Prototyping VI: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

New features in object recognition and a test run with images:

Since the initial freesound.org and Gemini API setup, I have added several improvements.
You can now choose between different object recognition models and adjust settings like the number of detected objects and the minimum confidence threshold.

GUI for the settings of the model
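To illustrate the kind of options the GUI exposes, here is a rough Python sketch; the class and field names (RecognitionSettings, max_objects, min_confidence, model_name) are placeholders and not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class RecognitionSettings:
    """Hypothetical container mirroring the GUI options described above."""
    model_name: str = "gemini-1.5-flash"  # assumed model identifier, selectable in the GUI
    max_objects: int = 5                  # upper limit for detected objects per image
    min_confidence: float = 0.6           # detections below this threshold are discarded

def filter_detections(detections, settings: RecognitionSettings):
    """Keep only the most confident detections, capped at max_objects."""
    kept = [d for d in detections if d["confidence"] >= settings.min_confidence]
    kept.sort(key=lambda d: d["confidence"], reverse=True)
    return kept[: settings.max_objects]
```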

I also created a detailed testing matrix, using a wide range of images to evaluate detection accuracy. Based on those results, the model might change later on, because the Gemini API seems to offer only a fairly basic pool of tags and is not equally well trained in every category.

Test images for the object recognition
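To give an idea of how such a test run could be scored, here is a minimal sketch; the CSV layout and the detect_tags helper are assumptions, not the actual testing matrix.

```python
import csv

def evaluate_matrix(matrix_csv, detect_tags):
    """Compare detected tags against expected tags listed in a simple CSV matrix.

    Assumed CSV columns: image_path, expected_tags (semicolon-separated).
    detect_tags(image_path) stands in for the actual recognition call.
    """
    hits, total = 0, 0
    with open(matrix_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            expected = {t.strip().lower() for t in row["expected_tags"].split(";")}
            detected = {t.lower() for t in detect_tags(row["image_path"])}
            hits += len(expected & detected)
            total += len(expected)
    return hits / total if total else 0.0
```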

Detection is still reliable for basic tags like “bird”, “car”, “tree”, etc., and for these tags it doesn’t matter much if there is a lot of shadow, if only half of the object is visible, or if the image is blurry. Because of the lack of more specific tags, however, I will look into models or APIs that offer more fine-grained recognition.

Coming up: I’ll be working on whether to auto-play or download the selected audio files, including layering sounds, adjusting volumes, and experimenting with EQ and filtering, all to make the playback more natural and immersive. I will also think about categorization and about moving the tags into a layer system. Besides that, I am going to look at other object recognition models, though I might stick with the Gemini API for prototyping a bit longer and change the model later.

Prototyping V: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

Integration of AI object recognition into the automated audio file search process:

After setting up the initial interface for the freesound.org API and confirming everything works with test tags and basic search filters, the next major milestone is now in motion: AI-based object recognition using the Gemini API.

The idea is to feed in an image (or a batch of them), let the AI detect what’s in it, and then use those recognized tags to trigger an automated search for corresponding sounds on freesound.org. The integration already loads the detected tags into an array, which is then automatically passed on to the sound search. This allows the system to dynamically react to the content of an image and search for matching audio files — no manual tagging needed anymore.
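As a rough illustration of that hand-off, assuming the google-generativeai Python SDK, the tag detection and the pass-through to the sound search could look something like this; the prompt wording, the model name and the search_freesound helper are placeholders rather than the actual code.

```python
import google.generativeai as genai  # assumed SDK: pip install google-generativeai
import PIL.Image

def detect_tags(image_path, api_key, max_tags=5):
    """Ask Gemini for a short comma-separated list of objects visible in the image."""
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name
    image = PIL.Image.open(image_path)
    prompt = f"List up to {max_tags} objects visible in this image as comma-separated single words."
    response = model.generate_content([prompt, image])
    return [t.strip().lower() for t in response.text.split(",") if t.strip()]

def sonify_image(image_path, api_key, search_freesound):
    """Collect the detected tags in a list and pass them straight to the sound search."""
    tags = detect_tags(image_path, api_key)          # e.g. ["bird", "tree", "car"]
    return {tag: search_freesound(tag) for tag in tags}
```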

So far, the detection is working pretty reliably for general categories like “bird”, “car”, “tree”, etc. But I’m looking into models or APIs that offer more fine-grained recognition. For instance, instead of just “bird”, I’d like it to say “sparrow”, “eagle”, or even specific songbird species if possible. This would make the whole sound mapping feel much more tailored and immersive.

A list of test images will be prepared, but there is already a testing matrix covering different objects, situations, scenery, and technical differences.

On the freesound side, I’ve got the basic query parameters set up: tag search, sample rate, file type, license, and duration filters. There’s room to expand this with additional parameters like rating, bit depth, and maybe even a random selection toggle to avoid repetition when the same tag comes up multiple times.
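For reference, a text search with those filters could look roughly like this, using requests against the Freesound API v2; the concrete filter values are only examples and the token is a placeholder.

```python
import random
import requests

FREESOUND_TOKEN = "YOUR_API_KEY"  # placeholder

def search_freesound(tag, max_results=10):
    """Text search on freesound.org filtered by sample rate, file type, license and duration.

    Filter field names follow the Freesound API v2 docs; extra filters such as
    avg_rating:[4 TO *] or bitdepth:16 could be appended to the same string.
    """
    params = {
        "query": tag,
        "filter": 'samplerate:44100 type:wav license:"Creative Commons 0" duration:[1 TO 30]',
        "fields": "id,name,previews,license,duration,avg_rating",
        "token": FREESOUND_TOKEN,
    }
    r = requests.get("https://freesound.org/apiv2/search/text/", params=params, timeout=10)
    r.raise_for_status()
    results = r.json().get("results", [])
    random.shuffle(results)  # simple random-selection toggle to avoid repetition
    return results[:max_results]
```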

Coming up: I’ll be working on whether to auto-play or download the selected audio files, and starting to test how the AI-generated tags influence the mood and quality of the soundscape. The long-term plan includes layering sounds, adjusting volumes, experimenting with EQ and filtering — all to make the playback more natural and immersive.
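As one possible starting point for those layering and volume experiments, a quick sketch with pydub (purely exploratory, not a decided implementation) could overlay and attenuate the downloaded files like this:

```python
from pydub import AudioSegment

def layer_sounds(paths, gains_db=None, lowpass_hz=None):
    """Overlay several audio files into one soundscape.

    paths: audio files already downloaded from freesound.org (placeholders below).
    gains_db: optional per-file volume adjustment in dB.
    lowpass_hz: optional low-pass cutoff as a very crude EQ experiment.
    """
    gains_db = gains_db or [0] * len(paths)
    mix = None
    for path, gain in zip(paths, gains_db):
        seg = AudioSegment.from_file(path).apply_gain(gain)
        if lowpass_hz:
            seg = seg.low_pass_filter(lowpass_hz)
        mix = seg if mix is None else mix.overlay(seg)
    return mix

# Example usage (file names are placeholders):
# layer_sounds(["bird.wav", "car.wav"], gains_db=[-3, -8], lowpass_hz=4000).export("scene.wav", format="wav")
```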