Prototyping IV: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

Tests on automated audio file search via the freesound.org API:

For later use in the automated audio file search for the recognized objects, I tested the freesound.org API and programmed a first interface for testing purposes. The first step was to request an API key from freesound.org. After that I noticed a point worth considering for my project: the key is limited to 5000 requests per year, but I will look into options for extending this. For testing, 5000 is more than enough.
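As a rough illustration of what such a test interface does, the sketch below queries the APIv2 text-search endpoint with plain HTTP requests; the API key value and the requested fields are placeholders, and the official freesound Python client could be used instead.

```python
# Minimal sketch of a text search against the freesound APIv2, assuming a
# token-style API key (no OAuth needed for searching). Field list and page
# size are placeholder choices.
import requests

API_KEY = "YOUR_FREESOUND_API_KEY"  # issued by freesound.org

def search_freesound(query, page_size=10):
    """Return the raw result list for a text query."""
    response = requests.get(
        "https://freesound.org/apiv2/search/text/",
        params={
            "query": query,
            "fields": "id,name,tags,previews",  # keep the response small
            "page_size": page_size,
            "token": API_KEY,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["results"]

if __name__ == "__main__":
    for sound in search_freesound("dog"):
        print(sound["id"], sound["name"])
```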

The current code already searches with a few test tags and can filter the results by sample rate, duration, license and file type. More filter options may follow, such as rating and bit depth, and possibly random file selection so the result is not always the same for each tag.
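The filters mentioned above map onto the API's Solr-style filter string; a possible version of that query, combined with a random pick from the hits, might look like this (the field names follow the public API documentation, the concrete values are just examples):

```python
# Sketch of a filtered search with random selection; reuses the endpoint and
# token idea from the previous snippet. Filter values are example choices.
import random
import requests

API_KEY = "YOUR_FREESOUND_API_KEY"

def filtered_search(tag, samplerate=44100, min_dur=1.0, max_dur=10.0,
                    license_name="Creative Commons 0", filetype="wav"):
    filter_str = " ".join([
        f"samplerate:{samplerate}",
        f"duration:[{min_dur} TO {max_dur}]",
        f'license:"{license_name}"',
        f"type:{filetype}",
    ])
    response = requests.get(
        "https://freesound.org/apiv2/search/text/",
        params={"query": tag, "filter": filter_str,
                "fields": "id,name,previews", "token": API_KEY},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()["results"]
    # Random pick so repeated searches for the same tag do not always
    # return the same file.
    return random.choice(results) if results else None
```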

Next steps include either downloading the found file or playing it back automatically. Then I will test feeding the tags from the AI image recognition code into this automated search. Later in the process I will have to work out the playback of multiple files, volume staging, and filtering or EQ methods to deal with masking effects.

Test GUI for automated sound search via the freesound.org API
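As a first, hedged sketch of the download/playback and mixing steps mentioned above: the snippet fetches the MP3 preview of a search result (full-quality downloads require OAuth2, previews only need the API key) and overlays several files with per-file gain using pydub, which in turn needs ffmpeg for MP3 decoding. The helper names and gain values are placeholders.

```python
# Sketch of "download or play" plus a first pass at volume staging; assumes
# search results shaped like the snippets above and pydub/ffmpeg for decoding.
import requests
from pydub import AudioSegment
from pydub.playback import play

def download_preview(sound, path="preview.mp3"):
    """Save the high-quality MP3 preview of a search result to disk."""
    url = sound["previews"]["preview-hq-mp3"]
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=10).content)
    return path

def mix(paths, gains_db):
    """Overlay several files, each with its own gain in dB."""
    layers = [AudioSegment.from_file(p) + g for p, g in zip(paths, gains_db)]
    mixed = layers[0]
    for layer in layers[1:]:
        mixed = mixed.overlay(layer)
    return mixed

# Example: a quieter background layer under a foreground sound.
# play(mix(["dog.mp3", "street.mp3"], [-3, -9]))
```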

Prototyping III: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

Research on sonification of images / video material and different approaches – focus on RGB

The paper by Kopecek and Ošlejšek presents a system that enables visually impaired users to perceive color images through sound using a semantic color model. Each primary color (such as red, green, or blue) is assigned a unique sound, and colors in an image are approximated by the two closest primary colors. These are represented through two simultaneous tones, with volume indicating the proportion of each color. Users can explore images by selecting pixels or regions using input devices like a touchscreen or mouse. The system calculates the average color of the selected area and plays the corresponding sounds. Distinct audio cues indicate image boundaries, and sounds can be either synthetic or instrument-based, with timbre and pitch helping to differentiate them. Users can customize colors and sounds for a more personalized experience. This approach allows for dynamic, efficient exploration of images and supports navigation via annotated SVG formats.

Image separation by Kopecek and Ošlejšek
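To make the two-tone idea more concrete, here is a rough sketch of mapping a pixel's RGB value to its two closest primaries and rendering two simultaneous sine tones whose volumes reflect the proportions; the primary colors, their frequencies, and the weighting rule are my own placeholders, not the mapping from the paper.

```python
# Illustrative two-nearest-primary sonification: closer primary -> louder tone.
import numpy as np

PRIMARIES = {          # color -> tone frequency in Hz (assumed values)
    "red":   (np.array([255, 0, 0]), 440.0),
    "green": (np.array([0, 255, 0]), 550.0),
    "blue":  (np.array([0, 0, 255]), 660.0),
}

def two_tone(rgb, fs=44100, dur=0.5):
    """Approximate an RGB value by its two closest primaries and render
    two simultaneous sine tones whose volumes reflect the proportions."""
    rgb = np.asarray(rgb, dtype=float)
    dists = {name: np.linalg.norm(rgb - col) for name, (col, _) in PRIMARIES.items()}
    first, second = sorted(dists, key=dists.get)[:2]
    # Weights normalised to sum to 1; the closer primary gets the larger share.
    w1 = dists[second] / (dists[first] + dists[second] + 1e-9)
    w2 = 1.0 - w1
    t = np.linspace(0.0, dur, int(fs * dur), endpoint=False)
    tone = (w1 * np.sin(2 * np.pi * PRIMARIES[first][1] * t)
            + w2 * np.sin(2 * np.pi * PRIMARIES[second][1] * t))
    return tone / np.max(np.abs(tone))

# e.g. sounddevice.play(two_tone((200, 120, 40)), 44100)
```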

The review by Sarkar, Bakshi, and Sa offers an overview of various image sonification methods designed to help visually impaired users interpret visual scenes through sound. It covers techniques such as raster scanning, query-based, and path-based approaches, where visual data like pixel intensity and position are mapped to auditory cues. Systems like vOICe and NAVI use high and low-frequency tones to represent image regions vertically. The paper emphasizes the importance of transfer functions, which link image properties to sound attributes such as pitch, volume, and frequency. Different rendering methods—like audification, earcons, and parameter mapping—are discussed in relation to human auditory perception. Special attention is given to color sonification, including the semantic color model introduced by Kopecek and Ošlejšek, which improves usability through clearly distinguishable tones. The paper also explores applications in fields such as medical imaging, algorithm visualization, and network analysis, and briefly touches on sound-to-image conversions.

Principles of the image-to-sound mapping
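A minimal, illustrative version of the raster-scan idea behind systems like vOICe and NAVI, with vertical position mapped to frequency and brightness to amplitude, might look like the following; the frequency range, scan rate, and grayscale input are assumptions, not the original transfer functions.

```python
# Column-by-column raster scan: top rows -> high frequencies, bottom rows -> low,
# pixel brightness -> amplitude of the corresponding partial.
import numpy as np

def column_scan(image, fs=44100, col_dur=0.05, f_lo=200.0, f_hi=4000.0):
    """image: 2D array of grayscale values in [0, 1], row 0 = top of image."""
    rows, cols = image.shape
    freqs = np.linspace(f_hi, f_lo, rows)                # high pitch at the top
    t = np.arange(int(fs * col_dur)) / fs
    out = []
    for c in range(cols):
        col = image[:, c][:, None]                       # brightness as amplitude
        tones = col * np.sin(2 * np.pi * freqs[:, None] * t)
        out.append(tones.sum(axis=0))                    # mix all rows of the column
    signal = np.concatenate(out)
    return signal / (np.max(np.abs(signal)) + 1e-9)
```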

Matta, Rudolph, and Kumar propose the theoretical system “Auditory Eyes,” which converts visual data into auditory and tactile signals to support blind users. The system comprises three main components: an image encoder that uses edge detection and triangulation to estimate object location and distance; a mapper that translates features like motion, brightness, and proximity into corresponding sound and vibration cues; and output generators that produce sound using tools like Csound and tactile feedback via vibrations. Motion is represented using effects like Doppler shift and interaural time difference, while spatial positioning is conveyed through head-related transfer functions. Brightness is mapped to pitch, and edges are conveyed through tone duration. The authors emphasize that combining auditory and tactile information can create a richer and more intuitive understanding of the environment, making the system potentially very useful for real-world navigation and object recognition.
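Two of the cues described here, brightness mapped to pitch and horizontal position rendered as an interaural time difference (ITD) by delaying one channel, can be sketched as follows; the pitch range and the roughly 0.66 ms maximum ITD are assumed values rather than parameters from the paper, and a full HRTF or Doppler rendering is beyond this snippet.

```python
# Brightness -> pitch; azimuth -> interaural time difference (sample delay).
import numpy as np

def brightness_to_pitch(brightness, f_lo=200.0, f_hi=2000.0):
    """Map a brightness value in [0, 1] to a tone frequency in Hz."""
    return f_lo + brightness * (f_hi - f_lo)

def itd_tone(brightness, azimuth, fs=44100, dur=0.4, max_itd=0.00066):
    """azimuth in [-1, 1]: -1 = far left, +1 = far right."""
    t = np.arange(int(fs * dur)) / fs
    tone = np.sin(2 * np.pi * brightness_to_pitch(brightness) * t)
    delay = int(abs(azimuth) * max_itd * fs)        # lag of the far ear, in samples
    lagged = np.concatenate([np.zeros(delay), tone])[: tone.size]
    # Source on the right -> left ear hears it later, and vice versa.
    left, right = (lagged, tone) if azimuth >= 0 else (tone, lagged)
    return np.stack([left, right], axis=1)          # (n, 2) stereo buffer

# e.g. sounddevice.play(itd_tone(brightness=0.8, azimuth=-0.5), 44100)
```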

References

Kopecek, Ivan, and Radek Ošlejšek. 2008. “Hybrid Approach to Sonification of Color Images.” In 2008 Third International Conference on Convergence and Hybrid Information Technology, 721–726. IEEE. https://doi.org/10.1109/ICCIT.2008.152.

Sarkar, Rajib, Sambit Bakshi, and Pankaj K. Sa. 2012. “Review on Image Sonification: A Non-visual Scene Representation.” In 1st International Conference on Recent Advances in Information Technology (RAIT-2012), 1–5. IEEE. https://doi.org/10.1109/RAIT.2012.6194495.

Matta, Suresh, Heiko Rudolph, and Dinesh K. Kumar. 2005. “Auditory Eyes: Representing Visual Information in Sound and Tactile Cues.” In Proceedings of the 13th European Signal Processing Conference (EUSIPCO 2005), 1–5. Antalya, Turkey. https://www.researchgate.net/publication/241256962.