Prototyping X: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

Researching Automated Mixing Strategies for Clarity and Real-Time Composition

As the Image Extender project continues to evolve from a tagging-to-sound pipeline into a dynamic, spatially aware audio compositing system, this phase focused on surveying and evaluating recent methods in automated sound mixing. My aim was to understand how existing research handles spectral masking, spatial distribution, and frequency-aware filtering—especially in scenarios where multiple unrelated sounds are combined without a human in the loop.

This blog post synthesizes findings from several key research papers and explores how their techniques may apply to our use case: a generative soundscape engine driven by object detection and Freesound API integration. The next development phase will evaluate which of these methods can be realistically adapted into the Python-based architecture.

Adaptive Filtering Through Time–Frequency Masking Detection

A compelling solution to masking was presented by Zhao and Pérez-Cota (2024), who proposed a method for adaptive equalization driven by masking analysis in both time and frequency. By calculating short-time Fourier transforms (STFT) for each track, their system identifies where overlap occurs and evaluates the masking directionality—determining whether a sound acts as a masker or a maskee over time.

These interactions are quantified into masking matrices that inform the design of parametric filters, tuned to reduce only the problematic frequency bands, while preserving the natural timbre and dynamics of the source sounds. The end result is a frequency-aware mixing approach that adapts to real masking events rather than applying static or arbitrary filtering.

Why this matters for Image Extender:
Generated mixes often feature overlapping midrange content (e.g., engine hums, rustling leaves, footsteps). By applying this masking-aware logic, the system can avoid blunt frequency cuts and instead respond intelligently to real-time spectral conflicts.

Implementation possibilities:

  • STFTs: librosa.stft
  • Masking matrices: pairwise multiplication and normalization (NumPy)
  • EQ curves: second-order IIR filters via scipy.signal.iirfilter

“This information is then systematically used to design and apply filters… improving the clarity of the mix.”
— Zhao and Pérez-Cota (2024)

Iterative Mixing Optimization Using Psychoacoustic Metrics

Another strong candidate emerged from Liu et al. (2024), who proposed an automatic mixing system based on iterative masking minimization. Their framework evaluates masking using a perceptual model derived from PEAQ (ITU-R BS.1387) and adjusts mixing parameters—equalization, dynamic range compression, and gain—through iterative optimization.

The system’s strength lies in its objective function: it not only minimizes total masking but also seeks to balance masking contributions across tracks, ensuring that no source is disproportionately buried. The optimization process runs until a minimum is reached, using a harmony search algorithm that continuously tunes each effect’s parameters for improved spectral separation.

Why this matters for Image Extender:
This kind of global optimization is well-suited for multi-object scenes, where several detected elements contribute sounds. It supports a wide range of source content and adapts mixing decisions to preserve intelligibility across diverse sonic elements.

Implementation path:

  • Masking metrics: critical band energy modeling on the Bark scale
  • Optimization: scipy.optimize.differential_evolution or other derivative-free methods
  • EQ and dynamics: Python wrappers (pydub, sox, or raw filter design via scipy.signal)

“Different audio effects… are applied via an iterative Harmony searching algorithm that aims to minimize the masking.”
— Liu et al. (2024)

Comparative Analysis

MethodCore ApproachIntegration PotentialImplementation Effort
Time–Frequency Masking (Zhao)Analyze masking via STFT; apply targeted EQHigh — per-event conflict resolutionMedium
Iterative Optimization (Liu)Minimize masking metric via parametric searchHigh — global mix clarityHigh

Both methods offer significant value. Zhao’s system is elegant in its directness—its per-pair analysis supports fine-grained filtering on demand, suitable for real-time or batch processes. Liu’s framework, while computationally heavier, offers a holistic solution that balances all tracks simultaneously, and may serve as a backend “refinement pass” after initial sound placement.

Looking Ahead

This research phase provided the theoretical and technical groundwork for the next evolution of Image Extender’s audio engine. The next development milestone will explore hybrid strategies that combine these insights:

  • Implementing a masking matrix engine to detect conflicts dynamically
  • Building filter generation pipelines based on frequency overlap intensity
  • Testing iterative mix refinement using masking as an objective metric
  • Measuring the perceived clarity improvements across varied image-driven scenes

References

Zhao, Wenhan, and Fernando Pérez-Cota. “Adaptive Filtering for Multi-Track Audio Based on Time–Frequency Masking Detection.” Signals 5, no. 4 (2024): 633–641. https://doi.org/10.3390/signals5040035:contentReference[oaicite:2]{index=2}

Liu, Xiaojing, Angeliki Mourgela, Hongwei Ai, and Joshua D. Reiss. “An Automatic Mixing Speech Enhancement System for Multi-Track Audio.” arXiv preprint arXiv:2404.17821 (2024). https://arxiv.org/abs/2404.17821:contentReference[oaicite:3]{index=3}

Prototyping VII: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

Mixing of the automatically searched audio files into one combined stereo file:

In this latest update, I’ve implemented several new features to create the first layer of an automated sound mixing system for the object recognition tool. The tool now automatically adjusts pan values and applies attenuation to ensure a balanced stereo mix, while seamlessly handling multiple tracks. This helps avoid overload and guarantees smooth audio mixing.

check of the automatically searched and downloaded files + the automatically generated combined audiofile

A key new feature is the addition of a sound_pannings array, which holds unique panning values for each sound based on the position of the object’s bounding box within an image. This ensures that each sound associated with a recognized object gets an individualized panning, calculated from its horizontal position within the image, for a more dynamic and immersive experience.

display of the sound panning values [-1 left, 1 right]

I’ve also introduced a system to automatically download sound files directly into Google Colab’s file system. This eliminates the need for managing local folders. Users can now easily preview audio within the notebook, which adds interactivity and helps visualize the results instantly.

The sound downloading process has also been revamped. The filters for the search can now be saved via a buttonclick to apply for the search and download for the audiofile. Currently for each tags there are 10 sounds per tag preloaded, with each sound randomly selected to avoid duplication but ensure the use of multiple times of the same tag. A sound is only downloaded if it hasn’t been used before. If all sound options for a tag are exhausted, no sound will be downloaded for that tag.

Additionally, I’ve added the ability to create a ZIP file that includes all the downloaded sounds as well as the final mixed audio output. This makes it easy to download and share the files. To keep things organized, I’ve also introduced a delete button that removes all downloaded files once they are no longer needed. The interface now includes buttons for controlling the download, file cleanup, and audio playback, simplifying the process for users.

Looking ahead, I plan to continue refining the system by working on better mixing techniques, focusing on aspects like spectrum, frequency, and the overall importance of the sounds. Future updates will also look at integrating volume control and more far in the future an LLM Model that can check the correctness of the found file title.