Product IX: Image Extender

Moving Beyond Dry Audio to Spatially Intelligent Soundscapes

My primary objective for this update was to bridge a critical perceptual gap in the system: while the previous iterations successfully mapped visual information to sonic elements with precise panning and temporal placement, the resulting audio mix remained perceptually “dry” and disconnected from the image’s implied acoustic environment. This update introduces adaptive reverberation, not as a cosmetic effect, but as a semantically grounded spatialization layer that transforms discrete sound objects into a coherent, immersive acoustic scene.

System Architecture

The existing interactive DAW interface, with its per-track volume controls, sound replacement engine, and user feedback mechanisms, was extended with a comprehensive spatial audio processing module. This module interprets the reverb parameters derived from image analysis (room detection, size estimation, material damping, and spatial width) and provides interactive control over their application.

Global Parameter State & Data Flow Integration

A crucial architectural challenge was maintaining separation between the raw audio mix (user-adjustable volume levels) and the reverb-processed version. I implemented a dual-state system with:

  • current_mix_raw: The continuously updated sum of all audio tracks with current volume slider adjustments.
  • current_mix_with_reverb: A cached, processed version with reverberation applied, recalculated only when reverb parameters change or volume sliders are adjusted with reverb enabled.

This separation preserves processing efficiency while maintaining real-time responsiveness. The system automatically pulls reverb parameters (room_size, damping, wet_level, width) from the image analysis block when available, providing image-informed defaults while allowing full manual override.
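The default-plus-override behaviour can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name resolve_reverb_params, the FALLBACK_REVERB values, and the shape of the overrides argument are assumptions made for the example.

```python
# Fallback values used when the image analysis provides nothing.
# These numbers are assumptions for this sketch, not project settings.
FALLBACK_REVERB = {"room_size": 0.5, "damping": 0.5, "wet_level": 0.3, "width": 1.0}


def resolve_reverb_params(scene, overrides=None):
    """Merge fallbacks, image-derived values, and manual overrides.

    Priority (lowest to highest): fallback < image analysis < manual override.
    """
    params = dict(FALLBACK_REVERB)
    # Image-derived values replace the fallbacks when they are present.
    for key in FALLBACK_REVERB:
        if scene.get(key) is not None:
            params[key] = float(scene[key])
    # Values set by hand (for example via sliders) always win.
    for key, value in (overrides or {}).items():
        if key in FALLBACK_REVERB and value is not None:
            params[key] = float(value)
    return params


# Hypothetical analysis result: width and wet_level were not estimated.
scene = {"room_detected": True, "room_size": 0.8, "damping": 0.2}
params = resolve_reverb_params(scene, overrides={"wet_level": 0.45})
print(params)
```

Keeping the merge in one place means the sliders, the reverb engine, and the export step all see the same resolved values.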

Pedalboard-Based Reverb Engine

I integrated the pedalboard audio processing library to implement professional-grade reverberation. The engine operates through a transparent conversion chain:

  1. Format Conversion: AudioSegment objects (from pydub) are converted to NumPy arrays normalized to the [-1, 1] range
  2. Pedalboard Processing: A Reverb effect instance applies parameters with real-time adjustable controls
  3. Format Restoration: Processed audio is converted back to AudioSegment while preserving sample rate and channel configuration

The implementation supports both mono and stereo processing chains, maintaining compatibility with the existing panning system.
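The three-step chain above might look roughly like this. It is a sketch under stated assumptions: the helper names (interleaved_to_float, float_to_interleaved, apply_reverb) are hypothetical, 16-bit PCM is assumed, and setting dry_level to 1 - wet_level is one possible wet/dry mapping rather than the one the project necessarily uses. pedalboard is imported inside apply_reverb so the conversion helpers can be tried with NumPy alone.

```python
import numpy as np

FULL_SCALE = 32768.0  # magnitude of the most negative 16-bit sample


def interleaved_to_float(samples, channels):
    """Interleaved 16-bit PCM -> float32 array of shape (channels, frames) in [-1, 1]."""
    data = np.asarray(samples, dtype=np.float32).reshape(-1, channels).T
    return data / FULL_SCALE


def float_to_interleaved(audio):
    """Float array of shape (channels, frames) -> interleaved int16 PCM."""
    # Clip before casting so reverb peaks above full scale do not wrap around.
    scaled = np.clip(np.round(audio * FULL_SCALE), -FULL_SCALE, FULL_SCALE - 1)
    return scaled.T.astype(np.int16).reshape(-1)


def apply_reverb(segment, room_size=0.5, damping=0.5, wet_level=0.3, width=1.0):
    """Apply a pedalboard Reverb to a pydub AudioSegment and return a new segment."""
    from pedalboard import Pedalboard, Reverb  # lazy import, only needed here

    if segment.sample_width != 2:
        segment = segment.set_sample_width(2)  # this sketch handles 16-bit audio only
    audio = interleaved_to_float(segment.get_array_of_samples(), segment.channels)
    board = Pedalboard([
        Reverb(
            room_size=room_size,
            damping=damping,
            wet_level=wet_level,
            dry_level=1.0 - wet_level,  # assumed mapping for the wet/dry slider
            width=width,
        )
    ])
    wet = board(audio, segment.frame_rate)
    # _spawn keeps the frame rate, sample width, and channel count of the input.
    return segment._spawn(float_to_interleaved(wet).tobytes())


# Conversion check on synthetic stereo PCM (no audio file needed).
pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
restored = float_to_interleaved(interleaved_to_float(pcm, channels=2))
print(np.array_equal(pcm, restored))
```

Because the output has the same length as the input, any reverb tail beyond the end of the mix is cut off in this sketch; padding the dry mix with silence first would be one way to keep it.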

Interactive Reverb Control Interface

A dedicated control panel was added to the DAW interface, featuring:

  • Parameter Sliders: Four continuous controls for room size, damping, wet/dry mix, and stereo width, pre-populated with image-derived values when available
  • Toggle System: Three distinct interaction modes:
    1. “🔄 Apply Reverb”: Manual application with current settings
    2. “🔇 Remove Reverb”: Return to dry mix
    3. “Reverb ON/OFF Toggle”: Single-click switching between states
  • Contextual Feedback: Display of image-based room detection status (indoor/outdoor)

Seamless Playback Integration

The playback system was redesigned to dynamically switch between dry and wet mixes:

  • Intelligent Routing: The play_mix() function automatically selects current_mix_with_reverb or current_mix_raw based on the reverb_enabled flag
  • State-Aware Processing: When volume sliders are adjusted with reverb enabled, the system automatically reapplies reverberation to the updated mix, maintaining perceptual consistency
  • Export Differentiation: Final mixes are exported with _with_reverb or _raw suffixes, providing clear version control
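The routing and export naming described in this list can be illustrated with a small sketch. The MixState container, select_mix, and export_filename are hypothetical names introduced here; only current_mix_raw, current_mix_with_reverb, reverb_enabled, and the two suffixes come from the description above. Falling back to the dry mix when no wet version has been rendered yet is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MixState:
    """Hypothetical container for the dual-state mix."""
    current_mix_raw: Any = None
    current_mix_with_reverb: Optional[Any] = None
    reverb_enabled: bool = False


def _wet_available(state):
    return state.reverb_enabled and state.current_mix_with_reverb is not None


def select_mix(state):
    """Return the mix that playback should use: wet if enabled and rendered, else dry."""
    return state.current_mix_with_reverb if _wet_available(state) else state.current_mix_raw


def export_filename(base_name, state, extension="wav"):
    """Build an export name whose suffix states which version was written."""
    suffix = "_with_reverb" if _wet_available(state) else "_raw"
    return f"{base_name}{suffix}.{extension}"


# Strings stand in for AudioSegment objects in this example.
state = MixState(current_mix_raw="dry mix", current_mix_with_reverb="wet mix")
print(select_mix(state), export_filename("scene", state))
state.reverb_enabled = True
print(select_mix(state), export_filename("scene", state))
```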

Design Philosophy: Transparency Over Automation

This phase reinforced a critical design principle: spatial effects should enhance rather than obscure the user’s creative decisions. Several automation approaches were considered and rejected:

  • Automatic Reverb Application: While the system could automatically apply image-derived reverb, I preserved manual activation to maintain user agency
  • Dynamic Parameter Adjustment: Real-time modification of reverb parameters during playback was technically feasible but introduced perceptual confusion
  • Per-Track Reverb: Individual reverberation for each sound object would create acoustic chaos rather than coherent space

The decision was made to implement reverb as a master bus effect, applied consistently to the entire mix after individual track processing. This approach creates a unified acoustic space that respects the visual scene’s implied environment while preserving the clarity of individual sound elements.

Technical Challenges & Solutions

State Synchronization

The most significant challenge was maintaining synchronization between the constantly updating volume-adjusted mix and the computationally expensive reverb processing. The solution was a conditional caching system: reverb is only recalculated when parameters change or when volume adjustments occur with reverb active.
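One way to express such a conditional cache is sketched below. The ReverbCache class, the mix_version counter (assumed to be incremented whenever a volume slider changes the dry mix), and the injected reverb_fn are all hypothetical; the actual implementation may track changes differently.

```python
class ReverbCache:
    """Re-render the wet mix only when the dry mix or the reverb parameters change."""

    def __init__(self, reverb_fn):
        self.reverb_fn = reverb_fn  # callable(dry_mix, params) -> wet_mix
        self._key = None            # (mix_version, params) of the cached render
        self._wet = None
        self.render_count = 0       # kept only to make the caching visible

    def get(self, dry_mix, mix_version, params):
        """Return the wet mix, reusing the cached render when nothing changed."""
        key = (mix_version, tuple(sorted(params.items())))
        if key != self._key:
            self._wet = self.reverb_fn(dry_mix, params)
            self._key = key
            self.render_count += 1
        return self._wet


# A stand-in for the expensive reverb call, so the sketch runs without audio.
def fake_reverb(dry_mix, params):
    return f"{dry_mix} + reverb(room_size={params['room_size']})"


cache = ReverbCache(fake_reverb)
params = {"room_size": 0.6, "damping": 0.4, "wet_level": 0.3, "width": 1.0}

cache.get("mix v1", 1, params)   # first request: renders
cache.get("mix v1", 1, params)   # nothing changed: served from cache
cache.get("mix v2", 2, params)   # a volume slider moved: renders again
print(cache.render_count)
```

With reverb disabled, the cache is simply never queried, so slider movements cost nothing beyond updating the dry mix.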

Format Compatibility

Bridging the pydub-based mixing system with pedalboard’s NumPy-based processing required careful attention to sample format conversion, channel configuration, and normalization. The implementation maintains bit-perfect round-trip conversion.

Product VII: Image Extender

Room-Aware Mixing – From Image Analysis to Coherent Acoustic Spaces

Instead of attempting to recover exact physical properties, the system derives normalized, perceptual room parameters from visual cues such as geometry, materials, furnishing density, and openness. These parameters are intentionally abstracted to work with algorithmic reverbs.

The introduced parameters are:

  • room_detected (bool)
    Indicates whether the image depicts a closed indoor space or an outdoor/open environment.
  • room_size (0.0–1.0)
    Represents the perceived acoustic size of the room (small rooms → short decay, large spaces → long decay).
  • damping (0.0–1.0)
    Estimates high-frequency absorption based on visible materials (soft furnishings, carpets, curtains vs. glass and hard walls).
  • wet_level (0.0–1.0)
    Describes how reverberant the space naturally feels.
  • width (0.0–1.0)
    Estimates perceived stereo width derived from room proportions and openness.

All parameters are stored flat within the same dictionary as objects, panning, and importance values, forming a single coherent scene representation.
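As an illustration, a flat scene dictionary and a small range check might look like this. The entries for objects, panning, and importance are invented placeholders (their real structure is not shown here), and normalize_room_params is a hypothetical helper, not part of the described system.

```python
REVERB_KEYS = ("room_size", "damping", "wet_level", "width")


def normalize_room_params(scene):
    """Return a copy of the scene with room parameters coerced to their valid ranges."""
    cleaned = dict(scene)
    cleaned["room_detected"] = bool(scene.get("room_detected", False))
    for key in REVERB_KEYS:
        if scene.get(key) is not None:
            # All four reverb parameters are defined on the 0.0 to 1.0 range.
            cleaned[key] = min(1.0, max(0.0, float(scene[key])))
    return cleaned


# Hypothetical scene representation: room parameters sit next to the other keys.
scene = {
    "objects": ["dog", "car"],
    "panning": {"dog": -0.4, "car": 0.6},
    "importance": {"dog": 0.9, "car": 0.5},
    "room_detected": True,
    "room_size": 1.3,   # out of range on purpose, will be clamped
    "damping": 0.35,
    "wet_level": 0.25,
    "width": 0.7,
}

cleaned = normalize_room_params(scene)
print({key: cleaned[key] for key in ("room_detected",) + REVERB_KEYS})
```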

Dereverberation: Explored, Then Intentionally Abandoned

As part of this phase, automatic analysis of existing reverberation (RT60, DRR estimation) and dereverberation was evaluated.

The outcome:

  • Computationally expensive, especially in Google Colab
  • Inconsistent and often unsatisfactory audio results
  • High complexity with limited practical benefit

Decision:
Dereverberation is not pursued further in this project. Instead, the system relies on:

  • Consistent room estimation
  • Controlled, unified reverb application
  • Preventive design rather than corrective processing

The next step is to analyze the individual sounds themselves (especially their RT60 and DRR values) so that, when the scene is a closed room, less reverb is applied to sounds that are already reverberant.