Product IV: Image Extender

Semantic Sound Validation & Ensuring Acoustic Relevance Through AI-Powered Verification

Building upon the intelligent fallback systems developed in Phase III, this week’s development addressed a more subtle yet critical challenge in audio generation: ensuring that retrieved sounds semantically match their visual counterparts. While the fallback system successfully handled missing sounds, I discovered that even when sounds were technically available, they didn’t always represent the intended objects accurately. This phase introduces a sophisticated description verification layer and flexible filtering system that transforms sound retrieval from a mechanical matching process to a semantically intelligent selection.

The newly implemented description verification system addresses this through OpenAI-powered semantic analysis. Each retrieved sound’s description is now evaluated against the original visual tag to determine if it represents the actual object or just references it contextually. This ensures that when Image Extender layers “car” sounds into a mix, they’re authentic engine recordings rather than musical tributes.

Intelligent Filter Architecture: Balancing Precision and Flexibility

Recognizing that overly restrictive filtering could eliminate viable sounds, we redesigned the filtering system with adaptive “any” options across all parameters. The Bit-Depth filter got removed because it resulted in search errors which is also mentioned in the documentation of the freesound.org api.

Scene-Aware Audio Composition: Atmo Sounds as Acoustic Foundation

A significant architectural improvement involves intelligent base track selection. The system now distinguishes between foreground objects and background atmosphere:

  • Scene & Location Analysis: Object detection extracts environmental context (e.g., “forest atmo,” “urban street,” “beach waves”)
  • Atmo-First Composition: Background sounds are prioritized as the foundational layer
  • Stereo Preservation: Atmo/ambience sounds retain their stereo imaging for immersive soundscapes
  • Object Layering: Foreground sounds are positioned spatially based on visual detection coordinates

This creates mixes where environmental sounds form a coherent base while individual objects occupy their proper spatial positions, resulting in professionally layered audio compositions.

Dual-Mode Object Detection with Scene Understanding

OpenAI GPT-4.1 Vision: Provides comprehensive scene analysis including:

  • Object identification with spatial positioning
  • Environmental context extraction
  • Mood and atmosphere assessment
  • Structured semantic output for precise sound matching

MediaPipe EfficientDet: Offers lightweight, real-time object detection:

  • Fast local processing without API dependencies
  • Basic object recognition with positional data
  • Fallback when cloud services are unavailable

Wildcard-Enhanced Semantic Search: Beyond Exact Matching

Multi-Stage Fallback with Verification Limits

The fallback system evolved into a sophisticated multi-stage process:

  1. Atmo Sound Prioritization: Scene_and_location tags are searched first as base layer
  2. Object Search: query with user-configured filters
  3. Description Verification: AI-powered semantic validation of each result
  4. Quality Tiering: Progressive relaxation of rating and download thresholds
  5. Pagination Support: Multiple result pages when initial matches fail verification
  6. Controlled Fallback: Limited OpenAI tag regeneration with automatic timeout

This structured approach prevents infinite loops while maximizing the chances of finding appropriate sounds. The system now intelligently gives up after reasonable attempts, preventing computational waste while maintaining output quality.

Toward Contextually Intelligent Audio Generation

This week’s enhancements represent a significant leap from simple sound retrieval to contextually intelligent audio selection. The combination of semantic verification, adaptive filtering and scene-aware composition creates a system that doesn’t just find sounds, it finds the right sounds and arranges them intelligently.

Product III: Image Extender

Intelligent Sound Fallback Systems – Enhancing Audio Generation with AI-Powered Semantic Recovery

After refining Image Extender’s sound layering and spectral processing engine, this week’s development shifted focus to one of the system’s most practical yet creatively crucial challenges: ensuring that the generation process never fails silently. In previous iterations, when a detected visual object had no directly corresponding sound file in the Freesound database, the result was often an incomplete or muted soundscape. The goal of this phase was to build an intelligent fallback architecture—one capable of preserving meaning and continuity even in the absence of perfect data.

Closing the Gap Between Visual Recognition and Audio Availability

During testing, it became clear that visual recognition is often more detailed and specific than what current sound libraries can support. Object detection models might identify entities like “Golden Retriever,” “Ceramic Cup,” or “Lighthouse,” but audio datasets tend to contain more general or differently labeled entries. This mismatch created a semantic gap between what the system understands and what it can express acoustically.

The newly introduced fallback framework bridges this gap, allowing Image Extender to adapt gracefully. Instead of stopping when a sound is missing, the system now follows a set of intelligent recovery paths that preserve the intent and tone of the visual analysis while maintaining creative consistency. The result is a more resilient, contextually aware sonic generation process—one that doesn’t just survive missing data, but thrives within it.

Dual Strategy: Structured Hierarchies and AI-Powered Adaptation

Two complementary fallback strategies were introduced this week: one grounded in structured logic, and another driven by semantic intelligence.

The CSV-based fallback system builds on the ontology work from the previous phase. Using the tag_hierarchy.csv file, each sound tag is part of a parent–child chain, creating predictable fallback paths. For example, if “tiger” fails, the system ascends to “jungle,” and then “nature.” This rule-based approach guarantees reliability and zero additional computational cost, making it ideal for large-scale batch operations or offline workflows.

In contrast, the AI-powered semantic fallback uses GPT-based reasoning to dynamically generate alternative tags. When the CSV offers no viable route, the model proposes conceptually similar or thematically related categories. A specific bird species might lead to the broader concept of “bird sounds,” or an abstract object like “smartphone” could redirect to “digital notification” or “button click.” This layer of intelligence brings flexibility to unfamiliar or novel recognition results, extending the system’s creative reach beyond its predefined hierarchies.

User-Controlled Adaptation

Recognizing that different projects require different balances between cost, control, and creativity, the fallback mode is now user-configurable. Through a simple dropdown menu, users can switch between CSV Mode and AI Mode.

  • CSV Mode favors consistency, predictability, and cost-efficiency—perfect for common, well-defined categories.
  • AI Mode prioritizes adaptability and creative expansion, ideal for complex visual inputs or unique scenes.

This configurability not only empowers users but also represents a deeper design philosophy: that AI systems should be tools for choice, not fixed solutions.

Toward Adaptive and Resilient Multimodal Systems

This week’s progress marks a pivotal evolution from static, database-bound sound generation to a hybrid model that merges structured logic with adaptive intelligence. The dual fallback system doesn’t just fill gaps, it embodies the philosophy of resilient multimodal AI, where structure and adaptability coexist in balance.

The CSV hierarchy ensures reliability, grounding the system in defined categories, while the AI layer provides flexibility and creativity, ensuring the output remains expressive even when the data isn’t. Together, they form a powerful, future-proof foundation for Image Extender’s ongoing mission: transforming visual perception into sound not as a mechanical translation, but as a living, interpretive process.

Product II: Image Extender

Dual-Model Vision Interface – OpenAI × Gemini Integration for Adaptive Image Understanding

Following the foundational phase of last week, where the OpenAI API Image Analyzer established a structured evaluation framework for multimodal image analysis, the project has now reached a significant new milestone. The second release integrates both OpenAI’s GPT-4.1-based vision models and Google’s Gemini (MediaPipe) inference pipeline into a unified, adaptive system inside the Image Extender environment.

Unified Recognition Interface

In The current version, the recognition logic has been completely refactored to support runtime model switching.
A dropdown-based control in Google Colab enables instant selection between:

  • Gemini (MediaPipe) – for efficient, on-device object detection and panning estimation
  • OpenAI (GPT-4.1 / GPT-4.1-mini) – for high-level semantic and compositional interpretation

Non-relevant parameters such as score threshold or delegate type dynamically hide when OpenAI mode is active, keeping the interface clean and focused. Switching back to Gemini restores all MediaPipe-related controls.
This creates a smooth dual-inference workflow where both engines can operate independently yet share the same image context and visualization logic.

Architecture Overview

The system is divided into two self-contained modules:

  1. Image Upload Block – handles external image input and maintains a global IMAGE_FILE reference for both inference paths.
  2. Recognition Block – manages model selection, executes inference, parses structured outputs, and handles visualization.

This modular split keeps the code reusable, reduces side effects between branches, and simplifies later expansion toward GUI-based or cloud-integrated applications.

OpenAI Integration

The OpenAI branch extends directly from Last week but now operates within the full environment.
It converts uploaded images into Base64 and sends a multimodal request to gpt-4.1 or gpt-4.1-mini.
The model returns a structured Python dictionary, typically using the following schema:

{

    “objects”: […],

    “scene_and_location”: […],

    “mood_and_composition”: […],

    “panning”: […]

}

A multi-stage parser (AST → JSON → fallback) ensures robustness even when GPT responses contain formatting artifacts.

Prompt Refinement

During development, testing revealed that the English prompt version initially returned empty dictionaries.
Investigation showed that overly strict phrasing (“exclusively as a Python dictionary”) caused the model to suppress uncertain outputs.
By softening this instruction to allow “reasonable guesses” and explicitly forbidding empty fields, the API responses became consistent and semantically rich.

Debugging the Visualization

A subtle logic bug was discovered in the visualization layer:
The post-processing code still referenced German dictionary keys (“objekte”, “szenerie_und_ort”, “stimmung_und_komposition”) from Last week.
Since the new English prompt returned English keys (“objects”, “scene_and_location”, etc.), these lookups produced empty lists, which in turn broke the overlay rendering loop.
After harmonizing key references to support both language variants, the visualization resumed normal operation.

Cross-Model Visualization and Validation

A unified visualization layer now overlays results from either model directly onto the source image.
In OpenAI mode, the “panning” values from GPT’s response are projected as vertical lines with object labels.
This provides immediate visual confirmation that the model’s spatial reasoning aligns with the actual object layout, an important diagnostic step for evaluating AI-based perception accuracy.

Outcome and Next Steps

The project now represents a dual-model visual intelligence system, capable of using symbolic AI interpretation (OpenAI) and local pixel-based detection (Gemini).

Next steps

The upcoming development cycle will focus on connecting the openAI API layer directly with the Image Extender’s audio search and fallback system.

Zwischen Bild und Ton – Kritische Bewertung der Masterarbeit “Automatic Sonification of Video Sequences” von Andrea Corcuera Marruffo

Grundlegendes

Autorin: Andrea Corcuera Marruffo
Titel: Automatic Sonification of Video Sequences through Object Detection and Physical Modelling
Hochschule: Aalborg University Copenhagen
Studiengang: MSc Sound and Music Computing
Jahr: 2017

Die Arbeit von Andrea Corcuera Marruffo untersucht die automatische Erzeugung von Foley-Sounds aus Videosequenzen. Ziel ist es, audiovisuelle Inhalte algorithmisch zu sonifizieren, indem visuelle Informationen, z.B. Materialeigenschaften oder Objektkollisionen, mithilfe von Convolutional Neural Networks (nutzung des YOLO models) analysiert und anschließend physikalisch modellierte Klänge synthetisiert werden. Damit positioniert sich die Arbeit an der Schnittstelle von Klangsynthese, teilweise software und coding und Wahrnehmung, ein Feld, das in der Medienproduktion wie auch in der künstlerischen Forschung zunehmende Relevanz besitzt und entsprechend auch überschneidungen zum Grundkonzept meiner vorstehenden Masterarbeit.

Das „Werkstück“ besteht aus einem funktionalen Prototypen, der Videos analysiert, Objekte klassifiziert und deren Interaktionen in synthetisierte Klänge übersetzt. Ergänzt wird dieses Tool durch eine Evaluation, in der audiovisuelle Stimuli hinsichtlich ihrer Plausibilität und wahrgenommenen Qualität getestet werden.

Bewertung

systematisch anhand der Beurteilungskriterien des Studiengangs CMS

(1) Gestaltungshöhe

Die Arbeit zeigt eine sehr gute technische Tiefe und eine klare methodische Struktur. Der Aufbau ist logisch, die Visualisierungen (z. B. Flussdiagramme, Spektrogramme) sind nachvollziehbar und unterstützen das Verständnis des Prozesses.

(2) Innovationsgrad

Der Ansatz, Foley-Sound automatisch (unter dem Einsatz von „physical modelling“) zu generieren, wurde zum Zeitpunkt der Veröffentlichung (2017) nur vereinzelt erforscht. Die Verbindung von Object Detection und Physical Modelling stellt daher einen innovativen Beitrag im Bereich „Computational Sound Design“ dar.

(3) Selbstständigkeit

Die Arbeit zeigt eine deutliche Eigenleistung. Die Autorin erstellt ein eigenes Dataset, modifiziert Trainingsdaten und implementiert das YOLO Model in einer angepassten Form. Auch die Syntheseparameter werden experimentell abgeleitet. Die Eigenständigkeit ist daher sowohl konzeptionell als auch technisch vorhanden.

(4) Gliederung und Struktur

Die Struktur folgt einem klassischen wissenschaftlichen Aufbau. Theorie, Implementierung, Evaluation, Schlussfolgerung. Kapitel sind klar fokussiert, jedoch teils stark technisch geprägt, was die Lesbarkeit für fachfremde Leser einschränken kann. Eine visuellere Darstellung der Evaluationsmethodik hätte das eventuell verbessert.

(5) Kommunikationsgrad

Die Arbeit ist insgesamt verständlich und präzise formuliert. Fachtermini werden sorgfältig eingeführt, Abbildungen sind beschriftet und logisch eingebunden. Der sprachliche Stil ist sachlich, allerdings manchmal zu stark an technischer Dokumentation orientiert. Narrative Reflexionen zu Designentscheidungen oder ästhetischen Überlegungen fehlen weitgehend, was anhand des Studiengangs, welcher sich nicht hauptsächlich an design orientiert verständlich und nachvollziehbar ist.

(6) Umfang der Arbeit

Mit über 30 Seiten Haupttext und zusätzlichem Anhang ist der Umfang angemessen. Die Balance zwischen Theorie, Umsetzung und Evaluation ist gelungen. Die empirische Studie mit 15 Proband bleibt jedoch relativ klein, wodurch die statistische Aussagekraft begrenzt ist.

(7) Orthographie, Sorgfalt und Genauigkeit

Die Arbeit ist durchgängig formal korrekt und methodisch sorgfältig dokumentiert. Kleinere sprachliche Unschärfen („he first talkie film“) mindern den Gesamteindruck kaum. Zitate und Quellenverweise sind konsistent.

(8) Literatur Das Literaturverzeichnis zeigt eine solide theoretische Fundierung. Es werden gängige Quellen zu Sound Synthesis, Modal Modelling und Neural Networks verwendet (Smith, Farnell, Van den Doel). Allerdings wären aktueller Medien- oder Wahrnehmungsforschung (durch z. B. Sonic Interaction Design, Embodied Sound Studies) noch eine spannende Ergänzung hinsichtlich Forschungsliteratur gewesen.

Abschließende Einschätzung

Insgesamt überzeugt die Arbeit durch ihren innovativen Ansatz, die methodische Präzision und die gelungene Umsetzung eines komplexen Systems. Die Evaluation zeigt kritisch die Grenzen des Modells auf (Objektgenauigkeit und Synchronisationsprobleme), was die Autorin reflektiert und nachvollziehbar einordnet.

Stärken: klare Struktur, hohes technisches Niveau, origineller Forschungsansatz, eigenständige Implementierung.
Schwächen: begrenzte ästhetische Reflexion, kleine Stichprobe in der Evaluation, eingeschränkte Materialvielfalt.