programming

by David Adlberger - 30. December 2025

Product VIII: Image Extender

Iterative Workflow and Feedback Mechanism

The primary objective for this update was to architect a paradigm shift from a linear generative pipeline to a nonlinear, interactive sound design environment

System Architecture & Implementation of Interactive Components

The existing pipeline, comprising image analysis (object detection, semantic tagging), importance-weighted sound search, audio processing (equalization, normalization, panoramic distribution based on visual coordinates), and temporal randomization was extended with a state-preserving session layer and an interactive control interface, implemented within the collab notebook ecosystem.

Data Structure & State Management
A critical prerequisite for interactivity was the preservation of all intermediate audio objects and their associated metadata. The system was refactored to maintain a global, mutable data structure, a list of processed_track objects. Each object encapsulates:

The raw audio waveform (as a NumPy array).
Semantic source tag (e.g., “car,” “rain”).
Track type (ambience base or foreground object).
Temporal onset and duration within the mix.
Panning coefficient (derived from image x-coordinate).
Initial target loudness (LUFS, derived from object importance scaling).

Dynamic Mixing Console Interface
A GUI panel was generated post-sonification, featuring the following interactive widgets for each processed_track:

Per-Track Gain Sliders: Linear potentiometers (range 0.0 to 2.0) controlling amplitude multiplication. Adjustment triggers an immediate recalculation of the output sum via a create_current_mix() function, which performs a weighted summation of all tracks based on the current slider states.
Play/Stop Controls: Buttons invoking a non-blocking, threaded audio playback engine (using IPython.display.Audio and threading), allowing for real-time auditioning without interface latency.

On-Demand Sound Replacement Engine
The most significant functional addition is the per-track “Search & Replace” capability. Each track’s GUI includes a dedicated search button (🔍). Its event handler executes the following algorithm:

Tag Identification: Retrieves the original semantic tag from the target processed_track.
Targeted Audio Retrieval: Calls a modified search_new_sound_for_tag(tag, exclude_id_list) function. This function re-executes the original search logic, including query formulation, Freesound API calls, descriptor validation (e.g., excluding excessively long or short files), and fallback strategies—while maintaining a session-specific exclusion list to avoid re-selecting previously used sounds.
Consistent Processing: The newly retrieved audio file undergoes an identical processing chain as in the initial pipeline: target loudness normalization (to the original track’s LUFS target), application of the same panning coefficient, and insertion at the identical temporal position.
State Update & Mix Regeneration: The new audio data replaces the old waveform in the processed_track object. The create_current_mix() function is invoked, seamlessly integrating the new sonic element while preserving all other user adjustments (e.g., volume levels of other tracks).

Integrated Feedback & Evaluation Module
To formalize user evaluation and gather data for continuous system improvement, a structured feedback panel was integrated adjacent to the mixing controls. This panel captures:

A subjective 5-point Likert scale rating.
Unstructured textual feedback.
Automated attachment of complete session metadata (input image description, derived tags, importance values, processing parameters, and the final processed_track list).
This design explicitly closes the feedback loop, treating each user interaction as a potential training or validation datum for future algorithmic refinements.
Automated sending of the feedback via email

by David Adlberger - 30. October 202515. December 2025

Product I: Image Extender

OpenAI API Image Analyzer – Structured Vision Testing and Model Insights

Adaptive Visual Understanding Framework
In this development phase, the focus was placed on building a robust evaluation framework for OpenAI’s multimodal models (GPT-4.1 and GPT-4.1-mini). The primary goal: systematically testing image interpretation, object detection, and contextual scene recognition while maintaining controlled cost efficiency and analytical depth.

upload of image (image source: https://www.trumau.at/)

Combined Request Architecture
Unlike traditional multi-call pipelines, the new setup consolidates image and text interpretation into a single API request. This streamlined design prevents token overhead and ensures synchronized contextual understanding between categories. Each inference returns a structured Python dictionary containing three distinct analytical branches:
• Objects – Recognizable entities such as animals, items, or people
• Scene and Location Estimation – Environment, lighting, and potential geographic cues
• Mood and Composition – Aesthetic interpretation, visual tone, and framing principles

For each uploaded image, the analyzer prints three distinct lists per modelside by side. This offers a straightforward way to assess interpretive differences without complex metrics. In practice, GPT-4.1 tends to deliver slightly more nuanced emotional and compositional insights, while GPT-4.1-mini prioritizes concise, high-confidence object recognition.

results of the image object analysis and model comparison

Through the unified format, post-processing can directly populate separate lists or database tables for subsequent benchmarking, minimizing parsing latency and data inconsistencies.

Robust Output Parsing
Because model responses occasionally include Markdown code blocks (e.g., python {…}), the parsing logic was redesigned with a multi-layered interpreter using regex sanitation and dual parsing strategies (AST > JSON > fallback). This guarantees that even irregularly formatted outputs are safely converted into structured datasets without manual intervention. The system thus sustains analytical integrity under diverse prompt conditions.
Model Benchmarking: GPT-4.1-mini vs. GPT-4.1
The benchmark test compared inference precision, descriptive richness, and token efficiency between the two models. While GPT-4.1 demonstrates deeper contextual inference and subtler mood detection, GPT-4.1-mini achieves near-equivalent recognition accuracy at approximately one-tenth of the cost per request. For large-scale experiments (e.g., datasets exceeding 10,000 images), GPT-4.1-mini provides the optimal balance between granularity and economic viability.
Token Management and Budget Simulation
A real-time token tracker revealed an average consumption of ~1,780 tokens per image request. Given GPT-4.1-mini’s rate of $0.003 / 1k tokens, a one-dollar operational budget supports roughly 187 full image analyses. This insight forms the baseline for scalable experimentation and budget-controlled automation workflows in cloud-based vision analytics.

The next development phase will integrate this OpenAI-driven visual analysis directly into the Image Extender environment. This integration marks the transition from isolated model testing toward a unified generative framework.

by David Adlberger - 7. April 202516. April 2025

Prototyping IV: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

Tests on automated audio file search via freesound.org api:

For further use in the automated audio file search of the recognized objects I tested the freesound.org api and programmed the first interface for testing purposes. The first thing I had to do was request an API-Key by freesound.org. After that I noticed an interesting point to think about using it in my project: it is open for 5000 requests per year, but I will research on possibilities for using it more. For the testing 5000 is more than enough.

The current code already searches with a few testing tags and gives possibilities to filter the searches by samplerate, duration, licence and file type. There might be added more filter possibilities next like rating, bit depth, and maybe the possibility of random file selection so it won’t be always the same for each tag.

Next steps would also include to either download the file or just play it automatically. Then there will be tests on using the tags of the AI image recognition code for this automated search. And later in the process I have to figure out the playback of multiple files, volume staging and filtering or EQing methods for masking effects etc…