Product X: Image Extender

Extending the System from Image Interpretation to Image Synthesis

This update marked a conceptual shift in the system’s scope: until now, images functioned purely as inputs, sources of visual information to be analyzed, interpreted, and mapped onto sound. With this iteration, I expanded the system to also support image generation, enabling users not only to upload visual material but to synthesize it directly within the same creative loop.

The goal was not to bolt on image generation as a novelty feature, but to integrate it in a way that respects the system’s broader design philosophy: user intent first, semantic coherence second, and automation as a supportive, not dominant, layer.

Architectural Separation: Reasoning vs. Rendering

A key early decision was to separate prompt reasoning from image rendering. Rather than sending raw user input directly to the image model, I introduced a two-stage pipeline:

  1. Prompt Interpretation & Enrichment (GPT-4.1)
    Responsible for understanding vague or underspecified user prompts and rewriting them into a semantically complete, realistic scene description.
  2. Image Synthesis (gpt-image-1 → DALL-E 2/3)
    Dedicated purely to rendering the final image from the enriched prompt. Through implementation, I discovered that while the original spec referenced gpt-image-1, OpenAI’s actual models are DALL-E 2 (60% cheaper, faster, but less detailed) and DALL-E 3 (higher quality but more expensive).

This separation mirrors the system’s audio architecture, where semantic interpretation and signal processing are deliberately decoupled. GPT-4.1 acts as a semantic mediator, while the image model remains a deterministic renderer.
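A minimal sketch of the first stage, assuming the official openai Python client (v1.x); the enrichment instruction here paraphrases the system’s intent rather than quoting the actual prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative enrichment instruction; the system's real prompt is more detailed.
ENRICHMENT_INSTRUCTION = (
    "Rewrite the user's idea as a realistic, photographic scene description. "
    "Add spatial context (location, environment), concrete objects and lighting. "
    "Avoid illustrated, cartoon or painterly styles."
)

def enrich_prompt(user_prompt: str) -> str:
    """Stage 1: semantic mediation - turn a vague idea into a complete scene description."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": ENRICHMENT_INSTRUCTION},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content.strip()

# Stage 2 (the image model) only ever sees the enriched description, never the raw input.
```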

The Response Format Learning Curve

During implementation, I encountered a subtle but important API nuance that forced a deeper understanding of the system’s data flow: DALL-E models return URLs by default, not base64 data. The initial implementation failed with a confusing “NoneType” error because I was trying to decode a base64 field that didn’t exist.

The fix was elegantly simple: adding response_format="b64_json" to the API call. But the debugging process revealed something more fundamental about API design: different services have different default behaviors, and understanding those defaults is crucial for robust system integration.

This also led to implementing proper fallback logic: if base64 isn’t available, the system gracefully falls back to downloading from the image URL, ensuring reliability across different OpenAI model versions and configurations.
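A sketch of that fallback, assuming the openai v1 Python client and the requests library; the helper name is mine:

```python
import base64
import requests

def image_bytes_from_response(result) -> bytes:
    """Prefer inline base64 data; fall back to downloading the returned URL."""
    item = result.data[0]
    if getattr(item, "b64_json", None):
        return base64.b64decode(item.b64_json)
    if getattr(item, "url", None):
        # Default DALL-E behaviour: only a time-limited URL is returned.
        download = requests.get(item.url, timeout=30)
        download.raise_for_status()
        return download.content
    raise ValueError("Image response contained neither b64_json nor url")
```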

Interactive Workflow Integration with Toggle Architecture

To maintain consistency with the existing interactive toolset while adding flexibility, I implemented a mode-toggle architecture:

  • Upload Mode: Traditional file upload with drag-and-drop support
  • Generate Mode: Text-to-image synthesis with prompt enrichment
  • State Preservation: The system maintains a single IMAGE_FILE variable that can be overwritten by either mode, ensuring seamless transitions between workflows

The interface exposes this through clean toggle buttons, showing only the relevant UI for each mode. This reduces cognitive load while preserving full functionality, a principle I’ve maintained throughout the system’s evolution.
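Since the system lives in a Colab notebook, the toggle idea can be sketched with ipywidgets; the widget names, layout and the global IMAGE_FILE handling below are illustrative assumptions, not the actual notebook code:

```python
import ipywidgets as widgets
from IPython.display import display

IMAGE_FILE = None  # single shared state, written by either mode

mode_toggle = widgets.ToggleButtons(options=["Upload", "Generate"], description="Mode:")
upload_widget = widgets.FileUpload(accept="image/*", multiple=False)
prompt_box = widgets.Text(description="Prompt:")
generate_button = widgets.Button(description="Generate image")

def refresh_ui(change=None):
    """Show only the controls that belong to the active mode."""
    is_upload = mode_toggle.value == "Upload"
    upload_widget.layout.display = "" if is_upload else "none"
    prompt_box.layout.display = "none" if is_upload else ""
    generate_button.layout.display = "none" if is_upload else ""

mode_toggle.observe(refresh_ui, names="value")
refresh_ui()
display(mode_toggle, upload_widget, prompt_box, generate_button)
```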

Cost-Aware Design with Caching and Model Selection

Image synthesis presents unique cost challenges compared to text generation or audio processing. I implemented several cost-mitigation strategies learned through experimentation:

  1. Resolution Control: Defaulting to 1024×1024 or 512×512 (for DALL-E 2)
  2. Quality Parameter Awareness: Only DALL-E 3 supports quality="standard" vs. "hd"; using the wrong parameter with DALL-E 2 causes API errors

The cost considerations weren’t just about saving money—they were about enabling iteration. When artists can generate dozens of variations without financial anxiety, they explore more freely. The system defaults to the cheapest viable path, with quality controls available but not forced.
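A small sketch of how those defaults can be encoded; the helper is an assumption, but the parameter constraints follow the notes above:

```python
def build_generation_kwargs(model: str = "dall-e-2", high_quality: bool = False) -> dict:
    """Pick size/quality arguments that match what each model accepts."""
    kwargs = {"model": model, "n": 1, "response_format": "b64_json"}
    if model == "dall-e-3":
        kwargs["size"] = "1024x1024"
        # Only DALL-E 3 understands the quality parameter.
        kwargs["quality"] = "hd" if high_quality else "standard"
    else:
        # Cheapest viable path: smaller renders on DALL-E 2, no quality parameter.
        kwargs["size"] = "512x512"
    return kwargs

# usage: client.images.generate(prompt=scene_description, **build_generation_kwargs())
```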

Prompt Realism as a Soft Constraint

Rather than enforcing hard validation rules (e.g., predefined lists of places or objects), I chose to treat realism as a soft constraint enforced by language, not logic.

User prompts are passed through a prompt-enrichment step where GPT-4.1 is instructed to:

  • Reframe the input as a photographic scene
  • Ensure the presence of spatial context (location, environment)
  • Ground the description in physical objects and lighting
  • Explicitly avoid illustrated, cartoon, or painterly styles

This approach preserves creative freedom while ensuring that the downstream image generation remains visually coherent and photo-realistic. Importantly, the system does not reject user input—it interprets it.

Design Philosophy: Generation as a First-Class Input

What this update ultimately enabled is a shift in how the system can be used:

  • Images are no longer just analyzed artifacts
  • They can now be constructed, refined, and immediately fed into downstream processes (visual analysis, audio mapping, spatial inference)

This closes a loop that previously required external tools. The system now supports a full cycle: imagine → generate → interpret → sonify.

Crucially, the same principle that guided earlier updates still applies: automation should amplify intent, not replace it. Image generation here is not about producing spectacle, but about giving users a controlled, semantically grounded way to define the visual worlds their soundscapes respond to.

The implementation journey, from API quirks to cost optimization to user experience design, reinforced that even “simple” features require deep consideration when integrating into a complex creative system. Each new capability should feel like it was always there, waiting to be discovered.

Product IX: Image Extender

Moving Beyond Dry Audio to Spatially Intelligent Soundscapes

My primary objective for this update was to bridge a critical perceptual gap in the system: while the previous iterations successfully mapped visual information to sonic elements with precise panning and temporal placement, the resulting audio mix remained perceptually “dry” and disconnected from the image’s implied acoustic environment. This update introduces adaptive reverberation, not as a cosmetic effect, but as a semantically grounded spatialization layer that transforms discrete sound objects into a coherent, immersive acoustic scene.

System Architecture

The existing interactive DAW interface, with its per-track volume controls, sound replacement engine, and user feedback mechanisms, was extended with a comprehensive spatial audio processing module. This module interprets the reverb parameters derived from image analysis (room detection, size estimation, material damping, and spatial width) and provides interactive control over their application.

Global Parameter State & Data Flow Integration

A crucial architectural challenge was maintaining separation between the raw audio mix (user-adjustable volume levels) and the reverb-processed version. I implemented a dual-state system with:

  • current_mix_raw: The continuously updated sum of all audio tracks with current volume slider adjustments.
  • current_mix_with_reverb: A cached, processed version with reverberation applied, recalculated only when reverb parameters change or volume sliders are adjusted with reverb enabled.

This separation preserves processing efficiency while maintaining real-time responsiveness. The system automatically pulls reverb parameters (room_size, damping, wet_level, width) from the image analysis block when available, providing image-informed defaults while allowing full manual override.
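A sketch of that hand-off, assuming the analysis block exposes the four keys named above; the default values are placeholders of mine:

```python
DEFAULT_REVERB = {"room_size": 0.5, "damping": 0.5, "wet_level": 0.33, "width": 1.0}

def reverb_params_from_analysis(analysis: dict) -> dict:
    """Image-informed defaults that the sliders can still override manually."""
    params = dict(DEFAULT_REVERB)
    for key in params:
        if key in analysis:
            params[key] = float(analysis[key])
    return params
```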

Pedalboard-Based Reverb Engine

I integrated the pedalboard audio processing library to implement professional-grade reverberation. The engine operates through a transparent conversion chain:

  1. Format Conversion: AudioSegment objects (from pydub) are converted to NumPy arrays normalized to the [-1, 1] range
  2. Pedalboard Processing: A Reverb effect instance applies parameters with real-time adjustable controls
  3. Format Restoration: Processed audio is converted back to AudioSegment while preserving sample rate and channel configuration

The implementation supports both mono and stereo processing chains, maintaining compatibility with the existing panning system.
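A condensed sketch of that chain, assuming pydub and pedalboard with 8-, 16- or 32-bit integer audio; edge cases (e.g. 24-bit samples) are ignored:

```python
import numpy as np
from pydub import AudioSegment
from pedalboard import Pedalboard, Reverb

def apply_reverb(segment: AudioSegment, room_size=0.5, damping=0.5,
                 wet_level=0.33, width=1.0) -> AudioSegment:
    """AudioSegment -> float32 NumPy -> pedalboard Reverb -> AudioSegment."""
    # 1. Format conversion: interleaved integer samples to float32 in [-1, 1]
    peak = float(1 << (8 * segment.sample_width - 1))
    samples = np.array(segment.get_array_of_samples()).astype(np.float32) / peak
    if segment.channels > 1:
        samples = samples.reshape((-1, segment.channels)).T  # (channels, samples)

    # 2. Pedalboard processing with the image-derived parameters
    board = Pedalboard([Reverb(room_size=room_size, damping=damping,
                               wet_level=wet_level, width=width)])
    processed = board(samples, segment.frame_rate)

    # 3. Format restoration: back to interleaved integers, same rate and channel count
    if segment.channels > 1:
        processed = processed.T.reshape(-1)
    dtype = {1: np.int8, 2: np.int16, 4: np.int32}[segment.sample_width]
    ints = np.clip(processed * peak, -peak, peak - 1).astype(dtype)
    return AudioSegment(data=ints.tobytes(), sample_width=segment.sample_width,
                        frame_rate=segment.frame_rate, channels=segment.channels)
```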

Interactive Reverb Control Interface

A dedicated control panel was added to the DAW interface, featuring:

  • Parameter Sliders: Four continuous controls for room size, damping, wet/dry mix, and stereo width, pre-populated with image-derived values when available
  • Toggle System: Three distinct interaction modes:
    1. “🔄 Apply Reverb”: Manual application with current settings
    2. “🔇 Remove Reverb”: Return to dry mix
    3. “Reverb ON/OFF Toggle”: Single-click switching between states
  • Contextual Feedback: Display of image-based room detection status (indoor/outdoor)

Seamless Playback Integration

The playback system was redesigned to dynamically switch between dry and wet mixes:

  • Intelligent Routing: The play_mix() function automatically selects current_mix_with_reverb or current_mix_raw based on the reverb_enabled flag (see the sketch after this list)
  • State-Aware Processing: When volume sliders are adjusted with reverb enabled, the system automatically reapplies reverberation to the updated mix, maintaining perceptual consistency
  • Export Differentiation: Final mixes are exported with _with_reverb or _raw suffixes, providing clear version control
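A minimal sketch of this routing plus the conditional recomputation, reusing apply_reverb from the sketch above; rebuild_raw_mix() and playback() stand in for the project’s actual mixing and playback helpers:

```python
# Global names follow the post; the glue code around them is assumed.
reverb_enabled = False
current_mix_raw = None           # volume-adjusted sum of all tracks (AudioSegment)
current_mix_with_reverb = None   # cached, reverb-processed version

def on_volume_change():
    """Rebuild the raw mix; only re-run the expensive reverb pass when it is audible."""
    global current_mix_raw, current_mix_with_reverb
    current_mix_raw = rebuild_raw_mix()                      # hypothetical helper
    if reverb_enabled:
        current_mix_with_reverb = apply_reverb(current_mix_raw)

def play_mix():
    """Intelligent routing: choose the wet or dry mix based on the reverb flag."""
    mix = current_mix_with_reverb if reverb_enabled else current_mix_raw
    playback(mix)                                            # hypothetical helper
```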

Design Philosophy: Transparency Over Automation

This phase reinforced a critical design principle: spatial effects should enhance rather than obscure the user’s creative decisions. Several automation approaches were considered and rejected:

  • Automatic Reverb Application: While the system could automatically apply image-derived reverb, I preserved manual activation to maintain user agency
  • Dynamic Parameter Adjustment: Real-time modification of reverb parameters during playback was technically feasible but introduced perceptual confusion
  • Per-Track Reverb: Individual reverberation for each sound object would create acoustic chaos rather than coherent space

The decision was made to implement reverb as a master bus effect, applied consistently to the entire mix after individual track processing. This approach creates a unified acoustic space that respects the visual scene’s implied environment while preserving the clarity of individual sound elements.

Technical Challenges & Solutions

State Synchronization

The most significant challenge was maintaining synchronization between the constantly updating volume-adjusted mix and the computationally expensive reverb processing. The solution was a conditional caching system: reverb is only recalculated when parameters change or when volume adjustments occur with reverb active.

Format Compatibility

Bridging the pydub-based mixing system with pedalboard’s NumPy-based processing required careful attention to sample format conversion, channel configuration, and normalization. The implementation maintains bit-perfect round-trip conversion.

Product V: Image Extender

Dynamic Audio Balancing Through Visual Importance Mapping

This development phase introduces sophisticated volume control based on visual importance analysis, creating audio mixes that dynamically reflect the compositional hierarchy of the original image. Where previous systems ensured semantic accuracy, we now ensure proportional acoustic representation.

The core advancement lies in importance-based volume scaling. Each detected object’s importance value (0-1 scale from visual analysis) now directly determines its loudness level within a configurable range (-30 dBFS to -20 dBFS). Visually dominant elements receive higher volume placement, while background objects maintain subtle presence.
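A sketch of that linear mapping with pydub, using the range quoted above:

```python
from pydub import AudioSegment

MIN_DBFS, MAX_DBFS = -30.0, -20.0   # importance-to-volume range from the text
ATMO_DBFS = -30.0                   # fixed background (atmo) level

def scale_to_importance(segment: AudioSegment, importance: float) -> AudioSegment:
    """Map importance (0-1) linearly onto a loudness target and apply the gain."""
    importance = max(0.0, min(1.0, importance))
    target = MIN_DBFS + importance * (MAX_DBFS - MIN_DBFS)
    return segment.apply_gain(target - segment.dBFS)

# e.g. a central car (0.9) lands near -21 dBFS, a distant tree (0.2) near -28 dBFS
```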

Key enhancements include:

– Linear importance-to-volume mapping creating natural acoustic hierarchies

– Fixed atmo sound levels (-30 dBFS) ensuring consistent background presence

– Image context integration in sound validation for improved semantic matching

– Transparent decision logging showing importance values and calculated loudness targets

The system now distinguishes between foreground emphasis and background ambiance, producing mixes where a visually central “car” (importance 0.9) sounds appropriately prominent compared to a distant “tree” (importance 0.2), while “urban street atmo” provides unwavering environmental foundation.

This represents a significant evolution from flat audio layering to dynamically balanced soundscapes that respect visual composition through intelligent volume distribution.

Product II: Image Extender

Dual-Model Vision Interface – OpenAI × Gemini Integration for Adaptive Image Understanding

Following the foundational phase of last week, where the OpenAI API Image Analyzer established a structured evaluation framework for multimodal image analysis, the project has now reached a significant new milestone. The second release integrates both OpenAI’s GPT-4.1-based vision models and Google’s Gemini (MediaPipe) inference pipeline into a unified, adaptive system inside the Image Extender environment.

Unified Recognition Interface

In the current version, the recognition logic has been completely refactored to support runtime model switching.
A dropdown-based control in Google Colab enables instant selection between:

  • Gemini (MediaPipe) – for efficient, on-device object detection and panning estimation
  • OpenAI (GPT-4.1 / GPT-4.1-mini) – for high-level semantic and compositional interpretation

Non-relevant parameters such as score threshold or delegate type dynamically hide when OpenAI mode is active, keeping the interface clean and focused. Switching back to Gemini restores all MediaPipe-related controls.
This creates a smooth dual-inference workflow where both engines can operate independently yet share the same image context and visualization logic.

Architecture Overview

The system is divided into two self-contained modules:

  1. Image Upload Block – handles external image input and maintains a global IMAGE_FILE reference for both inference paths.
  2. Recognition Block – manages model selection, executes inference, parses structured outputs, and handles visualization.

This modular split keeps the code reusable, reduces side effects between branches, and simplifies later expansion toward GUI-based or cloud-integrated applications.

OpenAI Integration

The OpenAI branch extends directly from last week’s work but now operates within the full environment.
It converts uploaded images into Base64 and sends a multimodal request to gpt-4.1 or gpt-4.1-mini.
The model returns a structured Python dictionary, typically using the following schema:

{
    "objects": [...],
    "scene_and_location": [...],
    "mood_and_composition": [...],
    "panning": [...]
}

A multi-stage parser (AST → JSON → fallback) ensures robustness even when GPT responses contain formatting artifacts.
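A sketch of such an AST → JSON → fallback chain; the project’s actual regex sanitation may differ in detail:

```python
import ast
import json
import re

def parse_model_output(raw: str) -> dict:
    """Turn a GPT response into a dict, tolerating Markdown code fences and loose quoting."""
    # Strip fences such as ```python ... ``` before parsing
    cleaned = re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", raw.strip())
    for parser in (ast.literal_eval, json.loads):
        try:
            result = parser(cleaned)
            if isinstance(result, dict):
                return result
        except (ValueError, SyntaxError):
            continue
    return {}  # fallback: an empty structure instead of a crash
```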

Prompt Refinement

During development, testing revealed that the English prompt version initially returned empty dictionaries.
Investigation showed that overly strict phrasing (“exclusively as a Python dictionary”) caused the model to suppress uncertain outputs.
By softening this instruction to allow “reasonable guesses” and explicitly forbidding empty fields, the API responses became consistent and semantically rich.

Debugging the Visualization

A subtle logic bug was discovered in the visualization layer:
The post-processing code still referenced German dictionary keys (“objekte”, “szenerie_und_ort”, “stimmung_und_komposition”) from last week’s version.
Since the new English prompt returned English keys (“objects”, “scene_and_location”, etc.), these lookups produced empty lists, which in turn broke the overlay rendering loop.
After harmonizing key references to support both language variants, the visualization resumed normal operation.
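The harmonization itself can be as small as a lookup that accepts both key variants; a sketch:

```python
def get_field(data: dict, english_key: str, german_key: str) -> list:
    """Support both the old German and the new English response schemas."""
    return data.get(english_key) or data.get(german_key) or []

objects = get_field(result, "objects", "objekte")
scene = get_field(result, "scene_and_location", "szenerie_und_ort")
mood = get_field(result, "mood_and_composition", "stimmung_und_komposition")
```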

Cross-Model Visualization and Validation

A unified visualization layer now overlays results from either model directly onto the source image.
In OpenAI mode, the “panning” values from GPT’s response are projected as vertical lines with object labels.
This provides immediate visual confirmation that the model’s spatial reasoning aligns with the actual object layout, an important diagnostic step for evaluating AI-based perception accuracy.
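One way such an overlay can be drawn, assuming each panning entry pairs an object label with a horizontal position between 0 and 1 (the real schema may differ); sketched with PIL and matplotlib:

```python
import matplotlib.pyplot as plt
from PIL import Image

def overlay_panning(image_path: str, panning: list):
    """Draw a labelled vertical line at each object's estimated horizontal position."""
    img = Image.open(image_path)
    fig, ax = plt.subplots()
    ax.imshow(img)
    for entry in panning:                    # assumed shape: {"object": str, "pan": 0..1}
        x = entry["pan"] * img.width
        ax.axvline(x=x, color="red", linewidth=1.5)
        ax.text(x, 10, entry["object"], color="red", rotation=90, va="top")
    ax.axis("off")
    plt.show()
```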

Outcome and Next Steps

The project now represents a dual-model visual intelligence system, capable of using symbolic AI interpretation (OpenAI) and local pixel-based detection (Gemini).

Next steps

The upcoming development cycle will focus on connecting the OpenAI API layer directly with the Image Extender’s audio search and fallback system.

Product I: Image Extender

OpenAI API Image Analyzer – Structured Vision Testing and Model Insights

Adaptive Visual Understanding Framework
In this development phase, the focus was placed on building a robust evaluation framework for OpenAI’s multimodal models (GPT-4.1 and GPT-4.1-mini). The primary goal: systematically testing image interpretation, object detection, and contextual scene recognition while maintaining controlled cost efficiency and analytical depth.

upload of image (image source: https://www.trumau.at/)
  1. Combined Request Architecture
    Unlike traditional multi-call pipelines, the new setup consolidates image and text interpretation into a single API request. This streamlined design prevents token overhead and ensures synchronized contextual understanding between categories. Each inference returns a structured Python dictionary containing three distinct analytical branches:
    • Objects – Recognizable entities such as animals, items, or people
    • Scene and Location Estimation – Environment, lighting, and potential geographic cues
    • Mood and Composition – Aesthetic interpretation, visual tone, and framing principles

For each uploaded image, the analyzer prints three distinct lists per model, side by side. This offers a straightforward way to assess interpretive differences without complex metrics. In practice, GPT-4.1 tends to deliver slightly more nuanced emotional and compositional insights, while GPT-4.1-mini prioritizes concise, high-confidence object recognition.

results of the image object analysis and model comparison

Through the unified format, post-processing can directly populate separate lists or database tables for subsequent benchmarking, minimizing parsing latency and data inconsistencies.

  2. Robust Output Parsing
    Because model responses occasionally include Markdown code blocks (e.g., fenced ```python blocks wrapping the dictionary), the parsing logic was redesigned with a multi-layered interpreter using regex sanitation and dual parsing strategies (AST > JSON > fallback). This guarantees that even irregularly formatted outputs are safely converted into structured datasets without manual intervention. The system thus sustains analytical integrity under diverse prompt conditions.
  3. Model Benchmarking: GPT-4.1-mini vs. GPT-4.1
    The benchmark test compared inference precision, descriptive richness, and token efficiency between the two models. While GPT-4.1 demonstrates deeper contextual inference and subtler mood detection, GPT-4.1-mini achieves near-equivalent recognition accuracy at approximately one-tenth of the cost per request. For large-scale experiments (e.g., datasets exceeding 10,000 images), GPT-4.1-mini provides the optimal balance between granularity and economic viability.
  4. Token Management and Budget Simulation
    A real-time token tracker revealed an average consumption of ~1,780 tokens per image request. Given GPT-4.1-mini’s rate of $0.003 / 1k tokens, a one-dollar operational budget supports roughly 187 full image analyses (the short sketch below reproduces this calculation). This insight forms the baseline for scalable experimentation and budget-controlled automation workflows in cloud-based vision analytics.
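For reference, the arithmetic behind those numbers:

```python
tokens_per_request = 1_780
price_per_1k_tokens = 0.003                                           # USD, rate quoted above
cost_per_request = tokens_per_request / 1000 * price_per_1k_tokens    # ≈ $0.00534
analyses_per_dollar = int(1.0 / cost_per_request)                     # ≈ 187
```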

The next development phase will integrate this OpenAI-driven visual analysis directly into the Image Extender environment. This integration marks the transition from isolated model testing toward a unified generative framework.

#5 Visualisation Refinement and Hardware Setup

Over the past few weeks, this project slowly evolved into something that brings together a lot of different inspirations—some intentional, some accidental. Looking back, it really started during the VR project we worked on at the beginning of the design week. We were thinking about implementing NFC tags, and there was something fascinating about the idea that just placing an object somewhere could trigger an action. That kind of physical interaction stuck with me.

NFC Tag

Around the same time, we got a VR headset to develop and test our game. While browsing games, I ended up playing this wizard game—and one small detail in it fascinated me. You could lay magical cards onto a rune-like platform, and depending on the card, different things would happen. It reminded me exactly of those NFC interactions in the real world. It was playful, physical, and smart. That moment clicked for me: I really liked the idea that placing something down could unlock or reveal something.

Wizard Game

Closing the Circle

That’s the energy I want to carry forward into the final version of this project. I’m imagining an interactive desk where you can place cards representing different countries and instantly see their CO2 emission data visualized. For this prototype, I’m keeping it simple and focused—Austria only, using the dataset I already processed. But this vision could easily scale: more countries, more visual styles, more ways to explore and compare.

Alongside developing the interaction concept, I also took time to refine the visualization itself. In earlier versions, the particle behavior and data mapping were more abstract and experimental—interesting, but sometimes a bit chaotic. For this version, I wanted it to be clearer and more readable without losing that expressive quality. I adjusted the look of the CO2 particles to feel more alive and organic, giving them color variation, slight flickering, and softer movement. These small changes helped shift the visual language from a data sketch to something that feels more atmospheric and intentional. It’s still messy in a good way, but now it communicates more directly what’s at stake.

Image Reference

Image 1 (NFC Tag): https://www.als-uk.com/news-and-blog/the-future-of-nfc-tags/

Image 2 (Wizard Game): https://www.roadtovr.com/the-wizards-spellcasting-vr-combat-game-early-access-launch-trailer-release-date/

#4 Alright… Now What?

So far, I’ve soldered things together (mentally, not literally), tested sensors, debugged serial communication, and got Arduino and Processing talking to each other. That in itself feels like a win. But now comes the real work: What do I actually do with this setup?

At this stage, I started combining the two main inputs—the proximity sensor and the potentiometer—into a single, working system. The potentiometer became a kind of manual timeline scrubber, letting me move through 13 steps along a line, intended as a test for a potential timeline. The proximity sensor added a sense of presence, acting like a trigger that wakes the system up when someone approaches. Together, they formed a simple but functional prototype of a prototype, a rough sketch of the interaction I’m aiming for. It helped me think through how the data might be explored, not just visually, but physically, with gestures and motion. This phase was more about testing interaction metaphors than polishing visuals—trying to understand how something as abstract as historical emissions can be felt through everyday components like a knob and a distance sensor. This task showed me how important testing and quick ideation can be for understanding your own thoughts and forming a more precise picture of your plan.

Small Prototype to connect sensors in one file

Things about to get serious

Building on the knowledge I gained during the ideation phase, I connected my working sensor system, a potentiometer and a proximity sensor, to the Processing sketch I had developed during design week. That earlier version already included interaction through Makey Makey and homemade aluminum foil buttons, which made for a playful and tactile experience. In my opinion, the transfer to Arduino technology made the whole setup easier to handle and much cleaner—fewer cables, more direct control, and better integration with the Processing environment. The potentiometer now controls the timeline of Austria’s CO2 emissions, while the proximity sensor acts as a simple trigger to activate the visualization. This transition from foil to microcontroller reflects how the project evolved from rough experimentation into a more stable, cohesive prototype.

17 – Clickable Prototype v1

After all the sketches, user flows, and planning, I finally pulled everything into a quick clickable prototype (Figma is awesome for this, btw). It’s still an early version, but it gives a solid feel of how the app might look and behave. I wanted to see how the Home, Activity, and Settings tabs work together and how smooth the experience feels when clicking through it all.

Here’s a short walkthrough video showing the prototype in action:

Working on this helped me catch a few small details I hadn’t noticed before, like the pacing between steps and where extra feedback could better guide the user. Overall, seeing it come to life, even in a simple form, was a great way to confirm if the structure works.

Next, I’ll refine the flow, tidy up interactions, and start testing how others respond. It’s exciting to finally transition from an idea to something tangible you can click through.

2.6. “The Hidden Side of Graz”

After extensive experimentation with the Touch Board, I’m excited to share a short video showcasing the final prototype in action. This interactive map invites people to discover Graz through touch and sound. Each spot on the map hides a small surprise: a sound, a memory, a piece of the city waiting to be heard. Everything you see here was designed to feel handmade and screen-free, turning simple tech into something a little more magical.

Watch the video to see how it all comes together.

And here’s the video with all the sound stories.
Hope you enjoy 🙂

Building the panner: Implementing the Object and Trigger System

After conceptualizing the panner interface as a core feature of my spatial sound toolkit, the next phase of the project shifted into technical territory. This stage involved developing both the XY panner behavior and a trigger system built directly on top of object positions. In this post, I’ll walk through how I translated the idea into code using Max/MSP, Max for Live and JavaScript, creating a mix of visual and hidden logic.

Starting with a simple XY Pad

My starting point was a simple XY pad. At first glance, this seemed like a straightforward way to navigate sound across a room and interact with virtual objects. But I quickly found that in its raw form, it lacked the nuance I needed; it was too binary, too linear. There was no sense of proximity, weight, or sonic gravity between the user and the objects.

So I introduced some kind of attractors.

Introducing Attractors

The new implementation allows each object in the panner to become an attractor within a customizable radius. Here’s how it works:

  1. Each object is placed at a fixed position on the grid; The user can set the position within the interface.
  2. A radius value (default: 0.5; range: 0.5–4 [coordinates]) defines how close the user’s XY slider needs to be in order to activate the attractor. This gets checked via a classic condition.
  3. If the user’s XY position falls within that radius, it triggers the attraction_value abstraction;
  4. This abstraction calculates the distance between the user position and the object using the classic formula:
    d = √([x₂ - x₁]²+[y₂ - y₁]²)
  5. This distance is then normalized between 0 and 1 based on the radius and used to control mapping parameters; in this case, faders that modulate each object’s sound layer.

This system gives users a gradient-based interaction model, where getting closer to an object increases its influence, allowing for more natural and exploratory listening behaviors. To give creators further control over the responsiveness, there is an additional smoothing fader that determines how quickly panning movements take effect (100–4000 ms).
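The attraction logic itself lives in Max abstractions, but the underlying math is easy to sketch outside the patch; the Python below is purely illustrative, and the inversion (closer means more influence) is one plausible way to drive the faders:

```python
import math

def attraction(user_xy, object_xy, radius=0.5):
    """Return 0-1 influence: 1 at the object's position, 0 at or beyond the radius."""
    dx = object_xy[0] - user_xy[0]
    dy = object_xy[1] - user_xy[1]
    d = math.sqrt(dx * dx + dy * dy)     # classic Euclidean distance
    if d >= radius:
        return 0.0                       # outside the radius: attractor stays inactive
    return 1.0 - d / radius              # distance normalized by the radius, inverted
```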

Trigger System

To complement the panner, I also implemented the trigger system that sits directly on top of the mapped objects.

To keep the patch clean and user-friendly, I wrote a custom JavaScript file:
includeTriggers.js

Using JavaScript in Max/MSP provided me with several advantages:

First of all, it allows controlled patch editing without the user needing to dive into patch internals.

Further, I could implement accurate placement of the trigger buttons in both the patcher and the presentation mode (which later on also serves as the UI of the Max for Live device).

I could also establish invisible connections to the send object that routes interaction to my event_trigger abstraction.

This script is activated via a simple toggle switch in the user interface. When toggled on, it triggers the following actions:

  1. Finds the correct trigger button templates;
  2. Positions them on top of the corresponding object locations;
  3. Connects them invisibly to the back-end.

When toggled off, a sister script, excludeTriggers.js, removes them from presentation mode, disabling interaction safely without deleting anything.

Using the Max for Live API

When a user activates one of the visible triggers, the event_trigger abstraction takes action. It uses the Max for Live API to launch a clip from Ableton Live’s Session View; playing a sound event specifically assigned to that object.

Each object can hold multiple events, which are randomly triggered using a round-robin system. As pointed out in the previous blog entry, this ensures variation and prevents repetition.

Learning Through the Implementation

This implementation phase was not only functional but also very much educational. Working with Max for Live’s UI elements and the API gave me a much better understanding of the platform’s architecture.

In particular, experimenting with JavaScript within Max/MSP allowed me to see and manipulate the underlying hierarchy of patch elements; something normally hidden from view. It was a somewhat tedious process that forced me to rely a lot on trial and error due to sparse documentation. But these experiments resulted in a handful of reusable scripts, such as createTriggers.js and deleteTriggers.js, which I may refine further for future iterations. The same goes for working with Max for Live: even though I might not use every approach, I now have some patches that I can easily adapt for other UIs.

Since I already mentioned that it’s quite a new challenge for me to work with broader usability in mind, some feedback would be really nice: if you’re working with spatial sound, Max for Live, or experimental interaction systems and would like to test this prototype or collaborate, feel free to reach out.