Product I – Embodied Resonance – Hardware Selection and System Setup

RECAP:
Embodied Resonance investigates how the body of a person with lived experience of war responds to auditory triggers that recall traumatic events, and how these responses can be expressed through sound. The heart is central to this project, as cardiac activity – particularly heart rate variability (HRV) – provides detailed insight into stress regulation and autonomic nervous system dynamics associated with post-traumatic stress disorder (PTSD).

The conceptual direction of the work is shaped by my personal experience of the war in Ukraine and an interest in trauma physiology. I was drawn to the idea that trauma leaves measurable traces in the body—signals that often remain inaccessible through language but can be explored scientifically and translated into sonic form. This approach was influenced by The Body Keeps the Score by Bessel van der Kolk, which emphasizes embodied memory and non-verbal manifestations of trauma.

In the previous semester, the project focused on exploratory work with existing physiological datasets. A large open-access dataset on stress-induced myocardial ischemia was used to study cardiac behavior under rest, stress, and recovery conditions. Although not designed specifically for PTSD research, the dataset includes participants with PTSD, anxiety-related disorders, and cardiovascular conditions, offering a diverse basis for analysis.

During this phase, Python tools based on the NeuroKit2 library were developed to compute time- and frequency-domain HRV metrics from ECG recordings. Additional scripts transformed these parameters into MIDI patterns and continuous controller (CC) data for sound synthesis and composition. Initial experiments with real-time HRV streaming were also conducted, but they revealed significant limitations: many HRV metrics require long analysis windows and are computationally demanding, making them unsuitable for stable real-time sonification.
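To make the time-domain side of this pipeline concrete, the sketch below computes two standard time-domain HRV metrics (SDNN and RMSSD) from a list of RR intervals in plain Python. The actual project used NeuroKit2 for this; this is only an illustrative reimplementation of the underlying formulas, not the project's code.

```python
import math
import statistics

def hrv_time_domain(rr_ms):
    """Compute two common time-domain HRV metrics from RR intervals (ms).

    SDNN: standard deviation of all RR intervals.
    RMSSD: root mean square of successive RR-interval differences.
    """
    sdnn = statistics.stdev(rr_ms)
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return {"SDNN": sdnn, "RMSSD": rmssd}

# Example: RR intervals around 800 ms (~75 bpm) with mild variability
rr = [790, 810, 805, 795, 820, 780, 800]
print(hrv_time_domain(rr))
```

In the real pipeline the RR intervals come from R-peak detection on the ECG signal; frequency-domain metrics (LF/HF power) additionally require resampling and spectral estimation, which is part of why they need long analysis windows.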

In the current semester, corresponding to the Product phase, the project transitions from simulation-based exploration to work with my own body. During earlier presentations, concerns were raised regarding the ethical implications of experiments that could potentially lead to re-traumatization, particularly when involving other participants with war-related trauma. In response, I decided not to extrapolate the experiment to other Ukrainians at this stage and to limit the investigation to my own physiological responses.

Furthermore, instead of exposing myself to recorded sirens at arbitrary times, I chose to record my ECG during the weekly civil defense siren tests that take place every Saturday in Graz. This context offers a meaningful contrast: for most residents of Austria, the siren test is a routine element of everyday life, largely stripped of emotional urgency. For someone with lived experience of war, however, the same sound carries associations of immediate danger. By situating the recordings within this real, socially normalized setting, the project examines how a familiar public signal can produce profoundly different embodied responses depending on personal history.

Before starting the experimental recordings, it was necessary to select and acquire appropriate sensors and a microcontroller. Prior to purchase, a short survey of available biosensing hardware was conducted, with particular attention paid to signal quality, availability of documentation, and the existence of example projects demonstrating practical use. An additional criterion was whether the sensors had been previously employed in projects related to heart rate variability (HRV) analysis.

For ECG acquisition, the DFRobot Gravity Heart Rate Monitor Sensor was selected. This sensor offered a favorable balance between cost and functionality, providing all required cables as well as disposable electrodes. Importantly, it had been used in a well-documented HRV-focused project, which served as a valuable technical reference during development and troubleshooting. In addition to ECG, a galvanic skin response (GSR) sensor from Seeed Studio was included to explore changes in skin conductance as an additional physiological marker of stress. While GSR was not part of the previous semester’s research, it was included experimentally to assess whether this modality could provide complementary information. At this stage, the structure and usefulness of GSR data were not yet fully predictable and were treated as exploratory. As a microcontroller, the Arduino MKR WiFi 1010 was chosen. 

The full list of acquired components is as follows:

  • Arduino MKR WiFi 1010
  • DFRobot Gravity Heart Rate Monitor Sensor (ECG)
  • DFRobot Disposable ECG Electrodes
  • Seeed Studio GSR Sensor
  • Seeed Studio 4-pin Male Jumper to Grove Conversion Cable
  • Breadboard (400 holes)
  • Male-to-male jumper wires
  • Male-to-female jumper wires
  • Potentiometers (100 kΩ)
  • LiPo Battery 1S 3.7 V 500 mAh (not used in final setup)

The total cost of the acquired hardware amounted to approximately 80 EUR. The initial motivation for choosing the Arduino MKR WiFi 1010 was the possibility of wireless data transmission via WiFi or Bluetooth. In practice, however, wireless communication was not required. Due to the high motion sensitivity of both ECG and GSR sensors, recordings had to be performed in a largely static position, making a wired USB connection to the computer sufficient. For this reason, the battery intended for mobile operation was ultimately not used.

For software configuration, the Arduino IDE was installed. Although I had prior experience working with Arduino hardware several years ago, the interface had changed significantly. To support the Arduino MKR WiFi 1010, the SAMD Boards package was additionally installed via the Boards Manager. After software setup, all components were connected according to a simple wiring scheme that required no additional electronic elements. 

Figure 1. Wiring diagram of the experimental setup with Arduino MKR WiFi 1010, ECG sensor, and GSR sensor.

The Arduino ground (GND) was connected to the ground rail of the breadboard, and the 5 V output was connected to the power rail.

The ECG sensor was connected as follows:

GND (black wire) → ground rail on the breadboard

VCC (red wire) → 5 V power rail

Signal output (blue wire) → analog input A1 on the Arduino

The GSR sensor was connected as follows:

GND (black wire) → ground rail on the breadboard

VCC (red wire) → 3.3 V output on the Arduino

Signal output (yellow wire) → analog input A2 on the Arduino

Figure 2 illustrates the complete wiring configuration of the system, including the Arduino MKR WiFi 1010, ECG sensor, GSR sensor, and breadboard power distribution.

Figure 2. Physical hardware configuration used for ECG and GSR data recording.
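With the wiring in place, the samples can be captured on the computer over the USB serial connection. The minimal host-side sketch below assumes the Arduino firmware prints one comma-separated "ecg,gsr" pair of raw 10-bit ADC readings per line — that output format is an assumption for illustration, not the project's actual firmware.

```python
# Host-side helper for the USB serial stream. Assumes (hypothetically) that
# the Arduino prints one "ecg,gsr" pair of raw ADC readings (0-1023) per line.

def parse_sample(line):
    """Parse one serial line like '512,300' into an (ecg, gsr) tuple.

    Returns None for malformed lines (boot messages, partial reads),
    so the recording loop can simply skip them.
    """
    parts = line.strip().split(",")
    if len(parts) != 2:
        return None
    try:
        ecg, gsr = int(parts[0]), int(parts[1])
    except ValueError:
        return None
    if not (0 <= ecg <= 1023 and 0 <= gsr <= 1023):
        return None
    return ecg, gsr

# In the real setup the lines would come from pyserial, e.g.:
#   import serial
#   with serial.Serial("/dev/ttyACM0", 115200) as port:
#       sample = parse_sample(port.readline().decode("ascii", "ignore"))
print(parse_sample("512,300"))   # (512, 300)
print(parse_sample("garbage"))   # None
```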

Product X: Image Extender

Extending the System from Image Interpretation to Image Synthesis

This update marked a conceptual shift in the system’s scope: until now, images functioned purely as inputs, sources of visual information to be analyzed, interpreted, and mapped onto sound. With this iteration, I expanded the system to also support image generation, enabling users not only to upload visual material but to synthesize it directly within the same creative loop.

The goal was not to bolt on image generation as a novelty feature, but to integrate it in a way that respects the system’s broader design philosophy: user intent first, semantic coherence second, and automation as a supportive, not dominant, layer.

Architectural Separation: Reasoning vs. Rendering

A key early decision was to separate prompt reasoning from image rendering. Rather than sending raw user input directly to the image model, I introduced a two-stage pipeline:

  1. Prompt Interpretation & Enrichment (GPT-4.1)
    Responsible for understanding vague or underspecified user prompts and rewriting them into a semantically complete, realistic scene description.
  2. Image Synthesis (gpt-image-1 → DALL-E 2/3)
    Dedicated purely to rendering the final image from the enriched prompt. Through implementation, I discovered that while the original spec referenced gpt-image-1, OpenAI’s actual models are DALL-E 2 (60% cheaper, faster, but less detailed) and DALL-E 3 (higher quality but more expensive).

This separation mirrors the system’s audio architecture, where semantic interpretation and signal processing are deliberately decoupled. GPT-4.1 acts as a semantic mediator, while the image model remains a deterministic renderer.

The Response Format Learning Curve

During implementation, I encountered a subtle but important API nuance that forced a deeper understanding of the system’s data flow: DALL-E models return URLs by default, not base64 data. The initial implementation failed with a confusing “NoneType” error because I was trying to decode a base64 field that didn’t exist.

The fix was simple: adding response_format="b64_json" to the API call. But the debugging process revealed something more fundamental about API design: different services have different default behaviors, and understanding those defaults is crucial for robust system integration.

This also led to implementing proper fallback logic: if base64 isn’t available, the system gracefully falls back to downloading from the image URL, ensuring reliability across different OpenAI model versions and configurations.
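The fallback logic described above can be sketched as a small helper. The field names (b64_json, url) mirror OpenAI's image responses as described in the text, but the surrounding structure is a simplified assumption; the downloader is injected so the logic can be shown without network access.

```python
import base64

def image_bytes_from_response(item, download=None):
    """Return raw image bytes from one image-API result item.

    Prefers the base64 payload ('b64_json'); if it is missing, falls back
    to downloading from the 'url' field via the supplied `download`
    callable. Sketch only -- the exact dict layout is an assumption.
    """
    b64 = item.get("b64_json")
    if b64:
        return base64.b64decode(b64)
    url = item.get("url")
    if url and download is not None:
        return download(url)  # e.g. lambda u: requests.get(u).content
    raise ValueError("response contains neither b64_json nor url")

# Usage with a stubbed downloader:
fake = {"url": "https://example.com/img.png"}
data = image_bytes_from_response(fake, download=lambda u: b"\x89PNG...")
```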

Interactive Workflow Integration with Toggle Architecture

To maintain consistency with the existing interactive toolset while adding flexibility, I implemented a mode-toggle architecture:

  • Upload Mode: Traditional file upload with drag-and-drop support
  • Generate Mode: Text-to-image synthesis with prompt enrichment
  • State Preservation: The system maintains a single IMAGE_FILE variable that can be overwritten by either mode, ensuring seamless transitions between workflows

The interface exposes this through clean toggle buttons, showing only the relevant UI for each mode. This reduces cognitive load while preserving full functionality, a principle I’ve maintained throughout the system’s evolution.

Cost-Aware Design with Caching and Model Selection

Image synthesis presents unique cost challenges compared to text generation or audio processing. I implemented several cost-mitigation strategies learned through experimentation:

  1. Resolution Control: Defaulting to 1024×1024, or 512×512 for DALL-E 2
  2. Quality Parameter Awareness: Only DALL-E 3 supports quality="standard" vs "hd"; using the wrong parameter with DALL-E 2 causes API errors
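Both mitigations can be combined in a small request builder that only attaches the quality parameter for DALL-E 3 and defaults each model to its cheapest viable resolution. The parameter names follow the text above; the concrete defaults are illustrative choices, not fixed project settings.

```python
def build_image_request(model, prompt, quality="standard"):
    """Assemble keyword arguments for an image-generation call.

    Only DALL-E 3 accepts the `quality` parameter (sending it to DALL-E 2
    causes an API error), so it is added conditionally. Sizes default to
    the cheapest viable resolution per model.
    """
    params = {
        "model": model,
        "prompt": prompt,
        "size": "1024x1024" if model == "dall-e-3" else "512x512",
        "response_format": "b64_json",  # avoid the URL-only default
    }
    if model == "dall-e-3":
        params["quality"] = quality
    return params

print(build_image_request("dall-e-2", "a quiet harbor at dusk"))
```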

The cost considerations weren’t just about saving money—they were about enabling iteration. When artists can generate dozens of variations without financial anxiety, they explore more freely. The system defaults to the cheapest viable path, with quality controls available but not forced.

Prompt Realism as a Soft Constraint

Rather than enforcing hard validation rules (e.g., predefined lists of places or objects), I chose to treat realism as a soft constraint enforced by language, not logic.

User prompts are passed through a prompt-enrichment step where GPT-4.1 is instructed to:

  • Reframe the input as a photographic scene
  • Ensure the presence of spatial context (location, environment)
  • Ground the description in physical objects and lighting
  • Explicitly avoid illustrated, cartoon, or painterly styles

This approach preserves creative freedom while ensuring that the downstream image generation remains visually coherent and photo-realistic. Importantly, the system does not reject user input—it interprets it.

Design Philosophy: Generation as a First-Class Input

What this update ultimately enabled is a shift in how the system can be used:

  • Images are no longer just analyzed artifacts
  • They can now be constructed, refined, and immediately fed into downstream processes (visual analysis, audio mapping, spatial inference)

This closes a loop that previously required external tools. The system now supports a full cycle: imagine → generate → interpret → sonify.

Crucially, the same principle that guided earlier updates still applies: automation should amplify intent, not replace it. Image generation here is not about producing spectacle, but about giving users a controlled, semantically grounded way to define the visual worlds their soundscapes respond to.

The implementation journey, from API quirks to cost optimization to user experience design, reinforced that even “simple” features require deep consideration when integrating into a complex creative system. Each new capability should feel like it was always there, waiting to be discovered.

Product VII: Image Extender

Room-Aware Mixing – From Image Analysis to Coherent Acoustic Spaces

Instead of attempting to recover exact physical properties, the system derives normalized, perceptual room parameters from visual cues such as geometry, materials, furnishing density, and openness. These parameters are intentionally abstracted to work with algorithmic reverbs.

The introduced parameters are:

  • room_detected (bool)
    Indicates whether the image depicts a closed indoor space or an outdoor/open environment.
  • room_size (0.0–1.0)
    Represents the perceived acoustic size of the room (small rooms → short decay, large spaces → long decay).
  • damping (0.0–1.0)
    Estimates high-frequency absorption based on visible materials (soft furnishings, carpets, curtains vs. glass and hard walls).
  • wet_level (0.0–1.0)
    Describes how reverberant the space naturally feels.
  • width (0.0–1.0)
    Estimates perceived stereo width derived from room proportions and openness.

All parameters are stored flat within the same dictionary as objects, panning, and importance values, forming a single coherent scene representation.
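A minimal sketch of how such a flat scene dictionary could drive an algorithmic reverb is shown below. The parameter names come from the list above; the concrete mapping ranges (e.g. a 0.3–8.0 s decay) are illustrative assumptions, not values taken from the project.

```python
def reverb_settings(scene):
    """Map normalized room parameters (0.0-1.0) to reverb controls.

    The linear ranges used here are illustrative choices; any algorithmic
    reverb could be targeted with a similar mapping.
    """
    if not scene.get("room_detected", False):
        return None  # outdoor/open scene: skip the room reverb entirely
    return {
        "decay_s": 0.3 + scene["room_size"] * 7.7,   # small room -> short decay
        "hf_damping": scene["damping"],              # soft materials absorb highs
        "wet_dry": scene["wet_level"],
        "stereo_width": scene["width"],
    }

scene = {"room_detected": True, "room_size": 0.5, "damping": 0.7,
         "wet_level": 0.4, "width": 0.8}
print(reverb_settings(scene))
```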

Dereverberation: Explored, Then Intentionally Abandoned

As part of this phase, automatic analysis of existing reverberation (RT60, DRR estimation) and dereverberation was evaluated.

The outcome:

  • Computationally expensive, especially in Google Colab
  • Inconsistent and often unsatisfactory audio results
  • High complexity with limited practical benefit

Decision:
Dereverberation is not pursued further in this project. Instead, the system relies on:

  • Consistent room estimation
  • Controlled, unified reverb application
  • Preventive design rather than corrective processing

The next step will focus on analyzing the source sounds themselves (especially their RT60 and DRR values) so that, when a closed room is detected, less reverb is applied to sounds that already carry strong reverberation.

Critical review of “Microbiophonic Emergences” by Adomas Palekas (Master’s Thesis, Institute of Sonology, 2024)

Adomas Palekas’s master’s thesis, entitled Microbiophonic Emergences, could be described as an interdisciplinary mixture of artistic reflection, philosophical speculation, and experimental sound practice. Combining ecological thought and artistic research, the text examines the relationship between sound, life, and perception. Drawing upon the Gaia hypothesis and Goethean science, the author advances a more sensitive and ethical mode of listening wherein the boundaries between art and scientific observation are dissolved.

Overall, the presentation of the work is careful and visually well-structured, though it significantly deviates from academic conventions. A clearly defined research question, hypothesis, or structured methodology is lacking. Instead, the text comes across as a long essay on listening, nature, and non-human agency. Its first half is dedicated to theoretical reflections on sound as a living force, while the second half introduces a series of artistic experiments and installations entitled Kwass Fermenter, Microbial Music I–III (On Bread, Compost and Haze, Aerials), Rehydration, Infection, Spectrum of Mutations: Myosin III and Kwassic Motion. According to the author, these works constitute a coherent artistic ecosystem in which microorganisms and sonic feedback interact.

More conceptually, this framing of sonification as bi-directional means that sound should not just be generated from biological data but is also to be sent back into the system and used to affect it. Conceptually, this approach seeks to transform sonification into a dialogue rather than a representation. This claim of originality, however, feels somewhat overstated: bi-directional or feedback-based sonification has been explored conceptually and practically by many artists and researchers before him, largely within the field of bio-art and ecological sound practice. Palekas himself mentions only one precedent when, in fact, a wide range of comparable works deals with translating biological activity into sound and then re-introducing it into the same system. His treatment of the topic therefore remains limited, without deep contextual awareness, and gives the impression of rediscovering ideas that are already well conceptualized in the discipline.

The artistic independence of the thesis and a strong, personal vision are explicit. Palekas’s voice is consistent; his writing also reflects genuine curiosity and sensitivity. But it is this very independence that alienates his research from the broader academic and artistic discourse. One misses the dialogue with other practitioners or with theoretical perspectives, except for the few philosophical sources mentioned above. The limited literature review weakens the credibility of his theoretical framework and makes it difficult to situate the work within contemporary sound studies or bio-art research.

The structuring of the thesis is much closer to a philosophical narrative than to a scientific report. The chapters are more intuitively than logically connected. Because explicit methodological framing is absent, the reader has to reconstruct the logic of the experiments from poetic descriptions. For example, the sonification tests with fermentation are told in narrative terms, sometimes mentioning sensors, mappings, and feedback without providing detailed diagrams, lists of parameters, or reproducible data.

From a communicational point of view, the thesis is well-written and easy to read. Palekas’s prose is expressive and reflective; his philosophical passages are a pleasure to read. At the same time, this lyricism too often supplants analytical clarity. The experimental results remain fuzzy; the measurements are given “by ear,” not through numerical analysis, and the reader cannot tell whether the effects observed are significant or only subjective impressions.

In scope and depth, the thesis is ambitious but uneven. It tries to combine philosophy, biology, and sound art, but the practical documentation remains superficial. The experiments are deficient in calibration and control conditions, as well as in quantitative evidence. The author himself recognizes that fermentation is hardly predictable and thus difficult to reproduce. But this admission only underlines the fragility of his conclusions. Without a presentation of clear data or even replicable protocols, the whole project remains conceptual rather than empirical.

Accuracy and attention to detail are only partial: the author provides some information about equipment and process – for example, relative calibration of the CO₂ sensors and the use of Arduino and Pure Data – but no consistent system for reporting values, frequencies, and time spans. References to appendices and videos are incomplete, and none of the cited sound recordings or code are available. As a result, the project can be neither scientifically evaluated nor reproduced.

The section on literature review reflects selectivity: In situating his thought within broader ecological and philosophical frameworks, Palekas barely engages the rich corpus of research on bio-sonification, microbial sensing, and feedback sound systems. The lack of these sources increases the effect of isolation: the thesis feels self-contained rather than in conversation with a field.

This gap between theory and documentation is where the quality of the artifact is questioned. The installations and performances he describes conceptually are incompletely and poorly documented. It is not clear if the works were created for this thesis or collated from previous projects. Without recordings, schematics, or step-by-step documentation available, one cannot evaluate any artistic or technical outcomes. Put differently, Microbiophonic Emergences is a strong artistic and philosophical statement, but it is only a partially successful academic thesis. Its conceptual strength comes from the ethical rethinking of listening, the poetic vision of sound as life, and the attempt to dissolve the hierarchy between observer and observed.

The work unfortunately lacks methodological rigor, detailed evidence, and sufficient contextual grounding. While Palekas seeks to establish a dialogue between humans and microbes, the outcome remains speculative and unverified. The claimed invention of bi-directional sonification is not genuinely new; moreover, the thesis overlooks the numerous past projects that have already elaborated a similar feedback relationship between sound and living systems. Overall, the work succeeds as a reflective, imaginative exploration of sound and ecology but falls short as a systematically researched academic document. While it evokes curiosity and wonder, it requires far stronger methodological and contextual grounding to meet the standards of a master’s thesis.

Product III: Image Extender

Intelligent Sound Fallback Systems – Enhancing Audio Generation with AI-Powered Semantic Recovery

After refining Image Extender’s sound layering and spectral processing engine, this week’s development shifted focus to one of the system’s most practical yet creatively crucial challenges: ensuring that the generation process never fails silently. In previous iterations, when a detected visual object had no directly corresponding sound file in the Freesound database, the result was often an incomplete or muted soundscape. The goal of this phase was to build an intelligent fallback architecture—one capable of preserving meaning and continuity even in the absence of perfect data.

Closing the Gap Between Visual Recognition and Audio Availability

During testing, it became clear that visual recognition is often more detailed and specific than what current sound libraries can support. Object detection models might identify entities like “Golden Retriever,” “Ceramic Cup,” or “Lighthouse,” but audio datasets tend to contain more general or differently labeled entries. This mismatch created a semantic gap between what the system understands and what it can express acoustically.

The newly introduced fallback framework bridges this gap, allowing Image Extender to adapt gracefully. Instead of stopping when a sound is missing, the system now follows a set of intelligent recovery paths that preserve the intent and tone of the visual analysis while maintaining creative consistency. The result is a more resilient, contextually aware sonic generation process—one that doesn’t just survive missing data, but thrives within it.

Dual Strategy: Structured Hierarchies and AI-Powered Adaptation

Two complementary fallback strategies were introduced this week: one grounded in structured logic, and another driven by semantic intelligence.

The CSV-based fallback system builds on the ontology work from the previous phase. Using the tag_hierarchy.csv file, each sound tag is part of a parent–child chain, creating predictable fallback paths. For example, if “tiger” fails, the system ascends to “jungle,” and then “nature.” This rule-based approach guarantees reliability and zero additional computational cost, making it ideal for large-scale batch operations or offline workflows.
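The fallback walk described above can be sketched in a few lines. The in-memory parent map below stands in for tag_hierarchy.csv (which would be loaded with csv.DictReader); the column and key names used here are assumptions for illustration.

```python
# Stand-in for tag_hierarchy.csv: each child tag maps to its parent.
PARENT = {"tiger": "jungle", "jungle": "nature"}

def fallback_chain(tag, parents=PARENT):
    """Return the tag and its ancestors, e.g. tiger -> jungle -> nature."""
    chain = [tag]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

def first_available(tag, has_sound, parents=PARENT):
    """Return the first tag in the chain for which a sound exists, else None."""
    for candidate in fallback_chain(tag, parents):
        if has_sound(candidate):
            return candidate
    return None

# If only "nature" exists in the sound library, "tiger" resolves to it:
print(first_available("tiger", has_sound=lambda t: t == "nature"))  # nature
```

Because the chain is a pure dictionary walk, the lookup stays deterministic and costs nothing per query, which is exactly what makes it suitable for batch and offline use.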

In contrast, the AI-powered semantic fallback uses GPT-based reasoning to dynamically generate alternative tags. When the CSV offers no viable route, the model proposes conceptually similar or thematically related categories. A specific bird species might lead to the broader concept of “bird sounds,” or an abstract object like “smartphone” could redirect to “digital notification” or “button click.” This layer of intelligence brings flexibility to unfamiliar or novel recognition results, extending the system’s creative reach beyond its predefined hierarchies.

User-Controlled Adaptation

Recognizing that different projects require different balances between cost, control, and creativity, the fallback mode is now user-configurable. Through a simple dropdown menu, users can switch between CSV Mode and AI Mode.

  • CSV Mode favors consistency, predictability, and cost-efficiency—perfect for common, well-defined categories.
  • AI Mode prioritizes adaptability and creative expansion, ideal for complex visual inputs or unique scenes.

This configurability not only empowers users but also represents a deeper design philosophy: that AI systems should be tools for choice, not fixed solutions.

Toward Adaptive and Resilient Multimodal Systems

This week’s progress marks a pivotal evolution from static, database-bound sound generation to a hybrid model that merges structured logic with adaptive intelligence. The dual fallback system doesn’t just fill gaps; it embodies the philosophy of resilient multimodal AI, where structure and adaptability coexist in balance.

The CSV hierarchy ensures reliability, grounding the system in defined categories, while the AI layer provides flexibility and creativity, ensuring the output remains expressive even when the data isn’t. Together, they form a powerful, future-proof foundation for Image Extender’s ongoing mission: transforming visual perception into sound not as a mechanical translation, but as a living, interpretive process.

Product II: Image Extender

Dual-Model Vision Interface – OpenAI × Gemini Integration for Adaptive Image Understanding

Following the foundational phase of last week, where the OpenAI API Image Analyzer established a structured evaluation framework for multimodal image analysis, the project has now reached a significant new milestone. The second release integrates both OpenAI’s GPT-4.1-based vision models and Google’s Gemini (MediaPipe) inference pipeline into a unified, adaptive system inside the Image Extender environment.

Unified Recognition Interface

In the current version, the recognition logic has been completely refactored to support runtime model switching.
A dropdown-based control in Google Colab enables instant selection between:

  • Gemini (MediaPipe) – for efficient, on-device object detection and panning estimation
  • OpenAI (GPT-4.1 / GPT-4.1-mini) – for high-level semantic and compositional interpretation

Non-relevant parameters such as score threshold or delegate type dynamically hide when OpenAI mode is active, keeping the interface clean and focused. Switching back to Gemini restores all MediaPipe-related controls.
This creates a smooth dual-inference workflow where both engines can operate independently yet share the same image context and visualization logic.

Architecture Overview

The system is divided into two self-contained modules:

  1. Image Upload Block – handles external image input and maintains a global IMAGE_FILE reference for both inference paths.
  2. Recognition Block – manages model selection, executes inference, parses structured outputs, and handles visualization.

This modular split keeps the code reusable, reduces side effects between branches, and simplifies later expansion toward GUI-based or cloud-integrated applications.

OpenAI Integration

The OpenAI branch extends directly from last week’s work but now operates within the full environment.
It converts uploaded images into Base64 and sends a multimodal request to gpt-4.1 or gpt-4.1-mini.
The model returns a structured Python dictionary, typically using the following schema:

{
    "objects": […],
    "scene_and_location": […],
    "mood_and_composition": […],
    "panning": […]
}

A multi-stage parser (AST → JSON → fallback) ensures robustness even when GPT responses contain formatting artifacts.
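The multi-stage strategy can be sketched as below. The stage order (AST → JSON → fallback) follows the description above; the fence-stripping regex and the exact error handling are illustrative assumptions, not the project's actual parser.

```python
import ast
import json
import re

def parse_model_reply(text):
    """Parse a model reply into a dict via AST -> JSON -> regex fallback.

    Strips Markdown code fences first, then tries ast.literal_eval (which
    handles Python-style dicts with single quotes), then json.loads; if
    both fail, a regex pulls out the outermost {...} block and retries.
    Returns {} when nothing parses.
    """
    cleaned = re.sub(r"```[a-z]*", "", text).strip()
    for loader in (ast.literal_eval, json.loads):
        try:
            result = loader(cleaned)
            if isinstance(result, dict):
                return result
        except (ValueError, SyntaxError):
            pass
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except ValueError:
            pass
    return {}

print(parse_model_reply('```json\n{"objects": ["dog"]}\n```'))  # {'objects': ['dog']}
```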

Prompt Refinement

During development, testing revealed that the English prompt version initially returned empty dictionaries.
Investigation showed that overly strict phrasing (“exclusively as a Python dictionary”) caused the model to suppress uncertain outputs.
By softening this instruction to allow “reasonable guesses” and explicitly forbidding empty fields, the API responses became consistent and semantically rich.

Debugging the Visualization

A subtle logic bug was discovered in the visualization layer:
the post-processing code still referenced the German dictionary keys ("objekte", "szenerie_und_ort", "stimmung_und_komposition") from last week.
Since the new English prompt returned English keys ("objects", "scene_and_location", etc.), these lookups produced empty lists, which in turn broke the overlay rendering loop.
After harmonizing the key references to support both language variants, the visualization resumed normal operation.

Cross-Model Visualization and Validation

A unified visualization layer now overlays results from either model directly onto the source image.
In OpenAI mode, the "panning" values from GPT’s response are projected as vertical lines with object labels.
This provides immediate visual confirmation that the model’s spatial reasoning aligns with the actual object layout, which is an important diagnostic step for evaluating AI-based perception accuracy.

Outcome and Next Steps

The project now represents a dual-model visual intelligence system, capable of using symbolic AI interpretation (OpenAI) and local pixel-based detection (Gemini).

Next steps

The upcoming development cycle will focus on connecting the OpenAI API layer directly with the Image Extender’s audio search and fallback system.

Product I: Image Extender

OpenAI API Image Analyzer – Structured Vision Testing and Model Insights

Adaptive Visual Understanding Framework
In this development phase, the focus was placed on building a robust evaluation framework for OpenAI’s multimodal models (GPT-4.1 and GPT-4.1-mini). The primary goal: systematically testing image interpretation, object detection, and contextual scene recognition while maintaining controlled cost efficiency and analytical depth.

Figure: upload of image (image source: https://www.trumau.at/)
  1. Combined Request Architecture
    Unlike traditional multi-call pipelines, the new setup consolidates image and text interpretation into a single API request. This streamlined design prevents token overhead and ensures synchronized contextual understanding between categories. Each inference returns a structured Python dictionary containing three distinct analytical branches:
    • Objects – Recognizable entities such as animals, items, or people
    • Scene and Location Estimation – Environment, lighting, and potential geographic cues
    • Mood and Composition – Aesthetic interpretation, visual tone, and framing principles

For each uploaded image, the analyzer prints three distinct lists per model side by side. This offers a straightforward way to assess interpretive differences without complex metrics. In practice, GPT-4.1 tends to deliver slightly more nuanced emotional and compositional insights, while GPT-4.1-mini prioritizes concise, high-confidence object recognition.

Figure: results of the image object analysis and model comparison

Through the unified format, post-processing can directly populate separate lists or database tables for subsequent benchmarking, minimizing parsing latency and data inconsistencies.

  2. Robust Output Parsing
    Because model responses occasionally include Markdown code blocks (e.g., a ```python fence around the dictionary), the parsing logic was redesigned with a multi-layered interpreter using regex sanitation and dual parsing strategies (AST > JSON > fallback). This guarantees that even irregularly formatted outputs are safely converted into structured datasets without manual intervention. The system thus sustains analytical integrity under diverse prompt conditions.
  3. Model Benchmarking: GPT-4.1-mini vs. GPT-4.1
    The benchmark test compared inference precision, descriptive richness, and token efficiency between the two models. While GPT-4.1 demonstrates deeper contextual inference and subtler mood detection, GPT-4.1-mini achieves near-equivalent recognition accuracy at approximately one-tenth of the cost per request. For large-scale experiments (e.g., datasets exceeding 10,000 images), GPT-4.1-mini provides the optimal balance between granularity and economic viability.
  4. Token Management and Budget Simulation
    A real-time token tracker revealed an average consumption of ~1,780 tokens per image request. Given GPT-4.1-mini’s rate of $0.003 / 1k tokens, a one-dollar operational budget supports roughly 187 full image analyses. This insight forms the baseline for scalable experimentation and budget-controlled automation workflows in cloud-based vision analytics.
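The budget arithmetic above can be reproduced in a few lines (the helper name is illustrative; the figures are the ones reported by the token tracker):

```python
def analyses_per_budget(budget_usd: float, tokens_per_request: float,
                        usd_per_1k_tokens: float) -> int:
    """Number of full image analyses that fit into a fixed budget."""
    cost_per_request = tokens_per_request / 1000 * usd_per_1k_tokens
    return int(budget_usd // cost_per_request)

# ~1,780 tokens per request at $0.003 per 1k tokens, one-dollar budget:
print(analyses_per_budget(1.00, 1780, 0.003))  # → 187
```

Scaling the budget argument gives an immediate cost projection for larger runs, which is the basis of the budget-controlled automation mentioned above.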

The next development phase will integrate this OpenAI-driven visual analysis directly into the Image Extender environment. This integration marks the transition from isolated model testing toward a unified generative framework.

Between Image and Sound – A Critical Evaluation of the Master Thesis “Automatic Sonification of Video Sequences” by Andrea Corcuera Marruffo

Overview

Author: Andrea Corcuera Marruffo
Title: Automatic Sonification of Video Sequences through Object Detection and Physical Modelling
University: Aalborg University Copenhagen
Degree programme: MSc Sound and Music Computing
Year: 2017

Andrea Corcuera Marruffo’s thesis investigates the automatic generation of Foley sounds from video sequences. The aim is to sonify audiovisual content algorithmically: visual information such as material properties or object collisions is analyzed with convolutional neural networks (using the YOLO model), and physically modelled sounds are then synthesized from the results. The thesis thus positions itself at the intersection of sound synthesis, software development, and perception, a field of growing relevance in media production as well as in artistic research, and one that also overlaps with the core concept of my own master thesis described above.

The practical component (“Werkstück”) consists of a functional prototype that analyzes videos, classifies objects, and translates their interactions into synthesized sounds. This tool is complemented by an evaluation in which audiovisual stimuli are tested for plausibility and perceived quality.

Evaluation

assessed systematically against the evaluation criteria of the CMS degree programme

(1) Gestaltungshöhe – Design Quality

The thesis demonstrates very good technical depth and a clear methodological structure. Its organization is logical, and the visualizations (e.g., flow diagrams, spectrograms) are easy to follow and support understanding of the process.

(2) Innovationsgrad – Degree of Innovation

At the time of publication (2017), the automatic generation of Foley sound by means of physical modelling had only been explored sporadically. The combination of object detection and physical modelling therefore constitutes an innovative contribution to the field of computational sound design.

(3) Selbstständigkeit – Independence

The thesis shows substantial independent work. The author creates her own dataset, modifies training data, and implements the YOLO model in an adapted form. The synthesis parameters are also derived experimentally. Independence is therefore evident both conceptually and technically.

(4) Gliederung und Struktur – Structure and Organization

The structure follows a classical scientific format: theory, implementation, evaluation, conclusion. The chapters are clearly focused but at times heavily technical, which may limit readability for readers outside the field. A more visual presentation of the evaluation methodology might have improved this.

(5) Kommunikationsgrad – Communication

Overall, the thesis is comprehensible and precisely worded. Technical terms are introduced carefully, and figures are labelled and logically integrated. The writing style is factual, though at times oriented too strongly toward technical documentation. Narrative reflections on design decisions or aesthetic considerations are largely absent, which is understandable given that the degree programme is not primarily design-oriented.

(6) Umfang der Arbeit – Scope

With more than 30 pages of main text plus an appendix, the scope is appropriate, and the balance between theory, implementation, and evaluation is well struck. However, the empirical study with 15 participants remains relatively small, which limits its statistical power.

(7) Orthographie, Sorgfalt und Genauigkeit – Orthography, Care, and Accuracy

The thesis is formally correct throughout and methodically documented with care. Minor linguistic slips (“he first talkie film”) hardly diminish the overall impression. Quotations and source references are consistent.

(8) Literatur – Literature

The bibliography shows a solid theoretical foundation, drawing on standard sources for sound synthesis, modal modelling, and neural networks (Smith, Farnell, Van den Doel). However, more recent media and perception research (e.g., sonic interaction design, embodied sound studies) would have been a welcome addition to the research literature.

Final Assessment

Overall, the thesis convinces through its innovative approach, its methodological precision, and the successful implementation of a complex system. The evaluation critically identifies the model’s limitations (object-detection accuracy and synchronization problems), which the author reflects on and places in context convincingly.

Strengths: clear structure, high technical level, original research approach, independent implementation.
Weaknesses: limited aesthetic reflection, small evaluation sample, limited variety of materials.

Critical Review: “Sound response to physicality – Artistic expressions of movement sonification” by Aleksandra Joanna Słyż (Royal College of Music, 2022)

by Verena Schneider, CMS24 Sound Design Master 

The master thesis “Sound Response to Physicality: Artistic Expressions of Movement Sonification” was written by Aleksandra Joanna Słyż in 2022 at the Royal College of Music in Stockholm (Kungliga Musikhögskolan; Stockholm, Sweden).

Introduction

I chose Aleksandra Słyż’s master thesis because her topic immediately resonated with my own research interests. In my master project I am working with the x-IMU3 motion sensor to track surf movements and transform them into sound for a surf documentary.
During my research process, the question of how to sonify movement data became central, and Słyż’s work gave me valuable insights into which parameters can be used and how the translation from sensor to sound can be conceptually designed.

Her thesis, Sound response to physicality, focuses on the artistic and perceptual dimensions of movement sonification. Through her work Hypercycle, she explores how body motion can control and generate sound in real time, using IMU sensors and multichannel sound design. I found many of her references—such as John McCarthy and Peter Wright’s Technology as Experience—highly relevant for my own thesis.

Gestaltungshöhe – Artistic Quality and Level of Presentation

Słyż’s thesis presents a high level of artistic and conceptual quality. The final piece, Hypercycle, is a technically complex and interdisciplinary installation that connects sound, body, and space. The artistic idea of turning the body into a musical instrument is powerful, and she reflects deeply on the relation between motion, perception, and emotion.

Visually, the documentation of her work is clear and professional, though I personally wished for a more detailed sonic description. The sound material she used is mainly synthesized tones—technically functional, but artistically minimal. As a sound designer, I would have enjoyed a stronger exploration of timbre and spatial movement as expressive parameters.

Innovationsgrad – Innovation and Contribution to the Field

Using motion sensors for artistic sonification is not entirely new, yet her combination of IMU data, embodied interaction, and multichannel audio gives the project a strong contemporary relevance. What I found innovative was how she conceptualized direct and indirect interaction—how spectators experience interactivity even when they don’t control the sound themselves.

However, from a technical point of view, the work could have been more transparent. I was missing a detailed explanation of how exactly she mapped sensor data to sound parameters. This part felt underdeveloped, and I see potential for future work to document such artistic systems more precisely.

Selbstständigkeit – Independence and Original Contribution

Her thesis clearly shows independence and artistic maturity. She worked across disciplines—combining psychology, music technology, and perception studies—and reflected on her process critically. I especially appreciated that she didn’t limit herself to the technical side but also integrated a psychological and experiential perspective.

As someone also working with sensor-based sound, I can see how much self-direction and experimentation this project required. The depth of reflection makes the work feel authentic and personal.

Gliederung und Struktur – Structure and Coherence

The structure of the thesis is logical and easy to follow. Each chapter begins with a quote that opens the topic in a poetic way, which I found very effective. She starts by explaining the theoretical background, then moves toward the technical discussion of IMU sensors, and finally connects everything to her artistic practice.

Her explanations are written in clear English, and she carefully defines all important terms such as sonification, proprioception, and biofeedback. Even readers with only basic sound design knowledge can follow her reasoning.

Kommunikationsgrad – Communication and Expression

The communication of her ideas is well-balanced between academic precision and personal reflection. I like that she uses a human-centered language, often describing how the performer or spectator might feel within the interactive system.

Still, the technical documentation of the sonification process could be more concrete. She briefly shows a Max/MSP patch, but I would have loved to understand more precisely how the data flow—from IMU to sound—was built. For future readers and practitioners, such details would be extremely valuable.

Umfang – Scope and Depth

The length of the thesis (around 50 pages) feels appropriate for the topic. She covers a wide range of areas: from sensor technology and perception theory to exhibition practice and performance philosophy.
At the same time, I had the impression that she decided to keep the technical parts lighter, focusing more on conceptual reflection. For me, this makes the thesis stronger as an artistic reflection, but weaker as a sound design manual.

Orthography, Accuracy, and Formal Care

The thesis is very carefully written and proofread. References are consistent, and the terminology is accurate. She integrates both scientific and artistic citations, which gives the text a professional academic tone.
The layout is clear, and the visual elements (diagrams, performance photos) are well placed.

Literature – Quality and Relevance

The literature selection is one of the strongest aspects of this work. She cites both technical and philosophical sources—from G. Kramer’s Sonification Report to McCarthy & Wright’s Technology as Experience and Tanaka & Donnarumma’s The Body as Musical Instrument.
For me personally, her bibliography became a guide for my own research. I found new readings that I will also include in my master thesis.

Final Assessment – Strengths, Weaknesses, and Personal Reflection

Overall, Sound response to physicality is a well-balanced, thoughtful, and inspiring thesis that connects technology, perception, and art.
Her biggest strength lies in how she translates complex sensor-based interactions into human experience and emotional resonance. The way she conceptualizes embodied interaction and indirect interactivity is meaningful and poetic.

The main weakness, in my opinion, is the lack of detailed technical documentation—especially regarding how the IMU data was mapped to sound and multichannel output. As someone building my own sonification system with the x-IMU3 and contact microphones, I would have loved to see the exact data chain from sensor to audio.

Despite that, her work inspired me profoundly. It reminded me that the psychological and experiential dimensions of sound are just as important as the data itself. In my own project, where I sonify the movement of a surfboard and the feeling of the ocean, I will carry this understanding forward: that sonification is not only about data translation but about shaping human experience through sound.

Post 1: Listening to the Ocean

– The Emotional Vision Behind Surfboard Sonification

Surfing is more than just a sport. For many surfers, it is a ritual, a form of meditation, and an experience of deep emotional release. There is a unique silence that exists out on the water. It is not the absence of sound but the presence of something else: a sense of connection, stillness, and immersion. This is where the idea for “Surfboard Sonification” was born. It began not with technology, but with a feeling. A moment on the water when the world quiets, and the only thing left is motion and sensation.

The project started with a simple question: how can one translate the feeling of surfing into sound? What if we could make that feeling audible? What if we could tell the story of a wave, not through pictures or words, but through vibrations, resonance, and sonic movement?

My inspiration came from both my personal experiences as a surfer and from sound art and acoustic ecology. I was particularly drawn to the work of marine biologist Wallace J. Nichols and his theory of the “Blue Mind.” According to Nichols, being in or near water has a scientifically measurable impact on our mental state. It relaxes us, improves focus, and connects us to something larger than ourselves. It made me wonder: can we create soundscapes that replicate or amplify that feeling?

In addition to Nichols’ research, I studied the sound design approaches of artists like Chris Watson and Jana Winderen, who work with natural sound recordings to create immersive environments. I also looked at data-driven artists such as Ryoji Ikeda, who transform abstract numerical inputs into rich, minimalist sonic works.

The goal of Surfboard Sonification was to merge these worlds. I wanted to use real sensor data and field recordings to tell a story. I did not want to rely on synthesizers or artificial sound effects. I wanted to use the board itself as an instrument. Every crackle, vibration, and movement would be captured and turned into music—not just any music, but one that feels like surfing.

The emotional journey of a surf session is dynamic. You begin on the beach, often overstimulated by the environment. There is tension, anticipation, the chaos of wind, people, and crashing waves. Then, as you paddle out, things change. The noise recedes. You become attuned to your body and the water. You wait, breathe, and listen. When the wave comes and you stand up, everything disappears. It’s just you and the ocean. And then it’s over, and a sense of calm returns.

This narrative arc became the structure of the sonic composition I set out to create. Beginning in noise and ending in stillness. Moving from overstimulation to focus. From red mind to blue mind.

To achieve this, I knew I needed to design a system that could collect as much authentic data as possible. This meant embedding sensors into a real surfboard without affecting its function. It meant using microphones that could capture the real vibrations of the board. It meant synchronizing video, sound, and movement into one coherent timeline.

This was not just an artistic experiment. It was also a technical challenge, an engineering project, and a sound design exploration. Each part of the system had to be carefully selected and tested. The hardware had to survive saltwater, sun, and impact. The software had to process large amounts of motion data and translate it into sound in real time or through post-processing.

And at the heart of all this was one simple but powerful principle, spoken to me once by a surf teacher in Sri Lanka:

“You are only a good surfer if you catch a wave with your eyes closed.”

That phrase stayed with me. It encapsulates the essence of surfing. Surfing is not about seeing; it’s about sensing. Feeling. Listening. This project was my way of honoring that philosophy—by creating a system that lets us catch a wave with our ears.

This blog series will walk through every step of that journey. From emotional concept to hardware integration, from dry-land simulation to ocean deployment. You will learn how motion data becomes music. How a surfboard becomes a speaker. And how the ocean becomes an orchestra.

In the next post, I will dive into the technical setup: the sensors, microphones, recorders, and housing that make it all possible. I will describe the engineering process behind building a waterproof, surfable, sound-recording device—and what it took to embed that into a real surfboard without compromising performance.

But for now, I invite you to close your eyes. Imagine paddling out past the break. The sound of your breath, the splash of water, the silence between waves. This is the world of Surfboard Sonification. And this is just the beginning.

References

Nichols, W. J. (2014). Blue Mind. Little, Brown Spark.

Watson, C. (n.d.). Field recording artist.

Winderen, J. (n.d.). Jana Winderen: Artist profile. https://www.janawinderen.com

Ikeda, R. (n.d.). Official site. https://www.ryojiikeda.com

Truax, B. (2001). Acoustic Communication. Ablex Publishing.

Puckette, M. S. (2007). The Theory and Technique of Electronic Music. World Scientific Publishing Company.