image recognition

by David Adlberger - 7. April 2025 (updated 16. April 2025)

Prototyping IV: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

Tests on automated audio file search via the freesound.org API:

For the automated audio file search of the recognized objects, I tested the freesound.org API and programmed a first interface for testing purposes. The first step was to request an API key from freesound.org. I then noticed a point worth considering for my project: the key is limited to 5000 requests per year, so I will look into ways of raising that limit. For testing, 5000 is more than enough.

The current code already searches with a few test tags and can filter the results by sample rate, duration, licence, and file type. More filters may follow, such as rating and bit depth, and possibly random file selection so that a tag does not always return the same file.
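A minimal sketch of how such a filtered search request could be put together in Python. It is not the code from my test interface: the function names are hypothetical, and the endpoint, the `filter` field names (`samplerate`, `duration`, `license`, `type`) and the `token` parameter are taken from my reading of the Freesound APIv2 documentation, so they should be checked against the current docs before use.

```python
from urllib.parse import urlencode

# Text search endpoint of the Freesound APIv2 (assumption, see lead-in).
SEARCH_URL = "https://freesound.org/apiv2/search/text/"


def build_search_params(tag, api_key, samplerate=None, max_duration=None,
                        license_name=None, file_type=None):
    """Return the query parameters for one text search (hypothetical helper)."""
    filters = []
    if samplerate is not None:
        filters.append(f"samplerate:{samplerate}")
    if max_duration is not None:
        # Range filter: sounds between 0 and max_duration seconds.
        filters.append(f"duration:[0 TO {max_duration}]")
    if license_name is not None:
        # Quoted because licence names contain spaces.
        filters.append(f'license:"{license_name}"')
    if file_type is not None:
        filters.append(f"type:{file_type}")

    params = {
        "query": tag,
        "fields": "id,name,license,duration,previews",
        "token": api_key,
    }
    if filters:
        params["filter"] = " ".join(filters)
    return params


def build_search_url(tag, api_key, **filters):
    """Return the full request URL for one tag (hypothetical helper)."""
    return SEARCH_URL + "?" + urlencode(build_search_params(tag, api_key, **filters))
```

Calling `build_search_url("dog", "YOUR_API_KEY", samplerate=44100, file_type="wav")` gives a URL that can be opened with any HTTP client. Keeping the parameter building separate from the request makes it easy to check the filters without spending any of the request quota.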

The next steps also include either downloading the file or playing it automatically. After that, I will test using the tags from the AI image recognition code for this automated search. Later in the process I have to figure out the playback of multiple files, volume staging, and filtering or EQing methods to deal with masking effects, etc.
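Two of these open points, random file selection and volume staging, could be sketched roughly as follows. This is only an assumption about how it might work: the function names are hypothetical, and using the detection confidence as a weight for the volume is my own placeholder idea, not a tested method.

```python
import random


def pick_sound(results, rng=random):
    """Pick one search result at random so a tag does not always map to the same file."""
    if not results:
        return None
    return rng.choice(results)


def stage_gains(confidences):
    """Turn detection confidences into linear gains that sum to 1.0.

    Because the gains sum to 1.0, mixing full-scale files with these
    gains cannot exceed full scale.
    """
    total = sum(confidences)
    if total <= 0:
        return [0.0 for _ in confidences]
    return [c / total for c in confidences]
```

Passing a seeded `random.Random` instance as `rng` makes the selection repeatable, which helps when comparing soundscapes for the same image.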

Test GUI for automated sound searching via the freesound.org API

by David Adlberger - 17. March 2025

Prototyping II: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

Expanded research on sonification of images / video material and different approaches:

Yeo and Berger (2005) write in “A Framework for Designing Image Sonification Methods” about the challenge of mapping static, time-independent data like images into the time-dependent auditory domain. They introduce two main concepts: scanning and probing. Scanning follows a fixed, pre-determined order of sonification, whereas probing allows for arbitrary, user-controlled exploration. The paper also discusses the importance of pointers and paths in defining how data is mapped to sound. Several sonification techniques are analyzed, including inverse spectrogram mapping and the method of raster scanning (which was already explained in the Prototyping I blog entry), with examples illustrating their effectiveness. The authors suggest that combining scanning and probing offers a more comprehensive approach to image sonification, allowing for both global context and local feature exploration. Future work includes extending the framework to model human image perception for more intuitive sonification methods.
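The difference between scanning and probing can be illustrated with a small Python sketch. It is a toy version under my own assumptions (a grayscale image as a nested list, a linear brightness-to-frequency mapping for probing), not the framework from the paper.

```python
def raster_scan(image):
    """Scanning: visit every pixel in a fixed left-to-right, top-to-bottom
    order and map brightness 0..255 to an audio sample in -1.0..1.0."""
    return [pixel / 127.5 - 1.0 for row in image for pixel in row]


def probe(image, x, y, low_hz=110.0, high_hz=880.0):
    """Probing: the listener chooses a pointer position (x, y) and the
    brightness at that position is mapped linearly to a frequency."""
    brightness = image[y][x] / 255.0
    return low_hz + brightness * (high_hz - low_hz)


# A 2x2 grayscale test image: rows of brightness values.
image = [[0, 255],
         [255, 0]]
samples = raster_scan(image)      # one sample per pixel, fixed order
frequency = probe(image, 1, 0)    # frequency at the top-right pixel
```

In the scanning case the path through the image is fixed in advance, while in the probing case the path is whatever sequence of (x, y) positions the user produces.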

Time on “perpendicular” axis. (Yeo, Berger, 2005)

Sharma et al. (2017) explore action recognition in still images using Natural Language Processing (NLP) techniques in “Action Recognition in Still Images Using Word Embeddings from Natural Language Descriptions.” Rather than training visual action detectors, they propose detecting prominent objects in an image and inferring actions based on object relationships. The Object-Verb-Object (OVO) triplet model predicts verbs using object co-occurrence, while word2vec captures semantic relationships between objects and actions. Experimental results show that this approach reliably detects actions without computationally intensive visual action detectors. The authors highlight the potential of this method in resource-constrained environments, such as mobile devices, and suggest future work incorporating spatial relationships and global scene context.
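The co-occurrence idea behind the OVO triplets can be shown with a toy Python sketch. It only counts which verb most often links two objects in a few made-up triplets; the word2vec part of the paper is left out, and all names and data here are hypothetical.

```python
from collections import Counter, defaultdict


def train_ovo(triplets):
    """Count how often each verb links an (object, object) pair."""
    counts = defaultdict(Counter)
    for first, verb, second in triplets:
        counts[(first, second)][verb] += 1
    return counts


def predict_verb(counts, first, second):
    """Return the most frequent verb for the object pair, or None if unseen."""
    pair_counts = counts.get((first, second))
    if not pair_counts:
        return None
    return pair_counts.most_common(1)[0][0]


# Made-up triplets, as they might be extracted from image descriptions.
toy_triplets = [
    ("person", "rides", "horse"),
    ("person", "rides", "horse"),
    ("person", "feeds", "horse"),
    ("person", "throws", "ball"),
]
model = train_ovo(toy_triplets)
```

For the Image Extender, a predicted verb such as "rides" could become an additional search tag next to the object names.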

Iovino et al. (1997) discuss developments in Modalys, a physical modeling synthesizer based on modal synthesis, in “Recent Work Around Modalys and Modal Synthesis.” Modalys allows users to create virtual instruments by defining physical structures (objects), their interactions (connections), and control parameters (controllers). The authors explore the musical possibilities of Modalys, emphasizing its flexibility and the challenges of controlling complex synthesis parameters. They propose applications such as virtual instrument construction, simulation of instrumental gestures, and convergence of signal and physical modeling synthesis. The paper also introduces single-point objects, which allow for spectral control of sound, bridging the gap between signal synthesis and physical modeling. Real-time control and expressivity are emphasized, with future work focused on integrating Modalys with real-time platforms.
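The basic principle of modal synthesis, a sound built as a sum of exponentially decaying sinusoids (one per vibration mode), can be sketched in a few lines of Python. This is a generic illustration, not the Modalys API, and the mode values in the example are arbitrary.

```python
import math


def modal_tone(modes, duration=1.0, sample_rate=44100):
    """Render a sum of decaying sine modes.

    modes: list of (frequency_hz, amplitude, decay_per_second) tuples.
    Returns a list of float samples.
    """
    n_samples = int(duration * sample_rate)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        samples.append(sum(
            amp * math.exp(-decay * t) * math.sin(2 * math.pi * freq * t)
            for freq, amp, decay in modes
        ))
    return samples


# Two arbitrary modes: a fundamental and a faster-decaying overtone.
tone = modal_tone([(220.0, 0.5, 3.0), (440.0, 0.3, 5.0)],
                  duration=0.1, sample_rate=8000)
```

Changing the frequencies, amplitudes, and decay rates of the modes changes the perceived material and size of the virtual object.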

McGee et al. (2012) describe Voice of Sisyphus, a multimedia installation that sonifies a black-and-white image using raster scanning and frequency domain filtering in “Voice of Sisyphus: An Image Sonification Multimedia Installation.” Unlike traditional spectrograph-based sonification methods, this project focuses on probing different image regions to create a dynamic audio-visual composition. Custom software enables real-time manipulation of image regions, polyphonic sound generation, and spatialization. The installation cycles through eight phrases, each with distinct visual and auditory characteristics, creating a continuous, evolving experience. The authors discuss balancing visual and auditory aesthetics, noting that visually coherent images often produce noisy sounds, while abstract images yield clearer tones. The project draws inspiration from early experiments in image sonification and aims to create a synchronized audio-visual experience engaging viewers on multiple levels.

Software Interface for Voice of Sisyphus (McGee et al., 2012)

Roodaki et al. (2017) introduce SonifEye, a system that uses physical modeling sound synthesis to convey visual information in high-precision tasks, in “SonifEye: Sonification of Visual Information Using Physical Modeling Sound Synthesis.” They propose three sonification mechanisms: touch, pressure, and angle of approach, each mapped to sounds generated by physical models (e.g., tapping on a wooden plate or plucking a string). The system aims to reduce cognitive load and avoid alarm fatigue by using intuitive, natural sounds. Two experiments compare the effectiveness of visual, auditory, and combined feedback in high-precision tasks. Results show that auditory feedback alone can improve task performance, particularly in scenarios where visual feedback may be distracting. The authors suggest applications in medical procedures and other fields requiring precise manual tasks.

Dubus and Bresin (2013) review mapping strategies for the sonification of physical quantities in “A Systematic Review of Mapping Strategies for the Sonification of Physical Quantities.” Their study analyzes 179 publications to identify trends and best practices in sonification. The authors find that pitch is the most commonly used auditory dimension, while spatial auditory mapping is primarily applied to kinematic data. They also highlight the lack of standardized evaluation methods for sonification efficiency. The paper proposes a mapping-based framework for characterizing sonification and suggests future work in refining mapping strategies to enhance usability.
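Since pitch is the most common mapping target, here is a minimal Python sketch of a value-to-pitch mapping. The exponential mapping and the frequency range are my own assumptions for illustration; the review itself does not prescribe a formula.

```python
def value_to_frequency(value, v_min, v_max, low_hz=220.0, high_hz=880.0):
    """Map a physical quantity onto a frequency range.

    The mapping is exponential, so equal steps in the value give equal
    musical intervals. Values outside the range are clamped.
    """
    if v_max == v_min:
        raise ValueError("v_min and v_max must differ")
    position = (value - v_min) / (v_max - v_min)
    position = min(1.0, max(0.0, position))
    return low_hz * (high_hz / low_hz) ** position
```

With the default range of two octaves, the middle of the value range lands exactly one octave above the lowest frequency.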

References

Yeo, Woon Seung, and Jonathan Berger. 2005. “A Framework for Designing Image Sonification Methods.” In Proceedings of ICAD 05-Eleventh Meeting of the International Conference on Auditory Display, Limerick, Ireland, July 6-9, 2005.

Sharma, Karan, Arun CS Kumar, and Suchendra M. Bhandarkar. 2017. “Action Recognition in Still Images Using Word Embeddings from Natural Language Descriptions.” In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 978-1-5090-4941-7/17. DOI: 10.1109/WACVW.2017.17.

Iovino, Francisco, Rene Causse, and Richard Dudas. 1997. “Recent Work Around Modalys and Modal Synthesis.” In Proceedings of the International Computer Music Conference (ICMC).

McGee, Ryan, Joshua Dickinson, and George Legrady. 2012. “Voice of Sisyphus: An Image Sonification Multimedia Installation.” In Proceedings of the 18th International Conference on Auditory Display (ICAD-2012), Atlanta, USA, June 18–22, 2012.

Roodaki, Hessam, Navid Navab, Abouzar Eslami, Christopher Stapleton, and Nassir Navab. 2017. “SonifEye: Sonification of Visual Information Using Physical Modeling Sound Synthesis.” IEEE Transactions on Visualization and Computer Graphics 23, no. 11: 2366–2371. DOI: 10.1109/TVCG.2017.2734320.

Dubus, Gaël, and Roberto Bresin. 2013. “A Systematic Review of Mapping Strategies for the Sonification of Physical Quantities.” PLoS ONE 8(12): e82491. DOI: 10.1371/journal.pone.0082491.

by David Adlberger - 5. March 2025

Explore II: Image Extender – Image sonification tool for immersive perception of sounds from images and new creation possibilities

The Image Extender project bridges accessibility and creativity, offering an innovative way to perceive visual data through sound. With its dual-purpose approach, the tool has the potential to redefine auditory experiences for diverse audiences, pushing the boundaries of technology and human perception.

The project is designed as a dual-purpose tool for immersive perception and creative sound design. By leveraging AI-based image recognition and sonification algorithms, the tool will transform visual data into auditory experiences. This innovative approach is intended for:

1. Visually Impaired Individuals
2. Artists and Designers

The tool will focus on translating colors, textures, shapes, and spatial arrangements into structured soundscapes, ensuring clarity and creativity for diverse users.

Core Functionality: Translating image data into sound using sonification frameworks and AI algorithms.
Target Audiences: Visually impaired users and creative professionals.
Platforms: Initially desktop applications with planned mobile deployment for on-the-go accessibility.
User Experience: A customizable interface to balance complexity, accessibility, and creativity.

Working Hypotheses and Requirements

Hypotheses:
1. Cross-modal sonification enhances understanding and creativity in visual-to-auditory transformations.
2. Intuitive soundscapes improve accessibility for visually impaired users compared to traditional methods.
Requirements:
- Develop an intuitive sonification framework adaptable to various images.
- Integrate customizable settings to prevent sensory overload.
- Ensure compatibility across platforms (desktop and mobile).

Subtasks

1. Project Planning & Structure

Define Scope and Goals: Clarify key deliverables and objectives for both visually impaired users and artists/designers.
Research Methods: Identify research approaches (e.g., user interviews, surveys, literature review).
Project Timeline and Milestones: Establish a phased timeline including prototyping, testing, and final implementation.
Identify Dependencies: List libraries, frameworks, and tools needed (Python, Pure Data, Max/MSP, OSC, etc.).

2. Research & Data Collection

Sonification Techniques: Research existing sonification methods and metaphors for cross-modal (sight-to-sound) mapping, as well as other approaches that could be blended into the overall sonification strategy.
Image Recognition Algorithms: Investigate AI image recognition models and the libraries used to run them (e.g., OpenCV, TensorFlow, PyTorch).
Psychoacoustics & Perceptual Mapping: Review how different sound frequencies, intensities, and spatialization affect perception.
Existing Tools & References: Study tools like Melobytes, VOSIS, and BeMyEyes to understand features, limitations, and user feedback.

Object detection with a Python YOLO library
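How the output of such an object detector could be turned into search tags and a stereo position is sketched below. The detection format (a dict with `label`, `confidence`, and a `box` given as x_min, y_min, x_max, y_max in pixels) is a hypothetical structure for illustration and not the exact output of any YOLO library.

```python
def detections_to_layers(detections, image_width, min_confidence=0.5):
    """Keep the most confident detection per label and derive a pan position.

    Returns a list of dicts with 'tag', 'confidence', and 'pan',
    where pan runs from -1.0 (left edge) to +1.0 (right edge).
    """
    layers = {}
    for det in detections:
        if det["confidence"] < min_confidence:
            continue  # ignore uncertain detections
        x_min, _, x_max, _ = det["box"]
        center_x = (x_min + x_max) / 2.0
        pan = 2.0 * center_x / image_width - 1.0
        best = layers.get(det["label"])
        if best is None or det["confidence"] > best["confidence"]:
            layers[det["label"]] = {
                "tag": det["label"],
                "confidence": det["confidence"],
                "pan": pan,
            }
    return list(layers.values())
```

Each resulting layer carries a tag for the sound search and a pan value that could later place the sound in the stereo field according to the object's position in the image.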

3. Concept Development & Prototyping

Develop Sonification Mapping Framework: Define rules for mapping visual elements (color, shape, texture) to sound parameters (pitch, timbre, rhythm).
Simple Prototype: Create a basic prototype that integrates:
- AI content recognition (Python + image processing libraries).
- Sound generation (Pure Data or Max/MSP).
- Communication via OSC (e.g., using Wekinator).
Create or collect Sample Soundscapes: Generate initial soundscapes for different types of images (e.g., landscapes, portraits, abstract visuals).
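One possible mapping rule from a visual element to sound parameters is sketched below in Python. The concrete choices (hue to pitch over two octaves, saturation to timbre brightness, value to amplitude) are placeholder assumptions to make the idea tangible; the actual mapping framework still has to be defined.

```python
import colorsys


def color_to_sound(r, g, b):
    """Map an RGB color (0..255 per channel) to simple sound parameters."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return {
        "midi_note": 48 + round(h * 24),  # hue spread over two octaves from C3
        "brightness": s,                  # saturation as a timbre control, 0..1
        "amplitude": v,                   # value (lightness) as loudness, 0..1
    }
```

For example, `color_to_sound(255, 0, 0)` gives the lowest note at full brightness and amplitude, while black always results in silence.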

Example of Pure Data with the rem library (image to sound in Pure Data by Artiom Constantinov)

4. User Experience Design

UI/UX Design for Desktop:
- Design intuitive interface for uploading images and adjusting sonification parameters.
- Mock up controls for adjusting sound complexity, intensity, and spatialization.
Accessibility Features:
- Ensure screen reader compatibility.
- Develop customizable presets for different levels of user experience (basic vs. advanced).
Mobile Optimization Plan:
- Plan for responsive design and functionality for smartphones.

5. Testing & Feedback Collection

Create Testing Scenarios:
- Develop a set of diverse images (varying in content, color, and complexity).
Usability Testing with Visually Impaired Users:
- Gather feedback on the clarity, intuitiveness, and sensory experience of the sonifications.
- Identify areas of overstimulation or confusion.
Feedback from Artists/Designers:
- Assess the creative flexibility and utility of the tool for sound design.
Iterate Based on Feedback:
- Refine sonification mappings and interface based on user input.

6. Implementation of Standalone Application

Develop Core Application:
- Integrate image recognition with sonification engine.
- Implement adjustable parameters for sound generation.
Error Handling & Performance Optimization:
- Ensure efficient processing for high-resolution images.
- Handle edge cases for unexpected or low-quality inputs.
Cross-Platform Compatibility:
- Ensure compatibility with Windows, macOS, and plan for future mobile deployment.

7. Finalization & Deployment

Finalize Feature Set:
- Balance between accessibility and creative flexibility.
- Ensure the sonification language is both consistent and adaptable.
Documentation & Tutorials:
- Create user guides for visually impaired users and artists.
- Provide tutorials for customizing sonification settings.
Deployment:
- Package as a standalone desktop application.
- Plan for mobile release (potentially a future phase).

Technological Basis Subtasks:

Programming: Develop core image recognition and processing modules in Python.
Sonification Engine: Create audio synthesis patches in Pure Data/Max/MSP.
Integration: Implement OSC communication between Python and the sound engine.
UI Development: Design and code the user interface for accessibility and usability.
Testing Automation: Create scripts for automating image-sonification tests.

Possible academic foundations for further research and work:

Chatterjee, Oindrila, and Shantanu Chakrabartty. “Using Growth Transform Dynamical Systems for Spatio-Temporal Data Sonification.” arXiv preprint, 2021.

Chion, Michel. Audio-Vision. New York: Columbia University Press, 1994.

Görne, Tobias. Sound Design. Munich: Hanser, 2017.

Hermann, Thomas, Andy Hunt, and John G. Neuhoff, eds. The Sonification Handbook. Berlin: Logos Publishing House, 2011.

Schick, Adolf. Schallwirkung aus psychologischer Sicht. Stuttgart: Klett-Cotta, 1979.

Sigal, Erich. “Akustik: Schall und seine Eigenschaften.” Accessed January 21, 2025. mu-sig.de.

Spence, Charles. “Crossmodal Correspondences: A Tutorial Review.” Attention, Perception, & Psychophysics, 2011.

Ziemer, Tim. Psychoacoustic Music Sound Field Synthesis. Cham: Springer International Publishing, 2020.

Ziemer, Tim, Nuttawut Nuchprayoon, and Holger Schultheis. “Psychoacoustic Sonification as User Interface for Human-Machine Interaction.” International Journal of Informatics Society, 2020.

Ziemer, Tim, and Holger Schultheis. “Three Orthogonal Dimensions for Psychoacoustic Sonification.” Acta Acustica United with Acustica, 2020.