IRCAM Forum Workshops 2025 – ACIDS

From 26 to 28 March 2025, we (the Sound Design master's program, second semester) had the incredible opportunity to visit IRCAM (Institut de Recherche et Coordination Acoustique/Musique) in Paris as part of a student excursion. For anyone passionate about sound, music technology, and AI, visiting IRCAM feels like stepping straight into current research: discussions, demonstrations, and prototypes in action. One of my personal highlights was learning about the ACIDS team (Artificial Creative Intelligence and Data Science) and two of their research projects: RAVE (Real-time Audio Variational autoEncoder) and AFTER (Audio Features Transfer and Exploration in Real-time).

ACIDS – Team

The ACIDS team is a multidisciplinary group of researchers working at the intersection of machine learning, sound synthesis, and real-time audio processing. Their name stands for Artificial Creative Intelligence and Data Science, reflecting their broad focus on computational audio research. During our visit, they gave us an inside look at their latest developments, including demonstrations from the IRCAM Forum Workshops (March 26–28, 2025), where they showcased some of their most exciting advancements. Besides their engaging, catchy (and also a bit funny) presentation, I want to showcase two of their projects.

RAVE (Real-Time Neural Audio Synthesis)

One of the most impressive projects we explored was RAVE (Real-time Audio Variational autoEncoder), a deep learning model for high-quality audio synthesis and transformation. Unlike traditional digital signal processing, RAVE uses a latent space representation of sound, allowing for intuitive and expressive real-time manipulation.

[Figure: Overall architecture of the RAVE approach. Blocks in blue are the only ones optimized, while blocks in grey are fixed or frozen operations.]

Key Innovations

  1. Two-Stage Training:
    • Stage 1: Learns compact latent representations using a spectral loss.
    • Stage 2: Fine-tunes the decoder with adversarial training for ultra-realistic audio.
  2. Blazing Speed:
    • Runs 20× faster than real-time on a laptop CPU, thanks to a multi-band decomposition technique.
  3. Precision Control:
    • Post-training latent space analysis balances reconstruction quality vs. compactness.
    • Enables timbre transfer and signal compression (2048:1 ratio).
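
To make the latent-space workflow more concrete, here is a minimal sketch of how an exported RAVE model can be used for encode/decode-style timbre manipulation. It assumes a pretrained TorchScript export with encode and decode methods; the file names and the latent tweak are purely illustrative, so treat this as a rough outline rather than the official workflow.

import torch
import torchaudio

# Assumption: a pretrained RAVE model exported as TorchScript (e.g. "vintage_synth.ts").
model = torch.jit.load("vintage_synth.ts")

# Load a sound, downmix to mono, and add a batch dimension -> (batch, channels, samples).
audio, sr = torchaudio.load("input.wav")
audio = audio.mean(0, keepdim=True).unsqueeze(0)

with torch.no_grad():
    z = model.encode(audio)   # compact latent representation (stage 1)
    z[:, 0] += 1.0            # illustrative tweak of one latent dimension to alter the timbre
    out = model.decode(z)     # decoder fine-tuned adversarially (stage 2)

torchaudio.save("output.wav", out[0], sr)

Because encoding and decoding run faster than real time on a laptop CPU, the same latent representation can also serve as a form of signal compression, which is the 2048:1 ratio mentioned above.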

Performance

  • Outperforms NSynth and SING in perceived audio quality (mean opinion scores of 3.01 vs. 2.68 and 1.15, respectively) while using fewer parameters (17.6M).
  • Handles polyphonic music and speech, unlike many models that are restricted to a single sound domain.

You can explore RAVE’s code and research on their GitHub repository and learn more about its applications on the IRCAM website.

AFTER

While many AI audio tools focus on raw sound generation, what sets AFTER apart is its sophisticated control mechanisms, a priority highlighted in recent research from the ACIDS team. As their paper states:

“Deep generative models now synthesize high-quality audio signals, shifting the critical challenge from audio quality to control capabilities. While text-to-music generation is popular, explicit control and example-based style transfer better capture the intents of artists.”

How AFTER Achieves Precision

The team’s breakthrough lies in separating local and global audio information:

  • Global (timbre/style): Captured from a reference sound (e.g., a vintage synth’s character).
  • Local (structure): Controlled via MIDI, text prompts, or another audio’s rhythm/melody.

This is enabled by a diffusion autoencoder that builds two disentangled representation spaces, enforced through:

  1. Adversarial training to prevent overlap between timbre and structure (see the sketch below the figure).
  2. A two-stage training strategy for stability.
[Figure: Detailed overview of the AFTER method. Input signal(s) are passed to structure and timbre encoders, which provide semantic encodings that are further disentangled through confusion maximization. These encodings condition a latent diffusion model that generates the output signal. Input signals are identical during training but distinct at inference.]
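
To illustrate the adversarial disentanglement mentioned in point 1 above, here is a small conceptual sketch in PyTorch. This is not the authors' implementation: the encoders, embedding sizes, and the confusion predictor are simplified stand-ins that only show how the structure embedding can be penalized for carrying timbre information.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoders producing a "structure" and a "timbre" embedding (illustrative sizes).
class Encoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, stride=4, padding=4), nn.GELU(),
            nn.Conv1d(64, dim, kernel_size=9, stride=4, padding=4),
        )

    def forward(self, x):            # x: (batch, 1, samples)
        return self.net(x).mean(-1)  # time-pooled embedding: (batch, dim)

structure_enc, timbre_enc = Encoder(), Encoder()
confusion = nn.Linear(128, 128)      # adversary: tries to predict the timbre code from the structure code

enc_opt = torch.optim.Adam(list(structure_enc.parameters()) + list(timbre_enc.parameters()), lr=1e-4)
adv_opt = torch.optim.Adam(confusion.parameters(), lr=1e-4)

def training_step(x):
    # During training, both encoders see the same excerpt (as in the figure caption).
    s, t = structure_enc(x), timbre_enc(x)

    # 1) Adversary update: learn to recover the timbre code from the structure code.
    leak = F.mse_loss(confusion(s.detach()), t.detach())
    adv_opt.zero_grad()
    leak.backward()
    adv_opt.step()

    # 2) Encoder update: maximize the adversary's error ("confusion maximization") so that
    #    the structure code ends up carrying no timbre information. In the full model this
    #    term would be combined with the latent diffusion / reconstruction objective.
    enc_loss = -F.mse_loss(confusion(s), t.detach())
    enc_opt.zero_grad()
    enc_loss.backward()
    enc_opt.step()

At inference the two inputs differ: one piece is fed to the structure encoder and another to the timbre encoder, and the diffusion model renders the combination. That is exactly what makes the timbre-transfer and cover-version examples below possible.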

Why Musicians Care

In tests, AFTER outperformed existing models in:

  • One-shot timbre transfer (e.g., making a piano piece sound like a harp).
  • MIDI-to-audio generation with precise stylistic control.
  • Full “cover version” generation—transforming a classical piece into jazz while preserving its melody.

Check out AFTER’s progress on GitHub and stay updated via IRCAM’s research page.

References

Caillon, Antoine, and Philippe Esling. “RAVE: A Variational Autoencoder for Fast and High-Quality Neural Audio Synthesis.” arXiv preprint arXiv:2111.05011 (2021). https://arxiv.org/abs/2111.05011.

Demerlé, Nils, Philippe Esling, Guillaume Doras, and David Genova. “Combining Audio Control and Style Transfer Using Latent Diffusion.” In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2024.
