FunAudioLLM shipped PrismAudio on March 24, 2026 — an open-source video-to-audio (V2A) generation model accepted to the ICLR 2026 Main Conference. It achieves state-of-the-art results across all four perceptual dimensions on both the VGGSound benchmark and the newly released AudioCanvas evaluation suite.

Four Reasoning Modules, One Model

Previous V2A systems relied on a single reasoning chain. PrismAudio splits that into four specialized Chain-of-Thought (CoT) modules for Semantic, Temporal, Aesthetic, and Spatial reasoning, each with its own reward function. Those per-dimension rewards drive multi-dimensional reinforcement learning (RL) optimization via Fast-GRPO, a hybrid ODE-SDE sampling method that cuts RL training overhead without hurting generation quality.
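
To make the multi-reward idea concrete, here is a minimal sketch of how per-dimension rewards could feed a GRPO-style group-relative advantage. The reward values, equal weights, group size, and function names below are illustrative assumptions, not PrismAudio's actual implementation, and the Fast-GRPO hybrid ODE-SDE sampling itself is not shown.

```python
# Sketch: combining four per-dimension rewards into one GRPO-style
# group-relative advantage. All numbers and names are illustrative.
import torch

def combined_reward(rewards: dict[str, torch.Tensor],
                    weights: dict[str, float]) -> torch.Tensor:
    """Weighted sum of per-dimension rewards for each sample in the group."""
    return sum(weights[k] * rewards[k] for k in weights)

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each sample's reward by the
    group's mean and standard deviation (the core idea behind GRPO)."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: a group of 4 candidate audio clips sampled for one video.
rewards = {
    "semantic":  torch.tensor([0.71, 0.58, 0.64, 0.80]),
    "temporal":  torch.tensor([0.62, 0.70, 0.55, 0.66]),
    "aesthetic": torch.tensor([0.50, 0.47, 0.61, 0.58]),
    "spatial":   torch.tensor([0.44, 0.52, 0.49, 0.55]),
}
weights = {"semantic": 0.25, "temporal": 0.25, "aesthetic": 0.25, "spatial": 0.25}

total = combined_reward(rewards, weights)  # one scalar reward per clip
adv = grpo_advantages(total)               # which clips beat the group average
print(total, adv)
```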

The payoff: at 518M parameters, PrismAudio runs inference in 0.63 seconds — faster than MMAudio (1.30s) and ThinkSound (1.07s) — while outscoring both on benchmark metrics.

AudioCanvas Benchmark

Alongside the model, the team released AudioCanvas, a new V2A benchmark covering 300 single-event sound classes and 501 multi-event samples. It's designed to test out-of-domain generalization; on it, PrismAudio scores CLAP 0.52 and MOS-Q 4.12, leading the field.
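
For readers unfamiliar with the CLAP metric: it is typically computed as the cosine similarity between an audio embedding and a text embedding from a pretrained CLAP model, averaged over the benchmark. The sketch below illustrates that computation only; `audio_emb` and `text_emb` stand in for real CLAP encoder outputs and are not part of the PrismAudio release.

```python
# Illustrative CLAP-style score: cosine similarity between L2-normalized
# audio and text embeddings. Embeddings here are hypothetical placeholders
# for the outputs of a real CLAP checkpoint's encoders.
import numpy as np

def clap_style_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between normalized audio and text embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(a @ t)

# Averaging this score over a benchmark's clips gives the reported CLAP number.
```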

Get It Now

Model weights are live on Hugging Face and ModelScope. Code is in the prismaudio branch of the ThinkSound GitHub repo. An interactive demo is available on Hugging Face Spaces and ModelScope Studios.
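
If you want to pull the weights programmatically, the standard Hugging Face Hub client works; a minimal sketch is below. The repo id shown is a placeholder guess, so check the PrismAudio model card for the actual identifier.

```python
# Minimal download sketch using the Hugging Face Hub client.
# NOTE: "FunAudioLLM/PrismAudio" is a placeholder repo id, not confirmed
# by the release notes; replace it with the id on the model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="FunAudioLLM/PrismAudio")
print("Weights downloaded to:", local_dir)
```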