FoleySpace

Vision-Aligned Binaural Spatial Audio Generation

TL;DR

FoleySpace first estimates the 2D coordinates and depth of the sound source in each video frame, then uses a coordinate mapping mechanism to convert these 2D positions into a 3D trajectory. This 3D trajectory, together with the monaural audio generated by a pre-trained video-to-audio (V2A) model, conditions a diffusion model that generates spatially consistent binaural audio.
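To make the 2D-to-3D step concrete, here is a minimal sketch of one way per-frame pixel coordinates and depth could be lifted into a 3D trajectory. It assumes a simple pinhole camera with a known horizontal field of view. The page does not specify FoleySpace's actual coordinate mapping, so the function names (`pixel_to_3d`, `trajectory`, `azimuth_deg`), the 90 degree default field of view, and the axis convention are all illustrative assumptions.

```python
import math

def pixel_to_3d(u, v, depth, width, height, fov_deg=90.0):
    """Back-project pixel (u, v) with a depth value to camera-frame (x, y, z).

    Assumption: pinhole camera, principal point at the image center,
    horizontal field of view `fov_deg`. Not the paper's confirmed mapping.
    """
    # Focal length in pixels, derived from the assumed horizontal FOV.
    f = (width / 2) / math.tan(math.radians(fov_deg) / 2)
    x = (u - width / 2) * depth / f    # positive x is to the listener's right
    y = (height / 2 - v) * depth / f   # positive y is up (image v grows downward)
    z = depth                          # positive z is straight ahead
    return (x, y, z)

def trajectory(frames, width, height, fov_deg=90.0):
    """Map a list of per-frame (u, v, depth) estimates to a list of 3D points."""
    return [pixel_to_3d(u, v, d, width, height, fov_deg) for (u, v, d) in frames]

def azimuth_deg(point):
    """Horizontal angle of a 3D point: 0 is straight ahead, positive is right."""
    x, _, z = point
    return math.degrees(math.atan2(x, z))

if __name__ == "__main__":
    # Hypothetical estimates: a source moving from image center to the right edge.
    frames = [(320, 240, 2.0), (480, 240, 2.0), (640, 240, 2.0)]
    for p in trajectory(frames, width=640, height=480):
        print(p, round(azimuth_deg(p), 1))
```

In a pipeline like the one described, a trajectory of this kind would be passed alongside the monaural audio as conditioning for the binaural generator. The azimuth helper only illustrates how a 3D position relates to the perceived left-right direction.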



Demo presentation

Tip: Please wear headphones 🎧 to enjoy the best audiovisual experience.




Methods compared: SEE-2-SOUND, AudioX, ThinkSound, FoleySpace