FoleySpace

FoleySpace first estimates the 2D coordinates and depth of the sound source in each video frame, then applies a coordinate mapping to convert these per-frame positions into a 3D trajectory. This trajectory, together with the monaural audio generated by a pre-trained video-to-audio (V2A) model, conditions a diffusion model that generates spatially consistent binaural audio.
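
To make the coordinate-mapping step concrete, here is a minimal Python sketch of one way to turn per-frame 2D positions plus depth into a 3D trajectory. This is not the FoleySpace implementation. It assumes a simple pinhole camera with a known horizontal field of view, and the names `pixel_to_3d`, `build_trajectory`, `azimuth_deg`, and `hfov_deg` are hypothetical and chosen for illustration only.

```python
import math


def pixel_to_3d(u, v, depth, width, height, hfov_deg=90.0):
    """Back-project pixel (u, v) at a given depth to camera-centred (x, y, z).

    Assumption: pinhole camera, square pixels, optical axis through the
    image centre. The mapping used by FoleySpace may differ.
    """
    # Focal length in pixels, derived from the assumed horizontal field of view.
    focal = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    x = (u - width / 2.0) * depth / focal   # positive to the right of the axis
    y = (height / 2.0 - v) * depth / focal  # positive upward (image v grows downward)
    z = depth                               # distance along the viewing direction
    return (x, y, z)


def build_trajectory(detections, width, height, hfov_deg=90.0):
    """Map a list of per-frame (u, v, depth) tuples to a list of 3D points."""
    return [pixel_to_3d(u, v, d, width, height, hfov_deg) for (u, v, d) in detections]


def azimuth_deg(point):
    """Horizontal angle of a 3D point relative to straight ahead, in degrees."""
    x, _, z = point
    return math.degrees(math.atan2(x, z))


# Hypothetical detections: a source moving from image centre to the right edge
# of a 640x480 frame while staying 2 units away from the camera.
frames = [(320, 240, 2.0), (480, 240, 2.0), (640, 240, 2.0)]
trajectory = build_trajectory(frames, width=640, height=480)
```

A trajectory in this form (one 3D point per frame) is the kind of signal that could be passed, alongside the mono waveform, as conditioning to a binaural generator. The azimuth helper shows how a left-right cue can be read off each point.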

Vision-Aligned Binaural Spatial Audio Generation

TL;DR

Demo Presentation