FoleySpace first estimates the 2D coordinates and depth of the sound source in each video frame, then applies a coordinate mapping mechanism to convert the 2D source positions into a 3D trajectory. This 3D trajectory, together with the monaural audio generated by a pre-trained V2A model, conditions a diffusion model that generates spatially consistent binaural audio.
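To make the coordinate mapping step concrete, the following is a minimal sketch that lifts per-frame pixel positions and depths to 3D points. It assumes a standard pinhole camera back-projection with a known horizontal field of view; the text does not specify the actual mapping used by FoleySpace, so the function name `pixels_to_trajectory`, the `fov_deg` parameter, and the axis conventions are all hypothetical choices for illustration.

```python
import math


def pixels_to_trajectory(pixels, depths, width, height, fov_deg=90.0):
    """Lift per-frame (u, v) pixel positions and depths to 3D points.

    Assumption: a pinhole camera with the principal point at the image
    centre. This is an illustrative stand-in, not the paper's mapping.
    """
    # Focal length in pixels, derived from the assumed horizontal field of view.
    focal = (width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    cx, cy = width / 2.0, height / 2.0

    trajectory = []
    for (u, v), d in zip(pixels, depths):
        x = (u - cx) / focal * d  # positive x points right
        y = (cy - v) / focal * d  # image v grows downward, so flip the sign
        trajectory.append((x, y, d))  # z is the estimated depth
    return trajectory


# Two frames of a source moving from the image centre to the right edge.
track = pixels_to_trajectory([(320, 240), (640, 240)], [2.0, 2.0], 640, 480)
```

Under these assumptions, a source at the image centre maps to a point straight ahead of the listener, and a source at the right edge of a 90-degree view at the same depth maps to a point displaced to the right by that depth. The resulting list of per-frame points is the kind of 3D trajectory that could serve as a conditioning signal alongside the monaural audio.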
[Figure panels: SEE-2-SOUND, AudioX, ThinkSound, FoleySpace]