Geometry-aware multi-task learning for binaural audio generation from video
Human audio perception is inherently spatial, and videos with binaural audio recreate this spatial experience by delivering a different sound to each ear. However, videos are typically recorded with mono audio and therefore do not offer the rich listening experience that binaural audio provides. We propose an audio spatialization method that uses the visual information in a video to convert its mono audio to binaural. We leverage the spatial and geometric cues about the audio present in the video's visuals to guide the learning process, learning these geometry-aware visual features in a multi-task manner to generate rich binaural audio. We also generate a large video dataset with binaural audio in photorealistic environments to better understand and evaluate the task. Through extensive evaluation on two datasets, we demonstrate that our method learns more spatially coherent visual features and generates better binaural audio.