Diffusion Models for Audio-Visual Speech Enhancement

119
Not scheduled
20m
Von-Melle-Park 4

Poster

Description

This poster showcases a selection of our work on diffusion models for speech enhancement. While diffusion models have proven successful in natural image generation, we adapt them to speech enhancement by introducing a task-adapted diffusion process in the complex short-time Fourier transform domain. Our results show performance competitive with strong predictive methods, along with better generalization when evaluated under a training–test mismatch. However, for very challenging inputs, the model tends to produce speech-like sounds without semantic meaning. To address this problem, we condition the diffusion model on visual input showing the speaker’s lips, which improves speech quality and intelligibility. This improvement is also reflected in a reduced word error rate of a downstream automatic speech recognition model.
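The task-adapted diffusion process can be pictured as a stochastic interpolation between the clean and noisy complex spectrograms: the forward process drifts the clean STFT toward the noisy one while injecting Gaussian noise, and a learned score network reverses it. A minimal NumPy sketch, assuming an Ornstein–Uhlenbeck-style forward SDE simulated with Euler–Maruyama; the parameter values (`gamma`, `sigma_min`, `sigma_max`) and the exact noise schedule are illustrative, not taken from the poster:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x, y, t, dt, gamma=1.5, sigma_min=0.05, sigma_max=0.5):
    """One Euler-Maruyama step of an illustrative forward SDE: drift the
    current state x toward the noisy STFT y while adding complex Gaussian
    noise whose scale g(t) grows exponentially over the process."""
    g = sigma_min * (sigma_max / sigma_min) ** t  # exponential noise schedule
    drift = gamma * (y - x) * dt
    noise = g * np.sqrt(dt) * (rng.standard_normal(x.shape)
                               + 1j * rng.standard_normal(x.shape))
    return x + drift + noise

# Toy complex "STFT" frames standing in for clean and noisy speech.
x0 = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))      # clean
y = x0 + 0.3 * (rng.standard_normal((4, 8))
                + 1j * rng.standard_normal((4, 8)))                       # noisy

x = x0.copy()
n_steps = 100
for k in range(n_steps):
    x = forward_step(x, y, k / n_steps, 1.0 / n_steps)
# x now lies near the noisy spectrogram plus Gaussian noise; enhancement
# runs the learned reverse process from such a state back toward x0.
```

Working in the complex STFT domain means both magnitude and phase are modeled, which is one motivation for the generative formulation over mask-based predictive methods.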

Keywords

Diffusion models
Speech Enhancement
Audio-Visual
Generative Models
