RadioSES: mmWave-Based Audioradio Speech Enhancement and Separation System

Muhammed Zahid Ozturk, Chenshu Wu, Beibei Wang, Min Wu, K.J. Ray Liu

In Submission to IEEE/ACM Transactions on Audio, Speech and Language Processing



Abstract


Speech enhancement and separation have been long-standing problems, despite recent advances using a single microphone. Although microphones perform well in constrained settings, their speech separation performance degrades in noisy conditions. In this work, we propose RadioSES, an audioradio speech enhancement and separation system that overcomes inherent problems in audio-only systems. By fusing a complementary radio modality, RadioSES can estimate the number of speakers, solve the source association problem, separate and enhance noisy speech mixtures, and improve both intelligibility and perceptual quality. We perform millimeter-wave sensing to detect and localize speakers, and introduce an audioradio deep learning framework to fuse the separate radio features with the mixed audio features. Extensive experiments using commercial off-the-shelf devices show that RadioSES outperforms a variety of state-of-the-art baselines, with consistent performance gains in different environmental settings. Compared with audiovisual methods, RadioSES provides similar improvements (e.g., ~3 dB gains in SiSDR), along with the benefits of lower computational complexity and fewer privacy concerns.
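For readers unfamiliar with the SiSDR metric quoted above, the following is a minimal sketch of the commonly used scale-invariant signal-to-distortion ratio. This page does not define the metric, so the definition used here is an assumption, and the function name `si_sdr` and its `eps` parameter are hypothetical; this is not the authors' evaluation code.

```python
import math

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Assumed (common) definition: project the zero-mean estimate onto the
    zero-mean reference, then compare the energy of that projection with
    the energy of the residual.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)

    target = [scale * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Under this definition, rescaling the estimate leaves the score unchanged, and a gain of about 3 dB corresponds to roughly halving the residual energy relative to the target.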



Audio Samples


Below are sample audio files for 3-person, 2-person, and single-person separation and enhancement, together with their spectrograms.

3-person Separation

#1 Clean Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
Speaker 2 Mic Speaker 2 RadioSES
#2 Clean Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
Speaker 2 Mic Speaker 2 RadioSES
#3 Clean Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
Speaker 2 Mic Speaker 2 RadioSES
#4 Noisy Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
Speaker 2 Mic Speaker 2 RadioSES
#5 Noisy Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
Speaker 2 Mic Speaker 2 RadioSES
#6 Noisy Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
Speaker 2 Mic Speaker 2 RadioSES

2-person Separation

#1 Clean Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
#2 Clean Mixture with Silence
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
#3 Clean Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
#4 Clean Mixture with Same Speaker
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
#5 Noisy Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
#6 Noisy Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
#7 Noisy Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES
#8 Noisy Mixture
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Speaker 1 Mic Speaker 1 RadioSES

Single-person Enhancement

Example #1
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Example #2
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Example #3
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Example #4
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Example #5
Mixture
Speaker 0 Mic Speaker 0 RadioSES
Example #6
Mixture
Speaker 0 Mic Speaker 0 RadioSES