SoloAudio is a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio is trained on synthetic audio generated by state-of-the-art text-to-audio models and demonstrates strong generalization to out-of-domain data and unseen sound events.
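To make the idea of a skip-connected backbone concrete, the following is a minimal sketch of how long skip connections can link shallow and deep blocks in a stack. It is an illustration under stated assumptions, not the SoloAudio implementation: the description above does not say how the skips are wired, so the mirrored pairing (block i with block n - 1 - i), the averaging merge, and the names `make_toy_block` and `skip_connected_stack` are all hypothetical. The toy blocks stand in for transformer layers so that the example runs with the standard library alone.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for a transformer block: a residual elementwise update."""
    def block(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return block


def skip_connected_stack(x: Vector, blocks: List[Block]) -> Vector:
    """Run blocks with long skips pairing block i with block (n - 1 - i).

    Assumptions: an even number of blocks and mirrored pairing. The skip is
    merged by averaging here; a real model would more likely concatenate the
    two feature tensors and apply a learned linear projection.
    """
    if len(blocks) % 2 != 0:
        raise ValueError("expected an even number of blocks")
    half = len(blocks) // 2
    saved: List[Vector] = []
    for block in blocks[:half]:
        x = block(x)
        saved.append(x)  # remember shallow features for the mirrored deep block
    for block in blocks[half:]:
        skip = saved.pop()  # most recently saved features feed the first deep block
        x = [(a + b) / 2.0 for a, b in zip(x, skip)]
        x = block(x)
    return x
```

In this sketch the shallow features are pushed onto a stack and popped in reverse order, so the innermost pair of blocks is connected first and the outermost pair last, which is the same nesting a U-Net uses without its downsampling and upsampling stages.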
We test on real recordings from the AudioSet evaluation set, using 41 categories from the FSD Kaggle 2018 dataset. Each category has two test samples.
Text | Original Recording | SoloAudio (ours) | DPM-TSE | WaveFormer | AudioSep |
---|---|---|---|---|---|
Acoustic guitar | |||||
Acoustic guitar | |||||
Applause | |||||
Applause | |||||
Bark | |||||
Bark | |||||
Bass drum | |||||
Bass drum | |||||
Burping, eructation | |||||
Burping, eructation | |||||
Bus | |||||
Bus | |||||
Cello | |||||
Cello | |||||
Chime | |||||
Chime | |||||
Clarinet | |||||
Clarinet | |||||
Computer keyboard | |||||
Computer keyboard | |||||
Cough | |||||
Cough | |||||
Cowbell | |||||
Cowbell | |||||
Double bass | |||||
Double bass | |||||
Drawer open, close | |||||
Drawer open, close | |||||
Electric piano | |||||
Electric piano | |||||
Fart | |||||
Fart | |||||
Finger snapping | |||||
Finger snapping | |||||
Fireworks | |||||
Fireworks | |||||
Flute | |||||
Flute | |||||
Glockenspiel | |||||
Glockenspiel | |||||
Gong | |||||
Gong | |||||
Gunshot, gunfire | |||||
Gunshot, gunfire | |||||
Harmonica | |||||
Harmonica | |||||
Hi-hat | |||||
Hi-hat | |||||
Keys jangling | |||||
Keys jangling | |||||
Knock | |||||
Knock | |||||
Laughter | |||||
Laughter | |||||
Meow | |||||
Meow | |||||
Microwave oven | |||||
Microwave oven | |||||
Oboe | |||||
Oboe | |||||
Saxophone | |||||
Saxophone | |||||
Scissors | |||||
Scissors | |||||
Shatter | |||||
Shatter | |||||
Snare drum | |||||
Snare drum | |||||
Squeak | |||||
Squeak | |||||
Tambourine | |||||
Tambourine | |||||
Tearing | |||||
Tearing | |||||
Telephone | |||||
Telephone | |||||
Trumpet | |||||
Trumpet | |||||
Violin, fiddle | |||||
Violin, fiddle | |||||
Writing | |||||
Writing | |||||
@article{helin2024soloaudio,
author = {Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
title = {SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer},
journal = {arXiv},
year = {2024},
}