SoloAudio is a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio is trained on synthetic audio generated by state-of-the-art text-to-audio models, and it generalizes well to out-of-domain data and unseen sound events.
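To make the "skip-connected transformer" idea concrete, here is a minimal, framework-free sketch of the wiring only: early blocks save their outputs, and later blocks receive them in reverse order, as in a U-Net. Everything in it is an illustrative assumption rather than the SoloAudio implementation: the function name `skip_connected_stack`, the toy blocks `add` and `scale`, and the `average` merge are hypothetical stand-ins. In a real model the blocks would be transformer layers, and the merge is commonly a concatenation followed by a learned linear projection.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]
Merge = Callable[[Vector, Vector], Vector]


def skip_connected_stack(x: Vector, in_blocks: List[Block], mid_block: Block,
                         out_blocks: List[Block], merge: Merge) -> Vector:
    """Run blocks with long skips: in_blocks[i] feeds out_blocks[-(i + 1)]."""
    if len(in_blocks) != len(out_blocks):
        raise ValueError("in_blocks and out_blocks must have the same length")
    skips: List[Vector] = []
    for block in in_blocks:
        x = block(x)
        skips.append(x)  # remember this feature for the mirrored block
    x = mid_block(x)
    for block in out_blocks:
        # The most recently saved feature is consumed first (last in, first out).
        x = block(merge(x, skips.pop()))
    return x


# Toy stand-ins (hypothetical): real blocks would be transformer layers.
def add(c: float) -> Block:
    return lambda v: [a + c for a in v]


def scale(c: float) -> Block:
    return lambda v: [a * c for a in v]


def average(a: Vector, b: Vector) -> Vector:
    # Stand-in for "concatenate, then project back to the model width".
    return [(p + q) / 2 for p, q in zip(a, b)]


if __name__ == "__main__":
    out = skip_connected_stack(
        [0.0, 2.0],
        in_blocks=[add(1.0), add(2.0)],
        mid_block=scale(2.0),
        out_blocks=[add(0.0), add(0.0)],
        merge=average,
    )
    print(out)
```

The long skips give the later blocks direct access to early, less-processed features, which is the property the U-Net backbone provided before it was replaced.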
We test on real recordings from the AudioSet evaluation set, using the 41 categories of the FSD Kaggle 2018 dataset as target sounds. Each category has 2 test samples.
| Text | Original Recording | SoloAudio (ours) | DPM-TSE | WaveFormer | AudioSep | 
|---|---|---|---|---|---|
| Acoustic guitar | |||||
| Acoustic guitar | |||||
| Applause | |||||
| Applause | |||||
| Bark | |||||
| Bark | |||||
| Bass drum | |||||
| Bass drum | |||||
| Burping, eructation | |||||
| Burping, eructation | |||||
| Bus | |||||
| Bus | |||||
| Cello | |||||
| Cello | |||||
| Chime | |||||
| Chime | |||||
| Clarinet | |||||
| Clarinet | |||||
| Computer keyboard | |||||
| Computer keyboard | |||||
| Cough | |||||
| Cough | |||||
| Cowbell | |||||
| Cowbell | |||||
| Double bass | |||||
| Double bass | |||||
| Drawer open, close | |||||
| Drawer open, close | |||||
| Electric piano | |||||
| Electric piano | |||||
| Fart | |||||
| Fart | |||||
| Finger snapping | |||||
| Finger snapping | |||||
| Fireworks | |||||
| Fireworks | |||||
| Flute | |||||
| Flute | |||||
| Glockenspiel | |||||
| Glockenspiel | |||||
| Gong | |||||
| Gong | |||||
| Gunshot, gunfire | |||||
| Gunshot, gunfire | |||||
| Harmonica | |||||
| Harmonica | |||||
| Hi-hat | |||||
| Hi-hat | |||||
| Keys jangling | |||||
| Keys jangling | |||||
| Knock | |||||
| Knock | |||||
| Laughter | |||||
| Laughter | |||||
| Meow | |||||
| Meow | |||||
| Microwave oven | |||||
| Microwave oven | |||||
| Oboe | |||||
| Oboe | |||||
| Saxophone | |||||
| Saxophone | |||||
| Scissors | |||||
| Scissors | |||||
| Shatter | |||||
| Shatter | |||||
| Snare drum | |||||
| Snare drum | |||||
| Squeak | |||||
| Squeak | |||||
| Tambourine | |||||
| Tambourine | |||||
| Tearing | |||||
| Tearing | |||||
| Telephone | |||||
| Telephone | |||||
| Trumpet | |||||
| Trumpet | |||||
| Violin, fiddle | |||||
| Violin, fiddle | |||||
| Writing | |||||
| Writing | |||||
@article{helin2024soloaudio,
  author    = {Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  title     = {SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer},
  journal   = {arXiv},
  year      = {2024},
}