SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

Johns Hopkins University
*Indicates equal contribution

SoloAudio is a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio is trained on synthetic audio generated by state-of-the-art text-to-audio models, and it generalizes well to out-of-domain data and unseen sound events.
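
To make the "skip-connected transformer" idea more concrete, here is a minimal sketch of how long skip connections can be wired between the shallow and deep halves of a block stack. Everything in it is an assumption for illustration, not the SoloAudio implementation: the function name `skip_connected_forward`, the mirrored pairing of blocks, and the element-wise sum used as the fusion step are all hypothetical (a real model would more likely concatenate and project), and the toy blocks stand in for transformer layers acting on latent features.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]
Fuse = Callable[[Vector, Vector], Vector]


def skip_connected_forward(
    x: Vector,
    in_blocks: Sequence[Block],
    mid_block: Block,
    out_blocks: Sequence[Block],
    fuse: Fuse,
) -> Vector:
    """Run a block stack with long skips from the first half to the second.

    Hypothetical wiring: the output of in_blocks[i] is fused into the input
    of the mirrored out block, so the shallowest features reach the deepest
    block, as in a U-Net but without any down/up-sampling.
    """
    if len(in_blocks) != len(out_blocks):
        raise ValueError("in_blocks and out_blocks must have the same length")

    skips: List[Vector] = []
    for block in in_blocks:
        x = block(x)
        skips.append(x)  # save shallow features for the mirrored deep block

    x = mid_block(x)

    for block in out_blocks:
        # pop() pairs the first out block with the last saved skip
        x = block(fuse(x, skips.pop()))
    return x


# Toy stand-ins for transformer blocks (assumptions, for illustration only).
def add(c: float) -> Block:
    return lambda v: [e + c for e in v]


def scale(c: float) -> Block:
    return lambda v: [e * c for e in v]


def fuse_sum(x: Vector, skip: Vector) -> Vector:
    # Placeholder fusion: element-wise sum of current and skipped features.
    return [a + b for a, b in zip(x, skip)]


result = skip_connected_forward(
    [1.0],
    in_blocks=[add(1.0), add(10.0)],
    mid_block=scale(2.0),
    out_blocks=[add(100.0), add(1000.0)],
    fuse=fuse_sum,
)
```

In this toy run the second input block's output is fused into the first output block and the first input block's output into the last, which is the mirrored pairing the sketch assumes.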

Diagram of SoloAudio

TSE: SoloAudio vs. Prior SotA

We test on real recordings from the AudioSet evaluation set, using 41 sound categories from the FSD Kaggle 2018 dataset. Each category has 2 test samples.

Text | Original Recording | SoloAudio (ours) | DPM-TSE | WaveFormer | AudioSep
Acoustic guitar
Acoustic guitar
Applause
Applause
Bark
Bark
Bass drum
Bass drum
Burping, eructation
Burping, eructation
Bus
Bus
Cello
Cello
Chime
Chime
Clarinet
Clarinet
Computer keyboard
Computer keyboard
Cough
Cough
Cowbell
Cowbell
Double bass
Double bass
Drawer open, close
Drawer open, close
Electric piano
Electric piano
Fart
Fart
Finger snapping
Finger snapping
Fireworks
Fireworks
Flute
Flute
Glockenspiel
Glockenspiel
Gong
Gong
Gunshot, gunfire
Gunshot, gunfire
Harmonica
Harmonica
Hi-hat
Hi-hat
Keys jangling
Keys jangling
Knock
Knock
Laughter
Laughter
Meow
Meow
Microwave oven
Microwave oven
Oboe
Oboe
Saxophone
Saxophone
Scissors
Scissors
Shatter
Shatter
Snare drum
Snare drum
Squeak
Squeak
Tambourine
Tambourine
Tearing
Tearing
Telephone
Telephone
Trumpet
Trumpet
Violin, fiddle
Violin, fiddle
Writing
Writing

BibTeX

@article{helin2024soloaudio,
  author    = {Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  title     = {SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer},
  journal   = {arXiv},
  year      = {2024},
}