SoloAudio

SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

¹Johns Hopkins University

*Indicates equal contribution

SoloAudio is a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio is trained on synthetic audio generated by state-of-the-art text-to-audio models, and it demonstrates strong generalization to out-of-domain data and unseen sound events.
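To make the "skip-connected transformer on latent features" idea concrete, here is a minimal NumPy sketch of the wiring only. It is not the SoloAudio implementation. Each block is a toy residual MLP standing in for a real attention block, the diffusion timestep is omitted, and feeding the target-sound embedding (a CLAP feature in SoloAudio) as one prepended token is an assumption made for illustration. All names (`SkipConnectedStack`, `make_block`, `make_skip_fusion`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_block(dim):
    """Toy stand-in for a transformer block: a residual two-layer MLP."""
    w1 = rng.normal(0.0, 0.02, (dim, dim))
    w2 = rng.normal(0.0, 0.02, (dim, dim))

    def block(x):
        return x + np.tanh(x @ w1) @ w2

    return block


def make_skip_fusion(dim):
    """Linear projection that fuses a long skip: concat(x, skip) -> dim."""
    w = rng.normal(0.0, 0.02, (2 * dim, dim))

    def fuse(x, skip):
        return np.concatenate([x, skip], axis=-1) @ w

    return fuse


class SkipConnectedStack:
    """In-blocks, one mid-block, out-blocks, with U-Net-style long skips."""

    def __init__(self, dim, depth):
        assert depth % 2 == 1, "use an odd depth: in-blocks, mid-block, out-blocks"
        half = depth // 2
        self.in_blocks = [make_block(dim) for _ in range(half)]
        self.mid_block = make_block(dim)
        self.out_blocks = [make_block(dim) for _ in range(half)]
        self.fusions = [make_skip_fusion(dim) for _ in range(half)]

    def __call__(self, latents, condition):
        # Assumption: the condition embedding enters as one extra token.
        x = np.concatenate([condition[None, :], latents], axis=0)
        skips = []
        for block in self.in_blocks:
            x = block(x)
            skips.append(x)  # remember shallow activations
        x = self.mid_block(x)
        for fuse, block in zip(self.fusions, self.out_blocks):
            # The deepest in-block output pairs with the first out-block.
            x = block(fuse(x, skips.pop()))
        return x[1:]  # drop the condition token


dim, frames = 16, 10
model = SkipConnectedStack(dim=dim, depth=5)
noisy_latents = rng.normal(size=(frames, dim))  # placeholder latent frames
target_embedding = rng.normal(size=dim)         # placeholder for a CLAP feature
prediction = model(noisy_latents, target_embedding)
print(prediction.shape)
```

The stack of saved activations is popped in reverse order, so shallow and deep layers are paired symmetrically, which mirrors how a U-Net connects its encoder and decoder stages.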

Audio samples for each target sound (columns: Text, Original Recording, SoloAudio (ours), DPM-TSE, WaveFormer, AudioSep):

Acoustic guitar

Applause

Bark

Bass drum

Burping, eructation

Bus

Cello

Chime

Clarinet

Computer keyboard

Cough

Cowbell

Double bass

Drawer open, close

Electric piano

Fart

Finger snapping

Fireworks

Flute

Glockenspiel

Gong

Gunshot, gunfire

Harmonica

Hi-hat

Keys jangling

Knock

Laughter

Meow

Microwave oven

Oboe

Saxophone

Scissors

Shatter

Snare drum

Squeak

Tambourine

Tearing

Telephone

Trumpet

Violin, fiddle

Writing

@article{helin2024soloaudio,
  author  = {Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  title   = {SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer},
  journal = {arXiv},
  year    = {2024},
}


Diagram of SoloAudio

TSE: SoloAudio vs. Prior SotA

BibTeX