SoloSpeech

SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

¹Johns Hopkins University ²The Chinese University of Hong Kong ³Nanyang Technological University ⁴Tsinghua University ⁵Brno University of Technology

Target Speech Extraction (TSE) aims to isolate a target speaker’s voice from a mixture of multiple speakers by leveraging speaker-specific cues. Recent TSE systems have relied mainly on discriminative models, which offer high perceptual quality but often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. Generative models for TSE, in contrast, lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that takes its conditional information from the cue audio’s latent space and aligns it with the mixture audio’s latent space to prevent mismatches. Evaluated on the widely used Libri2Mix dataset, SoloSpeech achieves state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks while demonstrating exceptional generalization on out-of-domain data, providing at least a 16.1% SI-SNR improvement over the previous best method, USEF-TSE. These findings underscore SoloSpeech’s robustness and versatility in diverse acoustic environments.
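For readers unfamiliar with the scale-invariant signal-to-noise ratio cited in the abstract, the following is a minimal sketch of how that metric is commonly computed. It assumes the standard textbook definition (zero-mean signals, projection of the estimate onto the reference) and is not taken from the SoloSpeech codebase; the function name `si_snr` and the `eps` parameter are illustrative choices, and the exact evaluation settings behind the reported numbers may differ.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (illustrative helper, not the authors' code).

    Both inputs are equal-length sequences of float audio samples.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so a constant (DC) offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target: the part of the estimate
    # that is a scaled copy of the reference signal.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]

    # Whatever remains after the projection counts as noise or distortion.
    e_noise = [e - s for e, s in zip(est, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


if __name__ == "__main__":
    # Synthetic example: a sine "target" plus a small interfering tone.
    target = [math.sin(0.1 * i) for i in range(1000)]
    noisy = [t + 0.1 * math.cos(0.37 * i) for i, t in enumerate(target)]
    print(f"SI-SNR of noisy estimate: {si_snr(noisy, target):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric suits extraction systems whose output gain is arbitrary.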

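Audio samples, CHiME-5 (real-recorded): Mixture, Cue, SoloSpeech, USEF-TSE, SoloAudio

Audio samples, RealSEP (real-recorded): Mixture, Cue, SoloSpeech, USEF-TSE, SoloAudio

Audio samples, comparison with SOTA methods on Libri2Mix: Mixture, Cue, Target, SoloSpeech, USEF-TSE, SoloAudio

Audio samples, compressor ablation on Libri2Mix: Mixture, Cue, Target, T-F Audio VAE, Stable Audio VAE

Audio samples, corrector ablation on Libri2Mix: Mixture, Cue, Target, w/ Corrector, w/o Corrector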

BibTeX

@misc{wang2025solospeechenhancingintelligibilityquality,
  title={SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline},
  author={Helin Wang and Jiarui Hai and Dongchao Yang and Chen Chen and Kai Li and Junyi Peng and Thomas Thebaud and Laureano Moro Velazquez and Jesus Villalba and Najim Dehak},
  year={2025},
  eprint={2505.19314},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2505.19314},
}

Overall Pipeline

Table of Contents

Comparison on Real-recorded Datasets

CHiME-5

RealSEP

Comparison with SOTA Methods on Libri2Mix

Ablation Studies for the Compressor on Libri2Mix

Ablation Studies for the Corrector on Libri2Mix

BibTeX