SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

¹Johns Hopkins University, ²Tencent AI Lab, ³Nanyang Technological University
*Work done during an internship at Tencent AI Lab

SSR-Speech is a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. Our approach achieves state-of-the-art performance on the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. SSR-Speech also excels at multi-span speech editing and is robust to background sounds.

Animated demonstration of SSR-Speech.

Speech Editing: SSR-Speech vs. Prior SotA

All speakers are unseen during training. Utterances are from our RealEdit evaluation set, which comprises audiobooks, YouTube videos, and Spotify podcasts. The original VoiceCraft paper runs inference 10 times with different margin parameters. We compare against this setting and also against a single inference pass for SSR-Speech, VoiceCraft, and FluentSpeech.

Audio samples per row: Original Recording, SSR-Speech (once), VoiceCraft (once), VoiceCraft (10 times), FluentSpeech (once)

Original Transcript | Target Transcript
Fast cars, that had the nice clothes, that had the money, they was criminals. | Fast cars, that had the nice clothes, that had expensive gold watches, that had the money, they was criminals.
We would just be open and willing to adopt whatever child God brought to her life. | We would just be excited to welcome whatever child God brought into her life.
that's a bomb and that's a good sign from him. he got fully extended on it. knew it as soon as that ball left. | that's a bomb and that's a good sign from him. he clearly signalled and made the play happen as soon as that ball left.
economic development remains one of the most effective ways to increase the capacity to adapt to climate change. | economic development remains one of the most promising options that we have left on the table to increase the capacity to adapt to climate change.
So if you've been following my story, you will remember that I said earlier in this podcast that the Grammy nominations came out. | So if you've been following my story, you will remember that I said earlier that this week we had super exciting stuff to talk about because Grammy nominations came out.
because we can include so many other characters if we just expand the definitions to any sword wielder, who's a little spicy. | because we can include so many other participants if we are brave enough to expand the definitions to any blade wielder, who's a little spicy.
It was a glance of inquiry, ending in a look of chagrin, with some muttered phrases that rendered it more emphatic. | It was a look of disgust followed by a curled lip, with some muttered phrases that rendered it more emphatic.
some times i really feel that the world around us continues to be more hectic and more complicated and so many of us are truly craving to find simplicity. | some times i really feel that the world around us continues to be more hectic, more impersonal, and more uncaring and so many of us are truly craving to find simplicity.
For making the title though because I need to get my numbers way up before I get there, but I'm gonna get there title of Iceland is definitely going to sign me and um, yeah. | For making the title though because I need to get my numbers way up before I get there, but I'm gonna get there title of Iceland is going to sign me and um, yeah.
they knew that governments don't control things. a government can't control the economy without controlling people. | they knew that governments don't control money directly. a government can't control the economy without controlling people.
No words just lightning breaking darkness and crashing into the Earth with brilliant presence. | No words just lightning breaking darkness and crashing into the surface of the Earth with brilliant presence.
that schedule is one per week and it will probably be like a Wednesday night thing because I plan on doing one to two videos per week. | that schedule is one per week and you will start to see a lot more content arriving because I plan on doing one to two videos per week.
in a case like this, i probably wouldn't spend any more time looking at the deal if i was only interested in the cash flow. | in a case like this, i probably wouldn't spend any more time looking at all the details and the fine print if i was only interested in the cash flow.
but the renaissance broke their monopoly on knowledge, one of the most important bastions of the church. | but the renaissance broke their monopoly on knowledge, with it's free movement of research and endless scientific inquiry, one of the most important bastions of the church.
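
Each editing example above pairs an original transcript with a target transcript that changes one or more word spans. As a rough illustration of what those spans are (this is not the SSR-Speech pipeline, only a word-level diff built on Python's standard `difflib`; the function name `edited_spans` is hypothetical), the changed regions can be located like this:

```python
import difflib
from typing import List, Tuple


def edited_spans(original: str, target: str) -> List[Tuple[str, str, str]]:
    """Return the word-level edits that turn `original` into `target`.

    Each item is (operation, words_removed, words_added), where operation
    is one of difflib's opcode tags: "replace", "delete", or "insert".
    Illustrative helper only; not taken from the SSR-Speech code base.
    """
    src = original.split()
    tgt = target.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the regions that actually change
            spans.append((tag, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return spans


# Single-span example taken from the table above.
single = edited_spans(
    "No words just lightning breaking darkness and crashing into the Earth "
    "with brilliant presence.",
    "No words just lightning breaking darkness and crashing into the surface "
    "of the Earth with brilliant presence.",
)
print(single)  # [('insert', '', 'surface of the')]

# Multi-span example taken from the table above.
multi = edited_spans(
    "because we can include so many other characters if we just expand the "
    "definitions to any sword wielder, who's a little spicy.",
    "because we can include so many other participants if we are brave enough "
    "to expand the definitions to any blade wielder, who's a little spicy.",
)
for op in multi:
    print(op)
```

A single-span edit yields one changed region, while a multi-span edit such as the second example yields several, each of which has to be regenerated so that it blends with the untouched audio around it.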

Zero-Shot TTS: SSR-Speech vs. Prior SotA

All speakers are unseen during training. Only the first 3 seconds of the Voice Prompt are given to the models. The original VoiceCraft paper runs inference 5 times with different random seeds. We compare against this setting and also against a single inference pass for SSR-Speech, VoiceCraft, VALL-E, XTTS v2, and FluentSpeech.

Audio samples per row: Voice Prompt (only the first 3 seconds are used), SSR-Speech (once), VoiceCraft (once), VoiceCraft (5 times), VALL-E (once), XTTS v2 (once), FluentSpeech (once)

Target Transcript
we voted it out that was a standard that people used till eleven or little bit past eleven um and it was an international standard.
hey you all, my name is corey ash and, and i know that you have been working really hard to try to figure out
i'm pleading the fifth about my shoeless state at the moment. but what if you learned about yourselves in this time and and what has changed about the way that you're working?
Do you know I was so foolish that I thought every flash of lightning must descend on your head.
Quox did not have much to say until the conversation was ended, but then he turned to Kaliko and asked:
you know it's one thing to kind of be home alone but it's another to know that you have, you know zoom, yoga classes with friend of yours.
at its base, raider one is a four wheel drive polaris, MVRS 700, that controllers can operate nearly one thousand yards away.
We clambered up to the front seat and jolted off past the little pond and along the road that climbed to the big cornfield.
We were the only ones who did go afoot, however, although the corrals were not more than two hundred yards' distant.
i don't wanna make it as a thing where i'm absolving myself of any responsibility.
and the only way to do great work is to love what you do. if you haven't found it yet, keep looking.
blank space because i'd written a lot of lines down already in the year preceding the session.
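
The zero-shot TTS comparison gives each model only the first 3 seconds of the voice prompt. A minimal sketch of that truncation step, assuming a mono waveform held as a plain sequence of samples (the function name `truncate_prompt` and its parameters are hypothetical and not taken from the SSR-Speech code):

```python
from typing import List, Sequence


def truncate_prompt(
    samples: Sequence[float], sample_rate: int, max_seconds: float = 3.0
) -> List[float]:
    """Keep only the leading `max_seconds` of a mono waveform.

    Illustrative helper: `samples` is a flat sequence of amplitudes and
    `sample_rate` is in Hz. Prompts shorter than the limit are returned whole.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if max_seconds < 0:
        raise ValueError("max_seconds must be non-negative")
    n_keep = int(round(max_seconds * sample_rate))
    return list(samples[:n_keep])


# A 5-second prompt at 16 kHz is cut to 3 seconds (48000 samples).
five_seconds = [0.0] * (5 * 16000)
prompt = truncate_prompt(five_seconds, sample_rate=16000)
print(len(prompt))  # 48000
```

The 16 kHz rate in the usage line is only an example value; the same function works for any positive sample rate.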

BibTeX

@article{wang2024ssrspeech,
      author    = {Wang, Helin and Yu, Meng and Hai, Jiarui and Chen, Chen and Hu, Yuchen and Chen, Rilin and Dehak, Najim and Yu, Dong},
      title     = {SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis},
      journal   = {arXiv},
      year      = {2024},
}