Guided-TTS: Text-to-Speech with Untranscribed Speech

Neural text-to-speech (TTS) models are successfully used to generate high-quality, human-like speech. However, most TTS models can be trained only if transcribed data from the desired speaker is available. This means that long-form untranscribed data, such as podcasts, cannot be used to train existing models.

A recent paper on arXiv proposes Guided-TTS, which combines an unconditional diffusion-based generative model with a phoneme classifier for text-to-speech synthesis. The diffusion model is trained on untranscribed data: it learns to generate mel-spectrograms of the speaker without any text context. At synthesis time, the phoneme classifier guides the generation so that the output matches the given text.
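The guidance idea can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a one-dimensional standard normal as a stand-in for the unconditional model, a logistic classifier as a stand-in for the phoneme classifier, and simple Langevin sampling in place of the diffusion sampler. All function names and parameters below are hypothetical. The only point it shows is that adding the classifier's gradient to the unconditional score steers samples toward the desired label.

```python
import numpy as np

def unconditional_score(x):
    # Gradient of log N(x; 0, 1); toy stand-in for the score that an
    # unconditional generative model would learn from untranscribed data.
    return -x

def classifier_log_prob_grad(x, w=2.0):
    # Gradient w.r.t. x of log sigmoid(w * x), i.e. log p(y=1 | x) for a
    # logistic classifier; toy stand-in for the phoneme classifier.
    return w * (1.0 - 1.0 / (1.0 + np.exp(-w * x)))

def langevin_sample(n_samples, guidance_scale, n_steps=500, step_size=0.01, seed=0):
    # Unadjusted Langevin dynamics driven by the guided score:
    # unconditional score + guidance_scale * classifier gradient.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    for _ in range(n_steps):
        score = unconditional_score(x) + guidance_scale * classifier_log_prob_grad(x)
        noise = rng.standard_normal(n_samples)
        x = x + step_size * score + np.sqrt(2.0 * step_size) * noise
    return x

unguided = langevin_sample(5000, guidance_scale=0.0)  # samples from the prior
guided = langevin_sample(5000, guidance_scale=1.0)    # samples steered toward y=1
print("unguided mean:", unguided.mean())
print("guided mean:  ", guided.mean())
```

With guidance switched off the samples stay centered near zero, while with guidance on they shift toward the region the classifier associates with the target label. In the paper's setting the same role is played by phoneme labels and mel-spectrogram frames rather than a binary label and a scalar.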

The results show that the proposed method matches the performance of existing models on LJSpeech. When the classifier is trained on a multi-speaker paired dataset, comparable performance is achieved without using any LJSpeech transcripts. Thus, it is possible to build a high-quality TTS model without transcripts for the desired speaker.

Research paper: Kim, H., Kim, S., and Yoon, S., “Guided-TTS: Text-to-Speech with Untranscribed Speech”, 2021. Link: arXiv:2111.11755