Neural text-to-speech (TTS) models are successfully used to generate high-quality, human-like speech. However, most TTS models can be trained only if transcribed data from the target speaker is available. That means that long-form untranscribed data, such as podcasts, cannot be used to train existing models.
A recent paper on arXiv proposes an unconditional diffusion-based generative model. It is trained on untranscribed data and relies on a phoneme classifier for text-to-speech synthesis. A probabilistic model learns to generate mel-spectrograms of the speaker without any context.
The results show that the proposed approach matches the performance of existing models on LJSpeech. By training the classifier on a multi-speaker paired dataset, comparable performance is achieved without seeing any transcript of LJSpeech. Thus, it is possible to build a high-quality TTS model without a transcript for the target speaker.
Most neural text-to-speech (TTS) models require paired data from the target speaker for high-quality speech synthesis, which limits the usage of large amounts of untranscribed data for training. In this work, we present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for text-to-speech. By modeling the unconditional distribution for speech, our model can utilize the untranscribed data for training. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms from the conditional distribution given the transcript. We show that Guided-TTS achieves comparable performance with the existing methods without any transcript for LJSpeech. Our results further show that a single speaker-dependent phoneme classifier trained on multispeaker large-scale data can guide unconditional DDPMs for various speakers to perform TTS.
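The guidance idea in the abstract rests on Bayes' rule written for gradients: the gradient of log p(x | y) equals the gradient of log p(x) plus the gradient of log p(y | x). The sketch below illustrates that identity on a toy problem and is not the paper's implementation. It assumes a one-dimensional value x in place of a mel-spectrogram, two hypothetical classes in place of phonemes, closed-form functions in place of the trained networks, and plain Langevin sampling in place of the DDPM reverse process. All function names and constants are made up for illustration.

```python
import math
import random

# Toy stand-in for speech frames: x comes from an equal mixture of two
# unit-variance Gaussians, one per hypothetical "phoneme" class.
MEANS = {0: -2.0, 1: 2.0}


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def class_posterior(x, y):
    # Exact p(y | x) for this mixture; plays the role of the classifier.
    p1 = sigmoid(4.0 * x)
    return p1 if y == 1 else 1.0 - p1


def unconditional_score(x):
    # d/dx log p(x) for the mixture; plays the role of the unconditional model.
    p1 = class_posterior(x, 1)
    return (1.0 - p1) * (MEANS[0] - x) + p1 * (MEANS[1] - x)


def classifier_grad(x, y):
    # d/dx log p(y | x), in closed form for the logistic posterior above.
    p1 = class_posterior(x, 1)
    return 4.0 * (1.0 - p1) if y == 1 else -4.0 * p1


def guided_score(x, y, scale=1.0):
    # Unconditional score plus the classifier gradient gives the
    # conditional score when scale is 1.
    return unconditional_score(x) + scale * classifier_grad(x, y)


def langevin_sample(y, steps=500, step_size=0.01, rng=None):
    # Unadjusted Langevin dynamics driven by the guided score.
    if rng is None:
        rng = random.Random(0)
    x = rng.gauss(0.0, 1.0)
    noise_std = math.sqrt(2.0 * step_size)
    for _ in range(steps):
        x += step_size * guided_score(x, y) + noise_std * rng.gauss(0.0, 1.0)
    return x
```

In this toy setting the unconditional part never sees a class label, which mirrors how the unconditional model in the article never sees a transcript. Only the classifier term steers samples toward the requested class, so calling `langevin_sample(1)` repeatedly should yield values clustered around the class-1 mean.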
Research paper: Kim, H., Kim, S., and Yoon, S., “Guided-TTS: Text-to-Speech with Untranscribed Speech”, 2021. Link: https://arxiv.org/abs/2111.11755