Voice cloning with Artificial Intelligence

Machine learning and deep learning, the technologies underlying AI, have shown considerable potential for text-to-speech (TTS), also called speech synthesis. Combining speech synthesis with speech recognition gave rise to virtual assistants such as Siri and Alexa.

Voice cloning helps eliminate the robotic tone of synthetic speech, which improves voice assistance in chatbot services. Today, AI software can generate synthetic speech that closely resembles a target human voice. Deep neural networks are bringing chatbots a step closer to natural, interactive, and personalized conversations.

Recent research by Jia, Zhang, and colleagues on text-to-speech synthesis proposed a technique, called Speaker Verification to TTS (SV2TTS), that generates speech closely resembling a target voice from only a few seconds of sample audio. This is far more efficient than traditional, expensive training methods that required several hours of professionally recorded speech. The advantages of SV2TTS are:

  • It can clone voices without excessive training or retraining.
  • A very high-quality audio result can be produced.
  • It can synthesize natural speech for speakers that were not present in the training data.

The Topic Covers

  1. The technique

  2. Steps to generate these embeddings

  3. Synthesizer

  4. Neural Vocoder

  5. Conclusion

The technique

SV2TTS has three parts, each trained individually on independent data, which reduces the dependency on high-quality multi-speaker data.
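As a rough illustration of how the three parts fit together, the sketch below wires a speaker encoder, a synthesizer, and a vocoder into one function. All names (`clone_voice`, `toy_encoder`, and so on), the array sizes, and the toy stand-ins are hypothetical; the stand-ins only return zero arrays of a plausible shape and are not trained models.

```python
from typing import Callable

import numpy as np

# Hypothetical type aliases for the three independently trained parts.
SpeakerEncoder = Callable[[np.ndarray], np.ndarray]    # reference audio -> speaker embedding
Synthesizer = Callable[[str, np.ndarray], np.ndarray]  # text + embedding -> mel frames
Vocoder = Callable[[np.ndarray], np.ndarray]           # mel frames -> raw waveform


def clone_voice(reference_audio: np.ndarray, text: str,
                encoder: SpeakerEncoder, synthesizer: Synthesizer,
                vocoder: Vocoder) -> np.ndarray:
    """Run the three stages in order: embed the voice, synthesize mels, vocode."""
    embedding = encoder(reference_audio)
    mel_frames = synthesizer(text, embedding)
    return vocoder(mel_frames)


# Assumed sizes, chosen only so the toy stand-ins have consistent shapes.
EMBED_DIM, N_MELS, FRAMES_PER_CHAR, SAMPLES_PER_FRAME = 256, 80, 5, 200


def toy_encoder(audio: np.ndarray) -> np.ndarray:
    return np.zeros(EMBED_DIM)  # placeholder for a trained speaker encoder


def toy_synthesizer(text: str, embedding: np.ndarray) -> np.ndarray:
    return np.zeros((len(text) * FRAMES_PER_CHAR, N_MELS))  # placeholder mel frames


def toy_vocoder(mel_frames: np.ndarray) -> np.ndarray:
    return np.zeros(mel_frames.shape[0] * SAMPLES_PER_FRAME)  # placeholder waveform


audio = clone_voice(np.zeros(16000), "hello", toy_encoder, toy_synthesizer, toy_vocoder)
print(audio.shape)
```

Because each stage is only a function from one array to the next, any one of them can be retrained or swapped without touching the other two, which is the practical benefit of training the parts independently.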


The speaker encoder

This is the first part of SV2TTS. The speaker encoder takes the input audio, converts it into “mel” spectrogram frames, and outputs an embedding that captures how the speaker sounds, independent of what is being said and of any background noise. Its main job is to assess the characteristics of the speaker’s voice, such as high or low tone, pitch, and accent. These features are combined into a low-dimensional vector known as a d-vector, or speaker embedding. As a result, utterances from the same speaker map to embeddings that lie close together, while utterances from other speakers lie farther away. Below is a visual representation of the embeddings, where each color represents a different speaker.


Steps to generate these embeddings are:

  • Speech audio examples from different speakers are segmented into 1.6-second clips, with no transcripts, and then transformed into “mel” spectrograms.
  • The speaker encoder is then trained to decide whether two audio samples were produced by the same speaker. As a result, the embeddings it creates for utterances of the same speaker closely match one another.
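The steps above can be sketched in a few lines of NumPy. The first part cuts a waveform into fixed-length clips and converts a clip into a log-mel spectrogram; the second part shows cosine similarity, a common way to compare two speaker embeddings. The function names, the clip length passed in the example, and the frame parameters (`n_fft`, `hop`, `n_mels`) are assumptions for illustration, and the vectors compared at the end are hand-made toy values, not outputs of a trained encoder.

```python
import numpy as np


def split_into_clips(wave, sample_rate, clip_seconds):
    """Cut a waveform into equal-length clips, dropping any leftover samples."""
    clip_len = int(round(clip_seconds * sample_rate))
    n_clips = len(wave) // clip_len
    return wave[: n_clips * clip_len].reshape(n_clips, clip_len)


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    hz_points = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(clip, sample_rate, n_fft=400, hop=160, n_mels=40):
    """Short-time power spectrum projected onto mel filters, then log-scaled."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(clip) - n_fft) // hop
    frames = np.stack([clip[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sample_rate, n_fft, n_mels).T
    return np.log(mel + 1e-6)  # small offset avoids log(0)


def cosine_similarity(a, b):
    """Close to 1 for embeddings pointing the same way, near 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Example: a 4-second synthetic tone stands in for real speech.
sample_rate = 16000
t = np.arange(4 * sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)

clips = split_into_clips(wave, sample_rate, clip_seconds=1.6)  # assumed clip length
spec = log_mel_spectrogram(clips[0], sample_rate)
print(clips.shape, spec.shape)

# Toy vectors standing in for speaker embeddings.
same = cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
different = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

A verification-style training objective pushes the similarity between clips of the same speaker up and the similarity between clips of different speakers down, which is what makes the resulting embedding useful as a description of a voice.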



Synthesizer

The synthesizer is the second part of SV2TTS. It creates the “mel” spectrograms that a vocoder later converts into sound. The input text is first converted into a sequence of phonemes, the basic units of human speech sound. These phonemes are then converted into “mel” spectrogram frames, conditioned on the speaker embedding, using the Tacotron 2 architecture.
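A minimal sketch of the synthesizer's input side follows, under two assumptions: text is mapped to phonemes with a small lookup table, and the speaker's voice is injected by appending the speaker embedding to the text encoder's output at every time step. The lexicon `TOY_LEXICON`, the function names, and the random "encoder outputs" are hypothetical stand-ins; a real system would use a full pronunciation dictionary and a trained Tacotron 2 encoder.

```python
import numpy as np

# Toy pronunciation lexicon (ARPAbet-style symbols, stress marks omitted).
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phonemes(text, lexicon):
    """Look up each word; raises KeyError for words missing from the lexicon."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(lexicon[word])
    return phonemes


def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker embedding to every time step.

    encoder_outputs: (time_steps, features); speaker_embedding: (embed_dim,)
    returns: (time_steps, features + embed_dim)
    """
    tiled = np.broadcast_to(
        speaker_embedding, (encoder_outputs.shape[0], speaker_embedding.shape[0])
    )
    return np.concatenate([encoder_outputs, tiled], axis=1)


phonemes = text_to_phonemes("hello world", TOY_LEXICON)

# Random values stand in for what a trained text encoder would produce,
# one feature vector per phoneme.
rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(len(phonemes), 4))
speaker_embedding = np.array([0.1, 0.2, 0.3])

conditioned = condition_on_speaker(encoder_outputs, speaker_embedding)
print(phonemes, conditioned.shape)
```

The decoder that turns these conditioned features into mel frames is the trained part of the model and is not reproduced here.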


Neural Vocoder

Once the synthesizer has created the “mel” spectrograms, a vocoder such as DeepMind’s WaveNet model converts them into raw audio waveforms.
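The trained vocoder itself cannot be reproduced in a few lines, but one small piece of the original WaveNet design can: it predicts audio one sample at a time as one of 256 classes, obtained by mu-law companding of the waveform. The sketch below shows that encoding and its inverse. The function names are hypothetical, and this is only one output representation; a given SV2TTS implementation may use a WaveNet variant that models samples differently.

```python
import numpy as np


def mu_law_encode(wave, mu=255):
    """Compress samples in [-1, 1] and quantize them to integers 0..mu."""
    wave = np.clip(wave, -1.0, 1.0)
    compressed = np.sign(wave) * np.log1p(mu * np.abs(wave)) / np.log1p(mu)
    return np.round((compressed + 1.0) / 2.0 * mu).astype(np.int64)


def mu_law_decode(codes, mu=255):
    """Map integer codes 0..mu back to approximate samples in [-1, 1]."""
    compressed = 2.0 * codes.astype(np.float64) / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu


wave = np.linspace(-1.0, 1.0, 101)
codes = mu_law_encode(wave)
restored = mu_law_decode(codes)
max_error = np.max(np.abs(restored - wave))
print(codes.min(), codes.max())
```

Companding spends more of the 256 levels on quiet samples, where the ear is more sensitive, so the round trip loses little despite the coarse quantization.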



Conclusion

SV2TTS can power hundreds of applications involving traditional TTS, chatbots, and virtual assistants like Siri and Cortana. It can also help people who have lost their voice, due to ALS or any other reason, by converting text into speech in a cloned version of their voice.

Despite these advantages, voice cloning technology can be misused in many ways. AI deepfakes, that is, media in which a person’s existing image, audio, or video is replaced with something else with the help of AI, are multiplying quickly, as identified by the US Federal Trade Commission. Preventing the malicious use of voice synthesizers to speak on a person’s behalf requires technological safeguards. These include an audio fingerprint that ties the individual’s sample voice to the system-generated voice and serves as a form of authentication. Proper guidance, through regulation and legislation on deepfake voice cloning, is the need of the hour.
