Voice cloning

Jan 7, 2020

Voice cloning is a technology that changes a person's voice using software and hardware, either in real time or in batch mode.

The technology models the individual characteristics of human speech so that the result fully matches the original voice, called the “copy target”.

Overall Technology Assessment

Currently, speech recognition systems are relatively well developed. They are used for voice control of various household appliances (in phones, car audio systems, and even washing machines and refrigerators). A few days ago, we also launched a service that makes it easy to add speech recognition to your website. The reverse process is harder: there are still several difficulties in extracting words from an audio signal and in synthesizing speech.

Software products

The following programs are considered to belong to the categories “Voice Changing Software” or “Voice changer”:

MorphVOX;
Voice Changer;
VMic (Voice Anonymizer).

There are also SDK packages:

Voice Cloning Toolkit for Festival and HTS (Mac), a research package by Junichi Yamagishi and the Centre for Speech Technology Research at the University of Edinburgh.

Website and telephone service

In the past, some commercial companies offered a real-time voice change service for telephone subscribers. It worked as follows:

1. In advance, the customer orders, through the website, a callback to their own phone and a call to the phone of the “copy target”. During this call the system collects samples of the copy target's voice.

2. Then the customer orders a callback to their own phone and a call to the phone of the subscriber they want to reach. The system connects the two parties. The customer's signal passes through the company's server, where the frequency characteristics and the tone of the voice are changed to match the parameters of the copy target's voice. The subscriber hears the customer's words, but to the subscriber they (supposedly) sound as if spoken by the copy target.
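The article does not say how the server maps one voice onto another. As a rough illustration of just one ingredient, the sketch below computes how far a speaker's average pitch would have to be shifted to land on the copy target's average pitch. The function names and the example frequencies are hypothetical, and a real system would also have to change the spectral envelope and timing, not only the pitch.

```python
import math


def semitone_shift(source_f0_hz, target_f0_hz):
    """Pitch shift, in semitones, that moves source_f0_hz onto target_f0_hz."""
    if source_f0_hz <= 0 or target_f0_hz <= 0:
        raise ValueError("fundamental frequencies must be positive")
    # One octave (a doubling of frequency) equals 12 semitones.
    return 12.0 * math.log2(target_f0_hz / source_f0_hz)


def shift_frequency(freq_hz, semitones):
    """Apply a semitone shift to a single frequency value."""
    return freq_hz * 2.0 ** (semitones / 12.0)


# Hypothetical example: customer averages 120 Hz, copy target averages 240 Hz.
shift = semitone_shift(120.0, 240.0)
shifted = shift_frequency(120.0, shift)
```

Here the target sits exactly one octave above the source, so the shift is 12 semitones and the shifted frequency equals the target's 240 Hz.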

Description of technology

Real-time telephone voice cloning is based on discrete Fourier transform (DFT) methods, which analyze the frequency content of the discrete signal obtained by digitizing the analog telephone signal with the G.729 narrowband voice codec. The altered speech is synthesized on the basis of a carrier signal, so the resulting “cloned voice” preserves as much as possible of the personal acoustic characteristics of the copied voice: phonetic peculiarities of pronunciation, accent, and even stuttering artifacts. As a result, it is claimed to be impossible to detect that the speaker's voice is artificial, even with special processing and mathematical analysis of the output telephone signal. Illegal use of the technology is said to be strictly excluded by a dedicated protection program built into the online service. According to its creators, the described technology for cloning voices in telephone networks is a new product with no previous equivalent.
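To make the frequency-analysis step concrete, here is a minimal sketch of a DFT applied to one short frame of a digitized signal. It is not the service's implementation: the 8 kHz sampling rate is an assumption (typical for narrowband telephony), the frame is a synthetic sine tone standing in for voice samples, and all function names are made up for illustration.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000  # assumed narrowband telephone sampling rate


def dft(samples):
    """Naive O(N^2) discrete Fourier transform of a real-valued frame."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def dominant_frequency(samples, sample_rate=SAMPLE_RATE_HZ):
    """Return the frequency (Hz) of the strongest bin below the Nyquist limit."""
    spectrum = dft(samples)
    half = len(samples) // 2
    # Skip bin 0 (the DC offset) and look only at the non-mirrored half.
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


def make_tone(freq_hz, n_samples, sample_rate=SAMPLE_RATE_HZ):
    """Synthetic stand-in for a digitized voice frame: a pure sine tone."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n_samples)
    ]


# A 20 ms frame (160 samples at 8 kHz) containing a 200 Hz tone.
frame = make_tone(200.0, 160)
peak_hz = dominant_frequency(frame)
```

With 160 samples at 8 kHz each DFT bin is 50 Hz wide, so the 200 Hz tone falls exactly on bin 4 and `peak_hz` comes out as 200 Hz. Production code would use a fast Fourier transform rather than this quadratic loop.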

Background

Existing systems for generating machine speech have proven themselves in particular technical niches: car navigation systems, watches, electronic readers, translators, and so on. In such systems the task is not to imitate the voice of a particular person, so the resulting machine speech is not personalized and is easily recognized as artificial.

Previously, attempts to synthesize a particular person's speech were based on building a “core” speech database containing a complete set of acoustic, phonetic and prosodic features, that is, the individual features of the person's speech. This required a very detailed, personalized database of the copied voice. The person whose voice was to be copied had to read a long text, specially prepared to contain a large number of phonemes, so that the features of the speaker's speech were captured as fully as possible.

This posed some difficulties, since an ordinary person is known to get tired after as little as 15 minutes of continuous reading, and after 20 minutes their voice may even break. Even for a professional speaker, 45 minutes of continuous reading while maintaining the full range of individual speech characteristics is quite a difficult task. The recording quality also had to be very high: various types of noise that could interfere with modeling had to be excluded. The resulting recording of the original voice was then subjected to frequency analysis and mathematical processing, and the calculations often took more than a day. Only then could the individual database with the voice of the particular person be used by a speech synthesizer. Naturally, the length of the encoding process and, most importantly, the need to record the reference speech in a studio significantly narrowed the range of application of speech copying systems under normal conditions.

Application

At the moment, the most prominent commercial application of speech cloning technology is in the entertainment industry. By calling someone and talking to them in the voice of another person (for example, a mutual acquaintance), you can play a prank on them and find out what they think of you. Children will be able to listen to stories that were originally narrated by professionals, but in the voice of their parents. When films and other productions are localized, the voices of the dubbing actors can be “tailored” to the voices of the original cast. It should be noted that such technology also opens up opportunities for a wide range of abuses that fall under various articles of the criminal code.

Interesting facts

Speech cloning technology, and even a mobile device for it (in the form of a “mini voice recorder”), was shown as a small gadget in the first film of the series “BUGS” (“Electronic Bugs”).
In the film “Simona”, Al Pacino's character, the director Victor Taransky, speaks in the cloned voice of a virtual actress.
Igor Lutsenko, a character played by Igor Sklyar in the Russian film “The Imitator” directed by Oleg Borisovich Fialko, can expertly imitate other people's voices and impersonates them in telephone conversations for personal gain.
A device capable of mimicking other people's voices in phone conversations was used for fraud by a villain in one of the recent episodes of the series “Secrets of Investigation”.
The cyborgs, or humanoid terminators, in all parts of the Terminator film series (unlike other models) can imitate any human voice. For example, the T-800 in Terminator 2 can reproduce sounds and imitate a person's voice, changing its timbre over a very wide range so that it can mimic children's and women's voices.