As if you weren’t worried enough about identity theft, AI may soon be able to translate your speech into another language while preserving the characteristics of your voice. Researchers at Google have trained a neural network to map spectrograms, or “voiceprints,” of speech in one language directly to spectrograms of speech in another. In other words, if you speak Spanish, the AI can render your words in English while still keeping the unique qualities and tone of your voice.
Conventional speech translation systems work as a multi-step cascade. They first transcribe the audio into text, then translate that text, and finally synthesize new audio from the translation. Because the speaker’s voice is discarded at the transcription step, the output is spoken in a synthetic voice that sounds nothing like the original and can come across as unnatural.
Translatotron Can Retain Voice Characteristics in Translation
The new system, called Translatotron, translates speech audio in one language directly into speech audio in another. No transcripts or other intermediate text representations are used during this process.
First, a trained neural network maps the voiceprint of the input speech to a voiceprint in the output language. A separate component then converts that new voiceprint into an audio waveform that can be played. Optionally, a third component encodes the vocal characteristics of the original speaker so that they can be carried over into the translated speech.
“It makes use of two other separately trained components: a neural vocoder that converts output spectrograms to time-domain waveforms, and, optionally, a speaker encoder that can be used to maintain the character of the source speaker’s voice in the synthesized translated speech,” said a Google blog post.
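To make the term “spectrogram” more concrete, here is a minimal sketch of how one can be computed from a raw waveform with NumPy. It is not Translatotron’s actual front end, which this article does not describe; the function name, frame length, hop size, and test tone are all assumptions chosen for illustration.

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=512, hop=256):
    """Return a (frames x frequency bins) magnitude spectrogram of a 1-D signal."""
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([
        wave[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical input: one second of a 500 Hz tone sampled at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 500 * t)

spec = magnitude_spectrogram(tone)

# Average over time and find the frequency bin holding the most energy
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 512

print(spec.shape, peak_hz)
```

Each row of the result is a short slice of time and each column is a frequency band, so for the pure tone the energy piles up in the column around 500 Hz. Real speech produces a far richer pattern, and that pattern is what the network learns to map from one language to another. A vocoder performs the reverse trip, from spectrogram back to waveform.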
Because Translatotron can retain the original speaker’s vocal characteristics in the translated speech, the output sounds more natural and less jarring. Working directly on audio may also help preserve nonverbal cues, such as tone, that a text transcript throws away. And because the task is handled in one step rather than several, there are fewer points at which errors can creep in and compound.
“This system avoids dividing the task into separate stages, providing a few advantages over cascaded systems, including faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated (e.g., names and proper nouns),” the blog also stated.
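The “compounding errors” point can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that each stage of a cascade handles an utterance correctly with some fixed probability and that the stages fail independently; the function name and the 95% figures are hypothetical and are not measurements of any real system.

```python
def cascade_success_rate(stage_rates):
    """Chance that every stage succeeds, assuming stage errors are independent."""
    rate = 1.0
    for stage_rate in stage_rates:
        rate *= stage_rate
    return rate

# Hypothetical per-stage rates: speech recognition, text translation, speech synthesis
three_stage = cascade_success_rate([0.95, 0.95, 0.95])

# A single-stage system with the same hypothetical per-stage rate
one_stage = cascade_success_rate([0.95])

print(round(three_stage, 4), one_stage)
```

Under these assumptions, three 95% stages chained together get everything right only about 86% of the time, because a mistake in an early stage is passed along to every later one. A direct model is not automatically more accurate, but it has no intermediate hand-offs where such errors can accumulate.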
Proof of Concept: Spanish to English
Translatotron is only a “proof of concept,” and for now it handles only Spanish-to-English translation. As you can hear in the samples below, it’s not perfect, but you can pick out some of the tone and characteristics of the original speaker in the Translatotron output.