Microsoft has built a new AI-based speech generator that can mimic how you talk after hearing it for just 3 seconds
One of Microsoft’s artificial intelligence (AI) research teams published a paper about its text-to-speech (TTS) synthesis model last week.
The model is called VALL-E – no doubt a nod to OpenAI’s image-generating AI DALL-E – and it demonstrates a remarkable ability to copy not only a speaker’s voice, but also their emotional intonation (such as anger) and the acoustic properties of the recording environment (such as reverb).
It was trained on 60,000 hours of audio from 7,000 unique speakers using a cluster of 16 high-end NVIDIA graphics cards with 32 GB of memory each.
The result is a text-to-speech model that can mimic speakers not included in its training data, using only a three-second sample of their speech.
Known in the field of machine learning as a “zero-shot” setting, this gives the model far more flexibility than an AI that needs to be trained on hours of speech from a single person to accurately mimic them.
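To make the zero-shot idea concrete, the sketch below shows the shape of such a system: a short acoustic prompt from an unseen speaker conditions the generation of new speech. The class and method names are purely illustrative, as Microsoft has not released a public VALL-E API, and the model here is just a placeholder.

```python
# Illustrative-only sketch of zero-shot voice mimicry: condition generation on
# a ~3-second acoustic prompt from a speaker never seen during training.
# Names are hypothetical; this is not a released VALL-E interface.
from dataclasses import dataclass
from typing import List

@dataclass
class AcousticPrompt:
    samples: List[float]          # ~3 seconds of audio from the target speaker
    sample_rate: int = 16000

class ZeroShotTTS:
    """Stand-in for a neural-codec language model in the spirit of VALL-E."""

    def synthesize(self, text: str, prompt: AcousticPrompt) -> List[float]:
        # A real model would encode the prompt into discrete audio tokens and
        # continue that token sequence conditioned on the text's phonemes,
        # carrying over the prompt's voice, intonation and room acoustics.
        # This placeholder just returns silence of a plausible duration.
        seconds_per_char = 0.06
        n_samples = int(len(text) * seconds_per_char * prompt.sample_rate)
        return [0.0] * n_samples

# One three-second clip "enrolls" a speaker who was never in the training data.
prompt = AcousticPrompt(samples=[0.0] * (3 * 16000))
speech = ZeroShotTTS().synthesize("Hello from a cloned voice.", prompt)
```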
You can listen to samples of VALL-E on a site set up by the research team. Some of the results still sound computer-generated, a limitation the paper acknowledges, while others are hard to distinguish from the original speaker’s sample.
But overall, VALL-E’s speech mimicry and its ability to recreate acoustic environments are impressive to hear, so much so that the Microsoft team is preparing a method to detect whether a piece of audio was generated by VALL-E.
“Since VALL-E could synthesize speech that preserves the speaker’s identity, it could carry potential risks in misusing the model, such as forging voice identification or mimicking a specific speaker,” the paper said.
“To mitigate such risks, it is possible to build a detection model to distinguish whether an audio clip has been synthesized by VALL-E.
“We will also put Microsoft AI Principles into practice in the further development of the models.”
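In its simplest form, the detection model the authors mention would be a binary classifier trained on labelled real and synthesized clips. The sketch below illustrates that setup with scikit-learn, using random vectors in place of real acoustic features; it is an assumption for illustration only and does not reflect any actual Microsoft implementation.

```python
# Minimal sketch of a real-vs-synthetic speech detector as a binary classifier.
# Random vectors stand in for acoustic features (e.g. spectral statistics);
# this is an illustrative assumption, not Microsoft's detection method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend features: 500 real clips and 500 synthesized clips, 64 features each,
# with a small mean shift so the two classes are actually separable.
real_feats = rng.normal(loc=0.0, scale=1.0, size=(500, 64))
fake_feats = rng.normal(loc=0.3, scale=1.0, size=(500, 64))
X = np.vstack([real_feats, fake_feats])
y = np.array([0] * 500 + [1] * 500)   # 0 = real speech, 1 = synthesized

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```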
A possible practical application of this technology – beyond impersonating celebrities – is the creation of audio data for speech recognition systems such as Siri and Alexa.
“Speech recognition always benefits from diverse inputs with different speakers and acoustic environments, which previous [text-to-speech] systems cannot provide,” the paper’s authors note.
“Given the diversity characteristic of VALL-E, it is an ideal candidate to generate pseudo-data for speech recognition.”
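In practice, that pseudo-data step amounts to running a TTS model over plain transcripts with many different speaker prompts and collecting the resulting audio/text pairs as extra training material for a recognizer. The loop below sketches the idea; the `synthesize` callable is a placeholder assumption, since no public VALL-E interface exists.

```python
# Hypothetical sketch of generating ASR "pseudo-data" with a zero-shot TTS
# model: pair each transcript with synthesized audio in many voices.
# The `synthesize` callable is a stand-in, not a published VALL-E API.
from typing import Callable, List, Tuple

Audio = List[float]

def make_pseudo_dataset(
    synthesize: Callable[[str, Audio], Audio],   # (text, speaker_prompt) -> waveform
    transcripts: List[str],
    speaker_prompts: List[Audio],
) -> List[Tuple[Audio, str]]:
    """Cross every transcript with every speaker prompt for acoustic diversity."""
    pairs: List[Tuple[Audio, str]] = []
    for text in transcripts:
        for prompt in speaker_prompts:
            pairs.append((synthesize(text, prompt), text))
    return pairs

# Usage with a dummy synthesizer that simply echoes the prompt back:
dummy_tts = lambda text, prompt: prompt
data = make_pseudo_dataset(dummy_tts, ["turn on the lights"], [[0.0] * 48000])
```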
Microsoft also proposed a fully AI-based content creation pipeline that would combine VALL-E with text generation models such as GPT-3, which ChatGPT has made famous. The podcasts and audiobooks of the future could be written and spoken entirely by generative AI.
The creative industries are already struggling with the rise of AI art, which threatens to take over work that used to be done by human artists.
Digital artist Ben Moran recently made headlines after being banned from the popular Art subreddit for submitting AI-generated art, which is against community guidelines.
Moran is adamant that the piece was made by a real artist, not an AI.