Speech synthesis: Robots that can finally express emotions
- December 29, 2022
Insight summary
Machine-generated speech has been around for a while, but only recent developments in speech recognition and generation are making it sound less robotic. Some companies are using advances in voice synthesis and cloning to infuse emotion (i.e., tone) into machine-generated speech. The long-term implications of speech synthesis could include recreating celebrity voices and even more convincing deepfake content.
Speech synthesis context
Synthetic speech is speech generated by a non-human source (e.g., a computer) that re-creates the sound of a human voice. The technology has existed since the 1930s, when American acoustic engineer Homer Dudley constructed the first vocoder (voice synthesizer). Systems using Gaussian Mixture Models (GMM) gradually emerged and improved the quality of speech synthesis, though not its speed. However, advances in deep learning (DL, a machine learning method) and artificial intelligence (AI) have refined the technology to produce more believable and natural-sounding speech. Speech synthesis is primarily supported by two deep neural network (DNN) technologies: text-to-speech (TTS) and voice conversion (VC).
Text-to-speech converts written text into voice, while VC transforms one person's voice to mimic another's. These two DNNs are often used in virtual assistants and can create more nuanced voices and conversations. Speech synthesis could enable more empathetic robot caregivers and smarter digital home assistants.
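The text-to-speech half of this stack is already accessible through commodity libraries. As an illustration only, the snippet below uses the open-source pyttsx3 package, which drives the operating system's built-in speech engines rather than a DNN, to read a sentence aloud; the choice of voice and speaking rate here is an assumption and varies by platform.

```python
# Illustrative sketch only: pyttsx3 wraps platform speech engines (SAPI5,
# NSSpeechSynthesizer, eSpeak), so the output is far from DNN-quality TTS,
# but the basic text-to-speech workflow looks similar.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)          # words per minute; slower can sound calmer
voices = engine.getProperty("voices")
if voices:                               # pick the first installed voice (platform-dependent)
    engine.setProperty("voice", voices[0].id)

engine.say("Good morning. It is time for your medication.")
engine.runAndWait()                      # blocks until the utterance finishes
```

DNN-based services expose a comparable interface (text in, audio out) but add controls for pitch, timbre, and emotional style on top of it.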
However, synthetic voice technology can also be used for cyberattacks. Fraudsters copy people's voiceprints (digitally stored voice samples that serve as biometric identification) to infiltrate systems and devices. Voice cloning can also fool colleagues into giving up passwords and other sensitive company information, and stolen or generated voices can be used in phishing attacks that trick people into sending money to specific bank accounts.
Disruptive impact
In 2021, researchers from the Japanese technology company Hitachi and Japan's University of Tsukuba developed an AI model that can mimic human-like speech, including different audio-based emotional markers, with the output intended to sound like a professional caregiver. Models like this are meant for robots or devices that offer companionship, support, and guidance to people who need it. The team trained its AI model by first feeding it examples of emotional speech.
An emotion recognizer was then trained to identify the feeling in each sample, and a speech synthesis model was built to generate emotional speech, with the recognizer guiding the synthesizer toward whatever "target emotion" the user expects or needs to hear. The researchers tested the model on elderly patients: participants became more energetic during the daytime, and the model could also calm them and soothe them to sleep at night.
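The two-stage idea described above (train an emotion recognizer, then use it to steer a synthesizer toward a target emotion) can be sketched in a few lines of PyTorch. Everything below, from the four-emotion label set to the layer sizes and names, is a hypothetical stand-in rather than the researchers' actual architecture, and random tensors replace real speech data.

```python
# Hypothetical sketch of emotion-guided synthesis; not Hitachi's model.
import torch
import torch.nn as nn

NUM_EMOTIONS = 4      # e.g., neutral, happy, calm, sad (assumed label set)
MEL_DIM = 80          # mel-spectrogram bins per frame
TEXT_DIM = 128        # dimensionality of an upstream text encoding

class EmotionRecognizer(nn.Module):
    """Classifies the emotion of an utterance from averaged mel frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(MEL_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, NUM_EMOTIONS))

    def forward(self, mel):               # mel: (batch, frames, MEL_DIM)
        return self.net(mel.mean(dim=1))  # logits: (batch, NUM_EMOTIONS)

class EmotionConditionedSynthesizer(nn.Module):
    """Maps a text encoding plus a target-emotion embedding to mel frames."""
    def __init__(self, out_frames=100):
        super().__init__()
        self.emotion_emb = nn.Embedding(NUM_EMOTIONS, 32)
        self.decoder = nn.Linear(TEXT_DIM + 32, out_frames * MEL_DIM)
        self.out_frames = out_frames

    def forward(self, text_enc, target_emotion):
        cond = torch.cat([text_enc, self.emotion_emb(target_emotion)], dim=-1)
        return self.decoder(cond).view(-1, self.out_frames, MEL_DIM)

# Toy training step: the recognizer scores the synthesizer's output and
# nudges it toward the requested emotion (a reconstruction loss against
# real recordings would also be needed in practice; omitted here).
recognizer = EmotionRecognizer()
synthesizer = EmotionConditionedSynthesizer()
optimizer = torch.optim.Adam(synthesizer.parameters(), lr=1e-3)

text_enc = torch.randn(8, TEXT_DIM)            # placeholder text encodings
target = torch.randint(0, NUM_EMOTIONS, (8,))  # requested "target emotions"

mel_out = synthesizer(text_enc, target)
loss = nn.functional.cross_entropy(recognizer(mel_out), target)
loss.backward()
optimizer.step()
print(f"emotion-guidance loss: {loss.item():.3f}")
```

In a real system the recognizer would be pretrained on labelled emotional recordings and frozen, so its feedback only steers the synthesizer's prosody rather than defining the audio outright.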
Meanwhile, voice synthesis is increasingly being used in film. For example, to create the synthetic voice narration for the 2022 Netflix docu-series The Andy Warhol Diaries, voice generation firm Resemble AI used 3 minutes and 12 seconds of Warhol's original voice recordings from the 1970s and '80s. The firm's technology allowed Warhol's voice to be recreated to recite his own words from the diaries, creating a six-part immersive documentary on his life.
The team took the generated output of Warhol’s voice from the AI and made adjustments for emotion and pitch. They also added human-like imperfections by referencing audio clips of another speaker. Resemble AI reiterates that before any voice cloning or synthesis project, the company always asks for consent from the voice owners or their legal representatives. For the docu-series, the company obtained the permission of the Andy Warhol Foundation.
Implications of speech synthesis
Wider implications of speech synthesis may include:
- Media companies using speech synthesis to re-create the voices of deceased celebrities for films and documentaries. However, some audiences might find this unethical and off-putting.
- Increased incidents of voice cloning cybercrimes, particularly in the financial services industry.
- Live portrait firms using synthetic speech to bring famous paintings and historical figures to life. This service is especially attractive for museums and the education sector.
- Speech synthesis being used in deepfake videos to spread propaganda and falsely accuse people, particularly journalists and activists.
- More startups offering voice cloning and synthetic speech services, including to celebrities and influencers who want to rent out their voices to brands.
- Enhanced realism in virtual assistants and interactive games through advanced speech synthesis, improving user experience but raising concerns over emotional attachment to AI.
- Adoption of speech synthesis in automated customer service, streamlining operations but potentially leading to job displacement in the call center industry.
- Government agencies leveraging speech synthesis for public service announcements, enabling multilingual and accent-specific communication but requiring careful oversight to prevent misuse or misinformation.
Questions to consider
- What are the other potential benefits of more human-sounding bots?
- How else can cybercriminals use speech synthesis?