Marius Cotescu and Georgi Tinchev, data scientists at Amazon, recently demonstrated their work to help Alexa, the company’s digital assistant, overcome pronunciation difficulties and master Irish-accented English. Their goal was to use artificial intelligence and recordings from native speakers to improve Alexa’s handling of regional accents.
During the demonstration, Alexa spoke about a fun night out, using the Irish word “craic” to describe the enjoyment. However, Mr. Tinchev noticed that Alexa had dropped the “r” in the word “party,” making it sound more British than Irish. This highlighted the challenge of voice disentanglement, a complex area of data science that involves understanding the nuances of accents, prosody (the pitch, timbre, and accentuation of speech), and emotional expression to create more natural-sounding artificial speech.
Recent advancements in AI, computer chips, and hardware have enabled researchers to make progress in solving the voice disentanglement problem. The goal is to develop AI-powered devices and speech synthesizers that can accurately reproduce a wide range of regional accents, making interactions with digital assistants more conversational and natural. This research aligns with the emergence of generative AI, which enables chatbots to generate their own responses and respond verbally to voice commands, potentially revitalizing consumer interest in voice assistants like Alexa and Siri.
Historically, teaching voice assistants multiple languages and accents has been a costly and time-consuming process, involving hiring voice actors to record extensive speech data. However, recent developments in AI, particularly text-to-speech models, have made it possible to create synthetic voices based on text inputs in different languages, accents, and dialects.
Under pressure to catch up with competitors like Microsoft and Google in the AI race, Amazon has been working to enhance Alexa’s capabilities. The company’s chief executive, Andy Jassy, has expressed the intention to make Alexa more proactive and conversational using sophisticated generative AI. Achieving this requires AI systems to disentangle accents from other speech characteristics like tone and frequency, allowing for the synthesis of new accents.
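The disentanglement idea described above can be caricatured as factoring a voice into independently swappable components, so that the accent can be replaced while the speaker’s identity and prosody stay fixed. The sketch below is purely illustrative and is not Amazon’s actual architecture; every class name and embedding value in it is hypothetical.

```python
from dataclasses import dataclass

# Toy illustration of voice disentanglement: a speech style is factored
# into independent components so that one of them (the accent) can be
# swapped while the others (speaker identity, prosody) are held fixed.
# All names and numbers here are hypothetical, for illustration only.

@dataclass(frozen=True)
class SpeechStyle:
    speaker: tuple   # who is talking (timbre, pitch range)
    accent: tuple    # regional-accent embedding
    prosody: tuple   # rhythm and intonation

def swap_accent(style: SpeechStyle, new_accent: tuple) -> SpeechStyle:
    """Keep the speaker and prosody, replace only the accent component."""
    return SpeechStyle(style.speaker, new_accent, style.prosody)

# An existing (hypothetical) British-accented voice ...
british_voice = SpeechStyle(speaker=(0.2, 0.7), accent=(1.0, 0.0), prosody=(0.5,))
# ... and an accent embedding learned from Irish-accented recordings:
irish_accent = (0.0, 1.0)

# The same speaker, re-rendered with the Irish accent:
irish_voice = swap_accent(british_voice, irish_accent)
```

In a real system the components would be learned embeddings feeding a neural synthesizer, but the design goal is the same: once the factors are separated, a new accent can be paired with an existing voice.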
Building an Irish-accented Alexa presented several linguistic challenges, such as accounting for the Irish tendency to drop the “h” in “th” and to emphasize the letter “r.” Marius Cotescu’s team tackled these challenges by training Irish Alexa on a relatively small corpus of recordings by voice actors who spoke Irish-accented English. Initially, there were some glitches, with dropped letters and syllables, as well as voice pitch inconsistencies. However, the team fine-tuned the model, addressing each issue meticulously to create a more accurate Irish accent.
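The fine-tuning process described above can be caricatured as repeated error-driven updates against a small corpus of target recordings. This is a minimal sketch under stated assumptions, not the team’s training code: a single parameter stands in for the millions of neural-network weights a real text-to-speech model would adjust, and the target values are invented.

```python
# Minimal caricature of fine-tuning on a small corpus: a model parameter
# is nudged toward native-speaker targets whenever the synthesized output
# deviates from the recordings. Purely illustrative; a real TTS model
# updates millions of weights, not one scalar.

def fine_tune(model: float, corpus: list, lr: float = 0.1,
              steps: int = 200) -> float:
    """Gradient descent on the mean squared error against the recordings."""
    for _ in range(steps):
        # Gradient of mean((model - target)^2) with respect to model.
        grad = sum(2 * (model - target) for target in corpus) / len(corpus)
        model -= lr * grad
    return model

# Hypothetical native-speaker targets for one pronunciation feature,
# e.g. how strongly the "r" in "party" is voiced:
recordings = [0.9, 0.85, 0.95]

# Start from a model that barely voices the "r" and fine-tune it:
tuned = fine_tune(model=0.1, corpus=recordings)
```

The loop converges toward the corpus mean, mirroring in miniature how each round of fine-tuning pulled the model’s output closer to the voice actors’ recordings.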
Experts in Irish linguistics praised the accuracy of Irish Alexa’s accent, noting that it successfully emphasized “r’s” and softened “t’s.” The positive feedback encouraged the researchers, giving them hope that their methodology could be extended to replicate accents in languages other than English.
In conclusion, Marius Cotescu, Georgi Tinchev, and their team at Amazon have made significant progress in voice disentanglement, using AI and recordings to improve the accuracy of regional accents in digital assistants like Alexa. This advancement paves the way for more natural and conversational interactions with AI-powered devices.