James Vincent in a report from last year:
The newscaster speaking voice was created by recording audio clips from real-life news channels, then using machine learning to spot patterns in how newscasters read the text. Speaking to The Verge, Amazon’s Trevor Wood, who oversees the application of AI in text-to-speech at Amazon, said this approach more easily captures the detail in human speaking styles. “It’s difficult to describe these nuances precisely in words, and a data-driven approach can discover and generalize these more efficiently than a human,” said Wood.
Notably, Amazon says it only took a few hours of data to teach Alexa the newscaster speaking voice, suggesting that a whole range of styles could be easily incorporated in the future. So far, Amazon has already added a whisper mode for Alexa, and after the upgrade to NTTS in the coming weeks we can probably expect a panoply of voices in 2019.
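As an aside on what "a range of styles" might look like in practice: Amazon's Polly text-to-speech service already exposes a newscaster style on its neural voices through an SSML domain tag. A minimal fragment, assuming a Polly neural voice that supports the news domain (the report itself doesn't say whether Alexa's style will be developer-selectable this way):

```
<speak>
  <amazon:domain name="news">
    Amazon has rolled out a newscaster speaking style for its
    neural text-to-speech voices.
  </amazon:domain>
</speak>
```

The same SSML mechanism is how styles like whispering are toggled today, so per-utterance style tags are a plausible path for the "panoply of voices" the quote anticipates.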
It's interesting to think about this in the context of each of these digital assistants having one set voice (though different voices in different regions/languages). This feels like something that has to change, but changing it is also a risk in these nascent days of vocal computing. One unquestionably nice thing about Alexa/Siri/etc. is that each is consistent. I imagine there's a very fine line between such consistency and monotony.