SuttaCentral Voice Assistant

karl_lew · November 24, 2018, 8:22pm

All AWS Polly voices rely on machine learning to infer cadence, inflection and pronunciation from context. These AI voices do an amazing job for their target languages, but still struggle with homographs that they have not been trained on. One classic example in the suttas is “lives” as in “my past lives” and “he lives now”. The AI voices usually understand the patterns of grammar, but not always. Like us, they fail in a quirky, individual way. Unlike us, they cannot be fixed without retraining. Humans can correct their pronunciation adaptively, but an AI voice would literally have to relearn everything for which it has already been trained, along with whatever additional corrections are needed. Amazon is in charge of training the voices and would no doubt hesitate at the rather large expense and risk of retraining any existing voice. The risk is a negative impact on all existing voice applications globally: not just our risk, but risk to everybody in the world using AWS voices.

Because the training of ML voices involves phrases and/or sentences (i.e., not just individual words), the pronunciation of an individual word is affected by surrounding words. This leads to inconsistencies in pronunciation of individual words such as desito. These inconsistencies have been quite vexing. I don’t know how to fix them. At best, the only control we have is over letter combinations. I have managed to nudge some words back into a believable universe by changing them slightly–e.g., “kkh” becomes “k.k\u02b0”.
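As a rough illustration of that kind of letter-combination nudge, here is a minimal Python sketch. It is not the actual SuttaCentral Voice code: the table and function names are hypothetical, and only the “kkh” entry comes from this post.

```python
# Hypothetical rule table. Only the "kkh" entry is taken from the post;
# any further entries would have to be worked out by listening.
LETTER_RULES = {
    "kkh": "k.k\u02b0",
}


def apply_letter_rules(word, rules=LETTER_RULES):
    """Replace each listed letter combination, trying longer combinations first."""
    for combo in sorted(rules, key=len, reverse=True):
        word = word.replace(combo, rules[combo])
    return word
```

For example, `apply_letter_rules("dukkha")` rewrites the “kkh” in the middle of the word and leaves the rest untouched. Words without a listed combination pass through unchanged.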

The same inconsistency occurs with other words that have a short “a” at the end.

Fortunately, consistent mispronunciations are tractable. We’ll need to know the rules for identifying a short “a” at the end of a word. Is it every ending “a” or only some? I need the letter combinations for ending “a” that need to be corrected. Once we determine those, Aminah can add them to the Release Plan as bugs.
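One way to start collecting those letter combinations might be to tally the endings of every word that finishes in a short “a” in a sample text, and then listen to a few words from each group. The Python sketch below is only an assumption about how that could be done. The function name and the two-letter window are hypothetical, and it cannot tell which endings are actually mispronounced. That still takes trained ears.

```python
import re
import unicodedata
from collections import Counter


def ending_a_combinations(text, width=2):
    """Count the final `width` letters of each word ending in short "a"."""
    # NFC keeps a long "ā" as a single code point, so it is not mistaken
    # for a short "a" followed by a combining macron.
    normalized = unicodedata.normalize("NFC", text.lower())
    # Runs of Unicode letters (no digits, underscores or punctuation).
    words = re.findall(r"[^\W\d_]+", normalized)
    return Counter(
        word[-width:]
        for word in words
        if len(word) >= width and word.endswith("a")
    )
```

Calling `ending_a_combinations(sutta_text).most_common()` on a sutta's text lists the most frequent endings first, which gives a rough order in which to check them.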

Aminah, this can perhaps go in the Release Plan as a bug, since it seems to be an “always wrong everywhere” thing. I am really glad we have Sabbamita’s trained ears–I have no idea what Pali should sound like and have been clumsily relying on the internet, not real people speaking around me.

Thank you very much for listening for these. I personally have found the awareness of inconsistencies arises subtly with repeated listening. I listen to DN33 and MN44 regularly, but with more people listening, our coverage will improve.

I’m afraid we may need to grant Russell mercy because of his status as a tourist (like Amy and Raveena) in Pali land. Raveena with her Hindi phonemes tends to be closer if Russell and Amy prove too annoying.

Raveena was trained on and is speaking English with the odd Pali hiccough thrown in. This is why Raveena is generally smoother. Aditi has been trained to speak Hindi or English and is speaking a foreign language (Pali) entirely. This is why Aditi’s Pali is rough and feels like words stitched together: that is exactly what it is.

All the voices have trouble speaking Pali-only. Raveena sounds even worse than Aditi. And Amy speaking Pali was truly, well, unspeakable–I wouldn’t want to inflict that on anyone. Yes, I share your sad dismay at Aditi’s choppiness–it is not correctable. What we need is an AWS Polly Pali voice trained with human voice recordings of many Pali phrases. For that to happen, we would need to approach Amazon and request a Pali voice, which would no doubt require a lot of money (millions?), time and effort. Such a request would be feasible with a much larger user base and might happen in Aminah’s lifetime, but perhaps not mine. Do not give up hope for it, but be prepared to wait for the right time.

Consistent mispronunciations that come from letter combinations, or that occur at the beginning or end of words, are fixable and can go in the Release Plan as bugs. Each mispronunciation typically takes a day to work out. Yes, they are maddening, but that estimate can guide your judgement about the value of each correction. Homographs (e.g., “lives”) should be documented in Mispronunciations and will not be fixed.