SutaCentral, read me a sutta (Emma)

I’ve been experimenting with the “SuttaCentral, read me a sutta” concept, especially in the area of text-to-speech (TTS). I’ve evaluated several ways of converting printed material to speech:

As the primary criterion for choosing among these, I would propose that any automatic speech for suttas should ideally be conducive to aware and tranquil listening. In other words, not only is the informational content important, but the delivery itself should be clear, dispassionate, engaging and smooth.

The services/software listed above are shown in decreasing order of acceptability. Amazon Polly is the best and offers many voices, many of which are conversationally engaging and smooth. Here is an excerpt of The Root of Existence MN1 spoken by AWS Polly using the Emma and Amy voices at varying pitches and rates. For example, p30 means -30% of natural voice pitch, and r10 means -10% of normal speaking speed.

The Root of Existence MN1/Bodhi AWSPolly/Emma/p30r10
The Root of Existence MN1/Bodhi AWSPolly/Emma/p20r20
The Root of Existence MN1/Bodhi AWSPolly/Emma/p15r20
The Root of Existence MN1/Bodhi AWSPolly/Amy/p0r20
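
For anyone who wants to reproduce these settings, the sample labels can be turned into SSML mechanically. This is only a rough sketch of the naming convention described above: the function name `label_to_prosody` is made up for illustration, and it assumes the TTS service accepts signed percentages in the `pitch` and `rate` attributes of `<prosody>`.

```python
import re
from xml.sax.saxutils import escape

# Labels look like "p30r10" (some were written "p15-r20"): pitch and rate
# reductions in percent. The optional hyphen covers both spellings.
LABEL_RE = re.compile(r"^p(\d+)-?r(\d+)$")


def _reduction(percent):
    """Format a reduction as a signed percentage, e.g. 30 -> "-30%"."""
    return f"-{percent}%" if percent else "+0%"


def label_to_prosody(label, text):
    """Wrap text in an SSML <prosody> element for a label such as "p30r10".

    Hypothetical helper: it only encodes the labeling convention used for
    the samples and is not part of any TTS SDK.
    """
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized sample label: {label!r}")
    pitch, rate = (int(group) for group in match.groups())
    return (
        f'<prosody pitch="{_reduction(pitch)}" rate="{_reduction(rate)}">'
        f"{escape(text)}</prosody>"
    )


print(label_to_prosody("p30r10", "earth as earth"))
```

A label with no pitch change, such as the Amy sample, comes out as `pitch="+0%"`.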

Please listen to these samples and share your thoughts on:

  • clarity: are the words spoken clearly? What words are difficult?
  • engagement: how long could you listen to such a voice? one sutta? several suttas? an afternoon of suttas?
  • affect: how does the speech affect you? Is it irritating? Is it neutral? Is it calming?
  • content: is there any sutta excerpt you’d like to hear? I’ll be happy to generate it.
  • anything else that comes to mind

NOTE: Please ignore the advertisements. They are from SoundCloud, which I’m using just for sharing prototypes. SuttaCentral itself would not have such ads.

8 Likes

clarity: are the words spoken clearly? What words are difficult?

It took me the whole of the first section to ‘align’ my hearing to the speech. The pitch and the ‘whispery’ quality made it difficult to distinguish between words. I didn’t recognise the word ‘Earth’ until the last repetition.

engagement: how long could you listen to such a voice? one sutta? several suttas? an afternoon of suttas?

It definitely gets easier the longer one listens. I would classify it as tolerable rather than pleasant - definitely only use it if there is a strong need, i.e. no other option.

affect: how does the speech affect you? Is it irritating? Is it neutral? Is it calming?

While not immediately associated with calming feelings, one could habituate to it - would take some effort - but that is part of mind discipline

content: is there any sutta excerpt you’d like to hear? I’ll be happy to generate it.
anything else that comes to mind

Thank you, this is most kind of you :smiley:

Sadhu Sadhu Sadhu!

1 Like

Thanks Karl, that’s amazing.

For my taste, it’s a little too fast. Is this something that could be controlled by the user?

Also, taking what I gather is a female voice and reducing the pitch sounds, to me, a little odd, and gives a slightly irritating edge. I was reminded of Gollum in the Lord of the Rings movies… Was there a reason for varying the pitch so much?

2 Likes

User control would be a V2 feature, since it would require an active AWS connection and that would not be free. With V1, the proposal would be to record and store a default voice that could be used offline. In other words, we’re evaluating the default voice.
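
To make the V1 idea concrete, here is a rough sketch of "record once, replay offline". Everything in it is hypothetical: `get_audio`, the cache layout and the `synthesize` callback are illustrative names, and the callback stands in for whatever paid TTS call ends up being used.

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir, voice, ssml):
    """Derive a stable file name from the voice and the SSML input."""
    key = hashlib.sha256(f"{voice}\n{ssml}".encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{key}.mp3"


def get_audio(cache_dir, voice, ssml, synthesize):
    """Return stored audio if present; otherwise synthesize once and store it.

    `synthesize(voice, ssml)` is an assumed callback that returns audio bytes.
    """
    path = cache_path(cache_dir, voice, ssml)
    if path.exists():
        return path.read_bytes()  # offline path: no TTS service needed
    audio = synthesize(voice, ssml)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

With something like this, only the first request for a given voice and text would need an active connection.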

Given your feedback, I’ve uploaded a sample at pitch -20% and rate -20% to achieve a more measured and deliberate cadence and less of a gruff old lady Gollum feeling. I’ve found that reducing the pitch tends to maintain intelligibility when lowering the rate.

slower-and-higher-Emma

Thanks for the detailed feedback. :pray:

The alignment issue might be due to the rate reduction: it’s 10% slower than normal conversational speech. Normal Dhamma talks are much slower (e.g., ~30% slower) and I wanted to capture some of that without losing intelligibility (slower AI speech tends to get muddy). The whispery quality might be due to an excessive pitch reduction, which mikenz66 also remarked on. I’ve uploaded a new sample linked from the edited original post.

Glad listening got easier. I noticed that as well in my explorations, and it was a bit hard listening to the thoroughly mangled AI speech produced by other vendors (for example, Watson could not say “supreme” in UK English).

1 Like

That’s a lot better. I would prefer a higher pitch (if you use the female) or a male voice, but it’s quite listenable now.

1 Like

I uploaded a p-15% r-20% variation for Emma. I also realized that the Emma voice has more emotional color than the Amy voice, which I have uploaded at its natural pitch.

Emma, p30r10: at first I thought there were significant echoes but after listening a few times I realised it was the result of repetitions in the text being repeated with no change in Emma’s prosody. I couldn’t cope with listening to this one.

Emma, p20r20: clearer, but the echo effect remains a problem.

Emma, p15r20: very similar to p20r20 but I prefer Emma at p15 to p20. Pity about the echo effect.

Amy, p0r20: I prefer Amy’s voice. Her fricatives are clearer. By now I’m getting used to the echo; however I fear it would become irritating after a while.

‘Earth’ was a stumbling block. I couldn’t identify it as a word in any of the ‘Emma’ readings; with ‘Amy’ I wondered if it was ‘earth’ or ‘death’ and had to check the sutta to find out. Why can’t they do fricatives?

You may get different responses from listeners depending on the variety of English they are used to. I’m speaking from a British and Australian point of view.

Despite my criticisms the overall result is amazingly good. And many sadhus to you for this project. More and more I prefer listening to reading, especially with material that I repeat a lot. I really look forward to your project being implemented.

I understand that achieving meaningful variation in prosodic features (pitch, stress, intonation) is the biggest challenge in this sort of technology, and it is a pity it is so hugely important for English. I don’t suppose that there’s a voice out there that could read in a monotonal chant? It might do better.

Best of luck with this.

2 Likes

Well, it’s interesting, and certainly listenable. None of them compare with Google’s infamous demo from a few months back, which really was virtually indistinguishable from human, at least for a short span of time. That one uses the WaveNet AI, which now underlies Google Assistant. But I’m not sure if that can be used as a general purpose reader yet.

Don’t forget, we have suttas read by humans already, and we want to get them out there.

The automated tts would be suited for the UI and for suttas that we don’t have in human-read form.

And in that context, the quality of the voice is probably less important than the overall UI design; knowing what questions people will ask, how to answer them, and how to get people where they want.

And BTW, I don’t know if it was deliberate, but the heading is perfect: suta in Pali means “heard” (as opposed to sutta, “discourse”).

I’ve begun developing an overall approach to a11y here, but it is only in its beginning stages. Please feel free to add to it!

4 Likes

I also prefer slow Amy, and it may well be the least objectionable. We can actually have a custom IPA lexicon to deal with “Earth” and other oddities that are garbled. For example, “bhikkhu” didn’t come across well, and SuttaCentral has three (3) different spellings of that word (discounting case). Indeed, the samples you heard already had such substitutions for bhikkhu and Tathagata. Adding earth should be easy and would apply automatically for all of SuttaCentral. Thank you for your careful listening and very helpful feedback!
:pray:

1 Like

Bhante, I also drooled over WaveNet, but their API is in beta with no guarantee of backwards compatibility. To deal with the various existing and forthcoming TTS AI services, I’ve created a small GitHub JavaScript repository, sc-voice, which will read and understand Pootle files and also allow us to plug in any TTS service. It already works with Watson TextToSpeech and I’m currently writing the AWS Polly adapter. Writing a WaveNet adapter shouldn’t be that hard once WaveNet matures.
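
To illustrate what "plug in any TTS service" could look like, here is a rough sketch of the adapter idea. It is written in Python purely for illustration (sc-voice itself is JavaScript), and every name in it (`TtsAdapter`, `EchoAdapter`, `Reader`) is hypothetical rather than the actual sc-voice API.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class TtsAdapter(ABC):
    """Hypothetical common interface that each TTS service would implement."""

    @abstractmethod
    def synthesize(self, ssml: str) -> bytes:
        """Return audio bytes for the given SSML document."""


class EchoAdapter(TtsAdapter):
    """Stand-in adapter for testing: returns the SSML itself as bytes."""

    def synthesize(self, ssml: str) -> bytes:
        return ssml.encode("utf-8")


class Reader:
    """Reads text segments aloud through whichever adapter is plugged in."""

    def __init__(self, adapter: TtsAdapter):
        self.adapter = adapter

    def read_segments(self, segments):
        # One synthesis request per segment, so results can later be
        # served at segment granularity.
        return [
            self.adapter.synthesize(f"<speak>{escape(segment)}</speak>")
            for segment in segments
        ]


reader = Reader(EchoAdapter())
print(reader.read_segments(["earth as earth"]))
```

A Watson or Polly adapter would then only need to implement `synthesize`, leaving the rest of the reader unchanged.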

The sc-voice library will take advantage of the text segment approach pioneered and used extensively in your own translations. Specifically, we’ll be able to deliver TTS search results at the text segment, paragraph or sutta level. And after experimenting with IPA, I have a hope that we may also offer Pali TTS in later releases with the same granularity levels of segment, paragraph or sutta. I’ve learned so much following all the discussions on SC about the various interpretations of specific Pali phrases. I’m slowly realizing the critical importance of learning Pali while studying the Dhamma.

SuttaCentral search does seem to be a good candidate for initial assisted improvements. I think the Discuss and Discover forum style would be much trickier given that it requires search as well as quoting and editing. Search is “just one text box”. I haven’t looked at Alexa just yet, but AWS does have an Alexa voice service. This could be used for the initial search query and potentially for search result traversal.

In my deep dive into assistive technology this past week I’ve learned a lot about ARIA, ChromeVox, and other stuff. I even have Braille stick-ons for my laptop keyboard. I am truly amazed at the proficiency displayed by long-time users of screen readers. They are magicians! Sadly, I am also becoming sharply aware that such proficiency remains out of reach to many of us coming into disability late in life. After years of trying, I still do not know Braille. What does resonate for me is this line from the SC GitHub issue:

Create dedicated UI for using Alexa, Siri, Google, etc.

:heart::pray:

P.S. Thanks for catching my misspelling of SuttaCentral. I guess I’ll leave it as is. :smiley:

2 Likes

That is all great, we should talk soon.

Just one detail:

We are hoping to move away from .po files and possibly Pootle. In any case, assume that the data will be JSON that is something like this:

Our database already works this way.

1 Like

That’s breathtaking!

I was actually wondering how I might go about suggesting JSON as a document format and was a bit hesitant to put that idea forward given the current investment in .po files. Yes. Please JSON by all means!

:pray:

p.s., Should you need to contact me directly, I’m karl.m.lew on a certain *oogle electronic mail service.

1 Like

Yes, .po is just a way of using the translation engine. In fact our database, ArangoDB, talks to the outside world in JSON. Its data is handled pretty much like in that link; which I, incidentally, was unaware of when developing that spec, until Blake showed me that we’re already using it! DBs are like a black box to me, I stay away from them as much as possible!

But the idea is to keep everything in JSON on Github, and one of the reasons is precisely to enable various uses of the data, such as yours.

Meanwhile, for consuming the data over the web, you may want to use our REST API.

1 Like

Thank you so much, Karl, for this work!

I have to say I also prefer Amy’s voice.

What I had a bit of difficulty understanding was the passage “earth as earth” (especially with Emma), and “…and that with being as condition” (with both voices). I am not a native English speaker; my native language is German.

3 Likes

Hi Karl

I was really impressed and surprised at the quality of the reading and preferred “slow” Amy.

We are working on producing audio versions of the suttas, and some of the areas of difficulty are emphasis, timing, pronunciation of Pali words, dealing with elisions and how to differentiate between, say, questioner and responder.

A point I have noticed when listening to suttas is how errors seem to be very distracting, and there seems to be a tendency to focus in on areas where something isn’t quite right, e.g. the spacing of “earth as earth”.

In reviewing our readings I have had to accept that the result is inevitably unsatisfactory and not everyone will resonate with the style we have adopted.

To offer more choice would be a good thing, and as technology improves it does seem to offer a very worthwhile alternative to spoken suttas.

Best wishes and many thanks

Dave

I noted when listening that I have a tendency to pick up and be distracted by any issue.

2 Likes

I would love to use the REST API. Sadly, it appears to be in a state of flux. I will need to install SC locally to investigate.

1 Like

Enabling voice assistance via TTS will require work beyond software. Specifically we will need careful listeners (you all are wonderful!) as well as people familiar with the International Phonetic Alphabet (IPA). To illustrate the work required, I’ve put together an example of how we might deal with a problematic phrase such as “earth as earth”, which was called out in your comments. I’ll let Slow Amy speak for herself:

Slow Amy: Earth as earth

The solution to “earth as earth” requires use of SSML to call out specific articulations of “earth” in IPA. SSML also lets us place pauses for emphasizing words.

<speak>
<prosody rate="-20%">
Slow Amy's normal speech:

earth as earth.

Slow Amy speaking with the aid of SSML input:

<phoneme alphabet="ipa" ph="ɜ:rθ">earth</phoneme><break time=".1s"/> as
<phoneme alphabet="ipa" ph="ɜ:rθ">earth</phoneme><break time=".1s"/>.

Thank you
</prosody>
</speak>

Fortunately, we can create a SuttaCentral lexicon that maps “earth” to:

<phoneme alphabet="ipa" ph="ɜ:rθ">earth</phoneme><break time=".1s"/>
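
As a rough illustration of how such a lexicon might be applied automatically, here is a minimal sketch in Python. The `LEXICON` dictionary and the `apply_lexicon` function are hypothetical names, not part of sc-voice or any TTS SDK, and the IPA string is simply the one used above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: lower-cased word -> IPA transcription.
# The single entry is the "earth" example from this thread.
LEXICON = {"earth": "ɜ:rθ"}


def apply_lexicon(text, lexicon=LEXICON, pause=".1s"):
    """Wrap every lexicon word in an SSML <phoneme> tag followed by a pause."""
    parts = []
    # Splitting on a captured (\w+) keeps both words and the text between them.
    for token in re.split(r"(\w+)", text):
        ipa = lexicon.get(token.lower())
        if ipa is None:
            parts.append(escape(token))  # unchanged text, XML-escaped
        else:
            parts.append(
                f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
                f'{escape(token)}</phoneme><break time="{pause}"/>'
            )
    return "".join(parts)


print(apply_lexicon("earth as earth."))
```

Matching is case-insensitive, so “Earth” at the start of a sentence gets the same treatment while keeping its original capitalization in the output.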

Slow Amy will never have the marvelous resonant insight of a human speaker. Slow Amy is still a robot. And thanks to Paliaudio.com, we already do have a beautiful rendering of major suttas, including MN1. Perhaps Slow Amy can help provide voice assistance in finding a sutta or an excerpt of a sutta to help spread the Dhamma.

Enabling voice assistance will take time and patience. Thank you all for your help and advice.

:pray:

3 Likes

Out of all the options, I much prefer slow Amy as well.

Thank you for your wonderful work :slight_smile:

2 Likes

How are you choosing the dialect for the robotic voices? Or is dialect sort of baked into the software? I noticed you are using the IPA. Is it a tool for creating pacing only?
This project is fascinating and I commend you for taking on the work.

1 Like