SutaCentral, read me a sutta (Emma)

karl_lew · August 10, 2018, 8:11pm

I’ve been experimenting with the “SuttaCentral, read me a sutta” concept, especially in the area of text-to-speech (TTS). I’ve evaluated several ways of converting printed material to speech:

Amazon Polly
IBM Watson TextToSpeech
ChromeVox
ispeech.org
several others too bad to list

As the primary criterion for choosing among these, I would propose that any automatic speech for suttas should ideally be conducive to aware and tranquil listening. In other words, not only is the informational content important, but the delivery itself should be clear, dispassionate, engaging and smooth.

The services/software listed above are shown in decreasing order of acceptability. Amazon Polly is the best and offers many voices, many of which are conversationally engaging and smooth. Here is an excerpt of The Root of Existence MN1 spoken by AWS Polly using the Emma and Amy voices at varying pitches and rates. For example, p30 means -30% of natural voice pitch, and r10 means -10% of normal speaking speed.

The Root of Existence MN1/Bodhi AWSPolly/Emma/p30r10
The Root of Existence MN1/Bodhi AWSPolly/Emma/p20r20
The Root of Existence MN1/Bodhi AWSPolly/Emma/p15-r20
The Root of Existence MN1/Bodhi AWSPolly/Amy/p0-r20
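For anyone who wants to reproduce a sample, the labels above can be turned into SSML mechanically. This is only a minimal sketch: the function name `label_to_ssml` is made up for illustration, and the relative-percentage form for `rate` simply mirrors the `rate="-20%"` SSML posted further down this thread. Check your TTS service's SSML documentation for the exact pitch and rate syntax it accepts.

```python
import re
from xml.sax.saxutils import escape

# Label format taken from the sample names above: "p30r10", "p15-r20", "p0-r20".
# The helper itself is hypothetical and not part of any library.
LABEL_RE = re.compile(r"^p(\d+)-?r(\d+)$")


def label_to_ssml(label: str, text: str) -> str:
    """Wrap plain text in an SSML prosody element for a label such as 'p30r10'."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    pitch_cut, rate_cut = (int(group) for group in match.groups())
    # Both numbers are percentage reductions, so they become negative offsets.
    # A zero reduction is written "+0%" to keep the value explicitly relative.
    pitch = f"-{pitch_cut}%" if pitch_cut else "+0%"
    rate = f"-{rate_cut}%" if rate_cut else "+0%"
    return (
        f'<speak><prosody pitch="{pitch}" rate="{rate}">'
        f"{escape(text)}</prosody></speak>"
    )


print(label_to_ssml("p30r10", "earth as earth"))
```

The sketch only builds the SSML string; sending it to a service is left out on purpose.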

Please listen to these samples and share your thoughts on:

clarity: are the words spoken clearly? What words are difficult?
engagement: how long could you listen to such a voice? one sutta? several suttas? an afternoon of suttas?
affect: how does the speech affect you? Is it irritating? Is it neutral? Is it calming?
content: is there any sutta excerpt you’d like to hear? I’ll be happy to generate it.
anything else that comes to mind

NOTE: Please ignore the advertisements. They are from SoundCloud, which I’m using just for sharing prototypes. SuttaCentral itself would not have such ads.

Viveka · August 10, 2018, 9:17pm

clarity: are the words spoken clearly? What words are difficult?

It took me the whole of the first section to ‘align’ my hearing to the speech. The pitch and the ‘whispery’ quality made it difficult to distinguish between words. I didn’t recognise the word ‘Earth’ until the last repetition.

engagement: how long could you listen to such a voice? one sutta? several suttas? an afternoon of suttas?

It definitely gets easier the longer one listens. I would classify it as tolerable rather than pleasant. I would definitely only use it if there is a strong need, i.e. no other option.

affect: how does the speech affect you? Is it irritating? Is it neutral? Is it calming?

While not immediately associated with calming feelings, one could habituate to it - would take some effort - but that is part of mind discipline

content: is there any sutta excerpt you’d like to hear? I’ll be happy to generate it.
anything else that comes to mind

Thank-you, this is most kind of you

Sadhu Sadhu Sadhu!

mikenz66 · August 10, 2018, 9:23pm

Thanks Karl, that’s amazing.

For my taste, it’s a little too fast. Is this something that could be controlled by the user?

Also, taking what I gather is a female voice and reducing the pitch sounds, to me, a little odd, and gives a slightly irritating edge. I was reminded of Gollum in the Lord of the Rings movies… Was there a reason for varying the pitch so much?

karl_lew · August 10, 2018, 9:39pm

User control would be a V2 feature, since it would require an active AWS connection and that would not be free. With V1, the proposal would be to record and store a default voice that could be used offline. In other words, we’re evaluating the default voice.

Given your feedback, I’ve uploaded a pitch-20% and rate-20% to achieve a more measured and deliberate cadence and less of a gruff old lady Gollum feeling. I’ve found that reducing the pitch tends to maintain intelligibility when lowering the rate.

slower-and-higher-Emma

karl_lew · August 10, 2018, 9:47pm

Thanks for the detailed feedback.

The alignment issue might be due to the rate reduction: it’s 10% slower than normal conversational speech. Normal Dhamma talks are much slower (e.g., ~30% slower) and I wanted to capture some of that without losing intelligibility (slower AI speech tends to get muddy). The whispery quality might be due to an excessive pitch reduction, which mikenz66 also remarked on. I’ve uploaded a new sample linked from the edited original post.

Glad listening got easier. I noticed that as well in my explorations, and it was a bit hard listening to thoroughly mangled AI speech produced by other vendors (for example, Watson could not say “supreme” in UK English).

mikenz66 · August 10, 2018, 9:52pm

That’s a lot better. I would prefer a higher pitch (if you use the female) or a male voice, but it’s quite listenable now.

karl_lew · August 10, 2018, 10:23pm

I uploaded a p-15% r-20% variation for Emma. I also realized that the Emma voice has more emotional color than the Amy voice, which I have uploaded at its natural pitch.

Gillian · August 11, 2018, 12:11am

Emma, p30r10: at first I thought there were significant echoes but after listening a few times I realised it was the result of repetitions in the text being repeated with no change in Emma’s prosody. I couldn’t cope with listening to this one.

Emma, p20r20: clearer, but the echo effect remains a problem.

Emma, p15r20: very similar to p20r20 but I prefer Emma at p15 to p20. Pity about the echo effect.

Amy, p0r20: I prefer Amy’s voice. Her fricatives are clearer. By now I’m getting used to the echo; however I fear it would become irritating after a while.

‘Earth’ was a stumbling block. I couldn’t identify it as a word in any of the ‘Emma’ readings; with ‘Amy’ I wondered if it was ‘earth’ or ‘death’ and had to check the sutta to find out. Why can’t they do fricatives?

You may get different responses from listeners depending on the variety of English they are used to. I’m speaking from a British and Australian point of view.

Despite my criticisms the overall result is amazingly good. And many sadhus to you for this project. More and more I prefer listening to reading, especially with material that I repeat a lot. I really look forward to your project being implemented.

I understand that achieving meaningful variation in prosodic features (pitch, stress, intonation) is the biggest challenge in this sort of technology, and it is a pity it is so hugely important for English. I don’t suppose that there’s a voice out there that could read in a monotonal chant? It might do better.

Best of luck with this.

sujato · August 11, 2018, 12:51am

Well, it’s interesting, and certainly listenable. None of them compare with Google’s infamous demo from a few months back, which really was virtually indistinguishable from human, at least for a short span of time. That one uses the WaveNet AI, which now underlies Google Assistant. But I’m not sure if that can be used as a general purpose reader yet.

Don’t forget, we have suttas read by humans already, and we want to get them out there.

The automated tts would be suited for the UI and for suttas that we don’t have in human-read form.

And in that context, the quality of the voice is probably less important than the overall UI design; knowing what questions people will ask, how to answer them, and how to get people where they want.

And BTW, I don’t know if it was deliberate, but the heading is perfect: suta in Pali means “heard” (as opposed to sutta, “discourse”).

I’ve begun developing an overall approach to a11y here, but it is only in its beginning stages. Please feel free to add to it!

karl_lew · August 11, 2018, 3:04am

I also prefer slow Amy, and it may well be the least objectionable. We can actually have a custom IPA lexicon to deal with “Earth” and other oddities that are garbled. For example, “bhikkhu” didn’t come across well, and SuttaCentral has three different spellings of that word (discounting case). Indeed, the samples you heard already had such substitutions for “bhikkhu” and “Tathagata”. Adding “earth” should be easy and would apply automatically across all of SuttaCentral. Thank you for your careful listening and very helpful feedback!

karl_lew · August 11, 2018, 3:52am

Bhante, I also drooled over WaveNet, but their API is in beta with no guarantee of backwards compatibility. To deal with the various existing and forthcoming TTS AI services, I’ve created a small GitHub JavaScript repository, sc-voice, which will read and understand Pootle files and also allow us to plug in any TTS service. It already works with Watson TextToSpeech and I’m currently writing the AWS Polly adapter. Writing a WaveNet adapter shouldn’t be that hard once WaveNet matures.
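To make the "plug in any TTS service" idea concrete, here is a rough sketch of an adapter layer. It is written in Python purely for illustration (sc-voice itself is JavaScript), and every name in it (`TtsAdapter`, `EchoAdapter`, `SegmentReader`) is hypothetical rather than taken from the repository. No real service is called; the stand-in adapter just echoes the SSML back as bytes.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class TtsAdapter(ABC):
    """Hypothetical interface: one subclass per TTS service."""

    @abstractmethod
    def synthesize(self, ssml: str) -> bytes:
        """Return audio bytes for the given SSML document."""


class EchoAdapter(TtsAdapter):
    """Stand-in adapter for experiments: returns the SSML itself as bytes."""

    def synthesize(self, ssml: str) -> bytes:
        return ssml.encode("utf-8")


class SegmentReader:
    """Reads text segments one by one through whichever adapter is plugged in."""

    def __init__(self, adapter: TtsAdapter) -> None:
        self.adapter = adapter

    def read(self, segments: list[str]) -> list[bytes]:
        # One audio clip per segment, so playback can start at any segment.
        return [
            self.adapter.synthesize(f"<speak>{escape(segment)}</speak>")
            for segment in segments
        ]


reader = SegmentReader(EchoAdapter())
clips = reader.read(["So I have heard.", "At one time the Buddha"])
print(len(clips))
```

A Watson, Polly or WaveNet adapter would then only need to implement `synthesize`, and the rest of the reader would stay unchanged.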

The sc-voice library will take advantage of the text segment approach pioneered and used extensively in your own translations. Specifically, we’ll be able to deliver TTS search results at the text segment, paragraph or sutta level. And after experimenting with IPA, I have a hope that we may also offer Pali TTS in later releases with the same granularity levels of segment, paragraph or sutta. I’ve learned so much following all the discussions on SC discussing the various interpretations of specific Pali phrases. I’m slowly realizing the critical importance of learning Pali while studying the Dhamma.

SuttaCentral search does seem to be a good candidate for initial assisted improvements. I think the Discuss and Discover forum style would be much trickier given that it requires search as well as quoting and editing. Search is “just one text box”. I haven’t looked at Alexa just yet, but AWS does have an Alexa voice service. This could be used for the initial search query and potentially for search result traversal.

In my deep dive into assistive technology this past week I’ve learned a lot about ARIA, ChromeVox, and other stuff. I even have Braille stick-ons for my laptop keyboard. I am truly amazed at the proficiency displayed by long-time users of screen readers. They are magicians! Sadly, I am also becoming sharply aware that such proficiency remains out of reach to many of us coming into disability late in life. After years of trying, I still do not know Braille. What does resonate for me is this line from the SC GitHub issue:

Create dedicated UI for using Alexa, Siri, Google, etc.

P.S. Thanks for catching my misspelling of SuttaCentral. I guess I’ll leave it as is.

sujato · August 11, 2018, 3:56am

That is all great, we should talk soon.

Just one detail:

We are hoping to move away from .po files and possibly Pootle. In any case, assume that the data will be JSON that is something like this:

github.com/suttacentral/suttacentral

Use standoff data in segmented texts

opened 10:29AM - 20 Jun 18 UTC

closed 07:35AM - 11 Sep 19 UTC

sujato

Type: improvement P3

> It's the job that's never started as takes longest to finish.

We have discussed many times the desirability of using standoff markup for text data on SC. With our segmented texts we finally have a chance to do this.

Just to clarify the terminology, as per Schmidt, section 1. [standoff-properties.pdf](https://github.com/suttacentral/suttacentral/files/2259114/standoff-properties.pdf)

Standoff **markup** is mostly what we are talking about here. Standoff **properties** (which allow a lossless incorporation of metadata at any granularity) are more sophisticated, and proposing such a system is Schmidt's main thesis. These come into play in the final section of this page.

## The Problem

Currently we keep our texts in either a translation format (PO) or a presentation format (HTML). Both of these involve serious compromises. Without going into too much detail, the basic problem is that such formats, being pressed into serving a purpose for which they were not designed, end up mixing various kinds of data. Instead, we should separate each kind of data cleanly and consistently using a dedicated data format, probably JSON.

The advantage of standoff markup is that it allows an unlimited set of data to be associated with the source, without cluttering up the source files. This can include such things as:

1. Multiple references from different editions, with more that can be added over time.
2. HTML or other markup.
3. Notes.
4. Variant readings
5. Anything else!

## The Idea

The data needs to preserve three things for each item:

1. The segment ID: a number that stands for the abstract idea of the segment and is unique across the entire SC corpus.
2. The content: a string of utf-8 glyphs that represents a line of text, a reference number, and so on.
3. The attribute: the kind of thing that the data is; for example an English translation of the Pali by Sujato.

These things should be preserved cleanly, simply, unambiguously, and locally (i.e. you should not have to go to another file to find out who the author is.) We can do this by making everything a JSON object.

1. The segment ID is the key of the JSON object.
2. The content is the value of the JSON object.
3. The attribute is defined in the file.

Currently in DN 1 we have a PO file that records the following information about the first segment of DN 1.

    #. </h2><p>
    #. <a class="pts-cs" id="pts-cs1.1"></a>
    #. <a class="pts-vp-pli" id="pts-vp-pli1.1"></a>
    #. <a class="sc" id="sc1"></a>
    #. <span class="evam">
    msgctxt "dn1:1.1.1"
    msgid "Evaṃ me sutaṃ—"
    msgstr "So I have heard."

Notice that this breaks the rules on a number of fronts:

1. Different kinds of stuff is mixed: references, HTML, source text, translation text. In other segments, variant readings or comments might also be found.
2. It is verbose, including all the stuff necessary for HTML markup of references.
3. There is nothing to identify the edition or the author.

Since each segment has a `msgctxt` number, we can split this into data sets something like the following.

    {
      "ms-pli": [
        { "dn1:1.1.1": "Evaṃ me sutaṃ—" },
        { "dn1:1.1.2": "ekaṃ samayaṃ bhagavā" }
      ]
    }

The root key tells us the attribute, which is *universal* to this data set. Here it is `ms-pli`, i.e. the Mahasangiti edition of the Pali Tipitaka. The segment IDs and text strings form key/value objects. They are ordered, so form an array.

A translation can be handled similarly.

    {
      "sujato-en": [
        { "dn1:1.1.1": "So I have heard." },
        { "dn1:1.1.2": "At one time the Buddha" }
      ]
    }

This tells us that it is an English translation by Sujato; that the translation is of DN 1; the identity of each segment; and the content of the translation itself.

For references we can include multiple editions within one array. It would also be possible to have each edition in a separate file. Here the key is, as always, the segment ID. This matches an object which is a set of key/value pairs, each representing a specific reference system + number.

    {
      "refs": [
        { "dn1:1.1.1": { "pts-cs": "1.1", "pts-vp-pli": "1.1", "msdiv": "1" } },
        { "dn1:1.1.2": { } }
      ]
    }

In another file can have markup. Note that this can be done better than in PO, for in PO we can only add markup at the beginning of a segment, and hence must *close* the previous html tag where applicable (see above `</h2><p>`). This creates problems; for example, at the end of a sutta, there is nowhere for the final tags to be closed. Much better to do something like this:

    {
      "html": [
        { "dn1:1.1.1": "<p><span class="evam">{{1stString}}</span>" },
        { "dn1:1.1.2": "{{2ndString}}</p>" }
      ]
    }

## Usage

Once the data is cleanly separated like this, we can mix and match it as we wish. Some use cases:

- For search, just take the text. Or index the refs separately.
- For online, import just the text and HTML, and bring in other data dynamically as requested.
- For Pootle, leave out refs and HTML, but include notes and variants.
- Presentation markup can be adapted for various cases. For example, different languages might handle paragraph conventions differently.
- We have already seen cases where people have used SC's data for themselves. By cleanly separating the sources, such third-party applications can flexibly use whatever data is convenient for them.
- Things we have not thought of! We set up elegant and powerful data, and the applications will suggest themselves.

## Note!!!

There are some bugs in the current pootle data, when converting to JSON we should squash them: https://github.com/suttacentral/suttacentral/issues/1115

## Standoff properties: do stuff inside segments

All the above assumes that the stuff we need to remove can be located outside a segment. And in most cases, this is true. The structural markup works outside segments. Refs such as page break lose a little granularity by going outside the segment, but nothing serious, in most cases. Variants and notes can work fine that way, too.

However certain things must live inside a segment. These include:

- Certain kinds of textual emphasis or styling: *all* consciousness is not-self!
- Some references would be awkward to handle outside segments. I am thinking primarily of line-breaks in the Taisho texts. We would commonly have two or more line breaks in one segment.
- Reconstructed and other text marked with text-critical markup, especially in Sanskrit texts and other manuscripts.

To handle such cases, a system called "standoff properties" has been developed using a JSON-based markup called STIL. https://discourse.suttacentral.net/uploads/default/original/2X/6/6056afc3c25fcf0e9b3e677c04ea4bc34b8151ab.pdf

Essentially, each inner span would be marked with three pieces of information, which represent:

1. The kind of property (`name`)
2. The number of glyphs since the start of the segment (`reloff`)
3. The length of the inner span (`len`)

Here is an example.

![stil](https://user-images.githubusercontent.com/6112010/43619593-4ab768f4-9701-11e8-803b-d4a60457a333.png)

Unfortunately, work on the basic tools for such a system is still ongoing, and I don't think it is mature yet. I would suspect that for the time being we will have to fake it a bit. Perhaps we could use a set of defined plain text glyphs to represent text-critical info, and markdown for styling.

BTW, this looks cool: https://github.com/dhlab-basel/Knora/pull/319

Our database already works this way.
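As a rough illustration of how a consumer might recombine such standoff data, here is a small Python sketch that joins a translation layer and a reference layer by segment ID. The two JSON samples are copied from the issue quoted above; the function names `load_layer` and `merge_layers` are made up for this example, and the file layout actually used on SuttaCentral may differ.

```python
import json

# Sample layers copied from the quoted issue; each file has one root key
# (the attribute) holding an ordered array of {segment_id: value} objects.
TEXT_JSON = """
{ "sujato-en": [
    { "dn1:1.1.1": "So I have heard." },
    { "dn1:1.1.2": "At one time the Buddha" }
] }
"""
REFS_JSON = """
{ "refs": [
    { "dn1:1.1.1": { "pts-cs": "1.1", "pts-vp-pli": "1.1", "msdiv": "1" } },
    { "dn1:1.1.2": { } }
] }
"""


def load_layer(raw: str):
    """Return (attribute, {segment_id: value}) for one standoff layer."""
    data = json.loads(raw)
    ((attribute, items),) = data.items()  # exactly one root key is assumed
    segments = {}
    for item in items:
        segments.update(item)  # dicts keep insertion order, so order survives
    return attribute, segments


def merge_layers(*raws: str) -> dict:
    """Combine several layers into {segment_id: {attribute: value}}."""
    merged = {}
    for raw in raws:
        attribute, segments = load_layer(raw)
        for segment_id, value in segments.items():
            merged.setdefault(segment_id, {})[attribute] = value
    return merged


merged = merge_layers(TEXT_JSON, REFS_JSON)
for segment_id, layers in merged.items():
    print(segment_id, layers["sujato-en"], layers["refs"])
```

A TTS reader would only need the text layer, while a web page could pull in the references and markup layers as well.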

karl_lew · August 11, 2018, 4:06am

That’s breathtaking!

I was actually wondering how I might go about suggesting JSON as a document format and was a bit hesitant to put that idea forward given the current investment in .po files. Yes. Please JSON by all means!

p.s., Should you need to contact me directly, I’m karl.m.lew on a certain *oogle electronic mail service.

sujato · August 11, 2018, 6:44am

Yes, .po is just a way of using the translation engine. In fact our database, ArangoDB, talks to the outside world in JSON. Its data is handled pretty much like in that link; which I, incidentally, was unaware of when developing that spec, until Blake showed me that we’re already using it! DBs are like a black box to me, I stay away from them as much as possible!

But the idea is to keep everything in JSON on Github, and one of the reasons is precisely to enable various uses of the data, such as yours.

Meanwhile, for consuming the data over the web, you may want to use our REST API.

sabbamitta · August 11, 2018, 8:30am

Thank you so much, Karl, for this work!

I have to say I also prefer Amy’s voice.

What I had a bit of difficulty understanding was the passage “earth as earth” (especially with Emma), and “…and that with being as condition” (with both voices). I am not a native English speaker; my native language is German.

Paliaudio.com · August 11, 2018, 12:31pm

Hi Karl

I was really impressed and surprised at the quality of the reading and preferred “slow” Amy.

We are working on producing audio versions of the suttas, and some of the areas of difficulty are emphasis, timing, pronunciation of Pali words, dealing with elisions, and how to differentiate between, say, questioner and responder.

A point I have noticed when listening to suttas is how errors seem to be very distracting, and there seems to be a tendency to focus in on areas where something isn’t quite right, e.g. the spacing of “earth as earth”.

In reviewing our readings I have had to accept that the result is inevitably unsatisfactory and not everyone will resonate with the style we have adopted.

To offer more choice would be a good thing, and as technology improves it does seem to offer a very worthwhile alternative to spoken suttas.

Best wishes and many thanks

Dave

I noted when listening that I have a tendency to pick up and be distracted by any issue.

karl_lew · August 11, 2018, 2:27pm

I would love to use the REST API. Sadly, it appears to be in a state of flux. I will need to install SC locally to investigate.

karl_lew · August 11, 2018, 3:15pm

Enabling voice assistance via TTS will require work beyond software. Specifically we will need careful listeners (you all are wonderful!) as well as people familiar with the International Phonetic Alphabet (IPA). To illustrate the work required, I’ve put together an example of how we might deal with a problematic phrase such as “earth as earth”, which was called out in your comments. I’ll let Slow Amy speak for herself:

Slow Amy: Earth as earth

The solution to “earth as earth” requires use of SSML to call out specific articulations of “earth” in IPA. SSML also lets us place pauses for emphasizing words.

<speak>
<prosody rate="-20%"> Slow Amy's normal speech:
earth as earth.

Slow Amy speaking with the aid of SSML input

<phoneme alphabet="ipa" ph="ɜ:rθ">earth</phoneme><break time=".1s"/> as
<phoneme alphabet="ipa" ph="ɜ:rθ">earth</phoneme><break time=".1s"/>.

Thank you
</prosody>
</speak>

Fortunately, we can create a SuttaCentral lexicon that maps “earth” to:

<phoneme alphabet="ipa" ph="ɜ:rθ">earth</phoneme><break time=".1s"/>
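A lexicon like that could be applied with a few lines of code before the text goes to the TTS service. The following is just a sketch under some assumptions: `LEXICON` and `apply_lexicon` are names made up for illustration, only the “earth” entry comes from this thread, lexicon keys are expected in lower case, and the input is taken to be plain text without existing SSML tags.

```python
import re

# Hypothetical lexicon: word (lower case) -> IPA string.
# Only the "earth" entry is taken from this thread.
LEXICON = {"earth": "ɜ:rθ"}


def apply_lexicon(text: str, lexicon: dict = LEXICON, pause: str = ".1s") -> str:
    """Wrap every lexicon word in an SSML phoneme tag followed by a short break."""
    if not lexicon:
        return text
    # Match whole words only, ignoring case, so "Earth" and "earth" both hit.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in lexicon) + r")\b",
        re.IGNORECASE,
    )

    def replace(match: re.Match) -> str:
        word = match.group(0)
        ipa = lexicon[word.lower()]
        return (
            f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'
            f'<break time="{pause}"/>'
        )

    return pattern.sub(replace, text)


print(apply_lexicon("Earth as earth."))
```

Because the substitution is driven by the dictionary, adding “bhikkhu” or “Tathagata” would just be another entry rather than another code change.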

Slow Amy will never have the marvelous resonant insight of a human speaker. Slow Amy is still a robot. And thanks to Paliaudio.com, we already do have a beautiful rendering of major suttas, including MN1. Perhaps Slow Amy can help provide voice assistance in finding a sutta or an excerpt of a sutta to help spread the Dhamma.

Enabling voice assistance will take time and patience. Thank you all for your help and advice.

Viveka · August 11, 2018, 9:31pm

Out of all the options, I much prefer slow Amy as well.

Thank you for your wonderful work

Nadine · August 12, 2018, 4:33am

How are you choosing the dialect for the robotic voices? Or is dialect sort of baked into the software? I noticed you are using the IPA. Is it a tool for creating pacing only?
This project is fascinating and I commend you for taking on the work.