Here I just used the Chrome browser developer pane’s Network tab to get the individual audio URLs.
I think you’ll have greater success than I did. I could not figure out how to reconcile the text time scale with the audio time scale. Letters take different amounts of time to articulate and I had no idea how to match the audio to the letter for training. By using time-based input instead of text, I was able to circumvent that conundrum. But the compression potential was lost in my shortcut.
This project was my own introduction to TensorFlow, and it’s been an incredible deep dive into some really cool stuff, all new to me. I’ve even had to take math courses to understand how MP3 works — for example, the Möbius transformations and the DFT.
It doesn’t actually use text to speech, it’s just a bunch of sounds files for each syllable. The pronunciation may not be perfect, but it’s certainly not completely wrong and at the least gives a sense of proper Pali cadence and stress.
My hope is that you can take the segments and the matching audio without further labeling or cutting effort and just let the GPU do the work using some of the already-built systems. Probably the most common English speech dataset also doesn’t appear to have sophisticated per-letter annotations, for example. Not sure if that would work, but I’ll try to find some time to try it out.
Wow, that sounds like it’s already been quite a journey for you.
Well, the Raspberry Pi 400 is having difficulty with Voice software, so there are currently two possibilities:
Download MP3s (or Opus) from Voice for offline listening. This is what I do. You can choose monolingual or bilingual. I currently listen to DN34 Pali/English, which is 3.5 hours of offline listening.
Download Voice to a Chromebook with a Linux subsystem. I develop Voice on Google’s Chromebook, but that’s a developer build that requires AWS payments, so it will be a lot more complicated than just using MP3s downloaded from Voice.
Sorry for the delay, I have not yet tried the original idea of the thread of training a model.
But I tried out the approach of concatenating audio of individual words and created a small prototype to compare the quality with the original sc-voice audio. It might be quite buggy, I think, because it’s only for testing purposes, and you might have to reload the page if it does not work; but after the text shows up you can click on “Speak” and it should start reading. Something like this could be built as an offline desktop application pretty easily (because it just needs to host static assets), for example using redbean, and then it could be downloaded as a single file of ca. 2 GB.
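The word-concatenation idea can be sketched in a few lines. This is a minimal illustration, not the prototype’s actual code: it assumes each word’s audio has already been decoded to a list of mono PCM samples at 16 kHz, and the gap length is a made-up placeholder.

```python
# Sketch: join per-word PCM sample lists with a short silence gap
# between words, roughly what concatenating word audio files does.
# SAMPLE_RATE and gap_seconds are illustrative assumptions.

SAMPLE_RATE = 16000  # assumed mono sample rate

def concatenate_words(word_samples, gap_seconds=0.05):
    """Join word audio, inserting gap_seconds of silence between words."""
    gap = [0] * int(gap_seconds * SAMPLE_RATE)
    out = []
    for i, samples in enumerate(word_samples):
        if i > 0:
            out.extend(gap)  # inter-word pause
        out.extend(samples)
    return out

# Tiny fake "words" (real data would come from decoded audio files).
words = [[100, 200, 300], [400, 500]]
sentence = concatenate_words(words, gap_seconds=0.001)
print(len(sentence))  # 3 + 16 silent samples + 2 = 21
```

Making `gap_seconds` a user-adjustable parameter is what would let listeners tune the flow between words.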
Perhaps this is not read fluently enough, but I wanted to try that out before getting into more complicated methods. I cannot personally judge very well if that would be useful already because I don’t understand the language.
@karl_lew Mimic 3 Preview - Mycroft looks interesting because it claims to be able to “run completely offline on low-cost devices”, but it’s not yet released. Perhaps this would be a candidate for offline usage; only Pali would need to be trained additionally, as common languages are already included.
Compared with what Voice does, it’s indeed a bit slower. But apart from that it sounds great! One small thing: in the word susaṁvutā, a long “u” (in vutā) is spoken, but it should be a short one. But in Voice, Aditi doesn’t get that one right either. We (or actually, Karl) have worked quite a lot on Aditi’s pronunciation, and it’s not easy to get every edge case correct.
When clicking on your link the page loaded immediately and I could start listening. Of course this is just a tiny portion of the canon, and bugs usually like to hide in some obscure corners …
That certainly sounds a lot better than “syllable chopping”. At 2 GB, that’s well in reach of a Raspberry Pi. The cadence is a bit halting, which induces some listening fatigue. Perhaps that’s just the inter-word gap, which might be adjustable to a smaller value for a smoother flow?
Mycroft looks like it’s aiming at an open-source home assistant, which is fantastic. We have Google Assistant in the kitchen, and just setting timers while cooking is incredibly useful. We also use GA for spoken web queries — combined with cloud search, it is remarkably useful, although not up to par with typed search. I’ve been keeping an eye on this technology as well, since it should be possible to create a Voice Assistant (e.g., “SuttaCentral, please read Snp2.12”). I even have the Google RPi kit for custom training, but haven’t assembled it yet. One of our friends actually works on Google Home, so I’ve had some interesting chats with him about what is feasible or not.
There is, however, a huge gap in capability between the online and offline versions of these assistants. Offline versions currently have very limited vocabulary at the “simple magic spell” level, without any deep understanding of grammar and syntax. So a voice-activated kitchen timer would be feasible on Mycroft, but a sutta Voice assistant feels a bit of a stretch at the moment, much like Musk’s decades-old promise of “we’ll have self-driving in a year”. Overall, although we are indeed on the cusp of interactive offline voice interaction, we’re just not quite there yet.
I just now got scv-bilara to work on my RPi4, and that is just a first step towards getting Voice on RPi. It is a little slower, but it did find the root of suffering. I find that oddly…delightful.
I updated the prototype and added a number input (“0.1”) to trim away that amount of seconds from the beginning and end of each word. Perhaps that helps, but it’s a bit hacky (it would be better to first trim any silence at the beginning and end of the audio files, and then add padding).
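The “trim silence first, then add consistent padding” idea could look something like the sketch below. It operates on plain lists of PCM samples; the amplitude threshold and padding length are illustrative assumptions, not values from the prototype, and real recordings would need a tuned threshold.

```python
# Sketch: strip leading/trailing silence from a word's PCM samples,
# then add a fixed, consistent amount of silent padding on each side.
# threshold and pad_seconds are made-up values for illustration.

SAMPLE_RATE = 16000  # assumed mono sample rate

def trim_silence(samples, threshold=50):
    """Drop samples whose absolute amplitude is below threshold at both ends."""
    start = 0
    while start < len(samples) and abs(samples[start]) < threshold:
        start += 1
    end = len(samples)
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

def pad(samples, pad_seconds=0.02):
    """Add pad_seconds of silence before and after the trimmed word."""
    silence = [0] * int(pad_seconds * SAMPLE_RATE)
    return silence + samples + silence

# A fake "word" with quiet samples at both ends.
word = [0, 0, 3, 900, 800, 2, 0]
print(trim_silence(word))  # [900, 800]
```

Trimming to the actual speech and re-padding uniformly would make the inter-word gaps consistent, instead of cutting a fixed number of seconds that may remove real audio from short words.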