Here I just used the Chrome browser developer pane’s Network tab to get the individual audio URLs.
I think you’ll have greater success than I did. I could not figure out how to reconcile the text time scale with the audio time scale. Letters take different amounts of time to articulate and I had no idea how to match the audio to the letter for training. By using time-based input instead of text, I was able to circumvent that conundrum. But the compression potential was lost in my shortcut.
This project was my own introduction to TensorFlow, and it’s been an incredible deep dive into some really cool stuff, all new to me. I’ve even had to take math courses to understand how MP3 works — for example, the Möbius transformations and the DFT.
It doesn’t actually use text to speech, it’s just a bunch of sounds files for each syllable. The pronunciation may not be perfect, but it’s certainly not completely wrong and at the least gives a sense of proper Pali cadence and stress.
My hope is that you can take the segments and the matching audio without further labeling or cutting effort and just let the GPU do the work using some of the already-built systems. Probably the most common English speech dataset also doesn’t appear to have sophisticated per-letter annotations, for example. Not sure if that would work, but I’ll try to find some time to try it out.
Wow, that sounds like it’s already been quite a journey for you.
Well, the Raspberry Pi 400 is having difficulty with Voice software, so there are currently two possibilities:
Download MP3s (or Opus) from Voice for offline listening. This is what I do. You can choose monolingual or bilingual. I currently listen to DN34 Pali/English, which is 3.5 hours of offline listening.
Download Voice to a Chromebook with a Linux subsystem. I develop Voice on Google’s Chromebook, but that’s a developer build that requires AWS payments, so it will be a lot more complicated than just using MP3s downloaded from Voice.
Sorry for the delay, I have not yet tried the original idea of the thread of training a model.
But I tried out the approach of concatenating audio of individual words and created a small prototype to compare the quality with the original sc-voice audio. It might be quite buggy, I think, because it’s only for testing purposes, and you might have to reload the page if it does not work; but after the text shows up you can click on “Speak” and it should start reading. Something like this could be built as an offline desktop application pretty easily (because it just needs to host static assets), for example using redbean, and then it could be downloaded as a single file of ca. 2 GB.
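The word-concatenation idea can be sketched in a few lines. This is a minimal illustration, not the prototype’s actual code: it assumes each word’s audio has already been decoded to a list of mono PCM samples at 16 kHz, and the gap length is a made-up placeholder.

```python
# Sketch: join per-word PCM sample lists with a short silence gap
# between words, roughly what concatenating word audio files does.
# SAMPLE_RATE and gap_seconds are illustrative assumptions.

SAMPLE_RATE = 16000  # assumed mono sample rate

def concatenate_words(word_samples, gap_seconds=0.05):
    """Join word audio, inserting gap_seconds of silence between words."""
    gap = [0] * int(gap_seconds * SAMPLE_RATE)
    out = []
    for i, samples in enumerate(word_samples):
        if i > 0:
            out.extend(gap)  # inter-word pause
        out.extend(samples)
    return out

# Tiny fake "words" (real data would come from decoded audio files).
words = [[100, 200, 300], [400, 500]]
sentence = concatenate_words(words, gap_seconds=0.001)
print(len(sentence))  # 3 + 16 silent samples + 2 = 21
```

Making `gap_seconds` a user-adjustable parameter is what would let listeners tune the flow between words.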
Perhaps this is not read fluently enough, but I wanted to try that out before getting into more complicated methods. I cannot personally judge very well if that would be useful already because I don’t understand the language.
@karl_lew Mimic 3 Preview - Mycroft looks interesting because it claims to be able to “run completely offline on low-cost devices”, but it’s not yet released. Perhaps this would be a candidate for offline usage; only Pali would need to be trained additionally, as common languages are already included.
Compared with what Voice does, it’s indeed a bit slower. But apart from that it sounds great! One small thing: in the word susaṁvutā, a long “u” (in vutā) is spoken, but it should be a short one. But in Voice, Aditi doesn’t get that one right either. We (or actually, Karl) have worked quite a lot on Aditi’s pronunciation, and it’s not easy to get every edge case correct.
When clicking on your link the page loaded immediately and I could start listening. Of course this is just a tiny portion of the canon, and bugs usually like to hide in some obscure corners …
That certainly sounds a lot better than “syllable chopping”. At 2 GB, that’s well in reach of a Raspberry Pi. The cadence is a bit halting, which induces some listening fatigue. Perhaps that’s just the inter-word gap, which might be adjustable to a smaller value for a smoother flow?
Mycroft looks like it’s aiming at an open-source home assistant, which is fantastic. We have Google Assistant in the kitchen, and just setting timers while cooking is incredibly useful. We also use GA for spoken web queries — combined with cloud search, it is remarkably useful, although not up to par with typed search. I’ve been keeping an eye on this technology as well, since it should be possible to create a Voice Assistant (e.g., “SuttaCentral, please read Snp2.12”). I even have the Google RPi kit for custom training, but haven’t assembled it yet. One of our friends actually works on Google Home, so I’ve had some interesting chats with him about what is feasible or not.
There is, however, a huge gap in capability between the online and offline versions of these assistants. Offline versions currently have very limited vocabulary at the “simple magic spell” level, without any deep understanding of grammar and syntax. So a voice-activated kitchen timer would be feasible on Mycroft, but a sutta Voice assistant feels a bit of a stretch at the moment, much like Musk’s decades-old promise of “we’ll have self-driving in a year”. Overall, although we are indeed on the cusp of interactive offline voice interaction, we’re just not quite there yet.
I just now got scv-bilara to work on my RPi4, and that is just a first step towards getting Voice on RPi. It is a little slower, but it did find the root of suffering. I find that oddly…delightful.
I updated the prototype and added a number input (“0.1”) to trim away that amount of seconds from the beginning and end of each word. Perhaps that helps, but it’s a bit hacky (it would be better to first trim any silence at the beginning and end of the audio files, and then add padding).
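The “trim silence first, then add consistent padding” idea could look something like the sketch below. It operates on plain lists of PCM samples; the amplitude threshold and padding length are illustrative assumptions, not values from the prototype, and real recordings would need a tuned threshold.

```python
# Sketch: strip leading/trailing silence from a word's PCM samples,
# then add a fixed, consistent amount of silent padding on each side.
# threshold and pad_seconds are made-up values for illustration.

SAMPLE_RATE = 16000  # assumed mono sample rate

def trim_silence(samples, threshold=50):
    """Drop samples whose absolute amplitude is below threshold at both ends."""
    start = 0
    while start < len(samples) and abs(samples[start]) < threshold:
        start += 1
    end = len(samples)
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

def pad(samples, pad_seconds=0.02):
    """Add pad_seconds of silence before and after the trimmed word."""
    silence = [0] * int(pad_seconds * SAMPLE_RATE)
    return silence + samples + silence

# A fake "word" with quiet samples at both ends.
word = [0, 0, 3, 900, 800, 2, 0]
print(trim_silence(word))  # [900, 800]
```

Trimming to the actual speech and re-padding uniformly would make the inter-word gaps consistent, instead of cutting a fixed number of seconds that may remove real audio from short words.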