My new project: recording the suttas 🎙


Aditi does have a higher pitch, but my hypothesis is that Aeneas relies heavily on the inter-segment pauses for breathing. This provides a coarse segmentation.

If you listen carefully to Bhante’s segment #4 audio, you will hear two (2) breaths.

  • The first breath is an intermediate breath (only needed because the preceding inter-segment breath was too short).
  • The second breath is an inter-segment breath.

Both breaths are the same length, which presents Aeneas with a problem: there are more audio fragments than the text would imply. In these cases, my hypothesis is that Aeneas uses a time-weighted average of the TTS audio to determine the inter-segment boundary. This would explain why Aeneas arbitrarily chose to chop “ayasma” in half. In other words, I think Aeneas got confused here and simply sliced at a computed time value, totally disregarding any audio pauses.
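To make the hypothesis concrete, here is a toy sketch of what such a time-proportional fallback could look like: boundaries are placed at times proportional to the synthetic (TTS) segment durations, ignoring any pauses actually present in the human recording. This is an illustration of the guess above, not Aeneas’s actual algorithm, and all names are made up.

```python
def proportional_boundaries(tts_durations, recorded_total):
    """Hypothetical fallback: place segment boundaries at times
    proportional to the TTS segment durations, disregarding any
    silences in the real recording (which would explain a cut
    landing mid-word)."""
    total_tts = sum(tts_durations)
    boundaries = []
    elapsed = 0.0
    for d in tts_durations[:-1]:  # no boundary after the final segment
        elapsed += d
        boundaries.append(recorded_total * elapsed / total_tts)
    return boundaries

# Three TTS segments of 2s, 3s, 5s mapped onto a 20s human recording:
print(proportional_boundaries([2.0, 3.0, 5.0], 20.0))  # → [4.0, 10.0]
```

If the speaker’s pauses don’t fall near 4s and 10s, a cut like this slices straight through a word, which matches the “ayasma” symptom.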

When Bhante takes a deep breath before each segment, that deep breath itself provides a distinct timing block of silence. I’m guessing that Bhante is speaking from continuity born of immersion as one would in lengthy chants. In such a chanting mode, the breaths support the continuity of chanting to keep a certain cadence. Breaths would tend to be shallow or as needed.

By taking a deep breath before each segment, Bhante would be mindfully segmenting the audio himself. This is why I think Bhante @Sujato will want to decide how he wants the user to experience his audio. The difference is one of presentation. One may speak the sutta as a single entire full chant or one may simply speak each sutta segment mindfully. This is Bhante’s decision to make, not ours.

Note that the above is still a hypothesis. We only have the one recording of SN1.20.


I’ve just removed all breaths on sn1.20-pli:
That didn’t fix the chop of “ayasma”, but perhaps there is something to the cadence strategy, and perhaps adding pauses into the text where Bhante pauses would help. Maybe other TTS voice settings will help; I’ll hop on that next, tune the gate a lot more so that more is cut out without removing the soft starts of words, and work on getting SN1.20 okay. Night!
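For anyone curious what “tuning the gate” without “removing soft starts of words” involves, here is a minimal sketch of a noise gate with a hold counter: quiet samples are only muted after several consecutive quiet samples, so the gentle onset of a word survives. The threshold and hold values are invented for illustration; a real gate works on windows of audio with attack/release envelopes, not raw samples.

```python
def gate(samples, threshold=0.02, hold=3):
    """Toy noise gate: mute a run of low-amplitude samples, but only
    once `hold` quiet samples have already passed, so soft word
    onsets are preserved. Values are illustrative only."""
    out = list(samples)
    quiet = 0
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            quiet += 1
            if quiet > hold:
                out[i] = 0.0  # deep into a silence: safe to mute
        else:
            quiet = 0  # loud sample resets the counter
    return out

# The first few quiet samples (a soft onset) are kept intact:
print(gate([0.5, 0.01, 0.01, 0.01, 0.01, 0.5]))
# → [0.5, 0.01, 0.01, 0.01, 0.0, 0.5]
```

Raising `hold` keeps more of each soft start at the cost of leaving more background noise in place.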


I likely misunderstood; perhaps breaths are considered a segment boundary, but I will need to look at this later with fresher eyes. There is a way for it to segment based on a broad, teleprompter-like rate, heh, but I think perhaps the recording is just too different from the TTS and it’s mixing up where it’s up to. I’m very happy we have something to drill down into!


My hypothesis is that the 1.5 second pause between bhagava and rajagahe in SN1.20:1.2 of your link is confusing Aeneas.


Just to comment on the breathing/pause issue, the problem is that as time goes on we will have multiple different situations when handling segment breaks, including abbreviations, cases where the sentence really does have to flow over the segment break, and so on.

Currently I am working on the Sagathavagga, which is verse. So we have a nice, regular segment structure, with a reasonable pause between segments. This is not normal! Once we get into the prose, things will get far more messed up.

This is why we can’t design a system based on the assumption that I can control pauses or breaths; the linguistic context is just too diverse to rely on such assumptions. The only reliable indicator is the actual content of the words. So somehow we have to get Aeneas to work with this (or if it proves impossible, find another approach.)

Michael, this is all fantastic, but please use opus, not MP3!

I uploaded some more suttas today.


Aeneas does a credible job with most segments, as Michael’s web page shows. I propose that we simply use Aeneas as is with eSpeak and Michael’s optimizations for now. Auto-captioning is a technology that will get better over time. As consumers of that technology, we would simply accept the current state of the technology and re-segment existing recordings as the segmentation technology improves. Personally, I currently lack the skills to improve on what Aeneas does today.

Unless there are objections, I think Voice can start incorporating Aeneas segmented audio in a forthcoming release. As technology improves, we would update the segmentations of existing audio. Regardless of what we decide on Aeneas segmentation, Voice will also link to the full Opus audio files.


Yes, sounds good; just modify the segments. I’ve been playing with Aeneas settings, to no avail yet! However, perhaps this is an option for anyone wanting to get all the Pali phonemes down (I don’t think there are that many?). There is a piece of software on Ubuntu for eSpeak that lets you construct a map:

If all the rest of the recording process is going well, I can try my hand at getting it talking to Aditi too.

Sadhu on more recordings, Bhante @Sujato! I will try to update it all in the next few days. I also have some ideas on reading style that might help with how it sounds: e.g. perhaps words that start with S or F could be sounded a bit less subtly, even in a section meant to be pronounced more softly, because the expander has issues with that at present.

More importantly, if new recordings could be 48kHz instead of 44.1kHz, I think that would be better. It is also easy to convert all the older ones to 48kHz; the difference won’t be discernible. I thought 44.1kHz the safer choice, but it wasn’t the right decision.
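Converting 44.1kHz recordings to 48kHz is, as noted, lossless in practice for speech (upsampling adds no new information). A real conversion would use ffmpeg or sox with a proper low-pass filter, but as a rough illustration of what sample-rate conversion does, here is a naive linear-interpolation resampler with toy rates; it is a sketch only and not suitable for production audio.

```python
def resample(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustration only; real
    converters such as ffmpeg's aresample apply band-limited filters)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate   # position in the source signal
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a + (b - a) * frac)  # interpolate between neighbours
    return out

# Doubling the rate of a two-sample ramp:
print(resample([0.0, 1.0], 1, 2))  # → [0.0, 0.5, 1.0, 1.0]
```

Going from 44.1kHz to 48kHz only ever interpolates between existing samples, which is why the change is inaudible; downsampling is the direction that can discard audible content.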

Mp3s are just generated for the tuning pages and for testing with SC voice.


This is a great project, a big sadhu to all involved :heart_eyes:

Is this really discernible though? I was under the impression that human speech (not singing—and melodious recitation should really qualify as speech) rarely exceeds 8kHz?

I’ve been encoding speech audio to mp3 using 16kbps 11kHz mono and there is almost no perceived loss (when the original recording is in studio quality—poor-quality originals tend to amplify noise a bit more at these settings), while the file size is approximately 10 times smaller than with the standard encoding (128kbps, 44.1kHz, stereo).
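The size saving follows directly from the bitrate: an mp3’s size is essentially bitrate times duration, and the sample rate and channel count only matter through the bitrate you give the encoder. A quick back-of-the-envelope check (helper name is mine):

```python
def mp3_size_mb(bitrate_kbps, seconds):
    """Approximate mp3 file size: bitrate x duration, ignoring
    container overhead. kbps -> megabytes."""
    return bitrate_kbps * seconds / 8 / 1024

# A 10-minute sutta recording:
print(mp3_size_mb(128, 600))  # → 9.375 MB at the standard encoding
print(mp3_size_mb(16, 600))   # → 1.171875 MB at 16kbps mono
```

So 128kbps vs 16kbps is an 8x reduction from the bitrate alone; the “approximately 10 times” figure is consistent once per-file overhead and rounding are included.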


Young folks can hear up to 20kHz, perhaps higher. Nyquist sampling requires a minimum of 2x the highest frequency, so that would yield a lower bound of 40kHz for sampling. I am also curious about the benefit of the extra bandwidth at 48kHz, since higher sampling rates generally increase file size.
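The Nyquist arithmetic in this thread can be captured in one line; the same rule explains both the 40kHz lower bound for full human hearing and the 8kHz telephone-speech figure mentioned later.

```python
def min_sample_rate(max_hz):
    """Nyquist criterion: the sampling rate must be at least twice
    the highest frequency to be captured without aliasing."""
    return 2 * max_hz

print(min_sample_rate(20_000))  # → 40000: lower bound for 20kHz hearing
print(min_sample_rate(4_000))   # → 8000: enough for ordinary speech
```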

Let’s see if SC Voice can handle Opus files generated on your pipeline. This will be a good test for a future transition to Opus.

It’s taken us months to construct the Aditi map for Pali. Doing that again for eSpeak would be somewhat exhausting. :laughing:

I think that AI will naturally evolve over time to match human understanding of spoken speech. At such a time, audio segmentation will become a simple task for that future software. With that in mind, Aeneas is doing quite a good job as is.


What I was referring to was the capability of a human vocal apparatus to produce sounds in the above frequency range. Consonants should peak out at 5–6 kHz, and while the upper range for the higher resonances for the vowels seems unlimited, their produced power is too weak for them to be heard in practice.

Therefore I’m not convinced that any significant gain can be achieved by higher-frequency sampling. But I’m not an expert in this field by any means, so I can easily be completely wrong.

4kHz is a practical limit for ordinary speech (not singing) and it is usually enough to encode it reliably with double that range at 8kHz.


Michael, this week I’ll be adding code to the next release of Voice to link to the full audio recordings as they become available. Shall I refer to the links as you have provided or shall I refer to the Opus files that are not there yet?

BTW, a small request…if it’s not too much trouble, may we avoid the leading zero for the minor number? It’s actually harder to code for sn1.03 than it is to code for sn1.3. However if that is what you are given, then I can deal with it on my end.


On the SC main page the numbers are without zeroes.


I have set the encoding for 48khz for future recordings.


You are certainly welcome to record however you like, but I will weigh in on this point and confirm that the higher sampling rate will make no difference to your quality and will simply make your files larger.

[citation: I studied digital audio in college and was a professional sound engineer for a time]

[edit: 48k is useful for, e.g. recording sound effects for movies, where a) the higher frequencies matter and b) the director might later say something like “that’s perfect, but can it be a little lower?” 48k allows you to change timing, pitch, etc after the fact with less quality loss.]


While I’m here: instead of trying to line up the audio to the segments after the fact, it would be a lot easier if Bhante could just hit the spacebar on every segment boundary while recording. There are plenty of ways to hook up that kind of form-factor and then there’s no guesswork. Just an idea :smile:
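As a sketch of what the spacebar idea would produce: the tap timestamps, together with the total recording length, define the segment intervals directly, with no alignment guesswork. The function name and shape are mine, purely for illustration.

```python
def taps_to_segments(tap_times, total_duration):
    """Turn spacebar-tap timestamps (seconds from recording start)
    into (start, end) segment intervals covering the whole file."""
    edges = [0.0] + sorted(tap_times) + [total_duration]
    return list(zip(edges[:-1], edges[1:]))

# Two taps during a 15-second recording give three segments:
print(taps_to_segments([4.2, 9.7], 15.0))
# → [(0.0, 4.2), (4.2, 9.7), (9.7, 15.0)]
```

The counter-argument raised below still applies: this only works if the reader can tap accurately between words that run together, which is exactly the hard part.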


We cannot rely on there being a gap between segments. Often two segments will simply run on to each other.

I’ll check with Michael re why he wants the higher sampling rate.


Going online on SC Voice, sadhu!! I’ll make sure all of Bhante’s work thus far is up in Opus and MP3 this weekend; they will be different files with different timings. Are you going off the JSON, VTT, or SRT pulled live, or storing them on the SC Voice server?

Will change the URLs if it makes it easier! They just match the naming of the FLACs.

I can tell Opus files are reducing the bandwidth in any case; 20kHz is what it’s going down to, I’ve just checked (Opus (audio format) - Wikipedia). I will set the encoder to 8kHz and see if there’s any discernible difference for chanting; good to check, thanks! If reducing the sampling rate leads to less CPU needed on playback, that is good too.

The only reason I say 48kHz is because most things use 48kHz; 44.1kHz is really a standard from CD days. As I said, it doesn’t matter so much, but since 44.1kHz is basically considered legacy and there will be other things to plug these files into in the future, 48kHz is more common. I chose 44.1kHz originally because some of the processing tools wanted it.

Amazing to hear! Anything that can be tweaked will be great to hear about.


I’m not sure how tapping a button while you talk requires “a gap”?

Really? Is that so these days? Well, I guess I shouldn’t be too surprised: storage is so cheap!

CDs and DATs and even a reel-to-reel machine :joy: My studio experience was long ago and is now, apparently, obsolete, so I guess I’ll just go back to lurking and learning from y’all. :pray:


And then they charge us for bandwidth. :laughing:

I had a reel-to-reel in college for listening to hi-fi music. Now I listen to MP3s. Maybe I should choose a middle way. Maybe…Opus.

Thank you. Deleting those zeroes will be a big help.

The zeroes are complicated because Voice users do not type zeroes, and one never knows exactly how many zeroes to expect. It’s not always one; sometimes it’s two, sometimes it’s none, etc.
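One way around the unknown-padding problem is to normalize IDs on ingestion, stripping leading zeros from every numeric part so “sn1.03”, “sn01.3”, and “sn1.3” all compare equal. This is just an illustrative helper, not Voice’s actual code.

```python
import re

def normalize_uid(uid):
    """Strip leading zeros from each numeric component of a sutta ID,
    e.g. 'sn1.03' -> 'sn1.3' (illustrative, not Voice's implementation)."""
    return re.sub(r"0*(\d+)", lambda m: m.group(1), uid)

print(normalize_uid("sn1.03"))   # → sn1.3
print(normalize_uid("sn01.20"))  # → sn1.20
```

Comparing normalized forms means the linking code never has to guess how many zeros the file names carry.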

For now, I’ll adopt the crudest implementation, which is to simply provide an HTML link to your digitalocean .opus files. The full audio files can be heard directly in the browser, so I think all I’d need to provide is a Pali link and an English link for each uploaded sutta. Aminah is thinking about changing the Voice UI a bit to make the Other Resources more prominently available. It will therefore take us a few release iterations before we settle on how to fetch and present Bhante’s full audio recordings.

In a later release, I’ll be downloading your segmented audio files into dedicated audio caches on the Voice server. This will provide a smooth user listening experience without jarring pauses for retrieval. This latter task is much more work and has dependencies on VSM technology that I haven’t implemented yet. The segmented human audio will be quite interesting to work on and experience, since we’ll be able to combine, for example, Bhante’s Pali audio segments with Vicki’s German voice for the German translation of a sutta while showing the text for both at the same time, segment by segment!

I guess explosions and bullets and jets would need the higher frequencies!

But it never occurred to me that one would lower the pitch of a recorded human audio! Wow! That certainly would require higher frequency recording. I had the oddest mental flash on the suttas chanted by Bhante slowed down to Mongolian throat singer pitches. :scream_cat:

Although we do lower the pitch of Aditi by 10% to make it more accessible to users with high frequency hearing loss, we would probably not need to do that with human voices since we already have adequate low frequency voice coverage of all supported suttas.


Because words in a sentence normally run on one to the other. We don’t speak like this, wespeaklikethis. Cutting a run-on of words in real time, reliably and accurately, is practically impossible. It’s even more difficult if it is to be done without disturbing concentration or altering the natural phrasing. Anyway, there is no need: that’s the whole reason for using software to do it in post-processing. The software works fine; it is just being tweaked.