My new project: recording the suttas 🎙


I don’t think it’s as fractured as all that: I use webm on Linux just fine; and ogg has been supported on Edge since version 17.

It’s only Apple that is the outlier.

Thanks, I’m sure others are wondering what this is all about, too. For all the messed up things about the world, I find it really incredible that an advanced project like this is made, with the deliberate intent to provide the best possible quality freely available. Open source rocks!


Ah! Thanks for the correction. Politely exits


Hi all, I’ve put some tests up on the original sn1.1 - sn1.10 for Karl to target and for people to test load times. I will work on all the rest too, and can run it as Bhante records, with his permission. No processing yet, just normalisation, so this is just for possible testing on the SC app. The cutting currently sometimes cuts on parts of words, which I need to work on. It is at:
[ for whole sutta]( for whole sutta) for timings on new files with silence removal.

I might also be able to export as .vtt for easy use in Node.js subtitle libraries such as vtt-cue-object on npm, and for easy integration with YouTube too, if that helps, Karl? Otherwise, for the time being I will be focusing on getting the cuts a lot smoother and on processing/equalising/breath removal etc.
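For the .vtt idea, here’s a minimal sketch of turning segment timings into WebVTT, assuming the cutter produces (start, end, text) tuples in seconds (the tuple shape is my assumption):

```python
def to_timestamp(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def to_vtt(cues):
    """Render (start, end, text) cues as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{to_timestamp(start)} --> {to_timestamp(end)}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)
```

A file generated this way should load directly into an HTML `<track>` element or the usual Node subtitle libraries.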

The pipeline is almost alive I reckon!


For a given fidelity, we care about cache disk space and latency. Fast disk space costs money in the cloud. Latency is annoying to end users who hate lag. Caching solves latency at the cost of disk space. To support non-MP3 audio formats, we have to incur higher ongoing disk space costs for caching. For example, to store both CAF and OGG would double the cache size. And on-the-fly conversion might affect latency.

Currently, MP3 fits the sweet spot in that it can be cached as is and delivered as is to all platforms. Switching to another format for delivery involves tradeoffs between SSD cache space, conversion latency and fidelity.
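To make the disk tradeoff concrete, here’s a back-of-envelope sketch; the assumption that CAF and OGG segments are roughly MP3-sized at comparable fidelity is mine:

```python
def cache_size_gb(base_gb, format_ratios):
    """Cache size when storing one copy per delivered format.
    base_gb: size of the current single-format (MP3) cache.
    format_ratios: per-format size relative to MP3 (1.0 = same size)."""
    return base_gb * sum(format_ratios)

mp3_only = cache_size_gb(8, [1.0])               # 8 GB: today
caf_and_ogg = cache_size_gb(8, [1.0, 1.0])       # 16 GB: the "double" case
all_three = cache_size_gb(8, [1.0, 1.0, 1.0])    # 24 GB if MP3 stays too
```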

Opus is not 100% supported. Consider the following from the ffmpeg documentation:

8.4 opus

Opus encoder.

This is a native FFmpeg encoder for the Opus format. Currently it’s in development and only implements the CELT part of the codec. Its quality is usually worse than, and at best equal to, the libopus encoder.

What this means is that to support Opus on Voice, we will need to write code that others will eventually write. That code we would write is idiosyncratic throwaway code. However, I do think we’re “close enough” to start the transition to Opus. For example, we could phase in Opus for human voices using the following incremental progression that provides initial access quickly and eventually matures to a fast, high fidelity segmented solution:

  1. Archive in FLAC/Opus/CAF; link to sutta audio archive from Voice sutta (least work)
  2. Archive in FLAC; segment to Opus; cache in MP3; deliver in MP3 (least fidelity)
  3. Archive in FLAC; segment to Opus; cache in Opus; deliver in MP3, CAF, OGG (least disk)
  4. Archive in FLAC; segment to Opus; cache/deliver in MP3, CAF, OGG (least latency)

In other words, it will take a bit of work to deliver the fidelity we all want.
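Steps 2-4 above all hinge on segmenting the FLAC masters into Opus. A sketch of the per-segment conversion, assuming an ffmpeg build with libopus (file names and bitrate are hypothetical):

```python
def opus_segment_cmd(flac_path, start, end, out_path, bitrate="64k"):
    """Build an ffmpeg command that cuts [start, end) seconds from a FLAC
    archive file and encodes the slice with libopus."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start),          # seek before the input: fast
        "-i", flac_path,
        "-t", str(end - start),     # duration of the segment
        "-c:a", "libopus", "-b:a", bitrate,
        out_path,
    ]
```

Each command would then be run with `subprocess.run(cmd, check=True)` per Aeneas-produced segment boundary.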


How much would you estimate the costs to be for caching non-MP3s without negatively affecting latency?


Currently Voice is using up 12 months of free-tier AWS EC2 T2 8GB. I estimate the monthly costs will eventually be about USD17/month, but Voice usage is increasing and I am unfamiliar with AWS pricing small print.

Caching non-MP3s always reduces latency and increases disk cost. If we went from 8GB to 100GB, I think it might be USD30/month.
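Taking those two estimates at face value, the implied marginal storage cost is easy to back out (a linear interpolation between the two figures, nothing more):

```python
def usd_per_gb_month(cost_lo, gb_lo, cost_hi, gb_hi):
    """Implied marginal cost of cache disk, from two monthly estimates."""
    return (cost_hi - cost_lo) / (gb_hi - gb_lo)

rate = usd_per_gb_month(17, 8, 30, 100)  # roughly USD 0.14 per GB-month
```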

Converting non-MP3s on the fly reduces disk usage but increases CPU usage. That would also cost us. It’s difficult to estimate the CPU cost of conversion since it is load-related and load is not constant. Currently our CPU cost is negligible and we are not using up our CPU credits.

It’s actually fascinating that Voice is perfectly happy running in the AWS Free Tier (for 12 months). I’ve had to pay literally only a few cents here and there. The free tier will end this year, so we’ll get a better idea of ongoing costs.

Amusingly, it is actually feasible to keep Voice free by simply hosting it on a different AWS account every year. :laughing: Ethically, I think that is problematic. :japanese_goblin:


Here are the first three segments from aeneas-test/sn1.20-pli-mahasangiti-plain_map.json:

  • SN1.20:0.3

  • SN1.20:1.1

  • SN1.20:1.2

Here are the first three segments from the Pali WebM:

  • sn1.7:0.3

  • sn1.7:1.2

  • sn1.7:2.2


Bhante @Sujato, @MichaelH, I fear that none of us has a solution for the Pali Aeneas segmentation mis-alignment. The discrepancies are nominal for the first and last segments, but they get unacceptably large for the other segments. I would propose that at this time we simply put Pali Opus files for the full suttas into Wasabi so that Voice can refer to them at the sutta level. If this is agreeable, then let me know what help might be needed to convert and upload the FLACs as Opus files in Wasabi.


Hi Bhante and Karl!

I’m not sure where your Opus links for 1.7 were pointing; they should have been pointing to:

which isn’t both segments in one.

The intro was cut with incorrect silence levels; the earlier files were just to see if it could be plugged into the app. I’m not sure what sn1.7:1.2 referred to in that list of files!

The only issues I saw were cuts between segments landing mid-sound, and at the start of the chapter title. These two issues are fixed now.

Pali segmentation has also just been set to German, because the sounds of the Romanised letters are more correct - the “e” being “ay” - and Pali, like German, descends from Proto-Indo-European - check it out! Help:IPA/Standard German - Wikipedia

All of sn1.1 - 1.20, with initial processing for breath removal, a first equaliser try, VTT subtitles with the Pali segment as the text, and processed FLACs, will be up on DigitalOcean by the end of the day. I’m just battling with mundane file naming issues lol.

If you want to know the details: these mis-alignment issues arise when Aeneas is installed without the C extensions and all their dependencies, and without the setting ‘task_adjust_boundary_percent_value=50’. Without the C extensions (missing on Bhante’s machine, I now realise) the timings are way off. You need the C extensions for the timing to be close to correct:

sudo pip install -r requirements.txt
python setup.py build_ext --inplace

I’m inferring that you’re using the S3 API to pull into the cache, Karl. The whole sutta is accessible via the S3 API on DigitalOcean the same way as on Wasabi, and you have the IAM credentials that can be used if you want to pull the segment into the cache. I can also sync to Wasabi now if you like!

It might be smoother if the playouts happened from the whole sutta in any case.
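For reference, a sketch of such a pull with boto3, which works against any S3-compatible endpoint; the key layout, bucket, and endpoint region here are my assumptions:

```python
def object_key(sutta_id, lang="pli", fmt="opus"):
    """Hypothetical object key layout for whole-sutta files."""
    return f"{lang}/{sutta_id}.{fmt}"

def fetch_sutta(bucket, sutta_id, dest,
                endpoint="https://sgp1.digitaloceanspaces.com"):
    """Pull a whole-sutta file from an S3-compatible store into the local
    cache. DigitalOcean Spaces and Wasabi both speak the S3 API; only the
    endpoint_url differs."""
    import boto3  # credentials come from the usual AWS env vars / config
    s3 = boto3.client("s3", endpoint_url=endpoint)
    s3.download_file(bucket, object_key(sutta_id), dest)
```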

My apologies for delays, I have a job with extremely long hours and can only spend around 20 hours a week on this.



This sounds like very good news! :clap:

:joy: :+1:

Sadhu! :anjal:


SN1-20 is up, with files matching the usual naming convention. Will add MP3s tomorrow!

Some mid-word segment cuts are still happening, so my last message turned out to be wrong; I didn’t check it enough.

I think there might be a way to manually adjust segmentation for each sutta based on a waveform using subtitle-editing software too; it would just need to be coded.

Whole Sutta Example Paths:

Segments example URLs:

FLACs Whole Sutta Examples

At least now we can make adjustments in one place, and it just re-cuts and reprocesses them all.

Night Anagarika Sabamitta and all, thanks for encouragement! :slight_smile:


Thank you, Michael. This is exactly what I need to start experimenting.
Bhante @sujato, I will look into this and see if we can provide a Voice interface for Editors/Translators to access Aeneas segmentation functionality. This would allow you to preview Aeneas segmentation yourself as you wish. Would an Aeneas-segmented preview of your individual recordings be of some value to you? You would simply enter one of your GitHub URLs and Voice would show you a page of HTML5 audio segments such as this:
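Generating such a page is straightforward; a sketch (segment URLs are hypothetical):

```python
def preview_html(segment_urls):
    """Render a minimal preview page: one HTML5 audio control per segment,
    labelled with the file name."""
    rows = "\n".join(
        f'<li>{url.rsplit("/", 1)[-1]} <audio controls src="{url}"></audio></li>'
        for url in segment_urls
    )
    return f"<ul>\n{rows}\n</ul>"
```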

Very good! Thank you very much!


Thx so much Michael and Karl!

Not really! I’m hoping that the process will be automated and I won’t have to check things.

With the timing, it’s not such a big deal if one or two segments are divided in the wrong place, but if the whole thing gets out of whack that’s not acceptable. But it sounds like it’s getting back on track.


Cool! Then we’ll invest instead in automating the pipeline. I’ll be looking first at the Pali pipeline, simply because it will be the most ornery and difficult (e.g., German? :laughing: ). The English pipeline should be much easier. Painful before pleasant. :see_no_evil:

The other thing of interest is that we’ll be working on Voice Sound Modules (VSM). A VSM is a single file comprising a collection of sounds from one source (e.g., Mahasangiti or yourself or Ajahn Brahmali) and a single voice (e.g., a human speaker or an AI speaker), for a “logical unit” (vagga or nikaya), such that it fits within 1MB-1GB. A VSM will become the basic unit of S3-style storage.

The rationale for a VSM approach has arisen from my investigation into billing. We would be penalized for individual segmented audio files in S3-style storage. It’s not just the size, but also the mode and frequency of access which affects billing. Therefore we have to “containerize” sounds as VSMs for archival and retrieval. Containerization will also allow us to provision Voice caches from VSMs for servers around the world according to local need. An EU Voice server would have German VSMs. A Japanese Voice server would have Japanese VSMs. Etc.

VSMs will therefore be the primary unit of cost control and content management for Voice. It’s all a bit hand-wavy now, but over the next month or so I’ll be working on a more concrete specification.


A quick update on the Aeneas pipeline:

After a small delay to wipe my machine and re-install Ubuntu (because I had forgotten my password), I’ve been experimenting with installing Aeneas. I have yet to complete the rather laborious installation and am writing a script to assist with installation of Aeneas on a standard Ubuntu 16.04 environment.

During this lengthy process, one of the things I have noticed is that Aeneas uses eSpeak, apparently to generate waveforms for comparison with the human audio during the segmentation process. This seems to be an eminently sensible approach, i.e., comparing two audio waveforms. Yet that very approach raises some interesting questions regarding which TTS package we should use for Aeneas. There are several possibilities:

  • eSpeak is ancient and it is difficult to download and install, especially for writing an installation script.
  • eSpeak-ng is the next generation of eSpeak, which should be backwards compatible at the CLI level.
  • Aeneas supports AWS Polly, which means we could theoretically segment Bhante’s audio by having Aeneas match AWS Polly Aditi’s speech of the very same text.
  • Voice doesn’t use Aditi directly, but relies on a very special phoneme translator that we have all worked on together, led by Anagarika Sabbamitta as the chief listener. It might therefore be possible to create an Aeneas Aditi/Voice adapter for segmentation.

Personally, I think using Aeneas with the Voice version of Aditi would be the long-term way to go with the highest potential for segmentation accuracy. That is also the most work, but I think the benefit would be greater. This is all hypothetical currently, since I’m still slogging through the Aeneas installation.

Indeed, since the surest path to the Aeneas Aditi/Voice adapter is actually to do all the preceding steps successively, it will be a bit of a trudge. :footprints:
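For later reference, the TTS choice is made in aeneas’s runtime configuration. A sketch of the invocation, with tts=custom pointing at a (so far hypothetical) Voice/Aditi CustomTTSWrapper; task_language is "deu" here only because German has been the stand-in for Pali in this thread:

```python
def aeneas_cmd(audio, text, out_json, tts="espeak-ng", tts_path=None):
    """Build an aeneas execute_task invocation, selecting the TTS engine via
    the runtime configuration (-r). tts="custom" plus tts_path points at a
    CustomTTSWrapper, which is where a Voice/Aditi adapter would plug in."""
    task = "task_language=deu|is_text_type=plain|os_task_file_format=json"
    rconf = f"tts={tts}" + (f"|tts_path={tts_path}" if tts_path else "")
    return ["python", "-m", "aeneas.tools.execute_task",
            audio, text, task, out_json, f"-r={rconf}"]
```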


As I am puttering around with Aeneas, I have confirmed that Aeneas uses eSpeak to generate audio for comparison with Bhante’s audio. The following eSpeak output is an example of how badly it pronounces “evam me sutam”:

Now imagine how Aeneas might perform using Voice Aditi as comparison:

With this in mind, we will need to develop a Voice command line utility that mimics eSpeak functionality. This new voice utility will take arbitrary text from stdin and parameters that specify speaker (e.g., “aditi”) as well as language (e.g., “pli”). The output would be to an audio file for Aeneas to process.

This is actually quite encouraging, since such a voice utility is quite feasible and should work fine with Aeneas. However, we will need to develop this voice utility to proceed with the Pali auto-segmentation effort on Bhante’s audio recordings. I’ll be adding this as an issue for future implementation. Until the voice utility is ready, progress on auto-segmentation will be blocked.
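A skeleton of what that utility’s surface might look like, mirroring how Aeneas drives eSpeak (text on stdin, output file as a flag); the flag names and the synthesize hook are assumptions:

```python
import argparse
import sys

def synthesize(text, speaker, language, out_path):
    """Placeholder: the real version would run the Voice phoneme translator
    and AWS Polly, then write an audio file for Aeneas to consume."""
    raise NotImplementedError

def build_parser():
    # Flag names are hypothetical, chosen to be easy to call from Aeneas.
    p = argparse.ArgumentParser(prog="voice")
    p.add_argument("--speaker", default="aditi")
    p.add_argument("--language", default="pli")
    p.add_argument("-w", "--write", required=True, help="output audio file")
    return p

def main(argv=None):
    args = build_parser().parse_args(argv)
    synthesize(sys.stdin.read(), args.speaker, args.language, args.write)

if __name__ == "__main__":
    main()
```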


Thanks, that sounds like excellent progress. Indeed, having gone to the trouble of creating decent Pali TTS, it seems like the obvious solution to actually use it.

The Aeneas documentation seems to be quite thorough; I wonder if they give any guidance on using other voices.


Sadhu to using Aditi in Aeneas!! In theory we could then subtitle any YouTube video with tracts of Pali too! :slight_smile: Just need a Pali-detector!

The latest from me is:

More boring:

  • Changed the way it calculates segments to cut 300ms before start of next segment where possible, which looks better for segmentation to a degree, and tuned down speech area detection to mfcc_mask_log_energy_threshold=0.112
  • Shortened all silences to 2.2 seconds at most
  • Non-Unicode text has its speech characters like '!,; restored, to make the English TTS comparison happier.
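Pulling the knobs from this thread together: they live in two different aeneas strings - the boundary adjustment in the task configuration, the energy threshold in the runtime configuration. A sketch with the values reported above:

```python
def tuned_configs(percent=50, energy_threshold=0.112):
    """Build the aeneas task and runtime configuration strings with the
    tuning values discussed in this thread."""
    task = ("task_language=deu|is_text_type=plain|os_task_file_format=json"
            "|task_adjust_boundary_algorithm=percent"
            f"|task_adjust_boundary_percent_value={percent}")
    rconf = ("mfcc_mask_nonspeech=True"
             f"|mfcc_mask_log_energy_threshold={energy_threshold}")
    return task, rconf
```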


Sadly, the Aeneas CustomTTSWrappers are not proving useful. Indeed, what seems to be happening is that Aeneas is heavily influenced by breathing.

Consider SN1.20, segment 4 from Bhante’s sc-audio/aeneas-test:

ekam samayam bhagava rajagahe viharati tapodarame
Atha kho ayasma samiddhi rattiya paccusasamayam paccutthaya yena tapoda tenupasankami gattani parisincitum

Here is Aditi’s audio segment. Aditi, being a robot, does not breathe:

Here is Aeneas’s segmentation of Bhante’s audio with a breath taken between bhagava and rajagahe. In this segmentation, the choice of TTS engine is actually irrelevant (they both segment this way):

The intermediate breath creates a big difference in timings and may be throwing off the Aeneas segmenter. Notice that Aeneas has added parts of segment 5 (“Atha kho ay…”) to the end of segment 4. Also notice that this breath isn’t really necessary: there are other segments where Bhante simply takes a deep breath and rolls through much longer phrases than “ekam samayam bhagava”. I even tried your mfcc_mask_log_energy_threshold=0.112 parameter and it does not improve the segmentation of segments 4/5 of SN1.20 at all.

Therefore, I think we have the following options:

  • Accept Aeneas segmentation as is. Some segments will end or start at the wrong place, as shown above.
  • Ask Bhante @sujato to re-record SN1.20, taking a full deep breath before reciting each segment. We could then compare the new recording segmentation with the old recording segmentation to see if that single deep breath would benefit segmentation.


Wow, thanks for drilling down! I’ll introduce breath removal into processing and send now. It seemed happier without any processing in tests last week, but I’ll try to tune it for Aeneas instead, I guess. 1.20 has had lots of issues in both pli and eng; there are heaps of skips for eng.
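One way to do the breath/silence trimming is ffmpeg’s silenceremove filter. A sketch; the exact option semantics (in particular stop_silence as “amount of silence to keep”) should be verified against the ffmpeg filter docs and by ear, and the threshold is a guess:

```python
def trim_silence_cmd(in_path, out_path, max_silence=2.2, threshold_db=-40):
    """Build an ffmpeg command that trims silences, keeping at most
    max_silence seconds of each silent stretch."""
    f = ("silenceremove=stop_periods=-1"
         f":stop_duration={max_silence}"
         f":stop_threshold={threshold_db}dB"
         f":stop_silence={max_silence}")
    return ["ffmpeg", "-y", "-i", in_path, "-af", f, out_path]
```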

I wonder, does pitch perhaps come into play in the comparisons?