SuttaCentral

My new project: recording the suttas 🎙


#1

Readers of the forum will be aware of the ongoing development of an audio interface for suttas on SC via @karl_lew’s SC Voice, which uses computer-generated voices. We have also discussed the possibility of recording the suttas with humans.

Over the last few weeks, I have been gradually setting up a studio at the Lokanta Vihara, specifically for recording suttas. (We hope to do some podcasts and the like as well, but let’s focus on suttas for now.)

Thanks to the generous support of @michaelh I now have an excellent studio mic and preamp. We’ve also put quite a bit of work into sound-proofing my room. I mean, it’s nice anyway to have a quiet space, but it’s essential for good recordings.

We have been discussing making the recordings, and a number of people have shown an interest in doing the reading. Which is great! But when I read a few suttas for testing, I immediately realized that by reading them myself, I could also review the translations and improve the wording and phrasing. I gradually came round to the realization that I should do all the suttas myself.

I’ll read them through and make any adjustments to the translation, gradually working through all the nikayas. If anyone else is still interested in reading them as well, that’s fine of course, but I’d recommend working in my wake, so to speak, waiting until the adjusted texts are ready. This is, of course, one disadvantage of this approach compared to the computer-generated audio, which can be updated continually with the latest version.

The process will go something like this:

  • I’ll make the recording here, as a raw flac file.
  • Any changes to the text are added to SC.
  • I send audio and text to Michael, who processes the audio, generates a timing map of the text, and outputs the final product as opus files.
  • The raw files are archived.
  • The processed files are served to SC Voice. These are optimized segmented audio files, so can be served directly on SCV.
  • Processed files are also made into EPUB audio ebooks, released when each volume is ready.

I guess the whole project will take a year or so, reading suttas for an hour or two each day. We’re just about ready to get started, and I’ll keep you all up to date as to progress.


SuttaCentral Voice going one step further! V1.2.13 ready to go :-)
#2

Hi Bhante, this is fantastic news indeed! Looking forward to hearing this on SC-Voice… :ear:

Which is the first Nikaya you want to start with?


#3

This project sounds like delightful silence to my ears!

I think the SuttaCentral Podcast (SCP) would benefit immensely from a reliable database of human sutta recordings. I think it’d be best to postpone SCP to 2020, after you finish this project.

I’d be happy to volunteer for proofreading the revisions!


#4

This is exciting! I would love to listen to suttas in human voices especially from the sutta experts like our dear (familiar) bhantes @sujato and @brahmali … thank you and sadhu! :pray:t4:


#5

That’s great, Bhante @sujato, I hope this project keeps going from strength to strength!!


#6

:man_cartwheeling: :man_cartwheeling: :man_cartwheeling:

It is very difficult to practice restraint at the hearing of this wonderful news!

Thank you, Bhante, for this forthcoming gift to the world. With this gift, we will all be able to speak the Dhamma together with you in unison according to the best of the oral tradition, without needing books or computer screens or robots.

Sadhu! Sadhu! Sadhu!

:pray: :pray: :pray:


We are currently looking into cloud storage possibilities given the forthcoming massive influx of audio files. These are early days for us in the cloud storage area and we seek advice from others. Our current working hypothesis is that we would use any cloud storage (e.g., Wasabi) having an API compatible with the S3 Storage API.

The storage units would correspond to Voice caches. As Anagarika Sabbamitta has mentioned, we would store your recordings in caches by Nikaya, language, translator, and voice:

  • mn_en_sujato_sujato
  • dn_en_sujato_sujato
  • an_en_sujato_sujato
  • sn_en_sujato_sujato
  • kn_en_sujato_sujato

The caches are indexed by GUID computed as hash signatures of the JSON used as the TTS input. This means that we can follow your translation changes as you make them. Edited segments will automatically have new guids as the content changes. Unaffected segments will retain the same guids. This will allow us to present seamless “audio-pubs” that follow your updated translations.
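The GUID scheme described above can be sketched in a few lines. This is a minimal illustration, not the actual Voice implementation: the function name and the choice of MD5 over sorted-key JSON are assumptions (though the 32-hex-digit GUIDs shown below are at least consistent with an MD5 digest):

```python
import hashlib
import json

def cache_guid(entry: dict) -> str:
    """Compute a deterministic GUID for a cache entry by hashing its
    canonical JSON form. Keys are sorted so that semantically identical
    entries always hash to the same GUID."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Illustrative entries (field set is hypothetical):
a = {"api": "human", "voice": "Sujato", "text": "Why is that?"}
b = dict(a)                       # identical content -> same GUID
c = dict(a, text="Why is this?")  # edited text -> new GUID
```

Because the GUID is derived purely from content, an edited segment automatically gets a new GUID while untouched segments keep theirs.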

For example, here is the cache index for a spoken text segment. Note that it is without sutta_uid or segment reference since all identical text segments are spoken identically:

  {
    "api": "aws-polly",
    "apiVersion": "v4",
    "audioFormat": "mp3",
    "voice": "Amy",
    "prosody": {
      "rate": "-30%",
      "pitch": "-0%"
    },
    "language": "en-GB",
    "text": "<prosody rate=\"-30%\" pitch=\"-0%\">consciousness as self, self as having consciousness, consciousness in self, or self in consciousness.</prosody>",
    "guid": "15df18b9d78303983d659070079de8b0",
    "volume": "mn_en_sujato_amy"
  }

Your recordings would have an index entry with a similar structure. We generate the cache GUID from the hash signature of the following JSON. Since the GUID changes as the text or apiVersion changes, we can accommodate both translation changes as well as voice changes, although the need for a human “apiVersion” is admittedly only relevant in a scenario of humans altering their own voicing.

  {
    "api": "human",
    "apiVersion": "v1",
    "audioFormat": "opus",
    "voice": "Sujato",
    "language": "en-AU",
    "text": "consciousness as self, self as having consciousness, consciousness in self, or self in consciousness.",
    "guid": "some-signature-guid",
    "volume": "mn_en_sujato_sujato"
  }

The implications of the above structure are several:

  • We only need one sound file for any given spoken text (i.e., you need not say “Why is that?” a million times).
  • We need a way to index all your recordings. For now the SC ID with segment numbers is fine, but we would also need the exact text spoken so that we can update the cache if your translation changes.
  • If it would help, we can add a Voice interface to help maintain an index of spoken entries for you so that you can simply skip segments you have already spoken and move on to the unspoken segments.
  • If you do wish to speak each segment in place, we would simply add the SCID to the JSON index for each audio file.

Let us know what you need. How can we help?


#7

Cloud storage sounds good. What about Google? Do they offer cheap cloud storage?


#8

Yes. Google allows up to 5GB in their free tier. What this means is that Bhante Sujato could probably store all his recordings as Voice caches in Google cloud storage for free under a single account. Voice could then access those caches and serve up sutta recordings assembled on demand. Google cloud storage does have egress restrictions, so the Voice webserver and the audio cloud storage would probably need to be nationally co-located (i.e., US/US or AU/AU, etc.)

In other words, we don’t need a single vendor, and having multiple free accounts for individual human speakers should be very cost-effective. That said, the cost of storing ALL the recordings under a single account is also quite reasonable, given that vendors such as Wasabi offer 1TB at USD78/year.


#9

This is a great idea. I think @Frankk has already done some wonderful recordings; perhaps he may be a great help.
Bhante @sujato, do you have any sample recordings that I can listen to?
Thanks


#10

Bhante @sujato, is this an additional voice in SCV?
For instance, can I listen to your voice in SCV while reading the sutta at the same time?


#11

Bhante @sujato, another suggestion is to record live, which could be published on YouTube.


#12

I like the idea. Especially if the recordings would be videotaped.

EDIT: I’m less sure now. See posts #14 and #16 below.


#13

This is very exciting! One question, though: will these recordings be English only, or Pali/English segments, or some variation of those? I would very much like to hear the Pali!


#14

That is what I :man_cartwheeling: would :heart: as well. We are currently discussing how to make this happen…

Here we have a problem, because it would be somewhat difficult to do one recording that serves both YouTube as well as Voice.

  • If we record each text segment with the highest fidelity, that recording becomes a diamond that we can string together with other segment recordings into a sutta presented in Voice. I don’t think anybody wants to see Bhante speak one text segment on YouTube.

  • If we record each sutta with video, we get a wonderful video with low fidelity audio where we can all watch Bhante speak the sutta.

Which should we do first? Think carefully, because if we do both, then Bhante has to do more than TWICE the work. :thinking:

Another way of thinking about this is… do we want to look at 3999 video recordings of Bhante chanting suttas? Or do we want to listen to the suttas while reading them?

I know I would love to see one (1) video recording of Bhante making segment audio recordings (i.e., “Behind the scenes with Bhante…”). Robbie would probably have fun doing that for one of his podcasts. :grin:

If we record Bhante speaking individual text segments, then Voice can present these segments with or without Aditi speaking Pali.

If Bhante also records Pali segments, then we could offer the choice of Bhante vs. Aditi to Voice users for all segmented translations to other languages.

Voice currently has links to existing audio recordings by Pali Audio, Frankk and others. However, these recordings are not segmented, so it is difficult to hear/read suttas segment by segment.


#16

I don’t know.

I was wondering if the flac files could be easily combined with video footage recorded simultaneously.

I’m not sure.

I like BSWA’s YouTube/podcast double postings. Then again, maybe videos of sutta recordings don’t add enough additional value.

I second this. It would also help Pāli students with getting the pronunciation right. Though that’s perhaps a minor benefit compared to the work it involves.


#17

I think it might depend on the sutta. For example, the Thag/Thig verses are short, so it should be no trouble for Bhante to read both the entire sutta and each segment recorded as flac. My assumption here is that a flac of an entire sutta would require a lot of work to break up into segments. There are 137,351 segments in Bhante’s 3999 sutta translations.

However, imagine Bhante recording DN33 (two hours) both meticulously by segment and then again in its unbroken entirety. For the larger suttas, assembling individual high-fidelity segment recordings might make for a better listening experience with the least effort on Bhante’s part.

My thinking is just that Bhante’s forthcoming gift is already quite large; do we need him to give it twice? Or would we rather he spend the time giving Dhamma talks? Because recording all the suttas is at least one year of Dhamma talks.

(Of course, if Bhante is passionate about speaking suttas, then I would simply support his wish.)


#18

You’re right. Now I’m convinced. I think videos are NOT a good idea.

Thanks for changing my mind! :grinning:


#19

Good discussion!
The objective should be to use time and resources effectively and not lose sight of the end goal.
Brainstorming is a good idea before starting the project.


#20

Actually you also changed mine.

I did some research and realized that it is possible to automatically generate YouTube videos from:

  1. a picture of Bhante Sujato used as a background
  2. the text of a segment used as a caption
  3. the audio of a segment

We could then concatenate all those video clips into a sutta video clip for YouTube. In this way we can create a Bhante Sujato Sutta YouTube channel and publish suttas regularly to the channel. If enough people subscribe to the channel and watch Bhante’s videos, YouTube would be a source of some income.
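The clip-generation step could look something like this in a script. The helper name and the exact ffmpeg options are illustrative assumptions, not a finished pipeline (a real version would also need to escape quotes in the caption text):

```python
def clip_command(image: str, audio: str, caption: str, out: str) -> list:
    """Build an ffmpeg argument list combining a still image, a segment's
    audio, and the segment text (as a drawtext caption) into one clip.
    Assumes the caption contains no quote characters."""
    return [
        "ffmpeg",
        "-loop", "1", "-i", image,   # repeat the still image as the video track
        "-i", audio,                 # the segment's audio track
        "-vf", f"drawtext=text='{caption}':x=(w-text_w)/2:y=h-80",
        "-shortest",                 # end the clip when the audio ends
        "-c:v", "libx264", "-c:a", "aac",
        out,
    ]
```

Concatenating the resulting per-segment clips (e.g., with ffmpeg’s concat demuxer) would then yield a whole-sutta video for the channel.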

What do you think?


#21

Hey all!

Due to popular demand, it looks like I’ll be doing both :man_shrugging:

The process is documented in more detail here:

I will record whole suttas, not segments. Michael uses the incredible aeneas to create a JSON timing map of the sutta. This tells us to the millisecond where each segment starts and stops on the audio. He then uses this to create a set of segmented files. So we automatically have both whole sutta files (for ebooks, youtube, podcasts, etc.) and segmented files (for SCV). He will automate the pipeline so that the appropriate files can be processed and uploaded with a single script.
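The segmentation step above could be sketched like this, assuming an aeneas-style timing map (a `fragments` array with `begin`/`end` offsets in seconds). The function name and the exact ffmpeg flags in Michael’s pipeline are assumptions:

```python
import json

def segment_commands(timing_map: str, audio_in: str, out_dir: str) -> list:
    """Turn an aeneas-style JSON timing map into ffmpeg commands that cut
    a whole-sutta recording into one opus file per segment."""
    fragments = json.loads(timing_map)["fragments"]
    cmds = []
    for frag in fragments:
        begin, end = float(frag["begin"]), float(frag["end"])
        cmds.append(
            f"ffmpeg -i {audio_in} -ss {begin:.3f} -to {end:.3f} "
            f"-c:a libopus {out_dir}/{frag['id']}.opus"
        )
    return cmds
```

Running the emitted commands yields the segmented files for SCV while the original whole-sutta file remains available for ebooks and podcasts.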

This means that, unlike the computer voices, the segments will not be reused. I will record the whole sutta in one flow, which will sound more natural.

Aeneas works with Pali as well as English.