Bhante! This is definitely very good work you are carrying out.
I just managed to get Aeneas to work with raw pootle files from SC GitHub, yay!
Any reason to target webm rather than ogg or matroska? Any should be fine, but I can’t find anything that explains the differences.
I made a repo to start keeping the files on Github, together with a bunch of new recordings:
Actually this is under my own name; we should probably switch to an independent account.
One change I’ve made here, I am now recording the audio while listening to a drone background in A:
I did this originally because I found myself starting to chant at different pitches and having to correct it. So this makes sure the starting point is always correct.
It also means all the recordings will be pitched consistently and can be mixed together better. And if someone wants to make an app, or just listen, with ambient background music, go right ahead!
I’m in Canberra for a couple of days, will get back to Sydney tomorrow, so will keep working from there.
.webm is for streaming / seeking and fast start generally, ogg is less so and matroska is geared to video I believe.
One of the major goals is to allow content creators to have advanced playback capabilities, such as fast seeking and fast start using only an HTTP server.
GitHub looks perfect for storage, since the FLACs shouldn’t be that large - the Mahaparinibbana Sutta might refuse to hash though haha. Also the repo might get huge and difficult to clone, will look into!
Thanks, now I know!
I played around with the sn1.01 zip file and split the first four segments of the Pali:
Here is the program:
ffmpeg -y -ss 00:00.000 -t 4.120 -i sn1.01-pli.ogg -c:a libvorbis sn1.01-pli-s1.ogg
ffmpeg -y -ss 00:04.120 -t 5.56 -i sn1.01-pli.ogg -c:a libvorbis sn1.01-pli-s2.ogg
ffmpeg -y -ss 00:09.680 -t 5.28 -i sn1.01-pli.ogg -c:a libvorbis sn1.01-pli-s3.ogg
ffmpeg -y -ss 00:14.960 -t 6.24 -i sn1.01-pli.ogg -c:a libvorbis sn1.01-pli-s4.ogg
What you will notice is that the segmented sound files don’t quite line up correctly. I am unfamiliar with editing sound files, so I am not quite sure what is going on. @MichaelH would you have any suggestions for addressing this issue?
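For more than a handful of segments, the same ffmpeg commands could be generated from the timing map instead of typed by hand. This is only a sketch: it assumes the Aeneas JSON output has a top-level "fragments" list whose entries carry "begin" and "end" times in seconds as strings, and the function name split_commands and the sample map are made up for illustration.

```python
import json
import shlex


def split_commands(sync_map_json, source, stem, codec="libvorbis", ext="ogg"):
    """Build one ffmpeg command (as an argument list) per fragment.

    Assumes each fragment has "begin" and "end" in seconds, as strings.
    """
    fragments = json.loads(sync_map_json)["fragments"]
    commands = []
    for n, frag in enumerate(fragments, start=1):
        begin = float(frag["begin"])
        duration = float(frag["end"]) - begin
        commands.append([
            "ffmpeg", "-y",
            "-ss", f"{begin:.3f}",      # segment start, in seconds
            "-t", f"{duration:.3f}",    # segment length, in seconds
            "-i", source,
            "-c:a", codec,
            f"{stem}-s{n}.{ext}",
        ])
    return commands


# Hypothetical two-fragment map using the first two timings from the post above.
EXAMPLE_MAP = json.dumps({"fragments": [
    {"id": "f000001", "begin": "0.000", "end": "4.120"},
    {"id": "f000002", "begin": "4.120", "end": "9.680"},
]})

for cmd in split_commands(EXAMPLE_MAP, "sn1.01-pli.ogg", "sn1.01-pli"):
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run as-is. Because every segment's start comes straight from the previous segment's end, hand-copying mistakes can be ruled out when the segments do not line up.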
Just so you know, better use the files on Github, as my method and approach are evolving and they will be the latest versions.
I also noticed that the timing on the Pali segments crept out of sync a little. I had hoped it was just a passing glitch, but it seems not.
I am seeing lots of timing issues with Aeneas output, not sure if it’s an error with what I’ve done to Aeneas or not.
I’m not sure how aeneas works, but could it help to put in a longer inter-segment pause during the sutta Pali recording? This would make segments sound like paragraphs and aeneas might be able to pick up on that granularity. Also, will the drone be a problem for aeneas or will that be on a separate audio track that only Bhante can hear?
Sorry I cannot help more. This is all new to me.
I dunno, but it isn’t possible to rely on this. Pauses ≠ segments.
Only I can hear it, it is not on the recording.
This is new for all of us!
I wonder whether the problem might be the parsing of special Unicode characters? Perhaps, as there are so many in the Pali, Aeneas trips on the timing? We might try first converting the Pali text in the same way that you do for audio synthesis, see if that makes a difference.
I share your worry about Unicode. Perhaps Aeneas has an English bias. The SSML conversions I use will be even less helpful since they are XML. Here is a Pali segment from SN. As you can see, it is a bit ghastly.
<prosody rate=\"-30%\" pitch=\"-10%\"><break time=\"0.001s\"/><phoneme alphabet=\"ipa\" ph=\"ẽkəŋ\">ekaṃ</phoneme><break time=\"0.001s\"/> <phoneme alphabet=\"ipa\" ph=\"sə mə jəŋ\">samayaṃ</phoneme><break time=\"0.001s\"/> <phoneme alphabet=\"ipa\" ph=\"bʰə gə ʋɑː\">bhagavā</phoneme><break time=\"0.001s\"/> <phoneme alphabet=\"ipa\" ph=\"sɑː ʋət̪ t̪ʰɪ jəŋ\">sāvatthiyaṃ</phoneme><break time=\"0.001s\"/> <phoneme alphabet=\"ipa\" ph=\"ʋɪ hə ɾə t̪ɪ\">viharati</phoneme><break time=\"0.001s\"/> <phoneme alphabet=\"ipa\" ph=\"ʝe t̪ə ʋə ne\">jetavane</phoneme><break time=\"0.001s\"/> <phoneme alphabet=\"ipa\" ph=\"ə nɑːt̪ hə pɪɳ ɖɪ kəs sə\" >anāthapiṇḍikassa</phoneme><break time=\"0.001s\"/> <phoneme alphabet=\"ipa\" ph=\"ɑːɾɑːme\">ārāme</phoneme><break time=\"0.001s\"/>.</prosody>
I will try an alternate approach and generate romanized Pali for Aeneas. Give me a few moments…
sn1.1-romanized.txt.gz (451 Bytes)
samyutta nikaya 1 1. nalavagga 1. oghataranasutta evam me sutam— ekam samayam bhagava savatthiyam viharati jetavane anathapindikassa arame. atha kho annatara devata abhikkantaya rattiya abhikkantavanna kevalakappam jetavanam obhasetva yena bhagava tenupasankami; upasankamitva bhagavantam abhivadetva ekamantam atthasi. ekamantam thita kho sa devata bhagavantam etadavoca: katham nu tvam, marisa, oghamatari ti? appatittham khvaham, avuso, anayuham oghamatarin ti. yathakatham pana tvam, marisa, appatittham anayuham oghamatari ti? yadasvaham, avuso, santitthami tadassu samsidami; yadasvaham, avuso, ayuhami tadassu nibbuyhami. evam khvaham, avuso, appatittham anayuham oghamatarin ti. cirassam vata passami, brahmanam parinibbutam; appatittham anayuham, tinnam loke visattikan ti. idamavoca sa devata. samanunno sattha ahosi. atha kho sa devata: samanunno me sattha ti bhagavantam abhivadetva padakkhinam katva tatthevantaradhayiti.
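For anyone wanting to reproduce something like the romanized text above, here is a minimal sketch. It is not necessarily the conversion used to make that file: it assumes that stripping combining marks after Unicode NFD decomposition and lowercasing is close enough, and the function name romanize is made up for illustration.

```python
import unicodedata


def romanize(text):
    """Strip diacritics from Pali text and lowercase it.

    NFD splits a letter such as "ā" into "a" plus a combining macron;
    dropping every combining mark leaves only the base letters.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(romanize("ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati jetavane anāthapiṇḍikassa ārāme."))
```

Punctuation is passed through untouched, so quote marks and dashes would need separate handling if Aeneas turns out to be sensitive to them as well.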
I have made a test for this, and can confirm: replacing the diacriticals with plain text changes the mapping by aeneas. Files are here.
Use the “notext” files to compare timings. The end of the first segment is at 2.66 in plain, and 2.16 in unicode. The second segment is 15.16/10.08, the third is 28.12/20.28, and so on. So it is definitely a significant difference.
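To put numbers on the drift, the differences between the two sets of segment end times can be computed directly. A small sketch using only the three pairs quoted above; the function name boundary_drift is made up for illustration.

```python
def boundary_drift(plain_ends, unicode_ends):
    """Return plain-minus-unicode end-time differences, in seconds."""
    return [round(p - u, 3) for p, u in zip(plain_ends, unicode_ends)]


# End times (seconds) of the first three segments, as quoted in the post.
plain_ends = [2.66, 15.16, 28.12]
unicode_ends = [2.16, 10.08, 20.28]

for n, drift in enumerate(boundary_drift(plain_ends, unicode_ends), start=1):
    print(f"segment {n}: {drift:.2f} s")
```

The gaps come out at 0.50, 5.08 and 7.84 seconds, so the two mappings drift further apart with each segment rather than differing by a fixed offset.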
Next step is to figure out if this is what is causing the discrepancies you noticed.
If this is an issue, we should convert both the Pali and the English before running Aeneas. The English still has a few Pali terms, especially the names. I haven’t noticed this causing any problems, but it would seem prudent to ensure all files are processed similarly.
Thank you, Bhante. I shall use the new JSON map files to split up your audio recording tomorrow. If this works, then perhaps I should replicate your Aeneas setup on my own computer so that I can create an audio pipeline that generates the MP3 files. As you suggest, the pipeline would romanize both Pali and English for consistent treatment by Aeneas. The pipeline could probably just be some code in my fork of sc-audio so that we could both share.
Best to co-ordinate this with Michael. Ultimately I think all the processing should be done at his end, and you just get the files. That way you, or whoever inherits SCV, can focus on application, and Michael on content.
I really, really, really want to use opus, not MP3.
By the way, did I mention that I really want to use opus? I think I might have mentioned it!
Yes, Bhante. We know and were just discussing that in our most recent Voice planning meeting. Opus has better high frequency response than MP3. It’s just the better technology and clearly the long term direction. For the AI voices this was not an issue, since their quality is much lower than your FLAC files. Indeed, the maximum sampling rate for the AI voices is only 22kHz, which is woefully low given that, by the Nyquist criterion, covering the 20Hz to 20kHz audible range requires a sampling rate of at least 40kHz. MP3 would cripple the fidelity of the FLAC files you are creating.
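The Nyquist arithmetic behind that remark, as a tiny sketch (the function names are made up for illustration): a signal sampled at rate fs can only represent frequencies up to fs / 2.

```python
def max_representable_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent (Nyquist limit)."""
    return sample_rate_hz / 2


def min_sample_rate_hz(max_frequency_hz):
    """Lowest sampling rate needed to represent a given frequency."""
    return 2 * max_frequency_hz


print(max_representable_hz(22_000))  # ceiling for 22kHz sampling
print(min_sample_rate_hz(20_000))    # rate needed to reach 20kHz
```

So 22kHz sampling tops out at 11kHz, while reaching 20kHz needs at least 40kHz.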
The use of MP3 files for caching and download is simply a short-term solution that supports the most players with least engineering effort. Currently, the pesky use case is Apple iPhone for browsing and offline listening. We’ll need to phase in Opus files carefully over time, since we do not want to cache both formats (the Voice caches are 70% full and growing daily even now). I think that a two-tier storage solution using cloud for archival and VPS cache for performance will provide a good balance. This constraint therefore makes Voice cloud storage a requirement for Opus support. Fortunately, Michael has already provided a Wasabi account for use, so we will probably be tackling that after we sort out the segmentation of the audio files.
@MichaelH, let me know how you’d like us to work together on automating the generation of audio segments. Tomorrow I’ll take a pass at splitting up the Pali audio file per Bhante’s new Aeneas JSON files. Hopefully that will solve the audio registration issue we saw earlier.
I assume, Bhante, your preference isn’t merely for quality but is also ideological? (Since mp3 is a proprietary format, in addition to being technologically inferior)
Yes, both. It just seems crazy to me to use an ancient, technologically inferior, patent-encumbered format when there is a modern, technologically superior, and open source format available. The only advantage MP3 has is it is more widely supported. But Opus now covers 86% of browsers, which is pretty much what we support anyway (it excludes ancient IE and some mobile browsers).
It is supported on iOS safari 11 and later if put in the CAF wrapper.
Does SCV care about what audio file we use? Can we not simply use the auto-generated voices in MP3 and the real ones in opus?
I assume this was the (kind of) engineering effort referred to. Windows requires the .webm container, Apple requires .caf, Linux prefers ogg… the encoding is standardized but sadly the containers are not.
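Since only the container differs, one option is to encode to Opus once and then copy that stream into each container without re-encoding. A rough sketch that just builds the ffmpeg command lines: the 64k bitrate, the file names and the function name opus_commands are assumptions for illustration, and .caf is left out because I am not certain which tool is best for producing it.

```python
import shlex


def opus_commands(source, stem, bitrate="64k"):
    """Build ffmpeg commands: encode once to Ogg Opus, then remux to WebM."""
    encoded = f"{stem}.opus"  # Opus audio in an Ogg container
    return [
        # Encode the FLAC master to Opus a single time.
        ["ffmpeg", "-y", "-i", source, "-c:a", "libopus", "-b:a", bitrate, encoded],
        # Copy the already-encoded stream into WebM; "-c:a copy" avoids re-encoding.
        ["ffmpeg", "-y", "-i", encoded, "-c:a", "copy", f"{stem}.webm"],
    ]


for cmd in opus_commands("sn1.01-pli.flac", "sn1.01-pli"):
    print(shlex.join(cmd))
```

Remuxing with a stream copy keeps the audio bit-identical across containers, so the only per-platform cost is storage, not quality.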