SuttaCentral

Adding Vietnamese audios to SC-Voice “Other Resources”

sc-voice

#1

Just to let you know: there is Vietnamese audio here: https://discourse.suttacentral.net/t/sutta-pi-aka-in-vietnamese-audio/5105, although I think it would need some edits before publishing.


#2

Would you be able to do the edits? We are then very happy to link it to SuttaCentral Voice!


#3

Oh I’d love to :grinning: ! Are there some guidelines to follow?

Also if anyone is interested, feel free to contribute, I’m not going to “monopolize” it


#4

Fantastic, thanks for that!

Personally I can’t tell, since firstly I am not a Vietnamese speaker, and secondly don’t have any particular knowledge in sound engineering. Maybe @michaelh or @Robbie or @musiko can give some advice?

And maybe, Robbie, this is now drifting a bit off topic from the original intention of the Wiki you created. Can one of the mods help to split it into an extra thread please? @Viveka? Thanks!

You can call it “Adding Vietnamese audios to SC-Voice “Other Resources””.


#5

I don’t think I can give any advice. I’m clueless about sound engineering. :grin: It’s a nice idea though and I hope it will succeed! :grinning:


#6

Well I can always find some online tutorials for sound editing, that’s not really a problem :wink:


#7

Hi Phineas,

This is a great find! :smiling_face_with_three_hearts:

I had trouble finding complete canon recordings in Pali myself, and there it is in Vietnamese!

I am not the one doing the part that plugs Resources into the interface, but I’m quite sure the next steps would be:

  • The first step would be to download the files and then rename them per sutta, so that the names match SuttaCentral numbering and can plug into SC Voice, with edition_translator in this naming and folder format (I’m guessing GHPGVN is the Vietnam Buddhist Sangha Committee who recorded it):
    /vie/sn/sn1/sn1.60-vie-chau-ghpgvn.flac
  • If any editing is needed to remove or cut suttas recorded together, use Audacity and just save as .flac file format
  • Then I would run some processing on it to match volume, remove noise etc.
  • Then the work would need to be done to get it into the interface.
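The renaming step could be scripted. Here is a minimal sketch, assuming the folder and naming format from the example path above. The function name, the defaults (`vie`, `chau`, `ghpgvn`) and the layout for suttas without a dot in their number (e.g. DN1) are my own assumptions for illustration, not something SC Voice prescribes.

```python
import re

def sutta_audio_path(sutta_id, lang="vie", translator="chau", edition="ghpgvn"):
    """Build a path like /vie/sn/sn1/sn1.60-vie-chau-ghpgvn.flac from 'SN1.60'.

    Hypothetical helper: the defaults are copied from the example path,
    and the handling of ids without a dot (e.g. 'DN1') is a guess.
    """
    match = re.fullmatch(r"([a-z]+)(\d+)(?:\.(\d+))?", sutta_id.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised sutta id: {sutta_id!r}")
    collection, major, minor = match.groups()

    # Folder: /<lang>/<collection>/, plus a sub-folder such as sn1 when
    # the id has a second number (SN1.60 is sutta 60 inside SN1).
    parts = [lang, collection]
    name = f"{collection}{major}"
    if minor is not None:
        parts.append(name)
        name = f"{name}.{minor}"

    filename = f"{name}-{lang}-{translator}-{edition}.flac"
    return "/" + "/".join(parts + [filename])

print(sutta_audio_path("SN1.60"))
```

The language code is a parameter so that it can be switched if a different code turns out to be required.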

It might be possible to add it so that it is read at the same time as the Pali. The way we are doing this with English currently (work in progress) is to use software to speak the English; the software then compares this to the recording and splits it according to the lines already in SuttaCentral (which appear not to exist for Vietnamese at the moment; there are numbers that aren’t split further into segments).

If the version being read out is exactly the Thich Minh Chau translation then this could work. The rest of the editing is somewhat automatic in theory, with minor adjustments made for each recording!

To do the splitting, we would need a Vietnamese TTS. This one looks great for the task: Vnspeak TTS – Vietnamese Multi-platform Text-to-Speech Engine. We could write to them and ask for pricing; perhaps they will just let us use it, which would be great!


#8

Thanks, this sounds like very good advice as a whole.

Just one point:

I don’t think this is possible unless the Vietnamese translation on the SC main site is segmented according to the Pali. As I can tell from the work on the Vinaya I have been involved in, this is really a loooott of work! For the Vinaya it took us a year (we were three volunteers), and the Suttas are still a larger body of texts. So I would consider this a project of its own, rather than just one more step in preparing the existing audio recordings.

If someone is interested in doing it I would recommend coordinating with Bhante @sujato. I guess this sort of work should be done either with Bilara (which I think is not yet quite at this point) or with Pootle (which I think is already outdated)…

Otherwise the recordings can just be added under “Other Resources” in Voice.

[Screenshot from 2019-07-06 21:13:27]

Here you find, for example, other English translations that are not segmented (like the one by I.B. Horner), as well as audio recordings other than what Voice can support directly (i.e. recordings of a segmented root text or translation that have been split into segments according to the text on the SC main site). In our case there would then be a note that the recording is in Vietnamese.


#9

Thanks for the help! :smile: Just one big question:

Where can I find the corresponding numbering between the Pali and VNmese translation?

(much smaller question: there is kind of a “front matter” in the VNmese audio, should I skip it?)


#10

Just taking any random Sutta, I hope I am right and this is in fact the Vietnamese translation (the shortcut for the language, by the way, is “vi” as you can see in the URL, and this should be kept throughout).

  • Make sure you have “View textual information” in the text settings turned on.
  • You can see the SuttaCentral numbering as “SC1”, “SC2”, etc. like this:

This, as I already mentioned above, has to be done either in Pootle which we used for the Vinaya segmentation, or in Bilara. In both cases it would have to be set up as a project by Bhante Sujato, and I don’t think this is likely going to happen right now.

Just so you get an idea what I am talking about, I am adding a screenshot from a Vinaya passage. You see the Pali segment in the left, and the respective part of the translation had to be entered into the fields at the right; in this case together with all the html code which makes it look a bit confusing perhaps. :grin:

Hmm… hard to say; it may depend on what kind of front matter it is. In any case the audios will be linked to Voice according to Suttas. If the matter belongs to one particular Sutta, it may be kept, if it belongs to a whole chapter or even the entire collection it would perhaps be better to skip.


#11

No! Now looking in the context of Michael’s post, this is not the sort of numbering he was talking about. He was talking about the numbers of the Suttas themselves, like “MN1”, “MN2”, etc. Especially in parts of the Anguttara Nikaya there can be different numbering systems for the Suttas, with the result that in one edition the total number of Suttas in that collection differs from the numbers in other editions. Just compare with what is on the website and go with this. The numbering inside a Sutta isn’t relevant for the current purposes.


#12

Whoa, I didn’t know we could find the numbering on the website. That’s awesome!

Sorry, I thought that SN1.60 was inside a Sutta. I was afraid to have to split each sentence :relieved:.

So for the time being, I just have to split by Sutta. The more detailed segmentation would be a separate project with an enormous amount of work.

The front matter is for the entire collection, it is 5-6 min long. It tells a bit about the author (Thich Minh Chau), the translation, the affiliation, the interest and production of this audio collection, etc. I assume it’s safe to skip it then.


#13

:joy: SN1.60 means the 60th Sutta in the first Samyutta of the Samyutta Nikaya. If you look at the left sidebar on SuttaCentral, the traditional structure of the collections is nicely mirrored in the way it is arranged on the website (I can only tell for the Pali canon, though).

I would say so, unless @karl_lew thinks otherwise.

And to give you an idea what it can be like once this work has been done, just look at the Voice representation of the Sutta you mentioned, SN1.60. Click the “play” button, and you will find the Sutta read out segment by segment with two languages “side by side”, so to speak, and be surprised who is reading the Pali! :wink: If the Vietnamese were segmented, it could be matched in the same way with either the Pali or the English.

Underneath the “play” button you see a line with the number and title of the Sutta. Click on the little arrow at the start of that line, and this will expand the “Other Resources” section, which is where the product of your work will be found as a link in the future.

Thank you so much, Phineas, for doing this work! :pray:


#14

Hey, I’m just skimming, and am not 100% sure the point is relevant here, but I just want to note that currently, some of the VI sutta numbers are incorrect on SuttaCentral. This mostly concerns the SN and also the AN a bit (I’m trusting the MN is fine, but I haven’t checked). I have now corrected the numbering (:crossed_fingers:), but this won’t be visible on SC until a larger bit of work it was part of is updated on the site.

Yes, and also to note that this is due to be made more visible (subject to amendment, of course).


#15

And also note that this will happen thanks to Aminah! :wink:


#16

We have yet to add in Bhante Sujato’s own introductory matter for each collection. At this point we are only able to publish sutta-specific material. SuttaCentral will always have more than Voice. Accessibility is difficult.

Voice does provide additional audio links for each sutta under “Other Resources”. Those audio links need not be segmented. In fact, we link to Bhante Sujato’s own full, unsegmented audio recordings of suttas as they become available. Segmentation is difficult. Linking is not.


#17

Oh you helped to split the Vinaya Ang. @sabbamitta sadhu amazing :slight_smile:

This is all inspiring me to test something that guesses where the SC segment splits are for all translations, based on a Google translation into English and then a comparison of each section with its segments, using this software from Google, word2vec, that compares two sentences for similarity:

You feed in a corpus (something like all English translations and all computer translations of the non-English ones) and it supposedly can compare how similar two phrases, sentences or paragraphs are.

Never used it myself, but apparently it can infer “female royal” as “queen”, these kinds of semantic things. So you would compare each computer-translated-to-English sentence (or line, if there is no sentence) to each of that section’s segments. So you would guess where 1.2.5 and 1.2.6 start inside 1.2, for example.

But who knows, given how similar/synonym-y one segment can be to another, this might not work. Would be fun to visualise though, and might help with sutta-search too. In theory we could also end up with a comparison of all English translations segment by segment, which might aid in making new translations with Bilara. Also you might be able to compare similarities inside a translation to see which texts have more, like, internal validity with the canon to other suttas, but I think I am getting a little bit ahead of myself :slight_smile:
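To make the matching idea concrete, here is a rough sketch of the comparison step only. It does not use word2vec: as a stand-in it uses plain bag-of-words cosine similarity from the Python standard library, so it only sees shared words, not meaning. The segment numbers and English sentences are made up for illustration, and the function names are my own.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word occurrences in a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_segment(sentence, segments):
    """Return the id of the segment whose text is most similar to the sentence."""
    vector = bag_of_words(sentence)
    return max(segments, key=lambda seg_id: cosine(vector, bag_of_words(segments[seg_id])))

# Made-up segments standing in for an already segmented English text.
segments = {
    "1.2.5": "the monks went to the village for alms",
    "1.2.6": "the king asked the teacher a question",
}
# Made-up sentence standing in for a computer translation.
print(best_segment("a king put a question to his teacher", segments))
```

A real attempt would swap `cosine` over word counts for a similarity based on trained word vectors, and would probably also need to force the matches to stay in order within a section.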


#18

:joy: :laughing:

Wow, so much enthusiasm! Usually that’s a good thing to have in order to start something! So who knows what can come out of this!! :heart:

Not that I want to weaken your elation, I’d just like to give a few points to think about:

  • Firstly, translators take different approaches. I think most translations on SC are done from the root texts, not from English translations. And if you only compare some of the existing English translations, they can be quite different, at least in some passages. If you now take translations in other languages, I can’t imagine they are closer to, say, Bhante Sujato’s translation than Bhikkhu Bodhi’s or Bhikkhu Thanissaro’s are. Many of the existing German translations, for example, were made 100–200 years ago, and they differ considerably from modern English translations. So if you computer-translate such translations to English, there might be far fewer similarities than you would hope.
  • Secondly, I suppose that even if using this tool, segmentation would still have to be done via Bilara in order for the texts to be integrated into SC.
  • But—and that might be the most important point—such a tool could increase the scope of possible volunteers for segmentation. If it is not necessarily required to be fluent in the language-to-be-segmented, more people would be able to do such jobs. And I can definitely confirm that by doing this work for the Vinaya my grasp of Pali, of which I am still a beginner student, has grown considerably! And I have learned new aspects of the English language too.

#19

I’ve started putting the files here: MEGA

Luckily all the audio files in the DN are well numbered, so I didn’t have any problems with the filenames. I’m listening to each file to make sure.

As @karl_lew mentioned, introductory matter could be added in the future, so I created a separate folder for that. The order is confusing though, as there are 2 sets of introductory matter for the whole audio collection (in the DN and the MN).


#20

Great, looks good!

Not sure if this is relevant in this context, but on SC the language shortcut for Vietnamese is “vi”, and here it is “vie”. Maybe it doesn’t matter, but I can’t tell.

I think if you rely on the Sutta numbers in the Pali or the English on SC you should be safe.

Maybe a good idea just to store it separately as you did, and then see what comes up over time. Even if it can’t be used, it doesn’t get lost at least.