An English Translation of Niddesa

Bhante, I need to apologize: I’m going to have to postpone working on this. I’ve had to start working nights and weekends to make ends meet, and simply don’t have time anymore. If anyone else wants to pick it up, they’re more than welcome; the bilara-ification scripts and line-correlation data are all here, and my version of the translation is CC0/Public Domain. Otherwise, I will definitely come back to it when the economy improves (or collapses completely, one way or the other).

No worries, sorry to hear about your circumstances.

I wonder whether Ayya @vimala would like to take this up?

Sure, I’d be happy to do that. We can use these data for Linguae Dharmae training as well.

Just to check that I understand correctly:
These are the files from https://zacanger.com/niddesa/assets/niddesa.html that are being converted to Bilara format, and the conversion has been partly done in the repo: niddesa/bilara-scripts at feat/pub · zacanger/niddesa · GitHub

I might have some questions about what I’m seeing in the repo in the next week. I’m not sure how long this will take me but will try to get it done asap.

Yes, I believe that is correct. It’s kind of a pilot program for what Linguae Dharmae is doing.

I had a look at the input files and discussed them with @SebastianN. They are lovely, clean markdown, so the conversion should be fairly easy. We need to train an AI model on Pali <-> English sentence alignment. That should be easy to do but will take some time because we are both quite busy at the moment. But we will get there.

If I’m not mistaken, this translation was made by an AI algorithm. Won’t that cause problems if you use it for training other AI?

Good point @Snowbird. It won’t cause technical issues for us, but you might question whether it is useful data; the same goes for Bilara/SuttaCentral. I did not look into the translation quality, as I assumed that if Bhante @Sujato wants it for Bilara/SuttaCentral, the quality must be quite reasonable and the translation proofread to the extent that it is worthwhile to have.

It’s a modified AI output, which has been hand-corrected.

When it’s ready, if we put it on Bilara, I can go through it myself. It can be a kind of pilot test, to see how useful the process is.

The initial version was a combination of four machine translations, using two services with two different source languages (Sinhala and Chinese), collated and cross-referenced in any tricky spots against the Pali. The data in the bilara-scripts subdirectory is primarily for matching lines, because the best translation I was able to automate was from a Chinese source that translated in prose format, not lines and verses. Some of the files in that reference directory are actually from Linguae Dharmae; the segmented JSON based on the BuddhaNexus data is useful for matching up line-by-line.
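
For anyone picking this up, the line-by-line matching could be sketched roughly as below. This is only a minimal illustration, not the actual code in the bilara-scripts directory: the function name `pair_segments`, the segment IDs, and the placeholder strings are all hypothetical, and it assumes the segmented JSON is a flat mapping from segment ID to source text and that the translation has already been split into one line per segment.

```python
import json


def pair_segments(source_segments, translated_lines):
    """Pair translated lines with source segment IDs, in order.

    Hypothetical helper: assumes `source_segments` is a flat
    {segment_id: text} mapping and `translated_lines` is already
    split into one entry per segment.
    """
    segment_ids = list(source_segments)  # dicts keep insertion order
    paired = {
        seg_id: line.strip()
        for seg_id, line in zip(segment_ids, translated_lines)
    }
    # Anything left over on either side needs a human to look at it.
    unmatched_ids = segment_ids[len(translated_lines):]
    leftover_lines = translated_lines[len(segment_ids):]
    return paired, unmatched_ids, leftover_lines


# Placeholder data; the IDs and strings are illustrative only.
source_segments = {
    "seg:1.1": "source line one",
    "seg:1.2": "source line two",
    "seg:1.3": "source line three",
}
translated_lines = ["English line one ", "English line two"]

paired, unmatched_ids, leftover_lines = pair_segments(
    source_segments, translated_lines
)
print(json.dumps(paired, ensure_ascii=False, indent=2))
print("Segments still needing a translation:", unmatched_ids)
```

Pairing purely by position is fragile, so the sketch returns the unmatched IDs and leftover lines instead of silently dropping them; a prose-format source would still need to be split into lines by hand or by a separate alignment step first.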

And FWIW I fed a few verses through GPT-3 on a whim and they came out beautiful but not entirely accurate. I think training on a parallel corpus that’s as large and as thoroughly vetted as possible would really help.

Please let me know if anyone has any questions about my initial version, the conversion scripts, or any of it. I’m happy to help; I just don’t have time to actually finish it anymore.
