Help me find EBTs in Khmer

Charlotteannun · January 3, 2022, 2:24am

My friend Ayya Saddha Bhikkhuni of Cambodia would greatly appreciate any suttas translated into her language.

Anything on site? Or any leads? Thanks in advance!

Khemarato.bhikkhu · January 3, 2022, 3:11am

There doesn’t appear to be any on SC…

Ven @Dhammanando, any leads?

Khemarato.bhikkhu · January 3, 2022, 4:13am

Aha! I may have found it?

5000-years.org

ទាញយក (Download) ព្រះត្រៃបិដកខ្មែរ

គម្ពីរព្រះត្រៃបិដកគឺជាឃ្លាំងផ្ទុកនូវពុទ្ធវចនៈរបស់ព្រះពុទ្ធដែលទ្រង់ទេសនាទូន្មានប្រៀនប្រដៅអប់រំរំឭកដាស់តឿនរវាង៤៥ព្រះវស្សាចាប់តាំងពីទ្រង់បានត្រាស់ដឹងជាព្រះសម្ពោធិញ្ញាណរហូតដល់ព្រះអង្គបានចូលបរិនិព្វាន។វចនៈដែលទ្រង់បានសម្ដែងនេះហៅថាធម្មវិន័យ។ធម្មវិន័យនេះចែកជ...

A Khmer speaker will have to help, but I was able to confirm that there is a 1969 translation of the Pāḷi Canon into Khmer … and this might be it?

mikenz66 · January 3, 2022, 4:20am

That looks like everything (via Google Translate):

There are 110 books in total in the Khmer language.

Discipline consists of 13 chapters, 5 books (Part 1 - 13)

The Sutta Padak has 64 chapters, 5 books (Part 14-77)

Abhidhamma Bidak has 33 chapters, 7 books (part 78 - 110)

The Buddhist Council established the Trinity Committee in 1929, chaired by two monks and consisting of 40 members and lay people with advanced degrees in Buddhism and Pali. The first volume of the Trinity was born as a bilingual book, one Pali and one Khmer, published in 1931, and the last 110 volumes, first published in 1968, with a volume of more than 400 pages. This transformation took almost 40 years to complete. Our Khmer Trinity was inaugurated and celebrated on April 1-2, 2513-1969.

Khemarato.bhikkhu · January 3, 2022, 4:41am

Yes, I did see the Google translate of the page. But I’m on mobile and can’t download the large PDFs to confirm that they indeed are what it says on the box.

Quite lucky that they finished by 1969. Just a few years later and the project would never have been finished at all.

paul1 · January 3, 2022, 5:42am

This teaching resource is located in the US, but the monks are Cambodian:

Churchwell · February 5, 2022, 10:57am

Venerable, are the texts going to be added to SuttaCentral? I’m a Khmer speaker; if needed, I can assist with inputting the texts.

Khemarato.bhikkhu · February 5, 2022, 11:25am

I don’t think there’s a plan to add them at this time, but if there are volunteers and copyright allows (two big 'if’s!) I don’t see why SC wouldn’t be happy to welcome them.

In the likely event that this is copyrighted but with no actual owner (the owning body having since been disbanded), Bhante @sujato: does SuttaCentral have a policy regarding orphaned works? Considering they were something of a public works project from the beginning, I think a strong argument could be made for these being considered public domain now, but it’s of course your call.

sujato · February 5, 2022, 11:39pm

Thanks for the offer! That would be really helpful. Let’s see what we can do.

We don’t have an official policy on orphaned works, but I agree that it’s probably okay in such a case. Generally, of course, there is a safe harbor provision for copyright, so if a work is added in good faith, then should the copyright owner complain, we merely have to take it down.

FYI, I did discuss adding Khmer suttas with some of the Cambodian community some years ago, but it never succeeded.

Alright, let’s have a look and see what we’ve got. Looking at the 5000-years site, it looks like the texts are present as PDF files. They’re not text files, they are scanned images. These are not suitable for us, as there’s no way to reliably transform images to text, especially text in exotic scripts like Khmer. The only use they would have would be as reference points for proof-reading.

So here’s the first question:

Can we find these texts in HTML? Or even Word, or literally anything other than scanned images?

Here’s a start: this site seems to have the Tipitaka in Khmer in HTML. Here’s the Brahmajala Sutta:

http://ti-kh.org/books?book=14&page=7

So this might be a useful source. There’s a long way to go, however. Assuming we can scrape the site and get the content, there’s no semantic markup, hence transforming it into a form usable for SC will be tricky. Still, so long as the origin is in reasonably well-formed HTML it should be possible.

Can some of our Cambodian readers check out this site for us? It’d be good to know:

is there a copyright?
is the text accurate and well-proofed?
what is the source for the translation?

If we can’t find the whole Tipitaka, perhaps we can find some individual suttas or collections?

Churchwell · February 6, 2022, 6:10am

Ok so I looked at the links that you and Ven Khemarato shared, as well as a pdf Khmer Tripitaka that I have in my own possession, and here’s what I found:

I found nothing mentioning copyright at all. I’m not familiar with Cambodian copyright law (I’m of Khmer descent, but I’m an American), but in my opinion, there should be no legal issues? The Khmer translation was originally published under a government that no longer exists, and in addition, the files hosted on 5000-years.org are merely scanned reprints, not an original translation.

The text is the entire Tripitaka, and it looks like the same translation from 1969. However, it doesn’t look like the standard 6th council edition. Some of the Suttas have slightly different names, and some have additional verses. Sometimes the Pali words are spelled differently. Also, a few books from the Khuddaka Nikaya are missing, I assume they were not considered canonical.

Would manual input be feasible?

sujato · February 6, 2022, 6:27am

Great, thanks!

Let’s proceed on the understanding that these are orphaned works under copyright, and that we are using them in the spirit intended (i.e. we’re not selling them, etc.)

Now, the files from ti-kh.org are HTML, so these are better than the 5000-years ones for us.

They appear to be in a form where each page of HTML corresponds to a page in the book. Here’s a sample:

 <div style="width: 560px; padding-left: 60px; padding-top: 50px;">
                
                ព្រះដ៏មានព្រះភាគព្រះនាមវេស្សភូ១ ជាព្រហ្មចរិយធម៌មិនឋិតនៅយូរ ម្នាលសារីបុត្ត ឯព្រហ្មចរិយធម៌របស់ព្រះដ៏មានព្រះភាគព្រះនាមកកុសន្ធ១ ព្រះដ៏មានព្រះភាគព្រះនាមកោនាគមន១ ព្រះដ៏មានព្រះភាគព្រះនាមកស្សប១ ជាព្រហ្មចរិយធម៌ឋិតនៅបានយូរ។ ព្រះសារីបុត្តត្ថេរក្រាបទូលសួរថា សូមទ្រង់ព្រះមេត្តាប្រោស ព្រហ្មចរិយធម៌របស់ព្រះដ៏មានព្រះភាគព្រះនាមវិបស្សី១ ព្រះដ៏មានព្រះភាគព្រះនាមសិខី១ ព្រះដ៏មានព្រះភាគព្រះនាមវេស្សភូ១ បានជាមិនឋិតនៅយូរ នោះតើព្រោះហេតុ និងបច្ច័យដូចម្តេច។ ព្រះដ៏មានព្រះភាគទ្រង់ត្រាស់ថា ម្នាលសារីបុត្ត ព្រះដ៏មានព្រះភាគព្រះនាមវិបស្សី១ ព្រះដ៏មានព្រះភាគព្រះនាមសិខី១ ព្រះដ៏មានព្រះភាគព្រះនាមវេស្សភូ១ ព្រះអង្គមិនសូវប្រឹងប្រែង<sup>(១)</sup> សំដែងធម៌ដល់សាវកទាំងឡាយដោយពិស្តារប៉ុន្មានទេ ព្រហ្មចរិយធម៌មានអង្គ៩ប្រការ របស់ព្រះពុទ្ធទាំងឡាយនោះគឺ សុត្តៈ គេយ្យៈ វេយ្យាករណៈ គាថា ឧទានៈ ឥតិវុត្តកៈ ជាតកៈ អព្ភូតធម្មៈ វេទល្លៈ ជាធម៌មានចំនួនតិចៗណាស់ សិក្ខាបទក៏ព្រះពុទ្ធទាំងនោះមិនបានបញ្ញត្តដល់សាវកទាំងឡាយទេ អាណាបាតិមោក្ខ ក៏ព្រះពុទ្ធទាំងនោះមិនបានសំដែងឡើយ ដល់អំណឹះ ឥតពីព្រះពុទ្ធដ៏មានជោគ<hr /><div style='font-size: 14px'>(១) ប្រែចេញ ពីបាលីថា កិលាសុនោ សព្ទនេះប្រែថាខ្ជិល ប៉ុន្តែក្នុងទីនេះ បើប្រែថាខ្ជិល នឹងទៅជាឆ្គងសេចក្តី មិនសមតាមន័យក្នុងអដ្ឋកថា ៗប្រាប់សេចក្តីថា ព្រះពុទ្ធគ្មានសេចក្តីខ្ជិលទេ។</div>
            </div>

As you can see, there are no paragraphs, headings, or anything else that might help in structuring the text. In this page, there is a tag <hr /><div style='font-size: 14px'>, maybe that corresponds to a paragraph break?
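
One possible reading of the sample above, offered as an assumption rather than a confirmed fact about the site: the text after `<hr />` begins with the same (១) marker that appears inside the `<sup>` tag, so the `<hr />` may introduce the page's footnotes rather than a paragraph break. A minimal Python sketch of splitting a page on that basis follows. The function name `split_page` is hypothetical, and the tag stripping is deliberately crude.

```python
import re
from typing import Tuple

TAG = re.compile(r"<[^>]+>")                  # any HTML tag
HR = re.compile(r"<hr\s*/?>", re.IGNORECASE)  # <hr>, <hr/>, <hr />

def split_page(fragment: str) -> Tuple[str, str]:
    """Split one page of HTML into (main text, footnote text).

    Assumption: everything after the first <hr> is footnote material.
    """
    parts = HR.split(fragment, maxsplit=1)
    body = parts[0]
    notes = parts[1] if len(parts) > 1 else ""

    def clean(s: str) -> str:
        # Drop tags, then collapse runs of whitespace to single spaces.
        return " ".join(TAG.sub("", s).split())

    return clean(body), clean(notes)
```

If a Khmer reader confirms that the second part is always a footnote, it could be dropped or stored separately at this stage.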

Above that, the file has two kinds of title:

  <h1>
                ព្រះត្រៃបិដក ភាគ ០១
            </h1>

and

                <a href="" title='បិដក ភាគ ០១ - ទំព័រទី ២៣'>
                        បិដក ភាគ ០១ - ទំព័រទី ២៣</a>

Something about “sutta 9”? Can you make out what these are?

The file name, and also the page metadata, contain the information “book=1&page=23”.

That’s all I got! As far as I can tell, the entire collection is structured this way. There’s no indication of nikayas, sutta numbers (?), or other details. It’s all based on the book and page, which makes it hard. We need to find something in the text, or the navigation, somewhere, that can tell us what these things are, not just what they correspond to in a Khmer edition.
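
Since book and page are the only keys available, one sanity check is that the Khmer numerals in the page title agree with the URL. In the sample title, ០១ and ២៣ are the digits 01 and 23, which matches book=1&page=23. A small Python sketch of that check follows. The helper names are hypothetical, and it relies on Python's `int()` accepting Khmer digits.

```python
import re
from urllib.parse import parse_qs, urlparse

# Runs of Khmer digits (U+17E0 to U+17E9 are the digits 0 to 9).
KHMER_DIGITS = re.compile(r"[\u17e0-\u17e9]+")

def book_and_page(url):
    """Read the book and page numbers from a ti-kh.org style URL."""
    query = parse_qs(urlparse(url).query)
    return int(query["book"][0]), int(query["page"][0])

def khmer_numbers(text):
    """Return every run of Khmer digits in the text as an int."""
    return [int(run) for run in KHMER_DIGITS.findall(text)]

def title_matches_url(title, url):
    """True when the numerals in the title equal (book, page) from the URL."""
    return tuple(khmer_numbers(title)) == book_and_page(url)

# Example with the URL quoted earlier in the thread:
# book_and_page("http://ti-kh.org/books?book=14&page=7") gives (14, 7)
```

This only confirms that the files are keyed consistently. It does not say which nikaya or sutta a page belongs to.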

So this is possible, but it’s not looking easy. Can we keep looking around for a bit, see if we can find a better source?

Let’s keep it for a last resort.

Churchwell · February 6, 2022, 9:33am

It’s hard since it’s structured according to books, and only loosely corresponds to actual sections. But sometimes you can search the string to find the title. For example:

ព្រហ្មជាលសូត្រ ទី១

indicates Brahmajala Sutra #1. So sometimes if you search the string for “សូត្រ” meaning Sutra in Khmer, followed by “ទី” which indicates the number, you can find the title.
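
The search described above could be sketched in Python roughly as follows. This is a sketch under assumptions: the names are hypothetical, the pattern treats whatever non-space text precedes “សូត្រ” as the title, and it tolerates zero-width spaces between the pieces. Ordinary body text that happens to mention a numbered sutra would also match, so the results are only candidates for a Khmer reader to confirm.

```python
import re

SUTRA = "សូត្រ"   # "Sutra"
ORDINAL = "ទី"     # marker that precedes the number

# Title ending in SUTRA, optional spaces or zero-width spaces,
# then ORDINAL and a run of Khmer digits (U+17E0 to U+17E9).
TITLE = re.compile(
    r"(\S*" + SUTRA + r")[\s\u200b]*" + ORDINAL + r"[\s\u200b]*([\u17e0-\u17e9]+)"
)

def find_sutta_titles(text):
    """Return (title, number) pairs for every candidate sutta heading."""
    return [(m.group(1), int(m.group(2))) for m in TITLE.finditer(text)]
```

Run over the text of a book, this gives a first list of candidate headings with their numbers, which can then be matched against the known order of suttas in each collection.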

But this becomes even more complicated because sometimes, especially in the Anguttara and Khuddaka Nikayas, the sutta isn’t even labeled. It is only indicated by a number. For example, in the Anguttara Nikaya:

បឋមបណ្ណាសក ពាលវគ្គ

indicates Pathama pannasaka Bala Vagga. In the text that follows, the sutta is only indicated by “[១]”, corresponding to AN 3.1. But the actual name of the sutta isn’t shown.
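
Where only bracketed numbers are available, the text can at least be cut at those markers. A minimal Python sketch, with a hypothetical function name; whether a given number marks a new sutta or only a passage within one still has to be judged by someone who reads Khmer.

```python
import re

# A bracketed run of Khmer digits, e.g. the marker for 1.
MARKER = re.compile(r"\[([\u17e0-\u17e9]+)\]")

def split_numbered(text):
    """Return (number, passage) pairs, one per bracketed Khmer number.

    Text before the first marker (such as a vagga heading) is ignored.
    """
    # With one capture group, re.split returns
    # [preamble, num1, passage1, num2, passage2, ...].
    pieces = MARKER.split(text)
    return [
        (int(pieces[i]), pieces[i + 1].strip())
        for i in range(1, len(pieces), 2)
    ]
```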

So for me, it’s pretty easy to see where a sutta begins and ends, but for someone who can’t read Khmer, it might not be obvious.

Regarding other sources, almost every other site I’ve found also has the scanned pdfs. But I did find them in Microsoft word doc format here:

http://5000-years.org/kh/read/2976

The text is formatted a bit weird, but other than that it’s OK. Someone would still have to search through to find the sutta boundaries, though.

sujato · February 6, 2022, 10:30pm

Okay thanks, that’s all useful info.

Now, I’ve scraped the ti-kh site, and ended up with some 26,000 files containing the Tipitaka. It seems some of these are in Pali and some in Khmer. Leaving aside the Pali we have maybe 10-15,000 files in Khmer. That’s a lot! Not easy to wrangle these into a usable form.

The word documents on 5000 seem like a more usable starting point. But first, can I ask you a couple more questions?

Can you closely read a couple of equivalent paragraphs in the Word version and the ti-kh HTML version? Maybe start with Book 14, which I think is DN 1.
Are they the same translation?
In terms of accuracy, are they similar, or do we find typos in one and not the other?

Assuming that they are the same translation, and that there are no major differences in proofing, we might consider moving forward with the Word files.

I have prepared one of these in plain HTML.

1. Open Book 14 in LibreOffice.
2. Choose “Preview in Web Browser”.
3. Save as HTML.
4. Run HTML Tidy with aggressive cleaning settings.
5. Delete footnotes (not supported by SC).

So that’s pretty straightforward, and the result is fairly clean.
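
For all 110 books, the conversion and tidy steps could probably be scripted instead of done by hand. The Python sketch below is an assumption about how that might look, not the procedure actually used: it swaps the manual LibreOffice export for the headless `soffice --convert-to html` command, the Tidy flags are a guess at what “aggressive cleaning” means, and the file name `Pitaka-014.doc` in the comment is hypothetical. Footnote removal is left out.

```python
import subprocess
from pathlib import Path

def conversion_commands(doc, out_dir):
    """Build the two command lines for one Word file (nothing is run here)."""
    doc, out_dir = Path(doc), Path(out_dir)
    html = out_dir / (doc.stem + ".html")          # written by soffice
    tidied = out_dir / (doc.stem + ".tidy.html")   # written by tidy
    return [
        ["soffice", "--headless", "--convert-to", "html",
         "--outdir", str(out_dir), str(doc)],
        # Assumed settings: quiet, UTF-8, replace presentational markup,
        # strip Word-specific characters.
        ["tidy", "-quiet", "-utf8", "-clean", "-bare",
         "-output", str(tidied), str(html)],
    ]

def convert(doc, out_dir):
    """Run both steps; requires LibreOffice and HTML Tidy to be installed."""
    for cmd in conversion_commands(doc, out_dir):
        # tidy exits with status 1 when it only has warnings,
        # so a non-zero exit is not treated as fatal here.
        subprocess.run(cmd)

# Hypothetical usage: convert("Pitaka-014.doc", "out")
```

Keeping the command construction separate from the execution makes it easy to print and review the commands before running them over the whole collection.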

Pitaka-014.zip (87.3 KB)

A few questions:

The text has paragraphs. Are these actual paragraphs, or do they merely correspond to the page divisions?
The text has zero-width spaces. These are used to indicate a word break in languages that do not space words. I am wondering whether these have been applied properly, or if they are just random noise. Here’s an example:

កាលខ្ញុំព្រះអង្គទាំងឡាយ

If you can’t see them:

កាលខ្ញុំ***ព្រះអង្គ***ទាំងឡាយ

I’m suspicious, because they don’t occur often enough to be true word breaks. The HTML files from ti-kh also have these, but a lot more of them:

ព្រះ***ដ៏***មាន***ព្រះ***ភាគ***អង្***គនោះ
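
To compare the two sources more systematically, the zero-width spaces can be made visible and counted. A minimal Python sketch with hypothetical helper names; note that the density is measured per code point, not per visible Khmer syllable, so it is only a rough comparison between files.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

def show_zwsp(text, marker="***"):
    """Replace each zero-width space with a visible marker."""
    return text.replace(ZWSP, marker)

def zwsp_per_100_chars(text):
    """Zero-width spaces per 100 other code points in the text."""
    count = text.count(ZWSP)
    others = len(text) - count
    return 100 * count / others if others else 0.0
```

A much lower figure for the Word files than for the ti-kh HTML would support the suspicion that the marks in the Word files are not real word breaks.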

Churchwell · February 7, 2022, 7:28am

Well, some bad news. After reading only a few paragraphs of DN1, I can already see typos here and there. Sometimes it’s significant. For example, in the HTML file you sent, there are several sentences missing from the beginning of DN1. However, it looks like the files on ti-kh.org have fewer typos. Even so, it might require proofreading. But the good news is that aside from the typos, it seems like the texts are all the same translation.

By paragraphs, do you mean the passages marked by a number in brackets, like this: [១]? I think that in most cases, they do correspond to actual passages, like a monologue by a certain person.

To be honest, I don’t know much about the zero-width space. I don’t even know how to type it on my computer, and I didn’t know whether you were even supposed to use it for Khmer. If I had to guess, the majority of Khmer people don’t use it either. Besides, in the last sample you showed from the HTML files, the zero-width space was placed incorrectly: the last two words should be split as អង្គ***នោះ, not អង្***គនោះ. So I’m not an expert, but maybe the zero-width space can be ignored?

sujato · February 9, 2022, 8:29am

Oops, I think that is my mistake. I converted the files quickly and it looks like I gobbled some of the content. Here’s the original Word document, you can check.

Pitaka-014.zip (182.2 KB)

Proofreading is a whole next level of commitment. Generally speaking, we don’t proofread our legacy texts, we just take them as-is. One of the reasons for this is that we want to focus our efforts on new translations via Bilara.

If we can get a reasonable version online from existing files, that would be a great start.

Ok, that’s good.

Yes. It seems that those bracketed numbers appear at the start of a <p> in the HTML. So they appear to be paragraph numbers. That’s good!

Ok, good to know.

The main use of a zero-width space is to give hints as to where to break a line correctly. Browsers are usually not very clever when it comes to exotic languages. But people usually get used to it.

If they’re not reliable, best delete them all. If need be, we can find an algorithm to insert them properly at the end of the project.
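
Deleting them all is a small job. A minimal Python sketch, assuming the converted files are UTF-8 HTML under one folder; the function name and the `*.html` pattern are hypothetical. It rewrites files in place, so it should be run on a copy.

```python
from pathlib import Path

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def strip_zwsp_in_tree(root, pattern="*.html"):
    """Remove every zero-width space from matching files under root.

    Returns the total number of characters removed.
    """
    removed = 0
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8")
        count = text.count(ZWSP)
        if count:
            path.write_text(text.replace(ZWSP, ""), encoding="utf-8")
            removed += count
    return removed
```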

Churchwell · February 10, 2022, 3:30pm

Ah, that makes sense. I reread the first few paragraphs, and with that big error gone, the rest are very minor (mostly just random spelling errors every once in a while). So it should be at a low enough level to be acceptable.

Based on the sample I read, both the Word doc and the text on ti-kh.org have a similar level of errors, but they are both pretty low, so both should be acceptable. But you guys said you would rather use the Word docs, right?

sujato · February 12, 2022, 9:44pm

Yes, if possible, the word docs would be much better.

What do you think, shall we go ahead?

One issue is, we don’t actually have a roster of volunteers for this work at the moment. We used to, but recently our focus has shifted to making new translations on Bilara. So we need someone to do the work of converting the files.

Is this something you’d be interested in doing? I can help with the skills if need be.

If it’s too much for you, we should look for a volunteer.

Churchwell · February 13, 2022, 11:29am

Well, that depends. What would the time commitment be like? I’m still in school currently, so I don’t have an extreme amount of free time, but if it’s not too much, then I would gladly contribute whatever time I have, especially if there is no one else to do it. If it matters, I’m a computer science student, so the file conversion might be somewhat familiar to me.

Khemarato.bhikkhu · February 13, 2022, 12:13pm

Wow! Wish I could quadruple heart this thread! Makes me so happy to think SuttaCentral may be getting another Southeast Asian language soon!

If you’re at an American undergraduate program, I wonder if there’s a way you could spin this as a for-credit project? Maybe not worth the red tape in the end, but probably worth at least broaching with your advisor. Many universities love to support student passion projects.

trusolo · February 13, 2022, 2:41pm

Hi @Churchwell, adding to the suggestion by Venerable @Khemarato.bhikkhu, you could ask whether your program offers something called an Independent Study (usually a two-credit course). It lets you turn a research project into credits, and is typically used for summer research projects done instead of internships. Since you are in computer science, faculty whose interests are in natural language processing (NLP) might be interested in mentoring you.
With metta