My friend Ayya Saddha Bhikkhuni of Cambodia would greatly appreciate any suttas translated into her language.
Anything on site? Or any leads? Thanks in advance!
Ahha! I maybe found it?
A Khmer speaker will have to help, but I was able to confirm that there is a 1969 translation of the Pāḷi Canon into Khmer … and this might be it?
That looks like everything (via Google Translate):
There are 110 books in total in the Khmer language.
- Discipline consists of 13 chapters, 5 books (Part 1 - 13)
- The Sutta Padak has 64 chapters, 5 books (Part 14-77)
- Abhidhamma Bidak has 33 chapters, 7 books (part 78 - 110)
The Buddhist Council established the Trinity Committee in 1929, chaired by two monks and consisting of 40 members and lay people with advanced degrees in Buddhism and Pali. The first volume of the Trinity was born as a bilingual book, one Pali and one Khmer, published in 1931, and the last 110 volumes, first published in 1968, with a volume of more than 400 pages. This transformation took almost 40 years to complete. Our Khmer Trinity was inaugurated and celebrated on April 1-2, 2513-1969.
Yes, I did see the Google translate of the page. But I’m on mobile and can’t download the large PDFs to confirm that they indeed are what it says on the box.
Quite lucky that they finished by 1969. Just a few years later and the project never would have finished at all
This teaching resource is located in the US, but the monks are Cambodian:
Venerable, are the texts going to be added to SuttaCentral? I’m a Khmer speaker; if needed, I can assist with inputting the texts.
I don’t think there’s a plan to add them at this time, but if there are volunteers and copyright allows (two big ‘if’s!) I don’t see why SC wouldn’t be happy to welcome them.
@Churchwell - Could you take a look at the books and see what their copyright notice says? I assume they say something like “Copyright the Trinity Committee; All Rights Reserved”…
In the likely event that this is copyrighted but with no actual owner (the owning body being subsequently disbanded), Bhante @sujato: does SuttaCentral have a policy re orphaned works? Considering they were something of a public works project from the beginning, I think a strong argument could be made for these being considered public domain now, but it’s of course your call.
Thanks for the offer! That would be really helpful. Let’s see what we can do.
Not officially, but I agree that it’s probably okay in such a case. Generally, of course, there is a safe harbor provision for copyright, so if a work is added in good faith, then should the copyright owner complain, we merely have to take it down.
FYI, I did discuss adding Khmer suttas with some of the Cambodian community some years ago, but it never succeeded.
Alright, let’s have a look and see what we’ve got. Looking at the 5000-years site, it looks like the texts are present as PDF files. They’re not text files, they are scanned images. These are not suitable for us, as there’s no way to reliably transform images to text, especially text in exotic scripts like Khmer. The only use they would have would be as reference points for proof-reading.
So here’s the first question:
Here’s a start: this site seems to have the Tipitaka in Khmer in HTML. Here’s the Brahmajala Sutta:
http://ti-kh.org/books?book=14&page=7
So this might be a useful source. There’s a long way to go, however. Assuming we can scrape the site and get the content, there’s no semantic markup, hence transforming it into a form usable for SC will be tricky. Still, so long as the origin is in reasonably well-formed HTML it should be possible.
Can some of our Cambodian readers check out this site for us? It’d be good to know:
If we can’t find the whole Tipitaka, perhaps we can find some individual suttas or collections?
Ok so I looked at the links that you and Ven Khemarato shared, as well as a PDF Khmer Tripitaka that I have in my own possession, and here’s what I found:
I found nothing mentioning copyright at all. I’m not familiar with Cambodian copyright law (I’m of Khmer descent, but I’m an American), but in my opinion, there should be no legal issues? The Khmer translation was originally published under a government that no longer exists, and in addition, the files hosted on 5000-years.org are merely scanned reprints, not an original translation.
The text is the entire Tripitaka, and it looks like the same translation from 1969. However, it doesn’t look like the standard 6th council edition. Some of the Suttas have slightly different names, and some have additional verses. Sometimes the Pali words are spelled differently. Also, a few books from the Khuddaka Nikaya are missing; I assume they were not considered canonical.
Would manual input be feasible?
Great, thanks!
Let’s proceed on the understanding that these are orphaned works under copyright, and that we are using them in the spirit intended (i.e. we’re not selling them, etc.)
Now, the files from ti-kh.org are HTML, so these are better than the 5000-years ones for us.
They appear to be in a form where each page of HTML corresponds to a page in the book. Here’s a sample:
<div style="width: 560px; padding-left: 60px; padding-top: 50px;">
ααααβααβααΆαβααααβααΆαβααααααΆαβαααααααΌα‘β βααΆβαααα ααα
αα·αβαααβαα·αβαα·ααα
βααΌαβ βααααΆαβααΆααΈαα»αααβ βα―βαααα ααα
αα·αβαααβααααβααααβααβααΆαβααααβααΆαβααααααΆαβαβαα»αβαααα‘β βααααβααβααΆαβααααβααΆαβααααααΆαβααβααΆαβααα‘β βααααβααβααΆαβααααβααΆαβααααααΆαβαααααα‘β βααΆβαααα ααα
αα·αβαααβαα·ααα
βααΆαβααΌαβαβ βααααβααΆααΈαα»ααααααααβααααΆαααΌαβαα½αβααΆβ βααΌαβαααααααααβααααααΆαααααβ βαααα ααα
αα·αβαααβααααβααααβααβααΆαβααααβααΆαβααααααΆαβαα·αααααΈα‘β βααααβααβααΆαβααααβααΆαβααααααΆαβαα·ααΈα‘β βααααβααβααΆαβααααβααΆαβααααααΆαβαααααααΌα‘β βααΆαβααΆβαα·αβαα·ααα
βααΌαβ βαααβααΎβαααααβα ααα»β βαα·αβαα
αα
ααβααΌα
ααααα
βαβ βααααβααβααΆαβααααβααΆαβαααααβααααΆααβααΆβ βααααΆαβααΆααΈαα»αααβ βααααβααβααΆαβααααβααΆαβααααααΆαβαα·αααααΈα‘β βααααβααβααΆαβααααβααΆαβααααααΆαβαα·ααΈα‘β βααααβααβααΆαβααααβααΆαβααααααΆαβαααααααΌα‘β βααααα’αααβαα·αααΌαβααααΉαααααα<sup>β(βα‘β)β</sup> βααααααααβαααβααΆααβααΆααα‘αΆαβαααβαα·ααααΆαβααα»ααααΆαβααβ βαααα ααα
αα·αβαααβααΆαβα’αααα©αααααΆαβ βααααβαααααα»αααβααΆααα‘αΆαβαααβααΊβ βαα»βααααβ βααααααβ βααααααΆααααβ βααΆααΆβ βα§ααΆααβ βα₯αα·βαα»αααβααβ βααΆαβααβ βα’ααααΌαβαααααβ βααβαααααβ βααΆβαααβααΆαβα
αβαα½αβαα·α
βαβααΆααβ βαα·ααααΆααβααβαααααα»αααβααΆαααααβαα·αααΆαβααααααααααβααΆααβααΆααα‘αΆαβααβ βα’αΆααΆβααΆαα·αααααβ βααβαααααα»αααβααΆαααααβαα·αααΆαβαααααβα‘αΎαβ βαααβα’αααΉαβ βα₯αααΈβαααααα»αααβααβααΆααααβ<hr /><div style='font-size: 14px'>β(βα‘β)β βααααβα
ααβ βααΈβααΆααΈβααΆβ βαα·βααΆβαα»ααβ βααααβαααβααααααΆβαααα·αβ βααα»ααααβαααα»αβααΈαααβ βααΎβααααααΆβαααα·αβ βααΉαβαα
ααΆβααααβααα
ααααΈβ βαα·αβααβααΆααααβαααα»αβα’ααααααΆβ βαβααααΆααβααα
ααααΈβααΆβ βαααααα»αααβααααΆαβααα
ααααΈβαααα·αβααβαβ</div>
</div>
As you can see, there are no paragraphs, headings, or anything else that might help in structuring the text. In this page, there is a tag <hr /><div style='font-size: 14px'>, maybe that corresponds to a paragraph break?
Above that, the file has two kinds of title:
<h1>
αααααααααα·αα ααΆα α α‘
</h1>
and
<a href="" title='αα·αα ααΆα α α‘ - αααααααΈ α’α£'>
αα·αα ααΆα α α‘ - αααααααΈ α’α£</a>
Something about “sutta 9”? Can you make out what these are?
The file name, and also the page metadata, contain the information “book=1&page=23”.
That’s all I got! As far as I can tell, the entire collection is structured this way. There’s no indication of nikayas, sutta numbers (?) and other details. It’s all based on the book and page. Which makes it hard. We need to find something in the text, or the navigation, somewhere, that can tell us what these things are, not just what they correspond to in a Khmer edition.
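Since the book and page numbers are the only reliable structure, a first step could be to read them back out of each scraped URL and sort the pages into reading order. This is only a rough sketch: the function names are made up for illustration, and it assumes every URL carries the `book` and `page` query parameters in the same form as the sample link.

```python
from urllib.parse import urlparse, parse_qs


def book_and_page(url):
    """Return (book, page) as integers from a URL like .../books?book=14&page=7."""
    query = parse_qs(urlparse(url).query)
    return int(query["book"][0]), int(query["page"][0])


def sort_pages(urls):
    """Order page URLs by book number, then by page number within the book."""
    return sorted(urls, key=book_and_page)


if __name__ == "__main__":
    print(book_and_page("http://ti-kh.org/books?book=14&page=7"))
```

Sorting numerically matters here, because a plain string sort would put page 10 before page 7.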
So this is possible, but it’s not looking easy. Can we keep looking around for a bit, see if we can find a better source?
Let’s keep it for a last resort.
It’s hard since it’s structured according to books, and only loosely corresponds to actual sections. But sometimes you can search the string to find the title. For example:
βαααα ααβααΆαβααΌαααβ βααΈα‘β
indicates Brahmajala Sutra #1. So sometimes if you search the string for βααΌαααβ meaning Sutra in Khmer, followed by βααΈβ which indicates the number, you can find the title.
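The search described above could be sketched roughly like this. Everything here is an assumption for illustration: the function name is hypothetical, and the two Khmer strings are given as code points according to my reading of the words for "sutta" and the ordinal marker, so a Khmer reader should check them. Zero-width spaces are removed first so they cannot split the match.

```python
import re

ZWSP = "\u200b"
# Assumed spellings, written as code points; please verify against the texts.
SUTTA = "\u179f\u17bc\u178f\u17d2\u179a"  # word for "sutta"
ORDINAL = "\u1791\u17b8"                  # ordinal marker before the number
# Khmer digits occupy U+17E0 (zero) to U+17E9 (nine).
KHMER_DIGITS = "".join(chr(0x17E0 + i) for i in range(10))
TO_ASCII = str.maketrans(KHMER_DIGITS, "0123456789")

TITLE = re.compile(
    r"([^\s\[\]]*" + re.escape(SUTTA) + r")\s*"
    + re.escape(ORDINAL) + r"\s*([" + KHMER_DIGITS + r"]+)"
)


def find_titles(text):
    """Return (title, number) pairs for 'title ending in SUTTA + ORDINAL + digits'."""
    cleaned = text.replace(ZWSP, "")
    return [
        (m.group(1), int(m.group(2).translate(TO_ASCII)))
        for m in TITLE.finditer(cleaned)
    ]
```

Because Khmer is written without spaces between words, the captured title may include extra text that precedes it on the same line, so the results would still need a human check.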
But this becomes even more complicated because sometimes, especially in the Anguttara and Khuddaka Nikayas, the sutta isn’t even labeled. It is only indicated by a number. For example, in the Anguttara Nikaya:
βαααβαααααΆβααβ βααΆαβααααβ
indicates Pathama pannasaka Bala Vagga. In the text that follows, the sutta is only indicated by “[១]”, corresponding to AN 3.1. But the actual name of the sutta isn’t shown.
So for me, it’s pretty easy to see where a sutta begins and ends, but for someone who can’t read Khmer, it might not be obvious.
Regarding other sources, almost every other site I’ve found also has the scanned PDFs. But I did find them in Microsoft Word doc format here:
http://5000-years.org/kh/read/2976
The text is formatted a bit weird, but other than that it’s OK. Someone would still have to search through to find the sutta boundaries though.
Okay thanks, that’s all useful info.
Now, I’ve scraped the ti-kh site, and ended up with some 26,000 files containing the Tipitaka. It seems some of these are in Pali and some in Khmer. Leaving aside the Pali we have maybe 10-15,000 files in Khmer. That’s a lot! Not easy to wrangle these into a usable form.
The Word documents on 5000-years seem like a more usable starting point. But first, can I ask you a couple more questions?
Assuming that they are the same translation, and that there are no major differences in proofing, we might consider moving forward with the Word files.
I have prepared one of these in plain HTML.
So that’s pretty straightforward, and the result is fairly clean.
Pitaka-014.zip (87.3 KB)
A few questions:
ααΆααααα»αβααααα’αααβααΆααα‘αΆα
If you can’t see them:
ααΆααααα»αβ***ααααα’αααβ***ααΆααα‘αΆα
I’m suspicious, because they don’t occur often enough to be true word breaks. The HTML files from ti-kh also have these, but a lot more of them:
αααα***βααβ***ααΆαβ***αααα***βααΆα***βα’αα***αβααα
Well, some bad news. After reading only a few paragraphs of DN1, I can already see typos here and there. Sometimes they’re significant. For example, in the HTML file you sent, there are several sentences missing from the beginning of DN1. However, it looks like the files on ti-kh.org have fewer typos. Even so, it might require proofreading. But the good news is that aside from the typos, it seems like the texts are all the same translation.
By paragraphs, do you mean the passages marked by a number in brackets, like this: [១]? I think that in most cases, they do correspond to actual passages, like a monologue by a certain person.
To be honest, I don’t know much about the zero-width space. I don’t even know how to type it on my computer, and I didn’t know whether you were even supposed to use it for Khmer. If I had to guess, the majority of Khmer people don’t use it either. Besides, in the last sample you showed from the HTML files, the zero-width space was done incorrectly. The last 2 words should be α’αααααα, not α’αααααα. So I’m not an expert, but maybe the zero-width space can be ignored?
Oops, I think that is my mistake. I converted the files quickly and it looks like I gobbled some of the content. Here’s the original Word document, you can check.
Pitaka-014.zip (182.2 KB)
Proofreading is a whole next level of commitment. Generally speaking, we don’t proofread our legacy texts, we just take them as-is. One of the reasons for this is that we want to focus our efforts on new translations via Bilara.
If we can get a reasonable version online from existing files, that would be a great start.
Ok, that’s good.
Yes. It seems that those bracketed numbers appear at the start of a <p> in the HTML. So they appear to be paragraph numbers. That’s good!
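If the bracketed numbers really do mark paragraphs, they could be pulled out of each paragraph's text with something like the sketch below. The function name is hypothetical, and it assumes the marker is always a run of Khmer digits inside square brackets at the very start of the paragraph.

```python
import re

# Khmer digits occupy U+17E0 (zero) to U+17E9 (nine).
KHMER_DIGITS = "".join(chr(0x17E0 + i) for i in range(10))
TO_ASCII = str.maketrans(KHMER_DIGITS, "0123456789")
MARKER = re.compile(r"^\s*\[([" + KHMER_DIGITS + r"]+)\]\s*")


def split_marker(paragraph):
    """Return (number, rest) if the paragraph starts with [digits], else (None, paragraph)."""
    m = MARKER.match(paragraph)
    if m is None:
        return None, paragraph
    number = int(m.group(1).translate(TO_ASCII))
    return number, paragraph[m.end():]
```

Paragraphs without a marker come back unchanged with `None` as the number, so gaps or jumps in the numbering are easy to spot afterwards.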
Ok, good to know.
The main use of a zero-width space is to give hints as to where to break a line correctly. Browsers are usually not very clever when it comes to exotic languages. But people usually get used to it.
If they’re not reliable, best delete them all. If need be, we can find an algorithm to insert them properly at the end of the project.
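Deleting them is a one-liner, since the zero-width space is a single code point, U+200B. A minimal sketch, with made-up function names:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def count_zwsp(text):
    """Count the zero-width spaces in the text, useful as a check before and after."""
    return text.count(ZWSP)


def strip_zwsp(text):
    """Remove every zero-width space, leaving all other characters untouched."""
    return text.replace(ZWSP, "")
```

This only touches U+200B. Other invisible characters, such as the zero-width joiner, are left alone, because they can affect how the script is rendered.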
Ah, that makes sense. I reread the first few paragraphs and with that big error gone, the rest are very minor (mostly just random spelling errors every once in a while). So it should be at a low enough level to be acceptable.
Based on the sample I read, both the Word doc and the text on ti-kh.org have a similar level of errors, but they are both pretty low so both should be acceptable. But you guys said you would rather use the Word docs, right?
Yes, if possible, the Word docs would be much better.
What do you think, shall we go ahead?
One issue is, we don’t actually have a roster of volunteers for this work at the moment. We used to, but recently our focus has shifted to making new translations on Bilara. So we need someone to do the work of converting the files.
Is this something you’d be interested in doing? I can help with the skills if need be.
If it’s too much for you, we should look for a volunteer.
Well that depends, what would the time commitment be like? I’m still in school currently, so I don’t have an extreme amount of free time, but if it’s not too much, then I would gladly contribute whatever time I have, especially if there is no one else to do it. If it matters, I’m a computer science student, so the file conversion might be somewhat familiar to me.
Wow! Wish I could quadruple heart this thread! Makes me so happy to think SuttaCentral may be getting another Southeast Asian language soon!
If you’re at an American undergraduate program, I wonder if there’s a way you could spin this as a for-credit project? Maybe not worth the red-tape in the end, but probably worth at least broaching with your advisor. Many universities love to support student passion-projects.
Hi @Churchwell, adding to the suggestion by Venerable @Khemarato.bhikkhu, you could inquire if there is something called an Independent Study (it is usually a two credit course). It is to convert a research project for credits, typically done when you do summer research projects instead of internships. Since you are in computer science, faculty whose interests are in natural language processing (NLP) might be interested in mentoring you.
with metta