Help me find EBTs in Khmer

It’s hard to say exactly. If I am converting a text the size of a nikaya, and I have a reasonable source, I’d allow perhaps a morning’s work.

Obviously it would take time to work up the skills, and it will vary a lot depending on the sources and the texts, but it needn’t be too long. And more importantly, there’s no deadline, so you can work gradually.

How about we start with DN, I’ll mentor you through the process, then we can see?

In that case you should be fine.

Basically what we need to do is:

  • get the files ✅
  • convert to HTML
    • if starting from a Word document, I find the best way is usually to open the file in the browser and then save the source. For some reason it gives a cleaner result than simply “save as HTML”.
  • I find it easier to work with an entire nikaya in one file, so if it’s separate files, combine them.
    • (use cat or whatever)
  • clean the HTML
    • use HTML Tidy and maybe some regex. So far as is possible, we want clean, semantic HTML as a basis.
  • remove any extraneous content
    • notes, etc. This requires some manual checking.
  • bring the HTML into line with the required SC standard
  • add a metadata section for each sutta
    • publication details, licensing, etc.
  • check the HTML with Tidy, also visually
    • visual inspection lets you check that paragraphs, headings, etc. are correct. One method I use is to apply a very coarse and obvious stylesheet that makes borders or backgrounds for every different element.
  • check the file for typographic niceties, such as:
    • single spaces only
    • no triple dots (...), only ellipses (…)
    • use correct dashes
    • use correct quotes
  • split the files into separate files per sutta.
  • upload to SC!
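
To give a feel for the typographic step, here is a minimal sketch in Python. It only covers the two mechanical fixes (triple dots and repeated spaces); dashes and quotes depend on context, so the sketch just counts straight quotes for manual review. The function names are made up for illustration and are not part of any SC tooling, and it assumes you run it on text where collapsing spaces is harmless.

```python
import re


def normalize_typography(text: str) -> str:
    """Apply the two purely mechanical fixes from the checklist."""
    # Three literal dots become one ellipsis character (U+2026).
    text = text.replace("...", "\u2026")
    # Runs of two or more spaces collapse to a single space.
    text = re.sub(r" {2,}", " ", text)
    return text


def count_straight_quotes(text: str) -> int:
    """Count straight quotes so they can be reviewed by hand.

    Hypothetical helper: choosing the correct curly quote needs context,
    so this only reports how many straight ones remain.
    """
    return text.count('"') + text.count("'")


if __name__ == "__main__":
    sample = 'evam  me   sutam... "ekam samayam"'
    print(normalize_typography(sample))
    print(count_straight_quotes(sample))
```

In practice you would read the combined nikaya file, run it through something like this, and then still check the result by eye.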

We’ll also need to add Khmer support to SC, but this is simple.

What do you think? Shall we start with DN?

@Khemarato.bhikkhu and @trusolo thanks for the suggestions, those are good ideas! I’m actually taking an NLP course right now, so might be able to ask my professor.

@sujato sure, we can start with DN. For me, as long as there’s no deadline, I should be fine. I won’t be super fast, but I’m willing to work at it until the entire Tripitaka is finished. Only thing is I don’t really have HTML experience, but I should be able to figure it out.
