Vimuttimagga: OCR conversion of pdf page images to plain text or html, with romanized pali?

frankk · October 30, 2016, 5:05pm

there’s a pdf file of the 400+ page vimuttimagga on internet archive.
i found a free way to OCR it to plain html.

using pdfsam for pc, i split the 400 page pdf file into 40 pdf files, each 10 pages long,
since google drive and google docs will only OCR at most 10 pages of a single pdf.
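
The 10-page split described above comes down to simple page-range arithmetic. Here is a minimal Python sketch of it, standard library only; the function name `chunk_ranges` is hypothetical, and it only computes the ranges to enter into a splitting tool such as pdfsam rather than touching the pdf itself:

```python
def chunk_ranges(total_pages, chunk_size=10):
    """Return 1-based inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        # The last chunk is shortened if the page count is not a multiple of chunk_size.
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    ranges = chunk_ranges(400)
    print(len(ranges), ranges[0], ranges[-1])  # 40 (1, 10) (391, 400)
```

A document that is a few pages over 400 would simply get one extra, shorter chunk at the end.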

now it’s choking on the diacriticals for the pali. is there a way to train google docs to understand romanized pali script?

i figure sutta central must have lots of experience with the many languages and media sources used to assemble the awesome collection of suttas we have.

if not, i’ll go ahead and finish up the vimuttimagga conversion job later this week, and the english part of it should be very readable.

Gabriel_L · October 30, 2016, 10:43pm

Please do share with us the result. This is a text worth checking!

sujato · October 30, 2016, 11:17pm

Hi @frankk,

To answer your question, no, I don’t think it’s possible to train Google Docs to recognize Pali diacriticals. Nor do I think you’ll get reliable results out of the box from other OCR tools. I use Tesseract (https://github.com/tesseract-ocr), which is good, but I wouldn’t expect it to recognize diacriticals. Surely there are ways of training such specialized tools to get diacriticals right, but they wouldn’t be for the faint of heart!
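
One rough way to see where an OCR pass has dropped diacriticals is to compare its output against a known-good spelling with the marks folded away. The following is a minimal Python sketch using only the standard `unicodedata` module; the function names `fold_diacriticals` and `lost_diacriticals` are hypothetical, and this is a checking aid, not a way of training any OCR tool:

```python
import unicodedata


def fold_diacriticals(text):
    """Strip combining marks, e.g. turn 'jhāna' into 'jhana'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lost_diacriticals(expected, ocr_output):
    """True if ocr_output differs from expected only by missing or altered diacriticals."""
    same = unicodedata.normalize("NFC", expected) == unicodedata.normalize("NFC", ocr_output)
    return (not same) and fold_diacriticals(expected) == fold_diacriticals(ocr_output)


if __name__ == "__main__":
    print(fold_diacriticals("ānāpānasati"))
    print(lost_diacriticals("ānāpānasati", "anapanasati"))
```

This only catches the case where the base letters survive; it says nothing about words the OCR garbles in other ways.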

I’m wondering, why don’t you simply use the plain text version from the internet archive? At a quick glance it looks reasonably good, at least the English portions. As for the Pali text, the stuff in footnotes is mostly not really necessary; personally I’d just remove the footnotes and concentrate on getting a nice readable text.

There is no program that can accurately prepare a clean and reliable HTML version of a complex document like this from a scanned copy. The problem is that the information is so complex that you need to contextually understand how it fits together to make sense of it. Perhaps next-gen AI tools could do this, but not today.

For SC, we start with the best sources we can find, and work with them by hand. Each case is different.

Incidentally, my friend Ven Nyanatusita has been working for a long time on a revised translation; it is well known that the extant translation, as a pioneering work, is not very reliable.

frankk · October 31, 2016, 4:10pm

thanks bhante.

i just did a search for vimuttimagga in archive.org,
and since the last time i checked, maybe a year or two ago,
i see there’s an audio version! awesome. someone with a european accent speaking english, but clear enough i can understand.

i took a look at the text version on archive.org, looks ok, but
i’ll go ahead and finish up the google docs html version i started then, it won’t take long. google did a pretty good job matching fonts on the html output, so perhaps they’ll be more accurate on the text itself.

when does ven. nyanatusita plan on publicly sharing his work on vimt.?
i love his work on the patimokkha, the word by word pali breakdown in all its glorious gory detail.

since the source material is in ancient classical chinese, i’m a little skeptical about how well anyone can really translate it. erudite chinese buddhists i know who have translated portions of the agamas don’t even agree with each other on lots of things, simply because ancient classical chinese is so terse and knotty.

frankk · October 31, 2016, 5:54pm

ok, here is the raw html conversion in a zip file, and a superquick epub job adding some table of contents headers:

https://archive.org/download/VimuttimaggaHtmlAndEpub

homage to the arahant upatissa!

i don’t plan on doing any more formatting or editing on this, other than a few sections i’m interested in, such as anapanasati, jhana, and the 4 elements.

sujato · October 31, 2016, 10:55pm

Thanks so much for sharing with us.

Yes, translating the Vimuttimagga properly is a monumental task. There are many parallels that must be considered, including those in Tibetan. And abhidhamma texts in Chinese are even less studied than the Agamas. Ven Nyanatusita is a slow and thorough scholar; I have no idea of his timeline on this.

I feel very fortunate doing Pali translations; the texts are on the whole very clear and well understood, and the translation issues have been worked over for more than a century. I feel like I’m upgrading a well-worn track, while the translators from Chinese are clearing untrodden ground.