Pali and Optical Character Recognition

Antonio-Costanzo · November 16, 2020, 6:01pm

As I understand the matter, you found what you were searching for. One question: do you need a digitized version of the parts of the PDF that are just images? In other words, do you need some OCR work?

khagga · November 16, 2020, 6:16pm

Hi @Antonio-Costanzo, thanks for asking. The PDF is all text, so it’s not really an OCR kind of situation. I suppose one could try running OCR on the PDF… out of curiosity I just tried with https://tesseract.projectnaptha.com/. Here’s how that came out:

Unsurprisingly the default language model didn’t support diacritics, so it would be a huge post-editing process to fix that. On the other hand, if one were to use a language model which supported diacritics it might work. Perhaps such a model exists for Tesseract, but I’ve not looked into it. (Probably worth having a discussion about that around here, as I imagine many people would be interested in OCR’ing old books.)

But yes, in this case it’s not necessary since we’re in touch with the original author and a born-digital version exists. Do you work with OCR? It’s a topic I’m interested in — especially the topic of developing custom language models. But Tesseract (which seems to be the only open-source candidate out there) is pretty complicated.

Incidentally, I just searched this site and there are plenty of hits for “OCR” that might be of interest to you!

Just ran across this thread, for instance, which has some interesting stuff on OCR.

Antonio-Costanzo · November 16, 2020, 6:36pm

Hi @khagga,

I tried Tesseract once but realized it was too difficult for me to use. As I was searching for a way to set Pāli as the OCR language, after some time spent on research, I had to buy a very expensive licence for ABBYY FineReader. I managed to make a txt file with all the Pāli Tipiṭaka words, so I normally get good results when doing OCR of Pāli text or other texts with Pāli quotations. Anyway, you still have to do a lot of work after running the OCR over all the pages you need. Just to show you, this is a sample of how ABBYY works:

All the highlighted glyphs are ones the program is uncertain about.
E.g., if I want to correct the spelling in the page I showed you, I have to run the verify tool:

When you are satisfied with your editing work, you can save it in many formats, including searchable PDF, DOCX, HTML and many more.

khagga · November 17, 2020, 3:09am

This looks like a reasonable workflow in ABBYY. I have some familiarity with the output of ABBYY because that’s what the Internet Archive uses and I’ve done work on other languages there.

Specifically, I work on a project on colonial documents in 17th-century Spanish and Aymara. There are several diacritics that ABBYY wasn’t trained on, so that, for instance, long s was consistently mapped to £, and so forth.
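
A consistent confusion like this can usually be undone mechanically after OCR. Here is a minimal sketch in Python, assuming the only systematic error is £ standing in for long s (ſ) and that the source text contains no genuine £ signs. The table and the function name are illustrative and are not part of ABBYY or of any project tooling.

```python
# Hypothetical substitution table: character the OCR produced -> character intended.
# Extend it with other one-to-one confusions observed in a given corpus.
SUBSTITUTIONS = {"£": "ſ"}


def fix_systematic_errors(text, table=SUBSTITUTIONS):
    """Replace every character listed in `table` with its intended form."""
    return text.translate(str.maketrans(table))


print(fix_systematic_errors("£u £eñor"))
```

This only helps with errors that are truly one-to-one. Anything context-dependent still needs a human proofreading pass.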

Other than that, ABBYY’s output was great in my experience. On the other hand, I’m pathologically stingy, so I pretty much never buy software.

It might be interesting to see what Pali content is on archive.org and see how the OCR came out. Probably all of the PTS content is there?

sujato · November 17, 2020, 5:40am

Here’s the start of AN 5.1:

Evam me sutaip. Ekaqi samayaQi Bhagava Sayatthi-
yajp viharati Jetavane Anathapi9dika8sa arame. Tatra
kho Bhagava bhikkhu amantesi: — Bhikkhavo ti. Bhadante’
ti te bhikkhu Bhagavato paccassosum. Bhagava etad
avoca: —

Pane’ imani bhikkhave sekhabals^ni. Katamani panca?
Saddhabalaiii , hiribalam ^ , ottappabalaip , viriyabalaip,

pannEbalam.

khagga · November 17, 2020, 11:20am

Oh dear, seems like editing that would be more trouble than typing it again!

Antonio-Costanzo · November 17, 2020, 11:41am

If you have a Pāli dictionary in ABBYY, the results are not as bad as the ones Bhante Sujato showed. I’m quite sure that the files you find in the Archive were made without a Pāli dictionary, so it is to be expected that they contain many errors. I made a dictionary based on CST 4.0 and, trust me, the editing work is not so terrifying, even though you obviously have to spend some time doing it. The better the dictionary you have, the better the results you obtain. In my case the dictionary lacks stem forms such as dhamma or buddha, so I’m adding these stem forms to the dictionary little by little. Another good thing is that you can train ABBYY to recognize special forms of glyphs, such as the long s, by choosing a suitable font. The real problem with ABBYY, I think, is not the quality of the software or the editing you need to do, but the price. Unfortunately, it is not affordable for everyone.
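
The word-list idea can also be applied outside ABBYY, as a post-processing check on any OCR output. The sketch below is only an illustration of the principle and not a description of how ABBYY uses its dictionary internally. It assumes a plain-text word list with one word per line, and the function names are hypothetical. It flags every token that is not in the list and suggests the closest known words using Python's standard `difflib` module.

```python
import difflib
import re
import unicodedata


def load_wordlist(lines):
    """Build a set of known words from an iterable of lines, one word per line."""
    return {
        unicodedata.normalize("NFC", line.strip().lower())
        for line in lines
        if line.strip()
    }


def flag_unknown_words(text, wordlist, max_suggestions=3):
    """Map each token missing from `wordlist` to a list of close known words."""
    text = unicodedata.normalize("NFC", text)
    candidates = sorted(wordlist)
    flagged = {}
    # [^\W\d_]+ matches runs of letters, including precomposed ā, ṃ, ṭ, ñ, etc.
    for token in re.findall(r"[^\W\d_]+", text):
        key = token.lower()
        if key not in wordlist and key not in flagged:
            flagged[key] = difflib.get_close_matches(
                key, candidates, n=max_suggestions, cutoff=0.6
            )
    return flagged


# Tiny illustrative word list; a real one would come from a full corpus.
words = load_wordlist(["evaṃ", "me", "sutaṃ"])
for word, suggestions in flag_unknown_words("Evam me sutaip", words).items():
    print(word, "->", suggestions)
```

As noted above, a list of inflected forms alone will flag stem forms such as dhamma as unknown, so the list needs to grow over time. The suggestions are only candidates for a human editor to confirm.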

sujato · November 17, 2020, 11:00pm

Indeed. I find it strange that with all our advances in neural nets and machine learning, such a fundamental task is still so clunky. Imagine what it’s like with even more exotic scripts.

Hmm, I wonder whether it would be possible to get GPT-3 to do this?