SuttaCentral

Pali and Optical Character Recognition

Dear @khagga,

As I understand it, you found what you were looking for. One question: do you need a digitized version of the parts of the PDF that are just images? In other words, do you need some OCR work?

Hi @Antonio-Costanzo, thanks for asking. The PDF is all text, so it’s not really an OCR kind of situation. I suppose one could try running OCR on the PDF… out of curiosity I just tried with https://tesseract.projectnaptha.com/. Here’s how that came out:

Unsurprisingly the default language model didn’t support diacritics, so it would be a huge post-editing process to fix that. On the other hand, if one were to use a language model which supported diacritics it might work. Perhaps such a model exists for Tesseract, but I’ve not looked into it. (Probably worth having a discussion about that around here, as I imagine many people would be interested in OCR’ing old books.)
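For anyone who wants a rough measure of whether an OCR run kept the diacritics, here is a minimal Python sketch. It assumes romanized Pāli and a hand-picked set of marked letters; the name `diacritic_ratio` is only illustrative, and nothing here talks to Tesseract itself.

```python
import unicodedata

# Assumption: these are the marked letters we expect in romanized Pāli.
PALI_DIACRITICS = set("āīūṅñṭḍṇḷṃṁ")


def diacritic_ratio(text: str) -> float:
    """Return the share of letters in `text` that carry a Pāli diacritic."""
    # NFC turns "m + combining dot below" into the single character "ṃ".
    text = unicodedata.normalize("NFC", text).lower()
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    marked = sum(1 for ch in letters if ch in PALI_DIACRITICS)
    return marked / len(letters)


print(diacritic_ratio("Evam me sutam"))   # OCR output with diacritics lost
print(diacritic_ratio("Evaṃ me sutaṃ"))   # the same phrase with diacritics
```

A ratio of zero on a page that should be Pāli is a fair hint that the language model dropped the diacritics, though it says nothing about other kinds of errors.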

But yes, in this case it’s not necessary since we’re in touch with the original author and a born-digital version exists. Do you work with OCR? It’s a topic I’m interested in — especially the topic of developing custom language models. But Tesseract (which seems to be the only open-source candidate out there) is pretty complicated.

Incidentally, I just searched this site and there are plenty of hits for “OCR” that might be of interest to you!

:pray:

Just ran across this thread, for instance, which has some interesting stuff on OCR.

Hi @khagga,

I tried Tesseract once but found it too difficult to use. While looking for a way to set Pāli as the OCR language, and after some time spent on research, I ended up buying a very expensive licence for ABBYY FineReader. I managed to make a txt file with all the Pāli Tipiṭaka words, so I normally get good results when doing OCR of Pāli texts or other texts with Pāli quotations. Even so, there is still a lot of work to do after running the OCR over all the pages you need. Just to show you, this is a sample of how ABBYY works:


All the highlighted glyphs are ones the program is uncertain about.
For example, if I want to correct the spelling on the page I showed you, I have to run the verify tool:

image

When you are satisfied with your editing work, then you can save it in many formats, including searchable pdf, docx, html and much more.

This looks like a reasonable workflow in ABBYY. I have some familiarity with ABBYY's output because that’s what the Internet Archive uses, and I’ve done work on other languages there.

Specifically, I work on a project on colonial documents in 17th-century Spanish and Aymara. There are several diacritics that ABBYY wasn’t trained on, so that, for instance, the long s was consistently mapped to £, and so forth.
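When a misreading is that consistent, it can in principle be reversed in one pass after OCR. A minimal Python sketch, assuming the only known confusion is £ for the long s (ſ); the table and the function name are made up for illustration, and any genuine £ signs in the source would be changed too:

```python
# Assumption: every "£" in the OCR output stands for a long s (ſ).
# Extend the table with other confusions as they are discovered.
KNOWN_CONFUSIONS = str.maketrans({"£": "ſ"})


def fix_known_confusions(text: str) -> str:
    """Replace characters the OCR engine is known to get wrong."""
    return text.translate(KNOWN_CONFUSIONS)


print(fix_known_confusions("£u merced"))
```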

Other than that, ABBYY’s output was great in my experience. On the other hand, I’m pathologically stingy, so I pretty much never buy software :rofl:

It might be interesting to see what Pali content is on archive.org and see how the OCR came out. Probably all of the PTS content is there?

Here’s the start of AN 5.1:

  1. Evam me sutaip. Ekaqi samayaQi Bhagava Sayatthi-
    yajp viharati Jetavane Anathapi9dika8sa arame. Tatra
    kho Bhagava bhikkhu amantesi: — Bhikkhavo ti. Bhadante’
    ti te bhikkhu Bhagavato paccassosum. Bhagava etad
    avoca: —

  2. Pane’ imani bhikkhave sekhabals^ni. Katamani panca?
    Saddhabalaiii , hiribalam ^ , ottappabalaip , viriyabalaip,

pannEbalam.

:grimacing:

Oh dear, seems like editing that would be more trouble than typing it again! :scream:

If you have a Pāli dictionary in ABBYY, the results are not as bad as what Bhante Sujato showed. I’m quite sure that the files you find on the Internet Archive were made without a Pāli dictionary, so it is to be expected that they contain many errors. I made a dictionary based on CST 4.0 and, trust me, the editing work is not so terrifying, even though you obviously have to spend some time on it. The better the dictionary, the better the results. In my case the dictionary does not contain stem forms such as dhamma or buddha, so I’m adding these to the dictionary little by little. Another good thing is that you can train ABBYY to recognize special forms of glyphs, such as the long s, by choosing a suitable font. The real problem with ABBYY, I think, is not the quality of the software or the editing you need to do, but the price. Unfortunately, it is not affordable for everyone.
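The same dictionary idea can be tried outside ABBYY as a post-correction step. The following is only a sketch in Python using the standard library: it builds a word list from a reference text, flags OCR tokens that are not in it, and suggests near matches with `difflib`. The function names, the similarity cutoff of 0.6, and the one-line reference text (standing in for a real corpus such as CST) are all assumptions made for illustration; this is not how ABBYY uses its dictionary internally.

```python
from __future__ import annotations

import difflib
import re
import string
import unicodedata

# Punctuation to trim from the edges of a token before lookup.
EDGE_PUNCT = string.punctuation + "‘’“”"


def build_wordlist(corpus: str) -> set[str]:
    """Collect every distinct lowercase word from a reference text."""
    corpus = unicodedata.normalize("NFC", corpus).lower()
    # [^\W\d_]+ matches runs of Unicode letters, including ā, ṃ, ñ and so on.
    return set(re.findall(r"[^\W\d_]+", corpus))


def flag_unknown(
    ocr_text: str, wordlist: set[str], cutoff: float = 0.6
) -> list[tuple[str, list[str]]]:
    """Return (token, suggestions) for each OCR token missing from the word list."""
    candidates = sorted(wordlist)
    flagged = []
    for token in unicodedata.normalize("NFC", ocr_text).split():
        word = token.strip(EDGE_PUNCT).lower()
        if not word or word in wordlist:
            continue
        # Up to three dictionary words that look similar to the unknown token.
        suggestions = difflib.get_close_matches(word, candidates, n=3, cutoff=cutoff)
        flagged.append((token, suggestions))
    return flagged


# Stand-in reference text; a real word list would come from a full corpus.
reference = "evaṃ me sutaṃ ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati"
words = build_wordlist(reference)
for token, suggestions in flag_unknown("Evam me sutaip.", words):
    print(token, "->", suggestions)
```

As noted above about missing stem forms, the word list decides the quality: a token is only as correctable as the dictionary is complete, and a human still has to confirm each suggestion.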

Indeed. I find it strange that with all our advances in neural nets and machine learning, such a fundamental task is still so clunky. Imagine what it’s like with even more exotic scripts.

Hmm, I wonder whether it would be possible to get GPT-3 to do this?