Dear @khagga,
As I understand it, you found what you were looking for. One question: do you need a digitized version of the parts of the PDF that are just images? In other words, do you need some OCR work?
Hi @Antonio-Costanzo, thanks for asking. The PDF is all text, so it's not really an OCR kind of situation. I suppose one could try running OCR on the PDF… Out of curiosity, I just tried with https://tesseract.projectnaptha.com/. Here's how that came out:
Unsurprisingly, the default language model didn't support diacritics, so it would be a huge post-editing process to fix that. On the other hand, if one were to use a language model which supported diacritics, it might work. Perhaps such a model exists for Tesseract, but I've not looked into it. (Probably worth having a discussion about that around here, as I imagine many people would be interested in OCR'ing old books.)
But yes, in this case it's not necessary, since we're in touch with the original author and a born-digital version exists. Do you work with OCR? It's a topic I'm interested in, especially the topic of developing custom language models. But Tesseract (which seems to be the only open-source candidate out there) is pretty complicated.
Incidentally, I just searched this site and there are plenty of hits for "OCR" that might be of interest to you!
Just ran across this thread, for instance, which has some interesting stuff on OCR.
Hi @khagga,
I tried Tesseract once, but I found it too difficult to use. I was searching for a way to set Pāli as the OCR language, and after some time spent on research I had to buy a very expensive licence for ABBYY FineReader. I managed to make a txt file with all the Pāli Tipiṭaka words, so I normally get good results when doing OCR of Pāli texts or other texts with Pāli quotations. Anyway, you still have to do a lot of work after running the OCR over all the pages you need. Just to show you, this is a sample of how ABBYY works:
When you are satisfied with your editing work, you can save it in many formats, including searchable PDF, DOCX, HTML and more.
This looks like a reasonable workflow in ABBYY. I have some familiarity with the output of ABBYY because that's what the Internet Archive uses, and I've done work on other languages there.
Specifically, I work on a project on colonial documents in 17th-century Spanish and Aymara. There are several diacritics that ABBYY wasn't trained on, so that, for instance, the long s was consistently mapped to £, and so forth.
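When a misrecognition is as consistent as the long s coming out as £, it can be undone in bulk after the OCR pass. Here is a minimal Python sketch of that idea. The table name `OCR_FIXES`, the function name and the sample string are made up for illustration; only the £ / long s pair comes from the case described above.

```python
# Hypothetical correction table: each key is a character the OCR engine
# emitted, each value is the character that was actually on the page.
# Only the "£" -> "ſ" (long s) pair comes from the discussion; any other
# pairs would have to be collected by inspecting real output.
OCR_FIXES = str.maketrans({"£": "ſ"})


def fix_known_confusions(text: str) -> str:
    """Replace systematically misrecognized characters in OCR output."""
    return text.translate(OCR_FIXES)


# Made-up sample string, not taken from any real scan.
sample = "£u Mage£tad"
print(fix_known_confusions(sample))
```

A blanket replacement like this is only safe if the source never contains a genuine £ sign, so it is worth checking a few pages by hand first.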
Other than that, ABBYY's output was great in my experience. On the other hand, I'm pathologically stingy, so I pretty much never buy software.
It might be interesting to see what Pali content is on archive.org and see how the OCR came out. Probably all of the PTS content is there?
Evam me sutaip. Ekaqi samayaQi Bhagava Sayatthi-
yajp viharati Jetavane Anathapi9dika8sa arame. Tatra
kho Bhagava bhikkhu amantesi: “ Bhikkhavo ti. Bhadante”
ti te bhikkhu Bhagavato paccassosum. Bhagava etad
avoca: “Pane” imani bhikkhave sekhabals^ni. Katamani panca?
Saddhabalaiii , hiribalam ^ , ottappabalaip , viriyabalaip,pannEbalam.
Oh dear, seems like editing that would be more trouble than typing it again!
If you have a Pāli dictionary in ABBYY, the results are not as bad as what Bhante Sujato showed. I'm quite sure that the files you find on the Archive were made without a Pāli dictionary, so it is to be expected that they contain many errors. I made a dictionary based on CST 4.0 and, trust me, the editing work is not so terrifying, even though you obviously have to spend some time doing it. The better your dictionary, the better the results you get. In my case, the dictionary does not contain stem forms such as dhamma or buddha, so I'm adding these stem forms to the dictionary little by little. Another good thing is that you can train ABBYY to recognize special glyph forms, such as the long s, by choosing a suitable font. The real problem with ABBYY, I think, is not the quality of the software or the editing you need to do, but the price. Unfortunately, it is not affordable for everyone.
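For anyone without ABBYY, the general idea of checking OCR output against a word list can be tried with a few lines of Python. This is only a rough sketch of the principle, not how ABBYY uses its dictionary internally. The tiny `WORDS` list, the function names and the cutoff values are assumptions for illustration; a real list would come from something like the CST 4.0 text.

```python
import difflib
import unicodedata

# Tiny stand-in word list; a real one would hold every form in the corpus.
WORDS = ["bhagavā", "bhikkhū", "viharati", "jetavane", "sutaṃ"]


def strip_diacritics(word: str) -> str:
    """Drop combining marks, so 'bhagavā' becomes 'bhagava'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest(token: str, words: list[str], cutoff: float = 0.8) -> list[str]:
    """Return the dictionary words whose diacritic-free form best matches token."""
    # Index the dictionary by diacritic-free spelling, since the OCR output
    # in question has lost its diacritics.
    index: dict[str, list[str]] = {}
    for w in words:
        index.setdefault(strip_diacritics(w), []).append(w)
    key = strip_diacritics(token.lower())
    # Fuzzy-match the OCR token against the diacritic-free keys.
    best = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[best[0]] if best else []


for token in ["Bhagava", "sutaip", "zzzz"]:
    print(token, "->", suggest(token, WORDS, cutoff=0.7))
```

The suggestions are candidates for a human to confirm, not automatic fixes: a lower cutoff catches more garbled words but also proposes more wrong matches.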
Indeed. I find it strange that with all our advances in neural nets and machine learning, such a fundamental task is still so clunky. Imagine what it's like with even more exotic scripts.
Hmm, I wonder whether it would be possible to get GPT-3 to do this?