The Guide (Nettippakarana) seems to have been scanned to text. (How?) 🤔

Upasaka_Dhammasara · August 4, 2020, 1:32pm

Bhante @sujato was telling me that books like this are difficult to convert from scans to text. I don’t understand how this was done. It’s still incredible. It seems they didn’t edit it.

http://lirs.ru/lib/The_Guide,Nettippakarana,Nanamoli,1977.pdf

arkaprava · August 5, 2020, 1:34pm

It was OCR’d. Adobe gives good results.

Upasaka_Dhammasara · August 5, 2020, 1:40pm

Can you do that? For the Petakopadesa.

arkaprava · August 5, 2020, 3:02pm

I have OCR’d the Petakopadesa too.

Upasaka_Dhammasara · August 5, 2020, 8:17pm

Can you please try with this one?

http://www.kbrl.gov.mm/book/details/003430?categoryId=58

Thanks in advance

arkaprava · August 5, 2020, 8:20pm

https://drive.google.com/drive/folders/1RakYQuDA0wTTygoopPNTizEXI-7z46Tt

Upasaka_Dhammasara · August 5, 2020, 9:22pm

I am not sure I got in. At first it said I needed permission. Now I see a Petakopadesa, but it seems to be one I already had. I just want the version I gave you, with the words copyable, meaning that you can copy and then paste.

arkaprava · August 5, 2020, 9:29pm

You can copy and paste, as well as search, in that document.

sujato · August 5, 2020, 10:06pm

To be clear about what’s happening here: when you scan a text, all you have is images. You can then run OCR on them and make a searchable and copyable version of the same text. How this works is that the OCR software adds an invisible layer of text to the PDF file, aligned with the page images. This is great for making a usable copy of a scanned document.
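For anyone who finds the "invisible layer" idea easier to follow in code, here is a minimal sketch of it as a data model. It is not how Adobe or any particular PDF tool implements it; the `Word` and `ScannedPage` names, the file name, and the coordinates are all made up for illustration. The point is only that the page image is what you see, while copy and search operate on a separate list of recognized words.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One word recognized by OCR, with its position on the page image."""
    text: str
    x: float  # hypothetical left edge on the page
    y: float  # hypothetical top edge on the page


@dataclass
class ScannedPage:
    """A scanned page: the visible image plus an invisible text layer."""
    image_file: str          # what the reader actually sees
    text_layer: List[Word]   # what copy/paste and search actually use

    def copyable_text(self) -> str:
        # Copying "from the page" really copies from the text layer.
        return " ".join(word.text for word in self.text_layer)

    def search(self, query: str) -> List[Word]:
        # Search matches against the text layer, then uses the stored
        # positions to highlight the right spot on the image.
        needle = query.casefold()
        return [w for w in self.text_layer if needle in w.text.casefold()]


# Illustrative data only: a made-up page with three recognized words.
page = ScannedPage(
    image_file="page_001.png",
    text_layer=[
        Word("The", 72.0, 100.0),
        Word("Guide", 100.0, 100.0),
        Word("Nettippakarana", 145.0, 100.0),
    ],
)

print(page.copyable_text())
print(page.search("guide"))
```

Note that if the OCR misreads a word, the image still looks perfect while the text layer silently contains the error, which is why the result is usable for searching but not the same thing as a proofread digital text.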

None of this, however, produces a proper digital text. Under the hood, an OCR’d text looks something like this. This can, indeed, be used as the basis for a digital text, and in simple cases that can be quite good. But for complex texts with footnotes, special characters, and the like, we are still very far away from anything that is usable or publishable.

We have found that the only way to make a properly publishable digital text is through skilled human workers, time and patience. Typists are good, fast, and cheap, and it is easier to simply pay a typing firm to do the work than to mess around with OCR.

Perhaps a next generation of neural nets will be applicable to this work and produce better outcomes, but so far I haven’t seen this kind of magic.

Upasaka_Dhammasara · August 5, 2020, 11:11pm

Now I got it. Thank you.

Upasaka_Dhammasara · August 5, 2020, 11:12pm

Thanks for the explanation. I understand.

Khemarato.bhikkhu · February 21, 2021, 4:53am

And unfortunately you won’t see that kind of magic for a very long time.

OCR is considered a “solved problem” in machine vision circles, so serious academics can’t get funding anymore to continue the research it would take to perfect the technology.

Usually, this is where for-profits step in and productionize the research. While some commercial interest (e.g. at Google) does exist in scanning whole books, the messy gook you pointed out above is good enough for their purposes (full text search) and is unlikely to get much more investment there either (beyond e.g. extending the existing technology to more languages, which Google is doing a fantastic job with).

Non-profits like Archive.org can only do so much with their limited staff and budget. Cutting-edge AI research is likely beyond their scope, and non-profits that do have AI research teams will not be remotely interested in this kind of problem.

Tl;dr: until a lot more people have machine learning skills, it’ll likely remain an unfinished technology.