The Guide (Nettippakarana) seems to have been scanned to text. (How?) 🤔

Bhante @sujato was telling me that books like this are difficult to convert from scan to text. I don’t understand how this was done, but it’s incredible. It seems they didn’t edit it.

Nettippakarana,Nanamoli,1977.pdf

It was OCR’d. Adobe gives good results.

Can you do that for the Petakopadesa?

I have OCR’d the Petakopadesa too.

Can you please try with this one?


Thanks in advance


I am not sure I got in. At first it said I needed permission. Now I see a Petakopadesa, but it seems to be one I already had. I just want the version that I gave you, with the words made copyable, meaning that you can copy and paste them.

You can copy, paste, and search in that document.


To be clear what’s happening here: when you scan a text, all you have is images. You can then run OCR on that, and make a searchable and copyable version of the same text. How this works is that the PDF puts an invisible layer of text in the file. This is great for making a usable copy of a scanned document.
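For anyone curious what that invisible layer amounts to, here is a rough sketch in Python. It is not how Adobe or any particular OCR tool does it; it only builds the kind of PDF drawing instructions such a layer is made of. The `OcrWord` class, the `invisible_text_layer` function, and the font name `F1` are made up for illustration. The one real PDF detail is text rendering mode 3 (`3 Tr`), which draws text that is neither filled nor stroked, so it can be selected and searched but not seen.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One recognized word with its position (hypothetical structure)."""
    text: str    # the word as the OCR engine read it
    x: float     # left edge, in PDF points
    y: float     # baseline, in PDF points
    size: float  # font size, in points


def escape_pdf_string(s: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_layer(words: list[OcrWord], font: str = "F1") -> str:
    """Build content-stream operators that place each word invisibly.

    The font resource name is an assumption; a real PDF must also
    define that font on the page, which this sketch leaves out.
    """
    lines = ["BT", "3 Tr"]  # begin text object; rendering mode 3 = invisible
    for w in words:
        lines.append(f"/{font} {w.size:g} Tf")          # select font and size
        lines.append(f"1 0 0 1 {w.x:g} {w.y:g} Tm")     # move to the word's position
        lines.append(f"({escape_pdf_string(w.text)}) Tj")  # show the text
    lines.append("ET")  # end text object
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [OcrWord("Netti", 72, 700, 11), OcrWord("(guide)", 110, 700, 11)]
    print(invisible_text_layer(sample))
```

The scanned image stays on the page exactly as it was, and these instructions sit on top of it. That is why copying works even though what you see is still a picture, and why any OCR misreadings are carried straight into what you copy.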

None of this, however, produces a proper digital text. Under the hood, an OCR’d text looks something like this. This can, indeed, be used as the basis for a digital text, and in simple cases that can be quite good. But for complex texts with footnotes, special characters, and the like, we are still very far away from anything that is usable or publishable.

We have found that the only way to make a properly publishable digital text is through skilled human workers, time and patience. Typists are good, fast, and cheap, and it is easier to simply pay a typing firm to do the work than to mess around with OCR.

Perhaps a next generation of neural nets will be applicable to this work and produce better outcomes, but so far I haven’t seen this kind of magic.


:scream: Now I got it. Thank you.

Thanks for the explanation. I understand.