Correcting the CPED

sujato · February 17, 2016, 12:25am

I doubt very much that you’ll find an OCR engine that does a much better job, but good luck. I can do it with the open source Tesseract, which does include Sanskrit. I just used it to OCR the Monier Williams Sanskrit Dictionary: it took a full 12 hours running at 100% of one of my CPUs, so I’m not eager to do it unless it’s going to be really useful! I didn’t test Tesseract’s Sanskrit support; it probably assumes Devanagari characters, so it would not be of any use for romanized text.

Perhaps you can explain to me exactly what you want to achieve and maybe there’s an easier way.

Is the problem that you can’t input the diacritical marks? If so, we have instructions for doing this on all major operating systems here:

But I’m not sure how useful this will be, since you don’t need diacritical marks to search in Google Docs: it does fuzzy searching by default.
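
If you ever need the same diacritic-insensitive matching outside Google Docs, here is a minimal sketch of one way it could be done in Python. This is only an illustration, not how Google Docs works internally: the function names `strip_diacritics` and `find_entries` and the sample headwords are made up for the example. It assumes the romanized letters decompose under Unicode NFD normalization, which holds for the usual Pali diacritics (ā, ī, ū, ṃ, ṅ, ñ, ṭ, ḍ, ṇ, ḷ).

```python
import unicodedata
from typing import List


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed, e.g. 'ā' becomes 'a'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_entries(query: str, entries: List[str]) -> List[str]:
    """Return entries whose folded form starts with the folded query."""
    folded_query = strip_diacritics(query).lower()
    return [
        entry
        for entry in entries
        if strip_diacritics(entry).lower().startswith(folded_query)
    ]


# Hypothetical sample headwords, just for illustration.
sample_entries = ["saṅkhāra", "saññā", "sati", "paññā"]
print(find_entries("sanna", sample_entries))  # prints ['saññā']
```

Typing plain `sanna` finds `saññā` here because both sides are folded to unaccented lowercase before comparing.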

In the majority of cases you just have to visually scan the document to find the entries.