Back in another thread I opined that we likely won’t see proper document-level OCR technology for a while… Well, it seems the wait may be shorter than I feared!
A team of researchers at Harvard, Brown, and other institutions has just announced Layout Parser: a deep-learning OCR preprocessor in Python that automatically determines the structure of a given document. This preprocessor lets the OCR engine take the document one section at a time and output more reasonable data.
Technical details are in the arxiv.org paper and on GitHub.
This is really interesting, thank you for sharing. ABBYY FineReader seems to do some of this kind of thing, but it's patchy. Recognizing columns can make all the difference between usable OCR and unusable.
I wonder how well this works with unusual orthographies and diacritics.
From the paper, it does quite well across languages.
But remember that this is just a preprocessor. It's not doing the actual OCR; it's just figuring out how to crop the images it sends to the OCR step. The problem that, e.g., Tesseract has trouble with mixed-language input is not solved by this particular project.
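To make the "crop first, then OCR each region" division of labor concrete, here's a minimal sketch in plain Python. The `detect_layout` and `run_ocr` functions are hypothetical stand-ins for a layout model (such as Layout Parser's) and an OCR engine (such as Tesseract); only the orchestration logic between them is the point.

```python
# Sketch of a layout-aware OCR pipeline: detect regions first, then
# OCR each region separately and stitch the text back together in
# reading order. detect_layout() and run_ocr() are hypothetical
# stand-ins, not real library calls.

def detect_layout(page):
    """Pretend layout model: returns (x, y, w, h, kind) regions.
    A real model (e.g. Layout Parser) would infer these from the image."""
    return [
        (0, 0, 600, 80, "title"),
        (0, 100, 290, 700, "text"),    # left column
        (310, 100, 290, 700, "text"),  # right column
    ]

def run_ocr(page, region):
    """Pretend OCR engine: would crop `page` to `region` and recognize it."""
    x, y, w, h, kind = region
    return f"<{kind} text from ({x},{y},{w},{h})>"

def ocr_document(page):
    regions = detect_layout(page)
    # Reading order: top-to-bottom, then left-to-right. Without this
    # cropping step, a plain full-page OCR pass would interleave lines
    # from the two columns.
    regions.sort(key=lambda r: (r[1], r[0]))
    return "\n".join(run_ocr(page, r) for r in regions)

print(ocr_document(page=None))
```

The key design point is that the layout model and the OCR engine stay decoupled: the preprocessor only decides *what* to send and in *what order*, so any per-region weaknesses of the OCR engine itself (like mixed-language text) remain.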