February 1, 2018
Optical character recognition (also optical character reader, OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example from a television broadcast).
This library supports over 60 languages, automatic text orientation and script detection, a simple interface for reading paragraph, word, and character bounding boxes. Tesseract.js can run either in a browser and on a server with NodeJS.
Tesseract.js can be integrated with odoo frontend and use for conversion of images to text.
Example : (Image consisting of english text converted to text)
Example : (Image consisting of chinese text converted to chinese text)
We had a pdf to parse in incoming mail, During this process we parsed that pdf, by first converting it to image, the converted image was then passed to tesseract.js which parsed contents of image to text.
However the accurateness of extracted text depends on the quality of image passed to OCR.
Below is the ocr content(Dutch Language) after processing it with custom logic.