Optical character recognition (OCR) is a method of extracting text from images and drawings and converting it into a machine-readable format. OCR is often used for faxes or scanned documents. i-net PDFC uses this technology as the basis for several pre-installed filters. The individual scenarios are described in the documentation of the respective filters.
i-net PDFC's basic OCR plug-in uses the open source software Tesseract as its default text recognition engine. Tesseract 4 uses trained neural networks for recognition and thus offers a very high recognition rate for printed text. Handwritten text is not supported.
For Tesseract to deliver the best possible results, the input image must meet a number of requirements. For example, the text should be large enough that a lowercase x has a height of about 10 pixels.
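The x-height guideline above can be turned into a simple scaling rule. The following is a minimal sketch, not part of i-net PDFC: it assumes you have already measured the height of a lowercase x in your scan (in pixels), and the function names `scale_factor_for_x_height` and `scaled_size` are hypothetical helpers chosen for illustration.

```python
# Assumption: the target of about 10 pixels is taken from the guideline above.
TARGET_X_HEIGHT_PX = 10.0


def scale_factor_for_x_height(measured_x_height_px: float,
                              target_px: float = TARGET_X_HEIGHT_PX) -> float:
    """Return the factor by which an image must be resized so that a
    lowercase x ends up roughly target_px pixels tall."""
    if measured_x_height_px <= 0:
        raise ValueError("measured x-height must be positive")
    return target_px / measured_x_height_px


def scaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Apply the factor to the image dimensions, never going below 1 pixel."""
    return (max(1, round(width * factor)), max(1, round(height * factor)))
```

For a scan in which a lowercase x is 5 pixels tall, the factor is 2.0, so an 800 x 600 image would be resized to 1600 x 1200 before recognition. The actual resizing can be done with any image tool.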
If languages other than English are to be recognized, the corresponding language files must be installed manually. Visit https://github.com/tesseract-ocr/tessdata and download the corresponding *.traineddata files. Then copy or move these files to the <installation folder>/lang/tessdata folder. Finally, restart the i-net PDFC service.
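The copy step can also be scripted. The sketch below is an illustration under stated assumptions, not an official i-net tool: it assumes the *.traineddata files have already been downloaded into a local folder, the function name `install_language_files` is hypothetical, and both paths must be supplied by you. Creating the lang/tessdata folder when it is missing is also an assumption of this sketch. The service restart is not covered here.

```python
import shutil
from pathlib import Path


def install_language_files(download_dir: str, installation_dir: str) -> list[str]:
    """Copy all *.traineddata files from download_dir into
    <installation folder>/lang/tessdata and return the copied file names."""
    install_root = Path(installation_dir)
    if not install_root.is_dir():
        raise FileNotFoundError(f"installation folder not found: {install_root}")

    # Assumption: create the target folder if it does not exist yet.
    target = install_root / "lang" / "tessdata"
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for source in sorted(Path(download_dir).glob("*.traineddata")):
        shutil.copy2(source, target / source.name)
        copied.append(source.name)
    return copied
```

The function returns the names of the copied files, which makes it easy to confirm that the expected languages were picked up before restarting the i-net PDFC service.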
Text recognition consumes a large amount of system resources. If it frequently causes bottlenecks, this plugin should be deactivated.
For Tesseract to be used, it must be installed on the system and be functional. The installed version must be a release of version 4; alpha and beta versions are not supported.
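The version requirement can be checked before activating the plugin. The following sketch assumes the `tesseract` executable is on the PATH and that `tesseract --version` prints a first line such as `tesseract 4.1.1`. The helper names are hypothetical, and the parsing rule (major version 4, no "alpha" or "beta" in the version string) is an interpretation of the requirement above, not i-net PDFC's own check.

```python
import re
import subprocess
from typing import Optional

# Matches e.g. "tesseract 4.1.1", "tesseract v4.1.0.20190314",
# "tesseract 4.0.0-beta.1": group 1 is the major version, group 2 the rest.
_VERSION_RE = re.compile(r"tesseract\s+v?(\d+)(\S*)", re.IGNORECASE)


def is_supported_version(version_output: str) -> bool:
    """Return True if the output reports a Tesseract 4 release
    that is not an alpha or beta build."""
    match = _VERSION_RE.search(version_output)
    if match is None:
        return False
    major = int(match.group(1))
    rest = match.group(2).lower()
    is_prerelease = "alpha" in rest or "beta" in rest
    return major == 4 and not is_prerelease


def installed_tesseract_output() -> Optional[str]:
    """Run `tesseract --version`; return its output, or None if the
    executable cannot be found."""
    try:
        result = subprocess.run(["tesseract", "--version"],
                                capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return None
    # Some builds write the version to stderr, so both streams are combined.
    return result.stdout + result.stderr
```

A typical use is to call `installed_tesseract_output()` and pass the result to `is_supported_version()`. A `None` result means Tesseract is not installed or not on the PATH.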
Linux/Mac users must install the Tesseract 4 program in addition to the plugin. See Install Tesseract for details.
The University of Mannheim offers suitable installers for Windows. Install a Tesseract 4 version. See Install Tesseract for Windows for details.