Text recognition

Computer-aided text recognition (optical character recognition, OCR) extracts text from pictures and drawings and converts it into a machine-readable format. OCR is often used for faxes or scanned documents. i-net PDFC uses this technology as the basis for various pre-installed filters. How each filter uses OCR is described in the documentation of the respective filter.

Tesseract OCR

i-net PDFC's basic OCR plugin uses the open source software Tesseract as its default text recognition engine. Version 4 of Tesseract uses trained neural networks for recognition and therefore achieves a very high recognition rate for printed text. Handwritten text is not supported.

Prerequisites

For Tesseract to deliver the best possible results, the following conditions should be met.

  • Tesseract must be installed and operational. The functionality can be tested via configuration or recovery.
  • A language must be specified. (This is done automatically by the LanguageDetection plugin, provided the document contains actual text.) If the language is detected incorrectly, it can be set manually.
  • Only English is included by default; further languages must be added manually. (If the language file is missing, English is used.)
  • The image resolution must be at least 300 DPI. As a rule of thumb, 300 DPI is reached when the lowercase letter x is about 10 pixels high.
  • The background should be a single uniform colour. Noise in the image should be avoided.
  • Texts should be aligned horizontally.
  • The font should not be exotic. Fonts that work well are included in this list.
  • The text should not be written by hand.
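
The 300 DPI requirement from the list above can be estimated from an image's pixel dimensions and the physical size of the scanned page. The following is a minimal sketch, not part of i-net PDFC: the function names are hypothetical, and it assumes the physical page size in inches is known.

```python
MIN_DPI = 300  # minimum resolution named in the prerequisites


def effective_dpi(pixels: int, inches: float) -> float:
    """Return the resolution along one axis in dots per inch."""
    if inches <= 0:
        raise ValueError("physical size must be positive")
    return pixels / inches


def meets_minimum_resolution(width_px: int, height_px: int,
                             width_in: float, height_in: float,
                             min_dpi: float = MIN_DPI) -> bool:
    """Check that both axes reach the minimum resolution."""
    horizontal = effective_dpi(width_px, width_in)
    vertical = effective_dpi(height_px, height_in)
    # The weaker axis decides whether the scan is usable.
    return min(horizontal, vertical) >= min_dpi


if __name__ == "__main__":
    # A US Letter page (8.5 x 11 inches) scanned to 2550 x 3300 pixels.
    print(meets_minimum_resolution(2550, 3300, 8.5, 11.0))
```

A scan of the same page at 1275 x 1650 pixels would correspond to 150 DPI and would fail this check.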

Add more languages

If languages other than English are to be recognized, the corresponding language files must be installed manually. Download the required *.traineddata files from https://github.com/tesseract-ocr/tessdata and copy or move them to the <installation folder>/lang/tessdata folder. Finally, restart the i-net PDFC service.
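
The copy step can also be scripted. The sketch below is an illustration under stated assumptions, not an official installer: the function name is hypothetical, the *.traineddata files are assumed to be already downloaded into a local folder, and the installation folder is passed in by the caller because its location depends on the system.

```python
import shutil
from pathlib import Path
from typing import List


def install_language_files(download_dir: str, installation_dir: str) -> List[str]:
    """Copy all *.traineddata files into <installation folder>/lang/tessdata."""
    target = Path(installation_dir) / "lang" / "tessdata"
    if not target.is_dir():
        # Refuse to guess: a missing folder usually means a wrong path.
        raise FileNotFoundError(f"tessdata folder not found: {target}")

    copied = []
    for source in sorted(Path(download_dir).glob("*.traineddata")):
        shutil.copy2(source, target / source.name)
        copied.append(source.name)
    return copied
```

The function returns the names of the copied files. The i-net PDFC service still has to be restarted afterwards.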

Final Notes

Text recognition consumes a large amount of resources. If bottlenecks occur frequently, this plugin should be deactivated.

Tesseract (Installed Plugin)

For Tesseract to be used, it must be installed on the system and be functional. The installed version must be a version 4 release and must not be an alpha or beta version.
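
One way to check this requirement is to inspect the output of `tesseract --version`. The sketch below is an assumption-laden illustration and not how i-net PDFC itself performs the check: the function names are hypothetical, the executable is assumed to be reachable as `tesseract` on the PATH, and any version string with a suffix after the numeric part (such as "alpha" or "-beta.1") is treated as a pre-release.

```python
import re
import subprocess
from typing import Optional


def parse_tesseract_version(output: str) -> Optional[str]:
    """Extract the version string from `tesseract --version` output."""
    match = re.search(r"^tesseract\s+v?(\S+)", output, re.MULTILINE)
    return match.group(1) if match else None


def is_supported_version(version: str) -> bool:
    """Accept only purely numeric 4.x versions (no alpha/beta suffix)."""
    return re.fullmatch(r"4(\.\d+)*", version) is not None


def installed_tesseract_version(executable: str = "tesseract") -> Optional[str]:
    """Run the executable and return its version, or None if unavailable."""
    try:
        result = subprocess.run([executable, "--version"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # Some builds print the version to stderr, so both streams are searched.
    return parse_tesseract_version(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    version = installed_tesseract_version()
    if version is None:
        print("Tesseract was not found")
    else:
        print(version, "supported" if is_supported_version(version) else "not supported")
```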

Linux/Mac

Linux/Mac users must install the Tesseract 4 program in addition to the plugin. See Install Tesseract for details.

Windows

The University of Mannheim offers suitable installers for Windows. The Tesseract 4 version should be installed. See Install Tesseract for Windows for details.