Text recognition

Optical character recognition (OCR), or computer-aided text recognition, is a method of extracting text from images and drawings and converting it into a machine-readable format. OCR is often used for faxes or scanned documents. i-net PDFC uses this technology as the basis for various pre-installed filters. Each scenario is described in detail with the respective filter.

Tesseract OCR

i-net PDFC's basic OCR plugin uses the open source software Tesseract as its standard for text recognition. Version 4 of Tesseract uses trained neural networks for recognition and thus offers a very high recognition rate for printed text. Handwritten text is not supported.

Prerequisites

For Tesseract to deliver the best possible results, the following conditions should be met.

  • Tesseract must be installed and operational. The functionality can be tested via configuration or recovery.
  • A language must be specified. The LanguageDetection plugin does this automatically, provided the document contains actual text. If the language is detected incorrectly, it can be set manually.
  • Only English is included by default; further languages must be added manually. If a language file is missing, English is used.
  • The image resolution must be at least 300 DPI. As a rule of thumb, 300 DPI is reached when the lowercase letter x is about 10 pixels high.
  • The background should be a single uniform colour. Noise in the image should be avoided.
  • Text should be aligned horizontally.
  • The font should not be exotic. Fonts that work well are included in this list.
  • The text should not be handwritten.
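
The 300 DPI requirement can be checked with simple arithmetic when the physical size of the scanned page is known. The following is a minimal Python sketch, not part of i-net PDFC; the function names and the example page size are assumptions chosen for illustration.

```python
def effective_dpi(pixel_length: int, physical_length_inches: float) -> float:
    """Return the resolution in dots per inch along one side of a scan."""
    if physical_length_inches <= 0:
        raise ValueError("physical length must be positive")
    return pixel_length / physical_length_inches


def meets_minimum_resolution(pixel_length: int,
                             physical_length_inches: float,
                             minimum_dpi: float = 300.0) -> bool:
    """Check a scan against the minimum resolution (300 DPI by default)."""
    return effective_dpi(pixel_length, physical_length_inches) >= minimum_dpi


# Assumed example: a page that is 8.5 inches wide.
print(meets_minimum_resolution(2550, 8.5))  # 2550 / 8.5 = 300 DPI, prints True
print(meets_minimum_resolution(1700, 8.5))  # 1700 / 8.5 = 200 DPI, prints False
```

If the physical size is unknown, the x-height rule of thumb from the list above is the more practical check.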

Tesseract Windows

As a prerequisite, the Visual C++ Redistributable 2015 package needs to be installed on the Windows system.

It can be installed using:

Tesseract (Installed Plugin)

To use Tesseract, it must be installed on the system and functional. The installed version must be a release of version 4, not an alpha or beta version. (If "Used Tesseract Plugin" is set to "Windows" in the configuration, no installation is required.)
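
The version requirement can be checked outside of i-net PDFC by inspecting the output of `tesseract --version`. The sketch below is an illustration only: the helper names are hypothetical, and the parsing assumes that the first output line has the form `tesseract 4.1.0` or `tesseract v4.1.0...`, which may differ between distributions.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_tesseract_version(output: str) -> Optional[Tuple[int, int, int, str]]:
    """Parse the first line of `tesseract --version` output.

    Returns (major, minor, patch, suffix), or None if no version was found.
    """
    lines = output.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?(\S*)", first_line)
    if match is None:
        return None
    major, minor, patch, suffix = match.groups()
    return int(major), int(minor), int(patch or 0), suffix


def is_supported_release(output: str) -> bool:
    """True for a version 4 release that is not marked as alpha or beta."""
    parsed = parse_tesseract_version(output)
    if parsed is None:
        return False
    major, _minor, _patch, suffix = parsed
    prerelease = any(tag in suffix.lower() for tag in ("alpha", "beta"))
    return major == 4 and not prerelease


def read_version_output(program: str = "tesseract") -> str:
    """Run `<program> --version`; raises FileNotFoundError if it is missing."""
    result = subprocess.run([program, "--version"],
                            capture_output=True, text=True, check=False)
    # Older Tesseract builds print the version to stderr instead of stdout.
    return result.stdout or result.stderr
```

A call such as `is_supported_release(read_version_output())` then reports whether the Tesseract found on the search path satisfies the requirement.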

Linux/Mac

Linux/Mac users must install the Tesseract 4 program in addition to the plugin. See Install Tesseract for details.

Windows

The University of Mannheim offers suitable installers for Windows. The Tesseract 4 version should be installed. For details, see Install Tesseract for Windows.

Configuration

Status | Values | Information/error handling
Used Tesseract Plugin | Installed | Tesseract must be installed on the system.
 | Windows | If the internal Tesseract works, no further settings need to be made. It is only available on Windows systems, and it is not guaranteed to work there. Tesseract does not need to be installed.
Status | ok | Tesseract was configured correctly and can be used.
 | Tesseract could not be found [...] | Check the path to the Tesseract installation or make it available through the system environment.
 | Tesseract could not find the language file narrowly. [...] | Check the path to the language files. There must always be an English language file.
 | The Tesseract plugin does not work correctly, [...] | A Tesseract version 4.x should be installed.
Version | tesseract v4.1.0* | Detected version of the installed Tesseract distribution.
Language files found | [deu, eng] | Tesseract has detected the given languages.
 | No language files are found. | Check the path to the .traineddata files. The Tesseract installation directory usually contains a tessdata folder with the appropriate files. The Tesseract Windows plugin uses the lang/tessdata folder in the i-net PDFC installation directory. English is shipped with the plugin by default.

Add more languages

To recognize languages other than English, the corresponding language files must be installed manually. Download the required *.traineddata files from https://github.com/tesseract-ocr/tessdata and copy or move them to the <installation folder>/lang/tessdata folder. Finally, restart the i-net PDFC service.
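
The copy step can also be scripted. The following Python sketch is an assumption-laden illustration rather than a tool shipped with i-net PDFC: the function name is hypothetical, and both folder paths must be supplied by the caller, since the installation folder differs from system to system.

```python
import shutil
from pathlib import Path
from typing import List


def install_language_files(download_dir: Path, tessdata_dir: Path) -> List[str]:
    """Copy every *.traineddata file from download_dir into tessdata_dir.

    Returns the language codes (file names without extension) that were copied.
    """
    if not tessdata_dir.is_dir():
        # Refuse to create the folder: a missing folder usually means a wrong path.
        raise FileNotFoundError(f"tessdata folder not found: {tessdata_dir}")
    copied = []
    for source in sorted(download_dir.glob("*.traineddata")):
        shutil.copy2(source, tessdata_dir / source.name)
        copied.append(source.stem)
    return copied
```

The service restart mentioned above is still required after the files have been copied.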

Settings

For Tesseract to work properly with i-net PDFC, some settings may need to be adjusted. After all settings have been made, click on Apply to update the status.

Note: No settings need to be made for the Tesseract Windows plugin.

Tesseract Program

Specify the path to the installed Tesseract program. If this option is set correctly, Tesseract can be used. If Tesseract has been added to the environment variables or is in a standard path (/usr/bin/tesseract or /usr/local/bin/tesseract), the default setting "tesseract" is sufficient.
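
Whether the default setting "tesseract" is sufficient depends on whether the program can be found on the search path. A quick way to check this from the same user account is sketched below; the helper name is hypothetical, and the lookup uses Python's standard `shutil.which`, not the mechanism i-net PDFC uses internally.

```python
import shutil
from typing import Optional


def resolve_tesseract_program(setting: str = "tesseract") -> Optional[str]:
    """Return the path of the executable that the setting refers to, or None.

    A bare name such as "tesseract" is looked up on the PATH; a value that
    contains a directory is checked as given.
    """
    return shutil.which(setting)


location = resolve_tesseract_program()
if location is None:
    print("tesseract is not on the search path; enter the full path instead")
else:
    print(f"the default setting resolves to {location}")
```

Note that a service may run with a different environment than an interactive shell, so the result is only a hint.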

Folder of the language files

By default (empty field), the language files are read from the lang/tessdata folder in the PDFC installation directory. To use a different path, specify the appropriate folder. The folder is expected to contain the .traineddata files.

Note: The English language file is always required.
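
Before entering a custom folder, its contents can be checked against these expectations. The following sketch uses hypothetical helper names and assumes that each language corresponds to a file named `<code>.traineddata`, with `eng.traineddata` for English.

```python
from pathlib import Path
from typing import List


def available_languages(tessdata_dir: Path) -> List[str]:
    """Return the sorted language codes found as *.traineddata files."""
    return sorted(path.stem for path in tessdata_dir.glob("*.traineddata"))


def has_required_english(tessdata_dir: Path) -> bool:
    """True if the folder contains the English language file."""
    return "eng" in available_languages(tessdata_dir)
```

For a folder that contains deu.traineddata and eng.traineddata, `available_languages` returns `["deu", "eng"]`, matching the list shown in the status.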