Computer-aided text recognition (OCR) is a method of extracting text from pictures and drawings and converting it into a machine-readable format. OCR is often used for scanned documents. i-net PDFC uses this technology as the basis for various pre-installed filters. The exact description of the individual scenarios is documented with the respective filters.
The OCR in i-net PDFC is based on Tesseract and requires at least version 4. The OCR configuration depends on the operating system on which the i-net PDFC server is installed. This configuration page mainly displays information about the availability of Tesseract.
Note: Tesseract 4 and 5 are supported. These must not be alpha or beta versions.
The Current State section reflects information from the backend system and indicates whether Tesseract is functional.
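The version requirement can also be checked by hand on the server. The following is a minimal sketch, not part of i-net PDFC: `is_supported_version` is a hypothetical helper that accepts major versions 4 and 5 and rejects version strings containing `alpha` or `beta`. It assumes that the first line of `tesseract --version` has the form `tesseract 5.3.0`, optionally with a leading `v` before the number.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given version string is a
# Tesseract 4 or 5 release that is not an alpha or beta build.
is_supported_version() {
    version=${1#v}                    # drop a leading "v", e.g. "v5.3.0"
    case "$version" in
        *alpha*|*beta*) return 1 ;;   # pre-release builds are rejected
    esac
    major=${version%%.*}              # text before the first dot
    case "$major" in
        4|5) return 0 ;;
        *)   return 1 ;;
    esac
}

# Read the version from the installed binary, if one is on the PATH.
if command -v tesseract >/dev/null 2>&1; then
    installed=$(tesseract --version 2>&1 | head -n 1 | awk '{print $2}')
    if is_supported_version "$installed"; then
        echo "Tesseract $installed meets the version requirement"
    else
        echo "Tesseract $installed is not a supported release"
    fi
else
    echo "tesseract was not found on the PATH"
fi
```

This only inspects the version string. Whether the server can actually use the binary is still reported by the Current State section.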
The Visual C++ Redistributable 2015 package has to be installed on the Windows system, which can be done in one of the following ways:
choco install vcredist2015
For custom installations, please check Install Tesseract for installation details on Linux and Windows systems. macOS users can usually install Tesseract 5 with one of the following commands, using either the MacPorts or the Homebrew package manager:
sudo port install tesseract
# or
brew install tesseract
If languages other than English are to be supported, the corresponding language files must be installed manually by downloading the matching *.traineddata files. These files then have to be moved into the <installation>/lang/tessdata folder or the customized path. Finally, the i-net PDFC server has to be restarted.
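The manual steps can be scripted roughly as follows. This is a sketch under assumptions, not an official installer. It assumes the language files are taken from the `tesseract-ocr/tessdata` repository on GitHub. `INSTALL_DIR` stands for the `<installation>` placeholder and must be set by the user, and `TESSDATA_DIR` stands for a customized path. The three functions are hypothetical helpers.

```shell
#!/bin/sh
# Target folder: the customized path if TESSDATA_DIR is set,
# otherwise <installation>/lang/tessdata below INSTALL_DIR.
tessdata_dir() {
    if [ -n "${TESSDATA_DIR:-}" ]; then
        echo "$TESSDATA_DIR"
    else
        echo "${INSTALL_DIR:-}/lang/tessdata"
    fi
}

# Download URL for a language code such as "deu" or "fra".
# Assumption: files come from the tesseract-ocr/tessdata repository.
traineddata_url() {
    echo "https://github.com/tesseract-ocr/tessdata/raw/main/$1.traineddata"
}

# Download one language file into the target folder.
install_language() {
    lang=$1
    if [ -z "${TESSDATA_DIR:-}" ] && [ -z "${INSTALL_DIR:-}" ]; then
        echo "set INSTALL_DIR or TESSDATA_DIR first" >&2
        return 1
    fi
    dest=$(tessdata_dir)
    mkdir -p "$dest" &&
        curl -fL -o "$dest/$lang.traineddata" "$(traineddata_url "$lang")"
}

# Example (not run automatically):
#   INSTALL_DIR=/path/to/installation install_language deu
# Afterwards, restart the i-net PDFC server.
```

To confirm that Tesseract sees the new file, `tesseract --list-langs --tessdata-dir "$(tessdata_dir)"` should list the added language code.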
tesseract binary name should suffice. This entry is only shown for the Custom Installation Tesseract variant.
lang/tessdata if left empty.