PDF parser optimized to use stroke instead of fill where possible, resulting in smaller exported PDF files and more precise rendering
JPEG 2000 encoded images are now supported
The font DejaVu Sans is now used as the default font for font embedding in PDF export
Table comparison overhauled to compare the structure as well. It now also shows added or removed cells, columns, rows and tables
Text location check in strict mode now only checks the position along the baseline of the text and no longer all bounds of each word. This is more robust when the font family or size has changed as well
Header/footer detection in documents with large line heights improved
Recognition of the text styles subscript, superscript and strike-through for both PDF and DOCX
Improved table recognition for tables with header box, "cross tabs" and tables with only inner borders
Link verification now also checks relative URLs for a potential HTTP protocol
Handling of incorrect PDF files improved; shapes will be ignored in that case
Multi-column detection for plain two-column layouts and heading + columns scenarios improved
Visibility calculation improved for shapes with mixed fill+stroke colors and redundant elements
'Compute actual visibility' improved: it now also excludes text on a filled background of equal color
Word detection and separation improved in case of chunking inside a numerical value
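The change to the strict-mode text location check listed above can be illustrated with a minimal sketch. It is not the product's implementation: the `Word` record, both function names and the tolerance value are hypothetical, and "position along the baseline" is interpreted here as the word's start point on the baseline. The sketch only shows why comparing that point, rather than the full bounding box, tolerates a change of font family or size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word record; the field names are assumptions for this sketch."""
    text: str
    x: float           # left edge of the word
    baseline_y: float  # y coordinate of the text baseline
    width: float       # width of the bounding box
    height: float      # height of the bounding box


def same_location_all_bounds(a: Word, b: Word, tol: float = 0.5) -> bool:
    """Old-style check (assumed): every bound must match within the tolerance."""
    return (
        abs(a.x - b.x) <= tol
        and abs(a.baseline_y - b.baseline_y) <= tol
        and abs(a.width - b.width) <= tol
        and abs(a.height - b.height) <= tol
    )


def same_location_baseline(a: Word, b: Word, tol: float = 0.5) -> bool:
    """New-style check (assumed): only the start point on the baseline must match."""
    return abs(a.x - b.x) <= tol and abs(a.baseline_y - b.baseline_y) <= tol


# The same word rendered with a slightly larger font: the box grows,
# but the start point on the baseline stays where it was.
original = Word("Total", x=72.0, baseline_y=700.0, width=30.0, height=10.0)
resized = Word("Total", x=72.0, baseline_y=700.0, width=33.5, height=11.0)
moved = Word("Total", x=72.0, baseline_y=690.0, width=30.0, height=10.0)

print(same_location_all_bounds(original, resized))  # box differs
print(same_location_baseline(original, resized))    # baseline position unchanged
print(same_location_baseline(original, moved))      # baseline shifted
```

Under these assumptions, a font change alone no longer reports a location difference, while a word that actually moves to another line is still flagged.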
Tesseract OCR:
DOCX parser: