The latest public release of Solid Framework SDK is now available for download from the developer portal at www.solidframework.net. This is version 9.2.8150.
Faster Detection of Language within OCR’d files
Solid Framework can detect the language of a scanned document, which helps to improve OCR accuracy. For files whose language was ambiguous, this detection could be slow.
We have modified the mechanism so that detection is significantly faster with these files.
Modified mechanism for specifying the location of Tesseract “traineddata” files
Solid Framework performs OCR for Chinese, Japanese, Korean and Greek language documents using Tesseract. For this to work, it is necessary to specify the location of the folder that contains the “traineddata” files for the specific language.
The mechanism for doing this has been modified with the creation of a read/write property TesseractDataDirectoryLocation. This replaces the method SetTesseractDataDirectory.
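Before pointing TesseractDataDirectoryLocation at a folder, it can help to confirm that the folder actually contains the expected files. The following is a minimal Python sketch of such a check, not part of Solid Framework itself. The helper name missing_traineddata is hypothetical, and the file names assume Tesseract's usual language codes (jpn, kor, ell, and chi_sim for Simplified Chinese; chi_tra may be the right choice for Traditional Chinese).

```python
from pathlib import Path

# Tesseract language codes for the languages named above.
# Assumption: "chi_sim" (Simplified Chinese); "chi_tra" may apply instead.
TESSERACT_CODES = {
    "Chinese": "chi_sim",
    "Japanese": "jpn",
    "Korean": "kor",
    "Greek": "ell",
}


def missing_traineddata(folder, languages):
    """Return the languages whose <code>.traineddata file is absent from folder.

    Hypothetical helper: run it before assigning the folder path to
    TesseractDataDirectoryLocation in your own application code.
    """
    root = Path(folder)
    return [
        lang
        for lang in languages
        if not (root / f"{TESSERACT_CODES[lang]}.traineddata").is_file()
    ]
```

An empty result means every requested language has its file in place; anything else lists the languages that still need a “traineddata” file.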
The latest public release of Solid Framework SDK is now available for download from the developer portal. This is version 9.2.8136.
Improved OCR Accuracy
We’ve made a whole bunch of improvements in SolidOCR. In addition to general improvements, we’ve become even better in three main areas:
Better Reconstruction of Lists that Contain Ambiguous Letters
When SolidFramework finds characters that might represent the items in a list, it tries to make sense of them, and if it can, it will recreate the list.
Some characters though are ambiguous. For example, is “i” the first item in a Roman numeral list, or the one that comes after “h” in an alphabetic list? It could be either, and the answer depends on the context of the rest of the document.
In the latest release we have improved the logic for working this out.
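To illustrate the kind of contextual reasoning involved, here is a minimal Python sketch that classifies an ambiguous marker by looking at its neighbours. It is an illustration only and does not reflect SolidFramework's actual logic; the function name classify_marker and the short Roman numeral table are assumptions made for the example.

```python
ROMAN = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]
ALPHA = list("abcdefghijklmnopqrstuvwxyz")


def classify_marker(marker, previous=None, following=None):
    """Guess whether a list marker is 'roman' or 'alpha' from its neighbours.

    Returns 'ambiguous' when the neighbours do not settle the question.
    """

    def fits(sequence):
        # The marker fits a sequence if it occurs there and any known
        # neighbour sits directly before or after it in that sequence.
        if marker not in sequence:
            return False
        idx = sequence.index(marker)
        prev_ok = previous is None or (idx > 0 and sequence[idx - 1] == previous)
        next_ok = following is None or (
            idx + 1 < len(sequence) and sequence[idx + 1] == following
        )
        return prev_ok and next_ok

    roman, alpha = fits(ROMAN), fits(ALPHA)
    if roman and not alpha:
        return "roman"
    if alpha and not roman:
        return "alpha"
    return "ambiguous"
```

With this sketch, “i” preceded by “h” is classified as alphabetic, “i” followed by “ii” as Roman, and “i” on its own stays ambiguous, which is why the rest of the document matters.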
Better Reconstruction of Non-Continuous Lists
SolidFramework now handles lists better, even if there are missing items.
Previously the presence of out-of-sequence items would cause OCR to assume that the item was an image rather than a list item. We now recognize the item as starting a new list.
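The new behaviour can be pictured with a small Python sketch that groups item numbers into lists and lets an out-of-sequence number start a new list. This is a simplified illustration under the assumption that items are plain integers; the function name split_into_lists is hypothetical and this is not SolidFramework's implementation.

```python
def split_into_lists(numbers):
    """Group item numbers into lists; an out-of-sequence number starts a new list."""
    lists = []
    for n in numbers:
        if lists and n == lists[-1][-1] + 1:
            # The number continues the current list.
            lists[-1].append(n)
        else:
            # First item, or an out-of-sequence item: begin a new list.
            lists.append([n])
    return lists
```

For example, the item numbers 1, 2, 3, 7, 8 would be grouped as one list of 1 to 3 followed by a second list starting at 7, rather than the 7 being discarded as a non-list item.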
Improved OCR with Sparse Text
OCR relies on the context of characters on a page as part of identifying what the characters represent. This is difficult when sentences are short.
We’ve made significant improvements in identifying text in this scenario allowing us to correctly identify all of the text in a number of previously difficult samples.
Improved Ability to Detect Tables that have Closely Neighbouring Non-Table Text
As SolidFramework reconstructs a document from a PDF file, it has to decide whether text represents text columns, or the contents of a table.
We’ve made some changes to the way this works so that we can do this better, even for borderless tables that are surrounded by text that is not part of the table.
Layout Information Available for many Objects
As part of ongoing work, SolidFramework is being developed to provide layout information for all of the items in the document.
One aspect of this work involves calculating not just where objects were in the PDF, but also where they will be located on the page within a reconstructed document. This may not be the same as within the PDF if, for example, a font used within the original PDF has to be substituted with a different one in the recreated document because the font does not exist or is not licensed on the machine.
SolidFramework 8132 includes information about the location of bullets and numbers used to identify the start of a list item.
Solid Framework 9.2.8136.1 Released. Posted by Solid Documents on 2017-11-18; last modified 2021-01-17.