Solid Framework 9.2.8150.1 Released

The latest public release of Solid Framework SDK is now available for download from the developer portal at www.solidframework.net. This is version 9.2.8150.

Main Features

Faster Detection of Language within OCR’d files

Solid Framework can detect the language of a scanned document, which helps to improve OCR accuracy. For files whose language was ambiguous, this detection could be slow.

We have modified the mechanism to be significantly faster with these files.

Modified mechanism for specifying the location of Tesseract “traineddata” files

Solid Framework performs OCR for Chinese, Japanese, Korean and Greek language documents using Tesseract. For this to work it is necessary to specify the location of the folder that contains the “traineddata” files for the specific language.

The mechanism for doing this has been modified with the creation of a read/write property, TesseractDataDirectoryLocation. This replaces the method SetTesseractDataDirectory.

For further information see Performing OCR using Tesseract.
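
Before pointing TesseractDataDirectoryLocation at a folder, it can help to confirm that the folder really contains the files Tesseract needs. The Python sketch below is not part of Solid Framework: missing_traineddata and TESSERACT_CODES are hypothetical names, and the mapping assumes Tesseract's standard file naming (for example jpn.traineddata). Which of these files a given Solid Framework build expects is also an assumption.

```python
from pathlib import Path

# Tesseract's standard language codes for the languages named above.
# Assumption: the folder uses Tesseract's usual "<code>.traineddata" naming.
TESSERACT_CODES = {
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
    "Greek": "ell",
}


def missing_traineddata(folder, languages):
    """Return the languages whose traineddata file is absent from folder."""
    root = Path(folder)
    return [
        lang
        for lang in languages
        if not (root / f"{TESSERACT_CODES[lang]}.traineddata").is_file()
    ]
```

An empty result means every requested language has its file in place, so the folder is a reasonable candidate for the property.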

Further OCR improvements

SolidOCR continues to improve, and this release contains further refinements.

Solid Framework 9.2.8136.1 Released

The latest public release of Solid Framework SDK is now available for download from the developer portal. This is version 9.2.8136.

Main Features

Improved OCR Accuracy

We’ve made a whole bunch of improvements in SolidOCR. In addition to general improvements, we’ve become even better in three main areas:

Better Reconstruction of Lists that Contain Ambiguous Letters

When SolidFramework finds characters that might represent the items in a list, it tries to make sense of them, and if it can, it will recreate the list.

Some characters though are ambiguous. For example, is “i” the first item in a Roman numeral list, or the one that comes after “h” in an alphabetic list? It could be either, and the answer depends on the context of the rest of the document.

In the latest release we have improved the logic for working this out.
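
To illustrate why context matters, here is a minimal Python sketch of one possible disambiguation rule. It is not SolidOCR's actual logic: classify_marker is a hypothetical helper that looks only at the preceding marker, knows Roman numerals up to “x”, and defaults to Roman when there is no evidence either way.

```python
ROMAN = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]


def classify_marker(marker, previous=None):
    """Guess whether a list marker is 'roman' or 'alpha' from the marker before it."""
    m = marker.lower()
    p = previous.lower() if previous else None
    could_be_roman = m in ROMAN
    could_be_alpha = len(m) == 1 and "a" <= m <= "z"

    if could_be_roman and not could_be_alpha:
        return "roman"
    if could_be_alpha and not could_be_roman:
        return "alpha"
    if not could_be_roman:
        return "unknown"

    # Ambiguous marker ("i", "v" or "x"): if the previous marker is the
    # letter immediately before it, treat the list as alphabetic.
    if p is not None and len(p) == 1 and "a" <= p <= "z" and ord(p) + 1 == ord(m):
        return "alpha"
    # Assumption: with no alphabetic predecessor, prefer the Roman reading.
    return "roman"
```

With this rule, “i” on its own is read as Roman, while “i” following “h” is read as alphabetic. A real implementation would weigh far more of the document than a single neighbour.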

Better Reconstruction of Non-Continuous Lists

SolidFramework now handles lists better, even if there are missing items.

Previously the presence of out-of-sequence items would cause OCR to assume that the item was an image rather than a list item. We now recognize the item as starting a new list.
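
The idea of treating an out-of-sequence item as the start of a new list can be sketched in a few lines of Python. This is an illustration only, not the SolidFramework implementation: split_into_lists is a hypothetical helper that works on item numbers that have already been recognized.

```python
def split_into_lists(numbers):
    """Group item numbers into lists; an item that does not directly
    follow its predecessor starts a new list."""
    lists = []
    for n in numbers:
        if lists and n == lists[-1][-1] + 1:
            lists[-1].append(n)   # continues the current list
        else:
            lists.append([n])     # out of sequence: start a new list
    return lists
```

For the items 1, 2, 3, 7, 8 this yields two lists, one holding 1 to 3 and one holding 7 and 8, rather than discarding item 7.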

Improved OCR with Sparse Text

OCR relies on the context of characters on a page as part of identifying what the characters represent. This is difficult when sentences are short.

We’ve made significant improvements in identifying text in this scenario, allowing us to correctly identify all of the text in a number of previously difficult samples.

Improved Ability to Detect Tables that have Closely Neighbouring Non-Table Text

As SolidFramework reconstructs a document from a PDF file, it has to decide whether text represents text columns, or the contents of a table.

We’ve changed the way this works so that we can do this better, even for borderless tables that are surrounded by text that is not part of the table.

Layout Information Available for many Objects

As part of ongoing work, SolidFramework is being developed to provide layout information for all of the items in the document.

One aspect of this work involves calculating not just where objects were in the PDF, but also where they will be located on the page within a reconstructed document. The two may differ if, for example, a font used within the original PDF has to be substituted with a different one in the recreated document because the font does not exist or is not licensed on the machine.
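
As a rough illustration of why positions can shift, the following Python sketch computes the width of a run of text from per-glyph advance widths, expressed in thousandths of the font size as is usual for PDF fonts. The two width tables are made-up values, not the metrics of any real font, and text_width is a hypothetical helper rather than a SolidFramework call.

```python
# Made-up advance widths, in 1/1000 of the font size.
ORIGINAL_ADVANCES = {"a": 500, "b": 500}
SUBSTITUTE_ADVANCES = {"a": 600, "b": 600}


def text_width(text, advances, font_size):
    """Width in points of text set at font_size, given per-glyph advances."""
    return sum(advances[ch] for ch in text) * font_size / 1000.0


# The same two characters at 10 pt occupy different widths in each font,
# so whatever follows them on the line starts at a different x position.
original_width = text_width("ab", ORIGINAL_ADVANCES, 10)
substitute_width = text_width("ab", SUBSTITUTE_ADVANCES, 10)
shift = substitute_width - original_width
```

Here the substituted font pushes the following object 2 points to the right, which is the kind of difference the layout information has to account for.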

SolidFramework 8132 includes information about the location of bullets and numbers used to identify the start of a list item.