Developing software using Solid Framework

What operating systems does Solid Framework support?

Solid Framework comes in two flavors.

Native Solid Framework

The native version is pure C++ and runs on Windows and OS X.

We are just finalizing a version for Linux which we will release in the next few weeks.

Solid Framework for .NET

The .NET wrapped version is a breeze to work with. It requires the .NET Framework 4.0 or later.  It is not supported on Windows 9x.

The framework has been tested on:

  • Windows 10
  • Windows 8
  • Windows 7
  • Windows Server 2016
  • Windows Server 2012 R2 x64
  • Windows Server 2008 R2 x64

The Solid Framework .dll is built as an “AnyCPU” framework and automatically runs x86 or x64 native, depending on the process that loads it.

Is SolidFramework just for .NET?

No. We offer two versions of the library: a .NET library and a native C++ DLL that can be used without .NET.

What .NET programming languages can be used when writing software that works with Solid Framework?

Solid Framework is a CLS Compliant class library. This means that the class library only exposes features that are common across all .NET languages. For example, unsigned types and overloaded methods are not used since these features are not available in all supported languages.

For simplicity, all the samples and documentation are in C#. Commonly used CLS-compliant languages include:

  • C#
  • Visual Basic .NET
  • J#

Is it possible to use C++ to write software that includes Solid Framework?

Absolutely! Several of our customers use C++ for their web-based or app-based products.

If C++ is used then .NET does not need to be available on the machine.

There are several samples to demonstrate the use of C++ available in the Downloads part of the Solid Framework portal.

Which versions of Visual Studio can Solid Framework be used with?

The Solid Documents team develop using VS 2013 and VS 2015 and we currently target the “v12” MSVC runtime (the version that shipped with VS 2013).

We also know that some of our customers are using VS 2017.

Is it possible to use Visual C++ 6.0?

The quick answer is that we have never tried so we don’t know.

The longer answer is that Solid Framework makes use of standard library features such as std::shared_ptr, which was standardised in C++11, and std::wstring. As such we think that there may be problems using Solid Framework with VC++ 6.

Having said that, we deliberately do not link statically into customer apps, which makes us less dependent on specific third-party library versions (such as the version of the MSVC runtime). If the customer can compile against our SolidFramework.h and SolidFramework.cpp API interface files, then our library will work, because these interface files load the rest of our system dynamically in a version-agnostic way.

How do I download the Solid Framework SDK?

The Solid Framework SDK can be downloaded using the self-service Solid Framework Developer Portal.

You will need to create an account in order to access the portal.

Click here to see a video tutorial on how to download the SDK.

After completing your download, place SolidFramework.dll in the source folder of your project.

Add a reference to this assembly from your project. Click here to see a video tutorial.

Parallel Processing under Solid Framework

Solid Framework is not intrinsically thread safe.

If parallel processing is required then we recommend using the “JobProcessor” class. This class will spin up a number of independent JobHandler processes, with each process performing a single conversion.

The JobProcessor will queue conversion requests and allocate them to a JobHandler process when one becomes idle.
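The queueing behaviour described above can be illustrated with a small simulation. This is only a conceptual sketch and not Solid Framework code: `simulateQueue` is a hypothetical function, the "workers" are plain numbers rather than JobHandler processes, and job durations are assumed to be known in advance.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Conceptual simulation (hypothetical, not part of Solid Framework):
// jobs are taken from a FIFO queue and each one is handed to whichever
// worker becomes idle first. Returns the time at which the last job ends.
long simulateQueue(const std::vector<long>& jobSeconds, unsigned workerCount) {
    if (workerCount == 0) {
        workerCount = 1; // treat "no workers" as a single worker
    }
    // Min-heap holding the time at which each worker next becomes idle.
    std::priority_queue<long, std::vector<long>, std::greater<long>> idleAt;
    for (unsigned i = 0; i < workerCount; ++i) {
        idleAt.push(0);
    }
    long lastFinish = 0;
    for (long seconds : jobSeconds) {
        const long start = idleAt.top(); // earliest idle worker takes the job
        idleAt.pop();
        const long finish = start + seconds;
        idleAt.push(finish);
        if (finish > lastFinish) {
            lastFinish = finish;
        }
    }
    return lastFinish;
}
```

In this model, four jobs lasting 4, 2, 2 and 2 seconds take 10 seconds on one worker and 6 seconds on two workers, because the short jobs are picked up by the second worker while the first is still busy.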

By default, the JobProcessor will launch as many concurrent JobHandler processes as you have cores on your machine. You can restrict the number of parallel JobHandlers by setting JobProcessor::WorkerCount. 

Typically you will wish to set the number of workers to less than the number of cores on the machine, since this will allow other tasks to continue while Solid Framework is converting files.
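One possible way of choosing a worker count that leaves some capacity free is sketched below in standard C++. The helper names and the default of one reserved core are assumptions made for illustration; they are not part of the Solid Framework API.

```cpp
#include <thread>

// Hypothetical helper (not part of Solid Framework): choose a worker count
// that leaves `reservedCores` cores free for other tasks.
// `detectedCores` may be 0, because std::thread::hardware_concurrency()
// is allowed to return 0 when the core count cannot be determined.
unsigned chooseWorkerCount(unsigned detectedCores, unsigned reservedCores = 1) {
    if (detectedCores <= reservedCores) {
        return 1; // always keep at least one worker
    }
    return detectedCores - reservedCores;
}

// Convenience wrapper that queries the current machine.
unsigned chooseWorkerCountForThisMachine(unsigned reservedCores = 1) {
    return chooseWorkerCount(std::thread::hardware_concurrency(), reservedCores);
}
```

The resulting value is the kind of number you would then assign to JobProcessor::WorkerCount.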

Currently, JobProcessor is only available for .NET. We hope to release a native C++ version in the near future.

Licensing

I have just updated my license using the latest Machine ID generator, and now I can’t use Solid Framework. What has gone wrong?

The machine ID generator was recently updated to solve problems related to virtual or multiple network cards installed on the same machine.  The new generator generates longer IDs that are not compatible with versions of Solid Framework prior to 9.2.7514.

There are two solutions:

  1. Update the version of Solid Framework to at least 9.2.7514
  2. Download and use the version of the Machine ID generator compatible with older versions. This can be found at http://downloads.soliddocuments.com/solidframework/SolidMachineID2015.exe
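If you need to decide between the two solutions programmatically, the version comparison could be sketched as follows. The function names are hypothetical, and the sketch assumes that you already have the Solid Framework version as a "major.minor.build" string; how you obtain that string is not covered here.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: parse "major.minor.build" into three integers.
// Missing or malformed parts are treated as 0.
std::array<int, 3> parseVersion(const std::string& text) {
    std::array<int, 3> parts{{0, 0, 0}};
    std::istringstream in(text);
    std::string field;
    for (std::size_t i = 0; i < parts.size() && std::getline(in, field, '.'); ++i) {
        try {
            parts[i] = std::stoi(field);
        } catch (...) {
            parts[i] = 0; // not a number, or out of range
        }
    }
    return parts;
}

// True if the given version is at least 9.2.7514, the first version that
// accepts the longer machine IDs according to the text above.
bool supportsLongMachineIds(const std::string& frameworkVersion) {
    const std::array<int, 3> minimum{{9, 2, 7514}};
    return parseVersion(frameworkVersion) >= minimum; // lexicographic comparison
}
```

The comparison is done field by field as numbers, so "9.10.1" is correctly treated as newer than "9.2.7514", which a plain string comparison would get wrong.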

How do I get and use a license for Solid Framework?

To use the features of Solid Framework, you need a license from Solid Documents. Licenses, including trial licenses, can be created using the self-service Solid Framework Developer Portal. These licenses depend on a machine-specific ID and there is a utility available at the Developer Portal to generate these ids.

To use the Solid Framework features you must embed the location of your license in your code. Click here to see the video tutorial.

// Solid Framework (Professional) license
License.Import(new StreamReader(@"C:\Users\Joe\Documents\Visual Studio 2010\Projects\FrameworkProject\license.xml"));




Image Processing and OCR

What is the focus for OCR accuracy?

SolidFramework is primarily aimed at the reconstruction of business documents.

As such OCR is unashamedly biased towards accurately recognising the content of such documents.

What is CGM?

CGM is an abbreviation for Color Gray Mono. Originally our image processing was targeted at archiving functionality: creating small scanned pages for PDF/A archive files while preserving the page image quality as much as possible. To this end, we recognize zones of the page image based on their nature and break the page up into appropriate components. For example:

  • a color photograph is extracted from the rest of the page, downsampled (typically to 150dpi) and compressed as JPEG
  • for a color graphic or text heading (palette image) we resample the colors to a smaller palette (like 16 or 256 colors) and use lossless compression (think of it as a GIF or PNG)
  • for monochrome text we extract as either 1 or 2 bits per pixel (anti-aliased text) and store losslessly using CCITT FAX compression

This segmentation, selective use of lossy or lossless compression and selective downsampling allows us to build a composite image page in PDF which is far smaller than a single scanned image page would be.
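The mapping from zone type to treatment described above can be summarised as a small lookup. This is an illustrative sketch only: the type and function names are invented for this example and do not correspond to the Solid Framework API.

```cpp
#include <string>

// Hypothetical zone categories, mirroring the three examples in the text.
enum class ZoneKind { ColorPhoto, PaletteGraphic, MonoText };

// How a zone is stored in the composite PDF page.
struct ZoneEncoding {
    std::string compression; // compression family named in the text
    bool lossy;              // true if image detail may be discarded
    bool downsampled;        // true if the resolution is reduced
};

// Returns the treatment the text describes for each kind of zone.
ZoneEncoding encodingFor(ZoneKind kind) {
    switch (kind) {
        case ZoneKind::ColorPhoto:
            return {"JPEG", true, true};               // typically 150dpi
        case ZoneKind::PaletteGraphic:
            return {"lossless palette", false, false}; // 16 or 256 colors
        case ZoneKind::MonoText:
            return {"CCITT FAX", false, false};        // 1 or 2 bits per pixel
    }
    return {"unknown", false, false};
}
```

Only the photographic zones lose information; text and graphics stay lossless, which is why the page remains readable while the file shrinks.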

What pre-processing does SolidOCR support?

Solid CGM also includes all the obvious pre-processing functionality required to process scanned page images.

  • deskew
  • auto rotate (determining dominant page text orientation)
  • despeckle (“salt” and “pepper” noise removal)
  • dynamic thresholding (OCR is typically a monochrome process but for that to work, we need to establish “paper” and “ink” shades and limits)
  • scanner noise removal (typically black bars at the edges of pages)
  • staple, punch hole and folded corner noise removal
  • 90 and 270 degree text component detection (minor text components not in the same orientation as the rest of the page)
  • vector table detection
  • vector underline removal and repair (fixing the text character descenders that the underline may have “sliced”)
  • inverse text component detection (either for the whole page or for smaller text components: typically white on black text but can be any colors)

What needs to be done to perform OCR on CJK or Greek Text?

SolidFramework uses Tesseract to perform OCR on Chinese, Japanese, Korean and Greek language documents.

Information about how to do this can be found in the document Performing OCR using Tesseract.

Miscellaneous

How do I get hold of columns within a document?

Columns are a property of the “Section” object.

How can I find the location of a piece of text on a page?

Provided that the CoreModel was created with PdfOptions.ExposeTargetDocumentPagination set to true, the LayoutDocument can be obtained from the CoreModel once it has been created.

Each object within the CoreModel.Topic (except runs) has an associated Layout object. This layout object contains information about the location of the object within the document.

To get the layout object, call LayoutDocument.FindLayoutObject(ID), where ID is the identifier of the SolidObject, which can be obtained using SolidObject.GetID().

For each paragraph in the CoreModel.Topic there will be a matching LayoutParagraph which provides access to the location.

When I try to convert a PDF to PDF/A, I get a conversion status of PdfAError, and yet conversion appears to have happened. What does this mean?

ConversionStatus.PdfAError means “There was a problem in the source document that meant that it was not PDF/A compliant”.

However, SolidFramework may have been able to resolve these problems to create a compliant document, in which case an output file would have been generated.

It is therefore necessary to check whether the ConversionResults contains a path to a file, which would indicate that conversion was able to occur.

Typical code is as follows:

if (res == ConversionStatus.Success || res == ConversionStatus.PdfAError)
{
    // Get the location of the generated file
    if (conv.Results[0] != null)
    {
        if (conv.Results[0].Paths.Count == 1)
        {
            string path = conv.Results[0].Paths[0];
            //Do something with the file
        }
    }
}

How do I remove tagging from a PDF file?

Tagging uses a set of standard structure types to allow page content (text, graphics and images) to be extracted and reused for other purposes.

For example, Solid Framework uses tags, if present, to identify tables within a PDF. This can allow more accurate extraction of table data from a PDF. The problem with this is that tags are optional.

The same textual data may be identified as a table if tags are present, but identified as ordinary text if they are absent. This causes problems if you are trying to compare apparently similar files where one is tagged and the other is not.

Solid Framework allows tags to be removed from a PDF using the following code:

string taggedFile = @"c:\test\tagged.pdf";     // Placeholder: path to the PDF that contains tags
string untaggedFile = @"c:\test\untagged.pdf"; // Placeholder: path for the PDF with tags removed

PdfDocument doc = new PdfDocument(taggedFile);
doc.Open();
doc.RemoveStructTreeRoot();
doc.SaveAs(untaggedFile, SolidFramework.Plumbing.OverwriteMode.ForceOverwrite, true);

Why do I have more pages in my reconstructed Word document than were in the PDF?

Solid Framework is very good at reconstructing Word documents from PDF. In most situations the reconstructed document will have the same number of pages as the PDF, but there are two main situations where this is not the case because of limitations within Word.

Page jumps due to consecutive Odd (or Even) Page Footers.

Word supports headers and footers being different on odd and even pages.  However, Word does not support an odd footer on one page being followed by a different odd footer on the following page (or indeed two consecutive even-page footers). Attempting to do so will result in Word quietly inserting an additional page, which may not be visible within Word, but will result in an increased page count, and cause the page numbers to jump as you move from one page to the next.

Tables as the very last item in the Document

Word does not allow a table to be the last item in a document. It must be followed by a new line.

If the table ends at the very bottom of the page then the new line may result in an extra page being created.

This problem can be resolved by editing the document and setting a very small font size for the newline character. While we could have done that automatically when Solid Framework reconstructed the document, we chose not to, since we realised that doing so would make editing the Word document difficult.

What does PdfDocument.SaveOptimizedAs() do?

The PDF file structure includes a cross-reference (or XRef) table which contains links to all of the objects or elements in a file and helps in navigating the file.

The image below shows the start of the XRef table for the PDF file that contains the documentation for PDF version 1.7.

If the user removes all references to an item (for example a font, an image or a page) from the PDF, the XRef table may continue to hold references to that item, and the file may continue to contain it, even though it is no longer used. This results in a PDF file that is larger than it needs to be.

PdfDocument.SaveOptimizedAs() removes the links to, and the content of, these obsolete objects. This potentially allows the size of the file to be reduced.

If obsolete objects are present within the file, then significant reductions in file size can be achieved. If there are no obsolete objects present, then this method will have a minor effect on file size and could even cause a small increase in file size.
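The idea of discarding obsolete objects can be illustrated with a toy model in which each object simply lists the objects it refers to. This sketch shows the general "keep only what is reachable from the root" technique; it is an assumption made for illustration and does not describe how SaveOptimizedAs() is actually implemented. All names are hypothetical.

```cpp
#include <map>
#include <set>
#include <vector>

// Toy model of a file: object number -> object numbers it references.
using ObjectGraph = std::map<int, std::vector<int>>;

// Collect every object that can be reached from `root` by following references.
std::set<int> reachableFrom(const ObjectGraph& objects, int root) {
    std::set<int> seen;
    std::vector<int> pending{root};
    while (!pending.empty()) {
        const int id = pending.back();
        pending.pop_back();
        // Skip dangling references and objects that were already visited.
        if (objects.count(id) == 0 || !seen.insert(id).second) {
            continue;
        }
        for (int ref : objects.at(id)) {
            pending.push_back(ref);
        }
    }
    return seen;
}

// Return a copy of the graph that contains only the reachable objects.
ObjectGraph dropObsolete(const ObjectGraph& objects, int root) {
    const std::set<int> live = reachableFrom(objects, root);
    ObjectGraph kept;
    for (const auto& entry : objects) {
        if (live.count(entry.first) != 0) {
            kept.insert(entry);
        }
    }
    return kept;
}
```

If nothing in the graph is unreachable, `dropObsolete` returns everything it was given, which matches the observation above that a file without obsolete objects does not get meaningfully smaller.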

Problem solving

How do I get debugging information from Solid Framework?

Solid Framework can emit a detailed text file log during processing. This can be very useful in allowing Solid Documents to identify where a problem is occurring.

By default no log file is created. If one is specified then it will be written to as processing occurs.

Additional log files will be created by individual JobHandler processes if they are used. These files will have the letters “jh” and an ID included in their filename.

Note: in versions up to and including 9.2.8284, if the log file name does not end in “.txt” then all JobHandler processes will use the same log file which may cause file access contention and occasional conversion failures.

The pattern for using a log in C# is typically something like this:


string logPath = @"c:\test\solidframework.txt";
if (System.IO.File.Exists(logPath))
{
    System.IO.File.Delete(logPath);
}

SolidFramework.Plumbing.Logging.Instance.Path = logPath;

Similarly in C++ the following code can be used.

LPCWSTR p = L"c:/test/solidframework.txt";
SolidFramework::Platform::Plumbing::Logging::getInstance().setLogPath(p);
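
For versions affected by the “.txt” note above, a small guard can make sure the log path has the expected suffix before it is handed to the logger. The helper below is hypothetical and uses only the standard library; it assumes that the check is for an exact, lower-case “.txt” ending, which the note does not state explicitly.

```cpp
#include <string>

// Hypothetical helper (not part of Solid Framework): returns the path
// unchanged if it already ends in ".txt", otherwise appends ".txt".
std::wstring ensureTxtSuffix(const std::wstring& path) {
    const std::wstring suffix = L".txt";
    const bool hasSuffix =
        path.size() >= suffix.size() &&
        path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0;
    return hasSuffix ? path : path + suffix;
}
```

The returned string's c_str() can then be passed wherever a wide-character path is expected.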