Developing software using Solid Framework

What operating systems does Solid Framework support?

Solid Framework comes in two flavors.

Native Solid Framework

The native version is pure C++, and runs on Windows, macOS and Linux (Ubuntu 18.04 and up, CentOS 8 and up).

Solid Framework for .NET

The .NET wrapped version is a breeze to work with. It requires the .NET Framework 4.0 or later. It is not supported on Windows 9x.

The framework has been tested on:

  • Windows 10
  • Windows 8
  • Windows 7
  • Windows Server 2016
  • Windows Server 2012 R2 x64
  • Windows Server 2008 R2 x64

The Solid Framework .dll is built as an “AnyCPU” framework and automatically runs x86 or x64 native, depending on the process that loads it.

Is SolidFramework just for .NET?

No. We offer two versions of the library: a .NET library, and a native C++ DLL which can be used without .NET.

What version of the .NET Framework is required when developing using the .NET version of Solid Framework?

The minimum version of .NET Framework required is 4.0.

Later versions of the .NET Framework can also be used.

What .NET programming languages can be used when writing software that works with Solid Framework?

Solid Framework is a CLS Compliant class library. This means that the class library only exposes features that are common across all .NET languages. For example, unsigned types and overloaded methods are not used since these features are not available in all supported languages.

For simplicity, all the samples and documentation are in C#. Commonly used CLS-compliant languages include:

  • C#
  • Visual Basic .NET
  • J#

Is it possible to use C++ to write software that includes Solid Framework?

Absolutely! Several of our customers use C++ for their web-based or app-based products.

If C++ is used then .NET does not need to be available on the machine.

There are several samples to demonstrate the use of C++ available in the Downloads part of the Solid Framework portal.

What are the differences between the .NET and the native C++ versions of Solid Framework?

Solid Framework is available in two distinct versions: a .NET version and a C++ version.

This document describes the main differences between these versions.

Which versions of Visual Studio can Solid Framework be used with?

The Solid Documents team develop using VS 2019 and we currently target the “v14” MSVC runtime (the version that shipped with VS 2015).

Is it possible to use Visual C++ 6.0?

The quick answer is that we have never tried so we don’t know.

The longer answer is that Solid Framework makes use of language and library features such as “std::shared_ptr” (added to C++ in 2011) and “std::wstring”. As such we think that there may be problems using Solid Framework with VC++ 6.0.

Having said that, we deliberately do not link statically into customer apps, which makes us less dependent on specific third-party library versions (such as the version of the MSVC runtime). If the customer can compile against our SolidFramework.h and SolidFramework.cpp API interface files, then our library will work: these interface files perform version-agnostic dynamic loading of the rest of our system.

How do I download the Solid Framework SDK?

The Solid Framework SDK can be downloaded using the self-service Solid Framework Developer Portal.

You will need to create an account in order to access the portal.

Click here to see a video tutorial on how to download the SDK.

After completing your download, place SolidFramework.dll in the source folder of your project.

Add a reference to this assembly from your project. Click here to see a video tutorial.

Does Solid Framework work on Macs powered by Apple Silicon processors?

Solid Documents Ltd is a member of the Apple Developer program and we have a long history of developing for the Mac platform.

Solid Framework works on the ARM based Apple M1 chip. Rosetta is not required.

What hardware factors are related to SDK processing performance and concurrency?

The two biggest factors are number of cores and amount of memory available. Assuming a 64-bit executable, the real scale comes when converting large scanned documents: the conversion process dynamically scales both the number of threads and the number of pages being concurrently processed to maximize use of the available resources.

Parallel Processing in Solid Framework

Does the SDK support multi-threaded calling?

No. While Solid Framework does take advantage of multi-threaded execution for increased conversion performance, it only supports one concurrent conversion per process. The primary reasons for this architectural choice are stability and job isolation.

A JobProcessor can be used to allow concurrent conversions on the same machine, but within different processes.

What is a Job Processor?

In a service environment where the number of concurrent connections could be very large it makes sense to manage and limit the number of concurrent conversions taking place and use a queuing system to manage requests. This is what our JobProcessor architecture does on Windows (C# implementation). This architecture is strongly modelled on the original Apache web server MPM architecture for:
– scaling to take advantage of available resources (adjustable but constrained number of worker processes)
– managing a queue of requests (to never allow periods of high traffic to bring the service down, only the average time per conversion increases)
– managing worker process health independently of the main process (a crash, hang or timeout in a worker process only affects a single job, the one that caused it; workers are “health recycled”, by default every 100 jobs, so there are no long-term resource leakage issues)

This architecture is stable and well tested. It is what is behind our www.simplypdf.com conversion service. It runs for months at a time doing 10,000s of conversions with no crashes, leakage or restarts.
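The queueing and “health recycling” ideas described above can be illustrated with a small, library-independent C++ sketch. Every name in it (Worker, JobQueueSketch, Submit, RunNextOn) is hypothetical and is not part of the Solid Framework API; a real JobProcessor runs its workers as separate processes, whereas this sketch only models the bookkeeping.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not the Solid Framework JobProcessor API.
struct Worker {
    std::size_t jobsDone = 0;  // jobs handled since the worker was last (re)started
    std::size_t restarts = 0;  // how many times the worker has been recycled
};

class JobQueueSketch {
public:
    JobQueueSketch(std::size_t workerCount, std::size_t recycleAfter)
        : workers_(workerCount), recycleAfter_(recycleAfter) {}

    // Requests are always accepted: under heavy load the queue grows
    // instead of the number of concurrent conversions.
    void Submit(const std::string& job) { pending_.push_back(job); }

    // Hand the next queued job to worker `index`.
    // Returns false when nothing is waiting.
    bool RunNextOn(std::size_t index) {
        if (pending_.empty()) return false;
        pending_.pop_front();  // a real worker process would convert the file here
        Worker& w = workers_.at(index);
        if (++w.jobsDone >= recycleAfter_) {
            // "Health recycle": model replacing the worker with a fresh one.
            w.jobsDone = 0;
            ++w.restarts;
        }
        return true;
    }

    std::size_t Pending() const { return pending_.size(); }
    const Worker& WorkerAt(std::size_t index) const { return workers_.at(index); }

private:
    std::vector<Worker> workers_;
    std::deque<std::string> pending_;
    std::size_t recycleAfter_;
};
```

With two workers and a recycle limit of 100, sending 250 jobs to the first worker recycles it twice and leaves it 50 jobs into its third life. The second worker is untouched.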

Is it possible to limit the number of JobHandlers?

Solid Framework is not intrinsically thread safe.

If parallel processing is required then we recommend using the “JobProcessor” class. This class will spin up a number of independent JobHandler processes, with each process performing a single conversion.

The JobProcessor will queue conversion requests and allocate them to a JobHandler process when one becomes idle.

By default, the JobProcessor will launch as many concurrent JobHandler processes as you have cores on your machine. You can restrict the number of parallel JobHandlers by setting JobProcessor::WorkerCount. 

Typically you will wish to set the number of workers to less than the number of cores on the machine, since this will allow other tasks to continue while Solid Framework is converting files.
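One way to choose a worker count that leaves headroom for other tasks is sketched below in plain C++. The helper names and the idea of a fixed number of “reserved” cores are assumptions made for illustration. The resulting value is what you would then assign to the worker count setting mentioned above.

```cpp
#include <thread>

// Hypothetical helper: number of workers to run on a machine with `cores`
// cores while leaving `reservedCores` free for other tasks.
// Always returns at least 1 so that conversions can still make progress.
unsigned SuggestedWorkerCount(unsigned cores, unsigned reservedCores) {
    if (cores <= reservedCores) return 1;
    return cores - reservedCores;
}

// Same calculation using the core count reported by the standard library.
// std::thread::hardware_concurrency() may return 0 when the count is
// unknown; the helper above then falls back to a single worker.
unsigned SuggestedWorkerCountForThisMachine(unsigned reservedCores) {
    return SuggestedWorkerCount(std::thread::hardware_concurrency(), reservedCores);
}
```

For example, on an 8-core machine with 2 cores reserved, the sketch suggests 6 workers.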

Is the JobProcessor available in the C++ version of the Framework?

The JobProcessor class is available only within the .NET version of SolidFramework.

However, we do have a Python based implementation that will work with the C++ version of SolidFramework, which can be used on both Windows and Linux machines.

Licensing

How do I get a Machine ID?

The license for Solid Framework is linked to the ID of the machine on which Solid Framework is installed.

To get the Machine ID, download the tool from the “My Account” page on solidframework.net.

Click on “How do I get a Machine ID?” to see the download options.

How do I get and use a license for Solid Framework?

To use the features of Solid Framework, you need a license from Solid Documents. Licenses, including trial licenses, can be created using the self-service Solid Framework Developer Portal. These licenses depend on a machine-specific ID, and a utility for generating these IDs is available at the Developer Portal.

To use the Solid Framework features you must embed the location of your license in your code. Click here to see the video tutorial.

// Solid Framework (Professional) license
License.Import(new StreamReader(@"C:\Users\Joe\Documents\Visual Studio 2010\Projects\FrameworkProject\license.xml"));




How do I use a license for Solid Framework if the Machine ID is volatile, such as on the Cloud?

Licenses are generally associated with the ID of the machine on which Solid Framework is being used. This can cause problems if Solid Framework is deployed on the cloud, since the machine ID may change from time to time.

Solid Framework has a mechanism for dealing with this issue, so please email us for details.

Note that deployment to the cloud is only available for hybrid or public licenses.

Image Processing and OCR

What languages does Solid OCR support?

Solid OCR directly supports 17 languages:

  1. English
  2. Catalan
  3. Danish
  4. Dutch
  5. Finnish
  6. French
  7. German
  8. Italian
  9. Norwegian
  10. Polish
  11. Portuguese
  12. Romanian
  13. Russian
  14. Spanish
  15. Swedish
  16. Slovenian
  17. Turkish

It uses Tesseract to provide support for a further 6 languages:

  1. Chinese (traditional)
  2. Chinese (simplified)
  3. Japanese
  4. Korean
  5. Greek
  6. Hebrew

What needs to be done to perform OCR on CJK, Greek or Hebrew Text?

SolidFramework uses Tesseract to perform OCR on Chinese, Japanese, Korean, Greek and Hebrew language documents.

Information about how to do this can be found in the document Performing OCR using Tesseract.

What is the focus for OCR accuracy?

SolidFramework is primarily aimed at the reconstruction of business documents.

As such OCR is unashamedly biased towards accurately recognising the content of such documents.

What is CGM?

CGM is an abbreviation for Color Gray Mono. Originally our image processing was targeted at archiving functionality: creating small scanned pages for PDF/A archive files while preserving the page image quality as much as possible. To this end, we recognize zones of the page image based on their nature and break the page up into appropriate components. For example:

  • a color photograph is extracted from the rest of the page, downsampled (typically to 150dpi) and compressed as JPEG
  • for a color graphic or text heading (palette image) we resample the colors to a smaller palette (like 16 or 256 colors) and use lossless compression (think of it as a GIF or PNG)
  • for monochrome text we extract as either 1 or 2 bits per pixel (anti-aliased text) and store losslessly using CCITT FAX compression

This segmentation, selective use of lossy or lossless compression and selective downsampling allows us to build a composite image page in PDF which is far smaller than a single scanned image page would be.
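The per-zone choices listed above can be summarised as a simple lookup. The C++ sketch below is illustrative only: the type and function names (ZoneKind, EncodingPlan, PlanFor) are hypothetical, the labels are descriptive strings rather than real codec settings, and it does not reflect how Solid Framework implements segmentation internally.

```cpp
#include <string>

// Hypothetical classification of a page zone, following the three examples above.
enum class ZoneKind { ColorPhoto, PaletteGraphic, MonoText };

// Descriptive summary of how a zone would be stored.
struct EncodingPlan {
    std::string compression;  // label only, not a codec configuration
    bool lossy;               // true if the compression discards information
    bool downsample;          // true if the resolution is reduced first
};

EncodingPlan PlanFor(ZoneKind kind) {
    switch (kind) {
        case ZoneKind::ColorPhoto:
            // Photographs tolerate downsampling and lossy compression.
            return {"JPEG", true, true};
        case ZoneKind::PaletteGraphic:
            // Graphics and headings: fewer colors, but stored losslessly.
            return {"reduced palette, lossless", false, false};
        case ZoneKind::MonoText:
            // Text: 1 or 2 bits per pixel, stored losslessly.
            return {"CCITT fax", false, false};
    }
    return {"unknown", false, false};
}
```

Only the photograph zone uses lossy compression and downsampling. Text and palette graphics stay lossless, which is what keeps them sharp in the composite page.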

What pre-processing does SolidOCR support?

Solid CGM also includes all the obvious pre-processing functionality required to process scanned page images.

  • deskew
  • auto rotate (determining dominant page text orientation)
  • despeckle (“salt” and “pepper” noise removal)
  • dynamic thresholding (OCR is typically a monochrome process but for that to work, we need to establish “paper” and “ink” shades and limits)
  • scanner noise removal (typically black bars at the edges of pages)
  • staple, punch hole and folded corner noise removal
  • 90 and 270 degree text component detection (minor text components not in the same orientation as the rest of the page)
  • vector table detection
  • vector underline removal and repair (fixing the text character descenders that the underline may have “sliced”)
  • inverse text component detection (either for the whole page or for smaller text components: typically white on black text but can be any colors)

Miscellaneous

How do I get hold of columns within a document?

Columns are a property of the “Section” object.

How can I find the location of a piece of text on a page?

Provided that the CoreModel has been created with the PdfOptions.ExposeTargetDocumentPagination set to true, then it is possible to get the LayoutDocument from the CoreModel once it has been created.

Each object within the CoreModel.Topic (except runs) has an associated Layout object. This layout object contains information about the location of the object within the document.

To get the layout object, call LayoutDocument.FindLayoutObject(ID), where ID is the identifier of the SolidObject, which can be found using SolidObject.GetID().

For each paragraph in the CoreModel.Topic there will be a matching LayoutParagraph which provides access to the location.

When I try to convert a PDF to PDF/A, I get a conversion status of PdfAError, and yet conversion appears to have happened. What does this mean?

ConversionStatus.PdfAError means “There was a problem in the source document that meant that it was not PDF/A compliant”.

However, SolidFramework may have been able to resolve these problems to create a compliant document, in which case an output file would have been generated.

It is therefore necessary to check whether the ConversionResults contains a path to a file, which would indicate that conversion was able to occur.

Typical code is as follows:

if (res == ConversionStatus.Success || res == ConversionStatus.PdfAError)
{
    // Get the location of the generated file
    if (conv.Results[0] != null)
    {
        if (conv.Results[0].Paths.Count == 1)
        {
            string path = conv.Results[0].Paths[0];
            //Do something with the file
        }
    }
}

How do I remove tagging from a PDF file?

Tagging uses a set of standard structure types to allow page content (text, graphics and images) to be extracted and reused for other purposes.

For example, Solid Framework uses tags, if present, to identify tables within a PDF. This can allow more accurate extraction of table data from a PDF. The problem with this is that tags are optional.

The same textual data may be identified as a table if tags are present, but identified as ordinary text if they are absent. This causes problems if you are trying to compare apparently similar files where one is tagged and the other is not.

Solid Framework allows tags to be removed from a PDF using the following code:

string taggedFile = @"C:\path\to\tagged.pdf"; // Placeholder: path to the PDF that contains tags
string untaggedFile = @"C:\path\to\untagged.pdf"; // Placeholder: path for the PDF with tags removed

PdfDocument doc = new PdfDocument(taggedFile);
doc.Open();
doc.RemoveStructTreeRoot();
doc.SaveAs(untaggedFile, SolidFramework.Plumbing.OverwriteMode.ForceOverwrite, true);

Why do I have more pages in my reconstructed Word document than were in the PDF?

Solid Framework is very good at reconstructing Word documents from PDF. In most situations the reconstructed document will have the same number of pages as the PDF, but there are two main situations where this is not the case because of limitations within Word.

Page jumps due to consecutive Odd (or Even) Page Footers.

Word supports headers and footers being different on odd and even pages.  However, Word does not support an odd footer on one page being followed by a different odd footer on the following page (or indeed two consecutive even-page footers). Attempting to do so will result in Word quietly inserting an additional page, which may not be visible within Word, but will result in an increased page count, and cause the page numbers to jump as you move from one page to the next.

Tables as the very last item in the Document

Word does not allow a table to be the last item in a document. It must be followed by a new line.

If the table ends at the very bottom of the page then the new line may result in an extra page being created.

This problem can be resolved by editing the document and setting a very small font size for the newline character. While we could have done that automatically when Solid Framework reconstructed the document, we chose not to, since we realised that doing so would make editing the Word document difficult.

What does PdfDocument.SaveOptimizedAs() do?

The PDF file structure includes a cross-reference (or XRef) table which contains links to all of the objects or elements in a file and helps in navigating the file.

The image below shows the start of the XRef table for the PDF file that contains the documentation for PDF version 1.7.

If the user removes all references to an item, for example a font, an image or a page, from the PDF, then potentially the XRef table continues to hold references to, and the file continues to contain, that item even though it is no longer used. This results in a PDF file that is larger than it needs to be.

PdfDocument.SaveOptimizedAs() removes the links to, and the content of, these obsolete objects. This potentially allows the size of the file to be reduced.

If obsolete objects are present within the file, then significant reductions in file size can be achieved. If there are no obsolete objects present, then this method will have a minor effect on file size and could even cause a small increase in file size.
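The idea of an “obsolete” object can be illustrated with a generic reachability check: any object that cannot be reached by following references from the document root is a candidate for removal. The C++ sketch below is a simplified model using hypothetical names (ObjectGraph, FindUnreferenced). It is not the Solid Framework implementation, and real PDF files involve more than this, for example object streams and incremental updates.

```cpp
#include <map>
#include <set>
#include <vector>

// Simplified model of a PDF: object number -> numbers of the objects it references.
using ObjectGraph = std::map<int, std::vector<int>>;

// Returns the objects that cannot be reached by following references from `root`.
std::set<int> FindUnreferenced(const ObjectGraph& objects, int root) {
    std::set<int> reachable;
    std::vector<int> stack{root};
    while (!stack.empty()) {
        int id = stack.back();
        stack.pop_back();
        // Skip dangling references and objects that were already visited.
        if (!objects.count(id) || !reachable.insert(id).second) continue;
        for (int ref : objects.at(id)) stack.push_back(ref);
    }

    std::set<int> unreferenced;
    for (const auto& entry : objects) {
        if (!reachable.count(entry.first)) unreferenced.insert(entry.first);
    }
    return unreferenced;
}
```

If object 1 is the root and references objects 2 and 3, while objects 4 and 5 are referenced only by each other or by nothing reachable, the sketch reports 4 and 5 as unreferenced.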

Problem solving

How do I get debugging information from Solid Framework?

Solid Framework can emit a detailed text file log during processing. This can be very useful in allowing Solid Documents to identify where a problem is occurring.

By default no log file is created. If one is specified then it will be written to as processing occurs.

Additional log files will be created by individual JobHandler processes if they are used. These files will have the letters “jh” and an ID included in their filename.

Note: in versions up to and including 9.2.8284, if the log file name does not end in “.txt” then all JobHandler processes will use the same log file which may cause file access contention and occasional conversion failures.
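A defensive way to avoid the shared-log problem in those versions is to make sure the log path ends in “.txt” before passing it to the logger. The helper below is a hypothetical, library-independent C++ sketch. It uses a case-sensitive comparison, which is an assumption, since the note above does not say whether the check inside Solid Framework is case-sensitive.

```cpp
#include <string>

// Hypothetical helper: returns `path` unchanged if it already ends in ".txt"
// (case-sensitive), otherwise returns `path` with ".txt" appended.
std::wstring EnsureTxtExtension(const std::wstring& path) {
    const std::wstring ext = L".txt";
    const bool hasExt =
        path.size() >= ext.size() &&
        path.compare(path.size() - ext.size(), ext.size(), ext) == 0;
    return hasExt ? path : path + ext;
}
```

The returned path can then be used wherever the log path is set.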

The pattern for using a log in C# is typically something like this:

string logPath = @"c:\test\solidframework.txt";

if (System.IO.File.Exists(logPath))
{
     System.IO.File.Delete(logPath);
}

SolidFramework.Plumbing.Logging.Instance.Path = logPath;

(Alternatively you can use SolidFramework.Plumbing.Logging.Path = logPath;)

Similarly in C++ the following code can be used.

LPCWSTR p = L"c:/test/solidframework.txt";
SolidFramework::Platform::Plumbing::Logging::GetInstance().SetLogPath(p);

Note: It is important to initialize Solid Framework before attempting to specify the log file.
The easiest way to do this is to place this code after importing the Solid Framework license.

Why am I getting a "BadImageFormatException"?

Three versions of SolidFramework.dll are available for .NET

One is specifically for 32-bit, one for 64-bit, and the third is “AnyCPU”.

If a Visual Studio project has a different bitness from the referenced SolidFramework.dll, a BadImageFormatException is thrown when the assembly is loaded.

The problem occurs most often if the 64-bit version SolidFramework.dll is downloaded, since the default bitness for an “AnyCPU” Visual Studio project is, unintuitively, 32-bit.

Solution

  1. (Preferred). Download and use the AnyCPU version of SolidFramework.dll
  2. Change the Visual Studio project options to uncheck the default “Prefer 32-bit”

The Prefer 32-bit checkbox

I am getting an error that says that "api-ms-win-crt-runtime-l1-1-0.dll" is missing

Solid Framework 10.0.10054 is compiled using Visual Studio 2019, which uses the Windows 10 SDK.

This has resulted in a change to the C++ Redistributable libraries that are required to use Solid Framework. This is not a problem on Windows 8 or Windows 10 machines (as the required files are automatically present), but it can cause an error on Windows versions up to and including Windows 7.

When trying to run Solid Framework an error will be shown if the required files are not available.

Error shown if api-ms-win-crt-runtime-l1-1-0.dll is missing

The required files can be downloaded from https://support.microsoft.com/en-nz/help/2977003/the-latest-supported-visual-c-downloads

For more information please see https://solidframework.net/wp-content/uploads/general/running_solid_framework_on_windows_7.pdf