Handling non-table Content when Exporting to Excel – New Options Added

Solid Framework is great for extracting tables from a PDF and putting them into an Excel spreadsheet. Options let you choose whether to create a separate sheet for each table or to put all of the tables onto the same sheet.

If a table is split over several pages in the PDF, Solid Framework will very often stitch the parts of the table back together rather than creating separate tables.

But what should be done with text and images in a PDF that are not part of a table?

In the past, Solid Framework has offered two choices: either put the content into the first columns of the Excel file (TablesFromContent = true), or else remove it entirely (TablesFromContent = false).

In version 8472 we changed ExcelTablesFromContent from a Boolean to an enum to support a third option. This was reverted in 8492 to avoid compile-time errors in existing code.

We recommend that users download the latest version of Solid Framework if they are using 8472.

To support this functionality we have added a new Boolean option, PreserveColumnsInNonTableContent.

The choices now are:

  •  false is the equivalent of ExcelTablesFromContent = true in previous versions of SolidFramework.
  •  true allows text that is in columns in the PDF to be placed in columns in the reconstructed Excel file. This is now the default.

How this compares in practice

These examples are based on the following text:

PreserveColumnsInNonTableContent = false

Each paragraph is placed into a single cell within the reconstructed spreadsheet. The text for each paragraph is always placed in the first column.

PreserveColumnsInNonTableContent = true

Text is retained in columns. A number of sentences may be included in the same cell and row.

This mode may be particularly useful if table data is not correctly detected, since the spreadsheet will still look similar to the original file.
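The difference between the two settings can be pictured with a small stand-alone model. The sketch below is not Solid Framework code and does not reflect its actual layout algorithm: the fragment tuples, the place_fragments function and the fixed column_width bucket are all assumptions made for illustration, and each line of text is treated as its own paragraph.

```python
from typing import Dict, List, Tuple

# Hypothetical input: (x position in points, line index on the page, text).
Fragment = Tuple[float, int, str]


def place_fragments(
    fragments: List[Fragment],
    preserve_columns: bool,
    column_width: float = 200.0,  # assumed bucket width, for illustration only
) -> List[List[str]]:
    """Return a grid of cell strings, one inner list per spreadsheet row."""
    rows: Dict[int, Dict[int, str]] = {}
    # Process fragments top to bottom, then left to right.
    for x, line, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        # With preserve_columns off, everything lands in the first column.
        col = int(x // column_width) if preserve_columns else 0
        cells = rows.setdefault(line, {})
        cells[col] = f"{cells[col]} {text}" if col in cells else text
    if not rows:
        return []
    width = max(max(cells) for cells in rows.values()) + 1
    return [[rows[line].get(c, "") for c in range(width)] for line in sorted(rows)]


sample = [(50, 0, "Name"), (300, 0, "Role"), (50, 1, "Ada"), (300, 1, "Engineer")]
first_column_only = place_fragments(sample, preserve_columns=False)
columns_kept = place_fragments(sample, preserve_columns=True)
```

In this model, the false setting merges each line into a single cell in the first column, while the true setting keeps text that sat side by side in the PDF in separate columns.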

Solid Framework 9.2.8472.1 Released

The latest public release of Solid Framework SDK is now available for download from the developer portal at www.solidframework.net. This is version 9.2.8472.

MAIN FEATURES

Avoidance of Unnecessary Columns when converting to Excel

Additional columns were created when converting some PDFs to Excel. This made editing difficult. Solid Framework 9.2.8472 no longer creates such columns.

Large Images are now Exportable to Excel

If a PDF contains an image taller than the maximum row height in Excel (409 pt), it was previously discarded. In this release additional rows are added so that the image can be shown.
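As a rough illustration of the arithmetic involved, the following sketch works out how many rows an image of a given height would need to span. How Solid Framework actually distributes the height across rows is not stated here, so the function name and the "fill each row to the maximum, then add the remainder" strategy are assumptions.

```python
from typing import List

EXCEL_MAX_ROW_HEIGHT_PT = 409.0  # limit quoted in the release note above


def row_heights_for_image(
    image_height_pt: float,
    max_row_height_pt: float = EXCEL_MAX_ROW_HEIGHT_PT,
) -> List[float]:
    """Split an image height into row heights that each fit Excel's limit."""
    if image_height_pt <= 0:
        raise ValueError("image height must be positive")
    full_rows = int(image_height_pt // max_row_height_pt)
    remainder = image_height_pt - full_rows * max_row_height_pt
    heights = [max_row_height_pt] * full_rows
    if remainder > 0:
        heights.append(remainder)  # last, shorter row for the leftover height
    return heights


tall_image_rows = row_heights_for_image(1000)
```

Under these assumptions, a 1000 pt image spans three rows: two at the 409 pt maximum and one of 182 pt.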

FURTHER OCR IMPROVEMENTS

SolidOCR continues to improve, and this release contains further refinements. In particular, in this release we have made improvements with regard to very small text.

OPTION TO IGNORE TAGS WHEN RECONSTRUCTING A PDF DOCUMENT

Solid Framework has supported tags within PDF documents for many years. Tags are used to guide the reconstruction process, particularly with regard to identifying tables. This can result in visually similar PDFs being reconstructed differently depending on whether or not they are tagged.

We have now added the option DetectTaggedTables (default = true). If the option is set to false then the tags will be ignored when reconstructing the document.

NEW OPTION WHEN EXTRACTING NON-TABLE CONTENT TO EXCEL

When creating an Excel spreadsheet from a PDF, we have always offered a choice as to how text that is not part of a table should be handled. In the past the only options have been to either remove it, or to place it into the first column of the spreadsheet, with one sentence per row.

We have now added the option of “KeepColumns” which will respect the horizontal location of text. This option allows the spreadsheet to look more like the original PDF.

For more information see the blog note.

This improvement has required ExcelTablesFromContent to be changed from a Boolean to an enum. This may cause compile-time errors in existing code. Please contact us if you require support.

Export to .Doc now creates an RTF file

We recommend choosing the “.docx” file format when converting to Word, as this has been the default format for more than ten years.

We have, however, also supported conversion to “.rtf” and to “.doc”.

While we will continue to support “.rtf”, from this release conversion to “.doc” will actually create an RTF file that simply carries the “.doc” file extension. Such files will still open seamlessly in Word.
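If a downstream tool needs to know which kind of “.doc” file it has been given, the first few bytes of the file are enough to tell. The sketch below relies only on the well-known file signatures (RTF files begin with `{\rtf`, and legacy binary Word files use the OLE compound file header); the function names are our own and are not part of Solid Framework.

```python
RTF_MAGIC = b"{\\rtf"  # every RTF file starts with this text
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file header


def sniff_word_format(header: bytes) -> str:
    """Classify a file from its leading bytes, ignoring the extension."""
    if header.startswith(RTF_MAGIC):
        return "rtf"
    if header.startswith(OLE2_MAGIC):
        return "doc-binary"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first eight bytes of a file on disk and classify it."""
    with open(path, "rb") as handle:
        return sniff_word_format(handle.read(8))


rtf_result = sniff_word_format(b"{\\rtf1\\ansi")
```

A “.doc” file produced by this release would be expected to report "rtf" here, while one saved by an older version of Word would report "doc-binary".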

Solid Framework 9.2.8284.1 Released

The latest public release of Solid Framework SDK is now available for download from the developer portal at www.solidframework.net. This is version 9.2.8284.

MAIN FEATURES

IMPROVED HANDLING OF SHADING

The rendering and conversion of Type 3 and Type 7 shading has been significantly improved.

RED NUMBERS ARE NO LONGER CONSIDERED TO ALWAYS BE NEGATIVE WHEN CONVERTING TO EXCEL

Previously, red numeric text was always considered to represent a negative number. It is now considered negative only if it is preceded by a “-” sign.
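The change in rule can be summarised in a few lines. This is a minimal sketch of the rule as described above, not the Framework's own parser: the function name, the is_red flag and the legacy_red_is_negative switch are hypothetical, and comma thousands separators are assumed.

```python
def parse_cell_number(
    text: str,
    is_red: bool,
    legacy_red_is_negative: bool = False,  # hypothetical switch for the old rule
) -> float:
    """Convert numeric cell text to a float under the old or the new rule."""
    cleaned = text.strip().replace(",", "")  # assumes "," thousands separators
    value = float(cleaned)  # a leading "-" is honoured here
    if legacy_red_is_negative and is_red and value > 0:
        return -value  # old behaviour: red alone implied a negative number
    return value  # new behaviour: colour is ignored, only the sign counts


new_rule = parse_cell_number("1,250.50", is_red=True)
old_rule = parse_cell_number("1,250.50", is_red=True, legacy_red_is_negative=True)
```

With the new rule, red text such as "1,250.50" stays positive, whereas the old rule would have turned it into -1250.5. Text that already carries a “-” sign is negative under both rules.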

FURTHER OCR IMPROVEMENTS

SolidOCR continues to improve, and this release contains further refinements. In this release we have improved the recognition of bold text within documents and made specific improvements to the 32-bit version of the Framework.

SUPPORT FOR REMOVING TAGS FROM A PDF DOCUMENT

Solid Framework has supported tags within PDF documents for many years. Tags are used to guide the reconstruction process, which can result in visually similar PDFs being reconstructed differently depending on whether or not they are tagged.

We have now added functionality to allow tags to be removed if required.