Introduction

Solid Framework can reconstruct both PDFs that contain embedded text and PDFs that are scanned. This article describes the focus of our work on reconstructing scanned documents.

Types of Document that Solid Framework Converts

As a business, our focus is on document reconstruction. Specifically, we strive to do an excellent job of reconstructing business documents that originated on personal computers some time in the last twenty years.

In particular, we aim to take documents that started as Word, PowerPoint or Excel files before being converted to PDF and convert them back into an editable form (in Word, PowerPoint or Excel).

There are many other types of PDFs that are not our focus, for example:

  •  restaurant menus, birthday cards, colorful brochures, etc. created in layout products that are not word processors
  •  contrived or constructed test case documents that don’t look like typical customer business documents

Of course, we try to convert all documents. We simply choose to focus our energy on business documents when it comes to making improvements to Solid Framework.

 

Types of Document that Solid OCR Converts

When it comes to reconstructing scanned documents, our focus remains on business documents.

Examples of scanned content that would not be our focus are:

  • low quality faxes
  • low quality scans (low resolution or lossy compression, e.g. JPEG used for text)
  • non-document scans (for example cash register receipts, invoices, etc.)

Again, we obviously attempt to do a good job of converting just about anything. We just choose to focus our improvement energy on business documents.

 

What is so great about Solid OCR?

We have developed our own OCR technology for Latin and Cyrillic scripts. It is fast and accurate for the types of documents (mentioned above) that we have the most interest in reconstructing.

We support:

  • the popular Latin and Cyrillic script languages
  • font detection for commonly used fonts (Windows and macOS) (e.g. Arial Black is detected as Arial Black, not some generic bold sans-serif font)
  • automatic language detection (excellent for batch processing or other automated conversion processes)
  • multilingual document support (language detection is at the paragraph level, allowing a document to be converted even if it contains a mix of languages)
  • massively parallel conversion on powerful machines
  • vector text conversion (we detect non-font text in output from products like CorelDRAW or Illustrator and pass the text vectors through OCR while preserving the non-text vectors as-is)
  • large format pages at high resolution (consider A0 size CAD drawings at 300 dpi; most competitors downsample before doing OCR)
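
To put the large format claim in numbers, here is a small back-of-the-envelope calculation in Python. It assumes the ISO 216 A0 page size (841 mm × 1189 mm). The function names are illustrative only and are not part of Solid Framework.

```python
MM_PER_INCH = 25.4


def pixels(length_mm: float, dpi: int) -> int:
    """Convert a physical length in millimetres to whole pixels at a given dpi."""
    return round(length_mm / MM_PER_INCH * dpi)


def raster_size(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int, int]:
    """Return (width_px, height_px, total_pixels) for a page rendered at dpi."""
    w = pixels(width_mm, dpi)
    h = pixels(height_mm, dpi)
    return w, h, w * h


# ISO 216 A0 page dimensions in millimetres (assumed page size for this example).
A0_MM = (841, 1189)

if __name__ == "__main__":
    for dpi in (150, 300):
        w, h, total = raster_size(*A0_MM, dpi=dpi)
        print(f"A0 at {dpi} dpi: {w} x {h} px, {total / 1e6:.1f} megapixels")
```

At 300 dpi an A0 page comes out at 9933 × 14043 pixels, about 139.5 megapixels, which is roughly four times the pixel count of the same page at 150 dpi. That difference is the detail lost when a page is downsampled before OCR.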

Handling of other languages

To support Greek, Hebrew and CJK scripts we fall back on the open-source Tesseract OCR engine. While it is slower and nowhere near as accurate as our own technology for Latin and Cyrillic scripts, it does provide an option for these languages. Solid OCR image processing is still used for these languages, which means all our related functionality still works (page auto-rotation, language detection, parallel processing, vector text, etc.).
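
The split between engines described above can be summarised as a simple lookup. The following is only a sketch of that rule in Python: the script labels, the engine names and the pick_engine function are hypothetical and do not reflect the actual Solid Framework API.

```python
# Hypothetical labels; the grouping itself follows the text above.
SOLID_OCR_SCRIPTS = {"latin", "cyrillic"}
TESSERACT_SCRIPTS = {"greek", "hebrew", "cjk"}


def pick_engine(script: str) -> str:
    """Return the (hypothetical) engine name that would handle a given script."""
    key = script.strip().lower()
    if key in SOLID_OCR_SCRIPTS:
        return "solid-ocr"
    if key in TESSERACT_SCRIPTS:
        return "tesseract"
    raise ValueError(f"unsupported script: {script!r}")


if __name__ == "__main__":
    for script in ("Latin", "Cyrillic", "Greek", "Hebrew", "CJK"):
        print(f"{script}: {pick_engine(script)}")
```

In either case the image processing stage is the same, so only the recognition step changes with the script.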

 

Examples of Customer Use Cases for Solid OCR

  • legal document products (where correctly recognising isolated list-numbering characters throughout the document is very important)
  • high-end business document comparison
  • an Acrobat clone targeted at the CAD market (large format pages, vector text support)
  • automated text translation