The Solid Framework Core Model is our in-memory model of the reconstructed content of your PDF file. We reconstruct clean Unicode text in reading order and provide the bounds of the major text blocks, matching their positions in the original PDF layout.
Typically, this model is then exported to another format such as Word .docx, but this sample shows you how to iterate over the model programmatically, in memory.
The sample outputs two files:
- a text file that includes the bounding box coordinates and text for each of the text blocks.
- an HTML file that visually illustrates the different bounding rectangles for all the text blocks/paragraphs in the PDF.
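The shape of these two outputs can be sketched as follows. This is a minimal illustration and not the code from the sample project: the `TextBlock` class, the `LayoutWriter` helper and the line format are all hypothetical, no Solid Framework types are used, and the coordinates are assumed to be already expressed as offsets from the top-left corner of the page.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Net;
using System.Text;

// Hypothetical stand-in for one reconstructed text block (not a Solid Framework type).
public sealed class TextBlock
{
    public double Left { get; }
    public double Top { get; }
    public double Width { get; }
    public double Height { get; }
    public string Text { get; }

    public TextBlock(double left, double top, double width, double height, string text)
    {
        Left = left; Top = top; Width = width; Height = height; Text = text;
    }
}

public static class LayoutWriter
{
    // Text output: one line per block, the bounding box followed by the block's text.
    public static string ToText(IEnumerable<TextBlock> blocks)
    {
        var sb = new StringBuilder();
        foreach (var b in blocks)
        {
            sb.AppendLine(string.Format(CultureInfo.InvariantCulture,
                "[{0}, {1}, {2}, {3}] {4}", b.Left, b.Top, b.Width, b.Height, b.Text));
        }
        return sb.ToString();
    }

    // HTML output: one absolutely positioned, outlined <div> per block.
    public static string ToHtml(IEnumerable<TextBlock> blocks)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<!DOCTYPE html>");
        sb.AppendLine("<html><body style=\"position:relative\">");
        foreach (var b in blocks)
        {
            sb.AppendLine(string.Format(CultureInfo.InvariantCulture,
                "<div style=\"position:absolute;left:{0}px;top:{1}px;width:{2}px;height:{3}px;border:1px solid red\">{4}</div>",
                b.Left, b.Top, b.Width, b.Height, WebUtility.HtmlEncode(b.Text)));
        }
        sb.AppendLine("</body></html>");
        return sb.ToString();
    }
}
```

Each returned string could then be saved with `System.IO.File.WriteAllText`. The text is HTML-encoded so that characters such as `<` in the PDF content do not break the generated page.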
The source code for this sample is provided in the form of a zip file. Solid Framework for .NET is available from the developer portal (self-service).
Steps to Extract Text from the Model – C# .NET Sample Application
- Create a directory such as C:\SampleCode, then download TextBounding.NET.zip (6KB) and extract the sample project into your new directory.
- Create a free portal account, generate a Developer SDK license, and download the Solid Framework .NET .dll into your new directory.
- Open TextBounding.NET.sln in Visual Studio.
- Navigate to your downloaded Solid Framework .NET for Windows .dll file and add it as a Reference.
- Right click on the project and select Rebuild.
- Open a cmd window and navigate to your project’s Debug folder, which contains the project’s .exe file, e.g. cd C:\SampleCode\Debug.
- Type the name of your .exe file followed by these four paths, in this order:
  - Your license.xml file
  - The PDF file to parse
  - The path where you want your .html file to be saved
  - The path where you want your .txt file to be saved
  e.g. TextBounding.NET.exe C:\SampleCode\license.xml C:\SampleCode\YourPDF.pdf C:\SampleCode\layout.html C:\SampleCode\layout.txt
- Press Enter. You can then open the .html file, which visually illustrates the reconstructed text blocks/paragraphs and their bounds.
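The four command-line arguments from the steps above could be validated along these lines before any PDF processing starts. This is a sketch under stated assumptions rather than the sample's actual code: the `SampleArgs` class, its property names and the error messages are hypothetical, and only the argument order is taken from the steps.

```csharp
using System;
using System.IO;

// Hypothetical holder for the four paths; names are illustrative only.
public sealed class SampleArgs
{
    public string LicensePath { get; }
    public string PdfPath { get; }
    public string HtmlPath { get; }
    public string TextPath { get; }

    private SampleArgs(string licensePath, string pdfPath, string htmlPath, string textPath)
    {
        LicensePath = licensePath;
        PdfPath = pdfPath;
        HtmlPath = htmlPath;
        TextPath = textPath;
    }

    // Returns null and sets 'error' when the arguments are unusable.
    public static SampleArgs Parse(string[] args, out string error)
    {
        if (args == null || args.Length != 4)
        {
            error = "Usage: TextBounding.NET.exe <license.xml> <input.pdf> <output.html> <output.txt>";
            return null;
        }
        // The two input files must already exist; the two outputs are created later.
        if (!File.Exists(args[0]))
        {
            error = "License file not found: " + args[0];
            return null;
        }
        if (!File.Exists(args[1]))
        {
            error = "PDF file not found: " + args[1];
            return null;
        }
        error = null;
        return new SampleArgs(args[0], args[1], args[2], args[3]);
    }
}
```

A `Main` method could call `SampleArgs.Parse(args, out var error)`, print `error` and exit when the result is null, and otherwise pass the four paths on to the license import and model-building code.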