Skip to main content

Converting a PDF to a CSV Table

Sheetloom provides a powerful feature to extract data from a PDF document and convert it to a queryable CSV table.

Multiple PDFs can be simultaneously uploaded and converted into one CSV table.

To do this requires a field map document defining header names and fields from the PDF document.

Conversion Stages

The following sets out the PDF to CSV conversion process workflow.

1. Upload a PDF

From the PDFs page click Upload PDF and select a file.

upload pdf


To change the name of the folder the PDF file will be stored in, click Show Advanced Options. The folder name can be changed from the PDF name that displays by default. In the example we change from b000001 (the company registration number) to Registration.

upload pdf advanced


Select Upload. The Folder with the chosen name now displays in the PDFs page.

pdf page


tip

Changing the folder name to describe the content of the PDF being uploaded is recommended. For example, if the PDF is named as the company name, and contains registration details, changing name to Registration makes sense as other company PDFs will be appended later.


2. Append PDFs

Additional PDF files may be available to append immediately, or may not become available until later. The append is done the same way in either case.

From the PDF page click on the PDF folder then click Append PDF. Click Choose PDFs and select the file or files to append, and click Open.

append files


The files display in the dialogue box. Remove any if required then select Upload

pdf confirm append upload


The uploaded files are added to the folder.

pdf-append-in-folder


3. Extract Text Fields from PDFs

Uploaded PDFs are in binary format and text needs to be extracted from them before conversion.

Begin by generating a text file from the uploaded PDF file(s). To do this click the Play button in the TXT column. The process runs in a few seconds and a text document for each PDF is generated.

The TXT icon becomes live. Click on it to view a list of text documents that have been generated from the uploaded PDFs.

pdf text docs


A text document can be viewed by downloading it. Click the text icon and then download from the Actions column in the pop-up that displays. If the field map has already been loaded it can also be viewed by clicking the eye icon.

4. Add a Field Map

This document is in YAML format and contains details of the CSV headers and field values to extract from the text document.

It can be externally produced, or produced within Sheetloom using the text files generated in step 3. See the field map page for configuration details.

From the PDF page click the Upload icon in the Field Map column. Select the field map file and upload it.

upload-mapping-doc

Mapping from the document to the target headers and values in the text file is automatically done when the map document is uploaded. To preview these click the text icon and the eye icon for the required text field from the pop-up that displays.

pdf extract preview

Mapped fields display in yellow. If the field was created in Sheetloom using a Header entry the field heading block also displays in blue.


info

The Field Map is typically prepared in advance but it can be created during the conversion workflow by following the instructions at Create a Field Map in Sheetloom


5. Generate a CSV from Text Files

A CSV file with the target headers and values from the text file can now be generated.

On the PDF page click on the Play button in the CSV column. pdf generate csv


In a few seconds the CSV file is created. The CSV icon in the CSV column becomes live. Click it then the eye icon to generate a preview of the CSV showing the extracted columns.

csv-preview


To preview the database table, select the table icon from the Table column.

The data is stored in a database table in Sheetloom called pdf_mytablename e.g., pdf_registration which can be referenced in a Stitch query like any other table. Click the list icon adjacent to the table icon in the Table column to display a sample query.