Extract Financial Data Tables from a PDF with Python

PDF stands for “Portable Document Format.” It is the third most popular file format on the web (after HTML and XHTML). There are trillions of PDF files worldwide. Businesses and government agencies widely use PDFs to distribute information and collect data electronically. While it has become an essential part of business communication in the digital age, it is not necessarily a good format for working with data. Extracting tables from PDF files is a common need for businesses and researchers. It allows them to analyze and report on the data more effectively.

VIEW PROJECT

Project Type

Data Scraping, Data Wrangling

Software/Tools/Libraries

Python, Camelot, PyPDF2, Pandas, Matplotlib, NumPy

Authors

Tivon Johnson

Overview

Extracting text from a PDF can be as simple as copying and pasting the content. However, extracting tabular data may not be as straightforward as table structure is rarely maintained and columns and rows get distorted. Fortunately, there are other methods we can use. Python, for example, has a variety of libraries available for extracting tables from PDF files. This project demonstrates how to extract tables from a PDF using the Camelot Python library.

Data Source

Annual City Budget data for Augusta, Georgia published as downloadable PDF files at www.augustaga.gov.

Process Outline

Install Python Libraries
Extract Tables
Pandas Dataframe
Data Validation
Save as CSV File

Python Libraries

Camelot – Used for extracting tables
- Works with text-based files and tables only
- Required dependencies: Tkinter, Ghostscript
PyPDF2 – Different types of PDF operations
- Auto-installed with Camelot
- Defaults to v3.0, which has a deprecation error
- Solution: pip install PyPDF<3.0
Pandas – Data manipulation and analysis
Matplotlib – Visualization and plots
NumPy – Scientific computing