Extract Financial Data Tables from a PDF with Python
https://tivon.io/projects/extract-tables-from-pdf/ (Sun, 15 Oct 2023)

Project Type

Data Scraping, Data Wrangling

Software/Tools/Libraries

Python, Camelot, PyPDF2, Pandas, Matplotlib, NumPy

Authors

Tivon Johnson

Overview

PDF stands for “Portable Document Format.” It is the third most popular file format on the web (after HTML and XHTML). There are trillions of PDF files worldwide. Businesses and government agencies widely use PDFs to distribute information and collect data electronically. While it has become an essential part of business communication in the digital age, it is not necessarily a good format for working with data. Extracting tables from PDF files is a common need for businesses and researchers. It allows them to analyze and report on the data more effectively.

Extracting text from a PDF can be as simple as copying and pasting the content. Extracting tabular data, however, is rarely that straightforward: table structure is seldom preserved, and rows and columns get distorted. Fortunately, there are better methods. Python, for example, has a variety of libraries available for extracting tables from PDF files. This project demonstrates how to extract tables from a PDF using the Camelot Python library.

Data Source

Annual City Budget data for Augusta, Georgia published as downloadable PDF files at www.augustaga.gov.

Process Outline
  1. Install Python Libraries
  2. Extract Tables
  3. Pandas Dataframe
  4. Data Validation
  5. Save as CSV File
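
Putting the outline into code, the following is a minimal sketch of steps 2 through 5, assuming the budget PDF has been downloaded locally (the filename and page number are placeholders):

    # Minimal sketch: extract a table, inspect it, validate it, save it.
    import camelot
    from PyPDF2 import PdfFileReader  # PyPDF2 < 3.0, as pinned below

    # Step 2: extract tables ("lattice" suits ruled financial tables)
    with open("augusta_budget.pdf", "rb") as f:
        print("pages:", PdfFileReader(f).getNumPages())
    tables = camelot.read_pdf("augusta_budget.pdf", pages="1", flavor="lattice")

    # Step 3: each extracted table exposes a pandas DataFrame
    df = tables[0].df

    # Step 4: check extraction quality before trusting the data
    print(tables[0].parsing_report)  # accuracy, whitespace, order, page

    # Step 5: save as a CSV file
    tables[0].to_csv("summary_by_fund.csv")
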
Python Libraries
  • Camelot – Used for extracting tables
    • Works with text-based files and tables only
    • Required dependencies: Tkinter, Ghostscript
  • PyPDF2 – Different types of PDF operations
    • Auto-installed with Camelot
    • Camelot pulls in v3.0 by default, which raises a deprecation error
    • Solution: pip install "PyPDF2<3.0"
  • Pandas – Data manipulation and analysis
  • Matplotlib – Visualization and plots
  • NumPy – Scientific computing
Tables
  • Summary by Fund PDF (screenshot)
  • Detail Revenue PDF (screenshot)
Pandas Dataframes
  • Summary by Fund Dataframe (screenshot)
  • Detail Revenue Dataframe (screenshot)
CSV Files
  • Summary by Fund CSV (screenshot)
  • Detail Revenue CSV (screenshot)

Web Scraping & Image Classification of Maritime Vessels in Open Seas
https://tivon.io/projects/travel-app-design-creativity-application/ (Sun, 15 May 2022)

Utilizing open-source information to derive insights into commercial maritime activities. The objective of this project is to create a web scraping and machine learning application to automate the process of obtaining and identifying images of ships from social media platforms by geolocation.

Model Type

Neural Network

Software/Tools/Libraries
Python, Pandas, NumPy, Tweepy, PyTorch, AWS SageMaker, SQL, Twitter API, Instagram API
Authors
Tivon Johnson, Claudio Escudero, Himanvesh Maddina, Zeeshan Raza, James Worrall, Yiiang Xu

Overview

The goal of this project is to build a data mining pipeline that uses computer vision to automate the search for open-source photos of commercial maritime vessels in an area of interest. The application applies web scraping and machine learning to obtain geolocated images from social media and to detect and identify the ships within them. It pulls posts from Twitter and other open data sources, applies computer vision models to find Tweets that contain images of ships, and presents the results in a ranked view based on confidence score.

Project Goals
  • Build a data mining pipeline that allows an analyst to search for open-source photos in an area
    of interest
  • Apply computer vision to detect ships within the photos
  • Apply advanced analytics to determine which of the photos is most likely to be the ship the
    analyst is looking for
Requirements
  • Web scraping required to scrape data from Twitter/Instagram
  • Building a machine learning model based on object detection and hashtags
  • Utilization of Optical Character Recognition (OCR) software
  • Pre-trained models to categorize and label images
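
The following is a minimal sketch of the pipeline these requirements describe, assuming Twitter API v1.1 credentials and a pretrained COCO object detector standing in for the project's trained model (the keys, query, and geocode are placeholders):

    # Minimal sketch: pull geotagged tweets with media and rank their
    # photos by the detector's confidence that they contain a boat.
    import requests
    import torch
    import tweepy
    from io import BytesIO
    from PIL import Image
    from torchvision.models.detection import (
        FasterRCNN_ResNet50_FPN_Weights, fasterrcnn_resnet50_fpn)
    from torchvision.transforms.functional import to_tensor

    auth = tweepy.OAuth1UserHandler("KEY", "SECRET", "TOKEN", "TOKEN_SECRET")
    api = tweepy.API(auth)

    # Search recent tweets with media near a port (lat,long,radius)
    tweets = api.search_tweets(q="ship filter:media",
                               geocode="33.75,-118.27,10km", count=50)

    model = fasterrcnn_resnet50_fpn(
        weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()
    BOAT = 9  # "boat" class index in the COCO label map

    ranked = []
    for t in tweets:
        for media in t.entities.get("media", []):
            img = Image.open(BytesIO(
                requests.get(media["media_url_https"]).content)).convert("RGB")
            with torch.no_grad():
                out = model([to_tensor(img)])[0]
            scores = out["scores"][out["labels"] == BOAT]
            if len(scores):
                ranked.append((float(scores.max()), t.id,
                               media["media_url_https"]))

    ranked.sort(reverse=True)  # highest-confidence ship photos first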

Use a Decision Tree to Determine Eligibility for Credit Line Increase
https://tivon.io/projects/workout-website-design-and-development/ (Tue, 10 Aug 2021)

A decision tree is a classification technique that models decisions in a hierarchical structure resembling the branches of a tree. Decision tree algorithms are widely used in machine learning for predictive analytics. This project provides an example of a default decision tree classifier to determine eligibility for a credit limit increase.

Model Type
Decision Tree
Software/Tools/Libraries

Python, scikit-learn, NumPy, Pandas, Matplotlib, Google Colab

Authors
Tivon Johnson, Harsharan Gorli, Andrew Levy, Yingying Liu


Model Details

Basic Information
  • Model date: August 2021
  • Model version: 1.0
  • Model type: Decision Tree
  • Columns used as inputs in the final model: LIMIT_BAL, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6
  • Column(s) used as target(s) in the final model: DELINQ_NEXT
  • Software used to implement the model: Python 3.6+, Google Colab
  • Version of the modeling software: v0.2.5
  • Hyperparameters or other settings of your model: max_depth = 12
  • License: Apache 2.0 License
  • Model implementation code: Credit_Line_Increase.ipynb
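
A minimal sketch of the classifier these settings describe, assuming the course dataset is available locally as a CSV with the column names from the data dictionary below (the filename is hypothetical):

    # Minimal sketch: 50/25/25 split, max_depth=12 tree, AUC report.
    import pandas as pd
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    inputs = (["LIMIT_BAL"] + [f"PAY_{i}" for i in (0, 2, 3, 4, 5, 6)]
              + [f"BILL_AMT{i}" for i in range(1, 7)]
              + [f"PAY_AMT{i}" for i in range(1, 7)])
    target = "DELINQ_NEXT"

    data = pd.read_csv("credit_line_increase.csv")  # hypothetical filename

    # 50% training, 25% validation, 25% test, as described below
    train, other = train_test_split(data, train_size=0.5, random_state=12345)
    valid, test = train_test_split(other, train_size=0.5, random_state=12345)

    clf = DecisionTreeClassifier(max_depth=12, random_state=12345)
    clf.fit(train[inputs], train[target])

    for name, part in [("Training", train), ("Validation", valid), ("Test", test)]:
        auc = roc_auc_score(part[target], clf.predict_proba(part[inputs])[:, 1])
        print(f"{name} AUC: {auc:.2f}")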

Intended Use
  • Primary intended uses: This model is an example probability of default classifier, with an example use case for determining eligibility for a credit line increase.
  • Primary intended users: Professors and students in the GWU DNSC 6301 bootcamp.
  • Out-of-scope use cases: This model is for educational purposes and not intended to evaluate real-world credit worthiness.

Training Data
  • Data dictionary:

Name | Modeling Role | Measurement Level | Description
ID | ID | int | unique row identifier
LIMIT_BAL | input | float | amount of previously awarded credit
SEX | demographic information | int | 1 = male; 2 = female
RACE | demographic information | int | 1 = hispanic; 2 = black; 3 = white; 4 = asian
EDUCATION | demographic information | int | 1 = graduate school; 2 = university; 3 = high school; 4 = others
MARRIAGE | demographic information | int | 1 = married; 2 = single; 3 = others
AGE | demographic information | int | age in years
PAY_0, PAY_2 – PAY_6 | inputs | int | history of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; …; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; …; 8 = payment delay for eight months; 9 = payment delay for nine months and above
BILL_AMT1 – BILL_AMT6 | inputs | float | amount of bill statement; BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; …; BILL_AMT6 = amount of bill statement in April, 2005
PAY_AMT1 – PAY_AMT6 | inputs | float | amount of previous payment; PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; …; PAY_AMT6 = amount paid in April, 2005
DELINQ_NEXT | target | int | whether a customer's next payment is delinquent (late); 1 = late, 0 = on-time

  • Source of training data: GWU Blackboard, email jphall@gwu.edu for more information
  • How training data was divided into training and validation data: 50% training, 25% validation, 25% test
  • Number of rows in training and validation data:
    • Training rows: 15,000
    • Validation rows: 7,500

Test Data
  • Source of test data: GWU Blackboard, email jphall@gwu.edu for more information
  • Number of rows in test data: 7,500
  • State any differences in columns between training and test data: None

Quantitative Analysis
  • Metrics used to evaluate the model and final figures:
    • Training AUC: 0.78
    • Validation AUC: 0.75
    • Test AUC: 0.74
    • Asian-to-White AIR: 1.00
    • Black-to-White AIR: 0.85
    • Female-to-Male AIR: 1.02
    • Hispanic-to-White AIR: 0.83
  • Iteration plot of the final model (inclusive of Training AUC, Validation AUC, and Hispanic-to-White AIR):
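
The AIR figures above are adverse impact ratios: the rate of favorable outcomes for a protected group divided by the rate for a reference group. A minimal sketch of the calculation, assuming model predictions and the RACE coding from the data dictionary (the function and variable names are illustrative):

    import numpy as np

    def air(y_pred, group, protected, reference):
        # Favorable outcome here: predicted on-time payment (DELINQ_NEXT = 0)
        favorable = (np.asarray(y_pred) == 0)
        group = np.asarray(group)
        return (favorable[group == protected].mean()
                / favorable[group == reference].mean())

    # e.g., Black-to-White AIR with RACE coded 2 = black, 3 = white
    # air(test_preds, test["RACE"], protected=2, reference=3)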

Ethical Considerations
  • Potential negative impacts of the model:
    • The model can lead to discrimination. While the model may be accurate, accuracy does not always imply that the model is unbiased. Many of the factors that can lead to delinquency can unfortunately be linked to race or gender. Bias testing was implemented in order to mitigate any potential discrimination.
    • According to the variable importance chart, the most recent payment is the primary factor the decision tree splits on. As we have seen in the pandemic, the people most affected by economic disruptions are lower-income individuals, who would be most in need of increased credit lines.
  • Potential uncertainties relating to the impact of using the model:
    • One uncertainty is off-label use of the model. While the group intends the model to be used specifically for extending a credit line, other groups could potentially use it in instances where it has not been tested.
    • Another uncertainty is the accuracy of the data itself. Over time, the data can become dated, leading to inaccurate results.
  • Other unexpected results:
    • The model still has bias, and it is hard to fix it based on the dataset.

Predicting Mortgage Rates with Interpretable Machine Learning
https://tivon.io/projects/interpretable-machine-learning/ (Tue, 15 Jun 2021)

In machine learning, a "black box" model is one whose internal mechanisms are difficult to understand, making it hard to explain how it arrives at its predictions. This lack of transparency can raise ethical concerns regarding accountability and fairness in automated decision-making.

"Interpretable Machine Learning" (IML) refers to methods used to make machine learning models more transparent and comprehensible.

Model Type

Explainable Boosting Machine (EBM)

Software/Tools/Libraries

Python, InterpretML, XGBoost, NumPy, Pandas, Matplotlib, Seaborn, h2o

Authors

Tivon Johnson, Minhye Kim, Zach Vila, Qunzhe Ding

Model Details

Our group developed interpretable machine learning models as part of our semester project in the Responsible Machine Learning class taught by Professor Hall during the Summer 2021 semester. We used Home Mortgage Disclosure Act (HMDA) historic mortgage reporting data to predict the probability of applicants being charged a higher rate for their mortgages. To address growing concerns about black-box machine learning models being deployed in high-impact social domains without sufficient contemplation of and precaution against adverse effects, we demonstrate available techniques for interpreting and explaining predictive models in order to prevent unjust discrimination, improve security, and encourage ethical decision-making.

Basic Information
  • Model Date: June 2021
  • Model Version: 1.0
  • Model Type: Explainable Boosting Machine (EBM)
  • Software: Python 3.6+, InterpretML v0.2.5
  • Hyperparameters: {'max_bins': 512, 'max_interaction_bins': 32, 'interactions': 15, 'outer_bags': 10, 'inner_bags': 4, 'learning_rate': 0.01, 'validation_size': 0.4, 'min_samples_leaf': 1, 'max_leaves': 3, 'early_stopping_rounds': 100.0, 'n_jobs': 4, 'random_state': 12345}
  • Columns used as inputs: ['intro_rate_period_std', 'no_intro_rate_period_std', 'debt_to_income_ratio_missing', 'property_value_std', 'income_std', 'debt_to_income_ratio_std']
  • Column used as target: 'high-priced'
  • Paper or other resource for more information
  • License: Apache 2.0
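
A minimal sketch of fitting the EBM described above, assuming interpret v0.2.5 (whose constructor accepts the hyperparameters listed) and the preprocessed HMDA training file linked below (the CSV filename and the exact target header are assumptions):

    import pandas as pd
    from interpret.glassbox import ExplainableBoostingClassifier

    inputs = ["intro_rate_period_std", "no_intro_rate_period_std",
              "debt_to_income_ratio_missing", "property_value_std",
              "income_std", "debt_to_income_ratio_std"]
    target = "high-priced"  # adjust to the actual column header in the file

    data = pd.read_csv("hmda_train_preprocessed.csv")  # hypothetical filename

    # Hyperparameters as listed in Basic Information above
    ebm = ExplainableBoostingClassifier(
        max_bins=512, max_interaction_bins=32, interactions=15,
        outer_bags=10, inner_bags=4, learning_rate=0.01,
        validation_size=0.4, min_samples_leaf=1, max_leaves=3,
        early_stopping_rounds=100, n_jobs=4, random_state=12345)
    ebm.fit(data[inputs], data[target])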

Intended Use

Primary intended uses

    • The primary intended use is to provide an interpretable machine learning model that helps explain predictions as opposed to black box models which provide little, if any, explanation. Such transparency may help prevent bias and discrimination that can occur with black-box models as it relates to applicants with higher mortgage rates.
    • Our project goal is to determine whether the Annual Percentage Rate (APR) charged for a mortgage is high-priced, which we consider one of many issues that perpetuate a massive disparity in overall wealth between different demographic groups in the US. As a result, demographic factors such as race and sex were considered. We discovered that Black applicants are more likely to receive high-priced mortgages.

Primary intended users

    • The primary intended users of this model are professors, students, and researchers of interpretable machine learning models.

Out-of-scope use cases

    • This model is for educational purposes and not intended to evaluate real-world credit worthiness.

Metrics

Model performance measures

    • In our project, we chose Area Under the Curve (AUC) as the evaluation metric for the model: in machine learning, AUC is one of the most important metrics for evaluating a classification model's performance.

Decision thresholds

    • Typically, an excellent model has an AUC near 1 and a poor model has an AUC near 0; an AUC of 0.5 means the model has no class-separation capacity. In our project we did not set a decision threshold on the AUC; instead, we selected the model with the highest AUC (the EBM) as our best model: 0.8247 pre-remediation and 0.8097 post-remediation.
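
Continuing the fitting sketch above, validation AUC can be computed along these lines (the 70/30 split mirrors the Variation Approaches item below):

    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    train, valid = train_test_split(data, train_size=0.7, random_state=12345)
    ebm.fit(train[inputs], train[target])
    auc = roc_auc_score(valid[target], ebm.predict_proba(valid[inputs])[:, 1])
    print(f"Validation AUC: {auc:.4f}")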

Variation approaches

    • Our random grid search for the EBM ran through 500 iterations.
    • We split our data 70%/30% into training and validation sets.

Training Data

Datasets

    • Home Mortgage Disclosure Act (HMDA) aggregate lending data

Preprocessing

    • This data contains no major quality issues, so no preprocessing was required.
    • The data was randomly divided into training and validation sets in a 70:30 ratio.

Data Shape

    • Training data rows = 112,253, columns = 23
    • Validation data rows = 48,085, columns = 23

Attributes selected to remediate our model for fairness regarding demographic information.

    • black: Binary numeric input, whether the borrower is Black (1) or not (0).
    • asian: Binary numeric input, whether the borrower is Asian (1) or not (0).
    • white: Binary numeric input, whether the borrower is White (1) or not (0).
    • male: Binary numeric input, whether the borrower is male (1) or not (0).
    • female: Binary numeric input, whether the borrower is female (1) or not (0).

Attributes selected to fit our model, chosen because they best explain the relationship between the independent variables and the target variable.

    • term 360: Binary numeric input, whether the mortgage is a standard 360 month mortgage (1) or a different type of mortgage (0).
    • conforming: Binary numeric input, whether the mortgage conforms to normal standards (1), or whether the loan is different (0), e.g., jumbo, HELOC, reverse mortgage, etc.
    • debt to income ratio missing: Binary numeric input, missing marker (1) for debt to income ratio std.
    • loan amount std: Numeric input, standardized amount of the mortgage for applicants.
    • loan to value ratio std: Numeric input, ratio of the mortgage size to the value of the property for mortgage applicants.
    • no intro rate period std: Binary numeric input, whether or not a mortgage does not include an introductory rate period.
    • intro rate period std: Numeric input, standardized introductory rate period for mortgage applicants.
    • property value std: Numeric input, value of the mortgaged property.
    • income std: Numeric input, standardized income for mortgage applicants.
    • debt to income ratio std: Numeric input, standardized debt-to-income ratio for mortgage applicants.

Attributes engineered for residual analysis.

    • phat: Numeric input, prediction probabilities of high-priced mortgage for mortgage applicants.
    • r: Numeric input, log loss residuals for the predicted probabilities.

Attribute for the target variable.

    • high priced: Binary target, whether (1) or not (0) the annual percentage rate (APR) charged for a mortgage is 150 basis points (1.5%) or more above a survey-based estimate of similar mortgages. (High-priced mortgages are legal, but somewhat punitive to borrowers. High-priced mortgages often fall on the shoulders of minority home owners, and are one of many issues that perpetuates a massive disparity in overall wealth between different demographic groups in the US.)

Attributes that were not used in our approaches.

    • row id: Numeric input, value that uniquely identifies a row in a table.
    • amind: Binary numeric input, whether the borrower is American Indian (1) or not (0).
    • hipac: Binary numeric input, whether the borrower is Native Hawaiian or Other Pacific Islander (1) or not (0).
    • hispanic: Binary numeric input, whether the borrower is Hispanic (1) or not (0).
    • non hispanic: Binary numeric input, whether the borrower is Non-Hispanic (1) or not (0).
    • agegte62: Binary numeric input, whether the borrower’s age is 62 or over (1) or not (0).
    • agelt62: Binary numeric input, whether the borrower’s age is under 62 (1) or not (0).

Link: hmda_train_preprocessed.zip

Evaluation Data

Datasets

    • Home Mortgage Disclosure Act (HMDA) aggregate lending data

Data Shape

    • Test data rows = 19,832, columns = 22

All Test Data Columns

    • All the columns are the same as in the training and validation data, except that the target variable (high-priced) does not exist in the test data.

Link: hmda_test_preprocessed.zip

Quantitative Analysis

Unitary results:

    • Our best remediated EBM model produced an AUC of 0.8097 after employing several post-processing techniques, such as removing outliers and performing sensitivity analysis under economic recession conditions. This AUC was achieved while ensuring a minimum Adverse Impact Ratio (AIR) of 0.8.
    • Best training/validation AUC (pre-remediation): 0.8247

Intersectional results:

    • Among the models explored (EBM, Ensemble, GBM, MGBM, and GLM), we found that the EBM model produced the greatest fidelity to the true outcomes, while maintaining the highest standards of fairness. We compared not only the AUC results to evaluate the models independently but also cross-validated over a number of evaluation metrics such as ACC, AUC, Log Loss, F1, and MSE. Once we determined the superiority of the EBM class model, we selected it as the best model and continued on to remediation techniques.

AUC (pre-remediation) of other alternative models:

    • Ensemble: 0.8195
    • Gradient Boosting Machine (GBM): 0.8183
    • Monotonic Gradient Boosting Machine (MGBM): 0.8021
    • Penalized Generalized Linear Model (GLM): 0.7628
Partial Dependence Plots:

The partial dependence plot (short PDP or PD plot) shows the “marginal effect one or two features have on the predicted outcome of a machine learning model” (J. H. Friedman 2001).

Global Model Variable Importance:

Global variable importance values give an indication of the magnitude of a variable’s contribution to model predictions for all of the data.
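
Both the per-feature shape functions that play the role of partial dependence plots and the global variable importances come directly out of InterpretML; a minimal sketch, continuing the fitted ebm from the sketch above:

    from interpret import show

    # Global explanation: per-feature shape functions plus overall importances
    global_explanation = ebm.explain_global(name="EBM global explanation")
    show(global_explanation)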

Ethical Considerations

  • Although we use the 4/5ths rule, one should aim for full parity where possible in a machine learning model (i.e. 1 to 1 parity in classification)
  • Pre-processing remediation techniques should be scrutinized for potential legal issues (e.g. manipulating data with racial class could constitute affirmative action)
  • Failure to perform bias testing and remediation of machine learning models can lead to discrimination, which can become self-reinforcing over time
  • Our best model underperformed markedly when exposed to economic conditions mimicking a recession, which demonstrates that even the most carefully scrutinized training data can be undermined by shifting real-world conditions
  • This model card does not constitute legal or compliance advice
  • Further exploration is warranted for our models, but we provide a baseline here
  • Additional Reading

All models are wrong, but some are useful – George E. P. Box
