Extract Financial Data Tables from a PDF with Python
https://tivon.io/projects/extract-tables-from-pdf/ (Sun, 15 Oct 2023)

Project Type

Data Scraping, Data Wrangling

Software/Tools/Libraries

Python, Camelot, PyPDF2, Pandas, Matplotlib, NumPy

Authors

Tivon Johnson

Overview

PDF stands for “Portable Document Format.” It is the third most popular file format on the web (after HTML and XHTML). There are trillions of PDF files worldwide. Businesses and government agencies widely use PDFs to distribute information and collect data electronically. While it has become an essential part of business communication in the digital age, it is not necessarily a good format for working with data. Extracting tables from PDF files is a common need for businesses and researchers. It allows them to analyze and report on the data more effectively.

Extracting text from a PDF can be as simple as copying and pasting the content. Extracting tabular data, however, is rarely that straightforward: table structure is seldom preserved, and rows and columns get distorted. Fortunately, there are better methods. Python, for example, has a variety of libraries available for extracting tables from PDF files. This project demonstrates how to extract tables from a PDF using the Camelot Python library.

Data Source

Annual City Budget data for Augusta, Georgia published as downloadable PDF files at www.augustaga.gov.

Process Outline
  1. Install Python Libraries
  2. Extract Tables
  3. Pandas Dataframe
  4. Data Validation
  5. Save as CSV File
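
Putting the outline into code, the following is a minimal sketch of steps 2 through 5, assuming the budget PDF has been downloaded locally (the filename and page number are placeholders):

    # Minimal sketch: extract a table, inspect it, validate it, save it.
    import camelot
    from PyPDF2 import PdfFileReader  # PyPDF2 < 3.0, as pinned below

    # Step 2: extract tables ("lattice" suits ruled financial tables)
    with open("augusta_budget.pdf", "rb") as f:
        print("pages:", PdfFileReader(f).getNumPages())
    tables = camelot.read_pdf("augusta_budget.pdf", pages="1", flavor="lattice")

    # Step 3: each extracted table exposes a pandas DataFrame
    df = tables[0].df

    # Step 4: check extraction quality before trusting the data
    print(tables[0].parsing_report)  # accuracy, whitespace, order, page

    # Step 5: save as a CSV file
    tables[0].to_csv("summary_by_fund.csv")
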
Python Libraries
  • Camelot – Used for extracting tables
    • Works with text-based files and tables only
    • Required dependencies: Tkinter, Ghostscript
  • PyPDF2 – Different types of PDF operations
    • Auto-installed with Camelot
    • Camelot pulls in v3.0 by default, which raises a deprecation error
    • Solution: pip install "PyPDF2<3.0"
  • Pandas – Data manipulation and analysis
  • Matplotlib – Visualization and plots
  • NumPy – Scientific computing
Tables
  • Summary by Fund PDF (screenshot)
  • Detail Revenue PDF (screenshot)
Pandas Dataframes
  • Summary by Fund Dataframe (screenshot)
  • Detail Revenue Dataframe (screenshot)
CSV Files
  • Summary by Fund CSV (screenshot)
  • Detail Revenue CSV (screenshot)

Web Scraping & Image Classification of Maritime Vessels in Open Seas
https://tivon.io/projects/travel-app-design-creativity-application/ (Sun, 15 May 2022)

Utilizing open-source information to derive insights into commercial maritime activities. The objective of this project is to create a web scraping and machine learning application to automate the process of obtaining and identifying images of ships from social media platforms by geolocation.

Model Type

Neural Network

Software/Tools/Libraries
Python, Pandas, NumPy, Tweepy, PyTorch, AWS SageMaker, SQL, Twitter API, Instagram API
Authors
Tivon Johnson, Claudio Escudero, Himanvesh Maddina, Zeeshan Raza, James Worrall, Yiiang Xu

Overview

The goal of this project is to build a data mining pipeline that uses computer vision to automate the search for open-source photos of commercial maritime vessels in an area of interest. The application applies web scraping and machine learning to obtain geolocated images from social media and to detect and identify the ships within them. It pulls posts from Twitter and other open data sources, applies computer vision models to find Tweets that contain images of ships, and presents the results in a ranked view based on confidence score.

Project Goals
  • Build a data mining pipeline that allows an analyst to search for open-source photos in an area
    of interest
  • Apply computer vision to detect ships within the photos
  • Apply advanced analytics to determine which of the photos is most likely to be the ship the
    analyst is looking for
Requirements
  • Web scraping required to scrape data from Twitter/Instagram
  • Building a machine learning model based on object detection and hashtags
  • Utilization of Optical Character Recognition (OCR) software
  • Pre-trained models to categorize and label images
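
The following is a minimal sketch of the pipeline these requirements describe, assuming Twitter API v1.1 credentials and a pretrained COCO object detector standing in for the project's trained model (the keys, query, and geocode are placeholders):

    # Minimal sketch: pull geotagged tweets with media and rank their
    # photos by the detector's confidence that they contain a boat.
    import requests
    import torch
    import tweepy
    from io import BytesIO
    from PIL import Image
    from torchvision.models.detection import (
        FasterRCNN_ResNet50_FPN_Weights, fasterrcnn_resnet50_fpn)
    from torchvision.transforms.functional import to_tensor

    auth = tweepy.OAuth1UserHandler("KEY", "SECRET", "TOKEN", "TOKEN_SECRET")
    api = tweepy.API(auth)

    # Search recent tweets with media near a port (lat,long,radius)
    tweets = api.search_tweets(q="ship filter:media",
                               geocode="33.75,-118.27,10km", count=50)

    model = fasterrcnn_resnet50_fpn(
        weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()
    BOAT = 9  # "boat" class index in the COCO label map

    ranked = []
    for t in tweets:
        for media in t.entities.get("media", []):
            img = Image.open(BytesIO(
                requests.get(media["media_url_https"]).content)).convert("RGB")
            with torch.no_grad():
                out = model([to_tensor(img)])[0]
            scores = out["scores"][out["labels"] == BOAT]
            if len(scores):
                ranked.append((float(scores.max()), t.id,
                               media["media_url_https"]))

    ranked.sort(reverse=True)  # highest-confidence ship photos first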

Use a Decision Tree to Determine Eligibility for Credit Line Increase
https://tivon.io/projects/workout-website-design-and-development/ (Tue, 10 Aug 2021)

A decision tree is a classification technique that models decisions in a hierarchical structure resembling the branches of a tree. Decision tree algorithms are widely used in machine learning for predictive analytics. This project provides an example of a default decision tree classifier to determine eligibility for a credit limit increase.

Model Type
Decision Tree
Software/Tools/Libraries

Python, scikit-learn, NumPy, Pandas, Matplotlib, Google Colab

Authors
Tivon Johnson, Harsharan Gorli, Andrew Levy, Yingying Liu


Model Details

Basic Information
  • Model date: August 2021
  • Model version: 1.0
  • Model type: Decision Tree
  • Columns used as inputs in the final model: LIMIT_BAL, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6
  • Column(s) used as target(s) in the final model: DELINQ_NEXT
  • Software used to implement the model: Python 3.6+, Google Colab
  • Version of the modeling software: v0.2.5
  • Hyperparameters or other settings of your model: max_depth = 12
  • License: Apache 2.0 License
  • Model implementation code: Credit_Line_Increase.ipynb
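
A minimal sketch of the classifier these settings describe, assuming the course dataset is available locally as a CSV with the column names from the data dictionary below (the filename is hypothetical):

    # Minimal sketch: 50/25/25 split, max_depth=12 tree, AUC report.
    import pandas as pd
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    inputs = (["LIMIT_BAL"] + [f"PAY_{i}" for i in (0, 2, 3, 4, 5, 6)]
              + [f"BILL_AMT{i}" for i in range(1, 7)]
              + [f"PAY_AMT{i}" for i in range(1, 7)])
    target = "DELINQ_NEXT"

    data = pd.read_csv("credit_line_increase.csv")  # hypothetical filename

    # 50% training, 25% validation, 25% test, as described below
    train, other = train_test_split(data, train_size=0.5, random_state=12345)
    valid, test = train_test_split(other, train_size=0.5, random_state=12345)

    clf = DecisionTreeClassifier(max_depth=12, random_state=12345)
    clf.fit(train[inputs], train[target])

    for name, part in [("Training", train), ("Validation", valid), ("Test", test)]:
        auc = roc_auc_score(part[target], clf.predict_proba(part[inputs])[:, 1])
        print(f"{name} AUC: {auc:.2f}")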

Intended Use
  • Primary intended uses: This model is an example probability of default classifier, with an example use case for determining eligibility for a credit line increase.
  • Primary intended users: Professors and students in the GWU DNSC 6301 bootcamp.
  • Out-of-scope use cases: This model is for educational purposes and not intended to evaluate real-world credit worthiness.

Training Data
  • Data dictionary:

Name | Modeling Role | Measurement Level | Description
ID | ID | int | unique row identifier
LIMIT_BAL | input | float | amount of previously awarded credit
SEX | demographic information | int | 1 = male; 2 = female
RACE | demographic information | int | 1 = hispanic; 2 = black; 3 = white; 4 = asian
EDUCATION | demographic information | int | 1 = graduate school; 2 = university; 3 = high school; 4 = others
MARRIAGE | demographic information | int | 1 = married; 2 = single; 3 = others
AGE | demographic information | int | age in years
PAY_0, PAY_2 – PAY_6 | inputs | int | history of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; …; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; …; 8 = payment delay for eight months; 9 = payment delay for nine months and above
BILL_AMT1 – BILL_AMT6 | inputs | float | amount of bill statement; BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; …; BILL_AMT6 = amount of bill statement in April, 2005
PAY_AMT1 – PAY_AMT6 | inputs | float | amount of previous payment; PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; …; PAY_AMT6 = amount paid in April, 2005
DELINQ_NEXT | target | int | whether a customer's next payment is delinquent (late); 1 = late, 0 = on-time

  • Source of training data: GWU Blackboard, email jphall@gwu.edu for more information
  • How training data was divided into training and validation data: 50% training, 25% validation, 25% test
  • Number of rows in training and validation data:
    • Training rows: 15,000
    • Validation rows: 7,500

Test Data
  • Source of test data: GWU Blackboard, email jphall@gwu.edu for more information
  • Number of rows in test data: 7,500
  • State any differences in columns between training and test data: None

Quantitative Analysis
  • Metrics used to evaluate the model and final figures:
    • Training AUC: 0.78
    • Validation AUC: 0.75
    • Test AUC: 0.74
    • Asian-to-White AIR: 1.00
    • Black-to-White AIR: 0.85
    • Female-to-Male AIR: 1.02
    • Hispanic-to-White AIR: 0.83
  • Iteration plot of the final model (inclusive of Training AUC, Validation AUC, and Hispanic-to-White AIR):
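
The AIR figures above are adverse impact ratios: the rate of favorable outcomes for a protected group divided by the rate for a reference group. A minimal sketch of the calculation, assuming model predictions and the RACE coding from the data dictionary (the function and variable names are illustrative):

    import numpy as np

    def air(y_pred, group, protected, reference):
        # Favorable outcome here: predicted on-time payment (DELINQ_NEXT = 0)
        favorable = (np.asarray(y_pred) == 0)
        group = np.asarray(group)
        return (favorable[group == protected].mean()
                / favorable[group == reference].mean())

    # e.g., Black-to-White AIR with RACE coded 2 = black, 3 = white
    # air(test_preds, test["RACE"], protected=2, reference=3)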

Ethical Considerations
  • Potential negative impacts of the model:
    • The model can lead to discrimination. While the model may be accurate, accuracy does not always imply that the model is unbiased. Many of the factors that can lead to delinquency can unfortunately be linked to race or gender. Bias testing was implemented in order to mitigate any potential discrimination.
    • According to the variable importance chart, the most recent payment is the primary factor the decision tree splits on. As we have seen in the pandemic, the people most affected by economic disruptions are lower-income individuals, who would be most in need of increased credit lines.
  • Potential uncertainties relating to the impact of using the model:
    • One uncertainty is off-label use of the model. While the group intends the model to be used specifically for extending a credit line, other groups could potentially use it in instances where it has not been tested.
    • Another uncertainty is the accuracy of the data itself. Over time, the data can become dated, leading to inaccurate results.
  • Other unexpected results:
    • The model still has bias, and it is hard to fix it based on the dataset.

Predicting Mortgage Rates with Interpretable Machine Learning
https://tivon.io/projects/interpretable-machine-learning/ (Tue, 15 Jun 2021)

In machine learning, a "black box" model is one whose internal mechanisms are difficult to understand, making it hard to explain how it arrives at its predictions. This lack of transparency can raise ethical concerns regarding accountability and fairness in automated decision-making.

"Interpretable Machine Learning" (IML) refers to methods used to make machine learning models more transparent and comprehensible.

Model Type

Explainable Boosting Machine (EBM)

Software/Tools/Libraries

Python, InterpretML, XGBoost, NumPy, Pandas, Matplotlib, Seaborn, h2o

Authors

Tivon Johnson, Minhye Kim, Zach Vila, Qunzhe Ding

Model Details

Our group developed interpretable machine learning models as part of our semester project in the Responsible Machine Learning class taught by Professor Hall during the Summer 2021 semester. We used Home Mortgage Disclosure Act (HMDA) historic mortgage reporting data to predict the probability of applicants being charged a higher rate for their mortgages. To address growing concerns about black-box machine learning models being deployed in high-impact social domains without sufficient contemplation of and precaution against adverse effects, we demonstrate available techniques for interpreting and explaining predictive models in order to prevent unjust discrimination, improve security, and encourage ethical decision-making.

Basic Information
  • Model Date: June 2021
  • Model Version: 1.0
  • Model Type: Explainable Boosting Machine (EBM)
  • Software: Python 3.6+, InterpretML v0.2.5
  • Hyperparameters: {'max_bins': 512, 'max_interaction_bins': 32, 'interactions': 15, 'outer_bags': 10, 'inner_bags': 4, 'learning_rate': 0.01, 'validation_size': 0.4, 'min_samples_leaf': 1, 'max_leaves': 3, 'early_stopping_rounds': 100.0, 'n_jobs': 4, 'random_state': 12345}
  • Columns used as inputs: ['intro_rate_period_std', 'no_intro_rate_period_std', 'debt_to_income_ratio_missing', 'property_value_std', 'income_std', 'debt_to_income_ratio_std']
  • Column used as target: 'high-priced'
  • Paper or other resource for more information
  • License: Apache 2.0
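
A minimal sketch of fitting the EBM described above, assuming interpret v0.2.5 (whose constructor accepts the hyperparameters listed) and the preprocessed HMDA training file linked below (the CSV filename and the exact target header are assumptions):

    import pandas as pd
    from interpret.glassbox import ExplainableBoostingClassifier

    inputs = ["intro_rate_period_std", "no_intro_rate_period_std",
              "debt_to_income_ratio_missing", "property_value_std",
              "income_std", "debt_to_income_ratio_std"]
    target = "high-priced"  # adjust to the actual column header in the file

    data = pd.read_csv("hmda_train_preprocessed.csv")  # hypothetical filename

    # Hyperparameters as listed in Basic Information above
    ebm = ExplainableBoostingClassifier(
        max_bins=512, max_interaction_bins=32, interactions=15,
        outer_bags=10, inner_bags=4, learning_rate=0.01,
        validation_size=0.4, min_samples_leaf=1, max_leaves=3,
        early_stopping_rounds=100, n_jobs=4, random_state=12345)
    ebm.fit(data[inputs], data[target])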

Intended Use

Primary intended uses

    • The primary intended use is to provide an interpretable machine learning model that helps explain predictions as opposed to black box models which provide little, if any, explanation. Such transparency may help prevent bias and discrimination that can occur with black-box models as it relates to applicants with higher mortgage rates.
    • Our project goal is to determine whether the Annual Percentage Rate (APR) charged for a mortgage is high-priced, which we consider one of many issues that perpetuate a massive disparity in overall wealth between different demographic groups in the US. As a result, demographic factors such as race and sex were considered. We discovered that Black applicants are more likely to receive high-priced mortgages.

Primary intended users

    • The primary intended users of this model are professors, students, and researchers of interpretable machine learning models.

Out-of-scope use cases

    • This model is for educational purposes and not intended to evaluate real-world credit worthiness.

Metrics

Model performance measures

    • In our project, we chose Area Under the Curve (AUC) as the evaluation metric for the model: in machine learning, AUC is one of the most important metrics for evaluating a classification model's performance.

Decision thresholds

    • Typically, an excellent model has an AUC near 1 and a poor model has an AUC near 0; an AUC of 0.5 means the model has no class-separation capacity. In our project we did not set a decision threshold on the AUC; instead, we selected the model with the highest AUC (the EBM) as our best model: 0.8247 pre-remediation and 0.8097 post-remediation.
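
Continuing the fitting sketch above, validation AUC can be computed along these lines (the 70/30 split mirrors the Variation Approaches item below):

    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    train, valid = train_test_split(data, train_size=0.7, random_state=12345)
    ebm.fit(train[inputs], train[target])
    auc = roc_auc_score(valid[target], ebm.predict_proba(valid[inputs])[:, 1])
    print(f"Validation AUC: {auc:.4f}")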

Variation approaches

    • Our random grid search for the EBM ran through 500 iterations.
    • We split our data 70%/30% into training and validation sets.

Training Data

Datasets

    • Home Mortgage Disclosure Act (HMDA) aggregate lending data

Preprocessing

    • This data contains no major quality issues, so no preprocessing was required.
    • The data was randomly divided into training and validation sets in a 70:30 ratio.

Data Shape

    • Training data rows = 112,253, columns = 23
    • Validation data rows = 48,085, columns = 23

Attributes selected to remediate our model for fairness regarding demographic information.

    • black: Binary numeric input, whether the borrower is Black (1) or not (0).
    • asian: Binary numeric input, whether the borrower is Asian (1) or not (0).
    • white: Binary numeric input, whether the borrower is White (1) or not (0).
    • male: Binary numeric input, whether the borrower is male (1) or not (0).
    • female: Binary numeric input, whether the borrower is female (1) or not (0).

Attributes selected to fit our model, chosen because they best explain the relationship between the independent variables and the target variable.

    • term 360: Binary numeric input, whether the mortgage is a standard 360 month mortgage (1) or a different type of mortgage (0).
    • conforming: Binary numeric input, whether the mortgage conforms to normal standards (1), or whether the loan is different (0), e.g., jumbo, HELOC, reverse mortgage, etc.
    • debt to income ratio missing: Binary numeric input, missing marker (1) for debt to income ratio std.
    • loan amount std: Numeric input, standardized amount of the mortgage for applicants.
    • loan to value ratio std: Numeric input, ratio of the mortgage size to the value of the property for mortgage applicants.
    • no intro rate period std: Binary numeric input, whether or not a mortgage does not include an introductory rate period.
    • intro rate period std: Numeric input, standardized introductory rate period for mortgage applicants.
    • property value std: Numeric input, value of the mortgaged property.
    • income std: Numeric input, standardized income for mortgage applicants.
    • debt to income ratio std: Numeric input, standardized debt-to-income ratio for mortgage applicants.

Attributes engineered for residual analysis.

    • phat: Numeric input, prediction probabilities of high-priced mortgage for mortgage applicants.
    • r: Numeric input, log loss residuals for the predicted probabilities.

Attribute for the target variable.

    • high priced: Binary target, whether (1) or not (0) the annual percentage rate (APR) charged for a mortgage is 150 basis points (1.5%) or more above a survey-based estimate of similar mortgages. (High-priced mortgages are legal, but somewhat punitive to borrowers. High-priced mortgages often fall on the shoulders of minority home owners, and are one of many issues that perpetuates a massive disparity in overall wealth between different demographic groups in the US.)

Attributes that were not used in our approaches.

    • row id: Numeric input, value that uniquely identifies a row in a table.
    • amind: Binary numeric input, whether the borrower is American Indian (1) or not (0).
    • hipac: Binary numeric input, whether the borrower is Native Hawaiian or Other Pacific Islander (1) or not (0).
    • hispanic: Binary numeric input, whether the borrower is Hispanic (1) or not (0).
    • non hispanic: Binary numeric input, whether the borrower is Non-Hispanic (1) or not (0).
    • agegte62: Binary numeric input, whether the borrower’s age is 62 or over (1) or not (0).
    • agelt62: Binary numeric input, whether the borrower’s age is under 62 (1) or not (0).

Link: hmda_train_preprocessed.zip

Evaluation Data

Datasets

    • Home Mortgage Disclosure Act (HMDA) aggregate lending data

Data Shape

    • Test data rows = 19,832, columns = 22

All Test Data Columns

    • All the columns are the same as in the training and validation data, except that the target variable (high-priced) does not exist in the test data.

Link: hmda_test_preprocessed.zip

Quantitative Analysis

Unitary results:

    • Our best remediated EBM model produced an AUC of 0.8097 after employing several post-processing techniques, such as removing outliers and performing sensitivity analysis under economic recession conditions. This AUC was achieved while ensuring a minimum Adverse Impact Ratio (AIR) of 0.8.
    • Best training/validation AUC (pre-remediation): 0.8247

Intersectional results:

    • Among the models explored (EBM, Ensemble, GBM, MGBM, and GLM), we found that the EBM model produced the greatest fidelity to the true outcomes, while maintaining the highest standards of fairness. We compared not only the AUC results to evaluate the models independently but also cross-validated over a number of evaluation metrics such as ACC, AUC, Log Loss, F1, and MSE. Once we determined the superiority of the EBM class model, we selected it as the best model and continued on to remediation techniques.

AUC (pre-remediation) of other alternative models:

    • Ensemble: 0.8195
    • Gradient Boosting Machine (GBM): 0.8183
    • Monotonic Gradient Boosting Machine (MGBM): 0.8021
    • Penalized Generalized Linear Model (GLM): 0.7628
Partial Dependence Plots:

The partial dependence plot (short PDP or PD plot) shows the “marginal effect one or two features have on the predicted outcome of a machine learning model” (J. H. Friedman 2001).

Global Model Variable Importance:

Global variable importance values give an indication of the magnitude of a variable’s contribution to model predictions for all of the data.
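
Both the per-feature shape functions that play the role of partial dependence plots and the global variable importances come directly out of InterpretML; a minimal sketch, continuing the fitted ebm from the sketch above:

    from interpret import show

    # Global explanation: per-feature shape functions plus overall importances
    global_explanation = ebm.explain_global(name="EBM global explanation")
    show(global_explanation)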

Ethical Considerations

  • Although we use the 4/5ths rule, one should aim for full parity where possible in a machine learning model (i.e. 1 to 1 parity in classification)
  • Pre-processing remediation techniques should be scrutinized for potential legal issues (e.g. manipulating data with racial class could constitute affirmative action)
  • Failure to perform bias testing and remediation of machine learning models can lead to discrimination, which can become self-reinforcing over time
  • Our best model underperformed markedly when exposed to economic conditions mimicking a recession, which demonstrates that even the most carefully scrutinized training data can be undermined by shifting real-world conditions
  • This model card does not constitute legal or compliance advice
  • Further exploration is warranted for our models, but we provide a baseline here
  • Additional Reading

All models are wrong, but some are useful – George E. P. Box
