Predicting the Outcome of Home Loan Applications using Logistic Regression in R

Martin Calvino
12 min read · May 11, 2023


Will Bank of America accept your application for a home loan?

Image created by Martin Calvino using DCGAN algorithm

Imagine you are a first-time home buyer and you have submitted a home loan application to the bank where you are already a client. The bank denies you the loan, precluding you from having the money to buy the house you like. You decide to apply for a home loan at a different bank, but you don't know which one. You may choose a series of banks based on the previous experiences of relatives, friends, and/or colleagues, assuming that if they got a loan you may also get one. Prospecting bank after bank until you get a home loan is usually a frustrating and time-consuming process. Wouldn't a better approach be to pre-select the banks with the highest chances of granting you a home loan, based on some sort of educated guess that can point you in the right direction? Wouldn't it be nice if you could predict the outcome of your home loan application to a particular bank before you even apply? Welcome Logistic Regression, a supervised machine learning methodology that can be used to predict the outcome of your home loan application for any particular bank of interest.

In this work I focused my attention on fitting a logistic regression model to nationwide home mortgage data in order to predict the outcome of home loan applications to Bank of America. I was particularly interested in exploring the relationship between the ethnicity of the applicant and the outcome of the loan (accepted or denied).

LOGISTIC REGRESSION

The difference between linear and logistic regression is that the response variable is continuous for linear regression and categorical for logistic regression. Oftentimes the response variable in logistic regression is binary, meaning observations belong to one of two classes (for instance 0 or 1, TRUE or FALSE, Yes or No, and Accepted or Denied, to name a few examples). For this reason logistic regression is used as a classification algorithm; it is based on the logistic function, which calculates the probability that an observation belongs to one of the two classes and assigns the observation to the class with the highest probability. The logistic function is an S-shaped curve that maps a continuous variable onto values between 0 and 1.
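To make this concrete, here is a minimal sketch (illustrative only, not part of the analysis below) of the logistic function in R; base R's plogis() computes the same curve:

# the logistic (sigmoid) function maps any real number onto the interval (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

x <- seq(-6, 6, by = 0.1)
plot(x, logistic(x), type = "l", ylab = "probability") # the S-shaped curve

# base R's plogis() implements the same function
all.equal(logistic(x), plogis(x)) # TRUE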

Binomial logistic regression is commonly used for two-class classification problems, whereas multinomial logistic regression is used in classification problems with three or more classes.

A nice characteristic of logistic regression models is their interpretability: each predictor variable in the model gets an estimate that measures how the value of that variable affects the probability that a case belongs to one class over the other. Probabilities are converted to log odds (the output of logistic regression models) because the relationship between the odds and the explanatory variable is not linear; by taking the natural logarithm of the odds, a linear relationship is obtained between the log odds and the explanatory variable.

The odds of a home loan application being accepted by Bank of America would be:

odds = probability of being accepted / probability of being denied

odds = p / (1-p)

Taking the natural logarithm of the odds gives us the log odds:

log odds = ln(p / (1 - p)), an equation known as the logit function

And log odds can be interpreted in the following manner: [A] a positive value means the outcome of a home loan application is more likely to be accepted than to be denied; [B] a negative value means the outcome of a home loan application is less likely to be accepted than to be denied; and [C] log odds of 0 means the outcome of a home loan application is as likely to be accepted as to be denied.

The linear relationship between log odds and an explanatory variable is:

log odds = y_intercept + slope * explanatory_variable

meaning that when there are several explanatory variables, their individual contributions to the log odds can be added to obtain the overall log odds of a home loan being accepted by Bank of America.
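As an illustration of this additivity (a minimal sketch with made-up coefficient values, not taken from the fitted model below):

# hypothetical intercept and slope, for illustration only
intercept <- -1.5
slope_income <- 0.00002 # log-odds increase per dollar of income

log_odds <- intercept + slope_income * 90000 # 0.3 for a $90,000 income
odds <- exp(log_odds) # ~1.35
probability <- odds / (1 + odds) # ~0.57; equivalently, plogis(log_odds)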

IMPLEMENTATION

I took advantage of the Home Mortgage Disclosure Act (HMDA) dataset available at the Consumer Financial Protection Bureau (CFPB) and downloaded nationwide home mortgage data for Bank of America for the years 2018, 2019, 2020 and 2021. Altogether my starting dataset comprised 1,671,302 observations and 99 variables, each observation being a home loan application, with the variables shown in Figure 1.

# load libraries
library(tidyverse)
library(psych)
library(gplots)
library(vcd)
library(graphics)
library(yardstick)

# load datasets
# Bank of America as 'boa' for years 2018, 2019, 2020 and 2021 respectively
boa18 <- read.csv(file.choose())
boa19 <- read.csv(file.choose())
boa20 <- read.csv(file.choose())
boa21 <- read.csv(file.choose()) # 368,728 observations x 99 variables

# create one big data frame with data from the four years
train.boa <- rbind(boa18, boa19, boa20, boa21) # 1,671,302 observations x 99 variables

colnames(train.boa)
Figure 1. The 99 variables comprising the home mortgage dataset from Bank of America. Documentation for each variable is found at the following weblink: https://ffiec.cfpb.gov/documentation/2018/lar-data-fields/

I was interested in predicting the outcome of home loan applications based on income, loan amount, down payment (defined as 100 - loan_to_value_ratio), the state in which the application was submitted/received, and the ethnicity of the loan applicant.

The loan-to-value (LTV) ratio of a home is obtained by dividing the amount of the mortgage by the appraised value of the property. The resulting ratio represents the percentage of the home's value that is being financed through a mortgage.

LTV ratio = (Amount of mortgage / Appraised value of property) x 100%

For example, if you are financing a home with a mortgage of $200,000, and the appraised value of the property is $250,000, the LTV ratio would be:

LTV ratio = ($200,000 / $250,000) x 100% = 80%

This means that you are financing 80% of the value of the property through your mortgage, and you will need to provide a down payment of 20% or more to purchase the property. Thus, the down payment can be indirectly calculated in my dataset as 100 minus LTV, and it was added as a new variable.
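As a quick check of the arithmetic above, here is the same calculation in R (illustrative only):

# LTV ratio and down payment for the worked example above
mortgage <- 200000
appraised_value <- 250000
ltv <- (mortgage / appraised_value) * 100 # 80 (%)
downpayment <- 100 - ltv # 20 (%)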

# keep home loan applications with loan-to-value ratios of at most 100%
train.boa <- filter(train.boa, loan_to_value_ratio <= 100)

# feature engineering: estimate down payment as 100 - loan_to_value_ratio
# down payment is in percentual points
train.boa <- mutate(train.boa, downpayment = round(100 - loan_to_value_ratio))
head(train.boa[, 100], n = 25) # down payment is now variable number 100

# feature selection: consider action_taken, income, down payment, loan amount, ethnicity and state variables only
# action_taken is a variable denoting the action of the bank on the home loan application (example: originated or denied)
actionTaken <- train.boa$action_taken
income <- train.boa$income
downpayment <- train.boa$downpayment
loanAmount <- train.boa$loan_amount
ethnicity <- train.boa$derived_ethnicity
state <- train.boa$state_code

# build a data frame with income, down payment, loan amount, ethnicity, action taken, and state ('idoet')
train.boa.idoet <- data.frame(income, downpayment, loanAmount, ethnicity, actionTaken, state)
head(train.boa.idoet)

# recode ethnicity variable
train.boa.idoet$ethnicity[train.boa.idoet$ethnicity == "Ethnicity Not Available"] <- NA
head(train.boa.idoet, n = 25)

# how many missing values?
sum(is.na(train.boa.idoet)) # 220,515 NAs
# remove NAs
train.boa.idoet <- na.omit(train.boa.idoet)

# inspect summary statistics
summary(train.boa.idoet)

# income in HMDA data is reported in thousands of dollars; convert to dollars
# and keep loan applications with income > 0
train.boa.idoet$income <- train.boa.idoet$income*1000
train.boa.idoet <- filter(train.boa.idoet, income > 0)

After removing missing values from my dataset, I proceeded to identify and remove outlier observations for the income and loan amount variables. Inspecting summary statistics for the dataset revealed that the median income was $93,000/year, the median down payment was 33%, and the median loan amount was $125,000 (Figure 2).

# identify and remove outliers; boxplot()$out returns points beyond 1.5x the interquartile range
# income
outliers.inc.train.boa.idoet <- boxplot(train.boa.idoet[, 1])$out
train.boa.idoet <- train.boa.idoet[-which(train.boa.idoet[, 1] %in% outliers.inc.train.boa.idoet), ]
boxplot(train.boa.idoet[, c(1:3)])$out
# loan amount
outliers.la.train.boa.idoet <- boxplot(train.boa.idoet[, 3])$out
train.boa.idoet <- train.boa.idoet[-which(train.boa.idoet[, 3] %in% outliers.la.train.boa.idoet), ]
boxplot(train.boa.idoet[, c(1:3)])$out

# inspect summary statistics now that outliers were removed
summary(train.boa.idoet[, 1:3])
Figure 2. Summary statistics for home mortgage dataset from Bank of America after variables of interest were selected, and missing values and outliers were removed from the original dataset; the statistics shown are based on a dataset that comprised 972,223 observations and 6 variables.

I then proceeded to keep home loan applications from Latino or non-Latino applicants, and loan applications that were either accepted or denied. This left me with a dataset of 914,315 observations and 6 variables.

# look at ethnicity variable
levels(factor(train.boa.idoet$ethnicity))

# select Hispanic and Not Hispanic home loan applicants
selected_ethnicities <- c("Hispanic or Latino", "Not Hispanic or Latino")
train.boa.idoet <- filter(train.boa.idoet, ethnicity %in% selected_ethnicities)
levels(factor(train.boa.idoet$ethnicity))

# look at the actionTaken variable
levels(factor(train.boa.idoet$actionTaken))
# select loans that were accepted (actionTaken == 1) or denied (actionTaken == 3)
train.boa.idoet <- filter(train.boa.idoet, actionTaken == 1 | actionTaken == 3)
# recode actionTaken variable
train.boa.idoet$actionTaken[train.boa.idoet$actionTaken == 1] <- "accepted"
train.boa.idoet$actionTaken[train.boa.idoet$actionTaken == 3] <- "denied"
head(train.boa.idoet, n = 25)

str(train.boa.idoet) # 914,315 observations x 6 variables

Subsequently, I explored how the income, loan amount, down payment, state, and ethnicity variables related to the outcome of the home loan applications (the actionTaken variable) to get a better sense of the possible predictive value of each variable (Figure 3) prior to fitting a logistic regression model to the dataset. It was clear that the distribution of values for income and loan amount differed between loan applications that were accepted and those that were denied (Figure 3A). The proportion of home loans that were accepted also differed by state, with the lowest values observed for Florida, Georgia, Hawaii, Rhode Island, and Texas (Figure 3B). Because none of the loan applications in Puerto Rico were accepted, I removed those observations, as they harbor no predictive value. Lastly, the proportion of accepted home loans was lower for Latino applicants compared to non-Latinos (Figure 3C). This is consistent with my previous studies exploring home loan acceptance rates for the Latino population (see references).

# explore the relationship among variables
exrelva <- gather(train.boa.idoet, key = "Variable", value = "Value", -actionTaken)
View(exrelva)

# violin plots
exrelva %>%
  filter(Variable != "ethnicity" & Variable != "state") %>%
  ggplot(aes(actionTaken, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  theme_bw()

# bar plot > in Puerto Rico all home loan applications were denied
exrelva %>%
  filter(Variable == "state") %>%
  ggplot(aes(Value, fill = actionTaken)) +
  geom_bar(position = "fill") +
  theme_bw()

# bar plot of acceptance by ethnicity
exrelva %>%
  filter(Variable == "ethnicity") %>%
  ggplot(aes(Value, fill = actionTaken)) +
  geom_bar(position = "fill") +
  theme_bw()
Figure 3A. Violin plots (with quantiles drawn) displaying the distribution of values for the down payment, income, and loan amount variables in relation to the actionTaken variable. On the y-axis: percentages are shown for down payment; US dollars are shown for income and loan amount.
Figure 3B. Proportion of home loan applications that were accepted (in red) and denied (in green) by Bank of America for each state.
Figure 3C. Proportion of home loan applications that were accepted (in red) and denied (in green) by Bank of America in relation to the ethnicity of the applicant.
# remove home loan applications from Puerto Rico because they contribute no
# predictive value (every application in Puerto Rico was denied)
train.boa.idoet$state[train.boa.idoet$state == "PR"] <- NA
train.boa.idoet <- na.omit(train.boa.idoet)

# code response variable (actionTaken) as 0 and 1 to implement logistic regression
train.boa.idoet$actionTaken[train.boa.idoet$actionTaken == "accepted"] <- 1
train.boa.idoet$actionTaken[train.boa.idoet$actionTaken == "denied"] <- 0
train.boa.idoet$actionTaken <- factor(
  train.boa.idoet$actionTaken,
  levels = c(0, 1),
  labels = c(0, 1)
)

table(train.boa.idoet$actionTaken) # 408,547 denied loans and 505,767 accepted loans

After removing observations from Puerto Rico, my dataset contained 505,767 loans that were accepted and 408,547 loans that were denied. I split the dataset into training and test set, with the training set comprising 70% and the test set comprising the remaining 30%. This allowed me to train the logistic regression model on 640,020 observations while evaluating its performance on 274,294 observations respectively.

# use a train/test split; set a seed so the random shuffle is reproducible
set.seed(2023) # assumed seed, not specified in the original analysis
rows <- sample(nrow(train.boa.idoet))
train.boa.idoet <- train.boa.idoet[rows, ]
# train:test split is 70%:30%
split <- round(nrow(train.boa.idoet) * 0.70)
train <- train.boa.idoet[1:split, ]
test <- train.boa.idoet[(split + 1):nrow(train.boa.idoet), ]
nrow(train) / nrow(train.boa.idoet) # training dataset is 70% of the entire dataset


# fit a logistic regression model to training data using the glm() function
fit.boa <- glm(actionTaken ~ income + downpayment + loanAmount + ethnicity + state,
               data = train, family = binomial())

# inspect model's coefficients
summary(fit.boa)

# make predictions on the test dataset to evaluate model performance
pred.test <- predict(fit.boa, newdata = test, type = "response")
summary(pred.test)

The coefficients of the model fitted to the training set are shown in Figure 4, and suggest that higher income, loan amount, and down payment, as well as not being Latino, increased the likelihood of a home loan being accepted by Bank of America. The contribution of the state variable to the acceptance of home loans varied considerably from state to state.

Figure 4. Screen capture showing the coefficients of the fitted logistic regression model on the training set of home mortgage data for Bank of America.
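Because these coefficients are on the log-odds scale, exponentiating them converts them into odds ratios, which are often easier to communicate. A short sketch (the exact output depends on the fitted model):

# exponentiate coefficients to obtain odds ratios:
# an odds ratio > 1 increases the odds of acceptance, < 1 decreases them
exp(coef(fit.boa))
# Wald confidence intervals on the odds-ratio scale; confint() would profile
# the likelihood, which is slow on a model with ~640,000 observations
exp(confint.default(fit.boa))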

In order to evaluate the performance of the fitted model, I created a confusion matrix and used the yardstick package to compute evaluation metrics such as accuracy (the proportion of correct predictions), sensitivity (the proportion of true positives correctly identified), and specificity (the proportion of true negatives correctly identified) (Figure 5). The model correctly predicted 68.5% of observations (home loan applications) in the test set (accuracy); it correctly predicted 70.2% of accepted home loans (sensitivity) and 66.5% of denied home loans (specificity).

# evaluate model performance
# Confusion Matrix: counts of outcomes
actual_response <- test$actionTaken
predicted_response <- ifelse(pred.test > 0.50, "1", "0")
outcomes <- table(predicted_response, actual_response)
outcomes

# evaluate model using functions from the yardstick package
confusion <- conf_mat(outcomes)
autoplot(confusion)
# model performance metrics
summary(confusion, event_level = "second") # class "1" (accepted) is the positive class

# accuracy is the proportion of correct predictions > 0.685
# sensitivity is the proportion of true positives > 0.702
# specificity is the proportion of true negatives > 0.665
Figure 5A. Table displaying the actual responses of Bank of America to home loan applications (0 is denied, and 1 is accepted) relative to the predicted responses output by the model. True positives: 106,384 observations; True negatives: 81,618 observations; False negatives: 45,262 observations; and False positives: 41,030 observations.
Figure 5B. Evaluation metrics on the fitted model, calculated using the yardstick package.
Figure 5C. Graphic output from yardstick's autoplot function displaying the results of the confusion matrix. Compare this graph to Figure 5A.
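To make these definitions concrete, the three metrics can be recomputed by hand from the counts reported in Figure 5A (a verification sketch; the counts below are copied from the figure):

# counts from Figure 5A
tp <- 106384 # accepted loans predicted as accepted (true positives)
tn <- 81618  # denied loans predicted as denied (true negatives)
fp <- 41030  # denied loans predicted as accepted (false positives)
fn <- 45262  # accepted loans predicted as denied (false negatives)

(tp + tn) / (tp + tn + fp + fn) # accuracy ~0.685
tp / (tp + fn)                  # sensitivity ~0.702
tn / (tn + fp)                  # specificity ~0.665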

Finally, I used the fitted model to make predictions on new data. I selected from Zillow's website a house (Figure 6) located in Phoenix, Arizona (notice in Figure 4 that the log odds for Arizona are positive and significant) and predicted whether someone like me, a Latino who works as a postdoctoral researcher with a fixed fellowship from the National Institutes of Health (NIH) set at $55,000 per year, could get a home loan from Bank of America. Assuming my down payment was 10% and the loan amount requested on my application to Bank of America was $517,500 ($575,000 minus $57,500), the model predicted my application as accepted, with a probability of 0.90. Notice that the probability would be higher (0.93) had my ethnicity been non-Latino.

Figure 6. Screen capture from Zillow’s website highlighting a home priced at $575,000.
# use the model to make predictions on new data
# what is the probability that Bank of America will accept a home loan application from
# a Latino earning $55,000/year with a down payment of 10%, borrowing $517,500 and living in Arizona?
predict(fit.boa,
        newdata = data.frame(income = 55000, downpayment = 10, loanAmount = 517500,
                             ethnicity = "Hispanic or Latino",
                             state = "AZ"), # "AZ" is Arizona's state code ("AR" is Arkansas)
        type = "response")
# answer is: 0.90 >> home loan application accepted
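To quantify the ethnicity gap mentioned above, the same application can be scored under both ethnicity values in a single call (a minimal sketch; only the ethnicity field differs between the two rows):

# score the identical application for a Latino and a non-Latino applicant
new_applications <- data.frame(income = c(55000, 55000),
                               downpayment = c(10, 10),
                               loanAmount = c(517500, 517500),
                               ethnicity = c("Hispanic or Latino", "Not Hispanic or Latino"),
                               state = c("AZ", "AZ"))
predict(fit.boa, newdata = new_applications, type = "response") # ~0.90 vs ~0.93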

CONCLUSION

Here I described the implementation of a logistic regression model trained on home mortgage data from Bank of America (640,020 observations) to predict the outcome (accepted vs. denied) of loan applications based on the applicant's income, down payment, loan amount, state, and ethnicity. The model correctly predicted 68.5% of applications in the test dataset (274,294 observations) and was used to make predictions on new data. The relevance and implications of this work are the following: First, logistic regression models trained on home mortgage data from the biggest banks in the United States can assist prospective home buyers in identifying banks likely to give them a positive outcome, saving them the time and effort of prospecting bank after bank. Second, from a social justice perspective, models trained with ethnicity and race data from applicants can assist minority groups in deciding which banks may extend them a home loan, facilitating home ownership for Latinos and persons of color.

REFERENCES

ABOUT THE AUTHOR

Martin Calvino is a Visiting Professor at Torcuato Di Tella University; a Computational Biologist at The Human Genetics Institute of New Jersey — Rutgers University; and a Multimedia Artist. You can follow him on Instagram: @from.data.to.art

THE CODE FROM THIS WORK as well as the datasets can be accessed from my GitHub page at:
