Predicting Home Loan Amounts from Property Value and Income using Linear Regression in R

Martin Calvino
Apr 21, 2023


Image created by Martin Calvino using DCGAN

Linear Regression is a group of methodologies used to predict a response variable (also called dependent or outcome variable) from one (simple linear regression) or more (multiple linear regression) predictor variables (also called independent or explanatory variables).

In linear regression the outcome variable must be continuous (quantitative), whereas the explanatory variables can be continuous or categorical. Linear regression lets you identify the explanatory variables that relate to an outcome variable, describe the relationship involved, and provide an equation for predicting the outcome variable from the explanatory variables.

Throughout this work I will focus on a class of regression models known as Ordinary Least Squares (OLS) regression, which includes simple and multiple linear regression:

in simple linear regression > you predict a quantitative outcome variable from a quantitative or categorical explanatory variable

in multiple linear regression > you predict a quantitative outcome variable from 2 or more explanatory variables (quantitative and/or categorical)

Linear regression uses the equation of a straight line to make predictions:

y = intercept + slope*x

where y is the outcome variable, x is the explanatory variable, the intercept is the value of y when x=0, and the slope is the amount of change in y when x increases by one unit. The intercept and slope of a linear regression model are known as its parameters (also called coefficients).

In this work I will implement simple and multiple linear regression models, evaluate their fit, test their assumptions and select among alternative models. My objective is to select model parameters (coefficients) that minimize the difference between actual values from the outcome variable and those predicted by the model. The parameters are selected in order to minimize the sum of squared residuals. A residual is the deviation between the real value of the outcome variable and its estimated/predicted value.
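To make the least-squares idea concrete, here is a minimal sketch on simulated data (the variables x and y are made up for illustration; they are not part of the loan dataset):

```r
# lm() picks the intercept and slope that minimize the sum of squared residuals
set.seed(1)
x <- runif(100, 0, 10)           # explanatory variable
y <- 2 + 3 * x + rnorm(100)      # outcome: true intercept 2, true slope 3, plus noise
fit <- lm(y ~ x)
coef(fit)                        # estimated intercept and slope, close to 2 and 3
sum(residuals(fit)^2)            # the quantity OLS minimizes
sum((y - (2.5 + 2.8 * x))^2)     # any other line yields a larger sum of squared residuals
```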

Here I will use linear regression to predict loan_amount (outcome variable) reported during 2021 for Bank of America in New Jersey from property_value and income (explanatory variables), by training a model on home loan applications from 2020 taken from the Home Mortgage Disclosure Act (HMDA) dataset, publicly available on the Consumer Financial Protection Bureau (CFPB) website. I have worked extensively with this dataset in the past to address differences in home loan statistics between Latino and non-Latino applicants in New Jersey and other states (see references).

CODE IMPLEMENTATION IN R

To predict loan_amount from property_value and income, I selected loans accepted by Bank of America in New Jersey during 2020 and 2021; identified and removed missing values (linear regression cannot handle them); identified and removed outliers for the numeric variables of interest; and made sure the ranges of loan_amount, property_value and income were comparable between the dataset used for training (2020) and the dataset used to make predictions on (2021).

# OBJECTIVE
# Predict loan_amount associated to Bank of America home loan applications in New Jersey during 2021 from
# a linear regression model trained with data associated with Bank of America home loan applications in New Jersey during 2020

library(tidyverse)
library(patchwork)

# datasets were downloaded from:
# https://ffiec.cfpb.gov/data-browser/data/2020?category=states&items=NJ&leis=B4TYDEB6GKMZO031MB27

# dataset: home loans from Bank of America (boa) in New Jersey for 2020
boa20 <- read.csv(file.choose()) # 18,855 observations x 99 variables
# dataset: home loans from Bank of America (boa) in New Jersey for 2021
boa21 <- read.csv(file.choose()) # 17,316 observations x 99 variables

# description/documentation of variables can be found at:
# https://ffiec.cfpb.gov/documentation/2018/lar-data-fields/

# take accepted loans only (action_taken == 1)
boa20 <- filter(boa20, action_taken == 1) # 9,348 observations x 99 variables
boa21 <- filter(boa21, action_taken == 1) # 8,297 observations x 99 variables

# select numeric columns
colnames(boa20)
View(boa20)
# numeric.boa.accepted.2020 as nuba20
nuba20 <- select(boa20, loan_amount, income, property_value, loan_to_value_ratio, interest_rate)
# numeric.boa.accepted.2021 as nuba21
nuba21 <- select(boa21, loan_amount, income, property_value, loan_to_value_ratio, interest_rate)

# identify and remove NAs
sum(is.na(nuba20)) # 29 missing values from boa20
nuba20 <- na.omit(nuba20)
sum(is.na(nuba21)) # 16 missing values from boa21
nuba21 <- na.omit(nuba21)

# income is reported in thousands of dollars; multiply by 1000 to convert to dollars
nuba20$income <- nuba20$income*1000
nuba21$income <- nuba21$income*1000

# let's explore some statistics for our variables
summary(nuba20)
summary(nuba21)

# select observations displaying income values bigger than 0
nuba20 <- filter(nuba20, income > 0)
nuba21 <- filter(nuba21, income > 0)

################################################################################
# remove outliers one variable at a time for nuba20
# (logical indexing with !... %in% ... keeps all rows when no outliers are found,
# whereas -which(...) would drop every row in that case)
# loan amount
outliers.la.nuba20 <- boxplot(nuba20[, 1])$out
nuba20 <- nuba20[!nuba20[, 1] %in% outliers.la.nuba20, ]
boxplot(nuba20[, c(1:3)])$out
# income
outliers.inc.nuba20 <- boxplot(nuba20[, 2])$out
nuba20 <- nuba20[!nuba20[, 2] %in% outliers.inc.nuba20, ]
boxplot(nuba20[, c(1:3)])$out
# property value
outliers.pv.nuba20 <- boxplot(nuba20[, 3])$out
nuba20 <- nuba20[!nuba20[, 3] %in% outliers.pv.nuba20, ]
boxplot(nuba20[, c(1:3)])$out
# loan to value ratio
outliers.ltvr.nuba20 <- boxplot(nuba20[, 4])$out
nuba20 <- nuba20[!nuba20[, 4] %in% outliers.ltvr.nuba20, ]
boxplot(nuba20[, 4])$out
# interest rate
outliers.ir.nuba20 <- boxplot(nuba20[, 5])$out
nuba20 <- nuba20[!nuba20[, 5] %in% outliers.ir.nuba20, ]
boxplot(nuba20[, 5])$out

# let's only consider values from 0 to 100 for loan_to_value_ratio
nuba20 <- filter(nuba20, loan_to_value_ratio <= 100)

################################################################################
# remove outliers one variable at a time for nuba21
# (same logical-indexing idiom as for nuba20, which is safe when no outliers are found)
# loan amount
outliers.la.nuba21 <- boxplot(nuba21[, 1])$out
nuba21 <- nuba21[!nuba21[, 1] %in% outliers.la.nuba21, ]
boxplot(nuba21[, c(1:3)])$out
# income
outliers.inc.nuba21 <- boxplot(nuba21[, 2])$out
nuba21 <- nuba21[!nuba21[, 2] %in% outliers.inc.nuba21, ]
boxplot(nuba21[, c(1:3)])$out
# property value
outliers.pv.nuba21 <- boxplot(nuba21[, 3])$out
nuba21 <- nuba21[!nuba21[, 3] %in% outliers.pv.nuba21, ]
boxplot(nuba21[, c(1:3)])$out
# loan to value ratio
outliers.ltvr.nuba21 <- boxplot(nuba21[, 4])$out
nuba21 <- nuba21[!nuba21[, 4] %in% outliers.ltvr.nuba21, ]
boxplot(nuba21[, 4])$out
# interest rate
outliers.ir.nuba21 <- boxplot(nuba21[, 5])$out
nuba21 <- nuba21[!nuba21[, 5] %in% outliers.ir.nuba21, ]
boxplot(nuba21[, 5])$out

# let's only consider values from 0 to 100 for loan_to_value_ratio
nuba21 <- filter(nuba21, loan_to_value_ratio <= 100)

################################################################################
# let's explore again the summary statistics for our variables after outliers were removed
summary(nuba20)
summary(nuba21)

# let's filter nuba21 to have same ranges as nuba20 for loan_amount, property_value, and income
# as to potentially improve model performance later on (model can predict based on what it learned)
nuba21 <- filter(nuba21, loan_amount >= 25000 & loan_amount <=955000)
nuba21 <- filter(nuba21, income >= 13000 & income <= 451000)
nuba21 <- filter(nuba21, property_value >= 45000 & property_value <= 1205000)

summary(nuba20)
summary(nuba21)

I subsequently ran a correlation analysis and created scatter plots to explore the relationship among the variables in my datasets. It seemed property_value, and to a lesser extent income, were the variables that displayed the highest correlation coefficients in relation to loan_amount. Furthermore, the relationship appeared to be linear (Figure 1 and Figure 2).

# let's check the correlation among variables
cor(nuba20)
cor(nuba21)

# use the gather() function on nuba20 to plot each variable against loan_amount to get an idea of the relationship in the data
# all.variables.against.loan.amount as avala20
avala20 <- gather(nuba20, key = "Variable", value = "Value", -loan_amount)
View(avala20)

avala21 <- gather(nuba21, key = "Variable", value = "Value", -loan_amount)
View(avala21)

# plot predictor variables against the outcome variable (loan_amount)
# it seems income and property_value have a linear relationship with loan_amount

# avala20
ggplot(data = avala20, mapping = aes(x = Value, y = loan_amount)) +
facet_wrap(~ Variable, scales = "free_x") +
geom_point(alpha = 0.15) +
geom_smooth() +
geom_smooth(method = "lm", col = "red") +
theme_bw() +
ggtitle("Relationship among variables to loan_amount", "6583 accepted home loan applications: Bank of America - New Jersey - 2020")

# avala21
ggplot(data = avala21, mapping = aes(x = Value, y = loan_amount)) +
facet_wrap(~ Variable, scales = "free_x") +
geom_point(alpha = 0.15) +
geom_smooth() +
geom_smooth(method = "lm", col = "red") +
theme_bw() +
ggtitle("Relationship among variables to loan_amount", "6617 accepted home loan applications: Bank of America - New Jersey - 2021")
Figure 1. Screenshot of console output displaying Pearson correlation coefficients
Figure 2A. Scatter plots displaying GAM (blue line) and linear (red line) fits. Property value and income appear to display a linear relationship with loan amount.
Figure 2B. Scatter plots displaying GAM (blue line) and linear (red line) fits. Property value and income appear to display a linear relationship with loan amount, whereas interest rate and loan_to_value_ratio do not.

Based on the above, I explored the possibility of predicting loan_amount from property_value and income. For this, I trained linear regression models on home loan data from 2020 in order to make predictions on home loan data from 2021. I explored the following 3 models:

model 1 >> lm(loan_amount ~ property_value, data = nuba20)

model 2 >> lm(loan_amount ~ property_value + income, data = nuba20)

model 3 >> lm(loan_amount ~ property_value + income + property_value:income, data = nuba20)
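Since the three models are nested (each adds a term to the previous one), they can also be compared formally with an F-test via anova(), an approach not used in the original analysis. A sketch on simulated data standing in for nuba20, with made-up coefficients:

```r
# nested-model comparison with anova(); data simulated to mimic the loan variables
set.seed(2)
n <- 500
property_value <- runif(n, 50000, 1200000)
income <- runif(n, 20000, 400000)
loan_amount <- 20000 + 0.55 * property_value + 0.3 * income + rnorm(n, sd = 50000)
m1 <- lm(loan_amount ~ property_value)
m2 <- lm(loan_amount ~ property_value + income)
m3 <- lm(loan_amount ~ property_value + income + property_value:income)
anova(m1, m2, m3)  # each row tests whether the added term significantly reduces residual error
```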


# SIMPLE LINEAR REGRESSION >> lm(response_variable ~ predictor_variable, data = dataFrame)
# response/outcome variable: loan_amount
# predictor/explanatory variable: property_value

la.pv.fit <- lm(loan_amount ~ property_value, data = nuba20)
summary(la.pv.fit) # adjusted R-squared: 0.53
# the prediction equation is loan_amount = 17600 + 0.5609 * property_value

# plot residuals model 1
plot.residuals.m1 <- ggplot(data = la.pv.fit) +
geom_histogram(mapping = aes(x = la.pv.fit$residuals)) +
ggtitle("model 1")


# let's predict loan_amounts from 2021
predicted_loan.model.1 <- predict(la.pv.fit, data.frame(property_value = nuba21$property_value))
is.vector(predicted_loan.model.1)
predicted_loan.model.1[1:10]

# let's add predicted loan_amounts as a new column on nuba21 dataframe
nuba21$pred.la.m1 <- predicted_loan.model.1
View(nuba21)
# place the newly created column (named pred.la.m1) next to loan_amount column
nuba21 <- select(nuba21, loan_amount, pred.la.m1, everything())
View(nuba21)

################################################################################

# MULTIPLE LINEAR REGRESSION (additive) >>> lm(response_variable ~ predictor_variable1 + predictor_variable2, data = dataFrame)
# add 'income' as a second predictor variable and see if linear model performs better
la.pv.plus.inc.fit <- lm(loan_amount ~ property_value + income, data = nuba20)
summary(la.pv.plus.inc.fit) # adjusted R-squared 0.54

# predict a single loan_amount from property_value and income
predict(la.pv.plus.inc.fit, data.frame(property_value = 369000, income = 80000))

# plot residuals model 2
plot.residuals.m2 <- ggplot(data = la.pv.plus.inc.fit) +
geom_histogram(mapping = aes(x = la.pv.plus.inc.fit$residuals)) +
ggtitle("model 2")

# let's make predictions on loan amounts for 2021
predicted_loan.model.2 <- predict(la.pv.plus.inc.fit, data.frame(property_value = nuba21$property_value, income = nuba21$income))
is.vector(predicted_loan.model.2)
predicted_loan.model.2[1:10]

# let's add a second new column with predicted loan amounts from model2 to nuba21
nuba21$pred.la.m2 <- predicted_loan.model.2
nuba21 <- select(nuba21, loan_amount, pred.la.m1, pred.la.m2, everything())
View(nuba21)

# MULTIPLE LINEAR REGRESSION (with interaction among predictor variables)
# lm(response_variable ~ predictor_variable1 + predictor_variable2 + predictor_variable1:predictor_variable2, data = dataFrame)
interaction.pv.inc.fit <- lm(loan_amount ~ property_value + income + property_value:income, data = nuba20)
summary(interaction.pv.inc.fit) # adjusted R-squared is 0.55

# predict a single loan_amount from property_value and income
predict(interaction.pv.inc.fit, data.frame(property_value = 369000, income = 80000))

# let's predict loan_amounts in 2021 with model3
predicted_loan.model.3 <- predict(interaction.pv.inc.fit, data.frame(property_value = nuba21$property_value, income = nuba21$income))

# add a 3rd new column with predicted loan_amounts from model3 to nuba21
nuba21$pred.la.m3 <- predicted_loan.model.3
nuba21 <- select(nuba21, loan_amount, pred.la.m1, pred.la.m2, pred.la.m3, everything())


# plot residuals model 3
plot.residuals.m3 <- ggplot(data = interaction.pv.inc.fit) +
geom_histogram(mapping = aes(x = interaction.pv.inc.fit$residuals)) +
ggtitle("model 3")


# EVALUATE MODELS

# plot histograms with residuals from the 3 models
plot.residuals.m1 + plot.residuals.m2 + plot.residuals.m3

# confidence interval for model 3
confint(interaction.pv.inc.fit)

# assess regression diagnostics
par(mfrow = c(2, 2))
plot(interaction.pv.inc.fit)
par(mfrow = c(1, 1))

I then evaluated the models by plotting histograms of their residuals (Figure 3), knowing the residuals should display a symmetrical distribution around zero when a model fits the data well. Although the histograms look quite similar, model 3 displayed a narrower bell shape around zero, indicating a better fit to the data than model 1 and model 2. This was supported by model 3's higher adjusted R-squared.

Figure 3. Histograms displaying the distribution of residuals for each linear model.

I then evaluated the confidence intervals for model 3 (shown on Figure 4) and found that I could be 95% confident that the intervals contained the true change in loan_amount for a 1 unit change in property_value, income, and property_value:income respectively. I also looked at regression diagnostics to evaluate if my data satisfied the statistical assumptions underlying a linear regression model; mainly normality, independence, linearity, and homoscedasticity (Figure 5).
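Beyond the diagnostic plots, the normality assumption can also be checked numerically, for example with the Shapiro-Wilk test from base R. A sketch on a simulated fit (shapiro.test() accepts between 3 and 5,000 observations, so on a larger dataset you would test a random sample of the residuals):

```r
# quick numeric check of the normality-of-residuals assumption on simulated data
set.seed(3)
x <- runif(1000, 0, 10)
y <- 1 + 2 * x + rnorm(1000)     # residuals are normal by construction here
fit <- lm(y ~ x)
shapiro.test(residuals(fit))     # a large p-value gives no evidence against normality
```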

Figure 4. Outputs from the confint( ) function (top) and summary( ) function (bottom) applied to model 3.
Figure 5. Regression diagnostics plots

Subsequently, I used model 3 to predict loan_amounts from the property_value and income values of 2021. Because the 2021 dataset contains the real loan_amount values (Figure 6), I could evaluate how each model performed by plotting real loan_amount values from 2021 against the values predicted by models trained on 2020 data (Figure 7), and compare the predictive performance of models 1, 2 and 3. This comparison indicated that the three models had similar predictive patterns, with predicted loan amounts deviating most from real loan amounts for values under $250,000.

# inspect nuba21 with columns containing predicted values for loan amounts
View(nuba21)

# let's plot real loan_amounts from home loan applications in 2021 versus loan_amount predictions from the 3 models we built
plot1 <- ggplot(data = nuba21) +
geom_point(mapping = aes(x = loan_amount, y = pred.la.m1), alpha = 0.25) +
geom_abline(intercept = 0, slope = 1, color = "red") +
scale_y_continuous(breaks = seq(0, 1000000, 100000)) +
ggtitle("Loan amounts \npredicted by model 1")


plot2 <- ggplot(data = nuba21) +
geom_point(mapping = aes(x = loan_amount, y = pred.la.m2), alpha = 0.25) +
geom_abline(intercept = 0, slope = 1, color = "red") +
scale_y_continuous(breaks = seq(0, 1000000, 100000)) +
ggtitle("Loan amounts \npredicted by model 2")

plot3 <- ggplot(data = nuba21) +
geom_point(mapping = aes(x = loan_amount, y = pred.la.m3), alpha = 0.25) +
geom_abline(intercept = 0, slope = 1, color = "red") +
scale_y_continuous(breaks = seq(0, 1000000, 100000)) +
ggtitle("Loan amounts \npredicted by model 3")

# visualize the 3 plots together for easy comparison
plot1 + plot2 + plot3
Figure 6. Screenshot depicting predicted loan amounts (columns 2, 3 and 4) from models 1, 2 and 3 in relation to real loan amounts in column 1; loan_amount, property_value and income values are from the 2021 dataset, whereas the models were trained with values from the 2020 dataset.
Figure 7. Scatter plots depicting real loan_amount values from 2021 (x-axis) plotted against predicted loan amounts (y-axis). Had a model been perfect, there would have been a 1:1 relationship, with all data points aligned along the red line.
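The visual comparison can be complemented with a single accuracy number per model, such as the root-mean-square error (RMSE) between real and predicted loan amounts. The rmse() helper below is my own addition, not part of the original analysis; with the real data you would call it on nuba21$loan_amount and each of the predicted columns:

```r
# root-mean-square error: one number summarizing prediction accuracy (lower is better)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# toy illustration with made-up loan amounts
actual    <- c(200000, 350000, 500000)
predicted <- c(210000, 340000, 520000)
rmse(actual, predicted)  # ~14142: on average, predictions miss by about $14,000
```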

CONCLUSION

By using two datasets with the same variables from two consecutive years, I could evaluate the power of three linear regression models to predict loan amounts from property values and incomes. Although not perfect, these models were trained on home loans accepted by Bank of America in New Jersey, and thus could be useful for prospective home buyers in estimating the amount to request on their loan applications based solely on their income and the value of the property they want to buy.

REFERENCES

ABOUT THE AUTHOR

Martin Calvino is a Visiting Professor at Torcuato Di Tella University; a Computational Biologist at The Human Genetics Institute of New Jersey — Rutgers University; and a Multimedia Artist. You can follow him on Instagram: @from.data.to.art

THE CODE FROM THIS WORK as well as the datasets can be accessed from my GitHub page at:
