Dimensionality Reduction with PCA & UMAP: uncovering grouping structure in home loan applicants
I combined PCA and UMAP with logistic regression to classify Asian and African American home loan applicants in Bank of America data
Let’s place ourselves in the following imaginary situation: Bank of America is preparing the submission of its home loan data for the year 2022 to the Consumer Financial Protection Bureau (CFPB), as it is mandated to do each year under the Home Mortgage Disclosure Act (HMDA), and realized it accidentally labeled loan applications from Asian and African American customers as “Race Not Available”. They hired you to solve this issue, requesting that you analyze data from the previous four years (2018–2019–2020–2021) to infer the race of these two groups of loan applicants in 2022. How would you do it?
A possible approach is to use dimensionality reduction methods such as Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) to uncover the patterns present in home loan data from 2018–2019–2020 and to identify, if possible, clusters of loan applications that belong to Asian and African American customers respectively. The patterns learned by these two models can then be used as input to a logistic regression algorithm that classifies the race of each loan application as Asian or African American in 2021, the most recent year with available data. The logistic regression model could then be used to predict the race of home loan applicants during 2022.
DATA PREPARATION
The number of loan applications processed by Bank of America over the three-year period 2018–2019–2020 totaled 1,302,574, while the number of applications for 2021 was 368,728. From the 99 variables originally present in the dataset I selected 10 numeric variables (median_family_income, income, loan_amount, property_value, loan_to_value_ratio, house_age, interest_rate, loan_costs, origination_charges, and loan_term) and one categorical variable (race). After identifying and removing outliers and missing values, and after filtering out all loan applications from customers who weren’t Asian or African American, the dataset I used to train the PCA and UMAP models comprised 37,032 loans from Asian applicants and 17,501 loans from African American applicants.
# load libraries
library(tidyverse)
library(psych)
library(GGally)
# load datasets 2018-2019-2020-2021
# home loan data is nationwide
# Bank of America 2018 as boa18
boa18 <- read.csv(file.choose()) # 462,401 observations x 99 variables
boa19 <- read.csv(file.choose()) # 466,552 observations x 99 variables
boa20 <- read.csv(file.choose()) # 373,621 observations x 99 variables
boa21 <- read.csv(file.choose()) # 368,728 observations x 99 variables
# combine datasets from 2018-2019-2020 into a single data frame named boa
boa <- rbind(boa18, boa19, boa20) # 1,302,574 observations x 99 variables
# inspect the names of variables
colnames(boa)
# select variables of interest (feature selection)
median_family_income <- boa$ffiec_msa_md_median_family_income
income <- boa$income*1000 # HMDA reports applicant income in thousands of dollars; convert to dollars
loanAmount <- boa$loan_amount
property_value <- boa$property_value
ltv <- boa$loan_to_value_ratio
house.age <- boa$tract_median_age_of_housing_units
interest_rate <- boa$interest_rate
loan_costs <- boa$total_loan_costs
origination_charges <- boa$origination_charges
loan_term <- boa$loan_term
race <- boa$derived_race
# because PCA & UMAP only work with numeric variables (continuous data),
# create a new data frame that keeps the numeric variables and drops all categorical variables
# from the original dataset except race
boa.numeric <- data.frame(income,
median_family_income,
loanAmount,
property_value,
ltv,
house.age,
loan_costs,
origination_charges,
loan_term,
interest_rate,
race)
# for our imaginary example, let's focus on Asian & African American loan applicants only
boa.numeric <- filter(boa.numeric, race == "Asian" | race == "Black or African American")
# inspect the first 100 rows of boa.numeric
View(boa.numeric[1:100, ])
# inspect summary statistics
summary(boa.numeric[, c(1:10)])
# identify & remove outliers for income, loan amount and property value variables
boxplot(boa.numeric[, c(1:10)])
# remove outliers for the income variable
outliers.inc <- boxplot(boa.numeric[, 1])$out
boa.numeric <- boa.numeric[-which(boa.numeric[, 1] %in% outliers.inc), ]
boxplot(boa.numeric[, c(1:10)])
# remove outliers for the loan amount variable
outliers.la <- boxplot(boa.numeric[, 3])$out
boa.numeric <- boa.numeric[-which(boa.numeric[, 3] %in% outliers.la), ]
boxplot(boa.numeric[, c(1:10)])
# remove outliers for the property value variable
outliers.pv <- boxplot(boa.numeric[, 4])$out
boa.numeric <- boa.numeric[-which(boa.numeric[, 4] %in% outliers.pv), ]
boxplot(boa.numeric[, c(1:10)])
# remove outliers for the loan_to_value_ratio variable
outliers.ltv <- boxplot(boa.numeric[, 5])$out
boa.numeric <- boa.numeric[-which(boa.numeric[, 5] %in% outliers.ltv), ]
boxplot(boa.numeric[, c(1:10)])
# remove applications with negative values for income
boa.numeric <- filter(boa.numeric, income > 0)
# inspect summary statistics
summary(boa.numeric[, c(1:10)])
# count missing values
sum(is.na(boa.numeric)) # 436,145 NAs
# remove missing values
# notice that because we remove observations containing missing values for the interest_rate variable
# we indirectly keep only loan applications that were accepted (denied applications don't have interest rate values)
boa.numeric <- na.omit(boa.numeric)
# inspect the structure of boa.numeric
str(boa.numeric) # we now have 54,533 observations x 11 variables that we can work with
# inspect summary statistics once more
summary(boa.numeric[, c(1:10)])
# visualize all data in boa.numeric using the ggpairs() function from the GGally package
plot.boa.numeric <- ggpairs(boa.numeric, mapping = aes(col = race)) +
theme_bw()
# see the plot
plot.boa.numeric
# save plot with specified dimensions for easy viewing
ggsave("Bank_of_America_Asian_vs_African_American_2018_to_2020_nationWide.jpeg",
width = 8000, height = 4000, units = "px")
# count the number of applications for each racial group
table(boa.numeric$race) # 37,032 Asian & 17,501 African American applicants
To visualize the distribution of values for each numeric variable and their pairwise combinations in relation to each racial group, I plotted the data using the ggpairs() function from the GGally package (Figure 1) and found that the variable loan_to_value_ratio (shown as ltv in the plot) seems to differentiate between Asian and African American applicants. To a lesser extent, the same is true for the variables income, median_family_income, loan_amount and property_value. On the other hand, the variable loan_term contains little information that distinguishes Asian from African American applicants.
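To look more closely at this, a quick supplementary density plot of loan_to_value_ratio by race (not one of the numbered figures in this article) can be drawn from the same data frame:
# supplementary plot: distribution of loan_to_value_ratio by racial group
ggplot(boa.numeric, aes(x = ltv, fill = race)) +
geom_density(alpha = 0.5) +
labs(x = "loan_to_value_ratio (ltv)") +
theme_bw()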
IMPLEMENTATION OF PCA ALGORITHM
PCA is a data-reduction method that transforms a large number of correlated variables into a smaller set of uncorrelated variables known as principal components. These components are linear combinations of the variables in a dataset, with the weights used to create the linear composites chosen to maximize the variance each principal component accounts for while keeping the components uncorrelated with one another. Thus, PCA creates new variables that capture most of the information in the data.
Because PCA is an unsupervised machine learning algorithm, it does not require labeled data; instead, it learns the patterns inherent in the data itself.
Reducing the dimensions of a dataset using PCA also facilitates the visualization of data (multivariate information can be displayed in a 2D graph); attenuates the curse of dimensionality (data becomes sparse as the number of variables increases for a given number of observations, and many machine learning algorithms have trouble learning patterns from sparse data); and reduces the effects of collinearity (variables in a dataset displaying various degrees of correlation with each other, which negatively affects the estimation of parameters in linear regression models).
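For intuition, the variance-maximizing property just described can be checked directly on the data: the eigenvalues of the correlation matrix of the ten numeric variables are exactly the variances of the principal components. The short sketch below is a conceptual check only and is not part of the analysis pipeline.
# conceptual check: eigenvalues of the correlation matrix = variances of the components
eig <- eigen(cor(boa.numeric[, -11]))
round(eig$values, 2)                           # variance captured by each of the 10 components
round(cumsum(eig$values) / sum(eig$values), 2) # cumulative proportion of total variance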
To select the number of components to extract from the home loan data, I used the fa.parallel() function from the psych package, which produced a scree plot of eigenvalues (the variances of the data along a given principal component) against their component numbers, together with a parallel analysis based on 1,000 simulations (the function extracts eigenvalues from random data matrices of the same size as the dataset used in this study; if an eigenvalue based on the real home loan data is larger than the average of the corresponding eigenvalues from the random matrices, that component is retained) (Figure 2). The results shown in Figure 2 suggested I extract 4 components, based on the criterion of selecting components with eigenvalues both greater than 1 and larger than the simulated values (dotted line in the figure).
# IMPLEMENT PCA
# select the number of components to extract
fa.parallel(boa.numeric[, -11], fa = "pc", n.iter = 1000, show.legend = TRUE,
main = "Scree plot with parallel analysis") # the plot suggest I extract 4 components
I then proceeded to extract the principal components from the Bank of America home loan dataset using the principal() function from the psych package. The output is shown in Figure 3. The columns labeled PC1 to PC4 contain the component loadings, which are the correlations of the observed variables with the principal components. For instance, the variables income, median_family_income, loan_amount, and property_value correlated highly with PC1, whereas the variables loan_costs and origination_charges correlated highly with PC2. The column labeled h2 contains the component communalities, the amount of variance in each variable explained by the components. For example, 91% of the variance in loan_amount and property_value could be explained by the four components. Conversely, the column labeled u2 contains the uniquenesses (1 − h2), the variance left unexplained by the components. The row labeled SS Loadings contains the eigenvalues associated with the components, and the row labeled Proportion Var describes the amount of variance accounted for by each component, with PC1 accounting for 31% of the variance in the ten variables. Altogether, the four principal components I extracted accounted for 77% of the variance in the ten numeric variables present in the home loan dataset (row labeled Cumulative Var).
# scale & center data because variables have different units of measurements (US Dollars, percentages, years) and ranges
# boa.numeric.centerd.and.scaled as "boncas"
boncas <- scale(boa.numeric[, -11], center = TRUE, scale = TRUE)
# extract principal components
pc.boa.num <- principal(boncas, nfactors = 4, rotate = "none", scores = TRUE)
pc.boa.num
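The quantities described above (loadings, communalities, uniquenesses, SS loadings, Proportion Var) can be reconstructed directly from the loading matrix returned by principal(); the short sketch below uses only objects already created and is meant as an interpretation aid rather than as part of the analysis.
# reconstruct the summary quantities reported by principal() from the loadings
pc.loadings <- unclass(pc.boa.num$loadings)    # 10 x 4 matrix of component loadings
h2 <- rowSums(pc.loadings^2)                   # communalities: variance in each variable explained by the 4 components
u2 <- 1 - h2                                   # uniquenesses: variance left unexplained
ss.loadings <- colSums(pc.loadings^2)          # eigenvalues (SS loadings) of the retained components
prop.var <- ss.loadings / ncol(boncas)         # proportion of total variance per component
round(cbind(h2, u2), 2)
round(rbind(ss.loadings, prop.var, cumulative.var = cumsum(prop.var)), 2)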
I subsequently plotted the first two components against each other to evaluate how well they separated Asian from African American loan applicants (Figure 4). The component scores associated with home loan applications from Asians varied primarily along PC1, whereas the component scores associated with African Americans varied primarily along PC2. Although the two groups of home loan applicants weren’t completely separated on the plot, the component scores still clustered into two distinct groups.
# plot home loan applications across the first 2 PCs
# Bank of America as boa
# add PCA scores associated to each loan application for the first 2 components
boa_pca <- boa.numeric %>%
mutate(PCA1 = pc.boa.num$scores[, 1], PCA2 = pc.boa.num$scores[, 2])
# inspect boa_pca
head(boa_pca)
ggplot(boa_pca, aes(PCA1, PCA2, col = race)) +
geom_point() +
theme_bw()
In summary, I started with ten continuous variables and condensed most of the information in the dataset into just two principal components that contained enough information to differentiate the two clusters of home loan applications.
IMPLEMENTATION OF UMAP ALGORITHM
Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimension reduction algorithm that computes distances between observations in a multivariate space and then tries to recreate these distances in a low-dimensional space. The inherent assumption in UMAP is that the data is distributed along a manifold, an n-dimensional smooth geometric shape in which the neighborhood of every point looks locally flat. The distances between observations along the manifold are calculated and then iteratively recreated in a lower-dimensional space.
The UMAP algorithm has four main hyperparameters that affect the resulting embedding (a configuration sketch follows this list):
n_neighbors: larger values incorporate more neighboring observations, forcing the algorithm to prioritize global structure, whereas smaller values incorporate fewer neighbors, forcing the algorithm to prioritize local structure.
min_dist: determines the minimum distance apart that observations are allowed to be in the lower-dimensional representation; larger values result in observations being further apart, whereas smaller values result in observations being more tightly packed.
metric: determines the distance metric to be used, with Euclidean distance as the default; other distance metrics can also be used.
n_epochs: determines the number of iterations used to optimize the embedding.
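As an alternative to passing these hyperparameters directly to umap() (as done in the next code block), they can be collected in a configuration object; the values below simply mirror the ones used later and are not the result of tuning.
# alternative way to set UMAP hyperparameters: a custom configuration object
library(umap)
custom.config <- umap.defaults
custom.config$n_neighbors <- 20
custom.config$min_dist <- 0.5
custom.config$metric <- "manhattan"
custom.config$n_epochs <- 600
custom.config$verbose <- TRUE
# usage: umap(as.matrix(boa.numeric[, -11]), config = custom.config)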
I proceeded to create the embedding using UMAP and plotted the results to examine how well the two groups of home loan applicants separated from each other (Figure 5). Although the loan applications weren’t clearly separated for each racial group, a clear pattern can still be observed, with observations from African Americans clustered on the outer contour and lower tip of the inverted C shape.
# IMPLEMENT UMAP ALGORITHM
# install and load the umap package
install.packages("umap", dependencies = TRUE)
library(umap)
# create embedding
boa_umap <- select(boa.numeric, -race) %>%
as.matrix() %>%
umap(n_neighbors = 20, min_dist = 0.5, metric = "manhattan", n_epochs = 600, verbose = TRUE)
# visualize results
boa_umap_2 <- boa.numeric %>%
mutate(UMAP1 = boa_umap$layout[, 1], UMAP2 = boa_umap$layout[, 2])
# inspect boa_umap_2
head(boa_umap_2)
ggplot(boa_umap_2, aes(UMAP1, UMAP2, col = race)) +
geom_point(size = 0.1) +
theme_bw()
In summary, similar to using PCA, the implementation of the UMAP algorithm allowed me to condense the information in my dataset from 10 continuous variables into two UMAP dimensions in which loan applications from Asians and African Americans could be differentiated.
COMPUTING COMPONENT SCORES FROM PCA AND THE UMAP EMBEDDING ON NEW DATA (HOME LOAN APPLICATIONS FROM 2021)
The idea now is to take the information learned by the PCA and UMAP models from the 2018–2019–2020 data and use their principal components / UMAP dimensions as explanatory variables in a logistic regression model to predict the race (Asian vs. African American) of home loan applicants during 2021. For this, I processed the 2021 data in the same manner as the data from the previous three years.
# work on boa21 dataset now
# select variables of interest (feature selection)
median_family_income21 <- boa21$ffiec_msa_md_median_family_income
income21 <- boa21$income*1000
loanAmount21 <- boa21$loan_amount
property_value21 <- boa21$property_value
ltv21 <- boa21$loan_to_value_ratio
house.age21 <- boa21$tract_median_age_of_housing_units
interest_rate21 <- boa21$interest_rate
loan_costs21 <- boa21$total_loan_costs
origination_charges21 <- boa21$origination_charges
loan_term21 <- boa21$loan_term
race21 <- boa21$derived_race
# because PCA & UMAP only work with numeric variables (continuous data),
# create a new data frame that keeps the numeric variables and drops all categorical variables
# from the original dataset except race
boa21.numeric <- data.frame(income21,
median_family_income21,
loanAmount21,
property_value21,
ltv21,
house.age21,
loan_costs21,
origination_charges21,
loan_term21,
interest_rate21,
race21)
# for our imaginary example, let's focus on Asian & African American loan applicants only
boa21.numeric <- filter(boa21.numeric, race21 == "Asian" | race21 == "Black or African American")
# identify outliers and remove them from boa21.numeric
boxplot(boa21.numeric[, c(1:10)])
# remove outliers for the income variable
outliers.inc <- boxplot(boa21.numeric[, 1])$out
boa21.numeric <- boa21.numeric[-which(boa21.numeric[, 1] %in% outliers.inc), ]
boxplot(boa21.numeric[, c(1:10)])
# remove outliers for the loan amount variable
outliers.la <- boxplot(boa21.numeric[, 3])$out
boa21.numeric <- boa21.numeric[-which(boa21.numeric[, 3] %in% outliers.la), ]
boxplot(boa21.numeric[, c(1:10)])
# remove outliers for the property value variable
outliers.pv <- boxplot(boa21.numeric[, 4])$out
boa21.numeric <- boa21.numeric[-which(boa21.numeric[, 4] %in% outliers.pv), ]
boxplot(boa21.numeric[, c(1:10)])
# remove outliers for the loan_to_value_ratio variable
outliers.ltv <- boxplot(boa21.numeric[, 5])$out
boa21.numeric <- boa21.numeric[-which(boa21.numeric[, 5] %in% outliers.ltv), ]
boxplot(boa21.numeric[, c(1:10)])
# inspect boa21.numeric
summary(boa21.numeric)
# remove applications with negative values for income
boa21.numeric <- filter(boa21.numeric, income21 > 0)
# identify missing values and remove them
sum(is.na(boa21.numeric))
boa21.numeric <- na.omit(boa21.numeric)
table(boa21.numeric$race21) # 13,551 observations from Asians and 8,172 observations from African American
Subsequently, I recoded race as a dichotomous factor (Asian == 1 and African American == 0) and split the 2021 data into train and test datasets.
# code response variable (race) as 0 and 1 to implement logistic regression
boa21.numeric$race21[boa21.numeric$race21 == "Asian"] <- 1
boa21.numeric$race21[boa21.numeric$race21 == "Black or African American"] <- 0
boa21.numeric$race21 <- factor(
boa21.numeric$race21,
levels = c(0, 1),
labels = c(0, 1)
)
str(boa21.numeric)
# create a train/test split of dataset
rows <- sample(nrow(boa21.numeric))
train.boa21.numeric <- boa21.numeric[rows, ]
# train:test split is 70%:30%
split <- round(nrow(train.boa21.numeric) * 0.70)
train <- train.boa21.numeric[1:split, ]
test <- train.boa21.numeric[(split + 1):nrow(train.boa21.numeric), ]
nrow(train) / nrow(train.boa21.numeric) # training dataset is 70% of the entire dataset
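Before fitting any model, it is worth confirming that both racial groups appear in similar proportions in the train and test splits; a quick check using the objects just created:
# check the class balance carried into each split (1 = Asian, 0 = African American)
prop.table(table(train$race21))
prop.table(table(test$race21))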
PC COMPONENTS AS EXPLANATORY VARIABLES IN LOGISTIC REGRESSION
I used the PCA model trained with data from the previous three years to predict component scores for the train/test data on home loans from 2021, and used the first two of the resulting PCs as explanatory variables in my logistic regression model. I then evaluated the model’s performance using the confusion matrix and three other metrics (accuracy, sensitivity and specificity) provided by the yardstick package (Figure 6).
# PC COMPONENTS AS EXPLANATORY VARIABLES
# use PCA model created with data from 2018-2019-2020 to predict component scores on train dataset
pca_data_modeled_21 <- predict(pc.boa.num, data = train[, -11])[, 1:4]
head(pca_data_modeled_21)
# use PCA model to predict component scores on test dataset
pca_test_modeled_21 <- predict(pc.boa.num, data = test[, -11])[, 1:4]
head(pca_test_modeled_21)
# convert from matrix to data frame to use in logistic regression
class(pca_data_modeled_21)
pca_data_modeled_21 <- as.data.frame(pca_data_modeled_21)
class(pca_test_modeled_21)
pca_test_modeled_21 <- as.data.frame(pca_test_modeled_21)
# add PC component scores to train dataset
train <- train %>%
mutate(PCA1 = pca_data_modeled_21[, 1], PCA2 = pca_data_modeled_21[, 2])
View(train)
# add PC component scores to test dataset
test <- test %>%
mutate(PCA1 = pca_test_modeled_21[, 1], PCA2 = pca_test_modeled_21[, 2])
View(test)
# fit logistic regression model with PC components as explanatory variables
fit.glm.pca.21 <- glm(race21 ~ PCA1 + PCA2, data = train, family = binomial())
# inspect model's coefficients
summary(fit.glm.pca.21)
# make predictions on the test dataset
pred.test <- predict(fit.glm.pca.21, newdata = test, type = "response")
# evaluate model performance
# Confusion Matrix: counts of outcomes
actual_response <- test$race21
predicted_response <- ifelse(pred.test > 0.50, "1", "0")
outcomes <- table(predicted_response, actual_response)
outcomes
# evaluate model performance using functions from the yardstick package
library(yardstick)
confusion <- conf_mat(outcomes)
autoplot(confusion)
# obtain model's performance metrics
summary(confusion, event_level = "second")
# accuracy is the proportion of correct predictions: 0.77 or 77.0%
# sensitivity is the proportion of true positives: 0.818 or 81.8%
# specificity is the proportion of true negatives: 0.693 or 69.3%
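To connect the yardstick output with the raw counts in the confusion matrix, the same three metrics can be recomputed by hand, treating race == 1 (Asian) as the positive class just as event_level = "second" does above:
# recompute the metrics directly from the confusion matrix counts
TP <- outcomes["1", "1"]   # predicted Asian, actually Asian
TN <- outcomes["0", "0"]   # predicted African American, actually African American
FP <- outcomes["1", "0"]   # predicted Asian, actually African American
FN <- outcomes["0", "1"]   # predicted African American, actually Asian
accuracy    <- (TP + TN) / sum(outcomes)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)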
Based on the results shown in Figure 6, I was pleasantly surprised to find that the accuracy of the model (the proportion of correct predictions) was 77%, because I used only the first two PCs as explanatory variables (out of the four PCs available) when fitting the logistic regression model. These two PCs accounted for 54% of the variance in the dataset containing home loan applications from 2018–2019–2020 (see Figure 3). The sensitivity (the proportion of true positives) of the logistic regression model was 81.8%, whereas the specificity (the proportion of true negatives) was 69.3%. The lower value for specificity compared to sensitivity was somewhat expected because the 2021 dataset had fewer loan applications from African Americans (coded as 0) than from Asians (coded as 1), and thus the model was exposed to more applications from the latter group.
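To put the 77% accuracy in perspective, it can be compared against the no-information rate, i.e., the accuracy of always predicting the more frequent class in the test set, which for the class proportions reported earlier sits well below 77%:
# no-information baseline: accuracy of always predicting the majority class
prop.table(table(actual_response))
max(prop.table(table(actual_response)))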
In summary, I found that PCA was a valid dimension reduction technique to uncover racial grouping structure in home loan applications from 2018–2019–2020; and that using the first two PCs from the PCA model as explanatory variables to a logistic regression model was a successful approach in predicting the race of home loan applicants in 2021.
UMAP DIMENSIONS AS EXPLANATORY VARIABLES IN LOGISTIC REGRESSION
Similar to what I described above for PCA, I subsequently used the UMAP model trained with data from the previous three years to predict the UMAP embedding for the train/test data on home loans from 2021, and used the resulting two UMAP dimensions as explanatory variables in my logistic regression model. I then evaluated the model’s performance using the confusion matrix and three other metrics (accuracy, sensitivity and specificity) provided by the yardstick package (Figure 7).
# UMAP DIMENSIONS AS EXPLANATORY VARIABLES
# create a train/test split of dataset
rowsU <- sample(nrow(boa21.numeric))
train.boa21.numericU <- boa21.numeric[rowsU, ]
# train:test split is 70%:30%
splitU <- round(nrow(train.boa21.numericU) * 0.70)
trainU <- train.boa21.numericU[1:splitU, ]
testU <- train.boa21.numericU[(splitU + 1):nrow(train.boa21.numericU), ]
nrow(trainU) / nrow(train.boa21.numericU) # training dataset is 70% of the entire dataset
# compute UMAP embedding on train dataset
UMAP_data_modeled_21 <- predict(boa_umap, data = as.matrix(trainU[, -11])) # coerce to a matrix, matching the input format used to train the UMAP model
# compute UMAP embedding on test dataset
UMAP_test <- predict(boa_umap, data = as.matrix(testU[, -11]))
# convert matrices to data frames
class(UMAP_data_modeled_21)
UMAP_train <- as.data.frame(UMAP_data_modeled_21)
class(UMAP_test)
UMAP_test <- as.data.frame(UMAP_test)
# add UMAP dimensions to trainU dataset
trainU <- trainU %>%
mutate(UMAP1 = UMAP_train[, 1], UMAP2 = UMAP_train[, 2])
# add UMAP dimensions to testU dataset
testU <- testU %>%
mutate(UMAP1 = UMAP_test[, 1], UMAP2 = UMAP_test[, 2])
# fit logistic regression model with UMAP dimensions as explanatory variables
fit.glm.UMAP.21 <- glm(race21 ~ UMAP1 + UMAP2, data = trainU, family = binomial())
# inspect model's coefficients
summary(fit.glm.UMAP.21)
# make predictions on the test dataset
pred.test.U <- predict(fit.glm.UMAP.21, newdata = testU, type = "response")
# evaluate model performance
# Confusion Matrix: counts of outcomes
actual_responseU <- testU$race21
predicted_responseU <- ifelse(pred.test.U > 0.50, "1", "0")
outcomesU <- table(predicted_responseU, actual_responseU)
outcomesU
# evaluate model performance using functions from the yardstick package
confusionU <- conf_mat(outcomesU)
autoplot(confusionU)
# obtain model's performance metrics
summary(confusionU, event_level = "second")
# accuracy is the proportion of correct predictions: 0.731 or 73.1%
# sensitivity is the proportion of true positives: 0.776 or 77.6%
# specificity is the proportion of true negatives: 0.657 or 65.7%
In summary, the accuracy of the logistic regression model when using UMAP dimensions as explanatory variables was 73.1%, roughly four percentage points lower than the accuracy obtained when using PCs as explanatory variables. The same holds for the sensitivity and specificity metrics. A possible way to increase the performance of the logistic regression model would be to change the hyperparameters of the UMAP algorithm when training it with home loan data from 2018–2019–2020, to improve the separation of Asians from African Americans, and see whether this improves the accuracy of the logistic regression on the 2021 data.
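A minimal sketch of what such a sweep could look like is shown below; the candidate values for n_neighbors and min_dist are illustrative rather than tuned, each refit is computationally expensive on a dataset of this size, and the separation would be judged visually before repeating the downstream logistic regression.
# illustrative sweep over UMAP hyperparameters on the 2018-2019-2020 data
for (k in c(10, 20, 50)) {
  for (d in c(0.1, 0.5)) {
    emb <- select(boa.numeric, -race) %>%
      as.matrix() %>%
      umap(n_neighbors = k, min_dist = d, metric = "manhattan", n_epochs = 600)
    sweep.plot <- ggplot(data.frame(UMAP1 = emb$layout[, 1],
                                    UMAP2 = emb$layout[, 2],
                                    race = boa.numeric$race),
                         aes(UMAP1, UMAP2, col = race)) +
      geom_point(size = 0.1) +
      theme_bw() +
      ggtitle(paste("n_neighbors =", k, "| min_dist =", d))
    print(sweep.plot)
  }
}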
CONCLUSION
In this work I combined the dimension reduction algorithms PCA and UMAP to uncover the grouping structure of home loan applicants (Asians and African Americans) to Bank of America during the three-year period 2018–2019–2020, and used the information learned by these models to implement logistic regression in order to classify the race of loan applicants in 2021. Overall, using PCs as explanatory variables in logistic regression produced better classification accuracy than using UMAP dimensions.
The logistic regression model developed here could then be used to make predictions on the race of home loan applicants during 2022, helping Bank of America solve the imaginary problem of mislabeling Asian and African American customers with the designation “Race Not Available”.
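As a sketch of that final step: assuming a hypothetical data frame boa22.numeric prepared from the 2022 submission exactly as boa21.numeric was (the same ten numeric columns in the same order, with race unknown), the trained PCA and logistic regression models could be applied as follows.
# hypothetical final step: predict the race of 2022 applicants
# (boa22.numeric is assumed to exist and to mirror the structure of boa21.numeric)
pca_scores_22 <- as.data.frame(predict(pc.boa.num, data = boa22.numeric)[, 1:4])
boa22.scored <- boa22.numeric %>%
  mutate(PCA1 = pca_scores_22[, 1], PCA2 = pca_scores_22[, 2])
prob_asian_22 <- predict(fit.glm.pca.21, newdata = boa22.scored, type = "response")
predicted_race_22 <- ifelse(prob_asian_22 > 0.50, "Asian", "Black or African American")
table(predicted_race_22)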
ABOUT THE AUTHOR
Martin Calvino is a Visiting Professor at Torcuato Di Tella University; a Computational Biologist at The Human Genetics Institute of New Jersey — Rutgers University; and a Multimedia Artist. You can follow him on Instagram: @from.data.to.art
THE CODE ASSOCIATED WITH THIS WORK as well as the datasets can be accessed from my GitHub page at: