Where to Apply for Home Loans if You Are Latinx? Cluster Analysis of Financial Institutions

13 min read · Jun 10, 2023

Image created by Martin Calvino using DCGAN algorithm.

Imagine that a non-profit organization focused on latinx issues commissions you to analyze all financial institutions in the United States in order to identify those with ‘friendly’ home loan offerings for latinxs. Because equitable access to home ownership is seen as a fundamental aspect of social justice, identifying the financial institutions that combine high percentages of accepted home loan applications with low interest rates for latinxs would constitute an important resource of knowledge for this societal group.

In this work I analyzed 3,731 financial institutions across the country that together processed 2,243,013 home loan applications received from latinxs during 2021. The implementation of cluster analysis allowed me to identify financial institutions that accepted home loan applications from latinxs at high percentages and at the same time provided low interest rates on their granted loans.

CLUSTER ANALYSIS

Cluster analysis consists of numerous data-reduction methodologies designed to uncover subgroups of observations in a dataset. It enables you to reduce a great number of observations into a smaller number of clusters or types, with each cluster containing observations that are more similar to each other than they are to observations in other clusters.

In my work, financial institutions were clustered based on the similarity of three per-bank variables: the percentage of home loan applications that were accepted, the mean interest rate, and the mean loan amount. Based on this information, latinx applicants are encouraged to apply to those subgroups of financial institutions displaying higher percentages of acceptance and lower interest rates for their intended loan amounts.

There are two main approaches to clustering: hierarchical agglomerative clustering and partitioning clustering. In hierarchical agglomerative clustering, each observation starts as its own cluster, and clusters are subsequently combined two at a time until all of them are unified into a single big cluster. In partitioning clustering, we must specify k, the number of clusters sought; observations are then randomly assigned to k groups and reshuffled until cohesive clusters are formed. We visualize hierarchical clustering in the form of dendrograms and partitioning clustering in the form of bivariate cluster plots.
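To make the contrast concrete, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical and are not part of the HMDA dataset). It runs both approaches on the same six observations and cross-tabulates the resulting group assignments.

```r
library(cluster)  # provides pam()

# toy data (hypothetical): six observations forming two obvious groups
toy <- data.frame(
  x = c(1.0, 1.2, 0.9, 8.0, 8.3, 7.9),
  y = c(1.1, 0.8, 1.0, 8.2, 7.9, 8.1),
  row.names = paste0("obs", 1:6)
)

# hierarchical agglomerative clustering: every observation starts as its own
# cluster, and the two closest clusters are merged until only one remains
hc <- hclust(dist(toy, method = "euclidean"), method = "average")
plot(hc, hang = -1)             # dendrogram
hc.groups <- cutree(hc, k = 2)  # cut the tree into two groups

# partitioning clustering: the number of clusters k is specified up front
fit <- pam(toy, k = 2, metric = "euclidean")
pam.groups <- fit$clustering

# cross-tabulate the two sets of assignments
table(hc.groups, pam.groups)
```

On data this clean both approaches recover the same two groups; on real data they can disagree, which is one reason to try more than one.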

The clustering algorithms used are different for each approach. For hierarchical clustering these are: single linkage, complete linkage, average linkage, centroid, and Ward’s method. For partitioning clustering these are: k-means and partitioning around medoids (PAM) (see references for an in-depth explanation). All these algorithms require a measure of the distance among the observations to be clustered that can be any of the following: Euclidean, Manhattan, Canberra, asymmetric binary, maximum, and Minkowski distance; with larger distances meaning larger dissimilarities between observations (the distance between an observation and itself is 0).
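As a small worked example of these distance measures, the sketch below uses base R's `dist()` on two made-up points (the points are hypothetical, chosen so the arithmetic is easy to follow by hand).

```r
# two toy observations (hypothetical values)
pts <- rbind(a = c(0, 0), b = c(3, 4))

d.euclidean <- dist(pts, method = "euclidean")         # sqrt(3^2 + 4^2) = 5
d.manhattan <- dist(pts, method = "manhattan")         # |3| + |4| = 7
d.maximum   <- dist(pts, method = "maximum")           # max(|3|, |4|) = 4
d.minkowski <- dist(pts, method = "minkowski", p = 3)  # (3^3 + 4^3)^(1/3)

# the full distance matrix has zeros on the diagonal:
# the distance between an observation and itself is 0
as.matrix(d.euclidean)
```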

In this work I explored average linkage as the clustering algorithm for the implementation of hierarchical clustering and partitioning around medoids as the clustering algorithm for the implementation of partitioning clustering. In both approaches I used Euclidean distance.

DATA PREPARATION

I downloaded all home loan applications across the country for the year 2021 from the Consumer Financial Protection Bureau (CFPB) website, where the Home Mortgage Disclosure Act (HMDA) dataset is publicly available. This constituted 19,182,225 loan applications that were processed by 4,292 financial institutions nationwide. When I considered latinx home loan applicants only, my dataset consisted of 3,731 financial institutions that processed 2,243,013 applications.

### load libraries
library(tidyverse)
library(NbClust)
library(cluster)
library(fMultivar)
library(patchwork)
library(MetBrewer)
library(plot3D)
library(ggrepel)

### load datasets

# load dataset: home loans nationwide for 2021
banks <- read.csv(file.choose()) # 19,182,225 observations x 99 variables
length(levels(factor(banks$lei))) # 4,292 financial institutions

# load csv file with common bank names mapped to leis by year
bank.names <- read.csv(file.choose())
View(bank.names)
# select leis for 2020 (the most recent year)
bank.names <- bank.names[, c(1,5)] 
View(bank.names)
# rename variable lei_2020 to lei
names(bank.names)[2] <- "lei" # this will help in left_join operation
View(bank.names) # 5,399 entries

# only consider banks that have common names assigned to leis
my.banks <- filter(bank.names, lei %in% banks$lei)
length(levels(factor(my.banks$respondent_name))) # 3,731 financial institutions
View(my.banks)
# add common names to banks data frame
banks <- inner_join(my.banks, banks, by = c("lei"))
length(levels(factor(banks$respondent_name))) # 3,731 financial institutions

### latino home loan applicants

# consider latino home loan applicants only
banks <- filter(banks, derived_ethnicity == "Hispanic or Latino") # 2,243,013 observations

Because the objective of my work was to cluster financial institutions based on the percentage of accepted home loan applications, the mean interest rate and the mean loan amount per bank, I proceeded to create these variables from the existing variables in the dataset and inspected their value distributions and summary statistics (Figure 1 and Figure 2). I found that the median for percentage of accepted loans per bank was about 66%, whereas the median value for mean interest rate per bank was 3.1%. The median value for mean loan amount per bank was $201,500.

# inspect data frame
str(banks)

# convert interest_rate variable to numeric
typeof(banks$interest_rate)
class(banks$interest_rate)
banks$interest_rate <- as.numeric(banks$interest_rate)
class(banks$interest_rate)

### create new variables per bank

# variable > percentage of accepted home loans per bank

# group all home loan applications per bank and arrange banks in descending order
banks.Arranged <- banks %>%
  group_by(respondent_name) %>%
  summarize(c0 = n()) %>%
  arrange(desc(c0))

View(banks.Arranged) # 3,456 financial institutions processed home loan applications from latinos in 2021

# group accepted home loan applications per bank and arrange banks in descending order
applications.accepted.per.bank <- banks  %>%
  filter(action_taken == 1) %>%
  group_by(respondent_name) %>%
  summarize(c1 = n()) %>%
  arrange(desc(c1))

View(applications.accepted.per.bank) # 3,338 financial institutions accepted home loan applications

# consider banks with accepted home loan applications
all.applications.per.bank <- banks.Arranged %>%
  filter(respondent_name %in% applications.accepted.per.bank$respondent_name)

View(all.applications.per.bank) # 3,338 financial institutions

# join data frames
pct.accepted.per.bank <- inner_join(all.applications.per.bank, applications.accepted.per.bank, by = c("respondent_name"))
View(pct.accepted.per.bank)
# create new variable containing the percentage of accepted home loans per bank
pct.accepted.per.bank <- mutate(pct.accepted.per.bank, pct.accepted = (c1*100)/c0)
View(pct.accepted.per.bank)

# variable > mean interest_rate per bank

# calculate the mean interest rate per bank
# values for interest rates are only present in home loan applications that were accepted by the bank
mean.ir.per.bank <- banks %>%
  filter(respondent_name %in% pct.accepted.per.bank$respondent_name) %>%
  filter(action_taken == 1) %>%
  group_by(respondent_name) %>%
  summarize(mean.ir = as.numeric(mean(interest_rate)))

View(mean.ir.per.bank)

# join data frames
banks.to.be.clustered <- inner_join(pct.accepted.per.bank, mean.ir.per.bank, by = c("respondent_name"))
View(banks.to.be.clustered)

# variable >  mean loan_amount per bank
mean.la.per.bank <- banks %>%
  filter(respondent_name %in% pct.accepted.per.bank$respondent_name) %>%
  filter(action_taken == 1) %>%
  group_by(respondent_name) %>%
  summarize(mean.la = mean(loan_amount))

View(mean.la.per.bank)

# join data frames
banks.to.be.clustered <- inner_join(banks.to.be.clustered, mean.la.per.bank, by = c("respondent_name"))
View(banks.to.be.clustered)

### identify and remove missing values
sum(is.na(banks.to.be.clustered))
banks.to.be.clustered <- na.omit(banks.to.be.clustered)
View(banks.to.be.clustered) # 2,377 financial institutions

### inspect summary statistics
summary(banks.to.be.clustered[, c(4:6)])

# inspect the distribution of percentage of accepted loans per bank
plot1 <- ggplot(data = banks.to.be.clustered, mapping = aes(x = pct.accepted)) +
  geom_histogram()
# inspect the distribution of mean interest rate per bank
plot2 <- ggplot(data = banks.to.be.clustered, mapping = aes(x = mean.ir)) +
  geom_histogram()
# inspect the distribution of mean loan amount per bank
plot3 <- ggplot(data = banks.to.be.clustered, mapping = aes(x = mean.la)) +
  geom_histogram()

plot1 + plot2 + plot3

**Figure 1.** Screen capture of RStudio’s console output displaying the summary statistics for the variables percentage of accepted home loans per bank (pct.accepted), mean interest rate per bank (mean.ir) and mean loan amount per bank (mean.la). Results are shown for 2,377 financial institutions that processed home loans from latinx applicants.

**Figure 2.** Histograms displaying the distribution of values for the variables percentage of accepted home loans per bank (pct.accepted, left plot), mean interest rate per bank (mean.ir, middle plot) and mean loan amount per bank (mean.la, right plot). Results are based on 2,377 financial institutions that processed home loans from latinx applicants.
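The per-bank variables built above can also be sketched in a single pipeline. The example below uses a made-up `loans` table (all values are hypothetical, and the "Exempt" entry is only an illustrative stand-in for a non-numeric interest rate). It is an alternative formulation rather than the code used for the results in this article. It also shows that `mean()` returns `NA` when any accepted loan lacks a numeric rate, which may be one reason banks are dropped by the `na.omit()` step; that explanation is an assumption on my part.

```r
library(dplyr)

# toy stand-in for the HMDA records (hypothetical values);
# action_taken == 1 marks an accepted loan, as in the code above
loans <- data.frame(
  respondent_name = c("Bank A", "Bank A", "Bank A", "Bank A",
                      "Bank B", "Bank B", "Bank C"),
  action_taken    = c(1, 1, 3, 1, 1, 3, 3),
  interest_rate   = c("3.0", "3.5", NA, "Exempt", "2.5", NA, NA),
  loan_amount     = c(200000, 300000, 150000, 250000, 100000, 120000, 90000)
)

# non-numeric entries become NA (R would normally emit a coercion warning)
loans$interest_rate <- suppressWarnings(as.numeric(loans$interest_rate))

per.bank <- loans %>%
  group_by(respondent_name) %>%
  summarize(
    c0           = n(),                     # all applications
    c1           = sum(action_taken == 1),  # accepted applications
    pct.accepted = 100 * c1 / c0,
    # NA as soon as one accepted loan has a missing rate
    mean.ir      = mean(interest_rate[action_taken == 1]),
    # same mean, ignoring missing rates
    mean.ir.narm = mean(interest_rate[action_taken == 1], na.rm = TRUE),
    mean.la      = mean(loan_amount[action_taken == 1])
  ) %>%
  filter(c1 > 0) %>%   # keep banks with at least one accepted application
  arrange(desc(c0))

per.bank
```

Whether to pass `na.rm = TRUE` is a modelling decision: it keeps more banks in the analysis, but their mean interest rate is then based on only a subset of their accepted loans.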

IMPLEMENTATION OF AGGLOMERATIVE HIERARCHICAL CLUSTERING

Because agglomerative hierarchical clustering works best when the number of observations is under 150, I selected the top100 financial institutions in terms of the number of processed home loan applications submitted by latinxs (Figure 3).

### HIERARCHICAL CLUSTERING

# consider the top100 financial institutions in terms of processed home loans (c0 variable)
top100 <- banks.to.be.clustered[1:100, ]
View(top100)

# visualize these
ggplot(data = top100) +
  geom_col(mapping = aes(x = reorder(respondent_name, c0), y = c0)) +
  coord_flip() +
  theme_bw()

# inspect summary statistics
summary(top100[, c(4:6)])
# the median for percentage of accepted loan applications per bank is: 55.9%
# the median interest rate per bank is: 3.0%
# the median loan amount per bank is: $267,447

# remove columns c0 and c1
top100 <- top100[, -c(2,3)]
View(top100)

**Figure 3.** The top100 financial institutions plotted according to the number of home loan applications from latinxs that were processed during 2021. United Wholesale Mortgage was the first in the list with 112,248 home loans, whereas America First was the last on the list with 3,054 home loans.

I subsequently scaled the data and performed the clustering, obtaining the results shown in Figure 4. The dendrogram displays two big clusters: the leftmost cluster is composed of 5 financial institutions (Discover Bank, 21st Mortgage Corporation, Triad Financial Services, Vanderbilt Mortgage and Finance, and Credit Human Federal Credit Union), whereas the second cluster is composed of the other 95 financial institutions. It was striking to find that the median value for the percentage of accepted loans was 13.95% for these 5 financial institutions, whereas the median value was 55.9% when all top100 financial institutions were considered (see the code segment above and Figure 5). The median value for mean interest rate was about 6.9% for the 5 institutions composing the leftmost cluster in Figure 4, whereas the median value for the top100 financial institutions was 3.0%. These results indicated that the institutions composing the leftmost cluster were similar in offering less favorable home loan terms to latinx people.

# scale numeric variables prior to clustering
# the scale() function standardizes the variables to a mean of 0 and standard deviation of 1
# (x-mean(x)) / sd(x)
top100.scaled <- scale(top100[, c(2,3,4)])
View(top100.scaled)
# add bank common names as row names to scaled data frame
row.names(top100.scaled) <- top100$respondent_name
View(top100.scaled)

# calculate distance
?dist()
distance <- dist(top100.scaled, method = "euclidean")
# implement clustering
hc.top100.scaled <- hclust(distance, method = "average")
# visualize dendrogram
plot(hc.top100.scaled, hang = -1, cex = 0.6, main = "Hierarchical Clustering of top100 financial institutions")

**Figure 4.** Dendrogram depicting the results from hierarchical clustering. Notice the leftmost group or cluster consisting of 5 financial institutions (Discover Bank, 21st Mortgage Corporation, Triad Financial Services, Vanderbilt Mortgage and Finance, Credit Human Federal Credit Union).
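The scaling and clustering steps can be illustrated on a small made-up table (the `toy` data frame and its six banks are hypothetical). The sketch checks that `scale()` matches the manual formula and then uses `cutree()` to extract group membership from the tree. Using `cutree()` is an addition for illustration; in this article the clusters were read directly off the dendrogram.

```r
# toy stand-in for the per-bank table (hypothetical values)
toy <- data.frame(
  pct.accepted = c(60, 55, 62, 58, 12, 15),
  mean.ir      = c(3.0, 3.1, 2.9, 3.2, 7.0, 6.5),
  mean.la      = c(250000, 270000, 260000, 240000, 85000, 90000),
  row.names    = c("Bank A", "Bank B", "Bank C", "Bank D", "Bank E", "Bank F")
)

toy.scaled <- scale(toy)

# scale() matches the manual formula (x - mean(x)) / sd(x), column by column
manual <- (toy$mean.ir - mean(toy$mean.ir)) / sd(toy$mean.ir)
scale.matches <- isTRUE(all.equal(as.numeric(toy.scaled[, "mean.ir"]), manual))

# average-linkage hierarchical clustering on Euclidean distances
hc <- hclust(dist(toy.scaled, method = "euclidean"), method = "average")

# cut the dendrogram into two groups instead of reading them off the plot
groups <- cutree(hc, k = 2)
split(names(groups), groups)

# median of each variable per group, on the original (unscaled) values
aggregate(toy, by = list(cluster = groups), FUN = median)
```

Scaling matters here because loan amounts are several orders of magnitude larger than interest rates; without it, the Euclidean distance would be dominated by mean.la.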

# inspect metrics for leftmost cluster
financial_institution <- c("Discover_Bank", "21st_Mortgage_Corp.", "Triad_Financial", "Vanderbilt_Mortgage", "Credit_Human_Federal")
pct_accepted_loans <- c(8.709925, 15.14854, 15.84744, 13.95384, 13.89522)
mean_interest_rate <- c(6.877177, 7.903087, 7.565589, 6.404317, 5.461475)
mean_loan_amount <- c(86579.79, 89930.14, 79428.49, 79428.49, 98302.11)
# create data frame with financial institutions from leftmost cluster
leftmost_cluster <- data.frame(financial_institution, pct_accepted_loans, mean_interest_rate, mean_loan_amount)
leftmost_cluster
# inspect summary statistics
summary(leftmost_cluster[, c(2,3,4)])

**Figure 5.** Screen capture of RStudio’s console output depicting actual values (top) and summary statistics (bottom) for the three variables of interest obtained from the 5 financial institutions composing the leftmost cluster shown in **Figure 4**.

In summary, by selecting the top100 financial institutions in terms of processed loans from latinxs during 2021, and by implementing agglomerative hierarchical clustering, I was able to obtain two big clusters of financial institutions, with the leftmost cluster composed of 5 financial institutions that displayed a lower median value for percentage of accepted home loans and a higher median value for mean interest rate than the top100 as a whole.

IMPLEMENTATION OF PARTITIONING AROUND MEDOIDS (PAM) CLUSTERING

For the implementation of partitioning around medoids (PAM) I focused on those financial institutions that processed at least 1,000 home loan applications from latinxs during 2021, which were 245 institutions in total. Using the NbClust( ) function from the NbClust package I determined that the best number for k was 2; the resulting bivariate plot is shown in Figure 6. Two distinct clusters were observed: one cluster contained 139 financial institutions, whereas the other contained 106 institutions.

### PARTITIONING CLUSTERING

# consider financial institutions that processed at least 1,000 home loan applications
top245 <- filter(banks.to.be.clustered, c0 >= 1000)
View(top245)

# scale data
top245.scaled <- scale(top245[, c(4,5,6)])
# add row names
row.names(top245.scaled) <- top245$respondent_name
View(top245.scaled)

# determine the best number of clusters
?NbClust()
nc <- NbClust(top245.scaled, distance = "euclidean", min.nc = 2, max.nc = 15, method = "kmeans") # 2 clusters were suggested

# PAM: Partitioning Around Medoids
set.seed(1234)
?pam()
# fit PAM
fit.pam <- pam(top245.scaled, k = 2, metric = "euclidean", stand = FALSE)
# identify Medoids
fit.pam$medoids
# visualize partition in bivariate plot
clusplot(fit.pam)
# obtain info about clusters
fit.pam$clusinfo # one cluster contains 139 financial institutions whereas the other contains 106 institutions
# inspect cluster assignment for each financial institution
fit.pam$clustering

**Figure 6.** Bivariate plot depicting results from partitioning around medoids clustering. The 245 financial institutions that processed at least 1,000 home loan applications from latinxs were partitioned into two big clusters (shown as triangles and circles within the ellipses).
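As a complementary check on the choice of k, the average silhouette width reported by `pam()` can be compared across candidate values. The sketch below does this on made-up data (the `toy` data frame is hypothetical). It is a different criterion from the indices computed by NbClust( ) and is not the procedure used for the results in this article.

```r
library(cluster)  # provides pam()

# toy data (hypothetical): two well-separated groups of five observations
toy <- data.frame(
  x = c(1.0, 1.2, 0.8, 1.1, 0.9, 10.0, 10.2, 9.8, 10.1, 9.9),
  y = c(1.0, 0.9, 1.1, 1.2, 0.8, 10.0, 9.9, 10.1, 10.2, 9.8)
)
toy.scaled <- scale(toy)

# average silhouette width for each candidate k (higher means more cohesive)
candidate.k <- 2:4
avg.sil <- sapply(candidate.k, function(k) {
  pam(toy.scaled, k = k, metric = "euclidean")$silinfo$avg.width
})
names(avg.sil) <- candidate.k
best.k <- candidate.k[which.max(avg.sil)]

# fit PAM with the selected k and inspect medoids and cluster sizes
fit <- pam(toy.scaled, k = best.k, metric = "euclidean")
fit$medoids
fit$clusinfo
```

Unlike k-means centroids, the medoids returned by `pam()` are actual observations, so each cluster is represented by a real financial institution.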

I plotted the financial institutions along the three variables of interest in a 3D plot and colored them according to their cluster assignment (Figure 7). I could observe that financial institutions belonging to cluster 1 (in blue color) displayed higher percentage of accepted loans and higher mean loan amounts compared to institutions in cluster 2 (in red color). These results were confirmed by visualizing the distribution of these variables per cluster using box plots (Figure 8). The distribution of values for mean interest rate per bank was similar for both clusters.

# create a new variable in top245 data frame that contains cluster assignments
top245 <- mutate(top245, cluster_assignment = fit.pam$clustering)
View(top245)

# visualize banks in 3D colored by cluster assignments
x <- top245$pct.accepted
y <- top245$mean.ir  
z <- top245$mean.la
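# define the requested size of the graphics device
# (note: screen devices generally interpret width and height in inches, not pixels)
width <- 10000
height <- 10000
# open a new graphics device with the requested width and height
dev.new(width = width, height = height)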
# create scatter plot in 3D
scatter3D(x, y, z, colvar = top245$cluster_assignment,
          bty = "g", xlab = "percentage of accepted home loans", ylab = "mean interest rate", zlab = "mean loan amount",
          colkey = list(side = 1, length = 0.5),
          phi = 0,
          pch = 20,
          cex = 3,
          ticktype = "detailed", theta = 40)
text3D(x, y, z, labels = top245$respondent_name, add = TRUE, cex = 0.6)
# save plot to file
filename <- "3Dplot_top245_financial_institutions.clustered.png"
dev.copy(png, filename)
dev.off()

# visualize distribution of variables per cluster
# percentage of accepted home loans
plot4 <- ggplot(data = top245, aes(x = factor(cluster_assignment), y = pct.accepted)) +
  geom_boxplot()
# mean interest rate
plot5 <- ggplot(data = top245, aes(x = factor(cluster_assignment), y = mean.ir)) +
  geom_boxplot()
# mean loan amount
plot6 <- ggplot(data = top245, aes(x = factor(cluster_assignment), y = mean.la)) +
  geom_boxplot()
# view all three plots
plot4 + plot5 + plot6

**Figure 7.** 3D plot depicting financial institutions colored according to their cluster assignment (blue = cluster 1 and red = cluster 2).

**Figure 8.** Distribution of values for the percentage of accepted home loans per bank for clusters 1 and 2 (left plot), mean interest rate per bank for clusters 1 and 2 (middle plot) and mean loan amount per bank for cluster 1 and 2 (right plot).
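The comparison shown in the box plots can also be summarized numerically. The sketch below computes per-cluster medians on a made-up table (the `toy` data frame and its values are hypothetical, with column names matching those used above).

```r
library(dplyr)

# toy stand-in for top245 with cluster assignments (hypothetical values)
toy <- data.frame(
  respondent_name    = paste("Bank", LETTERS[1:6]),
  pct.accepted       = c(70, 65, 72, 40, 35, 45),
  mean.ir            = c(3.1, 3.0, 3.2, 3.1, 3.3, 3.0),
  mean.la            = c(300000, 280000, 320000, 150000, 140000, 160000),
  cluster_assignment = c(1, 1, 1, 2, 2, 2)
)

# number of banks and median of each variable per cluster
cluster.summary <- toy %>%
  group_by(cluster_assignment) %>%
  summarize(
    n.banks             = n(),
    median.pct.accepted = median(pct.accepted),
    median.mean.ir      = median(mean.ir),
    median.mean.la      = median(mean.la)
  )

cluster.summary
```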

WHERE TO APPLY FOR A HOME LOAN IF YOU ARE LATINX?

If you are latinx, at which financial institutions should you apply for a home loan in the United States? Based on the work I presented here, I suggest that you apply for home loans at the institutions forming cluster 1, in particular those with a higher percentage of accepted loans and lower interest rates. To visualize this, I plotted the financial institutions assigned to cluster 1 according to the percentage of accepted home loans and mean interest rate per bank (Figure 9A) and focused on those with low values for mean interest rate (Figure 9B). Altogether they constitute a valuable resource of financial knowledge for the latinx community.

# show financial institutions from cluster 1
cluster1.banks <- top245[top245$cluster_assignment == 1, ]
View(cluster1.banks)

# create new variable distinguishing banks with mean interest rates higher than 3.5%
cluster1.banks <- cluster1.banks %>%
  mutate(banks_high_ir = ifelse(mean.ir > 3.5, ">3.5%", "<= 3.5%"))

# visualize them based on their percentage of accepted home loans and interest rates
ggplot(data = cluster1.banks, aes(x = pct.accepted, y = mean.ir, color = banks_high_ir)) +
  scale_color_manual(values = met.brewer("Demuth", 2)) +
  geom_point() +
  geom_rug() +
  geom_text_repel(aes(label = respondent_name), size = 2.5) +
  theme(legend.position = "none")
  
# select financial institutions from cluster 1 with interest rates below 3.5
cluster1.low.ir <- filter(cluster1.banks, mean.ir < 3.5)
cluster1.low.ir$respondent_name
# visualize banks from cluster 1 according to percentage of accepted loans and mean interest rate per bank
ggplot(data = cluster1.low.ir, aes(x = pct.accepted, y = mean.ir, color = banks_high_ir)) +
  scale_color_manual(values = met.brewer("Demuth", 2)) +
  geom_point() +
  geom_rug() +
  geom_label_repel(aes(label = respondent_name), size = 2.5) +
  theme(legend.position = "none")

**Figure 9A.** Scatter and rug plot depicting financial institutions from cluster 1 according to their percentage of accepted loans and their mean interest rates. Financial institutions with mean interest rate above 3.5% were colored grey, otherwise brown.

**Figure 9B.** Scatter and rug plot showing financial institutions from Figure 9A with mean interest rates below 3.5%.
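Beyond plotting, the low-interest institutions can be turned into a ranked shortlist. The sketch below does this on a made-up table (the `toy` data frame, bank names and values are hypothetical). Ranking by acceptance percentage first is an illustrative choice, not a result of this analysis.

```r
library(dplyr)

# toy stand-in for cluster1.banks (hypothetical values)
toy <- data.frame(
  respondent_name = c("Bank A", "Bank B", "Bank C", "Bank D"),
  pct.accepted    = c(72, 80, 65, 78),
  mean.ir         = c(3.1, 3.6, 2.9, 3.0)
)

# keep banks with mean interest rate below 3.5%, then rank by acceptance
# percentage (highest first), breaking ties by lower mean interest rate
shortlist <- toy %>%
  filter(mean.ir < 3.5) %>%
  arrange(desc(pct.accepted), mean.ir)

shortlist$respondent_name
```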

I reasoned in a similar manner for the financial institutions comprising cluster 2, visualizing them according to the percentage of accepted loans in relation to their mean interest rates (Figure 10). It was interesting to find that the same group of 5 financial institutions identified earlier with hierarchical clustering not only formed part of cluster 2, but were quite distinct as well in relation to the other banks (Figure 10A).

### show financial institutions from cluster 2
cluster2.banks <- top245[top245$cluster_assignment == 2, ]
View(cluster2.banks)

# create new variable distinguishing banks with mean interest rates higher than 3.5%
cluster2.banks <- cluster2.banks %>%
  mutate(banks_high_ir = ifelse(mean.ir > 3.5, ">3.5%", "<= 3.5%"))

# visualize them based on their percentage of accepted home loans and interest rates
ggplot(data = cluster2.banks, aes(x = pct.accepted, y = mean.ir, color = banks_high_ir)) +
  scale_color_manual(values = met.brewer("Demuth", 2)) +
  geom_point() +
  geom_rug() +
  geom_text_repel(aes(label = respondent_name), size = 2.5) +
  theme(legend.position = "none")

# select financial institutions from cluster 2 with interest rates below 3.5
cluster2.low.ir <- filter(cluster2.banks, mean.ir < 3.5)
cluster2.low.ir$respondent_name
# visualize banks from cluster 2 according to percentage of accepted loans and mean interest rate per bank
ggplot(data = cluster2.low.ir, aes(x = pct.accepted, y = mean.ir, color = banks_high_ir)) +
  scale_color_manual(values = met.brewer("Demuth", 2)) +
  geom_point() +
  geom_rug() +
  geom_label_repel(aes(label = respondent_name), size = 2.5) +
  theme(legend.position = "none")

**Figure 10A.** Scatter and rug plot depicting financial institutions from cluster 2 according to their percentage of accepted loans and their mean interest rates. Financial institutions with mean interest rate above 3.5% were colored grey, otherwise brown.

**Figure 10B.** Scatter and rug plot showing financial institutions from Figure 10A with mean interest rates below 3.5%.

CONCLUSION

My work showed the usefulness of cluster analysis in identifying financial institutions with similarly unfavorable or favorable loan terms for the latinx population, and thus constitutes a valuable source of financial knowledge for this societal group. This work can be replicated at the state level for a more granular analysis. Furthermore, home loan data on African Americans or any other minority group can also be analyzed in order to provide useful information on banks. I hope this work helps inform latinxs on issues pertaining to home ownership, providing an overall view of social justice from a financial perspective.

REFERENCES

R in Action. www.manning.com

Impressive package for 3D and 4D graph - R software and data visualization. www.sthda.com

ABOUT THE AUTHOR

Martin Calvino is a Visiting Professor at Torcuato Di Tella University; a Computational Biologist at The Human Genetics Institute of New Jersey — Rutgers University; and a Multimedia Artist. You can follow him on Instagram: @from.data.to.art

THE CODE ASSOCIATED WITH THIS WORK as well as the datasets can be accessed from my GitHub page at:

GitHub …

github.com