Comparative Text Analysis in R: the LatinX Term According to the Printed Press (part 2)

Martin Calvino
17 min read · Sep 26, 2023


In this work I continued my analysis of news articles from six newspapers (The Washington Post, The New York Times, Los Angeles Times, Houston Chronicle, Miami Herald and Chicago Tribune) that mentioned the latinx term.

Image created by Martin Calvino using DCGAN algorithm

COMPARISON OF WORD FREQUENCY ACROSS NEWSPAPERS

In my previous work (see references) I found that The Washington Post and the Miami Herald shared views on the latinx term with respect to the countries mentioned, the topics extracted, and the most frequent words-in-context associated with latinx; both newspapers mentioned latinx within political and identity topics. But what about the full set of words contained in their published news articles? Do these newspapers draw on the same vocabulary? I compared their word frequencies and quantified (using correlation tests) how similar these sets of frequencies were, not only between these two newspapers but also in relation to the other newspapers in my study. My results are shown in Figure 1. The Houston Chronicle displayed the highest correlation (largest Pearson's correlation coefficient) in word frequency with the Miami Herald. Surprisingly, The Washington Post displayed the lowest correlation with the Miami Herald: although these two newspapers shared a similar framing of latinx within political and identity topics, they used words at very different frequencies (the number of times each word is mentioned, adjusted for text length).
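Before walking through the full pipeline below, here is a minimal sketch of the underlying idea, using made-up counts for two hypothetical newspapers (the object names and numbers are illustrative only): word counts are normalized into per-newspaper proportions, and the two proportion vectors are then compared with cor.test().

# toy illustration (not the actual corpus): word counts normalized to proportions,
# then Pearson's correlation between the two newspapers' proportion vectors
library(tidyverse)

toy_counts <- tribble(
  ~author,       ~word,      ~n,
  "Newspaper A", "latinx",   10,
  "Newspaper A", "voters",    6,
  "Newspaper A", "identity",  4,
  "Newspaper B", "latinx",    8,
  "Newspaper B", "voters",    7,
  "Newspaper B", "identity",  5
)

toy_props <- toy_counts %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%   # frequency adjusted by text length
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion)

cor.test(toy_props$`Newspaper A`, toy_props$`Newspaper B`)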

# load libraries
library(tidyverse)
library(tidytext)
library(MetBrewer)
library(scales)
library(igraph)
library(ggraph)


# write function to show directory path for each newspaper .txt document
show_files <- function(directory_path, pattern = "\\.txt$") {
  file_name_v <- dir(directory_path, pattern, full.names = TRUE)
  for (i in seq_along(file_name_v)) {
    cat(i, file_name_v[i], "\n", sep = " ")
  }
}

# directory path to folder (latinx2022) containing .txt documents (data)
my_folder <- "/home/calviot/Desktop/latinx2022/data"

# show directory paths
show_files(my_folder)

# load .txt file containing news articles from Chicago Tribune
ct_text <- scan("/home/calviot/Desktop/latinx2022/data/ChicagoTribune.txt", what = "character", sep = "\n")
length(ct_text)
ct_text
# convert to lower case
ct_text <- tolower(ct_text)
ct_text
# create tibble
ct_text_df <- tibble(line = 1:length(ct_text), text = ct_text)
View(ct_text_df)

# access dataset with stop words from the tidytext package
data("stop_words", package = "tidytext")

# Chicago Tribune as tidytext as "ct_tt"
ct_tt <- ct_text_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)

View(ct_tt) # is in tidytext format (one token per row)


# create a function to do all the above, and apply it to the other 5 newspapers' .txt documents
tidy_newspaper_text <- function(directory_path) {
  newspaper_text <- scan(directory_path, what = "character", sep = "\n")
  newspaper_text <- tolower(newspaper_text)
  newspaper_text <- tibble(line = 1:length(newspaper_text), text = newspaper_text)
  newspaper_tidy_text <- newspaper_text %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words)
  return(newspaper_tidy_text)
}

# call the tidy_newspaper_text() function on other newspapers' .txt files

# Houston Chronicle as tidytext as "hc_tt"
hc_tt <- tidy_newspaper_text("/home/calviot/Desktop/latinx2022/data/HoustonChronicle.txt")
View(hc_tt)
# Los Angeles Times as tidytext as "lat_tt"
lat_tt <- tidy_newspaper_text("/home/calviot/Desktop/latinx2022/data/LATimes.txt")
View(lat_tt)
# Miami Herald as tidytext as "mh_tt"
mi_tt <- tidy_newspaper_text("/home/calviot/Desktop/latinx2022/data/MiamiHerald.txt")
View(mi_tt)
# The New York Times as tidytext as "nyt_tt"
nyt_tt <- tidy_newspaper_text("/home/calviot/Desktop/latinx2022/data/NYTimes.txt")
View(nyt_tt)
# The Washington Post as tidytext as "wp_tt"
wp_tt <- tidy_newspaper_text("/home/calviot/Desktop/latinx2022/data/WashingtonPost.txt")
View(wp_tt)


################################################################################
### Text Comparison based on word frequency usage

# calculate frequency for each word for each newspaper and place it into a single data frame
word_frequency.1 <- bind_rows(mutate(mi_tt, author = "Miami Herald"),
                              mutate(lat_tt, author = "Los Angeles Times"),
                              mutate(ct_tt, author = "Chicago Tribune"),
                              mutate(hc_tt, author = "Houston Chronicle"),
                              mutate(wp_tt, author = "The Washington Post"),
                              mutate(nyt_tt, author = "The New York Times"))

View(word_frequency.1) # 170,657 rows
sum(is.na(word_frequency.1))

word_frequency.1 <- mutate(word_frequency.1, word = str_extract(word, "[^.,\\d]+")) # remove periods, commas and digits from the "word" column
View(word_frequency.1)
sum(is.na(word_frequency.1)) # 4,591 NAs

word_frequency.2 <- word_frequency.1 %>%
count(author, word)
View(word_frequency.2)

word_frequency.3 <- word_frequency.2 %>%
group_by(author) %>%
mutate(proportion = n/sum(n)) %>%
select(-n)
View(word_frequency.3)

word_frequency.4 <- word_frequency.3 %>%
spread(author, proportion) %>%
select(word, `Miami Herald`, everything())
View(word_frequency.4)

word_frequency.5 <- word_frequency.4 %>%
gather(`Chicago Tribune`:`The Washington Post`, key = author, value = proportion)
View(word_frequency.5)

# plot results
ggplot(word_frequency.5, aes(x = proportion, y = `Miami Herald`, color = abs(`Miami Herald` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 5) +
  labs(y = "Miami Herald", x = NULL) +
  theme_bw() +
  # the complete theme_bw() goes before theme() so legend.position is not reset
  theme(legend.position = "none")

# save plot as .svg and modify it in Adobe Illustrator for better visualization
ggsave("comparative_word_frequency.svg", width = 5000, height = 3000, units = "px")

# how correlated are the word frequencies between the Miami Herald and each of the other newspapers?
cor.test(data = word_frequency.5[word_frequency.5$author == "Chicago Tribune",], ~ proportion + `Miami Herald`) # 0.867
cor.test(data = word_frequency.5[word_frequency.5$author == "Houston Chronicle",], ~ proportion + `Miami Herald`) # 0.881
cor.test(data = word_frequency.5[word_frequency.5$author == "Los Angeles Times",], ~ proportion + `Miami Herald`) # 0.845
cor.test(data = word_frequency.5[word_frequency.5$author == "The New York Times",], ~ proportion + `Miami Herald`) # 0.814
cor.test(data = word_frequency.5[word_frequency.5$author == "The Washington Post",], ~ proportion + `Miami Herald`) # 0.733

Figure 1. Comparative word frequency for five newspapers (x-axis) relative to the Miami Herald (y-axis). The Pearson's correlation coefficient for each newspaper pair is shown. Words close to the dotted diagonal line have similar frequencies in both sets of texts (each composed of news articles); words far from the diagonal occur more frequently in one set of texts than in the other.

QUANTIFYING WHAT NEWS ARTICLES MENTIONING THE LATINX TERM WERE ABOUT: term frequency — inverse document frequency (tf-idf)

Although my correlation analysis of word frequencies (Figure 1) highlighted the similarities between the news articles of the Miami Herald and the Houston Chronicle, it was not a useful approach for assessing what these newspapers wrote about. The statistic tf-idf (term frequency-inverse document frequency) measures how important a word is to a text within a collection of texts; for example, the importance of a word to the document containing the Miami Herald's news articles relative to the collection of documents containing the other five newspapers' articles. The inverse document frequency (idf) decreases the weight of commonly used words and increases the weight of words that are rare across the collection of documents; the idf is then multiplied by the term frequency (tf) to obtain a word's tf-idf, that is, the frequency of a word adjusted for how rarely it is used.
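For intuition, here is a minimal, hand-rolled version of that calculation on a made-up two-document corpus (the documents and counts are invented for illustration); bind_tf_idf() from tidytext, used further below, performs the equivalent computation directly on the newspaper documents.

# toy tf-idf calculation: tf = word count / total words in the document;
# idf = log(total documents / documents containing the word); tf-idf = tf * idf
library(tidyverse)

toy <- tribble(
  ~document, ~word,     ~n,
  "doc1",    "latinx",   5,
  "doc1",    "housing",  2,
  "doc2",    "latinx",   3,
  "doc2",    "museum",   4
)

toy %>%
  group_by(document) %>%
  mutate(tf = n / sum(n)) %>%     # term frequency within each document
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n_distinct(toy$document) / n_distinct(document))) %>%  # rarer words get a larger idf
  ungroup() %>%
  mutate(tf_idf = tf * idf)       # "latinx" appears in both documents, so its tf-idf is 0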

I first analyzed the term frequency distribution for each newspaper (Figure 2) and found that many words were mentioned at low frequencies whereas few words occurred at high frequencies (long tails to the right in each plot). Whereas the word frequency distributions for the Chicago Tribune, Houston Chronicle, Los Angeles Times and The Washington Post displayed long right tails, the distributions for the Miami Herald and The New York Times were more compact.

# How to quantify what a newspaper text is about?
# Term frequency - inverse document frequency tf-idf

# commonly used words
newspaper_words <- word_frequency.1 %>%
count(author, word, sort = TRUE) %>%
ungroup()

View(newspaper_words)

# total words per newspaper
newspaper_total_words <- newspaper_words %>%
group_by(author) %>%
summarize(total = sum(n))

View(newspaper_total_words)

# inspect the distribution of n/total newspaper: the number of times a word appears in a newspaper divided by the total
# number of words in that newspaper
news_words <- left_join(newspaper_words, newspaper_total_words)
View(news_words)

ggplot(news_words, aes(n/total, fill = author)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.004) +
  facet_wrap(~author, ncol = 3, scales = "free_y") +
  theme_bw()
Figure 2. Histograms displaying the term frequency distribution for each newspaper. Each bar height represents the number of words at any given frequency.

I then calculated the tf-idf for each newspaper and visualized the words with the highest values; these words were important to the content of each newspaper's document (its collection of news articles put together). As shown in Figure 3, the characteristic words for each newspaper included names (the persons they wrote about), places and nouns. In the Chicago Tribune, for instance, there were words related to religion and to landlord/tenant renting issues, whereas for The New York Times the words had to do with Taller Boricua, Museo del Barrio, and the Puerto Rican experience in New York City in relation to artistic matters.

# calculate tf-idf
news_words_tf_idf <- news_words %>%
bind_tf_idf(word, author, n)

View(news_words_tf_idf) # words that appear in all newspapers' documents have idf = 0

# visualize words with high tf-idf values for each newspaper
news_words_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(author) %>%
  top_n(25) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = author)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf_idf") +
  facet_wrap(~author, ncol = 3, scales = "free") +
  coord_flip() +
  theme_bw()
Figure 3. Top 25 characteristic words based on tf-idf values for each newspaper.

IDENTIFYING MOST POSITIVE & NEGATIVE WORDS: SENTIMENT ANALYSIS

I was interested in analyzing the emotional intent of words (sentiment) in news articles across newspapers, and I found it intriguing to compare the usage of positive and negative words among them. Here, the sentiment content of sections of text comprising 5 lines each was assessed as the sum of the sentiment content of the individual words within each section. I used the Bing sentiment lexicon (see references), which categorizes words as positive or negative. I first assessed sentiment across narrative time (news articles ordered chronologically) for each newspaper (Figure 4) and found that the Miami Herald displayed the sections of text with the lowest sentiment scores among the newspapers, whereas The Washington Post displayed more sections of text with negative sentiment than with positive sentiment.

### Sentiment Analysis

# access the sentiments dataset from the tidytext package
sentiments # this is the Bing lexicon
# access the AFINN and NRC lexicon
get_sentiments("afinn")
get_sentiments("nrc")

# inspect negative words from the Bing lexicon
bing_negative <- get_sentiments("bing") %>%
filter(sentiment == "negative")

# concrete example: inspect negative words on Miami Herald
mi_negative_words <- mi_tt %>%
inner_join(bing_negative) %>%
count(word, sort = TRUE)
View(mi_negative_words)

# calculate sentiment score per 5 lines of text for each newspaper
bing_sentiment <- word_frequency.1 %>%
  inner_join(get_sentiments("bing")) %>%
  count(author, index = line %/% 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
View(bing_sentiment)

# plot sentiment scores across narrative time for each newspaper
ggplot(bing_sentiment, aes(index, sentiment, fill = author)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = met.brewer("Troy", 6)) +
  facet_wrap(~author, ncol = 2, scales = "free_x") +
  theme_bw()

# save plot as .svg and modify it in Adobe Illustrator for better visualization
ggsave("comparative_sentiment.svg", width = 4000, height = 3000, units = "px")
Figure 4. Sentiment scores for sections of text comprised of 5 lines each; news articles were ordered chronologically for each newspaper’s document. I did not take into account qualifiers before a word, thus “no good” or “not true” were scored individually as “no” + “good” or “not” + “true” respectively.

I then analyzed the top words that contributed to each sentiment (Figure 5) and found that words with negative sentiment such as racism/racist and poor/poverty/debt were shared among several newspapers, whereas positive words such as support and love were also present across newspapers. It was interesting to find that the word queer is labeled as negative in the Bing lexicon, exemplifying how problematic it is to apply general-purpose lexicons to domain-specific studies, and how inherent bias in computational analysis can be carried forward and affect real people if natural language processing systems for sentiment analysis built on the Bing lexicon were deployed in the market.

It was also interesting to note that in the Bing sentiment lexicon the word progressive is categorized as positive whereas the word conservative is categorized as negative, with both words mostly written about (in the news articles considered in my study) in relation to political issues. I would guess that Republican voters consider the word conservative to be positive.
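The lexicon itself can be inspected to verify these labels; this quick check only relies on get_sentiments(), which is already used in this section.

# check how the Bing lexicon labels the words discussed above
library(tidytext)
library(dplyr)

get_sentiments("bing") %>%
  filter(word %in% c("queer", "progressive", "conservative"))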

# identify most common positive and negative words for Miami Herald (A)
mi_word_counts <- mi_tt %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
View(mi_word_counts)

# plot the top 25 words contributing to each sentiment (negative vs. positive) in Miami Heralds' news articles (B)
mi_word_counts %>%
  group_by(sentiment) %>%
  top_n(25) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = met.brewer("Renoir", 2)) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "contribution to sentiment", x = NULL) +
  coord_flip() +
  theme_bw()

# save plot as .svg and modify it in Adobe Illustrator for better visualization (C)
ggsave("MiamiHerald_negative_positive_words.svg", width = 2000, height = 1500, units = "px")

# create a function that combines code from sections A, B, C
# top words contributing to sentiment as "top_words_to_sentiment"
top_words_to_sentiment <- function(newspaper_tidyText, my_plot_file_name) {
  # section A: count each word's contribution to positive and negative sentiment
  newspaper_tidyText_word_counts <- newspaper_tidyText %>%
    inner_join(get_sentiments("bing")) %>%
    count(word, sentiment, sort = TRUE) %>%
    ungroup()
  # section B: plot the top 25 words per sentiment
  my_plot <- newspaper_tidyText_word_counts %>%
    group_by(sentiment) %>%
    top_n(25) %>%
    ungroup() %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = sentiment)) +
    geom_col(show.legend = FALSE) +
    scale_fill_manual(values = met.brewer("Renoir", 2)) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "contribution to sentiment", x = NULL) +
    coord_flip() +
    theme_bw()
  # section C: save the plot, then return it (saving must happen before returning)
  ggsave(my_plot_file_name, plot = my_plot, width = 2000, height = 1500, units = "px")
  return(my_plot)
}

# top words contributing to sentiment in Chicago Tribune
top_words_to_sentiment(ct_tt, "ChicagoTribune_negative_positive_words.svg")
# top words contributing to sentiment in Houston Chronicle
top_words_to_sentiment(hc_tt, "HoustonChronicle_negative_positive_words.svg")
# top words contributing to sentiment in Los Angeles Times
top_words_to_sentiment(lat_tt, "LosAngelesTimes_negative_positive_words.svg")
# top words contributing to sentiment in The New York Times
top_words_to_sentiment(nyt_tt, "TheNewYorkTimes_negative_positive_words.svg")
# top words contributing to sentiment in The Washington Post
top_words_to_sentiment(wp_tt, "TheWashingtonPost_negative_positive_words.svg")

## import the above six .svg files into Adobe Illustrator to create a composite plot for visualizing results
Figure 5. Top 25 words (according to their number of mentions) that contributed the most to the sentiment score based on the Bing sentiment lexicon. Because “trump” is an English word and also the lower case version of the last name of president Donald Trump, it was scored as a positive word and greatly affected the sentiment score shown in Figure 4.

COMPARING BI-GRAMS ACROSS NEWSPAPERS

Bi-grams are pairs of adjacent words that occur repeatedly in a text. I was interested in exploring which words followed latinx (with stop_words removed) and how they differed across newspapers (Figure 6). When taking into account the top-10 bi-grams in which the first word was the latinx term, I observed that "latinx immigrants" and "latinx identity" were distinctively present in news articles from The Washington Post, whereas "latinx entrepreneurs" was distinctive of the Miami Herald.
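As a quick illustration of the tokenization step (the sentence here is made up), unnest_tokens() with token = "ngrams" and n = 2 produces overlapping pairs of adjacent words; the bigrams() function defined below applies exactly this to each newspaper's .txt file.

# toy example of bigram tokenization
library(tidytext)
library(dplyr)
library(tibble)

tibble(line = 1, text = "latinx voters in the midterm elections") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# yields overlapping bigrams: "latinx voters", "voters in", "in the", "the midterm", "midterm elections"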

### extract bi-grams

# tokenize text files by n-grams (bigrams)
bigrams <- function(directory_path) {
  newspaper_text <- scan(directory_path, what = "character", sep = "\n")
  newspaper_text <- tolower(newspaper_text)
  newspaper_text <- tibble(line = 1:length(newspaper_text), text = newspaper_text)
  newspaper_tidy_text <- newspaper_text %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2)
  return(newspaper_tidy_text)
}

# extract bigrams from The New York Times
bigrams_nyt <- bigrams("/home/calviot/Desktop/latinx2022/data/NYTimes.txt")
bigrams_nyt

# count bigrams present in The New York Times
bigrams_nyt %>%
count(bigram, sort = TRUE) # "of the" is the most common bigram

# separate bigrams into two columns: word1 and word2 in order to remove stop words from them
bigrams_nyt_separated <- bigrams_nyt %>%
separate(bigram, c("word1", "word2"), sep = " ")

bigrams_nyt_separated

# now filter out stop words
bigrams_nyt_filtered <- bigrams_nyt_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)

# now inspect most common bigrams (without stop words) in The New York Times
bigram_nyt_counts <- bigrams_nyt_filtered %>%
count(word1, word2, sort = TRUE)

bigram_nyt_counts # "latin american" is the most common bigram occurring 48 times

# inspect bigrams in which word1 == latinx
latinx_bigrams <- bigrams_nyt_filtered %>%
filter(word1 == "latinx") %>%
count(word1, word2, sort = TRUE)

View(latinx_bigrams)

# inspect bigrams in which word1 == ms
# this is useful for identifying female persons mentioned (a proxy for named entity recognition)
female_latinx_bigrams <- bigrams_nyt_filtered %>%
filter(word1 == "ms") %>%
count(word1, word2, sort = TRUE)

View(female_latinx_bigrams)
Figure 6A. Bi-grams containing the term latinx as the first word occurring in news articles from the Chicago Tribune. The top 10 occurrences are shown.
Figure 6B. Bi-grams containing the term latinx as the first word occurring in news articles from the Houston Chronicle. The top 10 occurrences are shown.
Figure 6C. Bi-grams containing the term latinx as the first word occurring in news articles from Los Angeles Times. The top 10 occurrences are shown.
Figure 6D. Bi-grams containing the term latinx as the first word occurring in news articles from the Miami Herald. The top 10 occurrences are shown.
Figure 6E. Bi-grams containing the term latinx as the first word occurring in news articles from The New York Times. The top 10 occurrences are shown.
Figure 6F. Bi-grams containing the term latinx as the first word occurring in news articles from The Washington Post. The top 10 occurrences are shown.

I then proceeded to visualize the network of all bi-grams that occurred at least three times for each newspaper’s document (Figure 7).

# visualize network of bigrams (occurring 3 or more times) for The New York Times
# create an igraph object with graph_from_data_frame() function
bigram_nyt_graph <- bigram_nyt_counts %>%
filter(n >= 3) %>%
graph_from_data_frame()

bigram_nyt_graph

# visualize network with ggraph()
set.seed(2023)
ggraph(bigram_nyt_graph, layout = "fr") +
geom_edge_link(edge_width = 1) +
geom_node_point(color = "orange", size = 4.5) +
geom_node_text(aes(label = name)) +
theme_void()

ggsave("nyt_bigram_plot.svg", width = 1000, height = 1000, units = "px")


## create a function that extracts bigrams in which word1 == latinx
word1_latinx_bigrams <- function(dataset) {
  bigrams_newspaper <- bigrams(dataset) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
    filter(word1 == "latinx") %>%
    count(word1, word2, sort = TRUE)
  return(bigrams_newspaper)
}

# The New York Times latinx bigrams
word1_latinx_bigrams("/home/calviot/Desktop/latinx2022/data/NYTimes.txt")
# Chicago Tribune latinx bigrams
word1_latinx_bigrams("/home/calviot/Desktop/latinx2022/data/ChicagoTribune.txt")
# Houston Chronicle latinx bigrams
word1_latinx_bigrams("/home/calviot/Desktop/latinx2022/data/HoustonChronicle.txt")
# Los Angeles Times latinx bigrams
word1_latinx_bigrams("/home/calviot/Desktop/latinx2022/data/LATimes.txt")
# Miami Herald latinx bigrams
word1_latinx_bigrams("/home/calviot/Desktop/latinx2022/data/MiamiHerald.txt")
# The Washington Post latinx bigrams
word1_latinx_bigrams("/home/calviot/Desktop/latinx2022/data/WashingtonPost.txt")


## create a function that visualizes bigrams
visualize_bigrams <- function(dataset) {
  bigrams_newspaper <- bigrams(dataset) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
    count(word1, word2, sort = TRUE) %>%
    filter(n >= 3) %>%
    graph_from_data_frame()

  my_bigram_network_plot <- ggraph(bigrams_newspaper, layout = "fr") +
    geom_edge_link() +
    geom_node_point(color = "orange", size = 4.5) +
    geom_node_text(aes(label = name)) +
    theme_void()

  return(my_bigram_network_plot)
}

# The New York Times bigrams' visualization
visualize_bigrams("/home/calviot/Desktop/latinx2022/data/NYTimes.txt")
# Chicago Tribune bigrams' visualization
visualize_bigrams("/home/calviot/Desktop/latinx2022/data/ChicagoTribune.txt")
# Houston Chronicle bigrams' visualization
visualize_bigrams("/home/calviot/Desktop/latinx2022/data/HoustonChronicle.txt")
# Los Angeles Times bigram's visualization
visualize_bigrams("/home/calviot/Desktop/latinx2022/data/LATimes.txt")
# Miami Herald bigrams' visualization
visualize_bigrams("/home/calviot/Desktop/latinx2022/data/MiamiHerald.txt")
# The Washington Post bigrams' visualization
visualize_bigrams("/home/calviot/Desktop/latinx2022/data/WashingtonPost.txt")

Figure 7A. Network of bi-grams visualization for the Chicago Tribune. We can observe bi-grams such as low-income, rental-assistance, and hit-pandemic that are related to one of the previously identified topics whose relative importance was most prominent in the Chicago Tribune (see references).
Figure 7B. Network of bi-grams visualization for the Houston Chronicle. We can observe bi-grams such as immigrant-families, mexico-border, tejano-music and environmental-racism, for instance.
Figure 7C. Network of bi-grams visualization for Los Angeles Times. We can observe bi-grams such as social-justice, border-protection, border-crossing, border-crisis, border patrol and immigrant-parents.
Figure 7D. Network of bi-grams visualization for Miami Herald. We can observe bi-grams such as systemic-racism, gender-neutral, safety-net, vaccination-rates and immigration-status.
Figure 7E. Network of bi-grams visualization for The New York Times. We can observe bi-grams such as poor-people, low-income, george-floyd, and anti-immigrant.
Figure 7F. Network of bi-grams visualization for The Washington Post. We can observe bi-grams such as cancel-culture, racial-identity, heritage-month and schools-reopening.

CORRELATING PAIRS OF WORDS ACROSS SECTIONS OF TEXT

I was interested in analyzing not only bi-grams but also pairs of words that co-occurred across sections of text, even when they were not located next to each other. I therefore calculated the correlation between pairs of co-occurring words within each 5-line section of text across news articles for each newspaper (Figure 8). This analysis revealed interesting results, as the words correlated with the terms "support" and "issues" varied across newspapers. The words correlated with "racism" and "queer" were also particularly interesting.
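The key tool here is pairwise_cor() from the widyr package (installed and loaded below), which correlates the presence or absence of two words across sections (the phi coefficient). Here is a minimal sketch on a few toy sections, with invented words, just to show its input and output shape.

# toy illustration of pairwise_cor(): words sharing sections get positive correlations
library(widyr)
library(dplyr)
library(tibble)

toy_sections <- tribble(
  ~section, ~word,
  1, "latinx",
  1, "voters",
  2, "latinx",
  2, "identity",
  3, "voters",
  3, "identity",
  4, "latinx",
  4, "voters"
)

toy_sections %>%
  pairwise_cor(word, section, sort = TRUE)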

## Correlating Pairs of Words across sections of text

install.packages("widyr", dependencies = TRUE)
library(widyr)

# what words tend to appear within 5-line sections of text?
# let's focus on the Chicago Tribune first as a concrete example
section_words <- word_frequency.1 %>%
  mutate(section = row_number() %/% 5) %>%
  filter(author == "Chicago Tribune") %>%
  filter(section > 0) %>%
  filter(!word %in% stop_words$word)

View(section_words)

# count words co-occurring within sections
word_pairs <- section_words %>%
pairwise_count(word, section, sort = TRUE)

View(word_pairs)

# which words co-occur with these terms within each 5-line section of text?
# "undocumented" was the word that contributed the most to negative sentiment in the Chicago Tribune
word_pairs %>%
  filter(item1 == "undocumented") %>%
  print(n = 25)

# "affordable" was the word that contributed the most to positive sentiment in the Chicago Tribune
word_pairs %>%
  filter(item1 == "affordable") %>%
  print(n = 25)


# find correlated words
word_cors <- section_words %>%
group_by(word) %>%
filter(n() >= 3) %>%
pairwise_cor(word, section, sort = TRUE)

View(word_cors)


# what are the most correlated words with "undocumented" and "affordable" in Chicago Tribune?
word_cors %>%
  filter(item1 %in% c("undocumented", "affordable")) %>%
  group_by(item1) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~item1, scales = "free") +
  coord_flip()

# visualize network of correlated words within 5-lines sections of text for Chicago Tribune
word_cors %>%
  na.omit() %>%
  filter(item1 %in% c("undocumented", "issues", "hard", "affordable", "support", "helped")) %>%
  filter(correlation > 0.01) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link() +
  geom_node_point(color = "orange", size = 4.5) +
  geom_node_text(aes(label = name)) +
  theme_void()


# create a function to identify the most correlated words within 5-line sections of text, and apply it to each newspaper
correlated_word_pairs <- function(newspaper_name, negative_term, positive_term) {
  section_words <- word_frequency.1 %>%
    mutate(section = row_number() %/% 5) %>%
    filter(author == newspaper_name) %>%
    filter(section > 0) %>%
    filter(!word %in% stop_words$word)

  word_cors <- section_words %>%
    group_by(word) %>%
    filter(n() >= 3) %>%
    pairwise_cor(word, section, sort = TRUE)

  word_cors %>%
    filter(item1 %in% c(negative_term, positive_term)) %>%
    group_by(item1) %>%
    top_n(15) %>%
    ungroup() %>%
    mutate(item2 = reorder(item2, correlation)) %>%
    ggplot(aes(item2, correlation)) +
    geom_bar(stat = "identity") +
    facet_wrap(~item1, scales = "free") +
    coord_flip()
}

# calculate correlated pairs of words and visualize them
correlated_word_pairs("Chicago Tribune", "undocumented", "affordable")
correlated_word_pairs("Houston Chronicle", "issues", "support")
correlated_word_pairs("Los Angeles Times", "queer", "support")
correlated_word_pairs("Miami Herald", "racism", "support")
correlated_word_pairs("The New York Times", "issues", "support")
correlated_word_pairs("The Washington Post", "hard", "support")

Figure 8A. The top-2 words ("undocumented" and "affordable") contributing to the sentiment score in the Chicago Tribune (see Figure 5) were selected, and their top-10 most correlated words across 5-line sections of text were identified.
Figure 8B. The top-2 words ("issues" and "support") contributing to the sentiment score in the Houston Chronicle (see Figure 5) were selected, and their top-10 most correlated words across 5-line sections of text were identified.
Figure 8C. The top-2 words ("queer" and "support") contributing to the sentiment score in the Los Angeles Times (see Figure 5) were selected, and their top-10 most correlated words across 5-line sections of text were identified.
Figure 8D. The top-2 words ("racism" and "support") contributing to the sentiment score in the Miami Herald (see Figure 5) were selected, and their top-10 most correlated words across 5-line sections of text were identified.
Figure 8E. The top-2 words ("issues" and "support") contributing to the sentiment score in The New York Times (see Figure 5) were selected, and their top-10 most correlated words across 5-line sections of text were identified.
Figure 8F. The top-2 words ("hard" and "support") contributing to the sentiment score in The Washington Post (see Figure 5) were selected, and their top-10 most correlated words across 5-line sections of text were identified.

I wanted to extend the above analysis and look at words that showed a correlation of at least 0.07 with the top-3 words contributing to the sentiment score in each newspaper. This time around, I created network visualizations that provided a more comprehensive overview of the pairs of correlated words across sections of text (Figure 9). Interesting pairs of correlated words included undocumented-poverty and affordable-housing in the Chicago Tribune (Figure 9A); hate-twitter and environmental-racism in the Houston Chronicle (Figure 9B); lack-diversity and queer-identity in the Los Angeles Times (Figure 9C); hate-speech and structural-racism in the Miami Herald (Figure 9D); collectors-support and queer-relationships in The New York Times (Figure 9E); and trust-vaccines and infection-risk in The Washington Post (Figure 9F).

# create function to visualize network of correlated words associated with my words of choice
correlated_words_to_my_selected_words <- function(newspaper_name, vector_with_words_of_choice) {
  section_words <- word_frequency.1 %>%
    mutate(section = row_number() %/% 5) %>%
    filter(author == newspaper_name) %>%
    filter(section > 0) %>%
    filter(!word %in% stop_words$word)

  word_cors <- section_words %>%
    group_by(word) %>%
    filter(n() >= 3) %>%
    pairwise_cor(word, section, sort = TRUE)

  word_cors %>%
    na.omit() %>%
    filter(item1 %in% vector_with_words_of_choice) %>%
    filter(correlation >= 0.07) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link() +
    geom_node_point(color = "orange", size = 4.5) +
    geom_node_text(aes(label = name)) +
    theme_void()
}

# Chicago Tribune
ct_words <- c("undocumented", "issues", "hard", "affordable", "support", "helped")
correlated_words_to_my_selected_words("Chicago Tribune", ct_words)
# Houston Chronicle
hc_words <- c("issues", "racism", "hate", "support", "love", "free")
correlated_words_to_my_selected_words("Houston Chronicle", hc_words)
# Los Angeles Times
lat_words <- c("queer", "hard", "lack", "support", "love", "free")
correlated_words_to_my_selected_words("Los Angeles Times", lat_words)
# Miami Herald
mihe_words <- c("racism", "issues", "hate", "support", "love", "pride")
correlated_words_to_my_selected_words("Miami Herald", mihe_words)
# The New York Times
nyt_words <- c("issues", "dirt", "queer", "support", "love", "won")
correlated_words_to_my_selected_words("The New York Times", nyt_words)
# The Washington Post
wp_words <- c("hard", "risk", "issue", "support", "educated", "trust")
correlated_words_to_my_selected_words("The Washington Post", wp_words)
Figure 9A. Network visualization of pairs of correlated words (correlation of at least 0.07) with the top-3 words contributing to negative sentiment and the top-3 words contributing to positive sentiment in the Chicago Tribune (see Figure 5).
Figure 9B. Network visualization of pairs of correlated words (correlation of at least 0.07) with the top-3 words contributing to negative sentiment and the top-3 words contributing to positive sentiment in the Houston Chronicle (see Figure 5).
Figure 9C. Network visualization of pairs of correlated words (correlation of at least 0.07) with the top-3 words contributing to negative sentiment and the top-3 words contributing to positive sentiment in the Los Angeles Times (see Figure 5).
Figure 9D. Network visualization of pairs of correlated words (correlation of at least 0.07) with the top-3 words contributing to negative sentiment and the top-3 words contributing to positive sentiment in the Miami Herald (see Figure 5).
Figure 9E. Network visualization of pairs of correlated words (correlation of at least 0.07) with the top-3 words contributing to negative sentiment and the top-3 words contributing to positive sentiment in The New York Times (see Figure 5).
Figure 9F. Network visualization of pairs of correlated words (correlation of at least 0.07) with the top-3 words contributing to negative sentiment and the top-3 words contributing to positive sentiment in The Washington Post (see Figure 5).

CONCLUSION

In this work I used word frequencies to compare whole texts comprised of news articles and found that the Miami Herald and Houston Chronicle documents were highly correlated. I then calculated the tf-idf statistic and found words that were characteristic of each newspaper, for instance lease/eviction/landlords (Chicago Tribune); baylor/arte/publico (Houston Chronicle); coachella/lacma/marve (Los Angeles Times); prisons/cannabis/voto (Miami Herald); museo/nuyorican/taller (The New York Times); and politico/reopenings/rosalia (The Washington Post). I then turned to sentiment analysis and how sentiment changed along narrative time, identifying the words that contributed most to negative and positive scores for each newspaper. I also extracted bi-grams for each newspaper and looked into the most common bi-grams in which one word of the pair was the latinx term. Furthermore, I analyzed co-occurring words across sections of text and identified the terms most correlated with the words contributing to sentiment. This work extends my previous efforts to apply computational tools for text analysis in order to provide insights into the perception of the latinx term by the printed press in the United States.

REFERENCES

ABOUT THE AUTHOR

Martin Calvino is currently a Visiting Professor at Torcuato Di Tella University; he previously worked as a Computational Biologist at The Human Genetics Institute of New Jersey (Rutgers University); he is also a multimedia artist. You can follow him on Instagram: @from.data.to.art

THE CODE FOR THIS WORK CAN BE ACCESSED AT MY GITHUB PAGE:
