Understanding the semantic meaning of the LatinX term with Word Embeddings

Martin Calvino
8 min read · Dec 20, 2023


Image created by Martin Calvino using DCGAN

The term “latinx” is employed to describe people of Latin American descent or cultural identity in the United States. It is also intended to accommodate individuals who identify as non-binary or gender non-conforming within the Latin American and Hispanic community. The usage of the term has met with controversy and opposition, not only within segments of the latino community but also at the political and governmental level. Having lived in the United States for more than 16 years (I now live in Buenos Aires, Argentina), I felt confused at the time: should I identify with it and accept it, or should I keep identifying myself just as before (latino, hispanic, or Uruguayan-born)? The term certainly helped, I thought, to distinguish latinos living in the United States from latinos living in countries of Latin America. The evolving usage of the term sparked my interest, and I decided to analyze how it was being referred to in the printed press, collecting news articles mentioning the term from six different newspapers. With this in mind, I embarked on a computational analysis of text to extract the contextual meaning of the term according to each newspaper, and published two extensive analyses on the topic. In my first work I implemented a keyword-in-context analysis to identify the most frequent words surrounding mentions of the “latinx” term within a 10-word context extracted from news articles published by The New York Times, The Washington Post, Los Angeles Times, Houston Chronicle, Miami Herald and Chicago Tribune (see here); in my second work I analyzed bi-grams containing the “latinx” term and how they differed among newspapers (see here). I now present the implementation of vector models for language, known as word embeddings, to extract the meaning of the “latinx” term according to how it was used (written about) in news articles from the above-mentioned newspapers.

IMPLEMENTATION OF WORD EMBEDDINGS

Semantics (a subfield of linguistics) tells us that the meaning of a word is constructed from its contextual relations. Word embeddings build on this idea: they are learned representations of text in which words used with the same meaning have similar representations. In a word embedding, each word is represented as a real-valued vector in a pre-defined vector space. Because individual words are associated with points in this space, words that lie close to each other are words that are used in similar ways, which naturally captures word meaning. In other words, word embeddings take into account the usage of a word across a collection of texts in order to define its meaning.
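To make the idea of “closeness” in vector space concrete, here is a minimal sketch in R of cosine similarity, the measure used later in this post to find the nearest neighbors of a word; the three-dimensional vectors below are made-up toy values, not vectors from the actual embeddings:

# toy 3-dimensional word vectors (hypothetical values, for illustration only)
v_latinx <- c(0.8, 0.1, 0.3)
v_latino <- c(0.7, 0.2, 0.4)
v_baseball <- c(0.1, 0.9, 0.0)

# cosine similarity: dot product divided by the product of the vector norms
cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_similarity(v_latinx, v_latino)   # close to 1: similar usage
cosine_similarity(v_latinx, v_baseball) # much lower: dissimilar usage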

I implemented my own word embeddings from news articles mentioning the term “latinx”, taking advantage of the normalized skipgram probability approach. The skipgram probability measures how often word_1 and word_2 are found together. The term “skipgram” refers to the fact that we can “skip” several words between word_1 and word_2. The normalized skipgram probability asks whether the frequency of the skipgram is higher or lower than what we would expect from the unigram frequencies. Because some words are very common and some are very rare, we divide the skipgram frequency by the two unigram frequencies. When the result is greater than 1, the skipgram occurred more frequently than expected from the unigram probabilities of the two input words, and we say word_1 and word_2 are “associated”; the greater the ratio, the stronger the association. For ratios less than 1, word_1 and word_2 are said to be “anti-associated”. The log of this ratio is called the Pointwise Mutual Information (PMI) of word_1 and word_2; it represents how frequently word_1 and word_2 occur “mutually” (or jointly) rather than independently.

PMI(word_1, word_2) = log( p(word_1, word_2) / (p(word_1) * p(word_2)) )
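As a quick sanity check of the formula, consider a toy calculation in R with made-up counts (not counts from the actual corpus): suppose word_1 appears in 50 out of 10,000 windows, word_2 in 200, and the two co-occur in 20 of them:

# hypothetical counts, for illustration only
p_w1 <- 50 / 10000      # p(word_1)
p_w2 <- 200 / 10000     # p(word_2)
p_w1_w2 <- 20 / 10000   # p(word_1, word_2)

p_w1_w2 / (p_w1 * p_w2)       # ratio = 20 (> 1), so word_1 and word_2 are "associated"
log(p_w1_w2 / (p_w1 * p_w2))  # PMI of roughly 3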

The computational approach then builds a PMI matrix, with each row representing word_1, each column representing word_2, and each entry holding the corresponding PMI value; the matrix therefore has size (n_vocabulary, n_vocabulary). The dimensionality of this matrix is then compressed, using Singular Value Decomposition (SVD), into two smaller matrices, each forming a set of word vectors of size (n_vocabulary, n_dim), with each row of one matrix representing a single word vector. The SVD outputs an orthogonal space that allows us to add and subtract word vectors in a meaningful manner; the resulting word vectors are, in effect, word eigenvectors.
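In the analysis below this compression step is handled by widyr::widely_svd(), but the idea can be sketched with base R on a toy matrix, where random values stand in for PMI scores purely for illustration:

# toy "PMI matrix" of size (n_vocabulary, n_vocabulary) filled with random values
set.seed(42)
vocabulary <- paste0("word_", 1:5)
toy_pmi <- matrix(rnorm(25), nrow = 5, dimnames = list(vocabulary, vocabulary))

# compress to n_dim = 2 with SVD; each row of u is then a 2-dimensional word vector
decomposition <- svd(toy_pmi, nu = 2, nv = 2)
word_vectors <- decomposition$u
rownames(word_vectors) <- vocabulary
word_vectors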

This approach is very valuable because it allowed me to find word vectors for my own collection of news articles without relying on pre-trained vectors (such as Word2vec or GloVe).


# load libraries
library(tidyverse)
library(tidytext)
library(SnowballC)
library(slider)
library(widyr)
library(furrr)
library(textdata)

# load .csv file containing news articles from The New York Times
nyt_news <- read_csv(file.choose())
View(nyt_news)

# filter out words used 3 times or less in the dataset
# create a tidy text dataset (one token per row)
tidy_nyt_news <- nyt_news %>%
  select(News_ID, Text) %>%
  unnest_tokens(word, Text) %>%
  add_count(word) %>%
  filter(n >= 4) %>%
  select(-n)

View(tidy_nyt_news)

# create nested dataframe with one row per news article
nyt_nested_words <- tidy_nyt_news %>%
  nest(words = c(word))

nyt_nested_words
View(nyt_nested_words)

# create a slide_windows() function to implement fast sliding window computations
# calculate skipgram probabilities
# define a fixed-size moving window that centers around each word
slide_windows <- function(tbl, window_size) {
  skipgrams <- slider::slide(
    tbl,
    ~.x,
    .after = window_size - 1,
    .step = 1,
    .complete = TRUE
  )
  safe_mutate <- safely(mutate)

  out <- map2(skipgrams, 1:length(skipgrams), ~ safe_mutate(.x, window_id = .y))

  out %>%
    transpose() %>%
    pluck("result") %>%
    compact() %>%
    bind_rows()
}

plan(multisession) # for parallel computing

# calculate PMI (Pointwise Mutual Information) values for each sliding window with size of 4 words
nyt_tidy_pmi <- nyt_nested_words %>%
  mutate(words = future_map(words, slide_windows, 4L)) %>%
  unnest(words) %>%
  unite(window_id, News_ID, window_id) %>%
  pairwise_pmi(word, window_id)
# when PMI is high, the two words are associated with each other (likely to occur together)
nyt_tidy_pmi
View(nyt_tidy_pmi)
summary(nyt_tidy_pmi[,3]) # the median PMI value is 1.42

# determine word vectors from PMI values using SVD (Singular Value Decomposition: a method for dimensionality reduction via matrix factorization)
# project the sparse, high-dimensional set of word features into a more dense, 100-dimensional set of features
# each word is represented as a numeric vector in this new feature space
nyt_tidy_word_vectors <- nyt_tidy_pmi %>%
  widely_svd(item1, item2, pmi, nv = 100, maxit = 1000)

View(nyt_tidy_word_vectors)

# explore nyt word embedding
# which words are close to each other in this 100-dimensional feature space?
# create a function to find nearest words based on cosine similarity
# return a dataframe sorted by similarity to my word/token of interest
nearest_neighbors <- function(df, token) {
  df %>%
    widely(
      ~ {
        y <- .[rep(token, nrow(.)), ]
        res <- rowSums(. * y) / (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
      },
      sort = TRUE
    )(item1, dimension, value) %>%
    select(-item2)
}

# what words are closest to "latinx"
nyt_tidy_word_vectors %>%
  nearest_neighbors("latinx")

################################################################################

# create a function to contain all the above code
# and apply it to find the closest words to "latinx" in news articles
# from the other 5 newspapers

closest_words <- function(newspaper_news, token) {
  tidy_news <- newspaper_news %>%
    select(News_ID, Text) %>%
    unnest_tokens(word, Text) %>%
    add_count(word) %>%
    filter(n >= 4) %>%
    select(-n)

  nested_words <- tidy_news %>%
    nest(words = c(word))

  slide_windows <- function(tbl, window_size) {
    skipgrams <- slider::slide(
      tbl,
      ~.x,
      .after = window_size - 1,
      .step = 1,
      .complete = TRUE
    )
    safe_mutate <- safely(mutate)

    out <- map2(skipgrams, 1:length(skipgrams), ~ safe_mutate(.x, window_id = .y))

    out %>%
      transpose() %>%
      pluck("result") %>%
      compact() %>%
      bind_rows()
  }

  plan(multisession)

  tidy_pmi <- nested_words %>%
    mutate(words = future_map(words, slide_windows, 4L)) %>%
    unnest(words) %>%
    unite(window_id, News_ID, window_id) %>%
    pairwise_pmi(word, window_id)

  tidy_word_vectors <- tidy_pmi %>%
    widely_svd(item1, item2, pmi, nv = 100, maxit = 1000)

  nearest_neighbors <- function(df, token) {
    df %>%
      widely(
        ~ {
          y <- .[rep(token, nrow(.)), ]
          res <- rowSums(. * y) / (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
          matrix(res, ncol = 1, dimnames = list(x = names(res)))
        },
        sort = TRUE
      )(item1, dimension, value) %>%
      select(-item2)
  }

  # find the closest words to the token passed as an argument
  tidy_word_vectors %>%
    nearest_neighbors(token)
}

################################################################################

# Los Angeles Times
# load .csv file containing news articles from Los Angeles Times
lat_news <- read_csv(file.choose())
View(lat_news)
# find closest words to "latinx" in the word embedding from news articles
closest_words(lat_news, "latinx")


# Houston Chronicle
# load .csv file containing news articles from Houston Chronicle
huc_news <- read_csv(file.choose())
View(huc_news)
# find closest words to "latinx" in the word embedding from news articles
closest_words(huc_news, "latinx")


# Miami Herald
# load .csv file containing news articles from Miami Herald
mihe_news <- read_csv(file.choose())
View(mihe_news)
# find closest words to "latinx" in the word embedding from news articles
closest_words(mihe_news, "latinx")


# The Washington Post
# load .csv file containing news articles from Washington Post
wp_news <- read_csv(file.choose())
View(wp_news)
# find closest words to "latinx" in the word embedding from news articles
closest_words(wp_news, "latinx")


# Chicago Tribune
# load .csv file containing news articles from Chicago Tribune
ct_news <- read_csv(file.choose())
View(ct_news)
# find closest words to "latinx" in the word embedding from news articles
closest_words(ct_news, "latinx")

EXPLORING MY OWN WORD EMBEDDINGS

Now that I have determined word embeddings derived from news articles mentioning the term “latinx” for six newspapers, I explored which words were closest to “latinx” in feature space. The results are shown in Figure 1.

Figure 1. Top words closest to “latinx” in the feature space of word embeddings derived from news articles from each newspaper

I found that in the word embedding extracted from news articles published by The New York Times, inclusive and neutral were among the top words closest in feature space to latinx. Interestingly, neutral was found among the top words associated with latinx using keyword-in-context analysis (see Figure 4 here) but inclusive was not, demonstrating the value of word embeddings as a complementary analysis. It was also interesting to find that lgbtq and queer were among the closest words in feature space to latinx for the embeddings extracted from the Los Angeles Times and the Houston Chronicle. Non-religious was found close to latinx in feature space for the embedding extracted from the Chicago Tribune, which is consistent with my previous results (see Figure 4 here).

The results shown here demonstrate that the usage of the term latinx differs among newspapers, and that word embeddings are a great way to assess these differences.


ABOUT THE AUTHOR

Martin Calvino is currently a Visiting Professor in Data Science and Multimedia Art at Torcuato Di Tella University; he previously worked as a Computational Biologist at the Human Genetics Institute of New Jersey at Rutgers University; he is also a Multimedia Artist. You can follow him on Instagram: @from.data.to.art

THE CODE FOR THIS WORK CAN BE ACCESSED AT MY GITHUB PAGE:
