This lesson is still being designed and assembled (Pre-Alpha version)

LSA

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • TODO

Objectives
  • TODO

Topic Modeling

So far, we’ve:

  1. Warmed up with a hands-on card sorting activity and a discussion about the kinds of tasks we have as qualitative or mixed methods researchers
  2. Situated ourselves with the kinds of tasks machine learning is well suited for, alongside some of the common ethical risk areas for data-driven research and technology
  3. Narrowed our focus to modern Natural Language Processing tasks
  4. Introduced vector spaces for thinking about the differences and similarities between several documents at once
  5. Prepared our data for processing in Python
  6. And introduced word embeddings and TF-IDF matrices as one approach for representing our documents as large vector spaces

We now begin to close the loop with Topic Modeling.

Topic Modeling is a frequent goal of text analysis. Topics are the things that a document is about, by some sense of “about.” We could think of topics as:

  1. Discrete categories that a document either does or does not belong to
  2. Spectra of subject matter that a document can contain to a greater or lesser degree

In the first case, we could use machine learning to predict discrete categories, such as trying to determine the author of the Federalist Papers.

In the second case, we could try to determine the smallest number of topics that provides the most information about how our documents differ from one another, then use those concepts to gain insight about the “stuff” or “story” of our documents.

In this lesson we’ll focus on this second case, where topics are treated as spectra of subject matter. There are a variety of ways of doing this, and not all of them use the vector space model we have learned; Latent Dirichlet Allocation (LDA), for example, models topics probabilistically rather than geometrically.

Specifically, we will be discussing Latent Semantic Analysis (LSA). We’re narrowing our focus to LSA because it introduces us to concepts and workflows that we will use in the future.

The assumption behind LSA is that underlying the thousands of words in our vocabulary are a smaller number of hidden topics, and that those topics help explain the distribution of the words we see across our documents. In all our models so far, each dimension has corresponded to a single word. But in LSA, each dimension now corresponds to a hidden topic, and each of those in turn corresponds to the words that are most strongly associated with it.

For example, a hidden topic might be the lasting influence of the Battle of Hastings on the English language, with some documents using more words with Anglo-Saxon roots and other documents using more words with Latin roots. This dimension is “hidden” because authors don’t usually stamp a label on their books with a summary of the linguistic histories of their words. Still, we can imagine a spectrum between words that are strongly indicative of authors with more Anglo-Saxon diction vs. words strongly indicative of authors with more Latin diction. Once we have that spectrum, we can place our documents along it, then move on to the next hidden topic, then the next, and so on, until we’ve discussed the fewest, strongest hidden topics that capture the most “story” about our documents.

Setting Up TF-IDF

Mathematically, these “latent semantic” dimensions are derived from our TF-IDF matrix, so let’s begin there.

First we need to download the data we’ll be using and do a little Python setup.

!pip install parse
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: parse in /usr/local/lib/python3.9/dist-packages (1.19.0)
# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# add folder to colab's path so we can import the helper functions
import sys
from os import listdir
wksp_dir = '/content/drive/My Drive/Colab Notebooks/text-analysis'
sys.path.insert(0, wksp_dir)
listdir(wksp_dir)

# and import our functions from helpers.py
from helpers import create_file_list, parse_into_dataframe, lemmatize_files
# get list of files to analyze
data_dir = wksp_dir + '/data/'
corpus_file_list = create_file_list(data_dir, "*.txt")

# parse filelist into a dataframe
data = parse_into_dataframe(data_dir + "{Author}-{Title}.txt", corpus_file_list)
data
Author Title Item
0 austen sense /content/drive/My Drive/Colab Notebooks/text-a...
21 austen persuasion /content/drive/My Drive/Colab Notebooks/text-a...
12 austen pride /content/drive/My Drive/Colab Notebooks/text-a...
22 austen northanger /content/drive/My Drive/Colab Notebooks/text-a...
9 austen emma /content/drive/My Drive/Colab Notebooks/text-a...
7 austen ladysusan /content/drive/My Drive/Colab Notebooks/text-a...
10 chesterton thursday /content/drive/My Drive/Colab Notebooks/text-a...
8 chesterton ball /content/drive/My Drive/Colab Notebooks/text-a...
5 chesterton brown /content/drive/My Drive/Colab Notebooks/text-a...
30 chesterton knewtoomuch /content/drive/My Drive/Colab Notebooks/text-a...
39 chesterton whitehorse /content/drive/My Drive/Colab Notebooks/text-a...
27 chesterton napoleon /content/drive/My Drive/Colab Notebooks/text-a...
18 dickens hardtimes /content/drive/My Drive/Colab Notebooks/text-a...
28 dickens bleakhouse /content/drive/My Drive/Colab Notebooks/text-a...
38 dickens davidcopperfield /content/drive/My Drive/Colab Notebooks/text-a...
40 dickens taleoftwocities /content/drive/My Drive/Colab Notebooks/text-a...
17 dickens christmascarol /content/drive/My Drive/Colab Notebooks/text-a...
20 dickens greatexpectations /content/drive/My Drive/Colab Notebooks/text-a...
41 dickens pickwickpapers /content/drive/My Drive/Colab Notebooks/text-a...
2 dickens ourmutualfriend /content/drive/My Drive/Colab Notebooks/text-a...
13 dickens olivertwist /content/drive/My Drive/Colab Notebooks/text-a...
37 dumas threemusketeers /content/drive/My Drive/Colab Notebooks/text-a...
33 dumas montecristo /content/drive/My Drive/Colab Notebooks/text-a...
32 dumas twentyyearsafter /content/drive/My Drive/Colab Notebooks/text-a...
3 dumas tenyearslater /content/drive/My Drive/Colab Notebooks/text-a...
29 dumas maninironmask /content/drive/My Drive/Colab Notebooks/text-a...
14 dumas blacktulip /content/drive/My Drive/Colab Notebooks/text-a...
34 litbank conll /content/drive/My Drive/Colab Notebooks/text-a...
4 melville moby_dick /content/drive/My Drive/Colab Notebooks/text-a...
24 melville typee /content/drive/My Drive/Colab Notebooks/text-a...
23 melville pierre /content/drive/My Drive/Colab Notebooks/text-a...
11 melville piazzatales /content/drive/My Drive/Colab Notebooks/text-a...
19 melville conman /content/drive/My Drive/Colab Notebooks/text-a...
1 melville omoo /content/drive/My Drive/Colab Notebooks/text-a...
15 melville bartleby /content/drive/My Drive/Colab Notebooks/text-a...
26 shakespeare othello /content/drive/My Drive/Colab Notebooks/text-a...
6 shakespeare midsummer /content/drive/My Drive/Colab Notebooks/text-a...
16 shakespeare muchado /content/drive/My Drive/Colab Notebooks/text-a...
31 shakespeare caesar /content/drive/My Drive/Colab Notebooks/text-a...
35 shakespeare lear /content/drive/My Drive/Colab Notebooks/text-a...
36 shakespeare romeo /content/drive/My Drive/Colab Notebooks/text-a...
25 shakespeare twelfthnight /content/drive/My Drive/Colab Notebooks/text-a...

Next we’ll load our tokenizer for processing our documents.

Then we’ll preprocess each file, creating a “lemmatized” copy of each.

Note: This can take several minutes to run. It took me around 7 minutes when I was writing these instructions.

import spacy, logging
tokenizer = spacy.load('en_core_web_sm', disable=["parser", "ner", "textcat", "senter"])
tokenizer.max_length = 4500000
def lemmatize_files(tokenizer, corpus_file_list, pos_set={"ADJ", "ADV", "INTJ", "NOUN", "VERB"}, stop_set=set()):
    """
    Example:
        data["Lemma_File"] = lemmatize_files(tokenizer, corpus_file_list)
    """
    logging.warning("This function is computationally intensive. It may take several minutes to finish running.")
    N = len(corpus_file_list)
    lemma_filename_list = []
    for i, filename in enumerate(corpus_file_list):
        logging.info(f"{i+1} out of {N}: Lemmatizing {filename}")
        lemma_filename = filename + ".lemmas"
        lemma_filename_list.append(lemma_filename)
        open(lemma_filename, "w", encoding="utf-8").writelines(
            token.lemma_.lower() + "\n"
            for token in tokenizer(open(filename, "r", encoding="utf-8").read())
            if token.pos_ in pos_set
            and token.lemma_.lower() not in stop_set
            # and token.text.lower() not in stop_set
        )

    return lemma_filename_list

data["LemmaFile"] = lemmatize_files(tokenizer, corpus_file_list)
WARNING:root:This function is computationally intensive. It may take several minutes to finish running.



lemmas_file_list = []
pos_set = {"ADJ", "ADV", "INTJ", "NOUN", "VERB"}
for i, filename in enumerate(corpus_file_list):
  print(i+1, "out of", len(corpus_file_list), "Lemmatizing", filename)
  lemma_filename = filename + ".lemmas"
  lemmas_file_list.append(lemma_filename)
  open(lemma_filename, "w", encoding="utf-8").writelines(
    token.lemma_.lower() + "\n"
    for token in tokenizer(open(filename, "r", encoding="utf-8").read())
    if token.pos_ in pos_set
  )

print(lemmas_file_list)
1 out of 41 Lemmatizing python-text-analysis/data/dickens-olivertwist.txt
2 out of 41 Lemmatizing python-text-analysis/data/dumas-montecristo.txt
3 out of 41 Lemmatizing python-text-analysis/data/melville-typee.txt
4 out of 41 Lemmatizing python-text-analysis/data/dumas-twentyyearsafter.txt
5 out of 41 Lemmatizing python-text-analysis/data/dumas-blacktulip.txt
6 out of 41 Lemmatizing python-text-analysis/data/dumas-tenyearslater.txt
7 out of 41 Lemmatizing python-text-analysis/data/dickens-davidcopperfield.txt
8 out of 41 Lemmatizing python-text-analysis/data/chesterton-thursday.txt
9 out of 41 Lemmatizing python-text-analysis/data/shakespeare-lear.txt
10 out of 41 Lemmatizing python-text-analysis/data/shakespeare-midsummer.txt
11 out of 41 Lemmatizing python-text-analysis/data/chesterton-whitehorse.txt
12 out of 41 Lemmatizing python-text-analysis/data/chesterton-ball.txt
13 out of 41 Lemmatizing python-text-analysis/data/dickens-taleoftwocities.txt
14 out of 41 Lemmatizing python-text-analysis/data/chesterton-napoleon.txt
15 out of 41 Lemmatizing python-text-analysis/data/shakespeare-romeo.txt
16 out of 41 Lemmatizing python-text-analysis/data/chesterton-brown.txt
17 out of 41 Lemmatizing python-text-analysis/data/dickens-pickwickpapers.txt
18 out of 41 Lemmatizing python-text-analysis/data/austen-pride.txt
19 out of 41 Lemmatizing python-text-analysis/data/dickens-hardtimes.txt
20 out of 41 Lemmatizing python-text-analysis/data/melville-moby_dick.txt
21 out of 41 Lemmatizing python-text-analysis/data/austen-emma.txt
22 out of 41 Lemmatizing python-text-analysis/data/shakespeare-othello.txt
23 out of 41 Lemmatizing python-text-analysis/data/melville-conman.txt
24 out of 41 Lemmatizing python-text-analysis/data/dickens-ourmutualfriend.txt
25 out of 41 Lemmatizing python-text-analysis/data/dickens-greatexpectations.txt
26 out of 41 Lemmatizing python-text-analysis/data/shakespeare-muchado.txt
27 out of 41 Lemmatizing python-text-analysis/data/chesterton-knewtoomuch.txt
28 out of 41 Lemmatizing python-text-analysis/data/austen-northanger.txt
29 out of 41 Lemmatizing python-text-analysis/data/dickens-christmascarol.txt
30 out of 41 Lemmatizing python-text-analysis/data/austen-persuasion.txt
31 out of 41 Lemmatizing python-text-analysis/data/melville-bartleby.txt
32 out of 41 Lemmatizing python-text-analysis/data/austen-sense.txt
33 out of 41 Lemmatizing python-text-analysis/data/dumas-threemusketeers.txt
34 out of 41 Lemmatizing python-text-analysis/data/melville-piazzatales.txt
35 out of 41 Lemmatizing python-text-analysis/data/shakespeare-caesar.txt
36 out of 41 Lemmatizing python-text-analysis/data/melville-pierre.txt
37 out of 41 Lemmatizing python-text-analysis/data/dumas-maninironmask.txt
38 out of 41 Lemmatizing python-text-analysis/data/austen-ladysusan.txt
39 out of 41 Lemmatizing python-text-analysis/data/melville-omoo.txt
40 out of 41 Lemmatizing python-text-analysis/data/dickens-bleakhouse.txt
41 out of 41 Lemmatizing python-text-analysis/data/shakespeare-twelfthnight.txt
['python-text-analysis/data/dickens-olivertwist.txt.lemmas', 'python-text-analysis/data/dumas-montecristo.txt.lemmas', 'python-text-analysis/data/melville-typee.txt.lemmas', 'python-text-analysis/data/dumas-twentyyearsafter.txt.lemmas', 'python-text-analysis/data/dumas-blacktulip.txt.lemmas', 'python-text-analysis/data/dumas-tenyearslater.txt.lemmas', 'python-text-analysis/data/dickens-davidcopperfield.txt.lemmas', 'python-text-analysis/data/chesterton-thursday.txt.lemmas', 'python-text-analysis/data/shakespeare-lear.txt.lemmas', 'python-text-analysis/data/shakespeare-midsummer.txt.lemmas', 'python-text-analysis/data/chesterton-whitehorse.txt.lemmas', 'python-text-analysis/data/chesterton-ball.txt.lemmas', 'python-text-analysis/data/dickens-taleoftwocities.txt.lemmas', 'python-text-analysis/data/chesterton-napoleon.txt.lemmas', 'python-text-analysis/data/shakespeare-romeo.txt.lemmas', 'python-text-analysis/data/chesterton-brown.txt.lemmas', 'python-text-analysis/data/dickens-pickwickpapers.txt.lemmas', 'python-text-analysis/data/austen-pride.txt.lemmas', 'python-text-analysis/data/dickens-hardtimes.txt.lemmas', 'python-text-analysis/data/melville-moby_dick.txt.lemmas', 'python-text-analysis/data/austen-emma.txt.lemmas', 'python-text-analysis/data/shakespeare-othello.txt.lemmas', 'python-text-analysis/data/melville-conman.txt.lemmas', 'python-text-analysis/data/dickens-ourmutualfriend.txt.lemmas', 'python-text-analysis/data/dickens-greatexpectations.txt.lemmas', 'python-text-analysis/data/shakespeare-muchado.txt.lemmas', 'python-text-analysis/data/chesterton-knewtoomuch.txt.lemmas', 'python-text-analysis/data/austen-northanger.txt.lemmas', 'python-text-analysis/data/dickens-christmascarol.txt.lemmas', 'python-text-analysis/data/austen-persuasion.txt.lemmas', 'python-text-analysis/data/melville-bartleby.txt.lemmas', 'python-text-analysis/data/austen-sense.txt.lemmas', 'python-text-analysis/data/dumas-threemusketeers.txt.lemmas', 'python-text-analysis/data/melville-piazzatales.txt.lemmas', 'python-text-analysis/data/shakespeare-caesar.txt.lemmas', 'python-text-analysis/data/melville-pierre.txt.lemmas', 'python-text-analysis/data/dumas-maninironmask.txt.lemmas', 'python-text-analysis/data/austen-ladysusan.txt.lemmas', 'python-text-analysis/data/melville-omoo.txt.lemmas', 'python-text-analysis/data/dickens-bleakhouse.txt.lemmas', 'python-text-analysis/data/shakespeare-twelfthnight.txt.lemmas']
# %%shell

# cd python-text-analysis/data
# zip lemmas *.lemmas
# %%shell

# cd python-text-analysis/data
# unzip lemmas.zip

And after that, we’ll convert our lemmatized files to TF-IDF matrix format, just as we did in the previous lesson.

Recall, max_df=.6 removes terms that appear in more than 60% of our documents (overly common words like the, a, an) and min_df=.1 removes terms that appear in less than 10% of our documents (overly rare words like specific character names, typos, or punctuation the tokenizer doesn’t understand). We’re looking for that sweet spot where terms are frequent enough for us to build theoretical understanding of what they mean for our corpus, but not so frequent that they can’t help us tell our documents apart.

from sklearn.feature_extraction.text import TfidfVectorizer
lemmas_file_list = create_file_list("python-text-analysis/data", "*.txt.lemmas")
vectorizer = TfidfVectorizer(input='filename', max_df=.6, min_df=.1)
tfidf = vectorizer.fit_transform(lemmas_file_list)
print(tfidf.shape)
(41, 9879)

How LSA Works: SVD of TF-IDF Matrix

What do these dimensions mean? We have 41 documents, which we can think of as rows. And we have nearly ten thousand unique terms (9,879 of them), which is like a dictionary of all the types of words in our documents, and which we represent as columns.

Now we want to reduce the number of dimensions used to represent our documents. Sure, we could talk through each of those thousands of words in our dictionary, or through each of our 41 individual documents. Qualitative researchers are capable of great things. But those approaches won’t take advantage of our model, they require a HUGE page burden to walk a reader through, and they put all the pressure on you to notice cross-cutting themes on your own. Instead, we can use a technique from linear algebra, Singular Value Decomposition (SVD), to reduce those thousands of words or 41 documents to a smaller set of more cross-cutting dimensions.

We won’t dive deeply into all the mathematics of SVD, but we will discuss what happens in the abstract. The technique is called “SVD” because we are “decomposing” our original matrix into a special set of matrices built around “singular values.” Any matrix M, of any size, can be split or decomposed into three matrices that multiply together to make M. Matrices can be factored in many different ways; SVD is one particular, especially informative factorization.

The three resulting matrices are called U, Σ, and Vt.

Image of SVD. Visualisation of the principle of SVD by CMG Lee.

The U matrix has documents as rows and topics as columns. The score in each cell shows how much each document is “about” each topic.

The Vt matrix has topics as rows and terms as columns. Again, the value in each cell shows how strongly a given word indicates a given topic.

The Σ matrix is special, and it is the one from which SVD gets its name. Nearly every cell in the matrix is zero; only the diagonal cells are filled in, and each of these holds a singular value. Each singular value represents the amount of variation in our data explained by its topic, that is, how much of the “stuff” of the story that topic covers.

A good deal of variation can often be explained by a relatively small number of topics, and often the variation each topic describes shrinks with each new topic. Because of this, we can truncate or remove individual rows with the lowest singular values, since they provide the least amount of information.

Once this truncation happens, we can multiply together our three matrices and end up with a smaller matrix with topics instead of words as dimensions.

Image of singular value decomposition with columns and rows truncated.
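To make the mechanics concrete, here is a minimal sketch of truncated SVD on a tiny made-up matrix, using NumPy directly rather than the scikit-learn wrapper we’ll use below (the numbers are purely illustrative):

import numpy as np

# A tiny toy "document-term" matrix: 4 documents x 3 terms.
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# Decompose M into U (documents x topics), S (singular values), and Vt (topics x terms).
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(np.round(S, 3))  # singular values, sorted from largest to smallest

# Truncate: keep only the k strongest topics, then multiply back together.
k = 2
M_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.round(M_approx, 2))  # a lower-rank approximation of M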

This allows us to focus our account of our documents on a narrower set of cross-cutting topics. This does come at a price, though. When we reduce dimensionality, our model loses information about our dataset. Our hope is that the information lost was unimportant. But “importance” depends on your theoretical and moral stances. Because of this, it is important to carefully inspect the results of your model, interpret the “topics” it identifies, and check all of that against your qualitative and theoretical understanding of your documents.

This will likely be an iterative process where you refine your model several times. Keep in mind the adage: all models are wrong, some are useful, and a less accurate model may be preferred if it is easier to explain to your stakeholders.

Question: What’s the maximum number of topics we could get from this model? Think about the largest number of singular values you could possibly fit in the Σ matrix.

Remember, these singular values exist only on the diagonal, so the most topics we could have will be whichever we have fewer of: unique words or documents in our corpus.

Because there are usually more unique words than there are documents, the maximum will almost always equal the number of documents we have, in this case 41. (The arpack solver we use below requires strictly fewer components than that maximum, which is why the code asks for min(tfidf.shape) - 1, i.e. 40.)

To see this, let’s begin to reduce the dimensionality of our TF-IDF matrix using SVD, starting with the greatest number of dimensions.

from sklearn.decomposition import TruncatedSVD
maxDimensions = min(tfidf.shape)-1
svdmodel = TruncatedSVD(n_components=maxDimensions, algorithm="arpack")
lsa = svdmodel.fit_transform(tfidf)
print(lsa)
[[ 3.91364432e-01 -3.38256707e-01 -1.10255485e-01 ... -3.30703329e-04
   2.26445596e-03 -1.29373990e-02]
 [ 2.83139301e-01 -2.03163967e-01  1.72761316e-01 ...  1.98594965e-04
  -4.41931701e-03 -1.84732254e-02]
 [ 3.32869588e-01 -2.67008449e-01 -2.43271177e-01 ...  4.50149502e-03
   1.99200352e-03  2.32871393e-03]
 ...
 [ 1.91400319e-01 -1.25861226e-01  4.36682522e-02 ... -8.51158743e-04
   4.48451964e-03  1.67944132e-03]
 [ 2.33925324e-01 -8.46322843e-03  1.35493523e-01 ...  5.46406784e-03
  -1.11972177e-03  3.86332162e-03]
 [ 4.09480701e-01 -1.78620470e-01 -1.61670733e-01 ... -6.72035999e-02
   9.27745251e-03 -7.60191949e-05]]

How should we pick a number of topics to keep? Fortunately, we have the singular values to help us understand how much of our data each topic explains. Let’s take a look, visualizing the results on a graph.

import matplotlib.pyplot as plt

#this shows us the amount of dropoff in explanation we have in our sigma matrix. 
print(svdmodel.explained_variance_ratio_)

plt.plot(range(maxDimensions), svdmodel.explained_variance_ratio_ * 100)
plt.xlabel("Topic Number")
plt.ylabel("% explained")
plt.title("SVD dropoff")
plt.show()  # show first chart
[0.02053967 0.12553786 0.08088013 0.06750632 0.05095583 0.04413301
 0.03236406 0.02954683 0.02837433 0.02664072 0.02596086 0.02538922
 0.02499496 0.0240097  0.02356043 0.02203859 0.02162737 0.0210681
 0.02004    0.01955728 0.01944726 0.01830292 0.01822243 0.01737443
 0.01664451 0.0160519  0.01494616 0.01461527 0.01455848 0.01374971
 0.01308112 0.01255502 0.01201655 0.0112603  0.01089138 0.0096127
 0.00830014 0.00771224 0.00622448 0.00499762]

[Plot: “SVD dropoff”, percent of data explained vs. topic number]

A heuristic researchers often use to determine a topic count is to look at the dropoff in the percentage of data explained by each topic.

Typically the amount of data explained will be high at first, drop off quickly, and then start to level out. We can pick a point on the “elbow,” where the curve goes from a high level of explanation per topic to leveling out and not explaining much more with each new topic. Past this point, we see diminishing returns on how much of the “stuff” of our documents each additional topic covers. The elbow is also often a good sweet spot between overfitting our model and not having enough topics.

Alternatively, we could set some target sum for how much of our data we want our topics to explain, something like 90% or 95%. However, with a small dataset like this, that would result in a large number of topics, so we’ll pick an elbow instead.
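If you are curious, here is a quick sketch of that alternative (not part of the original workflow), using the svdmodel we just fit with 40 components; the 90% threshold is just an example:

import numpy as np

# Cumulative share of variance explained as topics are added.
cumulative = np.cumsum(svdmodel.explained_variance_ratio_)
target = 0.90
if cumulative[-1] >= target:
    print(int(np.argmax(cumulative >= target)) + 1, "topics needed to explain", target)
else:
    print("Even", len(cumulative), "topics only explain", round(float(cumulative[-1]), 3))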

Looking at our results so far, a good number in the middle of the “elbow” appears to be around 5-7 topics. So, let’s fit a model with 7 components and take a look at what each topic looks like; we’ll skip the first component (see below), which leaves us 6 topics to interpret.

(Why is the first topic, “Topic 0,” so low? It has to do with how our SVD was set up. Truncated SVD does not mean-center the data beforehand, which lets it take advantage of sparse matrix algorithms by leaving most of the data at zero; if we centered the data first, the matrix would be mostly filled with the negative of each column’s mean, which takes much more memory to store. The math is outside the scope of this lesson, but in this setup it’s expected that topic 0 will be less informative than the ones that come after it, so we’ll skip it.)
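As a quick sanity check, we can confirm that every document scores positively on topic 0 in the lsa array we computed above, which is what we would expect from a non-negative, non-centered TF-IDF matrix:

# Every document loads positively on the first component, so it mostly reflects
# overall word usage rather than a contrast between documents.
print((lsa[:, 0] > 0).all())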

numDimensions = 7
svdmodel = TruncatedSVD(n_components=numDimensions, algorithm="arpack")
lsa = svdmodel.fit_transform(tfidf)
print(lsa)
[[ 3.91364432e-01 -3.38256707e-01 -1.10255485e-01 -1.57263147e-01
   4.46988327e-01  4.19701195e-02 -1.60554169e-01]
 [ 2.83139301e-01 -2.03163967e-01  1.72761316e-01 -2.09939164e-01
  -3.26746690e-01  5.57239735e-01 -2.77917582e-01]
 [ 3.32869588e-01 -2.67008449e-01 -2.43271177e-01  2.10563091e-01
  -1.76563657e-01 -2.99275913e-02  1.16776821e-02]
 [ 3.08138678e-01 -2.10715886e-01  1.90232173e-01 -3.35332382e-01
  -2.39294420e-01 -2.10772234e-01 -5.00250358e-02]
 [ 3.05001339e-01 -2.28993064e-01  2.27384118e-01 -3.12862475e-01
  -2.30273991e-01 -3.01470572e-01  2.94344505e-02]
 [ 4.61714301e-01 -3.71103910e-01 -6.23885346e-02 -2.07781625e-01
   3.75805961e-01  4.62796547e-02 -2.40105061e-02]
 [ 3.99078406e-01 -3.72675621e-01 -4.29488320e-01  3.21312840e-01
  -2.06780567e-01 -4.79678166e-02  1.81897768e-02]
 [ 2.60635143e-01 -1.90036072e-01 -1.31092747e-02 -1.38136420e-01
   1.37846031e-01  2.59831829e-02  1.28138615e-01]
 [ 2.75254100e-01 -1.66002010e-01  1.51344979e-01 -2.03879356e-01
  -1.97434785e-01  4.34660579e-01  3.51604210e-01]
 [ 2.63962657e-01 -1.51795541e-01  1.03662446e-01 -1.32354362e-01
  -8.01919283e-02  1.34144571e-01  4.40821829e-01]
 [ 5.39085586e-01  5.51168135e-01 -7.25812593e-02  1.11795245e-02
  -2.79031624e-04 -1.68092332e-02  5.49535679e-03]
 [ 2.69952815e-01 -1.76699531e-01  5.70356228e-01  4.48630131e-01
   4.28713759e-02 -2.18545514e-02  1.29750415e-02]
 [ 6.20096940e-01  6.50488110e-01 -3.76389598e-02  2.84363611e-02
   1.59378698e-02 -1.18479143e-02 -1.67609142e-02]
 [ 2.39439789e-01 -1.46548125e-01  5.73647210e-01  4.48872088e-01
   6.91429226e-02 -6.62720018e-02 -5.65690665e-02]
 [ 3.46673808e-01 -2.28179603e-01  4.18572442e-01  1.99567055e-01
  -9.26169891e-03  1.28870542e-02  6.90447513e-02]
 [ 6.16613469e-01  6.59524199e-01 -6.30672750e-02  4.21736740e-03
   1.66141337e-02 -1.39649741e-02 -9.24035248e-04]
 [ 4.19959535e-01 -3.55330895e-01 -5.39327447e-02 -2.01473687e-01
   3.73339308e-01  6.42749710e-02  3.85309124e-02]
 [ 3.69324851e-01 -3.45008143e-01 -3.46180574e-01  2.57048111e-01
  -2.03332217e-01  8.43097532e-03 -3.03449265e-02]
 [ 6.27339749e-01  1.62509554e-01  2.45818244e-02 -7.59347178e-02
  -6.91425518e-02  5.45427510e-02  2.01009502e-01]
 [ 3.10638955e-01 -1.27428647e-01  6.35926253e-01  4.72744826e-01
   8.18397293e-02 -5.48693117e-02 -7.44129304e-02]
 [ 5.81561697e-01  6.09748220e-01 -4.20854426e-02  1.91045296e-03
   4.76425507e-03 -2.04751525e-02 -1.90787467e-02]
 [ 3.25549596e-01 -2.35619355e-01  1.94586350e-01 -3.99287993e-01
  -2.46239345e-01 -3.59189648e-01 -5.52938926e-02]
 [ 3.88812327e-01 -3.62768914e-01 -4.48329052e-01  3.68459209e-01
  -2.60646554e-01 -7.30511536e-02  3.70734308e-02]
 [ 4.01431564e-01 -3.29316324e-01 -1.07594721e-01 -9.11451209e-02
   2.29891158e-01  5.14621207e-03  4.04610197e-02]
 [ 1.72871962e-01 -5.46831788e-02  8.30995631e-02 -1.54834480e-01
  -1.59427703e-01  3.85080042e-01 -9.72202770e-02]
 [ 5.98566537e-01  5.98108991e-01 -6.66814202e-02  3.05305099e-02
   5.34360487e-03 -2.87781213e-02 -2.44070894e-02]
 [ 2.59082136e-01 -1.76483028e-01  1.18735256e-01 -1.85860632e-01
  -3.24030617e-01  4.76593510e-01 -3.77322924e-01]
 [ 2.85857247e-01 -2.16452087e-01  1.56285206e-01 -3.83067065e-01
  -2.24662519e-01 -4.59375982e-01 -1.60404615e-02]
 [ 3.96454518e-01 -3.51785523e-01 -4.06191581e-01  3.09628775e-01
  -1.65348903e-01 -3.42214059e-02 -8.79935957e-02]
 [ 5.68307565e-01  5.79236354e-01 -2.49977438e-02 -1.65820193e-03
  -1.48330776e-03  4.97525494e-04 -7.56653060e-03]
 [ 3.95181458e-01 -3.43909965e-01 -1.12527848e-01 -1.54143147e-01
   4.24627540e-01  3.46146552e-02 -9.53357379e-02]
 [ 7.03778529e-02 -4.53018748e-02  4.47075047e-02 -1.29319689e-02
  -1.25637206e-04 -3.73101178e-03  2.26633086e-02]
 [ 5.87259340e-01  5.91592344e-01 -3.06093001e-02  3.14797614e-02
   9.20390599e-03 -8.28941483e-03 -2.50957867e-02]
 [ 2.90241679e-01 -1.59290104e-01  5.44614348e-01  3.72292370e-01
   2.60700775e-02  7.08606085e-03 -4.24466458e-02]
 [ 3.73064985e-01 -2.83432129e-01  2.07212226e-01 -1.86820663e-02
   2.03303288e-01  1.46948739e-02  1.10489338e-01]
 [ 3.80760325e-01 -3.20618500e-01 -2.67027067e-01  4.74970999e-02
   1.41382144e-01 -1.72863694e-02  8.04289208e-03]
 [ 2.76029781e-01 -2.66104786e-01 -3.70078860e-01  3.35161862e-01
  -2.59387443e-01 -7.34908946e-02  4.83959546e-02]
 [ 2.87419636e-01 -2.05299959e-01  1.46794264e-01 -3.22859868e-01
  -2.05122322e-01 -3.24165310e-01 -4.45227118e-02]
 [ 1.91400319e-01 -1.25861226e-01  4.36682522e-02 -1.02268922e-01
  -2.32049150e-02  1.95768614e-01  5.96553168e-01]
 [ 2.33925324e-01 -8.46322843e-03  1.35493523e-01 -1.92794298e-01
  -1.74616417e-01  4.49616713e-02 -1.85204985e-01]
 [ 4.09480701e-01 -1.78620470e-01 -1.61670733e-01 -8.17899037e-02
   3.68899535e-01  1.60467077e-02 -2.28751397e-01]]

Now let’s put all our results together in one DataFrame and save it to a spreadsheet, so we keep the work we’ve done so far. This will also make plotting easier in a moment.

Since we don’t know what these topics correspond to yet, for now I’ll call the first topic X, the second Y, the third Z, and so on.

import parse
import pandas
# "python-text-analysis/data/{Author}-{Title}.txt"
def parse_into_dataframe(pattern, items, col_name="Item"):
    results = []
    p = parse.compile(pattern)
    for item in items:
        result = p.search(item)
        if result is not None:
            result.named[col_name] = item
            results.append(result.named)
    
    return pandas.DataFrame.from_dict(results)
data = parse_into_dataframe("python-text-analysis/data/{Author}-{Title}.txt", lemmas_file_list)

# skipping column 0 of lsa because it is less informative in a truncated SVD

data[["X", "Y", "Z", "W", "P", "Q"]] = lsa[:, [1, 2, 3, 4, 5, 6]]
data[["X", "Y", "Z", "W", "P", "Q"]] -= data[["X", "Y", "Z", "W", "P", "Q"]].mean()

data.to_csv("results.csv")
print(data)
         Author              Title  \
0       dickens        olivertwist   
1      melville               omoo   
2        austen         northanger   
3    chesterton              brown   
4    chesterton        knewtoomuch   
5       dickens    ourmutualfriend   
6        austen               emma   
7       dickens     christmascarol   
8      melville        piazzatales   
9      melville             conman   
10  shakespeare            muchado   
11        dumas      tenyearslater   
12  shakespeare               lear   
13        dumas    threemusketeers   
14        dumas        montecristo   
15  shakespeare              romeo   
16      dickens  greatexpectations   
17       austen         persuasion   
18     melville             pierre   
19        dumas   twentyyearsafter   
20  shakespeare             caesar   
21   chesterton               ball   
22       austen              pride   
23      dickens         bleakhouse   
24     melville          moby_dick   
25  shakespeare       twelfthnight   
26     melville              typee   
27   chesterton           thursday   
28       austen              sense   
29  shakespeare          midsummer   
30      dickens     pickwickpapers   
31        dumas         blacktulip   
32  shakespeare            othello   
33        dumas      maninironmask   
34      dickens    taleoftwocities   
35      dickens   davidcopperfield   
36       austen          ladysusan   
37   chesterton           napoleon   
38     melville           bartleby   
39   chesterton         whitehorse   
40      dickens          hardtimes   

                                                 Item         X         Y  \
0   python-text-analysis/data/dickens-olivertwist.... -0.261657 -0.141328   
1   python-text-analysis/data/melville-omoo.txt.le... -0.126564  0.141689   
2   python-text-analysis/data/austen-northanger.tx... -0.190409 -0.274343   
3   python-text-analysis/data/chesterton-brown.txt... -0.134116  0.159160   
4   python-text-analysis/data/chesterton-knewtoomu... -0.152394  0.196312   
5   python-text-analysis/data/dickens-ourmutualfri... -0.294504 -0.093461   
6    python-text-analysis/data/austen-emma.txt.lemmas -0.296076 -0.460560   
7   python-text-analysis/data/dickens-christmascar... -0.113437 -0.044181   
8   python-text-analysis/data/melville-piazzatales... -0.089402  0.120273   
9   python-text-analysis/data/melville-conman.txt.... -0.075196  0.072590   
10  python-text-analysis/data/shakespeare-muchado....  0.627768 -0.103653   
11  python-text-analysis/data/dumas-tenyearslater.... -0.100100  0.539284   
12  python-text-analysis/data/shakespeare-lear.txt...  0.727088 -0.068711   
13  python-text-analysis/data/dumas-threemusketeer... -0.069949  0.542575   
14  python-text-analysis/data/dumas-montecristo.tx... -0.151580  0.387500   
15  python-text-analysis/data/shakespeare-romeo.tx...  0.736124 -0.094139   
16  python-text-analysis/data/dickens-greatexpecta... -0.278731 -0.085005   
17  python-text-analysis/data/austen-persuasion.tx... -0.268409 -0.377253   
18  python-text-analysis/data/melville-pierre.txt....  0.239109 -0.006490   
19  python-text-analysis/data/dumas-twentyyearsaft... -0.050829  0.604854   
20  python-text-analysis/data/shakespeare-caesar.t...  0.686348 -0.073158   
21  python-text-analysis/data/chesterton-ball.txt.... -0.159020  0.163514   
22  python-text-analysis/data/austen-pride.txt.lemmas -0.286169 -0.479401   
23  python-text-analysis/data/dickens-bleakhouse.t... -0.252717 -0.138667   
24  python-text-analysis/data/melville-moby_dick.t...  0.021916  0.052027   
25  python-text-analysis/data/shakespeare-twelfthn...  0.674709 -0.097754   
26  python-text-analysis/data/melville-typee.txt.l... -0.099883  0.087663   
27  python-text-analysis/data/chesterton-thursday.... -0.139853  0.125213   
28  python-text-analysis/data/austen-sense.txt.lemmas -0.275186 -0.437264   
29  python-text-analysis/data/shakespeare-midsumme...  0.655836 -0.056070   
30  python-text-analysis/data/dickens-pickwickpape... -0.267310 -0.143600   
31  python-text-analysis/data/dumas-blacktulip.txt...  0.031298  0.013635   
32  python-text-analysis/data/shakespeare-othello....  0.668192 -0.061681   
33  python-text-analysis/data/dumas-maninironmask.... -0.082691  0.513542   
34  python-text-analysis/data/dickens-taleoftwocit... -0.206833  0.176140   
35  python-text-analysis/data/dickens-davidcopperf... -0.244019 -0.298099   
36  python-text-analysis/data/austen-ladysusan.txt... -0.189505 -0.401151   
37  python-text-analysis/data/chesterton-napoleon.... -0.128700  0.115722   
38  python-text-analysis/data/melville-bartleby.tx... -0.049262  0.012596   
39  python-text-analysis/data/chesterton-whitehors...  0.068136  0.104421   
40  python-text-analysis/data/dickens-hardtimes.tx... -0.102021 -0.192743   

           Z         W         P         Q  
0  -0.152952  0.466738  0.032626 -0.164769  
1  -0.205628 -0.306997  0.547896 -0.282132  
2   0.214874 -0.156814 -0.039271  0.007463  
3  -0.331021 -0.219545 -0.220116 -0.054240  
4  -0.308552 -0.210525 -0.310814  0.025220  
5  -0.203471  0.395555  0.036936 -0.028225  
6   0.325624 -0.187031 -0.057312  0.013975  
7  -0.133825  0.157595  0.016639  0.123924  
8  -0.199568 -0.177685  0.425317  0.347390  
9  -0.128043 -0.060443  0.124801  0.436607  
10  0.015490  0.019470 -0.026153  0.001281  
11  0.452941  0.062621 -0.031198  0.008760  
12  0.032747  0.035687 -0.021192 -0.020976  
13  0.453183  0.088892 -0.075616 -0.060784  
14  0.203878  0.010488  0.003543  0.064830  
15  0.008528  0.036364 -0.023309 -0.005139  
16 -0.197163  0.393089  0.054931  0.034316  
17  0.261359 -0.183583 -0.000913 -0.034560  
18 -0.071624 -0.049393  0.045199  0.196795  
19  0.477056  0.101589 -0.064213 -0.078628  
20  0.006221  0.024514 -0.029819 -0.023293  
21 -0.394977 -0.226490 -0.368533 -0.059509  
22  0.372770 -0.240897 -0.082395  0.032859  
23 -0.086834  0.249641 -0.004198  0.036246  
24 -0.150524 -0.139678  0.375736 -0.101435  
25  0.034841  0.025093 -0.038122 -0.028622  
26 -0.181550 -0.304281  0.467250 -0.381538  
27 -0.378756 -0.204913 -0.468720 -0.020255  
28  0.313940 -0.145599 -0.043565 -0.092208  
29  0.002653  0.018266 -0.008846 -0.011781  
30 -0.149832  0.444377  0.025271 -0.099550  
31 -0.008621  0.019624 -0.013075  0.018449  
32  0.035791  0.028953 -0.017633 -0.029310  
33  0.376603  0.045819 -0.002258 -0.046661  
34 -0.014371  0.223053  0.005351  0.106275  
35  0.051808  0.161132 -0.026630  0.003828  
36  0.339473 -0.239638 -0.082835  0.044181  
37 -0.318549 -0.185373 -0.333509 -0.048737  
38 -0.097958 -0.003455  0.186425  0.592339  
39 -0.188483 -0.154867  0.035618 -0.189420  
40 -0.077479  0.388649  0.006703 -0.232966  

Inspecting Results

Let’s plot the results, color-coding by author to see if any patterns are immediately apparent. We’ll focus on the X and Y topics for now to illustrate the workflow. We’ll return to the other topics in our model as a further exercise.

colormap = {
    "austen": "red",
    "chesterton": "blue",
    "dickens": "green",
    "dumas": "orange",
    "melville": "cyan",
    "shakespeare": "magenta"
}

data["Color"] = data["Author"].map(colormap)

xR2 = round(svdmodel.explained_variance_ratio_[1] * 100, 2)
yR2 = round(svdmodel.explained_variance_ratio_[2] * 100, 2)
for author, books in data.groupby(by="Author"):
  books.plot(
      "X", "Y",
      kind="scatter",
      ax=plt.gca(),
      figsize=[5, 5],
      label=author,
      c="Color",
      xlim=[-1, 1],
      ylim=[-1, 1],
      title="My LSA Plot",
      xlabel=f"Topic X ({xR2}%)",
      ylabel=f"Topic Y ({yR2}%)"
  )

[Plot: “My LSA Plot”, Topic X vs. Topic Y scatter, points colored by author]

It seems that some of the books by the same author are clumping up and getting arranged together in our plot, much like how we arranged cards in the card sorting activity much earlier.

We don’t know why they are getting arranged this way, since we don’t know what more meaningful concepts X and Y correspond to. But we can do some work to figure that out.

Let’s write a helper to get the strongest words for each topic. It will show the terms with the highest and lowest association with a topic. In LSA, each topic is a spectrum of subject matter, running from the kinds of terms on the low end to the kinds of terms on the high end. So, inspecting the contrast between these high and low terms (and checking that against our domain knowledge) can help us interpret what our model is identifying.

def showTopics(topic, n):
  terms = vectorizer.get_feature_names_out()
  weights = svdmodel.components_[topic]
  df = pandas.DataFrame({"Term": terms, "Weight": weights})
  tops = df.sort_values(by=["Weight"], ascending=False)[0:n]
  bottoms = df.sort_values(by=["Weight"], ascending=False)[-n:]
  return pandas.concat([tops, bottoms])

Let’s use it to get the terms for the X topic.

What does this topic seem to represent to you? What’s the contrast between the top and bottom terms?

print(showTopics(1, 5))
            Term    Weight
8718        thou  0.369606
4026        hath  0.368384
3104        exit  0.219252
8673        thee  0.194711
8783         tis  0.184968
9435          ve -0.083406
555   attachment -0.090431
294           am -0.103122
5312          ma -0.117927
581         aunt -0.139385

And the Y topic.

What does this topic seem to represent to you? What’s the contrast between the top and bottom terms?

print(showTopics(2, 5))
            Term    Weight
1221    cardinal  0.269191
5318      madame  0.258087
6946       queen  0.229547
4189       honor  0.211801
5746   musketeer  0.203572
294           am -0.112988
5312          ma -0.124932
555   attachment -0.150380
783    behaviour -0.158139
581         aunt -0.216180

Now that we have names for our first two topics, let’s redo the plot with better axis labels.

for author, books in data.groupby(by="Author"):
  books.plot(
      "X", "Y",
      kind="scatter",
      ax=plt.gca(),
      figsize=[5, 5],
      label=author,
      c="Color",
      xlim=[-1, 1],
      ylim=[-1, 1],
      title="My LSA Plot",
      xlabel=f"Victorian vs. Elizabethan ({xR2}%)",
      ylabel=f"English vs. French ({yR2}%)"
    )

[Plot: “My LSA Plot”, Victorian vs. Elizabethan (X) against English vs. French (Y), points colored by author]

Finally, let’s repeat this process with the other 4 topics, tentatively called Z, W, P, and Q.

In the first two topics (X and Y), some authors were clearly separated, but others overlapped. If we hadn’t color-coded them, we wouldn’t easily be able to tell them apart.

But in the next few topics, this flips, with different combinations of authors getting pulled apart and pulled together. This is because these topics (Z, W, P, and Q) highlight different features of the data, independent of the features we’ve already captured above.
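One way to see that independence concretely: because the document scores come from an SVD, the topic columns of our lsa array are mutually orthogonal. Here is a quick check (the off-diagonal entries should be essentially zero):

import numpy as np

# lsa is documents x topics; its columns are orthogonal, so this product is
# (numerically) a diagonal matrix containing the squared singular values.
print(np.round(lsa.T @ lsa, 2))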

zR2 = round(svdmodel.explained_variance_ratio_[3] * 100, 2)
wR2 = round(svdmodel.explained_variance_ratio_[4] * 100, 2)
for author, books in data.groupby(by="Author"):
  books.plot(
      "Z", "W",
      kind="scatter",
      ax=plt.gca(),
      figsize=[5, 5],
      label=author,
      c="Color",
      xlim=[-1, 1],
      ylim=[-1, 1],
      title="My LSA Plot",
      xlabel=f"Topic Z ({zR2}%)",
      ylabel=f"Topic W ({wR2}%)"
    )

I wonder what these topics correspond to?

print(showTopics(3, 5))
print(showTopics(4, 5))
for author, books in data.groupby(by="Author"):
  books.plot(
      "Z", "W",
      kind="scatter",
      ax=plt.gca(),
      figsize=[5, 5],
      label=author,
      c="Color",
      xlim=[-1, 1],
      ylim=[-1, 1],
      title="My LSA Plot",
      xlabel=f"Common vs. Royal ({zR2}%)",
      ylabel=f"Sea vs. Contractions ({wR2}%)"
    )

So far, Austen and Dickens have been lumped together, as have Chesterton and Melville. But in these last two topics, P and Q, we can see their differences.

pR2 = round(svdmodel.explained_variance_ratio_[5] * 100, 2)
qR2 = round(svdmodel.explained_variance_ratio_[6] * 100, 2)
for author, books in data.groupby(by="Author"):
  books.plot(
      "P", "Q",
      kind="scatter",
      ax=plt.gca(),
      figsize=[5, 5],
      label=author,
      c="Color",
      xlim=[-1, 1],
      ylim=[-1, 1],
      title="My LSA Plot",
      xlabel=f"Topic P ({pR2}%)",
      ylabel=f"Topic Q ({qR2}%)"
    )

I wonder what these topics correspond to?

print(showTopics(5, 5))
print(showTopics(6, 5))
for author, books in data.groupby(by="Author"):
  books.plot(
      "P", "Q",
      kind="scatter",
      ax=plt.gca(),
      figsize=[5, 5],
      label=author,
      c="Color",
      xlim=[-1, 1],
      ylim=[-1, 1],
      title="My LSA Plot",
      xlabel=f"Land vs. Sea ({pR2}%)",
      ylabel=f"Exploration vs. Labor ({qR2}%)"
    )

Key Points

  • TODO