LSA
Overview
Teaching: 20 min
Exercises: 20 min
Questions
TODO
Objectives
TODO
Topic Modeling
So far, we’ve:
- Warmed up with a hands-on card sorting activity and a discussion about the kinds of tasks we have as qualitative or mixed methods researchers
- Situated ourselves with the kinds of tasks machine learning is well suited for, alongside some of the common ethical risk areas for data-driven research and technology
- Narrowed our focus to modern Natural Language Processing tasks
- Introduced vector spaces for thinking about the differences and similarities between several documents at once
- Prepared our data for processing in Python
- And introduced word embeddings and TF-IDF matrices as one approach for representing our documents as large vector spaces
We now begin to close the loop with Topic Modeling.
Topic Modeling is a frequent goal of text analysis. Topics are the things that a document is about, by some sense of “about.” We could think of topics as:
- discrete categories that your documents belong to, such as fiction vs. non-fiction
- or spectra of subject matter that your documents contain in differing amounts, such as being about politics, cooking, racing, dragons, …
In the first case, we could use machine learning to predict discrete categories, such as trying to determine the author of the Federalist Papers.
In the second case, we could try to determine the smallest number of topics that provides the most information about how our documents differ from one another, then use those topics to gain insight about the “stuff” or “story” of our documents.
In this lesson we’ll focus on this second case, where topics are treated as spectra of subject matter. There are a variety of ways of doing this, and not all of them use the vector space model we have learned. For example:
- Vector-space models:
- Principal Component Analysis (PCA)
- Epistemic Network Analysis (ENA)
- Linear Discriminant Analysis (LDA)
- Latent Semantic Analysis (LSA)
- Probability models:
- Latent Dirichlet Allocation (LDA)
Specifically, we will be discussing Latent Semantic Analysis (LSA). We’re narrowing our focus to LSA because it introduces us to concepts and workflows that we will use in the future.
The assumption behind LSA is that underlying the thousands of words in our vocabulary are a smaller number of hidden topics, and that those topics help explain the distribution of the words we see across our documents. In all our models so far, each dimension has corresponded to a single word. But in LSA, each dimension now corresponds to a hidden topic, and each of those in turn corresponds to the words that are most strongly associated with it.
For example, a hidden topic might be the lasting influence of the Battle of Hastings on the English language, with some documents using more words with Anglo-Saxon roots and other documents using more words with Latin roots. This dimension is “hidden” because authors don’t usually stamp a label on their books with a summary of the linguistic histories of their words. Still, we can imagine a spectrum between words that are strongly indicative of authors with more Anglo-Saxon diction vs. words strongly indicative of authors with more Latin diction. Once we have that spectrum, we can place our documents along it, then move on to the next hidden topic, then the next, and so on, until we’ve discussed the fewest, strongest hidden topics that capture the most “story” about our documents.
Setting Up TF-IDF
Mathematically, these “latent semantic” dimensions are derived from our TF-IDF matrix, so let’s begin there.
First we need to download the data we’ll be using and do a little Python setup.
!pip install parse
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: parse in /usr/local/lib/python3.9/dist-packages (1.19.0)
# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# add the workshop folder to Colab's path so we can import the helper functions
import sys
from os import listdir
wksp_dir = '/content/drive/My Drive/Colab Notebooks/text-analysis'
sys.path.insert(0, wksp_dir)
listdir(wksp_dir)
# and import our functions from helpers.py
from helpers import create_file_list, parse_into_dataframe, lemmatize_files
# get list of files to analyze
data_dir = wksp_dir + '/data/'
corpus_file_list = create_file_list(data_dir, "*.txt")
# parse filelist into a dataframe
data = parse_into_dataframe(data_dir + "{Author}-{Title}.txt", corpus_file_list)
data
 | Author | Title | Item |
---|---|---|---|
0 | austen | sense | /content/drive/My Drive/Colab Notebooks/text-a... |
21 | austen | persuasion | /content/drive/My Drive/Colab Notebooks/text-a... |
12 | austen | pride | /content/drive/My Drive/Colab Notebooks/text-a... |
22 | austen | northanger | /content/drive/My Drive/Colab Notebooks/text-a... |
9 | austen | emma | /content/drive/My Drive/Colab Notebooks/text-a... |
7 | austen | ladysusan | /content/drive/My Drive/Colab Notebooks/text-a... |
10 | chesterton | thursday | /content/drive/My Drive/Colab Notebooks/text-a... |
8 | chesterton | ball | /content/drive/My Drive/Colab Notebooks/text-a... |
5 | chesterton | brown | /content/drive/My Drive/Colab Notebooks/text-a... |
30 | chesterton | knewtoomuch | /content/drive/My Drive/Colab Notebooks/text-a... |
39 | chesterton | whitehorse | /content/drive/My Drive/Colab Notebooks/text-a... |
27 | chesterton | napoleon | /content/drive/My Drive/Colab Notebooks/text-a... |
18 | dickens | hardtimes | /content/drive/My Drive/Colab Notebooks/text-a... |
28 | dickens | bleakhouse | /content/drive/My Drive/Colab Notebooks/text-a... |
38 | dickens | davidcopperfield | /content/drive/My Drive/Colab Notebooks/text-a... |
40 | dickens | taleoftwocities | /content/drive/My Drive/Colab Notebooks/text-a... |
17 | dickens | christmascarol | /content/drive/My Drive/Colab Notebooks/text-a... |
20 | dickens | greatexpectations | /content/drive/My Drive/Colab Notebooks/text-a... |
41 | dickens | pickwickpapers | /content/drive/My Drive/Colab Notebooks/text-a... |
2 | dickens | ourmutualfriend | /content/drive/My Drive/Colab Notebooks/text-a... |
13 | dickens | olivertwist | /content/drive/My Drive/Colab Notebooks/text-a... |
37 | dumas | threemusketeers | /content/drive/My Drive/Colab Notebooks/text-a... |
33 | dumas | montecristo | /content/drive/My Drive/Colab Notebooks/text-a... |
32 | dumas | twentyyearsafter | /content/drive/My Drive/Colab Notebooks/text-a... |
3 | dumas | tenyearslater | /content/drive/My Drive/Colab Notebooks/text-a... |
29 | dumas | maninironmask | /content/drive/My Drive/Colab Notebooks/text-a... |
14 | dumas | blacktulip | /content/drive/My Drive/Colab Notebooks/text-a... |
34 | litbank | conll | /content/drive/My Drive/Colab Notebooks/text-a... |
4 | melville | moby_dick | /content/drive/My Drive/Colab Notebooks/text-a... |
24 | melville | typee | /content/drive/My Drive/Colab Notebooks/text-a... |
23 | melville | pierre | /content/drive/My Drive/Colab Notebooks/text-a... |
11 | melville | piazzatales | /content/drive/My Drive/Colab Notebooks/text-a... |
19 | melville | conman | /content/drive/My Drive/Colab Notebooks/text-a... |
1 | melville | omoo | /content/drive/My Drive/Colab Notebooks/text-a... |
15 | melville | bartleby | /content/drive/My Drive/Colab Notebooks/text-a... |
26 | shakespeare | othello | /content/drive/My Drive/Colab Notebooks/text-a... |
6 | shakespeare | midsummer | /content/drive/My Drive/Colab Notebooks/text-a... |
16 | shakespeare | muchado | /content/drive/My Drive/Colab Notebooks/text-a... |
31 | shakespeare | caesar | /content/drive/My Drive/Colab Notebooks/text-a... |
35 | shakespeare | lear | /content/drive/My Drive/Colab Notebooks/text-a... |
36 | shakespeare | romeo | /content/drive/My Drive/Colab Notebooks/text-a... |
25 | shakespeare | twelfthnight | /content/drive/My Drive/Colab Notebooks/text-a... |
Next we’ll load our tokenizer for processing our documents.
Then we’ll preprocess each file, creating a “lemmatized” copy of each.
Note: This can take several minutes to run. It took me around 7 minutes when I was writing these instructions.
import spacy, logging

# load a small English pipeline, keeping only the components needed for lemmatization
tokenizer = spacy.load('en_core_web_sm', disable=["parser", "ner", "textcat", "senter"])

# raise the default character limit so spaCy will accept our longest novels
tokenizer.max_length = 4500000
def lemmatize_files(tokenizer, corpus_file_list, pos_set={"ADJ", "ADV", "INTJ", "NOUN", "VERB"}, stop_set=set()):
"""
Example:
data["Lemma_File"] = lemmatize_files(tokenizer, corpus_file_list)
"""
logging.warning("This function is computationally intensive. It may take several minutes to finish running.")
N = len(corpus_file_list)
lemma_filename_list = []
for i, filename in enumerate(corpus_file_list):
logging.info(f"{i+1} out of {N}: Lemmatizing {filename}")
lemma_filename = filename + ".lemmas"
lemma_filename_list.append(lemma_filename)
open(lemma_filename, "w", encoding="utf-8").writelines(
token.lemma_.lower() + "\n"
for token in tokenizer(open(filename, "r", encoding="utf-8").read())
if token.pos_ in pos_set
and token.lemma_.lower() not in stop_set
# and token.text_.lower() not in stop_set
)
return lemma_filename_list
data["LemmaFile"] = lemmatize_files(tokenizer, corpus_file_list)
WARNING:root:This function is computationally intensive. It may take several minutes to finish running.
Here is the same lemmatization loop written out directly; this pass produced the .lemmas files listed below.
lemmas_file_list = []
pos_set = {"ADJ", "ADV", "INTJ", "NOUN", "VERB"}
for i, filename in enumerate(corpus_file_list):
print(i+1, "out of", len(corpus_file_list), "Lemmatizing", filename)
lemma_filename = filename + ".lemmas"
lemmas_file_list.append(lemma_filename)
open(lemma_filename, "w", encoding="utf-8").writelines(
token.lemma_.lower() + "\n"
for token in tokenizer(open(filename, "r", encoding="utf-8").read())
if token.pos_ in pos_set
)
print(lemmas_file_list)
1 out of 41 Lemmatizing python-text-analysis/data/dickens-olivertwist.txt
2 out of 41 Lemmatizing python-text-analysis/data/dumas-montecristo.txt
3 out of 41 Lemmatizing python-text-analysis/data/melville-typee.txt
4 out of 41 Lemmatizing python-text-analysis/data/dumas-twentyyearsafter.txt
5 out of 41 Lemmatizing python-text-analysis/data/dumas-blacktulip.txt
6 out of 41 Lemmatizing python-text-analysis/data/dumas-tenyearslater.txt
7 out of 41 Lemmatizing python-text-analysis/data/dickens-davidcopperfield.txt
8 out of 41 Lemmatizing python-text-analysis/data/chesterton-thursday.txt
9 out of 41 Lemmatizing python-text-analysis/data/shakespeare-lear.txt
10 out of 41 Lemmatizing python-text-analysis/data/shakespeare-midsummer.txt
11 out of 41 Lemmatizing python-text-analysis/data/chesterton-whitehorse.txt
12 out of 41 Lemmatizing python-text-analysis/data/chesterton-ball.txt
13 out of 41 Lemmatizing python-text-analysis/data/dickens-taleoftwocities.txt
14 out of 41 Lemmatizing python-text-analysis/data/chesterton-napoleon.txt
15 out of 41 Lemmatizing python-text-analysis/data/shakespeare-romeo.txt
16 out of 41 Lemmatizing python-text-analysis/data/chesterton-brown.txt
17 out of 41 Lemmatizing python-text-analysis/data/dickens-pickwickpapers.txt
18 out of 41 Lemmatizing python-text-analysis/data/austen-pride.txt
19 out of 41 Lemmatizing python-text-analysis/data/dickens-hardtimes.txt
20 out of 41 Lemmatizing python-text-analysis/data/melville-moby_dick.txt
21 out of 41 Lemmatizing python-text-analysis/data/austen-emma.txt
22 out of 41 Lemmatizing python-text-analysis/data/shakespeare-othello.txt
23 out of 41 Lemmatizing python-text-analysis/data/melville-conman.txt
24 out of 41 Lemmatizing python-text-analysis/data/dickens-ourmutualfriend.txt
25 out of 41 Lemmatizing python-text-analysis/data/dickens-greatexpectations.txt
26 out of 41 Lemmatizing python-text-analysis/data/shakespeare-muchado.txt
27 out of 41 Lemmatizing python-text-analysis/data/chesterton-knewtoomuch.txt
28 out of 41 Lemmatizing python-text-analysis/data/austen-northanger.txt
29 out of 41 Lemmatizing python-text-analysis/data/dickens-christmascarol.txt
30 out of 41 Lemmatizing python-text-analysis/data/austen-persuasion.txt
31 out of 41 Lemmatizing python-text-analysis/data/melville-bartleby.txt
32 out of 41 Lemmatizing python-text-analysis/data/austen-sense.txt
33 out of 41 Lemmatizing python-text-analysis/data/dumas-threemusketeers.txt
34 out of 41 Lemmatizing python-text-analysis/data/melville-piazzatales.txt
35 out of 41 Lemmatizing python-text-analysis/data/shakespeare-caesar.txt
36 out of 41 Lemmatizing python-text-analysis/data/melville-pierre.txt
37 out of 41 Lemmatizing python-text-analysis/data/dumas-maninironmask.txt
38 out of 41 Lemmatizing python-text-analysis/data/austen-ladysusan.txt
39 out of 41 Lemmatizing python-text-analysis/data/melville-omoo.txt
40 out of 41 Lemmatizing python-text-analysis/data/dickens-bleakhouse.txt
41 out of 41 Lemmatizing python-text-analysis/data/shakespeare-twelfthnight.txt
['python-text-analysis/data/dickens-olivertwist.txt.lemmas', 'python-text-analysis/data/dumas-montecristo.txt.lemmas', 'python-text-analysis/data/melville-typee.txt.lemmas', 'python-text-analysis/data/dumas-twentyyearsafter.txt.lemmas', 'python-text-analysis/data/dumas-blacktulip.txt.lemmas', 'python-text-analysis/data/dumas-tenyearslater.txt.lemmas', 'python-text-analysis/data/dickens-davidcopperfield.txt.lemmas', 'python-text-analysis/data/chesterton-thursday.txt.lemmas', 'python-text-analysis/data/shakespeare-lear.txt.lemmas', 'python-text-analysis/data/shakespeare-midsummer.txt.lemmas', 'python-text-analysis/data/chesterton-whitehorse.txt.lemmas', 'python-text-analysis/data/chesterton-ball.txt.lemmas', 'python-text-analysis/data/dickens-taleoftwocities.txt.lemmas', 'python-text-analysis/data/chesterton-napoleon.txt.lemmas', 'python-text-analysis/data/shakespeare-romeo.txt.lemmas', 'python-text-analysis/data/chesterton-brown.txt.lemmas', 'python-text-analysis/data/dickens-pickwickpapers.txt.lemmas', 'python-text-analysis/data/austen-pride.txt.lemmas', 'python-text-analysis/data/dickens-hardtimes.txt.lemmas', 'python-text-analysis/data/melville-moby_dick.txt.lemmas', 'python-text-analysis/data/austen-emma.txt.lemmas', 'python-text-analysis/data/shakespeare-othello.txt.lemmas', 'python-text-analysis/data/melville-conman.txt.lemmas', 'python-text-analysis/data/dickens-ourmutualfriend.txt.lemmas', 'python-text-analysis/data/dickens-greatexpectations.txt.lemmas', 'python-text-analysis/data/shakespeare-muchado.txt.lemmas', 'python-text-analysis/data/chesterton-knewtoomuch.txt.lemmas', 'python-text-analysis/data/austen-northanger.txt.lemmas', 'python-text-analysis/data/dickens-christmascarol.txt.lemmas', 'python-text-analysis/data/austen-persuasion.txt.lemmas', 'python-text-analysis/data/melville-bartleby.txt.lemmas', 'python-text-analysis/data/austen-sense.txt.lemmas', 'python-text-analysis/data/dumas-threemusketeers.txt.lemmas', 'python-text-analysis/data/melville-piazzatales.txt.lemmas', 'python-text-analysis/data/shakespeare-caesar.txt.lemmas', 'python-text-analysis/data/melville-pierre.txt.lemmas', 'python-text-analysis/data/dumas-maninironmask.txt.lemmas', 'python-text-analysis/data/austen-ladysusan.txt.lemmas', 'python-text-analysis/data/melville-omoo.txt.lemmas', 'python-text-analysis/data/dickens-bleakhouse.txt.lemmas', 'python-text-analysis/data/shakespeare-twelfthnight.txt.lemmas']
# Optional: zip the lemmatized files so they can be reused later without
# re-running the (slow) lemmatization step.
# %%shell
# cd python-text-analysis/data
# zip lemmas *.lemmas

# Optional: restore a previously saved set of lemmatized files.
# %%shell
# cd python-text-analysis/data
# unzip lemmas.zip
After that, we’ll convert our lemmatized files to TF-IDF matrix format, just as we did in the previous lesson.
Recall that max_df=.6 removes terms that appear in more than 60% of our documents (overly common words like the, a, an) and min_df=.1 removes terms that appear in fewer than 10% of our documents (overly rare words like specific character names, typos, or punctuation the tokenizer doesn’t understand). We’re looking for the sweet spot where terms are frequent enough for us to build a theoretical understanding of what they mean for our corpus, but not so frequent that they can’t help us tell our documents apart.
from sklearn.feature_extraction.text import TfidfVectorizer
lemmas_file_list = create_file_list("python-text-analysis/data", "*.txt.lemmas")
vectorizer = TfidfVectorizer(input='filename', max_df=.6, min_df=.1)
tfidf = vectorizer.fit_transform(lemmas_file_list)
print(tfidf.shape)
(41, 9879)
How LSA Works: SVD of TF-IDF Matrix
What do these dimensions mean? We have 41 documents, which we can think of as rows. And we have nearly 10,000 unique terms (9,879, as the shape above shows), which is like a dictionary of all the types of words in our documents, and which we represent as columns.
Now we want to reduce the number of dimensions used to represent our documents. Sure, we could talk through each of the thousands of words in our dictionary, and we could talk through each of our 41 individual documents. Qualitative researchers are capable of great things. But those approaches don’t take advantage of our model, they require a HUGE page burden to walk a reader through, and they put all the pressure on you to notice cross-cutting themes on your own. Instead, we can use a technique from linear algebra–Singular Value Decomposition (SVD)–to reduce those thousands of word dimensions to a smaller set of more cross-cutting dimensions.
We won’t dive deeply into all the mathematics of SVD, but we will discuss what happens in the abstract. The technique is called “SVD” because we are “decomposing” our original matrix and creating a special matrix of “singular values.” Any matrix M of arbitrary size can always be split, or decomposed, into three matrices that multiply together to make M. There are many ways to decompose a matrix; SVD is one specific, particularly useful choice.
The three resulting matrices are called U, Σ, and Vt.
The U matrix has one row per document and one column per topic. The score in each cell shows how much that document is “about” that topic.
The Vt matrix has one row per topic and one column per term. Again, the value in each cell shows how strongly a given word indicates a given topic.
The Σ matrix is the special one, the one from which SVD gets its name. Nearly every cell in it is zero; only the diagonal cells are filled in, and those diagonal entries are the singular values. Each singular value represents the amount of variation in our data explained by its topic–how much of the “stuff” of the story that topic covers.
A good deal of variation can often be explained by a relatively small number of topics, and the variation each topic explains typically shrinks with each new topic. Because of this, we can truncate, or remove, the rows with the lowest singular values, since they provide the least information.
Once this truncation happens, we can multiply together our three matrices and end up with a smaller matrix with topics instead of words as dimensions.
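To make U, Σ, and Vt concrete, here is a minimal sketch using numpy on a small made-up matrix (the toy numbers below are purely illustrative, not part of our corpus). scikit-learn’s TruncatedSVD, which we use next, performs the truncated version of this for us on the real TF-IDF matrix.
import numpy as np

# a toy "document-term" matrix: 5 documents x 4 terms
M = np.array([
    [3.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 3.0],
    [0.0, 0.0, 3.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

# decompose M into U (documents x topics), S (singular values), and Vt (topics x terms)
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(U.shape, S.shape, Vt.shape)
print(S)  # singular values, largest first

# keep only the k strongest topics and rebuild an approximation of M
k = 2
M_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.round(M_approx, 2))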
This allows us to focus our account of our documents on a narrower set of cross-cutting topics. This does come at a price, though. When we reduce dimensionality, our model loses information about our dataset. Our hope is that the information lost was unimportant. But “importance” depends on your moral and theoretical stances. Because of this, it is important to carefully inspect the results of your model, carefully interpret the “topics” it identifies, and check all of that against your qualitative and theoretical understanding of your documents.
This will likely be an iterative process in which you refine your model several times. Keep in mind the adage: all models are wrong, but some are useful; a less accurate model may even be preferable if it is easier to explain to your stakeholders.
Question: What is the maximum number of topics we could get from this model? Think about the largest number of singular values you could possibly fit in the Σ matrix.
Remember, the singular values exist only on the diagonal, so the most topics we could have is whichever we have fewer of: unique terms or documents in our corpus.
Because there are usually far more unique terms than documents, the maximum will almost always equal the number of documents we have, in this case 41.
To see this, let’s begin to reduce the dimensionality of our TF-IDF matrix using SVD, starting with the greatest number of dimensions.
from sklearn.decomposition import TruncatedSVD

# The arpack solver requires n_components to be strictly less than min(n_documents, n_terms),
# so we ask for one fewer dimension than the theoretical maximum.
maxDimensions = min(tfidf.shape) - 1

svdmodel = TruncatedSVD(n_components=maxDimensions, algorithm="arpack")
lsa = svdmodel.fit_transform(tfidf)
print(lsa)
[[ 3.91364432e-01 -3.38256707e-01 -1.10255485e-01 ... -3.30703329e-04
2.26445596e-03 -1.29373990e-02]
[ 2.83139301e-01 -2.03163967e-01 1.72761316e-01 ... 1.98594965e-04
-4.41931701e-03 -1.84732254e-02]
[ 3.32869588e-01 -2.67008449e-01 -2.43271177e-01 ... 4.50149502e-03
1.99200352e-03 2.32871393e-03]
...
[ 1.91400319e-01 -1.25861226e-01 4.36682522e-02 ... -8.51158743e-04
4.48451964e-03 1.67944132e-03]
[ 2.33925324e-01 -8.46322843e-03 1.35493523e-01 ... 5.46406784e-03
-1.11972177e-03 3.86332162e-03]
[ 4.09480701e-01 -1.78620470e-01 -1.61670733e-01 ... -6.72035999e-02
9.27745251e-03 -7.60191949e-05]]
How should we pick the number of topics to keep? Fortunately, we have the singular values to help us understand how much of the data each topic explains. Let’s visualize that on a graph.
import matplotlib.pyplot as plt
# This shows the drop-off in how much of the data each successive topic explains.
print(svdmodel.explained_variance_ratio_)
plt.plot(range(maxDimensions), svdmodel.explained_variance_ratio_ * 100)
plt.xlabel("Topic Number")
plt.ylabel("% explained")
plt.title("SVD dropoff")
plt.show() # show first chart
[0.02053967 0.12553786 0.08088013 0.06750632 0.05095583 0.04413301
0.03236406 0.02954683 0.02837433 0.02664072 0.02596086 0.02538922
0.02499496 0.0240097 0.02356043 0.02203859 0.02162737 0.0210681
0.02004 0.01955728 0.01944726 0.01830292 0.01822243 0.01737443
0.01664451 0.0160519 0.01494616 0.01461527 0.01455848 0.01374971
0.01308112 0.01255502 0.01201655 0.0112603 0.01089138 0.0096127
0.00830014 0.00771224 0.00622448 0.00499762]
A heuristic researchers often use to choose a topic count is to look at the drop-off in the percentage of data explained by each topic.
Typically the amount of data explained is high at first, drops off quickly, then starts to level out. We can pick a point on the “elbow,” where the curve goes from a high level of explanation per topic to where it starts leveling out and not explaining much more per topic. Past this point, we see diminishing returns on how much of the “stuff” of our documents we can cover quickly. This is also often a good sweet spot between overfitting our model and not having enough topics.
Alternatively, we could set a target for how much of our data we want our topics to explain in total, something like 90% or 95%. However, with a small dataset like this, that would require a large number of topics, so we’ll pick an elbow instead.
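For reference, here is a quick sketch of that target-coverage approach using the 40-component svdmodel we just fit; we won’t use the resulting number, and the 90% target is an arbitrary choice, but it shows how the calculation would go.
import numpy as np

# cumulative share of the data explained by the first 1, 2, 3, ... topics
cumulative = np.cumsum(svdmodel.explained_variance_ratio_)

# index of the first topic count that reaches the target
# (assumes the target is actually reachable with the components we kept)
target = 0.90
n_needed = int(np.searchsorted(cumulative, target)) + 1
print(f"{n_needed} topics needed to reach {target:.0%} of the explained variance")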
Looking at our results so far, a good spot in the middle of the “elbow” appears to be around 5-7 topics. So let’s refit the model with 7 components, which, after setting aside the first one (see the note below), gives us six topics to look at.
(Why is the first topic, “Topic 0,” so low? It has to do with how our SVD was set up. TruncatedSVD does not mean-center the data beforehand, which lets it take advantage of sparse matrix algorithms by leaving most of the data at zero. Otherwise, our matrix would be mostly filled with the negative of the mean of each column or row, which takes much more memory to store. The math is outside the scope of this lesson, but in this scenario topic 0 is expected to be less informative than the ones that come after it, so we’ll skip it.)
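If you want an informal check of this, here is a quick sketch using the lsa matrix and svdmodel fit above: because the TF-IDF matrix has no negative entries and isn’t mean-centered, every document in this corpus scores positively on topic 0, and its term weights point almost entirely in one direction, so it mostly reflects what the documents share rather than what distinguishes them.
import numpy as np

# every document scores positively on topic 0 in this corpus
print("All topic 0 document scores positive?", bool((lsa[:, 0] > 0).all()))

# topic 0 has few (if any) negative term weights, while topic 1 mixes
# strongly positive and strongly negative weights, i.e. it contrasts documents
print("Negative term weights, topic 0:", int((svdmodel.components_[0] < 0).sum()))
print("Negative term weights, topic 1:", int((svdmodel.components_[1] < 0).sum()))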
numDimensions = 7
svdmodel = TruncatedSVD(n_components=numDimensions, algorithm="arpack")
lsa = svdmodel.fit_transform(tfidf)
print(lsa)
[[ 3.91364432e-01 -3.38256707e-01 -1.10255485e-01 -1.57263147e-01
4.46988327e-01 4.19701195e-02 -1.60554169e-01]
[ 2.83139301e-01 -2.03163967e-01 1.72761316e-01 -2.09939164e-01
-3.26746690e-01 5.57239735e-01 -2.77917582e-01]
[ 3.32869588e-01 -2.67008449e-01 -2.43271177e-01 2.10563091e-01
-1.76563657e-01 -2.99275913e-02 1.16776821e-02]
[ 3.08138678e-01 -2.10715886e-01 1.90232173e-01 -3.35332382e-01
-2.39294420e-01 -2.10772234e-01 -5.00250358e-02]
[ 3.05001339e-01 -2.28993064e-01 2.27384118e-01 -3.12862475e-01
-2.30273991e-01 -3.01470572e-01 2.94344505e-02]
[ 4.61714301e-01 -3.71103910e-01 -6.23885346e-02 -2.07781625e-01
3.75805961e-01 4.62796547e-02 -2.40105061e-02]
[ 3.99078406e-01 -3.72675621e-01 -4.29488320e-01 3.21312840e-01
-2.06780567e-01 -4.79678166e-02 1.81897768e-02]
[ 2.60635143e-01 -1.90036072e-01 -1.31092747e-02 -1.38136420e-01
1.37846031e-01 2.59831829e-02 1.28138615e-01]
[ 2.75254100e-01 -1.66002010e-01 1.51344979e-01 -2.03879356e-01
-1.97434785e-01 4.34660579e-01 3.51604210e-01]
[ 2.63962657e-01 -1.51795541e-01 1.03662446e-01 -1.32354362e-01
-8.01919283e-02 1.34144571e-01 4.40821829e-01]
[ 5.39085586e-01 5.51168135e-01 -7.25812593e-02 1.11795245e-02
-2.79031624e-04 -1.68092332e-02 5.49535679e-03]
[ 2.69952815e-01 -1.76699531e-01 5.70356228e-01 4.48630131e-01
4.28713759e-02 -2.18545514e-02 1.29750415e-02]
[ 6.20096940e-01 6.50488110e-01 -3.76389598e-02 2.84363611e-02
1.59378698e-02 -1.18479143e-02 -1.67609142e-02]
[ 2.39439789e-01 -1.46548125e-01 5.73647210e-01 4.48872088e-01
6.91429226e-02 -6.62720018e-02 -5.65690665e-02]
[ 3.46673808e-01 -2.28179603e-01 4.18572442e-01 1.99567055e-01
-9.26169891e-03 1.28870542e-02 6.90447513e-02]
[ 6.16613469e-01 6.59524199e-01 -6.30672750e-02 4.21736740e-03
1.66141337e-02 -1.39649741e-02 -9.24035248e-04]
[ 4.19959535e-01 -3.55330895e-01 -5.39327447e-02 -2.01473687e-01
3.73339308e-01 6.42749710e-02 3.85309124e-02]
[ 3.69324851e-01 -3.45008143e-01 -3.46180574e-01 2.57048111e-01
-2.03332217e-01 8.43097532e-03 -3.03449265e-02]
[ 6.27339749e-01 1.62509554e-01 2.45818244e-02 -7.59347178e-02
-6.91425518e-02 5.45427510e-02 2.01009502e-01]
[ 3.10638955e-01 -1.27428647e-01 6.35926253e-01 4.72744826e-01
8.18397293e-02 -5.48693117e-02 -7.44129304e-02]
[ 5.81561697e-01 6.09748220e-01 -4.20854426e-02 1.91045296e-03
4.76425507e-03 -2.04751525e-02 -1.90787467e-02]
[ 3.25549596e-01 -2.35619355e-01 1.94586350e-01 -3.99287993e-01
-2.46239345e-01 -3.59189648e-01 -5.52938926e-02]
[ 3.88812327e-01 -3.62768914e-01 -4.48329052e-01 3.68459209e-01
-2.60646554e-01 -7.30511536e-02 3.70734308e-02]
[ 4.01431564e-01 -3.29316324e-01 -1.07594721e-01 -9.11451209e-02
2.29891158e-01 5.14621207e-03 4.04610197e-02]
[ 1.72871962e-01 -5.46831788e-02 8.30995631e-02 -1.54834480e-01
-1.59427703e-01 3.85080042e-01 -9.72202770e-02]
[ 5.98566537e-01 5.98108991e-01 -6.66814202e-02 3.05305099e-02
5.34360487e-03 -2.87781213e-02 -2.44070894e-02]
[ 2.59082136e-01 -1.76483028e-01 1.18735256e-01 -1.85860632e-01
-3.24030617e-01 4.76593510e-01 -3.77322924e-01]
[ 2.85857247e-01 -2.16452087e-01 1.56285206e-01 -3.83067065e-01
-2.24662519e-01 -4.59375982e-01 -1.60404615e-02]
[ 3.96454518e-01 -3.51785523e-01 -4.06191581e-01 3.09628775e-01
-1.65348903e-01 -3.42214059e-02 -8.79935957e-02]
[ 5.68307565e-01 5.79236354e-01 -2.49977438e-02 -1.65820193e-03
-1.48330776e-03 4.97525494e-04 -7.56653060e-03]
[ 3.95181458e-01 -3.43909965e-01 -1.12527848e-01 -1.54143147e-01
4.24627540e-01 3.46146552e-02 -9.53357379e-02]
[ 7.03778529e-02 -4.53018748e-02 4.47075047e-02 -1.29319689e-02
-1.25637206e-04 -3.73101178e-03 2.26633086e-02]
[ 5.87259340e-01 5.91592344e-01 -3.06093001e-02 3.14797614e-02
9.20390599e-03 -8.28941483e-03 -2.50957867e-02]
[ 2.90241679e-01 -1.59290104e-01 5.44614348e-01 3.72292370e-01
2.60700775e-02 7.08606085e-03 -4.24466458e-02]
[ 3.73064985e-01 -2.83432129e-01 2.07212226e-01 -1.86820663e-02
2.03303288e-01 1.46948739e-02 1.10489338e-01]
[ 3.80760325e-01 -3.20618500e-01 -2.67027067e-01 4.74970999e-02
1.41382144e-01 -1.72863694e-02 8.04289208e-03]
[ 2.76029781e-01 -2.66104786e-01 -3.70078860e-01 3.35161862e-01
-2.59387443e-01 -7.34908946e-02 4.83959546e-02]
[ 2.87419636e-01 -2.05299959e-01 1.46794264e-01 -3.22859868e-01
-2.05122322e-01 -3.24165310e-01 -4.45227118e-02]
[ 1.91400319e-01 -1.25861226e-01 4.36682522e-02 -1.02268922e-01
-2.32049150e-02 1.95768614e-01 5.96553168e-01]
[ 2.33925324e-01 -8.46322843e-03 1.35493523e-01 -1.92794298e-01
-1.74616417e-01 4.49616713e-02 -1.85204985e-01]
[ 4.09480701e-01 -1.78620470e-01 -1.61670733e-01 -8.17899037e-02
3.68899535e-01 1.60467077e-02 -2.28751397e-01]]
Next, let’s put all our results together in one DataFrame and save it to a spreadsheet, so we keep the work we’ve done so far. This will also make plotting easier in a moment.
Since we don’t yet know what these topics correspond to, for now I’ll call the first topic X, the second Y, the third Z, and so on.
import parse
import pandas
# "python-text-analysis/data/{Author}-{Title}.txt"
def parse_into_dataframe(pattern, items, col_name="Item"):
results = []
p = parse.compile(pattern)
for item in items:
result = p.search(item)
if result is not None:
result.named[col_name] = item
results.append(result.named)
return pandas.DataFrame.from_dict(results)
import pandas
data = parse_into_dataframe("python-text-analysis/data/{Author}-{Title}.txt", lemmas_file_list)
# Skip the first LSA dimension (topic 0) because it is less informative in a truncated SVD
data[["X", "Y", "Z", "W", "P", "Q"]] = lsa[:, [1, 2, 3, 4, 5, 6]]
data[["X", "Y", "Z", "W", "P", "Q"]] -= data[["X", "Y", "Z", "W", "P", "Q"]].mean()
data.to_csv("results.csv")
print(data)
Author Title \
0 dickens olivertwist
1 melville omoo
2 austen northanger
3 chesterton brown
4 chesterton knewtoomuch
5 dickens ourmutualfriend
6 austen emma
7 dickens christmascarol
8 melville piazzatales
9 melville conman
10 shakespeare muchado
11 dumas tenyearslater
12 shakespeare lear
13 dumas threemusketeers
14 dumas montecristo
15 shakespeare romeo
16 dickens greatexpectations
17 austen persuasion
18 melville pierre
19 dumas twentyyearsafter
20 shakespeare caesar
21 chesterton ball
22 austen pride
23 dickens bleakhouse
24 melville moby_dick
25 shakespeare twelfthnight
26 melville typee
27 chesterton thursday
28 austen sense
29 shakespeare midsummer
30 dickens pickwickpapers
31 dumas blacktulip
32 shakespeare othello
33 dumas maninironmask
34 dickens taleoftwocities
35 dickens davidcopperfield
36 austen ladysusan
37 chesterton napoleon
38 melville bartleby
39 chesterton whitehorse
40 dickens hardtimes
Item X Y \
0 python-text-analysis/data/dickens-olivertwist.... -0.261657 -0.141328
1 python-text-analysis/data/melville-omoo.txt.le... -0.126564 0.141689
2 python-text-analysis/data/austen-northanger.tx... -0.190409 -0.274343
3 python-text-analysis/data/chesterton-brown.txt... -0.134116 0.159160
4 python-text-analysis/data/chesterton-knewtoomu... -0.152394 0.196312
5 python-text-analysis/data/dickens-ourmutualfri... -0.294504 -0.093461
6 python-text-analysis/data/austen-emma.txt.lemmas -0.296076 -0.460560
7 python-text-analysis/data/dickens-christmascar... -0.113437 -0.044181
8 python-text-analysis/data/melville-piazzatales... -0.089402 0.120273
9 python-text-analysis/data/melville-conman.txt.... -0.075196 0.072590
10 python-text-analysis/data/shakespeare-muchado.... 0.627768 -0.103653
11 python-text-analysis/data/dumas-tenyearslater.... -0.100100 0.539284
12 python-text-analysis/data/shakespeare-lear.txt... 0.727088 -0.068711
13 python-text-analysis/data/dumas-threemusketeer... -0.069949 0.542575
14 python-text-analysis/data/dumas-montecristo.tx... -0.151580 0.387500
15 python-text-analysis/data/shakespeare-romeo.tx... 0.736124 -0.094139
16 python-text-analysis/data/dickens-greatexpecta... -0.278731 -0.085005
17 python-text-analysis/data/austen-persuasion.tx... -0.268409 -0.377253
18 python-text-analysis/data/melville-pierre.txt.... 0.239109 -0.006490
19 python-text-analysis/data/dumas-twentyyearsaft... -0.050829 0.604854
20 python-text-analysis/data/shakespeare-caesar.t... 0.686348 -0.073158
21 python-text-analysis/data/chesterton-ball.txt.... -0.159020 0.163514
22 python-text-analysis/data/austen-pride.txt.lemmas -0.286169 -0.479401
23 python-text-analysis/data/dickens-bleakhouse.t... -0.252717 -0.138667
24 python-text-analysis/data/melville-moby_dick.t... 0.021916 0.052027
25 python-text-analysis/data/shakespeare-twelfthn... 0.674709 -0.097754
26 python-text-analysis/data/melville-typee.txt.l... -0.099883 0.087663
27 python-text-analysis/data/chesterton-thursday.... -0.139853 0.125213
28 python-text-analysis/data/austen-sense.txt.lemmas -0.275186 -0.437264
29 python-text-analysis/data/shakespeare-midsumme... 0.655836 -0.056070
30 python-text-analysis/data/dickens-pickwickpape... -0.267310 -0.143600
31 python-text-analysis/data/dumas-blacktulip.txt... 0.031298 0.013635
32 python-text-analysis/data/shakespeare-othello.... 0.668192 -0.061681
33 python-text-analysis/data/dumas-maninironmask.... -0.082691 0.513542
34 python-text-analysis/data/dickens-taleoftwocit... -0.206833 0.176140
35 python-text-analysis/data/dickens-davidcopperf... -0.244019 -0.298099
36 python-text-analysis/data/austen-ladysusan.txt... -0.189505 -0.401151
37 python-text-analysis/data/chesterton-napoleon.... -0.128700 0.115722
38 python-text-analysis/data/melville-bartleby.tx... -0.049262 0.012596
39 python-text-analysis/data/chesterton-whitehors... 0.068136 0.104421
40 python-text-analysis/data/dickens-hardtimes.tx... -0.102021 -0.192743
Z W P Q
0 -0.152952 0.466738 0.032626 -0.164769
1 -0.205628 -0.306997 0.547896 -0.282132
2 0.214874 -0.156814 -0.039271 0.007463
3 -0.331021 -0.219545 -0.220116 -0.054240
4 -0.308552 -0.210525 -0.310814 0.025220
5 -0.203471 0.395555 0.036936 -0.028225
6 0.325624 -0.187031 -0.057312 0.013975
7 -0.133825 0.157595 0.016639 0.123924
8 -0.199568 -0.177685 0.425317 0.347390
9 -0.128043 -0.060443 0.124801 0.436607
10 0.015490 0.019470 -0.026153 0.001281
11 0.452941 0.062621 -0.031198 0.008760
12 0.032747 0.035687 -0.021192 -0.020976
13 0.453183 0.088892 -0.075616 -0.060784
14 0.203878 0.010488 0.003543 0.064830
15 0.008528 0.036364 -0.023309 -0.005139
16 -0.197163 0.393089 0.054931 0.034316
17 0.261359 -0.183583 -0.000913 -0.034560
18 -0.071624 -0.049393 0.045199 0.196795
19 0.477056 0.101589 -0.064213 -0.078628
20 0.006221 0.024514 -0.029819 -0.023293
21 -0.394977 -0.226490 -0.368533 -0.059509
22 0.372770 -0.240897 -0.082395 0.032859
23 -0.086834 0.249641 -0.004198 0.036246
24 -0.150524 -0.139678 0.375736 -0.101435
25 0.034841 0.025093 -0.038122 -0.028622
26 -0.181550 -0.304281 0.467250 -0.381538
27 -0.378756 -0.204913 -0.468720 -0.020255
28 0.313940 -0.145599 -0.043565 -0.092208
29 0.002653 0.018266 -0.008846 -0.011781
30 -0.149832 0.444377 0.025271 -0.099550
31 -0.008621 0.019624 -0.013075 0.018449
32 0.035791 0.028953 -0.017633 -0.029310
33 0.376603 0.045819 -0.002258 -0.046661
34 -0.014371 0.223053 0.005351 0.106275
35 0.051808 0.161132 -0.026630 0.003828
36 0.339473 -0.239638 -0.082835 0.044181
37 -0.318549 -0.185373 -0.333509 -0.048737
38 -0.097958 -0.003455 0.186425 0.592339
39 -0.188483 -0.154867 0.035618 -0.189420
40 -0.077479 0.388649 0.006703 -0.232966
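Because we saved the results with to_csv, we can reload them in a later session without re-running the lemmatization and SVD steps. A minimal sketch (the filename matches the to_csv call above):
import pandas

# reload the saved document-topic scores
data = pandas.read_csv("results.csv", index_col=0)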
Inspecting Results
Let’s plot the results, color-coding by author to see if any patterns are immediately apparent. We’ll focus on the X and Y topics for now to illustrate the workflow. We’ll return to the other topics in our model as a further exercise.
colormap = {
"austen": "red",
"chesterton": "blue",
"dickens": "green",
"dumas": "orange",
"melville": "cyan",
"shakespeare": "magenta"
}
data["Color"] = data["Author"].map(colormap)
xR2 = round(svdmodel.explained_variance_ratio_[1] * 100, 2)
yR2 = round(svdmodel.explained_variance_ratio_[2] * 100, 2)
for author, books in data.groupby(by="Author"):
books.plot(
"X", "Y",
kind="scatter",
ax=plt.gca(),
figsize=[5, 5],
label=author,
c="Color",
xlim=[-1, 1],
ylim=[-1, 1],
title="My LSA Plot",
xlabel=f"Topic X ({xR2}%)",
ylabel=f"Topic Y ({yR2}%)"
)
It seems that some of the books by the same author are clumping together in our plot, much like how we arranged cards in the card sorting activity much earlier.
We don’t yet know why they are arranged this way, since we don’t know what more meaningful concepts X and Y correspond to. But we can do some work to figure that out.
Let’s write a helper to get the strongest words for each topic. It will show the terms with the highest and lowest weights for a given topic. In LSA, each topic is a spectrum of subject matter, running from the kinds of terms on the low end to the kinds of terms on the high end. So inspecting the contrast between these high and low terms (and checking that against our domain knowledge) can help us interpret what our model is identifying.
def showTopics(topic, n):
    # the vocabulary learned by the vectorizer, in column order
    terms = vectorizer.get_feature_names_out()
    # the weight of each term for the chosen topic
    weights = svdmodel.components_[topic]
    df = pandas.DataFrame({"Term": terms, "Weight": weights})
    # the n terms with the highest weights and the n terms with the lowest weights
    tops = df.sort_values(by=["Weight"], ascending=False)[0:n]
    bottoms = df.sort_values(by=["Weight"], ascending=False)[-n:]
    return pandas.concat([tops, bottoms])
Let’s use it to get the terms for the X topic.
What does this topic seem to represent to you? What’s the contrast between the top and bottom terms?
print(showTopics(1, 5))
Term Weight
8718 thou 0.369606
4026 hath 0.368384
3104 exit 0.219252
8673 thee 0.194711
8783 tis 0.184968
9435 ve -0.083406
555 attachment -0.090431
294 am -0.103122
5312 ma -0.117927
581 aunt -0.139385
And the Y topic.
What does this topic seem to represent to you? What’s the contrast between the top and bottom terms?
print(showTopics(2, 5))
Term Weight
1221 cardinal 0.269191
5318 madame 0.258087
6946 queen 0.229547
4189 honor 0.211801
5746 musketeer 0.203572
294 am -0.112988
5312 ma -0.124932
555 attachment -0.150380
783 behaviour -0.158139
581 aunt -0.216180
Now that we have names for our first two topics, let’s redo the plot with better axis labels.
for author, books in data.groupby(by="Author"):
books.plot(
"X", "Y",
kind="scatter",
ax=plt.gca(),
figsize=[5, 5],
label=author,
c="Color",
xlim=[-1, 1],
ylim=[-1, 1],
title="My LSA Plot",
xlabel=f"Victorian vs. Elizabethan ({xR2}%)",
ylabel=f"English vs. French ({yR2}%)"
)
Finally, let’s repeat this process with the other four topics, tentatively called Z, W, P, and Q.
In the first two topics (X and Y), some authors were clearly separated, but others overlapped. If we hadn’t color-coded them, we wouldn’t easily be able to tell them apart.
But in the next few topics, this flips, with different combinations of authors being pulled apart and pulled together. This is because these topics (Z, W, P, and Q) highlight different features of the data, independent of the features we’ve already captured above.
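As a quick, informal check of that independence (a sketch using the lsa matrix from our 7-component fit): the topic columns of the LSA document matrix are orthogonal to one another, so the cross-products between different topics come out near zero.
import numpy as np

# lsa has one row per document and one column per topic (U times Sigma).
# Off-diagonal entries of lsa.T @ lsa are near zero: each topic captures
# variation in the documents that the other topics do not.
print(np.round(lsa.T @ lsa, 3))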
zR2 = round(svdmodel.explained_variance_ratio_[3] * 100, 2)
wR2 = round(svdmodel.explained_variance_ratio_[4] * 100, 2)
for author, books in data.groupby(by="Author"):
books.plot(
"Z", "W",
kind="scatter",
ax=plt.gca(),
figsize=[5, 5],
label=author,
c="Color",
xlim=[-1, 1],
ylim=[-1, 1],
title="My LSA Plot",
xlabel=f"Topic Z ({zR2}%)",
ylabel=f"Topic W ({wR2}%)"
)
I wonder what these topics correspond to. Let’s check.
print(showTopics(3, 5))
print(showTopics(4, 5))
for author, books in data.groupby(by="Author"):
books.plot(
"Z", "W",
kind="scatter",
ax=plt.gca(),
figsize=[5, 5],
label=author,
c="Color",
xlim=[-1, 1],
ylim=[-1, 1],
title="My LSA Plot",
xlabel=f"Common vs. Royal ({zR2}%)",
ylabel=f"Sea vs. Contractions ({wR2}%)"
)
So far, Austen and Dickens have been lumped together, as have Chesterton and Melville. But in these last two topics, P and Q, we can see their differences.
pR2 = round(svdmodel.explained_variance_ratio_[5] * 100, 2)
qR2 = round(svdmodel.explained_variance_ratio_[6] * 100, 2)
for author, books in data.groupby(by="Author"):
books.plot(
"P", "Q",
kind="scatter",
ax=plt.gca(),
figsize=[5, 5],
label=author,
c="Color",
xlim=[-1, 1],
ylim=[-1, 1],
title="My LSA Plot",
xlabel=f"Topic P ({pR2}%)",
ylabel=f"Topic Q ({qR2}%)"
)
I wonder what these topics correspond to. Let’s check.
print(showTopics(5, 5))
print(showTopics(6, 5))
for author, books in data.groupby(by="Author"):
books.plot(
"P", "Q",
kind="scatter",
ax=plt.gca(),
figsize=[5, 5],
label=author,
c="Color",
xlim=[-1, 1],
ylim=[-1, 1],
title="My LSA Plot",
xlabel=f"Land vs. Sea ({pR2}%)",
ylabel=f"Exploration vs. Labor ({qR2}%)"
)
Key Points
TODO