
Training Word2Vec

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • How do we train a Word2Vec model on our own text data?
  • How can we use a trained model to explore relationships between words in a text?

Objectives
  • Preprocess a text into tokenized sentences suitable for training Word2Vec.
  • Train CBOW and skip-gram Word2Vec models using gensim.
  • Use most_similar to explore the relationships a trained model has learned.

Colab Setup

Run the code below to mount your Google Drive and make the lesson's helper functions importable.

# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

# show existing colab notebooks and helpers.py file
from os import listdir
wksp_dir = '/content/drive/My Drive/Colab Notebooks/text-analysis'

# add folder to colab's path so we can import the helper functions
import sys
sys.path.insert(0, wksp_dir)
Mounted at /content/drive
# check that wksp_dir is correct
listdir(wksp_dir)
['.ipynb_checkpoints',
 '__pycache__',
 'data',
 'word2vec_PCA_plot.jpg',
 'helpers.py',
 'Word_embeddings_intro.ipynb',
 'word2vec_scratch.ipynb',
 'Word_embeddings_train.ipynb']

Load in the data

Create a list of the files we'll use for our analysis. We'll start by fitting a word2vec model to just one of the books in our list, Moby Dick.

# pip install necessary to access parse module (called from helpers.py)
!pip install parse

# test that functions can now be imported from helpers.py
from helpers import create_file_list 

# get list of files to analyze
data_dir = wksp_dir + '/data/'
corpus_file_list = create_file_list(data_dir, "*.txt")

# parse filelist into a dataframe
from helpers import parse_into_dataframe 
data = parse_into_dataframe(data_dir + "{Author}-{Title}.txt", corpus_file_list)
data
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting parse
  Downloading parse-1.19.0.tar.gz (30 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: parse
  Building wheel for parse (setup.py) ... done
  Created wheel for parse: filename=parse-1.19.0-py3-none-any.whl size=24589 sha256=400d8a9b0d6f99cf6a340ed83dce69610fcf3fc01a6fee0ac4508e748eef6977
  Stored in directory: /root/.cache/pip/wheels/d6/9c/58/ee3ba36897e890f3ad81e9b730791a153fce20caa4a8a474df
Successfully built parse
Installing collected packages: parse
Successfully installed parse-1.19.0


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Author Title Item
16 austen emma /content/drive/My Drive/Colab Notebooks/text-a...
11 austen sense /content/drive/My Drive/Colab Notebooks/text-a...
5 austen pride /content/drive/My Drive/Colab Notebooks/text-a...
7 austen northanger /content/drive/My Drive/Colab Notebooks/text-a...
15 austen persuasion /content/drive/My Drive/Colab Notebooks/text-a...
19 austen ladysusan /content/drive/My Drive/Colab Notebooks/text-a...
14 chesterton brown /content/drive/My Drive/Colab Notebooks/text-a...
13 chesterton thursday /content/drive/My Drive/Colab Notebooks/text-a...
17 chesterton whitehorse /content/drive/My Drive/Colab Notebooks/text-a...
18 chesterton ball /content/drive/My Drive/Colab Notebooks/text-a...
20 chesterton napoleon /content/drive/My Drive/Colab Notebooks/text-a...
3 chesterton knewtoomuch /content/drive/My Drive/Colab Notebooks/text-a...
23 dickens olivertwist /content/drive/My Drive/Colab Notebooks/text-a...
25 dickens ourmutualfriend /content/drive/My Drive/Colab Notebooks/text-a...
26 dickens pickwickpapers /content/drive/My Drive/Colab Notebooks/text-a...
22 dickens greatexpectations /content/drive/My Drive/Colab Notebooks/text-a...
21 dickens bleakhouse /content/drive/My Drive/Colab Notebooks/text-a...
0 dickens taleoftwocities /content/drive/My Drive/Colab Notebooks/text-a...
10 dickens davidcopperfield /content/drive/My Drive/Colab Notebooks/text-a...
8 dickens christmascarol /content/drive/My Drive/Colab Notebooks/text-a...
4 dickens hardtimes /content/drive/My Drive/Colab Notebooks/text-a...
9 dumas montecristo /content/drive/My Drive/Colab Notebooks/text-a...
12 dumas maninironmask /content/drive/My Drive/Colab Notebooks/text-a...
6 dumas twentyyearsafter /content/drive/My Drive/Colab Notebooks/text-a...
24 dumas blacktulip /content/drive/My Drive/Colab Notebooks/text-a...
2 dumas tenyearslater /content/drive/My Drive/Colab Notebooks/text-a...
1 dumas threemusketeers /content/drive/My Drive/Colab Notebooks/text-a...
29 litbank conll /content/drive/My Drive/Colab Notebooks/text-a...
28 melville moby_dick /content/drive/My Drive/Colab Notebooks/text-a...
30 melville piazzatales /content/drive/My Drive/Colab Notebooks/text-a...
33 melville pierre /content/drive/My Drive/Colab Notebooks/text-a...
34 melville conman /content/drive/My Drive/Colab Notebooks/text-a...
36 melville omoo /content/drive/My Drive/Colab Notebooks/text-a...
38 melville bartleby /content/drive/My Drive/Colab Notebooks/text-a...
27 melville typee /content/drive/My Drive/Colab Notebooks/text-a...
40 shakespeare lear /content/drive/My Drive/Colab Notebooks/text-a...
31 shakespeare romeo /content/drive/My Drive/Colab Notebooks/text-a...
32 shakespeare twelfthnight /content/drive/My Drive/Colab Notebooks/text-a...
35 shakespeare othello /content/drive/My Drive/Colab Notebooks/text-a...
37 shakespeare muchado /content/drive/My Drive/Colab Notebooks/text-a...
39 shakespeare caesar /content/drive/My Drive/Colab Notebooks/text-a...
41 shakespeare midsummer /content/drive/My Drive/Colab Notebooks/text-a...
single_file = data.loc[data['Title'] == 'moby_dick','Item'].item()
single_file
'/content/drive/My Drive/Colab Notebooks/text-analysis/data/melville-moby_dick.txt'

Let’s preview the file contents to make sure our code so far is working correctly.

# open and read file
f = open(single_file,'r')
file_contents = f.read()
f.close()

# preview file contents
preview_len = 500
print(file_contents[0:preview_len])
[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teach them by wha
file_contents[0:preview_len] # Note that \n are still present in actual string (print() processes these as new lines)
'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consumptive Usher to a Grammar School)\n\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\nnow.  He was ever dusting his old lexicons and grammars, with a queer\nhandkerchief, mockingly embellished with all the gay flags of all the\nknown nations of the world.  He loved to dust his old grammars; it\nsomehow mildly reminded him of his mortality.\n\n"While you take in hand to school others, and to teach them by wha'

Preprocessing steps

  1. Split text into sentences
  2. Tokenize the text
  3. Lemmatize and lowercase all tokens
  4. Remove stop words

1. Convert text to list of sentences

Remember that we are using the sequence of words in a sentence to learn meaningful word embeddings. The last word of one sentence does not always relate to the first word of the next sentence. For this reason, we will split the text into individual sentences before going further.

Punkt Sentence Tokenizer

NLTK's sentence tokenizer ('punkt') works well in most cases, but it may not correctly detect sentence boundaries in complex paragraphs that contain many punctuation marks, exclamation marks, abbreviations, or repeated symbols. There is no standard way to overcome every one of these issues. If you want to ensure that every "sentence" you use to train the Word2Vec model is truly a sentence, you would need to write additional (and highly data-dependent) code that uses regex and string manipulation to catch the rare errors.

For our purposes, we're willing to overlook a few sentence tokenization errors. If this work were being published, it would be worthwhile to double-check punkt's work (a sketch of what such extra clean-up might look like follows the tokenization below).

import nltk
nltk.download('punkt') # dependency of sent_tokenize function
sentences = nltk.sent_tokenize(file_contents)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
sentences[300:305]
['How then is this?',
 'Are the green fields gone?',
 'What do they\nhere?',
 'But look!',
 'here come more crowds, pacing straight for the water, and\nseemingly bound for a dive.']
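If you did want to apply some extra clean-up to punkt's output, here is a hypothetical sketch of what that might look like; it collapses internal newlines and drops "sentences" that contain no letters. Real fixes would depend entirely on your data, and we won't apply this in the rest of the lesson.

import re

# hypothetical clean-up for punkt's output: collapse internal newlines and
# drop "sentences" that contain no alphabetic characters (stray punctuation,
# chapter separators, etc.)
def tidy_sentences(sentences):
    tidy = []
    for s in sentences:
        s = re.sub(r'\s+', ' ', s).strip()  # collapse whitespace and newlines
        if re.search(r'[A-Za-z]', s):       # keep only strings with real text
            tidy.append(s)
    return tidy

# example usage (not used in the rest of the lesson)
tidy_sentences(sentences)[300:305]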

2-4: Tokenize, lemmatize, and remove stop words

Let's pull up the preprocess_text helper function and unpack what it does. A rough sketch of how such a function can be written is shown after the test below.

from helpers import preprocess_text
# test function
string = 'It is not down on any map; true places never are.'
tokens = preprocess_text(string, 
                         remove_stopwords=True,
                         verbose=True)
tokens
Tokens ['It', 'is', 'not', 'down', 'on', 'any', 'map', 'true', 'places', 'never', 'are']
Lowercase ['it', 'is', 'not', 'down', 'on', 'any', 'map', 'true', 'places', 'never', 'are']
Lemmas ['it', 'is', 'not', 'down', 'on', 'any', 'map', 'true', 'place', 'never', 'are']
StopRemoved ['map', 'true', 'place', 'never']
['map', 'true', 'place', 'never']
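The verbose output above shows each stage of the pipeline. If you're curious what a helper like this looks like inside, here is a minimal sketch built from NLTK pieces; the actual preprocess_text in helpers.py may differ in its details, so treat this as an approximation.

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# (the punkt, wordnet, and stopwords resources were already downloaded above)

def preprocess_text_sketch(text, remove_stopwords=True):
    # 1) tokenize, keeping only alphabetic tokens (drops punctuation and numbers)
    tokens = [t for t in word_tokenize(text) if t.isalpha()]
    # 2) lowercase
    tokens = [t.lower() for t in tokens]
    # 3) lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    # 4) optionally remove stop words
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        tokens = [t for t in tokens if t not in stops]
    return tokens

preprocess_text_sketch('It is not down on any map; true places never are.')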
# convert list of sentences to pandas series so we can use the apply functionality
import pandas as pd
sentences_series = pd.Series(sentences)
tokens_cleaned = sentences_series.apply(preprocess_text, 
                                        remove_stopwords=True, 
                                        verbose=False)
# view sentences before cleaning
sentences[300:305]
['How then is this?',
 'Are the green fields gone?',
 'What do they\nhere?',
 'But look!',
 'here come more crowds, pacing straight for the water, and\nseemingly bound for a dive.']
# view sentences after cleaning
tokens_cleaned[300:305]
300                                                   []
301                                 [green, field, gone]
302                                                   []
303                                               [look]
304    [come, crowd, pacing, straight, water, seeming...
dtype: object
tokens_cleaned.shape # 9852 sentences
(9852,)
# remove empty sentences and 1-word sentences (all stop words)
tokens_cleaned = tokens_cleaned[tokens_cleaned.apply(len) > 1]
tokens_cleaned.shape
(9007,)

Train Word2Vec model using tokenized text

We can now use this data to train a word2vec model. We'll start by importing the Word2Vec module from gensim. We'll then pass our list of tokenized sentences to Word2Vec and set sg=0 to use the continuous bag of words (CBOW) training method.

Set seed and workers for a fully deterministic run: Next we'll set some parameters for reproducibility. We'll set a seed so that our vectors get randomly initialized the same way each time the code is run. For a fully deterministic, reproducible run, we'll also limit the model to a single worker thread (workers=1) to eliminate ordering jitter from OS thread scheduling, as noted in gensim's documentation.

# import gensim's Word2Vec module
from gensim.models import Word2Vec

# train the word2vec model with our cleaned data
model = Word2Vec(sentences=tokens_cleaned, seed=0, workers=1, sg=0)

Gensim's implementation is based on Tomas Mikolov's original word2vec model, which automatically downsamples frequent words based on their frequency. This downsampling saves time when training the model.
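The amount of downsampling is controlled by Word2Vec's sample parameter (gensim's default is 1e-3). If you ever want to adjust or disable it, that would look something like the sketch below; this is purely illustrative, and the default is usually fine.

# sample controls how aggressively very frequent words are downsampled;
# sample=0 disables downsampling entirely (illustrative only)
model_no_downsampling = Word2Vec(sentences=tokens_cleaned, seed=0, workers=1,
                                 sg=0, sample=0)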

Next steps: word embedding use-cases

We now have a vector representation for all of the (lemmatized, non-stop) words that appear throughout Moby Dick. Let's see how we can use these vectors to gain insights from our text data.

Most similar words

Just like with the pretrained word2vec models, we can use the most_similar function to find words that meaningfully relate to a queried word.

# default
model.wv.most_similar(positive=['whale'], topn=10)
[('great', 0.9986481070518494),
 ('white', 0.9984517097473145),
 ('fishery', 0.9984385371208191),
 ('sperm', 0.9984176158905029),
 ('among', 0.9983417987823486),
 ('right', 0.9983320832252502),
 ('three', 0.9983301758766174),
 ('day', 0.9983181357383728),
 ('length', 0.9983041882514954),
 ('seen', 0.998255729675293)]

Vocabulary limits

Note that Word2Vec can only produce vector representations for words encountered in the data used to train the model.
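You can check what did and did not make it into the model's vocabulary before querying it. Note that, by default, gensim also drops words that appear fewer than min_count=5 times.

# how many unique words the model learned vectors for
print(len(model.wv))

# membership checks: 'whale' should be in the vocabulary, while a word that
# never appears in Moby Dick (or appears fewer than min_count=5 times) won't be
print('whale' in model.wv.key_to_index)
print('orca' in model.wv.key_to_index)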

fastText solves OOV issue

If you need word vectors for out-of-vocabulary (OOV) words, you can use the fastText word embedding algorithm instead (also provided by Gensim). fastText builds a word's vector by summing the vectors of its component character n-grams, so it can produce a vector even for an OOV word, provided at least one of its character n-grams was present in the training data. A short sketch of this is shown after the error below.

model.wv.most_similar(positive=['orca'],topn=30) 
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-25-9dc7ea336470> in <cell line: 1>()
----> 1 model.wv.most_similar(positive=['orca'],topn=30)


/usr/local/lib/python3.9/dist-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, restrict_vocab, indexer)
    839 
    840         # compute the weighted average of all keys
--> 841         mean = self.get_mean_vector(keys, weight, pre_normalize=True, post_normalize=True, ignore_missing=False)
    842         all_keys = [
    843             self.get_index(key) for key in keys if isinstance(key, _KEY_TYPES) and self.has_index_for(key)


/usr/local/lib/python3.9/dist-packages/gensim/models/keyedvectors.py in get_mean_vector(self, keys, weights, pre_normalize, post_normalize, ignore_missing)
    516                 total_weight += abs(weights[idx])
    517             elif not ignore_missing:
--> 518                 raise KeyError(f"Key '{key}' not present in vocabulary")
    519 
    520         if total_weight > 0:


KeyError: "Key 'orca' not present in vocabulary"
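For comparison, here is a minimal sketch of training a fastText model with gensim on the same cleaned sentences. Because fastText builds vectors from character n-grams, it can synthesize a (rough) vector even for 'orca'; the parameters below are illustrative.

from gensim.models import FastText

# train fastText on the same tokenized sentences (illustrative parameters)
ft_model = FastText(sentences=tokens_cleaned, seed=0, workers=1, sg=1)

# 'orca' never appears in the training data, but fastText can still build a
# vector for it from its character n-grams
orca_vector = ft_model.wv['orca']
print(ft_model.wv.similar_by_vector(orca_vector, topn=10))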

What can we do with this most_similar functionality? One way we can use it is to construct a list of similar words to represent some sort of category. For example, maybe we want to know what other sea creatures are referenced throughout Moby Dick. We can use gensim's most_similar function to begin constructing a list of words that, on average, represent a "sea creature" category.

We’ll use the following procedure:

  1. Initialize a small list of words that represent the category, sea creatures.
  2. Calculate the average vector representation of this list of words (gensim's most_similar does this internally when given several positive words; see the sketch after this list).
  3. Use this average vector to find the top N most similar vectors (words).
  4. Review the similar words and update the sea creatures list.
  5. Repeat steps 1-4 until no additional sea creatures can be found.
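Steps 2 and 3 happen inside most_similar whenever you pass it several positive words. If you want to see the averaging explicitly, a sketch like the one below should produce comparable results; note that similar_by_vector, unlike most_similar, does not exclude the query words themselves from the results.

# compute the (normalized) average of the query word vectors explicitly...
seed_words = ['whale', 'fish', 'creature', 'animal']
mean_vector = model.wv.get_mean_vector(seed_words)

# ...then look up the words whose vectors are closest to that average
print(model.wv.similar_by_vector(mean_vector, topn=10))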
# start with a small list of words that represent sea creatures 
sea_creatures = ['whale','fish','creature','animal']

# The below code will calculate an average vector of the words in our list, 
# and find the vectors/words that are most similar to this average vector
model.wv.most_similar(positive=sea_creatures, topn=30)
[('great', 0.9997826814651489),
 ('part', 0.9997532963752747),
 ('though', 0.9997507333755493),
 ('full', 0.999735951423645),
 ('small', 0.9997267127037048),
 ('among', 0.9997209906578064),
 ('case', 0.9997204542160034),
 ('like', 0.9997190833091736),
 ('many', 0.9997131824493408),
 ('fishery', 0.9997081756591797),
 ('present', 0.9997068643569946),
 ('body', 0.9997056722640991),
 ('almost', 0.9997050166130066),
 ('found', 0.9997038245201111),
 ('whole', 0.9997023940086365),
 ('water', 0.9996949434280396),
 ('even', 0.9996913075447083),
 ('time', 0.9996898174285889),
 ('two', 0.9996897578239441),
 ('air', 0.9996871948242188),
 ('length', 0.9996850490570068),
 ('vast', 0.9996834397315979),
 ('line', 0.9996828436851501),
 ('made', 0.9996813535690308),
 ('upon', 0.9996812343597412),
 ('large', 0.9996775984764099),
 ('known', 0.9996767640113831),
 ('harpooneer', 0.9996761679649353),
 ('sea', 0.9996750354766846),
 ('shark', 0.9996744990348816)]
# we can add shark to our list
model.wv.most_similar(positive=['whale','fish','creature','animal','shark'],topn=30) 
[('great', 0.9997999668121338),
 ('though', 0.9997922778129578),
 ('part', 0.999788761138916),
 ('full', 0.999781608581543),
 ('small', 0.9997766017913818),
 ('like', 0.9997683763504028),
 ('among', 0.9997652769088745),
 ('many', 0.9997631311416626),
 ('case', 0.9997614622116089),
 ('even', 0.9997515678405762),
 ('body', 0.9997514486312866),
 ('almost', 0.9997509717941284),
 ('present', 0.9997479319572449),
 ('found', 0.999747633934021),
 ('water', 0.9997465014457703),
 ('made', 0.9997431635856628),
 ('air', 0.9997406601905823),
 ('whole', 0.9997400641441345),
 ('fishery', 0.9997299909591675),
 ('harpooneer', 0.9997295141220093),
 ('time', 0.9997290372848511),
 ('two', 0.9997289776802063),
 ('sea', 0.9997265934944153),
 ('strange', 0.9997244477272034),
 ('large', 0.999722421169281),
 ('place', 0.9997209906578064),
 ('dead', 0.9997198581695557),
 ('leviathan', 0.9997192025184631),
 ('sometimes', 0.9997178316116333),
 ('high', 0.9997177720069885)]
# add leviathan (sea serpent) to our list
model.wv.most_similar(positive=['whale','fish','creature','animal','shark','leviathan'],topn=30) 
[('though', 0.9998274445533752),
 ('part', 0.9998168349266052),
 ('full', 0.9998133182525635),
 ('small', 0.9998107552528381),
 ('great', 0.9998067021369934),
 ('like', 0.9998064041137695),
 ('even', 0.9997999668121338),
 ('many', 0.9997966885566711),
 ('body', 0.9997950196266174),
 ('among', 0.999794602394104),
 ('found', 0.9997929334640503),
 ('case', 0.9997885823249817),
 ('almost', 0.9997871518135071),
 ('made', 0.9997868537902832),
 ('air', 0.999786376953125),
 ('water', 0.9997802972793579),
 ('whole', 0.9997780919075012),
 ('present', 0.9997757077217102),
 ('harpooneer', 0.999768853187561),
 ('place', 0.9997684955596924),
 ('much', 0.9997658729553223),
 ('time', 0.999765157699585),
 ('sea', 0.999765157699585),
 ('dead', 0.999764621257782),
 ('strange', 0.9997624158859253),
 ('high', 0.9997615218162537),
 ('two', 0.999760091304779),
 ('sometimes', 0.9997592568397522),
 ('half', 0.9997562170028687),
 ('vast', 0.9997541904449463)]

No additional sea creatures show up in the results. It appears we have recovered our list of sea creatures using Word2Vec.

Limitations

There is at least one sea creature missing from our list: the giant squid. The giant squid is mentioned only a handful of times throughout Moby Dick, so it could be that our word2vec model was not able to learn a good representation of the word "squid". Neural networks only work well when you have lots of data.
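One quick way to check this explanation is to count how often the lemma "squid" actually occurs in our cleaned sentences, and whether it survived gensim's default min_count=5 cutoff at all.

# count sentences mentioning 'squid' and total occurrences in the cleaned text
squid_sentences = sum('squid' in toks for toks in tokens_cleaned)
squid_total = sum(toks.count('squid') for toks in tokens_cleaned)
print(squid_sentences, squid_total)

# words occurring fewer than min_count=5 times (the gensim default) are dropped
# from the vocabulary entirely, so 'squid' may not even have a vector
print('squid' in model.wv.key_to_index)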

Exercise: Exploring the skip-gram algorithm

The skip-gram algorithm sometimes performs better at capturing the meaning of rarer words encountered in the training data. Train a new Word2Vec model using the skip-gram algorithm (sg=1), and see if you can repeat the categorical search task above to surface the word "squid".

# import gensim's Word2Vec module
from gensim.models import Word2Vec

# train the word2vec model with our cleaned data
model = Word2Vec(sentences=tokens_cleaned, seed=0, workers=1, sg=1)
model.wv.most_similar(positive=['whale','fish','creature','animal','shark','leviathan'],topn=100) # still no sight of squid 
[('whalemen', 0.9931729435920715),
 ('specie', 0.9919217824935913),
 ('bulk', 0.9917919635772705),
 ('ground', 0.9913252592086792),
 ('skeleton', 0.9905602931976318),
 ('among', 0.9898401498794556),
 ('small', 0.9887762665748596),
 ('full', 0.9885162115097046),
 ('captured', 0.9883950352668762),
 ('found', 0.9883666634559631),
 ('sometimes', 0.9882548451423645),
 ('snow', 0.9880553483963013),
 ('magnitude', 0.9880378842353821),
 ('various', 0.9878063201904297),
 ('hump', 0.9876748919487),
 ('cuvier', 0.9875931739807129),
 ('fisherman', 0.9874721765518188),
 ('general', 0.9873012900352478),
 ('living', 0.9872495532035828),
 ('wholly', 0.9872384667396545),
 ('bone', 0.987160861492157),
 ('mouth', 0.9867696762084961),
 ('natural', 0.9867129921913147),
 ('monster', 0.9865870475769043),
 ('blubber', 0.9865683317184448),
 ('indeed', 0.9864518046379089),
 ('teeth', 0.9862186908721924),
 ('entire', 0.9861844182014465),
 ('latter', 0.9859246015548706),
 ('book', 0.9858523607254028)]

Discuss Exercise Result

When using Word2Vec to reveal items from a category, you risk missing items that are rarely mentioned. This is true even when we use the skip-gram training method, which has been found to perform better on rarer words. For this reason, it's sometimes better to save this task for larger text corpora. In a later lesson, we will explore a task called Named Entity Recognition, which handles this type of problem in a more robust and systematic way.

Exercise: Categorical Search Applications

How else might you exploit this kind of analysis? Share your ideas with the group.

Other word embedding models

While Word2Vec is a famous model that is still used throughout many NLP applications today, there are a few other word embedding models that you might also want to explore. GloVe and fastText are two of the most popular choices to date.

GloVe

GloVe, coined from Global Vectors, is a model based on the distributional hypothesis. The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus (the GloVe model in spaCy, for example, comes pretrained on the Gigaword dataset). The main intuition underlying the model is the observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning.

Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.
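To make the idea of a co-occurrence matrix concrete, here is a toy count over our cleaned Moby Dick sentences using a symmetric two-token window. GloVe's real counts are weighted by distance and computed over far larger corpora, so this is only an illustration.

from collections import Counter

# toy word-word co-occurrence counts with a +/- 2-token window
window = 2
cooccurrence = Counter()
for toks in tokens_cleaned:
    for i, word in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if i != j:
                cooccurrence[(word, toks[j])] += 1

# how often 'white' appears within two tokens of 'whale' in the cleaned text
print(cooccurrence[('white', 'whale')])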

fastText

# Preview other pretrained word embedding models available through gensim's downloader
import gensim.downloader as api
print(list(api.info()['models'].keys()))
['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']
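Any of these pretrained models can be loaded through the same downloader API. For example, one of the smaller GloVe models (the vectors are downloaded the first time this runs):

# load pretrained 50-dimensional GloVe vectors (trained on Wikipedia + Gigaword)
glove_vectors = api.load('glove-wiki-gigaword-50')
print(glove_vectors.most_similar('whale', topn=5))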

Key Points

  • Word2Vec is trained on tokenized sentences; preprocessing choices (lemmatization, lowercasing, stop word removal) shape the vectors it learns.
  • gensim's Word2Vec supports both CBOW (sg=0) and skip-gram (sg=1) training; setting seed and workers=1 makes runs reproducible.
  • most_similar can surface words related to a single word or to the average of several words, but rare words may be represented poorly or missing from the vocabulary.
  • fastText and GloVe are popular alternatives; fastText can produce vectors for out-of-vocabulary words.