This lesson is still being designed and assembled (Pre-Alpha version)

Text Analysis in Python: Glossary

Key Points

Introduction to Natural Language Processing
  • NLP comprises models that perform different tasks.

  • Our workflow for an NLP project consists of designing the task, preprocessing the data, representing it numerically, running the model, creating output, and interpreting that output.

  • NLP tasks can be adapted to suit different research interests.

Vector Space and Distance
  • We model documents by plotting them in high-dimensional space.

  • Distance is highly dependent on document length.

  • Documents are modeled as vectors so cosine similarity can be used as a similarity metric.
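
The cosine-similarity idea above can be sketched in plain Python (a minimal illustration, not the lesson's own code; the vectors are toy word counts):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two word-count vectors: the second is the first doubled, i.e. a longer
# document with the same word proportions.
doc1 = [1, 2, 0, 3]
doc2 = [2, 4, 0, 6]
print(cosine_similarity(doc1, doc2))  # 1.0 — document length does not matter
```

Because cosine similarity measures only the angle between vectors, doubling a document's length leaves the score unchanged, which is exactly why it is preferred over raw distance here.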

Preparing and Preprocessing Your Data
  • Tokenization breaks strings into smaller parts for analysis.

  • Lowercasing (case folding) converts capital letters to lowercase so that words such as “The” and “the” are treated as the same token.

  • Stopwords are common words that do not contain much useful information.

  • Lemmatization reduces words to their dictionary (lemma) form.
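
The preprocessing steps above can be sketched as a small pipeline (a simplified illustration: the stopword list is a toy one, and lemmatization is omitted because it typically requires a library such as NLTK or spaCy):

```python
import re

# Toy stopword list -- real lists (e.g. NLTK's) are much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "on"}

def preprocess(text):
    # Tokenize: split the string into word tokens.
    tokens = re.findall(r"\w+", text)
    # Lowercase each token.
    tokens = [t.lower() for t in tokens]
    # Remove stopwords.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The cat sat on the mat"))  # ['cat', 'sat', 'mat']
```

Each step mirrors one bullet above: tokenization, lowercasing, then stopword removal; a real pipeline would finish by lemmatizing the surviving tokens.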

Document Embeddings and TF-IDF
  • todo

Latent Semantic Analysis
  • todo

Intro to Word Embeddings
  • Word embeddings can help us derive additional meaning stored in text at the level of individual words.

  • Word embeddings have many use cases in text analysis and NLP-related tasks.

The Word2Vec Algorithm
  • Artificial neural networks (ANNs) are powerful models that can approximate a wide range of functions given sufficient capacity and training data.

  • The best way to choose between the two training methods (CBOW and Skip-gram) is to try both and see which works better for your specific application.

Training Word2Vec
  • TODO

Ethics and Text Analysis
  • todo

LLMs and BERT Overview
  • LLMs are based on the transformer architecture and train millions to billions of parameters on vast datasets.

  • Attention allows for context to be encoded into an embedding.

  • BERT is an example of an LLM.
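
The attention idea can be sketched in a toy, pure-Python form (heavily simplified: a single query vector, no learned projection matrices, and hand-picked numbers rather than real embeddings):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Toy scaled dot-product attention for one query vector."""
    d = len(query)
    # Score the query against each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output: an attention-weighted blend of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query attends equally to two identical keys, so the output is
# the average of the two value vectors.
blended = attention([1, 0], [[1, 0], [1, 0]], [[0, 2], [4, 0]])
print(blended)  # [2.0, 1.0]
```

This blending is how context gets encoded into an embedding: each word's output vector is a weighted mixture of the vectors around it.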

APIs
  • You will need to evaluate the suitability of data for inclusion in your corpus, taking into consideration issues such as legal and ethical restrictions and data quality, among others.

  • Making an API request is a common way to access data.

  • You can build a query using a source’s URL and combine it with get() in Python to make a request for data.
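
Building a query URL and requesting data can be sketched as follows (the endpoint and parameter names are hypothetical, so substitute values from a real API's documentation; the actual `requests.get()` call is left commented out because it needs a live endpoint):

```python
from urllib.parse import urlencode

def build_query_url(base_url, params):
    """Combine a source's base URL with URL-encoded query parameters."""
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and parameters for illustration only.
url = build_query_url("https://api.example.org/search",
                      {"q": "text analysis", "page": 1})
print(url)  # https://api.example.org/search?q=text+analysis&page=1

# With the third-party requests library (pip install requests):
# import requests
# response = requests.get(url, timeout=30)
# data = response.json()  # parse a JSON response body
```

`urlencode` handles escaping (e.g. the space in "text analysis"), so you never need to paste parameters into the URL by hand.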

Glossary

FIXME