This lesson is still being designed and assembled (Pre-Alpha version)

Text Analysis in Python: Glossary

Key Points

Introduction to Natural Language Processing
  • NLP comprises models that perform different tasks.

  • Our workflow for an NLP project consists of designing the task, preprocessing the data, representing it numerically, running the model, creating output, and interpreting that output.

  • NLP tasks can be adapted to suit different research interests.

Vector Space and Distance
  • We model documents by plotting them in high-dimensional space.

  • Distance is highly dependent on document length.

  • Documents are modeled as vectors so cosine similarity can be used as a similarity metric.
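
The cosine-similarity idea above can be sketched in plain Python (a minimal illustration, not the lesson's own code; the vectors are toy word counts):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two word-count vectors: the second is the first doubled, i.e. a longer
# document with the same word proportions.
doc1 = [1, 2, 0, 3]
doc2 = [2, 4, 0, 6]
print(cosine_similarity(doc1, doc2))  # 1.0 — document length does not matter
```

Because cosine similarity measures only the angle between vectors, doubling a document's length leaves the score unchanged, which is exactly why it is preferred over raw distance here.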

Preparing and Preprocessing Your Data
  • Tokenization breaks strings into smaller parts for analysis.

  • Lowercasing (case folding) converts capital letters to lowercase so that words such as “The” and “the” are treated as the same token.

  • Stopwords are common words that do not contain much useful information.

  • Lemmatization reduces words to their dictionary (lemma) form.
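
The preprocessing steps above can be sketched as a small pipeline (a simplified illustration: the stopword list is a toy one, and lemmatization is omitted because it typically requires a library such as NLTK or spaCy):

```python
import re

# Toy stopword list -- real lists (e.g. NLTK's) are much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "on"}

def preprocess(text):
    # Tokenize: split the string into word tokens.
    tokens = re.findall(r"\w+", text)
    # Lowercase each token.
    tokens = [t.lower() for t in tokens]
    # Remove stopwords.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The cat sat on the mat"))  # ['cat', 'sat', 'mat']
```

Each step mirrors one bullet above: tokenization, lowercasing, then stopword removal; a real pipeline would finish by lemmatizing the surviving tokens.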

Document Embeddings and TF-IDF
  • todo

Latent Semantic Analysis
  • todo

Intro to Word Embeddings
  • Word embeddings can help us derive additional meaning stored in text at the level of individual words.

  • Word embeddings have many use cases in text analysis and NLP-related tasks.

The Word2Vec Algorithm
  • Artificial neural networks (ANNs) are powerful models that can approximate a wide range of functions given sufficient capacity and training data.

  • The best way to choose between the two training methods (CBOW and Skip-gram) is to try both and see which works better for your specific application.

Training Word2Vec
  • TODO

Ethics and Text Analysis
  • todo

LLMs and BERT Overview
  • LLMs are based on the transformer architecture and train millions to billions of parameters on vast datasets.

  • Attention allows for context to be encoded into an embedding.

  • BERT is an example of an LLM.
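
The attention idea can be sketched in a toy, pure-Python form (heavily simplified: a single query vector, no learned projection matrices, and hand-picked numbers rather than real embeddings):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Toy scaled dot-product attention for one query vector."""
    d = len(query)
    # Score the query against each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output: an attention-weighted blend of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query attends equally to two identical keys, so the output is
# the average of the two value vectors.
blended = attention([1, 0], [[1, 0], [1, 0]], [[0, 2], [4, 0]])
print(blended)  # [2.0, 1.0]
```

This blending is how context gets encoded into an embedding: each word's output vector is a weighted mixture of the vectors around it.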

APIs
  • You will need to evaluate the suitability of data for inclusion in your corpus, taking into consideration issues such as legal and ethical restrictions and data quality, among others.

  • Making an API request is a common way to access data.

  • You can build a query using a source’s URL and combine it with get() in Python to make a request for data.
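
Building a query URL and requesting data can be sketched as follows (the endpoint and parameter names are hypothetical, so substitute values from a real API's documentation; the actual `requests.get()` call is left commented out because it needs a live endpoint):

```python
from urllib.parse import urlencode

def build_query_url(base_url, params):
    """Combine a source's base URL with URL-encoded query parameters."""
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and parameters for illustration only.
url = build_query_url("https://api.example.org/search",
                      {"q": "text analysis", "page": 1})
print(url)  # https://api.example.org/search?q=text+analysis&page=1

# With the third-party requests library (pip install requests):
# import requests
# response = requests.get(url, timeout=30)
# data = response.json()  # parse a JSON response body
```

`urlencode` handles escaping (e.g. the space in "text analysis"), so you never need to paste parameters into the URL by hand.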

Glossary

FIXME