Textbook:
- You'll be able to follow along without installing Python on your computer by viewing the Jupyter notebooks in the repo. You should also be able to run all code except for lesson one using the Google Colab buttons.
- The repository has been built for Windows computers. Some packages may be incompatible with Macs.
- Environment setup instructions: https://www.youtube.com/watch?v=sUUWLBmj7Xc&feature=youtu.be
- For lesson one, you will want to use environment_nlp-course-small.yml for a fast installation.
- Config.ini: You will need to change two lines in the config.ini file under [USER]. Change the text following "USERNAME:" to your username and "RAW_DATA:" to the file path to the raw data folder within the repository.
- Topics: course overview, git bash, python config.ini files, conda virtual environments
- Technology: git bash, configparser, conda
- Homework: use the command line to search data among 1000's of server configuration files
- Topics: Extract text from docx, pdf, and image files
- Technology: docx, PyPDF2, pdfminer.six, subprocess, pytesseract
- Homework: structure the annual reports into sections
- Supplementary Material: watch lesson_databases videos
- Topics: POS tagging, dependency parsing, rule-based matching, phrase dectection
- Technology: SpaCy, gensim
- Prework: Read section 2.1-2.4 SLP and/or 2.1-2.5 SLP videos , section 8.1-8.3 SLP, and chapter 5 Collocations
- Supplementary Material: watch lesson_automation videos
- Topics: vector space model, TFIDF, BM25, Co-occurance matrix
- Technology: scikit-learn
- Prework: Read section 6.1-6.6 SLP
- Supplementary Material: watch lesson_object_oriented_python
- Topics: PCA, latent semantic indexing (LSI), latent dirichlet allocation(LDA), topic coherence metrics
- Technology: scikit-learn, gensim
- Prework: Read TamingTextwiththeSVD
- Topics: Word2Vec, GloVe, FastText
- Technology: scikit-learn, gensim
- Prework: Read section 6.8-6.13 SLP, Efficient Estimation of Word Representations in Vector Space & Distributed Representations of Words and Phrases and their Compositionality
- Topics: Neural networks with word embeddings
- Technology: keras, gensim, FLAIR (pytorch)
- Prework: Read NLP (Almost) from Scratch and A Primer on Neural Network Models for Natural Language Processing
- Topics: Neural networks with word embeddings, Contextual Word Embeddings
- Technology: keras, gensim, FLAIR (pytorch)
- Prework: Read NLP (Almost) from Scratch and A Primer on Neural Network Models for Natural Language Processing
- Topics: cosine similarity, distance metrics, l1 and l2 norm, recommendation engines
- Technology: scikit-learn, SpaCy, gensim
- Prework: Read section 2.5 SLP and/or 2.1-2.5 SLP videos
- Topics: automate the process to collect data from https://www.annualreports.com
- Technology: requests, Jupyter Notebooks, BeautifulSoup, Scrapy
- Homework: automate the process to identify and download company 10-K annual reports
- Topics: use sqlalchemy to create and populate a database, locally and on AWS
- Technology: sqlalchemy, sqllite, AWS RDS (MySQL)
- Homework: create and populate a database with sqlalchemy
- Topics: reconstruct scikit-learn's CountVectorizer codebase
- Technology: scikit-learn, object oriented Python