-
Natural Language Toolkit
python -c "import nltk;nltk.download('all')"
-
Wordcloud Library
conda install -c conda-forge wordcloud
-
News API Python Client Library
pip install newsapi-python
-
IBM Watson Python Library
pip install --upgrade "ibm-watson>=3.0.3"
-
SpaCy Library
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
-
alpaca-trade-api SDK
pip install alpaca-trade-api
-
python-dotenv Library
pip install python-dotenv
-
Natural Language Processing (NLP) is a set of methods for building computer software that understands, generates, and manipulates human language. —Jacob Eisenstein
-
Tokenization is the process of segmenting running text into words, sentences, or phrases.
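A minimal sketch of word tokenization using Python's `re` module (the `tokenize` helper is illustrative; in practice NLTK's `nltk.word_tokenize` handles punctuation and contractions more carefully):

```python
import re

def tokenize(text):
    """Segment running text into word tokens (simplified sketch)."""
    return re.findall(r"[A-Za-z']+", text.lower())

tokens = tokenize("The market rallied, and traders cheered.")
# → ['the', 'market', 'rallied', 'and', 'traders', 'cheered']
```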
-
Stopwords are words that, for analysis purposes, do not carry informational content, such as “the,” “there,” and “in.” They are useful for grammar and syntax, but they don’t contribute any important content.
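Removing stopwords is then just filtering the token list. The tiny stopword set below is illustrative; in practice you would use `nltk.corpus.stopwords.words('english')`:

```python
# Tiny illustrative stopword set; NLTK ships a full English list.
STOPWORDS = {"the", "there", "in", "and", "a", "is"}

def remove_stopwords(tokens):
    """Drop words that serve grammar but carry little content."""
    return [t for t in tokens if t not in STOPWORDS]

filtered = remove_stopwords(["the", "fed", "raised", "rates", "in", "march"])
# → ['fed', 'raised', 'rates', 'march']
```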
-
Lemmatization is standardizing the "morphology" of words. For example, walking, walked, and walks will all become walk.
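A toy lookup-table sketch of the idea (real lemmatization, e.g. NLTK's `WordNetLemmatizer`, uses a dictionary of English morphology rather than a hand-written table):

```python
# Illustrative lemma table covering only the example words above.
LEMMAS = {"walking": "walk", "walked": "walk", "walks": "walk"}

def lemmatize(token):
    """Map a word form to its base (dictionary) form, if known."""
    return LEMMAS.get(token, token)

lemmas = [lemmatize(t) for t in ["walking", "walked", "walks"]]
# → ['walk', 'walk', 'walk']
```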
-
N-Grams are tokens that include multi-word phrases. The n is the number of words—for example, bigrams are two-word combinations.
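Generating n-grams amounts to sliding a window of n words across the token list; a small sketch:

```python
def ngrams(tokens, n=2):
    """Return all n-word windows over the token list (bigrams by default)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["natural", "language", "processing"])
# → [('natural', 'language'), ('language', 'processing')]
```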
-
Sentiment Analysis is a field within NLP; it’s defined as “the computational study of people's opinions, sentiments, emotions, and attitudes.” It covers:
- Polarity (positive, neutral, negative)
- Emotions (happy, angry, sad)
- Intentions (detects what people want to do)
-
Term Relevance is quite important in sentiment analysis, since it leads to a better understanding of human speech.
-
Corpus is a large, structured, and organized collection of text documents that normally focuses on a specific subject.
-
TF Term Frequency
-
IDF Inverse Document Frequency
-
TF-IDF A weighting factor intended to measure how important a word is to a document in a collection of documents, or corpus.
-
TF drives the score up, but IDF will bring it down.
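The interplay can be computed directly from the definitions above. This sketch uses the plain (unsmoothed) formulas; libraries such as scikit-learn's `TfidfVectorizer` apply smoothed variants, so exact values differ:

```python
import math

def tf(term, doc):
    # Term frequency: share of the document's tokens that are this term.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log of (total docs / docs containing term).
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["stocks", "rose", "today"],
          ["stocks", "fell", "today"],
          ["bonds", "rallied"]]

# "stocks" occurs in 2 of 3 documents, so IDF pulls its weight down.
score = tf_idf("stocks", corpus[0], corpus)  # (1/3) * log(3/2)
```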
-
VADER Sentiment is a tool used to score the sentiment polarity of human speech as positive, neutral, or negative based on a set of rules and a lexicon. It generates 4 scores:
- Positive : compound score >= 0.05
- Neutral : -0.05 < compound score < 0.05
- Negative : compound score <= -0.05
- Compound
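The thresholds above map a compound score to a polarity label; a short sketch (the compound score itself would come from NLTK's `SentimentIntensityAnalyzer`, assumed installed, so only the labeling rule is shown here):

```python
def classify(compound):
    """Label a VADER compound score using the conventional thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(classify(0.62))   # → positive
print(classify(-0.40))  # → negative
print(classify(0.0))    # → neutral
```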
Tone Analyzer is a cloud service from IBM Watson that is able to measure the tone of written text. This service can analyze tone in English and French conversations, and it can be used from Python via its API.
-
NLP Workflow
- Preprocessing: preparing the text, including ingestion
- Extraction: getting interesting features of the text
- Analysis: summarizing these features
- Representation: visualizing your analysis
-
NLTK
- Core functions depend on language models learned from programmed rules
- Accurate
- Intended for educational and prototyping purposes
-
spaCy
- Core functions depend on language models learned from tagged text
- Fast and flexible
- Designed specifically for production use
- Also provides tools for tokenization & lemmatization
- In comparison to NLTK, spaCy's language models trade off accuracy for speed
In examples, we will be using spaCy for:
- Part-of-speech tagging: spaCy categorizes each word in a sentence by its grammatical role in the sentence.
- Named Entity Recognition: extracting named entities, which include proper nouns and other specific types of nouns, such as currencies, from a text.
- Text as Features: in order to use this data for classification or prediction, we need to turn unstructured text into features, that is, numerical representations.
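One simple way to turn tokenized text into numerical features is a bag-of-words count vector; the `bag_of_words` helper below is an illustrative pure-Python sketch of what scikit-learn's `CountVectorizer` does:

```python
from collections import Counter

def bag_of_words(docs):
    """Turn tokenized documents into count vectors over a shared vocabulary."""
    vocab = sorted({t for doc in docs for t in doc})
    vectors = [[Counter(doc)[t] for t in vocab] for doc in docs]
    return vocab, vectors

vocab, X = bag_of_words([["stocks", "rose"], ["stocks", "fell", "fell"]])
# vocab → ['fell', 'rose', 'stocks']
# X     → [[0, 1, 1], [2, 0, 1]]
```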