
Natural Language Processing

Installation:

Natural Language Toolkit

```
python -c "import nltk;nltk.download('all')"
```

Wordcloud Library

```
conda install -c conda-forge wordcloud
```

News API Python Client Library
```
 pip install newsapi-python
```

IBM Watson Python Library

```
pip install --upgrade "ibm-watson>=3.0.3"
```

SpaCy Library

```
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
```

alpaca-trade-api SDK
```
 pip install alpaca-trade-api
```
python-dotenv Library
```
 pip install python-dotenv
```

Important Terminologies & Concepts

Natural Language Processing (NLP) is a set of methods for building computer software that understands, generates, and manipulates human language. (Jacob Eisenstein)
Tokenization is the process of segmenting running text into words, sentences, or phrases.
Stopwords are words that, for analysis purposes, carry no informational content, such as “the,” “there,” and “in.” They are useful for grammar and syntax, but they don’t contain any important content.
Lemmatization is standardizing the "morphology" of words. For example, walking, walked, and walks will all become walk.
N-Grams are tokens that include multi-word phrases. The n is the number of words—for example, bigrams are two-word combinations.
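
The four definitions above (tokenization, stopwords, lemmatization, n-grams) can be sketched in plain Python. This is only an illustration: the stopword set and the lemma table are tiny, hand-picked assumptions, and the regex tokenizer is a simplification. In practice these steps would typically be done with the NLTK or spaCy tools installed above.

```python
import re

# Toy resources, chosen only for this illustration (assumptions, not real corpora).
STOPWORDS = {"the", "there", "in", "a", "and", "is", "to", "of"}
LEMMAS = {"walking": "walk", "walked": "walk", "walks": "walk"}


def tokenize(text):
    """Tokenization: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that carry no informational content."""
    return [t for t in tokens if t not in STOPWORDS]


def lemmatize(tokens):
    """Lemmatization via a lookup table; unknown words are left unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


text = "She walked in the park and he walks there"
tokens = lemmatize(remove_stopwords(tokenize(text)))
bigrams = ngrams(tokens, 2)
print(tokens)   # ['she', 'walk', 'park', 'he', 'walk']
print(bigrams)  # [('she', 'walk'), ('walk', 'park'), ('park', 'he'), ('he', 'walk')]
```
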
Sentiment Analysis is a field within NLP, defined as “the computational study of people's opinions, sentiments, emotions, and attitudes.”
- Polarity (positive, neutral, negative)
- Emotions (happy, angry, sad)
- Intentions (detects what people want to do)
Term Relevance is important in sentiment analysis because it leads to a better understanding of human speech.
Corpus is a large, structured, and organized collection of text documents that usually focuses on a specific subject.
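TF (Term Frequency) is how often a term appears in a document.
IDF (Inverse Document Frequency) measures how rare a term is across the documents in the corpus.
TF-IDF is a weighting factor intended to measure how important a word is to a document in a collection of documents or corpus.
A high TF drives the score up, while a term that appears in many documents gets a low IDF, which brings the score down.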
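
To make the TF and IDF interaction concrete, here is a minimal sketch that computes TF-IDF by hand. It assumes one common textbook variant (TF as a relative count, IDF as `log(N / df)`); libraries often use smoothed or normalized variants, so their numbers will differ. The function name `tf_idf` and the sample corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            # TF (relative count) times IDF (log of inverse document share).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already-tokenized corpus.
corpus = [
    ["stocks", "rally", "stocks"],
    ["stocks", "fall"],
    ["markets", "rally"],
]
scores = tf_idf(corpus)
for doc_scores in scores:
    print({term: round(score, 3) for term, score in doc_scores.items()})
```

In the second document, "fall" scores higher than "stocks" even though both occur once there, because "stocks" also appears in another document and so has a lower IDF.
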
VADER Sentiment is a tool that scores the sentiment polarity of human speech as positive, neutral, or negative, based on a set of rules and a lexicon. It generates 4 scores: positive, neutral, negative, and compound. The compound score is commonly used to label the text:
- Positive: compound score >= 0.05
- Neutral: compound score > -0.05 and < 0.05
- Negative: compound score <= -0.05
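
The compound-score thresholds can be applied with a small helper. This sketch does not call VADER itself; it assumes you already have a scores dictionary shaped like the one returned by NLTK's `SentimentIntensityAnalyzer.polarity_scores` (keys `neg`, `neu`, `pos`, `compound`). The helper name `label_from_compound` and the sample values are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a polarity label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores, written by hand to show the expected shape.
sample_scores = {"neg": 0.0, "neu": 0.58, "pos": 0.42, "compound": 0.63}
print(label_from_compound(sample_scores["compound"]))  # positive
```
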

IBM Watson Tone Analyzer

Tone Analyzer is a cloud service from IBM Watson that measures the tone of written text. The service can analyze tone in English and French conversations, and you can use it from Python via its API.

Natural Language Processing Workflow

Preprocessing: preparing the text, including ingestion
Extraction: get interesting features of the text
Analysis: summarize these features
Representation: visualize your analysis
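
The four workflow stages can be sketched as a small pipeline. This is a simplified illustration using only the standard library: the function names and the sample text are assumptions, and a plain-text bar chart stands in for a real visualization such as a word cloud.

```python
import re
from collections import Counter


def preprocess(raw):
    """Preprocessing: lowercase the raw text and split it into word tokens."""
    return re.findall(r"[a-z]+", raw.lower())


def extract(tokens):
    """Extraction: count how often each token occurs."""
    return Counter(tokens)


def analyze(counts, k=3):
    """Analysis: keep the k most frequent tokens."""
    return counts.most_common(k)


def represent(top):
    """Representation: render a plain-text bar chart."""
    return "\n".join(f"{word:<8} {'#' * n}" for word, n in top)


raw = "Markets rise. Markets fall. Traders watch markets rise."
top = analyze(extract(preprocess(raw)))
chart = represent(top)
print(chart)
```
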

NLTK

Core functions depend on language models built from programmed rules
Accurate
Intended for educational and prototyping purposes

spaCy

Core functions depend on language models learned from tagged text
Fast and flexible
Designed specifically for production use
Also provides tools for tokenization and lemmatization
In comparison to NLTK, spaCy's language models trade off accuracy for speed

In examples, we will be using spaCy for:

Part of speech tagging: spaCy categorizes each word in a sentence by its grammatical role.
Named Entity Recognition: extracting named entities, which include proper nouns and other specific types of nouns such as currencies, from a text.
Text As Feature: to use this data for classification or prediction, we need to turn it into features, that is, numerical representations of unstructured text.
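
As an illustration of turning text into features, here is a minimal bag-of-words sketch in plain Python. It does not use spaCy: it assumes the documents are already tokenized, and the helper name `bag_of_words` and the sample documents are made up. Each document becomes a vector of term counts over a shared vocabulary, which is one simple numerical representation a classifier could consume.

```python
from collections import Counter


def bag_of_words(docs):
    """Map tokenized documents to count vectors over a shared vocabulary."""
    vocab = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # One count per vocabulary term, in vocabulary order.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


# Hypothetical tokenized documents.
docs = [["oil", "prices", "rise"], ["oil", "prices", "fall", "fall"]]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['fall', 'oil', 'prices', 'rise']
print(vectors)  # [[0, 1, 1, 1], [2, 1, 1, 0]]
```
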

SashaFlores / Natural_Language_Processing