practical-nlp / practical-nlp-code

Official Repository for Code associated with 'Practical Natural Language Processing' book by O'Reilly Media

Home Page:http://www.practicalnlp.ai/

Data not available for Doc2Vec Chapter 4

jkapila opened this issue

Hello Folks,

I was trying to replicate the code for Chapter 4 in Google Colab. While running 02_Doc2Vec_Example.ipynb, I was not able to get the data from Kaggle because it has been removed. I then referred to a previous issue and found that the data is in the Data folder.

Using that copy of the data, I found that it is different from what is indicated in the notebook.

Can you please indicate which data we should use? The accuracy I get from this data is pretty low compared with what is showcased in the notebook.

Following is the output:


#Load the dataset and explore.
filepath = "data/train_data.csv"
df = pd.read_csv(filepath)
print(df.shape)
df.head()
sentiment   content
empty       @tiffanylue i know i was listenin to bad habi...
sadness     Layin n bed with a headache ughhhh...waitin o...
sadness     Funeral ceremony...gloomy friday...
enthusiasm  wants to hang out with friends SOON!
neutral     @dannycastillo We want to trade with someone w...
df['sentiment'].value_counts()
category    value
worry        7433
neutral      6340
sadness      4828
happiness    2986
love         2068
surprise     1613
hate         1187
fun          1088
relief       1021
empty         659
enthusiasm    522
boredom       157
anger          98
Name: sentiment, dtype: int64
#Let us take the top 3 categories and leave out the rest.
shortlist = ['neutral', "happiness", "worry"]
df_subset = df[df['sentiment'].isin(shortlist)]
df_subset.shape

(16759, 2)
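
One way to see how this copy of the data differs is to compare the hardcoded shortlist with the counts printed above: here sadness (4828) outranks happiness (2986), so the shortlist is no longer the literal top 3. The following is a minimal sketch that uses only the numbers from the output above; the variable names `counts`, `top3` and `subset_rows` are illustrative and do not come from the notebook.

```python
# Label counts copied from the df['sentiment'].value_counts() output above.
counts = {
    "worry": 7433, "neutral": 6340, "sadness": 4828, "happiness": 2986,
    "love": 2068, "surprise": 1613, "hate": 1187, "fun": 1088,
    "relief": 1021, "empty": 659, "enthusiasm": 522, "boredom": 157,
    "anger": 98,
}

# Shortlist hardcoded in the notebook cell above.
shortlist = ["neutral", "happiness", "worry"]

# The three most frequent labels in this copy of the data.
top3 = sorted(counts, key=counts.get, reverse=True)[:3]

# Rows kept by the shortlist filter; this should equal df_subset.shape[0].
subset_rows = sum(counts[label] for label in shortlist)

print("top 3 in this data:", top3)
print("shortlist matches top 3:", set(top3) == set(shortlist))
print("rows kept by shortlist:", subset_rows, "of", sum(counts.values()))
```

With the DataFrame loaded, the same top 3 should be available as `df['sentiment'].value_counts().index[:3]`, since `value_counts()` sorts in descending order by default.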

preds = myclass.predict(test_vectors)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(test_cats, preds))
              precision    recall  f1-score   support

   happiness       0.34      0.54      0.42       713
     neutral       0.48      0.56      0.52      1595
       worry       0.62      0.40      0.48      1882

    accuracy                           0.48      4190
   macro avg       0.48      0.50      0.47      4190
weighted avg       0.52      0.48      0.49      4190
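
To put the 0.48 accuracy in context, it can be compared with a trivial baseline that always predicts the most frequent class in the test split. This is a rough sketch that uses only the support and recall values from the report above; because the recalls are rounded to two decimals, the rebuilt accuracy is approximate, and the variable names are illustrative.

```python
# Per-class support and recall copied from the classification report above.
support = {"happiness": 713, "neutral": 1595, "worry": 1882}
recall = {"happiness": 0.54, "neutral": 0.56, "worry": 0.40}

total = sum(support.values())

# Accuracy of a classifier that always predicts the most frequent class.
majority_label = max(support, key=support.get)
majority_baseline = support[majority_label] / total

# Accuracy equals the support-weighted mean of the per-class recalls.
approx_accuracy = sum(recall[c] * support[c] for c in support) / total

print(f"majority class: {majority_label}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"accuracy rebuilt from recall: {approx_accuracy:.3f}")
```

On these numbers the classifier is only a few points above the always-predict-worry baseline of about 0.45.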