Data not available for Doc2Vec, Chapter 4

Question

jkapila opened this issue 4 years ago · comments

Hello Folks,

I was trying to replicate the Chapter 4 code in Google Colab. While running 02_Doc2Vec_Example.ipynb, I could not download the data from Kaggle because it has been removed. I then referred to a previous issue and found that the data is in the Data folder.

When I used that copy of the data, I found that it differs from what the notebook shows.

Can you please indicate which data we should use? The accuracy I get with this data is much lower than the accuracy shown in the notebook.

The outputs I get are as follows:

# Load the dataset and explore.
import pandas as pd
filepath = "data/train_data.csv"
df = pd.read_csv(filepath)
print(df.shape)
df.head()

sentiment	content
empty	@tiffanylue i know i was listenin to bad habi...
sadness	Layin n bed with a headache ughhhh...waitin o...
sadness	Funeral ceremony...gloomy friday...
enthusiasm	wants to hang out with friends SOON!
neutral	@dannycastillo We want to trade with someone w...

df['sentiment'].value_counts()

category	value
worry	7433
neutral	6340
sadness	4828
happiness	2986
love	2068
surprise	1613
hate	1187
fun	1088
relief	1021
empty	659
enthusiasm	522
boredom	157
anger	98
Name: sentiment, dtype: int64

#Let us take the top 3 categories and leave out the rest.
shortlist = ['neutral', "happiness", "worry"]
df_subset = df[df['sentiment'].isin(shortlist)]
df_subset.shape

(16759, 2)
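One concrete sign of the mismatch can be checked from the counts alone. The sketch below is my own illustration, not code from the notebook: it copies the value_counts() output above into a plain dictionary (the name `counts` is hypothetical) and compares the three most frequent labels with the hard-coded shortlist.

```python
# Label counts copied by hand from the value_counts() output above.
# The dictionary name `counts` is an illustrative assumption.
counts = {
    "worry": 7433, "neutral": 6340, "sadness": 4828, "happiness": 2986,
    "love": 2068, "surprise": 1613, "hate": 1187, "fun": 1088,
    "relief": 1021, "empty": 659, "enthusiasm": 522, "boredom": 157,
    "anger": 98,
}

# Shortlist exactly as hard-coded in the notebook cell.
shortlist = ["neutral", "happiness", "worry"]

# Three most frequent labels in this copy of the data.
top3 = sorted(counts, key=counts.get, reverse=True)[:3]

# Rows that survive the shortlist filter.
subset_rows = sum(counts[label] for label in shortlist)

print(top3)         # ['worry', 'neutral', 'sadness']
print(subset_rows)  # 16759
```

With this file, "sadness" rather than "happiness" is among the top 3, so the shortlist no longer matches the cell's comment, while the row count agrees with the (16759, 2) shape above.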

from sklearn.metrics import classification_report, confusion_matrix

# Predict on the held-out vectors and report per-class metrics.
preds = myclass.predict(test_vectors)
print(classification_report(test_cats, preds))

              precision    recall  f1-score   support

   happiness       0.34      0.54      0.42       713
     neutral       0.48      0.56      0.52      1595
       worry       0.62      0.40      0.48      1882

    accuracy                           0.48      4190
   macro avg       0.48      0.50      0.47      4190
weighted avg       0.52      0.48      0.49      4190
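To put the 0.48 accuracy in context, here is a rough sanity check of my own (not part of the notebook). It uses only the support column of the report above to compute what a classifier that always predicts the most frequent class would score. The variable names are illustrative.

```python
# Support counts copied from the classification report above.
support = {"happiness": 713, "neutral": 1595, "worry": 1882}

total = sum(support.values())

# A majority-class baseline always predicts the most frequent label.
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

print(total)                                        # 4190
print(majority_label, round(baseline_accuracy, 2))  # worry 0.45
```

So the model is only about three points above the majority-class baseline on this test split.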