Data not available for Doc2Vec, Chapter 4
jkapila opened this issue
Hello Folks,
I was trying to replicate the Chapter 4 code in Google Colab. While running 02_Doc2Vec_Example.ipynb, I could not download the data from Kaggle because the dataset has been removed. I then referred to a previous issue and found that the data is in the Data folder.
Using that copy of the data, I found that it is different from what the notebook shows.
Can you please indicate which data we should use? The accuracy I get with this data is pretty low compared to what is showcased in the notebook.
Following is the output:
#Load the dataset and explore.
filepath = "data/train_data.csv"
df = pd.read_csv(filepath)
print(df.shape)
df.head()
sentiment | content |
---|---|
empty | @tiffanylue i know i was listenin to bad habi... |
sadness | Layin n bed with a headache ughhhh...waitin o... |
sadness | Funeral ceremony...gloomy friday... |
enthusiasm | wants to hang out with friends SOON! |
neutral | @dannycastillo We want to trade with someone w... |
df['sentiment'].value_counts()
sentiment | count |
---|---|
worry | 7433 |
neutral | 6340 |
sadness | 4828 |
happiness | 2986 |
love | 2068 |
surprise | 1613 |
hate | 1187 |
fun | 1088 |
relief | 1021 |
empty | 659 |
enthusiasm | 522 |
boredom | 157 |
anger | 98 |

Name: sentiment, dtype: int64
#Let us take the top 3 categories and leave out the rest.
shortlist = ['neutral', "happiness", "worry"]
df_subset = df[df['sentiment'].isin(shortlist)]
df_subset.shape
(16759, 2)
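One thing I noticed from the counts above: in this copy of the data, `happiness` is not among the three most frequent labels, even though the notebook comment says the shortlist is the top 3. Below is a minimal sketch that recomputes the top three from the counts pasted in this issue. The counts are typed by hand into a plain dict, and `counts`, `top3`, and `subset_rows` are names I made up for illustration, not names from the notebook.

```python
# Label counts copied from the value_counts() output in this issue.
counts = {
    "worry": 7433, "neutral": 6340, "sadness": 4828, "happiness": 2986,
    "love": 2068, "surprise": 1613, "hate": 1187, "fun": 1088,
    "relief": 1021, "empty": 659, "enthusiasm": 522, "boredom": 157,
    "anger": 98,
}

# Hard-coded shortlist used in the notebook cell.
shortlist = ["neutral", "happiness", "worry"]

# Three most frequent labels in this copy of the data.
top3 = sorted(counts, key=counts.get, reverse=True)[:3]

# Number of rows the shortlist selects; should match df_subset.shape[0].
subset_rows = sum(counts[label] for label in shortlist)

print(top3)
print(subset_rows)
```

The shortlist selects 16759 rows, which matches the shape shown above, but the top three labels here are `worry`, `neutral`, and `sadness`. With the DataFrame loaded, `df['sentiment'].value_counts().index[:3].tolist()` should give the same answer.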
preds = myclass.predict(test_vectors)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(test_cats, preds))
class | precision | recall | f1-score | support |
---|---|---|---|---|
happiness | 0.34 | 0.54 | 0.42 | 713 |
neutral | 0.48 | 0.56 | 0.52 | 1595 |
worry | 0.62 | 0.40 | 0.48 | 1882 |
accuracy | | | 0.48 | 4190 |
macro avg | 0.48 | 0.50 | 0.47 | 4190 |
weighted avg | 0.52 | 0.48 | 0.49 | 4190 |
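To put the 0.48 accuracy in context, here is a rough sketch that compares it with a majority-class baseline computed from the support column of the report above. The `support` dict and the variable names are mine, typed in from the report; nothing here comes from the notebook.

```python
# Per-class support copied from the classification report in this issue.
support = {"happiness": 713, "neutral": 1595, "worry": 1882}
reported_accuracy = 0.48  # accuracy row of the report

total = sum(support.values())

# Accuracy of a classifier that always predicts the most frequent class.
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

print(total)
print(majority_label, round(baseline_accuracy, 2))
print(round(reported_accuracy - baseline_accuracy, 2))
```

On these numbers, always predicting `worry` would already score about 0.45, so the model is only about three points above that baseline.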