import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/twitter-data/master/twitter4000.csv', encoding='latin1')
      twitts                                             sentiment
0     is bored and wants to watch a movie any sugge...          0
1     back in miami. waiting to unboard ship                    0
2     @misskpey awwww dnt dis brng bak memoriessss, ...         0
3     ughhh i am so tired blahhhhhhhhh                          0
4     @mandagoforth me bad! It's funny though. Zacha...         0
...   ...                                                     ...
3995  i just graduated                                          1
3996  Templating works; it all has to be done                   1
3997  mommy just brought me starbucks                           1
3998  @omarepps watching you on a House re-run...lov...         1
3999  Thanks for trying to make me smile I'll make y...         1

4000 rows × 2 columns
df['sentiment'].value_counts()

1    2000
0    2000
Name: sentiment, dtype: int64
len('this is text'.split())

df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))
      twitts                                             sentiment  word_counts
2296  bulat dan bahagia and desperately needing a k...         1           15
3600  @johncmayer Like there was ever any doubt you ...        1           12
2468  @kirstiealley LETS DO IT!                                1            4
66    @anthothemantho hahaha i agree! i cried like a...        0           16
1602  @KINOFLYHIGH fuck i shouldnt have left!                  0            6
      twitts                          sentiment  word_counts
385   homework                                0            1
691   @ekrelly                                0            1
1124  disappointed                            0            1
1286  @officialmgnfox                         0            1
1325  headache                                0            1
1897  @MCRmuffin                              0            1
2542  Graduated!                              1            1
2947  reading                                 1            1
3176  @omeirdeleon                            1            1
3470  www.myspace.com/myfinalthought          1            1
3966  @gethyp3                                1            1
def char_counts(x):
    s = x.split()
    x = ''.join(s)
    return len(x)

df['char_counts'] = df['twitts'].apply(lambda x: char_counts(str(x)))
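A quick sanity check of the helper above (restated here so the snippet runs on its own): char_counts counts characters after all whitespace is stripped out.

```python
# Restatement of char_counts from above: characters excluding whitespace.
def char_counts(x):
    return len(''.join(x.split()))

print(char_counts('this is'))        # 6 (the space is not counted)
print(char_counts('thankyou guys'))  # 12
```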
      twitts                                             sentiment  word_counts  char_counts
2503  Woke up. Such a nice weather out there. Shower...        1           13           57
1408  I think I killed outlook                                 0            5           20
16    @BrianQuest I made 1 fo u 2: http://bit.ly/eId ...       0           19           81
601   working all day on mothers dayy but i left my...         0           20           72
1345  This java assignment has really got me down. ...         0           24           99
x = 'this is'        # 6/2 = 3
y = 'thankyou guys'  # 12/2 = 6

df['avg_word_len'] = df['char_counts'] / df['word_counts']
      twitts                                             sentiment  word_counts  char_counts  avg_word_len
489   thiking of goin to the library but not realy c...        0           11           52      4.727273
1291  I dropped one of my iPod earphones in a glass ...        0           12           43      3.583333
1834  carley & kim are coming over! but no mallo...            0           17           71      4.176471
1494  I'm still alive, but I need some miracle. Don'...        0           23           91      3.956522
{'go', 'alone', 'besides', 'against', 'anyway', 'being', 'former', 'becoming', 'namely', 'this', 'over', ...}

(spaCy's built-in English stop-word set, abbreviated here)
x = 'this is the text data'
x.split()
['this', 'is', 'the', 'text', 'data']

[t for t in x.split() if t in stopwords]
len([t for t in x.split() if t in stopwords])

df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in stopwords]))
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len
1692  @Person678 Keep trying, I grew one last year a...        0           23           90      3.913043              10
3021  @taylormcfly I know!! Should of guessed they'd...        1           10           51      5.100000               1
1544  Although I want to hit up mcdonalds breakfast ...        0            9           46      5.111111               2
1329  i'm gonna be a good girl and stay at my dorm d...        0           23           88      3.826087              13
876   @LipstickNYC hmmm i owed you a story yesterday...        0           23          114      4.956522               8
Count #HashTags and @Mentions

x = 'this is #hashtag and this is @mention'
x.split()
['this', 'is', '#hashtag', 'and', 'this', 'is', '@mention']

[t for t in x.split() if t.startswith('@')]
len([t for t in x.split() if t.startswith('@')])

df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count
843   @Ms_Kaydine all im sayin is MJ's feet better g...        0           26          109      4.192308              12               0               1
2597  angels and demonds...i saw that movie yesterda...        1           10           67      6.700000               3               0               0
657   I need a bf! LOL anyone wanna sign up haha. Th...        0           32          105      3.281250              13               0               0
1070  @ABBSound ??????? ????? ??? ???? ??? ??? ?? ??...        0            9           46      5.111111               0               0               1
2335  @lukeb3000 i might be interested. how shall i ...        1           10           52      5.200000               6               0               1
If numeric digits are present in twitts

x = 'this is 1 and 2'
x.split()
['this', 'is', '1', 'and', '2']

[t for t in x.split() if t.isdigit()]

df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count
1063  @Destini41 where do you think the otalia story...        0           19           95      5.000000               8               0               1               0
546   fml :/ today is too nice of a day to feel this...        0           13           38      2.923077               6               0               0               0
3325  @Kimmy6313 I totally feel better, you were rig...        1           14           62      4.428571               4               0               1               0
686   wants tomorrow to be over already.                       0            6           29      4.833333               3               0               0               0
1814  @xMarshmellows Awww                                      0            2           18      9.000000               0               0               1               0
x = 'I AM HAPPY'
y = 'i am happy'

[t for t in x.split() if t.isupper()]

df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper()]))
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts
1617  thinks working 57 hours this week might just k...        0           22           84      3.818182              11               0               0               1             0
565   @derrickkendall that is if i'm not busy murder...        0           17           84      4.941176               8               0               1               0             0
946   Muwahahaha .... >.> hides behind pink fo...              0           21          100      4.761905               9               0               1               0             0
1517  Making pesto pasta for memy 2nd bday dinner! H...        0           28          111      3.964286               6               0               0               0             2
3864  The first day of "real" rehersals of...                  1           21           91      4.333333              12               0               0               0             1
'@DavidArchie Our local shows love tributes too much. True story! Will be watching SIS videos in Youtube later, haha '
Preprocessing and Cleaning
df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts
1048  afternoon everyone just playing some tunes whi...        0           24          106      4.416667              12               0               0               0             1
31    shit you mister gembul! oh no.. you stole my h...        0           10           45      4.500000               3               0               0               0             0
3709  @silverlines hey you opened it! congrats!                1            6           36      6.000000               1               0               1               0             0
1777  @i140 myliferecord ... a health/medical histo...         0           16           90      5.625000               2               0               1               0             0
1596  @talentdmrripley maybe a good night's sleep f...         0            8           50      6.250000               2               0               1               0             0
contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how does",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so is",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    " u ": " you ",
    " ur ": " your ",
    " n ": " and ",
    "won't": "will not",
    'dis': 'this',
    'bak': 'back',
    'brng': 'bring'}
x = "i'm don't he'll"  # "i am do not he will"

def cont_to_exp(x):
    if type(x) is str:
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x
%%timeit
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))

97.6 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
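If the per-key replace loop above ever becomes a bottleneck, one alternative (a sketch, not part of the original notebook) is to compile all keys into a single regex alternation and do one pass over each string; demo_contractions here is a small hypothetical stand-in for the full mapping.

```python
import re

# Small stand-in for the full contractions mapping above.
demo_contractions = {"i'm": "i am", "don't": "do not", "he'll": "he will"}

# Longest keys first, so e.g. "can't've" would win over "can't".
pattern = re.compile('|'.join(
    re.escape(k) for k in sorted(demo_contractions, key=len, reverse=True)))

def cont_to_exp_fast(text):
    # One regex pass instead of one str.replace per dictionary key.
    return pattern.sub(lambda m: demo_contractions[m.group(0)], text)

print(cont_to_exp_fast("i'm don't he'll"))  # i am do not he will
```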
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts
3348  @timtech awww, how cute. i love when men go al...        1           11           47      4.272727               5               0               1               0             0
470   wii says i gained back .4 pounds                         0            7           26      3.714286               1               0               0               0             1
826   @littleliverbird maybe. i go on a bit less too...        0           27          109      4.037037              13               0               1               0             1
570   cannot get into mariah's new song.                       0            6           28      4.666667               2               0               0               0             0
2966  @sassyback dude i am gen y myself                        1            6           27      4.500000               1               0               1               0             0
df[df['twitts'].str.contains('hotmail.com')]
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts
3713  @securerecs arghh me please markbradbury_16@h...         1            5           51          10.2               0               0               1               0             0
'@securerecs arghh me please markbradbury_16@hotmail.com'

import re

x = '@securerecs arghh me please markbradbury_16@hotmail.com'
re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', x)
['markbradbury_16@hotmail.com']

df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x))
df['emails_count'] = df['emails'].apply(lambda x: len(x))
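The same extraction can stay inside pandas via Series.str.findall, which applies the pattern row by row; the toy Series below is a hypothetical stand-in for the twitts column.

```python
import pandas as pd

# Toy stand-in for the twitts column.
s = pd.Series(['@securerecs arghh me please markbradbury_16@hotmail.com',
               'no email in this tweet'])

# One list of matches per row, then its length as the per-row count.
emails = s.str.findall(r'[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+')
counts = emails.str.len()

print(emails[0])        # ['markbradbury_16@hotmail.com']
print(counts.tolist())  # [1, 0]
```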
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails                         emails_count
3713  @securerecs arghh me please markbradbury_16@h...         1            5           51          10.2               0               0               1               0             0  [markbradbury_16@hotmail.com]             1
re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', '', x)
'@securerecs arghh me please '

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', '', x))
      twitts                        sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails                         emails_count
3713  @securerecs arghh me please           1            5           51          10.2               0               0               1               0             0  [markbradbury_16@hotmail.com]             1
x = 'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'
# ssh://git@git.com:username/repo.git=riif?%
re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)
[('https', 'youtube.com', '/kgptalkie')]

df['url_flags'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))
df[df['url_flags'] > 0].sample(5)
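One gotcha worth noting with this pattern: re.findall returns the capture-group tuples, not the full matched URLs, so rebuilding a URL means joining the scheme, host, and path groups back together (a small sketch):

```python
import re

url_pattern = r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
x = 'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'

# findall gives the groups: ('https', 'youtube.com', '/kgptalkie')
scheme, host, path = re.findall(url_pattern, x)[0]
url = scheme + '://' + host + path
print(url)  # https://youtube.com/kgptalkie
```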
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  url_flags
3203  @thewebguy http://twitpic.com/6jb33 - dude, th...        1           14           85      6.071429               3               0               1               0             0      []             0          1
3362  shabtai it is great prizes today! (go almost ...         1           15           80      5.333333               5               1               0               0             1      []             0          1
2537  @seuj sardinia for a few days of pre-graduatio...        1           10           67      6.700000               4               0               1               0             0      []             0          1
2458  and again http://twitpic.com/4wp8l                       1            3           32     10.666667               2               0               0               0             0      []             0          1
548   @cyphersushi no, i am afraid not.but! go here...         0           16          117      7.312500               7               0               1               0             0      []             0          1
'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'

re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)
'hi, thanks to watching it. for more visit '

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  url_flags
2784  @realadulttalk come on and smile for me? that ...        1           12           62      5.166667               4               0               1               0             1      []             0          0
888   @richmiller oh man, i am really sorry i hope ...         0           17           67      3.941176               6               0               1               0             1      []             0          0
190   im veryy bad                                             0            3           10      3.333333               0               0               0               0             0      []             0          0
1090  @simplymallory you be naht online d: sighs i...          0           15           63      4.200000               6               0               1               0             2      []             0          0
1553  just got sad, although sadly expected, news fr...        0           10           48      4.800000               4               0               0               0             0      []             0          0
df[df['twitts'].str.contains('rt')]
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  url_flags
4     @mandagoforth me bad! it is funny though. zach...        0           26          116      4.461538              13               0               2               0             0      []             0          0
23    ut oh, i wonder if the ram on the desktop is s...        0           14           46      3.285714               7               0               0               0             2      []             0          0
59    @paulmccourt dunno what sky you're looking at!...        0           15           80      5.333333               3               0               1               0             0      []             0          0
75    im back home in belfast im realli tired thoug...         0           22           84      3.818182               9               0               0               0             1      []             0          0
81    @lilmonkee987 i know what you mean... i feel s...        0           11           48      4.363636               5               0               1               0             0      []             0          0
...   ...                                                  ...          ...          ...           ...             ...             ...             ...             ...           ...     ...           ...        ...
3913  for the press so after she recovered she kille...        1           24          100      4.166667               1               0               0               0             0      []             0          0
3919  earned her cpr & first aid certifications!               1            7           40      5.714286               1               0               0               0             1      []             0          0
3945  @teciav "i look high, i look low, i look ...             1           23          106      4.608696              10               0               1               0             0      []             0          0
3951  i am soo very parched. and hungry. oh and i am...        1           21           87      4.142857               7               0               0               2             1      []             0          0
3986  @countroshculla yeah..needed to get up early.....        1           10           69      6.900000               4               0               1               0             0      []             0          0

381 rows × 13 columns
x = 'rt @username: hello hirt'
re.sub(r'\brt\b', '', x).strip()

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', '', x).strip())
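The \b word boundaries matter here: without them, the substitution would also eat the 'rt' hiding inside ordinary words like 'hirt'. A minimal check:

```python
import re

x = 'rt @username: hello hirt'

# \b restricts the match to a standalone 'rt' token ...
with_boundary = re.sub(r'\brt\b', '', x).strip()
# ... while a bare 'rt' also chews into 'hirt'.
without_boundary = re.sub(r'rt', '', x).strip()

print(with_boundary)     # @username: hello hirt
print(without_boundary)  # @username: hello hi
```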
Special character or punctuation removal
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  url_flags
2205  eating food leaving school to go to hospital ...         1           12           45      3.750000               5               0               0               0             0      []             0          0
812   @earthlifeshop i know! it makes it hard for th...        0           17           74      4.352941               5               0               1               0             1      []             0          0
1443  cannot sleep! only 3 hours til i have to wake up         0           11           38      3.454545               6               0               0               1             0      []             0          0
x = '@duyku apparently i was not ready enough... i...'
re.sub(r'[^\w ]+', '', x)
'duyku apparently i was not ready enough i'

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'[^\w ]+', '', x))
      twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  url_flags
2294  joshishollywood aw joshi would describe you ex...        1            9           55      6.111111               3               0               1               0             0      []             0          0
3495  repressd i hate it when that happens errrr i m...        1           14           63      4.500000               4               0               1               0             2      []             0          0
1678  but when you do have a camera less funny thing...        0           11           45      4.090909               6               0               0               0             0      []             0          0
3702  uh do not wanna work but mondays are easy days...        1           13           49      3.769231               5               0               0               0             0      []             0          0
3201  heromancer i will take shin                              1            6           50      8.333333               1               0               1               0             0      []             0          1
Remove multiple spaces, e.g. "hi      hello   "

x = 'hi hello     how are you'
' '.join(x.split())

df['twitts'] = df['twitts'].apply(lambda x: ' '.join(x.split()))
!pip install beautifulsoup4
Requirement already satisfied: beautifulsoup4 in c:\users\chitr\appdata\local\programs\python\python36\lib\site-packages (4.9.3)
from bs4 import BeautifulSoup

x = '<html><h1> thanks for watching it </h1></html>'
x.replace('<html><h1>', '').replace('</h1></html>', '')  # plain string replace works, but is not recommended
' thanks for watching it '

# BeautifulSoup(x, 'lxml') raises FeatureNotFound when the lxml parser is not
# installed (pip install lxml); the built-in 'html.parser' needs no extra package.
BeautifulSoup(x, 'html.parser').get_text().strip()
'thanks for watching it'

%%time
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().strip())
import unicodedata

def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x

df['twitts'] = df['twitts'].apply(lambda x: remove_accented_chars(x))
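A quick check of the accent-stripping helper (restated here so the snippet is self-contained):

```python
import unicodedata

def remove_accented_chars(x):
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ascii with errors='ignore' then drops the marks.
    return unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')

print(remove_accented_chars('café crème'))  # cafe creme
```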
x = 'this is a stop words'
' '.join([t for t in x.split() if t not in stopwords])

df['twitts_no_stop'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in stopwords]))
Convert words into their base or root form

nlp = spacy.load('en_core_web_sm')

x = 'this is chocolates. what is times? this balls'

def make_to_base(x):
    x = str(x)
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma = token.lemma_
        # spaCy v2 lemmatizes pronouns to the placeholder '-PRON-';
        # keep the original text for those and for forms of 'be'
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    return ' '.join(x_list)

df['twitts'] = df['twitts'].apply(lambda x: make_to_base(x))
x = 'this is this okay bye'

text = ' '.join(df['twitts'])
freq_comm = pd.Series(text.split()).value_counts()

f20 = freq_comm.head(20)     # 20 most frequent words
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in f20]))

rare20 = freq_comm.tail(20)  # 20 rarest words
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in rare20]))
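The frequent/rare-word filtering above in miniature, on toy data. Note why `t not in f20` works: membership tests on a Series check its index, and for value_counts the index holds the words themselves.

```python
import pandas as pd

words = 'a a a b b c'.split()
freq = pd.Series(words).value_counts()  # index: words, values: counts

top = freq.head(1)                      # most frequent word(s); index = the words
filtered = [w for w in words if w not in top]
print(filtered)  # ['b', 'b', 'c']
```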
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

text = ' '.join(df['twitts'])
wc = WordCloud(width=800, height=400).generate(text)
plt.imshow(wc)
plt.axis('off')
plt.show()
!python -m textblob.download_corpora

from textblob import TextBlob
x = 'thankks forr waching it'
x = TextBlob(x).correct()
Tokenization using spaCy

x = 'thanks#watching this video. please like it'
doc = nlp(x)
for token in doc:
    print(token)
x = 'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
doc = nlp(x)
for noun in doc.noun_chunks:
    print(noun)
Language Translation and Detection
Language Code: https://www.loc.gov/standards/iso639-2/php/code_list.php
Use TextBlob's Built-in Sentiment Classifier

from textblob.sentiments import NaiveBayesAnalyzer

x = 'we all stands together. we are gonna win this fight'
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())