chitreshkr / Natural-Language-Processing-Python


Complete Text Processing

import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/twitter-data/master/twitter4000.csv', encoding = 'latin1')
df
twitts sentiment
0 is bored and wants to watch a movie any sugge... 0
1 back in miami. waiting to unboard ship 0
2 @misskpey awwww dnt dis brng bak memoriessss, ... 0
3 ughhh i am so tired blahhhhhhhhh 0
4 @mandagoforth me bad! It's funny though. Zacha... 0
... ... ...
3995 i just graduated 1
3996 Templating works; it all has to be done 1
3997 mommy just brought me starbucks 1
3998 @omarepps watching you on a House re-run...lov... 1
3999 Thanks for trying to make me smile I'll make y... 1

4000 rows × 2 columns

df['sentiment'].value_counts()
1    2000
0    2000
Name: sentiment, dtype: int64

Word Counts

len('this is text'.split())
3
df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))
df.sample(5)
twitts sentiment word_counts
2296 bulat dan bahagia and desperately needing a k... 1 15
3600 @johncmayer Like there was ever any doubt you ... 1 12
2468 @kirstiealley LETS DO IT! 1 4
66 @anthothemantho hahaha i agree! i cried like a... 0 16
1602 @KINOFLYHIGH fuck i shouldnt have left! 0 6
df['word_counts'].max()
32
df['word_counts'].min()
1
df[df['word_counts']==1]
twitts sentiment word_counts
385 homework 0 1
691 @ekrelly 0 1
1124 disappointed 0 1
1286 @officialmgnfox 0 1
1325 headache 0 1
1897 @MCRmuffin 0 1
2542 Graduated! 1 1
2947 reading 1 1
3176 @omeirdeleon 1 1
3470 www.myspace.com/myfinalthought 1 1
3966 @gethyp3 1 1

Character Count

len('this is')
7
def char_counts(x):
    s = x.split()
    x = ''.join(s)
    return len(x)
char_counts('this is')
6
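The helper above can be written more compactly; both variants below count non-whitespace characters and are equivalent to char_counts (a minimal sketch):

```python
def char_counts(x):
    # split() removes every run of whitespace; join and count what remains
    return len(''.join(str(x).split()))

def char_counts_alt(x):
    # equivalent: count every character that is not whitespace
    return sum(1 for ch in str(x) if not ch.isspace())

print(char_counts('this is'), char_counts_alt('this is'))  # 6 6
```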
df['char_counts'] = df['twitts'].apply(lambda x: char_counts(str(x)))
df.sample(5)
twitts sentiment word_counts char_counts
2503 Woke up. Such a nice weather out there. Shower... 1 13 57
1408 I think I killed outlook 0 5 20
16 @BrianQuest I made 1 fo u 2: http://bit.ly/eId... 0 19 81
601 working all day on mothers dayy but i left my... 0 20 72
1345 This java assignment has really got me down. ... 0 24 99

Average Word Length

x = 'this is' # 6/2 = 3
y = 'thankyou guys' # 12/2 = 6
df['avg_word_len'] = df['char_counts']/df['word_counts']
df.sample(4)
twitts sentiment word_counts char_counts avg_word_len
489 thiking of goin to the library but not realy c... 0 11 52 4.727273
1291 I dropped one of my iPod earphones in a glass ... 0 12 43 3.583333
1834 carley &amp; kim are coming over! but no mallo... 0 17 71 4.176471
1494 I'm still alive, but I need some miracle. Don'... 0 23 91 3.956522
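Every tweet here has word_counts of at least 1, but a standalone helper should guard against empty strings, which would otherwise divide by zero; a sketch that computes the average directly:

```python
def avg_word_len(text):
    words = str(text).split()
    # guard against empty input, which would otherwise divide by zero
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(avg_word_len('this is'))  # (4 + 2) / 2 = 3.0
print(avg_word_len(''))         # 0.0
```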

Stop Words Count

print(stopwords)
{'go', 'alone', 'besides', 'against', 'anyway', 'being', 'former', 'becoming', 'namely', 'this', 'over', 'whole', "'s", 'name', 'were', 'nevertheless', 'herein', 'nowhere', 'more', 'whether', 'amount', 'per', 'everything', 'our', 'than', 'show', 'top', 'them', '’s', 'how', 'on', 'my', 'mostly', 'done', 'seems', 'serious', 'both', 'very', 'amongst', 'who', 'n‘t', 'often', 'twenty', 'thus', '’ve', 'should', 'few', 'again', 'hundred', 'any', 'under', 'become', 'three', 'must', 'twelve', 're', 'meanwhile', 'also', 'around', 'out', 'something', 'other', 'whither', 'after', 'these', 'using', 'else', 'further', 'see', 'down', 'side', 'each', 'one', 'cannot', 'within', 'us', 'whereas', "'m", 'somehow', 'elsewhere', 'its', 'but', 'seemed', 'made', 'hers', '‘s', 'the', '’m', 'at', 'his', "'ve", 'another', 'perhaps', 'became', 'those', 'least', 'nine', 'she', '‘ll', '‘m', 'it', 'are', 'either', 'not', 'ten', '’re', 'you', 'has', 'still', 'off', 'sometimes', 'is', 'had', 'whom', 'why', 'with', 'used', 'say', 'could', 'was', 'yours', 'therein', 'when', 'enough', 'rather', 'yourselves', 'throughout', 'her', 'because', 'seem', 'fifteen', 'in', 'keep', 'just', 'fifty', 'quite', '’d', 'five', 'across', 'then', 'their', 'therefore', 'already', 'moreover', 'up', '‘d', 'have', 'put', 'that', 'there', 'onto', 'herself', 'most', 'no', 'whatever', 'since', 'though', 'may', 'ca', 'from', 'someone', 'latter', 'eight', 'they', 'and', 'various', 'well', 'latterly', 'whereafter', 'now', 'anything', 'ourselves', "'re", 'into', "n't", 'somewhere', 'an', 'take', 'been', 'without', 'indeed', 'me', 'third', 'thru', 'him', 'whereupon', 'whoever', 'above', 'next', 'which', 'themselves', 'several', 'last', 'four', 'many', 'thence', 'whereby', 'beyond', 'between', 'much', 'however', 'seeming', 'hereby', 'unless', 'hence', 'n’t', 'yet', 'nor', '‘ve', 'along', 'although', 'among', 'via', 'never', 'give', 'regarding', 'wherever', 'to', 'he', 'would', 'of', 'mine', 'always', 'back', 'anyone', 'others', 
'do', 'two', 'until', 'your', 'as', 'bottom', 'thereafter', 'formerly', 'neither', 'toward', 'we', 'thereupon', 'all', 'together', 'becomes', '‘re', 'so', 'might', 'thereby', 'empty', 'where', 'please', 'ours', 'will', 'move', "'ll", 'even', 'or', 'myself', 'afterwards', 'does', 'front', 'get', 'anywhere', 'nothing', 'own', 'am', 'beforehand', 'behind', 'by', 'too', 'doing', 'beside', 'wherein', 'i', 'be', 'whose', 'if', 'such', 'did', 'less', 'otherwise', 'part', 'make', 'noone', 'every', 'due', 'almost', 'except', 'before', 'what', 'some', 'same', 'ever', 'everyone', 'here', 'while', 'a', 'hereupon', 'about', 'none', 'call', '’ll', 'whence', 'eleven', 'anyhow', 'hereafter', 'for', 'itself', 'once', 'six', 'nobody', 'sixty', 'only', 'first', 'really', 'towards', 'whenever', 'yourself', 'himself', 'below', 'everywhere', 'forty', 'upon', 'through', 'full', "'d", 'sometime', 'can', 'during'}
len(stopwords)
326
x = 'this is the text data'
x.split()
['this', 'is', 'the', 'text', 'data']
[t for t in x.split() if t in stopwords]
['this', 'is', 'the']
len([t for t in x.split() if t in stopwords])
3
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in stopwords]))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len
1692 @Person678 Keep trying, I grew one last year a... 0 23 90 3.913043 10
3021 @taylormcfly I know!! Should of guessed they'd... 1 10 51 5.100000 1
1544 Although I want to hit up mcdonalds breakfast ... 0 9 46 5.111111 2
1329 i'm gonna be a good girl and stay at my dorm d... 0 23 88 3.826087 13
876 @LipstickNYC hmmm i owed you a story yesterday... 0 23 114 4.956522 8
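Note that the tweets have not been lowercased yet at this point, so capitalized tokens like 'This' are missed by the membership test. A case-insensitive variant is sketched below; STOP is a tiny stand-in for spaCy's STOP_WORDS:

```python
STOP = {'this', 'is', 'the'}  # illustrative stand-in for spaCy's STOP_WORDS

def stop_word_count(text, stop=STOP):
    # lowercase each token so 'This' and 'this' both match
    return sum(1 for t in str(text).lower().split() if t in stop)

print(stop_word_count('This is the text data'))  # → 3
```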

Count #HashTags and @Mentions

x = 'this is #hashtag and this is @mention'
x.split()
['this', 'is', '#hashtag', 'and', 'this', 'is', '@mention']
[t for t in x.split() if t.startswith('@')]
['@mention']
len([t for t in x.split() if t.startswith('@')])
1
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count
843 @Ms_Kaydine all im sayin is MJ's feet better g... 0 26 109 4.192308 12 0 1
2597 angels and demonds...i saw that movie yesterda... 1 10 67 6.700000 3 0 0
657 I need a bf! LOL anyone wanna sign up haha. Th... 0 32 105 3.281250 13 0 0
1070 @ABBSound ??????? ????? ??? ???? ??? ??? ?? ??... 0 9 46 5.111111 0 0 1
2335 @lukeb3000 i might be interested. how shall i ... 1 10 52 5.200000 6 0 1
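The startswith test misses tags glued to punctuation, e.g. '(#nlp)' or '@user:'. A regex variant catches those as well (a minimal sketch):

```python
import re

def tag_counts(text):
    # \w+ after the marker matches the tag name even when it is wrapped
    # in punctuation, e.g. '(#nlp)' or '@user:'
    hashtags = re.findall(r'#\w+', str(text))
    mentions = re.findall(r'@\w+', str(text))
    return len(hashtags), len(mentions)

print(tag_counts('this is #hashtag and this is @mention'))  # → (1, 1)
```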

Count Numeric Digits in twitts

x = 'this is 1 and 2'
x.split()
['this', 'is', '1', 'and', '2']
x.split()[3].isdigit()
False
[t for t in x.split() if t.isdigit()]
['1', '2']
df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count
1063 @Destini41 where do you think the otalia story... 0 19 95 5.000000 8 0 1 0
546 fml :/ today is too nice of a day to feel this... 0 13 38 2.923077 6 0 0 0
3325 @Kimmy6313 I totally feel better, you were rig... 1 14 62 4.428571 4 0 1 0
686 wants tomorrow to be over already. 0 6 29 4.833333 3 0 0 0
1814 @xMarshmellows Awww 0 2 18 9.000000 0 0 1 0

Uppercase Word Count

x = 'I AM HAPPY'
y = 'i am happy'
[t for t in x.split() if t.isupper()]
['I', 'AM', 'HAPPY']
df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper()]))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts
1617 thinks working 57 hours this week might just k... 0 22 84 3.818182 11 0 0 1 0
565 @derrickkendall that is if i'm not busy murder... 0 17 84 4.941176 8 0 1 0 0
946 Muwahahaha .... &gt;.&gt; hides behind pink fo... 0 21 100 4.761905 9 0 1 0 0
1517 Making pesto pasta for memy 2nd bday dinner! H... 0 28 111 3.964286 6 0 0 0 2
3864 The first day of &quot;real&quot; rehersals of... 1 21 91 4.333333 12 0 0 0 1
df.iloc[3962]['twitts']
'@DavidArchie Our local shows love tributes too much. True story! Will be watching SIS videos in Youtube later, haha '

Preprocessing and Cleaning

Lower Case Conversion

x = 'this is Text'
x.lower()
'this is text'
x = 45.0
str(x).lower()
'45.0'
df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts
1048 afternoon everyone just playing some tunes whi... 0 24 106 4.416667 12 0 0 0 1
31 shit you mister gembul! oh no.. you stole my h... 0 10 45 4.500000 3 0 0 0 0
3709 @silverlines hey you opened it! congrats! 1 6 36 6.000000 1 0 1 0 0
1777 @i140 myliferecord ... a health/medical histo... 0 16 90 5.625000 2 0 1 0 0
1596 @talentdmrripley maybe a good night's sleep f... 0 8 50 6.250000 2 0 1 0 0

Contraction to Expansion

contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and ",
"won't": "would not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'}
x = "i'm don't he'll" # "i am do not he will"
def cont_to_exp(x):
    if isinstance(x, str):
        # replace every contraction key with its expansion
        for key, value in contractions.items():
            x = x.replace(key, value)
    return x
    
cont_to_exp(x)
'i am do not he will'
%%timeit
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))
97.6 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
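Plain str.replace rewrites substrings anywhere, so a key like 'dis' above would also mangle the inside of 'distance'. A word-boundary regex avoids that; the dict below is a small illustrative subset (space-padded keys like ' u ' would still need separate handling):

```python
import re

# illustrative subset of the contractions mapping above
contractions = {"i'm": "i am", "don't": "do not", "dis": "this"}

# one alternation pattern with \b anchors so 'dis' cannot match
# inside longer words such as 'distance'
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in contractions) + r")\b")

def cont_to_exp(x):
    if not isinstance(x, str):
        return x
    return pattern.sub(lambda m: contractions[m.group(1)], x)

print(cont_to_exp("i'm don't dis distance"))  # → 'i am do not this distance'
```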
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts
3348 @timtech awww, how cute. i love when men go al... 1 11 47 4.272727 5 0 1 0 0
470 wii says i gained back .4 pounds 0 7 26 3.714286 1 0 0 0 1
826 @littleliverbird maybe. i go on a bit less too... 0 27 109 4.037037 13 0 1 0 1
570 cannot get into mariah's new song. 0 6 28 4.666667 2 0 0 0 0
2966 @sassyback dude i am gen y myself 1 6 27 4.500000 1 0 1 0 0

Count and Remove Emails

import re
df[df['twitts'].str.contains('hotmail.com')]
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts
3713 @securerecs arghh me please markbradbury_16@h... 1 5 51 10.2 0 0 1 0 0
df.iloc[3713]['twitts']
'@securerecs arghh me please  markbradbury_16@hotmail.com'
x = '@securerecs arghh me please  markbradbury_16@hotmail.com'
re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', x)
['markbradbury_16@hotmail.com']
df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x))
df['emails_count'] = df['emails'].apply(lambda x: len(x))
df[df['emails_count']>0]
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count
3713 @securerecs arghh me please markbradbury_16@h... 1 5 51 10.2 0 0 1 0 0 [markbradbury_16@hotmail.com] 1
re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x)
'@securerecs arghh me please  '
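Counting and removing can be combined in one pass with re.subn, which returns both the cleaned string and the number of substitutions made; a sketch reusing the same email pattern:

```python
import re

EMAIL_RE = re.compile(r'[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+')

def strip_emails(text):
    # subn returns (cleaned_string, number_of_replacements)
    cleaned, n = EMAIL_RE.subn('', str(text))
    return cleaned, n

print(strip_emails('@securerecs arghh me please  markbradbury_16@hotmail.com'))
```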
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x))
df[df['emails_count']>0]
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count
3713 @securerecs arghh me please 1 5 51 10.2 0 0 1 0 0 [markbradbury_16@hotmail.com] 1

Count and Remove URLs

x = 'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'
#shh://git@git.com:username/repo.git=riif?%
re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)
[('https', 'youtube.com', '/kgptalkie')]
df['url_flags'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))
df[df['url_flags']>0].sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
3203 @thewebguy http://twitpic.com/6jb33 - dude, th... 1 14 85 6.071429 3 0 1 0 0 [] 0 1
3362 shabtai it is great prizes today! (go almost ... 1 15 80 5.333333 5 1 0 0 1 [] 0 1
2537 @seuj sardinia for a few days of pre-graduatio... 1 10 67 6.700000 4 0 1 0 0 [] 0 1
2458 and again http://twitpic.com/4wp8l 1 3 32 10.666667 2 0 0 0 0 [] 0 1
548 @cyphersushi no, i am afraid not.but! go here... 0 16 117 7.312500 7 0 1 0 0 [] 0 1
x
'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'
re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)
'hi, thanks to watching it. for more visit '
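For tweet-style text a much shorter pattern is usually enough: match the scheme, then everything up to the next whitespace. A minimal sketch:

```python
import re

URL_RE = re.compile(r'(?:https?|ftp|ssh)://\S+')

def strip_urls(text):
    # \S+ swallows the rest of the URL up to the next whitespace,
    # which is usually sufficient for tweets
    return URL_RE.sub('', str(text)).strip()

print(strip_urls('for more visit https://youtube.com/kgptalkie'))  # → 'for more visit'
```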
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
2784 @realadulttalk come on and smile for me? that ... 1 12 62 5.166667 4 0 1 0 1 [] 0 0
888 @richmiller oh man, i am really sorry i hope ... 0 17 67 3.941176 6 0 1 0 1 [] 0 0
190 im veryy bad 0 3 10 3.333333 0 0 0 0 0 [] 0 0
1090 @simplymallory you be naht online d: sighs i... 0 15 63 4.200000 6 0 1 0 2 [] 0 0
1553 just got sad, although sadly expected, news fr... 0 10 48 4.800000 4 0 0 0 0 [] 0 0

Remove RT

df[df['twitts'].str.contains('rt')]
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
4 @mandagoforth me bad! it is funny though. zach... 0 26 116 4.461538 13 0 2 0 0 [] 0 0
23 ut oh, i wonder if the ram on the desktop is s... 0 14 46 3.285714 7 0 0 0 2 [] 0 0
59 @paulmccourt dunno what sky you're looking at!... 0 15 80 5.333333 3 0 1 0 0 [] 0 0
75 im back home in belfast im realli tired thoug... 0 22 84 3.818182 9 0 0 0 1 [] 0 0
81 @lilmonkee987 i know what you mean... i feel s... 0 11 48 4.363636 5 0 1 0 0 [] 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3913 for the press so after she recovered she kille... 1 24 100 4.166667 1 0 0 0 0 [] 0 0
3919 earned her cpr &amp; first aid certifications! 1 7 40 5.714286 1 0 0 0 1 [] 0 0
3945 @teciav &quot;i look high, i look low, i look ... 1 23 106 4.608696 10 0 1 0 0 [] 0 0
3951 i am soo very parched. and hungry. oh and i am... 1 21 87 4.142857 7 0 0 2 1 [] 0 0
3986 @countroshculla yeah..needed to get up early..... 1 10 69 6.900000 4 0 1 0 0 [] 0 0

381 rows × 13 columns

x = 'rt @username: hello hirt'
re.sub(r'\brt\b', '', x).strip()
'@username: hello hirt'
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', '', x).strip())

Special Character and Punctuation Removal

df.sample(3)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
2205 eating food leaving school to go to hospital ... 1 12 45 3.750000 5 0 0 0 0 [] 0 0
812 @earthlifeshop i know! it makes it hard for th... 0 17 74 4.352941 5 0 1 0 1 [] 0 0
1443 cannot sleep! only 3 hours til i have to wake up 0 11 38 3.454545 6 0 0 1 0 [] 0 0
x = '@duyku apparently i was not ready enough... i...'
re.sub(r'[^\w ]+', "", x)
'duyku apparently i was not ready enough i'
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'[^\w ]+', "", x))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
2294 joshishollywood aw joshi would describe you ex... 1 9 55 6.111111 3 0 1 0 0 [] 0 0
3495 repressd i hate it when that happens errrr i m... 1 14 63 4.500000 4 0 1 0 2 [] 0 0
1678 but when you do have a camera less funny thing... 0 11 45 4.090909 6 0 0 0 0 [] 0 0
3702 uh do not wanna work but mondays are easy days... 1 13 49 3.769231 5 0 0 0 0 [] 0 0
3201 heromancer i will take shin 1 6 50 8.333333 1 0 1 0 0 [] 0 1

Remove Multiple Spaces

x =  'hi    hello     how are you'
' '.join(x.split())
'hi hello how are you'
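A regex does the same and also collapses tabs and newlines explicitly; a minimal sketch:

```python
import re

def squeeze_spaces(text):
    # \s+ matches any run of whitespace, including tabs and newlines
    return re.sub(r'\s+', ' ', str(text)).strip()

print(squeeze_spaces('hi    hello     how are you'))  # → 'hi hello how are you'
```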
df['twitts'] = df['twitts'].apply(lambda x: ' '.join(x.split()))

Remove HTML tags

!pip install beautifulsoup4
Requirement already satisfied: beautifulsoup4 in c:\users\chitr\appdata\local\programs\python\python36\lib\site-packages (4.9.3)
Requirement already satisfied: soupsieve>1.2 in c:\users\chitr\appdata\local\programs\python\python36\lib\site-packages (from beautifulsoup4) (2.2.1)


from bs4 import BeautifulSoup
x = '<html><h1> thanks for watching it </h1></html>'
x.replace('<html><h1>', '').replace('</h1></html>', '') # not recommended: brittle, only works for these exact tags
' thanks for watching it '
BeautifulSoup(x, 'lxml').get_text().strip()
---------------------------------------------------------------------------

FeatureNotFound                           Traceback (most recent call last)

<ipython-input-187-2e9db3c14738> in <module>
----> 1 BeautifulSoup(x, 'lxml').get_text().strip()


c:\users\chitr\appdata\local\programs\python\python36\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
    244                     "Couldn't find a tree builder with the features you "
    245                     "requested: %s. Do you need to install a parser library?"
--> 246                     % ",".join(features))
    247 
    248         # At this point either we have a TreeBuilder instance in


FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
%%time
# 'lxml' failed above because the lxml package is not installed
# (`pip install lxml`); the built-in 'html.parser' needs no extra dependency
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().strip())
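Tag stripping can also be done with the standard library alone, with no parser dependency at all; a crude but dependency-free sketch (a real HTML parser is more robust on malformed markup):

```python
import re
from html import unescape

def strip_tags(text):
    # drop anything of the form <...>, then decode entities like &amp;
    return unescape(re.sub(r'<[^>]+>', '', str(text))).strip()

print(strip_tags('<html><h1> thanks for watching it </h1></html>'))  # → 'thanks for watching it'
```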

Remove Accented Chars

x = 'Áccěntěd těxt'
import unicodedata
def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x
remove_accented_chars(x)
'Accented text'
df['twitts'] = df['twitts'].apply(lambda x: remove_accented_chars(x))

Remove Stop Words

x = 'this is a stop words'
' '.join([t for t in x.split() if t not in stopwords])
df['twitts_no_stop'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in stopwords]))
df.sample(5)

Convert into base or root form of word

nlp = spacy.load('en_core_web_sm')
x = 'this is chocolates. what is times? this balls'
def make_to_base(x):
    x = str(x)
    x_list = []
    doc = nlp(x)
    
    for token in doc:
        lemma = token.lemma_
        # spaCy 2.x lemmatizes every pronoun to '-PRON-'; keep the original
        # token text instead (spaCy 3.x returns real pronoun lemmas, so the
        # check is harmless there)
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text

        x_list.append(lemma)
    return ' '.join(x_list)
make_to_base(x)
df['twitts'] = df['twitts'].apply(lambda x: make_to_base(x))
df.sample(5)

Common words removal

x = 'this is this okay bye'
text = ' '.join(df['twitts'])
len(text)
text = text.split()
len(text)
freq_comm = pd.Series(text).value_counts()
f20 = freq_comm[:20]
f20
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in f20]))
df.sample(5)
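The same frequency filtering can be sketched with the standard library's Counter on a toy corpus (not the tweet data):

```python
from collections import Counter

docs = ['this is this okay bye', 'this is fine']

# count token frequencies across the whole corpus
freq = Counter(' '.join(docs).split())
# the two most common tokens, analogous to f20 above
top2 = {w for w, _ in freq.most_common(2)}
# drop those tokens from every document
cleaned = [' '.join(t for t in d.split() if t not in top2) for d in docs]
print(top2, cleaned)
```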

Rare words removal

rare20 = freq_comm.tail(20)
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in rare20]))
df.sample(5)

Word Cloud Visualization

# !pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
text = ' '.join(df['twitts'])
len(text)
wc = WordCloud(width=800, height=400).generate(text)
plt.imshow(wc)
plt.axis('off')
plt.show()

Spelling Correction

!pip install -U textblob
!python -m textblob.download_corpora
from textblob import TextBlob
x = 'thankks forr waching it'
x = TextBlob(x).correct()
x

Tokenization using TextBlob

x = 'thanks#watching this video. please like it'
TextBlob(x).words
doc = nlp(x)
for token in doc:
    print(token)

Detecting Nouns

x = 'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
doc = nlp(x)
for noun in doc.noun_chunks:
    print(noun)

Language Translation and Detection

Language Code: https://www.loc.gov/standards/iso639-2/php/code_list.php

x
tb = TextBlob(x)
tb.detect_language()
tb.translate(to = 'zh')
Note: detect_language() and translate() relied on the Google Translate API and raise NotImplementedError from TextBlob 0.16 onward; use a dedicated translation library instead.

Use TextBlob's Inbuilt Sentiment Classifier

from textblob.sentiments import NaiveBayesAnalyzer
x = 'we all stand together. we are gonna win this fight'
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())
tb.sentiment
