chitreshkr / Natural-Language-Processing-Python


Complete Text Processing

import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/twitter-data/master/twitter4000.csv', encoding = 'latin1')
df
twitts sentiment
0 is bored and wants to watch a movie any sugge... 0
1 back in miami. waiting to unboard ship 0
2 @misskpey awwww dnt dis brng bak memoriessss, ... 0
3 ughhh i am so tired blahhhhhhhhh 0
4 @mandagoforth me bad! It's funny though. Zacha... 0
... ... ...
3995 i just graduated 1
3996 Templating works; it all has to be done 1
3997 mommy just brought me starbucks 1
3998 @omarepps watching you on a House re-run...lov... 1
3999 Thanks for trying to make me smile I'll make y... 1

4000 rows × 2 columns

df['sentiment'].value_counts()
1    2000
0    2000
Name: sentiment, dtype: int64

Word Counts

len('this is text'.split())
3
df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))
df.sample(5)
twitts sentiment word_counts
2296 bulat dan bahagia and desperately needing a k... 1 15
3600 @johncmayer Like there was ever any doubt you ... 1 12
2468 @kirstiealley LETS DO IT! 1 4
66 @anthothemantho hahaha i agree! i cried like a... 0 16
1602 @KINOFLYHIGH fuck i shouldnt have left! 0 6
df['word_counts'].max()
32
df['word_counts'].min()
1
df[df['word_counts']==1]
twitts sentiment word_counts
385 homework 0 1
691 @ekrelly 0 1
1124 disappointed 0 1
1286 @officialmgnfox 0 1
1325 headache 0 1
1897 @MCRmuffin 0 1
2542 Graduated! 1 1
2947 reading 1 1
3176 @omeirdeleon 1 1
3470 www.myspace.com/myfinalthought 1 1
3966 @gethyp3 1 1

Character Count

len('this is')
7
def char_counts(x):
    s = x.split()
    x = ''.join(s)
    return len(x)
char_counts('this is')
6
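The helper above can be written more compactly; both variants below count non-whitespace characters and are equivalent to char_counts (a minimal sketch):

```python
def char_counts(x):
    # split() removes every run of whitespace; join and count what remains
    return len(''.join(str(x).split()))

def char_counts_alt(x):
    # equivalent: count every character that is not whitespace
    return sum(1 for ch in str(x) if not ch.isspace())

print(char_counts('this is'), char_counts_alt('this is'))  # 6 6
```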
df['char_counts'] = df['twitts'].apply(lambda x: char_counts(str(x)))
df.sample(5)
twitts sentiment word_counts char_counts
2503 Woke up. Such a nice weather out there. Shower... 1 13 57
1408 I think I killed outlook 0 5 20
16 @BrianQuest I made 1 fo u 2: http://bit.ly/eId... 0 19 81
601 working all day on mothers dayy but i left my... 0 20 72
1345 This java assignment has really got me down. ... 0 24 99

Average Word Length

x = 'this is' # 6/2 = 3
y = 'thankyou guys' # 12/2 = 6
df['avg_word_len'] = df['char_counts']/df['word_counts']
df.sample(4)
twitts sentiment word_counts char_counts avg_word_len
489 thiking of goin to the library but not realy c... 0 11 52 4.727273
1291 I dropped one of my iPod earphones in a glass ... 0 12 43 3.583333
1834 carley &amp; kim are coming over! but no mallo... 0 17 71 4.176471
1494 I'm still alive, but I need some miracle. Don'... 0 23 91 3.956522
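Every tweet here has word_counts of at least 1, but a standalone helper should guard against empty strings, which would otherwise divide by zero; a sketch that computes the average directly:

```python
def avg_word_len(text):
    words = str(text).split()
    # guard against empty input, which would otherwise divide by zero
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(avg_word_len('this is'))  # (4 + 2) / 2 = 3.0
print(avg_word_len(''))         # 0.0
```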

Stop Words Count

print(stopwords)
{'go', 'alone', 'besides', 'against', 'anyway', 'being', 'former', 'becoming', 'namely', 'this', 'over', 'whole', "'s", 'name', 'were', 'nevertheless', 'herein', 'nowhere', 'more', 'whether', 'amount', 'per', 'everything', 'our', 'than', 'show', 'top', 'them', '’s', 'how', 'on', 'my', 'mostly', 'done', 'seems', 'serious', 'both', 'very', 'amongst', 'who', 'n‘t', 'often', 'twenty', 'thus', '’ve', 'should', 'few', 'again', 'hundred', 'any', 'under', 'become', 'three', 'must', 'twelve', 're', 'meanwhile', 'also', 'around', 'out', 'something', 'other', 'whither', 'after', 'these', 'using', 'else', 'further', 'see', 'down', 'side', 'each', 'one', 'cannot', 'within', 'us', 'whereas', "'m", 'somehow', 'elsewhere', 'its', 'but', 'seemed', 'made', 'hers', '‘s', 'the', '’m', 'at', 'his', "'ve", 'another', 'perhaps', 'became', 'those', 'least', 'nine', 'she', '‘ll', '‘m', 'it', 'are', 'either', 'not', 'ten', '’re', 'you', 'has', 'still', 'off', 'sometimes', 'is', 'had', 'whom', 'why', 'with', 'used', 'say', 'could', 'was', 'yours', 'therein', 'when', 'enough', 'rather', 'yourselves', 'throughout', 'her', 'because', 'seem', 'fifteen', 'in', 'keep', 'just', 'fifty', 'quite', '’d', 'five', 'across', 'then', 'their', 'therefore', 'already', 'moreover', 'up', '‘d', 'have', 'put', 'that', 'there', 'onto', 'herself', 'most', 'no', 'whatever', 'since', 'though', 'may', 'ca', 'from', 'someone', 'latter', 'eight', 'they', 'and', 'various', 'well', 'latterly', 'whereafter', 'now', 'anything', 'ourselves', "'re", 'into', "n't", 'somewhere', 'an', 'take', 'been', 'without', 'indeed', 'me', 'third', 'thru', 'him', 'whereupon', 'whoever', 'above', 'next', 'which', 'themselves', 'several', 'last', 'four', 'many', 'thence', 'whereby', 'beyond', 'between', 'much', 'however', 'seeming', 'hereby', 'unless', 'hence', 'n’t', 'yet', 'nor', '‘ve', 'along', 'although', 'among', 'via', 'never', 'give', 'regarding', 'wherever', 'to', 'he', 'would', 'of', 'mine', 'always', 'back', 'anyone', 'others', 
'do', 'two', 'until', 'your', 'as', 'bottom', 'thereafter', 'formerly', 'neither', 'toward', 'we', 'thereupon', 'all', 'together', 'becomes', '‘re', 'so', 'might', 'thereby', 'empty', 'where', 'please', 'ours', 'will', 'move', "'ll", 'even', 'or', 'myself', 'afterwards', 'does', 'front', 'get', 'anywhere', 'nothing', 'own', 'am', 'beforehand', 'behind', 'by', 'too', 'doing', 'beside', 'wherein', 'i', 'be', 'whose', 'if', 'such', 'did', 'less', 'otherwise', 'part', 'make', 'noone', 'every', 'due', 'almost', 'except', 'before', 'what', 'some', 'same', 'ever', 'everyone', 'here', 'while', 'a', 'hereupon', 'about', 'none', 'call', '’ll', 'whence', 'eleven', 'anyhow', 'hereafter', 'for', 'itself', 'once', 'six', 'nobody', 'sixty', 'only', 'first', 'really', 'towards', 'whenever', 'yourself', 'himself', 'below', 'everywhere', 'forty', 'upon', 'through', 'full', "'d", 'sometime', 'can', 'during'}
len(stopwords)
326
x = 'this is the text data'
x.split()
['this', 'is', 'the', 'text', 'data']
[t for t in x.split() if t in stopwords]
['this', 'is', 'the']
len([t for t in x.split() if t in stopwords])
3
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in stopwords]))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len
1692 @Person678 Keep trying, I grew one last year a... 0 23 90 3.913043 10
3021 @taylormcfly I know!! Should of guessed they'd... 1 10 51 5.100000 1
1544 Although I want to hit up mcdonalds breakfast ... 0 9 46 5.111111 2
1329 i'm gonna be a good girl and stay at my dorm d... 0 23 88 3.826087 13
876 @LipstickNYC hmmm i owed you a story yesterday... 0 23 114 4.956522 8
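Note that the tweets have not been lowercased yet at this point, so capitalized tokens like 'This' are missed by the membership test. A case-insensitive variant is sketched below; STOP is a tiny stand-in for spaCy's STOP_WORDS:

```python
STOP = {'this', 'is', 'the'}  # illustrative stand-in for spaCy's STOP_WORDS

def stop_word_count(text, stop=STOP):
    # lowercase each token so 'This' and 'this' both match
    return sum(1 for t in str(text).lower().split() if t in stop)

print(stop_word_count('This is the text data'))  # → 3
```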

Count #HashTags and @Mentions

x = 'this is #hashtag and this is @mention'
x.split()
['this', 'is', '#hashtag', 'and', 'this', 'is', '@mention']
[t for t in x.split() if t.startswith('@')]
['@mention']
len([t for t in x.split() if t.startswith('@')])
1
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count
843 @Ms_Kaydine all im sayin is MJ's feet better g... 0 26 109 4.192308 12 0 1
2597 angels and demonds...i saw that movie yesterda... 1 10 67 6.700000 3 0 0
657 I need a bf! LOL anyone wanna sign up haha. Th... 0 32 105 3.281250 13 0 0
1070 @ABBSound ??????? ????? ??? ???? ??? ??? ?? ??... 0 9 46 5.111111 0 0 1
2335 @lukeb3000 i might be interested. how shall i ... 1 10 52 5.200000 6 0 1
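The startswith test misses tags glued to punctuation, e.g. '(#nlp)' or '@user:'. A regex variant catches those as well (a minimal sketch):

```python
import re

def tag_counts(text):
    # \w+ after the marker matches the tag name even when it is wrapped
    # in punctuation, e.g. '(#nlp)' or '@user:'
    hashtags = re.findall(r'#\w+', str(text))
    mentions = re.findall(r'@\w+', str(text))
    return len(hashtags), len(mentions)

print(tag_counts('this is #hashtag and this is @mention'))  # → (1, 1)
```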

Count Numeric Digits in twitts

x = 'this is 1 and 2'
x.split()
['this', 'is', '1', 'and', '2']
x.split()[3].isdigit()
False
[t for t in x.split() if t.isdigit()]
['1', '2']
df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count
1063 @Destini41 where do you think the otalia story... 0 19 95 5.000000 8 0 1 0
546 fml :/ today is too nice of a day to feel this... 0 13 38 2.923077 6 0 0 0
3325 @Kimmy6313 I totally feel better, you were rig... 1 14 62 4.428571 4 0 1 0
686 wants tomorrow to be over already. 0 6 29 4.833333 3 0 0 0
1814 @xMarshmellows Awww 0 2 18 9.000000 0 0 1 0

Uppercase Word Count

x = 'I AM HAPPY'
y = 'i am happy'
[t for t in x.split() if t.isupper()]
['I', 'AM', 'HAPPY']
df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper()]))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts
1617 thinks working 57 hours this week might just k... 0 22 84 3.818182 11 0 0 1 0
565 @derrickkendall that is if i'm not busy murder... 0 17 84 4.941176 8 0 1 0 0
946 Muwahahaha .... &gt;.&gt; hides behind pink fo... 0 21 100 4.761905 9 0 1 0 0
1517 Making pesto pasta for memy 2nd bday dinner! H... 0 28 111 3.964286 6 0 0 0 2
3864 The first day of &quot;real&quot; rehersals of... 1 21 91 4.333333 12 0 0 0 1
df.iloc[3962]['twitts']
'@DavidArchie Our local shows love tributes too much. True story! Will be watching SIS videos in Youtube later, haha '

Preprocessing and Cleaning

Lower Case Conversion

x = 'this is Text'
x.lower()
'this is text'
x = 45.0
str(x).lower()
'45.0'
df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts
1048 afternoon everyone just playing some tunes whi... 0 24 106 4.416667 12 0 0 0 1
31 shit you mister gembul! oh no.. you stole my h... 0 10 45 4.500000 3 0 0 0 0
3709 @silverlines hey you opened it! congrats! 1 6 36 6.000000 1 0 1 0 0
1777 @i140 myliferecord ... a health/medical histo... 0 16 90 5.625000 2 0 1 0 0
1596 @talentdmrripley maybe a good night's sleep f... 0 8 50 6.250000 2 0 1 0 0

Contraction to Expansion

contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and ",
"won't": "would not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'}
x = "i'm don't he'll" # "i am do not he will"
def cont_to_exp(x):
    if isinstance(x, str):
        # replace every contraction key with its expansion
        for key, value in contractions.items():
            x = x.replace(key, value)
    return x
    
cont_to_exp(x)
'i am do not he will'
%%timeit
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))
97.6 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
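Plain str.replace rewrites substrings anywhere, so a key like 'dis' above would also mangle the inside of 'distance'. A word-boundary regex avoids that; the dict below is a small illustrative subset (space-padded keys like ' u ' would still need separate handling):

```python
import re

# illustrative subset of the contractions mapping above
contractions = {"i'm": "i am", "don't": "do not", "dis": "this"}

# one alternation pattern with \b anchors so 'dis' cannot match
# inside longer words such as 'distance'
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in contractions) + r")\b")

def cont_to_exp(x):
    if not isinstance(x, str):
        return x
    return pattern.sub(lambda m: contractions[m.group(1)], x)

print(cont_to_exp("i'm don't dis distance"))  # → 'i am do not this distance'
```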
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts
3348 @timtech awww, how cute. i love when men go al... 1 11 47 4.272727 5 0 1 0 0
470 wii says i gained back .4 pounds 0 7 26 3.714286 1 0 0 0 1
826 @littleliverbird maybe. i go on a bit less too... 0 27 109 4.037037 13 0 1 0 1
570 cannot get into mariah's new song. 0 6 28 4.666667 2 0 0 0 0
2966 @sassyback dude i am gen y myself 1 6 27 4.500000 1 0 1 0 0

Count and Remove Emails

import re
df[df['twitts'].str.contains('hotmail.com')]
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts
3713 @securerecs arghh me please markbradbury_16@h... 1 5 51 10.2 0 0 1 0 0
df.iloc[3713]['twitts']
'@securerecs arghh me please  markbradbury_16@hotmail.com'
x = '@securerecs arghh me please  markbradbury_16@hotmail.com'
re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', x)
['markbradbury_16@hotmail.com']
df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x))
df['emails_count'] = df['emails'].apply(lambda x: len(x))
df[df['emails_count']>0]
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count
3713 @securerecs arghh me please markbradbury_16@h... 1 5 51 10.2 0 0 1 0 0 [markbradbury_16@hotmail.com] 1
re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x)
'@securerecs arghh me please  '
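Counting and removing can be combined in one pass with re.subn, which returns both the cleaned string and the number of substitutions made; a sketch reusing the same email pattern:

```python
import re

EMAIL_RE = re.compile(r'[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+')

def strip_emails(text):
    # subn returns (cleaned_string, number_of_replacements)
    cleaned, n = EMAIL_RE.subn('', str(text))
    return cleaned, n

print(strip_emails('@securerecs arghh me please  markbradbury_16@hotmail.com'))
```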
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x))
df[df['emails_count']>0]
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count
3713 @securerecs arghh me please 1 5 51 10.2 0 0 1 0 0 [markbradbury_16@hotmail.com] 1

Count and Remove URLs

x = 'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'
#shh://git@git.com:username/repo.git=riif?%
re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)
[('https', 'youtube.com', '/kgptalkie')]
df['url_flags'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))
df[df['url_flags']>0].sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
3203 @thewebguy http://twitpic.com/6jb33 - dude, th... 1 14 85 6.071429 3 0 1 0 0 [] 0 1
3362 shabtai it is great prizes today! (go almost ... 1 15 80 5.333333 5 1 0 0 1 [] 0 1
2537 @seuj sardinia for a few days of pre-graduatio... 1 10 67 6.700000 4 0 1 0 0 [] 0 1
2458 and again http://twitpic.com/4wp8l 1 3 32 10.666667 2 0 0 0 0 [] 0 1
548 @cyphersushi no, i am afraid not.but! go here... 0 16 117 7.312500 7 0 1 0 0 [] 0 1
x
'hi, thanks to watching it. for more visit https://youtube.com/kgptalkie'
re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)
'hi, thanks to watching it. for more visit '
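For tweet-style text a much shorter pattern is usually enough: match the scheme, then everything up to the next whitespace. A minimal sketch:

```python
import re

URL_RE = re.compile(r'(?:https?|ftp|ssh)://\S+')

def strip_urls(text):
    # \S+ swallows the rest of the URL up to the next whitespace,
    # which is usually sufficient for tweets
    return URL_RE.sub('', str(text)).strip()

print(strip_urls('for more visit https://youtube.com/kgptalkie'))  # → 'for more visit'
```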
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
2784 @realadulttalk come on and smile for me? that ... 1 12 62 5.166667 4 0 1 0 1 [] 0 0
888 @richmiller oh man, i am really sorry i hope ... 0 17 67 3.941176 6 0 1 0 1 [] 0 0
190 im veryy bad 0 3 10 3.333333 0 0 0 0 0 [] 0 0
1090 @simplymallory you be naht online d: sighs i... 0 15 63 4.200000 6 0 1 0 2 [] 0 0
1553 just got sad, although sadly expected, news fr... 0 10 48 4.800000 4 0 0 0 0 [] 0 0

Remove RT

df[df['twitts'].str.contains('rt')]
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
4 @mandagoforth me bad! it is funny though. zach... 0 26 116 4.461538 13 0 2 0 0 [] 0 0
23 ut oh, i wonder if the ram on the desktop is s... 0 14 46 3.285714 7 0 0 0 2 [] 0 0
59 @paulmccourt dunno what sky you're looking at!... 0 15 80 5.333333 3 0 1 0 0 [] 0 0
75 im back home in belfast im realli tired thoug... 0 22 84 3.818182 9 0 0 0 1 [] 0 0
81 @lilmonkee987 i know what you mean... i feel s... 0 11 48 4.363636 5 0 1 0 0 [] 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3913 for the press so after she recovered she kille... 1 24 100 4.166667 1 0 0 0 0 [] 0 0
3919 earned her cpr &amp; first aid certifications! 1 7 40 5.714286 1 0 0 0 1 [] 0 0
3945 @teciav &quot;i look high, i look low, i look ... 1 23 106 4.608696 10 0 1 0 0 [] 0 0
3951 i am soo very parched. and hungry. oh and i am... 1 21 87 4.142857 7 0 0 2 1 [] 0 0
3986 @countroshculla yeah..needed to get up early..... 1 10 69 6.900000 4 0 1 0 0 [] 0 0

381 rows × 13 columns

x = 'rt @username: hello hirt'
re.sub(r'\brt\b', '', x).strip()
'@username: hello hirt'
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', '', x).strip())

Special Character and Punctuation Removal

df.sample(3)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
2205 eating food leaving school to go to hospital ... 1 12 45 3.750000 5 0 0 0 0 [] 0 0
812 @earthlifeshop i know! it makes it hard for th... 0 17 74 4.352941 5 0 1 0 1 [] 0 0
1443 cannot sleep! only 3 hours til i have to wake up 0 11 38 3.454545 6 0 0 1 0 [] 0 0
x = '@duyku apparently i was not ready enough... i...'
re.sub(r'[^\w ]+', "", x)
'duyku apparently i was not ready enough i'
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'[^\w ]+', "", x))
df.sample(5)
twitts sentiment word_counts char_counts avg_word_len stop_words_len hashtags_count mentions_count numerics_count upper_counts emails emails_count url_flags
2294 joshishollywood aw joshi would describe you ex... 1 9 55 6.111111 3 0 1 0 0 [] 0 0
3495 repressd i hate it when that happens errrr i m... 1 14 63 4.500000 4 0 1 0 2 [] 0 0
1678 but when you do have a camera less funny thing... 0 11 45 4.090909 6 0 0 0 0 [] 0 0
3702 uh do not wanna work but mondays are easy days... 1 13 49 3.769231 5 0 0 0 0 [] 0 0
3201 heromancer i will take shin 1 6 50 8.333333 1 0 1 0 0 [] 0 1

Remove Multiple Spaces

x =  'hi    hello     how are you'
' '.join(x.split())
'hi hello how are you'
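A regex does the same and also collapses tabs and newlines explicitly; a minimal sketch:

```python
import re

def squeeze_spaces(text):
    # \s+ matches any run of whitespace, including tabs and newlines
    return re.sub(r'\s+', ' ', str(text)).strip()

print(squeeze_spaces('hi    hello     how are you'))  # → 'hi hello how are you'
```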
df['twitts'] = df['twitts'].apply(lambda x: ' '.join(x.split()))

Remove HTML tags

!pip install beautifulsoup4
Requirement already satisfied: beautifulsoup4 in c:\users\chitr\appdata\local\programs\python\python36\lib\site-packages (4.9.3)
Requirement already satisfied: soupsieve>1.2 in c:\users\chitr\appdata\local\programs\python\python36\lib\site-packages (from beautifulsoup4) (2.2.1)


from bs4 import BeautifulSoup
x = '<html><h1> thanks for watching it </h1></html>'
x.replace('<html><h1>', '').replace('</h1></html>', '') # not recommended: brittle, only works for these exact tags
' thanks for watching it '
BeautifulSoup(x, 'lxml').get_text().strip()
---------------------------------------------------------------------------

FeatureNotFound                           Traceback (most recent call last)

<ipython-input-187-2e9db3c14738> in <module>
----> 1 BeautifulSoup(x, 'lxml').get_text().strip()


c:\users\chitr\appdata\local\programs\python\python36\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
    244                     "Couldn't find a tree builder with the features you "
    245                     "requested: %s. Do you need to install a parser library?"
--> 246                     % ",".join(features))
    247 
    248         # At this point either we have a TreeBuilder instance in


FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
%%time
# 'lxml' failed above because the lxml package is not installed
# (`pip install lxml`); the built-in 'html.parser' needs no extra dependency
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text().strip())
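Tag stripping can also be done with the standard library alone, with no parser dependency at all; a crude but dependency-free sketch (a real HTML parser is more robust on malformed markup):

```python
import re
from html import unescape

def strip_tags(text):
    # drop anything of the form <...>, then decode entities like &amp;
    return unescape(re.sub(r'<[^>]+>', '', str(text))).strip()

print(strip_tags('<html><h1> thanks for watching it </h1></html>'))  # → 'thanks for watching it'
```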

Remove Accented Chars

x = 'Áccěntěd těxt'
import unicodedata
def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x
remove_accented_chars(x)
'Accented text'
df['twitts'] = df['twitts'].apply(lambda x: remove_accented_chars(x))

Remove Stop Words

x = 'this is a stop words'
' '.join([t for t in x.split() if t not in stopwords])
df['twitts_no_stop'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in stopwords]))
df.sample(5)

Convert into base or root form of word

nlp = spacy.load('en_core_web_sm')
x = 'this is chocolates. what is times? this balls'
def make_to_base(x):
    x = str(x)
    x_list = []
    doc = nlp(x)
    
    for token in doc:
        lemma = token.lemma_
        # spaCy 2.x lemmatizes every pronoun to '-PRON-'; keep the original
        # token text instead (spaCy 3.x returns real pronoun lemmas, so the
        # check is harmless there)
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text

        x_list.append(lemma)
    return ' '.join(x_list)
make_to_base(x)
df['twitts'] = df['twitts'].apply(lambda x: make_to_base(x))
df.sample(5)

Common words removal

x = 'this is this okay bye'
text = ' '.join(df['twitts'])
len(text)
text = text.split()
len(text)
freq_comm = pd.Series(text).value_counts()
f20 = freq_comm[:20]
f20
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in f20]))
df.sample(5)
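The same frequency filtering can be sketched with the standard library's Counter on a toy corpus (not the tweet data):

```python
from collections import Counter

docs = ['this is this okay bye', 'this is fine']

# count token frequencies across the whole corpus
freq = Counter(' '.join(docs).split())
# the two most common tokens, analogous to f20 above
top2 = {w for w, _ in freq.most_common(2)}
# drop those tokens from every document
cleaned = [' '.join(t for t in d.split() if t not in top2) for d in docs]
print(top2, cleaned)
```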

Rare words removal

rare20 = freq_comm.tail(20)
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in rare20]))
df.sample(5)

Word Cloud Visualization

# !pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
text = ' '.join(df['twitts'])
len(text)
wc = WordCloud(width=800, height=400).generate(text)
plt.imshow(wc)
plt.axis('off')
plt.show()

Spelling Correction

!pip install -U textblob
!python -m textblob.download_corpora
from textblob import TextBlob
x = 'thankks forr waching it'
x = TextBlob(x).correct()
x

Tokenization using TextBlob

x = 'thanks#watching this video. please like it'
TextBlob(x).words
doc = nlp(x)
for token in doc:
    print(token)

Detecting Nouns

x = 'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
doc = nlp(x)
for noun in doc.noun_chunks:
    print(noun)

Language Translation and Detection

Language Code: https://www.loc.gov/standards/iso639-2/php/code_list.php

x
tb = TextBlob(x)
tb.detect_language()
tb.translate(to = 'zh')
Note: detect_language() and translate() relied on the Google Translate API and raise NotImplementedError from TextBlob 0.16 onward; use a dedicated translation library instead.

Use TextBlob's Inbuilt Sentiment Classifier

from textblob.sentiments import NaiveBayesAnalyzer
x = 'we all stand together. we are gonna win this fight'
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())
tb.sentiment
