General terms associated with spaCy and NLP
- Text: The original word text.
- Lemma: The base form of the word.
- POS: The simple UPOS part-of-speech tag.
- Tag: The detailed part-of-speech tag.
- Dep: Syntactic dependency, i.e. the relation between tokens.
- Shape: The word shape – capitalization, punctuation, digits.
- is alpha: Does the token consist of alphabetic characters?
- is stop: Is the token part of a stop list, i.e. the most common words of the language?
- Root text: The original text of the word connecting the noun chunk to the rest of the parse.
- Root dep: Dependency relation connecting the root to its head.
- Root head text: The text of the root token’s head.
- Head text: The original text of the token head.
- Head POS: The part-of-speech tag of the token head.
- Children: The immediate syntactic dependents of the token.
- Grouping the column by Class Name to look at the count by category
- Grouping the column by Clothing ID to look at the count by Clothing ID
**# Count by Class Name**
Dresses 6319
Knits 4843
Blouses 3097
Sweaters 1428
Pants 1388
Name: Class Name, dtype: int64
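The call that produced the Class Name counts is not shown. A minimal sketch of one way to get such a table, assuming a pandas frame with a 'Class Name' column; the toy rows below are made up for illustration and are not the review data:

```python
import pandas as pd

# Toy stand-in for the review data (hypothetical rows); the real frame is
# assumed to have a 'Class Name' column, as used elsewhere in this post.
toy_data = pd.DataFrame({
    "Class Name": ["Dresses", "Knits", "Dresses", "Blouses", "Dresses", "Knits"],
})

# value_counts() sorts by frequency in descending order by default,
# so head(5) returns the five most common categories.
top_classes = toy_data["Class Name"].value_counts().head(5)
print(top_classes)
```

The same pattern applies to any categorical column, such as Clothing ID.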
**# Count by Clothing ID**
data_dress['Clothing ID'].value_counts()[0:5]
1078 1024
1094 756
1081 582
1110 480
1095 327
Name: Clothing ID, dtype: int64
- To analyze the reviews, filter the data to Class Name "Dresses" and Clothing ID 1078
data_dress = data[(data['Class Name']=='Dresses') & (data['Clothing ID']==1078)]
- This step is needed to explore the different spaCy functions. The nlp object now has a tokenizer, tagger, parser and entity recognizer in its pipeline, and we can use it to process a text and get all of those features.
dress_review = data_dress['Review Text'].str.cat(sep='\n')
doc = nlp(dress_review)
doc[1200:1600]
love it! will be easy to wear casually and work appropriate, too. the sale price was a huge bonus. I love this dress because its very playful and bouncy. it puts me in a light hearted mood when i wear it. i originally wanted to buy the grey color but my store only had the navy, so i tried it on. the navy is brighter and more colorful than it looks on line and the stripes are more varied in color than in the picture - so its quite appealing and vibrant. the lines of the dress are also quite flattering. all in all, its a fun dress! This dress is comfortable as well as flattering, which does not happen very often!
- The spaCy pipeline consists of three components (tagger, parser and ner), which are analyzed further below.
nlp.pipeline
[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f18bb185ba8>), ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f18bb05f7c8>), ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f18bb05f828>)]
##SPAN
dress_span = doc[0:20]
print(dress_span)
print(type(dress_span))
I really wanted this to work. alas, it had a strange fit for me. the straps would
<class 'spacy.tokens.span.Span'>
##SENTENCES
# Print the first 10 sentences of the document
for i, sent in enumerate(doc.sents, start=1):
    print(sent)
    if i == 10:
        break
I really wanted this to work.
alas, it had a strange fit for me.
the straps would not stay up, and it had a weird fit under the breast.
it worked standing up, but the minute i sat down it fell off my shoulders.
the fabric was beautiful!
and i loved that it had pockets.
I love cute summer dresses and this one, especially because it is made out of linen, is unique.
it is very well-made with a design that is quite flattering.
i am 5 foot 6 and a little curvy with a 38 c bust
and i got a size 10.
#To check if token at index 17 is the start of the sentence or not.
doc[17].is_sent_start
True
- We can also retrieve linguistic features such as noun chunks, part-of-speech tags, and dependency relations between the tokens in each sentence. To understand what tags such as token.pos_, token.tag_, or token.dep_ mean, we can use spacy.explain(), which looks a label up in spaCy's annotation glossary.
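As a small illustration of spacy.explain(), the sketch below looks up a few of the labels that appear in this post. The choice of labels is arbitrary; no trained model is loaded because the lookup only uses spaCy's built-in glossary:

```python
import spacy

# spacy.explain() looks labels up in spaCy's built-in glossary,
# so no trained pipeline is needed for this lookup.
labels = ["ADV", "VBD", "nsubj", "MONEY"]
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label:<6} {meaning}")
```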
- The entire text (doc) can be sliced with token indices to get single tokens or sequences of tokens (spans), and each token exposes attributes such as text, lemma, index, pos and tag.
for token in doc[0:10]:
    print(f'{token.text:{10}} {token.lemma_:{10}} {token.pos_:{6}} {token.dep_:{12}} {spacy.explain(token.tag_)}')
I          -PRON-     PRON   nsubj        pronoun, personal
really     really     ADV    advmod       adverb
wanted     want       VERB   ROOT         verb, past tense
this       this       DET    nsubj        determiner
to         to         PART   aux          infinitival "to"
- Converting the tokens into a pandas DataFrame with POS and lemma for each word
dress_frame = pd.DataFrame()
o = 0
for token in doc:
    dress_frame.loc[o, 'lemma'] = token.lemma_
    dress_frame.loc[o, 'pos'] = token.pos_
    dress_frame.loc[o, 'text'] = token.text
    # Note: this overwrites the lemma stored above, so the 'lemma' column
    # ends up holding the dependency label (token.dep_), as the output shows.
    dress_frame.loc[o, 'lemma'] = token.dep_
    o = o + 1
dress_frame[0:10]
    lemma   pos    text
0   nsubj   PRON   I
1   advmod  ADV    really
2   ROOT    VERB   wanted
3   nsubj   DET    this
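The loop above fills the frame one cell at a time and stores token.dep_ in the 'lemma' column. A possible alternative, sketched here with hand-written rows standing in for real tokens (the `token_frame` name and the separate 'dep' column are illustrative choices, not the author's), collects one record per token and builds the frame in a single step:

```python
import pandas as pd

# Hand-written stand-ins for (text, lemma, pos, dep) of the first tokens.
# With a real Doc these would come from token.text, token.lemma_,
# token.pos_ and token.dep_.
rows = [
    ("I", "-PRON-", "PRON", "nsubj"),
    ("really", "really", "ADV", "advmod"),
    ("wanted", "want", "VERB", "ROOT"),
    ("this", "this", "DET", "nsubj"),
    ("to", "to", "PART", "aux"),
]

# Build the frame in one step; 'dep' gets its own column so that
# 'lemma' keeps the lemma.
token_frame = pd.DataFrame(rows, columns=["text", "lemma", "pos", "dep"])
print(token_frame)
```

With a real document the rows could be gathered as `[(t.text, t.lemma_, t.pos_, t.dep_) for t in doc]`. Note that with this layout a filter on the dependency label would use the 'dep' column rather than 'lemma'.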
- Grouping the tokens by POS (part of speech)
group_dress = dress_frame.groupby(['pos']).agg({'text': 'count'})
group_dress['text'].sort_values(ascending=False)[0:15]
pos
NOUN     11003
DET       8329
PUNCT     8311
VERB      7634
PRON      6510
ADJ       6399
ADP       5133
AUX       4879
ADV       4787
CCONJ     3339
PART      1692
SCONJ     1361
SPACE     1327
NUM       1101
PROPN      822
## Getting the top 5 adjectives
group_dress_adj = dress_frame[dress_frame['pos']=='ADJ'].groupby(['text']).agg({'text': 'count'})
group_dress_adj['text'].sort_values(ascending=False)[0:5]
text
great          257
flattering     185
perfect        176
comfortable    162
small          149
Name: text, dtype: int64
- Looking at noun chunks in the document
# Similar to Doc.ents, Doc.noun_chunks is another Doc property.
# Noun chunks are "base noun phrases" – flat phrases that have a noun as their head.
# You can think of noun chunks as a noun plus the words describing the noun – for example,
# in Sheb Wooley's 1958 song, a "one-eyed, one-horned, flying, purple people-eater"
# would be one long noun chunk.
# https://spacy.io/usage/visualizers
i = 0
for chunk in doc.noun_chunks:
    i += 1
    print(chunk.text)
    if i == 15:
        break
I
it
a strange fit
me
the straps
it
a weird fit
the breast
it
i
it
my shoulders
the fabric
i
it
- Exploring the syntactic head and children of the same word across the document, e.g. "size" here
# Text: The original token text.
# Dep: The syntactic relation connecting the child to head.
# Head text: The original text of the token head.
# Head POS: The part-of-speech tag of the token head.
# Children: The immediate syntactic dependents of the token.
# https://spacy.io/usage/linguistic-features
i = 0
for token in doc:
    if token.text == "size":
        if i == 15:
            break
        else:
            i += 1
            print(f'{token.text:{14}} {token.head.text:{12}} {token.head.pos_:{10}} {[child for child in token.children]}')
size           got          VERB       [a, 10]
size           to           ADP        []
size           down         ADP        [a]
size           xs           PROPN      []
size           xl           PROPN      []
size           to           ADP        []
size           was          AUX        [a, petite]
size           looked       VERB       [neither]
size           runs         VERB       [so, i, would, down, framed]
size           purchased    VERB       [the, 4, ,, fit]
size           small        ADJ        []
size           needed       VERB       [to, versital, ,, cute]
size           ordered      VERB       [my, regular, ,, medium]
size           to           PART       []
size           returned     VERB       [my]
- Exploring the NER label MONEY
- This extracts all the entities labeled "MONEY" by the spaCy entity recognizer
for ent in doc.ents:
    if ent.label_ == 'MONEY':
        print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
102# - MONEY - Monetary values, including unit
135 - MONEY - Monetary values, including unit
two cents - MONEY - Monetary values, including unit
128# - MONEY - Monetary values, including unit
140# 34d - MONEY - Monetary values, including unit
120# - MONEY - Monetary values, including unit
140# 5'3 - MONEY - Monetary values, including unit
15 bucks - MONEY - Monetary values, including unit
158 - MONEY - Monetary values, including unit
120# - MONEY - Monetary values, including unit
over $50 - MONEY - Monetary values, including unit
5'3 - MONEY - Monetary values, including unit
110# - MONEY - Monetary values, including unit
168 - MONEY - Monetary values, including unit
49 - MONEY - Monetary values, including unit
79 - MONEY - Monetary values, including unit
39;fuzzy' - MONEY - Monetary values, including unit
over $250 - MONEY - Monetary values, including unit
#32b# - MONEY - Monetary values, including unit
120# max - MONEY - Monetary values, including unit
138 - MONEY - Monetary values, including unit
135# 36c - MONEY - Monetary values, including unit
- Retrieving and visualizing named entities is done very conveniently in spaCy.
displacy.render(doc[0:500], style='ent', jupyter=True, options={'distance': 110})
- I want to tag the word "dress" as a "PRODUCT". The code below finds the first occurrence of "dress" in the document and tags it.
for i, token in enumerate(doc):
    if token.text == 'dress':
        print(token.text)
        print(i)
        break
Output:
dress
140
from spacy.tokens import Span
new_ent = Span(doc, 140, 141, label='PRODUCT')
doc.ents = list(doc.ents) + [new_ent]
- Now the word "dress" is tagged and shows up when we filter the document's entities for "PRODUCT"
**dress - PRODUCT - Objects, vehicles, foods, etc. (not services)**
the s fit - PRODUCT - Objects, vehicles, foods, etc. (not services)
the s fit great - PRODUCT - Objects, vehicles, foods, etc. (not services)
p6 - PRODUCT - Objects, vehicles, foods, etc. (not services)
s - PRODUCT - Objects, vehicles, foods, etc. (not services)
a34b - PRODUCT - Objects, vehicles, foods, etc. (not services)
small/ - PRODUCT - Objects, vehicles, foods, etc. (not services)
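The snippet above tags a single occurrence of "dress" (token 140). A minimal sketch of tagging every occurrence follows, assuming a blank English pipeline and a made-up sentence so that it runs without a trained model. `filter_spans` from `spacy.util` is used because `doc.ents` must not contain overlapping spans:

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

# A blank English pipeline only tokenizes, so doc.ents starts empty here;
# with a trained model it would already hold the model's entities.
nlp = spacy.blank("en")
doc = nlp("the dress fit well. i love this dress!")

# One single-token span per occurrence of the word "dress".
product_spans = [
    Span(doc, token.i, token.i + 1, label="PRODUCT")
    for token in doc
    if token.lower_ == "dress"
]

# doc.ents must not contain overlapping spans; filter_spans keeps the
# longest span wherever spans overlap.
doc.ents = filter_spans(product_spans + list(doc.ents))

tagged = [(ent.text, ent.label_) for ent in doc.ents]
print(tagged)
```

On the full review document the same loop would run over `doc` as built earlier; whether the new spans or the model's existing entities should win on overlap is a design choice this sketch does not settle.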
- Stemming on adjectives
- Stemming is the process of reducing a word to its root form (its stem).
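# Note: the 'lemma' column of dress_frame holds token.dep_ (see how the frame was built),
# so this selects adjectives that act as adjectival complements (acomp).
dress_adj = dress_frame[(dress_frame['pos']=='ADJ') & (dress_frame['lemma']=='acomp')]
dress_adj[0:10]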
     lemma  pos  text
54   acomp  ADJ  beautiful
84   acomp  ADJ  unique
98   acomp  ADJ  flattering
127  acomp  ADJ  difficult
181  acomp  ADJ  nice
200  acomp  ADJ  true
229  acomp  ADJ  lovely
253  acomp  ADJ  perfect
287  acomp  ADJ  adorable
292  acomp  ADJ  flattering
The stem of the word beautiful is beauti
The stem of the word unique is uniqu
The stem of the word flattering is flatter
The stem of the word difficult is difficult
The stem of the word nice is nice
The stem of the word true is true
The stem of the word lovely is love
The stem of the word perfect is perfect
The stem of the word adorable is ador
The stem of the word worse is wors
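The code that produced these stems is not shown, and spaCy itself does not include a stemmer. The stems listed are consistent with a Porter stemmer, so the sketch below assumes NLTK's `PorterStemmer`; that choice is an assumption, not something the post confirms:

```python
from nltk.stem.porter import PorterStemmer

# Assumption: a Porter stemmer, which yields stems such as
# "beauti" and "flatter" like the ones listed above.
stemmer = PorterStemmer()

adjectives = ["beautiful", "unique", "flattering", "difficult", "nice"]
stems = {word: stemmer.stem(word) for word in adjectives}

for word, stem in stems.items():
    print(f"The stem of the word {word} is {stem}")
```

Because a stemmer only strips suffixes by rule, results such as "beauti" are not dictionary words.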
- Stemming on verbs
       lemma  pos   text
91     acomp  VERB  made
567    acomp  VERB  wearing
2526   acomp  VERB  made
4471   acomp  VERB  chested
4853   acomp  VERB  faded
6134   acomp  VERB  closed
8687   acomp  VERB  worried
9350   acomp  VERB  dressed
10675  acomp  VERB  looking
10781  acomp  VERB  endowed
The stem of the word made is made
The stem of the word wearing is wear
The stem of the word chested is chest
The stem of the word faded is fade
The stem of the word closed is close
The stem of the word worried is worri
The stem of the word dressed is dress
The stem of the word looking is look
The stem of the word endowed is endow
The stem of the word pictured is pictur
- In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words.
- The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'.
- Further, the lemma of 'meeting' might be 'meet' or 'meeting', depending on its use in a sentence.
- Lemmatization on adjectives
i = 0
for token in doc:
    if token.pos_ == 'ADJ' and token.dep_ == 'acomp':
        if i < 10:
            i += 1
            print(f'{token.text:{10}} {token.lemma_:{10}} {token.pos_:{6}} {token.dep_:{12}} {spacy.explain(token.tag_)}')
        else:
            break
beautiful **beautiful** ADJ acomp adjective
unique **unique** ADJ acomp adjective
flattering **flattering** ADJ acomp adjective
difficult **difficult** ADJ acomp adjective
nice **nice** ADJ acomp adjective
true **true** ADJ acomp adjective
lovely **lovely** ADJ acomp adjective
perfect **perfect** ADJ acomp adjective
adorable **adorable** ADJ acomp adjective
flattering **flattering** ADJ acomp adjective
- Lemmatization on verbs
i = 0
for token in doc:
    if token.pos_ == 'VERB' and token.dep_ == 'acomp':
        if i < 10:
            i += 1
            print(f'{token.text:{10}} {token.lemma_:{10}} {token.pos_:{6}} {token.dep_:{12}} {spacy.explain(token.tag_)}')
        else:
            break
made **make** VERB acomp verb, past participle
wearing **wear** VERB acomp verb, gerund or present participle
made **make** VERB acomp verb, past participle
chested **cheste** VERB acomp verb, past participle
faded **fade** VERB acomp verb, past participle
closed **close** VERB acomp verb, past participle
worried **worry** VERB acomp verb, past participle
dressed **dress** VERB acomp verb, past participle
looking **look** VERB acomp verb, gerund or present participle
endowed **endow** VERB acomp verb, past participle
- List of Default Stop Words
stop_words = nlp.Defaults.stop_words
stop_word = [i for i in stop_words]
stop_word[0:10]
['during',
'never',
'besides',
'thereafter',
'since',
'or',
'noone',
'rather',
'often',
'though']
- Check whether a word is a stop word
nlp.vocab['is'].is_stop
True
nlp.vocab['mystery'].is_stop
False
- Adding a stop word to the default list of stop words
#Adding a stop word
#Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')
# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True
- Removing a stop word from the default list of stop words
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('however')
# Remove the stop_word tag from the lexeme
nlp.vocab['however'].is_stop = False
nlp.vocab['however'].is_stop
False
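A typical use of the stop-word flag is to drop stop words from a text. The following sketch assumes a blank English pipeline (tokenizer plus the language's default stop-word list) and a made-up sentence, so it does not depend on the modifications made above:

```python
import spacy

# A blank English pipeline carries the language's default stop-word list
# and a tokenizer, which is all this example needs.
nlp = spacy.blank("en")
doc = nlp("this dress is a mystery")

# Keep only the tokens that are not flagged as stop words.
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)
```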
Two ways of matching text in spaCy are covered below:
- Matcher
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern1 = [{'LOWER': 'tight', 'OP': '+'}]
pattern2 = [{'LOWER': 'petites'}]
matcher.add('sizes', None, pattern1, pattern2)
matcher
i found the fit to be flattering -- fitted enough but not too loose or **tight**.
i do think the cut is on the trim side, but it isn't **tight** or fitted.
it is a bit looser on top (i'm 32c) and more form-fitting around the hips but not **tight** or clingy.
the end of the sleeves (where the buttons are) are very **tight** but
the arm holes were **tight**, but have very cute buttons if you look closely at the picture.
both fit well, but the sleeves in the large were quite **tight**.
when it arrived and i tried it on, it fit great on my arms, wasn't too tight on my neck, but once it went down over my chest (36c), the dress never came back "in" to show my feminine waist/shape.it is a perfect length and drapes well...not too **tight** at all.
ok, the arms were a little **tight**.
i also love how versatile it is--you can wear it as a dress as shown on the models, or as a tunic over **tight** jeans or jeggings/leggings.
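The loop that applies the matcher and prints the hits is not shown. A minimal sketch of how the matches could be collected, assuming a blank English pipeline, a made-up sentence, and the spaCy v3 form of `Matcher.add`, which takes the patterns as a list (the call above, with `None` as second argument, is the older v2 form):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; enough for LOWER patterns
matcher = Matcher(nlp.vocab)

pattern1 = [{"LOWER": "tight", "OP": "+"}]
pattern2 = [{"LOWER": "petites"}]
# Assumption: spaCy v3-style add(), which takes a list of patterns.
matcher.add("sizes", [pattern1, pattern2])

doc = nlp("ok, the arms were a little tight. petites fit better.")
matched = []
for match_id, start, end in matcher(doc):
    # Show a few tokens of context around each match.
    context = doc[max(start - 3, 0):min(end + 3, len(doc))]
    matched.append(doc[start:end].text)
    print(f"{doc[start:end].text!r} in: {context.text}")
```

Each match is a `(match_id, start, end)` triple of token indices, so `doc[start:end]` recovers the matched span.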
- PhraseMatcher
from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase = ['tight', 'extremely tight', 'too tight', 'wrong size']
phrase_patterns = [nlp(text) for text in phrase]
print(phrase_patterns)
phrase_matcher.add('rsk', None, *phrase_patterns)
the armholes fit perfectly though, if i had sized down they may have been **too tight**.
the slip was also slightly **tight** over hips.
i may have to try, though, as otherwise it's a great workable dress for summer--not **too tight** or revealing in the b
i may have to try, though, as otherwise it's a great workable dress for summer--not **too tight** or revealing in the b
for myself, the sleeves are **tight** and so is the fit across the back and shoulders.
one comment mentioned that the slip underneath was **tight** in the hips, yet as a size 30 in pants the slip was still fine.
This dress is beautiful but the bottom half was **too tight** for my shape.
This dress is beautiful but the bottom half was **too tight** for my shape.
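Some sentences appear twice in the output above, plausibly because both "tight" and "too tight" match in the same sentence. The sketch below reproduces that overlap on a made-up sentence and keeps only the longest match using `spacy.util.filter_spans`. It assumes a blank English pipeline and the spaCy v3 form of `PhraseMatcher.add`:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
phrase_matcher = PhraseMatcher(nlp.vocab)

phrases = ["tight", "extremely tight", "too tight", "wrong size"]
# make_doc() only tokenizes, which is all a phrase pattern needs.
patterns = [nlp.make_doc(text) for text in phrases]
# Assumption: spaCy v3-style add(), which takes the patterns as a list.
phrase_matcher.add("rsk", patterns)

doc = nlp("the bottom half was too tight for my shape.")
spans = [doc[start:end] for _, start, end in phrase_matcher(doc)]
all_matches = sorted(span.text for span in spans)

# Keep only the longest span where matches overlap.
longest = [span.text for span in filter_spans(spans)]
print(all_matches, longest)
```

Filtering this way would report each sentence once, at the cost of hiding the shorter match.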
- Default Segmentation rule.
print(nlp.pipe_names)
['tagger', 'parser', 'ner']
- Adding a custom segmentation rule.
- By default a new sentence ends with ".", but what if we also want a sentence to end at ",", as in the case of poems?
def set_custome_segmentations(doc):
    for token in doc[:-1]:
        if token.text == ",":
            doc[token.i+1].is_sent_start = True
    return doc
nlp.pipe_names
nlp.add_pipe(set_custome_segmentations, before='parser')
print(nlp.pipe_names)
['tagger', 'set_custome_segmentations', 'parser', 'ner']
- Before Segmentation Rule.
for sent in dec_ss.sents:
    print(sent)
**I really wanted this to work..
alas, it had a strange fit for me..
the straps would not stay up, and it had a weird fit under the breast.**
- After adding the custom segmentation rule.
- Now a new sentence also starts after each ","
doc_post_ss = nlp(ss)
for sent in doc_post_ss.sents:
    print(sent)
**I really wanted this to work..
alas,
it had a strange fit for me..
the straps would not stay up,
and it had a weird fit under the breast.**
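The rule itself can be tried in isolation. In this sketch the function is called directly on a tokenized document instead of being added to the pipeline, assuming a blank English pipeline so that the rule is the only source of sentence boundaries; the function name is spelled differently from the one above on purpose, to mark it as a separate illustration:

```python
import spacy

def set_custom_segmentation(doc):
    # Mark the token after each comma as the start of a new sentence.
    for token in doc[:-1]:
        if token.text == ",":
            doc[token.i + 1].is_sent_start = True
    return doc

# Assumption: a blank pipeline (tokenizer only), so no parser has set
# sentence boundaries and the function can be applied directly.
nlp = spacy.blank("en")
doc = set_custom_segmentation(nlp("alas, it had a strange fit for me"))

sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Note that `nlp.add_pipe(function, before='parser')` as used above is the spaCy v2 form. In spaCy v3, a custom component is registered under a name (for example with the `@Language.component` decorator) and added to the pipeline by that name.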