explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

Tokenization of URLs needs work

rlvoyer opened this issue:

In [2]: from spacy.en import English  # import needed for the English() call below

In [3]: nlp = English()

In [4]: doc = nlp("Do you agree that this is a URL: http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0")

In [5]: [s.lemma_.lower() for s in doc if not s.like_url]
Out[5]:
['do',
 'you',
 'agree',
 'that',
 'this',
 'be',
 'a',
 'url',
 ':',
 '-',
 'york-primary-preview.html?hp&action=click&pgtype=homepage&clicksource=story-heading&module=a-lede-package-region&region=top-news&wt.nav=top-news&_r=0']
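
For reference, the comprehension above drops only tokens whose like_url flag is set, so the output shows that the tokenizer split the URL at the hyphens and flagged only the leading chunk as URL-like. A small inspection script makes this easy to check on any version; it assumes the current spacy.lang.en import path (the report above used the older spacy.en path), and on recent releases the URL may already come through as a single token.

# Inspection sketch; assumes a recent spaCy where the blank English
# pipeline lives in spacy.lang.en. like_url is a lexical flag, so no
# trained model is needed.
from spacy.lang.en import English

nlp = English()
doc = nlp("Do you agree that this is a URL: "
          "http://www.nytimes.com/2016/04/20/us/politics/"
          "new-york-primary-preview.html?hp&_r=0")  # query string shortened here

# Print each token with its like_url flag to see where the URL is split
# and which pieces are still treated as URL-like.
for token in doc:
    print(repr(token.text), token.like_url)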

In fact, the problem here might be better characterized as a tokenization problem (so I'm going to rename the issue).
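
Later spaCy releases (2.3 and up) expose a url_match hook on the tokenizer and ship a built-in URL pattern, which is one way to keep a whole http(s) URL together before the infix rules ever see its hyphens. The sketch below relies on that newer API, which was not available at the time of this report, and the regex is illustrative rather than spaCy's own URL pattern.

# Workaround sketch; assumes spaCy >= 2.3, where Tokenizer.url_match
# exists and can be reassigned.
import re
from spacy.lang.en import English

nlp = English()

# Treat any http:// or https:// run up to the next whitespace as a single
# token, so hyphens and query strings inside the URL never reach the
# infix rules.
nlp.tokenizer.url_match = re.compile(r"https?://\S+").match

doc = nlp("See http://www.nytimes.com/2016/04/20/us/politics/"
          "new-york-primary-preview.html?hp&_r=0 for the preview.")
print([t.text for t in doc if t.like_url])

With the URL kept as one token, the like_url filter from the original snippet removes it cleanly instead of leaving the trailing fragment behind.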

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.