Tokenization of URLs needs work
rlvoyer opened this issue
In [2]: from spacy.en import English

In [3]: nlp = English()
In [4]: doc = nlp("Do you agree that this is a URL: http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0")
In [5]: [s.lemma_.lower() for s in doc if not s.like_url]
Out[5]:
['do',
'you',
'agree',
'that',
'this',
'be',
'a',
'url',
':',
'-',
'york-primary-preview.html?hp&action=click&pgtype=homepage&clicksource=story-heading&module=a-lede-package-region&region=top-news&wt.nav=top-news&_r=0']
In fact, this might be better characterized as a tokenization problem: the URL is split into several pieces, and only some of those pieces are flagged by `like_url` (so I'm going to rename the issue).
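Until the tokenizer handles this, one possible workaround is to pre-extract URL spans with a regex before whitespace tokenization, so the URL survives as a single token. A minimal sketch, independent of spaCy (the regex here is deliberately simplified and is not spaCy's actual `like_url` heuristic; `split_with_urls` is a hypothetical helper, not a spaCy API):

```python
import re

# Simplified, illustrative URL pattern: scheme followed by any non-whitespace.
URL_RE = re.compile(r"https?://\S+")

def split_with_urls(text):
    """Split text on whitespace, but keep anything matching URL_RE whole."""
    tokens = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Tokenize the non-URL stretch before this match normally.
        tokens.extend(text[pos:match.start()].split())
        # Keep the entire URL as one token.
        tokens.append(match.group())
        pos = match.end()
    # Tokenize whatever follows the last URL.
    tokens.extend(text[pos:].split())
    return tokens

text = ("Do you agree that this is a URL: "
        "http://www.nytimes.com/2016/04/20/us/politics/"
        "new-york-primary-preview.html?hp&action=click")
tokens = split_with_urls(text)
```

Here the full query string, hyphens included, stays attached to the URL token, which can then be filtered out in one piece.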