explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

Tokenization of URLs needs work

rlvoyer opened this issue:

In [2]: from spacy.en import English  # import needed for the English() call below

In [3]: nlp = English()

In [4]: doc = nlp("Do you agree that this is a URL: http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region&region=top-news&WT.nav=top-news&_r=0")

In [5]: [s.lemma_.lower() for s in doc if not s.like_url]
Out[5]:
['do',
 'you',
 'agree',
 'that',
 'this',
 'be',
 'a',
 'url',
 ':',
 '-',
 'york-primary-preview.html?hp&action=click&pgtype=homepage&clicksource=story-heading&module=a-lede-package-region&region=top-news&wt.nav=top-news&_r=0']
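
For reference, the comprehension above drops only tokens whose like_url flag is set, so the output shows that the tokenizer split the URL at the hyphens and flagged only the leading chunk as URL-like. A small inspection script makes this easy to check on any version; it assumes the current spacy.lang.en import path (the report above used the older spacy.en path), and on recent releases the URL may already come through as a single token.

# Inspection sketch; assumes a recent spaCy where the blank English
# pipeline lives in spacy.lang.en. like_url is a lexical flag, so no
# trained model is needed.
from spacy.lang.en import English

nlp = English()
doc = nlp("Do you agree that this is a URL: "
          "http://www.nytimes.com/2016/04/20/us/politics/"
          "new-york-primary-preview.html?hp&_r=0")  # query string shortened here

# Print each token with its like_url flag to see where the URL is split
# and which pieces are still treated as URL-like.
for token in doc:
    print(repr(token.text), token.like_url)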

In fact, the problem here might be better characterized as a tokenization problem (so I'm going to rename the issue).
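
Later spaCy releases (2.3 and up) expose a url_match hook on the tokenizer and ship a built-in URL pattern, which is one way to keep a whole http(s) URL together before the infix rules ever see its hyphens. The sketch below relies on that newer API, which was not available at the time of this report, and the regex is illustrative rather than spaCy's own URL pattern.

# Workaround sketch; assumes spaCy >= 2.3, where Tokenizer.url_match
# exists and can be reassigned.
import re
from spacy.lang.en import English

nlp = English()

# Treat any http:// or https:// run up to the next whitespace as a single
# token, so hyphens and query strings inside the URL never reach the
# infix rules.
nlp.tokenizer.url_match = re.compile(r"https?://\S+").match

doc = nlp("See http://www.nytimes.com/2016/04/20/us/politics/"
          "new-york-primary-preview.html?hp&_r=0 for the preview.")
print([t.text for t in doc if t.like_url])

With the URL kept as one token, the like_url filter from the original snippet removes it cleanly instead of leaving the trailing fragment behind.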

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.