explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Home Page: https://spacy.io

BUG: token.ent_iob_ is str not unicode

ELind77 opened this issue

spaCy version: 1.3.0
System: Ubuntu 14.04

Issue:
The value returned by token.ent_iob_ is a byte string (str), not unicode.

Code:
The above issue is reproducible with the following:

import spacy
nlp = spacy.load('en')

txt = u'''Lorem Ipsum is simply dummy text of the printing and typesetting industry.'''

doc = nlp(txt)
for tok in doc[:5]:
    print(type(tok.ent_iob_))

Results in:

<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>

Comments:
Pretty sure this is caused by this line in token.pyx.
Possible solutions are to change that line or to import unicode_literals from __future__ in that file. I'm not sure how the project handles strings internally, but having all modules use unicode_literals might not be a terrible idea.
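For illustration, here is a minimal sketch of the one-line fix. It assumes the line in question looks up an integer IOB code in a small tuple of single-letter literals; the names IOB_STRINGS and iob_to_text, and the 0-3 code range, are hypothetical stand-ins and not taken from token.pyx.

```python
# Hypothetical stand-in for the lookup in token.pyx; all names are illustrative.
# The u prefix makes each literal unicode on Python 2 and is a no-op on Python 3.
# Putting `from __future__ import unicode_literals` at the top of the module
# would have the same effect without touching each literal.
IOB_STRINGS = (u'', u'I', u'O', u'B')

# unicode on Python 2, str on Python 3
text_type = type(u'')


def iob_to_text(iob_code):
    """Return the text form of an integer IOB code (assumed range 0-3)."""
    return IOB_STRINGS[iob_code]


# Every value handed back is a text string, never a byte string.
for code in range(len(IOB_STRINGS)):
    assert isinstance(iob_to_text(code), text_type)
```

Prefixing the literals is the smaller change; unicode_literals protects any literal added to the module later.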

Just fixing the single line would be easy though. If I want to submit a PR as small as this, do I need to run a bunch of tests, or can I just put u in front of each of those letters? That said, adding some kind of automated test builder to ensure that all properties and return values respect the contracts in the documentation might not be a bad idea. Alternatively, from what little I know about Cython, maybe the properties could get type declarations that would be enforced by the compiler?
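A rough sketch of what such an automated check could look like is below. The helper non_text_attrs and the FakeToken class are hypothetical and exist only so the example runs without spaCy; in a real test the object would be a token from nlp(...) and the attribute list would come from the documentation.

```python
# unicode on Python 2, str on Python 3
text_type = type(u'')


def non_text_attrs(obj, attr_names):
    """Return the names of attributes on obj whose value is not a text string."""
    return [name for name in attr_names
            if not isinstance(getattr(obj, name), text_type)]


class FakeToken(object):
    """Hypothetical stand-in for a token; not part of spaCy."""
    orth_ = u'Lorem'
    # A byte string, mimicking the value described in this report.
    ent_iob_ = b'O'


# The check flags only the attribute that breaks the "returns unicode" contract.
offenders = non_text_attrs(FakeToken(), ['orth_', 'ent_iob_'])
print(offenders)
```

A test suite could run this over every string-valued attribute and fail whenever the returned list is non-empty.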

Follow-up question: is there a page with instructions for contributing?

-- Eric

Thanks for the report!

All modules should definitely have unicode_literals. Good suggestions re the testing, which currently needs to be refactored and improved. I don't know how to add a type declaration to a property in Cython, though. You can only specify return types for cdef and cpdef functions, I believe.

You can find the contribution guidelines here. Thanks again!

Should be fixed now.
