glample / tagger

Named Entity Recognition Tool

ValueError: max() arg is an empty sequence

victoriastuart opened this issue:

Two issues:

  1. Others (e.g. issues #20, #41) asked what a 'tokenized sentence' is; that puzzled me too.
    Answer: any plain sentence, with words separated by whitespace, counts as 'tokenized'; e.g.

    Victoria was born in 1961 in Halifax, Nova Scotia, Canada.

  2. If your input file contains blank lines, e.g.

    Victoria was born in 1961 in Halifax, Nova Scotia, Canada.
    
    Victoria used to work at NIEHS in North Carolina.
    

then tagger.py (through utils.py) throws an error:

...
    max_length = max([len(word) for word in words])
ValueError: max() arg is an empty sequence
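
The failure can be reproduced in isolation. The sketch below assumes, based on the `words = line.rstrip().split()` line in tagger.py and the `max(...)` line in the traceback, that each input line is split on whitespace and the resulting word list reaches `max()` unchanged. The helper name `longest_word_length` is hypothetical and not part of the tagger.

```python
def longest_word_length(line):
    """Hypothetical helper mirroring the two lines involved in the traceback."""
    # Same whitespace split that tagger.py applies to every input line.
    words = line.rstrip().split()
    # Same expression as in the traceback; it fails when `words` is empty.
    return max([len(word) for word in words])


sentence = "Victoria was born in 1961 in Halifax, Nova Scotia, Canada.\n"
# Punctuation stays attached to the neighbouring word, e.g. 'Halifax,'.
print(sentence.rstrip().split())
print(longest_word_length(sentence))

try:
    # A blank line splits into an empty list, so max() has nothing to compare.
    longest_word_length("\n")
except ValueError as err:
    print("ValueError:", err)
```

The exact wording of the exception message can differ between Python versions; the exception type is `ValueError` in each case.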

You can solve that simply by changing the following lines in tagger.py:

Original:

print 'Tagging...'
with codecs.open(opts.input, 'r', 'utf-8') as f_input:
    count = 0
    for line in f_input:
        words = line.rstrip().split()

Modified:

print 'Tagging...'
with codecs.open(opts.input, 'r', 'utf-8') as f_input:
    count = 0
    for line in f_input:
        if len(line) <= 1:
            line = ''
        words = line.rstrip().split()

Added lines:

        if len(line) <= 1:
            line = ''
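
A slightly more defensive variant is sketched below. It assumes that a line containing only spaces or tabs should be treated like a blank line too, which the `len(line) <= 1` test does not cover. `split_words` is a hypothetical helper, not a function from the tagger; it checks the split result rather than the raw line length.

```python
def split_words(line):
    """Hypothetical helper: return the whitespace-separated words of a line.

    Returns an empty list for blank or whitespace-only lines, so the caller
    can skip them before anything calls max() on the word list.
    """
    return line.split()


lines = [
    "Victoria was born in 1961 in Halifax, Nova Scotia, Canada.\n",
    "\n",        # blank line
    "   \t\n",   # whitespace-only line, longer than one character
    "Victoria used to work at NIEHS in North Carolina.\n",
]

for line in lines:
    words = split_words(line)
    if not words:
        # Nothing to tag on this line; skip it.
        continue
    print(len(words), max(len(word) for word in words))
```

How this would plug into the loop in tagger.py depends on code not shown here, for example whether the output has to keep one line per input line.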

@victoriastuart Thanks a lot, you just saved me a lot of time!

Hi @victoriastuart @nkruglikov, I am new to Python. Can you please help me train the model using the GoogleNews word embeddings? I am trying to train with this command:

python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method=adam --tag_scheme=iob --pre_emb=GoogleNews-vectors-negative300.bin --all_emb=300

I got this error:
(screenshot of the error; image not available)

I have been stuck on this issue for about 2 months and haven't been able to resolve it. Thanks in advance.