glample / tagger

Named Entity Recognition Tool

Error In Evaluation

mhbashari opened this issue · comments

Hi!

Running the tagger on my data causes this error (after the first epoch):

Traceback (most recent call last):
  File "./train.py", line 220, in <module>
    dev_data, id_to_tag, dico_tags)
  File "/home/tagger/utils.py", line 282, in evaluate
    return float(eval_lines[1].strip().split()[-1])
IndexError: list index out of range

The data has this form:

<sent0_unicode_word><space><iob_tag>
<sent0_unicode_word><space><iob_tag>
<sent0_unicode_word><space><iob_tag>

<sent1_unicode_word><space><iob_tag>
<sent1_unicode_word><space><iob_tag>
<sent1_unicode_word><space><iob_tag>

The IOB tags are drawn from the set {B-PER, I-PER, O}, and the data is validated by this script:

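def is_valid(conll):
    """Return True if every non-blank line ends with an allowed IOB tag."""
    for line in conll:
        if line.strip():  # blank lines separate sentences
            spl = line.strip().split()
            if spl[-1] not in ["B-PER", "I-PER", "O"]:
                return False
    return True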

Could you help me find out where and why this exception is raised?

Hi,

If training gets through the first epoch, your data is probably in the right format. The error comes from this line:
float(eval_lines[1].strip().split()[-1])

Here, eval_lines are the lines extracted from the output of the evaluation script. The Python code calls an external Perl script to evaluate the sentences and stores the result in a file. Maybe this file was not created properly. Can you check whether there is anything in the evaluation folder of your experiment? This kind of issue can happen when the Python code tries to write into a folder where it doesn't have permission.
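
To make this failure easier to diagnose, the parsing step could check the scores file before indexing into it. The following is only a sketch, not the code in utils.py: `read_fb1` and its `scores_path` argument are hypothetical names, and it assumes (as in the line quoted above) that the overall score is the last field on the second line of the evaluation output.

```python
import io
import os


def read_fb1(scores_path):
    """Return the last field on line 2 of the scores file as a float.

    Raises a descriptive error instead of a bare IndexError when the
    evaluation script produced no usable output.
    """
    if not os.path.isfile(scores_path):
        raise IOError("evaluation output not found: %s" % scores_path)
    with io.open(scores_path, "r", encoding="utf8") as f:
        eval_lines = [line.rstrip() for line in f]
    # An empty or one-line file means the external script did not finish.
    if len(eval_lines) < 2 or not eval_lines[1].split():
        raise ValueError(
            "evaluation output in %s is empty or truncated (%d lines); "
            "the external script probably did not run"
            % (scores_path, len(eval_lines))
        )
    return float(eval_lines[1].strip().split()[-1])
```

With a check like this, an empty scores file would be reported as such rather than as "list index out of range".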

Hi glample,
I'm getting much the same error, but right in the first epoch, and I cannot fix it.
Traceback (most recent call last):
  File "./train.py", line 222, in <module>
    test_data, id_to_tag, dico_tags)
  File "/home/vuong/Documents/thang/tagger/utils.py", line 282, in evaluate
    return float(eval_lines[1].strip().split()[-1])
IndexError: list index out of range
I also get:
WARNING (theano.tensor.blas): We did not found a dynamic library into the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.
unexpected number of features: 5 (3)
Can you please tell me how to fix this error?
Thank you!
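
The message "unexpected number of features: 5 (3)" seems to come from the evaluation script and suggests that at least one line had 5 whitespace-separated fields where 3 were expected, which could happen if a token in the input contains whitespace. A minimal sketch for locating such lines in a two-column input file follows; `find_irregular_lines` and `expected_fields` are hypothetical names used only for illustration.

```python
def find_irregular_lines(lines, expected_fields=2):
    """Return (line_number, field_count) for lines with an unexpected field count.

    Blank lines are treated as sentence separators and skipped.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if fields and len(fields) != expected_fields:
            bad.append((number, len(fields)))
    return bad
```

Calling it on the lines of a training file (for example `find_irregular_lines(open(path))`, where `path` is your own data file) lists the lines worth inspecting by hand.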

Hi Glample,
It works with the data you provided, but when I run it on my own data I get the same error as detuvoldo. Can you help us, please?
Thank you.

Hi @svensy @detuvoldo, were you able to solve that error? Can you please guide me? I am stuck.

@Rabia-Noureen can you post your data format here? A few examples would help us help you.

@detuvoldo thanks for your response. I am using the dataset provided by @glample; the link is below:
https://github.com/glample/tagger/tree/master/dataset

I am using 64-bit Windows 10 with Python 2.7. When I tried to train the model I got an error:

(env_name27) C:\Users\Acer\tagger-master>python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GT 620M (CNMeM is enabled with initial size: 85.0% of memory, cuDNN not available)
Model location: ./models
Found 23624 unique words (203621 in total)
Found 84 unique characters
Found 17 unique named entity tags
14041 / 3250 / 3453 sentences in train / dev / test.
Saving the mappings to disk...
Compiling...
Starting epoch 0...
50, cost average: 15.406189
100, cost average: 11.704297
150, cost average: 10.767459
200, cost average: 13.812738
250, cost average: 11.460194
300, cost average: 13.207466
350, cost average: 12.146099
400, cost average: 12.428576
450, cost average: 10.977689
500, cost average: 12.830771
550, cost average: 10.062991
600, cost average: 9.834551
650, cost average: 11.481623
700, cost average: 9.460655
750, cost average: 9.907359
800, cost average: 10.251657
850, cost average: 10.405848
900, cost average: 14.113665
950, cost average: 10.436158
'.' is not recognized as an internal or external command,
operable program or batch file.
ID NE Total O S-LOC B-PER E-PER S-ORG S-MISC B-ORG E-ORG S-PER I-ORG B-LOC E-LOC B-MISC E-MISC I-MISC I-PER I-LOC Percent
0 O 42759 42759 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100.000
1 S-LOC 1603 1603 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
2 B-PER 1234 1234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
3 E-PER 1234 1234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
4 S-ORG 891 891 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
5 S-MISC 665 665 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
6 B-ORG 450 450 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
7 E-ORG 450 450 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
8 S-PER 608 608 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
9 I-ORG 301 301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
10 B-LOC 234 234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
11 E-LOC 234 234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
12 B-MISC 257 257 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
13 E-MISC 257 257 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
14 I-MISC 89 89 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
15 I-PER 73 73 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
16 I-LOC 23 23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
42759/51362 (83.25026%)
Traceback (most recent call last):
  File "train.py", line 220, in <module>
    dev_data, id_to_tag, dico_tags)
  File "C:\Users\Acer\tagger-master\utils.py", line 282, in evaluate
    return float(eval_lines[1].strip().split()[-1])
IndexError: list index out of range

Am I doing something wrong? I have been stuck on this issue for about 2 months and couldn't resolve it. Thanks in advance.
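
The line "'.' is not recognized as an internal or external command" in the log above is what the Windows command prompt prints when asked to run a command that starts with "./", so the evaluation script most likely never ran and its output file stayed empty. One possible workaround, sketched here under assumptions, is to call the Perl interpreter explicitly and bind the input and output files directly instead of going through the shell. The function name, its arguments, and the requirement that a `perl` executable is on the PATH are all assumptions for illustration, not the repository's code.

```python
import subprocess


def run_eval_script(script_path, input_path, output_path, perl="perl"):
    """Run a Perl evaluation script with stdin/stdout bound to files.

    Passing an argument list (no shell) avoids the shell's handling of
    "./" and of the "<" and ">" redirections. Returns the exit code.
    """
    cmd = [perl, script_path]
    with open(input_path, "rb") as fin, open(output_path, "wb") as fout:
        return subprocess.call(cmd, stdin=fin, stdout=fout)
```

A non-zero return value, or an empty output file afterwards, would indicate that the evaluation itself failed rather than the training.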

According to the text you provided, I think there are errors in the dataset. You should check the dataset carefully. Maybe there is an unnecessary "." somewhere.

Additionally, Theano will no longer be supported, so you should consider switching to another framework.

@detuvoldo so should I try to run the script on CPU instead of GPU? I ask because Theano is listed as a requirement for the NER tagger in the README. Also, could you please link to another dataset that follows the required format? I am new to Python, so I don't know much about this.
Thanks

https://github.com/detuvoldo/tagger/tree/master/lstm/fold1

You can look here to see the correct format.

Okay, thanks a lot. I will use the dataset you provided; I hope it solves the issue. I also want to train the model using the GoogleNews word embeddings, with this command:

python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method=adam --tag_scheme=iob --pre_emb=GoogleNews-vectors-negative300.bin --all_emb=300

It's a .bin file. Is that fine?

I think you don't need the "=", just a space.

Oh, I got it. Thanks for your help, sir.

@detuvoldo sorry for disturbing you again. I tried running your dataset and script; that fixed the problem with the dataset, but the error is still there. I replaced two lines in utils.py because I am using Windows 10 and there were some path-related issues.

[image]

The error is
run train.py --train lstm/fold1/train --dev lstm/fold1/dev --test lstm/fold1/test
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GT 620M (CNMeM is enabled with initial size: 85.0% of memory, cuDNN not available)
Model location: \?\E:\New-Code\tagger-master\tagger-master\models\tag_scheme=iob,lower=False,zeros=False,char_dim=25,char_lstm_dim=25,char_bidirect=True,word_dim=100,word_lstm_dim=100,word_bidirect=True,pre_emb=,all_emb=False,cap_dim=0,crf=True,dropout=0.3,lr_method=sgd-lr_.005
Found 2573 unique words (48986 in total)
Found 64 unique characters
Found 27 unique named entity tags
858 / 289 / 286 sentences in train / dev / test.
Saving the mappings to disk...
Compiling...
Starting epoch 0...
50, cost average: 101.645935
100, cost average: 83.234520
150, cost average: 82.757523
200, cost average: 69.019493
250, cost average: 64.411346
300, cost average: 62.836563
350, cost average: 60.969635
400, cost average: 58.851826
450, cost average: 49.994457
ID NE Total O I-LOC B-CTT B-OBJ B-LOC B-ACR B-INT B-PRC I-FACE I-PRC I-ACR I-OBJ B-FNUM I-FNUM I-DDIR B-FACE I-BEDNUM I-CTT B-DDIR I-INT B-BEDNUM B-BATHNUM I-BATHNUM I-FPOS B-FPOS I-BDIR B-BDIR Percent
0 O 9314 9175 0 63 14 0 0 62 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 98.508
1 I-LOC 2604 2602 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
2 B-CTT 478 245 0 233 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48.745
3 B-OBJ 464 282 0 0 177 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38.147
4 B-LOC 439 439 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
5 B-ACR 346 334 0 1 1 0 7 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.023
6 B-INT 339 126 0 0 32 0 0 181 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 53.392
7 B-PRC 233 232 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
8 I-FACE 218 218 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
9 I-PRC 232 225 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
10 I-ACR 214 203 0 0 2 0 1 0 0 0 0 7 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.271
11 I-OBJ 201 198 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
12 B-FNUM 170 156 0 0 5 0 0 8 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
13 I-FNUM 166 157 0 0 8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
14 I-DDIR 170 169 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
15 B-FACE 120 120 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
16I-BEDNUM 103 103 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
17 I-CTT 103 98 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
18 B-DDIR 83 83 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
19 I-INT 57 56 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
20 B-BEDNUM 57 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
21 B-BATHNUM 44 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
22 I-BATHNUM 45 44 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
23 I-FPOS 42 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
24 B-FPOS 37 36 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
25 I-BDIR 22 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
26 B-BDIR 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000
9780/16307 (59.97424%)
Traceback (most recent call last):
  File "E:\New-Code\tagger-master\tagger-master\train.py", line 221, in <module>
    dev_data, id_to_tag, dico_tags, epoch)
  File "utils.py", line 284, in evaluate
    return float(eval_lines[1].strip().split()[-1])
IndexError: list index out of range

Can you please suggest something that can help me solve the error?
Thanks in advance

@detuvoldo sorry to disturb you again, but I am still waiting for your response. Please reply if you can help me with this. Thanks