glample / tagger

Named Entity Recognition Tool


parsing sentences from dataset

sbmaruf opened this issue

https://github.com/glample/tagger/blob/master/loader.py#L8
In this function, while reading sentences from the dataset, you ignore any sentence that starts with DOCSTART. I guess DOCSTART marks the start of a new document. Why are you ignoring the first sentence of a new document? Or have I misunderstood your code?

I ignored DOCSTART because this token is not really a part of the document, it is not a sentence inside of which you need to tag named entities. But you can remove this condition, it would not make any difference.

I understand that you are ignoring 'DOCSTART'. But why do you also ignore the sentence that follows DOCSTART?
Assume a dataset like the following (from the Dutch dataset):

-DOCSTART- -DOCSTART- O
De Art O
tekst N O
van Prep O
het Art O
arrest N O
is V O
nog Adv O
niet Adv O
schriftelijk Adj O
beschikbaar Adj O
maar Conj O
het Art O
bericht N O
werd V O
alvast Adv O
bekendgemaakt V O
door Prep O
een Art O
communicatiebureau N O
dat Conj O
Floralux N B-ORG
inhuurde V O
. Punc O

In Prep O
'81 Num O
regulariseert V O
de Art O
toenmalige Adj O
Vlaamse Adj B-MISC
regering N O
de Art O
toestand N O
met Prep O
een Art O
BPA N B-MISC
dat Pron O
het Art O
bedrijf N O
op Prep O
eigen Pron O
kosten N O
heeft V O
laten V O
opstellen V O
. Punc O

In this case your function 'load_sentences' would not read the sentence

"De tekst van het arrest is nog niet schriftelijk beschikbaar maar het bericht werd alvast bekendgemaakt door een communicatiebureau dat Floralux inhuurde."

Instead, it will start from the second sentence:

"In '81 regulariseert ..."

Is there any reason why you did this?
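
The behaviour described above can be reproduced with a small sketch. It assumes that the loader groups lines into blocks separated by empty lines and drops any block whose first token contains DOCSTART. The names `group_blocks` and `drop_docstart_blocks` are hypothetical and are not the functions in loader.py.

```python
def group_blocks(lines):
    """Split lines into blocks of token rows, separated by empty lines."""
    blocks, block = [], []
    for line in lines:
        line = line.rstrip()
        if not line:
            # An empty line closes the current block, if there is one.
            if block:
                blocks.append(block)
                block = []
        else:
            block.append(line.split())
    if block:
        blocks.append(block)
    return blocks


def drop_docstart_blocks(blocks):
    """Drop every block whose first token contains DOCSTART (assumed rule)."""
    return [b for b in blocks if "DOCSTART" not in b[0][0]]


# Shortened version of the Dutch example: no empty line after -DOCSTART-.
sample = [
    "-DOCSTART- -DOCSTART- O",
    "De Art O",
    "tekst N O",
    "",
    "In Prep O",
    "'81 Num O",
]
kept = drop_docstart_blocks(group_blocks(sample))
first_words = [[row[0] for row in block] for block in kept]
print(first_words)
```

Because -DOCSTART- and "De tekst ..." land in the same block, the whole block is dropped and only the "In '81 ..." sentence survives.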

Sorry for the delay. In practice I think there is an empty line after each DOCSTART symbol, so if you add an empty line before the first sentence, it will not be skipped. No?

Thanks glample for your reply.
I had assumed the same from the English dataset, but there is no empty line after -DOCSTART- in the Dutch dataset. I guess this should not change the results much, though.
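
One possible way to make the reader independent of whether an empty line follows -DOCSTART- is to treat the DOCSTART line itself as a sentence boundary. The following is only a sketch under that assumption. `read_sentences` is a hypothetical helper, not the repository's `load_sentences`, and it omits that function's other options.

```python
def read_sentences(lines):
    """Group token rows into sentences.

    Both empty lines and lines whose first token contains DOCSTART
    are treated as sentence boundaries (assumption of this sketch).
    """
    sentences, sentence = [], []
    for line in lines:
        line = line.rstrip()
        is_docstart = bool(line) and "DOCSTART" in line.split()[0]
        if not line or is_docstart:
            # Boundary: close the current sentence, discard the marker.
            if sentence:
                sentences.append(sentence)
                sentence = []
        else:
            sentence.append(line.split())
    if sentence:
        sentences.append(sentence)
    return sentences


# Dutch-style layout: no empty line after -DOCSTART-.
without_blank = [
    "-DOCSTART- -DOCSTART- O",
    "De Art O",
    "tekst N O",
    "",
    "In Prep O",
]
# English-style layout: an empty line after -DOCSTART-.
with_blank = [
    "-DOCSTART- -DOCSTART- O",
    "",
    "De Art O",
    "tekst N O",
    "",
    "In Prep O",
]

words_a = [[row[0] for row in s] for s in read_sentences(without_blank)]
words_b = [[row[0] for row in s] for s in read_sentences(with_blank)]
print(words_a == words_b)
```

With this rule, both layouts give the same sentences, and the sentence directly after -DOCSTART- is kept in either case.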