donglixp / coarse2fine

Annotate.py doesn't add ent field

stprior opened this issue

Hi,
The annotate script for WikiSQL doesn't seem to add an ent field for questions, which the TableDataset class seems to expect. I added one using a POS annotator with CoreNLPClient. However, in your paper I see: "We appended 10-dimensional part-of-speech tag vectors to embeddings of the question words in WikiSQL. The part-of-speech tags were obtained by the spaCy toolkit." Is this related to the ent field, and is there a different approach I should use to populate it?

Hi @stprior,

Please find the code as follows:

import spacy
import codecs
import json
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_lg')

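def anno_main(anno_path):
    # Output goes to a parallel 'annotated_ent' directory, which must already exist.
    with codecs.open(anno_path.replace('annotated', 'annotated_ent'), "w", "utf-8") as f_out:
        with codecs.open(anno_path, "r", "utf-8") as f_in:
            for line in f_in:
                js = json.loads(line)
                # Reuse the existing tokenization instead of re-tokenizing with spaCy.
                w_list = js['question']['gloss']
                # True where a token is followed by whitespace.
                ws_list = [it.isspace() for it in js['question']['after']]
                doc = Doc(nlp.vocab, words=w_list, spaces=ws_list)
                # Run each pipeline component on the pre-tokenized Doc.
                for name, proc in nlp.pipeline:
                    doc = proc(doc)
                # Despite its name, 'ent' holds spaCy's fine-grained POS tags (tk.tag_).
                js['question']['ent'] = [tk.tag_ for tk in doc]
                assert len(js['question']['ent']) == len(js['question']['words'])
                f_out.write(json.dumps(js))
                f_out.write('\n')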


for split in ('train','dev','test'):
    anno_main("data_path/WikiSQL/annotated/{}.jsonl".format(split))
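
To sanity-check the output before training, something like the sketch below might help. It is not part of the repository: `check_ent_field` is a hypothetical helper name, and it only assumes the record layout used by the script above (a `question` object with `words` and `ent` lists). It uses the standard library only, so it runs without spaCy.

```python
import json


def check_ent_field(lines):
    """Check that each JSONL record has one 'ent' tag per question word.

    `lines` is any iterable of JSON strings, for example an open file.
    Returns the number of records checked and raises ValueError on the
    first record that is missing 'ent' or has a length mismatch.
    (Hypothetical helper, not part of coarse2fine.)
    """
    count = 0
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        question = json.loads(line)["question"]
        if "ent" not in question:
            raise ValueError("record {}: missing 'ent' field".format(line_no))
        if len(question["ent"]) != len(question["words"]):
            raise ValueError("record {}: {} tags for {} words".format(
                line_no, len(question["ent"]), len(question["words"])))
        count += 1
    return count


# Tiny in-memory example; the tag values here are made up for illustration.
sample = [
    json.dumps({"question": {"words": ["how", "many"], "ent": ["WRB", "JJ"]}}),
]
print(check_ent_field(sample))
```

On real data you would pass an open file instead of the in-memory list, e.g. the `data_path/WikiSQL/annotated_ent/train.jsonl` file that the script above writes.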