Annotate.py doesn't add ent field
stprior opened this issue
Stephen Prior commented
Hi,
The annotate script for WikiSQL doesn't seem to add an ent field for questions, which the TableDataset class seems to expect. I added one using a POS annotator with CoreNLPClient. However, your paper says: "We appended 10-dimensional part-of-speech tag vectors to embeddings of the question words in WikiSQL. The part-of-speech tags were obtained by the spaCy toolkit." Is this related to the ent field, and is there a different approach I should use to populate it?
Li Dong commented
Hi @stprior,
Here is the code:
import spacy
import codecs
import json
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_lg')


def anno_main(anno_path):
    # Write the result next to the input, under 'annotated_ent' instead of 'annotated'.
    with codecs.open(anno_path.replace('annotated', 'annotated_ent'), "w", "utf-8") as f_out:
        with codecs.open(anno_path, "r", "utf-8") as f_in:
            for line in f_in:
                js = json.loads(line)
                # Reuse the existing tokenization so the tags align with the question words.
                w_list = js['question']['gloss']
                # True where a token is followed by whitespace.
                ws_list = [it.isspace() for it in js['question']['after']]
                doc = Doc(nlp.vocab, words=w_list, spaces=ws_list)
                # Run every pipeline component on the pre-tokenized doc.
                for name, proc in nlp.pipeline:
                    doc = proc(doc)
                # The 'ent' field holds the fine-grained part-of-speech tag of each token.
                js['question']['ent'] = [tk.tag_ for tk in doc]
                assert(len(js['question']['ent']) == len(js['question']['words']))
                f_out.write(json.dumps(js))
                f_out.write('\n')


for split in ('train', 'dev', 'test'):
    anno_main("data_path/WikiSQL/annotated/{}.jsonl".format(split))
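After running the script, it may be worth sanity-checking the output before training. The following is a minimal sketch, not part of the repository: the helper names `spaces_from_after` and `check_ent` are hypothetical, and the sample record is made up to mirror the fields used in the snippet above (`gloss`, `words`, `after`, `ent`). It only needs the standard library, so it can be run without spaCy.

```python
import json


def spaces_from_after(after):
    """Turn each token's trailing string into the boolean flag passed as `spaces`."""
    # An empty string means "no whitespace follows this token".
    return [s.isspace() for s in after]


def check_ent(record):
    """Return True if the question carries exactly one 'ent' tag per word."""
    q = record["question"]
    return "ent" in q and len(q["ent"]) == len(q["words"])


# Hypothetical record, shaped like one line of an annotated_ent/*.jsonl file.
# The tag values are illustrative, not actual spaCy output.
line = json.dumps({
    "question": {
        "gloss": ["How", "many", "teams", "?"],
        "words": ["how", "many", "teams", "?"],
        "after": [" ", " ", "", ""],
        "ent": ["WRB", "JJ", "NNS", "."],
    }
})

record = json.loads(line)
spaces = spaces_from_after(record["question"]["after"])
ok = check_ent(record)
```

Applying `check_ent` to every line of the generated files should catch records where the field is missing or misaligned, which is the condition the `assert` in the script guards against.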