Questions about the dataset format and its applications

urcllr opened this issue 6 years ago

I looked at the first few lines of the downloaded raw and preprocessed files, and the fields appear to be roughly as follows:
RAW
{
question,question_type?,fact_or_option?,question_id,
documents[{title,bs_rank_pos?,is_selected?,paragraphs[]}],
entity_answers[],
answers[]
}

PREPROCESSED
{
question,question_type,fact_or_option,question_id,
documents[{title,bs_rank_pos,is_selected,paragraphs[],+most_related_para?,+segmented_title[word-segmented title],+segmented_paragraphs[word-segmented paragraphs]}],
answers[],
+answer_spans[],?
+fake_answers[],?
+segmented_answers[word-segmented answers],
+answer_docs[],?
+segmented_question[word-segmented question],
+match_scores[]?,
+yesno_type?
}
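
A field summary like the one above can be reproduced mechanically. The following is a minimal sketch, assuming each file stores one JSON object per line (which the "first few lines" inspection suggests but this thread does not confirm). The helper name `collect_keys` and the in-memory sample records are hypothetical and only illustrate the comparison.

```python
import json
from typing import Iterable, Set, Tuple


def collect_keys(lines: Iterable[str], limit: int = 100) -> Tuple[Set[str], Set[str]]:
    """Return (top-level keys, per-document keys) seen in the first `limit` records."""
    top_keys: Set[str] = set()
    doc_keys: Set[str] = set()
    seen = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if seen >= limit:
            break
        record = json.loads(line)
        top_keys.update(record.keys())
        for doc in record.get("documents", []):
            doc_keys.update(doc.keys())
        seen += 1
    return top_keys, doc_keys


# Tiny in-memory stand-ins for the first lines of a raw and a preprocessed file.
raw_sample = [
    '{"question": "q", "question_id": 1, '
    '"documents": [{"title": "t", "paragraphs": ["p"]}], "answers": ["a"]}'
]
pre_sample = [
    '{"question": "q", "question_id": 1, "segmented_question": ["q"], '
    '"documents": [{"title": "t", "paragraphs": ["p"], "segmented_title": ["t"]}], '
    '"answers": ["a"]}'
]

raw_top, raw_doc = collect_keys(raw_sample)
pre_top, pre_doc = collect_keys(pre_sample)

# Fields that appear only after preprocessing (the ones prefixed with "+" above).
print(sorted(pre_top - raw_top))
print(sorted(pre_doc - raw_doc))
```

With real data, an open file handle can be passed instead of a list, for example `with open(path, encoding="utf-8") as f: collect_keys(f)`, where `path` is whatever location the downloaded file has.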

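I have a few questions:

What do the fields marked with ? mean? Are they REQUIRED or OPTIONAL? Roughly what effect do different values, or omitting them, have on the results? Fields generated by preprocess.py can be skipped without explanation.
I have read the closed issues: bs_rank_pos in RAW is the search ranking position. Is a larger value more recommended, or a smaller one?
answers and entity_answers in RAW both appear to be answers. What is the difference between them?
I plan to use MRC to build a medical question-answering system: the machine would read different editions of textbook text as the articles, with end-of-chapter exercises and their answers as training data, or question-answer pairs extracted from popular medical science magazines, rather than a dataset obtained from a search engine. In that case, where should the reading material go? In documents.paragraphs? Also, how should search-engine-related fields such as bs_rank_pos and is_selected be set?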

Any guidance would be much appreciated. Thanks!

urcllr commented 6 years ago

Thanks

urcllr · Answer 1 · Sat Apr 28 2018 17:16:33 GMT+0800 (China Standard Time)

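Sorry, there is a feasibility question I would like to understand first: can the expository text of an entire textbook be read as a single large article (which could be several hundred MB to several GB), with questions then asked about the content of different chapters?
Suppose an entire Office textbook covering Word, Excel, and PowerPoint is fed in at once as a single large article. The user could then ask any question about Word, Excel, or PowerPoint and get an answer, instead of selecting the Word tutorial when asking about Word, then separately selecting the Excel tutorial when asking about Excel, and so on.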

lkliukai · Answer 2 · Sat Apr 28 2018 19:19:03 GMT+0800 (China Standard Time)

Fields generated by preprocess.py can be skipped without explanation.

Yes, they are.

I have read the closed issues: bs_rank_pos in RAW is the search ranking position. Is a larger value more recommended, or a smaller one?

You can ignore the rank; we do not recommend using that information for now.

answers and entity_answers in RAW both appear to be answers. What is the difference between them?

They are answers in a more structured format, and they offer a different angle on answering the question.

I plan to use MRC to build a medical question-answering system…

That sounds good. You can simply put the document text under paragraphs to reuse the inference logic.
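
As an illustration of that suggestion, here is a minimal sketch of turning custom text (a textbook chapter, say) into a record shaped like the RAW layout listed in the question. It assumes one JSON object per line and splits paragraphs on blank lines. The helper names `split_paragraphs`, `make_record`, and `to_json_line` are hypothetical, and the sample text is a placeholder. Search-related fields such as bs_rank_pos and is_selected are left out here; whether the preprocessing or training scripts require them is not confirmed in this thread.

```python
import json
import re
from typing import Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and drop empty chunks (an assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def make_record(question_id: int, question: str, title: str,
                text: str, answers: List[str]) -> Dict:
    """Build one record with the document text placed under documents[].paragraphs."""
    return {
        "question_id": question_id,
        "question": question,
        "documents": [
            {"title": title, "paragraphs": split_paragraphs(text)},
        ],
        "answers": list(answers),
    }


def to_json_line(record: Dict) -> str:
    """Serialize one record as a single line, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)


# Placeholder content standing in for a chapter and an exercise with its answer.
chapter = "First paragraph of the chapter.\n\nSecond paragraph of the chapter."
record = make_record(
    question_id=1,
    question="A sample exercise question?",
    title="Chapter 1",
    text=chapter,
    answers=["A sample answer."],
)
print(to_json_line(record))
```

Records built this way would presumably still need to go through the same preprocessing as the original data before training or inference, since the segmented_* fields and answer spans are produced by that step.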

lkliukai · Answer 3 · Sat Apr 28 2018 19:20:51 GMT+0800 (China Standard Time)

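Sorry, there is a feasibility question I would like to understand first: can the expository text of an entire textbook be read as a single large article (which could be several hundred MB to several GB), with questions then asked about the content of different chapters?…

Yes, it is. One potential application is reading an instruction manual.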