hotpotqa / hotpot

Problematic Questions

yairf11 opened this issue · comments

Hi,

I have been looking through your datasets and found something odd: in the training set, some questions seem to be broken or missing.
For example, sample id 5a775ea9554299373536024d holds the question 'w', and sample id 5a81265c5542995ce29dcbca holds the question 'DRM'. There are several more.

The easiest way to find these examples is by sorting the questions in the training set by length, and then looking at the shortest ones.
A simple workaround could be to discard all questions with no question mark, but this eliminates 2322 samples, some of them perfectly good questions.
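The length-sorting check and the question-mark filter described above can be sketched roughly as follows. This is a minimal illustration, not the maintainers' tooling: it assumes the training file is a JSON list of objects with `_id` and `question` fields, and the `toy-*` entries are made-up samples added only so the snippet runs without the dataset.

```python
import json  # only needed for the commented-out loading step below


def shortest_questions(examples, n=20):
    """Return the n examples with the shortest question text."""
    return sorted(examples, key=lambda ex: len(ex["question"]))[:n]


def without_question_mark(examples):
    """Return the examples whose question contains no '?' at all."""
    return [ex for ex in examples if "?" not in ex["question"]]


# To run on the real data, load the file instead (the path is an assumption):
#     with open("hotpot_train_v1.json", encoding="utf-8") as f:
#         sample = json.load(f)
sample = [
    {"_id": "5a775ea9554299373536024d", "question": "w"},
    {"_id": "5a81265c5542995ce29dcbca", "question": "DRM"},
    # Hypothetical entries, not taken from the dataset:
    {"_id": "toy-1", "question": "Which city hosted the event?"},
    {"_id": "toy-2", "question": "Name the author of the novel."},
]

for ex in shortest_questions(sample, n=2):
    print(ex["_id"], repr(ex["question"]))

print(len(without_question_mark(sample)), "questions have no question mark")
```

Note that `toy-2` is a perfectly reasonable question that the question-mark filter would still discard, which is the drawback mentioned above.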

Are you aware of this?
Thanks!

Thank you for pointing this out! We just took a look at the shortest questions and indeed found the same questions you mentioned. As far as we know:

  • All these very short (and meaningless) questions come from the “easy” split. (See the “level” field in the json).
  • There are fewer than 100 such questions.
  • There are no such questions in the dev and test sets.

At this point, I think you could sort the examples by question length and remove the shortest 100 before training your model. We will release the next version of the training set very soon to exclude these examples.
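A rough sketch of that suggestion, assuming the same JSON layout (a list of objects with `question` and `level` fields). The function names and the toy data are hypothetical, and the cutoff `n` is a blunt parameter that may also drop valid short questions, so it is worth inspecting what gets removed.

```python
from collections import Counter


def drop_shortest(examples, n=100):
    """Return the examples minus the n shortest questions, keeping the original order."""
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]["question"]))
    dropped = set(order[:n])
    return [ex for i, ex in enumerate(examples) if i not in dropped]


def shortest_levels(examples, n=100):
    """Count the 'level' values among the n shortest questions."""
    shortest = sorted(examples, key=lambda ex: len(ex["question"]))[:n]
    return Counter(ex.get("level") for ex in shortest)


# Hypothetical toy data standing in for the real training set.
sample = [
    {"_id": "toy-1", "question": "Which river flows through both cities?", "level": "medium"},
    {"_id": "toy-2", "question": "w", "level": "easy"},
    {"_id": "toy-3", "question": "Who directed the film?", "level": "hard"},
    {"_id": "toy-4", "question": "DRM", "level": "easy"},
]

kept = drop_shortest(sample, n=2)
print([ex["_id"] for ex in kept])
print(shortest_levels(sample, n=2))
```

Sorting indices rather than the examples themselves keeps the surviving examples in their original order, which matters if anything downstream depends on it.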

We just released v1.1 here http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json
We removed 117 questions of this kind from the training set v1.

I also found a problematic example in the dev set:

{
  "_id": "5ae61bfd5542992663a4f261",
  "answer": "swingman",
  "question": "Which teams did Jimmy Butler play and what role did he play on these teams?",
  "supporting_facts": [
    [
      "Shooting guard",
      4
    ],
    [
      "Shooting guard",
      5
    ],
    [
      "Jimmy Butler (basketball)",
      0
    ],
    [
      "Jimmy Butler (basketball)",
      902
    ]

Note that the sentence index 902 is far too large: there is no such sentence in the document.
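A check along these lines could flag such entries automatically. It is only a sketch: it assumes each example also carries a `context` field holding `[title, sentences]` pairs, which the snippet above does not show, and the placeholder sentences below are made up so the code runs on its own.

```python
def invalid_supporting_facts(example):
    """Return the (title, sentence_index) pairs that do not exist in the context."""
    sentences_by_title = {title: sents for title, sents in example["context"]}
    bad = []
    for title, sent_id in example["supporting_facts"]:
        sents = sentences_by_title.get(title)
        # Flag unknown titles and out-of-range sentence indices alike.
        if sents is None or not 0 <= sent_id < len(sents):
            bad.append((title, sent_id))
    return bad


example = {
    "_id": "5ae61bfd5542992663a4f261",
    "supporting_facts": [
        ["Shooting guard", 4],
        ["Shooting guard", 5],
        ["Jimmy Butler (basketball)", 0],
        ["Jimmy Butler (basketball)", 902],
    ],
    # Assumed layout with placeholder sentences, not the real paragraphs.
    "context": [
        ["Shooting guard", ["placeholder sentence"] * 6],
        ["Jimmy Butler (basketball)", ["placeholder sentence"] * 3],
    ],
}

print(invalid_supporting_facts(example))
```

Running this over every example in a split and collecting the non-empty results would list all supporting facts that point outside their paragraph.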