Problematic Questions

yairf11 opened this issue 6 years ago

Hi,

I have been looking through your datasets and found something odd: in the training set, some questions seem broken or incomplete.
For example, sample id 5a775ea9554299373536024d holds the question 'w', and sample id 5a81265c5542995ce29dcbca holds the question 'DRM'. There are several more.

The easiest way to find these examples is by sorting the questions in the training set by length, and then looking at the shortest ones.
A simple workaround could be to discard all questions with no question mark, but this eliminates 2322 samples, some of them perfectly good questions.
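A minimal sketch of both checks described above: listing the shortest questions, and counting how many questions the "no question mark" filter would discard. It assumes the training file is a JSON list of objects with `_id` and `question` fields; the helper names and the tiny in-memory sample are made up for illustration.

```python
import json


def load_examples(path):
    """Load a JSON file that holds a list of example dicts (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def shortest_questions(examples, n=20):
    """Return the n examples whose question text is shortest."""
    return sorted(examples, key=lambda ex: len(ex["question"]))[:n]


def without_question_mark(examples):
    """Return the examples whose question contains no '?' character."""
    return [ex for ex in examples if "?" not in ex["question"]]


# Tiny made-up sample; the real data would come from load_examples(<path>).
sample = [
    {"_id": "a", "question": "w"},
    {"_id": "b", "question": "DRM"},
    {"_id": "c", "question": "Which teams did Jimmy Butler play for?"},
]

for ex in shortest_questions(sample, n=2):
    print(ex["_id"], repr(ex["question"]))

print(len(without_question_mark(sample)), "questions have no question mark")
```

Inspecting the shortest questions by hand before filtering avoids throwing away valid questions that merely lack a question mark.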

Are you aware of this?
Thanks!

Zhilin Yang · Answer 1 · Tue Nov 06 2018 12:18:42 GMT+0800 (China Standard Time)

Thank you for pointing this out! We just took a look at the shortest questions and indeed found the same questions you mentioned. As far as we know:

All of these very short (and meaningless) questions come from the “easy” split (see the “level” field in the JSON).
There are fewer than 100 such questions.
There are no such questions in the dev and test sets.

At this point, I think you could sort the examples by question length and remove the shortest 100 before training your model. We will release the next version of the training set very soon to exclude these examples.
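The suggested workaround could look roughly like the sketch below. It assumes each example is a dict with `question` and `level` fields; `drop_shortest` and `level_counts` are hypothetical helper names, and the sample data is made up. Because the cutoff is a fixed count, it may also drop a few valid short questions.

```python
from collections import Counter


def drop_shortest(examples, n=100):
    """Return the examples minus the n shortest questions, keeping the original order."""
    by_length = sorted(range(len(examples)), key=lambda i: len(examples[i]["question"]))
    to_drop = set(by_length[:n])
    return [ex for i, ex in enumerate(examples) if i not in to_drop]


def level_counts(examples, n=100):
    """Count the 'level' values among the n shortest questions."""
    shortest = sorted(examples, key=lambda ex: len(ex["question"]))[:n]
    return Counter(ex.get("level") for ex in shortest)


# Made-up sample data for illustration only.
sample = [
    {"question": "Who directed the film adapted from the novel?", "level": "hard"},
    {"question": "w", "level": "easy"},
    {"question": "DRM", "level": "easy"},
    {"question": "Which city hosted the event?", "level": "medium"},
]

print(level_counts(sample, n=2))
cleaned = drop_shortest(sample, n=2)
print([ex["question"] for ex in cleaned])
```

Checking `level_counts` first is a quick way to confirm that the shortest questions do come from the “easy” split before removing them.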

Zhilin Yang · Answer 2 · Wed Nov 07 2018 02:04:34 GMT+0800 (China Standard Time)

We just released v1.1 here: http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json
We removed 117 questions of this kind from the v1 training set.

Vimos Tan · Answer 3 · Wed Feb 13 2019 22:08:49 GMT+0800 (China Standard Time)

I also found a problematic example in the dev set:

{
  "_id": "5ae61bfd5542992663a4f261",
  "answer": "swingman",
  "question": "Which teams did Jimmy Butler play and what role did he play on these teams?",
  "supporting_facts": [
    [
      "Shooting guard",
      4
    ],
    [
      "Shooting guard",
      5
    ],
    [
      "Jimmy Butler (basketball)",
      0
    ],
    [
      "Jimmy Butler (basketball)",
      902
    ]

Note that the sentence index 902 is out of range: the document has no such sentence.
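A rough way to scan for this kind of problem is to compare each supporting-fact index against the number of sentences in the matching document. The sketch below assumes each example also carries a `context` field shaped as a list of `[title, [sentence, ...]]` pairs, which is not shown in the snippet above; the function name and the placeholder sentences are made up for illustration.

```python
def invalid_supporting_facts(example):
    """Return (title, index) pairs that do not point at an existing sentence."""
    # Assumed layout: context is a list of [title, list_of_sentences] pairs.
    sentences_by_title = {title: sents for title, sents in example["context"]}
    bad = []
    for title, sent_idx in example["supporting_facts"]:
        sents = sentences_by_title.get(title)
        # Flag facts whose title is missing or whose index is out of range.
        if sents is None or not 0 <= sent_idx < len(sents):
            bad.append((title, sent_idx))
    return bad


# Placeholder context; the sentence texts are invented, only the shape matters.
example = {
    "_id": "5ae61bfd5542992663a4f261",
    "supporting_facts": [
        ["Shooting guard", 4],
        ["Shooting guard", 5],
        ["Jimmy Butler (basketball)", 0],
        ["Jimmy Butler (basketball)", 902],
    ],
    "context": [
        ["Shooting guard", ["s0", "s1", "s2", "s3", "s4", "s5"]],
        ["Jimmy Butler (basketball)", ["s0", "s1"]],
    ],
}

print(invalid_supporting_facts(example))
```

Running this over every example in a split would list all supporting facts that reference a missing title or a nonexistent sentence.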