The main data is split into two files: one for train+val (36,766 + 4,966 samples) and one for test (7,540 samples).
- Images
The large image file is compressed and split into 51 chunks of 1 GB each. You can download all chunks at once by running this script.
To unzip and merge all chunks, run 7z x imgs.7z.001
We also provide Google Drive download links.
You are good when you have WebQA_train_val.json, WebQA_test.json, imgs.lineidx and imgs.tsv.
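A quick way to confirm the download and extraction finished is to check for the four files listed above. The following is a minimal sketch, assuming all four files sit in one directory; the helper name `missing_files` and the `data_dir` parameter are hypothetical and not part of the dataset tooling.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = (
    "WebQA_train_val.json",
    "WebQA_test.json",
    "imgs.lineidx",
    "imgs.tsv",
)


def missing_files(data_dir="."):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumption: the data was downloaded into the current directory.
    missing = missing_files(".")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All files present.")
```

An empty list means every expected file was found; otherwise the returned names show what still needs to be downloaded or extracted.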
{<guid>: {'sources': [<image_id>/<snippet_id>, ..., ],
          'answer': "xxxxxxx" },
 <guid>: {...},
 <guid>: {...},
}
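The structure above maps each guid to a record with a list of source ids and an answer string. The following is a minimal sketch of building and checking such a dictionary in Python, assuming it is serialized as JSON; the helper names `make_record` and `check_structure`, and all guid and id values, are made-up placeholders rather than real dataset entries.

```python
import json


def make_record(sources, answer):
    """Build one record: a list of image/snippet ids plus an answer string."""
    return {"sources": list(sources), "answer": str(answer)}


def check_structure(data):
    """Raise ValueError if any record deviates from the layout shown above."""
    for guid, record in data.items():
        if not isinstance(record, dict) or set(record) != {"sources", "answer"}:
            raise ValueError(f"{guid}: expected exactly 'sources' and 'answer'")
        if not isinstance(record["sources"], list):
            raise ValueError(f"{guid}: 'sources' must be a list")
        if not isinstance(record["answer"], str):
            raise ValueError(f"{guid}: 'answer' must be a string")
    return True


# Placeholder guids and ids, for illustration only.
data = {
    "example_guid_1": make_record([12345, "example_snippet_id"], "xxxxxxx"),
    "example_guid_2": make_record([], "xxxxxxx"),
}

check_structure(data)
serialized = json.dumps(data, indent=2)
```

Validating before writing keeps a malformed record from going unnoticed until the file is consumed elsewhere.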