wenhuchen / Table-Fact-Checking

Data and Code for ICLR2020 Paper "TabFact: A Large-scale Dataset for Table-based Fact Verification"


About the differences between collected_data and tokenized_data.

ichiroex opened this issue · comments

Thank you for sharing your interesting dataset.

I'm curious about the differences between collected_data and tokenized_data.
How did you process collected_data to generate tokenized_data?

I tried to split collected_data into train/val/test using the train_id.json/val_id.json/test_id.json files in the data folder, but the number of examples in each split differs from the train/val/test split reported in your paper, as shown below.

| split | my count | paper |
|-------|----------|-------|
| train | 92,585 | 92,283 |
| val | 12,851 | 12,792 |
| test | 12,839 | 12,779 |

However, the number of train/val/test examples in the tokenized_data folder matches the numbers in your paper.
Did you apply any filtering to collected_data?
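
For reference, a minimal sketch of the kind of split described above, assuming the collected_data files are dicts keyed by table id with the statements in the first field, and that the *_id.json files are lists of those table ids. The exact file names below are guesses, not the repo's official paths:

```python
import json

def load_ids(path):
    # The *_id.json files are assumed to contain a list of table ids.
    with open(path) as f:
        return set(json.load(f))

def count_statements(collected_paths, id_path):
    # Each collected_data file is assumed to map table_id -> [statements, labels, caption].
    ids = load_ids(id_path)
    total = 0
    for collected_path in collected_paths:
        with open(collected_path) as f:
            collected = json.load(f)
        total += sum(len(entry[0]) for table_id, entry in collected.items()
                     if table_id in ids)
    return total

collected_files = ["collected_data/r1_training_all.json",   # assumed file names
                   "collected_data/r2_training_all.json"]
for split in ("train", "val", "test"):
    print(split, count_statements(collected_files, f"data/{split}_id.json"))
```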

Hi, a few sentences are filtered out of collected_data; please refer to this line:

```python
# print("drop sentence: {}".format(orig_sent))
```
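
For anyone hitting the same mismatch, one rough way to see where statements were dropped is to compare per-table statement counts between the two folders. This is only a sketch: the file names and the assumed table_id -> [statements, labels, caption] layout are guesses about the repo's format, not taken from the official scripts.

```python
import json
from collections import Counter

def counts(paths):
    # Sum statement counts per table id across the given JSON files.
    c = Counter()
    for path in paths:
        with open(path) as f:
            for table_id, entry in json.load(f).items():
                c[table_id] += len(entry[0])  # entry[0] assumed to be the statement list
    return c

collected = counts(["collected_data/r1_training_all.json",   # assumed file names
                    "collected_data/r2_training_all.json"])
tokenized = counts(["tokenized_data/train_examples.json",    # assumed file names
                    "tokenized_data/val_examples.json",
                    "tokenized_data/test_examples.json"])

# Print the tables whose statement counts changed during tokenization.
for table_id in collected:
    if collected[table_id] != tokenized.get(table_id, 0):
        print(table_id, collected[table_id], "->", tokenized.get(table_id, 0))
```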