questions about the evaluation script

Question

bozheng-hit opened this issue 6 years ago

Hi Tao,
I evaluated the first example in gold_example.txt and pred_example.txt.
I want to know why the exact match result comes out to be 1.

The examples are:
gold: SELECT count(*) FROM singer|concert_singer
pred: select count(*) from stadium

The command I used is:
python evaluation.py --gold ./evaluation_examples/gold_small.txt --pred ./evaluation_examples/pred_small.txt --etype match --db ./database/ --table tables.json

Would you please give an explanation about this?

Best,
Bo Zheng

Tao Yu · Answer 1 · Wed Oct 17 2018 17:44:10 GMT+0800 (China Standard Time)

Hi Bo,

In this special case, the evaluation script does not take the table name into account. This happens only for * (here it appears in count(*)), because tables.json lists * once as an additional column for the whole database. We should have added * as an additional column for each table of the database instead. However, modifying the inputs and code for all the baselines and our syntaxSQL model would be too time-consuming for us.
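To illustrate the point above, here is a minimal sketch (not the actual evaluation.py code) of why a single database-wide * column makes the two count(*) expressions indistinguishable when only column identity is compared. The column lists, the `TABLES` list, and the `column_id` helper are all hypothetical; a table index of -1 is assumed to mean "no owning table".

```python
# Hypothetical, simplified schema: each column is (table_index, name).
# A table index of -1 is assumed to mean the column belongs to no table.
TABLES = ["stadium", "singer"]

# One "*" for the whole database (the situation described above).
COLUMNS_GLOBAL_STAR = [(-1, "*"), (0, "stadium_id"), (1, "singer_id")]

# One "*" per table (the alternative described above).
COLUMNS_PER_TABLE_STAR = [(0, "*"), (0, "stadium_id"), (1, "*"), (1, "singer_id")]


def column_id(columns, table, name):
    """Return the index of the entry for `name` as seen from `table`."""
    for idx, (tab_idx, col_name) in enumerate(columns):
        if col_name != name:
            continue
        # A table-less column matches any table; otherwise the table must agree.
        if tab_idx == -1 or TABLES[tab_idx] == table:
            return idx
    raise KeyError((table, name))


# With a global "*", both queries resolve to the same column entry.
global_same = (
    column_id(COLUMNS_GLOBAL_STAR, "singer", "*")
    == column_id(COLUMNS_GLOBAL_STAR, "stadium", "*")
)

# With a per-table "*", the two queries resolve to different entries.
per_table_same = (
    column_id(COLUMNS_PER_TABLE_STAR, "singer", "*")
    == column_id(COLUMNS_PER_TABLE_STAR, "stadium", "*")
)

print(global_same, per_table_same)
```

Under these assumptions, a comparison based only on column identity cannot tell the gold and predicted queries apart in the first layout, but can in the second.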

As you know, the evaluation script can also report execution accuracy, which could get this example right.
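As a rough sketch of what an execution-based comparison does with this pair of queries, the snippet below runs both against an in-memory SQLite database and compares the returned rows. The table definitions and row counts are made up for illustration and are not the real concert_singer database; the `run` helper is hypothetical.

```python
import sqlite3

# Toy in-memory database with made-up rows (not the real concert_singer data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY);
    CREATE TABLE stadium (stadium_id INTEGER PRIMARY KEY);
    INSERT INTO singer VALUES (1), (2);
    INSERT INTO stadium VALUES (1), (2), (3);
    """
)


def run(query):
    """Execute a query and return all result rows."""
    return conn.execute(query).fetchall()


gold_rows = run("SELECT count(*) FROM singer")
pred_rows = run("SELECT count(*) FROM stadium")

# The prediction counts as correct only if both queries return the same rows.
execution_match = gold_rows == pred_rows
print(execution_match)  # False: the toy tables hold 2 and 3 rows
```

Note that this kind of check separates the two queries only when the tables happen to hold different numbers of rows; if both tables had the same row count, the results would coincide.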

Best,
Tao

Tao Yu · Answer 2 · Wed Oct 17 2018 18:23:48 GMT+0800 (China Standard Time)

Hi Bo,

As we pointed out here, the evaluation script doesn't consider the DISTINCT keyword. The reason is that people very commonly add DISTINCT to a SQL query even when the corresponding natural language question gives no clue that DISTINCT is needed (we found this problem during our annotation). Thus, the evaluation script does not give 0 when the only difference between two SQL queries is DISTINCT.
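The effect of ignoring DISTINCT can be sketched as follows. This is a simplified string-level stand-in, not how evaluation.py does it: the `normalize_ignoring_distinct` helper is hypothetical, and it lowercases the whole query, so it would also alter string literals.

```python
import re


def normalize_ignoring_distinct(sql):
    """Lowercase the query, drop the DISTINCT keyword, collapse whitespace."""
    without_distinct = re.sub(r"\bdistinct\b\s*", "", sql.lower())
    return " ".join(without_distinct.split())


gold = "SELECT DISTINCT name FROM singer"
pred = "select name from singer"

# The two queries differ only by DISTINCT, so they compare equal here.
same_ignoring_distinct = (
    normalize_ignoring_distinct(gold) == normalize_ignoring_distinct(pred)
)
print(same_ignoring_distinct)
```

A query that differs in anything other than DISTINCT, such as the selected column, still compares unequal under this normalization.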

Best,
Tao

Bo Zheng · Answer 3 · Wed Oct 17 2018 18:37:20 GMT+0800 (China Standard Time)

Hi Tao,
Since you are running a leaderboard now and the test set is not visible to us, I think it would be better to provide a correct evaluation. We have no idea how many test examples have the same problem.

Thanks for the quick reply.

Best,
Bo

Tao Yu · Answer 4 · Fri Oct 26 2018 05:06:49 GMT+0800 (China Standard Time)

Hi Bo,

We updated the evaluation script so that the first problem (count(*)) is fixed. For the DISTINCT case, we think that it is still reasonable to not include it in the evaluation.

Best,
Tao