questions about the evaluation script
bozheng-hit opened this issue
Hi Tao,
I evaluated the first example in gold_example.txt and pred_example.txt.
I want to know why the exact match result comes out to be 1.
The examples are:
gold: SELECT count(*) FROM singer|concert_singer
pred: select count(*) from stadium
The command I used is:
python evaluation.py --gold ./evaluation_examples/gold_small.txt --pred ./evaluation_examples/pred_small.txt --etype match --db ./database/ --table tables.json
Would you please give an explanation about this?
Best,
Bo Zheng
Hi Bo,
For this special case, the evaluation script doesn't take the table name into consideration. This happens only for * (here it appears in count(*)), since we add * as a single additional column for the whole database in tables.json. We should have added * as an additional column for each table of the database in tables.json. However, it would be too time-consuming for us to modify the inputs and code for all baselines and our syntaxSQL model.
As you know, the evaluation script can also report execution accuracy, which would get this example right.
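To illustrate the point above, here is a minimal sketch (hypothetical function and variable names, not the actual evaluation.py internals) of how registering * once for the whole database, rather than once per table, makes the two queries look identical to an exact-match comparison:

```python
# Hypothetical sketch: columns are mapped to ids as in tables.json,
# but "*" gets a single database-wide id instead of one id per table.
def build_column_index(schema):
    """Map (table, column) -> id; "*" shares one id across the schema."""
    index = {("*", "*"): 0}  # one shared id for "*" in the whole database
    next_id = 1
    for table, columns in schema.items():
        for col in columns:
            index[(table, col)] = next_id
            next_id += 1
    return index

schema = {"singer": ["name"], "stadium": ["capacity"]}
idx = build_column_index(schema)

# Both count(*) expressions resolve to the same shared column id, so an
# exact match on parsed column ids cannot tell the two tables apart.
gold_col = idx[("*", "*")]  # from: SELECT count(*) FROM singer
pred_col = idx[("*", "*")]  # from: select count(*) from stadium
print(gold_col == pred_col)  # True
```

Had * been registered per table, the two queries would resolve to different column ids and the match would fail, which is the fix Tao describes below.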
Best,
Tao
Hi Bo,
As we pointed out here, the evaluation script doesn't consider the DISTINCT keyword. The reason is that it is very common for people to add DISTINCT to a SQL query even though the corresponding natural language question doesn't contain any clue that DISTINCT is needed (we found this problem during our annotation). Thus, the evaluation script does not give 0 if the only difference between two SQL queries is DISTINCT.
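A small sketch of how such DISTINCT-insensitive matching can work (hypothetical helper, not the actual evaluation.py code): both queries are normalized so that DISTINCT never affects the comparison.

```python
import re

def normalize(sql):
    """Lowercase, drop DISTINCT, and collapse whitespace before comparing."""
    sql = sql.strip().lower()
    sql = re.sub(r"\bdistinct\b\s*", "", sql)  # DISTINCT is ignored entirely
    return re.sub(r"\s+", " ", sql).strip()

gold = "SELECT DISTINCT name FROM singer"
pred = "SELECT name FROM singer"
print(normalize(gold) == normalize(pred))  # True
```

Under this scheme a prediction is never penalized for adding or omitting DISTINCT, which matches the behavior described above.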
Best,
Tao
Hi Tao,
Since you are running a leaderboard now and the test set is not visible to us, I think it's better to provide a correct evaluation for us. We have no idea how many test examples have the same problem.
Thanks for the quick reply.
Best,
Bo
Hi Bo,
We updated the evaluation script so that the first problem (count(*)) is fixed. For the DISTINCT case, we think that it is still reasonable not to include it in the evaluation.
Best,
Tao