taoyds / spider

scripts and baselines for Spider: Yale complex and cross-domain semantic parsing and text-to-SQL challenge

Home Page: https://yale-lily.github.io/spider

questions about the evaluation script

bozheng-hit opened this issue

Hi Tao,
I evaluated the first example in gold_example.txt and pred_example.txt.
I want to know why the exact match result comes out to be 1.

The examples are:
gold: SELECT count(*) FROM singer|concert_singer
pred: select count(*) from stadium

The command I used is:
python evaluation.py --gold ./evaluation_examples/gold_small.txt --pred ./evaluation_examples/pred_small.txt --etype match --db ./database/ --table tables.json

Would you please give an explanation about this?

Best,
Bo Zheng

Hi Bo,

In this special case, the evaluation script doesn't take the table name into account. This happens only for * (here it appears in count(*)), because tables.json defines * once as an additional column for the whole database. We should have added * as an additional column for each table of the database instead. However, it would be too time-consuming for us to modify the inputs and code for all the baselines and our syntaxSQL model.
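To make the effect concrete, here is a toy sketch in Python. It is not the logic of evaluation.py and not the real tables.json layout: the two column lists and the helpers star_id and select_matches are hypothetical, and the sketch assumes the comparison only sees column identifiers. With a single database-wide *, count(*) over two different tables resolves to the same identifier; with one * per table, it does not.

```python
# Hypothetical, simplified schema encodings (NOT the real tables.json layout).
# Each column is a (table_name or None, column_name) pair; None means
# "belongs to the whole database".
GLOBAL_STAR = [(None, "*"), ("singer", "name"), ("stadium", "name")]
PER_TABLE_STAR = [
    ("singer", "*"), ("singer", "name"),
    ("stadium", "*"), ("stadium", "name"),
]


def star_id(columns, table):
    """Return the index of the '*' column that a query over `table` resolves to."""
    for idx, (tbl, col) in enumerate(columns):
        if col == "*" and tbl in (None, table):
            return idx
    raise ValueError(f"no '*' column for table {table!r}")


def select_matches(columns, gold_table, pred_table):
    """Compare only the SELECT clause of two count(*) queries."""
    gold_select = ("count", star_id(columns, gold_table))
    pred_select = ("count", star_id(columns, pred_table))
    return gold_select == pred_select


# One '*' for the whole database: the two queries look identical.
print(select_matches(GLOBAL_STAR, "singer", "stadium"))     # True
# One '*' per table: the table difference becomes visible.
print(select_matches(PER_TABLE_STAR, "singer", "stadium"))  # False
```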

As you know, the evaluation script can also report execution accuracy, which could get this example right.
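As a rough illustration of the idea behind execution accuracy, the sketch below runs both queries against an in-memory SQLite database and compares the returned rows. The table contents and the helper same_execution_result are assumptions made for this example, not part of evaluation.py. Note that the outcome depends on the data: if singer and stadium happened to hold the same number of rows, the two count(*) queries would return the same result.

```python
import sqlite3
from collections import Counter


def same_execution_result(conn, gold_sql, pred_sql):
    """Return True when both queries yield the same multiset of rows."""
    gold_rows = conn.execute(gold_sql).fetchall()
    pred_rows = conn.execute(pred_sql).fetchall()
    return Counter(gold_rows) == Counter(pred_rows)


# Hypothetical toy database: 3 singers, 1 stadium.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stadium (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO singer (name) VALUES ('a'), ('b'), ('c');
    INSERT INTO stadium (name) VALUES ('x');
""")

gold = "SELECT count(*) FROM singer"
pred = "select count(*) from stadium"
print(same_execution_result(conn, gold, pred))  # False: 3 rows vs 1 row
```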

Best,
Tao

Hi Bo,

As we pointed out here, the evaluation script doesn't consider the DISTINCT keyword. The reason is that people very commonly add DISTINCT to a SQL query even when the corresponding natural language question gives no clue that DISTINCT is needed (we found this problem during our annotation). Thus, the evaluation script does not give 0 when the only difference between two SQL queries is DISTINCT.
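A minimal sketch of what "ignoring DISTINCT" means for a comparison, in Python. This is a string-level approximation written for illustration only; the real script compares parsed queries, and the helpers strip_distinct and match_ignoring_distinct are hypothetical. Lowercasing the whole query would also alter string literals, so this is not suitable for queries that contain quoted values.

```python
import re


def strip_distinct(sql):
    """Lowercase the query, drop the DISTINCT keyword, collapse whitespace."""
    no_distinct = re.sub(r"\bdistinct\s+", "", sql.lower())
    return " ".join(no_distinct.split())


def match_ignoring_distinct(gold_sql, pred_sql):
    """Return True when the queries differ at most by DISTINCT, case, or spacing."""
    return strip_distinct(gold_sql) == strip_distinct(pred_sql)


# Differ only by DISTINCT (and case): treated as a match.
print(match_ignoring_distinct(
    "SELECT DISTINCT name FROM singer",
    "select name from singer",
))  # True

# Differ by table: still a mismatch.
print(match_ignoring_distinct(
    "SELECT name FROM singer",
    "SELECT name FROM stadium",
))  # False
```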

Best,
Tao

Hi Tao,
Since you are now running a leaderboard and the test set is not visible to us, I think it would be better to provide a correct evaluation. We have no idea how many test examples have the same problem.

Thanks for the quick reply.

Best,
Bo

Hi Bo,

We updated the evaluation script so that the first problem (count(*)) is fixed. For the DISTINCT case, we still think it is reasonable not to include it in the evaluation.

Best,
Tao