taoyds / spider

scripts and baselines for Spider: Yale complex and cross-domain semantic parsing and text-to-SQL challenge

Home Page: https://yale-lily.github.io/spider

Parsed results of FROM Clause in SQL query

JasperGuo opened this issue

Hi,

Thanks for your great work building this dataset for nl2sql.
I am currently studying the characteristics of the SQL queries in it.
I found that some of the parsed results for the FROM clause are confusing.

Here is an example.
The FROM clause in the query contains two JOIN operations involving three tables.
But the parsed table_units contain only two of those tables.

    {
        "db_id": "department_management",
        "query": "SELECT DISTINCT T1.creation FROM department AS T1 JOIN management AS T2 ON T1.department_id  =  T2.department_id JOIN head AS T3 ON T2.head_id  =  T3.head_id WHERE T3.born_state  =  'Alabama'",
        "sql": {
            "except": null,
            "from": {
                "conds": [
                    [
                        false,
                        2,
                        [
                            0,
                            [
                                0,
                                1,
                                false
                            ],
                            null
                        ],
                        [
                            0,
                            11,
                            false
                        ],
                        null
                    ]
                ],
                "table_units": [
                    [
                        "table_unit",
                        0
                    ],
                    [
                        "table_unit",
                        2
                    ]
                ]
            },
           ...
        }
    }
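The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of the Spider scripts: `count_from_tables` and `table_units_mismatch` are hypothetical helper names, and the count assumes a simple query with no subqueries, set operations, or comma-separated tables (one table after FROM plus one per JOIN).

```python
def count_from_tables(query: str) -> int:
    """Count table references in a simple query's FROM clause.

    Assumption: no subqueries, no INTERSECT/UNION/EXCEPT, and no
    comma-separated tables, so the count is 1 (after FROM) + one per JOIN.
    """
    upper_tokens = [tok.upper() for tok in query.split()]
    if "FROM" not in upper_tokens:
        return 0
    return 1 + upper_tokens.count("JOIN")


def table_units_mismatch(example: dict) -> bool:
    """Return True when the parsed table_units do not cover every table
    that the query string appears to reference in its FROM clause."""
    expected = count_from_tables(example["query"])
    parsed = len(example["sql"]["from"]["table_units"])
    return expected != parsed
```

For the department_management example, the query string references three tables while table_units lists two, so the check flags it.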

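On further inspection, this may be caused by the order of the JOIN and ON clauses: in the query above, each JOIN is followed by its own ON clause.
Here is an example that is parsed correctly, in which all the JOINs come before a single ON clause.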

    {
        "db_id": "bike_1",
        "query": "SELECT count(*) FROM station AS T1 JOIN trip AS T2 JOIN station AS T3 JOIN trip AS T4 ON T1.id  =  T2.start_station_id AND T2.id  =  T4.id AND T3.id  =  T4.end_station_id WHERE T1.city  =  \"Mountain View\" AND T3.city  =  \"Palo Alto\"",
        "sql": {
            "except": null,
            "from": {
                "conds": [
                    [
                        false,
                        2,
                        [
                            0,
                            [
                                0,
                                1,
                                false
                            ],
                            null
                        ],
                        [
                            0,
                            16,
                            false
                        ],
                        null
                    ],
                    "and",
                    [
                        false,
                        2,
                        [
                            0,
                            [
                                0,
                                12,
                                false
                            ],
                            null
                        ],
                        [
                            0,
                            12,
                            false
                        ],
                        null
                    ],
                    "and",
                    [
                        false,
                        2,
                        [
                            0,
                            [
                                0,
                                1,
                                false
                            ],
                            null
                        ],
                        [
                            0,
                            19,
                            false
                        ],
                        null
                    ]
                ],
                "table_units": [
                    [
                        "table_unit",
                        0
                    ],
                    [
                        "table_unit",
                        2
                    ],
                    [
                        "table_unit",
                        0
                    ],
                    [
                        "table_unit",
                        2
                    ]
                ]
            },
            ...
        }
    },
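To compare the two parses more easily, the nested conds lists can be flattened into pairs of column ids. This is a sketch under an assumption about the layout, inferred from the JSON above and from my reading of the comments in process_sql.py: each condition is `[not_op, op_id, val_unit, val1, val2]`, a val_unit is `[unit_op, col_unit1, col_unit2]`, and a col_unit is `[agg_id, col_id, isDistinct]`. The function name `join_column_pairs` is hypothetical.

```python
def join_column_pairs(conds: list) -> list:
    """Extract (left_col_id, right_col_id) pairs from a parsed "conds" list.

    Assumed layout (not verified against every version of process_sql.py):
      cond_unit = [not_op, op_id, val_unit, val1, val2]
      val_unit  = [unit_op, col_unit1, col_unit2]
      col_unit  = [agg_id, col_id, isDistinct]
    """
    pairs = []
    for unit in conds:
        if isinstance(unit, str):
            # "and" / "or" connectors sit between condition units
            continue
        _not_op, _op_id, val_unit, right, _val2 = unit
        left_col_id = val_unit[1][1]
        # In a join condition the right-hand side is itself a col_unit
        if isinstance(right, list):
            pairs.append((left_col_id, right[1]))
    return pairs
```

Applied to the two examples, the first yields a single pair, (1, 11), although its query has two ON conditions, while the second yields three pairs, one per condition in its ON clause.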

Please let me know if I have misunderstood something about the parsed results.

Many thanks!

Hi,
We generated train.json and dev.json using an old version of the SQL parsing script, process_sql.py. Some baselines take the old parsed results as input, so we did not update them in the released files. If you would like to update the parsed SQL results, the following script uses the latest process_sql.py (with a few modifications compared to the old version) to produce the correct parsed results:
https://github.com/taoyds/spider/blob/master/preprocess/parse_raw_json.py

Best,
Tao

Thanks Tao! I will give it a try.