haoxing49 / chase

Project page of Chase

CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL

CHASE is a large-scale and pragmatic Chinese dataset for the cross-database context-dependent text-to-SQL task (natural language interfaces for relational databases). It is released along with our ACL 2021 paper: CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL. This repo contains the CHASE dataset.

Citation

Data Content and Format

Question, SQL, and Parsed SQL

Each entry in train.json and dev.json contains the following fields:

  • database_id: the database id to which this interaction is addressed.
  • interaction: a sequence of questions asked against the database, in order. Each question in the interaction includes:
    • utterance: the natural language question
    • utterance_toks: the natural language question tokens
    • query: the SQL query corresponding to the question.
    • sql: the result of parsing this SQL query with process_sql.py. Please refer to the Spider GitHub page for detailed documentation.
    {
        "database_id": "party_host",
        "interaction": [
            {
                "utterance": "主办方都有谁?",
                "utterance_toks": [
                    "主",
                    "办",
                    "方",
                    ...
                    "?"
                ],
                "query": "select 姓名 from 主办方",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            },
            {
                "utterance": "他们来自哪些不同的国家?",
                "utterance_toks": [
                    "他",
                    "们",
                    ...
                    "?"
                ],
                "query": "select distinct 国籍 from 主办方",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            },
            {
                "utterance": "每个国家有多少个主办方?",
                "utterance_toks": [
                    "每",
                    "个",
                    "国",
                    "家",
                    ...
                    "?"
                ],
                "query": "select 国籍 , count(*) from 主办方 group by 国籍",
                "sql": {
                    "except": null,
                    "from": {
                        "conds": [],
                        "table_units": [
                            [
                                "table_unit",
                                1
                            ]
                        ]
                    },
                    ...
                    "where": []
                }
            }
        ]
    }
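
The fields above can be read with a few lines of Python. The following is a minimal sketch, not part of the released code: it assumes that train.json and dev.json each hold a JSON list of entries shaped like the example above, and the helper names (load_interactions, iter_turns, from_table_indices) are hypothetical. The in-memory sample is abbreviated to the fields the helpers use.

```python
import json


def load_interactions(path):
    """Load a CHASE split, assuming the file is a JSON list of interactions."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_turns(interactions):
    """Yield (database_id, turn_index, utterance, query) for every question."""
    for interaction in interactions:
        db_id = interaction["database_id"]
        for turn_index, turn in enumerate(interaction["interaction"]):
            yield db_id, turn_index, turn["utterance"], turn["query"]


def from_table_indices(parsed_sql):
    """Return the table indices listed as plain table units in the FROM clause."""
    return [
        unit[1]
        for unit in parsed_sql["from"]["table_units"]
        if unit[0] == "table_unit"
    ]


# Abbreviated in-memory sample shaped like the example above.
sample = [
    {
        "database_id": "party_host",
        "interaction": [
            {
                "utterance": "主办方都有谁?",
                "query": "select 姓名 from 主办方",
                "sql": {"from": {"conds": [], "table_units": [["table_unit", 1]]}},
            },
            {
                "utterance": "他们来自哪些不同的国家?",
                "query": "select distinct 国籍 from 主办方",
                "sql": {"from": {"conds": [], "table_units": [["table_unit", 1]]}},
            },
        ],
    }
]

turns = list(iter_turns(sample))
for db_id, turn_index, utterance, query in turns:
    print(db_id, turn_index, utterance, query)
```

With the real data, `load_interactions("train.json")` would replace the in-memory sample, provided the file layout matches the assumption above. Because the questions are context-dependent, keeping the turn index alongside each utterance makes it easy to rebuild the preceding context for any question.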

Tables

tables.json contains the following information for each database:

  • db_id: database id
  • table_names_original: original table names stored in the database.
  • table_names: cleaned and normalized table names. We make sure the table names are meaningful. [to be changed]
  • column_names_original: original column names stored in the database. Each column looks like: [0, "派对主题"]. 0 is the index of the table in table_names, which is "派对" in this case. "派对主题" is the column name.
  • column_names: cleaned and normalized column names. We make sure the column names are meaningful. [to be changed]
  • column_types: data type of each column
  • foreign_keys: foreign keys in the database. Each pair, such as [11, 1], gives the indices of two columns in column_names. The two columns belong to two different tables and are linked by a foreign key.
  • primary_keys: primary keys in the database. Each number is an index into column_names.
    {
        "db_id": "party_host",
        "table_names_original": [
            "派对",
            "主办方",
            "派对主办方"
        ],
        "table_names": [
            "派对",
            "主办方",
            "派对主办方"
        ],
        "column_names_original": [
            [
                -1,
                "*"
            ],
            [
                0,
                "派对"
            ],
            [
                0,
                "派对主题"
            ],
            [
                0,
                "地点"
            ],
            ...
        ],
        "column_names": [
            [
                -1,
                "*"
            ],
            [
                0,
                "派对"
            ],
            [
                0,
                "派对主题"
            ],
            [
                0,
                "地点"
            ],
            ...
        ],
        "column_types": [
            "text",
            "number",
            "text",
            "text",
            ...
        ],
        "foreign_keys": [
            [
                11,
                1
            ],
            [
                12,
                7
            ]
        ],
        "primary_keys": [
            1,
            7,
            11
        ]
    }
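
Since foreign_keys and primary_keys store indices rather than names, it is often convenient to resolve them to table-qualified column names. The sketch below shows one way to do that. The helper names are hypothetical, and the schema is a small made-up example in the tables.json layout (it is not the real party_host entry, whose later columns are elided above).

```python
def qualified_column_names(schema):
    """Return one 'table.column' string per entry in column_names_original."""
    tables = schema["table_names_original"]
    names = []
    for table_idx, column in schema["column_names_original"]:
        if table_idx < 0:
            # Index -1 marks the special "*" column, which belongs to no table.
            names.append(column)
        else:
            names.append(f"{tables[table_idx]}.{column}")
    return names


def resolve_foreign_keys(schema):
    """Map each foreign key index pair to a pair of qualified column names."""
    names = qualified_column_names(schema)
    return [(names[a], names[b]) for a, b in schema["foreign_keys"]]


def resolve_primary_keys(schema):
    """Map each primary key index to a qualified column name."""
    names = qualified_column_names(schema)
    return [names[i] for i in schema["primary_keys"]]


# Hypothetical toy schema, for illustration only.
toy_schema = {
    "db_id": "toy_party_host",
    "table_names_original": ["party", "host", "party_host"],
    "column_names_original": [
        [-1, "*"],
        [0, "party_id"],
        [0, "theme"],
        [1, "host_id"],
        [1, "name"],
        [2, "party_id"],
        [2, "host_id"],
    ],
    "foreign_keys": [[5, 1], [6, 3]],
    "primary_keys": [1, 3, 5],
}

for left, right in resolve_foreign_keys(toy_schema):
    print(left, "<->", right)
print(resolve_primary_keys(toy_schema))
```

The same lookup works with table_names and column_names if the normalized names are preferred over the original ones.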

About

License: MIT License