zerolink-io / zsql-sqlite-dpo

DPO pairs for converting English text into SQL queries

Home Page:https://huggingface.co/datasets/zerolink/zsql-sqlite-dpo


zsql-sqlite-dpo

This is a dataset for training machine learning models to convert natural-language English text into SQL queries in the SQLite dialect.

This dataset comprises 200,000 DPO pairs curated to support the rapid development of text-to-SQL generation models. What distinguishes it is its optimization process: the "chosen" field of each pair contains a SQL query that has been canonicalized and optimized, and that was selected from a candidate set as the query minimizing syntactic cyclomatic and asymptotic complexity against the given schema.

Direct Preference Optimization (DPO; see Rafailov et al., 2023) is an approach to fine-tuning from positive and negative samples that modifies the behavior of large-scale unsupervised language models to align with human preferences. The method simplifies the fine-tuning process, making it more stable and computationally efficient without the need for extensive hyperparameter tuning or LM sampling, and has been shown to control model outputs effectively, matching or surpassing existing methods.
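As an illustrative sketch (not part of this dataset's tooling), the per-pair DPO objective described by Rafailov et al. can be computed from four summed sequence log-probabilities: those of the chosen and rejected responses under the model being trained and under a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    # Margin: how much more the policy prefers the chosen response over the
    # rejected one, measured relative to the frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2 (about 0.693). The loss falls below that value as the policy shifts probability toward the chosen query and rises above it when the policy favors the rejected one.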

The source data is cleaned and filtered based on the following criteria:

  • Remove queries which are not in English.
  • Remove queries which are not valid SQL queries.
  • Remove queries which are not executable against the given schema.
  • Remove queries which are executed against tables with non-Latin characters.
  • Remove queries which use features not supported by the given database.
  • Remove long queries which contain domain-specific knowledge likely to confuse the model.
  • Remove queries which do not fit within a 4096 token context window.
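The validity and executability criteria above could be checked along the following lines. This is a minimal sketch using Python's built-in sqlite3 module, not the pipeline actually used to build the dataset. It assumes the schema is a string of CREATE TABLE statements, and the helper name `is_executable` is hypothetical.

```python
import sqlite3


def is_executable(schema: str, query: str) -> bool:
    """Return True if `query` parses and runs against `schema` in SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        # Build the (empty) tables described by the schema.
        conn.executescript(schema)
        # Running the query surfaces syntax errors and unknown tables/columns.
        conn.execute(query).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


schema = "CREATE TABLE singer (id INTEGER, name TEXT);"
print(is_executable(schema, "SELECT name FROM singer WHERE id = 1"))
print(is_executable(schema, "SELECT age FROM singer"))
```

Because the tables are empty, the check only confirms that a query compiles and runs against the schema. It says nothing about whether the query returns the intended rows.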

Usage

To load the dataset using the HuggingFace datasets library:

from datasets import load_dataset

dataset = load_dataset("zerolink/zsql-sqlite-dpo")

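To use the dataset in model fine-tuning, apply the tokenizer's chat template as follows:

from transformers import AutoTokenizer

# `model` is the name or path of the base model being fine-tuned.
tokenizer = AutoTokenizer.from_pretrained(model)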

def tokenize(element):
    schema = element["schema"]
    question = element["question"]
    answer = element["chosen"]

    prompt = f"""
    Using the schema:
    {schema}
    Generate SQL for the following question:
    {question}
    """

    system = "Translate English to SQLite SQL."
    message = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer},
    ]
    output = tokenizer.apply_chat_template(
        message, add_generation_prompt=False, tokenize=True
    )
    return {"text": output}

Fields

The fields in this dataset are as follows:

Field Name   Description
schema       The schema of the database.
question     The natural language question.
chosen       The DPO preferred SQL query.
rejected     The DPO rejected SQL query.
weight       The weight of the query in the reward function.
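For preference training, the fields above can be mapped onto the prompt/chosen/rejected layout that DPO trainers such as the one in the TRL library commonly expect. The sketch below is one possible mapping under that assumption. The helper name `to_dpo_example` and the sample record are hypothetical and are not taken from the dataset.

```python
def to_dpo_example(element: dict) -> dict:
    """Map one dataset record to a prompt/chosen/rejected triple."""
    prompt = (
        "Using the schema:\n"
        f"{element['schema']}\n"
        "Generate SQL for the following question:\n"
        f"{element['question']}\n"
    )
    return {
        "prompt": prompt,
        "chosen": element["chosen"],
        "rejected": element["rejected"],
    }


# Made-up record, for illustration only.
record = {
    "schema": "CREATE TABLE singer (id INTEGER, name TEXT);",
    "question": "How many singers are there?",
    "chosen": "SELECT COUNT(*) FROM singer",
    "rejected": "SELECT COUNT(name) FROM singer GROUP BY id",
    "weight": 1.0,
}
example = to_dpo_example(record)
print(example["prompt"])
```

With the Hugging Face datasets library, a function like this could be applied to every record through `dataset.map(to_dpo_example)`.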

Sources

This dataset is derived from the following sources:

Source                  License       External Link
wikisql                 BSD 3-Clause  https://github.com/salesforce/WikiSQL
spider                  CC-BY-SA-4.0  https://huggingface.co/datasets/spider
sql_create_context      CC-BY-4.0     https://huggingface.co/datasets/b-mc2/sql-create-context
squall                  CC-BY-SA-4.0  https://github.com/tzshi/squall
sede                    Apache-2.0    https://github.com/hirupert/sede
nvbench                 MIT           https://github.com/TsinghuaDatabaseGroup/nvBench
imdb                    Not Found     https://github.com/jkkummerfeld/text2sql-data
advising                CC-BY-4.0     https://github.com/jkkummerfeld/text2sql-data
atis                    Not Found     https://github.com/jkkummerfeld/text2sql-data
restaurants             Not Found     https://github.com/jkkummerfeld/text2sql-data
scholar                 Not Found     https://github.com/jkkummerfeld/text2sql-data
yelp                    Not Found     https://github.com/jkkummerfeld/text2sql-data
academic                Not Found     https://github.com/jkkummerfeld/text2sql-data
criteria2sql            Apache-2.0    https://github.com/xiaojingyu92/Criteria2SQL
eICU                    CC-BY-4.0     https://github.com/glee4810/EHRSQL
mimic_iii               CC-BY-4.0     https://github.com/glee4810/EHRSQL
mimicsql_data           MIT           https://github.com/wangpinggl/TREQS
worldsoccerdatabase     CC-BY-SA-4.0  https://github.com/chiahsuan156/KaggleDBQA
whatcdhiphop            CC-BY-SA-4.0  https://github.com/chiahsuan156/KaggleDBQA
studentmathscore        CC-BY-SA-4.0  https://github.com/chiahsuan156/KaggleDBQA
pesticide               CC-BY-SA-4.0  https://github.com/chiahsuan156/KaggleDBQA
thehistoryofbaseball    CC-BY-SA-4.0  https://github.com/chiahsuan156/KaggleDBQA
uswildfires             CC-BY-SA-4.0  https://github.com/chiahsuan156/KaggleDBQA
geonucleardata          CC-BY-SA-4.0  https://github.com/chiahsuan156/KaggleDBQA
greatermanchestercrime  CC-BY-SA-4.0  https://github.com/chiahsuan156/KaggleDBQA

Composition:

[Figure: Composition]

License

This dataset is provided for academic and research purposes. Please adhere to the specified license terms and conditions for usage and distribution.