Awesome Text2SQL 🎉🎉🎉


English | 中文版

Curated tutorials and resources for Large Language Models, Text2SQL, and more.

🌱 How to Contribute

We warmly welcome contributions from everyone, whether you've found a typo or a bug, have a suggestion, or want to share a resource related to LLM+Text2SQL. For detailed guidelines on how to contribute, please see our CONTRIBUTING.md file.

📜 Contents

  • 👋 Introduction
  • 📖 Survey
  • 💬 Classic Model
  • 🔥 Base Model
  • 💡 Fine-tuning
  • 💪 Dataset
  • 🌈 Evaluation Index
  • 🔧 Practice Project
  • 🤝 Friendship Links

👋 Introduction

  • Text-to-SQL (or Text2SQL), as the name implies, converts text into SQL. A more academic definition: converting natural language questions about a database into structured query language that can be executed on a relational database. Text-to-SQL is therefore also abbreviated as NL2SQL. A minimal prompting sketch follows this list.
    • Input: a natural language question, such as "Query the relevant information of the table t_user, sort the results in descending order by id, and keep only the first 10 rows."
    • Output: SQL, such as "SELECT * FROM t_user ORDER BY id DESC LIMIT 10"
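
To make the input/output contract concrete, here is a minimal zero-shot prompting sketch in Python. The schema, the prompt template, and the `defog/sqlcoder-7b-2` checkpoint are illustrative assumptions, not the method of any particular paper below.

```python
# A minimal zero-shot Text-to-SQL sketch. Assumptions: the schema, the
# prompt template, and the "defog/sqlcoder-7b-2" checkpoint are illustrative
# choices, not the method of any particular paper in this list.
from transformers import pipeline

schema = "CREATE TABLE t_user (id INT PRIMARY KEY, name TEXT, created_at DATE);"
question = ("Query the relevant information of the table t_user, sort the "
            "results in descending order by id, and keep only the first 10 rows.")

# Schema-in-prompt: the model sees the DDL and the question, then completes the SQL.
prompt = (
    f"### Database schema\n{schema}\n\n"
    f"### Question\n{question}\n\n"
    "### SQL\n"
)

generator = pipeline("text-generation", model="defog/sqlcoder-7b-2")
result = generator(prompt, max_new_tokens=64, return_full_text=False)
print(result[0]["generated_text"])
# Expected, roughly: SELECT * FROM t_user ORDER BY id DESC LIMIT 10
```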

📖 Survey

  • (2023-International Conference on Very Large Data Bases, VLDB, CCF-A) A survey on deep learning approaches for text-to-SQL [paper]
  • (2022-IEEE Transactions on Knowledge and Data Engineering, TKDE, CCF-A) A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions [paper]
  • (2022-International Conference on Computational Linguistics, COLING, CCF-B) Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect [paper]
  • (2022-arXiv) Deep Learning Driven Natural Languages Text to SQL Query Conversion: A Survey [paper]

💬 Classic Model

  • (2023-arXiv, None) Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation [paper] [code]

  • (2023-arXiv, None) Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs [paper] [code]

  • (2023-arXiv, None) DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [paper] [code]

  • (2023-arXiv, None) A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability [paper] [code]

  • (2023-ICLR, CCF-A) Binding Language Models in Symbolic Languages [paper] [code]

  • (2023-ICASSP, CCF-B) T5-SR: A Unified Seq-to-Seq Decoding Strategy for Semantic Parsing [paper]

  • (2022-ACL, CCF-A) S^2SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers [paper]

  • (2022-NAACL, CCF-B) SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising [paper]

  • (2022-EMNLP, CCF-B) STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing [paper] [code]

  • (2022-EMNLP, CCF-B) RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL [paper] [code]

  • (2022-EMNLP, CCF-B) CQR-SQL: Conversational Question Reformulation Enhanced Context-Dependent Text-to-SQL Parsers [paper]

  • (2022-ACL, CCF-A) HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing [paper]

  • (2022-arXiv, None) Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [paper]

  • (2021-ACL, CCF-A) Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL [paper]

  • (2021-arXiv, None) Pay More Attention to History: A Context Modelling Strategy for Conversational Text-to-SQL [paper] [code]

  • (2021-ICLR, CCF-A) SCORE: Pre-training for Context Representation in Conversational Semantic Parsing [paper]

  • (2021-DASFAA, CCF-B) An Interactive NL2SQL Approach with Reuse Strategy [paper]

  • (2021-NAACL, CCF-B) Structure-Grounded Pretraining for Text-to-SQL [paper]

  • (2021-EMNLP, CCF-B) PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models [paper] [code]

  • (2021-ICLR, CCF-A) GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [paper] [code]

  • (2021-ACL, CCF-A) LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations [paper] [code]

  • (2020-EMNLP, CCF-B) Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [paper] [code]

  • (2020-ACL, CCF-A) TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [paper] [code]

  • (2020-ACL, CCF-A) RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers [paper] [code]

  • (2020-EMNLP, CCF-B) Mention Extraction and Linking for SQL Query Generation [paper]

  • (2020-EMNLP, CCF-B) IGSQL: Database Schema Interaction Graph Based Neural Model for Context-Dependent Text-to-SQL Generation [paper] [code]

  • (2020-arXiv, None) Hybrid Ranking Network for Text-to-SQL [paper] [code]

  • (2019-arXiv, None) X-SQL: reinforce schema representation with context [paper]

  • (2019-EMNLP, CCF-B) Global Reasoning over Database Structures for Text-to-SQL Parsing [paper] [code]

  • (2019-EMNLP, CCF-B) Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions [paper] [code]

  • (2019-ACL, CCF-A) Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing [paper] [code]

  • (2019-ACL, CCF-A) Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation [paper] [code]

  • (2018-EMNLP, CCF-B) SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task [paper] [code]

  • (2018-NAACL, CCF-B) TypeSQL: Knowledge-based Type-Aware Neural Text-to-SQL Generation [paper] [code]

  • (2017-arXiv, None) SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning [paper] [code]

🔥 Base Model

  • Llama [paper] [code] [model]

    • 2023/02, Meta AI proposes the open source LLM Llama, which has four scales: 7b, 13b, 33b, and 65b.
  • ChatGLM [paper] [code] [model]

    • 2023/03, Tsinghua University proposes ChatGLM, an open bilingual language model based on the General Language Model (GLM) framework, with the specification of 6b.
  • Alpaca [paper] [code] [model]

    • 2023/03, Stanford University proposes Alpaca, an open source LLM fine-tuned from the Llama 7b model. It has a single specification of 7b, and its training is simpler and cheaper.
  • Vicuna [paper] [code] [model]

    • 2023/03, UC Berkeley, CMU, and Stanford University propose Vicuna, an open source LLM based on the Llama model, with two specifications: 7b and 13b.
  • WizardLM [paper] [code] [model]

    • 2023/04, Peking University and Microsoft propose WizardLM, an LLM trained on evolved instructions (Evol-Instruct), with three specifications of 7b, 13b, and 30b. In 2023/06 they propose WizardMath, an LLM for the mathematics domain, and in 2023/08 WizardCoder, an LLM for the code domain.
  • Falcon [paper] [code] [model]

    • 2023/06, the Technology Innovation Institute (TII) of the United Arab Emirates proposes Falcon, an open source LLM trained solely on the RefinedWeb dataset, with four parameter specifications of 1b, 7b, 40b, and 180b. Notably, the 40b model outperforms the 65b LLaMA.
  • ChatGLM2 [paper] [code] [model]

    • 2023/06, Tsinghua University proposes ChatGLM2, the second-generation version of ChatGLM, with the specification of 6b; it has stronger performance, a longer context, more efficient inference, and a more open license.
  • Llama 2 [paper] [code] [model]

    • 2023/07, Meta AI proposes Llama 2, the second-generation Llama series of open-source LLMs. Compared with Llama 1, it uses 40% more training data and doubles the context length. The model has four specifications: 7b, 13b, 34b, and 70b, but 34b is not open source.
  • Code Llama [paper] [code] [model]

    • 2023/08, Meta AI proposes Code Llama, based on Llama 2. Code Llama reaches state-of-the-art performance among open models on several code benchmarks. There are foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct), with 7b, 13b, and 34b parameters each.

💡 Fine-tuning

  • P-Tuning [paper] [code]

    • 2021/03, Tsinghua University and others propose P-Tuning, a fine-tuning method for LLMs that uses trainable continuous prompt embeddings to reduce the cost of fine-tuning.
  • LoRA [paper] [code]

    • 2021/06, Microsoft proposes LoRA (Low-Rank Adaptation), which fine-tunes an LLM by freezing the pretrained weights and training small low-rank adapter matrices instead (a combined LoRA/QLoRA sketch follows this list).
  • P-Tuning V2 [paper] [code]

    • 2021/10, Tsinghua University proposes P-Tuning V2, an improved version of P-Tuning with better performance.
  • RLHF [paper] [code]

    • 2022/12, OpenAI uses the RLHF (Reinforcement Learning from Human Feedback) method to train ChatGPT, using human feedback signals to directly optimize the language model, with excellent performance.
  • QLoRA [paper] [code]

    • 2023/05, the University of Washington proposes QLoRA, which freezes the base model in 4-bit quantization and trains LoRA adapters on top of it, further reducing the cost of fine-tuning.
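
As a concrete illustration of the LoRA and QLoRA entries above, here is a minimal sketch with the Hugging Face peft and bitsandbytes libraries; the base model name and all hyperparameters are illustrative assumptions, not prescriptions.

```python
# A combined LoRA + QLoRA sketch using Hugging Face transformers, peft, and
# bitsandbytes. Assumptions: the base model name and all hyperparameters are
# illustrative; tune them for your task.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA idea: load the frozen base model in 4-bit NF4 quantization.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed base model
    quantization_config=bnb,
    device_map="auto",
)

# LoRA idea: keep the pretrained weights frozen; train small low-rank adapters.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of base weights
```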

💪 Dataset

  • WikiSQL [paper] [code] [dataset]

    • 2017/09, Salesforce proposes WikiSQL, a large Text-to-SQL dataset. The data comes from Wikipedia and belongs to a single domain; it contains 80,654 natural language questions and 77,840 SQL statements. The SQL statements are relatively simple and do not involve sorting, grouping, subqueries, or other complex operations.
  • Spider [paper] [code] [dataset]

    • 2018/09, Yale University proposes Spider, a Text-to-SQL dataset with multiple databases, multiple tables, and single-turn queries. It is widely recognized as the most difficult large-scale cross-domain benchmark in the field. It contains 10,181 natural language questions and 5,693 SQL statements, involving more than 200 databases across 138 domains, with difficulty levels of easy, medium, hard, and extra hard (see the loading sketch after this list).
  • CoSQL [paper] [code] [dataset]

    • 2019/09, Yale University and Salesforce Research propose a cross-domain database CoSQL, which consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains.
  • CHASE [paper] [code] [dataset]

    • 2021/08, Xi'an Jiaotong University and Microsoft propose CHASE, the first cross-domain, multi-turn Chinese Text-to-SQL dataset, containing 5,459 multi-turn question sequences and 17,940 <question, SQL> pairs.
  • BIRD-SQL [paper] [code] [dataset]

    • 2023/05, the University of Hong Kong and Alibaba propose BIRD, a large-scale cross-domain dataset containing 12,751 unique question-SQL pairs and 95 large databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.
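
For a quick look at the shape of this kind of data, the sketch below loads the Hugging Face Hub copy of Spider; the dataset ID and field names follow that copy and may differ from the original Yale release.

```python
# Peek at Spider via the Hugging Face `datasets` library. Assumption: the
# hub copy of the dataset ("spider") and its field names; the original Yale
# release is distributed differently.
from datasets import load_dataset

spider = load_dataset("spider", split="train")
example = spider[0]
print(example["db_id"])     # which database the question targets
print(example["question"])  # natural language question
print(example["query"])     # gold SQL annotation
```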

🌈 Evaluation Index

  • Execution Accuracy (EX) [paper]

    • Definition: the proportion of examples in the evaluation set for which the execution result of the generated SQL matches that of the gold SQL. The result may be overestimated, since different SQL queries can coincidentally return the same result.
  • Exact Match (EM) [paper]

    • Definition: the proportion of examples for which the generated SQL exactly matches the annotated SQL. The result may be underestimated, since semantically equivalent SQL can be written in many different forms. A simplified sketch of both metrics follows this list.
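
The sketch below implements simplified versions of both metrics against a SQLite database; real benchmark scripts, such as the Spider evaluator, normalize and compare queries much more carefully, so treat this as an approximation of the two definitions above.

```python
# Simplified EX and EM over a SQLite database. Real benchmarks (e.g. Spider)
# normalize queries and compare components more carefully; this is only an
# approximation of the two definitions above.
import sqlite3

def execution_accuracy(pairs, db_path):
    """EX: fraction of (predicted, gold) SQL pairs whose execution results match."""
    conn = sqlite3.connect(db_path)
    correct = 0
    for pred_sql, gold_sql in pairs:
        try:
            pred = set(map(tuple, conn.execute(pred_sql).fetchall()))
            gold = set(map(tuple, conn.execute(gold_sql).fetchall()))
            correct += int(pred == gold)
        except sqlite3.Error:
            pass  # an unexecutable prediction counts as wrong
    conn.close()
    return correct / len(pairs)

def exact_match(pairs):
    """EM: fraction of pairs whose SQL strings match after whitespace/case
    normalization; semantically equivalent rewrites still count as misses."""
    def norm(sql):
        return " ".join(sql.lower().split())
    return sum(norm(p) == norm(g) for p, g in pairs) / len(pairs)
```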

🔧 Practice Project

  • DB-GPT-Hub

    • The eosphoros organization's open source project focuses on Text-to-SQL fine-tuning based on LLMs, covering steps such as large model download, dataset preprocessing, fine-tuning techniques (e.g., LoRA and QLoRA), model prediction, and model evaluation.
  • sqlcoder

    • The Defog organization proposes sqlcoder, an advanced Text-to-SQL LLM with outstanding performance: it outperforms GPT-3.5, WizardCoder, StarCoder, and others, second only to GPT-4.
  • modal_finetune_sql

    • This project fine-tunes the Llama 2 7b model for Text-to-SQL and includes a complete training, fine-tuning, and evaluation pipeline.
  • LLaMA-Efficient-Tuning

    • An easy-to-use LLM fine-tuning framework (LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, ChatGLM2).

🤝 Friendship Links

  • eosphoros

    • They are a team of technology enthusiasts from internet companies and NLP graduate students who are passionate about open source projects. Their focus is on developing solutions that protect the privacy and security of databases and large language models. Their aim is to ensure that the abilities of these models remain absolutely private, secure, and under control.
  • Awesome-AIGC-Tutorials

    • Awesome AIGC Tutorials houses a curated collection of tutorials and resources spanning Large Language Models, AI Painting, and related fields, with in-depth insights and knowledge catered to both beginners and advanced AI enthusiasts.

License: MIT