leuchine / self_play_picard

Using self-play to augment multi-turn text-to-SQL datasets

This is the official implementation of the following paper:

Qi Liu, Zihuiwen Ye, Tao Yu, Phil Blunsom and Linfeng Song. Augmenting Multi-Turn Text-to-SQL Datasets with Self-Play.

About Self-Play for Text-to-SQL

The task of context-dependent text-to-SQL aims to convert multi-turn user utterances to formal SQL queries. This task is challenging due both to the scarcity of training data from which to learn complex contextual dependencies and to the need to generalize to unseen databases. In this paper, we explore augmenting the training datasets using self-play, which leverages contextual information to synthesize new interactions and adapt the model to new databases. We first design a SQL-to-text model conditioned on a sampled goal query, which represents a user's intent, that then converses with a text-to-SQL semantic parser to generate new interactions. We then filter the synthesized interactions and retrain the models with the augmented data. We find that self-play improves the accuracy of a strong baseline on SParC and CoSQL, two widely used cross-domain text-to-SQL datasets. Our analysis shows that self-play simulates various conversational thematic relations, enhances cross-domain generalization, and improves beam-search.

The implementation is based on PICARD, which we use as our baseline Text-to-SQL model.

Pulling the Docker Images

Training and evaluation are run inside Docker. Following PICARD, please first pull the image for training:

$ make pull-train-image

And then pull the image for evaluation:

$ make pull-eval-image

Generating Goal Query Templates

We use the method proposed in GAZP to generate goal queries, which are later used for synthetic interaction generation. Please follow their README to preprocess the data and generate templates using the code under the section "Generate data".
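The idea behind goal-query templates can be illustrated with a minimal sketch. The function name and the `{value}` placeholder below are illustrative only; GAZP's actual procedure is richer (it types slots by column and samples concrete values back in):

```python
import re

def sql_to_template(sql: str) -> str:
    """Abstract a concrete SQL query into a coarse template by masking
    literal values. A simplified, hypothetical stand-in for the GAZP
    procedure, which additionally types the slots."""
    sql = re.sub(r"'[^']*'", "{value}", sql)            # mask string literals
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "{value}", sql)  # mask numeric literals
    return sql

print(sql_to_template("SELECT name FROM singer WHERE age > 30"))
# SELECT name FROM singer WHERE age > {value}
```

Sampling such templates on a new database and filling their slots yields goal queries for the self-play step described below.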

Training

The scripts for training text-to-SQL models on CoSQL and SParC are launch_cosql.sh and launch_sparc.sh, respectively. Taking launch_sparc.sh as an example, you can run it with:

$ bash launch_sparc.sh


In launch_sparc.sh, there are four commands.

nohup make train_sparc

This trains a text-to-SQL model on the specified dataset (SParC), and saves the checkpoints in the output directory train_sparc under the seq2seq directory.

nohup make train_sql2text_sparc 

This trains a SQL-to-text model on the specified dataset (SParC), and saves the checkpoints in train_sql2text_sparc under the seq2seq directory.

nohup make self_play_sparc

This generates synthetic self-play examples: goal query templates are sampled using the method proposed in GAZP, and the trained text-to-SQL and SQL-to-text models then converse with each other.
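The conversation between the two models can be sketched as follows. Here `sql2text` and `text2sql` are hypothetical stand-ins for the trained models, not the repository's actual interfaces:

```python
def self_play_episode(goal_query, sql2text, text2sql, max_turns=5):
    """Generate one synthetic interaction: the SQL-to-text model produces
    a user utterance steering toward `goal_query`; the text-to-SQL parser
    predicts SQL given the dialogue history; the episode ends once the
    prediction matches the goal or the turn budget is exhausted."""
    history = []
    for _ in range(max_turns):
        utterance = sql2text(goal_query, history)  # simulated user turn
        history.append(utterance)
        predicted = text2sql(history)              # parse in context
        if predicted == goal_query:                # goal reached
            return history, predicted
    return None  # unsuccessful episodes can be filtered out later

# Toy stand-ins to illustrate the control flow only.
goal = "SELECT name FROM singer"
sql2text = lambda g, h: f"question {len(h) + 1}"
text2sql = lambda h: goal if len(h) >= 2 else "SELECT *"

turns, sql = self_play_episode(goal, sql2text, text2sql)
print(len(turns), sql)  # 2 SELECT name FROM singer
```

Episodes that never reach the goal query are discarded, which is the filtering step mentioned in the paper abstract.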

nohup make train_sparc_self_play

This retrains the text-to-SQL and SQL-to-text models on the generated self-play interactions together with the original training data (SParC). The resulting checkpoints are saved in train_sparc_self_play under the seq2seq directory.
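The filter-then-retrain data flow amounts to the following sketch; the function name and the `keep` predicate are illustrative, and the repository's actual filtering lives inside the make targets above:

```python
def augment(original, synthetic, keep):
    """Keep only the synthetic interactions passing the filter `keep`
    (e.g. the parser reproduced the goal query), then append them to
    the original training set. A hypothetical sketch of the data flow."""
    return original + [ex for ex in synthetic if keep(ex)]

train = [{"dialogue": "d1", "goal_reached": True}]
generated = [{"dialogue": "g1", "goal_reached": True},
             {"dialogue": "g2", "goal_reached": False}]
augmented = augment(train, generated, keep=lambda ex: ex["goal_reached"])
print(len(augmented))  # 2
```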

Evaluation

To evaluate the models trained on SParC, please run:

$ make eval_sparc

To evaluate the models trained on CoSQL, please run:

$ make eval_cosql

License: Apache License 2.0

