This is a portable Gymnasium environment for SQLite databases. It is designed for platforms that cannot use Docker (e.g., users without root privileges).
Simply pip install sqlgym. If you want to generate a ReAct dataset and fine-tune a model, please clone the repository and install from source.
# Clone this repository
git clone https://github.com/KYLN24/sqlgym.git
# or via SSH
# git clone git@github.com:KYLN24/sqlgym.git
cd sqlgym
# Install this package
pip install ".[sft]"
# Make a directory to save data
mkdir .data
cd .data
This project currently supports the BIRD-SQL dataset.
mkdir bird
cd bird
# Download BIRD-SQL Dataset
wget -c https://bird-bench.oss-cn-beijing.aliyuncs.com/train.zip
unzip train.zip
cd train
unzip train_databases.zip
cd ..
wget -c https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip
unzip dev.zip
cd dev
unzip dev_databases.zip
cd ..
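Before moving on, it can help to confirm that the archives unpacked where the later steps expect them. The following is a minimal sketch, not part of sqlgym: the helper name `missing_bird_paths` is hypothetical, and the expected sub-directories are an assumption inferred from the unzip commands above. Run it from the repository root.

```python
from pathlib import Path

# Assumed layout, inferred from the unzip commands above.
# Adjust these entries if the archives unpack differently on your machine.
EXPECTED = ("train/train_databases", "dev/dev_databases")


def missing_bird_paths(bird_path, expected=EXPECTED):
    """Return the expected sub-directories that are absent under bird_path."""
    root = Path(bird_path)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_bird_paths(".data/bird")
    print("missing:", missing if missing else "none")
```

An empty result means both database folders were found; anything listed points at an unzip step that should be repeated.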
from sqlgym import SqlGymEnv
from sqlgym.datasets import BirdDataset
dataset = BirdDataset(
bird_path=".data/bird",
mode="dev",
)
env = SqlGymEnv(dataset)
print(env.reset(0))
print(env.step(dataset[0].gt))
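To illustrate the general idea of an environment whose actions are SQL queries, here is a toy sketch built only on the standard library. It is not sqlgym's implementation: the class `ToySqlEnv`, its constructor arguments, and the reward rule (1.0 when the query returns the same rows as the ground-truth query) are all assumptions made for illustration. Only the return shapes of `reset` and `step` follow the Gymnasium convention.

```python
import sqlite3


class ToySqlEnv:
    """Hypothetical stand-in for illustration; not sqlgym's SqlGymEnv."""

    def __init__(self, schema_sql, question, gt_sql):
        self.schema_sql = schema_sql  # statements that build and fill the database
        self.question = question      # natural-language task shown to the agent
        self.gt_sql = gt_sql          # ground-truth query used for scoring
        self.conn = None

    def reset(self):
        # Start every episode from a fresh in-memory database.
        if self.conn is not None:
            self.conn.close()
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(self.schema_sql)
        return self.question, {}

    def _run(self, sql):
        # Sort by repr so the comparison ignores row order.
        return sorted(self.conn.execute(sql).fetchall(), key=repr)

    def step(self, action):
        try:
            rows = self._run(action)
        except sqlite3.Error as exc:
            # Invalid SQL: return the error text as the observation, no reward.
            return str(exc), 0.0, False, False, {}
        correct = rows == self._run(self.gt_sql)
        # (observation, reward, terminated, truncated, info)
        return rows, float(correct), correct, False, {}


env = ToySqlEnv(
    "CREATE TABLE t (id INTEGER, name TEXT);"
    "INSERT INTO t VALUES (1, 'a'), (2, 'b');",
    "How many rows are in t?",
    "SELECT COUNT(*) FROM t",
)
observation, info = env.reset()
print(observation)
print(env.step("SELECT COUNT(id) FROM t"))
```

Returning the SQLite error message as an observation, rather than raising, lets an agent read the error and retry within the same episode.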
You can use scripts/make_datasets.py to generate an SFT dataset.
python -u scripts/make_datasets.py --bird_path=.data/bird # Dataset will be created at .data/bird/train.jsonl and .data/bird/dev.jsonl
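To check what the script produced, you can look at the first few records of the output. This is a generic JSON Lines reader, sketched under the assumption that each line of the file is one JSON object; the helper name `peek_jsonl` is hypothetical, and the record fields are not documented here, so the example only prints whatever keys it finds.

```python
import json
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return the first n non-empty records of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        non_empty = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_empty, n)]


if __name__ == "__main__":
    path = Path(".data/bird/train.jsonl")
    if path.exists():
        for record in peek_jsonl(path):
            # Assumes each record is a JSON object; print its field names only.
            print(sorted(record))
    else:
        print(f"{path} not found; generate the dataset first")
```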
You can use scripts/make_react_dataset.py to convert it to ReAct format, with thoughts generated by GPT.
# Edit the script to add your OpenAI api_key.
# Change base_url and other generation parameters as you wish.
python -u scripts/make_react_dataset.py \
--data_path=.data/bird/train.jsonl \
--save_path=.data/bird/train_react.jsonl
Then, use scripts/train.py or scripts/train_react.py to fine-tune a chat model. The tokenizer must support the apply_chat_template method.
torchrun --nproc_per_node=8 scripts/train.py \
--model=meta-llama/Llama-2-7b-chat-hf \
--train_set=.data/bird/train.jsonl \
--output_dir=.data/output
torchrun --nproc_per_node=8 scripts/train.py \
--model=meta-llama/Llama-2-7b-chat-hf \
--train_set=.data/bird/train_react.jsonl \
--output_dir=.data/output \
--react