This project benchmarks large language models (LLMs) on various Text-to-SQL datasets. It is easy to use and works "out of the box" for any LLM that can be loaded as a transformers module. We use the llama-index framework to facilitate the integration and testing of models across standardized datasets such as WikiSQL and BIRD-SQL.
- Ease of Use: Simple setup and execution to facilitate quick benchmarking.
- Flexibility: Supports a wide range of LLMs compatible with transformer architectures.
- Extensibility: Easily add new datasets by providing a corresponding configuration file.
We perform zero-shot prediction; the current results are listed in the Text-to-SQL LeaderBoard.
Evaluation is currently in progress on the following models; results are coming soon:
- SQLCoder-7B-2
- GLM4-9B-Chat
- The datasets that you wish to test on
- The LLM model directory
I built this version with Python 3.11; it may also run on other recent Python versions, but this is untested.
Clone the repository and install the required Python packages:
```bash
git clone https://github.com/Nutingnon/text-to-sql-benchmark.git
cd text-to-sql-benchmark
pip install -r requirements.txt
```
To start benchmarking, run the following command:
```bash
python benchmark.py --config configs/your_config_file.yaml
```
Replace your_config_file.yaml with the path to your dataset configuration file. We provide three configuration files, for WikiSQL, BIRD-SQL, and Kaggle-DBQA.
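For orientation, a dataset configuration file might look like the sketch below. The field names here are illustrative assumptions, not the project's actual schema; check the provided files under configs/ for the real keys.

```yaml
# Hypothetical configuration sketch -- field names are assumptions,
# not the project's actual schema; see the files under configs/.
dataset:
  name: wikisql
  path: ./datasets/wikisql          # root directory of the dataset
model:
  name_or_path: /path/to/your/llm   # local LLM model directory
  max_new_tokens: 256
output:
  predictions_path: ./predictions/wikisql/predictions.jsonl
```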
Because test-suite-sql-eval lacks maintenance, it cannot correctly parse many SQL keywords, such as CAST and UNION ALL. To evaluate the models' predictions, I modified the evaluation script from BIRD-SQL:
```bash
python evaluation.py \
    --predicted_sql_path <your_sql_path> \
    --ground_truth_path <ground_truth_path> \
    --db_root_path <db_root_path> \
    --pred_key <pred_key> \
    --gold_key <gold_key>
```
For example, to run the BIRD-SQL evaluation:
```bash
python evaluation.py \
    --predicted_sql_path ./predictions/bird-minidev/predictions_Codellama-34B-Instruct-hf.jsonl \
    --ground_truth_path ./datasets/minidev/MINIDEV/mini_dev_sqlite.json \
    --db_root_path ./datasets/minidev/MINIDEV/dev_databases \
    --pred_key predict_query \
    --gold_key SQL
```
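Judging from the flags above, the prediction file is a JSONL file where each line carries the predicted query under the pred_key (here, predict_query). Below is a minimal sketch of writing such a file under that assumption; the exact required fields may differ, so check the modified evaluation.py for what it actually reads.

```python
import json

# Hypothetical example: write predictions in the JSONL layout implied by
# --pred_key predict_query. The exact field set is an assumption.
predictions = [
    {"question_id": 0, "predict_query": "SELECT COUNT(*) FROM users"},
    {"question_id": 1, "predict_query": "SELECT name FROM city WHERE pop > 100000"},
]

with open("./predictions/example/predictions.jsonl", "w") as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")
```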
Currently, we support (that is, have tested on) the following datasets:
If you want to test models on additional datasets, please make sure the dataset is well organized; you can refer to the preprocessing code from UNITE.
We welcome contributions to improve the project. If you're interested in enhancing the functionality or adding support for additional datasets, please fork the repository and submit a pull request. I am looking forward to contributions in the following areas:
- Accelerating prediction (see the batching sketch after this list)
  - Larger batch sizes
  - Multi-GPU usage
- Difficulty classification for comprehensive evaluation (see the heuristic sketch after this list)
  - Evaluate the difficulty of a SQL query along various dimensions. The classification rules can refer to these two papers:
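One way to speed up prediction is batched generation with the model sharded across GPUs. The sketch below uses Hugging Face transformers; the model path, prompts, and batch size are placeholders, and the project's actual generation loop may look different.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/your/llm"  # placeholder: local LLM model directory
tokenizer = AutoTokenizer.from_pretrained(model_dir, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without a pad token
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,
    device_map="auto",  # shard the model across all visible GPUs
)

prompts = ["-- question 1 ...", "-- question 2 ...", "-- question 3 ..."]
batch_size = 8  # larger batches amortize per-call overhead

for start in range(0, len(prompts), batch_size):
    batch = prompts[start:start + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```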
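For difficulty classification, a simple feature-counting heuristic can serve as a starting point. This is purely an illustrative sketch under my own assumptions (the feature list and thresholds are made up); the actual classification rules should follow the two papers mentioned above.

```python
import re

# Illustrative heuristic only -- features and thresholds are assumptions,
# not the rules from the referenced papers.
FEATURES = ["JOIN", "GROUP BY", "ORDER BY", "HAVING", "UNION",
            "INTERSECT", "EXCEPT", "CASE", "EXISTS"]

def sql_difficulty(sql: str) -> str:
    upper = sql.upper()
    # Count structural features plus nested subqueries.
    score = sum(len(re.findall(rf"\b{re.escape(f)}\b", upper)) for f in FEATURES)
    score += upper.count("(SELECT")
    if score == 0:
        return "easy"
    if score <= 2:
        return "medium"
    return "hard"

print(sql_difficulty("SELECT name FROM city WHERE pop > 100000"))   # easy
print(sql_difficulty("SELECT a.x FROM a JOIN b ON a.id = b.id "
                     "GROUP BY a.x HAVING COUNT(*) > 1"))           # hard
```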
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to the following projects and resources, which provided the code base for this project:
- UNITE: A Unified Benchmark for Text-to-SQL Evaluation
- test-suite-sql-eval
- BIRD-DEV
- Contributors and users who have provided valuable feedback and suggestions.