
Text-to-SQL Benchmarking for LLMs

Overview

This project benchmarks large language models (LLMs) on various Text-to-SQL datasets. It is easy to use and works out of the box for any LLM that can be loaded as a transformers module. We use the llama-index framework to integrate and test models across standardized datasets such as WikiSQL and BIRD-SQL.
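
For reference, this is roughly what loading a local model through llama-index looks like. It is a minimal sketch, not the exact code in benchmark.py; the model path, generation settings, and prompt are illustrative assumptions.

from llama_index.llms.huggingface import HuggingFaceLLM  # pip install llama-index-llms-huggingface

# Wrap a local transformers checkpoint. The path is an assumption; point it
# at your own model directory (see Prerequisites below).
llm = HuggingFaceLLM(
    model_name="./models/sqlcoder-7b-2",
    tokenizer_name="./models/sqlcoder-7b-2",
    context_window=4096,
    max_new_tokens=256,
)

# Ask the model to translate a natural-language question into SQL.
prompt = (
    "Given the table employees(id, name, salary), "
    "write a SQL query that returns the name of the highest-paid employee."
)
print(llm.complete(prompt).text)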

Features

  • Ease of Use: Simple setup and execution to facilitate quick benchmarking.
  • Flexibility: Supports a wide range of LLMs compatible with transformer architectures.
  • Extensibility: Easily add new datasets by providing a corresponding configuration file.

Current Results

We perform zero-shot prediction; the current results are summarized in the leaderboard below.

[Figure: Text-to-SQL leaderboard]

Evaluation is in progress for the following models; results will be added soon:

  • SQLCoder-7B-2
  • GLM4-9B-Chat

Getting Started

Prerequisites

  • The datasets you wish to test on
  • The local directory of the LLM you want to benchmark

Installation

This project was built with Python 3.11; it should also run on any Python >= 3.8.

Clone the repository and install the required Python packages:

git clone https://github.com/Nutingnon/text-to-sql-benchmark.git
cd text-to-sql-benchmark
pip install -r requirements.txt

Usage

I. Generate predictions for a dataset

To start benchmarking, run the following command:

python benchmark.py --config configs/your_config_file.yaml

Replace your_config_file.yaml with the path to your dataset configuration file. We provide three configuration files, for WikiSQL, BIRD-SQL, and Kaggle-DBQA; a sketch of what such a config might contain is shown below.
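
The following is an illustration only: the key names are hypothetical, so consult the files shipped under configs/ for the actual schema.

import yaml  # pip install pyyaml

# Hypothetical config layout -- the real key names live in configs/.
config_text = """
dataset_name: wikisql                # which benchmark to run
data_path: ./datasets/wikisql        # where the dataset files live
model_path: ./models/sqlcoder-7b-2   # local LLM directory
output_path: ./predictions/wikisql   # where predictions are written
"""

config = yaml.safe_load(config_text)
print(config["dataset_name"], config["model_path"])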

II. Evaluation

Because test-suite-sql-eval is no longer actively maintained, it fails to parse many SQL keywords, such as CAST and UNION ALL. To evaluate a model's predictions, I modified the evaluation script from BIRD-SQL:

python evaluation.py \
    --predicted_sql_path <your_sql_path> \
    --ground_truth_path <ground_truth_path> \
    --db_root_path <db_root_path> \
    --pred_key <pred_key> \
    --gold_key <gold_key>

For example, to run the BIRD-SQL evaluation:

python evaluation.py \
    --predicted_sql_path ./predictions/bird-minidev/predictions_Codellama-34B-Instruct-hf.jsonl \
    --ground_truth_path ./datasets/minidev/MINIDEV/mini_dev_sqlite.json \
    --db_root_path ./datasets/minidev/MINIDEV/dev_databases \
    --pred_key predict_query \
    --gold_key SQL
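
For context, execution-based evaluation of this kind boils down to running the predicted and gold queries against the same SQLite database and comparing result sets. Here is a minimal sketch, assuming SQLite databases as in BIRD-SQL; the function name, the set-based comparison, and the database path are simplifications and assumptions, not the exact logic of the modified evaluation.py.

import sqlite3

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    # Run both queries against the same database and compare results.
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    finally:
        conn.close()
    # Order-insensitive comparison; the real script may handle ordering,
    # duplicates, and timeouts differently.
    return set(pred_rows) == set(gold_rows)

print(execution_match(
    "SELECT name FROM employees ORDER BY salary DESC LIMIT 1",
    "SELECT name FROM employees WHERE salary = (SELECT MAX(salary) FROM employees)",
    "./dev_databases/company/company.sqlite",  # illustrative path
))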

Supported Datasets

Currently, we support (that is, have tested on) the following datasets: WikiSQL, BIRD-SQL, and Kaggle-DBQA.

If you want to test models on additional datasets, make sure the dataset is properly organized; you can refer to the preprocessing code from UNITE. A sketch of a BIRD-style ground-truth record is shown below.
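
For orientation, a BIRD-style ground-truth record looks roughly like the following. Only the SQL key is confirmed by the --gold_key flag used above; the other field names are assumptions based on the public BIRD format.

import json

# Hypothetical ground-truth record; only the "SQL" key is confirmed by the
# --gold_key flag above, the other field names are assumptions.
example = {
    "question": "Which employee earns the most?",
    "db_id": "company",
    "SQL": "SELECT name FROM employees ORDER BY salary DESC LIMIT 1",
}
print(json.dumps(example, indent=2))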

Call for Contributions

We welcome contributions to improve the project. If you're interested in enhancing the functionality or adding support for additional datasets, please fork the repository and submit a pull request. I am especially looking forward to contributions in the following areas:

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements
