This repository contains the implementation of LIMAO on Balsa, a Lifelong Modular Reinforcement Learning framework designed for database query optimization in dynamic environments. The codebase currently supports experiments on IMDB and TPC-H workloads, but you are welcome to extend it to additional workloads. Due to time constraints, the repository might still have some inconsistencies. Feel free to reach out to the authors for assistance!
LIMAO integrates reinforcement learning to optimize database queries dynamically under workload, data volume, and both workload with data volume switching scenarios. This repository enables users to reproduce the experiments conducted in the associated paper.
- Operating System: Linux (CloudLab c240g5 machine recommended, if you use your own mahcines, please make some customizations to the setup)
- Python: 3.8
- PostgreSQL: 12.5
- Anaconda: Anaconda3-2023.09
Note: Additional configurations might be needed for using a different system.
-
Prepare the storage environment:
sudo /usr/local/etc/emulab/mkextrafs.pl -f /mydata sudo chmod 777 /mydata
-
Install Anaconda:
cd /mydata curl -O https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh bash ./Anaconda3-2023.09-0-Linux-x86_64.sh rm ./Anaconda3-2023.09-0-Linux-x86_64.sh
-
Clone the repository:
git clone [email protected]:Tsihan/LIMAOLifeLongRLDB.git -b final_switching_workload
Note: This branch aims to replay the result in workload switching situation using LIMAO, if you want to see other dynamic situations, you may use other branches like final_switching_dbvolume, final_switching_workload_dbvolume, .etc. The prefix of each branch like "final", "original", "treedecompose" reflect different variants of the prototype evolving from Balsa to the final form of LIMAO implemented in Balsa.
-
Set up the Python environment:
conda create -n limao python=3.8 -y conda activate limao pip install -e . pip install -e pg_executor pip install -r requirements.txt
-
Install PostgreSQL 12.5:
wget https://ftp.postgresql.org/pub/source/v12.5/postgresql-12.5.tar.gz tar xzvf postgresql-12.5.tar.gz cd postgresql-12.5 ./configure --prefix=/mydata/postgresql-12.5 --without-readline sudo make -j sudo make install echo 'export PATH=/mydata/postgresql-12.5/bin:$PATH' >> ~/.bashrc source ~/.bashrc
-
Install
pg_hint_plan
extension:git clone https://github.com/ossc-db/pg_hint_plan.git -b REL12_1_3_7 cd pg_hint_plan # Modify Makefile: Set PG_CONFIG to /mydata/postgresql-12.5/bin/pg_config make sudo make install
-
Download IMDB dataset:
mkdir -p /mydata/datasets/job && cd /mydata/datasets/job wget -c http://homepages.cwi.nl/~boncz/job/imdb.tgz tar -xvzf imdb.tgz
-
Prepend headers to CSV files:
- Activate the LIMAO environment:
conda activate limao
- Run the script:
python3 /mydata/LIMAOLifeLongRLDB/scripts/prepend_imdb_headers.py
- Modify the script to set the correct directory:
flags.DEFINE_string('csv_dir', '/mydata/datasets/job', 'Directory to IMDB CSVs.')
- Activate the LIMAO environment:
-
Initialize and start PostgreSQL:
pg_ctl -D /mydata/databases initdb cp /mydata/LIMAOLifeLongRLDB/conf/balsa-postgresql.conf /mydata/databases/postgresql.conf pg_ctl -D /mydata/databases start -l logfile
-
Load IMDB dataset:
bash load-postgres/load_job_postgres.sh /mydata/datasets/job
Note: For setting up the TPC-H workload, please refer to the following Google Drive link.
- IMDB Set 1:
python3 run.py --run Balsa_JOBRandSplit_IMDB_assorted_3 --local
- IMDB Set 2:
python3 run.py --run Balsa_JOBRandSplit_IMDB_assorted_4 --local
- Switching Workload:
python3 run.py --run Balsa_JOBRandSplit_IMDB_assorted_Replay_2 --local
- TPCH Set 1:
python3 run.py --run Balsa_JOBRandSplit_TPCH10_assorted --local
- TPCH Set 2:
python3 run.py --run Balsa_JOBRandSplit_TPCH10_assorted_2 --local
- Switching Workload:
python3 run.py --run Balsa_JOBRandSplit_TPCH10_assorted_Replay --local
-
Modify parameters in the following scripts as needed:
balsa/optimizer.py
(class initialization)balsa/util/postgres.py
(PostgreSQL connection)run.py
:_MakeWorkload(self, is_origin=False)
RunBaseline(self)
sim.py
:Params(cls)
_SimulationDataPath(self)
_FeaturizedDataPath(self)
-
Update timezones in
balsa-postgresql.conf
if required.
This project is licensed under the MIT License. See the LICENSE file for details.
Note: If any steps are unclear or additional details are required, please feel free to raise an issue in this repository.