This repository contains the code for the airflowHPC project, a proof of concept for using Apache Airflow to manage HPC jobs, in particular complex molecular dynamics simulation workflows.
git clone https://github.com/ejjordan/airflowHPC.git
export AIRFLOW_HOME=$PWD/airflow_dir
python3 -m venv airflowHPC_env
source airflowHPC_env/bin/activate
pip install --upgrade pip
pip install -e airflowHPC/
pip install -r airflowHPC/requirements.txt
It may be necessary to first install gmxapi manually as described in the gmxapi installation instructions.
gmxapi_ROOT=/path/to/gromacs pip install --no-cache-dir gmxapi
This demo shows how to run grompp and mdrun in a simple Airflow DAG. It should complete in less than a minute.
airflow users create --username admin --role Admin -f firstname -l lastname -e [email protected]
export AIRFLOW__CORE__EXECUTOR=airflowHPC.executors.resource_executor.ResourceExecutor
export AIRFLOW__CORE__DAGS_FOLDER=airflowHPC/airflowHPC/dags/
export AIRFLOW__CORE__LOAD_EXAMPLES=False
export AIRFLOW__WEBSERVER__DAG_DEFAULT_VIEW="graph"
airflow standalone
This will start a webserver on port 8080 where you can trigger Airflow DAGs. The first command will prompt you to enter a password for the admin user; you will then use this username and password to log in to the webserver.
On the webserver, navigate to the DAGs tab and click the run_gmx DAG. If you press the play button in the upper right corner, you can specify the output directory (or use the default value) and trigger the DAG. The output files are written by default to a directory called outputs, though this is a DAG parameter which can be changed in the browser before triggering the DAG. Note that if you run the run_gmx DAG multiple times with the same output directory, the files will be overwritten. The output directory will be located in the directory where you ran the airflow standalone command.
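DAG parameters can also be supplied when triggering from the command line with airflow dags trigger. A minimal sketch, assuming the run_gmx output-directory parameter is named output_dir (check the DAG definition for the actual parameter name):
airflow dags trigger run_gmx --conf '{"output_dir": "outputs_run2"}'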
On the webserver from the simple demonstration, you may have noticed warnings that the SequentialExecutor and SQLite database should not be used in production. Detailed instructions for setting up a database connection are available in the Airflow documentation.
MySQL is available from package managers, for example via apt on Ubuntu.
sudo apt install mysql-server
You can then set up the database with the following commands.
sudo systemctl start mysql
sudo mysql --user=root --password=root -e "CREATE DATABASE airflow_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
sudo mysql --user=root --password=root -e "CREATE USER 'airflow_user' IDENTIFIED BY 'airflow_pass';"
sudo mysql --user=root --password=root -e "GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow_user'"
sudo mysql --user=root --password=root -e "FLUSH PRIVILEGES;"
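After creating the database and user, Airflow must be pointed at them through its connection string. A minimal sketch, assuming the credentials above and that a MySQL driver such as mysqlclient is installed in your virtual environment:
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="mysql+mysqldb://airflow_user:airflow_pass@localhost/airflow_db"
airflow db init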
If you don't have root access, you can use the PostgreSQL instructions below, which do not require it.
It is possible to install PostgreSQL on HPC resources with spack.
spack install postgresql
Now to set up the database, you can use the following commands.
spack load postgresql
mkdir postgresql_db
initdb -D postgresql_db/data
pg_ctl -D postgresql_db/data -l logfile start
Note that you will need to run the pg_ctl command every time you want to use the database, for example after logging out and back in to the HPC resource.
The airflow database can then be set up with the following commands.
createdb -T template1 airflow_db
psql airflow_db
In the psql shell, you can then run the following commands.
CREATE USER airflow_user WITH PASSWORD 'airflow_pass';
GRANT ALL PRIVILEGES ON DATABASE airflow_db TO airflow_user;
GRANT ALL ON SCHEMA public TO airflow_user;
ALTER USER airflow_user SET search_path = public;
quit
You may also need to add the PostgreSQL library directory to the LD_LIBRARY_PATH environment variable.
export LD_LIBRARY_PATH=/path/to/your/postgresql/lib:$LD_LIBRARY_PATH
Once you have a database set up and configured according to the instructions, you can install the airflowHPC postgresql requirements.
pip install airflowHPC[postgresql]
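With the postgresql extra installed, you can point Airflow at the database created above and initialize the schema. A minimal sketch using the example credentials from the psql commands:
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:airflow_pass@localhost/airflow_db"
airflow db init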
You can then launch the standalone webserver with the prepare_standalone.sh script. This script will use the ResourceExecutor, which is vended by the get_provider_info() function in airflowHPC/__init__.py.
bash airflowHPC/scripts/prepare_standalone.sh
Note that the script can be run from any directory, but it should not be moved from its original location in the repository. The script should work without needing to configure any environment variables, as they are set in the script itself, and its help output should provide enough information to run it.
After configuring a Python virtual environment and database as described above, you can run a DAG using the Airflow CLI. Note that you must configure the Airflow environment variables as described above so that the CLI can find the DAGs and the database. You also need to have set up the Airflow configuration by, for example, running the airflow standalone or airflow db init command at least once.
airflow dags backfill -s YYYY-MM-DD --reset-dagruns -y run_gmx
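For example, assuming the PostgreSQL setup above, a minimal environment for a CLI run might look like the following (adjust the paths and start date as needed):
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:airflow_pass@localhost/airflow_db"
export AIRFLOW__CORE__EXECUTOR=airflowHPC.executors.resource_executor.ResourceExecutor
export AIRFLOW__CORE__DAGS_FOLDER=airflowHPC/airflowHPC/dags/
export AIRFLOW__CORE__LOAD_EXAMPLES=False
airflow dags backfill -s 2024-06-14 --reset-dagruns -y run_gmx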
Below is an example Slurm script for running a DAG on an HPC resource.
#!/bin/bash -l
#SBATCH --account=your_account
#SBATCH --job-name=airflow
#SBATCH --partition=gpu
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
source /path/to/spack/share/spack/setup-env.sh
spack load [email protected]
source /path/to/pyenvs/spack_py/bin/activate
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:airflow_pass@localhost/airflow_db"
export AIRFLOW__CORE__EXECUTOR=airflowHPC.executors.resource_executor.ResourceExecutor
export AIRFLOW__CORE__LOAD_EXAMPLES=False
export AIRFLOW__CORE__DAGS_FOLDER="/path/to/airflowHPC/airflowHPC/dags/"
export AIRFLOW__HPC__CORES_PER_NODE=128
export AIRFLOW__HPC__GPUS_PER_NODE=8
export AIRFLOW__HPC__GPU_TYPE="nvidia"
export AIRFLOW__HPC__MEM_PER_NODE=256
export AIRFLOW__HPC__THREADS_PER_CORE=2
export RADICAL_UTILS_NO_ATFORK=1
export LD_LIBRARY_PATH=/path/to/postgresql/lib/:$LD_LIBRARY_PATH
pg_ctl -D /path/to/databases/postgresql/data/ -l /path/to/databases/postgresql/server.log start
module load gromacs
airflow dags backfill -s 2024-06-14 -v --reset-dagruns -y gmx_multi
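The script can then be submitted in the usual way, assuming here that it is saved as submit_airflow.sh:
sbatch submit_airflow.sh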
Instructions for debugging a DAG are available in the Airflow documentation. It is also possible to debug an operator, task, or DAG by setting the AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT environment variable to a large value (in seconds) and dropping a set_trace() statement in the code.
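As a sketch, you could raise the import timeout and run the DAG in a single local process with airflow dags test, which will stop at any set_trace() breakpoint you have inserted:
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=3600
airflow dags test run_gmx 2024-06-14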
Airflow ships with a command line tool for clearing the database.
airflow db clean --clean-before-timestamp 'yyyy-mm-dd'
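For example, to purge metadata older than a given date without the interactive confirmation prompt (the --yes flag is assumed to be available in your Airflow version):
airflow db clean --clean-before-timestamp '2024-01-01' --yes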
The DAG runs can also be deleted from the webserver.