This repository is the codebase for the distributed version of C3, based on Orca.
- Create a CloudLab job with 17 nodes (16 nodes for actors, 1 node for the learner) using the `orca` profile (which will have the `linux-learner` image preinstalled once the nodes start).
- Run `./cloudlab/config.sh` to set up the nodes (run this from YOUR machine).
- Next, run `./cloudlab/setup_params.sh` to generate `rl-module/params_distributed.json` (run this on node0 on CloudLab). A consolidated sketch of this setup sequence appears after this list.
- Once this is done, use `v9_multi_train.sh` to start the training job; do this within a `tmux` session (run this on CloudLab node0, as shown in the sketch after this list).
- `ssh` into the actor nodes and look at `~/actor_logs/` to see the `stdout` (and `stderr`) of each actor (see the log-inspection sketch after this list). Look inside `~/ConstrainedOrca/rl-module/training_log` for the training log.
- Once training is done, use `scripts/collate_train_files.sh` to collect everything into one place. Back up the folder it generates for future use.
- Move the checkpoint you want to evaluate into `~/ConstrainedOrca/rl-module/train_dir/seed0/`. When you run `ls` inside this `seed0` directory, it should show one directory that looks something like `learner0-v9_actorNum256_multi_lambda0.0_ksymbolic5_k1_raw-sym_threshold25_seed0/` (see the checkpoint-staging sketch after this list).
- Run `./scripts/eval_orca.sh <model_name> <trace_dir> <results_dir> <start_run> <end_run> <constraints_id>` (an example invocation appears after this list). `<trace_dir>` is `/proj/verifiedmlsys-PG0/ConstrainedOrca/sage_traces/traces` for SAGE traces. `<results_dir>` is `/proj/verifiedmlsys-PG0/sigcomm_results/new_result_dir/constraint_id_*`.
- Use `cd scripts && ./process_down_file.sh` to trim the result files.
- Use `./scripts/plots/plot_thr.py` for motivation figures.
- Use `./scripts/plots/plot_thr_delay.py` for throughput-vs-delay plots.
- `baseline_v4` is the baseline used for the NSDI submission.
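
A consolidated sketch of the setup sequence, assuming the repository is checked out at `~/ConstrainedOrca` on node0 (the split between hosts follows the steps above):

```bash
# On YOUR machine: configure all 17 CloudLab nodes.
./cloudlab/config.sh

# Then, on node0 (after ssh-ing in; repository path per the steps above):
cd ~/ConstrainedOrca
./cloudlab/setup_params.sh   # generates rl-module/params_distributed.json
```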
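A sketch of launching training on node0 inside `tmux`; the session name `orca-train` is arbitrary, and the location of `v9_multi_train.sh` in the repository root is an assumption:

```bash
# On node0: start a named tmux session so training survives SSH disconnects.
tmux new-session -s orca-train

# Inside the tmux session (script location is an assumption):
cd ~/ConstrainedOrca
./v9_multi_train.sh

# Detach with Ctrl-b d; re-attach later with:
tmux attach -t orca-train
```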
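A sketch for inspecting actor output; the actor hostname and the per-actor log file name are placeholders, since their exact naming is not specified here:

```bash
# ssh into an actor node (hostname is a placeholder).
ssh node1.example.cloudlab.us

# On the actor node: list the per-actor logs and follow one (file name is a placeholder).
ls ~/actor_logs/
tail -f ~/actor_logs/actor0.log

# On node0: the training log is kept under this directory.
ls ~/ConstrainedOrca/rl-module/training_log/
```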
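A sketch of staging a checkpoint for evaluation; the backup location `~/train_backup/` is a placeholder, and the checkpoint directory name is the example shown above:

```bash
# Copy the chosen checkpoint directory into seed0 (source path is a placeholder).
cp -r ~/train_backup/learner0-v9_actorNum256_multi_lambda0.0_ksymbolic5_k1_raw-sym_threshold25_seed0 \
      ~/ConstrainedOrca/rl-module/train_dir/seed0/

# Verify: this should list exactly one checkpoint directory.
ls ~/ConstrainedOrca/rl-module/train_dir/seed0/
```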
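A hypothetical invocation of the evaluation script on the SAGE traces; the run range (1 through 5) and constraints id (0) are placeholder values, `constraint_id_0` stands in for whichever `constraint_id_*` results directory you are using, and passing the checkpoint directory name as `<model_name>` is an assumption:

```bash
./scripts/eval_orca.sh \
    learner0-v9_actorNum256_multi_lambda0.0_ksymbolic5_k1_raw-sym_threshold25_seed0 \
    /proj/verifiedmlsys-PG0/ConstrainedOrca/sage_traces/traces \
    /proj/verifiedmlsys-PG0/sigcomm_results/new_result_dir/constraint_id_0 \
    1 5 0
```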