add GPU AMIP scaling runs #673

Closed
wants to merge 28 commits into from
Commits
41dff6f  add strong scaling GPU AMIP  (juliasloan25, Mar 6, 2024)
3d8f4bb  add weak scaling  (juliasloan25, Mar 6, 2024)
e9b769c  use atmos branch  (juliasloan25, Mar 6, 2024)
9fec265  don't use vert_diff: true [skip ci]  (juliasloan25, Mar 6, 2024)
416b053  use correct driver [skip ci]  (juliasloan25, Mar 6, 2024)
9d69913  wait after each job [skip ci]  (juliasloan25, Mar 7, 2024)
50c45cc  weak scaling only [skip ci]  (juliasloan25, Mar 7, 2024)
9e0df96  add barrier  (juliasloan25, Mar 7, 2024)
9d994e4  strong scaling only [skip ci]  (juliasloan25, Mar 7, 2024)
686f44c  no waits; update ws resolutions [skip ci]  (juliasloan25, Mar 8, 2024)
33811a6  decrease ws 1 GPU dt [skip ci]  (juliasloan25, Mar 8, 2024)
91be5d3  show surface fractions [skip ci]  (juliasloan25, Mar 8, 2024)
7b521db  show surface fraction sums [skip ci]  (juliasloan25, Mar 8, 2024)
ed2768d  more barrier [skip ci]  (juliasloan25, Mar 8, 2024)
0b22c9b  add scaling plot [skip ci]  (juliasloan25, Mar 8, 2024)
1f675c1  ws 1 GPU h_elem 30 [skip ci]  (juliasloan25, Mar 8, 2024)
9031045  2 gpu ss sum before max [skip ci]  (juliasloan25, Mar 8, 2024)
1d9206f  strong scaling only h_elem 30 [skip ci]  (juliasloan25, Mar 9, 2024)
885737d  strong scaling h_elem 60, dt 50  (juliasloan25, Mar 9, 2024)
f77dfab  weak scaling h_elem 84 dt 50  (juliasloan25, Mar 11, 2024)
dbbdfee  ss 1 gpu @ 60, 4 gpu @ 42  (juliasloan25, Mar 11, 2024)
c64b442  dyamond strong scaling [skip ci]  (juliasloan25, Mar 12, 2024)
6c22046  DYAMOND ws [skip ci]  (juliasloan25, Mar 12, 2024)
4a21815  dyamond ws higher res [skip ci]  (juliasloan25, Mar 12, 2024)
f963c31  fix pipeline [skip ci]  (juliasloan25, Mar 12, 2024)
55c6dc1  include 1 GPU [skip ci]  (juliasloan25, Mar 12, 2024)
8e68e7f  dyamond ws helem 30,42,60 [skip ci]  (juliasloan25, Mar 12, 2024)
10ac98e  run for 1 day [skip ci]  (juliasloan25, Mar 12, 2024)
222 changes: 222 additions & 0 deletions .buildkite/gpu/pipeline.yml
@@ -0,0 +1,222 @@
agents:
  queue: clima
  slurm_mem: 8G
  modules: common nsight-systems/2023.4.1

env:
  JULIA_CUDA_MEMORY_POOL: none
  JULIA_MPI_HAS_CUDA: "true"
  JULIA_NVTX_CALLBACKS: gc
  JULIA_MAX_NUM_PRECOMPILE_FILES: 100
  OPENBLAS_NUM_THREADS: 1
  OMPI_MCA_opal_warn_on_missing_libcuda: 0
  SLURM_KILL_BAD_EXIT: 1
  SLURM_GRES_FLAGS: "allow-task-sharing"
  GPU_CONFIG_PATH: "config/gpu_configs"
  GPU_DYAMOND_CONFIG_PATH: "config/gpu_configs/gpu_dyamond"
  GPU_DYAMOND_WS_CONFIG_PATH: "config/gpu_configs/gpu_dyamond_ws"
  CLIMAATMOS_GC_NSTEPS: 10

steps:
  - label: "init :GPU:"
    key: "init_gpu_env"
    command:
      - echo "--- Instantiate experiments/AMIP"
      - julia --project=experiments/AMIP -e 'using Pkg; Pkg.instantiate(;verbose=true)'
      - julia --project=experiments/AMIP -e 'using Pkg; Pkg.precompile()'
      - julia --project=experiments/AMIP -e 'using Pkg; Pkg.status()'

      - echo "--- Download artifacts"
      - "julia --project=artifacts -e 'using Pkg; Pkg.instantiate(;verbose=true)'"
      - "julia --project=artifacts -e 'using Pkg; Pkg.precompile()'"
      - "julia --project=artifacts -e 'using Pkg; Pkg.status()'"
      - "julia --project=artifacts artifacts/download_artifacts.jl"

    agents:
      slurm_gpus: 1
      slurm_cpus_per_task: 8
    env:
      JULIA_NUM_PRECOMPILE_TASKS: 8
      JULIA_MAX_NUM_PRECOMPILE_FILES: 50

  - wait

  # - group: "DYAMOND GPU strong scaling"
  #   steps:

  #     - label: "GPU AMIP DYAMOND - strong scaling - 1 GPU"
  #       key: "gpu_amip_dyamond"
  #       command:
  #         - >
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
  #           --config_file $GPU_DYAMOND_CONFIG_PATH/gpu_amip_dyamond.yml
  #       artifact_paths: "gpu_amip_dyamond/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 1
  #         slurm_mem: 32G

  #     - label: "GPU AMIP DYAMOND - strong scaling - 2 GPUs"
  #       key: "gpu_amip_dyamond_2process"
  #       command:
  #         - >
  #           srun --cpu-bind=threads --cpus-per-task=4
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
  #           --config_file $GPU_DYAMOND_CONFIG_PATH/gpu_amip_dyamond_2process.yml
  #       artifact_paths: "gpu_amip_dyamond_2process/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 2
  #         slurm_mem: 32G

  #     - label: "GPU AMIP DYAMOND - strong scaling - 4 GPUs"
  #       key: "gpu_amip_dyamond_4process"
  #       command:
  #         - >
  #           srun --cpu-bind=threads --cpus-per-task=4
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
  #           --config_file $GPU_DYAMOND_CONFIG_PATH/gpu_amip_dyamond_4process.yml
  #       artifact_paths: "gpu_amip_dyamond_4process/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 4
  #         slurm_mem: 32G

  - group: "DYAMOND GPU weak scaling"
    steps:

      - label: "GPU AMIP DYAMOND - weak scaling - 1 GPU"
        key: "gpu_amip_dyamond_ws"
        command:
          - >
            julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl

Review comment (Member):
Can we add an nsys profile to the non-MPI jobs?

            --config_file $GPU_DYAMOND_WS_CONFIG_PATH/gpu_amip_dyamond_ws.yml
        artifact_paths: "gpu_amip_dyamond_ws/*"
        agents:
          slurm_gpus_per_task: 1
          slurm_cpus_per_task: 4
          slurm_ntasks: 1
          slurm_mem: 32G
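
      # NOTE: illustrative sketch only, NOT part of this PR's diff. One possible
      # way to act on the review comment above would be to wrap the single-process
      # (non-MPI) command in `nsys profile`, which the nsight-systems module loaded
      # under `agents.modules` is expected to provide. The label, key, and report
      # path below are hypothetical names chosen for illustration, and the sketch
      # assumes the output directory is writable when nsys starts.
      #
      # - label: "GPU AMIP DYAMOND - weak scaling - 1 GPU (nsys)"
      #   key: "gpu_amip_dyamond_ws_nsys"
      #   command:
      #     - >
      #       nsys profile --trace=nvtx,cuda --output=gpu_amip_dyamond_ws/report
      #       julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
      #       --config_file $GPU_DYAMOND_WS_CONFIG_PATH/gpu_amip_dyamond_ws.yml
      #   artifact_paths: "gpu_amip_dyamond_ws/*"
      #   agents:
      #     slurm_gpus_per_task: 1
      #     slurm_cpus_per_task: 4
      #     slurm_ntasks: 1
      #     slurm_mem: 32G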

      - label: "GPU AMIP DYAMOND - weak scaling - 2 GPUs"
        key: "gpu_amip_dyamond_ws_2process"
        command:
          - >
            srun --cpu-bind=threads --cpus-per-task=4
            julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
            --config_file $GPU_DYAMOND_WS_CONFIG_PATH/gpu_amip_dyamond_ws_2process.yml
        artifact_paths: "gpu_amip_dyamond_ws_2process/*"
        agents:
          slurm_gpus_per_task: 1
          slurm_cpus_per_task: 4
          slurm_ntasks: 2
          slurm_mem: 32G

      - label: "GPU AMIP DYAMOND - weak scaling - 4 GPUs"
        key: "gpu_amip_dyamond_ws_4process"
        command:
          - >
            srun --cpu-bind=threads --cpus-per-task=4
            julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
            --config_file $GPU_DYAMOND_WS_CONFIG_PATH/gpu_amip_dyamond_ws_4process.yml
        artifact_paths: "gpu_amip_dyamond_ws_4process/*"
        agents:
          slurm_gpus_per_task: 1
          slurm_cpus_per_task: 4
          slurm_ntasks: 4
          slurm_mem: 32G

  # - group: "CHAP GPU strong scaling"
  #   steps:

  #     - label: "GPU AMIP CHAP - strong scaling - 1 GPU"
  #       key: "gpu_amip_chap"
  #       command:
  #         - >
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
  #           --config_file $GPU_CONFIG_PATH/gpu_amip_chap.yml
  #       artifact_paths: "gpu_amip_chap/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 1
  #         slurm_mem: 32G

  #     - label: "GPU AMIP CHAP - strong scaling - 2 GPUs"
  #       key: "gpu_amip_chap_2process"
  #       command:
  #         - >
  #           srun --cpu-bind=threads --cpus-per-task=4
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
  #           --config_file $GPU_CONFIG_PATH/gpu_amip_chap_2process.yml
  #       artifact_paths: "gpu_amip_chap_2process/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 2
  #         slurm_mem: 32G

  #     - label: "GPU AMIP CHAP - strong scaling - 4 GPUs"
  #       key: "gpu_amip_chap_4process"
  #       command:
  #         - >
  #           srun --cpu-bind=threads --cpus-per-task=4
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
  #           --config_file $GPU_CONFIG_PATH/gpu_amip_chap_4process.yml
  #       artifact_paths: "gpu_amip_chap_4process/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 4
  #         slurm_mem: 32G

  # - group: "CHAP GPU weak scaling"
  #   steps:

  #     - label: "GPU AMIP CHAP - weak scaling - 1 GPU"
  #       key: "gpu_amip_chap_ws"
  #       command:
  #         - >
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl --config_file $GPU_CONFIG_PATH/gpu_amip_chap_ws.yml
  #       artifact_paths: "gpu_amip_chap_ws/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 1
  #         slurm_mem: 32G
  #         slurm_exclusive:

  #     - label: "GPU AMIP CHAP - weak scaling - 2 GPUs"
  #       key: "gpu_amip_chap_ws_2process"
  #       command:
  #         - >
  #           srun --cpu-bind=threads --cpus-per-task=4
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
  #           --config_file $GPU_CONFIG_PATH/gpu_amip_chap_ws_2process.yml
  #       artifact_paths: "gpu_amip_chap_ws_2process/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 2
  #         slurm_mem: 32G
  #         slurm_time: 8:00:00
  #         slurm_exclusive:

  #     - label: "GPU AMIP CHAP - weak scaling - 4 GPUs"
  #       key: "gpu_amip_chap_ws_4process"
  #       command:
  #         - >
  #           srun --cpu-bind=threads --cpus-per-task=4
  #           julia --threads=3 --color=yes --project=experiments/AMIP experiments/AMIP/coupler_driver.jl
  #           --config_file $GPU_CONFIG_PATH/gpu_amip_chap_ws_4process.yml
  #       artifact_paths: "gpu_amip_chap_ws_4process/*"
  #       agents:
  #         slurm_gpus_per_task: 1
  #         slurm_cpus_per_task: 4
  #         slurm_ntasks: 4
  #         slurm_mem: 32G
  #         slurm_time: 8:00:00
  #         slurm_exclusive:
22 changes: 22 additions & 0 deletions config/gpu_configs/gpu_amip_chap.yml
@@ -0,0 +1,22 @@
anim: false
apply_limiter: false
atmos_config_file: "config/gpu_configs/gpu_aquaplanet_chap.yml"
dt: "50secs"
dt_cloud_fraction: "1hours"
dt_cpl: 50
dt_rad: "1hours"
dt_save_state_to_disk: "Inf"
dt_save_to_sol: "Inf"
energy_check: false
evolving_ocean: false
h_elem: 60
hourly_checkpoint: false
job_id: "gpu_amip_chap"
land_albedo_type: "map_static"
mode_name: "amip"
mono_surface: false
run_name: "gpu_amip_chap"
start_date: "19790301"
surface_setup: "PrescribedSurface"
t_end: "1days"
turb_flux_partition: "CombinedStateFluxes"
22 changes: 22 additions & 0 deletions config/gpu_configs/gpu_amip_chap_2process.yml
@@ -0,0 +1,22 @@
anim: false
apply_limiter: false
atmos_config_file: "config/gpu_configs/gpu_aquaplanet_chap_2process.yml"
dt: "50secs"
dt_cloud_fraction: "1hours"
dt_cpl: 50
dt_rad: "1hours"
dt_save_state_to_disk: "Inf"
dt_save_to_sol: "Inf"
energy_check: false
evolving_ocean: false
h_elem: 60
hourly_checkpoint: false
job_id: "gpu_amip_chap_2process"
land_albedo_type: "map_static"
mode_name: "amip"
mono_surface: false
run_name: "gpu_amip_chap_2process"
start_date: "19790301"
surface_setup: "PrescribedSurface"
t_end: "1days"
turb_flux_partition: "CombinedStateFluxes"
22 changes: 22 additions & 0 deletions config/gpu_configs/gpu_amip_chap_4process.yml
@@ -0,0 +1,22 @@
anim: false
apply_limiter: false
atmos_config_file: "config/gpu_configs/gpu_aquaplanet_chap_4process.yml"
dt: "50secs"
dt_cloud_fraction: "1hours"
dt_cpl: 50
dt_rad: "1hours"
dt_save_state_to_disk: "Inf"
dt_save_to_sol: "Inf"
energy_check: false
evolving_ocean: false
h_elem: 42
hourly_checkpoint: false
job_id: "gpu_amip_chap_4process"
land_albedo_type: "map_static"
mode_name: "amip"
mono_surface: false
run_name: "gpu_amip_chap_4process"
start_date: "19790301"
surface_setup: "PrescribedSurface"
t_end: "1days"
turb_flux_partition: "CombinedStateFluxes"
22 changes: 22 additions & 0 deletions config/gpu_configs/gpu_amip_chap_ws.yml
@@ -0,0 +1,22 @@
anim: false
apply_limiter: false
atmos_config_file: "config/gpu_configs/gpu_aquaplanet_chap_ws_1process.yml"
dt: "100secs"
dt_cloud_fraction: "1hours"
dt_cpl: 100
dt_rad: "1hours"
dt_save_state_to_disk: "Inf"
dt_save_to_sol: "Inf"
energy_check: false
evolving_ocean: false
h_elem: 30
hourly_checkpoint: false
job_id: "gpu_amip_chap_ws"
land_albedo_type: "map_static"
mode_name: "amip"
mono_surface: false
run_name: "gpu_amip_chap_ws"
start_date: "19790301"
surface_setup: "PrescribedSurface"
t_end: "1days"
turb_flux_partition: "CombinedStateFluxes"
22 changes: 22 additions & 0 deletions config/gpu_configs/gpu_amip_chap_ws_2process.yml
@@ -0,0 +1,22 @@
anim: false
apply_limiter: false
atmos_config_file: "config/gpu_configs/gpu_aquaplanet_chap_ws_2process.yml"
dt: "50secs"
dt_cloud_fraction: "1hours"
dt_cpl: 50
dt_rad: "1hours"
dt_save_state_to_disk: "Inf"
dt_save_to_sol: "Inf"
energy_check: false
evolving_ocean: false
h_elem: 60
hourly_checkpoint: false
job_id: "gpu_amip_chap_ws_2process"
land_albedo_type: "map_static"
mode_name: "amip"
mono_surface: false
run_name: "gpu_amip_chap_ws_2process"
start_date: "19790301"
surface_setup: "PrescribedSurface"
t_end: "1days"
turb_flux_partition: "CombinedStateFluxes"
22 changes: 22 additions & 0 deletions config/gpu_configs/gpu_amip_chap_ws_4process.yml
@@ -0,0 +1,22 @@
anim: false
apply_limiter: false
atmos_config_file: "config/gpu_configs/gpu_aquaplanet_chap_ws_4process.yml"
dt: "50secs"
dt_cloud_fraction: "1hours"
dt_cpl: 50
dt_rad: "1hours"
dt_save_state_to_disk: "Inf"
dt_save_to_sol: "Inf"
energy_check: false
evolving_ocean: false
h_elem: 84
hourly_checkpoint: false
job_id: "gpu_amip_chap_ws_4process"
land_albedo_type: "map_static"
mode_name: "amip"
mono_surface: false
run_name: "gpu_amip_chap_ws_4process"
start_date: "19790301"
surface_setup: "PrescribedSurface"
t_end: "1days"
turb_flux_partition: "CombinedStateFluxes"
21 changes: 21 additions & 0 deletions config/gpu_configs/gpu_dyamond/gpu_amip_dyamond.yml
@@ -0,0 +1,21 @@
anim: false
apply_limiter: false
atmos_config_file: "config/gpu_configs/gpu_aquaplanet_dyamond.yml"
dt: "100secs"
dt_cpl: 100
dt_rad: "1hours"
dt_save_state_to_disk: "Inf"
dt_save_to_sol: "Inf"
energy_check: false
evolving_ocean: false
h_elem: 30
hourly_checkpoint: false
job_id: "gpu_amip_dyamond"
land_albedo_type: "map_static"
mode_name: "amip"
mono_surface: false
run_name: "gpu_amip_dyamond"
start_date: "19790301"
surface_setup: "PrescribedSurface"
t_end: "12hours"
turb_flux_partition: "CombinedStateFluxes"