Script to help test the real configs on Hera #79

zmoon · 2025-02-27T18:24:41Z

The script tests/run_config.py

- sets up a temporary case directory for each selected config
  - the configs are identified by the names of the config/* subdirectories, e.g. cmaq_gfs_megan_nei2019_globtempo
- stores settings as JSON (script args, etc.)
- copies in config files, modifying time and modifying grid if requested
- links in the input data
- writes a Slurm job script (optionally submitting it as well)
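
The case-creation steps listed above can be sketched roughly as follows. This is only an illustration, not the code in tests/run_config.py: the function name `create_case`, the file names `settings.json` and `job.sh`, and the job script body are all assumptions. The `#SBATCH` directives themselves are standard Slurm options.

```python
import json
import tempfile
from pathlib import Path


def create_case(config_name, settings, *, ntasks=4, cpus_per_task=1, base_dir=None):
    """Create a temporary case directory holding settings and a Slurm job script.

    Hypothetical helper: names and layout are illustrative only.
    """
    case_dir = Path(tempfile.mkdtemp(prefix=f"{config_name}_", dir=base_dir))

    # Store the settings (script args, etc.) as JSON inside the case directory
    (case_dir / "settings.json").write_text(json.dumps(settings, indent=2))

    # Write a job script; the body is a placeholder, not the real NEXUS invocation
    job_lines = [
        "#!/bin/bash",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"export OMP_NUM_THREADS={cpus_per_task}",
        "echo 'run the model here'",
    ]
    (case_dir / "job.sh").write_text("\n".join(job_lines) + "\n")
    return case_dir
```

Copying the config files and linking the input data are left out here, since their locations are specific to Hera.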

The script tests/collect_data.py collects data from those runs, including the stored settings and info from the Slurm job stdout and stderr files, and writes it as a newline-delimited JSON file, which can be loaded in pandas with pd.read_json("data.ndjson", lines=True).
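
For environments without pandas, a newline-delimited JSON file can also be read with the standard library alone. The sketch below assumes one JSON object per line; the record keys (`n`, `c`, `r`, `time`) are borrowed from the data table in this description and may not match the keys the script actually writes.

```python
import json


def read_ndjson(text):
    """Parse newline-delimited JSON: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical records; the real keys written by collect_data.py may differ
sample = (
    '{"n": 40, "c": 4, "r": true, "time": 1.98}\n'
    '{"n": 36, "c": 4, "r": true, "time": 2.02}\n'
)
records = read_ndjson(sample)

# Pick the record with the shortest run time
fastest = min(records, key=lambda rec: rec["time"])
```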

In the future we could use these capabilities to develop some e2e regression tests and maybe connect a self-hosted runner on Hera to GitHub Actions for automated testing.

The figures below summarize data for a set of runs (two time steps of cmaq_gfs_megan_nei2019_globtempo, no change to the grid). Regrid T means NEXUS output regridded via ESMF to the FV3 grid spec (nexus -r grid_spec.nc), whereas regrid F means native-grid output produced by HEMCO.

Data table:

n: number of tasks (MPI)
c: number of CPUs per task (also used as OMP_NUM_THREADS)
r: regrid
time: run time in minutes (only successful runs included)
mem: max node-wise memory usage during the job in GB

n	c	r	time	mem
40	4	T	1.98	29.66
36	4	T	2.02	26.73
36	4	T	2.02	26.75
40	4	T	2.03	29.66
32	4	T	2.10	23.88
32	4	T	2.12	23.90
24	4	T	2.15	24.18
28	4	T	2.15	29.99
28	4	T	2.17	29.94
24	4	T	2.18	24.16
32	2	T	2.20	47.66
32	2	T	2.22	47.61
36	2	T	2.22	53.30
36	2	T	2.23	53.30
28	2	T	2.25	41.88
40	2	T	2.25	59.11
16	4	T	2.28	24.63
16	4	T	2.28	24.65
40	2	T	2.28	59.07
24	2	T	2.28	36.19
24	2	T	2.32	36.15
28	2	T	2.32	41.92
20	4	T	2.37	30.43
12	4	T	2.37	18.96
12	4	T	2.38	18.97
20	4	T	2.42	30.41
20	1	T	2.62	60.43
16	2	T	2.63	37.49
24	1	T	2.63	71.77
20	1	T	2.63	60.40
24	1	T	2.65	71.93
24	1	T	2.65	72.04
16	2	T	2.67	49.02
24	1	T	2.68	70.59
20	2	T	2.70	60.61
20	2	T	2.70	60.55
28	1	T	2.72	78.17
32	1	T	2.72	90.81
32	1	T	2.75	88.89
28	1	T	2.75	83.52
32	1	T	2.75	94.08
12	2	T	2.77	37.67
16	1	T	2.80	49.04
12	2	T	2.80	37.68
16	1	T	2.82	49.06
16	1	T	2.82	48.96
36	1	T	2.87	100.01
32	1	T	2.88	93.80
16	1	T	2.90	49.04
40	1	T	2.90	112.84
40	1	T	2.92	110.34
8	4	T	2.93	26.31
8	4	T	2.95	26.32
40	1	T	3.00	90.78
8	2	T	3.15	26.28
12	1	T	3.15	37.68
12	1	T	3.15	37.65
12	1	T	3.17	37.67
8	2	T	3.18	26.25
12	1	T	3.22	37.71
4	12	T	3.52	7.89
4	12	T	3.65	8.98
4	16	T	3.67	8.99
4	20	T	3.68	8.98
4	20	T	3.68	9.01
4	16	T	3.72	8.99
8	1	T	3.75	26.31
8	1	T	3.77	26.26
8	1	T	3.77	26.24
4	8	T	3.82	16.23
4	8	T	3.82	16.32
8	1	T	3.82	26.25
4	4	T	4.02	16.26
4	4	T	4.05	16.24
4	28	T	4.07	5.23
3	16	T	4.35	11.25
3	16	T	4.35	11.20
4	32	T	4.37	5.30
4	32	T	4.38	5.29
4	40	T	4.38	5.24
3	20	T	4.47	10.96
3	20	T	4.47	11.09
4	24	T	4.47	5.23
4	36	T	4.50	5.23
4	40	T	4.52	5.24
3	8	T	4.60	15.82
3	8	T	4.60	15.86
4	28	T	4.67	5.21
4	2	T	4.70	16.36
4	36	T	4.72	5.23
4	2	T	4.72	16.39
4	24	T	4.78	5.20
3	28	T	4.82	6.29
3	4	T	4.93	15.81
3	4	T	4.98	15.82
3	24	T	5.03	6.28
3	24	T	5.07	6.30
3	32	T	5.10	6.45
3	36	T	5.10	6.47
3	12	T	5.20	14.58
3	40	T	5.22	6.33
3	40	T	5.25	6.29
3	28	T	5.27	6.26
3	12	T	5.30	14.57
3	32	T	5.52	6.45
3	36	T	5.58	6.30
4	1	T	5.60	16.35
4	1	T	5.63	16.36
2	16	T	5.72	15.47
2	16	T	5.73	15.47
2	12	T	5.80	15.47
3	2	T	5.80	15.87
2	12	T	5.83	15.47
2	8	T	5.83	15.46
2	20	T	5.83	15.33
2	8	T	5.92	15.44
3	2	T	5.92	15.77
2	20	T	6.00	15.48
2	24	T	6.17	8.41
2	28	T	6.28	8.46
2	40	T	6.45	8.49
2	4	T	6.48	15.43
2	32	T	6.50	8.56
2	32	T	6.50	8.61
2	4	T	6.53	15.43
2	36	T	6.60	8.39
2	28	T	6.63	8.41
2	40	T	6.63	8.47
2	36	T	6.85	8.31
2	24	T	6.95	8.42
3	1	T	7.15	15.80
3	1	T	7.18	15.85
2	2	T	7.57	15.44
2	2	T	7.62	15.35
2	1	T	9.53	15.17
2	1	T	9.57	15.41
1	16	F	9.62	12.62
1	12	F	9.63	12.63
2	16	F	9.73	23.31
3	16	F	9.75	23.93
2	12	F	9.87	23.92
1	8	F	9.92	12.62
1	16	F	10.03	12.65
1	8	F	10.08	12.64
1	24	F	10.10	12.64
2	8	F	10.15	23.94
2	16	F	10.15	23.95
1	16	T	10.23	14.65
1	16	T	10.25	13.26
1	12	F	10.28	12.65
1	20	T	10.28	14.52
1	12	T	10.32	13.23
1	20	T	10.35	14.54
1	24	F	10.40	12.65
1	12	T	10.57	14.66
1	32	F	10.60	12.62
1	8	T	10.82	14.64
1	28	T	10.87	14.58
1	32	F	11.05	12.69
1	40	F	11.12	12.69
1	24	T	11.13	14.57
2	32	F	11.15	12.72
1	4	F	11.25	12.64
1	40	F	11.33	12.68
1	8	T	11.33	14.66
1	4	F	11.35	12.62
2	4	F	11.48	23.92
1	28	T	11.52	14.53
2	4	F	11.55	23.07
1	24	T	11.57	14.57
1	40	T	11.67	14.59
1	36	T	11.77	14.59
3	8	F	11.83	33.75
1	32	T	11.83	13.29
1	36	T	11.92	14.57
1	40	T	12.00	14.60
1	4	T	12.35	14.51
1	32	T	12.55	14.71
1	4	T	12.62	14.61
3	4	F	13.42	35.08
2	2	F	13.50	23.88
2	2	F	13.55	23.45
1	2	F	13.60	12.61
3	12	F	13.77	35.12
3	4	F	13.82	35.13
1	2	F	14.10	12.60
3	12	F	14.12	35.16
1	2	T	14.20	14.64
1	2	T	14.60	14.45
3	2	F	15.72	35.08
4	2	F	16.05	45.91
3	2	F	16.48	34.74
1	1	F	17.38	12.61
2	1	F	17.52	22.88
1	1	F	17.87	12.59
1	1	T	18.10	14.26
1	1	T	18.47	14.60
3	1	F	19.32	34.72
4	1	F	20.13	45.91
3	1	F	20.18	35.09
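
One way to explore a table like this is to parse the tab-separated rows and compare configurations by more than run time. The sketch below uses a handful of rows copied from the table; the "CPU-minutes" figure (n × c × time) is an extra metric introduced here for illustration and is not part of the collected data.

```python
import csv
import io

# A few rows copied from the table above (tab-separated)
TABLE = (
    "n\tc\tr\ttime\tmem\n"
    "40\t4\tT\t1.98\t29.66\n"
    "4\t12\tT\t3.52\t7.89\n"
    "1\t16\tF\t9.62\t12.62\n"
    "1\t1\tT\t18.10\t14.26\n"
)


def load_rows(text):
    """Convert the tab-separated text into a list of typed dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "n": int(rec["n"]),
            "c": int(rec["c"]),
            "r": rec["r"] == "T",
            "time": float(rec["time"]),
            "mem": float(rec["mem"]),
        }
        for rec in reader
    ]


def cpu_minutes(row):
    """Requested CPUs (n * c) multiplied by run time in minutes."""
    return row["n"] * row["c"] * row["time"]


rows = load_rows(TABLE)
fastest = min(rows, key=lambda row: row["time"])
cheapest = min(rows, key=cpu_minutes)
```

Among these four rows, the shortest run time and the lowest CPU-minutes belong to different configurations, which is the kind of trade-off the full table can be used to examine.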

Copilot

PR Overview

This pull request adds two scripts: one to run config cases on Hera (tests/run_config.py) and one to collect their output data (tests/collect_data.py).

tests/run_config.py sets up temporary directories, updates config files based on command‐line arguments, creates a Slurm job script, and optionally submits it.
tests/collect_data.py collects settings, extracts runtime and memory usage info from Slurm outputs, and writes a newline‐delimited JSON file.
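
As a rough illustration of this kind of log parsing, the sketch below shows two small helpers: one that reads `key=value` lines by splitting on the first `=` only (so values may themselves contain `=`), and one that flags a failed run by looking for lines starting with `srun: error:`. The function names and the sample log text are hypothetical; the actual format of the job output files is not shown in this PR description.

```python
def parse_key_values(text):
    """Collect `key=value` lines, splitting on the first `=` only."""
    out = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            out[key.strip()] = value.strip()
    return out


def has_srun_error(stderr_text):
    """Return True if any stderr line starts with the srun error prefix."""
    return any(line.startswith("srun: error:") for line in stderr_text.splitlines())


# Hypothetical log contents, for illustration only
settings = parse_key_values("name=value=with=equals\nother = 1\n")
failed = has_srun_error("some warning\nsrun: error: task 0 failed\n")
```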

Reviewed Changes

File	Description
tests/run_config.py	Implements test run setup, config updates, and job creation/submission for Hera.
tests/collect_data.py	Implements collection and parsing of run outputs into a summary JSON file.

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

drnimbusrain · 2025-02-28T14:03:14Z

@zmoon Awesome! Do you want this tested on Hera in the UFS-SRW-App workflow? Also, how did you get Copilot enabled to act as a reviewer of this PR? I can't see the option in other noaa-oar-arl repos.

bbakernoaa · 2025-02-28T14:14:54Z

@drnimbusrain you can request copilot to review PRs under the normal "request reviewer" in the top right

I don't think these changes are for the workflow. This is a simple script for us to be able to perform tests with the configuration files used for our various supported emission scenarios (which include the operational system). This is basically the RT piece for NEXUS.

drnimbusrain · 2025-02-28T16:58:54Z

Great, thanks Barry! Unfortunately I cannot see the CoPilot for other repositories under reviewer, just here.

bbakernoaa · 2025-02-28T17:10:02Z

You have to enable it under settings for each repository.

drnimbusrain · 2025-02-28T17:30:34Z

Still can't access, no worries...

zmoon added 30 commits February 20, 2025 11:48

Discover config dirs

ffe86d4

Select configs

27a3f09

Start creating run dirs

4ac0592

Store current commit hash

8ead166

Make dirs, link data

3dae157

Initial job script

5e63ba9

Link grid spec from Jianping's location

f421ef5

note that /scratch1/RDARCH/rda-arl-gpu/Barry.Baker/emissions/nexus/FENGSHA/ has links to them too

Specify more than one config with multiple -c usages

5bfc026

also could be interesting for space-separated (e.g. "gfs megan") to include _all_ the matches

Include all matches

46b524b

might be a good idea to require at least one delimiter for this matching mode, to ensure that it is intentional

Only do words match if intentional

19815e5
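
The selection behavior described in the last few commit notes might look something like the sketch below. It is a guess at the semantics, not the actual implementation: the function name `select_configs` is hypothetical, and it assumes that a selector containing a space is treated as a list of words that must all appear among the underscore-separated parts of a config name, while a selector without a space must match a name exactly.

```python
def select_configs(selector, available):
    """Return the config names matched by one selector string (hypothetical)."""
    if " " in selector:
        # Words mode: only active when the selector contains a space,
        # so that it is clearly intentional; all matching names are returned
        words = selector.split()
        return [
            name
            for name in available
            if all(word in name.split("_") for word in words)
        ]
    # Without a delimiter, assume an exact name is required
    return [name for name in available if name == selector]


# Only the first name comes from this PR; the second is made up
names = ["cmaq_gfs_megan_nei2019_globtempo", "hypothetical_other_config"]
matches = select_configs("gfs megan", names)
```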

Specify Slurm -N/-n

5933cd3

Warn if inputs not linked

3643e56

but don't fail, so it is still easy to test the tmp case creation outside of Hera

Use the 2h GFC sfc file I have available

6acea0b

Remove possible previous run log files

194fe9a

Grid scaling

21ad7da

Use -g/--grid-factor instead of -d

0e7323e

Specify arg short form too when referencing in helps

3e0a063

Option to auto submit the job

f8a8eaa

GFS sfc file is expected to be at config level

080d1a4

Don't want literal \

257be8d

-q is --qos not --queue

704cc76

Use --cpus-per-task

b725897

and nproc isn't useful

Comment out --nodes line if not specifying

767a51c

Tweak case prints

2f07b90

Set mem

1fbdaf0

was getting OOM-ed running the default grid without this; haven't been able to find what the workflow uses for --mem(-per-cpu)

Store grid factor too

d8ab217

Use the report-mem util from RDHPCS

4c71878

https://docs.rdhpcs.noaa.gov/slurm/overview.html#using-report-mem-utility-in-batch-jobs

Add --qos option

300d094

Initial data collect script

7d69059

Ask before replacing existing data file

67824c8

zmoon and others added 11 commits February 24, 2025 09:58

Add case dir

87eac56

in case need to check details

x

ccbd2bf

Use all node mem (for easier testing of usage)

7abce8d

should probably make this an option instead later

Add option to disable regrid

b35d30b

Don't use 'do' in the flag

fbc428f

Ignore .nc files at repo root

6cab6e3

produced from testing the Python scripts and such

Ignore outputs

8ac74b4

Taken from noaa-oar-arl#78

Check for srun error message

ca622da

Should be srun: error:

28de608

Add input and output args for collect-data

7ba8c9a

Ignore all tmp-ish dirs

177f703

zmoon requested a review from Copilot February 27, 2025 19:31

Copilot AI reviewed Feb 27, 2025

zmoon and others added 2 commits February 27, 2025 14:36

Split on first =

3fc57f4

Co-authored-by: Copilot <[email protected]>

"data records"

1f77ea3

zmoon marked this pull request as ready for review February 27, 2025 20:03

zmoon requested review from bbakernoaa and drnimbusrain February 27, 2025 20:03

Nodes determined by Slurm -n and -c

6424c4b

zmoon added 2 commits February 28, 2025 12:04

Clarify -p vs -c

ba12322

Add account override opt for submission

4921585

bbakernoaa approved these changes Feb 28, 2025
