Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unify buildkite pipelines #443

Merged
merged 1 commit into from
Oct 5, 2023
Merged

unify buildkite pipelines #443

merged 1 commit into from
Oct 5, 2023

Conversation

juliasloan25
Copy link
Member

Purpose

remove some environment variables from our buildkite pipelines, and unify standard and longrun pipelines

closes #441


  • I have read and checked the items on the review checklist.

.buildkite/pipeline.yml Outdated Show resolved Hide resolved
Copy link
Member

@simonbyrne simonbyrne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but might be a good idea to try running it

@juliasloan25
Copy link
Member Author

@@ -1,22 +1,19 @@
agents:
queue: central
slurm_mem: 8G
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're hitting CliMA/slurm-buildkite#47

This is applied to all jobs, but some other jobs also specify slurm_mem_per_cpu, which causes conflicts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh okay, thank you

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it may be similarly problematic to specify both slurm_ntasks and slurm_tasks_per_node? e.g. here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that can do weird things.

@juliasloan25
Copy link
Member Author

@LenkaNovak it looks like these changes will fix the longrun issues we had that prevented the coupler reports from being generated :)

@LenkaNovak
Copy link
Collaborator

Awesome! Thanks for pursuing this @juliasloan25 and @simonbyrne for your help! Once this is merged we can finalize #456 and start profiling the DYAMOND run with @sriharshakandala 👏

@LenkaNovak
Copy link
Collaborator

LenkaNovak commented Oct 4, 2023

Maybe just a clarification, @simonbyrne , wouldn't removing slurm_tasks_per_node: X make our weak scaling in these long runs less accurate? We noticed in the past that testing across nodes caused a large variability in walltime.

@simonbyrne
Copy link
Member

Looking at the docs:

--ntasks-per-node=
Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option. This is related to --cpus-per-task=ncpus, but does not require knowledge of the actual number of cpus on each node. In some cases, it is more convenient to be able to request that no more than a specific number of tasks be invoked on each node. Examples of this include submitting a hybrid MPI/OpenMP app where only one MPI "task/rank" should be assigned to each node while allowing the OpenMP portion to utilize all of the parallelism present in the node, or submitting a single setup/cleanup/monitoring job to each node of a pre-existing allocation as one step in a larger job script.

What I would generally suggest is using either just slurm_ntasks, or slurm_ntasks_per_node + slurm_nodes.

@juliasloan25
Copy link
Member Author

juliasloan25 commented Oct 4, 2023

I can it change to slurm_ntasks_per_node + slurm_nodes if we want to be more specific! Let me know which you prefer @LenkaNovak

Also, as of this PR our target AMIP longrun fails due to numerical instability. It's hard to tell which PR introduced this, since the last time the longruns successfully ran was Sept 17, and between then and now the target AMIP run was failing due to HDF5 errors. It could have been the updates to ClimaAtmos v0.16.0 or v0.16.1.

@LenkaNovak
Copy link
Collaborator

slurm_ntasks_per_node + slurm_nodes

Sounds good. Thanks for the pointer! @juliasloan25 Would it be possible to do this, i.e. specify slurm_ntasks_per_node + slurm_nodes? I've just tested it here. The precise setup is a trade off between wait time for resource allocation and getting the specific setup we need, but it would be good to have a more trackable / comparable configurations for the AMIP run set. Otherwise happy to merge this.

@juliasloan25
Copy link
Member Author

bors r+

@bors
Copy link
Contributor

bors bot commented Oct 5, 2023

Build succeeded!

The publicly hosted instance of bors-ng is deprecated and will go away soon.

If you want to self-host your own instance, instructions are here.
For more help, visit the forum.

If you want to switch to GitHub's built-in merge queue, visit their help page.

@bors bors bot merged commit 15314cf into main Oct 5, 2023
8 checks passed
@bors bors bot deleted the js/envvars branch October 5, 2023 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reduce env variables in buildkite pipelines
3 participants