Slurm job template: how a job can probe instance topology and hostname-instanceid mappings… #268

Draft · wants to merge 1 commit into base: main
@@ -0,0 +1,154 @@
# Slurm job template to probe EC2 information

Usage: review and customize [job-template.sbatch](job-template.sbatch) to your needs.
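A minimal sketch of the customize-then-submit flow. The template content below is a hypothetical stand-in, not the real file; substitute the actual `job-template.sbatch` from this directory:

```shell
#!/bin/bash
# Stand-in for job-template.sbatch (hypothetical minimal content),
# only to illustrate the customize-then-submit flow.
cat > my-job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
srun -l /usr/bin/hostname
EOF

# Customize, e.g. bump the node count.
sed -i 's/^#SBATCH --nodes=.*/#SBATCH --nodes=4/' my-job.sbatch
grep '^#SBATCH --nodes' my-job.sbatch

# Submission itself needs a Slurm cluster:
#   sbatch my-job.sbatch
```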
Contributor:
Please include bullet points on the features here. If I'm reading this correctly the script does:

  • Outputs instance topology
  • Outputs hostname to instance id mapping
  • Checks all instances are on the same network spine
  • Prints out start, end, and elapsed time in the job
  • Outputs instance id's after the job completes (so you can see if any have been replaced)
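The hostname-to-instance-id line can be tried on a single host. A sketch with a fallback for non-EC2 machines, since the DMI `board_asset_tag` file only carries the instance id on Nitro-based EC2 instances (`i-unknown` is a placeholder, not real output):

```shell
#!/bin/bash
# On Nitro-based EC2 instances the instance id is exposed via the DMI
# board asset tag; elsewhere the file may be absent, so fall back to a
# placeholder value.
TAG_FILE=/sys/devices/virtual/dmi/id/board_asset_tag
INSTANCE_ID=$( [ -r "$TAG_FILE" ] && cat "$TAG_FILE" || echo "i-unknown" )
echo "hostname <=> instance_id mapping: $(hostname) <=> $INSTANCE_ID"
```

In the template this runs under `srun -l bash -c …`, so every allocated node prints one such line.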

Contributor:

I don't understand the value of this since Slurm only comes with network topology awareness.

Contributor (author):

response in a separate comment.


Below is the sample output (trimmed) of customizing the template to run Megatron-LM, i.e., replacing the `srun -l /usr/bin/hostname` line with the relevant parts from the [Megatron-LM example](../../3.test_cases/1.megatron-lm/).

```text
...
+ validate_ec2_same_spine
++ lstopo_ec2
++ INSTANCE_IDS=($(srun cat /sys/devices/virtual/dmi/id/board_asset_tag))
+++ srun cat /sys/devices/virtual/dmi/id/board_asset_tag
++ local INSTANCE_IDS
++ aws ec2 describe-instance-topology --instance-ids i-1111111111example i-0000000000example
+ local 'TOPO_JSON={
"Instances": [
{
"InstanceId": "i-0000000000example",
"InstanceType": "p4de.24xlarge",
"NetworkNodes": [
"nn-1111111111example",
"nn-2222222222example",
"nn-3333333333example"
],
"AvailabilityZone": "us-west-2b",
"ZoneId": "usw2-az2"
},
{
"InstanceId": "i-1111111111example",
"InstanceType": "p4de.24xlarge",
"NetworkNodes": [
"nn-1111111111example",
"nn-2222222222example",
"nn-3333333333example"
],
"AvailabilityZone": "us-west-2b",
"ZoneId": "usw2-az2"
}
]
}'
+ echo '{
"Instances": [
{
"InstanceId": "i-0000000000example",
"InstanceType": "p4de.24xlarge",
"NetworkNodes": [
"nn-1111111111example",
"nn-2222222222example",
"nn-3333333333example"
],
"AvailabilityZone": "us-west-2b",
"ZoneId": "usw2-az2"
},
{
"InstanceId": "i-1111111111example",
"InstanceType": "p4de.24xlarge",
"NetworkNodes": [
"nn-1111111111example",
"nn-2222222222example",
"nn-3333333333example"
],
"AvailabilityZone": "us-west-2b",
"ZoneId": "usw2-az2"
}
]
}'
{
"Instances": [
{
"InstanceId": "i-0000000000example",
"InstanceType": "p4de.24xlarge",
"NetworkNodes": [
"nn-1111111111example",
"nn-2222222222example",
"nn-3333333333example"
],
"AvailabilityZone": "us-west-2b",
"ZoneId": "usw2-az2"
},
{
"InstanceId": "i-1111111111example",
"InstanceType": "p4de.24xlarge",
"NetworkNodes": [
"nn-1111111111example",
"nn-2222222222example",
"nn-3333333333example"
],
"AvailabilityZone": "us-west-2b",
"ZoneId": "usw2-az2"
}
]
}
++ echo '{
"Instances": [
{
"InstanceId": "i-0000000000example",
"InstanceType": "p4de.24xlarge",
"NetworkNodes": [
"nn-1111111111example",
"nn-2222222222example",
"nn-3333333333example"
],
"AvailabilityZone": "us-west-2b",
"ZoneId": "usw2-az2"
},
{
"InstanceId": "i-1111111111example",
"InstanceType": "p4de.24xlarge",
"NetworkNodes": [
"nn-1111111111example",
"nn-2222222222example",
"nn-3333333333example"
],
"AvailabilityZone": "us-west-2b",
"ZoneId": "usw2-az2"
}
]
}'
++ grep '^ *"nn\-.................\"'
++ sort -n
++ uniq -c
++ wc -l
+ local UNIQ_NN=3
+ echo Expected 3 nn ids, got 3 nn ids
Expected 3 nn ids, got 3 nn ids
+ [[ 3 -eq 3 ]]
...
+ srun -l bash -c 'echo "hostname <=> instance_id mapping: $(hostname) <=> $(cat /sys/devices/virtual/dmi/id/board_asset_tag)"'
1: hostname <=> instance_id mapping: p4de-st-p4de-2 <=> i-0000000000example
0: hostname <=> instance_id mapping: p4de-st-p4de-1 <=> i-1111111111example
+ srun -l bash -c 'echo BEFORE: $(hostname) $(sudo lctl get_param llite.*.stats | grep write_bytes)'
0: BEFORE: p4de-st-p4de-1 write_bytes 10361 samples [bytes] 1 2147479552 401369912220
1: BEFORE: p4de-st-p4de-2
...
++ date
+ BEGIN_TRAINING='Mon Apr 8 08:33:01 UTC 2024'
+ SECONDS=0
+ srun -l --container-image ... --container-mounts ... python -m torch.distributed.run ... /workspace/Megatron-LM/pretrain_gpt.py ...
...
++ date
+ END_TRAINING='Mon Apr 8 09:02:25 UTC 2024'
+ echo 'BEGIN_TRAINING: Mon Apr 8 08:33:01 UTC 2024'
BEGIN_TRAINING: Mon Apr 8 08:33:01 UTC 2024
+ echo 'END_TRAINING : Mon Apr 8 09:02:25 UTC 2024'
END_TRAINING : Mon Apr 8 09:02:25 UTC 2024
+ echo 'Elapsed: 29min 24sec'
Elapsed: 29min 24sec
+ srun -l bash -c 'echo AFTER: $(hostname) $(sudo lctl get_param llite.*.stats | grep write_bytes)'
srun: Step created for StepId=191.4
1: AFTER: p4de-st-p4de-2
0: AFTER: p4de-st-p4de-1 write_bytes 11553 samples [bytes] 1 2147479552 775980517197
```
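The spine check boils down to counting distinct `nn-` ids in the `describe-instance-topology` response. A standalone sketch using sample JSON in the shape shown above, with no AWS call involved (the counting pipeline here is a simplified variant of the template's grep):

```shell
#!/bin/bash
# Sample response shaped like `aws ec2 describe-instance-topology` output.
TOPO_JSON='{
  "Instances": [
    {"InstanceId": "i-0000000000example",
     "NetworkNodes": ["nn-1111111111example", "nn-2222222222example", "nn-3333333333example"]},
    {"InstanceId": "i-1111111111example",
     "NetworkNodes": ["nn-1111111111example", "nn-2222222222example", "nn-3333333333example"]}
  ]
}'

# Instances share a network spine iff the distinct nn- ids across all
# instances collapse to a single set (3 levels per instance for p4de here).
UNIQ_NN=$(echo "$TOPO_JSON" | grep -o 'nn-[0-9a-z]*' | sort -u | wc -l)
echo "distinct nn ids: $UNIQ_NN"
[[ $UNIQ_NN -eq 3 ]] || echo "WARNING: instances on different network spines"
```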
@@ -0,0 +1,56 @@
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH --nodes=2 # number of nodes to use

set -exuo pipefail
Contributor:
Do we need to print all output and commands? Seems too verbose.

Contributor (author):

response in a separate comment.



################################################################################
## Preamble
################################################################################
# Helper function to query instance topology
lstopo_ec2() {
local INSTANCE_IDS=( $(srun cat /sys/devices/virtual/dmi/id/board_asset_tag) )
aws ec2 describe-instance-topology --instance-ids "${INSTANCE_IDS[@]}"
Contributor:
nice! I like how simple this is.

}

# Are SLURM_NNODES on the same network spine?
validate_ec2_same_spine() {
local TOPO_JSON="$(lstopo_ec2)"
echo "${TOPO_JSON}"

local UNIQ_NN=$(echo "$TOPO_JSON" | grep '^ *"nn\-.................\"' | sort -n | uniq -c | wc -l)
echo Expected 3 nn ids, got $UNIQ_NN nn ids
[[ $UNIQ_NN -eq 3 ]] || echo WARNING: ec2 instances on different network spine...
}
validate_ec2_same_spine

# Track instance ids, later on to view their CloudWatch metrics.
srun -l bash -c "echo \"hostname <=> instance_id mapping: \$(hostname) <=> \$(cat /sys/devices/virtual/dmi/id/board_asset_tag)\""

# Track per-instance cumulative Lustre statistics. In this example, we only show the write_bytes.
srun -l bash -c "echo BEFORE: \$(hostname) \$(sudo lctl get_param llite.*.stats | grep write_bytes)" || true
Contributor:
is this attempting to show lustre read/write statistics for the job?

Contributor (author):

yes.


env


################################################################################
## Actual scripts. Below example runs /usr/bin/hostname on all allocated nodes.
################################################################################
BEGIN_TRAINING=$(date)
SECONDS=0
srun -l /usr/bin/hostname
Contributor:
Add a block here for the training code

################################################################################
## Insert training code here...
################################################################################

END_TRAINING=$(date)
echo "BEGIN_TRAINING: ${BEGIN_TRAINING}"
echo "END_TRAINING : ${END_TRAINING}"
echo "Elapsed: $(($SECONDS / 60))min $(($SECONDS % 60))sec"


################################################################################
## Postamble
################################################################################
# Track per-instance cumulative Lustre statistics. In this example, we only show the write_bytes.
srun -l bash -c "echo AFTER: \$(hostname) \$(sudo lctl get_param llite.*.stats | grep write_bytes)" || true
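The `llite` stats are cumulative counters, so the job's actual write volume is the AFTER-minus-BEFORE difference. A sketch of the arithmetic, using the write_bytes totals from the sample output above (last field of the `write_bytes` line):

```shell
#!/bin/bash
# Cumulative write_bytes totals captured before and after the job
# (values taken from the sample Megatron-LM run shown earlier).
BEFORE=401369912220
AFTER=775980517197
echo "bytes written during job: $((AFTER - BEFORE))"
# → bytes written during job: 374610604977
```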