We tried to run CosmoFlow on H100 GPU nodes (a 4-node cluster, 32 GPUs in total) and are running into several issues with NVIDIA's MLPerf HPC CosmoFlow implementation in the following environment:
• Ubuntu 22.04.5 LTS
• NVIDIA Driver 535.126.03
• CUDA Toolkit 12.4
• Docker 27.1.2
• NVIDIA Container Toolkit 1.17.2
• SLURM 24.11
• Enroot 3.5.0
• Pyxis
I'm not sure whether CosmoFlow can run on the H100 at all. Where should I start to debug this?
In a multi-node GPU cluster with IB (32 H100 GPUs)
Slurm logs
/workspace/cosmoflow/trainer.py:282: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  with torch.cuda.amp.autocast():
/workspace/cosmoflow/trainer.py:282: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  with torch.cuda.amp.autocast():
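(The FutureWarning above is cosmetic and does not stop the run; it just reflects the torch.cuda.amp deprecation in recent PyTorch. A minimal sketch of the migration the warning asks for, where the model and input are stand-ins rather than the actual trainer.py code:

```python
import torch

model = torch.nn.Linear(16, 16).cuda()        # stand-in for the CosmoFlow model
x = torch.randn(4, 16, device="cuda")

# Old, deprecated spelling used at trainer.py:282:
#     with torch.cuda.amp.autocast():
#         y = model(x)

# Newer spelling the warning asks for (behaviour is the same):
with torch.amp.autocast("cuda"):
    y = model(x)

print(y.dtype)  # float16 inside the autocast region
```

The related torch.cuda.amp.GradScaler is deprecated the same way in favour of torch.amp.GradScaler("cuda"), so a warning-free log would need both call sites updated.)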
:::MLLOG {"namespace": "", "time_ms": 1735533105502, "event_type": "INTERVAL_START", "key": "staging_start", "value": null, "metadata": {"file": "/workspace/cosmoflow/utils/utils.py", "lineno": 233, "instance": 0}}
./run_and_time.sh: line 144: 3387730 Killed ${LOGGER:-} ${DISTRIBUTED} ${BIND} python main.py "${PARAMS[@]}" "$@"
./run_and_time.sh: line 144: 3422311 Killed ${LOGGER:-} ${DISTRIBUTED} ${BIND} python main.py "${PARAMS[@]}" "$@"
slurmstepd: error: mpi/pmix_v5: _errhandler: hgx-022 [2]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.403.6:18]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 403.6 ON hgx-020 CANCELLED AT 2024-12-30T12:33:49 ***
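A bare "Killed" with no Python traceback, arriving right after staging_start, usually means the process was terminated from outside, most often by the kernel OOM killer while the dataset is staged into node-local storage or RAM; the PMIx status -61 and the step cancellation are then just Slurm reacting to the dead ranks. That is an assumption, not something the log proves, so it is worth confirming: `dmesg` or `journalctl -k` on the affected nodes should show an "Out of memory" entry if this is the cause, and host memory can be watched while staging runs. A minimal sketch of such a watcher (the thread and its names are hypothetical, not part of the CosmoFlow code):

```python
import threading
import time

import psutil


def log_host_memory(stop: threading.Event, interval_s: float = 5.0) -> None:
    """Periodically print host RAM usage so an OOM kill can be correlated with staging."""
    while not stop.is_set():
        mem = psutil.virtual_memory()
        print(f"[mem-watch] used={mem.used / 2**30:.1f} GiB "
              f"avail={mem.available / 2**30:.1f} GiB ({mem.percent:.0f}%)",
              flush=True)
        time.sleep(interval_s)


stop = threading.Event()
threading.Thread(target=log_host_memory, args=(stop,), daemon=True).start()

# ... the data-staging / training step would run here ...
time.sleep(1)  # placeholder so the example terminates on its own

stop.set()
```

If staging is indeed exhausting RAM (for example when the dataset is copied into a tmpfs or shared-memory path), pointing the staging directory at local NVMe or giving the job more memory per node is the usual fix; the exact knob depends on how the staging path is configured in this submission's run scripts.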
In a single-node GPU run (8 H100 GPUs)
Compliance results
INFO - Running compliance on file: /results/slurm_241230122056067138914_1.log
INFO - Compliance checks: hpc_3.0.0/common.yaml
INFO - Compliance checks: hpc_3.0.0/closed_common.yaml
INFO - Compliance checks: hpc_3.0.0/closed_cosmoflow.yaml
WARNING - Failed checks for eval_error : v['value'] <= 0.124 and v['value'] > 0.
ERROR - FAILED
Slurm logs
:::MLLOG {"namespace": "", "time_ms": 1735532584719, "event_type": "POINT_IN_TIME", "key": "eval_error", "value": NaN, "metadata": {"file": "/workspace/cosmoflow/utils/utils.py", "lineno": 233, "epoch_num": 10, "instance": 0}}
:::MLLOG {"namespace": "", "time_ms": 1735532584720, "event_type": "INTERVAL_END", "key": "run_stop", "value": null, "metadata": {"file": "/workspace/cosmoflow/main.py", "lineno": 204, "status": "aborted", "time": 28.41117286682129, "epoch_num": 10}}
NCCL version 2.23.4+cuda12.6
[rank0]:[W1230 04:23:05.204499354 ProcessGroupNCCL.cpp:1294] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
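The compliance failure on the single-node run follows directly from the NaN: the checker requires 0 < eval_error <= 0.124, and a NaN value fails that check, so the run is marked aborted at epoch 10. A NaN validation error is usually not H100-specific; common causes are the loss diverging under mixed precision, a learning-rate/batch-size combination that no longer matches the reduced GPU count, or corrupted/incomplete input data. One way to localize it is to assert that the loss stays finite every step, so the first bad step is identified instead of only seeing NaN at evaluation time. A minimal sketch (the training-loop skeleton is hypothetical, not the actual trainer.py loop):

```python
import torch


def check_finite(name: str, value: torch.Tensor, step: int) -> None:
    """Fail loudly the first time a tensor goes NaN/Inf."""
    if not torch.isfinite(value).all():
        raise RuntimeError(f"{name} became non-finite at step {step}")


# Hypothetical training-step skeleton showing where the check would go:
model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler("cuda")

for step in range(3):
    x = torch.randn(16, 8, device="cuda")
    target = torch.randn(16, 1, device="cuda")
    with torch.amp.autocast("cuda"):
        loss = torch.nn.functional.mse_loss(model(x), target)
    check_finite("loss", loss, step)   # catches divergence at the step it happens
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```

The ProcessGroupNCCL warning at the end is unrelated and benign; it only says the script exited without calling torch.distributed.destroy_process_group(), which PyTorch 2.4 started warning about.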
NCCL all_reduce test on a single node (8 H100 GPUs):
srun --mpi=pmix --ntasks=8 --ntasks-per-node=8 --container-name=cosmoflow_401 all_reduce_perf_mpi -b 160M -e 160M -d float -G 1 -f 2
nThread 1 nGpus 1 minBytes 167772160 maxBytes 167772160 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 1
Using devices
Rank 0 Group 0 Pid 3480784 on hgx-020 device 0 [0x1a] NVIDIA H100 80GB HBM3
Rank 1 Group 0 Pid 3480785 on hgx-020 device 1 [0x40] NVIDIA H100 80GB HBM3
Rank 2 Group 0 Pid 3480786 on hgx-020 device 2 [0x53] NVIDIA H100 80GB HBM3
Rank 3 Group 0 Pid 3480787 on hgx-020 device 3 [0x66] NVIDIA H100 80GB HBM3
Rank 4 Group 0 Pid 3480788 on hgx-020 device 4 [0x9c] NVIDIA H100 80GB HBM3
Rank 5 Group 0 Pid 3480789 on hgx-020 device 5 [0xc0] NVIDIA H100 80GB HBM3
Rank 6 Group 0 Pid 3480790 on hgx-020 device 6 [0xd2] NVIDIA H100 80GB HBM3
Rank 7 Group 0 Pid 3480791 on hgx-020 device 7 [0xe4] NVIDIA H100 80GB HBM3
                                        out-of-place                     in-place
      size      count   type  redop  root    time   algbw   busbw  #wrong    time   algbw   busbw  #wrong
       (B) (elements)                         (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 167772160   41943040  float    sum    -1   727.4  230.65  403.64       0   726.0  231.08  404.39       0
Out of bounds values : 0 OK
Avg bus bandwidth : 404.015
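For what it's worth, ~404 GB/s bus bandwidth for a 160 MiB all-reduce is in the expected range for NVLink/NVSwitch inside one HGX H100 node, so intra-node NCCL looks healthy; this test just never touches the IB fabric that the 4-node run depends on. Before digging further into CosmoFlow itself, it may help to repeat the same kind of check across all 32 ranks from inside the same container. If rebuilding all_reduce_perf_mpi for multi-node is inconvenient, here is a minimal PyTorch-only sketch; it assumes it is launched with srun the same way as above, and that MASTER_ADDR/MASTER_PORT are exported to point at the first node (those variable names are torch.distributed's standard env:// convention, nothing CosmoFlow-specific):

```python
import os
import time

import torch
import torch.distributed as dist

# Rank/size come from Slurm; MASTER_ADDR and MASTER_PORT must be exported separately.
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

torch.cuda.set_device(local_rank)
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# 160 MiB of float32, matching the all_reduce_perf run above.
tensor = torch.zeros(160 * 1024 * 1024 // 4, device="cuda")

for _ in range(5):                      # warm-up
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

size_bytes = tensor.numel() * tensor.element_size()
busbw = 2 * (world_size - 1) / world_size * size_bytes / elapsed / 1e9
if rank == 0:
    print(f"all_reduce {size_bytes / 2**20:.0f} MiB over {world_size} ranks: "
          f"{elapsed * 1e3:.2f} ms/iter, ~{busbw:.1f} GB/s bus bandwidth")

dist.destroy_process_group()
```

If this also hangs or gets killed across 4 nodes, the problem is in the NCCL/IB setup rather than in CosmoFlow; if it passes with sensible bandwidth, the multi-node failure is more likely the staging/memory issue suspected above.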