MIG support for HPL benchmark #832

Open
marcofaltelli opened this issue Dec 12, 2024 · 0 comments

Hi,

I'm trying to run the NVIDIA HPL benchmark with the NVIDIA Container Toolkit as described here.
The only difference is that I'm using a MIG vGPU (A100 MIG 7g.40gb) inside a virtual machine.
At the moment the benchmark hangs indefinitely at its first step; you can find the full log at the end of the issue.
Here are the variables I defined for the container:

root@cuda-clic:/home/ubuntu# cat variables.env 
NVIDIA_MIG_MONITOR_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=all
NVIDIA_MIG_CONFIG_DEVICES=all
HPL_USE_NVSHMEM=0
HPL_P2P_AS_BCAST=1
HPL_FCT_COMM_POLICY=1
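
As a sanity check (not part of the original instructions), the MIG instance can be confirmed on the host VM before starting the container; nvidia-smi should list the single 7g.40gb device:

root@cuda-clic:/home/ubuntu# nvidia-smi -L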

This is how I start the container:

sudo docker run --rm --runtime=nvidia --gpus all --shm-size=20g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --privileged --security-opt=label=disable --env-file variables.env -i -t nvcr.io/nvidia/hpc-benchmarks:24.09 /bin/bash
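
In case it matters, the container can also be restricted to just the MIG instance instead of --gpus all. Assuming the MIG UUID selection supported by the NVIDIA Container Toolkit, something like the following should also work; it is the same command with only the --gpus argument changed, and the UUID here is a placeholder for the one reported by nvidia-smi -L:

sudo docker run --rm --runtime=nvidia --gpus '"device=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' --shm-size=20g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --privileged --security-opt=label=disable --env-file variables.env -i -t nvcr.io/nvidia/hpc-benchmarks:24.09 /bin/bash

Equivalently, the instance can be selected with -e NVIDIA_VISIBLE_DEVICES=MIG-&lt;UUID&gt; when using --runtime=nvidia.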

Container log for HPL:

root@027d829baa7d:/workspace# ./hpl.sh --no-multinode --dat hpl-linux-x86_64/sample-dat/HPL-1GPU.dat   

================================================================================
HPL-NVIDIA 24.09.0  -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    8000 
NB     :    1024 
PMAP   : Column-major process mapping
P      :       1 
Q      :       1 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :    Left 
BCAST  :  2ringM 
DEPTH  :       1 
SWAP   : Spread-roll (long)
L1     : no-transposed form
U      : transposed form
EQUIL  : no
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0


HPL-NVIDIA ignores the following parameters from input file:
	* Broadcast parameters
	* Panel factorization parameters
	* Look-ahead value
	* L1 layout
	* U layout
	* Equilibration parameter
	* Memory alignment parameter

HPL-NVIDIA settings from environment variables:
	HPL_USE_NVSHMEM from environment 0 
	HPL_P2P_AS_BCAST from environment 1 (0->ncclBcast, 1->ncclSend / Recv, 2->CUDA - aware MPI, 3->host MPI, 4->NVSHMEM)
	HPL_FCT_COMM_POLICY from environment 1 (0 -> nvshmem (default), 1 -> host MPI)
--- DEVICE INFO ---
  Peak clock frequency: 1410 MHz
  SM version          : 80
  Number of SMs       : 42
-------------------
[HPL TRACE] cuda_nvshmem_init: max=0.0000 (0) min=0.0000 (0)
[WARNING] Change Input N 8000 to 7168
[HPL TRACE] ncclCommInitRank: max=0.1928 (0) min=0.1928 (0)
[HPL TRACE] cugetrfs_mp_init: max=0.2329 (0) min=0.2329 (0)
--- MEMORY INFO ---
DEVICE
  System           =      2.41571 GiB (MIN)      2.41571 GiB (MAX)      2.41571 GiB (AVG)
  HPL buffers      =      2.68807 GiB (MIN)      2.68807 GiB (MAX)      2.68807 GiB (AVG)
  Used             =      5.10378 GiB (MIN)      5.10378 GiB (MAX)      5.10378 GiB (AVG)
  Total            =     19.99597 GiB (MIN)     19.99597 GiB (MAX)     19.99597 GiB (AVG)
HOST
  HPL buffers      =      0.00008 GiB (MIN)      0.00008 GiB (MAX)      0.00008 GiB (AVG)
-------------------

 ... Testing HPL components ... 

 **** Factorization, m = 7168, policy = 0 **** 
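
For reference, the parameter list above maps onto the standard HPLinpack input layout of the .dat file. A sketch of what hpl-linux-x86_64/sample-dat/HPL-1GPU.dat roughly contains is below; the shipped sample may differ slightly, and according to the log only N, NB, PMAP, P and Q are actually honoured by HPL-NVIDIA:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
8000         Ns
1            # of NBs
1024         NBs
1            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
0            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
3            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

N and NB are the values I would tune to size the problem for the single MIG slice (the run above already rescales N from 8000 to 7168).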

Do you have any suggestions on what the issue could be? I'm only interested in running the benchmark on a single vGPU, with no inter-node communication.
Thanks in advance!
