MIG support for HPL benchmark #832

Open
marcofaltelli opened this issue Dec 12, 2024 · 0 comments

Hi,

I'm trying to run the NVIDIA HPL benchmark with the NVIDIA Container Toolkit as described here.
The only difference is that I'm using a MIG vGPU (A100 MIG 7g.40gb) inside a virtual machine.
At the moment the benchmark hangs indefinitely at its first step; you can find the full log at the end of the issue.
Here are the variables I defined for the container:

root@cuda-clic:/home/ubuntu# cat variables.env 
NVIDIA_MIG_MONITOR_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=all
NVIDIA_MIG_CONFIG_DEVICES=all
HPL_USE_NVSHMEM=0
HPL_P2P_AS_BCAST=1
HPL_FCT_COMM_POLICY=1
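
As a sanity check (not part of the original instructions), the MIG instance can be confirmed on the host VM before starting the container; nvidia-smi should list the single 7g.40gb device:

root@cuda-clic:/home/ubuntu# nvidia-smi -L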

This is how I start the container:

sudo docker run --rm --runtime=nvidia --gpus all --shm-size=20g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --privileged --security-opt=label=disable --env-file variables.env -i -t nvcr.io/nvidia/hpc-benchmarks:24.09 /bin/bash
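
In case it matters, the container can also be restricted to just the MIG instance instead of --gpus all. Assuming the MIG UUID selection supported by the NVIDIA Container Toolkit, something like the following should also work; it is the same command with only the --gpus argument changed, and the UUID here is a placeholder for the one reported by nvidia-smi -L:

sudo docker run --rm --runtime=nvidia --gpus '"device=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' --shm-size=20g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --privileged --security-opt=label=disable --env-file variables.env -i -t nvcr.io/nvidia/hpc-benchmarks:24.09 /bin/bash

Equivalently, the instance can be selected with -e NVIDIA_VISIBLE_DEVICES=MIG-&lt;UUID&gt; when using --runtime=nvidia.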

Container log for HPL:

root@027d829baa7d:/workspace# ./hpl.sh --no-multinode --dat hpl-linux-x86_64/sample-dat/HPL-1GPU.dat   

================================================================================
HPL-NVIDIA 24.09.0  -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    8000 
NB     :    1024 
PMAP   : Column-major process mapping
P      :       1 
Q      :       1 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :    Left 
BCAST  :  2ringM 
DEPTH  :       1 
SWAP   : Spread-roll (long)
L1     : no-transposed form
U      : transposed form
EQUIL  : no
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0


HPL-NVIDIA ignores the following parameters from input file:
	* Broadcast parameters
	* Panel factorization parameters
	* Look-ahead value
	* L1 layout
	* U layout
	* Equilibration parameter
	* Memory alignment parameter

HPL-NVIDIA settings from environment variables:
	HPL_USE_NVSHMEM from environment 0 
	HPL_P2P_AS_BCAST from environment 1 (0->ncclBcast, 1->ncclSend / Recv, 2->CUDA - aware MPI, 3->host MPI, 4->NVSHMEM)
	HPL_FCT_COMM_POLICY from environment 1 (0 -> nvshmem (default), 1 -> host MPI)
--- DEVICE INFO ---
  Peak clock frequency: 1410 MHz
  SM version          : 80
  Number of SMs       : 42
-------------------
[HPL TRACE] cuda_nvshmem_init: max=0.0000 (0) min=0.0000 (0)
[WARNING] Change Input N 8000 to 7168
[HPL TRACE] ncclCommInitRank: max=0.1928 (0) min=0.1928 (0)
[HPL TRACE] cugetrfs_mp_init: max=0.2329 (0) min=0.2329 (0)
--- MEMORY INFO ---
DEVICE
  System           =      2.41571 GiB (MIN)      2.41571 GiB (MAX)      2.41571 GiB (AVG)
  HPL buffers      =      2.68807 GiB (MIN)      2.68807 GiB (MAX)      2.68807 GiB (AVG)
  Used             =      5.10378 GiB (MIN)      5.10378 GiB (MAX)      5.10378 GiB (AVG)
  Total            =     19.99597 GiB (MIN)     19.99597 GiB (MAX)     19.99597 GiB (AVG)
HOST
  HPL buffers      =      0.00008 GiB (MIN)      0.00008 GiB (MAX)      0.00008 GiB (AVG)
-------------------

 ... Testing HPL components ... 

 **** Factorization, m = 7168, policy = 0 **** 
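
For reference, the parameter list above maps onto the standard HPLinpack input layout of the .dat file. A sketch of what hpl-linux-x86_64/sample-dat/HPL-1GPU.dat roughly contains is below; the shipped sample may differ slightly, and according to the log only N, NB, PMAP, P and Q are actually honoured by HPL-NVIDIA:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
8000         Ns
1            # of NBs
1024         NBs
1            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
0            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
3            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

N and NB are the values I would tune to size the problem for the single MIG slice (the run above already rescales N from 8000 to 7168).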

Do you have any suggestions on what the issue could be? I'm only interested in running the benchmark on a single vGPU, with no inter-node communication.
Thanks in advance!
