I'm trying to run the NVIDIA HPL benchmark with the NVIDIA container toolkit as described here.
The only difference is that I'm using a MIG vGPU (A100 MIG 7g.40gb) in a virtual machine.
The HPL benchmark currently hangs indefinitely at its first step. You can find the log at the end of the issue.
Here are the variables I defined for the container:
root@027d829baa7d:/workspace# ./hpl.sh --no-multinode --dat hpl-linux-x86_64/sample-dat/HPL-1GPU.dat
================================================================================
HPL-NVIDIA 24.09.0 -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 8000
NB : 1024
PMAP : Column-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
HPL-NVIDIA ignores the following parameters from input file:
* Broadcast parameters
* Panel factorization parameters
* Look-ahead value
* L1 layout
* U layout
* Equilibration parameter
* Memory alignment parameter
HPL-NVIDIA settings from environment variables:
HPL_USE_NVSHMEM from environment 0
HPL_P2P_AS_BCAST from environment 1 (0->ncclBcast, 1->ncclSend / Recv, 2->CUDA - aware MPI, 3->host MPI, 4->NVSHMEM)
HPL_FCT_COMM_POLICY from environment 1 (0 -> nvshmem (default), 1 -> host MPI)
--- DEVICE INFO ---
Peak clock frequency: 1410 MHz
SM version : 80
Number of SMs : 42
-------------------
[HPL TRACE] cuda_nvshmem_init: max=0.0000 (0) min=0.0000 (0)
[WARNING] Change Input N 8000 to 7168
[HPL TRACE] ncclCommInitRank: max=0.1928 (0) min=0.1928 (0)
[HPL TRACE] cugetrfs_mp_init: max=0.2329 (0) min=0.2329 (0)
--- MEMORY INFO ---
DEVICE
System = 2.41571 GiB (MIN) 2.41571 GiB (MAX) 2.41571 GiB (AVG)
HPL buffers = 2.68807 GiB (MIN) 2.68807 GiB (MAX) 2.68807 GiB (AVG)
Used = 5.10378 GiB (MIN) 5.10378 GiB (MAX) 5.10378 GiB (AVG)
Total = 19.99597 GiB (MIN) 19.99597 GiB (MAX) 19.99597 GiB (AVG)
HOST
HPL buffers = 0.00008 GiB (MIN) 0.00008 GiB (MAX) 0.00008 GiB (AVG)
-------------------
... Testing HPL components ...
**** Factorization, m = 7168, policy = 0 ****
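As a sanity check on the numbers in the log above, here is a minimal Python sketch. It assumes that the adjusted N is simply N rounded down to a multiple of NB (the log only shows the result, not the rule) and that matrix entries are 8-byte doubles. The helper names `parse_memory_info` and `round_down_to_multiple` are made up for illustration and are not part of HPL.

```python
import re

# Memory lines copied verbatim from the DEVICE section of the log.
LOG = """\
System = 2.41571 GiB (MIN) 2.41571 GiB (MAX) 2.41571 GiB (AVG)
HPL buffers = 2.68807 GiB (MIN) 2.68807 GiB (MAX) 2.68807 GiB (AVG)
Used = 5.10378 GiB (MIN) 5.10378 GiB (MAX) 5.10378 GiB (AVG)
Total = 19.99597 GiB (MIN) 19.99597 GiB (MAX) 19.99597 GiB (AVG)
"""


def parse_memory_info(text):
    """Return {label: GiB}, taking the (MIN) value on each line."""
    pattern = re.compile(r"^\s*(.+?)\s*=\s*([0-9.]+) GiB \(MIN\)", re.MULTILINE)
    return {label: float(value) for label, value in pattern.findall(text)}


def round_down_to_multiple(n, nb):
    """Assumed adjustment rule: largest multiple of nb that is <= n."""
    return (n // nb) * nb


mem = parse_memory_info(LOG)
adjusted_n = round_down_to_multiple(8000, 1024)

# Size of an adjusted_n x adjusted_n matrix of 8-byte doubles, in GiB.
matrix_gib = adjusted_n ** 2 * 8 / 2**30

print("adjusted N:", adjusted_n)
print("System + HPL buffers:", round(mem["System"] + mem["HPL buffers"], 5))
print("Used (from log):", mem["Used"])
print("matrix size (GiB):", matrix_gib)
print("Total device memory (GiB):", mem["Total"])
```

Under these assumptions the adjusted N matches the value in the warning, and the reported "Used" figure is the sum of the "System" and "HPL buffers" lines, so the run does not look memory-bound at this problem size.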
Do you have any suggestions on what could be the issue? I'm interested in running the benchmark on a single vGPU, without communication between nodes.
Thanks in advance
This is how I start the container:
sudo docker run --rm --runtime=nvidia --gpus all --shm-size=20g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --privileged --security-opt=label=disable --env-file variables.env -i -t nvcr.io/nvidia/hpc-benchmarks:24.09 /bin/bash
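If the launch needs to be scripted, the same invocation can be assembled as an argument list in Python. This is only a sketch: `build_hpl_docker_cmd` is a hypothetical helper, and every flag and value in it is copied from the command above rather than taken from any additional documentation.

```python
import shlex


def build_hpl_docker_cmd(image="nvcr.io/nvidia/hpc-benchmarks:24.09",
                         env_file="variables.env",
                         shm_size="20g"):
    """Hypothetical helper: rebuild the docker run command from the issue."""
    return [
        "sudo", "docker", "run", "--rm",
        "--runtime=nvidia", "--gpus", "all",
        f"--shm-size={shm_size}", "--ipc=host",
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
        "--privileged", "--security-opt=label=disable",
        "--env-file", env_file,
        "-i", "-t", image, "/bin/bash",
    ]


cmd = build_hpl_docker_cmd()
# shlex.join (Python 3.8+) renders the list as a single shell-safe string.
print(shlex.join(cmd))
```

Keeping the arguments in a list makes it easier to change one option at a time (for example the env file) while comparing runs, and the list can be passed to `subprocess.run(cmd)` unchanged.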