As CUDA and NVML have been upgraded, some functions have gained variants with a "_v2" suffix, such as nvmlDeviceGetMemoryInfo and nvmlDeviceGetMemoryInfo_v2. Upper-level applications may prefer the v2 functions when calling these APIs. If libcuda.so or libnvidia-ml.so does not export the v2 function, they fall back to the v1 version, as in this code snippet https://github.com/XuehaiPan/nvitop/blob/470245dc3da0d9f4e3106b2c981d63d23440a5a5/nvitop/api/libnvml.py#L861-L879 .
However, a hook library like nvshare runs into a problem here. If the hook declares the v2 version of a function to stay compatible with newer versions, and then forwards the call to the v2 function in the real library, the call can fail when the real library is an older version that lacks the v2 function, potentially leading to an exception.
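The caller-side fallback described above can be sketched roughly as follows. This is only an illustration of the idea, not the nvitop or nvshare implementation: `resolve_first` is a hypothetical helper, and it relies on the fact that `ctypes` raises `AttributeError` when a shared library does not export the requested symbol. A hook library could apply the same kind of probe to the real library before deciding which function to forward to.

```python
import ctypes


def resolve_first(lib, names):
    """Return (name, function) for the first symbol in `names` that `lib` exports.

    `lib` is expected to be a ctypes.CDLL handle; attribute lookup on it
    raises AttributeError when the symbol is missing, which is what
    triggers the fallback to the next candidate name.
    """
    for name in names:
        try:
            return name, getattr(lib, name)
        except AttributeError:
            continue  # symbol not exported by this library version, try the next one
    raise LookupError("none of the symbols are exported: " + ", ".join(names))


def resolve_memory_info(lib_name="libnvidia-ml.so.1"):
    """Prefer the v2 entry point and fall back to v1 (hypothetical helper).

    Raises OSError if the library itself cannot be loaded.
    """
    nvml = ctypes.CDLL(lib_name)
    return resolve_first(
        nvml, ["nvmlDeviceGetMemoryInfo_v2", "nvmlDeviceGetMemoryInfo"]
    )
```

On a machine with an older driver, `resolve_memory_info()` would be expected to return the v1 name, so the caller knows which struct layout and call convention to use.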
If we can't solve this problem inside the library, maybe we can work around it from outside.
A simple, if inelegant, idea: detect the CUDA/NVML version when the container or CUDA process starts, then put a matching hook library on the library path.
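A rough sketch of that workaround, written as a small launcher, might look like the following. Several things here are assumptions rather than anything nvshare provides: the hook variants and their paths are hypothetical, the hook is assumed to be injected through `LD_PRELOAD`, and instead of parsing a version number the sketch probes the real library for an exported symbol, which stands in for version detection.

```python
import ctypes
import os


def pick_hook_library(real_lib, variants):
    """Pick the first hook whose required symbol exists in the real library.

    `variants` is a list of (required_symbol, hook_path) pairs, newest
    first. A required_symbol of None marks a catch-all variant.
    """
    for required_symbol, hook_path in variants:
        if required_symbol is None or hasattr(real_lib, required_symbol):
            return hook_path
    raise LookupError("no hook variant matches the installed library")


def with_preload(base_env, hook_path):
    """Return a copy of `base_env` with `hook_path` first in LD_PRELOAD."""
    env = dict(base_env)
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = hook_path if not existing else hook_path + ":" + existing
    return env


# Hypothetical hook builds; these paths are placeholders, not real files.
EXAMPLE_VARIANTS = [
    ("cuGetProcAddress_v2", "/opt/hooks/libhook_with_v2.so"),
    (None, "/opt/hooks/libhook_v1_only.so"),
]


def launch(argv, variants=EXAMPLE_VARIANTS, real_lib_name="libcuda.so.1"):
    """Probe the real library, then replace this process with the target program."""
    real_lib = ctypes.CDLL(real_lib_name)  # raises OSError if the driver is absent
    hook_path = pick_hook_library(real_lib, variants)
    os.execvpe(argv[0], argv, with_preload(os.environ, hook_path))
```

A container entrypoint could call something like `launch(["python", "train.py"])`, so that the CUDA process only ever sees a hook build matching the symbols its driver exports.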
As a concrete instance of the problem, the code at https://github.com/grgalex/nvshare/blob/main/src/hook.c#L598 returns CUDA_ERROR_NOT_INITIALIZED when the real libcuda.so has no cuGetProcAddress_v2 function, which might cause the user program to malfunction.