Is your feature request related to a problem? Please describe.
For tracking system metrics with the profiler and MLflow, the current code only supports NVIDIA GPUs, because the only package we use for this is pynvml, which is limited to NVIDIA hardware.
Describe the solution you'd like
We could improve this and at least provide better support for AMD GPUs: there is an open-source package called pyrsmi (developed by ROCm) that does the same as pynvml, but for AMD ROCm hardware.
We could define a custom SystemMetrics monitor that can handle several kinds of hardware (see comments for a potential implementation). It would work out of the box, with the same config settings you already use to run MLflow. Lastly, it would only monitor one node, i.e. 8 AMD GPUs on the same node and not across nodes, since we assume that memory consumption, speed, etc. would be the same for all GPUs except the "master GPU".
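As a rough illustration of the idea (not the implementation referred to in the comments), a vendor-agnostic monitor could hide pynvml and pyrsmi behind one small interface and pick whichever initialises on the current node. The names `GpuBackend`, `NvidiaBackend`, `AmdBackend`, `detect_backend` and `collect_gpu_metrics` are hypothetical, as are the metric key names. The pyrsmi calls are written from memory of its `rocml` module and should be checked against the installed version.

```python
from typing import Dict, Optional, Protocol


class GpuBackend(Protocol):
    """Hypothetical minimal interface that each vendor adapter implements."""

    def device_count(self) -> int: ...
    def utilization_percent(self, index: int) -> float: ...
    def memory_used_bytes(self, index: int) -> int: ...


class NvidiaBackend:
    """Adapter around pynvml (NVIDIA only)."""

    def __init__(self) -> None:
        import pynvml  # imported lazily so AMD-only nodes do not need it

        self._nvml = pynvml
        pynvml.nvmlInit()

    def device_count(self) -> int:
        return self._nvml.nvmlDeviceGetCount()

    def utilization_percent(self, index: int) -> float:
        handle = self._nvml.nvmlDeviceGetHandleByIndex(index)
        return float(self._nvml.nvmlDeviceGetUtilizationRates(handle).gpu)

    def memory_used_bytes(self, index: int) -> int:
        handle = self._nvml.nvmlDeviceGetHandleByIndex(index)
        return int(self._nvml.nvmlDeviceGetMemoryInfo(handle).used)


class AmdBackend:
    """Adapter around pyrsmi (AMD ROCm only).

    Assumption: the rocml function names below match the installed pyrsmi.
    """

    def __init__(self) -> None:
        from pyrsmi import rocml  # imported lazily, see NvidiaBackend

        self._rocml = rocml
        rocml.smi_initialize()

    def device_count(self) -> int:
        return self._rocml.smi_get_device_count()

    def utilization_percent(self, index: int) -> float:
        return float(self._rocml.smi_get_device_utilization(index))

    def memory_used_bytes(self, index: int) -> int:
        return int(self._rocml.smi_get_device_memory_used(index))


def detect_backend() -> Optional[GpuBackend]:
    """Return the first backend that initialises on this node, else None."""
    for backend_cls in (NvidiaBackend, AmdBackend):
        try:
            return backend_cls()
        except Exception:
            # Package missing or no matching driver/hardware: try the next one.
            continue
    return None


def collect_gpu_metrics(backend: GpuBackend) -> Dict[str, float]:
    """Read utilisation and memory usage for every GPU on the local node."""
    metrics: Dict[str, float] = {}
    for i in range(backend.device_count()):
        metrics[f"gpu_{i}_utilization_percentage"] = backend.utilization_percent(i)
        metrics[f"gpu_{i}_memory_usage_megabytes"] = backend.memory_used_bytes(i) / 1e6
    return metrics
```

A monitor built this way would call `detect_backend()` once at start-up and `collect_gpu_metrics(backend)` on every sampling interval, so the MLflow-facing code never needs to know which vendor library is underneath. Both adapters only see devices on the local node, which matches the single-node scope described above.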
Describe alternatives you've considered
No response
Additional context
This solution was originally suggested by @einrone, so many thanks for this!
Organisation
No response