[QUESTION]How to Use ndtimeline in a Multi-Machine Multi-GPU Environment #55

zmtttt · 2024-09-18T09:07:53Z

Does ndtimeline support multi-machine, multi-GPU use?
I can currently use the ndtimeline tool on a single machine with multiple GPUs to analyze GPT,
but I wonder whether it also supports multiple machines. How do I flush ndtimeline? How should communication be handled? And how do I define a custom time event?
Thanks!
The following picture shows a run on a single machine with four GPUs.

@MingjiHan99 @pengyanghua @MackZackA @JsBlueCat @Meteorix

zmtttt · 2024-09-23T03:13:16Z


Hello!
I was wondering how to use ndtimeline across multiple machines.
"- In case you need a tracing file related to ranks on different machines, you can implement an MQHandler by yourself and send all metrics to a central storage. This provides you with a method to filter and generate the tracing file for specified ranks."
Does MQHandler mean a message-queue handler? What is the central storage, and how do I set it up? Have you evaluated the time overhead?

MackZackA · 2024-10-09T14:04:09Z

Thank you for your interest in veScale!
For this question, I would like to refer you to @vocaltract, who is an expert on the MQ handler.

vocaltract · 2024-10-09T14:15:15Z


“MQHandler” stands for “message queue handler”. We tend to use a message queue (MQ) to send metric data.
The overhead is quite low because the message queue producer has its own local buffer in memory and will send data to the broker asynchronously.
“Central Storage” refers to the infrastructure that consumes messages and persists them in a data warehouse such as Hive, ClickHouse, InfluxDB, and so on.
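
To make the buffering point concrete, here is a minimal sketch of the producer side. It is not ndtimeline's actual handler interface: the class name `BufferedMQHandler`, its `emit` and `close` methods, the record fields, and the `send_fn` callable (a stand-in for a real message-queue client's send call) are all assumptions made for illustration. It only shows why the overhead stays low: the training process appends to an in-memory buffer, and a background thread ships batches to the broker.

```python
import json
import queue
import threading


class BufferedMQHandler:
    """Hypothetical handler: buffer records locally, send them asynchronously."""

    def __init__(self, send_fn, rank, batch_size=64, flush_interval=0.5):
        self._send_fn = send_fn            # stand-in for a real MQ producer's send
        self._rank = rank
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._buffer = queue.Queue()       # local in-memory buffer
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, metric, start_ts, duration):
        # Hot path: only enqueue a small dict, never touch the network.
        self._buffer.put(
            {"rank": self._rank, "metric": metric, "ts": start_ts, "dur": duration}
        )

    def _drain(self):
        # Take up to batch_size records without blocking.
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._buffer.get_nowait())
            except queue.Empty:
                break
        return batch

    def _send(self, batch):
        self._send_fn(json.dumps(batch).encode("utf-8"))

    def _run(self):
        # Background thread: send batches, sleep briefly when the buffer is empty.
        while not self._stop.is_set():
            batch = self._drain()
            if batch:
                self._send(batch)
            else:
                self._stop.wait(self._flush_interval)

    def close(self):
        # Stop the worker, then flush whatever is still buffered.
        self._stop.set()
        self._worker.join()
        while True:
            batch = self._drain()
            if not batch:
                break
            self._send(batch)
```

In a real deployment `send_fn` would wrap whichever message-queue client you use, and one such handler would run in every rank on every machine; the metric name passed to `emit` is whatever your profiling setup records.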

zmtttt · 2024-10-10T03:18:19Z


Thanks! But I still don't know how to write the MQHandler code. Do I need to create a separate script as a producer to receive messages from consumers? That is, each rank sends its own record information to the consumer, and the corresponding producer receives the rank-record information from different nodes?
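
On the roles asked about here: in the usual message-queue arrangement, each rank is a producer that sends its records, and a separate central process is the consumer that receives them. A minimal sketch of that consumer side follows, assuming each message is a JSON list of records with the hypothetical fields `rank`, `metric`, `ts`, and `dur` (times in seconds). The function name `build_chrome_trace` and the record schema are illustrative and not part of ndtimeline; only the output layout follows the Chrome trace event format.

```python
import json


def build_chrome_trace(payloads, ranks=None):
    """Merge JSON batches from all ranks into a Chrome-trace-style dict.

    payloads: iterable of JSON-encoded lists of records (hypothetical schema).
    ranks:    optional set of ranks to keep; None keeps every rank.
    """
    events = []
    for payload in payloads:
        for rec in json.loads(payload):
            if ranks is not None and rec["rank"] not in ranks:
                continue  # filter out ranks that were not requested
            events.append(
                {
                    "name": rec["metric"],
                    "ph": "X",                 # "complete" event: start plus duration
                    "ts": rec["ts"] * 1e6,     # trace timestamps are in microseconds
                    "dur": rec["dur"] * 1e6,
                    "pid": rec["rank"],        # one row group per rank in the viewer
                    "tid": 0,
                }
            )
    events.sort(key=lambda e: e["ts"])
    return {"traceEvents": events}
```

The consumer would call this on the messages it has read from the broker (or from the data warehouse behind it) and write the result with `json.dump` to get a file for the selected ranks.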

MackZackA mentioned this issue Oct 9, 2024

[QUESTION]How to use MQhandler for muti machines？ #56

Closed
