[QUESTION] How to Use ndtimeline in a Multi-Machine Multi-GPU Environment #55

Open
zmtttt opened this issue Sep 18, 2024 · 4 comments

@zmtttt

zmtttt commented Sep 18, 2024

Does ndtimeline support multi-machine, multi-GPU use?
I can already use the ndtimeline tool on a single machine with multiple GPUs to analyze GPT,
but I wonder whether it also supports multiple machines. How do I flush ndtimeline? How do I deal with communication? How do I define a custom timed event?
Thanks!
The following picture shows a run on a single machine with four GPUs:
[Image: Megatron]

@MingjiHan99 @pengyanghua @MackZackA @JsBlueCat @Meteorix

@zmtttt
Author

zmtttt commented Sep 23, 2024

Hello!
How do I use ndtimeline across multiple machines?
“In case you need a tracing file related to ranks on different machines, you can implement an MQHandler by yourself and send all metrics to a central storage. This provides you with a method to filter and generate the tracing file for specified ranks.”
Does MQHandler mean message-queue handler? What is the central storage, and how do I set it up? Have you evaluated the time overhead?

@MackZackA
Collaborator

Thank you for your interest in veScale!
For this question, I would like to refer you to @vocaltract, who is an expert on the MQ handler.

@vocaltract
Collaborator

vocaltract commented Oct 9, 2024

“MQHandler” stands for “message queue handler”. We tend to use a message queue (MQ) to send metric data.
The overhead is quite low because the message queue producer has its own local buffer in memory and will send data to the broker asynchronously.
“Central Storage” refers to the infrastructure that consumes messages and persists them in a data warehouse such as Hive, ClickHouse, InfluxDB, and so on.
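
To make the "local buffer plus asynchronous send" idea concrete, here is a minimal sketch in plain Python. It is not ndtimeline's actual handler interface: the class name `BufferedMQHandler`, the `emit` method, the record fields, and the `send_batch` callable are all hypothetical and stand in for whatever handler hook and MQ client you really use.

```python
import json
import queue
import threading


class BufferedMQHandler:
    """Hypothetical handler (not part of ndtimeline): buffers metric records
    in memory and ships them to a broker from a background thread."""

    def __init__(self, send_batch, rank, max_buffer=10000, batch_size=100):
        # send_batch: assumed callable that delivers a list of JSON strings
        # to the broker, e.g. a thin wrapper around your MQ client.
        self._send_batch = send_batch
        self._rank = rank
        self._batch_size = batch_size
        self._buffer = queue.Queue(maxsize=max_buffer)
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, metric, start, duration):
        """Called from the training process; never blocks on the network."""
        record = {"rank": self._rank, "metric": metric,
                  "start": start, "duration": duration}
        try:
            self._buffer.put_nowait(json.dumps(record))
        except queue.Full:
            pass  # drop the record rather than stall training

    def _drain(self):
        # Keep sending until asked to stop AND the buffer is empty.
        while not self._stop.is_set() or not self._buffer.empty():
            batch = []
            try:
                batch.append(self._buffer.get(timeout=0.1))
                while len(batch) < self._batch_size:
                    batch.append(self._buffer.get_nowait())
            except queue.Empty:
                pass
            if batch:
                self._send_batch(batch)

    def close(self):
        """Flush whatever is still buffered, then stop the worker thread."""
        self._stop.set()
        self._worker.join()
```

Each rank on each machine would create one such handler with its own rank id and call `close()` at the end of training to flush. The only cost on the training path is a JSON encode and an in-memory enqueue, which is why the overhead stays low; the network send happens on the worker thread.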

@zmtttt
Author

zmtttt commented Oct 10, 2024

Thanks! But I still don't know how to write the MQHandler code. Do I need to create a separate script as a producer to receive messages from consumers? That is, each rank sends its own record information to the consumer, and the corresponding producer receives the rank-record information from different nodes.
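
In the usual message-queue terminology the roles are the other way round: every rank is a producer that publishes its records, and a separate process is the consumer that reads them and writes them to the central storage. A rough sketch of the consumer-side step that filters records by rank and builds a tracing file is below. It assumes each message is a JSON string with the hypothetical fields `rank`, `metric`, `start`, and `duration` (times in seconds), and it assumes the Chrome trace event format as the output; neither is confirmed as ndtimeline's own schema.

```python
import json


def build_trace(messages, ranks):
    """Turn JSON-encoded records into Chrome trace events, keeping only
    the records whose rank is in `ranks`."""
    wanted = set(ranks)
    events = []
    for raw in messages:
        rec = json.loads(raw)
        if rec["rank"] not in wanted:
            continue
        events.append({
            "name": rec["metric"],
            "ph": "X",                     # "complete" event with a duration
            "ts": rec["start"] * 1e6,      # seconds -> microseconds
            "dur": rec["duration"] * 1e6,
            "pid": rec["rank"],            # one process row per rank
            "tid": 0,
        })
    events.sort(key=lambda e: e["ts"])
    return {"traceEvents": events}
```

In a real deployment `messages` would come from the data warehouse or directly from the broker, and the result could be saved with `json.dump(build_trace(messages, ranks=[0, 8]), open("trace.json", "w"))` and opened in a trace viewer.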
