[BUG] (pytorch) flickering in multiprocessing / distributed #3147
Comments
If you have multiple processes, each of your processes will have a unique Progress object. They are all going to be writing to standard output on different schedules and showing different data. Even if it didn't flicker, it's not going to show you anything useful. The solution is to coordinate a single process which displays progress information from each of the worker processes, via some kind of IPC mechanism. If you were using multiprocessing I'd recommend Pipes. I don't know if there is an equivalent with Torch.
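For anyone who lands here later, a minimal sketch of that pattern with multiprocessing Pipes might look like the following. The worker body, `STEPS`, and the polling loop are illustrative stand-ins, not anything Rich or Torch provides; the point is that only one process ever owns a Progress.

```python
import multiprocessing as mp
import time
from multiprocessing.connection import wait

from rich.progress import Progress

STEPS = 100


def worker(conn) -> None:
    """Do the work and report progress over the pipe; never touch the terminal."""
    for _ in range(STEPS):
        time.sleep(0.01)   # stand-in for real work
        conn.send(1)       # report one completed step
    conn.send(None)        # sentinel: this worker is finished
    conn.close()


def main(num_workers: int = 4) -> None:
    parents, children = zip(*(mp.Pipe(duplex=False) for _ in range(num_workers)))
    procs = [mp.Process(target=worker, args=(child,)) for child in children]
    for p in procs:
        p.start()

    # Only this process writes to stdout, so there is exactly one live display.
    with Progress() as progress:
        task_for = {
            parent: progress.add_task(f"rank {rank}", total=STEPS)
            for rank, parent in enumerate(parents)
        }
        open_conns = set(parents)
        while open_conns:
            for conn in wait(open_conns, timeout=0.1):
                try:
                    msg = conn.recv()
                except EOFError:
                    open_conns.discard(conn)
                    continue
                if msg is None:
                    open_conns.discard(conn)
                else:
                    progress.advance(task_for[conn], msg)

    for p in procs:
        p.join()


if __name__ == "__main__":
    main()
```

With torch.distributed the Pipes would be replaced by whatever IPC is available (a queue, a file, point-to-point ops), but the shape is the same: workers send counts, one process renders.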
I think the confusing part here was the visible argument creating a very chaotic situation. As a user, the expected behavior is that anything except the first rank will have a progress task but just be invisible, as described in the Hiding Tasks section of the docs. If only rank 0 is visible, then wouldn't we expect clean data and output? I didn't think to look at the Progress group at first, since the docs focus on that group dictating style. Not to mention that the multithreading examples floating around do not seem as "complicated". I'm also not sure disabling all processes except for the 0th rank is the best solution, as there are definitely reasons to have asynchronous tasks happening. Also, given the triviality of tqdm and how the
The problem is with your code. To solve it you will need an understanding of the difference between processes and threads. This is not a problem that Rich can solve for you on its own. If you think tqdm is better, use that.
Sure, I'm very willing to admit that my code can be better, and I'm more than happy to hear critiques and ways I can improve. We're all niche experts; I'm a researcher first and a dev second, so I'm sure my code is bad. But I thought this report would be useful to you, especially as a competitor solves the issue. I'm sorry if I came off antagonistic; I'm not trying to fight but rather inform you of a point of confusion with your awesome tool. I did find a solution, but it was non-obvious from reading the docs, and that seems like something users should report, if only so this issue can help others google the problem I faced. If you want to support distributed users that's awesome; if you don't, it's your project and that's cool too. I get that you have your own priorities. (Fwiw, PyTorch Lightning has a Rich tracker, so there are large ML projects trying to use Rich.) But can you help me understand why the visible argument behaves this way? Is there a way to use the simpler track API, something like this?

```python
# foo.py
for i in rich.progress.track(range(N), disable=dist.get_rank() != 0):
    foo_task()
    if i % some_frequency:
        bar_task()

# bar.py
def bar_task():
    for j in rich.progress.track(range(M), disable=dist.get_rank() != 0):
        buzz()
```
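(For reference, a sketch of the closest supported shape for the two-loop case above: one Progress owning both bars, rather than two track() calls, each of which starts its own live display. foo_task, buzz, N, M and some_frequency are the placeholders from the snippet, filled with stand-in bodies so it runs.)

```python
import time

import torch.distributed as dist
from rich.progress import Progress


def foo_task() -> None:   # stand-in for the real outer-loop work
    time.sleep(0.01)


def buzz() -> None:       # stand-in for the real inner-loop work
    time.sleep(0.01)


def run(N: int = 50, M: int = 20, some_frequency: int = 10) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    # One Progress owns both bars; two nested track() calls would each try to
    # start their own live display and collide.
    with Progress(disable=rank != 0) as progress:
        outer = progress.add_task("outer", total=N)
        for i in range(N):
            foo_task()
            if i % some_frequency:
                inner = progress.add_task("inner", total=M)
                for _ in range(M):
                    buzz()
                    progress.advance(inner)
                progress.remove_task(inner)
            progress.advance(outer)


if __name__ == "__main__":
    run()
```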
The
You have guessed exactly how it works. Reading the API reference is generally better than guessing.
This is what I think is non-obvious from reading the docs.
The two-file example I gave with simultaneous tracks does not in fact work (and I knew it wouldn't, because I read the docs). It doesn't work for exactly the reasons we've discussed: it complains about an already existing progress bar. And idk if you call reading the docs and trying to interpret them "guessing". I did RTFM, and the source, fwiw. It's possible to RTFM and get unexpected results. Whatever, I understand you aren't concerned, so let's just close this.
I hope we solved your problem. If you like using Rich, you might also enjoy Textual |
Steven, if you are going to assume the worst intentions behind every response, you will quickly burn through the goodwill open source maintainers need in order to assist you.
Describe the bug
Rich progress bars flicker a ton when working with distributed processes. This makes them nearly unreadable. The problem does not exist with tqdm, but I'd love to use Rich for the additional features and console. The effect is clearly caused by multiple processes writing to the same terminal and overwriting each other's output.
Minimal example, with notes
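The script itself didn't survive in this copy of the issue; a rough sketch of the setup being described (the file name comes from the command below, the step count and sleep are stand-ins) is something like:

```python
# rich_example.py (reconstruction -- the original script is not in this copy of the issue)
import time

import torch.distributed as dist
from rich.progress import Progress


def main() -> None:
    dist.init_process_group(backend="nccl")   # launched via torchrun, one process per GPU
    rank = dist.get_rank()

    with Progress() as progress:
        # Hiding the bar on non-zero ranks via `visible` still leaves every rank
        # driving its own live display, so every rank keeps writing to the terminal.
        task = progress.add_task(f"rank {rank}", total=500, visible=rank == 0)
        for _ in range(500):
            time.sleep(0.01)   # stand-in for a training step
            progress.advance(task)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```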
Run the program with the following command:
torchrun --standalone --nnodes=1 --nproc_per_node=8 rich_example.py
`nproc_per_node` is the number of GPUs; the best reproduction happens with more than one GPU. So here are some things I've tried and their effects:
The `visible` argument actually makes things worse than any other option, which was the most surprising to me. The flickering is obvious even without the desync across processes, so something appears to be wrong with this functionality. I read the docs as saying it works like tqdm, in that we just don't display the progress bar.

Additional note: in my real program I can actually get flickering when using only one GPU. This may have something to do with the PyTorch loggers, but I'm unsure; this may be a hint or useless information / a red herring. Flickering does not occur when using standard non-DDP code; it only occurs when using distributed. I've also had Rich trigger a CUDA device-side assert, but that may be a different error. It only happened once, and on a less reliable machine (the flickering happens on multiple machines with different versions of PyTorch and GPUs).
What does work: using
```python
with Progress(disable=(dist.get_rank() != 0)) as progress:
```
Disabling the progress group on anything except the 0th rank solves all the flickering problems. But this has an obvious limitation: we can't have different ranks doing different tasks which we'd want to track.
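Put in context, a sketch of that working pattern (step count and sleep are stand-ins):

```python
import time

import torch.distributed as dist
from rich.progress import Progress


def train() -> None:
    dist.init_process_group(backend="nccl")
    # Only rank 0 renders; every other rank gets a no-op Progress.
    with Progress(disable=dist.get_rank() != 0) as progress:
        task = progress.add_task("training", total=500)
        for _ in range(500):
            time.sleep(0.01)   # stand-in for a training step
            progress.advance(task)
    dist.destroy_process_group()


if __name__ == "__main__":
    train()
```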
Platform
OS: Linux (Ubuntu)
I may ask you to copy and paste the output of the following commands. It may save some time if you do it now.
If you're using Rich in a terminal:
```
$ pip freeze | grep rich
rich==13.5.2
```