[BUG] (pytorch) flickering in multiprocessing / distributed #3147

stevenwalton · 2023-10-09T04:03:15Z

[x ] I've checked docs and closed issues for possible solutions.
[x ] I can't find my issue in the FAQ.

Describe the bug

Rich progress bars flicker a ton when working on distributed processes. This makes them nearly unreadable. The problem does not exist for tqdm, but I'd love to use rich for the additional features and console. This effect is clearly caused by multiple processes writing to the same task and overwriting the results.

Minimal example, with notes

import os
import time
from tqdm import tqdm
import torch
from torch import nn
from torch import optim
from torch import distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision import models
from rich.progress import track, Progress

def train():
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"
    
    model = models.resnet50().to(device) # A bigger model makes the effect more obvious if you can't reproduce
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    #for _ in tqdm(range(1000)): # Works just fine
    #for _ in track(range(1000), description="I'm rich"): # Similar problems to Progress
    with Progress() as progress:
        if dist.get_rank() == 0: # Maybe do this?
            train_task = progress.add_task("[red] Training", total=1000,)
                                       #visible=dist.get_rank()==0)
        for i in range(1000):
            optimizer.zero_grad(set_to_none=True)
            x = torch.randn(32,3,224,224) # Dummy data. Use anything here
            out = model(x) 
            loss = out.mean()
            loss.backward()
            optimizer.step()
            # simulate GPUs not finishing at exact same time (you need this line when not using visible)
            # It isn't completely necessary but makes problem more obvious and realistic
            time.sleep(0.1 * torch.randn(1).abs().item())
            if dist.get_rank() == 0: # If done above do here too
                progress.update(train_task, advance=1, description=f"[red] Training: {i} iteration")

if __name__ == "__main__":
    train()

Run the program with the following command:
torchrun --standalone --nnodes=1 --nproc_per_node=8 rich_example.py

nproc_per_node is the number of GPUs. Best reproduction will happen with more than one GPU.

Here are some things I've tried and their effects:

Don't wrap anything in a get_rank check and just run:
- Twitchy when displaying information around the iteration number
- Numbers change depending on which process writes to that tracker

Wrap in get_rank so that only rank 0 has the tracker and updates it:
- Twitchy, with lots of updates and long stretches where the bar is just missing
- It appears that every process triggers a refresh

Use the visible argument and only set rank 0 to visible:
- Complete chaos. The sleep line isn't needed if visible is used.
- No difference if visible is set in the update
- Information is at least correct, but most of the time it is invisible

The visible argument actually makes things worse than any other option, which was the most surprising to me. The flickering is obvious even without the desync across processes. So something appears to be wrong with this functionality. I read the docs as similar to tqdm in that we just don't display the progress bar.

Additional note: in my real program I can actually get flickering when using only one GPU. This may have something to do with the pytorch loggers, but I'm unsure; it may be a hint or a red herring. Flickering does not occur when using standard non-DDP code. It only occurs when using distributed. I've also had rich trigger a cuda device-side assert, but this may be a different error. That only happened once, on a less reliable machine (the flickering happens on multiple machines with different versions of pytorch and GPUs).

What does work: using with Progress(disable=(dist.get_rank() != 0)) as progress: Disabling the progress group on anything except the 0th rank solves all flickering problems. But this has obvious limitations as we can't have different ranks doing different tasks which we'd want to track.
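A minimal sketch of how the rank check in that workaround could be written, assuming the script is launched with torchrun, which exports a RANK environment variable to every worker process. The helper name is_main_rank is hypothetical and not part of Rich or PyTorch; it only reads the environment, so it can be called before dist.init_process_group.

```python
import os


def is_main_rank(environ=None):
    """Return True on global rank 0, or when no RANK variable is set.

    Assumption: the launcher (e.g. torchrun) exports RANK for each worker.
    A plain, non-distributed run has no RANK and is treated as rank 0.
    """
    environ = os.environ if environ is None else environ
    return int(environ.get("RANK", "0")) == 0


if __name__ == "__main__":
    # On every rank except 0 this prints True, i.e. "disable the display".
    print("disable progress display:", not is_main_rank())
```

The result would then be passed as Progress(disable=not is_main_rank()), or to the disable argument of track, so that only one process ever draws to the terminal.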

Platform

OS: Linux (Ubuntu

I may ask you to copy and paste the output of the following commands. It may save some time if you do it now.

If you're using Rich in a terminal:

$ python -m rich.diagnose                          
╭───────────────────────── <class 'rich.console.Console'> ─────────────────────────╮
│ A high level console interface.                                                  │
│                                                                                  │
│ ╭──────────────────────────────────────────────────────────────────────────────╮ │
│ │ <console width=210 ColorSystem.TRUECOLOR>                                    │ │
│ ╰──────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                  │
│     color_system = 'truecolor'                                                   │
│         encoding = 'utf-8'                                                       │
│             file = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'> │
│           height = 54                                                            │
│    is_alt_screen = False                                                         │
│ is_dumb_terminal = False                                                         │
│   is_interactive = True                                                          │
│       is_jupyter = False                                                         │
│      is_terminal = True                                                          │
│   legacy_windows = False                                                         │
│         no_color = False                                                         │
│          options = ConsoleOptions(                                               │
│                        size=ConsoleDimensions(width=210, height=54),             │
│                        legacy_windows=False,                                     │
│                        min_width=1,                                              │
│                        max_width=210,                                            │
│                        is_terminal=True,                                         │
│                        encoding='utf-8',                                         │
│                        max_height=54,                                            │
│                        justify=None,                                             │
│                        overflow=None,                                            │
│                        no_wrap=False,                                            │
│                        highlight=None,                                           │
│                        markup=None,                                              │
│                        height=None                                               │
│                    )                                                             │
│            quiet = False                                                         │
│           record = False                                                         │
│         safe_box = True                                                          │
│             size = ConsoleDimensions(width=210, height=54)                       │
│        soft_wrap = False                                                         │
│           stderr = False                                                         │
│            style = None                                                          │
│         tab_size = 8                                                             │
│            width = 210                                                           │
╰──────────────────────────────────────────────────────────────────────────────────╯
╭─── <class 'rich._windows.WindowsConsoleFeatures'> ────╮
│ Windows features available.                           │
│                                                       │
│ ╭───────────────────────────────────────────────────╮ │
│ │ WindowsConsoleFeatures(vt=False, truecolor=False) │ │
│ ╰───────────────────────────────────────────────────╯ │
│                                                       │
│ truecolor = False                                     │
│        vt = False                                     │
╰───────────────────────────────────────────────────────╯
╭────── Environment Variables ───────╮
│ {                                  │
│     'TERM': 'xterm-kitty',         │
│     'COLORTERM': 'truecolor',      │
│     'CLICOLOR': None,              │
│     'NO_COLOR': None,              │
│     'TERM_PROGRAM': None,          │
│     'COLUMNS': None,               │
│     'LINES': None,                 │
│     'JUPYTER_COLUMNS': None,       │
│     'JUPYTER_LINES': None,         │
│     'JPY_PARENT_PID': None,        │
│     'VSCODE_VERBOSE_LOGGING': None │
│ }                                  │
╰────────────────────────────────────╯
platform="Linux"

$ pip freeze | grep rich
rich==13.5.2

willmcgugan · 2023-10-09T09:15:07Z

If you have multiple processes, each of your processes will have a unique Progress object. They are all going to be writing to standard output on different schedules, and showing different data. Even if it didn't flicker, it's not going to show you anything useful.

The solution is to coordinate a single process which displays progress information from each of the worker processes, via some kind of IPC mechanism. If you were using multiprocessing I'd recommend Pipes. I don't know if there is an equivalent with Torch.
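A rough sketch of that coordination pattern using only the standard library, not a tested recipe for Torch. It uses a multiprocessing Queue rather than Pipes, and it assumes a POSIX system where the "fork" start method is available. The names worker and collect_progress are made up for illustration. Workers never touch the terminal; they only send (rank, advance) messages, and the parent process is the single place where anything would be displayed.

```python
import multiprocessing as mp


def worker(rank, steps, queue):
    """Do the work and report each finished step to the parent process."""
    for _ in range(steps):
        # ... one unit of real work would happen here ...
        queue.put((rank, 1))
    queue.put((rank, None))  # sentinel: this worker is done


def collect_progress(num_workers, steps):
    """Start the workers and aggregate their progress in this process only."""
    ctx = mp.get_context("fork")  # assumption: POSIX platform
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(rank, steps, queue))
        for rank in range(num_workers)
    ]
    for proc in procs:
        proc.start()

    completed = {rank: 0 for rank in range(num_workers)}
    finished = 0
    while finished < num_workers:
        rank, advance = queue.get()
        if advance is None:
            finished += 1
        else:
            # The only place a display would be updated.
            completed[rank] += advance

    for proc in procs:
        proc.join()
    return completed


if __name__ == "__main__":
    print(collect_progress(num_workers=4, steps=10))
```

With Rich, the parent would own one Progress, create one task per worker with add_task, and call progress.update(task_id, advance=advance) at the point where the dictionary is incremented.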

stevenwalton · 2023-10-10T21:02:54Z

I think the confusing part here was the visible argument creating a very chaotic situation. As a user, the expected behavior is that every rank except the first still has a progress task, just an invisible one, as described in the Hiding Tasks section of the docs. If only rank 0 is visible, wouldn't we expect correct data and clean output? I didn't think to look at the progress group at first, since the docs focus on that group dictating style. Not to mention that the multithreading examples floating around do not seem as "complicated".

I also am not sure disabling all processes except for the 0th rank is the best solution as there are definitely reasons to have asynchronous tasks happening.

Also, given how trivial tqdm is to use and that rich.progress.track similarly doesn't work, I think there could be some improvement here. At minimum, an option to disable would be helpful so that you can get progress bars as trivially as with tqdm. I'm not sure what tqdm is doing, but it is clear that making a simple progress bar in this environment is far easier with tqdm than with rich. I'd much rather use rich, as display-wise it is far prettier and more powerful. But it is also hard to convince others to use it if track doesn't work easily.

willmcgugan · 2023-10-11T07:28:20Z

The problem is with your code. To solve it you will need an understanding of the difference between processes and threads.

This is not a problem that Rich can solve for you on its own. If you think tqdm is better, use that.

stevenwalton · 2023-10-11T18:44:52Z

Sure, I'm very willing to admit that my code can be better, and I'm more than happy to hear critiques and ways I can improve. We're all also niche experts. I'm a researcher first, dev second, so I'm sure my code is bad. But I just thought this report would be useful to you, especially as a competitor solves the issue. I'm sorry if I came off as antagonistic; I'm not trying to fight but rather to inform you of a point of confusion for your awesome tool. I mean, I found a solution, but it was non-obvious from reading the docs, and that seems like something users should report. At minimum, this issue can help others google the problem I faced. If you want to support distributed users, that's awesome; if you don't, it's your project and that's cool too. I get that you have your own priorities. (Fwiw Pytorch Lightning has a rich tracker, so there are large ML projects trying to use Rich.)

But can you help me understand why the visible argument doesn't work as (I had) expected? It is not obvious to me why setting all ranks, except for one, as invisible results in such flickering. Shouldn't that result in only one writer writing to the cli? In this type of setting it is perfectly fine to watch only one rank since the rest should be approximately at the same progress location as the others (this is what lightning does btw and how people use tqdm).

Is there a way to use the simpler track method that can be used if a progress group is already in use? Such as

# foo.py
for i in rich.progress.track(range(N), disable=dist.get_rank() != 0):
   foo_task()
   if i % some_frequency:
     bar_task()
     
# bar.py
def bar_task():
   for j in rich.progress.track(range(M), disable=dist.get_rank() != 0):
      buzz()

willmcgugan · 2023-10-12T09:20:40Z

The visible property works on tasks. A progress bar can have multiple tasks. Making all the tasks invisible won't disable the progress bar.

> Is there a way to use the simpler track method that can be used if a progress group is already in use? Such as

You have guessed exactly how it works. Reading the API reference is generally better than guessing.

stevenwalton · 2023-10-12T17:00:18Z

> Making all the tasks invisible won't disable the progress bar.

This is what I think is non-obvious, from reading the docs.

> You have guessed exactly how it works.

The two file example I gave with simultaneous tracks does not in fact work (and I knew it wouldn't because I read the docs). It doesn't work for exactly the reasons we've discussed. It complains about an already existing progress bar. And idk if you call reading the docs and trying to interpret "guessing". I did RTFM, and the source, fwiw. It's possible to RTFM and get unexpected results. Whatever, I understand you aren't concerned so let's just close this.

github-actions · 2023-10-12T17:00:34Z

I hope we solved your problem.

If you like using Rich, you might also enjoy Textual

willmcgugan · 2023-10-12T17:26:09Z

Steven, if you are going to assume the worst intentions behind every response, you will quickly burn through the good will required by open source maintainers to assist you.

stevenwalton added the Needs triage label Oct 9, 2023

Textualize deleted a comment from github-actions bot Oct 9, 2023

stevenwalton closed this as completed Oct 12, 2023
