PyTorch Profiler produces large trace files (~1GB) causing TensorBoard to crash #720

jxmmy7777 · 2023-10-04T16:39:49Z

When using the PyTorch Profiler with TensorBoard, the generated trace files are too large (e.g., 1 to 2 GB for just 10 steps), causing TensorBoard to crash or hang.

To reproduce

Steps to reproduce the behavior:

Set up a PyTorch Lightning Trainer with the following profiler configuration:

import torch
from pytorch_lightning.profilers import PyTorchProfiler

profiler = PyTorchProfiler(
    on_trace_ready=torch.profiler.tensorboard_trace_handler("<path_to_logs>"),
    schedule=torch.profiler.schedule(skip_first=2, wait=1, warmup=0, active=5),
    profile_memory=True,
)

Run the training for a few steps.
The produced trace file size becomes excessively large.
Attempt to open with TensorBoard.
TensorBoard crashes or becomes unresponsive when viewing in the trace or memory tab.
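
For reference, the schedule in the configuration above can be traced step by step. The sketch below is a plain-Python re-implementation of the documented behaviour of torch.profiler.schedule, written only for illustration; the helper names profiler_action and recorded_steps are hypothetical and are not part of PyTorch or Lightning. It shows which step indices would be recorded, assuming the default repeat=0, under which cycles continue until profiling ends.

```python
def profiler_action(step, *, wait, warmup, active, repeat=0, skip_first=0):
    """Return 'none', 'warmup', or 'record' for a 0-based step index.

    Illustrative re-implementation of the documented semantics of
    torch.profiler.schedule; it does not import or call torch.
    """
    if step < skip_first:
        return "none"
    step -= skip_first
    cycle = wait + warmup + active
    # repeat=0 means "no upper bound on the number of cycles".
    if repeat > 0 and step // cycle >= repeat:
        return "none"
    pos = step % cycle
    if pos < wait:
        return "none"
    if pos < wait + warmup:
        return "warmup"
    return "record"


def recorded_steps(num_steps, **schedule):
    """List the step indices that fall in a recording phase."""
    return [
        s for s in range(num_steps)
        if profiler_action(s, **schedule) == "record"
    ]


# Schedule taken from the configuration in this issue.
issue_schedule = dict(skip_first=2, wait=1, warmup=0, active=5)

print(recorded_steps(14, **issue_schedule))            # [3, 4, 5, 6, 7, 9, 10, 11, 12, 13]
print(recorded_steps(14, repeat=1, **issue_schedule))  # [3, 4, 5, 6, 7]
```

Under these semantics, 5 of every 6 steps are recorded for as long as training runs unless repeat is set, so passing repeat=1 (a real parameter of torch.profiler.schedule) is one way to bound how many steps are captured. Whether that alone brings the file to a viewable size depends on how many events each step produces.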

Expected behavior
The trace file should be of manageable size, or there should be a method to limit or chunk the file size to prevent such issues. Additionally, TensorBoard should be able to handle large trace files more gracefully.

Environment:
PyTorch Lightning Version: 1.9.0
Python version: 3.9.18

I have tried (1) disabling profile_memory and (2) reducing the active steps in the profiler schedule. However, the trace file is always larger than 1 GB, which I can't view in TensorBoard. Can someone suggest alternatives for profiling?

Given the challenges with the current profiler, I am looking for alternative methods or tools to view the profile of my PyTorch Lightning training. Suggestions or recommendations would be highly appreciated.

UTokyoChenYe · 2023-12-26T05:42:29Z

@jxmmy7777 I ran into the same problem when profiling inference. Did you fix it?

idontkonwher · 2024-02-20T01:48:23Z

@jxmmy7777 @UTokyoChenYe I ran into the same problem too. My JSON file is about 1.3 GB, and it still doesn't work when I use export_to_chrome instead.

kvignesh1420 · 2024-03-18T21:04:35Z

Hi, any update on this issue?

jxmmy7777 · 2024-03-19T08:17:11Z

Hi @kvignesh1420 @idontkonwher @UTokyoChenYe, I haven't found a good solution yet. My current approach is to minimize the file size as much as possible and to reduce the number of active and warm-up steps. Alternatively, I use a simpler profiler for performance profiling.
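
One way to minimize the file after the fact is to trim an already-written trace. The sketch below assumes the trace is Chrome-trace JSON with a top-level "traceEvents" list, which is the layout the PyTorch profiler exports; the function name trim_trace and its parameters are hypothetical, and the whole file is loaded into memory, so a multi-GB trace needs several GB of RAM.

```python
import gzip
import json


def trim_trace(src, dst, *, min_dur_us=0.0, drop_categories=()):
    """Write a reduced copy of a Chrome-trace JSON file.

    Keeps every top-level key, but drops complete events ("ph" == "X")
    that are shorter than min_dur_us or whose "cat" is in drop_categories.
    Events of other phases (metadata, flows, counters) are kept as they are.
    Returns (events_before, events_after).
    """
    opener = gzip.open if str(src).endswith(".gz") else open
    with opener(src, "rt", encoding="utf-8") as f:
        trace = json.load(f)

    events = trace["traceEvents"]
    drop = set(drop_categories)
    kept = [
        ev for ev in events
        if not (
            ev.get("ph") == "X"
            and (ev.get("dur", 0) < min_dur_us or ev.get("cat") in drop)
        )
    ]
    trace["traceEvents"] = kept

    # Compact separators avoid writing padding whitespace.
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(trace, f, separators=(",", ":"))
    return len(events), len(kept)
```

For example, trim_trace("torch_trace.json", "torch_trace.small.json", min_dur_us=10) would drop complete events shorter than 10 microseconds. The trimmed file is meant for a generic Chrome-trace viewer; flow arrows that pointed at removed events may dangle, and it is not verified here that the TensorBoard plugin accepts a trimmed file.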

idontkonwher · 2024-03-26T01:56:34Z

@jxmmy7777 Thanks for your reply. I fixed my problem by reducing the size of the code block inside the profiler context.
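
A minimal sketch of what narrowing the profiled region can look like with plain torch.profiler, outside Lightning. The model, the batch, the helper name profile_region, and the output file name are all made up for illustration; switching off record_shapes, with_stack, and profile_memory rests on the assumption that those options add events or metadata to the exported file.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_region(fn, trace_path):
    """Profile a single call to fn and export a Chrome trace.

    Optional extras (shapes, stacks, memory) are switched off on the
    assumption that they enlarge the exported file.
    """
    with profile(
        activities=[ProfilerActivity.CPU],
        record_shapes=False,
        with_stack=False,
        profile_memory=False,
    ) as prof:
        fn()
    prof.export_chrome_trace(trace_path)
    return prof


# Hypothetical stand-ins for a real model and input batch.
model = torch.nn.Linear(64, 64)
batch = torch.randn(8, 64)

# Setup and a warm-up call stay outside the profiled region.
model(batch)

prof = profile_region(lambda: model(batch), "small_trace.json")

# A text summary that does not need a trace viewer at all.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The key_averages() table is also a possible fallback when the trace itself is too large to open, since it aggregates per-operator times as plain text.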

alexseceks · 2024-08-06T09:14:44Z

I tried reducing the block size in the profiler context, but with no luck. I still get a 1.9 GB torch_trace.json.
