Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support chrome tracing in ppe.profiler #742

Merged
merged 17 commits into from
Nov 17, 2023

Conversation

emcastillo
Copy link
Contributor

@emcastillo emcastillo commented Aug 23, 2023

Tracer class by @okdshin

The user should be the one in charge of enabling chrome for the records they want? I mean, passing a Emitter object instead of a boolean flag and avoiding to store in the object in TLS

@emcastillo emcastillo changed the title [WIP] Support chrome tracing in ppe.profiler Support chrome tracing in ppe.profiler Aug 28, 2023
@emcastillo emcastillo marked this pull request as ready for review August 28, 2023 02:54
@emcastillo
Copy link
Contributor Author

emcastillo commented Aug 28, 2023

trace.json.zip
You can check this file with Chrome and about:tracing
I modified mnist example to output the trace

@emcastillo
Copy link
Contributor Author

Discussed offline

  • How about merging the tracer with TimeSummary?
  • The TimeSummary class is complicated so we don't want to modify it more ..., also the TS object lives in the TLS too

@kmaehashi kmaehashi self-assigned this Aug 28, 2023
@kmaehashi
Copy link
Member

Added me to assignee as I'd like to join the design discussion :)

@emcastillo
Copy link
Contributor Author

I changed the code to allow users to create their own tracers for timelines (not to rely only on google tracing format) and to allow users to supply a tracer object instead of using the default one in the TLS if needed

pytorch_pfn_extras/profiler/_tracing.py Outdated Show resolved Hide resolved
pytorch_pfn_extras/profiler/_tracing.py Outdated Show resolved Hide resolved
pytorch_pfn_extras/profiler/_record.py Show resolved Hide resolved
pytorch_pfn_extras/profiler/_tracing.py Outdated Show resolved Hide resolved
@emcastillo
Copy link
Contributor Author

image Trace for the multithreaded test case :)

linshokaku
linshokaku previously approved these changes Sep 13, 2023
Copy link
Member

@linshokaku linshokaku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kuenishi
Copy link
Member

Thank you for the work. I'm testing this patch in advance for pfnet/pfio#326. According to @emcastillo 's advice, I added some line to the MNIST example, but I'm getting empty trace.json for now. Do you have any insight to get the trace?

$ git diff
diff --git a/example/mnist_trainer.py b/example/mnist_trainer.py
index d36fccd..509f929 100644
--- a/example/mnist_trainer.py
+++ b/example/mnist_trainer.py
@@ -266,6 +266,7 @@ def main():
         device=args.device,
         extensions=my_extensions,
         stop_trigger=trigger,
+        enable_trace=True,
         evaluator=ppe.engine.create_evaluator(
             model_with_loss,
             device=args.device,
$ python example/mnist_trainer.py  --epochs 1
epoch       iteration   train/loss  lr          model/fc2.bias/grad/min  val/loss    val/accuracy
1           938         0.343534    0.01        -0.0292827               0.105675    0.9673
$ cat result/trace.json
[]$

@kuenishi
Copy link
Member

kuenishi commented Sep 20, 2023

IMO, to integrate well with PFIO regarding pfnet/pfio#326, it would be nicer to pass the enable_trace knob somehow from trainer to underlying PFIO access IO connections such as pfio.v2.S3 . You might not like TLS but it'd be nice for me...

pytorch_pfn_extras/profiler/_tracing.py Show resolved Hide resolved
pytorch_pfn_extras/profiler/_util.py Outdated Show resolved Hide resolved
pytorch_pfn_extras/training/extensions/timeline_trace.py Outdated Show resolved Hide resolved
Comment on lines +13 to +28
class Tracer:
@contextlib.contextmanager
def add_event(self, name: str) -> Generator[None, None, None]:
raise NotImplementedError("Tracers must implement add_event")

def add_remote_event(self, name: str, value: Any) -> None:
raise NotImplementedError("Tracers must implement add_remote_event")

def clear(self) -> None:
raise NotImplementedError("Tracers must implement clear")

def flush(self, filename: str, writer: Writer) -> None:
raise NotImplementedError("Tracers must implement flush")

def enable(self, enable_flag: bool) -> None:
raise NotImplementedError("Tracers must implement enable")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Tracer class should have a finalize() or synchronize() interface. In particular, if the QueueWorker is not synchronized, the possibility arises that the program will terminate before all profiles are output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a finalize method, also enforced synchronization in flush method. Good catch!!!

Comment on lines +126 to +127
# TODO(ecastill): try to work on some append mode manipulating the
# file pointer and with json.dumps?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment:
I couldn't find any document, but it's worth checking to see if it supports the jsonl format.
The jsonl format is append friendly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chrome does not support jsonl format for displaying the trace :(

@emcastillo
Copy link
Contributor Author

Thanks for the great review, PTAL!

@linshokaku
Copy link
Member

/test

Copy link
Member

@linshokaku linshokaku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@linshokaku linshokaku merged commit 8500fe2 into pfnet:master Nov 17, 2023
17 checks passed
@emcastillo emcastillo added this to the v0.7.5 milestone Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use PPE instead of pytorch_pfn_extras for profiler key
4 participants