"No trace event is collected" when using tensorboard / capture_tpu_profile #380

Open
lackhole opened this issue Jan 24, 2022 · 6 comments

Comments

@lackhole

At first I was trying to profile BERT on a Google Cloud TPU VM (v3-8 | tpu-vm-tf-2.7.0), so I followed the guide while fine-tuning BERT.

But when I press capture, it says "No trace event is collected", so I thought the problem might be specific to TPU and posted a question on Stack Overflow.
Full log:

Starting to trace for 1000 ms. Remaining attempt(s): 3
No trace event is collected. Automatically retrying.

Starting to trace for 1000 ms. Remaining attempt(s): 2
No trace event is collected. Automatically retrying.

Starting to trace for 1000 ms. Remaining attempt(s): 1
No trace event is collected. Automatically retrying.

Starting to trace for 1000 ms. Remaining attempt(s): 0
No trace event is collected after 4 attempt(s). Perhaps, you want to try again (with more attempts?).
Tip: increase number of attempts with --num_tracing_attempts.

After that, I thought TensorBoard itself might be the problem, so I followed the TensorFlow Serving README on my personal machines (macOS 10.15 / Ubuntu 18.04) using the CPU, but both hit the same error: "No trace event is collected. Automatically retrying."
The original issue was filed as TensorBoard issue 5517.

The output from diagnose_tensorboard.py is pasted in the original issue.

Note: the TensorBoard web UI shows the toast "Capture profile successfully. Please refresh.", but it disappears after about 0.5 seconds and nothing changes after a refresh.

@dmmolitor

Have you tried increasing the number of tracing attempts, as suggested in the log? You can also try increasing the profile duration. The potential issues section of the guide has suggestions for what could be going wrong here and steps to try. In particular, make sure the TPU is running before you capture the trace.
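
For anyone who prefers to trigger the capture from Python rather than the TensorBoard UI, the two knobs mentioned above (more attempts, longer duration) can be sketched with `tf.profiler.experimental.client.trace`. This is only a minimal sketch, assuming a profiler server is already reachable: the address `grpc://localhost:6000`, the log directory, the helper names `trace_kwargs` and `capture_profile`, and the default values are all hypothetical placeholders, not recommended settings.

```python
def trace_kwargs(service_addr, logdir, duration_ms=5000, attempts=10):
    """Build keyword arguments for tf.profiler.experimental.client.trace.

    The defaults are illustrative only (assumption), chosen to be larger
    than the 1000 ms / 4 attempts seen in the log above.
    """
    if duration_ms <= 0 or attempts <= 0:
        raise ValueError("duration_ms and attempts must be positive")
    return {
        "service_addr": service_addr,      # e.g. "grpc://localhost:6000" (placeholder)
        "logdir": logdir,                  # where the profile is written
        "duration_ms": duration_ms,        # how long each trace runs
        "num_tracing_attempts": attempts,  # retries when no events are collected
    }


def capture_profile(**kwargs):
    """Request a profile from a running profiler server."""
    # Imported lazily so the helper above works without TensorFlow installed.
    import tensorflow as tf

    tf.profiler.experimental.client.trace(**kwargs)


if __name__ == "__main__":
    # Hypothetical address and log directory; replace with your own.
    capture_profile(**trace_kwargs("grpc://localhost:6000", "/tmp/profile_logs"))
```

The capture only returns events if the training job is actually executing steps during the trace window.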

@lackhole
Author

lackhole commented Mar 9, 2022

@dmmolitor Yes, I did. Since the error persisted, I switched to the TPU Node architecture and everything worked fine. So I suspect there is a bug outside the Node architecture, since TensorBoard cannot even profile the CPU on my laptop, as I mentioned above.
Thank you.

@dmmolitor

You are welcome. If your issue is resolved, could you please close the issue?

@lackhole
Author

@dmmolitor I don't think the issue is resolved, since profiling only works on one specific architecture. I'll leave it open.

@chokkyvista

I am also unable to reproduce the steps at https://cloud.google.com/tpu/docs/profile-tpu-vm#profile_tab for capturing profiles on TPU VMs (TPU nodes work fine, as @lackhole noted).

In my case, the TensorBoard web UI says "Failed to capture profile: empty trace result", and the TensorBoard server logs the following errors:

I tensorflow/core/profiler/rpc/client/profiler_client.cc:113] Asynchronous gRPC Profile() to localhost:6000
I tensorflow/core/profiler/rpc/client/remote_profiler_session_manager.cc:96] Issued Profile gRPC to 1 clients
I tensorflow/core/profiler/rpc/client/profiler_client.cc:131] Waiting for completion.
E tensorflow/core/profiler/rpc/client/profiler_client.cc:154] Unavailable: failed to connect to all addresses
W tensorflow/core/profiler/rpc/client/capture_profile.cc:133] No trace event is collected from localhost:6000
W tensorflow/core/profiler/rpc/client/capture_profile.cc:145] localhost:6000 returned Unavailable: failed to connect to all addresses

This doesn't look like something that increasing the number of retries or the profiling duration will resolve 🤔

I also tried the command-line tool capture_tpu_profile, to no avail (I think it only works with TPU nodes).

Here's my TF setup for reference:

$ python3 -m pip list | grep -E 'tensor|cloud-tpu'
cloud-tpu-client              0.10
cloud-tpu-profiler            2.4.0
tensorboard                   2.6.0
tensorboard-data-server       0.6.1
tensorboard-plugin-profile    2.11.1
tensorboard-plugin-wit        1.8.1
tensorflow                    2.6.5
tensorflow-addons             0.16.1
tensorflow-datasets           4.8.2
tensorflow-estimator          2.6.0
tensorflow-hub                0.12.0
tensorflow-io                 0.30.0
tensorflow-io-gcs-filesystem  0.30.0
tensorflow-metadata           1.12.0
tensorflow-model-optimization 0.7.3
tensorflow-text               2.6.0

@chokkyvista

As it turns out, the "localhost:6000 returned Unavailable: failed to connect to all addresses" error above happened because I forgot to start the TF profiler server. Adding tf.profiler.experimental.server.start(6000) to the training script fixes it.
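
A minimal sketch of that fix, plus a quick check that something is actually listening on the port before a capture is attempted. The helper names `port_is_listening` and `start_profiler_server` are hypothetical; only `tf.profiler.experimental.server.start` comes from the comment above, and the plain TCP probe is an assumption about how to spot the "failed to connect" case, not part of the TF profiler itself.

```python
import socket


def port_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable host.
        return False


def start_profiler_server(port=6000):
    """Start the TF profiler server inside the training process."""
    # Imported lazily so the port check works without TensorFlow installed.
    import tensorflow as tf

    tf.profiler.experimental.server.start(port)


if __name__ == "__main__":
    # Call this near the top of the training script, before the training loop.
    start_profiler_server(6000)
    print("profiler port open:", port_is_listening("localhost", 6000))
```

If `port_is_listening("localhost", 6000)` returns False from the machine running TensorBoard, the capture will most likely fail with the same "Unavailable" error regardless of retries or duration.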

I was then able to see the following output from the training session, signalling a successful profile capture ✌:

I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
W tensorflow/core/profiler/lib/profiler_session.cc:137] Profiling is late by 25154051 nanoseconds and will start immediately.
I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
I tensorflow/core/profiler/rpc/profiler_service_impl.cc:67] Collecting XSpace to repository: gs://.../plugins/profile/2023_02_03_20_17_09/localhost_6000.xplane.pb
I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.

On the TensorBoard server side, though, there's a new error:

W tensorflow/core/profiler/convert/xplane_to_tools_data.cc:226] Can not find tool: tool_names. Please update to the latest version of Tensorflow.

which prevented the resulting xplane.pb from being correctly parsed and displayed.
Downgrading tensorboard-plugin-profile from 2.11.1 to 2.8.0, to align it more closely with tensorboard (2.6.0), proved effective 🎉
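
To spot this kind of mismatch earlier, the installed versions can be compared from Python. This is a rough sketch using only the standard library; the helper names are hypothetical, and the "minor version gap" is just a heuristic for flagging drift between packages, not an official compatibility rule.

```python
from importlib import metadata


def major_minor(version):
    """Parse 'X.Y.Z' into (X, Y); suffixes after the minor part are ignored."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def minor_gap(a, b):
    """Absolute minor-version difference, or None if the majors differ."""
    (a_major, a_minor), (b_major, b_minor) = major_minor(a), major_minor(b)
    if a_major != b_major:
        return None
    return abs(a_minor - b_minor)


def installed_versions(packages=("tensorflow", "tensorboard",
                                 "tensorboard-plugin-profile")):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    found = installed_versions()
    print(found)
    tb, plugin = found["tensorboard"], found["tensorboard-plugin-profile"]
    if tb and plugin:
        print("minor version gap:", minor_gap(tb, plugin))
```

With the versions listed earlier in this thread, the gap between tensorboard 2.6.0 and the plugin shrinks from 5 (2.11.1) to 2 (2.8.0) after the downgrade.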


3 participants