Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaky] Exception in GraphListener thread: rcl_wait unexpectedly timed out #527

Open
JEnoch opened this issue Mar 11, 2025 · 17 comments
Open

Comments

@JEnoch
Copy link
Contributor

JEnoch commented Mar 11, 2025

The same error log has been reported by several users in different contexts, using latest release, or commit

  1. in #455 - using v0.2.2 (release for jazzy)
2025-03-07T08:07:28.822902000Z [ERROR] [diagnostics_summarizer-29]: process has died [pid 662, exit code -6, cmd '/install/harvey/lib/harvey/diagnostics_summarizer --ros-args -r __node:=diagnostics_summarizer --params-file /tmp/launch_params_i1lgx4ix'].
2025-03-07T08:07:33.852035000Z [diagnostics_summarizer-29] [ERROR] [1741334848.118082488] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
2025-03-07T08:07:33.882358000Z [diagnostics_summarizer-29] terminate called after throwing an instance of 'std::runtime_error'
2025-03-07T08:07:33.882532000Z [diagnostics_summarizer-29]   what():  rcl_wait unexpectedly timed out
2025-03-07T08:07:33.884842000Z [diagnostics_summarizer-29] Stack trace (most recent call last) in thread 1024:
2025-03-07T08:07:33.891791000Z [diagnostics_summarizer-29] #11   Object "", at 0xffffffffffffffff, in 
2025-03-07T08:07:33.891994000Z [diagnostics_summarizer-29] #10   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b62bc3b, in 
2025-03-07T08:07:33.894063000Z [diagnostics_summarizer-29] #9    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b59eaa3, in 
2025-03-07T08:07:33.895319000Z [diagnostics_summarizer-29] #8    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b830db3, in 
2025-03-07T08:07:33.922034000Z [diagnostics_summarizer-29] #7    Object "/opt/ros/jazzy/lib/librclcpp.so", at 0x7fc25bb96ab7, in 
2025-03-07T08:07:33.928509000Z [diagnostics_summarizer-29] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b7ff0c0, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
2025-03-07T08:07:33.934079000Z [diagnostics_summarizer-29] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b7e9a54, in std::terminate()
2025-03-07T08:07:33.945943000Z [diagnostics_summarizer-29] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b7ff0d9, in 
2025-03-07T08:07:33.965063000Z [diagnostics_summarizer-29] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7fc25b7e9ff4, in 
2025-03-07T08:07:33.970482000Z [diagnostics_summarizer-29] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b52a8fe, in abort
2025-03-07T08:07:33.986060000Z [diagnostics_summarizer-29] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b54727d, in raise
2025-03-07T08:07:34.030215000Z [diagnostics_summarizer-29] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc25b5a0b2c, in pthread_kill
2025-03-07T08:07:34.064043000Z [diagnostics_summarizer-29] Aborted (Signal sent by tkill() 662 0)
  1. in #267 - using v0.2.2 (release for jazzy)
[foxglove_bridge-2] [ERROR] [1741672280.071278525] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[foxglove_bridge-2] terminate called after throwing an instance of 'std::runtime_error'
[foxglove_bridge-2]   what():  rcl_wait unexpectedly timed out
[ERROR] [foxglove_bridge-2]: process has died [pid 77, exit code -6, cmd '/opt/ros/jazzy/lib/foxglove_bridge/foxglove_bridge --ros-args --log-level error --ros-args -r __node:=data_bridge -r __ns:=/groundstation --params-file /tmp/launch_params_veefrsbb'].
  1. via email - using jazzy branch, commit fe1a014
[lifecycle_manager-2] [ERROR] [1741548982.344915581] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[lifecycle_manager-2] terminate called after throwing an instance of 'std::runtime_error'
[lifecycle_manager-2]   what():  rcl_wait unexpectedly timed out
[lifecycle_manager-4] [ERROR] [1741548982.587766326] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[lifecycle_manager-4] terminate called after throwing an instance of 'std::runtime_error'
[lifecycle_manager-4]   what():  rcl_wait unexpectedly timed out
[lifecycle_manager-18] [ERROR] [1741548983.299309956] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[lifecycle_manager-18] terminate called after throwing an instance of 'std::runtime_error'
[lifecycle_manager-18]   what():  rcl_wait unexpectedly timed out
[ERROR] [lifecycle_manager-2]: process has died [pid 175258, exit code -6, cmd '/home/gd/xxxxx/g_ws/install/nav2_lifecycle_manager/lib/nav2_lifecycle_manager/lifecycle_manager --ros-args -r __node:=lifecycle_manager_map_server_amcl --params-file /tmp/launch_params_d_t5atuk --params-file /tmp/launch_params_51hi__lq --params-file /tmp/launch_params_n239xeyi --params-file /tmp/launch_params_yb0p02ir --params-file /tmp/launch_params_h316r13z'].
@evshary
Copy link
Contributor

evshary commented Mar 11, 2025

This can also be reproduced while running navigation2 test: colcon test --packages-select nav2_system_tests --ctest-args -R test_waypoint_follower --event-handlers console_direct+. Still investigating

stdout_stderr.log

@JEnoch
Copy link
Contributor Author

JEnoch commented Mar 11, 2025

Looking at the code:

  • the GraphListener::run_loop() calls rcl_wait(&wait_set_, -1) - i.e. with an infinite timeout.
  • rcl_wait() converts -1 into a INT64_MAX nanoseconds timeout, and calls rmw_wait() with it.
  • In rmw_zenoh_cpp the rmw_wait() implementation calls std::condition_variable::wait_for() with this "INT64_MAX" timeout and a predicate on wait_set_data->triggered.
  • Thus this returns only when wait_set_data->triggered is set to true, which is supposed to be done only after the insertion of an event in one of the queues for subscriptions, guard_conditions, services, clients or events.
  • Then the code checks those queues and returns RMW_RET_TIMEOUT if all the queues are empty.

So the issue seems to be some race condition between the insertion of an event, the setting of wait_set_data->triggered to true and the wakeup of this condition_variable and the check of for a new event.

@Mallets
Copy link
Contributor

Mallets commented Mar 11, 2025

I believe #528 provides a fix to the problem.
@evshary could you validate the validate on nav2_system_tests?

@evshary
Copy link
Contributor

evshary commented Mar 11, 2025

I believe #528 provides a fix to the problem. @evshary could you validate the validate on nav2_system_tests?

The rcl_wait error disappeared on my side. The failure in the log is caused by another issue.

stdout_stderr.log

@MrBlenny
Copy link

I did some testing of this using the jazzy backport and the foxglove bridge. Was getting the same issue. I'll do some more testing tomorrow to confirm in a more isolated environment.

@Mallets
Copy link
Contributor

Mallets commented Mar 13, 2025

@MrBlenny did you get the same issue with or without #528 (now merged upstream)?

@mgraoikerlan
Copy link

I am also experiencing the same issue even with #528.

@Mallets
Copy link
Contributor

Mallets commented Mar 17, 2025

@mgraoikerlan we are still investigating the issue but we weren't able to reproduce it from our side. Do you have a minimal example that would trigger it on your side?

@mgraoikerlan
Copy link

@mgraoikerlan we are still investigating the issue but we weren't able to reproduce it from our side. Do you have a minimal example that would trigger it on your side?

You can use the example @evshary provided:

colcon test --packages-select nav2_system_tests --ctest-args -R test_waypoint_follower --event-handlers console_direct+

I'm getting the error with that too.

@Mallets
Copy link
Contributor

Mallets commented Mar 17, 2025

If I run

colcon test --packages-select nav2_system_tests --ctest-args -R test_waypoint_follower --event-handlers console_direct+

on my side I get the same error as @evshary (see #527 (comment)).

I.e. I do NOT see any timed out error

[rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out

That error was solved by #528 on my side and on the CI.

Did you manage to trigger the rcl_wait unexpectedly timed out error in any other way?

@mgraoikerlan
Copy link

mgraoikerlan commented Mar 17, 2025

If I run

colcon test --packages-select nav2_system_tests --ctest-args -R test_waypoint_follower --event-handlers console_direct+

on my side I get the same error as @evshary (see #527 (comment)).

I.e. I do NOT see any timed out error

[rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out

That error was solved by #528 on my side and on the CI.

Did you manage to trigger the rcl_wait unexpectedly timed out error in any other way?

I confirm that this test does not give this error after #528. In the next few days I will do more tests with the code of my project.

@Yadunund
Copy link
Member

Closing as the original rcl_wait error has been resolved.

@JEnoch
Copy link
Contributor Author

JEnoch commented Mar 20, 2025

Despite #528 reduces the frequency of this error, it's still happening.
I managed to reproduce it once on +20 runs of nav2_system_tests :

[1.079s] 23: [lifecycle_manager-17] [ERROR] [1742395388.500310160] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[1.079s] 23: [lifecycle_manager-17] terminate called after throwing an instance of 'std::runtime_error'
[1.079s] 23: [lifecycle_manager-17]   what():  rcl_wait unexpectedly timed out

And another user reported it in the launch phase of its robot:

[lifecycle_manager-11] [ERROR] [1742335587.578381804] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[lifecycle_manager-11]   what():  rcl_wait unexpectedly timed out
[lifecycle_manager-26] [ERROR] [1742335587.620730087] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[lifecycle_manager-26]   what():  rcl_wait unexpectedly timed out
[lifecycle_manager-22] [ERROR] [1742335587.728636503] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[lifecycle_manager-22]   what():  rcl_wait unexpectedly timed out
[lifecycle_manager-52] [ERROR] [1742335588.111462446] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[lifecycle_manager-52]   what():  rcl_wait unexpectedly timed out
[limit_accel_with_height_node-15] [ERROR] [1742335097.434303364] [rclcpp]: caught std::exception exception in GraphListener thread: rcl_wait unexpectedly timed out
[limit_accel_with_height_node-15]   what():  rcl_wait unexpectedly timed out

@Yadunund Yadunund reopened this Mar 20, 2025
@Yadunund Yadunund changed the title Exception in GraphListener thread: rcl_wait unexpectedly timed out [Flaky] Exception in GraphListener thread: rcl_wait unexpectedly timed out Mar 20, 2025
@JEnoch
Copy link
Contributor Author

JEnoch commented Mar 21, 2025

I did some "printf debugging" and here is what I found:
The GraphListener only uses guard_conditions in its wait_set_. It calls rcl_wait(&wait_set_, -1) which leads to call rmw_wait() with wait_timeout == nullptr and thus to call wait_set_data->condition_variable.wait().

When the bug occurs (i.e. rcl_wait(&wait_set_, -1) returning RMW_RET_TIMEOUT), just before the call to wait_set_data->condition_variable.wait() I see that wait_set_data->triggered == true and all the rmw_zenoh_cpp::GuardCondition have has_triggered_ == false.
Hence, the wait() directly exits, and no triggered guard_condition is found, leading to RMW_RET_TIMEOUT.

I think the issue is that during the previous loop iteration:

  • Some guard condition was triggered
  • wait_set_data->triggered is set to false before the wait_set_data->condition_mutex is released.
  • Then the triggered guard_conditions are consumed
  • At the same time, some new guard_condition is probably be triggered, leading wait_set_data->triggered to be set to true
  • On next loop iteration, the GraphListener clears the wait_set_, which probably cancels the triggered guard_conditions. But I don't see the wait_set_data->triggered being set to false !
  • Hence the next call to condition_variable.wait() is with wait_set_data->triggered == true but no guard_condition.

@Yadunund , @clalancette , what do you think ?
When the wait_set_ is cleared, shouldn't wait_set_data->triggered be reset to false also ?

@Mallets
Copy link
Contributor

Mallets commented Mar 26, 2025

Based on @JEnoch analysis I've prepared #575.
The PR seems to solve the flaky test on the same VM used by @JEnoch to run the tests.

@JEnoch
Copy link
Contributor Author

JEnoch commented Mar 26, 2025

In #575 we make sure that when entering in rmw_wait the wait_set_data->triggered flag is consistent with all its conditions has_triggered_ flags. I.e. if all are false then wait_set_data->triggered is set to false.

Still, I'm not clear on 2 points in rcl and rclcpp layers:

  1. Why isn't rcl_wait_set_clear() setting rmw_wait_set_t::triggered to false at the same time it clears all conditions ?
  2. In rclcpp's GraphListener::run_loop(), it seems to me that some conditions could be triggered concurrently to the loop, and be silently cleared and ignored when calling rcl_wait_set_clear(&wait_set_) before rcl_wait(&wait_set_, -1). Isn't it an issue ?

@mjcarroll
Copy link
Member

Why isn't rcl_wait_set_clear() setting rmw_wait_set_t::triggered to false at the same time it clears all conditions ?

Is triggered an implementation-specific detail? I don't think that it's part of the general rmw interface, which would explain why rcl or rclcpp can't do anything about it.

In rclcpp's GraphListener::run_loop(), it seems to me that some conditions could be triggered concurrently to the loop, and be silently cleared and ignored when calling rcl_wait_set_clear(&wait_set_) before rcl_wait(&wait_set_, -1). Isn't it an issue ?

I don't think there is anything that specifies that the user of the waitset handles every ready entity in the waitset. You could (in theory) wait for a ready condition for a specific entity and ignore the rest. This would cause a lot more churn/spin, but is a "legal" use of the pattern.

Specifically though in the case of the GraphListener, this may indicate a bug if the notifications aren't propagating.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants