Fix MoveGroup::setStartStateToCurrentState() (issue 442) #671

andreaskoepf · 2016-05-10T13:26:15Z

We ran into #442 in our lab too and observed some very threatening and dangerous high-speed motions of our UR5 robotic arm. On a real industrial robot this issue is extremely safety-critical since it can lead to plans that start far away from the current robot joint position even though MoveGroup::setStartStateToCurrentState() has been called. I am personally quite shocked that this was not fixed by somebody else before since #442 is more than two years old.

Some background details about the problem as understood by us:

Calling MoveGroup::setStartStateToCurrentState() internally nulls the considered_start_state_ and sends a move-group planning request action goal with an empty start state (joint-names & positions arrays are emtpy). This makes MoveIt! directly use the RobotState which is stored inside the planning scene, e.g. when using OMPL ModelBasedPlanningContext::setCompleteInitialState() will be called from ompl_interface::PlanningContextManager::getPlanningContext() with a state generated by robot_state::RobotStatePtr start_state = planning_scene->getCurrentStateUpdated(req.start_state); .. when req.start_state is 'empty' this means effectively that the current state of the planning scene is is used.

Unfortunately the action server callback MoveGroupMoveAction::executeMoveCallback does not ensure that a robot state newer than the planning request is available. This leads to plans that use an outdated start state. Actually I am not quite sure how far back in time the invalid start state can be since from the motions that we saw it must have been significantly > 0.1s... maybe somebody from the MoveIt! team with more expertise in this specific area can explain why we saw such severe movements....

How we address the problem:

This PR makes sure that a new JointState message is received by CurrentStateMonitor and is indeed written to the RobotState of the PlanningScene before the planning request is being processed. In our tests this fixed the problem (which otherwise is quickly reproductible). In this PR the change is done only exemplary for the MoveGroupMoveAction .. other actions might be affected by an outdated current state too.

Since the MoveIt! source base is new to me and the project is quite large this PR is intended as a discussion basis on how to best address the problem. E.g. the sleep-based wait loop should better be replaced by a boost::condition_variable or some other more direct signalling mechanism, etc.

Special thanks to @alemme for debugging this issue together with me.

simonschmeisser · 2016-05-10T14:18:08Z

It has surprised me as well a couple of times how badly synchronized stuff can be in MoveIt!. I somehow manage to forget after some time about all the sleep() calls I had to put into my code.

I like your idea about saving the last update time. However I think this should be stored in the RobotState itself. Ideally the timestamp of when the position information entered the ros system (ie when the joint position message arrived from the robot controller).

JimmyDaSilva · 2016-05-10T15:06:54Z

Works like a charm for me. Glad someone finally digged into this :)
Thanks !

andreaskoepf · 2016-05-10T15:27:47Z

@simonschmeisser I would appreciate to have time stamps inside the RobotState itself - something similar to the values current_state_time_ and joint_time_ currently stored externally in CurrentStateMonitor, e.g. current_state_monitor.cpp#L312.

130s · 2016-05-26T21:38:03Z

@andreaskoepf Could you rebase to the latest source of origin/indigo-devel? Your PR is based on outdated commits and due the the old .travis.yml Travis would probably never pass.

andreaskoepf · 2016-05-26T22:55:37Z

@130s Rebasing would be no problem. My hope was that some MoveIt! expert would review the changes and give me some hints how to fix this in a more general way. E.g. I have not yet checked how planning scene changes (e.g. addition/removal of collision objects) is incorporated in the planning process and if that faces the same problem e.g. it might not be guaranteed that the most recent changes are considered during planning.

I can set this on my todo list and try to find 1-2 days of time to build up a deeper understanding about the code-structure and then come up with a cleaner solution. But I will not be able to do this before June 6th...

davetcoleman · 2016-06-14T22:31:22Z

move_group/src/default_capabilities/move_action_capability.cpp

+    if (age < ros::Duration(0))
+      break;
+
+    if (i >= 200)


instead of a less intuitive iteration count, can you have a timeout variable in seconds? i know its basically the same thing

@davetcoleman yes, sure. thx for having a look. Still had no time to do more complete analysis if integration of changes in the planning-scene suffers from a similar issue (I think I observed a couple of times that collision objects were not correcty added/removed before the the planning but that is not verified).

@andreaskoepf: see #686 for integrating changes to the planning scene synchronously.

davetcoleman · 2016-06-14T22:50:46Z

I'm not super familiar with the architecture of this part of MoveIt! but overall it seems like a reasonable fix.

alainsanguinetti · 2016-07-04T08:57:19Z

Maybe we should deprecate this method?
Could it be a problem with how people process the callbacks ? (not fast enough or maybe async spinner thread is preempted etc .. )
As @v4hn noted, I think #686 can solve the problem.
I think it is better to rely on an explicit synchronisation mechanism rather than adding some delay to hide that something was not fast enough or is not happening.

v4hn · 2016-07-04T09:49:29Z

move_group/src/default_capabilities/move_action_capability.cpp

+    }
+
+    ros::Duration(0.001).sleep();
+  }


I believe this whole block should only be executed if planning_scene::isEmpty(goal->request.start_state).
If you just want to plan with a complete state, there is no need to wait for a state update.
On the other hand, if the start_state is empty that means we should plan with the current state, so we should wait for the next update as you proposed @andreaskoepf .

v4hn · 2016-07-04T10:19:56Z

The current state is that the MoveGroupAction fails to fulfill its specification for the case where `request.start_state` is empty. In this case the action might accidentally use an outdated state instead of the current state and I don't see a different solution than the one proposed here. I think a cleaned version of patch should definitely be applied. @andreaskoepf could you please address the comments by dave and me? I think the sleep is fine for now to fix this hazardous behavior. If later on someone steps up to replace it by a mechanism using condition_variable et. al. that would be great, but not mandatory.

alainsanguinetti · 2016-07-06T07:38:23Z

Hello,
I think the title of this PR is misleading because it pretends to fix the setStartToCurrentState method while it actually addresses issue #442 which is a much bigger problem which resides mainly in the move_group package.
I think we can do two things:

actually improve setStartToCurrentState by making it use current_state_monitor_.getCurrentState() which is quite easy and would make it possible for the requesting code to manage what is the start state.
fix the problem on the move_group node.

I think, as I said before, that it is a problem with how callbacks are processed. To show that, in move_group/move_group.cpp line 152 you can change:
ros::AsyncSpinner spinner( 1 );
to
ros::AsyncSpinner spinner( 0 );
which actually tells the spinner to use as many threads as there are cores on the computer. In my case, this solved the problem or at least significantly reduced it.
My explanation would be that when there is only one thread, if the callbacks corresponding to the action are processed by the spinner before the state update callbacks then it is impossible to have the state be updated before the action is processed.

In any case, changing from 1 to 0 improved the smoothness of the display so I think I'll open a PR just for that.

rhaschke · 2016-07-15T02:12:36Z

@alainsanguinetti raised an important point: Having only a single spinner thread, all traffic is processed strictly in order. There are some blocking services, e.g. MoveGroupExecuteService and maybe more, which take a while for processing. During that time, no other ROS messages are processed.

In case of a waiting MoveGroupExecuteService that means, that the service waits for the trajectory controller to finish, however, incoming joint state updates are not processed and pile up in the callback queue. Afterwards, it takes a while to process all of them.
For this reason the proposed fix doesn't solve the issue: Because updateSceneWithCurrentState stores the current time as last_robot_state_update_time_ but not the (eventually outdated) timestamp of the joint_state message, the waiting loop might stop to early, being satisfied with an outdated state.

To fix the problem in a more fundamental fashion, I suggest to modify planning_scene_monitor::LockedPlanningSceneRO such, that it waits until all scene updates with a timestamp < ros::Time::now() are processed and only then returns the current planning scene.
This indeed would require a condition variable to be efficient.

Another issue might be bad locking behaviour of the PlanningScene in PlanningSceneMonitor.
I observed a write-lock time of 0.277s for a state update. I don't yet fully understand the reason, but eventually the multiple number of read-only accessors block a write-accessor. In this case, a pending write-accessor should block all future read-accessors?

Increasing the number of spinner threads is a quick fix. However, we should carefully look for those blocking services and put them actively in threads (instead of running them in the spinner thread only).

v4hn · 2016-07-15T09:02:51Z

On Thu, Jul 14, 2016 at 07:12:37PM -0700, Robert Haschke wrote:

@alainsanguinetti raised an important point: Having only a single spinner thread, all traffic is processed strictly in order.
There are some blocking services, e.g. MoveGroupExecuteService and maybe more, which take a while for processing. During that time, no other ROS messages are processed.

Yes, we should work on that front.

This does not solve #422 in theory though. The problem there is basically the same one as previously noticed with the ObjectRecognitionAction:
Traditionally the action server does not get to know exactly when it is supposed to start using data, i.e. input images.
https://groups.google.com/forum/#!msg/ros-sig-perception/j1JtB5dkkIU/tEF_kOZAU-oJ

We have the same problem with the planning scene/robot state to use for the action request, and I believe we should change that somehow.
Cleaning up threading in MoveIt does not resolve this design flaw, but will reduce the number of errors quite a lot to make it harder
to trigger the bug (which does not mean that it will not appear the next time you have to show a demo).

rhaschke · 2016-07-15T09:32:21Z

This does not solve #422 in theory though. The problem there is basically the same one as previously noticed with the ObjectRecognitionAction:
Traditionally the action server does not get to know exactly when it is supposed to start using data, i.e. input images.
https://groups.google.com/forum/#!msg/ros-sig-perception/j1JtB5dkkIU/tEF_kOZAU-oJ

As far as I can see, Martin Günther proposes exactly the same concept as me: Make sure that the input data is current (here: all input to PSM previous to a timestamp is processed) w.r.t. the timestamp of the call of an action.
Indeed, currently there is no method to pass this timestamp explicitly. However, in case of PSM, I think
ros::Time::now() will be sufficient.

Please continue this discussion in #716.

andreaskoepf · 2016-07-15T11:28:58Z

Please continue this discussion in #716.

fix currentstate to startstate bug

9ecc1b5

andreaskoepf mentioned this pull request May 10, 2016

Trajectory start doesn't match Current Position, when using plan/execute #442

Closed

davetcoleman reviewed Jun 14, 2016
View reviewed changes

v4hn reviewed Jul 4, 2016
View reviewed changes

rhaschke mentioned this pull request Jul 15, 2016

fix #442: race conditions when updating PlanningScene state #716

Merged

andreaskoepf closed this Jul 15, 2016

Fix MoveGroup::setStartStateToCurrentState() (issue 442) #671

Fix MoveGroup::setStartStateToCurrentState() (issue 442) #671

Uh oh!

Conversation

andreaskoepf commented May 10, 2016

Uh oh!

simonschmeisser commented May 10, 2016

Uh oh!

JimmyDaSilva commented May 10, 2016

Uh oh!

andreaskoepf commented May 10, 2016

Uh oh!

130s commented May 26, 2016

Uh oh!

andreaskoepf commented May 26, 2016

Uh oh!

davetcoleman Jun 14, 2016

Choose a reason for hiding this comment

Uh oh!

andreaskoepf Jun 14, 2016

Choose a reason for hiding this comment

Uh oh!

v4hn Jul 4, 2016

Choose a reason for hiding this comment

Uh oh!

davetcoleman commented Jun 14, 2016

Uh oh!

alainsanguinetti commented Jul 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

v4hn Jul 4, 2016

Choose a reason for hiding this comment

Uh oh!

v4hn commented Jul 4, 2016 via email

Uh oh!

alainsanguinetti commented Jul 6, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhaschke commented Jul 15, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

v4hn commented Jul 15, 2016

Uh oh!

rhaschke commented Jul 15, 2016

Uh oh!

andreaskoepf commented Jul 15, 2016

Uh oh!

Uh oh!

alainsanguinetti commented Jul 4, 2016 •

edited

Loading

alainsanguinetti commented Jul 6, 2016 •

edited

Loading

rhaschke commented Jul 15, 2016 •

edited

Loading