
Execution structure (spin-off of #427) #428

Open · mxgrey wants to merge 15 commits into main from execution_structure
Conversation

@mxgrey (Collaborator) commented Nov 20, 2024

This PR is spun out of #427 and focuses entirely on changes to the execution structure (the Context -> Executor -> Node relationship).

This is the most crucial change needed to make async execution feasible. Up until now, rclrs has had a loose coupling between the executor and nodes, meaning nodes could be added to or removed from the executor freely. This comes with the following disadvantages:

  • It's possible to add the same node to multiple executors and spin them all at the same time. We have protections in place so this won't violate any contract with rcl, but it would be a confusing arrangement.
  • We need many layers of mutexes, arcs, and weaks to manage the loose ownership.
  • Users don't have a clear order in which to invoke these three top-level objects.
  • For async execution, nodes will need a way of issuing jobs to their executor at any time, which isn't possible if they can be orphaned or bounce around between executors.

Due to the last bullet point in particular, I'm proposing a rigid procedure in which:

  • A user creates a Context
  • The Context can create one or more Executors
  • Each Executor can create one or more Nodes

The minimal publisher example gives a good demonstration of this new procedure.
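For reference, the flow looks roughly like the sketch below. This is not copied from the example: `create_node` and `SpinOptions::default()` are assumed names based on the discussion in this PR, while `Context::default_from_env`, `create_basic_executor`, and `SpinOptions` all appear elsewhere in this review.

```rust
use rclrs::*;

fn main() -> Result<(), RclrsError> {
    // 1. The user creates a Context.
    let context = Context::default_from_env()?;
    // 2. The Context creates an Executor.
    let mut executor = context.create_basic_executor();
    // 3. The Executor creates one or more Nodes. (`create_node` is an assumed name.)
    let _node = executor.create_node("minimal_publisher")?;
    // Spin until shutdown (assuming `SpinOptions::default()` means "spin indefinitely").
    executor.spin(SpinOptions::default());
    Ok(())
}
```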

In this structure, Nodes are no longer allowed to be removed from an Executor or move between Executors. That's a very unlikely use case to begin with, so I don't think there's a downside to this restriction. After #421 we'll be able to introduce custom flavors of Executors, so if anyone really cares about moving their node around between executors, it would be possible to introduce a custom hierarchical executor that can move nodes between other executors.

mxgrey and others added 4 commits November 16, 2024 16:05
Comment on lines -42 to +54
-pub fn spin_once(&self, timeout: Option<Duration>) -> Result<(), RclrsError> {
+fn spin_once(&self, timeout: Option<Duration>) -> Result<(), RclrsError> {
A Collaborator commented:
Why was spin_once made private, and is there a way to have a call to spin that doesn't run continuously? I also noticed in the implementation that this function executes all the available work, so it's more of a spin_some than a spin_once, but that's out of scope here.

@mxgrey (Collaborator Author) replied:
> Why was spin_once made private, and is there a way to have a call to spin that doesn't run continuously?

The idea is to make this fn spin_once an implementation detail. The public API would now be

executor.spin(SpinOptions::spin_once());

It's certainly more verbose, but I'd like to start the API transition by narrowing down the public API for Executor to be based entirely on SpinOptions. After we've tick-tocked this version of spin_once I wouldn't mind revisiting how to make the API more ergonomic. But I don't think we'll need to worry about making "spin once" ergonomic since the vast majority of downstream users won't have any use for it. Once async execution is available they will almost certainly want to use until_promise_resolved in places where they would have previously used spin_once in a loop with a condition. But that's a discussion that doesn't make sense to have until we're reviewing the async execution PR.
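As a hedged sketch of the "spin_once in a loop with a condition" pattern mentioned above, written against the new SpinOptions-based API (`keep_going` is a placeholder for whatever condition the caller checks; whether SpinOptions composes with a loop exactly like this is an assumption):

```rust
fn spin_while(executor: &mut rclrs::Executor, keep_going: impl Fn() -> bool) {
    while keep_going() {
        // Each pass handles the currently available work (rclcpp-style spin_once semantics).
        executor.spin(rclrs::SpinOptions::spin_once());
    }
}
```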

> I also noticed in the implementation that this function executes all the available work, so it's more of a spin_some than a spin_once, but that's out of scope here.

I double-checked and if I'm reading the ROS 1 and rclcpp docs correctly, spin_once from ROS 1 would do all available work without waiting for any new work. In rclcpp this was changed so spin_once will wait until the next time new work becomes available if there isn't any work already available. In rclcpp spin_some behaves the way spin_once used to in ROS 1.

In this case our behavior matches rclcpp's spin_once so I think we should keep the name as-is.

@esteve (Collaborator) commented Dec 16, 2024

@mxgrey could you revert the version bump? I prefer doing it before a new release, separately from other PRs.

@mxgrey (Collaborator Author) commented Dec 16, 2024

@esteve I can certainly do that, but I should caution that without the version bump cargo will not recognize that it needs to update its lockfile or its build cache. I ran into an issue on my machine where compiling the examples failed because cargo insisted on using an outdated version of rclrs that it had cached. I had to delete my colcon build and install directories and the target directories in the cargo packages before it would compile.

Bumping the version number across all the Cargo.tomls as I did in the last commit should spare others from needing to manually purge their build artifacts. Do you still want me to revert it?

@mxgrey (Collaborator Author) commented Dec 16, 2024

FWIW another way to avoid the problem that I just described is for us to put a workspace Cargo.toml in the root of the git repo. That's something I've been wanting to do for a long time now, but I've been holding off on it in case it's controversial for any reason. colcon-cargo should be able to support it by now, so that shouldn't be a blocker at least.

If you'd like, I could revert the version bump and add a workspace Cargo.toml. This would mean that all the Cargo.toml files in the examples directory specify relative paths to rclrs and rosidl_runtime_rs, so they might not be as useful for folks who want to just copy a whole example crate directory when starting a project.
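For concreteness, a root workspace manifest could look roughly like the sketch below; the member list is an assumption, not a final proposal.

```toml
# Hypothetical <repo-root>/Cargo.toml
[workspace]
resolver = "2"
members = [
    "rclrs",
    "rosidl_runtime_rs",
    "examples/*",
]
```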

use anyhow::{Error, Result};

#[tokio::main]
async fn main() -> Result<(), Error> {
-    let context = rclrs::Context::new(env::args())?;
+    let mut executor = rclrs::Context::default_from_env()?.create_basic_executor();
A Collaborator commented:
@mxgrey how would a custom executor work here? It'd take a Context as an argument, right? I'd remove the create_basic_executor function and be slightly more verbose (SingleThreadedExecutor::new(&context)) to show how the API works, and basically to not make any distinction between the executors.

@mxgrey (Collaborator Author) commented Dec 16, 2024:

If we look several PRs into the future we can see how the basic executor will be created once we have the ability to support custom executors.

The hook for making a custom executor will be the runtime trait which downstream users can implement for their custom struct. Once they've implemented that trait, anyone can use the custom runtime to create an opaque Executor instance. The Executor hides away what kind of runtime is being used so the rest of the application doesn't have to care about it.

I've gone ahead and made a repo to demonstrate what this would look like. In that repo I've made two different custom executors: DoNothingSimple and DoNothingAdvanced. Both are custom executor runtimes which, as their names imply, do nothing. They're not useful as far as executors go, but they can show us how the custom executor API will work. If you clone that repo into a colcon workspace and check out the worker branch of my fork, you'll be able to compile and run the demo.

Simple

The simplest way to create a custom executor is to use the Context::create_executor method as demonstrated here. This will work perfectly fine as long as it's okay for downstream users to directly initialize your custom runtime.
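In other words, something like the following hedged sketch, where DoNothingSimple comes from the demo repo, and both the create_executor signature and the Default construction are assumptions:

```rust
use rclrs::*;

// Assumes Context::create_executor accepts any type implementing the future
// executor runtime trait, and that DoNothingSimple can be built with Default.
fn make_demo_executor(context: &Context) -> Executor {
    context.create_executor(DoNothingSimple::default())
}
```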

Advanced

There are two reasons you might want to use a more advanced approach for a custom runtime:

  1. You don't want users to be able to directly instantiate your runtime. Maybe your runtime is a singleton or relies on some dependencies that you want to hide as implementation details.
  2. Your runtime needs access to the Context right away during instantiation, and you want to guarantee that the user can't pass the wrong context to you.

To achieve this you can use the extension trait pattern. The repo demonstrates an implementation of this pattern here. Then the pattern gets used here.

This pattern allows you to add new methods to the Context struct from downstream crates. In this case we're adding Context::create_do_nothing_executor which bears an uncanny resemblance to our built-in method Context::create_basic_executor.

All that is to say, the API for custom downstream executors can be fully consistent with this out-of-the-box upstream API of Context::create_basic_executor.
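To make the pattern concrete without the repo at hand, here is a hedged sketch of the "advanced" approach. Apart from Context, Executor, create_executor, and create_do_nothing_executor, the names below are assumptions, and the runtime trait implementation for DoNothingAdvanced is elided:

```rust
use rclrs::*;

/// Hypothetical custom runtime. Reason 2 above: it holds the Context from the start.
struct DoNothingAdvanced {
    context: Context,
}

/// Extension trait: a downstream crate can add new methods to rclrs::Context.
pub trait CreateDoNothingExecutor {
    fn create_do_nothing_executor(&self) -> Executor;
}

impl CreateDoNothingExecutor for Context {
    fn create_do_nothing_executor(&self) -> Executor {
        // Reason 1 above: DoNothingAdvanced is only ever constructed in here,
        // so users can't instantiate it directly or pass in the wrong Context.
        // Assumes Context is cheaply clonable (it wraps an Arc'd handle).
        self.create_executor(DoNothingAdvanced { context: self.clone() })
    }
}
```

A caller then brings the extension trait into scope with a `use` and calls context.create_do_nothing_executor(), exactly mirroring create_basic_executor.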

@mxgrey (Collaborator Author) commented:

As discussed in the Working Group meeting, I've changed this PR to use an extension trait to provide the Context::create_basic_executor method: a75cd24

At the same time I've simplified all the tests and demos to just do use rclrs::*; since that's probably how most users will want to use rclrs anyway.

@mxgrey force-pushed the execution_structure branch from e3a7436 to 0918476 on December 16, 2024 14:51
@mxgrey (Collaborator Author) commented Dec 16, 2024

@esteve After giving it more thought I've reverted the version bump. It would fix the problem for people pulling this PR, but then we'd need to keep bumping the version number with each of the subsequent PRs that are lined up right now or else the problem will keep happening.

I think gathering all the crates up into one cargo workspace is the right solution here. How would you feel if I open up a separate PR for that?

@mxgrey (Collaborator Author) commented Dec 16, 2024

I should bring attention to a last minute change that I just introduced: b465f6f

While developing the workers PR #446 I realized that a multi-threaded executor could accumulate multiple RclrsErrors across its different threads within one spin. I don't want to discard any of those errors just for the sake of returning a single Result<(), RclrsError> from Executor::spin, so in #446 I changed Executor::spin to return a Vec<RclrsError> which will be empty if no errors occurred.

To allow this to funnel into a single Result<(), RclrsError> instance when users want it to, I introduced the trait method RclrsErrorFilter::first_error, which can be applied to a Vec<RclrsError> to get a Result<(), RclrsError> that reflects whether the vector contained any errors. You can see an example usage here.
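A hedged sketch of what that looks like at a call site; the SpinOptions constructor used here is a placeholder:

```rust
use rclrs::*;

fn spin_and_report(executor: &mut Executor) -> Result<(), RclrsError> {
    executor
        .spin(SpinOptions::default()) // now returns Vec<RclrsError>, empty on success
        .first_error()                // RclrsErrorFilter: Ok(()) unless something went wrong
}
```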

Since the point of this PR is to introduce all the user-facing API changes that will be needed in order to support async executors (which IMO includes multi-threaded executors), I figured I should push that API change up to this PR.

@maspe36 mentioned this pull request Dec 17, 2024
@maspe36 previously approved these changes Dec 17, 2024
/// A helper trait to disregard timeouts as not an error.
pub trait RclrsErrorFilter {
    /// Get the first error available, or Ok(()) if there are no errors.
    fn first_error(self) -> Result<(), RclrsError>;

@mxgrey (Collaborator Author) commented Dec 18, 2024:

That's both interesting and surprising to me. I knew that you could use collect to go from Vec<Result<T, E>> to Result<Vec<T>, E> but I didn't know Vec<Result<(), E>> to Result<(), E> was possible. This seems to be a special case for () because I get a compilation error when I try to use it with Vec<Result<i32, E>>.

I can change the implementation of fn first_error to take advantage of this, although I think the code would be a bit less clear about what it's doing.
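For reference, the special case being discussed is that () implements FromIterator<()>, so a collection of Result<(), E> can collapse straight into a single Result:

```rust
fn main() {
    let results: Vec<Result<(), String>> = vec![Ok(()), Err("boom".into()), Ok(())];
    // Collecting stops at (and returns) the first Err it encounters.
    let collapsed: Result<(), String> = results.into_iter().collect();
    assert_eq!(collapsed, Err("boom".to_string()));
}
```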

A Collaborator commented:

We could also remove this function from the public API and let the users figure out what to do with a Vec<Result<(), RclrsError>>.

@mxgrey (Collaborator Author) replied:

We're already using it in the unit tests. I'd rather not replace it with .into_iter().collect() because I feel like .into_iter().collect() is much more abstruse than it needs to be for expressing something as simple as "give me the first error in this collection".

@esteve (Collaborator) commented Dec 18, 2024

@mxgrey

> I think gathering all the crates up into one cargo workspace is the right solution here. How would you feel if I open up a separate PR for that?

Not sure, what would be the implications? Eventually we'd like to move the crates to separate repositories so that they can be released independently into the ROS buildfarm. We're still far from getting there, but would putting all the crates into one cargo workspace prevent that in the future?

@mxgrey (Collaborator Author) commented Dec 18, 2024

> Not sure, what would be the implications?

The only implication is that you can do $ cargo build, $ cargo run, and $ cargo test from the root of the workspace, and the crates can all share one cache and one lockfile. There are no downsides as far as I can figure.

> would putting all the crates into one cargo workspace prevent that in the future?

No, there's no significant commitment in setting up a workspace. IIRC we were mostly planning on moving out rosidl_generator_rs, which is a Python package anyway. But even if we wanted to move rosidl_runtime_rs out into its own repo, when it comes time to do that we would just revert the Cargo.toml of rclrs to specify a version for rosidl_runtime_rs (like it does right now) instead of specifying a path. But I assume the example crates will always stay in the same repo as rclrs, so that should continue to be useful.
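To illustrate the two dependency styles being contrasted (the version number below is just a placeholder):

```toml
# In a workspace layout, rclrs/Cargo.toml points at the sibling crate by path:
[dependencies]
rosidl_runtime_rs = { path = "../rosidl_runtime_rs" }

# If rosidl_runtime_rs later moved to its own repo, this would revert to a version:
# rosidl_runtime_rs = "0.4"
```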

@esteve (Collaborator) commented Jan 15, 2025

@mxgrey let us know when this is ready for one final review, though my only feedback was about the create_basic_executor function, but that's been addressed. @nwn is that change ok with you?

@mxgrey (Collaborator Author) commented Jan 15, 2025

@esteve I believe we've closed the loop on all feedback, so I think it's a good time to announce a final call for feedback to the working group.

@nwn (Contributor) commented Jan 20, 2025

> @mxgrey let us know when this is ready for one final review, though my only feedback was about the create_basic_executor function, but that's been addressed. @nwn is that change ok with you?

Yes, thanks for making the change. It just makes it less likely that we'll accidentally give special treatment to the basic executor.

Comment on lines +81 to +88
impl Default for Context {
    fn default() -> Self {
        // SAFETY: It should always be valid to instantiate a context with no
        // arguments, no parameters, no options, etc.
        Self::new([], InitOptions::default()).expect("Failed to instantiate a default context")
    }
}

A Contributor commented:
It might be worth adding a doc comment for this impl to warn people that this isn't the same thing as the "default context" they're used to in rclcpp, which is a global shared context.
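For example, something along these lines (wording is only a suggestion):

```rust
/// Note: unlike the "default context" in rclcpp, this is *not* a globally
/// shared instance. Every call to `Context::default()` creates a brand new,
/// independent Context.
impl Default for Context {
    fn default() -> Self {
        Self::new([], InitOptions::default()).expect("Failed to instantiate a default context")
    }
}
```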

use std::{
    sync::{Arc, Mutex, Weak},
    time::Duration,
};

/// Single-threaded executor implementation.
-pub struct SingleThreadedExecutor {
+pub struct Executor {
    context: Arc<ContextHandle>,
    nodes_mtx: Mutex<Vec<Weak<Node>>>,
A Contributor commented:
With the removal of remove_node(), it should now be the responsibility of the Executor to clean up old Nodes. So to avoid leaking the Node allocations, we should filter those out, possibly within spin or spin_once.
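A hedged sketch of that cleanup, assuming the nodes_mtx: Mutex<Vec<Weak<Node>>> field shown in the diff above:

```rust
use std::sync::{Mutex, Weak};

/// Drop entries whose Node has already been dropped by the user so the Vec
/// doesn't grow without bound; this could run at the start of each spin.
fn prune_dropped_nodes(nodes_mtx: &Mutex<Vec<Weak<rclrs::Node>>>) {
    let mut nodes = nodes_mtx.lock().unwrap();
    nodes.retain(|node| node.strong_count() > 0);
}
```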

Comment on lines -53 to +55
-/// Sending messages does not require calling [`spin`][1] on the publisher's node.
+/// Sending messages does not require the node's executor to [spin][2].
 ///
-/// [1]: crate::spin
+/// [2]: crate::Executor::spin
A Contributor commented:
Nit-pick, but this could just be [1] instead of [2].
