Affinity: Update resource lifetime #70

Open: wants to merge 1 commit into master
54 changes: 48 additions & 6 deletions affinity/cpp-20/d0796r2.md
@@ -152,17 +152,59 @@ From a historic perspective, programming models for traditional high-performance

Some of these programming models also address *fault tolerance*. In particular, PVM has native support for this, providing a mechanism [[27]][pvm-callback] which can notify a program when a resource is added or removed from a system. MPI lacks a native *fault tolerance* mechanism, but there have been efforts to implement fault tolerance on top of MPI [[28]][mpi-post-failure-recovery] or by extensions[[29]][mpi-fault-tolerance].

Due to the complexity involved in standardizing *dynamic resource discovery* and *fault tolerance*, these are currently out of the scope of this paper.
Due to the complexity involved in standardizing *dynamic resource discovery* and *fault tolerance*, these are currently out of the scope of this paper. However, we leave open the possibility of accommodating both in the future, by not overconstraining *resources*' lifetimes (see next section).

### Lifetime considerations
### Resource lifetime

As the execution context would provide a partitioning interface which returns objects describing the components of the system topology of an execution resource, it is important to consider the lifetime of these objects.
The initial solution may only target systems with a single addressable memory region. It may thus exclude devices like discrete GPUs. However, in order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future. In particular, in order to support heterogeneous systems, the abstraction must let the interface query the *resource topology* of the *system* in order to perform device discovery.

The objects returned from the partitioning interface would be opaque, implementation-defined objects that do not perform any scheduling or execution functionality which would be expected from an *execution context* and would not store any state related to an execution. Instead they would act simply as an identifier to a particular partition of the *resource topology*.
The *resource* objects returned from the partitioning interface are opaque, implementation-defined objects. They would not perform any scheduling or execution functionality which would be expected from an *execution context*, and they would not store any state related to an execution. Instead, they would simply act as an identifier to a particular partition of the *resource topology*. This means that the lifetime of a *resource* retrieved from an *execution context* must not be tied to the lifetime of that *execution context*.

For these reasons, *resources* must always outlive any *execution context* which is constructed from them, and any *resource* retrieved from an *execution context* must not be tied to the lifetime of that *execution context*.
The lifetime of a *resource* instance refers to both validity and uniqueness. First, if a *resource* instance exists, does it point to a valid underlying hardware or software resource? That is, could an instance's validity ever change at run time? Second, could a *resource* instance ever point to a different (but still valid) underlying resource? It suffices for now to define "point to a valid underlying resource" informally. We will elaborate this idea later in this proposal.

The initial solution should target systems with a single addressable memory region. It should thus exclude devices like discrete GPUs. In order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future. In particular, in order to support heterogeneous systems, the abstraction must let the interface query the *resource topology* of the *system* in order to perform device discovery.
Creation of a *context* expresses intent to use the *resource*, not just to view it as part of the *resource topology*. Thus, if a *resource* could ever cease to point to a valid underlying resource, then users must not be allowed to create a *context* from the resource instance, or launch parallel executions with that context. *Context* construction, and use of an *executor* with that *context* to launch a parallel execution, both assert validity of the *context*'s *resource*.
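
As a rough sketch of how these assertions might surface to users, consider the following. Every name in it (`this_system::get_resources`, `execution_context`, `executor()`, `execute`) is a placeholder for whatever interface this proposal ultimately adopts, not a settled API:

```cpp
// Illustrative only; all names below are placeholders, not proposed wording.
void sketch() {
  // Viewing the topology does not by itself assert that any resource is usable.
  auto resources = this_system::get_resources();

  // Constructing a context expresses intent to use the resource and asserts
  // that it currently points to a valid underlying resource.
  execution_context ctx{resources.front()};

  // Launching work through an executor bound to that context asserts the
  // validity of the context's resource again, at launch time.
  auto ex = ctx.executor();
  ex.execute([]{ /* work placed on the chosen resource */ });
}
```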

If a *resource* is valid, then it must always point to the same underlying thing. For example, a *resource* cannot first point to one CPU core, and then suddenly point to a different CPU core. *Contexts* can thus rely on properties like binding of operating system threads to CPU cores. However, the "thing" to which a *resource* points may be a dynamic, possibly software-managed pool of hardware. For instance, the operating system may maintain a pool of a varying number of CPU cores, or a device driver may switch between a "discrete" GPU and an "integrated" GPU, depending on utilization and power constraints.
Contributor

I'm not sure if I understand the part about the thing that is pointed to being dynamic. Does this mean that the thing that an execution_resource points to may become available/unavailable dynamically, not that it can become something different?

Collaborator Author

Thanks so much for your feedback! I'll answer a few questions at a time since I'm in meetings all day.

Collaborator Author (@mhoemmen, Aug 21, 2018)

What I mean by "the thing that is pointed to" is that an execution resource may point either to a hardware resource (CPU core, etc.) or to a software resource. Tom Rodgers from Red Hat pointed out today the use case of running in a virtual machine, in which "hardware" doesn't necessarily map to real hardware.

Collaborator Author

In the case of running in a virtual machine, the "CPU core" to which an execution resource points might run on one physical CPU core at one moment. At the next, it may run on a different physical CPU core, possibly even on a different node. Nevertheless, we want the execution resource to point to that same "virtualized CPU core"; the resource shouldn't suddenly point to something else like a GPU or whatever.

Contributor

Thanks for your explanation, I think I understand now. So effectively we want to say that a C++ library implementation cannot change what an execution resource points to dynamically; however, what that execution resource points to can be dynamically managed by the environment which is executing the process, such as the OS or a virtual machine?

Collaborator Author

That's right; thanks! :D

Collaborator Author

(Note to self: Added virtual machine example locally)


In summary, a *resource* either uniquely identifies an underlying resource, or harmlessly points to nothing. The sections that follow justify and explain this.

#### Permit dynamic resource lifetime

We should not assume that *resource* instances have the same lifetime as the running application. For example, some hardware accelerators like GPUs require calling an initialization function before a running application may use the accelerator, and calling a finalization function after using the accelerator. The software interface for the accelerator may not even be available at application launch time. For instance, the interface may live in a dynamic library that users may load at run time. In the case of a pool of CPU cores managed by the operating system, the application might have to request access to the pool at run time, and the operating system may have to do some work in order to reserve CPU cores and set them up for use in the pool. Applications that do not use the pool should not have to pay this setup cost. The more general cases of dynamic resource discovery and fault tolerance, that we discussed above, also call for dynamic *resource* lifetimes.
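
As a hedged illustration of the dynamic-library case, the sketch below loads an invented accelerator runtime at run time; the library name and the `accel_initialize` entry point are made up for this example, and the point is only that the accelerator's *resources* cannot appear in the *resource topology* before such a call succeeds.

```cpp
#include <dlfcn.h>  // POSIX dynamic loading, used here purely for illustration

// The library and symbol names below are invented for this example.
bool bring_up_accelerator() {
  void* lib = dlopen("libaccel_runtime.so", RTLD_NOW);
  if (lib == nullptr) {
    return false;  // the accelerator's software interface is not even present
  }
  using init_fn = int (*)();
  auto init = reinterpret_cast<init_fn>(dlsym(lib, "accel_initialize"));
  if (init == nullptr || init() != 0) {
    return false;  // present, but could not be initialized
  }
  // Only from this point on could a topology query report the accelerator's
  // resources; before it, they do not exist as resources at all.
  return true;
}
```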

#### Resources should not reference count

We considered mandating that *execution resources* use reference counting, just like `shared_ptr`. This would clearly define resources' lifetimes. However, there are several arguments against requiring reference counting.

1. Holding a reference to the *execution resource* would prevent parallel execution from shutting down, thus (potentially) deadlocking the program.
2. Not all kinds of *resources* may have lifetimes that fit reference counting semantics. Some kinds of GPU *resources* only exist during parallel execution, for example; those *resources* cannot be valid if they escape the parallel region. In general, programming models that let a "host" processor launch code on a "different processor" have this issue.
3. Reference counting could have unattractive overhead if accessed concurrently, especially if code wants to traverse a particular subset of the *resource topology* inside a parallel region (e.g., to access GPU scratch memory).
4. Since users can construct arbitrary data structures from *resources* in a *resource hierarchy*, the proposal would need another *resource* type analogous to `weak_ptr`, in order to avoid circular dependencies that could prevent releasing *resources*.
5. There is no type currently in the Standard that has reference-counting semantics, but does not have `shared_` in its name (e.g., `shared_ptr` and `shared_future`). Adding a type like this sets a bad precedent for types with hidden costs and correctness issues (see (4)).

#### What does validity of a resource mean?

Here, we elaborate on what it means for a *resource* to be "valid." This proposal lets users encounter a *resource* either while traversing the *resource topology*, or through a *context* that uses the *resource*. "Viewing" the *resource* in the *resource topology* implies a lower level of "commitment" or "permanence" than using the *resource* in a *context*. In particular,

1. It is implementation defined whether any subset of the *resource topology* reflects the current state of the *system*, or just a "snapshot." Ability to iterate a *resource*'s children in the *resource topology* need not imply ability to create a *context* from that *resource*. This may even vary between subsets of the *resource topology*.
2. *Context* creation asserts *resource* validity.
3. Use of a *context* to launch parallel execution asserts *resource* validity.

Here is a concrete example. Suppose that company "Aleph" makes an accelerator that can be viewed as a *resource*, and that has its own child *resources*. Users must call `Aleph_initialize()` in order to see the accelerator and its children as *resources* in the *resource topology*. Users must call `Aleph_finalize()` when they are done using the accelerator.
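
The questions below refer to the following sequence of events, sketched with placeholder names; only `Aleph_initialize` and `Aleph_finalize` come from the example above, while `this_system::get_resources`, `find_aleph_resource`, `inspect`, and `execution_context` are invented for illustration.

```cpp
// Placeholder interface; only Aleph_initialize and Aleph_finalize are taken
// from the example in the text.
void aleph_scenario() {
  Aleph_initialize();                            // accelerator and children become visible

  auto all   = this_system::get_resources();     // hypothetical topology query
  auto accel = find_aleph_resource(all);         // hypothetical lookup helper

  for (auto const& child : accel.resources()) {  // questions 1 and 2 concern traversal
    inspect(child);                              // hypothetical; merely views the child
  }

  execution_context ctx{accel};                  // question 3 concerns context creation

  Aleph_finalize();                              // question 4: ctx still exists here
}
```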

Questions:

1. What should happen if users are traversing the *resource topology*, but never use the accelerator's *resource* (other than to iterate past it), and something else concurrently calls `Aleph_finalize()`?
Contributor

You've raised some interesting questions here.

I think the best way to solve this would be to require that this_system::get_resource be thread-safe and initialise and finalize any third party or OS resources when constructing the topology before returning. This would mean that there's no way to concurrently interfere with how the implementation uses those resources without using them directly through their respective API, though that would have to be undefined behaviour. This also means that the topology would be a snapshot of the resources at that point, and wouldn't actually require any of the underlying resources to be alive until they are actually used, i.e. when passed to the construction of a context.

The disadvantage of this approach is that the initial call to this_system::get_resource could be quite costly, depending on the implementation, though I'd expect that you would only do this once on startup or at specific intervals between computations.

Alternatively, we could go down the route that was suggested at Rapperswil of having a visitor approach to topology discovery, where the user provides some kind of visitor that traverses the topology and returns the resources that it wants to use. With this kind of approach, it might be worth having the execution_resource have active resources. Perhaps this pattern could construct the context directly?

It may also be worth having a thread-safe valid member function on the execution_resource to check if the underlying resource is still available. Though I think this would have to be named differently to represent that it is not simply a getter and that it must do some querying or initialising of resources within the topology to identify if the resource is still valid. However, even with this, you could still have the situation where a resource becomes unavailable between calling valid and constructing an execution context or after the context is constructed.

Perhaps instead we should say that you can always construct a context from an execution_resource, even if it's unavailable, but have the context handle the case where the resource is unavailable, by cancelling enqueued work and disallowing new submissions and throwing an exception or calling a callback function.

Collaborator Author

> This also means that the topology would be a snapshot of the resources at that point, and wouldn't actually require any of the underlying resources to be alive until they are actually used, i.e. when passed to the construction of a context.

I like this. This could even solve the problem of some resources only being available inside a parallel region (like a GPU). Topology is a snapshot of everything reachable from the root.

Collaborator Author

> This would mean that there's no way to concurrently interfere with how the implementation uses those resources without using them directly through their respective API, though that would have to be undefined behaviour.

I'm OK with that being UB in general, possibly with specific APIs giving stronger guarantees.

Collaborator Author

> though I'd expect that you would only do this once on startup or at specific intervals between computations.

That's right -- this is really about constructing a thread pool. Crazy high-frequency dynamic hardware load balancing feels like it wants a different interface. (Users might just see that as a special kind of resource -- it doesn't feel like something that naturally has an easily walkable topology.)

Collaborator Author

> Alternatively, we could go down the route that was suggested at Rapperswil of having a visitor approach to topology discovery, where the user provides some kind of visitor that traverses the topology and returns the resources that it wants to use.

Was that in the notes? I'd like to learn more about that. (Sorry I missed that discussion at Rapperswil.) I'm not sure, without knowing more at least, whether that would solve the problems we've discussed here.

Collaborator Author

> Perhaps instead we should say that you can always construct a context from an execution_resource, even if it's unavailable, but have the context handle the case where the resource is unavailable, by cancelling enqueued work and disallowing new submissions and throwing an exception or calling a callback function.

Why not just attempt to create the context? If the resource is no longer available, the context creation fails. Otherwise, the context assumes the responsibility for keeping the resource alive or otherwise handling the case where the resource ceases to be alive at some point.

Collaborator Author

@AerialMantis I reviewed the Rapperswil notes, and noticed that the affinity paper was marked "DND." It looks like it wasn't covered there. Were you referring to the P0443 discussion in SG1 (and/or joint with LEWG)? Here is the link (requires usual login). For example: "JB: My takeaway is that (P0443) executors are copyable things with reference semantics."

Collaborator Author

Note: Discussion a week or two ago concluded that the affinity paper was actually discussed in SG1 in Rapperswil, but didn't make it into the minutes.

Collaborator Author

(Note to self: I think I've addressed this in my local changes.)

2. What should happen if users are traversing the accelerator's child *resources*, and something else concurrently calls `Aleph_finalize()`?
3. What should happen if users try to create an *execution context* from the accelerator's *resource*, after `Aleph_finalize()` has been called?
4. What should happen to outstanding *execution contexts* that use the accelerator's *resource*, if something calls `Aleph_finalize()` after the *context* was created?

Answers:

1. Nothing bad happens. Users must be able to iterate past an invalidated *resource*. If users are iterating a *resource* R's children and one child becomes invalid, that must not invalidate R or the iterators to its children.
2. Iterating the children after invalidation of the parent must not be undefined behavior, but the child *resources* remain invalid. Attempts to view and iterate the children of the child *resources* may (but need not) fail.
3. *Context* creation asserts *resource* validity. If the *resource* is invalid, *context* creation must fail. (Compare to how MPI functions report an error if they are called after `MPI_Finalize` has been called on that process.)
4. Use of a *context* in an *executor* to launch parallel execution asserts *resource* validity, and must thus fail if the *resource* is no longer valid. (A sketch of both failure modes follows this list.)
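
A short sketch of how answers 3 and 4 might look to users follows; it assumes, purely for illustration, that failure is reported by an exception, since this proposal has not settled on an error-reporting mechanism, and it reuses the hypothetical names from the sketches above.

```cpp
// Illustrative only: exception-based error reporting is an assumption here.
void after_finalize(execution_resource accel) {
  Aleph_finalize();                  // accel no longer points to a valid resource

  try {
    execution_context ctx{accel};    // answer 3: creation asserts validity, so it fails
  } catch (std::exception const&) {
    // recover, e.g., fall back to host-only resources
  }

  // Answer 4: a context created *before* Aleph_finalize() must likewise fail
  // when its executor is next used to launch parallel execution.
}
```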

### Querying the relative affinity of partitions
