diff --git a/affinity/cpp-20/d0796r2.md b/affinity/cpp-20/d0796r2.md
index 8b5c9b2..4e7de29 100644
--- a/affinity/cpp-20/d0796r2.md
+++ b/affinity/cpp-20/d0796r2.md
@@ -152,17 +152,59 @@ From a historic perspective, programming models for traditional high-performance
 Some of these programming models also address *fault tolerance*. In particular, PVM has native support for this, providing a mechanism [[27]][pvm-callback] which can notify a program when a resource is added or removed from a system. MPI lacks a native *fault tolerance* mechanism, but there have been efforts to implement fault tolerance on top of MPI [[28]][mpi-post-failure-recovery] or by extensions[[29]][mpi-fault-tolerance].
 
-Due to the complexity involved in standardizing *dynamic resource discovery* and *fault tolerance*, these are currently out of the scope of this paper.
+Due to the complexity involved in standardizing *dynamic resource discovery* and *fault tolerance*, these are currently out of the scope of this paper. However, we leave open the possibility of accommodating both in the future, by not overconstraining the lifetimes of *resources* (see the next section).
 
-### Lifetime considerations
+### Resource lifetime
 
-As the execution context would provide a partitioning interface which returns objects describing the components of the system topology of an execution resource, it is important to consider the lifetime of these objects.
+The initial solution may only target systems with a single addressable memory region, and may thus exclude devices like discrete GPUs. However, in order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future. In particular, in order to support heterogeneous systems, the abstraction must let the interface query the *resource topology* of the *system* in order to perform device discovery.
 
-The objects returned from the partitioning interface would be opaque, implementation-defined objects that do not perform any scheduling or execution functionality which would be expected from an *execution context* and would not store any state related to an execution. Instead they would act simply as an identifier to a particular partition of the *resource topology*.
+The *resource* objects returned from the partitioning interface are opaque, implementation-defined objects. They would not perform any scheduling or execution functionality that would be expected of an *execution context*, and they would not store any state related to an execution. Instead, each would simply act as an identifier for a particular partition of the *resource topology*. This means that the lifetime of a *resource* retrieved from an *execution context* must not be tied to the lifetime of that *execution context*.
 
-For these reasons, *resources* must always outlive any *execution context* which is constructed from them, and any *resource* retrieved from an *execution context* must not be tied to the lifetime of that *execution context*.
+The lifetime of a *resource* instance refers to both validity and uniqueness. First, if a *resource* instance exists, does it point to a valid underlying hardware or software resource? That is, could an instance's validity ever change at run time? Second, could a *resource* instance ever point to a different (but still valid) underlying resource? It suffices for now to define "point to a valid underlying resource" informally. We will elaborate this idea later in this proposal.
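+
+To make the intended usage model concrete, the following non-normative sketch shows how a *resource* might be discovered by traversing the *resource topology* and then used to construct an *execution context*. The names `execution_resource`, `execution_context`, `this_system::resources()`, and the member functions used below are illustrative placeholders only, not wording proposed by this section.
+
+```cpp
+// Non-normative sketch; all names are illustrative placeholders.
+#include <vector>
+
+void topology_example() {
+  // Query the *resource topology* of the system. Each element is an
+  // opaque identifier for a partition of the topology; holding one does
+  // not schedule or execute anything, and carries no execution state.
+  std::vector<execution_resource> resources = this_system::resources();
+
+  // Viewing a resource, or iterating its child resources, expresses no
+  // commitment to use it.
+  for (execution_resource& res : resources) {
+    for (execution_resource& child : res.resources()) {
+      (void)child;
+    }
+  }
+
+  // Constructing a *context* from a resource expresses intent to use it,
+  // and asserts that the resource is valid at this point.
+  execution_context ctx{resources.front()};
+
+  // An executor obtained from the context can launch parallel execution
+  // on the underlying hardware; doing so again asserts validity.
+  auto exec = ctx.executor();
+  (void)exec;
+}
+```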
 
-The initial solution should target systems with a single addressable memory region. It should thus exclude devices like discrete GPUs. In order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future. In particular, in order to support heterogeneous systems, the abstraction must let the interface query the *resource topology* of the *system* in order to perform device discovery.
+Creation of a *context* expresses intent to use the *resource*, not just to view it as part of the *resource topology*. Thus, if a *resource* could ever cease to point to a valid underlying resource, then users must not be allowed to create a *context* from that resource instance, or to launch parallel executions with that context. *Context* construction, and use of an *executor* with that *context* to launch a parallel execution, both assert validity of the *context*'s *resource*.
+
+If a *resource* is valid, then it must always point to the same underlying thing. For example, a *resource* cannot first point to one CPU core, and then suddenly point to a different CPU core. *Contexts* can thus rely on properties like binding of operating system threads to CPU cores. However, the "thing" to which a *resource* points may be a dynamic, possibly software-managed pool of hardware. For instance, the operating system may maintain a pool of a varying number of CPU cores, or a device driver may switch between a "discrete" GPU and an "integrated" GPU, depending on utilization and power constraints.
+
+In summary, a *resource* either identifies a thing uniquely, or harmlessly points to nothing. The subsections that follow justify and explain this.
+
+#### Permit dynamic resource lifetime
+
+We should not assume that *resource* instances have the same lifetime as the running application. For example, some hardware accelerators, such as GPUs, require calling an initialization function before a running application may use the accelerator, and calling a finalization function after using it. The software interface for the accelerator may not even be available at application launch time; for instance, the interface may live in a dynamic library that users load at run time. In the case of a pool of CPU cores managed by the operating system, the application might have to request access to the pool at run time, and the operating system may have to do some work in order to reserve CPU cores and set them up for use in the pool. Applications that do not use the pool should not have to pay this setup cost. The more general cases of dynamic resource discovery and fault tolerance discussed above also call for dynamic *resource* lifetimes.
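+
+As a non-normative illustration of such a dynamic lifetime, the sketch below reuses the placeholder names from the earlier sketch, together with the hypothetical vendor functions `Aleph_initialize()` and `Aleph_finalize()` from the "Aleph" example later in this section; none of these names are proposed wording.
+
+```cpp
+// Non-normative sketch; all names are illustrative placeholders.
+#include <cstddef>
+#include <vector>
+
+void dynamic_lifetime_example() {
+  // Before the accelerator's software stack is initialized, its
+  // resources need not appear in the resource topology at all.
+  std::size_t before = this_system::resources().size();
+
+  // Initialization may load a driver, reserve hardware, and so on.
+  // Only applications that opt in pay this setup cost.
+  Aleph_initialize();
+
+  // The accelerator and its child resources may now be visible.
+  std::size_t after = this_system::resources().size();
+  (void)before;
+  (void)after;
+
+  // After finalization, previously obtained resource instances that
+  // identify the accelerator no longer point to a valid underlying
+  // resource, although merely holding them or iterating past them in
+  // the topology must remain harmless.
+  Aleph_finalize();
+}
+```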
+
+#### Resources should not be reference counted
+
+We considered mandating that *execution resources* use reference counting, just like `shared_ptr`. This would clearly define resources' lifetimes. However, there are several arguments against requiring reference counting.
+
+ 1. Holding a reference to the *execution resource* would prevent parallel execution from shutting down, thus (potentially) deadlocking the program.
+ 2. Not all kinds of *resources* may have lifetimes that fit reference-counting semantics. Some kinds of GPU *resources* only exist during parallel execution, for example; those *resources* cannot remain valid if they escape the parallel region. In general, programming models that let a "host" processor launch code on a "different processor" have this issue.
+ 3. Reference counting could have unattractive overhead if accessed concurrently, especially if code wants to traverse a particular subset of the *resource topology* inside a parallel region (e.g., to access GPU scratch memory).
+ 4. Since users can construct arbitrary data structures from *resources* in a *resource hierarchy*, the proposal would need another *resource* type analogous to `weak_ptr`, in order to avoid circular dependencies that could prevent releasing *resources*.
+ 5. No type currently in the Standard has reference-counting semantics without also having `shared_` in its name (e.g., `shared_ptr` and `shared_future`). Adding such a type would set a bad precedent for types with hidden costs and correctness issues (see (4)).
+
+#### What does validity of a resource mean?
+
+Here, we elaborate on what it means for a *resource* to be "valid." This proposal lets users encounter a *resource* either while traversing the *resource topology*, or through a *context* that uses the *resource*. "Viewing" the *resource* in the *resource topology* implies a lower level of "commitment" or "permanence" than using the *resource* in a *context*. In particular,
+
+ 1. It is implementation-defined whether any subset of the *resource topology* reflects the current state of the *system*, or just a "snapshot." The ability to iterate a *resource*'s children in the *resource topology* need not imply the ability to create a *context* from that *resource*. This may even vary between subsets of the *resource topology*.
+ 2. *Context* creation asserts *resource* validity.
+ 3. Use of a *context* to launch parallel execution asserts *resource* validity.
+
+Here is a concrete example. Suppose that company "Aleph" makes an accelerator that can be viewed as a *resource*, and that has its own child *resources*. Users must call `Aleph_initialize()` in order to see the accelerator and its children as *resources* in the *resource topology*. Users must call `Aleph_finalize()` when they are done using the accelerator.
+
+Questions:
+
+ 1. What should happen if users are traversing the *resource topology*, but never use the accelerator's *resource* (other than to iterate past it), and something else concurrently calls `Aleph_finalize()`?
+ 2. What should happen if users are traversing the accelerator's child *resources*, and something else concurrently calls `Aleph_finalize()`?
+ 3. What should happen if users try to create an *execution context* from the accelerator's *resource*, after `Aleph_finalize()` has been called?
+ 4. What should happen to outstanding *execution contexts* that use the accelerator's *resource*, if something calls `Aleph_finalize()` after the *context* was created?
+
+Answers (illustrated by the sketch after this list):
+
+ 1. Nothing bad is allowed to happen: users must be able to iterate past an invalidated *resource*. If users are iterating a *resource* R's children and one child becomes invalid, that must not invalidate R or the iterators to its children.
+ 2. Iterating the children after invalidation of the parent must not be undefined behavior, but the child *resources* remain invalid. Attempts to view and iterate the children of those child *resources* may (but need not) fail.
+ 3. *Context* creation asserts *resource* validity. If the *resource* is invalid, *context* creation must fail. (Compare to how MPI functions report an error if they are called after `MPI_Finalize` has been called on that process.)
+ 4. Use of a *context* in an *executor* to launch parallel execution asserts *resource* validity, and must thus fail if the *resource* is no longer valid.
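+
+The following non-normative sketch restates answers (1)-(4) in code, again using the placeholder names introduced in the earlier sketches; the use of an exception for the failure in (3) is purely illustrative.
+
+```cpp
+// Non-normative sketch; all names are illustrative placeholders.
+#include <exception>
+#include <iostream>
+
+void validity_example(execution_resource accel) {
+  Aleph_finalize();
+
+  // (1) and (2): iterating past the now-invalid accelerator resource,
+  // or over its children, must remain well defined; the children simply
+  // stay invalid and cannot be committed to.
+  for (execution_resource& child : accel.resources()) {
+    (void)child;
+  }
+
+  // (3): context creation asserts validity, so it must fail here.
+  try {
+    execution_context ctx{accel};
+  } catch (const std::exception& e) {
+    std::cout << "context creation failed: " << e.what() << "\n";
+  }
+
+  // (4): launching parallel execution through an executor of a context
+  // created before Aleph_finalize() must likewise fail.
+}
+```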
 
 ### Querying the relative affinity of partitions