Replies: 9 comments
-
Hi @chkothe, thanks so much for the very detailed write-up of your use case! This is a good place to share. My overall takeaway from the statistics and details you provide is that we've gotten things mostly right and our minimalist API surface covers a pleasantly large amount of what you need. However you also point out some gaps that would be helpful to address, so let me try and respond to each and see which ones we need to follow up on:
This part surprised me. I'd expect converting between array types on every function call to be quite expensive, introducing both overhead and possible bugs. Do you have some examples of functions where you do or don't allow this?
TensorFlow and PyTorch (and C99) use `acos`.
There is something here that we still need to follow up on (not enough hours in a day ...): can we popularize/adopt
Both of these are good candidates to add, I'd say. Let's open a separate issue for them.
I'm not sure if this is actually guaranteed - for example the numpy docs just say "calculates 1/x".
I very much doubt this will happen. I have so far not seen any suggestions like this, nor do I think it's a valid reason to remove functionality. Each library is going to provide a superset of what's in the standard. Deprecation and removal of existing functionality requires a good reason, like "confuses users", "there are now superior alternatives", "it's broken", etc.
We went back and forth on that quite a bit. I think we need to see this work in practice for a bit during adoption of the standard in array-consuming libraries. There were certainly arguments for adopting a context manager. One pain point I remember is that semantics were hard to agree on - for example PyTorch has a rule of never allowing implicit device transfers, while TensorFlow does allow that.
That's super useful, thanks. I think there's work in progress to propose adding single-integer-array indexing to the specification, which I think addresses the most common gap you're seeing.
-
Thanks for the detailed feedback! I agree with the overall assessment.
Yes, and to be clearer: for the typical few-liner subroutine it would be too much overhead -- the places where we do that mainly involve iterative solvers that can take tens of seconds (or more) to run on large data, where the conversion overhead is small. Some examples are robust estimators, large-scale least-squares interpolation, machine learning code, or costly sliding-window ops on time series (so the analogy would be the typical high-level scikit-learn or statsmodels method). Another case is workflows where only the few most expensive steps run on the GPU, meaning that a conversion happens anyway, and then one may as well pick the fastest backend.
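To make that concrete, one such boundary looks roughly like this (illustrative sketch only; `lstsq_gd` is a made-up stand-in for such a solver, not our actual code):

```python
import numpy as np
import torch

def lstsq_gd(x_np, y_np, n_iter=1000, lr=1e-3):
    # Convert once at the boundary of an expensive iterative solver,
    # run all iterations on the fast backend, convert back once at the end.
    dev = 'cuda' if torch.cuda.is_available() else 'cpu'
    x = torch.asarray(x_np, device=dev)
    y = torch.asarray(y_np, device=dev)
    w = torch.zeros(x.shape[1], dtype=x.dtype, device=dev)
    for _ in range(n_iter):
        w = w - lr * (x.T @ (x @ w - y))  # gradient step on ||x @ w - y||^2
    return np.asarray(w.cpu())            # hand numpy back to the caller
```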
Yes for sure -- those are quick search-and-replace changes. We're also fine with putting light wrappers around namespaces to fix up small gaps like those where needed, and have been doing that all along (that may be necessary if one wishes to have side-by-side support for backends that have and have not already transitioned to the array API). Overall I agree that concise and uniform naming seems preferable.
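For illustration, such a wrapper can be as thin as the following (sketch, not our production code; only a couple of aliases shown):

```python
import numpy

class PatchedNamespace:
    # Delegate to a backend module while adding aliases it may be missing,
    # so call sites can use one spelling regardless of the backend.
    def __init__(self, xp):
        self._xp = xp
        self.concat = getattr(xp, 'concat', None) or xp.concatenate
        self.acos = getattr(xp, 'acos', None) or xp.arccos

    def __getattr__(self, name):
        # Anything not patched falls through to the real namespace.
        return getattr(self._xp, name)

xp = PatchedNamespace(numpy)  # xp.acos(...) now works on top of numpy
```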
My two cents are that having those added to the API seems appealing for newly written code, at least in places where it makes numpy code easier to write or follow (requires some case-by-case judgment). When porting existing numpy code, I'd ideally be able to do a straightforward search-and-replace.
Re context managers -- yes, those look like a candidate for a later extension (seems like a potential tar pit that could slow down standardization). After all, there's sort of an escape hatch that allows users to retrofit one that controls the `device=` argument of the array creation functions.
Yeah, and I'm not even sure our single-integer-array indexing use cases are all that relevant (oftentimes we could just as well use a 1-element slice). What's more important to us is that implementations continue to preserve the numpy-style view semantics and write-through capabilities they already have (and one may hope that more implementations achieve that where it's low-hanging fruit for them).
-
Hi @chkothe, thanks for your very detailed report! It's valuable, and as a CuPy contributor I'm happy to hear that both CuPy and the array API help your work. Sorry to digress here; I was hoping to discuss this offline but couldn't find your contact info. Could you kindly share why you need
-
Thanks for following up on that! Emailed you.
-
These are indeed tricky to implement, but probably not significantly harder than matrix factorization? Yes, they are in SciPy rather than NumPy, but that's a somewhat artificial distinction. Many (not all) of these can be found in PyTorch, JAX, and TensorFlow. My two cents is that it would be valuable if the standard specified the interface for these functions, even if not every library is going to implement them. It's really not a big deal to need to look up a compatibility table for advanced linear algebra functionality. The other option is to encourage people to write their own shims for specific backends, which is not terrible but goes a little against the spirit of the array standard.
-
Yes, definitely!
-
Hmm, I'm not so sure I agree with that. Unfortunately https://data-apis.org/array-api/latest/extensions/index.html does not explain this yet - we should fix that. I had a look through older discussions, and it wasn't conclusively resolved whether this is okay or not. What was resolved is that an extension is optional. My working assumption was that if the extension is implemented, it is complete. If every single function in an extension were optional, that would be very tricky to use from a user perspective: checking for the presence of each individual function before calling it would get painful quickly.
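With all-or-nothing extensions, the user-side check can stay a single `hasattr` (sketch):

```python
def solve_qr(xp, a):
    # One capability check for the whole linalg extension, after which all
    # of its functions can be assumed to be present.
    if not hasattr(xp, 'linalg'):
        raise TypeError("this backend does not implement the linalg extension")
    q, r = xp.linalg.qr(a)
    return q, r
```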
I had a look at how the existing libraries fare here.
For the four libraries that do have an implementation, the signatures don't match. So it's also a matter of effort whether we'd like to do this.
-
Raised in issue #482
Raised issue #483
-
We're very happy to see progress in getting the array API standardized -- obviously a monumental undertaking when all is considered.
We've been using backend-agnostic numpy-style code for over a year in research and now in production, and are gradually rolling it out across a multi-megabyte codebase (supporting mainly numpy, cupy, pytorch, tensorflow, and jax; also dask, though we're not using that at the moment), so I thought I'd share our user story in case it's helpful. It's a similar use case to what the array API addresses, but it was built with the current numpy-workalike APIs in mind.
Our codebase has an entrypoint equivalent to `get_namespace`, although we call them backends, and a typical use case looks like:
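(A minimal sketch of such a call site; `backend_for` is our helper, the surrounding function is illustrative.)

```python
def center(x):
    # Resolve the namespace ("backend") that owns x, then use only that
    # namespace, so the same code runs on numpy/cupy/torch/tf/jax arrays.
    be = backend_for(x)
    return x - be.mean(x, axis=0, keepdims=True)
```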
For each of the backends we have a (usually thin) compatibility layer that adds any missing functions or fixes up issues with the function signature. In our case, `backend_for` looks at `__array_priority__` to return the namespace for the highest-priority array, although we rarely use it with more than one array (but it results in the above function accepting multiple array types and promoting according to np < dask < {jax, tf, cupy} < torch). Getting a namespace this way is about as fast as doing `.T` on a 1x1 numpy array (thanks to some caching), so we use it on even the smallest subroutine.
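In outline, the dispatch looks like this (simplified sketch; `_TYPE_TO_BACKEND` is a hypothetical registry of wrapped namespaces):

```python
from functools import lru_cache

_TYPE_TO_BACKEND = {}  # populated at import time: {np.ndarray: np_backend, ...}

@lru_cache(maxsize=None)
def _namespace_of(array_type):
    # Paid once per array *type* thanks to the cache, hence the near-zero
    # per-call cost mentioned above.
    return _TYPE_TO_BACKEND[array_type]

def backend_for(*arrays):
    # Pick the namespace of the highest-__array_priority__ argument, which
    # yields the promotion order np < dask < {jax, tf, cupy} < torch.
    best = max(arrays, key=lambda a: getattr(a, '__array_priority__', 0.0))
    return _namespace_of(type(best))
```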
We use a lot of the API surface of these backends, and typically the most compute-intensive subroutines have an option to choose a preferred backend (via a function `backend_get(shorthand)`) out of the subset that supports the necessary ops, or to keep the same backend. We've been extremely impressed with the compatibility of cupy and the performance of torch (notably also its cpu arrays), and we have a few places where we prefer tf or jax because they might have a faster implementation of a critical op or parallelize better (e.g. jax). We find that even over a half-year time span, things evolved rapidly enough that the ideal backend pick for a function may change from one to another (one of the reasons we aim for that much flexibility).
We traced our API usage and found that perhaps 90% of our backend-agnostic call sites would be covered by the array API as it exists now. There are a few places that use different aliases for the same functionality (due to some forms being more popular with current backends than others), the most frequent issues being `absolute(x)` and `concatenate(x)`, other examples being `arccos(x)` (all our backends) vs `acos(x)` (no backend). We also frequently use convenience shorthands like `hstack(x)`, `vstack(x)`, `ravel(x)` or `x.flatten()`, but those could be substituted easily (or provided in a wrapper).
We found a few omissions that would require a bit more code rewriting, among others the functions `minimum(a,b)`, `maximum(a,b)`, and `clip(x, lower, upper)` (presumably that would turn into `where(x>upper, upper, where(x<lower, lower, x))`). Also we frequently use `moveaxis(x,a,b)` and `swapaxes(x,a,b)`, e.g., in linear algebra on stacks of matrices (>30 call sites for us). All of these are supported by the above 6 backends already, and they're pretty trivial, fortunately. Our code uses `real(x)` in some places since some of the implementations might return spurious complex numbers; that may be reason enough to at least partially specify that function already now. Also `reciprocal(x)` occurs frequently in our code, I guess from a suspicion that writing `1/x` may fail to use the reciprocal instruction if it's available.
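For instance, portable shims for these could be built from `where` alone (sketch; assumes the backend broadcasts the scalar bounds):

```python
def minimum(xp, a, b):
    return xp.where(a < b, a, b)

def maximum(xp, a, b):
    return xp.where(a > b, a, b)

def clip(xp, x, lower, upper):
    # Composed purely from ops the standard already specifies.
    return xp.where(x > upper, upper, xp.where(x < lower, lower, x))
```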
A few things that we use have no substitute at this point, unfortunately, namely `einsum()`, `lstsq(x,y)`, `eig(x)`, and `sqrtm()` (though the latter could be implemented via `eigh`). We hope that these eventually find their way (back) into the API. I realize that `lstsq` was removed as per a previous discussion (and it's understandable given that the API is a bit crufty), but our code base has 26 unique call sites of that alone, since we're dealing mostly with engineering and stats. One might reasonably assume that backends that already have optimized implementations of it (all 6 do, and torch/tf support batched matrices) will provide it anyway in their array API namespace. However, we do worry that, given that it has been deliberately removed, we can't be sure that some of the existing backends won't be encouraged to minimize their maintenance surface and drop functions like that from their formerly numpy-compatible (soon array API compatible) namespace, forcing users like us to deal with their raw EagerTensor, xla_extension.DeviceArray, or whatever it may be called, and go find the functionality in whichever ancestral tensorflow namespace it may have been buried in before. We're wondering if a tradeoff could be made where, e.g., some of the rarely-used outputs could be marked as a "reserved" placeholder and allowed to hold unspecified values (e.g., None) until perhaps some future date when the API specifies them. There's also the option to go the same route as with `svd`, where some arguments were removed in the interest of simplicity. On the plus side, it's good to see `diag` retired in favor of `diagonal` (especially so in the age of batched matrices).
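Concretely, the `eigh` route for a symmetric positive semi-definite input would be roughly (sketch; `xp` stands for a numpy-style namespace):

```python
def sqrtm_psd(a, xp):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition:
    # a = V diag(w) V^T  implies  sqrt(a) = V diag(sqrt(w)) V^T.
    w, v = xp.linalg.eigh(a)
    w = xp.sqrt(xp.maximum(w, 0))  # clamp rounding-induced negative eigenvalues
    return (v * w[..., None, :]) @ xp.swapaxes(v, -1, -2)
```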
Other than that, for multi-GPU we use a backend-provided context manager where available (torch, tf, cupy), a custom context manager where it's not (jax), and a no-op context manager for numpy & dask (usage looks like `with be.select_device(id):`). That's because passing `device=` through all compute functions down into the individual array creation calls (from `arange` to `eye`) just isn't all that practical with a large and deeply nested codebase, and it's easy to overlook call sites, causing hidden performance bugs that only turn up on multi-accelerator runs -- however, since the user can write their own context manager (with a stack in thread-local storage and wrappers around the array creation functions), that can be worked around with some effort.
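The workaround we mean looks roughly like this (sketch; `select_device` and the wrapped `zeros` are illustrative):

```python
import threading

_tls = threading.local()

class select_device:
    # Thread-local device stack; creation wrappers consult the top of the
    # stack so device= doesn't have to be threaded through every call.
    def __init__(self, device):
        self.device = device
    def __enter__(self):
        _tls.stack = getattr(_tls, 'stack', []) + [self.device]
    def __exit__(self, *exc):
        _tls.stack.pop()

def current_device():
    stack = getattr(_tls, 'stack', [])
    return stack[-1] if stack else None

def zeros(xp, shape, **kw):
    # Wrapped creation function: inject the ambient device if one is set.
    dev = current_device()
    if dev is not None and 'device' not in kw:
        kw['device'] = dev
    return xp.zeros(shape, **kw)
```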
Lastly, our indexed array accesses in our main codebase (the parts that we hope to eventually port) look like the following:
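(Reconstructed for illustration; `NamedTensor` and its methods are stand-ins for our actual wrapper API.)

```python
import numpy as np

class NamedTensor:
    # Minimal stand-in for our wrapper: name-addressed axes over a raw array.
    def __init__(self, data, axes):
        self.data, self.axes = data, tuple(axes)
    def moved(self, name, to=-1):
        # Return a transposed *view* with the named axis in position `to`.
        return np.moveaxis(self.data, self.axes.index(name), to)

X = NamedTensor(np.zeros((4, 8, 100, 16)),
                axes=('space', 'frequency', 'time', 'instance'))
X.moved('time')[..., :10] = 1.0   # write through the view into X.data
```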
We use a high-level array wrapper (similar in spirit to xarray) that supports arbitrarily strided views and allows writes into those views, which results in low-level calls (in the guts of the array class) equivalent to the form:
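Sketched concretely (numpy here; the same works on any backend that keeps numpy's view semantics):

```python
import numpy as np

data = np.zeros((4, 8, 100, 16))  # e.g. (space, frequency, time, instance)
v = np.moveaxis(data, 2, -1)      # transposed view, no copy
v[..., :10] = 1.0                 # write-through: mutates `data` itself
assert data[:, :, :10, :].all()   # the original buffer reflects the write
```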
... that's because we spend much of our time dealing with multi-way tensors (e.g., neural data) that have axes such as space, time, frequency, instance, statistic, or feature (often 3-5 at a time), and most subroutines are agnostic to the presence or order of most axes except for one or two, so they create views that move those few to specific places and then read/write through the transposed view. Our way of dealing with backends that don't support that is to not enable them for those functions (and to keep feature flags on the backends for reverse indexing, slice assignment, and mesh indexing support, to catch the cases where we do rely on them).
I wasn't sure if this is the right place to report relevant API usage "in the field", hopefully it is.