repr(scalable)
#3838
Conversation
text/3838-repr-scalable.md
Similarly, a `scalable` repr is introduced to define a scalable vector type.
`scalable` accepts an integer to determine the minimum number of elements the
vector contains. For example:
Wouldn't it be that the amount of elements in the vector needs to be a whole multiple of the scalable value? Otherwise `scalable(4)` would allow 5 elements, which would need to be represented as `<vscale x 1 x f32>` in LLVM rather than `<vscale x 4 x f32>`.
I asked the same question in the previous RFC, unfortunately that never got resolved and the new RFC perpetuates the same confusion. Also see this subthread. @davidtwco It would be good to ensure that all the valuable feedback the previous RFC got is not just deleted and forgotten for the 2nd revision.
For the question at hand:
Yeah, this description of the argument as "minimum" is extremely misleading. This field apparently must be set to the "hardware scaling unit" (I don't know the proper term for this, but it's 128bit for ARM) divided by the size of the field. Everything else would at best be a giant waste, if it even works. For instance, IIUC, putting `scalable(2)` here means only half the register will ever get used (no matter the register size). I wonder why we even let the code pick that field at all; it seems to me the compiler should just compute it. Is there ever a situation where a 4-byte element type should not use `scalable(4)` on ARM?
I don't know that I follow - the `N` in `repr(scalable(N))` is the `N` in `<vscale x N x f32>`. There's a correct value for that `N` for any given valid type that the user of `repr(scalable)` needs to write. We'd need to get that correct when defining these types in `stdarch`, but because this is an internal-facing thing, that's fine, we can make sure they're correct.

I can only really speak to SVE, but as I understand it, `N` is basically "how many of this type can you fit in the minimum vector length?". For an `f32` on SVE, that's four, because the minimum vector length is 128 bits and an `f32` is 32 bits. `vscale` is the hardware-dependent part that we don't pick; on some processors, it'll be one and you just have the 4x `f32`, on some it'll be two and you have 8x `f32`, etc.
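As a sketch of that correspondence, assuming the RFC's proposed syntax and SVE's 128-bit minimum register length (illustrative declarations, not the final `stdarch` definitions):

```rust
// N = minimum register size / element size, so the type lowers to <vscale x N x ...>.
#[repr(simd, scalable(4))]
pub struct svfloat32_t { _ty: [f32] }  // <vscale x 4 x f32>: 128 / 32 = 4

#[repr(simd, scalable(16))]
pub struct svint8_t { _ty: [i8] }      // <vscale x 16 x i8>: 128 / 8 = 16
```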
Which part of my comment is confusing? I even asked a concrete question:
Is there ever a situation where a 4-byte element type should not use scalable(4) on ARM?
It seems the answer is "no", which then means there is apparently no motivation for having this degree of freedom in the first place?
I think you confirmed my theory from above:
- Calling this the minimum is very confusing. If I set the minimum to 2 I would expect that an array of size 8 still uses the most efficient representation for that size, but that's not what happens here.
- There is only one desired value for this "minimum size", and it can be computed by the compiler.
I don't understand why we should be exposing this very confusing parameter, even if it's just exposed to libcore. That still needs good documentation and testing for all possible values etc. The RFC surely does not do a good job documenting it. (And no, relating it to some obscure LLVM concept doesn't count. This needs a self-contained explanation.)
Which part of my comment is confusing?
GitHub only showed me bjorn's comment when I replied, just seeing yours now, apologies.
I asked the same question (#3268 (comment)) in the previous RFC, unfortunately that never got resolved and the new RFC perpetuates the same confusion. Also see this subthread (#3268 (comment)). @davidtwco It would be good to ensure that all the valuable feedback the previous RFC got is not just deleted and forgotten for the 2nd revision.
Apologies, I tried to capture anything that seemed like it was still relevant but may have missed some threads.
Yeah, this description of the argument as "minimum" is extremely misleading. This field apparently must be set to the "hardware scaling unit" (I don't know the proper term for this, but it's 128bit for ARM) divided by the size of the field. Everything else would at best be a giant waste, if it even works. For instance, IIUC, putting scalable(2) here means only half the register will ever get used (no matter the register size). I wonder why we even let the code pick that field at all; it seems to me the compiler should just compute it. Is there ever a situation where a 4-byte element type should not use scalable(4) on ARM?
It seems the answer is "no", which then means there is apparently no motivation for having this degree of freedom in the first place?
I think you confirmed my theory from above:
Calling this the minimum is very confusing. If I set the minimum to 2 I would expect that an array of size 8 still uses the most efficient representation for that size, but that's not what happens here.
There is only one desired value for this "minimum size", and it can be computed by the compiler.
I don't understand why we should be exposing this very confusing parameter, even if it's just exposed to libcore.
I phrased it as "minimum" considering the number of elements that might be present when one includes `vscale` in that calculation. e.g. for a `svint32_t`, the minimum number of elements is four (when you have a 128-bit register and `vscale=1`), but it could be greater than four: it could be eight (when you have a 256-bit register and `vscale=2`) or it could be sixteen (when you have a 512-bit register and `vscale=4`), etc. This wording is clearly confusing though, so I'll try to make it clearer.
In the "Reference-level explanation", I do mention why the compiler doesn't compute N
(though that should probably be in the "Rationale and alternatives" section):
repr(scalable)
expects the number ofelements
to be provided rather than calculating it. This avoids needing to teach the compiler how to calculate the requiredelement
count, particularly as some of these scalable types can have different element counts. For instance, the predicates used in SVE have different element counts depending on the types they are a predicate for.
I'll expand this in the RFC also, but to elaborate: For SVE, many of the intrinsics take a predicate vector alongside the data vectors, and the predicate vector decides which lanes are on or off for the operation (i.e. which elements in the vector are operated on: all of them, none of them, even-numbered indices, whatever you want). Predicate vectors are in different and smaller registers than the data; these have a bit for every byte in the vector register (for a minimum size of 16 bits). e.g. a `<vscale x 16 x i8>` vector has a `<vscale x 16 x i1>` predicate.
For the non-predicate vectors, you're right that we're basically only going to want to define them with the minimum element count that uses the whole minimum register size. However, for predicate vectors, we want to define types where the `N` matches the number of elements in the non-predicate vector, i.e. a `<vscale x 4 x i1>` to match a `<vscale x 4 x f32>`, `<vscale x 8 x i1>` to match `<vscale x 8 x u16>`, or `<vscale x 16 x i1>` to match `<vscale x 16 x u8>`. Some of the sign-extension intrinsics use types with non-128-bit multiples inside their definitions too, though I'd need to refresh my memory on the specifics for those.

That makes it trickier than just `$min_register_size / $type_size`, which would always be `<vscale x 16 x i1>`. We could do something more complicated that gives the compiler all the information it needs to be able to calculate this in every circumstance, but that's a bunch of extra complexity for types that are going to be defined once in the standard library and then well-tested.
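To illustrate the predicate case, here is a hypothetical pair of declarations (assuming the RFC's syntax and a `bool` element type for predicates; the real definitions may differ):

```rust
// The predicate's N matches the data vector's N, not 128 / element size.
#[repr(simd, scalable(4))]
pub struct svbool4_t { _ty: [bool] }   // pairs with a <vscale x 4 x f32> data vector

#[repr(simd, scalable(16))]
pub struct svbool_t { _ty: [bool] }    // pairs with a <vscale x 16 x i8> data vector
```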
That still needs good documentation and testing for all possible values etc. The RFC surely does not do a good job documenting it. (And no, relating it to some obscure LLVM concept doesn't count. This needs a self-contained explanation.)
I don't think it needs testing for all possible values. Just those that we end up defining to use with the intrinsics, and those will all get plenty of testing.
It's a bit of a balancing act how much detail to provide about how the scalable vectors work or what rustc's contract with the codegen backend is, and how much to omit as too much detail. Especially when somewhat familiar with these types/extensions, it's easy to miss where we've gone too far one way or the other, so I appreciate the feedback.
Added a bunch more wording along the lines of what I've written in my previous comment in 5f5f7d2, let me know if that clears things up.
GitHub only showed me bjorn's comment when I replied, just seeing yours now, apologies.
Ah, that's fair. :)
Thanks for clarifying. I guess with predicates the point is that they do indeed not use the entire width of the corresponding register but that's fine -- we only want one bit per element, the hardware has one bit per byte, and deals with the mismatch?
Added a bunch more wording along the lines of what I've written in my previous comment in 5f5f7d2, let me know if that clears things up.
You still introduce this number as the "minimum", which is still confusing. It's not just the minimum, it is the "quantum" or whatever you want to call it -- the vector will always have a length that is an integer multiple of this number. (And moreover, it'll be the same integer multiple for all types.) It is formally correct that this number is also the minimum size, but by leading with that you send the reader down the entirely wrong track.
I don't think it needs testing for all possible values. Just those that we end up defining to use with the intrinsics, and those will all get plenty of testing.
I can almost guarantee that some library contributor will one day pick a wrong value for this and they'll have a really bad time if the result is some odd garbage behavior. Even internal interfaces need to be designed with care and tested in an "open source hive mind" project like Rust. The only fundamental difference to external interfaces is that internal interfaces are perma-unstable so we can more easily adjust them if mistakes have been made. But in terms of documentation, they absolutely need as much care as stable features, and they should get proper testing too.
fn sve_add(in_a: Vec<f32>, in_b: Vec<f32>, out_c: &mut Vec<f32>) {
    let len = in_a.len();
I think you are assuming here that `len` is a multiple of `step`? The scalar version would not typically make such an assumption, so this is worth calling out.

EDIT: Or does the `svwhilelt_b32` part handle that? The comments are not clear enough for me to tell. The mask being "based on" index and len doesn't say how it is based on them...
// `svwhilelt_b32` generates a mask based on comparing the current
// index against the `len`
- // `svwhilelt_b32` generates a mask based on comparing the current
- // index against the `len`
+ // `svwhilelt_b32` deals with the tail of the iteration: it generates a mask that
+ // is enabled for the first `len` elements overall, but disables the last `len % step`
+ // elements in the last iteration.
Is this correct? I am guessing here.
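A scalar model of the predicate (a sketch, not the real intrinsic) captures the behaviour in question: lane `j` of the mask is active exactly when `i + j < len`, so the final iteration masks off the tail.

```rust
// Scalar model of `svwhilelt_b32(i, len)` for a register with `lanes` 32-bit lanes.
fn whilelt_model(i: usize, len: usize, lanes: usize) -> Vec<bool> {
    (0..lanes).map(|j| i + j < len).collect()
}

fn main() {
    // With 128-bit registers (4 x f32 lanes) and len = 10, the last iteration
    // starts at i = 8 and only the first two lanes stay active:
    assert_eq!(whilelt_model(8, 10, 4), vec![true, true, false, false]);
}
```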
let pred = svwhilelt_b32(i as _, len as _);

// `svld1_f32` loads a vector register with the data from address
// `a`, zeroing any elements in the vector that are masked out
And crucially, those "zeroed" elements never even get loaded, so it is okay if they are out-of-bounds?
I do not think this is the correct approach nor implementation. It is invested in reusing an existing framework in `repr(simd)` which has many undesirable features.
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

Types annotated with the `#[repr(simd)]` attribute contains either an array
field or multiple fields to indicate the intended size of the SIMD vector that
the type represents.

Similarly, a `scalable(N)` representation is introduced to define a scalable
vector type. `scalable(N)` accepts an integer to determine the minimum number of
elements the vector contains. For example:
We use `repr(C)` because otherwise some hypothetical CStruct repr type would have to be parameterized by every single arrangement of types, in an ordered fashion. But `repr(simd)` does not pair with either `repr(packed)` or `repr(align(N))` coherently, and neither would `repr(scalable)`.

I do not think this should be handled by a new `repr`... modeling it on a still-unstable `repr(simd)`... which, as we have discovered over time with existing reprs, has a million footguns for how they work and get compiled. A lang item seems more to-the-point. I intend to make this true for `repr(simd)` as well anyways.
```rust
#[repr(simd, scalable(4))]
pub struct svfloat32_t { _ty: [f32], }
```
One of the aspects of using these random `repr` attributes is that it is misleading. This looks like it is unsized to begin with, but these are unconst `Sized`.
Therefore, there is an additional restriction that these types cannot be used in
the argument or return types of functions unless those functions are annotated
with the relevant target feature.
Would this not mean that these types cannot be used in function pointers, as we have no knowledge of the target features of a function pointer?
For the same reason it would also clash with dyn-compatible traits.
I think this is similar to the existing checks we have for `__m256` only being allowed in `extern "C"` functions when AVX is available. They can be used in function ptrs just fine, because someone somewhere must have made the unsafe promise that it is okay to call a `#[target_feature]` function since we actually do have the target feature.
As `repr(scalable(N))` is intended to be a permanently unstable attribute, any
value of `N` is accepted by the attribute and it is the responsibility of
whomever is defining the type to provide a valid value. A correct value for `N`
depends on the purpose of the specific scalable vector type and the
architecture.
This kind of abdication of responsibility in practice, when the rules are simply not that hard to work out, just makes the compiler harder to maintain in the future, as it amounts to saying "We don't have to write down any reasoning about validity, not even in documentation, and we will allow someone in the future to discover why our decisions were fragile and sometimes incorrect without protecting them against more easily-induced miscompilations". I realize rust-lang has accepted this kind of reasoning in the past, (edited by moderator) future maintainers once you can no longer ask the previous ones due to their inevitable burnout.
I'd prefer if you engaged on this RFC without describing decisions you disagree with as "sucks ass" or "pissing in the eyes of future maintainers". I'm open to discussing the details of this proposal and changing them, as with any proposal, but I'm much more open to that when concerns are phrased like this concern was in Ralf's comment rather than how you've chosen to phrase yours.
When a scalable vector is instantiated into a generic function during
monomorphisation, or a trait method is being implemented for a scalable vector,
then the relevant target feature will be added to the function.
Adding target features to functions based on the types being mentioned in them was also proposed for a different reason in #3525 but that ran into all sorts of complications (I'm not sure off-hand which of these are relevant here).
Unlike that other RFC, what's proposed here isn't even visible in the function signature and not tied to some "if a value of this type exists, the target feature must be enabled, so it's safe" reasoning, which causes new problems. So far the `unsafe` enforcement and manual safety reasoning for target features assumes it's clear which target features a function has, long before monomorphization. For example, consider this program:
// crate A
fn foo<T: Sized>() -> usize {
    size_of::<T>()
}

// crate B
use crate_a::foo;
fn main() {
    println!("vector size: {}", foo::<svfloat32_t>());
}
If the instantiation of `foo` gets the target feature, it needs to be unsafe to call. How can the compiler ensure this? And how is it surfaced to the programmer so they can write the correct `#[cfg]` or `is_aarch64_feature_detected!(...)` condition?
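For comparison, the pattern today's safety reasoning assumes looks like this, with the feature requirement visible on the callee itself (a sketch for an AArch64 target; the `sve` target feature may still require nightly):

```rust
// The target feature is declared on the function, so callers know what to
// detect before making the unsafe call.
#[target_feature(enable = "sve")]
unsafe fn uses_sve() {
    // ... SVE intrinsics ...
}

fn caller() {
    if std::arch::is_aarch64_feature_detected!("sve") {
        // SAFETY: `sve` was detected at runtime.
        unsafe { uses_sve() };
    }
}
```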
Adding target features depending on the concrete instantiation is also fundamentally incompatible with how target features currently work in rustc. Of course the architecture can be changed, but it's a quite non-trivial change, and it has non-trivial ramifications: the MIR inliner crucially relies (for soundness) on knowing the target features of a function without knowing the concrete instance of the generics (inlining a function with more features into a function with fewer features is unsound), so functions that are generic in their target feature set risk serious perf regressions due to the MIR inliner becoming less effective. Making every generic function potentially have target-feature thus seems like a non-starter both in terms of soundness (that's @hanna-kruppe's issue) and because it completely kills the MIR inliner for generic code (which is the code where we most care about the MIR inliner).
When a scalable vector is instantiated into a generic function during monomorphisation

It's also somewhat unclear what this means -- does instantiating `T` with `fn() -> svfloat32_t` also add the target feature? It pretty much has to, for soundness.
But then what about something like this:
trait Tr { type Assoc; fn make() -> Self::Assoc; }
fn foo<T: Tr>() {
    size_of_val(&T::make());
}
Now if I call `foo` with a type `T` such that `T::Assoc` is `svfloat32_t`, I also need the target feature:
impl Tr for i32 { type Assoc = svfloat32_t; ...}
foo::<i32>(); // oopsie, this must be unsafe
The target feature is needed any time `svfloat32_t` is in any way reachable via the given trait instance. This is not something the type system can even express, which also means I don't think it can possibly be sound.
For non-predicate scalable vectors, it will be typical that `N` will be
`$minimum_register_length / $type_size` (e.g. `4` for `f32` or `8` for `f16`
with a minimum 128-bit register length). In this circumstance, `N` could be
trivially calculated by the compiler.
Note that even this part doesn't really work for RVV, which has the added dimension of LMUL: there's not a single `$minimum_register_length` but several useful sizes of N for every element type. See the table at https://llvm.org/docs//RISCV/RISCVVectorExtension.html#mapping-to-llvm-ir-types
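To illustrate, under LLVM's RVV type mapping (where `vscale` is `VLEN/64`), a single element type already has several useful values of `N` depending on LMUL. Hypothetical declarations, assuming the RFC's syntax:

```rust
// Same element type, different N per register group size (LMUL).
#[repr(simd, scalable(2))]
pub struct vfloat32m1_t { _ty: [f32] } // LMUL=1: <vscale x 2 x f32>
#[repr(simd, scalable(4))]
pub struct vfloat32m2_t { _ty: [f32] } // LMUL=2: <vscale x 4 x f32>
#[repr(simd, scalable(8))]
pub struct vfloat32m4_t { _ty: [f32] } // LMUL=4: <vscale x 8 x f32>
```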
Therefore, there is an additional restriction that these types cannot be used in
the argument or return types of functions unless those functions are annotated
with the relevant target feature.
What does this mean for generic functions? I could write a `fn foo<T>(x: Box<T>)` that internally passes a `T` to some function by-value. Exactly during which stage of the compiler do we stop compilation if `T` ends up being a scalable vector type?

It seems to me that the only practical answer is "during monomorphization". Post-mono errors are rather undesirable though (and this would be easily observable by users on stable) so this would definitely need t-lang signoff. Please explicitly note this as a drawback.
Oh, I missed the sentence about automatically adding target features to generic functions... for reasons discussed here, that is unfortunately not practical, so I discarded it as an option in my comment above.
`repr(scalable)` as described later in
[*Reference-level explanation*][reference-level-explanation] is perma-unstable
and exists only enables scalable vector types to be defined in the standard
- and exists only enables scalable vector types to be defined in the standard
+ and exists only to enable scalable vector types to be defined in the standard
```rust
fn sve_add(in_a: Vec<f32>, in_b: Vec<f32>, out_c: &mut Vec<f32>) {
    let len = in_a.len();
```
- let len = in_a.len();
+ assert_eq!(in_a.len(), in_b.len());
+ assert_eq!(in_a.len(), out_c.len());
+ let len = in_a.len();
[reference-level-explanation]: #reference-level-explanation

Types annotated with the `#[repr(simd)]` attribute contains either an array
field or multiple fields to indicate the intended size of the SIMD vector that
as of rust-lang/rust#129403, `#[repr(simd)]` only supports an array field.
Without support for scalable vectors in the language and compiler, it is not
possible to leverage hardware with scalable vectors from Rust. As extensions
with scalable vectors are available in architectures as either the only or
recommended way to do SIMD, lack of support in Rust would severely limit Rust's
assuming you're referring to RISC-V V by "only", you can use fixed-length vectors (e.g. `std::simd::Simd<f32, 16>`) just fine, LLVM just uses the minimum guaranteed size when deciding how many registers to use, so assuming the `V` extension (which implies `Zvl128b` giving a minimum length of 128 bits), a `std::simd::Simd<f32, 16>` would take 4 registers (though LLVM would probably just set `LMUL` to 4 and use 1 register).
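A sketch of that fixed-length usage (nightly `portable_simd` assumed):

```rust
#![feature(portable_simd)]
use std::simd::Simd;

// On a RISC-V target with the `v` extension (which implies Zvl128b, i.e. a
// 128-bit minimum vector length), this 512-bit vector type still compiles and
// runs; the backend splits it across registers or raises LMUL as needed.
pub fn add(a: Simd<f32, 16>, b: Simd<f32, 16>) -> Simd<f32, 16> {
    a + b
}
```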
compiler always be able to calculate `N` isn't justified given the permanently
unstable nature of the `repr(scalable(N))` attribute and the scalable vector
types defined in `std::arch` are likely to be few in number, automatically
generated and well-tested.
I want there to be a `std::simd` way to use scalable vectors at some point, so we would need a way for the compiler to either generate the correct `N` or be able to correctly handle any `N` (within some architecture-independent bounds, e.g. `N` is always 1, 2, 4, 8, or 16) by translating to some other `N` that the current architecture can handle.
types to have fewer distinct restrictions than other SIMD types, and would
enable SIMD vectors to be passed by-register, a performance improvement.

Such a mechanism would need be introduced gradually to existing SIMD types with
- Such a mechanism would need be introduced gradually to existing SIMD types with
+ Such a mechanism would need to be introduced gradually to existing SIMD types with
However, as C also has restriction and scalable vectors are nevertheless used in
production code, it is unlikely there will be much demand for those restrictions
to be relaxed.
`struct`s containing a scalable SIMD type may actually end up being in large demand due to project-portable-simd:

One idea IIRC we've talked about is that we would end up with something kinda like a SIMD version of `ArrayVec` where we'd have a `struct` that contains both an active length (like `Vec::len`, not to be confused with `vscale` which is more like `Vec::capacity`) and a scalable SIMD type, so we could do operations where the tail of the SIMD type is not valid but user code wouldn't need `unsafe` since those extra elements are ignored by using auto-generated masks or `setvl`.

this would need `struct`s like so:
pub struct SimdArrayVec<T> {
    len: usize,
    value: ScalableSimd<T>,
}
Supersedes #3268.

Extends Rust's existing SIMD infrastructure, `#[repr(simd)]`, with a complementary scalable representation, `#[repr(scalable)]`, to support scalable vector types, such as Arm's Scalable Vector Extension (SVE) or RISC-V's Vector Extension (RVV).

Like the existing `repr(simd)` representation, `repr(scalable)` is internal compiler infrastructure that will be used only in the standard library to introduce scalable vector types which can then be stabilised. Only the infrastructure to define these types is introduced in this RFC, not the types or intrinsics that use it.

Rendered