Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range extension thunks #425

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Range extension thunks #425

wants to merge 1 commit into from

Conversation

sorear
Copy link
Collaborator

@sorear sorear commented Feb 20, 2024

We are considering three use cases here.

  1. A true large code model needs to support more than 2 GiB of text; data accesses are out of scope for this change but jumps and calls across a range of more than 2 GiB are needed. Most users of a large model will have more than 2 GiB of data but small text, or text with a highly local call pattern, so we want most calls to be able to use the auipc+jalr sequence. This would normally call for relaxation, but relaxation requires object files to contain the longest possible sequence, of which several are possible. Instead, keep the sequences the same and allow thunk insertion.

  2. For executables and shared objects in a Unix environment, most of the code size benefits of relaxation come from call->jal relaxation, not data or TLS relaxation. If the compiler is modified to generate jal instructions instead of call instructions, the code size benefits can be achieved without relaxation at all, but this requires JAL_THUNK to avoid relocation errors at a 1 MiB limit.

  3. If a function has many static call sites in a large binary but is known to be dynamically cold, due to a function attribute or PGO, the call sites can be replaced with jal instructions, sharing a single thunk between all call sites within a 2 MiB text region. This saves code size at small runtime cost.

Restricting the register usage of the thunks is an intentional feature copied from the Go 1.15 toolchain, where every non-leaf function requires a conditional call to runtime.morestack in the prologue; since ra cannot be saved before the stack frame is allocated, the call is performed using t0 as the return register.


There's an argument to be made that we can use thunks for JAL, CALL, and CALL_PLT because linking would fail otherwise. I'd rather not accept the risk of breakage with that. There's also an argument that CALL_PLT is always allowed to use a PLT, and PLT stubs clobber t1 and t2 even with STO_RISCV_VARIANT_CC, so substituting CALL_THUNK for CALL_PLT is always fine. I'm more sympathetic to that, but we need JAL_PLT for 2/3 use cases so why not add both.

I don't have a working PoC or a schedule for preparing a working PoC. As this is my first time on this side of the current process, what level of functional completeness are we looking for? Do I need to cover all three use cases or would it be enough to have a working clang -mshort-calls and lld combination and results showing a binary size decrease?

Recommend using CALL_THUNK for #388, FDPIC, and any other new code models or sub-ABIs.

@aswaterman
Copy link
Contributor

aswaterman commented Feb 20, 2024

For executables and shared objects in a Unix environment, most of the code size benefits of relaxation come from call->jal relaxation, not data or TLS relaxation. If the compiler is modified to generate jal instructions instead of call instructions, the code size benefits can be achieved without relaxation at all, but this requires JAL_THUNK to avoid relocation errors at a 1 MiB limit.

This seems out of scope to me, as it is a solution in search of a problem. Also, we need to support functions that are >1 MiB in size--something GCC does poorly currently but will need to get better at. That would require JAL thunks to be interspersed within a function, which sounds undesirably messy.

So, I'd focus on the large code model motivation. (The dynamically cold call site optimization is also interesting, and though it also has problems with >1 MiB functions, that's OK, because it's purely opportunistic.)

@jrtc27
Copy link
Collaborator

jrtc27 commented Feb 20, 2024

PLT stubs clobber t1 and t2 even with STO_RISCV_VARIANT_CC

PLT stubs clobber t1 and t3 unconditionally (I don't know why that was picked in the first place, but hey, that's what we've got). Without STO_RISCV_VARIANT_CC and -z now / LD_BIND_NOW the header also clobbers t0 and t2, but of course the resolver then clobbers some arbitrary subset of the caller-saved register set.

but we need JAL_PLT for 2/3 use cases so why not add both.

Because that eats precious relocation encoding space for something that does nothing new. We've already got the wastage from CALL+CALL_PLT, we don't need to add yet another relocation that implementations will treat identically to the other two. RV64 has enough space that one should never run out, but RV32 is very constrained.

@rui314
Copy link
Collaborator

rui314 commented Feb 21, 2024

This change would significantly diverge the RISC-V psABI from the psABIs of other RISC processors. Other psABIs do not distinguish range-extendable CALL relocation and the usual one. The linker automatically takes care of everything and creates range extension thunks as needed. (2) and (3) can be implemented without making a modification to the psABI, too. So I don't see a reason to add new relocations.

@sorear
Copy link
Collaborator Author

sorear commented Feb 21, 2024

@aswaterman

This seems out of scope to me, as it is a solution in search of a problem.

Are you saying that code size / speed tradeoffs in general are out of scope (what about Zcmt?), or are you making a narrower point that I'm failing to see?

My current thought for the prototype is

-mshort-calls=never   don't use short calls
-mshort-calls=cold    if [[gnu::cold]] is present
-mshort-calls=always  if function/section size is less than 2 MiB

That would require JAL thunks to be interspersed within a function, which sounds undesirably messy.

The compiler knows how large the function/section is and can switch to AUIPC+JALR if external references would struggle to reach outside it, it needs to do this anyway for internal references and I've done it before (there was function larger than 1 MiB in the Go test suite).

I don't intend for the linker to ever insert bytes in the middle of an input section.

So, I'd focus on the large code model motivation. (The dynamically cold call site optimization is also interesting, and though it also has problems with >1 MiB functions, that's OK, because it's purely opportunistic.)

Unfortunately, the large code model isn't close to done, so I can't do much with it without taking over a much larger project. Likewise I am not confident in my ability to get PGO working on a nontrivial project. So short calls on the basis of a command line option / attribute are the most likely thing I'll be able to have an end to end demonstration of in a reasonable length of time.

@jrtc27

PLT stubs clobber t1 and t3 unconditionally

Good catch. Unfortunately, we need t2 as one of our registers for future Zicfilp support (range extension thunks are software-guarded branches), whether or not Zicfilp modifies the PLT stub to use t2.

we don't need to add yet another relocation that implementations will treat identically to the other two

I'll drop CALL_THUNK then. It potentially turns link failures into runtime undefined behavior if anyone is using CALL_PLT for DSO-local functions with a custom calling convention that uses t2, which won't affect -msave-restore but is anyone doing interprocedural register allocation?

@rui314

So I don't see a reason to add new relocations.

To be clear, you want to drop both CALL_THUNK and JAL_THUNK, and add range extension thunk semantics for CALL, CALL_PLT, and JAL?

I think at a minimum we'd need a clear statement that R_RISCV_JAL within a single section over less than a 1 MiB range will never generate a thunk.

@aswaterman
Copy link
Contributor

Are you saying that code size / speed tradeoffs in general are out of scope (what about Zcmt?), or are you making a narrower point that I'm failing to see?

Not at all! My point was that motivation (2) is unconvincing unless you assume a fictitious universe in which linker relaxation does not exist. But, as I wrote, (1) is a sufficient motivation for this idea, and (3) is an interesting direction. I'm not pooh-poohing the idea; I'm just trying to help hone it.

@rui314
Copy link
Collaborator

rui314 commented Feb 21, 2024

To be clear, you want to drop both CALL_THUNK and JAL_THUNK, and add range extension thunk semantics for CALL, CALL_PLT, and JAL?

Only CALL_PLT (and CALL as it's a synonym for CALL_PLT) would require a range extension thunk, as that's the only relocation in this ABI to call a function that may be at an arbitrary memory address.

Essentially, I don't think you need to invent something new for RISC-V for range extension thunks. You can instead just do what other RISC psABIs do. As a reference, this is what AArch64 psABI do for range extension thunks: https://github.com/ARM-software/abi-aa/blob/2a70c42d62e9c3eb5887fa50b71257f20daca6f9/aaelf64/aaelf64.rst#L1281

We are considering three use cases here.

1. A true large code model needs to support more than 2 GiB of text;
   data accesses are out of scope for this change but jumps and calls
   across a range of more than 2 GiB are needed. Most users of a large
   model will have more than 2 GiB of data but small text, or text with
   a highly local call pattern, so we want most calls to be able to use
   the auipc+jalr sequence. This would normally call for relaxation, but
   relaxation requires object files to contain the longest possible
   sequence, of which several are possible. Instead, keep the sequences
   the same and allow thunk insertion.

2. There have been requests from developers of linkers other than ld.bfd
   to avoid the use of length-changing relaxation, since it is a unique
   linking step for RISC-V among common large systems architectures, can
   be computationally expensive, and requires a substantial number of
   additional relocations in object files.

   In an environment which makes limited use of global data, most of the
   code size benefits of relaxation come from call->jal relaxation. If
   the compiler is modified to generate jal instructions instead of call
   instructions, the code size benefits can be achieved without
   relaxation at all, but this requires JAL_THUNK to avoid relocation
   errors at a 1 MiB limit.

3. If a function has many static call sites in a large binary but is
   known to be dynamically cold, due to a function attribute or PGO, the
   call sites can be replaced with jal instructions, sharing a single
   thunk between all call sites within a 2 MiB text region. This saves
   code size at small runtime cost.

Restricting the register usage of the thunks is an intentional feature
copied from the Go 1.15 toolchain, where every non-leaf function
requires a conditional call to runtime.morestack in the prologue; since
ra cannot be saved before the stack frame is allocated, the call is
performed using t0 as the return register.

Range extension thunks use t1 and t2 as temporary registers, while
traditional practice in PLT entries is to use t1 and t3. This is a
necessary change to support the use of software-guarded branches when
Zicfilp support is added.

Thunk insertion is forbidden inside a single section, since relaxation
assumes that sections cannot grow and to ensure that intra-function
branches do not impact register allocation. Thunks may be inserted on
any cross-section jump; while linkers should endeavor to minimize the
number of thunks used, this is a complex optimization problem with a
quality/time tradeoff and mandating any algorithm would be
inappropriate.

Since this change redefines the existing relocation types, it is a
**backwards incompatible ABI change** if object files expect t1 and t2
to retain their values across jumps that cross section boundaries. In
practice, cross-section JAL relocations are not generated by current
compilers and rarely appear in handwritten assembly, and linkers are not
expected to generate thunks for CALL and CALL_PLT until the output
approaches the 2 GiB limit of auipc+jalr sequences.

Signed-off-by: Stefan O'Rear <[email protected]>
@sorear sorear force-pushed the range-extension-thunks branch from 96bdae0 to 9ee8805 Compare February 26, 2024 19:18
@sorear
Copy link
Collaborator Author

sorear commented Feb 26, 2024

Take 2, now with a better description of use case 2 and with the requested feature of "surprise ABI breaks for currently working code".

There are at least three ways to get current gcc or clang to expect t2 to be valid across a call:

  • clang -mllvm=-enable-ipra calls to a non-interposable function which does not use t2
  • clang uses t2 for the ninth actual argument of a dso_local function which is converted to fastcc
  • GCC nested functions called by name (not a pointer) use t2 as the static chain
  • it is unclear if MachineFunctionSplitter or BasicBlockSections can be enabled in any effective way

What is the path forward for these? Do we change them to not use t2 and retroactively declare old compilers buggy? Do we add a new mechanism for compilers to communicate which registers they expect to live across a call? Do we do nothing, and call it user error to create a binary which exceeds 2 GiB of text when using the medlow/medany code models?

@rui314
Copy link
Collaborator

rui314 commented Feb 27, 2024

It looks much better, but I think we need to first answer that question: do we need range extension thunks for RISC-V?

Other RISC psABIs required range extension thunks because, without them, they couldn't support a medium code model of 2 GiB binary size. They use only a single instruction for function calls, and therefore the "reach" of function call instructions is limited (typically equal to or less than ±128 MiB). On the other hand, they use longer code sequences for data access, so range extenders are not needed for data loads and stores.

In other words, other RISC-V psABI required range extension thunks to bridge the discrepancy between code and data references's reaches. To build binaries larger than 2 GiB, we usually need to build them with the large code model in the first place so that data references can refer to a location beyond ±2 GiB.

The RISC-V's medium code model doesn't have the above-mentioned issue because both code and data references can address ±2 GiB. That means the situation in which range extension thunks are useful is very limited; they're useful only when we have code scattered across more than a 2 GiB address range while all data references are within 2 GiB from the code location. There might be a program that fits into that use case, but I honestly think it would be very rare.

@sorear
Copy link
Collaborator Author

sorear commented Feb 27, 2024

do we need range extension thunks for RISC-V?

No.

It's an optimization, albeit an important one that will be painful to retrofit if it isn't considered early.

That means the situation in which range extension thunks are useful is very limited; they're useful only when we have code scattered across more than a 2 GiB address range while all data references are within 2 GiB from the code location.

That's the situation in which range extension thunks are present. They are useful precisely when they are absent: less than 2 GiB of relatively contiguous code, and data scattered across the address space. I expect this to represent the majority of uses of large models.

If we want to close the call performance gap between the medium and large models, we need to either define relaxations to turn a 64-bit-range call into a 32-bit-range call and then use them on every call in every input object, or define range extension thunks and not use them.

I think that the latter option is better for several reasons:

  • Between non-PIC, PIC, FDPIC, different options for intermediate range, etc there are a substantial number of plausible long call sequences. We would need to add relaxation semantics for all of them. By contrast, the semantics of thunks are agnostic of the details of the call sequence.
  • Relaxation requires linker work in the common case when the short call is generated. Thunks require linker work in the rare case when the long call must be generated.
  • Thunks use the short sequence in object files, so they are smaller.
  • Fully relaxing a long call which uses a literal island requires deleting bytes in two independent places, an unprecedented situation for relaxation.
  • Thunks have secondary uses for binary size optimization which are difficult to

@rui314
Copy link
Collaborator

rui314 commented Feb 27, 2024

All of the bullet points appear to be hypothetical and not validated by any actual experience or implementation. Utilizing a long code sequence with a full 64-bit offset for a function call and allowing the linker to relax it would be a logical expansion for the RISC-V large code model, as that's what we are doing for the medium code model.

Moreover, the linker relaxation and range extension thunks are independent. The linker is permitted to insert range extension thunks into a program, even with the current psABI, as long as it doesn't violate the ABI's assumptions (i.e., as long as range extension thunks preserve registers just like PLT entries do). Therefore, I believe defining range extension thunks at this moment is not absolutely necessary. To me, it appears that this proposal is too early to be ratified.

@aswaterman
Copy link
Contributor

aswaterman commented Feb 27, 2024

I would also favor relaxation from the hypothetical large code model to the current regime as the default approach, given that we already have the relaxation arrow in our quiver.

It’s clear to me that consideration (3) from Stef’s original proposal could bear some fruit, though it’s an optimization beyond the broader topic of large code model support.

@sorear
Copy link
Collaborator Author

sorear commented Feb 28, 2024

All of the bullet points appear to be hypothetical and not validated by any actual experience or implementation.

https://review.gerrithub.io/c/riscv/riscv-go/+/352852/5

The linker is permitted to insert range extension thunks into a program, even with the current psABI, as long as it doesn't violate the ABI's assumptions (i.e., as long as range extension thunks preserve registers just like PLT entries do).

Do you consider it a bug that LLVM treats t3/X28 as preserved by call instructions for the purposes of ipra and fastcc, despite the fact that PLT entries clobber t3?

When Zicfilp happens, "PLT entries" for address not taken functions without a landing pad will need to be changed to use t2/X7 for software-guarded branch reasons. Does this mean that t2 will be removed from the list of registers preserved across a CALL_PLT? Does this mean that the register used for LLVM trampoline intrinsics and GCC nested functions will have to change to something else?

(What external constraint is responsible for Zicfilp using X7 instead of X6 in spite of the ABI problems this causes?)

@rui314
Copy link
Collaborator

rui314 commented Feb 28, 2024

Do you consider it a bug that LLVM treats t3/X28 as preserved by call instructions for the purposes of ipra and fastcc, despite the fact that PLT entries clobber t3?

I guess it depends. ipra and fastcc doesn't follow the calling convention defined by this psABI, so I'd think they are implementation-defined optimizations.

When Zicfilp happens, "PLT entries" for address not taken functions without a landing pad will need to be changed to use t2/X7 for software-guarded branch reasons. Does this mean that t2 will be removed from the list of registers preserved across a CALL_PLT? Does this mean that the register used for LLVM trampoline intrinsics and GCC nested functions will have to change to something else?

That needs to be address in the Zicflip spec, no?

@sorear
Copy link
Collaborator Author

sorear commented Mar 11, 2024

I guess it depends. ipra and fastcc doesn't follow the calling convention defined by this psABI, so I'd think they are implementation-defined optimizations.

ipra and fastcc opt out of the calling convention, but they don't opt out of the relocation scheme, and LLVM generates normal relocatable files with nothing to prevent them from being consumed by a third-party linker that implements the psABI exactly. So I don't think that "it's implementation-defined" is a blanket license to ignore relocation rules.

That needs to be address in the Zicflip spec, no?

Are you advocating changing the draft Zicfilp spec from using t2 to using t1 for compatibility with the ratified ABI, or advocating for Zicfilp to define a new ABI that reserves t2 in PLT stubs?

@kito-cheng
Copy link
Collaborator

FYI, I think...we should reconsider Zicfilp should use t1 rather than t2 for the landing pad label register riscv/riscv-cfi#208

Copy link
Collaborator

@kito-cheng kito-cheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support range extension thunks, without introduce new relocation and new tags.

  • Adding new tag isn't really helpful, especially for asm code, it's not easy to maintain that new tag in asm code, and it's also not too much useful if we add that implicitly via some configuration option or flag.

other input sections may be resolved by the linker to point to a range
extension thunk instead of the target symbol. Range extension thunks will
eventually transfer control to the target symbol, and preserve the contents of
memory and all registers except for `t1` and `t2`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use t1 and t3, avoid use t2 due to landing pad (zicfilp).

@@ -735,6 +735,56 @@ that can represent an even signed 21-bit offset (-1MiB to +1MiB-2).
Branch (SB-Type) instructions have a `R_RISCV_BRANCH` relocation that
can represent an even signed 13-bit offset (-4096 to +4094).

==== Range Extension Thunks

`R_RISCV_JAL`, `R_RISCV_CALL`, and `R_RISCV_CALL_PLT` relocations to targets in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to exclude R_RISCV_JAL, that relocation are used with j and jal; j is used for jump within function and jal typically is only used when the target is known very close, otherwise should use call or tail (then with R_RISCV_CALL_PLT).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants