Proposal for RISC-V Embedded PIC (ePIC) specification #343

Draft
wants to merge 1 commit into
base: master

luismarques
Collaborator

@luismarques luismarques commented Aug 30, 2022

This PR adds a new file, detailing a new RISC-V Embedded PIC (ePIC) ABI specification. It is thus a proposal for that ABI, to be discussed here, improved, and, hopefully, eventually accepted.

What is ePIC

The aim of the ePIC ABI is to allow the creation of position-independent executables that are easy to load and relocate, with minimal overhead. This is meant to be evaluated in the context of (relatively constrained) embedded systems, where you typically want to use execute-in-place and would prefer to avoid some of the overheads of the System V ABIs. ePIC basically works by using PC-relative addressing for code (and read-only data) and GP-relative addressing for data. The loader only has to apply simple relocations in the data segment (essentially, update statically initialized pointers). See the new file for all the details.

Review notes

  • For the convenience of reviewers, I have included the entire specification document in this PR. It includes sections that might be out of the scope of this repo, such as ones detailing command-line options or providing non-essential information. I felt that would help review the overall design of the spec. I suggest we keep those sections for now and remove them closer to the commit stage -- assuming we reach that stage. Review feedback in those sections is still useful.

  • There is a toolchain implementation for ePIC but it is a work-in-progress. It needs to be updated to reflect some changes in the specification and other issues. There are ePIC demos using that toolchain but they don't yet fully reflect the current specification text. For instance, they currently use the linker option --emit-relocs and additional filtering of relocations (in the loader or the program format converter) instead of relying on the linker to emit only the appropriate relocations for ePIC executables. This is in part an artifact of how the ePIC project started, which was to generate relocatable Tock OS applications, as Tock used a similar mechanism. I will work on updating the toolchain, guided by the feedback of this review, so those discrepancies will eventually go away.

  • Assuming the overall design of ePIC is roughly acceptable, having early feedback about command-line options would be helpful. That would reduce the amount of churn for the toolchain implementation patches, the ePIC demos, and any other systems hoping to adopt ePIC before the spec is completely finalized. That is the most user-visible interface, but the same goes for ABI details like choosing the final relocation numbers (see spec text).

  • My thanks to everybody that directly or indirectly contributed to this effort. That includes Alex Bradbury (coordinated this project while at lowRISC, and later provided editing suggestions); Sam Elliot (part of the initial effort to get this project started, including requirements gathering, etc.); Kito Cheng and Nelson Chu (early review feedback, see below for details); Evandro Menezes (provided the draft compact/large code model specs and toolchain implementation, which influenced / were useful for this project); Maciej W. Rozycki (RISC-V FDPIC proposal video); Greg McGary (provided early review feedback); Jorge Prendes (ported epic-tock-c demo to rust, etc.); Alistair Francis (added OpenTitan support to the epic-tock-c demo); Michael Munday (added Docker support to the epic-c-example, various suggestions); The Tock OS contributors. Apologies to anyone I might have forgotten. Any deficiencies in this spec are my fault and not of the people listed.

Initial review concerns

This section lists some initial review concerns that I had written down or that were brought up in personal communications.

  • Kito/Nelson suggested adding the R_RISCV_64_PCREL relocation. They believe that's required for proper debug support in RV64 systems, as R_RISCV_32_PCREL might not cover all situations, based on their experience with the compact code model. In general, all the assumptions for RV64 systems should be double-checked and evaluated, as the majority of the effort for this spec/project has been on RV32 systems. Given the focus of this spec on embedded systems with limited resources and simple implementations, it might make sense to solve issues like this by making trade-offs that limit the implementation generality (e.g. limiting address space distances).

  • Kito/Nelson suggested converting R_RISCV_32 and R_RISCV_64 to R_RISCV_RELATIVE, like what PIE did. The stated advantage would be that "we don't need to look up symbol table other than __global_pointer$ for the loader, that could make the binary size could be smaller since the dynamic symbol table only need __global_pointer$". I'm not sure I fully understand that suggestion. In the demos we didn't need to check the symbol table, but we were making assumptions about how the program was linked (addresses in the ePIC code segment had the MSBit set, while ones in the ePIC data segment did not). The loader used only the data in the .rela sections, it did not check any other data per relocation entry.

  • The demos currently use the command-line options -static -Wl,--emit-relocs,-T,app.lds. What should that look like in the future, for the final toolchain? For a regular program, not using a linker script, would that be -fpie -fepic? That would mean "I want a position-independent executable and it should use the ePIC ABI", and those two options would be orthogonal. With that approach, in principle it would even be possible to later extend this ABI for shared library support.

  • The spec currently uses the __global_pointer$ symbol. That means, for instance, that PC-rel to GP-rel relaxations must be disabled when ePIC is used, which is currently specified to happen when the -fepic option is provided to the linker. I wonder if it wouldn't be simpler to just introduce a __epic_base$ symbol instead. Using the ePIC ABI during LTO codegen could still be configured, possibly through another option (e.g. the current toolchain implementation uses --lto-epic, in a departure from the spec, but consistent with other LTO options in LLD). I would like to avoid something clunky like -plugin-opt=--code-model=..., though.

  • Should we introduce any ELF attributes for ePIC?

@aswaterman
Contributor

This is cool.

@ptomsich ptomsich added this to the Post 1.0 milestone Aug 30, 2022
@Nelson1225
Collaborator

Nelson1225 commented Aug 31, 2022

ePIC basically works by using PC-relative addressing for code (and read-only data) and GP-relative addressing for data.

So does that mean all symbol references should be within ±2^31 bytes of gp and pc? According to the compact spec, if rodata or anything else is placed far away from pc and gp, we can at least use got_gprel to access it. However, it seems that case cannot work when using the ePIC model. I think it is probably worth mentioning the valid access range (or limitations) for ePIC in the spec.

Address generation for an unknown segment by using %epic

This design seems better than compact. Compact usually generates %got_gprel for these cases, but since we chose not to chain the related instructions the way pcrel_lo does (we just refer to the target symbol directly), the last load/store instruction of %got_gprel cannot be relaxed, so we end up with bad code like the following:

lui t0, %got_gprel_hi(src) # R_RISCV_GOT_GPREL_HI20
add t1, s0, t0, %got_gprel(src) # R_RISCV_GOT_GPREL_ADD
ld t2, %got_gprel_lo(src)(t1) # R_RISCV_GOT_GPREL_LO12_I
lw t3, 0(t2), %got_gprel(src) # R_RISCV_GOT_GPREL_LOAD
*** relaxed to ***
---
---
ld t2, %got_gprel_lo(src)(gp) # R_RISCV_GOT_GPREL_LO12_I
lw t3, 0(t2) -> redundant load/store, since we don't know the last lw/sw are using
                t3 and t2 register when relaxing the previous instructions.

Compared to that, since ePIC chains the low instructions to the high one, the relaxation seems perfect and won't leave a redundant load/store. Just to make sure of one thing: does that mean we should always do the ePIC relaxations, just like alignment, even if -mno-relax is given?

Should the %gprel_hi be shared sometimes?

I just discussed this with @palmer-dabbelt, and had discussed something similar with my SiFive partners when implementing compact. gcc (not sure what clang does) considers the section alignment to decide whether the LUI can be shared by the later low instructions. We may run into the problem that gp is not aligned in the GNU default linker script, or in a custom linker script. I'm not sure whether we should update the related sharing checks (or disallow sharing %gprel_hi) in gcc/clang, or just make sure gp is aligned in the linker. The latter probably needs to be mentioned in the spec, but the former doesn't.
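
To make the alignment concern concrete, here is a small Python sketch. The helper names and the addresses are made up for illustration, and it only models the usual hi20/lo12 split, not what gcc or clang actually check. It shows that an 8-byte object can share one upper-part instruction for all of its bytes when gp is 8-byte aligned, but not necessarily when gp is off by 4.

```python
def hi20(offset):
    """Upper part for lui/auipc; +0x800 compensates for the sign-extended low 12 bits."""
    return (offset + 0x800) >> 12

def can_share_hi(sym, gp, size):
    """True if the first and last byte of the object use the same upper part."""
    return hi20(sym - gp) == hi20(sym + size - 1 - gp)

SYM = 0x10007F8                 # hypothetical 8-byte-aligned, 8-byte object
ALIGNED_GP = 0x1000000          # gp aligned to 8 bytes
MISALIGNED_GP = 0x0FFFFFC       # gp only aligned to 4 bytes

shared_when_aligned = can_share_hi(SYM, ALIGNED_GP, 8)        # offsets 0x7f8..0x7ff
shared_when_misaligned = can_share_hi(SYM, MISALIGNED_GP, 8)  # offsets 0x7fc..0x803
```

In the misaligned case the object straddles the point where the low 12 bits wrap from +2047 to -2048, so the two halves need different upper parts.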

In general, I would support ePIC since it has the better relaxation design than compact, but just some minor issues that need to be confirmed. Thanks.

@luismarques
Collaborator Author

So does that mean all symbol references should be within ±2^31 bytes of gp and pc? According to the compact spec, if rodata or anything else is placed far away from pc and gp, we can at least use got_gprel to access it. However, it seems that case cannot work when using the ePIC model. I think it is probably worth mentioning the valid access range (or limitations) for ePIC in the spec.

Yes, that's the range currently specified:

Like when using the medium code model, code can address the range between -2 GiB and +2 GiB from its position. Data in the data segment can be addressed between -2 GiB and +2 GiB from the global pointer.

That phrasing could be improved to make it clearer, if needed. We could also imagine extending this for RV64 (either unconditionally or through a code model option), but given the intended use case of this ABI it's not clear there's a need for that.
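
For readers less familiar with where the ±2 GiB figure comes from, the following Python sketch splits an offset into the 20-bit upper part and the 12-bit signed lower part used by two-instruction address sequences. The function names are illustrative only, and the reconstruction assumes 32-bit wrap-around arithmetic as on RV32; the exact reachable window on RV64 is slightly shifted and is not modelled here.

```python
def split_hi_lo(offset):
    """Split a signed 32-bit offset into (hi20, lo12) for a lui/auipc + addi/load pair."""
    if not -(1 << 31) <= offset < (1 << 31):
        raise ValueError("offset outside the +/-2 GiB window")
    hi = (offset + 0x800) >> 12      # round so that lo fits in a signed 12-bit field
    lo = offset - (hi << 12)         # always in [-2048, 2047]
    return hi & 0xFFFFF, lo

def rebuild(hi, lo):
    """Recombine the two parts with 32-bit wrap-around, as the hardware would on RV32."""
    value = ((hi << 12) + lo) & 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

samples = [0, 0x7FF, 0x800, -1, -0x800, -0x801, 0x7FFFFFFF, -0x80000000]
round_trips = [rebuild(*split_hi_lo(s)) for s in samples]
```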

Compared to that, since ePIC chains the low instructions to the high one, the relaxation seems perfect and won't leave a redundant load/store. Just to make sure of one thing: does that mean we should always do the ePIC relaxations, just like alignment, even if -mno-relax is given?

The R_RISCV_EPIC_* relocations always need to be processed, as those are transformed into other relocation types that cannot be ignored (pcrel and gprel relocations). I wouldn't call that transformation a relaxation, if that's what you mean, but it's true that they must always be processed, even with -mno-relax. Notice that the code size doesn't have to change, though, as the add instruction referenced by the R_RISCV_EPIC_BASE_ADD relocation can be transformed into a nop in the case where the add is not needed. That's not the optimal implementation but it is straightforward to implement in O(1). The linker relaxations defined in the "Linker relaxations" section are optional.
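
A purely symbolic Python sketch of that size-preserving transformation follows. It assumes the compiler emits the lui/add/load triple shown later in this thread, and the relocation names in the strings are placeholders rather than the names or semantics fixed by the spec. The point is only that each case yields three instructions, so no code has to move.

```python
NOP = "nop"

def resolve_epic_sequence(rd, sym, in_data_segment):
    """Return (instruction, relocation) pairs replacing one ePIC lui/add/lw triple.

    All mnemonics and relocation names are illustrative placeholders.
    """
    if in_data_segment:
        # GP-relative form: the add of gp is required.
        return [
            (f"lui {rd}, %gprel_hi({sym})", "GPREL_HI20"),
            (f"add {rd}, {rd}, gp", "GPREL_ADD"),
            (f"lw {rd}, %gprel_lo({sym})({rd})", "GPREL_LO12_I"),
        ]
    # PC-relative form: auipc already adds the PC, so the add becomes a nop.
    return [
        (f"auipc {rd}, %pcrel_hi({sym})", "PCREL_HI20"),
        (NOP, None),
        (f"lw {rd}, %pcrel_lo(.Lhi)({rd})", "PCREL_LO12_I"),
    ]

data_case = resolve_epic_sequence("t0", "var", in_data_segment=True)
code_case = resolve_epic_sequence("t0", "table", in_data_segment=False)
```

Deleting the nop afterwards would be the optional relaxation step; the sketch stops before that.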

In general, I would support ePIC since it has the better relaxation design than compact, but just some minor issues that need to be confirmed. Thanks.

Thanks for the detailed feedback!

@palmer-dabbelt
Contributor

This is coupling a lot of things together: some new user interface, new relocations, and a new ABI. The core goal around having two addressable ranges (one PC-based, one GP-based) seems broadly useful; some users even take advantage of the de facto second range that relying on auipc->GP relaxation gives you (though it's not super useful because it's only 12 bits). IMO splitting this out into a handful of distinct features will make this easier to define/implement and more useful.

My worry here is really the R_RISCV_EPIC_* relocations, which is the core of where all this coupling is coming from. Essentially they're encoding this specific ABI (ie, .data can move relative to .text, but .rodata can't) via the relocations targeting specific symbols. In practice the relocations are processed pretty much independently from the code model and ABI, and while the psABI does describe some coupling between these, we generally just ignore that in the toolchain as it's not really feasible to implement.

More concretely, I'd structure this as follows:

  • Add what's necessary for the 3-instruction PCREL addressing sequences (ie, auipc; addi/andi; lw). We'll need these anyway just to allow sharing auipc in medany, but having three instructions will allow us to rewrite these as 32-bit-offset GP-relative sequences as well.
  • Add the 3-instruction GPREL addressing sequences (keeping in mind that we'll probably end up with auigp if this gets popular). This will enable a static/no-PIE ABI that has two address ranges, which itself is useful for a bunch of cases. We'll likely want the same tuning hooks to control where read-only data goes (the Arm ports have some, my guess is an attribute would be useful as well), we could either require those match for correctness or re-write between the PCREL and GPREL sequences in the linker.
  • Add the new ABI that defines these position independence rules that match what you're looking for. Something like static/PIE might be sufficient, but there's so many bugs around that right now it's hard to tell for sure. IIUC this isn't quite FDPIC, but I can't find an actual RISC-V FDPIC spec and last I heard it wasn't quite sorted out yet so it might be possible to fit it in.

I think that would be sufficient to let users make binaries that behave the same way, but without all that extra coupling we'll have an easier time implementing it and we'd get more than one use case out of it.

@luismarques
Collaborator Author

luismarques commented Sep 5, 2022

Thank you for the detailed feedback, Palmer!

This is coupling a lot of things together: some new user interface, new relocations, and a new ABI.

Sorry to nitpick but, to reiterate the PR summary, the user interface is (almost surely) out-of-scope of the psABI repo, and it's only included here for convenience, to give an overview of how this new ABI would be used in practice. I also don't see the distinction between "relocations" and "ABI", my understanding is that the former is part of the latter. In any case, I appreciate the overall sentiment, and it's good to discuss if it's possible to decouple some of these concerns without introducing other issues.

The core goal around having two addressable ranges (one PC-based, one GP-based) seems broadly useful; some users even take advantage of the de facto second range that relying on auipc->GP relaxation gives you

My understanding is that PC-rel to GP-rel relaxation currently assumes that the two ranges don't move with respect to each other, so in a sense we are talking about "two ranges" in significantly different ways.
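
The difference can be shown with a few lines of Python. The layout numbers below are invented, and the helper is only an arithmetic illustration: when code and data are moved by different amounts, the PC-relative distance to a data variable changes while its GP-relative distance does not.

```python
def pc_and_gp_offsets(code_base, data_base):
    """Offsets to one data variable for a given placement of the two segments."""
    pc = code_base + 0x40      # address of the referencing instruction
    gp = data_base + 0x800     # global pointer, inside the data segment
    var = data_base + 0x10     # variable in the data segment
    return var - pc, var - gp

link_time = pc_and_gp_offsets(0x0001_0000, 0x0002_0000)
moved_together = pc_and_gp_offsets(0x2001_0000, 0x2002_0000)  # same delta for both
moved_apart = pc_and_gp_offsets(0x2001_0000, 0x1000_0000)     # independent deltas
```

A PC-rel to GP-rel relaxation is only a safe substitution in the `moved_together` situation, which is the assumption the existing relaxation makes.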

Add what's necessary for the 3-instruction PCREL addressing sequences (ie, auipc; addi/andi; lw). We'll need these anyway just to allow sharing auipc in medany, but having three instructions will allow us to rewrite these as 32-bit-offset GP-relative sequences as well.

Could you please elaborate on this a little bit? While I think I understand what your overall point is, some details aren't quite clear to me. E.g.:

  1. In what circumstances do you want to share auipc?
  2. What's andi (or what did you actually mean to write)?
  3. To be sure, you're proposing that we always use 3 instructions even when 2 would be enough (e.g. for address generation, such as when returning a pointer), right? Is the idea that we would then relax that instruction sequence to 2 instructions in the cases where we do not want to transform the sequence from PC-rel to GP-rel? E.g. remove the nop in auipc; addi; nop?

We'll likely want the same tuning hooks to control where read-only data goes (the Arm ports have some, my guess is an attribute would be useful as well), we could either require those match for correctness or re-write between the PCREL and GPREL sequences in the linker.

Is it really possible to ensure that those always match, without introducing significant pessimizations, compatibility concerns or very intrusive toolchain changes? I was assuming that a linker rewrite would be inevitable but if you think otherwise or have tips for how this might be solvable please do share.

Add the new ABI that defines these position independence rules that match what you're looking for.

If I understand your point correctly, the following solution would fall broadly within your guidance, right?

  1. The compiler always emits PC-relative 3-instruction instruction sequences when using -fepic (or some other option to signal the new ABI). Those 3 instructions are "tagged" with preexisting R_RISCV_PCREL_HI20, R_RISCV_PCREL_LO12_I, etc.
  2. The linker rewrites those PC-relative instruction sequences into GP-relative sequences when using -fepic (when applicable to the symbol).

I'm not necessarily arguing against such an approach, but some counterpoints to keep in mind:

  1. You said "we'll have an easier time implementing it", but it's not obvious to me that's the case. For instance, to implement the R_RISCV_EPIC_* relocations I just checked the rewrite condition, rewrote the instructions (if the condition applied) and then changed the relocation type to become one of the previously specified/proposed PC-relative or GP-relative relocations. So, my experience was that it actually provided an easier separation of concerns, as the ePIC relocations delegated to the normal regular relocation resolution process. In fact, you can do all of that ePIC processing upfront (neater) or interleave with the regular relocation processing loop (probably better performance, due to cache locality, etc., but slightly messier).
  2. Your counterproposal arguably makes things more implicit and tangled. Now when you see auipc; addi; nop with the regular PC-relative relocations you don't really know if that's meant to be a completely different instruction sequence (even if you look at the associated relocations) unless you also have some context about possible command line options or ELF attributes. That seems more error-prone and confusing.

Hopefully, I didn't strawman your proposal. If so, I apologise.

I kindly ask all of you to chime in about this:

  1. Should we try to move this proposal roughly in the direction proposed by Palmer? Or do you disagree with the concerns raised, or suggest other ways of addressing those concerns, etc?
  2. If we do move in that direction, can you please support me in that transition? We would be going from something more specific (closer to a possible final form) to something with more details to fill in (not to mention redoing the toolchain implementation). While I would no doubt take a very active part in driving that effort forward, I kindly ask you to come along with me and take an active part in that effort as well.

@kito-cheng
Collaborator

@sorear
Collaborator

sorear commented Nov 4, 2022

Haven't reviewed this in great detail but I'm very pleased that it exists and am happier with it superficially than I was with compact.

AUIGP won't save bytes until the gp-relative data segment exceeds 256KiB, since lui and add are both compressible.
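
A back-of-the-envelope Python sketch of that byte count follows. It assumes a hypothetical 32-bit `auigp` encoding (no such instruction is defined), a 32-bit load in both sequences, and that `c.lui` is usable whenever the upper part fits its nonzero 6-bit signed immediate; register constraints are ignored.

```python
def hi_part(offset):
    """Upper 20-bit part of a gp-relative offset (low 12 bits are sign-extended)."""
    return (offset + 0x800) >> 12

def bytes_with_lui_add(offset):
    """lui + add gp + 32-bit load, compressing lui/add where possible."""
    hi = hi_part(offset)
    if hi == 0:
        return 4                          # a single load relative to gp is enough
    lui = 2 if -32 <= hi <= 31 else 4     # c.lui: nonzero 6-bit signed immediate
    return lui + 2 + 4                    # (c.)lui + c.add + load

def bytes_with_auigp(offset):
    """Hypothetical 32-bit auigp + 32-bit load."""
    return 4 if hi_part(offset) == 0 else 4 + 4

near, mid, far = 100, 100_000, 200_000    # sample gp-relative offsets in bytes
sizes = [(bytes_with_lui_add(o), bytes_with_auigp(o)) for o in (near, mid, far)]
```

Under these assumptions the two sequences tie while the offset stays inside the roughly 256 KiB window reachable through `c.lui`, and `auigp` only wins by two bytes beyond it.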

Relocation processing needs to handle the fact that "3-instruction sequences" are dynamic; various compiler optimizations can cause the early instructions to be duplicated, the late instructions to be duplicated, the instructions to be out of order in the file, interleaved with other relocations, etc, etc; trying to match relocations with one another when not explicitly threaded is a losing proposition.
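
The value of explicit threading can be sketched in Python, modelled loosely on how the existing PC-relative low-part relocations name the address of their high-part instruction. The data structures and addresses are invented for illustration. Because every low part points at its high part, the lookup works regardless of duplication, ordering, or interleaving.

```python
def resolve_lo_relocs(hi_relocs, lo_relocs, symbol_addrs):
    """Compute the low 12 bits for each low-part relocation.

    hi_relocs: {address of high-part instruction: symbol name}
    lo_relocs: [(address of low-part instruction, address of its high part)]
    """
    resolved = {}
    for lo_addr, hi_addr in lo_relocs:
        sym = hi_relocs[hi_addr]                 # explicit link, no guessing
        offset = symbol_addrs[sym] - hi_addr     # relative to the high-part PC
        hi = (offset + 0x800) >> 12
        resolved[lo_addr] = offset - (hi << 12)
    return resolved

symbols = {"var": 0x3ABC, "other": 0x1104}
hi_relocs = {0x1000: "var", 0x1004: "other"}
# One low part sits before its high part in the file, one is duplicated,
# and the two chains are interleaved.
lo_relocs = [(0x0FF0, 0x1000), (0x1008, 0x1004), (0x1010, 0x1000)]
low_parts = resolve_lo_relocs(hi_relocs, lo_relocs, symbols)
```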

The final code size extension will likely have some form of gp-relative loads and stores; at least Andes already has them as a vendor extension. Relaxing 3-instruction sequences to lwgp/swgp is challenging because of the previous point; we don't know, from the LUI/ADD instructions and their relocations alone, whether all of the corresponding load and store instructions are eligible for relocation, so we might need to add new relocations to convey the "all possible users of this are {sw,sd,lw,ld}" information. Which might involve a degree of coordination between the development of this extension and that of the code size reduction extension. I am fairly certain the relaxation code in plctlab/llvm-project@riscv-zce-llvm14 gets this wrong and can delete non-dead instructions if you do anything nontrivial with code ordering.

FDPIC is, first and foremost, a calling convention, which this does not change at all. However, the process of loading an ePIC image (each program header is an independent non-FIXED mmap) is identical to what is needed for FDPIC, so they can probably share a single ELF flag for the purposes of binfmt_elf_fdpic, and be treated as two incompatible ABIs beyond that.

@luismarques
Collaborator Author

Since so far we have not made significant progress with alternative proposals, shall we move forward with the general approach proposed in this PR? If so, we can start focusing the discussion on the various details of this proposal.

If you believe an alternative proposal would be preferable (either one that has already been mentioned in this PR discussion or a different one) please say so now. For this discussion not to stall when evaluating alternatives we'll need either reasonably detailed counter-proposals or a quick feedback cycle to address any raised questions or issues.

Thank you all for your input.

@jrtc27
Collaborator

jrtc27 commented Dec 5, 2022

I don't think the silence should be regarded as "people think this is the right thing and thus haven't proposed alternatives", rather it's "people have been busy with other things and not thought about this further". As a group we've been focused on getting 1.0 done up until recently, not on big new things like ePIC. My two concerns are:

  1. having too many ABIs in the ecosystem and fragmenting
  2. having too many relocations such that we run out of encoding space on RV32

1 is somewhat inevitable but in as much as we can align the ABIs (e.g. FDPIC and ePIC using the same sequences for certain things) we should just to keep things less crazy for implementers and users. 2 is a real issue, especially given 25% of the space is reserved for custom extensions and almost another 25% is already allocated, so again where possible relocations should be general rather than overly tailored to a specific ABI variant (e.g. again, relocations should be general between FDPIC and ePIC where they have overlap in code sequences).
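
The RV32 limit comes from the ELF32 relocation entry format, where the relocation type shares the 32-bit `r_info` word with the symbol index and gets only the low 8 bits. A short Python sketch of the standard packing (the helper names mirror the usual ELF macros; the sample type number is arbitrary):

```python
def elf32_r_info(sym, rel_type):
    """Pack a symbol index and relocation type as ELF32 does: 24 bits + 8 bits."""
    if not 0 <= rel_type <= 0xFF:
        raise ValueError("ELF32 relocation types must fit in 8 bits")
    if not 0 <= sym <= 0xFFFFFF:
        raise ValueError("ELF32 symbol indices must fit in 24 bits")
    return (sym << 8) | rel_type

def elf32_r_sym(info):
    return info >> 8

def elf32_r_type(info):
    return info & 0xFF

def elf64_r_info(sym, rel_type):
    """ELF64 gives the type a full 32 bits, so the pressure is specific to ELF32."""
    return (sym << 32) | (rel_type & 0xFFFFFFFF)

TOTAL_ELF32_TYPES = 1 << 8                     # 256 encodable relocation types
info = elf32_r_info(sym=5, rel_type=23)
```

With 256 slots in total, a quarter reserved and nearly another quarter already assigned, each new ABI-specific relocation uses up a visible share of what is left.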

The need to rewrite instruction sequences beyond just filling in immediate bits at link time is also a bit ugly; @MaskRay may have opinions on that from an LLD perspective.

However, I can see how you've ended up where you have, don't immediately have a better proposal and don't want to block solving a real problem by commenting on things I don't overly like on idealistic grounds.

@kito-cheng
Collaborator

Sorry for not responding to this in public for a long time. I've talked with @luismarques at the psABI meeting and in off-list mail several times before, but I think I really owe a public response here.

In short:
I like the idea of the ePIC code model.

Longer version:
Having two separate memory regions is not rare in embedded systems, and it's a challenge for the existing code models, medlow and medany. Other code models have been proposed before that can resolve that issue: the large code model and the compact code model. Fortunately, I drove the large code model when I worked at Andes and drove the compact code model implementation at SiFive, so I have experience with both.

The large code model (literal pool style) just does everything conservatively. It works in such a scenario, but pays a much higher cost in code size and performance, and it can't provide position-independent code in general.

The compact code model proposal tries to fix a similar issue but takes a different path. It has to predefine which region a symbol or section is located in, and must let the compiler know (to pick the right address-loading code sequence). However, it's hard to deal with customized sections specified by an attribute, which are common in embedded programs. The original design didn't intend that; on the contrary, it tried to optimize that case, but we found it really challenging when we implemented it, and we didn't find a good solution.

So back to the ePIC code model: I think it's a better solution for this problem in the embedded world. It defers the address classification to link time via the new ePIC relocations, instead of trying to let the compiler resolve it.

One concern with ePIC is that it introduces 8 new relocations, but IMO the first 4 relocations (the GP-related ones) could be used by other code models in the future, like FDPIC, so there are really only 4 relocations dedicated to ePIC. Just one comment here: we could rename R_RISCV_EPIC_* to something like R_RISCV_VAR_PIC_* (variable PIC relocation) to relax the coupling between the code model and the relocation names a little bit.

I think no code model can fit every scenario well, but the ePIC code model is a good one for addressing the embedded-world scenarios that we can't handle well with the existing code models, so I support this.

@Nelson1225
Collaborator

The compact code model proposal tries to fix a similar issue but takes a different path. It has to predefine which region a symbol or section is located in, and must let the compiler know (to pick the right address-loading code sequence). However, it's hard to deal with customized sections specified by an attribute, which are common in embedded programs. The original design didn't intend that; on the contrary, it tried to optimize that case, but we found it really challenging when we implemented it, and we didn't find a good solution.

I was one of the main people implementing the compact code model when I was at SiFive. To describe this in more detail: compact always needs to generate the got_gprel patterns, which are similar to the ePIC patterns, but they cannot be relaxed to the gprel patterns since the low instruction doesn't refer to the corresponding high one. That's a flaw of compact, but it also meant that we didn't want to create a second pcrel_lo case back then. From my side, apart from this issue, compact and ePIC are similar, so ePIC is an enhancement of compact.

SiFive should have supported the compact code model for almost two years, and should have customers for it, so it should be a real need in the embedded world. I support compact as before, so I will also support its enhancement, which is the ePIC code model, and I hope to see ePIC upstream.

@MaskRay
Collaborator

MaskRay commented Feb 5, 2024

With that approach, in principle it would even be possible to later extend this ABI for shared library support.

This proposal is more like uClinux BFLT without shared library support.
(The Linux kernel removed shared library support for BFLT in April 2022.)
(I cannot find much information about BFLT, perhaps these configurations use GOT more often than absolutely necessary.)
If we want to support shared library, I believe we should just proceed with FDPIC.

My two concerns are:

having too many ABIs in the ecosystem and fragmenting
having too many relocations such that we run out of encoding space on RV32

@jrtc27 raised a good question. IIUC this proposal introduces the two sets of relocations:

  • GP-relative relocations, the same as the FDPIC with the addition of R_RISCV_GPREL_ADD for linker relaxation
  • ePIC relocations (new), %epic_hi/%epic_lo/%epic_base_add

I believe the GP-relative relocations are non-controversial. The ePIC relocations could use more debates.

The unknown segment is because we cannot determine rodata vs data/bss for non-definition declarations in the general case
(extern const S s;).

The need to rewrite instruction sequences beyond just filling in immediate bits at link time is also a bit ugly

I agree that the linker implementation can be dirty. The linker has to figure out whether the referenced section and GP share the same PT_LOAD segment.
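
The check itself is small; the untidy part is wiring it into relocation processing. A minimal Python sketch, assuming the linker already has the final PT_LOAD layout available (the type and function names are hypothetical and the addresses are invented):

```python
from typing import NamedTuple, Optional, Sequence

class LoadSegment(NamedTuple):
    vaddr: int   # p_vaddr of a PT_LOAD program header
    memsz: int   # p_memsz of the same header

def containing_segment(segments: Sequence[LoadSegment], addr: int) -> Optional[int]:
    """Index of the PT_LOAD segment covering addr, or None if there is none."""
    for index, seg in enumerate(segments):
        if seg.vaddr <= addr < seg.vaddr + seg.memsz:
            return index
    return None

def shares_segment_with_gp(segments: Sequence[LoadSegment], addr: int, gp: int) -> bool:
    """True when addr and the global pointer live in the same PT_LOAD segment."""
    gp_seg = containing_segment(segments, gp)
    return gp_seg is not None and containing_segment(segments, addr) == gp_seg

layout = [LoadSegment(0x80000000, 0x4000),   # code + rodata
          LoadSegment(0x00000000, 0x2000)]   # data + bss
GP = 0x800
data_symbol_is_gp_relative = shares_segment_with_gp(layout, 0x0010, GP)
rodata_symbol_is_gp_relative = shares_segment_with_gp(layout, 0x80001000, GP)
```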

FDPIC ABIs avoid the "unknown segment" problem by using a GOT code sequence.
The GOT is accessed via GP, which resides in the data segment.

If we introduce FDPIC in the future, I feel that we can re-implement ePIC relocations using R_RISCV_GPREL_GOT_HI20 optimization/relaxation. Though we probably should not do it, as the return-on-investment may not be sufficiently large:

label:
lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
c.add t0, gp               # we probably need a marker relocation to remove this instruction for relaxation
l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label

-fepic

The proposed code snippets look like clang -fpie -fdirect-access-external-data.
-fpie is a variant of -fpic that assumes defined symbols non-preemptible (binds locally).
-fdirect-access-external-data instructs Clang to assume non-TLS non-definition declarations non-preemptible.

The code also resembles GCC -fpie -mno-pic-data-is-text-relative (arm/s390x) in that PC-relative instructions for data is disabled.
arm -mno-pic-data-is-text-relative uses GOT. -mno-pic-data-is-text-relative seems like a restricted -mfdpic.

@kito-cheng
Collaborator

I think one advantage of ePIC is that it tries to avoid introducing a GOT, which reduces some runtime overhead and code size. The extra load (in the GOT scheme) could be removed by linker relaxation, but the instruction sequence is still longer.

Load variable from symbol

GOT scheme

label:
  lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
  c.add t0, gp               # R_RISCV_GPREL_ADD var
  l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
  l[w|d] t0, 0(t0) # load the actual value from symbol

Current ePIC scheme

label:
  lui t0, %epic_high(var) # EPIC_HI20 (var)
  c.add t0, gp               # EPIC_BASE_ADD var
  l[w|d] t0, %epic_low(label)(t0) # EPIC_LO12_I label

Load address

GOT scheme

label:
  lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
  c.add t0, gp               # we probably need a marker relocation to remove this instruction for relaxation
  l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label

Current ePIC scheme

label:
  lui t0, %epic_high(var) # EPIC_HI20 (var)
  c.add t0, gp               # EPIC_BASE_ADD var
  addi t0, t0, %epic_low(label) # EPIC_LO12_I label

@richfelker

BFLT-like "single global data base register" ABIs do have a slight code size advantage, but only work for static linking and only really make sense for the most tiny bare metal settings. Of course you can try to implement a custom idea of dynamic linking on top of that, which several archs' BFLT variants did, but this quickly turns into reinventing everything in ELF and inevitably doing it really badly.

As such, musl libc (I'm commenting from the standpoint of author/maintainer) has never supported this kind of thing and has only supported either entirely-nonshareable normal ELF PIE or ELF FDPIC binaries for MMU-less systems. I would really like to see a real FDPIC ABI for riscv that we could support, even if this ePIC thing is sufficient for certain bare-metal uses you have in mind.

@richfelker

Also note that, if a data object is known to be local to the same DSO (if it's static or hidden or if you're building with compiler flags indicating that everything will be static linked), FDPIC data references do not have to go through the GOT. They can be relative to the FDPIC GOT register (which need not actually point to a GOT, just some anchor in the data segment). This also works for shared libraries - because the callee library's FDPIC GOT register is loaded when calling into it, the library function is able to access its own static/hidden data GOT-relative. This makes everything almost as cheap as the single-global-pointer system you want in ePIC in the case where it's static linked, but much more general, where efficiency only degrades gracefully as you use more features (dynamic linking).

@sorear
Collaborator

sorear commented Feb 18, 2024

If the intention is to merge this in pieces, could this be turned into a draft to reduce confusion?

Please don't provide a YouTube link as the sole definition of FDPIC; there are plenty of written explanations.

I came up with a relocation scheme that supports almost all of the useful features for a FDPIC ABI:

  • following the conventions expected by musl and uclibc for function descriptor allocation, lazy binding, etc
  • support for all TLS models, including TLSDESC
  • generates optimal access sequences for all objects and functions in the code segment, data segment, and absolute segment
  • optimal access sequences generated by relaxation from a single compiler-generated sequence
  • optimal access sequences can be generated by the compiler if relaxation is disabled
  • multiple code models to support code and data sizes up to the full address space
  • -fno-plt calls
  • data flow respects original register specifiers, allowing for code motion
  • linking all relocations together allows for future addressing extensions
  • rematerializability in very limited cases

The relocation scheme can also be used when statically linking and using the normal calling convention; call this the ePIC subset. The most economical scheme has 9 new relocation types on top of v1.0 + TLSDESC, although a scheme with 11 new relocations is simpler in several ways. The 11-type scheme has a 2-type subset which supports ePIC operations; the two types correspond to the #405 relocation types, although they are described very differently to account for the larger context. Neither scheme uses the #394 relocations at all.

FDPIC relocation needs

| # Relocs | Used by ePIC | Shared with this PR | Description |
| --- | --- | --- | --- |
| 4 | 2 | 0 | Absolute minimum for a working FDPIC+ePIC with TLS |
| 9 | 3 | 1 | Supporting all TLS models and generating optimal instruction sequences via relaxation or length-preserving rewrites |
| 11 | 2 | 2 | Avoid double indirection in INTERMEDIATE_LOAD and save one relocation entry for ePIC, reaching parity |
| 20+? | 7? | 7? | Low estimate for doing the same thing with single-purpose relocations |

9-type relocation scheme

Non-TLSDESC global dynamic TLS is not supported. No special provision is made to distinguish rematerializable from non-rematerializable addressing sequences, although compilers may treat addressing sequences as rematerializable if they are known to not be in the code segment. Omitting R_RISCV_RELAX allows length-preserving rewrites. This sketch optimizes the number of relocation types at the expense in some cases of the number of relocation entries.

I've gone back and forth several times over exactly which transformations should be permitted without RELAX. The current rules allow us to express the "use PCREL or GPREL but never use a GOT" property of ePIC and also allow the use of code models in non-relaxed FDPIC, but in the default FDPIC model do not allow rematerialization or omission of R_RISCV_PIC_ADD relocations.

Requiring R_RISCV_INTERMEDIATE_LOAD to be explicitly marked even when it is optimized out at compilation or assembly time is a wart on the design and the only place we're quantitatively worse than ePIC. To fix it, use the 11-type scheme.

A full FDPIC proposal would include, in addition to the relocations and relaxations described here, a precise definition of the calling convention, ELF flags and attributes, the list of code models, and sibling PRs to asm-manual and c-api.

  • R_RISCV_FUNCDESC (Static/Dynamic, FDPIC ABI only)

    Populates a 32/64 bit location with a pointer to a canonical function descriptor created by the dynamic linker for globally visible symbols and the static linker otherwise.

    NOTE: PPC64 ELFv1 points symbol values directly at function descriptors but consistency with FR-V/Blackfin/SuperH/Arm favors this approach.

  • R_RISCV_FUNCDESC_VALUE (Static/Dynamic, FDPIC ABI only)

    Populates a 64/128 bit location with a copy of the canonical function descriptor.

    This is the relocation type used to support lazy binding if present in the relocation table pointed to by DT_JMPREL.

    NOTE: This could be used as a static relocation to populate an ia64-style vtable containing inline descriptors, however all function descriptor ABIs for architectures supported in LLVM use pointers to canonical descriptors in the vtable. This relocation type may also be used for lazy binding when referenced from DT_JMPREL.

  • R_RISCV_GOTGPREL_HI (Static, all GP ABIs)

    Nondeterministically pick an access method, which is one of GOT entry, GP-relative, PC-relative, or absolute. Absolute, GP-relative, and PC-relative can only be used for symbols which are resolved to a definition at static link time. Absolute requires that the symbol be absolute and within signed ±2GiB of zero. GP-relative requires that the symbol be within ±2GiB of __global_pointer$ in the data segment. PC-relative requires that the symbol be within ±2GiB of the relocation's offset in the code segment. PC-relative and absolute access methods can only be used if the relocation offset is even and points at a lui.

    If the access method is GOT entry, find or add an entry to the GOT which will, at runtime, contain the address of the relocation target. When generating a dynamically linked executable or shared library this will typically involve creating a R_RISCV_32 or R_RISCV_64 dynamic relocation.

    The offset of the relocation must be odd, even and point at a c.lui instruction, or even and point at a lui instruction. Other cases are reserved for future standard use.

    For the GOT entry access method and the GP-relative access method, the byte displacement from __global_pointer$ to the GOT entry or the target is divided by 4096, rounding to nearest with ties rounded up. The divided displacement is inserted in the immediate field of the lui or c.lui instruction. If the divided displacement cannot be represented in the immediate field or if the relocation offset is odd and the divided displacement is not zero, relocation fails.

    For the absolute access method, the absolute address of the target is divided and inserted into the instruction immediate as described in the previous paragraph.

    For the PC-relative access method, the displacement from the relocation offset to the target is divided and inserted into the immediate of the instruction, which also has its opcode rewritten from lui to auipc.

    The relocation may be paired with R_RISCV_RELAX. In this case, if the lui instruction is not replaced with an auipc it may be replaced with a c.lui (if RVC is available for relaxation), and a lui or c.lui may be deleted outright if it receives an immediate of 0.

  • R_RISCV_FUNCDESC_GOTGPREL_HI (Static, FDPIC ABI only)

    Find or create a GOT entry which will receive a canonical function descriptor for the target, which must be a function symbol with zero addend. Perform relocation and relaxation as for R_RISCV_GOTGPREL_HI with a forced access method of the chosen GOT entry.

  • R_RISCV_FUNCDESC_VALUE_GPREL_HI (Static, FDPIC ABI only)

    Find or create an aligned pair of GOT entries which will receive a function descriptor for the target, which must be a function symbol with zero addend. If the target lacks global visibility, the aligned pair will be the canonical function descriptor for the symbol. Perform relocation and relaxation as for R_RISCV_GOTGPREL_HI with a forced GP relative access method and target of the first chosen GOT entry.

  • R_RISCV_TLSDESC_GPREL_HI (Static, all GP ABIs but only useful when dynamic linking)

    Find or create a pair of GOT entries which will receive a TLS descriptor for the target, which must be a symbol in a SHF_TLS section, typically through creation of a R_RISCV_TLSDESC dynamic relocation. Perform relocation and relaxation as for R_RISCV_GOTGPREL_HI with an access method of GP relative and a target of the first GOT entry. May also be relaxed into an initial-exec or local-exec form as described elsewhere for TLSDESC (except for the presence of an add instruction).

    The calling convention of the TLS descriptor does not change; even if an object uses the FDPIC calling convention, the descriptor must ignore but preserve the gp value it is called with.

  • R_RISCV_TLS_GOTGPREL_HI (Static, all GP ABIs but only useful when dynamic linking)

    Find or create a GOT entry containing the TP offset for the target, which must be a symbol in a SHF_TLS section, typically through creation of a R_RISCV_TLS_TPREL32 or R_RISCV_TLS_TPREL64 dynamic relocation. Perform relocation and relaxation as for R_RISCV_GOTGPREL_HI with the GOT entry. May also be relaxed into local-exec as described elsewhere (except for the presence of an add instruction).

  • R_RISCV_PIC_ADD (Static, all GP ABIs; replaces R_RISCV_EPIC_BASE_ADD)

    The target of the relocation is used to locate another ("parent") relocation which must have the basic behavior of R_RISCV_GOTGPREL_HI. The offset of the relocation must be even and point to an add or c.add instruction with gp as one argument; all other cases are reserved for standard use.

    If the parent relocation deleted its lui instruction (only possible if the parent relocation is paired with R_RISCV_RELAX), delete the add or c.add instruction.

    If the parent relocation did not delete its lui instruction and its access method is GOT entry or GP-relative, no action is taken.

    If the parent relocation did not delete its lui instruction and its access method is absolute or PC-relative, rewrite the instruction into a c.mv or canonical mv instruction which copies the non-gp argument of the add to its result. If the resulting instruction would move a register to itself and the parent relocation is paired with R_RISCV_RELAX, the instruction may optionally be deleted instead.

    NOTE: R_RISCV_PIC_ADD relocations have no effect and can be omitted when the parent relocation is not paired with R_RISCV_RELAX and either points to a c.lui or has odd offset.

  • R_RISCV_INTERMEDIATE_LOAD (Static, all GP ABIs)

    The target of the relocation is used to locate another ("parent") relocation which must have the basic behavior of R_RISCV_GOTGPREL_HI. The offset of the relocation must be odd, or even and point to an I-type instruction; all other cases are reserved for standard use.

    If the offset is odd, the instruction is treated as deleted for the purposes of child relocations. No instruction modification is performed as an effect of the relocation, but relocation fails if the parent relocation access method is GOT entry.

    Otherwise, for all access methods, write the low 12 bits of the parent relocation computed displacement into the I-type immediate of the instruction. If the parent relocation deleted its lui instruction (not possible for the PC-relative access method), replace the rs1 register specifier with gp (for GOT entry or GP-relative) or zero (for absolute).

    If the parent relocation access method is not GOT entry, replace the opcode and funct3 of the instruction to convert it into an addi instruction. Relocation fails if the instruction was not previously a lw or ld determined by EI_CLASS. If the parent relocation is paired with R_RISCV_RELAX and rs1 was originally equal to rd, optionally delete the instruction instead.

    If the parent relocation is paired with R_RISCV_RELAX and RVC is available for relaxation, optionally replace the instruction with an equivalent compressed instruction.

  • R_RISCV_PCREL_LO12_I (Existing relocation; add R_RISCV_PIC_LO12_I as an alias)

    The target of the relocation is used to locate another ("parent") relocation. If the parent relocation has an existing type (only R_RISCV_PCREL_HI20 remains valid in GP ABIs), perform relocation as described currently.

    If the parent relocation has the basic behavior of R_RISCV_GOTGPREL_HI, the offset must be even and point to a lw or ld instruction. Perform relocation as for R_RISCV_INTERMEDIATE_LOAD except that the instruction may not be deleted, regardless of relaxation.

    What follows only applies if the parent relocation has type R_RISCV_INTERMEDIATE_LOAD. Find the ("grandparent") parent relocation of the parent relocation.

    If the parent relocation will not have a deleted instruction, no change is made to the instruction. Otherwise, continue by adding the low 12 bits of the grandparent relocation computed displacement to the 12-bit I-type immediate of the instruction at the relocation offset. Relocation fails if addition overflows and may fail if the addends have any bits in common. If the grandparent relocation will have a deleted instruction, replace the rs1 register specifier with gp (for GOT entry or GP relative access methods) or zero (for the absolute access method).

    If the grandparent relocation is paired with R_RISCV_RELAX and RVC is available for relaxation, optionally replace the instruction with an equivalent compressed instruction.

  • R_RISCV_PCREL_LO12_S (Existing relocation; add R_RISCV_PIC_LO12_S as an alias)

    The target of the relocation is used to locate another ("parent") relocation. If the parent relocation has an existing type (no defined cases as of writing remain valid in GP ABIs), perform relocation as described currently.

    If the parent relocation has type R_RISCV_INTERMEDIATE_LOAD, relocation and relaxation proceed as described for R_RISCV_PIC_LO12_I except for the use of the S-type immediate format.

  • R_RISCV_TLSDESC_LOAD_LO12 (Existing relocation)

    Extended to allow using the low 12 bits of the computed displacement of a parent relocation of type R_RISCV_TLSDESC_GPREL_HI. Replace rs1 with gp if the parent relocation deleted its instruction.

  • R_RISCV_TLSDESC_ADD_LO12 (Existing relocation)

    Extended to allow using the low 12 bits of the computed displacement of a parent relocation of type R_RISCV_TLSDESC_GPREL_HI. Replace rs1 with gp if the parent relocation deleted its instruction.

  • R_RISCV_CALL (Existing relocation)

    Becomes reserved for standard use in the GP-relative ABIs.

  • R_RISCV_CALL_PLT (Existing relocation)

    In addition to the auipc jalr sequence supported for PLT calls, we also recognize lui add/c.add lx lx jalr/c.jr sequences for no-PLT calls.

    All R_RISCV_CALL_PLT relocations may pass control through a linker-generated stub which clobbers registers equivalent to an eagerly bound PLT stub (t1 - t6).
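
The arithmetic shared by the `_HI` relocations and the `LO12` relocations that hang off them can be sketched in Python. This is a minimal model, not linker code: the function names are hypothetical, only the `lui` form is modelled (not `c.lui`), and the optional "addends have bits in common" failure is left out.

```python
def split_hi_lo(disp):
    """Divide a byte displacement by 4096, rounding to nearest with ties up.

    Returns (hi, lo) such that disp == (hi << 12) + lo and -2048 <= lo <= 2047.
    """
    hi = (disp + 0x800) >> 12   # floor((disp + 2048) / 4096)
    lo = disp - (hi << 12)      # signed remainder that fits a 12-bit immediate
    return hi, lo


def relocate_hi(disp, offset_is_odd):
    """Return the value for the lui immediate, or raise if relocation fails."""
    hi, _ = split_hi_lo(disp)
    if offset_is_odd and hi != 0:
        raise ValueError("odd offset requires a zero divided displacement")
    if not -(1 << 19) <= hi < (1 << 19):   # lui holds a signed 20-bit value
        raise ValueError("divided displacement does not fit the lui immediate")
    return hi


def fold_lo(existing_imm, disp):
    """Add the low part of disp to an existing 12-bit I/S-type immediate."""
    _, lo = split_hi_lo(disp)
    total = existing_imm + lo
    if not -2048 <= total <= 2047:
        raise ValueError("12-bit immediate overflow")
    return total


print(split_hi_lo(0x12345))   # (18, 837), i.e. (0x12, 0x345)
print(split_hi_lo(0x800))     # (1, -2048): the tie rounds up
```

Rounding to nearest rather than truncating is what keeps the low part inside the signed 12-bit range of a load, store, or `addi` immediate.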

9-type usage examples

All examples are given for RV64; RV32 may be obtained by replacing `ld` with `lw`.

lb, sb, la, lla,  Pseudoinstructions documented in riscv-asm-manual
la.tls.ie
la.fd, lla.fd     Materializes a pointer to a function descriptor (i.e. a C
                  function pointer) for a global or local symbol
llb, lsb          Like lb/sb but for local symbols
tlsdesc_call      Materialize tp-relative offset to a global dynamic TLS symbol
call_noplt        Like call but inlines PLT entry

### lb a0, symbol ###                  ### llb a0, symbol ###
0  lui a0, 0                           0  lui a0, 0
0     R_RISCV_GOTGPREL_HI symbol       0     R_RISCV_GOTGPREL_HI symbol
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add a0, gp                        4  c.add a0, gp
4     R_RISCV_PIC_ADD 0                4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)                        5     R_RISCV_INTERMEDIATE_LOAD 0
6     R_RISCV_INTERMEDIATE_LOAD 0      6  lb a0, 0(a0)
a  lb a0, 0(a0)                        6     R_RISCV_PIC_LO12_I 5
a     R_RISCV_PIC_LO12_I 6

### sb a1, symbol, a0 ###              ### lsb a1, symbol, a0 ###
0  lui a0, 0                           0  lui a0, 0
0     R_RISCV_GOTGPREL_HI symbol       0     R_RISCV_GOTGPREL_HI symbol
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add a0, gp                        4  c.add a0, gp
4     R_RISCV_PIC_ADD 0                4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)                        5     R_RISCV_INTERMEDIATE_LOAD 0
6     R_RISCV_INTERMEDIATE_LOAD 0      6  sb a1, 0(a0)
a  sb a1, 0(a0)                        6     R_RISCV_PIC_LO12_S 5
a     R_RISCV_PIC_LO12_S 6

### la/lla a0, symbol ###              ### la.fd a0, symbol ###
0  lui a0, 0                           0  lui a0, 0
0     R_RISCV_GOTGPREL_HI symbol       0     R_RISCV_FUNCDESC_GOTGPREL_HI symbol
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add a0, gp                        4  c.add a0, gp
4     R_RISCV_PIC_ADD 0                4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)                        6  ld a0, 0(a0)
6     R_RISCV_PIC_LO12_I 0             6     R_RISCV_PIC_LO12_I 0

### la.tls.ie a0, symbol ###           ### lla.fd a0, symbol ###
0  lui a0, 0                           0  lui a0, 0
0     R_RISCV_TLS_GOTGPREL_HI symbol   0     R_RISCV_FUNCDESC_VALUE_GPREL_HI symbol   
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add a0, gp                        4  c.add a0, gp
4     R_RISCV_PIC_ADD 0                4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)                        6  ld a0, 0(a0)
6     R_RISCV_PIC_LO12_I 0             6     R_RISCV_PIC_LO12_I 0

### call_noplt symbol, t2 ###          ### tlsdesc_call symbol, t2 ###
0  lui t2, 0                           0  lui a0, 0
0     R_RISCV_CALL_PLT symbol          0     R_RISCV_TLSDESC_GPREL_HI symbol
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add t2, gp                        4  c.add a0, gp
6  ld gp, 8(t2)                        4     R_RISCV_PIC_ADD 0
a  ld t2, 0(t2)                        6  ld t2, 0(a0)
e  c.jr t2                             6     R_RISCV_TLSDESC_LOAD_LO12 0
                                       a  addi a0, a0, 0
                                       a     R_RISCV_TLSDESC_ADD_LO12 0
                                       e  jalr t0, t2
                                       e     R_RISCV_TLSDESC_CALL 0
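
To make the `call_noplt` sequence above concrete, here is a small Python model of its two loads. It assumes the RV64 layout implied by the listing (entry address at offset 0, callee gp at offset 8) and little-endian memory. The `memory` buffer, the addresses, and the helper names are hypothetical.

```python
import struct


def make_descriptor(entry, callee_gp):
    """Pack a function descriptor as two little-endian 64-bit words."""
    return struct.pack("<QQ", entry, callee_gp)


def call_noplt(memory, gp, displacement):
    """Model lui/c.add/ld/ld; return (jump target, gp seen by the callee)."""
    t2 = gp + displacement                                  # lui t2 ; c.add t2, gp
    (new_gp,) = struct.unpack_from("<Q", memory, t2 + 8)    # ld gp, 8(t2)
    (target,) = struct.unpack_from("<Q", memory, t2)        # ld t2, 0(t2)
    return target, new_gp                                   # c.jr t2


# A toy data segment: 16 bytes of padding, then one descriptor.
memory = bytes(16) + make_descriptor(0x1000, 0x2000)
target, callee_gp = call_noplt(memory, gp=8, displacement=8)
print(hex(target), hex(callee_gp))   # 0x1000 0x2000
```

The gp load must come first because the second load overwrites `t2`, the register holding the descriptor address.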
4-type usage examples (inefficient, illustrative only, for small code model and global dynamic TLS)
### lb/llb a0, symbol ###              ### sb/lsb a1, symbol, a0 ###         
0  c.lui a0, 0                         0  c.lui a0, 0                      
0     R_RISCV_GOTGPREL_HI symbol       0     R_RISCV_GOTGPREL_HI symbol  
2  c.add a0, gp                        2  c.add a0, gp                   
4  ld a0, 0(a0)                        4  ld a0, 0(a0)                   
4     R_RISCV_PIC_LO12_I 0             4     R_RISCV_PIC_LO12_I 0
8  lb a0, 0(a0)                        8  sb a1, 0(a0)                   

### la/lla a0, symbol ###              ### la.fd/lla.fd a0, symbol ###
0  c.lui a0, 0                         0  c.lui a0, 0
0     R_RISCV_GOTGPREL_HI symbol       0     R_RISCV_FUNCDESC_GOTGPREL_HI symbol
2  c.add a0, gp                        2  c.add a0, gp
4  ld a0, 0(a0)                        4  ld a0, 0(a0)
4     R_RISCV_PIC_LO12_I 0             4     R_RISCV_PIC_LO12_I 0

### call_noplt symbol, t2 ###          ### tlsdesc_call symbol, t2 ###
0  lui t2, 0                           0  c.lui a0, 0
0     R_RISCV_CALL_PLT symbol          0     R_RISCV_TLSDESC_GPREL_HI symbol
4  c.add t2, gp                        2  c.add a0, gp
6  ld gp, 8(t2)                        4  ld t2, 0(a0)
a  ld t2, 0(t2)                        4     R_RISCV_TLSDESC_LOAD_LO12 0
e  c.jr t2                             8  addi a0, a0, 0
                                       8     R_RISCV_TLSDESC_ADD_LO12 0
                                       c  jalr t0, t2
                                       c     R_RISCV_TLSDESC_CALL 0
                                       
11-type relocation scheme (delta)

This saves one relocation entry for loads and stores when the GOT is known at assemble time to be unused. We also flatten `R_RISCV_INTERMEDIATE_LOAD` relocations, potentially saving a local symbol.

  • R_RISCV_GPREL_HI (New; Static, all GP ABIs)

    Acts exactly as R_RISCV_GOTGPREL_HI except that the GOT entry access method will not be used. Relocation shall fail if no other access method is possible.

  • R_RISCV_INTERMEDIATE_LOAD (Redefined; Static, all GP ABIs)

    The target of the relocation is used to locate another ("parent") relocation which must have the basic behavior of R_RISCV_GOTGPREL_HI. The offset of the relocation must be even and point to a lw (for ELFCLASS32) or ld (for ELFCLASS64) instruction; all other cases are reserved for standard use.

    If the parent relocation access method is not GOT entry, replace the instruction with an instruction that moves rs1 to rd.

    If the parent relocation access method is GOT entry, write the low 12 bits of the parent relocation computed displacement into the I-type immediate of the instruction. If the parent relocation has odd offset or deleted its lui instruction, replace the rs1 register specifier with gp.

    If the parent relocation is paired with R_RISCV_RELAX and RVC is available for relaxation, optionally replace the instruction with an equivalent compressed instruction or delete it if it has no effect.

  • R_RISCV_PIC_ADDR_LO12_I (New; Static, all GP ABIs)

    The target of the relocation is used to locate another ("parent") relocation which must have the basic behavior of R_RISCV_GOTGPREL_HI. The offset of the relocation must be even and point to a lw (for ELFCLASS32) or ld (for ELFCLASS64) instruction; all other cases are reserved for standard use.

    For all access methods, write the low 12 bits of the parent relocation computed displacement into the I-type immediate of the instruction. If the parent relocation has odd offset or deleted its lui instruction, replace the rs1 register specifier with gp.

    If the access method is not GOT entry, replace opcode and funct3 to convert the instruction into an addi.

    If the parent relocation is paired with R_RISCV_RELAX and RVC is available for relaxation, optionally replace the instruction with an equivalent compressed instruction or delete it if it has no effect.

  • R_RISCV_PIC_LO12_I (Modified)

    The target of the relocation is used to locate another ("parent") relocation. If the parent relocation has an existing type (only R_RISCV_PCREL_HI20 remains valid in GP ABIs), perform relocation as described currently. The following applies if the parent relocation has the basic behavior of R_RISCV_GOTGPREL_HI; all other new cases are reserved.

    If the parent relocation access method is not GOT entry, add the low 12 bits of the parent relocation computed displacement to the 12-bit I-type immediate of the instruction at the relocation offset. Relocation fails if addition overflows and may fail if the addends have any bits in common. If the parent relocation has odd offset or deleted its instruction, replace the rs1 register specifier with gp (for the GP relative access method) or zero (for the absolute access method).

    For all access methods, if the parent relocation is paired with R_RISCV_RELAX and RVC is available for relaxation, optionally replace the instruction with an equivalent compressed instruction or delete it if it has no effect.

  • R_RISCV_PIC_LO12_S (Modified)

    The target of the relocation is used to locate another ("parent") relocation. If the parent relocation has an existing type (no defined cases as of writing remain valid in GP ABIs), perform relocation as described currently. The following applies if the parent relocation has the basic behavior of R_RISCV_GOTGPREL_HI; all other new cases are reserved.

    If the parent relocation access method is not GOT entry, add the low 12 bits of the parent relocation computed displacement to the 12-bit S-type immediate of the instruction at the relocation offset. Relocation fails if addition overflows and may fail if the addends have any bits in common. If the parent relocation has odd offset or deleted its instruction, replace the rs1 register specifier with gp (for the GP relative access method) or zero (for the absolute access method).

    For all access methods, if the parent relocation is paired with R_RISCV_RELAX and RVC is available for relaxation, optionally replace the instruction with an equivalent compressed instruction or delete it if it has no effect.
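
The difference between R_RISCV_GOTGPREL_HI and R_RISCV_GPREL_HI comes down to which access methods the linker may choose from. The Python sketch below lists the permitted methods for one relocation. The `Symbol` record, the `place` and `even_lui` parameters, and the exact half-open ±2GiB bounds are assumptions made for illustration.

```python
from dataclasses import dataclass

TWO_GIB = 1 << 31


@dataclass
class Symbol:
    address: int
    segment: str                 # "abs", "data", or "code"
    defined_at_link_time: bool


def within_2gib(delta):
    return -TWO_GIB <= delta < TWO_GIB


def allowed_methods(sym, place, gp, even_lui, allow_got=True):
    """List the access methods a linker may pick for one _HI relocation.

    place:     address of the relocated instruction
    even_lui:  relocation offset is even and points at a lui
    allow_got: True for R_RISCV_GOTGPREL_HI, False for R_RISCV_GPREL_HI
    """
    methods = ["got"] if allow_got else []
    if sym.defined_at_link_time:
        if sym.segment == "data" and within_2gib(sym.address - gp):
            methods.append("gp-relative")
        if even_lui and sym.segment == "code" and within_2gib(sym.address - place):
            methods.append("pc-relative")
        if even_lui and sym.segment == "abs" and within_2gib(sym.address):
            methods.append("absolute")
    if not methods:
        raise ValueError("relocation fails: no access method is possible")
    return methods


var = Symbol(address=0x20000100, segment="data", defined_at_link_time=True)
print(allowed_methods(var, place=0x1000, gp=0x20000000, even_lui=True))
# ['got', 'gp-relative']
print(allowed_methods(var, place=0x1000, gp=0x20000000, even_lui=True, allow_got=False))
# ['gp-relative']
```

Passing `allow_got=False` models the ePIC property of never using a GOT: an undefined or out-of-range symbol then makes the relocation fail instead of silently falling back to a GOT entry.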

11-type usage examples
### lb a0, symbol ###                  ### llb a0, symbol ###
0  lui a0, 0                           0  lui a0, 0
0     R_RISCV_GOTGPREL_HI symbol       0     R_RISCV_GPREL_HI symbol
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add a0, gp                        4  c.add a0, gp
4     R_RISCV_PIC_ADD 0                4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)                        6  lb a0, 0(a0)
6     R_RISCV_INTERMEDIATE_LOAD 0      6     R_RISCV_PIC_LO12_I 0
a  lb a0, 0(a0)
a     R_RISCV_PIC_LO12_I 0

### sb a1, symbol, a0 ###              ### lsb a1, symbol, a0 ###
0  lui a0, 0                           0  lui a0, 0
0     R_RISCV_GOTGPREL_HI symbol       0     R_RISCV_GPREL_HI symbol
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add a0, gp                        4  c.add a0, gp
4     R_RISCV_PIC_ADD 0                4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)                        6  sb a1, 0(a0)
6     R_RISCV_INTERMEDIATE_LOAD 0      6     R_RISCV_PIC_LO12_S 0
a  sb a1, 0(a0)                     
a     R_RISCV_PIC_LO12_S 0

### lla a0, symbol ###
0  lui a0, 0
0     R_RISCV_GPREL_HI symbol
0     R_RISCV_RELAX
4  c.add a0, gp
4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)
6     R_RISCV_PIC_ADDR_LO12_I 0

### la a0, symbol ###                  ### la.fd a0, symbol ###
0  lui a0, 0                           0  lui a0, 0
0     R_RISCV_GOTGPREL_HI symbol       0     R_RISCV_FUNCDESC_GOTGPREL_HI symbol
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add a0, gp                        4  c.add a0, gp
4     R_RISCV_PIC_ADD 0                4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)                        6  ld a0, 0(a0)
6     R_RISCV_PIC_ADDR_LO12_I 0        6     R_RISCV_PIC_ADDR_LO12_I 0

### la.tls.ie a0, symbol ###           ### lla.fd a0, symbol ###
0  lui a0, 0                           0  lui a0, 0
0     R_RISCV_TLS_GOTGPREL_HI symbol   0     R_RISCV_FUNCDESC_VALUE_GPREL_HI symbol   
0     R_RISCV_RELAX                    0     R_RISCV_RELAX
4  c.add a0, gp                        4  c.add a0, gp
4     R_RISCV_PIC_ADD 0                4     R_RISCV_PIC_ADD 0
6  ld a0, 0(a0)                        6  ld a0, 0(a0)
6     R_RISCV_PIC_ADDR_LO12_I 0        6     R_RISCV_PIC_ADDR_LO12_I 0
unedited and obsolete development notes

The minimum implementation of FDPIC without TLS needs 3 relocations:

| Name | Description |
| --- | --- |
| FUNCDESC | Dynamic relocation. Pointer to a canonical function descriptor in initialized data. |
| GOTGPREL_HI20 | First step of referencing a data object. Following are an add, a load with PCREL_I12, and then the final access. |
| GOTGPREL_FUNCDESC_HI20 | First step of referencing a function. Following are an add and a load with PCREL_I12. |

We want cross-module calls to proceed directly from the GOT and C++ method calls to proceed directly from the vtable, without an additional indirection through a canonical function descriptor. That brings us up to 4 for PLT/vtables and 5 to also support -fno-plt:

| Name | Description |
| --- | --- |
| FUNCDESC | Dynamic relocation. Pointer to a canonical function descriptor in initialized data. |
| FUNCDESC_VALUE | Dynamic relocation. Copy of the canonical function descriptor. |
| GOTGPREL_HI20 | First step of referencing a data object. Following are an add, a load with PCREL_I12, and then the final access. |
| GOTGPREL_FUNCDESC_HI20 | First step of referencing a function. Following are an add and a load with PCREL_I12. |
| GOTGPREL_FUNCDESC_VALUE_HI20 | First step of referencing a function without a canonical descriptor. Following are an add, two PCREL_I12 loads (one gets an addend), and a jump. |

(The PLT sequence requires a fence r,r to support multithreaded lazy binding, but that is not an issue with -fno-plt or -z now.)

If an object is known to be local and part of the data segment at link time, the GOTGPREL_HI20/PCREL_I12 can be rewritten to directly refer to the object and replace the load with an addi.

The only remaining case of an unnecessary load is local objects in the code segment. It is not possible to rewrite the GOTGPREL_HI20 into an auipc because the add is untagged and because the rewrite is impossible if the HI20 was duplicated by compiler transformations (due to the auipc/aluipc issue). Let's fix that, now we have 7 relocations and minimal loads:

| Name | Description |
| --- | --- |
| FUNCDESC | Dynamic relocation. Pointer to a canonical function descriptor in initialized data. |
| FUNCDESC_VALUE | Dynamic relocation. Copy of the canonical function descriptor. |
| GOTGPREL_NONUNIQUE_HI20 | First step of referencing a data object. May have been duplicated. Following are a PIC_ADD, a load with PCREL_I12, and then the final access. |
| GOTGPREL_HI20 | First step of referencing a data object. Following are a PIC_ADD, a load with PCREL_I12, and then the final access. |
| PIC_ADD | Polymorphic relocation used to mark adds that need to be deleted or converted into a nop when transforming GOT-relative to PC-relative accesses. |
| GOTGPREL_FUNCDESC_HI20 | First step of referencing a function. Following are an add and a load with PCREL_I12. |
| GOTGPREL_FUNCDESC_VALUE_HI20 | First step of referencing a function without a canonical descriptor. Following are an add, two PCREL_I12 loads (the second gets an addend), and a jump. |

If the non-gp source and destination of the PIC_ADD are not the same, it must be replaced with a move instead of deleted. It might be possible to, instead, replace the add with the auipc, and delete the lui entirely. This would avoid the move but prevent the add from being compressed in the object file.
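
The delete-versus-move decision can be written out as a small decision function. This Python sketch follows the R_RISCV_PIC_ADD rules from the 9-type scheme, not the alternative of replacing the add with the auipc. The function name, the string labels, and the choice to always take the optional deletion are assumptions.

```python
def pic_add_action(access_method, parent_deleted_lui, relax, rd, rs):
    """Decide what happens to `add rd, rs, gp` tagged with PIC_ADD.

    access_method: "got", "gp-relative", "pc-relative", or "absolute"
    rd, rs:        destination and non-gp source register names
    Returns "delete", "keep", or "mv".
    """
    if access_method not in ("got", "gp-relative", "pc-relative", "absolute"):
        raise ValueError("unknown access method")
    if parent_deleted_lui:
        return "delete"    # later accesses use gp (or zero) directly
    if access_method in ("got", "gp-relative"):
        return "keep"      # gp really is the base: leave the add alone
    # pc-relative or absolute: gp must no longer be added in
    if rd == rs and relax:
        return "delete"    # a self-move; deletion is optional, taken here
    return "mv"            # mv rd, rs preserves the data flow


print(pic_add_action("pc-relative", False, True, "a0", "a0"))   # delete
print(pic_add_action("pc-relative", False, True, "a1", "a0"))   # mv
print(pic_add_action("gp-relative", False, False, "a0", "a0"))  # keep
```

Without relaxation the instruction count cannot change, so a self-move stays in place as a `mv` rather than being removed.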

We have minimum loads but there are still two cases of overlong sequences after relaxation.

If the GOT entry or symbol is ±2KiB of gp, the lui immediate will be zero; in this case we can rewrite the PCREL_I12 to use gp as a source and delete both the GOTGPREL_HI20 and the PIC_ADD without additional relocations. Supporting this for functions requires using PIC_ADD with GOTGPREL_FUNCDESC_HI20 and GOTGPREL_FUNCDESC_VALUE_HI20.
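
As a rough illustration of this relaxation, the Python sketch below emits either the full three-instruction sequence or the single gp-based load, depending on whether the upper part of the displacement is zero. The function name, the `t1` default, and the textual instruction format are hypothetical. The precise window is taken to be -2048 to 2047 bytes from gp, which follows from the signed 12-bit immediate.

```python
def relax_gp_access(disp, rd="t1"):
    """Return the instruction sequence loading from gp + disp."""
    hi = (disp + 0x800) >> 12   # value for the lui immediate
    lo = disp - (hi << 12)      # value for the load immediate
    if hi == 0:
        # lui would load zero: drop lui and add, and read through gp
        return [f"lw {rd}, {lo}(gp)"]
    return [
        f"lui {rd}, {hi & 0xFFFFF:#x}",   # 20-bit field, shown unsigned
        f"add {rd}, {rd}, gp",
        f"lw {rd}, {lo}({rd})",
    ]


print(relax_gp_access(100))    # ['lw t1, 100(gp)']
print(relax_gp_access(2048))   # ['lui t1, 0x1', 'add t1, t1, gp', 'lw t1, -2048(t1)']
```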

If a local address is materialized only to perform access using integer and floating point loads/stores that take an imm12, we should be able to delete the addi (which was originally the PCREL_I12 load) and fold the offset into the dependent accesses. This requires a new relocation for tagging and, again, requires that the source and destination register are the same.

| Name | Description |
| --- | --- |
| FUNCDESC | Dynamic relocation. Pointer to a canonical function descriptor in initialized data. |
| FUNCDESC_VALUE | Dynamic relocation. Copy of the canonical function descriptor. |
| GOTGPREL_NONUNIQUE_HI20 | First step of referencing a data object. May have been duplicated. |
| GOTGPREL_HI20 | First step of referencing a data object. |
| PIC_ADD | Polymorphic relocation used to mark adds that need to be deleted or converted into a nop when transforming GOT-relative to PC-relative accesses. |
| PIC_LO12_I, PIC_LO12_S | Aliases for PCREL_LO12_I, PCREL_LO12_S |
| INTERMEDIATE_LOAD | Load from a GOT entry, used only for subsequent accesses with an imm12. |
| GOTGPREL_FUNCDESC_HI20 | First step of referencing a function. |
| GOTGPREL_FUNCDESC_VALUE_HI20 | First step of referencing a function without a canonical descriptor. |

# void (*fun)() = abort;
   .long 0        | FUNCDESC(abort)
# PLT or vtable entry
   .long 0, 0     | FUNCDESC_VALUE(foo::bar)
# abort()
   auipc ra, 0    | CALL_PLT(abort)
   jalr ra
# abort(), -fno-plt
1: lui t1, 0      | GOTGPREL_FUNCDESC_VALUE_HI20(abort)
   add t1, t1, gp | PIC_ADD(1)
   lw t2, 0(gp)   | PIC_12I(1 + 0)
   lw gp, 0(gp)   | PIC_12I(1 + 4)
   jalr t2
# extern int whatever; whatever++;
1: lui t1, 0      | GOTGPREL_HI20(whatever) or GOTGPREL_NONUNIQUE_HI20(whatever)
   add t1, t1, gp | PIC_ADD(1)
2: lw t1, 0(t1)   | INTERMEDIATE_LOAD(1)
   lw t2, 0(t1)   | PIC_12I(2)
   addi t2, t2, 1 |
   sw t2, 0(t1)
# return &whatever;
1: lui t1, 0      | GOTGPREL_HI20(whatever)
   add t1, t1, gp | PIC_ADD(1)
   lw a0, 0(t1)   | PIC_12I(1)
   ret

Wait, oops, we need TLS. Fortunately, the subsequent relocations are threaded through the first, so they can be made polymorphic.

| Name | Description |
| --- | --- |
| FUNCDESC | Dynamic relocation. Pointer to a canonical function descriptor in initialized data. |
| FUNCDESC_VALUE | Dynamic relocation. Copy of the canonical function descriptor. |
| NONUNIQUE_GOTGPREL_HI20 | First step of referencing a data object. May have been duplicated. |
| GOTGPREL_HI20 | First step of referencing a data object. |
| PIC_ADD | Polymorphic relocation used to mark adds that need to be deleted or converted into a nop when transforming GOT-relative to PC-relative accesses. |
| PIC_LO12_I, PIC_LO12_S | Aliases for PCREL_LO12_I, PCREL_LO12_S. |
| INTERMEDIATE_LOAD | Load from a GOT entry, used only for subsequent accesses with an imm12. |
| FUNCDESC_GOTGPREL_HI20 | First step of referencing a function. |
| FUNCDESC_VALUE_GOTGPREL_HI20 | First step of referencing a function without a canonical descriptor. |
| TLS_GOTGPREL_HI20 | For initial-exec. |
| TLS_GD_GOTGPREL_HI20 | For dynamic modes without TLSDESC. |
| TLSDESC_GOTGPREL_HI20 | For dynamic modes with TLSDESC. |

PLT32 and GOT32_PCREL are used to support relative vtables, which is unfortunately impossible with the FDPIC calling convention since the callee GOT must be reachable from the vtable. 32_PCREL appears to be used for DWARF and is unaffected by FDPIC.

Our natural code model is a large one, supporting any amount of code and data as long as the GOT is smaller than 4 GiB. If that requirement is ever exceeded, a larger model can be defined by allowing XXX_GOTGPREL_HI20 relocations to point at an auipc/ld pair instead of a lui, resulting in a linker-generated constant island if not relaxed.

That's 11 new relocations for reasonably optimal FDPIC code generation, 1 of which overlaps with the ePIC proposal, assuming that relaxation is in use. What if we want to get close to optimal in the -mno-relax case (which excludes length changing relaxations but not semantically needed rewrites)? Relaxations depend on the size of the gp-relative segment (code model), whether symbols are external to the module or local (ELF visibility attribute), and whether symbols are in the code or data segment (not always knowable at compile time).

A "smallish FDPIC" supporting a GOT up to 256 KiB or a "small FDPIC" with total writable data up to 256 KiB can be handled by generating a c.lui instead of a lui for XXX_GOTGPREL_HI20. This is a misnomer. A "tiny FDPIC" with a 4 KiB limit is possible, but duplicating all of the XXX_GOTGPREL_HI20 relocs is most undesirable; instead, let's define a GOTGPREL_HI20 with an odd offset as zero length. In the zero-length case no add and no PIC_ADD relocation are generated.
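The 256 KiB figure follows from the c.lui encoding, which carries a non-zero, sign-extended 6-bit immediate for bits 17:12. A small sketch of the reach check, assuming the same hi20/lo12 rounding that lui/addi pairs use; the name `fits_c_lui` is hypothetical:

```python
C_LUI_MIN, C_LUI_MAX = -32, 31  # signed 6-bit immediate of c.lui


def fits_c_lui(offset):
    """True if the upper part of a gp-relative offset is encodable in c.lui.

    c.lui cannot encode a zero immediate (and rd must not be x0 or x2);
    a zero upper part is the case where the instruction is dropped
    entirely, so it is reported as not needing c.lui here.
    """
    hi = (offset + 0x800) >> 12  # standard hi20 rounding
    return hi != 0 and C_LUI_MIN <= hi <= C_LUI_MAX


# Offsets reachable with either no upper instruction or a c.lui:
LOWEST = (C_LUI_MIN << 12) - 2048    # -133120
HIGHEST = (C_LUI_MAX << 12) + 2047   # 129023
SPAN = HIGHEST - LOWEST + 1          # 262144 bytes = 256 KiB
```

Under these assumptions the reachable window is asymmetric around gp, but its total size is exactly 256 KiB.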

TODO: no-relax codegen bypassing GOT, applicability to non-FDPIC calling convention

@sorear
Collaborator

sorear commented Feb 27, 2024

Started a draft specification of my own at #429. For now, the relocation section is a cut and paste of the above comment.

@luismarques
Collaborator Author

> I came up with a relocation scheme that supports almost all of the useful features for a FDPIC ABI:

Thanks for sharing this. I'm a bit under the weather right now, I'll provide some commentary very soon.

@luismarques
Collaborator Author

luismarques commented Mar 5, 2024

> Started a draft specification of my own at #429. For now, the relocation section is a cut and paste of the above comment.

Thanks for writing down your proposal. I am OK with going with this alternative approach, assuming other people also support it. I think it's important that we find a path forward, and being able to generalize to a full FDPIC is quite nice.

A couple of questions:

  • If we assume linker relaxations are enabled and fully implemented, how different are the local (e.g. llb) and non-local (e.g. lb) instruction sequences after linking, really? If people are concerned about the scarce number of relocation values available, is it worth it to define e.g. R_RISCV_GPREL_HI if the non-local sequence can always be relaxed to the local one?

  • Are you planning to implement your proposal in LLVM? I ask to avoid duplicate efforts.

@luismarques luismarques marked this pull request as draft March 5, 2024 11:23
@sorear
Collaborator

sorear commented Mar 11, 2024

> If we assume linker relaxations are enabled and fully implemented, how different are the local (e.g. llb) and non-local (e.g. lb) instruction sequences after linking, really?

They aren't. I thought that it was necessary to avoid worse performance for -mno-relax ePIC, but I'll defer to you on that.

> Are you planning to implement your proposal in LLVM?

Yes.

@luismarques
Collaborator Author

> They aren't. I thought that it was necessary to avoid worse performance for -mno-relax ePIC, but I'll defer to you on that.

For my part, it's OK to rely on relaxations to achieve good code size.
