Proposal for RISC-V Embedded PIC (ePIC) specification #343
This is cool. |
So does that mean all symbol references must be within ±2^31 bytes of gp and pc? Under the compact spec, if rodata or anything else is placed far away from pc and gp, we can at least use got_gprel to access it. That case does not seem to work under the ePIC model, though. I think it is probably worth mentioning the valid access range (or limitations) for ePIC in the spec.
This design seems better than compact. Compact usually generates %got_gprel for these cases, but since we chose not to chain the related instructions the way pcrel_lo does (we just refer to the target symbol directly), the last load/store instruction of %got_gprel cannot be relaxed, so we get bad code as follows,
By comparison, since ePIC chains the low instructions to the high one, the relaxation seems perfect and leaves no redundant load/store. But just to make sure of one thing: does that mean we should always do the ePIC relaxations, just like the alignment, even if -mno-relax is given?
I just discussed this with @palmer-dabbelt, and had discussed a similar thing with my SiFive partners when implementing compact. gcc (not sure what clang will do) considers the section alignment to decide whether the LUI can be shared by the later low instructions. We may hit the problem that gp is not aligned in the GNU default linker script, or in a custom linker script. I'm not sure whether we should update the related sharing checks (or disallow sharing %gprel_hi) in gcc/clang, or just make sure gp is aligned in the linker. The latter probably needs to be mentioned in the spec, but the former doesn't. In general, I support ePIC since it has a better relaxation design than compact, but there are some minor issues that need to be confirmed. Thanks. |
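To make the range question concrete, here is a minimal sketch of how a signed offset from gp (or pc) splits into a 20-bit high part and a 12-bit low part, and when that split is representable. It assumes RV64-style sign extension of the `lui`/`auipc` immediate; on RV32 the arithmetic wraps, so the whole address space is reachable. The names `split_hi_lo` and `reachable` are illustrative and do not come from the spec.

```python
def split_hi_lo(offset: int) -> tuple[int, int]:
    """Split a signed byte offset into (hi20, lo12) with
    (hi20 << 12) + lo12 == offset and lo12 in [-2048, 2047]."""
    # The low part is sign-extended by the load/store/addi, so round
    # the high part to the nearest 4 KiB multiple.
    lo12 = ((offset + 0x800) & 0xFFF) - 0x800
    hi20 = (offset - lo12) >> 12
    return hi20, lo12


def reachable(offset: int) -> bool:
    """True if hi20 fits the signed 20-bit immediate of lui/auipc
    (assumption: RV64, where the result is sign-extended, not wrapped)."""
    hi20, _ = split_hi_lo(offset)
    return -(1 << 19) <= hi20 < (1 << 19)
```

Under these assumptions the reachable window is slightly skewed: offsets from -2^31 - 2^11 up to 2^31 - 2^11 - 1, rather than a symmetric ±2^31.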
Yes, that's the range currently specified:
That phrasing could be improved to make it clearer, if needed. We could also imagine extending this for RV64 (either unconditionally or through a code model option), but given the intended use case of this ABI it's not clear there's a need for that.
The
Thanks for the detailed feedback! |
This is coupling a lot of things together: some new user interface, new relocations, and a new ABI. The core goal around having two addressable ranges (one PC-based, one GP-based) seems broadly useful, some users even take advantage of the defacto second range that relying on auipc->GP relaxation gives you (though it's not super useful because it's only 12 bits). IMO splitting this out into a handful of distinct features will make this easier to define/implement and more useful. My worry here is really the More concretely, I'd structure this as follows:
I think that would be sufficient to let users make binaries that behave the same way, but without all that extra coupling we'll have an easier time implementing it and we'd get more than one use case out of it. |
Thank you for the detailed feedback, Palmer!
Sorry to nitpick but, to reiterate the PR summary, the user interface is (almost surely) out-of-scope of the psABI repo, and it's only included here for convenience, to give an overview of how this new ABI would be used in practice. I also don't see the distinction between "relocations" and "ABI", my understanding is that the former is part of the latter. In any case, I appreciate the overall sentiment, and it's good to discuss if it's possible to decouple some of these concerns without introducing other issues.
My understanding is that PC-rel to GP-rel relaxation currently assumes that the two ranges don't move with respect to each other, so in a sense we are talking about "two ranges" in significantly different ways.
Could you please elaborate on this a little bit? While I think I understand what your overall point is, some details aren't quite clear to me. E.g.:
Is it really possible to ensure that those always match, without introducing significant pessimizations, compatibility concerns or very intrusive toolchain changes? I was assuming that a linker rewrite would be inevitable but if you think otherwise or have tips for how this might be solvable please do share.
If I understand your point correctly, the following solution would fall broadly within your guidance, right?
I'm not necessarily arguing against such an approach, but some counterpoints to keep in mind:
Hopefully, I didn't strawman your proposal. If so, I apologise. I kindly ask all of you to chime in about this:
|
@luismarques presented ePIC in today's psABI call; here are the slides: https://docs.google.com/presentation/d/1gjZaDW8F_lgZiXnyUXd3IrxY9AXYDbXsCR1NrPYkHaM/edit#slide=id.g168f245c672_0_6 |
Haven't reviewed this in great detail but I'm very pleased that it exists and am happier with it superficially than I was with compact.
AUIGP won't save bytes until the gp-relative data segment exceeds 256KiB, since lui and add are both compressible.
Relocation processing needs to handle the fact that "3-instruction sequences" are dynamic; various compiler optimizations can cause the early instructions to be duplicated, the late instructions to be duplicated, the instructions to be out of order in the file, interleaved with other relocations, etc, etc; trying to match relocations with one another when not explicitly threaded is a losing proposition.
The final code size extension will likely have some form of gp-relative loads and stores; at least Andes already has them as a vendor extension. Relaxing 3-instruction sequences to lwgp/swgp is challenging because of the previous point; we don't know, from the LUI/ADD instructions and their relocations alone, whether all of the corresponding load and store instructions are eligible for relocation, so we might need to add new relocations to convey the "all possible users of this are {sw,sd,lw,ld}" information. Which might involve a degree of coordination between the development of this extension and that of the code size reduction extension. I am fairly certain the relaxation code in plctlab/llvm-project@riscv-zce-llvm14 gets this wrong and can delete non-dead instructions if you do anything nontrivial with code ordering.
FDPIC is, first and foremost, a calling convention, which this does not change at all. However, the process of loading an ePIC image (each program header is an independent non-FIXED mmap) is identical to what is needed for FDPIC, so they can probably share a single ELF flag for the purposes of binfmt_elf_fdpic, and be treated as two incompatible ABIs beyond that. |
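The 256KiB figure above follows from the `c.lui` encoding, which carries a nonzero 6-bit signed immediate in units of 4 KiB. A small sketch of that arithmetic, assuming a 4-byte AUIGP (the proposed instruction, size assumed here) and ignoring the register restrictions on `c.lui`/`c.add`; the function names are illustrative only.

```python
C_LUI_MIN, C_LUI_MAX = -32, 31          # 6-bit signed immediate, 4 KiB units
C_LUI_SPAN_BYTES = (C_LUI_MAX - C_LUI_MIN + 1) * 4096

AUIGP_SIZE = 4  # assumption: the proposed AUIGP is a 32-bit instruction


def fits_c_lui(hi20: int) -> bool:
    """c.lui cannot encode zero, and only covers hi20 in [-32, 31]."""
    return hi20 != 0 and C_LUI_MIN <= hi20 <= C_LUI_MAX


def lui_add_size(hi20: int) -> int:
    """Bytes for `lui rd, hi20; c.add rd, gp`, compressing lui when possible.
    (A zero hi20 would normally be relaxed away; that is not modelled here.)"""
    lui_bytes = 2 if fits_c_lui(hi20) else 4
    return lui_bytes + 2  # c.add is always 2 bytes
```

With these assumptions the compressed pair costs the same 4 bytes as AUIGP anywhere inside the 256 KiB window around gp, and only grows to 6 bytes outside it.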
Since so far we have not made significant progress with alternative proposals, shall we move forward with the general approach proposed in this PR? If so, we can start focusing the discussion on the various details of this proposal. If you believe an alternative proposal would be preferable (either one that has already been mentioned in this PR discussion or a different one) please say so now. For this discussion not to stall when evaluating alternatives we'll need either reasonably detailed counter-proposals or a quick feedback cycle to address any raised questions or issues. Thank you all for your input. |
I don't think the silence should be regarded as "people think this is the right thing and thus haven't proposed alternatives", rather it's "people have been busy with other things and not thought about this further". As a group we've been focused on getting 1.0 done up until recently, not on big new things like ePIC. My two concerns are:
1 is somewhat inevitable but in as much as we can align the ABIs (e.g. FDPIC and ePIC using the same sequences for certain things) we should just to keep things less crazy for implementers and users. 2 is a real issue, especially given 25% of the space is reserved for custom extensions and almost another 25% is already allocated, so again where possible relocations should be general rather than overly tailored to a specific ABI variant (e.g. again, relocations should be general between FDPIC and ePIC where they have overlap in code sequences). The need to rewrite instruction sequences beyond just filling in immediate bits at link time is also a bit ugly; @MaskRay may have opinions on that from an LLD perspective. However, I can see how you've ended up where you have, don't immediately have a better proposal and don't want to block solving a real problem by commenting on things I don't overly like on idealistic grounds. |
Sorry for not responding to this in public for a long time. I've talked with @luismarques in the psABI meeting and by off-list mail several times, but I think I really owe a public response here. In short: In long version: The large code model (literal-pool style) just does everything conservatively. It works in such scenarios but costs much more in code size and performance, and it can't provide position-independent code in general. The compact code model proposal tries to fix a similar issue but takes a different path: it has to predefine which region a symbol or section is located in, and must let the compiler know (so it can emit the right address-loading code sequence). However, it's hard to deal with customized sections specified by attribute, which are common in embedded programs. The original design didn't intend to handle that (on the contrary, it tried to optimize for it), but we found it to be a real challenge when implementing it, and we didn't find a good solution. So back to the ePIC code model: I think it's a better solution to the problem in the embedded world. It defers the address classification to link time via the new EPIC relocations instead of trying to let the compiler resolve it. One concern with ePIC is that it introduces 8 new relocations, but IMO the first 4 relocations (the GP-related ones) could be used by other code models in the future, like FDPIC, so there are really only 4 relocations dedicated to ePIC. Just one comment here: we could rename the And I think there is no perfect code model that fits every scenario well, but the ePIC code model is a good one for the embedded scenarios that we can't handle well with existing code models, so I support this. |
I was one of the main people implementing the compact code model when I was at SiFive. To describe this in detail: compact always needs to generate the got_gprel patterns, which are similar to the ePIC patterns, but they cannot be relaxed to the gprel patterns since the low instruction doesn't refer to the corresponding high one. That's a flaw of compact, but it also reflects that we didn't want to create a second pcrel_lo case back then. From my side, apart from this issue, compact and ePIC are similar, so ePIC is an enhancement of compact. SiFive should have supported the compact code model for almost two years by now, and should have customers for it, so it is a real need in the embedded world. I support compact as usual, so I will also support its enhancement, the ePIC code model, and hope to see ePIC upstream. |
This proposal is more like uClinux BFLT without shared library support.
@jrtc27 raised a good question. IIUC this proposal introduces the two sets of relocations:
I believe the GP-relative relocations are non-controversial. The ePIC relocations could use more debate. The unknown segment arises because we cannot determine rodata vs data/bss for non-definition declarations in the general case
I agree that the linker implementation can be dirty. The linker has to figure out whether the referenced section and GP share the same PT_LOAD segment. FDPIC ABIs avoid the "unknown segment" problem by using a GOT code sequence. If we introduce FDPIC in the future, I feel that we can re-implement ePIC relocations using R_RISCV_GPREL_GOT_HI20 optimization/relaxation. Though we probably should not do it, as the return-on-investment may not be sufficiently large:
label:
lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
c.add t0, gp # we probably need a marker relocation to remove this instruction for relaxation
l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
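The "same PT_LOAD segment as GP" check mentioned above could look roughly like the following. This is only a sketch under simplifying assumptions: segments are modelled as plain (vaddr, memsz) pairs, and `Segment`, `find_segment`, and `choose_base` are hypothetical names, not anything from an actual linker.

```python
from typing import NamedTuple, Optional, Sequence


class Segment(NamedTuple):
    vaddr: int   # link-time start address of the PT_LOAD segment
    memsz: int   # size in memory

    def contains(self, addr: int) -> bool:
        return self.vaddr <= addr < self.vaddr + self.memsz


def find_segment(segments: Sequence[Segment], addr: int) -> Optional[Segment]:
    """Return the segment containing addr, or None if it is unmapped."""
    return next((s for s in segments if s.contains(addr)), None)


def choose_base(segments: Sequence[Segment], sym_addr: int, gp_addr: int) -> str:
    """Pick 'gp' when the symbol moves together with gp at load time,
    otherwise 'pc' (it then moves together with the code)."""
    gp_seg = find_segment(segments, gp_addr)
    sym_seg = find_segment(segments, sym_addr)
    if gp_seg is None or sym_seg is None:
        raise ValueError("address not covered by any PT_LOAD segment")
    return "gp" if sym_seg == gp_seg else "pc"
```

A real linker would also have to handle absolute symbols, weak undefined symbols, and layouts with more than two independently loaded segments, none of which this sketch attempts.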
|
I think one advantage of ePIC is that it tries to avoid introducing a GOT, which could reduce some runtime overhead and code size; the extra load (in the GOT scheme) could be removed by linker relaxation, but that still leaves a longer instruction sequence.
Load variable from symbol
GOT scheme
label:
lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
c.add t0, gp # R_RISCV_GPREL_ADD var
l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
l[w|d] t0, 0(t0) # load the actual value from symbol
Current ePIC scheme
label:
lui t0, %epic_high(var) # EPIC_HI20 (var)
c.add t0, gp # EPIC_BASE_ADD var
l[w|d] t0, %epic_low(label)(t0) # EPIC_LO12_I label
Load address
GOT scheme
label:
lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
c.add t0, gp # we probably need a marker relocation to remove this instruction for relaxation
l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
Current ePIC scheme
label:
lui t0, %epic_high(var) # EPIC_HI20 (var)
c.add t0, gp # EPIC_BASE_ADD var
addi t0, t0, %epic_low(label) # EPIC_LO12_I label |
BFLT-like "single global data base register" ABIs do have a slight code size advantage, but only work for static linking and only really make sense for the most tiny bare metal settings. Of course you can try to implement a custom idea of dynamic linking on top of that, which several archs' BFLT variants did, but this quickly turns into reinventing everything in ELF and inevitably doing it really badly. As such, musl libc (I'm commenting from the standpoint of author/maintainer) has never supported this kind of thing and has only supported either entirely-nonshareable normal ELF PIE or ELF FDPIC binaries for MMU-less systems. I would really like to see a real FDPIC ABI for riscv that we could support, even if this ePIC thing is sufficient for certain bare-metal uses you have in mind. |
Also note that, if a data object is known to be local to the same DSO (if it's static or hidden or if you're building with compiler flags indicating that everything will be static linked), FDPIC data references do not have to go through the GOT. They can be relative to the FDPIC GOT register (which need not actually point to a GOT, just some anchor in the data segment). This also works for shared libraries - because the callee library's FDPIC GOT register is loaded when calling into it, the library function is able to access its own static/hidden data GOT-relative. This makes everything almost as cheap as the single-global-pointer system you want in ePIC in the case where it's static linked, but much more general, where efficiency only degrades gracefully as you use more features (dynamic linking). |
If the intention is to merge this in pieces, could this be turned into a draft to reduce confusion? Please don't provide a YouTube link as the sole definition of FDPIC; there are plenty of written explanations. I came up with a relocation scheme that supports almost all of the useful features for a FDPIC ABI:
The relocation scheme can also be used when statically linking and using the normal calling convention; call this the ePIC subset. The most economical scheme has 9 new relocation types on top of v1.0 + TLSDESC, although a scheme with 11 new relocations is simpler in several ways. The 11-type scheme has a 2-type subset which supports ePIC operations; the two types correspond to the #405 relocation types, although they are described very differently to account for the larger context. Neither scheme uses the #394 relocations at all. FDPIC relocation needs
9-type relocation scheme
Non-TLSDESC global dynamic TLS is not supported. No special provision is made to distinguish rematerializable from non-rematerializable addressing sequences, although compilers may treat addressing sequences as rematerializable if they are known to not be in the code segment. Omitting I've gone back and forth several times over exactly which transformations should be permitted without RELAX. The current rules allow us to express the "use PCREL or GPREL but never use a GOT" property of ePIC and also allow the use of code models in non-relaxed FDPIC, but in the default FDPIC model do not allow rematerialization or omission of Requiring A full FDPIC proposal would include, in addition to the relocations and relaxations described here, a precise definition of the calling convention, ELF flags and attributes, the list of code models, and sibling PRs to asm-manual and c-api.
9-type usage examples
All examples are given for RV64; RV32 may be obtained by replacing `ld` with `lw`.
4-type usage examples (inefficient, illustrative only, for small code model and global dynamic TLS)
11-type relocation scheme (delta)
This saves one relocation entry for loads and stores when the GOT is known at assemble time to be unused. We also flatten `R_RISCV_INTERMEDIATE_LOAD` relocations, potentially saving a local symbol.
11-type usage examples
unedited and obsolete development notes
The minimum implementation of FDPIC without TLS needs 3 relocations:
We want cross-module calls to proceed directly from the GOT and C++ method calls to proceed directly from the vtable, without an additional indirection through a canonical function descriptor. That brings us up to 4 for PLT/vtables and 5 to also support -fno-plt:
(The PLT sequence requires a If an object is known to be local and part of the data segment at link time, the GOTGPREL_HI20/PCREL_I12 can be rewritten to directly refer to the object and replace the load with an addi. The only remaining case of an unnecessary load is local objects in the code segment. It is not possible to rewrite the GOTGPREL_HI20 into an auipc because the add is untagged and because the rewrite is impossible if the HI20 was duplicated by compiler transformations (due to the auipc/aluipc issue). Let's fix that, now we have 7 relocations and minimal loads:
If the non-gp source and destination of the PIC_ADD are not the same, it must be replaced with a move instead of deleted. It might be possible to, instead, replace the add with the auipc, and delete the lui entirely. This would avoid the move but prevent the add from being compressed in the object file. We have minimum loads but there are still two cases of overlong sequences after relaxation. If the GOT entry or symbol is ±2KiB of gp, the lui immediate will be zero; in this case we can rewrite the PCREL_I12 to use gp as a source and delete both the GOTGPREL_HI20 and the PIC_ADD without additional relocations. Supporting this for functions requires using PIC_ADD with GOTGPREL_FUNCDESC_HI20 and GOTGPREL_FUNCDESC_VALUE_HI20. If a local address is materialized only to perform access using integer and floating point loads/stores that take an imm12, we should be able to delete the addi (which was originally the PCREL_I12 load) and fold the offset into the dependent accesses. This requires a new relocation for tagging and, again, requires that the source and destination register are the same.
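The "±2KiB of gp" case in the notes above is the one where the high part vanishes entirely. A tiny sketch of that check, with `gp_direct_imm` as a made-up helper name and the 12-bit signed immediate range of RISC-V loads, stores, and `addi` as the only assumption:

```python
from typing import Optional


def gp_direct_imm(target_addr: int, gp_addr: int) -> Optional[int]:
    """Return the imm12 for `op rd, imm12(gp)` if the target (a symbol or
    its GOT entry) is close enough to gp that no lui/add is needed,
    otherwise None."""
    offset = target_addr - gp_addr
    if -2048 <= offset <= 2047:
        return offset
    return None
```

When this returns a value, the HI20 instruction and the add can both be dropped and the final instruction rewritten to use gp as its base register, as described above.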
Wait, oops, we need TLS. Fortunately, the subsequent relocations are threaded through the first, so they can be made polymorphic.
PLT32 and GOT32_PCREL are used to support relative vtables, which is unfortunately impossible with the FDPIC calling convention since the callee GOT must be reachable from the vtable. 32_PCREL appears to be used for DWARF and is unaffected by FDPIC. Our natural code model is a large one, supporting any amount of code and data as long as the GOT is smaller than 4 GiB. If that requirement is ever exceeded, a larger model can be defined by allowing XXX_GOTGPREL_HI20 relocations to point at an auipc/ld pair instead of a lui, resulting in a linker-generated constant island if not relaxed. That's 11 new relocations for reasonably optimal FDPIC code generation, 1 of which overlaps with the ePIC proposal, assuming that relaxation is in use. What if we want to get close to optimal in the A "smallish FDPIC" supporting GOT up to 256KiB or a "small FDPIC" with total writable data up to 256 KiB can be handled by generating a c.lui instead of a lui for XXX_GOTPCREL_HI20. This is a misnomer. A "tiny FDPIC" with a 4 KiB limit is possible, but duplicating all of the GOTPCREL_HI20 relocs is most undesirable; instead, let's define a GOTPCREL_HI20 with an odd offset as zero length. In the zero length case no add and no PIC_ADD relocation are generated. TODO: no-relax codegen bypassing GOT, applicability to non-FDPIC calling convention |
Started a draft specification of my own at #429. For now, the relocation section is a cut and paste of the above comment. |
Thanks for sharing this. I'm a bit under the weather right now, I'll provide some commentary very soon. |
Thanks for writing down your proposal. I am OK with going with this alternative approach, assuming other people also support it. I think it's important that we find a path forward, and being able to generalize to a full FDPIC is quite nice. A couple of questions:
|
They aren't. I thought that it was necessary to avoid worse performance for
Yes. |
For my part, it's OK to rely on relaxations to achieve good code size. |
This PR adds a new file, detailing a new RISC-V Embedded PIC (ePIC) ABI specification. It is thus a proposal for that ABI, to be discussed here, improved, and, hopefully, eventually accepted.
What is ePIC
Basically, the aim of the ePIC ABI is to allow the creation of position-independent executables that are easy to load and relocate, with minimal overhead. This is meant to be evaluated in the context of (relatively constrained) embedded systems, where you typically want to use execute-in-place and would prefer to avoid some of the overheads of the System V ABIs. ePIC basically works by using PC-relative addressing for code (and read-only data) and GP-relative addressing for data. The loader only has to apply simple relocations in the data segment (essentially, update statically initialized pointers). See the new file for all the details.
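As a rough illustration of the loader step described above, the sketch below patches statically initialized pointers in a data image. It assumes a 32-bit little-endian target, a relocation list that is simply a sequence of byte offsets into the data segment, and that each slot holds a link-time address; `apply_relocs`, `code_range`, `code_delta`, and `data_delta` are hypothetical names chosen for this example, not part of the spec.

```python
import struct
from typing import Iterable, Tuple


def apply_relocs(data: bytearray,
                 reloc_offsets: Iterable[int],
                 code_range: Tuple[int, int],
                 code_delta: int,
                 data_delta: int) -> None:
    """Rebase each pointer slot in place.

    code_range is the link-time [start, end) of the code segment; pointers
    inside it move by code_delta, all others by data_delta, since the two
    segments are loaded independently."""
    start, end = code_range
    for off in reloc_offsets:
        (addr,) = struct.unpack_from("<I", data, off)
        delta = code_delta if start <= addr < end else data_delta
        struct.pack_into("<I", data, off, (addr + delta) & 0xFFFFFFFF)
```

The point of the design is visible in how little the loop does: no symbol lookup, no GOT, just one add per relocated word.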
Review notes
For the convenience of reviewers, I have included the entire specification document in this PR. It includes sections that might be out of the scope of this repo, such as ones detailing command-line options or providing non-essential information. I felt that would help review the overall design of the spec. I suggest we keep those sections for now and remove them closer to the commit stage -- assuming we reach that stage. Review feedback in those sections is still useful.
There is a toolchain implementation for ePIC but it is a work-in-progress. It needs to be updated to reflect some changes in the specification and other issues. There are ePIC demos using that toolchain but they don't yet fully reflect the current specification text. For instance, they currently use the linker option `--emit-relocs` and additional filtering of relocations (in the loader or the program format converter) instead of relying on the linker to emit only the appropriate relocations for ePIC executables. This is in part an artifact of how the ePIC project started, which was to generate relocatable Tock OS applications, as Tock used a similar mechanism. I will work on updating the toolchain, guided by the feedback of this review, so those discrepancies will eventually go away.
Assuming the overall design of ePIC is roughly acceptable, having early feedback about command-line options would be helpful. That would reduce the amount of churn for the toolchain implementation patches, the ePIC demos, and any other systems hoping to adopt ePIC before the spec is completely finalized. That is the most user-visible interface, but the same goes for ABI details like choosing the final relocation numbers (see spec text).
My thanks to everybody that directly or indirectly contributed to this effort. That includes Alex Bradbury (coordinated this project while at lowRISC, and later provided editing suggestions); Sam Elliot (part of the initial effort to get this project started, including requirements gathering, etc.); Kito Cheng and Nelson Chu (early review feedback, see below for details); Evandro Menezes (provided the draft compact/large code model specs and toolchain implementation, which influenced / were useful for this project); Maciej W. Rozycki (RISC-V FDPIC proposal video); Greg McGary (provided early review feedback); Jorge Prendes (ported epic-tock-c demo to rust, etc.); Alistair Francis (added OpenTitan support to the epic-tock-c demo); Michael Munday (added Docker support to the epic-c-example, various suggestions); The Tock OS contributors. Apologies to anyone I might have forgotten. Any deficiencies in this spec are my fault and not of the people listed.
Initial review concerns
This section lists some initial review concerns that I had written down or that were brought up in personal communications.
Kito/Nelson suggested adding the `R_RISCV_64_PCREL` relocation. They believe that's required for proper debug support in RV64 systems, as `R_RISCV_32_PCREL` might not cover all situations, based on their experience with the compact code model. In general, all the assumptions for RV64 systems should be double-checked and evaluated, as the majority of the effort for this spec/project has been on RV32 systems. Given the focus of this spec on embedded systems with limited resources and simple implementations, it might make sense to solve issues like this by making trade-offs that limit the implementation generality (e.g. limiting address space distances).
Kito/Nelson suggested converting `R_RISCV_32` and `R_RISCV_64` to `R_RISCV_RELATIVE`, like what PIE did. The stated advantage would be that "we don't need to look up symbol table other than `__global_pointer$` for the loader, that could make the binary size could be smaller since the dynamic symbol table only need `__global_pointer$`". I'm not sure I fully understand that suggestion. In the demos we didn't need to check the symbol table, but we were making assumptions about how the program was linked (addresses in the ePIC code segment had the MSBit set, while ones in the ePIC data segment did not). The loader used only the data in the .rela sections, it did not check any other data per relocation entry.
The demos currently use the command-line options `-static -Wl,--emit-relocs,-T,app.lds`. What should that look like in the future, for the final toolchain? For a regular program, not using a linker script, would that be `-fpie -fepic`? That would mean "I want a position-independent executable and it should use the ePIC ABI", and those two options would be orthogonal. With that approach, in principle it would even be possible to later extend this ABI for shared library support.
The spec currently uses the `__global_pointer$` symbol. That means, for instance, that PC-rel to GP-rel relaxations must be disabled when ePIC is used, which is currently specified to happen when the `-fepic` option is provided to the linker. I wonder if it wouldn't be simpler to just introduce a `__epic_base$` symbol instead. Using the ePIC ABI during LTO codegen could still be configured, possibly through another option (e.g. the current toolchain implementation currently uses `--lto-epic`, in a departure from the spec, but consistent with other LTO options in LLD). I would like to avoid something clunky like `-plugin-opt=--code-model=...`, though.
Should we introduce any ELF attributes for ePIC?