-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIMD-0178: SBPF Static Syscalls #178
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me! I think we just need to land on which SBPF version this goes into?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! Removing relocations is a big win. My only comment is the opcode change - this feels a bit unnecessary. But it is not a deal-breaker.
The opcode `0x9D` must represent the return instruction, which supersedes the | ||
`exit` instruction. The opcode (opcode `0x95`), previously assigned to the | ||
`exit` instruction, must now be interpreted as the new syscall instruction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation behind changing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, changing the name from exit
to return
when it is the same instruction could be confusing. I have already seen this confused in other SIMDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note - we should bundle large sets of proposed ISA changes together into the same SBPF version upgrade, so that clients don't have to support a mis-mash of ISAs based on feature flags. I believe this is the intent of #161, but just re-iterating 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Motivation is that exit
was occupying the slot in the instruction class for controlflow with immediate values and it does not take an immediate value. The new syscall
opcode however does, so it took its place.
proposals/0178-static-syscalls.md
Outdated
## Detailed Design | ||
|
||
The following must go into effect if and only if a program indicates the SBPF | ||
version XX or higher in its ELF header e_flags field, according to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we specify which version XX is?
The resolution of syscalls during ELF loading requires relocating addresses, | ||
which is a performance burden for the validator. Relocations require an entire | ||
copy of the ELF file in memory to either relocate addresses we fetch from the | ||
symbol table or offset addresses to after the start of the virtual machine’s | ||
memory. Moreover, relocations pose security concerns, as they allow the | ||
arbitrary modification of program headers and programs sections. A new | ||
separate opcode for syscalls modifies the behavior of the ELF loader, allowing | ||
us to resolve syscalls without relocations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
phase. `call imm` (opcode `0x85`) instructions must only refer to internal | ||
calls and its immediate field must only be interpreted as a relative address | ||
to jump from the program counter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean there is no longer a need to hash the immediates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does and is the intention.
proposals/0178-static-syscalls.md
Outdated
| sol_get_clock_sysvar | 36 | | ||
| sol_get_epoch_schedule_sysvar | 37 | | ||
| sol_get_last_restart_slot | 38 | | ||
| sol_get_epoch_rewards_slot | 39 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think this one should be sol_get_epoch_rewards_sysvar
, see:
https://github.com/anza-xyz/agave/blob/3890ce5bc594d1b81e4e3fa57174e919fb9257ae/programs/bpf_loader/src/syscalls/mod.rs#L394
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that. Fixed!
program reaches the execution stage containing the `0x9D` opcode, an | ||
`EbpfError::UnsupportedInstruction` must be raised. | ||
|
||
### Syscall numbering convention |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integer is not a great idea. SVM will continue to diverge more and more as non-mainnet SVM chains and L2s develop. If they wish to push their own syscalls, not only will they need to write their own tooling to handle it, if Solana mainnet adds another syscall that clashes and that same binary is shipped to another chain, or vice-versa, people could lose money. It would make way more sense to use the Murmur3 hash. If they decide to launch a hash collision, at least we tried to stop them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Murmur3 is what the original implementation did, and I was against switching to indexes, but then I got busy on other stuff (I just noticed that master was switched to indexes).
Is the best argument for indexes that lookup is faster? And if that's the argument, isn't it moot considering that we JIT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @deanmlittle,
Thanks for your input. I didn't follow why teams would have to develop their own tooling. On the Solana SDK, changing from consecutive integers to a murmur hash is simply assigning another constant to the function pointers. The compiler toolchain can deal with both cases without changes.
On the other hand, I partially agree that using consecutive numbers may hinder the development of SVM chains. In Agave, we considered that using a contiguous array would add to much complexity to the code just to handle inactive syscalls or deprecated ones, so we are still using a BTree, as there are only 40 syscalls to lookup for.
Agave's implementation does not prevent SVM external users from calculating the murmur32 hash for their own syscalls, as any 32-bit integer can be used for indexing. The numbering convention they use does not need to match ours, provided that the numbers don't coincide.
Using consecutive numbers was a request from Firedancer. I believe either @topointon-jump, @ripatel-fd or @0x0ece can elaborate more on the reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LucasSte The main argument is to optimize for bytecode decode efficiency. Interpreting a syscall instruction with a hash requires at least two memory accesses, having an index requires only one. That's a significant cost saving for an instruction that may be executed up to 1 billion times per second in the future.
@deanmlittle There's no need to lose ABI security protections or break cross-SVM program compatibility. This SIMD makes no mention of removal of the symbol table. Currently, the dynamic symbol table of each program maps syscall names to the syscall hash. It would make sense to redefined st_value
to carry the new syscall ID in this SIMD.
Then, the ELF loader can trivially reject programs that have an unknown syscall name or mismatching ID. And the bytecode verifier should reject syscall invocations that weren't verified via the symbol table. ABIv2 proposed previously by Anza similarly moves checks to bytecode verification from later stages, so this wouldn't be out of line.
The other concern you brought up is compatibility. A public GH repo listing IDs and their users is a common way to solve the enum problem. (Examples of other projects doing this: https://github.com/multiformats/multicodec/blob/master/table.csv https://www.iana.org/assignments/tls-extensiontype-values/tls-extensiontype-values.xhtml)
@LucasSte Could you clarify in the SIMD whether static syscalls are verified during ELF loading or bytecode verification? If we won't have these measures in place to support enums, I agree with Dean that we should keep the hash instead.
Is the best argument for indexes that lookup is faster? And if that's the argument, isn't it moot considering that we JIT?
@alessandrod There is a case for allowing zero-copy execution out of a bytecode buffer (i.e. interpreter). With direct mapping, we've seen that the average per-instruction overhead for mainnet executions is so high that a JIT barely outperforms Firedancer's interpreter even when the compiled program is in program cache. Bytecode translation is more susceptible to DoS due to the high cost of allocating memory and JIT compiling when spam invoking cold programs. FWIW, we are beginning mainnet testing of the full Firedancer client too, which is interpreter-only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alessandrod There is a case for allowing zero-copy execution out of a bytecode buffer (i.e. interpreter).
you don't have to fixup in place right?
I understand your perspective. From the anza POV tho, there would be no overhead in having hashes since we always JIT. From the devs POV, I think we can agree that having SVM compatibility is desirable if it doesn't come at a huge cost?
So is it worth making a protocol compromise to accommodate for fd's current implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the devs POV, I think we can agree that having SVM compatibility is desirable if it doesn't come at a huge cost?
Which huge cost do you mean? Or is it only for sake of example?
So is it worth making a protocol compromise to accommodate for fd's current implementation?
@alessandrod I don't view this as a matter of optimizing for the implementation. I think we make the protocol as robust and efficient as possible while balancing it against scope of changes. For ABIv1 direct mapping, in-place interpretation is the most robust against DoS. For ABIv2 + instruction coding improvements + better economics for cold program invocations, it's going to be the JIT.
That said, after writing above comment I realized we can just brute force inverse hash tables to make in-place Murmur32 syscall hash reversal cheaper. We already do cryptographic attacks against Murmur32 to make direct calls cheaper. So the interpreter performance loss of doing Murmur32 over dense indexing would be so small that it doesn't matter.
So, if you feel strongly about this, feel free to change.
I think we should do the central syscall registry repo regardless. And also clarify whether SIMD-0178 requires the set of unique immediates of syscall instructions to be listed in the symbol table. (Otherwise implies you can omit the symbol table and have "raw" murmur32 hashes that aren't checked for string collisions)
No description provided.