Skip to content

Create a bytecode interpreter to compactly represent text/JSON name maps. #1789

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

allevato
Copy link
Collaborator

@allevato allevato commented May 29, 2025

This is something I've been cooking for a bit, as a way to reduce codegen.

The problem with the ExpressibleByDictionaryLiteral approach we're using today is that its codegen is unpredictable. For what is really just static data, the Swift compiler always generates sequences of instructions in debug mode, and in release mode it may or may not be outlined into global data depending on various things outside of the author's control.

StaticStrings, on the other hand, are super predictable since they always become static read-only data (ignoring the oddball edge case involving a single code point). So, why not do what some other languages do and create blobs of binary data that we can process at runtime instead?

The catch is that since StaticStrings must be valid UTF-8, we have to invent our own little bytecode format a bit, to avoid encodings that might be invalid. The format is a stream of bytes as follows:

  • Integers are represented as varints with 6-bit shifts and an always cleared MSB. This is similar to protobuf's own 7-bit varints except we lose a bit to ensure that continuation bytes are in the range 0x40–0x7F instead of 0x80–0xFF.
  • Strings are represented as null-terminated UTF-8.
  • String arrays are represented by a count integer followed by that many null-terminated UTF-8 strings.
  • The stream always starts with an integer "version" number, which future-proofs us if we want to change the encoding. Right now, it's always zero.

Layered on top of the bytecode stream is an interpreter, which abstracts "instructions" that are RawRepresentable as UInt64, and the interpreter loop just reads the next instruction and invokes a callback that is supposed to read and process the operands.

The bytecode is an "ABI" that we can add to, but never remove from (for a particular major version of the runtime). For example, we could add new instructions in the future to represent things differently, as long as the runtime can still correctly interpret the bytecode from older generated code.

There are definitely more optimizations that we could do to the representation here to make it more compact. The most obvious thing would be to add specialized opcodes for the most common cases. Other things we could do would be to sort the field numbers in numerical order and represent them as adjacent differences instead of their absolute value—that would shrink things a bit for dense messages/enums that have numbers greater than 63.

But for now, I just wanted to put up this draft as a vibe check. I took one of our particularly large transitive proto closures internally and measured the optimized and dead-stripped output with and without this change, using a tiny binary that just imports a huge proto module, sets a single field in a message, and then black-holes the message:

Before:  3995696 bytes
After:   3823880 bytes
Diff:     171816 bytes (–4.3%)

But I think this is just the beginning of what we can do—namely, we could refactor some of the generated message storage internals and make visitation completely generalized over a similar bytecode that we would generate about a message's layout. That, I imagine, would yield far greater codegen savings (and likely significant compile time savings as well).


Side note: Many of the APIs presented by the bytecode interpreter and the generated code just scream for Span and UTF8Span. I suppose one day we'll have a high enough deployment target that we could use those 😭

allevato added 3 commits May 29, 2025 10:34
This will be used to generate data compactly as static strings, which
optimize much more consistently than some of the code we generate today
like the dictionary literal used for `NameMap`s.

Later on, I'd also like to look at how we could use this to tabularize
parsing and serialization. With some other refactoring, I think we could
get to a point where the visitation APIs are completely driven by
bytecode strings and we wouldn't have to generate the large
message-specific methods that do this.
/// UTF-8. Specifically,
///
/// - Integers are encoded in a varint-like format similar to protobuf, except that only the low
/// 7 bits are used. The most-significant bit is always clear, and the second-most-significant
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is since the underlying storage is utf-8 and so the high bit would bring on other meanings?

When SE-0483InlineArray Literal Syntax get approved, I'm guessing it would make sense to support/move to that as it means a full 8bit encoding so things could get smaller? Any idea how easy this code could be leverage to support that eventually?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested it with newer toolchains, but I suspect there would still be heavy type-checking costs to using InlineArray since it would have to be initialized with an array literal. Given this:

static let foo: InlineArray<someHugeNumber, UInt8> = [0, 1, 47, 8, ...]

At a minimum, even if it's able to store that as readonly data with minimal codegen under all circumstances, the type checker still has to verify that every element of that array literal is a UInt8 and that it has exactly someHugeNumber elements. That's a lot of work that StaticString doesn't have, since that string literal is just an atomic blob.

What we really need is a way to encode raw bytes as a special kind of string-ish literal, which could supports \xFF-style encoding. Maybe it makes sense to propose something like that once semi-related features like @section land.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I wasn't thinking compile time costs and was just thinking about (slightly) smaller binaries as some numbers would shrink in encoding.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I suspected, creating an InlineArray of, say, 100 elements generates ~100 type variables and ~100 constraints in the constraint system. https://godbolt.org/z/9ahWK6oj1

None of those constraints are hard to solve—they're unambiguous since there aren't any overloads or casts/bridging involved—but the compiler still has to run through them all to prove that it's correct. I wouldn't feel comfortable putting that much load on the type checker when StaticString doesn't have any of that baggage.

But that's not all—in debug builds, the compiler still generates in non-optimized builds a scales-linearly-with-the-size-of-the-array "one-time initialization function" for the InlineArray for the property's unsafeMutableAddressor, even though it also generates it as static RO data. I don't really know what's going on there, but it's more unpredictability that we should avoid.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I wonder if something should get opened on the Swift project in general about this since it sorta seems like this was the space InlineArray was trying to solve, and it compiler performance for large blobs is going to be bad, what was the point?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra codegen in debug builds seems silly and is probably an area with room for improvement, but I don't think there's a way around the type checker part since something has to validate that the code is good.

// The only way this could happen is if the program were a single byte, meaning that it
// only has a 6-bits-or-fewer format specifier and nothing else. In other words, there
// are no instructions, and we can simply return as there is nothing to execute.
return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still wonder if this should fatalError since it shouldn't happen and if something were generated with one character something went wrong?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty message produces a single code-point name map, since it's just the version prefix and then nothing else: "\u{0}"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should just be _NameMap() like it is now, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but do we care enough about special casing that in the generator? The overhead of interpreting the empty name map bytecode would be low, especially for such an uncommon situation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is somewhat common for things that use extensions over fields as a design pattern (don't ask), so it does happen. The main "savings" from my POV is just less "startup code" that has to run, yes, it's is less, but zero is always less then "a little" 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps - it's also even less for the compiler to look at when type checking the compile. 😉

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, and a little extra complexity in the generator for more savings at runtime is probably the right choice in general.

///
/// - Parameter programFormat: The program format of the bytecode stream to write. Currently,
/// the only valid value is zero.
public init(programFormat: Int = 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a constant over in the main runtime library for the zero value that is used by these places also?

/// - Parameter programFormat: The program format of the bytecode stream to write. Currently,
/// the only valid value is zero.
public init(programFormat: Int = 0) {
if programFormat != 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guard?

if byte & 0x40 == 0 {
return value
}
shift += 6
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some guard that shift doesn't get too large?

/// - Returns: An array of `UnsafeBufferPointer`s containing the strings that were read from the
/// bytecode stream. See the documentation of `nextString()` for details on the lifetimes of
/// these pointers.
package mutating func nextNullTerminatedStringArray() -> [UnsafeBufferPointer<UInt8>] {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like this got a test (or a helper to write one of these)

/// ## Operands
/// * An integer representing the lower bound (inclusive) of the reserved field number range.
/// * An integer representing the upper bound (exclusive) of the reserved field number range.
case reservedFields = 6
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"reservedNumber" nameMap isn't specific to fields, and enums also have reserved numbers, we just haven't needed to capture them.

@@ -110,6 +113,50 @@ class FieldGeneratorBase {
}
}

func writeProtoNameInstruction(to writer: inout ProtoNameInstructionWriter) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't var fieldMapNames come out now?

}
let jsonName = fieldDescriptor.jsonName

// TODO: Create a separate instruction for the group being able to match
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should do this extra opcode now vs. keeping it as a TODO

if isprint(Int32(truncatingIfNeeded: value)) != 0 {
self.append(escapingIfNecessary: UnicodeScalar(UInt32(truncatingIfNeeded: value))!)
} else {
code.append(String(format: "\\u{%x}", value))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth picking off \0 as the one special case? (tab, newline, etc don't seem worth it, but if \0 is common, it would slightly shorten strings and there's no code on the decoding side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants