-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write binary data in yaml canonical format instead of hex encoded strings #5
Comments
We configured our parser to handle this quite easily, so it's not a problem for us. I'm not opposed to switching if it's a common problem, but my preference is that we keep the YAML and JSON representations of binary data the same, as to reduce total system complexity. EDIT: When I refer to JSON, I'm referring to the minimal validator API spec. Here signatures, roots and other binary data are represented as |
Still investigating this proposal. We have been using hex encoded strings in eth1 for some time now and have existing functionality for that with JSON encoding in go-ethereum. Example block returned from json-rpc in eth1. {
"number": 3,
"hash": "0xef95f2f1ed3ca60b048b4bf67cde2195961e0bba6f70bcbea9a2c4e133e34b46",
"parentHash": "0x2302e1c0b972d00932deb5dab9eb2982f570597d9d42504c05d9c2147eaf9c88",
"nonce": "0xfb6e1a62d119228b",
"sha3Uncles": "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
"logsBloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
"transactionsRoot": "0x3a1b03875115b79539e5bd33fb00d8f7b7cd61929d5a3c574f507b8acf415bee",
"stateRoot": "0xf1133199d44695dfa8fd1bcfe424d82854b5cebef75bddd7e40ea94cda515bcb",
"receiptsRoot: '0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421',
"miner": "0x8888f1f195afa192cfee860698584c030f4c9db1",
"difficulty": '21345678965432',
"totalDifficulty": '324567845321',
"size": 616,
"extraData": "0x",
"gasLimit": 3141592,
"gasUsed": 21662,
"timestamp": 1429287689,
"transactions": [
"0x9fc76417374aa880d4449a1f7f31ec597f00b1f6f3dd2d66f4c9c6c445836d8b"
],
"uncles": []
} Suggesting a shift from hex encoded strings to base64 encoding would be annoying to say the least. Will follow up with a bit more before closing this out, but it's looking like it should be hex encoded. |
It looks like go-ethereum uses type overrides (ex) which uses gencodec to create custom json/yaml parsers to map to go structs (ex). The follow up, if its worth following up, is why are we using hex encoded strings to represent binary data in the first place instead of the seemingly more intuitive Base64 encoding? |
one reason to use one encoding over another is that it should ideally match whatever is used in canonical value printers so it's easy to compare logs with yaml files.. |
Good thinking. Clients are more likely to print hex encoded strings for binary data. |
Reopening this. We've found it too infeasible to writing special decoding/encoding support for these byte arrays as hex encoded strings. Instead, we overwrite the hex strings to base64 in the yaml, then ingest that data to the tests. With regards to "do what the JSON API does", I recant my earlier statement and I don't support hex encoded strings in JSON going forward. I see little benefit to hex encoded over base64 and it means we have to write a special work arounds to support it. The latter comment is probably for a different discussion topic. |
Another note in support of base64, we've found that these yaml files are saving about 30% space on disk as base64 rather than hex encoded. |
For size, we could also compress, encode suites in SSZ (or other), or split mainnet suites (huge) from minimal suites (reasonable) in another repo. If there's an issue with size, clients should put that forward, and describe their resources / caching. But I don't think there's much of a problem (although yes, yaml encoded mainnet tests are very large). Generally I think it's perfectly fine to load tests in a convenient way, i.e. load into a map, translate python names and hex values, and load into the appropriate types. Better than describing all new types just for testing. In Go specifically, I just load them through https://github.com/mitchellh/mapstructure and do some translation, then put them directly in the types I use for state-transitioning. No extra code, or per-type conversion. And while I agree it's far from perfect, it's affordable (tests loading doesn't take that long) and easy. You could generate structs with decoder definitions or use a custom yaml decoder, but I really just think running (and then passing) the tests at this stage is worth much more than (over-)engineering the test runner. Eventually, yes, it would be very nice. But first things first. |
It's not necessarily an issue with size, but difficulty of using existing tools with hex encoded strings for seemingly no benefit. |
If other clients are in for a change too, we can change it. But don't want to break the working setup of anyone right before spec freeze. |
The point of yaml seems to be to optimize utility for humans, where hex is more commonly used - ie Else we might as well go with a binary format that can be memory-mapped instead of parsed - that would at least net a simplification and speedup of test-running. From nimbus, we're fine with keeping things interoperable and simple with hex too, and not too much yaml magic. |
Thanks for the context. I now understand some of the reasoning that went into choosing hex for yaml, but I disagree with this decision. When we need to read the data as humans, we would likely do that in some form of test runner. I.e. We (and others) have had to write custom yaml processing logic or workarounds to support hex encoded strings in over 1 million of lines of yaml in this repository. Maybe it is too late to undo this decision before spec freeze, but we think it is a technical debt that we'll be paying for a long time. |
@prestonvanloon Can you form a concrete counter proposal, to be executed after spec freeze, if it works for everyone? |
This issue is my proposal. Base64 is the canonical format for binary data representation in yaml. Sources:
|
Was thinking of something more radical. If we give up on readability, there should be much better options. |
We have no preference on readability or format as long as we don't have to waste time implementing things that already exist. YAML is fine if we use YAML as intended. JSON is fine because the JSON spec suggests Base64 as well. This isn't giving up on readability. YAML is readable. |
We don't intend to use any of the fancy I think we give something up if fields in tests such as pubkeys become difficult to discuss due to the base64 notation, especially considering pubkeys will almost certainly be represented in hex to users. |
@djrtwo For some additional context, we have wasted at least one eng week to write tooling to support non-canonical formatting with our standard libraries. I think it is a bad idea to use hex for everything. It makes sense when you need representation to be human readable for end users, like a wallet address in UI, but binary data for test ingestion is not one of those use cases. This data is not end user facing and is causing a lot of pain for little/no benefit. |
So pretty much all data, unless it's not a SSZ object, will also be available as ssz-encoded raw bytes in the next testing release. See ethereum/consensus-specs#1320 So effectively no yaml / hex encoded data, unless for a small test metadata detail (which shouldn't touch ssz/production code anyway). Does this work for Prysm @prestonvanloon ? |
What does this mean? There will still be some required yaml? Switching to ssz encoded data will most likely work for us. Thanks for the recent improvements. |
@prestonvanloon there are a few small files that just say things like "there are 10 deposits to read from ssz files", "there are 6 blocks to read", "BLS required here". Etc. etc. Nothing hard to parse really. |
We're finding that unmarshalling yaml is very cumbersome with hex encoded strings as binary data.
Example:
ssz: '0x00000000'
cannot be inherently unmarshaled to a byte slice in go. go playground link.Looking at the spec for yaml, this data should be represented as a base64 binary string with the prefix
!!binary
.Edit: I no longer support adding
!!binary
as it is entirely unnecessary with modern libraries understanding that Base64 encoded strings are binary data. However, the tag resolution in the yaml spec for this scenario is unclear. It's probably safer to add!!binary
explicitly.The text was updated successfully, but these errors were encountered: