Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write binary data in yaml canonical format instead of hex encoded strings #5

Open
prestonvanloon opened this issue Jun 6, 2019 · 21 comments

Comments

@prestonvanloon
Copy link

prestonvanloon commented Jun 6, 2019

We're finding that unmarshalling yaml is very cumbersome with hex encoded strings as binary data.

Example: ssz: '0x00000000' cannot be inherently unmarshaled to a byte slice in go. go playground link.

Looking at the spec for yaml, this data should be represented as a base64 binary string with the prefix !!binary.

Edit: I no longer support adding !!binary as it is entirely unnecessary with modern libraries understanding that Base64 encoded strings are binary data. However, the tag resolution in the yaml spec for this scenario is unclear. It's probably safer to add !!binary explicitly.

@paulhauner
Copy link

paulhauner commented Jun 6, 2019

We configured our parser to handle this quite easily, so it's not a problem for us.

I'm not opposed to switching if it's a common problem, but my preference is that we keep the YAML and JSON representations of binary data the same, as to reduce total system complexity.

EDIT: When I refer to JSON, I'm referring to the minimal validator API spec. Here signatures, roots and other binary data are represented as 0x00...

@prestonvanloon
Copy link
Author

prestonvanloon commented Jun 6, 2019

Still investigating this proposal. We have been using hex encoded strings in eth1 for some time now and have existing functionality for that with JSON encoding in go-ethereum.

Example block returned from json-rpc in eth1.

{
    "number": 3,
    "hash": "0xef95f2f1ed3ca60b048b4bf67cde2195961e0bba6f70bcbea9a2c4e133e34b46",
    "parentHash": "0x2302e1c0b972d00932deb5dab9eb2982f570597d9d42504c05d9c2147eaf9c88",
    "nonce": "0xfb6e1a62d119228b",
    "sha3Uncles": "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
    "logsBloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
    "transactionsRoot": "0x3a1b03875115b79539e5bd33fb00d8f7b7cd61929d5a3c574f507b8acf415bee",
    "stateRoot": "0xf1133199d44695dfa8fd1bcfe424d82854b5cebef75bddd7e40ea94cda515bcb",
    "receiptsRoot: '0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421',
    "miner": "0x8888f1f195afa192cfee860698584c030f4c9db1",
    "difficulty": '21345678965432',
    "totalDifficulty": '324567845321',
    "size": 616,
    "extraData": "0x",
    "gasLimit": 3141592,
    "gasUsed": 21662,
    "timestamp": 1429287689,
    "transactions": [
        "0x9fc76417374aa880d4449a1f7f31ec597f00b1f6f3dd2d66f4c9c6c445836d8b"
    ],
    "uncles": []
}

Suggesting a shift from hex encoded strings to base64 encoding would be annoying to say the least.

Will follow up with a bit more before closing this out, but it's looking like it should be hex encoded.

@prestonvanloon
Copy link
Author

It looks like go-ethereum uses type overrides (ex) which uses gencodec to create custom json/yaml parsers to map to go structs (ex).
So, it looks like we can generate some extra code to support hex encoded strings to represent binary data and close out this issue.

The follow up, if its worth following up, is why are we using hex encoded strings to represent binary data in the first place instead of the seemingly more intuitive Base64 encoding?

@arnetheduck
Copy link

one reason to use one encoding over another is that it should ideally match whatever is used in canonical value printers so it's easy to compare logs with yaml files..

@prestonvanloon
Copy link
Author

Good thinking. Clients are more likely to print hex encoded strings for binary data.

@prestonvanloon
Copy link
Author

Reopening this. We've found it too infeasible to writing special decoding/encoding support for these byte arrays as hex encoded strings. Instead, we overwrite the hex strings to base64 in the yaml, then ingest that data to the tests.

With regards to "do what the JSON API does", I recant my earlier statement and I don't support hex encoded strings in JSON going forward. I see little benefit to hex encoded over base64 and it means we have to write a special work arounds to support it.

The latter comment is probably for a different discussion topic.

@prestonvanloon
Copy link
Author

Another note in support of base64, we've found that these yaml files are saving about 30% space on disk as base64 rather than hex encoded.

@protolambda
Copy link
Contributor

For size, we could also compress, encode suites in SSZ (or other), or split mainnet suites (huge) from minimal suites (reasonable) in another repo. If there's an issue with size, clients should put that forward, and describe their resources / caching. But I don't think there's much of a problem (although yes, yaml encoded mainnet tests are very large).

Generally I think it's perfectly fine to load tests in a convenient way, i.e. load into a map, translate python names and hex values, and load into the appropriate types. Better than describing all new types just for testing. In Go specifically, I just load them through https://github.com/mitchellh/mapstructure and do some translation, then put them directly in the types I use for state-transitioning. No extra code, or per-type conversion. And while I agree it's far from perfect, it's affordable (tests loading doesn't take that long) and easy.

You could generate structs with decoder definitions or use a custom yaml decoder, but I really just think running (and then passing) the tests at this stage is worth much more than (over-)engineering the test runner. Eventually, yes, it would be very nice. But first things first.

@prestonvanloon
Copy link
Author

It's not necessarily an issue with size, but difficulty of using existing tools with hex encoded strings for seemingly no benefit.

@protolambda
Copy link
Contributor

If other clients are in for a change too, we can change it. But don't want to break the working setup of anyone right before spec freeze.

@arnetheduck
Copy link

The point of yaml seems to be to optimize utility for humans, where hex is more commonly used - ie echo hello | sha256sum prints in hex.

Else we might as well go with a binary format that can be memory-mapped instead of parsed - that would at least net a simplification and speedup of test-running.

From nimbus, we're fine with keeping things interoperable and simple with hex too, and not too much yaml magic.

@prestonvanloon
Copy link
Author

prestonvanloon commented Jun 19, 2019

Thanks for the context. I now understand some of the reasoning that went into choosing hex for yaml, but I disagree with this decision. When we need to read the data as humans, we would likely do that in some form of test runner. I.e. Method failed. Wanted 0xf2d, got 0xabb. This formatting could be done with whatever stdout printer is being used without subjecting others to the cost having yaml data as hex.

We (and others) have had to write custom yaml processing logic or workarounds to support hex encoded strings in over 1 million of lines of yaml in this repository.

Maybe it is too late to undo this decision before spec freeze, but we think it is a technical debt that we'll be paying for a long time.

@protolambda
Copy link
Contributor

@prestonvanloon Can you form a concrete counter proposal, to be executed after spec freeze, if it works for everyone?

@prestonvanloon
Copy link
Author

This issue is my proposal. Base64 is the canonical format for binary data representation in yaml.
Nearly every library supports ingesting base64 as binary.

Sources:
https://yaml.org/spec/1.2/spec.html
https://yaml.org/type/binary.html

Canonical: Base64 without white space and line breaks.

@protolambda
Copy link
Contributor

Was thinking of something more radical. If we give up on readability, there should be much better options.

@prestonvanloon
Copy link
Author

We have no preference on readability or format as long as we don't have to waste time implementing things that already exist. YAML is fine if we use YAML as intended. JSON is fine because the JSON spec suggests Base64 as well.

This isn't giving up on readability. YAML is readable.

@djrtwo
Copy link
Contributor

djrtwo commented Jun 25, 2019

We don't intend to use any of the fancy !! YAML notation due to the more simple parsers not supporting this usage. The 0x is super common in ethereum and human readable and speakable.

I think we give something up if fields in tests such as pubkeys become difficult to discuss due to the base64 notation, especially considering pubkeys will almost certainly be represented in hex to users.

@prestonvanloon
Copy link
Author

prestonvanloon commented Jun 25, 2019

@djrtwo For some additional context, we have wasted at least one eng week to write tooling to support non-canonical formatting with our standard libraries. I think it is a bad idea to use hex for everything.

It makes sense when you need representation to be human readable for end users, like a wallet address in UI, but binary data for test ingestion is not one of those use cases. This data is not end user facing and is causing a lot of pain for little/no benefit.

@protolambda
Copy link
Contributor

So pretty much all data, unless it's not a SSZ object, will also be available as ssz-encoded raw bytes in the next testing release. See ethereum/consensus-specs#1320
An example can be found here: https://github.com/ethereum/eth2.0-spec-tests/tree/tests-v082-reorg/tests/minimal/phase0/operations/deposit/pyspec_tests/success_top_up

So effectively no yaml / hex encoded data, unless for a small test metadata detail (which shouldn't touch ssz/production code anyway). Does this work for Prysm @prestonvanloon ?

@prestonvanloon
Copy link
Author

unless for a small test metadata detail

What does this mean? There will still be some required yaml?

Switching to ssz encoded data will most likely work for us. Thanks for the recent improvements.

@protolambda
Copy link
Contributor

@prestonvanloon there are a few small files that just say things like "there are 10 deposits to read from ssz files", "there are 6 blocks to read", "BLS required here". Etc. etc. Nothing hard to parse really.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants
@arnetheduck @djrtwo @paulhauner @prestonvanloon @protolambda and others