Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider dropping type checks in Keccak256 methods #16

Open
gsalgado opened this issue Apr 30, 2018 · 17 comments
Open

Consider dropping type checks in Keccak256 methods #16

gsalgado opened this issue Apr 30, 2018 · 17 comments

Comments

@gsalgado
Copy link

The __call__() and new() methods there will check the type of their arguments using isinstance, and although that should be a cheap check, these are code paths where performance is paramount given the number of times they are called when we perform trie operations on a fully synced DB. For example, when doing a full sync around 10% of the time is being spent into isinstance calls. Even though they're probably not all originating from these checks, I think we should consider removing them as these two methods have type annotations that should make it clear enough what is wrong in case they crash when passed an argument of the wrong type

@gsalgado
Copy link
Author

So, it turns out that during a full sync we're spending so much time on isinstance() calls because we're making an insane amount of them: 21926745 calls in a 3 minute run, but only 421334 (2%) of those are coming from these codepaths, so we might not gain anything by dropping these checks after all

@pipermerriam
Copy link
Member

@gsalgado I'm betting most of those isinstance checks are happening in the validation code within the EVM thats checking things like this.:

@gsalgado
Copy link
Author

The biggest offenders are actually rlp and trie: rlp/codec.py:72(encode_raw), trie/validation.py:11(validate_is_bytes), rlp/sedes/serializable.py:300(make_immutable). Just those 3 seem to account to almost half of all isinstance() calls

@pipermerriam
Copy link
Member

Hrm, I could get onboard with dropping the check from trie. It's not clear to me how to drop the check from rlp but we do call make_immutable on every argument during object instantiation. It'd be nice to drop this but I like that when you pass a list into an rlp object it gets cast to a tuple.

@davesque
Copy link
Contributor

davesque commented Apr 30, 2018

I haven't looked at the specific code, but I wonder if the behavior of some of those checks could be duplicated by appropriately coded try/except blocks. According to the docs, try/except is very efficient when an exception is not raised: https://docs.python.org/3.6/faq/design.html#how-fast-are-exceptions.

@pipermerriam
Copy link
Member

not the right solution but I did have a look at functools.singledispatch as a potential remedy for this but it's about 4x slower than a manual isinstance check (but way more elegant)

@carver
Copy link
Collaborator

carver commented Apr 30, 2018

rlp/codec.py:72(encode_raw)
trie/validation.py:11(validate_is_bytes)
rlp/sedes/serializable.py:300(make_immutable)

I think we can improve performance on all of these with assert statements. That way, default usages (and our tests) will work slower & safer, but we can disable asserts when running trinity client.

trie/validation.py:11(validate_is_bytes)

Trivial to move to asserts

rlp/sedes/serializable.py:300(make_immutable)

Not sure about this one. Can we simply disallow list as a valid input type, and use assert statements to reject lists?

rlp/codec.py:72(encode_raw)

This one is tougher, but here is one way to reduce the calls of isinstance by 2/3rds:

old code

    if isinstance(item, Atomic):
        if len(item) == 1 and item[0] < 128:
            return item
        payload = item
        prefix_offset = 128  # string
    elif not isinstance(item, str) and isinstance(item, collections.Sequence):
        payload = b''.join(encode_raw(x) for x in item)
        prefix_offset = 192  # list
    else:
        msg = 'Cannot encode object of type {0}'.format(type(item).__name__)
        raise EncodingError(msg, item)

new code

    if isinstance(item, Atomic):
        if len(item) == 1 and item[0] < 128:
            return item
        payload = item
        prefix_offset = 128  # string
    else:
        try:
            assert not isinstance(item, str)
        except AssertionError as exc:
            raise encoding_error(item) from exc

        try:
            payload = b''.join(encode_raw(x) for x in item)
        except TypeError as exc:
            raise encoding_error(item) from exc

        prefix_offset = 192  # list

def encoding_error(item)
        msg = 'Cannot encode object of type {0}'.format(type(item).__name__)
        return EncodingError(msg, item)

@carver
Copy link
Collaborator

carver commented Apr 30, 2018

I guess this should really be moved to py-evm/pyrlp/py-trie -- seems like we shouldn't remove the type checks in eth-hash until they are clearly a bottleneck.

@pipermerriam
Copy link
Member

We can vendor the encode_raw implementation into py-trie. That's an easy optimization, allowing us to drop checks without any compromise to the actual rlp library. Bonus is that if we go ahead and vendor the decode as well we can drop the rlp dependency.

@davesque
Copy link
Contributor

davesque commented May 1, 2018

I've always felt weird about disabling asserts since some libraries might incorrectly use them as a general exception throwing mechanism. Thoughts on that?

@carver
Copy link
Collaborator

carver commented May 1, 2018

I hope that having libraries break when you turn off assertions would be very rare. I would certainly consider that broken. Is that something you've run into before? @davesque

@davesque
Copy link
Contributor

davesque commented May 1, 2018

@carver Actually, no thankfully. And I guess it's probably not a very big risk and we'd notice it pretty quickly if a library broke in optimized mode. But it does seems a bit undesirable that one should have to use -O to get decent performance. I've never run across a library or tool where that was the case. Maybe I've just never had occasion to use one that worked that way. Or is there a way to disable asserts on a per module basis? If so, that would probably be a decent way to go.

But maybe it's possible that some of that manual type checking could just be removed altogether since a wrong value type would lead pretty quickly to a fairly obvious error (that's why I suggested the try/except approach)? It just feels a bit like we might be trying to shoehorn the wrong concerns into a dynamically typed language.

@pipermerriam
Copy link
Member

eurika! (maybe)

We should be able to get roughly the same level of confidence with type checking via mypy.

@davesque
Copy link
Contributor

davesque commented May 1, 2018

@pipermerriam Yes, I agree with that. Even if third party users of our libraries didn't use mypy, it would still give some guarantees about their internal correctness and the interactions between them.

@pipermerriam
Copy link
Member

Well, currently we have type definitions for eth-keys defined like this: https://github.com/ethereum/py-evm/tree/master/stubs

We can move this stuff to a submodule and add type definitions for anything else which is stable, share the submodule across repositories and get some stronger type safety checks across library boundaries, so even consumers of py-trie/eth-hash/whatever should be able to still get solid type safety checks.

@carver
Copy link
Collaborator

carver commented May 1, 2018

We should be able to get roughly the same level of confidence with type checking via mypy.

Which comes at the cost of requiring that any module that uses the method to use mypy to get a reasonable developer experience. For example, if a module uses encode_raw() and sends a str in as the item (without running mypy checks), we end up in an infinite recursion.

Infinite recursion also happens if running with python -O in the above example, but at least that was "opt-out" of nicer error messages, rather than mypy, which requires "opt-in" to get nicer error messages.

@pipermerriam
Copy link
Member

It still doesn't sit well with me to disable assertions for performance gains.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants