unexpected perf results and questions #15

Open
sarnold opened this issue Dec 4, 2020 · 3 comments

Comments

@sarnold

sarnold commented Dec 4, 2020

I need a good Python interface to RE2 (there appear to be many, largely unmaintained), so I tested this one and the Google one using your performance.py script. The results are somewhat unexpected compared to the performance table in the README: https://github.com/andreasvc/pyre2#performance

I made a small change to make it run with newer Python, `return _wikidata.decode('utf8')`, which may be why the results look odd. Can you verify whether this is correct?

re2-perf-data.txt

@andreasvc
Owner

Working with unicode adds overhead. If your use case allows working with bytes, that is faster, and apparently that is what the performance script (which I didn't write) benchmarks. To make the script work across Python 2 and 3 with the best performance, you should probably use bytes. I don't know whether this explains the unexpected results; let me know if you discover more. I don't have time to look into this myself, but if I did, I would start by profiling.

I don't really understand what you are benchmarking. What exactly are "google-re2" and "py-re2"? Why would the performance of Python's re from the standard library differ between the two? That is supposed to be the baseline, so I don't know whether the difference is meaningful.
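
The bytes-versus-unicode overhead discussed in this comment can be checked with a rough timing sketch like the one below. It is not the project's performance.py: the sample text, the pattern, and the helper names (`count_matches`, `best_time`, `compare`) are made up for illustration, and it assumes pyre2 imports as `re2` and mirrors the stdlib `re` module-level API. If `re2` is not installed, only the stdlib baseline is timed.

```python
import re
import timeit

try:
    import re2  # pyre2, if installed; assumed here to mirror the stdlib `re` API
except ImportError:
    re2 = None

# Hypothetical sample data, not the data used by performance.py.
TEXT = "contact: alice@example.com, bob@example.org; café owner: zoe@example.net\n" * 2000
PATTERN = r"[\w.+-]+@[\w-]+\.[\w.-]+"


def count_matches(module, pattern, data):
    """Return the number of non-overlapping matches of pattern in data."""
    return len(module.findall(pattern, data))


def best_time(module, pattern, data, repeat=3):
    """Best wall-clock time of several single runs, in seconds."""
    return min(timeit.repeat(lambda: count_matches(module, pattern, data),
                             number=1, repeat=repeat))


def compare(module):
    """Time the same search on str input and on UTF-8 encoded bytes input."""
    return {
        "str": best_time(module, PATTERN, TEXT),
        "bytes": best_time(module, PATTERN.encode("utf8"), TEXT.encode("utf8")),
    }


if __name__ == "__main__":
    for name, module in (("re", re), ("re2", re2)):
        if module is not None:
            print(name, compare(module))
```

Running the stdlib baseline through the same function in both setups should also show whether the baseline itself varies between runs, which is the question raised above.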

@sarnold
Author

sarnold commented Dec 5, 2020

Sorry if that wasn't clear. google-re2 is my CMake respin of the Google Python interface, which you can find here: https://github.com/freepn/google-re2, and py-re2 is my fork of your repo (I added the dash to avoid name clashes). I was hoping the Google one would be interface-compatible with pyre2/adblockparser, but it is not, so right now my only fallback is your pyre2.

@andreasvc
Owner

I see. I didn't know about the google-re2 Python bindings. Perhaps the README could explain the differences.

If performance is critical, you should work with UTF-8 encoded byte strings, which is what RE2 uses internally. If you work with Python unicode strings, text is encoded and decoded on every pyre2 call. RE2 fully supports unicode even when you pass UTF-8 encoded byte strings.

If your fork contains any useful improvements, you're welcome to submit a pull request.
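
One way to follow the advice about UTF-8 byte strings is to encode once at the boundary, match on bytes, and decode only the matches. The sketch below illustrates that pattern; `find_all_utf8` is a hypothetical helper, and it assumes pyre2 imports as `re2` and accepts bytes patterns through the same `compile`/`finditer` interface as the stdlib `re`, which it falls back to when `re2` is missing.

```python
import re

try:
    import re2 as engine  # pyre2, if installed (assumed to mirror the `re` API)
except ImportError:
    engine = re  # stdlib fallback so the sketch still runs


def find_all_utf8(pattern, text, module=engine):
    """Encode pattern and text once, match on bytes, decode only the matches."""
    data = text.encode("utf8")
    compiled = module.compile(pattern.encode("utf8"))
    return [m.group(0).decode("utf8") for m in compiled.finditer(data)]


if __name__ == "__main__":
    print(find_all_utf8("café|naïve", "A naïve café visit; another café."))
```

With the stdlib fallback, a bytes pattern treats `.` and character classes as single bytes rather than code points, so the fallback is only equivalent for patterns such as this literal alternation.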
