feat: microoptimize _LRUCacheWrapper.__call__
#683
Conversation
I'm not convinced this is going to make a noticeable difference. A good first step would be to add codspeed benchmarks as we've done on the aiohttp repo.
Benchmark or no benchmark, it reduces the number of PyObject_GetAttr calls the interpreter makes: for each cache hit, there will be one fewer PyObject_GetAttr C call. Whether that change is "noticeable" depends on how you define noticeable, but regardless of the definition the interpreter makes fewer C-API calls with this code, and the user's CPU is not processing instructions it has no need to process. Before I spend any more time here, how do you define "noticeable"? We already know it will make an unknown difference greater than zero; probably small, of course, but every little bit helps, particularly for this type of middleware-ish library. Do we really need a benchmark when the reduction in instructions is itself measurable and the PR doesn't particularly harm readability?
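(For illustration, a minimal sketch of the kind of change being discussed, assuming the wrapper keeps its cache as an instance attribute; the class and attribute names here are invented for the example and are not the actual async-lru internals.)

```python
class _WrapperSketch:
    """Illustrative stand-in for a cache wrapper; not async-lru's real code."""

    def __init__(self) -> None:
        self._cache: dict = {}
        self._hits: int = 0

    def lookup_before(self, key):
        # Every dotted access below is a separate attribute lookup,
        # i.e. another PyObject_GetAttr call at the C level.
        if key in self._cache:
            self._hits += 1
            return self._cache[key]
        return None

    def lookup_after(self, key):
        # Hoist the attribute into a local once, then reuse the local:
        # one attribute lookup instead of two on the hit path.
        cache = self._cache
        if key in cache:
            self._hits += 1
            return cache[key]
        return None
```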
It's generally frowned upon to modify Python code and make it less readable for nanosecond optimisation. If there's no real-world benefit to such optimisation, which can generally be seen in realistic benchmarks (e.g. like some of the small optimisations in aiohttp), then there's likely no point in spending time on this. There are much better approaches that could improve performance by several orders of magnitude more. These optimisations may also depend on the platform they run on. For example, if we compiled this code with mypyc, there may be exactly 0 difference in performance.
In aiohttp we've had several micro-optimisations that went through because they made a significant difference (e.g. 5% in a tight benchmark) to performance measured in codspeed. Likewise, we also rejected several micro-optimisations where the difference was negligible.
I'd agree, but I don't think any of that is happening here. More thoughts below.
Respectfully, we're spending more time on this discussion than it took me to refactor out all 5 uses of the dot operator!
Sure, but this is both low-hanging fruit and already implemented. Nothing in this PR or review prevents any of the other approaches from also being implemented.
This is where I'll have to disagree; I've merged 10 PRs into the mypyc codebase in just the past two weeks. If you compile both versions of this code with mypyc, which is exactly what I'm doing (you can check out the C code on my fork!) and the reason I ended up here in the first place, the difference is even more obvious in the C code. There is zero chance that "calling a C function 5 extra times" and "calling a C function 0 extra times" have "0 difference in performance" when compiled with mypyc. An impossibility, really. A difference is essentially guaranteed: the CPU receives fewer instructions and goes through fewer wasted cycles. The only real question, as you've noted above, is the magnitude. But anyhow, I suppose none of this matters if we don't have an actual performance target in mind. Are you saying the improvement must be > 5% in order to merge a microoptimization PR?
This is still dependent on the current state of mypyc though. I suspect the difference you see is because OrderedDict is not native in mypyc today; that could change in a future release. The documentation explicitly says not to do these kinds of micro-optimisations: https://mypyc.readthedocs.io/en/latest/performance_tips_and_tricks.html#using-fast-native-features It's entirely possible that a future release sees this change become a performance degradation. Without meaningful benchmarks to catch any such changes, it doesn't seem worthwhile to me.
Mypyc is not able to safely optimize away any of the PyObject_GetAttr calls, either now or in the future, since that would violate Python semantics: the __get__ method of descriptors would not be called as expected. You see a difference because one version of the code calls PyObject_GetAttr more times than the other, for no benefit, as we aren't accessing descriptors here. In the event of future optimizations, every instance where we currently have a GetAttr call would become equally faster, but again, it would never become the case that plural calls are faster than a singular call. That's simply not possible. Again, what number are you looking for in a benchmark? Is the minimum performance impact really 5%?
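(A small self-contained example of the semantics being referenced: because a descriptor's __get__ can have observable side effects, two attribute accesses cannot in general be merged into one without changing behaviour. The class names are invented for illustration.)

```python
class CountingDescriptor:
    """A data descriptor whose __get__ has an observable side effect."""

    def __init__(self) -> None:
        self.reads = 0

    def __get__(self, obj, objtype=None):
        if obj is None:          # accessed on the class itself
            return self
        self.reads += 1          # side effect on every instance access
        return obj._value

    def __set__(self, obj, value) -> None:
        obj._value = value


class Box:
    value = CountingDescriptor()

    def __init__(self, v) -> None:
        self.value = v


b = Box(10)
_ = b.value
_ = b.value
print(Box.value.reads)  # 2: each dotted access ran __get__, so the two
                        # lookups can't be collapsed into one without
                        # changing observable Python behaviour.
```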
I know, I'm the one who wrote the PR; it's in review already lol. OrderedDict won't necessarily become a fully-fledged native class, but we will be able to use the more specific PyDict_GetItem function from Python's C-API, just like we'd currently do for a true instance of dict. But that isn't hugely relevant here, because when compiled with mypyc these C-API calls become C-struct member lookups (and even then, a single lookup takes less time than several).
I'm familiar with those docs, but that documentation specifically says not to do early binding, which only applies to methods. If you look at that example, the value they are caching is not a plain object; it's a method which is bound to the object in question. That's what they refer to with "binding". We aren't doing any of that here: we are looking up attributes one time, and then using them repeatedly after that without any extra lookups (see the sketch below).
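(To make the distinction concrete, a hedged sketch: the first pattern caches a bound method in a local, which is the kind of manual tweak the mypyc docs warn against because it can bypass early binding of the call; the second just hoists a plain attribute value into a local, as this PR does. The names are invented for the example.)

```python
class Collector:
    def __init__(self) -> None:
        self.items: list[int] = []

    def add(self, x: int) -> None:
        self.items.append(x)


def fill_with_cached_method(c: Collector, n: int) -> None:
    # Caching the *bound method*: the pattern the mypyc docs advise against,
    # since `c.add(i)` inside the loop could otherwise be an early-bound call.
    add = c.add
    for i in range(n):
        add(i)


def fill_with_hoisted_attribute(c: Collector, n: int) -> None:
    # Hoisting a plain attribute *value* into a local: no method binding
    # involved, just avoiding a repeated lookup of the same object.
    items = c.items
    for i in range(n):
        items.append(i)
```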
If you follow the link, it says "References to functions, types, most attributes, and methods in the same compilation unit use early binding". So, my understanding is that if OrderedDict becomes native and the async-lru class is being compiled as native, then what it says still applies and this change would actually decrease performance.
My understanding was that this is exactly what it does for native classes: because it can see the entire class upfront, it can know that no descriptors exist. But, regardless, some functionality is simply not supported with native classes (I've had to opt out of native classes on occasion for some projects).
Yeah this is mostly the case, but it just replaces the C API call with a struct member lookup in each case where a C API call previously existed as noted above
Though this is actually a really good idea for an optimization; I'll try to encode this logic into a mypyc PR. We'd essentially just need to detect whether a particular native class defines its own …

Do we have a numeric performance-improvement target in mind for the benchmark, or should I just consider this PR dead? At this point I've asked three times what number would convince you and you haven't given me anything to work with. I could always just fork and compile …
I'd at least want a regression test, to know if some future change breaks the performance improvement here. So, let's say if the improvement is enough to stand out from the noise of a benchmark, then we'll use it. codspeed does some things to eliminate syscalls etc. from benchmarks, but they'll still have a degree of random noise.
Okay great, thanks. I'll try to put that together this week.
Benchmarks added, let me know if this looks like what you had in mind.
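(For context, a minimal sketch of what such a cache-hit benchmark might look like; it assumes pytest-codspeed's @pytest.mark.benchmark marker and async_lru.alru_cache, and the actual benchmark added here may be structured differently.)

```python
import asyncio

import pytest
from async_lru import alru_cache


@pytest.mark.benchmark  # collected by pytest-codspeed in CI
def test_cache_hit_path() -> None:
    @alru_cache(maxsize=32)
    async def cached(x: int) -> int:
        return x

    async def run() -> None:
        await cached(1)            # first call populates the cache
        for _ in range(1_000):
            await cached(1)        # subsequent calls exercise the hit path

    asyncio.run(run())
```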
Probably also want this in a separate PR in order to see the change this PR produces.
Looks like I was using an invalid … I've replaced it with …
I'd just look at how it's been done on aiohttp, multidict, etc. I've added @bdraco to review, so hopefully he can also provide some guidance.
I created a new PR #688 with just the codspeed stuff. Once we merge that, I can rebase this PR on top to clean up the diff.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## master #683 +/- ##
==========================================
+ Coverage 97.22% 97.23% +0.01%
==========================================
Files 12 12
Lines 793 796 +3
Branches 41 41
==========================================
+ Hits 771 774 +3
Misses 20 20
Partials 2 2
CodSpeed Performance Report

Merging #683 will not alter performance.
This PR just microoptimizes attribute lookups in _LRUCacheWrapper.__call__ to streamline call speed.