
Conversation

@BobTheBuidler (Contributor) commented Aug 15, 2025

This PR micro-optimizes attribute lookups in _LRUCacheWrapper.__call__ to speed up the call path.
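
For illustration, here is the general shape of the change (a simplified, hypothetical sketch with made-up attribute names, not the actual diff): attributes that `__call__` reads more than once are hoisted into local variables so each one is fetched a single time per call.

```python
# Hypothetical sketch of the pattern (names are illustrative, not async-lru's real attributes).
class _Wrapper:
    def __init__(self, fn, maxsize):
        self._fn = fn
        self._maxsize = maxsize
        self._cache = {}

    def call_before(self, key):
        # `self._cache` is looked up on every use: two attribute lookups on a hit,
        # and more on a miss.
        if key in self._cache:
            return self._cache[key]
        value = self._cache[key] = self._fn(key)
        return value

    def call_after(self, key):
        # Hoist the attribute into a local: one lookup, then cheap local accesses.
        cache = self._cache
        if key in cache:
            return cache[key]
        value = cache[key] = self._fn(key)
        return value
```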

Checklist

  • I think the code is well written
  • Unit tests for the changes exist

@Dreamsorcerer (Member)

I'm not convinced this is going to make a noticeable difference. A good first step would be to add codspeed benchmarks as we've done on the aiohttp repo.

@BobTheBuidler (Contributor, Author) commented Aug 16, 2025

Benchmark or no benchmark, this change reduces the number of PyObject_GetAttr calls the interpreter makes.

For each cache hit, there will be one fewer PyObject_GetAttr C call.
For each cache miss, there will be two fewer.
For each cache miss with a maxsize set, there will be five fewer.
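
As a quick way to see what this means at the bytecode level (my own toy example, not the wrapper's code), `dis` shows the repeated LOAD_ATTR instructions that dotted access generates, versus the LOAD_FAST used once the attribute is hoisted into a local:

```python
import dis


class C:
    def __init__(self):
        self.cache = {}

    def dotted(self, key):
        # `self.cache` is resolved twice: two LOAD_ATTR instructions for `cache`.
        return self.cache.get(key), self.cache.get(key)

    def hoisted(self, key):
        # `self.cache` is resolved once; later uses are LOAD_FAST on the local.
        cache = self.cache
        return cache.get(key), cache.get(key)


dis.dis(C.dotted)
dis.dis(C.hoisted)
```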

Whether or not those changes are "noticeable" depends on how you define noticeable. But regardless of the definition, the interpreter will make fewer C-API calls with this code, and the user's CPU will not be processing instructions it has no need to process.

Before I spend any more time here, how do you define "noticeable"? We already know it will make some difference greater than zero. Probably small, of course, but every little bit helps, particularly for this type of middleware-ish library. Do we really need a benchmark when the reduction in instructions is itself measurable and the PR doesn't particularly harm readability?

@Dreamsorcerer (Member)

It's generally frowned upon to modify Python code and make it less readable for nanosecond optimisation. If there's no real-world benefit to such optimisation, which can generally be seen in realistic benchmarks (e.g. like some of the small optimisations in aiohttp), then there's likely no point in spending time on this. There are much better approaches that could improve performance by several orders of magnitude more.

These optimisations may also depend on the platform they run on. e.g. If we compiled this code with mypyc, there may be exactly 0 difference in performance.

@Dreamsorcerer (Member)

In aiohttp we've had several micro-optimisations go through because they made a significant difference (e.g. 5% in a tight benchmark) to performance as measured in codspeed. Likewise, we also rejected several micro-optimisations where the difference was negligible.

@BobTheBuidler (Contributor, Author) commented Aug 17, 2025

It's generally frowned upon to modify Python code and make it less readable for nanosecond optimisation.

I'd agree, but I don't think any of that is happening here. More thoughts below.

If there's no real-world benefit to such optimisation, which can generally be seen in realistic benchmarks (e.g. like some of the small optimisations in aiohttp), then there's likely no point in spending time on this.

Respectfully, we're spending more time on this chat than it took me to refactor out all five uses of the dot operator!

There are much better approaches that could improve performance by several orders of magnitude more.

Sure, but this is both low-hanging fruit and already implemented. Nothing in this PR or review prevents any of the other approaches from also being implemented.

If we compiled this code with mypyc, there may be exactly 0 difference in performance.

This is where I'll have to disagree; I've merged 10 PRs into the mypyc codebase in just the past two weeks.

If you compile both versions of this code with mypyc, which is exactly what I'm doing (you can check out the C code on my fork!) and the reason I ended up here in the first place, the difference is even more obvious in the C code. There is zero chance that "calling a C function 5 extra times" and "calling a C function 0 extra times" have "0 difference in performance" when compiled with mypyc. An impossibility, really. A difference is essentially guaranteed: the CPU receives fewer instructions and goes through fewer wasted cycles. The only real question, as you've noted above, is the magnitude.

But anyhow, I suppose none of this matters if we don't have an actual performance target in mind. Are you saying the improvement must be > 5% in order to merge a micro-optimization PR?

@Dreamsorcerer (Member)

This is where I'll have to disagree: if you compile both versions of this code with mypyc, which is exactly what I'm doing (you can check out the C code on my fork!) and the reason I ended up here in the first place, the difference is even more obvious in the C code. There is zero chance that "calling a C function 5 extra times" and "calling a C function 0 extra times" have "0 difference in performance" when compiled with mypyc. An impossibility, really. A difference is essentially guaranteed: the CPU receives fewer instructions and goes through fewer wasted cycles. The only real question, as you've noted above, is the magnitude.

This is still dependent on the current state of mypyc, though. I suspect the difference you see is because OrderedDict is not native in mypyc today; that could change in a future release. The documentation explicitly says not to do these kinds of micro-optimisations: https://mypyc.readthedocs.io/en/latest/performance_tips_and_tricks.html#using-fast-native-features

It's entirely possible that a future release sees this change become a performance degradation. Without meaningful benchmarks to catch any such changes, it doesn't seem worthwhile to me.

@BobTheBuidler (Contributor, Author) commented Aug 17, 2025

Mypyc is not able to safely optimize away any of the PyObject_GetAttr calls, either now or in the future, since that would violate Python semantics: the __get__ method of descriptors would not be called as expected.

You see a difference because one version of the code calls PyObject_GetAttr more times than the other, for no benefit, as we aren't accessing descriptors here.

In the event of future optimizations, all instances where we currently have a GetAttr call would get faster by the same amount, but again, it would never become the case that multiple calls are faster than a single call. That's simply not possible.
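
To make the semantics argument concrete (my own minimal example, unrelated to async-lru's internals): any attribute backed by a descriptor has its __get__ run on every dotted access, so the compiler can't silently drop those lookups, whereas reading the attribute once into a local captures the value and triggers no further __get__ calls:

```python
class Counting:
    """A non-data descriptor whose __get__ runs on every dotted access."""

    def __init__(self):
        self.calls = 0

    def __get__(self, obj, objtype=None):
        self.calls += 1
        return self.calls


class Holder:
    attr = Counting()


h = Holder()
h.attr, h.attr, h.attr        # three dotted accesses -> __get__ called three times
value = h.attr                # one more access -> fourth call
value, value, value           # reusing the local triggers no further __get__ calls
print(Holder.__dict__["attr"].calls)  # 4
```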

Again, what number are you looking for in a benchmark? Is the minimum performance improvement really 5%?

@BobTheBuidler (Contributor, Author) commented Aug 17, 2025

that could change in a future release

I know, I'm the one who wrote that PR; it's in review already, lol.

OrderedDict won't necessarily become a fully-fledged native class, but we will be able to use the more specific PyDict_GetItem function from Python's C API, just like we currently do for a true instance of builtins.dict.

But that isn't hugely relevant here, because when compiled with mypyc these C API calls become C-struct member lookups (and even then, a single lookup takes less time than several).

@BobTheBuidler (Contributor, Author) commented Aug 17, 2025

The documentation explicitly says not to do these kinds of micro-optimisations: https://mypyc.readthedocs.io/en/latest/performance_tips_and_tricks.html#using-fast-native-features

I'm familiar with those docs, but that documentation specifically says not to do early binding, which only applies to methods.

If you notice in that example, the value they are caching is not an object, it's a method which is bound to the object in question. That's what they mean by "binding".

We aren't doing any of that here. We are looking up attributes once, and then using them repeatedly without any extra lookups.
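
Roughly, the distinction being drawn (a sketch under my reading of the thread, with made-up class names): the docs warn against caching a bound method on the instance and reusing it across calls, whereas this PR only reads an attribute into a local within a single call:

```python
# What the docs warn against (hypothetical example): stashing a bound method
# for reuse across calls, which defeats mypyc's early binding of native methods.
class Worker:
    def __init__(self):
        self.fast_step = self.step   # cached *bound method*, reused later

    def step(self, x: int) -> int:
        return x + 1


# What this PR does (hypothetical example): within one call, an attribute used
# several times is read once into a local and then reused; no binding is cached.
class Counter:
    def __init__(self):
        self.values: dict[str, int] = {}

    def bump(self, key: str) -> int:
        values = self.values          # single attribute lookup
        values[key] = values.get(key, 0) + 1
        return values[key]
```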

@Dreamsorcerer (Member)

early binding, which only applies to methods.

If you follow the link, it says "References to functions, types, most attributes, and methods in the same compilation unit use early binding".

So, my understanding is that if OrderedDict becomes native and the async-lru class is being compiled as native, then what it says still applies and this change would actually decrease performance.

Mypyc is not able to safely optimize away any of the PyObject_GetAttr calls

I was under the impression that this is exactly what it does for native classes. As it can see the entire class upfront, it can know that no descriptors exist. But, regardless, some functionality is simply not supported with native classes (I've had to opt out of native classes on occasion for some projects).

@BobTheBuidler (Contributor, Author) commented Aug 18, 2025

I was under the impression that this is exactly what it does for native classes.

Yeah, this is mostly the case, but it just replaces each C API call with a struct member lookup wherever a C API call previously existed, as noted above.

As it can see the entire class upfront, it can know that no descriptors exist.

This is actually a really good idea for an optimization, though; I'll try to encode this logic into a mypyc PR. We'd essentially just need to detect whether a particular native class defines its own __get__. But even if/when I implement that, the C code will match the compiled C code of this PR (as in, the final compiled code in both cases would have one lookup for each attribute that is looked up).

Do we have a numeric performance improvement target in mind for the benchmark or should I just consider this PR dead? At this point I've asked 3 times what number would convince you and you haven't given me anything to work with. I could always just fork and compile faster-async-lru and call it a day, but the goal was to improve the actual lib that people widely use, and do it in a way that doesn't harm readability.

@Dreamsorcerer (Member)

Do we have a numeric performance improvement target in mind for the benchmark or should I just consider this PR dead?

I'd at least want a regression test so we know if some future change breaks the performance improvement here. So, let's say: if the improvement is enough to stand out from the noise of a benchmark, then we'll use it. codspeed does some things to eliminate syscalls etc. from benchmarks, but they'll still have a degree of random noise.

@BobTheBuidler (Contributor, Author)

Okay great, thanks. I'll try to put that together this week.

@BobTheBuidler (Contributor, Author)

Benchmarks added; let me know if this looks like what you had in mind.

(Member)

Probably also want this in a separate PR, in order to see the change this PR produces.

@Dreamsorcerer requested a review from bdraco on August 25, 2025, 19:05
@BobTheBuidler (Contributor, Author)

Looks like I was using an invalid pytest.mark.codspeed marker.

I've replaced it with pytest.mark.benchmark but should I just remove the pytestmark line entirely?
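
For reference, this is roughly the shape of benchmark being discussed (an illustrative sketch only, not the tests added in this PR or in #688; it assumes pytest-codspeed's `benchmark` fixture, which mirrors pytest-benchmark's API):

```python
# Illustrative benchmark sketch; assumes pytest-codspeed (or pytest-benchmark)
# provides the `benchmark` fixture.
import asyncio

from async_lru import alru_cache


def test_cache_hit_path(benchmark):
    def run() -> None:
        @alru_cache(maxsize=32)
        async def cached(x: int) -> int:
            return x

        async def hit_loop() -> None:
            await cached(1)              # first call: a miss that fills the cache
            for _ in range(100):
                await cached(1)          # subsequent calls: cache hits

        asyncio.run(hit_loop())

    benchmark(run)
```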

@Dreamsorcerer (Member)

I've replaced it with pytest.mark.benchmark but should I just remove the pytestmark line entirely?

I'd just look at how it's been done on aiohttp, multidict etc. I've added @bdraco to review, so hopefully he can also provide some guidance.

@BobTheBuidler (Contributor, Author) commented Aug 27, 2025

I created a new PR, #688, with just the codspeed stuff. Once we merge that, I can rebase this PR on top to clean up the diff.

@Dreamsorcerer removed the request for review from bdraco on August 28, 2025, 10:51

codecov bot commented Sep 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.23%. Comparing base (2913fe2) to head (303f9d5).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #683      +/-   ##
==========================================
+ Coverage   97.22%   97.23%   +0.01%     
==========================================
  Files          12       12              
  Lines         793      796       +3     
  Branches       41       41              
==========================================
+ Hits          771      774       +3     
  Misses         20       20              
  Partials        2        2              
Flag Coverage Δ
unit 95.85% <100.00%> (+0.01%) ⬆️



codspeed-hq bot commented Sep 27, 2025

CodSpeed Performance Report

Merging #683 will not alter performance

Comparing BobTheBuidler:call (303f9d5) with master (2913fe2)

Summary

✅ 39 untouched
