Optimize values() calls #5479
base: develop
Conversation
self._field = self._manual_field or field
self._big_field = big_field
self._num_list_fields = len(list_fields)
# Optimization: call str() in memory rather than $toString in database
Not really related to this PR, but would it be possible to not convert it to a string? Or would that break a lot of things / not be easy to change?
I'm just thinking that in-memory conversion to string (if done in a list comprehension rather than while iterating over the cursor) doesn't scale well for millions of samples.
values("id")
returns strings, values("_id")
returns ObjectIDs. So it is up to the caller on whether they want to potentially optimize things by directly working with ObjectIDs rather than strings.
You're right that it should be slightly faster to use ObjectIDs. It removes the $toString
conversion in the database, and ObjectId
is fewer bytes than str(ObjectId)
so it would decrease network I/O somewhat.
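For illustration, a minimal sketch of the two call styles discussed above (the dataset name is hypothetical):

```python
import fiftyone as fo

dataset = fo.load_dataset("my-dataset")  # hypothetical dataset name

# Returns string IDs; the IDs are converted from ObjectIds, either via
# $toString in the database or via str() in Python with this PR
str_ids = dataset.values("id")

# Returns raw ObjectIds; skips the string conversion entirely and
# transfers fewer bytes over the wire
object_ids = dataset.values("_id")
```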
# Optimization: use field inclusion syntax when possible
if self._expr is None and path == "_id":
    big_field = path
    _pipeline = [{"$project": {"_id": 1}}]
What is big_field? I think I looked into it once, and it looked like it was just used for frames... I may be wrong, but I don't think frame _ids are used much, just sample_id and frame_number, and in that case we should project those fields.
values() is a "big" aggregation because it returns IDs one at a time (in separate documents). All other aggregations are true aggregations: they do the combination in the database and return the results in a single document (i.e. min(), max(), etc).
Returning values() results as "big" style aggregations is necessary to avoid errors when the dataset contains >16MB of IDs.
Change log
This PR experiments with optimizing values("id") calls.
Currently, values("id") executes a pipeline that converts each ID to a string in the database via $toString. I wanted to see how much faster values("id") would be if a bare [{"$project": {"_id": 1}}] pipeline were used instead, with the str() conversion happening in Python rather than in the database.
Additionally, values("id") does not leverage the _id field index in either of the above pipelines by default; it uses a COLLSCAN. So, I also wanted to see how much faster/slower both pipelines are when they use an IXSCAN, which can be forced via hint={"_id": 1}.
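For reference, a sketch of the two pipeline shapes being compared, run against an assumed pymongo collection (the pipelines FiftyOne actually generates may contain additional stages):

```python
from pymongo import MongoClient

coll = MongoClient()["fiftyone"]["samples"]  # placeholder collection

# Current shape: convert each ObjectId to a string in the database
original = [{"$project": {"_id": {"$toString": "$_id"}}}]
ids = [doc["_id"] for doc in coll.aggregate(original)]

# Optimized shape: plain field inclusion, then convert in Python
optimized = [{"$project": {"_id": 1}}]
ids = [str(doc["_id"]) for doc in coll.aggregate(optimized)]

# Forcing an IXSCAN on the _id index (equivalent to the hint={"_id": 1}
# mentioned above; pymongo expects the index spec as a list of tuples)
ids = [str(doc["_id"]) for doc in coll.aggregate(optimized, hint=[("_id", 1)])]
```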
Findings
The benchmarking section below shows the following findings on a local database with a 500k sample dataset:
- The {"$project": {"_id": 1}} pipeline runs in 2.1 seconds, compared to the original pipeline, which runs in 3.4 seconds
- hint={"_id": 1} makes both pipelines 0.4-0.6 seconds slower
Create large dataset
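As an illustration only (not the author's actual script), a 500k sample dataset along these lines could be created with something like:

```python
import fiftyone as fo

dataset = fo.Dataset("values-benchmark")  # hypothetical dataset name

# The filepaths need not exist on disk; only the documents matter here
samples = [
    fo.Sample(filepath="/tmp/img-%06d.jpg" % i) for i in range(500_000)
]
dataset.add_samples(samples)
```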
Test real usage
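A simple way to exercise the real code path (again a sketch, not the author's script) is to time the values("id") call itself:

```python
import time

start = time.perf_counter()
ids = dataset.values("id")
elapsed = time.perf_counter() - start
print("values('id') returned %d ids in %.2fs" % (len(ids), elapsed))
```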
Benchmarking
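The raw pipelines could be compared in isolation with a harness along these lines (a sketch assuming direct pymongo access; timings will differ by environment):

```python
import time

from pymongo import MongoClient

coll = MongoClient()["fiftyone"]["samples"]  # placeholder collection

PIPELINES = {
    "toString": [{"$project": {"_id": {"$toString": "$_id"}}}],
    "inclusion": [{"$project": {"_id": 1}}],
}

for name, pipeline in PIPELINES.items():
    for hint in (None, [("_id", 1)]):
        kwargs = {"hint": hint} if hint else {}
        start = time.perf_counter()
        ids = [doc["_id"] for doc in coll.aggregate(pipeline, **kwargs)]
        if name == "inclusion":
            # In-memory conversion, as proposed in this PR
            ids = [str(_id) for _id in ids]
        print("%s (hint=%s): %.2fs" % (name, hint, time.perf_counter() - start))
```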