Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repoID bitmap for speeding up findShard in compound shards #899

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

keegancsmith
Copy link
Member

We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase.

The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind.

It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers.

Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards.

I haven't quantified the performance improvements yet. Before landing I will generate some synthetic large compound shards and test on my machine the changes in perf for findShard (will update description).

Closes https://linear.app/sourcegraph/issue/SPLF-824/zoekt-fast-detection-of-repo-id-in-shard

We add a new section to shards which contains a roaring bitmap for
quickly checking if a shard contains a repo ID. We then can load just
this (small amount) of data to rule out a compound shard. We use roaring
bitmaps since we already have that dependency in our codebase.

The reason we speed up this operation is we found on a large instance
which contained thousands of tiny repos we spent so much time in
findShard that our indexing queue would always fall behind.

It is possible this new section won't speed this up enough and we need
some sort of global oracle (or in-memory cache in indexserver?). This is
noted in the code for future travellers.

Test Plan: the existing unit tests already cover if this is forwards and
backwards compatible. Additionally I added some logging to zoekt to test
if older version of shards still work correctly in findShard, as well as
if older versions of zoekt can read the new shards.

I haven't quantified the performance improvements yet. Before landing I
will generate some synthetic large compound shards and test on my
machine the changes in perf for findShard (will update description).
@keegancsmith keegancsmith requested a review from a team January 22, 2025 15:23
@keegancsmith
Copy link
Member Author

Note: I still need to add unit tests

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems really useful to have this bitmap!

Overall question: when you saw a bunch of time spent in findShard, do you know what exactly was slow? Was it reading the metadata, or going through all the repos to check if one existed?

The reason I ask is that I already optimized the metadata reading part a bit: #826. Before, we were reading all sections in the shard (!!) now we just load the metadata. I am curious if the large instance you saw had this fix, but still saw slowness, or could have been missing it.

@@ -187,6 +188,8 @@ func (t *indexTOC) sectionsTaggedList() []taggedSection {
{"nameBloom", &unusedSimple},
{"contentBloom", &unusedSimple},
{"ranks", &unusedSimple},

{"reposIDsBitmap", &t.reposIDsBitmap},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny comment, it'd be nice to group this with the in-use sections above so it's next to "repos"

// If we are still seeing performance issues, we should consider adding
// some sort of global oracle here to avoid filepath.Glob and checking
// each compound shard.
if !zoekt.MaybeContainRepo(fn, o.RepositoryDescription.ID) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud: what if this were not two separate steps (so callers must remember to check MaybeContainRepo first), but a single new method like zoekt.ReadMetadataForRepo(fn, repoID)? That might also let us share work between the two, and avoid opening and mmapping the index file twice (although it's likely not a big deal :))

@@ -96,7 +96,8 @@ type indexTOC struct {
contentChecksums simpleSection
runeDocSections simpleSection

repos simpleSection
repos simpleSection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we bump the index FeatureVersion or FormatVersion here? (I'm not 100% on which one 😊 ) That way, far in the future once we've dropped support for older versions, we can know the bitmap is always there, and simplify the logic?

@keegancsmith
Copy link
Member Author

I'll get back to finishing this PR this week, its just not high priority so wanna prioritize some other work first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants