-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
repoID bitmap for speeding up findShard in compound shards #899
base: main
Are you sure you want to change the base?
Conversation
We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase. The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind. It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers. Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards. I haven't quantified the performance improvements yet. Before landing I will generate some synthetic large compound shards and test on my machine the changes in perf for findShard (will update description).
Note: I still need to add unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems really useful to have this bitmap!
Overall question: when you saw a bunch of time spent in findShard
, do you know what exactly was slow? Was it reading the metadata, or going through all the repos to check if one existed?
The reason I ask is that I already optimized the metadata reading part a bit: #826. Before, we were reading all sections in the shard (!!) now we just load the metadata. I am curious if the large instance you saw had this fix, but still saw slowness, or could have been missing it.
@@ -187,6 +188,8 @@ func (t *indexTOC) sectionsTaggedList() []taggedSection { | |||
{"nameBloom", &unusedSimple}, | |||
{"contentBloom", &unusedSimple}, | |||
{"ranks", &unusedSimple}, | |||
|
|||
{"reposIDsBitmap", &t.reposIDsBitmap}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny comment, it'd be nice to group this with the in-use sections above so it's next to "repos"
// If we are still seeing performance issues, we should consider adding | ||
// some sort of global oracle here to avoid filepath.Glob and checking | ||
// each compound shard. | ||
if !zoekt.MaybeContainRepo(fn, o.RepositoryDescription.ID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud: what if this were not two separate steps (so callers must remember to check MaybeContainRepo
first), but a single new method like zoekt.ReadMetadataForRepo(fn, repoID)
? That might also let us share work between the two, and avoid opening and mmapping the index file twice (although it's likely not a big deal :))
@@ -96,7 +96,8 @@ type indexTOC struct { | |||
contentChecksums simpleSection | |||
runeDocSections simpleSection | |||
|
|||
repos simpleSection | |||
repos simpleSection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we bump the index FeatureVersion
or FormatVersion
here? (I'm not 100% on which one 😊 ) That way, far in the future once we've dropped support for older versions, we can know the bitmap is always there, and simplify the logic?
I'll get back to finishing this PR this week, its just not high priority so wanna prioritize some other work first. |
We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase.
The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind.
It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers.
Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards.
I haven't quantified the performance improvements yet. Before landing I will generate some synthetic large compound shards and test on my machine the changes in perf for findShard (will update description).
Closes https://linear.app/sourcegraph/issue/SPLF-824/zoekt-fast-detection-of-repo-id-in-shard