repoID bitmap for speeding up findShard in compound shards #899

keegancsmith · 2025-01-22T15:23:08Z

This adds a new section to shards containing a roaring bitmap, so we can quickly check whether a shard contains a repo ID. We can then load just this small amount of data to rule out a compound shard. We use roaring bitmaps because the codebase already depends on them.
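
To illustrate the idea, here is a minimal sketch of a per-shard membership check. It uses a plain fixed-width bitset built on the standard library as a stand-in for a roaring bitmap; `repoIDSet`, `add`, and `contains` are hypothetical names and not zoekt's API.

```go
package main

import "fmt"

// repoIDSet is a plain bitset used here as a stand-in for the roaring bitmap
// mentioned in the description. It is hypothetical and not zoekt's type.
type repoIDSet struct {
	words []uint64
}

// add records id in the set, growing the backing slice as needed.
func (s *repoIDSet) add(id uint32) {
	w := int(id / 64)
	for len(s.words) <= w {
		s.words = append(s.words, 0)
	}
	s.words[w] |= uint64(1) << (id % 64)
}

// contains reports whether id was previously added.
func (s *repoIDSet) contains(id uint32) bool {
	w := int(id / 64)
	if w >= len(s.words) {
		return false
	}
	return s.words[w]&(uint64(1)<<(id%64)) != 0
}

func main() {
	var s repoIDSet
	for _, id := range []uint32{2, 3, 5, 70} {
		s.add(id)
	}
	// A shard whose set does not contain the ID can be skipped without
	// reading the rest of its metadata.
	fmt.Println(s.contains(5), s.contains(4))
}
```

A roaring bitmap offers the same kind of membership test while also compressing sparse ID ranges, which a plain bitset like this one does not.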

The motivation: on a large instance with thousands of tiny repos, we spent so much time in findShard that the indexing queue would always fall behind.

It is possible this new section won't speed this up enough and that we will need some sort of global oracle (or an in-memory cache in indexserver?). This is noted in the code for future travellers.

Test Plan: the existing unit tests already cover forwards and backwards compatibility. Additionally, I added some logging to zoekt to check that older versions of shards still work correctly in findShard, and that older versions of zoekt can read the new shards.

I haven't quantified the performance improvements yet. Before landing I will generate some synthetic large compound shards and test on my machine the changes in perf for findShard (will update description).

Added a benchmark to check the impact. See comments in the code.

Closes https://linear.app/sourcegraph/issue/SPLF-824/zoekt-fast-detection-of-repo-id-in-shard

keegancsmith · 2025-01-22T16:02:53Z

Note: I still need to add unit tests

jtibshirani

It seems really useful to have this bitmap!

Overall question: when you saw a bunch of time spent in findShard, do you know what exactly was slow? Was it reading the metadata, or going through all the repos to check if one existed?

The reason I ask is that I already optimized the metadata reading part a bit: #826. Before, we were reading all sections in the shard (!!); now we just load the metadata. I am curious whether the large instance you saw had this fix and was still slow, or whether it was missing it.

jtibshirani · 2025-01-22T16:31:17Z

toc.go

@@ -187,6 +188,8 @@ func (t *indexTOC) sectionsTaggedList() []taggedSection {
 		{"nameBloom", &unusedSimple},
 		{"contentBloom", &unusedSimple},
 		{"ranks", &unusedSimple},
+
+		{"reposIDsBitmap", &t.reposIDsBitmap},


Tiny comment, it'd be nice to group this with the in-use sections above so it's next to "repos"

jtibshirani · 2025-01-22T16:59:56Z

build/builder.go

+		// If we are still seeing performance issues, we should consider adding
+		// some sort of global oracle here to avoid filepath.Glob and checking
+		// each compound shard.
+		if !zoekt.MaybeContainRepo(fn, o.RepositoryDescription.ID) {


Thinking out loud: what if this were not two separate steps (so callers must remember to check MaybeContainRepo first), but a single new method like zoekt.ReadMetadataForRepo(fn, repoID)? That might also let us share work between the two, and avoid opening and mmapping the index file twice (although it's likely not a big deal :))

stefanhengl · 2025-02-18

I think that makes sense. I have refactored the code.
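
A rough sketch of what a single-call shape could look like, assuming simplified in-memory stand-ins for a shard. `shard`, `repository`, and `readMetadataForRepo` are hypothetical and do not reflect the refactored zoekt code; the ID set models the bitmap section, and a nil set models an older shard written without it.

```go
package main

import "fmt"

// repository and shard are simplified, hypothetical stand-ins. A real shard
// is an index file on disk, not an in-memory struct.
type repository struct {
	ID   uint32
	Name string
}

type shard struct {
	// repoIDs models the repo ID bitmap section. nil means the shard was
	// written before the section existed.
	repoIDs map[uint32]struct{}
	// repos models the full repository metadata, which is costlier to read.
	repos []repository
}

// readMetadataForRepo does the cheap membership check first and only scans
// the full metadata when the shard might contain repoID. Callers cannot
// forget the pre-check because it is part of the same call.
func readMetadataForRepo(s shard, repoID uint32) (repository, bool) {
	if s.repoIDs != nil {
		if _, ok := s.repoIDs[repoID]; !ok {
			return repository{}, false // ruled out without touching repos
		}
	}
	for _, r := range s.repos {
		if r.ID == repoID { // match by ID only, never by name
			return r, true
		}
	}
	return repository{}, false
}

func main() {
	s := shard{
		repoIDs: map[uint32]struct{}{4: {}, 5: {}},
		repos:   []repository{{ID: 4, Name: "repoD"}, {ID: 5, Name: "repoE"}},
	}
	r, ok := readMetadataForRepo(s, 5)
	fmt.Println(r.Name, ok)
	_, ok = readMetadataForRepo(s, 9)
	fmt.Println(ok)
}
```

Falling back to the full scan when the ID set is absent keeps older shards readable in this sketch.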

jtibshirani · 2025-01-22T17:04:40Z

toc.go

@@ -96,7 +96,8 @@ type indexTOC struct {
 	contentChecksums simpleSection
 	runeDocSections  simpleSection

-	repos simpleSection
+	repos          simpleSection


Should we bump the index FeatureVersion or FormatVersion here? (I'm not 100% on which one 😊 ) That way, far in the future once we've dropped support for older versions, we can know the bitmap is always there, and simplify the logic?

stefanhengl · 2025-02-18

FeatureVersion seems wrong, because we don't require a reindex.

Not sure about FormatVersion either. The code is currently in an in-between state. We write and read v16 by default, but compound shards are v17. Introducing v18 now seems confusing.

I am also not sure the distinction between FeatureVersion and FormatVersion is really useful. I am always confused about the different versions. Maybe it's worth rethinking the model? 🤔

I suggest completing the move to v17 in a separate PR and using that to simplify the logic: we can re-index v16 shards if we encounter them, but keep support for reading them for a couple of releases.

jtibshirani · 2025-02-18

Sounds good to do it in a separate PR. Also let's catch up about FeatureVersion/FormatVersion more generally, I'd like to understand how to simplify the model.

keegancsmith · 2025-01-27T07:27:46Z

I'll get back to finishing this PR this week; it's just not high priority, so I want to prioritize some other work first.

jtibshirani · 2025-02-07T22:53:18Z

@keegancsmith let's revive this PR!! (Also sorry about the merge conflicts ...)

stefanhengl · 2025-02-18T10:44:36Z

index/builder_test.go

-					{Name: "sameName", ID: 2},
-					{Name: "sameName", ID: 3},
+					{Name: "repoB", ID: 2},
+					{Name: "repoC", ID: 3},
 				},
 				{
-					{Name: "repoB", ID: 4},
-					{Name: "sameName", ID: 5},
-					{Name: "sameName", ID: 6},
+					{Name: "repoD", ID: 4},
+					{Name: "repoE", ID: 5},
+					{Name: "repoF", ID: 6},
 				},
 			},
 			expectedShardCount: 1,
-			expectedRepository: zoekt.Repository{Name: "sameName", ID: 5},
+			expectedRepository: zoekt.Repository{Name: "something-else", ID: 5},


Fly-by change to make the intent of the test more obvious. Before, the name and ID both matched, so it was not obvious that we match by ID only.

jtibshirani · 2025-02-18T16:56:24Z

index/builder_test.go

+// BenchmarkFindCompoundShard-16    	   33505	     36016 ns/op
+//
+// Without optimization
+// BenchmarkFindCompoundShard-16    	      76	  15568589 ns/op


Nice! I guess this answers my earlier question:

"When you saw a bunch of time spent in findShard, do you know what exactly was slow? Was it reading the metadata, or going through all the repos to check if one existed?"
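
For scale, the two ns/op figures quoted in the benchmark comment can be turned into a ratio. The small helper below is only illustrative arithmetic on those published numbers; `speedup` is a hypothetical name.

```go
package main

import "fmt"

// speedup returns how many times faster the optimized run is per operation,
// given the ns/op of the baseline and of the optimized benchmark.
func speedup(baselineNs, optimizedNs float64) float64 {
	return baselineNs / optimizedNs
}

func main() {
	// Figures taken from the BenchmarkFindCompoundShard comment above.
	fmt.Printf("%.0fx\n", speedup(15568589, 36016)) // prints "432x"
}
```

That is roughly 15.6 ms per lookup without the bitmap versus about 36 µs with it.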

jtibshirani · 2025-02-18T16:58:18Z

index/read.go

@@ -649,6 +653,103 @@ func IndexFilePaths(p string) ([]string, error) {
 	return exist, nil
 }

+// MaybeContainRepo returns true if the shard at path p could contain repoID.


Small comment: this should be maybeContainsRepo. Also, it seems like the method we're encouraging callers to use is actually containsRepo. So we could move the doc comment there instead (and rework it?)

jtibshirani · 2025-02-18T17:08:05Z

index/builder_test.go

+
+		shard := o.findCompoundShard()
+		if shard != "" {
+			b.Fatal("expected emtpy result")


Suggested change

-			b.Fatal("expected emtpy result")
+			b.Fatal("expected empty result")

keegancsmith requested a review from a team January 22, 2025 15:23

jtibshirani reviewed Jan 22, 2025

keegancsmith assigned stefanhengl Feb 17, 2025

keegancsmith and others added 2 commits February 17, 2025 15:06

mild refactor, add tests and benchmark

04e94b3

stefanhengl force-pushed the k/find-shard-ds branch from 8d65282 to 04e94b3 Compare February 18, 2025 10:32

stefanhengl reviewed Feb 18, 2025

stefanhengl requested a review from jtibshirani February 18, 2025 11:34

jtibshirani approved these changes Feb 18, 2025

jtibshirani reviewed Feb 18, 2025

stefanhengl added 3 commits February 19, 2025 16:50

improve docstrings

87d0d9e

typo

4c9e0aa

typo

d218d5a

stefanhengl merged commit 456196a into main Feb 19, 2025
9 checks passed

stefanhengl deleted the k/find-shard-ds branch February 19, 2025 16:01

jtibshirani mentioned this pull request Mar 6, 2025

ranking: incorporate file signals into BM25F #922

Merged
