
MQE: Fix panic in loading too many samples #10261

Merged: jhesketh merged 8 commits into grafana:main from jhesketh/mqe-pooling on Dec 18, 2024

Conversation

@jhesketh (Contributor) commented Dec 17, 2024

What this PR does

This PR solves two bugs:

Bug 1:

BucketedPool would create pools in powers of two up to a set maxSize. That is, if maxSize isn't a power of two itself, the largest bucket would be smaller than maxSize.

However, when requesting a slice from the pool we were only checking against maxSize, not whether a bucket existed for that size. Now, instead, we calculate the bucketIndex and check that the bucket exists before using it.

This would cause panics in range vectors of more than 65,536 samples.

This problem was introduced in #10199
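
To make the failure concrete, here is a small sketch of the arithmetic (assuming buckets start at 2^0 and indices are derived with bits.Len, as the diff fragments quoted later in this thread suggest; the rest is illustrative):

package main

import (
    "fmt"
    "math/bits"
)

func main() {
    // With maxSize = 100_000, buckets exist for 2^0 .. 2^16 (2^17 = 131_072
    // would exceed maxSize), i.e. 17 buckets at indices 0..16.
    maxSize := 100_000
    numBuckets := bits.Len(uint(maxSize)) // 17

    // A request for 70_000 samples passes the old "size <= maxSize" check...
    size := 70_000
    bucketIndex := bits.Len(uint(size - 1)) // 17: the bucket for 2^17

    // ...but indexes one past the last bucket, which is what panicked.
    fmt.Println(bucketIndex >= numBuckets) // true, so the fix bails out here
}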

Bug 2:

If a requested slice size exceeded the size of the pools, a slice of exactly the requested size was returned. This broke the ring buffers, which require slice capacities to be powers of two.

This PR ensures that the returned slice capacity is always a power of two. This means we may return a slice larger than MaxExpectedPointsPerSeries, but it won't come from a pool (and similarly won't be returned to a pool). The MemoryConsumptionTracker protects us against any unreasonable requests.

This was introduced in #9664

Originally this would cause panics in subqueries selecting more than 65,536 samples, but since #10199 it returns errors instead.
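
The rounding itself is simple; a hypothetical helper illustrating the guarantee (the fix's actual code lives inline in BucketedPool.Get, quoted further down):

package main

import (
    "fmt"
    "math/bits"
)

// nextPowerOfTwo rounds size up to the nearest power of two. This is a
// hypothetical stand-in for what the fix does inline: even requests larger
// than maxSize get a power-of-two capacity, just allocated outside the pool.
func nextPowerOfTwo(size int) int {
    if size <= 1 {
        return 1
    }
    return 1 << bits.Len(uint(size-1))
}

func main() {
    fmt.Println(nextPowerOfTwo(100_001)) // 131072: beyond MaxExpectedPointsPerSeries, still a power of two
    fmt.Println(nextPowerOfTwo(65_536))  // 65536: exact powers of two pass through unchanged
}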

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@jhesketh requested a review from a team as a code owner December 17, 2024 09:28
@jhesketh mentioned this pull request Dec 17, 2024
@jhesketh (Author) commented: (this will need backporting to r321)

(2 resolved review threads on pkg/util/pool/bucketed_pool.go)
@jhesketh merged commit 34a24b1 into grafana:main on Dec 18, 2024
29 checks passed
@jhesketh deleted the jhesketh/mqe-pooling branch December 18, 2024 09:52
jhesketh added a commit that referenced this pull request Dec 18, 2024
* MQE: Fix panic in loading too many samples

BucketedPool would create pools in powers of two up to a set
`maxSize`. That is, if `maxSize` isn't a power of two itself,
then the largest bucket would be smaller than `maxSize`.

However, when requesting a slice from the pool we were only checking
against `maxSize`, and not whether a bucket existed for that size.
Instead, calculate the bucketIndex and check that the bucket exists
before using it.

* Use a power-of-two size for the max bucket

* MQE: Ensure BucketedPool always returns power-of-two slices

This is to guarantee they work with the ring buffers, which expect
slice capacities to always be a power of two.

The limiting pool still protects us from requesting an unreasonable
number of points with the MemoryConsumptionTracker.

* Fix tests

* Correctly return slices to their respective buckets or discard them

* Extra tests

* Address review feedback

* Remove unnecessary slice length check

(cherry picked from commit 34a24b1)

Co-authored-by: Joshua Hesketh <[email protected]>
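
For context on why the ring buffers insist on power-of-two capacities: with a power-of-two capacity, wrap-around indexing can use a cheap bitmask instead of a modulo. A minimal sketch (not the actual FPointRingBuffer code, whose internals aren't shown in this thread):

package main

import "fmt"

// ringBuffer is a stripped-down illustration. cap(points) must be a power
// of two for the mask in at() to be equivalent to a modulo.
type ringBuffer struct {
    points []float64
    first  int // index of the oldest point
}

func (b *ringBuffer) at(i int) float64 {
    // (first + i) & (cap - 1) == (first + i) % cap only when cap is a power
    // of two; a slice with, say, capacity 15 would silently wrap incorrectly.
    return b.points[(b.first+i)&(cap(b.points)-1)]
}

func main() {
    b := &ringBuffer{points: make([]float64, 8), first: 6}
    b.points[6], b.points[7], b.points[0] = 1, 2, 3
    fmt.Println(b.at(0), b.at(1), b.at(2)) // 1 2 3: the third read wraps from 7 to 0
}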
@charleskorn (Contributor) left a comment:
Thanks for finding and fixing these issues!

@@ -13,7 +13,7 @@ import (
 )

 const (
-    MaxExpectedPointsPerSeries = 100_000 // There's not too much science behind this number: 100000 points allows for a point per minute for just under 70 days.
+    MaxExpectedPointsPerSeries = 131_072 // There's not too much science behind this number: 100,000 points allows for a point per minute for just under 70 days. Then we use the next power of two.
[nit] Might be good to clarify why we want a power of two here, for example:

Suggested change:
-    MaxExpectedPointsPerSeries = 131_072 // There's not too much science behind this number: 100,000 points allows for a point per minute for just under 70 days. Then we use the next power of two.
+    MaxExpectedPointsPerSeries = 131_072 // There's not too much science behind this number: 100,000 points allows for a point per minute for just under 70 days. Then we use the next power of two, given the pools always return slices with capacity equal to a power of two.

Comment on lines +231 to +259
func TestLimitingBucketedPool_PowerOfTwoCapacities(t *testing.T) {
    memoryConsumptionTracker := limiting.NewMemoryConsumptionTracker(0, nil)

    pool := NewLimitingBucketedPool(
        pool.NewBucketedPool(100_000, func(size int) []int { return make([]int, 0, size) }),
        1,
        false,
        nil,
    )

    cases := []struct {
        requestedSize int
        expectedCap   int
    }{
        {3, 4},
        {5, 8},
        {10, 16},
        {65_000, 65_536},
        {100_001, 131_072}, // Exceeds max, expect next power of two
    }

    for _, c := range cases {
        slice, err := pool.Get(c.requestedSize, memoryConsumptionTracker)
        require.NoError(t, err, "Unexpected error when requesting size %d", c.requestedSize)
        require.Equal(t, c.expectedCap, cap(slice),
            "LimitingBucketedPool.Get() returned slice with capacity %d; expected %d", cap(slice), c.expectedCap)
        pool.Put(slice, memoryConsumptionTracker)
    }
}
This is testing functionality entirely in BucketedPool - should this test move to util/pool/bucketed_pool_test.go?

Comment on lines +261 to +290
func TestLimitingBucketedPool_UnreasonableSizeRequest(t *testing.T) {
    const maxMemoryLimit = 1_000_000 * FPointSize

    reg, metric := createRejectedMetric()
    memoryConsumptionTracker := limiting.NewMemoryConsumptionTracker(uint64(maxMemoryLimit), metric)

    pool := NewLimitingBucketedPool(
        pool.NewBucketedPool(100_000, func(size int) []int { return make([]int, 0, size) }),
        1,
        false,
        nil,
    )

    // Request a reasonable size
    slice, err := pool.Get(500_000, memoryConsumptionTracker)
    require.NoError(t, err, "Expected to succeed for reasonable size request")
    require.Equal(t, 524_288, cap(slice), "Capacity should be next power of two")
    assertRejectedQueryCount(t, reg, 0)

    pool.Put(slice, memoryConsumptionTracker)

    // Request an unreasonable size
    _, err = pool.Get(10_000_000, memoryConsumptionTracker)
    require.Error(t, err, "Expected an error for unreasonably large size request")
    require.Contains(t, err.Error(), "exceeded", "Error message should indicate memory consumption limit exceeded")
    assertRejectedQueryCount(t, reg, 1)

    require.Equal(t, uint64(0), memoryConsumptionTracker.CurrentEstimatedMemoryConsumptionBytes,
        "Current memory consumption should remain at 0 after rejected request")
}
Is this test testing something beyond what TestLimitingPool_Limited already covers? If so, a comment explaining the difference would be helpful.

Comment on lines +506 to +528
func TestFPointRingBuffer_UseReturnsErrorOnNonPowerOfTwoSlice(t *testing.T) {
    memoryConsumptionTracker := limiting.NewMemoryConsumptionTracker(0, nil)
    buf := NewFPointRingBuffer(memoryConsumptionTracker)

    nonPowerOfTwoSlice := make([]promql.FPoint, 0, 15)

    err := buf.Use(nonPowerOfTwoSlice)
    require.Error(t, err, "Use() should return an error for a non-power-of-two slice")
    require.EqualError(t, err, "slice capacity must be a power of two, but is 15",
        "Error message should indicate the invalid capacity")
}

func TestHPointRingBuffer_UseReturnsErrorOnNonPowerOfTwoSlice(t *testing.T) {
    memoryConsumptionTracker := limiting.NewMemoryConsumptionTracker(0, nil)
    buf := NewHPointRingBuffer(memoryConsumptionTracker)

    nonPowerOfTwoSlice := make([]promql.HPoint, 0, 15)

    err := buf.Use(nonPowerOfTwoSlice)
    require.Error(t, err, "Use() should return an error for a non-power-of-two slice")
    require.EqualError(t, err, "slice capacity must be a power of two, but is 15",
        "Error message should indicate the invalid capacity")
}
Couldn't this be incorporated into testRingBuffer?

Comment on lines +85 to +87
if bucketIndex >= len(p.buckets) {
    return // Ignore slices larger than the largest bucket
}
Instead of this, should we require that maxSize is a power of two in NewBucketedPool? I think that would be easier to reason about and less confusing for someone using this type.

For example, imagine someone created a pool with maxSize set to 1000. However, only slices with capacity up to 512 will be pooled, which is likely not what they intended.
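
If that route were taken, the constructor check is a one-liner; a hedged sketch (not the actual Mimir code):

// isPowerOfTwo reports whether n is a power of two: a power of two has a
// single set bit, so n&(n-1) clears it to zero.
func isPowerOfTwo(n uint) bool {
    return n != 0 && n&(n-1) == 0
}

// NewBucketedPool could then reject a maxSize like 1000 up front, e.g.:
//
//    if !isPowerOfTwo(maxSize) {
//        panic("BucketedPool maxSize must be a power of two")
//    }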

Comment on lines +45 to +47
// If no bucket large enough exists, a slice larger than the requested size
// of the next power of two is returned.
// Get guarantees the resulting slice always has a capacity in power of twos.
I think this could be a little clearer:

Suggested change:
-// If no bucket large enough exists, a slice larger than the requested size
-// of the next power of two is returned.
-// Get guarantees the resulting slice always has a capacity in power of twos.
+// The resulting slice always has a capacity that is a power of two.
+// If size is greater than maxSize, then a slice is still returned, however it may not be drawn from a pool.

maxSize := 100000
pool := NewBucketedPool(uint(maxSize), makeFunc)

// Request a size that triggers the last bucket boundary.
[nit]

Suggested change:
-// Request a size that triggers the last bucket boundary.
+// Request a slice with size that will be drawn from the last bucket in the pool.

    require.Len(t, s, 0)
}

func TestBucketedPool_AlwaysReturnsPowerOfTwoCapacities(t *testing.T) {
Isn't this covered by TestBucketedPool_HappyPath?

Comment on lines +167 to +168
// Create a slice with capacity that triggers the upper edge case
s := make([]int, 0, 65_000) // 86401 is close to maxSize but not aligned to power of 2
These two comments aren't clear to me: it's not clear why 86401 is mentioned in the second comment, and "upper edge case" isn't clear either. If this is intended to test the case where we're putting a slice into the last bucket in the pool and maxSize is not a power of two, then perhaps something along those lines would be clearer?

return p.make(size)
bucketIndex := bits.Len(uint(size - 1))

// If bucketIndex exceeds the number of available buckets, return a slice of the next power of two.
[nit]

Suggested change:
-// If bucketIndex exceeds the number of available buckets, return a slice of the next power of two.
+// If the requested size is larger than the size of the largest bucket, return a slice of the next power of two greater than or equal to size.
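
Putting the quoted fragments together, here is a simplified, non-generic sketch of the fixed Get path (the names follow the diff fragments above; the struct layout and everything else is an assumption, not the exact Mimir code):

package main

import (
    "fmt"
    "math/bits"
    "sync"
)

type BucketedPool struct {
    buckets []sync.Pool
    make    func(size int) []int
}

func (p *BucketedPool) Get(size int) []int {
    if size == 0 {
        return nil
    }
    // Index of the smallest power-of-two bucket with capacity >= size.
    bucketIndex := bits.Len(uint(size - 1))
    // If the request is larger than the largest bucket, allocate a
    // power-of-two slice directly instead of indexing past the bucket
    // list (the old code panicked here).
    if bucketIndex >= len(p.buckets) {
        return p.make(1 << bucketIndex)
    }
    if s, ok := p.buckets[bucketIndex].Get().([]int); ok {
        return s
    }
    return p.make(1 << bucketIndex)
}

func main() {
    p := &BucketedPool{
        buckets: make([]sync.Pool, 17), // 2^0 .. 2^16, as with maxSize = 100_000
        make:    func(size int) []int { return make([]int, 0, size) },
    }
    fmt.Println(cap(p.Get(70_000))) // 131072: beyond the pool, still a power of two
    fmt.Println(cap(p.Get(65_000))) // 65536: falls in the 2^16 bucket
}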
