Skip to content

feat: SIMD sum vector aggregators (Long/Double/Float)#19561

Open
clintropolis wants to merge 1 commit into
apache:masterfrom
clintropolis:simd-vector-aggs
Open

feat: SIMD sum vector aggregators (Long/Double/Float)#19561
clintropolis wants to merge 1 commit into
apache:masterfrom
clintropolis:simd-vector-aggs

Conversation

@clintropolis

Copy link
Copy Markdown
Member

Description

Follow-up to #19512, this PR adds opt-in jdk.incubator.vector specializations of LongSumVectorAggregator, DoubleSumVectorAggregator, and FloatSumVectorAggregator's ungrouped aggregate(buf, position, startRow, endRow) path. This functionality is controlled by the same druid.expressions.useVectorApi flag introduced for expression vector processors in the prior PR (off by default).

Shows a pretty nice improvement:

// 0: non-expression timeseries reference, 1 columns
"SELECT SUM(long1) FROM expressions",
// 4: non-expression timeseries reference, 5 columns
"SELECT SUM(long1), SUM(long4), SUM(double1), SUM(float3), SUM(long5) FROM expressions",
// 7: math op - 2 longs
"SELECT SUM(long1 * long2) FROM expressions",
// 11: all same math op - 3 longs, 1 double, 1 float
"SELECT SUM(long5 * float3 * long1 * long4 * double1) FROM expressions",
Benchmark                        (complexCompression)  (deferExpressionDimensions)  (jsonObjectStorageEncoding)  (query)  (rowsPerSegment)  (schemaType)  (storageType)  (stringEncoding)  (useVectorApi)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE        0           1500000      explicit           MMAP              UTF8           false        force  avgt    5   5.793 ± 0.554  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE        0           1500000      explicit           MMAP              UTF8            true        force  avgt    5   5.270 ± 0.264  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE        4           1500000      explicit           MMAP              UTF8           false        force  avgt    5  32.833 ± 0.879  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE        4           1500000      explicit           MMAP              UTF8            true        force  avgt    5  25.135 ± 0.905  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE        7           1500000      explicit           MMAP              UTF8           false        force  avgt    5  14.073 ± 0.602  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE        7           1500000      explicit           MMAP              UTF8            true        force  avgt    5   9.602 ± 0.362  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       11           1500000      explicit           MMAP              UTF8           false        force  avgt    5  61.997 ± 1.508  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE       11           1500000      explicit           MMAP              UTF8            true        force  avgt    5  25.191 ± 0.953  ms/op

I changed query 0 to be sum on long5 which has nulls which shows a nicer improvement than the original 0 on long1 which has no nulls.

Benchmark                        (complexCompression)  (deferExpressionDimensions)  (jsonObjectStorageEncoding)  (query)  (rowsPerSegment)  (schemaType)  (storageType)  (stringEncoding)  (useVectorApi)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE        0           1500000      explicit           MMAP              UTF8           false        force  avgt    5  16.936 ± 1.073  ms/op
SqlExpressionBenchmark.querySql                  NONE                 singleString                        SMILE        0           1500000      explicit           MMAP              UTF8            true        force  avgt    5  10.676 ± 0.752  ms/op

changes:

  • add NullAwareVectorAggregator marker interface declaring aggregate(buf, position, startRow, endRow, nullVector) for delegates that handle nulls themselves; return value reports whether any non-null row contributed.
  • update NullableNumericVectorAggregator.aggregate(buf, position, startRow, endRow) to a three-way dispatch: null-free fast path; instanceof NullAwareVectorAggregator -> new null-aware overload (set null marker iff it returned true); else existing scatter-gather fallback.
  • add SimdLongSumVectorAggregator, SimdDoubleSumVectorAggregator, and SimdFloatSumVectorAggregator under query/aggregation/simd/. Each extends its scalar parent, overrides the ungrouped no-null aggregate with a va.add(vb) + reduceLanes(VectorOperators.ADD) SIMD reduction, and implements the null-aware overload using VectorMask-based masked accumulation with notNull.trueCount() for the non-null check.
  • wire Long/Double/FloatSumAggregatorFactory.factorizeVector to dispatch on ExpressionProcessing.useVectorApi().
  • add SimdSumVectorAggregatorTest, for each of the three types, tries various vector sizes and null patterns and asserts SIMD output matches the scalar reference (exact for long, within relative tolerance for double/float to accommodate SIMD's tree-reduce vs scalar's left-to-right reduce).

changes:
* add `NullAwareVectorAggregator` marker interface declaring `aggregate(buf, position, startRow, endRow, nullVector)` for delegates that handle nulls themselves; return value reports whether any non-null row contributed.
* update `NullableNumericVectorAggregator.aggregate(buf, position, startRow, endRow)` to a three-way dispatch: null-free fast path; `instanceof NullAwareVectorAggregator` -> new null-aware overload (set null marker iff it returned true); else existing scatter-gather fallback.
* add `SimdLongSumVectorAggregator`, `SimdDoubleSumVectorAggregator`, and `SimdFloatSumVectorAggregator` under `query/aggregation/simd/`. Each extends its scalar parent, overrides the ungrouped no-null aggregate with a `va.add(vb)` + `reduceLanes(VectorOperators.ADD)` SIMD reduction, and implements the null-aware overload using `VectorMask`-based masked accumulation with `notNull.trueCount()` for the non-null check.
* wire `Long/Double/FloatSumAggregatorFactory.factorizeVector` to dispatch on `ExpressionProcessing.useVectorApi()`.
* add `SimdSumVectorAggregatorTest`, for each of the three types, tries various vector sizes and null patterns and asserts SIMD output matches the scalar reference (exact for long, within relative tolerance for double/float to accommodate SIMD's tree-reduce vs scalar's left-to-right reduce).

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 11 of 11 changed files.


This is an automated review by Codex GPT-5.5

@gianm gianm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the line comments, a documentation note: I noticed that druid.expressions.useVectorApi appears once in the docs, just in docs/operations/java.md. It should also be on the main docs/configuration/index.md. Actually I think all of the ExpressionProcessingConfig stuff is missing. They should all be added.

final ByteBuffer simdBuf = ByteBuffer.allocate(Float.BYTES);
scalar.init(scalarBuf, 0);
simd.init(simdBuf, 0);
scalar.aggregate(scalarBuf, 0, 0, size);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the tests seem to use buf, 0, 0, size as parameters. Please improve coverage by varying the position, startRow, and endRow. It should be ok to do one call with buf, 0, 0, size and another with buf, 1, 1, size + 1 (and put some junk in the first row of the selector that the startRow of 1 is meant to ignore).

* {@link org.apache.druid.query.aggregation.NullAwareVectorAggregator}), compared against a manually-computed
* reference sum.
*/
public class SimdSumVectorAggregatorTest extends InitializedNullHandlingTest

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For test coverage: add useVectorApi as a parameter in some of the parameterized tests? Maybe the timeseries tests and groupBy tests.

// ===========================
// 0: non-expression timeseries reference, 1 columns
"SELECT SUM(long1) FROM expressions",
"SELECT SUM(long5) FROM expressions",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i can revert it, its mentioned in the PR description that I changed it because long1 has no nulls and long5 has a bunch of them, so they exercise different paths

@jtuglu1

jtuglu1 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

within relative tolerance for double/float to accommodate SIMD's tree-reduce vs scalar's left-to-right reduce).

Does this error accumulate any different than normal, non-vectorized instructions? In other words, do we risk returning results with more/less error than non-vectorized aggregators?

@clintropolis

Copy link
Copy Markdown
Member Author

Does this error accumulate any different than normal, non-vectorized instructions? In other words, do we risk returning results with more/less error than non-vectorized aggregators?

I wasn't that sure tbh, the internet tells me that there should be significantly less error than our standard aggregators (relevant arxiv link from img https://arxiv.org/pdf/2107.01604):

Screenshot 2026-06-05 at 2 37 44 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants