feat: SIMD sum vector aggregators (Long/Double/Float)#19561
feat: SIMD sum vector aggregators (Long/Double/Float)#19561clintropolis wants to merge 1 commit into
Conversation
changes: * add `NullAwareVectorAggregator` marker interface declaring `aggregate(buf, position, startRow, endRow, nullVector)` for delegates that handle nulls themselves; return value reports whether any non-null row contributed. * update `NullableNumericVectorAggregator.aggregate(buf, position, startRow, endRow)` to a three-way dispatch: null-free fast path; `instanceof NullAwareVectorAggregator` -> new null-aware overload (set null marker iff it returned true); else existing scatter-gather fallback. * add `SimdLongSumVectorAggregator`, `SimdDoubleSumVectorAggregator`, and `SimdFloatSumVectorAggregator` under `query/aggregation/simd/`. Each extends its scalar parent, overrides the ungrouped no-null aggregate with a `va.add(vb)` + `reduceLanes(VectorOperators.ADD)` SIMD reduction, and implements the null-aware overload using `VectorMask`-based masked accumulation with `notNull.trueCount()` for the non-null check. * wire `Long/Double/FloatSumAggregatorFactory.factorizeVector` to dispatch on `ExpressionProcessing.useVectorApi()`. * add `SimdSumVectorAggregatorTest`, for each of the three types, tries various vector sizes and null patterns and asserts SIMD output matches the scalar reference (exact for long, within relative tolerance for double/float to accommodate SIMD's tree-reduce vs scalar's left-to-right reduce).
FrankChen021
left a comment
There was a problem hiding this comment.
I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 11 of 11 changed files.
This is an automated review by Codex GPT-5.5
gianm
left a comment
There was a problem hiding this comment.
In addition to the line comments, a documentation note: I noticed that druid.expressions.useVectorApi appears once in the docs, just in docs/operations/java.md. It should also be on the main docs/configuration/index.md. Actually I think all of the ExpressionProcessingConfig stuff is missing. They should all be added.
| final ByteBuffer simdBuf = ByteBuffer.allocate(Float.BYTES); | ||
| scalar.init(scalarBuf, 0); | ||
| simd.init(simdBuf, 0); | ||
| scalar.aggregate(scalarBuf, 0, 0, size); |
There was a problem hiding this comment.
all the tests seem to use buf, 0, 0, size as parameters. Please improve coverage by varying the position, startRow, and endRow. It should be ok to do one call with buf, 0, 0, size and another with buf, 1, 1, size + 1 (and put some junk in the first row of the selector that the startRow of 1 is meant to ignore).
| * {@link org.apache.druid.query.aggregation.NullAwareVectorAggregator}), compared against a manually-computed | ||
| * reference sum. | ||
| */ | ||
| public class SimdSumVectorAggregatorTest extends InitializedNullHandlingTest |
There was a problem hiding this comment.
For test coverage: add useVectorApi as a parameter in some of the parameterized tests? Maybe the timeseries tests and groupBy tests.
| // =========================== | ||
| // 0: non-expression timeseries reference, 1 columns | ||
| "SELECT SUM(long1) FROM expressions", | ||
| "SELECT SUM(long5) FROM expressions", |
There was a problem hiding this comment.
oh i can revert it, its mentioned in the PR description that I changed it because long1 has no nulls and long5 has a bunch of them, so they exercise different paths
Does this error accumulate any different than normal, non-vectorized instructions? In other words, do we risk returning results with more/less error than non-vectorized aggregators? |
I wasn't that sure tbh, the internet tells me that there should be significantly less error than our standard aggregators (relevant arxiv link from img https://arxiv.org/pdf/2107.01604):
|

Description
Follow-up to #19512, this PR adds opt-in jdk.incubator.vector specializations of
LongSumVectorAggregator,DoubleSumVectorAggregator, andFloatSumVectorAggregator's ungrouped aggregate(buf, position, startRow, endRow) path. This functionality is controlled by the samedruid.expressions.useVectorApiflag introduced for expression vector processors in the prior PR (off by default).Shows a pretty nice improvement:
I changed query 0 to be sum on long5 which has nulls which shows a nicer improvement than the original 0 on long1 which has no nulls.
changes:
NullAwareVectorAggregatormarker interface declaringaggregate(buf, position, startRow, endRow, nullVector)for delegates that handle nulls themselves; return value reports whether any non-null row contributed.NullableNumericVectorAggregator.aggregate(buf, position, startRow, endRow)to a three-way dispatch: null-free fast path;instanceof NullAwareVectorAggregator-> new null-aware overload (set null marker iff it returned true); else existing scatter-gather fallback.SimdLongSumVectorAggregator,SimdDoubleSumVectorAggregator, andSimdFloatSumVectorAggregatorunderquery/aggregation/simd/. Each extends its scalar parent, overrides the ungrouped no-null aggregate with ava.add(vb)+reduceLanes(VectorOperators.ADD)SIMD reduction, and implements the null-aware overload usingVectorMask-based masked accumulation withnotNull.trueCount()for the non-null check.Long/Double/FloatSumAggregatorFactory.factorizeVectorto dispatch onExpressionProcessing.useVectorApi().SimdSumVectorAggregatorTest, for each of the three types, tries various vector sizes and null patterns and asserts SIMD output matches the scalar reference (exact for long, within relative tolerance for double/float to accommodate SIMD's tree-reduce vs scalar's left-to-right reduce).