Describe the bug
The datafusion-spark implementation of array_repeat incorrectly returns NULL when the first argument (element) is NULL. In Apache Spark, only a NULL count (second argument) produces a NULL result — a NULL element should be repeated into the array.
To Reproduce
PySpark (correct behavior):
SELECT array_repeat(NULL, 2); -- [NULL, NULL]
SELECT array_repeat(NULL, 1); -- [NULL]
SELECT array_repeat(NULL, 0); -- []
SELECT array_repeat('x', NULL); -- NULL
DataFusion-spark (incorrect behavior):
SELECT array_repeat(NULL, 2); -- NULL (should be [NULL, NULL])
SELECT array_repeat(NULL, 1); -- NULL (should be [NULL])
SELECT array_repeat(NULL, 0); -- NULL (should be [])
SELECT array_repeat('x', NULL); -- NULL (correct)
The .slt test at datafusion/sqllogictest/test_files/spark/array/array_repeat.slt line 59 has the wrong expected value (NULL instead of [NULL, NULL]). Line 79 also has a wrong expected value for the (NULL, 1) row (NULL instead of [NULL]).
Expected behavior
| Expression | Spark result | datafusion-spark result |
|---|---|---|
| array_repeat('x', 3) | [x, x, x] | [x, x, x] ✓ |
| array_repeat(NULL, 2) | [NULL, NULL] | NULL ✗ |
| array_repeat(NULL, 1) | [NULL] | NULL ✗ |
| array_repeat(NULL, 0) | [] | NULL ✗ |
| array_repeat('x', NULL) | NULL | NULL ✓ |
Additional context
Root cause: SparkArrayRepeat::spark_array_repeat in datafusion/spark/src/function/array/repeat.rs uses compute_null_mask on all arguments, which returns NULL if any argument is NULL. But array_repeat should only return NULL when the count (second argument) is NULL — a NULL element should be passed through to DataFusion's underlying array_repeat, which correctly repeats it.
Fix: Only check the second argument (count) for NULL, not the first argument (element).
The .slt expected values at lines 59 and 79 will also need to be corrected.
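To make the intended semantics concrete, here is a minimal plain-Rust sketch of the two behaviors described above. It is only a model: `None` stands in for SQL NULL, the function names `array_repeat_null_mask_all` and `array_repeat_spark` are hypothetical, and none of this is the actual code in repeat.rs (which operates on Arrow arrays). Treating a negative count as zero is an assumption that this report does not cover.

```rust
// Plain-Rust model of the NULL semantics; `None` stands in for SQL NULL.
// Both helpers are hypothetical illustrations, not the datafusion-spark code.

/// Behavior described in the report: a NULL in *any* argument yields NULL.
fn array_repeat_null_mask_all<T: Clone>(
    element: Option<T>,
    count: Option<i64>,
) -> Option<Vec<Option<T>>> {
    match (element, count) {
        (Some(e), Some(n)) => Some(vec![Some(e); n.max(0) as usize]),
        // Either argument being NULL collapses the whole result to NULL.
        _ => None,
    }
}

/// Spark-compatible behavior: only a NULL count yields NULL.
fn array_repeat_spark<T: Clone>(
    element: Option<T>,
    count: Option<i64>,
) -> Option<Vec<Option<T>>> {
    // `?` returns None early when the count is NULL.
    let n = count?;
    // Assumption: a negative count is treated as zero repetitions.
    // A NULL element is kept as-is and repeated into the array.
    Some(vec![element; n.max(0) as usize])
}
```

In this model, `array_repeat_spark::<&str>(None, Some(2))` produces an array of two NULL entries, while `array_repeat_null_mask_all` returns NULL for the same input, which mirrors the mismatch in the table under "Expected behavior".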