
bug: datafusion-spark array_repeat incorrectly returns NULL when element is NULL #21512

@andygrove



Describe the bug

The datafusion-spark implementation of array_repeat incorrectly returns NULL when the first argument (element) is NULL. In Apache Spark, only a NULL count (second argument) produces a NULL result — a NULL element should be repeated into the array.

To Reproduce

PySpark (correct behavior):

SELECT array_repeat(NULL, 2);    -- [NULL, NULL]
SELECT array_repeat(NULL, 1);    -- [NULL]
SELECT array_repeat(NULL, 0);    -- []
SELECT array_repeat('x', NULL);  -- NULL

DataFusion-spark (incorrect behavior):

SELECT array_repeat(NULL, 2);    -- NULL (should be [NULL, NULL])
SELECT array_repeat(NULL, 1);    -- NULL (should be [NULL])
SELECT array_repeat(NULL, 0);    -- NULL (should be [])
SELECT array_repeat('x', NULL);  -- NULL (correct)

The .slt test at datafusion/sqllogictest/test_files/spark/array/array_repeat.slt line 59 has the wrong expected value (NULL instead of [NULL, NULL]). Line 79 also has a wrong expected value for the (NULL, 1) row (NULL instead of [NULL]).

Expected behavior

| Expression | Spark result | datafusion-spark result |
|---|---|---|
| array_repeat('x', 3) | [x, x, x] | [x, x, x] |
| array_repeat(NULL, 2) | [NULL, NULL] | NULL |
| array_repeat(NULL, 1) | [NULL] | NULL |
| array_repeat(NULL, 0) | [] | NULL |
| array_repeat('x', NULL) | NULL | NULL |

Additional context

Root cause: SparkArrayRepeat::spark_array_repeat in datafusion/spark/src/function/array/repeat.rs uses compute_null_mask on all arguments, which returns NULL if any argument is NULL. But array_repeat should only return NULL when the count (second argument) is NULL — a NULL element should be passed through to DataFusion's underlying array_repeat, which correctly repeats it.

Fix: Only check the second argument (count) for NULL, not the first argument (element).
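The difference between the two null-handling rules can be sketched on plain Rust values. This is only an illustration, not the code in repeat.rs: it uses `Option` in place of Arrow arrays and null masks, both function names are hypothetical, and clamping a negative count to zero is an assumption made for the sketch rather than something stated in this issue.

```rust
/// Hypothetical model of the current behavior: the result is NULL (None)
/// as soon as either argument is NULL.
fn array_repeat_any_null<T: Clone>(
    element: Option<T>,
    count: Option<i64>,
) -> Option<Vec<Option<T>>> {
    match (element, count) {
        // Negative counts are clamped to zero (assumption for this sketch).
        (Some(e), Some(n)) => Some(vec![Some(e); n.max(0) as usize]),
        _ => None,
    }
}

/// Hypothetical model of the proposed behavior: only a NULL count yields
/// NULL; a NULL element is repeated like any other value.
fn array_repeat_count_null_only<T: Clone>(
    element: Option<T>,
    count: Option<i64>,
) -> Option<Vec<Option<T>>> {
    // `element` may be None here; it is copied into each slot unchanged.
    count.map(|n| vec![element; n.max(0) as usize])
}
```

With these definitions, `array_repeat_count_null_only::<&str>(None, Some(2))` yields `Some(vec![None, None])`, matching the Spark column of the table above, while `array_repeat_any_null::<&str>(None, Some(2))` yields `None`, matching the current datafusion-spark column.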

The .slt expected values at lines 59 and 79 will also need to be corrected.
