Skip to content

KLL_INVALID_INPUT_SKETCH_BUFFER error when empty KLL sketch is read #55235

@code-mc

Description

@code-mc

The newly added KLL sketch functions behave in an unexpected way when you encounter a sketch that is empty (which is often the case when dealing with nullable input fields or using FILTER WHERE expressions in your sketch aggregate).

Minimal example to reproduce on the latest 4.2.0-preview3:

WITH sketch_data AS (
    SELECT kll_sketch_agg_double(value) AS sketch
    FROM (
    select * from VALUES (1.23D), (2.34D), (3.45D) AS T(value)
    limit 0
    )
)
SELECT kll_sketch_get_quantile_double(sketch, array(0.25, 0.75)) FROM sketch_data
;

Due to the limit 0 an empty double KLL sketch will be created, which when queried using the kll_sketch_get_quantile_double will raise the specified KLL_INVALID_INPUT_SKETCH_BUFFER error.

While the pyspark docs do not mention anything about behavior when reading an empty sketch, the databricks docs (I know not pyspark but still) do say the following:

Notes
Returns NULL if the sketch is empty.

See: https://docs.databricks.com/gcp/en/sql/language-manual/functions/kll_sketch_get_quantile_double#notes

(Just to clarify, the databricks docs are incorrect, as on databricks the same error described in this issue is also raised)

Which makes a lot more sense to me, as it does not crash any query that operates over a row with an empty sketch, which is currently difficult to ensure you will not encounter. The only way to work around this error is to first filter out any empty sketch:

WITH sketch_data AS (
      SELECT kll_sketch_agg_double(value) AS sketch
      FROM (
      select * from VALUES (1.23D), (2.34D), (3.45D) AS T(value)
      limit 0
      )
  )
  SELECT 
    CASE WHEN kll_sketch_get_n_double(sketch) != 0 THEN kll_sketch_get_quantile_double(sketch, array(0.25, 0.75)) ELSE null END
  FROM sketch_data

This adds a lot of boilerplate that ruins the readability of the query.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions