describe() returns null for min/max on binary columns — render as hex instead #21496

@kumarUjjawal

Is your feature request related to a problem or challenge?

After PR #21455, DataFrame::describe() no longer crashes on binary-like columns (Binary, LargeBinary, BinaryView, FixedSizeBinary), but it now returns null for min and max on those columns.

That fix avoids an unsafe Cast(Binary → Utf8), but it leaves users with no way to see the value range of a binary column in describe().

For columns that store hashes, UUIDs, content-addressed identifiers, or fingerprints (common uses of FixedSizeBinary(16) / FixedSizeBinary(32)), knowing the min/max value is genuinely useful for sanity-checking data.

The Utf8 cast is the wrong tool for this. Arrow correctly refuses to cast arbitrary bytes to Utf8 because there is no general lossless mapping. But that doesn't mean we have to give up on min/max for binary; we can render the bytes as hex instead.
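To illustrate why the cast is lossy while hex is not, here is a minimal std-only sketch. `to_hex` and `distinct_after_lossy_utf8` are hypothetical helpers written for this issue, not Arrow or DataFusion APIs.

```rust
use std::fmt::Write;

/// Hex-encode arbitrary bytes as two lowercase hex digits per byte.
/// Hypothetical helper for illustration only.
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for byte in bytes {
        // Writing into a String cannot fail, so unwrap is safe here.
        write!(out, "{byte:02x}").unwrap();
    }
    out
}

/// Returns true if two byte strings are still distinguishable after a
/// lossy UTF-8 decode (invalid sequences become U+FFFD).
fn distinct_after_lossy_utf8(a: &[u8], b: &[u8]) -> bool {
    String::from_utf8_lossy(a) != String::from_utf8_lossy(b)
}
```

With these helpers, `std::str::from_utf8(&[0xff, 0xfe])` is an error, `[0xff]` and `[0xfe]` decode to the same replacement character under the lossy conversion, and their hex forms `ff` and `fe` stay distinct.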

Describe the solution you'd like

In DataFrame::describe():

  1. Stop filtering Binary, LargeBinary, BinaryView out of the min/max aggregations. These types are already supported by MinMaxBytesAccumulator, so min(col) and max(col) produce a real binary scalar.
  2. At the display step, special-case binary columns: instead of cast(column, &DataType::Utf8), use Arrow's ArrayFormatter (or DisplayIndex), which already renders these arrays as lowercase hex (it writes each byte via {byte:02x}).
  3. Update the describe schema so binary columns map to Utf8 output containing the hex string (they already do — only the projection back into the output column needs to change).
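The intended behaviour of the steps above can be sketched without any Arrow types. This is a std-only stand-in, not the actual describe() code: `binary_min_max_hex` and `hex` are hypothetical names, `None` models a null value, and the sketch assumes binary min/max means lexicographic byte order.

```rust
/// Lowercase hex, two digits per byte (hypothetical helper).
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Stand-in for the min/max rows of describe() on one binary column.
/// Returns (min, max) as hex strings, or None when every value is null.
fn binary_min_max_hex(values: &[Option<&[u8]>]) -> (Option<String>, Option<String>) {
    // Byte slices implement Ord as lexicographic byte comparison
    // (assumed here to be the ordering binary min/max uses);
    // flatten() skips the null entries.
    let min = values.iter().flatten().min().map(|b| hex(b));
    let max = values.iter().flatten().max().map(|b| hex(b));
    (min, max)
}
```

For a column holding `[0x0a, 0xff]`, null, `[0x0a]`, and `[0xb0, 0x00]`, this yields `0a` as min and `b000` as max. An all-null column yields null for both, which matches today's output.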

Describe alternatives you've considered

  • Keep returning null. Safe, but unhelpful — users have to write their own SQL to get this information.
  • Render binary as base64 instead of hex. Slightly more compact for long values but less common for debugging/inspection. Hex matches Arrow's existing display formatters, so it's the lower-friction choice.
  • Skip binary columns from describe entirely (don't even show count/null_count). Worse than today — you lose information that already works correctly.
  • Add a format option to describe() to let the caller choose. Possible, but premature — pick a sensible default first.
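For a rough sense of the size trade-off between hex and base64, the output lengths can be computed directly. This sketch assumes standard padded base64; `hex_len` and `base64_len` are hypothetical helpers.

```rust
/// Characters needed to render `n` bytes as hex: two per byte.
fn hex_len(n: usize) -> usize {
    n * 2
}

/// Characters needed to render `n` bytes as padded base64:
/// four per 3-byte group, rounding the group count up.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}
```

For FixedSizeBinary(16) that is 32 hex characters versus 24 base64 characters, and for FixedSizeBinary(32) it is 64 versus 44.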

Additional context

Generated by Codex

Labels: enhancement