[SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional #51473

heyihong · 2025-07-14T13:08:51Z

What changes were proposed in this pull request?

This PR makes the data types optional for Expression.Literal.Array and Expression.Literal.Map in the Spark Connect protocol buffers. The key changes include:

Type Inference Logic: Added logic to infer array element types and map key/value types from the actual data when the types are not explicitly provided
Converter Updates: Updated the conversion logic in both Scala and Python to handle optional data types

The implementation allows Spark Connect to infer types from the first element of an array and the first key-value pair of a map when data types are not explicitly specified. This reduces the overhead of type specification while maintaining backward compatibility.
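
The inference rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the converter code from this PR: `ArrayLiteral`, `MapLiteral`, `infer_type`, `array_element_type`, and `map_types` are hypothetical names, types are represented as simple strings, and the mapping from Python values to type names is an arbitrary illustrative subset.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class ArrayLiteral:
    """Hypothetical stand-in for Expression.Literal.Array."""
    elements: List[Any] = field(default_factory=list)
    element_type: Optional[str] = None  # None models an unset optional field


@dataclass
class MapLiteral:
    """Hypothetical stand-in for Expression.Literal.Map."""
    keys: List[Any] = field(default_factory=list)
    values: List[Any] = field(default_factory=list)
    key_type: Optional[str] = None
    value_type: Optional[str] = None


def infer_type(value: Any) -> str:
    """Infer a type name from a single value (illustrative subset only)."""
    if isinstance(value, ArrayLiteral):
        return f"array<{array_element_type(value)}>"
    if isinstance(value, MapLiteral):
        key_type, value_type = map_types(value)
        return f"map<{key_type},{value_type}>"
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, int):
        return "long"  # assumption: the real mapping may pick a narrower type
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"cannot infer a type for {value!r}")


def array_element_type(array: ArrayLiteral) -> str:
    """An explicit type wins; otherwise infer from the first element."""
    if array.element_type is not None:
        return array.element_type
    if not array.elements:
        raise ValueError("an empty array literal needs an explicit element_type")
    return infer_type(array.elements[0])


def map_types(m: MapLiteral) -> Tuple[str, str]:
    """Explicit types win; otherwise infer from the first key-value pair."""
    if m.key_type is not None and m.value_type is not None:
        return m.key_type, m.value_type
    if not m.keys or not m.values:
        raise ValueError("an empty map literal needs explicit key and value types")
    key_type = m.key_type if m.key_type is not None else infer_type(m.keys[0])
    value_type = m.value_type if m.value_type is not None else infer_type(m.values[0])
    return key_type, value_type
```

In this sketch an explicitly provided type always takes precedence, which models how messages from clients that still set the types would keep working, and an empty literal without a type is rejected because there is nothing to infer from.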

Why are the changes needed?

Currently, Spark Connect requires explicit data type specification for array and map literals, even when the types can be easily inferred from the contained elements. This creates unnecessary overhead in:

Performance: Redundant type information increases message size and processing time
Usability: Developers must explicitly specify types that could be automatically inferred

By making data types optional with type inference, we can improve both performance and developer experience while maintaining backward compatibility.
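
To make the message-size argument concrete, here is a rough sketch that encodes a nested array with and without per-array type annotations and compares the serialized sizes. The dictionary layout and the helper names `type_of` and `encode` are assumptions for illustration only; they are not the Spark Connect wire format.

```python
import json
from typing import Any, Dict, List


def type_of(value: Any) -> Dict[str, Any]:
    """Describe the type of an int or a (nested) list of ints as a dict."""
    if isinstance(value, list):
        return {"array": {"elementType": type_of(value[0])}}
    return {"integer": {}}


def encode(values: List[Any], with_types: bool) -> Dict[str, Any]:
    """Encode a (nested) list of ints as a hypothetical array literal."""
    elements = [
        encode(v, with_types) if isinstance(v, list) else {"integer": v}
        for v in values
    ]
    literal: Dict[str, Any] = {"elements": elements}
    if with_types:
        # Every nested array literal repeats its own element type.
        literal["elementType"] = type_of(values[0])
    return {"array": literal}


nested = [[1, 2], [3, 4], [5, 6]]
size_with_types = len(json.dumps(encode(nested, with_types=True)))
size_without_types = len(json.dumps(encode(nested, with_types=False)))
print(size_with_types, size_without_types)
```

Because each nested array carries its own type description in this encoding, the overhead grows with the number of nested literals rather than staying constant.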

Does this PR introduce any user-facing change?

No - This PR does not introduce any user-facing changes.
The change is backward compatible and existing connect clients will continue to work unchanged.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.2.4

heyihong · 2025-07-16T15:29:25Z

@hvanhovell @HyukjinKwon @beliefer

...t/common/src/main/scala/org/apache/spark/sql/connect/common/LiteralValueProtoConverter.scala

HyukjinKwon · 2025-07-16T22:57:27Z

cc @zhengruifeng too

zhengruifeng · 2025-07-17T10:30:56Z

sql/connect/common/src/main/protobuf/spark/connect/expressions.proto

-      DataType element_type = 1;
+      // (Optional) The element type of the array. Only need to set this when the elements are
+      // empty, since spark 4.1+ supports inferring the element type from the elements.
+      optional DataType element_type = 1;
      repeated Literal elements = 2;


I am not sure whether it is worthwhile to optimize out just the element_type.
For large arrays of primitive types, e.g. a large dense matrix for ML, we introduced SpecializedArray.

zhengruifeng · 2025-07-17

For an Array[Array[Int]] case, how would the nullability be inferred?

heyihong · 2025-07-17

You mean the nullable field is missing in the array literal? I was thinking of deprecating element_type and introducing a new DataType.Array field so that each array literal includes the nullable field within DataType.Array, for example:

message Array {
  DataType element_type = 1 [deprecated = true];
  repeated Literal elements = 2;
  DataType.Array data_type_array = 3;
}

heyihong · 2025-07-17

@zhengruifeng This change optimizes out the types for both arrays and maps, and it also applies to non-primitive types. Also, the reduction in the size of function_lit_array.json is evident.

heyihong · 2025-07-23

I created a separate ticket to track the Protobuf message change: https://issues.apache.org/jira/browse/SPARK-52930
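
The nullability question raised in this thread can be illustrated with a small sketch. It assumes nested Python lists as a stand-in for an Array[Array[Int]] literal and uses hypothetical helper names. It only shows why inferring from the first element cannot settle whether the inner arrays may contain nulls, and it does not reflect how the PR or SPARK-52930 resolves the issue.

```python
from typing import List, Optional

Matrix = List[List[Optional[int]]]


def inner_contains_null_first_only(rows: Matrix) -> bool:
    """Decide the inner arrays' nullability from the first inner array alone."""
    return any(v is None for v in rows[0])


def inner_contains_null_full_scan(rows: Matrix) -> bool:
    """Decide it by scanning every inner array, at the cost of a full pass."""
    return any(v is None for row in rows for v in row)


rows: Matrix = [[1, 2], [3, None]]
first_only = inner_contains_null_first_only(rows)
full_scan = inner_contains_null_full_scan(rows)
print(first_only, full_scan)
```

The two strategies disagree on this input: the first inner array has no nulls, but the second one does. Carrying the nullability explicitly in the message, as the DataType.Array proposal in this thread suggests, avoids having to guess.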

zhengruifeng · 2025-07-17T10:33:24Z

sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/Serializer.scala

@@ -165,18 +167,19 @@ private[ml] object Serializer {
          case proto.Expression.Literal.LiteralTypeCase.BOOLEAN =>
            (literal.getBoolean.asInstanceOf[Object], classOf[Boolean])
          case proto.Expression.Literal.LiteralTypeCase.ARRAY =>
-            val array = literal.getArray


cc @WeichenXu123 for the ml side

heyihong · 2025-07-21T14:01:39Z

friendly ping @hvanhovell @HyukjinKwon @beliefer @zhengruifeng @WeichenXu123

Update: No need to review at the moment — I need to finish SPARK-52930 first.

github-actions bot added SQL CONNECT labels Jul 14, 2025

heyihong changed the title ~~[SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Expression.Literal.Array optional~~ [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional Jul 14, 2025

heyihong force-pushed the SPARK-52449 branch from 9ee011c to 35bb507 July 14, 2025 17:01

github-actions bot added the PYTHON label Jul 14, 2025

heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 62372ff to 2f4c24b July 16, 2025 14:15

heyihong changed the title ~~[WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional~~ [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional Jul 16, 2025

heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 9e7737d to 71814d0 July 16, 2025 15:28

heyihong force-pushed the SPARK-52449 branch from 71814d0 to 235f358 July 16, 2025 15:59

hvanhovell reviewed Jul 16, 2025

...t/common/src/main/scala/org/apache/spark/sql/connect/common/LiteralValueProtoConverter.scala (outdated)

heyihong requested a review from hvanhovell July 16, 2025 19:00

heyihong force-pushed the SPARK-52449 branch 3 times, most recently from 7e5b478 to da00208 July 16, 2025 19:57

heyihong force-pushed the SPARK-52449 branch from da00208 to fc694d4 July 17, 2025 10:02

github-actions bot added the ML label Jul 17, 2025

zhengruifeng reviewed Jul 17, 2025

heyihong force-pushed the SPARK-52449 branch from fc694d4 to b6f0561 July 17, 2025 13:33

heyihong changed the title ~~[SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional~~ [SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional Jul 17, 2025

heyihong requested a review from zhengruifeng July 17, 2025 13:41

heyihong force-pushed the SPARK-52449 branch from b6f0561 to aabcc02 July 17, 2025 14:03

heyihong force-pushed the SPARK-52449 branch 4 times, most recently from c819bcb to c579c1c July 21, 2025 13:56

[SPARK-52449] Make datatypes for Expression.Literal.Map/Expression.Literal.Array optional

0682838

heyihong force-pushed the SPARK-52449 branch from c579c1c to 0682838 July 23, 2025 15:02
