Skip to content

[SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional #51473

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

heyihong
Copy link
Contributor

@heyihong heyihong commented Jul 14, 2025

What changes were proposed in this pull request?

This PR makes data types optional for Expression.Literal.Array and Expression.Literal.Map in Spark Connect protocol buffers. The key changes include:

  1. Type Inference Logic: Added logic to infer array element types and map key/value types from the actual data when types are not explicitly provided
  2. Converter Updates: Updated conversion logics in both Scala and Python to handle optional data types

The implementation allows Spark Connect to infer types from the first element in arrays and first key-value pair in maps when data types are not explicitly specified, reducing the overhead of type specification while maintaining backward compatibility.

Why are the changes needed?

Currently, Spark Connect requires explicit data type specification for array and map literals, even when the types can be easily inferred from the contained elements. This creates unnecessary overhead in:

  1. Performance: Redundant type information increases message size and processing time
  2. Usability: Developers must explicitly specify types that could be automatically inferred

By making data types optional with type inference, we can improve both performance and developer experience while maintaining backward compatibility.

Does this PR introduce any user-facing change?

No - This PR does not introduce any user-facing changes.
The change is backward compatible and existing connect clients will continue to work unchanged.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.2.4

@heyihong heyihong changed the title [SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Expression.Literal.Array optional [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional Jul 14, 2025
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 62372ff to 2f4c24b Compare July 16, 2025 14:15
@heyihong heyihong changed the title [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional Jul 16, 2025
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 9e7737d to 71814d0 Compare July 16, 2025 15:28
@heyihong
Copy link
Contributor Author

@heyihong heyihong requested a review from hvanhovell July 16, 2025 19:00
@heyihong heyihong force-pushed the SPARK-52449 branch 3 times, most recently from 7e5b478 to da00208 Compare July 16, 2025 19:57
@HyukjinKwon
Copy link
Member

cc @zhengruifeng too

DataType element_type = 1;
// (Optional) The element type of the array. Only need to set this when the elements are
// empty, since spark 4.1+ supports inferring the element type from the elements.
optional DataType element_type = 1;
repeated Literal elements = 2;
Copy link
Contributor

@zhengruifeng zhengruifeng Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether it is worthwhile to just optimize out the element_type.
For large arrays of primitive types, e.g. large dense matrix for ML, we introduced SpecializedArray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a Array[Array[Int]] case, how to infer the nullability ?

Copy link
Contributor Author

@heyihong heyihong Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the nullable field is missing in the array literal? I was thinking of deprecating element_type and introducing a new DataType.Array field so that each array literal includes the nullable field within DataType.Array, for example:

   message Array {
      DataType element_type = 1; [deprecated=true]
      repeated Literal elements = 2;
      DataType.Array data_type_array = 3;
  }

Copy link
Contributor Author

@heyihong heyihong Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhengruifeng This change optimizes out both arrays and maps, and also applies to non-primitive types. Also, the reduction in size of function_lit_array.json seems obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a separate ticket to track the Protobuf message change: https://issues.apache.org/jira/browse/SPARK-52930

@@ -165,18 +167,19 @@ private[ml] object Serializer {
case proto.Expression.Literal.LiteralTypeCase.BOOLEAN =>
(literal.getBoolean.asInstanceOf[Object], classOf[Boolean])
case proto.Expression.Literal.LiteralTypeCase.ARRAY =>
val array = literal.getArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @WeichenXu123 for the ml side

@heyihong heyihong changed the title [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional [SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional Jul 17, 2025
@heyihong heyihong requested a review from zhengruifeng July 17, 2025 13:41
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from c819bcb to c579c1c Compare July 21, 2025 13:56
@heyihong
Copy link
Contributor Author

heyihong commented Jul 21, 2025

friendly ping @hvanhovell @HyukjinKwon @beliefer @zhengruifeng @WeichenXu123

Update: No need to review at the moment — I need to finish SPARK-52930 first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants