-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional #51473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
62372ff
to
2f4c24b
Compare
9e7737d
to
71814d0
Compare
...t/common/src/main/scala/org/apache/spark/sql/connect/common/LiteralValueProtoConverter.scala
Outdated
Show resolved
Hide resolved
7e5b478
to
da00208
Compare
cc @zhengruifeng too |
DataType element_type = 1; | ||
// (Optional) The element type of the array. Only need to set this when the elements are | ||
// empty, since spark 4.1+ supports inferring the element type from the elements. | ||
optional DataType element_type = 1; | ||
repeated Literal elements = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether it is worthwhile to just optimize out the element_type
.
For large arrays of primitive types, e.g. large dense matrix for ML, we introduced SpecializedArray
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a Array[Array[Int]]
case, how to infer the nullability ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the nullable field is missing in the array literal? I was thinking of deprecating element_type and introducing a new DataType.Array field so that each array literal includes the nullable field within DataType.Array, for example:
message Array {
DataType element_type = 1; [deprecated=true]
repeated Literal elements = 2;
DataType.Array data_type_array = 3;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhengruifeng This change optimizes out both arrays and maps, and also applies to non-primitive types. Also, the reduction in size of function_lit_array.json seems obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a separate ticket to track the Protobuf message change: https://issues.apache.org/jira/browse/SPARK-52930
@@ -165,18 +167,19 @@ private[ml] object Serializer { | |||
case proto.Expression.Literal.LiteralTypeCase.BOOLEAN => | |||
(literal.getBoolean.asInstanceOf[Object], classOf[Boolean]) | |||
case proto.Expression.Literal.LiteralTypeCase.ARRAY => | |||
val array = literal.getArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @WeichenXu123 for the ml side
c819bcb
to
c579c1c
Compare
friendly ping @hvanhovell @HyukjinKwon @beliefer @zhengruifeng @WeichenXu123 Update: No need to review at the moment — I need to finish SPARK-52930 first. |
…teral.Array optional
What changes were proposed in this pull request?
This PR makes data types optional for
Expression.Literal.Array
andExpression.Literal.Map
in Spark Connect protocol buffers. The key changes include:The implementation allows Spark Connect to infer types from the first element in arrays and first key-value pair in maps when data types are not explicitly specified, reducing the overhead of type specification while maintaining backward compatibility.
Why are the changes needed?
Currently, Spark Connect requires explicit data type specification for array and map literals, even when the types can be easily inferred from the contained elements. This creates unnecessary overhead in:
By making data types optional with type inference, we can improve both performance and developer experience while maintaining backward compatibility.
Does this PR introduce any user-facing change?
No - This PR does not introduce any user-facing changes.
The change is backward compatible and existing connect clients will continue to work unchanged.
How was this patch tested?
build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.2.4