
Using row_number().over(Window) leads to Arrow exception in 4.0 preview (0.43.1-preview) #1435

@mariussoutier

Description

My code works fine with Spark 3.5.2 and the 0.43.1 connector. When I try to upgrade to Spark 4.0 with the 0.43.1-preview connector, I hit this error:

WARN org.apache.spark.scheduler.TaskSetManager Lost task 0.0 in stage 0.0 (TID 0) (executor driver): java.lang.IllegalArgumentException: should have as many children as in the schema: found 0 expected 15
	at com.google.cloud.spark.bigquery.repackaged.org.apache.arrow.util.Preconditions.checkArgument(Preconditions.java:255)
	at com.google.cloud.spark.bigquery.repackaged.org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:153)
	at com.google.cloud.spark.bigquery.repackaged.org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:88)
	at com.google.cloud.spark.bigquery.repackaged.org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:213)
	at com.google.cloud.spark.bigquery.repackaged.org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
	at com.google.cloud.spark.bigquery.v2.context.ArrowColumnBatchPartitionReaderContext$SimpleAdapter.loadNextBatch(ArrowColumnBatchPartitionReaderContext.java:77)
	at com.google.cloud.spark.bigquery.v2.context.ArrowColumnBatchPartitionReaderContext.next(ArrowColumnBatchPartitionReaderContext.java:252)
	at com.google.cloud.spark.bigquery.v2.BigQueryPartitionReader.next(BigQueryPartitionReader.java:32)
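
If I read the failing precondition right, Arrow's VectorLoader is comparing the number of field nodes in the incoming record batch against the number of fields in the reader's schema, so "found 0 expected 15" means a batch with no field nodes arrived for a 15-column schema. A minimal sketch of that invariant (reconstructed from the error message, not the actual repackaged Arrow source):

// Sketch only: reconstructed from the error message, not the real
// Arrow VectorLoader code.
def checkBatchMatchesSchema(fieldNodeCount: Int, schemaFieldCount: Int): Unit =
  require(
    fieldNodeCount == schemaFieldCount,
    s"should have as many children as in the schema: found $fieldNodeCount expected $schemaFieldCount"
  )

// In this failure the reader saw 0 field nodes against a 15-column schema:
// checkBatchMatchesSchema(0, 15) // throws IllegalArgumentException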

I've narrowed it down to this statement:

.withColumn("row_num", row_number().over(
  Window
    .partitionBy(concat(col("user_pseudo_id"), col("event_timestamp"), col("event_name")))
    .orderBy(lit("window_ordering"))
))

Without it, the code runs fine.
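
For reference, here's a self-contained repro sketch. The project, dataset, and table names are placeholders, as is the local master; any BigQuery table containing the three columns used above should do:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat, lit, row_number}

object RowNumberRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("row_number-arrow-repro")
      .master("local[*]")
      .getOrCreate()

    // Placeholder table; substitute any table with these three columns.
    val events = spark.read
      .format("bigquery")
      .load("my-project.my_dataset.events")

    val withRowNum = events.withColumn("row_num", row_number().over(
      Window
        .partitionBy(concat(col("user_pseudo_id"), col("event_timestamp"), col("event_name")))
        .orderBy(lit("window_ordering"))
    ))

    // Succeeds on Spark 3.5.2 + connector 0.43.1; fails in Arrow's
    // VectorLoader (as in the trace above) on Spark 4.0 + 0.43.1-preview.
    withRowNum.show()
  }
}

As a diagnostic, the connector's documented readDataFormat option can switch reads from Arrow to Avro; if the same query then succeeds, that would further isolate the problem to the Arrow read path (I haven't verified this against the 4.0 preview):

val eventsAvro = spark.read
  .format("bigquery")
  .option("readDataFormat", "AVRO") // connector default is ARROW
  .load("my-project.my_dataset.events")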
