Is Spark limited to splitting Parquet reads at row-group granularity only?

According to some articles I found:

- https://cloudsqale.com/2021/03/19/spark-reading-parquet-why-the-number-of-tasks-can-be-much-larger-than-the-number-of-row-groups/
- https://www.gresearch.com/news/parquet-files-know-your-scaling-limits/

It seems Spark can only parallelize Parquet reads across row groups. Is this a known limitation?

Is there any way to split the read at the row or page level instead?

If a file has a single row group, does that mean every task except one would be idle, and that one task would read the entire file?
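To make the scenario concrete, here is a toy model in plain Python (not Spark code). It assumes, based on my reading of the articles linked above, that the file is cut into byte-range splits and that a row group is read only by the task whose split contains the row group's midpoint. The names `RowGroup`, `byte_splits`, and `assign_row_groups` are hypothetical and exist only for this illustration; the 128 MB split size is just an example value.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class RowGroup:
    start: int  # byte offset where the row group begins
    size: int   # size of the row group in bytes


def byte_splits(file_size, max_split_bytes):
    """Cut the file into contiguous byte ranges, one per task."""
    return [
        (offset, min(offset + max_split_bytes, file_size))
        for offset in range(0, file_size, max_split_bytes)
    ]


def assign_row_groups(row_groups, splits):
    """Assumed rule: a row group goes to the split containing its midpoint."""
    assignment = {split: [] for split in splits}
    for index, rg in enumerate(row_groups):
        midpoint = rg.start + rg.size // 2
        for lo, hi in splits:
            if lo <= midpoint < hi:
                assignment[(lo, hi)].append(index)
                break
    return assignment


# A 512 MB file holding one row group that spans almost the whole file.
file_size = 512 * MB
row_groups = [RowGroup(start=4, size=file_size - 4)]

splits = byte_splits(file_size, 128 * MB)
assignment = assign_row_groups(row_groups, splits)
busy = [split for split, groups in assignment.items() if groups]

print(f"tasks created: {len(splits)}, tasks with data: {len(busy)}")
```

Under these assumptions, four tasks are created but only one of them gets the row group, which is the behavior I am asking about.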

Thanks.


Is Spark limited to split the Parquet read granularity by Row Group level only? #55747
