Support file row number in Parquet reader

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Deletion vectors in the Delta Lake and Iceberg table formats are defined in terms of row numbers within individual Parquet files. To be able to filter out rows defined as deleted by deletion vectors we need a way to know the file row number of the rows read by the Arrow Parquet reader.

**Describe the solution you'd like**

The  Arrow Parquet reader should optionally return a column containing the row number of each row. We add a method `ArrowReaderBuilder::with_row_numbers(self, with_row_numbers: bool) -> Self`, which configures the Arrow Parquet reader to add an extra column named `row_number` to its schema (possibly the method could be `ArrowReaderBuilder::with_row_number_column(self, with_row_numbers: Option<String>) -> Self` to make the column name configurable). This column contains the row number within the file.

**Describe alternatives you've considered**

There is a corresponding issue on Datafusion https://github.com/apache/datafusion/issues/13261. It considers an alternative using primary keys and existing SQL primitives, but this comes with a performance penalty. The consensus on the issue is

> I agree with the assessment that the information must be coning from the file reader itself.


The idea is to produce a new column, so if a file had a column `A` with this feature the parquet reader would add a new `row_number` column, like

| A | `row_number` |
|--------|--------|
| x | 0 (row number increases sequentially)|
| y | 1 |
| d | 2 |
| q | 3 |
| a | 4 | 
| .. | .. | 
| q | 100 | 

This would also account for predicates, so for example if we selected only rows with `A = 'q'` above the output would be

| A | `row_number` |
|--------|--------|
| q | 3 |
| q | 100 | 



That is, the Arrow Parquet reader.

**Additional context**

Please see https://github.com/apache/datafusion/issues/13261 for the corresponding issue in Datafusion. There is also a discussion in Datafusion to add system/metadata columns in https://github.com/apache/datafusion/pull/14057 through which this additional file row number column could be exposed. Though, we do not need system/metadata columns to be available to support deletion vectors in delta-rs or iceberg-rs, since the delta-rs and iceberg-rs Datafusion based readers use the Datafusion ParquetSource directly to construct the execution plans for the scans of their TableProviders.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support file row number in Parquet reader #7299

16 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

A	`row_number`
q	3
q	100

Support file row number in Parquet reader #7299

Description

Activity

alamb commented on Mar 18, 2025

jkylling commented on Mar 18, 2025

alamb commented on Sep 25, 2025

alamb commented on Oct 17, 2025

alamb commented on Oct 17, 2025

alamb commented on Oct 21, 2025

vustef commented on Oct 22, 2025

jkylling commented on Oct 23, 2025

16 remaining items

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions