
Support file row number in Parquet reader #7299

@jkylling

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Deletion vectors in the Delta Lake and Iceberg table formats are defined in terms of row numbers within individual Parquet files. To be able to filter out rows defined as deleted by deletion vectors we need a way to know the file row number of the rows read by the Arrow Parquet reader.
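
For illustration, a minimal sketch of that filtering step, assuming the reader has already appended a non-null row_number column as proposed below (the column name and the helper are assumptions, not existing API; the filter kernels are the existing arrow ones):

use std::collections::HashSet;

use arrow::array::{AsArray, BooleanArray, RecordBatch};
use arrow::compute::filter_record_batch;
use arrow::datatypes::Int64Type;
use arrow::error::ArrowError;

/// Hypothetical post-read step: drop the rows whose file row number appears
/// in the table format's deletion vector.
fn apply_deletion_vector(
    batch: &RecordBatch,
    deleted: &HashSet<i64>,
) -> Result<RecordBatch, ArrowError> {
    let row_numbers = batch
        .column_by_name("row_number") // assumed name of the new column
        .expect("reader was configured to emit row numbers")
        .as_primitive::<Int64Type>();
    // Keep a row iff its file row number is not marked as deleted.
    let keep: BooleanArray = row_numbers
        .values()
        .iter()
        .map(|n| Some(!deleted.contains(n)))
        .collect();
    filter_record_batch(batch, &keep)
}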

Describe the solution you'd like

The Arrow Parquet reader should optionally return a column containing the row number of each row. We add a method ArrowReaderBuilder::with_row_numbers(self, with_row_numbers: bool) -> Self, which configures the Arrow Parquet reader to add an extra column named row_number to its schema (possibly the method could be ArrowReaderBuilder::with_row_number_column(self, with_row_numbers: Option<String>) -> Self to make the column name configurable). This column contains the row number within the file.
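
A minimal sketch of the proposed option in use; with_row_numbers is the hypothetical method described above, everything else is the existing parquet crate API:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        // Hypothetical: appends a `row_number` column to the output schema.
        .with_row_numbers(true)
        .build()?;
    let batch = reader.next().unwrap()?;
    // Each value is the row's position within the Parquet file.
    let row_numbers = batch.column_by_name("row_number").unwrap();
    println!("{row_numbers:?}");
    Ok(())
}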

Describe alternatives you've considered

There is a corresponding issue on Datafusion: apache/datafusion#13261. It considers an alternative using primary keys and existing SQL primitives, but this comes with a performance penalty. The consensus on the issue is:

I agree with the assessment that the information must be coming from the file reader itself.

The idea is to produce a new column, so if a file had a column A, with this feature the parquet reader would add a new row_number column, like:

A    row_number
x    0
y    1
d    2
q    3
a    4
..   ..
q    100

(the row number increases sequentially through the file)

This would also account for predicates, so, for example, if we selected only the rows with A = 'q' above, the output would be:

A    row_number
q    3
q    100

That is, the row numbers reported by the Arrow Parquet reader are positions within the original file, not positions within the filtered output.
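
To make the interaction with predicates concrete, here is a hedged sketch that combines the existing RowFilter API with the hypothetical with_row_numbers option from above (the file name and column layout are assumptions):

use std::fs::File;
use arrow::array::StringArray;
use arrow::compute::kernels::cmp::eq;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Existing API: keep only rows where A = 'q'. Assumes A is the first
    // leaf column of the file.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch| {
        eq(batch.column(0), &StringArray::new_scalar("q"))
    });

    let mut reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        // Hypothetical: the option proposed in this issue.
        .with_row_numbers(true)
        .build()?;

    // Surviving rows keep their original file row numbers (3, 100, ...).
    let batch = reader.next().unwrap()?;
    println!("{:?}", batch.column_by_name("row_number"));
    Ok(())
}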

Additional context

Please see apache/datafusion#13261 for the corresponding issue in Datafusion. There is also a discussion in Datafusion about adding system/metadata columns in apache/datafusion#14057, through which this additional file row number column could be exposed. However, we do not need system/metadata columns to be available to support deletion vectors in delta-rs or iceberg-rs, since the delta-rs and iceberg-rs Datafusion-based readers use the Datafusion ParquetSource directly to construct the execution plans for the scans of their TableProviders.

Activity

added the enhancement label (Any new improvement worthy of an entry in the changelog) on Mar 16, 2025
alamb (Contributor) commented on Mar 18, 2025

I think adding this to the reader seems reasonable to me if there is a way to:

  1. Opt in (don't slow down reading if the row number isn't needed)
  2. The API is reasonable / doesn't make the code "too" complicated (I realize this is a subjective judgement)
jkylling (Author) commented on Mar 18, 2025

> I think adding this to the reader seems reasonable to me if there is a way to:
>
>   1. Opt in (don't slow down reading if the row number isn't needed)
>   2. The API is reasonable / doesn't make the code "too" complicated (I realize this is a subjective judgement)

I've started on this in #7307. Please let me know if you think the approach is reasonable.

alamb (Contributor) commented on Sep 25, 2025

There is lots of good discussion. Here is one comment from @scovich, #7307 (comment), about how other readers represent row numbers:

> > What do other parquet readers do to represent row numbers in their output schema?
> >
> > #7307 (comment), posted Apr 15, might be a starting point?
>
> AFAIK, most parquet readers now support row numbers. We can add DuckDB and Iceberg to the ones already mentioned above.
>
> DuckDB uses a column schema type approach. Interestingly, that's new -- last time I looked (nearly a year ago) it required the reader to pass options along with the schema, and one of the options was to request row numbers (which then became an extra unnamed column at the end of the regular schema). I think that approach didn't scale as they started needing more and more special column types. I see geometry, variant, and non-materialized expressions, for example.
>
> Iceberg's parquet reader works almost exclusively from field ids, and the row index has a baked-in field id from the range of metadata row ids.
>
> Spark uses a metadata column approach, identified by a special name (_metadata._rowid); I don't remember precisely how that maps to the underlying parquet reader.

changed the title from "Return file row number in Parquet readers" to "Support file row number in Parquet reader" on Oct 17, 2025
alamb (Contributor) commented on Oct 17, 2025

I filed another ticket for something very similar (row group index).

alamb (Contributor) commented on Oct 17, 2025

I also added an example to this ticket's description.

alamb (Contributor) commented on Oct 21, 2025

Copying a comment I made in Discord:

I recommend sketching out an "end to end" example that shows how the new API would work

For example, make an example similar to this one that shows how you would specify reading row numbers and how you would access those row numbers in the returned batch
https://docs.rs/parquet/latest/parquet/arrow/index.html#example-reading-parquet-file-into-arrow-recordbatch
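
For reference, the linked docs example boils down to roughly the following (existing API only, no row numbers yet); the requested end-to-end example would extend it:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    println!("Converted arrow schema is: {}", builder.schema());
    let mut reader = builder.build()?;
    let batch = reader.next().unwrap()?;
    println!("Read {} records.", batch.num_rows());
    Ok(())
}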

vustef commented on Oct 22, 2025

> Copying a comment I made in Discord:
>
> I recommend sketching out an "end to end" example that shows how the new API would work
>
> For example, make an example similar to this one that shows how you would specify reading row numbers and how you would access those row numbers in the returned batch https://docs.rs/parquet/latest/parquet/arrow/index.html#example-reading-parquet-file-into-arrow-recordbatch

Here's an example:

let file = File::open(path).unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();

let row_number_field = Field::new(
    "my_row_num_col",
    ArrowDataType::Int64,
    false,
)
// Required: `with_row_number_column` won't accept a field without this.
.with_extension_type(RowNumber::default())
// Optional, just an example here.
.with_metadata(std::collections::HashMap::from([(
    PARQUET_FIELD_ID_META_KEY.to_string(),
    "2147483645".to_string(),
)]));

let builder = builder.with_row_number_column(row_number_field);

// `row_number_field` will be included in the schema, added to the end of the list.
println!("Converted arrow schema is: {}", builder.schema());

let mut reader = builder.build().unwrap();
let record_batch = reader.next().unwrap().unwrap();
println!("Read {} records.", record_batch.num_rows());

let row_number_col = record_batch.column_by_name("my_row_num_col").unwrap();

Rough ideas behind this:

  • It builds upon the discussion at the PR (here).
  • The new column is part of the schema. That makes usage much easier, as clients don't need to track this extra column.
  • Because this is a special column, we need to mark it as such. We use a new extension type for this.
  • Users also get the flexibility of fully specifying the field (name, metadata properties, etc.). Type and nullability are going to be asserted, though. We can provide a helper function to construct this field, to avoid having to pass false for nullability and ArrowDataType::Int64 (see the sketch after this list).
  • To make this field part of the schema, the proposal is to use builder.with_row_number_column(field). The alternative is to make users create the full schema and insert this field somewhere in it, but that doesn't always seem user-friendly. Rather, with_row_number_column would add this field to the end of the fields list in the schema.
  • with_row_number_column should also modify ArrowReaderBuilder::fields to add a new field. I'm not sure what field_type it should have there. It probably needs a new one, so that the array reader builders build a special array reader that enumerates row positions; information about the extension type would otherwise be lost at this point.
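
A minimal sketch of the helper mentioned in the list, assuming the proposed RowNumber extension type; neither it nor this function exists in arrow-rs today:

use arrow_schema::{DataType, Field};

// Hypothetical helper: fixes the type and nullability so callers cannot get
// them wrong, while the name stays configurable.
fn row_number_field(name: &str) -> Field {
    Field::new(name, DataType::Int64, false).with_extension_type(RowNumber::default())
}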

Please let me know what you think.

jkylling (Author) commented on Oct 23, 2025

> [@vustef's example and design notes above, quoted in full]

This looks really good!

How about:

// The constructor of the RowNumber extension type is private, so this is the
// only way to create this field. It ensures that the type and nullability are
// always correct.
let row_number_field = RowNumberField::new("my_row_num_col");

This can be used with (1) (I believe this pattern is common in engines, even if not always user-friendly):

let supplied_schema = Arc::new(Schema::new(vec![
    row_number_field,
]));
let options = ArrowReaderOptions::new().with_schema(supplied_schema.clone());
let mut builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
    file,
    options
).expect("Error if the schema is not compatible with the parquet file schema.");

and (2)

let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
let builder = builder.with_metadata_columns([row_number_field]);

Alternatively, we modify (2) to be:

(3)

let options = ArrowReaderOptions::new().with_metadata_columns([row_number_field]);
let mut builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
    file,
    options,
).unwrap();

as this might simplify the changes to ParquetRecordBatchReaderBuilder (it might be unchanged?).

This would allow us to do #8641 in the future without having to change the interface of ArrowReaderOptions or ParquetRecordBatchReaderBuilder further.

16 remaining items

