Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Oct 22, 2025

Removes a downcast match in favor of use of the trait. This mirrors the changes to DataSourceExec to use partition_statistics instead of statistics from #15852

@github-actions github-actions bot added the datasource Changes to the datasource crate label Oct 22, 2025
Comment on lines -288 to -302
if let Some(partition) = partition {
let mut statistics = Statistics::new_unknown(&self.schema());
if let Some(file_config) =
self.data_source.as_any().downcast_ref::<FileScanConfig>()
{
if let Some(file_group) = file_config.file_groups.get(partition) {
if let Some(stat) = file_group.file_statistics(None) {
statistics = stat.clone();
}
}
}
Ok(statistics)
} else {
Ok(self.data_source.statistics()?)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was clearly buggy: in the partition = None path it returns projected statistics (see current implementation in FileScanConfig::statistics), if it took the Some(partition) path it calculated the statistics but then never projected them.

Comment on lines +2547 to +2546
#[test]
fn test_partition_statistics_projection() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test demonstrates the bug and fails on main

@adriangb adriangb requested a review from crepererum October 22, 2025 21:05
Comment on lines 196 to 217
fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics> {
if let Some(partition) = partition {
// Compute statistics for a specific partition
if let Some(batches) = self.partitions.get(partition) {
Ok(common::compute_record_batch_statistics(
from_ref(&batches),
&self.schema,
self.projection.clone(),
))
} else {
// Invalid partition index
Ok(Statistics::new_unknown(&self.projected_schema))
}
} else {
// Compute statistics across all partitions
Ok(common::compute_record_batch_statistics(
&self.partitions,
&self.schema,
self.projection.clone(),
))
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously unimplemented because only FileScanConfig was handled via downcast matching.

@adriangb adriangb requested a review from xudong963 October 22, 2025 21:38
Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adriangb

SchedulingType::NonCooperative
}
fn statistics(&self) -> Result<Statistics>;
fn partition_statistics(&self, partition: Option<usize>) -> Result<Statistics>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an API changing, I think we should make statistics as deprecated, and add a new partition_statistics API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I added fn statistics() back to DataSource trait as a deprecated method and made it default to partition_statistics(None). This should make it fully backwards compatible for all implementations.

@xudong963 xudong963 added the api change Changes the API exposed to users of the crate label Oct 24, 2025
Removes a downcast match in favor of use of the trait. This mirrors the changes to DataSourceExec to use partition_statistics instead of statistics.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants