Is there a fast way to estimate data size for the lance dataset? #3266

Open
SaintBacchus opened this issue Dec 18, 2024 · 2 comments

Comments

@SaintBacchus
Contributor

Data size is very useful to the cost-based optimizer (CBO) in compute engines such as Spark.

Currently, Lance has to open a LanceFileReader to get the data size of a single fragment. With this SDK, a large Lance dataset therefore needs a lot of IO requests.

Is there another way to get the data size (an approximate size would also be fine), similar to the count_rows function?
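To make the request concrete, here is a rough sketch of the kind of cheap, approximate answer I have in mind. It is not a Lance API: `LanceDirSizeEstimate` and `estimateBytes` are hypothetical names, and the code assumes a dataset on a local filesystem whose data files are the `*.lance` files directly under `<root>/data`. It only lists that directory and never opens a file, so the cost is one listing. The result can over-count, since files kept for older versions and rows that have since been deleted are still included.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, not part of the Lance SDK.
final class LanceDirSizeEstimate {
    private LanceDirSizeEstimate() {}

    /**
     * Sums the sizes of the *.lance files directly under <datasetRoot>/data.
     * Assumption: local filesystem and the usual "data/" layout.
     * The result is an upper-bound style estimate: files that belong only to
     * older dataset versions are counted as well.
     */
    static long estimateBytes(Path datasetRoot) throws IOException {
        Path dataDir = datasetRoot.resolve("data");
        if (!Files.isDirectory(dataDir)) {
            return 0L; // no data directory, nothing to count
        }
        try (Stream<Path> files = Files.list(dataDir)) {
            return files
                .filter(p -> p.getFileName().toString().endsWith(".lance"))
                .mapToLong(p -> {
                    try {
                        return Files.size(p); // metadata lookup only, file is not read
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }
}
```

On object storage the same idea would map to a single prefix listing, which is still far cheaper than opening a reader per fragment.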

@wjones127
Contributor

I think we are trying to expose some stats here: #3221

If we want to use them for planning purposes, we might consider caching at least some of them in the manifest.

I assume you want the size per column, right? Could you provide a link to the Spark API you are talking about?

@SaintBacchus
Contributor Author

Spark needs the Statistics from this interface.
Statistics has three methods. sizeInBytes and numRows are useful for simple optimizer rules. columnStats is also useful for CBO rules, but it covers many per-column statistics, and I'm not sure the storage-level statistics can provide them all.

public interface Statistics {
  OptionalLong sizeInBytes();
  OptionalLong numRows();
  default Map<NamedReference, ColumnStatistics> columnStats() {
    return new HashMap<>();
  }
}
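
For the two scalar values, a connector could simply aggregate per-fragment numbers that were recorded ahead of time, for example if they were cached in the manifest as suggested above. The sketch below illustrates that idea under stated assumptions: `ScanStatistics` is a local stand-in for the two scalar methods of Spark's interface so the code compiles without Spark, and `FragmentSummary` and `LanceScanStatistics` are hypothetical names, not existing Lance or Spark classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

// Local stand-in for the scalar part of Spark's Statistics interface,
// declared here only so the sketch compiles without Spark on the classpath.
interface ScanStatistics {
    OptionalLong sizeInBytes();
    OptionalLong numRows();
}

// Hypothetical per-fragment summary, assumed to be available without file IO
// (for example, read from cached metadata).
final class FragmentSummary {
    final long numRows;
    final long dataBytes;

    FragmentSummary(long numRows, long dataBytes) {
        this.numRows = numRows;
        this.dataBytes = dataBytes;
    }
}

// Hypothetical aggregation over the fragments selected for a scan.
final class LanceScanStatistics implements ScanStatistics {
    private final List<FragmentSummary> fragments;

    LanceScanStatistics(List<FragmentSummary> fragments) {
        this.fragments = new ArrayList<>(fragments); // defensive copy
    }

    @Override
    public OptionalLong sizeInBytes() {
        long total = 0L;
        for (FragmentSummary f : fragments) {
            total += f.dataBytes;
        }
        return OptionalLong.of(total);
    }

    @Override
    public OptionalLong numRows() {
        long total = 0L;
        for (FragmentSummary f : fragments) {
            total += f.numRows;
        }
        return OptionalLong.of(total);
    }
}
```

columnStats is left out on purpose: the default implementation in the interface returns an empty map, so a connector that only reports sizeInBytes and numRows would still satisfy the simple optimizer rules.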
