Is there a fast way to estimate data size for the lance dataset? #3266

SaintBacchus · 2024-12-18T09:23:11Z

Data size is very useful for the cost-based optimizer in compute engines such as Spark.

Currently, though, Lance has to open a LanceFileReader to get the data size of a single fragment. Using this SDK call on a large Lance dataset would issue a lot of I/O requests.

Is there another way to get the data size (an approximate size would be fine), similar to the count_rows function?


wjones127 · 2024-12-23T21:51:29Z

I think we are trying to expose some stats here: #3221

If we want to use them for planning purposes, we might consider caching at least some of them in the manifest.

I assume you want the size per column, right? Could you provide a link to the Spark API you are talking about?

SaintBacchus · 2024-12-24T02:39:15Z

Spark needs the Statistics provided through this interface.
Statistics has three methods. sizeInBytes and numRows are useful for simple optimizer rules. columnStats is also useful for CBO rules, but it covers a lot of per-column statistics, and I'm not sure the storage-level statistics can provide them all.

public interface Statistics {
  OptionalLong sizeInBytes();
  OptionalLong numRows();
  default Map<NamedReference, ColumnStatistics> columnStats() {
    return new HashMap<>();
  }
}
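To make the idea of serving sizeInBytes and numRows from cached metadata concrete, here is a minimal sketch. It assumes per-fragment row counts and byte sizes were already stored somewhere cheap to read, such as the manifest. FragmentMeta and ManifestStatsEstimator are hypothetical names for illustration only; they are not part of the Lance or Spark APIs, and the two methods merely mirror the shape of the Statistics interface above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical per-fragment metadata as it might be cached in a manifest.
// Not a real Lance class; the fields are assumptions for this sketch.
final class FragmentMeta {
    final long numRows;
    final OptionalLong sizeInBytes; // empty when no size was cached

    FragmentMeta(long numRows, OptionalLong sizeInBytes) {
        this.numRows = numRows;
        this.sizeInBytes = sizeInBytes;
    }
}

// Hypothetical aggregator that answers the two simple Statistics questions
// from cached metadata only, without opening any data file.
final class ManifestStatsEstimator {
    private final List<FragmentMeta> fragments;

    ManifestStatsEstimator(List<FragmentMeta> fragments) {
        this.fragments = fragments;
    }

    // Total row count across all fragments.
    OptionalLong numRows() {
        long total = 0L;
        for (FragmentMeta f : fragments) {
            total = Math.addExact(total, f.numRows);
        }
        return OptionalLong.of(total);
    }

    // Total byte size, or empty if any fragment has no cached size,
    // so the caller can fall back to its own default estimate.
    OptionalLong sizeInBytes() {
        long total = 0L;
        for (FragmentMeta f : fragments) {
            if (!f.sizeInBytes.isPresent()) {
                return OptionalLong.empty();
            }
            total = Math.addExact(total, f.sizeInBytes.getAsLong());
        }
        return OptionalLong.of(total);
    }
}
```

Returning an empty OptionalLong when any fragment lacks a cached size is one possible design choice: it reports the size as unknown rather than silently under-counting.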
