Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Statistics] How to distinct "utf8" and "binary" in arrow::ArrayStatistics? #44579

Closed
kou opened this issue Oct 30, 2024 · 3 comments
Closed

Comments

@kou
Copy link
Member

kou commented Oct 30, 2024

Describe the enhancement requested

We can represent "utf8" and "binary" min/max values by arrow::ArrayStatistics:

/// \brief Statistics for an Array
///
/// Apache Arrow format doesn't have statistics but data source such
/// as Apache Parquet may have statistics. Statistics associated with
/// data source can be read unified API via this class.
struct ARROW_EXPORT ArrayStatistics {
using ValueType = std::variant<bool, int64_t, uint64_t, double, std::string>;
/// \brief The number of null values, may not be set
std::optional<int64_t> null_count = std::nullopt;
/// \brief The number of distinct values, may not be set
std::optional<int64_t> distinct_count = std::nullopt;
/// \brief The minimum value, may not be set
std::optional<ValueType> min = std::nullopt;
/// \brief Whether the minimum value is exact or not
bool is_min_exact = false;
/// \brief The maximum value, may not be set
std::optional<ValueType> max = std::nullopt;
/// \brief Whether the maximum value is exact or not
bool is_max_exact = false;
/// \brief Check two statistics for equality
bool Equals(const ArrayStatistics& other) const {
return null_count == other.null_count && distinct_count == other.distinct_count &&
min == other.min && is_min_exact == other.is_min_exact && max == other.max &&
is_max_exact == other.is_max_exact;
}
/// \brief Check two statistics for equality
bool operator==(const ArrayStatistics& other) const { return Equals(other); }
/// \brief Check two statistics for not equality
bool operator!=(const ArrayStatistics& other) const { return !Equals(other); }
};

But we can't distinct them because we use std::string in ValueType for both of them.

How can we distinct them? Should we add arrow::ArrayStatistics::{min,max}_type?

Component(s)

C++

kou added a commit to kou/arrow that referenced this issue Nov 6, 2024
kou added a commit to kou/arrow that referenced this issue Nov 6, 2024
@pitrou
Copy link
Member

pitrou commented Dec 4, 2024

Why would we want to distinguish them?

@kou
Copy link
Member Author

kou commented Dec 5, 2024

I wanted this when I create an arrow::DataType for arrow::ArrayStatistics.
But I may use associated arrow::Array information for it. I'll try it.

kou added a commit to kou/arrow that referenced this issue Dec 21, 2024
kou added a commit that referenced this issue Dec 31, 2024
…pe (#45094)

### Rationale for this change

`arrow::ArrayStatistics` uses raw C++ types such as `int64_t` and `std::string` for min/max types. We need to convert raw C++ types to Arrow types when we use `arrow::ArrayStatistics` for generating statistics array. (GH-45038)

We can't map `std::string` to an Arrow type. Because it may be `arrow::binary()`, `arrow::utf8()` or something.

### What changes are included in this PR?

Use `arrow::DataType` information of associated array when we convert `arrow::ArrayStatistics`'s min/max raw C++ types to Arrow types.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #44579

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 19.0.0 milestone Dec 31, 2024
@kou
Copy link
Member Author

kou commented Dec 31, 2024

Issue resolved by pull request 45094
#45094

@kou kou closed this as completed Dec 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants