Fix: hardcoded DType in dict layout
#5761
Conversation
We decided that more than 64k unique elements in a DictLayout (where chunks can be ~8k elements) doesn't make much sense! I imagine this is done this way, with bad assumptions, in a few places...
So this was done before the dict encoder was generic over the codes ptype, when it dynamically set the codes dtype based on the magnitude of the max_len constraint. It always encoded to a wider int, and after encoding finished we would downcast to a narrower int. In that world it was not possible to tell the codes ptype before encoding had fully finished. Also, I don't think the codes ptype is selected dynamically anymore; it should still depend on the max_len argument, and only if that is smaller than 256 should we get u8 codes. I think what is happening here is that we are trying to dict encode u8 values, in which case this code selects u8 even though max_len is u16::MAX. So probably the right fix is to expose the codes ptype from the dict builder?
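For illustration, here is a minimal self-contained sketch of the upfront selection rule being described; the `CodeWidth` enum and `select_codes_width` function are invented for this example and are not the actual vortex-dict API:

```rust
/// Invented stand-in for the codes primitive type (not Vortex's `PType`).
#[derive(Clone, Copy, Debug, PartialEq)]
enum CodeWidth {
    U8,
    U16,
    U32,
}

impl CodeWidth {
    fn byte_width(self) -> usize {
        match self {
            CodeWidth::U8 => 1,
            CodeWidth::U16 => 2,
            CodeWidth::U32 => 4,
        }
    }

    fn for_bytes(bytes: usize) -> CodeWidth {
        match bytes {
            1 => CodeWidth::U8,
            2 => CodeWidth::U16,
            _ => CodeWidth::U32,
        }
    }
}

/// Choose the codes width upfront from the input width and the max_len cap.
fn select_codes_width(input_width_bytes: usize, max_len: u64) -> CodeWidth {
    // Width implied by the cardinality cap alone.
    let from_max_len = if max_len <= u8::MAX as u64 {
        CodeWidth::U8
    } else if max_len <= u16::MAX as u64 {
        CodeWidth::U16
    } else {
        CodeWidth::U32
    };
    // A narrower input can never exceed its own value space: a u8 input has
    // at most 256 distinct values, so u8 codes suffice even when max_len is
    // u16::MAX. This is the case the fuzzer hits.
    let from_input = CodeWidth::for_bytes(input_width_bytes);
    if from_input.byte_width() < from_max_len.byte_width() {
        from_input
    } else {
        from_max_len
    }
}
```

Under this rule, `select_codes_width(1, u16::MAX as u64)` returns `CodeWidth::U8`, so the writer sees a u8 codes dtype even though max_len alone would imply u16.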
Force-pushed from 833c131 to 6aa5e45.
@onursatici I think that is what I am doing, since I am extracting the dtype of the codes? Or are you saying that we need to forward the original dtype from somewhere else to the writer? Or are you saying we need to always upcast the codes?
I guess what I am saying is that this comment is not exactly right: we choose the ptype for the codes upfront, so it doesn't depend on the actual cardinality after encoding. It depends on two things: the width of the input primitive ptype, and the max_len constraint. Normally we have max_len set to u16::MAX because we don't want a higher cardinality in a chunk, so we always get u16 codes. In the fuzzer we end up dict encoding u8 types, and because the input is narrower than the max_len constraint (u8 vs u16), the dict encoder returns u8 codes. As for the fix, I think this will work and I am happy to merge as is, but I think the right way is to get the codes dtype from the dict encoder: as soon as you construct one you can get a codes ptype, you don't need to encode a chunk to see what ptype it will end up having.
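Sketched against the invented `CodeWidth`/`select_codes_width` helpers above (the real vortex-dict encoder looks different), the idea is that the codes width is fixed at construction time and can simply be exposed through a getter:

```rust
/// Invented encoder shape to show the idea; not the real vortex-dict type.
struct DictEncoderSketch {
    /// Fixed when the encoder is constructed, before any chunk is encoded.
    codes_width: CodeWidth,
}

impl DictEncoderSketch {
    fn new(input_width_bytes: usize, max_len: u64) -> Self {
        Self {
            codes_width: select_codes_width(input_width_bytes, max_len),
        }
    }

    /// The layout writer could query this up front instead of waiting for
    /// the first encoded batch.
    fn codes_width(&self) -> CodeWidth {
        self.codes_width
    }
}
```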
Shall we do this then, @connortsui20: use the max_len instead of the first batch?
self.active_values_tx = Some(values_tx);

// Send first codes
let codes_dtype = codes.dtype().clone();
I agree with @onursatici and @joseph-isaacs; it seems like we should take this from DictLayoutConstraints::max_len and not from the first batch.
As discussed with Joe, I will make that change and push to this branch.
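A hedged sketch of that change, again using the invented `CodeWidth` from above: derive the codes width from the configured constraint, which is known before any batch is encoded, instead of cloning the first batch's dtype.

```rust
/// Illustrative replacement for `let codes_dtype = codes.dtype().clone();`.
/// Since max_len is capped at u16::MAX by design, u16 always suffices once
/// the cap exceeds the u8 range; batches that happen to carry narrower codes
/// would then be upcast to this width.
fn codes_width_from_constraints(max_len: u16) -> CodeWidth {
    if max_len <= u16::from(u8::MAX) {
        CodeWidth::U8
    } else {
        CodeWidth::U16
    }
}
```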
Deploying vortex-bench with Cloudflare Pages

Latest commit: 0a7cccb
Status: ✅ Deploy successful!
Preview URL: https://96bde770.vortex-93b.pages.dev
Branch Preview URL: https://ct-fix-dict-writer.vortex-93b.pages.dev
Signed-off-by: Connor Tsui <[email protected]>
Signed-off-by: Onur Satici <[email protected]>
Force-pushed from 0a7cccb to 11a0cbb.
/// Constraints for dictionary layout encoding.
///
/// Note that [`max_len`](Self::max_len) is limited to `u16` (65,535 entries) by design. Since
/// layout chunks are typically ~8k elements, having more than 64k unique values in a dictionary
/// means dictionary encoding provides little compression benefit. If a column has very high
/// cardinality, the fallback encoding strategy should be used instead.
Might be worth saying you can still use a DictArray?
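If that suggestion is taken, the note might read something like the sketch below; the struct shape and field type are assumptions, only the `max_len` name and the `u16` cap come from this thread:

```rust
/// Constraints for dictionary layout encoding.
///
/// [`max_len`](Self::max_len) is limited to `u16` (65,535 entries) by design; if a column has
/// very high cardinality, the fallback encoding strategy should be used instead. Such a column
/// can still be dictionary-encoded in memory as a `DictArray`, which this layout-level limit
/// does not apply to.
pub struct DictLayoutConstraints {
    /// Maximum number of unique dictionary entries per layout.
    pub max_len: u16,
}
```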
joseph-isaacs left a comment:
Shall we let the tests finish before we merge?
Benchmark bot results (summary and detailed results tables collapsed): FineWeb NVMe, TPC-H SF=1 on NVMe, TPC-H SF=1 on S3, TPC-DS SF=1 on NVMe, TPC-H SF=10 on NVMe, FineWeb S3, Statistical and Population Genetics, TPC-H SF=10 on S3, Clickbench on NVMe.
Fixes #5591
Also fixes #5563
Still not super sure why we limit the dict layout codes to a max of u16::MAX...