@duongcongtoai duongcongtoai commented Oct 19, 2025

Which issue does this PR close?

Rationale for this change

Implement custom interleaving logic for nested data types that contain dictionaries. The current main branch uses interleave_fallback, which concatenates the dictionaries even when they are logically equal (same values, but stored at different addresses). With enough inputs, the concatenated dictionary can exceed the key type's range and overflow. There are two ways to work around this:

  • make interleave_fallback return an error on dictionary overflow, then merge the dictionaries and retry.
  • implement custom interleaving logic for these data types (the approach chosen by this PR).

The first solution may provide wider support, since it would cover any nested data type that contains a dictionary (e.g. list<struct<list>>).
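To make the overflow concrete, here is a minimal sketch in plain Rust (standard library only). It is not the PR's code and does not use the arrow-rs API: `Dict`, `sample_dict`, `concatenated_len`, `merge_by_value`, and `interleave` are hypothetical names, and a `u8` key type is assumed so the numbers stay small. It contrasts concatenating dictionary values with merging them by value before remapping keys.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a dictionary-encoded array with u8 keys.
/// (Hypothetical type for illustration; not arrow-rs's `DictionaryArray`.)
struct Dict {
    keys: Vec<u8>,
    values: Vec<String>,
}

/// Builds a dictionary with 200 distinct values "v0".."v199", one key per value.
fn sample_dict() -> Dict {
    Dict {
        keys: (0..200u8).collect(),
        values: (0..200).map(|i| format!("v{i}")).collect(),
    }
}

/// Number of dictionary entries if the value arrays are simply concatenated.
fn concatenated_len(dicts: &[Dict]) -> usize {
    dicts.iter().map(|d| d.values.len()).sum()
}

/// Merges dictionaries by value. Returns the deduplicated values and, for each
/// input, a table mapping its old keys to keys in the merged values.
/// Returns None if the merged dictionary does not fit in the u8 key space.
fn merge_by_value(dicts: &[Dict]) -> Option<(Vec<String>, Vec<Vec<u8>>)> {
    let mut index: HashMap<&str, u8> = HashMap::new();
    let mut merged: Vec<String> = Vec::new();
    let mut remaps = Vec::with_capacity(dicts.len());
    for d in dicts {
        let mut remap = Vec::with_capacity(d.values.len());
        for v in &d.values {
            let key = match index.get(v.as_str()) {
                Some(&k) => k,
                None => {
                    if merged.len() > u8::MAX as usize {
                        return None; // genuine overflow: too many distinct values
                    }
                    let k = merged.len() as u8;
                    merged.push(v.clone());
                    index.insert(v.as_str(), k);
                    k
                }
            };
            remap.push(key);
        }
        remaps.push(remap);
    }
    Some((merged, remaps))
}

/// Interleaves rows given as (array index, row index), re-encoding each key
/// against the merged dictionary.
fn interleave(dicts: &[Dict], indices: &[(usize, usize)]) -> Option<Dict> {
    let (values, remaps) = merge_by_value(dicts)?;
    let keys = indices
        .iter()
        .map(|&(a, row)| remaps[a][dicts[a].keys[row] as usize])
        .collect();
    Some(Dict { keys, values })
}
```

With two copies of `sample_dict()`, concatenation would need 400 entries, which exceeds the 256 keys a `u8` can address, while the merged dictionary needs only 200. A retry-based variant (the first approach above) would call something like `merge_by_value` only after the cheaper concatenation path reports an overflow.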

What changes are included in this PR?

Special interleave handling for specific data types:

  • list of dictionary
  • list of struct whose fields include dictionaries
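As a rough illustration of the list-of-dictionary case, the sketch below models a list array as offsets over child dictionary keys and rebuilds both the offsets and the keys while interleaving. It is an assumption-laden model in plain Rust, not the PR's implementation: `ListOfDict` and `interleave_lists` are hypothetical names, keys are assumed to be `u8`, and values are only added to the output dictionary when a selected row references them.

```rust
use std::collections::HashMap;

/// Simplified model of a list<dictionary<u8, string>> array.
/// (Hypothetical type for illustration; not an arrow-rs type.)
struct ListOfDict {
    offsets: Vec<usize>, // row i spans child keys offsets[i]..offsets[i + 1]
    keys: Vec<u8>,       // child dictionary keys
    values: Vec<String>, // child dictionary values
}

/// Interleaves rows selected by (array index, row index), re-encoding child
/// keys against one dictionary shared by the whole output.
/// Returns None if the shared dictionary would exceed the u8 key space.
fn interleave_lists(
    arrays: &[ListOfDict],
    indices: &[(usize, usize)],
) -> Option<ListOfDict> {
    let mut lookup: HashMap<&str, u8> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::new();
    let mut offsets = vec![0];
    for &(a, row) in indices {
        let arr = &arrays[a];
        // Copy the child range of this row, remapping each key by value.
        for &old_key in &arr.keys[arr.offsets[row]..arr.offsets[row + 1]] {
            let value = arr.values[old_key as usize].as_str();
            let new_key = match lookup.get(value) {
                Some(&k) => k,
                None => {
                    if values.len() > u8::MAX as usize {
                        return None;
                    }
                    let k = values.len() as u8;
                    values.push(value.to_string());
                    lookup.insert(value, k);
                    k
                }
            };
            keys.push(new_key);
        }
        // Close the output row at the current end of the child keys.
        offsets.push(keys.len());
    }
    Some(ListOfDict { offsets, keys, values })
}
```

Because logically equal values from different inputs map to the same output key, the output dictionary grows with the number of distinct values referenced rather than with the number of input arrays.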

Are these changes tested?

Last benchmark

group                                                                          arrow-57.0.0-custom                    arrow-fix-overflow-on-interleave-list-of-dict
-----                                                                          -------------------                    ---------------------------------------------
interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]                     1.00      4.0±0.18µs        ? ?/sec    1.02      4.1±0.30µs        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]           1.03     19.3±1.47µs        ? ?/sec    1.00     18.8±0.62µs        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]                    1.03     19.8±1.73µs        ? ?/sec    1.00     19.2±0.67µs        ? ?/sec
interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]                     1.00      9.5±0.39µs        ? ?/sec    1.00      9.5±0.43µs        ? ?/sec
interleave dict_distinct 100                                                   1.00      3.3±0.13µs        ? ?/sec    1.00      3.4±0.10µs        ? ?/sec
interleave dict_distinct 1024                                                  1.06     11.8±0.57µs        ? ?/sec    1.00     11.1±0.33µs        ? ?/sec
interleave dict_distinct 2048                                                  1.00     42.3±2.77µs        ? ?/sec    1.00     42.2±1.43µs        ? ?/sec
interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]              1.00      4.1±0.21µs        ? ?/sec    1.01      4.1±0.15µs        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.00     18.8±0.72µs        ? ?/sec    1.00     18.8±0.94µs        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]             1.01     19.4±1.09µs        ? ?/sec    1.00     19.3±0.54µs        ? ?/sec
interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]              1.00      9.4±0.42µs        ? ?/sec    1.02      9.6±0.37µs        ? ?/sec
interleave list of logically_equivalent_dict 100                               1.08      8.7±0.44µs        ? ?/sec    1.00      8.0±0.29µs        ? ?/sec
interleave list of logically_equivalent_dict 1024                              1.00     29.3±1.16µs        ? ?/sec    1.01     29.6±0.87µs        ? ?/sec
interleave list of logically_equivalent_dict 2048                              1.00     68.6±2.88µs        ? ?/sec    1.02     69.8±1.84µs        ? ?/sec
interleave list struct with logically_equivalent_dict 100                      1.00     25.0±0.60µs        ? ?/sec    1.04     26.0±1.55µs        ? ?/sec
interleave list struct with logically_equivalent_dict 1024                     1.00     91.4±2.43µs        ? ?/sec    1.04     95.2±7.25µs        ? ?/sec
interleave list struct with logically_equivalent_dict 2048                     1.01    180.2±7.38µs        ? ?/sec    1.00    178.1±5.23µs        ? ?/sec
interleave list struct with same dict ptr 100                                  1.00     25.0±0.63µs        ? ?/sec    1.03     25.8±0.92µs        ? ?/sec
interleave list struct with same dict ptr 1024                                 1.00     92.7±2.87µs        ? ?/sec    1.01     93.6±4.87µs        ? ?/sec
interleave list struct with same dict ptr 2048                                 1.00    179.7±7.63µs        ? ?/sec    1.00   180.6±12.30µs        ? ?/sec

@github-actions github-actions bot added the "arrow" label (Changes to the arrow crate) Oct 19, 2025
@duongcongtoai duongcongtoai changed the title from "fix: try merging list of dict if possible" to "fix: try merging on interleaving list of dict" Oct 19, 2025
@duongcongtoai duongcongtoai force-pushed the fix-overflow-on-interleave-list-of-dict branch from 5c3ee8c to d18f6f0 October 25, 2025 13:46
@duongcongtoai duongcongtoai marked this pull request as ready for review October 25, 2025 14:49

Development

Successfully merging this pull request may close these issues.

DictionaryKeyOverflowError on interleave with nested type containing dictionary