LastNonNullIter is too slow if there are too many duplicate rows #5229

Open
evenyag opened this issue Dec 24, 2024 · 0 comments
Labels
C-enhancement Category Enhancements C-performance Category Performance

Comments

Contributor

evenyag commented Dec 24, 2024

What type of enhancement is this?

Performance

What does the enhancement do?

The last-non-null dedup implementation (`LastNonNullIter`) is too slow, so flushing the memtable may take a long time.

If a key has more than 2M rows, deduplication becomes very expensive and stalls write requests.

2024-12-24T09:24:09.200271Z  INFO mito2::read::dedup: LastNonNullIter inner iter returns batch, region: 4483945857024(1044, 0), batch len: 2578923, timestamps: Some([1734968880013721400, 1734968880013721400, 1734968880013721400, 1734968880013721400, 1734968880013721400
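For readers unfamiliar with the semantics, here is a minimal sketch of what "last non null" dedup means for rows that share a timestamp, as in the batch logged above. It is not the mito2 implementation: the `Row` type and the `last_non_null` function are hypothetical, and the sketch assumes duplicates of a timestamp arrive newest first.

```rust
/// Hypothetical row type for illustration only; this is not mito2's batch type.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    /// Timestamp in nanoseconds, like the values in the log line above.
    timestamp: i64,
    /// Field values; `None` stands for a null field.
    fields: Vec<Option<f64>>,
}

/// Merges rows that share a timestamp into a single row.
///
/// Assumption: `rows` is sorted by timestamp, and among rows with the same
/// timestamp the newest row comes first. For each field, the merged row keeps
/// the value from the newest row in which that field is non-null.
fn last_non_null(rows: &[Row]) -> Vec<Row> {
    let mut out: Vec<Row> = Vec::new();
    for row in rows {
        let is_duplicate = out
            .last()
            .map_or(false, |last| last.timestamp == row.timestamp);
        if is_duplicate {
            // An older duplicate may only fill fields that are still null.
            let last = out.last_mut().expect("checked by is_duplicate");
            for (dst, src) in last.fields.iter_mut().zip(&row.fields) {
                if dst.is_none() {
                    *dst = *src;
                }
            }
        } else {
            // First (newest) row for this timestamp.
            out.push(row.clone());
        }
    }
    out
}
```

In this sketch every duplicate row costs one pass over its fields, so a key with millions of rows at the same timestamp is expensive even before any per-row batch handling is counted.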

Some metrics

2024-12-24T09:24:02.606851Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 1, num_rows: 598196, num_splits: 235112, num_push_batches: 235112, num_return_batches: 3628, num_finish_batches: 0
2024-12-24T09:24:07.615855Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 11, num_rows: 931664, num_splits: 888700, num_push_batches: 888710, num_return_batches: 16853, num_finish_batches: 0
2024-12-24T09:24:12.617940Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 12, num_rows: 3510587, num_splits: 949850, num_push_batches: 949861, num_return_batches: 18191, num_finish_batches: 0
2024-12-24T09:24:17.621942Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 12, num_rows: 3510587, num_splits: 1006868, num_push_batches: 1006879, num_return_batches: 19135, num_finish_batches: 0
2024-12-24T09:24:22.622635Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 12, num_rows: 3510587, num_splits: 1065109, num_push_batches: 1065120, num_return_batches: 20084, num_finish_batches: 0
2024-12-24T09:24:27.625763Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 12, num_rows: 3510587, num_splits: 1124782, num_push_batches: 1124793, num_return_batches: 21048, num_finish_batches: 0
2024-12-24T09:24:32.631291Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 12, num_rows: 3510587, num_splits: 1185770, num_push_batches: 1185781, num_return_batches: 22046, num_finish_batches: 0
2024-12-24T09:24:37.633937Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 12, num_rows: 3510587, num_splits: 1248650, num_push_batches: 1248661, num_return_batches: 23087, num_finish_batches: 0
2024-12-24T09:24:42.633966Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 12, num_rows: 3510587, num_splits: 1313255, num_push_batches: 1313266, num_return_batches: 24107, num_finish_batches: 0
2024-12-24T09:24:47.634862Z  INFO mito2::read::dedup: LastNonNullIter, region: 4483945857024(1044, 0), num_batches: 12, num_rows: 3510587, num_splits: 1380601, num_push_batches: 1380612, num_return_batches: 25152, num_finish_batches: 0
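A rough reading of the counters above: between 09:24:12 and 09:24:47, `num_rows` stays at 3510587 while `num_splits` grows from 949850 to 1380601, which is roughly 12,300 splits per second. The sketch below just redoes that arithmetic from the logged values. The `Sample` struct and helper functions are hypothetical, and what each counter means is inferred from its name only.

```rust
/// One `LastNonNullIter` metrics sample copied from the log lines above.
struct Sample {
    /// Seconds past 09:24:00, taken from the log timestamp.
    secs: f64,
    num_splits: u64,
    num_push_batches: u64,
    num_return_batches: u64,
}

/// Average growth of `num_splits` per second between two samples.
fn splits_per_second(a: &Sample, b: &Sample) -> f64 {
    (b.num_splits - a.num_splits) as f64 / (b.secs - a.secs)
}

/// Number of pushed batches per returned batch in one sample.
fn pushes_per_return(s: &Sample) -> f64 {
    s.num_push_batches as f64 / s.num_return_batches as f64
}

/// The 09:24:12 and 09:24:47 samples from the log.
fn samples() -> (Sample, Sample) {
    let first = Sample {
        secs: 12.617940,
        num_splits: 949_850,
        num_push_batches: 949_861,
        num_return_batches: 18_191,
    };
    let last = Sample {
        secs: 47.634862,
        num_splits: 1_380_601,
        num_push_batches: 1_380_612,
        num_return_batches: 25_152,
    };
    (first, last)
}
```

With these numbers, `pushes_per_return` on the last sample is about 55, so most of the work appears to go into pushing small batches rather than returning results.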

The hotspots
image

Implementation challenges

We may need to refactor the iterator or the memtable to make this computation faster.

@evenyag evenyag added C-enhancement Category Enhancements C-performance Category Performance labels Dec 24, 2024
@evenyag evenyag self-assigned this Dec 24, 2024