sort:Optimize sort collation for long lines by mattsu2020 · Pull Request #12144 · uutils/coreutils

mattsu2020 · 2026-05-04T12:40:50Z

What changed

Avoid precomputing ICU collation sort keys for lines larger than 1 MiB.
Store optional collation key ranges so very long lines can fall back to lazy locale comparison during sorting.
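
As a rough illustration of the "optional key range" idea above, the sketch below stores all precomputed keys in one shared buffer and keeps an `Option<Range<usize>>` per line. Every name here (`KeyStore`, `fake_sort_key`, `MAX_PRECOMPUTED_KEY_LINE_LEN`) is hypothetical and not taken from the uutils source, and the key function is a plain ASCII-lowercasing stand-in rather than a real ICU sort key.

```rust
use std::ops::Range;

/// Hypothetical cutoff in bytes; the description above uses 1 MiB.
const MAX_PRECOMPUTED_KEY_LINE_LEN: usize = 1024 * 1024;

/// Stand-in for an ICU sort key (assumption): ASCII-lowercased bytes.
fn fake_sort_key(line: &[u8]) -> Vec<u8> {
    line.to_ascii_lowercase()
}

/// All keys share one buffer; each line keeps only an optional range into it.
struct KeyStore {
    buffer: Vec<u8>,
    ranges: Vec<Option<Range<usize>>>,
}

impl KeyStore {
    fn build(lines: &[&[u8]]) -> Self {
        let mut buffer = Vec::new();
        let mut ranges = Vec::with_capacity(lines.len());
        for line in lines {
            if line.len() > MAX_PRECOMPUTED_KEY_LINE_LEN {
                // Very long line: do not materialize a key at all.
                ranges.push(None);
            } else {
                let start = buffer.len();
                buffer.extend_from_slice(&fake_sort_key(line));
                ranges.push(Some(start..buffer.len()));
            }
        }
        Self { buffer, ranges }
    }

    /// Returns the precomputed key for a line, or `None` if it was skipped.
    fn key(&self, index: usize) -> Option<&[u8]> {
        self.ranges[index].clone().map(|r| &self.buffer[r])
    }
}
```

With this layout, a line over the cutoff costs only one `None` entry, so the key buffer never grows in proportion to the very long lines.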

Why

Fixes #12138. In UTF-8 locales, sort precomputed ICU collation keys for every input line. For inputs with a small number of very large lines, such as 26 lines of 200 MiB each, the cost of generating and storing multi-GiB collation keys dominated runtime.

Impact

Small and normal-sized lines keep the existing precomputed-key fast path. Very long lines skip the expensive key materialization and use locale_cmp when compared.
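
The two paths described above could be combined in a comparator along these lines. This is a minimal sketch under stated assumptions: `Line` and `compare_lines` are hypothetical names, and the `locale_cmp` shown here is an ASCII case-insensitive stand-in, not the real function of that name in `uu_sort`. The fallback must order lines the same way the precomputed keys do, otherwise mixing the two paths in one sort would be inconsistent.

```rust
use std::cmp::Ordering;

/// Stand-in for locale-aware comparison (assumption): ASCII case-insensitive,
/// evaluated lazily byte by byte without allocating a key.
fn locale_cmp(a: &[u8], b: &[u8]) -> Ordering {
    a.iter()
        .map(u8::to_ascii_lowercase)
        .cmp(b.iter().map(u8::to_ascii_lowercase))
}

/// Hypothetical line record: the key is present only for short lines.
struct Line<'a> {
    text: &'a [u8],
    key: Option<Vec<u8>>,
}

impl<'a> Line<'a> {
    fn new(text: &'a [u8], max_key_len: usize) -> Self {
        // The stand-in key is the lowercased text, matching `locale_cmp` above.
        let key = (text.len() <= max_key_len).then(|| text.to_ascii_lowercase());
        Self { text, key }
    }
}

fn compare_lines(a: &Line, b: &Line) -> Ordering {
    match (&a.key, &b.key) {
        // Fast path: both keys were precomputed, so compare them bytewise.
        (Some(ka), Some(kb)) => ka.cmp(kb),
        // At least one line skipped key generation: compare lazily.
        _ => locale_cmp(a.text, b.text),
    }
}
```

A caller would pass this to `sort_by`, for example `lines.sort_by(|a, b| compare_lines(a, b))`.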

Validation

cargo check -p uu_sort
cargo test -p uu_sort
cargo test -p coreutils --test tests test_sort::test_default_unsorted_ints -- --exact
Compared output against GNU sort with cmp for 52 MiB and 130 MiB reproducer inputs.
Hyperfine on the issue-sized 5.1 GiB input with LC_ALL=en_US.UTF-8 --parallel 1 --buffer-size 8G:
- uutils release: 5.054 s
- GNU gsort 9.11: 33.685 s
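
For readers checking the numbers, the arithmetic behind them can be sketched as follows. The helper names are hypothetical. 26 lines of 200 MiB come to 5200 MiB, about 5.08 GiB, which is consistent with the 5.1 GiB input quoted above, and the two timings work out to roughly a 6.7x difference.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

/// Total input size in bytes for `lines` lines of `mib_per_line` MiB each.
fn input_bytes(lines: u64, mib_per_line: u64) -> u64 {
    lines * mib_per_line * MIB
}

/// Converts a byte count to GiB as a floating-point value.
fn as_gib(bytes: u64) -> f64 {
    bytes as f64 / GIB as f64
}

/// Ratio of the slower wall-clock time to the faster one.
fn speedup(slow_secs: f64, fast_secs: f64) -> f64 {
    slow_secs / fast_secs
}
```

Calling `as_gib(input_bytes(26, 200))` and `speedup(33.685, 5.054)` reproduces the two figures.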

github-actions · 2026-05-04T13:06:18Z

GNU testsuite comparison:

Skip an intermittent issue tests/date/resolution (fails in this run but passes in the 'main' branch)
Note: The gnu test tests/basenc/bounded-memory is now being skipped but was previously passing.
Note: The gnu test tests/tail/tail-n0f is now being skipped but was previously passing.

codspeed-hq · 2026-05-04T13:15:42Z

Merging this PR will degrade performance by 23.24%

❌ 3 regressed benchmarks
✅ 308 untouched benchmarks
⏩ 46 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | Memory | `sort_german_de_locale` | 3.3 MB | 4.3 MB | -23.24% |
| ❌ | Simulation | `sort_key_field[500000]` | 767.8 ms | 804.6 ms | -4.57% |
| ❌ | Simulation | `sort_ascii_utf8_locale` | 15.4 ms | 16.2 ms | -4.83% |

Comparing mattsu2020:fix_sort_performance (23e4bb3) with main (c23dc67)

¹ 46 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

xtqqczze · 2026-05-04T14:46:38Z

Out of interest, why choose 1 MiB as the limit, rather than something lower like u16::MAX?

mattsu2020 · 2026-05-04T23:23:49Z

> Out of interest, why choose 1 MiB as the limit, rather than something lower like u16::MAX?

Measurements with a 64 KiB threshold showed performance at least equivalent to the 1 MiB threshold on the issue workload, so we will change the threshold to u16::MAX.
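
To put the thresholds discussed above side by side: `u16::MAX` is 65,535 bytes, one byte short of 64 KiB and a sixteenth of the original 1 MiB limit. The sketch below only illustrates the sizes. The constant names and `should_precompute_key` are hypothetical, and whether the real check is inclusive or exclusive at the boundary is an assumption.

```rust
const ONE_MIB: usize = 1024 * 1024;
const SIXTY_FOUR_KIB: usize = 64 * 1024;
const U16_MAX_BYTES: usize = u16::MAX as usize;

/// Hypothetical predicate: precompute a collation key only for lines whose
/// byte length does not exceed the threshold (inclusive bound assumed).
fn should_precompute_key(line_len: usize, threshold: usize) -> bool {
    line_len <= threshold
}
```

Under this assumption, a 65,536-byte line keeps its precomputed key with the 1 MiB threshold but falls back to lazy comparison with `u16::MAX`.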

xtqqczze · 2026-05-04T23:29:23Z

@mattsu2020 Could you also add a benchmark (in separate PR)?

mattsu2020 · 2026-05-04T23:32:02Z

> @mattsu2020 Could you also add a benchmark (in separate PR)?

Sure, I’ll keep this PR focused on the fix and open a separate PR adding a benchmark for long-line locale collation.

Optimize sort collation for long lines

004d638

mattsu2020 changed the title ~~[codex] Optimize sort collation for long lines~~ sort:Optimize sort collation for long lines May 4, 2026

mattsu2020 marked this pull request as ready for review May 4, 2026 13:00

Lower sort collation key threshold

23e4bb3

mattsu2020 mentioned this pull request May 4, 2026

sort:Add sort long-line locale benchmark #12150

Open

mattsu2020 wants to merge 2 commits into uutils:main from mattsu2020:fix_sort_performance
