bremoran
Contributor

@bremoran bremoran commented Sep 10, 2025

On 32-bit architectures, each call to mld_keccakf1600_xor_bytes incurs an overhead. For example, on Armv7-M and Armv8-M with the optimised bit interleaving from XKCP, XORing a lane into the state incurs an overhead of 37 instructions. Whenever an incomplete lane is XORed into the state, this penalty is paid twice. This PR ensures that only full lanes are XORed into the state.
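
As a rough illustration of the cost described above, the following sketch counts lane XORs for a sequence of absorb calls, with and without buffering of incomplete lanes. It is a counting model only, not code from this PR: the function names and chunk sizes are hypothetical, and it assumes the total input stays below one SHAKE256 rate block so that no permutation happens in between.

```c
#include <assert.h>
#include <stddef.h>

#define LANE_BYTES 8   /* one Keccak-f[1600] lane is 64 bits */
#define RATE_BYTES 136 /* SHAKE256 rate in bytes (assumed for this model) */

/* Lanes touched when len bytes are XORed in at byte offset pos. */
static size_t lanes_touched(size_t pos, size_t len)
{
  if (len == 0)
  {
    return 0;
  }
  return (pos + len - 1) / LANE_BYTES - pos / LANE_BYTES + 1;
}

/* Unbuffered: every call XORs its bytes directly, so a lane that is
 * split across two calls is touched (and pays the overhead) twice. */
static size_t lane_xors_unbuffered(const size_t *chunks, size_t n)
{
  size_t pos = 0, count = 0, i;
  for (i = 0; i < n; i++)
  {
    count += lanes_touched(pos, chunks[i]);
    pos += chunks[i];
  }
  assert(pos < RATE_BYTES); /* model assumption: a single rate block */
  return count;
}

/* Buffered: only complete lanes are XORed, plus at most one final
 * partial lane when the buffer is flushed. */
static size_t lane_xors_buffered(const size_t *chunks, size_t n)
{
  size_t total = 0, i;
  for (i = 0; i < n; i++)
  {
    total += chunks[i];
  }
  assert(total < RATE_BYTES); /* model assumption: a single rate block */
  return total / LANE_BYTES + (total % LANE_BYTES != 0 ? 1 : 0);
}
```

For hypothetical chunk sizes of 2, 5 and 33 bytes, the model gives 7 lane XORs unbuffered and 5 buffered. At the 37-instruction overhead quoted above, that would be 74 instructions saved.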

Fixes #445

@bremoran bremoran requested a review from a team as a code owner September 10, 2025 13:36
@rod-chapman
Contributor

Please provide a description for this PR. What is the point of this refactoring? What benefit does it bring? Please provide a CBMC proof harness and Makefile for any new functions.

@mkannwischer
Contributor

@bremoran, sorry for the long wait for the review on this. Could you please rebase this on top of the changes in main, so we can benchmark and review it?

@bremoran bremoran force-pushed the f/refactor-fips202 branch from aa57a15 to 2cd2d61 Compare October 3, 2025 10:34
bremoran and others added 3 commits October 4, 2025 08:26
@mkannwischer
Contributor

@bremoran, that was not quite what I meant by rebasing.
I applied the changes required to make this work myself in a8d2d6a.

This gets inlined into the proof of mld_H - no need for a separate
contract if the proofs go through.

Signed-off-by: Matthias J. Kannwischer <[email protected]>
@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 47837 cycles 47836 cycles 1.00
ML-DSA-44 sign 156325 cycles 156334 cycles 1.00
ML-DSA-44 verify 52453 cycles 52450 cycles 1.00
ML-DSA-65 keypair 83684 cycles 83701 cycles 1.00
ML-DSA-65 sign 255488 cycles 255371 cycles 1.00
ML-DSA-65 verify 85590 cycles 85601 cycles 1.00
ML-DSA-87 keypair 136128 cycles 136113 cycles 1.00
ML-DSA-87 sign 320962 cycles 321312 cycles 1.00
ML-DSA-87 verify 137899 cycles 138009 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 115074 cycles 115039 cycles 1.00
ML-DSA-44 sign 430931 cycles 430787 cycles 1.00
ML-DSA-44 verify 122238 cycles 122176 cycles 1.00
ML-DSA-65 keypair 197047 cycles 196905 cycles 1.00
ML-DSA-65 sign 701023 cycles 701285 cycles 1.00
ML-DSA-65 verify 197670 cycles 197656 cycles 1.00
ML-DSA-87 keypair 334759 cycles 335149 cycles 1.00
ML-DSA-87 sign 884276 cycles 884767 cycles 1.00
ML-DSA-87 verify 328610 cycles 329046 cycles 1.00

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 281441 cycles 288008 cycles 0.98
ML-DSA-44 sign 971200 cycles 972295 cycles 1.00
ML-DSA-44 verify 301117 cycles 306786 cycles 0.98
ML-DSA-65 keypair 482405 cycles 492097 cycles 0.98
ML-DSA-65 sign 1584980 cycles 1609911 cycles 0.98
ML-DSA-65 verify 487166 cycles 493789 cycles 0.99
ML-DSA-87 keypair 817778 cycles 830114 cycles 0.99
ML-DSA-87 sign 2103778 cycles 2168352 cycles 0.97
ML-DSA-87 verify 823572 cycles 838050 cycles 0.98

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 35501 cycles 35660 cycles 1.00
ML-DSA-44 sign 132302 cycles 132372 cycles 1.00
ML-DSA-44 verify 41006 cycles 40941 cycles 1.00
ML-DSA-65 keypair 63922 cycles 63906 cycles 1.00
ML-DSA-65 sign 220917 cycles 220391 cycles 1.00
ML-DSA-65 verify 66232 cycles 66307 cycles 1.00
ML-DSA-87 keypair 95630 cycles 96815 cycles 0.99
ML-DSA-87 sign 259768 cycles 265102 cycles 0.98
ML-DSA-87 verify 99879 cycles 100242 cycles 1.00

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 95541 cycles 95838 cycles 1.00
ML-DSA-44 sign 343758 cycles 345726 cycles 0.99
ML-DSA-44 verify 101480 cycles 101478 cycles 1.00
ML-DSA-65 keypair 164662 cycles 164854 cycles 1.00
ML-DSA-65 sign 571713 cycles 568786 cycles 1.01
ML-DSA-65 verify 166031 cycles 165621 cycles 1.00
ML-DSA-87 keypair 271224 cycles 270260 cycles 1.00
ML-DSA-87 sign 725476 cycles 724985 cycles 1.00
ML-DSA-87 verify 273047 cycles 273226 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 41585 cycles 45299 cycles 0.92
ML-DSA-44 sign 143200 cycles 154336 cycles 0.93
ML-DSA-44 verify 46943 cycles 49529 cycles 0.95
ML-DSA-65 keypair 73940 cycles 74392 cycles 0.99
ML-DSA-65 sign 236322 cycles 237019 cycles 1.00
ML-DSA-65 verify 77313 cycles 78423 cycles 0.99
ML-DSA-87 keypair 111858 cycles 112104 cycles 1.00
ML-DSA-87 sign 279992 cycles 279301 cycles 1.00
ML-DSA-87 verify 117273 cycles 116800 cycles 1.00

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 57678 cycles 57941 cycles 1.00
ML-DSA-44 sign 201328 cycles 201248 cycles 1.00
ML-DSA-44 verify 66243 cycles 65669 cycles 1.01
ML-DSA-65 keypair 102316 cycles 101945 cycles 1.00
ML-DSA-65 sign 332994 cycles 333057 cycles 1.00
ML-DSA-65 verify 107021 cycles 107115 cycles 1.00
ML-DSA-87 keypair 157063 cycles 157562 cycles 1.00
ML-DSA-87 sign 399257 cycles 399500 cycles 1.00
ML-DSA-87 verify 162886 cycles 162176 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 71344 cycles 71956 cycles 0.99
ML-DSA-44 sign 212929 cycles 214221 cycles 0.99
ML-DSA-44 verify 74779 cycles 75325 cycles 0.99
ML-DSA-65 keypair 123608 cycles 123638 cycles 1.00
ML-DSA-65 sign 345402 cycles 346781 cycles 1.00
ML-DSA-65 verify 124084 cycles 123918 cycles 1.00
ML-DSA-87 keypair 206533 cycles 208833 cycles 0.99
ML-DSA-87 sign 447608 cycles 447509 cycles 1.00
ML-DSA-87 verify 205360 cycles 204748 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton4

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 69348 cycles 69498 cycles 1.00
ML-DSA-44 sign 222588 cycles 222917 cycles 1.00
ML-DSA-44 verify 74645 cycles 74589 cycles 1.00
ML-DSA-65 keypair 123409 cycles 123347 cycles 1.00
ML-DSA-65 sign 365960 cycles 366381 cycles 1.00
ML-DSA-65 verify 123609 cycles 123483 cycles 1.00
ML-DSA-87 keypair 201689 cycles 200598 cycles 1.01
ML-DSA-87 sign 467807 cycles 466978 cycles 1.00
ML-DSA-87 verify 201993 cycles 201918 cycles 1.00

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 120674 cycles 120817 cycles 1.00
ML-DSA-44 sign 452232 cycles 453984 cycles 1.00
ML-DSA-44 verify 131541 cycles 131897 cycles 1.00
ML-DSA-65 keypair 204081 cycles 205210 cycles 0.99
ML-DSA-65 sign 739495 cycles 738619 cycles 1.00
ML-DSA-65 verify 209598 cycles 210495 cycles 1.00
ML-DSA-87 keypair 339929 cycles 343513 cycles 0.99
ML-DSA-87 sign 942376 cycles 952408 cycles 0.99
ML-DSA-87 verify 350063 cycles 353724 cycles 0.99

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 135463 cycles 136469 cycles 0.99
ML-DSA-44 sign 542488 cycles 545358 cycles 0.99
ML-DSA-44 verify 148719 cycles 149472 cycles 0.99
ML-DSA-65 keypair 227337 cycles 229684 cycles 0.99
ML-DSA-65 sign 880524 cycles 888847 cycles 0.99
ML-DSA-65 verify 236252 cycles 237595 cycles 0.99
ML-DSA-87 keypair 375243 cycles 375230 cycles 1.00
ML-DSA-87 sign 1102759 cycles 1101253 cycles 1.00
ML-DSA-87 verify 387967 cycles 389206 cycles 1.00

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 157565 cycles 158047 cycles 1.00
ML-DSA-44 sign 563855 cycles 566208 cycles 1.00
ML-DSA-44 verify 169337 cycles 169650 cycles 1.00
ML-DSA-65 keypair 270050 cycles 269850 cycles 1.00
ML-DSA-65 sign 928714 cycles 928430 cycles 1.00
ML-DSA-65 verify 275259 cycles 275016 cycles 1.00
ML-DSA-87 keypair 450252 cycles 450841 cycles 1.00
ML-DSA-87 sign 1180577 cycles 1179105 cycles 1.00
ML-DSA-87 verify 460070 cycles 459184 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton3

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 73954 cycles 73980 cycles 1.00
ML-DSA-44 sign 236004 cycles 236034 cycles 1.00
ML-DSA-44 verify 80304 cycles 79930 cycles 1.00
ML-DSA-65 keypair 129494 cycles 129578 cycles 1.00
ML-DSA-65 sign 388474 cycles 388294 cycles 1.00
ML-DSA-65 verify 131006 cycles 130908 cycles 1.00
ML-DSA-87 keypair 210035 cycles 210041 cycles 1.00
ML-DSA-87 sign 491914 cycles 492267 cycles 1.00
ML-DSA-87 verify 212663 cycles 212589 cycles 1.00

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 462159 cycles 466373 cycles 0.99
ML-DSA-44 sign 2216904 cycles 2214442 cycles 1.00
ML-DSA-44 verify 547750 cycles 550635 cycles 0.99
ML-DSA-65 keypair 778716 cycles 777523 cycles 1.00
ML-DSA-65 sign 3628400 cycles 3643249 cycles 1.00
ML-DSA-65 verify 853665 cycles 849541 cycles 1.00
ML-DSA-87 keypair 1250941 cycles 1269297 cycles 0.99
ML-DSA-87 sign 4442690 cycles 4513601 cycles 0.98
ML-DSA-87 verify 1364598 cycles 1373707 cycles 0.99

@oqs-bot oqs-bot left a comment

Graviton2

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 115550 cycles 115640 cycles 1.00
ML-DSA-44 sign 392162 cycles 392538 cycles 1.00
ML-DSA-44 verify 123972 cycles 123749 cycles 1.00
ML-DSA-65 keypair 200210 cycles 200190 cycles 1.00
ML-DSA-65 sign 648965 cycles 648572 cycles 1.00
ML-DSA-65 verify 203087 cycles 202921 cycles 1.00
ML-DSA-87 keypair 328316 cycles 327699 cycles 1.00
ML-DSA-87 sign 822365 cycles 820887 cycles 1.00
ML-DSA-87 verify 332366 cycles 331384 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 132701 cycles 132744 cycles 1.00
ML-DSA-44 sign 498674 cycles 498324 cycles 1.00
ML-DSA-44 verify 145009 cycles 144951 cycles 1.00
ML-DSA-65 keypair 226922 cycles 227315 cycles 1.00
ML-DSA-65 sign 814244 cycles 813246 cycles 1.00
ML-DSA-65 verify 231594 cycles 231619 cycles 1.00
ML-DSA-87 keypair 374429 cycles 374603 cycles 1.00
ML-DSA-87 sign 1021798 cycles 1021441 cycles 1.00
ML-DSA-87 verify 384208 cycles 383659 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 138585 cycles 138628 cycles 1.00
ML-DSA-44 sign 495158 cycles 495579 cycles 1.00
ML-DSA-44 verify 148937 cycles 148792 cycles 1.00
ML-DSA-65 keypair 241460 cycles 241330 cycles 1.00
ML-DSA-65 sign 810228 cycles 809886 cycles 1.00
ML-DSA-65 verify 241222 cycles 240937 cycles 1.00
ML-DSA-87 keypair 396305 cycles 396441 cycles 1.00
ML-DSA-87 sign 1031970 cycles 1031506 cycles 1.00
ML-DSA-87 verify 402475 cycles 402272 cycles 1.00

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 213442 cycles 213493 cycles 1.00
ML-DSA-44 sign 781132 cycles 794089 cycles 0.98
ML-DSA-44 verify 230277 cycles 230005 cycles 1.00
ML-DSA-65 keypair 380712 cycles 381674 cycles 1.00
ML-DSA-65 sign 1287339 cycles 1285921 cycles 1.00
ML-DSA-65 verify 373222 cycles 373670 cycles 1.00
ML-DSA-87 keypair 609594 cycles 609555 cycles 1.00
ML-DSA-87 sign 1644483 cycles 1645486 cycles 1.00
ML-DSA-87 verify 621636 cycles 621588 cycles 1.00

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 115380 cycles 115390 cycles 1.00
ML-DSA-44 sign 392034 cycles 392115 cycles 1.00
ML-DSA-44 verify 123904 cycles 123546 cycles 1.00
ML-DSA-65 keypair 200071 cycles 199986 cycles 1.00
ML-DSA-65 sign 648490 cycles 647905 cycles 1.00
ML-DSA-65 verify 203071 cycles 202802 cycles 1.00
ML-DSA-87 keypair 327348 cycles 327077 cycles 1.00
ML-DSA-87 sign 819919 cycles 819688 cycles 1.00
ML-DSA-87 verify 331865 cycles 331074 cycles 1.00

@github-actions github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 822021 cycles 823286 cycles 1.00
ML-DSA-44 sign 3332036 cycles 3327209 cycles 1.00
ML-DSA-44 verify 920516 cycles 918657 cycles 1.00
ML-DSA-65 keypair 1395987 cycles 1400241 cycles 1.00
ML-DSA-65 sign 5415850 cycles 5443356 cycles 0.99
ML-DSA-65 verify 1464876 cycles 1464467 cycles 1.00
ML-DSA-87 keypair 2296738 cycles 2298732 cycles 1.00
ML-DSA-87 sign 6800722 cycles 6822286 cycles 1.00
ML-DSA-87 verify 2402751 cycles 2403402 cycles 1.00

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 213144 cycles 213012 cycles 1.00
ML-DSA-44 sign 780665 cycles 781249 cycles 1.00
ML-DSA-44 verify 230117 cycles 230192 cycles 1.00
ML-DSA-65 keypair 380413 cycles 380850 cycles 1.00
ML-DSA-65 sign 1304248 cycles 1291535 cycles 1.01
ML-DSA-65 verify 372936 cycles 372768 cycles 1.00
ML-DSA-87 keypair 609458 cycles 609112 cycles 1.00
ML-DSA-87 sign 1641897 cycles 1642387 cycles 1.00
ML-DSA-87 verify 621885 cycles 621381 cycles 1.00

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 239303 cycles 231907 cycles 1.03
ML-DSA-44 sign 701852 cycles 692048 cycles 1.01
ML-DSA-44 verify 238239 cycles 234215 cycles 1.02
ML-DSA-65 keypair 395898 cycles 397168 cycles 1.00
ML-DSA-65 sign 1112619 cycles 1103780 cycles 1.01
ML-DSA-65 verify 392007 cycles 380128 cycles 1.03
ML-DSA-87 keypair 662188 cycles 660299 cycles 1.00
ML-DSA-87 sign 1484409 cycles 1454152 cycles 1.02
ML-DSA-87 verify 645366 cycles 650049 cycles 0.99

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 239303 cycles 231907 cycles 1.03
ML-DSA-65 verify 392007 cycles 380128 cycles 1.03
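
The rows above show a rounded ratio of 1.03 against a threshold of 1.03, which can look contradictory. The sketch below spells out the comparison, assuming that the ratio is current cycles divided by previous cycles and that the alert fires when the unrounded ratio strictly exceeds the threshold. That rule is an assumption consistent with the numbers in this thread, not taken from the github-action-benchmark source, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio as shown in the tables: current cycles over previous cycles. */
static double bench_ratio(uint64_t current, uint64_t previous)
{
  assert(previous != 0);
  return (double)current / (double)previous;
}

/* Assumed alert rule: flag when the unrounded ratio strictly exceeds
 * the threshold (the tables only display two decimal places). */
static int bench_is_regression(uint64_t current, uint64_t previous,
                               double threshold)
{
  return bench_ratio(current, previous) > threshold;
}
```

With the Cortex-A72 (opt) numbers, ML-DSA-44 keypair gives 239303 / 231907, roughly 1.032, and is flagged, while ML-DSA-87 sign gives 1484409 / 1454152, roughly 1.021, and is not.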

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 keypair 315776 cycles 311426 cycles 1.01
ML-DSA-44 sign 1230641 cycles 1214729 cycles 1.01
ML-DSA-44 verify 353493 cycles 338228 cycles 1.05
ML-DSA-65 keypair 562601 cycles 572363 cycles 0.98
ML-DSA-65 sign 2009516 cycles 1992144 cycles 1.01
ML-DSA-65 verify 541825 cycles 547811 cycles 0.99
ML-DSA-87 keypair 884415 cycles 884798 cycles 1.00
ML-DSA-87 sign 2488138 cycles 2501693 cycles 0.99
ML-DSA-87 verify 912836 cycles 901676 cycles 1.01

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: f82f729 Previous: efe03f9 Ratio
ML-DSA-44 verify 353493 cycles 338228 cycles 1.05

@mkannwischer
Contributor

mkannwischer commented Oct 4, 2025

Performance-wise, there is no reason not to merge this. There is even a small improvement of 1-3% on Cortex-A55 and (for reasons that are beyond me) on 4th gen AMD EPYC (c7a).

CBMC proofs are failing, but we can fix that at a later point.

Fundamentally, I believe such caching does not belong in sign.c but in fips202.c. One could make the incomplete lane part of the Keccak state, which would make it a little cleaner, but it would still clutter the code somewhat.

WDYT @hanno-becker?

@rod-chapman
Contributor

I see one proof failure in mld_H. Let me take a look...

Contributor

@hanno-becker hanno-becker left a comment

Thanks @bremoran! I can definitely see this being useful for 32-bit platforms.

A few requests:

  • I don't think this needs an API extension. Instead, buffering input before XORing it into the state should be an implementation detail of the existing absorb API (add a buffer for the incomplete lane).
  • We should have documentation and CBMC proofs for new functionality.
  • The new logic belongs to FIPS-202.

Could you adjust the PR accordingly?
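
One way to read the first request is sketched below: the incomplete lane lives in a small buffer inside the incremental state, and the absorb function only ever XORs full lanes. This is a minimal sketch under assumptions, not the mldsa-native implementation. All sketch_* names and the struct layout are hypothetical, the rate is fixed to the SHAKE256 value of 136 bytes, and the Keccak-f[1600] permutation is replaced by a stub that only counts calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANE_BYTES 8
#define RATE_LANES 17 /* SHAKE256: 136-byte rate = 17 lanes (assumed here) */

/* Hypothetical incremental state with a buffer for the incomplete lane. */
typedef struct {
  uint64_t s[25];          /* Keccak state lanes */
  uint8_t buf[LANE_BYTES]; /* input bytes not yet XORed into s */
  size_t buf_len;          /* number of valid bytes in buf, 0..7 */
  size_t lane;             /* next lane to absorb into, 0..RATE_LANES-1 */
  size_t lane_xors;        /* instrumentation for this example only */
  size_t permutes;         /* instrumentation for this example only */
} sketch_ctx;

/* Stub: a real implementation would run Keccak-f[1600] on c->s here. */
static void sketch_permute(sketch_ctx *c) { c->permutes++; }

/* Little-endian load of one full lane. */
static uint64_t sketch_load_lane(const uint8_t *in) {
  uint64_t v = 0;
  size_t i;
  for (i = 0; i < LANE_BYTES; i++) v |= (uint64_t)in[i] << (8 * i);
  return v;
}

/* XOR exactly one full lane; permute once the rate block is full. */
static void sketch_xor_lane(sketch_ctx *c, const uint8_t *in) {
  c->s[c->lane] ^= sketch_load_lane(in);
  c->lane_xors++;
  if (++c->lane == RATE_LANES) {
    sketch_permute(c);
    c->lane = 0;
  }
}

static void sketch_init(sketch_ctx *c) { memset(c, 0, sizeof *c); }

/* Same shape as a usual absorb(state, in, len) call: no API extension. */
static void sketch_absorb(sketch_ctx *c, const uint8_t *in, size_t len) {
  if (len == 0) return;
  if (c->buf_len > 0) { /* top up the pending incomplete lane first */
    size_t take = LANE_BYTES - c->buf_len;
    if (take > len) take = len;
    memcpy(c->buf + c->buf_len, in, take);
    c->buf_len += take;
    in += take;
    len -= take;
    if (c->buf_len < LANE_BYTES) return; /* still incomplete: no XOR yet */
    sketch_xor_lane(c, c->buf);
    c->buf_len = 0;
  }
  for (; len >= LANE_BYTES; in += LANE_BYTES, len -= LANE_BYTES)
    sketch_xor_lane(c, in);
  memcpy(c->buf, in, len); /* keep the tail (0..7 bytes) for later */
  c->buf_len = len;
}

/* Flush the tail with SHAKE padding (0x1F ... 0x80), then permute. */
static void sketch_finalize(sketch_ctx *c) {
  memset(c->buf + c->buf_len, 0, LANE_BYTES - c->buf_len);
  c->buf[c->buf_len] = 0x1F;
  c->s[c->lane] ^= sketch_load_lane(c->buf);
  c->lane_xors++;
  c->s[RATE_LANES - 1] ^= (uint64_t)0x80 << 56;
  sketch_permute(c);
  c->buf_len = 0;
  c->lane = 0;
}
```

In this sketch, splitting a 40-byte input into calls of 2, 5 and 33 bytes leaves the same state as a single 40-byte call, with five full-lane XORs during absorb in both cases. Callers see no change in the absorb signature; only the state struct grows by the buffer and its length.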

@mkannwischer
Contributor

mkannwischer commented Oct 4, 2025

I agree. Marking this as draft for now. @bremoran, please mark it as ready when you have updated the PR.
Let us know if you need help with adjusting the CBMC proofs.

@mkannwischer mkannwischer marked this pull request as draft October 4, 2025 04:09

Development

Successfully merging this pull request may close these issues.

Low performance in mld_H
