@neuschaefer commented Oct 4, 2025

Increase the consistency between _mm_loadu_si128 and _mm_stream_si128 by using vector loads/stores of 64-bit elements in both. This should have no impact on existing users. On aarch64 (release build, GCC 15.2), crc_non_temporal_memcpy.cc.o stays effectively the same; the only change is the following (a sketch of the matched-width load/store pattern appears after the diff):

--- crc_non_temporal_memcpy.cc.o (original)
+++ crc_non_temporal_memcpy.cc.o (patched)
├── objdump --line-numbers --disassemble --demangle --reloc --no-show-raw-insn --section=.text {}
│ @@ -255,15 +255,15 @@
│       add     x2, x21, x2
│       mov     x0, x21
│       ldp     q31, q30, [x0, #32]
│       add     x1, x1, #0x40
│       ldp     q29, q28, [x0], #64
│       stp     q31, q30, [x1, #-32]
│       stp     q29, q28, [x1, #-64]
│ -     cmp     x0, x2
│ +     cmp     x2, x0
│       b.ne    3b0 <absl::crc_internal::CrcNonTemporalMemcpyEngine::Compute(void*, void const*, unsigned long, absl::crc32c_t) const+0x270>  // b.any
│       and     x0, x3, #0xffffffffffffffc0
│       and     x23, x23, #0x3f
│       dmb     ish
│       add     x22, x22, x0
│       add     x21, x21, x0
│       b       380 <absl::crc_internal::CrcNonTemporalMemcpyEngine::Compute(void*, void const*, unsigned long, absl::crc32c_t) const+0x240>
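
For context, here is a minimal sketch of the matched-width pattern on aarch64 NEON. It is illustrative only, and the names (vec128, load_u128, store_u128) are hypothetical rather than the actual Abseil code; the point is simply that both the load and the store interpret the 128 bits as two 64-bit elements, so the in-register lane layout agrees on both sides regardless of endianness.

    #include <arm_neon.h>
    #include <stdint.h>

    typedef int64x2_t vec128;  /* stand-in for the emulated __m128i type */

    /* Load 16 bytes as two 64-bit elements (counterpart of _mm_loadu_si128). */
    static inline vec128 load_u128(const void *p) {
      return vld1q_s64((const int64_t *)p);
    }

    /* Store 16 bytes as two 64-bit elements (counterpart of _mm_stream_si128). */
    static inline void store_u128(void *p, vec128 v) {
      vst1q_s64((int64_t *)p, v);
    }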

On big-endian Arm (aarch64_be), this fixes a bug in non_temporal_store_memcpy in which the two 32-bit halves of each 64-bit parcel of memory were swapped during the copy. For example, the byte sequence 218edf0b 13c68753 would be copied as 13c68753 218edf0b (see the illustration below).
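
To illustrate the endianness hazard (this is an assumption about the pre-patch pattern, not the exact Abseil code): loading a 16-byte block as four 32-bit elements and storing it as two 64-bit elements is a byte-for-byte copy on little-endian, but on big-endian it exchanges the two 32-bit halves of each 64-bit parcel, because byte order within a lane and lane numbering interact differently for the two element widths.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Hypothetical mismatched-width copy of 16 bytes: 32-bit element load,
     * 64-bit element store. Correct on little-endian aarch64, but on
     * aarch64_be it swaps the 32-bit halves of each 64-bit parcel, e.g.
     * 218edf0b 13c68753 comes out as 13c68753 218edf0b. */
    static inline void mismatched_copy16(void *dst, const void *src) {
      uint32x4_t v = vld1q_u32((const uint32_t *)src);
      vst1q_u64((uint64_t *)dst, vreinterpretq_u64_u32(v));
    }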
