Description
Instructions for reproduction:
Code:
#include <arm_neon.h>
#include <arm_fp16.h>
float16x4_t run_vcmla_lane_f16(float16_t *r_vals, float16_t *a_vals, float16_t *b_vals) {
    const int32_t lane_val = 0;
    float16x4_t r_val = vld1_f16(r_vals);
    float16x4_t a_val = vld1_f16(a_vals);
    float16x4_t b_val = vld1_f16(b_vals);
    return vcmla_lane_f16(r_val, a_val, b_val, lane_val);
}
Then compile with clang for aarch64_be-unknown-linux-gnu at O2, compile again without optimisations (debug), and compare the generated assembly.
ASM/Output
With optimisations (incorrect)
[...]
ld1 { v0.4h }, [x2]
ld1 { v1.4h }, [x0]
ld1 { v2.4h }, [x1]
rev32 v0.8h, v0.8h
fcmla v1.4h, v2.4h, v0.h[0], #0
[...]
Without optimisations (correct)
[...]
ld1 { v0.4h }, [x10]
ld1 { v1.4h }, [x9]
ld1 { v2.4h }, [x8]
fcmla v0.4h, v1.4h, v2.4h, #0
[...]
Difference
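With optimisations enabled, an extra rev32 is applied to v0 (the vector loaded from b_vals, which supplies the lane operand) before the fcmla. The unoptimised build has no such instruction, and the extra element swap makes the optimised output incorrect.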
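To check which of the two builds is right on real inputs, it can help to compare against a scalar model. The following is only a sketch: `ref_cmla_rot0_lane` is a hypothetical helper name, it uses `float` instead of `float16_t` so it builds on any host, and it assumes the rotation-0 by-lane behaviour (each complex pair of `r` accumulates the real part of the matching pair of `a` times the real and imaginary parts of the selected pair of `b`). Half-precision rounding and fused multiply-add behaviour are not modelled, so compare with small exactly representable values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of vcmla_lane_f16 (rotation #0) on a 4-element
   vector holding two complex pairs laid out as {re0, im0, re1, im1}.
   Uses float rather than float16_t; rounding differences are not modelled. */
static void ref_cmla_rot0_lane(float out[4], const float r[4],
                               const float a[4], const float b[4],
                               int lane)
{
    /* A 64-bit vector of four halves holds two complex pairs,
       so the lane index selects pair 0 or pair 1 of b. */
    assert(lane >= 0 && lane < 2);

    const float b_re = b[2 * lane];
    const float b_im = b[2 * lane + 1];

    for (size_t i = 0; i < 2; ++i) {
        /* Rotation #0: only the real part of a contributes. */
        out[2 * i]     = r[2 * i]     + a[2 * i] * b_re;
        out[2 * i + 1] = r[2 * i + 1] + a[2 * i] * b_im;
    }
}
```

Feeding the same `r_vals`, `a_vals` and `b_vals` to this model and to `run_vcmla_lane_f16` from both builds should show which build diverges.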
Suspected other faulty intrinsics when optimisations are enabled on big-endian
vcmla_lane_f16
vcmla_laneq_f16
vcmla_rot180_lane_f16
vcmla_rot180_laneq_f16
vcmla_rot270_lane_f16
vcmla_rot270_laneq_f16
vcmla_rot90_lane_f16
vcmla_rot90_laneq_f16
vcmlaq_lane_f16
vcmlaq_laneq_f16
vcmlaq_laneq_f32
vcmlaq_rot180_lane_f16
vcmlaq_rot180_laneq_f16
vcmlaq_rot180_laneq_f32
vcmlaq_rot270_lane_f16
vcmlaq_rot270_laneq_f16
vcmlaq_rot270_laneq_f32
vcmlaq_rot90_lane_f16
vcmlaq_rot90_laneq_f16
vcmlaq_rot90_laneq_f32
vdot_lane_s32
vdot_lane_u32
vdot_laneq_s32
vdot_laneq_u32
vdotq_lane_s32
vdotq_lane_u32
vdotq_laneq_s32
vdotq_laneq_u32
vsudot_lane_s32
vsudot_laneq_s32
vsudotq_lane_s32
vsudotq_laneq_s32
vusdot_lane_s32
vusdot_laneq_s32
vusdotq_lane_s32
vusdotq_laneq_s32
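For the dot-product intrinsics in the list above, a similar scalar model could be used to confirm whether they are affected. This is a sketch for the signed 64-bit form only, assuming the usual by-lane semantics of `vdot_lane_s32` (each 32-bit result lane accumulates the dot product of four `int8_t` elements of `a` with the group of four elements of `b` chosen by the lane index). `ref_sdot_lane` is a hypothetical name, and the model assumes inputs small enough that the 32-bit accumulator does not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vdot_lane_s32: two 32-bit accumulators,
   eight signed bytes in a, and one group of four signed bytes selected
   from b by the lane index. Assumes no 32-bit overflow for the inputs. */
static void ref_sdot_lane(int32_t out[2], const int32_t r[2],
                          const int8_t a[8], const int8_t b[8],
                          int lane)
{
    /* An int8x8_t holds two groups of four bytes, so lane is 0 or 1. */
    assert(lane >= 0 && lane < 2);

    for (size_t i = 0; i < 2; ++i) {
        int32_t acc = r[i];
        for (size_t j = 0; j < 4; ++j) {
            acc += (int32_t)a[4 * i + j] * (int32_t)b[4 * lane + j];
        }
        out[i] = acc;
    }
}
```

Choosing a `b` whose two groups differ makes a wrong lane or swapped elements show up directly in the result.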
Edit: More minimal reproducer