[AARCH64] Incorrect codegen for some _lane_ intrinsics with O1,O2,O3 on big-endian aarch64 #166190

@CrooseGit

Description

Instructions for reproduction:

Code:

#include <arm_neon.h>
#include <arm_fp16.h>

float16x4_t run_vcmla_lane_f16(float16_t *r_vals, float16_t *a_vals, float16_t *b_vals) {
    const int32_t lane_val = 0;
    float16x4_t r_val = vld1_f16(r_vals);
    float16x4_t a_val = vld1_f16(a_vals);
    float16x4_t b_val = vld1_f16(b_vals);
    return vcmla_lane_f16(r_val, a_val, b_val, lane_val);
}

Then

Compile with clang for aarch64_be-unknown-linux-gnu at -O2, then again without optimisations (-O0), and compare the generated assembly.

ASM/Output

With optimisations (incorrect)

[...]
ld1	{ v0.4h }, [x2]
ld1	{ v1.4h }, [x0]
ld1	{ v2.4h }, [x1]
rev32	v0.8h, v0.8h
fcmla	v1.4h, v2.4h, v0.h[0], #0
[...]

Without optimisations (correct)

[...]
ld1	{ v0.4h }, [x10]
ld1	{ v1.4h }, [x9]
ld1	{ v2.4h }, [x8]
fcmla	v0.4h, v1.4h, v2.4h, #0
[...]

Difference

The optimised build emits an extra rev32 on the lane operand (v0) before the indexed fcmla. The unoptimised build has no such instruction, and the extra rev32 makes the optimised output incorrect.
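To decide which build is right independently of either codegen path, the result can be compared against a scalar model. The sketch below is my reading of the rotation-0 FCMLA by-element behaviour, not code from this report: `ref_vcmla_lane_rot0` is a hypothetical helper name, `float` stands in for `float16_t`, and a plain multiply-add replaces the fused one, so the last bit may differ from hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of vcmla_lane_f16 (rotation 0).
 * Vectors hold interleaved complex values: {re0, im0, re1, im1}.
 * `lane` selects one complex pair of b, so it must be 0 or 1.
 * Assumption: float replaces float16_t and the multiply-add is not fused. */
void ref_vcmla_lane_rot0(float out[4], const float r[4],
                         const float a[4], const float b[4], int lane) {
    assert(lane >= 0 && lane < 2);
    const float b_re = b[2 * lane];
    const float b_im = b[2 * lane + 1];
    for (size_t i = 0; i < 2; ++i) {
        /* Rotation 0 uses only the real part of each element of a. */
        const float a_re = a[2 * i];
        out[2 * i]     = r[2 * i]     + a_re * b_re;
        out[2 * i + 1] = r[2 * i + 1] + a_re * b_im;
    }
}
```

Feeding the same inputs to this model and to `run_vcmla_lane_f16` from both builds should show which one deviates, provided the inputs are chosen so that every intermediate value is exactly representable in half precision.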

Suspected other faulty intrinsics when optimisations are enabled on big-endian

vcmla_lane_f16
vcmla_laneq_f16
vcmla_rot180_lane_f16
vcmla_rot180_laneq_f16
vcmla_rot270_lane_f16
vcmla_rot270_laneq_f16
vcmla_rot90_lane_f16
vcmla_rot90_laneq_f16
vcmlaq_lane_f16
vcmlaq_laneq_f16
vcmlaq_laneq_f32
vcmlaq_rot180_lane_f16
vcmlaq_rot180_laneq_f16
vcmlaq_rot180_laneq_f32
vcmlaq_rot270_lane_f16
vcmlaq_rot270_laneq_f16
vcmlaq_rot270_laneq_f32
vcmlaq_rot90_lane_f16
vcmlaq_rot90_laneq_f16
vcmlaq_rot90_laneq_f32
vdot_lane_s32
vdot_lane_u32
vdot_laneq_s32
vdot_laneq_u32
vdotq_lane_s32
vdotq_lane_u32
vdotq_laneq_s32
vdotq_laneq_u32
vsudot_lane_s32
vsudot_laneq_s32
vsudotq_lane_s32
vsudotq_laneq_s32
vusdot_lane_s32
vusdot_laneq_s32
vusdotq_lane_s32
vusdotq_laneq_s32
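For the dot-product entries in this list, a similar scalar model may help confirm whether a given intrinsic is affected. This is a minimal sketch for `vdot_lane_s32` only; `ref_vdot_lane_s32` is a hypothetical helper name, and the code assumes the accumulation does not overflow `int32_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of vdot_lane_s32.
 * r and out have 2 int32 lanes; a and b have 8 int8 elements.
 * `lane` (0 or 1) selects one group of 4 bytes from b, which is
 * multiplied against each group of 4 bytes in a.
 * Assumption: the sums stay within int32_t range. */
void ref_vdot_lane_s32(int32_t out[2], const int32_t r[2],
                       const int8_t a[8], const int8_t b[8], int lane) {
    assert(lane >= 0 && lane < 2);
    for (int i = 0; i < 2; ++i) {
        int32_t acc = r[i];
        for (int j = 0; j < 4; ++j) {
            acc += (int32_t)a[4 * i + j] * (int32_t)b[4 * lane + j];
        }
        out[i] = acc;
    }
}
```

Using distinct values in each group of `b` makes a wrong lane selection or a reversed element order visible in the result.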

Edit: More minimal reproducer
