Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full SME(1) instruction support and STREAMING Groups #415

Open
wants to merge 44 commits into
base: dev
Choose a base branch
from

Conversation

FinnWilkinson
Copy link
Contributor

@FinnWilkinson FinnWilkinson commented Jun 12, 2024

This PR implements all available SME (version 1) instructions that are contained within LLVM 14.0.5. Specifically, this is Version 2021-06 of the Armv9-A A64 ISA.

No FP16 or BF16 instructions have been supported due to lacking C++17 types. All Quad-Word instruction variants have been emulated using 64-bit data-types.

In addition to this, new STREAMING_SVE and STREAMING_PREDICATE groups have been introduced (along with corresponding decode logic) to allow for a different pipeline / latency configuration for these instructions when SVE Streaming Mode (the context mode which SME instructions are executed in) is enabled. This can allow for a co-processor style implementation of SME to be implemented within SimEng; with additional latency / reduced throughput being configured to mimic an offload penalty, and different execution or LD/STR hardware being modelled for said co-processor compared to the main core.

  • Add STREAMING Group support
  • Add execution logic and regression tests for all missing SME instructions

@FinnWilkinson FinnWilkinson added enhancement New feature or request 0.9.7 Part of SimEng Release 0.9.7 labels Jun 12, 2024
@FinnWilkinson FinnWilkinson self-assigned this Jun 12, 2024
@FinnWilkinson
Copy link
Contributor Author

#rerun tests

bool isStreamingModeEnabled() const;

/** Returns if the SME ZA Register is enabled. */
bool isZA_RegisterEnabled() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the underscore in ZA_Register needed/canonical? Looks a bit weird jumping between camel and snake cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

bool Architecture::isStreamingModeEnabled() const { return SVCRval_ & 1; }

// 1st bit of SVCR register determines if ZA register is enabled.
bool Architecture::isZA_RegisterEnabled() const { return SVCRval_ & 2; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about underscore

const uint16_t STORE_SME = 85;
const uint16_t ALL = 86;
const uint16_t NONE = 87;
const uint16_t STREAMING_SVE = 66;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I looked at this file in isolation, I assumed that the ordering of these was arbitrary. However, it seems to be important for how instruction.cc works. So I'd be in favour of adding a comment here to explain that the ordering of these instructions is important and can't be changed without looking at instruction.cc.

@@ -91,6 +91,19 @@ span<const memory::MemoryAccessTarget> Instruction::generateAddresses() {
setMemoryAddresses({{sourceValues_[2].get<uint64_t>(), 8}});
break;
}
case Opcode::AArch64_LD1_MXIPXX_V_B: // ld1b {zatv.b[ws, #imm]}, pg/z,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using "[[fallthrough]];" here as well? Apply same to other non-attributed fallthroughs below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally I think it's fine but it provides a better structure and possibly improved readability so I'd say we should use fallthroughs

uint64_t out[32] = {0};
std::memcpy(out, zaRow, rowCount * sizeof(uint64_t));
// Slice element is active IFF:
// - Element in 1st source pred corresponding to horiz. slice is TRUE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an AND between these two statements

CHECK_MAT_COL(ARM64_REG_ZAS3, i, uint32_t,
fillNeon<uint32_t>(inter32, (SVL / 8)));
} else {
// Even cols, all elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Odd cols?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several occurrences of the possible same issue throughout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, throughout the test file some SME tests use the same predicate patterns. Two predicates are always used: p0 and p1.

  • p0 is always set to all true using ptrue p0.d (for example)
  • p1 is always set to the pattern {ON, OFF, ON, OFF...} using zip1 p1.s, p0.s, p1.s (for example)

Hence, when using p1 as the predicate the rows, columns, or individual elements per row/col updated will always be even ones. This should also be reflected in the test initialisation with inter32[i] = (i % 2 == 0) ? i : 65; where only the even vector elements are set to increasing values, and odd values to the test-default of 66.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what does i represent here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i is the row or column element index (i.e. from something like int i = 0; i < sliceElements; i++ with sliceElements sometimes fixed to a constant representing the number of elements for the max SVL of 2048-bits) depending on if the Horizontal or Vertical instruction variant is being used.

So in the above, if i is even (index 0, 2, 4 etc) then the row/column element is set to i to reflect the use of index z1.s, #0, #1 in the test. If i is odd then the value remains unchanged; for this test a value of 65.

@FinnWilkinson FinnWilkinson changed the title [WIP] Full SME(1) instruction support and STREAMING Groups Full SME(1) instruction support and STREAMING Groups Sep 2, 2024
Copy link
Contributor

@jj16791 jj16791 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments and I agree with several of Alex's comments. I think it would be good to get the ARM SME/SVE loops as part of our functional verification checks to help test these new instructions. I assume it would have to be done somewhere private though (not sure if we already have that guarantee in the upcoming CI/CD pipelines)?

CMakeLists.txt Outdated Show resolved Hide resolved
configs/a64fx_SME.yaml Outdated Show resolved Hide resolved
docs/sphinx/assets/instruction_groups_AArch64.png Outdated Show resolved Hide resolved
@@ -91,6 +91,19 @@ span<const memory::MemoryAccessTarget> Instruction::generateAddresses() {
setMemoryAddresses({{sourceValues_[2].get<uint64_t>(), 8}});
break;
}
case Opcode::AArch64_LD1_MXIPXX_V_B: // ld1b {zatv.b[ws, #imm]}, pg/z,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally I think it's fine but it provides a better structure and possibly improved readability so I'd say we should use fallthroughs

src/lib/arch/aarch64/Instruction.cc Show resolved Hide resolved
@@ -188,6 +188,20 @@ uint8_t Architecture::predecode(const uint8_t* ptr, uint16_t bytesAvailable,
newInsn.setExecutionInfo(getExecutionInfo(newInsn));
// Cache the instruction
iter = decodeCache_.insert({insn, newInsn}).first;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no data on this but doing this process for every AArch64 instruction may have a detrimental effect on performance. Will need to run through the new CI/CD when it's ready to determine this. Would argue that such a change shouldn't be merged until we know there's no significant performance regression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non Predicate / SVE instructions the overhead is 3 if statements and a function call which should be minor. But yes, a performance regression test for this would be good.

Not sure on what an alternative solution could be though if the performance impact is significant...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this comment unresolved so we remember to look into this

src/lib/arch/aarch64/Instruction_decode.cc Outdated Show resolved Hide resolved
@@ -568,9 +568,14 @@ RegisterValue vecUMaxP(srcValContainer& sourceValues) {
const T* n = sourceValues[0].getAsVector<T>();
const T* m = sourceValues[1].getAsVector<T>();

// Concatenate the vectors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you double-checked the ordering of the concatenation? Ran it on ookami and I think these may be the wrong way round but worth double checking in case I've made a mistake

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the spec:

This instruction creates a vector by concatenating the vector elements of the first source SIMD&FP register after the vector elements of the second source SIMD&FP register...

i.e. N is concatonated onto the end of M (M:N)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with "what the spec says" vs "observed values", the latter should probably be taken as the truth. So it's worth someone else double-checking that the values I've observed do go against what the spec says

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very odd and confusing... I've also checked on Ookami and Isambard-AI with the following asm programme:

        movi v0.16b, #0
	movi v1.16b, #1
	movi v2.16b, #2

	umaxp v0.16b, v1.16b, v2.16b

	mov w12, v0.s[0]
	mov w13, v0.s[3]

Which after executing yields the following:

  • v0.b = {1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2}
  • v0.s = {16843009, 16843009, 33686018, 33686018}

Which means the concatonation is v1:v2, NOT v2:v1.
I double checked that gdb doesn't display vector registers "in reverse" (i.e. left-hand most element is in fact v0[0] and not v0[15]) using the final two instructions. Their results were:

  • w12 = 16843009
  • w13 = 33686018

So yes, on hardware the concatonation is seemingly vn:vm.

However, the spec and its pseudo code for UMAXP doesn't align with this... From this page, the pseudo code is as follows:

CheckFPAdvSIMDEnabled64();
constant bits(datasize) operand1 = V[n, datasize];
constant bits(datasize) operand2 = V[m, datasize];
bits(datasize) result;
constant bits(2*datasize) concat = operand2:operand1;
integer element1;
integer element2;
integer max;

for e = 0 to elements-1
    element1 = UInt(Elem[concat, 2*e, esize]);
    element2 = UInt(Elem[concat, (2*e)+1, esize]);
    max = Max(element1, element2);
    Elem[result, e, esize] = max<esize-1:0>;

V[d, datasize] = result;

Where it is clear that the concatonation according to this is vm:vn....

In this instance, we should probably go with hardware. But it is quite annoying that the spec doesn't align with hardware on this, and that updating our code in-line with the spec still fixed the issue that was occuring!

src/lib/arch/aarch64/Instruction_execute.cc Outdated Show resolved Hide resolved
CHECK_MAT_COL(ARM64_REG_ZAS3, i, uint32_t,
fillNeon<uint32_t>(inter32, (SVL / 8)));
} else {
// Even cols, all elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several occurrences of the possible same issue throughout

Copy link
Contributor

@dANW34V3R dANW34V3R left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't finished the review but posting comments to prevent overlaps

CMakeLists.txt Outdated
Comment on lines 75 to 76
set(SIMENG_COMPILE_OPTIONS -Wall -pedantic) #-Wextra

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was causing the issue such that this needed to be removed. Should be fixed and Werror reinstated before we merge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool isStreamingModeEnabled() const;

/** Returns if the SME ZA Register is enabled. */
bool isZA_RegisterEnabled() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

src/lib/arch/aarch64/Instruction_decode.cc Show resolved Hide resolved
@@ -1,4 +1,5 @@
#pragma once
#include <cstdint>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this include needed? Either way I don't think iostream is needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see issue #426

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iostream removed though

Comment on lines 65 to 70
- STORE_DATA_INT
- STORE_DATA_SCALAR
- STORE_DATA_VECTOR
- STORE_DATA_SVE
- STORE_DATA_STREAMING_SVE
- STORE_DATA_SME
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't properly looked through all the groups but can this block be changed simply to STORE_DATA. Same below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this port not do STORE_FP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was an ACM summer school change to make it clearer for new users. Will revert / make simpler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also STORE_FP doesn't exist

Copy link
Contributor

@dANW34V3R dANW34V3R left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOOTS of new instructions, well done for grinding through them. Bring on SAIL

* this instruction was first decoded, and updates the instruction group
* accordingly if required.
* Returns TRUE if the group was updated, FALSE otherwise. */
bool checkStreamingGroup();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the word "check" implies that there won't be state change but there is. Something along the lines of checkStreamingGroupAndUpdate may be more descriptive

Comment on lines +318 to +338
// Identify subgroup type
if (isInstruction(InsnType::isBranch))
group = InstructionGroups::BRANCH;
else if (isInstruction(InsnType::isLoad))
group += 8;
else if (isInstruction(InsnType::isStore))
group += 9;
else if (isInstruction(InsnType::isDivide))
group += 7;
else if (isInstruction(InsnType::isMultiply))
group += 6;
else if (isInstruction(InsnType::isShift) ||
isInstruction(InsnType::isConvert))
group += 5;
else if (isInstruction(InsnType::isLogical))
group += 4;
else if (isInstruction(InsnType::isCompare))
group += 3;
else
group += 2; // Default return is {Data type}_SIMPLE_ARTH

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a switch statement on instructionIdentifier not be viable? Would be slightly nicer to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't this a switch would work here, or would decrease readability if it did. instructionIdentifier is just an uint32_t and so the associated value used in the switch has no meaning. With the current implementation the if clause contains what is being checked against (i.e. isMultiply)

Comment on lines +7417 to +7422
}
results_[row] = {outRow, 256};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misremembering but I think we said we would stop using this implicit registerValue initialisation in favour of an explicit one. If that is the case this should be updated throughout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. results_[row] = RegisterValue(outRow, 256)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 158 instances of using {...} to initialise a RegisterValue. Possibly more if they contain a line break due to a long-names helper function being used.

If this is something we want to move away from then I think it should be a seperate PR, and the {...} constructor should be prohibited for this class (if possible)

memoryData_[index] =
RegisterValue((char*)mdata.data(), md_size * 4);
md_size = 0;
RegisterValue((char*)memData.data(), memData.size() * 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof(uint32_t) might be better than 4 but not that important. Applied throughout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, if this is something we want to enforce then it should be its own PR as this occurs extremely often in this file. An instruction's opcode defines the data type used so we know what the multiplicand should be

const uint64_t* tileSlice =
sourceValues_[sliceNum].getAsVector<uint64_t>();
memoryData_ = sve_merge_store_data<uint64_t>(tileSlice, pg, VL_bits);
std::vector<uint64_t> memData;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you profiled the use of std::vector throughout. It might not make too much difference but "the internet" is saying they are always initialised (which might not be desirable for performance). Do you NEED 0 initialised memory? Do you need dynamically sized arrays? Or could this be done with a std::array or normal array? CPP reference also says reallocations (when resizing) could be costly, reserve() could be useful it if is a problem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the vector length is unknown before simulation, and this instruction is predicated, to me the sensible and reasonable choice is a dynamic data structure such as a vector.

Comment on lines 3566 to 3572
uint16_t* row =
const_cast<uint16_t*>(sourceValues_[i].getAsVector<uint16_t>());
uint64_t shifted_active = 1ull << ((i % 32) * 2);
if (pg[i / 32] & shifted_active) {
row[sliceNum] = data[i];
} else {
out[i] = 0;
row[sliceNum] = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assignment through the pointer to the underlying memory of the registerValue technically breaks our rule (as per the comment) that a registerValue is immutable. This will probably need to be refactored when the reinterpret_cast issue is being resolved with the solution being similar to the implementations below e.g. populate a std::vector and creating in new registerValue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing, will update this to keep RegVals immutable

…uctions and aliases and regression tests (B, H, S, D)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.9.7 Part of SimEng Release 0.9.7 enhancement New feature or request
Projects
Status: Changes Requested
Development

Successfully merging this pull request may close these issues.

4 participants