Skip to content

[VPlan] Compute interleave count for VPlan. #149702

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
Original file line number Diff line number Diff line change
Expand Up @@ -487,6 +487,9 @@ class LoopVectorizationPlanner {
/// all profitable VFs in ProfitableVFs.
VectorizationFactor computeBestVF();

unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
InstructionCost LoopCost);

/// Generate the IR code for the vectorized loop captured in VPlan \p BestPlan
/// according to the best selected \p VF and \p UF.
///
Expand Down
109 changes: 72 additions & 37 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -955,13 +955,6 @@ class LoopVectorizationCostModel {
/// 64 bit loop indices.
std::pair<unsigned, unsigned> getSmallestAndWidestTypes();

/// \return The desired interleave count.
/// If interleave count has been specified by metadata it will be returned.
/// Otherwise, the interleave count is computed and returned. VF and LoopCost
/// are the selected vectorization factor and the cost of the selected VF.
unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
InstructionCost LoopCost);

/// Memory access instruction may be vectorized in more than one way.
/// Form of instruction after vectorization depends on cost.
/// This function takes cost-based decisions for Load/Store instructions
Expand Down Expand Up @@ -4611,8 +4604,8 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
}

unsigned
LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
InstructionCost LoopCost) {
LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
InstructionCost LoopCost) {
// -- The interleave heuristics --
// We interleave the loop in order to expose ILP and reduce the loop overhead.
// There are many micro-architectural considerations that we can't predict
Expand All @@ -4627,11 +4620,11 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
// 3. We don't interleave if we think that we will spill registers to memory
// due to the increased register pressure.

if (!isScalarEpilogueAllowed())
if (!CM.isScalarEpilogueAllowed())
return 1;

// Do not interleave if EVL is preferred and no User IC is specified.
if (foldTailWithEVL()) {
if (any_of(Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
IsaPred<VPEVLBasedIVPHIRecipe>)) {
Comment on lines +4626 to +4627
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not CM.foldTailWithEVL()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the information when possible is more accruate, as it encodes the materialized decisions and it removes uses of the legacy cost model later in the transform stages.

LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. "
"Unroll factor forced to be 1.\n");
return 1;
Expand All @@ -4644,15 +4637,20 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
// We don't attempt to perform interleaving for loops with uncountable early
// exits because the VPInstruction::AnyOf code cannot currently handle
// multiple parts.
if (Legal->hasUncountableEarlyExit())
if (Plan.hasEarlyExit())
return 1;

const bool HasReductions = !Legal->getReductionVars().empty();
const bool HasReductions =
any_of(Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
IsaPred<VPReductionPHIRecipe>);

// If we did not calculate the cost for VF (because the user selected the VF)
// then we calculate the cost of VF here.
if (LoopCost == 0) {
LoopCost = expectedCost(VF);
if (VF.isScalar())
LoopCost = CM.expectedCost(VF);
else
LoopCost = cost(Plan, VF);
assert(LoopCost.isValid() && "Expected to have chosen a VF with valid cost");

// Loop body is free and there is no need for interleaving.
Expand All @@ -4661,7 +4659,7 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
}

VPRegisterUsage R =
calculateRegisterUsageForPlan(Plan, {VF}, TTI, ValuesToIgnore)[0];
calculateRegisterUsageForPlan(Plan, {VF}, TTI, CM.ValuesToIgnore)[0];
// We divide by these constants so assume that we have at least one
// instruction that uses at least one register.
for (auto &Pair : R.MaxLocalUsers) {
Expand Down Expand Up @@ -4722,21 +4720,21 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor;
}

unsigned EstimatedVF = getEstimatedRuntimeVF(VF, VScaleForTuning);
unsigned EstimatedVF = getEstimatedRuntimeVF(VF, CM.getVScaleForTuning());

// Try to get the exact trip count, or an estimate based on profiling data or
// ConstantMax from PSE, failing that.
if (auto BestKnownTC = getSmallBestKnownTC(PSE, TheLoop)) {
if (auto BestKnownTC = getSmallBestKnownTC(PSE, OrigLoop)) {
// At least one iteration must be scalar when this constraint holds. So the
// maximum available iterations for interleaving is one less.
unsigned AvailableTC = requiresScalarEpilogue(VF.isVector())
unsigned AvailableTC = CM.requiresScalarEpilogue(VF.isVector())
? BestKnownTC->getFixedValue() - 1
: BestKnownTC->getFixedValue();

unsigned InterleaveCountLB = bit_floor(std::max(
1u, std::min(AvailableTC / (EstimatedVF * 2), MaxInterleaveCount)));

if (getSmallConstantTripCount(PSE.getSE(), TheLoop).isNonZero()) {
if (getSmallConstantTripCount(PSE.getSE(), OrigLoop).isNonZero()) {
// If the best known trip count is exact, we select between two
// prospective ICs, where
//
Expand Down Expand Up @@ -4797,7 +4795,7 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
// vectorized the loop we will have done the runtime check and so interleaving
// won't require further checks.
bool ScalarInterleavingRequiresPredication =
(VF.isScalar() && any_of(TheLoop->blocks(), [this](BasicBlock *BB) {
(VF.isScalar() && any_of(OrigLoop->blocks(), [this](BasicBlock *BB) {
return Legal->blockNeedsPredication(BB);
}));
bool ScalarInterleavingRequiresRuntimePointerCheck =
Expand All @@ -4820,8 +4818,39 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,

// Interleave until store/load ports (estimated by max interleave count) are
// saturated.
unsigned NumStores = Legal->getNumStores();
unsigned NumLoads = Legal->getNumLoads();
unsigned NumStores = 0;
unsigned NumLoads = 0;
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
vp_depth_first_deep(Plan.getVectorLoopRegion()->getEntry()))) {
for (VPRecipeBase &R : *VPBB) {
if (isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe>(&R)) {
NumLoads++;
continue;
}
if (isa<VPWidenStoreRecipe, VPWidenStoreEVLRecipe>(&R)) {
NumStores++;
continue;
}

if (auto *InterleaveR = dyn_cast<VPInterleaveRecipe>(&R)) {
if (unsigned StoreOps = InterleaveR->getNumStoreOperands())
NumStores += StoreOps;
else
NumLoads += InterleaveR->getNumDefinedValues();
continue;
}
if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R)) {
NumLoads += isa<LoadInst>(RepR->getUnderlyingInstr());
NumStores += isa<StoreInst>(RepR->getUnderlyingInstr());
continue;
}
if (isa<VPHistogramRecipe>(&R)) {
NumLoads++;
NumStores++;
continue;
}
}
}
unsigned StoresIC = IC / (NumStores ? NumStores : 1);
unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1);

Expand All @@ -4831,12 +4860,15 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
// do the final reduction after the loop.
bool HasSelectCmpReductions =
HasReductions &&
any_of(Legal->getReductionVars(), [&](auto &Reduction) -> bool {
const RecurrenceDescriptor &RdxDesc = Reduction.second;
RecurKind RK = RdxDesc.getRecurrenceKind();
return RecurrenceDescriptor::isAnyOfRecurrenceKind(RK) ||
RecurrenceDescriptor::isFindIVRecurrenceKind(RK);
});
any_of(Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
[](VPRecipeBase &R) {
auto *RedR = dyn_cast<VPReductionPHIRecipe>(&R);

return RedR && (RecurrenceDescriptor::isAnyOfRecurrenceKind(
RedR->getRecurrenceKind()) ||
RecurrenceDescriptor::isFindIVRecurrenceKind(
RedR->getRecurrenceKind()));
});
if (HasSelectCmpReductions) {
LLVM_DEBUG(dbgs() << "LV: Not interleaving select-cmp reductions.\n");
return 1;
Expand All @@ -4847,12 +4879,14 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
// we're interleaving is inside another loop. For tree-wise reductions
// set the limit to 2, and for ordered reductions it's best to disable
// interleaving entirely.
if (HasReductions && TheLoop->getLoopDepth() > 1) {
if (HasReductions && OrigLoop->getLoopDepth() > 1) {
bool HasOrderedReductions =
any_of(Legal->getReductionVars(), [&](auto &Reduction) -> bool {
const RecurrenceDescriptor &RdxDesc = Reduction.second;
return RdxDesc.isOrdered();
});
any_of(Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
[](VPRecipeBase &R) {
auto *RedR = dyn_cast<VPReductionPHIRecipe>(&R);

return RedR && RedR->isOrdered();
});
if (HasOrderedReductions) {
LLVM_DEBUG(
dbgs() << "LV: Not interleaving scalar ordered reductions.\n");
Expand Down Expand Up @@ -10071,8 +10105,11 @@ bool LoopVectorizePass::processLoop(Loop *L) {

GeneratedRTChecks Checks(PSE, DT, LI, TTI, F->getDataLayout(), CM.CostKind);
if (LVP.hasPlanWithVF(VF.Width)) {
VPCostContext CostCtx(CM.TTI, *CM.TLI, CM.Legal->getWidestInductionType(),
CM, CM.CostKind);

// Select the interleave count.
IC = CM.selectInterleaveCount(LVP.getPlanFor(VF.Width), VF.Width, VF.Cost);
IC = LVP.selectInterleaveCount(LVP.getPlanFor(VF.Width), VF.Width, VF.Cost);

unsigned SelectedIC = std::max(IC, UserIC);
// Optimistically generate runtime checks if they are needed. Drop them if
Expand All @@ -10083,8 +10120,6 @@ bool LoopVectorizePass::processLoop(Loop *L) {
// Check if it is profitable to vectorize with runtime checks.
bool ForceVectorization =
Hints.getForce() == LoopVectorizeHints::FK_Enabled;
VPCostContext CostCtx(CM.TTI, *CM.TLI, CM.Legal->getWidestInductionType(),
CM, CM.CostKind);
if (!ForceVectorization &&
!isOutsideLoopWorkProfitable(Checks, VF, L, PSE, CostCtx,
LVP.getPlanFor(VF.Width), SEL,
Expand Down
5 changes: 4 additions & 1 deletion llvm/lib/Transforms/Vectorize/VPlan.h
Original file line number Diff line number Diff line change
Expand Up @@ -4213,7 +4213,10 @@ class VPlan {
/// block with multiple predecessors (one for the exit via the latch and one
/// via the other early exit).
bool hasEarlyExit() const {
return ExitBlocks.size() > 1 ||
return count_if(ExitBlocks,
[](VPIRBasicBlock *EB) {
return EB->getNumPredecessors() != 0;
}) > 1 ||
(ExitBlocks.size() == 1 && ExitBlocks[0]->getNumPredecessors() > 1);
}

Expand Down
27 changes: 14 additions & 13 deletions llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ target triple = "aarch64--linux-gnu"
; (udiv(2) + extractelement(8) + insertelement(4)) / 2 = 7
;
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
; CHECK: Found an estimated cost of 7 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp2, %tmp3
;
define i32 @predicated_udiv(ptr %a, ptr %b, i1 %c, i64 %n) {
entry:
Expand Down Expand Up @@ -60,7 +60,7 @@ for.end:
; (store(4) + extractelement(4)) / 2 = 4
;
; CHECK: Scalarizing and predicating: store i32 %tmp2, ptr %tmp0, align 4
; CHECK: Found an estimated cost of 4 for VF 2 For instruction: store i32 %tmp2, ptr %tmp0, align 4
; CHECK: Cost of 4 for VF 2: profitable to scalarize store i32 %tmp2, ptr %tmp0, align 4
;
define void @predicated_store(ptr %a, i1 %c, i32 %x, i64 %n) {
entry:
Expand Down Expand Up @@ -93,8 +93,8 @@ for.end:
; CHECK: Found scalar instruction: %addr = phi ptr [ %a, %entry ], [ %addr.next, %for.inc ]
; CHECK: Found scalar instruction: %addr.next = getelementptr inbounds i32, ptr %addr, i64 1
; CHECK: Scalarizing and predicating: store i32 %tmp2, ptr %addr, align 4
; CHECK: Found an estimated cost of 0 for VF 2 For instruction: %addr = phi ptr [ %a, %entry ], [ %addr.next, %for.inc ]
; CHECK: Found an estimated cost of 4 for VF 2 For instruction: store i32 %tmp2, ptr %addr, align 4
; CHECK: Cost of 0 for VF 2: induction instruction %addr = phi ptr [ %a, %entry ], [ %addr.next, %for.inc ]
; CHECK: Cost of 4 for VF 2: profitable to scalarize store i32 %tmp2, ptr %addr, align 4
;
define void @predicated_store_phi(ptr %a, i1 %c, i32 %x, i64 %n) {
entry:
Expand Down Expand Up @@ -135,9 +135,10 @@ for.end:
;
; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
; CHECK: Found an estimated cost of 3 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
; CHECK: Cost of 3 for VF 2: profitable to scalarize %tmp3 = add nsw i32 %tmp2, %x
; CHECK: Cost of 5 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp2, %tmp3
;

define i32 @predicated_udiv_scalarized_operand(ptr %a, i1 %c, i32 %x, i64 %n) {
entry:
br label %for.body
Expand Down Expand Up @@ -180,8 +181,8 @@ for.end:
;
; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x
; CHECK: Scalarizing and predicating: store i32 %tmp2, ptr %tmp0, align 4
; CHECK: Found an estimated cost of 3 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, ptr %tmp0, align 4
; CHECK: Cost of 3 for VF 2: profitable to scalarize %tmp2 = add nsw i32 %tmp1, %x
; CHECK: Cost of 2 for VF 2: profitable to scalarize store i32 %tmp2, ptr %tmp0, align 4
;
define void @predicated_store_scalarized_operand(ptr %a, i1 %c, i32 %x, i64 %n) {
entry:
Expand Down Expand Up @@ -232,11 +233,11 @@ for.end:
; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
; CHECK: Scalarizing and predicating: store i32 %tmp5, ptr %tmp0, align 4
; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
; CHECK: Found an estimated cost of 7 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
; CHECK: Found an estimated cost of 7 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
; CHECK: Found an estimated cost of 3 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, ptr %tmp0, align 4
; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp4 = udiv i32 %tmp3, %tmp2
; CHECK: Cost of 7 for VF 2: profitable to scalarize %tmp3 = sdiv i32 %tmp1, %tmp2
; CHECK: Cost of 2 for VF 2: profitable to scalarize store i32 %tmp5, ptr %tmp0, align 4
; CHECK: Cost of 3 for VF 2: profitable to scalarize %tmp5 = sub i32 %tmp4, %x
; CHECK: Cost of 1 for VF 2: WIDEN ir<%tmp2> = add ir<%tmp1>, ir<%x>
;
define void @predication_multi_context(ptr %a, i1 %c, i32 %x, i64 %n) {
entry:
Expand Down
Loading
Loading