
Conversation


@skachkov-sc skachkov-sc commented May 20, 2025

RFC link: https://discourse.llvm.org/t/rfc-loop-vectorization-of-compress-store-expand-load-patterns/86442

Patch series:

  1. [IVDescriptors] Implement MonotonicDescriptor #140720
  2. [LoopVectorize][NFC] Refactor widening decision logic #140722
  3. [VPlan] Implement compressed widening of memory instructions #166956
  4. [LoopVectorize] Support vectorization of compressing patterns in VPlan #140723

Additionally, there is a patch with LAA support - #140721. It depends on patch (1) but is not strictly required for LoopVectorize support, so it can be reviewed independently.

Notes
This patch extends VPlan with the following changes:

  1. VPMonotonicPHIRecipe represents a monotonic header phi and contains its descriptor. We allow only uniform uses of it in the loop, e.g. uniform pointer operands of expand loads/compress stores (LoopVectorizationCostModel::collectLoopUniforms is extended with uniformity analysis of monotonic phis).
  2. VPInstruction is extended with a ComputeMonotonicResult operation that is needed to update the monotonic phi value:
ComputeMonotonicResult(Val, Mask, Step) = Val + ctpop(Mask) * Step

There are multiple approaches to generating ctpop(Mask), e.g. casting Mask to an integer type and then using the llvm.ctpop intrinsic. However, we want to support scalable VFs, so the selected generation scheme is llvm.vector.reduce.add.i32(zext(<vscale x N x i1> to <vscale x N x i32>)). For fixed-width vectors, InstCombine simplifies this to llvm.ctpop (https://godbolt.org/z/461z5Pf61); for scalable vectors, targets should recognize this pattern (for RISC-V this was done in #127497).
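
As a minimal IR sketch of this scheme (assuming an i32 monotonic value and N = 4; the names are illustrative, not taken from the patch):

  %mask.ext = zext <vscale x 4 x i1> %mask to <vscale x 4 x i32>
  %popcnt = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %mask.ext)
  ; ComputeMonotonicResult(Val, Mask, Step) = Val + ctpop(Mask) * Step
  %inc = mul i32 %popcnt, %step
  %result = add i32 %val, %inc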

It's also worth noting that we don't support loop interleaving of monotonic patterns. For an interleaved loop, ComputeMonotonicResult would look like this:

Res[0] = Val + ctpop(Mask[0]) * Step
Res[1] = Res[0] + ctpop(Mask[1]) * Step
...
Res[N] = Res[N-1] + ctpop(Mask[N]) * Step

so it would form a dependency chain between the different interleaved parts, which contradicts the purpose of interleaving.

As mentioned before, any uniform use of a monotonic phi is allowed, but we are mostly interested in the case where it is used as the pointer operand of a load/store. Additional checks are required to determine that such a memory access is an expand load/compress store (they are verified in the LoopVectorizationLegality::isCompressedPtr method):

  1. the monotonic value of the pointer has a stride equal to the access size
  2. the memory access is predicated with the same condition as the monotonic value update; in other words, they use the same mask after vectorization.

We've implemented a simplified version of the last check: the load/store must be placed in the basic block that starts the MonotonicDescriptor::getPredicateEdge() CFG edge, and this basic block must have only one successor. This guarantees that the load/store is executed if and only if the increment of the monotonic phi happens. One vector iteration of such a loop is sketched below.
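
For illustration, one vector iteration of a compress-store loop like if (src[i] != 0) dst[idx++] = src[i]; could then lower to roughly the following IR (a hand-written sketch with assumed names and VF, not actual vectorizer output):

  %vals = load <vscale x 4 x i32>, ptr %src.cur, align 4
  %mask = icmp ne <vscale x 4 x i32> %vals, zeroinitializer
  ; Compress store: active elements are written consecutively starting at %dst.cur.
  call void @llvm.masked.compressstore.nxv4i32(<vscale x 4 x i32> %vals, ptr %dst.cur, <vscale x 4 x i1> %mask)
  ; Monotonic pointer update: advance by ctpop(mask) elements.
  %mask.ext = zext <vscale x 4 x i1> %mask to <vscale x 4 x i32>
  %popcnt = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %mask.ext)
  %dst.next = getelementptr inbounds i32, ptr %dst.cur, i32 %popcnt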

@llvmbot llvmbot added vectorizers, llvm:analysis and llvm:transforms labels May 20, 2025

llvmbot commented May 20, 2025

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-vectorizers

Author: Sergey Kachkov (skachkov-sc)

Changes

Patch is 80.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/140723.diff

12 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+1)
  • (modified) llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h (+18)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+35)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+117-13)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+5-3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+71-12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+13-8)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+78-6)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+7-5)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+3-2)
  • (added) llvm/test/Transforms/LoopVectorize/compress-idioms.ll (+698)
  • (modified) llvm/unittests/Transforms/Vectorize/VPlanTest.cpp (+4-4)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 1aed98e8f50db..bd3082579f808 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1403,6 +1403,7 @@ class TargetTransformInfo {
     Normal,        ///< The cast is used with a normal load/store.
     Masked,        ///< The cast is used with a masked load/store.
     GatherScatter, ///< The cast is used with a gather/scatter.
+    Compressed,    ///< The cast is used with an expand load/compress store.
     Interleave,    ///< The cast is used with an interleaved load/store.
     Reversed,      ///< The cast is used with a reversed load/store.
   };
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index d654ac3ec9273..757bff2d50eba 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -269,6 +269,10 @@ class LoopVectorizationLegality {
   /// induction descriptor.
   using InductionList = MapVector<PHINode *, InductionDescriptor>;
 
+  /// MonotonicPHIList saves monotonic phi variables and maps them to the
+  /// monotonic phi descriptor.
+  using MonotonicPHIList = MapVector<PHINode *, MonotonicDescriptor>;
+
   /// RecurrenceSet contains the phi nodes that are recurrences other than
   /// inductions and reductions.
   using RecurrenceSet = SmallPtrSet<const PHINode *, 8>;
@@ -304,6 +308,11 @@ class LoopVectorizationLegality {
   /// Returns the induction variables found in the loop.
   const InductionList &getInductionVars() const { return Inductions; }
 
+  /// Returns the monotonic phi variables found in the loop.
+  const MonotonicPHIList &getMonotonicPHIs() const { return MonotonicPHIs; }
+
+  bool hasMonotonicPHIs() const { return !MonotonicPHIs.empty(); }
+
   /// Return the fixed-order recurrences found in the loop.
   RecurrenceSet &getFixedOrderRecurrences() { return FixedOrderRecurrences; }
 
@@ -361,6 +370,12 @@ class LoopVectorizationLegality {
   /// loop. Do not use after invoking 'createVectorizedLoopSkeleton' (PR34965).
   int isConsecutivePtr(Type *AccessTy, Value *Ptr) const;
 
+  /// Returns true if Phi is monotonic variable.
+  bool isMonotonicPHI(PHINode *Phi) const;
+
+  /// Check if memory access is compressed when vectorizing.
+  bool isCompressedPtr(Type *AccessTy, Value *Ptr, BasicBlock *BB) const;
+
   /// Returns true if \p V is invariant across all loop iterations according to
   /// SCEV.
   bool isInvariant(Value *V) const;
@@ -597,6 +612,9 @@ class LoopVectorizationLegality {
   /// variables can be pointers.
   InductionList Inductions;
 
+  /// Holds all of the monotonic phi variables that we found in the loop.
+  MonotonicPHIList MonotonicPHIs;
+
   /// Holds all the casts that participate in the update chain of the induction
   /// variables, and that have been proven to be redundant (possibly under a
   /// runtime guard). These casts can be ignored when creating the vectorized
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 8e09e6f8d4935..cdfd556e5fc89 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -43,6 +43,10 @@ AllowStridedPointerIVs("lv-strided-pointer-ivs", cl::init(false), cl::Hidden,
                        cl::desc("Enable recognition of non-constant strided "
                                 "pointer induction variables."));
 
+static cl::opt<bool> EnableMonotonicPatterns(
+    "lv-monotonic-patterns", cl::init(true), cl::Hidden,
+    cl::desc("Enable recognition of monotonic patterns."));
+
 static cl::opt<bool>
     HintsAllowReordering("hints-allow-reordering", cl::init(true), cl::Hidden,
                          cl::desc("Allow enabling loop hints to reorder "
@@ -468,6 +472,30 @@ int LoopVectorizationLegality::isConsecutivePtr(Type *AccessTy,
   return 0;
 }
 
+bool LoopVectorizationLegality::isMonotonicPHI(PHINode *Phi) const {
+  return MonotonicPHIs.count(Phi);
+}
+
+bool LoopVectorizationLegality::isCompressedPtr(Type *AccessTy, Value *Ptr,
+                                                BasicBlock *BB) const {
+  MonotonicDescriptor Desc;
+  if (!MonotonicDescriptor::isMonotonicVal(Ptr, TheLoop, Desc, *PSE.getSE()))
+    return false;
+
+  // Check if memory operation will use the same mask as monotonic phi.
+  // TODO: relax restrictions of current implementation.
+  if (Desc.getPredicateEdge() !=
+      MonotonicDescriptor::Edge(BB, BB->getUniqueSuccessor()))
+    return false;
+
+  // Check if pointer step equals access size.
+  auto *Step =
+      dyn_cast<SCEVConstant>(Desc.getExpr()->getStepRecurrence(*PSE.getSE()));
+  if (!Step)
+    return false;
+  return Step->getAPInt() == BB->getDataLayout().getTypeAllocSize(AccessTy);
+}
+
 bool LoopVectorizationLegality::isInvariant(Value *V) const {
   return LAI->isInvariant(V);
 }
@@ -874,6 +902,13 @@ bool LoopVectorizationLegality::canVectorizeInstrs() {
           continue;
         }
 
+        MonotonicDescriptor MD;
+        if (EnableMonotonicPatterns && MonotonicDescriptor::isMonotonicPHI(
+                                           Phi, TheLoop, MD, *PSE.getSE())) {
+          MonotonicPHIs[Phi] = MD;
+          continue;
+        }
+
         if (RecurrenceDescriptor::isFixedOrderRecurrence(Phi, TheLoop, DT)) {
           AllowedExit.insert(Phi);
           FixedOrderRecurrences.insert(Phi);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 490d0afa99690..32f0d8bd3f85d 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1095,6 +1095,7 @@ class LoopVectorizationCostModel {
     CM_Widen_Reverse, // For consecutive accesses with stride -1.
     CM_Interleave,
     CM_GatherScatter,
+    CM_Compressed,
     CM_Scalarize,
     CM_VectorCall,
     CM_IntrinsicCall
@@ -1308,9 +1309,9 @@ class LoopVectorizationCostModel {
   getDivRemSpeculationCost(Instruction *I,
                            ElementCount VF) const;
 
-  /// Returns widening decision (CM_Widen or CM_Widen_Reverse) if \p I is a
-  /// memory instruction with consecutive access that can be widened, or
-  /// CM_Unknown otherwise.
+  /// Returns widening decision (CM_Widen, CM_Widen_Reverse or CM_Compressed) if
+  /// \p I is a memory instruction with consecutive access that can be widened,
+  /// or CM_Unknown otherwise.
   InstWidening memoryInstructionCanBeWidened(Instruction *I, ElementCount VF);
 
   /// Returns true if \p I is a memory instruction in an interleaved-group
@@ -3263,6 +3264,9 @@ LoopVectorizationCostModel::memoryInstructionCanBeWidened(Instruction *I,
   auto *Ptr = getLoadStorePointerOperand(I);
   auto *ScalarTy = getLoadStoreType(I);
 
+  if (Legal->isCompressedPtr(ScalarTy, Ptr, I->getParent()))
+    return CM_Compressed;
+
   // In order to be widened, the pointer should be consecutive, first of all.
   auto Stride = Legal->isConsecutivePtr(ScalarTy, Ptr);
   if (!Stride)
@@ -3372,9 +3376,9 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
     if (IsUniformMemOpUse(I))
       return true;
 
-    return (WideningDecision == CM_Widen ||
-            WideningDecision == CM_Widen_Reverse ||
-            WideningDecision == CM_Interleave);
+    return (
+        WideningDecision == CM_Widen || WideningDecision == CM_Widen_Reverse ||
+        WideningDecision == CM_Interleave || WideningDecision == CM_Compressed);
   };
 
   // Returns true if Ptr is the pointer operand of a memory access instruction
@@ -3514,6 +3518,39 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
     AddToWorklistIfAllowed(IndUpdate);
   }
 
+  // Handle monotonic phis (similarly to induction vars).
+  for (const auto &MonotonicPHI : Legal->getMonotonicPHIs()) {
+    auto *Phi = MonotonicPHI.first;
+    auto *PhiUpdate = cast<Instruction>(Phi->getIncomingValueForBlock(Latch));
+    const auto &Desc = MonotonicPHI.second;
+
+    auto UniformPhi = llvm::all_of(Phi->users(), [&](User *U) -> bool {
+      auto *I = cast<Instruction>(U);
+      if (I == Desc.getStepInst())
+        return true;
+      if (auto *PN = dyn_cast<PHINode>(I); PN && Desc.getChain().contains(PN))
+        return true;
+      return !TheLoop->contains(I) || Worklist.count(I) ||
+             IsVectorizedMemAccessUse(I, Phi);
+    });
+    if (!UniformPhi)
+      continue;
+
+    auto UniformPhiUpdate =
+        llvm::all_of(PhiUpdate->users(), [&](User *U) -> bool {
+          auto *I = cast<Instruction>(U);
+          if (I == Phi)
+            return true;
+          return !TheLoop->contains(I) || Worklist.count(I) ||
+                 IsVectorizedMemAccessUse(I, Phi);
+        });
+    if (!UniformPhiUpdate)
+      continue;
+
+    AddToWorklistIfAllowed(Phi);
+    AddToWorklistIfAllowed(PhiUpdate);
+  }
+
   Uniforms[VF].insert_range(Worklist);
 }
 
@@ -4272,6 +4309,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
       case VPDef::VPEVLBasedIVPHISC:
       case VPDef::VPPredInstPHISC:
       case VPDef::VPBranchOnMaskSC:
+      case VPDef::VPMonotonicPHISC:
         continue;
       case VPDef::VPReductionSC:
       case VPDef::VPActiveLaneMaskPHISC:
@@ -4992,6 +5030,10 @@ LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
   if (Legal->hasUncountableEarlyExit())
     return 1;
 
+  // Monotonic vars don't support interleaving.
+  if (Legal->hasMonotonicPHIs())
+    return 1;
+
   const bool HasReductions = !Legal->getReductionVars().empty();
 
   // If we did not calculate the cost for VF (because the user selected the VF)
@@ -5577,12 +5619,17 @@ InstructionCost LoopVectorizationCostModel::getConsecutiveMemOpCost(
     Instruction *I, ElementCount VF, InstWidening Decision) {
   Type *ValTy = getLoadStoreType(I);
   auto *VectorTy = cast<VectorType>(toVectorTy(ValTy, VF));
+  const Align Alignment = getLoadStoreAlignment(I);
   unsigned AS = getLoadStoreAddressSpace(I);
   enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
 
+  if (Decision == CM_Compressed)
+    return TTI.getExpandCompressMemoryOpCost(I->getOpcode(), VectorTy,
+                                             /*VariableMask*/ true, Alignment,
+                                             CostKind, I);
+
   assert((Decision == CM_Widen || Decision == CM_Widen_Reverse) &&
          "Expected widen decision.");
-  const Align Alignment = getLoadStoreAlignment(I);
   InstructionCost Cost = 0;
   if (Legal->isMaskRequired(I)) {
     Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS,
@@ -6292,6 +6339,11 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
   // the scalar version.
   if (isUniformAfterVectorization(I, VF))
     VF = ElementCount::getFixed(1);
+  else if (auto *Phi = dyn_cast<PHINode>(I)) {
+    // Prohibit scalarization of monotonic phis.
+    if (Legal->isMonotonicPHI(Phi))
+      return InstructionCost::getInvalid();
+  }
 
   if (VF.isVector() && isProfitableToScalarize(I, VF))
     return InstsToScalarize[VF][I];
@@ -6647,6 +6699,8 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
       switch (getWideningDecision(I, VF)) {
       case LoopVectorizationCostModel::CM_GatherScatter:
         return TTI::CastContextHint::GatherScatter;
+      case LoopVectorizationCostModel::CM_Compressed:
+        return TTI::CastContextHint::Compressed;
       case LoopVectorizationCostModel::CM_Interleave:
         return TTI::CastContextHint::Interleave;
       case LoopVectorizationCostModel::CM_Scalarize:
@@ -7238,6 +7292,16 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
     }
   }
 
+  for (const auto &[MonotonicPhi, MonotonicDesc] : Legal->getMonotonicPHIs()) {
+    // TODO: currently, we restrict vectorization of non-uniform monotonic phis
+    // by reporting Invalid cost for it. This can be relaxed in future.
+    if (VF.isVector() && !CM.isUniformAfterVectorization(MonotonicPhi, VF))
+      Cost = InstructionCost::getInvalid();
+    else
+      Cost += TTI.getCFInstrCost(Instruction::PHI, CostCtx.CostKind);
+    CostCtx.SkipCostComputation.insert(MonotonicPhi);
+  }
+
   // Pre-compute the costs for branches except for the backedge, as the number
   // of replicate regions in a VPlan may not directly match the number of
   // branches, which would lead to different decisions.
@@ -8229,8 +8293,9 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands,
   LoopVectorizationCostModel::InstWidening Decision =
       CM.getWideningDecision(I, Range.Start);
   bool Reverse = Decision == LoopVectorizationCostModel::CM_Widen_Reverse;
+  bool Compressed = Decision == LoopVectorizationCostModel::CM_Compressed;
   bool Consecutive =
-      Reverse || Decision == LoopVectorizationCostModel::CM_Widen;
+      Reverse || Compressed || Decision == LoopVectorizationCostModel::CM_Widen;
 
   VPValue *Ptr = isa<LoadInst>(I) ? Operands[0] : Operands[1];
   if (Consecutive) {
@@ -8258,11 +8323,12 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands,
   }
   if (LoadInst *Load = dyn_cast<LoadInst>(I))
     return new VPWidenLoadRecipe(*Load, Ptr, Mask, Consecutive, Reverse,
-                                 VPIRMetadata(*Load, LVer), I->getDebugLoc());
+                                 Compressed, VPIRMetadata(*Load, LVer),
+                                 I->getDebugLoc());
 
   StoreInst *Store = cast<StoreInst>(I);
   return new VPWidenStoreRecipe(*Store, Ptr, Operands[0], Mask, Consecutive,
-                                Reverse, VPIRMetadata(*Store, LVer),
+                                Reverse, Compressed, VPIRMetadata(*Store, LVer),
                                 I->getDebugLoc());
 }
 
@@ -8771,11 +8837,19 @@ VPRecipeBase *VPRecipeBuilder::tryToCreateWidenRecipe(VPSingleDefRecipe *R,
       return Recipe;
 
     VPHeaderPHIRecipe *PhiRecipe = nullptr;
-    assert((Legal->isReductionVariable(Phi) ||
+    assert((Legal->isMonotonicPHI(Phi) || Legal->isReductionVariable(Phi) ||
             Legal->isFixedOrderRecurrence(Phi)) &&
-           "can only widen reductions and fixed-order recurrences here");
+           "can only widen monotonic phis, reductions and fixed-order "
+           "recurrences here");
     VPValue *StartV = Operands[0];
-    if (Legal->isReductionVariable(Phi)) {
+    Value *IncomingVal =
+        Phi->getIncomingValueForBlock(OrigLoop->getLoopPreheader());
+    if (Legal->isMonotonicPHI(Phi)) {
+      const MonotonicDescriptor &Desc =
+          Legal->getMonotonicPHIs().find(Phi)->second;
+      assert(Desc.getExpr()->getStart() == PSE.getSCEV(IncomingVal));
+      PhiRecipe = new VPMonotonicPHIRecipe(Phi, Desc, StartV);
+    } else if (Legal->isReductionVariable(Phi)) {
       const RecurrenceDescriptor &RdxDesc =
           Legal->getReductionVars().find(Phi)->second;
       assert(RdxDesc.getRecurrenceStartValue() ==
@@ -9397,6 +9471,27 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range,
   // bring the VPlan to its final state.
   // ---------------------------------------------------------------------------
 
+  // Adjust the recipes for any monotonic phis.
+  for (VPRecipeBase &R : HeaderVPBB->phis()) {
+    auto *MonotonicPhi = dyn_cast<VPMonotonicPHIRecipe>(&R);
+    if (!MonotonicPhi)
+      continue;
+
+    auto &Desc = MonotonicPhi->getDescriptor();
+    auto [EdgeSrc, EdgeDst] = Desc.getPredicateEdge();
+    auto &SE = *PSE.getSE();
+    auto *Step = vputils::getOrCreateVPValueForSCEVExpr(
+        *Plan, Desc.getExpr()->getStepRecurrence(SE), SE);
+
+    auto *MonotonicI = new VPInstruction(
+        VPInstruction::ComputeMonotonicResult,
+        {MonotonicPhi, RecipeBuilder.getEdgeMask(EdgeSrc, EdgeDst), Step},
+        *Desc.getStepInst());
+    auto *InsertBlock = MonotonicPhi->getBackedgeRecipe().getParent();
+    InsertBlock->insert(MonotonicI, InsertBlock->getFirstNonPhi());
+    MonotonicPhi->getBackedgeValue()->replaceAllUsesWith(MonotonicI);
+  }
+
   // Adjust the recipes for any inloop reductions.
   adjustRecipesForReductions(Plan, RecipeBuilder, Range.Start);
 
@@ -10587,6 +10682,15 @@ bool LoopVectorizePass::processLoop(Loop *L) {
     IC = CM.selectInterleaveCount(LVP.getPlanFor(VF.Width), VF.Width, VF.Cost);
 
     unsigned SelectedIC = std::max(IC, UserIC);
+
+    if (LVL.hasMonotonicPHIs() && SelectedIC > 1) {
+      reportVectorizationFailure(
+          "Interleaving of loop with monotonic vars",
+          "Interleaving of loops with monotonic vars is not supported",
+          "CantInterleaveWithMonotonicVars", ORE, L);
+      return false;
+    }
+
     //  Optimistically generate runtime checks if they are needed. Drop them if
     //  they turn out to not be profitable.
     if (VF.Width.isVector() || SelectedIC > 1)
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 06b738afbd221..f5b2667c2b253 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -308,10 +308,11 @@ Value *VPTransformState::get(const VPValue *Def, bool NeedsScalar) {
   VPLane LastLane(IsSingleScalar ? 0 : VF.getKnownMinValue() - 1);
   // Check if there is a scalar value for the selected lane.
   if (!hasScalarValue(Def, LastLane)) {
-    // At the moment, VPWidenIntOrFpInductionRecipes, VPScalarIVStepsRecipes and
-    // VPExpandSCEVRecipes can also be a single scalar.
+    // At the moment, VPWidenIntOrFpInductionRecipes, VPScalarIVStepsRecipes,
+    // VPMonotonicPHIRecipe and VPExpandSCEVRecipes can also be a single scalar.
     assert((isa<VPWidenIntOrFpInductionRecipe, VPScalarIVStepsRecipe,
-                VPExpandSCEVRecipe>(Def->getDefiningRecipe())) &&
+                VPMonotonicPHIRecipe, VPExpandSCEVRecipe>(
+               Def->getDefiningRecipe())) &&
            "unexpected recipe found to be invariant");
     IsSingleScalar = true;
     LastLane = 0;
@@ -1005,6 +1006,7 @@ void VPlan::execute(VPTransformState *State) {
     auto *PhiR = cast<VPSingleDefRecipe>(&R);
     // VPInstructions currently model scalar Phis only.
     bool NeedsScalar = isa<VPInstruction>(PhiR) ||
+                       isa<VPMonotonicPHIRecipe>(PhiR) ||
                        (isa<VPReductionPHIRecipe>(PhiR) &&
                         cast<VPReductionPHIRecipe>(PhiR)->isInLoop());
     Value *Phi = State->get(PhiR, NeedsScalar);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index e634de1e17c69..9ce743da635b1 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -539,6 +539,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue {
     case VPRecipeBase::VPWidenIntOrFpInductionSC:
     case VPRecipeBase::VPWidenPointerInductionSC:
     case VPRecipeBase::VPReductionPHISC:
+    case VPRecipeBase::VPMonotonicPHISC:
     case VPRecipeBase::VPPartialReductionSC:
       return true;
     case VPRecipeBase::VPBranchOnMaskSC:
@@ -900,6 +901,7 @@ class VPInstruction : public VPRecipeWithIRFlags,
     Broadcast,
     ComputeFindLastIVResult,
     ComputeReductionResult,
+    ComputeMonotonicResult,
     // Extracts the last lane from its operand if it is a vector, or the last
     // part if scalar. In the latter case, the recipe will be removed during
     // unrolling.
@@ -965,6 +967,11 @@ class VPInstruction : public VPRecipeWithIRFlags,
 #endif
 
 public:
+  VPInstruction(unsigned Opcode, ArrayRef<VPValue *> Operands, Instruction &I,
+                const Twine &Name = "")
+      : VPRecipeWithIRFlags(VPDef::VPInstructionSC, Operands, I),
+        Opcode(Opcode), Name(Name.str()) {}
+
   VPInstruction(unsigned Opcode, ArrayRef<VPValue *> Operands, DebugLoc DL,
                 const Twine &Name = "")
       : VPRecipeWithIRFlags(VPDef::VPInstructionSC, Operands, DL),
@@ -2249,6 +2256,50 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe,
   }
 };
 
+/// A recipe for handling monotonic phis. The start value is the first operand
+/// of the recipe and the incoming value from the backedge is the...
[truncated]


llvmbot commented May 20, 2025

@llvm/pr-subscribers-llvm-transforms


@skachkov-sc skachkov-sc requested review from ayalz, fhahn and lukel97 May 20, 2025 12:41
@skachkov-sc

llvm-test-suite statistics (number of vectorized loops, target - riscv64 with RVV enabled):

Metric: loop-vectorize.LoopsVectorized

Program                                                                                             loop-vectorize.LoopsVectorized              
                                                                                                    before                         after   diff 
                                   test-suite :: SingleSource/Benchmarks/CoyoteBench/huffbench.test    0.00                           1.00  inf%
                                         test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test   46.00                          86.00 87.0%
                                 test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  231.00                         293.00 26.8%
                          test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/XSBench/XSBench.test    4.00                           5.00 25.0%
                                 test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test   44.00                          48.00  9.1%
                                  test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test   35.00                          37.00  5.7%
                                      test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test   29.00                          30.00  3.4%
                                test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  100.00                         103.00  3.0%
                               test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  100.00                         103.00  3.0%
                            test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   75.00                          77.00  2.7%
                      test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   79.00                          81.00  2.5%
                              test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  120.00                         123.00  2.5%
                                      test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test   93.00                          95.00  2.2%
                                     test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  344.00                         350.00  1.7%
                                      test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test  119.00                         121.00  1.7%
                                   test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test   99.00                         100.00  1.0%
                            test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test  133.00                         134.00  0.8%
                                          test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  183.00                         184.00  0.5%
                                 test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  434.00                         436.00  0.5%
                                  test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  434.00                         436.00  0.5%
                           test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 1584.00                        1591.00  0.4%
                                      test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  364.00                         365.00  0.3%
                     test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  952.00                         953.00  0.1%
                      test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  952.00                         953.00  0.1%

@skachkov-sc skachkov-sc force-pushed the users/skachkov-sc/widen-decision-refactor branch from 3a5b3c8 to 9ae9f1d on November 7, 2025 11:59
@skachkov-sc skachkov-sc force-pushed the users/skachkov-sc/monotonic-vectorization branch from 888e1ea to e38379d on November 7, 2025 11:59

github-actions bot commented Nov 7, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@fhahn fhahn left a comment

I think you can probably make this independent of #140721 by first just supporting cases where the compressed store does not alias any of the other memory accesses?

Curious if you already have any runtime performance numbers you could share?

Comment on lines 4498 to 4508

void VPMonotonicPHIRecipe::execute(VPTransformState &State) {
  assert(getParent()->getPlan()->getUF() == 1 && "Expected unroll factor 1.");
  Value *Start = getStartValue()->getLiveInIRValue();
  BasicBlock *VectorPH =
      State.CFG.VPBB2IRBB.at(getParent()->getCFGPredecessor(0));
  PHINode *MonotonicPHI =
      State.Builder.CreatePHI(Start->getType(), 2, "monotonic.iv");
  MonotonicPHI->addIncoming(Start, VectorPH);
  MonotonicPHI->setDebugLoc(getDebugLoc());
  State.set(this, MonotonicPHI, /*IsScalar=*/true);
}
Contributor

This looks just like a plain VPWidenPHI; can you use that here, or do we need a new recipe? If so, please document the rationale; probably good to do in the description.

Contributor Author

The only rationale for a new recipe was to simplify "adjusting" of the VPlan: we need to insert VPInstruction::ComputeMonotonicResult at the backedge of such a phi (it will increment the phi value by ctpop(mask)). This looks similar to the handling of reductions: VPMonotonicPHIRecipe/VPInstruction::ComputeMonotonicResult is symmetric to VPReductionPHIRecipe/VPInstruction::ComputeReductionResult. Probably the VPWidenPHI recipe could be used here, but there are some places in the code where we want to distinguish "monotonic" header phis from the others.

Comment on lines 3242 to 3199
/// Whether the consecutive accessed addresses are compressed with mask value.
bool Compressed;

Contributor

This corresponds 1-1 to an intrinsic, right? Can we just use VPWidenIntrinsicRecipe?

Contributor Author

There is no intrinsic before loop vectorization here; we have a plain load/store instruction that is placed under some predicate in the original loop, so it becomes masked in LoopVectorizer. The difference from ordinary masked loads/stores is that "compressed" loads/stores read or write memory consecutively (the number of elements == the number of set mask bits) and then distribute the elements across the active mask positions.
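
To illustrate the difference (a hand-written fixed-width example, not part of the patch): with memory [a, b, c, d, ...] at %p, passthru %pt and mask <1, 0, 1, 1>, a masked load reads each active lane from its own address, while an expand load reads as many consecutive elements as there are set mask bits and distributes them into the active lanes:

  %ml = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %p, i32 4, <4 x i1> %mask, <4 x i32> %pt)
  ; %ml = <a, pt[1], c, d>  -- lanes 0, 2, 3 loaded from p[0], p[2], p[3]
  %el = call <4 x i32> @llvm.masked.expandload.v4i32(ptr %p, <4 x i1> %mask, <4 x i32> %pt)
  ; %el = <a, pt[1], b, c>  -- 3 consecutive elements expanded into the active lanes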

Contributor Author

Changes related to the extension of VPWidenMemoryRecipe have been split into a separate PR: #166956 (hopefully it will be easier to review).

@skachkov-sc

I think you can probably make this independent of #140721 by first just supporting cases where the compressed store does not alias any of the other memory accesses?

Yes, the changes in LAA are fully independent; we can skip them for now.

Curious if you already have any runtime performance numbers you could share?

We've benchmarked the following loop pattern:

// benchmark() is run 32 times

template<typename T>
void benchmark(T *dst, const T *src) {
  size_t idx = 0;
  for(size_t i = 0; i < 1024; ++i) {
    T cur = src[i];
    if (cur != static_cast<T>(0))
      dst[idx++] = cur;
  }
  dst[idx] = static_cast<T>(0);
}

On a SpacemiT-X60 core (a RISC-V CPU with VLEN=256), the results are as follows:

Type     cycles (scalar)  cycles (vector)  speedup
int16_t  189151           56795            3.33x
int32_t  205712           87196            2.36x
int64_t  205757           150115           1.37x

There were no branch mispredicts for the if (cur != static_cast<T>(0)) branch in the scalar case here (due to the specifics of the data in the src array), so I think the speedup can be even bigger for more random inputs. We haven't observed any significant changes on SPECs, though.

@MacDue MacDue requested a review from huntergr-arm November 7, 2025 15:01
@skachkov-sc skachkov-sc force-pushed the users/skachkov-sc/widen-decision-refactor branch from 9ae9f1d to 3bb26ad on November 7, 2025 15:21
@skachkov-sc skachkov-sc force-pushed the users/skachkov-sc/monotonic-vectorization branch from e38379d to ca191be on November 7, 2025 15:23
@skachkov-sc skachkov-sc force-pushed the users/skachkov-sc/widen-decision-refactor branch from 3bb26ad to f39e2a0 on November 10, 2025 14:11
@skachkov-sc skachkov-sc force-pushed the users/skachkov-sc/monotonic-vectorization branch from ca191be to 6f23b10 on November 11, 2025 12:04
@skachkov-sc skachkov-sc changed the base branch from users/skachkov-sc/widen-decision-refactor to users/skachkov-sc/widen-memory-compress on November 11, 2025 12:05

  // Adjust the recipes for any monotonic phis.
  for (VPRecipeBase &R : HeaderVPBB->phis()) {
    auto *MonotonicPhi = dyn_cast<VPMonotonicPHIRecipe>(&R);
Contributor

If you use a regular VPInstruction::PHI instead of VPMonotonicPHIRecipe and set the underlying value to the monotonic phi, could you avoid the need for an extra recipe?

Would you be able to detect the monotonic phis by just calling Legal->getMonotonicPHIs().find(Phi->getUnderlyingValue())?

Contributor Author

Do you mean the VPIRPhi recipe? I think it's possible, but I'm slightly concerned that, semantically, VPMonotonicPHIRecipe models a loop header phi, so it should be derived from VPHeaderPHIRecipe. But yes, we can distinguish monotonic phis using LoopVectorizationLegality.

@lukel97 lukel97 Nov 11, 2025

I was thinking of VPPhi, which is any VPInstruction with an opcode of Instruction::Phi; you can create one with VPBuilder::createScalarPhi. It should let you model everything in VPlan without needing to create a dedicated recipe.

I wouldn't worry about trying to derive from VPHeaderPHIRecipe; we use VPPhi for the EVL tail-folding stuff even though those phis are placed in the header. Other transforms know how to handle VPPhi recipes, if that's what you're wondering.

Contributor Author

Sounds good, will try to use VPPhi then

@skachkov-sc skachkov-sc force-pushed the users/skachkov-sc/widen-memory-compress branch from 92342e0 to ed75372 on November 12, 2025 12:25
@skachkov-sc skachkov-sc force-pushed the users/skachkov-sc/monotonic-vectorization branch from 6f23b10 to ecb2b7e on November 12, 2025 15:33
@skachkov-sc

@lukel97 @fhahn I've tried to replace the custom VPMonotonicPHIRecipe with VPPhi (in this commit: ecb2b7e). In general, this approach is feasible, but I've encountered some minor issues:

  1. VPlanTransforms::addScalarResumePhis() expects VPHeaderPHIRecipes to work with (and we want to process monotonic phis with this transform), so it needs some small tweaks to work via VPPhiAccessors (to support both VPPhi and VPHeaderPHIRecipes).
  2. We need to explicitly set the underlying value for the VPPhi (so it can be used to find the monotonic descriptor), and currently VPInstruction::computeCost asserts that the underlying value is null - I've bypassed this check for now.
