batch-serial is backwards-compatible with serial, and batch-canary with canary, but they are added as new commands for now so as not to disrupt workflows that rely on the existing commands being bulletproof
1 parent f397879, commit 465254a
Showing 8 changed files with 880 additions and 7 deletions.
@@ -0,0 +1,273 @@
// Copyright 2017 Palantir Technologies, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package batchcanary

import (
    "context"
    "strings"

    at "github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
    "github.com/palantir/bouncer/bouncer"
    "github.com/pkg/errors"
    log "github.com/sirupsen/logrus"
)

// Runner holds data for a particular batch-canary run
// Note that in the batch-canary case, asgs will always be of length 1
type Runner struct {
    bouncer.BaseRunner
    batchSize int32 // This field is set in NewRunner
}

// NewRunner instantiates a new batch-canary runner
func NewRunner(ctx context.Context, opts *bouncer.RunnerOpts) (*Runner, error) {
    br, err := bouncer.NewBaseRunner(ctx, opts)
    if err != nil {
        return nil, errors.Wrap(err, "error getting base runner")
    }

    batchSize := *opts.BatchSize

    if batchSize == 0 {
        if len(strings.Split(opts.AsgString, ",")) > 1 {
            return nil, errors.New("Batch canary mode supports only 1 ASG at a time")
        }

        da, err := bouncer.ExtractDesiredASG(opts.AsgString, nil, nil)
        if err != nil {
            return nil, err
        }

        batchSize = da.DesiredCapacity
    }

    r := Runner{
        BaseRunner: *br,
        batchSize:  batchSize,
    }
    return &r, nil
}

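// Illustrative note (not part of this commit): if the batch-size option is
// left at its zero value and the single ASG argument encodes a desired
// capacity of 6 (assuming the usual name:desired_capacity form), batchSize
// defaults to 6, so after the canary node comes up healthy the remaining old
// nodes are replaced in one full-size batch.
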
// ValidatePrereqs checks that the batch runner is safe to proceed
func (r *Runner) ValidatePrereqs(ctx context.Context) error {
    asgSet, err := r.NewASGSet(ctx)
    if err != nil {
        return errors.Wrap(err, "error building actualASG")
    }

    if len(asgSet.ASGs) > 1 {
        log.WithFields(log.Fields{
            "count given": len(asgSet.ASGs),
        }).Error("Batch Canary mode supports only 1 ASG at a time")
        return errors.New("error validating ASG input")
    }

    for _, actualAsg := range asgSet.ASGs {
        if actualAsg.DesiredASG.DesiredCapacity != *actualAsg.ASG.DesiredCapacity {
            log.WithFields(log.Fields{
                "desired capacity given":  actualAsg.DesiredASG.DesiredCapacity,
                "desired capacity actual": *actualAsg.ASG.DesiredCapacity,
            }).Error("Desired capacity given must be equal to starting desired_capacity of ASG")
            return errors.New("error validating ASG state")
        }

        if actualAsg.DesiredASG.DesiredCapacity < *actualAsg.ASG.MinSize {
            log.WithFields(log.Fields{
                "min size":         *actualAsg.ASG.MinSize,
                "max size":         *actualAsg.ASG.MaxSize,
                "desired capacity": actualAsg.DesiredASG.DesiredCapacity,
            }).Error("Desired capacity given must be greater than or equal to min ASG size")
            return errors.New("error validating ASG state")
        }

        if *actualAsg.ASG.MaxSize < (actualAsg.DesiredASG.DesiredCapacity + r.batchSize) {
            log.WithFields(log.Fields{
                "min size":         *actualAsg.ASG.MinSize,
                "max size":         *actualAsg.ASG.MaxSize,
                "desired capacity": actualAsg.DesiredASG.DesiredCapacity,
                "batch size":       r.batchSize,
            }).Error("Max capacity of ASG must be >= desired capacity + batch size")
            return errors.New("error validating ASG state")
        }
    }

    return nil
}

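// Worked example (illustrative, not from the diff): an ASG with min=2, max=9
// and a starting desired_capacity of 6 passes all three checks for a batch
// size of 3 (6 == 6, 6 >= 2, 9 >= 6+3); a batch size of 4 would fail the last
// check, since the ASG could never grow to the required 10 instances.
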
func min(a, b int32) int32 {
    if a < b {
        return a
    }
    return b
}

// Run has the meat of the batch job
func (r *Runner) Run() error {
    var newDesiredCapacity int32
    decrement := true

    ctx, cancel := r.NewContext()
    defer cancel()

    for {
        // Rebuild the state of the world every iteration of the loop because instance and ASG statuses are changing
        log.Debug("Beginning new batch canary run check")
        asgSet, err := r.NewASGSet(ctx)
        if err != nil {
            return errors.Wrap(err, "error building ASGSet")
        }

        // Since we only support one ASG in batch-canary mode
        asg := asgSet.ASGs[0]
        curDesiredCapacity := *asg.ASG.DesiredCapacity
        finDesiredCapacity := asg.DesiredASG.DesiredCapacity

        oldUnhealthy := asgSet.GetUnHealthyOldInstances()
        newHealthy := asgSet.GetHealthyNewInstances()
        oldHealthy := asgSet.GetHealthyOldInstances()

        newCount := int32(len(asgSet.GetNewInstances()))
        oldCount := int32(len(asgSet.GetOldInstances()))
        healthyCount := int32(len(newHealthy) + len(oldHealthy))

        totalCount := newCount + oldCount

        // Never terminate nodes in a way that would take us below finDesiredCapacity healthy (InService) machines
        extraNodes := healthyCount - finDesiredCapacity

        maxDesiredCapacity := finDesiredCapacity + r.batchSize
        newDesiredCapacity = min(maxDesiredCapacity, finDesiredCapacity+oldCount)

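        // Worked example (illustrative, not from the diff): with finDesiredCapacity=6
        // and batchSize=3, an iteration that finds 3 healthy old and 3 healthy new
        // nodes computes extraNodes=6-6=0, maxDesiredCapacity=9 and
        // newDesiredCapacity=min(9, 6+3)=9, so the branches below scale out by a
        // batch of 3 rather than terminating anything.
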
        // Clean-out old unhealthy instances in P:W now, as they're just wasting time
        oldKilled := false
        for _, oi := range oldUnhealthy {
            if oi.ASGInstance.LifecycleState == at.LifecycleStatePendingWait {
                err := r.KillInstance(ctx, oi, &decrement)
                if err != nil {
                    return errors.Wrap(err, "error killing instance")
                }
                oldKilled = true
            }
        }

        if oldKilled {
            ctx, cancel = r.NewContext()
            defer cancel()
            r.Sleep(ctx)

            continue
        }

        // This check already prints statuses of individual nodes
        if asgSet.IsTransient() {
            log.Info("Waiting for nodes to settle")
            r.Sleep(ctx)
            continue
        }

        // Our exit case - we have exactly the number of nodes we want, they're all new, and they're all InService
        if oldCount == 0 && totalCount == finDesiredCapacity {
            if curDesiredCapacity == finDesiredCapacity {
                log.Info("Didn't find any old instances or ASGs - we're done here!")
                return nil
            }

            // Not sure how this would happen off-hand?
            log.WithFields(log.Fields{
                "Current desired capacity": curDesiredCapacity,
                "Final desired capacity":   finDesiredCapacity,
            }).Error("Capacity mismatch")
            return errors.New("old instance mismatch")
        }

        // If we haven't canaried a new instance yet, let's do that
        if newCount == 0 {
            log.Info("Adding canary node")
            newDesiredCapacity = curDesiredCapacity + 1

            err = r.SetDesiredCapacity(ctx, asg, &newDesiredCapacity)
            if err != nil {
                return errors.Wrap(err, "error setting desired capacity")
            }

            ctx, cancel = r.NewContext()
            defer cancel()
            r.Sleep(ctx)

            continue
        }

        // Scale-out a batch
        if newDesiredCapacity > curDesiredCapacity {
            log.WithFields(log.Fields{
                "Batch size given":       r.batchSize,
                "Old machines remaining": oldCount,
                "Max descap":             maxDesiredCapacity,
                "Current batch size":     newDesiredCapacity - curDesiredCapacity,
            }).Info("Adding a batch of new nodes")

            err = r.SetDesiredCapacity(ctx, asg, &newDesiredCapacity)
            if err != nil {
                return errors.Wrap(err, "error setting desired capacity")
            }

            ctx, cancel = r.NewContext()
            defer cancel()
            r.Sleep(ctx)

            continue
        }

        // Scale-in a batch
        if extraNodes > 0 {
            killed := int32(0)

            log.WithFields(log.Fields{
                "Old nodes":     oldCount,
                "Healthy nodes": healthyCount,
                "Extra nodes":   extraNodes,
            }).Info("Killing a batch of nodes")

            for _, oi := range oldHealthy {
                err := r.KillInstance(ctx, oi, &decrement)
                if err != nil {
                    return errors.Wrap(err, "error killing instance")
                }
                killed++
                if killed == extraNodes {
                    log.WithFields(log.Fields{
                        "Killed Nodes": killed,
                    }).Info("Already killed number of extra nodes to get back to desired capacity, pausing here")
                    break
                }
            }
            ctx, cancel = r.NewContext()
            defer cancel()
            r.Sleep(ctx)

            continue
        }

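        // For the scale-in branch above, illustrative numbers (not from the diff):
        // with healthyCount=9 and finDesiredCapacity=6, extraNodes=3, so exactly 3
        // healthy old nodes are terminated (decrementing desired capacity) before
        // the runner sleeps and re-evaluates.
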
        // Not sure how this would happen off-hand?
        log.WithFields(log.Fields{
            "Current desired capacity": curDesiredCapacity,
            "Final desired capacity":   finDesiredCapacity,
            "Old nodes":                oldCount,
            "Healthy nodes":            healthyCount,
            "Extra nodes":              extraNodes,
        }).Error("Unknown condition hit")
        return errors.New("undefined error")
    }
}
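
For orientation, a minimal driver sketch (not part of this commit) of how the pieces above would typically be chained; the batchcanary import path and the cmd package name are assumptions based on the repository layout, and the *bouncer.RunnerOpts is expected to arrive already populated from bouncer's CLI layer, as for the existing canary command.

// Hypothetical glue code, for illustration only.
package cmd

import (
    "context"

    "github.com/palantir/bouncer/batchcanary" // assumed import path
    "github.com/palantir/bouncer/bouncer"
    "github.com/pkg/errors"
)

// runBatchCanary builds the runner, validates the ASG preconditions, and then
// runs the replacement loop until every old node has been cycled out.
func runBatchCanary(ctx context.Context, opts *bouncer.RunnerOpts) error {
    r, err := batchcanary.NewRunner(ctx, opts)
    if err != nil {
        return errors.Wrap(err, "error creating batch-canary runner")
    }
    if err := r.ValidatePrereqs(ctx); err != nil {
        return err
    }
    return r.Run()
}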