Spread reconfigure in one rack #116

majst01 · 2024-01-16T16:52:22Z

We sporadically see network spikes that correlate with configuration reloads happening at the same time within one rack.
With this change, the start of the reconfiguration is shifted by half the reconfiguration interval, based on the hostname suffix of the switch. Switches whose hostname does not end in a number are not spread.

With this, we get a 10-second spread with a 20-second reconciliation interval:

$ m sw ls
ID                    PARTITION   RACK               OS   STATUS   LAST SYNC 
fra-equ01-r01leaf01   fra-equ01   fra-equ01-rack01   🐢   ●        8s ago      
fra-equ01-r01leaf02   fra-equ01   fra-equ01-rack01   🐢   ●        18s ago

Keep this PR open, but as a draft, because we have no actual need for it at the moment.

chbmuc · 2024-01-17T08:12:55Z

I would suggest calculating the wait time instead of iterating, e.g. something like this:

func waitForTicker(hostname string, interval time.Duration) {
	// Use the last character of the hostname as the leaf number
	index, err := strconv.Atoi(hostname[len(hostname)-1:])
	if err != nil {
		log.Warn("unable to parse leaf number from hostname, not spreading switch reloads", "hostname", hostname, "error", err)
		return
	}

	// Get the next time the ticker should tick
	now := time.Now()
	next := now.Truncate(interval).Add(interval)

	// If the index is odd, add half the interval to the next time
	if index%2 != 0 {
		next = next.Add(interval / 2)
	}

	// If the wait is longer than the interval, we can start one interval earlier
	wait := next.Sub(now)
	if wait > interval {
		wait -= interval
		next = next.Add(-interval)
	}

	log.Info("waiting to start trigger", "wait", wait, "until", next)

	time.Sleep(wait)
}

chbmuc · 2024-01-17T08:59:57Z

As an addition, I would also skip the reconfiguration if the previous one has just finished, to give BGP more time to propagate its changes:

diff --git a/cmd/internal/core/reconfigure-switch.go b/cmd/internal/core/reconfigure-switch.go
index 27fd446..998a8f8 100644
--- a/cmd/internal/core/reconfigure-switch.go
+++ b/cmd/internal/core/reconfigure-switch.go
@@ -18,11 +18,18 @@ import (

 // ReconfigureSwitch reconfigures the switch.
 func (c *Core) ReconfigureSwitch() {
+       var last time.Time
+
        t := time.NewTicker(c.reconfigureSwitchInterval)
        host, _ := os.Hostname()
        for range t.C {
                c.log.Info("trigger reconfiguration")
                start := time.Now()
+               if start.Sub(last) < c.reconfigureSwitchInterval {
+                       c.log.Info("skipping reconfiguration because the last reconfiguration was too recent")
+                       continue
+               }
+
                err := c.reconfigureSwitch(host)
                elapsed := time.Since(start)
                c.log.Info("reconfiguration took", "elapsed", elapsed)
@@ -48,6 +55,8 @@ func (c *Core) ReconfigureSwitch() {
                        c.log.Error("notification about switch reconfiguration failed", "error", err)
                        c.metrics.CountError("reconfiguration-notification")
                }
+
+               last = time.Now()
        }
 }

majst01 · 2024-01-17T09:41:45Z

I have now switched to cron-based scheduling and will implement this logic accordingly.

Gerrit91 · 2024-01-18T13:14:48Z

References metal-stack/metal-roles#254

Gerrit91 · 2024-08-22T12:19:59Z

I think we decided not to do it because of too many negative implications. Vote for closing.

majst01 added 4 commits January 16, 2024 17:49

Spread reconfigure in one rack

afa0079

refactor to be testable

67787b2

Fix

e815d00

Better

8a10bd8

majst01 marked this pull request as ready for review January 17, 2024 06:45

majst01 requested a review from a team as a code owner January 17, 2024 06:45

Early return

d673caf

cron based sync

a3e3487

majst01 added 2 commits January 17, 2024 10:59

More

ae2bc37

Fix

9ab43a0

majst01 added the do-not-merge label Mar 22, 2024

majst01 marked this pull request as draft March 22, 2024 07:52

majst01 added 4 commits April 12, 2024 09:03

satisfy linter

15e49d6

use go 1.22

b16f517

Fix dockerfile

8e822f5

Fix action

438a85e
