Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spread reconfigure in one rack #116

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from
Draft

Conversation

majst01
Copy link
Contributor

@majst01 majst01 commented Jan 16, 2024

We sporadically see network spikes which correlates with configuration reloads which happen at the same time in one rack.
With this, the start time of the reconfiguration is shifted by the half duration of the reconfiguration interval based on the hostname suffix of the switch. Switches which do not have a number as suffix are not spreading.

with this we have a 10 second spread with a 20 sec reconciliation interval

$ m sw ls
ID                    PARTITION   RACK               OS   STATUS   LAST SYNC 
fra-equ01-r01leaf01   fra-equ01   fra-equ01-rack01   🐢   ●        8s ago      
fra-equ01-r01leaf02   fra-equ01   fra-equ01-rack01   🐢   ●        18s ago     

Keep this PR open, but as draft because we do not have the actual need for this.

@majst01 majst01 marked this pull request as ready for review January 17, 2024 06:45
@majst01 majst01 requested a review from a team as a code owner January 17, 2024 06:45
@chbmuc
Copy link

chbmuc commented Jan 17, 2024

I would suggest to calculate the wait time instead of iterating. E.g. something like this:

func waitForTicker(hostname string, interval time.Duration) {
	var (
		index int
		err   error
	)

	index, err = strconv.Atoi(hostname[len(hostname)-1:])
	if err != nil {
		index = 1
		log.Warn("unable to parse leaf number from hostname, not spreading switch reloads", "hostname", hostname, "error", err)
		return
	}

	// Get the next time the ticker should tick
	now := time.Now()
	next := now.Truncate(interval).Add(interval)

	// If the index is odd, add half the interval to the next time
	if index%2 != 0 {
		next = next.Add(interval / 2)
	}

	// If the wait is longer than the interval, we can start earlier
	if wait > interval {
		wait = wait - interval
	}

	wait := next.Sub(now)
	log.Info("Waiting", wait, "until", next, "to start trigger")

	time.Sleep(wait)
}

@chbmuc
Copy link

chbmuc commented Jan 17, 2024

As an addition I would also skip the reconfiguration if it had just finished, to give BGP more time to propagate it's changes:

diff --git a/cmd/internal/core/reconfigure-switch.go b/cmd/internal/core/reconfigure-switch.go
index 27fd446..998a8f8 100644
--- a/cmd/internal/core/reconfigure-switch.go
+++ b/cmd/internal/core/reconfigure-switch.go
@@ -18,11 +18,18 @@ import (

 // ReconfigureSwitch reconfigures the switch.
 func (c *Core) ReconfigureSwitch() {
+       var last time.Time
+
        t := time.NewTicker(c.reconfigureSwitchInterval)
        host, _ := os.Hostname()
        for range t.C {
                c.log.Info("trigger reconfiguration")
                start := time.Now()
+               if start.Sub(last) < c.reconfigureSwitchInterval {
+                       c.log.Info("skiping reconfiguration because of last reconfiguration was too recent")
+                       continue
+               }
+
                err := c.reconfigureSwitch(host)
                elapsed := time.Since(start)
                c.log.Info("reconfiguration took", "elapsed", elapsed)
@@ -48,6 +55,8 @@ func (c *Core) ReconfigureSwitch() {
                        c.log.Error("notification about switch reconfiguration failed", "error", err)
                        c.metrics.CountError("reconfiguration-notification")
                }
+
+               last = time.Now()
        }
 }

@majst01
Copy link
Contributor Author

majst01 commented Jan 17, 2024

As an addition I would also skip the reconfiguration if it had just finished, to give BGP more time to propagate it's changes:

diff --git a/cmd/internal/core/reconfigure-switch.go b/cmd/internal/core/reconfigure-switch.go
index 27fd446..998a8f8 100644
--- a/cmd/internal/core/reconfigure-switch.go
+++ b/cmd/internal/core/reconfigure-switch.go
@@ -18,11 +18,18 @@ import (

 // ReconfigureSwitch reconfigures the switch.
 func (c *Core) ReconfigureSwitch() {
+       var last time.Time
+
        t := time.NewTicker(c.reconfigureSwitchInterval)
        host, _ := os.Hostname()
        for range t.C {
                c.log.Info("trigger reconfiguration")
                start := time.Now()
+               if start.Sub(last) < c.reconfigureSwitchInterval {
+                       c.log.Info("skiping reconfiguration because of last reconfiguration was too recent")
+                       continue
+               }
+
                err := c.reconfigureSwitch(host)
                elapsed := time.Since(start)
                c.log.Info("reconfiguration took", "elapsed", elapsed)
@@ -48,6 +55,8 @@ func (c *Core) ReconfigureSwitch() {
                        c.log.Error("notification about switch reconfiguration failed", "error", err)
                        c.metrics.CountError("reconfiguration-notification")
                }
+
+               last = time.Now()
        }
 }

Now switched to cron based scheduling, will implement this logic accordingly

@Gerrit91
Copy link
Contributor

References metal-stack/metal-roles#254

@majst01 majst01 marked this pull request as draft March 22, 2024 07:52
@Gerrit91
Copy link
Contributor

I think we decided not to do it because of too many negative implications. Vote for closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants