Rechecking pending Pods #195

Open

xagent003 wants to merge 1 commit into master from private/arjun/recheckPendingPod

Conversation

@xagent003 (Contributor) commented Feb 16, 2022

This fixes #182, while hopefully not re-breaking #162.

We shouldn't treat all pending Pods as "alive" and skip the check. The list of all Pods fetched earlier may be stale; in some observed scenarios it was taken several seconds before the ip-reconciler performs the isPodAlive check.

Instead, we can retry a Get on the individual Pod, in the hope that it now has its final IP/network annotations. So re-fetch the Pod, up to 5 times, if it is in the Pending state and the initial IP check fails. After that, do the IP matching check as before.
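
Roughly, the intended check looks like the sketch below (simplified stand-ins such as podWrapper, checkIpOnPod and refreshPod approximate the PR's helpers; this is not the exact diff):

// A rough sketch of the re-check logic described above. podWrapper,
// refreshPod and checkIpOnPod are simplified stand-ins for the PR's
// helpers, not the exact implementation.
package main

import "fmt"

const podRefreshRetries = 5

type podWrapper struct {
    ips   map[string]struct{} // IPs parsed from the pod's network-status annotation
    phase string              // e.g. "Pending", "Running"
}

func checkIpOnPod(pod *podWrapper, ip string) bool {
    _, found := pod.ips[ip]
    return found
}

// isPodAlive decides whether an IP reservation still belongs to a live pod.
// refreshPod re-reads the single pod from the API server (nil if it is gone).
func isPodAlive(livePod *podWrapper, ip string, refreshPod func() *podWrapper) bool {
    if checkIpOnPod(livePod, ip) {
        return true
    }
    if livePod.phase != "Pending" {
        return false
    }
    // The Pod is Pending and the listed copy has no IP yet: the earlier List
    // may be stale, so re-fetch the individual Pod a few times.
    podToMatch := livePod
    for retries := 0; retries < podRefreshRetries; retries++ {
        refreshed := refreshPod()
        if refreshed == nil {
            return false // pod is gone; safe to clean up
        }
        podToMatch = refreshed
        if refreshed.phase != "Pending" {
            break // pod settled; do one final IP check below
        }
        if checkIpOnPod(refreshed, ip) {
            return true // annotation showed up while still Pending
        }
    }
    return checkIpOnPod(podToMatch, ip)
}

func main() {
    stale := &podWrapper{ips: map[string]struct{}{}, phase: "Pending"}
    fresh := &podWrapper{ips: map[string]struct{}{"10.128.165.33": {}}, phase: "Pending"}
    fmt.Println(isPodAlive(stale, "10.128.165.33", func() *podWrapper { return fresh })) // true
}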

@xagent003 xagent003 force-pushed the private/arjun/recheckPendingPod branch 3 times, most recently from d8e231e to 1edaca7 on February 17, 2022 04:45
@xagent003 (Contributor, Author):

@dougbtv and @maiqueb, can you take a look? I verified that #162 did not re-break:

xagent@pop-os:~/luigi/samples/macvlan$ kubectl logs ip-reconciler-27417868-ldhrq
2022-02-17T04:28:03Z [debug] NewReconcileLooper - Kubernetes config file located at: /host/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig
2022-02-17T04:28:04Z [debug] successfully read the kubernetes configuration file located at: /host/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig
2022-02-17T04:28:04Z [debug] listing IP pools
2022-02-17T04:28:04Z [debug] Added IP 10.128.165.32 for pod default/web-0
2022-02-17T04:28:04Z [error] could not parse network annotation  for pod: default/web-1; error: unexpected end of JSON input
2022-02-17T04:28:04Z [debug] the IP reservation: IP: 10.128.165.32 is reserved for pod: default/web-0
2022-02-17T04:28:04Z [debug] pod reference default/web-0 matches allocation; Allocation IP: 10.128.165.32; PodIPs: map[10.128.165.32:{}]
2022-02-17T04:28:04Z [debug] the IP reservation: IP: 10.128.165.33 is reserved for pod: default/web-1
2022-02-17T04:28:04Z [debug] pod reference default/web-1 matches allocation; Allocation IP: 10.128.165.33; PodIPs: map[]
2022-02-17T04:28:04Z [debug] Re-fetching Pending Pod: default/web-1 IP-to-match: 10.128.165.33
2022-02-17T04:28:05Z [debug] Added IP 10.128.165.33 for pod default/web-1
2022-02-17T04:28:05Z [debug] Got refreshed pod: &{map[10.128.165.33:{}] Pending}
2022-02-17T04:28:05Z [debug] pod reference default/web-1 matches allocation; Allocation IP: 10.128.165.33; PodIPs: map[10.128.165.33:{}]
2022-02-17T04:28:05Z [debug] Pod now has IP annotation while in Pending
2022-02-17T04:28:05Z [debug] pod reference default/web-0 matches allocation; Allocation IP: 10.128.165.32; PodIPs: map[10.128.165.32:{}]
2022-02-17T04:28:05Z [debug] pod reference default/web-1 matches allocation; Allocation IP: 10.128.165.33; PodIPs: map[]
2022-02-17T04:28:05Z [debug] Re-fetching Pending Pod: default/web-1 IP-to-match: 10.128.165.33
2022-02-17T04:28:05Z [debug] Added IP 10.128.165.33 for pod default/web-1
2022-02-17T04:28:05Z [debug] Got refreshed pod: &{map[10.128.165.33:{}] Pending}
2022-02-17T04:28:05Z [debug] pod reference default/web-1 matches allocation; Allocation IP: 10.128.165.33; PodIPs: map[10.128.165.33:{}]
2022-02-17T04:28:05Z [debug] Pod now has IP annotation while in Pending
2022-02-17T04:28:05Z [debug] no IP addresses to cleanup
xagent@pop-os:~/luigi/samples/macvlan$ kubectl get ippools.whereabouts.cni.cncf.io 10.128.165.0-24 -o yaml
apiVersion: whereabouts.cni.cncf.io/v1alpha1
kind: IPPool
metadata:
  creationTimestamp: "2022-02-16T07:41:55Z"
  generation: 90
  managedFields:
  - apiVersion: whereabouts.cni.cncf.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:allocations:
          .: {}
          f:32:
            .: {}
            f:id: {}
            f:podref: {}
          f:33:
            .: {}
            f:id: {}
            f:podref: {}
          f:34:
            .: {}
            f:id: {}
            f:podref: {}
        f:range: {}
    manager: whereabouts
    operation: Update
    time: "2022-02-17T04:28:08Z"
  name: 10.128.165.0-24
  namespace: default
  resourceVersion: "183433"
  uid: 41f5ab5f-2b34-4a5a-94ba-71c604c0a1d2
spec:
  allocations:
    "32":
      id: f7da88514327587cdd23fb0c5d7596341d2d422dea75fd7650c82fd229eb8051
      podref: default/web-0
    "33":
      id: 286db456fecf6f516721347ba004e82a823e9e5768984340ff3b44fdfa0f3c67
      podref: default/web-1
    "34":
      id: 919134bbeb8a0b1128724bf280184b48d3f3bd35aaa7bae8727324b259cb11fb
      podref: default/web-2
  range: 10.128.165.0/24

As you can see, Pod web-1 was missing the IP annotation and was in the Pending phase. I refreshed the Pod; this time it had the IP annotation on the first re-fetch, and a match was found. So no bad cleanup happened.

@xagent003 (Contributor, Author) commented Feb 17, 2022

Ok, I verified that it also fixes #182

I shut down a node running a StatefulSet on a network whose IP pool was full, then force-deleted the Pod so it gets rescheduled. The rescheduled Pod remains in the ContainerCreating state with a "Could not allocate IP in range: ip: 10.128.165.32 / - 10.128.165.33" error.

Then the reconciler runs. It cleans up the stale IP (belonging to the Pod I force-deleted from the dead node), even though the new replica with the same podRef is in the ContainerCreating state, i.e. the Pending phase. The new replica can then be created and moves to Running.

As seen in the ip-reconciler logs:

2022-02-17T07:01:05Z [debug] Going to update the reserve list to: [IP: 10.128.165.32 is reserved for pod: default/web-0]
2022-02-17T07:01:05Z [debug] PF9: patch = [{remove /spec/allocations/33 <nil>}]
2022-02-17T07:01:05Z [debug] PF9: wrote patchdata: [{"op":"test","path":"/metadata/resourceVersion","value":"204380"},{"op":"remove","path":"/spec/allocations/33"}]
2022-02-17T07:01:05Z [debug] successfully cleanup IPs: [10.128.165.33]

I verified the IPPools. The old reservation:

spec:
  allocations:
    "32":
      id: 1e62cb66ff77fbf409633ca4cef7b2b641427c347eb2269cd23efaaba0bb1e00
      podref: default/web-0
    "33":
      id: 40d15102db0b75cad2faff2ccee483bd4e6c25fdd477af9caa2e97b71423b5f7
      podref: default/web-1
  range: 10.128.165.0/24

versus the new one:

spec:
  allocations:
    "32":
      id: 1e62cb66ff77fbf409633ca4cef7b2b641427c347eb2269cd23efaaba0bb1e00
      podref: default/web-0
    "33":
      id: 3d6225ea764ed6e3d5650110725c62aa41a22c935c9949ef1572b03c1f7e6bfd
      podref: default/web-1
  range: 10.128.165.0/24

So this fixes the new issue #182 without re-breaking the fix for #162 (skip Pods in the Pending phase).

@maiqueb (Collaborator) left a comment

Thanks for this patch; while it seems I'm making plenty of comments, I'm very happy with the quality of the code.

I wonder about the impact of this retry, especially since the retries are nearly immediate (no backoff is used).

I honestly don't expect this scenario to happen (famous last words ...), but if the cronjob is triggered while there are plenty of pending Pods (and they stay in the Pending state for long), we will overload the API server.
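
For reference, bounding these re-fetches with k8s.io/apimachinery's wait.ExponentialBackoff could look roughly like the sketch below (illustration only, not code from this PR; the condition func is a placeholder for the re-fetch-and-check step):

// Sketch only: rate-limiting the re-fetches with exponential backoff rather
// than immediate retries. The condition func is a placeholder for
// "re-fetch the Pod and check whether the expected IP annotation showed up".
package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    backoff := wait.Backoff{
        Duration: 50 * time.Millisecond, // initial delay
        Factor:   2.0,                   // 50ms, 100ms, 200ms, 400ms, ...
        Jitter:   0.1,
        Steps:    5, // roughly PodRefreshRetries
    }
    attempts := 0
    // ExponentialBackoff stops as soon as the condition returns true or an error.
    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        attempts++
        // placeholder: re-fetch the Pod here and return true once the IP appears
        return attempts == 3, nil
    })
    fmt.Println("attempts:", attempts, "err:", err)
}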

_, isFound := livePodIPs[ip]
return isFound || livePod.phase == v1.PodPending
isFound := checkIpOnPod(&livePod, podRef, ip)
@maiqueb (Collaborator):

Hmmmm, I'd rename the method - not 100% sure to what.

It might be personal, but I'm very used to prefixing methods that return booleans with is. What do you think about isIPonPod(...)?

return isFound || livePod.phase == v1.PodPending
isFound := checkIpOnPod(&livePod, podRef, ip)

if !isFound && (livePod.phase == v1.PodPending) {
@maiqueb (Collaborator):

nit: it could make sense to extract this into a helper func.

* so re-fetch the Pod 5x
*/
var podToMatch *podWrapper = &livePod
var retries int = 0
@maiqueb (Collaborator):

Isn't 0 the default?

nit: I personally would prefer the variables to be initialized using :=. I'm just more used to it.

Comment on lines 116 to 134
for retries < storage.PodRefreshRetries {
    retries += 1
    podToMatch := rl.refreshPod(livePodRef)
    if podToMatch == nil {
        logging.Debugf("Cleaning up...")
        return false
    } else if podToMatch.phase != v1.PodPending {
        logging.Debugf("Pending Pod is now in phase: %s", podToMatch.phase)
        break
    } else {
        isFound = checkIpOnPod(podToMatch, podRef, ip)
        // Short-circuit - Pending Pod may have IP now
        if isFound {
            logging.Debugf("Pod now has IP annotation while in Pending")
            return true
        }
    }
}
isFound = checkIpOnPod(podToMatch, podRef, ip)
@maiqueb (Collaborator):

Would it make sense to use exponential backoff here?

Also, I think this could be re-written with k8s primitives:

import "k8s.io/apimachinery/pkg/util/wait"

timeout := time.Second
pollInterval := 200 * time.Millisecond // 5 * 0.2 = 1

// PollImmediate tries a condition func until it returns true, an error, or the timeout
// is reached.
// ...
waitErr := wait.PollImmediate(pollInterval, timeout, func() (bool, error) {
    pod := rl.refreshPod(livePodRef)

    if pod == nil {
        logging.Debugf("Cleaning up...")
        return false, fmt.Errorf("pod is gone")
    } else if !isPodPending(pod) {
        logging.Debugf("Pending Pod is now in phase: %s", pod.phase)
        return true, nil
    } else {
        return isIPonPod(pod, podRef, ip), nil // I'm not entirely sure I did not oversimplify this ...
    }
})

@xagent003 (Contributor, Author):

I am hesitant to use a backoff because, remember, for #182, when the IP Pool is full we WANT it to reconcile. In fact, we don't need any retries at all in that case, because the container will never come up otherwise: it is in the Pending state, does not have an IP annotation either, and will match the podRef.

So adding more retries or a backoff would just add load, or delay the reconciler (and thus the blocked Pod) even longer.

Any suggestions on how to distinguish between these two scenarios? Or should we just retry once?

Comment on lines 162 to 163
var namespace string
var podName string
@maiqueb (Collaborator):

nit: I don't see a reason for declaring the variables here; you can just use := below.

Or even - once you know len(namespacedName) == 2 - return directly: return namespacedName[0], namespacedName[1].

Not sure what I wrote is easy to understand; I mean something like:

func splitPodRef(podRef string) (string, string) {
    namespacedName := strings.Split(podRef, "/")
    if len(namespacedName) != 2 {
        logging.Errorf("Failed to split podRef %s", podRef)
        return "", ""
    }

    return namespacedName[0], namespacedName[1]
}

@@ -13,7 +13,8 @@ var (
    RequestTimeout = 10 * time.Second

    // DatastoreRetries defines how many retries are attempted when updating the Pool
    DatastoreRetries  = 100
    PodRefreshRetries = 5
@maiqueb (Collaborator):

This might not be needed if we end up using wait.PollImmediate or wait.ExponentialBackoff.

Comment on lines +114 to +126
pod, err := i.clientSet.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
if err != nil {
    return nil, err
}

return pod, nil
@maiqueb (Collaborator):

Hm, maybe I'm missing something, but why don't you just return i.clientSet.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})?

@xagent003 (Contributor, Author) commented Feb 18, 2022

@maiqueb I'm just going to remove the retries altogether. No backoff.

Retrying means more unnecessary API calls and a longer time to reconcile for #182. In that case the Pod is Pending, has no IP annotation, and matches the podRef, but it will never come up until the old IP is reconciled, since the IP pool is full.

The retry only helps #162, which is a rarer bug to hit (the ip-reconciler has to run at the same time as Pods are coming up, and the Pending Pod has no IP annotation YET, as it is still in the process of coming up). So just re-fetching the individual Pod once should be good enough, as it makes an uncommon scenario even rarer.

I tried a few times, and whenever the race was hit, the Pod always had the IP on the first Get call.
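
For clarity, the no-retry variant amounts to something like the sketch below (the same kind of simplified stand-ins as in the PR description; not the exact diff):

// Sketch of the "single re-fetch, no retry loop" variant, using simplified
// stand-in types rather than the reconciler's actual code.
package main

import "fmt"

type podInfo struct {
    ips   map[string]struct{}
    phase string
}

func hasIP(p *podInfo, ip string) bool {
    _, ok := p.ips[ip]
    return ok
}

// stillAlive re-fetches a Pending Pod exactly once before deciding.
func stillAlive(listed *podInfo, ip string, refetch func() *podInfo) bool {
    if hasIP(listed, ip) {
        return true
    }
    if listed.phase != "Pending" {
        return false
    }
    refreshed := refetch()
    if refreshed == nil {
        return false // pod is gone; reclaim the allocation
    }
    // In the pool-full case (#182) the annotation is still missing, so the
    // stale allocation gets reclaimed on this single pass; in the race from
    // #162 the annotation has usually appeared by now.
    return hasIP(refreshed, ip)
}

func main() {
    blocked := &podInfo{ips: map[string]struct{}{}, phase: "Pending"}
    fmt.Println(stillAlive(blocked, "10.128.165.33", func() *podInfo { return blocked })) // false -> reclaim
}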

@xagent003 xagent003 force-pushed the private/arjun/recheckPendingPod branch from 1edaca7 to 72eb8ea on March 29, 2022 23:11
@xagent003 xagent003 force-pushed the private/arjun/recheckPendingPod branch from 72eb8ea to 69b4616 on March 30, 2022 03:39
@xagent003 (Contributor, Author):

@dougbtv @maiqueb Can you please take a look? I revised the PR. The test failure looks unrelated and is the same one my PR #180 is hitting.

Can you re-run/override?

@maiqueb (Collaborator) commented Mar 31, 2022

> @dougbtv @maiqueb Can you please take a look? I revised the PR. The test failure looks unrelated and is the same one my PR #180 is hitting.
>
> Can you re-run/override?

Would you rebase both PRs on #211 (which fixes the unit test issue you're facing)? We cannot override, since this is not executing any unit tests.


for retries < storage.PodRefreshRetries {
    retries += 1
    podToMatch = rl.refreshPod(livePodRef)
@maiqueb (Collaborator):

Would it make sense, instead of retrying now, to put the Pods in the Pending state into a list and, after the main loop (which handles all Pods that are not Pending), iterate over the ones that were Pending?

I am unsure if I am overthinking this.
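
As a rough sketch of that alternative (hypothetical, simplified types rather than the reconciler's actual structures):

// Sketch of the "defer Pending Pods to a second pass" idea, with simplified
// stand-in types rather than the reconciler's real code.
package main

import "fmt"

type allocation struct {
    podRef string
    ip     string
}

type podState struct {
    phase string
    ips   map[string]struct{}
}

func staleIPs(allocs []allocation, pods map[string]*podState, refetch func(ref string) *podState) []string {
    var stale []string
    var pending []allocation

    // First pass: handle every allocation whose Pod is not Pending.
    for _, a := range allocs {
        p, ok := pods[a.podRef]
        switch {
        case !ok:
            stale = append(stale, a.ip)
        case p.phase == "Pending":
            pending = append(pending, a) // decide later
        default:
            if _, found := p.ips[a.ip]; !found {
                stale = append(stale, a.ip)
            }
        }
    }

    // Second pass: Pending Pods have had extra time to receive their
    // annotations, so one re-fetch per Pod is usually enough.
    for _, a := range pending {
        p := refetch(a.podRef)
        if p == nil {
            stale = append(stale, a.ip)
            continue
        }
        if _, found := p.ips[a.ip]; !found {
            stale = append(stale, a.ip)
        }
    }
    return stale
}

func main() {
    pods := map[string]*podState{"default/web-1": {phase: "Pending", ips: map[string]struct{}{}}}
    allocs := []allocation{{podRef: "default/web-1", ip: "10.128.165.33"}}
    fmt.Println(staleIPs(allocs, pods, func(ref string) *podState { return pods[ref] })) // [10.128.165.33]
}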

@dougbtv (Member) commented Jun 23, 2022

Hey @xagent003, any chance we could see some tests that show this in action? With just one removed test, I'm not quite sure about it. Also, maybe some improvements to the reconciliation controller in the meantime.

Successfully merging this pull request may close these issues.

ip-reconciler issue with StatefulSet cleanup