
[BUG] container restarted, Could not allocate IP in range despite having reservation for existing Pod #291

Open
xagent003 opened this issue Jan 5, 2023 · 2 comments · May be fixed by #383

xagent003 commented Jan 5, 2023

@dougbtv @miguel Duarte de Mora Barroso is there a reason whereabouts should not return the existing IP reservation when the podRef matches? We are seeing more issues surrounding fully reserved IP pools.

I did some tests on node reboots and on restarting our stack's k8s services and kubelet. What I noticed is that kubelet recreates the container when it restarts, but we only see an ADD operation coming into the CNI/whereabouts. The ADD fails because the Pods already had IPs and the IP pool is full, so the Pod gets stuck in the ContainerCreating state:

E0104 22:37:06.684288    8702 remote_runtime.go:198] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox "a8d579efef9622a4f30486ff435fae4006022cb2c54941ba0a0bccda2385c6a9": plugin type="multus" name="multus-cni-network" failed (add): [default/asdasd-1:whereaboutsexample]: error adding container to network "whereaboutsexample": Error at storage engine: Could not allocate IP in range: ip: 10.128.165.32 / - 10.128.165.34 / range: net.IPNet{IP:net.IP{0xa, 0x80, 0xa5, 0x0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0x0}}"

But in whereabouts.log:

2023-01-04T22:37:04.504Z        DEBUG   ADD - IPAM configuration successfully read: {Name:whereaboutsexample Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[] DNS:{Nameservers:[] Domain: Search:[] Options:[]} Range:10.128.165.0/24 RangeStart:10.128.165.32 RangeEnd:10.128.165.34 GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts-macvlan165.log LogLevel:debug OverlappingRanges:true Gateway:<nil> Kubernetes:{KubeConfigPath:/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath: PodName:asdasd-1 PodNamespace:default}
2023-01-04T22:37:04.504Z        DEBUG   Beginning IPAM for ContainerID: a8d579efef9622a4f30486ff435fae4006022cb2c54941ba0a0bccda2385c6a9
...
2023-01-04T22:37:06.466Z        DEBUG   PF9: GetIpPool: &{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:10.128.165.0-24 GenerateName: Namespace:default SelfLink: UID:3854e207-0e34-4e73-80e5-8883ff039b90 ResourceVersion:140858 Generation:10 CreationTimestamp:2023-01-04 06:21:39 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[] ClusterName: ManagedFields:[{Manager:whereabouts Operation:Update APIVersion:whereabouts.cni.cncf.io/v1alpha1 Time:2023-01-04 22:19:18 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:spec":{".":{},"f:allocations":{".":{},"f:32":{".":{},"f:id":{},"f:podref":{}},"f:33":{".":{},"f:id":{},"f:podref":{}},"f:34":{".":{},"f:id":{},"f:podref":{}}},"f:range":{}}}}]} Spec:{Range:10.128.165.0/24 Allocations:map[32:{ContainerID:529a9ba352e94a553544ddbb838e17ae752c193c7306c71abc108076b2eeb773 PodRef:default/asdasd-0} 33:{ContainerID:0d6b3f6cfc602597bbea82ef00ec8804aa27bee17ca2b69f518191f70cb4af67 PodRef:default/asdasd-1} 34:{ContainerID:2a4b0e4ddd40e59d7693bb7aba317407246bce98e8875a9a0467f624484ed48d PodRef:default/asdasd-2}]}}
2023-01-04T22:37:06.466Z        DEBUG   PF9: Current Allocations: [IP: 10.128.165.32 is reserved for pod: default/asdasd-0 IP: 10.128.165.33 is reserved for pod: default/asdasd-1 IP: 10.128.165.34 is reserved for pod: default/asdasd-2]
2023-01-04T22:37:06.466Z        DEBUG   IterateForAssignment input >> ip: 10.128.165.32 | ipnet: {10.128.165.0 ffffff00} | first IP: 10.128.165.32 | last IP: 10.128.165.34
2023-01-04T22:37:06.466Z        ERROR   Error assigning IP: Could not allocate IP in range: ip: 10.128.165.32 / - 10.128.165.34 / range: net.IPNet{IP:net.IP{0xa, 0x80, 0xa5, 0x0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0x0}}

As you can see, Pod default/asdasd-1 already has an IP reservation (we added some custom logs to print the IP pool details). Stranger still, we don't see an ADD coming into whereabouts for pods asdasd-0 and asdasd-2, despite seeing logs in kubelet for all 3 Pod replicas:

I0104 22:37:04.270321    8702 kuberuntime_manager.go:487] "No sandbox for pod can be found. Need to start a new one" pod="default/asdasd-1"

We also don't see a DEL coming in, despite the original container dying and kubelet recreating it.

In any case, the underlying containers could be restarted or die for a variety of reasons. Rather than trying to fix why kubelet/the container runtime did not send a DEL, or why the container died, I think whereabouts should just return the existing IP if there is a matching podRef reservation.

In this case the ADD shouldn't fail, because the podRef already has a reservation and it is effectively the same Pod. Pod names should be unique for Deployments and DaemonSets, and in the case of a StatefulSet like the one above, we'd prefer it to just keep the same IP anyway.

The IP would still get cleaned up either when we delete the Pod gracefully, or via the ip-reconciler if the node is brought down ungracefully.
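
Roughly, what I have in mind is something like the sketch below at the ADD entry point: before running the normal allocation loop, check whether the pool already holds a reservation for the requesting podRef and hand that IP back. The type and helper here are illustrative (they only mirror the fields visible in the pool dump above), not whereabouts' actual API:

```go
// Illustrative sketch only: IPReservation mirrors the fields visible in the pool
// dump above (IP, ContainerID, PodRef); it is not the exact whereabouts type.
package allocate

import "net"

type IPReservation struct {
	IP          net.IP
	ContainerID string
	PodRef      string // "namespace/podname"
}

// reuseIfReserved is a hypothetical helper: if this pod already owns an IP in the
// pool (e.g. kubelet recreated the sandbox and only sent an ADD), return that IP
// with the new container ID recorded, instead of failing with
// "Could not allocate IP in range".
func reuseIfReserved(reservations []IPReservation, podRef, containerID string) (net.IP, bool) {
	for i := range reservations {
		if reservations[i].PodRef == podRef {
			reservations[i].ContainerID = containerID // record the new sandbox
			return reservations[i].IP, true
		}
	}
	return nil, false
}
```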

xagent003 (Contributor, Author) commented:

I think right here, where the loop just continues if an IP is already reserved, the function should check whether the podRef matches rather than skipping unconditionally: https://github.com/k8snetworkplumbingwg/whereabouts/blob/master/pkg/allocate/allocate.go#L215
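
For illustration only, here is roughly what that in-loop check could look like. This is a paraphrase, not the actual IterateForAssignment code; the types, helpers, and signature are assumptions:

```go
// Paraphrased sketch of the proposed check; not the actual whereabouts code.
package main

import (
	"bytes"
	"fmt"
	"net"
)

// reservation mirrors (approximately) the per-IP allocation entries shown in the
// pool dump above.
type reservation struct {
	ContainerID string
	PodRef      string
}

// nextIP returns ip+1 for an IPv4 address (sufficient for this sketch).
func nextIP(ip net.IP) net.IP {
	next := make(net.IP, len(ip))
	copy(next, ip)
	for i := len(next) - 1; i >= 0; i-- {
		next[i]++
		if next[i] != 0 {
			break
		}
	}
	return next
}

// assignInRange walks [first, last] and, instead of unconditionally skipping
// reserved IPs, returns the one already reserved for the requesting podRef.
func assignInRange(first, last net.IP, reserved map[string]reservation, podRef string) (net.IP, error) {
	for ip := first.To4(); bytes.Compare(ip, last.To4()) <= 0; ip = nextIP(ip) {
		if r, ok := reserved[ip.String()]; ok {
			if r.PodRef == podRef {
				return ip, nil // same pod re-added after a restart: reuse its IP
			}
			continue // reserved by a different pod: keep looking
		}
		return ip, nil // free IP found
	}
	return nil, fmt.Errorf("could not allocate IP in range: %s - %s", first, last)
}

func main() {
	// The pool state from the logs: all three IPs in the range are reserved.
	reserved := map[string]reservation{
		"10.128.165.32": {ContainerID: "529a9ba3", PodRef: "default/asdasd-0"},
		"10.128.165.33": {ContainerID: "0d6b3f6c", PodRef: "default/asdasd-1"},
		"10.128.165.34": {ContainerID: "2a4b0e4d", PodRef: "default/asdasd-2"},
	}
	// Today the ADD for default/asdasd-1 fails; with the podRef check it gets
	// its existing IP (10.128.165.33) back.
	ip, err := assignInRange(net.ParseIP("10.128.165.32"), net.ParseIP("10.128.165.34"), reserved, "default/asdasd-1")
	fmt.Println(ip, err)
}
```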


caribbeantiger commented Aug 29, 2023

Can we merge this fix into the master branch? We have been running into this full-IP-pool scenario a few times when a cloud node crashes.
