-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle pod delete with non-existent IPs #44
base: master
Are you sure you want to change the base?
Handle pod delete with non-existent IPs #44
Conversation
@dougbtv @crandles @nicklesimba @aneeshkp Please have a look and give your opinion. |
cmd/whereabouts.go
Outdated
@@ -79,8 +79,12 @@ func cmdDel(args *skel.CmdArgs) error { | |||
|
|||
_, err = storage.IPManagement(types.Deallocate, *ipamConf, args.ContainerID) | |||
if err != nil { | |||
logging.Errorf("Error deallocating IP: %s", err) | |||
return fmt.Errorf("Error deallocating IP: %s", err) | |||
if (strings.Contains(fmt.Sprintf("%s", err), "Did not find reserved IP for container")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would opt for returning a type we can compare against instead of string comparison: https://github.com/dougbtv/whereabouts/blob/d879bd867ede305511260028ae1c4c2f049271f3/pkg/allocate/allocate.go#L61-L63
Something like:
// IPNotFoundError represents an error finding a specific IP during the deallocation process for a container.
type IPNotFoundError struct {
containerID string
}
func (e *IPNotFoundError) Error() string {
return fmt.Sprintf("Did not find reserved IP for container %v", e.containerID)
}
and to compare:
_, err = storage.IPManagement(types.Deallocate, *ipamConf, args.ContainerID)
if err != nil {
if notFound := &(allocate.IPNotFoundError{}); errors.As(err, ¬Found) {
logging.Debugf("Cannot deallocate IP, it might not have been assigned by us. %s", err)
} else {
logging.Errorf("Error deallocating IP: %s", err)
return fmt.Errorf("Error deallocating IP: %s", err)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crandles Thanks!
Sounds good to me.
Also, I realized this PR has 2 unrelated patches clubbed together:
- 5cb45d9: To make whereabouts use hostNetwork by default
- 7083aa4: Handle non-existent IP del request gracefully
I'll split them into two PRs for better workflow and incorporate above comment to submit again.
Thanks!
Added the graceful handling for cases when the IPAM gets del requests for IPs it did not allocate or not managing. This situation can happen when whereabouts is added later as IPAM for additional CNI in a cluster already deployed and processing pod deletion. When this happens, kubelet keeps calling whereabouts with del request for IPs it did not allocate and whereabouts keep returning errors. This becomes endless loop between kubelet, which holds on references to such pods and periodically keep calling all CNI and IPAMs to cleanup. With this fix, whereabouts will log such requests and return gracefully ending possibility of such loops.
7083aa4
to
f49eb4d
Compare
ok, so ds deployment related change for using hostNetwork is now moved out this pull request: |
Added the graceful handling for cases when the IPAM gets del requests for IPs it did not allocate or not managing. This situation can happen when whereabouts is added later as IPAM for additional CNI in a cluster already deployed and processing pod deletion. When this happens, kubelet keeps calling whereabouts with del request for IPs it did not allocate and whereabouts keep returning errors. This becomes endless loop between kubelet, which holds on references to such pods and periodically keep calling all CNI and IPAMs to cleanup. With this fix, whereabouts will log such requests and return gracefully ending possibility of such loops.
@dougbtv @crandles @nicklesimba @aneeshkp Please have a look and give your opinion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM, but curious when this scenario occurs
Thanks @crandles for the review! As seen below, calico CNI and its IPAM was able to handle this situation gracefully with a warning. I felt whereabouts can also take a similar approach without erroring out on non-existent IPs. This will prevent endless attempts on kubelet to clean-up the pod sandbox. Logs:
|
Actually, this is also the expected behavior for IPAM as mentioned in the CNI spec here:
|
@ashish-billore ashish-billore @dougbtv , we are facing same issues when running cluster went down due to power outage . ErrorE1006 05:52:39.220542 5964 remote_runtime.go:128] StopPodSandbox "869c45444de0b540e1babd380ae6e642d6c446b0 Saw that same error log was thr and @crandles suggested some code changes. Let us know if this is checked in or what is the plan to do that. Appreciated |
@mkt-mohit Changes suggested in @crandles comments have been incorporated and I see it has been approved after that. We have been using whereabouts patched with changes as mentioned in this PR, in our setup to handle such cases. |
I have taken latest checked docker image, but still getting same error wherein it stuck in loop to delete. is there anything we can work on this from here? |
The latest docker image does not have this patch included, as this PR is reviewed and approved but not yet merged.
You need to build your own docker image of whereabouts with this patch included to try this fix. |
Sorry this got lagged behind! LGTM. Merging. |
Hrmmm, I also realize this conflicts with a recent change, so I need to rebase it. This solution is more ideal than what's currently there, I'll have to return to it. |
Seems problem with CI/CD system:
|
recheck |
Sorry this got lost. If it's still a concern, can you add a test and we can see if it still exists, then we can rebase, thanks! |
This patch makes whereabouts pods deployed as ds to use
hostNetwork by default. This is helpful as CNI and core
infra components use hostnetwork and hence avoid issues
such as fault in CNI config etc and also making debugging
much easier.