CORS-4278: Hardening the GCP destroy code to attempt to delete dependent resources #10035
Conversation
@barbacbd: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/hold
Force-pushed: 1b47b89 to b2c2788
/hold cancel
Force-pushed: b2c2788 to 6d3bcca
Commit message: CORS-4278: Hardening the GCP destroy code to attempt to delete dependent resources

The destroy process sporadically runs into a ResourceInUseByAnotherResource error, but sometimes the dependent resource is not found by the cluster uninstaller. Now these resources will be removed when the error is found.
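For context, a minimal sketch of how that error condition could be recognized. It assumes the compute API surfaces the failure as a *googleapi.Error whose error items carry the resourceInUseByAnotherResource reason string (the same string the parsing code in this PR strips from the message); the helper name is illustrative and is not part of this PR.

```go
package gcp

import (
	"errors"

	"google.golang.org/api/googleapi"
)

// isResourceInUseError is a sketch, not this PR's implementation: it reports
// whether an error from the compute API carries the
// "resourceInUseByAnotherResource" reason, under the assumption that the
// failure unwraps to a *googleapi.Error.
func isResourceInUseError(err error) bool {
	var apiErr *googleapi.Error
	if !errors.As(err, &apiErr) {
		return false
	}
	for _, item := range apiErr.Errors {
		if item.Reason == "resourceInUseByAnotherResource" {
			return true
		}
	}
	return false
}
```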
Force-pushed: 6d3bcca to d4801b2
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: (no approvals yet). The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@barbacbd: This pull request references CORS-4278, which is a valid jira issue. Warning: the referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest-required
@barbacbd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/cc
```go
func (o *ClusterUninstaller) deleteTargetPoolByName(ctx context.Context, resourceName string) error {
	items, err := o.listTargetPoolsWithFilter(ctx, "items(name),nextPageToken", func(pool *compute.TargetPool) bool {
		return strings.Contains(pool.Name, resourceName)
```
Other places (e.g. deleteSubnetworkByName in pkg/destroy/gcp/subnetwork.go) use an exact comparison (==). Should we make this consistent, or is there a reason to use .Contains here?
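For illustration, a minimal sketch of the two predicate styles being compared; the helper names are made up for this example and are not the installer's actual functions:

```go
package gcp

import "strings"

// exactMatch mirrors the == comparison used by e.g. deleteSubnetworkByName,
// while substringMatch mirrors the strings.Contains check in this PR. The
// substring form also catches derived names (such as "<name>-something"), at
// the risk of matching unrelated resources that merely contain the target name.
func exactMatch(name string) func(string) bool {
	return func(item string) bool { return item == name }
}

func substringMatch(name string) func(string) bool {
	return func(item string) bool { return strings.Contains(item, name) }
}
```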
```go
func (o *ClusterUninstaller) deleteFirewallByName(ctx context.Context, resourceName string) error {
	items, err := o.listFirewallsWithFilter(ctx, "items(name),nextPageToken", func(item string) bool { return strings.Contains(item, resourceName) })
```
Same question as above: should we make this consistent? Currently we use both == and .Contains in different places.
```go
func (o *ClusterUninstaller) deleteTargetTCPProxyByName(ctx context.Context, resourceName string) error {
	items, err := o.listTargetTCPProxiesWithFilter(ctx, globalTargetTCPProxyResource, "items(name),nextPageToken", func(item string) bool {
```
Will we ever run into a target TCP Proxy that is not type "targettcpproxy" at this point?
```diff
 func (o *ClusterUninstaller) listForwardingRules(ctx context.Context, typeName string) ([]cloudResource, error) {
-	return o.listForwardingRulesWithFilter(ctx, typeName, "items(name,region,loadBalancingScheme),nextPageToken", o.isClusterResource)
+	return o.listForwardingRulesWithFilter(ctx, typeName, "items(name,region,loadBalancingScheme,labels),nextPageToken", func(item *compute.ForwardingRule) bool {
+		return o.isClusterResource(item.Name) && o.isOwnedResource(item.Labels)
```
There are other places (e.g. listImages in pkg/destroy/gcp/image.go) that check like `o.isClusterResource(item.Name) && !o.isSharedResource(item.Labels)`. Should we make it consistent?
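To make the difference concrete, a small sketch assuming the usual GCP ownership label convention (a kubernetes-io-cluster-&lt;infraID&gt; key whose value is "owned" or "shared"); the helpers are stand-ins for the installer's methods and the label key is hypothetical:

```go
package gcp

// clusterLabelKey is a hypothetical ownership label key for this example.
const clusterLabelKey = "kubernetes-io-cluster-mycluster-abc12"

// isOwnedResource keeps only resources explicitly labeled as owned.
func isOwnedResource(labels map[string]string) bool {
	return labels[clusterLabelKey] == "owned"
}

// isSharedResource flags resources explicitly labeled as shared.
func isSharedResource(labels map[string]string) bool {
	return labels[clusterLabelKey] == "shared"
}

// The two checks under discussion disagree on resources that carry no
// ownership label at all: isOwnedResource(labels) skips them, while
// !isSharedResource(labels) would still select them for deletion.
```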
```go
	// address should have labels but none observed
	return strings.HasPrefix(item.Name, loadBalancerName)
})
```
This has the same check as loadBalancerFilterFunc, right? Can we keep the original implementation? We can still keep the comment, though.
```go
splitDetails := strings.Split(errMsg, alreadyInUseStr)
resource := strings.ReplaceAll(
	strings.ReplaceAll(
		strings.ReplaceAll(splitDetails[len(splitDetails)-1], "\"", ""),
		"'", "",
	), ", resourceInUseByAnotherResource", "",
)
splitResource := strings.Split(resource, "/")

if len(splitResource) > 6 || len(splitResource) < 5 {
	return fmt.Errorf("dependent resource information unable to be parsed: %s", resource)
}
// global -> ex: projects/xxxxxxxxxxxxxxxx/global/resource-type/resource-name
// regional -> ex: projects/xxxxxxxxxxxxxx/region/region-name/resource-type/resource-name
location, resourceType, resourceName := splitResource[len(splitResource)-3], splitResource[len(splitResource)-2], splitResource[len(splitResource)-1]
o.Logger.Debugf("found dependent resource information: %s, %s, %s", location, resourceType, resourceName)
```
This part can probably be refactored into a separate func, for example `func (o *ClusterUninstaller) parseResourceFromError(errMsg string) (location, resourceType, resourceName string, err error)`. But then I have the same question of whether it is future-proof to parse the error message.
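A minimal sketch of what the suggested helper could look like, lifted directly from the parsing logic in this diff. parseResourceFromError is the reviewer's proposed name; it is written here as a free function with alreadyInUseStr passed in so the sketch stays self-contained, whereas the actual code would presumably keep it as a method on ClusterUninstaller:

```go
package gcp

import (
	"fmt"
	"strings"
)

// parseResourceFromError is a sketch of the suggested refactor. It assumes the
// error message ends with a resource path in one of the formats noted in the
// diff's comments:
//   global   -> projects/xxxx/global/resource-type/resource-name
//   regional -> projects/xxxx/region/region-name/resource-type/resource-name
func parseResourceFromError(errMsg, alreadyInUseStr string) (location, resourceType, resourceName string, err error) {
	splitDetails := strings.Split(errMsg, alreadyInUseStr)
	resource := strings.NewReplacer(`"`, "", "'", "", ", resourceInUseByAnotherResource", "").
		Replace(splitDetails[len(splitDetails)-1])
	splitResource := strings.Split(resource, "/")
	if len(splitResource) > 6 || len(splitResource) < 5 {
		return "", "", "", fmt.Errorf("dependent resource information unable to be parsed: %s", resource)
	}
	n := len(splitResource)
	return splitResource[n-3], splitResource[n-2], splitResource[n-1], nil
}
```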
| case "targetTcpProxies": | ||
| deleteErr = o.deleteTargetTCPProxyByName(ctx, resourceName) | ||
| default: | ||
| deleteErr = fmt.Errorf("failed to find resource type: %s for %s", resourceType, resourceName) |
| deleteErr = fmt.Errorf("failed to find resource type: %s for %s", resourceType, resourceName) | |
| deleteErr = fmt.Errorf("unsupported resource type %q for dependent deletion", | |
| resourceType) |
```go
if deleteErr != nil {
	return fmt.Errorf("failed to delete dependent resource: %w", deleteErr)
}
```
We should have a log line when the deletion works:

```go
o.Logger.Infof("Successfully deleted dependent resource %s/%s", resourceType, resourceName)
```

```go
err = o.handleDependentResourceError(ctx, op, err)
if err != nil {
	o.Logger.Debugf("failed to handle dependent resource error: %v", err)
}
```
We should still return the error here so that the destroy loop will try again to delete the resource next time, right?

```go
depErr := o.handleDependentResourceError(ctx, op, err)
if depErr != nil {
	// If we successfully identified and attempted to delete a dependent resource,
	// return a more informative error.
	return fmt.Errorf("failed to delete %s %s: %w (dependent resource handling: %w)", resourceType, item.name, err, depErr)
}
// If the dependent resource was handled successfully, return the original error.
// The retry logic will attempt deletion again.
return fmt.Errorf("failed to delete %s %s: %w", resourceType, item.name, err)
```
I think the gcp jobs (for example, ci/prow/e2e-gcp-ovn and ci/prow/e2e-gcp-default-config) are failing due to nil pointer panics caused by this 😓
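For what it's worth, a sketch of the kind of nil guard that would avoid such a panic. It assumes the handler receives the *compute.Operation and error from the failed delete call (as in the diff above) and that the panic comes from dereferencing one of them; the helper name and guard are illustrative only, not the PR's code:

```go
package gcp

import (
	"google.golang.org/api/compute/v1"
)

// canHandleDependentResourceError is an illustrative guard, not the PR's code.
// Before parsing anything, it checks that there actually is an error and that
// the operation carries structured error details, so nothing is dereferenced
// when the delete call fails in a different way (or not at all).
func canHandleDependentResourceError(op *compute.Operation, err error) bool {
	if err == nil {
		return false // nothing failed, nothing to handle
	}
	if op == nil || op.Error == nil || len(op.Error.Errors) == 0 {
		return false // no structured error details to inspect
	}
	return true
}
```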
```go
err = o.handleDependentResourceError(ctx, op, err)
if err != nil {
	o.Logger.Debugf("failed to handle dependent resource error: %v", err)
}
```
We should probably use a warning here:

```go
depErr := o.handleDependentResourceError(ctx, op, err)
if depErr != nil {
	o.Logger.Warnf("Dependent resource cleanup failed for %s %s: %v", resourceType, item.name, depErr)
}
```