
Conversation

@barbacbd
Contributor

The destroy process sporadically runs into a ResourceInUseByAnotherResource error, but sometimes the dependent resource is not found by the cluster uninstaller. Now these resources will be removed when the error is encountered.
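
Roughly, the recovery path looks like this (a sketch only; handleDependentResourceError is the new helper added in this PR, while the surrounding call site and error-string check are illustrative):

// Illustrative call site: when a delete operation fails because another
// resource still references the target, hand the error to the new helper
// so it can parse out the blocking resource and delete it directly.
if err != nil && strings.Contains(err.Error(), "resourceInUseByAnotherResource") {
	if depErr := o.handleDependentResourceError(ctx, op, err); depErr != nil {
		o.Logger.Debugf("failed to handle dependent resource error: %v", depErr)
	}
}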

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 28, 2025
@openshift-ci-robot
Contributor

@barbacbd: This pull request explicitly references no jira issue.

In response to this:

The destroy process sporadically runs into a ResourceInUseByAnotherResource error, but sometimes the dependent resource is not found by the cluster uninstaller. Now these resources will be removed when the error is encountered.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@barbacbd
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 28, 2025
@barbacbd barbacbd force-pushed the gcp-destroy-hardening branch 4 times, most recently from 1b47b89 to b2c2788 on October 29, 2025 19:45
@barbacbd
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 2, 2025
@barbacbd barbacbd force-pushed the gcp-destroy-hardening branch from b2c2788 to 6d3bcca on November 3, 2025 13:08
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 3, 2025
…t resources

The destroy process sporadically runs into a ResourceInUseByAnotherResource error, but sometimes
the dependent resource is not found by the cluster uninstaller. Now these resources will be
removed when the error is encountered.
@barbacbd barbacbd force-pushed the gcp-destroy-hardening branch from 6d3bcca to d4801b2 on November 3, 2025 13:09
@openshift-ci
Contributor

openshift-ci bot commented Nov 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign patrickdillon for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@barbacbd barbacbd changed the title no-jira: Hardening the GCP destroy code to attempt to delete dependent resources CORS-4278: Hardening the GCP destroy code to attempt to delete dependent resources Nov 4, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 4, 2025

@barbacbd: This pull request references CORS-4278 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

The destroy process sporadically runs into a ResourceInUseByAnotherResource error, but sometimes the dependent resource is not found by the cluster uninstaller. Now these resources will be removed when the error is encountered.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@barbacbd
Contributor Author

barbacbd commented Nov 4, 2025

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Nov 4, 2025

@barbacbd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-default-config d4801b2 link false /test e2e-gcp-default-config
ci/prow/e2e-gcp-ovn-byo-vpc d4801b2 link false /test e2e-gcp-ovn-byo-vpc
ci/prow/e2e-gcp-xpn-dedicated-dns-project d4801b2 link false /test e2e-gcp-xpn-dedicated-dns-project
ci/prow/e2e-gcp-custom-endpoints d4801b2 link false /test e2e-gcp-custom-endpoints
ci/prow/okd-scos-e2e-aws-ovn d4801b2 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-gcp-custom-dns d4801b2 link false /test e2e-gcp-custom-dns
ci/prow/e2e-gcp-secureboot d4801b2 link false /test e2e-gcp-secureboot
ci/prow/e2e-gcp-ovn-xpn d4801b2 link false /test e2e-gcp-ovn-xpn
ci/prow/gcp-private d4801b2 link false /test gcp-private
ci/prow/e2e-gcp-ovn d4801b2 link true /test e2e-gcp-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tthvo
Member

tthvo commented Nov 5, 2025

/cc

@openshift-ci openshift-ci bot requested a review from tthvo November 5, 2025 02:40

func (o *ClusterUninstaller) deleteTargetPoolByName(ctx context.Context, resourceName string) error {
	items, err := o.listTargetPoolsWithFilter(ctx, "items(name),nextPageToken", func(pool *compute.TargetPool) bool {
		return strings.Contains(pool.Name, resourceName)
	})
Member

Other places (e.g. deleteSubnetworkByName in pkg/destroy/gcp/subnetwork.go) use exact comparison ==.

Should we make it consistent? Or is there a reason to use .Contains here?

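For comparison, the exact-match form would look like this (a sketch; same filter signature as in the diff above, mirroring the == comparison used in deleteSubnetworkByName):

// Exact-match variant: only the pool whose name equals resourceName matches.
items, err := o.listTargetPoolsWithFilter(ctx, "items(name),nextPageToken", func(pool *compute.TargetPool) bool {
	return pool.Name == resourceName
})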

func (o *ClusterUninstaller) deleteFirewallByName(ctx context.Context, resourceName string) error {
	items, err := o.listFirewallsWithFilter(ctx, "items(name),nextPageToken", func(item string) bool { return strings.Contains(item, resourceName) })
Member

Same question as above: should we make this consistent, since we currently use both == and .Contains in different places?


func (o *ClusterUninstaller) deleteTargetTCPProxyByName(ctx context.Context, resourceName string) error {
	items, err := o.listTargetTCPProxiesWithFilter(ctx, globalTargetTCPProxyResource, "items(name),nextPageToken", func(item string) bool {
Member

Will we ever run into a target TCP Proxy that is not type "targettcpproxy" at this point?

func (o *ClusterUninstaller) listForwardingRules(ctx context.Context, typeName string) ([]cloudResource, error) {
-	return o.listForwardingRulesWithFilter(ctx, typeName, "items(name,region,loadBalancingScheme),nextPageToken", o.isClusterResource)
+	return o.listForwardingRulesWithFilter(ctx, typeName, "items(name,region,loadBalancingScheme,labels),nextPageToken", func(item *compute.ForwardingRule) bool {
+		return o.isClusterResource(item.Name) && o.isOwnedResource(item.Labels)
Member

There are other places (e.g. listImages in pkg/destroy/gcp/image.go) that use a check like:

o.isClusterResource(item.Name) && !o.isSharedResource(item.Labels)

Should we make it consistent?

Comment on lines +122 to +124
	// address should have labels but none observed
	return strings.HasPrefix(item.Name, loadBalancerName)
})
Member

This has the same check as loadBalancerFilterFunc, right? Can we keep the original implementation? We can still keep the comment, though.

Comment on lines +466 to +481
splitDetails := strings.Split(errMsg, alreadyInUseStr)
resource := strings.ReplaceAll(
	strings.ReplaceAll(
		strings.ReplaceAll(splitDetails[len(splitDetails)-1], "\"", ""),
		"'", "",
	), ", resourceInUseByAnotherResource", "",
)
splitResource := strings.Split(resource, "/")

if len(splitResource) > 6 || len(splitResource) < 5 {
	return fmt.Errorf("dependent resource information unable to be parsed: %s", resource)
}
// global -> ex: projects/xxxxxxxxxxxxxxxx/global/resource-type/resource-name
// regional -> ex: projects/xxxxxxxxxxxxxx/region/region-name/resource-type/resource-name
location, resourceType, resourceName := splitResource[len(splitResource)-3], splitResource[len(splitResource)-2], splitResource[len(splitResource)-1]
o.Logger.Debugf("found dependent resource information: %s, %s, %s", location, resourceType, resourceName)
Member

This part can probably be refactored into a separate func, for example:

func (o *ClusterUninstaller) parseResourceFromError(errMsg string) (location, resourceType, resourceName string, err error) {}

But then I have the same question as before: is it future-proof to parse the error message?
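
A minimal sketch of that helper, assuming the parsing logic above moves in unchanged (alreadyInUseStr and the 5- or 6-segment path format come from this diff):

// parseResourceFromError extracts the location, type, and name of the
// dependent resource from a resourceInUseByAnotherResource error message.
// Sketch only: this mirrors the inline parsing shown above.
func (o *ClusterUninstaller) parseResourceFromError(errMsg string) (location, resourceType, resourceName string, err error) {
	splitDetails := strings.Split(errMsg, alreadyInUseStr)
	resource := strings.ReplaceAll(
		strings.ReplaceAll(
			strings.ReplaceAll(splitDetails[len(splitDetails)-1], "\"", ""),
			"'", "",
		), ", resourceInUseByAnotherResource", "",
	)
	splitResource := strings.Split(resource, "/")
	if len(splitResource) > 6 || len(splitResource) < 5 {
		return "", "", "", fmt.Errorf("dependent resource information unable to be parsed: %s", resource)
	}
	// global   -> projects/<project>/global/<resource-type>/<resource-name>
	// regional -> projects/<project>/regions/<region>/<resource-type>/<resource-name>
	return splitResource[len(splitResource)-3], splitResource[len(splitResource)-2], splitResource[len(splitResource)-1], nil
}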

case "targetTcpProxies":
deleteErr = o.deleteTargetTCPProxyByName(ctx, resourceName)
default:
deleteErr = fmt.Errorf("failed to find resource type: %s for %s", resourceType, resourceName)
Member

Suggested change
-	deleteErr = fmt.Errorf("failed to find resource type: %s for %s", resourceType, resourceName)
+	deleteErr = fmt.Errorf("unsupported resource type %q for dependent deletion", resourceType)

	if deleteErr != nil {
		return fmt.Errorf("failed to delete dependent resource: %w", deleteErr)
	}
}
Member

We should log when the deletion succeeds:

o.Logger.Infof("Successfully deleted dependent resource %s/%s", resourceType, resourceName)

Comment on lines +526 to +529
err = o.handleDependentResourceError(ctx, op, err)
if err != nil {
	o.Logger.Debugf("failed to handle dependent resource error: %v", err)
}
Member

We should still return the error here so that the destroy loop will try again to delete the resource next time, right?

depErr := o.handleDependentResourceError(ctx, op, err)
if depErr != nil {
    // If we successfully identified and attempted to delete a dependent resource,
    // return a more informative error
    return fmt.Errorf("failed to delete %s %s: %w (dependent resource handling: %w)", resourceType, item.name, err, depErr)
}
// If dependent resource was handled successfully, return original error
// The retry logic will attempt deletion again
return fmt.Errorf("failed to delete %s %s: %w", resourceType, item.name, err)

Member

I think the gcp job failures (for example, ci/prow/e2e-gcp-ovn, ci/prow/e2e-gcp-default-config) are happening because of nil pointer panics caused by this 😓
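
If it is the operation being dereferenced, a guard along these lines would avoid the panic (a sketch only; where the nil actually originates would need checking, and the fields are from the GCP compute API's Operation type):

// Hypothetical guard: only inspect the operation's error details when they
// are actually populated, instead of dereferencing op unconditionally.
if op != nil && op.Error != nil && len(op.Error.Errors) > 0 {
	// safe to read op.Error.Errors[0].Message here
}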

Comment on lines +535 to +538
err = o.handleDependentResourceError(ctx, op, err)
if err != nil {
	o.Logger.Debugf("failed to handle dependent resource error: %v", err)
}
Member

We should probably use a warning here:

depErr := o.handleDependentResourceError(ctx, op, err)
if depErr != nil {
    o.Logger.Warnf("Dependent resource cleanup failed for %s %s: %v", resourceType, item.name, depErr)
}
