CORS-4278: Hardening the GCP destroy code to attempt to delete dependent resources #10035
Conversation
@barbacbd: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/hold
Force-pushed: 1b47b89 to b2c2788
/hold cancel
Force-pushed: b2c2788 to 6d3bcca
Commit message: CORS-4278: Hardening the GCP destroy code to attempt to delete dependent resources

The destroy process sporadically runs into a ResourceInUseByAnotherResource error, but sometimes the dependent resource is not found by the cluster uninstaller. Now these resources will be removed when the error is found.
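For context, a minimal sketch of how that error condition could be recognized. It assumes the compute API surfaces the failure as a *googleapi.Error whose error items carry the resourceInUseByAnotherResource reason string (the same string the parsing code in this PR strips from the message); the helper name is illustrative and is not part of this PR.

```go
package gcp

import (
	"errors"

	"google.golang.org/api/googleapi"
)

// isResourceInUseError is a sketch, not this PR's implementation: it reports
// whether an error from the compute API carries the
// "resourceInUseByAnotherResource" reason, under the assumption that the
// failure unwraps to a *googleapi.Error.
func isResourceInUseError(err error) bool {
	var apiErr *googleapi.Error
	if !errors.As(err, &apiErr) {
		return false
	}
	for _, item := range apiErr.Errors {
		if item.Reason == "resourceInUseByAnotherResource" {
			return true
		}
	}
	return false
}
```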
Force-pushed: 6d3bcca to d4801b2
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: (no approvals yet). The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@barbacbd: This pull request references CORS-4278, which is a valid jira issue. Warning: the referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest-required
@barbacbd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/cc
```go
func (o *ClusterUninstaller) deleteTargetPoolByName(ctx context.Context, resourceName string) error {
	items, err := o.listTargetPoolsWithFilter(ctx, "items(name),nextPageToken", func(pool *compute.TargetPool) bool {
		return strings.Contains(pool.Name, resourceName)
```
Other places (e.g. deleteSubnetworkByName in pkg/destroy/gcp/subnetwork.go) use an exact comparison (==). Should we make this consistent, or is there a reason to use .Contains here?
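For illustration, a minimal sketch of the two predicate styles being compared; the helper names are made up for this example and are not the installer's actual functions:

```go
package gcp

import "strings"

// exactMatch mirrors the == comparison used by e.g. deleteSubnetworkByName,
// while substringMatch mirrors the strings.Contains check in this PR. The
// substring form also catches derived names (such as "<name>-something"), at
// the risk of matching unrelated resources that merely contain the target name.
func exactMatch(name string) func(string) bool {
	return func(item string) bool { return item == name }
}

func substringMatch(name string) func(string) bool {
	return func(item string) bool { return strings.Contains(item, name) }
}
```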
```go
func (o *ClusterUninstaller) deleteFirewallByName(ctx context.Context, resourceName string) error {
	items, err := o.listFirewallsWithFilter(ctx, "items(name),nextPageToken", func(item string) bool { return strings.Contains(item, resourceName) })
```
Same question as above: should we make this consistent? Currently we use both == and .Contains in different places.
```go
func (o *ClusterUninstaller) deleteTargetTCPProxyByName(ctx context.Context, resourceName string) error {
	items, err := o.listTargetTCPProxiesWithFilter(ctx, globalTargetTCPProxyResource, "items(name),nextPageToken", func(item string) bool {
```
Will we ever run into a target TCP Proxy that is not type "targettcpproxy" at this point?
```diff
 func (o *ClusterUninstaller) listForwardingRules(ctx context.Context, typeName string) ([]cloudResource, error) {
-	return o.listForwardingRulesWithFilter(ctx, typeName, "items(name,region,loadBalancingScheme),nextPageToken", o.isClusterResource)
+	return o.listForwardingRulesWithFilter(ctx, typeName, "items(name,region,loadBalancingScheme,labels),nextPageToken", func(item *compute.ForwardingRule) bool {
+		return o.isClusterResource(item.Name) && o.isOwnedResource(item.Labels)
```
There are other places (e.g. listImages in pkg/destroy/gcp/image.go) that check like `o.isClusterResource(item.Name) && !o.isSharedResource(item.Labels)`. Should we make it consistent?
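To make the difference concrete, a small sketch assuming the usual GCP ownership label convention (a kubernetes-io-cluster-&lt;infraID&gt; key whose value is "owned" or "shared"); the helpers are stand-ins for the installer's methods and the label key is hypothetical:

```go
package gcp

// clusterLabelKey is a hypothetical ownership label key for this example.
const clusterLabelKey = "kubernetes-io-cluster-mycluster-abc12"

// isOwnedResource keeps only resources explicitly labeled as owned.
func isOwnedResource(labels map[string]string) bool {
	return labels[clusterLabelKey] == "owned"
}

// isSharedResource flags resources explicitly labeled as shared.
func isSharedResource(labels map[string]string) bool {
	return labels[clusterLabelKey] == "shared"
}

// The two checks under discussion disagree on resources that carry no
// ownership label at all: isOwnedResource(labels) skips them, while
// !isSharedResource(labels) would still select them for deletion.
```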
```go
	// address should have labels but none observed
	return strings.HasPrefix(item.Name, loadBalancerName)
})
```
This has the same check as loadBalancerFilterFunc, right? Can we keep the original implementation? We can still keep the comment, though.
```go
splitDetails := strings.Split(errMsg, alreadyInUseStr)
resource := strings.ReplaceAll(
	strings.ReplaceAll(
		strings.ReplaceAll(splitDetails[len(splitDetails)-1], "\"", ""),
		"'", "",
	), ", resourceInUseByAnotherResource", "",
)
splitResource := strings.Split(resource, "/")

if len(splitResource) > 6 || len(splitResource) < 5 {
	return fmt.Errorf("dependent resource information unable to be parsed: %s", resource)
}
// global -> ex: projects/xxxxxxxxxxxxxxxx/global/resource-type/resource-name
// regional -> ex: projects/xxxxxxxxxxxxxx/region/region-name/resource-type/resource-name
location, resourceType, resourceName := splitResource[len(splitResource)-3], splitResource[len(splitResource)-2], splitResource[len(splitResource)-1]
o.Logger.Debugf("found dependent resource information: %s, %s, %s", location, resourceType, resourceName)
```
This part can probably be refactored into a separate func, for example `func (o *ClusterUninstaller) parseResourceFromError(errMsg string) (location, resourceType, resourceName string, err error)`. But then I have the same question of whether it is future-proof to parse the error message.
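A minimal sketch of what the suggested helper could look like, lifted directly from the parsing logic in this diff. parseResourceFromError is the reviewer's proposed name; it is written here as a free function with alreadyInUseStr passed in so the sketch stays self-contained, whereas the actual code would presumably keep it as a method on ClusterUninstaller:

```go
package gcp

import (
	"fmt"
	"strings"
)

// parseResourceFromError is a sketch of the suggested refactor. It assumes the
// error message ends with a resource path in one of the formats noted in the
// diff's comments:
//   global   -> projects/xxxx/global/resource-type/resource-name
//   regional -> projects/xxxx/region/region-name/resource-type/resource-name
func parseResourceFromError(errMsg, alreadyInUseStr string) (location, resourceType, resourceName string, err error) {
	splitDetails := strings.Split(errMsg, alreadyInUseStr)
	resource := strings.NewReplacer(`"`, "", "'", "", ", resourceInUseByAnotherResource", "").
		Replace(splitDetails[len(splitDetails)-1])
	splitResource := strings.Split(resource, "/")
	if len(splitResource) > 6 || len(splitResource) < 5 {
		return "", "", "", fmt.Errorf("dependent resource information unable to be parsed: %s", resource)
	}
	n := len(splitResource)
	return splitResource[n-3], splitResource[n-2], splitResource[n-1], nil
}
```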
| case "targetTcpProxies": | ||
| deleteErr = o.deleteTargetTCPProxyByName(ctx, resourceName) | ||
| default: | ||
| deleteErr = fmt.Errorf("failed to find resource type: %s for %s", resourceType, resourceName) |
| deleteErr = fmt.Errorf("failed to find resource type: %s for %s", resourceType, resourceName) | |
| deleteErr = fmt.Errorf("unsupported resource type %q for dependent deletion", | |
| resourceType) |
```go
if deleteErr != nil {
	return fmt.Errorf("failed to delete dependent resource: %w", deleteErr)
}
```
We should have a log line when the deletion works:

```go
o.Logger.Infof("Successfully deleted dependent resource %s/%s", resourceType, resourceName)
```

```go
err = o.handleDependentResourceError(ctx, op, err)
if err != nil {
	o.Logger.Debugf("failed to handle dependent resource error: %v", err)
}
```
We should still return the error here so that the destroy loop will try again to delete the resource next time, right?

```go
depErr := o.handleDependentResourceError(ctx, op, err)
if depErr != nil {
	// If we successfully identified and attempted to delete a dependent resource,
	// return a more informative error.
	return fmt.Errorf("failed to delete %s %s: %w (dependent resource handling: %w)", resourceType, item.name, err, depErr)
}
// If the dependent resource was handled successfully, return the original error.
// The retry logic will attempt deletion again.
return fmt.Errorf("failed to delete %s %s: %w", resourceType, item.name, err)
```
I think the gcp jobs (for example, ci/prow/e2e-gcp-ovn and ci/prow/e2e-gcp-default-config) are failing due to nil pointer panics caused by this 😓
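For what it's worth, a sketch of the kind of nil guard that would avoid such a panic. It assumes the handler receives the *compute.Operation and error from the failed delete call (as in the diff above) and that the panic comes from dereferencing one of them; the helper name and guard are illustrative only, not the PR's code:

```go
package gcp

import (
	"google.golang.org/api/compute/v1"
)

// canHandleDependentResourceError is an illustrative guard, not the PR's code.
// Before parsing anything, it checks that there actually is an error and that
// the operation carries structured error details, so nothing is dereferenced
// when the delete call fails in a different way (or not at all).
func canHandleDependentResourceError(op *compute.Operation, err error) bool {
	if err == nil {
		return false // nothing failed, nothing to handle
	}
	if op == nil || op.Error == nil || len(op.Error.Errors) == 0 {
		return false // no structured error details to inspect
	}
	return true
}
```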
```go
err = o.handleDependentResourceError(ctx, op, err)
if err != nil {
	o.Logger.Debugf("failed to handle dependent resource error: %v", err)
}
```
We should probably use a warning here:

```go
depErr := o.handleDependentResourceError(ctx, op, err)
if depErr != nil {
	o.Logger.Warnf("Dependent resource cleanup failed for %s %s: %v", resourceType, item.name, depErr)
}
```