feat(run): provides RunWithGracefulShutdown #564

liufuyang · 2023-11-16T15:37:41Z

So to allow users easily wire up a clean-up function that can be called BEFORE the root context is done.

closes #563

So with typical cloudrunner code initiated like this:

Before:

        err := cloudrunner.Run(func(ctx context.Context) error {
		grpcServer := cloudrunner.NewGRPCServer(ctx)
		cleanup := func() {
			logger := cloudrunner.Logger(ctx)
			select {
			case <-ctx.Done():
				logger.Warn("Root context is already done, cannot use client anymore.")
			default:
				logger.Info("Root context is still live.")
			}
		}

		defer cleanup()
		cloudrunner.Logger(ctx).Info("hello world")

		return cloudrunner.ListenGRPC(ctx, grpcServer)
	})

Output when cancel the process:

2023-11-16T16:35:50.491+0100	info	cloudrunner-go/run.go:173	up and running	{"config": {"cloudrunner": {"PORT": 8080, "K_SERVICE": "___Test_RunWithGracefulShutdown_helloWorld_ctx_cancel_should_before_clean_up_function_call2_in_go_einride_tech_cloudrunner.test", "K_REVISION": "", "K_CONFIGURATION": "", "CLOUD_RUN_JOB": "", "CLOUD_RUN_EXECUTION": "", "CLOUD_RUN_TASK_INDEX": 0, "CLOUD_RUN_TASK_ATTEMPT": 0, "CLOUD_RUN_TASK_COUNT": 0, "GOOGLE_CLOUD_PROJECT": "", "RUNTIME_SERVICEACCOUNT": "", "SERVICE_VERSION": "", "LOGGER_DEVELOPMENT": true, "LOGGER_LEVEL": "debug", "LOGGER_REPORTERRORS": false, "PROFILER_ENABLED": false, "PROFILER_MUTEXPROFILING": false, "PROFILER_ALLOCFORCEGC": true, "TRACEEXPORTER_ENABLED": false, "TRACEEXPORTER_TIMEOUT": "10s", "TRACEEXPORTER_SAMPLEPROBABILITY": 0.01, "METRICEXPORTER_ENABLED": false, "METRICEXPORTER_INTERVAL": "1m0s", "METRICEXPORTER_RUNTIMEINSTRUMENTATION": false, "METRICEXPORTER_HOSTINSTRUMENTATION": false, "METRICEXPORTER_OPENCENSUSPRODUCER": false, "SERVER_TIMEOUT": "4m50s", "CLIENT_TIMEOUT": "10s", "CLIENT_RETRY_ENABLED": true, "CLIENT_RETRY_INITIALBACKOFF": "200ms", "CLIENT_RETRY_MAXBACKOFF": "1m0s", "CLIENT_RETRY_MAXATTEMPTS": 5, "CLIENT_RETRY_BACKOFFMULTIPLIER": 2, "CLIENT_RETRY_RETRYABLESTATUSCODES": ["Unavailable", "Unknown"], "REQUESTLOGGER_MESSAGESIZELIMIT": 0, "REQUESTLOGGER_STATUSTOLEVEL": null}}, "resource": {"telemetry.sdk.language": "go", "telemetry.sdk.name": "opentelemetry", "telemetry.sdk.version": "1.20.0"}, "buildInfo": {"mainPath": "", "goVersion": "go1.21.3", "buildSettings": {}}}
2023-11-16T16:35:50.491+0100	info	cloudrunner-go/run_test.go:117	hello world
2023-11-16T16:35:50.491+0100	info	cloudrunner-go/run.go:220	watching for termination signals
2023-11-16T16:35:50.491+0100	info	cloudrunner-go/grpcserver.go:74	gRPCServer listening	{"address": ":8080"}
2023-11-16T16:35:54.624+0100	info	cloudrunner-go/run.go:227	got signal:	{"signal": "interrupt"}
2023-11-16T16:35:54.624+0100	info	cloudrunner-go/run.go:229	cloudrunner.Shutdown is not used. Canceling root context directly. Call RunWithGracefulShutdown(...) to enable graceful shutdown if preferred.
2023-11-16T16:35:54.624+0100	info	cloudrunner-go/grpcserver.go:71	gRPCServer shutting down
2023-11-16T16:35:54.624+0100	warn	cloudrunner-go/run_test.go:110	Root context is already done, cannot use client anymore.
2023-11-16T16:35:54.624+0100	info	cloudrunner-go/run.go:202	goodbye

Note

Noticing it says Root context is already done, cannot use client anymore.

After:

        err := cloudrunner.RunWithGracefulShutdown(func(ctx context.Context, shutdown *cloudrunner.Shutdown) error {
		grpcServer := cloudrunner.NewGRPCServer(ctx)
		cleanup := func() {
			logger := cloudrunner.Logger(ctx)
			select {
			case <-ctx.Done():
				logger.Warn("Root context is already done, cannot use client anymore.")
			default:
				logger.Info("Root context is still live.")
			}
		}

		shutdown.RegisterCancelFunc(cleanup)
		cloudrunner.Logger(ctx).Info("hello world")

		return cloudrunner.ListenGRPC(ctx, grpcServer)
	})

Output when cancel the process:

2023-11-16T16:33:04.269+0100	info	cloudrunner-go/run.go:173	up and running	{"config": {"cloudrunner": {"PORT": 8080, "K_SERVICE": "___Test_RunWithGracefulShutdown_helloWorld_ctx_cancel_should_before_clean_up_function_call2_in_go_einride_tech_cloudrunner.test", "K_REVISION": "", "K_CONFIGURATION": "", "CLOUD_RUN_JOB": "", "CLOUD_RUN_EXECUTION": "", "CLOUD_RUN_TASK_INDEX": 0, "CLOUD_RUN_TASK_ATTEMPT": 0, "CLOUD_RUN_TASK_COUNT": 0, "GOOGLE_CLOUD_PROJECT": "", "RUNTIME_SERVICEACCOUNT": "", "SERVICE_VERSION": "", "LOGGER_DEVELOPMENT": true, "LOGGER_LEVEL": "debug", "LOGGER_REPORTERRORS": false, "PROFILER_ENABLED": false, "PROFILER_MUTEXPROFILING": false, "PROFILER_ALLOCFORCEGC": true, "TRACEEXPORTER_ENABLED": false, "TRACEEXPORTER_TIMEOUT": "10s", "TRACEEXPORTER_SAMPLEPROBABILITY": 0.01, "METRICEXPORTER_ENABLED": false, "METRICEXPORTER_INTERVAL": "1m0s", "METRICEXPORTER_RUNTIMEINSTRUMENTATION": false, "METRICEXPORTER_HOSTINSTRUMENTATION": false, "METRICEXPORTER_OPENCENSUSPRODUCER": false, "SERVER_TIMEOUT": "4m50s", "CLIENT_TIMEOUT": "10s", "CLIENT_RETRY_ENABLED": true, "CLIENT_RETRY_INITIALBACKOFF": "200ms", "CLIENT_RETRY_MAXBACKOFF": "1m0s", "CLIENT_RETRY_MAXATTEMPTS": 5, "CLIENT_RETRY_BACKOFFMULTIPLIER": 2, "CLIENT_RETRY_RETRYABLESTATUSCODES": ["Unavailable", "Unknown"], "REQUESTLOGGER_MESSAGESIZELIMIT": 0, "REQUESTLOGGER_STATUSTOLEVEL": null}}, "resource": {"telemetry.sdk.language": "go", "telemetry.sdk.name": "opentelemetry", "telemetry.sdk.version": "1.20.0"}, "buildInfo": {"mainPath": "", "goVersion": "go1.21.3", "buildSettings": {}}}
2023-11-16T16:33:04.269+0100	info	cloudrunner-go/run.go:220	watching for termination signals
2023-11-16T16:33:04.269+0100	info	cloudrunner-go/run_test.go:117	hello world
2023-11-16T16:33:04.269+0100	info	cloudrunner-go/grpcserver.go:74	gRPCServer listening	{"address": ":8080"}
2023-11-16T16:33:43.245+0100	info	cloudrunner-go/run.go:227	got signal:	{"signal": "interrupt"}
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run.go:237	graceful shutdown has begun
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run_test.go:112	Root context is still live.
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run.go:241	Shutdown.shutdownFunc() has finished, meaning we will shutdown cleanly
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run.go:247	exiting by canceling root context due to shutdown signal
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/grpcserver.go:71	gRPCServer shutting down
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run.go:202	goodbye

Note

Noticing it now says Root context is still live.

liufuyang · 2023-11-16T16:01:05Z

run.go

 func Run(fn func(context.Context) error, options ...Option) (err error) {
-	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)


Old implementation makes a hard bond between the root context and SIGTERM, which means when a SIGTERM is received, the ctx will be canceled directly and user has no control of it.

New implementation breaks the bond, see trapShutdownSignal() for details. Taking some ideas from this post. https://medium.com/over-engineering/graceful-shutdown-with-go-http-servers-and-kubernetes-rolling-updates-6697e7db17cf

liufuyang · 2023-11-17T09:02:03Z

run.go

+// (default as 30 seconds) above this value to make sure Kubernetes can wait for enough time for graceful shutdown.
+// More info see here:
+// https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
+const gracefulShutdownMaxGracePeriod = time.Second * 10


Using 10 seconds here as in our preferred CloudRun environment the terminationGracePeriodSeconds seems not configurable and a fixed 10s is specified in the CloudRun doc.

https://cloud.google.com/run/docs/container-contract#instance-shutdown

Perhaps I should put this in the comment as well.

So to allow user easily wire up a clean up function that can be called BEFORE the root context is done.

quoral · 2024-01-05T11:41:36Z

Hey @liufuyang!

We've recently run into similar issues that this PR attempts to solve, and we've merged a very preliminary solution in #585. It does not solve the use-case for some things that you are aiming to solve in this PR however, like pull-based PubSub cloudrun instances.

There's work planned to implement those fixes down the line as well, so I'll keep you in the loop on how that goes with an upcoming ADR where we can discuss if that solution approaches a solution for your usecase. For now as I think we're gonna take that implementation in a different direction than these proposed changes, I'll close this PR and we can move the future discussions over to #563.

Thanks again!

liufuyang requested a review from a team as a code owner November 16, 2023 15:37

liufuyang force-pushed the ensure-graceful-shutdown branch 4 times, most recently from f8a9f4a to f66f10b Compare November 16, 2023 15:58

liufuyang commented Nov 16, 2023

View reviewed changes

liufuyang commented Nov 17, 2023

View reviewed changes

liufuyang requested a review from odsod November 17, 2023 09:03

liufuyang force-pushed the ensure-graceful-shutdown branch from f66f10b to d916cd3 Compare November 19, 2023 09:00

liufuyang changed the title ~~feat(run): provides RunWithGracefulShutdownHook~~ feat(run): provides RunWithGracefulShutdown Nov 19, 2023

feat(run): provides RunWithGracefulShutdown

a6d33d6

So to allow user easily wire up a clean up function that can be called BEFORE the root context is done.

liufuyang force-pushed the ensure-graceful-shutdown branch from d916cd3 to a6d33d6 Compare November 19, 2023 09:04

quoral closed this Jan 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(run): provides RunWithGracefulShutdown #564

feat(run): provides RunWithGracefulShutdown #564

liufuyang commented Nov 16, 2023 •

edited

Loading

liufuyang Nov 16, 2023 •

edited

Loading

liufuyang Nov 17, 2023

quoral commented Jan 5, 2024

		func Run(fn func(context.Context) error, options ...Option) (err error) {
		ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)

feat(run): provides RunWithGracefulShutdown #564

feat(run): provides RunWithGracefulShutdown #564

Conversation

liufuyang commented Nov 16, 2023 • edited Loading

Before:

After:

liufuyang Nov 16, 2023 • edited Loading

Choose a reason for hiding this comment

liufuyang Nov 17, 2023

Choose a reason for hiding this comment

quoral commented Jan 5, 2024

liufuyang commented Nov 16, 2023 •

edited

Loading

liufuyang Nov 16, 2023 •

edited

Loading