decrypt API lacks context.Context support: slow KMS round-trips cannot be cancelled mid-call #2179

@trendvidia

Summary

decrypt.Data(data []byte, format string) ([]byte, error) (and the related decrypt.File) doesn't accept a context.Context. Consumers that bind sops into a request path or a boot sequence can't interrupt a stuck KMS round-trip — the call blocks until the underlying provider's own internal timeout fires (often 30s+).

Reproduction

ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
defer cancel()

// We can check ctx before calling…
if err := ctx.Err(); err != nil {
    return err
}

// …but once we're in here, the ctx is irrelevant.
plaintext, err := decrypt.Data(ciphertext, "yaml")
// If the KMS provider is hung, this blocks for ~30s regardless
// of the 100ms deadline we set above.

Affects every key service that calls out to a remote provider: AWS KMS, GCP KMS, Azure Key Vault, HashiCorp Vault. The underlying provider SDKs (aws-sdk-go-v2, cloud.google.com/go/kms, etc.) accept contexts; sops just doesn't thread one through.

Real-world impact

We hit this in chameleon, a layered-config library that wraps sops/decrypt for encrypted layer files. A hung GCP KMS round-trip during application boot blocks initialization indefinitely instead of failing fast. Our only workaround is to check ctx.Err() before invoking sops; once we're inside the call, we're stuck. This is documented as a known limitation on our side: chameleon README ("Sops API has no ctx.Context hook").

Proposed API

Add ctx-aware variants alongside the existing functions (backward compatible):

// New, in package decrypt:
func DataWithContext(ctx context.Context, data []byte, format string) ([]byte, error)
func FileWithContext(ctx context.Context, path, format string) ([]byte, error)

The existing context-less functions can delegate to the ctx variants with context.Background(). Internally, the ctx is threaded through to each key-service call, using whatever context hook the provider SDK already exposes (the AWS SDK's context-accepting calls, GCP's existing ctx-first methods, etc.).

For sops v4 / next major: replace the existing signatures.

Alternatives considered

  • Goroutine + select-on-channel — works but leaks goroutines on cancel (no way to interrupt the blocked provider call). Worse than no fix.
  • runtime.Goexit from a watchdog goroutine — actively destructive: leaks resources held by the provider SDK (connections, mutexes).
  • Documented "don't call from latency-sensitive paths" — true but unhelpful for boot-time use.

Scope estimate

Mechanical: every kms.Decrypt(...) / kv.Decrypt(...) call site inside the key-service implementations under keyservice/ already accepts a context, so threading one from a new DataWithContext entry point down to those call sites is a one-pass change. Tests need a fake key service that respects a cancel signal.

Happy to put up a PR if there's appetite — would appreciate a maintainer comment first on the API shape (separate WithContext functions vs. breaking change).
