
gcp_cloud_storage sink: add content_encoding option (parity with aws_s3) to prevent GCS decompressive transcoding #24841

@zedge-it

Description

Problem

The gcp_cloud_storage sink does not expose a content_encoding configuration option. When compression = "gzip" is set, Vector always
sets Content-Encoding: gzip on the uploaded GCS object. This triggers GCS decompressive transcoding, which causes gsutil cp (and any
client that doesn't send Accept-Encoding: gzip) to automatically decompress the file on download.

The result: the downloaded file is plain text with a .gz extension. Tools like gunzip report "not in gzip format", and downstream
consumers that expect gzip-compressed files break.

The aws_s3 sink already has a content_encoding option in S3Options that allows overriding the compression-derived Content-Encoding
header. The gcp_cloud_storage sink is missing this equivalent functionality.
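For reference, this is roughly how the existing aws_s3 option is used (a sketch; bucket name and values are illustrative, field placement per the aws_s3 sink docs):

```yaml
sinks:
  s3_out:
    type: aws_s3
    inputs: ["kafka_in"]
    bucket: my-bucket
    compression: gzip
    content_encoding: ""   # overrides the compression-derived Content-Encoding header
```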

Use Case

We are migrating from Secor (a Kafka-to-cloud-storage offloader) to Vector. Secor uploads gzip-compressed
files to GCS with:

  • Content-Type: application/gzip
  • No Content-Encoding header

This means the files are stored as opaque gzip blobs — they remain compressed when downloaded and can be decompressed normally with gunzip,
zcat, Python gzip, BigQuery, Spark, etc.
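"Opaque gzip blob" here just means the stored bytes are a complete gzip stream that any standard tool can detect and decompress; a minimal stdlib Python illustration (the JSON payload is made up):

```python
import gzip

# An "opaque gzip blob": compressed bytes stored and served as-is,
# with no Content-Encoding header attached.
blob = gzip.compress(b'{"event":"page_view","zid":"abc123"}\n')

# Standard tooling can detect and decompress it directly.
assert blob[:2] == b"\x1f\x8b"   # gzip magic number
assert gzip.decompress(blob) == b'{"event":"page_view","zid":"abc123"}\n'
```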

We need Vector to produce the same output: gzip-compressed files uploaded without Content-Encoding: gzip.

Current Behavior

Config

sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["kafka_in"]
    bucket: my-bucket
    encoding:
      codec: raw_message
      framing:
        method: newline_delimited
    compression: gzip
    content_type: "application/gzip"
    filename_extension: gz
    key_prefix: "data/{{ topic }}/dt=%F/hr=%H/"

Result — GCS object metadata

$ gsutil stat gs://my-bucket/data/events.foo/dt=2026-03-04/hr=09/1772617594-xxx.gz

    Content-Encoding:       gzip        ← this triggers decompressive transcoding
    Content-Type:           application/gzip

Result — broken download

$ gsutil cp gs://my-bucket/data/.../file.gz ./file.gz
$ gunzip file.gz
gunzip: file.gz: not in gzip format
$ file file.gz
file.gz: JSON data              ← gsutil already decompressed it!
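What gunzip and file(1) are reacting to can be reproduced with a few lines of stdlib Python (a sketch; the payload is illustrative):

```python
import gzip

# A real gzip stream starts with the two magic bytes 0x1f 0x8b; a
# transcoded download starts with the plain-text payload instead.
def is_gzip(data: bytes) -> bool:
    return data[:2] == b"\x1f\x8b"

stored = gzip.compress(b'{"event":"page_view"}\n')   # what the bucket holds
downloaded = gzip.decompress(stored)                 # what gsutil hands back after transcoding

assert is_gzip(stored)
assert not is_gzip(downloaded)   # hence "not in gzip format"
```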

Expected Behavior

A content_encoding option (matching the aws_s3 sink) that allows overriding or suppressing the Content-Encoding header:

sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["kafka_in"]
    bucket: my-bucket
    encoding:
      codec: raw_message
      framing:
        method: newline_delimited
    compression: gzip
    content_type: "application/gzip"
    content_encoding: ""                  # ← suppress Content-Encoding header
    filename_extension: gz
    key_prefix: "data/{{ topic }}/dt=%F/hr=%H/"

Expected GCS object metadata

    Content-Type:           application/gzip
                                          ← no Content-Encoding

Expected download behavior

$ gsutil cp gs://my-bucket/data/.../file.gz ./file.gz
$ gunzip file.gz                          ← works cleanly
$ cat file
{"event":"page_view","zid":"abc123",...}
{"event":"page_view","zid":"def456",...}

Workaround Attempted — encode_gzip in VRL transform

To avoid Content-Encoding: gzip, we tried compressing in a VRL remap transform and uploading with compression: none:

transforms:
  batch_messages:
    type: reduce
    inputs: ["kafka_in"]
    group_by: [".topic"]
    expire_after_ms: 60000
    merge_strategies:
      message: concat_newline
  gzip_batch:
    type: remap
    inputs: ["batch_messages"]
    source: |
      .message = encode_gzip(string!(.message) + "\n")

sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["gzip_batch"]
    encoding:
      codec: raw_message
      framing:
        method: bytes
    content_type: "application/gzip"
    filename_extension: gz
    batch:
      max_events: 1
      timeout_secs: 10

This avoids the Content-Encoding header, but introduces a secondary issue: Vector's encoding pipeline appends a trailing \n byte (0x0a) after
the gzip stream footer, which causes gunzip to warn "trailing garbage ignored":

$ xxd file.gz | tail -3
000001a0: b6ac 1bd2 747a 7ac3 dc50 d120 ab1a b83c  ....tzz..P. ...<
000001b0: fc02 0624 df31 db03 0000 0a              ...$.1.....
                                    ^^
                    trailing 0x0a after gzip CRC32+ISIZE footer

A correctly produced gzip file (from Secor) ends cleanly at the gzip footer:

$ xxd correct_file.gz | tail -3
000001a0: 5454 d5b6 7bc7 ac98 54c0 0b85 987f 0206  TT..{...T.......
000001b0: 24df 31db 0300 00                        $.1....
                              ^^
                    file ends at gzip footer — no trailing bytes

The data inside the gzip stream is correct (all events decompress properly), but the trailing byte breaks strict gzip parsers and produces
warnings in standard tools.
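Both halves of that claim (intact data, stray trailing byte) can be verified with stdlib zlib, which exposes leftover input after a complete gzip member via unused_data (the payload is illustrative):

```python
import gzip
import zlib

# Simulate Vector's output: a complete gzip member followed by the stray 0x0a.
payload = gzip.compress(b'{"event":"page_view"}\n') + b"\n"

# wbits=31 tells zlib to expect a gzip wrapper; after the member's
# CRC32+ISIZE footer, any remaining input lands in unused_data.
d = zlib.decompressobj(wbits=31)
body = d.decompress(payload)

assert body == b'{"event":"page_view"}\n'   # the data itself is intact
assert d.eof                                # the gzip member ended cleanly
assert d.unused_data == b"\n"               # ...but one byte trails the footer
```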

Proposal

1. Add content_encoding option to gcp_cloud_storage sink — matching the existing aws_s3 sink behavior. When set to "", the Content-Encoding
header should be omitted from the GCS upload request, even when compression is set.
2. (Bonus) Investigate trailing \n in raw_message + bytes framing — When the raw_message codec is used with framing.method = bytes and
batch.max_events = 1, a trailing \n byte is appended after the serialized message. For binary payloads (like pre-compressed gzip data), this
corrupts the output. The bytes framing should write exactly the raw message bytes with no trailing characters.
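Until the framing behavior is fixed, consumers could trim the stray byte themselves; a rough sketch (note the caveat: a legitimate gzip footer can itself end in 0x0a, since ISIZE is a raw little-endian length, so this blind trim is a stopgap rather than a general fix):

```python
import gzip

# Simulated Vector output: a valid gzip member plus the appended 0x0a.
corrupted = gzip.compress(b'{"event":"page_view"}\n') + b"\n"

# Naive repair: drop a single trailing newline if present.
repaired = corrupted[:-1] if corrupted.endswith(b"\n") else corrupted

assert gzip.decompress(repaired) == b'{"event":"page_view"}\n'
```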

References

- GCS decompressive transcoding docs
- aws_s3 sink content_encoding option — existing precedent
- #21795 — same request for Azure Blob Storage sink
- Vector version: 0.53.0

Environment

- Vector 0.53.0 (timberio/vector:0.53.0-distroless-libc)
- Sink: gcp_cloud_storage
- GCS bucket with standard storage class
- Kafka source → GCS sink pipeline
