
gcp_cloud_storage sink: add content_encoding option (parity with aws_s3) to prevent GCS decompressive transcoding #24841

@zedge-it

Description

Problem

The gcp_cloud_storage sink does not expose a content_encoding configuration option. When compression = "gzip" is set, Vector always
sets Content-Encoding: gzip on the uploaded GCS object. This triggers GCS decompressive transcoding, which causes gsutil cp (and any
client that doesn't send Accept-Encoding: gzip) to automatically decompress the file on download.

The result: the downloaded file is plain text with a .gz extension. Tools like gunzip report "not in gzip format", and downstream
consumers that expect gzip-compressed files break.

The aws_s3 sink already has a content_encoding option in S3Options that allows overriding the compression-derived Content-Encoding
header. The gcp_cloud_storage sink is missing this equivalent functionality.
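For reference, this is roughly how the existing aws_s3 option is used (a sketch; bucket name and values are illustrative, field placement per the aws_s3 sink docs):

```yaml
sinks:
  s3_out:
    type: aws_s3
    inputs: ["kafka_in"]
    bucket: my-bucket
    compression: gzip
    content_encoding: ""   # overrides the compression-derived Content-Encoding header
```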

Use Case

We are migrating from Secor (a Kafka-to-cloud-storage offloader) to Vector. Secor uploads gzip-compressed
files to GCS with:

  • Content-Type: application/gzip
  • No Content-Encoding header

This means the files are stored as opaque gzip blobs — they remain compressed when downloaded and can be decompressed normally with gunzip,
zcat, Python gzip, BigQuery, Spark, etc.
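"Opaque gzip blob" here just means the stored bytes are a complete gzip stream that any standard tool can detect and decompress; a minimal stdlib Python illustration (the JSON payload is made up):

```python
import gzip

# An "opaque gzip blob": compressed bytes stored and served as-is,
# with no Content-Encoding header attached.
blob = gzip.compress(b'{"event":"page_view","zid":"abc123"}\n')

# Standard tooling can detect and decompress it directly.
assert blob[:2] == b"\x1f\x8b"   # gzip magic number
assert gzip.decompress(blob) == b'{"event":"page_view","zid":"abc123"}\n'
```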

We need Vector to produce the same output: gzip-compressed files uploaded without Content-Encoding: gzip.

Current Behavior

Config

sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["kafka_in"]
    bucket: my-bucket
    encoding:
      codec: raw_message
      framing:
        method: newline_delimited
    compression: gzip
    content_type: "application/gzip"
    filename_extension: gz
    key_prefix: "data/{{ topic }}/dt=%F/hr=%H/"

Result — GCS object metadata

$ gsutil stat gs://my-bucket/data/events.foo/dt=2026-03-04/hr=09/1772617594-xxx.gz

    Content-Encoding:       gzip        ← this triggers decompressive transcoding
    Content-Type:           application/gzip

Result — broken download

$ gsutil cp gs://my-bucket/data/.../file.gz ./file.gz
$ gunzip file.gz
gunzip: file.gz: not in gzip format
$ file file.gz
file.gz: JSON data              ← gsutil already decompressed it!
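What gunzip and file(1) are reacting to can be reproduced with a few lines of stdlib Python (a sketch; the payload is illustrative):

```python
import gzip

# A real gzip stream starts with the two magic bytes 0x1f 0x8b; a
# transcoded download starts with the plain-text payload instead.
def is_gzip(data: bytes) -> bool:
    return data[:2] == b"\x1f\x8b"

stored = gzip.compress(b'{"event":"page_view"}\n')   # what the bucket holds
downloaded = gzip.decompress(stored)                 # what gsutil hands back after transcoding

assert is_gzip(stored)
assert not is_gzip(downloaded)   # hence "not in gzip format"
```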

Expected Behavior

A content_encoding option (matching the aws_s3 sink) that allows overriding or suppressing the Content-Encoding header:

sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["kafka_in"]
    bucket: my-bucket
    encoding:
      codec: raw_message
      framing:
        method: newline_delimited
    compression: gzip
    content_type: "application/gzip"
    content_encoding: ""                  # ← suppress Content-Encoding header
    filename_extension: gz
    key_prefix: "data/{{ topic }}/dt=%F/hr=%H/"

Expected GCS object metadata

    Content-Type:           application/gzip
                                          ← no Content-Encoding

Expected download behavior

$ gsutil cp gs://my-bucket/data/.../file.gz ./file.gz
$ gunzip file.gz                          ← works cleanly
$ cat file
{"event":"page_view","zid":"abc123",...}
{"event":"page_view","zid":"def456",...}

Workaround Attempted — encode_gzip in VRL transform

To avoid Content-Encoding: gzip, we tried compressing in a VRL remap transform and uploading with compression: none:

transforms:
  batch_messages:
    type: reduce
    inputs: ["kafka_in"]
    group_by: [".topic"]
    expire_after_ms: 60000
    merge_strategies:
      message: concat_newline
  gzip_batch:
    type: remap
    inputs: ["batch_messages"]
    source: |
      .message = encode_gzip(string!(.message) + "\n")

sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["gzip_batch"]
    encoding:
      codec: raw_message
      framing:
        method: bytes
    content_type: "application/gzip"
    filename_extension: gz
    batch:
      max_events: 1
      timeout_secs: 10

This avoids the Content-Encoding header, but introduces a secondary issue: Vector's encoding pipeline appends a trailing \n byte (0x0a) after
the gzip stream footer, which causes gunzip to warn "trailing garbage ignored":

$ xxd file.gz | tail -3
000001a0: b6ac 1bd2 747a 7ac3 dc50 d120 ab1a b83c  ....tzz..P. ...<
000001b0: fc02 0624 df31 db03 0000 0a              ...$.1.....
                                    ^^
                    trailing 0x0a after gzip CRC32+ISIZE footer

A correctly produced gzip file (from Secor) ends cleanly at the gzip footer:

$ xxd correct_file.gz | tail -3
000001a0: 5454 d5b6 7bc7 ac98 54c0 0b85 987f 0206  TT..{...T.......
000001b0: 24df 31db 0300 00                        $.1....
                              ^^
                    file ends at gzip footer — no trailing bytes

The data inside the gzip stream is correct (all events decompress properly), but the trailing byte breaks strict gzip parsers and produces
warnings in standard tools.
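Both halves of that claim (intact data, stray trailing byte) can be verified with stdlib zlib, which exposes leftover input after a complete gzip member via unused_data (the payload is illustrative):

```python
import gzip
import zlib

# Simulate Vector's output: a complete gzip member followed by the stray 0x0a.
payload = gzip.compress(b'{"event":"page_view"}\n') + b"\n"

# wbits=31 tells zlib to expect a gzip wrapper; after the member's
# CRC32+ISIZE footer, any remaining input lands in unused_data.
d = zlib.decompressobj(wbits=31)
body = d.decompress(payload)

assert body == b'{"event":"page_view"}\n'   # the data itself is intact
assert d.eof                                # the gzip member ended cleanly
assert d.unused_data == b"\n"               # ...but one byte trails the footer
```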

Proposal

1. Add content_encoding option to gcp_cloud_storage sink — matching the existing aws_s3 sink behavior. When set to "", the Content-Encoding
header should be omitted from the GCS upload request, even when compression is set.
2. (Bonus) Investigate trailing \n in raw_message + bytes framing — When the raw_message codec is used with framing.method = bytes and
batch.max_events = 1, a trailing \n byte is appended after the serialized message. For binary payloads (like pre-compressed gzip data), this
corrupts the output. The bytes framing should write exactly the raw message bytes with no trailing characters.
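Until the framing behavior is fixed, consumers could trim the stray byte themselves; a rough sketch (note the caveat: a legitimate gzip footer can itself end in 0x0a, since ISIZE is a raw little-endian length, so this blind trim is a stopgap rather than a general fix):

```python
import gzip

# Simulated Vector output: a valid gzip member plus the appended 0x0a.
corrupted = gzip.compress(b'{"event":"page_view"}\n') + b"\n"

# Naive repair: drop a single trailing newline if present.
repaired = corrupted[:-1] if corrupted.endswith(b"\n") else corrupted

assert gzip.decompress(repaired) == b'{"event":"page_view"}\n'
```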

References

- GCS decompressive transcoding docs
- aws_s3 sink content_encoding option — existing precedent
- #21795 — same request for Azure Blob Storage sink
- Vector version: 0.53.0

Environment

- Vector 0.53.0 (timberio/vector:0.53.0-distroless-libc)
- Sink: gcp_cloud_storage
- GCS bucket with standard storage class
- Kafka source → GCS sink pipeline
