## Description

### Problem
The `gcp_cloud_storage` sink does not expose a `content_encoding` configuration option. When `compression = "gzip"` is set, Vector always sets `Content-Encoding: gzip` on the uploaded GCS object. This triggers GCS decompressive transcoding, which causes `gsutil cp` (and any client that doesn't send `Accept-Encoding: gzip`) to automatically decompress the file on download.

The result: the downloaded file is plain text with a `.gz` extension. Tools like `gunzip` report `not in gzip format`, and downstream consumers that expect gzip-compressed files break.
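The failure mode is easy to confirm locally: a real gzip stream always begins with the magic bytes `0x1f 0x8b`, while a transcoded (already-decompressed) download begins with the original payload. A minimal stdlib sketch of that check (illustrative only, not Vector code):

```python
import gzip

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
GZIP_MAGIC = b"\x1f\x8b"

payload = b'{"event":"page_view"}\n'
compressed = gzip.compress(payload)

# What Secor-style storage serves back: still gzip, magic intact.
print(compressed[:2] == GZIP_MAGIC)   # True

# What decompressive transcoding serves back: the plain payload, no magic,
# which is exactly why `gunzip` reports "not in gzip format".
print(payload[:2] == GZIP_MAGIC)      # False
```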
The `aws_s3` sink already has a `content_encoding` option in `S3Options` that allows overriding the compression-derived `Content-Encoding` header. The `gcp_cloud_storage` sink is missing the equivalent functionality.
### Use Case
We are migrating from Secor (a Kafka-to-cloud-storage offloader) to Vector. Secor uploads gzip-compressed files to GCS with:

- `Content-Type: application/gzip`
- no `Content-Encoding` header

This means the files are stored as opaque gzip blobs: they remain compressed when downloaded and can be decompressed normally with `gunzip`, `zcat`, Python's `gzip` module, BigQuery, Spark, etc.

We need Vector to produce the same output: gzip-compressed files uploaded without `Content-Encoding: gzip`.
### Current Behavior

#### Config

```yaml
sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["kafka_in"]
    bucket: my-bucket
    encoding:
      codec: raw_message
    framing:
      method: newline_delimited
    compression: gzip
    content_type: "application/gzip"
    filename_extension: gz
    key_prefix: "data/{{ topic }}/dt=%F/hr=%H/"
```
#### Result: GCS object metadata

```
$ gsutil stat gs://my-bucket/data/events.foo/dt=2026-03-04/hr=09/1772617594-xxx.gz
Content-Encoding: gzip          ← this triggers decompressive transcoding
Content-Type: application/gzip
```
#### Result: broken download

```
$ gsutil cp gs://my-bucket/data/.../file.gz ./file.gz
$ gunzip file.gz
gunzip: file.gz: not in gzip format
$ file file.gz
file.gz: JSON data              ← gsutil already decompressed it!
```
### Expected Behavior

A `content_encoding` option (matching the `aws_s3` sink) that allows overriding or suppressing the `Content-Encoding` header:

```yaml
sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["kafka_in"]
    bucket: my-bucket
    encoding:
      codec: raw_message
    framing:
      method: newline_delimited
    compression: gzip
    content_type: "application/gzip"
    content_encoding: ""   # ← suppress the Content-Encoding header
    filename_extension: gz
    key_prefix: "data/{{ topic }}/dt=%F/hr=%H/"
```
#### Expected GCS object metadata

```
Content-Type: application/gzip
(no Content-Encoding header)
```
#### Expected download behavior

```
$ gsutil cp gs://my-bucket/data/.../file.gz ./file.gz
$ gunzip file.gz                ← works cleanly
$ cat file
{"event":"page_view","zid":"abc123",...}
{"event":"page_view","zid":"def456",...}
```
### Workaround Attempted: `encode_gzip` in a VRL transform

To avoid `Content-Encoding: gzip`, we tried compressing in a VRL `remap` transform and uploading with `compression: none`:

```yaml
transforms:
  batch_messages:
    type: reduce
    inputs: ["kafka_in"]
    group_by: [".topic"]
    expire_after_ms: 60000
    merge_strategies:
      message: concat_newline
  gzip_batch:
    type: remap
    inputs: ["batch_messages"]
    source: |
      .message = encode_gzip(string!(.message) + "\n")
sinks:
  gcs_out:
    type: gcp_cloud_storage
    inputs: ["gzip_batch"]
    encoding:
      codec: raw_message
    framing:
      method: bytes
    content_type: "application/gzip"
    filename_extension: gz
    batch:
      max_events: 1
      timeout_secs: 10
```
This avoids the `Content-Encoding` header, but introduces a secondary issue: Vector's encoding pipeline appends a trailing `\n` byte (`0x0a`) after the gzip stream footer. This causes `gunzip` to warn `trailing garbage ignored`:

```
$ xxd file.gz | tail -3
000001a0: b6ac 1bd2 747a 7ac3 dc50 d120 ab1a b83c  ....tzz..P. ...<
000001b0: fc02 0624 df31 db03 0000 0a               ...$.1.....
                                ^^
                                trailing 0x0a after the gzip CRC32+ISIZE footer
```
A correctly produced gzip file (from Secor) ends cleanly at the gzip footer:

```
$ xxd correct_file.gz | tail -3
000001a0: 5454 d5b6 7bc7 ac98 54c0 0b85 987f 0206  TT..{...T.......
000001b0: 24df 31db 0300 00                        $.1....
                     ^^
                     file ends at the gzip footer, no trailing bytes
```
The data inside the gzip stream is correct (all events decompress properly), but the trailing byte breaks strict gzip parsers and produces
warnings in standard tools.
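The byte-level symptom can be reproduced with nothing but Python's stdlib `gzip` module; this is a sketch of the symptom, not of Vector's encoder:

```python
import gzip

payload = b'{"event":"page_view","zid":"abc123"}\n'
clean = gzip.compress(payload)

# Simulate the framing layer appending a newline after the gzip footer.
framed = clean + b"\n"
print(framed[-1:])                 # b'\n' -- the stray 0x0a seen in the xxd dump

# Strict parsers reject the trailing byte outright (Python 3.8+).
try:
    gzip.decompress(framed)
except gzip.BadGzipFile as exc:
    print("strict parser:", exc)

# Dropping the single trailing 0x0a restores a clean stream.
assert gzip.decompress(framed[:-1]) == payload
```

Tolerant tools like `gunzip` merely warn and ignore the extra byte, which is why the data itself still decompresses; stricter consumers fail outright.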
### Proposal

1. **Add a `content_encoding` option to the `gcp_cloud_storage` sink**, matching the existing `aws_s3` sink behavior. When set to `""`, the `Content-Encoding` header should be omitted from the GCS upload request, even when `compression` is set.
2. **(Bonus) Investigate the trailing `\n` in `raw_message` + `bytes` framing.** When the `raw_message` codec is used with `framing.method = bytes` and `batch.max_events = 1`, a trailing `\n` byte is appended after the serialized message. For binary payloads (like pre-compressed gzip data), this corrupts the output. The `bytes` framing should write exactly the raw message bytes with no trailing characters.
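The semantics requested in item 1 (mirroring what this issue describes for `aws_s3`) can be stated as a small pure function; the names below are illustrative, not Vector's actual internals:

```python
from typing import Optional

def resolve_content_encoding(
    compression: Optional[str],
    content_encoding: Optional[str],
) -> Optional[str]:
    """Pick the Content-Encoding header value for an upload (None = omit).

    Sketch of the requested behavior, not Vector code:
    - option unset (None)   -> derive the header from `compression`
    - option set to ""      -> omit the header entirely
    - option set to a value -> use that value verbatim
    """
    if content_encoding is None:
        return compression        # e.g. "gzip" when compression = "gzip"
    if content_encoding == "":
        return None               # header suppressed
    return content_encoding

# Today's behavior: compression drives the header.
print(resolve_content_encoding("gzip", None))   # gzip
# Requested behavior: gzip-compressed bytes, but no header.
print(resolve_content_encoding("gzip", ""))     # None
```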
### References

- GCS decompressive transcoding docs
- `aws_s3` sink `content_encoding` option (existing precedent)
- #21795 (the same request for the Azure Blob Storage sink)
- Vector version: 0.53.0
### Environment

- Vector 0.53.0 (`timberio/vector:0.53.0-distroless-libc`)
- Sink: `gcp_cloud_storage`
- GCS bucket with standard storage class
- Kafka source → GCS sink pipeline