Iceberg whitepaper #717

ivanyu · 2025-08-14T11:52:54Z

Maintainer

This is for discussing the Iceberg whitepaper.

stanislavkozlovski · 2025-08-14T16:45:52Z

Really great work! Out of curiosity - why was Avro chosen as the initial format? Is it because it's believed to be most commonly adopted amongst Kafka users?

Also how much additional work is it to add Protobuf support? I assume JSON is more work, but Protobuf perhaps may be similar to Avro? Happy to hear about any intricacies in supporting different format conversions into the Parquet schema

2 replies

ivanyu Aug 15, 2025
Maintainer Author

> Out of curiosity - why was Avro chosen as the initial format? Is it because it's believed to be most commonly adopted amongst Kafka users?

Yes, this was the main idea. It's the oldest (in the Kafka world) and has the biggest mind share out there, so it's rational to start with it. We'll be monitoring the demand for what's next.

ivanyu Aug 22, 2025
Maintainer Author

> Also how much additional work is it to add Protobuf support? I assume JSON is more work, but Protobuf perhaps may be similar to Avro? Happy to hear about any intricacies in supporting different format conversions into the Parquet schema

Hard to say without some preliminary research. There are lots of caveats with type conversions, and these formats are at slightly different levels of expressiveness. It will surely be less than what we have invested so far in Avro, since a lot of the work will be reusable.
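
As a rough illustration of the expressiveness gap mentioned above (a sketch under stated assumptions, not how the project does it): Protobuf has unsigned integer scalars that Avro and Iceberg lack, so a converter has to choose a widening or fallback rule for them. The mapping tables and the function name `to_iceberg_type` are hypothetical, and the entries marked "assumption" are illustrative choices only.

```python
# Hypothetical mapping of source primitive types to Iceberg primitive type names.
AVRO_TO_ICEBERG = {
    "boolean": "boolean",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "string",
}

PROTOBUF_TO_ICEBERG = {
    "bool": "boolean",
    "int32": "int",
    "sint32": "int",
    "sfixed32": "int",
    "int64": "long",
    "sint64": "long",
    "sfixed64": "long",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "string",
    # Assumption: widen unsigned 32-bit values to a signed 64-bit long.
    "uint32": "long",
    "fixed32": "long",
    # Unsigned 64-bit values have no lossless signed 64-bit target;
    # None marks "needs an explicit policy" in this sketch.
    "uint64": None,
    "fixed64": None,
}

TABLES = {"avro": AVRO_TO_ICEBERG, "protobuf": PROTOBUF_TO_ICEBERG}


def to_iceberg_type(source_format, type_name):
    """Return the Iceberg type name for a source primitive, or raise."""
    table = TABLES[source_format]
    if type_name not in table:
        raise KeyError(f"unknown {source_format} type: {type_name}")
    target = table[type_name]
    if target is None:
        raise ValueError(f"{source_format} type {type_name} has no lossless mapping")
    return target
```

The point is only that the Avro table is total over these primitives, while the Protobuf table needs extra decisions for its unsigned types.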

sap1ens · 2025-08-14T17:40:57Z

I know that the full scale benchmark is mentioned in the upcoming work section, but I'd love to understand the impact on read and write latency, e.g. compared with tiered storage without Iceberg support. Would you mind sharing anything?

0 replies

stanislavkozlovski · 2025-08-14T18:22:49Z

The README.md ought to be updated, I think.
Is this compatible with S3 Tables? I recall now that I was thinking about integrating these things together 8 months ago.
Are users expected to increase their long-term retention to match whatever their Iceberg table needs? AFAICT it's pretty usual to keep a year or so of data in these things. Does the new format mean we get way more efficient storage for long-term data?

0 replies

polyzos · 2025-08-16T17:03:01Z

I went through the whitepaper and also saw the upcoming work, but I have a few concerns and questions that go back to what @sap1ens mentioned about benchmarks with and without Iceberg tiered storage.

For one-way streaming, i.e., Kafka -> Iceberg, this approach might make sense, but I'm curious about the following:

The implementation lives within Kafka's brokers, whereas other solutions might use Flink behind the scenes. How do you scale this, considering it requires lots of extra CPU and memory for all the conversions from Avro to Parquet (row-oriented to column-oriented), with Parquet writers requiring lots of memory? In high-peak scenarios, this will only get worse, I guess.
Reading from remote storage is already slower, so how much bigger is the impact now that Parquet and more conversions are put into the mix? Considering that Kafka can only replay data, I suspect this will also have a big negative impact on consumer lag.
Offloading into Iceberg tables and later Delta tables: table format performance typically comes from techniques like partitioning and bucketing. However, it seems like the goal is to only offload everything into a "huge table", without a way to apply these techniques. So what's the actual benefit? The larger the topic/table, the harder it gets to query it, so downstream consumers will have to put in lots of work to make things "work".
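
To make the memory concern in the first point above concrete, here is a minimal sketch of row-to-column buffering. Parquet lays data out in row groups, with values stored column by column inside each group, so a writer has to hold a row group's worth of rows before it can write them out. `ColumnBuffer` and `row_group_size` are hypothetical names for illustration; real Parquet writers typically bound row groups by bytes and also encode and compress, which this sketch ignores.

```python
class ColumnBuffer:
    """Buffers row-oriented records as per-column lists until a row group is full."""

    def __init__(self, field_names, row_group_size):
        self.field_names = list(field_names)
        self.row_group_size = row_group_size
        self.columns = {name: [] for name in self.field_names}
        self.flushed = []  # stands in for row groups written to storage

    def buffered_rows(self):
        """Number of rows currently held in memory."""
        if not self.field_names:
            return 0
        return len(self.columns[self.field_names[0]])

    def append(self, record):
        """Pivot one row into the column lists; flush when the group is full."""
        for name in self.field_names:
            self.columns[name].append(record.get(name))
        if self.buffered_rows() >= self.row_group_size:
            self.flush()

    def flush(self):
        """Emit the buffered columns as one row group and reset the buffer."""
        if self.buffered_rows() == 0:
            return
        self.flushed.append(self.columns)
        self.columns = {name: [] for name in self.field_names}


buf = ColumnBuffer(["key", "value"], row_group_size=2)
for row in [{"key": 1, "value": "a"}, {"key": 2, "value": "b"}, {"key": 3, "value": "c"}]:
    buf.append(row)
```

In this sketch, memory held per open writer grows with `row_group_size` times the row width, and a broker would need one such buffer per segment being converted concurrently, which is the scaling question being asked.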

There are a few more thoughts, but I want to focus on these for now. What I'm looking to understand, I guess, is what the actual benefit of adding Iceberg or Delta Lake support natively into tiered storage is. A "single copy", yes, but in this scenario, does it indeed make sense and yield more benefits than drawbacks? Eager to get your thoughts.

0 replies
