docs: add docs for kafka sink auto evolve option #5824
guilleov wants to merge 2 commits into ClickHouse:main
Conversation
@guilleov is attempting to deploy a commit to the ClickHouse Team on Vercel. A member of the Team first needs to authorize it.
| `bufferCount` (since v1.3.6) | Number of records to buffer in memory before flushing to ClickHouse. `0` disables internal buffering. Buffering is not supported with `exactlyOnce=true`. | `"0"` |
| `bufferFlushTime` (since v1.3.6) | Maximum time in milliseconds to buffer records before flushing when `exactlyOnce=false`. `0` (the default) disables time-based flushing. Only effective when `bufferCount > 0`. | `"0"` |
| `reportInsertedOffsets` (since v1.3.6) | Returns only successfully inserted offsets from `preCommit` (instead of `currentOffsets`) when `exactlyOnce=false`. Does not apply when `ignorePartitionsWhenBatching=true`, where `currentOffsets` are still returned. | `"false"` |
| `auto.evolve` (since v1.3.7) | Automatically adds columns to the ClickHouse table when incoming records contain new fields not present in the table. See [Schema Evolution](#schema-evolution). | `"false"` |
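A minimal sink configuration enabling this option might look like the following. This is an illustrative sketch: the topic name and connection values are placeholders, and the exact set of connection properties should be checked against the connector's configuration reference.

```json
{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "topics": "events",
    "hostname": "clickhouse.example.com",
    "port": "8443",
    "database": "default",
    "auto.evolve": "true"
  }
}
```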
Please mention "schema" to make it clearer what is evolving. Maybe even use something like `schema.auto.column_creation`, because in the future we will have column alteration as well and will need to configure them separately.
1. For each batch of records, the connector compares the record schema against the table's column list.
2. If new fields are detected, it maps the Kafka Connect types to ClickHouse types and issues DDL.
3. If multiple schema versions appear in a single batch, the batch is split at schema boundaries: each sub-batch is flushed and the table is evolved before continuing.
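The splitting in step 3 can be sketched roughly as follows. This is a simplified model, not the connector's actual Java code; records are represented here as `(schema_version, value)` pairs.

```python
def split_at_schema_boundaries(records):
    """Split a batch into sub-batches so that each sub-batch
    contains records with a single schema version.

    In the real connector, each sub-batch would be flushed and the
    table evolved before the next sub-batch is processed.
    """
    batches = []
    current = []
    for schema_version, value in records:
        if current and current[-1][0] != schema_version:
            batches.append(current)  # schema boundary: close the sub-batch
            current = []
        current.append((schema_version, value))
    if current:
        batches.append(current)
    return batches
```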
This is risky: what if this split produces small batches and we get too many parts as a result?
What blocks us from adding more than one column to the table at once?
|---|---|---|
| `org.apache.kafka.connect.data.Decimal` | `Decimal(38, S)` | Scale from schema parameters |
| `org.apache.kafka.connect.data.Date` | `Date32` | |
| `org.apache.kafka.connect.data.Time` | `Int64` | |
This actually depends on the version: the latest versions of ClickHouse support the `Time` type.
When creating new columns, the connector maps Connect types to ClickHouse types as follows:
| Kafka Connect Type | ClickHouse Type | Notes |
|---|---|---|
Adding non-nullable columns will break backward compatibility, and only records with non-null fields will be inserted. Please make it clear whether `Nullable(...)` is really used.
| `STRING` / `BYTES` | `String` | |
| `ARRAY` | `Array(<element_type>)` | Recursive |
| `MAP` | `Map(<key_type>, <value_type>)` | Recursive |
| `STRUCT` | Not supported | Throws an error |
This is supported by our connector and should not throw an error:
https://github.com/ClickHouse/clickhouse-kafka-connect/blob/main/src/main/java/com/clickhouse/kafka/connect/sink/db/ClickHouseWriter.java#L574
Besides, STRUCT is used for unions, as in the case of union(String, bytes): https://github.com/ClickHouse/clickhouse-kafka-connect/blob/main/src/main/java/com/clickhouse/kafka/connect/sink/db/ClickHouseWriter.java#L251
Optional (nullable) fields are wrapped in `Nullable(...)`, except for `ARRAY` and `MAP` types, which [cannot be Nullable in ClickHouse](/sql-reference/data-types/nullable). Elements and values inside composite types can still be Nullable.
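The Nullable-wrapping rule can be sketched as a small recursive mapping. This is a simplified model for illustration only (a subset of Connect types, not the connector's real code):

```python
# Simplified Connect-to-ClickHouse type mapping illustrating the rule:
# optional scalars become Nullable(...), but Array/Map themselves never do;
# only their elements/values can be Nullable.
SCALAR_TYPES = {
    "STRING": "String",
    "BYTES": "String",
    "INT64": "Int64",
    "FLOAT64": "Float64",
    "BOOLEAN": "Bool",
}

def to_clickhouse_type(connect_type, optional=False,
                       element=None, key=None, value=None):
    if connect_type == "ARRAY":
        # The Array itself cannot be Nullable, but its elements can be.
        return f"Array({to_clickhouse_type(**element)})"
    if connect_type == "MAP":
        # Same for Map: only the value type may be wrapped.
        return (f"Map({to_clickhouse_type(**key)}, "
                f"{to_clickhouse_type(**value)})")
    ch = SCALAR_TYPES[connect_type]
    return f"Nullable({ch})" if optional else ch
```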
An optional field may have a default value, but as stated before, we have to create `Nullable` columns so as not to break inserts of older records.
The connector rejects schema evolution in the following cases with a clear error message:
- **Non-nullable field without a default value** - ClickHouse requires new columns to be either `Nullable` or have a `DEFAULT`.
Please describe how this should be configured:
- the specific Avro/Protobuf schema options
- the sink configuration
- **STRUCT fields** - Mapping Connect STRUCT to ClickHouse is non-trivial (could be Tuple, JSON, or Nested). Not supported for auto-evolution.
In most cases JSON is used, and it is easy to add as an option.
- **Schemaless or string records** - No Connect schema is available to derive ClickHouse types. Evolution is skipped with a warning.
This should throw an error, and the configuration doc should state clearly that evolution is available only with a schema.
Schema evolution is safe to use with multiple connector tasks. `ADD COLUMN IF NOT EXISTS` is idempotent: if two tasks race to add the same column, both succeed silently. DDL statements are executed with [`alter_sync=1`](/sql-reference/statements/alter#synchronicity-of-alter-queries) to wait for the local replica to apply the change. A retry loop on `DESCRIBE TABLE` (5 retries, 200ms backoff) handles propagation to other replicas.
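The propagation handling can be modeled as a simple retry loop. This is an illustrative sketch, not the connector's actual code; the retry count and backoff match the values quoted above, and `describe_table` stands in for a `DESCRIBE TABLE` query returning the current column names.

```python
import time

def describe_with_retry(describe_table, column, retries=5, backoff_s=0.2):
    """Poll the table description until the new column is visible,
    retrying to allow the DDL to propagate to other replicas."""
    for _ in range(retries):
        columns = describe_table()
        if column in columns:
            return columns
        time.sleep(backoff_s)
    raise RuntimeError(
        f"column {column!r} did not appear after {retries} retries")
```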
#### Limitations {#schema-evolution-limitations}
This should be at the top. As you can see, I've already asked questions that are covered here.
@chernser made some changes here also, since some parts are modified in the other PR.
Related to:
Check issue and PR description for a summary