Add Hudi sink connector support #4164
Conversation
@danny0405 @cshuo FYI

Changes here will require Hudi 1.1.0 to be released first.
@voonhous thanks for the PR. Can you also describe the scope of the PR for the Hudi CDC sink, e.g., which index types and table service (compaction) modes are supported?
 */
private void processFlushForTableFunction(
        EventBucketStreamWriteFunction tableFunction, Event flushEvent) {
    try {
No need to use reflection now? Call tableFunction.flushRemaining(false); directly.
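A minimal before/after sketch of that suggestion; the class below is a hypothetical stand-in for EventBucketStreamWriteFunction, not the real one:

```java
// Hypothetical stand-in for EventBucketStreamWriteFunction; only the method
// under discussion is modeled here.
class BucketWriteFunctionStub {
    /** Flushes buffered records; the flag marks whether this is the final flush. */
    public void flushRemaining(boolean endInput) {
        // ... write out buffered records for this table ...
    }
}

class FlushCallSketch {
    // Before (earlier revision of the PR): reflective invocation.
    static void flushViaReflection(BucketWriteFunctionStub fn) throws Exception {
        java.lang.reflect.Method m =
                BucketWriteFunctionStub.class.getMethod("flushRemaining", boolean.class);
        m.invoke(fn, false);
    }

    // After (reviewer's suggestion): just call the method directly.
    static void flushDirectly(BucketWriteFunctionStub fn) {
        fn.flushRemaining(false);
    }
}
```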
Done
}

// Extract record key from event data using cached field getters
String recordKey = extractRecordKeyFromEvent(dataChangeEvent);
The record key can be obtained from HoodieFlinkInternalRow directly by calling HoodieFlinkInternalRow#getRecordKey(). So extractRecordKeyFromEvent is unnecessary, and primaryKeyFieldGetters can be removed.
Done
/** Base infrastructures for streaming writer function to handle Events. */
public abstract class EventStreamWriteFunction extends AbstractStreamWriteFunction<Event>
        implements EventProcessorFunction {
We should make minimal changes to StreamWriteFunction and BucketStreamWriteFunction; the generic type should be kept as HoodieFlinkInternalRow. We can confine operations on Event to MultiTableEventStreamWriteFunction, and StreamWriteFunction only needs to provide the following operations (rough sketch below):
- processData(HoodieFlinkInternalRow): DataChangeEvent can be converted to HoodieFlinkInternalRow in MultiTableEventStreamWriteFunction.
- flushRemaining(): called when a flush event is received.
- updateSchema()?: called when a schema change event is received and the inner schema or related fields (like index fields) need to be updated.
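A rough sketch of the surface being proposed; the method names follow the list above, but HoodieFlinkInternalRow is replaced by a hypothetical stub so the snippet stands alone:

```java
// Hypothetical placeholder so the snippet is self-contained; stands in for
// Hudi's HoodieFlinkInternalRow.
class HoodieRowStub {}

// Per-table writer keeps Hudi's original generic type and never sees CDC Events;
// MultiTableEventStreamWriteFunction does the Event -> row conversion upstream.
abstract class PerTableWriteFunctionSketch {
    /** Buffers/writes one already-converted row. */
    abstract void processData(HoodieRowStub row);

    /** Called when a flush event is received. */
    abstract void flushRemaining();

    /** Called when a schema change event is received; refreshes the inner schema and index fields. */
    abstract void updateSchema();
}
```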
Seems there is no need to implement EventProcessorFunction; actually, processSchemaChange and processFlush of EventStreamWriteFunction will never be called.
Okay, I tried this; I remember what the problem was:
Event to HoodieFlinkInternalRow conversion in MultiTableEventStreamWriteFunction.
- The HoodieFlinkInternalRow constructor requires fileId and instantTime upfront.
- These values come from defineRecordLocation(), which needs the bucket number.
- So HoodieFlinkInternalRow cannot be created before calling defineRecordLocation().
fileId and instantTime are not required to construct HoodieFlinkInternalRow; these two fields are set later in defineRecordLocation().
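So the conversion can happen before bucketing; a sketch under that assumption (stand-in types only, not the real HoodieFlinkInternalRow API):

```java
// Hypothetical stand-in mirroring the fields discussed above; the real
// HoodieFlinkInternalRow may differ in names and signatures.
class RowWithLocationStub {
    private final String recordKey;
    private final String partitionPath;
    private String fileId;       // not required at construction time
    private String instantTime;  // not required at construction time

    RowWithLocationStub(String recordKey, String partitionPath) {
        this.recordKey = recordKey;
        this.partitionPath = partitionPath;
    }

    String getRecordKey() { return recordKey; }
    void setFileId(String fileId) { this.fileId = fileId; }
    void setInstantTime(String instantTime) { this.instantTime = instantTime; }
}

class RecordLocationSketch {
    // Mirrors the defineRecordLocation() flow: the bucket is derived from the
    // record key, and only then are fileId/instantTime filled in on the row.
    void defineRecordLocation(RowWithLocationStub row, String fileId, String instantTime) {
        row.setFileId(fileId);
        row.setInstantTime(instantTime);
    }
}
```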
Seems there is no need to implement EventProcessorFunction; actually, processSchemaChange and processFlush of EventStreamWriteFunction will never be called.
Caused by: java.lang.RuntimeException: Failed to process schema event for table: hudi_inventory_bptbsn.products
at org.apache.flink.cdc.connectors.hudi.sink.function.MultiTableEventStreamWriteFunction.processSchemaChange(MultiTableEventStreamWriteFunction.java:296)
at org.apache.flink.cdc.connectors.hudi.sink.function.MultiTableEventStreamWriteFunction.processElement(MultiTableEventStreamWriteFunction.java:167)
at org.apache.flink.cdc.connectors.hudi.sink.function.MultiTableEventStreamWriteFunction.processElement(MultiTableEventStreamWriteFunction.java:72)
at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
at org.apache.flink.cdc.connectors.hudi.sink.bucket.FlushEventAlignmentOperator.processElement(FlushEventAlignmentOperator.java:94)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:238)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:157)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:114)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:638)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:973)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:917)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:970)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:949)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:763)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.UnsupportedOperationException: #processSchemaChange should not be called
at org.apache.flink.cdc.connectors.hudi.sink.function.EventBucketStreamWriteFunction.processSchemaChange(EventBucketStreamWriteFunction.java:158)
at org.apache.flink.cdc.connectors.hudi.sink.function.MultiTableEventStreamWriteFunction.processSchemaChange(MultiTableEventStreamWriteFunction.java:293)
... 24 more
It is being invoked.
Done
 * <p>Assumes that CreateTableEvent will always arrive before DataChangeEvent for each table,
 * following the standard CDC pipeline startup sequence.
 */
public class HudiRecordEventSerializer implements HudiRecordSerializer<Event> {
Seems HudiRecordEventSerializer is designed to handle serialization for multiple tables. As with the comments on EventStreamWriteFunction, could HudiRecordEventSerializer be a field of MultiTableStreamWriteOperatorCoordinator, serializing data change events to HoodieFlinkInternalRow, which are then dispatched to the corresponding table write functions?
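A sketch of that arrangement; every type here is a simplified, hypothetical stand-in for the classes named above, and the real coordinator/function wiring may differ:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the classes discussed above.
class DataChangeEventStub { String tableId; }
class HoodieRowStub2 {}

interface RecordSerializerStub {
    HoodieRowStub2 serialize(DataChangeEventStub event);
}

// The multi-table level holds the serializer once and dispatches converted rows
// to the per-table write functions.
class MultiTableDispatchSketch {
    private final RecordSerializerStub serializer;
    private final Map<String, PerTableWriter> writers = new HashMap<>();

    MultiTableDispatchSketch(RecordSerializerStub serializer) {
        this.serializer = serializer;
    }

    void processElement(DataChangeEventStub event) {
        HoodieRowStub2 row = serializer.serialize(event);
        writers.computeIfAbsent(event.tableId, id -> new PerTableWriter()).processData(row);
    }

    static class PerTableWriter {
        void processData(HoodieRowStub2 row) { /* buffer/write for this table */ }
    }
}
```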
Done
// - Data events go to their specific bucket's task
DataStream<BucketWrapper> partitionedStream =
        bucketAssignedStream.partitionCustom(
                (key, numPartitions) -> key % numPartitions,
Maybe we should also consider the data skew problem, since there are records from multiple tables and partitions. You can refer to BucketIndexUtil#getPartitionIndexFunc.
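For illustration, the concern with `(key, numPartitions) -> key % numPartitions` and one way a partition-aware index avoids it (the hashing details below are assumptions, not the actual BucketIndexUtil implementation):

```java
// Illustrative only: contrast of the naive index with a skew-aware one in the
// spirit of BucketIndexUtil#getPartitionIndexFunc.
class PartitionIndexSketch {
    // Naive: bucket 0 of every table/partition always lands on subtask 0, and so on.
    static int naiveIndex(int bucketNumber, int numSubtasks) {
        return bucketNumber % numSubtasks;
    }

    // Skew-aware: offset by a hash of table + partition path so the same bucket
    // number from different tables/partitions spreads across subtasks.
    static int skewAwareIndex(String tableId, String partitionPath, int bucketNumber, int numSubtasks) {
        int offset = Math.floorMod((tableId + "|" + partitionPath).hashCode(), numSubtasks);
        return (offset + bucketNumber) % numSubtasks;
    }
}
```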
Done
        DataChangeEvent dataChangeEvent, Schema schema) {
    List<String> partitionKeys = schema.partitionKeys();
    if (partitionKeys == null || partitionKeys.isEmpty()) {
        return "default";
should be "" here?
Yeap, good catch, fixed.
}

/**
 * Calculate bucket from HoodieFlinkInternalRow using the record key. The record key is already
Are we going to support bucketing by hoodie.bucket.index.hash.field?
Not yet, I was planning on standardising everything to use record keys first. Since there is an orthogonal discussion on config, I wanted to leave this out for a separate exercise.
        String instantTime) {

// Extract record key from primary key fields
String recordKey = extractRecordKeyFromDataChangeEvent(dataChangeEvent, schema);
Can we use RowDataKeyGen to get the record key and partition path directly?
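For context, the appeal is that a key generator derives both the record key and the partition path from configured field lists instead of hand-rolled extraction; a minimal self-contained mimic (not the real RowDataKeyGen API) could look like:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative mimic of a key generator: both values are derived from configured
// field lists. The real RowDataKeyGen works on Flink RowData and Hudi
// configuration, not on a simple Map.
class KeyGenSketch {
    private final List<String> recordKeyFields;
    private final List<String> partitionFields;

    KeyGenSketch(List<String> recordKeyFields, List<String> partitionFields) {
        this.recordKeyFields = recordKeyFields;
        this.partitionFields = partitionFields;
    }

    String getRecordKey(Map<String, Object> row) {
        return recordKeyFields.stream()
                .map(f -> String.valueOf(row.get(f)))
                .collect(Collectors.joining(","));
    }

    String getPartitionPath(Map<String, Object> row) {
        return partitionFields.stream()
                .map(f -> String.valueOf(row.get(f)))
                .collect(Collectors.joining("/"));
    }
}
```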
Done
Force-pushed from 9f52239 to 6254c7f.