
[FLINK-36763 / 36690][runtime] Support new "distributed" schema evolution topology & fix parallelized hang glitch #3801

Open
wants to merge 11 commits into master
Conversation

@yuxiqian (Contributor) commented on Dec 13, 2024

This closes FLINK-36763 and FLINK-36690.

As explained in #3680, the current pipeline design does not cooperate well with tables whose data and schema change events are distributed across multiple partitions, a.k.a. distributed tables.

Unfortunately, some data sources (like Kafka) are naturally distributed in this way and cannot easily be fitted into the current pipeline framework.

To resolve this issue while keeping backward compatibility, the following changes have been made:

  1. Added another suite of SchemaOperator and SchemaCoordinator for the distributed topology. (See details below.)
  • The previous operators remain in the schema.regular package, while the new ones are located in the schema.distributed package.
  • Common code has been extracted into an abstract base class, SchemaRegistry, to reduce duplication.
  2. Added a new @Experimental optional method to DataSource to switch between the two topologies:
@PublicEvolving
public interface DataSource {
    // ...

    @Experimental
    default boolean canContainDistributedTables() {
        return false;
    }
}

The composer detects the data source's distribution trait to determine which operator topology to generate.
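To illustrate the branching, here is a minimal, self-contained sketch of how a composer might select a topology from the trait method. The `DataSource` interface is re-declared here in simplified form, and `TopologyChooser` is a hypothetical name, not the actual FlinkPipelineComposer API:

```java
// Simplified re-declaration of the trait method added in this PR.
interface DataSource {
    default boolean canContainDistributedTables() {
        return false;
    }
}

class TopologyChooser {
    // The composer picks the operator topology based on the source's trait:
    // "regular" maps to schema.regular, "distributed" to schema.distributed.
    static String chooseTopology(DataSource source) {
        return source.canContainDistributedTables() ? "distributed" : "regular";
    }
}
```

Since the method is `default` and returns `false`, existing sources keep the regular topology without any code change.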

  3. Extracted schema merging utilities into SchemaMergingUtils and deprecated the corresponding functions in SchemaUtils.

Schema merging is now required in the Transform, Routing, and Schema Evolution stages, and sources that support schema inference might need it as well. Unifying these utilities in one place makes them easier to maintain.
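The core idea behind such merging is "widening" column types along a chain until both inputs fit. The sketch below uses plain strings for types and a hypothetical `SchemaMergeSketch` class, not the real flink-cdc DataType classes or the actual SchemaMergingUtils API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SchemaMergeSketch {
    // A tiny widening chain: each type can hold all values of the types before it.
    static final List<String> WIDENING = List.of("INT", "BIGINT", "DOUBLE", "STRING");

    // Pick the wider of the two types along the widening chain.
    static String mergeType(String a, String b) {
        return WIDENING.get(Math.max(WIDENING.indexOf(a), WIDENING.indexOf(b)));
    }

    // Column-wise merge of two schemas keyed by column name; columns present
    // in only one schema are kept as-is.
    static Map<String, String> mergeSchemas(Map<String, String> left,
                                            Map<String, String> right) {
        Map<String, String> merged = new LinkedHashMap<>(left);
        right.forEach((col, type) -> merged.merge(col, type, SchemaMergeSketch::mergeType));
        return merged;
    }
}
```

For example, merging `{id: INT}` with `{id: BIGINT, name: STRING}` yields `{id: BIGINT, name: STRING}`.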

  4. Updated migration test cases to cover CDC 3.2.0+ only.

CDC 3.1.1 was released over six months ago; keeping state compatibility with earlier versions is no longer worthwhile.


P.S.: A detailed type merging tree looks like this:

[image: output]

@yuxiqian (Contributor, Author) commented on Dec 13, 2024

Here's a detailed write-up about the new topology for "distributed tables":

Currently, a YAML pipeline job has a typical topology like this:

[image: current_topology]

It relies on a basic assumption: data from a single table must either:

  • be present and evolve in one single partition only...
  • or be present in multiple partitions, but with a globally static schema.

The underlying reason is that we lack a coordination mechanism across schema operators. For example, if Schema Operator 1 triggers a schema change request, the other schema operators will not even be aware of it, since an operator only communicates with the coordinator when it receives a schema change event from upstream.

This becomes a problem when handling distributed sources: each partition may emit its own schema change stream, yet we must maintain a globally effective schema for writing downstream.

However, simply requesting operators to block and align is not viable in the current architecture, because there is a broadcast topology right after the schema operator; blocking there might freeze the entire downstream and leave us no chance to flush pending data records (see #3680 for more details about barrier alignment).


To coordinate among schema operators without blocking the record stream, the schema operator is moved closer to the sink, after the hash shuffle. That means data from different partitions will be mixed together and will no longer satisfy normal schema change semantics.

[image: mixed]

So, in the distributed topology, we trace schemas independently for each source partition, within which they are guaranteed to be well-formed.

[image: sep-trace]
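Per-partition tracing boils down to keeping a separate schema map for each source partition. The following is a minimal sketch under that assumption; `PartitionedSchemaMap` is a hypothetical name, not a class from this PR:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class PartitionedSchemaMap {
    // partitionId -> (tableId -> schema). Schemas evolve independently per
    // partition, where each partition's stream is guaranteed well-formed.
    private final Map<Integer, Map<String, String>> schemas = new HashMap<>();

    void update(int partition, String tableId, String schema) {
        schemas.computeIfAbsent(partition, p -> new HashMap<>()).put(tableId, schema);
    }

    Optional<String> get(int partition, String tableId) {
        return Optional.ofNullable(schemas.getOrDefault(partition, Map.of()).get(tableId));
    }
}
```

A schema change from partition 0 never clobbers the schema traced for partition 1, even after their records are mixed by the shuffle.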

Also, since any schema change event will be broadcast and copied $N$ times (where $N$ is the sink-side parallelism), schema operators are guaranteed to initiate $kN$ requests in total for $k$ upstream schema change events, so we do not need to notify each schema operator to block and talk with the coordinator.

In the first step, an arbitrary schema operator initiates the schema change request:

step_1

The other operators will do so eventually, since any event from upstream will be broadcast to all schema operators, so the coordinator does not need to notify them:

step_2

Notice that when a schema operator receives a schema change event and blocks its upstream, it also emits a FlushEvent to the sink writers, telling them that all pending data change events must be handled and persistently flushed. After that, the sink writers report success to the coordinator directly.

[image: step_3]
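Put together, the operator's reaction to a schema change event can be sketched as three actions: block upstream, flush downstream, and report to the coordinator. The class and event names below are illustrative stand-ins, not the real flink-cdc-runtime API:

```java
import java.util.ArrayList;
import java.util.List;

class SchemaOperatorSketch {
    final List<String> emittedDownstream = new ArrayList<>();
    final List<String> sentToCoordinator = new ArrayList<>();
    boolean upstreamBlocked = false;

    void onSchemaChangeEvent(String event) {
        upstreamBlocked = true;                          // stop pulling more records
        emittedDownstream.add("FlushEvent");             // ask sink writers to flush pending data
        sentToCoordinator.add("Request(" + event + ")"); // report the schema change request
    }
}
```

Note that the flush success acknowledgements do not pass back through the operator; the sink writers report to the coordinator directly.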

Now, the coordinator knows that:

  • All upcoming streams are already blocked (since all schema operators have initiated requests)
  • All pending data change events in the pipeline have been flushed (since it has collected success reports from all sink writers)
  • What the current upstream schemas from all partitions are (reported along with the schema change requests)

At this point, it can simply deduce the widest schema, apply it to the external database, and broadcast the consensus result to all schema operators while releasing them from blocking.

[image: step_4]
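The coordinator's release condition reduces to simple bookkeeping: with parallelism $N$, wait until all $N$ schema operators have requested and all $N$ sink writers have acked their flushes. A minimal sketch under that assumption, with illustrative names rather than the real SchemaCoordinator API:

```java
import java.util.HashSet;
import java.util.Set;

class CoordinatorSketch {
    private final int parallelism;
    private final Set<Integer> requests = new HashSet<>();
    private final Set<Integer> flushAcks = new HashSet<>();

    CoordinatorSketch(int parallelism) {
        this.parallelism = parallelism;
    }

    void onSchemaChangeRequest(int operatorId) { requests.add(operatorId); }

    void onFlushSuccess(int writerId) { flushAcks.add(writerId); }

    // All upcoming streams are blocked and all pending data is flushed once
    // every schema operator has requested and every sink writer has acked.
    boolean readyToEvolve() {
        return requests.size() == parallelism && flushAcks.size() == parallelism;
    }
}
```

Only when `readyToEvolve()` holds does the coordinator apply the merged schema externally and broadcast the result back.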

@yuxiqian (Contributor, Author): Polished, and marked this ready for review.

@yuxiqian yuxiqian marked this pull request as ready for review December 16, 2024 12:00
@yuxiqian yuxiqian force-pushed the FLINK-36763-V3 branch 2 times, most recently from d5f38a4 to 7c5f8d8 Compare December 18, 2024 09:51
@lvyanquan (Contributor): Thanks @yuxiqian for this contribution, left some comments.

@yuxiqian (Contributor, Author): Thanks to @lvyanquan and @Shawn-Hx for the kind review; addressed your comments.

@Shawn-Hx (Contributor) left a comment: Thanks @yuxiqian, LGTM.
