Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always splitting update events if partition key changes #11297

Open
CharlesCheung96 opened this issue Jun 12, 2024 · 0 comments · May be fixed by #11232
Open

Always splitting update events if partition key changes #11297

CharlesCheung96 opened this issue Jun 12, 2024 · 0 comments · May be fixed by #11232
Labels

Comments

@CharlesCheung96
Copy link
Contributor

CharlesCheung96 commented Jun 12, 2024

Background

When using the index-value or columns dispatcher to distribute data across different Kafka partitions based on the key, multiple consumer processes in the downstream consumer group consume Kafka topic partitions independently. Due to different consumption progress, data inconsistency might occur. Take the following SQL as an example:

CREATE TABLE t (a INT PRIMARY KEY, b INT);

INSERT INTO t VALUES (1, 2);
UPDATE t SET a = 2 WHERE a = 1;

INSERT INTO t VALUES (1, 3);
UPDATE t SET a="3" WHERE a="1";

If the UPDATE event is not split (ref #11211), and use the index-value or columns dispatcher:

# index 
dispatchers = [
    {matcher = ['test.t'], partition = "index-value"},

]

# columns
dispatchers = [
    {matcher = ['test2.*'], partition = "columns", columns = ["a"]}
]

THEN, the preceding DML will be distributed to the following partitions (change events contain both new and old values):

Event Seq partition-1 partition-2 partition-3
1 INSERT a = 1, b = 2 UPDATE a = 2 WHERE a = 1; UPDATE a = 3 WHERE a = 1;
2 INSERT a = 1, b = 3    

Since partitions are consumed in parallel, different consumption sequences will correspond to different results, in which only one consumption sequence can guarantee the final consistency (p1-1, p2-1, p1-2, p3-1).

Solution

Note that although the correctness of the index dispatcher could be guaranteed by setting output-raw-change-event to false, the column dispatcher still relies on this solution to ensure the correctness of the distribution behavior.

Design

TiCDC should always split the UPDATE event, which changes the partition key, into DELETE and INSERT events. And the preceding DML events will be distributed to the following partitions:

Event Seq partition-1 partition-2 partition-3
1 INSERT a = 1, b = 2 INSERT a = 2, b = 2 INSERT a = 3, b = 3
2 DELETE a = 1    
3 INSERT a = 1, b = 3    
4 DELETE a = 1    

Compatibility

In order to be compatible with the historical version and to ensure that the user's behavior does not change after upgrading, we should add a parameter split-update-partition-key and set it to false by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant