[fix](delta writer) Fix shared delta writer state lifetime#64349

Open
bobhan1 wants to merge 2 commits into apache:master from bobhan1:fix-opensource-380-delta-writer-state

@bobhan1 bobhan1 commented Jun 10, 2026


What problem does this PR solve?

Issue Number: None

Problem Summary:

Shared DeltaWriterV2 instances can be reused by multiple local sinks from the same load. Before this change, the shared writer stored the RuntimeState* from the sink that first created it. If that creator sink finished and its RuntimeState was destroyed while another local sink continued to reuse the shared writer, DeltaWriterV2::write() could access the destroyed state in the memtable flush-limit cancellation path, causing a BE crash or ASAN use-after-free.

This PR adds a BE unit test that reproduces the lifetime boundary:

  • one VTabletWriterV2 creates the shared DeltaWriterV2;
  • the creator writer and its RuntimeState are destroyed without cancelling the shared writer;
  • a second writer reuses the shared writer and is forced into the DeltaWriterV2::write() flush-limit wait path;
  • the old code reads the destroyed creator state, while the fixed code observes the current writer's cancel state and exits cleanly.
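The lifetime boundary in the list above can be modeled without triggering undefined behavior. The following is a minimal sketch, not the Doris code: `SinkState` and `CreatorBoundWriter` are hypothetical stand-ins, and a `std::weak_ptr` replaces the raw `RuntimeState*` so that the dangling case can be observed safely instead of crashing.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a sink's RuntimeState; not the Doris class.
struct SinkState {
    bool cancelled = false;
};

// Hypothetical model of the pre-fix shared writer. The real code held a raw
// RuntimeState*; a weak_ptr is used here only so the dangling case can be
// detected without undefined behavior.
class CreatorBoundWriter {
public:
    explicit CreatorBoundWriter(const std::shared_ptr<SinkState>& creator)
            : _creator_state(creator) {}

    // True when the creator's state has already been destroyed, which is the
    // point where a raw pointer would have been dangling.
    bool creator_state_gone() const { return _creator_state.expired(); }

private:
    std::weak_ptr<SinkState> _creator_state;
};

// Walks through the scenario and reports whether a later caller would find
// the creator's state already destroyed.
inline bool second_sink_sees_dead_creator() {
    auto creator_state = std::make_shared<SinkState>();

    // 1. The first sink creates the shared writer.
    auto shared_writer = std::make_shared<CreatorBoundWriter>(creator_state);

    // 2. The creator sink finishes: its state is destroyed, the writer lives on.
    creator_state.reset();

    // 3. A second sink reuses the shared writer and checks the stored state.
    return shared_writer->creator_state_gone();
}
```

In this model `second_sink_sees_dead_creator()` returns `true`, which corresponds to the point where the unfixed code would dereference freed memory.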

The fix removes the stored RuntimeState* from DeltaWriterV2. The shared writer now keeps only the stable WorkloadGroup shared pointer needed by MemTableWriter initialization, and VTabletWriterV2 passes a per-call cancel checker into DeltaWriterV2::write() so cancellation is evaluated against the current sink.
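A per-call cancel checker of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the actual patch: `SharedWriter`, `SinkState`, `WriteResult`, and the `polls_per_write` parameter are hypothetical, the checker is modeled as a `std::function<bool()>`, and the flush-limit wait is reduced to a bounded polling loop.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical result type; the real code returns a Doris Status.
enum class WriteResult { kOk, kCancelled };

// Hypothetical stand-in for a sink's RuntimeState.
struct SinkState {
    bool cancelled = false;
};

// Simplified model of a shared writer that stores no sink-local state.
class SharedWriter {
public:
    using CancelChecker = std::function<bool()>;

    // polls_per_write models how many times the flush-limit wait loop polls
    // before a write is allowed to proceed.
    explicit SharedWriter(int polls_per_write) : _polls_per_write(polls_per_write) {}

    // Cancellation is evaluated through the checker supplied by the caller,
    // so only the current sink's state is ever consulted.
    WriteResult write(const CancelChecker& is_cancelled) {
        for (int i = 0; i < _polls_per_write; ++i) {
            if (is_cancelled()) {
                return WriteResult::kCancelled;
            }
            // The real wait path would sleep here until memtables are flushed.
        }
        ++_writes_done;
        return WriteResult::kOk;
    }

    int writes_done() const { return _writes_done; }

private:
    int _polls_per_write;
    int _writes_done = 0;
};

// The creator sink writes once and is destroyed; a second sink then reuses
// the same writer with its own checker.
inline WriteResult write_from_second_sink(bool second_cancelled) {
    SharedWriter writer(/*polls_per_write=*/3);

    auto creator = std::make_unique<SinkState>();
    WriteResult first = writer.write([s = creator.get()] { return s->cancelled; });
    assert(first == WriteResult::kOk);
    (void)first;
    creator.reset(); // creator state is gone; the writer kept no pointer to it

    SinkState second;
    second.cancelled = second_cancelled;
    return writer.write([&second] { return second.cancelled; });
}
```

Because the checker is passed on every call, the writer never needs to outlive or track any particular sink; each caller is responsible only for keeping its own state alive for the duration of its own `write()` call.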

Release note

Fix a possible BE crash when shared delta writers are reused by multiple local sinks.

Check List (For Author)

  • Test: Unit Test
    • ./run-be-ut.sh --run --filter=TestVTabletWriterV2.shared_delta_writer_should_not_access_destroyed_creator_runtime_state -j 100
    • ./run-be-ut.sh --run --filter=DeltaWriterV2PoolTest.* -j 100
  • Behavior changed: Yes. Shared DeltaWriterV2 cancellation now uses the current sink's state instead of the creator sink's state.
  • Does this need documentation: No

bobhan1 added 2 commits June 10, 2026 11:36

### What problem does this PR solve?

Issue Number: None

Problem Summary: Add a backend unit test that reproduces a shared DeltaWriterV2 lifetime bug. When shared delta writers are enabled, the first local sink can create the shared DeltaWriterV2 and the writer stores that sink's RuntimeState pointer. If that creator sink and its RuntimeState are destroyed while another local sink still reuses the shared writer, DeltaWriterV2::write() can access the destroyed RuntimeState in the flush-limit cancel check path. The test builds two VTabletWriterV2 instances for the same load, destroys the creator state after the shared writer is created, and then forces the second writer into the same wait path. On the unfixed code this deterministically reports an ASAN heap-use-after-free in DeltaWriterV2::write().

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - ./run-be-ut.sh --run --filter=TestVTabletWriterV2.shared_delta_writer_should_not_access_destroyed_creator_runtime_state -j 100 (fails as expected before the fix with AddressSanitizer heap-use-after-free in DeltaWriterV2::write())

- Behavior changed: No

- Does this need documentation: No

### What problem does this PR solve?

Issue Number: None

Problem Summary: Shared DeltaWriterV2 instances can be reused by multiple local sinks from the same load. Before this change the shared writer stored the RuntimeState pointer from the sink that first created it. If that creator sink finished and its RuntimeState was destroyed while another local sink continued to reuse the same shared writer, DeltaWriterV2::write() could access the destroyed RuntimeState while checking cancellation in the memtable flush-limit wait path, causing a BE crash or ASAN use-after-free. The fix removes the stored RuntimeState pointer from DeltaWriterV2. The writer now stores only the stable WorkloadGroup shared pointer needed by MemTableWriter initialization, and VTabletWriterV2 passes a per-call cancel checker for the current sink into DeltaWriterV2::write(). This keeps cancellation tied to the active caller and avoids retaining sink-local RuntimeState inside the shared writer.
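The ownership side of this change, where the writer keeps a shared pointer to the workload group rather than a pointer into sink-local state, can be illustrated with a small sketch. All types here are hypothetical simplifications (`WorkloadGroup` is reduced to a name, and the group name `"normal"` is only an example value); the point is that shared ownership keeps the object valid after the creating sink is gone.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, heavily simplified workload group.
struct WorkloadGroup {
    std::string name;
};

// Hypothetical stand-in for sink-local state that references a workload group.
struct SinkState {
    std::shared_ptr<WorkloadGroup> workload_group;
};

// The shared writer co-owns the workload group instead of pointing at the
// sink's state, so the group stays alive for as long as the writer does.
class SharedWriter {
public:
    explicit SharedWriter(std::shared_ptr<WorkloadGroup> wg) : _wg(std::move(wg)) {}

    const std::string& workload_group_name() const { return _wg->name; }

private:
    std::shared_ptr<WorkloadGroup> _wg;
};

// Creates a writer from the creator sink's workload group, destroys the
// creator, and then reads the group through the writer.
inline std::string group_name_after_creator_exit() {
    auto creator = std::make_unique<SinkState>();
    creator->workload_group = std::make_shared<WorkloadGroup>(WorkloadGroup{"normal"});

    SharedWriter writer(creator->workload_group);
    creator.reset(); // sink-local state is destroyed here

    return writer.workload_group_name(); // still valid: the writer co-owns the group
}
```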

### Release note

Fix a possible BE crash when shared delta writers are reused by multiple local sinks.

### Check List (For Author)

- Test: Unit Test

    - ./run-be-ut.sh --run --filter=TestVTabletWriterV2.shared_delta_writer_should_not_access_destroyed_creator_runtime_state -j 100

    - ./run-be-ut.sh --run --filter=DeltaWriterV2PoolTest.* -j 100

- Behavior changed: Yes. Shared DeltaWriterV2 cancellation now uses the current sink's RuntimeState instead of the creator sink's RuntimeState.

- Does this need documentation: No
@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bobhan1 bobhan1 marked this pull request as ready for review June 10, 2026 03:55
bobhan1 commented Jun 10, 2026

run buildall

@bobhan1 bobhan1 changed the title from "[fix](be) Fix shared delta writer state lifetime" to "[fix](delta writer) Fix shared delta writer state lifetime" Jun 10, 2026
@hello-stephen

TPC-H: Total hot run time: 29299 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e70d03df42de89d0c9a8702b54233257eaf3cd55, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	17617	4008	3989	3989
q2	q3	10861	1367	804	804
q4	4690	483	345	345
q5	7554	879	594	594
q6	187	171	138	138
q7	772	869	633	633
q8	9394	1570	1674	1570
q9	5891	4551	4511	4511
q10	6742	1822	1525	1525
q11	425	285	253	253
q12	631	434	290	290
q13	18176	3401	2758	2758
q14	270	259	235	235
q15	q16	811	785	720	720
q17	941	968	1040	968
q18	7005	5762	5577	5577
q19	1321	1368	1052	1052
q20	529	413	264	264
q21	6287	2822	2736	2736
q22	468	379	337	337
Total cold run time: 100572 ms
Total hot run time: 29299 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	5094	4837	4723	4723
q2	q3	4931	5313	4711	4711
q4	2142	2168	1448	1448
q5	4711	4932	4705	4705
q6	230	178	128	128
q7	1869	1735	1593	1593
q8	2420	2122	2068	2068
q9	8012	7788	7346	7346
q10	4792	4671	4226	4226
q11	537	385	352	352
q12	736	744	527	527
q13	3005	3356	2822	2822
q14	275	283	251	251
q15	q16	673	712	621	621
q17	1290	1259	1254	1254
q18	7128	6718	6761	6718
q19	1113	1105	1087	1087
q20	2227	2221	1950	1950
q21	5307	4599	4460	4460
q22	532	466	439	439
Total cold run time: 57024 ms
Total hot run time: 51429 ms

@hello-stephen

TPC-DS: Total hot run time: 169654 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e70d03df42de89d0c9a8702b54233257eaf3cd55, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query5	4322	647	484	484
query6	440	192	189	189
query7	4840	564	299	299
query8	367	215	206	206
query9	8796	4018	4013	4013
query10	441	307	270	270
query11	5966	2381	2219	2219
query12	167	105	101	101
query13	1284	630	454	454
query14	6400	5395	5098	5098
query14_1	4417	4391	4421	4391
query15	206	204	177	177
query16	993	459	384	384
query17	970	737	604	604
query18	2455	491	358	358
query19	239	183	140	140
query20	118	108	111	108
query21	218	137	115	115
query22	13671	13611	13380	13380
query23	17414	16604	16171	16171
query23_1	16300	16387	16265	16265
query24	7522	1762	1321	1321
query24_1	1297	1302	1324	1302
query25	550	453	376	376
query26	1304	308	166	166
query27	2742	559	336	336
query28	4425	2020	2020	2020
query29	1063	629	467	467
query30	317	231	201	201
query31	1129	1082	973	973
query32	117	65	63	63
query33	530	317	260	260
query34	1191	1150	676	676
query35	755	793	687	687
query36	1452	1452	1238	1238
query37	159	102	90	90
query38	3235	3166	3048	3048
query39	919	927	907	907
query39_1	886	883	888	883
query40	225	123	103	103
query41	65	64	63	63
query42	96	102	95	95
query43	315	322	277	277
query44	
query45	198	188	183	183
query46	1147	1218	772	772
query47	2424	2400	2269	2269
query48	414	430	300	300
query49	649	484	352	352
query50	992	344	284	284
query51	4356	4340	4293	4293
query52	92	89	76	76
query53	236	266	190	190
query54	270	226	204	204
query55	82	76	71	71
query56	255	218	229	218
query57	1430	1415	1352	1352
query58	251	215	218	215
query59	1584	1696	1423	1423
query60	284	246	233	233
query61	162	159	158	158
query62	718	659	583	583
query63	224	182	186	182
query64	2563	818	631	631
query65	
query66	1844	457	338	338
query67	29900	29730	29556	29556
query68	
query69	432	314	277	277
query70	973	947	927	927
query71	311	221	215	215
query72	2944	2761	2444	2444
query73	889	804	447	447
query74	5161	4944	4770	4770
query75	2662	2597	2241	2241
query76	2335	1164	789	789
query77	363	392	289	289
query78	12365	12355	12092	12092
query79	1263	1053	757	757
query80	515	472	393	393
query81	447	290	253	253
query82	234	156	126	126
query83	271	276	254	254
query84	
query85	842	543	458	458
query86	325	322	278	278
query87	3391	3402	3198	3198
query88	3584	2724	2738	2724
query89	413	385	336	336
query90	2165	181	176	176
query91	175	171	136	136
query92	65	63	56	56
query93	1545	1472	820	820
query94	524	368	323	323
query95	677	376	338	338
query96	1074	794	344	344
query97	2683	2736	2583	2583
query98	213	216	221	216
query99	1142	1186	1048	1048
Total cold run time: 250748 ms
Total hot run time: 169654 ms
