
[improvement](cgroup) inactive_file should be treated as available memory to avoid query be cancelled #64347

Open
yiguolei wants to merge 2 commits into apache:master from yiguolei:fix_mem

Conversation

@yiguolei commented Jun 10, 2026

Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

We sometimes see the following error in cgroup or k8s environments:

Allocator sys memory check failed: Cannot alloc:5343, ...,process memory used 85.41 GB exceed limit 108.00 GB or sys available memory 5.88 GB less than low water mark 6.00 GB.

The mem_limit condition is false (85.41 < 108), so the 5343-byte allocation is rejected solely because sys available memory (5.88 GB) is below the low water mark (6.00 GB). With a 120 GB limit, 5.88 GB available implies a cgroup_mem_usage of about 114 GB, roughly 29 GB above the process memory used (85.41 GB); that gap is unmapped read page cache. The kernel reclaims clean page cache before it resorts to OOM, so this memory is effectively available, but Doris cannot reclaim it itself, and the rejection repeats on later allocations. (The low water mark of 6.00 GB is the default: min(120 - 108, 120 * 5%) = 6.)
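The arithmetic in the summary can be checked with a short sketch. It is illustrative only: the function and variable names are hypothetical rather than Doris identifiers, the values are copied from the error message, and the 120 GB total is inferred from the formula min(120 - 108, 120 * 5%).

```python
# Illustrative sketch only: names are hypothetical, not Doris identifiers.
# All values are in GB, copied from the error message in the summary.

def default_low_water_mark(total_limit, mem_limit):
    """min(total - mem_limit, total * 5%), following the summary's formula."""
    return min(total_limit - mem_limit, total_limit * 5 / 100)

def check_terms(process_used, mem_limit, sys_available, low_water_mark):
    """Return (limit_exceeded, below_low_water_mark) for one allocation."""
    return process_used > mem_limit, sys_available < low_water_mark

TOTAL_LIMIT = 120.0    # assumption: inferred from min(120 - 108, 120 * 5%)
MEM_LIMIT = 108.0
PROCESS_USED = 85.41
SYS_AVAILABLE = 5.88

low_water_mark = default_low_water_mark(TOTAL_LIMIT, MEM_LIMIT)
limit_exceeded, below_mark = check_terms(
    PROCESS_USED, MEM_LIMIT, SYS_AVAILABLE, low_water_mark
)

# Usage implied by the reported available memory, and its gap to process memory.
implied_cgroup_usage = TOTAL_LIMIT - SYS_AVAILABLE      # about 114 GB
page_cache_gap = implied_cgroup_usage - PROCESS_USED    # about 29 GB

print(low_water_mark, limit_exceeded, below_mark)
print(round(implied_cgroup_usage, 2), round(page_cache_gap, 2))
```

Only the second term of the check fires here, which is why an allocation of a few kilobytes is rejected while the process is well under its own limit.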

Before this PR, cgroup_mem_usage = memory.current - inactive_file - slab_reclaimable. Page cache on the active file list is therefore not treated as reclaimable memory, so cgroup_mem_usage is larger than RSS.
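A minimal sketch of the pre-PR formula, assuming the counters come from a cgroup v2 memory.stat file (which exposes inactive_file, active_file and slab_reclaimable). The function names and sample numbers are hypothetical. The variant that also subtracts active_file is an assumption, included only to show how much the result moves; the description does not spell out the exact formula the PR introduces.

```python
# Illustrative sketch only: function names and sample numbers are hypothetical.

def parse_memory_stat(text):
    """Parse 'key value' lines, the layout used by cgroup v2 memory.stat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats

def cgroup_mem_usage(memory_current, stats, extra_reclaimable=()):
    """memory.current minus inactive_file and slab_reclaimable (pre-PR formula).

    extra_reclaimable lists further counters to subtract; passing
    "active_file" is an assumption made here for comparison only.
    """
    usage = memory_current
    for key in ("inactive_file", "slab_reclaimable") + tuple(extra_reclaimable):
        usage -= stats.get(key, 0)
    return max(usage, 0)

# Made-up sample values in arbitrary units, not taken from a real system.
SAMPLE_STAT = """\
anon 900
inactive_file 50
active_file 300
slab_reclaimable 20
"""

stats = parse_memory_stat(SAMPLE_STAT)
memory_current = 1300

before = cgroup_mem_usage(memory_current, stats)                    # 1300 - 50 - 20
with_active = cgroup_mem_usage(memory_current, stats, ["active_file"])

print(before, with_active)
```

In the sample, the active file cache accounts for the whole difference between the two results, which mirrors the gap between cgroup_mem_usage and RSS described above.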

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yiguolei

Contributor Author

run buildall

@hello-stephen

Contributor

Cloud UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.39% (1915/2443)
Line Coverage 64.88% (34210/52725)
Region Coverage 65.30% (17598/26948)
Branch Coverage 53.97% (9346/17316)

@hello-stephen

Contributor
TPC-H: Total hot run time: 29737 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0245a280ecd348cee7772ef9e9af09472a2b38f8, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17658	4377	4322	4322
q2	
q3	10739	1386	878	878
q4	4681	496	360	360
q5	7569	867	615	615
q6	188	184	145	145
q7	793	874	634	634
q8	9552	1583	1657	1583
q9	6461	4497	4479	4479
q10	6832	1830	1545	1545
q11	448	282	258	258
q12	671	431	311	311
q13	18161	3557	2752	2752
q14	291	268	257	257
q15	
q16	832	789	722	722
q17	1507	1133	820	820
q18	7049	5788	5609	5609
q19	1353	1383	1113	1113
q20	527	406	267	267
q21	6205	2806	2728	2728
q22	487	395	339	339
Total cold run time: 102004 ms
Total hot run time: 29737 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5251	5096	5094	5094
q2	
q3	5135	5374	4778	4778
q4	2402	2464	1581	1581
q5	5100	5192	4888	4888
q6	258	200	136	136
q7	2039	1865	1771	1771
q8	2750	2300	2133	2133
q9	7668	7725	7639	7639
q10	4915	4894	4366	4366
q11	611	430	399	399
q12	815	812	600	600
q13	3046	3565	2826	2826
q14	281	286	261	261
q15	
q16	722	738	643	643
q17	1323	1308	1285	1285
q18	7826	7080	7155	7080
q19	1142	1107	1110	1107
q20	2276	2279	2008	2008
q21	5689	4988	4883	4883
q22	553	509	398	398
Total cold run time: 59802 ms
Total hot run time: 53876 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 169589 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0245a280ecd348cee7772ef9e9af09472a2b38f8, data reload: false

query5	4326	646	480	480
query6	450	197	181	181
query7	4817	575	306	306
query8	367	215	199	199
query9	8774	4085	4053	4053
query10	438	349	258	258
query11	5935	2356	2151	2151
query12	156	102	106	102
query13	1316	596	428	428
query14	6426	5459	5102	5102
query14_1	4439	4458	4423	4423
query15	215	201	175	175
query16	1046	472	432	432
query17	1154	729	597	597
query18	2719	489	355	355
query19	218	188	147	147
query20	115	117	120	117
query21	229	149	121	121
query22	13665	13588	13465	13465
query23	17438	16400	16177	16177
query23_1	16195	16344	16452	16344
query24	7390	1785	1341	1341
query24_1	1319	1320	1329	1320
query25	556	435	412	412
query26	1081	316	164	164
query27	2658	544	336	336
query28	4435	2013	2029	2013
query29	1031	657	485	485
query30	314	241	198	198
query31	1142	1082	950	950
query32	112	61	59	59
query33	521	305	248	248
query34	1180	1147	662	662
query35	764	779	691	691
query36	1366	1430	1212	1212
query37	157	112	96	96
query38	3207	3190	3055	3055
query39	937	934	917	917
query39_1	876	878	898	878
query40	213	140	115	115
query41	71	68	71	68
query42	106	101	95	95
query43	335	340	294	294
query44	
query45	197	187	179	179
query46	1087	1209	769	769
query47	2317	2383	2221	2221
query48	391	396	305	305
query49	627	465	351	351
query50	970	368	263	263
query51	4424	4314	4250	4250
query52	88	91	88	88
query53	256	274	188	188
query54	289	212	204	204
query55	79	82	73	73
query56	264	218	212	212
query57	1449	1405	1332	1332
query58	246	216	215	215
query59	1621	1674	1413	1413
query60	280	243	232	232
query61	161	152	160	152
query62	731	657	582	582
query63	235	193	189	189
query64	2214	805	614	614
query65	
query66	1701	464	346	346
query67	29711	29776	29600	29600
query68	
query69	421	310	266	266
query70	988	927	913	913
query71	302	228	217	217
query72	3039	2694	2392	2392
query73	844	791	432	432
query74	5142	5004	4794	4794
query75	2662	2575	2226	2226
query76	2309	1197	796	796
query77	360	367	297	297
query78	12356	12305	11849	11849
query79	1418	1040	786	786
query80	600	475	410	410
query81	458	289	253	253
query82	613	158	124	124
query83	357	279	248	248
query84	
query85	885	527	436	436
query86	363	316	287	287
query87	3389	3419	3209	3209
query88	3646	2762	2737	2737
query89	413	392	331	331
query90	1956	190	189	189
query91	179	171	173	171
query92	64	63	58	58
query93	1518	1398	913	913
query94	566	364	297	297
query95	680	397	458	397
query96	1020	772	330	330
query97	2707	2701	2553	2553
query98	210	211	214	211
query99	1158	1178	1059	1059
Total cold run time: 250517 ms
Total hot run time: 169589 ms

@yiguolei added the dev/4.1.x, usercase (Important user case type label), and dev/4.0.x labels Jun 10, 2026
@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 16.00% (4/25) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.76% (28261/38315)
Line Coverage 57.78% (307588/532307)
Region Coverage 54.61% (257553/471633)
Branch Coverage 55.97% (111723/199627)

@github-actions (bot) added the approved label (Indicates a PR has been approved by one committer.) Jun 10, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.


Labels

approved (Indicates a PR has been approved by one committer.), dev/4.0.x, dev/4.1.x, reviewed, usercase (Important user case type label)


3 participants