Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADBDEV-6156 Count startup memory of each process when using resource groups #1023

Open
wants to merge 18 commits into
base: adb-6.x-dev
Choose a base branch
from

Conversation

dnskvlnk
Copy link
Collaborator

@dnskvlnk dnskvlnk commented Aug 23, 2024

Count startup memory of each process when using resource groups

The goal of the patch is to make resource groups manager to track startup
memory of each process so that the redzone handler would estimate memory more
accurately, but the context needs to be brought.

Startup memory is considered correctly by the VMEM tracker, but it's not
accounted by the resource groups manager because on a process' startup this
memory is not added to self->memUsage. This happened because the startup memory
is considered on VMEM tracker initialization, but the ways VMEM tracker and
resource groups count memory are not consistent with each other.

This patch adds startup memory consumption to self->memUsage to make resource
groups consider this memory.

However, one should note that once a process is detached from a resource group
(it finished query execution), resource groups memory manager doesn't see this
memory again. This happens due to resource groups' design. We could've moved
inactive processes' memory to freeChunks, but that would drain freeChunks pretty
fast and cause new processes to fail because any process starts without a
resource group and acquires it later, thus it must add it's startup memory to
freeChunks at first. If freeChunks is already exhausted, a new process fails to
start. Also, there are other complexities in making resource groups manager to
fully consider startup memory. Thankfully, we have the VMEM tracker which
calculates it correctly, so if the resource groups manager fails to track it
correctly, we can still rely on it. It's impossible to solve this problem in
this patch because it's a design flaw rather than a bug.

@BenderArenadata
Copy link

Failed job Deploy multiarch Dockerimages: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807317

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/78383

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807327

@BenderArenadata
Copy link

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1807328

@dnskvlnk dnskvlnk marked this pull request as ready for review August 26, 2024 16:22
@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/78465

@BenderArenadata
Copy link

Failed job Regression tests with ORCA on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817041

@BenderArenadata
Copy link

Failed job Regression tests with Postgres on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817039

@BenderArenadata
Copy link

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1817048

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1818658

@BenderArenadata
Copy link

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1818659

@RekGRpth
Copy link
Member

Can you write some tests to check?

@BenderArenadata
Copy link

DROP

-- start_ignore
! gpstop -rai;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be add

! gpconfig -r gp_resource_manager;

before this line?

Copy link
Collaborator Author

@dnskvlnk dnskvlnk Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in disable_resgroup test

@RekGRpth
Copy link
Member

RekGRpth commented Sep 3, 2024

resgroup/enable_resgroup test hangs after applying patch.

@BenderArenadata
Copy link

@BenderArenadata
Copy link

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/79897

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1891649

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/80245

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1911167

@BenderArenadata
Copy link

Failed job Build ubuntu22 for x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1924163

@BenderArenadata
Copy link

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/80289

@BenderArenadata
Copy link

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/1926814

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/80301

@bandetto
Copy link
Member

Maybe add sql/resgroup/resgroup_startup_memory.sql and expected/resgroup/resgroup_startup_memory.out to src/test/isolation2/.gitignore? For some reason, not all the generated sql and expected files from input/resgroup/ are ignored, but all tests inside input/ folder are.

@dnskvlnk
Copy link
Collaborator Author

@bandetto

Maybe add sql/resgroup/resgroup_startup_memory.sql and expected/resgroup/resgroup_startup_memory.out to src/test/isolation2/.gitignore? For some reason, not all the generated sql and expected files from input/resgroup/ are ignored, but all tests inside input/ folder are.

Yes, this is pretty strange, so I would agree with you. However, I'm not sure about ignoring .sql since they are actually tracked, but some are not, as far as I remember

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/83966

@bandetto
Copy link
Member

However, I'm not sure about ignoring .sql since they are actually tracked, but some are not, as far as I remember

Tests which have input files in input/ instead of sql/ get their sql/..sql files generated out of files from the input/ directory. It means that any sql/..sql file from a test that uses input/ directory can safely be ignored. Seems that this ignoring thing is inconsistent, so it's is up to you to fix it in this patch :^)

@dnskvlnk
Copy link
Collaborator Author

@bandetto I know that, but there are some tests which are generated from .source file but the respective .sql is still tracked. Though I agree with you that if a test is generated then .sql should be ignored

In order to run tests, we need the old version of resGroupPalloc, however, to
keep old tests intact, we need to add it under a new name, because the old
logic of resGroupPalloc isn't compatible with the new behaviour
@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/84568

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/84658

Copy link
Member

@bandetto bandetto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, one should note that once a process is detached from a resource group (it finished query execution), resource groups memory manager doesn't see this memory again.

Can you please elaborate what this means?

We could've moved inactive processes' memory to freeChunks, but that would drain freeChunks pretty fast

Why?

Please explain these points, and why the added tests are needed in PR description (and maybe you can skip explaining some of parts in the description, if they are not really related to the patch, and shorten it to only describe key changes).

src/test/regress/regress_gp.c Outdated Show resolved Hide resolved
src/test/regress/regress_gp.c Outdated Show resolved Hide resolved
-- Start a session which will be detached from a group when the query is done
-- resource groups can't see startup chunks occupied by a detached session but
-- the vmem tracker can
0: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid != (SELECT pg_backend_pid());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here and below

Suggested change
0: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid != (SELECT pg_backend_pid());
0: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid != pg_backend_pid();

Do we really want to kill ALL backends except the current one?!

Copy link
Collaborator Author

@dnskvlnk dnskvlnk Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the point of the test, I don't want other processes to interfere. I can change it if you explain what problems it can cause

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/85888

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/85900

! gpconfig -c runaway_detector_activation_percent -v 20;
! gpstop -rai;

CREATE OR REPLACE FUNCTION resGroupPalloc(float) RETURNS int AS '/home/kds/gpdb/src/test/isolation2/../regress/regress.so', 'resGroupPalloc' LANGUAGE C READS SQL DATA;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path remained the same (without generic variable), but the tests have passed?

Copy link
Collaborator Author

@dnskvlnk dnskvlnk Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did it intentionally. Compilation path is defined during configuration stage, so it differs on different machines. Having such statement outside ignore block will cause this test to fail if your compilation path is not the same compared to the default path on ci. If this statement fails to complete, it's not a problem either, because those statements which use this function will indicate the problem

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about using @abs_builddir@ and @DLSUFFIX@ used in input, in output as well? It's replaced in runtime by pg_regress. The paths don't have to be inside ignore block that way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

src/test/regress/regress_gp.c Outdated Show resolved Hide resolved
Co-authored-by: Vladimir Sarmin <[email protected]>
@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/85984

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/86597

@BenderArenadata
Copy link

Allure report https://allure.adsw.io/launch/86650

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants