
Request for pscheduler optimisation due to constant load even for small mesh #1502

Open
szymontrocha opened this issue Jan 10, 2025 · 3 comments

Comments

@szymontrocha

Here is my production mesh in my metro network:
3 nodes (small-node hardware: BRIX GB-BASE-3160, Intel(R) Celeron(R) CPU J3160 @ 1.60GHz with 8 GB RAM). Two run Debian 12, one Ubuntu 20.
All nodes run the perfsonar-testpoint 5.1.4 bundle with the same default configuration, and all send results to a central default-installation archive.
The mesh runs a very light set of tests: throughput every 3 hours, latency, and some dns, http and rtt tests. See the JSON attached.
There were no significant changes to the mesh definition in this timeframe.

Issues observed (all graphs attached, from the built-in Prometheus monitoring, for the last 14 days):

  • Looking at memory consumption on the nodes, pscheduler (in top these are mostly python and postgres processes) does something every day for almost 9 (!) hours, consuming 3 GB (!) of RAM on the Debian 12 nodes. This means the node is loaded for almost half of each day while doing almost nothing (no heavy tests are running). That seems like overkill for a lightweight testpoint, which was designed to be a thin installation, and for such a small mesh. A rough way to cross-check this outside Prometheus is sketched just after this list.
  • There is a clear difference between the Debian 12 and Ubuntu 20 load. On Ubuntu 20 the process takes the same amount of time but consumes ~2 GB of RAM instead of 3 GB.
  • I observe big spikes in the number of runs, between 7,500 and 15,000. I don't know if that matters, but it is at least strange to observe, and maybe it deserves some description in the documentation of what such big numbers really mean to the user, i.e. whether one should be worried when seeing such graphs or whether it is normal. How should the graph be interpreted if we put it into the release?
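
For reference, this is roughly how I cross-check the Prometheus graphs on a node. It is only a sketch, not part of pScheduler: it assumes the third-party psutil package is installed and that the relevant processes can be identified by "pscheduler" or "postgres" appearing in their command lines.

```python
#!/usr/bin/env python3
# Rough cross-check of the Prometheus graphs (not part of pScheduler itself):
# sum the resident memory of processes whose command line mentions
# "pscheduler" or "postgres". Requires the third-party psutil package.
from collections import defaultdict

import psutil

totals = defaultdict(int)
for proc in psutil.process_iter(["cmdline", "memory_info"]):
    info = proc.info
    if not info["cmdline"] or info["memory_info"] is None:
        continue  # kernel thread, zombie, or unreadable process
    cmdline = " ".join(info["cmdline"])
    for tag in ("pscheduler", "postgres"):
        if tag in cmdline:
            totals[tag] += info["memory_info"].rss
            break

for tag, rss in sorted(totals.items()):
    print(f"{tag}: {rss / 1024 ** 2:.0f} MiB resident")
```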

p1-debian12-host-metrics
p2-ubuntu20-host-metrics
p3-debian12-host-metrics
p1-debian12-pscheduler-run
p2-ubuntu20-pscheduler-run
p3-debian12-pscheduler-run
pozman.json

@szymontrocha
Author

Any thoughts on this?

@mfeit-internet2
Member

We've seen memory growth in the runner before. I suspect it comes from a combination of runner processes taking on long-running jobs (e.g., latencybg) and growing from the smaller jobs they also handle. As pSConfig refreshes things, the older processes drop off and new ones start, which is where the 24-hour cycle comes from.

Fixing it will require a re-think of some of that and some rework of the runner code to avoid it as much as possible.
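
To illustrate the kind of rework I mean (this is a generic Python sketch of the pattern, not pScheduler's actual runner code): one way to keep long-lived worker processes from accumulating memory is to recycle them after a fixed number of jobs, which the standard library's multiprocessing.Pool supports via maxtasksperchild.

```python
# Generic illustration only, not pScheduler's runner: recycling worker
# processes after a fixed number of jobs bounds how much memory any one
# worker can accumulate from earlier runs.
from multiprocessing import Pool


def run_job(job_id):
    # Stand-in for executing one scheduled run; the real runner does far more.
    return f"run {job_id} finished"


if __name__ == "__main__":
    # Each worker is replaced after 50 jobs, so memory it grew into while
    # handling earlier runs is returned to the OS when it exits.
    with Pool(processes=4, maxtasksperchild=50) as pool:
        for result in pool.imap_unordered(run_job, range(200)):
            print(result)
```

Long-running jobs like latencybg would still pin a worker for their whole duration, so that part needs separate handling.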

@szymontrocha
Author

Maybe it indeed requires additional work, but given that even a testpoint now struggles to run a simple mesh, it seems like urgent work. My 4-node mesh is totally filled up doing almost nothing, with multiple failed and missed tests.
There is also an increasing discussion, at least in the GEANT community, about using Raspberry Pis for lightweight testpoint measurements. I can't imagine an RPi under this much load. I think a testpoint should have the minimal possible load and should not require so much RAM just to run its own processes. I'm not talking about hardware limitations for running tests like throughput.
Second, the hardware specification needs revising, since it currently seems to be just not right for the different bundles.
