
Request for pscheduler optimisation due to constant load even for small mesh #1502

Open
szymontrocha opened this issue Jan 10, 2025 · 3 comments

Comments

@szymontrocha

Here is my production mesh in my metro network:
3 nodes (small-node hardware: BRIX GB-BASE-3160, Intel(R) Celeron(R) CPU J3160 @ 1.60GHz with 8 GB RAM). Two run Debian 12, one Ubuntu 20.
All nodes run the perfsonar-testpoint 5.1.4 bundle with the same default configuration, and all send results to a central default-installation archive.
The mesh runs a very light set of tests: throughput every 3 hours, latency, and some dns, http and rtt tests. See the JSON attached.
There were no significant changes to the mesh definition in this timeframe.

Issues observed (all graphs attached, from the built-in Prometheus monitoring, for the last 14 days):

  • Looking at memory consumption on the nodes, pscheduler (in top these are mostly python and postgres processes) does something every day for almost 9 (!) hours, consuming 3 GB (!) of RAM on the Debian 12 nodes. This means the node is loaded for almost half of each day while doing almost nothing (no heavy tests are running). That seems like overkill for a lightweight testpoint, which was designed to be a thin installation, and for such a small mesh. A rough way to cross-check this outside Prometheus is sketched just after this list.
  • There is a clear difference between the Debian 12 and Ubuntu 20 load. On Ubuntu 20 the process takes the same amount of time but consumes ~2 GB of RAM instead of 3 GB.
  • I observe big spikes in the number of runs, between 7,500 and 15,000. I don't know if that matters, but it is at least strange to observe, and maybe it deserves some description in the documentation of what such big numbers really mean to the user, i.e. whether one should be worried when seeing such graphs or whether it is normal. How should the graph be interpreted if we put it into the release?
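
For reference, this is roughly how I cross-check the Prometheus graphs on a node. It is only a sketch, not part of pScheduler: it assumes the third-party psutil package is installed and that the relevant processes can be identified by "pscheduler" or "postgres" appearing in their command lines.

```python
#!/usr/bin/env python3
# Rough cross-check of the Prometheus graphs (not part of pScheduler itself):
# sum the resident memory of processes whose command line mentions
# "pscheduler" or "postgres". Requires the third-party psutil package.
from collections import defaultdict

import psutil

totals = defaultdict(int)
for proc in psutil.process_iter(["cmdline", "memory_info"]):
    info = proc.info
    if not info["cmdline"] or info["memory_info"] is None:
        continue  # kernel thread, zombie, or unreadable process
    cmdline = " ".join(info["cmdline"])
    for tag in ("pscheduler", "postgres"):
        if tag in cmdline:
            totals[tag] += info["memory_info"].rss
            break

for tag, rss in sorted(totals.items()):
    print(f"{tag}: {rss / 1024 ** 2:.0f} MiB resident")
```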

p1-debian12-host-metrics
p2-ubuntu20-host-metrics
p3-debian12-host-metrics
p1-debian12-pscheduler-run
p2-ubuntu20-pscheduler-run
p3-debian12-pscheduler-run
pozman.json

@szymontrocha
Author

Any thoughts on this?

@mfeit-internet2
Member

We've seen memory growth in the runner before. I suspect it comes from a combination of runner processes taking on long-running jobs (e.g., latencybg) and growing from the smaller jobs they also handle. As pSConfig refreshes things, the older processes drop off and new ones start, which is where the 24-hour cycle comes from.

Fixing it will require a re-think of some of that and some rework of the runner code to avoid it as much as possible.
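
To illustrate the kind of rework I mean (this is a generic Python sketch of the pattern, not pScheduler's actual runner code): one way to keep long-lived worker processes from accumulating memory is to recycle them after a fixed number of jobs, which the standard library's multiprocessing.Pool supports via maxtasksperchild.

```python
# Generic illustration only, not pScheduler's runner: recycling worker
# processes after a fixed number of jobs bounds how much memory any one
# worker can accumulate from earlier runs.
from multiprocessing import Pool


def run_job(job_id):
    # Stand-in for executing one scheduled run; the real runner does far more.
    return f"run {job_id} finished"


if __name__ == "__main__":
    # Each worker is replaced after 50 jobs, so memory it grew into while
    # handling earlier runs is returned to the OS when it exits.
    with Pool(processes=4, maxtasksperchild=50) as pool:
        for result in pool.imap_unordered(run_job, range(200)):
            print(result)
```

Long-running jobs like latencybg would still pin a worker for their whole duration, so that part needs separate handling.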

@szymontrocha
Author

Maybe it indeed requires additional work, but given that even a testpoint now struggles to run a simple mesh, it seems like urgent work. My 4-node mesh is totally filled up doing almost nothing, with multiple failed and missed tests.
There is also an increasing discussion, at least in the GEANT community, about using Raspberry Pis for lightweight testpoint measurements. I can't imagine an RPi under this much load. I think a testpoint should have the minimal possible load and should not require so much RAM just to run its own processes. I'm not talking about hardware limitations for running tests like throughput.
Second, the hardware specification needs revising, since it currently seems to be just not right for the different bundles.
