Querying job_stats on large cluster(s) takes multiple seconds #30
Comments
Hi mjurksa, how did you measure the delay of the jobstats metrics processing? Which Lustre version and which lustre_exporter version do you use? Can you provide an overview of the jobstats count on your Lustre filesystem, so that we can make better comparisons? Since querying Prometheus for jobstats might quickly become impractical because of the high data point count, we could compare some snapshot numbers taken directly on the server, as follows:
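For comparison, one way to take such a per-target snapshot count is sketched below in Go. The path /proc/fs/lustre/obdfilter/*/job_stats is an assumption and may differ depending on Lustre version and target type (MDT jobstats live under a different prefix, and the parameters may only be reachable via lctl get_param):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Count job_id entries per OST job_stats file.
	// The glob path is an assumption; adjust it for your Lustre version/target type.
	paths, err := filepath.Glob("/proc/fs/lustre/obdfilter/*/job_stats")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			continue
		}
		count := 0
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			if strings.Contains(scanner.Text(), "job_id:") {
				count++
			}
		}
		f.Close()
		fmt.Printf("%s: %d jobids\n", p, count)
	}
}
```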
Currently our maximum count for an OSS is at 10k jobids with a scrape time of 1-2s, and for an MDS it is at 8.5k jobids with a scrape time of 3-5s. Just to have some numbers, even though that can surely change...

Checking the promql query: indeed, files are re-opened and re-read for each metric across multiple OSTs. Yes, a workaround to just open and read the file once should not be a big deal. Regardless of the implementation, I would consider decreasing the corresponding Lustre parameter. As scrape interval we have set 30 seconds for the exporter.

Best
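A minimal sketch of that open-and-read-once workaround is shown below. This is not the exporter's actual code; the target path, metric names and regex patterns are simplified assumptions based on the job_stats file format:

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Example target path; the real exporter discovers its targets dynamically.
	path := "/proc/fs/lustre/obdfilter/lustre-OST0000/job_stats"

	// Read the job_stats file exactly once per scrape...
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	content := string(data)

	// ...then run every per-metric extraction against the cached content,
	// instead of re-opening and re-reading the file for each metric.
	metrics := []string{"read_bytes", "write_bytes", "getattr", "setattr"}
	for _, m := range metrics {
		re := regexp.MustCompile(m + `: +\{ +samples: +([0-9]+)`)
		for _, match := range re.FindAllStringSubmatch(content, -1) {
			fmt.Printf("%s samples: %s\n", m, match[1])
		}
	}
}
```

Compiling the per-metric regexes once up front, or replacing them with a single line-oriented parse, would also cut down the repeated regexp work.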
Hey,

We have around 15,000-20,000 jobs running daily, depending on the day of the week and holidays. We have compiled the latest release of the lustre_exporter and are using lustre-server version 2.12.8.

We measured the performance like this:
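For reference, a simple way to time a scrape end to end is sketched below; the endpoint http://localhost:9169/metrics is an assumption, so adjust host and port to your exporter configuration:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Time a single scrape of the exporter's metrics endpoint.
	// The URL is an assumption; adjust it to your deployment.
	url := "http://localhost:9169/metrics"

	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer resp.Body.Close()

	// Drain the full response, as Prometheus would during a scrape.
	n, err := io.Copy(io.Discard, resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("scrape took %s for %d bytes\n", time.Since(start), n)
}
```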
Regards,
Hi Mantautas, I do not see a problem here: if the scrape interval is set to 60 seconds and the exporter has a runtime of a few seconds, that is more than sufficient in your case.

Best
Hello Gabriele, well, yes, it works, but that's a waste of resources imho. FYI: we also did some profiling and it looks like 50% of the runtime is spent in the regex operations of the jobstats parsing.

Regards,
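For context, CPU profiling of this kind can be done with Go's standard runtime/pprof package; a minimal sketch (wrapping a hypothetical collection loop, not the exporter's stock setup) could look like this:

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

func main() {
	// Write a CPU profile while the (hypothetical) collection loop runs,
	// then inspect it with: go tool pprof cpu.prof
	f, err := os.Create("cpu.prof")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer pprof.StopCPUProfile()

	collectJobstats() // placeholder for the code path being profiled
}

// collectJobstats stands in for the exporter's jobstats parsing.
func collectJobstats() {}
```

Alternatively, importing net/http/pprof exposes the same profiles over HTTP on a running process, provided the process serves the default HTTP mux.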
Hi Mantautas, what resources do you mean exactly, e.g. CPU utilization? File servers are usually not very CPU bound... Can you give some numbers on the count of jobstats blocks to be processed? You have an OSS failover pair; in what way does the exporter impact the time to move resources from one OSS to the other? Sure, with an increasing number of jobstats to be processed the exporter runtime will increase. Do not get me wrong, I agree with you: wasting resources can hurt performance and has its own cost, but investigating, improving and testing the exporter for a few seconds of performance gain has to be weighed against the time invested, and at the moment there is no critical issue on this from our side. Interesting, how exactly did you profile the regex operations of the jobstats?

Best
I think the expectation is that if there are twice as many OSTs running on an OSS due to failover, then there are twice as many job_stats files to be processed.
Note that it is also possible to consume the …

Another alternative instead of waiting for …

At worst, if a job became active immediately after the stats were read, the amount of stats lost would be …

A more complicated mechanism would be something like "write …"
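As one concrete variant of consuming the statistics after each read: the Lustre job stats documentation describes clearing a job_stats file by writing "clear" to it (equivalent to lctl set_param obdfilter.*.job_stats=clear). A rough sketch of such a read-then-clear cycle, with an assumed target path, is shown below; the clear semantics should be verified against your Lustre version:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Example path; real code would iterate over all targets.
	path := "/proc/fs/lustre/obdfilter/lustre-OST0000/job_stats"

	// Read the accumulated statistics once...
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	exportMetrics(data)

	// ...then clear them, so the next read only contains jobs that were
	// active since this point. Writing "clear" is assumed to mirror
	// `lctl set_param obdfilter.*.job_stats=clear`; verify this on your
	// Lustre version before relying on it.
	if err := os.WriteFile(path, []byte("clear"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}

// exportMetrics stands in for whatever turns the raw stats into metrics.
func exportMetrics(raw []byte) {
	fmt.Printf("read %d bytes of job_stats\n", len(raw))
}
```

The small window between the read and the clear is exactly where the trade-off about stats being lost for jobs that became active in between comes from.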
Hello,
we have identified a performance issue with the lustre_exporter when querying metrics on large Lustre file systems with a significant number of jobstats. The problem seems to be related to procfs.go repeatedly accessing the same job_stats file in procfs, resulting in a delay of 4-5 seconds per query.
I think it should be possible to open the file once and scan each line for the needed information, although I'm not very well versed in Go and don't know whether this would require a significant refactor of the code.
Is this issue known? Are there any workarounds for it?