forked from cinek810/slurmmon

gather and plot data about Slurm scheduling and job statistics

fafik23/slurmmon

Slurmmon is a system for gaining insight into Slurm and the jobs it runs. It's meant for cluster administrators looking to raise cluster utilization and measure the effects of configuration changes. Features include:

  • trending all the scheduler performance diagnostics (sdiag output)
  • measuring job turnaround time of probe jobs, as a bellwether of scheduling issues
  • creating daily whitespace reports -- identifying specific users and jobs with low utilization of their allocations (the jobs that lead to the dreaded whitespace gap in plots of total resources vs. used resources)
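
As a rough illustration of the whitespace idea, the sketch below flags jobs whose consumed CPU time is a small fraction of what their allocation could have delivered. It is not slurmmon's actual report code: the `JobRecord` fields, the `whitespace_report` function, and the 50% threshold are all hypothetical names and values chosen for this example, and the job data is assumed to have been gathered from Slurm accounting beforehand.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job accounting record (not a slurmmon type)."""
    job_id: str
    user: str
    alloc_cpus: int      # CPUs allocated to the job
    elapsed_s: float     # wall-clock runtime in seconds
    cpu_time_s: float    # total CPU seconds actually consumed


def cpu_utilization(job):
    """Fraction of the allocated CPU capacity the job used, capped at 1.0."""
    capacity = job.alloc_cpus * job.elapsed_s
    if capacity <= 0:
        return 0.0
    return min(job.cpu_time_s / capacity, 1.0)


def whitespace_report(jobs, threshold=0.5):
    """Return (job_id, user, utilization, wasted_cpu_seconds) for jobs
    below the utilization threshold, largest waste first."""
    rows = []
    for job in jobs:
        util = cpu_utilization(job)
        if util < threshold:
            wasted = job.alloc_cpus * job.elapsed_s - job.cpu_time_s
            rows.append((job.job_id, job.user, util, wasted))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

Sorting by wasted CPU-seconds rather than by utilization puts the jobs that contribute most to the whitespace gap at the top, even if a smaller job has a lower utilization percentage.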

Slurmmon is meant to run on a RHEL/CentOS/SL 6-based system. It currently uses Ganglia for data collection and Apache/mod_python for reporting. The components are:

  • slurmmon-daemon -- the daemons that query Slurm and send data to Ganglia
  • slurmmon-ganglia -- the Ganglia custom reports that use PHP to stack raw RRD data
  • slurmmon-web -- a set of web pages that organize all the reports and relevant plots
  • slurmmon-python -- a general Python interface to Slurm, using lazy evaluation
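
To give a sense of what lazy evaluation can look like in a Python interface of this kind, here is a minimal sketch in which job details are fetched only when an attribute is first read. This is an assumption about the general pattern, not the slurmmon-python API: `LazyJob` and the `loader` callable are hypothetical, and in practice the loader would wrap whatever Slurm query supplies the fields.

```python
class LazyJob:
    """Hypothetical job object that defers the expensive lookup until
    one of its fields is actually accessed."""

    def __init__(self, job_id, loader):
        self.job_id = job_id
        self._loader = loader   # callable: job_id -> dict of field values
        self._fields = None     # filled in on first attribute access

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails, so
        # job_id, _loader and _fields never reach this method.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._fields is None:
            # First access: run the query once and cache the result.
            self._fields = self._loader(self.job_id)
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name) from None
```

Constructing many `LazyJob` objects is cheap under this scheme, because the loader runs at most once per job and only for jobs whose fields are read.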

See the doc directory for more information, specifically:

  • INSTALL for initial installation and setup
  • FAQ for answers to common questions and other details

Here is a screenshot of the basic diagnostic report from the production cluster at FASRC:

[Screenshot: slurmmon basic diagnostic report]
