Skip to content

A Cron job distributed system monitor. Performs periodic diagnostics using the High Performance Linpack benchmark to test for abnormal behavior in an HPC system.

Notifications You must be signed in to change notification settings

AdamDorwart/HPLdiag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This directory contains everything that you'll need to run a single
node HPL problem on Comet using 12 MPI processes and 2 threads per
process.

log/				Stores simplified logs of test results
results/			Stores indivdual job stdout/err files
compute/HPL.dat			HPL input data for CPU tests
compute/hpl_1n_12p_2t.comet	Slurm batch script for HPL benchmarks on Comet compute nodes
compute/xhpl.cpu		HPL executable for hybrid (MPI+OpenMP) execution
compute/ssd-test.fio		FIO input data used for SSD test
compute/fio			Flexible IO tester. Used for SSD benchmarks
gpu/HPL.dat			HPL input data for GPU tests (larger problem size for longer runtime)
gpu/hpl_1n_12p_4g_2t.comet	Slurm batch script for HPL benchmarks on Comet gpu nodes
gpu/xhpl.gpu			HPL executable for CUDA GPU execution
hosts				List of Comet nodes with partitions
HPLdiag.sh			Runs diagnostics on all nodes in a provided input file (see hosts) and puts the results in log/HPLresults.MMDDYY
HPLcancel.sh			Cancels all running tests for the current $USER. Logs actions in log/HPLresults.MMDDYY
sendStatusEmail.sh		Sends a status email on the health of the system given an input log file (located in logs), uses statusMailingList for recipients
statusEmail.txt			The template used for sendStatusEmail script
statusMailingList		Line deliminated file of emails to send status reports to

To run a diagnostic across all of comet use the provided hosts file like so
$ ./HPLdiag.sh hosts

To run a diagnostic on particular nodes and partitions
$ cat > smallTest
comet-xx-yy compute
comet-ww-zz compute
comet-ii-jj shared

$ ./HPLdiag.sh smallTest

'hosts' file can be generated with
sinfo -o "%n %R" -h | column -t | sort -u -k1,1 > hosts
The end of the file will sometimes contain a line for a 'fat' partition. Remove it as needed.

TODO:
For GPU nodes, use combination of HPL and AMBER
For SSD tests, look into IOR and FIO. Note that drive performance degrades as sectors get marked for TRIM and need to be periodically wiped to achevie full speed
Create Health Moniter cronjob to manage automated benchmarks 
Analyze duration times of results
Better input handling for HPLdiag
- Accept pipping or input file
- Options to run specific tests

Useful commands:
Get the status of all nodes currently enqueued
sinfo -o "%n %T %E" -n `squeue -h -o %n -u $USER | sed ':a;N;$!ba;s/\n/,/g'`
Create a hosts retest file from the TIMEOUTs in a result file
grep HPL-*-TIMEOUT log/HPLresults.$timestamp | awk '{print $4}' | grep -f - hosts > retestHosts

About

A Cron job distributed system monitor. Performs periodic diagnostics using the High Performance Linpack benchmark to test for abnormal behavior in an HPC system.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages