Wheeler
This page contains general info for wheeler.caltech.edu. The full SLURM documentation can be found here: https://slurm.schedmd.com/documentation.html
Newer versions of OpenSSH have reduced support for RSA keys, which makes Wheeler unhappy. If you are having trouble SSHing in, please make sure you have PubkeyAcceptedKeyTypes ssh-rsa set in your ~/.ssh/config for Wheeler, or that you are using a newer ed25519 key.
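For example, a minimal ~/.ssh/config entry might look like the sketch below (the Host alias and <USERNAME> are placeholders; writing +ssh-rsa appends ssh-rsa to the default list of accepted key types instead of replacing it):
Host wheeler
HostName wheeler.caltech.edu
User <USERNAME>
PubkeyAcceptedKeyTypes +ssh-rsa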
Compute nodes on Wheeler have 24 physical cores and 64GB of RAM.
Certain nodes have been found to never execute MPI code. These nodes are (as of January 9th, 2023)
- wheeler061 (But worked fine for FLASH on June 23, 2023)
- wheeler063
- wheeler099
- wheeler105 (But worked fine for SpEC on June 23, 2023)
- wheeler110 (But worked fine for SpEC on June 23, 2023)
- wheeler126 (But worked fine for SpEC on June 23, 2023)
Also, I (Kyle Nelli) just ran an 8-node job on Wheeler that was considerably slower than an identical job run on a different 8 nodes. The offending 8 nodes were wheeler[017-021,101-103]. Not sure which of these are the bad ones though. (January 19th, 2023)
You can avoid bad nodes by adding, e.g., #SBATCH --exclude=wheeler061,wheeler063 to your batch script. The syntax #SBATCH --exclude=wheeler[061-063] also works and will ignore nodes 061, 062, and 063.
- I (Mark) put wheeler099 into a DOWN state on Jun 23, 2023. (Let's see if it stays that way.) (Of course this doesn't fix the problem.)
Jobs are sorted into one of two queues: the default productionQ with a time limit of 24 hrs, and debug with a time limit of 2 hrs. There is no limit on the number of cores, beyond system-wide limits.
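For example, to submit a batch job to the debug queue with its maximum 2-hour limit, your batch script could include the following directives (the time value is just one choice within the limit):
#SBATCH -p debug
#SBATCH -t 02:00:00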
Wheeler uses SLURM instead of PBS, so the job & queue commands are different. For a full list of commands, see the SLURM documentation. Below is a toolbox of commands you will likely use frequently:
sbatch <job script> # submit a batch job
scancel <job id> # cancel a job
squeue --start # show estimated start times
squeue -u <username> # query current jobs
scontrol show job <job id> # query details on a job
sinfo -a # query queue and node statuses
All of these SLURM commands have help flags and man pages: e.g. to print a summary of options for sinfo use sinfo -h, and to examine the manual use man sinfo.
In addition, to submit an interactive job to the debug queue, use one of the following commands (see the warning below about choosing between the -n and -c options):
srun -p debug -n <number of cores> -t <time in minutes> --pty /bin/bash
or
srun -p debug -N <number of nodes> -c <number of cores per node> -t <time in minutes> --pty /bin/bash
For executables that expect one MPI rank per core, use the -n option. For running SpECTRE, which assumes one MPI rank per node, use the -c option. Using the wrong option can result in MPI hangs.
You can change the number of processes used on each node by defining --ntasks-per-node. Note that <number of cores> does NOT need to be a multiple of 24, i.e. you can request a fraction of a node, and the time limit cannot exceed 120 minutes for the debug queue.
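For example, to get a 4-core, one-hour interactive shell on the debug queue (the core count and time here are arbitrary choices within the limits above):
srun -p debug -n 4 -t 60 --pty /bin/bash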
In order to run an OpenMPI-dependent executable, you might need to
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
and you need to prepend srun in front of the executable you run. You shouldn't need this for Intel MPI.
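For example (MyExecutable is a placeholder for your own binary), inside your allocation this would look like:
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun ./MyExecutable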
Finally, while qsub -I -q debug does work, when the job starts you will be placed on the head node and must then SSH into the allocated compute node. We do not describe how to do this here because the srun approach is easier and less error-prone.
Jupyter notebooks can be launched on Wheeler and then viewed locally on one's desktop as follows. On Wheeler, SSH into a compute node (e.g., srun -p debug -n 1 -t 120 --pty /bin/bash), and then run:
jupyter notebook --no-browser --port=<PORT> --ip=0.0.0.0
where <PORT> is the port you would like to use, e.g., 8888. Locally, run:
ssh -NfL <PORT>:<NODE>:<PORT> <USERNAME>@wheeler.caltech.edu
where <NODE> is the compute node you SSH'd into on Wheeler, e.g., wheeler014. Finally, type localhost:<PORT> into your favorite browser. If it asks for a password, this is the token printed when running jupyter notebook on Wheeler.
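Using the example values above (port 8888 and compute node wheeler014), the two commands would be:
jupyter notebook --no-browser --port=8888 --ip=0.0.0.0
and, locally:
ssh -NfL 8888:wheeler014:8888 <USERNAME>@wheeler.caltech.edu
then open localhost:8888 in your browser.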
Alternatively, add the following to your ~/.ssh/config:
Host wheelerjupyter
HostName wheeler.caltech.edu
ForwardAgent yes
User <USERNAME>
LocalForward localhost:<PORT> <NODE>:<PORT>
Then connect to Wheeler through VSCode using the new wheelerjupyter host. Alternatively, after you have changed your config file, locally run ssh -NfL <PORT>:<NODE>:<PORT> <USERNAME>@wheeler.caltech.edu and type localhost:<PORT> into your browser.
For running parallel bilby on Wheeler, you need to have the following modules loaded:
gcc/9.3.0 impi/2017.1 python/3.8.7
and you should create a python venv (and activate it) before installing parallel bilby and its dependencies:
python -m venv /path/to/python_venv
source /path/to/python_venv/bin/activate
If you don't do this, MPI interoperability issues will cause your multi-node jobs to run forever or crash.
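A full setup might look like the following sketch; the pip package name parallel_bilby is assumed here, so check the parallel bilby documentation for the exact install command:
module load gcc/9.3.0 impi/2017.1 python/3.8.7
python -m venv /path/to/python_venv
source /path/to/python_venv/bin/activate
pip install parallel_bilby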
Within a single-node interactive job, a multithreaded executable can be run under gdb using
srun <srun args> --pty gdb --args <executable> <executable args>
where the <srun args> are the same as what would be used to run without gdb.
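For example, to debug a hypothetical MyExecutable with input file Input.yaml on a single core in the debug queue:
srun -p debug -n 1 -t 60 --pty gdb --args ./MyExecutable Input.yaml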
Ganglia is an online tool to help monitor the status of compute nodes. It is currently not working on Wheeler, but may be brought back online if there is enough interest. The old link was Ganglia.
It is often convenient to be notified via email when your job finishes or is aborted. To do this, include the following in your submission script:
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=ALL
where <your email address> is replaced with your email address. This will notify you when the job starts, if it is aborted, and when it finishes.
Usually every job on wheeler reserves an integer number of nodes, where each node has 24 cores. So what do you do if you want to run a job that uses fewer than 24 cores? Please do not just run that job on a 24-core node without thinking; by default if you run (for example) a 4-core job on a 24-core node, then 20 cores will be doing nothing, and nobody else can use them (including other jobs owned by you). So here's what you do:
Slurm on wheeler currently assumes that you use 2.3GB of memory per core. If you need more or less memory than that, #SBATCH --mem-per-cpu=2G is the option you need. If you specify --ntasks less than 24 (the number of cores on a wheeler node), then more than one job can run on a single node, as long as those jobs don't request more than 56GB in total (the amount of memory available to jobs on a wheeler node).
Here's an example script that uses 12 cores (--ntasks-per-node 12) and uses --mem-per-cpu to allow more than one slurm job to run on the same node. If you run this same script twice (with two different executables), both jobs should end up on the same node.
#!/bin/bash -
#SBATCH -o SpEC.stdout
#SBATCH -e SpEC.stdout
#SBATCH --ntasks-per-node 12
#SBATCH -A sxs
#SBATCH --no-requeue
#SBATCH -J ID_delta_1_82_
#SBATCH --nodes 1
#SBATCH -t 01:00:00
#SBATCH --mem-per-cpu=2G
mpirun -np 12 MyExecutable >Output.out 2>&1
Here is an example submit script that launches two 12-core job steps together:
#!/bin/bash -
#SBATCH -o SpEC.stdout
#SBATCH -e SpEC.stdout
#SBATCH --ntasks-per-node 24
#SBATCH -A sxs
#SBATCH --no-requeue
#SBATCH -J ID_delta_1_82_
#SBATCH --nodes 1
#SBATCH -t 01:00:00
#SBATCH --mem-per-cpu=2G
module purge
umask 0022
set -x
# load modules, etc.
# Note that the '&' is used to background each job step that
# launches MPI jobs. In this setup the jobs are not explicitly
# pinned to specific cores. You can set specific cores to run
# on by launching with
# 'srun --mpi=pmi2 -n 12 --cpu_bind=map_cpu:0,1,2,3,etc MyExecutable'
cd ./A0.075
mpirun -np 12 MyExecutable >Output.out 2>&1 &
cd ../A0.0755
mpirun -np 12 MyExecutable >Output.out 2>&1 &
# Wait for the backgrounded jobs to complete
wait
The wait command at the end of the submit script is important; without it the job would end before the backgrounded tasks completed.
In order to achieve good performance when running multiple executables on a single node, they must all have their own dedicated cores. For MPI executables mpirun can pin executables to cores. For single or multithreaded applications taskset can be used.
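For example, to pin two hypothetical 12-thread executables to separate halves of a 24-core node (the core ranges shown are just one possible split):
taskset -c 0-11 ./ExecutableA >OutputA.out 2>&1 &
taskset -c 12-23 ./ExecutableB >OutputB.out 2>&1 &
wait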
For SpECTRE or other Charm++-based executables (e.g. SpECTRE CCE) you can use
{spectre_build_dir}/bin/CharacteristicExtract ++ppn 1 +setcpuaffinity \
+pemap some_number1 \
+commap some_number2 2>&1 &
In this example CCE is run on 2 cores: 1 communication core and 1 worker core. The +pemap specifies which core(s) the worker(s) should use, while +commap specifies which core(s) the communication thread(s) should be placed on. See the Charm++ manual for more details.
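For instance, to place the single worker thread on core 0 and the communication thread on core 1 (the core numbers are an arbitrary choice):
{spectre_build_dir}/bin/CharacteristicExtract ++ppn 1 +setcpuaffinity \
+pemap 0 \
+commap 1 2>&1 &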
ParaView is installed for off-screen rendering using the OSMesa 17.3.3 backend. Load the ParaView module using module load paraview/5.10.1 (you'll need ParaView 5.10.1 also on your local machine), and if you want to run on multiple cores or nodes load the IMPI 2017.1 module using module load impi/2017.1. ParaView has a python interface, pvpython, and a parallel python interface, pvbatch. Documentation of the Python interface is unfortunately quite sparse, but ParaView has a tracing option that will record the python commands corresponding to what you are doing in the GUI. To start tracing use Tools->Start Trace in the ParaView GUI. This way you can set up a script locally with a small data set and, once you are happy, run it scaled up on Wheeler. Here is an example ParaView python script used for SpECTRE:
from paraview.simple import *
paraview.simple._DisableFirstRenderCameraReset()
grmhdxmf = XDMFReader(
FileNames=['/home/nils/nils/spectre/FishboneDiskCube/GrMhd.xmf'])
grmhdxmf.PointArrayStatus = ['ErrorRestMassDensity', 'RestMassDensity']
# get active source.
grmhdxmf = GetActiveSource()
# Properties modified on grmhdxmf
grmhdxmf.GridStatus = ['Evolution']
# create a new 'Slice'
slice1 = Slice(Input=grmhdxmf)
slice1.SliceType = 'Plane'
slice1.SliceOffsetValues = [0.0]
# init the 'Plane' selected for 'SliceType'
slice1.SliceType.Origin = [0.0, 11.0, 0.0]
# Properties modified on slice1.SliceType
slice1.SliceType.Normal = [0.0, 0.0, 1.0]
# Properties modified on slice1.SliceType
slice1.SliceType.Normal = [0.0, 0.0, 1.0]
# get active view
renderView1 = GetActiveViewOrCreate('RenderView')
renderView1.ViewSize = [1920, 1080]
# get color transfer function/color map for 'RestMassDensity'
restMassDensityLUT = GetColorTransferFunction('ErrorRestMassDensity')
# Rescale transfer function
restMassDensityLUT.RescaleTransferFunction(1e-12, 78.0)
# restMassDensityLUT.UseLogScale = 1
print("rendering...")
# show data in view
slice1Display = Show(slice1, renderView1)
print("Setting data and camera")
# trace defaults for the display properties.
slice1Display.Representation = 'Surface'
slice1Display.ColorArrayName = ['POINTS', 'ErrorRestMassDensity']
slice1Display.LookupTable = restMassDensityLUT
slice1Display.OSPRayScaleArray = 'ErrorRestMassDensity'
slice1Display.OSPRayScaleFunction = 'PiecewiseFunction'
slice1Display.SelectOrientationVectors = 'None'
slice1Display.ScaleFactor = 4.0
slice1Display.SelectScaleArray = 'ErrorRestMassDensity'
slice1Display.GlyphType = 'Arrow'
slice1Display.GlyphTableIndexArray = 'ErrorRestMassDensity'
slice1Display.DataAxesGrid = 'GridAxesRepresentation'
slice1Display.PolarAxes = 'PolarAxesRepresentation'
slice1Display.GaussianRadius = 2.0
slice1Display.SetScaleArray = ['POINTS', 'ErrorRestMassDensity']
slice1Display.ScaleTransferFunction = 'PiecewiseFunction'
slice1Display.OpacityArray = ['POINTS', 'ErrorRestMassDensity']
slice1Display.OpacityTransferFunction = 'PiecewiseFunction'
# show color bar/color legend
slice1Display.SetScalarBarVisibility(renderView1, True)
# hide data in view
Hide(grmhdxmf, renderView1)
# current camera placement for renderView1
renderView1.CameraPosition = [0.0, 11.0, 50]
renderView1.CameraFocalPoint = [0.0, 11.0, 0.0]
renderView1.CameraParallelScale = 23.345235059857504
# update the view to ensure updated data information
renderView1.Update()
WriteImage("./error.jpg", renderView1)
print("Image file written")
Since rendering can take quite a while, print statements are used at specific points so the user receives some feedback. To launch the above pvpython script in parallel on 10 cores run:
module purge # Get rid of whatever modules you have loaded
module load paraview/5.10.1 impi/2017.1 # Load ParaView and Intel MPI
mpirun -n 10 pvbatch ./VisParaView.py
where VisParaView.py is the script name on disk. Please don't run in parallel on the login node!
For those interested in how ParaView was built, see the module files for OSMesa and ParaView:
/usr/local/Modules/modulefiles/visualization/osmesa/17.3.3
/usr/local/Modules/modulefiles/visualization/paraview/5.6.0
ParaView server has support for rendering data on Wheeler and sending the results to a local machine for viewing. Start pvserver in serial on the login node (pvserver does not currently work on the compute nodes), e.g. by simply running pvserver. Now start a new SSH connection to Wheeler using:
ssh -L11111:wheeler:11111 wheeler
where the 11111 are the ports the ParaView server and client will use.
On your local machine open ParaView and select File->Connect.... Select Add Server, give the new server a name, for ServerType choose Client/Server, for Host use localhost, and for the port use 11111 (the port needs to match the first port specified in the ssh connection). Click Configure and set the Startup Type to Manual. Now click Save. In the future, to connect to Wheeler select the server you just created and click Connect. You can now open files on Wheeler through the ParaView GUI as if you were working locally on Wheeler. Keep in mind that there will be some delay due to internet connectivity and due to the amount of data you might be visualizing.
To find out your disk usage on Panasas filesystems (/panfs/ds09/sxs) run
/usr/local/adm/bin/fs_usage /panfs/ds09/sxs | grep `whoami`
You can figure out your quota using
/usr/local/bin/pan_quota
Note that you must be somewhere on /panfs for the command to succeed.
Globus (https://www.globus.org/) allows users to transfer files between various HPC systems and other local endpoints. In Globus terminology an endpoint is effectively one system or location where you can transfer data to and/or from. All XSEDE machines already have endpoints set up so please visit the XSEDE documentation for how to use Globus in that environment.
To transfer data to/from Wheeler, you must set up a "Globus Connect Personal" (GCP) endpoint on Wheeler. Follow the steps below:
- Sign in to the Globus web app, which you can (for example) do using your XSEDE credentials as detailed on the XSEDE portal https://portal.xsede.org/data-management. Then generate a setup key for the new GCP endpoint via the web interface. As of this writing (May 7, 2019) it is the first three steps in the Installation instructions https://docs.globus.org/how-to/globus-connect-personal-linux/.
- On wheeler, the GCP tools are provided in a module, and the endpoint can be set up and started from the command line. To set up your endpoint run
module load globus-personal/2.3.6
globusconnectpersonal -setup <YOUR_ENDPOINT_KEY>
where you must replace <YOUR_ENDPOINT_KEY> with the key generated in item 1 above.
To start the Globus server run
globusconnectpersonal -start -restrict-paths "/panfs/ds09/sxs/<USERNAME>/,/home/<USERNAME>/" &
where you must replace <USERNAME> with the output of whoami.
At this point you should see Wheeler under the "Administered by You" tab in the Endpoints page of the Globus web app. If you click on Wheeler you should be able to browse your Wheeler files in the web view. Note that unless you disown the globusconnectpersonal process or run it in GNU screen, you will need to remain logged into Wheeler during the transfer. You will get transfer speeds of 5-10MB/s, though at the start of the transfer it will be closer to 1MB/s.
After you are done transferring data you can stop the server with:
globusconnectpersonal -stop
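Since the transfer needs the globusconnectpersonal process to keep running, one option (a sketch; the screen session name is arbitrary) is to run it inside GNU screen so you can log out:
screen -S globus
module load globus-personal/2.3.6
globusconnectpersonal -start -restrict-paths "/panfs/ds09/sxs/<USERNAME>/,/home/<USERNAME>/"
# detach with Ctrl-a d; reattach later with: screen -r globus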
To contact fellow admins, email [email protected].
Slurm calls queues "partitions". Partitions are assigned nodes, allowed groups/users, time limits, etc. To set up a temporary partition for only some users it is easiest to leave the existing partitions as they are and use a reservation to block off the desired nodes. What the overall procedure looks like is:
- Set up a new partition with the desired time limits, etc. (all users is fine here). Choose whichever nodes you want to have in the partition. You can check the queue for nodes that will be available soon and grab those.
- Set up a reservation on the same nodes specifying which users you want to be able to run on the reservation. Make sure that you set the flag IGNORE_JOBS, which tells Slurm not to kill any jobs currently using those nodes, but not to allow anyone to allocate the nodes.
- Have users submit jobs to the new partition specifying both the partition and reservation. The flags for this are -p PARTITION_NAME --reservation RESERVATION_NAME. The reservation name will be the name of the first user of the reservation followed by _ followed by a number. If no users are specified when the reservation is created it will be named root_NUMBER.
Note: you can update the partition and reservation later in any way you want; you do not need to recreate it.
Now for the more detailed instructions; a worked example follows the list.
- To create a new partition run sudo scontrol create partition PartitionName=PARTITION_NAME Default=no Nodes=LIST_OF_NODES MaxTime=MAX_TIME. The LIST_OF_NODES can be a comma separated list including a range, e.g. wheeler001,wheeler012,wheeler[013-025]. The MAX_TIME can be in the format days-hours:minutes:seconds or UNLIMITED. For more details see the scontrol documentation section on SPECIFICATIONS FOR CREATE, UPDATE, AND DELETE COMMANDS, PARTITIONS.
- Next we want to reserve the nodes so that only the users we want can run on them. This is also done using the scontrol command. To create the reservation use sudo scontrol create reservation Reservation=RESERVATION_NAME StartTime=HH:MM:SS Duration=HH:MM:SS Flags=SPEC_NODES,OVERLAP,IGNORE_JOBS Nodes=NODE_LIST. Instead of supplying a node list, the Slurm documentation says you can also add the flag PART_NODES and specify PartitionName=PARTITION_NAME to have the nodes associated with a specific partition control the nodes of the reservation.
- Finally, any user who wants to run on the new partition and reservation will need to set #SBATCH -p PARTITION_NAME and #SBATCH --reservation RESERVATION_NAME in the Slurm submit script.
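Putting the pieces together, here is a sketch of the full sequence; the partition name, user names, node range, and times are all hypothetical:
sudo scontrol create partition PartitionName=temp_part Default=no Nodes=wheeler[030-037] MaxTime=1-00:00:00
sudo scontrol create reservation StartTime=now Duration=7-00:00:00 Users=alice,bob Flags=SPEC_NODES,OVERLAP,IGNORE_JOBS Nodes=wheeler[030-037]
Per the naming convention above, the reservation would be reported with a name like alice_1, and the users would then put in their submit scripts:
#SBATCH -p temp_part
#SBATCH --reservation alice_1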
When a node is put up for maintenance it gets put into a DRAIN state so that no new jobs are run on the node. Sometimes the state isn't cleared properly. To check the state of a node run
scontrol show node NODE_NAME
where NODE_NAME would be, for example, wheeler008. To check the reason why a node is in a DRAIN state run
sinfo -R
To bring a node out of a DRAIN state run
scontrol update NodeName=NODE_NAME State=UNDRAIN
where the NODE_NAME would be, for example, wheeler008.
Sometimes jobs get stuck as they are completing and will hang in the CG state. The only solution I've found is SSHing to the compute node(s) and running systemctl restart slurmd. Note that this will take a few minutes to complete and your terminal will be "stuck" waiting.
Compute nodes can go down for a variety of reasons. If there is a hardware issue then someone on the Caltech HPC team must revive the node. However, often a node goes down because of some issue with the slurm daemon, slurmd, on the node. There is slurm documentation on this at https://slurm.schedmd.com/troubleshoot.html#nodes, although there are a few subtle differences. To restart the slurm daemon on the node you must run systemctl restart slurmd instead of the /etc/init.d/... command in the slurm manual. It might take a while to restart the daemon (a minute or two), or you may need to kill the process manually and then run systemctl start slurmd. You can then log out of the compute node and set the node back to IDLE by running
sudo /panfs/ds09/support/slurm/install/current/bin/scontrol update NodeName=wheeler019 State=IDLE Reason="Start node"
The node will be in the IDLE* state, where the * means the node is "unreachable". This is because it might take a few (10-20) minutes for slurm to realize the node has returned. The node may even go back to a DOWN* state for a bit before returning to service.
The slurm conf file is located at /etc/slurm/slurm.conf. The CTLD log is located at /var/log/slurm/slurmctld.log. The daemon log file is located at /var/log/slurm/slurm.log. However, the log files seem to not always be written (Nils D. doesn't understand this).
The compute nodes (and login node) should all have /usr/local symlinked to /home/_SYS_/usr_local. To do this:
cd /usr; mv local local_ORIG; ln -s /home/_SYS_/usr_local local
Not having this correct will result in being unable to load modules.
The /etc/fstab should be something like:
#
# /etc/fstab
# Created by anaconda on Fri May 20 20:01:43 2022
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/vg00-lv_root / xfs defaults 0 0
UUID=70322a03-e113-42c3-82ff-4dad195856b3 /boot ext4 defaults 1 2
/dev/mapper/vg00-lv_scratch /scratch xfs defaults 0 0
/dev/mapper/vg00-lv_tmp /tmp xfs defaults 0 0
/dev/mapper/vg00-lv_var /var xfs defaults 0 0
/dev/mapper/vg00-lv_varlog /var/log xfs defaults 0 0
/dev/mapper/vg00-lv_varspool /var/spool xfs defaults 0 0
/dev/mapper/vg00-lv_vartmp /var/tmp xfs defaults 0 0
UUID=21ae7bb6-56c8-45f1-b795-86aa69428fdd swap swap defaults 0 0
#
tmpfs /dev/shm tmpfs defaults 0 0
#
172.16.20.1:/home /home nfs defaults 0 0
#
panfs://panasas-wheeler/Support /panfs/ds09/support panfs rw,auto,_netdev,callback-network-allow=192.168.132.0/24,rmlist=(192.168.202.87;192.168.202.71;192.168.202.65) 0 0
panfs://panasas-wheeler/SXS /panfs/ds09/sxs panfs rw,auto,_netdev,callback-network-allow=192.168.132.0/24,rmlist=(192.168.202.87;192.168.202.71;192.168.202.65) 0 0
panfs://panasas-wheeler/Hopkins /panfs/ds09/hopkins panfs rw,auto,_netdev,callback-network-allow=192.168.132.0/24,rmlist=(192.168.202.87;192.168.202.71;192.168.202.65) 0 0
panfs://panasas-wheeler/Fuller /panfs/ds09/fuller panfs rw,auto,_netdev,callback-network-allow=192.168.132.0/24,rmlist=(192.168.202.87;192.168.202.71;192.168.202.65) 0 0
You need the right paths and IP addresses for panfs. Make sure the directory /panfs/ds09 exists, and that the mount points are directories, not symlinks:
for x in support sxs hopkins fuller;
do
mkdir /panfs/ds09/$x
mount /panfs/ds09/$x
done
To be sure everything will be mounted after a reboot, reboot the node and check.
Singularity needs to be built as root in order for it to work properly, but it also requires the Go compiler. To build Go, follow the installation instructions in the Singularity docs. On Wheeler, Go was installed inside /usr/local/go using
export VERSION=1.13.7 OS=linux ARCH=amd64
wget https://dl.google.com/go/go$VERSION.$OS-$ARCH.tar.gz
tar xzf go$VERSION.$OS-$ARCH.tar.gz
Note that Go does not need to be installed as root. The Go directory was then renamed to $VERSION so that multiple compiler versions can be supported via modules. The module was set up in /usr/local/Modules/modulefiles/compilers/go/$VERSION. Only the bin directory needs to be appended to the path in the module file.
Singularity was built into /usr/local/singularity using
sudo su
export VERSION=3.5.2
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz
tar -xzf singularity-${VERSION}.tar.gz
cd ./singularity
./mconfig --prefix=/usr/local/singularity/${VERSION}
cd ./builddir
make
make install
Note that the sudo su at the beginning is necessary to set up Singularity correctly because it needs to be built as root. The Singularity module file is in /usr/local/Modules/modulefiles/tools/singularity/$VERSION. Only the bin directory needs to be appended to the path in the module file.
Mathematica can be run on wheeler by running module load mathematica/11.0 and then executing the command math. You may need to put !mathematica.caltech.edu in your .Mathematica/Licensing/mathpass file to avoid activation key issues.
The address sanitizer tries to allocate a huge amount of memory at the beginning of the process. It won't actually use all of it, but it will use some of it. My guess is this is to ensure that it won't run out of memory during memory diagnostics. If vm.overcommit_memory is set to 2 then ASAN won't work. You can check this by running:
cat /proc/sys/vm/overcommit_memory
If this is set to 2, you won't be able to run ASAN. If you have root privileges (or know someone who does and have a good reason to get this temporarily changed), then the person with root privileges can SSH into the nodes you want to use as root and run:
echo 0 > /proc/sys/vm/overcommit_memory
After the user is done with ASAN, someone with root privileges should SSH back into the nodes and run:
echo 2 > /proc/sys/vm/overcommit_memory
Jobs can request specific nodes using e.g. --nodelist=wheeler001,wheeler002.
Some info is at https://github.com/google/sanitizers/wiki/AddressSanitizer (search for overcommit).
Depending on the exact MPI installation, different hardware backends can be used, with very different performance characteristics. When testing an MPI installation, be it an existing module on a system or an installation you are doing yourself, it is important to understand the performance characteristics of the MPI library. Ohio State University, which develops the MVAPICH MPI library, has created a set of benchmarks to test the performance of an MPI library; the OSU microbenchmarks are linked in the data header below. Below is the raw data of latency and bandwidth measurements on Wheeler. The OpenMPI installation uses UCX, but UCX has a bug in it that prevents Charm++ from running properly. However, UCX is generally one of the fastest layers to use.
# 0: Packet size
# 1: OpenMPI bandwidth (MB/s)
# 2: IntelMPI Bandwidth (MB/s)
# 3: OpenMPI latency (us)
# 4: IntelMPI latency (us)
#
# Theoretical max on Wheeler with
# Mellanox SX6025 FDR IB Switch (oPSE) is 7168 MB/s
#
# Switch details at:
# https://network.nvidia.com/related-docs/prod_ib_switch_systems/PB_SX6025.pdf
#
# Data obtained used OSU microbenchmarks:
# ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/
#
# OpenMPI launch command:
# mpirun -mca btl ^openib -mca pml ucx -x UCX_NET_DEVICES=mlx4_0:1 ...
#
# IntelMPI launch command:
# mpiexec ...
1 2.41 2.08 1.87 2.26
2 5.09 4.09 1.80 2.25
4 10.30 8.20 1.77 2.25
8 20.51 16.99 1.77 2.25
16 41.18 32.33 1.77 2.89
32 81.40 63.46 1.78 2.89
64 159.45 129.67 1.80 2.89
128 315.03 255.52 1.87 2.93
256 601.62 494.53 1.97 3.01
512 1135.19 984.94 2.08 3.13
1024 1944.95 1808.88 2.34 3.44
2048 3102.21 2976.16 2.87 3.94
4096 5274.20 4399.81 3.22 4.40
8192 6083.76 5462.46 3.90 5.33
16384 6204.29 5607.61 5.31 6.90
32768 6309.14 5855.47 7.94 9.78
65536 6361.18 5932.48 13.21 15.32
131072 6368.93 5959.46 23.47 26.48
262144 5994.83 5209.34 43.75 46.89
524288 5927.50 5531.01 84.67 88.26
1048576 5891.17 5670.04 166.84 170.65
2097152 5878.14 5733.51 331.00 335.70
4194304 5853.38 5554.40 659.37 665.51
In order to be able to change user quotas, you must have permission to ssh admin@panasas-wheeler. You will not actually ssh there explicitly; you will run commands on Wheeler and the ssh will happen in the background (kind of like git pull and git push from/to a remote). To change quotas:
- Make sure /usr/local/adm/bin is in your $PATH on wheeler.
- On wheeler, pull all the current quotas to a (previously nonexistent) file. Here we will name the file quotas_OLD:
wheeler> get_panasas_quotas quotas_OLD
The above command will print something like Successfully copied the limits file contents to /tmp/quotas. You can ignore that message: /tmp/quotas is a file on panasas-wheeler (not on wheeler) that is temporarily generated as part of the get_panasas_quotas command.
- Copy the quotas file so you have a backup in case you do something wrong:
wheeler> cp quotas_OLD quotas_NEW
- Edit quotas_NEW to change whatever quotas you want, using vi, emacs, etc. Each line in quotas_NEW looks like
user uid:17625 /SXS 1.8T 2T 0 0 [email protected] =
where columns 4 and 5 are the soft and hard quotas, and the email field is always <USERNAME>@hpc.caltech.edu and not the user's actual email. The soft quota should be 90% of the hard quota.
- Push quotas_NEW to the server:
wheeler> set_panasas_quotas quotas_NEW