Running diagnostics in a distributed manner
ACME diagnostics can be run in a distributed manner on a cluster, which significantly speeds up a diagnostics run.
Go to the head node (aims4 or acme1) and become root. Then run the following commands:
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
export UVCDAT_ANONYMOUS_LOG=False
dask-scheduler --host 10.10.10.1
Go to each of the compute nodes and become root. Make sure that /p/cscratch is accessible from each node. Then run the following commands:
source /p/cscratch/acme/shaheen2/acme_diags_env/bin/activate /p/cscratch/acme/shaheen2/acme_diags_env
export UVCDAT_ANONYMOUS_LOG=False
dask-worker 10.10.10.1:8786 --nprocs NUM_WORKERS --nthreads 1
You can select the number of workers on your machine with something like dask-worker 10.10.10.1:8786 --nprocs 4 --nthreads 1
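A rough heuristic (not from the original instructions) is to derive NUM_WORKERS from the machine's core count, leaving a core or two free for the OS. Below is a minimal stdlib-only sketch; suggest_nprocs is a hypothetical helper, not part of dask:

```python
import multiprocessing

def suggest_nprocs(reserve=1):
    """Suggest a value for dask-worker --nprocs: the machine's core
    count minus a reserve for the OS, but never less than 1."""
    return max(1, multiprocessing.cpu_count() - reserve)

if __name__ == "__main__":
    print(suggest_nprocs())
```

You could then pass the result to the command above, e.g. dask-worker 10.10.10.1:8786 --nprocs $(python suggest_nprocs.py) --nthreads 1.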
Use the cdp-distrib CLI to interact with the scheduler and workers. You can do this even if you don't have access to the compute nodes. Just run cdp-distrib SCHEDULER_ADDR:PORT -w. Below is a sample output.
(/p/cscratch/acme/shaheen2/acme_diags_env) shaheen2@aims4:~$ cdp-distrib 10.10.10.1:8786 -w
Scheduler 10.10.10.1:8786 has 12 workers attached to it
Information about worker at greyworm1.llnl.gov(10.10.10.51):37188
name: tcp://10.10.10.51:37188
ncores: 1
executing: 0
memory_limit: 10146302976.0
pid: 60292
last-task: 2017-07-28 16:35:43
last-seen: 2017-07-28 16:58:38
Information about worker at greyworm1.llnl.gov(10.10.10.51):38710
name: tcp://10.10.10.51:38710
ncores: 1
executing: 0
memory_limit: 10146302976.0
pid: 60290
last-task: 2017-07-28 16:35:46
last-seen: 2017-07-28 16:58:38
- If you get an error like the one below when running dask-worker, make sure you can ping SCHEDULER_ADDRESS. If not, contact your sysadmin.
(/p/cscratch/acme/shaheen2/acme_diags_env) [root@greyworm1 ~]# dask-worker 198.128.245.178:8786
...
distributed.worker - INFO - Trying to connect to scheduler: tcp://198.128.245.178:8786
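Besides ping, it can help to verify that the scheduler's TCP port itself is reachable (ping can succeed while the port is blocked by a firewall). A small stdlib-only sketch; the address and port below are placeholders for your scheduler:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds
    within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the scheduler address used above (placeholder values).
# print(can_connect("198.128.245.178", 8786))
```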
- If, after running the diagnostics, you eventually get an IOError: [Errno 24] Too many open files error on the worker or head nodes, follow the steps below on all of the compute and head nodes. For more detail, see the actual answer on the Dask FAQs.
- View the old soft and hard limits on open files:
# ulimit -Sn
# ulimit -Hn
- Open /etc/security/limits.conf and add the following lines to increase the limit for root to something larger:
root soft nofile 8192
root hard nofile 20480
- View the new soft and hard limits:
# exit
$ sudo su -
# ulimit -Sn
# ulimit -Hn
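The same soft and hard limits can also be inspected from Python with the standard resource module (Unix only); this is just a convenience equivalent to ulimit -Sn / ulimit -Hn:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)
print("hard limit:", hard)

# A process may raise its own soft limit up to the hard limit
# without root, e.g.:
# resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))
```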
- When you start dask-scheduler, are there workers created (Starting worker ...) like in the snippet below? If so, run lsof | grep -E 'python2.7.*LISTEN' and kill all of the Python processes (with kill -9 PID) listening on the localhost (those with *:SOMEPORT). Do this with caution.
(acme_diags_env) shaheen2@shaheen2ml: dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Scheduler at: tcp://128.15.245.24:8786
distributed.scheduler - INFO - bokeh at: 0.0.0.0:8787
distributed.scheduler - INFO - http at: 0.0.0.0:9786
distributed.scheduler - INFO - Local Directory: /var/folders/nl/4tby_mh129g_95fj9dh6cgdh001nkh/T/scheduler-QMmVpu
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://128.15.245.24:50215
distributed.scheduler - INFO - Register tcp://128.15.245.24:50234
distributed.scheduler - INFO - Register tcp://128.15.245.24:50145
distributed.scheduler - INFO - Register tcp://128.15.245.24:50238
distributed.scheduler - INFO - Register tcp://128.15.245.24:50219
distributed.scheduler - INFO - Register tcp://128.15.245.24:49982
distributed.scheduler - INFO - Register tcp://128.15.245.24:50088
distributed.scheduler - INFO - Register tcp://128.15.245.24:50101
distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50238
distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50219
distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:49982
distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50088
distributed.scheduler - INFO - Starting worker compute stream, tcp://128.15.245.24:50101
Creating a single Anaconda environment accessible through the head node and compute nodes might be difficult, due to different system configurations and security settings. Below is how it was done. Eventually, all of this distributed functionality will be included in the default ACME environment.
- Log in to the head node and make sure you have Anaconda installed in a location accessible to the compute nodes. In the case of aims4 and the greyworm cluster, only /p/cscratch is accessible, so we installed Anaconda in /p/cscratch/acme/shaheen2/anaconda2/.
- Create an Anaconda environment in a location accessible to the compute nodes (/p/cscratch/acme/shaheen2/acme_diags_env):
/p/cscratch/acme/shaheen2/anaconda2/bin/conda create -p /p/cscratch/acme/shaheen2/acme_diags_env python=2.7 dask distributed -c conda-forge --copy -y
Make sure to use --copy; it copies the packages instead of symbolically linking them. Even if you use --copy, the activate, create, and conda scripts are still symbolically linked based on which conda was used in conda create. This is why we needed the conda (in /p/cscratch/acme/shaheen2/anaconda2/bin/conda) to be available on the head and compute nodes.