Htcondor-pararell

This guide gives you detailed instructions on how to set up and use HTCondor's parallel universe for MPI executions.

Use HTCondor >= 8.6.2
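If you are not sure which version a node is running, condor_version prints it:

$ condor_version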

Setup

Install HTCondor on your machines along with MPI and SSH. Then configure your nodes so they can receive parallel jobs by creating the file /etc/condor/config.d/condor_config.local.dedicated.resource:

DedicatedScheduler = "DedicatedScheduler@controller"
##-------------------------------------------------------------------
## 2) Always run jobs, but prefer dedicated ones
##--------------------------------------------------------------------
START          = True
SUSPEND        = False
CONTINUE       = True
PREEMPT        = False
KILL           = False
WANT_SUSPEND   = False
WANT_VACATE    = False
RANK = Scheduler =?= $(DedicatedScheduler) 
##--------------------------------------------------------------------
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
OPENMPI_INSTALL_PATH = /usr
OPENMPI_EXCLUDE_NETWORK_INTERFACES = docker0,virbr0
MOUNT_UNDER_SCRATCH = /
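After creating the file, you can sanity-check that a node picked up the new value with condor_config_val (optional, but it catches typos early). It should echo back exactly the value you configured, for example:

$ condor_config_val DedicatedScheduler
"DedicatedScheduler@controller"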

Replace DedicatedScheduler = "DedicatedScheduler@controller" with your scheduler name. Be careful when specifying the name of the dedicated scheduler: it must match exactly. You can see the name of the scheduler by running condor_status -schedd. For example, if the output of the command is:

$ condor_status -schedd
Name                 Machine              TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
hpctest0             hpctest0             0                 0              0

Then, the line must be DedicatedScheduler = "DedicatedScheduler@hpctest0".

Restart HTCondor on each node (on Ubuntu: sudo service condor restart) and check that the nodes are available:

condor_status -const '!isUndefined(DedicatedScheduler)' \
   -format "%s\t" Machine -format "%s\n" DedicatedScheduler
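If everything is configured correctly you should get one line per dedicated slot, for example (the hostnames here are placeholders for your own nodes):

server1    DedicatedScheduler@controller
server2    DedicatedScheduler@controller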

Hello world Test

Let's test a "hello world" program by creating a file called hello.c:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;
    char hostname[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    gethostname(hostname, 255);

    printf("Hello from processor %d of %d on host %s\n", myrank, nprocs, hostname);

    MPI_Finalize();
    return 0;
}

Compile it with mpicc hello.c -o hello.mpi and then create a submit file named submitfile:

######################################
## Example submit description file
## for Open MPI 
## condor_submit submitfile
######################################
universe = parallel
executable = /usr/share/doc/condor/etc/examples/openmpiscript
arguments = hello.mpi
machine_count = 2
# request_cpus = 8
should_transfer_files = yes
when_to_transfer_output = on_exit
log                     = logs.log
output                  = logs.out.$(NODE)
error                   = logs.err.$(NODE)
transfer_input_files = hello.mpi
queue

Submit it to HTCondor with condor_submit submitfile and monitor the job with condor_q.
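Once the job finishes, each parallel node has its own output and error file thanks to the $(NODE) macro. With openmpiscript, mpirun runs on node 0 and gathers the ranks' stdout, so the greetings usually land in logs.out.0 (the hostnames below are placeholders):

$ cat logs.out.0
Hello from processor 0 of 2 on host server1
Hello from processor 1 of 2 on host server2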

Numerical integration example

Download the source code of a numerical integration example using MPI, which demonstrates the use of MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Reduce and MPI_Finalize in your cluster. Compile and run it:

$ wget https://raw.githubusercontent.com/Lascilab/htcondor-pararell/master/ejemplo/integracion/integracion.c
$ wget https://raw.githubusercontent.com/Lascilab/htcondor-pararell/master/ejemplo/integracion/submitfile
$ condor_submit submitfile
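If you want a feel for what such a program looks like before downloading it, here is a minimal sketch of an MPI midpoint-rule integration. It is not the repository's integracion.c, only an illustration of the calls listed above; the integrand and interval are chosen arbitrarily. It compiles with mpicc just like the hello example.

#include <stdio.h>
#include <mpi.h>

/* Function to integrate; chosen arbitrarily for this sketch.
   On [0,1] it integrates to pi. */
static double f(double x) { return 4.0 / (1.0 + x * x); }

int main(int argc, char **argv) {
    int rank, nprocs;
    const long n = 1000000;          /* number of subintervals */
    const double a = 0.0, b = 1.0;   /* integration limits */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double h = (b - a) / n;
    double local = 0.0;
    /* Each rank sums every nprocs-th midpoint, starting at its own rank. */
    for (long i = rank; i < n; i += nprocs)
        local += f(a + (i + 0.5) * h) * h;

    /* MPI_Reduce adds the partial sums on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Integral approx = %.12f\n", total);

    MPI_Finalize();
    return 0;
}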

Advanced examples

Let's say that you need to source an environment file or execute MPI in an NFS folder. In that case you will need to modify openmpiscript a little (copy it to your folder, make the adjustments and update the submitfile); check the OpenFOAM example.

Notice that line 172 adds the -wdir option, which tells mpirun to execute in that directory:

mpirun -v -wdir /vagrant/ejemplo/openfoam/damBreak \
    --prefix $MPDIR --mca $mca_ssh_agent $CONDOR_SSH \
    -n $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@ &

Also notice that the condor_ssh file, on its last line (150), sources /etc/profile and sets many environment variables for the execution.

I have tried, unsuccessfully, to set USE_NFS = true in the HTCondor configuration file and initialDir = /nfs/folder in the submit file to accomplish the above objective; unfortunately, neither of those options has worked for me. Also, if you only need to set a few environment variables without modifying the openmpiscript, you can set environment = "PATH=/usr/bin:/bin:/usr/sbin:/sbin" in the submitfile or even use getenv = true.
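As a concrete illustration, the relevant submit file fragment would look like this (the values are just the example from above):

# fragment of a submitfile; values are illustrative
environment = "PATH=/usr/bin:/bin:/usr/sbin:/sbin"
# or forward the submitter's whole environment instead:
# getenv = true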

Vagrant

Use Vagrant for a quick test: install Vagrant and execute vagrant up. In a few minutes you will have three virtual machines up and running: a controller and two nodes. Execute vagrant ssh controller to get into the controller and submit any of the examples located in "/vagrant". Remember that the controller machine is only used to submit jobs, so if you want to compile the examples, log into the other nodes (vagrant ssh server1 or vagrant ssh server2).
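A typical session, assuming you keep hello.c and submitfile in the synced /vagrant folder, looks roughly like this:

$ vagrant up
$ vagrant ssh server1                      # compile on a node
vagrant@server1:~$ mpicc /vagrant/hello.c -o /vagrant/hello.mpi
vagrant@server1:~$ exit
$ vagrant ssh controller                   # submit from the controller
vagrant@controller:~$ cd /vagrant
vagrant@controller:~$ condor_submit submitfile
vagrant@controller:~$ condor_q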

Troubleshooting

Your cluster nodes might have these entries in the HTCondor config: TRUST_UID_DOMAIN = FALSE and STARTER_ALLOW_RUNAS_OWNER = FALSE. That means every job must run as the user 'nobody', which is a problem because that user doesn't have a home directory where openmpiscript can chdir. Please change them to True (or share your solution with us :D).
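In practice that means putting something like the following in a file under /etc/condor/config.d/ on the nodes and restarting HTCondor afterwards (adapt it to your site's security policy):

TRUST_UID_DOMAIN = True
STARTER_ALLOW_RUNAS_OWNER = True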

Sources
