The aim of this recipe is to learn how to use the execution queue in the OGBON environment, through a practical example of submitting real jobs to the supercomputing environment.
~$ ssh -p 5001 murilo@ogbon-login8.fieb.org.br
To create an alias for connecting to OGBON, see [Simplifying SSH] for a way to reduce the complexity of this command.
By following the steps below, you will be able to simply run
~$ ssh ogbon
and successfully connect to the server.
Create or edit a ~/.ssh/config file
~$ mkdir -p ~/.ssh
then create or edit the ~/.ssh/config file, appending the following content:
Host ogbon
HostName ogbon-login8.fieb.org.br
User murilo
PreferredAuthentications publickey
Compression yes
ServerAliveInterval 40
ForwardX11 yes
Port 5001
IdentityFile ~/.ssh/id_rsa
where you should change the User option from murilo to your username. Also, check that your SSH key really is id_rsa; otherwise, point IdentityFile to the correct key.
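OpenSSH refuses to use a private key or config file whose permissions are too open. If you see a "Bad owner or permissions" error when connecting, tightening them usually fixes it (general OpenSSH advice, not an OGBON-specific requirement):
~$ chmod 700 ~/.ssh
~$ chmod 600 ~/.ssh/config ~/.ssh/id_rsa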
~$ ssh ogbon
~$ sinfo
~$ squeue
~$ module avail
~$ module load gcc/11.1.0
~$ module list
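If you later need to switch toolchains, the same module system can unload a single module or clear everything that is loaded; a quick sketch using standard module subcommands:
~$ module unload gcc/11.1.0
~$ module purge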
~$ groups
murilo nec projetos cenpes-lde
To allocate a node, run:
~$ salloc -p cpulongb -N 1 -A cenpes-lde
The expected output is something like:
salloc: Pending job allocation 528705
salloc: job 528705 queued and waiting for resources
salloc: job 528705 has been allocated resources
salloc: Granted job allocation 528705
salloc: Waiting for resource configuration
salloc: Nodes c153 are ready for job
With the node c153 (only an example) properly allocated, ssh into it with the following command:
~$ ssh c153
To free up the allocated resources, run:
~$ scancel -u murilo
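Note that scancel -u cancels every job belonging to the user. To release only one allocation, pass the job ID reported by salloc instead (528705 in the example above):
~$ scancel 528705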
~$ sbatch script-slurm.sh
#!/bin/sh
#SBATCH --job-name=MPI # Job name
#SBATCH --nodes=2 # Run all processes on 2 nodes
#SBATCH --partition=cpulongb # OGBON partition
#SBATCH --output=out_%j.log # Standard output and error log
#SBATCH --ntasks-per-node=1 # 1 task per node
#SBATCH --account=cenpes-lde # Account of the group
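The directives above only describe the resources to request; script-slurm.sh still needs the command you actually want to run. A minimal sketch, assuming a hypothetical MPI executable ./mpi_hello in the submission directory (whether srun or mpirun is the right launcher depends on the MPI stack you load):
#!/bin/sh
#SBATCH --job-name=MPI # Job name
#SBATCH --nodes=2 # Run on 2 nodes
#SBATCH --ntasks-per-node=1 # 1 task per node
#SBATCH --partition=cpulongb # OGBON partition
#SBATCH --account=cenpes-lde # Group account
#SBATCH --output=out_%j.log # Standard output and error log

module load gcc/11.1.0 # toolchain loaded earlier in this recipe
srun ./mpi_hello # placeholder executable: replace with your application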
~$ scp -P 5001 -r /Users/muriloboratto/Documents/github/howto-ogbon/ murilo@ogbon-login8.fieb.org.br:/home/murilo/
~$ scp -P 5001 -r murilo@ogbon-login8.fieb.org.br:/home/murilo/cap-hpc/ .
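For repeated transfers of large directories, rsync over the same port is a convenient alternative to scp because it only copies files that have changed (a sketch, assuming rsync is installed on both ends; the paths are the same ones used above):
~$ rsync -avz -e "ssh -p 5001" /Users/muriloboratto/Documents/github/howto-ogbon/ murilo@ogbon-login8.fieb.org.br:/home/murilo/howto-ogbon/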
Click your profile photo in GitHub > Settings > SSH and GPG keys > Add SSH key
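If you do not yet have a key pair to paste into that form, you can generate one and print the public part (standard OpenSSH commands; the ed25519 key type and the comment are just suggestions):
~$ ssh-keygen -t ed25519 -C "your_email@example.com" # placeholder comment: use your own e-mail
~$ cat ~/.ssh/id_ed25519.pub # copy this output into the GitHub form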
~$ ssh -p 5001 -CXY -o ServerAliveInterval=40 murilo@ogbon-login8.fieb.org.br -L 8559:localhost:8559
~$ module load anaconda3/2020.07
~$ jupyter lab --port=8559
~$ ssh -p 5001 -CXY -o ServerAliveInterval=40 murilo@ogbon-login8.fieb.org.br -L 8559:localhost:8559
To allocate a node, run:
~$ salloc -p gpulongb -N 1 -A cenpes-lde
The expected output is something like:
salloc: Pending job allocation 528705
salloc: job 528705 queued and waiting for resources
salloc: job 528705 has been allocated resources
salloc: Granted job allocation 528705
salloc: Waiting for resource configuration
salloc: Nodes c003 are ready for job
With the node c003 (only an example) properly allocated, ssh into it with the following command:
~$ ssh c003 -L 8559:localhost:8559
~$ module load anaconda3/2020.07
~$ jupyter lab --port=8559
~$ singularity pull docker://speglich/cimatec-base
~$ singularity exec --nv docker://speglich/cimatec-base bash
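singularity pull saves the image as a local .sif file, by default cimatec-base_latest.sif in the current directory, which you can then execute directly; for example (my_script.py is a placeholder for your own code):
~$ singularity exec --nv cimatec-base_latest.sif python3 my_script.py # my_script.py is a placeholder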
Then, read the instructions that follow in the notebook, and connect with NICE DCV on OGBON:
~$ ssh -p 5001 [email protected]
~$ alias dcvCreate="dcv create-session profiling"
~$ alias dcvList="dcv list-sessions"
~$ alias dcvClose="dcv close-session profiling"
~$ dcvCreate
https://ogbon-cgpu4.fieb.org.br:8443#profiling
After clicking the Connect button, you will be asked for a password, which is registered with the NOC/CS2I.
~$ module load anaconda3/2023.07
~$ conda info --envs
[murilo@login8 ~]$ conda info --envs
# conda environments:
#
pytorch-2.x /home/murilo/.conda/envs/pytorch-2.x
tensorflow-2.x /home/murilo/.conda/envs/tensorflow-2.x
base * /opt/share/anaconda3/2020.07
llvm12 /opt/share/anaconda3/2020.07/envs/llvm12
~$ source activate pytorch-2.x
- Create a reference file called conda-pytorch-env.yaml:
name: pytorch-2.x
channels:
- pytorch
- conda-forge
- nvidia
dependencies:
- python=3.11
# Libraries
- pytorch-cuda=11.8
- pytorch>=2.0.1
- numpy
- pandas
# Tools
- ipykernel
- jupyterlab
- pip
- Create the env:
~$ conda env create --name pytorch-2.x --file conda-pytorch-env.yaml
- And activate:
~$ source activate pytorch-2.x
~$ source deactivate
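Because the environment lists ipykernel as a dependency, you can register it as a selectable kernel in JupyterLab while it is active (a sketch; the display name is just a suggestion):
~$ python -m ipykernel install --user --name pytorch-2.x --display-name "Python (pytorch-2.x)"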
[murilo@login8 ~]$ ls /public/singularity/tensorflow-2.14-gpu-jupyter.sif
[murilo@login8 ~]$ sbatch /public/singularity/slurm-jupyter-notebook.sh
~$ ssh -L 9807:c000:8888 murilo@ogbon-login8.fieb.org.br -p 5001
Attention 4: At the bottom of the file slurm-notebook-*.log, copy the Jupyter web link and replace the port 8888 with the port selected by SLURM, in this case 9807:
http://127.0.0.1:9807/tree?token=f79af719ee03701f0a0cf2f02cc72e8a895800487d595559