
Use datalad containers with sandbox #157

Open
m-petersen opened this issue Jul 29, 2021 · 7 comments

@m-petersen

Hi,

our HPC enforces unpacking Singularity containers into sandboxes, which takes a really long time when done multiple times in parallel. One way to avoid unpacking every time would be to use a preconverted sandbox. Is there a way to use datalad containers with a sandbox?

Thanks!

@yarikoptic
Member

Never encountered such a need. Could you please elaborate a bit - what commands do you run to "unpack"? Any sample public Singularity image (if it is image-type specific)?

@m-petersen
Author

Hi Yaroslav,

of course.

The command I call (from the root of a subject-specific ephemeral clone) is
datalad containers-run \
   -m "$PIPE_ID" \
   --explicit \
   --output $CLONE_DATA_DIR/freesurfer -o $CLONE_DATA_DIR/smriprep -o $CLONE_DATA_DIR/smriprep/$1 \
   --input "$CLONE_BIDS_DIR/$1" \
   --container-name smriprep \
   data/raw_bids data participant \
   -w .git/tmp/wdir \
   --participant-label $1 \
   --output-spaces fsnative fsaverage MNI152NLin6Asym T1w T2w \
   --nthreads $SLURM_CPUS_PER_TASK \
   --fs-subjects-dir data/freesurfer \
   --stop-on-first-crash \
   --fs-license-file envs/freesurfer_license.txt

The container installed with datalad containers-add is
https://hub.docker.com/layers/nipreps/smriprep/0.8.0rc2/images/sha256-4b6669dbb82f8ee14837208472d31c3b842b432e3abd6fd7deea738b4f4dafd7?context=explore

The containers-add command is

datalad containers-add ${container%-*} \
   --url $ENV_DIR/$container \
   --dataset $PROJ_DIR \
   --update \
   --call-fmt "singularity run --cleanenv --userns --no-home -B . -B \$SCRATCH_DIR:/tmp {img} {cmd}"
done

Environment variables used here:

PROJ_DIR=. (superdataset root)
ENV_DIR=./envs
SCRATCH_DIR=/scratch (scratch partition on our HPC)
CLONE_DATA_DIR is a directory of an ephemeral clone containing input and result subdatasets (my workflow follows http://handbook.datalad.org/en/latest/beyond_basics/101-171-enki.html)
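
For reference, a containers-add call like the one above records the container in .datalad/config; the entry should look roughly like the following (a sketch based on datalad-container's cmdexec/image config keys; the exact image path depends on whether containers-add copies the image into .datalad/environments or references it in place):

[datalad "containers.smriprep"]
    image = .datalad/environments/smriprep/image
    cmdexec = singularity run --cleanenv --userns --no-home -B . -B $SCRATCH_DIR:/tmp {img} {cmd}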

During test runs on a single subject the container works well once it is unpacked, but the unpacking itself takes ~15 minutes. When I run multiple subjects on one node in parallel, converting the containers takes hours, rendering the whole computation effectively sequential. That's why I want to avoid unpacking for every subject run.

We are aiming for computationally optimized processing of imaging data on our HPC using datalad for provenance tracking and collaboration.

My scripts are

pipelines_submission.txt
Submits pipelines_parallelization for a subject batch and a defined pipeline

pipelines_parallelization.txt
Parallelizes execution of pipelines_processing across a subject batch on a node. During tests it did that, but now I'm in doubt because the Singularity container conversions interfere. Correct me if I'm wrong, but interference occurs when multiple jobs try to unpack the same container. If I produce an ephemeral clone per subject (as I do in pipelines_processing), shouldn't every process unpack its own container, since the installed container (datalad containers-add) should be transferred to the clone?

pipelines_processing.txt
Sets up an ephemeral clone and calls a pipeline script like smriprep

smriprep.txt
Script containing the pipeline-specific command, including the container execution with datalad containers-run. We also try to implement other preconfigured pipelines like fmriprep, qsiprep, etc.

Hope that clarifies a bit what my problem is and what I am trying to achieve. Using datalad for all these things is a little bit overwhelming for me at the moment.

Thanks a lot in advance.

Regards,
Marvin

@yarikoptic
Member

I see -- so it is the conversion of the container from Docker to Singularity upon each run.
Ideally you/we should just have a single converted Singularity container for that app, so no conversion would be needed. That is what we also try to facilitate with https://github.com/ReproNim/containers/ but smriprep isn't there yet (filed an issue ref'ed above within bids-apps).
What is the value of ${container%-*} you have? I thought that for docker:// URLs we would do such a conversion once, while adding that Singularity container from Docker into the local dataset, so it could then be reused across jobs and there would be no further repacking of anything.

Here is a test on a smaller container to confirm that:
smaug:/tmp
$> datalad create containers-add-docker-test
[INFO   ] Creating a new annex repo at /tmp/containers-add-docker-test
[INFO   ] Scanning for unlocked files (this may take some time)
create(ok): /tmp/containers-add-docker-test (dataset)
2 5442.....................................:Thu 29 Jul 2021 09:44:42 AM EDT:.
smaug:/tmp
$> cd containers-add-docker-test
2 5443.....................................:Thu 29 Jul 2021 09:44:43 AM EDT:.
(git)smaug:/tmp/containers-add-docker-test[master]
$> datalad containers-add --url docker://neurodebian:nd100 test-docker
[INFO   ] Building Singularity image for docker://neurodebian:nd100 (this may take some time)
INFO:    Starting build...
Getting image source signatures
Copying blob 627b765e08d1 skipped: already exists
Copying blob ff66d7acb9e0 skipped: already exists
Copying blob 4ac627f2d764 skipped: already exists
Copying blob b33c3e9e07dc skipped: already exists
Copying blob 2c9c4b1dfc17 skipped: already exists
Copying config 6a5f86f6be done
Writing manifest to image destination
Storing signatures
2021/07/29 09:44:57  info unpack layer: sha256:627b765e08d177e63c9a202ca4991b711448905b934435c70b7cbd7d4a9c7959
2021/07/29 09:44:59  info unpack layer: sha256:ff66d7acb9e05c47e0621027afb45a8dfa4665301a45f8f794a16bd8c8ae8205
2021/07/29 09:44:59  info unpack layer: sha256:4ac627f2d764f56d7380099754dd943f54a31247e9400d632a215b8bc4ec5fa2
2021/07/29 09:44:59  info unpack layer: sha256:b33c3e9e07dc85213c696c81aff06d286ba379571b799845954b6cc776457e5f
2021/07/29 09:44:59  info unpack layer: sha256:2c9c4b1dfc179a7cb39a5bcbff0f4d529c84d1fab9395e799ab8a0350c620e58
INFO:    Creating SIF file...
INFO:    Build complete: image
[WARNING] Got jobs=6 but we cannot use threads with Pythons versions prior 3.8.0. Will run serially
add(ok): .datalad/config (file)
add(ok): .datalad/environments/test-docker/image (file)
save(ok): . (dataset)
containers_add(ok): /tmp/containers-add-docker-test/.datalad/environments/test-docker/image (file)
action summary:
  add (ok: 2)
  containers_add (ok: 1)
  save (ok: 1)
datalad containers-add --url docker://neurodebian:nd100 test-docker  24.41s user 2.23s system 165% cpu 16.096 total
2 5444.....................................:Thu 29 Jul 2021 09:45:08 AM EDT:.
(git)smaug:/tmp/containers-add-docker-test[master]
$> datalad containers-run -n test-docker ls
[INFO   ] Making sure inputs are available (this may take some time)
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
[WARNING] Got jobs=6 but we cannot use threads with Pythons versions prior 3.8.0. Will run serially
action summary:
  get (notneeded: 1)
  save (notneeded: 1)

So I am still not 100% sure what conversion in relation to Singularity we are dealing with. Maybe you have some output/logging which shows it?

During test runs on a single subject the container works well once it is unpacked, but the unpacking itself takes ~15 minutes.

So are you talking about those 15 minutes as the time of concern? My guess is that it is datalad/git trying to figure out what has changed in order to make a commit.

@m-petersen
Author

m-petersen commented Jul 29, 2021

I think there is a misunderstanding.

Before executing a SLURM submission, I install the Singularity container from a local datalad dataset called ENV_DIR, which contains predownloaded Singularity containers (from Docker Hub). I do not download any Singularity containers from Docker Hub on the fly during the respective job. It's the conversion of the Singularity container image to a sandbox that is enforced on our HPC (I think because the file system is incompatible with Singularity), and that conversion takes a long time or is practically impossible when done multiple times in parallel.

What is the value of ${container%-*} you have?
The containers are named following the scheme <pipeline>-<version>.sif in ENV_DIR. With ${container%-*} they are added under <pipeline> only, since dots and hyphens aren't allowed in container names, if I remember right.
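
For illustration, with that naming scheme the parameter expansion strips everything from the last hyphen onward (the filename here is just an assumed example):

container=smriprep-0.8.0rc2.sif
echo ${container%-*}    # prints "smriprep"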

The workaround I am now establishing is to execute singularity run, wrapped in datalad run, against a sandbox that has been converted in advance. I was wondering whether there is a way of using datalad containers-run instead, assuming it has some benefits, for instance with regard to provenance tracking, compared to plain datalad run.
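
A minimal sketch of that workaround, assuming the sandbox was built in advance (e.g. with singularity build --sandbox) under $SCRATCH_DIR; the sandbox path and the reduced option list are illustrative, not taken from my actual scripts:

# sandbox converted once, outside the parallel jobs
SANDBOX=$SCRATCH_DIR/smriprep_sandbox

# per subject: plain datalad run wrapping singularity run against the sandbox directory
datalad run \
   -m "smriprep $1" \
   --explicit \
   --input "$CLONE_BIDS_DIR/$1" \
   --output "$CLONE_DATA_DIR/smriprep/$1" \
   "singularity run --cleanenv --userns --no-home -B . -B $SCRATCH_DIR:/tmp \
      $SANDBOX data/raw_bids data participant --participant-label $1"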

@yarikoptic
Member

Sorry, we forgot about this issue discussion.

It's the conversion of the Singularity container image to a sandbox that is enforced on our HPC

It is still not clear to me (please point to/cite specific lines) what exactly such a conversion entails. My only guess: copying the Singularity container from a partition where it cannot be executed (e.g. /home) to another partition which supports executing it (e.g. /sandbox). If that is so, and /sandbox is mounted across all the nodes, the easiest solution would probably be:

  • install (with the needed images) the (sub)dataset with containers under /sandbox;
  • install it from /sandbox in reckless mode (symlinking .git/annex/objects) into the resultant dataset before execution;
  • run a regular containers-run (a rough sketch follows).
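
A rough sketch of those steps, assuming /sandbox is shared across nodes and the containers dataset lives at /sandbox/containers (both paths and the dataset URL are placeholders):

# once, on the shared partition: install the containers (sub)dataset and fetch the images
datalad install -s <containers-dataset-url> /sandbox/containers
datalad -C /sandbox/containers get .

# per job, inside the ephemeral clone: register it as a subdataset in reckless/ephemeral mode,
# which symlinks .git/annex/objects instead of copying the images
datalad clone -d . --reckless ephemeral /sandbox/containers containers

# then a regular containers-run; the container name is typically prefixed by the subdataset path
datalad containers-run -n containers/smriprep ...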

I was wondering whether there is a way of using datalad containers-run instead, assuming it has some benefits, for instance with regard to provenance tracking, compared to plain datalad run.

Well, containers-run is a lean wrapper around regular run. The only specific aspect which comes to mind is that it passes the container image not within inputs but within extra_inputs, which has slightly different semantics. ref: https://github.com/datalad/datalad-container/blob/master/datalad_container/containers_run.py#L137

@m-petersen
Author

m-petersen commented Feb 2, 2022

Thanks a lot for your reply.

By now we have established another solution that forgoes datalad.

What I mean by a sandbox is the conversion of the Singularity container image to a writable directory (https://sylabs.io/guides/3.0/user-guide/build_a_container.html#creating-writable-sandbox-directories). Automated conversion of the containers to sandbox directories before using them is enforced on our HPC, and due to its very slow file system this process takes forever, hampering computations.
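
For reference, that conversion corresponds to something like the following Singularity commands (the image name is just an assumed example):

# unpack the SIF image into a writable sandbox directory
singularity build --sandbox smriprep_sandbox/ smriprep-0.8.0rc2.sif
# the container can then be run directly from that directory
singularity run smriprep_sandbox/ ...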

@yarikoptic
Member

FWIW, as the command to actually execute the container is fully configurable in .datalad/config, I guess one solution could be to develop a shim which would convert the image to a sandbox if that has not yet been done. It could then even be local to the system/compute node if so desired. An example of such "shimming" is https://github.com/ReproNim/containers/blob/master/scripts/singularity_cmd which takes care of "thorough" sanitization and also supports running Singularity via Docker when on OSX. https://github.com/ReproNim/containers/blob/master/.datalad/config then refers to it instead of a plain singularity run command.
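
A minimal sketch of such a shim, assuming a node-local $SCRATCH_DIR and a hypothetical scripts/singularity_sandbox_cmd committed to the dataset (this is not an existing datalad-container feature, just an illustration of the configurable call format):

#!/bin/bash
# scripts/singularity_sandbox_cmd {img} {cmd...}: convert the image to a sandbox once per node,
# then run the command from the sandbox directory.
set -eu
img=$1; shift
# a real shim might want a more unique name than the image basename
sandbox="$SCRATCH_DIR/$(basename "$img").sandbox"

# serialize the one-time conversion across parallel jobs on the same node
exec 9>"$sandbox.lock"
flock 9
[ -d "$sandbox" ] || singularity build --sandbox "$sandbox" "$img"
flock -u 9

exec singularity run --cleanenv --userns --no-home -B . -B "$SCRATCH_DIR:/tmp" "$sandbox" "$@"

The container would then be registered with --call-fmt "bash scripts/singularity_sandbox_cmd {img} {cmd}" instead of the plain singularity run call format.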
