
Move temporary charliecloud containers outside of process directory #5212

Closed

Conversation


@nschan nschan commented Aug 7, 2024

This addresses #5192.
Instead of placing the process container inside the process work directory, it is created in a folder that can be defined via charliecloud.containerStore, or by default in work/charliecloud_store. This solves the problem of find having to traverse the whole container tree inside the work directory.
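
For illustration, this is roughly how the new option might be set; a minimal sketch only, using the option name as initially proposed in this PR (it may still change during review) and a hypothetical path:

```groovy
// nextflow.config -- minimal sketch; option name as initially proposed in this PR
charliecloud {
    enabled = true
    // directory where the temporary per-process containers are created;
    // defaults to work/charliecloud_store when not set
    containerStore = '/scratch/charliecloud_store'   // hypothetical path
}
```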

There are some issues that remain:

  • I do not know if nextflow clean should remove work/charliecloud_store, but I would assume it does not.

  • The whole approach of creating process-specific containers comes with considerable storage overhead, since each process will have a dedicated container created, which can easily mean hundreds to thousands of new containers during a single run. There are some workarounds (see the configuration sketch below): on modern systems, ch-run's --write-fake can be used, which skips process-specific container creation and instead runs containers directly from the container storage; this is probably the best way. On somewhat modern systems, useSquash can be enabled to create .squashfs containers, which only creates a single file per container instead of a whole directory tree. I think it would be nice to clean up the temporary (process) container after the process has finished, but I am not sure if / how this can be implemented via the charliecloud driver.
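
A sketch of how the two mitigations mentioned above might be enabled; the option names follow the docs changes in this PR, and the exact semantics were still under discussion at this point:

```groovy
// nextflow.config -- sketch only; semantics follow the discussion in this PR
charliecloud {
    enabled = true

    // preferred on recent kernels: run containers directly from storage
    // via an overlayfs-backed writeable layer (ch-run --write-fake)
    writeFake = true

    // fallback where overlayfs is unavailable but SquashFS is supported:
    // build a single .squashfs file per container instead of a directory tree
    // useSquash = true
}
```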

@nschan nschan requested a review from a team as a code owner August 7, 2024 14:04

@nschan nschan changed the title from "Charliecloud storage" to "Move temporary charliecloud containers outside of process directory" on Aug 7, 2024

nschan commented Aug 28, 2024

Any comments or thoughts on this PR?


nschan commented Sep 4, 2024

Any chance someone could review this and let me know if anything should be changed or done differently before this can be merged? @pditommaso @bentsherman

@pditommaso (Member)

It would be nice to have a review, @l-modolo or @phue (provided they are still interested in / using charliecloud).

docs/config.md Outdated
: Enable `writeFake` with charliecloud. This allows running containers from storage in writeable mode using overlayfs; see the [charliecloud documentation](https://hpc.github.io/charliecloud/ch-run.html#ch-run-overlay) for details. Using `writeFake` is recommended, provided the kernel supports it. If `writeFake` is not enabled, each process will run in an individual container, which can lead to the creation of many temporary containers for pipelines with many tasks.

`charliecloud.containerStore`
: If `writeFake` is enabled, this option has no effect. Directoy that stores temporary containers. Unless `writeFake` is enabled, each process runs in an individual container. If this is not set, nextflow will default to `work/charliecloud_store`. Path to temporary containers inside the `containerStore` is the full hash of the process.

Contributor

Suggested change
- : If `writeFake` is enabled, this option has no effect. Directoy that stores temporary containers. Unless `writeFake` is enabled, each process runs in an individual container. If this is not set, nextflow will default to `work/charliecloud_store`. Path to temporary containers inside the `containerStore` is the full hash of the process.
+ : If `writeFake` is enabled, this option has no effect. Directory that stores temporary containers. Unless `writeFake` is enabled, each process runs in an individual container. If this is not set, nextflow will default to `work/charliecloud_store`. Path to temporary containers inside the `containerStore` is the full hash of the process.

@phue phue Sep 9, 2024

Couldn't you just reuse charliecloud.cacheDir instead of adding yet another config option?
The singularity driver also has the concept of libraryDir for images in read-only locations; that might also be worth considering here to keep things comparable.

@nschan (author)

The issue here is that charliecloud does not allow running images from its storage in writable mode, so another directory seems necessary to store the temporary container. If cacheDir and $CH_IMAGE_STORAGE are identical, charliecloud would refuse to run with -w, unless I am misunderstanding something.
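
To make the constraint concrete, a minimal configuration sketch (the path is hypothetical) where the Nextflow-side cache deliberately differs from $CH_IMAGE_STORAGE so that ch-run -w remains possible:

```groovy
// nextflow.config -- sketch only; the path is hypothetical
charliecloud {
    enabled  = true
    // must not point at the same directory as $CH_IMAGE_STORAGE,
    // otherwise ch-run refuses to start the image with -w
    cacheDir = '/scratch/nxf_charliecloud_cache'
}
```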

Contributor

-w isn't actually needed when --write-fake is provided, correct?

@nschan (author)

I think it is not needed.

docs/config.md Outdated

`charliecloud.useSquash`
- : Create a temporary squashFS container image in the process work directory instead of a folder.
+ : If `writeFake` is enabled, this option has no effect. Create a temporary squashFS container image in `containerStore` (see above) named `container.squashfs`. If `writeFake` is not available but squashFS is an option, this should be enabled.

Contributor

I think useSquash should be made the default, if not the only option.
Using plain directory trees and making a copy for each process execution can lead to insane amounts of inodes for larger runs. Sysadmins won't like this.

Contributor

If using squashfs images is possible, we should do it.
Could you be more specific about the meaning of "somewhat modern" systems?

@nschan (author)

I also do not like this, and I am not sure what exactly is preventing it, but the infrastructure I am using does not support squashfs. I talked to the admins, but this will not change in the foreseeable future.
I agree that it should be made the default. Maybe if squashfs is not available, only a single copy of the container could be created per process, instead of having one per task? This might be a bit unsafe, but maybe a good trade-off against cluttering the filesystem?

@phue phue Sep 9, 2024

Creating one full copy of the container filesystem per task is not sane behaviour, I think we agree on that.

Instead (from what I understand), you could pull the container once (e.g. to charliecloud.cacheDir, which could be the same as $CH_IMAGE_STORAGE), and then use --write-fake to spawn as many container instances as you'd like without having to copy the container filesystem.

@nschan (author)

Yes, clearly using --write-fake with a storage directory is the best choice and avoids all of the extra creation of containers.
This is not an option on older systems though (https://hpc.github.io/charliecloud/ch-run.html#writeable-overlay-with-write-fake), such as the one I am using. Since charliecloud will not run an image from its storage in writable mode (-w), a new image needs to be created somewhere that is not $CH_IMAGE_STORAGE. cacheDir may or may not be the same directory as $CH_IMAGE_STORAGE, so I guess that would need to be checked? I will try to come up with a better way to handle support for older systems. I agree that creating possibly thousands of container directories is not really a reasonable way to handle this.

@nschan (author)

There is the additional issue that running ch-convert at the beginning of every task takes some time (minutes), which is not great.

@@ -62,6 +65,9 @@ class CharliecloudBuilder extends ContainerBuilder<CharliecloudBuilder> {
        if( params.containsKey('readOnlyInputs') )
            this.readOnlyInputs = params.readOnlyInputs?.toString() == 'true'

        if( params.containsKey('containerStore') )

Contributor

not sure this is actually needed (see above)

.params(runOptions: '-j --no-home')
.build()
.runCommand == 'ch-convert -i ch-image --storage /cacheDir busybox "$NXF_TASK_WORKDIR"/container_busybox && ch-run --unset-env="*" -c "$NXF_TASK_WORKDIR" --set-env -w -b "$NXF_TASK_WORKDIR" -j --no-home "$NXF_TASK_WORKDIR"/container_busybox --'
.runCommand == 'mkdir -p /containerStorage/"${NXF_TASK_WORKDIR: -34}" && ch-convert -i ch-image --storage /cacheDir busybox /containerStorage/"${NXF_TASK_WORKDIR: -34}" && ch-run --unset-env="*" -c "$NXF_TASK_WORKDIR" --set-env -w -b "$NXF_TASK_WORKDIR" -j --no-home /containerStorage/"${NXF_TASK_WORKDIR: -34}" --'

@phue phue Sep 9, 2024

Could you explain the magic default of -34? I don't understand the purpose of this

@nschan (author)

The -34 extracts the last 34 characters of $NXF_TASK_WORKDIR, i.e. the task-hash part of the work directory path, so that a separate container is created per task.
I think this is worth reconsidering: is it really necessary to create one container per task, or would it be sufficient to have one dedicated container for each process?


nschan commented Sep 10, 2024

Thanks @phue, I think the review largely confirms what I have outlined in the initial comment: Creating one container per task (as was initially suggested here) is not going to be a useful route for handling containers, especially if squashfs is not an option.
I would like to discuss what a reasonable way to work around this could be.
The problem underlying this issue is that ch-run refuses to run images stored in $CH_IMAGE_STORAGE with -w (writable), which is in turn required to use -b (bind-mount) to bind a path into the container root.
Some solutions I could come up with are:

  • Use --write-fake and run from $CH_IMAGE_STORAGE and offer no other options. This removes support for old kernels, and I would not be able to use charliecloud any longer. I am obviously not in favour of this. I think it should be the default way of running images, but other ways need to be available for older systems.
  • Work around the charliecloud cache: only use $NXF_CHARLIECLOUD_CACHEDIR or charliecloud.cacheDir for storage and either make sure that those are not identical to $CH_IMAGE_STORAGE, or unset $CH_IMAGE_STORAGE before running images. Unsetting $CH_IMAGE_STORAGE probably goes against the idea of charliecloud trying to maintain a cache that contains unmodified images.
  • Continue using the charliecloud cache ($CH_IMAGE_STORAGE) and provide a second cache ($NXF_CHARLIECLOUD_CACHEDIR or charliecloud.cacheDir) that is only available for that pipeline run. I do not really know if that has any advantages; images would still need to be converted via ch-convert to be moved from $CH_IMAGE_STORAGE to the run cache, and I imagine this is not much faster than simply re-downloading the image into the secondary cache. It also seems unnecessarily complex to maintain a shadow cache for each pipeline run.

I understand the idea of maintaining the purity of containers in the $CH_IMAGE_STORAGE cache, but I think the way nextflow uses containers requires a cache that can be run with -w, in case --write-fake is not an option. I am inclined to think that simply dropping $CH_IMAGE_STORAGE from the whole container pulling procedure, relying on $NXF_CHARLIECLOUD_CACHEDIR and charliecloud.cacheDir, and falling back onto $workDir/charliecloud is probably the best procedure. Maybe it would be worth checking whether $NXF_CHARLIECLOUD_CACHEDIR is identical to $CH_IMAGE_STORAGE and throwing an error to avoid problems with -w?

Do you have any thoughts or comments on this @phue ?


nschan commented Sep 10, 2024

I have updated this PR to follow what I have outlined above:

> I am inclined to think that simply dropping $CH_IMAGE_STORAGE from the whole container pulling procedure, relying on $NXF_CHARLIECLOUD_CACHEDIR and charliecloud.cacheDir, and falling back onto $workDir/charliecloud is probably the best procedure. Maybe it would be worth checking whether $NXF_CHARLIECLOUD_CACHEDIR is identical to $CH_IMAGE_STORAGE and throwing an error to avoid problems with -w?

This greatly simplifies the container procedure for charliecloud. However, with this change nextflow no longer makes proper use of charliecloud's internal storage capabilities. I personally do not consider this a major drawback; it is still possible to use $CH_IMAGE_STORAGE for everything that is not nextflow.

The commit messages that end early should include $CH_IMAGE_STORAGE, but I used double quotes so it got expanded to nothing; sorry about that.


phue commented Sep 10, 2024

Just to recap why -w was needed in the first place:
Before charliecloud gained support for overlayfs (implemented as --write-fake), we had to ensure the task work directory exists within the container to make it work with nextflow. This was done by using ch-run --write and mkdir.
This wasn't optimal in the first place, because it goes against the idea of immutable containers and completely lacks provenance.
Now overlayfs solves that issue and imo should be the default, but unfortunately its implementation in charliecloud requires a rather recent kernel.

The next best thing to do imo would be something along these lines:

  • leave it completely to charliecloud to handle its cache (the cache implementation anyway seems to change every couple of releases)
  • let it store "vanilla" images there, and for each nextflow run create a "working copy" of that container using ch-convert and store it under work/charliecloud.
  • to the latter one, add the mountpoint for work/, and cross fingers that it is enough to do that once per container and not repeatedly for each task, i.e. if you bind-mount work/ it will hopefully also recursively include all subdirectories thereof. Maybe give that a try.


nschan commented Sep 10, 2024

> Just to recap why -w was needed in the first place:
> Before charliecloud gained support for overlayfs (implemented as --write-fake), we had to ensure the task work directory exists within the container to make it work with nextflow. This was done by using ch-run --write and mkdir.
> This wasn't optimal in the first place, because it goes against the idea of immutable containers and completely lacks provenance.
> Now overlayfs solves that issue and imo should be the default, but unfortunately its implementation in charliecloud requires a rather recent kernel.

I agree that --write-fake should be used by default, but I will also be honest that a big motivation for me is having an implementation that works for me, and overlayfs is not an option with my infrastructure.

> leave it completely to charliecloud to handle its cache (the cache implementation anyway seems to change every couple of releases)

This was how the cache was used up until my latest commit. Even with this commit charliecloud will still handle its own cache, but nextflow will use a different cache ;)

> let it store "vanilla" images there, and for each nextflow run create a "working copy" of that container using ch-convert and store it under work/charliecloud.
> to the latter one, add the mountpoint for work/, and cross fingers that it is enough to do that once per container and not repeatedly for each task, i.e. if you bind-mount work/ it will hopefully also recursively include all subdirectories thereof. Maybe give that a try.

I have contemplated this before going with the current approach. The main problem is that this would require some additional parts in the charliecloudCache, basically replicating the behaviour of the imageDownloader to have an imageCloner or something. Since ch-convert is not significantly faster than ch-image pull (personal observation), the difference between downloading images to somewhere that is not CH_IMAGE_STORAGE, and downloading them into CH_IMAGE_STORAGE and then converting them from CH_IMAGE_STORAGE to that other place, seems kinda minor, but comes with a bunch of extras that may cause issues in the future? If there is some cool advantage that I am missing, I am happy to learn more about this.

Conversion of the container into a single "working copy" cannot easily be done inside the charliecloudBuilder, because those may run in parallel and I assume it would create a massive mess if several tasks try to run the same ch-convert. Of course, checks could be added to avoid this, but I think it would be kinda messy.

I should add that the current approach may lead to #3964 coming up again.


phue commented Sep 11, 2024

I don't quite understand this.
Isn't ch-image pull -s <cacheDir> (as done in CharliecloudCache) exactly the same as doing CH_IMAGE_STORAGE=<cacheDir> ch-image pull ...?
And if so, I believe you'll run into #4463 again.

> The main problem is that this would require some additional parts in the charliecloudCache, basically replicating the behaviour of the imageDownloader to have an imageCloner or something. Since ch-convert is not significantly faster than ch-image pull (personal observation), the difference between downloading images to somewhere that is not CH_IMAGE_STORAGE, and downloading them into CH_IMAGE_STORAGE and then converting them from CH_IMAGE_STORAGE to that other place, seems kinda minor, but comes with a bunch of extras that may cause issues in the future? If there is some cool advantage that I am missing, I am happy to learn more about this.

The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.

> Conversion of the container into a single "working copy" cannot easily be done inside the charliecloudBuilder, because those may run in parallel and I assume it would create a massive mess if several tasks try to run the same ch-convert. Of course, checks could be added to avoid this, but I think it would be kinda messy.

Then extend CharliecloudCache, it already implements a mutex to avoid concurrent pulls.


nschan commented Sep 11, 2024

> I don't quite understand this.
> Isn't ch-image pull -s <cacheDir> (as done in CharliecloudCache) exactly the same as doing CH_IMAGE_STORAGE=<cacheDir> ch-image pull ...?

Yes, it is. One difference is that ch-run will refuse to run with -w if the container is run by name from CH_IMAGE_STORAGE. I see the issue regarding #4463 and will think about how to handle this.

> Then extend CharliecloudCache, it already implements a mutex to avoid concurrent pulls.

This mutex never worked properly and caused problems; see #3566.

> The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.

You could use the same cache for different pipeline runs via NXF_CHARLIECLOUD_CACHEDIR when --write-fake cannot be used.


nschan commented Sep 11, 2024

So, here is the latest set of changes:

  • charliecloud.writeFake is true by default (!)
  • if charliecloud.writeFake is true and $CH_IMAGE_STORAGE is set, images will go into $CH_IMAGE_STORAGE and be run from there with --write-fake (they will be run by name in that case).
  • if charliecloud.writeFake is true but $CH_IMAGE_STORAGE is unset, images will go into charliecloud.cacheDir, $NXF_CHARLIECLOUD_CACHEDIR or work/charliecloud (in that order of priority) and will be run by path (since running by name only works if $CH_IMAGE_STORAGE is set).
  • if charliecloud.writeFake is set to false, $CH_IMAGE_STORAGE will not be used to store images. Images will go into charliecloud.cacheDir, $NXF_CHARLIECLOUD_CACHEDIR or work/charliecloud (in that order of priority). If the folder used to store images is the same as $CH_IMAGE_STORAGE, an exception is thrown. Images will be run by path.
  • if the image is supposed to be run with -w (i.e. readOnlyInputs = false and writeFake = false) and $CH_IMAGE_STORAGE is set, this will throw an exception. This is mainly to ensure that users realize that this can be a problem; I think it would also be fine not to throw here?

In summary:

  • Give preference to $CH_IMAGE_STORAGE in the default case of writeFake
  • Make some effort to avoid using $CH_IMAGE_STORAGE when writeFake is false.
  • Not using convert, since I still do not see a major advantage over pulling; as far as I understand, charliecloud will maintain its own cache defined by $CH_IMAGE_STORAGE even if the image goes somewhere else. Nextflow will use $CH_IMAGE_STORAGE if it is set and usable (writeFake = true), and will never use $CH_IMAGE_STORAGE when -w is used, to avoid messing with the charliecloud cache. A configuration sketch of the two resulting modes follows below.
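
For illustration, a sketch of the two resulting configurations; option names and defaults are those described above, the path is hypothetical, and the exact behaviour may still change before merge:

```groovy
// nextflow.config -- sketch only; pick one of the two variants, not both

// Variant A (proposed default): writeFake stays enabled. If $CH_IMAGE_STORAGE
// is set, images are pulled there and run by name with --write-fake.
charliecloud {
    enabled = true
    // writeFake = true is the proposed default, so nothing else is needed
}

// Variant B (older kernels without overlayfs): disable writeFake and give
// Nextflow its own writable image directory, which must not be the same
// directory as $CH_IMAGE_STORAGE.
charliecloud {
    enabled   = true
    writeFake = false
    cacheDir  = '/scratch/nxf_charliecloud_cache'   // hypothetical path
}
```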


nschan commented Sep 11, 2024

I would also like to add something that may not be obvious (at least to me) regarding this

> The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.

The charliecloud build cache appears to be located in $CH_IMAGE_STORAGE, but would still be used if the image is pulled somewhere else (https://hpc.github.io/charliecloud/ch-image.html#build-cache). I understand this to mean that setting $CH_IMAGE_STORAGE is generally a good idea. I will therefore modify this

> If the image is supposed to be run with -w (i.e. readOnlyInputs = false and writeFake = false) and $CH_IMAGE_STORAGE is set, this will throw an exception.

to test whether the image storage directory is the same as $CH_IMAGE_STORAGE and only throw an exception if that is the case.

Edit:
I think I misunderstood and the cache is in the directory that is passed via -s. See also hpc/charliecloud#1922.


nschan commented Sep 11, 2024

To explain my reluctance to add a conversion step, besides the fact that it would be quite some work: it would only add support for cases where someone is not able to use writeFake and needs to use $CH_IMAGE_STORAGE when running nextflow. If the goal is to have a charliecloud cache for nextflow containers without writeFake, this can be defined via NXF_CHARLIECLOUD_CACHEDIR. This writeable storage should probably be used only for nextflow anyway?

@phue phue left a comment

I like that it defaults to using overlayfs now.
The workaround that you implemented for older kernels isn't optimal, but it seems to be an improvement over the current state of things.

Left two comments below, but otherwise looks good to me.
Please note though that I currently don't have a good way to test this other than my laptop.

docs/config.md Outdated (resolved)
} else {
    // Otherwise run by path
    result << image
}

@phue phue Sep 12, 2024

It would be good to have unit tests for both code paths.

@nschan (author)

I totally agree. I am not super familiar with the test framework, though: is there a way to change the environment in the test framework, or would I need to construct something similar to how it is done in the container builder?


nschan commented Sep 12, 2024

Once again, something went sideways when I tried to sign off something. I will try to solve whatever went wrong.
