
Copying sdata_processed.zarr to S3 yields java.nio.file.NoSuchFileException #94

Open
R3myG opened this issue Nov 4, 2024 · 7 comments
Labels: bug (Something isn't working)

R3myG commented Nov 4, 2024

Description of the bug

Hello,

I'm running into a very strange issue with Nextflow + Tower + SpatialVi when publishing sdata_processed.zarr to S3.
I tested with Nextflow 24.04.3 and 24.04.4. The version of Tower is 23.3.0.
Publishing fails abruptly, aborting all currently running processes and reporting only a java.nio.file.NoSuchFileException in the error report.

The error occurs consistently for this folder. I've checked sdata_processed.zarr in the temporary work directory and it is valid, and I've manually copied the folder to S3 with an aws s3 cp --recursive command; all files reached the bucket without problems.
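For reference, the manual copy was of this form (paths sanitized to match the logs below):

aws s3 cp /my/path/dev/fd/1d03ec9a5058771f66189cef527764/artifacts/sdata_processed.zarr s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr --recursive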

I've sanitized the logs, which are included below.

Thank you,

Command used and terminal output

No response

Relevant files

[3f/b1e1b8] Submitted process > NFCORE_SPATIALVI:SPATIALVI:DOWNSTREAM:SPATIALLY_VARIABLE_GENES (B123456)
Nov-03 08:48:10.525 [PublishDir-5] DEBUG n.cloud.aws.nio.S3FileSystemProvider - S3 upload file from=/my/path/dev/fd/1d03ec9a5058771f66189cef527764/clustering.qmd to=s3://mys3bucket/Output/Project_Tmp/results/B123456/reports/clustering.qmd
Nov-03 08:48:10.560 [PublishDir-2] DEBUG n.cloud.aws.nio.S3FileSystemProvider - S3 upload file from=/my/path/dev/fd/1d03ec9a5058771f66189cef527764/params.yml to=s3://mys3bucket/Output/Project_Tmp/results/B123456/reports/clustering.yml
Nov-03 08:48:10.578 [PublishDir-1] DEBUG n.cloud.aws.nio.S3FileSystemProvider - S3 upload file from=/my/path/dev/fd/1d03ec9a5058771f66189cef527764/clustering.html to=s3://mys3bucket/Output/Project_Tmp/results/B123456/reports/clustering.html
Nov-03 08:48:10.852 [PublishDir-9] DEBUG n.cloud.aws.nio.S3FileSystemProvider - S3 upload file from=/my/path/dev/fd/1d03ec9a5058771f66189cef527764/artifacts/adata_processed.h5ad to=s3://mys3bucket/Output/Project_Tmp/results/B123456/data/adata_processed.h5ad
Nov-03 08:48:12.952 [PublishDir-8] DEBUG nextflow.processor.PublishDir - Failed to publish file: /my/path/dev/fd/1d03ec9a5058771f66189cef527764/_extensions; to: s3://mys3bucket/Output/Project_Tmp/results/B123456/reports/_extensions [copy] -- attempt: 1; reason: the path: s3://mys3bucket/Output/Project_Tmp/results/B123456/reports/_extensions/nf-core does not exist
Nov-03 08:48:13.059 [PublishDir-8] DEBUG n.cloud.aws.nio.S3FileSystemProvider - S3 upload directory from=/my/path/dev/fd/1d03ec9a5058771f66189cef527764/_extensions to=s3://mys3bucket/Output/Project_Tmp/results/B123456/reports/_extensions
Nov-03 08:48:15.024 [PublishDir-7] DEBUG nextflow.processor.PublishDir - Failed to publish file: /my/path/dev/fd/1d03ec9a5058771f66189cef527764/artifacts/sdata_processed.zarr; to: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr [copy] -- attempt: 1; reason: the path: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr/tables/table/obs/pct_counts_in_top_50_genes does not exist
Nov-03 08:48:18.521 [PublishDir-7] DEBUG nextflow.processor.PublishDir - Failed to publish file: /my/path/dev/fd/1d03ec9a5058771f66189cef527764/artifacts/sdata_processed.zarr; to: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr [copy] -- attempt: 2; reason: the path: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr/tables/table/obs/pct_counts_mt does not exist
Nov-03 08:48:22.555 [PublishDir-7] DEBUG nextflow.processor.PublishDir - Failed to publish file: /my/path/dev/fd/1d03ec9a5058771f66189cef527764/artifacts/sdata_processed.zarr; to: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr [copy] -- attempt: 3; reason: the path: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr/tables/table/obs/pct_counts_ribo does not exist
Nov-03 08:48:29.014 [PublishDir-7] DEBUG nextflow.processor.PublishDir - Failed to publish file: /my/path/dev/fd/1d03ec9a5058771f66189cef527764/artifacts/sdata_processed.zarr; to: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr [copy] -- attempt: 4; reason: the path: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr/tables/table/obs/region/categories does not exist
Nov-03 08:48:33.155 [PublishDir-7] ERROR nextflow.processor.PublishDir - Failed to publish file: /my/path/dev/fd/1d03ec9a5058771f66189cef527764/artifacts/sdata_processed.zarr; to: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr [copy] -- See log file for details
dev.failsafe.FailsafeException: java.nio.file.NoSuchFileException: the path: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr/tables/table/obs/region/codes does not exist
	at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:444)
	at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:129)
	at nextflow.processor.PublishDir.retryableProcessFile(PublishDir.groovy:396)
	at nextflow.processor.PublishDir.safeProcessFile(PublishDir.groovy:367)
	at jdk.internal.reflect.GeneratedMethodAccessor240.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.PublishDir$_apply1_closure1.doCall(PublishDir.groovy:342)
	at nextflow.processor.PublishDir$_apply1_closure1.call(PublishDir.groovy)
	at groovy.lang.Closure.run(Closure.java:505)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.nio.file.NoSuchFileException: the path: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr/tables/table/obs/region/codes does not exist
	at nextflow.cloud.aws.nio.S3FileSystemProvider.delete(S3FileSystemProvider.java:499)
	at java.base/java.nio.file.Files.delete(Files.java:1142)
	at nextflow.file.FileHelper$3.postVisitDirectory(FileHelper.groovy:1038)
	at nextflow.file.FileHelper$3.postVisitDirectory(FileHelper.groovy)
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2743)
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2797)
	at nextflow.file.FileHelper.deleteDir0(FileHelper.groovy:1030)
	at nextflow.file.FileHelper.deletePath(FileHelper.groovy:1022)
	at nextflow.processor.PublishDir.processFile(PublishDir.groovy:419)
	at jdk.internal.reflect.GeneratedMethodAccessor242.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.PublishDir$_retryableProcessFile_closure2.doCall(PublishDir.groovy:397)
	at jdk.internal.reflect.GeneratedMethodAccessor241.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:279)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at groovy.lang.Closure.call(Closure.java:433)
	at org.codehaus.groovy.runtime.ConvertedClosure.invokeCustom(ConvertedClosure.java:52)
	at org.codehaus.groovy.runtime.ConversionHandler.invoke(ConversionHandler.java:113)
	at com.sun.proxy.$Proxy43.get(Unknown Source)
	at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
	at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
	at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
	at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
	... 22 common frames omitted
ERROR ~ Failed to publish file: /my/path/dev/fd/1d03ec9a5058771f66189cef527764/artifacts/sdata_processed.zarr; to: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr [copy] -- See log file for details

 -- Check 'nf-2Aa5zlcwcMJE9j.log' file for details
Nov-03 08:48:33.187 [PublishDir-7] DEBUG nextflow.Session - Session aborted -- Cause: java.nio.file.NoSuchFileException: the path: s3://mys3bucket/Output/Project_Tmp/results/B123456/data/sdata_processed.zarr/tables/table/obs/region/codes does not exist

System information

Nextflow 24.04.3 and 24.04.4.
Tower / Seqera Cloud Enterprise 23.3.0

Tested with the version of SpatialVi currently on the dev branch.

Hardware: HPC
Executor: Slurm

R3myG added the bug label Nov 4, 2024
cavenel (Collaborator) commented Nov 4, 2024

Hi,

I think this might be fixed by nextflow-io/nextflow#3933, as the issue seems to occur when publishing a full directory that contains sub-directories (both sdata_processed.zarr and reports/_extensions fit that pattern). I suspect Nextflow doesn't create the sub-directories in S3 before trying to upload the individual files.
Unfortunately, that PR has not been merged into Nextflow yet.

R3myG (Author) commented Nov 5, 2024

@cavenel Thank you, this does indeed seem to be it. Do you have any suggestions on how to work around it until the fix is in place?

cavenel (Collaborator) commented Nov 5, 2024

Hi @R3myG, the answer from @bentsherman on the PR makes me wonder if that's actually the issue here. I have another theory.

If we look at your logs, we can see that it tries multiple times before failing.
For reports/_extensions, it fails once (attempt: 1) and then succeeds.
For artifacts/sdata_processed.zarr, it fails 4 times and then crashes.

What I think is happening here is that the "attempt index" is not reset between files of the same folder. And artifacts/sdata_processed.zarr contains a lot of files. So if the server is slow and times out for more than 4 files in the folder, the publish ends up crashing, as sketched below.
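A minimal sketch of this theory (hypothetical names; this is an illustration, not the actual Nextflow code):

// Hypothetical illustration only: a single retry counter shared across
// all files of the published folder, never reset per file.
int attempt = 0
folderContents.each { file ->
    while (!tryUpload(file)) {           // e.g. a transient S3 timeout
        if (++attempt >= maxAttempt)     // retry budget spent mid-folder
            throw new RuntimeException("publish failed: $file")
    }
}

With the default maxAttempt of 5, a handful of slow files would be enough to exhaust the budget for the whole folder.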

So if I am right and this issue comes from your S3 server being too slow or failing for some other reason, then the simplest solution would be to increase workflow.output.retryPolicy.maxAttempt and workflow.output.retryPolicy.delay. The former defaults to 5 and the latter to 150ms. You can try multiplying these by roughly 10 and see if it helps:

workflow.output.retryPolicy.delay = "3500ms"
workflow.output.retryPolicy.maxAttempt = 50
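(These lines can go straight into your nextflow.config, or into a custom config file passed with -c; for reference, the defaults are delay = 150ms and maxAttempt = 5.)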

(It would also eventually be nice to have a "fail index" per file and not per folder, but this looks a bit more tricky on the Nextflow side...)

R3myG (Author) commented Nov 7, 2024

@cavenel Thank you so much for the suggestion and for pointing me to the right settings to increase the retry delay and max attempts.
I had tried increasing the equivalent settings in the AWS config, but that didn't work.

Using

workflow.output.retryPolicy.delay = "3500ms"
workflow.output.retryPolicy.maxAttempt = 50

seems to have solved the problem when I ran one sample from my samplesheet. I'm currently rerunning the full samplesheet to double-check.

PS: You had inverted the values in your example; flagging it for anyone who might blindly copy-paste the snippet into their .config file ;)

cavenel (Collaborator) commented Nov 7, 2024

Great, let us know if it works on the full samplesheet so that we can close this issue.
(I fixed my inversion, sorry about that!)

R3myG (Author) commented Nov 11, 2024

Alright, some updates on my end.

After a successful run using a single sample and a completely different output directory in S3, I reverted to my original config.
This entailed using 3 samples and outputting to the usual /results directory.
It meant that the output folders, controlled by ${outDir}/results/${meta.id}, already had data in them from previous runs.

I had failures due to an issue with a Docker image, but the resumes worked and it all went through. I thought this was sorted for good, but I've since added more steps after spatialVi, and I had a failure almost at the end (spatialVi itself had completed).
When I resumed, something must have changed somewhere, because it only resumed from after the spaceranger step, and when it reached the completion of the Clustering process, the java.nio.file.NoSuchFileException popped up again.

Thankfully I had enabled the extra debugging logs, which I've sanitized for sensitive details and attached.

My reading is that when it reran the Clustering process, it deleted certain folders after completion and then attempted to upload again? That would match the stack trace above, where PublishDir.processFile calls FileHelper.deletePath on the existing S3 path before re-uploading.

error_S3_retry.txt

I'm curious to hear your thoughts; I'm currently rerunning the workflow from the beginning after clearing all the result folders.

R3myG (Author) commented Nov 14, 2024

@cavenel I've now confirmed that the issue persists when sdata_processed.zarr has previously been copied over to S3 and a rerun with the same output directory has to overwrite the existing data.
