This repository has been archived by the owner on Jan 25, 2018. It is now read-only.

Make cluster directory access more robust #24

Open
braymp opened this issue Jan 22, 2016 · 4 comments

Comments


braymp commented Jan 22, 2016

From @dlogan on January 29, 2014 19:17

Batch # 4203 http://imagingweb.broadinstitute.org/batchprofiler/cgi-bin/FileUI/CellProfiler/BatchProfiler/ViewBatch.py?batch_id=4203
is running; however, 4% of its batches have failed so far, all with the same error (example below). They all appear to stem from a temporary directory-access failure. I presume it is temporary because I can manually cd to the supposedly offending directory just fine.

Instead of me resubmitting them manually, can we add a "try, wait, try again" loop in loadimages?

...

Tue Jan 28 19:22:52 2014: Image # 20306, module MeasureObjectIntensity # 7: 0.88 sec
Tue Jan 28 19:22:53 2014: Image # 20306, module MeasureImageIntensity # 17: 3.22 sec
Tue Jan 28 19:22:56 2014: Image # 20306, module ExportToDatabase # 19: 0.20 sec
Tue Jan 28 19:22:57 2014: Image # 20306, module CreateBatchFiles # 20: 0.00 sec
Error detected during run of module LoadData
Traceback (most recent call last):
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/pipeline.py", line 1747, in run_with_yield
    module.run(workspace)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loaddata.py", line 1065, in run
    image = workspace.image_set.get_image(image_name)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/measurements.py", line 1485, in get_image
    image = matching_providers[0].provide_image(self)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loadimages.py", line 3138, in provide_image
    self.cache_file()
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loadimages.py", line 3063, in cache_file
    raise IOError("Test for access to directory failed. Directory: %s" %path)
IOError: Test for access to directory failed. Directory: /cbnt/cbimageX/HCS/shanmeghan/combinatorialscreen-dcn/nocode/2013-03-28/38265
Tue Jan 28 19:22:57 2014: Image # 20307, module LoadData # 1: 0.00 sec
Exiting the JVM monitor thread
FreeFontPath: FPE "unix/:7100" refcount is 2, should be 1; fixing.

Copied from original issue: CellProfiler/CellProfiler#1033
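The "try, wait, try again" idea could be sketched as a small retry wrapper around the directory-access test that cache_file performs. This is only an illustrative sketch, not CellProfiler code: the function name, retry count, and delay are hypothetical, and the final error message mirrors the one in the traceback above.

```python
import os
import time

def wait_for_directory(path, retries=5, delay=2.0):
    """Retry a directory-access test a few times before giving up.

    Hypothetical helper illustrating the "try, wait, try again" idea;
    the retry count and delay are illustrative defaults.
    """
    for attempt in range(retries):
        if os.path.isdir(path) and os.access(path, os.R_OK):
            return True
        # Linear backoff between attempts, in case the mount is slow to recover
        time.sleep(delay * (attempt + 1))
    raise IOError("Test for access to directory failed. Directory: %s" % path)
```

If the failure is genuinely transient (e.g. an NFS hiccup), a couple of retries would absorb it; if the node itself is broken, this only delays the error by retries × delay seconds.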

@braymp

braymp commented Jan 22, 2016

From @LeeKamentsky on January 29, 2014 19:28

I can always try a certain number of times, maybe with pauses in between, but my gut feeling is that the problem might be local to that cluster node and perhaps it's not recoverable. This might be a case where it's simpler and more reliable to deal with the failure at a higher level (have BatchProfiler 2.0 run CellProfiler again on another node). Vebjorn, what do you think? Also, is it worth pinging IT to ask them to look at the logs?

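The higher-level recovery Lee describes (rerun the failed batch on a different node) could look roughly like the following. This is a sketch, not BatchProfiler code: the function name and host list are hypothetical, though the `-R "select[...]"` resource-requirement string is standard LSF syntax for constraining where a job lands.

```python
def resubmit_command(batch_script, bad_hosts):
    """Build an LSF resubmission command that avoids suspect nodes.

    Hypothetical sketch of restarting a failed batch at a higher level;
    bad_hosts would come from scanning the logs of failed jobs.
    """
    cmd = ["bsub"]
    if bad_hosts:
        # Exclude hosts that produced the directory-access failures
        clause = " && ".join("hname!='%s'" % h for h in bad_hosts)
        cmd += ["-R", "select[%s]" % clause]
    cmd.append(batch_script)
    return cmd
```

For example, `resubmit_command("run_batch.sh", ["node1625"])` would produce a bsub invocation that keeps the rerun off the suspect node.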

@braymp

braymp commented Jan 22, 2016

From @dlogan on January 29, 2014 19:37

Aha, you are likely right. I just checked and (so far) the node is always the same, node1625.

/imaging/analysis/2007_11_07_Hepatoxicity_SPARC/2013_03_27_combinatorialscreen/Main_pipeline_output/2014_01_28_CP2p1_RUN_STRMLD/txt_output]$ grep -B 1000 rror * | grep node
20301_to_20650.txt-Sender: LSF System <lsf@node1625>
20301_to_20650.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
24851_to_25200.txt-Sender: LSF System <lsf@node1625>
24851_to_25200.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
51451_to_51800.txt-Sender: LSF System <lsf@node1625>
51451_to_51800.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
51801_to_52150.txt-Sender: LSF System <lsf@node1625>
51801_to_52150.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
59151_to_59500.txt-Sender: LSF System <lsf@node1625>
59151_to_59500.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
60901_to_61250.txt-Sender: LSF System <lsf@node1625>
60901_to_61250.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
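The grep pipeline above can also be expressed in Python, mapping each failed-job log to the host it ran on. The error markers and the LSF "executed on host(s)" header format are assumptions taken from the excerpts in this thread; the function name is illustrative.

```python
import glob
import re

def failing_hosts(log_glob="*.txt"):
    """Map each LSF output file containing an error to the executing host.

    Sketch of the grep pipeline above; assumes LSF's
    "Job was executed on host(s) <...>" header line.
    """
    hosts = {}
    for path in glob.glob(log_glob):
        with open(path) as fh:
            text = fh.read()
        if "Error detected" in text or "IOError" in text:
            m = re.search(r"executed on host\(s\) <([^>]+)>", text)
            if m:
                hosts[path] = m.group(1)
    return hosts
```

Running this over the txt_output directory would confirm at a glance whether every failure maps to node1625.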

@braymp

braymp commented Jan 22, 2016

From @dlogan on January 29, 2014 21:23

I emailed Help to have them look at node1625 (Help Ticket #409355).

@braymp

braymp commented Jan 22, 2016

From @ljosa on January 31, 2014 19:15

I think these kinds of errors are rarely temporary enough that it makes sense to sleep and retry; that only delays the inevitable. Better to fail fast and restart failed jobs from the top level.

Ideally, BatchProfiler should do that ASAP instead of waiting for a human to diagnose and trigger restarts, but I guess we don't want to rewrite BP right now…
