This repository has been archived by the owner on Jan 25, 2018. It is now read-only.

Make cluster directory access more robust #24

Open
braymp opened this issue Jan 22, 2016 · 4 comments

Comments


braymp commented Jan 22, 2016

From @dlogan on January 29, 2014 19:17

Batch # 4203 http://imagingweb.broadinstitute.org/batchprofiler/cgi-bin/FileUI/CellProfiler/BatchProfiler/ViewBatch.py?batch_id=4203
is running; however, 4% of its batches have failed so far, all with the same error (example below). They all appear to stem from a temporary directory-access failure. I presume it is temporary because I can manually cd to the supposedly offending directory just fine.

Instead of me resubmitting them manually, can we add a "try, wait, try again" loop in loadimages?

...

Tue Jan 28 19:22:52 2014: Image # 20306, module MeasureObjectIntensity # 7: 0.88 sec
Tue Jan 28 19:22:53 2014: Image # 20306, module MeasureImageIntensity # 17: 3.22 sec
Tue Jan 28 19:22:56 2014: Image # 20306, module ExportToDatabase # 19: 0.20 sec
Tue Jan 28 19:22:57 2014: Image # 20306, module CreateBatchFiles # 20: 0.00 sec
Error detected during run of module LoadData
Traceback (most recent call last):
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/pipeline.py", line 1747, in run_with_yield
    module.run(workspace)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loaddata.py", line 1065, in run
    image = workspace.image_set.get_image(image_name)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/measurements.py", line 1485, in get_image
    image = matching_providers[0].provide_image(self)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loadimages.py", line 3138, in provide_image
    self.cache_file()
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loadimages.py", line 3063, in cache_file
    raise IOError("Test for access to directory failed. Directory: %s" %path)
IOError: Test for access to directory failed. Directory: /cbnt/cbimageX/HCS/shanmeghan/combinatorialscreen-dcn/nocode/2013-03-28/38265
Tue Jan 28 19:22:57 2014: Image # 20307, module LoadData # 1: 0.00 sec
Exiting the JVM monitor thread
FreeFontPath: FPE "unix/:7100" refcount is 2, should be 1; fixing.

Copied from original issue: CellProfiler/CellProfiler#1033
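The "try, wait, try again" idea could be sketched as a small retry wrapper around the directory-access test that cache_file performs. This is only an illustrative sketch, not CellProfiler code: the function name, retry count, and delay are hypothetical, and the final error message mirrors the one in the traceback above.

```python
import os
import time

def wait_for_directory(path, retries=5, delay=2.0):
    """Retry a directory-access test a few times before giving up.

    Hypothetical helper illustrating the "try, wait, try again" idea;
    the retry count and delay are illustrative defaults.
    """
    for attempt in range(retries):
        if os.path.isdir(path) and os.access(path, os.R_OK):
            return True
        # Linear backoff between attempts, in case the mount is slow to recover
        time.sleep(delay * (attempt + 1))
    raise IOError("Test for access to directory failed. Directory: %s" % path)
```

If the failure is genuinely transient (e.g. an NFS hiccup), a couple of retries would absorb it; if the node itself is broken, this only delays the error by retries × delay seconds.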

@braymp

braymp commented Jan 22, 2016

From @LeeKamentsky on January 29, 2014 19:28

I can always try a certain number of times, maybe with pauses in between, but my gut feeling is that the problem might be local to that cluster node and perhaps it's not recoverable. This might be a case where it's simpler and more reliable to deal with the failure at a higher level (have BatchProfiler 2.0 run CellProfiler again on another node). Vebjorn, what do you think? Also, is it worth pinging IT to ask them to look at the logs?

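The higher-level recovery Lee describes (rerun the failed batch on a different node) could look roughly like the following. This is a sketch, not BatchProfiler code: the function name and host list are hypothetical, though the `-R "select[...]"` resource-requirement string is standard LSF syntax for constraining where a job lands.

```python
def resubmit_command(batch_script, bad_hosts):
    """Build an LSF resubmission command that avoids suspect nodes.

    Hypothetical sketch of restarting a failed batch at a higher level;
    bad_hosts would come from scanning the logs of failed jobs.
    """
    cmd = ["bsub"]
    if bad_hosts:
        # Exclude hosts that produced the directory-access failures
        clause = " && ".join("hname!='%s'" % h for h in bad_hosts)
        cmd += ["-R", "select[%s]" % clause]
    cmd.append(batch_script)
    return cmd
```

For example, `resubmit_command("run_batch.sh", ["node1625"])` would produce a bsub invocation that keeps the rerun off the suspect node.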

@braymp

braymp commented Jan 22, 2016

From @dlogan on January 29, 2014 19:37

Aha, you are likely right. I just checked and (so far) the node is always the same, node1625.

/imaging/analysis/2007_11_07_Hepatoxicity_SPARC/2013_03_27_combinatorialscreen/Main_pipeline_output/2014_01_28_CP2p1_RUN_STRMLD/txt_output]$ grep -B 1000 rror * | grep node
20301_to_20650.txt-Sender: LSF System <lsf@node1625>
20301_to_20650.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
24851_to_25200.txt-Sender: LSF System <lsf@node1625>
24851_to_25200.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
51451_to_51800.txt-Sender: LSF System <lsf@node1625>
51451_to_51800.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
51801_to_52150.txt-Sender: LSF System <lsf@node1625>
51801_to_52150.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
59151_to_59500.txt-Sender: LSF System <lsf@node1625>
59151_to_59500.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
60901_to_61250.txt-Sender: LSF System <lsf@node1625>
60901_to_61250.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
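The grep pipeline above can also be expressed in Python, mapping each failed-job log to the host it ran on. The error markers and the LSF "executed on host(s)" header format are assumptions taken from the excerpts in this thread; the function name is illustrative.

```python
import glob
import re

def failing_hosts(log_glob="*.txt"):
    """Map each LSF output file containing an error to the executing host.

    Sketch of the grep pipeline above; assumes LSF's
    "Job was executed on host(s) <...>" header line.
    """
    hosts = {}
    for path in glob.glob(log_glob):
        with open(path) as fh:
            text = fh.read()
        if "Error detected" in text or "IOError" in text:
            m = re.search(r"executed on host\(s\) <([^>]+)>", text)
            if m:
                hosts[path] = m.group(1)
    return hosts
```

Running this over the txt_output directory would confirm at a glance whether every failure maps to node1625.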

@braymp

braymp commented Jan 22, 2016

From @dlogan on January 29, 2014 21:23

I emailed Help to have them look at node1625 (Help Ticket #409355).

@braymp

braymp commented Jan 22, 2016

From @ljosa on January 31, 2014 19:15

I think these kinds of errors are rarely temporary enough that it makes sense to sleep and retry; that only delays the inevitable. Better to fail fast and restart failed jobs from the top level.

Ideally, BatchProfiler should do that ASAP instead of waiting for a human to diagnose and trigger restarts, but I guess we don't want to rewrite BP right now…
