We have observed that shards of VarScan work are sporadically failing, and the failures are not being caught by Cromwell.
This manifests as messages like the following in the VarScan stderr:
[E::bgzf_read_block] Failed to read BGZF block data at offset 2296748725 expected 9949 bytes; hread returned -1
[E::bgzf_read] Read block operation failed with error 4 after 65 of 234 bytes
samtools mpileup: error reading from input file
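For context, here is a minimal sketch of the streaming setup (the task and input names are illustrative, not the actual contents of varscan_somatic.wdl): with localization_optional enabled, Cromwell hands the task a gs:// URI rather than downloading the BAM, and samtools reads the file remotely over the network.

```wdl
task varscan_streaming_sketch {
  input {
    File normal_bam  # with localization_optional, this resolves to a gs:// URI
    File tumor_bam
    File reference
  }
  command <<<
    # htslib inside samtools streams the BGZF blocks over the network;
    # the bgzf_read_block errors above occur during these remote reads
    samtools mpileup -f ~{reference} ~{normal_bam} ~{tumor_bam} > combined.pileup
  >>>
}
```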
To get a clean run of VarScan results for comparison, we can turn `localization_optional` off by removing the block at analysis-wdls/definitions/tools/varscan_somatic.wdl, lines 22 to 27 in abc7e58.
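In Cromwell, that hint lives in the task's parameter_meta section; the block being removed looks roughly like this (a sketch, not the verbatim lines 22 to 27):

```wdl
parameter_meta {
  normal_bam: { localization_optional: true }
  tumor_bam: { localization_optional: true }
}
```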
Then, to get Cromwell to detect the failures and retry, we can add the change at analysis-wdls/definitions/tools/varscan_somatic.wdl, lines 41 to 43 in abc7e58. To get this working we also had to change the way VarScan was run, using pipes instead of redirection.
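A sketch of that change, assuming the original command fed VarScan via redirection so that samtools failures never surfaced as a nonzero exit code (the output name and exact VarScan arguments here are illustrative):

```wdl
command <<<
  set -o pipefail  # without this, the pipeline's exit code hides samtools failures
  # Piping mpileup straight into VarScan means a bgzf_read error in samtools
  # now fails the whole command, which Cromwell sees and can retry.
  samtools mpileup -f ~{reference} ~{normal_bam} ~{tumor_bam} \
    | java -jar VarScan.jar somatic output --mpileup 1
>>>
```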
That did allow the errors to be caught by Cromwell correctly, but the issue remained that when streaming from a bucket we encounter bgzf_read_block errors. In a test run I observed such failures in 3 out of 50 VarScan shards. In one of these the task succeeded on a reattempt; in the other two, both reattempts also failed, in a very similar but non-identical fashion, e.g.
attempt-1/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7179935145 expected 10896 bytes; hread returned -1
attempt-1/stderr:[E::bgzf_read] Read block operation failed with error 4 after 155 of 229 bytes
attempt-2/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7179459514 expected 23195 bytes; hread returned -1
attempt-2/stderr:[E::bgzf_read] Read block operation failed with error 4 after 967 of 2007 bytes
attempt-3/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7178631036 expected 10117 bytes; hread returned -1
attempt-3/stderr:[E::bgzf_read] Read block operation failed with error 4 after 222 of 226 bytes
We can investigate our options here (e.g. try different versions of htslib, allow more reattempts, change the way VarScan parallelizes its work, or drop VarScan entirely), but I think the short-term fix is to disable localization_optional for VarScan for now. Something about the way streaming works in this context does not seem robust enough for production.
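If we do lean on reattempts in the interim, Cromwell's maxRetries runtime attribute is the relevant knob (a sketch; the count is arbitrary):

```wdl
runtime {
  maxRetries: 3  # give transient streaming failures a few more chances
}
```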