Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add tests for SAM files that corrupt with adam + htsjdk 2.20.x #203

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

heuermh
Copy link
Contributor

@heuermh heuermh commented Sep 11, 2019

ADAM fails unit tests after bumping the htsjdk dependency version from 2.19.x to 2.20.x, see
bigdatagenomics/adam#2195

This pull request demonstrates that Hadoop-BAM shares the same issue.

$ mvn test -Dtest=TestSAMInputFormat
...
[ERROR] Tests run: 4, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 2.77 s <<< FAILURE! - in org.seqdoop.hadoop_bam.TestSAMInputFormat
[ERROR] testCorruptedHeaderSAM(org.seqdoop.hadoop_bam.TestSAMInputFormat)  Time elapsed: 0.829 s  <<< ERROR!
htsjdk.samtools.SAMFormatException:
Error parsing text SAM file. RNAME '75M' not found in any SQ record; Line 3
Line: 1	60	75M	*	0	0	GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA	*	NM:i:0	AS:i:75	XS:i:0
	at org.seqdoop.hadoop_bam.TestSAMInputFormat.testCorruptedHeaderSAM(TestSAMInputFormat.java:151)

[ERROR] testCorruptedReadNameSAM(org.seqdoop.hadoop_bam.TestSAMInputFormat)  Time elapsed: 0.033 s  <<< FAILURE!
org.junit.ComparisonFailure: expected:<[simread:1:2]6472783:false> but was:<[]6472783:false>
	at org.seqdoop.hadoop_bam.TestSAMInputFormat.testCorruptedReadNameSAM(TestSAMInputFormat.java:131)

[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   TestSAMInputFormat.testCorruptedReadNameSAM:131 expected:<[simread:1:2]6472783:false> but was:<[]6472783:false>
[ERROR] Errors:
[ERROR]   TestSAMInputFormat.testCorruptedHeaderSAM:151 » SAMFormat Error parsing text S...
[INFO]
[ERROR] Tests run: 4, Failures: 1, Errors: 1, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

@heuermh
Copy link
Contributor Author

heuermh commented Sep 17, 2019

@lbergelson @cmnbroad This pull request demonstrates an issue we see in ADAM and in Hadoop-BAM but not in disq. Might you be able to help me make a reproducing case in htsjdk, to see if the problem is there or in Hadoop-BAM?

@cmnbroad
Copy link
Collaborator

@heuermh - that seems bizarre. I'll have a look in the next day or so and see if I can track it down or repro.

@cmnbroad
Copy link
Collaborator

@heuermh It looks like these failures are caused by the combination of a change in htsjdk and an invalid assumption inherent in this code in WorkaroundingStream. That code block in WorkaroundingStream probably never executed before, but does now because SAMFileHeader.getTextHeader() has been deprecated and was changed to always return null (I believe that change was motivated by a performance issue with very large headers).

The code in WorkaroundingStream assumes that the length of the string resulting from re-encoding the header object will be the same as the original header's text length, and therefore can be used as an offset to find the end of the header in the input stream. Both of the failing test cases are missing the optional @HD line, so in those cases the re-encoded text is longer than the original because the encoder inserts an @HD line during encode. You can verify this by seeing that the tests pass if the @HD line is added to the input files. Not sure what they real fix should be though, the WorkaroundingStream seems pretty fragile.

@lbergelson
Copy link
Contributor

@cmnbroad Wow, nice work tracking that down.

@heuermh
Copy link
Contributor Author

heuermh commented Sep 18, 2019

Ah right, I mistakenly waded into the SAMFileHeader.getTextHeader() mess once before. ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants