interleave-reads.py fails with differing metadata for read pairs #1854

fungs · 2018-04-12T17:52:32Z

The script, if fed with read pairs which look like this

read1 length=99/1
read1 length=101/2

fails. This should be fixed by looking at the first part of the identifier separated by whitespace, which seems to be common Illumina format.

Message: ERROR: This doesn't look like paired data!
Version: 2.1.2
Executable: interleave-reads.py

Best,
Johannes

The text was updated successfully, but these errors were encountered:

standage · 2018-04-12T18:02:14Z

Thanks for the report!

The issue isn't the read lengths themselves, its the read IDs. At the moment khmer only recognizes paired reads if the entire read ID (including stuff after the first whitespace) is identical up to the /1 or /2 suffix. I'm not convinced this is a bug.

This isn't the only pairing format that khmer supports, but if you are relying on the /1 and /2 suffixes then everything prior to this suffix must be identical.

fungs · 2018-04-12T18:32:57Z

Well, I know there are different naming schemes but if you implement sanity checking for Illumina-style naming you should know that many post-processing tools append key-value tags to the read names but the part before the first whitespace is always unique and the rest is just metainfo. I'm not aware of a program creating identifiers which include whitespaces. So it would be much more robust to the if the first part matches instead of simply cutting of the /1 and /2 parts. Or you can argue with Brian Bushnell over at BBMap whether adjusting the length= tag to the new length of the read really is a bug (though my gut feeling tells me that not adjusting it would be wrong).

The length=xx tag is standard when you pull sequences from NCBI SRA using fastq-dump, this is where my reads came from.

This is just my 2 cents, I have my own working alternative and BBMap has its own which I guess works just fine (reformat.sh), I was just giving the khmer scripts a try.

Best,
Johannes

standage · 2018-04-12T20:16:43Z

One of the alternative conventions khmer supports (or is supposed to support) is @name 1:blah blah blah and @name 2:blah blah blah, which is compatible with appending metadata to the defline ad infinitum.

fungs · 2018-04-12T20:29:02Z

That doesn't change the checking rule I mentioned. You don't really need to check 1 vs 2, just make sure the identifiers before the whitespace match.

standage · 2018-04-12T20:34:54Z

I'll poke some of my colleagues and see if anyone else wants to chime in. As a new user of khmer, I remember the handling of paired reads as one of my first pain points. It should be fairly straightforward to support any number of reasonable conventions as long as they're not too obscure or complex.

drtamermansour · 2018-04-12T22:22:51Z

I used to have a similar problem when I got Fastq files from SRA using fastq-dump. They have an option to append .1 and .2 to the @name part of the headers but this is still not accepted by our parser. I had to edit them to /1 and /2. I think we can have an option like force_paired that does not check for broken reads and just believe that reads are paired.

ctb · 2018-04-13T13:02:36Z

note that for the 2.x series, we can only add command line flags (ref Semantic Versioning) to adjust behavior. I think @drtamermansour proposal is good for that.

however, for 3.x, we could completely change the behavior, and for that I would propose by default checking that the identifier before the white space is correct and ignoring everything after that and ripping out anything else that checks for proper pairing. There are too many conventions to support to do otherwise and there's a major speed consideration also.

thoughts?

fungs · 2018-04-13T13:30:34Z

@ctb: sounds reasonable to me. It's rather unlikely that two reads in the forward and reverse files have the same identifier and index but should not be properly paired.

If you want to go nuts on format checking, you could have a command line option for strong consistency checking which understands some formats and might check for uniqueness of identifiers as well. However, following the UNIX one-program-per-task paradigm that should be in a separate program.

fungs changed the title ~~interleave-reads.py fails with differently-sized read pairs~~ interleave-reads.py fails with differing metadata for read pairs Apr 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

interleave-reads.py fails with differing metadata for read pairs #1854

interleave-reads.py fails with differing metadata for read pairs #1854

fungs commented Apr 12, 2018 •

edited

Loading

standage commented Apr 12, 2018

fungs commented Apr 12, 2018

standage commented Apr 12, 2018

fungs commented Apr 12, 2018 •

edited

Loading

standage commented Apr 12, 2018

drtamermansour commented Apr 12, 2018

ctb commented Apr 13, 2018

fungs commented Apr 13, 2018

interleave-reads.py fails with differing metadata for read pairs #1854

interleave-reads.py fails with differing metadata for read pairs #1854

Comments

fungs commented Apr 12, 2018 • edited Loading

standage commented Apr 12, 2018

fungs commented Apr 12, 2018

standage commented Apr 12, 2018

fungs commented Apr 12, 2018 • edited Loading

standage commented Apr 12, 2018

drtamermansour commented Apr 12, 2018

ctb commented Apr 13, 2018

fungs commented Apr 13, 2018

fungs commented Apr 12, 2018 •

edited

Loading

fungs commented Apr 12, 2018 •

edited

Loading