Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

interleave-reads.py fails with differing metadata for read pairs #1854

Open
fungs opened this issue Apr 12, 2018 · 8 comments
Open

interleave-reads.py fails with differing metadata for read pairs #1854

fungs opened this issue Apr 12, 2018 · 8 comments

Comments

@fungs
Copy link

fungs commented Apr 12, 2018

The script, if fed with read pairs which look like this

read1 length=99/1
read1 length=101/2

fails. This should be fixed by looking at the first part of the identifier separated by whitespace, which seems to be common Illumina format.

Message: ERROR: This doesn't look like paired data!
Version: 2.1.2
Executable: interleave-reads.py

Best,
Johannes

@standage
Copy link
Member

Thanks for the report!

The issue isn't the read lengths themselves, its the read IDs. At the moment khmer only recognizes paired reads if the entire read ID (including stuff after the first whitespace) is identical up to the /1 or /2 suffix. I'm not convinced this is a bug.

This isn't the only pairing format that khmer supports, but if you are relying on the /1 and /2 suffixes then everything prior to this suffix must be identical.

@fungs
Copy link
Author

fungs commented Apr 12, 2018

Well, I know there are different naming schemes but if you implement sanity checking for Illumina-style naming you should know that many post-processing tools append key-value tags to the read names but the part before the first whitespace is always unique and the rest is just metainfo. I'm not aware of a program creating identifiers which include whitespaces. So it would be much more robust to the if the first part matches instead of simply cutting of the /1 and /2 parts. Or you can argue with Brian Bushnell over at BBMap whether adjusting the length= tag to the new length of the read really is a bug (though my gut feeling tells me that not adjusting it would be wrong).

The length=xx tag is standard when you pull sequences from NCBI SRA using fastq-dump, this is where my reads came from.

This is just my 2 cents, I have my own working alternative and BBMap has its own which I guess works just fine (reformat.sh), I was just giving the khmer scripts a try.

Best,
Johannes

@standage
Copy link
Member

One of the alternative conventions khmer supports (or is supposed to support) is @name 1:blah blah blah and @name 2:blah blah blah, which is compatible with appending metadata to the defline ad infinitum.

@fungs
Copy link
Author

fungs commented Apr 12, 2018

That doesn't change the checking rule I mentioned. You don't really need to check 1 vs 2, just make sure the identifiers before the whitespace match.

@fungs fungs changed the title interleave-reads.py fails with differently-sized read pairs interleave-reads.py fails with differing metadata for read pairs Apr 12, 2018
@standage
Copy link
Member

I'll poke some of my colleagues and see if anyone else wants to chime in. As a new user of khmer, I remember the handling of paired reads as one of my first pain points. It should be fairly straightforward to support any number of reasonable conventions as long as they're not too obscure or complex.

@drtamermansour
Copy link
Member

I used to have a similar problem when I got Fastq files from SRA using fastq-dump. They have an option to append .1 and .2 to the @name part of the headers but this is still not accepted by our parser. I had to edit them to /1 and /2. I think we can have an option like force_paired that does not check for broken reads and just believe that reads are paired.

@ctb
Copy link
Member

ctb commented Apr 13, 2018

note that for the 2.x series, we can only add command line flags (ref Semantic Versioning) to adjust behavior. I think @drtamermansour proposal is good for that.

however, for 3.x, we could completely change the behavior, and for that I would propose by default checking that the identifier before the white space is correct and ignoring everything after that and ripping out anything else that checks for proper pairing. There are too many conventions to support to do otherwise and there's a major speed consideration also.

thoughts?

@fungs
Copy link
Author

fungs commented Apr 13, 2018

@ctb: sounds reasonable to me. It's rather unlikely that two reads in the forward and reverse files have the same identifier and index but should not be properly paired.

If you want to go nuts on format checking, you could have a command line option for strong consistency checking which understands some formats and might check for uniqueness of identifiers as well. However, following the UNIX one-program-per-task paradigm that should be in a separate program.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants