-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
interleave-reads.py fails with differing metadata for read pairs #1854
Comments
Thanks for the report! The issue isn't the read lengths themselves, its the read IDs. At the moment khmer only recognizes paired reads if the entire read ID (including stuff after the first whitespace) is identical up to the This isn't the only pairing format that khmer supports, but if you are relying on the |
Well, I know there are different naming schemes but if you implement sanity checking for Illumina-style naming you should know that many post-processing tools append key-value tags to the read names but the part before the first whitespace is always unique and the rest is just metainfo. I'm not aware of a program creating identifiers which include whitespaces. So it would be much more robust to the if the first part matches instead of simply cutting of the The length=xx tag is standard when you pull sequences from NCBI SRA using fastq-dump, this is where my reads came from. This is just my 2 cents, I have my own working alternative and BBMap has its own which I guess works just fine (reformat.sh), I was just giving the khmer scripts a try. Best, |
One of the alternative conventions khmer supports (or is supposed to support) is |
That doesn't change the checking rule I mentioned. You don't really need to check 1 vs 2, just make sure the identifiers before the whitespace match. |
I'll poke some of my colleagues and see if anyone else wants to chime in. As a new user of khmer, I remember the handling of paired reads as one of my first pain points. It should be fairly straightforward to support any number of reasonable conventions as long as they're not too obscure or complex. |
I used to have a similar problem when I got Fastq files from SRA using fastq-dump. They have an option to append .1 and .2 to the @name part of the headers but this is still not accepted by our parser. I had to edit them to /1 and /2. I think we can have an option like force_paired that does not check for broken reads and just believe that reads are paired. |
note that for the 2.x series, we can only add command line flags (ref Semantic Versioning) to adjust behavior. I think @drtamermansour proposal is good for that. however, for 3.x, we could completely change the behavior, and for that I would propose by default checking that the identifier before the white space is correct and ignoring everything after that and ripping out anything else that checks for proper pairing. There are too many conventions to support to do otherwise and there's a major speed consideration also. thoughts? |
@ctb: sounds reasonable to me. It's rather unlikely that two reads in the forward and reverse files have the same identifier and index but should not be properly paired. If you want to go nuts on format checking, you could have a command line option for strong consistency checking which understands some formats and might check for uniqueness of identifiers as well. However, following the UNIX one-program-per-task paradigm that should be in a separate program. |
The script, if fed with read pairs which look like this
fails. This should be fixed by looking at the first part of the identifier separated by whitespace, which seems to be common Illumina format.
Message:
ERROR: This doesn't look like paired data!
Version: 2.1.2
Executable: interleave-reads.py
Best,
Johannes
The text was updated successfully, but these errors were encountered: