canon-gff3 problem with IDs #206
Comments
I think this represents a discrepancy between how the CDS is encoded in the Araport file and how CDSs are typically encoded in GFF3. In this test file, there are 4-5 CDS records associated with each mRNA. These of course don't represent distinct coding sequences, but a single discontinuous CDS. In GFF3, this is typically represented as a multifeature, where multiple records/lines are required to fully represent a single feature. The key to defining multifeatures is that each record associated with the multifeature must have the same ID (see the canonical gene example from the GFF3 spec). The Araport GFF3 doesn't follow this convention. Frankly, this isn't all that uncommon. The
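To make the multifeature convention concrete, here is a minimal Python sketch that lists mRNAs whose CDS records carry more than one distinct ID. It is not part of AEGeAn or GenomeTools; the function names (`parse_attributes`, `cds_ids_by_parent`, `parents_with_split_cds`) are hypothetical, and it assumes plain tab-separated GFF3 with unescaped attribute values.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def cds_ids_by_parent(gff3_lines):
    """Map each Parent to the set of distinct CDS IDs attached to it."""
    ids = defaultdict(set)
    for line in gff3_lines:
        # Skip directives, comments, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        attrs = parse_attributes(fields[8])
        # A CDS may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                ids[parent].add(attrs.get("ID"))
    return dict(ids)


def parents_with_split_cds(gff3_lines):
    """Return parents whose CDS records do not all share one ID."""
    by_parent = cds_ids_by_parent(gff3_lines)
    return sorted(p for p, cds_ids in by_parent.items() if len(cds_ids) > 1)
```

A parent reported by this check is only a candidate for the ID problem: the GFF3 spec also permits several genuinely distinct CDSs under one mRNA, so the result needs a manual look before the file is changed.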
Hmm. This is still a problem. See That is, canon-gff3 turns valid GFF3 into something that is not.
In a syntactic sense, yes, the input is a valid GFF3 file. But in a semantic sense it's not: it has 4+ CDSs per mRNA. This doesn't trigger a warning or error message with the GenomeTools validator, but it does trigger corrective measures, probably in the GenomeTools GFF3 writer if I had to guess. There's an argument that this particular input should cause a warning message in the GFF3 validator. But the link to the GFF3 spec in my last comment shows a valid, though less commonly used, encoding of multiple CDSs for a single mRNA. Distinguishing between these two scenarios is a bit involved, and was presumably deemed out of scope for the validator by the GenomeTools folks. Updating the GenomeTools validator, GFF3 parser, or GFF3 writer are all possibilities, though they would be pretty labor-intensive. Perhaps the solution is to clarify that AEGeAn Toolkit programs, and the GenomeTools library on which they rely, will work correctly when CDS features are encoded as per the spec, but may behave unexpectedly when input data deviates from the spec. Note: the documentation already discusses GFF3 and some common pitfalls: https://aegean.readthedocs.io/en/stable/gff3.html. Perhaps I can add another point there for this common multifeature ID issue.
Ok. I think one solution in our case would be to remove the Name attribute from the CDS lines in the file produced by canon-gff3. Thanks for the clarification.
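For anyone wanting the same workaround, a rough Python sketch of dropping `Name` from CDS records follows. The helper `strip_cds_name` is hypothetical (it is not an AEGeAn or GenomeTools function) and assumes tab-separated GFF3 lines with `;`-separated attributes.

```python
def strip_cds_name(line):
    """Return the line without its Name attribute if it is a CDS record."""
    fields = line.rstrip("\n").split("\t")
    # Leave comments, directives, malformed lines, and non-CDS features alone.
    if line.startswith("#") or len(fields) != 9 or fields[2] != "CDS":
        return line
    kept = [
        pair
        for pair in fields[8].split(";")
        if pair and not pair.strip().startswith("Name=")
    ]
    # GFF3 uses '.' for an empty column.
    fields[8] = ";".join(kept) or "."
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

It can be applied line by line to the canon-gff3 output, for example `cleaned = [strip_cds_name(l) for l in open("canon.gff3")]`, where the filename is only an example.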
Input:
Run:
This shows that canon.gff3 creates a mismatch between Name and ID (see CDS:5).