Add advanced options to do some basic manipulation of sequences and sequence names #39

Open
mvences opened this issue Jan 19, 2022 · 0 comments
mvences commented Jan 19, 2022

For the next version of DNAconvert, it would be good to implement some simple options that let the user specify details of what the converted file should look like. This is not a priority issue, but if there is time, maybe some of these functions can be added. Before doing so, we need to think about the best way to do it, because:

  • Some functions will only be applicable to certain output formats, and this needs to be obvious from the way they are shown in the GUI (maybe a specific pop-up window with options?).
  • IMPORTANT: In general, the current GUI of DNAconvert is very nice and simple, and we must avoid cluttering it with too many options. So these advanced options must be somehow "hidden": just one button that, when clicked, shows a pop-up menu or similar with these options.

Some of the functions will be super easy to implement, some are much more difficult. We definitely should go for the easy ones first.

I have tried to order the items by priority, listing first those changes that are probably easy to implement.

Many of the new options would be particularly relevant for the "batch" function:

  1. Add an option to make the tabfile "spreadsheet compatible" by adding a leading N wherever a gap is currently the first character (a kind of "padding"; I am not sure whether this is already implemented, but it is in concatenator).

  2. Option to remove all gaps from the sequences

  3. Option to remove all terminal gaps from the sequences

  4. Option to remove all terminal N and ? from the sequences

  5. Option to change all terminal N and ? to gaps

  6. Option to change all ambiguous data: RYSWKMBDHV to N or ?

  7. Like in concatenator, allow user to decide what symbol is used for missing data (? or N) and then adjust this for all sequences

  8. Like in concatenator, allow changing all nucleotides either to upper case or lower case

  9. Option to restrict the length of sequence names to a certain, user specified length (e.g., 50 or 80 characters) and add numbers to the names to make sure there are no duplicates.

  10. Include two sorting functions: order sequences in output file by sequence length, or alphabetically by sequence name.

  11. Add options for how to treat spaces preceding sequences in ali files during conversion: remove the spaces, or change them to question marks, dashes, or perhaps also asterisks.

  12. Add an option to disable all "sanitizing" of sequence names (special characters such as spaces etc. will be kept in the sequence names, identical to the input file).

  13. Option to change all letters in sequences to upper case or lower case (such as: atcgggttnty vs. ATCGGGTTNTY)
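To make the discussion more concrete, here is a rough Python sketch of how a few of the simple options above could look as plain string functions. Nothing here is taken from the DNAconvert code base: all function names, the `(name, sequence)` tuple layout, and the `_1`, `_2` suffix style for duplicate names are assumptions for illustration. For item 1, the sketch assumes that the leading gap is replaced by N, although prepending an N would be another possible reading.

```python
import re

# IUPAC ambiguity codes as listed in item 6
AMBIGUOUS = "RYSWKMBDHV"


def pad_leading_gap(seq):
    """Item 1 (assumed reading): replace a gap in first position with N."""
    return "N" + seq[1:] if seq.startswith("-") else seq


def remove_gaps(seq):
    """Item 2: remove every gap character from the sequence."""
    return seq.replace("-", "")


def strip_terminal(seq, chars="-"):
    """Items 3 and 4: remove leading and trailing characters listed in chars."""
    return seq.strip(chars)


def terminal_to_gaps(seq, chars="Nn?"):
    """Item 5: turn leading and trailing runs of chars into gaps, keeping length."""
    left = len(seq) - len(seq.lstrip(chars))
    if left == len(seq):
        # the whole sequence consists of missing-data symbols
        return "-" * len(seq)
    right = len(seq) - len(seq.rstrip(chars))
    return "-" * left + seq[left:len(seq) - right] + "-" * right


def mask_ambiguous(seq, missing="N"):
    """Item 6: replace ambiguity codes (upper or lower case) with missing."""
    return re.sub(f"[{AMBIGUOUS}]", missing, seq, flags=re.IGNORECASE)


def set_missing_symbol(seq, missing="N"):
    """Item 7: use a single symbol for missing data (N, n and ? become missing)."""
    return re.sub(r"[Nn?]", missing, seq)


def shorten_names(names, max_len=50):
    """Item 9: truncate names to max_len and number them to avoid duplicates.

    Assumes max_len is comfortably larger than the numeric suffix.
    """
    seen = set()
    result = []
    for name in names:
        candidate = name[:max_len]
        counter = 1
        while candidate in seen:
            suffix = f"_{counter}"
            candidate = name[:max_len - len(suffix)] + suffix
            counter += 1
        seen.add(candidate)
        result.append(candidate)
    return result


def sort_records(records, by="name"):
    """Item 10: sort (name, sequence) tuples by name or by sequence length."""
    if by == "name":
        return sorted(records, key=lambda rec: rec[0])
    return sorted(records, key=lambda rec: len(rec[1]))
```

Because each option is an independent function on a string, the batch mode could apply whichever ones the user ticked, in a fixed order, to every sequence. Case conversion (items 8 and 13) would simply be `seq.upper()` or `seq.lower()`.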


More complex options:

  1. Filter out (delete from output file) all sequences that contain a certain search string in the sequence or the sequence name

  2. Filter out (delete sequences from output file) according to either a "positive list" or a "negative list" of sequence names. That is, the user can provide a list of sequence names, and the sequences matching the list are either kept in the output file and all others deleted, or deleted and all others kept.

  3. Maybe a simple dereplication option could be integrated to remove sequences or sequence names that are identical to others (like in TaxI2 ... but probably this will be very memory intensive, to be decided)

  4. Is there a way to include "fastmerge" into DNAconvert as one option, so if running in batch mode (specifying a directory instead of input file) the program first merges all files in the directory, and then does the conversion?

  5. Allow specifying that sequences are proteins, and adjust a few options (e.g., allow deleting all * which in protein sequences are sometimes used instead of gaps).

  6. In fasta files, the entire sequence is currently written on one line, which is usually best, but we could allow "wrapping" the sequence over several lines and let the user specify how many characters per line.

  7. Add options to translate into proteins and back-translate from proteins to nucleotides.
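Some of the more complex options above are also not much code if the sequences are handled as a list of `(name, sequence)` tuples. The following Python sketch covers the text filter, the positive/negative name list, a simple dereplication by identical sequence, and fasta line wrapping. As before, all names and the record layout are hypothetical and not taken from DNAconvert, and the dereplication shown keeps only the first record of each identical sequence, which is just one possible rule.

```python
def filter_by_text(records, text):
    """Drop records whose name or sequence contains the search string."""
    return [(name, seq) for name, seq in records
            if text not in name and text not in seq]


def filter_by_names(records, names, keep=True):
    """Positive list (keep=True): keep only listed names.
    Negative list (keep=False): drop listed names and keep the rest."""
    listed = set(names)
    return [(name, seq) for name, seq in records if (name in listed) == keep]


def dereplicate(records):
    """Keep the first record of each distinct sequence, preserving order."""
    seen = set()
    unique = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique


def format_fasta(records, width=None):
    """Write records as fasta text; wrap sequences at width if given."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        if width:
            lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
        else:
            lines.append(seq)
    return "\n".join(lines) + "\n"


records = [("a", "ACGT"), ("b", "ACGT"), ("c", "TTTT")]
print(format_fasta(dereplicate(records), width=2))
```

Regarding the memory concern for dereplication: the set in this sketch holds every distinct sequence in memory, so for very large files one way to limit this would be to store a fixed-size digest of each sequence instead of the sequence itself.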

I will keep adding additional ideas to this issue.
