Add advanced options to do some basic manipulation of sequences and sequence names #39

mvences · 2022-01-19T20:27:14Z

For the next version of DNAconvert, it would be good to implement some simple options that let the user specify details of how the converted file should look. This is not a priority issue, but if there is time, some of these functions could be added. Before doing so, we need to think about the best way to do it, because:

Some functions will only be applicable to certain output formats; this needs to be obvious from the way it is shown in the GUI (maybe a specific pop-up window with options?).
IMPORTANT: In general, the current GUI of DNAconvert is very nice and simple, and we must avoid cluttering it with too many options. So these advanced options must be somehow "hidden": just one button which, when clicked, opens a pop-up menu or similar showing these options.

Some of the functions will be super easy to implement, some are much more difficult. We definitely should go for the easy ones first.

I have tried to order the items by priority, listing first those changes that are probably easy to implement.

Many of the new options would be particularly relevant for the "batch" function:

Add an option to make the tabfile "spreadsheet compatible" by adding a starting N where there now is a gap as first character (a kind of "padding" - not sure if this is already implemented, it is so in concatenator...)
Option to remove all gaps from the sequences
Option to remove all terminal gaps from the sequences
Option to remove all terminal N and ? from the sequences
Option to change all terminal N and ? to gaps
Option to change all ambiguous data: RYSWKMBDHV to N or ?
Like in concatenator, allow user to decide what symbol is used for missing data (? or N) and then adjust this for all sequences
Like in concatenator, allow changing all nucleotides either to upper case or lower case
Option to restrict the length of sequence names to a certain, user specified length (e.g., 50 or 80 characters) and add numbers to the names to make sure there are no duplicates.
Include two sorting functions: order sequences in output file by sequence length, or alphabetically by sequence name.
Add options for how to treat spaces preceding sequences in ali files during conversion: remove the spaces, or change them to question marks, dashes, or perhaps asterisks.
Add an option to disable all "sanitizing" of sequence names (special characters such as spaces etc. will be kept in the sequence names, identical to the input file).
Option to change all letters in sequences to upper case or lower case (such as: atcgggttnty vs. ATCGGGTTNTY)
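
To make the simpler options above more concrete, here is a minimal Python sketch of a few of them as pure functions on strings. Everything in it is an assumption for illustration: the function names, the `(name, sequence)` record tuple, the `_1`, `_2` suffix scheme for duplicate names, and the use of `-` as the only gap character are hypothetical and do not reflect DNAconvert's actual internals.

```python
import re
from typing import List, Tuple

Record = Tuple[str, str]  # hypothetical record type: (sequence name, sequence)


def remove_all_gaps(seq: str) -> str:
    """Delete every gap character ('-') from the sequence."""
    return seq.replace("-", "")


def strip_terminal(seq: str, chars: str = "-") -> str:
    """Remove leading and trailing runs of the given characters."""
    return seq.strip(chars)


def terminal_missing_to_gaps(seq: str, missing: str = "Nn?") -> str:
    """Replace leading and trailing runs of N/? with the same number of gaps."""
    lead = len(seq) - len(seq.lstrip(missing))
    if lead == len(seq):  # the whole sequence is missing data
        return "-" * len(seq)
    trail = len(seq) - len(seq.rstrip(missing))
    return "-" * lead + seq[lead:len(seq) - trail] + "-" * trail


def replace_ambiguous(seq: str, symbol: str = "N") -> str:
    """Replace the ambiguity codes RYSWKMBDHV (either case) with one symbol."""
    return re.sub(r"[RYSWKMBDHV]", symbol, seq, flags=re.IGNORECASE)


def shorten_names(names: List[str], max_len: int = 50) -> List[str]:
    """Truncate names to max_len; number any that collide after truncation.

    Assumes max_len is comfortably longer than the numeric suffix.
    """
    used = set()
    result = []
    for name in names:
        short = name[:max_len]
        candidate = short
        n = 1
        while candidate in used:
            suffix = f"_{n}"
            candidate = short[:max_len - len(suffix)] + suffix
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result


def sort_records(records: List[Record], by: str = "name") -> List[Record]:
    """Sort by sequence length (by='length') or alphabetically by name."""
    if by == "length":
        return sorted(records, key=lambda r: len(r[1]))
    return sorted(records, key=lambda r: r[0])
```

Keeping each option as a small independent function would let the GUI pop-up simply toggle which ones are applied, in a fixed order, to every record during conversion.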

More complex options:

Filter out (delete from output file) all sequences that contain a certain search string in the sequence or the sequence name
Filter out (delete sequences from output file) according to either a "positive list" or "negative list" of sequence names, that is, user can provide a list of sequence names and those sequences agreeing with the list are either kept in the output file and all others deleted, or deleted and all others kept.
Maybe a simple dereplication option could be integrated to remove sequences or sequence names that are identical to others (like in TaxI2 ... but probably this will be very memory intensive, to be decided)
Is there a way to include "fastmerge" into DNAconvert as one option, so if running in batch mode (specifying a directory instead of input file) the program first merges all files in the directory, and then does the conversion?
Allow specifying that sequences are proteins, and adjust a few options (e.g., allow deleting all * which in protein sequences are sometimes used instead of gaps).
In fasta files, now the entire sequence is written in one line which usually is the best, but we could allow for "wrapping" the sequence in several lines and allowing the user to specify how many characters per line
Add options to translate into proteins and back-translate from proteins to nucleotides.
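
Some of the more complex options could also start out quite small. The sketch below illustrates name-list filtering, search-string filtering, dereplication by identical sequence, and FASTA line wrapping. As before, all names and the `(name, sequence)` tuple are hypothetical, and the dereplication shown here compares sequences case-insensitively and keeps the first occurrence, which is only one possible definition of "identical".

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # hypothetical record type: (sequence name, sequence)


def filter_by_names(records: Iterable[Record], names: Iterable[str],
                    keep: bool = True) -> Iterator[Record]:
    """Positive list (keep=True) keeps listed names; negative list drops them."""
    listed = set(names)
    for name, seq in records:
        if (name in listed) == keep:
            yield name, seq


def filter_out_substring(records: Iterable[Record],
                         pattern: str) -> Iterator[Record]:
    """Drop records whose name or sequence contains the search string."""
    for name, seq in records:
        if pattern not in name and pattern not in seq:
            yield name, seq


def dereplicate(records: Iterable[Record]) -> Iterator[Record]:
    """Keep only the first record for each distinct sequence (case-insensitive).

    Memory grows with the number of distinct sequences, since each one is
    stored in the set of sequences already seen.
    """
    seen = set()
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            yield name, seq


def wrap_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format one FASTA record with at most `width` characters per line."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

Because the filters are generators, they can be chained over a stream of records without loading the whole file; only `dereplicate` needs to remember what it has seen, which is where the memory concern mentioned above would show up.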

I will keep adding additional ideas to this issue.

mvences added the no priority label Jan 19, 2022

mvences assigned necrosovereign Jan 19, 2022

mvences unassigned necrosovereign Jan 13, 2023

mvences assigned necrosovereign Mar 9, 2023
