Commit

improve-derep-page
telatin committed Jul 15, 2024
1 parent 59d0d1b commit adb2829
Showing 2 changed files with 34 additions and 3 deletions.
18 changes: 15 additions & 3 deletions docs/tools/derep.md
@@ -1,8 +1,15 @@

# seqfu derep

*derep* is one of the core subprograms of *SeqFu*, that allows
the dereplication of FASTA and FASTQ files.
*derep* is one of the core subprograms of *SeqFu*; it dereplicates FASTA and FASTQ files.
Dereplication, in R. C. Edgar's [words](https://drive5.com/usearch/manual/dereplication.html), is *A rather obscure name for finding the set of unique sequences. Or, equivalently, the process of finding duplicated (replicate) sequences.*

Put simply, given a FASTA file, each distinct sequence is printed only once in the output. A core feature is that the number of identical copies found in the original dataset is printed along with it.

Dereplication is a common step in the analysis of amplicon sequencing data: it reduces computational time, because each representative sequence is analysed only once, and some tools require dereplicated sequences as input
(e.g. [USEARCH](https://rcedgar.github.io/usearch12_documentation/)).
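
The idea can be illustrated with a minimal Python sketch. This is not SeqFu's implementation: the function name `dereplicate`, the `(name, sequence)` tuple format, the choice to keep the first name seen, and the ordering by decreasing abundance are all assumptions made for illustration.

```python
from collections import Counter


def dereplicate(records):
    """Collapse identical sequences and count how often each one occurs.

    `records` is an iterable of (name, sequence) tuples (hypothetical format).
    Returns (name, sequence, count) tuples, most abundant first; the name
    kept for each unique sequence is the first one seen in the input.
    """
    counts = Counter()
    first_name = {}
    for name, seq in records:
        counts[seq] += 1
        first_name.setdefault(seq, name)
    return [(first_name[seq], seq, n) for seq, n in counts.most_common()]


reads = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "ACGT"), ("r4", "ACGT")]
for name, seq, n in dereplicate(reads):
    # USEARCH-style header: the copy number is appended as ";size=N"
    print(f">{name};size={n}")
    print(seq)
```

Here the three copies of `ACGT` collapse into a single record annotated with `;size=3`, while `TTGA` is kept with `;size=1`.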



```text
Usage: derep [options] [<inputfile> ...]
@@ -27,6 +34,8 @@ Options:
## Size values

By default the program appends the number of identical sequences found to the sequence name, as USEARCH does.
For example, if a sequence is found 18,335 times in the input file, the output will contain a sequence with ";size=18335" in its name (unless `--ignore-size` is passed). The term "size" can be confusing, but it was adopted for compatibility with USEARCH/VSEARCH.

```
>seq.1;size=18335
CTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTACAGTATTCTTTTTGCCAGCGCTTAATTGCGCGGCGAAAAAACCTTACACACAGTGTTTTTTGTTATTACAAGAACTTTTGCTTTGGTCTGGACTAGAAATAGTTTGGGCCAGAGGTTTACTGAACTAAACTTCAATATTTATATTGAATTGTTATTTATTTAATTGTCAATTTGTTGATTAAATTCAAAAAATCTTCAAAACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAACGCAGC
@@ -51,7 +60,10 @@ CTTGGTCATTTAGAGGAAGTAAGAGAGAAATGTATAAACTCATAATTGACGAATGATAATTGTTATTGAAGTTTTTGTAA
If the input files were already dereplicated and carry the "size" of each cluster in the sequence names, `derep` will sum the
size values.

To our knowledge this feature is only available in SeqFu. It makes it possible to process multiple samples in parallel
and then generate a single "dereplicated file" at the end, propagating the correct cluster sizes.
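
As a rough sketch of what summing size values means, the snippet below merges records from two already dereplicated samples. It is an illustration under stated assumptions, not SeqFu's code: the helper names `split_size` and `merge_dereplicated`, the regular expression, and the rule that a record without a size annotation counts as 1 are all hypothetical.

```python
import re
from collections import defaultdict

# Matches a USEARCH-style annotation such as ";size=18335" (optionally ";"-terminated)
SIZE_RE = re.compile(r";size=(\d+);?")


def split_size(name):
    """Return (name without the size annotation, size).

    Assumption: a name without an annotation counts as a single sequence.
    """
    match = SIZE_RE.search(name)
    if match is None:
        return name, 1
    return SIZE_RE.sub("", name, count=1), int(match.group(1))


def merge_dereplicated(records):
    """Sum the cluster sizes of identical sequences across (name, sequence) records."""
    totals = defaultdict(int)
    for name, seq in records:
        _, size = split_size(name)
        totals[seq] += size
    return dict(totals)


sample_a = [("a1;size=10", "ACGT"), ("a2;size=3", "TTGA")]
sample_b = [("b1;size=5", "ACGT"), ("b2", "GGCC")]
merged = merge_dereplicated(sample_a + sample_b)
for seq, size in merged.items():
    print(seq, size)
```

The sequence `ACGT` appears in both samples with sizes 10 and 5, so the merged total is 15, which is the count a single dereplication of the pooled raw reads would have reported.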


## Screenshot

![Screenshot of "seqfu derep"](img/screenshot-derep.svg "SeqFu derep")
![Screenshot of "seqfu derep"]({{site.baseurl}}img/screenshot-derep.svg "SeqFu derep")
19 changes: 19 additions & 0 deletions scripts/makeFileFeatures.py
@@ -0,0 +1,19 @@
#!/usr/bin/env python
"""
A Python 3.6+ compatible script to generate FASTA or FASTQ files, randomly or using a template.
"""

import argparse
import random
import string
import sys

def main():
    args = argparse.ArgumentParser(description=__doc__)
    args.add_argument('output', help='Output file')
    args.add_argument('count', type=int, help='Number of sequences to generate')
    args.add_argument('--template', help='Template file')


if __name__ == '__main__':
    sys.exit(main())
