improve-derep-page

telatin · Jul 15, 2024 · adb2829
1 parent 59d0d1b
commit adb2829
Showing 2 changed files with 34 additions and 3 deletions.
diff --git a/docs/tools/derep.md b/docs/tools/derep.md
@@ -1,8 +1,15 @@
 
 # seqfu derep
 
-*derep*  is one of the core subprograms of *SeqFu*, that allows 
-the dereplication of FASTA and FASTQ files.
+*derep* is one of the core subprograms of *SeqFu*; it dereplicates FASTA and FASTQ files.
+Dereplication, in R. C. Edgar's [words](https://drive5.com/usearch/manual/dereplication.html), is *A rather obscure name for finding the set of unique sequences. Or, equivalently, the process of finding duplicated (replicate) sequences.*
+
+In simple terms, given a FASTA file, only the unique sequences are printed to the output. A core feature is that each unique sequence is annotated with the number of identical copies found in the original dataset.
+
+Dereplication is a common step in the analysis of amplicon sequencing data: it reduces computation time, because each representative sequence is analysed only once, and some tools require dereplicated sequences as input
+(e.g. [USEARCH](https://rcedgar.github.io/usearch12_documentation/)).
+
+
 
 ```text
 Usage: derep [options] [<inputfile> ...]
@@ -27,6 +34,8 @@ Options:
 ## Size values
 
 By default the program will add the number of identical sequences found to the sequence name, as USEARCH does:
+For example, if a sequence is found 18,335 times in the input file, the output will contain a sequence with ";size=18335" in its name (unless `--ignore-size` is passed). The term "size" can be confusing, but it was adopted for compatibility with USEARCH/VSEARCH.
+
 ```
 >seq.1;size=18335
 CTTGGTCATTTAGAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTACAGTATTCTTTTTGCCAGCGCTTAATTGCGCGGCGAAAAAACCTTACACACAGTGTTTTTTGTTATTACAAGAACTTTTGCTTTGGTCTGGACTAGAAATAGTTTGGGCCAGAGGTTTACTGAACTAAACTTCAATATTTATATTGAATTGTTATTTATTTAATTGTCAATTTGTTGATTAAATTCAAAAAATCTTCAAAACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAACGCAGC
@@ -51,7 +60,10 @@ CTTGGTCATTTAGAGGAAGTAAGAGAGAAATGTATAAACTCATAATTGACGAATGATAATTGTTATTGAAGTTTTTGTAA
 If the input files were already dereplicated printing the "size" of the cluster, `derep` will sum the
 size values.
 
+To our knowledge this feature is only available in SeqFu. It allows multiple samples to be dereplicated in parallel
+and then merged into a single "dereplicated file", propagating the correct cluster sizes.
+
 
 ## Screenshot
 
-![Screenshot of "seqfu derep"](img/screenshot-derep.svg "SeqFu derep")
+![Screenshot of "seqfu derep"]({{site.baseurl}}img/screenshot-derep.svg "SeqFu derep")
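
The behaviour documented above (collapse identical sequences, annotate each with `;size=N`, and sum sizes that are already present in the input) can be illustrated with a minimal Python sketch. This is not SeqFu's implementation: the function names `read_fasta`, `dereplicate` and `format_fasta` are hypothetical, sequences are compared as exact strings, and sorting by decreasing size is an assumption made for this example only.

```python
import re

# Matches a USEARCH-style abundance annotation such as ";size=18335".
SIZE_RE = re.compile(r";size=(\d+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Return (sequence, size) pairs, most abundant first (assumed order).

    A record already annotated with ";size=N" counts as N copies,
    any other record counts as one.
    """
    sizes = {}
    for header, sequence in records:
        match = SIZE_RE.search(header)
        copies = int(match.group(1)) if match else 1
        sizes[sequence] = sizes.get(sequence, 0) + copies
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


def format_fasta(clusters, prefix="seq"):
    """Yield FASTA records named like 'seq.1;size=7' (prefix is illustrative)."""
    for index, (sequence, size) in enumerate(clusters, start=1):
        yield ">{}.{};size={}\n{}".format(prefix, index, size, sequence)


if __name__ == "__main__":
    # One raw sample and one sample that was dereplicated beforehand.
    sample_a = [">read1", "ACGT", ">read2", "ACGT", ">read3", "TTTT"]
    sample_b = [">seq.1;size=5", "ACGT", ">seq.2;size=2", "GGGG"]
    merged = list(read_fasta(sample_a)) + list(read_fasta(sample_b))
    for record in format_fasta(dereplicate(merged)):
        print(record)
```

In this toy input `ACGT` occurs twice in the raw sample and carries `size=5` in the pre-dereplicated one, so the merged cluster is reported with size 7, which is the size propagation the documentation describes.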
diff --git a/scripts/makeFileFeatures.py b/scripts/makeFileFeatures.py
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+"""
+A Python 3.6+ compatible script to generate FASTA or FASTQ files, randomly or using a template.
+"""
+
+import argparse
+import random
+import string
+import sys
+
+def main():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument('output', help='Output file')
+    parser.add_argument('count', type=int, help='Number of sequences to generate')
+    parser.add_argument('--template', help='Template file')
+    parser.parse_args()  # validates the command line; sequence generation is not implemented in this commit
+
+if __name__ == '__main__':
+    sys.exit(main())
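
The script above only declares its command-line interface. As a rough illustration of the "randomly" mode mentioned in its docstring, random FASTA records could be produced along the following lines. Everything here is an assumption rather than the author's code: the helper names `random_dna` and `random_fasta`, the `length`, `seed` and `prefix` parameters, and the `seq.N` naming scheme are all hypothetical, and template-based generation is not covered.

```python
import random


def random_dna(length, rng):
    """Return a random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def random_fasta(count, length=100, seed=None, prefix="seq"):
    """Yield `count` FASTA records with random sequences.

    A private random.Random instance keeps the output reproducible
    for a given seed without touching the global random state.
    """
    rng = random.Random(seed)
    for index in range(1, count + 1):
        yield ">{}.{}\n{}".format(prefix, index, random_dna(length, rng))


if __name__ == "__main__":
    for record in random_fasta(3, length=20, seed=42):
        print(record)
```

Passing a fixed `seed` makes the generated file reproducible, which is convenient when the output is used as a test fixture.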