Documentation
=============
The latest version of the documentation is available on the UrQt website:
https://lbbe.univ-lyon1.fr/-UrQt-.html
UrQt (Unsupervised read Quality trimming) is a fast C++ program that trims
nucleotides of unreliable quality from NGS data in fastq or fastq.gz format
(the format is detected automatically). The default phred score encoding is
33 = Sanger (ASCII 33 to 126); use the option --phred to change it, for
example to 64 = Illumina 1.3 or 59 = Solexa/Illumina 1.0.
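To make the role of the encoding offset concrete, here is a minimal Python sketch of how a FASTQ quality string maps to phred scores. The function name decode_phred is hypothetical and is not part of UrQt; the sketch only covers the plain offset encodings (33 and 64) and leaves the Solexa scale aside.

```python
def decode_phred(quality, offset=33):
    """Convert a FASTQ quality string to a list of phred scores.

    `offset` plays the role of the value given to --phred
    (33 for Sanger, 64 for Illumina 1.3). Hypothetical helper,
    not UrQt code.
    """
    return [ord(char) - offset for char in quality]


print(decode_phred("II5!"))      # Sanger encoding       -> [40, 40, 20, 0]
print(decode_phred("hhT@", 64))  # Illumina 1.3 encoding -> [40, 40, 20, 0]
```

Decoding a file with the wrong offset shifts every score by a constant (31 between the two encodings above), which is why the encoding has to match the data.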
Single-end
----------
To use UrQt on a single-end fastq or fastq.gz file, run the following
command:
UrQt --in file.fastq --out file_trimmed.fastq
To prevent errors, the input file must be readable by UrQt and the output
file must be writable.
Paired-end
----------
To use UrQt on paired-end fastq or fastq.gz files, run the following
command:
UrQt --in file_R1.fastq --inpair file_R2.fastq --out file_R1_trimmed.fastq --outpair file_R2_trimmed.fastq
By default UrQt removes empty reads (i.e. reads with zero nucleotides of good
quality) and keeps the correspondence between the paired-end files.
We recommend using the option --gz to write the output in fastq.gz format,
which saves a significant amount of disk space.
Quality threshold
-----------------
The quality threshold parameter --t threshold defines the minimum phred score
above which a phred score is considered to be of "good quality".
By default UrQt uses a threshold of 5, but this can be changed with the option
--t threshold.
The classical definition of the quality threshold is obtained with --t 3.0103.
Note that UrQt won’t remove every base with a phred score below --t, but will
find the best segmentation between two segments of "bad quality" framing a
segment of "good quality".
This parameter is independent of the data and must be chosen according to the
goal of the analysis.
Example to set a threshold of 10:
UrQt --in file.fastq --out file_trimmed.fastq --t 10
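When choosing a value for --t, it can help to translate phred scores into error probabilities with the standard relation Q = -10 log10(p). The Python sketch below does this for a few thresholds. The helper names are hypothetical. Reading 3.0103 as the score at which the error probability equals 0.5 is an arithmetic observation (10 log10(2) is about 3.0103), not a statement taken from the UrQt documentation.

```python
import math


def phred_to_error_probability(phred):
    """Standard phred definition: p = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


def error_probability_to_phred(probability):
    """Inverse relation: Q = -10 * log10(p)."""
    return -10 * math.log10(probability)


# Print the error probability matching a few candidate --t values.
for threshold in (3.0103, 5, 10, 20):
    probability = phred_to_error_probability(threshold)
    print(f"--t {threshold}: error probability {probability:.4f}")
```

With this relation, the default threshold of 5 corresponds to an error probability of about 0.32, and a threshold of 10 to 0.1.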
Homopolymer trimming
--------------------
With the option --N letter you can define the poly-nucleotide to trim at the head
or tail of the sequences.
For letters not present in the standard IUB/IUPAC dictionary, UrQt will perform
QC trimming instead of poly-nucleotide trimming.
Example to trim polyA at the head and tail of the reads:
UrQt --in file.fastq --out file_trimmed.fastq --N A
Verbose mode
------------
By default UrQt displays minimal information. Use the option --v to display
all the options used, progress bars and the time left.
UrQt --in file.fastq --out file_trimmed.fastq --v
Multi-threading
---------------
By default UrQt uses 3 threads (the main thread plus two sub-threads) for a
total CPU usage of 100% of one processing unit. Use the option
--m thread_number to use more than one processing unit.
Each additional thread will use a new processing unit.
To run UrQt on 10 processing units:
UrQt --in file.fastq --out file_trimmed.fastq --m 10
Head or Tail
------------
By default (--pos both), UrQt first finds the best cut-point k_1 \in [1,l],
with l the size of the read, between a segment of "good quality" and a segment
of "bad quality" (tail trimming), and then finds the best cut-point
k_2 \in [1,k_1] between a segment of "bad quality" and a segment of "good
quality" (head trimming).
Instead of --pos both, one can use the parameter --pos head to only trim the head
of the reads or --pos tail to only trim the tail of the reads.
Example to trim the head and tail of the reads:
UrQt --in file.fastq --out file_trimmed.fastq --pos both
Example to trim only the head of the reads:
UrQt --in file.fastq --out file_trimmed.fastq --pos head
Example to trim only the tail of the reads:
UrQt --in file.fastq --out file_trimmed.fastq --pos tail
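The two-step search for k_1 and then k_2 can be illustrated with the Python sketch below. It is only an illustration of the order of the two searches: this section does not describe UrQt's actual scoring, so the sketch swaps in a deliberately simple stand-in score (the number of bases that fall on the "right" side of a cut). The function trim_bounds and its parameters are hypothetical, and unlike the description above the sketch allows a cut at position 0 so that a read can come out empty.

```python
def trim_bounds(phreds, threshold=5, pos="both"):
    """Return (start, end) such that phreds[start:end] is the kept segment.

    Stand-in score (NOT UrQt's): a cut is rewarded for every base at or
    above `threshold` placed in the "good" segment and for every base
    below it placed in a "bad" segment.
    """
    good = [q >= threshold for q in phreds]
    start, end = 0, len(phreds)

    if pos in ("both", "tail"):
        # k_1: phreds[:k_1] is "good", phreds[k_1:] is "bad".
        score = sum(1 for g in good if not g)  # k_1 = 0: everything is "bad"
        best = score
        end = 0
        for k in range(1, len(phreds) + 1):
            score += 1 if good[k - 1] else -1  # base k-1 joins the "good" side
            if score >= best:                  # ties keep the longer read
                best, end = score, k

    if pos in ("both", "head"):
        # k_2: phreds[:k_2] is "bad", phreds[k_2:end] is "good".
        score = sum(1 for g in good[:end] if g)  # k_2 = 0: nothing trimmed
        best = score
        for k in range(1, end + 1):
            score += -1 if good[k - 1] else 1  # base k-1 joins the "bad" side
            if score > best:                   # ties keep the longer read
                best, start = score, k

    return start, end


phreds = [2, 30, 30, 4, 30, 30, 2, 2]
start, end = trim_bounds(phreds, threshold=5)
print(phreds[start:end])  # [30, 30, 4, 30, 30]
```

Note that the base with phred 4 in the middle of the read is kept even though it is below the threshold: as with UrQt, the goal is the best segmentation of the read, not the removal of every low-quality base.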
Minimum trimmed read size
-------------------------
You can tell UrQt to only report reads longer than n nucleotides with the
option --min_read_size n.
Example to report only reads longer than 15 nucleotides:
UrQt --in file.fastq --out file_trimmed.fastq --min_read_size 15
Maximum number of nucleotides trimmed
-------------------------------------
You can constrain UrQt to remove no more than n nucleotides at the tail of the
reads with the option --max_tail_trim n.
The complementary option for the head of the reads is --max_head_trim n.
Example to trim at most 10 nucleotides at the head of the reads:
UrQt --in file.fastq --out file_trimmed.fastq --max_head_trim 10
Example to trim at most 10 nucleotides at the tail of the reads:
UrQt --in file.fastq --out file_trimmed.fastq --max_tail_trim 10
Example to trim at most 10 nucleotides at the head and at the tail of the
reads:
UrQt --in file.fastq --out file_trimmed.fastq --max_head_trim 10 --max_tail_trim 10
By default empty reads are removed from the output; you can keep them with the
option --r.
Classical filter
----------------
You can tell UrQt to only keep reads with a minimum of x percent of their length
above y phred with the two following options: --min_QC_length x and
--min_QC_phred y.
This filter is applied after the trimming procedure of UrQt.
For example, to retain only reads with a phred of more than 20 on 80% of their
length after trimming:
UrQt --in file.fastq --out file_trimmed.fastq --min_QC_length 80.0 --min_QC_phred 20
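The filter described above can be sketched in Python as follows. The function passes_classical_filter is hypothetical and mirrors the two options by name only. Treating a base exactly at the phred threshold as passing, and a read exactly at x percent as kept, are assumptions of this sketch, not documented UrQt behaviour.

```python
def passes_classical_filter(phreds, min_qc_length=80.0, min_qc_phred=20):
    """Keep a trimmed read if enough of its bases reach the phred threshold.

    `min_qc_length` is a percentage of the read length (like --min_QC_length),
    `min_qc_phred` is the phred threshold (like --min_QC_phred).
    Assumption: both comparisons are inclusive.
    """
    if not phreds:
        return False  # an empty read cannot satisfy the filter
    good_bases = sum(1 for q in phreds if q >= min_qc_phred)
    return 100.0 * good_bases / len(phreds) >= min_qc_length


print(passes_classical_filter([30] * 8 + [10] * 2))  # 80% of bases pass -> True
print(passes_classical_filter([30] * 7 + [10] * 3))  # 70% of bases pass -> False
```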
Nucleotides probability computation
-----------------------------------
By default, UrQt uses the EM algorithm to compute the proportion of the 4
different nucleotides in a read and to estimate the different cut-points.
You can tell UrQt to use a fixed proportion of 1/4 for each nucleotide with the
option --S.
To compute the proportion of each nucleotide on a sample of n reads, you
can use the option --s n.
These two options speed up the computation, but we recommend using the default
parameters (no parameter) for a better estimate of the cut-points in the reads.