HPC-CLUST v1.2 (5 February 2015)
by Joao F. Matias Rodrigues and Christian von Mering
Institute of Molecular Life Sciences, University of Zurich, Switzerland
Matias Rodrigues JF, Mering C von. HPC-CLUST: Distributed hierarchical clustering for very large sets of nucleotide sequences. Bioinformatics. 2013.
=================================================
Table of contents
1. Installation
2. HPC-CLUST(-MPI) usage instructions
3. File output
a) The merge log file
b) Obtaining OTU/clusters from the merge results
c) Obtaining OTU representatives
4. FAQ
5. Acknowledgments
6. History
-
HPC-CLUST is a set of tools designed to cluster large numbers (1 million or more) of
pre-aligned nucleotide sequences. HPC-CLUST clusters the sequences
using the Hierarchical Clustering Algorithm (HCA). Three linkage criteria
are currently implemented: single-linkage, complete-linkage, and average-linkage.
Four sequence distance functions are currently implemented:
- identity
gap-gap counting as match
- nogap
gap-gap being ignored
- nogap-single
like nogap, but consecutive gap-nogap's count as a single mismatch
- tamura
distance is calculated with the knowledge that transitions are more likely than
transversions
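To illustrate how the first three distance functions listed above differ, here is a
minimal Python sketch. It is not HPC-CLUST's implementation. The function names, the
handling of gap-gap columns inside a gap run, and the value returned when no columns
can be compared are assumptions made for this example. Each function returns an
identity fraction between 0 and 1.

```python
GAP = "-"


def _check(a, b):
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")


def identity(a, b):
    """Fraction of identical columns; a gap facing a gap counts as a match."""
    _check(a, b)
    if not a:
        return 0.0  # arbitrary choice for this sketch
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def nogap(a, b):
    """Like identity, but columns where both sequences have a gap are ignored."""
    _check(a, b)
    columns = [(x, y) for x, y in zip(a, b) if not (x == GAP and y == GAP)]
    if not columns:
        return 0.0  # arbitrary choice for this sketch
    return sum(1 for x, y in columns if x == y) / len(columns)


def nogap_single(a, b):
    """Like nogap, but a run of consecutive gap-vs-residue columns
    is counted as one mismatch (assumed reading of the description)."""
    _check(a, b)
    matches = 0
    compared = 0
    in_gap_run = False
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # ignored; assumed not to interrupt a gap run
        if x == GAP or y == GAP:
            if not in_gap_run:
                compared += 1  # the whole run costs a single mismatch
                in_gap_run = True
            continue
        in_gap_run = False
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0
```

For the pair "A---GT" and "ACTGGT", identity and nogap both give 0.5 (three mismatching
gap columns out of six), while nogap_single gives 0.75 because the three-column gap is
counted once.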
One advantage that HCA has over other algorithms is that instead of producing only the
clustering at a given threshold, it produces the set of merges happening at each threshold.
With this approach, the clusters can be very quickly computed for every threshold with
little extra computations. This approach also allows the plotting of the variation of
number of clusters with clustering threshold without requiring the clustering to be run
for each threshold independently.
Another feature of the way HPC-CLUST is implemented is that the single, complete, and average
linkage clusterings can be computed in a single go with little overhead.
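The three linkage criteria differ only in how the distance between two clusters is
derived from the pairwise distances between their members. The following Python sketch
shows the textbook definitions on a toy example. It is only an illustration and says
nothing about how HPC-CLUST organizes the computation internally; the function name
cluster_distance is hypothetical. Note that it works on distances (smaller means
closer), not on identities.

```python
def cluster_distance(cluster_a, cluster_b, dist, linkage):
    """Distance between two clusters under a given linkage criterion."""
    pairwise = [dist(x, y) for x in cluster_a for y in cluster_b]
    if not pairwise:
        raise ValueError("both clusters must be non-empty")
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all member pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: %r" % linkage)


# Toy example: points on a line, distance is the absolute difference.
a = [0, 1]
b = [3, 5]
dist = lambda x, y: abs(x - y)
single = cluster_distance(a, b, dist, "single")
complete = cluster_distance(a, b, dist, "complete")
average = cluster_distance(a, b, dist, "average")
```

Because all three criteria are computed from the same pairwise distances, the costly
step of computing those distances only has to be done once.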
-
For bugs and more information contact: Joao F. Matias Rodrigues <[email protected]>
1. INSTALLATION
=================================================
You can get the source code or binary packages at:
http://meringlab.org/software/hpc-clust/
or clone it with git:
git clone https://github.com/jfmrod/hpc-clust.git
i) To compile on Linux/Unix/MacOSX, type the following in the directory where you unpacked the package contents:
./bootstrap # only needed if you cloned the repository; requires the autotools/autoconf packages
./configure
make
make install
Alternatively, if you want the program to be installed to another
location instead of the default system wide /usr/local/ directory,
you can change the ./configure command to:
./configure --prefix=$HOME/usr
This would install the program binaries to a directory usr/bin inside
your home directory (i.e.: $HOME/usr/bin/hpc-clust), after you type
the command "make install".
2. HPC-CLUST, HPC-CLUST-MPI USAGE INSTRUCTIONS
=================================================
Two programs are provided in the HPC-CLUST package.
When running on a single machine you should use HPC-CLUST. It is multithreaded
so it can make use of several processors if you run it on a multi-core computer.
HPC-CLUST-MPI is the distributed computing version of
HPC-CLUST to be used on a cluster of computers. HPC-CLUST-MPI uses the MPI library
to automatically setup the communication between master and slave processes running
on multiple computing nodes.
a) USING HPC-CLUST
=================================================
To use HPC-CLUST you must first prepare your alignment file. You can accomplish this
using the INFERNAL aligner (http://infernal.janelia.org/) which aligns all sequences
against a consensus sequence model. Any other alignment such as multiple sequence
alignment will work as well, although, in practice, for very large numbers of sequences
it becomes difficult to compute a multiple sequence alignment.
HPC-CLUST takes the alignment file in aligned fasta format or non-interleaved Stockholm format.
It automatically detects the format from the first character of the file: if that
character is a '>', the file is read as fasta.
An example of non-interleaved stockholm format is:
SEQUENCE_ID1 ---ATGCAT---GCATGCATGC----... ...AT--AATT
SEQUENCE_ID2 ---ATCCAC---GCACGCATGC-AT-... ...AT--ATAA
...
...
An example of aligned fasta format is:
>SEQUENCE_ID1
---ATGCAT---GCATGCATGC----... ...AT--AATT
>SEQUENCE_ID2
---ATCCAC---GCACGCATGC-AT-... ...AT--ATAA
...
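The two input formats and the first-character detection rule can be sketched in Python
as follows. This is an illustrative reader, not the parser used by HPC-CLUST. The
function name parse_alignment is hypothetical, and skipping lines that start with '#'
or '//' in Stockholm files is an assumption based on the general Stockholm format.

```python
def parse_alignment(text):
    """Return a list of (sequence_id, aligned_sequence) pairs."""
    records = []
    if text.startswith(">"):
        # Aligned fasta: a header line followed by one or more sequence lines.
        seq_id, chunks = None, []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq_id is not None:
                    records.append((seq_id, "".join(chunks)))
                header = line[1:].split()
                seq_id, chunks = (header[0] if header else ""), []
            else:
                chunks.append(line)
        if seq_id is not None:
            records.append((seq_id, "".join(chunks)))
    else:
        # Non-interleaved Stockholm: one "ID SEQUENCE" pair per line.
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or line.startswith("//"):
                continue  # assumed: markup and terminator lines carry no sequence
            parts = line.split()
            records.append((parts[0], "".join(parts[1:])))
    return records


fasta_text = ">s1\n--AT\nGC\n>s2 some description\n--AA\nGC\n"
stockholm_text = "# STOCKHOLM 1.0\ns1 --ATGC\ns2 --AAGC\n//\n"
from_fasta = parse_alignment(fasta_text)
from_stockholm = parse_alignment(stockholm_text)
```

Both toy inputs yield the same two records, which makes it easy to check that an
alignment converted between the formats is unchanged.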
An example of how to call the INFERNAL aligner such that it produces a file in the
right format is:
cmalign -1 --sub --matchonly -o alignedseqs.sto bacterial-ssu-model.cm unalignedseqs.fasta
This would produce the file "alignedseqs.sto", containing the sequences from the unaligned
sequences file "unalignedseqs.fasta" aligned against the model "bacterial-ssu-model.cm".
If the aligned sequences are in the file "alignedseqs.sto", you can cluster them
by running the following command:
hpc-clust -sl true alignedseqs.sto
This will cluster the sequences using single-linkage clustering. The output will be written
to the file "alignedseqs.sto.sl"
To run all linkage methods at once:
hpc-clust -sl true -cl true -al true alignedseqs.sto
In this case, the single, complete and average linkage clusterings will be written to the
"alignedseqs.sto.sl", "alignedseqs.sto.cl", and "alignedseqs.sto.al", respectively.
You can change the number of CPUs that hpc-clust should use with the -ncpus <no_cpus> argument.
You can set the name of the output file with the -ofile <output_filename> argument.
The threshold at which distances should be kept is specified using the -t <threshold> argument.
The lower this value, the more distances will be stored, and the lower the threshold until which
the clustering can proceed. However, the more distances are stored, the higher the memory
requirement.
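To get a feeling for why the threshold matters for memory, the number of sequence pairs
grows quadratically with the number of sequences. The Python sketch below does the
arithmetic. The fraction of distances kept and the number of bytes per stored distance
are hypothetical inputs that you would have to estimate for your own data; neither is
stated in this README.

```python
def pair_count(n_sequences):
    """Number of distinct sequence pairs among n sequences."""
    return n_sequences * (n_sequences - 1) // 2


def stored_bytes(n_sequences, kept_fraction, bytes_per_distance):
    """Rough memory estimate for the distances that pass the threshold.

    kept_fraction and bytes_per_distance are assumptions supplied by the caller.
    """
    kept = int(pair_count(n_sequences) * kept_fraction)
    return kept * bytes_per_distance


pairs_for_one_million = pair_count(1000000)
example_bytes = stored_bytes(1000, 0.5, 8)
```

With one million sequences there are 499,999,500,000 pairs, so even a small change in
the fraction of distances kept has a large effect on memory use.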
b) USING HPC-CLUST-MPI
=================================================
To run the distributed version of HPC-CLUST, one needs to have the MPI library installed
before installing HPC-CLUST. If the MPI library is already set up, running HPC-CLUST-MPI
can be accomplished simply with:
mpirun -n 10 hpc-clust-mpi -sl true alignedseqs.sto
This will start 10 instances of hpc-clust-mpi (1 master + 9 slaves) and perform single linkage
clustering on the "alignedseqs.sto" file. The results can be found in the "alignedseqs.sto.sl"
file as explained in the "USING HPC-CLUST" section above.
The MPI version of HPC-CLUST is best used in conjunction with a queuing system such as the
Sun Grid Engine system. In that case, if every instance of HPC-CLUST-MPI runs on a
separate machine it can make use of the combined memory of all the machines.
The exact command for submitting a job to the queuing system will depend on your local cluster
configuration. An example of the command to submit a job to run hpc-clust-mpi on 10 nodes would be:
qsub -b y -cwd -pe orte 10 mpirun -n 10 hpc-clust-mpi -sl true alignedseqs.sto
The speed of the computation and memory usage can be further optimized by running more than one
thread per node, because the threads on a node share the sequence data. You can
increase the number of threads used on each node with the -nthreads <number_threads>
argument to hpc-clust-mpi.
3. FILE OUTPUT
=================================================
a) THE MERGE LOG FILE
=================================================
# seqsfile: aligned-archaea-seqs.sto
# OTU_count Merge_distance Merged_OTU_id1 Merged_OTU_id2
51094 1.0 37 38
51093 1.0 37 12176
51092 1.0 37 33214
51091 1.0 42 79
51090 1.0 42 2734
51089 1.0 42 2746
51088 1.0 42 2747
51087 1.0 42 2748
51086 1.0 42 2749
51085 1.0 42 2750
51084 1.0 42 2751
...
In the merge file output, each line indicates a merging event. For each merge event, the
number of OTUs existing after the merge, the identity distance at which the merge occurred,
and the OTU ids are shown. When a merge event occurs, the new OTU will have the same id
as the smallest OTU id merged. For example, if OTUs 37 and 12176 are merged, the new OTU
will have id 37.
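Because each line records the OTU count after a merge, the number of OTUs at any
threshold can be read straight from the merge log. The Python sketch below assumes the
four-column layout shown above, that lines starting with '#' are comments, and that
merges are listed in order of decreasing identity. The function names are hypothetical.

```python
def parse_merge_log(text):
    """Return a list of (otu_count, merge_value, otu_id1, otu_id2) tuples."""
    merges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # skip comments, blank lines, and incomplete lines
        merges.append((int(fields[0]), float(fields[1]),
                       int(fields[2]), int(fields[3])))
    return merges


def otu_count_at(merges, threshold):
    """Number of OTUs after applying every merge with value >= threshold."""
    if not merges:
        raise ValueError("merge log is empty")
    count = merges[0][0] + 1  # OTUs present before the first merge
    for otu_count, value, _, _ in merges:
        if value < threshold:
            break  # assumes the log is sorted by decreasing value
        count = otu_count
    return count


log_text = (
    "# seqsfile: toy.sto\n"
    "# OTU_count Merge_distance Merged_OTU_id1 Merged_OTU_id2\n"
    "4 1.0 0 1\n"
    "3 0.98 2 3\n"
    "2 0.95 0 2\n"
)
merges = parse_merge_log(log_text)
counts = [otu_count_at(merges, t) for t in (1.0, 0.97, 0.9)]
```

Evaluating otu_count_at over a range of thresholds gives the data needed to plot the
number of clusters against the clustering threshold without rerunning the clustering.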
b) OBTAINING OTUS/CLUSTERS FROM THE MERGE RESULTS
=================================================
Once the merge result files (.sl,.cl,.al) have been computed, you can produce the clusters
at different thresholds by running hpc-clust with the -makeotus option.
hpc-clust -makeotus alignedseqs.sto alignedseqs.sto.sl 0.97
This would generate the clusters produced at the 97% identity threshold from the single linkage
merge file "alignedseqs.sto.sl".
It is possible to obtain the OTU list in a format compatible with MOTHUR by using
the "-makeotus_mothur" option
hpc-clust -makeotus_mothur alignedseqs.sto alignedseqs.sto.sl 0.97
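Conceptually, producing the clusters at a threshold amounts to replaying the merge
events down to that threshold. The Python sketch below illustrates the idea on merge
events given as (OTU_count, merge_value, id1, id2) tuples. It is not the code behind
-makeotus and does not reproduce its output format. It assumes that merges are sorted
by decreasing identity and, as described for the merge log, that a merged OTU keeps the
smaller of the two ids. OTUs that never merge do not appear in the result, because the
merge events alone do not list them.

```python
def clusters_at(merges, threshold):
    """Group OTU ids by replaying merges with value >= threshold."""
    members = {}  # current OTU id -> list of original ids it contains
    for _, value, id1, id2 in merges:
        if value < threshold:
            break  # assumes merges are sorted by decreasing value
        keep, drop = min(id1, id2), max(id1, id2)
        group = members.setdefault(keep, [keep])
        # The absorbed OTU may itself be the result of earlier merges.
        group.extend(members.pop(drop, [drop]))
    return sorted(sorted(group) for group in members.values())


toy_merges = [(4, 1.0, 0, 1), (3, 0.98, 2, 3), (2, 0.95, 0, 2)]
at_97 = clusters_at(toy_merges, 0.97)
at_90 = clusters_at(toy_merges, 0.90)
```

At the 0.97 threshold the toy example yields two clusters, and lowering the threshold
to 0.90 joins them into one.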
c) OBTAINING OTU REPRESENTATIVES
=================================================
After clustering and producing an OTU file at a certain threshold it is possible to
obtain a fasta file containing a set of representatives for each OTU by using the
-makereps option.
hpc-clust -makereps alignedseqs.sto alignedseqs.sto.sl.0.97.otu
For each OTU, the sequence chosen to be the representative is the one having the smallest
average distance to all other sequences in the OTU.
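The selection rule described above picks what is often called the medoid of the OTU.
A minimal Python sketch of that rule follows. The function names and the simple
mismatch-fraction distance are assumptions for illustration; HPC-CLUST may use a
different distance function and break ties differently.

```python
def mismatch_fraction(a, b):
    """Toy distance: fraction of aligned columns that differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def pick_representative(sequences, dist):
    """Index of the sequence with the smallest average distance to the others."""
    if not sequences:
        raise ValueError("an OTU must contain at least one sequence")
    if len(sequences) == 1:
        return 0
    best_index, best_average = 0, None
    for i, a in enumerate(sequences):
        total = sum(dist(a, b) for j, b in enumerate(sequences) if j != i)
        average = total / (len(sequences) - 1)
        # Ties are resolved in favour of the first sequence (an assumption).
        if best_average is None or average < best_average:
            best_index, best_average = i, average
    return best_index


otu = ["AAAA", "AAAT", "AAAC", "TTTT"]
representative = otu[pick_representative(otu, mismatch_fraction)]
```

In this toy OTU, "AAAT" is chosen because it has the smallest average distance to the
other three sequences.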
4. FAQ
=================================================
Frequently asked questions and other documentation can be found at
http://meringlab.org/software/hpc-clust/faq.html
5. ACKNOWLEDGMENTS
=================================================
Sebastian Schmidt
6. HISTORY
=================================================
1.2.0 (5 February 2015)
- Average linkage computation now is calculated until the specified threshold
- Added -makeotus, -makeotus_mothur, and -makereps options to hpc-clust
1.1.1 (21 October 2014)
- Fixed bug in make-otus.sh when using fasta files
- Added make-otus-mothur.sh to create otu list files in mothur format
1.1.0 (5 June 2014)
- Added test suite
- Fixed bug in mpi version introduced after changing to long indices
1.0.2 (23 May 2014)
- Added support for aligned fasta format (automatically detects format based on whether the first character is '>')
- Added support for computing the clustering of more than 2 million sequences (--enable-longind option for configure command)
- Fixed issue with eutils not compiling with some gnu compiler versions (push_back error in ebasicarray.h)
1.0.1 (May 2014)
- Several bugs fixed in optimized distance calculation functions, sorting function (only with -O optimization), distributed computing (when distance threshold is strict or sparse databases)