Skip to content

Commit b0efa12

Browse files
James BarbettiJames Barbetti
James Barbetti
authored and
James Barbetti
committed
-alphabet parameter works (formerly is_DNA wasn't getting set to false
when alphabet was supplied, so it didn't work as it should have) (now it is, in DecentTreeOptions::processCommandLineOptions()). If alphabet is NOT yet set (e.g. is_DNA was false because -not-dna was supplied on the command-line but -alphabet was NOT supplied on the command-line), Sequences member functions, loadSequencesFrom{Fasta|Phylip}() now treat EVERY character as "acceptable" (because the alphabet will be determine retrospectively from what was seen in the alignment). Added command-line examples of the use of -alphabet and -not-dna.
1 parent 890dcda commit b0efa12

File tree

5 files changed

+95
-49
lines changed

5 files changed

+95
-49
lines changed

decenttree.cpp

Lines changed: 13 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -153,18 +153,22 @@ void showUsage() {
153153
* vary.
154154
* @param report_progress whether to display a progress bar
155155
* @param m (output) the distance matrix
156+
* @param alphabet the alphabet to use (if blank will be determined by
157+
* reading the sequences)
158+
*
156159
* @return true (if the sequences could be loaded, their distances calculated)
157160
* @return false (if an error occurred)
158161
*/
159162

160163
bool loadSequenceDistancesIntoMatrix(Sequences& sequences,
161-
const std::vector<char>& is_site_variant,
162-
bool report_progress, FlatMatrix& m) {
164+
const std::vector<char>& is_site_variant,
165+
bool report_progress, FlatMatrix& m,
166+
std::string& alphabet) {
163167
SequenceLoader loader(unknown_char, is_DNA, max_distance,
164168
sequences, correcting_distances,
165169
precision, compression_level, format,
166170
is_site_variant, report_progress);
167-
bool success = loader.loadSequenceDistances(m);
171+
bool success = loader.loadSequenceDistances(m, alphabet);
168172
return success;
169173
}
170174

@@ -449,7 +453,7 @@ bool prepInput(const std::string& fastaFilePath,
449453
if (loadMatrix) {
450454
if (!loadSequenceDistancesIntoMatrix
451455
(sequences, is_site_variant,
452-
reportProgress, m)) {
456+
reportProgress, m, alphabet)) {
453457
return false;
454458
}
455459
}
@@ -476,7 +480,7 @@ bool prepInput(const std::string& fastaFilePath,
476480
correcting_distances, precision,
477481
compression_level, format,
478482
is_site_variant, reportProgress);
479-
bool success = loader.writeDistanceMatrixToFile(numbered_names,
483+
bool success = loader.writeDistanceMatrixToFile(numbered_names, alphabet,
480484
distanceOutputFilePath);
481485
return success;
482486
}
@@ -675,9 +679,13 @@ class DecentTreeOptions {
675679
outputFilePath = "STDOUT";
676680
}
677681
isBannerSuppressed |= beSilent;
682+
if (!alphabet.empty() && is_DNA) {
683+
is_DNA = false;
684+
}
678685
if (alphabet.empty() && is_DNA) {
679686
alphabet = "ACGT";
680687
}
688+
681689
if (unknown_chars.empty()) {
682690
unknown_chars = ".~_-?N";
683691
unknown_char = 'N';

doco/Command_Line_Examples.md

Lines changed: 24 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -23,25 +23,31 @@ format) </li>
2323

2424
|Command Line|Explanation|
2525
|------------|-----------|
26-
|decenttree -in ../example/example.dist -std-out | Using the default distance matrix algorithm, infer a phylogenetic tree from the supplied distance matrix, and write the tree, in Newick format, to standard output.|
27-
|decenttree -in ../example/compressed_example.dist.gz -std-out | Using the same (default) distance matrix algorithm, infer a phylogenetic tree from the supplied distance matrix (which is in gz compressed format), and write the tree, in Newick format, to standard output.|
28-
|decenttree -in ../example/example.dist -t NJ-R-V -no-banner -out njrv.newick | Infer a phylogenetic tree, using the *r*apid, *v*ectorized version of the NJ algorithm (NJ-R-V), and write the tree, in Newick format, to the file njrv.newick. Suppress the decenttree banner text. |
29-
|decenttree -in ../example/example.dist -t NJ-R-V -q -out njrv.newick | Infer a phylogenetic tree, using the *r*apid, *v*ectorized version of the NJ algorithm (NJ-R-V), and write the tree, in Newick format, to the file njrv.newick. Don't write banner text or timing information to standard output (-q suppresses both). |
30-
|decenttree -in ../example/example.dist -t NJ-R-V -q -f 4 -out njrv_low_precision.newick | The same, but reduce the precision, of the inferred distances in the newick tree, to four digits, and write to a different file |
31-
|decenttree -in ../example/example.dist -t NJ-R-V -q -f 4 -out njrv_compressed.newick.gz | The same, but compress the output
26+
|decenttree -in ../example/example.dist<br> -std-out | Using the default distance matrix algorithm, infer a phylogenetic tree from the supplied distance matrix, and write the tree, in Newick format, to standard output.|
27+
|decenttree -in ../example/compressed_example.dist.gz<br> -std-out | Using the same (default) distance matrix algorithm, infer a phylogenetic tree from the supplied distance matrix (which is in gz compressed format), and write the tree, in Newick format, to standard output.|
28+
|decenttree -in ../example/example.dist<br> -t NJ-R-V -no-banner<br> -out njrv.newick | Infer a phylogenetic tree, using the *r*apid, *v*ectorized version of the NJ algorithm (NJ-R-V), and write the tree, in Newick format, to the file njrv.newick. Suppress the decenttree banner text. |
29+
|decenttree -in ../example/example.dist<br> -t NJ-R-V -q <br>-out njrv.newick | Infer a phylogenetic tree, using the *r*apid, *v*ectorized version of the NJ algorithm (NJ-R-V), and write the tree, in Newick format, to the file njrv.newick. Don't write banner text or timing information to standard output (-q suppresses both). |
30+
|decenttree -in ../example/example.dist<br> -t NJ-R-V -q -f 4<br>-out njrv_low_precision.newick | The same, but reduce the precision, of the inferred distances in the newick tree, to four digits, and write to a different file |
31+
|decenttree -in ../example/example.dist<br> -t NJ-R-V -q -f 4<br> -out njrv_compressed.newick.gz | The same, but compress the output
3232
(compression is implied by a .gz suffix). |
33-
|decenttree -in ../example/example.dist -t NJ-R-V -q -f 4 -out njrv_compressed.newick.gz | The same, but compress the output
33+
|decenttree -in ../example/example.dist<br> -t NJ-R-V -q -f 4<br> -out njrv_compressed.newick.gz | The same, but compress the output
3434
(compression is implied by a .gz suffix). |
35-
|decenttree -in ../example/example.dist -t NJ-R-V -q -f 4 -out njrv_compressed.newick.gz -nt 1 | The same, but use only a single thread. |
35+
|decenttree -in ../example/example.dist<br> -t NJ-R-V -q -f 4 -nt 1<br> -out njrv_compressed.newick.gz | The same, but use only a single thread. |
3636

3737
<h2>Alignments as inputs</h2>
3838

3939
|Command Line|Explanation|
4040
|------------|-----------|
41-
|decenttree decenttree -phylip ../example/example.phy -std-out -no-banner| Using the default distance matrix algorithm, infer a phylogenetic tree from the supplied phylip alignment file, and write the tree, in Newick format, to standard output.|
42-
|decenttree -phylip ../example/compressed_example.phy.gz -std-out -no-banner| As before, only this time the input is compressed. decenttree inspects the file to determine if it is compressed. The file name doesn't matter.|
43-
|decenttree -phylip ../example/example.phy -uncorrected -std-out -no-banner| Using the default distance matrix algorithm, infer a phylogenetic tree from the supplied phylip alignment file (using uncorrected Hamming distances), and write the tree, in Newick format, to standard output.|
44-
| decenttree -t NJ-R-V -phylip ../example/interleaved_example.phy -std-out | Using a vectorized (the V in NJ-R) branch-and-bound (the R in NJ-R-V) version of the Neighbour Joining algorithm, infer a phylogenetic tree from the supplied *interleaved* phylip alignment file, and write the tree, in Newick format, to standard output. |
41+
|decenttree decenttree -phylip ../example/example.phy<br> -std-out -no-banner| Using the default distance matrix algorithm, infer a phylogenetic tree from the supplied phylip alignment file, and write the tree, in Newick format, to standard output.|
42+
|decenttree -phylip ../example/compressed_example.phy.gz<br> -std-out -no-banner| As before, only this time the input is compressed. decenttree inspects the file to determine if it is compressed. The file name doesn't matter.|
43+
|decenttree -phylip ../example/example.phy -uncorrected<br> -std-out -no-banner| Using the default distance matrix algorithm, infer a phylogenetic tree from the supplied phylip alignment file (using uncorrected Hamming distances), and write the tree, in Newick format, to standard output.|
44+
| decenttree -phylip ../example/interleaved_example.phy<<br>> -t NJ-R-V -std-out | Using a vectorized (the V in NJ-R) branch-and-bound (the R in NJ-R-V) version of the Neighbour Joining algorithm, infer a phylogenetic tree from the supplied *interleaved* phylip alignment file, and write the tree, in Newick format, to standard output. |
45+
| decenttree -phylip ../example/3x3.phy<br> -dist-out 3x3_as_dna.dist<br> -q -no-out | Calculate a distance matrix for a dummy 3 taxon, 3 nucleotide input, which has A, C, or T, but not G sites. By default decenttree assumes that sites can be A, C, G, or T.
46+
| decenttree -phylip ../example/3x3.phy<br> -dist-out 3x3_not_dna.dist<br> -q -no-out -not-dna | Calculate a distance matrix using a 3-character alphabet, implied by input, which has A, C, or T, but not G. |
47+
| decenttree -phylip ../example/3x3.phy<br> -dist-out 3x3_five_chars.dist<br> -q -no-out -alphabet CATGU | Calculate a distance matrix using a 5 character alphabet |
48+
| cat 3x3*.dist | Display the distance matrices output by the previous three examples, showing the affect of alphabet size on corrected distances |
49+
50+
4551

4652
Todo: Provide example command-lines for:
4753
<ul>
@@ -63,8 +69,8 @@ Todo: Provide example command-lines for:
6369

6470
|Command Line|Explanation|
6571
|------------|-----------|
66-
| decenttree -t NONE -phylip ../example/example.phy -aln-out interleaved_example.phy -no-out -no-banner | Convert an un-interleaved phylip alignment file into an interleaved one |
67-
| decenttree -t NONE -phylip ../example/example.phy -aln-out interleaved_example.phy.gz -no-out -no-banner | Convert an un-interleaved phylip alignment file into an interleaved one, and zip the output file as it is generated (zipping is implied by the .gz extension) |
72+
| decenttree -phylip ../example/example.phy<br> -aln-out interleaved_example.phy<br> -t NONE -no-out -no-banner | Convert an un-interleaved phylip alignment file into an interleaved one |
73+
| decenttree -phylip ../example/example.phy<br> -aln-out interleaved_example.phy.gz<br> -t NONE -no-out -no-banner | Convert an un-interleaved phylip alignment file into an interleaved one, and zip the output file as it is generated (zipping is implied by the .gz extension) |
6874

6975

7076
<h2>Distance Matrices as outputs</h2>
@@ -78,8 +84,8 @@ Todo:
7884

7985
|Command Line|Explanation|
8086
|------------|-----------|
81-
|decenttree -phylip ../example/example.phy -no-banner -dist example.dist -out-format upper -t NONE -no-out | Using the default distance matrix algorithm, infer a distance matrix, and write the distance matrix (in upper-triangle Phylip format) to the file, example.dist. -t NONE says *not* to use a phylogenetic inference algorithm. If no out-format parameter is provided the default is square. |
82-
|decenttree -phylip ../example/example.phy -no-banner -uncorrected -dist uncorrected.dist.gz -out-format upper.gz -t NONE -no-out | The same, only this time uncorrected distances are to be calculated, the distance matrix is to be written to uncorrected.dist.gz, and is to be compressed with gzip|
87+
|decenttree -phylip ../example/example.phy<br> -no-banner <br> -dist example.dist -out-format upper <br>-t NONE -no-out | Using the default distance matrix algorithm, infer a distance matrix, and write the distance matrix (in upper-triangle Phylip format) to the file, example.dist. -t NONE says *not* to use a phylogenetic inference algorithm. If no out-format parameter is provided the default is square. |
88+
|decenttree -phylip ../example/example.phy<br> -no-banner -uncorrected<br> -dist uncorrected.dist.gz -out-format upper.gz<br> -t NONE -no-out | The same, only this time uncorrected distances are to be calculated, the distance matrix is to be written to uncorrected.dist.gz, and is to be compressed with gzip|
8389

8490
<h2>Benchmarking</h2>
8591
In these examples, simulated_1k.fa.gz is a fasta file containing a simulated alignment with 1000 taxa, and a sequence length of 10,000,
@@ -92,5 +98,5 @@ generated using IQTree2 and gzip, via the (MacOS/Unix) commands:
9298

9399
|Command Line|Explanation|
94100
|------------|-----------|
95-
|decenttree -t BENCHMARK -fasta ../example/simulated_1k.fa.gz -no-out -nt 4 | Benchmark each of the available algorithms, one after another, with 1 through 4 threads of execution (multi-threading is only supported if decenttree is compiled with Open MP, and running on a multi-CPU architecture).|
96-
|decenttree -t BENCHMARK -fasta ../example/simulated_1k.fa.gz -no-out | Benchmark each of the available algorithms, one after another, with 1 through *n* threads (where *n* is the number of available CPUs) (only if decenttree is compiled with Open MP, and running on a multi-CPU architecture).|
101+
|decenttree -fasta ../example/simulated_1k.fa.gz<br> -t BENCHMARK -no-out -nt 4 | Benchmark each of the available algorithms, one after another, with 1 through 4 threads of execution (multi-threading is only supported if decenttree is compiled with Open MP, and running on a multi-CPU architecture).|
102+
|decenttree -fasta ../example/simulated_1k.fa.gz<br> -t BENCHMARK -no-out | Benchmark each of the available algorithms, one after another, with 1 through *n* threads (where *n* is the number of available CPUs) (only if decenttree is compiled with Open MP, and running on a multi-CPU architecture).|

example/3x3.phy

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
3 3
2+
cat CAT
3+
tac TAC
4+
act ACT

sequence.cpp

Lines changed: 50 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -162,7 +162,7 @@ double correctedDistance(double char_dist, double chars_compared,
162162
if (max_distance<=d) {
163163
d = max_distance;
164164
}
165-
return d;
165+
return d;
166166
}
167167

168168
/**
@@ -367,9 +367,11 @@ bool Sequences::loadSequencesFromFasta(const std::string& fastaFilePath,
367367
}
368368
size_t line_num = 1;
369369
std::vector<int> in_alphabet;
370-
in_alphabet.resize(256, 0);
371-
for (auto alpha=alphabet.begin(); alpha!=alphabet.end(); ++alpha) {
372-
in_alphabet[*alpha] = 1;
370+
in_alphabet.resize(256, alphabet.empty() ? 1 : 0);
371+
if (!alphabet.empty()) {
372+
for (auto alpha=alphabet.begin(); alpha!=alphabet.end(); ++alpha) {
373+
in_alphabet[*alpha] = 1;
374+
}
373375
}
374376
for (; !in.eof(); ++line_num) {
375377
std::string line;
@@ -483,9 +485,11 @@ bool Sequences::loadSequencesFromPhylip(const std::string& phylipFilePath,
483485
}
484486
size_t line_num = 1;
485487
std::vector<int> in_alphabet;
486-
in_alphabet.resize(256, 0);
487-
for (auto alpha=alphabet.begin(); alpha!=alphabet.end(); ++alpha) {
488-
in_alphabet[*alpha] = 1;
488+
in_alphabet.resize(256, alphabet.empty() ? 1 : 0);
489+
if (!alphabet.empty()) {
490+
for (auto alpha=alphabet.begin(); alpha!=alphabet.end(); ++alpha) {
491+
in_alphabet[*alpha] = 1;
492+
}
489493
}
490494
size_t num_sequences = 0;
491495
size_t sequence_length = 0;
@@ -812,28 +816,46 @@ void SequenceLoader::setUpSerializedData() {
812816

813817
/**
814818
* @brief Determine the number of states (needed for correcting distances)
819+
* @param alphabet - on entry the alphabet supplied by the user
820+
* - on exit, the same (if one was supplied, minus
821+
* any duplicates), or the distinct characters found
822+
* in the input, if is_DNA is false, and a blank
823+
* alphabet was supplied.
815824
*/
816-
void SequenceLoader::getNumberOfStates() {
825+
void SequenceLoader::getNumberOfStates(std::string& alphabet) {
817826
num_states = 0.0;
818827
if (is_DNA) {
819828
num_states = 4;
820829
}
821-
else
822-
{
830+
else {
823831
std::vector<size_t> char_counts;
824832
char_counts.resize(256, 0);
825833
auto char_count_array = char_counts.data();
826-
const unsigned char* start_buffer = reinterpret_cast<unsigned char*>(buffer);
827-
const unsigned char* end_buffer = start_buffer + rank * seqLen;
828-
for (const unsigned char* scan=start_buffer; scan<end_buffer; ++scan) {
829-
++char_count_array[*scan];
834+
if (!alphabet.empty()) {
835+
//Determine what the distinct characters in the user-supplied
836+
//alphabet were.
837+
for (const unsigned char ch : alphabet) {
838+
++char_count_array[ch];
839+
}
830840
}
841+
else
842+
{
843+
//Determine what the distinct characters in the input were
844+
//(if we were told, it wasn't DNA but we weren't supplied an alphabet)
845+
const unsigned char* start_buffer = reinterpret_cast<unsigned char*>(buffer);
846+
const unsigned char* end_buffer = start_buffer + rank * seqLen;
847+
std::cout << "scanning: " << rank << " sequences of length " << seqLen << std::endl;
848+
for (const unsigned char* scan=start_buffer; scan<end_buffer; ++scan) {
849+
++char_count_array[*scan];
850+
}
851+
}
852+
alphabet.clear();
831853
for (int i=0; i<256; ++i) {
832-
num_states += (char_counts[i]==0) ? 0 : 1;
833-
}
834-
if (0<char_counts[unknown_char]) {
835-
--num_states;
836-
}
854+
if (i!=unknown_char && 0<char_counts[i]) {
855+
alphabet.push_back(static_cast<char>(i));
856+
}
857+
}
858+
num_states = alphabet.length();
837859
}
838860
}
839861

@@ -937,6 +959,8 @@ double SequenceLoader::getDistanceBetweenSequences(intptr_t row, intptr_t col) c
937959
* @brief Calculate pairwise distances, for every pair of sequences
938960
* @param m - reference to the FlatMatrix into which distances are to be
939961
* written.
962+
* @param alphabet - alphabet to use (if blank, will be determined
963+
* from the sites of the sequences)
940964
* @return true - always
941965
* @return false - in theory, could return this if it failed (but it won't)
942966
* (though it might throw an out of memory exception, I suppose)
@@ -947,13 +971,14 @@ double SequenceLoader::getDistanceBetweenSequences(intptr_t row, intptr_t col) c
947971
* (we may hope) "load balance better" (because the small bits of
948972
* work that won't take so long, occur for the later rows).
949973
*/
950-
bool SequenceLoader::loadSequenceDistances(FlatMatrix& m) {
974+
bool SequenceLoader::loadSequenceDistances(FlatMatrix& m, std::string& alphabet) {
951975
m.setSize(rank);
952976
for (intptr_t row=0; row<rank; ++row) {
953977
m.addCluster(sequences[row].getName());
954978
}
955979
setUpSerializedData();
956-
getNumberOfStates();
980+
getNumberOfStates(alphabet);
981+
957982
#if USE_PROGRESS_DISPLAY
958983
const char* task = report_progress ? "Calculating distances": "";
959984
progress_display progress( rank*(rank-1)/2, task );
@@ -984,6 +1009,8 @@ bool SequenceLoader::loadSequenceDistances(FlatMatrix& m) {
9841009
* calculating a distance matrix on a memory-challenged machine).
9851010
* @param numbered_names - true if names are to be numbered,
9861011
* false if the existing names are to be used
1012+
* @param alphabet - can be passed in (if blank, alphabet will be
1013+
* determined from the sequence character data)
9871014
* @param filePath - the path of the (phylip format) output file
9881015
* @return true - on success
9891016
* @return false - on failure (error messages will be written to std::cerr)
@@ -997,9 +1024,10 @@ bool SequenceLoader::loadSequenceDistances(FlatMatrix& m) {
9971024
* are parallelized over the columns to be outputted in that row
9981025
*/
9991026
bool SequenceLoader::writeDistanceMatrixToFile(bool numbered_names,
1027+
std::string& alphabet,
10001028
const std::string& filePath) {
10011029
setUpSerializedData();
1002-
getNumberOfStates();
1030+
getNumberOfStates(alphabet);
10031031

10041032
#if USE_PROGRESS_DISPLAY
10051033
bool isTriangle = contains(output_format,"lower") ||

0 commit comments

Comments
 (0)