fastANI is not working #96

Biofarmer · 2020-11-07T22:46:12Z

Hi Dr. Olm,

I am using version 2.6.2, when running dRep compare with --S_algorithm fastANI, there is an error:

Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
3355 primary clusters made
Step 3. Perform secondary clustering
Running 8999390 fastANI comparisons- should take ~ 1200.5 min
Traceback (most recent call last):
File "/install/software/anaconda3.6.b/bin/dRep", line 33, in
controller.parseArguments(args)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 146, in parseArguments
self.compare_operation(**vars(args))
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 91, in compare_operation
drep.d_workflows.compare_wrapper(kwargs['work_directory'],**kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_workflows.py", line 96, in compare_wrapper
drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 80, in d_cluster_wrapper
data_folder, wd=workDirectory, **kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 215, in cluster_genomes
ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 921, in compare_genomes
df = run_pairwise_fastANI(genome_list, working_data_folder, **kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 1096, in run_pairwise_fastANI
exe_loc = drep.get_exe('fastANI')
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/init.py", line 100, in get_exe
assert False, "{0} isn't working- make sure its installed".format(name)
AssertionError: fastANI isn't working- make sure its installed

May I ask that should I install the fastANI separately? If yes, how can I make sure the dRep can call it? We already have FastANI 1.1 installed.

Thanks

MrOlm · 2020-11-08T18:44:12Z

Yes, you need to install it separately and need to make sure it's in your system PATH. dRep bonus test --check_dependencies will tell you which programs dRep is able to confirm are working correctly.

Biofarmer · 2020-11-08T22:20:49Z

Hi Dr. Olm, thanks for reply. I just runt the code and see:

mash.................................... all good (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. !!! ERROR !!! (location = None)

May I ask that if I want to use fastANI for S_algorithm with dRep compare or dereplicate, do all the three "ERROR" stools need to be installed? For sure, fastANI yes. However about the other two? If yes, are the following links correct to download?
For ANIcalculator, is it here: https://ani.jgi.doe.gov/html/download.php?
For nsimscan, is it here: https://github.com/abadona/qsimscan ?

Many thanks for reply.

MrOlm · 2020-11-09T16:50:32Z

Hello,

Nope, to use fastANI you only need to correct the fastANI option. The other two are used for other S_algorithms

Biofarmer · 2020-11-09T17:58:13Z

Hi Dr. Olm, it is great to know that. I have only fixed the fastANI option, and now looks good.
mash.................................... all good (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. all good (location = /install/software/fastani/1.32/fastANI)

I have run dRep compare with fastANI, and find that the flag of fastANI '--minFraction' is 0, but the default is 0.2. May I ask that the '--minFraction' is indeed set to 0? or do I misunderstand anything?

11-09 17:51 DEBUG running cluster 1
11-09 17:51 DEBUG /install/software/fastani/1.32/fastANI --ql /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/tmp/genomeList --rl /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/tmp/genomeList -o /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/fastANI_out_tqcrxtgsrr --matrix -t 50 --minFraction 0 tqcrxtgsrr

Thanks

MrOlm · 2020-11-09T18:00:35Z

Hello,

Yes, the --minFraction is not used at the stage of the program being referenced in the log. It's used at a later stage when the clustering happens. The default is indeed 0.2, not 0.

Biofarmer · 2020-11-09T18:08:49Z

Hi, thanks for reply. It is good to have the default value used. I will see it when the running is finished.
Thanks again.

Biofarmer · 2020-11-11T14:24:12Z

Hi Dr. Olm, I have run with drep compare with fastANI with 50 processors, and it shows in the log that 1200.5 min is needed as below.
11-09 17:51 INFO 3355 primary clusters made
11-09 17:51 INFO Step 3. Perform secondary clustering
11-09 17:51 INFO Running 8999390 fastANI comparisons- should take ~ 1200.5 min

However, it has been running already 45 h (2700 min), and I can see that the job is still running, however may I ask that the time indicated in the log is just estimated and the real time may be longer than that?

Thanks

MrOlm · 2020-11-11T16:08:47Z

Hello,

Yes it is indeed just an estimate. Depending on random factors like genome size, the specifics of the genomes being compared, and the number of cores being used, the actual time can vary a bit. The parallelization also doesn't scale perfectly; 50 cores will take more than twice as long as 25 cores, but the simple formula used to estimate how long the job will take doesn't take this into account.

Best,
Matt

Biofarmer · 2020-11-13T13:51:25Z

Hi Dr. Olm,
It is good that I have finished my run with 20,000 genomes, with -pa 0.9 -sa 0.95 -nc 0.30 -cm larger.

However there is an error for a cluster:

Running 8999390 fastANI comparisons- should take ~ 1200.5 min
CRITICAL ERROR WITH SECONDARY CLUSTERING CODE hdzdcskeha; SKIPPING
CRITICAL ERROR WITH PRIMARY CLUSTER 592; TRYING AGAIN
CRITICAL ERROR WITH SECONDARY CLUSTERING CODE geyhqnsofz; SKIPPING
DOUBLE CRITICAL ERROR AGAIN WITH PRIMARY CLUSTER 592; SKIPPING

In the Cdb.csv, there is no cluster 592 record. May I ask why?

For dRep compare Step 4. Analyze part, there are many lines as below in my error log:

Plotting MDS plot
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:130: RuntimeWarning : divide by zero encountered in double_scalars
old_stress = stress / dis
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:125: RuntimeWarning : invalid value encountered in double_scalars
if(old_stress - stress / dis) < eps:
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:130: RuntimeWarning : invalid value encountered in double_scalars
old_stress = stress / dis
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:125: RuntimeWarning : invalid value encountered in double_scalars
if(old_stress - stress / dis) < eps:
...

May I ask why? I think these issues are without any impact on the cluster, and the files in data_tables are totally fine for subsequence analysis, right?

The value (0.30) I set with -nc is applied to --minFraction in fastANI? Because I need the alignment fraction (AF) for subsequent analysis. Where can I check the value used for --minFraction in fastANI?

Sorry about many questions and thanks for your time.

MrOlm · 2020-11-13T18:27:33Z

Hello,

This is because FastANI is a somewhat finicky program, and it didn't like something about one of those genomes. Try and run fastANI on those genomes outside of dRep to see what the problem is. dRep runs FastANI like below, where genomeList is a text file listing the genome locations.

$ fastANI --ql genomeList --rl genomeList -o fastANI_out_pqbxqtvbuy --matrix -t 8 --minFraction 0 pqbxqtvbuy

Alternatively you could run dRep again in debug mode (add -d), which will store all logs that FastANI creates to show what the error is (stored in log/cmd_lods/). I must warn you though, often fastANI crashes without an easily interpretable error, which makes it difficult to know what the problem is.

Yeah don't worry about those; it's just matplotlib yelling about unimportant things that might be problems but really aren't.
No; -nc is applied during the clustering step. You can see the AF values for all comparisons in the Ndb.csv file in the data_tables folder. During the clustering step if the AF is less than -nc, dRep will replace the actual ANI value with a 0. This 0 isn't stored in Ndb.csv so that you can see what the reported ANI is, when the clustering happens it will be treated as a 0.

Best,
Matt

Biofarmer · 2020-11-13T20:31:45Z

Hi Dr. Olm,

Thanks for reply. May I further ask based on your reply?

Where can I find the genomes that are listed in Cluster 592? I just see one genomeList lin tmp folder. Then I can run fastANI independently to see the problem.
That's great, thanks.
Based on my understanding, the AF used for secondary clustering is from -nc, right? When running fastANI, the --minFraction is set to 0, and the AF is then calculated based one the output of fastANI for each comparsion. Then dRep used -nc to do secondary clustering based on the calculated AF, right?

Thanks.

MrOlm · 2020-11-16T16:51:09Z

Hello,

It should just be the genomes that are in the datatable Bdb.csv (which contains all filtered genomes) but are not in Cdb.csv (which has the cluster assignments of all genomes).
That is mostly correct. You are correct and AF is reported and calculated by fastANI with --minFraction set to 0. dRep uses ANI to do the secondary clustering, however, not AF. All genomes with an AF below the nc have their ANI set to 0 by dRep, though.

-Matt

Biofarmer · 2020-11-17T14:00:11Z

Hi Dr. Olm,

Thank you for reply.

Yes, but all the genomes are in the datatable Bdb.csv, right? How can I know which genomes that belong to Cluster 592? I think I only need to re-run fastANI on this cluster to check where the problem is, am I right?
Thanks, it makes sense to me now.

Sorry may I introduce a new question:
3.1 Is the final number of clusters from column of "secondary_cluster" in Cdb? I have checked the unique cluster number from "secondary_cluster" is 4546, and the unique cluster from "primary_cluster" is 3354. I think this is expected to have higher cluster number from "secondary_cluster", right?
3.2 I run code with -pa 0.95 -sa 0.95 -nc 0.30 -cm larger for the same genomes. I am just curious that how many clusters from Mash with 0.95, which results in 4081 clusters. I understand the problem of Mash is to underestimate the distance between the incomplete genomes and split same-species genomes into multiple clusters, which means higher number of clusters from Mash itself. However, why the number I got with -pa 0.95 from Mash (the first step of dRep) is lower than the clusters I got from dRep with -sa 0.95? Or I cannot compare them directly like this?

Thanks

MrOlm · 2020-11-17T23:50:34Z

Hello,

The genomes that ARE in Bdb.csv and ARE NOT in Cdb.csv will be the ones where fastANI failed.

3.1) Yes, your understanding is correct.

3.2) I'm a little confused by the question but can explain a few things. The purpose of primary clustering (done with Mash, which is fast) is to reduce the number of comparisons that have to be done with fastANI (which is slow). The only genomes that are ever compared with FastANI are those that are in the same primary cluster by Mash; genomes in different primary clusters will never be in the same secondary clusters. Mash is not very accurate however (especially with incomplete genomes), so that's why by default dRep casts a wide net with Primary clustering, and then makes secondary clusters with fastANI.

If you'd like to compare Mash and fastANI directly, you'll need to run dRep once with the command -pa 0.95 --SkipMash and once with the command -sa 0.95 --SkipSecondary. I'll warn you though, this second command will be very slow (because it will not have the primary clustering speeding things up).

A bit more detail on this can be found here: https://drep.readthedocs.io/en/latest/choosing_parameters.html

-Matt

Biofarmer · 2020-11-18T15:36:50Z

Hi Dr. Olm,

sorry, I understood now, and found the cluster 592 only has one genome, and run the fastANI as below:
fastANI --ql genomeList --rl genomeList -o fastANI_out_592 --matrix -t 50 --minFraction 0

It run without error:

Reference = [bin.7.fna]
Query = [bin.7.fna]
Kmer size = 16
Fragment length = 3000
Threads = 50
ANI output file = fastANI_out_592

INFO [thread 0], skch::main, Count of threads executing parallel_for : 50
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24
INFO [thread 18], skch::main, ready to exit the loop
INFO [thread 1], skch::main, ready to exit the loop
INFO [thread 24], skch::main, ready to exit the loop
INFO [thread 17], skch::main, ready to exit the loop
INFO [thread 47], skch::main, ready to exit the loop
INFO [thread 23], skch::main, ready to exit the loop
INFO [thread 41], skch::main, ready to exit the loop
INFO [thread 19], skch::main, ready to exit the loop
INFO [thread 33], skch::main, ready to exit the loop
INFO [thread 43], skch::main, ready to exit the loop
INFO [thread 39], skch::main, ready to exit the loop
INFO [thread 12], skch::main, ready to exit the loop
INFO [thread 2], skch::main, ready to exit the loop
INFO [thread 15], skch::main, ready to exit the loop
INFO [thread 46], skch::main, ready to exit the loop
INFO [thread 44], skch::main, ready to exit the loop
INFO [thread 21], skch::main, ready to exit the loop
INFO [thread 34], skch::main, ready to exit the loop
INFO [thread 7], skch::main, ready to exit the loop
INFO [thread 6], skch::main, ready to exit the loop
INFO [thread 22], skch::main, ready to exit the loop
INFO [thread 38], skch::main, ready to exit the loop
INFO [thread 30], skch::main, ready to exit the loop
INFO [thread 25], skch::main, ready to exit the loop
INFO [thread 3], skch::main, ready to exit the loop
INFO [thread 42], skch::main, ready to exit the loop
INFO [thread 4], skch::main, ready to exit the loop
INFO [thread 45], skch::main, ready to exit the loop
INFO [thread 29], skch::main, ready to exit the loop
INFO [thread 36], skch::main, ready to exit the loop
INFO [thread 14], skch::main, ready to exit the loop
INFO [thread 10], skch::main, ready to exit the loop
INFO [thread 9], skch::main, ready to exit the loop
INFO [thread 32], skch::main, ready to exit the loop
INFO [thread 5], skch::main, ready to exit the loop
INFO [thread 37], skch::main, ready to exit the loop
INFO [thread 28], skch::main, ready to exit the loop
INFO [thread 8], skch::main, ready to exit the loop
INFO [thread 31], skch::main, ready to exit the loop
INFO [thread 49], skch::main, ready to exit the loop
INFO [thread 13], skch::main, ready to exit the loop
INFO [thread 35], skch::main, ready to exit the loop
INFO [thread 16], skch::main, ready to exit the loop
INFO [thread 26], skch::main, ready to exit the loop
INFO [thread 48], skch::main, ready to exit the loop
INFO [thread 27], skch::main, ready to exit the loop
INFO [thread 40], skch::main, ready to exit the loop
INFO [thread 20], skch::main, ready to exit the loop
INFO [thread 11], skch::main, ready to exit the loop
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 118228
INFO [thread 0], skch::Sketch::index, unique minimizers = 115202
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 112667) ... (14, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.161255 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.0048895 sec
INFO [thread 0], skch::main, Time spent post mapping : 1.052e-06 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished

However, there is nothing in fastANI_out_592 output. I think there should be one comparison inside.

I run another cluster also with only one genome that was good with dRep, there is one comparison in fastANI_out, and the log is same as above. I am confused, is it due to the bin file itself? Could you have a look where the problem is?

Thanks

MrOlm · 2020-11-19T01:03:48Z

...interesting. Are you sure that cluster 592 only has a single genome in it? The size of Bdb.csv (measured with the command cat Bdb.csv | wc -l) is only 1 line longer than Cdb.csv (assessed with the command cat Cdb.csv | wc -l)? The reason I ask is that secondary comparisons are not usually run on primary clusters that only have a single genome in them, so it's a bit surprising to me.

Thank you,
Matt

Biofarmer · 2020-11-19T10:28:50Z

Hi Dr. Olm,

Yes, there is only one genome difference between Bdb.csv (200001) and Cdb.csv (20000). I run 20000 genomes together, plus the title line, that's why the Bdb.csv is with 200001 lines. Indeed, I have so many fastANI output with only one genome in /data/fastANI_files/ folder.

The top 10 lines from Cdb.csv:

genome | secondary_cluster | threshold | cluster_method | comparison_algorithm | primary_cluster
-- | -- | -- | -- | -- | --
GCF_000711905.1_ASM71190v1_genomic.fna | 1_0 | 0.05 | average | fastANI | 1
GCF_002002665.1_ASM200266v1_genomic.fna | 2_0 | 0.05 | average | fastANI | 2
GCF_902158755.1_SYNGOMJ08_genomic.fna | 3_0 | 0.05 | average | fastANI | 3
GCF_000008665.1_ASM866v1_genomic.fna | 4_1 | 0.05 | average | fastANI | 4
GCF_000734035.1_ASM73403v1_genomic.fna | 4_1 | 0.05 | average | fastANI | 4
GCF_001610875.1_ASM161087v1_genomic.fna | 5_0 | 0.05 | average | fastANI | 5
GCF_001654775.1_ASM165477v1_genomic.fna | 6_0 | 0.05 | average | fastANI | 6
GCF_002212725.1_ASM221272v1_genomic.fna | 7_0 | 0.05 | average | fastANI | 7
GCF_006740685.1_ASM674068v1_genomic.fna | 8_0 | 0.05 | average | fastANI | 8
GCF_014078535.1_ASM1407853v1_genomic.fna | 9_0 | 0.05 | average | fastANI | 9
GCF_013407165.1_ASM1340716v1_genomic.fna | 10_0 | 0.05 | average | fastANI | 10

The version of fastANI I used is 1.32, and the version of dRep is 2.6.2, and run dRep bonus test --check_dependencies

Loading work directory
Checking dependencies
mash.................................... all good        (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good        (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good        (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!!   (location = None)
prodigal................................ all good        (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good        (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!!   (location = None)
fastANI................................. all good        (location = /install/software/fastani/1.32/fastANI)

Sorry about the new problem, and thanks a million for your time.

MrOlm · 2020-11-23T16:41:42Z

Hello,

OK interesting- thank you for pointing this out to me. You are indeed correct and apologies for not understanding how my own program works.

Unfortunately though I don't think I'll be able to help with this problem, since it seems to be an issue with FastANI. You can try and post an issue on the FastANI page (https://github.com/ParBLiSS/FastANI/issues), but they don't seem to be answered very often.

However, in future versions of dRep I will have it assign genomes that are in a primary cluster by themselves to a cluster without needing to run fastANI.

Best,
Matt

Biofarmer · 2020-11-23T17:53:10Z

Hi Dr. Olm,

So do you mean that secondary comparisons are run on primary clusters that only have a single genome in them, and what I have run and the output is correct, right?
As for the problem I met with the cluster 592, I should ask on the FastANI page, right? I will do that. However, I think if only one genome is in this cluster, actually I do not need to run secondary comparison, and just take it as its cluster 592. Am I right?

Thanks

MrOlm · 2020-11-23T18:04:09Z

Hello,

Yes, you are correct on both points. FastANI failing doesn't matter in this case, because no matter what that genome would be cluster 592.

-Matt

Biofarmer · 2020-11-23T18:06:05Z

Hi Dr. Olm,

Thank you very much for reply fast and now it does make sense to me.

Thanks again.

Biofarmer · 2020-11-23T18:28:45Z

Hi Dr. Olm,

FYI, I just searched the answer to my fastANI problem, and I found it here ParBLiSS/FastANI#21. I also changed the fragLen like 200, this genome worked with fastANI.

A quite small question about the command of fastANI in dRep, like below, the last 10 letters are just to distinguish the output of fastANI in fastANI_files folder, making no any effect on the process of fastANI itself, right?

$ fastANI --ql genomeList --rl genomeList -o fastANI_out_pqbxqtvbuy --matrix -t 8 --minFraction 0 pqbxqtvbuy

Thanks

MrOlm · 2020-11-23T18:31:13Z

Interesting. Thank you for pointing this out to me.

Yes, the last 10 letters are randomly generated and just used to distinguish output files.

-Matt

Biofarmer · 2020-11-23T18:40:04Z

Hi Dr. Olm,

Thank you for reply. Anyway, this is not a problem in the case of one genome in a cluster.
That's great, will run all my genomes.

Thanks again.

Biofarmer · 2020-12-03T12:57:09Z

Hi Dr. Olm,

One question from the output of run 30,000 genome together, with

dRep compare output -p 50 -pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI -g genome_path.txt

There is no error reported for Step 1. Cluster

***************************************************
    ..:: dRep compare Step 1. Cluster ::..
***************************************************

Loading genomes from a list
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
4362 primary clusters made
Step 3. Perform secondary clustering
Running 21641432 fastANI comparisons- should take ~ 2887.0 min
Step 4. Return output
***************************************************
    ..:: dRep compare Step 2. Bonus ::..
***************************************************

However, in the Cdb, one genome is missing compared to Bdb (there are 30,000 genomes). I also checked the unique primary cluster of Cdb, it is correct with 4362. I also checked the number of comparison in Mdb, it is correct with 59,999, however, there is no record in Ndb.

Do you know why? may be still the FastANI issue as discussed above. In addition, is it possible to check which cluster this genome is from?
If I leave this genome out, is the output of current clustering in Cdb still reliable?

Thanks
Wang

Biofarmer · 2020-12-08T19:36:52Z

Hi Dr. Olm,

I updated my question, can you look at it when you are available?
Thanks

MrOlm · 2020-12-08T21:08:14Z

Hello,

This is an interesting problem and not something that I've encountered before. I don't see any way that the problem genome could impact anything else, so I believe the rest of the clustering output is still correct.

I think this is probably an issue with FastANI, though it's not clear to me what the issue is. The only way I can think of to tell which primary cluster the missing genome is in is to re-run the program with the --debug flag. This will create an additional file called CdbF.csv in the data_tables folder, which contains the clustering information for all genomes with primary clustering only. You can also run this additional step with the --SkipSecondary flag to make it run much faster if you'd like.

Best,
Matt

Biofarmer · 2020-12-08T21:19:56Z

Hi Matt,

Thanks for reply. It is good that all rest clusters are fine.

Best,
Wang

Biofarmer · 2020-12-11T11:20:49Z

Hi Matt,

When I run ~50,000 genomes, and there is an error:

File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/pandas/core/reshape/reshape. py", line 124, in __init__
    raise ValueError('Unstacked DataFrame is too big, '
ValueError: Unstacked DataFrame is too big, causing int32 overflow

I think this is due to the large number of genomes, right? Is there any other way to solve this problem without reducing the genome number? In addition, I can shorten my genome names with half length.

Thanks
Wang

MrOlm · 2021-06-28T16:42:28Z

Yes your understanding correctly. And yes you can still use the clustering as reliable even when warnings are present; I ignore the warnings in my own research

-Matt

Biofarmer · 2021-06-28T19:52:20Z

Thanks a million, Matt. Best, Wang

Biofarmer · 2021-07-12T15:49:07Z

Hi Matt, I have sent one email to your gmail obtained from your github homepage, which is about the paper https://www.nature.com/articles/s41467-017-02018-w, may I ask whether you know it? Thanks, Wang

Biofarmer · 2021-07-21T18:10:10Z

Hi Matt,
I just found a failure to make plot in dRep (v2.6.2) compare Step 4. Analyze as below:

making plots 1, 2, 3, 4
Plotting primary dendrogram
Failed to make plot #1: Image size of 1000x607960 pixels is too large. It must be less than 2^16 in each direction.

I think this does not matter for the clustering results, right?
Thanks
Wang

MrOlm · 2021-07-21T18:11:47Z

Yes, that will not impact clustering results.

In response to your previous question, yes I know about that paper. It uses methods from the pre-inStrain days. Do you have a question about it?

Biofarmer · 2021-07-21T18:16:35Z

Hi Matt,
Thank you for your fast reply, that's great it does not impact the clustering results.

As for the other question, I have a question about the metadata for four samples, and sent it to [email protected], do you receive it?

Thanks
Wang

Biofarmer · 2021-07-21T18:21:28Z

Yes, that will not impact clustering results.

In response to your previous question, yes I know about that paper. It uses methods from the pre-inStrain days. Do you have a question about it?

Sorry to follow this question, as indicating to fail to make plot for primary cluster, why I still find 'Primary_clustering_dendrogram.pdf' in the folder of 'figures'?

MrOlm · 2021-07-21T18:22:34Z

The figure in that .pdf will just be broken; when the .pdf creation crashes in the middle of creation (as it did in this case) it still leaves some junk behind.

I don't believe I got the email. Maybe try re-sending to [email protected]

Biofarmer · 2021-07-21T18:26:49Z

The figure in that .pdf will just be broken; when the .pdf creation crashes in the middle of creation (as it did in this case) it still leaves some junk behind.

I don't believe I got the email. Maybe try re-sending to [email protected]

So, the Primary_clustering_dendrogram.pdf in this case is not complete, right? and this failure to make plot is due to too large number of genomes?

Thanks

Biofarmer · 2021-07-22T18:09:40Z

The figure in that .pdf will just be broken; when the .pdf creation crashes in the middle of creation (as it did in this case) it still leaves some junk behind.
I don't believe I got the email. Maybe try re-sending to [email protected]

So, the Primary_clustering_dendrogram.pdf in this case is not complete, right? and this failure to make plot is due to too large number of genomes?

Thanks

Hi Matt, do I understand correctly? Thanks, Wang

MrOlm · 2021-07-23T17:06:11Z

I think the figure it not complete, but I'm not entirely sure what happens when matplotlib encounters that problem

Biofarmer · 2021-07-23T17:09:17Z

I think the figure it not complete, but I'm not entirely sure what happens when matplotlib encounters that problem

Hi Matt, okay, thanks. I think this failure to make plot is due to too large number of genomes, right? When I run less genomes, this plot was plotted without failure.

MrOlm · 2021-07-23T17:19:43Z

Yes exactly

Biofarmer · 2021-07-23T17:24:15Z

Good to learn. Anyway, the unaffected clustering results are the most important. Many thanks.

Biofarmer · 2022-04-29T07:23:51Z

Hi Matt, May I ask a question about the memory used? If I run 30000 genomes at once (~90G in size) with 40 threads, how much memory and storage (considering the intermediate files) has to be used approximately? Thanks

brittanysuttner · 2022-09-15T12:07:39Z

Hello, I think I found a bug in dRep related to running it with the -l option less than 3000. I am trying to derep a collection of viral genomes (many of them are less than 2k) so I run derep as such:
> dRep dereplicate --S_algorithm fastANI -nc .5 -l 1000 -d -sa 0.99 -N50W 0 -sizeW 1 --ignoreGenomeQuality --clusterAlg single --multiround_primary_clustering drep_pig_virus_sa99 -g genomes/*ffn

This results in the majority of my genomes failing the fastANI step and I think its becasue the default FragmentLength for fastANI is 3000 and that should be lowered when running dRep with -l 1000 (maybe FragmentLength should be set equal to whatever is set for -l if its less than 3000?). Or there should be an option in the dRep command to set options in the fastANI command.
Thanks!

MrOlm · 2022-09-15T16:05:25Z

Hello,

Can you please confirm you’re running the most up-to-date versions of dRep and fastANI? I remember this being an issue that I believe I fixed

-Matt

brittanysuttner · 2022-09-16T10:32:19Z

Yes, I am running the most up-to-date versions *I think of dRep (v3.4.0) and fastANI (v1.33).
I installed dRep using pip install so it just installed that version by default.

MrOlm · 2022-09-19T16:18:35Z

Ahhh OK I see. I used to have an option to set the fragment length, but it ended up being a problem with different version of inStrain. I will try and address this in the next dRep update- thank you for bringing it to my attention

-MO

Gian77 · 2023-01-10T18:37:21Z

Hello,
I know this is closed but I am having this weird output from dRep. dRep is telling me

...
...
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 298, in secondary_clustering
    ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 367, in compare_genomes
    df = drep.d_cluster.external.run_pairwise_fastANI(genome_list, working_data_folder, **kwargs)
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/external.py", line 93, in run_pairwise_fastANI
    exe_loc = drep.get_exe('fastANI')
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/__init__.py", line 100, in get_exe
    raise ValueError("{0} isn't working- make sure its installed".format(name))
ValueError: fastANI isn't working- make sure its installed

but fstani is there

dendropy                  4.5.2              pyh3252c3a_0    bioconda
drep                      3.4.0              pyhdfd78af_0    bioconda
expat                     2.4.4                h295c915_0    anaconda
fastani                   1.33                 h0fdf51a_0    bioconda
fftw                      3.3.9                h27cfd23_1

and

mash.................................... all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/mash)
nucmer.................................. all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/nucmer)
checkm.................................. all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/checkm)
ANIcalculator........................... !!! ERROR !!!   (location = None)
prodigal................................ all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/prodigal)
centrifuge.............................. !!! ERROR !!!   (location = None)
nsimscan................................ !!! ERROR !!!   (location = None)
fastANI................................. !!! ERROR !!!   (location = /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI)

what is happening? Maybe doesnt like the fastANI version?
Thanks much!
G.

MrOlm · 2023-01-10T18:40:46Z

Hi @Gian77 -

What happens when you try the command /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI -h?

FastANI sometimes requires its own dependencies, which could be why this isn't working

-Matt

Gian77 · 2023-01-10T18:59:46Z

Hey @MrOlm

Thanks fro your fast answer on this.

This, what comes out, what's libgsl.so.25?

(drep) [benucci@dev-intel18 DAS_Tool_bins]$ /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI -h
/mnt/home/benucci/anaconda2/envs/drep/bin/fastANI: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory

Gian

MrOlm · 2023-01-10T19:05:19Z

Hi Gian,

This is what I suspected. This has something to do with C++ and how fastANI is written, which is something I don't fully understand.

Someone else was able to fix the problem here - #146

And a discussion of the problem on fastANI's gitHub can be found here - ParBLiSS/FastANI#96

Best,
Matt

Gian77 · 2023-01-10T19:10:25Z

Thanks much! I dug this out...

I was able to make fastANI to work by installing the right gsl libraries in the conda environemnt

conda install -c conda-forge gsl=2.7=he838d99_0

dRep worked after this fix.

Gian

AlessioMilanese · 2023-09-27T15:54:40Z

Following on the discussion.

Main problem: fastANI -h return 1, and I cannot run dRep.

I have fastANI installed and it works. I try to run:

$ fastANI -q test/2838728.3.fna -r test/2838728.3.fna -o test
$ cat test
test/2838728.3.fna      test/2838728.3.fna      100     353     357

But if I do:

$ fastANI -h
$ echo $?
1

Which basically means it return an error when you call it with -h.

dRep cannot proceed, even if fastANI is working, error:

    exe_loc = drep.get_exe('fastANI')
  File "/home/ec2-user/Software/miniconda3/envs/drep/lib/python3.7/site-packages/drep/__init__.py", line 100, in get_exe
    raise ValueError("{0} isn't working- make sure its installed".format(name))
ValueError: fastANI isn't working- make sure its installed

Tool version:

drep                      3.4.5              pyhdfd78af_0    bioconda
fastani                   1.1                  h4ef8376_0    bioconda

MrOlm · 2023-09-27T18:53:58Z

Hi @AlessioMilanese -

Thanks for tracking down this issue and posting here- I very much appreciate it! I will address in the next dRep update (or please feel free to submit a pull request if you're inclined). In the meantime, it seems that updating fastANI to new versions fixes this fastANI behavior.

Thanks again!
Matt

AlessioMilanese · 2023-09-29T08:16:10Z

Thanks for the fast response Matt. For some reasons I cannot install a newer version of FastANI, but I will figure it out.

I will address in the next dRep update (or please feel free to submit a pull request if you're inclined)

I'm not sure what is the best solution. I think I would add an option (like -I) to skip the part where you check if the tool is installed and working.

MrOlm · 2023-09-29T17:05:56Z

Hi @AlessioMilanese - yeah that would be a fine solution, or (if possible) just doing a different check for fastANI that doesn't return a 0 error code

MrOlm closed this as completed Nov 8, 2020

MrOlm reopened this Sep 27, 2023

MrOlm added the bug label Mar 1, 2024

fastANI is not working #96

fastANI is not working #96

Comments

Biofarmer commented Nov 7, 2020 • edited Loading

MrOlm commented Nov 8, 2020

Biofarmer commented Nov 8, 2020

MrOlm commented Nov 9, 2020

Biofarmer commented Nov 9, 2020 • edited Loading

MrOlm commented Nov 9, 2020

Biofarmer commented Nov 9, 2020

Biofarmer commented Nov 11, 2020

MrOlm commented Nov 11, 2020

Biofarmer commented Nov 13, 2020

MrOlm commented Nov 13, 2020

Biofarmer commented Nov 13, 2020

MrOlm commented Nov 16, 2020

Biofarmer commented Nov 17, 2020 • edited Loading

MrOlm commented Nov 17, 2020

Biofarmer commented Nov 18, 2020 • edited Loading

MrOlm commented Nov 19, 2020

Biofarmer commented Nov 19, 2020 • edited Loading

MrOlm commented Nov 23, 2020

Biofarmer commented Nov 23, 2020

MrOlm commented Nov 23, 2020

Biofarmer commented Nov 23, 2020

Biofarmer commented Nov 23, 2020

MrOlm commented Nov 23, 2020

Biofarmer commented Nov 23, 2020

Biofarmer commented Dec 3, 2020 • edited Loading

Biofarmer commented Dec 8, 2020

MrOlm commented Dec 8, 2020

Biofarmer commented Dec 8, 2020

Biofarmer commented Dec 11, 2020 • edited Loading

MrOlm commented Jun 28, 2021

Biofarmer commented Jun 28, 2021

Biofarmer commented Jul 12, 2021

Biofarmer commented Jul 21, 2021 • edited Loading

MrOlm commented Jul 21, 2021

Biofarmer commented Jul 21, 2021

Biofarmer commented Jul 21, 2021 • edited Loading

MrOlm commented Jul 21, 2021

Biofarmer commented Jul 21, 2021 • edited Loading

Biofarmer commented Jul 22, 2021

MrOlm commented Jul 23, 2021

Biofarmer commented Jul 23, 2021

MrOlm commented Jul 23, 2021

Biofarmer commented Jul 23, 2021 • edited Loading

Biofarmer commented Apr 29, 2022 • edited Loading

brittanysuttner commented Sep 15, 2022

MrOlm commented Sep 15, 2022

brittanysuttner commented Sep 16, 2022

MrOlm commented Sep 19, 2022

Gian77 commented Jan 10, 2023

MrOlm commented Jan 10, 2023

Gian77 commented Jan 10, 2023

MrOlm commented Jan 10, 2023

Gian77 commented Jan 10, 2023 • edited Loading

AlessioMilanese commented Sep 27, 2023

MrOlm commented Sep 27, 2023

AlessioMilanese commented Sep 29, 2023

MrOlm commented Sep 29, 2023

Biofarmer commented Nov 7, 2020 •

edited

Loading

Biofarmer commented Nov 9, 2020 •

edited

Loading

Biofarmer commented Nov 17, 2020 •

edited

Loading

Biofarmer commented Nov 18, 2020 •

edited

Loading

Biofarmer commented Nov 19, 2020 •

edited

Loading

Biofarmer commented Dec 3, 2020 •

edited

Loading

Biofarmer commented Dec 11, 2020 •

edited

Loading

Biofarmer commented Jul 21, 2021 •

edited

Loading

Biofarmer commented Jul 21, 2021 •

edited

Loading

Biofarmer commented Jul 21, 2021 •

edited

Loading

Biofarmer commented Jul 23, 2021 •

edited

Loading

Biofarmer commented Apr 29, 2022 •

edited

Loading

Gian77 commented Jan 10, 2023 •

edited

Loading