riplike: script is too slow compared to LANL tool #12

Open
ArtPoon opened this issue Jun 25, 2019 · 5 comments

ArtPoon commented Jun 25, 2019

On my Mac at home (admittedly a slow machine):

Elzar:poplars artpoon$ time python3 riplike.py ref_genomes/K03455.fasta test.out 
K03455|HIVHXB2CG

real	2m35.830s
user	2m31.287s
sys	0m0.750s

This same query takes about 7 seconds on the LANL server.

First I'm going to see if the bootstrap step can be made faster.

ArtPoon commented Jun 25, 2019

Turning off bootstrap sampling makes a big difference:

Elzar:poplars artpoon$ time python3 riplike.py -nrep 0 ref_genomes/K03455.fasta test.out
K03455|HIVHXB2CG

real	0m6.893s
user	0m6.578s
sys	0m0.256s

ArtPoon commented Jun 25, 2019

Replacing random.randint with random.random seems to have made a big difference:

Elzar:poplars artpoon$ time python3 riplike.py  ref_genomes/K03455.fasta test.out
K03455|HIVHXB2CG

real	0m35.956s
user	0m35.614s
sys	0m0.278s
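
For reference, a minimal sketch of the kind of swap described above. The helper names are hypothetical and not taken from riplike.py; int() truncation is an assumption chosen here so that every index stays equally likely and in range.

```python
import random


def sample_indices_randint(seqlen):
    """Draw seqlen column indices with replacement via random.randint."""
    return [random.randint(0, seqlen - 1) for _ in range(seqlen)]


def sample_indices_random(seqlen):
    """Same kind of draw via random.random.

    random.random() returns a float in [0.0, 1.0), so truncating
    random.random() * seqlen with int() gives an index in range(seqlen).
    """
    return [int(random.random() * seqlen) for _ in range(seqlen)]
```

Both helpers return a list of seqlen indices; only the call used to draw each index differs.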

kwade4 reopened this Jul 31, 2019

kwade4 commented Jul 31, 2019

riplike is very slow on Windows (possibly due to the MAFFT version). I think pdist and bootstrap could be made faster.

pdist time = 22 seconds
bootstrap time = 101 seconds.

Bootstrap
def bootstrap(s1, s2, reps=100):
...
    
    for rep in range(reps):
        result = []
        bootstrap = [random.randint(0, seqlen-1) for _ in range(seqlen)]        
        b1 = ''.join([s1[i] for i in bootstrap])
        b2 = ''.join([s2[i] for i in bootstrap])
        yield b1, b2

The string joining in bootstrap seems slow and may not be necessary. pdist could be modified to use a list.

NumPy

Using NumPy arrays in pdist and bootstrap (see changes in commit 2d12ba5) seems to improve bootstrap performance, while pdist stays about the same.

pdist time = 24 seconds
bootstrap time = 5 seconds

ArtPoon commented Jul 31, 2019

I think that the implementation of random.randint in Python is exceedingly slow; try using random.random in combination with round instead.
Also, we could pass a vector of differences (binary state) instead and resample that, which would avoid a lot of unnecessary calculation.
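
A rough sketch of the difference-vector idea, under some assumptions: the two sequences are already aligned and of equal, non-zero length, and the distance is a plain proportion of mismatches (gap and ambiguity handling, which the real pdist may do, is left out). The function names are hypothetical.

```python
import random


def difference_vector(s1, s2):
    """Return 1 where the aligned sequences differ and 0 where they match.

    Assumes s1 and s2 have equal length; gaps are not treated specially.
    """
    return [int(a != b) for a, b in zip(s1, s2)]


def bootstrap_pdist(diffs, reps=100):
    """Yield one resampled p-distance per replicate.

    Each replicate draws len(diffs) positions with replacement and
    averages the 0/1 states, so no sequence strings are rebuilt.
    Assumes diffs is non-empty.
    """
    n = len(diffs)
    for _ in range(reps):
        yield sum(diffs[int(random.random() * n)] for _ in range(n)) / n
```

The comparison of s1 and s2 happens once, up front; each replicate then only sums integers.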

ArtPoon commented Jul 31, 2019

See #22

kwade4 added a commit that referenced this issue Aug 9, 2019
-riplike: modified bootstrapping to use random.choices() (#12)
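
The referenced commit is not shown in this thread, so the following is only a guess at what bootstrapping with random.choices() could look like. The function name and the choice to resample (s1, s2) column pairs are assumptions; random.choices itself needs Python 3.6 or later and a non-empty population.

```python
import random


def bootstrap_columns(s1, s2, reps=100):
    """Yield reps bootstrap replicates of the aligned columns of s1 and s2.

    Each replicate is a list of (s1_char, s2_char) pairs drawn with
    replacement, the same length as the original alignment.
    Assumes s1 and s2 are aligned, equal in length, and non-empty.
    """
    columns = list(zip(s1, s2))
    for _ in range(reps):
        yield random.choices(columns, k=len(columns))
```

Sampling the pairs directly does all the draws for a replicate in one call and skips the two ''.join steps.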