Representative sequences after deduplication not consistent between different runs #440

Closed
zxl124 opened this issue Oct 16, 2020 · 6 comments


@zxl124

zxl124 commented Oct 16, 2020

When I run dedup on the same BAM files twice, even with the same --random-seed, the returned deduped BAM files have different sets of reads. This has a very small but non-zero effect on downstream analysis. Would it be possible to have completely consistent results between runs when random seed is the same?

For context, the input BAM was coordinate-sorted, and generated using STAR. dedup was run with --random-seed 100 --spliced-is-unique --multimapping-detection-method=NH.

@TomSmithCGAT
Member

If you set the environment variable PYTHONHASHSEED before running dedup, the output should be consistent.

Since Python 3.3, the hashes used for e.g. dictionary keys are non-deterministic by default: they are 'salted' with an unpredictable random value that changes between interpreter runs: https://docs.python.org/3.4/reference/datamodel.html#object.__hash__. I understand this is to prevent DoS attacks.
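To illustrate the salting described above, here is a minimal sketch that hashes the same string in two fresh interpreter processes with PYTHONHASHSEED fixed. The helper name `hash_in_fresh_interpreter` and the example string are made up for illustration; nothing here is UMI-tools code.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, hash_seed=None):
    """Return hash(text) as computed by a newly started Python process.

    hash_seed=None leaves hash randomization at its default (a random salt);
    any integer is passed to the child through PYTHONHASHSEED.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean state
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With a fixed seed, two separate processes agree on the hash.
seeded_a = hash_in_fresh_interpreter("ACGTACGT", hash_seed=100)
seeded_b = hash_in_fresh_interpreter("ACGTACGT", hash_seed=100)
print(seeded_a == seeded_b)

# Without a seed, each process draws its own salt, so these usually differ.
unseeded_a = hash_in_fresh_interpreter("ACGTACGT")
unseeded_b = hash_in_fresh_interpreter("ACGTACGT")
print(unseeded_a == unseeded_b)
```

Any code whose output depends on the iteration order of a set of strings can therefore vary between runs, even when the random module is seeded identically.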

@IanSudbery - Should we add the above to the FAQ?

@IanSudbery
Member

IanSudbery commented Oct 19, 2020 via email

@TomSmithCGAT
Member

Seems like it is possible: https://stackoverflow.com/questions/32538764/unable-to-see-or-modify-value-of-pythonhashseed-through-a-module. I think it would make sense from a user point of view if --random-seed set the value for PYTHONHASHSEED and made the output deterministic. Agree?
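One way this could work, sketched under assumptions: PYTHONHASHSEED is only read when the interpreter starts, so changing os.environ in a running process does not affect its hashes. A tool can instead restart itself with the variable set. The function names below are hypothetical and this is not how UMI-tools implements (or will implement) it.

```python
import os
import sys


def hash_seed_reexec_env(seed, environ):
    """Return the environment to restart with, or None if no restart is needed.

    A restart is needed whenever PYTHONHASHSEED is not already equal to the
    requested seed.
    """
    wanted = str(seed)
    if environ.get("PYTHONHASHSEED") == wanted:
        return None
    new_env = dict(environ)
    new_env["PYTHONHASHSEED"] = wanted
    return new_env


def ensure_hash_seed(seed):
    """Replace the current process with an identical one that has the seed set."""
    new_env = hash_seed_reexec_env(seed, os.environ)
    if new_env is not None:
        # os.execve does not return: the new interpreter runs the same
        # command line, this time with a fixed hash salt.
        os.execve(sys.executable, [sys.executable] + sys.argv, new_env)
```

A script would call `ensure_hash_seed(seed)` right after parsing its options and before doing any work. The second pass sees the variable already set and carries on. Rebuilding the command line from `sys.argv` assumes the program was started as a script file; a `python -m package` launch would need different handling.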

@SPPearce

Has this actually been fixed in a release? I'm seeing the same non-deterministic behaviour in dedup, even after setting --random-seed. Could a note be added to the website to make it clear that --random-seed on its own isn't sufficient? I'm trying exporting PYTHONHASHSEED now, but it took me a while of digging through these issues to find the fix.
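For anyone landing here, a sketch of the workaround when driving dedup from Python: pass PYTHONHASHSEED through the child environment alongside --random-seed. The file names and the helper `build_dedup_call` are placeholders, and reusing the same number for both seeds is just a convenience, not a requirement. The dedup options are the ones from the original report.

```python
import os


def build_dedup_call(in_bam, out_bam, seed):
    """Build the command line and environment for a reproducible dedup run."""
    cmd = [
        "umi_tools", "dedup",
        "-I", in_bam,
        "-S", out_bam,
        "--random-seed", str(seed),
        "--spliced-is-unique",
        "--multimapping-detection-method=NH",
    ]
    # Fix the hash salt of the Python process that umi_tools starts.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    return cmd, env


cmd, env = build_dedup_call("input.bam", "deduped.bam", 100)
print(" ".join(cmd))

# To actually run it (requires umi_tools on PATH and a real BAM file):
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```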

@TomSmithCGAT
Member

Hi @SPPearce - Sorry for the time you spent digging into how to make UMI-tools deterministic.

We have two open PRs to deal with this (#365 & #470), and I have a separate idea I wanted to try as well. I'm optimistically hoping to decide which route to take this week and then issue a new version. I've been saying that for the past few weeks though 😬

@TomSmithCGAT
Member

See the outstanding #550
