Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

makes umi_tools deterministic with --random-seed #550

Merged
merged 7 commits into from
Oct 1, 2024

Conversation

TomSmithCGAT
Copy link
Member

Trying to resolve issues with umi_tools being non-deterministic even with --random-seed. We have two open PRs on this.

I found #365 was sometimes reporting the same read multiple times. For example, running the test
umi_tools group -L test.log --out-sam --random-seed=123456789 --method=adjacency --output-bam --group-out=group_adj_py3.tsv -I tests/chr19.bam > branch_out

SRR2057595.11597812_ATAAA is reported twice, but just once (as it should be) when using the master branch.

$ grep SRR2057595.11597812_ATAAA *out
branch_out:SRR2057595.11597812_ATAAA	16	chr19	4078297	255	38M	*	0	0	*	*	XA:i:1	MD:Z:29A8	NM:i:1	UG:i:52	BX:Z:ATAAA
branch_out:SRR2057595.11597812_ATAAA	16	chr19	4078297	255	38M	*	0	0	*	*	XA:i:1	MD:Z:29A8	NM:i:1	UG:i:53	BX:Z:ATAAA
master_out:SRR2057595.11597812_ATAAA	16	chr19	4078297	255	38M	*	0	0	*	*	XA:i:1	MD:Z:29A8	NM:i:1	UG:i:52	BX:Z:ATAAA

#470 works as intended but introduces a non-conda dependency, as discussed. We could rip the code or get siphashc into conda, but we can take a much simpler route if we don't care about having to update the test files.

Following @TyberiusPrime's suggestion in #470, this PR just adds a sort to the components in network.py

Runtime is unaffected, at least on the example.bam file we include in the release (9.4-9.7s with both master and this branch).

The test files will need to be updated of course. I just wanted to raise the PR first to check you are OK with this @IanSudbery before I update them and merge.

Note that the output when using the adjacency method can change with respect to not just which reads are output, but also how many reads, since the order of the UMIs with the same counts affects how many steps are required to account for a connected component. I see no issue with the number of reads returned in these cases being fixed by the sort, given we are currently treating any possible resolution as equally probable.

For the other methods, the number of reads returned will be unchanged.

@IanSudbery
Copy link
Member

In general I have no issue with this, but I want to check performance on bigger input files (with 10s of thousands of UMIs at one position).

@TomSmithCGAT
Copy link
Member Author

Have you got something in mind?

@IanSudbery
Copy link
Member

I should have something somewhere from when I did benchmarking for the grant proposal.

@TomSmithCGAT
Copy link
Member Author

Hi @IanSudbery - Did you have a chance to check the runtimes on bigger files?

@TomSmithCGAT
Copy link
Member Author

Hi @IanSudbery. Would you be able to run the benchmarks that you wanted to before I merge this. The tests fail at the moment because many of the test files need to be updated since this fixed determination yields a different set of UMIs for cases where the selection is between equally likely UMIs, compared to the previous code.

@IanSudbery
Copy link
Member

IanSudbery commented Jun 13, 2023

Okay, I checked out timeings. Not much in it. Here I checked times to process a single position with an increasing number of "real" UMIs, check with twenty random PCR errors:

Current:

Centre_umis     time
1       17.33
4       17.27
16      17.3
64      17.21
256     17.65
1024    21.36
4096    70.68
5792    117.89
8192    223.07999999999998
11585   534.42

Proposed:

Centre_umis     time
1       9.26
4       9.15
16      9.21
64      9.19
256     9.61
1024    13.61
4096    62.99
5792    111.89
8192    219.17000000000002
11585   592.01

I'm not sure why there is a small but constant reduction in time for the new method at lower UMI counts - possibly something to do with the other changes since this branch was proposed? Anyway, other than that, very little in it.

I also tested on a 5M read real data set, with 4M positions, and an average of 1.1 UMI per position. Both versions took exactly the same 165 seconds.

@IanSudbery
Copy link
Member

IanSudbery commented Aug 8, 2024

I assume we are not worried that this will introduce a bias into which UMIs are selected. My guess is that it makes ties more likely to be solved in favour of alphabetically earlier UMIs? I'm not particulalry worried about this, but I thought I'd note that we are aware of it.

Its also worth noting that the siphash referred to in the first post on this PR is actaully now the default hash method in python. >=3.11

@TomSmithCGAT TomSmithCGAT merged commit 0a037ba into master Oct 1, 2024
6 checks passed
@IanSudbery
Copy link
Member

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants