The project has moved here:
https://github.com/bioinfo-pf-curie/mpiMarkDup
An MPI- and C-based program for marking duplicates in Next Generation Sequencing data.
The code is a fork of the mpiSORT project. We added a process to manage the marking of duplicate reads.
The goal is to create a distributed and (near) real-time version of the well-known Picard MarkDuplicates from the Broad Institute. To do so we rely on low-level technologies and parallel algorithms.
The project is still under development, so feel free to test it and report issues.
This is an open project initiated by the Institut Curie HPC team and students from Paris Descartes University.
Release 1.0 from the 25/10/2019
- Not a good idea to play with the magic number: bgzip complains. We went back to the previous version.
- TODO: the major bottleneck is now the computation of collisions in the perfect hash table.
  We should compute collisions only once, based on the biggest chromosome to mark.
- TODO: problem with the option -q > 0 (it works with -q 0)
Release 1.0 from the 23/10/2019
- Add the magic number at the end of the BGZF file. This way samtools doesn't complain.
Example of usage with samtools:
- samtools flagstat chr1.gz
- samtools view -f 1024 chr1.gz
Other tools may complain (not tested).
Release 1.0 from the 21/10/2019
- Review the search for mates after the Bruck exchange of inter-process reads.
We now use the hash table to look up the corresponding mate of a read.
Release 1.0 from the 02/07/2019
- Fix a bug when pairs overlap 2 ranks
- Fix a memory leak
Release 1.0 from the 01/07/2019
- Fix memory leaks
- Remove global variable COMM_WORLD
Release 1.0 from the 27/06/2019
- Fix a corner case: we check intra-node overlap before extra-node overlap in mpiSort.c.
- Fix a bug in sort_any_dim.c
Release 1.0 from the 19/06/2019
- Fix a corner case
Release 1.0 from the 13/06/2019
- Replace All2all with a Bruck when exchanging external fragments
- Change some types in the readInfo and mateInfo structures
Release 1.0 from the 23/05/2019
- Fix issues with openssl 1.0.2k-fips.
Release 1.0 from the 15/05/2019
- Cleaning up the code.
Release 1.0 from the 07/05/2019
- Fix reproducibility issue with 1 cpu.
Release 1.0 from the 06/05/2019
- Fix tie case (tested only with a power-of-2 number of CPUs).
Release 1.0 from the 08/04/2019
- Fix a reproducibility issue, but a corner case remains.
Release 1.0 from the 31/03/2019
- Fix a corner case when the read distribution is unbalanced.
Release 1.0 from the 28/03/2019
- Fix some integer conversions and prototypes.
- Fix a corner case when a rank receives no mates from other ranks.
Release 1.0 from the 22/03/2019
- Fix the coordinates overlapping between 2 ranks when the dimension is not a power of 2 (again, it is better to use a power-of-2 dimension).
Release 1.0 from the 16/03/2019
- Cleaning of the code
- Write unmapped reads (we forgot to write them in the previous release)
- Update the algorithm for overlapping coordinates between ranks
- Next steps: a Singularity definition file, more tests and profiling, an upgrade of the compression algorithm.
Tested with:
gcc > 4.8 (tested with 7.3)
Open MPI (tested with 2.2.1, 3, 4.0, and Intel MPI 2019)
OpenSSL (tested with 1.0.2k and 1.1.0g)
For compiling from source or creating a distribution you need:
automake-1.15
autoconf-2.69
cmocka (optional, only for unit testing)
The input is a SAM file of aligned paired reads, trimmed or not, compliant with the SAM format.
For small tests a laptop is sufficient.
For real-life tests an HPC cluster with a low-latency network and a parallel file system is mandatory.
git clone the repo
cd mpiMarkDup
aclocal
autoconf
automake --add-missing
./configure --prefix /usr/local CC=path_to_mpicc
make
make install
make clean
git clone the repo
cd mpiMarkDup/src
aclocal
autoreconf --install
./configure
automake
make dist
or
make distcheck
This creates a tar.gz in mpiMarkDup/src that you can distribute.
tar -xvzf mpimd-1.0.tar.gz
cd mpimd-1.0
./configure
make
mpirun -n cpu_number mpiMD input_sam output_dir -q 0 -d 1000 -v 4
(it is better if cpu_number is a power of 2)
Options are:
-q sets the quality filtering threshold (known problem with -q > 0; use -q 0)
-d sets the optical duplicate distance
-v sets the log verbosity level:
0 is LOG_OFF
1 is LOG_ERROR
2 is LOG_WARNING
3 is LOG_INFO (default)
4 is LOG_DEBUG
5 is LOG_TRACE
First, the program sorts the reads by genome coordinate and extracts discordant and unmapped (end and mate) reads, with the same techniques described in mpiSORT.
Second, the program marks the duplicates for each chromosome and for the discordant reads according to the Picard MarkDuplicates method. Unmapped reads and unmapped mates are not marked. To limit memory overhead we build a distributed perfect hash table (see perfectHash.c for details) for the fragment list and the end list. This way the memory usage stays below that of mpiSORT.
Finally, each chromosome is marked, compressed with BGZF, and written to the output folder.
We test reproducibility by comparing both pipelines: mpiMD versus mpiSORT + Picard MarkDuplicates.
We use the same number of CPUs for each pipeline. So far we obtain 100% reproducibility.
If the number of CPUs differs, reproducibility is not guaranteed. Indeed, tie cases are resolved using the index of the read in the sorted file, and this index can change with the number of CPUs.
This does not impact the results of downstream analysis.
In conclusion, when testing reproducibility, always use the same number of CPUs.
List of things left to do before production level:
- Build only one file with all chromosomes and the unmapped reads
- Integrate the discordant reads into their chromosomes
- Test with multiple libraries (e.g. multiple RG)
- Accelerate the construction of the hash table by considering a list of prime numbers
- Replace the Bruck exchange with its zero-copy version
- Generate a BAM file instead of gz
This program has been developed by
Frederic Jarlier from Institut Curie and Firmin Martin from Paris Descartes University (marking of duplicates part)
and supervised by
Philippe Hupe from Institut Curie
Contacts: