The project has moved here:
https://github.com/bioinfo-pf-curie/mpiMarkDup
An MPI- and C-based program for marking duplicates in Next Generation Sequencing data.
The code is a fork of the mpiSORT project. We added a process to manage the marking of duplicate reads.
The goal is to create a distributed and (near) real-time version of the well-known Picard MarkDuplicates from the Broad Institute. To do so we rely on low-level technologies and parallel algorithms.
The project is still under development, so feel free to test it and report issues.
This is an open project initiated by the Institut Curie HPC team and students from Paris Descartes University.
Release 1.0 from the 25/10/2019
- Not a good idea to play with the magic number: bgzip complains. We went back to the previous version.
- TODO: the major bottleneck is now the computation of collisions in the perfect hash table.
  We should compute collisions only once, based on the biggest chromosome to mark.
- TODO: problem with the option -q > 0 (it works with -q 0)
Release 1.0 from the 23/10/2019
- Add the magic number at the end of the BGZF file. This way samtools doesn't complain.
Example of usage with samtools:
- samtools flagstat chr1.gz
- samtools view -f 1024 chr1.gz
Other tools may complain (not tested).
Release 1.0 from the 21/10/2019
- Review the search for mates after the Bruck exchange of inter-process reads.
We now use the hash table to look up the corresponding mate of a read.
Release 1.0 from the 02/07/2019
- Fix a bug when pairs overlap 2 ranks
- Fix a memory leak
Release 1.0 from the 01/07/2019
- Fix memory leaks
- Remove global variable COMM_WORLD
Release 1.0 from the 27/06/2019
- Fix a corner case: we check intra-node overlap before extra-node overlap in mpiSort.c.
- Fix a bug in sort_any_dim.c
Release 1.0 from the 19/06/2019
- Fix a corner case
Release 1.0 from the 13/06/2019
- Replace All2all with a Bruck when exchanging external fragments
- Change some types in the readInfo and mateInfo structures
Release 1.0 from the 23/05/2019
- Fix issues with openssl 1.0.2k-fips.
Release 1.0 from the 15/05/2019
- Cleaning up the code.
Release 1.0 from the 07/05/2019
- Fix reproducibility issue with 1 cpu.
Release 1.0 from the 06/05/2019
- Fix tie case (tested only with a power-of-2 number of CPUs).
Release 1.0 from the 08/04/2019
- Fix a reproducibility issue, but a corner case remains.
Release 1.0 from the 31/03/2019
- Fix a corner case when the read distribution is unbalanced.
Release 1.0 from the 28/03/2019
- Fix some integer conversions and prototypes.
- Fix a corner case when a rank receives no mates from other ranks.
Release 1.0 from the 22/03/2019
- Fix the coordinates overlapping between 2 ranks when the dimension is not a power of 2 (again, it is better to use a power-of-2 dimension).
Release 1.0 from the 16/03/2019
- Cleaning of the code
- Write unmapped reads (we forgot to write them in the previous release)
- Update the algorithm for overlapping coordinates between ranks
- Next steps: a Singularity definition file, more tests and profiling, an upgrade of the compression algorithm.
Tested with:
gcc > 4.8 (tested with 7.3)
Open MPI (tested with 2.2.1, 3, 4.0, and Intel MPI 2019)
OpenSSL (tested with 1.0.2k and 1.1.0g)
For compiling from source or creating a distribution you need:
automake-1.15
autoconf-2.69
cmocka (optional, only for unit testing)
The input is a SAM file of aligned paired reads, trimmed or not, compliant with the SAM format.
For small tests a laptop is sufficient.
For real-life tests an HPC cluster with a low-latency network and a parallel file system is mandatory.
git clone the repo
cd mpiMarkDup
aclocal
autoconf
automake --add-missing
./configure --prefix /usr/local CC=path_to_mpicc
make
make install
make clean
git clone the repo
cd mpiMarkDup/src
aclocal
autoreconf --install
./configure
automake
make dist
or
make distcheck
This creates a tar.gz in mpiMarkDup/src that you can distribute.
tar -xvzf mpimd-1.0.tar.gz
cd mpimd-1.0
./configure
make
mpirun -n cpu_number mpiMD input_sam output_dir -q 0 -d 1000 -v 4
(it is better if cpu_number is a power of 2)
Options are:
-q sets the quality filtering threshold (known problem with -q > 0; use -q 0)
-d sets the optical duplicate distance
-v sets the log verbosity level:
0 is LOG_OFF
1 is LOG_ERROR
2 is LOG_WARNING
3 is LOG_INFO (default)
4 is LOG_DEBUG
5 is LOG_TRACE
First, the program sorts the reads by genome coordinate and extracts discordant and unmapped (end and mate) reads, with the same techniques described in mpiSORT.
Second, the program marks the duplicates for each chromosome and for the discordant reads according to the Picard MarkDuplicates method. Unmapped reads and unmapped mates are not marked. To limit memory overhead we build a distributed perfect hash table (see perfectHash.c for details) for the fragment list and the end list. This way the memory usage stays below that of mpiSORT.
Finally, each chromosome is marked, compressed with BGZF, and written to the output folder.
We test reproducibility by comparing both pipelines: mpiMD versus mpiSORT + Picard MarkDuplicates.
We use the same number of CPUs for each pipeline. So far we obtain 100% reproducibility.
If the number of CPUs differs, reproducibility is not guaranteed. Indeed, tie cases are resolved using the index of the read in the sorted file, and this index can change with the number of CPUs.
This does not impact the results of downstream analysis.
In conclusion, when testing reproducibility, always use the same number of CPUs.
List of things left to do before production level:
- Build only one file with all chromosomes and the unmapped reads
- Integrate the discordant reads into their chromosomes
- Test with multiple libraries (e.g. multiple RG)
- Accelerate the construction of the hash table by considering a list of prime numbers
- Replace the Bruck exchange with its zero-copy version
- Generate a BAM file instead of gz
This program has been developed by
Frederic Jarlier from Institut Curie and Firmin Martin from Paris Descartes University (marking of duplicates part)
and supervised by
Philippe Hupe from Institut Curie
Contacts: