
sv-callers Installation in local machine and input files #47

Closed
nitha26 opened this issue Oct 22, 2020 · 14 comments


nitha26 commented Oct 22, 2020

Hi,

I would like to use sv-callers for calling germline SVs from WGS data. I have the following questions:

  1. Can this tool be installed on a local CentOS 7 machine? And do the SV callers (Manta, Delly, LUMPY, GRIDSS) and the other tools (bcftools, SURVIVOR) need to be installed separately, or are they part of this repository (sub-modules already built into sv-callers)?

  2. The version of Manta is 1.1.0; does the current version of sv-callers support newer versions of Manta or GRIDSS?

  3. Can "cram" files be used as input?

  4. My samples are aligned to the GRCh38 reference; does sv-callers provide excluded regions as a .bed file for this reference genome?

Sorry for all the questions, let me know if there is a better place to ask them, person to email, etc.

Thanks in advance!
Nitha


arnikz commented Oct 22, 2020

Hi,

I would like to use sv-callers for calling germline SVs from WGS data. I have the following questions:

First, have you tried to run it locally?

1. Can this tool be installed on a local CentOS 7 machine? And do the SV callers (Manta, Delly, LUMPY, GRIDSS) and the other tools (bcftools, SURVIVOR) need to be installed separately, or are they part of this repository (sub-modules already built into sv-callers)?

The workflow takes care of the dependencies including SV callers etc. via (bio)conda.

2. The version of Manta is 1.1.0; does the current version of sv-callers support newer versions of Manta or GRIDSS?

In principle, yes (see here), but the unit/CI tests run with the aforementioned (older) software versions (see #35).

3. Can "cram" files be used as input?

Currently, there is no support for CRAM (sorry, we've been working with BAMs only).

4. My samples are aligned to the GRCh38 reference; does sv-callers provide excluded regions as a .bed file for this reference genome?

You can configure, among other things, the exclusion list here:

exclusion_list: data/ENCFF001TDO.bed
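For instance, pointing the workflow at an exclusion list matching your own reference build would be a one-line config change. Only the `exclusion_list` key is confirmed above; the file path below is a hypothetical placeholder, not a file shipped with sv-callers:

```yaml
# Sketch: swap the shipped BED file for one matching your reference build.
# The path below is a hypothetical example for a GRCh38 exclusion list.
exclusion_list: data/my_grch38_exclusion.bed
```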

Cheers,
Arnold


nitha26 commented Oct 23, 2020

Thank you for the reply. I will try to install sv-callers on a local machine.


nitha26 commented Oct 23, 2020

First, have you tried to run it locally?
Yes, I installed it on a CentOS 7 machine and was able to run the "execution of SV callers by writing (dummy) VCF files" command using the example data (log file "Trial_log_Exampledata.txt" attached).

However, I noticed the following lines in the output VCF files. Also, how do I execute the SV tools (Manta, Delly, LUMPY and GRIDSS) via sv-callers to get structural-variant genotype results (germline VCFs) on our local machine? Could you please point me to the command documentation?

all.vcf
data/bam/3/T3--N3/manta_out/survivor/manta.vcf data/bam/3/T3--N3/delly_out/survivor/delly.vcf data/bam/3/T3--N3/lumpy_out/survivor/lumpy.vcf data/bam/3/T3--N3/gridss_out/survivor/gridss.vcf

delly.vcf
data/fasta/chr22.fasta data/fasta/chr22.fasta.fai data/bam/3/T3.bam data/bam/3/T3.bam.bai data/bam/3/N3.bam data/bam/3/N3.bam.bai data/fasta/chr22.fasta data/fasta/chr22.fasta.fai data/bam/3/T3.bam data/bam/3/T3.bam.bai data/bam/3/N3.bam data/bam/3/N3.bam.bai data/fasta/chr22.fasta data/fasta/chr22.fasta.fai data/bam/3/T3.bam data/bam/3/T3.bam.bai data/bam/3/N3.bam data/bam/3/N3.bam.bai data/fasta/chr22.fasta data/fasta/chr22.fasta.fai data/bam/3/T3.bam data/bam/3/T3.bam.bai data/bam/3/N3.bam data/bam/3/N3.bam.bai data/fasta/chr22.fasta data/fasta/chr22.fasta.fai data/bam/3/T3.bam data/bam/3/T3.bam.bai data/bam/3/N3.bam [data/bam/3/N3.bam.bai

Trial_log_Exampledata.txt

Thanks.


arnikz commented Oct 23, 2020

Yep, that's correct

# 'vanilla' run (default) mimics the execution of SV callers by writing (dummy) VCF files
snakemake -C echo_run=1

Now for the real run, remove the data/bam/3/T3--N3 dir and run the workflow again with echo_run=0 etc. Please read the README or see this command 😉
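A minimal sketch of that real run, based on the commands quoted in this thread (the dir path is the example-data sample pair mentioned above; further flags may be needed for your setup):

```shell
# Remove the dummy output for the example sample pair...
rm -rf data/bam/3/T3--N3
# ...and re-run with dummy mode off; --use-conda lets the workflow
# install the SV callers and processing tools itself.
snakemake -C echo_run=0 --use-conda
```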


nitha26 commented Oct 23, 2020

As per your suggestion we tried the command snakemake -C echo_run=0, but we are getting an error (log file attached).
2020-10-23T143034.916521.snakemake.log

I still have some doubts:

1. As asked in the earlier query: do the SV callers (Manta, Delly, LUMPY, GRIDSS) and other tools (bcftools, SURVIVOR) need to be installed separately, or are they part of this repository (sub-modules already built into sv-callers)?

On this system we had already installed delly, so I think sv-callers is able to invoke only Delly; the other tools are not executing because they are not installed.


arnikz commented Oct 23, 2020

Please read the README carefully. You do not need to install the callers and processing tools yourself; the workflow takes care of that if you add the missing --use-conda arg. In addition, take a closer look at the aforementioned Travis CI log (green badge), which shows that everything runs fine (no errors) in the automated deployment of the workflow with test data.


nitha26 commented Oct 23, 2020

Right now we are working on a local CentOS 7 machine, so we tried the command below; it has been running for more than 45 minutes but still shows the same message. Could you please confirm whether this command is correct and, if so, how long it should take to run.

(wf) [root@localhost snakemake]# snakemake -C echo_run=0 mode=p enable_callers="['manta','delly','lumpy','gridss']" --use-conda

Building DAG of jobs... Removing incomplete Conda environment environment.yaml... Creating conda environment environment.yaml... Downloading and installing remote packages.

Thanks for your support.


arnikz commented Oct 23, 2020

The command looks fine. Yeah, conda install used to take a few minutes, but these days it's very slow indeed - something to consider for the next release (#49) - though it needs to be done just once before the actual workflow run(s). What's your conda --version? Btw, why are you executing the wf as root?


nitha26 commented Oct 23, 2020

I am running as the root user. The conda version is:

[root@localhost]# conda --version
conda 4.8.3

And the conda package install is STILL at the same stage:
Building DAG of jobs... Removing incomplete Conda environment environment.yaml... Creating conda environment environment.yaml... Downloading and installing remote packages.


arnikz commented Oct 23, 2020

I am running from root user.

Yes, that's clear but it's not necessary (and could be dangerous).

The conda version is

[root@localhost]# conda --version
conda 4.8.3

Update to the latest version via conda update -y conda once this one below has finished.

And the conda package install is STILL at the same stage.
Building DAG of jobs... Removing incomplete Conda environment environment.yaml... Creating conda environment environment.yaml... Downloading and installing remote packages.

Sorry, I can't help you with that (e.g. waiting, Internet bandwidth etc.)


nitha26 commented Oct 23, 2020

Yes, that's clear but it's not necessary (and could be dangerous).
Got it. Thank you.

Sorry, I can't help you with that (e.g. waiting, Internet bandwidth etc.)
I understand.

The command ran, but I wonder why I cannot find any SV call information in the results. Pasting the `all.vcf` content below (sorry, I could not attach the all.vcf file).

##fileformat=VCFv4.1
##source=SURVIVOR
##fileDate=20201023
##contig=<ID=chr22,length=51304566>
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=BND,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##INFO=<ID=CIEND,Number=2,Type=String,Description="PE confidence interval around END">
##INFO=<ID=CIPOS,Number=2,Type=String,Description="PE confidence interval around POS">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=MAPQ,Number=1,Type=Integer,Description="Median mapping quality of paired-ends">
##INFO=<ID=RE,Number=1,Type=Integer,Description="read support">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
##INFO=<ID=SVLEN,Number=1,Type=Float,Description="Length of the SV">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Method for generating this merged VCF file.">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of the SV.">
##INFO=<ID=SUPP_VEC,Number=1,Type=String,Description="Vector of supporting samples.">
##INFO=<ID=SUPP,Number=1,Type=String,Description="Number of samples supporting the variant">
##INFO=<ID=STRANDS,Number=1,Type=String,Description="Indicating the direction of the reads with respect to the type and breakpoint.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PSV,Number=1,Type=String,Description="Previous support vector">
##FORMAT=<ID=LN,Number=1,Type=Integer,Description="predicted length">
##FORMAT=<ID=DR,Number=2,Type=Integer,Description="# supporting reference,variant reads in that order">
##FORMAT=<ID=ST,Number=1,Type=String,Description="Strand of SVs">
##FORMAT=<ID=QV,Number=1,Type=String,Description="Quality values: if not defined a . otherwise the reported value.">
##FORMAT=<ID=TY,Number=1,Type=String,Description="Types">
##FORMAT=<ID=ID,Number=1,Type=String,Description="Variant ID from input.">
##FORMAT=<ID=RAL,Number=1,Type=String,Description="Reference allele sequence reported from input.">
##FORMAT=<ID=AAL,Number=1,Type=String,Description="Alternative allele sequence reported from input.">
##FORMAT=<ID=CO,Number=1,Type=String,Description="Coordinates">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878_2 NA12878 NA12878_1 N3.bam

  1. Could you please tell me which is the final result file that can be used for downstream analysis?

  2. Right now I have more than 500 WGS samples; what is the maximum number of samples that can be run through sv-callers? How long would a complete analysis of 10 samples take on a normal machine?

  3. To speed up the process, if I split the 500 samples into batches (on an HPC cluster with Slurm), at which stage can I combine all the VCF files for a population study? Please give an outline of how multiple samples can be run in batches and how the merging takes place in sv-callers. Do any changes have to be made in "samples.csv"?

  4. How can I validate or confirm that all my jobs completed successfully, and where are the time logs?

Thank you so much.


arnikz commented Oct 26, 2020

The command ran, but I wonder why I cannot find any SV call information in the results.

That's correct. The sample data are meant for CI testing only (the T3/N3 .bam files are identical and cover only a small part of the genome). The all.vcf file is the result of the SURVIVOR merge (final wf step) of all the SV callers' VCF files. For more details, refer to our paper.
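A quick way to see that such a file is header-only (as in the paste above) is to count its non-header records; 0 means no SV calls:

```shell
# Count VCF records, i.e. lines not starting with '#'.
# A header-only all.vcf yields 0.
grep -vc '^#' all.vcf
```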

Could you please tell me which is the final result file that can be used for downstream analysis?

You could use the VCF files of each caller in the corresponding dir, or the aforementioned (merged) VCF.

Right now I have more than 500 WGS samples; what is the maximum number of samples that can be run through sv-callers?

In principle, there is no limit on the number of samples in samples.csv you could analyze. It depends on the compute/storage resources available to you on an HPC system.

How long would a complete analysis of 10 samples take on a normal machine?

It depends on your samples and machine. See our paper for example runs (germline and somatic).

To speed up the process, if I split the 500 samples into batches (on an HPC cluster with Slurm), at which stage can I combine all the VCF files for a population study? Please give an outline of how multiple samples can be run in batches and how the merging takes place in sv-callers. Do any changes have to be made in "samples.csv"?

The workflow takes care of the parallelization so there is no need to split/merge jobs yourself.
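Scaling to more samples is then just a matter of adding rows to samples.csv. The header below is an illustrative assumption modeled on the example-data layout seen earlier in this thread, not the verified format; check the samples.csv shipped with the repo for the exact columns:

```csv
PATH,SAMPLE1,BAM1,SAMPLE2,BAM2
data/bam/3,T3,data/bam/3/T3.bam,N3,data/bam/3/N3.bam
```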

How can I validate or confirm that all my jobs completed successfully, and where are the time logs?

See the workflow log (snakemake ... &>smk.log) and/or the per-job stderr-[jobid].log files. In addition, you can retrieve detailed job accounting info from the HPC system used (see README.md).
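For example, capturing the workflow log and scanning it afterwards might look like this (smk.log is just a naming convention, and the grep pattern is an illustrative heuristic, not an official check):

```shell
# Run the workflow and capture all output in one log file.
snakemake -C echo_run=0 --use-conda &> smk.log
# Heuristic scan for failures in the workflow log...
grep -iE 'error|exception' smk.log
# ...and list the per-job stderr logs, as named above.
ls stderr-*.log
```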


arnikz commented Oct 26, 2020

And the conda package install is STILL at the same stage.

This was fixed in v1.1.2 (#49).


nitha26 commented Oct 26, 2020

Yesterday when I fired off the job, the conda packages were installed within a minute. Thanks!

arnikz closed this as completed Oct 26, 2020