Doing just variant calling /haplotypecaller? non-model organism no known-variant file for bqsr #49
Comments
Hi,
With regard to the error in the command, it seems you forgot to specify the name of the output file. The structure of the sfm command is: elprep sfm input.bam output.bam ...
You can also use elprep for haplotype calling alone, assuming the input BAM has already been sorted, duplicate-marked, etc., because the algorithm relies on that. However, it is best to combine all steps of the pipeline in a single elprep command: elprep internally merges and parallelizes the different steps of a pipeline, which gives better performance than running the command separately for each pipeline step.
Thanks!
|
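As a minimal sketch of the command structure described above: the input and output BAM files are the first two positional arguments after sfm, and the pipeline options follow. The helper elprep_sfm_cmd is hypothetical and only prints the command so it can be checked before running; sorted.dedup.bam is a placeholder name, and the other file names are the ones used in this thread.

```shell
# Print (rather than run) an elprep sfm command line.
# $1 = input BAM, $2 = output BAM, remaining arguments = pipeline options.
elprep_sfm_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: elprep_sfm_cmd input.bam output.bam [options...]" >&2
        return 1
    fi
    input=$1
    output=$2
    shift 2
    echo "elprep sfm $input $output $*"
}

# All steps combined in one call, as recommended above.
elprep_sfm_cmd PA113corr.bam PA113.output.bam \
    --mark-duplicates --mark-optical-duplicates PA113corr.metrics \
    --sorting-order coordinate --bqsr PA113corr.recal \
    --haplotypecaller PA113corr.vcf.gz --reference Autosome.elfasta

# Variant calling only, assuming the input is already sorted and
# duplicate-marked (sorted.dedup.bam is a placeholder name).
elprep_sfm_cmd sorted.dedup.bam PA113.output.bam \
    --haplotypecaller PA113corr.vcf.gz --reference Autosome.elfasta
```

Calling the helper with only one file name fails with a usage message, which mirrors the "Filename(s) in command line missing" error reported in this issue.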
Hi,
Thanks and welcome.
I think I saw the error. Must the files be named *.input.bam and *.output.bam, or can they have any name? In the example, the other names are output.metrics, output.recal and output.vcf.gz.
I have one question. The germline workflow for GATK has a final cohort variant calling step. Are the VCF files produced by elprep compatible with that last step?
Thanks!
|
I ran the code as you suggested and it worked, but then I got an error:
elprep version 5.0.2 compiled with go1.16.4 - see http://github.com/exascience/elprep for more information.
2021/06/07 21:27:25 signal: bus error (core dumped)
I gave the job 170 GB of RAM and 2 cores. My genome is 1.9 GB and about 40X WGS.
|
Hi,
Would it be possible to send us the log file: /users/PHS0338/jpac1984/logs/elprep/elprep-2021-06-07-20-58-25-469790574-EDT.log
Which OS are you using? Please keep in mind that elPrep is only supported for Linux.
(The signal could suggest an error with addressing a memory mapped file).
Thanks!
On 8 Jun 2021, at 05:33, desmodus1984 wrote:
I ran the code as you suggested it worked, but then I got an error:
elprep version 5.0.2 compiled with go1.16.4 - see http://github.com/exascience/elprep for more information.
2021/06/07 20:58:25 Created log file at /users/PHS0338/jpac1984/logs/elprep/elprep-2021-06-07-20-58-25-469790574-EDT.log
2021/06/07 20:58:25 Command line: [./elprep sfm PA113corr.bam PA113.output.bam --mark-duplicates --mark-optical-duplicates PA113corr.metrics --sorting-order coordinate --bqsr PA113cor$
2021/06/07 20:58:25 Executing command:
./elprep sfm PA113corr.bam PA113.output.bam --mark-duplicates --mark-optical-duplicates PA113corr.metrics --optical-duplicates-pixel-distance 100 --bqsr PA113corr.recal --reference A$
2021/06/07 20:58:25 Splitting...
2021/06/07 21:22:54 Filtering (phase 1)...
2021/06/07 21:27:25 signal: bus error (core dumped)
I gave the job 170Gb of ram and 2 cores. My genome is 1.9 GB and about 40X WGS.
|
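The two things asked for above (the OS and the detailed log file) can be collected with a short sketch like the following. The default directory $HOME/logs/elprep is an assumption based on the log path shown in this thread, and latest_elprep_log is a hypothetical helper name.

```shell
# Print the most recently modified .log file in a directory.
# The default directory is an assumption based on the path shown in this thread.
latest_elprep_log() {
    log_dir=${1:-"$HOME/logs/elprep"}
    # ls -t sorts newest first; errors (e.g. no log files) are silenced.
    ls -t "$log_dir"/*.log 2>/dev/null | head -n 1
}

# elPrep is only supported on Linux, so report the kernel name first.
echo "OS: $(uname -s)"
echo "Latest log: $(latest_elprep_log)"
```

If no log file is found, the second line is printed with an empty path.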
Hi,
I was wondering whether there is a way to determine how many resources (time/memory) a job needs. That log file was deleted when I deleted the folder of the failed job. The HPC uses Linux.
I converted my reference genome (1.9 GB / 88 scaffolds) into elfasta format, and I used the BAM file (converted from SAM) as input. The only change is that I excluded the reads that did not map to my reference from the BAM file (-F 12) during the samtools job. Furthermore, the coverage is ~6-7X at most. I gave the job 170 GB and it didn't finish in 24 hours. Is there any way to improve the runtime? I thought that 170 GB of RAM was enough for such a small dataset, since your recommendation is: whole-genome 30x: 128 GB RAM using the elprep split/filter/merge mode (sfm)
The log is:
elprep version 5.0.2 compiled with go1.16.4 - see http://github.com/exascience/elprep for more information.
2021/06/08 14:03:42 Created log file at /users/PHS0338/jpac1984/logs/elprep/elprep-2021-06-08-14-03-42-745829962-EDT.log
2021/06/08 14:03:42 Command line: [./elprep sfm /fs/scratch/PHS0338/appz/bwa-mem2-2.1_x64-linux/PA112.bam PA112.output.bam --mark-duplicates --mark-optical-duplicates PA112.metrics --sorting-order coordinate --bqsr PA112.recal --haplotypecaller PA112.vcf.gz --reference Autosome.elfasta]
2021/06/08 14:03:42 Executing command:
./elprep sfm /fs/scratch/PHS0338/appz/bwa-mem2-2.1_x64-linux/PA112.bam PA112.output.bam --mark-duplicates --mark-optical-duplicates PA112.metrics --optical-duplicates-pixel-distance 100 --bqsr PA112.recal --reference Autosome.elfasta --quantize-levels 0 --max-cycle 500 --haplotypecaller PA112.vcf.gz --sorting-order coordinate --intermediate-files-output-prefix PA112 --intermediate-files-output-type bam
2021/06/08 14:03:42 Splitting...
2021/06/08 14:07:34 Filtering (phase 1)...
2021/06/08 14:19:45 Filtering (phase 2) and variant calling...
slurmstepd: error: *** JOB 4304503 ON p0071 CANCELLED AT 2021-06-09T14:03:52 DUE TO TIME LIMIT ***
Any suggestions?
Thank you very much;
Juan Pablo Aguilar Cabezas
Ecology and Evolutionary Biology Ph.D. Candidate
Department of Biological Sciences
Ohio University, Athens OH
|
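One rough way to see where the time goes is to pull the phase-start lines out of the elprep log and compare their timestamps. The sketch below assumes that phase lines start with a "YYYY/MM/DD HH:MM:SS" timestamp and end in "...", as in the output quoted above; elprep_phases is a hypothetical helper name, and the demo file only reproduces lines from this thread.

```shell
# Print the phase-start lines of an elprep log: lines that begin with a
# "YYYY/MM/DD HH:MM:SS" timestamp and end in "...".
elprep_phases() {
    grep -E '^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} .*\.\.\.$' "$1"
}

# Demo on the lines quoted above, written to a temporary file.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
2021/06/08 14:03:42 Created log file at /users/PHS0338/jpac1984/logs/elprep/elprep-2021-06-08-14-03-42-745829962-EDT.log
2021/06/08 14:03:42 Splitting...
2021/06/08 14:07:34 Filtering (phase 1)...
2021/06/08 14:19:45 Filtering (phase 2) and variant calling...
EOF
elprep_phases "$demo_log"
rm -f "$demo_log"
```

In the run quoted above, splitting took about 4 minutes and phase 1 about 12 minutes, so nearly all of the 24 hours was spent in phase 2 and variant calling. On a SLURM cluster with job accounting enabled, sacct -j 4304503 --format=JobID,Elapsed,MaxRSS,State is one way to check elapsed time and peak memory of a finished job.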
Hi,
The error you first reported does not imply an out-of-memory error. The amount of memory for the job you describe indeed seems sufficient based on our previous experience.
The error seems to suggest a memory addressing error, e.g. because of accessing a corrupted data file, an OS issue, a bug, or something else entirely. It is hard to help figure out what the problem is without access to the detailed elprep log files or the data.
For your system, you may also want to look into the --tmp-path option (see documentation). By default, the temporary data elprep creates is stored on the path where the elprep binary is called. You may want to store it on a local scratch, specific shared storage, etc. instead.
If you can send us the detailed log files when errors occur, we may be able to help better.
Thanks.
|
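A minimal sketch of the --tmp-path suggestion above, assuming the option takes a directory path as its argument (check the elprep documentation for the exact syntax). It also assumes that $TMPDIR, when set, points at suitable scratch space on your cluster, which is site-specific; pick_tmp_path is a hypothetical helper name.

```shell
# Choose a directory for elprep's temporary files: prefer $TMPDIR when it is
# set, exists and is writable; otherwise fall back to the directory in $1.
pick_tmp_path() {
    fallback=$1
    if [ -n "${TMPDIR:-}" ] && [ -d "$TMPDIR" ] && [ -w "$TMPDIR" ]; then
        echo "$TMPDIR"
    else
        echo "$fallback"
    fi
}

tmp_path=$(pick_tmp_path "$PWD")
# Print the resulting command for review instead of running it.
echo "elprep sfm PA113.bam PA113.output.bam --haplotypecaller PA113.vcf.gz" \
     "--reference Autosome.elfasta --tmp-path $tmp_path"
```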
Hi,
I have been using the following code to test a couple of samples:
#!/bin/bash
#SBATCH --time=36:00:00
#SBATCH --ntasks=48
#SBATCH --mem=400G
#SBATCH --job-name=PA113-bam
#SBATCH --account=PHS0338
./elprep sfm PA113.bam PA113ori.output.bam --mark-duplicates --mark-optical-duplicates PA113.metrics --sorting-order coordinate \
  --bqsr PA113.recal --haplotypecaller PA113.vcf.gz --reference Autosome.elfasta --nr-of-threads 48
I wasn't aware of the multi-thread option, and while reading I found this: "The sfm subcommand tells elprep to run in sfm (split/filter/merge) mode. This is generally the preferred mode for WGS data, unless the data has very low coverage (<= 10x)."
Most of my samples will have ~5-7X coverage [100 bp PE reads]. Which subcommand should I use? I would greatly appreciate your help.
I am amazed that, according to your testing, the runtime for 50x Platinum NA12878 WGS aligned against hg38 (which has 999 contigs and 473 scaffolds) was less than 20 hours for the 5 steps, while my small datasets are taking more than 11 hours. Do you have any suggestions about what might be going wrong, besides the choice of the sfm subcommand and the --tmp-path option?
Please let me know which log file you would like me to send to you.
Thank you very much;
|
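The low-coverage question above is not answered in this thread. As a hedged sketch only: elprep also has a filter subcommand, and the documentation quote suggests sfm is not the preferred mode at or below 10x. The code below assumes that filter accepts the same options used elsewhere in this thread, which should be checked against the elprep documentation; choose_elprep_mode is a hypothetical helper that takes an integer coverage estimate.

```shell
# Pick an elprep subcommand from an integer coverage estimate, following the
# documentation quote above: sfm unless coverage is very low (<= 10x).
choose_elprep_mode() {
    coverage=$1
    if [ "$coverage" -le 10 ]; then
        echo filter
    else
        echo sfm
    fi
}

mode=$(choose_elprep_mode 6)
# Print the command for review; the options are the ones used in this thread.
echo "elprep $mode PA113.bam PA113.output.bam --mark-duplicates" \
     "--mark-optical-duplicates PA113.metrics --sorting-order coordinate" \
     "--bqsr PA113.recal --haplotypecaller PA113.vcf.gz" \
     "--reference Autosome.elfasta --nr-of-threads 48"
```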
Hi,
I got interested in elprep5 because I have been trying GATK4, but it is taking more than five days, which is ridiculous.
I wanted to ask two questions. Since I have tried GATK4, I have a sorted, duplicate-marked BAM file. Is there a way to perform just variant calling using elprep?
Also, I have done the mapping and conversion to get a .bam file as input. I am using the following job script:
./elprep sfm PA113corr.bam --mark-duplicates --mark-optical-duplicates PA113corr.metrics --sorting-order coordinate
--bqsr PA113corr.recal --haplotypecaller PA113corr.vcf.gz --reference Autosome.elfasta
and I still get an error:
elprep version 5.0.2 compiled with go1.16.4 - see http://github.com/exascience/elprep for more information.
2021/06/06 03:18:42 Filename(s) in command line missing.
Thanks;