Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COSMIC NEWER VERSION SUPPORT(v94 and later) #3698

Open
wangpenhok opened this issue Feb 13, 2023 · 6 comments
Open

COSMIC NEWER VERSION SUPPORT(v94 and later) #3698

wangpenhok opened this issue Feb 13, 2023 · 6 comments

Comments

@wangpenhok
Copy link

wangpenhok commented Feb 13, 2023

Version info

  • bcbio version :1.2.9
  • OS name and version :Ubuntu 20.04.5 LTS

To Reproduce
Exact bcbio command you have used:

bcbio_python  prepare_cosmic.py 94  ~/

Log files (could be found in work/log)

`2023-02-13 14:22:35,700 [I] Beginning COSMIC v94 prep for GRCh37.
2023-02-13 14:22:35,700 [I] Beginning COSMIC v94 prep for GRCh38.
2023-02-13 14:22:35,700 [I] Downloading COSMIC VCF files.
2023-02-13 14:22:35,700 [I] Downloading https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/VCF/CosmicCodingMuts.vcf.gz
2023-02-13 14:25:13,610 [I] Downloading https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/VCF/CosmicNonCodingVariants.vcf.gz
2023-02-13 14:26:18,927 [I] Sorting v94/GRCh38/CosmicCodingMuts.vcf.gz to match the order of /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.

gsort version 0.1.4
[W::vcf_parse] Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::vcf_format] Invalid BCF, CONTIG id=0 not present in the header
[flush_buffer] Error: cannot write to -
Failed to read from standard input: unknown file type
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x546d43]
goroutine 1 [running]:
bufio.NewReaderSize(...)
/home/brentp/go/src/bufio/bufio.go:49
bufio.NewReader(...)
/home/brentp/go/src/bufio/bufio.go:62
github.com/brentp/gsort.Sort(0x7ad380, 0x0, 0x7ad3a0, 0xc00061bb00, 0xc00061bac0, 0xaf0, 0x0, 0x0, 0x0)
/home/brentp/go/go/src/github.com/brentp/gsort/gsort.go:116 +0x63
main.main()
/home/brentp/go/go/src/github.com/brentp/gsort/cmd/gsort/gsort.go:373 +0x513
Failed to read from standard input: unknown file type
2023-02-13 14:26:19,012 [I] bgzipping and indexing v94/GRCh38/CosmicCodingMuts-prep.vcf.gz.
2023-02-13 14:26:19,058 [I] Sorting v94/GRCh38/CosmicNonCodingVariants.vcf.gz to match the order of /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.
gsort version 0.1.4
[W::vcf_parse] Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::vcf_format] Invalid BCF, CONTIG id=0 not present in the header
[flush_buffer] Error: cannot write to -
Failed to read from standard input: unknown file type
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x546d43]
goroutine 1 [running]:
bufio.NewReaderSize(...)
/home/brentp/go/src/bufio/bufio.go:49
bufio.NewReader(...)
/home/brentp/go/src/bufio/bufio.go:62
github.com/brentp/gsort.Sort(0x7ad380, 0x0, 0x7ad3a0, 0xc000629d40, 0xc000629d00, 0xaf0, 0x0, 0x0, 0x0)
/home/brentp/go/go/src/github.com/brentp/gsort/gsort.go:116 +0x63
main.main()
/home/brentp/go/go/src/github.com/brentp/gsort/cmd/gsort/gsort.go:373 +0x513
Failed to read from standard input: unknown file type
2023-02-13 14:26:19,137 [I] bgzipping and indexing v94/GRCh38/CosmicNonCodingVariants-prep.vcf.gz.
2023-02-13 14:26:19,175 [I] Combining COSMIC files to v94/bcbio_ready/hg38/cosmic.vcf.gz.
INFO 2023-02-13 14:26:19 MergeVcfs
********** NOTE: Picard's command line syntax is changing.


********** For more information, please see:


https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


********** The command line looks like this in the new syntax:


********** MergeVcfs -O v94/bcbio_ready/hg38/cosmic.vcf.gz -D /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.dict -I v94/GRCh38/CosmicCodingMuts-prep.vcf.gz -I v94/GRCh38/CosmicNonCodingVariants-prep.vcf.gz -USE_JDK_DEFLATER true -USE_JDK_INFLATER true -CREATE_INDEX false


[Mon Feb 13 14:26:19 CST 2023] MergeVcfs INPUT=[v94/GRCh38/CosmicCodingMuts-prep.vcf.gz, v94/GRCh38/CosmicNonCodingVariants-prep.vcf.gz] OUTPUT=v94/bcbio_ready/hg38/cosmic.vcf.gz SEQUENCE_DICTIONARY=/home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.dict CREATE_INDEX=false USE_JDK_DEFLATER=true USE_JDK_INFLATER=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Feb 13 14:26:19 CST 2023] Executing as bcbio@dell-PowerEdge-R7525 on Linux 5.15.0-46-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_332-b09; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.27.5
[Mon Feb 13 14:26:19 CST 2023] picard.vcf.MergeVcfs done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=514850816
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: file:///home/data/bcbio/cosmic-prep/v94/GRCh38/CosmicCodingMuts-prep.vcf.gz
at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:97)
at htsjdk.tribble.TabixFeatureReader.(TabixFeatureReader.java:82)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:117)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:81)
at htsjdk.variant.vcf.VCFFileReader.(VCFFileReader.java:145)
at picard.vcf.MergeVcfs.doWork(MergeVcfs.java:182)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:95)
... 8 more
Traceback (most recent call last):
File "prepare_cosmic.py", line 246, in
main(args.cosmic_version, args.bcbio_directory, args.overwrite, args.clean)
File "prepare_cosmic.py", line 62, in main
ready_cosmic = combine_cosmic(sorted_inputs, bcbio_ref, out_file)
File "prepare_cosmic.py", line 153, in combine_cosmic
subprocess.check_call(cmd)
File "/home/data/bcbio/anaconda/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['picard', 'MergeVcfs', 'O=v94/bcbio_ready/hg38/cosmic.vcf.gz', 'D=/home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.dict', 'I=v94/GRCh38/CosmicCodingMuts-prep.vcf.gz', 'I=v94/GRCh38/CosmicNonCodingVariants-prep.vcf.gz', 'USE_JDK_DEFLATER=true', 'USE_JDK_INFLATER=true', 'CREATE_INDEX=false']' returned non-zero exit status 1.
`

I have also tried COSMIC version 93 and earlier versions such as v90 , but it seems the VCF files are not available anymore for these versions. Furthermore, for version from v94 on, though the vcf files could be downloaded successfully, the following process always halted because of error:
Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
I assume format changes occurred to these newer versions of vcf files. Could you please update the bcbio_nextgen.py script so as to avoid such bugs? Thanks ~

@wangpenhok
Copy link
Author

I looked through the python script to prepare cosmic.vcf.gz for hg38 manually. Interestingly, I found the problem arises at the CMD bcftools norm --check-ref s --do-not-normalize -f {ref_file} , which fails because the COSMIC.pre.vcf does not contain contig information in its header, similar to the problem described here: samtools/bcftools#766. So, I tried bgip the COSMIC.pre.vcf and tabix it, and then the following steps went well until bcftools view -e 'SNP=1'. But if you skip this cmd, you can finally run through all commands.

So my queston is the step bcftools view -e 'SNP=1' important for annotation? And how could we fix this problem? I didn't see any info regarding "SNP" in the header of the original COSMIC.vcf files, possibly due to the updated version.

@wangpenhok wangpenhok changed the title COSMIC NEWER VERSION(v94 and later) COSMIC NEWER VERSION SUPPORT(v94 and later) Feb 15, 2023
@lbeltrame
Copy link
Contributor

That command excludes the known SNPs (non-tumor) from the COSMIC data. I believe they don't ship these variants anymore, though.

@wangpenhok
Copy link
Author

Yeah , so I skipped this step as well. I manually prepared vcf files of CodingVariants and NonCodingVariants separately. This time, the problem arises when I tried to merge the above two vcf files with error indicating that the contigs of them are not compatible. It is strange that even if I normalized according to the same reference genome, the contigs of NonCodingVcf is ordered as "chr1, chr2, chr3, chr 4....." while the CodingVcf is ordered as "chr1, chr10, chr11, chr12....chr2, chr22..."

@wangpenhok
Copy link
Author

I cannot figure out what is wrong

@lbeltrame
Copy link
Contributor

Did you try reordering the contigs so they all have the same order?

@wangpenhok
Copy link
Author

wangpenhok commented Feb 24, 2023

Thanks for your advice, I may try this later. But it still confused me how come different orders are given after the same operation(exactly the same gunzip-normalize-gsort process as is shown below)

gunzip -c CosmicNonCodingVariants.vcf.gz | sed "s/^\([0-9]\+\)\t/chr\1\t/g" | sed "s/^MT/chrM/g" | sed "s/^X/chrX/g" | sed "s/^Y/chrY/g" > CosmicNonCodingVariants-fix-chrom.vcf.gz

bgzip -c CosmicNonCodingVariants-fix-chrom.vcf.gz > CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz

tabix CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz

bcftools norm --check-ref s --do-not-normalize -f /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa CosmicNonCodingVariants-fix-chrom-bgzip.vcf.gz > CosmicNonCodingVariants-fix-chrom-bgzip-norm.vcf.gz
##Lines total/split/realigned/skipped: 32580266/0/0/0
##REF/ALT total/modified/added: 32580266/0/318

gsort --memory 20000 CosmicNonCodingVariants-fix-chrom-bgzip-norm.vcf.gz /home/data/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.fai > CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz

grep -v ^##contig CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz | bcftools annotate -h ./v97/GRCh38/CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort-contig_header.txt

bgzip -c CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort.vcf.gz > CosmicNonCodingVariants-prep.vcf.gz

tabix CosmicNonCodingVariants-prep.vcf.gz

mv CosmicNonCodingVariants-fix-chrom-bgzip-norm-gsort-contig_header.txt CosmicNonCodingVariants-prep-contig_header.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants