Outputs from only *.vcf file #95

sinclairify · 2021-12-29T19:51:12Z

I'd like to see how to generate mutation and lineage reports using only a *.vcf as input. The commercial lab that provides our county wastewater sequencing services provides a *.vcf, but doesn't provide the raw fastq file, in an effort to protect their company's proprietary primers. They don't provide wastewater sequencing reports, and they process the extracted RNA (from wastewater) as a clinical sample. The result is a few different files, and I'd like to use pigx to generate some lineage and mutation charts.

The *.vcf is generated from our commercial lab after they:

Align NGS reads to human genome and the seven coronaviruses that are known to affect humans
Trim the Fulgent primers from the ends of the reads that uniquely align to SARS-CoV-2 using the iVar trim utility
Compute coverage pileup using Samtools mpileup utility
Generate VCF using VarScan v2.4.3

We have a few outputs from them <pangolin_##_trimmed.csv>, <##_ivar_consensus_trimmed_qual.fa>, <##ivar_consensus_trimmed_qual.txt>, and <VarScan##_trimmed.vcf>.

I'm providing some files that they returned to us in late November. I'm assuming the *.vcf is the best bet. Any help would be appreciated.

SH7951.zip

vicfabienne · 2022-01-10T13:39:25Z

Hey, thank you for the request!
I looked into it. From what I can see it should be doable.
However, I'm not yet completely sure how to deal with the missing quality control. Any analysis and calculation would be performed under the strong assumption that all samples are of comparable quality, i.e. comparable sequencing depth at the mutation sites, comparable reference genome coverage, and so on. Since there is no way to check this automatically from the vcf files alone, the reports would be of limited reliability on their own. You would need to do the QC part separately.

If you still think it's a possibility that can help you I would go forward with this on an extra branch. I can't promise anything but if it works as expected I'd try to get a version working there.

sinclairify · 2022-01-12T18:32:23Z

Hi. Thanks for offering that. It's a great way to go because we do have broad QC numbers in some of the outputs that are provided. I will manually check a few items:

The *.vcf has a "qual" and it seems that everything says "pass". That may not give much detail, so I'll have to ask the company for more.
In the trimmed.csv there is a variable called "status" and I'll choose only the files that have "passed_qc". That seems to be most of the files we have received from them, so I will prioritize the next one.
In the ivar_consensus_trimmed.fa there is a quality score and most of them say "20". I'll have to follow up with the company, but I think that because we never give them samples with a Ct higher than 30 it should be OK.
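
The "status" check described above could be scripted roughly as follows. This is only a minimal sketch: it assumes the pangolin CSV has a column literally named "status" with the value "passed_qc" (as described in this comment), and the function name, sample names, and lineage values in the example are made up for illustration.

```python
import csv
import io


def passed_qc_rows(csv_text, status_column="status", passing_value="passed_qc"):
    """Return only the rows whose status column equals the passing value.

    The column name and passing value are assumptions taken from the
    comment above; adjust them to match the actual file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        # Treat a missing or empty status as "not passed".
        if (row.get(status_column) or "").strip() == passing_value
    ]


# Made-up example content, not taken from the real lab output.
example_csv = (
    "taxon,lineage,status\n"
    "sample_A,AY.103,passed_qc\n"
    "sample_B,None,fail\n"
)

kept = passed_qc_rows(example_csv)
print([row["taxon"] for row in kept])
```

For a real file, the text could be read with `open(path, newline="").read()` and passed to the same function.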

I suggest proceeding and I'll ask that company about more detail. Thanks!

jonasfreimuth · 2022-07-10T21:30:39Z

Hello,

here is a little update: I am currently working on enabling direct vcf input. However, there are some INFO fields that need to be present, namely Allele Frequency (AF) and Depth (DP). The information from both those fields is required by the downstream analysis. I tried running the pipeline on the vcf files you provided, but they are lacking that info. Also, when I try to work around this, no nucleotide info gets found by vep, which I am still investigating.
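
A quick way to see whether a given vcf carries the two INFO fields mentioned above is to scan the INFO column of each record. The sketch below is not part of the pipeline; the function name and the example records are hypothetical, and it only checks for the presence of the keys AF and DP, not for sensible values.

```python
def missing_info_keys(vcf_text, required=("AF", "DP")):
    """Return the required INFO keys that are absent from at least one record."""
    missing = set()
    for line in vcf_text.splitlines():
        # Skip blank lines and header lines (## meta lines and the #CHROM line).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        # INFO is the 8th tab-separated column: key=value pairs joined by ';'.
        keys = {entry.split("=", 1)[0] for entry in fields[7].split(";") if entry}
        missing.update(key for key in required if key not in keys)
    return sorted(missing)


# Made-up records for illustration only.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"])
with_fields = "\t".join(["chr1", "100", ".", "A", "G", ".", "PASS", "DP=100;AF=0.25"])
without_fields = "\t".join(["chr1", "200", ".", "C", "T", ".", "PASS", "XY=1"])

print(missing_info_keys(header + "\n" + with_fields))
print(missing_info_keys(header + "\n" + without_fields))
```

An empty result means every record has both keys; anything else lists the keys that would need to be supplied before the downstream analysis can use the file.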

So if you (still) want to use the pigx-sars-cov-2 pipeline to analyse your data, you would probably need to get your variants called with lofreq. There is a version that should be capable of producing variant reports from lofreq vcf output alone on branch predefine_file_io in my personal repo (not thoroughly tested at all).

sinclairify · 2022-07-11T16:26:21Z

Hi, thanks Jonas. We were able to eventually obtain some raw fastq, but not for the majority of our weekly assessments. I'm going to try the predefine_file_io option (https://github.com/jonasfreimuth/pigx_sars-cov-2/tree/predefine-rule-io) that you described.


jonasfreimuth · 2022-07-13T10:28:03Z

FYI, development of that branch will now take place on predef-rule-io-dev, due to git reasons

This commit has a lot of consequences:
* It changes how synonymous AA mutations are coded in the output. Previously the format was X123-, now it is X123X.
* The code that deals with deletions now gets executed reliably. The previous condition was misspecified and would almost never work (specified as `!(any(is.na(...)))` whereas `any(!is.na(...))` would be correct).
* The column names of the df `full` created in `detectable_deletions()` don't match up with the colnames passed into the function (gene_mut is missing from `full`). This error is fixed because `detectable_deletions()` will no longer be called; it will be removed in a future commit.
Note: There are no deletions (that I found) anywhere in the results section of the project dir. I only got some from running the pipeline on the files provided in BIMSBbioinfo#95.
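
The difference between the two conditions quoted in the commit message can be illustrated outside of R. The sketch below mirrors them in Python under the assumption that `None` stands in for R's `NA`; the helper `is_na` and the example values are hypothetical and do not come from the pipeline.

```python
def is_na(value):
    """Stand-in for R's is.na(): here a missing value is represented by None."""
    return value is None


# Made-up column: one entry is present, the others are missing.
values = ["X", None, None]

# Mirrors !(any(is.na(x))): true only when no value at all is missing.
none_missing = not any(is_na(v) for v in values)

# Mirrors any(!is.na(x)): true as soon as at least one value is present.
some_present = any(not is_na(v) for v in values)

print(none_missing, some_present)
```

With a mostly missing column the first condition is false even though there is data to process, which matches the commit's description of the old check almost never firing.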


jonasfreimuth · 2022-09-18T11:27:19Z

The changes are now merged into main in #142. But I have no updates on getting nucleotide info from the files @sinclairify provided.

sinclairify · 2022-10-12T15:30:11Z

Thank you @jonasfreimuth. Our governmental partners are working with the sequencing company (Fulgent) to provide this. They had some staff changes and lost track of our progress. We will keep trying.

vicfabienne self-assigned this Jan 3, 2022

vicfabienne added the type:enhancement any enhancement that doesn't fit into aesthetics, bug or documentation label Jan 10, 2022
