Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outputs from only *.vcf file #95

Open
sinclairify opened this issue Dec 29, 2021 · 7 comments
Open

Outputs from only *.vcf file #95

sinclairify opened this issue Dec 29, 2021 · 7 comments
Assignees
Labels
type:enhancement any enhancement that doesn't fit into aesthetics, bug or documentation

Comments

@sinclairify
Copy link

I'd like to see how to generate mutation and lineage reports using only *.vcf as an input. The commercial lab that provides our county wastewater sequencing services provides a *.vcf , but doesn't provide the raw fastq file in an effort to protect their companies proprietary primers. They don't provide wastewater sequencing reports and process the the extracted RNA (from wastewater) as a clinical sample. The result is a few different files and I'd like to use the pigx to generate some lineage and mutation charts.

The *.vcf is generated from our commercial lab after they:

  1. Align NGS reads to human genome and the seven coronaviruses that are known to affect humans
  2. Trim the Fulgent primers from the ends of the reads that uniquely align to SARS-CoV-2 using the iVar trim utility
  3. Compute coverage pileup using Samtools mpileup utility
  4. Generate VCF using VarScan v2.4.3

We have a few outputs from them <pangolin_##_trimmed.csv>, <##_ivar_consensus_trimmed_qual.fa>, <##ivar_consensus_trimmed_qual.txt>, and <VarScan##_trimmed.vcf>.

I'm providing some files that they returned to us in late November. I'm assuming the *.vcf is the best bet. Any help would be appreciated.

SH7951.zip

@vicfabienne vicfabienne self-assigned this Jan 3, 2022
@vicfabienne vicfabienne added the type:enhancement any enhancement that doesn't fit into aesthetics, bug or documentation label Jan 10, 2022
@vicfabienne
Copy link
Collaborator

Hey, thank you for the request!
I looked into it. From what I can see it should be doable.
However, I'm not yet completely sure about how to deal with the missing Quality Control. Any analysis and calculation would have been performed under the strong assumption, that all samples are of comparable quality i.e. comparable sequencing depth across the whole genome at the mutation sites, reference genome coverage etc. pp.. Since there is no way to do this automatically with only the vcf files the reports can only be so reliant on being taken on their own. You would need to have that QC part extra.

If you still think it's a possibility that can help you I would go forward with this on an extra branch. I can't promise anything but if it works as expected I'd try to get a version working there.

@sinclairify
Copy link
Author

Hi. Thanks for offering that. Its a great way to go because we do have a broad QC numbers in some of the outputs that are provided. I will manually check a few items:

  • The *.vcf has a "qual" and it seems that everything says "pass". That may not give detail and I'll have to ask the company some more.
  • In the trimmed.csv there is a variable called "status" and I'll check choose only the files that have "passed_qc". It seems to be most files that we have received from them, so I will prioritize the next one.
  • in the ivar_consensus_trimmed.fa there is a quality score and most of them say "20". I'll have to follow up with the company, but I think that because we never give them samples higher than ct of 30 it should be OK.

I suggest proceeding and I'll ask that company about more detail. Thanks!

@jonasfreimuth
Copy link
Member

Hello,

here is a little update: I am currently working on enabling direct vcf input. However, there are some INFO fields that need to be present, namely Allele Frequency (AF) and Depth (DP). The information from both those fields is required by the downstream analysis. I tried running the pipeline on the vcf files you provided, but they are lacking that info. Also, when I try to work around this, no nucleotide info gets found by vep, which I am still investigating.

So if you (still) want to use the pigx-sars-cov-2 pipeline to analyse your data, you would probably need to get your variants called with lofreq. There is a version that should be capable of producing variant reports from lofreq vcf output alone on brach predefine_file_io in my personal repo (not thoroughly tested at all).

@sinclairify
Copy link
Author

sinclairify commented Jul 11, 2022 via email

@jonasfreimuth
Copy link
Member

jonasfreimuth commented Jul 13, 2022

FYI, development of that branch will now take place on predef-rule-io-dev, due to git reasons

jonasfreimuth added a commit to jonasfreimuth/pigx_sars-cov-2 that referenced this issue Aug 25, 2022
  This commit has a lot of consequences:
  * It changes how synonymous AA mutations are coded in the output.
  Previously the format was X123-, now it is X123X.
  * The code that deals with deletions now gets executed reliably. The
  previous condition was misspecified and would almost never work
  (specified as `!(any(is.na(...)))`  whereas `any(!is.na(...))` would
  be correct)
  * the names between the df `full` created in `detectable_deletions()`
  don't match up with the colnames passed into the function (gene_mut is
  missing from `full`), this error is fixes as `detectable_deletions()`
  will not be called any more and removed in a future commit.

  Note: There are no deletions (that I found) anywhere in the results
  section of the project dir. I only got some from running the pipeline
  on the files provided in BIMSBbioinfo#95.
jonasfreimuth added a commit that referenced this issue Aug 25, 2022
  This commit has a lot of consequences:
  * It changes how synonymous AA mutations are coded in the output.
  Previously the format was X123-, now it is X123X.
  * The code that deals with deletions now gets executed reliably. The
  previous condition was misspecified and would almost never work
  (specified as `!(any(is.na(...)))`  whereas `any(!is.na(...))` would
  be correct)
  * the names between the df `full` created in `detectable_deletions()`
  don't match up with the colnames passed into the function (gene_mut is
  missing from `full`), this error is fixes as `detectable_deletions()`
  will not be called any more and removed in a future commit.

  Note: There are no deletions (that I found) anywhere in the results
  section of the project dir. I only got some from running the pipeline
  on the files provided in #95.
@jonasfreimuth
Copy link
Member

The changes are now merged into main in #142. But I have no updates on getting nucleotide info from the files @sinclairify provided.

@sinclairify
Copy link
Author

Thank you @jonasfreimuth. Our governmental partners are working with the sequencing company to provide this (Fulgent). They had some staff changes and lost track of our progress. We will keep trying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:enhancement any enhancement that doesn't fit into aesthetics, bug or documentation
Projects
None yet
Development

No branches or pull requests

3 participants