[palmDB] Generate palmprint-qc SQL tables for palmdb v2 #153

Open
3 tasks
ababaian opened this issue Nov 4, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

ababaian commented Nov 4, 2024

Overview

Open Virome uses palmDB (https://github.com/ababaian/palmdb) as its reference database. Each palmprint call in palmDB was made by one of three tools: palmscan v1, palmscan v2, or palm_annot.

In addition, the sequence QC table for each sequence will be used to compare basic sequence statistics across palmprints.

Background / Context

A subset of palmDB calls are currently "False Positives", defined for example as a sequence with one of the following errors:

  • Not-RdRp Error: Sequence is not an RdRp
  • Not-Palmprint Error: Sequence is an RdRp, but the selected palmprint is not a well-formed palmprint as defined by Motifs A, B, and C

For a complete set of TP/FP categories, see: #154

Hypothesis

The distributions (simple histograms) of scores and statistical values will differ significantly (t-test) between positive-control and negative-control palmprint sequences.

The difference in these values will define a "space" in which to interpret unknown sequences and to assign a False Positive Rate to each palmDB sequence based on its scores.
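
As a rough illustration of this hypothesis, the sketch below compares two score lists with a t-test and estimates a false positive rate at a score cutoff. It makes several assumptions that are not part of this issue: it uses Welch's unequal-variance form of the t-test, the scores are made-up numbers, and the function names and the cutoff of 30 are hypothetical. It reports the t statistic and degrees of freedom only; a p-value would need a t-distribution CDF from a statistics library.

```python
import math
from statistics import mean, variance


def welch_t(pos, neg):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    vp = variance(pos) / len(pos)  # squared standard error, positive controls
    vn = variance(neg) / len(neg)  # squared standard error, negative controls
    t = (mean(pos) - mean(neg)) / math.sqrt(vp + vn)
    df = (vp + vn) ** 2 / (vp ** 2 / (len(pos) - 1) + vn ** 2 / (len(neg) - 1))
    return t, df


def false_positive_rate(neg, threshold):
    """Fraction of negative-control scores at or above the threshold."""
    return sum(score >= threshold for score in neg) / len(neg)


# Made-up scores for illustration only (not real palmscan or palm_annot output).
pos_scores = [42.0, 45.5, 39.8, 47.1, 44.3]
neg_scores = [12.4, 18.9, 25.0, 9.7, 30.2]

t_stat, dof = welch_t(pos_scores, neg_scores)
fpr_at_30 = false_positive_rate(neg_scores, 30.0)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, FPR at cutoff 30 = {fpr_at_30:.2f}")
```

Sweeping the threshold over the observed score range would give one false positive rate per cutoff, which is one possible way to read the "space" described above.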

Experiment

For each palmprint in palmDB v2, run the following and store the results in a defined (standardized) SQL table:

  • palmscan2
  • palm_annot
  • Primary sequence properties (Length; AA composition; stop codons ...)
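
A minimal sketch of the third item, primary sequence properties, stored in a SQL table. Everything named here is an assumption for illustration: the table name `palmprint_qc`, its columns, the `palm_id` key, and the use of an in-memory SQLite database as a stand-in for the actual SQL server. Stop codons are assumed to appear as `*` in the amino-acid sequence. Columns for palmscan2 and palm_annot results are left out because their output fields are not specified in this issue.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical schema; the real table layout is still to be defined.
SCHEMA = """
CREATE TABLE IF NOT EXISTS palmprint_qc (
    palm_id        TEXT PRIMARY KEY,  -- hypothetical palmprint identifier
    seq_length     INTEGER NOT NULL,  -- residues, excluding stop codons
    stop_codons    INTEGER NOT NULL,  -- count of '*' characters
    aa_composition TEXT NOT NULL      -- JSON map of residue -> count
)
"""


def sequence_properties(seq):
    """Compute basic statistics for one amino-acid sequence."""
    seq = seq.strip().upper()
    stops = seq.count("*")
    counts = Counter(seq.replace("*", ""))
    return {
        "seq_length": sum(counts.values()),
        "stop_codons": stops,
        "aa_composition": json.dumps(counts, sort_keys=True),
    }


def store_properties(conn, palm_id, seq):
    """Insert or update the QC row for one palmprint."""
    props = sequence_properties(seq)
    conn.execute(
        "INSERT OR REPLACE INTO palmprint_qc VALUES (?, ?, ?, ?)",
        (palm_id, props["seq_length"], props["stop_codons"], props["aa_composition"]),
    )


conn = sqlite3.connect(":memory:")  # stand-in for the real SQL server
conn.execute(SCHEMA)
store_properties(conn, "example_1", "MKDG*LLA")  # toy sequence, not a real palmprint
row = conn.execute(
    "SELECT seq_length, stop_codons FROM palmprint_qc WHERE palm_id = ?",
    ("example_1",),
).fetchone()
print(row)
```

Keeping one row per palmprint keyed on an identifier would let tool scores be joined to these properties later, but that design choice is only a suggestion.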

Controls

Positive Controls: A subset of ICTV-confirmed RdRp sequences. These are certainly "RdRp", and when placed in a Multiple Sequence Alignment (i.e. Wolf 18), motifs A, B, and C will align to one another. This set is limited with respect to highly divergent RdRp, which are less likely to be in the "reference" or "known" virome set, but it will define a space of algorithm scores that are "very high quality".

Negative Controls: In the course of experimental analysis, many FP hits have come up, either Not-RdRp or Not-Palmprint. A separate issue will be needed to aggregate as many examples of these sequences as possible into one set.

Expected Outcome

The results of palmscan and palm_annot analysis for each palmprint in palmDB2 will be stored in a relational database hosted on the logan SQL server.

Open Questions

No response

References

No response

@ababaian ababaian added the enhancement New feature or request label Nov 4, 2024
@ababaian ababaian self-assigned this Nov 4, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Open Virome Nov 4, 2024
@ababaian ababaian changed the title Generate QC Tables for palmdb v2 Generate palmprint-qc SQL tables for palmdb v2 Nov 4, 2024
@ababaian ababaian changed the title Generate palmprint-qc SQL tables for palmdb v2 [palmDB] Generate palmprint-qc SQL tables for palmdb v2 Nov 4, 2024