Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kSNP4] Add new task/workflow for phylogenetic/SNP analysis #759

Merged
merged 11 commits into from
Feb 20, 2025

Conversation

MrTheronJ
Copy link
Contributor

@MrTheronJ MrTheronJ commented Feb 14, 2025

This PR closes #

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

⚡ Impacted Workflows/Tasks

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

Created a new workflow for kSNP4, representing version 4.1.

⚙️ Algorithm

Algorithmically, there are no significant changes between kSNP3 and kSNP4. According to their paper,

"kSNP4 restored the annotation function that was lost when NCBI discontinued the use of gi numbers and is more than twice as fast as was kSNP3."

Additionally, their README states that upgrading from version 3.1.2 to 4.1 will significantly reduce runtime, with most changes being transparent to the user.

➡️ Inputs

All inputs/flags remain the same as in kSNP3 except for a few that have been removed. Notably, these removed flags were not used in the previous kSNP3 task/workflow. The following flags have been removed:

  • --genbank <genbank.gbk> # optional file for SNP annotation
  • --all_annotations # optional, annotate each locus exhaustively with all the annotations in any of the annotated genomes. Without this option it only provides the first annotation it comes to for a given locus, checking in the order genomes are listed in the -annotate file.

⬅️ Outputs

One default output file in kSNP4 was renamed, while all other outputs remain unchanged from kSNP3. See line 65 in task_ksnp4.wdl for details. To clarify, kSNP3 originally outputs a file named tree.core.tre, but in kSNP4, this has been renamed to tree.core_SNPs.parsimony.tre. However, in both workflows, we automatically rename this file to ~{cluster_name}_core.nwk for consistency.

🧪 Testing

Tested kSNP4 locally with miniwdl and on Terra using a set of samples from the phylogenetics_bacterial_set data table from the PHB_Validation_v2-3-0 workspace. To assess differences between kSNP3 and kSNP4, I compared all of the major output files from each workflow using diff -u. This allowed me to check for any changes in SNP count or tree structure.

Files Compared:
(Terra output variables are listed above their corresponding file name.)

  • ksnp4_core_snp_matrix
    • ~{cluster_name}_core_distance_matrix.csv
  • ksnp4_core_tree
    • ~{cluster_name}_core_tree.nwk
  • ksnp4_pan_tree
    • ~{cluster_name}_pan_tree.nwk
  • ksnp4_vcf_ref_genome
    • ~{cluster_name}_VCF.reference_genome.vcf
  • ksnp4_core_snp_table
    • ~{cluster_name}_core_snp_table.csv
  • ksnp4_pan_snp_matrix
    • ~{cluster_name}_pan_distance_matrix.csv
  • ksnp4_vcf_snps_not_in_ref
    • ~{cluster_name}VCF.SNPsNotinRef.tsv
  • ksnp4_snps
    • ~{cluster_name}_SNPs_all

Results:

  1. The ~{cluster_name}_SNPs_all files showed slight differences across all sample sets, exclusively due to sorting order. After adjusting for these sorting discrepancies, the contents were found to be identical.
  2. For the "cjejuni_0810PADBR-1" and "senterica_simulated" sample sets, the pan_tree.nwk and pan_distance_matrix.csv had slight differences (mostly due to changes in order).
  3. For "senterica_tuna_scrape" sample set, the core_tree.nwk and core_distance_matrix.csv also showed minor differences.
  4. All other outputs were identical across all sample sets.

Terra kSNP3 Test
Terra kSNP4 Test

diff -u STDOUT cjejuni_0810PADBR-1
diff-u STDOUT senterica_simulated
diff -u STDOUT senterica_tuna_scrape
diff -u STDOUT ecoli_1405WAEXK-1 (empty file)
diff -u STDOUT lmonocytogenes_1408MLGX6-3WGS (empty file)

Suggested Scenarios for Reviewer to Test

It might be beneficial to conduct additional tests using some of the optional arguments (or other sample sets). I tested the optional arguments locally, and they work, but I haven’t fully assessed their impact in comparison to kSNP3.

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable
    • You have updated the "Last Known Changes" field for any affected workflows in the respective workflow documentation page and for every entry in the three workflows_overview tables to be the tag for the next upcoming release. If you do not know the tag, please put "vX.X.X"

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@MrTheronJ MrTheronJ changed the title Add kSNP4 task and workflow for phylogenetic/SNP analysis [kSNP4] Add new task/workflow for phylogenetic/SNP analysis Feb 19, 2025
@MrTheronJ MrTheronJ marked this pull request as ready for review February 19, 2025 19:41
@MrTheronJ MrTheronJ requested a review from a team as a code owner February 19, 2025 19:41
@Michal-Babins
Copy link
Contributor

Tests done for ksnp4 with flu
Test done for ksnp4 with sars-cov-2

Tested ksnp3

Everything looks good

Copy link
Contributor

@Michal-Babins Michal-Babins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation looks good. Code additions look good. Maintains ksnp3 scheme but updates for 4. Good work. Tests are passing.

@Michal-Babins Michal-Babins merged commit cf7a742 into main Feb 20, 2025
7 checks passed
@Michal-Babins Michal-Babins deleted the tj-ksnp4-dev branch February 20, 2025 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants