panGenomeBreedr (panGB) is conceived as a unified platform for
pangenome-enabled breeding that follows standardized conventions for
natural or causal variant analysis using pangenomes, marker design, and
marker QC hypothesis testing (Figure 1). It seeks to simplify the use of
pangenome resources to support plant breeding decisions during cultivar
development.
In its current development version, panGB provides customizable
functions for KASP marker design and validation (Steps 2 and 3 in
Figure 1).
panGB will host a user-friendly Shiny application to enable non-R
users to access its functionalities outside R.
LGC Genomics’ current visualization tool is platform-specific: the SNP
Viewer program runs only on Windows, which prevents Mac and other
non-Windows users from using it. SNP Viewer also does not incorporate
standardized conventions for visualizing predictions for positive
controls, which are needed to fully validate a marker. This makes it
difficult for users to validate markers conclusively with the existing
tool. panGB provides platform-independent functions that let users
perform hypothesis testing on KASP marker QC and validation.
Submit bug reports and feature suggestions, or track changes on the issues page.
- Requirements
- Recommended packages
- Installation
- Usage
- Troubleshooting
- Authors and contributors
- License
- Support and Feedback
To run this package locally on a machine, the following R packages are required:
- ggplot2: Elegant Graphics for Data Analysis.
- gridExtra: Miscellaneous Functions for “Grid” Graphics.
- utils: The R Utils Package.
- Rtools: Needed for package development and installation from GitHub on Windows PCs.
- rmarkdown: When installed, display of the project’s README.md will be rendered with R Markdown.
You can install the development version of panGenomeBreedr from GitHub with:
# install.packages("pak")
pak::pkg_install("awkena/panGenomeBreedr")
panGB depends on a list of Bioconductor packages that may not be
installed automatically alongside panGB. To manually install these
packages, use the code snippet below:
# Install and load required Bioconductor packages
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("VariantAnnotation",
"Biostrings",
"GenomicRanges",
"IRanges",
"msa"))
Currently, panGB has functionality for KASP marker design based on
causal variants and QC visualizations for marker validation.
Here, we provide examples of how to use panGB to design a KASP marker
based on a causal variant, and how to validate any KASP marker.
The kasp_marker_design() function provides a simplified approach to
designing a KASP marker based on identified causal variants.
The user needs two input files to run kasp_marker_design(): the
whole-genome or chromosome-specific sequence of the target crop, and a
vcf file containing variant calls from a putative causal variant
analysis pipeline.
The vcf file must contain the Chromosome ID, Position, locus ID, REF and ALT alleles, as well as the genotype data for samples, as shown below in Table 1:
CHROM | POS | ID | REF | ALT | IDMM | ISGC | ISGK | ISHC | ISHJ |
---|---|---|---|---|---|---|---|---|---|
Chr02 | 69197088 | SNP_Chr02_69197088 | G | A | 0\|0 | 0\|0 | 0\|0 | 0\|0 | 0\|0 |
Chr02 | 69197120 | SNP_Chr02_69197120 | G | C | 0\|0 | 0\|0 | 0\|0 | 0\|0 | 0\|0 |
Chr02 | 69197131 | SNP_Chr02_69197131 | G | T | 0\|0 | 0\|0 | 0\|0 | 0\|0 | 0\|0 |
Chr02 | 69197209 | SNP_Chr02_69197209 | G | T | 0\|0 | 0\|0 | 0\|0 | 0\|0 | 0\|0 |
Chr02 | 69197294 | SNP_Chr02_69197294 | G | A | 0\|0 | 0\|0 | 0\|0 | 0\|0 | 0\|0 |
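Before running the design step, it can help to confirm that a vcf file carries the columns listed above. The base R sketch below is only an illustration and is not part of panGB: the helper name check_vcf_columns() is hypothetical, and the small vcf file is made up for the example, reusing the first row of Table 1.

```r
# Hypothetical helper (not a panGB function): check that a vcf file has the
# columns described above and report its sample (genotype) columns.
check_vcf_columns <- function(vcf_path) {
  lines <- readLines(vcf_path)
  # The column header is the single line that starts with "#CHROM"
  header <- grep("^#CHROM", lines, value = TRUE)
  if (length(header) != 1) stop("Expected exactly one #CHROM header line.")
  cols <- strsplit(sub("^#", "", header), "\t", fixed = TRUE)[[1]]
  required <- c("CHROM", "POS", "ID", "REF", "ALT")
  fixed_cols <- c(required, "QUAL", "FILTER", "INFO", "FORMAT")
  list(missing = setdiff(required, cols),
       samples = setdiff(cols, fixed_cols))
}

# Minimal made-up vcf written to a temporary file for illustration
tmp_vcf <- tempfile(fileext = ".vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "IDMM", "ISGC"), collapse = "\t"),
  paste(c("Chr02", "69197088", "SNP_Chr02_69197088", "G", "A", ".", "PASS",
          ".", "GT", "0|0", "0|0"), collapse = "\t")
), tmp_vcf)

res <- check_vcf_columns(tmp_vcf)
res$missing  # empty when all required columns are present
res$samples  # names of the sample columns
```

An empty missing component suggests the file has the layout shown in Table 1; it does not validate the genotype values themselves.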
# Example to design a KASP marker on a substitution variant
# Set path to alignment output folder
library(panGenomeBreedr)
path <- tempdir() # (default directory for saving alignment outputs)
# Path to import sorghum genome sequence for Chromosome 2
path1 <- "https://raw.githubusercontent.com/awkena/panGB/main/Chr02.fa.gz"
# Path to import vcf file for variant calls on Chromosome 2
path2 <- system.file("extdata", "Sobic.002G302700_SNP_snpeff.vcf",
package = "panGenomeBreedr",
mustWork = TRUE)
# KASP marker design for variant ID: SNP_Chr02_69200443 in vcf file
ma1 <- kasp_marker_design(vcf_file = path2,
genome_file = path1,
marker_ID = "SNP_Chr02_69200443",
chr = "Chr02",
plot_draw = TRUE,
plot_file = path,
vcf_geno_code = c('1|1', '0|1', '0|0', '.|.'),
region_name = "ma1",
maf = 0.05)
#> using Gonnet
# View marker alignment output from temp folder
path3 <- file.path(path, list.files(path = path, "alignment_"))
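# Open the PDF from R ('open' is a macOS command; utils::browseURL(path3) is a portable alternative)
system(paste0('open "', path3, '"'))
# Remove the alignment output when done (on.exit() only takes effect inside a function)
unlink(path3)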
In the kasp_marker_design() function call above, the user must specify
the path to the genome sequence and vcf files using the genome_file
and vcf_file arguments, respectively. The user must specify the ID for
the variant in the vcf file using the marker_ID argument.
To save memory and enhance the computational speed, the chr argument
can be specified to access only the chromosome sequence of the chosen
variant from the genome sequence.
The vcf_geno_code argument is used to specify the genotype coding in
the vcf file – either phased (1|1) or unphased (1/1) coding.
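To decide which coding applies to a given file, the separator in the genotype strings can be inspected. The sketch below is an illustration in base R, not a panGB function: gt_phasing() is a hypothetical helper, and the unphased vector is derived on the assumption that it follows the same ordering as the phased vector used in the call above.

```r
# Hypothetical helper: report whether genotype strings use the phased ("|")
# or unphased ("/") separator.
gt_phasing <- function(gt) {
  phased <- grepl("|", gt, fixed = TRUE)
  unphased <- grepl("/", gt, fixed = TRUE)
  if (all(phased)) {
    "phased"
  } else if (all(unphased)) {
    "unphased"
  } else {
    "mixed"
  }
}

# Phased coding as used in the example call above
phased_code <- c("1|1", "0|1", "0|0", ".|.")
# Assumed unphased equivalent: same order, "/" instead of "|"
unphased_code <- gsub("|", "/", phased_code, fixed = TRUE)

gt_phasing(phased_code)
gt_phasing(unphased_code)
unphased_code
```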
Setting plot_draw = TRUE returns the alignment of the 100 bp upstream
and downstream sequences to the imported reference genome as a PDF file
(Figure 2).
The plot_file argument specifies the path to the directory where the
alignment should be saved – default is a temporary directory.
Fig. 2. Alignment of the 100 bp upstream and downstream sequences to the reference genome used for KASP marker design.
The required sequence for submission to Intertek for the designed KASP marker is shown in Table 2.
SNP_Name | SNP | Marker_Name | Chromosome | Chromosome_Position | Sequence | ReferenceAllele | AlternativeAllele |
---|---|---|---|---|---|---|---|
SNP_Chr02_69200443 | Substitution | ma1 | Chr02 | 69200443 | TAGTTTGATGTTTGCCTTACAATTTGATTTGATGGCAATACCTTTTCCATTTTATCAGCATCTACACCATTTTATATCTTTGGATTAGATTTTTTTTWAA\[A/T\]AAAAAAGTAATATGTTTGTTATGTGCTTTACTCAACAAGATCTACATTTTAAATTAGCTACTTTTTACCATCTTATTTGTTTGTTGTGTGTTTTATTCAA | A | T |
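The Sequence column places the two alleles in square brackets between the upstream and downstream flanks. As a quick sanity check before submission, such a string can be split into its parts. This is a minimal base R sketch: split_kasp_seq() is a hypothetical helper, and the short sequence is invented for the example.

```r
# Hypothetical helper: split a "FLANK[REF/ALT]FLANK" string into its parts
# and report the length of each flank.
split_kasp_seq <- function(seq) {
  pattern <- "^([A-Za-z]*)\\[([A-Za-z-]+)/([A-Za-z-]+)\\]([A-Za-z]*)$"
  m <- regmatches(seq, regexec(pattern, seq))[[1]]
  if (length(m) == 0) stop("Sequence is not in FLANK[REF/ALT]FLANK format.")
  list(upstream = m[2], ref = m[3], alt = m[4], downstream = m[5],
       flank_lengths = c(upstream = nchar(m[2]), downstream = nchar(m[5])))
}

# Short invented sequence; a real design would carry 100 bp on each side
parts <- split_kasp_seq("ACGTW[A/T]AAGC")
parts$ref
parts$alt
parts$flank_lengths
```

For a sequence produced by the design step, both flank lengths would be expected to equal 100.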
The following example demonstrates how to use the customizable functions
in panGB to perform hypothesis testing of allelic discrimination for
KASP marker QC and validation.
panGB offers customizable functions for KASP marker validation through
hypothesis testing. These functions allow users to easily perform the
following tasks:
- Import raw or polished KASP genotyping results files (.csv) into R.
- Process imported data and assign FAM and HEX fluorescence colors for multiple plates.
- Visualize marker QC using FAM and HEX fluorescence scores for each sample.
- Validate the effectiveness of trait-predictive or background markers using positive controls.
- Visualize plate design and randomization.
The read_kasp_csv() function allows users to import raw or polished
KASP genotyping full results files (.csv) into R. The function requires
the path to the raw file and the row tags for the different components
of data in the raw file as arguments.
For polished files, the user must extract the Data component of the
full results file and save it as a csv file before import.
By default, a typical unedited raw KASP data file uses the following row
tags for genotyping data: Statistics, DNA, SNPs, Scaling, Data.
The raw file is imported as a list object in R. Thus, all components in the imported data can be extracted using the row tag ID as shown in the code snippet below:
# Import raw KASP genotyping file (.csv) using the read_kasp_csv() function
library(panGenomeBreedr)
# Set path to the directory where your data is located
# path1 <- "inst/extdata/Genotyping_141.010_01.csv"
path1 <- system.file("extdata", "Genotyping_141.010_01.csv",
package = "panGenomeBreedr",
mustWork = TRUE)
# Import raw data file
file1 <- read_kasp_csv(file = path1,
row_tags = c("Statistics", "DNA", "SNPs", "Scaling", "Data"),
data_type = 'raw')
# Get KASP genotyping data for plotting
kasp_dat <- file1$Data
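To make the role of the row tags concrete, the sketch below splits a tagged text file into a named list of data frames using base R. It is an illustration only and not how read_kasp_csv() is implemented: split_by_row_tags() is a hypothetical helper, the input lines are invented, and real KASP files are assumed to be more elaborate than this simplified layout.

```r
# Hypothetical helper: split lines into one data frame per row tag.
# Assumes each tag sits alone on its own line and is followed by a header
# row plus at least one data row.
split_by_row_tags <- function(lines, row_tags) {
  lines <- lines[nzchar(trimws(lines))]   # drop blank lines
  starts <- which(lines %in% row_tags)    # positions of the tag rows
  ends <- c(starts[-1] - 1, length(lines))
  out <- lapply(seq_along(starts), function(i) {
    block <- lines[(starts[i] + 1):ends[i]]
    read.csv(text = block, stringsAsFactors = FALSE)
  })
  names(out) <- lines[starts]
  out
}

# Invented, heavily simplified file content
raw_lines <- c("Statistics", "Plate,Calls", "P1,2", "",
               "Data", "Well,Call", "A01,A:A", "A02,A:T")

parsed <- split_by_row_tags(raw_lines, row_tags = c("Statistics", "Data"))
names(parsed)
parsed$Data
```

Each component is then reachable by its tag, which mirrors the file1$Data access used above.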
The next step after importing data is to assign FAM and HEX fluorescence
colors to samples based on their observed genotype calls. This step is
accomplished using the kasp_color() function in panGB as shown in
the code snippet below:
# Assign KASP fluorescence colors using the kasp_color() function
library(panGenomeBreedr)
# Create a subset variable called plates: masterplate x snpid
kasp_dat$plates <- paste0(kasp_dat$MasterPlate, '_',
kasp_dat$SNPID)
dat1 <- kasp_color(x = kasp_dat,
subset = 'plates',
sep = ':',
geno_call = 'Call',
uncallable = 'Uncallable',
unused = '?',
blank = 'NTC')
The kasp_color() function requires the KASP genotype call file as a
data frame and can do bulk processing if there are multiple master
plates. The default values for the arguments in the kasp_color()
function are based on KASP annotations.
The kasp_color() function calls the kasp_pch() function to
automatically add PCH plotting symbols that can equally be used to group
genotypic clusters on the plot.
When expected genotype calls are available for positive controls in KASP genotyping samples, we recommend the use of the PCH symbols for grouping observed genotypes instead of FAM and HEX colors.
The kasp_color() function expects diploid genotype calls with the two
alleles separated by a symbol. By default, alleles in KASP data are
separated by the : symbol.
The kasp_color() function returns a list object with the processed
data for each master plate as the components.
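The diploid call format can be illustrated with a small base R sketch that splits each call on the separator and labels it as homozygous or heterozygous. This is not the kasp_color() implementation: call_zygosity() is a hypothetical helper, and the example calls are invented, using the NTC and ? annotations from the snippet above as non-genotype entries.

```r
# Hypothetical helper: classify diploid calls such as "A:A" or "A:T".
# Entries that do not split into exactly two alleles return NA.
call_zygosity <- function(calls, sep = ":") {
  vapply(strsplit(calls, sep, fixed = TRUE), function(alleles) {
    if (length(alleles) != 2) return(NA_character_)
    if (alleles[1] == alleles[2]) "homozygous" else "heterozygous"
  }, character(1))
}

# Invented calls, including blank (NTC) and unused (?) wells
calls <- c("A:A", "A:T", "T:T", "NTC", "?")
zyg <- call_zygosity(calls)
zyg
```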
To test the hypothesis that the designed KASP marker can accurately discriminate between homozygotes and heterozygotes (allelic discrimination), a cluster plot needs to be generated.
The kasp_qc_ggplot() and kasp_qc_ggplot2() functions in panGB can
be used to make the cluster plots for each plate and KASP marker as
shown below:
# KASP QC plot for Plate 05
library(panGenomeBreedr)
kasp_qc_ggplot2(x = dat1[5],
pdf = FALSE,
Group_id = NULL,
scale = TRUE,
expand_axis = 0.6,
alpha = 0.9,
legend.pos.x = 0.6,
legend.pos.y = 0.75)
#> $`SE-24-1088_P01_d1_snpSB00804`
# KASP QC plot for Plate 05 with predictions overlaid for positive controls
library(panGenomeBreedr)
kasp_qc_ggplot2(x = dat1[5],
pdf = FALSE,
Group_id = 'Group',
Group_unknown = '?',
scale = TRUE,
pred_cols = c('Blank' = 'black', 'False' = 'red',
'True' = 'blue', 'Unverified' = 'yellow2'),
expand_axis = 0.6,
alpha = 0.9,
legend.pos.x = 0.6,
legend.pos.y = 0.75)
#> $`SE-24-1088_P01_d1_snpSB00804`
Color-blind-friendly color combinations are used to visualize verified genotype predictions (Figure 3).
In Figure 4, the three genotype classes are grouped based on plot PCH symbols using the FAM and HEX scores for observed genotype calls.
To simplify the verified prediction overlay for the expected genotypes for positive controls, all possible outcomes are divided into three categories (TRUE, FALSE, and UNVERIFIED) and color-coded to make it easier to visualize verified predictions.
BLUE (color code for the TRUE category) means genotype prediction matches the observed genotype call for the sample.
RED (color code for the FALSE category) means genotype prediction does not match the observed genotype call for the sample.
YELLOW (color code for the UNVERIFIED category) means one of two things: either an expected genotype call could not be made before KASP genotyping, or an observed genotype call could not be made to verify the prediction.
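The three categories can be expressed as a simple comparison of expected and observed calls. The base R sketch below is an illustration under stated assumptions, not the package's internal logic: verify_prediction() is a hypothetical helper, expected and observed calls are assumed to use the same allele coding, and the example vectors are invented.

```r
# Hypothetical helper: label each sample as True, False, or Unverified.
# Assumes expected and observed calls share the same coding (e.g. "A:A").
verify_prediction <- function(expected, observed,
                              unknown = "?", uncallable = "Uncallable") {
  status <- ifelse(expected == observed, "True", "False")
  # No expected call, or no usable observed call: cannot be verified
  no_check <- expected %in% unknown | observed %in% c(unknown, uncallable)
  status[no_check] <- "Unverified"
  status
}

# Invented example calls
expected <- c("A:A", "A:T", "?", "T:T")
observed <- c("A:A", "T:T", "A:A", "Uncallable")
status <- verify_prediction(expected, observed)
status
```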
Users can set the pdf = TRUE argument to save plots as a PDF file in a
directory outside R. The kasp_qc_ggplot() and kasp_qc_ggplot2()
functions can generate cluster plots for multiple plates simultaneously.
To visualize predictions for positive controls to validate KASP markers,
the column name containing expected genotype calls must be provided and
passed to the function using the Group_id = 'Group' argument as shown
in the code snippets above. If this information is not available, set
the argument Group_id = NULL.
The pred_summary() function produces a summary of predicted genotypes
for positive controls in each reaction plate after verification (Table
3), as shown in the code snippet below:
# Get prediction summary for all plates
library(panGenomeBreedr)
my_sum <- pred_summary(x = dat1,
snp_id = 'SNPID',
Group_id = 'Group',
Group_unknown = '?',
geno_call = 'Call')
plate | snp_id | false | true | unverified |
---|---|---|---|---|
SE-24-1088_P01_d1_snpSB00800 | snpSB00800 | 4 | 6 | 84 |
SE-24-1088_P01_d2_snpSB00800 | snpSB00800 | 2 | 6 | 86 |
SE-24-1088_P01_d1_snpSB00803 | snpSB00803 | 0 | 32 | 62 |
SE-24-1088_P01_d2_snpSB00803 | snpSB00803 | 0 | 32 | 62 |
SE-24-1088_P01_d1_snpSB00804 | snpSB00804 | 1 | 31 | 62 |
SE-24-1088_P01_d2_snpSB00804 | snpSB00804 | 1 | 31 | 62 |
SE-24-1088_P01_d1_snpSB00805 | snpSB00805 | 14 | 18 | 62 |
SE-24-1088_P01_d2_snpSB00805 | snpSB00805 | 14 | 18 | 62 |
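One way to read such a summary is to compute, for each plate, the share of verifiable predictions that were true. The sketch below does this in base R for three rows copied from Table 3. The prop_true column is an illustrative ratio and is not something pred_summary() is stated to report.

```r
# Three rows copied from Table 3 (d1 plates only)
tab3 <- data.frame(
  plate = c("SE-24-1088_P01_d1_snpSB00800",
            "SE-24-1088_P01_d1_snpSB00803",
            "SE-24-1088_P01_d1_snpSB00805"),
  false = c(4, 0, 14),
  true = c(6, 32, 18),
  unverified = c(84, 62, 62),
  stringsAsFactors = FALSE
)

# Share of verifiable predictions (true + false) that matched the observed call
tab3$prop_true <- tab3$true / (tab3$true + tab3$false)
tab3[, c("plate", "prop_true")]
```

Unverified samples are left out of the denominator because no comparison could be made for them.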
Users can visualize the observed genotype calls in a plate design format
using the plot_plate() function as depicted in Figure 5, using the
code snippet below:
plot_plate(dat1[5], pdf = FALSE)
#> $`SE-24-1088_P01_d1_snpSB00804`
If the app does not run as expected, check the following:
- Was the package properly installed?
- Were any warnings or error messages returned during package installation?
- Do you have the required dependencies installed?
- Are all packages up to date?
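The dependency check can be done from the R console. The following base R sketch reports which of the packages named in this README are not installed; check_installed() is a hypothetical helper, not a panGB function.

```r
# Hypothetical helper: TRUE for each package that can be loaded as a namespace
check_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

# Packages named in the Requirements and Installation sections
pkgs <- c("ggplot2", "gridExtra", "utils", "rmarkdown",
          "VariantAnnotation", "Biostrings", "GenomicRanges", "IRanges", "msa")

status <- check_installed(pkgs)
names(status)[!status]  # packages that still need to be installed
```

An empty result means all listed packages were found; outdated versions can be listed separately with old.packages().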
For support and submission of feedback, email the maintainer Alexander Kena, PhD at [email protected]