The AnVILVRS package provides an R interface to the AnVIL VRS Toolkit, a Python library for working with the Global Alliance for Genomics and Health (GA4GH) Variation Representation Specification (VRS) standard. The package allows users to translate variant identifiers from various formats (e.g., gnomAD, SPDI, HGVS, Beacon) into GA4GH VRS Allele IDs and vice versa. Additionally, it provides functionality to retrieve allele frequency data from the 1000 Genomes Project based on VRS Allele IDs.
To use the AnVILVRS package, you need to have Python installed on your system. The package requires Python 3.11, so ensure that you have this version installed.
Install the AnVILVRS package from Bioconductor with:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("AnVILVRS")
Load the AnVILVRS package and the reticulate package for Python integration:
library(AnVILVRS)
library(reticulate)
After installing the package, you need to set up a Python virtual environment with the required dependencies. You can do this by running the following command in R:
install_AnVILVRS(envname = "vrs_env")
Once the virtual environment is set up, you can use the AnVILVRS package to translate variant identifiers and retrieve allele frequency data. First, load the package and activate the virtual environment:
use_virtualenv("vrs_env", required = TRUE)
You can translate variant identifiers from various formats into GA4GH
VRS Allele IDs using the get_vrs_id
function. Supported formats
include “gnomad”, “spdi”, “hgvs”, and “beacon”.
get_vrs_id("chr7-87509329-A-G", "gnomad")
#> [1] "'str' object has no attribute 'id'"
get_vrs_id("NC_000005.10:80656509:C:TT", "spdi")
#> [1] "'str' object has no attribute 'id'"
get_vrs_id("NC_000005.10:g.80656510delinsTT", "hgvs")
#> [1] "'str' object has no attribute 'id'"
get_vrs_id("5 : 80656489 C > T", "beacon")
#> [1] "'str' object has no attribute 'id'"
You can also retrieve the VRS Allele object using the get_vrs_allele
function:
allele <- get_vrs_allele("5 : 80656489 C > T", "beacon")
allele
#> [1] "HTTPSConnectionPool(host='services.genomicmedlab.org', port=443): Max retries exceeded with url: /seqrepo/1/metadata/GRCh38:5 (Caused by NameResolutionError(\"<urllib3.connection.HTTPSConnection object at 0x77f5a8137b90>: Failed to resolve 'services.genomicmedlab.org' ([Errno -3] Temporary failure in name resolution)\"))"
You can convert a VRS Allele object back to a variant identifier in a
specified format using the get_variant_from_allele
function:
get_variant_from_allele(allele, "hgvs")
#> [1] "could not translate host name \"uta.biocommons.org\" to address: Temporary failure in name resolution\n"
The get_pop_descriptor
function downloads the population descriptor
file from a known Google Storage URI to the BiocFileCache. This file is
used in the get_caf
function to map population codes.
library(readr)
pop_desc <- get_pop_descriptor()
#> Found in BiocFileCache: BFC2
read_tsv(
file = pop_desc,
col_types = cols(
population_descriptor_id = col_character(),
population_descriptor = col_character(),
subject_id = col_character(),
country_of_recruitment = col_character(),
population_label = col_character()
)
)
#> # A tibble: 3,202 × 5
#> population_descriptor_id population_descriptor subject_id country_of_recruitment population_label
#> <chr> <chr> <chr> <chr> <chr>
#> 1 fb7368ca population | superpopul… HG00096 UK GBR | EUR
#> 2 ada92c9c population | superpopul… HG00097 UK GBR | EUR
#> 3 5b9b6b8c population | superpopul… HG00099 UK GBR | EUR
#> 4 4957a17c population | superpopul… HG00100 UK GBR | EUR
#> 5 f3e3ab5b population | superpopul… HG00101 UK GBR | EUR
#> 6 bc1c2c8e population | superpopul… HG00102 UK GBR | EUR
#> 7 035cf47a population | superpopul… HG00103 UK GBR | EUR
#> 8 89ee21d5 population | superpopul… HG00105 UK GBR | EUR
#> 9 ac194313 population | superpopul… HG00106 UK GBR | EUR
#> 10 7c2b88af population | superpopul… HG00107 UK GBR | EUR
#> # ℹ 3,192 more rows
The get_seqrepo
function downloads a SeqRepo archive from a specified
URI to a local directory. This is useful for setting up the SeqRepo
database required by the AnVIL VRS Toolkit.
get_seqrepo(destdir = tempdir())
To calculate the Cohort Allele Frequency (CAF) using the 1000 Genomes
Project based on a VRS Allele ID, use the get_caf
function. You will
need a .gz
zipped VCF file containing 1000 Genomes Project variants
with allele frequency annotations and its corresponding index file.
use_virtualenv("vrs_env", required = TRUE)
vcf <- "../vrs_anvil_toolkit/tests/fixtures/1kGP.chr1.1000.vrs.vcf.gz"
vcf_index <- "../1000g_chr1_index.db"
variant_id <- "chr1-20094-TAA-T"
vrs_id <- get_vrs_id(variant_id, "gnomad")
pop_desc <- get_pop_descriptor()
get_caf(
vrs_id, vcf, vcf_index, "USA",
pop_desc_file = pop_desc, toolkit_dir = "../vrs_anvil_toolkit"
)