Most functions to retrieve and compare distributions between repertoires expect data.table
objects as input. For example, the package contains an annotations dataset in the extdata
folder, which can be read in as a data.table
as follows:
f <- system.file("extdata/test_annotations.csv", package="sumrep")
dat_a <- data.table::fread(f)
This file was generate via the following commands:
library(magrittr)
f <- system.file("extdata/test_annotations.csv", package="sumrep")
getPartisAnnotations(f, locus="igh") %$%
annotations %>%
fwrite("test_annotations.csv")
While sumrep
is able to handle rather general annotations datasets, things work best when the annotations dataset is a data.table
object with specific defaults chosen by members of the AIRR Community Software Working Group.
In particular, by default, sumrep
adheres to the AIRR Rearrangement standard, with specific default column names and definitions laid out in the extended documention.
Columns can be manually set via the column
argument to any function which acts on a single column; functions acting on more than one column have similar arguments for each column involved.
An example of this is shown below.
To encourage principled analyses, we recommend the following specification of sequence_alignment
and germline_alignment
(which have somewhat open-ended definitions in the general AIRR schema):
- The recommended specification of
sequence_alignment
will be the aligned portion of the query sequence with IMGT gaps, constrained to the V(D)J region. Similarly, the recommended specification ofgermline_alignment
will be the assembled, aligned, fully length inferred germline sequence spanning the same region as the sequence_alignment field, with IMGT gaps. - If IMGT gaps are not available, we will still recommend that the sequences in
sequence_alignment
be pairwise aligned and constrained to the V(D)J region, and thatgermline_alignment
adhere to the same properties as above, just without the IMGT gaps. - The user may specify custom columns to any function which uses
sequence_alignment
and/orgermline_alignment
as defaults. In this case, we note that the results will be sensible insofar as these sequences are well-defined and compared over a span that is covered by the reads in both sets, so please take caution.
Moreover, any function which acts on a column of sequences in an annotations object by default performs the following filters:
- Any gaps/spacers (defined here as
-
or.
characteres) are removed for distance-based functions. - Any sequences with a stop codon (with respect to the start of the germline V sequence ), as determined by the
stop_codon
column, are excluded. - Any sequences that are out-of-frame with respect to the start of the germline V sequence, as determined by the
vj_in_frame
column, are excluded.
If the stop_codon
and/or vj_in_frame
columns are missing, the function will output a warning and return the values without applying the respective filter.
The user can adjust one or more of these flags for any summary that operates on sequences from a column in a data.table
.
These defaults attempt to encourage sensible analyses that are not biased from sequencing artifacts or otherwise noisy data.
Nonetheless, we have tried to keep sumrep
as flexible as possible to accomodate a wide range of use cases.
There are many helper functions which take other types of data structures as input.
These are available to the user but are not as polished or standardized.
For example, the getDistanceMatrix
function returns the pairwise distance matrix of a list
or vector
of input sequences.
This object may be of auxiliary interest to the user but is not directly useful for plotting or comparison functions within the package.
Functions for retrieving distributions are generally of the form getXDistribution
.
For example, the pairwise distance distribution of dat
can be obtained via
library(sumrep)
pairwise_distances <- getPairwiseDistanceDistribution(dat_a)
This returns a vector of pairwise distances rather than a matrix, which is more practical for plotting and comparison.
This function will by default compute this distribution on the sequence
column.
To specify a different, column, say the junction
column, if present, you would instead want:
junction_pairwise_distances <- getPairwiseDistanceDistribution(dat_a, column="junction")
A complete table of available summary functions can be found in the extended documentation.
Functions to compare distributions of two annotations datasets, say dat_a
and dat_b
, are in general of the form compareXDistributions
, and expect two data.tables
as input.
Let's read in another dataset, this time a post-processed annotations dataset is provided in the
sumrep
package:
data(test_dat_boot, package="sumrep")
dat_b <- test_dat_boot
Then, to compare the pairwise distance distributions of dat_a
and dat_b
, we simply run
divergence <- comparePairwiseDistanceDistributions(dat_a, dat_b)
In this case, the output is the scalar JS-divergence between the pairwise distance distributions of the two distributions.
The function compareRepertoires
performs the full suite of summary comparisons between two repertoires.
The main inputs of this function, repertoire_1
and repertoire_2
are list
s, rather than data.table
s.
These lists should include an annotations
field corresponding to the annotations data.table
objects discussed above, and optionally a mutation_rates
field, which is described below.
For example, the following would do the trick:
repertoire_a <- list(annotations=dat_a)
repertoire_b <- list(annotations=dat_b)
compareRepertoires(repertoire_a, repertoire_b, locus="igh")
This would perform every comparison sans comparePerGeneMutationRates
and comparePerGenePerPositionMutationRates
.
Note that by default getPartisAnnotations
and getIgBlastAnnotations
return lists of this sort, although only getPartisAnnotations
includes the mutation_rates
object.
The mutation_rates
object should be equivalent to a data structure returned by getMutationInfo
, which is called within getPartisAnnotations
.
This structure includes a field for each gene name (e.g. `IGHD2-21\*01`
), which then includes subfields overall_mut_rate
and mut_rate_by_position
.
The function plotUnivariateDistributions
takes in a list of annotations datasets as well as a locus, and returns a plot of as many summaries are as relevant to the locus and data. For example, using our datasets above, we can run
plot_1 <- plotUnivariateDistributions(list(dat_a, dat_b), locus="igh")
By default, this creates two plots: a frequency polygon of each distribution, as well as an empirical CDF of each distribution.
You can access these plots by calling plot_1[["freqpoly"]]
and plot_1[["ecdf"]]
respectively.
This function is easy to modify; for example, if we wish to only plot frequency polygons of the pairwise distance distribution and aromaticity distribution, we could run
plot_2 <- plotUnivariateDistributions(list(dat_a, dat_b), locus="igh", plot_types="freqpoly", plot_function_strings=c("getPairwiseDistanceDistribution", "getAromaticityDistribution"))
The Examples
folder includes a few scripts that demonstrate basic sumrep
usage:
-
ExampleComparisonUsingPartis.R
shows how to obtainpartis
annotations and parameters fromsumrep
; how to simulate from these parameters usingpartis
fromsumrep
; and how to compare these observed and simulated annotations datasets with thecompareRepertoires
function. -
ExampleComparisonWithoutPartis.R
loads pre-computed annotations, so thatpartis
need not be installed, and shows how to compare these observed and simulated annotations datasets with thecompareRepertoires
function. -
MDS.R
downloads many post-processed IgH datasets from Gupta et al. (2017), computes divergences of each pairwise CDR3 length distribution, performs an multidimensional scaling of these divergences, and plots the first 2 coordinates.