[TODO: get help from Jerome and Alistair] [Existing tools include scikit-allel.]
Sgkit provides a number of methods for computing statistics in population genetics.
All methods work on an Xarray dataset that has a number of well-known variables defined [TODO: see section...], which for population genetics are the genotype calls, stored in the call_genotype
variable. The methods return a new dataset containing variables that store the computed statistics.
Before running the methods, the dataset is usually divided into windows along the genome, using the window_by_*
functions, which tell sgkit to produce per-window statistics.
For example, window_by_position
creates windows that are a fixed number of base pairs, while window_by_interval
creates windows corresponding to arbitrary user-defined intervals.
It's common in population genetics to group samples into populations, which in sgkit are referred to as cohorts. There are two types of statistics: one-way statistics where there is a single statistic for each cohort, and multi-way statistics where there is a statistic between each pair, triple, etc of cohorts. [TODO: do we need to say how cohorts are defined?]
The methods for one-way statistics include diversity
for computing mean genetic diversity, Tajimas_D
for computing Tajima’s D, and Garud_H
for computing the H1, H12, H123 and H2/H1 statistics defined in [@doi:10.1371/journal.pgen.1005004].
The methods for multi-way statistics include divergence
and Fst
for computing mean genetic divergence and F[ST] (respectively) between pairs of cohorts, and pbs
for computing the population branching statistic between cohort triples.
We converted phased Ag1000G hypotype data in Zarr format [@https://www.malariagen.net/data/ag1000g-phase-2-ar1] to sgkit's Zarr format using the read_scikit_allel_vcfzarr
function.
The data contained 1,164 samples at 39,604,636 sites, and was [TODO] MB on disk before conversion, and [TODO] MB after conversion to sgkit's Zarr format.
Data for the X chromosome was discarded since it was not available for all samples.
The conversion took [TODO: X minutes Y seconds], including a postprocessing rechunk
step to ensure that the data was suitably chunked for the subsequent analysis.
We grouped samples into 15 cohorts based on location using simple Pandas and Xarray data manipulation functions.
We then set up half-overlapping windows along the genome using sgkit's window_by_position
function with a fixed window size of 6000 base pairs, and a step size of 3000 base pairs.
We computed H statistics for one cohort using the Garud_H
function in sgkit.
This operation took [TODO: X minutes Y seconds].
Finally, we verified that the output was concordant with scikit-allel.
[TODO: how hard to reproduce the scikit-allel visualization?]
[TODO: summarize the biologically interesting thing here]