Skip to content

Latest commit

 

History

History
39 lines (26 loc) · 2.87 KB

10.popgen.md

File metadata and controls

39 lines (26 loc) · 2.87 KB

Popgen

Audience

[TODO: get help from Jerome and Alistair] [Existing tools include scikit-allel.]

API overview

Sgkit provides a number of methods for computing statistics in population genetics. All methods work on an Xarray dataset that has a number of well-known variables defined [TODO: see section...], which for population genetics are the genotype calls, stored in the call_genotype variable. The methods return a new dataset containing variables that store the computed statistics.

Before running the methods, the dataset is usually divided into windows along the genome, using the window_by_* functions, which tell sgkit to produce per-window statistics. For example, window_by_position creates windows that are a fixed number of base pairs, while window_by_interval creates windows corresponding to arbitrary user-defined intervals.

It's common in population genetics to group samples into populations, which in sgkit are referred to as cohorts. There are two types of statistics: one-way statistics where there is a single statistic for each cohort, and multi-way statistics where there is a statistic between each pair, triple, etc of cohorts. [TODO: do we need to say how cohorts are defined?]

The methods for one-way statistics include diversity for computing mean genetic diversity, Tajimas_D for computing Tajima’s D, and Garud_H for computing the H1, H12, H123 and H2/H1 statistics defined in [@doi:10.1371/journal.pgen.1005004].

The methods for multi-way statistics include divergence and Fst for computing mean genetic divergence and F[ST] (respectively) between pairs of cohorts, and pbs for computing the population branching statistic between cohort triples.

Example

We converted phased Ag1000G hypotype data in Zarr format [@https://www.malariagen.net/data/ag1000g-phase-2-ar1] to sgkit's Zarr format using the read_scikit_allel_vcfzarr function. The data contained 1,164 samples at 39,604,636 sites, and was [TODO] MB on disk before conversion, and [TODO] MB after conversion to sgkit's Zarr format. Data for the X chromosome was discarded since it was not available for all samples. The conversion took [TODO: X minutes Y seconds], including a postprocessing rechunk step to ensure that the data was suitably chunked for the subsequent analysis.

We grouped samples into 15 cohorts based on location using simple Pandas and Xarray data manipulation functions. We then set up half-overlapping windows along the genome using sgkit's window_by_position function with a fixed window size of 6000 base pairs, and a step size of 3000 base pairs.

We computed H statistics for one cohort using the Garud_H function in sgkit. This operation took [TODO: X minutes Y seconds]. Finally, we verified that the output was concordant with scikit-allel. [TODO: how hard to reproduce the scikit-allel visualization?]

[TODO: summarize the biologically interesting thing here]