To distribute Allen Institute Taxonomies (AIT), we define an AnnData .h5ad file with a formalized schema that encapsulates the essential components of a taxonomy required for downstream analysis.
For information on how to build and work with AIT files, see the companion scrattch R libraries.
For a list of available taxonomies in AIT format, see this table of available taxonomies.
One major challenge in creating a cell type taxonomy schema is defining terms such as "taxonomy", "dataset", "annotation", "metadata", and "data". It is becoming increasingly important to separate the data from the other components, and to compartmentalize all components so that users do not need to download, open, or upload huge and unwieldy files. AIT addresses this challenge by extending and modifying the popular CELLxGENE schema to better align with BICAN and Allen Institute needs.
(Brief description of AIT and its differences from CELLxGENE to be entered here. Also link to version of schema table with new column indicating what is included in CELLxGENE.)
Taxonomy 'modes' are a key concept specific to AIT that allow multiple embedded subsets of the data to be stored in a single .h5ad file. More detail about taxonomy modes and a separate schema describing how they work can be found here.
AIT is being developed alongside three complementary efforts for packaging of taxonomies, data sets, and associated metadata and annotations.
- Cell Annotation Platform (CAP): CAP 'is a centralized, community-driven platform for the creation, exploration, and storage of cell annotations for single-cell RNA-sequencing (scRNA-seq) datasets.' The Allen Institute and BICAN are partnering with CAP for annotation of brain (including basal ganglia) and spinal cord taxonomies.
- Cell Annotation Schema (CAS): Compatible with CAP and with Taxonomy Development Tools (TdT), CAS functions as a store of extended information about cell sets, including ontology term mappings and evidence for annotation (from annotation transfer and marker expression). CAS complements the more commonly used schemas, which are cell-centric and occasionally cluster-centric. CAS has both a general schema and a BICAN-associated schema, and can be embedded in the header (`uns`) of an AIT file.
- Brain Knowledge Platform (BKP): While not publicly laid out anywhere that I can find, the BKP schema is the data model used for Jupyter Notebooks associated with the Allen Brain Cell (ABC) Atlas and will eventually power all novel content hosted on Allen Brain Map. Currently, any data sets to be included in ABC Atlas or MapMyCells must be ingested into BKP.
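
To make the idea of embedding CAS in the `uns` header more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the AIT or CAS specification: the key name `cas`, the helper names `embed_cas` and `extract_cas`, and the fields in the example record are all hypothetical. A plain object with a `uns` dictionary stands in for an AnnData object so that the sketch runs without any third-party packages. With the anndata library, the same two functions should work on a real `AnnData` object, since its `uns` attribute is also a dictionary-like mapping.

```python
import json
from types import SimpleNamespace

# Assumed key name for illustration only; the AIT schema may define a different one.
CAS_KEY = "cas"


def embed_cas(adata, cas, key=CAS_KEY):
    """Store a CAS-style record as a JSON string under adata.uns[key]."""
    adata.uns[key] = json.dumps(cas)


def extract_cas(adata, key=CAS_KEY):
    """Return the embedded record as a dict, or None if nothing is stored."""
    raw = adata.uns.get(key)
    if raw is None:
        return None
    # Accept either a JSON string or an already-decoded mapping.
    return json.loads(raw) if isinstance(raw, str) else dict(raw)


# Stand-in for an AnnData object: anything exposing a dict-like `uns`.
adata = SimpleNamespace(uns={})

# Hypothetical record; field names are illustrative, not taken from the CAS schema.
cas = {
    "title": "Example taxonomy",
    "annotations": [
        {
            "labelset": "subclass",
            "cell_label": "example cell type",
            "cell_ids": ["cell_1", "cell_2"],
        }
    ],
}

embed_cas(adata, cas)
restored = extract_cas(adata)
print(restored["annotations"][0]["cell_label"])
```

Serializing the record to a single JSON string is one possible design choice. It keeps nested, cell-set-centric annotations together in one `uns` entry and avoids relying on how deeply nested Python objects are written to HDF5.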
