Abstract {.page_break_before}

The PyData ecosystem is an umbrella term covering Python packages based on a broad range of modern techniques, such as chunk-compressed columnar data storage, just-in-time compilation of numerical code, and scaling of calculations across clusters of computers. Together, these technologies have been successfully applied in scientific applications using data at the petabyte scale. However, these technologies, and the many benefits that they provide, have not been successfully applied in genomics, a field that is currently making the transition to working at petabyte scale. We present sgkit, a Python package designed to bring the benefits of the PyData ecosystem to genomics, allowing users to efficiently analyse large-scale data using familiar tools. We discuss the design principles underlying these technologies and illustrate their suitability for genetics and genomics applications through examples on large-scale datasets such as UK Biobank.
