## Abstract {.page_break_before}

The PyData ecosystem is an umbrella term for Python packages built on a broad range of modern techniques, such as chunk-compressed columnar data storage, just-in-time compilation of numerical code, and scaling of calculations across clusters of computers. Together, these technologies have been successfully applied in scientific applications using data at the petabyte scale. However, these technologies and the many benefits they provide have not been successfully applied in genomics, a field that is currently making the transition to working at petabyte scale. We present sgkit, a Python package designed to bring the benefits of the PyData ecosystem to genomics, allowing users to efficiently analyse large-scale data using familiar tools. We discuss the design principles underlying these technologies and illustrate their suitability for genetics and genomics applications through examples on large-scale datasets such as UK Biobank.