%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Please note that whilst this template provides a
% preview of the typeset manuscript for submission, it
% will not necessarily be the final publication layout.
%
% letterpaper/a4paper: US/UK paper size toggle
% num-refs/alpha-refs: numeric/author-year citation and bibliography toggle
%\documentclass[letterpaper]{oup-contemporary}
\documentclass[a4paper,num-refs]{oup-contemporary}
%%% Journal toggle; only specific options recognised.
%%% (Only "gigascience" and "general" are implemented now. Support for other journals is planned.)
\journal{preprint}
\usepackage{graphicx}
\usepackage{siunitx}
\usepackage[switch]{lineno}
\linenumbers
% Additional packages for data model figure
\usepackage{forest}
\usepackage{tikz}
\usetikzlibrary{quotes,arrows.meta,3d}
%%% Flushend: You can add this package to automatically balance the final page, but if things go awry (e.g. section contents appearing out-of-order or entire blocks or paragraphs are coloured), remove it!
% \usepackage{flushend}
% Macros etc for data model figure
\input{diagrams/data-model-preamble.tex}
% See https://github.com/sgkit-dev/vcf-zarr-publication/issues/87
% for discussion
\title{Analysis-ready VCF at Biobank scale using Zarr}
%%% Use the \authfn to add symbols for additional footnotes, if any.
% 1 is reserved for correspondence emails; then continuing with 2 etc for contributions.
% First author
\author[1,2\authfn{1}]{Eric Czech} % https://orcid.org/0000-0002-4254-4255
\author[3,4\authfn{1}]{Timothy R. Millar} % https://orcid.org/0000-0002-5142-8811
\author[5,\authfn{1}]{Will Tyler}
\author[6,\authfn{1}]{Tom White}
% Middle, alphabetical order
\author[7]{Benjamin Elsworth} % https://orcid.org/0000-0001-7328-4233
\author[8,9]{Jérémy Guez} % https://orcid.org/0009-0007-6406-5187
\author[10]{Jonny Hancox} % https://orcid.org/0009-0003-5799-5299
\author[11]{Ben Jeffery} % https://orcid.org/0000-0002-1982-6801
\author[8,9,12]{Konrad J. Karczewski} % https://orcid.org/0000-0003-2878-4671
\author[13]{Alistair Miles} % https://orcid.org/0000-0001-9018-4680
\author[14]{Sam Tallman} % https://orcid.org/0000-0001-7183-6276
\author[15]{Per Unneberg} % https://orcid.org/0000-0001-5735-3315
\author[1]{Rafal Wojdyla} % https://orcid.org/0009-0005-0735-7090
\author[16]{Shadi Zabad} % https://orcid.org/0000-0002-8003-9284
% Senior
\author[1,2,\authfn{2}]{Jeff Hammerbacher} % https://orcid.org/0000-0001-6596-8563
\author[11,\authfn{2},\authfn{3}]{Jerome Kelleher} % https://orcid.org/0000-0002-7894-5253
\affil[1]{Open Athena AI Foundation}
\affil[2]{Related Sciences}
\affil[3]{The New Zealand Institute for Plant \& Food Research Ltd, Lincoln,
New Zealand}
\affil[4]{Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand}
\affil[5]{Independent researcher}
\affil[6]{Tom White Consulting Ltd.}
\affil[7]{Our Future Health, Manchester, UK.}
\affil[8]{Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA}
\affil[9]{Analytic and Translational Genetics Unit, Massachusetts General
Hospital, Boston, Massachusetts 02114, USA}
\affil[10]{NVIDIA Ltd, Reading, UK}
\affil[11]{Big Data Institute, Li Ka Shing Centre for Health Information and Discovery,
University of Oxford, UK}
\affil[12]{Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA}
\affil[13]{Wellcome Sanger Institute}
\affil[14]{Genomics England}
\affil[15]{Department of Cell and Molecular Biology, National
Bioinformatics Infrastructure Sweden, Science for Life Laboratory,
Uppsala University, Uppsala, Sweden}
\affil[16]{School of Computer Science, McGill University, Montreal, QC, Canada}
%%% Author Notes
\authnote{\authfn{1}Joint first author.}
\authnote{\authfn{2}Joint senior author.}
\authnote{\authfn{3}[email protected]}
%%% Paper category
\papercat{Paper}
%%% "Short" author for running page header
\runningauthor{Czech et al.}
%%% Should only be set by an editor
\jvolume{00}
\jnumber{0}
\jyear{2024}
\begin{document}
\begin{frontmatter}
\maketitle
% The Abstract (250 words maximum) should be structured to
% include the following details:
% \textbf{Background}, the context and purpose of
% the study;
% \textbf{Results}, the main findings;
% \textbf{Conclusions}, brief
% summary and potential implications. Please minimize the use of abbreviations
% and do not cite references in the abstract.
%% NOTE: this is much too long currently, but keeping for now so we
% can see which stuff migrates to the intro
\begin{abstract}
\textbf{Background:}
Variant Call Format (VCF) is the standard file format for interchanging
genetic variation data and associated quality control metrics.
The usual row-wise encoding of the VCF data model (either as text
or packed binary) emphasises efficient retrieval of all data for a given
variant, but accessing data on a field or sample basis is inefficient.
Biobank scale datasets currently available
consist of hundreds of thousands of whole genomes
and hundreds of terabytes of compressed VCF.
Row-wise data storage is fundamentally unsuitable at this scale,
and a more scalable approach is needed.
\textbf{Results:}
Zarr is a format for storing
multi-dimensional data that is widely used across the sciences,
and is ideally suited to massively parallel processing.
We present the VCF Zarr specification, an encoding of the
VCF data model using Zarr, along with fundamental software infrastructure
for efficient and reliable conversion at scale.
We show how this format is far more efficient than
standard VCF based approaches,
and competitive with specialised methods for
storing genotype data in terms of compression ratios
and single-threaded calculation performance.
We present case studies on subsets of three large
human datasets (Genomics England: $n$=78,195;
Our Future Health: $n$=651,050;
All of Us: $n$=245,394) along with whole genome
datasets for Norway Spruce ($n$=1,063) and
SARS-CoV-2 ($n$=4,384,310).
We demonstrate the potential for VCF Zarr to enable a new generation
of high-performance and cost-effective applications
via illustrative examples using cloud computing and GPUs.
\textbf{Conclusions:}
Large row-encoded VCF files are a major bottleneck for current research, and
storing and processing these files incurs a substantial cost.
The VCF Zarr specification, building on widely-used, open-source technologies,
has the potential to greatly reduce these costs,
and may enable a diverse ecosystem of next-generation tools for analysing
genetic variation data directly from cloud-based object stores,
while maintaining compatibility with existing file-oriented workflows.
\end{abstract}
\begin{keywords}
Variant Call Format; Zarr; Analysis ready data.
\end{keywords}
\end{frontmatter}
%%% Key points will be printed at top of second page
\begin{keypoints*}
\begin{itemize}
\item VCF is widely supported, and the underlying data model is entrenched
in bioinformatics pipelines.
\item The standard row-wise encoding as text (or binary) is inherently
inefficient for large-scale data processing.
\item The Zarr format provides an efficient solution, by encoding fields
in the VCF separately in chunk-compressed binary format.
\end{itemize}
\end{keypoints*}
\section{Background}
Variant Call Format (VCF) is the standard format for interchanging genetic
variation data, encoding information about
DNA sequence polymorphisms among a set of samples with associated
quality control metrics and metadata~\citep{danecek2011variant}.
Originally defined specifically as a text file,
it has been refined and standardised~\citep{rehm2021ga4gh} and the
underlying data-model is now deeply embedded in bioinformatics practice.
Dataset sizes have grown explosively since the introduction of
VCF as part of the 1000 Genomes Project~\citep{10002015global},
with Biobank scale initiatives such as
Genomics England~\cite{turnbull2018100,caulfield2017national},
UK Biobank~\citep{bycroft2018genome,backman2021exome,halldorsson2022sequences,uk2023whole},
Our Future Health~\citep{cook2025our,ofhmain_2025},
and the All of Us research program~\citep{all2024genomic}
collecting genome sequence data for millions of humans~\citep{stark2024call}.
Large genetic variation datasets are also being generated for other organisms
and a variety of purposes including
agriculture~\citep{ros2020accuracy,wang2023rice},
conservation~\citep{shaffer2022landscape}
and infectious disease surveillance~\citep{hamid2023pf7}.
VCF's simple text-based design and widespread
support~\cite{garrison2022spectrum} make it an
excellent archival format, but it is an inefficient basis for analysis.
Methods that need efficient access to genotype data
either require conversion to the
PLINK~\cite{purcell2007plink,chang2015second}
or BGEN~\citep{band2018bgen}
formats~\citep[e.g.][]{yang2011gcta,mbatchou2021computationally,loh2015efficient}
or use bespoke binary formats
that support the required access patterns~\citep[e.g.][]{
% Uses custom "bref3" format,
% https://faculty.washington.edu/browning/beagle/bref3.24May18.pdf
browning2018one,
% .samples Zarr format
kelleher2019inferring,
% Has a "xcftools" package, but it still looks pretty experimental
hofmeister2023accurate}.
While PLINK and BGEN formats are more efficient to access than VCF, neither
can accommodate the full flexibility of the VCF data model and conversion
is lossy.
PLINK's approach of storing the genotype matrix in uncompressed
packed-binary format provides efficient access to genotype
data, but file sizes are substantially larger than
the equivalent compressed VCF (see Fig~\ref{fig-data-storage}).
For example, at two bits
per diploid genotype, the full genotype matrix for the GraphTyper SNP dataset
in the 500K UKB WGS data~\citep{uk2023whole} is 116 TiB.
% 1,037,556,156 SNPs x 490,640 samples
% humanize.naturalsize(1_037_556_156 * 490_640 / 4, binary=True)
% '115.7 TiB'
Processing of Biobank scale datasets can be split into a
few broad categories. The most basic analysis
is quality control (QC). Variant QC is an
involved and multi-faceted
task~\cite{marees2018tutorial,panoutsopoulou2018quality,chen2024genomic,hemstrom2024next},
often requiring interactive, exploratory analysis
and incurring substantial computation over multiple QC fields.
Genotype calls are sometimes refined via statistical methods,
for example by phasing~\citep{
browning2021fast,browning2023statistical,hofmeister2023accurate,williams2024phasing},
and imputation~\citep{browning2018one,rubinacci2020genotype,barton2021whole,
rubinacci2023imputation}
creating additional dataset copies.
A common task is to perform
a genome-wide association study (GWAS)~\cite{uffelmann2021genome}.
The majority of tools for
performing GWAS and related analyses require
data to be in PLINK or BGEN formats~\cite[e.g][]{chang2015second,
loh2015efficient,
abraham2017flashpca2,
mbatchou2021computationally},
and so data must be ``hard-called'' according to some QC criteria
and exported to additional copies.
Finally, variation datasets are often queried in exploratory
analyses, to find regions or samples of interest for a particular
study~\cite[e.g.][]{chen2024novo}.
VCF cannot support any of these workflows efficiently at the Biobank scale.
The most intrinsically limiting aspect of VCF's design
is its row-wise layout of data, which means that (for example)
information for a particular sample or field cannot be obtained without
retrieving the entire dataset.
The file-oriented paradigm is also unsuited to the realities
of modern datasets, which are too large to download and
often required to stay in-situ by data-access agreements.
Large files are currently stored in cloud environments, where the
file systems that are required by classical file-oriented tools
are expensively emulated on the basic building blocks
of object storage.
These multiple layers of inefficiencies around processing
VCF data at scale in the cloud mean that it is
time-consuming and expensive, and these vast datasets are
not utilised to their full potential.
To achieve this full potential we
need a new generation of tools that operate directly
on a primary data representation that supports
efficient access across a range of applications,
with native support for cloud object storage.
Such a representation can be termed ``analysis-ready''
and ``cloud-native''~\citep{abernathey2021cloud}.
For the representation to be FAIR~\citep{wilkinson2016fair},
it must be \emph{accessible}, using protocols that are
``open, free, and universally implementable''.
There is currently no efficient, FAIR representation of genetic variation
data suitable for cloud deployments.
Hail~\cite{ganna2016ultra,hail2024} has become the dominant platform
for quality control of large-scale variation datasets,
and has been instrumental in projects such as
gnomAD~\cite{karczewski2020mutational,chen2024genomic}.
While Hail is built on open components
from the Hadoop distributed computing ecosystem~\citep{white2012hadoop},
the details of its MatrixTable format are not documented
or intended for external reuse.
% cite? https://dev.hail.is/t/matrixtable-file-format-reference/173/6
Similarly, commercial solutions that have emerged to facilitate
the analysis of large-scale genetic variation data are either
based on proprietary~\cite{basespace2024,graf2024,googlelifesciences2024,
awshealthomics2024,microsoftgenomics2024}
or single-vendor technologies~\cite[e.g.][]{tiledb2024,genomicsdb2024}.
The next generation of VCF analysis methods requires
an open, free and transparent data representation
with multiple independent implementations.
In this article, we decouple the VCF data model from its row-oriented
file definition, and show how the data can be
compactly stored and efficiently analysed in a cloud-native, FAIR manner.
We do this by translating VCF data into Zarr format,
a method of storing large-scale multidimensional data as a regular
grid of compressed chunks.
Zarr's elegant simplicity and first-class support for
cloud object stores and parallel computing have led to
it gaining substantial traction
across the sciences, and it is now used in multiple petabyte-scale
datasets in cloud deployments (see Methods for details).
We present the VCF Zarr specification that formalises this
mapping, along with the \texttt{vcf2zarr} and \texttt{vcztools}
utilities to reliably convert VCF to and from Zarr at scale.
We show that VCF Zarr is much more compact than
VCF and is competitive with state-of-the-art
file-based VCF compression tools.
Moreover, we show that Zarr's storage of data in an analysis-ready
format greatly facilitates computation,
with various benchmarks being substantially faster than
\texttt{bcftools} based pipelines, and again competitive
with state-of-the-art file-oriented methods.
We demonstrate the scalability and flexibility of VCF Zarr
via case-studies on five boundary-pushing datasets.
We examine conversion and compression performance on three different
data modalities for large human datasets: whole genome sequence
data from Genomics England (5X smaller than gzipped VCF);
genotype data from Our Future Health (2.5X smaller than gzipped VCF);
and exome-like data from All of Us (1.1X larger than gzipped VCF).
We then examine performance
for species with large genomes using Norway Spruce data (3.2X smaller
than gzipped VCF),
and in a large collection of short whole genome alignments
using SARS-CoV-2 data
(81X smaller than gzipped FASTA).
Interoperability with the vast ecosystem of tools
based on VCF~\citep{danecek2021twelve,garrison2022spectrum}
is crucial.
We first present the \texttt{vcztools} program,
which implements a subset of the functionality in \texttt{bcftools}
and allows data stored in Zarr at some URL to be
treated as if it were a VCF or BCF file.
We demonstrate that \texttt{vcztools} can be used as a practical
replacement in processing pipelines.
We then present an illustrative example of how existing
methods can adopt VCF Zarr by modifying the popular
SAIGE~\cite{zhou2018efficiently, zhou2020scalable} GWAS software
(written in C++), and show significant speedups over the existing
VCF (7.9X) and BCF (3.4X) backends in a simple benchmark.
Finally, we illustrate the potential
for a new generation of Zarr-based applications via some illustrative
benchmarks. First, we demonstrate the very high levels of throughput
available on public clouds when data in an object
store is accessed in an asynchronous and highly parallel manner.
Second, we demonstrate the ease and efficiency with which data
in Zarr format can be processed on a GPU, opening many exciting
avenues for future research.
\section{Results}
\subsection{Storing genetic variation data}
Although VCF is the standard format for exchanging genetic variation
data, its limitations both in terms of compression
and query/compute performance are well
known~\citep[e.g.][]{kelleher2013processing,layer2016efficient,li2016bgt},
and many methods
have been suggested to improve on these properties.
Most approaches balance compression with
performance on particular types of queries,
typically using a command line interface (CLI)
and outputting VCF text~\citep{
layer2016efficient, %GQT
li2016bgt, % BGT
tatwawadi2016gtrac, % GTRAC
danek2018gtc, % GTC
lin2020sparse, % SpVCF
lan2020genozip,lan2021genozip, %genozip
lefaive2021sparse, % SAVVY
wertenbroek2022xsi,% XSI
zhang2023gbc,%GBC
luo2024gsc}. %GSC
Several specialised algorithms for compressing
the genotype matrix (i.e., just the genotype calls without additional
VCF information) have been proposed
\citep{qiao2012handling, % SpeedGene
deorowicz2013genome, %TGC
sambo2014compression, % snpack
deorowicz2019gtshark, %GTShark
deorowicz2021vcfshark, % VCFShark
dehaas2024genotype} % GRG
most notably the Positional
Burrows--Wheeler Transform (PBWT)~\citep{durbin2014efficient}.
(See~\citep{mcvean2019linkage} for a review of the techniques
employed in genetic data compression.)
The widely-used PLINK binary format stores genotypes in a
packed binary representation, supporting only biallelic
variants without phase information.
The PLINK 2 PGEN format~\citep{rivas2024efficient} is more general
and compact than PLINK, compressing variant data using specialised
algorithms~\cite{sambo2014compression}.
Methods have also been developed which store variation data
along with annotations in databases to facilitate
efficient queries~\cite[e.g.][]{
paila2013gemini,%GEMINI
lopez2017hgva} %OpenCGA
which are either limited to certain classes of variant~\cite[e.g.][]{greene2023genetic}
or have storage requirements larger
than uncompressed VCF~\citep{al2023critical}.
The SeqArray package~\citep{zheng2017seqarray} builds on the
Genomic Data Storage container format~\cite{zheng2012high}
to store VCF genotype data in a packed and compressed format,
and is used in several downstream R packages~\cite[e.g.][]{
gogarten2019genetic,fernandes2020simplephenotypes}.
VCF is a row-wise format in which
observations and metadata for a single variant are
encoded as a line of text~\citep{danecek2011variant}.
BCF~\citep{li2011statistical}, the standard binary representation of VCF,
is similarly row-wise, as
are the majority of proposed alternative storage
formats.
Row-wise storage makes retrieving all information
for a given record straightforward and efficient,
and works well when records are either relatively small
or we typically want to analyse each record in its entirety.
When we want to analyse only a subset of a record,
row-wise storage can be inefficient because we will usually need to
retrieve more information than required from storage. In the case
of VCF (and BCF) where records are not of a fixed size and
are almost always compressed in blocks, accessing any information
for a set of rows means retrieving and decompressing \emph{all}
information from these rows.
The usual alternative to row-wise storage is \emph{columnar} storage:
instead of grouping together all the fields for a record,
we group together all the records for a given field.
Columnar storage formats such as Parquet~\citep{parquet2024}
make retrieving particular columns much
more efficient and can lead to substantially better compression.
While columnar techniques have been successfully applied
in alignment
storage~\citep[e.g.][]{bonfield2014scramble,nothaft2015rethinking,bonfield2022cram},
columnar technologies for
storing and analysing variation data have had limited
success~\citep{boufea2017managing,fan2020variant}.
Mapping VCF directly to a columnar layout, in which there is a
column for the genotypes (and other per-call QC metrics)
for each sample, leads to a large number of columns, which
can be cumbersome and cause scalability issues.
Fundamentally, columnar methods are one-dimensional, storing a vector
of values associated with a particular key, whereas
genetic variation data is usually modelled as a two-dimensional matrix
in which we are interested in accessing both rows \emph{and} columns.
Just as row-oriented storage makes accessing data for a given
sample inefficient, columnar storage makes accessing all the data
for a given variant inefficient.
\begin{figure}
\resizebox{225pt}{!}{\input{diagrams/data-model.tex}}
\caption{Chunked compressed storage of VCF data using Zarr.
The \texttt{call\_genotype} array is a three-dimensional (variants, samples,
ploidy) array of integers, split into a uniform grid of
chunks determined by the variant and sample chunk sizes (10,000
and 1,000 by default in \texttt{vcf2zarr}). Each chunk is associated
with a key defining its location in this grid, which can be stored
in any key-value store such as a standard file-system or cloud object
store. Chunks are compressed independently using standard
codecs and pre-compression filters, which can be specified on a per-array
basis. Also shown are the one-dimensional \texttt{variant\_contig} (CHROM)
and \texttt{variant\_position} arrays (POS). Other fields are stored
in a similar fashion. \label{fig-data-model}}
\end{figure}
VCF is at its core an encoding of the genotype matrix, where each entry
describes the observed genotypes for a given sample at a given variant site,
interleaved with per-variant information
and other call-level matrices (e.g., the GQ or AD fields).
The data is largely numerical and of fixed dimension,
making it a natural fit for array-oriented or ``tensor'' storage.
We propose the VCF Zarr specification which maps the
VCF data model into an array-oriented layout using Zarr
(Fig~\ref{fig-data-model}).
In the VCF Zarr specification,
each field in a VCF is mapped to a separately-stored array,
allowing for efficient retrieval and
high levels of compression.
% In particular, call-level data is stored as $m \times n$ arrays
% (for $m$ sites and $n$ samples), allowing for efficient
% retrieval of subsets of those fields along both the
% variants and samples axis.
See the Methods for more detail on Zarr and the VCF Zarr
specification.
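
For illustration, the following minimal sketch (using the Zarr-Python
API; the store path \texttt{example.vcz} is hypothetical) opens such a
hierarchy and reads a few of the arrays shown in
Fig~\ref{fig-data-model}; only the arrays that are accessed are read
from storage.
\begin{verbatim}
# Minimal sketch: open a (hypothetical) VCF Zarr store and
# read a few of its arrays; other fields stay on disk.
import zarr

root = zarr.open("example.vcz", mode="r")
contig = root["variant_contig"][:]   # CHROM, as integer codes
pos = root["variant_position"][:]    # POS
gt = root["call_genotype"]           # (variants, samples, ploidy)
print(len(pos), gt.shape, gt.chunks)
\end{verbatim}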
\begin{figure}
\begin{center}
\includegraphics[]{figures/data-scaling}
\end{center}
\caption{Compression performance on simulated genotypes.
Comparison of total stored bytes for VCF data produced
by subsets of a large simulation of French-Canadians.
Sizes for $10^6$ samples are shown on the right.
Also shown for reference is the size of genotype matrix
when encoded as two bits per diploid genotype (2bit), as used
in the PLINK binary format. Default compression settings were
used in all cases.
\label{fig-data-storage}}
\end{figure}
One of the key benefits of Zarr is its cloud-native design,
but it also works well on standard file systems, where
arrays and chunks are stored hierarchically in directories
and files (storage as a single Zip archive is also supported,
and has certain advantages as discussed later).
To enable comparison with the existing file-based ecosystem
of tools, we focus on Zarr's file system chunk storage in a series of illustrative
benchmarks in the following sections before presenting cloud-based benchmarks.
We compare primarily with
VCF/BCF based workflows using \texttt{bcftools} because this
is the standard practice, used in the vast majority of cases.
We also compare with two representative recent specialised utilities;
see~\cite{danek2018gtc,zhang2023gbc,luo2024gsc} for further benchmarks of
these and other tools.
Genozip~\cite{lan2020genozip,lan2021genozip} is a tool focused
on compression performance, which uses a custom file format
and a CLI to extract VCF as text with various filtering options.
Savvy~\cite{lefaive2021sparse} is an extension of BCF which
takes advantage of sparsity in the genotype matrix as well
as using PBWT-based approaches for improved compression.
Savvy provides a CLI as well as a C++ API.
Our benchmarks are based on genotype data
from subsets of a large and highly realistic
simulation of French-Canadians~\cite{anderson2023on}
(see Methods for details on the dataset and benchmarking methodology).
Note that while simulations cannot capture
all the subtleties of real data, the allele frequency
and population structure patterns in this dataset
have been shown to closely follow
observations~\cite{anderson2023on} and so it provides
a reasonable and easily reproducible data point
when comparing such methods.
The simulations only contain genotypes without any additional
high-entropy QC fields, which is unrealistic
(see the later case-studies
for benchmarks on datasets that contain a wide variety
of such fields).
Note, however, that such minimal, genotype-only data
is something of a best-case scenario for specialised genotype
compression methods using row-wise storage.
Fig~\ref{fig-data-storage} shows compression performance
on up to a million samples for chromosome 21, with
the size of the genotype matrix encoded as one bit per haploid
call included for reference.
% 2bit 1689.153258 G
% vcf 81.375831 G
% = 20.75 X, 4.8%
Gzip compressed VCF performs remarkably well, compressing
the data to around 5\% of the
minimal binary encoding of a biallelic genotype matrix
for 1 million samples.
BCF provides a significant improvement in compression
performance over VCF (note the log-log scale). Genozip has
superb compression, having far smaller file sizes that the
other methods (although somewhat losing its advantage at
larger sample sizes). Zarr and Savvy have
very similar compression performance in this example (23.75 GiB
and 21.25 GiB for $10^6$ samples, respectively). When using
a chunk size of $10^4$ variants $\times$ $10^3$ samples, Zarr
storage is reduced to 22.07 GiB for $10^6$ samples (data not shown;
see Figs~\ref{fig-compression-chunksize} and \ref{fig-compression-chunksize-finegrained}
for more analysis on the effect of chunk size on compression).
It is remarkable that the simple approach of compressing
two dimensional chunks of the genotype matrix
using the Zstandard compressor~\citep{collet2021rfc} and the
bit-shuffle filter from Blosc~\cite{alted2010modern}
(see Methods for details) produces
compression levels competitive with the highly specialised methods
used by Savvy.
\subsection{Calculating with the genotype matrix}
Storing genetic variation data compactly is important, but it is also
important that we can analyse the data efficiently. Bioinformatics
workflows tend to emphasise text files and command line utilities
that consume and produce text~\citep[e.g.][]{buffalo2015bioinformatics}.
Thus, many tools that compress VCF data provide a command line
utility with a query language to restrict the records
examined, perform some pre-specified calculations and finally
output some text, typically VCF or tab/comma separated
values~\citep{
layer2016efficient, %GQT
li2016bgt, % BGT
danek2018gtc, % GTC
lin2020sparse, % SpVCF
lan2020genozip,lan2021genozip, %genozip
zhang2023gbc,%GBC
luo2024gsc}. %GSC
These pre-defined calculations are by necessity limited in scope, however,
and the volumes of text involved in Biobank scale datasets
make the classical approach of custom
analyses via Unix utilities in pipelines prohibitively slow. Thus,
methods have begun to provide Application Programming Interfaces
(APIs), providing efficient access to genotype and other VCF
data~\cite[e.g.][]{kelleher2013processing,lefaive2021sparse,
wertenbroek2022xsi}. By providing programmatic access,
the data can be retrieved from storage, decoded and then analysed
in the same memory space without additional copies and
inter-process communication through pipes.
To demonstrate the accessibility of genotype data and efficiency with
which calculations can be performed under the different formats,
we use the \texttt{bcftools +af-dist} plugin
(which computes a table of
deviations from Hardy-Weinberg expectations in
allele frequency bins) as an example.
% The details of the \texttt{af-dist} operation are not important:
% as an example of a whole-matrix operation.
We chose this particular operation for several reasons.
First, it is a straightforward calculation that
requires examining every element in the genotype matrix,
and can be reproduced in different programming languages
without too much effort.
Second, it produces a small volume of output, and therefore the
time spent outputting results is negligible.
Finally, it has an efficient implementation written using the
\texttt{htslib} C API~\citep{bonfield2021htslib},
and therefore running this command on a VCF or BCF file provides
a reasonable approximation of the limit of what can be achieved in terms
of whole-matrix computation on these formats.
\begin{figure}
\includegraphics{figures/whole-matrix-compute}
\caption{Whole-matrix compute performance with increasing sample size.
Total CPU time required to run \texttt{bcftools +af-dist}
and equivalent operations in a single thread for various tools.
Elapsed time is also reported (dotted line). Run-times for genozip
and bcftools on VCF
at $10^6$ samples were extrapolated by fitting an exponential.
See Methods for full details.
\label{fig-whole-matrix-compute}}
\end{figure}
Fig~\ref{fig-whole-matrix-compute} shows timing results
for running \texttt{bcftools +af-dist} and equivalent operations
on the data of Fig~\ref{fig-data-storage}. There is a large
difference in the time required (note the log-log scale).
The slowest approach uses Genozip. Because Genozip does not
provide an API and only outputs VCF text, the best approach available
is to pipe its output into \texttt{bcftools +af-dist}.
This involves first decoding the data from Genozip format,
then generating large volumes of VCF text (terabytes, in the
largest examples here), which we must
subsequently parse before finally doing the actual calculation.
Running \texttt{bcftools +af-dist} directly on the gzipped VCF
is substantially faster, indicating that Genozip's excellent
compression performance comes at a substantial decompression cost.
Using a BCF file is again significantly faster,
because the packed binary format avoids the overhead of parsing
VCF text into \texttt{htslib}'s internal data structures.
We only use BCF for subsequent \texttt{bcftools} benchmarks.
The data shown in Fig~\ref{fig-whole-matrix-compute} for Zarr and Savvy
is based on custom programs written using their respective APIs
to implement the \texttt{af-dist} operation. The Zarr program uses
the Zarr-Python package to iterate over the decoded chunks of the
genotype matrix and classifies genotypes within a chunk using a 14-line Python
function, accelerated using the Numba JIT compiler~\cite{lam2015numba}.
The allele frequencies and genotype counts are then analysed to produce
the final counts within the allele frequency bins with 9 lines of
Python using NumPy~\cite{harris2020array} functions. Remarkably, this
short and simple Python program is substantially faster than the
equivalent compiled C using \texttt{htslib} APIs on BCF (7.3 hours
vs 20.6 hours for 1 million samples).
% num_samples num_sites tool user_time sys_time wall_time total_time
% 21 1000000 7254858 savvy 8451.25 16.59 8471.293500 hdd 2.352178
% 10 1000000 7254858 bcftools 73939.32 46.09 74023.083678 hdd 20.551503
% 27 1000000 7254858 zarr 26227.92 187.22 31476.507443 hdd 7.337539
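
As an indication of what such a chunkwise implementation looks like,
the sketch below is a simplified illustration (not the benchmarked
14-line function; the store path is hypothetical and the binning is
reduced to its essentials) that iterates over variant chunks of
\texttt{call\_genotype}, counts heterozygous calls with a
Numba-compiled kernel, and compares them with Hardy--Weinberg
expectations in allele frequency bins.
\begin{verbatim}
# Illustrative chunkwise af-dist-style calculation (simplified).
import numba
import numpy as np
import zarr

@numba.njit
def count_chunk(G, het, alt_count):
    # G: (variants, samples, 2) integers; negative values = missing.
    for j in range(G.shape[0]):
        for k in range(G.shape[1]):
            a, b = G[j, k, 0], G[j, k, 1]
            if a < 0 or b < 0:
                continue
            if a != b:
                het[j] += 1
            if a > 0:
                alt_count[j] += 1
            if b > 0:
                alt_count[j] += 1

root = zarr.open("example.vcz", mode="r")  # hypothetical path
gt = root["call_genotype"]
m, n, _ = gt.shape
het = np.zeros(m, dtype=np.int64)
alt_count = np.zeros(m, dtype=np.int64)
step = gt.chunks[0]
for start in range(0, m, step):
    stop = min(start + step, m)
    count_chunk(gt[start:stop], het[start:stop],
                alt_count[start:stop])

# Bin variants by alternate allele frequency and compare observed
# heterozygote counts with Hardy-Weinberg expectations per bin.
freq = alt_count / (2 * n)
expected = 2 * freq * (1 - freq) * n
bins = np.linspace(0, 1, 11)
which = np.digitize(freq, bins)
for i in range(1, len(bins)):
    sel = which == i
    print(bins[i], het[sel].sum(), expected[sel].sum())
\end{verbatim}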
The fastest method is the
C++ program written using the Savvy API. This appears to be largely
due to Savvy's excellent genotype decoding performance
% zarr 1.2 GiB
% savvy 6.6 GiB
% zarr_nshf 2.7 GiB
% zarr_lz4_nshf 2.9 GiB
(up to 6.6 GiB/s vs 1.2 GiB/s for Zarr on this dataset;
Fig~\ref{fig-whole-matrix-decode}).
% sav / zarr ratio: [0.68777206 0.75870945 0.69865406 0.64644888 0.68131752
% 0.89460906]
% sequence_length num_samples num_sites tool size
% 58 48129895 1000000 7254858 zarr 23.752762
% 50 48129895 1000000 7254858 10kx1k.zarr 22.068925
% 51 48129895 1000000 7254858 bcf 51.749294
% 52 48129895 1000000 7254858 genozip 10.691384
% 53 48129895 1000000 7254858 lz4.noshuffle.zarr 214.868005
% 54 48129895 1000000 7254858 noshuffle.zarr 31.208149
% 55 48129895 1000000 7254858 sav 21.249436
% 56 48129895 1000000 7254858 tsk 1.802636
% 57 48129895 1000000 7254858 vcf 81.375831
% 59 48129895 1000000 7254858 zarr.zip 24.240338
% 5 48129895 1000000 7254858 2bit 1689.153258
Turning off the BitShuffle filter for the Zarr dataset,
however, leads to a substantial increase in decoding speed
(2.7 GiB/s) at the cost of a roughly 30\% increase in storage
space (31.2 GiB up from 23.75 GiB for 1 million samples; data not
shown). The LZ4 codec (again without BitShuffle) leads to slightly higher
decoding speed (2.9 GiB/s), but at the cost of much larger storage requirements
(214 GiB; data not shown).
Given the relatively small contribution of genotypes to the
overall storage of real datasets (see the Genomics England example)
and the frequency that they are likely to be accessed, using
Zstandard without BitShuffle
would seem like a good tradeoff in many cases.
This ability to easily tune compression performance
and decoding speed on a field-by-field basis is a major strong
point of Zarr. The \texttt{vcf2zarr} utility also provides
functionality to aid with such storage schema tuning.
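
A per-array storage schema of this kind can be expressed directly with
Zarr and numcodecs; the following sketch (Zarr-Python 2 API; the array
shapes and store path are chosen for illustration and are not the
\texttt{vcf2zarr} defaults) creates one genotype array compressed with
Zstandard plus BitShuffle and a second with shuffling disabled for
faster decoding.
\begin{verbatim}
# Illustrative per-array codec choices with Zarr and numcodecs.
import numcodecs
import zarr

# Zstandard + bit shuffle: best compression, slower decoding.
compact = numcodecs.Blosc(cname="zstd", clevel=7,
                          shuffle=numcodecs.Blosc.BITSHUFFLE)
# Zstandard without shuffling: larger, but faster to decode.
fast = numcodecs.Blosc(cname="zstd", clevel=7,
                       shuffle=numcodecs.Blosc.NOSHUFFLE)

root = zarr.open("schema-demo.zarr", mode="w")
root.zeros("call_genotype", shape=(100_000, 1_000, 2),
           chunks=(10_000, 1_000, 2), dtype="i1",
           compressor=compact)
root.zeros("call_genotype_fast", shape=(100_000, 1_000, 2),
           chunks=(10_000, 1_000, 2), dtype="i1",
           compressor=fast)
\end{verbatim}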
The discrepancy between elapsed time and total CPU time for Zarr
benchmarks in
Figs~\ref{fig-whole-matrix-compute}--\ref{fig-column-extract}
and~\ref{fig-whole-matrix-decode} is mostly due to the overhead
of accessing a large number of small chunk files and
consequent reduced effectiveness of operating system read-ahead caching.
While having many independent chunks maps
naturally to high-performance parallel access patterns
in the cloud store setting~\citep{durner2023exploiting}
(see also the section on accessing Genomics England data stored on S3),
such file access patterns can be problematic in the
classical file-system setting.
The Zarr Zip store (which is simply a Zip archive of the standard directory
hierarchy) provides a useful alternative here and can
significantly reduce overall processing time.
(The nascent Zarr v3 specification provides another option
via ``sharding'' which essentially
allows multiple chunks to be stored in a single file.)
To illustrate this
we repeated the benchmarks in Fig~\ref{fig-whole-matrix-decode} in
which we measure the time required to decode the entire genotype
matrix (thus serving as a lower-bound for any subsequent computations).
In the case of $10^6$ samples, the Zip file requires slightly more
% Full data table above ^
% 58 48129895 1000000 7254858 zarr 23.752762
% 59 48129895 1000000 7254858 zarr.zip 24.240338
storage space than the directory hierarchy (23.75 GiB and 24.24 GiB
for file system and Zip file, respectively)
and while total
CPU time is similar (193m vs 187m), the elapsed wall clock
time for the Zip file version is significantly less
(258m vs 188m).
% tool user_time sys_time cpu_time wall_time
% 11 zarr 11383.82 191.01 192.913833 258.598778
% 29 zarr.zip 11198.37 43.13 187.358333 188.659252
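
Switching to the Zip store requires only a change to the store object;
a minimal sketch (Zarr-Python 2 API, hypothetical archive name) is:
\begin{verbatim}
# Reading from a Zarr Zip store instead of a directory hierarchy.
import zarr

store = zarr.ZipStore("example.vcz.zip", mode="r")
root = zarr.open(store, mode="r")
gt = root["call_genotype"]
print(gt.shape, gt.chunks)
store.close()
\end{verbatim}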
Zarr is a standard with multiple
independent implementations across different programming languages
(see Methods). Fig~\ref{fig-whole-matrix-compute-zarr-versions}
compares the CPU time required to perform the computations
of Fig~\ref{fig-whole-matrix-compute}
for af-dist implementations using the Zarr-Python~\cite{zarrpython},
JZarr~\cite{jzarr},
and TensorStore~\citep{tensorstore} (C++ and Python) libraries.
The program using the TensorStore C++ API is fastest, requiring
about 20\% less CPU time than the version using Zarr-Python and Numba
used in our other benchmarks.
% num_samples num_sites tool user_time sys_time wall_time
% storage total_time
% 22 1000000 7254858 zarr 16709.55 38.14 16910.057917 hdd 4.652136
% 20 1000000 7254858 ts_cpp 13233.50 92.81 13573.456810 hdd 3.701753
% 21 1000000 7254858 zarr_java 43780.50 245.28 44134.462377 hdd 12.229383
% 23 1000000 7254858 ts_py 18027.43 642.40 19163.174182 hdd 5.186064
% Cpp / Zarr python [0.79197226]
\subsection{Subsetting the genotype matrix}
\begin{figure}[t]
\includegraphics{figures/subset-matrix-compute}
\caption{Compute performance on subsets of the matrix.
Total CPU time required to run the af-dist calculation for
a contiguous subset of 10,000 variants $\times$ 10 samples
from the middle of the matrix
for the data in Fig~\ref{fig-data-storage}.
Elapsed time is also reported (dotted line).
The \texttt{genozip} and \texttt{bcftools} pipelines involve
multiple commands required to correctly calculate the AF INFO field
required by \texttt{bcftools +af-dist}. See the Methods for full details
on the steps performed.
\label{fig-subset-matrix-compute}}
\end{figure}
As datasets grow ever larger, the ability to efficiently access subsets
of the data becomes increasingly important. VCF/BCF achieve efficient
access to the data for genomic ranges
by compressing blocks of adjacent records using \texttt{bgzip},
and storing secondary indexes alongside the original
files with a conventional suffix~\citep{li2011tabix}.
Thus, for a given range query we
decompress only the necessary blocks and can quickly access
the required records.
The row-wise nature of VCF (and most proposed alternatives), however, means
that we cannot efficiently subset \emph{by sample}
(e.g., to calculate statistics within a particular cohort). In the extreme
case, if we want to access only the genotypes for a single sample
we must still retrieve and decompress the entire dataset.
We illustrate this cost of row-wise encoding in
Fig~\ref{fig-subset-matrix-compute}, where we run the af-dist calculation
on a small fixed-size subset of the genotype matrices of
Fig~\ref{fig-data-storage}. The two-dimensional chunking of Zarr
means that this sub-matrix can be efficiently
extracted, and therefore the execution time depends very weakly on
the overall dataset size, with the computation requiring around
2 seconds for 1 million samples. Because of their
row-wise encoding, CPU time scales with the number of samples
for all the other methods.
Fig~\ref{fig-subset-matrix-compute-supplemental} shows performance
for the same operation when selecting half of the samples in the
dataset.
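
The corresponding access pattern in Zarr is a plain array slice; a
minimal sketch (hypothetical store path) of extracting the
10,000 variant $\times$ 10 sample sub-matrix used in
Fig~\ref{fig-subset-matrix-compute} is:
\begin{verbatim}
# Extract a small sub-matrix; only the chunks overlapping the
# requested slice are read and decompressed.
import zarr

root = zarr.open("example.vcz", mode="r")
gt = root["call_genotype"]
v0 = gt.shape[0] // 2   # start in the middle of the matrix
s0 = gt.shape[1] // 2
subset = gt[v0:v0 + 10_000, s0:s0 + 10]
print(subset.shape)     # (10000, 10, ploidy)
\end{verbatim}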
\subsection{Extracting, inserting and updating fields}
\begin{figure}
\includegraphics{figures/column-extract}
\caption{Time to extract the genome position and write to a text file.
Total CPU time required to extract the POS field for BCF,
sav and Zarr formats
for the data in Figure~\ref{fig-data-storage}.
For the BCF file we used \texttt{bcftools query -f"\%POS$\backslash$n"}.
For sav, we used the Savvy C++ API to extract position for each variant
and output text using the \texttt{std::cout} stream. For Zarr, we read
the variant\_position array into a NumPy array, and then wrote it to
a text file using the Pandas \texttt{to\_csv} method.
Zarr CPU time is dominated by writing the text output; we also show
the time required to populate a NumPy array with the data in Zarr,
which is 2 seconds. Wall-clock time (dotted line) is dominated
in this case by file I/O. Note that our benchmarking methodology of
dropping the file system caches before each run (see Methods) is
likely not reflective of real world usage where a small and
frequently used field such as POS would remain in cache.
Time to output text for Savvy is not significant
for $> 1000$ samples (not shown).
\label{fig-column-extract}}
\end{figure}
We have focused on the genotype matrix up to this point, contrasting
Zarr with existing row-wise methods.
Real-world VCFs encapsulate much more than just the genotype
matrix, and can contain large numbers of additional fields.
Fig~\ref{fig-column-extract} shows the time required to extract
the genomic position of each variant in the simulated benchmark
dataset, which we can use as an indicative example of a per-variant
query. Although Savvy is many times faster than \texttt{bcftools query}
here, the row-wise storage strategy that they share means that
the entire dataset must be read into memory and
decompressed to extract just one field from each record. Zarr
excels at these tasks: we only read and decompress the information required.
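
The Zarr side of this benchmark amounts to reading a single
one-dimensional array and writing it out as text; a minimal sketch
(hypothetical store path) is:
\begin{verbatim}
# Read the POS array and write it as text, one value per line.
import pandas as pd
import zarr

root = zarr.open("example.vcz", mode="r")
pos = root["variant_position"][:]  # only this array is decompressed
pd.Series(pos).to_csv("pos.txt", index=False, header=False)
\end{verbatim}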
Many of the additional fields that we find in real-world VCFs are
variant-level annotations, extensively used in downstream applications.
For example, a common workflow is to
add or update variant IDs in a VCF using a reference database
such as dbSNP \cite{Sherry2001dbSNP}. The standard approach to this
(using e.g.\ \texttt{bcftools annotate}) is to create a \emph{copy} of
the VCF which includes these new annotations. Thus, even though
we may only be altering a single field comprising a tiny fraction
of the data, we still read, decompress, update, compress and
write the entire dataset to a new file. With Zarr,
we can update an existing field or add arbitrary additional
fields without touching the rest of the data or creating redundant
copies.
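
As a concrete illustration, the sketch below (Zarr-Python 2 API; the
annotation name, values and store path are hypothetical) adds a new
per-variant array alongside the existing ones without rewriting any
other field.
\begin{verbatim}
# Add a new per-variant annotation array in place.
import zarr

root = zarr.open("example.vcz", mode="r+")
n_variants = root["variant_position"].shape[0]
ann = root.zeros("variant_new_annotation", shape=(n_variants,),
                 chunks=root["variant_position"].chunks,
                 dtype="f4")
# Update values for a slice of variants without touching any
# other array in the store.
ann[:1000] = 1.0
\end{verbatim}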
\subsection{Case study: Genomics England WGS data}
In this section we demonstrate the utility of VCF Zarr on a large human dataset
and the scalability of the \texttt{vcf2zarr} conversion utility.
Genomics England’s multi-sample VCF dataset (aggV2) is an
aggregate of 78,195 gVCFs from rare disease and cancer participants
recruited as part of the 100,000 Genomes
Project~\cite{turnbull2018100,caulfield2017national}.
The dataset comprises approximately 722 million annotated single-nucleotide
variants and small indels split into 1,371 roughly equal chunks and
totalling 165.3 TiB of VCF data after \texttt{bgzip} compression.
The dataset is used for a variety of research purposes, ranging from
GWAS~\cite{kousathanas2022whole} and
imputation~\cite{shi2023genomics} to
simple queries involving single gene
regions~\cite{leggatt2023genotype,lam2023repeat}.
As described in the Methods, conversion to Zarr using
\texttt{vcf2zarr} is a two-step process. We
first converted the 106 VCF files (12.81 TiB) for chromosome 2
into the intermediate columnar format (ICF). This task was
split into 14,605 partitions, and distributed using the Genomics England
HPC cluster. The average run-time per partition was 20.7 min.
The ICF representation used a total
of 9.94 TiB over 3,960,177 data storage files.
We then converted the ICF to Zarr, partitioned into
5,989 independent jobs, with an 18.6 min average run-time.
This produced a dataset with 44 arrays, consuming a
total of 2.54 TiB of storage over 6,312,488
chunk files (using a chunk size of $10^4$ variants $\times 10^3$
samples). This is a roughly 5X reduction in total storage
space over the original VCF.
The top fields in terms
of storage are detailed in Table~\ref{tab-genomics-england-data}.
We do not compare with other tools such as Genozip and
Savvy here because they have fundamental limitations
(as shown in earlier simulation-based benchmarks),
and conversion of these large VCFs is a major undertaking.
\begin{table}
\caption{Summary for VCF Zarr conversion of
Genomics England WGS data for chromosome 2 (78,195 samples, 59,880,903 variants),
consisting of 44 fields and 2.54 TiB of storage ($\sim$5X compression over
source gzipped VCF).
Each field is stored independently
as a Zarr array with the given type (sufficient to represent all values in the
data). We show the total storage consumed (reported via \texttt{du}) in
power-of-two units, and the compression ratio achieved on that array.
We also show the percentage of the overall storage that each array consumes.
Shown are the 13 fields consuming at least 0.01\% of the overall storage.
\label{tab-genomics-england-data}}
\begin{tabular}{llS[table-format=3.2]S[table-format=5.1]S[table-format=3.2]}
\toprule
{Field} & {Type} & {Storage} & {Compress} & {\%Total} \\
\midrule
call\_AD & int16 & 658.44 GiB & 26.0 & 25.35\% \\
call\_GQ & int16 & 654.45 GiB & 13.0 & 25.20\% \\
call\_DP & int16 & 570.03 GiB & 15.0 & 21.95\% \\
call\_DPF & int16 & 447.09 GiB & 20.0 & 17.22\% \\
call\_PL & int16 & 162.56 GiB & 160.0 & 6.26\% \\
call\_GQX & int16 & 40.99 GiB & 210.0 & 1.58\% \\
call\_FT & str & 25.04 GiB & 1400.0 & 0.96\% \\
call\_genotype & int8 & 21.46 GiB & 410.0 & 0.83\% \\
call\_genotype\_mask & bool & 12.78 GiB & 680.0 & 0.49\% \\
call\_genotype\_phased & bool & 2.35 GiB & 1900.0 & 0.09\% \\
call\_PS & int8 & 383.38 MiB & 12000.0 & 0.01\% \\
call\_ADF & int8 & 383.38 MiB & 12000.0 & 0.01\% \\
call\_ADR & int8 & 383.38 MiB & 12000.0 & 0.01\% \\
\bottomrule
\end{tabular}
\end{table}
Table~\ref{tab-genomics-england-data} shows that the dataset storage
size is dominated by a few fields, with the top four
(call\_AD, call\_GQ, call\_DP and call\_DPF) accounting for
90\% of the total. These fields are much less compressible
than genotype data (which uses $<1\%$ of the total space here)
because of their inherent noisiness~\citep{lin2020sparse}.
Note that these top four fields are stored as 16 bit integers
because they contain rare outliers that cannot be stored as
8 bits. While the fields could likely be truncated to have
a maximum of 127 with minimal loss of information,
the compression gains from doing so are relatively minor,
and we therefore opt for fully lossless compression here
for simplicity. The call\_PS field here has an extremely high
compression ratio because it consists entirely of missing data
(i.e., it was listed in the header but never used in the VCF).
To demonstrate the computational accessibility of Zarr on this
large human dataset, we performed some illustrative benchmarks.
As these benchmarks take some time to run, we focus
on a single 132 GiB compressed VCF file covering
positions 58,219,159--60,650,943 (562,640 variants)
from the middle of the list of 106 files for chromosome 2.
We report both the total CPU time and elapsed wall-clock time here
as both are relevant.
First, we extracted the genome position for each variant in this single VCF
chunk using \texttt{bcftools query} and Python Zarr code as described in
Fig~\ref{fig-column-extract}. The \texttt{bcftools} command required
% def parse_time(time_str):
% minsplit = time_str.split("m")
% return datetime.timedelta(minutes=int(minsplit[0]),
% seconds=float(minsplit[1][:-1]))
% def parse_time_output(s):
% values = {}
% for line in s.splitlines():
% split = line.split()
% values[split[0]] = parse_time(split[1])
% return values["real"].seconds / 60, (values["user"] + values["sys"]).seconds / 60
% real 85m51.013s
% user 54m39.986s
% sys 0m45.277s
55.42 min CPU and 85.85 min elapsed.
% CPU times: user 846 ms, sys: 1.93 s, total: 2.78 s
% Wall time: 1min 44s
% 55.42 * 60 / 2.78 = 1196
% 85.85 / 1.73 = 49.62
The Zarr code required 2.78 sec CPU and 1.73 min elapsed.
This is a 1196X smaller CPU burden and a 50X speed-up in elapsed time.
The major difference between CPU time and wall-time is noteworthy
here, and highlights the importance of using an index to find
the location of variant chunks.
This benchmark predates the \texttt{region\_index}
array computed by \texttt{vcf2zarr} (see Methods) which
stores sufficient information to identify which variant chunks intersect