%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Please note that whilst this template provides a
% preview of the typeset manuscript for submission, it
% will not necessarily be the final publication layout.
%
% letterpaper/a4paper: US/UK paper size toggle
% num-refs/alpha-refs: numeric/author-year citation and bibliography toggle
%\documentclass[letterpaper]{oup-contemporary}
\documentclass[a4paper,num-refs]{oup-contemporary}
%%% Journal toggle; only specific options recognised.
%%% (Only "gigascience" and "general" are implemented now. Support for other journals is planned.)
\journal{preprint}
\usepackage{graphicx}
\usepackage{siunitx}
\usepackage[switch]{lineno}
\linenumbers
% Additional packages for data model figure
\usepackage{forest}
\usepackage{tikz}
\usetikzlibrary{quotes,arrows.meta,3d}
%%% Flushend: You can add this package to automatically balance the final page, but if things go awry (e.g. section contents appearing out-of-order or entire blocks or paragraphs are coloured), remove it!
% \usepackage{flushend}
% Macros etc for data model figure
\input{diagrams/data-model-preamble.tex}
% See https://github.com/sgkit-dev/vcf-zarr-publication/issues/87
% for discussion
\title{Analysis-ready VCF at Biobank scale using Zarr}
%%% Use the \authfn to add symbols for additional footnotes, if any.
% 1 is reserved for correspondence emails; then continuing with 2 etc for contributions.
% First author
\author[1,\authfn{1}]{Eric Czech} % https://orcid.org/0000-0002-4254-4255
\author[2,3,\authfn{1}]{Timothy R. Millar} % https://orcid.org/0000-0002-5142-8811
\author[4,\authfn{1}]{Tom White}
% Middle
\author[5]{Ben Jeffery} % https://orcid.org/0000-0002-1982-6801
\author[6]{Alistair Miles} % https://orcid.org/0000-0001-9018-4680
\author[7]{Sam Tallman} % https://orcid.org/0000-0001-7183-6276
\author[1]{Rafal Wojdyla} % https://orcid.org/0009-0005-0735-7090
\author[8]{Shadi Zabad} % https://orcid.org/0000-0002-8003-9284
% Senior
\author[1,\authfn{2}]{Jeff Hammerbacher} % https://orcid.org/0000-0001-6596-8563
\author[5,\authfn{2},\authfn{3}]{Jerome Kelleher} % https://orcid.org/0000-0002-7894-5253
\affil[1]{Related Sciences}
\affil[2]{The New Zealand Institute for Plant \& Food Research Ltd, Lincoln,
New Zealand}
\affil[3]{Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand}
\affil[4]{Tom White Consulting Ltd.}
\affil[5]{Big Data Institute, Li Ka Shing Centre for Health Information and Discovery,
University of Oxford, UK}
\affil[6]{Wellcome Sanger Institute}
\affil[7]{Genomics England}
\affil[8]{School of Computer Science, McGill University, Montreal, QC, Canada}
%%% Author Notes
\authnote{\authfn{1}Joint first author.}
\authnote{\authfn{2}Joint senior author.}
\authnote{\authfn{3}[email protected]}
%%% Paper category
\papercat{Paper}
%%% "Short" author for running page header
\runningauthor{Czech et al.}
%%% Should only be set by an editor
\jvolume{00}
\jnumber{0}
\jyear{2024}
\begin{document}
\begin{frontmatter}
\maketitle
% The Abstract (250 words maximum) should be structured to
% include the following details:
% \textbf{Background}, the context and purpose of
% the study;
% \textbf{Results}, the main findings;
% \textbf{Conclusions}, brief
% summary and potential implications. Please minimize the use of abbreviations
% and do not cite references in the abstract.
%% NOTE: this is much too long currently, but keeping for now so we
% can see which stuff migrates to the intro
\begin{abstract}
\textbf{Background:}
Variant Call Format (VCF) is the standard file format for interchanging
genetic variation data and associated quality control metrics.
% It provides a well-defined data model and is central to a large ecosystem
% of interoperating tools.
The usual row-wise encoding of the VCF data model (either as text
or packed binary) emphasises efficient retrieval of all data for a given
variant, but accessing data on a field or sample basis is inefficient.
Biobank scale datasets currently available
consist of hundreds of thousands of whole genomes
and hundreds of terabytes of compressed VCF.
Row-wise data storage is fundamentally unsuitable at this scale,
and a more scalable approach is needed.
\textbf{Results:}
We present the VCF Zarr specification, an encoding of the
VCF data model using Zarr which makes retrieving subsets of the
data much more efficient. Zarr is a cloud-native format for storing
multi-dimensional data, widely used in scientific computing.
We show how this format is far more efficient than
standard VCF based approaches,
and competitive with specialised methods for
storing genotype data in terms of compression ratios
and calculation performance.
We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility)
on a subset of the Genomics England aggV2 dataset comprising
78,195 samples and 59,880,903 variants,
with a 5X reduction in storage and greater than 300X reduction in CPU usage
in some representative benchmarks.
\textbf{Conclusions:}
Large row-encoded VCF files are a major bottleneck for current research, and
storing and processing these files incurs a substantial cost.
The VCF Zarr specification, building on widely-used, open-source technologies,
has the potential to greatly reduce these costs,
and may enable a diverse ecosystem of next-generation tools for analysing
genetic variation data directly from cloud-based object stores,
while maintaining compatibility with existing file-oriented workflows.
\end{abstract}
\begin{keywords}
Variant Call Format; Zarr; Analysis-ready data.
\end{keywords}
\end{frontmatter}
%%% Key points will be printed at top of second page
\begin{keypoints*}
\begin{itemize}
\item VCF is widely supported, and its underlying data model is entrenched
in bioinformatics pipelines.
\item The standard row-wise encoding as text (or binary) is inherently
inefficient for large-scale data processing.
\item The Zarr format provides an efficient solution, by encoding fields
in the VCF separately in chunk-compressed binary format.
\end{itemize}
\end{keypoints*}
\section{Background}
Variant Call Format (VCF) is the standard format for interchanging genetic
variation data, encoding information about
DNA sequence polymorphisms among a set of samples with associated
quality control metrics and metadata~\citep{danecek2011variant}.
Originally defined specifically as a text file,
it has been refined and standardised~\citep{rehm2021ga4gh} and the
underlying data-model is now deeply embedded in bioinformatics practice.
Dataset sizes have grown explosively since the introduction of
VCF as part of the 1000 Genomes Project~\citep{10002015global},
with Biobank scale initiatives such as
Genomics England~\cite{turnbull2018100},
UK Biobank~\citep{bycroft2018genome,backman2021exome,halldorsson2022sequences,uk2023whole},
and the All of Us research program~\citep{all2024genomic}
collecting genome sequence data for hundreds of thousands of humans.
Large genetic variation datasets are also being generated for other organisms
and a variety of purposes including
agriculture~\citep{ros2020accuracy,wang2023rice},
conservation~\citep{shaffer2022landscape}
and infectious disease surveillance~\citep{hamid2023pf7}.
VCF's simple text-based design and widespread
support~\cite{garrison2022spectrum} make it an
excellent archival format, but an inefficient basis for analysis.
Methods that need efficient access to genotype data
either require conversion to the
PLINK~\cite{purcell2007plink,chang2015second}
or BGEN~\citep{band2018bgen}
formats~\citep[e.g.][]{yang2011gcta,mbatchou2021computationally,loh2015efficient}
or use bespoke binary formats
that support the required access patterns~\citep[e.g.][]{
% Uses custom "bref3" format,
% https://faculty.washington.edu/browning/beagle/bref3.24May18.pdf
browning2018one,
% .samples Zarr format
kelleher2019inferring,
% Has a "xcftools" package, but it still looks pretty experimental
hofmeister2023accurate}.
While PLINK and BGEN formats are more efficient to access than VCF, neither
can accommodate the full flexibility of the VCF data model and conversion
is lossy.
PLINK's approach of storing the genotype matrix in uncompressed
packed-binary format provides efficient access to genotype
data, but file sizes are substantially larger than
the equivalent compressed VCF (see Fig~\ref{fig-data-storage}).
For example, at two bits
per diploid genotype, the full genotype matrix for the GraphTyper SNP dataset
in the 500K UKB WGS data~\citep{uk2023whole} is 116 TiB.
% 1,037,556,156 SNPs x 490,640 samples
% humanize.naturalsize(1_037_556_156 * 490_640 / 4, binary=True)
% '115.7 TiB'
Processing of Biobank scale datasets can be split into a
few broad categories. The most basic analysis
is quality control (QC). Variant QC is an
involved and multi-faceted
task~\cite{marees2018tutorial,panoutsopoulou2018quality,chen2024genomic,hemstrom2024next},
often requiring interactive, exploratory analysis
and incurring substantial computation over multiple QC fields.
Genotype calls are sometimes refined via statistical methods,
for example by phasing~\citep{
browning2021fast,browning2023statistical,hofmeister2023accurate,williams2024phasing},
and imputation~\citep{browning2018one,rubinacci2020genotype,barton2021whole,
rubinacci2023imputation}
creating additional dataset copies.
A common task is
a genome-wide association study (GWAS)~\cite{uffelmann2021genome}.
The majority of tools for
performing GWAS and related analyses require
data to be in PLINK or BGEN formats~\cite[e.g.][]{chang2015second,
loh2015efficient,
abraham2017flashpca2,
mbatchou2021computationally},
and so data must be ``hard-called'' according to some QC criteria
and exported to additional copies.
Finally, variation datasets are often queried in exploratory
analyses, to find regions or samples of interest for a particular
study~\cite[e.g.][]{chen2024novo}.
VCF cannot support any of these workflows efficiently at the Biobank scale.
The most intrinsically limiting aspect of VCF's design
is its row-wise layout of data, which means that (for example)
information for a particular sample or field cannot be obtained without
retrieving the entire dataset.
The file-oriented paradigm is also unsuited to the realities
of modern datasets, which are too large to download and
often required to stay in-situ by data-access agreements.
Large files are currently stored in cloud environments, where the
file systems that are required by classical file-oriented tools
are expensively emulated on the basic building blocks
of object storage.
These multiple layers of inefficiencies around processing
VCF data at scale in the cloud mean that it is
time-consuming and expensive, and these vast datasets are
not utilised to their full potential.
To achieve this full potential we
need a new generation of tools that operate directly
on a primary data representation that supports
efficient access across a range of applications,
with native support for cloud object storage.
Such a representation can be termed ``analysis-ready''
and ``cloud-native''~\citep{abernathey2021cloud}.
For the representation to be FAIR~\citep{wilkinson2016fair},
it must also be \emph{accessible}, using protocols that are
``open, free, and universally implementable''.
There is currently no efficient, FAIR representation of genetic variation
data suitable for cloud deployments.
Hail~\cite{ganna2016ultra,hail2024} has become the dominant platform
for quality control of large-scale variation datasets,
and has been instrumental in projects such as
gnomAD~\cite{karczewski2020mutational,chen2024genomic}.
While Hail is built on open components
from the Hadoop distributed computing ecosystem~\citep{white2012hadoop},
the details of its MatrixTable format are not documented
or intended for external reuse.
% cite? https://dev.hail.is/t/matrixtable-file-format-reference/173/6
Similarly, commercial solutions that have emerged to facilitate
the analysis of large-scale genetic variation data are either
based on proprietary~\cite{basespace2024,graf2024,googlelifesciences2024,
awshealthomics2024,microsoftgenomics2024}
or single-vendor technologies~\cite[e.g.][]{tiledb2024,genomicsdb2024}.
The next generation of VCF analysis methods requires
an open, free and transparent data representation
with multiple independent implementations.
In this article, we decouple the VCF data model from its row-oriented
file definition, and show how the data can be
compactly stored and efficiently analysed in a cloud-native, FAIR manner.
We do this by translating VCF data into Zarr format,
a method of storing large-scale multidimensional data as a regular
grid of compressed chunks.
Zarr's elegant simplicity and first-class support for
cloud object stores have led to
it gaining substantial traction
across the sciences, and it is now used in multiple petabyte-scale
datasets in cloud deployments (see Methods for details).
We present the VCF Zarr specification that formalises this
mapping, and the \texttt{vcf2zarr}
utility to reliably convert large-scale VCFs to Zarr.
We show that VCF Zarr is much more compact than
VCF and is competitive with state-of-the-art
file-based VCF compression tools.
Moreover, we show that Zarr's storage of data in an analysis-ready
format greatly facilitates computation,
with various benchmarks being substantially faster than
\texttt{bcftools} based pipelines, and again competitive
with state-of-the-art file-oriented methods. Finally, we show the
utility of VCF Zarr on the Genomics England aggV2 dataset,
demonstrating that common \texttt{bcftools} queries can be performed orders
of magnitude more quickly using simple Python scripts.
\section{Results}
\subsection{Storing genetic variation data}
Although VCF is the standard format for exchanging genetic variation
data, its limitations both in terms of compression
and query/compute performance are well
known~\citep[e.g.][]{kelleher2013processing,layer2016efficient,li2016bgt},
and many methods
have been suggested to improve on these properties.
Most approaches balance compression with
performance on particular types of queries,
typically using a command line interface (CLI)
and outputting VCF text~\citep{
layer2016efficient, %GQT
li2016bgt, % BGT
tatwawadi2016gtrac, % GTRAC
danek2018gtc, % GTC
lin2020sparse, % SpVCF
lan2020genozip,lan2021genozip, %genozip
lefaive2021sparse, % SAVVY
wertenbroek2022xsi,% XSI
zhang2023gbc}. %GBC
Several specialised algorithms for compressing
the genotype matrix (i.e., just the genotype calls without additional
VCF information) have been proposed
\citep{qiao2012handling, % SpeedGene
deorowicz2013genome, %TGC
sambo2014compression, % snpack
deorowicz2019gtshark, %GTShark
deorowicz2021vcfshark, % VCFShark
dehaas2024genotype} % GRG
most notably the Positional
Burrows--Wheeler Transform (PBWT)~\citep{durbin2014efficient}.
See~\citep{mcvean2019linkage} for a review of the techniques
employed in genetic data compression.
The widely-used PLINK binary format stores genotypes in a
packed binary representation, supporting only biallelic
variants without phase information.
The PLINK 2 PGEN format~\citep{pgen2024} is more general
and compact than PLINK, compressing variant data using specialised
algorithms~\cite{sambo2014compression}.
Methods have also been developed which store variation data
along with annotations in databases to facilitate
efficient queries~\cite[e.g.][]{
paila2013gemini,%GEMINI
lopez2017hgva} %OpenCGA
which are either limited to certain classes of variant~\cite[e.g.][]{greene2023genetic}
or have storage requirements larger
than uncompressed VCF~\citep{al2023critical}.
The SeqArray package~\citep{zheng2017seqarray} builds on the
Genomic Data Storage container format~\cite{zheng2012high}
to store VCF genotype data in a packed and compressed format,
and is used in several downstream R packages~\cite[e.g.][]{
gogarten2019genetic,fernandes2020simplephenotypes}.
VCF is a row-wise format in which
observations and metadata for a single variant are
encoded as a line of text~\citep{danecek2011variant}.
BCF~\citep{li2011statistical}, the standard binary representation of VCF,
is similarly row-wise, as
are the majority of proposed alternative storage
formats.
Row-wise storage makes retrieving all information
for a given record straightforward and efficient,
and works well when records are either relatively small
or we typically want to analyse each record in its entirety.
When we want to analyse only a subset of a record,
row-wise storage can be inefficient because we will usually need to
retrieve more information than required from storage. In the case
of VCF (and BCF) where records are not of a fixed size and
are almost always compressed in blocks, accessing any information
for a set of rows means retrieving and decompressing \emph{all}
information from these rows.
The usual alternative to row-wise storage is \emph{columnar} storage:
instead of grouping together all the fields for a record,
we group together all the records for a given field.
Columnar storage formats such as Parquet~\citep{parquet2024}
make retrieving particular columns much
more efficient and can lead to substantially better compression.
While columnar techniques have been successfully applied
in alignment
% TODO check
% Jeff: OK with the ADAM reference in here? You did some parquet based
% alignment storage, right?
storage~\citep[e.g.][]{bonfield2014scramble,nothaft2015rethinking,bonfield2022cram},
the use of columnar technologies for
storing and analysing variation data has had limited
success~\citep{boufea2017managing,fan2020variant}.
Mapping VCF directly to a columnar layout, in which there is a
column for the genotypes (and other per-call QC metrics)
for each sample, leads to a large number of columns, which
can be cumbersome and cause scalability issues.
Fundamentally, columnar methods are one-dimensional, storing a vector
of values associated with a particular key, whereas
genetic variation data is usually modelled as a two-dimensional matrix
in which we are interested in accessing both rows \emph{and} columns.
Just as row-oriented storage makes accessing data for a given
sample inefficient, columnar storage makes accessing all the data
for a given variant inefficient.
\begin{figure}
\resizebox{225pt}{!}{\input{diagrams/data-model.tex}}
\caption{Chunked compressed storage of VCF data using Zarr.
The \texttt{call\_genotype} array is a three-dimensional (variants, samples,
ploidy) array of integers, split into a uniform grid of
chunks determined by the variant and sample chunk sizes (10,000
and 1,000 by default in \texttt{vcf2zarr}). Each chunk is associated
with a key defining its location in this grid, which can be stored
in any key-value store such as a standard file-system or cloud object
store. Chunks are compressed independently using standard
codecs and pre-compression filters, which can be specified on a per-array
basis. Also shown are the one-dimensional \texttt{variant\_contig} (CHROM)
and \texttt{variant\_position} arrays (POS). Other fields are stored
in a similar fashion. \label{fig-data-model}}
\end{figure}
VCF is at its core an encoding of the genotype matrix, where each entry
describes the observed genotypes for a given sample at a given variant site,
interleaved with per-variant information
and other call-level matrices (e.g., the GQ or AD fields).
The data is largely numerical and of fixed dimension,
and therefore maps naturally to array-oriented
or ``tensor'' storage.
We propose the VCF Zarr specification which maps the
VCF data model into an array-oriented layout using Zarr
(Fig~\ref{fig-data-model}).
In the VCF Zarr specification,
each field in a VCF is mapped to a separately-stored array,
allowing for efficient retrieval and
high levels of compression.
% In particular, call-level data is stored as $m \times n$ arrays
% (for $m$ sites and $n$ samples), allowing for efficient
% retrieval of subsets of those fields along both the
% variants and samples axis.
See the Methods for more detail on Zarr and the VCF Zarr
specification.
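As an illustrative sketch (using the Zarr-Python package; the store
path is hypothetical, and the array names follow the specification),
each field can be opened and sliced directly:
\begin{verbatim}
# Sketch: open a VCF Zarr store and access fields.
# The path "chr21.vcz" is hypothetical.
import zarr

root = zarr.open("chr21.vcz", mode="r")
gt = root["call_genotype"]     # (variants, samples, ploidy)
pos = root["variant_position"] # (variants,)
print(gt.shape, gt.chunks)
# A slice decompresses only the chunks it overlaps.
block = gt[:10_000, :1_000]
\end{verbatim}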
\begin{figure}
\begin{center}
\includegraphics[]{figures/data-scaling}
\end{center}
\caption{Compression performance on simulated genotypes.
Comparison of total stored bytes for VCF data produced
by subsets of a large simulation of French-Canadians.
Sizes for $10^6$ samples are shown on the right. Sizes
for Savvy (21.25GiB) and Zarr (22.06GiB) are very similar.
Also shown for reference is the size of the genotype matrix
when encoded as two bits per diploid genotype (2bit), as used
in the PLINK binary format.
\label{fig-data-storage}}
\end{figure}
One of the key benefits of Zarr is its cloud-native design,
but it also works well on standard file systems, where
arrays and chunks are stored hierarchically in directories
and files (storage as a single Zip archive is also supported).
To enable comparison with the existing file-based ecosystem
of tools, we focus on Zarr's file system chunk storage in a series of illustrative
benchmarks in the following sections.
(See \citep{durbin2020task,moore2021ome,gowan2022using} for Zarr
benchmarks in cloud settings.)
We compare primarily with
VCF/BCF based workflows using \texttt{bcftools} because this
is the standard practice, used in the vast majority of cases.
We also compare with two representative recent specialised utilities;
see~\cite{danek2018gtc,zhang2023gbc} for further benchmarks of
these and other tools.
Genozip~\cite{lan2020genozip,lan2021genozip} is a tool focused
on compression performance, which uses a custom file format
and a CLI to extract VCF as text with various filtering options.
Savvy~\cite{lefaive2021sparse} is an extension of BCF which
takes advantage of sparsity in the genotype matrix as well
as using PBWT-based approaches for improved compression.
Savvy provides a CLI as well as a C++ API.
Our benchmarks are based on genotype data
from subsets of a large and highly realistic
simulation of French-Canadians~\cite{anderson2023on}
(see Methods for details on the dataset and benchmarking methodology).
Note that while simulations cannot capture
all the subtleties of real data, the allele frequency
and population structure patterns in this dataset
have been shown to closely follow
observations~\cite{anderson2023on} and so it provides
a reasonable and easily reproducible data point
when comparing such methods.
The simulations only contain genotypes without any additional
high-entropy QC fields, which is unrealistic
(see the Genomics England case-study
for benchmarks on a large human dataset that includes
many such fields).
Note, however, that such minimal, genotype-only data
is something of a best-case scenario for specialised genotype
compression methods using row-wise storage.
Fig~\ref{fig-data-storage} shows compression performance
on up to a million samples for chromosome 21, with
the size of the genotype matrix encoded at two bits per diploid
genotype (one bit per haploid call) included for reference.
% 2bit 1689.153258 G
% vcf 81.375831 G
% = 20.75 X, 4.8%
Gzip compressed VCF performs remarkably well, compressing
the data to around 5\% of the
minimal binary encoding of a biallelic genotype matrix
for 1 million samples.
BCF provides a significant improvement in compression
performance over VCF (note the log-log scale). Genozip has
superb compression, with far smaller file sizes than the
other methods (although it somewhat loses this advantage at
larger sample sizes). Zarr and Savvy have
almost identical compression performance in this example.
It is remarkable that the simple approach of compressing
two-dimensional chunks of the genotype matrix
using the Zstandard compressor~\citep{collet2021rfc} and the
bit-shuffle filter from Blosc~\cite{alted2010modern}
(see Methods for details) produces
compression levels competitive with the highly specialised methods
used by Savvy.
\subsection{Calculating with the genotype matrix}
Storing genetic variation data compactly is important, but it is also
important that we can analyse the data efficiently. Bioinformatics
workflows tend to emphasise text files and command line utilities
that consume and produce text~\citep[e.g.][]{buffalo2015bioinformatics}.
Thus, many tools that compress VCF data provide a command line
utility with a query language to restrict the records
examined, perform some pre-specified calculations and finally
output some text, typically VCF or tab/comma separated
values~\citep{
layer2016efficient, %GQT
li2016bgt, % BGT
danek2018gtc, % GTC
lin2020sparse, % SpVCF
lan2020genozip,lan2021genozip, %genozip
zhang2023gbc}. %GBC
These pre-defined calculations are by necessity limited in scope, however,
and the volumes of text involved in Biobank scale datasets
make the classical approach of custom
analyses via Unix utilities in pipelines prohibitively slow. Thus,
methods have begun to provide Application Programming Interfaces
(APIs), providing efficient access to genotype and other VCF
data~\cite[e.g.][]{kelleher2013processing,lefaive2021sparse,
wertenbroek2022xsi}. By providing programmatic access,
the data can be retrieved from storage, decoded and then analysed
in the same memory space without additional copies and
inter-process communication through pipes.
To demonstrate the accessibility of genotype data and efficiency with
which calculations can be performed under the different formats,
we use the \texttt{bcftools +af-dist} plugin
(which computes a table of
deviations from Hardy-Weinberg expectations in
allele frequency bins) as an example.
% The details of the \texttt{af-dist} operation are not important:
% as an example of a whole-matrix operation.
We chose this particular operation for several reasons.
First, it is a straightforward calculation that
requires examining every element in the genotype matrix,
and can be reproduced in different programming languages
without too much effort.
Secondly, it produces a small volume of output and therefore the
time spent outputting results is negligible.
Finally, it has an efficient implementation written using the
\texttt{htslib} C API~\citep{bonfield2021htslib},
and therefore running this command on a VCF or BCF file provides
a reasonable approximation of the limit of what can be achieved in terms
of whole-matrix computation on these formats.
\begin{figure}
\includegraphics{figures/whole-matrix-compute}
\caption{Whole-matrix compute performance with increasing sample size.
Total CPU time required to run \texttt{bcftools +af-dist}
and equivalent operations in a single thread for various tools.
Elapsed time is also reported (dotted line). Run-times for genozip
and bcftools on VCF
at $10^6$ samples were extrapolated by fitting an exponential.
See Methods for full details.
\label{fig-whole-matrix-compute}}
\end{figure}
Fig~\ref{fig-whole-matrix-compute} shows timing results
for running \texttt{bcftools +af-dist} and equivalent operations
on the data of Fig~\ref{fig-data-storage}. There is a large
difference in the time required (note the log-log scale).
The slowest approach uses Genozip. Because Genozip does not
provide an API and only outputs VCF text, the best approach available
is to pipe its output into \texttt{bcftools +af-dist}.
This involves first decoding the data from Genozip format,
then generating large volumes of VCF text (terabytes, in the
largest examples here), which we must
subsequently parse before finally doing the actual calculation.
Running \texttt{bcftools +af-dist} directly on the gzipped VCF
is substantially faster, indicating that Genozip's excellent
compression performance comes at a substantial decompression cost.
Using a BCF file is again significantly faster,
because the packed binary format avoids the overhead of parsing
VCF text into \texttt{htslib}'s internal data structures.
We only use BCF for subsequent \texttt{bcftools} benchmarks.
The data shown in Fig~\ref{fig-whole-matrix-compute} for Zarr and Savvy
is based on custom programs written using their respective APIs
to implement the \texttt{af-dist} operation. The Zarr program uses
the Zarr-Python package to iterate over the decoded chunks of the
genotype matrix and classifies genotypes within a chunk using a 14 line Python
function, accelerated using the Numba JIT compiler~\cite{lam2015numba}.
The allele frequencies and genotype counts are then analysed to produce
the final counts within the allele frequency bins with 9 lines of
Python using NumPy~\cite{harris2020array} functions. Remarkably, this
short and simple Python program is substantially faster than the
equivalent compiled C using \texttt{htslib} APIs on BCF (6.9 hours
vs 20.6 hours for 1 million samples).
% num_samples num_sites tool user_time sys_time wall_time total_time
% 1000000 7254858 savvy 8371.07 4.16 8377.718136 2.326453
% 1000000 7254858 bcftools 73939.32 46.09 74023.083678 20.551503
% 1000000 7254858 zarr 24709.64 29.32 24750.704171 6.871933
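An illustrative reimplementation of this chunk-wise approach is
sketched below (assuming biallelic, diploid calls and a hypothetical
store path; this is not the exact benchmark code):
\begin{verbatim}
# Sketch: af-dist style counts over Zarr chunks.
import numba
import numpy as np
import zarr

@numba.njit(cache=True)
def count_chunk(g, hom_ref, het, hom_alt):
    # g: (variants, samples, ploidy); -1 = missing
    for j in range(g.shape[0]):
        for k in range(g.shape[1]):
            a, b = g[j, k, 0], g[j, k, 1]
            if a < 0 or b < 0:
                continue
            if a + b == 0:
                hom_ref[j] += 1
            elif a + b == 1:
                het[j] += 1
            else:
                hom_alt[j] += 1

def af_dist(path, num_bins=10):
    gt = zarr.open(path, mode="r")["call_genotype"]
    m = gt.shape[0]
    hom_ref = np.zeros(m, dtype=np.int64)
    het = np.zeros(m, dtype=np.int64)
    hom_alt = np.zeros(m, dtype=np.int64)
    step = gt.chunks[0]
    for lo in range(0, m, step):
        hi = min(lo + step, m)
        g = gt[lo:hi]  # decode one row of chunks
        count_chunk(g, hom_ref[lo:hi], het[lo:hi],
                    hom_alt[lo:hi])
    n = hom_ref + het + hom_alt
    p = (het + 2 * hom_alt) / np.maximum(2 * n, 1)
    bins = np.minimum((p * num_bins).astype(int),
                      num_bins - 1)
    obs = np.bincount(bins, weights=het,
                      minlength=num_bins)
    exp = np.bincount(bins, weights=2 * p * (1 - p) * n,
                      minlength=num_bins)
    return obs, exp
\end{verbatim}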
The fastest method is the
C++ program written using the Savvy API. This would largely seem
to be due to Savvy's excellent genotype decoding performance
% zarr 1.2 GiB
% savvy 6.6 GiB
% zarr_nshf 3.9 GiB
(up to 6.6 GiB/s vs 1.2 GiB/s for Zarr on this dataset;
Fig~\ref{fig-whole-matrix-decode}).
% sequence_length num_samples num_sites tool size
% 40 48129895 1000000 7254858 zarr 22.071687
% 35 48129895 1000000 7254858 bcf 51.749294
% 36 48129895 1000000 7254858 genozip 10.691384
% 37 48129895 1000000 7254858 sav 21.249436
% 38 48129895 1000000 7254858 tsk 1.802636
% 39 48129895 1000000 7254858 vcf 81.375831
% 41 48129895 1000000 7254858 zarr_nshf 29.897540
Turning off the BitShuffle filter for the Zarr dataset,
however, leads to a substantial increase in decoding speed
(3.9 GiB/s) at the cost of a roughly 25\% increase in storage
space (29.9 GiB, up from 22.1 GiB for 1 million samples; data not
shown). Given the relatively small contribution of genotypes to the
overall storage of real datasets (see the Genomics England example)
and the frequency that they are likely to be accessed, this
would seem like a good tradeoff in most cases.
This ability to easily tune compression performance
and decoding speed on a field-by-field basis is a major
strength of Zarr. The \texttt{vcf2zarr} utility also provides
functionality to aid with such storage schema tuning.
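With Zarr-Python and the numcodecs package, for example, such
per-field choices amount to a one-line change in the storage schema
(the shapes, chunk sizes and compression level below are
illustrative, not the benchmark settings):
\begin{verbatim}
# Sketch: per-array codec and filter choices.
import zarr
from numcodecs import Blosc

root = zarr.open_group("tuned.vcz", mode="w")
# Smaller but slower to decode: zstd + bit-shuffle.
root.zeros("call_genotype",
    shape=(100_000, 10_000, 2),
    chunks=(10_000, 1_000, 2), dtype="i1",
    compressor=Blosc(cname="zstd", clevel=7,
                     shuffle=Blosc.BITSHUFFLE))
# Faster to decode, somewhat larger: no shuffle.
root.zeros("call_DP",
    shape=(100_000, 10_000),
    chunks=(10_000, 1_000), dtype="i2",
    compressor=Blosc(cname="zstd", clevel=7,
                     shuffle=Blosc.NOSHUFFLE))
\end{verbatim}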
\subsection{Subsetting the genotype matrix}
\begin{figure}[t]
\includegraphics{figures/subset-matrix-compute}
\caption{Compute performance on subsets of the matrix.
Total CPU time required to run the af-dist calculation for
a contiguous subset of 10,000 variants $\times$ 10 samples
from the middle of the matrix
for the data in Fig~\ref{fig-data-storage}.
Elapsed time is also reported (dotted line).
The \texttt{genozip} and \texttt{bcftools} pipelines involve
multiple commands required to correctly calculate the AF INFO field
required by \texttt{bcftools +af-dist}. See the Methods for full details
on the steps performed.
\label{fig-subset-matrix-compute}}
\end{figure}
As datasets grow ever larger, the ability to efficiently access subsets
of the data becomes increasingly important. VCF/BCF achieve efficient
access to the data for genomic ranges
by compressing blocks of adjacent records using \texttt{bgzip},
and storing secondary indexes alongside the original
files with a conventional suffix~\citep{li2011tabix}.
Thus, for a given range query we
decompress only the necessary blocks and can quickly access
the required records.
The row-wise nature of VCF (and most proposed alternatives), however, means
that we cannot efficiently subset \emph{by sample}
(e.g., to calculate statistics within a particular cohort). In the extreme
case, if we want to access only the genotypes for a single sample
we must still retrieve and decompress the entire dataset.
We illustrate this cost of row-wise encoding in
Fig~\ref{fig-subset-matrix-compute}, where we run the af-dist calculation
on a small fixed-size subset of the genotype matrices of
Fig~\ref{fig-data-storage}. The two-dimensional chunking of Zarr
means that this sub-matrix can be efficiently
extracted, and therefore the execution time depends very weakly on
the overall dataset size, with the computation requiring around
1 second for 1 million samples. Because of their
row-wise encoding, CPU time scales with the number of samples
for all the other methods.
Fig~\ref{fig-subset-matrix-compute-supplemental} shows performance
for the same operation when selecting half of the samples in the
dataset.
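For reference, the Zarr version of this subset query is a single
two-dimensional slice (the store path and offsets are illustrative):
\begin{verbatim}
# Sketch: 10,000 variants x 10 samples from the middle
# of the matrix; only overlapping chunks are read.
import zarr

gt = zarr.open("chr21.vcz", mode="r")["call_genotype"]
v0 = gt.shape[0] // 2
s0 = gt.shape[1] // 2
subset = gt[v0:v0 + 10_000, s0:s0 + 10]
\end{verbatim}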
\subsection{Extracting, inserting and updating fields}
\begin{figure}
\includegraphics{figures/column-extract}
\caption{Time to extract the genome position and write to a text file.
Total CPU time required to extract the POS field for BCF,
sav and Zarr formats
for the data in Figure~\ref{fig-data-storage}.
For the BCF file we used \texttt{bcftools query -f"\%POS$\backslash$n"}.
For sav, we used the Savvy C++ API to extract position for each variant
and output text using the \texttt{std::cout} stream. For Zarr, we read
the variant\_position array into a NumPy array, and then wrote to
a text file using the Pandas \texttt{to\_csv} method.
Zarr CPU time is dominated by writing the text output; we also show
the time required to populate a NumPy array with the data in Zarr,
which is less than a second. Wall-clock time (dotted line) is dominated
in this case by file I/O. Time to output text for Savvy is not significant
for $> 1000$ samples (not shown).
\label{fig-column-extract}}
\end{figure}
We have focused on the genotype matrix up to this point, contrasting
Zarr with existing row-wise methods.
Real-world VCFs encapsulate much more than just the genotype
matrix, and can contain large numbers of additional fields.
Fig~\ref{fig-column-extract} shows the time required to extract
the genomic position of each variant in the simulated benchmark
dataset, which we can use as an indicative example of a per-variant
query. Although Savvy is many times faster than \texttt{bcftools query}
here, the row-wise storage strategy that they share means that
the entire dataset must be read into memory and
decompressed to extract just one field from each record. Zarr
excels at these tasks: we only read and decompress the information required.
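A sketch of the Zarr variant of this query (with a hypothetical
store path) is:
\begin{verbatim}
# Sketch: read variant_position and write it as text.
import pandas as pd
import zarr

root = zarr.open("chr21.vcz", mode="r")
pos = root["variant_position"][:]  # NumPy array
pd.DataFrame({"POS": pos}).to_csv(
    "pos.txt", index=False, header=False)
\end{verbatim}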
Many of the additional fields that we find in real-world VCFs are
variant-level annotations, extensively used in downstream applications.
For example, a common workflow is to
add or update variant IDs in a VCF using a reference database
such as dbSNP \cite{Sherry2001dbSNP}. The standard approach to this
(using e.g.\ \texttt{bcftools annotate}) is to create a \emph{copy} of
the VCF which includes these new annotations. Thus, even though
we may only be altering a single field comprising a tiny fraction
of the data, we still read, decompress, update, compress and
write the entire dataset to a new file. With Zarr,
we can update an existing field or add arbitrary additional
fields without touching the rest of the data or creating redundant
copies.
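For instance, a new variant-level annotation can be written as an
additional array without touching any existing field (the array name
and store path below are illustrative):
\begin{verbatim}
# Sketch: add a variant-level annotation in place.
import zarr

root = zarr.open("chr21.vcz", mode="r+")
m = root["variant_position"].shape[0]
ids = root.zeros("variant_id_new", shape=(m,),
                 chunks=(10_000,), dtype="<U32",
                 overwrite=True)
ids[:] = "."  # placeholder; real IDs (e.g. from
              # dbSNP) would be written chunk by chunk
\end{verbatim}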
\subsection{Case study: Genomics England 100,000 genomes}
In this section we demonstrate the utility of VCF Zarr on a large human dataset
and the scalability of the \texttt{vcf2zarr} conversion utility.
Genomics England’s multi-sample VCF dataset (aggV2) is an
aggregate of 78,195 gVCFs from rare disease and cancer participants
recruited as part of the 100,000 Genomes Project~\cite{turnbull2018100}.
The dataset comprises approximately 722 million annotated single-nucleotide
variants and small indels split into 1,371 roughly equal chunks and
totalling 165.3 TiB of VCF data after \texttt{bgzip} compression.
The dataset is used for a variety of research purposes, ranging from
GWAS~\cite{kousathanas2022whole} and
imputation~\cite{shi2023genomics} to
simple queries involving single gene
regions~\cite{leggatt2023genotype,lam2023repeat}.
As described in the Methods, conversion to Zarr using
\texttt{vcf2zarr} is a two-step process. We
first converted the 106 VCF files (12.81 TiB) for chromosome 2
into the intermediate columnar format (ICF). This task was
split into 14,605 partitions, and distributed using the Genomics England
HPC cluster. The average run-time per partition was 20.7 min.
The ICF representation used a total
of 9.94 TiB over 3,960,177 data storage files.
We then converted the ICF to Zarr, partitioned into
5989 independent jobs, with an 18.6 min average run time.
This produced a dataset with 44 arrays, consuming a
total of 2.54 TiB of storage over 6,312,488
chunk files. This is a roughly 5X reduction in total storage
space over the original VCF.
The top fields in terms
of storage are detailed in Table~\ref{tab-genomics-england-data}.
We do not compare with other tools such as Genozip and
Savvy here because they have fundamental limitations
(as shown in earlier simulation-based benchmarks),
and conversion of these large VCFs is a major undertaking.
\begin{table}
\caption{Summary for a selection of the largest VCF Zarr columns produced for
Genomics England aggV2 VCFs on chromosome 2 using \texttt{vcf2zarr}
default settings. Each field is stored independently
as a Zarr array with the given type (sufficient to represent all values in the
data). We show the total storage consumed (reported via \texttt{du}) in
power-of-two units, and the compression ratio achieved on that array.
We also show the percentage of the overall storage that each array consumes
(omitting values < 0.01\%).
\label{tab-genomics-england-data}}
\begin{tabular}{llS[table-format=3.1]S[table-format=3.2]S[table-format=3.2]}
\toprule
{Field} & {type} & {storage} & {compress} & {\%total} \\
\midrule
/call\_AD & int16 & 658.4G & 26 & 25.35\% \\
/call\_GQ & int16 & 654.5G & 13 & 25.20\% \\
/call\_DP & int16 & 570.0G & 15 & 21.95\% \\
/call\_DPF & int16 & 447.1G & 20 & 17.22\% \\
/call\_PL & int16 & 162.6G & 160 & 6.26\% \\
/call\_GQX & int16 & 41.0G & 210 & 1.58\% \\
/call\_FT & string & 25.0G & 1400& 0.96\% \\
/call\_genotype & int8 & 21.5G & 410& 0.83\% \\
/call\_genotype\_mask & bool & 12.8G & 680& 0.49\% \\
/call\_genotype\_phased & bool & 2.4G & 1900& 0.09\% \\
/call\_PS & int8 & 383.4M & 12000& 0.01\% \\
% /call\_ADF & int8 & 383.4M & 12000.0 & 0.01\% \\
% /call\_ADR & int8 & 383.4M & 12000.0 & 0.01\% \\
/variant\_position & int32 & 111.6M & 2 & \\
/variant\_quality & float32 & 87.4M & 2.6 & \\
/variant\_allele & string & 69.3M & 13 & \\
% /variant\_OLD\_MULTIALLELIC & object & 55.88M & 8.2 & 0.00\% \\
/variant\_AN & int32 & 47.3M & 4.8 & \\
% /variant\_AC & int32 & 47M & 4.9 & 0.00\% \\
% /variant\_AC\_Het & int32 & 46.62M & 4.9 & 0.00\% \\
% /variant\_ABratio & float32 & 20.23M & 11.0 & 0.00\% \\
% /variant\_MendelSite & float32 & 20.09M & 11.0 & 0.00\% \\
% /variant\_medianGQ & float32 & 19.16M & 12.0 & 0.00\% \\
% /variant\_completeGTRatio & float32 & 14.98M & 15.0 & 0.00\% \\
% /variant\_phwe\_eur & float32 & 14.46M & 16.0 & 0.00\% \\
% /variant\_phwe\_afr & float32 & 13.37M & 17.0 & 0.00\% \\
% /variant\_medianDepthNonMiss & float32 & 12.17M & 19.0 & 0.00\% \\
% /variant\_medianDepthAll & float32 & 11.99M & 19.0 & 0.00\% \\
% /variant\_phwe\_sas & float32 & 11.59M & 20.0 & 0.00\% \\
% /variant\_AC\_Hom & int32 & 11.12M & 21.0 & 0.00\% \\
% /variant\_missingness & float32 & 9.3M & 25.0 & 0.00\% \\
% /variant\_phwe\_eas & float32 & 7.12M & 32.0 & 0.00\% \\
/variant\_filter & bool & 6.4M & 570 & \\
% /variant\_AC\_Hemi & int32 & 849.43 KiB & 280.0 & 0.00\% \\
% /variant\_composite\_filter & bool & 433.38 KiB & 130.0 & 0.00\% \\
% /variant\_OLD\_CLUMPED & object & 384.18 KiB & 1200.0 & 0.00\% \\
% /variant\_id & object & 298.76 KiB & 1600.0 & 0.00\% \\
% /variant\_phwe\_amr & float32 & 269.52 KiB & 870.0 & 0.00\% \\
% /variant\_id\_mask & bool & 269.46 KiB & 220.0 & 0.00\% \\
/sample\_id & str & 268.1K & 2.3 & \\
% /variant\_contig & int16 & 257.77 KiB & 450.0 & 0.00\% \\
% /contig\_length & int64 & 3.79 KiB & 5.3 & 0.00\% \\
% /contig\_id & object & 2.3 KiB & 8.8 & 0.00\% \\
% /filter\_id & object & 808 bytes & 0.6 & 0.00\% \\
%%%% OLD for chr20
% call\_AD & int16 & 179.7G & 26 & 24.0\\
% call\_GQ & int16 & 171.8G & 13 & 23.0 \\
% call\_DP & int16 & 141.8G & 16 & 18.9 \\
% call\_DPF& int16 & 115.1G & 20 & 15.3\\
% call\_FT & string & 58.5G & 160 & 7.8 \\
% call\_PL & int16 & 51.4G & 140 & 6.9 \\
% call\_GQX & int16 & 12.1G & 190 & 1.6 \\
% call\_genotype & int8 & 6.1G & 380 & 0.8 \\
% call\_genotype\_mask & bool & 3.7G & 630 & 0.5\\
% call\_genotype\_phased & bool & 692.5M & 1700 & 0.1 \\
% call\_PS & int8 & 102.2M & 12000 & 0.05 \\
% variant\_quality & float32 & 22.6M & 2.7 & <0.01 \\
% variant\_allele & string & 21.5M & 11 \\
% variant\_position & int32 & 12.7M & 4.7 \\
% variant\_ABratio & float32 & 9.6M & 6.3 \\
% variant\_AN & int32 & 9.6M & 6.3 \\
% variant\_filter & bool & 1.9M & 490 \\
\bottomrule
\end{tabular}
\end{table}
Table~\ref{tab-genomics-england-data} shows that the dataset storage
size is dominated by a few columns, with the top four
(call\_AD, call\_GQ, call\_DP and call\_DPF) accounting for
90\% of the total. These fields are much less compressible
than genotype data (which uses $<1\%$ of the total space here)
because of their inherent noisiness~\citep{lin2020sparse}.
Note that these top four fields are stored as 16 bit integers
because they contain rare outliers that cannot be stored as
8 bits. While the fields could likely be truncated to have
a maximum of 127 with minimal loss of information,
the compression gains from doing so are relatively minor,
and we therefore opt for fully lossless compression here
for simplicity. The call\_PS field here has an extremely high
compression ratio because it consists entirely of missing data
(i.e., it was listed in the header but never used in the VCF).
To demonstrate the computational accessibility of Zarr on this
large human dataset, we performed some illustrative benchmarks.
As these benchmarks take some time to run, we focus
on a single 132GiB compressed VCF file covering
positions 58,219,159--60,650,943 (562,640 variants)
from the middle of the list of 106 files for chromosome 2.
We report both the total CPU time and elapsed wall-clock time here
as both are relevant.
First, we extracted the genome position for each variant in this single VCF
chunk using \texttt{bcftools query} and Python Zarr code as described in
Fig~\ref{fig-column-extract}. The \texttt{bcftools} command required
% def parse_time(time_str):
% minsplit = time_str.split("m")
% return datetime.timedelta(minutes=int(minsplit[0]),
% seconds=float(minsplit[1][:-1]))
% def parse_time_output(s):
% values = {}
% for line in s.splitlines():
% split = line.split()
% values[split[0]] = parse_time(split[1])
% return values["real"].seconds / 60, (values["user"] + values["sys"]).seconds / 60
% real 85m51.013s
% user 54m39.986s
% sys 0m45.277s
55.42 min CPU and 85.85 min elapsed.
% CPU times: user 846 ms, sys: 1.93 s, total: 2.78 s
% Wall time: 1min 44s
% 55.42 * 60 / 2.78 = 1196
% 85.85 / 1.73 = 49.62
The Zarr code required 2.78 sec CPU and 1.73 min elapsed.
This is a 1196X smaller CPU burden and a 50X speed-up in elapsed time.
The major difference between CPU time and wall-time is noteworthy
here, and indicates some opportunities for improvement in VCF Zarr
in high-latency environments such as the shared file system in the
Genomics England HPC system. Currently VCF Zarr does not store any
specialised index to map genomic coordinates to array positions
along the variants dimension. Instead, to find the relevant slice
of records corresponding to the range of positions in the target
VCF file, we load the entire variant\_position array and
binary search. This entails reading 5,989 chunk files
(the chunk size is 100,000 variants) which incurs a substantial
latency penalty on this system. Later versions of the specification
may solve this problem by storing an array of size
(approximately) the number of variant chunks
which maps ranges of genome coordinates to chunk indexes,
or a more specialised structure that supports overlap queries.
% When we subsequently time the operation of
% just populating an array of the variant positions in specified region,
% this is reduced to 1.13 sec, with an elapsed time of 20.9 sec.
% [Note: this is crazy that we still have a 19 second I/O wait. The
% 5989 files for the POS chunks must be in cache already, so the delay
% has to be the file system checking cache coherency.]
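The lookup itself is straightforward (sketch below, assuming the
store holds a single contig; the path and coordinates are
illustrative); the cost lies in the many small chunk reads needed to
load variant\_position on a high-latency file system.
\begin{verbatim}
# Sketch: binary search variant_position for a range.
import numpy as np
import zarr

root = zarr.open("aggV2_chr2.vcz", mode="r")
pos = root["variant_position"][:]
lo = np.searchsorted(pos, 58_219_159, side="left")
hi = np.searchsorted(pos, 60_650_943, side="right")
gt = root["call_genotype"][lo:hi]  # slice any field
\end{verbatim}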
We then ran the af-dist calculation (Figs~\ref{fig-whole-matrix-compute}
and~\ref{fig-subset-matrix-compute}) on the VCF file
using \texttt{bcftools +af-dist} as before.
This operation required 716.28 min CPU time and 716.3 min elapsed.
Repeating this operation for the same coordinates in Zarr
(using Python code described in previous sections)
% 716.28 / 2.32 = 308.74
% 716.3 / 4.25 = 168.54
gave a total CPU time of 2.32 min and elapsed time of 4.25 min.
This is a 309X reduction in CPU burden and a 169X speed-up in elapsed time.
It is worth noting here that \texttt{bcftools +af-dist} cannot be
performed in parallel across multiple slices of a chromosome, and if
we did want to run it on all of chromosome 2 we would need to
concatenate the 106 VCF files. While af-dist itself is not a common operation,
many tasks share this property
of not being straightforwardly decomposable across multiple VCF files.
Finally, to illustrate performance on a common filtering task, we
created a copy of the VCF chunk which contains only variants
that pass some common filtering criteria using
\texttt{bcftools view -I --include "FORMAT/DP>10 \& FORMAT/GQ>20"},
following standard practices~\citep[e.g.][]{bergstrom2020insights,
kousathanas2022whole,
chen2024genomic}.
This used 689.46 min CPU time, with an elapsed time of 689.48 min.
In comparison, computing and storing a variant mask (i.e., a boolean value
for each variant denoting whether it should be considered or not
for analysis) based on the same criteria using Zarr
consumed 1.96 min CPU time with an elapsed time of 11 min.
This is a 352X reduction in CPU usage, and a 63X reduction in elapsed time.
% 689.46 / 1.96 = 351.76
% 689.48 / 11 = 62.6 = 63X
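A sketch of this mask computation is shown below (one plausible
reading of the criterion, in which a variant passes if at least one
sample has DP$>$10 and GQ$>$20; the store path is illustrative):
\begin{verbatim}
# Sketch: compute and store a variant-level mask from
# the call_DP and call_GQ fields, chunk by chunk.
import zarr

root = zarr.open("aggV2_chunk.vcz", mode="r+")
dp, gq = root["call_DP"], root["call_GQ"]
mask = root.zeros("variant_pass_dp_gq",
                  shape=(dp.shape[0],), dtype=bool,
                  chunks=(10_000,), overwrite=True)
step = dp.chunks[0]
for lo in range(0, dp.shape[0], step):
    hi = min(lo + step, dp.shape[0])
    ok = (dp[lo:hi] > 10) & (gq[lo:hi] > 20)
    mask[lo:hi] = ok.any(axis=1)
\end{verbatim}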
There is an important distinction here between creating a copy
of the data (an implicit part of VCF based workflows) and creating an
additional \emph{mask}. As Table~\ref{tab-genomics-england-data}
illustrates, call-level masks are cheap (the standard
genotype missingness mask, call\_genotype\_mask, uses 0.49\% of the overall
storage) and variant or sample level masks require negligible storage.
If downstream software can
use configurable masks (at variant, sample and call level)
rather than expecting full copies of the data, major storage savings
and improvements in processing efficiency can be made.
The transition from the manifold inefficiencies of
present-day ``copy-oriented'' computing
to the ``mask-oriented'' analysis of large immutable, single-source
datasets is a potentially transformational change enabled by Zarr.
\section{Discussion}
% Zarr is great
VCF is a central element of modern genomics, facilitating
the exchange of data in a large ecosystem of interoperating tools.
Its current row-oriented form, however,
is fundamentally inefficient,
profoundly limiting the scalability of the present generation
of bioinformatics tools. Large scale VCF data cannot
currently be
processed without incurring a substantial economic
(and environmental~\cite{grealey2022carbon}) cost.
We have shown here that this is not a necessary situation,
and that greatly improved efficiency can be achieved by
using more appropriate storage representations tuned
to the realities of modern computing. We have argued that
Zarr provides a powerful basis for cloud-based
storage and analysis of large-scale genetic variation data.
We propose the VCF Zarr specification which losslessly
maps VCF data to Zarr, and provide an efficient and scalable