% LaTeX rebuttal letter example.
%
% Copyright 2019 Friedemann Zenke, fzenke.net
%
% Based on examples by Dirk Eddelbuettel, Fran and others from
% https://tex.stackexchange.com/questions/2317/latex-style-or-macro-for-detailed-response-to-referee-report
%
% Licensed under cc by-sa 3.0 with attribution required.
\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{lipsum} % to generate some filler text
\usepackage{fullpage}
% import Eq and Section references from the main manuscript where needed
% \usepackage{xr}
% \externaldocument{manuscript}
% package needed for optional arguments
\usepackage{xifthen}
% define counters for reviewers and their points
\newcounter{reviewer}
\setcounter{reviewer}{0}
\newcounter{point}[reviewer]
\setcounter{point}{0}
% This refines the format of how the reviewer/point reference will appear.
\renewcommand{\thepoint}{\thereviewer.\arabic{point}}
% command declarations for reviewer points and our responses
\newcommand{\reviewersection}{\stepcounter{reviewer} \bigskip \hrule}
% \section*{Reviewer \thereviewer}}
\newenvironment{point}
{\refstepcounter{point} \bigskip \noindent {\textbf{Point~\thepoint} } ---\ }
{\par }
\newcommand{\shortpoint}[1]{\refstepcounter{point} \bigskip \noindent
{\textbf{Reviewer~Point~\thepoint} } ---~#1\par }
\newenvironment{reply}
{\medskip \noindent \begin{sf}\textbf{Reply}:\ }
{\medskip \end{sf}}
\newcommand{\shortreply}[2][]{\medskip \noindent \begin{sf}\textbf{Reply}:\ #2
\ifthenelse{\equal{#1}{}}{}{ \hfill \footnotesize (#1)}%
\medskip \end{sf}}
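% Usage sketch for the macros defined above. This is illustrative only:
% the label name pt:example, the page note p.~3, and the placeholder text
% are assumptions and do not appear in the letter itself.
%
%   \begin{point}
%     Full reviewer comment goes here. \label{pt:example}
%   \end{point}
%   \begin{reply}
%     Full reply goes here.
%   \end{reply}
%
%   \shortpoint{One-line reviewer comment.}
%   \shortreply[p.~3]{One-line reply; see also Point~\ref{pt:example}.}
%
% The optional argument of \shortreply is typeset in parentheses after an
% \hfill, in \footnotesize. \label and \ref work here because both point
% macros advance the counter with \refstepcounter.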
\begin{document}
\section*{Response to the editor}
% General intro text goes here
Thank you for considering this manuscript for publication.
We have addressed the comments from both reviewers point-by-point
below.
\reviewersection
\section*{Reviewer 1 comments}
\begin{point}
The authors present VCF Zarr, a specification that translates the variant call
format (VCF) data model into an array-based representation for the Zarr storage
format. They also present the \texttt{vcf2zarr} utility to convert large VCFs to Zarr.
They provide data compression and analysis benchmarks comparing VCF Zarr to
existing variant storage technologies using simulated genotype data. They also
present a case study on real world Genomics England aggV2 data.
The authors' benchmarks overall show that VCF Zarr has superior compression and
computational analysis performance at scale relative to data stored as
row-oriented VCF and that VCF Zarr is competitive with specialized storage
solutions that require similarly specialized tools and access libraries for
querying. An attractive feature is that VCF Zarr allows for variant annotation
workflows that do not require full dataset copy and conversion. Another key
point is that Zarr is a high-level spec and data model for the chunked storage
of n-d arrays, rather than a byte-level encoding designed specifically around
the genomic variant data type. I personally have used Zarr productively for
several applications unrelated to statistical genetics. While Zarr VCF mildly
underperforms some of the specialized formats (Savvy in compute, Genozip in
compression) in a few instances, I believe the accessibility, interoperability,
and reusability gains of Zarr make the small tradeoff well worthwhile.
Because Zarr has seen heavy adoption in other scientific communities like the
geospatial and Earth sciences, and is well integrated in the scientific Python
stack, I think it holds potential for greater reusability across the ecosystem.
As such, I think the VCF Zarr spec is a highly valuable if not overdue
contribution to an entrenched field that has recently been confronted by a
scalability wall.
Overall, the paper is clear, comprehensive, and well written.
\end{point}
\begin{reply}
Thank you! We are delighted that you agree with us on the transformative
potential of VCF Zarr, and thank you for your insightful points and
helpful suggestions.
\end{reply}
\begin{point}
The benefits for large scientific datasets to be analysis-ready
cloud-optimized (ARCO) have been well articulated by Abernathey et al., 2021.
However, I do think that the ``local''/HPC single-file use case is still
important and won't disappear any time soon, and for some file system use
cases, expansive and deep hierarchies can be performance limiting (this was
hinted at in one of the benchmarks). In this scenario would a large Zarr VCF
perform reasonably well (or even better on some file systems) via a single
local zip store?
\end{point}
\begin{reply}
This is a good point, and we have added several mentions of the Zip file backend
as well as benchmarks using it. We added a new paragraph on page~5 (lines 318--338)
to discuss this point. We also added benchmarks based on the SARS-CoV-2
dataset, which is distributed as a Zip file.
\end{reply}
\begin{point}
The description of the intermediate columnar format (ICF) used by \texttt{vcf2zarr}
is missing some detail. At first I got the impression it might be based on
something like Parquet, but running the provided code showed that it consists
of a similar file-based chunk layout to Zarr. This should be clarified in the
manuscript.
\end{point}
\begin{reply}
We have tried to clarify this (and the purpose of ICF) by saying
``...storing each field independently in (approximately) fixed-size
compressed chunks in a file-system hierarchy.
ICF is designed to support efficient Zarr encoding within \texttt{vcf2zarr}
and not intended for reuse outside that context.''
\end{reply}
\begin{point}
The authors discuss the possibility of storing an index mapping genomic
coordinates to chunk indexes. Have Zarr-based formats in other fields like
geospatial introduced their own indexing approaches to take inspiration from?
\end{point}
\begin{reply}
We have added a new \texttt{region\_index} field for this purpose, which
supports efficient genomic range searches. This is discussed in the last
paragraph of the vcf2zarr Methods section (page~13).
\end{reply}
\begin{point}
Since VCF Zarr is still a draft proposal, it could be useful to indicate
where community discussions are happening and how potential new contributors
can get involved, if possible. This doesn't need to be in the paper per se, but
perhaps documented in the spec repo.
\end{point}
\begin{reply}
Thank you, that is an excellent idea. We have added basic guidance for
contributors to the specification repo.
\end{reply}
\begin{point}
In the background: ``For the representation to be FAIR, it must also be
accessible,'' -- A is for ``accessible'', so ``also'' doesn't make sense.
\end{point}
\begin{reply}
Fixed.
\end{reply}
\begin{point}
``There is currently no efficient, FAIR representation...''. Just a nit and
feel free to ignore, but the solution you present is technically ``current''.
\end{point}
\begin{reply}
You are right, but we would like to keep this as it is for narrative clarity.
\end{reply}
\begin{point}
In Figure 2, the zarr line is occluded by the sav line and hard to see.
\end{point}
\begin{reply}
The benchmarks have changed slightly, and the lines are now more separated.
\end{reply}
\reviewersection
\section*{Reviewer 2 comments}
\begin{point}
The paper presents an encoding of the VCF data using Zarr to enable fast
retrieving subsets of the data. A vcf2arr conversion was provided and validated
on both simulated and real-world data sets. The topic of this work is
interesting and of good values, however, the experimental studies and
contributions should be considerable improved.
\end{point}
\begin{reply}
Thank you for your valuable feedback; it has resulted in a much improved
manuscript. We have significantly extended the scope
of this work by providing:
\begin{itemize}
\item The vcztools program, which converts from Zarr to VCF efficiently.
\item A new case study on the Our Future Health genotype data, for
651K samples.
\item A new case study on the All of Us exome-like data, for 245K
samples.
\item A new case study on Norway Spruce data, for 1063 samples
and 3.75 \emph{billion} variants.
\item A new case study on SARS-CoV-2 data for 4.3 million samples.
\item A new section demonstrating a prototype of the SAIGE software
incorporating support for VCF Zarr, with excellent performance.
\item A new section exploring the use of VCF Zarr in cloud computing
platforms, showing how data processing rates of up to 25.5~GiB/s
are possible.
\item A new section exploring the use of GPU acceleration on VCF Zarr data.
\end{itemize}
We hope you agree that these extensive additions, applied to
a selection of the largest genomic datasets
in the world (in terms of both sample and genome size),
demonstrate the real-world value and transformative potential of VCF Zarr.
\end{reply}
\begin{point}
The proposed method is simply a conversion from VCF to Zarr format. Since
both are existing formats, the contributions and originality of this work are
not impressive.
\end{point}
\begin{reply}
We believe that the extensive work detailed in the previous response
addresses this criticism.
\end{reply}
\begin{point}
The compression and query performance is the main concern of this work. The
method should be compared with other state-of-the-art queriable VCF compressors
like GTC, GBC, and GSC. [References omitted]
\end{point}
\begin{reply}
We disagree on this point, and respectfully point out that for our
purposes, these methods are essentially equivalent to Genozip
(which we have included in our benchmarks as a representative
of this class of method).
As we argue in the ``Calculating with the genotype matrix'' section,
a fundamental point
is that compression methods that provide access to data
only via VCF text limit the rate at which the data can be
processed to the rate at which VCF text can be parsed. Producing VCF text
as output cannot be seen as the end point of analysis, and the rate at which it can be
done (beyond a certain minimum value of around 120~MiB/s, as we argue in the
new vcztools section) is of no real practical interest. Our benchmarks
(e.g., Figures 3 and 4) show that the Genozip-like approach
of highly compressed genotype data that is accessed only via VCF text
cannot form the basis of efficient processing pipelines. It is
intrinsically inefficient.
We have included references to GTC, GBC and GSC, and discuss them in the
``Storing genetic variation data'' section, directing the reader to
the benchmarks performed in the ``Calculating with the genotype matrix''
section.
\end{reply}
\begin{point}
The method should be evaluated on more real VCF data sets.
\end{point}
\begin{reply}
We have added four new case studies applying VCF Zarr
to some of the world's largest
datasets, covering different data modalities (genotype, exome-like, whole genome sequence)
and three very different organisms (human, tree, virus)
with widely differing genome properties.
\end{reply}
\end{document}