% LaTeX rebuttal letter example.
%
% Copyright 2019 Friedemann Zenke, fzenke.net
%
% Based on examples by Dirk Eddelbuettel, Fran and others from
% https://tex.stackexchange.com/questions/2317/latex-style-or-macro-for-detailed-response-to-referee-report
%
% Licensed under cc by-sa 3.0 with attribution required.

\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{lipsum} % to generate some filler text
\usepackage{fullpage}

% import Eq and Section references from the main manuscript where needed
% \usepackage{xr}
% \externaldocument{manuscript}

% package needed for optional arguments
\usepackage{xifthen}
% define counters for reviewers and their points
\newcounter{reviewer}
\setcounter{reviewer}{0}
\newcounter{point}[reviewer]
\setcounter{point}{0}

% This refines the format of how the reviewer/point reference will appear.
\renewcommand{\thepoint}{\thereviewer.\arabic{point}}

% command declarations for reviewer points and our responses
\newcommand{\reviewersection}{\stepcounter{reviewer} \bigskip \hrule}
                  % \section*{Reviewer \thereviewer}}

\newenvironment{point}
   {\refstepcounter{point} \bigskip \noindent {\textbf{Point~\thepoint} } ---\ }
   {\par }

\newcommand{\shortpoint}[1]{\refstepcounter{point}  \bigskip \noindent
    {\textbf{Reviewer~Point~\thepoint} } ---~#1\par }

\newenvironment{reply}
   {\medskip \noindent \sffamily \textbf{Reply}:\  }
   {\medskip}

\newcommand{\shortreply}[2][]{\medskip \noindent {\sffamily \textbf{Reply}:\  #2
    \ifthenelse{\equal{#1}{}}{}{ \hfill \footnotesize (#1)}%
    \medskip}}

\begin{document}

\section*{Response to the editor}
% General intro text goes here
Thank you for considering this manuscript for publication.  
We have addressed the comments from both reviewers point-by-point
below.

\reviewersection
\section*{Reviewer 1 comments}

\begin{point}
The authors present VCF Zarr, a specification that translates the variant call
format (VCF) data model into an array-based representation for the Zarr storage
format. They also present the \texttt{vcf2zarr} utility to convert large VCFs to Zarr.
They provide data compression and analysis benchmarks comparing VCF Zarr to
existing variant storage technologies using simulated genotype data. They also
present a case study on real world Genomics England aggV2 data.

The authors' benchmarks overall show that VCF Zarr has superior compression and
computational analysis performance at scale relative to data stored as
row-oriented VCF and that VCF Zarr is competitive with specialized storage
solutions that require similarly specialized tools and access libraries for
querying. An attractive feature is that VCF Zarr allows for variant annotation
workflows that do not require full dataset copy and conversion. Another key
point is that Zarr is a high-level spec and data model for the chunked storage
of n-d arrays, rather than a byte-level encoding designed specifically around
the genomic variant data type. I personally have used Zarr productively for
several applications unrelated to statistical genetics. While Zarr VCF mildly
underperforms some of the specialized formats (Savvy in compute, Genozip in
compression) in a few instances, I believe the accessibility, interoperability,
and reusability gains of Zarr make the small tradeoff well worthwhile. 

Because Zarr has seen heavy adoption in other scientific communities like the
geospatial and Earth sciences, and is well integrated in the scientific Python
stack, I think it holds potential for greater reusability across the ecosystem.
As such, I think the VCF Zarr spec is a highly valuable if not overdue
contribution to an entrenched field that has recently been confronted by a
scalability wall.

Overall, the paper is clear, comprehensive, and well written. 
\end{point}
\begin{reply}
Thank you!  We are delighted that you agree with us on the transformative
potential of VCF Zarr, and thank you for your insightful points and
helpful suggestions.
\end{reply}

\begin{point}
The benefits for large scientific datasets to be analysis-ready
cloud-optimized (ARCO) have been well articulated by Abernathey et al., 2021.
However, I do think that the ``local''/HPC single-file use case is still
important and won't disappear any time soon, and for some file system use
cases, expansive and deep hierarchies can be performance limiting (this was
hinted at in one of the benchmarks). In this scenario would a large Zarr VCF
perform reasonably well (or even better on some file systems) via a single
local zip store?
\end{point}
\begin{reply}
This is a good point, and we have added several mentions of the Zip file backend
as well as benchmarks using it. We added a new paragraph on p.~5 (lines 318--338)
to discuss this point. We also added benchmarks based on the SARS-CoV-2
dataset, which is distributed as a Zip file.
\end{reply}

\begin{point}
The description of the intermediate columnar format (ICF) used by \texttt{vcf2zarr}
is missing some detail. At first I got the impression it might be based on
something like Parquet, but running the provided code showed that it consists
of a similar file-based chunk layout to Zarr. This should be clarified in the
manuscript. 
\end{point}
\begin{reply}
We have tried to clarify this (and the purpose of ICF) by saying
``...storing each field independently in (approximately) fixed-size
compressed chunks in a file-system hierarchy.
ICF is designed to support efficient Zarr encoding within \texttt{vcf2zarr}
and not intended for reuse outside that context.''
\end{reply}

\begin{point}
The authors discuss the possibility of storing an index mapping genomic
coordinates to chunk indexes. Have Zarr-based formats in other fields like
geospatial introduced their own indexing approaches to take inspiration from?
\end{point}
\begin{reply}
We have added a new \texttt{region\_index} field for this purpose, which 
supports efficient genomic range searches. This is discussed in the last 
paragraph of the \texttt{vcf2zarr} Methods section (p.~13).
\end{reply}

\begin{point}
Since VCF Zarr is still a draft proposal, it could be useful to indicate
where community discussions are happening and how potential new contributors
can get involved, if possible. This doesn't need to be in the paper per se, but
perhaps documented in the spec repo.
\end{point}
\begin{reply}
Thank you, that is an excellent idea. We have added basic guidance for 
contributors to the specification repo.
\end{reply}


\begin{point}
In the background: ``For the representation to be FAIR, it must also be
accessible,'' -- A is for ``accessible'', so ``also'' doesn't make sense.
\end{point}
\begin{reply}
Fixed.
\end{reply}

\begin{point}
``There is currently no efficient, FAIR representation...''. Just a nit and
feel free to ignore, but the solution you present is technically ``current''.
\end{point}
\begin{reply}
You are right, but we would like to keep this as it is for narrative clarity.
\end{reply}

\begin{point}
In Figure 2, the zarr line is occluded by the sav line and hard to see.
\end{point}
\begin{reply}
The benchmarks have changed slightly, and the lines are now more separated.
\end{reply}

\reviewersection
\section*{Reviewer 2 comments}

\begin{point}
The paper presents an encoding of the VCF data using Zarr to enable fast
retrieving subsets of the data. A vcf2arr conversion was provided and validated
on both simulated and real-world data sets. The topic of this work is
interesting and of good values, however, the experimental studies and
contributions should be considerable improved.
\end{point}
\begin{reply}
Thank you for your valuable feedback; it has resulted in a much improved
manuscript. We have significantly extended the scope 
of this work by providing 
\begin{itemize}
\item The vcztools program, which converts from Zarr to VCF efficiently.
\item A new case study on the Our Future Health genotype data, for 
651K samples.
\item A new case study on the All of Us exome-like data, for 245K
samples.
\item A new case study on Norway Spruce data, for 1063 samples
and 3.75 \emph{billion} variants.
\item A new case study on SARS-CoV-2 data for 4.3 million samples.
\item A new section demonstrating a prototype of the SAIGE software
incorporating support for VCF Zarr, with excellent performance.
\item A new section exploring the use of VCF Zarr in cloud computing 
platforms, showing how data processing rates of up to 25.5~GiB/s
are possible.
\item A new section exploring the use of GPU acceleration on VCF Zarr data.
\end{itemize}
We hope you agree that these extensive additions,
applied to a selection of the largest genomic datasets
in the world (in terms of both sample and genome size),
demonstrate the real-world value and transformative potential of VCF Zarr.
\end{reply}

\begin{point}
The proposed method is simply a conversion from VCF to Zarr format. Since
both are existing formats, the contributions and originality of this work are
not impressive.
\end{point}
\begin{reply}
We believe that the extensive work detailed in the previous response
addresses this criticism.
\end{reply}

\begin{point}
The compression and query performance is the main concern of this work. The
method should be compared with other state-of-the-art queriable VCF compressors
like GTC, GBC, and GSC. [References omitted]
\end{point}
\begin{reply}
We disagree on this point, and respectfully point out that for our 
purposes, these methods are essentially equivalent to Genozip
(which we have included in our benchmarks as a representative 
of this class of method). 
As we argue in the ``Calculating with the genotype matrix'' section,
a fundamental point is that compression methods that only provide access
to data via VCF text output limit the rate at which the data can be
processed to the rate at which VCF text can be parsed. Producing VCF text
as output cannot be seen as the end point of analysis, and the rate at which
it can be done (beyond a certain minimum value of around 120~MiB/s, as we
argue in the new vcztools section) is of no real practical interest. Our
benchmarks in, e.g., Figures 3 and 4 show that the Genozip-like approach
of highly compressed genotype data that is accessed only via VCF text
cannot form the basis of efficient processing pipelines. It is 
intrinsically inefficient.

We have included references to GTC, GBC and GSC, and discuss them in the 
``Storing genetic variation data'' section, directing the reader to 
the benchmarks performed in the ``Calculating with the genotype matrix''
section.
\end{reply}

\begin{point}
The method should be evaluated on more real VCF data sets.
\end{point}
\begin{reply}
We have added four new case studies applying VCF Zarr
to some of the world's largest datasets,
covering different data modalities (genotype, exome-like, whole genome sequence)
and three very different organisms (human, tree, virus)
with widely differing genome properties. 
\end{reply}

\end{document}