adding some references
valentina-s committed Jun 4, 2024
1 parent 6157490 commit a5ea41e
Showing 2 changed files with 48 additions and 6 deletions.
13 changes: 8 additions & 5 deletions papers/valentina_staneva/main.md
@@ -2,16 +2,19 @@
# Ensure that this title is the same as the one in `myst.yml`
title: "echodataflow: Recipe-based Fisheries Acoustics Workflow Orchestration"
abstract: |
With the influx of large volumes of data from multiple instruments and experiments, scientists are wrangling complex data pipelines that are context-dependent and non-reproducible. We demonstrate how we leverage Prefect [@prefect], a modern orchestration framework, to facilitate fisheries acoustics data processing. We built a Python package, Echodataflow, which 1) allows users to specify workflows and their parameters by editing text “recipes”, which provides transparency and reproducibility of the pipelines; 2) supports scaling of the workflows while abstracting the computational infrastructure; and 3) provides monitoring and logging of the workflow progress. Under the hood, echodataflow uses Prefect to execute the workflows while providing a domain-friendly interface to facilitate diverse fisheries acoustics use cases. We demonstrate these features through a typical ship survey data processing pipeline.
---




## Motivation
Acoustic fisheries surveys and ocean observing systems collect terabytes of echosounder (water column sonar) data that require custom processing pipelines to obtain the distributions and abundance of fish and zooplankton in the ocean [ref]. The data are collected by sending an acoustic signal into the ocean, which scatters from objects, and the returning “echo” is recorded. Although the data usually have similar dimensions (range, time, location, frequency) and can be stored in multi-dimensional arrays, the exact format varies based on the data collection scheme and the instrument used. Fisheries ship surveys, for example, follow pre-defined paths and can span several months ([Figure%s](#fig:data_collection) top-left). Ocean moorings, on the other hand, have instruments at fixed locations and can collect data 24/7 for several years (when wired) ([Figure%s](#fig:data_collection) bottom). Unmanned Surface Vehicles (USVs) (e.g. Saildrone [ref], DriX [ref], [Figure%s](#fig:data_collection) top-right) can autonomously collect echosounder data over large spatial regions. Although in all these scenarios data are usually collected with the same type of instrument and some of the initial processing steps are similar, the combination of research needs, data volume, and available computational infrastructure demands different workflows.


:::{figure} data_collection.png
:label: fig:data_collection
**Data Collection Schemes:** top left, ship survey transect map for the Joint U.S.-Canada Integrated Ecosystem and Pacific Hake Acoustic Trawl Survey [ref]; top right, USV path map for a Saildrone west coast survey [ref] [image_source]; bottom, map and instrument diagram for a stationary ocean observing system (Ocean Observatories Initiative Cabled and Endurance Arrays [ref] [image_source])
:::


@@ -61,15 +64,15 @@ Researchers are faced with decisions of where to store the data from experiments
With the growth of the echosounder datasets, researchers face challenges processing the data on their personal machines, both in terms of memory usage and computational time. A typical first attempt at a resolution would be to amend the workflow to process smaller chunks of the data and parallelize operations across multiple cores if available. However, today researchers are also presented with a multitude of options for distributed computing: high-performance computing clusters at local or national institutions, and cloud provider services such as batch computing (e.g. Azure Batch, AWS Batch), elastic container services, and serverless options (e.g. Amazon Web Services Lambda Functions [ref], Google Cloud Functions, Microsoft Azure Functions). Data, code, and workflow organization usually need to be adapted based on the computing infrastructure. The knowledge required to configure these systems to achieve efficient processing is quite in-depth, and distributed computing libraries can be hard to debug and can have unpredictable performance. *Abstracting the computing infrastructure and distribution of the tasks can allow researchers to focus on the scientific analysis of these large and rich datasets.*
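As a minimal sketch of that first, chunk-and-parallelize step (not a specific echodataflow feature), the code below lazily opens a chunked dataset and lets dask spread a reduction over local workers; the file name, variable name, and chunk sizes are placeholder assumptions:

```python
import xarray as xr
from dask.distributed import Client

# Start a local Dask cluster; the same code could point at a remote
# scheduler (HPC, cloud) without changing the analysis below.
client = Client(n_workers=4)

# Lazily open a (hypothetical) Sv dataset stored as chunked Zarr;
# nothing is loaded into memory yet.
ds = xr.open_zarr("survey_Sv.zarr", chunks={"ping_time": 10_000})

# The reduction is split into per-chunk tasks that Dask schedules
# across the available workers instead of loading the whole array.
mean_sv = ds["Sv"].mean(dim="ping_time").compute()
print(mean_sv)
```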

## Echodataflow Overview
At the center of `echodataflow`'s design is the notion that a workflow can be configured through a set of recipes (`.yaml` files) that specify the pipeline, data storage, and logging details. The idea draws inspiration from the Pangeo-Forge Project [ref], which facilitates the Extraction, Transformation, Loading (ETL) of earth science geospatial datasets from traditional repositories to analysis-ready, cloud-optimized (ARCO) data stores [@pangeo-forge]. The Pangeo-Forge recipes provide a model of how the data should be accessed and transformed, and the project has garnered numerous recipes from the community. While Pangeo-Forge’s focus is on transformation from `.netcdf` [ref] and `hdf5` [ref] formats to `.zarr`, echodataflow’s aim is to support full echosounder data processing and analysis pipelines: from instrument raw data formats to biological products. `echodataflow` leverages Prefect [@prefect] to abstract data and computation management. In [Figure%s](#fig:echodataflow) we provide an overview of echodataflow’s framework. At the center we see several steps from an echosounder data processing pipeline: `open_raw`, `combine_echodata`, `compute_Sv`, `compute_MVBS`. All these functions exist in the echopype package and are wrapped by echodataflow into predefined stages. Prefect executes the stages on a dask cluster, which can be started locally or set up externally. The echopype functions already support distributed operations with dask, so the integration with Prefect within echodataflow is natural. Dask clusters can be set up on a variety of platforms (local, cloud, Kubernetes [ref], HPC cluster via `dask-jobqueue` [ref], etc.), allowing abstraction from the computing infrastructure. Input, intermediate, and final data sets can live in different storage systems (local/cloud, public/private), and Prefect’s block feature provides seamless, provider-agnostic, and secure integration. Workflows can be executed and monitored through Prefect’s dashboard, while logging of each function is handled by echodataflow.
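The sketch below illustrates this pattern — echopype functions wrapped as Prefect tasks and executed on a dask cluster. It is an illustration of the idea, not echodataflow's internal implementation; the file name, sonar model, and cluster settings are placeholder assumptions:

```python
import echopype as ep
from prefect import flow, task
from prefect_dask import DaskTaskRunner  # runs Prefect tasks on a Dask cluster

@task
def open_raw(raw_file: str):
    # Convert a vendor raw file into a standardized EchoData object
    return ep.open_raw(raw_file, sonar_model="EK60")

@task
def compute_sv(echodata):
    # Calibrate to volume backscattering strength (Sv)
    return ep.calibrate.compute_Sv(echodata)

@flow(task_runner=DaskTaskRunner())  # local cluster here; could attach to an existing one
def survey_pipeline(raw_files: list):
    results = []
    for f in raw_files:
        ed = open_raw.submit(f)              # each file becomes a task on the cluster
        results.append(compute_sv.submit(ed))
    return [r.result() for r in results]

if __name__ == "__main__":
    survey_pipeline(["D20190801-T120000.raw"])  # placeholder file name
```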

:::{figure} echodataflow.png
:label: fig:echodataflow
**`echodataflow` Framework:** The above diagram provides an overview of the echodataflow framework: the task is to fetch raw files from a local filesystem/cloud archive, process them through several stages of an echosounder data workflow using a cluster infrastructure, and store intermediate and final products. Echodataflow allows the workflow to be executed based on text configurations, and logs are generated for the individual processing stages. Prefect handles the distribution of the tasks on the cluster, and provides tools for monitoring the workflow runs.
:::

### Why Prefect?
We chose Prefect over other Python workflow orchestration tools such as Apache Airflow [@airflow], Dagster [@dagster], Argo [@argo], and Luigi [@luigi] for the following reasons:
* Prefect accepts dynamic workflows which are specified at runtime and are not required to follow a Directed Acyclic Graph (DAG), which can be restrictive and difficult to implement.
* In Prefect, Python functions are first-class citizens, so building a Prefect workflow does not deviate much from traditional scientific workflows built out of functions.
* Prefect integrates with a dask cluster, and echopype processing functions already use dask to scale operations.
@@ -82,7 +85,7 @@ We next describe in more detail the components of the workflow lifecycle.
The main goal of echodataflow is to allow users to configure an echosounder data processing pipeline by editing configuration “recipe” templates. Echodataflow can be configured through three templates: `datastore.yaml`, which handles the data storage decisions; `pipeline.yml`, which specifies the processing stages; and `logging.yaml`, which sets the logging format.
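As a loose illustration of this recipe-driven pattern (not echodataflow's actual schema or API — the key names and stage handling below are assumptions), the three templates can be read and used to drive a run roughly as follows:

```python
import yaml

def load_recipe(path: str) -> dict:
    """Read one YAML recipe file into a plain dictionary."""
    with open(path) as f:
        return yaml.safe_load(f)

# The three templates that drive a run (file names follow the text above).
datastore = load_recipe("datastore.yaml")   # where data come from / go to
pipeline = load_recipe("pipeline.yml")      # which stages to run, in order
logging_cfg = load_recipe("logging.yaml")   # how each stage logs

# Hypothetical: walk the configured stages and report what would run.
for stage in pipeline.get("stages", []):
    print(f"would run stage: {stage}")
```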

### Data Storage Configuration
In [Figure%s](#fig:datastore_config) (`datastore.yaml`) we provide an example of a data store configuration for a ship survey. In this scenario we want to process data from the Joint U.S.-Canada Integrated Ecosystem and Pacific Hake Acoustic Trawl Survey [@NWFSC_FRAM_2022], which is publicly shared on an AWS S3 bucket by the NOAA National Center for Environmental Information Acoustics Archive (NCEA) [@NWFSC_FRAM_2022]. The archive holds data from many surveys dating back to 1991 and contains ~280TB of data. The additional parameters referring to ship, survey, and sonar model names allow filtering the files to those belonging only to the survey of interest. The output is set to a private S3 bucket belonging to the user (i.e. an AWS account different from the input one), and the credentials are passed through a `block_name`. The survey contains ~4000 files, and one can set the `group` option to combine the files into survey-specific groups based on transect information provided in the `transect_group.txt` file. One can further use regular expressions to subselect new groups based on needs.
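For instance, selecting and grouping the survey's raw files from the public archive could be done along these lines — a hedged sketch in which the bucket name, directory layout, ship/survey identifiers, and regular expression are assumptions for illustration, not values taken from the actual configuration:

```python
import re
import s3fs

# Anonymous read access to the public NOAA water column sonar archive
# (bucket and path layout are assumptions for illustration).
fs = s3fs.S3FileSystem(anon=True)
raw_files = fs.glob("noaa-wcsd-pds/data/raw/Bell_M._Shimada/SH1707/EK60/*.raw")

# Mirror the regex-based subselection described above: keep only files
# whose names match a (hypothetical) date pattern for the transects of interest.
pattern = re.compile(r"D201707\d{2}")
transect_files = [f for f in raw_files if pattern.search(f)]

print(f"{len(raw_files)} files in survey, {len(transect_files)} in selected transects")
```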

:::{figure} datastore_config.png
:label: fig:datastore_config
41 changes: 40 additions & 1 deletion papers/valentina_staneva/mybib.bib
@@ -390,4 +390,43 @@ @article{numpy
title = {Array programming with {NumPy}},
volume = {585},
}
}
@misc{prefect,
title = {Prefect},
year = {2024},
url = {https://www.prefect.io/},
note = {Accessed 1 Jun. 2024}
}

@misc{dagster,
title = {Dagster},
year = {2024},
url = {https://dagster.io/},
note = {Accessed 1 Jun. 2024}
}

@misc{airflow,
title = {Apache Airflow},
year = {2024},
url = {https://airflow.apache.org/},
note = {Accessed 1 Jun. 2024}
}
@misc{luigi,
title = {Luigi},
year = {2024},
url = {https://luigi.readthedocs.io/en/stable/running_luigi.html},
note = {Accessed 1 Jun. 2024}
}
