A Python library designed to streamline writing, updating & accessing ocean model and observational data stored in cloud object storage.
To install the OceanDataStore package, first clone the repository from GitHub:

```sh
git clone [email protected]:NOC-MSM/OceanDataStore.git
```

Next, install OceanDataStore in editable mode as follows:

```sh
cd OceanDataStore
pip install -e .
```
**Note:** we strongly recommend installing OceanDataStore into a new virtual environment using either `venv` or `conda`/`mamba`.
To get started using OceanDataStore, users need to create a `credentials.json` file containing the following information:

```json
{
    "secret": "your_secret",
    "token": "your_token",
    "endpoint_url": "https://noc-msm-o.s3-ext.jc.rl.ac.uk"
}
```
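As an illustrative sketch (not part of the OceanDataStore API), the fields in `credentials.json` map naturally onto the keyword arguments accepted by `s3fs.S3FileSystem` (or the `storage_options` of `xarray.open_zarr`) when reading the resulting zarr stores directly; treating `token` as the S3 access key is an assumption here, not documented OceanDataStore behaviour:

```python
import json

# Parse a credentials file of the format shown above. The JSON is embedded
# as a string so the example is self-contained.
credentials_json = """
{
    "secret": "your_secret",
    "token": "your_token",
    "endpoint_url": "https://noc-msm-o.s3-ext.jc.rl.ac.uk"
}
"""
creds = json.loads(credentials_json)

# Hypothetical mapping onto s3fs.S3FileSystem keyword arguments:
# "token" is assumed to be the S3 access key, "secret" the secret key.
storage_options = {
    "key": creds["token"],
    "secret": creds["secret"],
    "client_kwargs": {"endpoint_url": creds["endpoint_url"]},
}
print(storage_options)
```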
To create a new zarr store in an object store from a local file, use the `send_to_zarr` command:

```sh
ods send_to_zarr -f /path/to/file.nc -c credentials.json -b bucket_name -v var
```
The arguments used are:

- `-f` : Path to the netCDF file containing the variables.
- `-c` : Path to the JSON file containing the object store credentials.
- `-b` : Bucket name in the object store where the variables will be stored.
- `-v` : Variable within the netCDF file to send to the object store.
In the example above, without a `-p` (or `--prefix`) argument, the variables will be stored in `<bucket_name>/<var>`. If a `--prefix` is provided, the variables will be stored in `<bucket_name>/<prefix>/<var>`.
To create a new zarr store in an object store using a large number of files, we can use dask with the `send_to_zarr` command by passing a dask configuration JSON file:

```sh
ods send_to_zarr -f filepaths -c credentials.json -b bucket_name -p prefix \
    -gf filepath_domain -uc '{"lat":"lat_new", "lon":"lon_new"}' \
    -cs '{"x":500, "y":500, "depthw":25}' \
    -dc dask_config.json
```
The arguments used are:

- `-f` : Paths to the multiple netCDF files containing the variables.
- `-c` : Path to the JSON file containing the object store credentials.
- `-b` : Bucket name in the object store where the variables will be stored.
- `-p` : Prefix used to define the path to the object (see above).
- `-gf` : Path to the model grid file containing domain variables.
- `-uc` : Coordinate dimension variables to update, given as a JSON string `'{"current_coord": "new_coord"}'`.
- `-cs` : Chunking strategy used to rechunk the model data.
- `-dc` : Path to the JSON file containing the Dask configuration.
where the contents of `dask_config.json` are:

```json
{
    "config_kwargs": {
        "temporary_directory": "..../jasmin_os_tmp/",
        "local_directory": "..../jasmin_os_tmp/"
    },
    "cluster_kwargs": {
        "n_workers": 12,
        "threads_per_worker": 1,
        "memory_limit": "2GB"
    }
}
```
In this example, a LocalCluster with 12 single-threaded workers, each with 2 GB of available memory, is used to transfer a large collection of files to an object store.

Users are strongly recommended to implement `send_to_zarr` workflows using a job scheduler, such as SLURM or PBS, either to run the LocalCluster on a single compute node or to use an existing SLURMCluster or PBSCluster (dask-jobqueue).
**Note:** the netCDF4 library does not support multi-threaded access to datasets, so users should ensure that `threads_per_worker: 1` is set in their dask configuration JSON file to avoid raising CancelledError exceptions when using `send_to_zarr` or `update_zarr`.
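A minimal sketch, assuming the `cluster_kwargs` entry holds the keyword arguments of `dask.distributed.LocalCluster`, of how a workflow might validate `dask_config.json` before launching a cluster (the directory paths are placeholders):

```python
import json

# Self-contained copy of a dask_config.json of the shape shown above.
config = json.loads("""
{
    "config_kwargs": {
        "temporary_directory": "/path/to/tmp/",
        "local_directory": "/path/to/tmp/"
    },
    "cluster_kwargs": {
        "n_workers": 12,
        "threads_per_worker": 1,
        "memory_limit": "2GB"
    }
}
""")

cluster_kwargs = config["cluster_kwargs"]

# Guard against the netCDF4 threading issue described in the note above.
if cluster_kwargs.get("threads_per_worker", 1) != 1:
    raise ValueError("netCDF4 is not thread-safe: set threads_per_worker to 1")

# With dask installed, the cluster would then be started as:
#   from dask.distributed import Client, LocalCluster
#   client = Client(LocalCluster(**cluster_kwargs))
print(cluster_kwargs)
```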
To update an existing zarr store in an object store, we can use the `update_zarr` command:

```sh
ods update_zarr -f /path/to/file.nc -c credentials.json -b bucket_name -p prefix -v var
```
This command will replace and/or append the values of variable `var` stored at the local filepath to the `/bucket_name/prefix/var` store, provided it already exists in the object store.

**Note:** compatibility checks must be passed before local data will be appended to an existing store; these include chunk size and dimension compatibility.
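The kind of compatibility check described above can be sketched as follows, assuming chunk sizes and dimension lengths are available as plain dicts (the actual checks are internal to `update_zarr`):

```python
def is_compatible(store_chunks, store_dims, local_chunks, local_dims,
                  append_dim="time_counter"):
    """Illustrative check: chunk sizes must match exactly, and all
    dimensions except the append dimension must agree in name and size."""
    fixed_store = {d: n for d, n in store_dims.items() if d != append_dim}
    fixed_local = {d: n for d, n in local_dims.items() if d != append_dim}
    return store_chunks == local_chunks and fixed_store == fixed_local

# Existing store vs. a local file: same spatial dims and chunking,
# with new time steps to append along "time_counter".
store_chunks = {"time_counter": 1, "x": 500, "y": 500}
store_dims = {"time_counter": 120, "x": 4320, "y": 3059}
local_chunks = {"time_counter": 1, "x": 500, "y": 500}
local_dims = {"time_counter": 12, "x": 4320, "y": 3059}

print(is_compatible(store_chunks, store_dims, local_chunks, local_dims))  # True
```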
To update an existing zarr store in an object store using a large number of files, we can use dask via the `update_zarr` command, analogously to `send_to_zarr`:

```sh
ods update_zarr -f filepaths -c credentials.json -b bucket_name -p prefix \
    -gf filepath_domain -uc '{"lat":"lat_new", "lon":"lon_new"}' \
    -cs '{"x":500, "y":500, "depthw":25}' -ad time \
    -dc dask_config.json
```
For further examples of how to implement the commands in OceanDataStore in your own workflows, see the bash scripts in the `examples` directory.
| Long Version | Short Version | Description |
|---|---|---|
| `action` | | Specify the action: `send` to send a file or `update` to update an existing object. |
| `--filepaths` | `-f` | Paths to the files to send or update. |
| `--credentials` | `-c` | Path to the JSON file containing the credentials for the object store. |
| `--bucket` | `-b` | Bucket name. |
| Flag | Short Version | Description |
|---|---|---|
| `--prefix` | `-p` | Object prefix (default=`None`). |
| `--append-dim` | `-ad` | Append dimension (default=`time_counter`). |
| `--variables` | `-v` | Variables to send (default=`None`). If `None`, all variables will be sent. If `consolidated`, the variables will be sent to a single consolidated zarr store. |
| `--chunk-strategy` | `-cs` | Chunk strategy as a JSON string (default=`None`). E.g., `'{"time_counter": 1, "x": 100, "y": 100}'` |
| `--dask-configuration` | `-dc` | Path to the JSON file defining the Dask LocalCluster configuration (default=`None`). |
| `--grid-filepath` | `-gf` | File path to the model grid file containing domain information (default=`None`). |
| `--update-coords` | `-uc` | Coordinate dimensions to update as a JSON string (default=`None`). E.g., `'{"nav_lon": "glamt", "nav_lat": "gphit"}'` |
| `--attributes` | `-at` | Attributes to add to the dataset as a JSON string. E.g., `'{"title": "my_dataset"}'` |
| `--zarr-version` | `-zv` | Zarr version used to create the zarr store (default=`3`). Options are `2` (v2) or `3` (v3). |