
Restoring to ECCO4 data #81

Open · simone-silvestri opened this issue May 3, 2024 · 14 comments
Labels: global simulations 🌎 They should have called this planet Ocean · SDI Software Design Issue

Comments

@simone-silvestri
Collaborator

We need a utility to download and use the ECCO4 fields as restoring data.
ECCO4 fields come in NetCDF format, with the state for each day stored in a file named like OCEAN_TEMPERATURE_SALINITY_day_mean_2017-12-20_ECCO_V4r4_latlon_0p50deg.nc

There are two main options:

  • leverage the same code that we have for JRA55 (changing names of functions to generalize them). This would require preprocessing the ECCO4 data, which is about 555 GB, but it would then let us operate in exactly the same way we operate with the prescribed atmosphere. Pros: easy to implement and a lot of code reuse. Cons: we might be stuck with having a gigantic 555 GB data file to download (the same problem we will eventually have with the atmosphere)
  • download the data and build a new ECCO4NetCDFBackend <: AbstractInMemoryBackend which loads individual snapshots into the FieldTimeSeries data (see the sketch below). Pros: flexibility in how much data we want to download. Cons: more coding and less code reuse
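For concreteness, here is a minimal sketch of the backend type in option 2. The module path and the start/length fields are assumptions mirroring the JRA55 backend, not a settled design:

```julia
# Sketch only: `AbstractInMemoryBackend` is assumed to come from
# Oceananigans.OutputReaders, with the fields mirroring the JRA55 backend.
using Oceananigans.OutputReaders: AbstractInMemoryBackend

struct ECCO4NetCDFBackend <: AbstractInMemoryBackend
    start  :: Int  # first time index currently held in memory
    length :: Int  # number of snapshots held in memory
end
```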
@simone-silvestri
Collaborator Author

An additional drawback of method 2 is that we need to preprocess the data anyway, because we need to inpaint missing regions.

Mostly for this reason, I would probably favour method 1.

simone-silvestri added the "enhancement" and "global simulations 🌎 They should have called this planet Ocean" labels on May 3, 2024
simone-silvestri self-assigned this on May 3, 2024
simone-silvestri added the "SDI Software Design Issue" label and removed the "enhancement" label on May 3, 2024
@glwagner
Member

glwagner commented May 3, 2024

Can we also look into the European Copernicus reanalysis data? I also wonder if there is a more intelligent way to download it, like downloading slices. I'm unsure about the pros and cons of the different data products, but for the purposes of restoring, or even initial conditions, I'm not sure a reanalysis is any worse than ECCO's state estimate (which differs in that ECCO is more dynamically consistent).
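For example, NCDatasets can open a remote OPeNDAP endpoint and fetch only an indexed hyperslab, so "downloading slices" could look something like this (the URL and variable name are placeholders; this assumes the provider actually exposes OPeNDAP):

```julia
using NCDatasets

# Placeholder endpoint, not a real Copernicus URL.
url = "https://example.com/opendap/reanalysis.nc"

ds = Dataset(url)
T_surface = ds["thetao"][:, :, 1, 1]  # fetch a single surface snapshot only
close(ds)
```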

@glwagner
Member

glwagner commented May 3, 2024

> Cons: we might be stuck with having a gigantic 555 GB data file to download (the same problem we will eventually have with the atmosphere)

How long does this take to download? In 2024 we have fast internet; maybe this is just life and we can accept it. We can build some tools that help users do the download once and store the data in some common location that ClimaOcean knows about (e.g. independent from individual run scripts). That's what DataDeps did, though I think it's simpler to use an Artifacts.toml.
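A sketch of what the Artifacts.toml route could look like on the Julia side (the artifact name is hypothetical; the URL and tree hash would live in the package's Artifacts.toml with `lazy = true`):

```julia
using LazyArtifacts

# Resolves the artifact, downloading it on first use and caching it in the
# shared Julia depot, independent of any individual run script.
ecco_dir = artifact"ecco4_restoring"
ecco_file = joinpath(ecco_dir, "OCEAN_TEMPERATURE_SALINITY_day_mean_2017-12-20_ECCO_V4r4_latlon_0p50deg.nc")
```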

@simone-silvestri
Collaborator Author

I have figured out that it is practically impossible to write a single NetCDF file that contains the whole ECCO dataset; we are talking about around 660 GB per variable in float32 format.

I think we can circumvent this by loading one FieldTimeSeries time index from a single file, by implementing a different backend like ECCONetCDFBackend.
This will probably not hurt performance too much, since we can use the daily means from ECCO (or even the monthly means), so loads will happen rather rarely.
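The intended usage would look something like this sketch (the constructor signature and names are assumptions based on the JRA55 code path, with grid, times, and path defined elsewhere):

```julia
# Hold only two ECCO snapshots in memory at a time; the backend reloads data
# from the NetCDF files as the simulation advances past the in-memory window.
backend = ECCONetCDFBackend(1, 2)

fts = FieldTimeSeries{Center, Center, Center}(grid, times;
                                              backend, path, name = "THETA")
```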

@glwagner
Member

glwagner commented May 7, 2024

Why do you have to write a single nc file?

@glwagner
Member

glwagner commented May 7, 2024

Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.

@glwagner
Member

glwagner commented May 7, 2024

You have to define `set!` and `new_backend` for the Oceananigans.FieldTimeSeries interface:

```julia
function set!(fts::JRA55NetCDFFTS, path::String=fts.path, name::String=fts.name)
    ds = Dataset(path)

    # Note that each file should have the variables
    #   - ds["time"]:     time coordinate
    #   - ds["lon"]:      longitude at the location of the variable
    #   - ds["lat"]:      latitude at the location of the variable
    #   - ds["lon_bnds"]: bounding longitudes between which variables are averaged
    #   - ds["lat_bnds"]: bounding latitudes between which variables are averaged
    #   - ds[shortname]:  the variable data

    # Nodes at the variable location
    λc = ds["lon"][:]
    φc = ds["lat"][:]

    LX, LY, LZ = location(fts)
    i₁, i₂, j₁, j₂, TX = compute_bounding_indices(nothing, nothing, fts.grid, LX, LY, λc, φc)

    ti = time_indices(fts)
    ti = collect(ti)
    data = ds[name][i₁:i₂, j₁:j₂, ti]
    close(ds)

    copyto!(interior(fts, :, :, 1, :), data)
    fill_halo_regions!(fts)

    return nothing
end

new_backend(::JRA55NetCDFBackend, start, length) = JRA55NetCDFBackend(start, length)
```

You can probably also reuse `compute_bounding_indices` by moving it to DataWrangling, so it can be used for both ECCO and JRA55.
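The ECCO analog of `new_backend` would then presumably be a one-liner (the type name is hypothetical):

```julia
# The backend is immutable, so sliding the in-memory window builds a new instance.
new_backend(::ECCONetCDFBackend, start, length) = ECCONetCDFBackend(start, length)
```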

@glwagner
Member

glwagner commented May 7, 2024

Don't we also need a new module called ECCO4? That can be a first PR that just defines the module and adds some basic functionality.
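A possible skeleton for that first PR, where everything beyond the module name is an assumption:

```julia
module ECCO4

using NCDatasets

# Hypothetical entry point; download helpers, variable-name tables, and the
# NetCDF backend would be added in follow-up PRs.
ecco4_field(name, date) = error("not implemented yet")

end # module
```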

@simone-silvestri
Collaborator Author

> Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.

This is exactly what I was suggesting. We always have to preprocess, though, because ECCO files have missing values that have to be filled in (inpainted) to a certain extent.
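For intuition, a minimal inpainting scheme just diffuses valid data into the missing regions, e.g.:

```julia
# Fill NaNs by repeatedly averaging available neighbors until none remain.
# This is a toy sketch, not the algorithm used in ClimaOcean.
function inpaint!(data::AbstractMatrix)
    all(isnan, data) && error("no valid data to inpaint from")
    Nx, Ny = size(data)
    while any(isnan, data)
        for I in findall(isnan.(data))
            i, j = Tuple(I)
            neighbors = [data[i′, j′] for (i′, j′) in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 1 ≤ i′ ≤ Nx && 1 ≤ j′ ≤ Ny && !isnan(data[i′, j′])]
            isempty(neighbors) || (data[i, j] = sum(neighbors) / length(neighbors))
        end
    end
    return data
end
```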

@simone-silvestri
Collaborator Author

Creating a new ECCO4 module is probably unnecessary, since the only difference between our current ECCO2 and ECCO4 is the filename to download from. I was thinking of just renaming the ECCO2 module to ECCO4 (since ECCO4 is a little more dynamically consistent).

Another option is to rename the module to ECCO and duplicate the download-files dictionary to include both ECCO2 and ECCO4, so we get maximum code reuse; a sketch of that option follows.
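```julia
# One dictionary confines the dataset differences to the download step
# (all URLs are placeholders, not the real endpoints).
ecco_urls = Dict(
    :ECCO2Daily   => "https://example.com/ecco2/daily/",
    :ECCO2Monthly => "https://example.com/ecco2/monthly/",
    :ECCO4Monthly => "https://example.com/ecco4/monthly/",
)
```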

@glwagner
Member

glwagner commented May 7, 2024

JRA55 isn't dynamically consistent and we support that. Is the advantage of ECCO2 that it's higher resolution? Or no?

@simone-silvestri
Collaborator Author

Ok, I think it is possible to support ECCO2Daily, ECCO2Monthly, and ECCO4Monthly with only a 4-line change in the code. Luckily, the structure of the .nc file does not change between these datasets.
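One way that change could look, with the dataset version distinguished by dispatch (the type names follow the datasets above; the filename fragments are illustrative placeholders, not the real conventions):

```julia
abstract type ECCOVersion end
struct ECCO2Daily   <: ECCOVersion end
struct ECCO2Monthly <: ECCOVersion end
struct ECCO4Monthly <: ECCOVersion end

# Only the filename logic differs between datasets.
filename(::ECCO2Daily,   date) = "THETA.1440x720x50.$date.nc"
filename(::ECCO2Monthly, date) = "THETA.1440x720x50.$date.nc"
filename(::ECCO4Monthly, date) = "OCEAN_TEMPERATURE_SALINITY_mon_mean_$(date)_ECCO_V4r4_latlon_0p50deg.nc"
```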

@glwagner
Member

glwagner commented May 7, 2024

> Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.

> This is exactly what I was suggesting. We always have to preprocess, though, because ECCO files have missing values that have to be filled in (inpainted) to a certain extent.

Ok, I was confused because I assumed we would have to do this, so I didn't understand the context of what you were saying; I didn't realize you were trying something different. I think it would help to write a bit more, like: "I wanted to explore whether we could avoid loading data from separate .nc files by writing a single huge .nc file, but it turns out that it's too big."

@glwagner
Member

> download the data and build a new ECCO4NetCDFBackend <: AbstractInMemoryBackend which loads individual snapshots into the FieldTimeSeries data. Pros: flexibility in how much data we want to download. Cons: more coding and less code reuse.

I think this is the right way to go.

Keep this overarching goal in mind: we want to make it as easy as possible for new users to start using the code, and also to port setups between machines and to change setups. Because of this priority, the workflow where we "preprocess a huge dataset and then keep using it for the next 3 years" is not the kind of workflow we want to promote.

Instead we want to promote a workflow where we re-download and re-process data often.

I don't think we want to opt to download huge files and make pre-processing really expensive just to save a bit of coding.
