Restoring to ECCO4 data #81

simone-silvestri · 2024-05-03T00:53:51Z

We need a utility to download/use the ECCO4 fields as restoring data.
ECCO4 fields come in NetCDF format, where each state for a particular day is stored in a file like this: OCEAN_TEMPERATURE_SALINITY_day_mean_2017-12-20_ECCO_V4r4_latlon_0p50deg.nc
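A minimal sketch of how the per-day file names could be generated from a date, assuming every daily file follows the pattern of the example above. The function names `ecco4_daily_filename` and `ecco4_daily_filenames` are hypothetical, not existing ClimaOcean functions.

```julia
using Dates

# Hypothetical helper: file name of the daily mean for `date`, assuming the
# pattern of the example file name quoted in this issue.
function ecco4_daily_filename(date::Date)
    datestr = Dates.format(date, dateformat"yyyy-mm-dd")
    return "OCEAN_TEMPERATURE_SALINITY_day_mean_" * datestr * "_ECCO_V4r4_latlon_0p50deg.nc"
end

# Hypothetical helper: one file name per day between `start` and `stop` (inclusive).
ecco4_daily_filenames(start::Date, stop::Date) =
    [ecco4_daily_filename(d) for d in start:Day(1):stop]
```

For example, `ecco4_daily_filenames(Date(2017, 12, 30), Date(2018, 1, 2))` lists four files, one per day, so only the days needed for a run have to be downloaded.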

There are two main options to do this:

1. Leverage the same code that we have for JRA55 (renaming functions to generalize them). This would require preprocessing the ECCO4 data, which is about 555 GB, but it would then let us operate in exactly the same way as we do with the prescribed atmosphere. Pros: easy to implement and a lot of code reuse. Cons: we might be stuck with a gigantic 555 GB data file to download (the same problem we will eventually have with the atmosphere).
2. Download the data and build a new ECCO4NetCDFBackend <: AbstractInMemoryBackend that loads individual snapshots into the FieldTimeSeries data. Pros: flexibility in how much data we want to download. Cons: more coding and less code reuse.

simone-silvestri · 2024-05-03T01:10:10Z

An additional drawback of method 2 is that we need to preprocess the data anyway because we need to inpaint missing regions.

Mostly for this reason I would probably favour method 1

glwagner · 2024-05-03T04:41:00Z

Can we also look into the European Copernicus reanalysis data? I wonder also if there is a more intelligent way to download it, like downloading slices or something. I'm unsure about the pros and cons of the different data products, but for the purposes of restoring or even initial conditions, I'm not sure a reanalysis is necessarily worse than ECCO's state estimate (which differs in that ECCO is more dynamically consistent, somehow)

glwagner · 2024-05-03T04:42:17Z

> Cons: We might be stuck with having a gigantic 555 GB datafile to download (the same problem we will eventually have with the atmosphere)

How long does this take to download? In 2024 we have fast internet, so maybe this is just life and we can accept it. We can build some tools that help users download once and store the data in some common location that ClimaOcean knows about (e.g. independent of individual run scripts). That's what DataDeps did, though I think it's simpler to use an Artifacts.toml.
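A rough sketch of the "download once, store in a common place" idea, using only the `Downloads` standard library. This is neither the DataDeps nor the Artifacts mechanism; `cached_download`, the `cache_dir` argument and the `downloader` keyword are hypothetical names used for illustration.

```julia
using Downloads

# Hypothetical helper: download `url` into `cache_dir` unless the file is
# already there, and return the local path. The `downloader` keyword is
# injectable so the caching logic can be tried without network access.
function cached_download(url::AbstractString, cache_dir::AbstractString;
                         downloader = Downloads.download)
    mkpath(cache_dir)                               # create the shared directory if needed
    filepath = joinpath(cache_dir, basename(url))   # local name taken from the URL
    if !isfile(filepath)
        downloader(url, filepath)                   # only reached on the first call
    end
    return filepath
end
```

Run scripts would then all call the same helper with the same `cache_dir`, so a second script (or a second run) finds the file already on disk.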

simone-silvestri · 2024-05-07T01:55:34Z

I have figured out that it is practically impossible to write a NetCDF file that contains the whole ECCO dataset: we are talking about around 660 GB per variable in Float32 format.

I think we can circumvent this by loading each FieldTimeSeries time index from a single file, implementing a different backend such as ECCONetCDFBackend.
This will probably not hurt performance much, since we can use the daily means from ECCO (or even the monthly means), so loads will happen rather rarely.
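To get a feel for how rarely files would be read, here is a small sketch of a sliding in-memory window. It assumes that two consecutive snapshots (n and n + 1) must be in memory at a time, for example for time interpolation; `SnapshotWindow`, `in_memory`, `slide` and `count_loads` are hypothetical names, not part of Oceananigans or ClimaOcean.

```julia
# Hypothetical stand-in for a backend that keeps `length` consecutive
# snapshots in memory, starting at time index `start`.
struct SnapshotWindow
    start  :: Int
    length :: Int
end

# Time indices currently held in memory.
in_memory(w::SnapshotWindow) = w.start : w.start + w.length - 1

# Slide the window so that it starts at time index `n`.
slide(w::SnapshotWindow, n::Int) = SnapshotWindow(n, w.length)

# Count how many times files are read while marching through `Nt` snapshots,
# assuming snapshots n and n + 1 must both be in memory at each stage.
function count_loads(w::SnapshotWindow, Nt::Int)
    loads = 1                       # the initial load
    for n in 1:Nt-1
        if !((n in in_memory(w)) && ((n + 1) in in_memory(w)))
            w = slide(w, n)         # reload starting from the current index
            loads += 1
        end
    end
    return loads
end
```

With a window of 4 snapshots, `count_loads(SnapshotWindow(1, 4), 10)` returns 3: files are read once every three snapshots, which for daily means is once every three simulated days.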

glwagner · 2024-05-07T03:00:11Z

Why do you have to write a single nc file?

glwagner · 2024-05-07T03:01:13Z

Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.

glwagner · 2024-05-07T03:07:19Z

You have to define set! and new_backend for the Oceananigans.FieldTimeSeries interface:

ClimaOcean.jl/src/DataWrangling/JRA55.jl

Lines 221 to 250 in 89a549d

    
    function set!(fts::JRA55NetCDFFTS, path::String=fts.path, name::String=fts.name)

        ds = Dataset(path)

        # Note that each file should have the variables
        #   - ds["time"]:     time coordinate
        #   - ds["lon"]:      longitude at the location of the variable
        #   - ds["lat"]:      latitude at the location of the variable
        #   - ds["lon_bnds"]: bounding longitudes between which variables are averaged
        #   - ds["lat_bnds"]: bounding latitudes between which variables are averaged
        #   - ds[shortname]:  the variable data

        # Nodes at the variable location
        λc = ds["lon"][:]
        φc = ds["lat"][:]
        LX, LY, LZ = location(fts)
        i₁, i₂, j₁, j₂, TX = compute_bounding_indices(nothing, nothing, fts.grid, LX, LY, λc, φc)

        ti = time_indices(fts)
        ti = collect(ti)
        data = ds[name][i₁:i₂, j₁:j₂, ti]
        close(ds)

        copyto!(interior(fts, :, :, 1, :), data)
        fill_halo_regions!(fts)

        return nothing
    end

    new_backend(::JRA55NetCDFBackend, start, length) = JRA55NetCDFBackend(start, length)

You can probably also reuse compute_bounding_indices and move it to DataWrangling so that both ECCO and JRA55 can use it.
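A sketch of what the ECCO analogue might look like, written so that it runs on its own. The abstract type below is only a stand-in for the Oceananigans one, and `ECCONetCDFBackend`, `load_window!` and the `read_snapshot` argument are hypothetical. The main difference from the JRA55 code above is that each ECCO time index lives in its own file, so filling the in-memory data means looping over files.

```julia
# Stand-in so the sketch is self-contained; in ClimaOcean the real
# AbstractInMemoryBackend from Oceananigans would be used instead.
abstract type AbstractInMemoryBackend end

# Hypothetical ECCO analogue of JRA55NetCDFBackend.
struct ECCONetCDFBackend <: AbstractInMemoryBackend
    start  :: Int
    length :: Int
end

# Mirrors the JRA55 method quoted above.
new_backend(::ECCONetCDFBackend, start, length) = ECCONetCDFBackend(start, length)

# Hypothetical loader: copy one snapshot per file into the in-memory window.
# `filepaths[n]` is the file holding time index `n`; `read_snapshot(path)` is
# assumed to return a 3D array (in practice it would read the NetCDF file).
function load_window!(data::AbstractArray{<:Any, 4}, backend::ECCONetCDFBackend,
                      filepaths, read_snapshot)
    indices = backend.start : backend.start + backend.length - 1
    for (m, n) in enumerate(indices)
        data[:, :, :, m] .= read_snapshot(filepaths[n])
    end
    return nothing
end
```

In a real implementation the body of `load_window!` would sit inside a `set!` method for the ECCO FieldTimeSeries type and end with `fill_halo_regions!`, as in the JRA55 version.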

glwagner · 2024-05-07T03:07:48Z

Don't we also need a new module called ECCO4? That can be a first PR that just defines the module and adds some basic functionality.

simone-silvestri · 2024-05-07T03:15:30Z

> Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.

This is exactly what I was suggesting. We always have to preprocess, though, because ECCO files have missing values that have to be filled in to a certain extent.
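For illustration, a very simple way to fill missing regions "to a certain extent" is to repeatedly replace each missing cell with the mean of its valid neighbors, stopping after a fixed number of passes. This is only a sketch and not necessarily the inpainting used in ClimaOcean; `inpaint!` is a hypothetical name, missing values are assumed to be stored as NaN, and periodicity in longitude is ignored.

```julia
# Hypothetical helper: fill NaN cells of a 2D field with the mean of their
# valid (non-NaN) nearest neighbors. Each pass extends the valid region by
# one cell, so `maxiter` limits how far the fill reaches into missing regions.
function inpaint!(field::AbstractMatrix; maxiter = 100)
    Nx, Ny = size(field)
    for _ in 1:maxiter
        any(isnan, field) || break       # nothing left to fill
        previous = copy(field)           # read from the previous pass only
        for j in 1:Ny, i in 1:Nx
            isnan(previous[i, j]) || continue
            total, nvalid = 0.0, 0
            for (di, dj) in ((-1, 0), (1, 0), (0, -1), (0, 1))
                ii, jj = i + di, j + dj
                if 1 <= ii <= Nx && 1 <= jj <= Ny && !isnan(previous[ii, jj])
                    total += previous[ii, jj]
                    nvalid += 1
                end
            end
            nvalid > 0 && (field[i, j] = total / nvalid)
        end
    end
    return field
end
```

Because this step is cheap for a single snapshot, it could be applied each time a file is loaded rather than in a separate preprocessing pass over the whole dataset.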

simone-silvestri · 2024-05-07T03:33:34Z

Creating a new ECCO4 module is probably unnecessary since the only difference between our current ECCO2 and ECCO4 is the filename to download from. I was thinking of just renaming the ECCO2 module to ECCO4 (since ECCO4 is a little more dynamically consistent)

Another option is to rename the module to ECCO and just extend the download-files dictionary to include both ECCO2 and ECCO4, so that we get maximum code reuse.

glwagner · 2024-05-07T04:53:50Z

JRA55 isn't dynamically consistent and we support that. Is the advantage of ECCO2 that it's higher resolution? Or no?

simone-silvestri · 2024-05-07T12:24:52Z

Ok, I think it is possible to support ECCO2Daily, ECCO2Monthly and ECCO4Monthly with only a 4-line change in the code. Luckily, the structure of the .nc file does not change between these.
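One way this could look is to dispatch on a small version type, so that only the file name and the time between files differ per dataset. This is a sketch under assumptions: the type hierarchy and function names are hypothetical, and the file names below are placeholders, not the real names on the ECCO servers.

```julia
using Dates

# Hypothetical version types, named after the three cases mentioned above.
abstract type ECCOVersion end
struct ECCO2Daily   <: ECCOVersion end
struct ECCO2Monthly <: ECCOVersion end
struct ECCO4Monthly <: ECCOVersion end

# Time between consecutive files for each version.
file_interval(::ECCO2Daily)   = Day(1)
file_interval(::ECCO2Monthly) = Month(1)
file_interval(::ECCO4Monthly) = Month(1)

# PLACEHOLDER file names: the real naming conventions differ and would
# replace these three lines.
filename(::ECCO2Daily,   date) = "ecco2_daily_"   * Dates.format(date, dateformat"yyyymmdd") * ".nc"
filename(::ECCO2Monthly, date) = "ecco2_monthly_" * Dates.format(date, dateformat"yyyymm")   * ".nc"
filename(::ECCO4Monthly, date) = "ecco4_monthly_" * Dates.format(date, dateformat"yyyymm")   * ".nc"

# Everything else is shared between versions.
all_filenames(version::ECCOVersion, start::Date, stop::Date) =
    [filename(version, d) for d in start:file_interval(version):stop]
```

Since the structure of the .nc files is the same, the loading code downstream of `all_filenames` would not need to know which version it is reading.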

glwagner · 2024-05-07T16:38:49Z

> > Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.
>
> This is exactly what I was suggesting. We always have to preprocess, though, because ECCO files have missing values that have to be filled in to a certain extent.

Ok, I was confused since I assumed we would have to do this, so I didn't understand the context of what you were saying. I didn't realize you were trying something different. I think it would help to write a bit more, like "I wanted to explore whether we could avoid loading data from separate .nc files by writing a single huge .nc file. But it turns out that it's too big."

glwagner · 2024-06-20T23:07:32Z

> download the data, build a new ECCO4NetCDFBackend <: AbstractInMemoryBackend which will load individual snapshots in the fieldtimeseries data. Pros: flexibility with how much data we want to download. Cons: more coding and less code reutilization

I think this is the right way to go.

Keep this overarching goal in mind: our goal is to make it as easy as possible for new users to start using the code, and also to port setups between machines and change setups. Because of this priority, the workflow where we "preprocess a huge dataset and then keep using it for the next 3 years" is not the kind of workflow we want to promote.

Instead we want to promote a workflow where we re-download and re-process data often.

I don't think we want to opt to download huge files and make pre-processing really expensive just to save a bit of coding.

simone-silvestri added the enhancement and global simulations 🌎 labels May 3, 2024

simone-silvestri self-assigned this May 3, 2024

simone-silvestri added the SDI (Software Design Issue) label and removed the enhancement label May 3, 2024

simone-silvestri mentioned this issue Jun 20, 2024

problem with one_degree_near_global_simulation.jl #87

Open
