Restoring to ECCO4 data #81
Comments
An additional drawback of method 2 is that we need to preprocess the data anyway because we need to inpaint missing regions. Mostly for this reason I would probably favour method 1.
Can we also look into the European Copernicus reanalysis data? I wonder also if there is a more intelligent way to download it, like downloading slices or something. I'm unsure about the pros and cons of the different data products, but for the purposes of restoring or even initial conditions, I'm not sure a reanalysis is necessarily worse than ECCO's state estimate (which differs in that ECCO is more dynamically consistent, somehow).
How long does this take to download? In 2024 we have fast internet, so maybe this is just life and we can accept it. We can build some tools that help users do a download once and store the data in some common location that ClimaOcean knows about (e.g. independent from individual run scripts). That's what DataDeps did, though I think it's simpler to use an Artifacts.toml.
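The "download once, store in a common location" idea can be sketched as follows. This is a minimal, language-neutral illustration in Python, not ClimaOcean code and not how DataDeps or Artifacts.toml work internally: `cached_download`, `cache_dir`, and the `fetch` callback are hypothetical names introduced here.

```python
import os
from pathlib import Path
from typing import Callable


def cached_download(filename: str, cache_dir: Path,
                    fetch: Callable[[str, Path], None]) -> Path:
    """Return the local path of `filename`, fetching it only if it is not cached.

    `fetch(filename, destination)` is a hypothetical callback that performs
    the actual transfer; any HTTP client could be plugged in here.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        # Write to a temporary name first, so an interrupted transfer never
        # leaves a truncated file that looks like a valid cache entry.
        partial = target.with_name(target.name + ".part")
        fetch(filename, partial)
        os.replace(partial, target)
    return target
```

Because the cache directory is independent of any run script, a second call with the same filename returns immediately without touching the network.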
I have figured out that it is practically impossible to write a NetCDF file that contains the whole ECCO dataset: we are talking about around 660 GB per variable in Float32 format. I think we can circumvent this by loading one FieldTimeSeries time index from a single file by implementing a different backend like
Why do you have to write a single nc file?
Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step, which is nice.
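Loading from the original files on the fly amounts to keeping only a small window of snapshots in memory and reading one file per time index when the window moves. The sketch below illustrates that idea in Python under stated assumptions: `SnapshotWindow` and the `load` callback are hypothetical, and this is not the Oceananigans or ClimaOcean backend interface.

```python
from typing import Any, Callable, Dict, Sequence


class SnapshotWindow:
    """Keep `length` consecutive snapshots in memory, one file per time index."""

    def __init__(self, filenames: Sequence[str],
                 load: Callable[[str], Any], length: int = 2):
        self.filenames = list(filenames)
        self.load = load          # hypothetical reader: filename -> snapshot
        self.length = length
        self.data: Dict[int, Any] = {}
        self._fill(0)

    def _fill(self, start: int) -> None:
        stop = min(start + self.length, len(self.filenames))
        # Reuse snapshots that are already in memory; read only the new ones.
        self.data = {
            n: self.data[n] if n in self.data else self.load(self.filenames[n])
            for n in range(start, stop)
        }

    def __getitem__(self, n: int) -> Any:
        if not 0 <= n < len(self.filenames):
            raise IndexError(n)
        if n not in self.data:
            # Requested index is outside the window: move the window to it.
            self._fill(n)
        return self.data[n]
```

With this layout the memory footprint is bounded by `length` snapshots regardless of how many daily files the time series spans.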
You have to define ClimaOcean.jl/src/DataWrangling/JRA55.jl Lines 221 to 250 in 89a549d
you can probably also reuse |
Don't we also need a new module called |
This is exactly what I was suggesting. We always have to preprocess, though, because ECCO files have missing values that have to be filled in to a certain extent.
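Filling missing values "to a certain extent" can be illustrated with a simple iterative neighbour average: on each pass, every missing cell that touches at least one valid cell takes the mean of its valid neighbours, and the number of passes limits how far the fill propagates. This is a generic sketch in Python, assuming missing values are marked with `None`; `fill_missing` and `max_passes` are hypothetical names, and this is not necessarily the inpainting scheme ClimaOcean uses.

```python
from typing import List, Optional

Grid = List[List[Optional[float]]]


def fill_missing(field: Grid, max_passes: int = 10) -> Grid:
    """Return a copy of `field` with missing cells filled from valid neighbours."""
    rows, cols = len(field), len(field[0])
    filled = [row[:] for row in field]  # leave the input untouched
    for _ in range(max_passes):
        updates = {}
        for i in range(rows):
            for j in range(cols):
                if filled[i][j] is not None:
                    continue
                neighbours = [
                    filled[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < rows and 0 <= b < cols
                    and filled[a][b] is not None
                ]
                if neighbours:
                    updates[(i, j)] = sum(neighbours) / len(neighbours)
        if not updates:
            break  # nothing left that can be filled
        # Apply all updates at once so a pass only sees the previous state.
        for (i, j), value in updates.items():
            filled[i][j] = value
    return filled
```

Each pass extends the valid region by one cell, so `max_passes` controls how far into land (or other masked regions) the ocean values are extrapolated.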
Creating a new ECCO4 module is probably unnecessary, since the only difference between our current ECCO2 and ECCO4 is the filename to download from. I was thinking of just renaming the ECCO2 module to ECCO4 (since ECCO4 is a little more dynamically consistent). Another option is to rename the module to ECCO and extend the download files dictionary to include both ECCO2 and ECCO4, so we get the maximum code reuse.
JRA55 isn't dynamically consistent and we support that. Is the advantage of ECCO2 that it's higher resolution? Or no?
Ok, I think it is possible to support |
Ok, I was confused since I assumed we would have to do this, so I didn't understand the context of what you were saying. I didn't realize you were trying something different. I think it would help to write a bit more, like "I wanted to explore whether we could avoid loading data from separate .nc files by writing a single huge .nc file. But it turns out that it's too big."
I think this is the right way to go. Keep the overarching goal in mind: we want to make it as easy as possible for new users to start using the code, to port setups between machines, and to change setups. Because of this priority, the workflow where we "preprocess a huge dataset and then keep using it for the next 3 years" is not the kind of workflow we want to promote. Instead we want to promote a workflow where we re-download and re-process data often. I don't think we want to download huge files and make pre-processing really expensive just to save a bit of coding.
We need a utility to download/use the ECCO4 fields as restoring data.
ECCO4 fields come in NetCDF format, where the state for each day is stored in a separate file named like this:
OCEAN_TEMPERATURE_SALINITY_day_mean_2017-12-20_ECCO_V4r4_latlon_0p50deg.nc
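Since each day has its own file, the list of files needed for a restoring period can be generated from the dates alone. The sketch below, in Python, assumes that every daily file follows exactly the pattern of the example filename above with only the date changing; the function names are hypothetical and not part of ClimaOcean.

```python
from datetime import date, timedelta
from typing import List


def ecco4_daily_filename(day: date) -> str:
    """Build the daily-mean temperature/salinity filename for `day`.

    Assumes the naming pattern of the example file, with an ISO date.
    """
    return ("OCEAN_TEMPERATURE_SALINITY_day_mean_"
            f"{day.isoformat()}_ECCO_V4r4_latlon_0p50deg.nc")


def ecco4_daily_filenames(start: date, stop: date) -> List[str]:
    """Filenames for every day from `start` to `stop`, both inclusive."""
    n_days = (stop - start).days + 1
    return [ecco4_daily_filename(start + timedelta(days=k))
            for k in range(n_days)]
```

Generating names this way makes it possible to download only the days a given simulation actually needs.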
There are two main options to do this:
ECCO4NetCDFBackend <: AbstractInMemoryBackend
which will load individual snapshots into the FieldTimeSeries data. Pros: flexibility in how much data we want to download. Cons: more coding and less code reuse.