
Add xarray backends for fgmax and fgout gridded formats #611

Merged · 24 commits · Aug 23, 2024

Conversation

kbarnhart
Contributor

@rjleveque, @mandli -

Here is a PR that addresses #598.

It may not be the most efficient (I flipud and transpose each array to put them in the orientation that xarray wants), but I find that it works.

I've tested the fgmax method on dclaw and geoclaw (the chile 2010 fgout-fgmax example) with each option for the number of variables written.

I've tested the fgout method with dclaw and geoclaw for both binary and ascii type files.

One point of discussion is whether you like the variable names I've used for fgmax; they are not exactly the same as in the fgmax object (e.g., I use h_max instead of h). I'm happy to change them to be identical if that is important to either of you.

I added an example usage script in the chile2010 fgout-fgmax example as it seemed the most sensible place to put an example.

Happy to improve this as you see fit.

@rjleveque
Member

Thanks @kbarnhart for providing this! I haven't experimented with it much yet, but a few comments...

  1. I had to install rioxarray, which is not part of xarray.
  2. Nice that for binary .b files it seems to work fine regardless of whether binary32 or binary64 is used.
  3. When reading fgmax arrays, it seems like the topography B field should also be returned by default. This is often needed in processing the results (e.g. to plot h_max only on shore).

Eventually it would be nice to integrate this better with fgout_tools.py and fgmax_tools.py so that the xarray datasets could be worked with in the same way as the ndarray option currently supported.

I hope to revamp the fgmax output options in the near future, e.g. to allow binary output and to make it easier to specify exactly what variables should be monitored. We are also working on improving the arrival time monitoring to keep track of various different times, which we have found necessary for tsunami studies in regions of vertical land motion and for maritime modeling where drawdown is important as well as increases in depth. That might be a good time to rethink a lot of the interface and try to better coordinate these tools.

But for now it seems fine to merge this in, perhaps with some tweaks, so that it can be used for your purposes.

@mandli and @ketch: I note that our new attempt at CI throws an error for this PR in the setup step:

cd geoclaw
git checkout refs/pull/611/merge
error: pathspec 'refs/pull/611/merge' did not match any file(s) known to git

Not sure if this is something to investigate?

@kbarnhart
Contributor Author

@rjleveque thanks for your comments. I'll make some updates to address them and ping you here when they are ready to consider.

Happy to be involved in integrating these better.

@mandli
Member

mandli commented Jun 2, 2024

Would the file xarray_backends.py be the location where any xarray backend would go? For example would we stick a gauge xarray backend in there?

@kbarnhart
Contributor Author

@mandli - I thought a tiny amount about this and it seemed to me like there were two options. Either all the xarray backends go in one module or each xarray backend goes inside a module associated with the relevant clawpack data structure (e.g., the FGOutBackend goes in fgout_tools.py).

I have a slight preference towards the latter option. However, I implemented the former because I didn't want to make the fgout_tools and fgmax_tools files longer and make it hard to discover the backends.

All of this is to say that I'm not sure of the best place for these within clawpack.geoclaw.

@mandli
Member

mandli commented Jun 4, 2024

@kbarnhart Those were the two options that came to my mind as well. Given that gauges are not actually handled here in GeoClaw but in PyClaw (it's common across most of the Clawpack packages), it may make more sense to keep them in the associated data structures. Similarly, we could/should do something with other data formats, such as friction from land use or storm data, for which there may be multiple backends. This should be easy to change up until we merge this fully and make a release, though.

@mandli
Member

mandli commented Jun 4, 2024

When it makes sense, variables should probably be named the same thing to reduce confusion, although I do like the more descriptive variable names you mentioned.

@kbarnhart
Contributor Author

Some notes based on conversation with @rjleveque:

  • Keep the more descriptive names
  • Keep these backends in their current file
  • In addition to adding B as a variable, add level

@rjleveque
Member

@kbarnhart: I'm finally getting back to looking at this and I see that the flexibility recently added to fgout variables means that these xarray tools won't always work.

What you call _qelements_geoclaw is what I called qmap (with default lists for geoclaw or dclaw) in the latest fgout_tools.FGoutGrid class, but now a user-specified qmap can be used if the user does not output all variables or specifies a different fgout.q_out_vars, as described at https://www.clawpack.org/dev/fgout.html#specifying-q-out-vars.

Class fgout_tools.FGoutFrame includes various properties for computing B from h and eta or any one of these from the other two, giving more flexibility.

I'm not sure if it's worth trying to incorporate these changes in the xarray backend, or if it would be easier to read in using fgout_tools and then convert to xarrays, or if it's fine to merge this for now and work on enhancements later?

@rjleveque
Member

Another thing you might want to add to your sample chile2010_fgmax-fgout/use_xarray_backends.py:

# combine all fgout files into a single netcdf file:
import glob
import xarray as xr
fgout_files = glob.glob('_output/fgout0001.b*')
fgouts = []
for filename in fgout_files:
    ds = xr.open_dataset(filename, engine=FGOutBackend,
                         backend_kwargs={'epsg':epsg_code})
    fgouts.append(ds)

fgout_all = xr.concat(fgouts, dim='time')
fgout_all.to_netcdf('fgout0001.nc')

Or maybe there's a cleaner way to do this. At any rate it could be useful for reducing all the fgout files to a single file.

@kbarnhart
Contributor Author

kbarnhart commented Aug 20, 2024

@rjleveque - Today I've finally gotten around to updating this for compatibility with your changes. I think I've now made the changes needed to make it work, and I have tested it with the geoclaw example and the radial slide example over on geoflows/dclaw#17. Both work.

My only thought is that it would be really nice if dclaw could have a default that would not require always setting

fgout.q_out_vars=[...]

However, I don't think it is a big deal to always need to set this for dclaw. I've discussed with Dave and we agree that it is fine as is.

It is going to be great to not need to write all elements of q.

I added an example of opening all the files per your suggestion. There is a slightly easier way to do this with xr.open_mfdataset(), which I used.

I think that this is ready to merge.

"B": "meters",
"huc": "meters squared per second",
"hvc": "meters squared per second",
}
Contributor Author


@rjleveque I did have a question about whether the units for huc and hvc were correct. I don't know how the corrections are applied (whether they are added or multiplied or something else), so I wasn't sure what to put.

Member


@kbarnhart: Interesting question, I think for the SGN equations the units are "meters per second squared" but for the MS equations "meters squared per second squared" because of the different scalings that get used when using these terms to update hu and hv. So I'm not sure what to put here, maybe "varies"? Normally the user would not use these terms for anything directly and would only output them for debugging purposes, perhaps.

Another related question: For dclaw you call the 7th element of q "delta_a" whereas in qmap I called it "bdif" because in digclaw_module.f90, i_bdif=7 is the index name used. We should be consistent, is delta_a a better name?

Contributor Author


Thanks for the units information. I've changed it to 'varies'.

'bdif' is better than 'delta_a'. This has been called different things at different times. At present it is also called 'b_eroded' internally and in the dclaw data.py docstrings (I'll change that for consistency).

I've changed delta_a to bdif as well. Good catch.

elif qmap == "geoclaw":
_qelements = _qelements_geoclaw
elif qmap == "geoclaw-bouss":
_qelements = _qelements_geoclaw_bouss
Member


If qmap isn't one of these three choices, then I think _qelements will be left uninitialized, leading to an error later? Should this have:

          else:
               _qelements = list(qmap.keys())

or even just set _qelements this way in general without needing _qelements_geoclaw etc and this if test at all?

Also, maybe qmap can be used more directly below without needing _qelements at all?

But I'm not sure what drop_variables is intended for, is this to further drop things that were already output in the fgout files, or is it to list variables that were not output to begin with (which is information already included in qmap if that is set properly, I think)?

Contributor Author


Looks like I had made _qelements vestigial and not removed it. Now removed.

drop_variables is a requirement of the BackendEntrypoint class. It would be used to further drop things from the fgout file (e.g., h and eta in the fgout file but only h in the xarray dataset).

https://docs.xarray.dev/en/stable/generated/xarray.backends.BackendEntrypoint.html#xarray.backends.BackendEntrypoint
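As a sketch of what honoring drop_variables amounts to (the dataset contents below are invented for illustration; the real backend returns the fgout variables):

```python
import numpy as np
import xarray as xr

# Invented stand-in for a dataset the backend might return.
ds = xr.Dataset(
    {
        "h": (("y", "x"), np.ones((2, 3))),
        "eta": (("y", "x"), np.zeros((2, 3))),
    }
)

# Inside open_dataset, honoring drop_variables boils down to
# discarding the requested variables after reading the file:
drop_variables = ["eta"]
ds_subset = ds.drop_vars(drop_variables)
print(list(ds_subset.data_vars))  # ['h']
```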

@rjleveque
Member

Thanks @kbarnhart!

Reading a single fgout frame still works fine, but the new code in use_xarray_backends.py:

fgout_files = glob.glob("_output/fgout0001.b*")
ds_all = xr.open_mfdataset(
    fgout_files,
    engine=FGOutBackend,
    backend_kwargs={"epsg": epsg_code, "qmap": "geoclaw"},
)

throws an error for me when trying to read the second file:

$ python use_xarray_backends.py
Reading fgout grid info from 
    _output/fgout_grids.data
Reading input for fgno=1, point_style = 2 
Found fgout grid q_out_vars =  [1, 2, 3, 4]
Using this mapping to fgout variable names: 
      qmap =  {'h': 1, 'hu': 2, 'hv': 3, 'eta': 4, 'B': 5}
    Reading  Frame 1 at t = 0  from outdir = /Users/rjl/git/clawpack/geoclaw/examples/tsunami/chile2010_fgmax-fgout/_output
Reading fgout grid info from 
    _output/fgout_grids.data
Reading input for fgno=1, point_style = 2 
Found fgout grid q_out_vars =  [1, 2, 3, 4]
Using this mapping to fgout variable names: 
      qmap =  {'h': 1, 'hu': 2, 'hv': 3, 'eta': 4, 'B': 5}
    Reading  Frame 10 at t = 8100  from outdir = /Users/rjl/git/clawpack/geoclaw/examples/tsunami/chile2010_fgmax-fgout/_output
Traceback (most recent call last):
  File "/Users/rjl/git/clawpack/geoclaw/examples/tsunami/chile2010_fgmax-fgout/use_xarray_backends.py", line 36, in <module>
    ds_all = xr.open_mfdataset(
  File "/opt/homebrew/lib/python3.10/site-packages/xarray/backends/api.py", line 1012, in open_mfdataset
    datasets = [open_(p, **open_kwargs) for p in paths]
  File "/opt/homebrew/lib/python3.10/site-packages/xarray/backends/api.py", line 1012, in <listcomp>
    datasets = [open_(p, **open_kwargs) for p in paths]
  File "/opt/homebrew/lib/python3.10/site-packages/xarray/backends/api.py", line 576, in open_dataset
    ds = _dataset_from_backend_dataset(
  File "/opt/homebrew/lib/python3.10/site-packages/xarray/backends/api.py", line 371, in _dataset_from_backend_dataset
    ds = _chunk_ds(
  File "/opt/homebrew/lib/python3.10/site-packages/xarray/backends/api.py", line 319, in _chunk_ds
    chunkmanager = guess_chunkmanager(chunked_array_type)
  File "/opt/homebrew/lib/python3.10/site-packages/xarray/core/parallelcompat.py", line 93, in guess_chunkmanager
    raise ValueError(
ValueError: unrecognized chunk manager dask - must be one of: []

Apparently dask is required for this approach to work. Should we be requiring this, or would it be better to revert to the slightly clumsier approach I used?

(now tested for both absence of h, and absence of h, eta, and B)
@kbarnhart
Contributor Author

@rjleveque Just updated to have options for both with and without Dask. I also tested and improved the handling of the dry tolerance (previously hardcoded to 0.001). Let me know if it works for you.

If there is a way to read in the user-specified dry tolerance, that is one final improvement I can think of.

@rjleveque
Member

@kbarnhart: use_xarray_backends.py runs if I replace

except ImportError:  # if dask is not available, use xr.concat.

with

except ValueError:  # if dask is not available, use xr.concat.
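The reason for this change: when dask is missing, xr.open_mfdataset fails with a ValueError ("unrecognized chunk manager dask"), not an ImportError, so the fallback must catch ValueError. The dask-free fallback itself is plain concatenation, shown here as a self-contained sketch with synthetic in-memory frames standing in for files opened with the FGOutBackend (names and shapes are illustrative only):

```python
import numpy as np
import xarray as xr

# Two synthetic "fgout frames" in place of files read via the backend.
frames = [
    xr.Dataset(
        {"h": (("time", "y", "x"), np.full((1, 2, 3), float(i)))},
        coords={"time": [t]},
    )
    for i, t in enumerate([0.0, 8100.0])
]

# The dask-free path: stitch the frames together along the time axis.
ds_all = xr.concat(frames, dim="time")
print(ds_all.sizes["time"])  # 2
```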

@rjleveque
Member

Is it clear you always want to mask out dry values in B and/or eta? The user may need to know B even at dry points for plotting.

You could have a default in

  def open_dataset(... , dry_tol=None, ...)

for example to mean no masking, or let the user specify the value to use at this point?

@kbarnhart
Contributor Author

@rjleveque I've added user-specified dry_tolerance (default = 0.001, but dry_tolerance=None means no masking at all).

The mask is never applied to B or eta; this seems like a good default to me, but if you think it would be better to do something different, I'm happy to change it.
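A minimal sketch of that behavior with a toy dataset (the helper name and variables here are invented for illustration, not the backend's actual code): flow variables are masked where h is at or below dry_tolerance, B and eta are left untouched, and dry_tolerance=None disables masking entirely.

```python
import numpy as np
import xarray as xr

def apply_dry_mask(ds, dry_tolerance=0.001):
    # Hypothetical helper: mask all variables except B and eta
    # wherever h <= dry_tolerance; None means no masking at all.
    if dry_tolerance is None:
        return ds
    masked = ds.copy()
    for name in ds.data_vars:
        if name not in ("B", "eta"):
            masked[name] = ds[name].where(ds["h"] > dry_tolerance)
    return masked

# Toy frame: first cell dry, second cell wet.
ds = xr.Dataset(
    {
        "h": (("y", "x"), np.array([[0.0, 1.0]])),
        "hu": (("y", "x"), np.array([[0.5, 0.5]])),
        "B": (("y", "x"), np.array([[10.0, -5.0]])),
    }
)
out = apply_dry_mask(ds)
# hu becomes NaN in the dry cell; B is untouched everywhere.
```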

Also, if it would be easier for a quick zoom to discuss, let me know. I'm pretty free for the rest of the day. Thanks for all these comments!

@kbarnhart
Contributor Author

@kbarnhart: use_xarray_backends.py runs if I replace

except ImportError:  # if dask is not available, use xr.concat.

with

except ValueError:  # if dask is not available, use xr.concat.

This should be fixed now; I just got rid of the specific exception type.

@rjleveque
Member

I think you removed ImportError from the wrong line.

Otherwise it looks fine and the dry_tolerance makes sense to me.

@kbarnhart
Contributor Author

🤦‍♀️ Thanks. Should be fixed now.

@rjleveque
Member

Works for me, I'm going to merge it.

Thanks!!

@rjleveque rjleveque merged commit 88cdbc7 into clawpack:master Aug 23, 2024
1 check failed