Add pygmt.gmtread to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray #3673
Conversation
```python
case "dataset":
    return lib.virtualfile_to_dataset(
        vfname=voutfile,
        column_names=column_names,
        header=header,
        dtype=dtype,
        index_col=index_col,
    )
case "grid" | "image":
    raster = lib.virtualfile_to_raster(vfname=voutfile, kind=kind)
```
Debating on whether we should have a low-level clib read that reads into a GMT virtualfile, and a high-level read that wraps around that to do both read + convert virtualfile to a pandas.DataFrame or xarray.DataArray.
We can have a low-level Session.read method, but as far as I can see, it would currently be used only in the high-level read function, so it seems unnecessary. We can refactor and add the low-level Session.read method when we need it in the future.
Yep, ok with just making a high-level read method for now.
Actually, we already have the Session.read_data method which is almost the same as the low-level read method you proposed.
Lines 1112 to 1123 in 99a6340
```python
def read_data(
    self,
    infile: str,
    kind: Literal["dataset", "grid", "image"],
    family: str | None = None,
    geometry: str | None = None,
    mode: str = "GMT_READ_NORMAL",
    region: Sequence[float] | None = None,
    data=None,
):
    """
    Read a data file into a GMT data container.
```
Fun fact: gmtread calls the gmt_copy function, which is a wrapper around the GMT_Read_Data function (https://github.com/GenericMappingTools/gmt/blob/9a8769f905c2b55cf62ed57cd0c21e40c00b3560/src/gmt_api.c#L1294).
```
load_dataarray
read
```
The load_dataarray function was put under the pygmt.io namespace. Should we consider putting read under pygmt.io too? (Thinking about whether we need a low-level pygmt.clib.read and high-level pygmt.io.read in my other comment).
Yes, that sounds good. I have two questions:

- Should we place the `read` source code in `pygmt/io.py`, or restructure `io.py` into a directory and put it in `pygmt/io/read.py` instead?
- Should we deprecate the `load_dataarray` function in favor of the new `read` function?
I'm expecting to have a write function that writes a pandas.DataFrame/xarray.DataArray into a tabular/netCDF file.
GMT.jl also wraps the read module (xref: https://www.generic-mapping-tools.org/GMTjl_doc/documentation/utilities/gmtread/). The differences are:

- It uses the name `gmtread`, which I think is better since `read` is a little too general.
- It returns custom data types like `GMTVector` and `GMTGrid`. [This doesn't work in PyGMT]
- It guesses the data kind based on the file extension. [Perhaps we can also do a similar guess?]
> Should we place the `read` source code in `pygmt/io.py`, or restructure `io.py` into a directory and put it in `pygmt/io/read.py` instead?
I think making the io directory sounds good, especially if you're planning on making a write function in the future.
> Should we deprecate the `load_dataarray` function in favor of the new `read` function?
No, let's keep load_dataarray for now. Something I'm contemplating is to make an xarray BackendEntrypoint that uses GMT read, so that users can then do pygmt.io.load_dataarray(..., engine="gmtread") or something like that. The load_dataarray function would use this new gmtread backend engine by default instead of netcdf4.
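For reference, a minimal sketch of what such an xarray backend could look like. Everything here is an assumption, not PyGMT code: the class name is invented, and the stub delegates to xarray's default engine where a real implementation would call into the GMT C API via pygmt.clib.

```python
import xarray as xr
from xarray.backends import BackendEntrypoint


class GMTReadBackendEntrypoint(BackendEntrypoint):
    """Illustrative stub of a backend that would delegate to GMT's read machinery."""

    description = "Open grids/images via GMT (sketch only, not the actual PyGMT API)"

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # A real implementation would read through pygmt.clib here; this stub
        # simply falls back to xarray's default engine.
        return xr.open_dataset(filename_or_obj, drop_variables=drop_variables)

    def guess_can_open(self, filename_or_obj):
        # Let xarray auto-select this engine for common GMT grid extensions.
        return str(filename_or_obj).endswith((".nc", ".grd"))
```

Registering such a class under the `xarray.backends` entry-point group in pyproject.toml is what would make an `engine="gmtread"` keyword available to `xr.open_dataarray` and friends.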
```python
@pytest.mark.skipif(not _HAS_NETCDF4, reason="netCDF4 is not installed.")
def test_io_gmtread_grid():
    """
    Test that reading a grid returns an xr.DataArray and the grid is the same as the
    one loaded via xarray.load_dataarray.
    """
    grid = gmtread("@static_earth_relief.nc", kind="grid")
    assert isinstance(grid, xr.DataArray)
    expected_grid = xr.load_dataarray(which("@static_earth_relief.nc", download="a"))
    assert np.allclose(grid, expected_grid)
```
We should also have a similar test for kind="image", comparing against rioxarray.open_rasterio?
Done in a6c4ee7.
When I tried to add a test for reading datasets, I realized that the DataFrame returned by load_sample_data is not ideal:
```
In [1]: from pygmt.datasets import load_sample_data

In [2]: data = load_sample_data("hotspots")

In [3]: data.dtypes
Out[3]:
longitude      float64
latitude       float64
symbol_size    float64
place_name      object
dtype: object
```
The last column place_name should be string dtype, rather than object. We also have similar issues for other sample datasets.
We have three options:

1. Do nothing and keep them unchanged
2. Fix and use appropriate dtypes
3. Use the new `gmtread` function instead of `pd.read_csv` in `_load_xxx` functions

I'm inclined to option 3.
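For concreteness, the dtype issue and an option-2 style fix can be reproduced with plain pandas. The one-row CSV below is made up for illustration, mimicking the hotspots table:

```python
from io import StringIO

import pandas as pd

# Made-up sample row mimicking the hotspots table.
csv = StringIO("longitude,latitude,symbol_size,place_name\n-155.28,19.41,0.5,Hawaii\n")

data = pd.read_csv(csv)
assert data["place_name"].dtype == object  # pd.read_csv's default for text columns

# Option 2: explicitly request pandas' dedicated string dtype.
fixed = data.astype({"place_name": "string"})
assert fixed["place_name"].dtype == "string"  # pandas StringDtype
```

Option 3 would get the same result without per-dataset dtype bookkeeping, since the conversion would happen inside `gmtread`.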
> 3. Use the new `gmtread` function instead of `pd.read_csv` in `_load_xxx` functions.
>
> I'm inclined to option 3.
Agree with this. We should also add dtype related checks for the tabular dataset tests in pygmt/tests/test_datasets_samples.py.
```
column_names
    A list of column names.
header
    Row number containing column names. ``header=None`` means not to parse the
    column names from table header. Ignored if the row number is larger than the
    number of headers in the table.
dtype
    Data type. Can be a single type for all columns or a dictionary mapping
    column names to types.
index_col
    Column to set as index.
```
Should we indicate in the docstring that these params are only used for kind="dataset"?
At line 31:

> For datasets, keyword arguments `column_names`, `header`, `dtype`, and `index_col` are supported.
```python
def gmtread(
    file: str | PurePath,
    kind: Literal["dataset", "grid", "image"],
    region: Sequence[float] | str | None = None,
    header: int | None = None,
    column_names: pd.Index | None = None,
    dtype: type | Mapping[Any, type] | None = None,
    index_col: str | int | None = None,
) -> pd.DataFrame | xr.DataArray:
```
On second thought, I'm wondering whether we should make gmtread a private function for internal use only for now; the fact that it can read either tabular or grid/image files seems like a lot of magic.
It seems the gmtread function is no longer needed if PR #3919 is implemented, right?
Yes, not needed for grids/images, but we could still use gmtread for tabular datasets? Though let's think about #3673 (comment).
```python
def gmtread(
    file: str | PurePath,
    kind: Literal["dataset", "grid", "image"],
```
Does GMT read also handle 'cube'?
Yes (xref: https://github.com/GenericMappingTools/gmt/blob/9a8769f905c2b55cf62ed57cd0c21e40c00b3560/src/gmtread.c#L75-L81), but need to wait for #3150, which may have upstream bugs.
Description of proposed changes
This PR adds the `pygmt.read` function to read any recognized data file (currently dataset, grid, or image) into a pandas.DataFrame/xarray.DataArray object. The new `read` function can replace most `load_dataarray`/`xr.open_dataarray`/`xr.load_dataarray` calls.

Related to #3643 (comment).
Preview: https://pygmt-dev--3673.org.readthedocs.build/en/3673/api/generated/pygmt.read.html
Reminders

- Run `make format` and `make check` to make sure the code follows the style guide.
- Add new public functions/methods/classes to `doc/api/index.rst`.

Slash Commands

You can write slash commands (`/command`) in the first line of a comment to perform specific operations. Supported slash command is:

- `/format`: automatically format and lint the code