
DAS-2232 - Functionality added to support SMAP L3 products #15

Open - wants to merge 38 commits into main

Conversation

@sudha-murthy (Collaborator) commented Sep 10, 2024

Description

SMAP L3 products do not have the 1-D dimension scales or the grid mapping variable that HOSS needs to do spatial subsetting.
Methods have been added to supply the missing grid mapping via overrides in hoss_config.json, along with supporting methods.
Methods have also been added that use the 2-D coordinate datasets to generate 1-D dimension scales, which are used to calculate
the index ranges that produce the spatially subsetted outputs.

Jira Issue ID

DAS-2232

Local Test Steps

  • Spatial subsetting can be tested without MaskFill (until DAS-2215 is complete).
  • Run Harmony-in-a-Box with HOSS.
  • Comment out mask_fill in the services.yaml file.
  • Run a spatial subset request locally and ensure you get the right subsetted output, e.g.:
  1. http://localhost:3000/C1268452378-EEDTEST/ogc-api-coverages/1.0.0/collections/parameter_vars/coverage/rangeset?forceAsync=true&granuleId=G1268452388-EEDTEST&subset=lat(70%3A80)&subset=lon(-160%3A-150)&variable=Soil_Moisture_Retrieval_Data_AM%2Fstatic_water_body_fraction&skipPreview=true
  2. http://localhost:3000/C1268452378-EEDTEST/ogc-api-coverages/1.0.0/collections/parameter_vars/coverage/rangeset?forceAsync=true&granuleId=G1268452388-EEDTEST&subset=lat(54%3A72)&subset=lon(2%3A42)&format=application%2Fx-netcdf4&variable=Soil_Moisture_Retrieval_Data_AM%2Falbedo%2CSoil_Moisture_Retrieval_Data_AM%2Fsurface_flag&skipPreview=true
  • 3D variables do not work (pending DAS-2238).
  • New unit tests have not been added; the current unit tests are passing.
  • DAS-2236 has been written to handle fill values in the corners.
  • Jupyter test notebooks exist for SMAP L3 and need to be updated.

PR Acceptance Checklist

  • Jira ticket acceptance criteria met.
  • [x] CHANGELOG.md updated to include high level summary of PR changes.
  • docker/service_version.txt updated if publishing a release.
  • Tests added/updated and passing.
  • Documentation updated (if needed).

@sudha-murthy sudha-murthy marked this pull request as draft September 12, 2024 17:40
@sudha-murthy sudha-murthy marked this pull request as ready for review September 13, 2024 05:09
Resolved review thread (outdated): hoss/dimension_utilities.py.
@flamingbear (Member) left a comment

Sudha, There's a lot here to review and unfortunately I haven't worked with HOSS directly before. I tried to look through what I could and make decent suggestions.

But for sure, you need to test your changes. When you run the tests a coverage report is generated. Before your changes the results were:

Test Coverage Estimates
Name                           Stmts   Miss  Cover
--------------------------------------------------
hoss/dimension_utilities.py      156      2    99%
hoss/spatial.py                   61      0   100%
--------------------------------------------------
TOTAL                            668      8    99%

And after your changes they dropped considerably:


Test Coverage Estimates
Name                           Stmts   Miss  Cover
--------------------------------------------------
hoss/dimension_utilities.py      242     65    73%
hoss/spatial.py                   70      8    89%
--------------------------------------------------
TOTAL                            771     81    89%

It is very difficult to understand the changes when I can't look at a test and see what a function was supposed to do. Likewise, the function comments were not updated to describe the new functionality. Hopefully I'm not confusing anything with my lack of familiarity. I will defer to Owen if there are differences of opinion on what should be done.

The final test instructions assume a strong understanding of the problem you were solving, but I was able to eventually run the two test URLs and get output. Should the output files be geolocated somehow? They open in Panoply, but aren't geo-2d, just 2d.

Resolved review threads (outdated): CHANGELOG.md; hoss/dimension_utilities.py (×4); hoss/projection_utilities.py; hoss/spatial.py (×4).
@owenlittlejohns (Member) left a comment

Thanks @sudha-murthy for putting a lot of effort into this PR!

This review is not 100% thorough. I think there are a few bigger things here that need to be addressed, and that will make it easier to digest in full:

  • I think the new functionality to deal with overriding dimensions should not just go in get_projected_x_y_ranges. Doing so adds complexity to that function, which makes things harder to understand overall. I also think it is forcing you to do some things that would otherwise be unnecessary - for example, I don't think you need to write the variables to the NetCDF-4 file; you just need to ask a function for 1-D dimension variables and spit out 1-D numpy arrays.
  • I'm really worried about the use of set objects to contain overriding dimensions. The ordering of dimensions is important, and must match the underlying array axes (see the sketch after this list). I suspect this is the reason you are having issues with 3-D variables (it's certainly part of the issue).
  • A lot of the additional code is written in ways that add high levels of cyclomatic complexity. I've suggested a few ways to cut that down, but it's worth taking a step back and working out what is common to code branches and what is truly different. Try to minimise what has to go in an if/elif/else, and then make sure each branch gives you the same output. The add_index_range function is perhaps the area of most concern.
  • There are a few places where the code branches and only assigns variables in one branch, which are then used after the branching is finished. This will cause errors when something skips that block and later tries to access a non-existent variable.
  • This PR has no new or updated unit tests, either for new functions or for new branches of code. I know I'm religious about these things, but they are absolutely needed, particularly on a PR of this scope. There are a bunch of changes, and they are complicated and hard to understand. Not only do tests make sure things work, reading them makes it easier to understand how the code flows.
  • There are a lot of places where code has changed significantly, but the documentation strings describing the functions have not. They should definitely be updated to keep in sync with what's going on.
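A minimal illustration of the set-ordering concern (illustration only, not project code):

# Dimension names in a list preserve array-axis order:
dimensions = ['/projected_y', '/projected_x']  # axis 0 is y, axis 1 is x
assert dimensions[0] == '/projected_y'

# Set iteration order is arbitrary (string hashing is randomised between
# interpreter runs), so axis 0 can get paired with the wrong name:
dimension_set = {'/projected_y', '/projected_x'}
print(list(dimension_set))  # may print either ordering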

Sorry for all the comments, but hopefully they are helpful!

Resolved review threads: CHANGELOG.md (outdated); docker/service_version.txt.
@@ -55,6 +59,42 @@ def is_index_subset(message: Message) -> bool:
)


def get_override_projected_dimensions(
Member:

This PR does not include any changes to unit tests. Each new function you are adding (or new branch of code within an existing conditional block) needs unit testing that hits every branch of the code within a function. I've not reviewed any of the code changes in detail yet, but this lack of tests is an absolute blocker for me to approve the PR.

Collaborator Author:

Yes, Owen. Will add the unit tests.

Resolved review threads (outdated): hoss/subset.py (×2); hoss/dimension_utilities.py (×2).
if variable.dimensions == []:
    override_dimensions = get_override_dimensions(varinfo, [variable_name])
    if len(override_dimensions) > 0:
        for override in reversed(list(override_dimensions)):
@owenlittlejohns (Member) commented Sep 20, 2024:

Why are you reversing things here? The ordering should be preserved. If the order of override dimensions does not reflect the ordering of array axes, then that needs to be fixed.

Collaborator Author:

I will revisit that. I had a problem with fixing the override dimensions list.

Collaborator Author:

Changes are in commit 756f7c0

@@ -422,22 +559,48 @@ def add_index_range(

"""
Member:

For this function: READ THIS COMMENT FIRST:

I think this is really overcomplicated. Fundamentally, all this function is doing is getting a list of dimensions for a variable. Then it is looking in a cache to see if there is an index range for any of the dimension names, before using that cache value to tack something on the end of a string.

Maybe I'm being naive, but it feels like really all you need is something to determine the correct list of variable dimensions; then all the rest of the logic (looking in the cache and appending strings to the end of the variable name) is the same. That latter stuff 100% does not need to be duplicated in multiple condition branches. It's making this function unnecessarily hard to read.

The other, really important comment: I am super suspicious of the bit where you are needing to reverse the order of the dimensions list. However that is derived, it should be reliably in the ordering of the variable axes. I'm wary that what this means is that you have found that, for your particular use-case, the "random" ordering in a set is predictable for the pseudo-dimensions you have for SMAP L3, and you can coincidentally impose the order you need by reversing the set. I really think dimensions should not be passed around in sets, because you need that ordering. I strongly suspect this is the root cause of your issues with 3-D variables.
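For illustration, a sketch of the simplified shape being described (the cache structure and get_override_dimensions returning an ordered list are assumptions, not the actual HOSS code):

def add_index_range(variable_name, varinfo, index_ranges: dict) -> str:
    """Append cached index ranges to a variable name, e.g.
    '/group/var' -> '/group/var[][31:53][20:42]'."""
    variable = varinfo.get_variable(variable_name)

    # The only branching needed: determine the ordered list of dimensions.
    if variable.dimensions:
        dimensions = variable.dimensions
    else:
        dimensions = get_override_dimensions(varinfo, variable_name)

    # Common logic, shared by both branches: look each dimension up in the
    # cache and append its range (or an empty range) to the variable name.
    range_strings = []
    for dimension in dimensions:
        if dimension in index_ranges:
            start, stop = index_ranges[dimension]
            range_strings.append(f'[{start}:{stop}]')
        else:
            range_strings.append('[]')

    return variable_name + ''.join(range_strings)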

Collaborator Author:

The ordering and the shape are things I need to get from varinfo. I don't have that information. Updated DAS-2241 for this.

Member:

Maybe rephrasing a different comment: I'm very uncomfortable with the thought of merging code with known problems into main. Going back to one of the things mentioned in the TRT breakout last week:

Teams ought to have a high-quality, maintainable baseline without known defects (particularly in new features) that is deployable to the production environment at all times.

Instead of making a massive PR that does 80% of a lot of stuff, this probably should be submitted piecemeal, with each piece being made 100% solid in a series of smaller PRs.

If you and David feel we're at a point that we can't break the changes down in that way, then a compromise might be to update this PR to merge into a feature branch. Then once all of the multiple PRs are merged into the feature branch, the feature branch can be merged into main.

Contributor:

I think terminology has confused some of this discussion. The question of reversing is not one of reversing dimensions, but of reversing the lat/lon variables coming from the coordinates attribute. The recommendation here is that the get_coordinates method itself should return them in a standard order (reversed, in this case), based upon the variable name, and not using reverse as shown here.

I don't think this is a case of "Known Issue".

We are planning to release a version for NSIDC to start testing with, which may not handle the 3D cases or other issues, but this release should not break any existing use cases; it should not truly break, but simply not handle those cases as desired. Incremental feature release is a tenet of agile development.

Collaborator Author:

Changes are in commit 756f7c0

I have simplified the method. The order for SMAP is 'projected_y', 'projected_x'. The override section of the code is only used by SMAP at the moment. It can be generalized if we can get that order of dimensions from varinfo. I am not sure if the order of dimensions is used for other products currently handled by HOSS.

Collaborator Author:

The use of sets for required_dimensions is based on how it is returned by varinfo and how it was originally in HOSS before my changes. The bounds update requires the dimensions to be a set; it fails for a list.

Member:

There's a difference between a set of all dimensions for all required variables (as used in the prefetch function), which aggregates all Variable.dimensions attributes (which individually are lists), and the dimensions on an individual variable. Variable.dimensions is not a set for ordering purposes, it is a list.

With regards to bounds - we know that the pseudo-dimensions won't have a bounds attribute, so you might be better off not trying to find bounds variables for them. Then you'll avoid any compatibility issues there.

Collaborator Author:

The add_index_range function is a lot simpler now.

Resolved review threads: hoss/dimension_utilities.py; hoss/dimension_utilities.py (outdated).
@owenlittlejohns (Member):
A couple of quick thoughts:

> autydp (yesterday): The review is a bit disjointed now with all the comments.

I agree that this PR is busy with a lot of comments. I'd recommend that @sudha-murthy picks a few of the comments in the PR to address at a time, pushes a commit up with them, and indicates in the comments that she has addressed them (with a link to the commit). Then the person making the comment can decide whether to mark it as resolved. Resolving the comments will make the PR look a lot less cluttered, and allow us all to focus on the remaining pieces.

> flamingbear (4 hours ago): But since it looks like you've thumbed up a bunch of things, you can add a comment with the git hash as a way to alert the commenter that you have addressed the issue, when you get there.

Yup - agreed.

Resolved review threads (outdated): CHANGELOG.md (×4).
@owenlittlejohns (Member) left a comment

This is another incomplete review, but there are some things raised that I think are important to address:

  • Coping with multiple grids.
  • I genuinely think the method to get the 1-D grid dimension variables is far too complicated. (You could cut out probably 100 lines of code by just using pyproj)

Resolved review threads: hoss/dimension_utilities.py (×2).
if not contains_latitude:
    raise MissingCoordinateDataset('latitude')
if not contains_longitude:
    raise MissingCoordinateDataset('longitude')
Member:

This function feels pretty convoluted now. I think it might be easier to do:

coordinate_variables_set = varinfo.get_references_for_attribute(
    requested_variables, 'coordinates'
)

latitude_coordinate_variables = [
    coordinate
    for coordinate in coordinate_variables_set
    if varinfo.get_variable(coordinate).is_latitude()
]

longitude_coordinate_variables = [
    coordinate
    for coordinate in coordinate_variables_set
    if varinfo.get_variable(coordinate).is_longitude()
]

if not latitude_coordinate_variables:
    raise MissingCoordinateDataset('latitude')

if not longitude_coordinate_variables:
    raise MissingCoordinateDataset('longitude')

return latitude_coordinate_variables, longitude_coordinate_variables

This reduces the number of local variables in the function, reduces the cyclomatic complexity and uses the more Python-native model of list-comprehensions. (Minor downside is two list comprehensions looping through the coordinates, but they should be a handful of elements, so I don't think we need to worry about performance)

Member:

One other thought, though: what if the metadata attributes for the collection point to non-existent variables? That will raise an exception that you currently aren't capturing.

Member:

Another thought (probably more important) - what about multiple grids? SPL3FP has global and polar grids, so has 2 latitude and 2 longitude coordinate variables. Your current implementation would get you something that is a 4-element list, and that ordering would depend entirely on the random ordering of the output of VarInfoFromDmr.get_references_for_attribute. That seems troubling. A big question here is how to make sure that the right latitudes and longitudes are being paired together.

When you write the unit tests for a lot of these functions, I definitely want to see what happens for a use-case when variables from different grids are being requested. We're going to need those tests to make sure things don't break.

Collaborator Author:

Good point. Will add the multiple-grid test cases.

Collaborator Author:

> This function feels pretty convoluted now. I think it might be easier to do: [quoting the list-comprehension suggestion above]

Updated the method - 802fe0e

Collaborator Author:

#15 (comment)

  • not sure what that is referring to in the earlier comment

Collaborator Author:

> One other thought, though, what if the metadata attributes for the collection point to non-existent variables? That will raise an exception that you currently aren't capturing.

If the coordinate variables don't exist, we will not do the override sections of the code.

@owenlittlejohns (Member) commented Oct 12, 2024:

Thanks for the update of the function.

> If the coordinate variables don't exist, we will not do the override sections of the code.

I'm not sure that's quite true. The first time this function is called in the flow of the code is within hoss.dimension_utilities::get_prefetch_variables. The function gets called if there are no required dimensions. Then within this function you are looking at the coordinates metadata attribute on each required variable, and trying to retrieve information on each listed coordinate. If the metadata attribute is incorrect and points to non-existent variables, then at that point the code is asking VarInfoFromDmr to do something it can't (varinfo.get_variable(coordinate).is_longitude() or varinfo.get_variable(coordinate).is_latitude()), and it will raise an exception. Specifically, it will say that None does not have a method is_latitude or is_longitude, because varinfo.get_variable('something that doesn't exist') will return None.

I think this failure case needs to be handled, most likely by catching the exception and re-raising a more user-friendly one.
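A minimal sketch of that guard, with a hypothetical MissingVariable exception standing in for whatever user-friendly error HOSS would raise:

class MissingVariable(Exception):
    """Hypothetical user-facing error for a coordinates reference that
    points at a variable absent from the granule."""

    def __init__(self, variable_name: str):
        super().__init__(
            f'Coordinate variable "{variable_name}" is referenced in '
            'metadata but does not exist in the granule.'
        )


def get_coordinate(varinfo, coordinate_name: str):
    """Return the VarInfo variable, raising a readable error if absent."""
    coordinate = varinfo.get_variable(coordinate_name)
    if coordinate is None:
        raise MissingVariable(coordinate_name)
    return coordinate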

return is_dimension_ascending(lat_col)


def get_lat_lon_arrays(
Member:

You're right, sorry, the return value is a list of coordinate names, but the idea is still applicable. (The main looping logic is essentially duplicated between the two functions, see L168-L174 versus L278-L283: those are doing the same loops and the same checks)

That said, I think there's a bigger concern here for collections with multiple grids (and therefore multiple groups of variables in the coordinates references), such as SPL3FTP. In the function below the logic is going to go through all coordinates and check each one to see if it is a latitude or a longitude. You're getting the list of coordinates (via a few other functions) from get_coordinate_variables (originally called here), which would have multiple latitude/longitude references for SPL3FTP (if a user requests both polar and global variables). If you are trying to handle both the global and polar grid, you're going to need to sometimes get the global coordinates and sometimes the polar, but this function will always just get whichever are listed last (because the loop will get to them last and assign them to lat_arr and lon_arr, even if something is already assigned to those variables).

Resolved review thread (outdated): hoss/dimension_utilities.py.
prefetch_dataset, coordinates, varinfo
)

geo_grid_corners = get_geo_grid_corners(prefetch_dataset, coordinates, varinfo)
Member:

This still feels like an unnecessarily complicated methodology. You could just project the entire latitude and longitude arrays using pyproj and take one row and one column from the projected output. Is it to reduce memory usage? SPL3FTP has arrays of (2, 406, 964) and (2, 500, 500) - arrays of those sizes will not be a problem for memory usage.

The method we have here adds a bunch of functions and complexity that would otherwise not be needed.
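A minimal sketch of that simpler approach (illustration only, not the HOSS implementation), assuming 2-D coordinate arrays in (y, x) order with no fill values in the sampled row and column; for the SMAP L3 global grid the target CRS would be EASE-Grid 2.0 (EPSG:6933):

import numpy as np
from pyproj import CRS, Transformer


def get_projected_dimension_scales(
    lat_arr: np.ndarray, lon_arr: np.ndarray, grid_crs: CRS
) -> tuple[np.ndarray, np.ndarray]:
    """Project the full 2-D latitude and longitude arrays, then keep one
    column of y values and one row of x values as 1-D dimension scales."""
    transformer = Transformer.from_crs(CRS('EPSG:4326'), grid_crs, always_xy=True)
    x_values, y_values = transformer.transform(lon_arr, lat_arr)
    return y_values[:, 0], x_values[0, :]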

@sudha-murthy (Collaborator, Author) commented Oct 3, 2024:

will look into it

Resolved review thread (outdated): hoss/spatial.py.
hoss/spatial.py (outdated)
Comment on lines 221 to 224
"""This function returns a dictionary containing the minimum and maximum
index ranges for a pair of lat/lon coordinates, e.g.:

index_ranges = {'/x': (20, 42), '/y': (31, 53)}
Member:

The initial bit of this description (the bit people are most likely to read) doesn't quite have enough information. I think the disconnect is due to referring to latitude and longitude and then the example having x and y.

Member:

Also - the names don't actually match what your function produces (currently "projected_x" and "projected_y" for all grids)

Collaborator Author:

Updated it - 802fe0e

But it may need revision if I pursue new names to support multiple grids.

Member:

Okay. Thanks for the update. I'll close this comment thread, and we can continue to discuss the multiple-grid things in this thread. (That's still an issue that will have to be resolved in this PR)

Comment on lines +240 to +241
projected_x = 'projected_x'
projected_y = 'projected_y'
Member:

For granules with multiple grids, these names will clash for both grids.

Member:

I think you likely want some nice naming function (likely deriving the names from the coordinate variables used), and then that function can be used in add_index_range to determine which entries in the cache to use.
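For illustration, a hedged sketch of such a naming function; deriving the pseudo-dimension names from the group of the latitude coordinate is an assumption, not the project's convention:

def get_projected_dimension_names(latitude_name: str) -> tuple[str, str]:
    """Derive grid-specific pseudo-dimension names from a coordinate
    variable path, e.g. '/Soil_Moisture_Retrieval_Data_AM/latitude'
    -> ('/Soil_Moisture_Retrieval_Data_AM/projected_y',
        '/Soil_Moisture_Retrieval_Data_AM/projected_x')."""
    group_path = latitude_name.rsplit('/', 1)[0]
    return f'{group_path}/projected_y', f'{group_path}/projected_x'

add_index_range could then look up these grid-specific keys in the index-range cache, so the global and polar grids never clash.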

Member:

This is still a significant issue. If a user requests variables from both the SMAP L3 global and polar grids, this function will try to assign the x and y dimension index ranges for both to keys called projected_x and projected_y in the cache of index ranges.

Resolved review threads (outdated): hoss/spatial.py (×2).
Comment on lines +78 to +82
) -> bool:
"""
returns the list of required variables without any
dimensions
"""
Member:

Two quick things here:

  1. The return type in the type hint is wrong: it should be set[str] not bool.
  2. It would probably be clearer in the documentation string to say "set of required variables" not "list of required variables". (I'm guessing you meant list in a generic sense, rather than the Python type, but it's a little ambiguous)



def get_variables_with_anonymous_dims(
    varinfo: VarInfoFromDmr, required_variables: set[str]
Member:

Nitpick: It probably isn't necessary in this function to refer to the variables as required_variables. That is a piece of information outside of the scope of this function, maybe a better choice of name would be variable_names.

Resolved review thread (outdated): hoss/spatial.py.
@owenlittlejohns (Member) left a comment

Thanks for adding some unit tests - I'll keep an eye out for more in future commits.

I resolved older comments I had left, which now look to be taken care of (thanks for that). There are still outstanding older items, and I've added some more comments on things that I looked at in more detail this time around. (To be honest, given the huge scale of this PR, it's hard to review it all in a single go, and so there are still bits I'm spotting only now, despite trying to go through the PR a couple of times)

Comment on lines +193 to +200
if lat_arr.ndim > 1:
    col_size = lat_arr.shape[0]
    row_size = lat_arr.shape[1]
if (lon_arr.shape[0] != lat_arr.shape[0]) or (lon_arr.shape[1] != lat_arr.shape[1]):
    raise IrregularCoordinateDatasets(lon_arr.shape, lat_arr.shape)
if lat_arr.ndim and lon_arr.ndim == 1:
    col_size = lat_arr.size
    row_size = lon_arr.size
Member:

This is a bit wonky:

The middle check that the array sizes are equal explicitly checks the size of the 0th and 1st axes of the array. But after that you have a condition for whether the arrays only have one dimension. This means one of two things:

  1. Either the last branch (with lat_arr.ndim and lon_arr.ndim both 1) will never be reached (because the arrays are always 2-D or higher), or
  2. The coordinate arrays could be 1-D, and then the check for lon_arr.shape[1] != lat_arr.shape[1] will raise an exception, because the shape tuple doesn't have enough elements.
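A sketch of a check covering both failure modes (same variable names as the snippet above; reusing IrregularCoordinateDatasets for mismatched dimensionality is an assumption about the intended behaviour):

if lat_arr.ndim == 1 and lon_arr.ndim == 1:
    col_size = lat_arr.size
    row_size = lon_arr.size
elif lat_arr.ndim == 2 and lon_arr.ndim == 2:
    if lat_arr.shape != lon_arr.shape:
        raise IrregularCoordinateDatasets(lon_arr.shape, lat_arr.shape)
    col_size, row_size = lat_arr.shape
else:
    raise IrregularCoordinateDatasets(lon_arr.shape, lat_arr.shape)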

    row_size = lat_arr.shape[1]
if (lon_arr.shape[0] != lat_arr.shape[0]) or (lon_arr.shape[1] != lat_arr.shape[1]):
    raise IrregularCoordinateDatasets(lon_arr.shape, lat_arr.shape)
if lat_arr.ndim and lon_arr.ndim == 1:
Member:

This condition isn't doing what I believe you think it is.

My guess is you are trying to check if both lat_arr and lon_arr have one dimension. What's really happening is that you are checking if lat_arr.ndim has a "truthy" value (i.e. any non-zero integer), and then separately checking if lon_arr.ndim == 1. If it helps, the checks are really:

if (lat_arr.ndim) and (lon_arr.ndim == 1):

I think what you are trying to do is:

if lat_arr.ndim == 1 and lon_arr.ndim == 1:


"""
override_variable = varinfo.get_variable(variable_name)
projected_dimension_name = ''
Member:

It would fit much better with the rest of the style of the code if you used else for these sorts of default values, instead of declaring them and then overriding them if a condition is met.

Yes, it's a personal preference, but it's consistent with the rest of the code (which is the more important thing here).

Comment on lines +191 to +192
row_size = 0
col_size = 0
Member:

This is one example but there are quite a few places in this code that I think default values are being used instead of raising a genuine exception. If none of the conditions below are met, then it would be much better to catch the issue now and raise a user-friendly exception before later code tries to use these values and finds it can't do what it needs to (and raises some unintelligible exception to the end user).

I think the question to ask with a bunch of these functions is: if they fall back on the default values, can the rest of the code work using those return values. I think in this case (and a few others) the honest answer is no.

Comment on lines +215 to +216
lat_col = lat_arr[:, 0]
lon_row = lon_arr[0, :]
Member:

This is making an assumption that latitude and longitude always represent the same axes in an array for all collections. We know that isn't true: for example, the GPM_3IMERGHH collection from GES DISC has things the wrong way around (time, lon, lat).

It would be better to take both arrays and check one row and one column of each to find which array is varying along each dimension.
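A minimal sketch of that kind of check, assuming a regular 2-D grid with no fill values in the sampled column (the function name is illustrative):

import numpy as np


def get_latitude_axis(lat_arr: np.ndarray) -> int:
    """Return the array axis along which latitude varies: 0 if latitude
    changes down a column (ordering is (y, x)), otherwise 1."""
    if not np.allclose(lat_arr[:, 0], lat_arr[0, 0]):
        return 0
    return 1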

Comment on lines +189 to +193
def test_get_x_y_index_ranges_from_coordinates(
    self,
    mock_get_x_y_extents,
    mock_get_dimension_index_range,
):
Member:

This is a good test!

You definitely need either one more that requests SMAP L3 data for multiple grids, or this test could be updated to do so using a collection that has multiple grids.

Comment on lines +135 to +137
# lat_arr = prefetch_dataset[self.latitude][:]
# lon_arr = prefetch_dataset[self.longitude][:]

Member:

These commented out lines should be removed.

lat_fill,
lon_fill,
)
for actual, expected in zip(actual_geo_corners, expected_geo_corners):
Member:

A couple of things here:

  1. Why only self.assertAlmostEqual? You are defining the expected values, so you could define them to be exactly what you need.
  2. (Possibly a personal preference) zip makes things a little more complicated. Instead you could do something like:
for index, expected_corner in enumerate(expected_geo_corners):
    self.assertTupleEqual(actual_geo_corners[index], expected_corner)

If you stick with zip, could you maybe tweak the variables in the loop so they are actual_corner and expected_corner. Just to make it easier when reading this test back. Thanks!


"""

expected_result1 = np.array([0, 1, 2, 3, 4])
Member:

Something of a theme here - there are a bunch of variable names in these tests that are really vague (expected_result1, expected, expected_result, similar things with actual). It would be clearer to the reader if the names were more specific. Something like expected_valid_indices.

Unit tests are a secondary form of documentation for developers, and can be really informative. It's really helpful (to me at least 😅) if they are as easy to read as possible.

Member:

Thanks for starting to add in the tests! A recommendation for next time would be to couple the writing of a function with the accompanying tests, instead of leaving all the tests to the end. (I tend to write tests for functions just after writing the code for the function itself, because that way it's still fresh in my head, but also when I move on from that function, I feel confident that it works when I call it elsewhere)
