Commit 1f913b4

Authored by: jamesfwood, l2ss-py bot, dkauf42, podaac-cloud-dsa, frankinspace
Release/2.2.0 (#124)
* /version 2.2.0-alpha.0
* Feature/issue 85 (#112)
* Add initial poetry setup guidance to the README
* Update CHANGELOG "unreleased" with issue 85
* Adding collections UAT: C1242387621-POCLOUD
* Adding collections UAT: C1238658389-POCLOUD
* Feature/issue 115 (#116)
* Make note in README of `-E harmony` install option for tests
* Update CHANGELOG.md
* Co-authored-by: dkaufma3 <[email protected]>
* Adding collections OPS: C2152045877-POCLOUD
* Feature/issue-110 (#117)
* Add extra line of logic to catch timedelta time cases
* Updated Changelog
* Add logic to handle time attributes for he5 time-converted files
* Linted code
* Feature/issue 119 (#120)
* Add extra line of logic to catch timedelta time cases
* Updated Changelog
* Add logic to handle time attributes for he5 time-converted files
* Linted code
* Added extra logic to compute time vars for cases without any variables
* Change back chunking
* Update Changelog
* Feature/issue 122 (#123)
* Add extra line of logic to catch timedelta time cases
* Updated Changelog
* Add logic to handle time attributes for he5 time-converted files
* Linted code
* Fix for ncdataset rename deprecation - test writing to follow
* Remove unnecessary comments, changing to issue 119
* Added test and linted for duplicate dimension name change
* Updated Changelog.md
* Release 2.2.0
* /version 2.2.0-rc.1
* Updated to rc.2
* /version 2.2.0-rc.3

Co-authored-by: l2ss-py bot <[email protected]>
Co-authored-by: Daniel Kaufman <[email protected]>
Co-authored-by: podaac-cloud-dsa <[email protected]>
Co-authored-by: Frank Greguska <[email protected]>
Co-authored-by: dkaufma3 <[email protected]>
Co-authored-by: James Wood <[email protected]>
Co-authored-by: Nick Lenssen <[email protected]>
1 parent 53b28cc commit 1f913b4

9 files changed (+166, -28 lines)

CHANGELOG.md

Lines changed: 21 additions & 1 deletion
@@ -5,6 +5,25 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ## [Unreleased]
+### Added
+### Changed
+### Deprecated
+### Removed
+### Fixed
+### Security
+
+## [2.2.0]
+### Added
+### Changed
+- [issue/115](https://github.com/podaac/l2ss-py/issues/115): Added notes to README about installing "extra" harmony dependencies to avoid test suite failures.
+- [issue/85](https://github.com/podaac/l2ss-py/issues/85): Added initial poetry setup guidance to the README
+- [issue/122](https://github.com/podaac/l2ss-py/issues/122): Changed renaming of duplicate dimensions from netCDF4 to xarray, due to issues in the netCDF rename function. https://github.com/Unidata/netcdf-c/issues/1672
+### Deprecated
+### Removed
+### Fixed
+- [issue/119](https://github.com/podaac/l2ss-py/issues/119): Added an extra check for variables without any dimensions after a squeeze in compute_time_variable_name()
+- [issue/110](https://github.com/podaac/l2ss-py/issues/110): Get the start date in convert_to_datetime and reconvert times to their original type when recombining grouped datasets
+### Security
 
 ## [2.1.1]
 ### Changed
@@ -32,8 +51,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Breaking Change** [issue/99](https://github.com/podaac/l2ss-py/issues/99): Removed support for python 3.7
 ### Fixed
 - [issue/95](https://github.com/podaac/l2ss-py/issues/95): Fix non-variable subsets for OMI, since variables are not in the same group as the lat/lon variables
-
 ### Security
+
+
 ## [1.5.0]
 ### Added
 - Added Shapefile option to UMM-S entry
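
Background on the issue/119 entry above: squeezing an xarray variable whose dimensions are all size 1 yields a scalar with an empty `.dims` tuple, so indexing `dims[0]` raises an IndexError. A minimal sketch of the failure mode and the added guard (illustrative only, not repo code; the variable name is made up):

```python
import numpy as np
import xarray as xr

# A variable whose only dimension has length 1 squeezes down to a scalar.
scalar_var = xr.DataArray(np.zeros((1,)), dims=['delta_time'])
assert scalar_var.squeeze().dims == ()  # empty tuple; dims[0] would raise IndexError

# The guard added in compute_time_variable_name() skips such variables
# instead of indexing dims[0]:
if len(scalar_var.squeeze().dims) == 0:
    pass  # continue to the next candidate variable
```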

README.md

Lines changed: 17 additions & 0 deletions
@@ -16,6 +16,18 @@ Harmony service for subsetting L2 data. l2ss-py supports:
 
 If you would like to contribute to l2ss-py, refer to the [contribution document](CONTRIBUTING.md).
 
+## Initial setup, with poetry
+
+1. Follow the instructions for installing `poetry` [here](https://python-poetry.org/docs/).
+2. Install l2ss-py, with its dependencies, by running the following from the repository directory:
+
+```
+poetry install
+```
+
+***Note:*** l2ss-py can be installed as above and run without any dependency on `harmony`.
+However, to additionally test the harmony adapter layer,
+extra dependencies can be installed with `poetry install -E harmony`.
 
 ## How to test l2ss-py locally
 
@@ -33,6 +45,11 @@ You can generate coverage reports as follows:
 poetry run pytest --junitxml=build/reports/pytest.xml --cov=podaac/ --cov-report=html -m "not aws and not integration" tests/
 ```
 
+***Note:*** The majority of the tests exercise core functionality of l2ss-py without ever interacting with the harmony python modules.
+The `test_subset_harmony` tests, however, are explicitly for testing the harmony adapter layer
+and do require the harmony optional dependencies to be installed,
+as described above with the `-E harmony` argument.
+
 ### l2ss-py script
 
 You can run l2ss-py on a single granule without using Harmony. In order

cmr/ops_associations.txt

Lines changed: 1 addition & 0 deletions
@@ -52,3 +52,4 @@ C2251465126-POCLOUD
 C2254232941-POCLOUD
 C2251464384-POCLOUD
 C2247621105-POCLOUD
+C2152045877-POCLOUD

cmr/uat_associations.txt

Lines changed: 1 addition & 0 deletions
@@ -38,3 +38,4 @@ C1238621088-POCLOUD
 C1240739713-POCLOUD
 C1244459498-POCLOUD
 C1242387621-POCLOUD
+C1238658389-POCLOUD

podaac/subsetter/dimension_cleanup.py

Lines changed: 17 additions & 4 deletions
@@ -24,17 +24,18 @@ def remove_duplicate_dims(nc_dataset):
     is changed to the original name.
     """
     dup_vars = {}
+    dup_new_varnames = []
     for var_name, var in nc_dataset.variables.items():
         dim_list = list(var.dimensions)
         if len(set(dim_list)) != len(dim_list):  # get true if var.dimensions has a duplicate
             dup_vars[var_name] = var  # populate dictionary with variables that have dup dims
     for dup_var_name, dup_var in dup_vars.items():
         dim_list = list(dup_var.dimensions)  # list of original dimensions of variable with dup dims
-        # get the dimensions that is duplicated
+        # get the dimensions that are duplicated
         dim_dup = [item for item, count in collections.Counter(dim_list).items() if count > 1][0]
         dim_dup_new = dim_dup+'_1'
-
         var_name_new = dup_var_name+'_1'
+        dup_new_varnames.append(var_name_new)
 
         # create new dimension by copying from the duplicated dimension
 
@@ -61,6 +62,18 @@ def remove_duplicate_dims(nc_dataset):
             data[var_name_new].setncattr(attrname, nc_dataset.variables[dup_var_name].getncattr(attrname))
         data[var_name_new][:] = nc_dataset.variables[dup_var_name][:]
         del nc_dataset.variables[dup_var_name]
-        nc_dataset.renameVariable(var_name_new, dup_var_name)
 
-    return nc_dataset
+    # Return the variables that will need to be renamed: the rename method is still an issue per https://github.com/Unidata/netcdf-c/issues/1672
+    return nc_dataset, dup_new_varnames
+
+
+def rename_dup_vars(dataset, rename_vars):
+    """
+    The netCDF4 rename function raises an HDF error for variables in S5P files with duplicate dimensions.
+    This method uses xarray to rename those variables instead.
+    """
+    for i in rename_vars:
+        original_name = i[:-2]
+        dataset = dataset.rename({i: original_name})
+
+    return dataset
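
For context, `netCDF4.Dataset.renameVariable` can raise an HDF error when applied to variables that had duplicate dimensions (the Unidata/netcdf-c#1672 issue linked above), which is why the rename is deferred and done through xarray. A minimal, self-contained sketch of the pattern; the file, dimension, and variable names here are stand-ins, not real l2ss-py names:

```python
import netCDF4 as nc
import xarray as xr

# Build a small in-memory netCDF file standing in for the state after
# remove_duplicate_dims(): the variable has been copied to a temporary
# name ending in '_1'.
nc_dataset = nc.Dataset('inmemory.nc', 'w', diskless=True)
nc_dataset.createDimension('corner', 4)
nc_dataset.createVariable('var_1', 'f4', ('corner',))
rename_vars = ['var_1']

# Re-open through xarray and rename the temporaries back to their original
# names, as rename_dup_vars() does; xarray's rename avoids the netCDF4/HDF
# rename bug.
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc_dataset))
for tmp_name in rename_vars:
    dataset = dataset.rename({tmp_name: tmp_name[:-2]})  # strip the '_1' suffix
print(list(dataset.variables))  # ['var']
```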

podaac/subsetter/subset.py

Lines changed: 27 additions & 16 deletions
@@ -522,6 +522,8 @@ def compute_time_variable_name(dataset, lat_var):
         if "time" in var_name and dataset[var_name].squeeze().dims == lat_var.squeeze().dims:
             return var_name
     for var_name in list(dataset.data_vars.keys()):
+        if len(dataset[var_name].squeeze().dims) == 0:
+            continue
         if 'time' in var_name.lower() and dataset[var_name].squeeze().dims[0] in lat_var.squeeze().dims:
             return var_name
 
@@ -946,7 +948,7 @@ def walk(group_node, path):
     return nc_dataset
 
 
-def recombine_grouped_datasets(datasets, output_file):  # pylint: disable=too-many-branches
+def recombine_grouped_datasets(datasets, output_file, start_date):  # pylint: disable=too-many-branches
     """
     Given a list of xarray datasets, combine those datasets into a
     single netCDF4 Dataset and write to the disk. Each dataset has been
@@ -978,7 +980,7 @@ def recombine_grouped_datasets(datasets, output_file):  # pylint: disable=too-ma
             dim_group.createDimension(new_dim_name, dataset.dims[dim_name])
 
         # Rename variables
-        _rename_variables(dataset, base_dataset)
+        _rename_variables(dataset, base_dataset, start_date)
 
         # Remove group vars from base dataset
         for var_name in list(base_dataset.variables.keys()):
@@ -1003,7 +1005,7 @@ def _get_nested_group(dataset, group_path):
     return nested_group
 
 
-def _rename_variables(dataset, base_dataset):
+def _rename_variables(dataset, base_dataset, start_date):
     for var_name in list(dataset.variables.keys()):
         new_var_name = var_name.split(GROUP_DELIM)[-1]
         var_group = _get_nested_group(base_dataset, var_name)
@@ -1014,10 +1016,13 @@ def _rename_variables(dataset, base_dataset):
         ) or np.issubdtype(
             dataset.variables[var_name].dtype, np.dtype(np.timedelta64)
         ):
-
-            cf_dt_coder = xr.coding.times.CFDatetimeCoder()
-            encoded_var = cf_dt_coder.encode(dataset.variables[var_name])
-            variable = encoded_var
+            if start_date:
+                dataset.variables[var_name].values = (dataset.variables[var_name].values - np.datetime64(start_date)) / np.timedelta64(1, 's')
+                variable = dataset.variables[var_name]
+            else:
+                cf_dt_coder = xr.coding.times.CFDatetimeCoder()
+                encoded_var = cf_dt_coder.encode(dataset.variables[var_name])
+                variable = encoded_var
 
         var_attrs = variable.attrs
         fill_value = var_attrs.get('_FillValue')
@@ -1134,15 +1139,19 @@ def convert_to_datetime(dataset, time_vars):
             # adjust the time values from the start date
             if start_date:
                 dataset[var].values = [start_date + datetime.timedelta(seconds=i) for i in dataset[var].values]
-            # copy the values from the utc time variable to the time variable
-            else:
-                utc_var_name = compute_utc_name(dataset)
-                if utc_var_name:
-                    dataset[var].values = [datetime.datetime(i[0], i[1], i[2], hour=i[3], minute=i[4], second=i[5]) for i in dataset[utc_var_name].values]
+                return dataset, start_date
+
+            utc_var_name = compute_utc_name(dataset)
+            if utc_var_name:
+                start_seconds = dataset[var].values[0]
+                dataset[var].values = [datetime.datetime(i[0], i[1], i[2], hour=i[3], minute=i[4], second=i[5]) for i in dataset[utc_var_name].values]
+                start_date = dataset[var].values[0] - np.timedelta64(int(start_seconds), 's')
+                return dataset, start_date
+
         else:
             pass
 
-    return dataset
+    return dataset, start_date
 
 
 def subset(file_to_subset, bbox, output_file, variables=None,
@@ -1210,7 +1219,7 @@ def subset(file_to_subset, bbox, output_file, variables=None,
     if has_groups:
         nc_dataset = transform_grouped_dataset(nc_dataset, file_to_subset)
 
-    nc_dataset = dc.remove_duplicate_dims(nc_dataset)
+    nc_dataset, rename_vars = dc.remove_duplicate_dims(nc_dataset)
 
     if variables:
         variables = [x.replace('/', GROUP_DELIM) for x in variables]
@@ -1227,14 +1236,16 @@ def subset(file_to_subset, bbox, output_file, variables=None,
         xr.backends.NetCDF4DataStore(nc_dataset),
         **args
     ) as dataset:
+        dataset = dc.rename_dup_vars(dataset, rename_vars)
         lat_var_names, lon_var_names, time_var_names = get_coordinate_variable_names(
            dataset=dataset,
            lat_var_names=lat_var_names,
            lon_var_names=lon_var_names,
            time_var_names=time_var_names
        )
+        start_date = None
        if min_time or max_time:
-            dataset = convert_to_datetime(dataset, time_var_names)
+            dataset, start_date = convert_to_datetime(dataset, time_var_names)
        chunks = calculate_chunks(dataset)
        if chunks:
            dataset = dataset.chunk(chunks)
@@ -1306,7 +1317,7 @@ def subset(file_to_subset, bbox, output_file, variables=None,
         dataset.load().to_netcdf(output_file, 'w', encoding=encoding)
 
     if has_groups:
-        recombine_grouped_datasets(datasets, output_file)
+        recombine_grouped_datasets(datasets, output_file, start_date)
     # Check if the spatial bounds are all 'None'. This means the
     # subset result is empty.
     if any(bound is None for bound in spatial_bounds):
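
A note on the start_date round-trip introduced above: convert_to_datetime() now returns the start date it used to decode the time values, and _rename_variables() uses it to re-encode datetimes back into the file's original seconds-since-start representation when groups are recombined. A minimal, self-contained sketch of that decode/re-encode step (the dates and values are made up for illustration; this is not repo code):

```python
import datetime

import numpy as np

# Decode, as in convert_to_datetime(): float seconds become datetimes
# relative to a known start date.
start_date = datetime.datetime(2020, 1, 16, 12, 0, 0)
seconds = np.array([0.0, 30.0, 60.0])
times = np.array(
    [start_date + datetime.timedelta(seconds=s) for s in seconds],
    dtype='datetime64[ns]'
)

# Re-encode, as in _rename_variables() when start_date is set: subtract the
# start date and divide by one second to recover the original float values.
recovered = (times - np.datetime64(start_date)) / np.timedelta64(1, 's')
assert np.allclose(recovered, seconds)
```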

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 
 [tool.poetry]
 name = "l2ss-py"
-version = "2.1.1"
+version = "2.2.0-rc.3"
 description = "L2 Subsetter Service"
 authors = ["podaac-tva <[email protected]>"]
 license = "Apache-2.0"
(binary file, 31.9 MB)

Binary file not shown.

tests/test_subset.py

Lines changed: 81 additions & 6 deletions
@@ -1290,6 +1290,34 @@ def test_duplicate_dims_sndr(self):
         for var_name, variable in in_nc.variables.items():
             assert in_nc[var_name].shape == out_nc[var_name].shape
 
+    def test_duplicate_dims_tropomi(self):
+        """
+        Check if TROPOMI files run successfully even though
+        these files have variables with duplicate dimensions
+        """
+        TROP_dir = join(self.test_data_dir, 'tropomi')
+        trop_file = 'S5P_OFFL_L2__AER_LH_20210704T005246_20210704T023416_19290_02_020200_20210708T023111.nc'
+
+        bbox = np.array(((-180, 180), (-90, 90)))
+        output_file = "{}_{}".format(self._testMethodName, trop_file)
+        shutil.copyfile(
+            os.path.join(TROP_dir, trop_file),
+            os.path.join(self.subset_output_dir, trop_file)
+        )
+        box_test = subset.subset(
+            file_to_subset=join(self.subset_output_dir, trop_file),
+            bbox=bbox,
+            output_file=join(self.subset_output_dir, output_file)
+        )
+        # check if the box_test is
+
+        in_nc = nc.Dataset(join(TROP_dir, trop_file))
+        out_nc = nc.Dataset(join(self.subset_output_dir, output_file))
+
+        for var_name, variable in in_nc.groups['PRODUCT'].groups['SUPPORT_DATA'].groups['DETAILED_RESULTS'].variables.items():
+            assert variable.shape == out_nc.groups['PRODUCT'].groups['SUPPORT_DATA'].groups['DETAILED_RESULTS'].variables[var_name].shape
+
+
     def test_omi_novars_subset(self):
         """
         Check that the OMI variables are conserved when no variables are specified
@@ -1314,8 +1342,9 @@ def test_omi_novars_subset(self):
         in_nc = nc.Dataset(join(omi_dir, omi_file))
         out_nc = nc.Dataset(join(self.subset_output_dir, output_file))
 
-        for var_name, variable in in_nc.variables.items():
-            assert in_nc[var_name].shape == out_nc[var_name].shape
+        for var_name, variable in in_nc.groups['HDFEOS'].groups['SWATHS'].groups['OMI Total Column Amount SO2'].groups['Geolocation Fields'].variables.items():
+            assert in_nc.groups['HDFEOS'].groups['SWATHS'].groups['OMI Total Column Amount SO2'].groups['Geolocation Fields'].variables[var_name].shape == \
+                out_nc.groups['HDFEOS'].groups['SWATHS'].groups['OMI Total Column Amount SO2'].groups['Geolocation Fields'].variables[var_name].shape
 
 
     def test_root_group(self):
@@ -1691,12 +1720,58 @@ def test_temporal_he5file_subset(self):
             if 'BRO' in i:
                 assert any('utc' in x.lower() for x in time_var_names)
 
-
-        dataset = subset.convert_to_datetime(dataset, time_var_names)
-
+        dataset, start_date = subset.convert_to_datetime(dataset, time_var_names)
         assert dataset[time_var_names[0]].dtype == 'datetime64[ns]'
-
 
+
+    def test_he5_timeattrs_output(self):
+        """Test that the time attributes in the output match the attributes of the input for OMI test files"""
+
+        omi_dir = join(self.test_data_dir, 'OMI')
+        omi_file = 'OMI-Aura_L2-OMBRO_2020m0116t1207-o82471_v003-2020m0116t182003.he5'
+        omi_file_input = 'input' + omi_file
+        bbox = np.array(((-180, 90), (-90, 90)))
+        output_file = "{}_{}".format(self._testMethodName, omi_file)
+        shutil.copyfile(
+            os.path.join(omi_dir, omi_file),
+            os.path.join(self.subset_output_dir, omi_file)
+        )
+        shutil.copyfile(
+            os.path.join(omi_dir, omi_file),
+            os.path.join(self.subset_output_dir, omi_file_input)
+        )
+
+        min_time = '2020-01-16T12:30:00Z'
+        max_time = '2020-01-16T12:40:00Z'
+        bbox = np.array(((-180, 180), (-90, 90)))
+        nc_dataset_input = nc.Dataset(os.path.join(self.subset_output_dir, omi_file_input))
+        incut_set = nc_dataset_input.groups['HDFEOS'].groups['SWATHS'].groups['OMI Total Column Amount BrO'].groups['Geolocation Fields']
+        xr_dataset_input = xr.open_dataset(xr.backends.NetCDF4DataStore(incut_set))
+        inattrs = xr_dataset_input['Time'].attrs
+
+        subset.subset(
+            file_to_subset=os.path.join(self.subset_output_dir, omi_file),
+            bbox=bbox,
+            output_file=os.path.join(self.subset_output_dir, output_file),
+            min_time=min_time,
+            max_time=max_time
+        )
+
+        output_ncdataset = nc.Dataset(os.path.join(self.subset_output_dir, output_file))
+        outcut_set = output_ncdataset.groups['HDFEOS'].groups['SWATHS'].groups['OMI Total Column Amount BrO'].groups['Geolocation Fields']
+        xrout_dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(outcut_set))
+        outattrs = xrout_dataset['Time'].attrs
+
+        for key in inattrs.keys():
+            if isinstance(inattrs[key], np.ndarray):
+                if np.array_equal(inattrs[key], outattrs[key]):
+                    pass
+                else:
+                    raise AssertionError('Attributes for {} do not equal each other'.format(key))
+            else:
+                assert inattrs[key] == outattrs[key]
+
+
     def test_temporal_subset_lines(self):
         bbox = np.array(((-180, 180), (-90, 90)))
         file = 'SWOT_L2_LR_SSH_Expert_368_012_20121111T235910_20121112T005015_DG10_01.nc'
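
One detail worth noting in test_he5_timeattrs_output above: array-valued attributes cannot be compared with a bare `assert a == b`, because `==` on numpy arrays is element-wise and its truth value is ambiguous; hence the `np.array_equal` branch. A condensed sketch of the same comparison logic, using hypothetical attribute dicts:

```python
import numpy as np

def attrs_equal(inattrs, outattrs):
    """Compare two attribute dicts, special-casing array-valued attributes."""
    for key, in_val in inattrs.items():
        if isinstance(in_val, np.ndarray):
            # '==' on arrays returns an element-wise array, so compare
            # shape and contents with np.array_equal instead.
            if not np.array_equal(in_val, outattrs[key]):
                return False
        elif in_val != outattrs[key]:
            return False
    return True

assert attrs_equal({'scale': np.array([1.0, 2.0]), 'units': 's'},
                   {'scale': np.array([1.0, 2.0]), 'units': 's'})
```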
