
Add dataset.services() method to list available Harmony services #500

Open · wants to merge 38 commits into base: main

Conversation

nikki-t (Collaborator) commented Mar 25, 2024

GitHub Issue: #447

Description

As a first step toward facilitating the use of services in earthaccess, earthaccess was modified so that it can list the available Harmony services for a collection.

Overview of work done

A new services module was created that performs service queries. The results module was updated to include a services function that uses the services module to query a collection's services and return the provider-id and UMM JSON for each service.
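
For context, a minimal usage sketch of the new method (it mirrors the snippet shown later in this thread; the short name is one of the collections tested below):

    import earthaccess

    earthaccess.login()

    datasets = earthaccess.search_datasets(
        short_name="MUR-JPL-L4-GLOB-v4.1",
        cloud_hosted=True,
    )
    for dataset in datasets:
        # Maps each associated service concept-id to its provider-id and UMM record
        print(dataset.services())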

Sample services results:

{
    "provider-id": "POCLOUD",
    "umm": {
        "URL": {
            "Description": "the main access point",
            "URLValue": "https://opendap.earthdata.nasa.gov/"
        },
        "Type": "OPeNDAP",
        "ServiceKeywords": [
            {
                "ServiceCategory": "EARTH SCIENCE SERVICES",
                "ServiceTopic": "DATA MANAGEMENT/DATA HANDLING",
                "ServiceTerm": "DATA SUBSETTING/SUPERSETTING",
                "ServiceSpecificTerm": "SPATIAL SUBSETTING"
            },
            {
                "ServiceCategory": "EARTH SCIENCE SERVICES",
                "ServiceTopic": "DATA MANAGEMENT/DATA HANDLING",
                "ServiceTerm": "DATA SUBSETTING/SUPERSETTING",
                "ServiceSpecificTerm": "TEMPORAL SUBSETTING"
            },
            {
                "ServiceCategory": "EARTH SCIENCE SERVICES",
                "ServiceTopic": "DATA MANAGEMENT/DATA HANDLING",
                "ServiceTerm": "DATA SUBSETTING/SUPERSETTING",
                "ServiceSpecificTerm": "VARIABLE SUBSETTING"
            }
        ],
        "ServiceOrganizations": [
            {
                "Roles": [
                    "DEVELOPER"
                ],
                "ShortName": "UCAR/UNIDATA",
                "LongName": "Unidata, University Corporation for Atmospheric Research"
            }
        ],
        "OperationMetadata": [
            {
                "DistributedComputingPlatform": [
                    "WEBSERVICES"
                ]
            }
        ],
        "Description": "Earthdata OPEnDAP in the cloud",
        "Version": "9",
        "Name": "PO.DAAC Cloud OPeNDAP",
        "ServiceOptions": {
            "Subset": {
                "SpatialSubset": {
                    "BoundingBox": {
                        "AllowMultipleValues": false
                    }
                },
                "TemporalSubset": {
                    "AllowMultipleValues": false
                },
                "VariableSubset": {
                    "AllowMultipleValues": true
                }
            },
            "SupportedReformattings": [
                {
                    "SupportedInputFormat": "NETCDF-4",
                    "SupportedOutputFormats": [
                        "ASCII",
                        "CSV",
                        "NETCDF-3",
                        "NETCDF-4"
                    ]
                },
                {
                    "SupportedInputFormat": "HDF5",
                    "SupportedOutputFormats": [
                        "ASCII",
                        "CSV",
                        "NETCDF-3",
                        "NETCDF-4"
                    ]
                }
            ]
        },
        "MetadataSpecification": {
            "URL": "https://cdn.earthdata.nasa.gov/umm/service/v1.5.2",
            "Name": "UMM-S",
            "Version": "1.5.2"
        },
        "LongName": "PO.DAAC OPeNDADP In the Cloud"
    }
}

Overview of verification done

Tested the new service functionality on five collections:

  • HLSS30
  • MUR-JPL-L4-GLOB-v4.1
  • VIIRS_NPP-JPL-L2P-v2016.2
  • ATL22
  • TOMSN7SO2

Overview of integration done

Created a new unit test for DataService.get. Unit test coverage:

---------- coverage: platform linux, python 3.11.6-final-0 -----------
Name                                    Stmts   Miss  Cover   Missing
---------------------------------------------------------------------
earthaccess/__init__.py                    31      3    90%   77-81
earthaccess/api.py                        107     77    28%   26-28, 65-78, 117, 124, 143-158, 182-193, 211-213, 234-239, 248-252, 261-265, 284-285, 307-308, 328-336, 345-346, 350-355
earthaccess/auth.py                       207     80    61%   17-18, 49-63, 95-96, 122-154, 188-206, 211-215, 242, 251-252, 260-270, 316-317, 346-358, 362-372, 384
earthaccess/daac.py                        20      6    70%   136, 140-147
earthaccess/formatters.py                  17     10    41%   11, 18, 22-59
earthaccess/kerchunk.py                    32     26    19%   13-22, 32-58
earthaccess/results.py                    152     81    47%   28-31, 34-47, 88-97, 107-109, 117, 124-130, 137-139, 146-148, 155-158, 165-166, 174-176, 184-200, 212-220, 223, 258-261, 268-276, 283-284, 287-290, 306-317, 321-329, 355, 379-380
earthaccess/search.py                     305    139    54%   47, 63-72, 88-89, 99-100, 114, 129-133, 146-150, 165-180, 184-186, 194-195, 203-204, 217, 235-236, 244, 276-312, 374, 392-396, 421, 441-442, 450, 458-464, 474-475, 489-498, 511-516, 525-526, 534-535, 543-544, 552-553, 564-565, 573-574, 582, 588, 628, 634-638, 641, 650, 654, 660-662, 665, 678-679, 721-722, 731-732, 741-742, 772-773, 782-783, 795-803
earthaccess/services.py                    34      6    82%   31, 51, 57-58, 61, 66
earthaccess/store.py                      289    190    34%   27-28, 31, 34, 42, 50-55, 62-75, 82-86, 109-114, 118-121, 124-125, 128-131, 134-137, 148, 156-159, 183-190, 193-194, 214, 224, 239-242, 292, 310-312, 331, 340-385, 394-440, 467-478, 506, 516-536, 546-582, 595-618, 634-649, 656-665
earthaccess/utils/_validation.py            5      3    40%   5-7
earthaccess/widgets.py                      0      0   100%
tests/unit/__init__.py                      0      0   100%
tests/unit/conftest.py                      0      0   100%
tests/unit/test_auth.py                    60      2    97%   133-134
tests/unit/test_collection_queries.py      30      0   100%
tests/unit/test_formatters.py               0      0   100%
tests/unit/test_granule_queries.py         22      0   100%
tests/unit/test_results.py                 22      0   100%
tests/unit/test_store.py                   58      0   100%
---------------------------------------------------------------------
TOTAL                                    1391    623    55%

========================================================================================================= 26 passed, 2 warnings in 1.75s =========================================================================================================
+ bash ./scripts/lint.sh
+ mypy earthaccess --disallow-untyped-defs
Success: no issues found in 12 source files
+ ruff check .

Created a new integration test that exercises the services functionality from the service query through the parsing of query results.
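
For illustration, a hypothetical shape for such a test, assuming pytest plus vcrpy (the VCR approach adopted later in this thread); the cassette path and the assertion are assumptions, not the PR's actual test code:

    import vcr

    import earthaccess

    @vcr.use_cassette("tests/integration/fixtures/services.yaml")
    def test_collection_services():
        datasets = earthaccess.search_datasets(short_name="MUR-JPL-L4-GLOB-v4.1")
        services = datasets[0].services()
        # Each entry maps a service concept-id to its parsed service record(s)
        assert services

Integration test coverage: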

---------- coverage: platform linux, python 3.11.6-final-0 -----------
Name                                        Stmts   Miss  Cover   Missing
-------------------------------------------------------------------------
earthaccess/__init__.py                        31      4    87%   68-72, 75
earthaccess/api.py                            107     33    69%   77, 124, 144-152, 190-192, 213, 237-238, 248-252, 261-265, 284-285, 331-334, 336, 345-346
earthaccess/auth.py                           207     59    71%   17-18, 122-154, 188-206, 211-212, 242, 251-252, 260-261, 266, 269, 346-358, 362-372, 384
earthaccess/daac.py                            20      2    90%   136, 147
earthaccess/formatters.py                      17     10    41%   11, 18, 22-59
earthaccess/kerchunk.py                        32     26    19%   13-22, 32-58
earthaccess/results.py                        152     58    62%   28-31, 34-47, 88-97, 107-109, 124-130, 137-139, 146-148, 155-158, 165-166, 174-176, 193, 223, 258-261, 268-276, 283-284, 287-290, 316-317, 321-329, 355, 379-380
earthaccess/search.py                         305     81    73%   69-70, 88-89, 114, 129-133, 146-150, 172, 184-186, 194-195, 203-204, 217, 235-236, 284, 290-294, 297, 304, 392-396, 421, 441-442, 459, 474-475, 490, 511-516, 525-526, 543-544, 552-553, 574, 582, 628, 634-638, 641, 650, 662, 678-679, 721-722, 731-732, 741-742, 772-773, 782-783, 795-803
earthaccess/services.py                        34      6    82%   31, 51, 57-58, 61, 66
earthaccess/store.py                          289     89    69%   34, 42, 62-75, 109-114, 118-121, 124-125, 128-131, 159, 183-190, 193-194, 214, 224, 241-242, 312, 331, 345, 373-374, 383-385, 394-440, 468-470, 478, 506, 523-536, 596, 612-617, 635, 637, 660-661
earthaccess/utils/_validation.py                5      0   100%
earthaccess/widgets.py                          0      0   100%
tests/integration/conftest.py                   9      5    44%   8-12
tests/integration/test_api.py                  51      0   100%
tests/integration/test_auth.py                 82     15    82%   56, 64-65, 87-95, 106, 117-118
tests/integration/test_cloud_download.py       80     13    84%   88-89, 130-133, 141-142, 147-156, 169
tests/integration/test_cloud_open.py           82      5    94%   101, 134-135, 159, 169
tests/integration/test_kerchunk.py             41     33    20%   11-87
tests/integration/test_onprem_download.py      82      5    94%   94, 127-128, 159, 161
tests/integration/test_onprem_open.py          75      4    95%   93, 126-127, 151
tests/integration/test_services.py             24      0   100%
-------------------------------------------------------------------------
TOTAL                                        1725    448    74%

================================================================================================================= short test summary info ==================================================================================================================
FAILED tests/integration/test_api.py::test_download[True-0] - ValueError: earthaccess can't yet guess the provider for cloud collections, we need to use one from earthaccess.list_cloud_providers()
FAILED tests/integration/test_api.py::test_download[True-selection1] - ValueError: earthaccess can't yet guess the provider for cloud collections, we need to use one from earthaccess.list_cloud_providers()
FAILED tests/integration/test_auth.py::test_auth_can_read_from_netrc_file - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_auth_populates_attrs - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_auth_can_fetch_s3_credentials - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_get_s3_credentials_lowercase_location[location0] - assert {}
FAILED tests/integration/test_auth.py::test_get_s3_credentials_lowercase_location[location1] - assert {}
FAILED tests/integration/test_auth.py::test_get_s3fs_session_lowercase_location[location0] - requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
FAILED tests/integration/test_auth.py::test_get_s3fs_session_lowercase_location[location1] - requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
FAILED tests/integration/test_cloud_download.py::test_earthaccess_can_download_cloud_collection_granules[daac0] - AssertionError: assert False
FAILED tests/integration/test_cloud_download.py::test_earthaccess_can_download_cloud_collection_granules[daac1] - AssertionError: assert False
FAILED tests/integration/test_cloud_download.py::test_earthaccess_can_download_cloud_collection_granules[daac2] - AssertionError: assert False
FAILED tests/integration/test_cloud_download.py::test_earthaccess_can_download_cloud_collection_granules[daac3] - AssertionError: assert False
FAILED tests/integration/test_cloud_download.py::test_earthaccess_can_download_cloud_collection_granules[daac4] - AssertionError: assert False
FAILED tests/integration/test_cloud_download.py::test_multi_file_granule - requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
FAILED tests/integration/test_cloud_open.py::test_multi_file_granule - requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
FAILED tests/integration/test_onprem_download.py::test_earthaccess_can_download_onprem_collection_granules[daac0] - AssertionError: False is not true
FAILED tests/integration/test_onprem_download.py::test_earthaccess_can_download_onprem_collection_granules[daac1] - AssertionError: False is not true
============================================================================================= 18 failed, 52 passed, 1 skipped, 3 warnings in 174.40s (0:02:54) =============================================================================================

PR checklist:

  • Linted & Formatted
  • Updated unit & integration tests
  • Updated changelog & readme
  • Updated documentation

Pending:

  • Pass currently failing integration tests.

📚 Documentation preview 📚: https://earthaccess--500.org.readthedocs.build/en/500/

mfisher87 changed the title from "Feature/issue 447" to "Add dataset.services() method to list available Harmony services" on Mar 25, 2024
nikki-t (Collaborator Author) commented Mar 25, 2024

It seems like the existing integration tests are failing due to credentials, which I had set up as environment variables and in a .netrc file when running them locally. What is the best way to handle credentials in these tests? Also happy to move this to a draft if we want to work out those details first.

mfisher87 (Member):

Hope you don't mind the rename, was finding the old name difficult to remember in my notifications :)

mfisher87 (Member) commented Mar 25, 2024

What is the best way to handle credentials in these tests?

@betolink @andypbarrett I think this may need documenting! As a developer, how do I run integration tests on my laptop?
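
(One common local setup, sketched here from earthaccess's documented login strategies: export EARTHDATA_USERNAME and EARTHDATA_PASSWORD, or add a urs.earthdata.nasa.gov entry to ~/.netrc, then authenticate before the tests run.)

    import earthaccess

    # Picks up EARTHDATA_USERNAME / EARTHDATA_PASSWORD from the environment;
    # use strategy="netrc" to read ~/.netrc instead.
    earthaccess.login(strategy="environment")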

nikki-t (Collaborator Author) commented Apr 26, 2024

@mfisher87 - I think that I was able to fix the integration test and also modified the services unit and integration tests to use VCR. All unit tests are passing.

I ran the integration tests locally and here is the summary:

==================== short test summary info =================================

FAILED tests/integration/test_auth.py::test_auth_can_read_from_netrc_file - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_auth_populates_attrs - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_auth_can_fetch_s3_credentials - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_get_s3_credentials_lowercase_location[location0] - assert {}
FAILED tests/integration/test_auth.py::test_get_s3_credentials_lowercase_location[location1] - assert {}
FAILED tests/integration/test_auth.py::test_get_s3fs_session_lowercase_location[location0] - requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
FAILED tests/integration/test_auth.py::test_get_s3fs_session_lowercase_location[location1] - requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

====== 7 failed, 71 passed, 1 skipped in 496.39s (0:08:16) ==========================================

I wonder if these tests need to be run in the same AWS region as the data?

Other than that I think this PR can be merged but let me know how that typically works for earthaccess development.

mfisher87 (Member):

Will you be at hack day this coming week? I may not have time to review this before then 😬

nikki-t (Collaborator Author) commented Apr 28, 2024

@mfisher87 - Unfortunately I am on travel this coming week so I can't make it but I do plan to attend the next earthaccess hack day in a few weeks. We can discuss then if it's easier! 😄

mfisher87 (Member):

Sounds like a good plan :) Safe and fun travels!

mfisher87 (Member):

@nikki-t I'll only attend 2nd half of hack day this week, had to schedule a dentist appointment at that time. I may be co-working on that call (I've used my funded Openscapes allocation), but feel free to interrupt me so we can have a quick chat. I love the how-to you added 🤩

earthaccess/services.py: two outdated review threads (resolved)
nikki-t and others added 3 commits June 3, 2024 14:13
Co-authored-by: Jessica Scheick <[email protected]>
Co-authored-by: Jessica Scheick <[email protected]>
nikki-t (Collaborator Author) commented Jun 3, 2024

@JessicaS11 - Thank you for the suggested updates, I have applied them!

@mfisher87 , @betolink - Do you think this PR is ready to be merged?

betolink (Member) commented Jun 3, 2024

Hi @nikki-t, the code looks good to me. There are some failing tests due to a print statement, which I'm not opposed to. @mfisher87, do you know if there is a way to skip this rule on a file?

chuckwondo (Collaborator):

Hi @nikki-t, the code looks good to me. There are some failing tests due to a print statement, which I'm not opposed to. @mfisher87, do you know if there is a way to skip this rule on a file?

Please use logging, not printing. We addressed this with #511.

betolink (Member) commented Jun 3, 2024

Ahh right @chuckwondo cc @nikki-t

Comment on lines 44 to 73

page_size = min(limit, 2000)
url = self._build_url()

results = []  # type: List[str]
page = 1
while len(results) < limit:
    params = {"page_size": page_size, "page_num": page}
    if self._debug:
        print(f"Fetching: {url}")
    # TODO: implement caching
    response = self.session.get(url, params=params)

    try:
        response.raise_for_status()
    except exceptions.HTTPError as ex:
        raise RuntimeError(ex.response.text)

    if self._format == "json":
        latest = response.json()["items"]
    else:
        latest = [response.text]

    if len(latest) == 0:
        break

    results.extend(latest)
    page += 1

return results
Collaborator:

This code already exists in ServiceQuery, so simply call the superclass method. Ideally, this method wouldn't need to be here at all, but for now we keep it for generating docs.

Suggested change. Replace:

    page_size = min(limit, 2000)
    url = self._build_url()
    results = []  # type: List[str]
    page = 1
    while len(results) < limit:
        params = {"page_size": page_size, "page_num": page}
        if self._debug:
            print(f"Fetching: {url}")
        # TODO: implement caching
        response = self.session.get(url, params=params)
        try:
            response.raise_for_status()
        except exceptions.HTTPError as ex:
            raise RuntimeError(ex.response.text)
        if self._format == "json":
            latest = response.json()["items"]
        else:
            latest = [response.text]
        if len(latest) == 0:
            break
        results.extend(latest)
        page += 1
    return results

with:

    return super.get(limit)

Collaborator Author:

I think this needs to be a call to super like this: super().get(limit). Will test and implement in code.
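
For reference, the distinction (a minimal illustration, not the PR's final code):

    from typing import Any, List

    from cmr import ServiceQuery

    class DataService(ServiceQuery):
        def get(self, limit: int = 2000) -> List[Any]:
            # `super` by itself is the built-in type; calling `super()` returns
            # the bound proxy needed to reach ServiceQuery.get.
            return super().get(limit)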

Collaborator:

Oh, hang on. We want umm_json, not json, so we do need to implement this here because python_cmr currently parses only when the format is json, not umm_json.

However, we already implement this generally in search.get_results, so this should likely be something like this:

    from .search import get_results

    def get(self, limit: int = 2000) -> List[Any]:
        return get_results(self.session, self, limit)

This will require you to tweak search.py as well, as follows:

First, change from cmr import CollectionQuery, GranuleQuery to from cmr import CollectionQuery, GranuleQuery, ServiceQuery

Next, change this:

def get_results(
    session: requests.Session,
    query: Union[CollectionQuery, GranuleQuery],
    limit: int = 2000,
) -> List[Any]:

to this:

def get_results(
    session: requests.Session,
    query: Union[CollectionQuery, GranuleQuery, ServiceQuery],
    limit: int = 2000,
) -> List[Any]:

Collaborator Author:

When I make these changes I get a circular dependency as services.py tries to import from .search while results.py is trying to import DataCollection, DataGranule from .search. The results module uses DataService to query and parse results. Also see this comment.

Collaborator:

Hmmm, yeah, I see the circularity. We may need to rethink where to put things to avoid circularities.
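
One standard workaround, sketched under the module layout described above (not necessarily the right long-term architecture), is to defer the import to call time so neither module needs the other while it is being loaded:

    from typing import Any, List

    from cmr import ServiceQuery

    class DataService(ServiceQuery):
        def get(self, limit: int = 2000) -> List[Any]:
            # A function-local import resolves only when get() is called, after
            # both modules have finished loading, breaking the import-time cycle.
            from .search import get_results

            return get_results(self.session, self, limit)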

earthaccess/results.py: outdated review thread (resolved)
Comment on lines 190 to 199
if services:
    for service in services:
        if earthaccess.__auth__.authenticated:
            query = DataService(auth=earthaccess.__auth__).parameters(
                concept_id=service
            )
        else:
            query = DataService().parameters(concept_id=service)
        results = query.get(query.hits())
        parsed[service] = self._parse_service_result(results)
Collaborator:

Suggested change. Replace:

    if services:
        for service in services:
            if earthaccess.__auth__.authenticated:
                query = DataService(auth=earthaccess.__auth__).parameters(
                    concept_id=service
                )
            else:
                query = DataService().parameters(concept_id=service)
            results = query.get(query.hits())
            parsed[service] = self._parse_service_result(results)

with:

    for service in services:
        if earthaccess.__auth__.authenticated:
            query = DataService(auth=earthaccess.__auth__).parameters(
                concept_id=service
            )
        else:
            query = DataService().parameters(concept_id=service)
        results = query.get(query.hits())
        parsed[service] = self._parse_service_result(results)

Comment on lines 216 to 217
"provider-id": result_json["items"][0]["meta"]["provider-id"],
"umm": result_json["items"][0]["umm"],
Collaborator:

Why are we getting only the item at index 0?

nikki-t (Collaborator Author) commented Jun 3, 2024

In the results response, the data always seems to be returned in a one-element list that contains all of the metadata. In order to provide some filtering, I chose to return only the provider-id and the UMM JSON response for each service. See attached sample-results-response.json.

It also looks like this is the case in the CMR API documentation, but to be on the safe side I will figure out how to iterate over the list to make sure we don't miss anything.
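
A hypothetical rework of the parsing (the name and shape here are assumed) that keeps every item rather than only index 0:

    def parse_service_results(result_json: dict) -> list:
        # One record per item, instead of hard-coding result_json["items"][0]
        return [
            {"provider-id": item["meta"]["provider-id"], "umm": item["umm"]}
            for item in result_json["items"]
        ]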

Collaborator:

When you address the suggestion I just added to the get method, you should not need to use json.loads because the get method will return parsed results.

Collaborator Author:

@chuckwondo - I can update the code to iterate over the "items" list in the CMR service query response but then I end up returning a list of lists to the end user. See attached parsed.json. What do you think of returning a list of lists that contains the items?

Here is an example:

"S2839491596-XYZ_PROV": [
        [
            {
                "provider-id": "XYZ_PROV",
                "umm": {
                    "URL": {
                        "Description": "https://harmony.earthdata.nasa.gov",
                        "URLValue": "This is the Harmony root endpoint."
                    },
                    "Type": "Harmony",
                    "ServiceKeywords": [
                        {
                            "ServiceCategory": "EARTH SCIENCE SERVICES",
                            "ServiceTopic": "DATA MANAGEMENT/DATA HANDLING",
                            "ServiceTerm": "DATA ACCESS/RETRIEVAL"
                        },
                        {
                            "ServiceCategory": "EARTH SCIENCE SERVICES",
                            "ServiceTopic": "DATA MANAGEMENT/DATA HANDLING",
                            "ServiceTerm": "DATA INTEROPERABILITY",
                            "ServiceSpecificTerm": "DATA REFORMATTING"
                        }
                    ],
                    "ServiceOrganizations": [
                        {
                            "Roles": [
                                "DEVELOPER",
                                "PUBLISHER",
                                "SERVICE PROVIDER"
                            ],
                            "ShortName": "NASA/GSFC/EOS/EOSDIS/EMD",
                            "LongName": "Maintenance and Development, Earth Observing System Data and Information System, Earth Observing System,Goddard Space Flight Center, NASA"
                        }
                    ],
                    "Description": "Backend NetCDF-to-Zarr service option description for Harmony data transformations. Cannot be chained with other operations from this record.",
                    "VersionDescription": "Semantic version number for the NetCDF-to-Zarr Docker image used by Harmony in production.",
                    "Version": "1.2.0",
                    "Name": "Harmony NetCDF-to-Zarr Service",
                    "ContactPersons": [
                        {
                            "Roles": [
                                "DEVELOPER"
                            ],
                            "FirstName": "Owen",
                            "LastName": "Littlejohns",
                            "ContactInformation": {
                                "ContactMechanisms": [
                                    {
                                        "Type": "Email",
                                        "Value": "[email protected]"
                                    }
                                ]
                            }
                        },
                        {
                            "Roles": [
                                "SERVICE PROVIDER"
                            ],
                            "FirstName": "David",
                            "LastName": "Auty",
                            "ContactInformation": {
                                "ContactMechanisms": [
                                    {
                                        "Type": "Email",
                                        "Value": "[email protected]"
                                    }
                                ]
                            }
                        }
                    ],
                    "ServiceOptions": {
                        "Aggregation": {
                            "Concatenate": {
                                "ConcatenateDefault": "False"
                            }
                        },
                        "SupportedReformattings": [
                            {
                                "SupportedInputFormat": "NETCDF-4",
                                "SupportedOutputFormats": [
                                    "ZARR"
                                ]
                            }
                        ]
                    },
                    "MetadataSpecification": {
                        "URL": "https://cdn.earthdata.nasa.gov/umm/service/v1.5.3",
                        "Name": "UMM-S",
                        "Version": "1.5.3"
                    },
                    "LongName": "Harmony NetCDF-to-Zarr Service"
                }
            }
        ]
    ]

Member:

Is there meaning to the nested list structure? If not, we could use itertools.chain() to flatten it out, IIRC.
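
A sketch of that flattening, where `nested` stands in for the list-of-lists of parsed service records shown above:

    from itertools import chain

    nested = [[{"provider-id": "XYZ_PROV"}], [{"provider-id": "POCLOUD"}]]
    flat = list(chain.from_iterable(nested))
    # -> [{'provider-id': 'XYZ_PROV'}, {'provider-id': 'POCLOUD'}]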

nikki-t and others added 2 commits June 3, 2024 16:16
Co-authored-by: Chuck Daniels <[email protected]>
Co-authored-by: Chuck Daniels <[email protected]>

Comment on lines 194 to 195
results = query.get(query.hits())
parsed[service] = self._parse_service_result(results)
Collaborator:

Once you make the change I suggested in the get method, you should be able to eliminate the _parse_service_result method and just do this:

Suggested change. Replace:

    results = query.get(query.hits())
    parsed[service] = self._parse_service_result(results)

with:

    parsed[service] = query.get(query.hits())

CHANGELOG.md: two outdated review threads (resolved)
from .auth import Auth


class DataService(ServiceQuery):
Collaborator:

Any reason this is not in search.py, like DataCollections and DataGranules are?

I'm not opposed to keeping this in a separate file, but splitting each type of query class into its own module might perhaps be left as a task for a separate issue (after discussing whether or not we want to do so).

For consistency with existing code, I suggest moving this to search.py and also renaming it to be pluralized: DataServices.

nikki-t (Collaborator Author) commented Jun 7, 2024

I had initially wanted to place DataService in search.py but because the results module uses the DataService class to query and parse results, a circular dependency is created between the search and results modules. I also thought the DataService class might grow as we add in the plugin architecture and could serve as the main entrypoint for loading and accessing plugins.

I can look into re-architecting the code so that there is a DataServices class in search.py and DataService class in results.py to be more consistent with the Collections and Granules structure. I was initially thinking that the services would be returned for a collection rather than having a separate search_services function.

Right now the end user can query services like this:

datasets = search_datasets(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    cloud_hosted=True,
    temporal=("2024-02-27T00:00:00Z", "2024-02-29T00:00:00Z"),
)
for dataset in datasets:
    print(dataset.services())

This makes it pretty easy to return service data for a collection. If I re-architect as mentioned above, the user would search a service like this:

datasets = search_datasets(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    cloud_hosted=True,
    temporal=("2024-02-27T00:00:00Z", "2024-02-29T00:00:00Z"),
)
for dataset in datasets:
    services = dataset["meta"]["associations"]["services"]
    for service in services:
        print(search_services(service))

I like how easy it is to return service data in the first code snippet, but I also want to make sure we are building a codebase that is consistent and easy to extend, since I think we want to build off of this with the plugin architecture.

Open to suggestions and/or discussing at the next hackday.

Collaborator:

Ah, I see the circular dependency you're referring to.

I still suggest you rename DataService (singular) to DataServices (plural), making it consistent with DataCollections and DataGranules.

Regarding the "easier" code you mention above, I agree, but that doesn't preclude providing a search_services method as well. Users can then do both things: (a) call dataset.services() to get the services associated with a dataset, or (b) call search_services to search for services more generally, not necessarily specific to a dataset.

Member:

Users can then do both things: (a) call dataset.services() to get the services associated with a dataset, or (b) call search_services to search for services more generally, not necessarily specific to a dataset.

Should we consider these separate features and follow-up later to add even more convenience?

Collaborator:

I don't think so. In fact, I suggest that dataset.services() simply invoke search_services.
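
A minimal sketch of that delegation (all names hypothetical; search_services is the proposed function, and the association path is the one shown in the snippet above):

    import earthaccess

    class DataCollection(dict):
        def services(self) -> dict:
            # Service concept-ids associated with this collection in its CMR metadata
            service_ids = self["meta"].get("associations", {}).get("services", [])
            # search_services is the proposed, not-yet-existing API
            return {sid: earthaccess.search_services(concept_id=sid) for sid in service_ids}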

Collaborator:

Sounds reasonable to me. My only request is to not name anything "utils." It's a pet peeve of mine because it's such a generic name as to have no meaning. Happy to iron out kinks with you at the next hack day.

Member:

My only request is to not name anything "utils."

I love this and feel so called out 🤣 I'm very prone to creating a utils subpackage but I always regret it later and am trying to get better at it.

Collaborator Author:

Totally agree on utils 😄 but it does look like there is already a utils directory. Should we consider moving that to a different name? Or maybe I am misunderstanding!

Collaborator:

Oh, I hadn't noticed that there's already a utils. Oh well. We can worry about that another time.

Member:

If that was me, sorry 🤣

- Modify services.py to use get_results function
- Update results.DataCollection.services function to retrieve and return UMM-JSON for services
- Update unit and integration tests
nikki-t (Collaborator Author) commented Jun 11, 2024

Suggested modifications have been made and unit and integration tests have been updated.

Unit Test Summary

================================================================================================================ short test summary info =================================================================================================================
FAILED tests/unit/test_collection_queries.py::test_query_can_parse_single_dates[2001-12-12-2001-12-21-2001-12-12T00:00:00Z,2001-12-21T23:59:59Z] - ValueError: time data '2001-12-12' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_collection_queries.py::test_query_can_parse_single_dates[2021-02-01--2021-02-01T00:00:00Z,] - ValueError: time data '2021-02-01' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_collection_queries.py::test_query_can_parse_single_dates[1999-02-01 06:00-2009-01-01-1999-02-01T06:00:00Z,2009-01-01T23:59:59Z] - ValueError: time data '1999-02-01 06:00' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_collection_queries.py::test_query_can_parse_single_dates[2019-03-10T00:00:00Z-2019-03-10T00:00:00-01:00-2019-03-10T00:00:00Z,2019-03-10T01:00:00Z] - ValueError: time data '2019-03-10T00:00:00-01:00' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_granule_queries.py::test_query_can_parse_single_dates[2001-12-12-2001-12-21-2001-12-12T00:00:00Z,2001-12-21T23:59:59Z] - ValueError: time data '2001-12-12' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_granule_queries.py::test_query_can_parse_single_dates[2021-02-01--2021-02-01T00:00:00Z,] - ValueError: time data '2021-02-01' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_granule_queries.py::test_query_can_parse_single_dates[1999-02-01 06:00-2009-01-01-1999-02-01T06:00:00Z,2009-01-01T23:59:59Z] - ValueError: time data '1999-02-01 06:00' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_granule_queries.py::test_query_can_parse_single_dates[2019-03-10T00:00:00Z-2019-03-10T00:00:00-01:00-2019-03-10T00:00:00Z,2019-03-10T01:00:00Z] - ValueError: time data '2019-03-10T00:00:00-01:00' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_results.py::TestResults::test_data_links - ValueError: time data '2020' does not match format '%Y-%m-%dT%H:%M:%SZ'
============================================================================================================== 9 failed, 31 passed in 2.56s ==============================================================================================================

Integration Test Summary

================================================================================================================ short test summary info =================================================================================================================
FAILED tests/integration/test_auth.py::test_auth_can_read_earthdata_env_variables - AttributeError: 'Auth' object has no attribute 'username'
FAILED tests/integration/test_auth.py::test_auth_can_read_from_netrc_file - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_auth_populates_attrs - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_auth_can_fetch_s3_credentials - AssertionError: False is not true
FAILED tests/integration/test_auth.py::test_get_s3_credentials_lowercase_location[location0] - assert {}
FAILED tests/integration/test_auth.py::test_get_s3_credentials_lowercase_location[location1] - assert {}
FAILED tests/integration/test_onprem_download.py::test_earthaccess_can_download_onprem_collection_granules[daac1] - AssertionError: False is not true
FAILED tests/integration/test_onprem_download.py::test_earthaccess_can_download_onprem_collection_granules[daac3] - AssertionError: 0 not greater than 3
FAILED tests/integration/test_onprem_open.py::test_earthaccess_can_open_onprem_collection_granules[daac3] - AssertionError: 0 not greater than 2
FAILED tests/unit/test_collection_queries.py::test_query_can_parse_single_dates[2001-12-12-2001-12-21-2001-12-12T00:00:00Z,2001-12-21T23:59:59Z] - ValueError: time data '2001-12-12' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_collection_queries.py::test_query_can_parse_single_dates[2021-02-01--2021-02-01T00:00:00Z,] - ValueError: time data '2021-02-01' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_collection_queries.py::test_query_can_parse_single_dates[1999-02-01 06:00-2009-01-01-1999-02-01T06:00:00Z,2009-01-01T23:59:59Z] - ValueError: time data '1999-02-01 06:00' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_collection_queries.py::test_query_can_parse_single_dates[2019-03-10T00:00:00Z-2019-03-10T00:00:00-01:00-2019-03-10T00:00:00Z,2019-03-10T01:00:00Z] - ValueError: time data '2019-03-10T00:00:00-01:00' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_granule_queries.py::test_query_can_parse_single_dates[2001-12-12-2001-12-21-2001-12-12T00:00:00Z,2001-12-21T23:59:59Z] - ValueError: time data '2001-12-12' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_granule_queries.py::test_query_can_parse_single_dates[2021-02-01--2021-02-01T00:00:00Z,] - ValueError: time data '2021-02-01' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_granule_queries.py::test_query_can_parse_single_dates[1999-02-01 06:00-2009-01-01-1999-02-01T06:00:00Z,2009-01-01T23:59:59Z] - ValueError: time data '1999-02-01 06:00' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_granule_queries.py::test_query_can_parse_single_dates[2019-03-10T00:00:00Z-2019-03-10T00:00:00-01:00-2019-03-10T00:00:00Z,2019-03-10T01:00:00Z] - ValueError: time data '2019-03-10T00:00:00-01:00' does not match format '%Y-%m-%dT%H:%M:%SZ'
FAILED tests/unit/test_results.py::TestResults::test_data_links - ValueError: time data '2020' does not match format '%Y-%m-%dT%H:%M:%SZ'
============================================================================================ 18 failed, 66 passed, 1 skipped, 1 warning in 596.56s (0:09:56) =============================================================================================

It looks like a lot of the tests are failing due to the date format. I pulled in the most recent changes from main to see if they had been fixed, but it doesn't look like it. I also typically see some test failures when running the integration tests locally.

@mfisher87 or @chuckwondo - Do you mind approving the workflows so that they can run? If these pass I think we might be in good shape to merge the PR.

mfisher87 (Member):

Workflows kicked off! You should not have to worry about that... I sent you an invitation with "triage" rights (includes the ability to add labels, close issues, etc., but not quite as much as "maintainer"), which you can accept or decline :)
