
Add dataset.services() method to list available services #500

Merged 55 commits on Sep 16, 2024 (changes shown are from the first 28 commits)

Commits
3c4d293
List services that are available for a collection
nikki-t Mar 19, 2024
78a6fcb
Define integration test for services functionality
nikki-t Mar 19, 2024
285b0d6
Update imports and fix type annotations
nikki-t Mar 19, 2024
5a40e8a
Update file formatting
nikki-t Mar 19, 2024
8088690
Update changelog and readme to include services functionality
nikki-t Mar 25, 2024
911b954
Update for clarity on services
nikki-t Mar 25, 2024
bc9c2f4
Provide unit test for DataService get function
nikki-t Mar 25, 2024
0bb9f11
Fix formatting of imports
nikki-t Mar 25, 2024
692d3cf
Fix code formatting
nikki-t Mar 25, 2024
9e37fe2
Mock API response to account for changing service records
nikki-t Mar 25, 2024
e8dfc74
Add documentation for services functionality
nikki-t Mar 25, 2024
5404df3
Be more clear about test failures because no tests were collected
mfisher87 Apr 2, 2024
1c41659
Improve the error message when no tests collected
mfisher87 Apr 2, 2024
daf1bb4
Merge branch 'main' into feature/issue-447
mfisher87 Apr 2, 2024
07771e1
Merge branch 'main' of https://github.com/nikki-t/earthaccess into fe…
nikki-t Apr 26, 2024
240c930
Fix import organization
nikki-t Apr 26, 2024
4a710f4
Use VCR for CMR API calls and update unit and integration test for se…
nikki-t Apr 26, 2024
6cf0f64
Merge branch 'feature/issue-447' of https://github.com/nikki-t/eartha…
nikki-t Apr 26, 2024
20d395d
Fix test formatting
nikki-t Apr 26, 2024
44917d9
Merge branch 'main' of https://github.com/nikki-t/earthaccess into fe…
nikki-t May 14, 2024
71973dc
Fix DataService init documentation
nikki-t May 14, 2024
16512d1
Add a HOW-TO on searching for services
nikki-t May 14, 2024
057d7fd
Fix trailing whitespace
nikki-t May 14, 2024
41d18b8
Update docs/howto/search-services.md
nikki-t Jun 3, 2024
14dc3ef
Update earthaccess/services.py
nikki-t Jun 3, 2024
64caf3d
Update earthaccess/services.py
nikki-t Jun 3, 2024
f67365b
Update earthaccess/results.py
nikki-t Jun 3, 2024
0c66c66
Update earthaccess/results.py
nikki-t Jun 3, 2024
9dfa385
Add issue to changelog enhancements
nikki-t Jun 11, 2024
cc24470
Update service architecture to provide cleaner access to service queries.
nikki-t Jun 11, 2024
2362eb8
Factor out get_results to utils._search to be shared by search and re…
nikki-t Jun 11, 2024
d1bebe3
Merge branch 'main' of https://github.com/nikki-t/earthaccess into fe…
nikki-t Jun 11, 2024
0e505ae
Fix code formatting
nikki-t Jun 11, 2024
b437341
Fix reference to expected test data file
nikki-t Jun 11, 2024
8ecacd1
Fix issue with accessing expected test data
nikki-t Jun 11, 2024
b874c01
Test response for different Python version unit tests
nikki-t Jun 11, 2024
61b2e04
Test response for different Python version unit tests
nikki-t Jun 11, 2024
49fb87d
Remove logging of response
nikki-t Jun 11, 2024
c2af6b3
Update fixtures for JSON body
nikki-t Jul 23, 2024
9474b33
Set authentication to false
nikki-t Jul 23, 2024
f6d3776
Merge branch 'main' into feature/issue-447
betolink Jul 23, 2024
c0a3c8a
Fix end of file reference
nikki-t Jul 23, 2024
79319c8
Merge branch 'feature/issue-447' of https://github.com/nikki-t/eartha…
nikki-t Jul 23, 2024
01bad9c
Merge branch 'main' of https://github.com/nikki-t/earthaccess into fe…
nikki-t Sep 10, 2024
2c29d73
Decode compressed VCR response
nikki-t Sep 10, 2024
8dd2402
Update unit test VCR file
nikki-t Sep 10, 2024
a1b64f2
Cache mypy_cache in CI to speedup build
chuckwondo Sep 13, 2024
d99dc3c
Tweak mypy config to drop explicit path args
chuckwondo Sep 13, 2024
134a7b7
Pluralize DataService
chuckwondo Sep 13, 2024
783d996
Add top-level search_services function
chuckwondo Sep 13, 2024
ba0eb8f
Simplify logic
chuckwondo Sep 13, 2024
39f8963
Fix tests failing for response compression handling
chuckwondo Sep 13, 2024
86ab7a8
Fixup changelog
chuckwondo Sep 13, 2024
3971454
Fix mkdocs build, including broken links
chuckwondo Sep 13, 2024
39b86a8
Add mfisher87 and betolink to credits for #447
chuckwondo Sep 16, 2024
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,10 @@

## [Unreleased]

* New Features

* [#447](https://github.com/nsidc/earthaccess/issues/447) Enable the retrieval of services associated with a collection.

* Changes

* [#421](https://github.com/nsidc/earthaccess/issues/421): Removed the
24 changes: 24 additions & 0 deletions docs/howto/search-services.md
@@ -0,0 +1,24 @@
# How to search for services using `earthaccess`

You can search for services associated with a dataset. Each service is a back-end processing workflow that transforms or processes the data in some way (e.g., clipping to a spatial extent or converting to a different file format).

`earthaccess` facilitates the retrieval of service metadata via the `search_datasets` function. Each result from `search_datasets` is an enhanced Python dictionary that includes a `services` method, which returns the metadata for all services associated with that collection as a Python dictionary.

To search for services, import the `earthaccess` library and search by dataset (you need the dataset's short name, which can be found on the dataset landing page).

```py
import earthaccess

datasets = earthaccess.search_datasets(
short_name="MUR-JPL-L4-GLOB-v4.1",
cloud_hosted=True,
temporal=("2024-02-27T00:00:00Z", "2024-02-29T23:59:59Z"),
)
```

Then call the `services` method on each result to retrieve metadata for the services available for that dataset.

```py
for dataset in datasets:
print(dataset.services())
```
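The shape of the returned dictionary can be illustrated with a small, self-contained sketch. The payload below is hypothetical (only the concept ID and a trimmed `umm` record are taken from the discussion later in this PR); it follows the `{"provider-id": ..., "umm": ...}` record structure that `services()` builds:

```python
# Hypothetical payload shaped like the dict returned by dataset.services():
# each key is a service concept ID, each value a list of metadata records.
services = {
    "S2839491596-XYZ_PROV": [
        {
            "provider-id": "XYZ_PROV",
            "umm": {"Name": "Harmony NetCDF-to-Zarr Service", "Type": "Harmony"},
        }
    ]
}

for concept_id, records in services.items():
    for record in records:
        # Pull a human-readable summary out of each UMM-S record.
        print(f"{concept_id}: {record['umm']['Name']} ({record['umm']['Type']})")
```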
8 changes: 8 additions & 0 deletions docs/user-reference/collections/collections-services.md
@@ -0,0 +1,8 @@
# Documentation for `Collection Services`

::: earthaccess.services.DataService
options:
inherited_members: true
show_root_heading: true
show_source: false

43 changes: 43 additions & 0 deletions earthaccess/results.py
@@ -2,7 +2,10 @@
import uuid
from typing import Any, Dict, List, Optional, Union

import earthaccess

from .formatters import _repr_granule_html
from .services import DataService


class CustomDict(dict):
@@ -172,6 +175,46 @@ def s3_bucket(self) -> Dict[str, Any]:
return self["umm"]["DirectDistributionInformation"]
return {}

def services(self) -> Dict[Any, List[Dict[str, Any]]]:
"""
Returns:
A dictionary mapping each associated service's concept ID to the metadata records for that service.
"""

services = self.get("meta", {}).get("associations", {}).get("services", [])

parsed = {}
for service in services:
if earthaccess.__auth__.authenticated:
query = DataService(auth=earthaccess.__auth__).parameters(
concept_id=service
)
else:
query = DataService().parameters(concept_id=service)
results = query.get(query.hits())
parsed[service] = self._parse_service_result(results)
Collaborator:

Once you make the change I suggested in the get method, you should be able to eliminate the _parse_service_result method and just do this:

Suggested change:

-            results = query.get(query.hits())
-            parsed[service] = self._parse_service_result(results)
+            parsed[service] = query.get(query.hits())
return parsed

def _parse_service_result(self, service_results: List) -> List[Dict[str, Any]]:
"""Parse CMR query service search result.

Parameters:
service_results (list): List of service query results

Returns:
List of relevant service data
"""

parsed = []
for service_result in service_results:
result_json = json.loads(service_result)
result_item = {
"provider-id": result_json["items"][0]["meta"]["provider-id"],
"umm": result_json["items"][0]["umm"],
Collaborator:

Why are we getting only the item at index 0?

Collaborator Author (@nikki-t, Jun 3, 2024):

In the results response, the data always seems to be returned under a list of one element which contains all of the metadata. In order to provide some filtering, I chose only to return the provider_id and the UMM JSON response for each service. See attached sample-results-response.json.

It also looks like this is the case in the CMR API documentation but to be on the safe side I will figure out how to iterate over the list to make sure we don't miss anything.

Collaborator:

When you address my suggestion I just added to the get method, you should not need to use json.loads because the get method will returned parsed results.

Collaborator Author:

@chuckwondo - I can update the code to iterate over the "items" list in the CMR service query response but then I end up returning a list of lists to the end user. See attached parsed.json. What do you think of returning a list of lists that contains the items?

Here is an example:

"S2839491596-XYZ_PROV": [
        [
            {
                "provider-id": "XYZ_PROV",
                "umm": {
                    "URL": {
                        "Description": "https://harmony.earthdata.nasa.gov",
                        "URLValue": "This is the Harmony root endpoint."
                    },
                    "Type": "Harmony",
                    "ServiceKeywords": [
                        {
                            "ServiceCategory": "EARTH SCIENCE SERVICES",
                            "ServiceTopic": "DATA MANAGEMENT/DATA HANDLING",
                            "ServiceTerm": "DATA ACCESS/RETRIEVAL"
                        },
                        {
                            "ServiceCategory": "EARTH SCIENCE SERVICES",
                            "ServiceTopic": "DATA MANAGEMENT/DATA HANDLING",
                            "ServiceTerm": "DATA INTEROPERABILITY",
                            "ServiceSpecificTerm": "DATA REFORMATTING"
                        }
                    ],
                    "ServiceOrganizations": [
                        {
                            "Roles": [
                                "DEVELOPER",
                                "PUBLISHER",
                                "SERVICE PROVIDER"
                            ],
                            "ShortName": "NASA/GSFC/EOS/EOSDIS/EMD",
                            "LongName": "Maintenance and Development, Earth Observing System Data and Information System, Earth Observing System,Goddard Space Flight Center, NASA"
                        }
                    ],
                    "Description": "Backend NetCDF-to-Zarr service option description for Harmony data transformations. Cannot be chained with other operations from this record.",
                    "VersionDescription": "Semantic version number for the NetCDF-to-Zarr Docker image used by Harmony in production.",
                    "Version": "1.2.0",
                    "Name": "Harmony NetCDF-to-Zarr Service",
                    "ContactPersons": [
                        {
                            "Roles": [
                                "DEVELOPER"
                            ],
                            "FirstName": "Owen",
                            "LastName": "Littlejohns",
                            "ContactInformation": {
                                "ContactMechanisms": [
                                    {
                                        "Type": "Email",
                                        "Value": "[email protected]"
                                    }
                                ]
                            }
                        },
                        {
                            "Roles": [
                                "SERVICE PROVIDER"
                            ],
                            "FirstName": "David",
                            "LastName": "Auty",
                            "ContactInformation": {
                                "ContactMechanisms": [
                                    {
                                        "Type": "Email",
                                        "Value": "[email protected]"
                                    }
                                ]
                            }
                        }
                    ],
                    "ServiceOptions": {
                        "Aggregation": {
                            "Concatenate": {
                                "ConcatenateDefault": "False"
                            }
                        },
                        "SupportedReformattings": [
                            {
                                "SupportedInputFormat": "NETCDF-4",
                                "SupportedOutputFormats": [
                                    "ZARR"
                                ]
                            }
                        ]
                    },
                    "MetadataSpecification": {
                        "URL": "https://cdn.earthdata.nasa.gov/umm/service/v1.5.3",
                        "Name": "UMM-S",
                        "Version": "1.5.3"
                    },
                    "LongName": "Harmony NetCDF-to-Zarr Service"
                }
            }
        ]
    ]

Member:

Is there meaning to the nested list structure? If not, we could use itertools.chain() to flatten it out, IIRC.
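The flattening suggested above is a one-liner. A minimal sketch, using hypothetical nested results shaped like the list-of-lists shown earlier in this thread:

```python
from itertools import chain

# Hypothetical nested result: a list of per-page lists of service records.
nested = [
    [{"provider-id": "XYZ_PROV"}],
    [{"provider-id": "ABC_PROV"}, {"provider-id": "DEF_PROV"}],
]

# chain.from_iterable removes exactly one level of nesting, preserving order.
flat = list(chain.from_iterable(nested))
```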


}
parsed.append(result_item)
return parsed

def __repr__(self) -> str:
return json.dumps(
self.render_dict, sort_keys=False, indent=2, separators=(",", ": ")
73 changes: 73 additions & 0 deletions earthaccess/services.py
@@ -0,0 +1,73 @@
from typing import Any, List, Optional

from requests import exceptions, session

from cmr import ServiceQuery

from .auth import Auth


class DataService(ServiceQuery):
Collaborator:

Any reason this is not in search.py, like DataCollections and DataGranules are?

I'm not opposed to keeping this in a separate file, but splitting each type of query class into their own modules might perhaps be left as a task for a separate issue (after discussing whether or not we want to do so).

For consistency with existing code, I suggest moving this to search.py and also renaming it to be pluralized: DataServices.

Collaborator Author (@nikki-t, Jun 7, 2024):

I had initially wanted to place DataService in search.py but because the results module uses the DataService class to query and parse results, a circular dependency is created between the search and results modules. I also thought the DataService class might grow as we add in the plugin architecture and could serve as the main entrypoint for loading and accessing plugins.

I can look into re-architecting the code so that there is a DataServices class in search.py and DataService class in results.py to be more consistent with the Collections and Granules structure. I was initially thinking that the services would be returned for a collection rather than having a separate search_services function.

Right now the end user can query services like this:

datasets = search_datasets(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    cloud_hosted=True,
    temporal=("2024-02-27T00:00:00Z", "2024-02-29T00:00:00Z"),
)
for dataset in datasets:
    print(dataset.services())

Making it pretty easy to return service data for a collection. If I re-architect as mentioned above, the user would search a service like this:

datasets = search_datasets(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    cloud_hosted=True,
    temporal=("2024-02-27T00:00:00Z", "2024-02-29T00:00:00Z"),
)
for dataset in datasets:
    services = dataset["meta"]["associations"]["services"]
    for service in services:
         print(search_services(service))

I like how easy it is to return service data in the first code snippet but also want to make sure we are building a codebase that is consistent and easy to modify (add to) in the future as I think we want to build off of this with the plugin architecture.

Open to suggestions and/or discussing at the next hackday.

Collaborator:

Ah, I see the circular dependency you're referring to.

I still suggest you rename DataService (singular) to DataServices (plural), making it consistent with DataCollections and DataGranules.

Regarding the "easier" code you mention above, I agree, but that doesn't preclude providing a search_services method as well. Users can then do both things: (a) call dataset.services() to get the services associated with a dataset, or (b) call search_services to search for services more generally, not necessarily specific to a dataset.

Member:

Users can then do both things: (a) call dataset.services() to get the services associated with a dataset, or (b) call search_services to search for services more generally, not necessarily specific to a dataset.

Should we consider these separate features and follow-up later to add even more convenience?

Collaborator:

I don't think so. In fact, I suggest that dataset.services() simply invoke search_services.

Collaborator:

Sounds reasonable to me. My only request is to not name anything "utils." It's a pet peeve of mine because it's such a generic name, as to have no meaning. Happy to iron out kinks with you at the next hack day.

Member:

My only request is to not name anything "utils."

I love this and feel so called out 🤣 I'm very prone to creating a utils subpackage but I always regret it later and am trying to get better at it.

Collaborator Author:

Totally agree on utils 😄 but it does look like there is already a utils directory. Should we consider moving that to a different name? Or maybe I am misunderstanding!

Collaborator:

Oh, I hadn't noticed that there's already a utils. Oh well. We can worry about that another time.

Member:

If that was me, sorry 🤣

"""A Service client for NASA CMR that returns data on collection services.

API: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#service
"""

_format = "umm_json"

def __init__(self, auth: Optional[Auth] = None, *args: Any, **kwargs: Any) -> None:
"""Build an instance of DataService to query CMR.

auth is an optional parameter for queries that need authentication,
e.g. restricted datasets.

Parameters:
auth (Optional[Auth], optional): An authenticated `Auth` instance.
"""

super().__init__(*args, **kwargs)
self._debug = False
self.session = session()
if auth is not None and auth.authenticated:
# To search, we need the new bearer tokens from NASA Earthdata
self.session = auth.get_session(bearer_token=True)

def get(self, limit: int = 2000) -> List:
"""Get all service results up to some limit.

Parameters:
    limit (int): The number of results to return

Returns:
    Query results as a list
"""

page_size = min(limit, 2000)
url = self._build_url()

results = [] # type: List[str]
page = 1
while len(results) < limit:
params = {"page_size": page_size, "page_num": page}
if self._debug:
print(f"Fetching: {url}")
# TODO: implement caching
response = self.session.get(url, params=params)

try:
response.raise_for_status()
except exceptions.HTTPError as ex:
raise RuntimeError(ex.response.text)

if self._format == "json":
latest = response.json()["items"]
else:
latest = [response.text]

if len(latest) == 0:
break

results.extend(latest)
page += 1

return results
Collaborator:

This code already exists in ServiceQuery, so simply call the superclass method. Ideally, this method doesn't need to be here at all, but for now we do this simply for generating docs.

Suggested change:

-        page_size = min(limit, 2000)
-        url = self._build_url()
-        results = []  # type: List[str]
-        page = 1
-        while len(results) < limit:
-            params = {"page_size": page_size, "page_num": page}
-            if self._debug:
-                print(f"Fetching: {url}")
-            # TODO: implement caching
-            response = self.session.get(url, params=params)
-            try:
-                response.raise_for_status()
-            except exceptions.HTTPError as ex:
-                raise RuntimeError(ex.response.text)
-            if self._format == "json":
-                latest = response.json()["items"]
-            else:
-                latest = [response.text]
-            if len(latest) == 0:
-                break
-            results.extend(latest)
-            page += 1
-        return results
+        return super.get(limit)

Collaborator Author:

I think this needs to be a call to super like this: super().get(limit). Will test and implement in code.

Collaborator:

Oh, hang on. We want umm_json, not json, so we do need to implement this here because python_cmr currently parses only when the format is json, not umm_json.

However, we already implement this generally in search.get_results, so this should likely be something like so:

    from .search import get_results

    def get(self, limit: int = 2000) -> List[Any]:
        return search.get_results(self.session, self, limit)

However, this will need you to tweak search.py as well, as follows:

First, change from cmr import CollectionQuery, GranuleQuery to from cmr import CollectionQuery, GranuleQuery, ServiceQuery

Next, change this:

def get_results(
    session: requests.Session,
    query: Union[CollectionQuery, GranuleQuery],
    limit: int = 2000,
) -> List[Any]:

to this:

def get_results(
    session: requests.Session,
    query: Union[CollectionQuery, GranuleQuery, ServiceQuery],
    limit: int = 2000,
) -> List[Any]:

Collaborator Author:

When I make these changes I get a circular dependency as services.py tries to import from .search while results.py is trying to import DataCollection, DataGranule from .search. The results module uses DataService to query and parse results. Also see this comment.

Collaborator:

Hmmm, yeah, I see the circularity. We may need to rethink where to put things to avoid circularities.

2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -75,6 +75,7 @@ nav:
- "Using authenticated sessions to access data": "howto/edl.ipynb"
- "Download data from on-prem location": "howto/onprem.md"
- "Direct S3 access - Open/stream files in the cloud": "howto/cloud.md"
- "Search for services": "howto/search-services.md"
- TUTORIALS:
- "Accessing remote NASA data with fsspec": "tutorials/file-access.ipynb"
- "Search and access of restricted datasets": "tutorials/restricted-datasets.ipynb"
@@ -87,6 +88,7 @@ nav:
- Collections:
- "Collection Queries": "user-reference/collections/collections-query.md"
- "Collection Results": "user-reference/collections/collections.md"
- "Collection Services": "user-reference/collections/collections-services.md"
- Granules:
- "Granule Queries": "user-reference/granules/granules-query.md"
- "Granule Results": "user-reference/granules/granules.md"
8 changes: 8 additions & 0 deletions tests/integration/conftest.py
@@ -7,6 +7,14 @@
def pytest_sessionfinish(session, exitstatus):
if exitstatus == 0:
return

if session.testscollected == 0:
raise RuntimeError(
"Failed to initialize tests. Couldn't calculate acceptable failure rate"
" because no tests were collected."
" This can happen if credential envvars are not populated."
)

failure_rate = (100.0 * session.testsfailed) / session.testscollected
if failure_rate <= ACCEPTABLE_FAILURE_RATE:
session.exitstatus = 0
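The rate check in this hook reduces to simple arithmetic. A sketch with assumed numbers (the real `ACCEPTABLE_FAILURE_RATE` constant is imported by conftest and may differ):

```python
# Assumed threshold for illustration; the real constant lives in the test suite.
ACCEPTABLE_FAILURE_RATE = 10  # percent

# Hypothetical session counts: 2 failures out of 40 collected tests.
testsfailed, testscollected = 2, 40

failure_rate = (100.0 * testsfailed) / testscollected
# Within tolerance, so the hook would force the session exit status to 0.
within_tolerance = failure_rate <= ACCEPTABLE_FAILURE_RATE
```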