Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further revision to the new ingestion workflow #16

Merged
merged 74 commits into from
Nov 8, 2023
Merged

Conversation

dchandan
Copy link
Collaborator

This PR makes further revision to the ingestion workflow by incorporating @huard's recent PR from another branch #14. Additional small architecture changes are included.

@dchandan dchandan requested a review from Nazim-crim August 24, 2023 03:52
@dchandan
Copy link
Collaborator Author

Not sure why I can't add @huard as a reviewer. David, please review this PR.

url = ds.access_urls["NCML"]

LOGGER.info("Requesting NcML dataset description")
r = requests.get(url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found that if you use this
r = requests.get(url, params={"catalog": catalog, "dataset": dataset})

You get more information in the response, in particular, you get a group called "THREDDSMetadata" with all the services offered by THREDDS. The line below (access_urls) would not be needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my tests, I didn't find any difference between the two. Maybe I am not providing the correct arguments for the parameters? Could you give an example with url, catalog and dataset values that produce the correct output?

Copy link
Collaborator

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the link provided on the THREDDS web page for the NcML service.

Comment on lines 87 to 91
f = NamedTemporaryFile()
f.write(r.content)

# Convert NcML to CF-compliant dictionary
attrs = xncml.Dataset(f.name).to_cf_dict()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself: I could add a from_xml class method to xncml.Dataset to avoid having to create a temp file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to get the thing to read from an io.StringIO class (so an in-memory file-like object) which would also remove the need to create a disk on file but that would require changes to xncml (pathlib doesn't work well with such constructions). So I thought I would do that later. Let's coordinate on this since we are both not satisfied with with having to write a file to disk.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll make a PR for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def handle_ingestion_error(self, error: str, item_name: str, item_data: MutableMapping[str, Any]):
pass

def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, I suspect this method will both ingest and validate in one step, but let's keep that structure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the error handler separate in the hopes that it could handle errors from other aspects of the workflow as well (for example, while posting the stac item). But I am not married to it and I suspect the structure will evolve as our error handling needs become more clear going forward.

@@ -84,21 +86,33 @@ def ingest(self) -> None:
if not stac_collection_exists(self.stac_host, self.collection_id):
LOGGER.info(f"Creating collection '{self.collection_name}'")
pystac_collection = create_stac_collection(self.collection_id, self._collection_info)
post_collection(self.stac_host, pystac_collection)
post_stac_collection(self.stac_host, pystac_collection)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why the collection isn't created in the base constructor ? Which would remove at the same time this check in ingest

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understood you correctly: what I understand is you are proposing to move the logic around creation of the collection itself to the contractor (or called from the constructor)? I think that's not a bad idea, that logic is separate from the "ingestion" itself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes to have it seperated from the ingest

pass

@abstractmethod
def process_stac_item(self): # noqa N802
def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have create_stac_collection here too instead of in stac_utils ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Right now the create_stac_collection is in stac_utils because it does not contain any logic that might need to be overridden by any derived classes of STACpopulatorBase. But, that is not to say that in future there might not be a need to override some parts of the logic. Do you think we should move all the create/post stac item/collection functions to STACpopulatorBase?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily, I think there should be a different file for all the code where requests are sent to the stac host. In the future stac_utils might become too big and have too many different responsabilies which would hinder the code readability. I would just keep url_validate inside stac_utils. Also, the create_stac_collection could be in STACpopulatorBase as a non-abstract method and be changed later on if there's a need for that. Let me know what you think.

@@ -0,0 +1,51 @@
import os
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add class bcolors that is in .depecrated/collection_processor.py since it has not been defined and is used in post_stac_collection

@dchandan
Copy link
Collaborator Author

dchandan commented Nov 8, 2023

Closed because the PR was getting too long. Will fix outstanding issues in future PRs.

@dchandan dchandan merged commit 9cd2ced into master Nov 8, 2023
@dchandan dchandan deleted the arch-changes branch January 17, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants