-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Further revision to the new ingestion workflow #16
Conversation
Not sure why I can't add @huard as a reviewer. David, please review this PR. |
url = ds.access_urls["NCML"] | ||
|
||
LOGGER.info("Requesting NcML dataset description") | ||
r = requests.get(url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found that if you use this
r = requests.get(url, params={"catalog": catalog, "dataset": dataset})
You get more information in the response, in particular, you get a group called "THREDDSMetadata" with all the services offered by THREDDS. The line below (access_urls) would not be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my tests, I didn't find any difference between the two. Maybe I am not providing the correct arguments for the parameters? Could you give an example with url
, catalog
and dataset
values that produce the correct output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the link provided on the THREDDS web page for the NcML service.
STACpopulator/input.py
Outdated
f = NamedTemporaryFile() | ||
f.write(r.content) | ||
|
||
# Convert NcML to CF-compliant dictionary | ||
attrs = xncml.Dataset(f.name).to_cf_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to myself: I could add a from_xml
class method to xncml.Dataset
to avoid having to create a temp file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to get the thing to read from an io.StringIO
class (so an in-memory file-like object) which would also remove the need to create a disk on file but that would require changes to xncml
(pathlib
doesn't work well with such constructions). So I thought I would do that later. Let's coordinate on this since we are both not satisfied with with having to write a file to disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll make a PR for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def handle_ingestion_error(self, error: str, item_name: str, item_data: MutableMapping[str, Any]): | ||
pass | ||
|
||
def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, I suspect this method will both ingest and validate in one step, but let's keep that structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left the error handler separate in the hopes that it could handle errors from other aspects of the workflow as well (for example, while posting the stac item). But I am not married to it and I suspect the structure will evolve as our error handling needs become more clear going forward.
STACpopulator/populator_base.py
Outdated
@@ -84,21 +86,33 @@ def ingest(self) -> None: | |||
if not stac_collection_exists(self.stac_host, self.collection_id): | |||
LOGGER.info(f"Creating collection '{self.collection_name}'") | |||
pystac_collection = create_stac_collection(self.collection_id, self._collection_info) | |||
post_collection(self.stac_host, pystac_collection) | |||
post_stac_collection(self.stac_host, pystac_collection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why the collection isn't created in the base constructor ? Which would remove at the same time this check in ingest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understood you correctly: what I understand is you are proposing to move the logic around creation of the collection itself to the contractor (or called from the constructor)? I think that's not a bad idea, that logic is separate from the "ingestion" itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes to have it seperated from the ingest
STACpopulator/populator_base.py
Outdated
pass | ||
|
||
@abstractmethod | ||
def process_stac_item(self): # noqa N802 | ||
def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have create_stac_collection here too instead of in stac_utils ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Right now the create_stac_collection
is in stac_utils
because it does not contain any logic that might need to be overridden by any derived classes of STACpopulatorBase
. But, that is not to say that in future there might not be a need to override some parts of the logic. Do you think we should move all the create/post stac item/collection functions to STACpopulatorBase
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily, I think there should be a different file for all the code where requests
are sent to the stac host. In the future stac_utils
might become too big and have too many different responsabilies which would hinder the code readability. I would just keep url_validate
inside stac_utils
. Also, the create_stac_collection
could be in STACpopulatorBase
as a non-abstract method and be changed later on if there's a need for that. Let me know what you think.
Validation
@@ -0,0 +1,51 @@ | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add class bcolors
that is in .depecrated/collection_processor.py since it has not been defined and is used in post_stac_collection
Arch finalization proposal
Closed because the PR was getting too long. Will fix outstanding issues in future PRs. |
This PR makes further revision to the ingestion workflow by incorporating @huard's recent PR from another branch #14. Additional small architecture changes are included.