Further revision to the new ingestion workflow #16

dchandan · 2023-08-24T03:52:13Z

This PR further revises the ingestion workflow by incorporating @huard's recent PR #14 from another branch. It also includes some small architecture changes.

dchandan · 2023-08-24T03:53:46Z

Not sure why I can't add @huard as a reviewer. David, please review this PR.

huard · 2023-08-24T13:16:53Z

STACpopulator/input.py

+        url = ds.access_urls["NCML"]
+
+        LOGGER.info("Requesting NcML dataset description")
+        r = requests.get(url)


I've found that if you use this:

r = requests.get(url, params={"catalog": catalog, "dataset": dataset})

you get more information in the response. In particular, you get a group called "THREDDSMetadata" listing all the services offered by THREDDS, so the line below (access_urls) would not be needed.

In my tests, I didn't find any difference between the two. Maybe I am not providing the correct arguments for the parameters? Could you give an example with url, catalog and dataset values that produce the correct output?

Try this:

https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc?catalog=https%3A%2F%2Fpavics.ouranos.ca%2Ftwitcher%2Fows%2Fproxy%2Fthredds%2Fcatalog%2Fbirdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fcatalog.html&dataset=birdhouse%2Ftestdata%2Fxclim%2Fcmip6%2Fsic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc

It's the link provided on the THREDDS web page for the NcML service.
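
For reference, the link above is the NcML service URL with `catalog` and `dataset` appended as query parameters, which is what `requests.get(url, params=...)` should build as well. The sketch below reproduces the encoding with the standard library only; `build_ncml_url` is a hypothetical helper for illustration, not part of STACpopulator.

```python
from urllib.parse import urlencode


def build_ncml_url(ncml_url: str, catalog: str, dataset: str) -> str:
    """Append the catalog and dataset query parameters to an NcML service URL."""
    # urlencode percent-encodes ":" and "/" in the values, as seen in the link above.
    query = urlencode({"catalog": catalog, "dataset": dataset})
    return f"{ncml_url}?{query}"


base = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds"
dataset = "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"

full_url = build_ncml_url(
    ncml_url=f"{base}/ncml/{dataset}",
    catalog=f"{base}/catalog/birdhouse/testdata/xclim/cmip6/catalog.html",
    dataset=dataset,
)
print(full_url)
```

Passing the same two values through `params=` in `requests.get` avoids building the query string by hand.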

huard · 2023-08-24T13:20:52Z

STACpopulator/input.py

+        f = NamedTemporaryFile()
+        f.write(r.content)
+
+        # Convert NcML to CF-compliant dictionary
+        attrs = xncml.Dataset(f.name).to_cf_dict()


Note to myself: I could add a from_xml class method to xncml.Dataset to avoid having to create a temp file.

I wanted to get it to read from an io.StringIO object (an in-memory file-like object), which would also remove the need to create a file on disk, but that would require changes to xncml (pathlib doesn't work well with such constructions). So I thought I would do that later. Let's coordinate on this since neither of us is satisfied with having to write a file to disk.

Ok, I'll make a PR for this.

xarray-contrib/xncml#54
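
To illustrate the trade-off discussed in this thread, here is a minimal sketch comparing the temp-file route with in-memory parsing. It uses the standard library's `xml.etree.ElementTree` as a stand-in for xncml, and the sample NcML document and helper names are assumptions made for illustration. Note that the temp-file route needs a `flush()` before the file is reopened by name, otherwise the reader may see an incomplete file.

```python
import xml.etree.ElementTree as ET
from tempfile import NamedTemporaryFile

NCML_NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}

# Tiny hand-written NcML document standing in for the THREDDS response body.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <attribute name="source_id" value="CanESM5"/>
  <attribute name="experiment_id" value="ssp245"/>
</netcdf>
"""


def global_attrs(root: ET.Element) -> dict:
    """Collect the top-level NcML attributes into a plain dictionary."""
    return {a.get("name"): a.get("value") for a in root.findall("nc:attribute", NCML_NS)}


def attrs_via_tempfile(content: bytes) -> dict:
    """Write the response to disk, then parse it by file name."""
    with NamedTemporaryFile(suffix=".ncml") as f:
        f.write(content)
        f.flush()  # make sure the bytes are on disk before reopening by name
        return global_attrs(ET.parse(f.name).getroot())


def attrs_in_memory(content: bytes) -> dict:
    """Parse the response directly, with no file on disk."""
    return global_attrs(ET.fromstring(content))


print(attrs_in_memory(SAMPLE))
```

Reopening a `NamedTemporaryFile` by name while it is still open is not portable to Windows, which is one more argument for the in-memory route.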

huard · 2023-08-24T13:48:11Z

implementations/CMIP6-UofT/add_CMIP6.py

+    def handle_ingestion_error(self, error: str, item_name: str, item_data: MutableMapping[str, Any]):
+        pass
+
+    def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]:


In practice, I suspect this method will both ingest and validate in one step, but let's keep that structure.

I left the error handler separate in the hopes that it could handle errors from other aspects of the workflow as well (for example, while posting the stac item). But I am not married to it and I suspect the structure will evolve as our error handling needs become more clear going forward.
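
One way the separation described above could look in the ingestion loop is sketched below. Only `create_stac_item` and `handle_ingestion_error` come from the diff; `PopulatorSketch`, `ToyPopulator`, `post_item`, and the sample data are hypothetical names used for illustration, and posting is replaced by an in-memory list.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, List, MutableMapping, Tuple


class PopulatorSketch(ABC):
    """Hypothetical base class: one loop, one error handler for every step."""

    def __init__(self, items: Iterable[Tuple[str, MutableMapping[str, Any]]]):
        self.items = items
        self.posted: List[MutableMapping[str, Any]] = []

    @abstractmethod
    def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
        ...

    @abstractmethod
    def handle_ingestion_error(self, error: str, item_name: str, item_data: MutableMapping[str, Any]) -> None:
        ...

    def post_item(self, stac_item: MutableMapping[str, Any]) -> None:
        # Stand-in for the HTTP request that would post the item to the STAC host.
        self.posted.append(stac_item)

    def ingest(self) -> None:
        for name, data in self.items:
            try:
                # Errors from creation and from posting reach the same handler.
                self.post_item(self.create_stac_item(name, data))
            except Exception as exc:
                self.handle_ingestion_error(str(exc), name, data)


class ToyPopulator(PopulatorSketch):
    def __init__(self, items):
        super().__init__(items)
        self.errors = []

    def create_stac_item(self, item_name, item_data):
        if "source_id" not in item_data:
            raise ValueError("missing source_id")
        return {"id": item_name, "properties": dict(item_data)}

    def handle_ingestion_error(self, error, item_name, item_data):
        self.errors.append((item_name, error))


pop = ToyPopulator([("good", {"source_id": "CanESM5"}), ("bad", {})])
pop.ingest()
print(pop.posted, pop.errors)
```

Because the handler receives the item name and data along with the error, a failed item can be logged or retried without stopping the loop.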

Nazim-crim · 2023-08-24T15:56:18Z

STACpopulator/populator_base.py

@@ -84,21 +86,33 @@ def ingest(self) -> None:
        if not stac_collection_exists(self.stac_host, self.collection_id):
            LOGGER.info(f"Creating collection '{self.collection_name}'")
            pystac_collection = create_stac_collection(self.collection_id, self._collection_info)
-            post_collection(self.stac_host, pystac_collection)
+            post_stac_collection(self.stac_host, pystac_collection)


Any reason why the collection isn't created in the base constructor? That would also remove the need for this check in ingest.

I'm not sure I understood you correctly: are you proposing to move the collection-creation logic into the constructor (or have it called from the constructor)? I think that's not a bad idea, since that logic is separate from the "ingestion" itself.

Yes, to have it separated from the ingest.
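
A minimal sketch of this proposal, assuming an in-memory stand-in for the STAC host. `FakeStacHost`, `PopulatorWithCollection`, and `ensure_collection` are hypothetical names and not the project's API; the point is only that the existence check moves out of `ingest`.

```python
from typing import Any, Dict, Iterable


class FakeStacHost:
    """In-memory stand-in for a STAC host (assumption, for illustration only)."""

    def __init__(self) -> None:
        self.collections: Dict[str, Dict[str, Any]] = {}
        self.items: Dict[str, list] = {}

    def collection_exists(self, collection_id: str) -> bool:
        return collection_id in self.collections

    def post_collection(self, collection_id: str, body: Dict[str, Any]) -> None:
        self.collections[collection_id] = body
        self.items[collection_id] = []


class PopulatorWithCollection:
    def __init__(self, host: FakeStacHost, collection_id: str, collection_info: Dict[str, Any]):
        self.host = host
        self.collection_id = collection_id
        self.collection_info = collection_info
        # The collection is set up once here, so ingest() no longer needs the check.
        self.ensure_collection()

    def ensure_collection(self) -> None:
        if not self.host.collection_exists(self.collection_id):
            self.host.post_collection(self.collection_id, dict(self.collection_info))

    def ingest(self, items: Iterable[Dict[str, Any]]) -> None:
        # Ingestion only deals with items.
        self.host.items[self.collection_id].extend(items)


host = FakeStacHost()
populator = PopulatorWithCollection(host, "cmip6", {"title": "CMIP6"})
populator.ingest([{"id": "item-1"}])
print(host.collections, host.items)
```

One consequence of this design is that constructing the populator now talks to the host, so a connection problem would surface at construction time rather than during ingestion.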

Nazim-crim · 2023-08-24T16:01:20Z

STACpopulator/populator_base.py

        pass

    @abstractmethod
-    def process_stac_item(self):  # noqa N802
+    def create_stac_item(self, item_name: str, item_data: MutableMapping[str, Any]) -> MutableMapping[str, Any]:


Should we have create_stac_collection here too instead of in stac_utils ?

Good question. Right now create_stac_collection is in stac_utils because it does not contain any logic that derived classes of STACpopulatorBase might need to override. That is not to say such a need will never arise. Do you think we should move all the create/post STAC item/collection functions to STACpopulatorBase?

Not necessarily. I think there should be a separate file for all the code that sends requests to the STAC host. Otherwise stac_utils might eventually become too big and take on too many different responsibilities, which would hinder code readability. I would keep only url_validate inside stac_utils. Also, create_stac_collection could live in STACpopulatorBase as a non-abstract method and be changed later if the need arises. Let me know what you think.
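
The layout suggested here could be sketched roughly as follows: request code kept in its own function (which would live in a separate module), and `create_stac_collection` as a non-abstract method with a default that a derived class may override. The class names, the dictionary-based collection, and the `store` argument are assumptions made for illustration.

```python
from typing import Any, Dict


def post_stac_collection(store: Dict[str, Dict[str, Any]], collection: Dict[str, Any]) -> None:
    """Stand-in for the request code that would live in its own module."""
    store[collection["id"]] = collection


class BaseSketch:
    def __init__(self, collection_id: str, info: Dict[str, Any]):
        self.collection_id = collection_id
        self.info = info

    def create_stac_collection(self) -> Dict[str, Any]:
        # Non-abstract: a sensible default that most populators can reuse as-is.
        return {"type": "Collection", "id": self.collection_id, **self.info}


class CMIP6Sketch(BaseSketch):
    def create_stac_collection(self) -> Dict[str, Any]:
        # Override only when a derived class needs extra logic.
        collection = super().create_stac_collection()
        collection["keywords"] = [*collection.get("keywords", []), "CMIP6"]
        return collection


store: Dict[str, Dict[str, Any]] = {}
post_stac_collection(store, BaseSketch("generic", {"title": "Generic"}).create_stac_collection())
post_stac_collection(store, CMIP6Sketch("cmip6", {"title": "CMIP6"}).create_stac_collection())
print(store)
```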

Nazim-crim · 2023-09-18T16:31:28Z

STACpopulator/api_requests.py

@@ -0,0 +1,51 @@
+import os


Add the bcolors class from .depecrated/collection_processor.py here: it is used in post_stac_collection but has not been defined in this file.
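
For context, a `bcolors` class is usually a small container of ANSI escape sequences for coloring terminal output. The sketch below shows the common pattern; the attribute names and the `colorize` helper are assumptions, and the actual class in collection_processor.py may differ.

```python
class bcolors:
    """ANSI escape sequences for coloring terminal output (assumed attribute names)."""

    OKGREEN = "\033[92m"
    WARNING = "\033[93m"
    FAIL = "\033[91m"
    ENDC = "\033[0m"  # resets the terminal color


def colorize(message: str, color: str) -> str:
    """Wrap a message in a color code and reset the color afterwards."""
    return f"{color}{message}{bcolors.ENDC}"


print(colorize("Collection created", bcolors.OKGREEN))
print(colorize("Collection already exists", bcolors.WARNING))
```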

dchandan · 2023-11-08T21:20:12Z

Closed because the PR was getting too long. Will fix outstanding issues in future PRs.

dchandan added 5 commits August 23, 2023 23:42

Re-architecting the loader classes

37de655

updating gitignore

12305af

further developing the ingestion loop

ca45cc3

moving post_stac_item to stac_utils

6e500d8

renaming post_collection to post_stac_collection

f93b10d

dchandan requested a review from Nazim-crim August 24, 2023 03:52

huard approved these changes Aug 24, 2023

Nazim-crim approved these changes Aug 24, 2023

dchandan and others added 19 commits August 25, 2023 11:42

moving collection creation to seaparate function

d5a3d2d

moving create_stac_collection to STACpopulatorBase

946e72a

moving all STAC API calls to separate file

3ee1e6e

Create pydantic data model for CMIP6 CV

db008a5

Merge branch 'arch-changes' into validation

92c5b74

create STAC item from TDS NcML response - untested due to missing STAC host

fc2daf3

Suggestions from Deepak. Use xncml 0.3 from_text

ab09580

black

86dddd4

Fix errors in metadata parsing

1c7aa53

change dchandan user to testuser

dd171a6

Merge pull request #17 from Ouranosinc/validation

7e2903a

Validation

adding type hints to collection2enum

99d17b5

implementation for post_stac_item + logger changes

94a43a5

removing validator from CMIP6populator

e193c28

makefile

a0770a7

simplifying the ingestion logic

b481f37

comments and small changes to the posting functions

d28ab74

fixing issue with thredds metadata for attributes with type tag

933d003

Replaced enums by literal for CMIP6 CV

f697be9

Nazim-crim reviewed Sep 18, 2023

Implemented post_stac_item, using a hash of the item attributes as the ID.

74ad594

dchandan added 27 commits October 12, 2023 11:30

code cleanup

94eb521

change how prefix is applied

a64a226

PR changes

f22c1a2

fixing output media type and roles output for assets

efd9230

adding magpie resource link

3e88591

adding collection resource link for Magpie

8d66fba

posting items fixes

00a968a

removing function no longer in use

2c3b49d

implemented updating stac collection and items

6908d55

removing need to pass yml file to app on command line

0c959ea

code cleanup

73b2773

adding __init__ files

9e919c2

fix

c62fb80

more fixes

10db128

diagnostics

25985db

removing unused code

6d675bc

refactoring to allow more flexibility

65bd5bb

fix datacube extension

f540dbe

pr changes

323c945

reverting to old way to read thredds access links

0581c61

adding ability to get single file from THREDDS loader

37a26e1

making make_cmip6_item_id a staticmethod

e55591d

wrapping call to make STAC item with a try-exepcet block

f1e28db

fixing commit e55591d

8bb21e1

more fixes to previous commits

3055afc

making tracking_id optional in CMIP6ItemProperties

3f1d284

Merge pull request #25 from crim-ca/arch-finalization-proposal

26fe4ad

Arch finalization proposal

dchandan merged commit 9cd2ced into master Nov 8, 2023

dchandan deleted the arch-changes branch January 17, 2024 16:35
