Add directory crawler populator to yield STAC Collections and Items + CLI utility #31

Merged: 12 commits into master from stac-dir-populator, Nov 16, 2023

fmigneault
Collaborator

@fmigneault fmigneault commented Nov 10, 2023

Changes

  • Add request session keyword to all request-related functions and populator methods to allow sharing a common set
    of settings (auth, SSL verify, cert) across requests toward the STAC Catalog.
  • Add DirectoryLoader that allows populating a STAC Catalog with Collections and Items loaded from a crawled directory
    hierarchy that contains collection.json files and other .json/.geojson items.
  • Add a generic CLI stac-populator that can be called to run populator implementations directly
    using the command stac-populator run <implementation> [impl-args].
  • Remove hardcoded verify=False from requests calls.
    If needed for testing purposes, users should pass a custom requests.sessions.Session with verify=False to
    the populator or, alternatively, use the CLI argument --no-verify, which has the same effect.
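
To illustrate the directory crawling described above, here is a minimal sketch. It assumes each collection directory holds a collection.json file and that every other .json/.geojson file under the root is an item. The function name crawl_stac_directory and its (kind, path, content) tuples are hypothetical; this is not the DirectoryLoader API added by this PR.

```python
import json
import os
from typing import Any, Dict, Iterator, Tuple


def crawl_stac_directory(root: str) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    """Yield (kind, path, content) for each STAC JSON document found under root.

    Hypothetical helper: kind is "collection" for collection.json files and
    "item" for any other .json/.geojson file.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        # Emit the collection before the items found in the same directory.
        if "collection.json" in filenames:
            path = os.path.join(dirpath, "collection.json")
            with open(path, encoding="utf-8") as f:
                yield "collection", path, json.load(f)
        for name in sorted(filenames):
            if name == "collection.json":
                continue
            if name.endswith((".json", ".geojson")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as f:
                    yield "item", path, json.load(f)
```

Yielding documents lazily keeps memory use flat on large hierarchies, and emitting a collection before the items in its directory lets a consumer create the collection before posting anything that depends on it.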

Testing

Run the following commands:

# get sample data
git clone https://github.com/ai-extensions/stac-data-loader /tmp/stac-eurosat

# get magpie cookie
curl \
    -k \
    -X POST \
    --cookie-jar /tmp/magpie-cookie.txt \
    -d '{"user_name":"...","password":"..."}' \
    -H 'Accept:application/json' \
    -H 'Content-Type:application/json' \
    'https://{hostname}/magpie/signin'

# run populator on data
stac-populator run DirectoryLoader \
    "https://{hostname}/stac/" \
    "/tmp/stac-eurosat/data/EuroSAT/stac/subset" \
    --update --prune \
    --no-verify --auth-handler cookie --auth-identity /tmp/magpie-cookie.txt
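
As an illustration of how the cookie jar written by curl could be reused from Python, the sketch below reads it with the standard library. The helper name load_cookie_file is hypothetical, and how the cookie auth handler actually consumes --auth-identity is not shown in this description, so treat this as an assumption rather than the populator's implementation.

```python
from http.cookiejar import MozillaCookieJar
from typing import Dict


def load_cookie_file(path: str) -> Dict[str, str]:
    """Read a Netscape-format cookie file (as written by curl --cookie-jar).

    Hypothetical helper: returns a plain name -> value mapping.
    """
    jar = MozillaCookieJar()
    # Keep session cookies and expired entries; the server decides validity.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return {cookie.name: cookie.value for cookie in jar}
```

The resulting mapping could then be applied to a shared session, for example with requests.Session().cookies.update(load_cookie_file("/tmp/magpie-cookie.txt")), setting session.verify = False only for testing.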

Results for reference:

Base automatically changed from devtools to master November 10, 2023 19:05
@fmigneault fmigneault self-assigned this Nov 14, 2023
@fmigneault fmigneault requested review from huard and dchandan November 14, 2023 15:56
@fmigneault fmigneault changed the title Add directory crawler populator to yield STAC Collections and Items Add directory crawler populator to yield STAC Collections and Items + CLI utility Nov 14, 2023
@fmigneault fmigneault marked this pull request as ready for review November 14, 2023 16:01
Collaborator

@huard huard left a comment

I'm trying to test this PR locally but having problems running the host server.
make complains about an existing pyessv-archive directory:

git clone "https://github.com/ES-DOC/pyessv-archive" ~/.esdoc/pyessv-archive
fatal: destination path '/home/david/.esdoc/pyessv-archive' already exists and is not an empty directory.
make: *** [Makefile:21: setup-pyessv-archive] Error 128

and if I go into the docker directory and do docker compose up, I'm getting errors like:

(stac) david@it-282:~/src/stac-populator/docker$ docker compose up
[+] Running 2/1
 ✔ Network docker_default            Created                                                                                                                                                          0.1s 
 ✔ Volume "docker_stac-db"           Created                                                                                                                                                          0.0s 
 ⠋ Container stac-populator-test-db  Creating                                                                                                                                                         0.0s 
Error response from daemon: Conflict. The container name "/stac-populator-test-db" is already in use by container "f701152d5d735273fc26e499b75a2bf00de839e08d18ef8f3a12c85efb5b7fda". You have to remove (or rename) that container to be able to reuse that name

I tried to remove that container, but then I get the same error with another hash.

@fmigneault
Collaborator Author

@huard If you delete the directory recursively before running make setup-pyessv-archive, does it work? An rm could be added before the git clone to make sure the error does not happen, but I prefer to leave that up to the user, to avoid removing something they did not intend to wipe.

@fmigneault
Collaborator Author

> and if I go into the docker directory and do docker compose up, I'm getting errors like:
>
> (stac) david@it-282:~/src/stac-populator/docker$ docker compose up
> [+] Running 2/1
>  ✔ Network docker_default            Created  0.1s
>  ✔ Volume "docker_stac-db"           Created  0.0s
>  ⠋ Container stac-populator-test-db  Creating 0.0s
> Error response from daemon: Conflict. The container name "/stac-populator-test-db" is already in use by container "f701152d5d735273fc26e499b75a2bf00de839e08d18ef8f3a12c85efb5b7fda". You have to remove (or rename) that container to be able to reuse that name
>
> I tried to remove that container, but then I get the same error with another hash.

Does adding --rm to the docker compose up command work?

@huard
Collaborator

huard commented Nov 15, 2023

I think I'd prefer something like this (not sure it's actually working):

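PYESSV_ARCHIVE_DIR = $(HOME)/.esdoc/pyessv-archive

# Clone the archive only if it is missing, then update it.
.PHONY: setup-pyessv-archive
setup-pyessv-archive:
	@[ -d "$(PYESSV_ARCHIVE_DIR)" ] || git clone "https://github.com/ES-DOC/pyessv-archive" "$(PYESSV_ARCHIVE_DIR)"
	cd "$(PYESSV_ARCHIVE_DIR)" && git pull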

Not sure I understand where to put the --rm.

@fmigneault
Collaborator Author

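> PYESSV_ARCHIVE_DIR = $(HOME)/.esdoc/pyessv-archive
>
> # Clone the archive only if it is missing, then update it.
> .PHONY: setup-pyessv-archive
> setup-pyessv-archive:
> 	@[ -d "$(PYESSV_ARCHIVE_DIR)" ] || git clone "https://github.com/ES-DOC/pyessv-archive" "$(PYESSV_ARCHIVE_DIR)"
> 	cd "$(PYESSV_ARCHIVE_DIR)" && git pull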

I think something similar could work.

> Not sure I understand where to put the --rm.

My bad, I confused it with the docker run command. Maybe try docker compose down --rmi, then retry make docker-start.

@huard
Collaborator

huard commented Nov 16, 2023

I think what was needed was docker container prune.

Collaborator

@huard huard left a comment

I was able to run the ingestion of the eurosat entries.

@fmigneault fmigneault merged commit 843ca45 into master Nov 16, 2023
7 checks passed
@fmigneault fmigneault deleted the stac-dir-populator branch November 16, 2023 18:20
fmigneault added a commit to bird-house/birdhouse-deploy that referenced this pull request Nov 30, 2023
## Overview

Provide a way to host local data that STAC API can refer to for use/download.

Currently, any STAC Asset that is referenced within responses by STAC-API Collections/Items must either be already hosted by another service of the stack (e.g., CMIP6 netCDF in THREDDS), or point at some other external resource not on the server.

Instead of requiring a custom config and mount point for each node, this optional component provides a standard way to define them.

## Changes

**Non-breaking changes**

- `optional-components/stac-data-proxy`: add a new feature to allow hosting of local STAC assets.

  The new component defines variables `STAC_DATA_PROXY_DIR_PATH` (default `${DATA_PERSIST_ROOT}/stac-data`) and
  `STAC_DATA_PROXY_URL_PATH` (default `/data/stac`) that are aliased (mapped) under `nginx` to provide a URL
  where locally hosted STAC assets can be downloaded from. This allows a server node to be a proper data provider,
  where its STAC-API can return Catalog, Collection and Item definitions that point at these local assets available
  through the `STAC_DATA_PROXY_URL_PATH` endpoint.

  When enabled, this component can be combined with `optional-components/secure-data-proxy` to allow per-resource
  access control of the contents under `STAC_DATA_PROXY_DIR_PATH` by setting relevant Magpie permissions under service
  `secure-data-proxy` for child resources that correspond to `STAC_DATA_PROXY_URL_PATH`. Otherwise, the path and
  all of its contents are publicly available, in the same fashion that WPS outputs are managed without
  `optional-components/secure-data-proxy`. 

  More details are provided in https://github.com/bird-house/birdhouse-deploy/blob/stac-data-proxy/birdhouse/optional-components/README.rst#provide-a-proxy-for-local-stac-asset-hosting

**Breaking changes**
- n/a

## Related Issue / Discussion

- Relates to crim-ca/stac-populator#31
- Relates to contents in https://github.com/ai-extensions/stac-data-loader/tree/main/data/EuroSAT/stac
- Relates to https://github.com/ai-extensions/stac-data-loader/blob/main/notebooks/stac_eurosat.ipynb

STAC metadata generated from the above notebook (see the subset for an example) will be able to use a location such as `https://${PAVICS_FQDN_PUBLIC}${STAC_DATA_PROXY_URL_PATH}/EuroSAT/...` instead of the temporary raw-GitHub content URLs. The STAC populator (with the `DirectoryLoader` implementation) will be able to push the STAC Collections/Items to that instance. The STAC Assets that they refer to will be placed under `${STAC_DATA_PROXY_DIR_PATH}/EuroSAT` to make them accessible externally.
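
For illustration, the mapping from a file under `STAC_DATA_PROXY_DIR_PATH` to its public URL under `STAC_DATA_PROXY_URL_PATH` could be sketched as follows. The helper name `asset_url` and the sample values in the usage note are hypothetical; the actual aliasing is performed by `nginx`, not by Python code.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def asset_url(local_path: str, dir_path: str, url_path: str, fqdn: str) -> str:
    """Map a file stored under dir_path to the URL served under url_path.

    Hypothetical helper mirroring the directory-to-URL alias described above.
    """
    # relative_to raises ValueError when local_path is not inside dir_path.
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(dir_path))
    # Percent-encode the relative path; "/" separators are left untouched.
    return f"https://{fqdn}{url_path.rstrip('/')}/{quote(relative.as_posix())}"
```

With a made-up host `example.org`, a directory `/data/stac-data` and the default URL path `/data/stac`, the file `/data/stac-data/EuroSAT/img.tif` would map to `https://example.org/data/stac/EuroSAT/img.tif`. Rejecting paths outside the configured directory avoids generating links to files that the proxy does not serve.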