This repository houses data and config used to create STAC records to be published to the US GHG Center Data Catalog. Inclusion in the US GHG Center catalog is a prerequisite for displaying the dataset in the US GHG Center web portal.
The repo follows this folder structure:
```
ingestion-data/
├── collections/
│   ├── archive/
│   ├── collection-1.json
│   ├── collection-2.json
│   ├── ...
│   └── collection-n.json
├── discovery-items/
│   ├── archive/
│   │   ├── archived-discovery-items-1.json
│   │   ├── archived-discovery-items-2.json
│   │   ├── ...
│   │   └── archived-discovery-items-n.json
│   ├── discovery-items-1.json
│   ├── discovery-items-2.json
│   ├── ...
│   └── discovery-items-n.json
└── notebooks/
```
The `collections/` folder contains the STAC collection records for all the available datasets. Each record should conform to the STAC specification for a collection.
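For reference, a bare-bones collection record looks roughly like this. This is a minimal sketch of the fields the STAC collection specification requires, not an actual GHG Center collection; all values are placeholders:

```json
{
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "<collection_id>",
    "title": "<title>",
    "description": "<description>",
    "license": "<license>",
    "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {"interval": [["<start_datetime>", null]]}
    },
    "links": []
}
```

Real records typically carry additional metadata (providers, renders, summaries, and so on) on top of these required fields.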
The `collections/archive/` folder holds the collections that we no longer update. However, we might still maintain them in the catalog.
The `discovery-items/` folder contains the item ingestion config files used by our data pipelines (Airflow). Specifically, the `veda_discover` DAG in `veda-data-airflow` discovers all the files specified in a config and triggers the `veda_ingest_raster` DAG, which takes care of creating the STAC items and publishing them.
The format looks like this:
```json
{
    "collection": "<coll_name>",
    "bucket": "<bucket>",
    "prefix": "<prefix>/",
    "filename_regex": "<file_regex>",
    "id_regex": "<id_regex>",
    "id_template": "<id_template_string>",
    "datetime_range": "<year|month|day>",
    "assets": {
        "<asset1_name>": {
            "title": "<asset_title>",
            "description": "<asset_description>",
            "regex": "<asset_regex>"
        },
        "<asset2_name>": {
            "title": "<asset_title>",
            "description": "<asset_description>",
            "regex": "<asset_regex>"
        }
    }
}
```

These configs are also used to transfer assets from the dev bucket (`ghgc-data-store-develop`, where the data is delivered) to the production bucket (`ghgc-data-store`, where the data is moved after it is finalized). The files from the production bucket are used to publish to the catalog. The transfer is done by triggering the `veda_transfer` DAG in `veda-data-airflow`.
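As a minimal sketch of how such a DAG run could be triggered, assuming an Airflow 2.x deployment with the stable REST API enabled (the host, credentials, and config path below are placeholders, and the actual GHG Center deployment may expose its DAGs differently):

```python
# Hypothetical sketch: trigger the veda_discover DAG with a discovery config
# via Airflow's stable REST API. Assumes the DAG reads its config from the
# run conf; host, credentials, and file path are placeholders.
import json

import requests

AIRFLOW_URL = "https://<airflow-host>/api/v1"  # placeholder host

# Hypothetical config file path within this repo.
with open("ingestion-data/discovery-items/discovery-items-1.json") as f:
    discovery_config = json.load(f)

response = requests.post(
    f"{AIRFLOW_URL}/dags/veda_discover/dagRuns",
    json={"conf": discovery_config},
    auth=("<username>", "<password>"),  # placeholder credentials
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```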
The fields of the discovery config are described below:

| Field | Description |
|---|---|
| `collection` | The collection id for the collection that the items will be ingested into |
| `bucket` | The S3 bucket where the item files are located |
| `prefix` | The S3 prefix under which to search for the files |
| `filename_regex` | The regex pattern that the files to be discovered should match |
| `id_regex` | A regex specifying which part of the filename (usually the datetime) should be used to group assets into an item. For example, if the filenames are `asset1_20151201.tif`, `asset2_20151201.tif`, `asset1_20161201.tif`, and `asset2_20161201.tif`, the items should be grouped by the datetime part, so the regex would be `".*_(.*).tif$"`. The relevant part must be wrapped in round brackets (a capture group). This is also the part of the filename that, together with the `id_template` field, forms the item id (see the worked example after this table). |
| `id_template` | A Python f-string-style template that defines the id of the STAC item, filled in with the value captured by `id_regex`. Continuing the example above, if the `id_template` is `eccodarwin-{}`, the two item ids would be `eccodarwin-20151201` and `eccodarwin-20161201`. |
| `datetime_range` | Specifies how to extract the datetime range from the filename. Valid values are `day`, `month`, and `year`. For example, if the filename contains `20160104` and `datetime_range` is `day`, the `start_datetime` and `end_datetime` are the start and end of that day; for `month`, they are the start and end of the month, and so on. |
| `assets.<asset_name>` | An id for the asset |
| `assets.<asset_name>.title` | A title for the asset |
| `assets.<asset_name>.description` | A description for the asset |
| `assets.<asset_name>.regex` | The regex pattern that matches a filename to its respective asset |
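To make this concrete, here is a hypothetical filled-in config built from the `eccodarwin` example used in the table above. The production bucket name comes from this README; the prefix, asset names, titles, and descriptions are illustrative only, not an actual GHG Center config:

```json
{
    "collection": "eccodarwin",
    "bucket": "ghgc-data-store",
    "prefix": "eccodarwin/",
    "filename_regex": ".*.tif$",
    "id_regex": ".*_(.*).tif$",
    "id_template": "eccodarwin-{}",
    "datetime_range": "day",
    "assets": {
        "asset1": {
            "title": "Asset 1",
            "description": "First example asset",
            "regex": "asset1_.*.tif$"
        },
        "asset2": {
            "title": "Asset 2",
            "description": "Second example asset",
            "regex": "asset2_.*.tif$"
        }
    }
}
```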
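The following is a minimal Python sketch of how `id_regex`, `id_template`, and `datetime_range` interact, using the filenames from the example above. It only mimics the behavior described in the table; the real logic lives in the `veda-data-airflow` pipelines:

```python
# Sketch of grouping assets into items and forming item ids.
import re
from collections import defaultdict
from datetime import datetime, timedelta

ID_REGEX = r".*_(.*).tif$"
ID_TEMPLATE = "eccodarwin-{}"

filenames = [
    "asset1_20151201.tif",
    "asset2_20151201.tif",
    "asset1_20161201.tif",
    "asset2_20161201.tif",
]

# Group assets by the part of the filename captured by id_regex,
# and form each item id with id_template.
items = defaultdict(list)
for name in filenames:
    match = re.match(ID_REGEX, name)
    items[ID_TEMPLATE.format(match.group(1))].append(name)

print(dict(items))
# {'eccodarwin-20151201': ['asset1_20151201.tif', 'asset2_20151201.tif'],
#  'eccodarwin-20161201': ['asset1_20161201.tif', 'asset2_20161201.tif']}

# datetime_range = "day": expand the captured datetime to the start
# and end of that day.
start = datetime.strptime("20151201", "%Y%m%d")
end = start + timedelta(days=1) - timedelta(seconds=1)
print(start.isoformat(), end.isoformat())
# 2015-12-01T00:00:00 2015-12-01T23:59:59
```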
The `discovery-items/archive/` folder contains the discovery-items configs for collections that we no longer update.
Sometimes there are exceptional datasets that require a one-off ingestion not supported by the current state of our data pipelines. In such cases, we create notebooks or Python scripts to ingest that data; the `notebooks/` folder is where they live.