
Ancillary fetch daemon

Download ancillary data automatically.

It is run with one argument: the location of a config file. It will run endlessly, downloading according to the schedules in the config file:

fetch-service config.yaml

(and is typically run from an init script)

Additionally, you can run a single rule from the config file, ignoring any schedules. It will run the rule once immediately and exit:

fetch-now config.yaml LS7_CPF

Fetch uses file locks in its work directory to ensure that only one instance of each rule is running at a time. You can safely use fetch-now while a service is running without risking multiple instances interfering.

Development

If fetch is not installed to the system, such as during development, the scripts can alternatively be run directly as modules:

Service:

python -m fetch.scripts.service config.yaml

Now:

python -m fetch.scripts.now config.yaml LS7_CPF

Developers should refer to the docs directory and the README file therein.

Configuration file

Configuration files are loaded in YAML format (essentially nested lists and dictionaries: YAML is a superset of JSON).

An example configuration file:

# Work directory:
directory: /data/ancillary-fetch

# Notification settings (for errors):
notify:
  email: ['[email protected]']

# Download rules:
rules:

  Modis utcpole-leapsec:
    schedule: '0 7 * * mon'
    source: !http-files
      urls:
      - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/utcpole.dat
      - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/leapsec.dat
      target_dir: /eoancillarydata/sensor-specific/MODIS/

  LS8 CPF:
    schedule: '*/30 * 1 1,4,7,10 *'
    source: !rss
      url: http://landsat.usgs.gov/cpf.rss
      target_dir: /eoancillarydata/sensor-specific/LANDSAT8/CalibrationParameterFile

directory: specifies the work directory for the daemon lock and log files.

notify: allows configuration of error notification.

The third option, rules:, contains the download rules.

  • In this case there are two rules specified: one http download of utcpole/leapsec files, and an RSS feed download of CPF files.

  • Rules are prefixed by a name: in the above example they are named Modis utcpole-leapsec and LS8 CPF.

  • Names are used as an ID for the rule.

  • The source: property is our download source for the rule. It is tagged with a YAML type (!rss or !http-files in this example) to specify the type of downloader.

  • Each downloader has its own properties: usually a URL to download from and a target directory to put the files in.

  • schedule: uses standard cron syntax for the download schedule (see the examples below).
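For example, the two schedules above read as follows (standard five-field cron: minute, hour, day-of-month, month, day-of-week):

# '0 7 * * mon'         -> 07:00 every Monday
# '*/30 * 1 1,4,7,10 *' -> every 30 minutes on the 1st of January, April, July and October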

Download sources

Types of downloaders:

!http-files

Fetch static HTTP URLs.

This is useful for fixed URLs whose contents change and need to be re-downloaded periodically.

Example:

source: !http-files
  urls:
  - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/utcpole.dat
  - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/leapsec.dat
  target_dir: /eoancillarydata/sensor-specific/MODIS/

All http rules have a connection_timeout option, defaulting to 100 (seconds).
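For example, a sketch of raising the timeout for a slow server (the value is illustrative, and we assume the option sits directly on the source):

source: !http-files
  urls:
  - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/utcpole.dat
  target_dir: /eoancillarydata/sensor-specific/MODIS/
  # Allow up to five minutes per connection
  connection_timeout: 300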

!ftp-files

Like http-files, but for FTP.

source: !ftp-files
  hostname: is.sci.gsfc.nasa.gov
  paths:
  - /ancillary/ephemeris/tle/drl.tle
  - /ancillary/ephemeris/tle/norad.tle
  target_dir: /eoancillarydata/sensor-specific/MODIS/tle

!http-directory

Fetch files from an HTTP listing page.

A regexp pattern can be specified to download only matching filenames.

source: !http-directory
  url: http://rhe-neo-dev03/ancillary/gdas
  # Download only files beginning with 'gdas'
  name_pattern: gdas.*
  target_dir: '/tmp/gdas-files'

!ftp-directory

Like http-directory, but for FTP.

source: !ftp-directory
  hostname: ftp.cdc.noaa.gov
  source_dir: /Datasets/ncep.reanalysis/surface
  # Match filenames such as "pr_wtr.eatm.2014.nc"
  name_pattern: pr_wtr.eatm.[0-9]{4}.nc
  target_dir: /eoancillarydata/water_vapour/source

!rss

Download files from an RSS feed.

source: !rss
  url: http://landsat.usgs.gov/cpf.rss
  target_dir: /eoancillarydata/sensor-specific/LANDSAT8/CalibrationParameterFile

!ecmwf-api

Fetch also allows access to the batch data servers of the European Centre for Medium-Range Weather Forecasts (ECMWF). The data archive is accessed via the Python ECMWF API.

The ECMWF API requires properties to be specified as follows:

source: !ecmwf-api
  cls: ei
  dataset: interim
  date: 2005-01-03/to/2005-01-05
  area: 0/100/-50/160
  expver: 1
  grid: 0.125/0.125
  levtype: sfc
  param: 134.128
  stream: oper
  time: 00:00:00
  step: 0
  typ: an
  target: /home/547/smr547/ecmwf_data/sp_20050103_to_20050105.grib
  override_existing: True

The keys (dataset, date, area, etc.) are MARS keywords used to specify various aspects of the data retrieval. Note that the class and type keywords are spelled differently here (cls and typ) to avoid clashing with Python's class keyword and type built-in.

Request parameters are complex. ECMWF recommends using the View Request Parameters feature as you become familiar with the available ECMWF data sets. This will assist you in preparing error-free requests.

The !ecmwf-api datasource supports Transformers and the override_existing option (defaults to False). !ecmwf-api datasources can also be used with the !date-range datasource.
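A sketch of wrapping an !ecmwf-api source in a !date-range (see the !date-range section below; the rule name, schedule and paths are illustrative, and overriding date and target this way is an assumption):

ECMWF surface pressure:
  schedule: '0 6 * * *'
  source: !date-range
    start_day: -3
    end_day: 0
    overridden_properties:
      date: '{year}-{month}-{day}'
      target: /data/ecmwf/sp_{year}{month}{day}.grib
    using: !ecmwf-api
      cls: ei
      dataset: interim
      area: 0/100/-50/160
      expver: 1
      grid: 0.125/0.125
      levtype: sfc
      param: 134.128
      stream: oper
      time: 00:00:00
      step: 0
      typ: an
      # Overridden by the properties above
      date: ''
      target: ''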

Transformers

Transformers allow for dynamic folder and file names (both sources and destinations).

Downloaders supporting them have a filename_transform: property.

!date-pattern

Put the current date/time in the filename.

This takes a format string with fields 'year', 'month', 'day', 'julday' (Julian day) and 'filename' (the original filename).

Example of an FTP download:

source: !ftp-files
  hostname: is.sci.gsfc.nasa.gov
  paths:
  - /ancillary/ephemeris/tle/noaa/noaa.tle
  target_dir: /eoancillarydata/sensor-specific/NOAA/tle
  # Prepend the current date to the output filename (eg. '20141024.noaa.tle')
  filename_transform: !date-pattern '{year}{month}{day}.{filename}'

!regexp-extract

Extract fields from a filename, and use them in the destination directory.

(This requires knowledge of regular expressions including named groups)

Supply a regexp pattern with named groups. Those group names can then be used in the target folder name.

In this example, we have a pattern with three regexp groups: 'year', 'month' and 'day'. We use year and month in the target_dir.

LS8 BPF:
  schedule: '*/15 * * * *'
  source: !rss
    url: http://landsat.usgs.gov/bpf.rss
    # Extract year and month from filenames using regexp groups
    #    Example filename: 'LT8BPF20141028232827_20141029015842.01'
    filename_transform: !regexp-extract 'L[TO]8BPF(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2}).*'
    # Use these group names ('year' and 'month') in the output location:
    target_dir: /eoancillarydata/sensor-specific/LANDSAT8/BiasParameterFile/{year}/{month}

!date-range

A !date-range is a pseudo-source that repeats a source multiple times over a date range.

It takes a start_day number and an end_day number. These are relative to the current day: i.e. a start day of -3 means three (UTC) days ago.

It then overrides properties on the embedded source using each date.

Example:

Modis Att-Ephem:
  schedule: '20 */2 * * *'
  source: !date-range
    start_day: -3
    end_day: 0
    overridden_properties:
      url: http://oceandata.sci.gsfc.nasa.gov/Ancillary/Attitude-Ephemeris/{year}/{julday}
      target_dir: /eoancillarydata/sensor-specific/MODIS/ancillary/{year}/{julday}
    using: !http-directory
      name_pattern: '[AP]M1(ATT|EPH).*'
      # Overridden by the property above
      url: ''
      # Overridden by the property above
      target_dir: ''

This expands to four !http-directory downloaders: three days ago, two days ago, one day ago and today.

The properties in overridden_properties: are formatted with the given date and set on each !http-directory downloader.
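
As a sketch of the expansion (the date is purely illustrative): if this rule ran on 2014-10-24 (UTC), the four generated !http-directory downloaders would resolve url to:

http://oceandata.sci.gsfc.nasa.gov/Ancillary/Attitude-Ephemeris/2014/294
http://oceandata.sci.gsfc.nasa.gov/Ancillary/Attitude-Ephemeris/2014/295
http://oceandata.sci.gsfc.nasa.gov/Ancillary/Attitude-Ephemeris/2014/296
http://oceandata.sci.gsfc.nasa.gov/Ancillary/Attitude-Ephemeris/2014/297

(days 294-297 of 2014, i.e. 2014-10-21 through 2014-10-24), with target_dir formatted the same way.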

Post-download file processing

Post-download processing can be done with the process: field.

Currently only shell commands are supported, using the !shell processor.

For example, use GDAL to convert each downloaded file from NetCDF (*.nc) to GeoTIFF (*.tif):

Water vapour:
  schedule: '30 12 * * *'
  source: !ftp-directory
    hostname: ftp.cdc.noaa.gov
    source_dir: /Datasets/ncep.reanalysis/surface
    # Match filenames such as "pr_wtr.eatm.2014.nc"
    name_pattern: pr_wtr.eatm.[0-9]{4}.nc
    target_dir: /data/fetch/eoancil-test/water_vapour/source
  # Convert files to GeoTIFF (from NetCDF)
  process: !shell
    command: 'gdal_translate -a_srs "+proj=latlong +datum=WGS84" {parent_dir}/{filename} {parent_dir}/{file_stem}.tif'
    expect_file: '{parent_dir}/{file_stem}.tif'

Where:

  • command: is the shell command to run
  • expect_file: is the full path to an output file (allowing the fetch daemon to track newly added files)

command:, the file list in input_files:, and expect_file: are all evaluated with Python string formatting. They support the following fields, as well as any named groups captured from the input_files: pattern:

# Full name of file (eg. 'pr_wtr.eatm.2014.nc')
{filename}
# Suffix of filename (eg. '.nc')
{file_suffix}
# Filename without suffix (eg. 'pr_wtr.eatm.2014')
{file_stem}
# Directory ('/data/fetch/eoancil-test/water_vapour/source')
{parent_dir}

A more complex example involves sidecar files that must be downloaded alongside the main files but treated as a single group when the post-download process runs:

MODIS BRDF:
  schedule: '10/10 * * * *'
  source: !date-range
    # Download from twenty days ago (-20) to ten days ago (-10):
    start_day: -20
    end_day: -10
    overridden_properties:
      url: https://e4ftl01.cr.usgs.gov/MOTA/MCD43A1.006/{year}.{month}.{day}
      target_dir: /tmp/data/BRDF/MCD43A1.006/{year}.{month}.{day}
    using: !http-directory
      url: ''
      target_dir: ''
      name_pattern: 'MCD43A1\.A[0-9]{7}\.h(2[7-9]|3[0-2])v(09|1[0-3])\.006\.[0-9]{13}\.hdf'
      beforehand: !http-auth
        url: https://urs.earthdata.nasa.gov
        username: <username>
        password: <password>
  process: !shell
    command: 'swfo-convert mcd43a1 h5-md --fname {brdf_base}/{collection}/{ymd}/{basename}{hdf_ext} --outdir /tmp/data/conversion/BRDF/{collection}/{ymd}/ --filter-opts ''{{"aggression": 6}}'' --compression BLOSC_ZSTANDARD'
    input_files: ['^(?P<brdf_base>.*BRDF)/(?P<collection>.*)/(?P<ymd>[0-9]{4}\.[0-9]{2}\.[0-9]{2})/(?P<basename>.*)(?P<hdf_ext>.hdf)(?P<xml_ext>.xml)?', ['{brdf_base}/{collection}/{ymd}/{basename}{hdf_ext}', '{brdf_base}/{collection}/{ymd}/{basename}{hdf_ext}.xml']]
    expect_file: '/tmp/data/conversion/BRDF/{collection}/{ymd}/{basename}.h5'

Where:

  • command: is the shell command to run
  • input_files: is a two-element list, as sketched after this list: the first element is a regexp pattern with named groups (e.g. '^(?P<base>.*hdf)') that is applied to the full path of each downloaded file; the second element is a list of file paths, formatted with those named groups, that must all be present before the shell command is executed. This is useful when there are sidecar files.
  • expect_file: is the full path to an output file (allowing the fetch daemon to track newly added files)
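
A minimal sketch of the input_files shape (the command, paths and pattern here are hypothetical):

process: !shell
  # 'do-something' stands in for a real conversion command.
  command: 'do-something {base}.hdf'
  # First element: a regexp with named groups, applied to each downloaded file's full path.
  # Second element: the files (formatted with those groups) that must all exist before running.
  input_files: ['^(?P<base>.*)\.hdf$', ['{base}.hdf', '{base}.hdf.xml']]
  expect_file: '{base}.h5'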

Signals

Send a SIGHUP signal to reload the config file without interrupting existing downloads:

kill -1 <pid>

Send a SIGINT or SIGTERM signal to start a graceful shutdown (any active downloads will be completed before exiting).

kill <pid>