Download ancillary data automatically.

It is run with one argument: a config file location. This will run endlessly, downloading according to schedules in the config file:

```bash
fetch-service config.yaml
```
(and is typically run from an init script)
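
As a sketch only (the install path, config location and unit name here are assumptions, not part of fetch), a minimal systemd unit for running it as a service might look like:

```ini
[Unit]
Description=Ancillary data fetch service
After=network-online.target

[Service]
# Assumed install and config locations:
ExecStart=/usr/local/bin/fetch-service /etc/fetch/config.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```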

Additionally, you can run a single rule from the config file, ignoring any schedules. It will run the rule once immediately and exit:

```bash
fetch-now config.yaml LS7_CPF
```
Fetch uses file locks in its work directory to ensure that only one instance of each rule is running at a time. You can safely use `fetch-now` while a service is running without risking multiple instances interfering.
If not installed to the system, such as during development, they can alternatively be run directly from the modules:

Service:

```bash
python -m fetch.scripts.service config.yaml
```

Now:

```bash
python -m fetch.scripts.now config.yaml LS7_CPF
```
## Development

Developers should refer to the `docs` directory and the README file therein.
## Configuration

Configuration files are loaded in YAML format (essentially nested lists and dictionaries: YAML is a superset of JSON).

An example configuration file:
```yaml
# Work directory:
directory: /data/ancillary-fetch

# Notification settings (for errors):
notify:
  email: ['[email protected]']

# Download rules:
rules:

  Modis utcpole-leapsec:
    schedule: '0 7 * * mon'
    source: !http-files
      urls:
      - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/utcpole.dat
      - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/leapsec.dat
      target_dir: /eoancillarydata/sensor-specific/MODIS/

  LS8 CPF:
    schedule: '*/30 * 1 1,4,7,10 *'
    source: !rss
      url: http://landsat.usgs.gov/cpf.rss
      target_dir: /eoancillarydata/sensor-specific/LANDSAT8/CalibrationParameterFile
```
`directory:` specifies the work directory for the daemon lock and log files.

`notify:` allows configuration of error notification.

The third option contains the download rules (`rules:`).
- In this case there are two rules specified: one HTTP download of utcpole/leapsec files, and an RSS feed download of CPF files.
- Rules are prefixed by a name: in the above example they are named `Modis utcpole-leapsec` and `LS8 CPF`. Names are used as an ID for the rule.
- The `source:` property is the download source for the rule. It is tagged with a YAML type (`!rss` or `!http-files` in this example) to specify the type of downloader.
    - Each downloader has its own properties: usually the URL to download from, and a target directory to put the files in.
- `schedule:` uses standard cron syntax for the download schedule.
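
For example, the schedule `'0 7 * * mon'` above runs at 07:00 every Monday, and `'*/30 * 1 1,4,7,10 *'` runs every 30 minutes on the first day of January, April, July and October.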
## Types of downloaders
### HTTP files (`!http-files`)

Fetch static HTTP URLs.

This is useful for fixed URLs whose contents are updated over time and need to be re-downloaded.
Example:

```yaml
source: !http-files
  urls:
  - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/utcpole.dat
  - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/leapsec.dat
  target_dir: /eoancillarydata/sensor-specific/MODIS/
```
All HTTP rules have a `connection_timeout` option, defaulting to 100 seconds.
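
For example (a sketch: we assume the option sits alongside the other source properties):

```yaml
source: !http-files
  urls:
  - http://oceandata.sci.gsfc.nasa.gov/Ancillary/LUTs/modis/utcpole.dat
  target_dir: /eoancillarydata/sensor-specific/MODIS/
  # Allow a slow server five minutes to respond, instead of the default 100 seconds:
  connection_timeout: 300
```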
### FTP files (`!ftp-files`)

Like `!http-files`, but for FTP.
```yaml
source: !ftp-files
  hostname: is.sci.gsfc.nasa.gov
  paths:
  - /ancillary/ephemeris/tle/drl.tle
  - /ancillary/ephemeris/tle/norad.tle
  target_dir: /eoancillarydata/sensor-specific/MODIS/tle
```
### HTTP directory listing (`!http-directory`)

Fetch files from an HTTP listing page.

A regexp pattern can be supplied to download only matching filenames.
```yaml
source: !http-directory
  url: http://rhe-neo-dev03/ancillary/gdas
  # Download only files beginning with 'gdas'
  name_pattern: gdas.*
  target_dir: '/tmp/gdas-files'
```
### FTP directory listing (`!ftp-directory`)

Like `!http-directory`, but for FTP.
```yaml
source: !ftp-directory
  hostname: ftp.cdc.noaa.gov
  source_dir: /Datasets/ncep.reanalysis/surface
  # Match filenames such as "pr_wtr.eatm.2014.nc"
  name_pattern: pr_wtr.eatm.[0-9]{4}.nc
  target_dir: /eoancillarydata/water_vapour/source
```
### RSS (`!rss`)

Download files from an RSS feed.
```yaml
source: !rss
  url: http://landsat.usgs.gov/cpf.rss
  target_dir: /eoancillarydata/sensor-specific/LANDSAT8/CalibrationParameterFile
```
### ECMWF (`!ecmwf-api`)

Fetch can also access the batch data servers of the European Centre for Medium-Range Weather Forecasts (ECMWF). The data archive is accessed via the Python ECMWF API.

The ECMWF API requires properties to be specified as follows:
```yaml
source: !ecmwf-api
  cls: ei
  dataset: interim
  date: 2005-01-03/to/2005-01-05
  area: 0/100/-50/160
  expver: 1
  grid: 0.125/0.125
  levtype: sfc
  param: 134.128
  stream: oper
  time: 00:00:00
  step: 0
  typ: an
  target: /home/547/smr547/ecmwf_data/sp_20050103_to_20050105.grib
  override_existing: True
```
The keys (`dataset`, `date`, `area`, etc.) are MARS keywords used to specify various aspects of the data retrieval. Please note that the `class` and `type` keywords are spelled differently here (`cls` and `typ`) to avoid clashing with reserved names in Python.
Request parameters are complex. ECMWF recommends using the View Request Parameters feature as you become familiar with the available ECMWF data sets; this will help you prepare error-free requests.
The `!ecmwf-api` datasource supports transformers and the `override_existing` option (defaults to `False`).

`!ecmwf-api` datasources can also be used with the `!date-range` datasource (described below).
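
A hypothetical sketch of that combination, assuming the request's `date:` and `target:` are the per-day overridden properties (the exact keys to override depend on your request):

```yaml
source: !date-range
  start_day: -3
  end_day: 0
  overridden_properties:
    date: '{year}-{month}-{day}'
    target: /data/ecmwf/sp_{year}{month}{day}.grib
  using: !ecmwf-api
    cls: ei
    dataset: interim
    levtype: sfc
    param: 134.128
    stream: oper
    time: 00:00:00
    step: 0
    typ: an
    # Overridden by the properties above:
    date: ''
    target: ''
```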
## Transformers

Transformers allow for dynamic folder and file names (both sources and destinations).

Downloaders that support them have a `filename_transform:` property.
### Date pattern (`!date-pattern`)

Put the current date/time in the filename.

This takes a format string with the properties 'year', 'month', 'day', 'julday' (Julian day) and 'filename' (the original filename).
Example of an FTP download:

```yaml
source: !ftp-files
  hostname: is.sci.gsfc.nasa.gov
  paths:
  - /ancillary/ephemeris/tle/noaa/noaa.tle
  target_dir: /eoancillarydata/sensor-specific/NOAA/tle
  # Prepend the current date to the output filename (eg. '20141024.noaa.tle')
  filename_transform: !date-pattern '{year}{month}{day}.{filename}'
```
### Regexp extract (`!regexp-extract`)

Extract fields from a filename and use them in the destination directory.

(This requires knowledge of regular expressions, including named groups.)

Supply a regexp pattern with named groups; those group names can then be used in the target folder name.

In this example, we have a pattern with three regexp groups: 'year', 'month' and 'day'. We use year and month in the `target_dir`.
```yaml
LS8 BPF:
  schedule: '*/15 * * * *'
  source: !rss
    url: http://landsat.usgs.gov/bpf.rss
    # Extract year and month from filenames using regexp groups
    # Example filename: 'LT8BPF20141028232827_20141029015842.01'
    filename_transform: !regexp-extract 'L[TO]8BPF(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2}).*'
    # Use these group names ('year' and 'month') in the output location:
    target_dir: /eoancillarydata/sensor-specific/LANDSAT8/BiasParameterFile/{year}/{month}
```
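
For the example filename above, the pattern extracts `year=2014` and `month=10`, so the file lands in `/eoancillarydata/sensor-specific/LANDSAT8/BiasParameterFile/2014/10`.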
## Date range (`!date-range`)

A `!date-range` is a pseudo-source that repeats a source multiple times over a date range.

It takes a `start_day` number and an `end_day` number. These are relative to the current day: i.e. a start day of -3 means three (UTC) days ago.

It then overrides properties on the embedded source using each date.
Example:
```yaml
Modis Att-Ephem:
  schedule: '20 */2 * * *'
  source: !date-range
    start_day: -3
    end_day: 0
    overridden_properties:
      url: http://oceandata.sci.gsfc.nasa.gov/Ancillary/Attitude-Ephemeris/{year}/{julday}
      target_dir: /eoancillarydata/sensor-specific/MODIS/ancillary/{year}/{julday}
    using: !http-directory
      name_pattern: '[AP]M1(ATT|EPH).*'
      # Overridden by the property above
      url: ''
      # Overridden by the property above
      target_dir: ''
```
This expands to four `!http-directory` downloaders: three days ago, two days ago, one day ago, and today.

The properties in `overridden_properties:` are formatted with the given date and set on each `!http-directory` downloader.
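
For illustration, the downloader generated for a date falling on Julian day 296 of 2014 would be roughly equivalent to:

```yaml
source: !http-directory
  name_pattern: '[AP]M1(ATT|EPH).*'
  url: http://oceandata.sci.gsfc.nasa.gov/Ancillary/Attitude-Ephemeris/2014/296
  target_dir: /eoancillarydata/sensor-specific/MODIS/ancillary/2014/296
```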
## Post-download processing

Post-download processing can be done with the `process:` field.

Currently only shell commands are supported, using the `!shell` processor.

For example, use GDAL to convert each downloaded file from NetCDF (`.nc`) to TIFF (`.tif`):
```yaml
Water vapour:
  schedule: '30 12 * * *'
  source: !ftp-directory
    hostname: ftp.cdc.noaa.gov
    source_dir: /Datasets/ncep.reanalysis/surface
    # Match filenames such as "pr_wtr.eatm.2014.nc"
    name_pattern: pr_wtr.eatm.[0-9]{4}.nc
    target_dir: /data/fetch/eoancil-test/water_vapour/source
  # Convert files to tiff (from netCDF)
  process: !shell
    command: 'gdal_translate -a_srs "+proj=latlong +datum=WGS84" {parent_dir}/{filename} {parent_dir}/{file_stem}.tif'
    expect_file: '{parent_dir}/{file_stem}.tif'
```
Where:

- `command:` is the shell command to run
- `expect_file:` is the full path to an output file (allowing the fetch daemon to track newly added files)

`command:`, the list of files in `input_files:`, and `expect_file:` are all evaluated with Python string formatting, supporting the following fields as well as any named groups found in the `input_files:` pattern:
```
# Full name of the file (eg. 'pr_wtr.eatm.2014.nc')
{filename}
# Suffix of the filename (eg. '.nc')
{file_suffix}
# Filename without suffix (eg. 'pr_wtr.eatm.2014')
{file_stem}
# Directory (eg. '/data/fetch/eoancil-test/water_vapour/source')
{parent_dir}
```
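
For the water-vapour rule above, a downloaded file `pr_wtr.eatm.2014.nc` would expand the command to:

```bash
gdal_translate -a_srs "+proj=latlong +datum=WGS84" \
    /data/fetch/eoancil-test/water_vapour/source/pr_wtr.eatm.2014.nc \
    /data/fetch/eoancil-test/water_vapour/source/pr_wtr.eatm.2014.tif
```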
A more complex example, involving sidecar files that are downloaded alongside the main file but must be treated as a single group when post-download processing takes place:
```yaml
MODIS BRDF:
  schedule: '10/10 * * * *'
  source: !date-range
    # Download from 20 days ago to 10 days ago:
    start_day: -20
    end_day: -10
    overridden_properties:
      url: https://e4ftl01.cr.usgs.gov/MOTA/MCD43A1.006/{year}.{month}.{day}
      target_dir: /tmp/data/BRDF/MCD43A1.006/{year}.{month}.{day}
    using: !http-directory
      url: ''
      target_dir: ''
      name_pattern: 'MCD43A1\.A[0-9]{7}\.h(2[7-9]|3[0-2])v(09|1[0-3])\.006\.[0-9]{13}\.hdf'
      beforehand: !http-auth
        url: https://urs.earthdata.nasa.gov
        username: <username>
        password: <password>
  process: !shell
    command: 'swfo-convert mcd43a1 h5-md --fname {brdf_base}/{collection}/{ymd}/{basename}{hdf_ext} --outdir /tmp/data/conversion/BRDF/{collection}/{ymd}/ --filter-opts ''{{"aggression": 6}}'' --compression BLOSC_ZSTANDARD'
    input_files: ['^(?P<brdf_base>.*BRDF)/(?P<collection>.*)/(?P<ymd>[0-9]{4}\.[0-9]{2}\.[0-9]{2})/(?P<basename>.*)(?P<hdf_ext>.hdf)(?P<xml_ext>.xml)?', ['{brdf_base}/{collection}/{ymd}/{basename}{hdf_ext}', '{brdf_base}/{collection}/{ymd}/{basename}{hdf_ext}.xml']]
    expect_file: '/tmp/data/conversion/BRDF/{collection}/{ymd}/{basename}.h5'
```
Where:

- `command:` is the shell command to run
- `input_files:` contains a regexp pattern and a list of expected files, which are checked before the post-process command is run
- `expect_file:` is the full path to an output file (allowing the fetch daemon to track newly added files)

`input_files:` is useful when there are sidecar files. Its value is a list whose first element is a regexp pattern, e.g. `'^(?P<base>.*hdf)'`. This is applied to the full name of the downloaded file to create named groups, which are then used in the second element. The second element is a list of files that must all be present before the shell command is executed.
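
As a worked example (the date here is hypothetical), suppose the BRDF rule downloads `/tmp/data/BRDF/MCD43A1.006/2019.01.01/MCD43A1.A2019001.h29v10.006.2019010031310.hdf`. The `input_files:` pattern extracts `brdf_base=/tmp/data/BRDF`, `collection=MCD43A1.006`, `ymd=2019.01.01`, `basename=MCD43A1.A2019001.h29v10.006.2019010031310` and `hdf_ext=.hdf`. The second element then requires both the `.hdf` file and its `.hdf.xml` sidecar to be present before the command runs, and `expect_file:` resolves to `/tmp/data/conversion/BRDF/MCD43A1.006/2019.01.01/MCD43A1.A2019001.h29v10.006.2019010031310.h5`.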
## Signals

Send a `SIGHUP` signal to reload the config file without interrupting existing downloads:

```bash
kill -1 <pid>
```
Send a `SIGINT` or `SIGTERM` signal to start a graceful shutdown (any active downloads will be completed before exiting):

```bash
kill <pid>
```