datacite-websnap was developed to facilitate interoperability between the data platforms of the ETH research institutions in Switzerland.
datacite-websnap empowers research institutions to share their DataCite metadata records by exporting the records to publicly accessible S3 cloud storage.
pip install datacite-websnap
To access CLI documentation:
datacite-websnap --help
To access more detailed documentation for the export command:
datacite-websnap export --help
Bulk export DataCite XML metadata records that correspond to the records for a particular DataCite repository and/or DOI prefix.
By default, records are exported to an S3 bucket, but the command also supports exporting them to a local machine.
Option | Default | Description |
---|---|---|
--doi-prefix | None | |
--client-id | None | |
--destination | S3 | |
--bucket | None | |
--key-prefix | None | |
--directory-path | None | |
--file-logs | False | |
--log-level | INFO | |
--early-exit | False | |
--api-url | https://api.datacite.org | |
--page-size | 250 | |
Repository account ID and DOI prefix are the supported filters used to select DataCite records that will be exported.
The filters can be applied for both S3 bucket and local machine usage.
Please note that applying this filter will bulk export ALL records for the specified repository account ID!
Repositories with records on DataCite each have their own DataCite repository account ID.
To confirm you have the correct repository ID you can call the DataCite API client endpoint.
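A minimal sketch of that check, assuming the client record is served at <api-url>/clients/<repository ID> (the base URL and the /clients endpoint are the configuration defaults listed in this README). The helper names client_url and fetch_client are hypothetical and are not part of datacite-websnap.

```python
import json
import urllib.request

DATACITE_API_URL = "https://api.datacite.org"  # default base URL from this README
CLIENTS_ENDPOINT = "/clients"  # default clients endpoint from this README


def client_url(client_id: str, api_url: str = DATACITE_API_URL) -> str:
    """Build the URL of the client record for a repository account ID."""
    return f"{api_url.rstrip('/')}{CLIENTS_ENDPOINT}/{client_id}"


def fetch_client(client_id: str, timeout: int = 32) -> dict:
    """Request the client record and decode the JSON body.

    urllib raises urllib.error.HTTPError for an error response,
    which is the signal that the repository ID is not known.
    """
    with urllib.request.urlopen(client_url(client_id), timeout=timeout) as response:
        return json.load(response)


# Building the URL needs no network access.
print(client_url("ethz.wsl"))  # prints https://api.datacite.org/clients/ethz.wsl
```

Calling fetch_client("ethz.wsl") performs the actual request; a successful response suggests the ID can be passed to --client-id.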
If you do not know the repository ID but do know a specific DOI that belongs to the repository:
- Navigate to DataCite Commons
- Enter the DOI in the search box. For example: 10.16904/envidat.576
- Click on the record and then click "Download Metadata", select "DataCite JSON"
- The repository account ID is the value for "clientId". For DOI 10.16904/envidat.576 the "clientId" value is "ethz.wsl".
Example usage as a command line argument: --client-id ethz.wsl
Please note that applying this filter will bulk export ALL records for the specified DOI prefix!
Records can also be exported by their DOI prefix.
The --doi-prefix argument accepts single or multiple prefix arguments.
Example usage as a command line argument: --doi-prefix 10.16904 --doi-prefix 10.25678
It can also be combined with the --client-id argument.
Utilizes the AWS SDK for Python (Boto3) to export DataCite XML metadata records for a specific repository and/or DOI prefix as objects in an S3 bucket.
The environment variables listed below are required to export records to an S3 bucket.
Environment Variable | Description |
---|---|
ENDPOINT_URL | URL to use for the constructed S3 client |
AWS_ACCESS_KEY_ID | AWS access key ID |
AWS_SECRET_ACCESS_KEY | AWS secret access key |
The environment variables can also be set in a .env file.
The .env file must be located in the directory where the CLI is being executed.
For example, if you are running the program from my-drive/cli-tools/datacite-websnap then the .env file must be in that directory.
Example .env file:
ENDPOINT_URL=https://dreamycloud.com
AWS_ACCESS_KEY_ID=1234567abcdefg
AWS_SECRET_ACCESS_KEY=hijklmn1234567
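The check below is a rough sketch of how the three required variables could be read from .env style text and validated before starting an export. It uses only the standard library; how datacite-websnap itself loads the file is not described here, and the names parse_env_text and missing_variables are hypothetical.

```python
REQUIRED_VARIABLES = ("ENDPOINT_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_variables(values: dict[str, str]) -> list[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARIABLES if not values.get(name)]


example = """\
ENDPOINT_URL=https://dreamycloud.com
AWS_ACCESS_KEY_ID=1234567abcdefg
AWS_SECRET_ACCESS_KEY=hijklmn1234567
"""
print(missing_variables(parse_env_text(example)))  # prints []
```

To check a real file, pass the contents of the .env file in the current working directory to parse_env_text.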
To export the records to an S3 bucket:
- The --bucket option must be assigned to an existing S3 bucket

- Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
- Write XML records to a bucket called "opendataswiss"
datacite-websnap export --client-id ethz.wsl --bucket opendataswiss
- Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
- Write XML records to a bucket called "opendataswiss"
- Use key prefix wsl
- Enable logging to a file
datacite-websnap export --client-id ethz.wsl --bucket opendataswiss --key-prefix wsl --file-logs
Export DataCite XML metadata records for a specific repository and/or DOI prefix to a local machine.
To write the records locally:
- The --destination option must be assigned to local
- The --directory-path option must be assigned to an existing local directory

- Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
- Write XML records locally
- Write XML records to a directory called "opendata/wsl"
datacite-websnap export --client-id ethz.wsl --destination local --directory-path "opendata/wsl"
Exported DataCite XML records are assigned file names (or S3 keys) using the DOI that corresponds to the record.
- The "/" slash character that divides the DOI prefix and suffix is replaced with a "_" underscore character
- ".xml" is appended to the DOI as a file extension
Record DOI: 10.16904/envidat.31
File name (or S3 key) for exported record: 10.16904_envidat.31.xml
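The naming rule can be sketched as a small function. This is an illustration of the rule as stated here, not the package's own code; the name doi_to_file_name is hypothetical, and the sketch assumes that only the slash between prefix and suffix is replaced (whether further slashes inside a suffix are also replaced is not stated in this document).

```python
def doi_to_file_name(doi: str) -> str:
    """Replace the prefix/suffix slash with "_" and append ".xml"."""
    # partition splits on the first "/" only, which separates prefix and suffix.
    prefix, separator, suffix = doi.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"Not a DOI with a prefix and a suffix: {doi!r}")
    return f"{prefix}_{suffix}.xml"


print(doi_to_file_name("10.16904/envidat.31"))  # prints 10.16904_envidat.31.xml
```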
Info messages and errors are logged to the console.
Optionally, info messages and errors can also be written to a log file, named "datacite-websnap.log" by default.
To enable file logs, pass the --file-logs option:
datacite-websnap export --client-id ethz.wsl --bucket opendataswiss --file-logs
Logging configuration variables are assigned in config.py.
To override the defaults, set the variables in the table below in config.py.
LOG_NAME is the name of the log file (used if the --file-logs option is enabled).
See the Python logging basic configuration documentation.
Configuration Variable | Default |
---|---|
LOG_NAME | "datacite-websnap.log" |
LOG_FORMAT | "%(asctime)s \| %(levelname)s \| %(module)s.%(funcName)s:%(lineno)d \| %(message)s" |
LOG_DATE_FORMAT | "%Y-%m-%d %H:%M:%S" |
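To show what these defaults produce, here is a minimal sketch that wires the three values into the standard logging module. The function build_logger and the logger name "websnap-sketch" are hypothetical; the package's actual logging setup may differ.

```python
import logging

# Default values from the table above.
LOG_NAME = "datacite-websnap.log"
LOG_FORMAT = "%(asctime)s | %(levelname)s | %(module)s.%(funcName)s:%(lineno)d | %(message)s"
LOG_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_logger(file_logs: bool = False, level: str = "INFO") -> logging.Logger:
    """Create a logger that writes to the console and, optionally, to LOG_NAME."""
    logger = logging.getLogger("websnap-sketch")
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter(LOG_FORMAT, datefmt=LOG_DATE_FORMAT)
    handlers = [logging.StreamHandler()]
    if file_logs:
        handlers.append(logging.FileHandler(LOG_NAME))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info("Export started")
```

Each line then carries a timestamp, the level, the module, function and line number that emitted the message, and the message itself, separated by " | ".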
datacite-websnap retrieves XML metadata records from the DataCite API.
Documentation for the DataCite API endpoints and pagination used in datacite-websnap:
Default configuration variables are assigned in config.py for the DataCite API base URL, endpoints, page size and timeout.
To override the default configuration variables related to DataCite, set the variables in the table below in config.py.
Configuration Variable | Default | Description |
---|---|---|
TIMEOUT | 32 | Timeout of API requests in seconds. |
DATACITE_API_URL | https://api.datacite.org | DataCite base URL used for API requests. Value is assigned as default to --api-url CLI option. |
DATACITE_API_CLIENTS_ENDPOINT | /clients | Endpoint used to retrieve client. |
DATACITE_API_DOIS_ENDPOINT | /dois | Endpoint used to retrieve list of DOIs. |
DATACITE_PAGE_SIZE | 250 | Number of DOIs retrieved per page using pagination. Value is assigned as default to --page-size CLI option. |
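As an illustration of how these values could combine into a request, the sketch below builds the URL for the first page of a DOI listing. The query parameter names client-id, prefix, page[size] and page[cursor] follow the public DataCite REST API; whether datacite-websnap builds its requests exactly this way is an assumption, and dois_page_url is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode

# Default values from the table above.
DATACITE_API_URL = "https://api.datacite.org"
DATACITE_API_DOIS_ENDPOINT = "/dois"
DATACITE_PAGE_SIZE = 250


def dois_page_url(
    client_id: Optional[str] = None,
    doi_prefix: Optional[str] = None,
    page_size: int = DATACITE_PAGE_SIZE,
    api_url: str = DATACITE_API_URL,
) -> str:
    """Build the URL for the first page of DOIs matching the given filters."""
    # page[cursor]=1 asks for cursor-based pagination starting at the first page.
    params = {"page[size]": page_size, "page[cursor]": 1}
    if client_id:
        params["client-id"] = client_id
    if doi_prefix:
        params["prefix"] = doi_prefix
    return f"{api_url.rstrip('/')}{DATACITE_API_DOIS_ENDPOINT}?{urlencode(params)}"


print(dois_page_url(client_id="ethz.wsl", doi_prefix="10.16904"))
```

Later pages would typically be fetched by following the "next" link returned in each response until no further link is present.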
Rebecca Buchholz, EnviDat Software Engineer
EnviDat is the environmental data portal of the Swiss Federal Institute for Forest, Snow and Landscape Research WSL.
An EnviDat PyPI package that copies files retrieved from an API to an S3 bucket or a local machine.