EnviDat/datacite-websnap

datacite-websnap

CLI tool that bulk exports DataCite metadata records for a specific repository to an S3 bucket.

Also supports exporting repository records to a local machine.


Purpose

datacite-websnap was developed to facilitate interoperability between the data platforms of the ETH research institutions in Switzerland.

datacite-websnap empowers research institutions to share their DataCite metadata records by exporting the records to publicly accessible S3 cloud storage.

Installation

pip install datacite-websnap

Terminal Documentation

To access CLI documentation:

datacite-websnap --help

To access more detailed documentation for the export command:

datacite-websnap export --help

CLI Options

Command: export

Bulk exports the DataCite XML metadata records for a particular DataCite repository and/or DOI prefix.

The default behavior is to export DataCite XML records to an S3 bucket, but the command also supports exporting the records to a local machine.

Option Default Description
--doi-prefix None
  • DataCite DOI prefix used to filter results
  • Accepts single or multiple prefix arguments
  • Example: --doi-prefix 10.16904 --doi-prefix 10.25678
--client-id None
  • DataCite repository account ID used to filter results
  • Example: --client-id ethz.wsl
--destination S3
  • Export destination for the DataCite XML records
  • S3 (default) for an S3 bucket
  • local for local file system
--bucket None
  • Name of S3 bucket that DataCite XML records (as S3 objects) will be written in
  • Example: --bucket opendataswiss
--key-prefix None
  • Optional key prefix for objects in S3 bucket
  • If omitted then objects are written in S3 bucket without a prefix
  • Example: --key-prefix wsl
--directory-path None
  • Only used if exporting to local destination
  • Path of the local directory that DataCite XML records will be written in
--file-logs False
  • Enables logging info messages and errors to a file log
--log-level INFO
--early-exit False
  • If enabled then terminates program immediately after export error occurs
  • Default value is False (not enabled)
  • If False then only logs export error and continues to try to export other DataCite XML records returned by search query
--api-url https://api.datacite.org
  • DataCite API base URL used for queries
  • Can also be set using a DataCite API configuration variable
--page-size 250
  • Number of records returned per page of DataCite API response using pagination
  • Can also be set using a DataCite API configuration variable
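
The --early-exit behavior described in the table can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function export_records and the export_one callable are hypothetical names introduced here, and only the stop-or-continue logic follows the option description.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("early_exit_sketch")


def export_records(
    dois: Iterable[str],
    export_one: Callable[[str], None],
    early_exit: bool = False,
) -> list[str]:
    """Export each DOI and return the DOIs that failed.

    export_one is a hypothetical callable that writes one record and
    raises an exception on failure.
    """
    failed = []
    for doi in dois:
        try:
            export_one(doi)
        except Exception as err:
            if early_exit:
                # --early-exit enabled: stop at the first export error
                raise
            # Default: log the error and continue with the next record
            logger.error("Export failed for %s: %s", doi, err)
            failed.append(doi)
    return failed
```

With early_exit left at False the function returns the list of failed DOIs after trying every record, which mirrors the "log and continue" default.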

    DataCite Filters

    Repository account ID and DOI prefix are the supported filters used to select DataCite records that will be exported.

    The filters can be applied for both S3 bucket and local machine usage.

    Repository Account ID

    Please note that applying this filter will bulk export ALL records for the specified repository account ID!

    Repositories with records on DataCite each have their own DataCite repository account ID.

    To confirm you have the correct repository ID you can call the DataCite API client endpoint.
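
As a rough sketch of such a check in Python, the helper below builds the URL from the default base URL and the /clients endpoint listed in this README. The function names are hypothetical, and the assumption that the response carries the ID at data.id (JSON:API style) should be verified against the DataCite API documentation.

```python
import json
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

DATACITE_API_URL = "https://api.datacite.org"
CLIENTS_ENDPOINT = "/clients"


def client_url(client_id: str, api_url: str = DATACITE_API_URL) -> str:
    """Build the clients endpoint URL for one repository account ID."""
    return f"{api_url.rstrip('/')}{CLIENTS_ENDPOINT}/{quote(client_id, safe='')}"


def client_exists(client_id: str, timeout: int = 32) -> bool:
    """Return True if the API returns a client with a matching ID.

    Returns False on an HTTP error such as 404.
    """
    try:
        with urlopen(client_url(client_id), timeout=timeout) as response:
            payload = json.load(response)
    except HTTPError:
        return False
    # Assumption: JSON:API-style response with the ID at data.id
    return payload.get("data", {}).get("id") == client_id
```

For example, client_url("ethz.wsl") yields the URL that can also be opened directly in a browser.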

    If you do not know the repository ID but do know a specific DOI that belongs to the repository:

    1. Navigate to DataCite Commons
    2. Enter the DOI in the search box. For example: 10.16904/envidat.576
    3. Click on the record and then click "Download Metadata", select "DataCite JSON"
    4. The repository account ID is the value for "clientId". For DOI 10.16904/envidat.576 the "clientId" value is "ethz.wsl".
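
If the downloaded DataCite JSON is saved to disk, step 4 can also be done programmatically. This is a small sketch that assumes "clientId" is a top-level key of the downloaded document; the function name is hypothetical.

```python
import json


def client_id_from_datacite_json(text: str) -> str:
    """Return the "clientId" value from a DataCite JSON metadata document."""
    record = json.loads(text)
    # Assumption: "clientId" is a top-level key of the document
    return record["clientId"]
```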

    Example usage as a command line argument: --client-id ethz.wsl

    DOI Prefix

    Please note that applying this filter will bulk export ALL records for the specified DOI prefix!

    Records can also be exported by their DOI prefix.

    The --doi-prefix argument accepts single or multiple prefix arguments.

    Example usage as a command line argument: --doi-prefix 10.16904 --doi-prefix 10.25678

    It can also be combined with the --client-id argument.
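
When the export is launched from a script, the two filters can be combined by assembling the argument list in Python. The helper build_export_command is a hypothetical name used for illustration; it only uses options documented in this README.

```python
import shlex
from typing import Optional, Sequence


def build_export_command(
    client_id: Optional[str] = None,
    doi_prefixes: Sequence[str] = (),
    bucket: Optional[str] = None,
) -> list[str]:
    """Assemble argv for `datacite-websnap export` from the chosen filters."""
    command = ["datacite-websnap", "export"]
    if client_id is not None:
        command += ["--client-id", client_id]
    for prefix in doi_prefixes:
        # --doi-prefix is repeated once per prefix
        command += ["--doi-prefix", prefix]
    if bucket is not None:
        command += ["--bucket", bucket]
    return command


command = build_export_command(
    client_id="ethz.wsl",
    doi_prefixes=["10.16904", "10.25678"],
    bucket="opendataswiss",
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the tool is installed.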

    Usage: S3 Bucket

    Utilizes the AWS SDK for Python (Boto3) to export DataCite XML metadata records for a specific repository and/or DOI prefix as objects in an S3 bucket.

    Environment Variables

    The environment variables listed below are required to export records to an S3 bucket.

    | Environment Variable | Description |
    | --- | --- |
    | ENDPOINT_URL | URL to use for the constructed S3 client |
    | AWS_ACCESS_KEY_ID | AWS access key ID |
    | AWS_SECRET_ACCESS_KEY | AWS secret access key |

    The environment variables can also be set in a .env file.

    The .env file must be located in the directory where the CLI is being executed.

    For example, if you are running the program from my-drive/cli-tools/datacite-websnap then the .env file must be in that directory.

    Example .env file:

    ENDPOINT_URL=https://dreamycloud.com
    AWS_ACCESS_KEY_ID=1234567abcdefg
    AWS_SECRET_ACCESS_KEY=hijklmn1234567
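
Before starting a long export it can help to confirm that the three variables are present. The sketch below is a hypothetical pre-flight check, not part of the tool: it inspects the process environment (or any mapping passed in) and does not parse the .env file itself.

```python
import os
from typing import Mapping, Optional

REQUIRED_S3_VARIABLES = ("ENDPOINT_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_variables(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARIABLES if not env.get(name)]
```

An empty list means all three variables are available to the process.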
    

    Examples

    To export the records to an S3 bucket:

    • --bucket option must be assigned to an existing S3 bucket

    Basic Usage

    • Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
    • Write XML records to a bucket called "opendataswiss"
    datacite-websnap export --client-id ethz.wsl --bucket opendataswiss

    Advanced Usage

    • Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
    • Write XML records to a bucket called "opendataswiss"
    • Use key prefix wsl
    • Enable logging to a file
    datacite-websnap export --client-id ethz.wsl --bucket opendataswiss --key-prefix wsl --file-logs

    Usage: Local Machine

    Export DataCite XML metadata records for a specific repository and/or DOI prefix to a local machine.

    To write the records locally:

    • The --destination option must be set to local
    • The --directory-path option must be set to an existing local directory

    Example

    • Return all DataCite records for the EnviDat repository (using client-id ethz.wsl)
    • Write XML records locally
    • Write XML records to a directory called "opendata/wsl"
    datacite-websnap export --client-id ethz.wsl --destination local --directory-path "opendata/wsl"

    Record Name Formatting

    Exported DataCite XML records are assigned file names (or S3 keys) using the DOI that corresponds to the record.

    • The "/" slash character that divides the DOI prefix and suffix is replaced with a "_" underscore character
    • ".xml" is appended to the DOI as a file extension

    Example

    Record DOI: 10.16904/envidat.31

    File name (or S3 key) for exported record: 10.16904_envidat.31.xml
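
The naming rule can be expressed as a short Python function. This is a sketch of the rule as described here, not the tool's own code; the function name is hypothetical, and it replaces every slash, since how additional slashes in a DOI suffix are handled is not stated.

```python
def doi_to_file_name(doi: str) -> str:
    """Convert a DOI to the file name (or S3 key) of its exported record."""
    # Replace the slash between prefix and suffix, then append ".xml"
    return doi.replace("/", "_") + ".xml"
```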

    Logs

    Info messages and errors are logged to the console.

    Optionally, info messages and errors can also be written to a log file, named "datacite-websnap.log" by default.

    To enable file logs the following option must be enabled: --file-logs

    Example

    datacite-websnap export --client-id ethz.wsl --bucket opendataswiss --file-logs

    Configuration: Logs

    Logging configuration variables are assigned in config.py.

    To override the defaults, set the variables in the table below in config.py.

    LOG_NAME is the name of the file log (used if the --file-logs option is enabled).

    For the format syntax, see the Python logging basic configuration documentation.

    Configuration Variable Default
    LOG_NAME "datacite-websnap.log"
    LOG_FORMAT "%(asctime)s | %(levelname)s | %(module)s.%(funcName)s:%(lineno)d | %(message)s"
    LOG_DATE_FORMAT "%Y-%m-%d %H:%M:%S"
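
To show how these defaults fit together, here is a minimal sketch that applies them with the standard logging module. The function build_logger and the logger name are hypothetical; the tool's own setup in config.py may differ.

```python
import logging

LOG_NAME = "datacite-websnap.log"
LOG_FORMAT = "%(asctime)s | %(levelname)s | %(module)s.%(funcName)s:%(lineno)d | %(message)s"
LOG_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_logger(file_logs: bool = False, level: str = "INFO") -> logging.Logger:
    """Create a logger that writes to the console and optionally to LOG_NAME."""
    logger = logging.getLogger("websnap_logging_sketch")
    logger.setLevel(level)
    logger.handlers.clear()
    formatter = logging.Formatter(LOG_FORMAT, datefmt=LOG_DATE_FORMAT)
    handlers = [logging.StreamHandler()]
    if file_logs:
        # Corresponds to enabling --file-logs
        handlers.append(logging.FileHandler(LOG_NAME))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Each log line then carries the timestamp, level, module, function, line number and message, separated by " | ".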

    DataCite API

    datacite-websnap retrieves XML metadata records from the DataCite API.

    See the DataCite API documentation for the endpoints and pagination used in datacite-websnap.

    Configuration: DataCite API

    Default configuration variables for the DataCite API base URL, endpoints, page size and timeout are assigned in config.py.

    To override the defaults, set the variables in the table below in config.py.

    | Configuration Variable | Default | Description |
    | --- | --- | --- |
    | TIMEOUT | 32 | Timeout of API requests in seconds. |
    | DATACITE_API_URL | https://api.datacite.org | DataCite base URL used for API requests. Value is assigned as default to --api-url CLI option. |
    | DATACITE_API_CLIENTS_ENDPOINT | /clients | Endpoint used to retrieve client. |
    | DATACITE_API_DOIS_ENDPOINT | /dois | Endpoint used to retrieve list of DOIs. |
    | DATACITE_PAGE_SIZE | 250 | Number of DOIs retrieved per page using pagination. Value is assigned as default to --page-size CLI option. |
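
As an illustration of how these defaults might combine into a request, the sketch below builds a /dois query URL. The function name is hypothetical, and the query parameter names client-id, prefix and page[size] are assumptions based on the public DataCite REST API; the exact query that datacite-websnap sends is not documented here.

```python
from typing import Optional
from urllib.parse import urlencode

DATACITE_API_URL = "https://api.datacite.org"
DATACITE_API_DOIS_ENDPOINT = "/dois"
DATACITE_PAGE_SIZE = 250


def dois_url(
    client_id: Optional[str] = None,
    doi_prefix: Optional[str] = None,
    page_size: int = DATACITE_PAGE_SIZE,
    api_url: str = DATACITE_API_URL,
) -> str:
    """Build a /dois query URL for one page of results."""
    # Assumed parameter names: page[size], client-id, prefix
    params: dict[str, object] = {"page[size]": page_size}
    if client_id is not None:
        params["client-id"] = client_id
    if doi_prefix is not None:
        params["prefix"] = doi_prefix
    return f"{api_url.rstrip('/')}{DATACITE_API_DOIS_ENDPOINT}?{urlencode(params)}"
```

Note that urlencode percent-encodes the square brackets in page[size], which is expected for query strings.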

    Author

    Rebecca Buchholz, EnviDat Software Engineer

    EnviDat is the environmental data portal of the Swiss Federal Institute for Forest, Snow and Landscape Research WSL.

    Inspiration

    An EnviDat PyPI package that copies files retrieved from an API to an S3 bucket or a local machine.

    License

    MIT License
