DD-1670 Implementing update-deposits #2

Merged: 25 commits, Nov 26, 2024

Changes from all commits
8daea6b
Basic service running, which currently can only report its version.
janvanmansum Nov 8, 2024
d6bce7f
Added unit test for DAO
janvanmansum Nov 8, 2024
974a719
WIP
janvanmansum Nov 8, 2024
7e64de3
Added some starter documentation.
janvanmansum Nov 9, 2024
d877f2e
First successful cycle to processed in outbox.
janvanmansum Nov 9, 2024
af8a9b8
PathIteratorZipper
janvanmansum Nov 9, 2024
8cce606
PathIteratorZipper
janvanmansum Nov 9, 2024
8c748ea
Upload with multiple zip files working
janvanmansum Nov 10, 2024
68f3363
Replaced dataset.json with dataset.yml
janvanmansum Nov 10, 2024
e525c33
Added get-import-status
janvanmansum Nov 10, 2024
b17a3ab
Corrections
janvanmansum Nov 10, 2024
bb3f432
[maven-release-plugin] prepare release v0.1.0
janvanmansum Nov 10, 2024
bb36771
[maven-release-plugin] prepare for next development iteration
janvanmansum Nov 10, 2024
100330a
Fixes
janvanmansum Nov 10, 2024
4277c8a
[maven-release-plugin] prepare release v0.2.0
janvanmansum Nov 10, 2024
b342875
[maven-release-plugin] prepare for next development iteration
janvanmansum Nov 10, 2024
dc0cc0a
naming
janvanmansum Nov 12, 2024
9904827
wip
janvanmansum Nov 12, 2024
c07a0d8
Implemented ordering of deposits, by creation.timestamp or sequence-n…
janvanmansum Nov 12, 2024
e636bb9
Update of metadata working.
janvanmansum Nov 12, 2024
64b2aa1
Refactorings, renamed some classes and separated out code into packages.
janvanmansum Nov 12, 2024
a61d3e4
New design for version updates.
janvanmansum Nov 13, 2024
00fc8bf
Multi-version imports
janvanmansum Nov 14, 2024
3906a34
Implemented deleting files.
janvanmansum Nov 14, 2024
efe938b
WIP
janvanmansum Nov 14, 2024
30 changes: 30 additions & 0 deletions TODO.TXT
@@ -0,0 +1,30 @@
====
Copyright (C) 2024 DANS - Data Archiving and Networked Services ([email protected])

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
====

- log updateMetadata begin/end separately.
- scale the upload batch size down if :ZipUploadFilesLimit is smaller than the ingest batch size
- implement replaceFiles in edit
- implement providing file metadata
- implement setting license
- implement setting embargo
- implement setting request access permission

//edit.yml
deleteFiles: []
replaceFiles: []
terms:
  license: <license url>
  embargo: ...
3 changes: 3 additions & 0 deletions debug-init-env.sh
@@ -17,5 +17,8 @@

echo -n "Pre-creating log..."
TEMPDIR=data
mkdir -p $TEMPDIR/imports/inbox
mkdir -p $TEMPDIR/imports/outbox
mkdir -p $TEMPDIR/temp
touch $TEMPDIR/dd-dataverse-ingest.log
echo "OK"
Empty file added docs/arch.md
Empty file.
134 changes: 134 additions & 0 deletions docs/description.md
@@ -0,0 +1,134 @@
DESCRIPTION
===========

Service for ingesting datasets into Dataverse via the API.

Deposit directories
-------------------

The datasets are prepared as deposit directories (or "deposits" for short) in the ingest area. The following types of deposit directories are supported:

### Simple

A directory with the following structure:

```text
087920d1-e37d-4263-84c7-1321e3ecb5f8
├── bag
│   ├── bag-info.txt
│   ├── bagit.txt
│   ├── data
│   │   ├── file1.txt
│   │   ├── file2.txt
│   │   └── subdirectory
│   │       └── file3.txt
│   ├── dataset.yml
│   └── manifest-sha1.txt
└── deposit.properties
```

The name of the deposit directory must be a UUID. The deposit directory contains the following files:

| File | Description |
|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `deposit.properties` | Contains instructions for `dd-dataverse-ingest` on how to ingest the dataset. |
| `bag/` | A bag, i.e. a directory with the files to be ingested, laid out according to <br>the [BagIt]{:target=_blank} specification. <br>The name of the bag does not have to be "bag"; it may be any valid filename. |

Instead of a single bag, multiple bags may be included; see [below](#new-versions-of-existing-datasets).
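
For illustration, a minimal `deposit.properties` could look like this (a sketch: `creation.timestamp` and `updates-dataset` are the only properties referenced in this document, the timestamp format shown is an assumption, and `updates-dataset` is only present when the deposit updates an existing dataset, see [below](#new-versions-of-existing-datasets)):

```text
creation.timestamp = 2024-11-10T12:00:00.000000+01:00
updates-dataset = doi:10.5072/FK2/ABCDEF
```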

#### Metadata and instructions

In the root of the bag, the following files can be included to provide metadata and instructions for the ingest process:

* `dataset.yml`: Dataset-level metadata in YAML format. It corresponds to the JSON that is passed to the `createDataset` and `updateMetadata` API
  calls.
* `files.yml`: File-level metadata in YAML format. It corresponds to the JSON that is passed to the `addFile` API call.
* `edit.yml`: Instructions to delete or replace specific files. An example of the content of this file is:

<!-- TODO: elaborate this -->

```yaml
fileActions:
  - action: DELETE
    filename: file1.txt
  - action: REPLACE
    filename: file2.txt
    replacement: /path/to/new/file2.txt
```
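
For illustration, a `dataset.yml` could mirror the JSON that Dataverse's native API expects when creating a dataset (a sketch: the fields shown form a minimal citation block, and the exact set of required fields depends on the target Dataverse installation):

```yaml
datasetVersion:
  metadataBlocks:
    citation:
      fields:
        - typeName: title
          typeClass: primitive
          multiple: false
          value: Example dataset
```

Similarly, a `files.yml` might key per-file metadata by file path, mirroring the JSON of the `addFile` call (again a sketch; the keying by path is an assumption):

```yaml
data/subdirectory/file3.txt:
  description: Example file description
  directoryLabel: subdirectory
  restrict: false
```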

#### New versions of existing datasets

<!-- TODO: simplify this ? -->

A deposit can also be used to create a new version of an existing dataset. In this case, the `deposit.properties` file must contain the following property:

```text
updates-dataset: 'doi:10.5072/FK2/ABCDEF'
```

in which the value is the DOI of the dataset to be updated.

Instead of one bag directory, the deposit may contain multiple bags. In this case the bag directories are processed in lexicographical order, so you should
name the bags accordingly, e.g. `1-bag`, `2-bag`, `3-bag`, etc., or `001-bag`, `002-bag`, `003-bag`, etc., depending on the number of bags.
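
For example, a deposit that creates a dataset and then adds two update versions could look like this:

```text
0223914e-c053-4ee8-99d8-a9135fa4db4a
├── 1-bag
├── 2-bag
├── 3-bag
└── deposit.properties
```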

[BagIt]: {{ bagit_specs_url }}

### DANS bag

Processing
----------
The deposit area is a directory with the following structure:

```text
imports
├── inbox
│   └── path
│       └── to
│           ├── batch1
│           │   ├── 0223914e-c053-4ee8-99d8-a9135fa4db4a
│           │   ├── 1b5c1b24-de40-4a40-9c58-d4409672229e
│           │   └── 9a47c5be-58c0-4295-8409-8156bd9ed9e1
│           └── batch2
│               ├── 5e42a936-4b90-4cac-b3c1-798b0b5eeb0b
│               └── 9c2ce5a5-b836-468a-89d4-880efb071d9d
└── outbox
    └── path
        └── to
            └── batch1
                ├── failed
                ├── processed
                │   └── 7660539b-6ddb-4719-aa31-a3d1c978081b
                └── rejected
```

### Processing a batch

The deposits to be processed are placed under `inbox`. All files under it must be readable and writable by the service.
When the service is requested to process a batch, it does the following:

1. Sort the deposits in the batch by their `creation.timestamp` property in `deposit.properties`, in ascending order.
2. Process each deposit in the batch in order.
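
A minimal sketch of this ordering step (illustrative only, not the service's actual code; it assumes `creation.timestamp` holds an ISO-8601 value, so plain string comparison yields chronological order):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;

public class DepositOrder {
    public static void main(String[] args) throws IOException {
        Path batch = Path.of(args[0]); // e.g. imports/inbox/path/to/batch1
        List<Path> deposits = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(batch)) {
            ds.forEach(deposits::add);
        }
        // Sort by the creation.timestamp property in each deposit's deposit.properties, ascending
        deposits.sort(Comparator.comparing(DepositOrder::creationTimestamp));
        deposits.forEach(System.out::println);
    }

    private static String creationTimestamp(Path deposit) {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(deposit.resolve("deposit.properties"))) {
            props.load(in);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("creation.timestamp", "");
    }
}
```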

### Processing a deposit

1. Sort the bags in the deposit by their numeric prefix, in ascending order.
2. Process each bag in the deposit in order.
3. Move the deposit to:
    * `outbox/path/to/batch/processed` if all versions were published successfully, or to
    * `outbox/path/to/batch/rejected` if one or more of the versions were not valid, or to
    * `outbox/path/to/batch/failed` if some other error occurred.

Note that the relative path of a processed deposit in the outbox is the same as in the inbox, except for an extra level
of directories indicating the status of the deposit.

### Processing a bag

1. If the bag is a first-version bag, create a dataset in Dataverse using the metadata in `dataset.yml`; otherwise, update the existing dataset's metadata
   using the metadata in `dataset.yml`.
2. Execute the actions in `edit.yml`, if it exists.
3. Add the files in the bag's `data/` directory to the dataset, using the file metadata in `files.yml` where available.
4. Publish the dataset.
5. Wait for the dataset to be published.
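
As a sketch of this flow in code (illustrative only: `DataverseService` and all of its methods are hypothetical stand-ins, not the service's real classes or the Dataverse client API; YAML parsing uses SnakeYAML):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import org.yaml.snakeyaml.Yaml;

// Hypothetical stand-in for the service's Dataverse client.
interface DataverseService {
    String createDataset(Map<String, Object> metadata);            // returns the new dataset's PID
    void updateMetadata(String pid, Map<String, Object> metadata);
    void applyEdits(String pid, Map<String, Object> edits);        // delete/replace files per edit.yml
    void addFiles(String pid, Path dataDir);                       // upload the payload files
    void publishAndWaitUntilReleased(String pid);
}

class BagProcessor {

    // Process one bag: create or update, edit, add files, publish.
    static String processBag(Path bag, String existingPid, DataverseService dv) throws IOException {
        Map<String, Object> metadata = readYaml(bag.resolve("dataset.yml"));
        String pid;
        if (existingPid == null) {       // first-version bag
            pid = dv.createDataset(metadata);
        }
        else {                           // update bag for an existing dataset
            pid = existingPid;
            dv.updateMetadata(pid, metadata);
        }
        Path editYml = bag.resolve("edit.yml");
        if (Files.exists(editYml)) {
            dv.applyEdits(pid, readYaml(editYml));
        }
        dv.addFiles(pid, bag.resolve("data"));
        dv.publishAndWaitUntilReleased(pid);
        return pid;
    }

    static Map<String, Object> readYaml(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return new Yaml().load(in);
        }
    }
}
```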



Empty file added docs/dev.md
Empty file.
61 changes: 3 additions & 58 deletions docs/index.md
@@ -1,69 +1,14 @@
dd-dataverse-ingest
===================

Service for ingesting datasets into Dataverse via the API.

SYNOPSIS
--------

dd-dataverse-ingest { server | check }


DESCRIPTION
-----------

Ingest datasets into Dataverse via the API


ARGUMENTS
---------

positional arguments:
{server,check} available commands

named arguments:
-h, --help show this help message and exit
-v, --version show the application version and exit

EXAMPLES
--------

<!-- Add examples of invoking this module from the command line or via HTTP other interfaces -->


INSTALLATION AND CONFIGURATION
------------------------------
Currently this project is built as an RPM package for RHEL7/CentOS7 and later. The RPM will install the binaries to
`/opt/dans.knaw.nl/dd-dataverse-ingest` and the configuration files to `/etc/opt/dans.knaw.nl/dd-dataverse-ingest`.

For installation on systems that do not support RPM and/or systemd:

1. Build the tarball (see next section).
2. Extract it to some location on your system, for example `/opt/dans.knaw.nl/dd-dataverse-ingest`.
3. Start the service with the following command:
```
/opt/dans.knaw.nl/dd-dataverse-ingest/bin/dd-dataverse-ingest server /opt/dans.knaw.nl/dd-dataverse-ingest/cfg/config.yml
```

BUILDING FROM SOURCE
--------------------
Prerequisites:
sudo systemctl {start|stop|restart|status} dd-dataverse-ingest

* Java 11 or higher
* Maven 3.3.3 or higher
* RPM

Steps:

git clone https://github.com/DANS-KNAW/dd-dataverse-ingest.git
cd dd-dataverse-ingest
mvn clean install

If the `rpm` executable is found at `/usr/local/bin/rpm`, the build profile that includes the RPM
packaging will be activated. If `rpm` is available but at a different path, activate the profile using
Maven's `-P` switch: `mvn -Prpm install`.

Alternatively, to build the tarball execute:

mvn clean install assembly:single
36 changes: 36 additions & 0 deletions docs/install.md
@@ -0,0 +1,36 @@
INSTALLATION AND CONFIGURATION
==============================

Currently, this project is built as an RPM package for RHEL8/Rocky8 and later. The RPM will install the binaries to
`/opt/dans.knaw.nl/dd-dataverse-ingest` and the configuration files to `/etc/opt/dans.knaw.nl/dd-dataverse-ingest`.

For installation on systems that do not support RPM and/or systemd:

1. Build the tarball (see next section).
2. Extract it to some location on your system, for example `/opt/dans.knaw.nl/dd-dataverse-ingest`.
3. Start the service with the following command:
```
/opt/dans.knaw.nl/dd-dataverse-ingest/bin/dd-dataverse-ingest server /opt/dans.knaw.nl/dd-dataverse-ingest/cfg/config.yml
```

BUILDING FROM SOURCE
====================
Prerequisites:

* Java 17 or higher
* Maven 3.3.3 or higher
* RPM

Steps:

    git clone https://github.com/DANS-KNAW/dd-dataverse-ingest.git
    cd dd-dataverse-ingest
    mvn clean install

If the `rpm` executable is found at `/usr/local/bin/rpm`, the build profile that includes the RPM
packaging will be activated. If `rpm` is available but at a different path, activate the profile using
Maven's `-P` switch: `mvn -Prpm install`.

Alternatively, to build the tarball execute:

    mvn clean install assembly:single
13 changes: 12 additions & 1 deletion mkdocs.yml
@@ -22,7 +22,18 @@ repo_name: DANS-KNAW/dd-dataverse-ingest
repo_url: https://github.com/DANS-KNAW/dd-dataverse-ingest

nav:
  - Manual:
      - Introduction: index.md
      - Description: description.md
      # - Examples: examples.md
      - Installation: install.md
  - Development:
      - Overview: dev.md
      - Context: arch.md

extra:
  bagit_specs_url: https://www.rfc-editor.org/rfc/rfc8493.html


plugins:
- markdownextradata