Documented API update from RFC 0072
* See inveniosoftware/rfcs#91 for details

Co-authored-by: Mirek Simek <[email protected]>
mesemus and mesemus committed Jan 19, 2025
1 parent 751e709 commit ca49043
Showing 1 changed file with 140 additions and 37 deletions.
177 changes: 140 additions & 37 deletions docs/reference/file_storage.md
@@ -3,28 +3,51 @@
There are two different concepts when handling file storage in InvenioRDM. One is the
backend, meaning the actual technology that is used to store it. For example, the local
file system or S3. You can find more information about storage backends in the
[customize](../customize/s3.md) section.
[customize](../customize/s3.md) section. Most of the time the backend is transparent to
the user, as the InvenioRDM API abstracts it away.

Moreover, the origin or method used to transport the files is also important. In InvenioRDM
there are three defined types.
Moreover, the origin or method used to transport the files is also important.
InvenioRDM implements an extensible mechanism for transporting files. Out of the box,
the following 4 transport mechanisms are supported:

- Local, which represents the files that are managed by the InvenioRDM instance,
- **Local**, which represents the files that are managed by the InvenioRDM instance,
independently of the backend.
- Fetch, these are files that are not managed by the instance but will be transported.
- **Fetch**, these are files that are not managed by the instance at the beginning when
the file is attached to the record, but will be transported and stored locally.
This means that they will eventually become _local_ files.
- Remote, these are represented by a reference to an external storage system. Since
- **Multipart**, these are files that are uploaded in parts. Users can upload parts
in parallel and can retransmit an individual part if its upload fails, for example due to
network errors. After upload, the parts are assembled into a single file and the
file becomes a _local_ file.
- **Remote**, these are represented by a reference to an external storage system. Since
the files are not managed by the instance there is no possible way to guarantee their
availability or integrity. At the moment this type of files are **not supported** by
InvenioRDM.
availability or integrity.

These file types are stored in the `storage_class` attribute of the file model, and
These file types are stored in the `transfer.type` attribute of the file model, and
represented by a one character encoding:

| Type | Representation |
|:------:|:--------------:|
| Local | L |
| Fetch | F |
| Remote | R |
| Type | Representation |
|:----------:|:--------------:|
| Local | L |
| Fetch | F |
| Multipart | M |
| Remote | R |

Example of selecting transfer type on file creation:

```http
POST /api/records/{id}/draft/files
Content-Type: application/json
[{
  "key": "dataset.zip",
  "transfer": {
    "type": "F",
    "url": "https://example.org/files/dataset.zip?token=<auth token>"
  },
  "metadata": {...}
}]
```
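
The request body above can also be assembled programmatically. The following is a minimal Python sketch, not part of InvenioRDM: the helper name `build_file_entry` and the `TRANSFER_TYPES` mapping are hypothetical, and sending an explicit `"type": "L"` for local files is an assumption. It only builds the JSON payload described in this section and does not send the request.

```python
import json

# One-character transfer type codes from the table above.
TRANSFER_TYPES = {"L": "Local", "F": "Fetch", "M": "Multipart", "R": "Remote"}


def build_file_entry(key, transfer_type="L", url=None):
    """Build one entry for the POST /api/records/{id}/draft/files body.

    Hypothetical helper for illustration; only the "key", "transfer.type"
    and "transfer.url" fields shown in this document are produced.
    """
    if transfer_type not in TRANSFER_TYPES:
        raise ValueError(f"unknown transfer type: {transfer_type!r}")
    transfer = {"type": transfer_type}
    # Fetch and Remote transfers reference an external URL.
    if transfer_type in ("F", "R"):
        if not url:
            raise ValueError("a url is required for transfer types F and R")
        transfer["url"] = url
    return {"key": key, "transfer": transfer}


# The endpoint accepts a JSON list of file entries.
body = json.dumps([
    build_file_entry("dataset.zip", "F", "https://example.org/files/dataset.zip"),
])
print(body)
```

Validating the type code on the client side gives an early error for typos, but the server remains the authority on which transfer types are enabled.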

## Local files (L)

@@ -33,28 +56,17 @@ Local files are managed as defined in the

## Files fetching (F)

_Introduced in InvenioRDM v11_

!!! warning "Experimental feature"

The file fetching mechanism in InvenioRDM v11 has a few limitations. Be aware that
future releases of InvenioRDM might introduce breaking changes. We will document them
as extensively as possible.

**Use it at your own risk!**

Fetched files accept two more arguments than a local files on their
[initialization](rest_api_drafts_records.md#start-draft-file-uploads): _storage\_class_, and
_uri_:
During initialization, fetched files are created using [the same protocol as local files](rest_api_drafts_records.md#start-draft-file-uploads).
Additionally, you need to provide a `transfer` object with `type` and `url` fields.

**Parameters**

| Name | Type | Location | Description |
| --------------- | ------ | -------- | -------------------------- |
| `storage_class` | string | body | "L" |
| `uri` | string | body | URL to fetch the file from |
| `type` | string | body | "F" |
| `url` | string | body | URL to fetch the file from |

The `uri` must be a URL, accessible from the server's network and resolving to a file
The `url` must be a URL, accessible from the server's network and resolving to a file
that can be fetched. No authentication mechanism (e.g. `Authorization` header) is
supported for the request process, so any authentication has to be part of the URL itself
(e.g. a token passed in a query string).
@@ -68,8 +80,10 @@ Content-Type: application/json
[
{
"key": "dataset.zip",
"uri": "https://example.org/files/dataset.zip?token=<auth token>",
"storage_class": "F",
"transfer": {
"type": "F",
"url": "https://example.org/files/dataset.zip?token=<auth token>"
}
},
...
]
@@ -92,8 +106,9 @@ Content-Type: application/json
"created": "2020-11-27 11:17:10.998919",
"metadata": null,
"status": "pending",
"storage_class": "F",
"uri": "https://example.org/files/dataset.zip?token=<auth token>",
"transfer": {
"type": "F"
},
"links": {
"content": "/api/records/{id}/draft/files/dataset.zip/content",
"self": "/api/records/{id}/draft/files/dataset.zip",
@@ -107,16 +122,25 @@ Content-Type: application/json
}
```

**Note**: The response does not contain the URL of the fetched file. This is intentional
as the URL might contain sensitive information (e.g. a token) that should not be exposed
to users.

At this point an asynchronous task will be launched and the file will be transported into
the InvenioRDM instance. Once the file transfer is completed, the status field will be
changed to `completed`. At this point the `storage_class` of the files has also changed
changed to `completed`. At this point the `transfer.type` of the files has also changed
to `L`. The status can be checked using the _files_ url (`/api/records/{id}/draft/files`).
Note, until all the files have been transferred (i.e. their status is `completed`) the
record cannot be published.

Moreover, while files are being transferred, requests to the `content` and `commit`
endpoints are not allowed (disabled).

### Error handling

If the file fetching fails, the status of the file will be set to `failed`
and the error message will be stored in the `transfer.error` field.
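
A client can use the `status` field to decide whether a draft is ready to be published. The following is a minimal sketch, assuming the files listing has already been fetched and parsed into a list of dicts carrying the `key`, `status` and `transfer` fields shown above. The function name `summarize_transfers` and the sample data are hypothetical.

```python
def summarize_transfers(entries):
    """Group file keys by transfer status.

    `entries` is assumed to be a list of file dicts as returned by the
    files endpoint, each with "key" and "status" and, on failure, an
    error message under "transfer" -> "error".
    """
    summary = {"completed": [], "pending": [], "failed": {}}
    for entry in entries:
        status = entry.get("status")
        if status == "completed":
            summary["completed"].append(entry["key"])
        elif status == "failed":
            error = (entry.get("transfer") or {}).get("error")
            summary["failed"][entry["key"]] = error
        else:
            # Anything else (e.g. "pending") is treated as still in progress.
            summary["pending"].append(entry["key"])
    # The record can only be published once every file is completed.
    summary["publishable"] = not summary["pending"] and not summary["failed"]
    return summary


# Illustrative data only; real entries contain more fields.
entries = [
    {"key": "dataset.zip", "status": "completed", "transfer": {"type": "L"}},
    {"key": "images.tar", "status": "pending", "transfer": {"type": "F"}},
    {"key": "broken.bin", "status": "failed",
     "transfer": {"type": "F", "error": "example error message"}},
]
summary = summarize_transfers(entries)
print(summary["publishable"], summary["failed"])
```

Treating every status other than `completed` and `failed` as pending is a conservative choice: an unknown status blocks publishing rather than silently allowing it.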

### Security

By default file fetching will be refused. Files can only be fetched from a configurable
@@ -131,6 +155,85 @@ RECORDS_RESOURCES_FILES_ALLOWED_DOMAINS = [

## Remote files (R)

!!! info "Not supported"
To link to a remote file, the `transfer` object must contain `"type": "R"` and a `url` field.

**Request**

```http
POST /api/records/{id}/draft/files HTTP/1.1
Content-Type: application/json
[
  {
    "key": "dataset.zip",
    "transfer": {
      "type": "R",
      "url": "https://mystoragehosting.org/files/dataset.zip"
    }
  },
  ...
]
```

There is no need to call the `commit` endpoint for remote files. The file is considered
committed as soon as it is created.

**Request**

```http
POST /api/records/{id}/draft/files/dataset.zip/commit HTTP/1.1
```

### Accessing remote files

Later on, when a user tries to access the file, a 302 redirect to the `url` provided
in the request will be returned.

**Request**

```http
GET /api/records/{id}/draft/files/dataset.zip/content HTTP/1.1
```

**Response**

```http
HTTP/1.1 302 FOUND
Location: https://mystoragehosting.org/files/dataset.zip
```

Remote files are currently not supported.
### Security

When a `302` redirect is sent to the user, they will retrieve the file directly
by following the returned URL. Therefore, you must ensure:

1. **Network Access**: The file’s URL is reachable from the user’s network.
2. **No Sensitive Data**: The URL does not include any sensitive information (such as tokens).

By default, Invenio refuses references to external files. Files can only be referenced
from a “trusted domains” list, which you can configure in your `invenio.cfg` file:

```python
RECORDS_RESOURCES_FILES_ALLOWED_REMOTE_DOMAINS = [
    "mystoragehosting.org",
]
```
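
To illustrate what a trusted-domains check amounts to, here is a minimal Python sketch. It is not the InvenioRDM implementation: the function name `is_allowed_remote_url` is hypothetical, and the exact-hostname matching rule is an assumption (the real check may treat subdomains or ports differently).

```python
from urllib.parse import urlparse

# Mirrors the example configuration value above.
ALLOWED_REMOTE_DOMAINS = ["mystoragehosting.org"]


def is_allowed_remote_url(url, allowed_domains=ALLOWED_REMOTE_DOMAINS):
    """Return True if the URL's hostname is exactly one of the allowed domains.

    Assumption: exact, case-insensitive hostname match; subdomains are
    not accepted unless listed explicitly.
    """
    hostname = urlparse(url).hostname  # lowercased by urlparse, None if absent
    if hostname is None:
        return False
    return hostname in {domain.lower() for domain in allowed_domains}


print(is_allowed_remote_url("https://mystoragehosting.org/files/dataset.zip"))
print(is_allowed_remote_url("https://example.org/mystoragehosting.org/x"))
```

Comparing the parsed hostname, rather than searching for the domain as a substring of the URL, avoids accepting URLs that merely mention a trusted domain in their path or query string.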

Since the repository cannot guarantee a remote file’s availability or integrity,
remote file uploads are also restricted to trusted users. By default, only users with
superuser access can upload remote files.

You can change this behavior in your `invenio.cfg` file:

```python
from invenio_administration.generators import Administration
from invenio_rdm_records.services.permissions import RDMRecordPermissionPolicy
from invenio_records_resources.services.files.generators import IfTransferType
from invenio_records_resources.services.files.transfer import REMOTE_TRANSFER_TYPE


class MyRepositoryPermissionPolicy(RDMRecordPermissionPolicy):
    # Extend the default policy so that administrators may also create remote files.
    can_draft_create_files = RDMRecordPermissionPolicy.can_draft_transfer_files + [
        IfTransferType(REMOTE_TRANSFER_TYPE, Administration())
    ]

RDM_PERMISSION_POLICY = MyRepositoryPermissionPolicy
```
