Search strategies: Bulk data retrieval #8

Open
fkleon opened this issue Sep 10, 2018 · 2 comments

fkleon commented Sep 10, 2018

Bulk data retrieval

The scroll API makes it possible to efficiently retrieve a large number of records, or even all records.
It is implemented on top of the Elasticsearch scroll API, which is referred to in the documentation.

The same concepts apply, most importantly:

  • A scroll is a temporary view into the data, whose search context is kept open for a limited time only.
  • A scroll can be consumed in batches.
  • A scroll is a temporary resource that can be addressed by its scroll ID.

The Te Papa Collections API imposes some additional restrictions to avoid too much strain on the cluster:

  • A scroll expiry limit of 15 minutes max.
  • A maximum page size of 10,000 records.
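
The two limits above can be checked on the client before a scroll is opened. The following is a minimal sketch: the function name `scroll_params` and the lower bound of 1 are assumptions made for illustration, and how the API itself responds to out-of-range values is not covered here.

```python
MAX_DURATION_MINUTES = 15  # scroll expiry limit stated above
MAX_PAGE_SIZE = 10_000     # maximum page size stated above


def scroll_params(duration: int, size: int) -> dict:
    """Return scroll query parameters, rejecting values outside the limits.

    The lower bound of 1 for both values is an assumption.
    """
    if not 1 <= duration <= MAX_DURATION_MINUTES:
        raise ValueError(f"duration must be 1..{MAX_DURATION_MINUTES} minutes")
    if not 1 <= size <= MAX_PAGE_SIZE:
        raise ValueError(f"size must be 1..{MAX_PAGE_SIZE} records")
    return {"duration": duration, "size": size}


print(scroll_params(15, 10_000))
```

Failing early on the client avoids a round trip for a request that exceeds the documented limits.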

A scroll is requested through any of the _scroll APIs. For example, by using /objects/_scroll the results are pre-filtered to only contain collection objects. To retrieve all object types, use the /search/_scroll API.

A scroll is opened with a POST request to a _scroll API, for example:

curl -XPOST -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/search/_scroll' \
  --data-urlencode 'duration=1' \
  --data-urlencode 'size=1'

This requests a scroll that is kept open for 1 minute (duration) and contains 1 result per page (size).
The result looks like an ordinary search result with one addition: the _metadata.query.scrollId field, which contains the unique scroll ID.
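
For scripted use, the same request can be sketched in Python with the standard library. This is an illustration only: `API_ROOT` is inferred from the example URL, the helper names are hypothetical, and the sample body is a made-up stub that contains nothing but the `_metadata.query.scrollId` field described above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed API root, inferred from the curl example above.
API_ROOT = "https://data.tepapa.govt.nz/collection"


def open_scroll_request(api_key: str, duration: int = 1, size: int = 1) -> Request:
    """Build (but do not send) the POST request that opens a scroll."""
    query = urlencode({"duration": duration, "size": size})
    return Request(
        f"{API_ROOT}/search/_scroll?{query}",
        headers={"x-api-key": api_key},
        method="POST",
    )


def extract_scroll_id(body: str) -> str:
    """Read the scroll ID from a JSON response body."""
    return json.loads(body)["_metadata"]["query"]["scrollId"]


request = open_scroll_request("KEY")
print(request.get_method(), request.full_url)

# Made-up stub body, not real API output.
sample_body = '{"_metadata": {"query": {"scrollId": "EXAMPLE-SCROLL-ID"}}}'
print(extract_scroll_id(sample_body))
```

The request object can be passed to `urllib.request.urlopen` to send it; the sketch stops short of that so it runs without network access or a real key.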

The next page of the scroll can then be retrieved through the GET scroll API. An API-root relative link to the next page is included in the Location header of the initial scroll response, or can be built from the scroll ID in the response body. Example:

curl -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/scroll/<SCROLL-ID>' \
  --data-urlencode 'duration=1'
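
Both ways of locating the next page can be sketched as small helpers. The API root is inferred from the example URLs, the function names are hypothetical, and the exact form of the Location header value (for instance, whether it starts with a slash) is an assumption.

```python
from urllib.parse import quote, urlencode

# Assumed API root, inferred from the example URLs above.
API_ROOT = "https://data.tepapa.govt.nz/collection"


def url_from_location(location: str) -> str:
    """Resolve an API-root relative Location value to an absolute URL."""
    return f"{API_ROOT}/{location.lstrip('/')}"


def url_from_scroll_id(scroll_id: str, duration: int = 1) -> str:
    """Build the GET scroll URL from a scroll ID taken from the response body."""
    path = quote(scroll_id, safe="")  # escape any characters unsafe in a path
    return f"{API_ROOT}/scroll/{path}?{urlencode({'duration': duration})}"


# Hypothetical values, for illustration only.
print(url_from_location("/scroll/EXAMPLE-ID?duration=1"))
print(url_from_scroll_id("EXAMPLE-ID"))
```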

Once the scroll is exhausted, the GET scroll API returns an HTTP 204 No Content response.
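
A client can use the 204 response as its stop condition. The sketch below assumes that every page carries a scroll ID in `_metadata.query.scrollId` and that successful pages return HTTP 200. `fetch_page` is a hypothetical callable that performs the GET request for a scroll ID and returns the status code with the parsed body; here it is replaced by canned responses, and the `page` key exists only for the demo.

```python
from typing import Callable, Iterator, Optional, Tuple

# Hypothetical signature: scroll ID in, (HTTP status, parsed JSON or None) out.
FetchPage = Callable[[str], Tuple[int, Optional[dict]]]


def iter_scroll_pages(first_page: dict, fetch_page: FetchPage) -> Iterator[dict]:
    """Yield the first page, then follow the scroll until HTTP 204."""
    page = first_page
    while True:
        yield page
        # Use the ID from the most recent page in case it changes between pages.
        scroll_id = page["_metadata"]["query"]["scrollId"]
        status, next_page = fetch_page(scroll_id)
        if status == 204:  # scroll exhausted
            return
        if status != 200 or next_page is None:
            raise RuntimeError(f"unexpected scroll response: HTTP {status}")
        page = next_page


def make_page(number: int) -> dict:
    """Build a made-up page for the demo; 'page' is not a real API field."""
    return {"page": number, "_metadata": {"query": {"scrollId": "EXAMPLE-ID"}}}


# Canned responses standing in for the network: one more page, then 204.
canned = iter([(200, make_page(2)), (204, None)])
pages = list(iter_scroll_pages(make_page(1), lambda scroll_id: next(canned)))
print([p["page"] for p in pages])
```

Keeping the HTTP call behind `fetch_page` lets the loop be exercised without network access; a real implementation would also need to finish each page within the scroll's `duration`.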

An arbitrary search request can be used to control which records are included in a scroll result.
For example, to only retrieve objects that have been modified recently, a date range query can be added to the initial request:

curl -XPOST -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/search/_scroll' \
  --data-urlencode 'q=_meta.modified:[2018-09-06 TO *]' \
  --data-urlencode 'size=1'
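
When the cut-off date is computed by a script, the query string can be assembled programmatically. A minimal sketch, assuming the range syntax shown in the curl example; the helper name `modified_since_query` is hypothetical.

```python
from datetime import date
from urllib.parse import urlencode


def modified_since_query(since: date) -> str:
    """Build an open-ended date range query on _meta.modified."""
    return f"_meta.modified:[{since.isoformat()} TO *]"


query = modified_since_query(date(2018, 9, 6))
params = urlencode({"q": query, "size": 1})  # percent-encodes ':', '[', ']' and '*'
print(query)
print(params)
```

The encoded `params` string can be appended to the `_scroll` URL after a `?`, which is what `curl -G --data-urlencode` does on the command line.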

The relevant API documentation is here:

@staplegun
Contributor

There is also a section in Getting Started - https://github.com/te-papa/collections-api/wiki/Getting-started#scrolling

@fkleon
Author

fkleon commented Sep 10, 2018

Great, I've somehow missed that. Feel free to merge useful bits into that one.
