Search strategies: Bulk data retrieval #8

fkleon · 2018-09-10T01:14:34Z

Bulk data retrieval

The scroll API makes it possible to efficiently retrieve a large number of records, or even all records.
The implementation uses the Elasticsearch scroll API under the hood, which is referred to in the documentation.

The same concepts apply, most importantly:

A scroll is a temporary view into the data, whose search context is kept open for a limited time only.
A scroll can be consumed in batches.
A scroll is a temporary resource that can be addressed by its scroll ID.

The Te Papa Collections API imposes some additional restrictions to avoid too much strain on the cluster:

A scroll expiry limit of 15 minutes max.
A maximum page size of 10,000 records.

A scroll is requested through any of the _scroll APIs. For example, by using /objects/_scroll the results are pre-filtered to only contain collection objects. To retrieve all object types, use the /search/_scroll API.

A scroll is opened with a POST request to a _scroll API, for example:

curl -XPOST -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/search/_scroll' \
  --data-urlencode 'duration=1' \
  --data-urlencode 'size=1'

This requests a scroll that is kept open for 1 minute (duration) and contains 1 result per page (size).
The result looks like an ordinary search result with one addition: the _metadata.query.scrollId field, which contains the unique scroll ID.

The next page of the scroll can then be retrieved through the GET scroll API. An API-root relative link to the next page is included in the Location header of the initial scroll response, or it can be built from the scroll ID in the response body. Example:

curl -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/scroll/<SCROLL-ID>' \
  --data-urlencode 'duration=1'

Once the scroll is exhausted the GET scroll API returns an HTTP 204 No Content response.
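
The open / fetch-next / stop-on-204 cycle described above could be scripted roughly as follows. This is only a sketch under assumptions that the text here does not confirm: the `TEPAPA_API_KEY` variable name is made up, every request is assumed to need the `x-api-key` header, pages are assumed to come back with HTTP 200, and follow-up pages are assumed to either repeat the scroll ID or omit it (in which case the previous ID is reused). The scroll ID is pulled out with plain text matching to avoid extra dependencies; a JSON-aware tool would be more robust.

```shell
#!/usr/bin/env bash
# Sketch only: walk a scroll page by page until the API answers 204 No Content.
# TEPAPA_API_KEY is a hypothetical variable name for your API key.

API_ROOT='https://data.tepapa.govt.nz/collection'

# Print the first "scrollId" string value found in a JSON body on stdin.
# Crude text matching, not a real JSON parser.
extract_scroll_id() {
  sed -n 's/.*"scrollId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# scroll_all PAGE_SIZE: print each page body on its own line.
# Returns 0 only if the scroll ended with 204 No Content.
scroll_all() {
  local size=$1 page status id='' new_id
  page=$(mktemp) || return 1

  # Open the scroll; the body goes to $page, the HTTP status to stdout.
  status=$(curl -sS -XPOST -G -o "$page" -w '%{http_code}' \
    -H "x-api-key: ${TEPAPA_API_KEY:-}" \
    "$API_ROOT/search/_scroll" \
    --data-urlencode 'duration=1' \
    --data-urlencode "size=$size")

  while [ "$status" = 200 ]; do
    cat "$page"; echo

    # Keep the previous scroll ID if this page does not carry one.
    new_id=$(extract_scroll_id < "$page")
    if [ -n "$new_id" ]; then id=$new_id; fi
    if [ -z "$id" ]; then break; fi

    # Fetch the next page; 'duration' keeps the scroll open for another minute.
    status=$(curl -sS -G -o "$page" -w '%{http_code}' \
      -H "x-api-key: ${TEPAPA_API_KEY:-}" \
      "$API_ROOT/scroll/$id" \
      --data-urlencode 'duration=1')
  done

  rm -f "$page"
  [ "$status" = 204 ]
}
```

After sourcing the script, `scroll_all 1000 > pages.jsonl` would write one page per line. The page size has to stay within the 10,000 record limit, and each page needs to be requested before the 1 minute duration runs out.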

An arbitrary search request can be used to control which records are included in a scroll result.
For example, to only retrieve objects that have been modified recently, a date range query can be added to the initial request:

curl -XPOST -G \
  -H 'x-api-key: KEY' \
  'https://data.tepapa.govt.nz/collection/search/_scroll' \
  --data-urlencode 'q=_meta.modified:[2018-09-06 TO *]' \
  --data-urlencode 'size=1'
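
For a recurring sync job, the start date of the range would normally be computed rather than hard-coded. The sketch below builds the same `_meta.modified` range query from a date and opens a scroll with it. The function names and the `TEPAPA_API_KEY` variable are illustrative, not part of the API, and the `days_ago` helper assumes either GNU or BSD `date` is available.

```shell
#!/usr/bin/env bash
# Sketch only: open a scroll limited to records modified since a given date.
# Function names and TEPAPA_API_KEY are hypothetical, chosen for illustration.

# Build the date range query used in the example above.
modified_since_query() {
  printf '_meta.modified:[%s TO *]' "$1"
}

# Print the UTC date N days ago as YYYY-MM-DD.
# Tries GNU date syntax first, then falls back to BSD/macOS syntax.
days_ago() {
  date -u -d "$1 days ago" +%F 2>/dev/null || date -u -v-"$1"d +%F
}

# open_scroll_modified_since DATE [PAGE_SIZE]: print the first page of results.
open_scroll_modified_since() {
  curl -sS -XPOST -G \
    -H "x-api-key: ${TEPAPA_API_KEY:-}" \
    'https://data.tepapa.govt.nz/collection/search/_scroll' \
    --data-urlencode "q=$(modified_since_query "$1")" \
    --data-urlencode 'duration=1' \
    --data-urlencode "size=${2:-100}"
}
```

For example, `open_scroll_modified_since "$(days_ago 7)"` would request records modified during the last week.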

The relevant API documentation is here:


staplegun · 2018-09-10T03:33:04Z

There is also a section in Getting Started - https://github.com/te-papa/collections-api/wiki/Getting-started#scrolling

fkleon · 2018-09-10T04:37:24Z

Great, I've somehow missed that. Feel free to merge useful bits into that one.
