
Pagination using Range header #234

Open · wants to merge 13 commits into main
Conversation

gudmundur

Note: 🐝 This is still a work in progress. 🐝

This code is based on how the Heroku Platform API deals with pagination and ranges, both in terms of code and design. Credit for that goes to @brandur and @geemus.

end

context 'with an unbound end range' do
  let(:range_header) { 'objects 0-' }

gudmundur (Member Author)

@geemus Do you think unbound ranges should be allowed?

geemus (Member)

@gudmundur I think they are in the API already!

$ curl -H "Range: id ]006e2c53-e3a9-4152-851c-abf6e9991c63..; max=1" -H "Accept: application/vnd.heroku+json; version=3" -n https://api.heroku.com/apps
[
  {
    "archived_at":null,
...

Member

Not absolutely required, but certainly nice to have.

@brandur

brandur commented Dec 14, 2015

@gudmundur Looks good so far! Let me know when you want a more detailed review.

One tricky piece we may still need is some kind of response helper to generate the next range.
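
Very roughly, something like this is what I have in mind (the helper name and details are hypothetical, not a settled interface):

# Hypothetical helper: given the page just fetched, build the value for a
# Next-Range response header, or nil when there are no more pages.
# e.g. in a route: headers['Next-Range'] = next_range(apps, field: :id)
def next_range(records, field:, max: 200)
  return nil if records.size < max            # short page: nothing more to fetch
  last = records.last.public_send(field)      # boundary value for the following page
  "#{field} ]#{last}..; max=#{max}"
end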

BTW, not sure if I missed something, but did we change the range separator from two dots to a hyphen? Might make ranging via UUID a little tricky ;)

@gudmundur

@brandur Yeah, there is an internal RFC at Heroku that changes a few things. The major difference is that it's using slicing (limit & offset in SQL), rather than >= and <=. The benefit there is that if the key being sorted over is not unique, there is no need to make it unique to sort out boundary issues between ranges. A minor difference is the hyphen, rather than two dots.
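
To illustrate the difference, roughly (simplified sketches, not the RFC's actual queries):

-- Slicing per the internal RFC: limit & offset
SELECT * FROM apps ORDER BY created_at LIMIT 200 OFFSET 400;

-- Current-style ranging over a key with >= (boundary value borrowed from the curl example above)
SELECT * FROM apps WHERE id >= '006e2c53-e3a9-4152-851c-abf6e9991c63' ORDER BY id LIMIT 200;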

@brandur

brandur commented Dec 15, 2015

Ah, I see! A few considerations for you and @geemus:

  1. This is a pretty large breaking change in that it's not compatible with the old/current ranging interface. It might be worth considering just reverting to a more traditional style of pagination (i.e. via query parameter) if it has to change anyway.
  2. Postgres is bad at limit/offset because the entire dataset up to the offset must still be computed. Unlike a where predicate, its performance doesn't scale to large data sets/offsets. Here's some token information from the manual (see the last paragraph) and here's a more detailed explanation.
  3. Querying by offset instead of by ID will give less stable results because indices mutate (especially if using UUIDs). It's always hard to paginate over a large data set in a stable way, but paginating by offset rather than by ID makes it even harder.

I don't have too much skin in the game anymore, but it might be wise to at least make the pagination mechanism modular so that users can vary their implementation if necessary.

@geemus

geemus commented Dec 15, 2015

@brandur thanks for the feedback, some thoughts:

  1. I would be fine maintaining double-dots instead of dash (or accepting either). I was a bit torn between backwards compat here and sticking more closely to the spec. Double-dot does seem more universal though and only one extra character.
  2. We ran into the postgres limitation and discussed it some. It seems not great, but the ease of implementation is nice. Do you have a sense of what scale of results starts to be a problem? i.e. is it also problematic for smallish sets, or only in the tens/hundreds of thousands or something?
  3. We had discussed order by created-at or similar, in the hopes that it would preserve order.

Also, happy to share the RFC itself if you are curious; I'll probably think about trying to move it into the HTTP API Guide in some capacity once we settle on some things.

@brandur

brandur commented Dec 15, 2015

@geemus

I would be fine maintaining double-dots instead of dash (or accepting either). I was a bit torn between backwards compat here and sticking more closely to the spec. Double-dot does seem more universal though and only one extra character.

Yeah, that's a tough one. I suppose that if you do adopt offset-based pagination, it might make sense to go back to hyphen if most client libraries will have to change anyway.

We ran into the postgres limitation and discussed it some. It seems not great, but the ease of implementation is nice. Do you have a sense of what scale of results starts to be a problem? i.e. is it also problematic for smallish sets, or only in the tens/hundreds of thousands or something?

Check out this graph from the article above. It's far from a comprehensive analysis, but it shows the performance drop-off very well. The discrepancy starts to show up right away, but it took 600 pages to see a very serious cliff.

The add-on API endpoint to retrieve lists of apps probably returns enough rows for at least some add-on providers that the difference would show (if you ever get that moved over to V3 ;). Obviously since it's offset-based today it's not going to take down the database or anything, but it's less efficient.

And one more data point: Stripe started with offset-based pagination but moved off of it (somewhat anecdotal ;). This wasn't so much for performance reasons, but more because of offset-based pagination's inferior stability characteristics.

@geemus

geemus commented Dec 22, 2015

@brandur I guess maybe we end up stuck with both, at least potentially. Offset is simpler to implement and sometimes desired (when you want to range over non-unique and/or non-ordered/incrementing values), but comes with tradeoffs. Otherwise, using our original style (or something like it) over an incrementing id would be more stable and more performant, but if I'm not mistaken it tends to need an auto-incrementing id to work well in terms of stability (and I'm not sure we want to reveal those and/or depend upon their presence). Definitely something I'll have to ponder more before we settle on something here.

@bjeanes

bjeanes commented Dec 23, 2015

Joining in here because part of the discussion for changing this was prompted by a few situations my team ran into. Content-based pagination is superior in all the ways you've mentioned, @brandur, but it still fundamentally relies on paginating over unique values. Some workarounds might be conceivable based on faux-composite indexes, but these have other trade-offs. Cleverer alternatives that are hard to implement well could also include returning smaller or larger ranges than requested when the page boundary lands in the middle of a run of same-value fields. But... :/

Have you thought much about a way to handle non-unique sort fields? In practice, these might be dates which rarely collide but in theory one might be sorting by something like a state field with lots of collisions.

@bjeanes

bjeanes commented Dec 23, 2015

I don't think I like what I'm about to propose but I'll do so anyway in case it can be a strawman which prompts other thoughts...

If, like Heroku, the sortable fields are pre-declared and the range headers are treated opaquely, we could create extra columns which hold unique values corresponding in order to the non-unique column, but which arbitrarily order same-value records uniquely. E.g. the created_at column might have a corresponding created_at_sort which is unique.

The trade-off here might be not working too well for mutable fields, but even this could be accommodated, to an extent, by spacing out the sort values. This might mean a header could be something like Content-Range: created_at ]2015-12-22T1500Z-002.., where the -002 is acting as the tiebreaker.

@bjeanes

bjeanes commented Dec 23, 2015

To take this further, Postgres is capable enough that these fields could actually be a custom type or domain which is a tuple of the original value and an auto-incrementing number.
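
For illustration, something along these lines (all names made up):

-- A composite type holding the original value plus a tiebreaker
CREATE TYPE sortable_timestamp AS (ts timestamptz, tiebreaker bigint);

ALTER TABLE apps ADD COLUMN created_at_sort sortable_timestamp;

-- Composite values compare field by field, so ordering and range predicates behave as we'd want
SELECT *
FROM apps
WHERE created_at_sort > ROW('2015-12-22T15:00Z', 2)::sortable_timestamp
ORDER BY created_at_sort
LIMIT 200;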

@gudmundur

@bjeanes Wouldn't the tiebreaker have to be based on something unique in the content?

tmaier mentioned this pull request Jan 5, 2016
@bjeanes

bjeanes commented Jan 6, 2016

@gudmundur yes, whether a sequence or some other field of the record, like id.

@brandur

brandur commented Jan 7, 2016

@bjeanes

Have you thought much about a way to handle non-unique sort fields? In practice, these might be dates which rarely collide but in theory one might be sorting by something like a state field with lots of collisions.

Not really :/ I don't know of any non-unique examples in the Stripe API off-hand (which isn't to say there aren't; I still don't know it all that well), but I'll dig in if one comes up.

The trade-off here might be not working too well for mutable fields, but even this could be accommodated, to an extent, by spacing out the sort values. This might mean a header could be something like Content-Range: created_at ]2015-12-22T1500Z-002.., where the -002 is acting as the tiebreaker.

Good idea! That certainly seems worthy of investigation at the very least. There are of course some possible downsides, like the ranges looking a tad unwieldy and a possibly more difficult implementation, but I think you probably considered all of that already.

A variation of the same idea might also be to expand the spec to allow ranging on composite fields and then just make sure that composite fields are always used in place of non-unique fields by themselves. So for example, instead of saying that it's possible to range on Range: created_at, you'd have something like Range: created_at/id and all range values would be compounds like 2015-12-22T1500Z/001.
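
A quick sketch of how the compound values might split apart (purely illustrative; nothing here is a settled format):

# Hypothetical parsing of a compound range header value
header       = 'created_at/id ]2015-12-22T1500Z/001..'
fields, rest = header.split(' ', 2)
field_names  = fields.split('/')                      # => ["created_at", "id"]
start_value  = rest.sub(/\A\]/, '').sub(/\.\.\z/, '') # strip the "]" prefix and ".." suffix
start_parts  = start_value.split('/')                 # => ["2015-12-22T1500Z", "001"]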

I was thinking originally that you might need some composite indexes on (field, id), but thinking about it more, it should be fine without one in most cases by just having Postgres scan inside the smaller data set filtered on field. You'd obviously want to aim to only index on non-unique fields where you have some reasonable guarantees that field values won't be hugely duplicated.

Anyway, I'm looking forward to seeing where you guys go with this!

@brandur

brandur commented Jan 7, 2016

And sorry about the delay in replying here! I let my inbox run away on me over the holidays :/

@bjeanes

bjeanes commented Jan 8, 2016

There are of course some possible downsides, like the ranges looking a tad unwieldy and a possibly more difficult implementation

Potentially, but it should be pretty opaque. Since it's a field > value comparison, you could still do Accept-Range: created_at ]2015-12-22T1500Z (i.e. drop the -002) for manually crafted requests. After that, you're likely piping through whatever you get back in Next-Range or whatever that header is.
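
In other words, something like this flow (URL and header values are made up):

# First page: a hand-crafted range, no tiebreaker needed
$ curl -i -H "Range: created_at ]2015-12-22T1500Z..; max=200" https://api.example.com/apps
...
Next-Range: created_at ]2015-12-23T0912Z-002..; max=200

# Subsequent pages: pipe the Next-Range value straight back in as Range
$ curl -i -H "Range: created_at ]2015-12-23T0912Z-002..; max=200" https://api.example.com/apps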

@brandur

brandur commented Jan 8, 2016

Since it's a field > value comparison, you could still do Accept-Range: created_at ]2015-12-22T1500Z (i.e. drop the -002) for manually crafted requests. After that, you're likely piping through whatever you get back in Next-Range or whatever that header is.

+1. That would be a nice feature for convenience.

@gudmundur

For the case of ranging over unique values, I think our approach is fairly nailed down.

I've mulled over non-unique values for a while now, finally spent some time Googling, and stumbled on this: http://www.slideshare.net/MarkusWinand/p2d2-pagination-done-the-postgresql-way. It seems that we can use row value expressions in Postgres to solve this.

In conjunction with a non-unique field, we mix in a unique value, a uuid for example. That would give us a header in the form of something like Next-Range: created_at ]2016-02-09T0730Z/facf751c-860e-4004-a206-6a0899b89597... When we detect a Range header with that format, we construct a query like:

SELECT
  *
FROM
  model
WHERE 
  ROW(created_at, id) <  ROW('2016-02-09T0730Z', 'facf751c-860e-4004-a206-6a0899b89597')
ORDER BY
  created_at, id
LIMIT 
  200;

Using this requires that we have a composite index on (created_at, id), and I'm assuming that if things are keyed off a user, for example, it'll have to be (user_id, created_at, id). According to the slide deck, this should perform like the wind.
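
In pliny terms, the dataset construction might end up looking roughly like this (a sketch only, assuming Sequel and placeholder variable names):

# last_created_at and last_id are assumed to have been parsed from the incoming Range header
dataset = Model
  .where(Sequel.lit('(created_at, id) < (?, ?)', last_created_at, last_id))
  .order(:created_at, :id)
  .limit(200)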

@brandur @bjeanes What do you guys think?

@bjeanes

bjeanes commented Feb 9, 2016

@gudmundur oh my god that is so elegant. I love it!

I'm assuming that if things are keyed off a user, for example, it'll have to be (user_id, created_at, id)

I'm thinking that in most cases your access isn't based on ownership so much as other access mechanisms (permissions, sharing, etc.), which means we lose out on that extra tuple value. Nonetheless, that should do the trick in some cases. Really nice find @gudmundur!

attn @geemus too

@bjeanes

bjeanes commented Feb 9, 2016

(Obviously, I say that without having tested this yet, but in theory it seems solid)

@bjeanes

bjeanes commented Feb 9, 2016

It looks like the ROW keyword is optional for these compound cases anyway, so simply doing the filter as: (field1, field2) < ('val1', 'val2') gets us the same neat effect, I think!

Edit: which, incidentally, seems to be what slide 28 of the presentation you linked is doing.

@brandur

brandur commented Feb 9, 2016

Thanks for the write-up Gudmundur! This looks good to me as well!


@geemus

geemus commented Feb 18, 2016

Sounds great, thanks for researching.

My only pause might be around the separator. I worry that a forward slash might be part of a legit value (though it isn't in either of those), and/or wonder if there is something clearer. Comma seems nice in some ways, for instance, though I'm not sure it helps at all wrt avoiding conflicts.

@ransombriggs

@gudmundur @geemus what is the current status / blockers for this? I would like to be able to use this in my pliny app and if there is anything I can do to help, please let me know.

@gudmundur

@ransombriggs I'm implementing this. What specific part do you need, pagination over uniques or over non-uniques? We'll be needing this on DevEx soon, so I'll allocate time to do this in the next couple of days.

@ransombriggs

@gudmundur Over uniques. I'm just working on a pliny app and didn't want to duplicate work if I could avoid it.
