Paginating/ranging over non-unique fields #81
Comments
The simple solution I have now is that the query to sort the results always looks something like:

For the small-ish number of records we might need to serve (likely 1-10,000, or no more than 10K for a while) in the service I'm proposing, sorting like this in each query without saving an index should be OK, but that can probably be improved in the future.
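A minimal sketch of the kind of tie-broken query being hinted at, assuming a hypothetical `apps` table with a non-unique `deploys` column and a unique `id`:

```python
import sqlite3

# Hypothetical schema/data; the point is the secondary ORDER BY on a
# unique column, so every query returns the same stable, total order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (id TEXT PRIMARY KEY, deploys INTEGER)")
conn.executemany("INSERT INTO apps VALUES (?, ?)",
                 [("b", 5), ("a", 5), ("c", 3)])

# Sorting on deploys alone leaves ties in arbitrary order; adding the
# unique id as a tie-breaker makes the order deterministic.
rows = conn.execute(
    "SELECT id, deploys FROM apps ORDER BY deploys DESC, id ASC"
).fetchall()
print(rows)  # [('a', 5), ('b', 5), ('c', 3)]
```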
@mathias I had originally started pondering down a very similar path. Unfortunately, although a convention-based approach helps in terms of consistent order, it doesn't help in terms of boundaries. I.e. if several records share the main key and are sorted by id, when you ask for the next page you could either end up skipping some of the records with differing ids (or duplicating them). In the pathological case, an entire range (or more) of records could share the same primary sorting key. No particular suggestion just yet; I need to ponder/research a bit more. Will try to work on something soon.
@geemus Doesn't using the id of the record (which should be unique) as the second sort key solve this?
I imagine, with some data (as pseudo-CSV):
When we want to sort by deploys descending, it'd really sort like this:
And if we sorted by another_attr ascending:
Essentially the unique ID is being used to break ties for the sort order.
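The tie-breaking described above can be sketched in a few lines; the rows here are hypothetical stand-ins for the pseudo-CSV data (id, deploys, another_attr), since the original sample was not preserved:

```python
# Hypothetical rows: ids are unique, deploys values are not.
rows = [
    ("a1", 5, "x"),
    ("b2", 5, "y"),
    ("c3", 3, "x"),
]

# Sort by deploys descending, using the unique id to break ties.
by_deploys = sorted(rows, key=lambda r: (-r[1], r[0]))

# Sort by another_attr ascending, again breaking ties on id.
by_attr = sorted(rows, key=lambda r: (r[2], r[0]))

print([r[0] for r in by_deploys])  # ['a1', 'b2', 'c3']
print([r[0] for r in by_attr])     # ['a1', 'c3', 'b2']
```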
Correct. The issue is: how do you ask for the second range if the last value of the first range and the first value of the second range have the same primary sort column value? So, for instance:
Let's say this is the data, and we are using ranges of 2 items each. The first range would be:
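The boundary problem can be reproduced with a small sketch (data and page size are hypothetical): if the "next range" request only carries the last primary sort value seen, records sharing that value get skipped:

```python
# Hypothetical data: (value, id) pairs sorted by the non-unique value,
# paginated in ranges of 2 items each.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]

def next_page_by_value(rows, after_value, limit=2):
    """Naive 'next range': everything strictly after the boundary VALUE."""
    return [r for r in rows if r[0] > after_value][:limit]

page1 = rows[:2]            # [(1, 'a'), (1, 'b')]
boundary = page1[-1][0]     # last value seen: 1
page2 = next_page_by_value(rows, boundary)

# (1, 'c') shares the boundary value, so it is silently skipped.
print(page2)  # [(2, 'd')]
```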
@geemus The right side would have to be a unique value for each of those. You couldn't have a list like:
Because the 1 on the right is reused. "Inside" any value on the left side, the right side would also be sorted. A better version of the above data would hopefully look like:
where the values on the right side are the database ID (or UUID) -- basically, where they're unique. However, this leads to something cumbersome on the client side: we'll have to know that internally that's what pagination needs, and send both values to paginate:
Passing
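Sending both values turns the boundary into a composite key, which fixes the skipping shown earlier. A sketch, using Python's tuple comparison on the same hypothetical data:

```python
# Same hypothetical data; now the boundary is the composite key
# (value, id), compared as a tuple, so ties cannot cause skips.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]

def next_page_by_key(rows, after_key, limit=2):
    """Keyset 'next range': everything strictly after (value, id)."""
    return [r for r in rows if r > after_key][:limit]

page1 = rows[:2]
page2 = next_page_by_key(rows, page1[-1])  # boundary: (1, 'b')

print(page2)  # [(1, 'c'), (2, 'd')]
```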
Well, my example was bad, but yeah, you would have to pass both keys (at least in some cases) to get the next range. And there are already semantics around asking for multiple ranges in the RFCs that are NOT at all like this, which seems troublesome. In the RFCs it is more like:

I was pondering using something like:
You're right, and I suspect that because the Range header is intended for ordered and unique values (bytes in a file), it won't work here without expanding what we allow to be passed in / expanding the definition to some degree. Adding in the

Sounds like S3 uses
They also have a

Would the response headers look like this?
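One hypothetical shape for such response headers, with the composite "value id" marker format invented purely for illustration (nothing here is settled):

```python
# Hypothetical builder for range response headers carrying a composite
# boundary. The "]" prefix marks an exclusive start, and Next-Range is
# the opaque token the client would echo back. Format is illustrative.
def range_headers(page, total, field="deploys"):
    first, last = page[0], page[-1]
    return {
        "Content-Range": (
            f"{field} {first[0]} {first[1]}..{last[0]} {last[1]}/{total}"
        ),
        "Next-Range": f"{field} ]{last[0]} {last[1]}..",
    }

print(range_headers([(5, "a1"), (5, "b2")], total=4))
```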
Follow-up to that: I guess the
?

Edit: Gah, can't seem to get this right. We wouldn't include a range in the marker in
Yeah, I think whatever we end up with, it should be such that the client can just pass Next-Range back. Since the markers in S3 give you back things AFTER the marker, but not including the marker (if I'm reading that right), I think the marker values in the examples could just be
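That client behavior can be sketched end to end: the client treats the Next-Range marker as an opaque token and echoes it back until the server stops sending one. The "server" here is a hypothetical in-process stand-in using S3-style marker semantics (items strictly after the marker):

```python
# Hypothetical dataset, already in stable (value, id) order.
DATA = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]

def server(marker, limit=2):
    """Return a page strictly AFTER the marker, plus the next marker."""
    start = 0
    if marker is not None:
        start = next(i for i, row in enumerate(DATA) if row > marker)
    page = DATA[start:start + limit]
    # Only send a next marker when more data remains.
    next_marker = page[-1] if start + limit < len(DATA) else None
    return page, next_marker

# Client loop: blindly echo whatever marker came back.
collected, token = [], None
while True:
    page, token = server(token)
    collected.extend(page)
    if token is None:
        break

print(collected)  # [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'd')]
```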
Marker seems like a close precedent, though S3 uses it to set a boundary on the primary sort order. Nonetheless, it's worth thinking about applying it here! I appreciate you dwelling on this for us, @geemus!
Still working on this, but I ran out of time as I leave for vacation. A couple of thoughts (though I'm midstream in research/pondering). Many of the other implementations of pagination via ranges do it with natural numbers, more like byte ranges. I imagine this often points to incrementing primary keys, but it could also just be used as a means of pointing to offsets/limits, i.e. something like:

To make it more generic, it could be something like:

Anyway, I have a ton of tabs open that I'm still working through to try to get something solidified, but I'm also just out of time (it's already a couple of hours past my EOD, but I wanted to share where I was, for reference and so that I can hopefully pick the thread back up more easily upon my return).
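The natural-number style mentioned above can be read as offset/limit. A sketch of parsing such a header, with the `items=first-last` format invented here for illustration (modeled on byte ranges, not taken from any spec):

```python
import re

# Hypothetical "natural number" range header, interpreted as
# offset/limit the way byte ranges are interpreted as byte positions.
def parse_item_range(header):
    m = re.fullmatch(r"items=(\d+)-(\d+)", header)
    if not m:
        raise ValueError(f"unparseable range: {header!r}")
    first, last = int(m.group(1)), int(m.group(2))
    return first, last - first + 1  # (offset, limit)

print(parse_item_range("items=0-99"))     # (0, 100)
print(parse_item_range("items=200-249"))  # (200, 50)
```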
See also a detailed discussion around a possible way to work around this (and the many tradeoffs/issues): interagent/pliny#234
Here is a proposal I wrote a little while back that would be much closer to the original Range stuff: https://gist.github.com/geemus/8dfc01bd99367346c25891c915327220 |
I know there were some performance concerns around using the object offset for pagination, but I do wonder how large the dataset has to be for that to become an actual issue. On interagent/pliny#234 I proposed a design using the unique identifiers with row value expressions. The only question remaining was what separator to use for the ranges. It seems to me that the performance of pagination is highly dependent on the underlying data store, so I'm not sure we can have a one-size-fits-all.
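A row value expression lets the composite boundary be a single predicate. A sketch of that design (schema and data are hypothetical; SQLite needs version 3.15+ for row values, and Postgres supports them natively):

```python
import sqlite3

# Hypothetical table with a non-unique sort column and a unique id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (id TEXT PRIMARY KEY, deploys INTEGER)")
conn.executemany("INSERT INTO apps VALUES (?, ?)",
                 [("a", 1), ("b", 1), ("c", 1), ("d", 2)])

def page_after(last_deploys=None, last_id=None, limit=2):
    if last_id is None:  # first page: no boundary
        sql = "SELECT deploys, id FROM apps ORDER BY deploys, id LIMIT ?"
        return conn.execute(sql, (limit,)).fetchall()
    # The row value comparison expresses the composite boundary as one
    # index-friendly predicate instead of nested AND/OR clauses.
    sql = ("SELECT deploys, id FROM apps WHERE (deploys, id) > (?, ?) "
           "ORDER BY deploys, id LIMIT ?")
    return conn.execute(sql, (last_deploys, last_id, limit)).fetchall()

page1 = page_after()            # [(1, 'a'), (1, 'b')]
page2 = page_after(*page1[-1])  # [(1, 'c'), (2, 'd')]
print(page1 + page2)
```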
The current advice, implicit in this doc and explicit in the Heroku API reference, breaks large result sets into ranges and uses field values as constraints applied to the query for the rest of the result set. It isn't hard to stumble upon scenarios where a result set is sorted by a non-unique field (ratings, popularity, counts, times, etc.), wherein ranges requested with the specified constraints would yield overlapping result sets.
Unfortunately, the Content-Range scheme doesn't account for these scenarios (likely because we haven't yet had such a scenario in the Heroku API). As this document aims to serve as general API design advice, and non-unique sort+pagination is not an uncommon situation, some thought towards this problem may be merited by the document authors. Additionally, our team (at Heroku) has come across some scenarios we'd like to introduce which would benefit from an answer. @geemus @brandur, have either of you thought about this much before?
@mathias and I have discussed the idea of using "composite keys", which include at least one unique field as a component of the search, to achieve something similar in spirit to the current advice, but nothing clearly simple emerges.