If you've been working with software for any length of time, you'll be very used to consuming results from APIs. Typically you'll make a request, and end up with a list of information which is quite long. People are usually bad at reading very long lists, so as developers, it's good practice to shorten them to something more manageable.
If you've used Google, you've seen this in action. You may have 9000 results, but you get 10 per page by default. People can handle 10. But 9000 in one go is just far too much.
We achieve this result via pagination. We split all our results into a series of pages. A common (and easy) way to do this is to say to the API - "Give me the first ten results". Then when we click on our 'next' button, we say, "Skip the first ten results, and serve up the next ten". In essence, we are offsetting our results by an amount of our choosing.
This is particularly easy when you're working on the frontend, but it's actually not very efficient at all.
When you are working with an offset, you make your request as detailed above and receive the expected results. But from the database's perspective it looks a bit more like this:
Offset only cares about how many rows you want to drop. It doesn't care about optimising a query, so it fetches everything, then drops what has been requested. This applies across both SQL and NoSQL databases, and creates a heavy workload for the database.
Ok, so we've seen that's an issue, but it's also problematic because you probably don't work with one request at a time. What happens if a new row is created whilst you're fetching? It turns out that offset tries to be clever and counter this by inserting duplicates in case something else was inserted during the fetch. This is a weird quirk and obviously frustrating to deal with.
So when you're coding a pagination function, you're unlikely to be giving the database enough context to efficiently and accurately provide you with the information that you want.
Ok, that's a bold title. This is a potential way to achieve your desired result though. It's actually simple - you can use a
WHERE clause to return the information that you haven't retrieved. As shown in the incredible article by Use The Index Luke on which this is based, you can achieve your desired result with this:
1SELECT someCol, anotherCol2 FROM someTable3 WHERE 'some_condition' > 14 AND id < ?last_seen_id5 ORDER BY id DESC6 FETCH FIRST 10 ROWS ONLY
This is called
keyset_pagination and it offers the advantages of being faster than
offset and also prevents the strange duplication of resuls, or other frustrating anomalies.
There are some downsides however. You can't go to an arbitrary page for example, so it's a method better suited to things like infinite scrolling in a way similar to Instagram, than it is to clicking on page numbers. It's a judgment call as to what will fit your use-case the best.
We've now understood the difference between offset and keyset pagination, and the advantages and disadvantages of both. I feel that while offset pagination is convenient, there are some tradeoffs when it comes to speed and predictability. I'd be very curious to know how Google achieves this.
Overall, my advice would be to use code similar to the snippet above, and avoid a simplistic offset where possible if you are concerned about performance in your application.
Want to know more about this kind of stuff? Sign up to my newsletter below!