API / DB performance and stability #4285
We have determined that there are multiple problems with our current API design.

Problem 1: Users downloading all of the data

We provide a web-based search (for interactive use) and an API (for systems/application use). When it comes to accessing all of the data, the API is the only tool users have. While it may be that some users do not have to download all of the data all of the time, there are users for whom having multiple years of data (or "all of it") is necessary. This means we have people who are pulling several million rows (5M in […]).

To be clear: we are not saying "people using our API is a problem." Far from it! However, we simply aren't optimized for this kind of use pattern, as further investigation makes clear.

The underlying problem: linear search

This is the underlying problem: when someone uses the API to download the first 20K rows, then the next 20K, and so on, each page is fetched at an offset into the table, and the database has to scan past every row that precedes the requested page. The total cost to download 4M rows, when executing via the API on a local development environment, is 50M DBUs. That is expensive.

In this particular instance, the local dev environment might be more performant than our RDS environment; certainly, there is less contention. So, we can assume that the numbers presented here represent a lower bound on the costs and performance we will see in production.

We do index the DB. Unfortunately, the Postgres B-tree index does not improve the performance of linear scans; to optimize for the case described above, we would need a counted B-tree index. Apparently, Oracle has one. Postgres does not. So, we cannot "simply" fix this problem with a new index.
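As a rough sketch of why this is expensive (the table and column names here are illustrative, not the actual FAC schema), a deep page fetched with `LIMIT`/`OFFSET` forces Postgres to produce and throw away everything before it:

```sql
-- Illustrative only: page 200 of a ~4M row table, 20K rows per page.
-- Postgres must generate and discard the first 3,980,000 rows before it
-- can return the 20,000 requested, so each page gets progressively slower.
SELECT *
  FROM federal_awards            -- hypothetical table name
 ORDER BY id                     -- hypothetical ordering column
 LIMIT 20000
 OFFSET 3980000;
```

Summed over all of the pages, the total work grows roughly quadratically with the number of rows downloaded.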
A possible solution: adding a `batch_number` column

What follows is a discussion of performance testing on a local development machine, using 4M rows of data in the federal awards table.
| units    | api110     | public100  | batches   |
|----------|------------|------------|-----------|
| DBUs     | 36,742,666 | 26,626,400 | 1,130,022 |
| Relative | 32x        | 23x        | 1x        |
(I'm rounding here. I really don't care about anything to the right of the decimal point, given the order of magnitude of the numbers involved.)
It cost 37M DBUs to download all 4M rows via `api110`, 27M DBUs via `public100`, and 1M DBUs via batches.

The difference between 37M and 27M is, I believe, because `public100` as an API has no `JOIN` statements. That means there is roughly a 30% improvement in fetch cost from pre-computing the `JOIN`s.
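For illustration, "pre-computing the `JOIN`s" amounts to building a flat copy of the data ahead of time so the API never joins at read time. A minimal sketch, with assumed schema and column names (not the actual FAC tables):

```sql
-- Illustrative only: build the denormalized table once (e.g. nightly),
-- so API reads are single-table lookups with no JOINs at request time.
CREATE TABLE public_api.federal_awards_flat AS
SELECT g.report_id,
       g.auditee_name,
       fa.federal_agency_prefix,
       fa.amount_expended
  FROM dissemination_general g
  JOIN dissemination_federalaward fa
    ON fa.report_id = g.report_id;
```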
We also add a `batch_number` column in `public100`; it is a pre-computed value, `div(row, 20000)`, and it is indexed. As a result, it is possible to write a query like

`GET https://api-url/federal_awards?batch_number=eq.200`

Because it is indexed, fetching a batch anywhere in the 4M rows has the same cost. It is therefore roughly 30x less expensive than `api110`.
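A sketch of that mechanism, again with assumed names, where `seq` stands in for whatever sequential row number is available when the table is built:

```sql
-- Illustrative only: assign every row to a 20,000-row batch at build time,
-- then index the batch number so any batch is an index lookup, not a scan.
ALTER TABLE public_api.federal_awards_flat
  ADD COLUMN batch_number integer;

UPDATE public_api.federal_awards_flat
   SET batch_number = div(seq, 20000);   -- seq: assumed sequential row number

CREATE INDEX ON public_api.federal_awards_flat (batch_number);
```

With the index in place, a request like `GET .../federal_awards?batch_number=eq.200` resolves to an index lookup whose cost does not depend on where in the table the batch lives.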
If people are going to download the entire dataset via API, this is the least expensive way we can offer it.
Timed performance
On the same local dev machine (meaning that networking costs are negligible, and we have more CPU and RAM available than in our RDS instances), we can measure how long it takes to fetch 4M rows of data via the API. Again, these values likely represent a best case, and real-world performance will likely be worse; the relative timings, however, should be consistent.
| units    | api110 | public100 | batches |
|----------|--------|-----------|---------|
| Seconds  | 188    | 56        | 27      |
| Relative | 6x     | 2x        | 1x      |
In terms of actual time, it takes just over 3 minutes to download all 4M rows via `api110`. Again, this is with the downloading script running on the same host as the development version of the FAC; bandwidth is effectively infinite, and latency is effectively zero.

It takes roughly 1 minute to download all of the data using the API without any `JOIN` statements (`public100`).

When we download using the new batches, it takes just under 30 seconds.

It is, therefore, roughly 6x faster to download the data via batches than via the current API, and roughly 2x faster than using the optimized tables directly (with `OFFSET` values).
Conclusion
It is possible to improve the API for the FAC. We can do so while maintaining a roughly consistent table shape (i.e. the same tables), adding columns (to improve search possibilities), and, in doing so, optimizing for use cases we see in the wild (e.g. downloading all of the data via the API).
Improvements based on `EXPLAIN` values are as much as 30x, and in clock time as much as 6x. Testing in the cloud.gov environment is to come.
The numbers above were generated with the rough-and-ready script here.
As a note from conversation today: running a snapshot backup as part of a deploy will likely collide with […]. In other words, even though it is a "no-op," it is part of the data pipeline: the dissemination tables that are backed up into […]. E.g., "Backup tables from DB1 -> DB2 --> Copy those tables into various forms --> Point API at those tables" is what we want.
Following difficulties deploying API v1.1.1, we discussed whether the current materialized view approach is still the right one. Deploying the full stack depends on a precise order of operations that can be fragile and prone to subtle failures.
Related: #4039 and friends.
Solution pathway
We've decided to move the API to the secondary database (`fac-snapshot-db`, or "DB2"). This solves multiple performance and load issues within the application.
Example of gating code for standup: `FAC/backend/dissemination/sql/fac-snapshot-db/post/020_api_v1_1_0.sql` (line 1, at cf6b5c9).
Spurious `NOTIFY` statements are everything except for the last one in `finalize`.
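As background (and as an assumption about what those statements are, based on how PostgREST normally works): PostgREST reloads its schema cache when it receives a `NOTIFY` on its configured channel, so only the very last step of the pipeline needs to send it, e.g.:

```sql
-- Assumed form of the statement: ask PostgREST to reload its schema cache.
-- Only finalize needs this; earlier copies fire while the schema is still
-- being rebuilt, which is why they are "spurious."
NOTIFY pgrst, 'reload schema';
```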
Our nightly backup (and more importantly, our deploys) will want to DROP/recreate `dissemination_general` in DB2 (`fac-snapshot-db`). So, we need to make yet one more copy of the `dissemination_*` tables for the `api_v1_1_0` `VIEW`s to point at. (That is, a copy we can create/tear down/etc.)

Work is underway in https://github.com/GSA-TTS/FAC/tree/jadudm/api-perf
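A sketch of that "one more copy" idea, with assumed schema names: the API views point at a copy we control, so the backed-up `dissemination_*` tables can be dropped and recreated without the API ever depending on them directly.

```sql
-- Illustrative only: refresh a stable copy for the API views to read from,
-- leaving the backed-up dissemination_general free to be dropped/recreated.
DROP TABLE IF EXISTS api_copy.dissemination_general CASCADE;  -- also drops dependent views
CREATE TABLE api_copy.dissemination_general AS
  SELECT * FROM public.dissemination_general;

CREATE VIEW api_v1_1_0.general AS
  SELECT * FROM api_copy.dissemination_general;
```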