
Commit 363bce8

Move design doc into new file
1 parent c91dcda commit 363bce8

File tree

2 files changed: +119 -118 lines

DESIGN.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@

# tscached: Initial Design

![tscached logo](https://github.com/zachm/tscached/raw/master/logo/logo.png)

## Stored Data

We're storing two types of data in Redis, hereafter referred to as **KQuery** and **MTS**.

### KQuery
- Key: a hash of the JSON dump of a given KairosDB query, with one exception: the start/end time values are removed before hashing.
- Value: a JSON dump of the query, including the ABSOLUTE timestamps it currently covers.
- The timestamps in the value are updated whenever we update its constituent MTS.
- We also include a list of Redis keys for the matching MTS (used in the HOT scenario).
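As a rough illustration, the KQuery key might be derived as below. This is a minimal sketch, not tscached's actual code; the stripped field names and the `tscached:kquery:` prefix are assumptions.

```python
# Minimal sketch (not tscached's actual code): hash a KairosDB query after
# stripping its start/end fields, so the same query over any time window
# maps to the same KQuery key.
import hashlib
import json


def kquery_key(query):
    """Hypothetical Redis key for the KQuery entry of `query`."""
    stripped = {k: v for k, v in query.items()
                if k not in ('start_relative', 'start_absolute',
                             'end_relative', 'end_absolute')}
    # sort_keys keeps the hash stable regardless of dict ordering
    digest = hashlib.sha1(json.dumps(stripped, sort_keys=True).encode()).hexdigest()
    return 'tscached:kquery:' + digest
```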

### MTS (Metric Time Series)
- Briefly, each Metric Time Series is one KairosDB **result** dict.
- Given that one KairosDB query may return N time series results, an MTS represents one of them.
- Key: a hash of a subset of the result: its `name`, `group_by`, and `tags` elements.
- Value: the full contents of the result dict.
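Continuing the same hypothetical sketch, the MTS key hashes only the fields that identify a series, never its datapoints:

```python
# Minimal sketch: an MTS key is a hash over the identifying subset of a
# KairosDB result dict. The `tscached:mts:` prefix is an assumption.
import hashlib
import json


def mts_key(result):
    """Hypothetical Redis key for one KairosDB result dict."""
    subset = {k: result.get(k) for k in ('name', 'group_by', 'tags')}
    digest = hashlib.sha1(json.dumps(subset, sort_keys=True).encode()).hexdigest()
    return 'tscached:mts:' + digest
```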

## Algorithm Outline

You have received a query intended for KairosDB. What to do?

What we do depends on whether (and what) corresponding data exists in the KQuery Store.

### Cache MISS (cold)
Unfortunately for the user, tscached has never seen this exact query before.

To proceed:
- Forward the entire query to Kairos.
- Split the Kairos result into discrete MTS, hash them, and write them into Redis.
- Write the KQuery (including its set of MTS hashes) into Redis.
- No trimming of old values is needed, since we only queried for what we wanted.
- Return the result (we may as well use the pre-built Kairos version) to the user.
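A rough sketch of the cold path, reusing the hypothetical helpers above (the `query_kairos` callable and the single-metric assumption are illustrative, not tscached's actual API):

```python
import json


def handle_miss(redis_client, query, query_kairos):
    """Cold path: query Kairos for everything, cache each MTS and the KQuery."""
    response = query_kairos(query)                    # full query to KairosDB
    mts_keys = []
    # Assume a single-metric query for brevity, hence response['queries'][0].
    for result in response['queries'][0]['results']:
        key = mts_key(result)
        redis_client.set(key, json.dumps(result))     # one Redis entry per MTS
        mts_keys.append(key)
    kquery_value = dict(query, mts_keys=mts_keys)     # remember which MTS we hold
    redis_client.set(kquery_key(query), json.dumps(kquery_value))
    return response                                   # reuse Kairos' pre-built response
```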

### Cache HIT (hot)
The user is in luck: this data is extremely fresh. This is the postcache scenario.

To be a *hot* hit, three properties must be true:
- A KQuery entry must exist for the data.
- The KQuery data must have a start timestamp at or before the one requested.
- The KQuery data must have an end timestamp within 10s of NOW (configurable).

To proceed:
- Do **not** query Kairos at all - this is explicit flood control!
- Pull all listed MTS out of Redis.
- For each MTS, trim any data older than the requested START.
- Return the rebuilt result (timestamp fixing, etc.) without updating Redis.
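The hot-hit test itself is simple; here is a sketch under the assumption that the KQuery value stores `start_absolute`/`end_absolute` in milliseconds:

```python
import time

HOT_WINDOW_SECONDS = 10  # configurable staleness window, per the design


def is_hot(kquery_value, requested_start_ms):
    """True when the cached KQuery already covers the requested window (hot hit)."""
    now_ms = int(time.time() * 1000)
    return (kquery_value['start_absolute'] <= requested_start_ms and
            now_ms - kquery_value['end_absolute'] <= HOT_WINDOW_SECONDS * 1000)
```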

### Cache HIT (warm)
This is the key tscached advancement. Data that already exists in Redis, but is more
than 10s old, is **appended to** instead of overwritten.

This removes a ridiculously high burden from your KairosDB cluster. For example, reading
from an entire production environment and plotting its load average:

- 10 second resolution
- 24 hour chart
- 2,000 hosts in the environment
- Returns **17.28 MILLION** data points (8,640 points per host, per day).

Needless to say, one requires a *ridiculously oversized* KairosDB cluster to handle
this kind of load. So why bother? Results from such a query total only a few megabytes
of JSON. With tscached, after the first (painful!) MISS, we may query for as few as
**2,000** new data points on each subsequent request.

The end goal, therefore, is to turn a bursty load into a constant load... for all the obvious
reasons!

To proceed:
- Mutate the request: forward the original request to Kairos for only the older/younger intervals.
- Pull the relevant MTS (listed in the KQuery) from Redis.
- Merge MTS in an all-pairs strategy. This relies on an Index-Hash lookup for the KairosDB data.
- As merges occur, overwrite the discrete MTS in Redis (see the sketch after this list).
- Any new MTS (one that just started reporting) is merged with an empty set and written to Redis.
- Update the KQuery with new start/end timestamps and with any new MTS hashes.
- If the retrieved MTS start is too old (compared to the original request), trim it down.
- Return the combined result.
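In rough form, the merge step might look like this (a simplified sketch that assumes the KairosDB `values` layout of `[timestamp_ms, value]` pairs; not tscached's actual code):

```python
import json


def merge_mts(redis_client, cached_result, new_result, requested_start_ms):
    """Append newly fetched datapoints to a cached MTS, trim, and write it back."""
    merged = dict(cached_result)
    merged['values'] = cached_result.get('values', []) + new_result.get('values', [])
    # Drop datapoints older than the requested start before writing back.
    merged['values'] = [v for v in merged['values'] if v[0] >= requested_start_ms]
    redis_client.set(mts_key(merged), json.dumps(merged))   # overwrite in Redis
    return merged
```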

## Future work

### Start updating
We may not want to support start updating at first. It would be a strange use
case: you would have a 1h dashboard that you stretched out to 6h, requiring a query from
-6h to -1h plus the last bit of clock time...

### Staleness prevention
If a KQuery was last requested 6 hours ago (and only for one hour's range), we should
not bother reading from it now. In other words, despite handling the same *semantic data*
as before, tscached is effectively cold. TTL expiry may be useful for this case.
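For instance, with a redis-py client one might simply write each KQuery with an expiry; the 6-hour window here is only an assumption:

```python
# Illustrative only: give the cached KQuery a TTL so a long-unused entry
# expires on its own instead of being served cold later.
redis_client.set(kquery_key(query), json.dumps(kquery_value), ex=6 * 3600)
```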

### Preemptive caching (Read-before)
tscached is intended to minimize read load on KairosDB; the fact that it will make
dashboards built with Grafana et al. much faster to load is a happy coincidence.

This leads to a natural next step: if a finite number of dashboards are built in a Grafana
installation, but then queried very rarely (i.e., only under emergency scenarios), why not
provide *shadow load* on them all the time? This would, in effect, keep tscached current and
result in a very high (hot) hit rate.

#### How to implement read-before caching
There are many ways to achieve this: a daemon, cron jobs, etc. Here is our presumed first attempt (see the sketch after this list):
- Keep a list, in Redis, of KQueries to treat as read-before enabled.
- Run a cron job every ~five minutes that iterates over the list of *shadow load-enabled* dashboards.
- This cron job may simply query the tscached webapp, or use the same code to offload the work.
- Regardless, it will keep the cached data fresh to within a given window, even without a human viewing it.
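A possible cron-driven refresher, sketched under the assumption that re-issuing a saved query through the tscached webapp is enough to warm its cache (the Redis list key, port, and endpoint are all hypothetical):

```python
# Hypothetical read-before refresher, run from cron every ~5 minutes.
import redis
import requests

r = redis.StrictRedis()
for raw_query in r.lrange('tscached:readahead', 0, -1):
    # Re-issuing each saved query through tscached itself keeps its cache warm.
    requests.post('http://localhost:8008/api/v1/datapoints/query',
                  data=raw_query,
                  headers={'Content-Type': 'application/json'})
```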

#### How to know what should be read-before cached
- It doesn't make sense to cache KQueries used solely in *compose* operations, since the query is constantly changing.
- Presumably those used by *saved* dashboards would benefit much more!
- We can check the HTTP *referer* field: if it contains `/dashboard/db/` (for a Grafana frontend), then it's a saved dashboard (good). If the last characters are `edit`, then it's an update of that saved dashboard (ungood).
- If we can fully understand the schema of URLs from a given frontend, Grafana or otherwise, this strategy should work.
- A better approach? Send an `X-tscached` header with an appropriate mode. This requires an upstream change in whichever graphing dashboard you choose to use, though.
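The referer heuristic could be as simple as the following (illustrative only; how the header is obtained depends on the web framework):

```python
def should_readahead(referer):
    """Heuristic: saved Grafana dashboards warrant read-before caching."""
    if not referer:
        return False
    return '/dashboard/db/' in referer and not referer.endswith('edit')
```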

README.md

Lines changed: 0 additions & 118 deletions
@@ -10,121 +10,3 @@ step further: *A previously issued query will be reissued across only the elapse
last execution.* This provides a substantial improvement in serving high-volume load, especially temporally long queries that return thousands of time series. Using only simple techniques - consistent hashing, read-through caching, and backend load chunking - we provide user-perceived read latency improvements of up to 100x.

There are several different frontends to use with a Kairos-compliant API like this one, but the most full-featured remains (as always) [Grafana](http://grafana.org/) with [this plugin](https://github.com/grafana/kairosdb-datasource) installed.

(118 lines removed: beginning with "Everything that follows is something of a fluid design document.", the rest of the README was the design document now reproduced verbatim in DESIGN.md above.)