## Stored Data

We're storing two types of data in Redis, hereafter referred to as **KQuery** and **MTS**.

### KQuery
- Key: a hash of the JSON dump of a given KairosDB query.
  - One exception: the query's start/end time values are removed before hashing, so the key is time-agnostic.
- Value: JSON dump of the query, including the ABSOLUTE timestamps it currently covers.
  - The timestamps in the value will be updated whenever we update its constituent MTS.
  - We also include a list of Redis keys for the matching MTS (used in the HOT scenario).
### MTS (Metric Time Series)
- Briefly, each Metric Time Series is one KairosDB **result** dict.
  - Given that one KairosDB query may return N time series results, this represents one of them.
- Key: a hash of a subset of the result: its `name`, `group_by`, and `tags` elements.
- Value: the full contents of the result dict.
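
As a concrete illustration, here is a minimal Python sketch of how both keys might be derived. The function names, key prefixes, and the choice of SHA-1 over a sorted JSON dump are assumptions for illustration, not tscached's actual implementation; the stripped time fields are KairosDB's standard `start_absolute`/`start_relative`/`end_absolute`/`end_relative`.

```python
import hashlib
import json


def make_kquery_key(query):
    """Hypothetical: hash the query's JSON dump, ignoring its time window."""
    time_fields = ('start_absolute', 'start_relative',
                   'end_absolute', 'end_relative')
    stripped = {k: v for k, v in query.items() if k not in time_fields}
    digest = hashlib.sha1(
        json.dumps(stripped, sort_keys=True).encode('utf-8')).hexdigest()
    return 'tscached:kquery:' + digest


def make_mts_key(result):
    """Hypothetical: hash only the identifying subset of a result dict."""
    subset = {k: result.get(k) for k in ('name', 'group_by', 'tags')}
    digest = hashlib.sha1(
        json.dumps(subset, sort_keys=True).encode('utf-8')).hexdigest()
    return 'tscached:mts:' + digest
```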
## Algorithm Outline

You have received a query intended for KairosDB. What to do?

What we do depends on whether (and what) corresponding data exists in the KQuery Store; that gives three scenarios, detailed below: cold, hot, and warm.
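
To make the branching concrete, here is a hedged sketch of the top-level dispatch. The KQuery value layout (`end` in epoch milliseconds, `mts_keys`) and the three handler functions (sketched under each scenario below) are assumptions; the hot path's start-coverage check is omitted for brevity.

```python
import json
import time

STALENESS_SECONDS = 10  # the configurable freshness threshold


def handle_query(redis_client, query):
    kquery_key = make_kquery_key(query)  # from the sketch above
    cached = redis_client.get(kquery_key)
    if cached is None:
        return handle_cold(redis_client, kquery_key, query)      # MISS (cold)
    kquery = json.loads(cached)
    age = time.time() - kquery['end'] / 1000.0  # 'end' assumed in epoch ms
    if age <= STALENESS_SECONDS:
        return handle_hot(redis_client, kquery, query)           # HIT (hot)
    return handle_warm(redis_client, kquery_key, kquery, query)  # HIT (warm)
```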
### Cache MISS (cold)
Unfortunately for the user, tscached has never seen this exact query before.

To proceed:
- Forward the entire query to Kairos.
- Split the Kairos result into discrete MTS; hash them; write them into Redis.
- Write the KQuery (including its set of MTS hashes) into Redis.
- No trimming of old values is needed, since we queried for exactly what we wanted.
- Return the result (may as well use the pre-built Kairos version) to the user.
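
Under the same assumptions, the cold path might look like the following; `KAIROS_URL` is KairosDB's standard query endpoint, but everything else is only a sketch (and it assumes a single-metric query for brevity).

```python
import json
import time

import requests

KAIROS_URL = 'http://localhost:8080/api/v1/datapoints/query'


def handle_cold(redis_client, kquery_key, query):
    response = requests.post(KAIROS_URL, json=query).json()
    mts_keys = []
    for result in response['queries'][0]['results']:
        mts_key = make_mts_key(result)
        redis_client.set(mts_key, json.dumps(result))  # one entry per series
        mts_keys.append(mts_key)
    # Record the absolute window this entry now covers, plus its MTS keys.
    redis_client.set(kquery_key, json.dumps({
        'query': query,
        'start': query.get('start_absolute'),
        'end': int(time.time() * 1000),
        'mts_keys': mts_keys,
    }))
    return response  # the pre-built Kairos version
```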
### Cache HIT (hot)
The user is in luck: this data is extremely fresh. This is the postcache scenario.

To be a *hot* hit, three properties must be true:
- A KQuery entry must exist for the data.
- The KQuery data must have a start timestamp at or before the requested start.
- The KQuery data must have an end timestamp within 10s of NOW (configurable).

To proceed:
- Do **not** query Kairos at all - this is explicit flood control!
- Pull all listed MTS out of Redis.
- For each MTS, trim any data older than the requested START.
- Return the rebuilt result (ts fixing, etc.) without updating Redis.
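
A sketch of the hot path under the same assumed layout; note that KairosDB encodes datapoints as `[timestamp_ms, value]` pairs inside each result's `values` list.

```python
import json


def handle_hot(redis_client, kquery, query):
    start_ms = query['start_absolute']  # assumes an absolute start, for brevity
    results = []
    for mts_key in kquery['mts_keys']:
        mts = json.loads(redis_client.get(mts_key))
        # Trim anything older than the requested START; no Redis writes here.
        mts['values'] = [v for v in mts['values'] if v[0] >= start_ms]
        results.append(mts)
    sample_size = sum(len(m['values']) for m in results)
    return {'queries': [{'sample_size': sample_size, 'results': results}]}
```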
### Cache HIT (warm)
This is the key tscached advancement. Data already present in Redis, but more
than 10s old, is **appended to** instead of overwritten.

This removes a ridiculously high burden from your KairosDB cluster. For example, reading
from an entire production environment and plotting its load average:

- 10 second resolution
- 24 hour chart
- 2,000 hosts in the environment
- Returns **17.28 MILLION** data points (86,400s / 10s = 8,640 points per host, times 2,000 hosts).

Needless to say, one requires a *ridiculously oversized* KairosDB cluster to handle
this kind of load. So why bother? Results from such a query total only a few megabytes
of JSON. With tscached, after the first (painful!) MISS, we may now query for as few as
**2,000** data points (one new point per host) on each subsequent query.

The end goal, therefore, is to turn a bursty load into a constant load... for all the obvious
reasons!
To proceed:
- Mutate the request: forward the original request to Kairos for only the missing (older/newer) intervals.
- Pull the relevant MTS (listed in the KQuery) from Redis.
- Merge MTS in an all-pairs strategy, relying on index-hash lookup for the KairosDB data (see the sketch after this list).
- As merges occur, overwrite the discrete MTS in Redis.
- Any new MTS (one that just started reporting) will be merged with an empty set and written to Redis.
- Update the KQuery with new start/end timestamps and with any new MTS hashes.
- If the retrieved MTS start is too old (compared to the original request), trim it down.
- Return the combined result.
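
Here is one way the warm merge might look, continuing the assumptions of the previous sketches. It only appends at the *end* (see the note on start updating under Future work), and it leaves out the trimming step from the list above.

```python
import json
import time

import requests


def handle_warm(redis_client, kquery_key, kquery, query):
    # Ask Kairos only for the interval elapsed since the cached end.
    tail_query = dict(query, start_absolute=kquery['end'])
    response = requests.post(KAIROS_URL, json=tail_query).json()

    # Index the fresh results by MTS hash so cached and new series pair up.
    fresh = {make_mts_key(r): r for r in response['queries'][0]['results']}

    merged_results, mts_keys = [], []
    for mts_key in set(kquery['mts_keys']) | set(fresh):
        cached = redis_client.get(mts_key)
        old_values = json.loads(cached)['values'] if cached else []  # new series merge with empty
        new_values = fresh[mts_key]['values'] if mts_key in fresh else []
        mts = fresh.get(mts_key) or json.loads(cached)
        mts['values'] = old_values + new_values
        redis_client.set(mts_key, json.dumps(mts))  # overwrite as merges occur
        merged_results.append(mts)
        mts_keys.append(mts_key)

    # Update the KQuery's covered window and its MTS key list.
    redis_client.set(kquery_key, json.dumps(
        dict(kquery, end=int(time.time() * 1000), mts_keys=mts_keys)))
    return {'queries': [{'results': merged_results}]}
```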
## Future work
### Start updating
We may not want to support start updating at first. It covers a strange use
case: you'd have a 1h dashboard that you stretched out to 6h. We would have to query -6h to -1h, then the
last bit of clock time...
### Staleness prevention
If a KQuery was last requested 6 hours ago (and only for one hour's range), we should
not bother reading from it now. In other words, despite handling the same *semantic data*
as before, tscached is effectively cold. TTL expiry may be useful for this case.
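
One plausible mechanism uses redis-py's native key expiry; the six-hour figure and key name are placeholders.

```python
import redis

client = redis.StrictRedis()
# Re-arm the TTL on every read or write of a KQuery (and its MTS);
# an entry nobody has touched for six hours simply disappears.
client.expire('tscached:kquery:<hash>', 6 * 60 * 60)  # TTL in seconds
```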
### Preemptive caching (Read-before)
tscached is intended to minimize read load on KairosDB; the fact that it will make
dashboards built with Grafana, et al. much faster to load is a happy coincidence.

This leads to a natural next step: if a finite number of dashboards are built in a Grafana
installation, but then queried very rarely (i.e., only under emergency scenarios), why not
provide *shadow load* on them all the time? This would, in effect, keep tscached current and
result in a very high (hot) hit rate.
#### How to implement read-before caching
There are many ways to achieve this: a daemon, cron jobs, etc. Here is our presumed first attempt, sketched after this list:
- Keep a list, in Redis, of KQueries to treat as read-before enabled.
- Run a cron job every ~five minutes that iterates over the list of *shadow load-enabled* dashboards.
- This cron job may simply query the tscached webapp, or use the same code to offload the work.
- Regardless, it will keep the cached data fresh to within a given window, even without a human viewing it.
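
A sketch of what that cron job might run, assuming the read-before list is a Redis set of KQuery keys and that tscached listens on a Kairos-compatible endpoint (the URL and set name are made up):

```python
import json

import redis
import requests

TSCACHED_URL = 'http://localhost:8008/api/v1/datapoints/query'  # assumed

client = redis.StrictRedis()
for kquery_key in client.smembers('tscached:read-before'):
    entry = client.get(kquery_key)
    if entry is None:
        continue  # entry expired; nothing to keep warm
    # Replaying the stored query through tscached itself triggers the
    # normal warm-append path, keeping the cache fresh with no viewer.
    requests.post(TSCACHED_URL, json=json.loads(entry)['query'])
```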
#### How to know what should be read-before cached
- It doesn't make sense to cache KQueries used solely in *compose* operations, since the query is constantly changing.
- Presumably those used by *saved* dashboards would benefit much more!
- We can check the HTTP *referer* field: if it contains `/dashboard/db/` (for a Grafana frontend), then it's a saved dashboard (good). If the last characters are `edit`, then it's an update of that saved dashboard (ungood). See the sketch after this list.
- If we can fully understand the schema of URLs from a given frontend, Grafana or otherwise, this strategy should work.
- A better approach? Send an `X-tscached` header with an appropriate mode. This requires an upstream change in whichever graphing dashboard you choose to use, though.
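
The referer heuristic could be as small as this hypothetical helper (shown for a Flask-style request object):

```python
def should_read_before_cache(request):
    """Hypothetical heuristic: shadow-cache saved-dashboard queries only."""
    referer = request.headers.get('Referer', '')
    if referer.endswith('edit'):
        return False  # editing a saved dashboard: ungood
    return '/dashboard/db/' in referer  # a saved Grafana dashboard: good
```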