
Commit 363bce8

Move design doc into new file
1 parent c91dcda commit 363bce8

File tree

2 files changed: +119 -118 lines

DESIGN.md

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@

# tscached: Initial Design

![tscached logo](https://github.com/zachm/tscached/raw/master/logo/logo.png)

## Stored Data

We're storing two types of data in Redis, hereafter referred to as **KQuery** and **MTS**.

### KQuery
- Key: a hash of the JSON dump of a given KairosDB query, with one exception: the start/end time values are removed before hashing.
- Value: a JSON dump of the query, including the ABSOLUTE timestamps it currently covers.
- The timestamps in the value are updated whenever we update its constituent MTS.
- We also include a list of Redis keys for the matching MTS (used in the HOT scenario).
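As a rough illustration, the KQuery key might be derived as below. This is a minimal sketch, not tscached's actual code; the stripped field names and the `tscached:kquery:` prefix are assumptions.

```python
# Minimal sketch (not tscached's actual code): hash a KairosDB query after
# stripping its start/end fields, so the same query over any time window
# maps to the same KQuery key.
import hashlib
import json


def kquery_key(query):
    """Hypothetical Redis key for the KQuery entry of `query`."""
    stripped = {k: v for k, v in query.items()
                if k not in ('start_relative', 'start_absolute',
                             'end_relative', 'end_absolute')}
    # sort_keys keeps the hash stable regardless of dict ordering
    digest = hashlib.sha1(json.dumps(stripped, sort_keys=True).encode()).hexdigest()
    return 'tscached:kquery:' + digest
```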

### MTS (Metric Time Series)
- Briefly, each Metric Time Series is one KairosDB **result** dict.
- Given that one KairosDB query may return N time series results, an MTS represents one of them.
- Key: a hash of a subset of the result: its `name`, `group_by`, and `tags` elements.
- Value: the full contents of the result dict.
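Continuing the same hypothetical sketch, the MTS key hashes only the fields that identify a series, never its datapoints:

```python
# Minimal sketch: an MTS key is a hash over the identifying subset of a
# KairosDB result dict. The `tscached:mts:` prefix is an assumption.
import hashlib
import json


def mts_key(result):
    """Hypothetical Redis key for one KairosDB result dict."""
    subset = {k: result.get(k) for k in ('name', 'group_by', 'tags')}
    digest = hashlib.sha1(json.dumps(subset, sort_keys=True).encode()).hexdigest()
    return 'tscached:mts:' + digest
```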

## Algorithm Outline

You have received a query intended for KairosDB. What to do?

What we do depends on whether (and what) corresponding data exists in the KQuery Store.

### Cache MISS (cold)
Unfortunately for the user, tscached has never seen this exact query before.

To proceed:
- Forward the entire query to Kairos.
- Split the Kairos result into discrete MTS, hash them, and write them into Redis.
- Write the KQuery (including its set of MTS hashes) into Redis.
- No trimming of old values is needed, since we only queried for what we wanted.
- Return the result (we may as well use the pre-built Kairos version) to the user.
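A rough sketch of the cold path, reusing the hypothetical helpers above (the `query_kairos` callable and the single-metric assumption are illustrative, not tscached's actual API):

```python
import json


def handle_miss(redis_client, query, query_kairos):
    """Cold path: query Kairos for everything, cache each MTS and the KQuery."""
    response = query_kairos(query)                    # full query to KairosDB
    mts_keys = []
    # Assume a single-metric query for brevity, hence response['queries'][0].
    for result in response['queries'][0]['results']:
        key = mts_key(result)
        redis_client.set(key, json.dumps(result))     # one Redis entry per MTS
        mts_keys.append(key)
    kquery_value = dict(query, mts_keys=mts_keys)     # remember which MTS we hold
    redis_client.set(kquery_key(query), json.dumps(kquery_value))
    return response                                   # reuse Kairos' pre-built response
```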

### Cache HIT (hot)
The user is in luck: this data is extremely fresh. This is the postcache scenario.

To be a *hot* hit, three properties must be true:
- A KQuery entry must exist for the data.
- The KQuery data must have a start timestamp at or before the one requested.
- The KQuery data must have an end timestamp within 10s of NOW (configurable).

To proceed:
- Do **not** query Kairos at all - this is explicit flood control!
- Pull all listed MTS out of Redis.
- For each MTS, trim any data older than the requested START.
- Return the rebuilt result (timestamp fixing, etc.) without updating Redis.
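The hot-hit test itself is simple; here is a sketch under the assumption that the KQuery value stores `start_absolute`/`end_absolute` in milliseconds:

```python
import time

HOT_WINDOW_SECONDS = 10  # configurable staleness window, per the design


def is_hot(kquery_value, requested_start_ms):
    """True when the cached KQuery already covers the requested window (hot hit)."""
    now_ms = int(time.time() * 1000)
    return (kquery_value['start_absolute'] <= requested_start_ms and
            now_ms - kquery_value['end_absolute'] <= HOT_WINDOW_SECONDS * 1000)
```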

### Cache HIT (warm)
This is the key tscached advancement. Data that already exists in Redis, but is more
than 10s old, is **appended to** instead of overwritten.

This removes a ridiculously high burden from your KairosDB cluster. For example, reading
from an entire production environment and plotting its load average:

- 10 second resolution
- 24 hour chart
- 2,000 hosts in the environment
- Returns **17.28 MILLION** data points (8,640 points per host, per day).

Needless to say, one requires a *ridiculously oversized* KairosDB cluster to handle
this kind of load. So why bother? Results from such a query total only a few megabytes
of JSON. With tscached, after the first (painful!) MISS, we may query for as few as
**2,000** new data points on each subsequent request.

The end goal, therefore, is to turn a bursty load into a constant load... for all the obvious
reasons!

To proceed:
- Mutate the request: forward the original request to Kairos for only the older/younger intervals.
- Pull the relevant MTS (listed in the KQuery) from Redis.
- Merge MTS in an all-pairs strategy. This relies on an Index-Hash lookup for the KairosDB data.
- As merges occur, overwrite the discrete MTS in Redis (see the sketch after this list).
- Any new MTS (one that just started reporting) is merged with an empty set and written to Redis.
- Update the KQuery with new start/end timestamps and with any new MTS hashes.
- If the retrieved MTS start is too old (compared to the original request), trim it down.
- Return the combined result.
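In rough form, the merge step might look like this (a simplified sketch that assumes the KairosDB `values` layout of `[timestamp_ms, value]` pairs; not tscached's actual code):

```python
import json


def merge_mts(redis_client, cached_result, new_result, requested_start_ms):
    """Append newly fetched datapoints to a cached MTS, trim, and write it back."""
    merged = dict(cached_result)
    merged['values'] = cached_result.get('values', []) + new_result.get('values', [])
    # Drop datapoints older than the requested start before writing back.
    merged['values'] = [v for v in merged['values'] if v[0] >= requested_start_ms]
    redis_client.set(mts_key(merged), json.dumps(merged))   # overwrite in Redis
    return merged
```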

## Future work

### Start updating
We may not want to support start updating at first. It would be a strange use
case: you would have a 1h dashboard that you stretched out to 6h, requiring a query from
-6h to -1h plus the last bit of clock time...

### Staleness prevention
If a KQuery was last requested 6 hours ago (and only for one hour's range), we should
not bother reading from it now. In other words, despite handling the same *semantic data*
as before, tscached is effectively cold. TTL expiry may be useful for this case.
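For instance, with a redis-py client one might simply write each KQuery with an expiry; the 6-hour window here is only an assumption:

```python
# Illustrative only: give the cached KQuery a TTL so a long-unused entry
# expires on its own instead of being served cold later.
redis_client.set(kquery_key(query), json.dumps(kquery_value), ex=6 * 3600)
```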

### Preemptive caching (Read-before)
tscached is intended to minimize read load on KairosDB; the fact that it will make
dashboards built with Grafana et al. much faster to load is a happy coincidence.

This leads to a natural next step: if a finite number of dashboards are built in a Grafana
installation, but then queried very rarely (i.e., only under emergency scenarios), why not
provide *shadow load* on them all the time? This would, in effect, keep tscached current and
result in a very high (hot) hit rate.

#### How to implement read-before caching
There are many ways to achieve this: a daemon, cron jobs, etc. Here is our presumed first attempt (see the sketch after this list):
- Keep a list, in Redis, of KQueries to treat as read-before enabled.
- Run a cron job every ~five minutes that iterates over the list of *shadow load-enabled* dashboards.
- This cron job may simply query the tscached webapp, or use the same code to offload the work.
- Regardless, it will keep the cached data fresh to within a given window, even without a human viewing it.
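A possible cron-driven refresher, sketched under the assumption that re-issuing a saved query through the tscached webapp is enough to warm its cache (the Redis list key, port, and endpoint are all hypothetical):

```python
# Hypothetical read-before refresher, run from cron every ~5 minutes.
import redis
import requests

r = redis.StrictRedis()
for raw_query in r.lrange('tscached:readahead', 0, -1):
    # Re-issuing each saved query through tscached itself keeps its cache warm.
    requests.post('http://localhost:8008/api/v1/datapoints/query',
                  data=raw_query,
                  headers={'Content-Type': 'application/json'})
```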

#### How to know what should be read-before cached
- It doesn't make sense to cache KQueries used solely in *compose* operations, since the query is constantly changing.
- Presumably those used by *saved* dashboards would benefit much more!
- We can check the HTTP *referer* field: if it contains `/dashboard/db/` (for a Grafana frontend), then it's a saved dashboard (good). If the last characters are `edit`, then it's an update of that saved dashboard (ungood).
- If we can fully understand the schema of URLs from a given frontend, Grafana or otherwise, this strategy should work.
- A better approach? Send an `X-tscached` header with an appropriate mode. This requires an upstream change in whichever graphing dashboard you choose to use, though.
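The referer heuristic could be as simple as the following (illustrative only; how the header is obtained depends on the web framework):

```python
def should_readahead(referer):
    """Heuristic: saved Grafana dashboards warrant read-before caching."""
    if not referer:
        return False
    return '/dashboard/db/' in referer and not referer.endswith('edit')
```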

README.md

Lines changed: 0 additions & 118 deletions
@@ -10,121 +10,3 @@ step further: *A previously issued query will be reissued across only the elapse
last execution.* This provides a substantial improvement in serving high-volume load, especially temporally long queries that return thousands of time series. Using only simple techniques - consistent hashing, read-through caching, and backend load chunking - we provide user-perceived read latency improvements of up to 100x.

There are several different frontends to use with a Kairos-compliant API like this one, but the most full-featured remains (as always) [Grafana](http://grafana.org/) with [this plugin](https://github.com/grafana/kairosdb-datasource) installed.

(118 lines removed: beginning with "Everything that follows is something of a fluid design document.", the rest of the README was the design document now reproduced verbatim in DESIGN.md above.)