---
title: Welcome read-cache-after-write consistency!
date: 2026-03-13
author: >-
  [Attila Mészáros](https://github.com/csviri)
---

**TL;DR:**
In version 5.3.0 we introduced strong consistency guarantees for updates with a new API.
You can now update resources (both your custom resource and managed resources),
and the framework will guarantee that these updates are instantly visible
when accessing resources from caches,
and naturally also in subsequent reconciliations.

I briefly [talked about this](https://www.youtube.com/watch?v=HrwHh5Yh6AM&t=1387s) topic at KubeCon last year.

```java
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {

  ConfigMap managedConfigMap = prepareConfigMap(webPage);
  // apply the resource with the new API
  context.resourceOperations().serverSideApply(managedConfigMap);

  // the fresh resource from our update is instantly available from the caches
  var upToDateResource = context.getSecondaryResource(ConfigMap.class);

  // from now on, built-in update methods use this feature by default;
  // it is guaranteed that resource changes will be visible in the next reconciliation
  return UpdateControl.patchStatus(alterStatusObject(webPage));
}
```

In addition, the framework will automatically filter out events caused by your own updates,
so they don't trigger the reconciliation again.

{{% alert color=success %}}
**This should significantly simplify controller development and make reconciliation
much easier to reason about!**
{{% /alert %}}

This post takes a deep dive into the topic, exploring the details and the rationale behind it.

See the related umbrella [issue](https://github.com/operator-framework/java-operator-sdk/issues/2944) on GitHub.

## Informers and eventual consistency

First, we have to understand a fundamental building block of Kubernetes operators: informers.
Since there is plenty of accessible information about this topic, here is just a brief summary. Informers:

1. Watch Kubernetes resources — if a resource changes, the K8S API sends an event to the client
   through a websocket. An event usually contains the whole resource. (There are some exceptions, see Bookmarks.)
   See details about watch as a K8S API concept in the [official docs](https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-watch).
2. Cache the latest state of the resource.
3. If an informer receives an event in which the `metadata.resourceVersion` differs from the version
   of the cached resource, it calls the event handler, in our case triggering the reconciliation.

A controller is usually composed of multiple informers: one tracking the primary resource, and
additional informers registered for each (secondary) resource we manage.
Informers are great since we don't have to poll the Kubernetes API — they are push-based. They also provide
a cache, so reconciliations are very fast, since they work on top of cached resources.
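
The deduplication in step 3 above can be modeled with a toy sketch. Note that the class and method names below are hypothetical, not JOSDK or fabric8 API; real informers do far more (list/watch bootstrapping, resync, and so on):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the informer behavior described above: remember the latest
// resourceVersion per resource and call the event handler only when an
// incoming event carries a different version than the cached one.
public class InformerSketch {

  private final Map<String, String> versionByName = new ConcurrentHashMap<>();

  /** Returns true if the event handler (i.e. the reconciliation) should fire. */
  public boolean onEvent(String name, String resourceVersion) {
    String previous = versionByName.put(name, resourceVersion);
    return !resourceVersion.equals(previous);
  }
}
```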

Now let's take a look at the flow when we update a resource:

```mermaid
graph LR
  subgraph Controller
    Informer:::informer
    Cache[(Cache)]:::teal
    Reconciler:::reconciler
    Informer -->|stores| Cache
    Reconciler -->|reads| Cache
  end
  K8S[⎈ Kubernetes API Server]:::k8s

  Informer -->|watches| K8S
  Reconciler -->|updates| K8S

  classDef informer fill:#C0527A,stroke:#8C3057,color:#fff
  classDef reconciler fill:#E8873A,stroke:#B05E1F,color:#fff
  classDef teal fill:#3AAFA9,stroke:#2B807B,color:#fff
  classDef k8s fill:#326CE5,stroke:#1A4AAF,color:#fff
```

It is easy to see that the informer's cache is only eventually consistent with the updates we send from the reconciler.
It usually takes just a very short time (a few milliseconds) to sync the caches, and everything is fine. Well, sometimes
it isn't: the websocket can be disconnected (which actually happens on purpose sometimes), the API server can be slow, and so on.

## The problem(s) we try to solve

Let's consider an operator with the following requirements:
 - we have a custom resource `PrefixedPod` whose spec contains only one field: `podNamePrefix`
 - the goal of the operator is to create a Pod whose name consists of the prefix and a random suffix
 - it should never run two Pods at once; if the `podNamePrefix` changes, it should delete
   the current Pod and then create a new one
 - the status of the custom resource should contain the `generatedPodName`

This is how the code would look in 5.2.x:

```java
public UpdateControl<PrefixedPod> reconcile(PrefixedPod primary, Context<PrefixedPod> context) {

  Optional<Pod> currentPod = context.getSecondaryResource(Pod.class);

  if (currentPod.isPresent()) {
    if (podNameHasPrefix(primary.getSpec().getPodNamePrefix(), currentPod.get())) {
      // all OK, we can return
      return UpdateControl.noUpdate();
    } else {
      // delete the current pod with the different name pattern
      context.getClient().resource(currentPod.get()).delete();
      // return; the pod delete event will trigger the reconciliation
      return UpdateControl.noUpdate();
    }
  } else {
    // create a new pod
    var newPod = context.getClient().resource(createPodWithOwnerReference(primary)).serverSideApply();
    return UpdateControl.patchStatus(setGeneratedPodNameToStatus(primary, newPod));
  }
}

@Override
public List<EventSource<?, PrefixedPod>> prepareEventSources(EventSourceContext<PrefixedPod> context) {
  // Code omitted for adding an InformerEventSource for the Pod
}
```

That is quite simple: if there is a Pod with a different name prefix we delete it; otherwise we create the Pod
and update the status. The Pod is created with an owner reference, so any update on the Pod will trigger
the reconciliation.

Now consider the following sequence of events:

1. We create a `PrefixedPod` with `spec.podNamePrefix`: `first-pod-prefix`.
2. Concurrently:
   - The reconciliation logic runs and creates a Pod with a generated name suffix: `first-pod-prefix-a3j3ka`;
     it also sets this name in the status and updates the custom resource status.
   - While the reconciliation is running, we update the custom resource to have the value
     `second-pod-prefix`.
3. The update of the custom resource triggers the reconciliation.

When the spec change triggers the reconciliation in point 3, there is absolutely **no guarantee** that:
- the created Pod will already be visible — `currentPod` might simply be empty
- the `status.generatedPodName` will be visible

Since both are backed by an informer, and the caches of those informers are only eventually consistent with our updates,
the next reconciliation would create a new Pod, violating the requirement not to have two
Pods running at the same time. In addition, the controller would override the status. Although in the case of a Kubernetes
resource we can still find the existing Pods later via owner references, if we were managing a
non-Kubernetes (external) resource, we would not notice that we had already created one.

So can we have stronger guarantees regarding caches? It turns out we can now...

## Achieving read-cache-after-write consistency

When we send an update to the Kubernetes API (this also applies to various create and patch requests),
the response contains the up-to-date resource with the most recent resource version at that point.
The idea is that we can store this response in a cache layered on top of the informer's cache.
We call this cache the `TemporaryResourceCache` (TRC); besides caching such responses, it also plays a role in event filtering,
as we will see later.

Note that the challenge in the past was knowing when to evict this response from the TRC. Eventually,
the informer will receive an event and its cache will be populated with an up-to-date resource.
But it was not possible to reliably tell whether an event contained a resource that was the result
of an update made before or after our own update, because the Kubernetes documentation stated that
`metadata.resourceVersion` should be treated as an opaque string and compared only for equality.
(With optimistic locking, though, we were able to overcome this issue — see [this blog post](primary-cache-for-next-recon.md).)

{{% alert color=success %}}
This changed in the Kubernetes guidelines. Now, if we can parse the `resourceVersion` as an integer,
we can use numerical comparison. See the related [KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/5504-comparable-resource-version).
{{% /alert %}}
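
Under the new guideline, such a comparison could look like the following sketch (a hypothetical helper, not part of the JOSDK API), falling back to plain equality when a version is not numeric:

```java
// Hypothetical helper illustrating the comparison enabled by the KEP:
// numeric resourceVersions can be ordered; opaque ones can only be
// compared for equality.
public final class ResourceVersionComparator {

  private ResourceVersionComparator() {}

  /** True if eventVersion is at least as new as knownVersion. */
  public static boolean isUpToDate(String eventVersion, String knownVersion) {
    try {
      return Long.parseLong(eventVersion) >= Long.parseLong(knownVersion);
    } catch (NumberFormatException e) {
      // Opaque resourceVersion: only equality is meaningful.
      return eventVersion.equals(knownVersion);
    }
  }
}
```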

From this point, the idea of the algorithm is very simple:

1. After updating a Kubernetes resource, cache the response in the TRC.
2. When the informer propagates an event, check if its resource version is greater than or equal to
   the one in the TRC. If yes, evict the resource from the TRC.
3. When the controller reads a resource from the cache, it checks the TRC first, then falls back to the informer's cache.

```mermaid
sequenceDiagram
  box rgba(50,108,229,0.1)
    participant K8S as ⎈ Kubernetes API Server
  end
  box rgba(232,135,58,0.1)
    participant R as Reconciler
  end
  box rgba(58,175,169,0.1)
    participant I as Informer
    participant IC as Informer Cache
    participant TRC as Temporary Resource Cache
  end

  R->>K8S: 1. Update resource
  K8S-->>R: Updated resource (with new resourceVersion)
  R->>TRC: 2. Cache updated resource in TRC

  I-)K8S: 3. Watch event (resource updated)
  I->>TRC: On event: event resourceVersion ≥ TRC version?
  alt Yes: event is up-to-date
    I-->>TRC: Evict resource from TRC
  else No: stale event
    Note over TRC: TRC entry retained
  end

  R->>TRC: 4. Read resource from cache
  alt Resource found in TRC
    TRC-->>R: Return cached resource
  else Not in TRC
    R->>IC: Read from Informer Cache
    IC-->>R: Return resource
  end
```
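
The three steps above can be condensed into a small model. This is a simplified sketch with hypothetical names, assuming numeric resource versions; the actual `TemporaryResourceCache` in JOSDK handles many more edge cases:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the TRC idea. "Resource" is a stand-in record keyed
// by name, carrying only the resourceVersion we compare on.
public class TemporaryResourceCacheSketch {

  public record Resource(String name, String resourceVersion) {}

  private final Map<String, Resource> trc = new ConcurrentHashMap<>();
  private final Map<String, Resource> informerCache = new ConcurrentHashMap<>();

  // 1. After an update, cache the API server's response in the TRC.
  public void onUpdateResponse(Resource updated) {
    trc.put(updated.name(), updated);
  }

  // 2. On an informer event, evict the TRC entry if the event is as new as
  //    or newer than our cached response; a stale event leaves it in place.
  public void onInformerEvent(Resource fromEvent) {
    informerCache.put(fromEvent.name(), fromEvent);
    trc.computeIfPresent(fromEvent.name(), (name, cached) ->
        Long.parseLong(fromEvent.resourceVersion())
            >= Long.parseLong(cached.resourceVersion()) ? null : cached);
  }

  // 3. Reads check the TRC first, then fall back to the informer cache.
  public Optional<Resource> get(String name) {
    Resource cached = trc.get(name);
    return cached != null ? Optional.of(cached) : Optional.ofNullable(informerCache.get(name));
  }
}
```

Note how a stale event still updates the informer cache in this sketch, but reads keep returning the fresher TRC entry until an event catches up with our own update.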

## Filtering events for our own updates

When we update a resource, the informer will eventually propagate an event that would trigger a reconciliation.
However, this is usually not desired: since we already have the up-to-date resource at that point,
we would like to be notified only if the resource changes after our update.
Therefore, in addition to caching the resource, we also filter out events that carry a resource
version older than or equal to the cached one.

Note that the implementation of this is relatively complex: while performing the update, we have to record all the
events received in the meantime and decide whether to propagate them further once the update request completes.

However, this way we significantly reduce the number of reconciliations, making the whole process much more efficient.

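One way to picture the "record events while the update is in flight" part is the following sketch. The names are hypothetical and the logic is greatly simplified compared to the real implementation (numeric resource versions are assumed):

```java
import java.util.ArrayList;
import java.util.List;

// While our own update request is running, incoming events are buffered;
// once the response arrives we know our resourceVersion and can decide
// which buffered events are genuinely newer and must still be propagated.
public class InFlightUpdateBuffer {

  private final List<String> buffered = new ArrayList<>();
  private boolean updateInFlight;

  public synchronized void startUpdate() {
    updateInFlight = true;
  }

  // Events arriving during the update are buffered; otherwise they are
  // propagated directly into the given sink.
  public synchronized void onEvent(String resourceVersion, List<String> propagated) {
    if (updateInFlight) {
      buffered.add(resourceVersion);
    } else {
      propagated.add(resourceVersion);
    }
  }

  // Called with the resourceVersion from our update response; buffered
  // events that are older or equal are our own echo and are dropped.
  public synchronized void finishUpdate(String ownVersion, List<String> propagated) {
    updateInFlight = false;
    buffered.stream()
        .filter(v -> Long.parseLong(v) > Long.parseLong(ownVersion))
        .forEach(propagated::add);
    buffered.clear();
  }
}
```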
### The case for instant reschedule

We realize that some of our users might rely on the fact that reconciliation is triggered by their own updates.
To support backwards compatibility, or rather a migration path, we now provide a way to instruct the framework
to queue an instant reconciliation:

```java
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {

  // omitted reconciliation logic

  return UpdateControl.<WebPage>noUpdate().reschedule();
}
```

## Additional considerations and alternatives

An alternative approach would be to not trigger the next reconciliation until the
target resource appears in the informer's cache. The upside is that we don't have to maintain an
additional cache of resources, just the target resource versions; this approach might therefore have
a smaller memory footprint, but not necessarily. See the related [KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/5647-stale-controller-handling#proposal)
that takes this approach.

On the other hand, when we make a request, the response object is always deserialized, regardless of whether we are going
to cache it. In most cases this object will be cached only for a very short time and then garbage collected.
The memory overhead should therefore be minimal.

Having the TRC brings an additional advantage: since the resource is instantly present in our caches, we can
elegantly continue the reconciliation in the same pass and reconcile resources that depend
on the latest state. More concretely, this also helps with our [Dependent Resources / Workflows](../../docs/documentation/dependent-resource-and-workflows/workflows.md#reconcile-sample),
which rely on up-to-date caches. In this sense, this approach is much better in terms of throughput.

## Conclusion

I personally worked on a prototype of an operator that depended on an unreleased version of JOSDK already
implementing these features. The most obvious gain was how much simpler the reasoning became in some cases and how it reduced the corner
cases that we would otherwise have to solve with the [expectation pattern](https://ahmet.im/blog/controller-pitfalls/#expectations-pattern)
or other facilities.

## Special thanks

I would like to thank all the contributors who directly or indirectly contributed, including [metacosm](https://github.com/metacosm),
[manusa](https://github.com/manusa), and [xstefank](https://github.com/xstefank).

Last but certainly not least, special thanks to [Steven Hawkins](https://github.com/shawkins),
who maintains the Informer implementation in the [fabric8 Kubernetes client](https://github.com/fabric8io/kubernetes-client)
and implemented the first version of the algorithms. We then iterated on it together multiple times.
Covering all the edge cases was quite an effort.
Just as a highlight, I'll mention the [last one](https://github.com/operator-framework/java-operator-sdk/issues/3208).

Thank you!