Commit 84d7b89

csviri and Copilot authored
docs: blog post on read-cache-after-write consistency (#3194)
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
1 parent b491b6a commit 84d7b89

3 files changed: +286 -3 lines changed


docs/content/en/blog/news/primary-cache-for-next-recon.md

Lines changed: 3 additions & 2 deletions
@@ -10,9 +10,10 @@ author: >-
 Read-cache-after-write consistency feature replaces this functionality. (since version 5.3.0)

 > It provides this functionality also for secondary resources and optimistic locking
-is not required anymore. See [details here](./../../docs/documentation/reconciler.md#read-cache-after-write-consistency-and-event-filtering).
-{{% /alert %}}
+is not required anymore. See the [docs](./../../docs/documentation/reconciler.md#read-cache-after-write-consistency-and-event-filtering) and
+related [blog post](read-after-write-consistency.md) for details.

+{{% /alert %}}

 We recently released v5.1 of Java Operator SDK (JOSDK). One of the highlights of this release is related to a topic of
 so-called
Lines changed: 282 additions & 0 deletions
@@ -0,0 +1,282 @@
---
title: Welcome read-cache-after-write consistency!
date: 2026-03-13
author: >-
  [Attila Mészáros](https://github.com/csviri)
---

**TL;DR:**
In version 5.3.0 we introduced strong consistency guarantees for updates with a new API.
You can now update resources (both your custom resource and managed resources)
and the framework guarantees that these updates are instantly visible
when you read resources from the caches,
and therefore also in subsequent reconciliations.

I briefly [talked about this](https://www.youtube.com/watch?v=HrwHh5Yh6AM&t=1387s) topic at KubeCon last year.

```java
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {

  ConfigMap managedConfigMap = prepareConfigMap(webPage);
  // apply the resource with new API
  context.resourceOperations().serverSideApply(managedConfigMap);

  // fresh resource instantly available from our update in the caches
  var upToDateResource = context.getSecondaryResource(ConfigMap.class);

  // from now on built-in update methods by default use this feature;
  // it is guaranteed that resource changes will be visible for next reconciliation
  return UpdateControl.patchStatus(alterStatusObject(webPage));
}
```

In addition, the framework automatically filters the events caused by your own updates,
so they don't trigger the reconciliation again.

{{% alert color=success %}}
**This should significantly simplify controller development, and will make reconciliation
much simpler to reason about!**
{{% /alert %}}

This post takes a deep dive into the topic, exploring the details and the rationale behind it.

See the related umbrella [issue](https://github.com/operator-framework/java-operator-sdk/issues/2944) on GitHub.

## Informers and eventual consistency

First, we have to understand a fundamental building block of Kubernetes operators: Informers.
Since plenty of accessible material already covers this topic, here is only a brief summary. Informers:

1. Watch Kubernetes resources: the Kubernetes API server sends an event to the client through a websocket
   whenever a resource changes. An event usually contains the whole resource. (There are some exceptions, see Bookmarks.)
   See details about watch as a Kubernetes API concept in the [official docs](https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-watch).
2. Cache the latest state of the resource.
3. If an informer receives an event in which the `metadata.resourceVersion` is different from the version
   in the cached resource, it calls the event handler, which in our case triggers the reconciliation.
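
The three points above can be sketched in a few lines of plain Java. This is only an illustration: `InformerSketch` and its methods are hypothetical names, not the fabric8 or JOSDK API, and a resource is reduced to an id plus its `resourceVersion` string.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Illustrative only: a toy informer that caches resource versions and notifies a handler. */
final class InformerSketch {

  // resource id -> last seen metadata.resourceVersion (a real informer caches the whole resource)
  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final Consumer<String> eventHandler;

  InformerSketch(Consumer<String> eventHandler) {
    this.eventHandler = eventHandler;
  }

  /** Called for every watch event received from the API server. */
  void onWatchEvent(String resourceId, String resourceVersion) {
    String previous = cache.put(resourceId, resourceVersion);
    // Versions are compared for equality only; a different version means the resource changed
    if (!resourceVersion.equals(previous)) {
      eventHandler.accept(resourceId);
    }
  }

  /** Reads from the local cache; never calls the API server. */
  String cachedVersion(String resourceId) {
    return cache.get(resourceId);
  }
}
```

In this sketch the handler would be the place where a reconciliation gets queued; repeated events carrying the same version do not call it again.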

A controller is usually composed of multiple informers: one tracking the primary resource, and
additional informers registered for each (secondary) resource we manage.
Informers are great because they are push-based, so we don't have to poll the Kubernetes API. They also provide
a cache, so reconciliations are very fast since they work on top of cached resources.

Now let's take a look at the flow when we update a resource:

```mermaid
graph LR
    subgraph Controller
        Informer:::informer
        Cache[(Cache)]:::teal
        Reconciler:::reconciler
        Informer -->|stores| Cache
        Reconciler -->|reads| Cache
    end
    K8S[⎈ Kubernetes API Server]:::k8s

    Informer -->|watches| K8S
    Reconciler -->|updates| K8S

    classDef informer fill:#C0527A,stroke:#8C3057,color:#fff
    classDef reconciler fill:#E8873A,stroke:#B05E1F,color:#fff
    classDef teal fill:#3AAFA9,stroke:#2B807B,color:#fff
    classDef k8s fill:#326CE5,stroke:#1A4AAF,color:#fff
```

It is easy to see that the cache of the informer is only eventually consistent with the update we sent from the reconciler.
It usually takes only a very short time (a few milliseconds) to sync the caches and everything is fine. Well, sometimes
it isn't. The websocket can be disconnected (which sometimes happens on purpose), the API Server can be slow, etc.

## The problem(s) we try to solve

Let's consider an operator with the following requirements:
- we have a custom resource `PrefixedPod` where the spec contains only one field: `podNamePrefix`
- the goal of the operator is to create a Pod with a name that has the prefix and a random suffix
- it should never run two Pods at once; if the `podNamePrefix` changes, it should delete
  the current Pod and then create a new one
- the status of the custom resource should contain the `generatedPodName`

How the code would look in 5.2.x:

```java
public UpdateControl<PrefixedPod> reconcile(PrefixedPod primary, Context<PrefixedPod> context) {

  Optional<Pod> currentPod = context.getSecondaryResource(Pod.class);

  if (currentPod.isPresent()) {
    if (podNameHasPrefix(primary.getSpec().getPodNamePrefix(), currentPod.get())) {
      // all ok, we can return
      return UpdateControl.noUpdate();
    } else {
      // delete the current Pod, whose name has a different prefix
      context.getClient().resource(currentPod.get()).delete();
      // return; the Pod delete event will trigger the reconciliation
      return UpdateControl.noUpdate();
    }
  } else {
    // create a new Pod
    var newPod = context.getClient().resource(createPodWithOwnerReference(primary)).serverSideApply();
    return UpdateControl.patchStatus(setGeneratedPodNameToStatus(primary, newPod));
  }
}

@Override
public List<EventSource<?, PrefixedPod>> prepareEventSources(EventSourceContext<PrefixedPod> context) {
  // Code omitted for adding InformerEventSource for the Pod
}
```

That is quite simple: if there is a Pod with a different name prefix we delete it; if there is no Pod we create one
and update the status. The Pod is created with an owner reference, so any update on the Pod will trigger
the reconciliation.

Now consider the following sequence of events:

1. We create a `PrefixedPod` with `spec.podNamePrefix`: `first-pod-prefix`.
2. Concurrently:
   - The reconciliation logic runs and creates a Pod with a generated name suffix: "first-pod-prefix-a3j3ka";
     it also sets this in the status and updates the custom resource status.
   - While the reconciliation is running, we update the custom resource to have the value
     `second-pod-prefix`.
3. The update of the custom resource triggers the reconciliation.

When the spec change triggers the reconciliation in point 3, there is absolutely **no guarantee** that:
- the created Pod will already be visible (`currentPod` might simply be empty)
- the `status.generatedPodName` will be visible

Both are backed by an informer, and the caches of those informers are only eventually consistent with our updates.
The next reconciliation could therefore create a second Pod, violating the requirement to not have two
Pods running at the same time. In addition, the controller would overwrite the status. In the case of a Kubernetes
resource we can still find the existing Pods later via owner references, but if we were managing a
non-Kubernetes (external) resource we would not notice that we had already created one.

So can we have stronger guarantees regarding caches? It turns out we can now...

## Achieving read-cache-after-write consistency

When we send an update (this also applies to various create and patch requests) to the Kubernetes API, in the response
we receive the up-to-date resource with the resource version that is the most recent at that point.
The idea is that we can cache this response in a cache on top of the Informer's cache.
We call this cache `TemporaryResourceCache` (TRC), and besides caching such responses, it also plays a role in event filtering
as we will see later.

Note that the challenge in the past was knowing when to evict this response from the TRC. Eventually,
we will receive an event in the informer and the informer cache will be populated with an up-to-date resource.
But it was not possible to reliably tell whether an event contained a resource that was the result
of an update before or after our own update. The reason is that the Kubernetes documentation stated that
`metadata.resourceVersion` should be treated as an opaque string and matched only with equality.
With optimistic locking we were able to work around this issue; see [this blog post](primary-cache-for-next-recon.md).

{{% alert color=success %}}
This changed in the Kubernetes guidelines. Now, if we can parse the `resourceVersion` as an integer,
we can use numerical comparison. See the related [KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/5504-comparable-resource-version).
{{% /alert %}}
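
As a rough illustration of what numerical comparison means here, the sketch below compares two `resourceVersion` strings under the assumption that both are plain decimal integers. `ResourceVersions` is a hypothetical helper written for this post, not the JOSDK or fabric8 API; `BigInteger` is used so that very large versions do not overflow.

```java
import java.math.BigInteger;

/** Hypothetical helper for illustration: numeric comparison of resourceVersion strings. */
final class ResourceVersions {

  private ResourceVersions() {}

  /**
   * Returns a negative number, zero, or a positive number if {@code a} is older than,
   * equal to, or newer than {@code b}. Throws NumberFormatException if either value
   * cannot be parsed as a decimal integer; callers then have to fall back to equality checks.
   */
  static int compare(String a, String b) {
    return new BigInteger(a).compareTo(new BigInteger(b));
  }
}
```

Note that plain string comparison would not work: lexicographically `"9"` sorts after `"10"`, while numerically it is older.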

From this point the idea of the algorithm is very simple:

1. After updating a Kubernetes resource, cache the response in the TRC.
2. When the informer propagates an event, check if its resource version is greater than or equal to
   the one in the TRC. If yes, evict the resource from the TRC.
3. When the controller reads a resource from cache, it checks the TRC first, then falls back to the Informer's cache.
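
The three steps can be sketched as follows. This is a simplified model and not the actual `TemporaryResourceCache` implementation: `TemporaryCacheSketch`, its nested `Resource` class and all method names are made up for illustration, and resource versions are assumed to be decimal integers.

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; the real TRC in JOSDK handles many more edge cases. */
final class TemporaryCacheSketch {

  /** Simplified stand-in for a Kubernetes resource. */
  static final class Resource {
    final String id;
    final String resourceVersion;

    Resource(String id, String resourceVersion) {
      this.id = id;
      this.resourceVersion = resourceVersion;
    }
  }

  private final Map<String, Resource> temporary = new ConcurrentHashMap<>();
  private final Map<String, Resource> informerCache = new ConcurrentHashMap<>();

  /** Step 1: cache the response of our own update. */
  void onOwnUpdate(Resource response) {
    temporary.put(response.id, response);
  }

  /** Step 2: on an informer event, evict the temporary entry if the event is at least as new. */
  void onInformerEvent(Resource fromEvent) {
    informerCache.put(fromEvent.id, fromEvent);
    // Returning null from the remapping function removes the entry
    temporary.computeIfPresent(fromEvent.id,
        (id, cached) -> isSameOrNewer(fromEvent, cached) ? null : cached);
  }

  /** Step 3: reads check the temporary cache first, then fall back to the informer cache. */
  Optional<Resource> get(String id) {
    Resource cached = temporary.get(id);
    return cached != null ? Optional.of(cached) : Optional.ofNullable(informerCache.get(id));
  }

  // Assumes both versions are decimal integers, so they can be compared numerically
  private static boolean isSameOrNewer(Resource a, Resource b) {
    return new BigInteger(a.resourceVersion).compareTo(new BigInteger(b.resourceVersion)) >= 0;
  }
}
```

With this model, a read right after `onOwnUpdate` returns our own write even while the informer cache still holds an older version, and the temporary entry disappears once the informer has caught up.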

```mermaid
sequenceDiagram
    box rgba(50,108,229,0.1)
    participant K8S as ⎈ Kubernetes API Server
    end
    box rgba(232,135,58,0.1)
    participant R as Reconciler
    end
    box rgba(58,175,169,0.1)
    participant I as Informer
    participant IC as Informer Cache
    participant TRC as Temporary Resource Cache
    end

    R->>K8S: 1. Update resource
    K8S-->>R: Updated resource (with new resourceVersion)
    R->>TRC: 2. Cache updated resource in TRC

    K8S-)I: 3. Watch event (resource updated)
    I->>TRC: On event: event resourceVersion ≥ TRC version?
    alt Yes: event is up-to-date
        I-->>TRC: Evict resource from TRC
    else No: stale event
        Note over TRC: TRC entry retained
    end

    R->>TRC: 4. Read resource from cache
    alt Resource found in TRC
        TRC-->>R: Return cached resource
    else Not in TRC
        R->>IC: Read from Informer Cache
        IC-->>R: Return resource
    end
```

## Filtering events for our own updates

When we update a resource, eventually the informer will propagate an event that would trigger a reconciliation.
However, this is usually not desired. Since we already have the up-to-date resource at that point,
we would like to be notified only if the resource is changed after our change.
Therefore, in addition to caching the resource, we also filter out events that contain a resource
version older than or equal to our cached resource version.
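
A minimal sketch of this filtering rule is shown below. `OwnUpdateEventFilter` and its methods are hypothetical names, not JOSDK API; the sketch assumes numeric resource versions, never evicts recorded versions, and ignores the events that arrive while the update request is still in flight.

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: decides whether an informer event should trigger a reconciliation. */
final class OwnUpdateEventFilter {

  // resource id -> newest resourceVersion produced by our own updates
  private final Map<String, BigInteger> ownVersions = new ConcurrentHashMap<>();

  /** Records the resourceVersion returned by the API server for our own update. */
  void recordOwnUpdate(String resourceId, String resourceVersion) {
    ownVersions.merge(resourceId, new BigInteger(resourceVersion), BigInteger::max);
  }

  /** Events that are not newer than our own update are filtered out. */
  boolean shouldTriggerReconciliation(String resourceId, String eventResourceVersion) {
    BigInteger own = ownVersions.get(resourceId);
    return own == null || new BigInteger(eventResourceVersion).compareTo(own) > 0;
  }
}
```

Only an event with a strictly newer version than our own write passes the filter, which corresponds to a change made by someone else after our update.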

Note that the implementation of this is relatively complex: while the update request is in flight, we have to record all the
events received in the meantime and decide whether to propagate them further once the request is complete.

However, this way we significantly reduce the number of reconciliations, making the whole process much more efficient.

### The case for instant reschedule

We realize that some of our users might rely on the fact that reconciliation is triggered by their own updates.
To support backwards compatibility, or rather a migration path, we now provide a way to instruct the framework
to queue an instant reconciliation:

```java
public UpdateControl<WebPage> reconcile(WebPage webPage, Context<WebPage> context) {

  // omitted reconciliation logic

  return UpdateControl.<WebPage>noUpdate().reschedule();
}
```

## Additional considerations and alternatives

An alternative approach would be to not trigger the next reconciliation until the
target resource appears in the Informer's cache. The upside is that we don't have to maintain an
additional cache of the resource, just the target resource version; therefore this approach might have
a smaller memory footprint, but not necessarily. See the related [KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/5647-stale-controller-handling#proposal)
that takes this approach.

On the other hand, when we make a request, the response object is always deserialized regardless of whether we are going
to cache it or not. In most cases this object is cached only for a very short time and garbage collected afterwards.
Therefore, the memory overhead should be minimal.

Having the TRC has an additional advantage: since we have the resource instantly in our caches, we can
elegantly continue the reconciliation in the same pass and reconcile resources that depend
on the latest state. More concretely, this also helps with our [Dependent Resources / Workflows](../../docs/documentation/dependent-resource-and-workflows/workflows.md#reconcile-sample)
which rely on up-to-date caches. In this sense, this approach is better for throughput.

## Conclusion

I personally worked on a prototype of an operator that depended on an unreleased version of JOSDK already
implementing these features. The most obvious gain was how much simpler the reasoning became in some cases and how many
corner cases disappeared that we would otherwise have to solve with the [expectation pattern](https://ahmet.im/blog/controller-pitfalls/#expectations-pattern)
or other facilities.

## Special thanks

I would like to thank all the contributors who directly or indirectly contributed, including [metacosm](https://github.com/metacosm),
[manusa](https://github.com/manusa), and [xstefank](https://github.com/xstefank).

Last but certainly not least, special thanks to [Steven Hawkins](https://github.com/shawkins),
who maintains the Informer implementation in the [fabric8 Kubernetes client](https://github.com/fabric8io/kubernetes-client)
and implemented the first version of the algorithms. We then iterated on it together multiple times.
Covering all the edge cases was quite an effort.
Just as a highlight, I'll mention the [last one](https://github.com/operator-framework/java-operator-sdk/issues/3208).

Thank you!

docs/content/en/blog/releases/v5-3-release.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ If your reconciler relied on being re-triggered by its own writes, a new `resche
 > in-flight updates. Use `context.getSecondaryResources(..)` or `InformerEventSource.get(ResourceID)`
 > instead.

-See the [reconciler docs](/docs/documentation/reconciler#read-cache-after-write-consistency-and-event-filtering) for details.
+See the related [blog post](../news/read-after-write-consistency.md) and [reconciler docs](/docs/documentation/reconciler#read-cache-after-write-consistency-and-event-filtering) for details.

 ### MicrometerMetricsV2
