-
Notifications
You must be signed in to change notification settings - Fork 142
Add v3 xDS support #140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add v3 xDS support #140
Conversation
58c2018
to
47f9e14
Compare
* Returns all cluster items in the CDS payload. | ||
*/ | ||
public abstract SnapshotResources<Cluster> clusters(); | ||
abstract String version(String typeUrl, List<String> resourceNames); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diff came out big, Snapshot basically got renamed to V2Snapshot, and a new V3Snapshot was created. they extend a common abstract class.
|
||
import static io.envoyproxy.controlplane.cache.Resources.V2_TYPE_URLS; | ||
|
||
public class V2SimpleCache<T> extends SimpleCache<T, V2Snapshot> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These classes make it so callers don't have to worry about the new type variable on SimpleCache, and also passing type URLs when instantiating the cache
* consistent (i.e. all referenced resources must be included in the snapshot). | ||
* | ||
* @param group group identifier | ||
* @param snapshot a versioned collection of node config data | ||
*/ | ||
void setSnapshot(T group, Snapshot snapshot); | ||
void setSnapshot(T group, U snapshot); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Callers will need to update to be specific about the type of snapshot they're generating, either V2Snapshot or V3Snapshot
* a v3 request. | ||
*/ | ||
@AutoValue | ||
public abstract class XdsRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is basically a oneof of v2 request or v3 request, and delegates methods to the right one.
@@ -0,0 +1,546 @@ | |||
package io.envoyproxy.controlplane.cache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the tests, my approach was basically to rename the existing test to V2, and then copy/paste to a V3 class, updating the imports to use v3 message types, and using v3 type URLs. I also had to implement the new callbacks interfaces, and throw IllegalStateException in the test if the unexpected request version happens.
This means theres' a lot of copy/paste, but IMO that's less concerning for tests. I like tests to be clear rather than super DRY, whereas the implementation being reused as much as possible makes less surface for bugs.
I'm open to feedback on this though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do a second round and try to dry them up myself and see if that is better.
@@ -54,17 +55,42 @@ default void onStreamOpen(long streamId, String typeUrl) { | |||
* will be returned to the client and the stream will be closed with error. | |||
*/ | |||
default void onStreamRequest(long streamId, DiscoveryRequest request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left the old method name the same to minimize impact on end users, but maybe we should force a rename? failing to implement the v3 version and turning on v3 transport in the data plane would cause issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename it to be explicit. We're already forcing renames in other places.
* @throws RequestException optionally can throw {@link RequestException} with custom status. That status | ||
* will be returned to the client and the stream will be closed with error. | ||
*/ | ||
default void onV3StreamRequest(long streamId, io.envoyproxy.envoy.service.discovery.v3.DiscoveryRequest request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I not default this for safety?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain more what the consequences are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe could mean a user accidentally doesn't implement the new callback after doing an upgrade, and then they don't get the callbacks they expect.
374c0f9
to
2610f3f
Compare
I thought envoyproxy/envoy#10776 was done already, I think we have to wait for that |
alternatively, I need to implement up/down conversion here |
2610f3f
to
d1de621
Compare
This commit adds support for the v3 xDS transport as well as support for storing and serving v3 resources out of the resource cache. Note that resource versions and xDS transport version are decoupled to provide a migration path to v3. End users can decide to generate v3 resources and serve them over v2 transport during migration, or continue to generate v2 resources and serve them over v3 transport. For implementation simplicity, it is not possible to generate a combination of v2 and v3 resources in the control plane, which would require either up or down conversion. Signed-off-by: Michael Puncel <[email protected]>
d1de621
to
cae4ebc
Compare
case V3: | ||
return output.toBuilder() | ||
.setTypeUrl(Resources.V2_TYPE_URLS_TO_V3.get(output.getTypeUrl())) | ||
.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added conversion in this PR. the type URL of the output Any will be rewritten to match the request version if they're different.
This is covered in V2DiscoveryServerV3ResourcesXdsIT, V2DiscoveryServerV3ResourcesAdsIT, V3DiscoveryServerV2ResourcesXdsIT, and V3DiscoveryServerV2ResourcesAdsIT.
Those tests do not cover using resource_api_version at all, which I'd be open to adding if folks think there'd be value
This is really big :) I'll probably need a couple of rounds of review to get through all. |
I also want to note that while it passes integration tests, I haven't tried it in production yet. I put up the PR on the early side since there has been interest from others in the status of this work. I'll keep posted on this PR when we try it out @slonka yep, definitely a big one. I do think most of the LOC added is in tests, which are mostly copy/paste. Let me know if there is anything I can do to make reviewing easier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, small changes needed. I also did a dry run of migration in our control plane and everything seems ok: https://github.com/allegro/envoy-control/compare/v3-update-dry-run?expand=1 . I still need to understand migration paths (and their tests) more deeply. I'll do it in the second run of review.
} | ||
} catch (InvalidProtocolBufferException e) { | ||
LOGGER.error( | ||
"Failed to convert HTTP connection manager config struct into protobuf message for listener {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding info about it being V3?
// For EDS clusters, use the cluster name or the service name override. | ||
if (c.getType() == io.envoyproxy.envoy.config.cluster.v3.Cluster.DiscoveryType.EDS) { | ||
if (!isNullOrEmpty(c.getEdsClusterConfig().getServiceName())) { | ||
refs.add(c.getEdsClusterConfig().getServiceName()); | ||
} else { | ||
refs.add(c.getName()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: This can be extracted to a method and used in V2 version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it? c
has a different type for each of them
@@ -13,7 +13,8 @@ | |||
long lastWatchRequestTime(); | |||
|
|||
/** | |||
* Returns the node grouping represented by this status, generated via {@link NodeGroup#hash(Node)}. | |||
* Returns the node grouping represented by this status, generated via | |||
* {@link NodeGroup#hashv3(Node)} or {@link NodeGroup#hashV2(io.envoyproxy.envoy.api.v2.core.Node)}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invalid link, should be V3
* {@link NodeGroup#hashv3(Node)} or {@link NodeGroup#hashV2(io.envoyproxy.envoy.api.v2.core.Node)}. | |
* {@link NodeGroup#hashV3(Node)} or {@link NodeGroup#hashV2(io.envoyproxy.envoy.api.v2.core.Node)}. |
this.latestResponse = new ConcurrentHashMap<>(Resources.TYPE_URLS.size()); | ||
this.ackedResources = new ConcurrentHashMap<>(Resources.TYPE_URLS.size()); | ||
|
||
// NOTE: while V2 URLs are referenced here, the size of V2_TYPE_URLS and V3_TYPE_URLS is the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a comment maybe we can check the size here and throw an exception if it's not.
@@ -54,17 +55,42 @@ default void onStreamOpen(long streamId, String typeUrl) { | |||
* will be returned to the client and the stream will be closed with error. | |||
*/ | |||
default void onStreamRequest(long streamId, DiscoveryRequest request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename it to be explicit. We're already forcing renames in other places.
@@ -0,0 +1,546 @@ | |||
package io.envoyproxy.controlplane.cache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do a second round and try to dry them up myself and see if that is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the change looks nice, thanks for the work. I left a few comments.
Also, test copies aren't great in general, yet I don't have an idea of how to easily share them between v2 and v3. Maybe somebody else has an idea?
* resources. Snapshots should have distinct versions per node group. | ||
*/ | ||
@AutoValue | ||
public abstract class V3Snapshot extends Snapshot { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why version prefixes in class names like here? To me that's what packages are for, i.e. (...).v3.Snapshot
. Otherwise, we'll eventually remove v2 support but stay with awkward V3Snapshot
and the like forever.
@@ -14,5 +14,12 @@ | |||
* | |||
* @param node identifier for the envoy instance that is requesting config | |||
*/ | |||
T hash(Node node); | |||
T hashV2(Node node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather use method overloading than version suffixes, so have hash(Node)
and hash(v3.Node)
instead of hashV2(Node)
and hashV3(v3.Node)
.
+ ".transport_sockets.tls.v3.Secret"; | ||
|
||
private static final String V2_TYPE_URL_PREFIX = "type.googleapis.com/envoy.api.v2."; | ||
public static final String CLUSTER_TYPE_URL = V2_TYPE_URL_PREFIX + "Cluster"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have CLUSTER_TYPE_URL
and V3_ CLUSTER_TYPE_URL
, not V2_CLUSTER_TYPE_URL
and V3_ CLUSTER_TYPE_URL
. Why is that?
Also, we could consider introducing a namespace for constants like:
public class Resources {
// ...
public static class V2 {
public static final String CLUSTER_TYPE_URL = "...";
// ...
}
public static class V3 {
public static final String CLUSTER_TYPE_URL = "...";
// ...
}
}
…adoc link. Use method overloading instead of different names for v2 vs v3 node hash Signed-off-by: Michael Puncel <[email protected]>
Signed-off-by: Michael Puncel <[email protected]>
Signed-off-by: Michael Puncel <[email protected]>
Signed-off-by: Michael Puncel <[email protected]>
Signed-off-by: Michael Puncel <[email protected]>
85864e8
to
3a07190
Compare
Thanks for the feedback! I believe I addressed everything except the test duplication so far |
@mpuncel we're watching ServiceMeshCon & KubeCon this week but we'll try to make a review ASAP. Please make sure the build is passing though. |
@sschepens can you look at this PR as well? |
I'm missing the overall picture of how the migration should occur. Now:
Where we want to be:
During the migration we would have some Envoys requesting resources with
since we did not switch them from If I'm completely wrong, can you describe the migration steps on the Envoy side and the CP side? |
I was envisioning that a control plane implementation will set up both a V2DiscoveryServer and a V3DiscoveryServer implementation, and give them the same cache instance. This would let you handle a combination of v2 and v3 transport as the migration happens, but only need to generate one resource version. It's my understanding that all combinations of transport and resource API version can be handled, but I don't see a reason why during migration you wouldn't keep transport_api_version and resource_api_version the same. It looks like the integration tests might unintentionally always be using the v2 resource version. we could:
|
…on tests Signed-off-by: Michael Puncel <[email protected]>
Note that there are not tests that cover a different transport version from resource version in the interest of not doubling the number of integration tests. Signed-off-by: Michael Puncel <[email protected]>
Signed-off-by: Michael Puncel <[email protected]>
0048bec
to
45f67aa
Compare
I DRY'd up the tests somewhat. I think a lot of the challenge there is that a lot of the heavy lifting is in static field initializers, so I don't think I can use an inheritance based approach very effectively. Java is not my main language though, so feel free to point out suggestions. I also fixed the integration tests so that the v3 clients are requesting v3 resources as well, it was an oversight that I wasn't setting Also added a migration guide explaining how I think migration will work. Hopefully that helps clear up how this PR should be used, and y'all can double check me, it's definitely a bit confusing. |
Also let me know if it would help for me to break this into smaller PRs that can be reviewed piecemeal. I needed to do it all together I think in order to understand how it would work, and I think it's probably useful to see in its entirety before reviewing smaller chunks, but I know it's pretty tough to review in its current form |
It would seem that remaining on v2 of xDS API incurs a performance penalty when using Envoy 14+: envoyproxy/envoy#12080 Any chance more progress has been made on v3 support in jcp? I've been hoping I would not need to fork this repo, but after discovering the above issue .... |
@mpuncel great initiative! We at Spotify are currently migrating edge control plane to use xds api v3 and are interested in this PR. Let us know if we could help. |
Edit: I'm working on an integration test that has both V2 and V3 Envoy's connected. If that works, this will remove all doubt that the migration is possible.
Edit: Ok, I created a version that serves only V3 Snapshots and both V2 and V3 Envoy consume it. It works because of this #140 (comment) right? If this https://github.com/mpuncel/java-control-plane/compare/mpuncel/v3-support...slonka:v3-support-slonka?expand=1 test shows how you envisioned the migration then I'm convinced and we can move this PR forward. |
I released |
cc @rulex123 |
@nezdolik - I've done it a couple of times but maybe this one will pass: https://app.circleci.com/pipelines/github/envoyproxy/java-control-plane/22/workflows/55a2c4b0-f831-4bfd-bc9e-656d4c925f45 |
@slonka the latest run is successful. This flakiness could appear in future builds. Did not dig into code much, one assumption is that echo container was not ready to serve request at the time the test was run (docker networking was not fully set up). |
@nezdolik the tests from this branch failed once again: https://app.circleci.com/pipelines/github/envoyproxy/java-control-plane?branch=pull%2F140 I do not know why CircleCi reported your PR (142) in this PR as passing (or am I missing something?). AFAIK our test setup in I think we can open up an issue about flaky tests and move forward with this PR. Have you had a chance to test this version at Spotify? |
@slonka maybe because the git history in my PR is complete copy of this PR, if CI relies on comparing git hashes to distinguish builds... |
@slonka @mpuncel while upgrading our custom plane to use code from this PR, i came across
So clients of the library will have to override both of those methods which is not optimal. Wdyt? |
I don't see any other way to distinguish between v2 and v3 node, if we want to still allow SimpleCache to serve them. If we split it I think we would have to provide a separate cache for v2 and v3. Can you share an example of how would you do it differently? |
@slonka if we "force" clients of the library to implement both v2 and v3 methods, it means that users cannot do v3 migration in one go and will have to do extra step by end of Q4 after the library removes all usages of v2 and is released with breaking changes. If this is acceptable and the PR intends to introduce v3 support with some mixed versioning in api then i guess code is fine as it is.
Would be great to hear more opinions on this. |
@slonka got it, was just thinking that there are users (like us) who would rather fully move on v3 with no need to maintain v2 code in their control plane. |
I agree that it's better to fail at compile time than discover problems at startup. Default methods could cause the latter. On the other hand, I think removing the implementation for a method that no longer exists in the future interface will be rather straightforward to do by the clients – again, the compiler will show what's to change. That's how I see the tradeoff, I can be convinced though. I didn't dig into the code enough to know if duplicate code approach is feasible. If it is then we could do it this way. |
I just want to make sure I understand that completely. Ok, so during the migration period when some Envoy's are v2 and some are v3 we would translate the v2 node to v3 type? Or do you want to have different caches for v2 and v3 nodes? |
@slonka thinking about it once again, looks like is a good overall tradeoff (not to duplicate caches for library maintainers + removing one method for library clients some time before end of Q4). |
Pinging @sschepens once more as the author of #139 if you want to take a look. If there are no objections I'm merging and releasing it on Friday afternoon / Monday morning (probably Monday morning) |
I was not able to release this version due to failing "WarmingClusters" integration tests. People who want to use this version now should use the snapshot version while I fix the tests in another PR. |
Released under |
This commit adds support for the v3 xDS transport as well as support for
storing and serving v3 resources out of the resource cache.
Note that resource versions and xDS transport version are decoupled to
provide a migration path to v3. End users can decide to generate v3
resources and serve them over v2 transport during migration, or continue
to generate v2 resources and serve them over v3 transport.
For implementation simplicity, it is not possible to generate a
combination of v2 and v3 resources in the control plane, which would
require either up or down conversion.
Signed-off-by: Michael Puncel [email protected]