Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent partially applies new fleet-server TLS configuration #5888

Open
AndersonQ opened this issue Oct 30, 2024 · 17 comments
Open

Agent partially applies new fleet-server TLS configuration #5888

AndersonQ opened this issue Oct 30, 2024 · 17 comments
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@AndersonQ
Copy link
Member

AndersonQ commented Oct 30, 2024

The Elastic Agent can receive porxy and TLS configuration for fleet-server thorough its CLI, during install/enroll, and the policy. The configurations from the policy take precedence over the CLI, however and empty configuration from the policy does not changes the current configuration.

If the Elastic Agent is installed with a proxy using mTLS and the policy has another proxy configured with simple/one way TLS, the agent will apply the new proxy and CA, but keep the old certificate, certificate key and key passphrase/key passphrase path. Letting the agent on an inconsistent state. Such state might even lead to a failure to start if the old certificate-related configurations are files and the files are removed. If it happens, on start up, the agent will try to load those files and fail, preventing the agent from starting.

The culprit is

func updateFleetConfig(log *logger.Logger, policyConfig remote.Config, agentConfig *remote.Config) {
// Hosts is the only connectivity field sent Fleet, let's clear everything else aside from Hosts
if len(policyConfig.Hosts) > 0 {
agentConfig.Hosts = make([]string, len(policyConfig.Hosts))
copy(agentConfig.Hosts, policyConfig.Hosts)
agentConfig.Host = ""
agentConfig.Protocol = ""
agentConfig.Path = ""
}
// Empty proxies from fleet are ignored. That way a proxy set by --proxy-url
// it won't be overridden by an absent or empty proxy from fleet-server.
// However, if there is a proxy sent by fleet-server, it'll take precedence.
// Therefore, it's not possible to remove a proxy once it's set.
if policyConfig.Transport.Proxy.URL == nil ||
policyConfig.Transport.Proxy.URL.String() == "" {
log.Debugw("proxy from fleet is empty or null, the proxy will not be changed", "current_proxy", agentConfig.Transport.Proxy.URL)
} else {
log.Debugw("received proxy from fleet, applying it", "old_proxy", agentConfig.Transport.Proxy.URL, "new_proxy", policyConfig.Transport.Proxy.URL)
// copy the proxy struct
agentConfig.Transport.Proxy = policyConfig.Transport.Proxy
// replace in agentConfig the attributes that are passed by reference within the proxy struct
// Headers map
agentConfig.Transport.Proxy.Headers = map[string]string{}
for k, v := range policyConfig.Transport.Proxy.Headers {
agentConfig.Transport.Proxy.Headers[k] = v
}
// Proxy URL
urlCopy := *policyConfig.Transport.Proxy.URL
agentConfig.Transport.Proxy.URL = &urlCopy
}
if policyConfig.Transport.TLS != nil {
tlsCopy := tlscommon.Config{}
if agentConfig.Transport.TLS != nil {
// copy the TLS struct
tlsCopy = *agentConfig.Transport.TLS
}
if policyConfig.Transport.TLS.Certificate == emptyCertificateConfig() {
log.Debug("TLS certificates from fleet are empty or null, the TLS config will not be changed")
} else {
tlsCopy.Certificate = policyConfig.Transport.TLS.Certificate
log.Debug("received TLS certificate/key from fleet, applying it")
}
if len(policyConfig.Transport.TLS.CAs) == 0 {
log.Debug("TLS CAs from fleet are empty or null, the TLS config will not be changed")
} else {
tlsCopy.CAs = make([]string, len(policyConfig.Transport.TLS.CAs))
copy(tlsCopy.CAs, policyConfig.Transport.TLS.CAs)
log.Debug("received TLS CAs from fleet, applying it")
}
agentConfig.Transport.TLS = &tlsCopy
}
}

which handles the proxy, TLS Certificate and the TLS CA configurations as different entities, when they should be treated as one.

In other words, if either change, the proxy or the TLS certificate or the CAs, they all should be replaced. If there is any change in policyConfig.Transport, apply the whole new Transport. Right now as proxy, TLS certificate and CA are handled separately, if the CLI defines a proxy with mTLS and the policy has another proxy with one way TLS, the result config will be a mix of both instead of only the new proxy with one way TLS. The expected result is to have the new proxy URL+headers and the CAs, but the certificate and certificate key must be cleared.

For confirmed bugs, please report:

  • Version: 8.16, main
  • Operating System: all
  • Discuss Forum URL: N/A
  • Steps to Reproduce:
    • add a HTTPS proxy on fleetUI with its CA
    • add this proxy to a fleet host
    • enroll the agent passing a different proxy with mTLS
    • after the agent is successfully enrolled and applied the policy from fleet-server, run the inspect command
    • observer the agent has the new proxy and CA, but kept the old certificate, certificate key and passphrase (if set)
@AndersonQ AndersonQ added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Oct 30, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@AndersonQ
Copy link
Member Author

AndersonQ commented Oct 30, 2024

@pierrehilbert, @nimarezainia, @cmacknz

we need to fully define and document what use-cases we'll cover with the proxy and TLS configuration. There are possible conflicting scenarios when the agent does not receive a proxy, certificate+key+passphrase or CAs.

just like that: https://xkcd.com/1172/

@AndersonQ
Copy link
Member Author

@leehinman I guess you're the one who can help to define what are real use cases or not

@nimarezainia
Copy link
Contributor

@pierrehilbert, @nimarezainia, @cmacknz

we need to fully define and document what use-cases we'll cover with the proxy and TLS configuration. There are possible conflicting scenarios when the agent does not receive a proxy, certificate+key+passphrase or CAs.

just like that: https://xkcd.com/1172/

@AndersonQ the use cases that we need to support should be defined already. I had a doc sometime ago and permutations of how we connect via a proxy are documented, I can forward if needed. I would say that we need to fix the code to ensure those use cases are functioning properly.

is this about the proxy having a different tls config?

@AndersonQ
Copy link
Member Author

is this about the proxy having a different tls config?

yes, about a new proxy removing any of the tls config. Either the clien certificates or the CA.

Considering the control plane connection: agent -> fleet-server, TLS is configured through the CLI and a proxy might be added
The cli configuration has precedence over empty/null/absent configuration from the policy.

Depending on type of the proxy the agent might still have to perform a TLS handshake with fleet-server (pass-through proxy)
Right now the agent considers 3 entities for the control plane connections (consider all TLS config is set as a path to a file):

  • Proxy:
    • url
    • headers
  • CA
    • list of CAs
  • client certificate
    • certificate
    • certificate key
    • key passphrase
    • key passphrase path

each of them can are evaluated in isolation and follow the precedence:
policy > cli > empty/null/absent

which means, once set, any of them can only be changed, never erased

This allows the agent to be individually configured, having different agents with
different TLS and proxy configuration in the same policy.

However, right now, once a mTLS proxy is configured, it's not possible to remove
the mutual TLS configuration. In a scenario where the user adds a new proxy to the policy
with one-way TLS, the agent will keep the client certificate configured. Which isn't an immediate problem as if the proxy does not require the agent to send its certificate, the agent will not send it. Therefore, the on-way TLS will work as expected.

The side effect is that if the client certificate (certificate, certificate key and key passphrase path) are a path and are deleted, the agent might not start up anymore as it'll look for the certificate (certificate, certificate key and key passphrase path).

I think it's more about we well defining and document what can and cannot be done.
We have the precedence working as expected: policy > cli > empty/null/absent
From the UI the proxy and its TLS configuration seems to be one thing, but on the agent side it is not. Also the agent does not differentiate between TLS for the proxy and directly to fleet-server.

The less intrusive is to keep it as is. However I believe we need to document clearly the implications of it and eventually, if needed, work to support removing proxy and TLS configurations thought the policy

@cmacknz
Copy link
Member

cmacknz commented Oct 31, 2024

From the UI the proxy and its TLS configuration seems to be one thing, but on the agent side it is not. Also the agent does not differentiate between TLS for the proxy and directly to fleet-server.

From a UI perspective, if you are adding a proxy and that proxy requires mTLS, then having the UI present all the TLS config together makes sense. It also is necessary to have a single atomic policy change for all of these parameters at once.

The UI below looks right in the context of adding a new proxy. This is assuming that the certificate and certificate key boxes are the client mTLS certificate and key. If they are, we should make that explicit.

Image

I think generally if someone is using mTLS it is likely to be a universal requirement of their deployment and not a quirk of a specific proxy, so it is highly unlikely mTLS is going to be suddenly removed. That said, not being able to remove mTLS if it was enabled at install time is a problem that will limit our ability to debug issues.

I suspect the core problem here is that clearing the proxy client cert and key has no effect if agent was configured locally with a client cert and key.

It should be possible for Fleet to tell agent to ignore the install time cert and key. This is a more general problem for parameters we allow to be set at install time that is not specific to this particular issue.

Unless I am missing something, we should open a separate issue to allow giving Fleet complete control over the agent configuration regardless of what got set on the command line at install or enroll time.

@AndersonQ
Copy link
Member Author

From a UI perspective, if you are adding a proxy and that proxy requires mTLS, then having the UI present all the TLS config together makes sense. It also is necessary to have a single atomic policy change for all of these parameters at once.

It's already atomic in the sense the agent either applies all changes or none if the validation fails. However it'll apply the precedence "entity by entity".

I suspect the core problem here is that clearing the proxy client cert and key has no effect if agent was configured locally with a client cert and key.

or through the policy. Not being able to remove a previously set config the is the expected and current behaviour.

It should be possible for Fleet to tell agent to ignore the install time cert and key. This is a more general problem for parameters we allow to be set at install time that is not specific to this particular issue.
Unless I am missing something, we should open a separate issue to allow giving Fleet complete control over the agent configuration regardless of what got set on the command line at install or enroll time.

That is the dream.

What I believe we need to decide is if, while we do not have this full control from fleet, the agent will treat proxy, CA, client cert as a single entity or keep the current behaviour.

It's my understanding it makes more send to make them all a single entity, which, given the proxy address has changed, would allow to remove CAs and the client certificate/key.

@AndersonQ
Copy link
Member Author

One thing is worth pointing out, the agent won't apply a configuration that causes loss of connectivity. Therefore, regardless the inconsistent state it might end up on, it'll still be able to reach fleet-server

@cmacknz
Copy link
Member

cmacknz commented Nov 1, 2024

It's already atomic in the sense the agent either applies all changes or none if the validation fails. However it'll apply the precedence "entity by entity".

When I said "atomic" I meant from the UI perspective, the mTLS configuration and the proxy URL should come in the same policy change action or agent won't apply it. The Proxy configuration UI is currently correct in grouping them all together.

What I believe we need to decide is if, while we do not have this full control from fleet, the agent will treat proxy, CA, client cert as a single entity or keep the current behaviour.

I don't fully understand this. The agent "entity" is an outgoing HTTP connection which may pass through a proxy and/or may use mTLS and/or may require additional certificate authorities to validate the certificate presented by the server it connects to. mTLS and the use of a proxy URL have no other relation, other than the fact that people may configure both a proxy and mTLS. The agent configuration needs no concept of a proxy entity right now.

The UI side of this models proxies as a single entity in a single page because people editing a proxy will want to consider the TLS configuration of the proxy. "Proxy" as an entity is a UI only concept here, and facilitates sending agent all of the required changes to support a proxy in one policy change action.

@jlind23
Copy link
Contributor

jlind23 commented Nov 4, 2024

Version: 8.16, main

@AndersonQ does this mean we have to fix it asap in order to make it land in 8.16.1? Is this bad enough that we should block 8.16? Shall we already create a known issue?

@AndersonQ
Copy link
Member Author

AndersonQ commented Nov 4, 2024

@jlind23 No, we're fine:

  • The current behaviour is what we agreed upon since the proxy in the policy was introduced, if any proxy/TLS configuration is set, it can only be overridden, not removed.
  • even with a possible mix up, if a proxy config changes from mTLS to TLS (one-way) it'll still work as the agent will not be required by the proxy to present the client certificate
  • if the new config does not work, the agent won't apply it

@AndersonQ
Copy link
Member Author

AndersonQ commented Nov 4, 2024

I don't fully understand this. The agent "entity" is an outgoing HTTP connection which may pass through a proxy and/or may use mTLS and/or may require additional certificate authorities to validate the certificate presented by the server it connects to. mTLS and the use of a proxy URL have no other relation, other than the fact that people may configure both a proxy and mTLS. The agent configuration needs no concept of a proxy entity right now.

@cmacknz, by "single entity" I meant the agent would check if the proxy changed, if so, it'd apply the TLS configuration from the policy, even if it's empty.

As you said, it's what one would expect to happen given the current UI

@leehinman
Copy link
Contributor

Been thinking about this for a bit from a PKI standpoint.

(Assumption) Because Fleet is a central management for configuration of agents, we would expect the configuration received from Fleet to completely override any local configuration. This would mean "erasing" an option if it wasn't present in the fleet supplied policy, but present in the local policy (not what happens today).

Edge cases:

  1. client tls private keys and certificates used in mTLS (proxy, enroll, etc). These SHOULD NOT be manged by fleet. This is because the client private key and certificate should be unique per host and a policy applies to a collection of hosts. Using the same private key on multiple hosts breaks the assumption that the key is "private". We need a local UI to update the private key & certificate to handle expiration/rotation.

  2. certificate authorities used in mTLS SHOULD be managed by Fleet. Because these are used to verify the server identity, these could and should be managed centrally. The agent should persist the certificate authorities in it's local policy. This is because it may need them to connect to the fleet server on restart.

  3. Proxy configuration. This COULD be managed centrally, but it would mean that all the agents in the policy use the same proxy configuration. This might not be a good assumption. The agent should persist the proxy configuration in it's local policy. This is because it may need the proxy to connect to fleet server on restart. If central management of proxy configuration is not a good assumption, then we need a local UI to update the configuration.

@cmacknz
Copy link
Member

cmacknz commented Nov 4, 2024

Well said @leehinman, we perhaps need an explicit way to annotate which parts of configuration can or must be per agent, because there is no way to centrally manage per agent values today. The mTLS private key and certificates are clearest example we've had so far. CC @nimarezainia The workflow we'd need to enable is rotating the mTLS private key of a single agent.

by "single entity" I meant the agent would check if the proxy changed, if so, it'd apply the TLS configuration from the policy, even if it's empty.

If I understand this right, this would be making an assumption in the agent that the UI will only ever update proxy and mTLS parameters together? I don't think agent should assume anything about how the UI will group changes together when there is not an explicitly defined API contract we can validate against (we don't have that here yet).

@AndersonQ
Copy link
Member Author

AndersonQ commented Nov 5, 2024

If I understand this right, this would be making an assumption in the agent that the UI will only ever update proxy and mTLS parameters together?

That would be the end result.

What we have today is trying to respect the separation Lee stated. "Proxy", "client cert and key" and "ca" are configured separately, but once set, they cannot be removed.

@ycombinator
Copy link
Contributor

@cmacknz is the next step here do build out the necessary UI enhancement for managing per-agent configuration (e.g. mTLS private key and certificate)? Or is there an immediate path forward where we can resolve this issue entirely on the Agent side without needing an UI work as a pre-requisite?

@AndersonQ
Copy link
Member Author

@ycombinator I'm not Craig, but my take on it is the following:

We cannot do it fully on the agent side, not the desired final state:

  1. client tls private keys and certificates used in mTLS (proxy, enroll, etc). These SHOULD NOT be manged by fleet. This is because the client private key and certificate should be unique per host and a policy applies to a collection of hosts. Using the same private key on multiple hosts breaks the assumption that the key is "private". We need a local UI to update the private key & certificate to handle expiration/rotation.

  2. certificate authorities used in mTLS SHOULD be managed by Fleet. Because these are used to verify the server identity, these could and should be managed centrally. The agent should persist the certificate authorities in it's local policy. This is because it may need them to connect to the fleet server on restart.

  3. Proxy configuration. This COULD be managed centrally, but it would mean that all the agents in the policy use the same proxy configuration. This might not be a good assumption. The agent should persist the proxy configuration in it's local policy. This is because it may need the proxy to connect to fleet server on restart. If central management of proxy configuration is not a good assumption, then we need a local UI to update the configuration.

What we can do on the agent side it to couple any proxy change with the CA and client tls private keys and certificates. So in the edge case where a CA and client cert were set though the CLI (during enroll) and a new proxy is set with absent CA and client cert, the agent would also remove the CA and client cert.

This is what can be done without UI changes. This would create a more consistent state, but goes against the behaviour that an nil or absent proxy/CA/cert would not remove an existing proxy/CA/cert.

I believe tying proxy+CA/cert together is the more consistent approach. That means, if the proxy changes, the agent would apply what ever CA/cert is in the policy, even if it means removing what was set by the CLI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

No branches or pull requests

7 participants