Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature/Certificate] Restart node when certificate updated #152

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

PLsergent
Copy link

@PLsergent PLsergent commented Aug 24, 2022

Q A
Bug fix? no
New feature? yes
API breaks? no
Deprecations? no
License Apache 2.0

What's in this PR?

This PR aims at creating a new field in the struct NodeState called CertificateExpireDate, and then automate the Nifi cluster rolling upgrade when the certificate is renewed (handle by the cert manager).

The idea is that we will give the field CertificateExpireDate the value of the current certificate renewal date, during each reconcile we will check if the CertificateExpireDate field has the same value as the current certificate renewal date, if no, we will do a rolling upgrade of the cluster and then assign the new renewal date to each node.

Why?

When the certificate handle by the cert manager was renewed it was necessary to restart the Nifi nodes manually to be able to access the UI.

Additional context

Checklist

  • Implementation tested
  • Error handling code meets the guideline
  • Logging code meets the guideline
  • User guide and development docs updated (if needed)
  • Append changelog with changes

ToDo

  • Test implementation after rebase

@PLsergent PLsergent changed the title feature/node_restart_when_certificate_updated [Feature/Certificate] Restart node when certificate updated Aug 24, 2022
pkg/util/pki/common.go Outdated Show resolved Hide resolved
@r65535
Copy link
Contributor

r65535 commented Aug 24, 2022

Have you tried out the certificate auto-reload feature that NiFi now supports out of the box?
Link
If you set nifi.security.autoreload.enabled in nifi.properties, any changes to the JKS keystore get auto-loaded without a restart.
Admittedly, I've not tried this in a k8s environment though - so might not be as slick as advertised here!

It might be worth testing out that workflow, before getting the operator to do the cert management via this PR?

@PLsergent
Copy link
Author

Have you tried out the certificate auto-reload feature that NiFi now supports out of the box? @r65535

I never heard about that so yes I'll try to use the feature before going further with this PR.

@mh013370
Copy link
Member

Have you tried out the certificate auto-reload feature that NiFi now supports out of the box? @r65535

I never heard about that so yes I'll try to use the feature before going further with this PR.

If cert-manager automatically renews certificates and the configured Secrets get updated, then Kubernetes will automatically update the volumes mounted to the running Pods:

https://kubernetes.io/docs/concepts/configuration/secret/#mounted-secrets-are-updated-automatically

If this is the case, then we might be able to take advantage of that feature of NiFi automatically. There's one condition where Kubernetes will not automatically update Secrets mounted as volumes in Pods when a mount subPath is used. I double checked and made sure nifikop doesn't set this:

func generateVolumesForSSL(cluster *v1alpha1.NifiCluster, nodeId int32) []corev1.Volume {
return []corev1.Volume{
{
Name: serverKeystoreVolume,
VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
SecretName: fmt.Sprintf(pkicommon.NodeServerCertTemplate, cluster.Name, nodeId),
DefaultMode: util.Int32Pointer(0644),
},
},
},
{
Name: clientKeystoreVolume,
VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
SecretName: cluster.GetNifiControllerUserIdentity(),
DefaultMode: util.Int32Pointer(0644),
},
},
},
}
}
func generateVolumeMountForSSL() []corev1.VolumeMount {
return []corev1.VolumeMount{
{
Name: serverKeystoreVolume,
MountPath: serverKeystorePath,
},
{
Name: clientKeystoreVolume,
MountPath: clientKeystorePath,
},
}
}

@PLsergent PLsergent marked this pull request as draft August 30, 2022 14:57
@PLsergent
Copy link
Author

PLsergent commented Aug 30, 2022

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

@mh013370
Copy link
Member

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices attached to controller services or processors. We think they could, but that's not how it's currently written.

However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService in deployed NifiDataflow. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.

@juldrixx
Copy link
Contributor

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices attached to controller services or processors. We think they could, but that's not how it's currently written.

However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService in deployed NifiDataflow. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.

The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.

@mh013370
Copy link
Member

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices attached to controller services or processors. We think they could, but that's not how it's currently written.
However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService in deployed NifiDataflow. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.

The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.

If i'm not mistaken, accessing the UI is what the nifi.security.autoreload.enabled feature allows for when certs are rotated. NiFI's Jetty web server uses the FrameworkServerConnectorFactory, which reloads certs from disk, to create connections. It uses Jetty's KeyStoreScanner and TrustStoreScanner to reload certs. As long as they get updated in the pods, then they should get updated if that NiFi property is set.

@PLsergent PLsergent force-pushed the feature/node_restart_when_certificate_updated branch from 8dc1fe9 to 2c33359 Compare August 31, 2022 14:05
@juldrixx
Copy link
Contributor

juldrixx commented Aug 31, 2022

I have recreated the problem. I set a Duration of 1h and a RenewBefore of 5m, so the certificates expire after 1h. When the certificates have expired, the user interface is not accessible and the operator cannot communicate with NiFi. Whether the nifi.security.autoreload.enabled function is enabled or not makes no difference.

@juldrixx
Copy link
Contributor

After further investigations with the help of @juldrixx we are not sure that the option nifi.security.autoreload.enabled is enough to apply new certificates. If any of you has an exemple that would tell me otherwise you're welcome ! @r65535 @michael81877

We chatted about this a little bit afterwards. This auto-reload feature works for the NiFi Jetty web server only. It doesn't apply to any SSLContextServices attached to controller services or processors. We think they could, but that's not how it's currently written.
However, a less disruptive approach to node restarts would be to have the operator watch the Certificates its creating, detect renewals, and restart (e.g. disable/re-enable) all controller services and processors with SSLContextService in deployed NifiDataflow. This is certainly a more complex approach to node restarts, but it's also less disruptive. I'm not necessarily suggesting you take this approach, but the NifiDataflow controller already performs controller service/processor restarts when there are flow changes.

The problem is not on the dataflow side but on the pure NiFi side. When the certificate has expired, NiFi does not automatically reload it. So when we try to access the interface, we are blocked because the certificate is no longer valid. We have to force NiFi to reload the new certificate by restarting it. Maybe our way of solving the problem or the problem itself is not the right one.

If i'm not mistaken, accessing the UI is what the nifi.security.autoreload.enabled feature allows for when certs are rotated. NiFI's Jetty web server uses the FrameworkServerConnectorFactory, which reloads certs from disk, to create connections. It uses Jetty's KeyStoreScanner and TrustStoreScanner to reload certs. As long as they get updated in the pods, then they should get updated if that NiFi property is set.

I dug around trying to figure out why it wasn't working.
When you mount the secret in the pod via a volume, it generates a symlink pyramid where the first symlink truststore.jks points to a second symlink ..data/truststore.jks which points to ..2022_08_31_19_45_44.079808492/truststore.jks. In the nifi.properties, it is told to look at truststore.jks but in reality the scan that tracks the state of the file monitors ..2022_08_31_19_45_44.079808492/truststore.jks. Now when the Certificate is updated, the secret is also updated and then reinjected into the pods but not under ..2022_08_31_19_45_44.079808492/truststore.jks but under another timestamp ..<NEW_TIMESTAMP>/truststore.jks. From the point of view of the symlink we specify in the nifi.properties this update is transparent but not to the scan that monitors the file. For him it has not been updated but deleted and stops monitoring it.

So it seems that the function can't be used unless you know a method to delete this timestamp directory (I didn't find it).

@mh013370
Copy link
Member

mh013370 commented Sep 1, 2022

Thank you for spending the time to track this down. That makes sense. It looks like Jetty has a unit test that specifically handles symlink targets. I wonder if it's in a newer version of Jetty?

https://github.com/eclipse/jetty.project/blob/cb127793e5d8b5c5730b964392a9a905ba49191d/tests/test-integration/src/test/java/org/eclipse/jetty/test/KeyStoreScannerTest.java#L208-L238

More details here jetty/jetty.project#5042

@juldrixx
Copy link
Contributor

juldrixx commented Sep 1, 2022

I've talked to the folks at NiFi and @michael81877, and it doesn't seem like this is expected behavior (link to discussion). I have opened a ticket at NiFi to have this fixed. Now the question is what to do with the PR. We can keep the feature but make it optional to have a system that would work in NiFi versions without the fix, or we drop it and hope that on the NiFi side the fix is done for the next release and that it fixes the problem.

@mh013370
Copy link
Member

mh013370 commented Sep 1, 2022

Yeah perhaps we can still pursue node restarts and make that a flag as a part of the SSL config. Then it wouldn't matter what NiFi version the operator is controlling. And then starting from whatever version of NiFi has the fix, the restarts are no longer necessary.

@PLsergent PLsergent marked this pull request as ready for review November 3, 2022 15:55
@PLsergent PLsergent force-pushed the feature/node_restart_when_certificate_updated branch from 8f077e4 to b452e9c Compare December 9, 2022 16:41
@PLsergent PLsergent force-pushed the feature/node_restart_when_certificate_updated branch from b452e9c to f8c34ad Compare December 15, 2022 14:25
@PLsergent
Copy link
Author

@erdrix @michael81877 ready for another review 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants