Make key and trust stores reload automatically upon change

To be Reviewed By: February 18, 2021

Authors: Joris Melchior

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

Currently, rotating certificates requires restarting each member of the cluster so that it loads the new certificates and trust material. It would be preferable if certificates could be rotated without having to restart members.

Anti-Goals

We do not attempt to change the format of key and trust stores, nor the basic design of our TLS implementation for the Geode cluster.

Solution

When a cluster member starts up, we currently read the TLS configuration, which, when TLS is enabled, specifies key and trust store files. If those files are specified they are read, and their contents are loaded into the key and trust manager objects that are in turn installed into the SSLContext.
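
For illustration, here is a minimal sketch of that startup-time loading using the standard JSSE classes; the class and parameter names are illustrative and do not correspond to Geode's actual code:

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import javax.net.ssl.KeyManagerFactory;
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.TrustManagerFactory;

    // Illustrative sketch only: read the key and trust stores from disk once and
    // build an SSLContext whose key and trust managers cannot change afterwards.
    class StaticSslContextFactory {
      static SSLContext create(String keyStorePath, char[] keyStorePassword,
                               String trustStorePath, char[] trustStorePassword)
          throws Exception {
        KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (FileInputStream in = new FileInputStream(keyStorePath)) {
          keyStore.load(in, keyStorePassword);
        }
        KeyManagerFactory kmf =
            KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(keyStore, keyStorePassword);

        KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (FileInputStream in = new FileInputStream(trustStorePath)) {
          trustStore.load(in, trustStorePassword);
        }
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        SSLContext context = SSLContext.getInstance("TLS");
        context.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
        return context;
      }
    }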

This solution introduces wrapper objects for the key and trust managers, along with file/directory watcher(s) that detect changes to the key and trust store files. When a key or trust store file changes, the watcher triggers a reload of the affected key or trust managers, and through the wrapper objects the new managers are injected into the SSLContext so that the running process starts using them without a restart.
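
The following is a minimal sketch of the wrapper idea, assuming a delegating key manager whose target can be swapped atomically. The class and method names are hypothetical, and an analogous wrapper would exist for the trust manager:

    import java.net.Socket;
    import java.security.Principal;
    import java.security.PrivateKey;
    import java.security.cert.X509Certificate;
    import java.util.concurrent.atomic.AtomicReference;
    import javax.net.ssl.SSLEngine;
    import javax.net.ssl.X509ExtendedKeyManager;

    // Hypothetical wrapper: the SSLContext is initialized with this object once,
    // while the delegate it forwards to can be replaced whenever the key store
    // file changes on disk.
    class ReloadingKeyManager extends X509ExtendedKeyManager {
      private final AtomicReference<X509ExtendedKeyManager> delegate;

      ReloadingKeyManager(X509ExtendedKeyManager initial) {
        this.delegate = new AtomicReference<>(initial);
      }

      // Called by the file watcher after it has rebuilt a key manager from the
      // updated key store file.
      void swapDelegate(X509ExtendedKeyManager newDelegate) {
        delegate.set(newDelegate);
      }

      @Override
      public String[] getClientAliases(String keyType, Principal[] issuers) {
        return delegate.get().getClientAliases(keyType, issuers);
      }

      @Override
      public String chooseClientAlias(String[] keyType, Principal[] issuers, Socket socket) {
        return delegate.get().chooseClientAlias(keyType, issuers, socket);
      }

      @Override
      public String[] getServerAliases(String keyType, Principal[] issuers) {
        return delegate.get().getServerAliases(keyType, issuers);
      }

      @Override
      public String chooseServerAlias(String keyType, Principal[] issuers, Socket socket) {
        return delegate.get().chooseServerAlias(keyType, issuers, socket);
      }

      @Override
      public String chooseEngineClientAlias(String[] keyType, Principal[] issuers, SSLEngine engine) {
        return delegate.get().chooseEngineClientAlias(keyType, issuers, engine);
      }

      @Override
      public String chooseEngineServerAlias(String keyType, Principal[] issuers, SSLEngine engine) {
        return delegate.get().chooseEngineServerAlias(keyType, issuers, engine);
      }

      @Override
      public X509Certificate[] getCertificateChain(String alias) {
        return delegate.get().getCertificateChain(alias);
      }

      @Override
      public PrivateKey getPrivateKey(String alias) {
        return delegate.get().getPrivateKey(alias);
      }
    }

A file/directory watcher built on java.nio.file.WatchService could then drive the swap. Again a sketch, assuming a supplier that knows how to rebuild a key manager from the updated store file:

    import java.nio.file.FileSystems;
    import java.nio.file.Path;
    import java.nio.file.StandardWatchEventKinds;
    import java.nio.file.WatchEvent;
    import java.nio.file.WatchKey;
    import java.nio.file.WatchService;
    import java.util.function.Supplier;
    import javax.net.ssl.X509ExtendedKeyManager;

    // Hypothetical watcher loop: when the key store file is modified, rebuild the
    // key manager and swap it into the wrapper so the SSLContext picks it up.
    class KeyStoreWatcher implements Runnable {
      private final Path keyStoreFile;
      private final ReloadingKeyManager keyManager;
      private final Supplier<X509ExtendedKeyManager> reload; // rereads the store file

      KeyStoreWatcher(Path keyStoreFile, ReloadingKeyManager keyManager,
                      Supplier<X509ExtendedKeyManager> reload) {
        this.keyStoreFile = keyStoreFile;
        this.keyManager = keyManager;
        this.reload = reload;
      }

      @Override
      public void run() {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
          keyStoreFile.getParent().register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
          while (!Thread.currentThread().isInterrupted()) {
            WatchKey key = watcher.take();
            for (WatchEvent<?> event : key.pollEvents()) {
              if (keyStoreFile.getFileName().equals(event.context())) {
                keyManager.swapDelegate(reload.get());
              }
            }
            key.reset();
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } catch (Exception e) {
          // A real implementation would log the failure and keep the previous managers.
        }
      }
    }

The important property is that the SSLContext is initialized once with the wrapper objects, so new TLS connections pick up the reloaded managers without re-creating the context or restarting the member.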

Changes and Additions to Public Interfaces

This change will not have an impact on Public Interfaces.

Performance Impact

This change should reduce member restarts and therefore have a positive impact on system performance.

Backwards Compatibility and Upgrade Path

Will the regular rolling upgrade process work with these changes? Yes

How do the proposed changes impact backwards-compatibility? Are message or file formats changing? No

Is there a need for a deprecation process to provide an upgrade path to users who will need to adjust their applications? No

Prior Art

What would be the alternatives to the proposed solution? What would happen if we don’t solve the problem? Why should this proposal be preferred?

We are not aware of alternatives to this solution.

If we don't solve the problem we will continue to require restarting members when certs are changed.

This proposal will simplify the administration task of rotating certs for a Geode cluster.

FAQ

Answers to questions you’ve commonly been asked after requesting comments for this proposal.

Errata

What are minor adjustments that had to be made to the proposal since it was approved?


7 Comments

  1. There is a tradeoff here that should be made explicit. The current RFC says:

    If we don't solve the problem we will continue to require to restart members when certs are changed.

    This proposal does seem to make it possible to avoid a member-wise restart when changing certs. However, if a key/trust store is invalid, then not restarting a member when that key/trust store changes leads to late discovery of the problem.

    The existing behavior (before this RFC) is that any problems with the key/trust stores are discovered as soon as the first member is restarted (with the new configuration files).

    This RFC changes the system's response to bad key/trust stores from fail fast to fail slow.

    The RFC goes on to say:

    This proposal will simplify the administration task of rotating certs for a Geode cluster.

    Administration does seem to be simplified in the happy-path case. But when key/trust store problems are present, administration is complicated. In the current system (before this RFC), if a key/trust store has a problem, the operator finds out as soon as the first member is restarted with the new config files. Under this proposal, an operator may not find out for hours or days that there is a problem. Depending on system configuration and workload, it may take a long time after updating the security context for that security context to be used to create new sockets.

    So we either live with this new class of problem (late discovery of key/trust store problems) or we try to mitigate it. Unfortunately, the mitigations add complexity. The simplest mitigation I can think of would be to force a Geode member, immediately after instantiating a new security context, to roll through every existing client or server connection, tearing it down and re-establishing it. This would give up some of the desired performance advantage of this RFC but would provide an operator with immediate feedback in the presence of key/trust store problems.

    On the other hand, if we need to go that far, the added complexity of this PR, relative to any performance benefit, might not be worthwhile. At a minimum it might induce us to take a second (or third) look at just why restarting a cache server is "too expensive". In light of the many re-configuration use-cases that can benefit from the fail-fast, simple nature of "rolling update", perhaps our effort would be better spent in making rolling update acceptably fast.

    1. Thanks for bringing up the trade-off between fail fast and fail slow. It seems like even with the current implementation it is possible to have late discovery of a key/trust store issue if, for example, you do not restart the member soon after changing the key/trust store. So I'm not sure I agree that this RFC changes the system's response to fail slow. I think the fail fast/slow trade-off actually has more to do with the environment where Geode is run and how the Geode cluster is managed. For example, in a cloud-native or Kubernetes environment there is usually a concept of liveness or readiness probes which can be used to detect whether the system is online and responding to requests. These probes could be used to make the system fail fast if a key/trust store has a problem: A probe which periodically pings an HTTPS endpoint would start failing soon after a key/trust store is updated to one that contains invalid certs. If needed, this probe could be made smarter by querying Geode's management REST API and checking the member status, or it could even try to create a new client connection each time. This mitigation would probably be a lot easier than having Geode tear down its existing client/server connections each time the key/trust store is updated.

      1. You say it is possible to have late discovery of key/trust store problems before this RFC. This is true. But the RFC sets an expectation that the key/trust stores will be used by running processes as soon as those stores change. Implied in that is that problems will be discovered soon too. Even if we document the fact that problems can lie undetected for a long time, it's still a complication in troubleshooting, since any time a failure is far removed in time from the original fault (the problem in the updated key/trust stores), drawing the line between the two gets harder.

        An argument can be made that this RFC simplifies any Geode operator for Kubernetes, since best practices dictate storing key/trust stores as secrets, and secret changes, unlike image changes, do not cause automatic rolling updates via Deployments. Without Deployment support (for secret changes) some other mechanism must be used to make Geode processes reflect key/trust store changes. If the only alternative to this RFC entailed adding logic to any Geode operator, then that might be a good argument for this RFC.

        But the Kubernetes community has recognized this Deployment behavior as a problem. From [1]:

        The problem: ConfigMap objects (and secret) updates are a risk for your cluster…

        Kubernetes has native semantics that support replacing a container image using a rolling update strategy. Because Kubernetes does not offer a way to notify containers about changes in their ConfigMaps in a rolling fashion, configuration changes are different…

        Configuration errors can lead to outages, cascading failures, or even misbehavior, which is not easy to detect. Having a mechanism to deploy configuration changes via rolling updates allows errors to be detected early so that the update can be aborted and rolled back. Using a rolling update mechanism to deliver configuration changes to containers is a step forward in treating configuration and code in the same way…

        What do they think of the approach proposed in this RFC? From [2]:

        One approach for dealing with (ConfigMap and Secret files updates) is to add application logic to watch ConfigMap and Secret files for changes, and reconfigure the application on the fly. This can lead to complicated logic since objects using the old configuration will need to be detected and recreated.

        What they are alluding to there is precisely the sort of uncontrolled rollout of changes (to secrets) that I'm worried about.

        But the community has developed some generic workarounds that don't place any added burden on an operator implementation. From [1] again:

        Hashed ConfigMaps and Secrets in Bitnami's Kubernetes Production Runtime allow you to trigger a rolling update on Pods whenever an associated ConfigMap or Secret object is changed

        And from [2]:

        Another approach is to trigger a rolling update of the Deployment when it's dependent ConfigMaps and Secrets are updated. This blog post describes a solution that creates ConfigMaps and Secrets alongside a Deployment, and uses a hash of these resources to automatically trigger a rolling update if they have changed.

        And from [3]:

        Problem: "I can manage all my K8s config in git, except Secrets."

        Solution: Encrypt your Secret into a SealedSecret, which is safe to store - even to a public repository.

        This type of approach leverages Kubernetes rolling update for controlled evolution, not only of images, but of ConfigMaps and Secrets as well. It lets administrators detect problems during the update instead of later. That leads to less downtime. And it does this without placing any extra burden on Kubernetes operators, and without any complication in Geode itself.

        These best practices let us control system configuration with the same rigor that we control source code[4]. The benefits to Geode are clear:

        1. Geode doesn't have to support on-the-fly configuration changes. If configuration changes are needed, start a new container/pod with the new configuration and retire the old one.
        2. Administrators can rely on (and get good at) a single mechanism for all configuration changes (new product releases, changes to properties files, key/trust store updates)
        3. Administrators see problems (in new configurations) ASAP—this makes remediation a lot easier

        On the other side, this approach may place pressure on certain Geode functionality TBD. I think we should try it and see where those pressure points are. I'd rather see us invest our effort there, instead of investing it in dynamic configuration.

        Under this RFC, a probe that pings an HTTPS endpoint could catch some key/trust store problems. A probe that hits some hypothetical management endpoint that causes the Geode process to initiate a new TLS connection could catch others. But I don't know if this latter functionality exists in Geode—it may require new development.

        Even if both sides (server/passive binding, client/active binding) could be exercised by a probe, how do you envision a Kubernetes administrator isolating this verification to a single pod, before rolling out key/trust store changes to other pods? The ability to update a single pod, and then to stop the update if that pod fails, is key. It's a key reason why I'm an advocate of implementing key/trust store updates via rolling update.


        [1] Rolling Updates for ConfigMap Objects, Felipe Alfaro on May 6, 2019, on the Bitnami engineering site: https://engineering.bitnami.com/articles/rolling-updates-for-configmap-objects.html

        [2] GitOps Kubernetes Rolling Update when ConfigMaps and Secrets Change, Caleb Lloyd on Jul 5, 2018, on the Boxboat blog: https://boxboat.com/2018/07/05/gitops-kubernetes-rolling-update-configmap-secret-change/

        [3] Kubernetes Sealed Secrets https://github.com/bitnami-labs/sealed-secrets

        [4] GitOps https://www.gitops.tech/

        1. It's true that dynamically reloading the key/trust stores introduces a risk that problems are found later since the members do not necessarily use the updated key/trust stores until a new TLS connection needs to be created. I also agree that in a Kubernetes environment, the workarounds you cited would provide a way to update key/trust stores backed by Kubernetes Secrets via a rolling update. However, we still think it would be valuable to introduce this change in Geode:

          • We've heard feedback from users that rolling updates for cert rotations can be disruptive and impractical. Anthony Baker gives an example below. Additionally, it's often desirable from a security standpoint to rotate certificates frequently since that removes the need for revocation in the event the certificate is compromised.
          • This RFC doesn't prevent users from implementing key/trust store updates via rolling update. For example, in a Kubernetes environment they can use the process you described above. Or, in an unmanaged (bare-metal or VM) environment they could create new key/trust store files and restart the members with the paths to the new files.
          • There are mitigations to the risk of finding problems late, such as using an automated certificate management system like Vault or cert-manager, which reduces the likelihood of issues in the certificate update process, or using a probe like the one I mentioned in my previous comment.
  2. I think there are some Geode and stateful workloads that really matter to this discussion. Restarting a stateful workload is not cheap in many cases. While you could expect a Tomcat web app to restart in seconds, a data system may take significantly more time. Even when it is "ready" it may take more time to warm caches and return to servicing requests at max speed. While this should be accounted for in failure scenarios regardless, it's a good thing to avoid situations where an expected event (rolling certs) combined with an unexpected event (server failure) can leave the system with insufficient resources and/or redundancy.

    In Geode's case, we know that restart times can be greatly impacted by persistent regions with indexes. For large data sets, fully replaying the WAL (oplogs) can take tens of minutes. If we assume a cluster size of 30 and a restart time of 10 min, the rotation time would be around 10 h (2 restarts needed per member), assuming serialized rolling restarts. Perhaps that could be optimized by AZ but that depends on operational overhead.

    For questions like this, what I really look for is a good default behavior with an extension point so that I can customize it however I want.


    1. On this point:

      it's a good thing to avoid situations where an expected event (rolling certs) combined with an unexpected event (server failure) can leave the system with insufficient resources and/or redundancy.

      The implication here is that my counter-proposal will leave the system with insufficient redundancy during a key/trust store change. But that is true only if the administrator hasn't configured sufficient redundancy to accommodate reconfiguration and failures. Administrators already face this issue when upgrading the Geode software version. My proposal to use the same mechanism for key/trust store change introduces no new burden or risk.

      If we imagine an administrator updating Geode software once a year, and updating key/trust stores once a year, then my proposal means they're doing this rolling update operation twice a year instead of once a year. In each case it's incumbent on the administrator to ensure that sufficient redundancy is configured to allow for unexpected failures during the expected restarts. Will an administrator configure this redundancy statically or will they temporarily add redundancy before the change window and subtract it afterward? That's up to them. But the key point is: they already have this problem.

      A key benefit of the universality of the rolling update approach is that the administrator is doing fundamentally the same operation every time, instead of doing two different operations. And guess what: not only would they update Geode software and key/trust stores using fundamentally the same operation; they'd also update any Geode tuning parameters that way too. Any change to the configuration would be done using rolling update under my counter-proposal. The benefit is that an administrator would get very good at this one operation. Another benefit is that the Geode software remains simpler (the present RFC isn't needed).

      Anthony points out an important downside: this operation that the administrator is getting very good at could really suck in some circumstances. He cites the inarguable fact that rolling updates have a performance impact. (This is what I meant in my most recent reply above, when I said that my counter-proposal could "place pressure on certain Geode functionality TBD".) He posits a hypothetical cluster of 30 nodes taking 10 hours to complete a rolling update. The end-to-end time is one issue since, ostensibly, operators would like to be on-hand to resolve any unexpected failures. But another issue he highlights is that cluster performance may be slowed during the 10 hours and for some time thereafter.

      Rather than downplay this issue, I'd go so far as to say that the crux of our debate here is: how shall we address it? If we generalize the approach in this RFC we're saying: when faced with a configuration change scenario that would require a rolling update, let's make a Geode process dynamically reconfigurable instead. My counter-proposal is saying: nope, there is only one way to reconfigure a Geode process, and that is by starting a new one with the new configuration and then retiring the old one.

      My counter-proposal, then, forces us to confront performance problems like Anthony's head-on. But here's the thing: our users already have to confront those performance problems head-on, for all the other rolling update scenarios. We have to confront them for Geode software updates and any changes to gemfire.properties, for instance. Since rolling update performance is already a pressing issue, we haven't really solved a problem by making one thing (in this case key/trust store change) not require a process restart.

      The importance of this RFC extends beyond the proximate issue, i.e. key/trust store functionality, to the broader design direction it implies. What's troubling is that we are adding dynamic configurability to a system that largely doesn't do that, at a time when the industry is marching decisively in the opposite direction.

      1. The RFC isn't suggesting that the only way to perform key management is through this dynamic mechanism; it is proposing an alternative. If an administrator wants to take the more burdensome approach of static key management, then they can.

        I will argue that your final statement is false in this context about key management. The industry is moving towards more frequent updates to managed keys, and as such the frameworks for these systems largely expect the applications being managed to ingest updates to keys managed by the system efficiently. The configuration that is static is the key management system; it is the keys in these systems that are dynamic. For K8s, as an example, my configuration from the server's standpoint is the mount point of a mutable key or certificate. The system then expects the process running in this container to respond adequately to mutation at the mount point. Geode is an outlier in this space.