Kubernetes Gremlins: What Broke My Cluster?
Ruben Hakopian
July 23, 2021

Any engineer who has ever deployed a HelloWorld application to Kubernetes knows that it is a journey filled with trials, errors, and disasters, let alone running a microservices application in production. YAML configurations are loosely coupled, which makes it very hard to determine what caused an application outage and to resolve it quickly. Such issues are usually side effects of harmless-looking changes, such as edits to ConfigMaps, labels, or annotations, that cause conflicts and unintended modifications in dependent resources. Dealing with them every day is a nightmare for human operators.

One might argue that GitOps solves this problem by preserving the change history in the git log. That is true to some extent, but as described by Henning Jacobs at KubeCon 2020, with the growing popularity of the operator pattern, the state of the cluster can change based on internal triggers, bypassing the git history. Also, the effects of some changes are not observed right away, which makes it hard to detect breaking changes and recover from them. For example, a change to a ConfigMap that is consumed through environment variables does not affect running pods; only newly created pods observe it.
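One common way to make this kind of silent drift visible is to record a checksum of the configuration next to each pod and compare it with the live ConfigMap. The following is a minimal sketch of that idea in plain Python. The pod records, field names, and configuration keys are hypothetical, the code does not call the Kubernetes API, and it is not a description of how Kubevious detects changes.

```python
import hashlib
import json


def config_checksum(data: dict) -> str:
    """Return a stable SHA-256 digest of a ConfigMap's data section."""
    # Sorting the keys makes the digest independent of key order.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stale_pods(current_data: dict, pods: list) -> list:
    """Names of pods whose recorded checksum differs from the live ConfigMap."""
    live = config_checksum(current_data)
    return [pod["name"] for pod in pods if pod["config_checksum"] != live]


# Hypothetical data: the ConfigMap before and after an edit.
old_data = {"LOG_LEVEL": "info", "DB_HOST": "db.internal"}
new_data = {"LOG_LEVEL": "debug", "DB_HOST": "db.internal"}

pods = [
    # Started before the edit, so it still runs with the old values.
    {"name": "web-1", "config_checksum": config_checksum(old_data)},
    # Started after the edit.
    {"name": "web-2", "config_checksum": config_checksum(new_data)},
]

print(stale_pods(new_data, pods))  # ['web-1']
```

Here `web-1` is flagged because it started before the edit and never picked up the new values, which is exactly the situation that is hard to spot from the git log alone.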

To solve that problem, we created the Time Machine. Kubevious has included the Time Machine capability since its earliest versions; it lets you travel back in time to see the state of the entire cluster at any particular moment. The challenge our users had was finding the right second to travel to. After carefully listening to the feedback, we redesigned the Time Machine and introduced a new feature that we call Change History. In this article, we will take a shallow dive into Change History, show how to use it, and cover some use cases.

In Kubevious Portal, you will now find a new window: “Change History”. Just like the Properties and Alerts windows, it is populated when you select an object in the diagram. Change History displays the history of changes made to the selected object, which can be any resource, such as a Namespace, ConfigMap, Ingress, or Application. The example below shows that the Namespace “berlioz” was initially created on July 19th with 2 errors and 1 warning, was deleted on July 20th, was recreated on July 21st, and currently has 4 errors.

[Screenshot: Change History window for the “berlioz” namespace]

Change History includes changes to errors, warnings, flags, and custom markers, and shows which properties were modified as part of each change.

Clicking on a date activates the Time Machine, so the cluster can be diagnosed as it was at that moment, in this case when the namespace was deleted. The Timeline window shows the overall cluster heartbeat, the errors, and the Time Machine activation indicator. The “berlioz” namespace disappears from the diagram, yet we still see its change history and its selection in the Properties tool window.

[Screenshot: Time Machine active at the moment the “berlioz” namespace was deleted]

Going back in time even further, to July 19th, brings back the namespace, but this time with 2 errors (as opposed to the 4 errors it has currently).

[Screenshot: the “berlioz” namespace active on July 19th]
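Conceptually, jumping to a date means finding the most recent change recorded at or before that moment. The sketch below illustrates the lookup with a binary search over a sorted history. The `Change` record and `state_at` function are hypothetical names, not part of Kubevious. The entries are modeled on the “berlioz” example, but the times of day are made up, and the error count at recreation is assumed to equal the current count of 4.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Change:
    """One hypothetical change-history entry for a single object."""
    at: datetime
    present: bool      # False means this change deleted the object
    errors: int = 0
    warnings: int = 0


def state_at(history: List[Change], moment: datetime) -> Optional[Change]:
    """Return the latest change at or before `moment`, or None if there is none.

    `history` must be sorted by timestamp in ascending order.
    """
    timestamps = [change.at for change in history]
    index = bisect_right(timestamps, moment)
    return history[index - 1] if index else None


# Assumed entries: created, deleted, recreated.
history = [
    Change(datetime(2021, 7, 19, 10, 0), present=True, errors=2, warnings=1),
    Change(datetime(2021, 7, 20, 15, 30), present=False),
    Change(datetime(2021, 7, 21, 9, 0), present=True, errors=4),
]

snapshot = state_at(history, datetime(2021, 7, 20, 18, 0))
print(snapshot.present)  # False: the namespace was deleted at that point
```

Asking for a moment on July 19th after the creation time returns the first entry with 2 errors, and asking for any moment before it returns `None`, since the object did not exist yet.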

So far, we have been looking at the namespace itself. Changes to applications, load balancers, and pods are not reflected in that list. Only the flags, markers, and numbers of errors and warnings give some indication of what is going on underneath and where to dig deeper. To cover that, we added a “Subtree” mode to Change History. To activate it, click the corresponding button on the right side.

[Screenshot: Change History in “Subtree” mode]

The “Subtree” mode tracks all changes to relevant objects within the entire subtree of the selection. Some examples are modified ConfigMaps, Ingresses, Services, Volumes, and container images. The tool window lets you drill down by selecting the relevant object.
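The idea behind a subtree view can be sketched as a tree traversal that merges the change lists of the selected node and all of its descendants into one timeline. This is only an illustration under assumed data structures: the `Node` class, the object names, and the changes are made up and do not reflect Kubevious internals.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """A hypothetical diagram node with its own (date, description) changes."""
    name: str
    changes: List[Tuple[str, str]] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def subtree_changes(root: Node) -> List[Tuple[str, str, str]]:
    """Merge changes from the whole subtree into one list sorted by date."""
    merged = [
        (date, node.name, description)
        for node in walk(root)
        for date, description in node.changes
    ]
    # ISO dates sort chronologically when compared as strings.
    return sorted(merged)


# Made-up example tree; names and changes are illustrative only.
namespace = Node(
    "ns/shop",
    changes=[("2021-07-19", "created")],
    children=[
        Node("app/web", changes=[("2021-07-21", "container image changed")]),
        Node("configmap/web-config", changes=[("2021-07-20", "data modified")]),
    ],
)

for entry in subtree_changes(namespace):
    print(entry)
```

Selecting the namespace in this sketch lists the ConfigMap and container image changes alongside the namespace's own entry, while selecting `app/web` alone would list only its image change.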

Change History is tightly integrated with the Time Machine and the Rules Engine (which lets you write custom validation rules and associate custom markers with resources). It provides a highly contextualized track record of configuration and operational status changes to clusters, namespaces, applications, or any other resources. Change History is a must-have tool for any Kubernetes operator who needs to quickly identify anomalies and recover Cloud Native applications when application health degrades. Change History is available in the Kubevious Portal. I invite you to give it a try and share your feedback. Follow the sign-up steps to get started.
