Kubernetes Outages with real-world case studies
Classic Reagan Dias
June 01, 2020
5 minute read

Kubernetes is rapidly becoming the most popular and trusted way to host applications. Data from Sysdig indicates that almost 90% of containerized deployments now happen on Kubernetes (including Red Hat OpenShift and Rancher).

With some of the most business-critical applications in the world depending on the stability and high availability of Kubernetes services, even a minor outage can have devastating consequences: financial loss, a tarnished reputation, and more.

Kubernetes is a complex system with many moving pieces. Outages are not easy to debug and resolve, and given the right circumstances they can cause a great deal of damage. One of the main reasons so many businesses find themselves unprepared is the lack of awareness: incidents are often kept quiet or covered up to protect reputations.

Let’s look at some well-documented cases to understand better how to prepare for and overcome such situations.

Free Now: New worker nodes unable to join clusters

Autoscaling is one of the most exciting things about Kubernetes. But if it starts failing, for example because new worker nodes cannot join the cluster when extra capacity is needed, your application's performance will deteriorate rapidly.

Free Now had such an incident in September 2019, when engineers started complaining about failed deployments in all environments. The issue was caused by a change to the CentOS mirror a few days earlier: once the cluster needed more capacity, Nodeup was unable to fetch a required dependency from the mirror, so new nodes could not join, pods were not scheduled, and they remained stuck in the Pending state.

Part of the reason for the issue was that the dependency URL was hardcoded into the bootstrap process for creating new nodes. Once the problem was identified, changing the AMI from Amazon Linux 2 to Debian did the trick: pods were scheduled as usual, and capacity returned to levels that allowed deployments to proceed.
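
Pods stuck in Pending are usually the first visible symptom when new nodes fail to join. As an illustration only (this is not Free Now's tooling), a small Go program using client-go could surface them, assuming it runs in-cluster with a service account that may list pods:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside the cluster with a service account
	// that is allowed to list pods.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("loading in-cluster config: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("creating clientset: %v", err)
	}

	// List pods across all namespaces that are stuck in the Pending phase,
	// the symptom Free Now saw when new nodes could not join.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		log.Fatalf("listing pods: %v", err)
	}

	for _, p := range pods.Items {
		fmt.Printf("%s/%s pending since %s\n", p.Namespace, p.Name, p.CreationTimestamp)
	}
}
```

Feeding the count of such pods into an alert gives early warning before a capacity problem turns into failed deployments.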

Preply: Partial DNS outage in Kubernetes cluster

In late February 2020, Preply, one of the world’s foremost educational platforms, had a partial outage that made a few of its services unavailable to some users. The issue was detected within a mere four minutes and resolved within another 22.

That was a close save, not least because node limits had not been reached. Let’s look at the cause of the DNS outage and the steps taken to resolve it. Thanks to good monitoring practices and Prometheus, engineers were able to detect a nearly 500% spike in load across three of its services within minutes. The spike was triggered by the CoreDNS-autoscaler scaling the CoreDNS pods down from three to two.

Given the highly unusual magnitude of the load, engineers started troubleshooting immediately. The cause was clear: kube-proxy had failed to delete an old entry from the conntrack table. Connection tracking, or “conntrack” as it is more commonly known, is a core feature of the Linux kernel’s networking stack. Kube-proxy uses it to track logical flows (network connections) and to identify the pods available to serve requests. Because the conntrack table had not been updated accurately, some traffic was still being routed to deleted pods.

Once the team identified the issue, it quickly performed a regular deploy on the cluster to create new nodes. The CoreDNS-autoscaler then took care of the rest by adding the necessary pods, and the conntrack table was rewritten automatically.
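
Detection within minutes depends on the services exporting metrics that Prometheus can scrape. Purely as a hedged sketch (not Preply’s actual code), a Go service might expose a per-status request counter like the one below; the metric name, labels, and port are made up:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Requests processed, labelled by outcome. A rate() query over this counter
// is what lets an alert fire when errors suddenly spike.
var requests = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "app_http_requests_total", // hypothetical metric name
	Help: "HTTP requests processed, labelled by status.",
}, []string{"status"})

// doWork stands in for the real request handling, e.g. a DNS-dependent call
// to another service.
func doWork(r *http.Request) error { return nil }

func handler(w http.ResponseWriter, r *http.Request) {
	if err := doWork(r); err != nil {
		requests.WithLabelValues("error").Inc()
		http.Error(w, "upstream failure", http.StatusBadGateway)
		return
	}
	requests.WithLabelValues("ok").Inc()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	// Prometheus scrapes this endpoint to collect the counter above.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

An alert built on a rate() query over such a counter is what turns a sudden error spike into a page within minutes rather than hours.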

Ravelin: Not so graceful shutdowns

This case study is somewhat similar to the previous one in that services kept referring to pods that were no longer available, but for a different reason. In Ravelin’s case, the migration to Kubernetes on GKE was going very smoothly, except for one issue: graceful shutdowns, or rather the lack thereof.

The issue was that while Kubernetes services are usually very quick to remove endpoints, ingress is somewhat slower. Even after the replication controller decided to remove a pod and it was taken out of the load balancer, the ingress continued to send it traffic.

The solution was to keep pods alive after receiving a SIGTERM and to continue serving new connections, answering them with a “Connection: close” response header so that clients would reconnect to a healthy pod. The pods were set to exit only once the termination grace period had run out.
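
As a minimal Go sketch of that behaviour (not Ravelin’s actual implementation): the port and sleep duration are illustrative, and the sleep must stay within the pod’s termination grace period.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var draining atomic.Bool

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Once we are shutting down, keep answering but ask keep-alive
		// clients to drop this connection so they re-resolve to a healthy pod.
		if draining.Load() {
			w.Header().Set("Connection", "close")
		}
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Kubernetes sends SIGTERM when it starts removing the pod.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop
	draining.Store(true)

	// Keep serving while ingress and endpoint updates propagate. 20s is an
	// illustrative value; it must stay under terminationGracePeriodSeconds.
	time.Sleep(20 * time.Second)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```

The key design choice is that SIGTERM only flips the server into a draining mode; the process keeps accepting requests until the endpoints and ingress have had time to catch up.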

Spotify: Accidental deletion of regional production clusters, twice

This item on our list is a little different because it was caused by human error, but it is included here because such mistakes are quite common and something you need to be prepared for.

In late 2018, shortly after Spotify migrated to Kubernetes, one of its engineers accidentally deleted the US regional cluster, thinking it was a test cluster he had created. Restoring it took over three hours because numerous issues in the cluster deployment scripts forced the team to start over after each failure.

Shortly after that, Terraform was introduced to manage deployments. A poorly informed pull request to the Terraform configuration by another engineer then ended up destroying two regional clusters (US and Asia). Terraform was set up to manage the clusters from a common configuration, so when the pull request included only the cluster the engineer was working on, it assumed the other two were no longer needed.

These issues were resolved, and guidelines were put in place to avoid similar mistakes in the future. Remarkably, Spotify users did not notice any downtime during either incident, because the migration was only partial and failover was still pointed at non-Kubernetes instances. Long-term fixes included backing up clusters, codifying the infrastructure to make it easier to restore, and running simulated disaster-recovery scenarios with the teams.

Conclusion

To summarize, here are some tips that can help you prepare for Kubernetes outages:

  • Document dependencies of critical processes like bootstrapping new nodes.

  • Ensure that you have comprehensive monitoring tools like Prometheus.

  • Create alerts for failures in critical activities, like creating nodes.

  • Use termination grace periods to ensure pods stay alive until new connections are no longer routed to them. If you cannot change the application because it relies on third-party code, add a preStop lifecycle hook that sleeps for the duration of the termination grace period so that the pod stays alive (see the sketch after this list).

  • Codify your cluster infrastructure so that you can quickly restore or replicate it when disaster strikes. Ensure that scripts are resumable so that you don’t need to start over due to failures.

  • Keep your teams prepared for disaster recovery scenarios of all types. Simulating such scenarios on test clusters can allow them to come up with potential solutions.

  • Finally, while it may not always be feasible, a non-Kubernetes failover plan can come in handy. A managed Kubernetes service can also be useful in case your teams do not have the necessary know-how.
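
As a sketch of the preStop workaround mentioned in the list above: the container name, image, and 30-second sleep are hypothetical (the sleep should match your terminationGracePeriodSeconds), and the type names assume a recent k8s.io/api release. The hook can of course be written directly in YAML; building it from the Go API types and printing it simply makes the structure explicit.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Container spec with a preStop hook that sleeps through the shutdown
	// window, keeping the pod alive while endpoints and ingress stop
	// routing new connections to it.
	container := corev1.Container{
		Name:  "app",             // hypothetical name
		Image: "example/app:1.0", // hypothetical image
		Lifecycle: &corev1.Lifecycle{
			PreStop: &corev1.LifecycleHandler{
				Exec: &corev1.ExecAction{
					// Should not exceed terminationGracePeriodSeconds.
					Command: []string{"sleep", "30"},
				},
			},
		},
	}

	// Render the equivalent manifest fragment for a Pod or Deployment spec.
	out, err := yaml.Marshal(container)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```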

We hope that these real-world cases, the steps used to resolve them, and the corrective actions recommended above will help you to avoid or at least overcome your Kubernetes outages.
