Kubernetes Failure Stories at Scale
Harshit Mehndiratta
February 22, 2021
8 minute read


Kubernetes is one of the most extensible and feature-rich container orchestration platforms, and it comes with both benefits and flaws. It can be deployed across all environments, including multiple public clouds, on-premises, and hybrid setups, yet it can just as easily become a central point of failure for applications and infrastructure.

During their migrations to Kubernetes, many companies have shared their failures and the resulting impact, so that other organizations can learn from these scenarios before migrating themselves. Many of these Kubernetes failures have different factors to blame, which we will discuss in this blog.

Without further ado, let’s get started.

Unstable EKS migration - MindTickle

Founded in 2011 as a SaaS platform, MindTickle provides sales readiness tools to bridge the skills and knowledge gap for teams looking to maximize leads. MindTickle ran its infrastructure as code on top of Kubernetes managed by kops, which it planned to migrate to Amazon Elastic Kubernetes Service (EKS) for a better networking experience.

During the migration, the EKS platform suddenly became unresponsive and very slow. The team raised AWS support tickets describing the EKS issues, and AWS support suggested capturing packets.

Running a packet capture on the pods revealed a pattern: internal network calls within the Kubernetes cluster were working fine, while calls outside the cluster were failing, with packets being retransmitted between the Kubernetes nodes and the EC2 machines.

AWS support then advised a more detailed packet capture to check the connections between the various entities, which revealed abnormal communication between nodes and pods.

After analyzing the capture, AWS Kubernetes experts recommended setting the Container Network Interface (CNI) flag AWS_VPC_K8S_CNI_EXTERNALSNAT=true, which eliminated all the networking problems, and the migration to EKS was completed successfully.

How did the AWS CNI flag resolve the issue?

Typically, communication within a virtual private cloud (VPC) is direct and requires no source network address translation (SNAT). But when traffic is destined for outside the VPC, the AWS VPC CNI plugin by default performs SNAT on the node so that pods can communicate with the internet.

Here, setting the AWS CNI flag AWS_VPC_K8S_CNI_EXTERNALSNAT=true enabled the plugin’s external SNAT mode, in which SNAT is no longer performed on the node and calls leaving the private cloud are handled by the NAT gateway instead.
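
For reference, here is a minimal sketch of how that flag is typically set on the Amazon VPC CNI plugin’s aws-node DaemonSet, written as a strategic-merge patch; treat the file name and exact command as illustrative.

```yaml
# external-snat-patch.yaml -- a strategic-merge patch for the aws-node DaemonSet.
# Apply with something like:
#   kubectl patch daemonset aws-node -n kube-system --patch-file external-snat-patch.yaml
# (kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_EXTERNALSNAT=true
#  achieves the same result.)
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
            # Skip SNAT on the node and rely on an external NAT device
            # (e.g. the VPC's NAT gateway) for traffic leaving the VPC.
            - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
              value: "true"
```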

Takeaways

  • Inspect and debug every managed service component. Problems can lie outside the cluster and across the stack.
  • Implement observability tools to properly understand what happens behind the scenes of each connection. Knowing the inner workings of pod and node communication helps detect abnormalities and make better predictions, resulting in a smoother migration.

Unnecessary CPU throttling in Kubernetes - Buffer

Buffer is a social media management platform made for marketers and agencies. Initially, Buffer had a classic monolithic code base, but since 2016 Buffer has run all of its workloads on Kubernetes, managing them with kops.

Buffer’s infrastructure comprises 60 nodes and about 1,500 containers, which encountered aggressive throttling in Kubernetes irrespective of the defined CPU limits.

Throttling counts how many times a container’s CPU is capped at its configured limit in order to keep resource utilization in check. In Buffer’s case, CPU usage was nowhere near the defined limits, yet most of the containers were still being throttled, causing high latencies.
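
To make the mechanism concrete, here is a minimal, hypothetical pod spec (the names and numbers are illustrative, not Buffer’s actual configuration) showing how a CPU limit translates into the kernel’s CFS quota that does the throttling.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api                       # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
      resources:
        requests:
          cpu: 250m               # what the scheduler reserves for the pod
          memory: 256Mi
        limits:
          # A 500m limit becomes a CFS quota of 50ms of CPU time per 100ms
          # period (cpu.cfs_quota_us=50000, cpu.cfs_period_us=100000).
          # Once the quota is used up, the container is throttled until the
          # next period, which shows up in the cgroup's cpu.stat counters
          # (nr_throttled, throttled_time).
          cpu: 500m
          memory: 512Mi
```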

To debug the issue, Buffer removed all the CPU limits for its services, which had both advantages and disadvantages. Cluster stability suffered: some processes consumed too many resources or disrupted other services, making containers unresponsive.

Latency, on the other hand, improved drastically across all the modified services. Buffer’s main landing page itself became 22 times faster after the limits were removed.

How did throttling get resolved?

There was a serious bug in the Linux kernel that was unnecessarily throttling containers with CPU limits. The bug has been fixed in kernel version 4.19 and higher; distributions running older kernels are recommended to upgrade to a release that ships a fixed kernel.

Takeaways

  • Kernel issues in the underlying Linux distribution can leave containers underperforming, even when Kubernetes itself is configured correctly.
  • Removing CPU limits is dangerous and is not a viable long-term solution to CPU throttling.
  • Removing CPU limits requires extensive monitoring tools to manage CPU and memory usage on the nodes.
  • Removing CPU limits can result in high resource usage, so Horizontal Pod Autoscaling has to be configured to schedule pods on nodes that have available resources (see the sketch below).
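
As a rough illustration of that last takeaway, here is a minimal HorizontalPodAutoscaler sketch; the target deployment, replica counts, and utilization threshold are hypothetical, not taken from Buffer’s setup.

```yaml
apiVersion: autoscaling/v2        # use autoscaling/v2beta2 on older clusters
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa                   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                     # hypothetical deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # Add replicas when average CPU usage crosses 70% of the requested CPU.
          averageUtilization: 70
```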

Missing Application Logs in Production - PrometheusKube

PrometheusKube is a Kubernetes operator that provides prebuilt packages for using Prometheus alerts, Grafana dashboards, and runbooks with open-source software. PrometheusKube packages define simple and efficient alerting rules with easy-to-understand key indicators.

PrometheusKube experienced a strange outage in which one of the nodes was missing production logs. Developers noticed that the log processor Fluent Bit, which collects logs and metrics from various resources, had stopped shipping logs in production.

The DevOps team at PrometheusKube started debugging the issue by reading the Fluent Bit logs, where they discovered that Elasticsearch could not handle the requests due to insufficient threads. The thread capacity was not configurable, so more CPU resources were allocated instead.

After the additional resources were allocated, the insufficient-threads error disappeared, but the software was still not recording logs. Looking at the Fluent Bit instance on the affected node, there were no logging actions at all.

To resolve the issue, PrometheusKube rolled out an alternative log shipper, Fluentd, which was deployed on the affected node first and then slowly rolled out to every node, replacing Fluent Bit.

Fluentd had its own struggles during deployment: developers ran into buffer overflow errors, which were overcome by using the position DB feature.
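
As an illustration of the position-tracking idea, here is a minimal Fluentd configuration sketch wrapped in a ConfigMap; the names, paths, and parser are assumptions, not the configuration PrometheusKube actually shipped.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config       # hypothetical name
  namespace: logging         # hypothetical namespace
data:
  fluent.conf: |
    # Tail container logs and record how far each file has been read in a
    # position file, so a restart resumes where it left off instead of
    # re-reading or silently skipping log lines.
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
      </parse>
    </source>
```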

Takeaways

  • Outages can happen due to a deadlock in software code.
  • Advanced monitoring and scripting methods should be present to detect, identify, and root cause abnormalities.

Redis operator failure in Kubernetes - Flant

Flant provides DevOps-as-a-Service for organizations looking to implement best practices and CI/CD for their Kubernetes-native applications. Flant manages its customers’ Kubernetes storage components by using a Redis operator in various applications.

Redis provides an established key-value data store that is used as persistent storage with Kubernetes applications to easily request and consume storage resources.

Typically, the Redis operator creates a set of resources for running Redis failover instances, which adds persistence to Kubernetes storage components in case of scaling or failure. But at Flant, increasing these Redis resources turned out to be a total disaster.

When Flant engineers tried to increase the number of Redis replicas to improve database reliability while lowering the containers’ memory requests and limits, the system became unstable. The restart of the master resulted in data loss, leading to downtime and the need to restore the data from a backup.

After a careful investigation, the Redis operator and insufficient readiness probes were found to be the leading causes of the data loss.
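
As a generic illustration of the readiness-probe point (not the Redis operator’s actual manifest), a probe like the following keeps a Redis pod out of rotation until the server actually responds:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: redis-replica        # hypothetical name
spec:
  containers:
    - name: redis
      image: redis:6.0       # illustrative version
      readinessProbe:
        exec:
          # Only mark the pod Ready once redis-cli gets a PONG back, so
          # traffic and failover decisions never target an instance that is
          # still starting up or loading its dataset.
          command: ["redis-cli", "ping"]
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3
```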

Takeaways

  • Use Kubernetes operators carefully when managing mission-critical stateful applications.
  • Avoid changing container limits at the same time as increasing the number of Redis instances.
  • Always back up the disk data before making any changes to a Redis cluster.

Resource Failure and High CPU usage in Kubernetes - Moonlight

Moonlight is a professional network that matches remote software developers with hiring companies. Moonlight uses a managed Kubernetes environment, Google Kubernetes Engine (GKE), to host its website and applications, which suffered a severe outage.

The issue started with connectivity to the Redis database that the Moonlight API uses for request validation. Google Cloud support reported some service disruptions, which Moonlight assumed were the major cause of the failure. But things got worse when the whole Moonlight website crashed, serving no traffic at all.

The DevOps team at Moonlight quickly escalated the issue to Google Cloud’s support engineering team, which tracked the resource usage and identified a pattern: GKE was scheduling pods onto nodes that were already at 100 percent CPU usage.

Initially, the pattern was only seen in the Redis pods. But as the cycle continued, the Kubernetes scheduler kept assigning more high-CPU pods to the same node, which eventually experienced a kernel panic, resulting in long periods of downtime.

The downtime was resolved by adding anti-affinity rules to the deployment, which automatically distribute pods across several nodes to increase performance and ride out periods of high CPU usage.
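
A minimal sketch of such a rule is shown below; the labels, image, and replica count are illustrative, not Moonlight’s actual manifest.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never place two "web" pods on the same node, so a
          # single node pegged at 100% CPU cannot take out every replica.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: example/web:latest   # hypothetical image
```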

Takeaways

  • Adding anti-affinity rules makes web-based applications more resistant to faults.
  • The Kubernetes scheduler may assign pods to the same node unless inter-pod anti-affinity rules are in place.
  • Keeping CPU-intensive and critical applications away from each other results in a more reliable system.

Weave Scope utilizing 100% CPU - JW Player

JW Player is an advanced media player used for streaming content, running video ads, and embedding videos into websites. JW Player uses Kubernetes for its development and staging clusters, which were hijacked for computing power by cryptocurrency miners.

The vulnerability was discovered when the cluster monitoring tool started alerting on high usage. For a few days the alerts were considered normal due to load on one of the services, but when a gcc process was found running on every machine at 100 percent CPU usage, the DevOps team at JW Player became suspicious.

Checking the gcc process revealed that the application was not the GNU Compiler Collection (gcc) but a process named gcc that had been launched through the Weave Scope monitoring tool.

Weave Scope had been used strategically to gain shell access via the URL of the load balancer. Deployments that do not make the load balancer internal expose its URL publicly, and anyone with that URL can open the Weave Scope dashboard without any authentication; the dashboard presents an interactive shell that allows commands to be executed inside a running container.
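
One way to avoid this class of exposure, sketched below under the assumption of an AWS-style load balancer (the story does not show JW Player’s actual manifest), is to ask the cloud provider for an internal load balancer so the dashboard is only reachable from inside the VPC.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: weave-scope-app      # illustrative name and labels
  namespace: weave
  annotations:
    # Provision an internal (VPC-only) load balancer instead of a public one.
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: weave-scope         # illustrative selector
  ports:
    - port: 80
      targetPort: 4040       # the Weave Scope app's UI port
```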

JW Player resolved the issue by stopping and removing the Weave Scope tool from its Kubernetes clusters. The gcc process’s outbound connections were also traced and found to be directed at a GitHub repository of a cryptocurrency miner.

Takeaways

  • Never give untrusted applications access to the root directory on Kubernetes nodes.
  • Restrict Weave Scope’s ability to execute a shell.
  • Implement anomaly and intrusion monitoring tools.
  • Avoid running containers as root (a minimal securityContext sketch follows below).
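
For that last point, here is a minimal securityContext sketch with generic hardening defaults; the names and values are illustrative, not JW Player’s configuration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app         # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true       # refuse to start containers whose image runs as root
    runAsUser: 10001         # arbitrary non-root UID
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]      # drop every Linux capability the app does not need
```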

Conclusion

The failure stories covered in this blog focused on the Kubernetes platform. Yet most of the issues in these stories, such as clusters becoming unstable, complex deployments, applications missing logs, and throttling, largely depend on other connected components.

So, it is necessary to understand the overall complexity of a problem before blaming it on Kubernetes. People often have the notion that Kubernetes (K8s) is the root cause of underlying issues, but it is just one part of an ecosystem of thousands of products working together to deliver reliability and scalability in the bigger picture.
