Kubernetes Gone Wrong: 3 Failure Stories

Kubernetes is a reliable orchestration platform and is worth deploying, but it is also complex and hard to troubleshoot when it is not properly configured. Many companies ran into problems while migrating to it. Some have shared their failures and the resulting impact publicly so that other organizations can learn from them and migrate more smoothly. Each of the failures below had a different cause and a different impact.

10x higher latency than usual with Kubernetes – Adevinta

What would you call a situation where one application hosted in the cloud responds in 20 ms while another, running on Kubernetes, takes ten times as long? The team at Adevinta went through exactly that.

Adevinta ran many diagnostics, starting with DNS queries on both instances. Resolution was somewhat slower on Kubernetes, but nowhere near enough to explain a tenfold increase in latency. A tcpdump capture, taken to check for DNS resolution issues, showed nothing conclusive on that front, but it did reveal a problem in how requests were being handled: each request triggered several additional queries, which slowed the responses down.

It turned out that those extra queries were part of how the application obtained its AWS credentials on Kubernetes. The first queried the role associated with the instance, and the second requested temporary credentials to access it. Adevinta knew this could become a bottleneck, but the setup had not been tuned for it. The AWS client was refreshing credentials as soon as 15 minutes were left before they expired, and those extra calls increased latency.
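The effect of that refresh rule can be sketched with a small simulation. This is an illustration only, not Adevinta's code or the AWS client's implementation: the class, the function names, and the credential lifetimes are all assumptions. It shows that if credentials are issued with a lifetime no longer than the 15-minute refresh window, every single request pays for a refresh, whereas a longer lifetime lets the cache do its job.

```python
from dataclasses import dataclass
from typing import Optional

# Refresh rule described in the text: refresh once 15 minutes or less remain.
REFRESH_WINDOW_S = 15 * 60


@dataclass
class Credentials:
    expires_at: float  # absolute time, in seconds


class CachingClient:
    """Hypothetical client that caches credentials and refreshes near expiry."""

    def __init__(self, lifetime_s: float) -> None:
        self.lifetime_s = lifetime_s  # assumed lifetime of issued credentials
        self.creds: Optional[Credentials] = None
        self.refresh_calls = 0

    def _fetch(self, now: float) -> Credentials:
        # Stands in for the two slow calls: look up the role,
        # then request temporary credentials for it.
        self.refresh_calls += 1
        return Credentials(expires_at=now + self.lifetime_s)

    def request(self, now: float) -> None:
        # Refresh when nothing is cached or too little time remains.
        if self.creds is None or self.creds.expires_at - now <= REFRESH_WINDOW_S:
            self.creds = self._fetch(now)


def count_refreshes(lifetime_s: float, requests: int, interval_s: float = 1.0) -> int:
    """Simulate evenly spaced requests and count credential refreshes."""
    client = CachingClient(lifetime_s)
    for i in range(requests):
        client.request(now=i * interval_s)
    return client.refresh_calls
```

Under these assumptions, `count_refreshes(15 * 60, 100)` refreshes on every one of the 100 requests, while `count_refreshes(60 * 60, 100)` refreshes only once.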

Failure to integrate Istio with Kubernetes - Exponea

Istio makes it easier for organizations with monolithic architecture to transition towards microservices architecture by providing security and control in Kubernetes.

But this was not the case with Exponea.

Exponea was running a big part of its infrastructure on Google Kubernetes Engine (GKE), and they were expecting to simplify the process of application deployment while increasing security and insights using Istio.

After a few experiments, they concluded that Istio did not yet offer enough value for their GKE setup, so they postponed the integration entirely. They had significant reasons for that decision.

Their Kubernetes Jobs were not finishing. A Kubernetes Job runs a script, and when the program in the Job exits, Kubernetes treats the Job as finished. Istio sidecars, which are utility containers that support the main container, kept running and never exited, so the Jobs never completed successfully.
Graceful shutdown of the proxy sidecars was another issue. Usually, an Istio sidecar quits promptly on a signal from Kubernetes once its work is done. However, this becomes a problem when the service itself has already shut down, as in the Job failure above.
Choosing a deployment strategy for both the Istio control plane and the proxies became difficult. With the Istio upgrade process at the time, unless the control plane was divided per tenant, recreating the pods in the cluster could bring down the Kubernetes master node, which Exponea wanted to avoid.
Exponea did try to make the changes needed to integrate Istio into its pipeline, but again hit a dead end. Exponea's clients wanted a multi-cluster setup rather than a multi-tenant one, and in that setup they would not be able to integrate Istio fully.
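One common way to work around the Job problem described above is to wrap the Job's command so that the sidecar is told to exit once the command finishes. The sketch below is not what Exponea did. It assumes the sidecar exposes a local HTTP endpoint that makes it quit; recent Istio releases document `/quitquitquit` on the agent port, but the URL and port here are assumptions to verify against your version. The helper names `post_quit` and `run_job` are hypothetical.

```python
import subprocess
import sys
import urllib.request
from typing import Callable, Sequence

# Assumption: the sidecar listens locally and exits when this URL is POSTed.
QUIT_URL = "http://127.0.0.1:15020/quitquitquit"


def post_quit(url: str = QUIT_URL) -> None:
    """Ask the sidecar to exit by sending an empty POST request."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=5):
        pass


def run_job(cmd: Sequence[str], stop_sidecar: Callable[[], None] = post_quit) -> int:
    """Run the Job's command, then ask the sidecar to exit whatever the outcome."""
    try:
        return subprocess.run(list(cmd)).returncode
    finally:
        try:
            stop_sidecar()
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            print(f"could not stop sidecar: {exc}", file=sys.stderr)
```

The wrapper returns the command's exit code, so the container can exit with the same status and Kubernetes still sees whether the Job succeeded or failed.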

A Kubernetes Migration Story - Prezi

The disadvantage of a full-blown container orchestration platform is that it needs proper configuration. Misconfiguration can give rise to issues that are small to fix but time-consuming to find. Prezi hit such an issue when they migrated their Nginx-based reverse proxy to Kubernetes: they noticed that latency had increased for a few of their services.

Prezi started debugging with the services that make subsequent calls, since those are the ones that accumulate the most latency. They also established that the only change was that their reverse proxy, which had previously run on Elastic Beanstalk machines, was now running on Kubernetes nodes.

Starting from traces, they discovered that a request took a full second to get from their gateway to the service handler, whereas the handler itself took just milliseconds to serve it.

Interestingly, the delay was consistently 1 second. They went on to check whether the outgoing TCP connection was what took so long. After logging custom attributes based on TCP variables, their suspicion turned out to be right.
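A rough way to reproduce this kind of measurement from inside a pod is to time the TCP connect step on its own, separately from the request itself. The sketch below is an approximation of the idea, not Prezi's instrumentation; the function names and the 0.9-second threshold are assumptions chosen to flag connects that cluster around the 1-second mark.

```python
import socket
import time
from typing import List


def connect_times(host: str, port: int, attempts: int = 20,
                  timeout_s: float = 3.0) -> List[float]:
    """Return how long each TCP connect to host:port took, in seconds."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        # Only the connection setup is timed; no request is sent.
        with socket.create_connection((host, port), timeout=timeout_s):
            pass
        durations.append(time.perf_counter() - start)
    return durations


def slow_connects(durations: List[float], threshold_s: float = 0.9) -> List[float]:
    """Keep only the connects that took at least threshold_s seconds."""
    return [d for d in durations if d >= threshold_s]
```

If most connects finish in milliseconds and a few take almost exactly one second, the delay is in connection setup rather than in the service handler.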

By default in Kubernetes, when a pod connects to a service by IP, the traffic goes through an extra step of source network address translation (SNAT). That step takes a little more time, and here it added a full second compared with the old Nginx-based reverse proxy setup.

Prezi fixed the latency by turning off SNAT altogether on the node. Once they applied the setting, Kubernetes also allocated more resources to the service, so a single pod received less traffic, which brought the latency numbers back to what they had been before the Kubernetes migration.

Final Thoughts

These stories show that one of the most significant sources of issues in Kubernetes, or in any other container orchestration platform, is not its fundamental features. The problem mainly lies in compatibility and in configuring the platform according to your needs. When an organization puts together mismatched pieces of technology, assuming that tools that are best in the market will also work best together, the result is often failure.

So it is essential to invest in observability when migrating to a container orchestration platform like Kubernetes. With proper configuration and observability, we can not only get close to the root of an issue but also build solutions that harness the true potential of a reliable container orchestration platform.

#kubernetes #case-study #outage #failure #Istio #Adevinta #Exponea #Prezi
