Haseeb Asif
June 19, 2020

Complexities Migrating to Kubernetes and Some Horror Stories

People often assume that Kubernetes (K8s) can do much more than it was designed to do. Comfortable elasticity inside an environment was never originally part of the K8s stack; the pieces that provide it become part of your monitored infrastructure, and you have to keep track of them yourself.

The Kubernetes ecosystem contains thousands of products, and many people assume they work together seamlessly, but they do not. One of the most common failures is a lack of understanding of how Kubernetes functions. Among the most common mistakes are bad or conflicting configurations in areas such as networking, storage, and compute specs.

Trying to do it all by yourself is also a problem. People start on their own, and when they try to put things into production and scale, they run into issues. A standard problem many teams face is scaling: they can't keep up with all the necessary work, including day-two operations like upgrades. K8s talent is also hard to find, and even if you are lucky enough to find it, it's tough to retain.

Many teams don't implement any policy around creating external load balancers or ingresses, which is why many security and operational failures occur. A lot of thought goes into which images can be deployed, but many companies have a blind spot around how easy it is to accidentally steal traffic from one workload and send it to another.

The best case in this common failure is downtime; the worst case is that your sensitive data gets exposed to the internet, and nobody notices because there is no notification or alert. Many of these failures come down to accidental oversight and human error in labeling and naming policies.
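To make that failure mode concrete, here is a hypothetical sketch (all names invented) of how a loose label selector can silently steal traffic from another team's workload:

```yaml
# Hypothetical example: two teams both label their pods "app: api".
# This Service was meant for team B's workload only, but its selector
# also matches team A's pods, so traffic is silently split between them.
apiVersion: v1
kind: Service
metadata:
  name: team-b-api
spec:
  selector:
    app: api          # too broad: matches every pod labeled app=api
  ports:
    - port: 80
      targetPort: 8080
```

A naming convention that scopes labels per team (e.g. `app: team-b-api`) closes exactly this gap.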

Another common K8s failure scenario is when the cluster's infrastructure hardware fails to satisfy the container startup policies. Since K8s deploys applications declaratively and those policies are strictly enforced, it is critical that the declared, desired container state can be met by the allocated infrastructure; otherwise, the container will fail to start.
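As a hypothetical sketch of that scenario (image and numbers invented): if no node in the cluster can allocate what the pod declares, the scheduler leaves it Pending indefinitely.

```yaml
# Assume every node has, say, 8 GiB of allocatable memory. This pod's
# declared requirements can never be met, so it stays Pending and the
# container never starts -- the declarative spec is strictly enforced.
apiVersion: v1
kind: Pod
metadata:
  name: oversized-app
spec:
  containers:
    - name: app
      image: example/app:1.0   # placeholder image
      resources:
        requests:
          memory: "16Gi"       # more than any node can allocate
          cpu: "2"
```

`kubectl describe pod oversized-app` would then show a `FailedScheduling` event explaining that no node has sufficient memory.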

Zalando’s Kubernetes journey

This story is about Henning Jacobs, head of developer productivity at Zalando, an e-commerce platform that runs over 140 Kubernetes clusters. Zalando uses Kubernetes to support over 1,100 developers, and in a March 2019 blog post Henning explained at length why "Kubernetes' extensible and cohesive API matters". He wrote that "learning it is worthwhile, even if you just want to run a bunch of containers."

Henning was responding to a blog post titled "Maybe You Don't Need Kubernetes," written by a backend engineer at the hotel listings site Trivago ("Especially for smaller teams, it can be time-consuming to maintain and has a steep learning curve"). Jacobs agrees that there are many choices for running containers, and that "all of them work, but differ heavily on what interface they provide".

But he also argued strongly that "having an extensible API matters as you will sooner or later hit a use case not reflected 100% by your infrastructure API, and/or you need to integrate with your existing organization's landscape". He applauded custom resource definitions (CRDs), which "allow building higher-level abstractions on top of core concepts". Henning's blog post also noted that there is already an ecosystem built on top of the Kubernetes API, arguing strongly that the world has converged on its feature set much as it did with the Linux kernel.

"I think this network effect will prevail, and we will see more high-level tools for Kubernetes," he wrote, and he even counts K8s itself among those resources: "I started collecting Kubernetes for no other reason than to leverage the enormous community and to improve infrastructure operations…"

But how does Jacobs feel about Kubernetes now? In 2019 he was a guest on Google's own Kubernetes Podcast, where he shared his experience of Zalando's 2015 migration to the cloud and said that it grew out of the company's commitment to open-source software.

Zalando shares its open-source Kubernetes components on GitHub. "But actually, it's not only about code, but it's also about how to deal with the system and components, how they work together, and how to set this up."

Excessive CPU throttling at Omio

Another story worth knowing comes from a DevOps engineer at the travel search engine Omio, which runs 100% of its workloads on Kubernetes. He wrote about "CPU limits and aggressive throttling in Kubernetes" resulting in high error rates and latency: "there is a serious known CFS bug in the kernel that causes unnecessary stalls and throttling".
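To see why CPU limits translate into throttling at all, here is some rough arithmetic (an illustration, not Omio's code): Kubernetes converts a CPU limit into a CFS runtime quota per scheduling period, 100 ms by default, and a multi-threaded service can burn that quota in a few milliseconds of wall time.

```python
# Rough arithmetic showing why CPU limits cause CFS throttling:
# Kubernetes converts a CPU limit into a runtime quota per scheduling
# period (100 ms by default in the kernel's CFS bandwidth controller).

CFS_PERIOD_MS = 100  # kernel default

def cfs_quota_ms(cpu_limit: float, period_ms: int = CFS_PERIOD_MS) -> float:
    """Runtime budget per period for a given CPU limit."""
    return cpu_limit * period_ms

def throttled_ms(cpu_limit: float, threads: int,
                 period_ms: int = CFS_PERIOD_MS) -> float:
    """With `threads` runnable threads each wanting a full core, how long
    is the container stalled each period once the quota is exhausted?"""
    budget = cfs_quota_ms(cpu_limit, period_ms)
    wall_ms_until_exhausted = budget / threads  # threads burn quota in parallel
    return max(period_ms - wall_ms_until_exhausted, 0.0)

# A limit of 0.5 CPU gives a 50 ms budget per 100 ms period. With 10 busy
# threads the budget is gone after 5 ms of wall time, so the container is
# frozen for the remaining 95 ms of every period -- visible as latency spikes.
print(cfs_quota_ms(0.5))      # 50.0
print(throttled_ms(0.5, 10))  # 95.0
```

The kernel bug Omio hit made things worse by throttling even when quota remained, but the arithmetic above is why tight limits hurt even on a fixed kernel.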

Running out of IPs at LoveHolidays

The head of DevOps at the travel search site LoveHolidays describes a time when Google Kubernetes Engine (GKE) ran out of IP addresses, which blocked autoscaling of both pods and nodes as well as fast deployment: "by default, GKE provides 256 IPs per node, which means that even large subnets like /16 can run out pretty quickly when you are running 256 nodes".
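The back-of-the-envelope math behind that quote can be sketched as follows (a simplification of GKE's defaults, not LoveHolidays' actual numbers): if each node claims a /24 pod range (256 addresses), a /16 cluster pod CIDR only holds 256 such ranges.

```python
# Simplified IP-exhaustion arithmetic for a VPC-native GKE cluster:
# each node reserves a /24 pod range (256 addresses) by default, so the
# cluster-wide pod CIDR caps how many nodes can ever join.

def max_nodes(pod_range_prefix: int, per_node_prefix: int = 24) -> int:
    """How many nodes fit when each node claims a /per_node_prefix slice
    out of a /pod_range_prefix pod CIDR."""
    return 2 ** (per_node_prefix - pod_range_prefix)

print(max_nodes(16))  # 256 -> node autoscaling stalls at node 257
print(max_nodes(14))  # 1024 -> a wider pod CIDR buys more headroom
```

The practical fix is to plan the pod CIDR (and the per-node range size) for the cluster's maximum node count before creating the cluster, since these ranges are hard to change afterwards.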

In 2019, a DevOps engineer for Exponea’s email list validation site wrote about why they postponed integrating Istio and deploying it into production.

Migration isn’t always an easy task

The team at Ravelin had a lovely time with their migration to Google Cloud Kubernetes until they were faced with a problem at the API layer. To ease the move, they decided to use an ingress. On paper it seemed very easy:

  1. Define the ingress controller
  2. Tinker with Terraform to get some IP addresses pointed at it
  3. Google will take care of everything

In most of the documentation, the process of removing pods from service is:

  1. The replication controller decides to remove the pod.
  2. The pod’s endpoint is removed from the load balancer or service, so new traffic no longer flows to the pod.
  3. The pod’s pre-stop hook is invoked, or the pod receives a SIGTERM.
  4. The pod shuts down and stops listening for new connections.
  5. As the shutdown completes and the pod exits, all existing connections eventually terminate or go idle.

Steps 2 and 3 take place simultaneously and happen very quickly. The problem is that the ingress is relatively slow to act on the endpoint change.

So the pod receives the SIGTERM long before the endpoint changes are actioned at the ingress. The ingress keeps sending it new connections, and clients get 500 errors in the meantime. This is how a Kubernetes migration falls apart.
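A common mitigation for this race (not necessarily Ravelin's exact fix, and with a placeholder image name) is to delay shutdown with a `preStop` sleep, so the pod keeps serving while the ingress catches up with the endpoint change:

```yaml
# Sketch: delay SIGTERM handling with a preStop sleep so the endpoint
# removal (step 2) wins the race against shutdown (steps 3-4).
apiVersion: v1
kind: Pod
metadata:
  name: graceful-app
spec:
  terminationGracePeriodSeconds: 45   # must exceed the preStop sleep
  containers:
    - name: app
      image: example/app:1.0          # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]  # keep serving while the ingress updates
```

The application should also handle SIGTERM gracefully, finishing in-flight requests before exiting.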

Takeaways

Bugs are not the usual source of problems in Kubernetes, nor are fundamental flaws in containerization or microservices. Problems appear when we put unlikely pieces together in the first place.

But as a whole, Kubernetes heads off a lot of trouble before it even surfaces, thanks to its built-in fault tolerance. Despite all these cases, Kubernetes remains the most reliable container orchestration platform around.