Google invented Kubernetes, and Google Kubernetes Engine (GKE) was the first managed Kubernetes service to come to market. Together, Kubernetes and GKE have given businesses a management platform that takes care of most of a cluster’s operational overhead.
The remaining cluster tasks, however, still required manual tweaking in GKE to meet business needs. Google has since announced GKE Autopilot, which takes care of that remaining cluster management overhead. This means operators do not have to deal with cluster configuration and can focus on running their applications on GKE.
GKE Autopilot implements many levels of controls for Kubernetes clusters to provide a simple way to build a secure and consistent Kubernetes platform. It comes preconfigured with many features which have their benefits and tradeoffs.
This blog will discuss those benefits and tradeoffs and how they stack up against the standard GKE platform.
Autopilot is an operation mode in Google Kubernetes Engine (GKE) which extends Kubernetes cluster configuration and management capabilities. Autopilot adds and automates much-needed operational, security, and provisioning features that standard GKE lacks, leading to reduced costs and management time.
With GKE Autopilot, users can manage the entire Kubernetes infrastructure by providing just the application configuration. The operation mode takes care of resource and node provisioning, standard security configuration, auto-scaling, auto-upgrades, maintenance, and Day 2 operations based on application specifications.
Compared with the standard GKE mode, where users manually provision resources based on their requirements, GKE Autopilot drastically reduces the number of decisions needed to provide a secure, production-grade infrastructure. Very few steps are required from the developer’s perspective. The service automatically determines the best configuration based on the workloads, which frees up dev teams to focus more on the application’s logic and less on managing Kubernetes clusters.
GKE Autopilot is built on a stack of components such as Shielded VMs, VPC-based public/private networking, and CSI-based storage. Together they offer a fully automated platform for users and organizations that want to manage service mesh functionality, cluster deployments, and other associated tasks.
With Autopilot, Google addresses the challenge of rightsizing Kubernetes environments and provides a platform in which Google SRE practices cover both the control plane and the nodes.
The biggest advantage of GKE Autopilot is that billing is based on deployed pods rather than on nodes. Autopilot implements an autoscaling policy that dynamically provisions resources to match workload requirements, leading to improved resource utilization. With Autopilot, users don’t pay flat hourly fees for preconfigured node instances; instead, they are charged per second for the vCPU, memory, and disk their pods request.
In addition to pod-based billing, Autopilot includes a Service Level Agreement (SLA) at the pod level. GKE Autopilot guarantees 99.9% uptime for pods deployed across multiple zones, enabling users to prioritize workloads for better availability and operations.
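To put the 99.9% figure in perspective, the downtime such a guarantee leaves room for can be worked out directly. The snippet below is a minimal sketch: the function name and the 30-day period are assumptions chosen for illustration and are not taken from Google’s SLA terms.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime an uptime percentage leaves room for.

    The 30-day default period is an illustrative assumption.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.9):.1f} minutes per 30 days")
```

Under these assumptions, 99.9% uptime corresponds to roughly 43 minutes of permitted downtime in a 30-day period.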
Standard GKE already provides lots of functionality, such as policy hardening, container layers, and auto-scaling. But with Autopilot, all of these processes and best practices are preconfigured from the get-go. Autopilot helps secure the cluster infrastructure by automating GKE hardening guidelines and utilizing security capabilities like Shielded GKE Nodes and Workload Identity.
Autopilot’s shielded nodes prohibit unsafe practices, such as SSH access to nodes, and eliminate pathways to the underlying servers. Workload Identity, in turn, enables granular authorization for the workloads in each cluster and automatically authenticates Kubernetes services when they access Google Cloud APIs.
Like Standard GKE, Autopilot is fully compatible with third-party monitoring (Datadog) and logging (GitLab) solutions that improve the performance and resilience of Kubernetes environments. Users can easily control node maintenance and analyze issues that weaken cluster security.
Both Datadog and GitLab are also configured the same way as with Standard GKE. No additional configuration or integration is needed, which reduces the overall operational load of managing Autopilot clusters.
GKE Autopilot becoming GA is great news for SRE teams, as it greatly benefits applications and clients. However, because many features were stripped away and are planned for upcoming releases, businesses face a number of limitations right now, which we will discuss below.
Pod vCPUs in Autopilot are available in increments of 0.25 and must be in a ratio of between 1:1 and 1:6.5 with memory. Services whose resource requests fall outside these ratio ranges are scaled up to meet the criteria. This may not be cost-effective for businesses that deploy small services, since those services have to be overscaled to fit the ratio ranges.
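The effect of these rules on a small service can be sketched in a few lines. This is an illustration rather than Autopilot’s actual admission logic: it assumes the ratio is vCPU to GiB of memory, and the function and constant names are hypothetical.

```python
import math

CPU_STEP = 0.25          # vCPU increment stated in the text
MIN_GIB_PER_VCPU = 1.0   # lower bound of the 1:1 ratio (assumed vCPU:GiB)
MAX_GIB_PER_VCPU = 6.5   # upper bound of the 1:6.5 ratio (assumed vCPU:GiB)


def round_up_to_step(value: float, step: float = CPU_STEP) -> float:
    """Round value up to the next multiple of step (at least one step)."""
    steps = max(1, math.ceil(value / step))
    return steps * step


def adjust_requests(vcpu: float, memory_gib: float) -> tuple[float, float]:
    """Scale a pod's requests up until they satisfy the stated constraints."""
    vcpu = round_up_to_step(vcpu)
    # Too much memory for the CPU: raise CPU until memory <= 6.5 * vCPU.
    if memory_gib > vcpu * MAX_GIB_PER_VCPU:
        vcpu = round_up_to_step(memory_gib / MAX_GIB_PER_VCPU)
    # Too little memory for the CPU: raise memory until memory >= 1 * vCPU.
    if memory_gib < vcpu * MIN_GIB_PER_VCPU:
        memory_gib = vcpu * MIN_GIB_PER_VCPU
    return vcpu, memory_gib


if __name__ == "__main__":
    # A small service and a memory-heavy service, both outside the ranges.
    print(adjust_requests(0.1, 0.1))
    print(adjust_requests(0.25, 4.0))
```

Under these assumptions, a service that asks for 0.1 vCPU and 0.1 GiB ends up with 0.25 vCPU and 0.25 GiB, which is the overscaling the paragraph above describes.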
GKE Autopilot’s lack of support for privileged containers can be a real downside for many organizations. Privileged mode is very useful for admins who want to change kubelet or networking settings. Because Autopilot clusters offer no way to update hosts or run pods in privileged mode, these changes aren’t allowed, which impacts many cluster workloads.
Autopilot does not support SSH access to nodes, meaning all nodes, including GPU and TPU (Tensor Processing Unit) nodes, are locked down. According to Google, support is planned for the future, but for now this can be limiting for organizations that prefer to configure clusters remotely.
GKE Autopilot makes porting between Standard GKE clusters and Autopilot, or vice versa, impossible, which complicates the move for organizations that want to shift to Autopilot for automated operations. According to Google, Autopilot does not support porting because of its configuration design. Standard GKE was built to support a wide range of configurations, whereas Autopilot is a configuration mode built on top of GKE for optimized management and security.
GKE Autopilot mode supports only Google’s own Container-Optimized OS, a Linux image, with containerd as the node operating system. So if your organization uses Red Hat Enterprise Linux (RHEL), Linux with Docker, or Windows Server, Autopilot is not a great choice right now. The rationale for choosing an in-house Linux and containerd combination is to meet the security and SLA standards Google has set for Autopilot.
GKE Autopilot, with its provisioning and management capabilities, is a great platform for users who want their source code up and running in a matter of seconds. But for power users who demand control and customization through third-party integrations, Autopilot has its limitations when compared with Standard GKE.
For example, GKE Autopilot does not have Istio and Knative integration, which bring true scaling capabilities. Configuring third-party storage platforms or network policies is also not supported by GKE Autopilot, although Standard GKE includes this capability.
With Autopilot available, users can use GKE in two different operation modes, Standard and Autopilot. Both modes offer different levels of customizations and configurations over the GKE clusters, which we will compare below based on different Kubernetes factors.
Standard GKE manages only the control plane; everything else, such as nodes or node pools, automation, and security, has to be configured manually by the user. GKE Autopilot, on the other hand, is preconfigured to manage the control plane, nodes, and Day 2 automation operations such as node auto-upgrades, repair, and maintenance.
GKE offers two cluster availability types, zonal (single-zone or multi-zonal) and regional, which are chosen based on workload requirements and budget.
A multi-zonal cluster has a single control plane replica and nodes spread across multiple zones. If a zone goes down, workloads keep running on the nodes in the other zones, although the control plane is unavailable when its own zone is the one affected. In contrast, regional clusters have multiple control plane replicas running in multiple zones within a given region.
GKE Autopilot is preconfigured to use regional clusters, whereas standard GKE allows users to choose between zonal/regional cluster types.
In GKE Autopilot mode, networking is preconfigured to use VPC-native traffic routing with a maximum of 32 pods per node and intranode visibility. Standard GKE can also be configured to use VPC-native routing, with a larger maximum of 110 pods per node and intranode visibility, but the setting is turned off by default.
Other network parameters, including private cluster, authorized networks, and cloud NAT are available as optional settings in both standard and autopilot modes and can be easily configured.
Automation in GKE lets users schedule custom maintenance windows and surge upgrades depending on workload needs. GKE Autopilot is preconfigured with maintenance windows and surge upgrades, which automatically take care of node maintenance based on pod specifications. Standard GKE, on the other hand, provides options to enable maintenance windows, specify upgrade times, and set maintenance exclusions.
Billing for Standard GKE is on a per-node basis and depends on the type of node (CPU, memory, disk, etc.). Users pay the assigned amount regardless of how much of the node’s resources they use. In contrast, Autopilot billing is based on per-pod resource requests (CPU, memory, and storage) and uses a pay-per-use model. Users pay only for the CPU, memory, and storage their pods request.
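The difference between the two billing models can be sketched as follows. This is a simplified illustration, not Google’s pricing calculator: the class and function names are hypothetical, and every price in the example is a placeholder rather than a real Google Cloud rate.

```python
from dataclasses import dataclass


@dataclass
class PodRequest:
    vcpu: float
    memory_gib: float
    storage_gib: float


def standard_cost(node_count: int, node_price_per_hour: float, hours: float) -> float:
    """Per-node billing: every provisioned node is paid for, used or not."""
    return node_count * node_price_per_hour * hours


def autopilot_cost(pods, seconds, vcpu_price, memory_price, storage_price) -> float:
    """Per-pod billing: pay per second for what each pod requests.

    All three prices are expressed per unit per hour.
    """
    hours = seconds / 3600
    hourly = sum(
        p.vcpu * vcpu_price
        + p.memory_gib * memory_price
        + p.storage_gib * storage_price
        for p in pods
    )
    return hourly * hours


if __name__ == "__main__":
    # Placeholder prices for illustration only, NOT real Google Cloud rates.
    pods = [PodRequest(0.5, 2, 1), PodRequest(0.25, 1, 1)]
    print("standard :", standard_cost(3, 0.10, 24))
    print("autopilot:", autopilot_cost(pods, 24 * 3600, 0.05, 0.005, 0.0001))
```

The point of the comparison is structural: the Standard figure depends only on how many nodes exist, while the Autopilot figure depends only on what the pods request and for how long.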
This billing model is very similar to AWS Fargate pricing.
GKE Autopilot is preconfigured to handle the scaling of nodes, whereas in Standard GKE, node auto-provisioning is available as an optional setting. Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA) are available by default in Autopilot, but in Standard GKE they are turned off by default.
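For readers unfamiliar with HPA, its core scaling rule, as described in the Kubernetes documentation, is the current replica count multiplied by the ratio of the observed metric to the target metric, rounded up. The sketch below shows only that rule. It leaves out the tolerance and stabilization behavior of the real autoscaler, and the function name and the default replica bounds are assumptions for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA rule: ceil(current * observed / target), then clamp."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Keep the result inside the configured replica bounds.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Four replicas at 90% average CPU with a 60% target.
    print(desired_replicas(4, 90, 60))
```

In Autopilot, the nodes needed to host the additional replicas are provisioned automatically, which is why the text treats pod and node scaling together.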
With GKE Autopilot, Google has reduced the number of configuration steps required to deploy the GKE cluster. The ability of Autopilot to automatically provision based on the compute resources and workload requirements can improve resource utilization and reduce Day-2 operation challenges.
GKE Autopilot is an easy-to-use solution for enterprises looking to apply Google SRE practices to managing their Kubernetes environments. Autopilot also offers SLAs of up to 99.95% for the control plane and 99.9% for pods, while significantly reducing the overall cost incurred by cluster resources.
It will be interesting to see how Autopilot matures in terms of integrations and deployment and whether it creates a strong differentiating factor for its platform. Services such as AWS Fargate for EKS already offer strong security capabilities and ops-friendly configuration, with pods as the fundamental unit of deployment and billing.