Jul 21, 2020

GKE Security Best Practices: Designing Secure Clusters

By: Wei Lien Dang

As the creator of the original Kubernetes project, Google launched its Google Kubernetes Engine (GKE) platform — then called Google Container Engine — in August 2015. Today, GKE is one of the most popular managed Kubernetes services on the market.

As with any infrastructure platform or Kubernetes service, though, GKE customers have to make important decisions and formulate a plan for configuring and maintaining secure GKE clusters. While many of the security requirements and responsibilities apply to all Kubernetes clusters, regardless of where they are hosted, GKE also has several unique requirements that users must consider and act on to ensure that their GKE clusters, and the workloads their organization runs on them, are safeguarded from potential breaches or malicious attacks.

This post is the first in a four-part series that will explore best practices for securely configuring and operating GKE clusters and the containerized applications that run on those clusters. Part one focuses on what you need to know when planning and setting up your GKE clusters.

Use Cloud IAM Conditions

Why: For many years, GCP Identity and Access Management did not offer resource-scoped permissions management within a GCP project, making it difficult to enforce tight control of individual clusters and other GCP resources sharing a project.

Limiting IAM access by resource within a GCP project that contains one or more GKE clusters provides critical security protection. The introduction of Cloud IAM Conditions gives GCP users this finer-grained control over their IAM permissions.

What to do: Use Cloud IAM Conditions to limit permission scope for users and workload service accounts in projects containing GKE clusters.
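
As a rough sketch, a conditional role binding could look like the following. The project, user, and cluster name prefix are hypothetical, and the exact resource.name format that GKE exposes to IAM Conditions should be confirmed against the GCP documentation:

    # Hypothetical example: grant roles/container.developer only on clusters
    # whose names start with "dev-" (the resource.name format is an assumption).
    gcloud projects add-iam-policy-binding my-project \
        --member="user:jane@example.com" \
        --role="roles/container.developer" \
        --condition='title=dev-clusters-only,expression=resource.name.startsWith("projects/my-project/locations/us-central1/clusters/dev-")'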

Use VPC Networks and Alias IP Ranges

Why: The strongest network security protections and controls for GKE clusters require using the current generation of Google Compute networks. VPC-native networking for GKE clusters makes it possible to create pod-level firewall rules and also automatically performs anti-spoofing checks for node-to-node traffic.

What to do: Use VPC networks and Alias IP ranges for new clusters. Using alias IP ranges, called VPC-native networking, requires deployment to VPC networks and can only be selected at cluster creation.

New GKE clusters not assigned to an existing network will create a VPC network by default. If you assign your new clusters to existing networks, make sure you are using or creating VPC networks for them instead of using legacy networks. VPC-native traffic routing with alias IPs is not currently the default for new clusters, however, and must be selected.
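
A minimal sketch of creating a VPC-native cluster, assuming an existing VPC network and subnet (cluster, network, subnet, and zone values are placeholders):

    # Create a VPC-native (alias IP) cluster on an existing VPC network and subnet.
    gcloud container clusters create my-cluster \
        --zone us-central1-a \
        --network my-vpc \
        --subnetwork my-subnet \
        --enable-ip-alias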

Enable Private Clusters

Why: Strict network isolation that prevents unauthorized external ingress to GKE cluster API endpoints, nodes, and pod containers is a critical piece of Kubernetes cluster security. By default, GKE clusters have Kubernetes cluster API endpoints and nodes with public IP addresses. Additionally, the Kubernetes cluster API is open to the Internet unless you explicitly limit access.

Creating private GKE clusters provides many of these protections or simplifies their implementation in your clusters, although they come with some restrictions and trade-offs.

What to do: The private cluster option must be selected at cluster creation. Additional requirements include use of VPC (non-legacy) networks and Alias IP ranges.

A private GKE cluster still has a publicly-accessible cluster API endpoint by default. However, you can disable the public endpoint or limit access to it using master authorized networks.
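
One possible sketch of such a cluster, assuming hypothetical names, an example control-plane CIDR, and an example office CIDR that should be allowed to reach the public endpoint:

    # Private, VPC-native cluster: nodes get no public IPs, and only the listed
    # CIDR may reach the (still public) control plane endpoint. All values are
    # placeholders; add --enable-private-endpoint to disable the public endpoint.
    gcloud container clusters create my-private-cluster \
        --zone us-central1-a \
        --enable-ip-alias \
        --enable-private-nodes \
        --master-ipv4-cidr 172.16.0.16/28 \
        --enable-master-authorized-networks \
        --master-authorized-networks 203.0.113.0/24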

After creating your private cluster, you may need to perform extra configuration steps to ensure your cluster can pull from container image registries.

Note that GKE nodes which do not have public IP addresses of their own will require additional configuration to be able to connect to any Google service APIs or external sites. If your GKE workloads need to connect to other GCP services, you will need to configure Private Google Access for those services on each VPC subnet with nodes that require access.
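
Private Google Access is enabled per subnet, for example (subnet name and region are placeholders):

    # Enable Private Google Access on the subnet hosting the private nodes.
    gcloud compute networks subnets update my-subnet \
        --region us-central1 \
        --enable-private-ip-google-access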

For node access to other services and sites outside the VPC, you can configure a Cloud NAT or custom NAT gateway.
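
A Cloud NAT setup for a single region might look like the following sketch; the router and NAT names, network, and region are hypothetical:

    # Create a Cloud Router and a Cloud NAT config so private nodes can reach
    # external sites; all names are placeholders.
    gcloud compute routers create nat-router \
        --network my-vpc \
        --region us-central1

    gcloud compute routers nats create nat-config \
        --router nat-router \
        --region us-central1 \
        --auto-allocate-nat-external-ips \
        --nat-all-subnet-ip-ranges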

Note also that Windows node pools require additional steps for deployment to private GKE clusters.

Use Container-Optimized OS (COS) for Node Images

Why: A compromised node poses a huge danger to your entire cluster and its workloads. Using minimal operating system images with read-only file systems protects your nodes from many attacks and limits their potential blast radius in two critical ways: attackers have few tools on the image to leverage to escalate their access, and if they cannot write or overwrite configuration files and binaries on the node’s root file system, they cannot hijack the system as easily or install their own malicious tools.

What to do: Use GKE Container-Optimized OS (COS) nodes with read-only root file systems for your Linux node images. The COS image has a read-only root partition.
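
For instance, the image type can be specified when creating a node pool (cluster, pool, and zone values are placeholders):

    # Create a node pool that uses the Container-Optimized OS image.
    gcloud container node-pools create cos-pool \
        --cluster my-cluster \
        --zone us-central1-a \
        --image-type COS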

Enforce Node Access Scopes

Why: GKE nodes themselves require some baseline Cloud IAM permissions in order to provide common and important cluster functionality, including the ability to pull container images stored in Container Registry and to send logging and monitoring data to Stackdriver. Prior to Kubernetes version 1.10, GKE clusters automatically used a GCP service account granted the compute-rw scope, which allowed the nodes and any processes running on them, including containers, to make potentially destructive changes in the cluster’s GCP project. These clusters also had the storage-ro scope, which granted read access to all Cloud Storage buckets in the project.

What to do: GKE now allows you to configure which Cloud IAM permissions your cluster nodes actually need. If you have existing clusters that are still using the legacy GKE scopes, you will want to update them. You will also need to create custom service accounts for GKE workloads that may have been sharing the node scopes.
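
One possible pattern, sketched below with hypothetical project, service account, cluster, and node pool names, is to create a least-privilege node service account, grant it only the logging, monitoring, and image-pull roles, and attach it to a node pool with the minimal gke-default scopes:

    # Create a least-privilege service account for nodes (all names are placeholders).
    gcloud iam service-accounts create gke-node-sa \
        --display-name "GKE node service account"

    # Grant only the roles nodes typically need for logging, monitoring, and image pulls.
    for role in roles/logging.logWriter roles/monitoring.metricWriter \
                roles/monitoring.viewer roles/storage.objectViewer; do
      gcloud projects add-iam-policy-binding my-project \
        --member "serviceAccount:gke-node-sa@my-project.iam.gserviceaccount.com" \
        --role "$role"
    done

    # Attach the account to a new node pool with the minimal gke-default scopes.
    gcloud container node-pools create restricted-pool \
        --cluster my-cluster \
        --zone us-central1-a \
        --service-account gke-node-sa@my-project.iam.gserviceaccount.com \
        --scopes gke-default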

Use Shielded Nodes

Why: While GCP has high standards of security for its software and hardware, running workloads on multi-tenant infrastructure still carries some risk. No commonly used operating system is completely safe from attack, whether launched from its own running processes or from external network or infrastructure manipulation.

GCP’s Shielded Cloud initiative focuses on mitigating and removing risks associated with multi-tenant cloud environments. GKE clusters now support Shielded Nodes. These nodes use Shielded GCE VMs to safeguard and monitor the runtime integrity of your nodes, starting during the boot process.

What to do: Shielded Nodes can be enabled at any time for a cluster. Note that once a cluster has been configured to use Shielded Nodes, standard nodes will no longer be able to join the cluster.
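
For example, Shielded Nodes can be turned on for an existing cluster with an update command (cluster name and zone are placeholders):

    # Enable Shielded Nodes on an existing cluster.
    gcloud container clusters update my-cluster \
        --zone us-central1-a \
        --enable-shielded-nodes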