Following security best practices for AWS EKS clusters is just as critical as for any Kubernetes cluster. In a talk I gave at the Bay Area AWS Community Day, I shared lessons learned and best practices for engineers running workloads on EKS clusters. This overview recaps my talk and includes links to instructions and further reading.
Amazon Elastic Kubernetes Service (EKS) is AWS’ managed Kubernetes service. AWS hosts and manages the Kubernetes masters, and the user is responsible for creating the worker nodes, which run on EC2 instances.
While Kubernetes offers a number of tools to control the security of your workloads, most of them aren’t enabled by default, not even in EKS. The user has to configure and manage them properly to protect their EKS clusters. Most of the recommendations covered here apply to a Kubernetes cluster running anywhere, but EKS still has a few idiosyncrasies. In addition to the cluster management topics below, all other security best practices for using AWS cloud services, especially for EC2, still need to be followed.
EKS Cluster Design Considerations
EKS masters automatically span three Availability Zones (AZs) for high availability. The user is responsible for creating the nodegroups as AWS Auto Scaling groups of EC2 instances for the workloads. These nodegroups should also span multiple AZs and should be placed in private subnets in the VPC, with NAT Gateways for egress. If Internet ingress to cluster services is required, public subnets can be added to the VPC to host ELBs.
- VPC design: https://docs.aws.amazon.com/eks/latest/userguide/network_reqs.html
- Cluster security groups (use separate groups per cluster!): https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html
All EKS clusters use the AWS VPC Container Network Interface (CNI) for Kubernetes pod networking. The VPC CNI uses AWS Elastic Network Interfaces (ENI) for pod networking. While this approach has the benefit of putting the pods directly on the VPC network, it has several drawbacks:
- It doesn’t support Kubernetes Network Policies or any other way to create firewall rules for Kubernetes deployment workloads. While ENIs can have their own EC2 security groups, the CNI doesn’t support any granularity finer than a security group per node, which does not align with how pods get scheduled on nodes. This limitation makes the CNI poorly suited to multi-tenant clusters and makes it hard to limit the blast radius if a pod is exploited.
- EC2 instances support only a certain number of ENIs, which limits the number of pods that can be scheduled on a node. The number varies by instance type.
Fortunately, the first problem, lack of support for network segregation between workloads in a cluster, has a simple fix. The Calico CNI can be deployed in EKS to run alongside the VPC CNI, providing Kubernetes Network Policies support. I strongly recommend deploying Calico to EKS.
- Configuration (note the ENI Allocation section and understand your pod limits per node instance type!): https://github.com/aws/amazon-vpc-cni-k8s/blob/master/README.md
- ENI Configuration: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html
- Installation instructions: https://docs.aws.amazon.com/eks/latest/userguide/calico.html
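To make the ENI limit concrete, here is a minimal sketch of the pod-limit arithmetic AWS publishes for the VPC CNI in its default mode: ENIs times (addresses per ENI minus one), plus two. The function name and the small lookup table are illustrative, and the per-instance ENI figures were copied by hand, so check them against the current EC2 documentation before relying on them.

```python
def max_pods(eni_count: int, ipv4_per_eni: int) -> int:
    """Estimate the pod limit for a node using the VPC CNI's default mode.

    One address on each ENI is the interface's own primary IP, so it is
    subtracted. The +2 follows AWS's published formula.
    """
    return eni_count * (ipv4_per_eni - 1) + 2


# (max ENIs, IPv4 addresses per ENI) for a few instance types.
# Assumption: values entered by hand for illustration; confirm in the EC2 docs.
ENI_LIMITS = {
    "t3.medium": (3, 6),
    "m5.large": (3, 10),
    "m5.xlarge": (4, 15),
}

if __name__ == "__main__":
    for instance_type, (enis, ips) in ENI_LIMITS.items():
        print(f"{instance_type}: up to {max_pods(enis, ips)} pods")
```

With these figures an m5.large tops out at 29 pods, which is worth knowing before you size a nodegroup around many small pods.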
Secure Container Images
Building secure container images is a topic unto itself.
For images you build yourself, make scanning for vulnerabilities part of your CI cycle. Choose a container vulnerability scanner that supports not just operating system packages but also language libraries (remember Apache Struts!). If you’re using a base image, make sure it’s up to date. Don’t install packages you don’t need, especially network tools or other software that can be leveraged in an exploit. Update your images frequently.
For third-party images, vulnerability scanning is also the main priority. If the images for the applications you need always seem to be full of CVEs, even in the latest version, you may want to build your own image for the application.
Kubernetes also provides a great mechanism to ensure dangerous containers won’t get deployed to your cluster: the almighty Admission Controller. Kubernetes can be configured to send all deployment requests to a dynamic admission controller for approval before they go to the Kubernetes scheduler, which is responsible for placing pods on nodes. If a deployment fails the controller’s configured tests, its pods won’t run at all.
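As a rough illustration of what such a controller decides, the sketch below shows only the review logic of a validating webhook: it takes an AdmissionReview request for a Pod and builds the response Kubernetes expects. The two checks (no privileged containers, images only from one registry) and the registry name `registry.example.com/` are hypothetical choices for this example. A real webhook also needs an HTTPS server and a ValidatingWebhookConfiguration, which are omitted here.

```python
# Hypothetical policy: only images from this registry are allowed.
ALLOWED_REGISTRY = "registry.example.com/"


def review_pod(admission_review: dict) -> dict:
    """Return an AdmissionReview response that allows or denies a Pod."""
    request = admission_review["request"]
    spec = request["object"].get("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])

    violations = []
    for container in containers:
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext") or {}
        if security.get("privileged"):
            violations.append(f"container {name} is privileged")
        if not container.get("image", "").startswith(ALLOWED_REGISTRY):
            violations.append(f"container {name} uses an unapproved registry")

    # The response must echo the request uid so the API server can match it.
    response = {"uid": request["uid"], "allowed": not violations}
    if violations:
        response["status"] = {"code": 403, "message": "; ".join(violations)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


if __name__ == "__main__":
    example = {
        "request": {
            "uid": "example-uid",
            "object": {
                "spec": {
                    "containers": [
                        {
                            "name": "app",
                            "image": "docker.io/library/nginx:1.25",
                            "securityContext": {"privileged": True},
                        }
                    ]
                }
            },
        }
    }
    print(review_pod(example)["response"])
```

Keeping the decision in a plain function like this makes it easy to unit test the policy separately from the webhook plumbing.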
- Vulnerability Scanning
- Anchore - open-source, does both OS package and language library scanning: https://github.com/anchore/anchore-engine
- Admission Controllers
- Anchore’s admission controller: https://github.com/anchore/kubernetes-admission-controller
- Write your own: https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/
- 11 tips for operationalizing admission controllers: https://www.stackrox.com/post/2019/03/11-tips-to-operationalizing-kubernetes-admission-controllers-for-better-security/
Pod Runtime Security
Limiting the permissions and capabilities of container runtimes is perhaps the most critical piece of security for EKS workloads, and it has many parts.
First, use Namespaces liberally. Figure out what pattern makes sense for your application, and make sure workloads that need to be managed by different teams have their own Namespace, which enables important privilege segregation.
Next, learn how to write strict Kubernetes RBAC roles. Follow the principle of least privilege, and try to avoid using wildcards. Avoid using ClusterRoles and ClusterRoleBindings as much as possible, as they are global across all namespaces.
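A minimal sketch of what a narrowly scoped role can look like, built as a Python dict that can be dumped to JSON and passed to `kubectl apply -f -`. The role name `pod-reader`, the namespace `team-a`, and the `wildcard_rules` helper are illustrative assumptions, not part of any Kubernetes tooling. The helper simply flags rules that use `*`.

```python
import json


def pod_reader_role(namespace: str) -> dict:
    """A namespaced Role that can only read pods, with no wildcards."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "pod-reader", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def wildcard_rules(role: dict) -> list:
    """Return the indexes of rules that use '*' in groups, resources, or verbs."""
    flagged = []
    for index, rule in enumerate(role.get("rules", [])):
        for field in ("apiGroups", "resources", "verbs"):
            if "*" in rule.get(field, []):
                flagged.append(index)
                break
    return flagged


if __name__ == "__main__":
    role = pod_reader_role("team-a")
    print(json.dumps(role, indent=2))
    print("rules with wildcards:", wildcard_rules(role))
```

A check like `wildcard_rules` could run in CI against every Role and ClusterRole you commit, so overly broad rules get caught in review.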
For cluster network security, install the Calico CNI (see earlier best practice) so you can use Network Policies to control traffic in the cluster.
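Once a policy engine is installed, a common starting point is to deny all ingress in a namespace and then allow specific flows. The sketch below builds both policies as Python dicts. The namespace, the `app` label key, and the port are hypothetical values chosen for the example.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Select every pod in the namespace and allow no ingress traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        # An empty podSelector matches all pods; listing "Ingress" with no
        # ingress rules means nothing is allowed in.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress(namespace: str, app: str, from_app: str, port: int) -> dict:
    """Allow pods labeled app=<from_app> to reach app=<app> on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    for policy in (default_deny_ingress("team-a"),
                   allow_ingress("team-a", "api", "web", 8080)):
        print(json.dumps(policy, indent=2))
```

Network Policies are additive, so the allow policy opens exactly one path through the default deny without having to modify it.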
For container runtimes, start by adding a securityContext to every deployment that sets a non-root user and disables privilege escalation. You can enforce these limitations in your cluster by using PodSecurityPolicies. Also make sure you give each application in your cluster its own Kubernetes service account.
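The securityContext settings described above can be sketched as a small helper that takes a Deployment manifest (as a dict) and returns a hardened copy. The helper name `harden`, the UID 10001, and the example names are assumptions made for illustration. The UID must exist and work with your image.

```python
import copy
import json


def harden(deployment: dict, service_account: str, uid: int = 10001) -> dict:
    """Return a copy of the Deployment with basic runtime restrictions."""
    hardened = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    pod_spec = hardened["spec"]["template"]["spec"]

    # Give the workload its own service account instead of "default".
    pod_spec["serviceAccountName"] = service_account

    # Pod-level settings: refuse to run as root and pin a non-root UID.
    pod_security = pod_spec.setdefault("securityContext", {})
    pod_security["runAsNonRoot"] = True
    pod_security["runAsUser"] = uid

    # Container-level setting: block setuid-style privilege escalation.
    for container in pod_spec.get("containers", []):
        container.setdefault("securityContext", {})["allowPrivilegeEscalation"] = False
    return hardened


EXAMPLE_DEPLOYMENT = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "api", "namespace": "team-a"},
    "spec": {
        "selector": {"matchLabels": {"app": "api"}},
        "template": {
            "metadata": {"labels": {"app": "api"}},
            "spec": {
                "containers": [
                    {"name": "api", "image": "registry.example.com/api:1.0"}
                ]
            },
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(harden(EXAMPLE_DEPLOYMENT, "api"), indent=2))
```

Applying the same transformation to every manifest keeps the settings consistent, while the cluster-side policy remains the actual enforcement point.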
Last but not least, you must protect the Identity and Access Management (IAM) credentials of your nodes’ IAM instance role. The nodes are standard EC2 instances with an IAM role that carries a standard set of EKS permissions, in addition to any permissions you may have added. The workload pods should not be allowed to fetch the role’s credentials from the EC2 metadata endpoint. You have several options for protecting the endpoint that still enable automated access to AWS APIs for deployments that need it. If you don’t use kube2iam or kiam, which both work by intercepting calls to the metadata endpoint and issuing limited credentials back to the pod based on your configuration, install the Calico CNI so you can add a Network Policy to block access to the metadata IP, 169.254.169.254.
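One way to express the metadata block is an egress Network Policy that allows all destinations except 169.254.169.254. The sketch below builds that policy and includes a small checker, `egress_allows`, which is a simplified model written for this example (it only looks at `ipBlock` peers and ignores ports and selectors), not how any policy engine evaluates rules. The namespace is a placeholder.

```python
import ipaddress
import json

METADATA_IP = "169.254.169.254"


def deny_metadata_egress(namespace: str) -> dict:
    """Allow egress everywhere except the EC2 metadata address."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "deny-ec2-metadata", "namespace": namespace},
        "spec": {
            "podSelector": {},  # applies to every pod in the namespace
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [
                        {
                            "ipBlock": {
                                "cidr": "0.0.0.0/0",
                                "except": [f"{METADATA_IP}/32"],
                            }
                        }
                    ]
                }
            ],
        },
    }


def egress_allows(policy: dict, ip: str) -> bool:
    """Simplified check: does any ipBlock peer in the policy permit this IP?"""
    address = ipaddress.ip_address(ip)
    for rule in policy["spec"].get("egress", []):
        for peer in rule.get("to", []):
            block = peer.get("ipBlock")
            if block is None or address not in ipaddress.ip_network(block["cidr"]):
                continue
            excluded = any(
                address in ipaddress.ip_network(cidr)
                for cidr in block.get("except", [])
            )
            if not excluded:
                return True
    return False


if __name__ == "__main__":
    policy = deny_metadata_egress("team-a")
    print(json.dumps(policy, indent=2))
    print("metadata reachable:", egress_allows(policy, METADATA_IP))
```

Because Network Policies are additive, any other policy in the namespace that allows egress to the metadata address would reopen it, so review the full set together.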
- Namespaces: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
- Kubernetes RBAC
- Kubernetes Network Policies (requires installing Calico CNI, see above):
- Lock down Container Runtime Privileges
- Pod security contexts: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
- Pod Security Policies: https://kubernetes.io/docs/concepts/policy/pod-security-policy/
- Protect EC2 Instance Role Credentials and Manage AWS IAM Permissions for Pods:
- Manage IAM credentials for pods:
- Using IAM with Kubernetes service accounts: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
- Block pod access to the EC2 metadata API (not needed if you use kube2iam or kiam):
- Better: Install Calico and use a Network Policy
Running workloads on EKS or any Kubernetes cluster brings new security responsibilities with it. While Kubernetes provides ways to lock down your clusters, you are still responsible for using them. These starting points, combined with the typical rules of thumb for security like the Principle of Least Privilege and trying to limit potential blast radiuses, should point you in the right direction.