
Mar 17, 2020

Guide to Designing EKS Clusters for Better Security

By: Karen Bruner

When it comes to cloud services like AWS, customers need to understand what features and tools their cloud provider makes available, as well as which pieces of the management role fall on the user. That share of the workload becomes even more critical with respect to securing the Kubernetes cluster, the workloads deployed to it, and its underlying infrastructure.

Customers share responsibility with AWS for the security and compliance of the services they use. AWS takes responsibility for securing its infrastructure and addressing security issues in its software. The customer must ensure the security of their own applications and must also correctly use the controls offered to protect their data and workloads within the AWS infrastructure.

For a managed Kubernetes service like EKS, users have three main layers that require action:

  • the cluster and its components,
  • the workloads running on the cluster, and
  • the underlying AWS services on which the cluster depends, which include much of the Elastic Compute Cloud (EC2) ecosystem: instances, storage, Virtual Private Cloud (VPC), and more.

In this blog post, we will provide a set of guidelines to help you design your EKS clusters without compromising security.

EKS Cluster Design

The path to running secure EKS clusters starts with designing a secure cluster. By understanding the controls available for Kubernetes and EKS, while also understanding where EKS clusters need additional reinforcement, it becomes easier to implement and maintain cluster security.

For existing clusters, most of these protections can be applied at any time. However, creating replacement clusters and migrating workloads would provide a fresh environment and an opportunity to apply workload protections, described in later sections, as the deployments get migrated.

Cloud Infrastructure Security

Why: A secure EKS cluster needs to run in a secure AWS environment.

What to do: Follow recommended security best practices for your AWS account, particularly for AWS IAM permissions, VPC network controls, and EC2 resources. This includes:

  • limiting use of the AWS account’s root user, which has unlimited privileges
  • following the principle of least privilege when granting IAM permissions
  • enabling CloudTrail logging for audit and forensics (a minimal sketch follows this list)
  • following configuration best practices for VPC networking and EC2 resources
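
As an example of one of these controls, a multi-region CloudTrail trail can be declared in CloudFormation. The snippet below is only a minimal sketch: the trail name and S3 bucket are placeholders, and it assumes the bucket already exists with a bucket policy that lets CloudTrail write to it.

    Resources:
      # Multi-region trail that records management API activity for audit and forensics.
      AccountAuditTrail:
        Type: AWS::CloudTrail::Trail
        Properties:
          TrailName: account-audit-trail         # placeholder name
          S3BucketName: example-cloudtrail-logs  # placeholder; bucket and policy must already exist
          IsLogging: true
          IsMultiRegionTrail: true
          EnableLogFileValidation: true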

VPC Layout

Why: Creating a VPC with network security in mind will be key for protecting EKS nodes from external threats.

What to do: Place your nodes on private subnets only. Public subnets should be used only for external-facing load balancers or NAT gateways. For external access to services running on the EKS cluster, use load balancers managed by Kubernetes Service resources or Ingress controllers rather than allowing direct access to node instances. If your workloads need Internet access, or you are using EKS managed node groups (which require Internet access), use NAT gateways for VPC egress.
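
If you create clusters with eksctl, a config file along these lines captures that layout. Treat it as a sketch rather than a complete configuration: the cluster name, region, and node group settings are placeholders, and it assumes you let eksctl create the VPC with a single NAT gateway for egress.

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: example-cluster      # placeholder
      region: us-east-1          # placeholder

    vpc:
      nat:
        gateway: Single          # NAT gateway for egress from the private subnets

    nodeGroups:
      - name: workers            # placeholder self-managed node group
        instanceType: m5.large
        desiredCapacity: 3
        privateNetworking: true  # place the nodes on private subnets only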

Dedicated IAM Role for EKS Cluster Creation

Why: EKS gives the IAM user or role creating the cluster permanent authentication on the cluster’s Kubernetes API service. AWS provides no ability to make this grant optional, to remove it, or to move it to a different IAM user or role (as of 3/17/2020). Furthermore, by default, this cluster user has full admin privileges in the cluster’s RBAC configuration. The official EKS documentation does not explicitly discuss this implementation or its serious implications with respect to the user’s ability to manage cluster access effectively.

What to do: Your options for locking down this access depend on whether you are trying to secure existing clusters or create new clusters. For reference, the cluster creator authenticates against the Kubernetes API as user kubernetes-admin in the group system:masters.

For new clusters

  1. Use a different dedicated IAM role for each cluster to create the cluster.
  2. After creation, remove all IAM permissions from the role.
  3. Update the aws-auth ConfigMap in the kube-system namespace to add more IAM users/roles assigned to groups other than system:masters (see the example after this list).
  4. Add these groups as subjects of RoleBindings and ClusterRoleBindings in the cluster RBAC as needed.
  5. Test the changes using the credentials or assuming the role of those IAM entities.
  6. Edit the cluster-admin ClusterRoleBinding to remove the system:masters group as a subject. Important: if you edit or remove the cluster-admin ClusterRoleBinding without first adding alternative ClusterRoleBindings for admin access, you could lose admin access to the cluster API.
  7. Remove the right to assume this role from all IAM entities. Grant the right to assume this role only when needed to repair cluster authorization issues due to misconfiguration of the cluster’s RBAC integration with AWS IAM.
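
For reference, a minimal sketch of what steps 3 and 4 might look like follows. The account ID, role name, group name, and usernames are placeholders; adapt them to your environment and keep the existing node role mappings in aws-auth intact.

    # aws-auth ConfigMap (kube-system): map an IAM role to a group other than system:masters.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: aws-auth
      namespace: kube-system
    data:
      mapRoles: |
        # The existing mappings for the node instance role must be preserved here.
        - rolearn: arn:aws:iam::111122223333:role/eks-cluster-admins  # placeholder role
          username: cluster-admin-role
          groups:
            - eks-admins
    ---
    # ClusterRoleBinding granting that group admin rights through Kubernetes RBAC.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: eks-admins-cluster-admin
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-admin
    subjects:
      - apiGroup: rbac.authorization.k8s.io
        kind: Group
        name: eks-admins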

For existing clusters

  1. Update the aws-auth ConfigMap in the kube-system namespace to add more IAM users/roles assigned to groups other than system:masters.
  2. Add these groups as subjects of RoleBindings and ClusterRoleBindings in the cluster RBAC as needed.
  3. Test the changes using the credentials or assuming the role of those IAM entities.
  4. Edit the cluster-admin ClusterRoleBinding to remove the system:masters group as a subject. Important: if you edit or remove the cluster-admin ClusterRoleBinding without first adding alternative ClusterRoleBindings for admin access, you could lose admin access to the cluster API.

You may also want to go through your AWS support channels and ask them to change this immutable cluster admin authentication behavior.

Managed vs Self-managed Node Groups

Why: AWS introduced managed node groups at re:Invent 2019 to simplify the creation and management of EKS node groups. The original method of creating EKS node groups, by creating an AWS Auto Scaling group configured for EKS, can still be used. Both types of node groups have advantages and disadvantages.

Managed Node Groups

Benefits

  • Easier to create
  • Reduces the management work required during node version patching and upgrades by draining pods from the nodes and replacing the nodes automatically

Drawbacks and limitations

  • Every node gets a public IP address and must be able to send VPC egress traffic to join the cluster, even if the node is on a private subnet and the EKS cluster has a private Kubernetes API endpoint.
  • Greatly reduced options for instance and node configuration: for example, custom instance user data (often used for installing third-party monitoring agents or other system daemons) is not allowed, and automatic node taints are not supported. In particular, the inability to modify instance user data leaves customization of the cluster networking unsupported for managed node groups and makes it very difficult to install monitoring agents directly on the nodes.
  • Only Amazon Linux is supported.

Self-managed Node Groups

Benefits

  • User has much more control over node and network configuration
  • Supports Amazon Linux, Ubuntu, or even custom AMIs for the node image

Drawbacks

  • Node group creation is not as automated as it is for managed node groups. Tools like eksctl can automate much of this work, however.
  • Requires manually replacing nodes or migrating to new node groups during node version upgrades. eksctl or other automation can ease this workload.

What to do: Decide which type suits your needs best. The fact that every node in a managed group gets a public IP address will be problematic for some security teams, while other teams may be more able or willing to fill the automation gap between managed and self-managed node groups.
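
For comparison, the snippet below sketches the kind of node customization a self-managed node group allows in an eksctl config file; the names, sizes, and bootstrap command are placeholders rather than recommendations.

    nodeGroups:
      - name: custom-workers     # placeholder self-managed node group
        instanceType: m5.large
        desiredCapacity: 3
        privateNetworking: true  # keep the nodes off public subnets
        ssh:
          allow: false           # do not install an SSH key on the nodes
        preBootstrapCommands:    # injected into the instance user data before the node joins
          - "echo 'install a monitoring agent or other system daemon here'"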

Cluster Resource Tagging

Why: Using unique AWS resource tags for each cluster will make it easier to limit IAM resource permissions to specific clusters by using conditionals based on tag values. Not all resources support tags, but most do, and AWS continues to add support for tagging existing resource types.

What to do: Tag your resources.
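
With eksctl, for example, cluster-level tags can be declared in the config file (node groups take their own tags field as well). The keys and values below are placeholders for whatever naming scheme your IAM policy conditions key on.

    metadata:
      name: example-cluster
      region: us-east-1
      tags:
        eks-cluster-name: example-cluster  # unique per-cluster value for IAM tag conditions
        environment: production            # placeholder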

Control SSH Access to Nodes

Why: Workloads running in pods on the cluster should never need to ssh to the nodes themselves. Blocking the pods’ access to the ssh port reduces the opportunities a malicious pod has for gaining direct access to the node.

What to do: Options for preventing access to the node’s SSH port:

  • Do not enable ssh access for node instances. In rare cases, not being able to ssh to the node may make troubleshooting more difficult, but system and EKS logs generally contain enough information for diagnosing problems. Usually, terminating problematic nodes is preferable to diagnosing issues, unless you see frequent node issues, which may be symptomatic of chronic problems.
  • Install the Calico CNI (Container Network Interface) and use Kubernetes Network Policies to block all pod egress traffic to port 22 (a sample policy follows this list).
  • You can use the AWS Systems Manager Session Manager instead of running sshd to connect to nodes. Note that this option only works for self-managed node groups as it requires installing an agent on each node. You need to add the agent’s installation to the user data field, which cannot be modified in managed node groups.
  • See the next section on EC2 security groups for EKS nodes for additional options.
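
As a sketch of the Network Policy option above, a Calico GlobalNetworkPolicy (applied with calicoctl) along these lines could deny pod egress to port 22 cluster-wide while allowing other traffic. The policy name is a placeholder, and you should verify it does not conflict with policies you already run.

    apiVersion: projectcalico.org/v3
    kind: GlobalNetworkPolicy
    metadata:
      name: deny-pod-egress-to-ssh  # placeholder name
    spec:
      selector: all()               # applies to all workload endpoints; narrow as needed
      types:
        - Egress
      egress:
        # Drop any egress traffic destined for TCP port 22.
        - action: Deny
          protocol: TCP
          destination:
            ports:
              - 22
        # Allow all other egress so normal pod networking keeps working.
        - action: Allow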

EC2 Security Groups for Nodes

Why: The ability to isolate pods from any system services listening on the node provides a critical control for safeguarding the node itself from malicious or compromised pods that may be running. However, accomplishing this traffic segmentation effectively in EKS clusters requires considerable effort.

EC2 security groups assigned to an ENI (Elastic Network Interface) apply to all the IP addresses associated with that ENI. Because the AWS VPC CNI (Container Network Interface) used for EKS cluster networking defaults to putting pod IP addresses on the same ENI as the node’s primary IP address, the node shares EC2 security groups with the pods running on the node. No simple way exists under this scheme to limit traffic to and from the pods without also affecting the nodes, and vice versa.

The issue gets more complicated. Starting with Kubernetes 1.14, EKS now adds a cluster security group that applies to all nodes (and therefore pods) and control plane components. This cluster security group has one rule for inbound traffic: allow all traffic on all ports to all members of the security group. This security group ensures that all cluster-related traffic between nodes and the control plane components remains open. However, because the pods also by extension share this security group with the nodes, their access to the nodes and the control plane is also unrestricted on the VPC network.

Each node group also generally has its own security group. Users of self-managed node groups will need to create the security group for the node group. Managed node groups each have an automatically-created security group. Why not just add a rule to the node security groups limiting access to port 22 for ssh, perhaps to the security group of a bastion host? Unfortunately, that won’t work, either. All security group rules are of type “allow.” When an EC2 network interface has two or more security groups that apply, their rules get merged additively. If multiple groups have rules that include the same port, the most permissive rule for that port gets applied.

What to do: EKS users have a few options that can be used alone or in combination for mitigating the default EKS security group issues.

Customize the AWS VPC CNI networking configuration.

By default, the CNI shares the node’s primary ENI with pods. Setting the configuration variable AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG to true changes this behavior to reserve the primary ENI for the node’s IP address only, with pods placed on secondary ENIs configured separately.
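
A rough sketch of what the custom networking setup involves, assuming the variable has been set on the aws-node DaemonSet: define one ENIConfig resource per Availability Zone that points pod ENIs at their own subnet and security groups. The subnet and security group IDs below are placeholders.

    # One ENIConfig per Availability Zone; its name must match the node label or
    # annotation value the CNI is configured to look up (commonly the AZ name).
    apiVersion: crd.k8s.amazonaws.com/v1alpha1
    kind: ENIConfig
    metadata:
      name: us-east-1a                  # placeholder AZ name
    spec:
      subnet: subnet-0123456789abcdef0  # placeholder subnet for pod ENIs in this AZ
      securityGroups:
        - sg-0123456789abcdef0          # placeholder security group applied to pod ENIs only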

Benefits

  • The nodes can now have their own security groups which do not apply to pods.
  • Also allows the pods to be placed on different subnets than the nodes, for greater control and to manage limited IP space.

Drawbacks

  • This solution cannot be used for clusters with managed node groups, as it requires changing the user data used to bootstrap EC2 instances, which managed node groups do not allow.
  • Not supported with Windows nodes.
  • Customizing the AWS VPC CNI behavior for a cluster requires a great deal of work, including tasks required when new subnets for nodes are added to the VPC or when new node groups get added to the cluster.
  • Because one ENI becomes dedicated for the node’s IP address, the maximum number of pods that can run on the node will decrease. The exact difference depends on the instance type.

Create a dedicated subnet in the VPC.

Create a subnet in the VPC that is dedicated to bastion hosts and has its own network ACL, separate from the cluster’s subnets. Add a rule to the network ACL for the cluster’s subnets that allows access to sensitive ports like ssh on the node subnets only from the bastion subnet’s CIDR block, and follow that rule with an explicit DENY rule for 0.0.0.0/0.
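
In CloudFormation terms, that pair of rules might look like the following sketch; the network ACL reference, rule numbers, and bastion CIDR block are placeholders.

    # Ingress entries on the network ACL for the node subnets.
    AllowSSHFromBastionSubnet:
      Type: AWS::EC2::NetworkAclEntry
      Properties:
        NetworkAclId: !Ref NodeSubnetNetworkAcl  # placeholder reference
        RuleNumber: 100
        Protocol: 6                              # TCP
        RuleAction: allow
        CidrBlock: 10.0.200.0/24                 # placeholder bastion subnet CIDR
        PortRange:
          From: 22
          To: 22
    DenySSHFromAnywhereElse:
      Type: AWS::EC2::NetworkAclEntry
      Properties:
        NetworkAclId: !Ref NodeSubnetNetworkAcl
        RuleNumber: 110                          # evaluated after the allow rule above
        Protocol: 6
        RuleAction: deny
        CidrBlock: 0.0.0.0/0
        PortRange:
          From: 22
          To: 22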

Benefits

  • Relatively simple to set up
  • Generally requires no regular maintenance or updates

Drawbacks

  • Applies to all interfaces on the node subnets, which may have unintended consequences if the subnets are shared with other non-EKS EC2 instances and resources.

Close down the cluster security group.

By default, the cluster security group allows all traffic between its members. Replace this blanket rule with a more restrictive set of rules. Note that you will still need to ensure that the cluster security group allows access to critical ports on the control plane, especially the cluster API endpoint.

Benefits

  • Addresses the security group issue at the security group level

Drawbacks

  • Requires time and research to craft the rules to restrict ingress traffic without breaking cluster networking
  • When using the default AWS VPC CNI configuration, these restrictions would apply to the nodes as well as the pods.
  • Cluster upgrades might overwrite these changes.

While some of the recommendations in this section require action before or at the time of cluster creation, many can still be applied later. If necessary, prioritize the following protections, and then work on the remaining tasks as time permits.

  • Cloud Infrastructure Security: Creating and maintaining best practices for your AWS account and the resources you use will always require ongoing vigilance. Always start here when tightening your cloud security.
  • Control SSH Access to Nodes: Isolating the nodes from the containers they host as much as possible is a critical protection for your cluster. In addition, limit node SSH access from other sources.