Very Good Security (VGS) uses Kubernetes, hosted on AWS, to speed up application delivery and optimize hosting costs. A common challenge is ensuring that replicas are evenly distributed across availability zones (AZs), which makes applications resilient and highly available (HA).
By default, the Kubernetes scheduler uses a bin-packing algorithm to fit as many pods as possible into a cluster. The scheduler prefers an evenly distributed overall node load over having an application's replicas precisely spread across nodes. Therefore, by default, multi-replica does not guarantee multi-AZ.
For production services, we use explicit pod anti-affinity to ensure replicas are distributed between AZs.
In this article, we use an AWS-based cluster with 3 nodes spread across 3 availability zones of the us-west-2 region.
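To verify this setup, you can show each node's zone as an extra column, using the same zone label that appears later as the topologyKey:
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone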
Example of hard AZ-based anti-affinity
spec:
  replicas: 4
  selector:
    ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: redash
        tier: backend
      name: redash
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: the pod is not scheduled at all if the rule cannot be satisfied.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redash
            # Spread by availability zone (the node's zone label).
            topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - args:
        - server
        ...
Results in
redash-1842984337-92cft 4/4 Running 2 2m 100.92.22.65 ip-10-20-35-160.us-west-2.compute.internal app=redash,pod-template-hash=1842984337,tier=backend
redash-1842984337-l3ljk 4/4 Running 2 2m 100.70.66.129 ip-10-20-80-215.us-west-2.compute.internal app=redash,pod-template-hash=1842984337,tier=backend
redash-1842984337-qlrkj 4/4 Running 0 19m 100.88.238.7 ip-10-20-115-58.us-west-2.compute.internal app=redash,pod-template-hash=1842984337,tier=backend
redash-1842984337-s93ls 0/4 Pending 0 16s <none> app=redash,pod-template-hash=1842984337,tier=backend
Here we run 4 replicas on a 3-node cluster; the 4th pod cannot be scheduled, because this is hard anti-affinity (requiredDuringSchedulingIgnoredDuringExecution) and we have only 3 AZs.
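To see why the fourth pod stays in Pending, inspect its scheduling events (the pod name comes from the output above; the exact event wording differs between Kubernetes versions):
kubectl describe pod redash-1842984337-s93ls
The Events section should contain a FailedScheduling entry reporting that no node satisfied the pod anti-affinity rule.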
This policy is less useful in real deployments, because users usually prefer capacity and uptime over strict spreading. In other words, if one of the nodes goes down, it is better to schedule its replica on the remaining nodes (temporarily ignoring anti-affinity) than to run with reduced capacity. To achieve such behavior we use soft AZ anti-affinity.
Example of a soft AZ-based anti-affinity
spec:
  replicas: 3
  selector:
    ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: redash
        tier: backend
      name: redash
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: a scoring preference, not a requirement.
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - redash
              topologyKey: failure-domain.beta.kubernetes.io/zone
            # Maximum weight, so AZ spreading outscores other scheduling factors.
            weight: 100
      containers:
      - args:
        - server
        ...
Results in
redash-3474901287-4k90r 4/4 Running 0 2m 100.92.22.66 ip-10-20-35-160.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-nhb9f 4/4 Running 0 8s 100.70.66.130 ip-10-20-80-215.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-qxq9f 4/4 Running 0 8s 100.88.238.9 ip-10-20-115-58.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
This time we provisioned 3 replicas with soft AZ anti-affinity (preferredDuringSchedulingIgnoredDuringExecution), and although the policy is only a preference, the pods still get spread across AZs. This is also due to the weight of 100, which makes anti-affinity more important to the scheduler than the node-load policy, i.e. Kubernetes prefers AZ spreading over loading the nodes equally and over other factors.
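The weight can be anything from 1 to 100; the higher it is, the more a node that satisfies the preference is favored over one that does not, and several preferred terms can be combined. As a variation that is not part of the manifest above, a second, lower-weight term keyed on the node hostname makes the scheduler also prefer separate nodes within a zone. A rough sketch, reusing the same app=redash selector (the weight of 50 is arbitrary):
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - redash
              topologyKey: failure-domain.beta.kubernetes.io/zone
            weight: 100   # primary preference: spread across AZs
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - redash
              topologyKey: kubernetes.io/hostname
            weight: 50    # secondary preference: spread across nodes within an AZ
This keeps some spreading pressure even after every AZ already hosts a replica.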
Then we switch replicas to 4, which results in:
redash-3474901287-16ksc 4/4 Running 0 7s 100.70.66.131 ip-10-20-80-215.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-4k90r 4/4 Running 0 6m 100.92.22.66 ip-10-20-35-160.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-nhb9f 4/4 Running 0 3m 100.70.66.130 ip-10-20-80-215.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-qxq9f 4/4 Running 0 3m 100.88.238.9 ip-10-20-115-58.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
We now have 4 replicas and only 3 AZs, so one of the pods gets co-located with another one in the same AZ. Scheduling does not fail, because this is a soft policy, not a hard one.
Suppose you change replicas to 9:
redash-3474901287-4k90r 4/4 Running 0 10m 100.92.22.66 ip-10-20-35-160.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-bb71n 4/4 Running 0 38s 100.70.66.133 ip-10-20-80-215.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-bncsm 4/4 Running 0 14s 100.92.22.71 ip-10-20-35-160.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-d4tmz 4/4 Running 0 14s 100.92.22.70 ip-10-20-35-160.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-gd51w 4/4 Running 0 14s 100.88.238.12 ip-10-20-115-58.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-rtzss 4/4 Running 0 14s 100.92.22.72 ip-10-20-35-160.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-vjlnr 4/4 Running 0 14s 100.70.66.134 ip-10-20-80-215.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-w1k5f 4/4 Running 0 38s 100.88.238.11 ip-10-20-115-58.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
redash-3474901287-xp98s 4/4 Running 0 14s 100.70.66.135 ip-10-20-80-215.us-west-2.compute.internal app=redash,pod-template-hash=3474901287,tier=backend
In this case, the pods are not spread evenly across the nodes:
- The 10-20-35-160 node is running 4 replicas
- The 10-20-115-58 node is running 2 replicas
- The 10-20-80-215 node is running 3 replicas
This happens because once each AZ has 1 pod with app=redash, the soft anti-affinity loses its influence: the preference is violated equally on every node, so the scheduler is guided by its other policies, e.g. an equal load split between the nodes.
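A quick way to reproduce the per-node counts above is to group the pods by node name; a one-liner along these lines (assuming the same app=redash label):
kubectl get pods -l app=redash -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c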
Soft anti-affinity is honored during down-scaling as well. For example, when you scale the replicas from 9 back down to 3, the pods to terminate are selected in such a way that the app ends up with 1 pod per AZ.
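For example, assuming the Deployment is actually named redash (a hypothetical name; substitute your own), the scale-down is just:
kubectl scale deployment redash --replicas=3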
Conclusion:
- An explicit anti-affinity policy and multiple replicas are required for production deployments.
- Soft anti-affinity is preferred over hard, unless the specifics of your project dictate otherwise.
- The number of replicas can exceed the number of AZs if your workload requires it and soft anti-affinity is used. In this case, an even AZ distribution is not guaranteed for the replicas beyond the number of AZs.
Further reading
For more information on Kubernetes pod scheduling, including the anti-affinity policies, see:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/