Kubernetes (K8s): Architecture, Pods, Deployments, and Security
Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. You describe the desired state (e.g. “run 3 replicas of this image”), and Kubernetes keeps the cluster in that state.
Why Kubernetes?
Containers (e.g. Docker) give you portable, consistent application units built from images. At scale you need orchestration: lifecycle (who restarts failed containers?), networking (how do services find each other?), resource utilization (how do we schedule and limit CPU/memory?), and portability (how do we roll out updates or move workloads across nodes?). Kubernetes addresses these by treating the cluster as a single system and keeping workloads in the desired state.
- Orchestration: Schedule and run containers across many nodes; handle restarts and placement.
- Scaling: Scale workloads up or down (manually or with autoscalers).
- Self-healing: Restart failed containers, replace unhealthy pods, reschedule when nodes fail.
- Declarative config: Define desired state in YAML; Kubernetes reconciles the actual state.
Kubernetes architecture
A Kubernetes cluster is split into two planes: the control plane (manages the cluster) and the data plane (runs your workloads).
High-level view
+------------------- CONTROL PLANE -------------------+
|  +-----------+   +-----------+   +-------------+    |
|  |   etcd    |   |           |   | Controller  |    |
|  | (storage) |   | Scheduler |   |   Manager   |    |
|  +-----+-----+   +-----+-----+   +------+------+    |
|        |               |                |           |
|        |      +--------v--------+       |           |
|        +----->| kube-apiserver  |<------+           |
|               +--------^--------+                   |
|                        |                            |
|           +------------+-------------+              |     +----------------+
|           | cloud-controller-manager |--------------|---->| cloud provider |
|           +--------------------------+              |     |      API       |
+------------------------+----------------------------+     +----------------+
                         |
                         v
+----------------- DATA PLANE (Nodes) ----------------+
|           +------------+------------+               |
|           v                         v               |
| +-------------------+     +-------------------+     |
| | Node 1            |     | Node 2            |     |
| | kubelet           |     | kubelet           |     |
| | kube-proxy        |     | kube-proxy        |     |
| | container runtime |     | container runtime |     |
| | [Pods]            |     | [Pods]            |     |
| +-------------------+     +-------------------+     |
+-----------------------------------------------------+
Control plane components
| Component | Role |
|---|---|
| API Server (kube-apiserver) | Single entrypoint for all cluster operations. Validates and processes REST requests; updates etcd. kubectl and other clients talk only to the API server. |
| etcd | Distributed key-value store holding cluster state (desired and current). Only the API server reads/writes etcd. High availability is critical for production. |
| Scheduler (kube-scheduler) | Watches for newly created pods with no assigned node; selects a node (based on resources, affinity, taints/tolerations) and assigns the pod. |
| Controller Manager (kube-controller-manager) | Runs controllers that reconcile state: Node Controller, Deployment Controller, ReplicaSet Controller, etc. They watch the API and drive the cluster toward the desired state. |
| Cloud Controller Manager | Optional; ties the cluster to cloud provider APIs (load balancers, nodes, routes). Used on AKS, EKS, GKE. |
Data plane (worker nodes)
| Component | Role |
|---|---|
| kubelet | Agent on each node. Registers the node with the API server; ensures containers in pods are running (pulls images, starts/stops containers, reports status). |
| kube-proxy | Network proxy on each node. Implements Service abstraction: maintains iptables or IPVS rules so traffic to a Service IP/port is forwarded to backend pods. |
| Container runtime | Software that runs containers (containerd, CRI-O, etc.). kubelet talks to it via the Container Runtime Interface (CRI). |
Request flow (example: create a Deployment)
- You run kubectl apply -f deployment.yaml → kubectl sends the manifest to the API Server.
- API Server validates the Deployment and stores it in etcd.
- Deployment controller (in Controller Manager) sees the new Deployment and creates a ReplicaSet; the ReplicaSet controller then creates Pod objects (no node yet).
- Scheduler sees Pods with no nodeName, selects nodes, and binds each Pod to the chosen node (written to etcd via the API Server).
- kubelet on each assigned node sees new Pods, pulls images via the container runtime, and starts containers.
- kubelet reports Pod status back to the API Server; controllers and users see the cluster state.
Core concepts
| Term | Meaning |
|---|---|
| Pod | Smallest deployable unit: one or more containers that share storage and network. |
| Deployment | Declarative way to manage a set of identical pods (replicas, rolling updates). |
| Service | Stable network endpoint to reach pods (cluster IP, NodePort, or LoadBalancer). |
| Namespace | Virtual cluster for grouping and isolating resources (e.g. dev, prod). |
| Node | A worker machine (VM or physical) that runs pods. |
Kubernetes resources overview
The following table summarizes the main Kubernetes resources (as in Kubernetes in Action, Lukša). Cluster-level resources are not namespaced; others live in a namespace.
| Resource (abbr.) | API version | Description |
|---|---|---|
| Namespace (ns) | v1 | Organizes resources into non-overlapping groups (e.g. per tenant, env). |
| Pod (po) | v1 | Basic deployable unit: one or more co-located containers sharing network and storage. |
| ReplicaSet (rs) | apps/v1 | Keeps a set of pod replicas running; used by Deployment. |
| ReplicationController (rc) | v1 | Older, less capable way to keep pod replicas; prefer ReplicaSet. |
| Deployment (deploy) | apps/v1 | Declarative deployment and rolling updates of pods via ReplicaSet. |
| StatefulSet (sts) | apps/v1 | Manages stateful pods with stable identity and ordered deployment. |
| DaemonSet (ds) | apps/v1 | Runs one pod replica per node (all nodes or those matching a selector). |
| Job | batch/v1 | Runs pods until a completable task succeeds (one or more pods). |
| CronJob | batch/v1 | Runs a Job on a schedule (cron expression). |
| Service (svc) | v1 | Exposes one or more pods at a stable IP and port (ClusterIP, NodePort, LoadBalancer). |
| Endpoints (ep) | v1 | Lists the pod IPs that back a Service (usually auto-managed). |
| Ingress (ing) | networking.k8s.io/v1 | Exposes services to the outside via HTTP(S) host/path routing. |
| ConfigMap (cm) | v1 | Key-value config for apps (non-sensitive); mount as files or env. |
| Secret | v1 | Sensitive data (passwords, tokens); base64, use encryption at rest. |
| PersistentVolume (pv) | v1 | Cluster-level piece of storage; bound by a PersistentVolumeClaim. |
| PersistentVolumeClaim (pvc) | v1 | Request for storage; bound to a PersistentVolume or dynamic provisioner. |
| StorageClass (sc) | storage.k8s.io/v1 | Defines a class of storage for dynamic provisioning of PVCs. |
Pods in more detail
- Lifecycle phases: Pending → Running (or Succeeded/Failed for one-off pods). A pod is Pending until scheduled and until at least one container has started.
- Init containers: Run to completion before the main containers start. Use them for setup (e.g. migrate DB, wait for a dependency). They run in order; if one fails, the kubelet retries it until it succeeds, unless restartPolicy is Never, in which case the pod is marked Failed.
- Multiple containers in a pod: Share the same network namespace (localhost) and can share volumes. Typical pattern: main app + sidecar (e.g. log shipper, proxy). If a container exits, the kubelet restarts only that container, not the whole pod (with restartPolicy OnFailure or Always).
spec:
  initContainers:
    - name: init-db
      image: busybox
      command: ['sh', '-c', 'until nslookup db; do sleep 2; done']
  containers:
    - name: app
      image: my-app:latest
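The main app + sidecar pattern mentioned above can be sketched as a pod whose two containers share an emptyDir volume. This is only an illustration: the pod name, the /var/log/app path, and the assumption that the app writes app.log there are hypothetical, not taken from a real application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar          # hypothetical name
spec:
  containers:
    - name: app
      image: my-app:latest        # assumed to write logs to /var/log/app/app.log
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-shipper           # sidecar: follows the file the app writes
      image: busybox
      command: ['sh', '-c', 'tail -F /var/log/app/app.log']
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true
  volumes:
    - name: logs
      emptyDir: {}                # shared scratch volume, removed with the pod
```

Both containers also share localhost, so a proxy sidecar could reach the app on its container port without a Service.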
Workload resources: ReplicaSet, Job, DaemonSet, StatefulSet
| Resource | Use case |
|---|---|
| ReplicaSet | Keep N identical pod replicas; use via Deployment, not alone. |
| Job | Run a batch task until success (e.g. backup, migration). Key fields: completions, parallelism, backoffLimit. |
| CronJob | Run a Job on a schedule (e.g. "0 * * * *" every hour). |
| DaemonSet | One pod per node (e.g. node exporter, log collector, CNI). |
| StatefulSet | Stateful apps with stable identity: stable pod name and storage, ordered create/delete. |
Deployment strategies and workload examples
This section covers how you roll out changes (deployment strategies) and all workload types in Kubernetes with concrete examples.
Deployment strategies (Deployment resource)
A Deployment manages a ReplicaSet and drives rolling updates (or recreate) when you change the pod template. You control the strategy and pace.
| Strategy | Behavior | When to use |
|---|---|---|
| RollingUpdate (default) | Gradually replace old pods with new ones. You can set maxSurge (extra pods allowed above desired count) and maxUnavailable (pods that can be missing). | Default for most apps; minimal downtime, continuous availability. |
| Recreate | Terminate all existing pods, then create new ones. No overlap. | When the app cannot run two versions at once (e.g. schema migration, singleton). |
| Blue-green | Not built-in. You run two Deployments (e.g. my-app-v1, my-app-v2) and switch the Service selector from one to the other. | Manual or scripted; instant cutover, easy rollback. |
| Canary | Not built-in. You run a small set of new-version pods and route a fraction of traffic (e.g. via Ingress weights or two Services). | Gradual exposure; use service mesh or Ingress for traffic split. |
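A blue-green cutover of the kind described in the table can be sketched with a version label and a Service selector that names exactly one version. The label key version and its values v1/v2 are assumptions for illustration; the two Deployments (my-app-v1, my-app-v2) are assumed to put the matching label on their pod templates.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
spec:
  selector:
    app: my-app
    version: v1          # change to v2 to cut over; change back to roll back
  ports:
    - port: 80
      targetPort: 3000
```

Re-applying this manifest with version: v2 moves all traffic at once, which is why the old Deployment is usually kept running until the new version has been verified.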
Example: Deployment with RollingUpdate and Recreate
# Rolling update: at most 1 extra pod, at most 0 unavailable (surge only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-registry.io/my-app:v2
          ports:
            - containerPort: 3000
---
# Recreate: all pods terminated before new ones start
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-migrator
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: migrator
  template:
    metadata:
      labels:
        app: migrator
    spec:
      containers:
        - name: migrator
          image: my-registry.io/db-migrate:latest
          command: ["/run-migrations.sh"]
Rollout commands:
# Watch rollout status
kubectl rollout status deployment/my-app
# Pause/resume a rollout
kubectl rollout pause deployment/my-app
kubectl rollout resume deployment/my-app
# Rollback to previous revision
kubectl rollout undo deployment/my-app
# List revisions, then roll back to a specific one
kubectl rollout history deployment/my-app
kubectl rollout undo deployment/my-app --to-revision=2
ReplicaSet
A ReplicaSet keeps a fixed number of pod replicas running. You usually don’t create it directly; the Deployment controller creates and updates ReplicaSets. If you manage a ReplicaSet yourself, scaling and updates are manual (no rollout history or rollback). Use a ReplicaSet directly only for special cases (e.g. temporary scaling that you don’t want the Deployment to own).
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-registry.io/my-app:latest
          ports:
            - containerPort: 3000
DaemonSet
A DaemonSet ensures that every node (or every node matching a selector) runs exactly one pod of the template. Use it for node-level agents: logging (Fluentd, Fluent Bit), monitoring (node-exporter), storage (CSI node plugin), or networking (CNI). New nodes get the pod automatically when they join the cluster.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          ports:
            - containerPort: 9100
          resources:
            requests:
              cpu: "50m"
              memory: "64Mi"
            limits:
              memory: "128Mi"
Run only on a subset of nodes: use nodeSelector, affinity, or taints/tolerations so the DaemonSet’s pods only schedule on nodes that match (e.g. nodes with a label node-role=logging).
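Restricting a DaemonSet to a subset of nodes might look like the following pod-template fragment. It is a sketch under stated assumptions: the node-role=logging label comes from the text above, while the dedicated=logging taint, the container name, and the image are hypothetical placeholders.

```yaml
# Fragment of a DaemonSet: this is spec.template.spec
spec:
  nodeSelector:
    node-role: logging            # schedule only on nodes carrying this label
  tolerations:
    - key: dedicated              # hypothetical taint on the logging nodes
      operator: Equal
      value: logging
      effect: NoSchedule
  containers:
    - name: log-collector
      image: fluent/fluent-bit:latest   # placeholder image and tag
```

The nodeSelector keeps the pods off other nodes, while the toleration lets them land on nodes that are tainted to repel ordinary workloads.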
StatefulSet
A StatefulSet is for stateful workloads that need stable identity and stable storage: pods get stable names (<statefulset-name>-0, -1, …), created and deleted in order, and each pod can have its own PersistentVolumeClaim (via volumeClaimTemplates). You need a headless Service (no cluster IP) so DNS can return individual pod hostnames.
apiVersion: v1
kind: Service
metadata:
  name: redis-headless
spec:
  clusterIP: None
  selector:
    app: redis
  ports:
    - port: 6379
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis-headless
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
Pods are named redis-0, redis-1, redis-2 and are reachable at redis-0.redis-headless.<namespace>.svc.cluster.local, etc. Each pod’s storage is bound to its own PVC and survives pod restarts and rescheduling.
Job
A Job runs one or more pods until a finite number of completions succeed. Use it for batch work: migrations, backups, report generation. You can set parallelism (how many pods run at once), completions (how many successful completions are required), and backoffLimit (retries before the Job is marked failed).
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrate
          image: my-registry.io/migrate:latest
          command: ["/bin/sh", "-c", "npm run migrate"]
Parallel Job (e.g. process N items):
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-process
spec:
  completions: 10
  parallelism: 3
  backoffLimit: 2
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: my-registry.io/batch-worker:latest
Up to 3 pods run at a time until 10 completions succeed.
CronJob
A CronJob creates Jobs on a schedule (cron expression). Use it for periodic tasks: backups, cleanup, report generation.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *" # 02:00 every day (controller time zone, typically UTC, unless spec.timeZone is set)
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: my-registry.io/backup:latest
              command: ["/backup.sh"]
- schedule: Standard cron (minute hour day month weekday).
- concurrencyPolicy: Allow (default), Forbid (skip if previous run still active), or Replace (cancel previous and start new).
- successfulJobsHistoryLimit / failedJobsHistoryLimit: How many finished Jobs to keep for visibility.
Summary: when to use which workload
| Workload | Use when |
|---|---|
| Deployment | Stateless app; you want rolling updates, rollback, and replica count. |
| ReplicaSet | Rarely directly; prefer Deployment. |
| DaemonSet | One pod per node for logging, monitoring, CNI, or storage agent. |
| StatefulSet | Stateful app with stable name and storage (DB, queue, etc.). |
| Job | One-off or batch task that must run to completion. |
| CronJob | Same as Job but on a schedule. |
Services and networking
- ClusterIP: Default. A virtual IP inside the cluster; pods reach the service by name (DNS: <svc>.<ns>.svc.cluster.local). Endpoints are created automatically from the service selector and list the backing pod IPs.
- NodePort: Exposes the service on each node’s IP at a static port (30000–32767). Good for dev or when you don’t have a load balancer.
- LoadBalancer: Provisions an external load balancer (cloud or on-prem). Often used with Ingress for HTTP(S).
- Headless service: clusterIP: None. No cluster IP; DNS returns all pod IPs. Used for StatefulSet or when clients need to talk to specific pods.
- ExternalName: Maps the service to a CNAME in DNS (e.g. an external API hostname); no cluster IP or proxy, useful for referencing out-of-cluster services by a stable name.
kube-proxy runs on every node and implements the Service abstraction. It watches the API for Service and Endpoints changes and updates iptables (or IPVS) rules so traffic to the service virtual IP is forwarded to a backing pod. In iptables mode (the default on Linux), rules point directly at pod IPs, so traffic goes from client → node → iptables → pod without an extra userspace hop. In the legacy userspace mode, which has since been removed from kube-proxy, the proxy process itself received and forwarded the traffic; this was slower, which is why iptables mode became the default. You can set sessionAffinity: ClientIP on a Service so the same client IP is sent to the same pod when possible.
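Two of the Service options described above can be sketched as manifests: client-IP session affinity and an ExternalName alias. The names my-app-sticky and billing-api and the hostname api.billing.example.com are hypothetical; the app: my-app selector and port 3000 follow the other examples in this article.

```yaml
# ClusterIP Service that pins each client IP to one backing pod
apiVersion: v1
kind: Service
metadata:
  name: my-app-sticky
spec:
  selector:
    app: my-app
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800       # 3 hours, which is also the default
  ports:
    - port: 80
      targetPort: 3000
---
# ExternalName Service: DNS CNAME to an out-of-cluster host, no proxying
apiVersion: v1
kind: Service
metadata:
  name: billing-api
spec:
  type: ExternalName
  externalName: api.billing.example.com
```

In-cluster clients would then resolve billing-api like any other Service name, while the traffic itself goes straight to the external host.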
Ingress exposes HTTP(S) routes to services. An Ingress controller (e.g. NGINX, Traefik) watches Ingress resources and configures the load balancer. One Ingress can route multiple hosts/paths to different ClusterIP services.
How Kubernetes DNS works for pods and services
Kubernetes ships with a cluster DNS service (CoreDNS or the older kube-dns) that watches the API server for Service and Endpoints objects and creates DNS records for them.
- Service names → virtual IPs
  - Every Service gets an A record like <svc>.<namespace>.svc.cluster.local.
  - Pods automatically get search paths such as <namespace>.svc.cluster.local, svc.cluster.local, and cluster.local.
  - This is why from a pod in namespace app you can usually reach backend just by using the short name backend (it expands via search paths).
- Headless Services and pod DNS
  - For a headless Service (clusterIP: None), DNS does not return a virtual IP; instead it returns one A record per pod IP.
  - With StatefulSets, each pod also gets a stable DNS name like web-0.web.app.svc.cluster.local, where web-0 is the pod name, web is the Service name, and app is the namespace.
  - Clients can either use the Service name (round‑robin across pods) or the individual pod hostnames for peer‑to‑peer protocols.
- Pods themselves
  - Regular pods do not usually have standalone DNS records that apps rely on; instead, you go through the Service.
  - The pod’s /etc/resolv.conf is configured to use the cluster DNS and the search domains mentioned above.
In short: kube-proxy makes Service IPs forward traffic to pods, while CoreDNS makes Service and pod names resolve to those IPs. Together they give you stable names (my-svc.my-namespace.svc.cluster.local) on top of dynamic pod IPs.
CNI and pod networking
Kubernetes expects every pod to get a routable IP so that pods can talk to each other and kube-proxy can forward service traffic to pod IPs. The Container Network Interface (CNI) is the standard that cluster installs use: a plugin runs when a pod is created/deleted and configures the pod’s network (e.g. a veth pair and bridge, or overlay). Common CNI plugins include Flannel (simple overlay, good for getting started), Calico (BGP or overlay, network policy and performance), Weave (overlay with encryption option), and Canal (Flannel for networking + Calico for network policy). Choose based on your need for encryption, network policy, and multi-node routing (e.g. BGP in bare metal).
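Example Ingress (see the Ingress description under Services and networking) routing app.example.com to the my-app-svc Service:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-svc
                port:
                  number: 80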
ConfigMap and Secret (configuration)
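- ConfigMap: Store non-sensitive config (URLs, feature flags, config files). Mount as a volume or inject as environment variables. Updates to a mounted ConfigMap volume reach the pod after the kubelet’s next sync (not instantly, and not for subPath mounts); environment variables only change when the pod is restarted.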
- Secret: Same idea for sensitive data (passwords, TLS certs). Stored base64; enable encryption at rest for the API server in production. Mount as a volume or env; prefer projected volumes or external secret operators for rotation.
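Both ways of consuming a ConfigMap can be sketched in one pod: a single key injected as an environment variable and the whole ConfigMap mounted as files. The names app-config and config-demo, the keys, and the values are illustrative assumptions.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  API_URL: "https://api.example.com"
  app.properties: |
    feature.newCheckout=true
---
apiVersion: v1
kind: Pod
metadata:
  name: config-demo
spec:
  containers:
    - name: app
      image: my-app:latest
      env:
        - name: API_URL                 # one key as an environment variable
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: API_URL
      volumeMounts:
        - name: config                  # every key becomes a file under /etc/app
          mountPath: /etc/app
          readOnly: true
  volumes:
    - name: config
      configMap:
        name: app-config
```

A Secret is consumed the same way, using secretKeyRef for environment variables and a secret volume for files.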
Namespace resource quotas and limits (multitenancy)
To share a cluster across teams or environments, use namespaces and cap resource usage per namespace so one tenant cannot starve others. ResourceQuota limits total usage in a namespace (e.g. total CPU, memory, number of pods, PVCs). LimitRange sets default and max requests/limits for containers in that namespace so every pod gets bounds even if the manifest omits them. Together they support safe multitenancy and prevent runaway workloads.
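A per-namespace cap of the kind described above might look like the following pair of objects. The namespace team-a and all numbers are assumptions chosen for illustration, not recommendations.

```yaml
# Caps the total resources all pods in the namespace may request or use
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    persistentvolumeclaims: "10"
---
# Fills in defaults and enforces a per-container maximum
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:           # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                  # applied when a container omits limits
        cpu: 500m
        memory: 256Mi
      max:
        cpu: "1"
        memory: 1Gi
```

Once a quota covers CPU or memory, pods that specify no requests or limits for those resources are rejected, which is why the LimitRange defaults are usually deployed alongside it.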
Volumes and persistent storage
- emptyDir: Temporary directory per pod; deleted when the pod is removed. Good for scratch space or sharing data between containers in a pod.
- PersistentVolumeClaim (PVC): Request storage (size, StorageClass). The cluster binds it to a PersistentVolume (PV) or triggers dynamic provisioning. Pods mount the PVC; data survives pod restarts.
- StorageClass: Defines a provisioner and parameters (e.g. cloud disk type). When you create a PVC that references a StorageClass, the provisioner creates the backing volume and binds the PVC.
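The volume types above can be combined in one pod, sketched here with a PVC for durable data and an emptyDir for scratch space. The StorageClass name standard, the object names, and the mount paths are assumptions; available StorageClasses differ per cluster (kubectl get storageclass lists them).

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard      # assumed StorageClass; omit to use the cluster default
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: storage-demo
spec:
  containers:
    - name: app
      image: my-app:latest
      volumeMounts:
        - name: data              # survives pod restarts and rescheduling
          mountPath: /var/lib/app
        - name: scratch           # deleted together with the pod
          mountPath: /tmp/work
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data
    - name: scratch
      emptyDir: {}
```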
Minimal Deployment example
Save as app-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-registry.io/my-app:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "128Mi"
              cpu: "200m"
Create the deployment:
kubectl apply -f app-deployment.yaml
Exposing the app with a Service
Save as app-service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP # or NodePort / LoadBalancer
Create the Service:
kubectl apply -f app-service.yaml
Essential kubectl commands
# List pods (default namespace)
kubectl get pods
# List pods in a namespace
kubectl get pods -n production
# Describe a pod (events, state, details)
kubectl describe pod <pod-name>
# View logs from a pod
kubectl logs <pod-name>
# Follow logs (like tail -f)
kubectl logs -f <pod-name>
# Execute a command in a pod
kubectl exec -it <pod-name> -- sh
# List deployments
kubectl get deployments
# Scale a deployment
kubectl scale deployment my-app --replicas=5
# Delete a deployment and its pods
kubectl delete deployment my-app
Pod lifecycle and restarts
- Kubernetes keeps the number of replicas you specified; if a pod exits or fails, it is replaced.
- livenessProbe and readinessProbe tell Kubernetes when to restart a container and when to send traffic to the pod:
containers:
  - name: app
    image: my-app:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 3000
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 3000
      initialDelaySeconds: 2
      periodSeconds: 5
Namespaces
# List namespaces
kubectl get namespaces
# Create a namespace
kubectl create namespace staging
# Run a one-off pod in a namespace
kubectl run debug --image=busybox -n staging -- sleep 3600
Security for Kubernetes
Securing a Kubernetes cluster involves the control plane, the nodes, the network, and the workloads. Below are the main areas and practices.
1. RBAC (Role-Based Access Control)
RBAC controls who can do what in the cluster. You define permissions (Roles or ClusterRoles) and bind them to subjects (users, groups, or ServiceAccounts) via RoleBinding or ClusterRoleBinding. The API server evaluates RBAC in the authorization step: after authentication (identity established) and before admission control (the request is mutated and validated).
RBAC building blocks
| Resource | Scope | Purpose |
|---|---|---|
| Role | One namespace | Set of rules (apiGroups, resources, verbs) in that namespace only. |
| ClusterRole | Cluster-wide (or reusable) | Same as Role but not tied to a namespace; can reference cluster-scoped resources (nodes, PVs) or be bound in many namespaces. |
| RoleBinding | One namespace | Binds a Role or ClusterRole to subjects; grants permissions in that namespace. |
| ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects; grants permissions across the whole cluster. |
Rule structure: Each rule has apiGroups (e.g. "" for core, apps, rbac.authorization.k8s.io), resources (e.g. pods, deployments, secrets), and verbs (e.g. get, list, watch, create, update, patch, delete). You can restrict a rule with resourceNames so the subject can only get/patch/delete the named resources. Subresources are listed in resources with a slash (e.g. pods/log, pods/status).
Common verbs: get (single resource by name), list / watch (read or stream a collection), create, update, patch, delete, deletecollection. For non-resource URLs (e.g. /healthz), use nonResourceURLs together with verbs; such rules are only valid in a ClusterRole.
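The rule fields above, including a subresource and a non-resource URL, can be sketched in a single ClusterRole. The role name and the particular combination of rules are hypothetical and only meant to show the syntax.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: scaler-and-health-checker     # hypothetical name
rules:
  # Resource rule with a subresource: read Deployments, change only their scale
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments/scale"]
    verbs: ["get", "update", "patch"]
  # Non-resource rule: allowed in ClusterRoles only
  - nonResourceURLs: ["/healthz", "/healthz/*"]
    verbs: ["get"]
```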
Subjects: User, Group, ServiceAccount
Bindings reference subjects:
- User (kind: User, name: alice): External identity (e.g. from OIDC, client cert). No User object in the cluster; the API server infers the user from the request.
- Group (kind: Group, name: dev-team): Often used with OIDC or LDAP so you grant access to a group once.
- ServiceAccount (kind: ServiceAccount, name: my-app, namespace: production): Identity for pods; the pod’s token is used to call the API. Most common for in-cluster workloads.
Example: Role and RoleBinding (namespaced)
# Role: read pods and pod logs in namespace production
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# RoleBinding: grant pod-reader to ServiceAccount ci-bot in production
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ci-bot
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Example: ClusterRole and RoleBinding (reuse in many namespaces)
A ClusterRole can be bound in multiple namespaces via separate RoleBindings, so you define “read pods in any namespace” once and bind it per namespace.
# ClusterRole: list and watch pods in any namespace (no namespace in metadata)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reader-cluster
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# RoleBinding in production: grant pod-reader-cluster only in production
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-production
  namespace: production
subjects:
  - kind: ServiceAccount
    name: monitor
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: pod-reader-cluster
  apiGroup: rbac.authorization.k8s.io
Example: ClusterRole and ClusterRoleBinding (cluster-wide)
Use for cluster-wide read-only or admin. Avoid granting cluster-admin (full access) in production; prefer narrow ClusterRoles.
# ClusterRole: read nodes and namespaces (cluster-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-viewer
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding: grant to all authenticated users in group "viewers"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-viewers
subjects:
  - kind: Group
    name: viewers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-viewer
  apiGroup: rbac.authorization.k8s.io
Example: Role with resourceNames (narrow scope)
Limit access to specific secrets by name:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: secret-reader-app-db
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["app-db-credentials"]
    verbs: ["get"]
Checking permissions
# Can the current user create pods in production?
kubectl auth can-i create pods -n production
# Can ServiceAccount monitoring/grafana list pods in production?
kubectl auth can-i list pods -n production --as=system:serviceaccount:monitoring:grafana
# Describe which roles/bindings apply (optional: use kubectl describe or yaml)
kubectl get role,rolebinding -n production
kubectl get clusterrole,clusterrolebinding
Best practices
- Least privilege: Grant only the verbs and resources needed; use resourceNames when appropriate.
- Prefer namespaced Role + RoleBinding so one namespace compromise does not affect others.
- Use ClusterRole + RoleBinding (not ClusterRoleBinding) when you want the same role in several namespaces without cluster-wide access.
- Use dedicated ServiceAccounts per app and bind minimal Roles to them; avoid using the default ServiceAccount for workloads.
2. Service accounts
A ServiceAccount is an identity for processes running inside the cluster (typically pods). When a pod calls the Kubernetes API, it authenticates using its ServiceAccount token. The API server then applies RBAC to determine what that identity can do.
Default and custom ServiceAccounts
- default: Every namespace has a default ServiceAccount. Pods that do not set spec.serviceAccountName use default in that namespace. Avoid using it for application pods; create a dedicated ServiceAccount per app and grant only the RBAC it needs.
- Creating a ServiceAccount: No role is attached by default; you must create a Role or ClusterRole and a RoleBinding or ClusterRoleBinding that references the ServiceAccount.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
kubectl create serviceaccount my-app -n production
Assigning a ServiceAccount to a pod
Set spec.serviceAccountName (the older spec.serviceAccount field is a deprecated alias; prefer the former). The pod can then use the token mounted by Kubernetes to talk to the API server.
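apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  # selector and template labels are required for a valid Deployment
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: my-app
      containers:
        - name: app
          image: my-registry.io/my-app:latest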
The ServiceAccount must exist in the same namespace as the pod. If it does not, the API server rejects the pod at admission, so it is never created.
Token: legacy mount vs projected (recommended)
- Legacy token: Kubernetes used to auto-mount a long-lived, Secret-based token for the pod’s ServiceAccount at /var/run/secrets/kubernetes.io/serviceaccount/token. Such a token has no expiry and is a risk if the pod or node is compromised. Prefer not to rely on it for new workloads.
- Token volume projection: You can mount a short-lived, audience-bound token using a projected volume. This is the recommended way for pods that need to call the API. You set an expiration (e.g. 1 hour) and an audience (e.g. api or your OIDC audience). The kubelet refreshes the token before it expires.
spec:
  serviceAccountName: my-app
  containers:
    - name: app
      image: my-registry.io/my-app:latest
      volumeMounts:
        - name: token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600
              audience: api
If you use projected tokens, ensure the ServiceAccount has the right RBAC (Role + RoleBinding or ClusterRole + RoleBinding) so the pod can perform only the API actions it needs.
Disabling automatic token mount
You can set automountServiceAccountToken: false on the pod so no ServiceAccount token is mounted. Use this for pods that never call the API, to reduce exposure.
spec:
  serviceAccountName: my-app
  automountServiceAccountToken: false
  containers:
    - name: app
      image: my-registry.io/my-app:latest
End-to-end: ServiceAccount + RBAC + pod
- Create a ServiceAccount (e.g. my-app in production).
- Create a Role (or ClusterRole) with the minimum required verbs and resources.
- Create a RoleBinding (or ClusterRoleBinding) that binds that Role to the ServiceAccount my-app in production.
- Create the Deployment (or Pod) with serviceAccountName: my-app and, if the app needs to call the API, a projected token volume with expiration and audience.
Then the pod authenticates as system:serviceaccount:production:my-app and RBAC allows only the permissions you granted.
3. Secrets management
- Kubernetes Secrets store sensitive data (passwords, tokens, TLS certs) as base64; they are not encrypted at rest by default. Enable encryption at rest for the API server (e.g. with a KMS provider) in production.
- Avoid putting secrets in plain YAML in Git. Use external secret managers (e.g. HashiCorp Vault, AWS Secrets Manager, Azure Key Vault) with operators (e.g. External Secrets Operator) to sync into Kubernetes Secrets.
- Prefer projected volumes or CSI secret stores so pods get only the secrets they need.
# Mount a secret as a file in a pod
spec:
  containers:
    - name: app
      volumeMounts:
        - name: db-secret
          mountPath: /etc/secrets
          readOnly: true
  volumes:
    - name: db-secret
      secret:
        secretName: db-credentials
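For the encryption-at-rest recommendation, a self-managed API server is usually pointed at an EncryptionConfiguration file through the --encryption-provider-config flag. The sketch below uses the local aescbc provider instead of a KMS provider, only to keep the example short; the key name key1 and the placeholder secret are assumptions, and managed services typically expose this as a cluster setting rather than a file.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # First provider is used to encrypt new writes
      - aescbc:
          keys:
            - name: key1                              # hypothetical key name
              secret: <base64-encoded 32-byte key>    # placeholder, generate your own
      # Fallback so Secrets written before encryption was enabled stay readable
      - identity: {}
```

Existing Secrets are only encrypted when they are next written, so they need to be rewritten once after the configuration is enabled.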
4. Network policies
By default, every pod in a cluster can talk to every other pod, across namespaces, because no traffic is filtered until a NetworkPolicy selects the pod. NetworkPolicy restricts ingress/egress traffic (e.g. only allow frontend → backend, block cross-namespace traffic).
# Allow only pods with label role=frontend to reach pods with label app=api on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 8080
- Enforcing NetworkPolicy requires a CNI plugin that supports it (e.g. Calico, Cilium).
- Start with deny-by-default or explicit allow lists for critical namespaces.
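A deny-by-default baseline for a namespace can be sketched as a policy that selects every pod and lists no allowed traffic. The policy name default-deny-all is an assumption; the production namespace follows the examples above.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}        # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so all traffic is denied
```

With this in place, allow rules such as api-allow-frontend open only the specific paths you need. Denying egress also blocks DNS lookups, so an explicit egress rule for the cluster DNS service is usually required alongside it.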
5. Pod security (security context and Pod Security Standards)
- Security context on pods/containers: run as a non-root user (runAsNonRoot, runAsUser), drop capabilities (securityContext.capabilities.drop: ["ALL"]), and use a read-only root filesystem where possible.
- Pod Security Standards (PSS): Privileged, Baseline, Restricted. Enforce via Pod Security Admission (labels on namespaces) or a policy engine (e.g. OPA Gatekeeper, Kyverno).
# Example: restricted-style pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        readOnlyRootFilesystem: true
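Enforcing a Pod Security Standard through Pod Security Admission is done with namespace labels. A minimal sketch, assuming the production namespace used elsewhere in this article and the Restricted level:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the Restricted standard
    pod-security.kubernetes.io/enforce: restricted
    # Also record violations in audit logs and warn the client
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

A common rollout path is to start with only the warn and audit labels, fix the workloads that trigger warnings, and add enforce afterwards.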
6. Image security
- Use private or trusted registries; avoid the latest tag in production.
- Scan images (e.g. Trivy, Snyk) in CI and at admission (e.g. Trivy admission controller, Gatekeeper) to block vulnerable images.
- Image signing and verification: use Cosign and policy-controller (or similar) so only signed images are allowed.
7. Control plane and node hardening
- API server: Restrict access (firewall, private endpoints); enable audit logging; use admission controllers (e.g. PodSecurity, validating webhooks) to enforce policies.
- etcd: Encrypt at rest; restrict network access to API server only.
- Nodes: Keep OS and kubelet/runtime updated; use node hardening (CIS benchmarks); consider read-only root filesystem and minimal images for the host where possible.
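- kubelet: Disable anonymous authentication (--anonymous-auth=false, or authentication.anonymous.enabled: false in the kubelet configuration file); use NodeRestriction admission to limit what kubelets can do.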
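As an illustration of the kubelet item, the relevant settings in a kubelet configuration file could look like the sketch below. Delegating authentication and authorization to the API server (webhook mode) and closing the read-only port are additions commonly paired with disabling anonymous access; treat them as assumptions to verify against your distribution's defaults.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false     # reject unauthenticated requests to the kubelet API
  webhook:
    enabled: true      # validate bearer tokens against the API server
authorization:
  mode: Webhook        # ask the API server whether a request is allowed
readOnlyPort: 0        # disable the unauthenticated read-only port
```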
8. Admission control
Admission controllers run after authentication and authorization; they can mutate or validate requests before the object is stored in etcd. Use them to enforce security and governance.
- Built-in: PodSecurity (enforces Pod Security Standards), NodeRestriction (limits what kubelets can modify), ResourceQuota, LimitRanger (default/limit resources per namespace or pod).
- Validating / mutating webhooks: Your own services receive AdmissionReview requests and allow or deny (and optionally patch) the object. Use for custom rules (e.g. “all pods must have a sidecar”, “no hostPath”).
- Policy engines: Open Policy Agent (OPA) Gatekeeper or Kyverno run as admission webhooks and enforce policies defined in CRDs (e.g. “only images from registry X”, “every Namespace must have a label”).
# Example: LimitRanger sets default requests/limits in a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        memory: "256Mi"
        cpu: "200m"
      defaultRequest:
        memory: "64Mi"
        cpu: "50m"
      type: Container
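The ResourceQuota admission controller listed among the built-ins works at the namespace level and pairs naturally with a LimitRange. A sketch with illustrative numbers; the name compute-quota and every value here are assumptions:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "4"        # sum of CPU requests across the namespace
    requests.memory: 8Gi
    limits.cpu: "8"          # sum of CPU limits across the namespace
    limits.memory: 16Gi
    pods: "50"               # maximum number of pods
```

Once a quota covers CPU or memory, pods that omit those requests or limits are rejected, which is one reason to define LimitRange defaults in the same namespace.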
9. Audit logging and compliance
- Audit logging: The API server can log every request (metadata or full body) to a file or backend. Enable audit policy and store logs in a secure, append-only store. Use for compliance (who did what, when) and incident response.
- Compliance: Map controls to frameworks (e.g. CIS Kubernetes Benchmark, SOC 2, PCI-DSS). Use tools such as kube-bench to check the cluster against the CIS Benchmark, and ship audit logs to a searchable backend (e.g. OpenSearch) for reporting.
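An audit policy is a file passed to the API server with --audit-policy-file (with --audit-log-path selecting the log destination). The rule set below is only a sketch of one reasonable starting point; which resources get which level is an assumption to adapt to your compliance needs.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the early stage to avoid logging each request twice
omitStages:
  - "RequestReceived"
rules:
  # Never log Secret or ConfigMap bodies; metadata only
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Log full request and response for RBAC changes
  - level: RequestResponse
    resources:
      - group: "rbac.authorization.k8s.io"
  # Everything else: who did what, when
  - level: Metadata
```

Rules are evaluated in order and the first match wins, so the Secret rule must come before any broader rule that would log request bodies.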
10. Summary: security checklist
| Area | Practices |
|---|---|
| Access | RBAC with least privilege; avoid cluster-admin; use dedicated ServiceAccounts and projected tokens. |
| Secrets | Encryption at rest; external secret manager; minimal exposure to pods. |
| Network | NetworkPolicy for segmentation; restrict egress where possible. |
| Workloads | Non-root, drop capabilities, read-only root; enforce PSS (Baseline/Restricted). |
| Images | Scan in CI and at admission; sign and verify images. |
| Admission | PodSecurity, LimitRanger, webhooks or Gatekeeper/Kyverno for custom policy. |
| Cluster | Harden API server, etcd, and nodes; enable audit logging; follow CIS benchmarks. |
Enhancements and operational best practices
Beyond basic Deployments and Services, you can improve reliability, performance, and observability with the following.
Resource requests and limits
Always set requests and limits for CPU and memory so the scheduler can place pods correctly and the node can enforce limits. Without them, pods can overcommit and cause noisy-neighbour or OOM issues.
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
- Requests: Minimum guaranteed; scheduler uses them for placement.
- Limits: Hard cap; exceeding memory leads to OOM kill; exceeding CPU leads to throttling.
- Use LimitRanger or namespace defaults so every container gets requests/limits even if the manifest omits them.
Horizontal and vertical scaling
- Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas based on CPU, memory, or custom/external metrics (e.g. from Prometheus). Keeps latency and utilization in range under load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
- Vertical Pod Autoscaler (VPA): Recommends or updates CPU/memory requests and limits based on actual usage. Useful for right-sizing over time.
Pod Disruption Budgets (PDB)
A PodDisruptionBudget limits how many pods of a given selector can be down at once during voluntary disruptions (node drain, cluster upgrade). That way kubectl drain and the cluster autoscaler can evict pods without dropping below your desired availability.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
Use minAvailable or maxUnavailable; combine with multiple replicas so at least one stays up during drains.
Topology spread and affinity
- Topology spread constraints: Spread pods across zones or nodes (e.g. topologyKey: topology.kubernetes.io/zone) to reduce blast radius of a single failure.
- Pod affinity / anti-affinity: Prefer (or require) pods to run on the same node (affinity) or on different nodes (anti-affinity) for high availability or colocation.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
Health probes and readiness
- livenessProbe: If it fails, the kubelet restarts the container. Use for “is the process dead?” (e.g. HTTP /health or TCP).
- readinessProbe: If it fails, the pod is removed from Service endpoints (no traffic). Use for “is the app ready to serve?” (e.g. dependencies up, cache warm).
- Set initialDelaySeconds and periodSeconds so slow starters are not killed and probes do not overload the app.
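The two probes above could be configured as in the following sketch. The /ready path, port 8080, and all timing values are assumptions for illustration; choose them from your app's real startup time and endpoint cost.

```yaml
spec:
  containers:
    - name: app
      image: my-registry.io/my-app:latest
      ports:
        - containerPort: 8080
      # Restart the container if the process stops answering
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10   # give the app time to start
        periodSeconds: 10
        failureThreshold: 3       # restart after 3 consecutive failures
      # Remove the pod from Service endpoints until it can serve traffic
      readinessProbe:
        httpGet:
          path: /ready            # hypothetical endpoint that checks dependencies
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
```

Keeping the liveness check cheap and free of dependency checks avoids restart loops when a downstream service is unavailable; dependency checks belong in the readiness probe.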
Observability: metrics, logging, tracing
- Metrics: Expose Prometheus-style metrics from the app or use cAdvisor/kubelet metrics. kube-state-metrics exposes cluster object state. Use HorizontalPodAutoscaler with custom metrics from Prometheus for scaling.
- Logging: Centralize logs (e.g. Fluent Bit or Fluentd as DaemonSet, Loki or Elasticsearch as backend). Avoid storing secrets in log lines.
- Tracing: Use OpenTelemetry or Jaeger so requests can be traced across services. Inject trace context via sidecar or SDK.
Backup and disaster recovery
- etcd: Back up etcd regularly (snapshots); restore procedure is critical for control-plane recovery. Many managed offerings (EKS, AKS, GKE) handle this; self-managed clusters need a process.
- Workloads and persistent data: Use Velero (or similar) to back up cluster resources (and optionally PV snapshots) to object storage. Restore to the same or another cluster for DR.
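With Velero, recurring backups are declared as a Schedule resource. The sketch below assumes Velero is installed in the velero namespace with a backup storage location already configured; the schedule name, cron expression, and retention period are illustrative.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production     # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"      # every day at 02:00
  template:
    includedNamespaces:
      - production
    snapshotVolumes: true    # also snapshot persistent volumes where supported
    ttl: 720h0m0s            # keep each backup for 30 days
```

A backup is only useful if restores are rehearsed, so periodically restore into a scratch cluster or namespace to validate the procedure.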
High availability (HA) for Kubernetes
High availability means your workloads stay available despite node failures, zone outages, voluntary disruptions (drains, upgrades), and load spikes. This section covers six mechanisms: pod anti-affinity, topology spread constraints, Pod Disruption Budgets, PriorityClass, HPA, and VPA, with architecture, trade-offs, and examples.
HA architecture overview
Together, these controls affect where pods run (spread and affinity), how many can be down at once (PDB), who gets evicted first under pressure (PriorityClass), and how many replicas and how much CPU/memory they get (HPA, VPA).
+--------------------------- High availability layers --------------------------+
|                                                                               |
|  Placement (where pods run)      Protection (during disruptions)             |
|  +------------------------+      +-------------------------------------+      |
|  | Pod anti-affinity      |      | PodDisruptionBudget (minAvailable / |      |
|  | Topology spread        |      | maxUnavailable)                     |      |
|  +------------------------+      +-------------------------------------+      |
|                                                                               |
|  Preemption (who stays when nodes are full)                                   |
|  +------------------------+                                                   |
|  | PriorityClass          | --> Higher priority pods preferred; lower         |
|  | (priority, preemption) |     can be evicted to make room.                  |
|  +------------------------+                                                   |
|                                                                               |
|  Scaling (replicas and resources)                                             |
|  +------------------------+      +------------------------+                   |
|  | HPA                    |      | VPA                    |                   |
|  | (replica count by      |      | (requests/limits by    |                   |
|  | CPU/memory/metrics)    |      | actual usage)          |                   |
|  +------------------------+      +------------------------+                   |
|                                                                               |
+-------------------------------------------------------------------------------+
1. Pod anti-affinity
Pod anti-affinity tells the scheduler to avoid placing new pods on the same node (or zone) as existing pods matching a label selector. That spreads replicas across nodes so a single node failure does not take down all replicas. Trade-off: Strict anti-affinity (requiredDuringSchedulingIgnoredDuringExecution) can consume more resources: the scheduler may leave nodes underutilized or even fail to schedule if there are fewer nodes than replicas. Use preferredDuringSchedulingIgnoredDuringExecution when you want a soft preference without blocking scheduling.
Example: required (hard) anti-affinity — one pod per node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: my-registry.io/my-app:latest
If you have fewer than 3 nodes, one or more pods will stay Pending. Use only when you have enough nodes and accept the resource cost.
Example: preferred (soft) anti-affinity — spread when possible
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname
The scheduler tries to put pods on different nodes but will schedule anyway if it cannot (e.g. single node); no extra pending pods.
2. Pod topology spread constraints
Topology spread constraints spread pods across domains (e.g. node, zone, region) and limit the skew (difference in count between the most and least populated domain). They are a flexible way to achieve HA without the strict “one per node” rule of hard anti-affinity. You can combine multiple constraints (e.g. spread by hostname and by zone).
- topologyKey: Domain to spread over (e.g. kubernetes.io/hostname, topology.kubernetes.io/zone).
- maxSkew: Maximum allowed difference in the number of matching pods between any two domains. For example, maxSkew: 1 with 4 replicas and 3 zones allows a 2/1/1 distribution but not 2/2/0, where the difference between the fullest and emptiest zone is 2.
- whenUnsatisfiable: DoNotSchedule (hard: do not schedule if it would violate skew) or ScheduleAnyway (soft: prefer spread but allow scheduling).
Example: spread across zones and nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: app
          image: my-registry.io/my-app:latest
Here, zone spread is hard (no zone may hold more than one pod above the least-populated zone), and node spread is soft (best-effort across nodes). With 6 replicas and 3 zones, that yields 2 pods per zone.
3. Pod Disruption Budget (PDB)
A PodDisruptionBudget limits how many pods of a given selector can be unavailable at once during voluntary disruptions (e.g. kubectl drain, node upgrade, cluster autoscaler scale-in). The eviction API and drain logic respect PDBs: they will not evict pods if that would break the PDB. PDBs do not protect against involuntary failures (node crash, OOM); use replicas + spread for that.
- minAvailable: Minimum number of pods that must remain available (absolute number or percentage, e.g. minAvailable: 1 or minAvailable: "50%").
- maxUnavailable: Maximum number of pods that can be unavailable (absolute or percentage). Use either minAvailable or maxUnavailable, not both.
Example: at least 2 pods available, or at most 25% unavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
---
# Alternative: percentage-based
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb-pct
  namespace: production
spec:
  maxUnavailable: "25%"
  selector:
    matchLabels:
      app: my-app
Combine PDB with multiple replicas and topology spread (or anti-affinity) so that when a node is drained, remaining pods are still spread and the service stays healthy.
4. PriorityClass
PriorityClass assigns a numeric priority to pods. When the scheduler or kubelet (e.g. under resource pressure) must choose which pods to run or evict, higher priority pods are preferred. Lower-priority pods can be preempted (evicted) to make room for pending higher-priority pods. Use this so critical workloads (e.g. production app) keep running and best-effort or batch workloads give way.
- value: 32-bit integer; higher = higher priority. User-defined classes must be at most 1,000,000,000; larger values are reserved for the built-in system-cluster-critical and system-node-critical classes.
- globalDefault: If true, this class is the default for pods that do not specify a priorityClassName. Only one PriorityClass in the cluster can set this.
- preemptionPolicy: PreemptLowerPriority (default) or Never. Never means the pod is scheduled only when resources are available without preempting others.
Example: high-priority app and low-priority batch
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-app
value: 10000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-batch
value: 100
globalDefault: false
preemptionPolicy: Never
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # selector and matching pod labels are required for a valid Deployment
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      priorityClassName: high-priority-app
      containers:
        - name: app
          image: my-registry.io/my-app:latest
Under node pressure, pods with low-priority-batch can be evicted before pods with high-priority-app. Use PriorityClass sparingly and document which workloads are critical.
5. Horizontal Pod Autoscaler (HPA)
HPA automatically adjusts the number of pod replicas (Deployment, StatefulSet, or other scale subresource) based on metrics (CPU, memory, or custom/external). When load rises, HPA increases replicas; when load falls, it scales down within minReplicas and maxReplicas. This keeps the application available and responsive under variable load.
Metrics: Resource (CPU/memory from the metrics server), Pods (average of a per-pod metric), Object (a metric describing another object), or External (a metric from outside the cluster, served through the external metrics API). The behavior field (scaleUp/scaleDown) tunes how fast HPA reacts.
Example: CPU and memory-based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
The metrics-server must be installed in the cluster for Resource metrics. For custom or external metrics, you need an adapter (e.g. Prometheus Adapter) that implements the custom.metrics.k8s.io or external.metrics.k8s.io API.
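Scaling on a custom per-pod metric uses the Pods metric type. The sketch below assumes an adapter already exposes a metric named http_requests_per_second for the target pods; that metric name, the HPA name, and the target value are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa-rps       # hypothetical name
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed to be served by the adapter
        target:
          type: AverageValue
          averageValue: "100"              # aim for 100 requests/s per pod
```

HPA computes the desired replica count as ceil(currentReplicas × currentMetricValue / desiredMetricValue). With 4 replicas averaging 150 requests/s against a target of 100, that gives ceil(4 × 150 / 100) = 6 replicas.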
6. Vertical Pod Autoscaler (VPA)
VPA recommends or automatically updates CPU and memory requests and limits for pods based on actual usage. It helps right-size workloads over time so they get enough resources (fewer OOMs and throttling) without over-provisioning. VPA runs as a controller and can operate in Off (recommend only), Initial (apply requests only when a pod is created), Recreate (evict and recreate pods to apply new requests), or Auto (which, at the time of writing, behaves like Recreate).
Note: VPA is not in-tree; install the Vertical Pod Autoscaler from the Kubernetes autoscaler repo. VPA and HPA together: If you use both for the same pods, avoid autoscaling on the same resource (e.g. use VPA for memory and HPA for CPU, or use VPA in recommendation mode and tune requests manually).
Example: VPA for a Deployment
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Recreate"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
        controlledResources: ["cpu", "memory"]
VPA will recommend or apply new requests/limits within minAllowed and maxAllowed and, with updateMode: Recreate, will recreate pods to apply changes. Use updateMode: Off to only get recommendations without automatic updates.
Summary: HA mechanisms
| Mechanism | Purpose | Trade-off / note |
|---|---|---|
| Pod anti-affinity | Spread replicas across nodes/zones; avoid single point of failure. | Hard anti-affinity can waste capacity or block scheduling; prefer soft or topology spread. |
| Topology spread | Limit skew across topology domains (node, zone, region). | Flexible; combine zone (hard) + node (soft) for HA without over-constraining. |
| PDB | Limit voluntary disruption (drain, upgrade) so enough replicas stay up. | Only for voluntary disruptions; pair with replicas and spread. |
| PriorityClass | Ensure critical pods are scheduled first and best-effort pods are preempted before them. | Use sparingly; document and avoid too many high-priority workloads. |
| HPA | Scale replica count by CPU, memory, or custom metrics. | Needs metrics-server (and optionally custom metrics adapter); set min/max and behavior. |
| VPA | Right-size CPU/memory requests and limits from actual usage. | Install separately; use with HPA carefully (different resources or recommendation-only). |
Control and management
How you operate and govern clusters at scale—lifecycle, automation, and policy.
Cluster setup options
You can run Kubernetes on public clouds, on-prem, or locally. Common options:
- Managed services (GKE, EKS, AKS): The provider runs the control plane and often offers managed node pools, automatic upgrades, and integrated logging/monitoring. Best for production when you want to focus on workloads rather than cluster ops.
- kubeadm: Official tool to bootstrap a cluster (control plane and join nodes). Works on various Linux distros; you manage VMs, networking, and upgrades. Good for custom or air-gapped environments.
- kops (Kubernetes Operations): Provisions and manages clusters on AWS (and other clouds); supports high availability, multiple AZs, and rolling updates. Alternative to hand-rolling with kubeadm on cloud.
- Local / dev: minikube, kind (Kubernetes in Docker), or k3d (k3s in Docker) run a small cluster on your machine for development and learning.
Provider-specific scripts (e.g. legacy kube-up.sh for GCE/AWS) are largely superseded by kubeadm and managed offerings.
Cluster lifecycle and upgrades
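- Kubernetes version skew: The version skew policy constrains upgrade order: kubelets must never be newer than the API server and may lag it by only a limited number of minor versions (check the policy for your release). Upgrade the control plane first, then nodes (kubelet and runtime).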
- Node management: Rolling node updates (drain, update, uncordon) or use managed node pools that handle OS and runtime upgrades. Cluster Autoscaler adds/removes nodes based on pending pods and utilization.
- Managed services: AKS, EKS, GKE (and others) manage control-plane upgrades and often node images; you choose when to adopt new versions.
Operators and custom resources (CRDs)
- Custom Resource Definitions (CRDs): Extend the API with your own resources (e.g. PostgresCluster, Certificate). Controllers watch these resources and reconcile real state (e.g. create pods, call external APIs).
- Operators: Pattern that packages domain logic (install, upgrade, backup) for an app (e.g. Prometheus Operator, cert-manager, PostgreSQL Operator). They use CRDs and controllers so you manage the app declaratively like native Kubernetes resources.
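To make the CRD idea concrete, here is a minimal sketch of a definition and one instance of it. The BackupSchedule kind, the example.com group, and its fields are all hypothetical; a controller would still be needed to act on such objects.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backupschedules.example.com   # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: backupschedules
    singular: backupschedule
    kind: BackupSchedule
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cron:
                  type: string
                retentionDays:
                  type: integer
---
# An instance of the new resource, managed like any built-in object
apiVersion: example.com/v1
kind: BackupSchedule
metadata:
  name: nightly
  namespace: production
spec:
  cron: "0 3 * * *"
  retentionDays: 14
```

Once the definition is applied, kubectl get backupschedules -n production works like it does for built-in resources.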
GitOps and declarative delivery
- GitOps: Git is the source of truth for desired cluster state. A controller (e.g. Flux, Argo CD) watches a Git repo (and optionally image registries) and applies manifests or Helm charts to the cluster. Changes are made by PR; rollback by revert.
- Benefits: Audit trail, consistency across envs, approval workflows, and separation between “what to run” (Git) and “what is running” (cluster). Use with Kustomize or Helm for environment-specific overlays.
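With Argo CD, the link between a Git path and a cluster namespace is itself a declarative resource. A sketch, assuming Argo CD is installed in the argocd namespace; the repository URL and path are placeholders, not real locations.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/deploy-config.git   # placeholder repo
    targetRevision: main
    path: overlays/production                                # e.g. a Kustomize overlay
  destination:
    server: https://kubernetes.default.svc                   # the cluster Argo CD runs in
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made in the cluster
```

Automated sync with selfHeal makes Git the only effective way to change the workload; leave syncPolicy out if you prefer to approve each sync manually.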
Policy enforcement and governance
- Policy as code: Define rules (e.g. “all images must be from registry X”, “no privileged pods”) and enforce them at admission (Gatekeeper, Kyverno) or in CI (e.g. Conftest, OPA) before apply.
- Kyverno: Kubernetes-native policy (no separate language); generate resources (e.g. add NetworkPolicy when Namespace is created) and validate/mutate. Good for tenant isolation and compliance.
- Multi-cluster and governance: Use fleet or multi-cluster tools (e.g. Rancher, Argo CD with multiple clusters, GKE Fleet) to apply policies and apps across many clusters from a single place.
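The “all images must be from registry X” rule could be written as a Kyverno policy along the lines below. This is a sketch modelled on Kyverno's validate-pattern style; the policy name and message are assumptions, the registry follows the examples in this article, and field names such as validationFailureAction have changed between Kyverno releases, so check the version you run.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries   # hypothetical name
spec:
  validationFailureAction: Enforce  # reject violating pods instead of only reporting
  background: true
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must come from my-registry.io."
        pattern:
          spec:
            containers:
              # every container image must match this glob
              - image: "my-registry.io/*"
```

This pattern only covers regular containers; init and ephemeral containers would need additional patterns to close the gap.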
Summary: control and management
| Area | Practices |
|---|---|
| Lifecycle | Plan upgrades (control plane then nodes); use managed node pools or Autoscaler. |
| Extensibility | CRDs and operators for custom apps; use well-maintained operators. |
| Delivery | GitOps (Flux, Argo CD) with Git as source of truth; Kustomize/Helm for overlays. |
| Governance | Admission and policy (Gatekeeper, Kyverno); multi-cluster policy where needed. |
Summary
- Architecture: Control plane (API server, etcd, scheduler, controllers) manages the cluster; data plane (kubelet, kube-proxy, container runtime) runs pods on nodes.
- Pods run your containers (init containers, multi-container pods); Deployments manage replicas and rolling updates via ReplicaSet; Services (ClusterIP, NodePort, LoadBalancer, headless) and Ingress expose pods on the network.
- Deployment strategies: RollingUpdate (default; maxSurge/maxUnavailable) or Recreate; blue-green and canary are done via separate Deployments and traffic switching. Use kubectl rollout for status, pause, and undo.
- Workloads: Deployment for stateless apps; DaemonSet for one pod per node (logging, monitoring); StatefulSet for stateful apps with stable names and PVCs; Job for batch/completion; CronJob for scheduled Jobs.
- Config: ConfigMap and Secret inject configuration; PersistentVolumeClaim and StorageClass provide persistent storage.
- Use kubectl to apply YAML and inspect pods, deployments, services, and other resources.
- Security: RBAC, NetworkPolicy, Secrets (encryption at rest, external managers), pod security contexts and Pod Security Standards, admission control (PodSecurity, LimitRanger, webhooks, Gatekeeper/Kyverno), ServiceAccount and token projection, audit logging, and control-plane/node hardening.
- Enhancements: Set resource requests/limits; tune health probes; add metrics, logging, and tracing; back up etcd and use Velero for DR.
- High availability (HA): Use pod anti-affinity (soft preferred to avoid extra resource cost), topology spread constraints (zone + node), PodDisruptionBudget, PriorityClass for critical vs best-effort, HPA for replica scaling, and VPA for right-sizing requests/limits.
- Control and management: Plan cluster upgrades and node lifecycle; use operators and CRDs for complex apps; adopt GitOps (Flux, Argo CD) for declarative delivery; enforce policy with Gatekeeper or Kyverno and govern multi-cluster where needed.
For production, combine these building blocks (ConfigMaps, Secrets, Ingress, security and policy, observability) on a managed cluster (e.g. AKS, EKS, GKE); for learning, a local setup like minikube or kind is enough.
References
- Kubernetes in Action (2nd ed.), Marko Lukša, Manning — comprehensive coverage of Pods, ReplicaSet, Deployment, Service, Endpoints, Ingress, ConfigMap, Secret, PV/PVC, StorageClass, Job, CronJob, DaemonSet, StatefulSet, and more.
- Getting Started with Kubernetes (2nd ed.), Jonathan Baier, Packt — introduction to Kubernetes, cluster setup (GCE, AWS, kubeadm), pods and services, networking (CNI, kube-proxy, Ingress), Deployments and Jobs, storage and StatefulSets, monitoring (Heapster, Prometheus), and container security.