quyennv.com

Senior DevOps Engineer · Healthcare, Finance

Kubernetes (K8S): Architecture, Pods, Deployments, and Security

#kubernetes #k8s #containers #orchestration #devops #cloud #security

Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. You describe the desired state (e.g. “run 3 replicas of this image”), and Kubernetes keeps the cluster in that state.

Why Kubernetes?

Containers (e.g. Docker) give you portable, consistent application units built from images. At scale you need orchestration: lifecycle (who restarts failed containers?), networking (how do services find each other?), resource utilization (how do we schedule and limit CPU/memory?), and portability (how do we roll out updates or move workloads across nodes?). Kubernetes addresses these by treating the cluster as a single system and keeping workloads in the desired state.

  • Orchestration: Schedule and run containers across many nodes; handle restarts and placement.
  • Scaling: Scale workloads up or down (manually or with autoscalers).
  • Self-healing: Restart failed containers, replace unhealthy pods, reschedule when nodes fail.
  • Declarative config: Define desired state in YAML; Kubernetes reconciles the actual state.

Kubernetes architecture

A Kubernetes cluster is split into two planes: the control plane (manages the cluster) and the data plane (runs your workloads).

High-level view

+----------------- CONTROL PLANE ------------------+
|  +------------+  +-----------+  +-------------+  |
|  |     etcd   |  |           |  |  Controller |  |
|  |  (storage) |  | Scheduler |  |  Manager    |  |
|  +------+-----+  +-----+-----+  +------+------+  |
|         |              v               |         |
|         |    +---------+--------+      |         |
|         +--->|    kube-api-     |      |         |
|              |      server      |<-----+         |
|              +------------------+                |
|                     ^                            | 
|                     |                            |
|             +-------+----------+                 |    +-----------------+
|             |  cloud-controll- |                 |    |  cloud provider |
|             |    manager       |-----------------|--->|      API        |
|             +------------------+                 |    +-----------------+
+--------------------------------------------------+  
                      |
                      v
+--------------- DATA PLANE (Nodes) -----------------+
|              +------------------------+            |
|              v                        v            |
|  +------------------+       +-------------------+  |
|  |     Node 1       |       |     Node 2        |  |
|  |  kubelet         |       |  kubelet          |  |
|  |  kube-proxy      |       |  kube-proxy       |  |
|  | container runtime|       |  container runtime|  |
|  |  [Pods]          |       |  [Pods]           |  |
|  +------------------+       +-------------------+  |
+----------------------------------------------------+

Control plane components

Component | Role
API Server (kube-apiserver) | Single entrypoint for all cluster operations. Validates and processes REST requests; updates etcd. kubectl and other clients talk only to the API server.
etcd | Distributed key-value store holding cluster state (desired and current). Only the API server reads/writes etcd. High availability is critical for production.
Scheduler (kube-scheduler) | Watches for newly created pods with no assigned node; selects a node (based on resources, affinity, taints/tolerations) and assigns the pod.
Controller Manager (kube-controller-manager) | Runs controllers that reconcile state: Node Controller, Deployment Controller, ReplicaSet Controller, etc. They watch the API and drive the cluster toward the desired state.
Cloud Controller Manager (cloud-controller-manager) | Optional; ties the cluster to cloud provider APIs (load balancers, nodes, routes). Used on managed services such as AKS, EKS, GKE.

Data plane (worker nodes)

Component | Role
kubelet | Agent on each node. Registers the node with the API server; ensures containers in pods are running (pulls images, starts/stops containers, reports status).
kube-proxy | Network proxy on each node. Implements the Service abstraction: maintains iptables or IPVS rules so traffic to a Service IP/port is forwarded to backend pods.
Container runtime | Software that runs containers (containerd, CRI-O, etc.). kubelet talks to it via the Container Runtime Interface (CRI).

Request flow (example: create a Deployment)

  1. You run kubectl apply -f deployment.yaml → kubectl sends the manifest to the API Server.
  2. API Server validates the Deployment and stores it in etcd.
  3. Deployment controller (in Controller Manager) sees the new Deployment and creates a ReplicaSet; the ReplicaSet controller then creates the Pod objects (no node yet).
  4. Scheduler sees Pods with no nodeName, selects nodes, and updates each Pod with the chosen node (write to etcd via API Server).
  5. kubelet on each assigned node sees new Pods, pulls images via the container runtime, and starts containers.
  6. kubelet reports Pod status back to the API Server; controllers and users see the cluster state.

Core concepts

Term | Meaning
Pod | Smallest deployable unit: one or more containers that share storage and network.
Deployment | Declarative way to manage a set of identical pods (replicas, rolling updates).
Service | Stable network endpoint to reach pods (cluster IP, NodePort, or LoadBalancer).
Namespace | Virtual cluster for grouping and isolating resources (e.g. dev, prod).
Node | A worker machine (VM or physical) that runs pods.

Kubernetes resources overview

The following table summarizes the main Kubernetes resources (as in Kubernetes in Action, Lukša). Cluster-level resources are not namespaced; others live in a namespace.

Resource (abbr.) | API version | Description
Namespace (ns) | v1 | Organizes resources into non-overlapping groups (e.g. per tenant, env).
Pod (po) | v1 | Basic deployable unit: one or more co-located containers sharing network and storage.
ReplicaSet (rs) | apps/v1 | Keeps a set of pod replicas running; used by Deployment.
ReplicationController (rc) | v1 | Older, less capable way to keep pod replicas; prefer ReplicaSet.
Deployment (deploy) | apps/v1 | Declarative deployment and rolling updates of pods via ReplicaSet.
StatefulSet (sts) | apps/v1 | Manages stateful pods with stable identity and ordered deployment.
DaemonSet (ds) | apps/v1 | Runs one pod replica per node (all nodes or those matching a selector).
Job | batch/v1 | Runs one or more pods until a finite task completes successfully.
CronJob | batch/v1 | Runs a Job on a schedule (cron expression).
Service (svc) | v1 | Exposes one or more pods at a stable IP and port (ClusterIP, NodePort, LoadBalancer).
Endpoints (ep) | v1 | Lists the pod IPs that back a Service (usually auto-managed).
Ingress (ing) | networking.k8s.io/v1 | Exposes services to the outside via HTTP(S) host/path routing.
ConfigMap (cm) | v1 | Key-value config for apps (non-sensitive); mount as files or env.
Secret | v1 | Sensitive data (passwords, tokens); base64-encoded, so enable encryption at rest.
PersistentVolume (pv) | v1 | Cluster-level piece of storage; bound by a PersistentVolumeClaim.
PersistentVolumeClaim (pvc) | v1 | Request for storage; bound to an existing PersistentVolume or to one created by dynamic provisioning.
StorageClass (sc) | storage.k8s.io/v1 | Defines a class of storage for dynamic provisioning of PersistentVolumes.

Pods in more detail

  • Lifecycle phases: Pending → Running → Succeeded or Failed (the last two apply to pods that run to completion). A pod stays Pending while it waits to be scheduled and while its images are pulled; it becomes Running once at least one container has started.
  • Init containers: Run to completion before the main containers start. Use them for setup (e.g. migrate DB, wait for a dependency). They run in order; if one fails, the kubelet retries it until it succeeds, unless restartPolicy is Never, in which case the pod is marked Failed.
  • Multiple containers in a pod: Share the same network namespace (localhost) and can share volumes. Typical pattern: main app + sidecar (e.g. log shipper, proxy). The kubelet restarts only the container that exited (with restartPolicy OnFailure or Always), not the whole pod.

spec:
  initContainers:
    - name: init-db
      image: busybox
      command: ['sh', '-c', 'until nslookup db; do sleep 2; done']
  containers:
    - name: app
      image: my-app:latest
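
The main app + sidecar pattern described above can be sketched as a single pod. This is a minimal illustration under stated assumptions, not a production setup: the pod name, the log path /var/log/app/app.log, and the busybox tail command are all hypothetical.

```yaml
# Hypothetical pod: main app plus a log-tailing sidecar sharing an emptyDir volume.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar          # illustrative name
spec:
  volumes:
    - name: logs
      emptyDir: {}                # scratch volume shared by both containers
  containers:
    - name: app
      image: my-app:latest        # assumed to write /var/log/app/app.log
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-tailer            # sidecar: streams the app log to stdout
      image: busybox
      # tail -F keeps retrying until the file exists
      command: ['sh', '-c', 'tail -F /var/log/app/app.log']
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true
```

Both containers see the same files under /var/log/app and could also reach each other over localhost, since they share the pod's network namespace.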

Workload resources: ReplicaSet, Job, DaemonSet, StatefulSet

Resource | Use case
ReplicaSet | Keep N identical pod replicas; use via Deployment, not alone.
Job | Run a batch task until success (e.g. backup, migration). Key fields: completions, parallelism, backoffLimit.
CronJob | Run a Job on a schedule (e.g. "0 * * * *" every hour).
DaemonSet | One pod per node (e.g. node exporter, log collector, CNI).
StatefulSet | Stateful apps with stable identity: stable pod name and storage, ordered create/delete.

Deployment strategies and workload examples

This section covers how you roll out changes (deployment strategies) and all workload types in Kubernetes with concrete examples.

Deployment strategies (Deployment resource)

A Deployment manages ReplicaSets (one per revision of the pod template) and drives a rolling update (or recreate) when you change the pod template. You control the strategy and pace.

Strategy | Behavior | When to use
RollingUpdate (default) | Gradually replace old pods with new ones. You can set maxSurge (extra pods allowed above desired count) and maxUnavailable (pods that can be missing). | Default for most apps; minimal downtime, continuous availability.
Recreate | Terminate all existing pods, then create new ones. No overlap. | When the app cannot run two versions at once (e.g. schema migration, singleton).
Blue-green | Not built-in. You run two Deployments (e.g. my-app-v1, my-app-v2) and switch the Service selector from one to the other. | Manual or scripted; instant cutover, easy rollback.
Canary | Not built-in. You run a small set of new-version pods and route a fraction of traffic (e.g. via Ingress weights or two Services). | Gradual exposure; use service mesh or Ingress for traffic split.
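
The blue-green switch comes down to one field on the Service. The sketch below assumes that both Deployments label their pods with app: my-app plus a version label (v1 or v2); the version label and the Service name are assumptions for illustration, not something Kubernetes defines.

```yaml
# Service that currently sends all traffic to the "blue" (v1) pods.
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
spec:
  selector:
    app: my-app
    version: v1        # hypothetical label; change to v2 to cut over
  ports:
    - port: 80
      targetPort: 3000
```

Changing version to v2 and re-applying the manifest moves traffic to the new pods in one step; setting it back to v1 is the rollback.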

Example: Deployment with RollingUpdate and Recreate

# Rolling update: at most 1 extra pod, at most 0 unavailable (surge only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-registry.io/my-app:v2
          ports:
            - containerPort: 3000
---
# Recreate: all pods terminated before new ones start
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-migrator
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: migrator
  template:
    metadata:
      labels:
        app: migrator
    spec:
      containers:
        - name: migrator
          image: my-registry.io/db-migrate:latest
          command: ["/run-migrations.sh"]

Rollout commands:

# Watch rollout status
kubectl rollout status deployment/my-app

# Pause/resume a rollout
kubectl rollout pause deployment/my-app
kubectl rollout resume deployment/my-app

# Rollback to previous revision
kubectl rollout undo deployment/my-app

# Rollback to a specific revision
kubectl rollout history deployment/my-app
kubectl rollout undo deployment/my-app --to-revision=2

ReplicaSet

A ReplicaSet keeps a fixed number of pod replicas running. You usually don’t create it directly; the Deployment controller creates and updates ReplicaSets. If you manage a ReplicaSet yourself, scaling and updates are manual (no rollout history or rollback). Use a ReplicaSet directly only for special cases (e.g. temporary scaling that you don’t want the Deployment to own).

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-registry.io/my-app:latest
          ports:
            - containerPort: 3000

DaemonSet

A DaemonSet ensures that every node (or every node matching a selector) runs exactly one pod of the template. Use it for node-level agents: logging (Fluentd, Fluent Bit), monitoring (node-exporter), storage (CSI node plugin), or networking (CNI). New nodes get the pod automatically when they join the cluster.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          ports:
            - containerPort: 9100
          resources:
            requests:
              cpu: "50m"
              memory: "64Mi"
            limits:
              memory: "128Mi"

Run only on a subset of nodes: use nodeSelector, affinity, or taints/tolerations so the DaemonSet’s pods only schedule on nodes that match (e.g. nodes with a label node-role=logging).
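
As a sketch of that restriction, the fragment below shows the fields involved. The node-role=logging label comes from the text above; the dedicated=logging taint is a hypothetical example of a taint the pods would need to tolerate.

```yaml
# Fragment of a DaemonSet: these fields live under spec.template.spec.
spec:
  template:
    spec:
      nodeSelector:
        node-role: logging        # schedule only on nodes labeled node-role=logging
      tolerations:
        - key: dedicated          # hypothetical taint key on those nodes
          operator: Equal
          value: logging
          effect: NoSchedule
```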

StatefulSet

A StatefulSet is for stateful workloads that need stable identity and stable storage: pods get stable names (<statefulset-name>-0, -1, …), are created and deleted in order, and can each have their own PersistentVolumeClaim (via volumeClaimTemplates). You need a headless Service (no cluster IP) so DNS can return individual pod hostnames.

apiVersion: v1
kind: Service
metadata:
  name: redis-headless
spec:
  clusterIP: None
  selector:
    app: redis
  ports:
    - port: 6379
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis-headless
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi

Pods are named redis-0, redis-1, redis-2 and are reachable at redis-0.redis-headless.<namespace>.svc.cluster.local, etc. Each pod’s storage is bound to that pod and survives pod restarts and rescheduling.

Job

A Job runs one or more pods until a finite number of completions succeed. Use it for batch work: migrations, backups, report generation. You can set parallelism (how many pods run at once), completions (how many successful completions are required), and backoffLimit (retries before the Job is marked failed).

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrate
          image: my-registry.io/migrate:latest
          command: ["/bin/sh", "-c", "npm run migrate"]

Parallel Job (e.g. process N items):

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-process
spec:
  completions: 10
  parallelism: 3
  backoffLimit: 2
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: my-registry.io/batch-worker:latest

Up to 3 pods run at a time until 10 completions succeed.

CronJob

A CronJob creates Jobs on a schedule (cron expression). Use it for periodic tasks: backups, cleanup, report generation.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"   # 02:00 every day, in the kube-controller-manager's time zone (often UTC) unless spec.timeZone is set
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: my-registry.io/backup:latest
              command: ["/backup.sh"]

  • schedule: Standard cron (minute hour day month weekday).
  • concurrencyPolicy: Allow (default), Forbid (skip if previous run still active), or Replace (cancel previous and start new).
  • successfulJobsHistoryLimit / failedJobsHistoryLimit: How many finished Jobs to keep for visibility.

Summary: when to use which workload

Workload | Use when
Deployment | Stateless app; you want rolling updates, rollback, and replica count.
ReplicaSet | Rarely directly; prefer Deployment.
DaemonSet | One pod per node for logging, monitoring, CNI, or storage agent.
StatefulSet | Stateful app with stable name and storage (DB, queue, etc.).
Job | One-off or batch task that must run to completion.
CronJob | Same as Job but on a schedule.

Services and networking

  • ClusterIP: Default. A virtual IP inside the cluster; pods reach the service by name (DNS: <svc>.<ns>.svc.cluster.local). Endpoints are created automatically from the service selector and list the backing pod IPs.
  • NodePort: Exposes the service on each node’s IP at a static port (30000–32767). Good for dev or when you don’t have a load balancer.
  • LoadBalancer: Provisions an external load balancer (cloud or on-prem). Often used with Ingress for HTTP(S).
  • Headless service: clusterIP: None. No cluster IP; DNS returns all pod IPs. Used for StatefulSet or when clients need to talk to specific pods.
  • ExternalName: Maps the service to a CNAME in DNS (e.g. an external API hostname); no cluster IP or proxy, useful for referencing out-of-cluster services by a stable name.
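
Two of the less common types from the list above, sketched as manifests. The Service names, the hostname api.example.com, and the nodePort value are hypothetical.

```yaml
# ExternalName: in-cluster clients resolve billing-api to an external hostname.
apiVersion: v1
kind: Service
metadata:
  name: billing-api                 # hypothetical in-cluster name
spec:
  type: ExternalName
  externalName: api.example.com     # hypothetical external hostname (CNAME target)
---
# NodePort: reachable on every node's IP at the given port.
apiVersion: v1
kind: Service
metadata:
  name: my-app-nodeport
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 3000
      nodePort: 30080               # must fall in the node port range; omit to auto-assign
```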

kube-proxy runs on every node and implements the Service abstraction. It watches the API for Service and Endpoints changes and updates iptables (or IPVS) rules so that traffic to the Service virtual IP is forwarded to a backing pod. In iptables mode (the default), the rules point directly at pod IPs, so traffic goes client → node → iptables → pod without an extra userspace hop. In the legacy userspace mode, kube-proxy itself received the traffic and forwarded it; that mode was slower and has been removed from recent Kubernetes releases (1.26 and later). You can set sessionAffinity: ClientIP on a Service so that requests from the same client IP go to the same pod when possible.
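
A minimal sketch of a Service with client-IP session affinity. The Service name is hypothetical, and the timeout shown is simply the documented default written out explicitly.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-sticky          # hypothetical name
spec:
  selector:
    app: my-app
  sessionAffinity: ClientIP    # same client IP goes to the same pod when possible
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800    # affinity window; 10800 (3 hours) is the default
  ports:
    - port: 80
      targetPort: 3000
```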

Ingress exposes HTTP(S) routes to services. An Ingress controller (e.g. NGINX, Traefik) watches Ingress resources and configures the load balancer. One Ingress can route multiple hosts/paths to different ClusterIP services.

How Kubernetes DNS works for pods and services

Kubernetes ships with a cluster DNS service (CoreDNS or the older kube-dns) that watches the API server for Service and Endpoints objects and creates DNS records for them.

  • Service names → virtual IPs

    • Every Service gets an A record like <svc>.<namespace>.svc.cluster.local.
    • Pods automatically get search paths such as <namespace>.svc.cluster.local, svc.cluster.local, and cluster.local.
    • This is why from a pod in namespace app you can usually reach backend just by using the short name backend (it expands via search paths).
  • Headless Services and pod DNS

    • For a headless Service (clusterIP: None), DNS does not return a virtual IP; instead it returns one A record per pod IP.
    • With StatefulSets, each pod also gets a stable DNS name like web-0.web.app.svc.cluster.local, where:
      • web-0 is the pod name,
      • web is the Service name,
      • app is the namespace.
    • Clients can either use the Service name (round‑robin across pods) or the individual pod hostnames for peer‑to‑peer protocols.
  • Pods themselves

    • Regular pods do not usually have standalone DNS records that apps rely on; instead, you go through the Service.
    • The pod’s /etc/resolv.conf is configured to use the cluster DNS and the search domains mentioned above.

In short: kube-proxy makes Service IPs forward traffic to pods, while CoreDNS makes Service and pod names resolve to those IPs. Together they give you stable names (my-svc.my-namespace.svc.cluster.local) on top of dynamic pod IPs.

CNI and pod networking

Kubernetes expects every pod to get a routable IP so that pods can talk to each other and kube-proxy can forward service traffic to pod IPs. The Container Network Interface (CNI) is the standard plugin interface that clusters use for this: a plugin runs when a pod is created or deleted and configures the pod’s network (e.g. a veth pair and bridge, or an overlay). Common CNI plugins include Flannel (simple overlay, good for getting started), Calico (BGP or overlay, network policy and performance), Weave (overlay with encryption option), and Canal (Flannel for networking + Calico for network policy). Choose based on your need for encryption, network policy, and multi-node routing (e.g. BGP on bare metal).

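Example Ingress (see the Ingress description earlier in this section), routing app.example.com to the my-app-svc Service:

apiVersion: networking.k8s.io/v1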
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-svc
                port:
                  number: 80

ConfigMap and Secret (configuration)

  • ConfigMap: Store non-sensitive config (URLs, feature flags, config files). Mount as a volume or inject as environment variables. Updates to a ConfigMap reach mounted volumes after a short delay (the kubelet syncs periodically), but environment variables and subPath mounts keep the old values until the pod is restarted.
  • Secret: Same idea for sensitive data (passwords, TLS certs). Values are stored base64-encoded, which is encoding, not encryption; enable encryption at rest for the API server in production. Mount as a volume or env; prefer projected volumes or external secret operators for rotation.
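
Both consumption styles (environment variables and a mounted file) can be sketched in one place. All names, keys, and values below are hypothetical; stringData is used so the Secret can be written in plain text, and the API server stores it base64-encoded.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config               # hypothetical
data:
  API_URL: "https://api.example.com"
  app.properties: |
    feature.newUi=true
---
apiVersion: v1
kind: Secret
metadata:
  name: app-db-credentials
type: Opaque
stringData:                      # plain text input; stored base64-encoded
  DB_PASSWORD: "change-me"       # placeholder value, never commit real secrets
---
apiVersion: v1
kind: Pod
metadata:
  name: config-demo
spec:
  containers:
    - name: app
      image: my-app:latest
      env:
        - name: API_URL          # injected from the ConfigMap
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: API_URL
        - name: DB_PASSWORD      # injected from the Secret
          valueFrom:
            secretKeyRef:
              name: app-db-credentials
              key: DB_PASSWORD
      volumeMounts:
        - name: config
          mountPath: /etc/app    # app.properties appears as /etc/app/app.properties
          readOnly: true
  volumes:
    - name: config
      configMap:
        name: app-config
        items:
          - key: app.properties
            path: app.properties
```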

Namespace resource quotas and limits (multitenancy)

To share a cluster across teams or environments, use namespaces and cap resource usage per namespace so one tenant cannot starve others. ResourceQuota limits total usage in a namespace (e.g. total CPU, memory, number of pods, PVCs). LimitRange sets default and max requests/limits for containers in that namespace so every pod gets bounds even if the manifest omits them. Together they support safe multitenancy and prevent runaway workloads.
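
A sketch of the two objects for a hypothetical staging namespace. The object names and all numbers are illustrative assumptions; pick values that match your tenants.

```yaml
# Caps the total resources the namespace may consume.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    persistentvolumeclaims: "10"
---
# Gives every container defaults and an upper bound.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: staging
spec:
  limits:
    - type: Container
      defaultRequest:            # applied when a container omits requests
        cpu: 100m
        memory: 64Mi
      default:                   # applied when a container omits limits
        cpu: 200m
        memory: 128Mi
      max:                       # no container may set limits above this
        cpu: "1"
        memory: 512Mi
```

When a quota covers CPU or memory, pods that specify no requests or limits for those resources are rejected, which is why the two objects are usually deployed together: the LimitRange fills in the missing values.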

Volumes and persistent storage

  • emptyDir: Temporary directory per pod; deleted when the pod is removed. Good for scratch space or sharing data between containers in a pod.
  • PersistentVolumeClaim (PVC): Request storage (size, StorageClass). The cluster binds it to a PersistentVolume (PV) or triggers dynamic provisioning. Pods mount the PVC; data survives pod restarts.
  • StorageClass: Defines a provisioner and parameters (e.g. cloud disk type). When you create a PVC that references a StorageClass, the provisioner creates the backing volume and binds the PVC.
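
The pieces fit together roughly as follows. This is a sketch: the StorageClass name standard is an assumption (list the classes in your cluster with kubectl get storageclass), and the pod and claim names are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard     # hypothetical class; triggers dynamic provisioning
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: storage-demo
spec:
  containers:
    - name: app
      image: my-app:latest
      volumeMounts:
        - name: data             # persistent: survives pod restarts
          mountPath: /data
        - name: scratch          # temporary: removed with the pod
          mountPath: /tmp/scratch
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data
    - name: scratch
      emptyDir: {}
```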

Minimal Deployment example

Save as app-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-registry.io/my-app:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "128Mi"
              cpu: "200m"

Create the deployment:

kubectl apply -f app-deployment.yaml

Exposing the app with a Service

apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP   # or NodePort / LoadBalancer

Save it as app-service.yaml and apply it:

kubectl apply -f app-service.yaml

Essential kubectl commands

# List pods (default namespace)
kubectl get pods

# List pods in a namespace
kubectl get pods -n production

# Describe a pod (events, state, details)
kubectl describe pod <pod-name>

# View logs from a pod
kubectl logs <pod-name>

# Follow logs (like tail -f)
kubectl logs -f <pod-name>

# Execute a command in a pod
kubectl exec -it <pod-name> -- sh

# List deployments
kubectl get deployments

# Scale a deployment
kubectl scale deployment my-app --replicas=5

# Delete a deployment and its pods
kubectl delete deployment my-app

Pod lifecycle and restarts

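  • Kubernetes keeps the number of replicas you specified; if a pod exits or fails, its controller (e.g. the ReplicaSet) replaces it.
  • livenessProbe and readinessProbe tell Kubernetes when to restart a container and when to send traffic: a failing liveness probe makes the kubelet restart the container, while a failing readiness probe removes the pod from Service endpoints until it passes again.

containers: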
  - name: app
    image: my-app:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 3000
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 3000
      initialDelaySeconds: 2
      periodSeconds: 5

Namespaces

# List namespaces
kubectl get namespaces

# Create a namespace
kubectl create namespace staging

# Run a one-off pod in a namespace
kubectl run debug --image=busybox -n staging -- sleep 3600

Security for Kubernetes

Securing a Kubernetes cluster involves the control plane, the nodes, the network, and the workloads. Below are the main areas and practices.

1. RBAC (Role-Based Access Control)

RBAC controls who can do what in the cluster. You define permissions (Roles or ClusterRoles) and bind them to subjects (users, groups, or ServiceAccounts) via RoleBinding or ClusterRoleBinding. RBAC is the authorization step: the API server evaluates it after authentication has established the caller’s identity and before admission control validates or mutates the request. A request that RBAC does not allow is rejected before it reaches admission.

RBAC building blocks

Resource | Scope | Purpose
Role | One namespace | Set of rules (apiGroups, resources, verbs) in that namespace only.
ClusterRole | Cluster-wide (or reusable) | Same as Role but not tied to a namespace; can reference cluster-scoped resources (nodes, PVs) or be bound in many namespaces.
RoleBinding | One namespace | Binds a Role or ClusterRole to subjects; grants permissions in that namespace.
ClusterRoleBinding | Cluster-wide | Binds a ClusterRole to subjects; grants permissions across the whole cluster.

Rule structure: Each rule has apiGroups (e.g. "" for core, apps, rbac.authorization.k8s.io), resources (e.g. pods, deployments, secrets), and verbs (e.g. get, list, watch, create, update, patch, delete). You can restrict by resourceNames so the subject can only get/patch/delete named resources. subresources (e.g. pods/log, pods/status) can be listed when needed.

Common verbs: get (single resource by name), list (a collection), watch (stream changes), create, update, patch, delete, deletecollection. For non-resource URLs (e.g. /healthz), use a rule with nonResourceURLs and verbs instead of apiGroups and resources; such rules are only valid in a ClusterRole.
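
A non-resource rule might look like the following sketch. The ClusterRole name is hypothetical; it takes effect only once bound with a ClusterRoleBinding.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: health-checker                 # hypothetical name
rules:
  - nonResourceURLs: ["/healthz", "/healthz/*"]
    verbs: ["get"]                     # HTTP GET on the listed paths only
```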

Subjects: User, Group, ServiceAccount

Bindings reference subjects:

  • User (kind: User, name: alice): External identity (e.g. from OIDC, client cert). No User object in the cluster; the API server infers the user from the request.
  • Group (kind: Group, name: dev-team): Often used with OIDC or LDAP so you grant access to a group once.
  • ServiceAccount (kind: ServiceAccount, name: my-app, namespace: production): Identity for pods; the pod’s token is used to call the API. Most common for in-cluster workloads.
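
The three subject kinds can appear side by side in one binding. The sketch below assumes a Role named pod-reader already exists in the production namespace; the binding name is hypothetical, and alice, dev-team, and my-app are the example identities from the list above.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-mixed           # hypothetical name
  namespace: production
subjects:
  - kind: User                     # external identity, no object in the cluster
    name: alice
    apiGroup: rbac.authorization.k8s.io
  - kind: Group
    name: dev-team
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount           # namespaced, so no apiGroup but a namespace
    name: my-app
    namespace: production
roleRef:
  kind: Role
  name: pod-reader                 # assumed to exist in production
  apiGroup: rbac.authorization.k8s.io
```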

Example: Role and RoleBinding (namespaced)

# Role: read pods and pod logs in namespace production
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# RoleBinding: grant pod-reader to ServiceAccount ci-bot in production
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ci-bot
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Example: ClusterRole and RoleBinding (reuse in many namespaces)

A ClusterRole can be bound in multiple namespaces via separate RoleBindings, so you define “read pods in any namespace” once and bind it per namespace.

# ClusterRole: get, list, and watch pods (no namespace in metadata; the binding decides where it applies)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reader-cluster
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# RoleBinding in production: grant pod-reader-cluster only in production
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-production
  namespace: production
subjects:
  - kind: ServiceAccount
    name: monitor
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: pod-reader-cluster
  apiGroup: rbac.authorization.k8s.io

Example: ClusterRole and ClusterRoleBinding (cluster-wide)

Use for cluster-wide read-only or admin. Avoid granting cluster-admin (full access) in production; prefer narrow ClusterRoles.

# ClusterRole: read nodes and namespaces (cluster-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-viewer
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding: grant to all authenticated users in group "viewers"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-viewers
subjects:
  - kind: Group
    name: viewers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-viewer
  apiGroup: rbac.authorization.k8s.io

Example: Role with resourceNames (narrow scope)

Limit access to specific secrets by name:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: secret-reader-app-db
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["app-db-credentials"]
    verbs: ["get"]

Checking permissions

# Can the current user create pods in production?
kubectl auth can-i create pods -n production

# Can ServiceAccount monitoring/grafana list pods in production?
kubectl auth can-i list pods -n production --as=system:serviceaccount:monitoring:grafana

# List the Roles and bindings that exist (use kubectl describe or -o yaml for details)
kubectl get role,rolebinding -n production
kubectl get clusterrole,clusterrolebinding

Best practices

  • Least privilege: Grant only the verbs and resources needed; use resourceNames when appropriate.
  • Prefer namespaced Role + RoleBinding so one namespace compromise does not affect others.
  • Use ClusterRole + RoleBinding (not ClusterRoleBinding) when you want the same role in several namespaces without cluster-wide access.
  • Use dedicated ServiceAccounts per app and bind minimal Roles to them; avoid using default ServiceAccount for workloads.

2. Service accounts

A ServiceAccount is an identity for processes running inside the cluster (typically pods). When a pod calls the Kubernetes API, it authenticates using its ServiceAccount token. The API server then applies RBAC to determine what that identity can do.

Default and custom ServiceAccounts

  • default: Every namespace has a default ServiceAccount. Pods that do not set spec.serviceAccountName use default in that namespace. Avoid using it for application pods; create a dedicated ServiceAccount per app and grant only the RBAC it needs.
  • Creating a ServiceAccount: No role is attached by default; you must create a Role or ClusterRole and a RoleBinding or ClusterRoleBinding that references the ServiceAccount.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production

Or create it imperatively:

kubectl create serviceaccount my-app -n production

Assigning a ServiceAccount to a pod

Set spec.serviceAccountName in the pod spec (spec.serviceAccount is a deprecated alias; prefer serviceAccountName). The pod can then use the token mounted by Kubernetes to talk to the API server.

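apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  selector:                        # required for apps/v1 Deployments
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app                # must match the selector
    spec:
      serviceAccountName: my-app   # pods run as this ServiceAccount
      containers:
        - name: app
          image: my-registry.io/my-app:latest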

The ServiceAccount must exist in the same namespace as the pod. If it does not, the API server rejects the pod, so it is never created or started.

  • Legacy token: Kubernetes used to auto-mount a long-lived, Secret-backed token for the pod’s ServiceAccount at /var/run/secrets/kubernetes.io/serviceaccount/token. Such a token has no expiry, which makes it a risk if the pod or node is compromised. Do not rely on it for new workloads.
  • Token volume projection: You can mount a short-lived, audience-bound token using a projected volume. This is the recommended way for pods that need to call the API. You set an expiration (e.g. 1 hour) and an audience (e.g. api or your OIDC audience). The kubelet refreshes the token before it expires.

spec:
  serviceAccountName: my-app
  containers:
    - name: app
      image: my-registry.io/my-app:latest
      volumeMounts:
        - name: token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: token
      projected:
        sources:
          - serviceAccountToken:
              path: token
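              expirationSeconds: 3600   # 1 hour; the kubelet rotates the file before expiry
              audience: api             # the token's recipient must accept this audience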

If you use projected tokens, ensure the ServiceAccount has the right RBAC (Role + RoleBinding or ClusterRole + RoleBinding) so the pod can perform only the API actions it needs.

Disabling automatic token mount

You can set automountServiceAccountToken: false on the pod so no ServiceAccount token is mounted. Use this for pods that never call the API, to reduce exposure.

spec:
  serviceAccountName: my-app
  automountServiceAccountToken: false
  containers:
    - name: app
      image: my-registry.io/my-app:latest
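
The same switch also exists on the ServiceAccount object, where it turns off automounting for every pod that uses the account unless a pod spec overrides it. A minimal sketch, reusing the my-app ServiceAccount from the earlier example:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
# Applies to all pods using this ServiceAccount; a pod-level
# automountServiceAccountToken setting takes precedence.
automountServiceAccountToken: false
```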

End-to-end: ServiceAccount + RBAC + pod

  1. Create a ServiceAccount (e.g. my-app in production).
  2. Create a Role (or ClusterRole) with the minimum required verbs and resources.
  3. Create a RoleBinding (or ClusterRoleBinding) that binds that Role to the ServiceAccount my-app in production.
  4. Create the Deployment (or Pod) with serviceAccountName: my-app and, if the app needs to call the API, a projected token volume with expiration and audience.

Then the pod authenticates as system:serviceaccount:production:my-app and RBAC allows only the permissions you granted.
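
Steps 2 and 3 above can be sketched as follows. The Role name and the choice of read-only access to ConfigMaps are assumptions made for illustration; substitute the verbs and resources your app actually needs.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-app-configmap-reader   # hypothetical name
  namespace: production
rules:
  - apiGroups: [""]               # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-configmap-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: my-app-configmap-reader
```

To check the result, run kubectl auth can-i list configmaps -n production --as=system:serviceaccount:production:my-app; it should answer yes, while the same query for secrets should answer no.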

3. Secrets management

  • Kubernetes Secrets store sensitive data (passwords, tokens, TLS certs) base64-encoded, which is an encoding and not encryption; by default they are not encrypted at rest in etcd. Enable encryption at rest for the API server (e.g. with a KMS provider) in production.
  • Avoid putting secrets in plain YAML in Git. Use external secret managers (e.g. HashiCorp Vault, AWS Secrets Manager, Azure Key Vault) with operators (e.g. External Secrets Operator) to sync into Kubernetes Secrets.
  • Prefer projected volumes or CSI secret stores so pods get only the secrets they need.
# Mount a secret as a file in a pod
spec:
  containers:
    - name: app
      volumeMounts:
        - name: db-secret
          mountPath: /etc/secrets
          readOnly: true
  volumes:
    - name: db-secret
      secret:
        secretName: db-credentials
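
Encryption at rest, mentioned in the first bullet above, is configured through a file that the API server loads via its --encryption-provider-config flag. The sketch below uses the local aescbc provider only to keep the example short; the key name is illustrative and the key value is a placeholder you must generate yourself. A KMS provider keeps the key off the control-plane node and is the better fit for production.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # The first provider encrypts new writes.
      - aescbc:
          keys:
            - name: key1                            # illustrative key name
              secret: <base64-encoded 32-byte key>  # placeholder, not a real key
      # identity (no encryption) stays last so existing plaintext Secrets remain readable.
      - identity: {}
```

Secrets written before the change stay unencrypted until they are rewritten, for example by re-applying them.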

4. Network policies

By default, Kubernetes places no restrictions on pod-to-pod traffic: any pod can talk to any other pod in the cluster. NetworkPolicy restricts ingress/egress traffic (e.g. only allow frontend → backend, block cross-namespace traffic).

# Allow only pods with label role=frontend to reach pods with label app=api on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 8080
  • Enforcing NetworkPolicy requires a CNI plugin that supports it (e.g. Calico, Cilium).
  • Start with deny-by-default or explicit allow lists for critical namespaces.
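
A deny-by-default baseline for a namespace can be sketched as below. The policy name is an assumption; the empty podSelector selects every pod in the namespace, and listing both policy types without any rules blocks all ingress and egress.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all   # hypothetical name
  namespace: production
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Allow rules such as api-allow-frontend then act as explicit exceptions. Denying egress also blocks DNS lookups, so most namespaces need an additional policy that allows traffic to the cluster DNS service.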

5. Pod security (security context and Pod Security Standards)

  • Security context on pods/containers: run as non-root user (runAsNonRoot, runAsUser), drop capabilities (securityContext.capabilities.drop: ["ALL"]), read-only root filesystem where possible.
  • Pod Security Standards (PSS): Privileged, Baseline, Restricted. Enforce via Pod Security Admission (labels on namespaces) or a policy engine (e.g. OPA Gatekeeper, Kyverno).
# Example: restricted-style pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        readOnlyRootFilesystem: true
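
Pod Security Admission is switched on per namespace with labels. A minimal sketch, assuming you want the Restricted profile enforced in production and violations also surfaced as warnings and audit annotations:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the Restricted profile.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Also warn the client and annotate audit events.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

A common rollout path is to start with only warn and audit, fix the workloads that are flagged, and add enforce afterwards.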

6. Image security

  • Use private or trusted registries; avoid latest tag in production.
  • Image scanning (e.g. Trivy, Snyk) in CI and at admission (e.g. Trivy admission controller, Gatekeeper) to block vulnerable images.
  • Image signing and verification: use Cosign and policy-controller (or similar) so only signed images are allowed.
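
The first point can be illustrated with a pod spec fragment that pulls from a private registry and pins a version. The Secret name and the version number are made up for this example.

```yaml
spec:
  imagePullSecrets:
    - name: registry-credentials   # hypothetical Secret of type kubernetes.io/dockerconfigjson
  containers:
    - name: app
      # Pin an explicit version tag instead of :latest (1.4.2 is a made-up version).
      image: my-registry.io/my-app:1.4.2
      # For an immutable reference, pin the digest instead:
      # image: my-registry.io/my-app@sha256:<digest>
```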

7. Control plane and node hardening

  • API server: Restrict access (firewall, private endpoints); enable audit logging; use admission controllers (e.g. PodSecurity, validating webhooks) to enforce policies.
  • etcd: Encrypt at rest; restrict network access to API server only.
  • Nodes: Keep OS and kubelet/runtime updated; use node hardening (CIS benchmarks); consider read-only root filesystem and minimal images for the host where possible.
  • kubelet: Disable anonymous access (--anonymous-auth=false, or authentication.anonymous.enabled: false in the kubelet configuration file); use NodeRestriction admission to limit what kubelets can do.
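
In the kubelet configuration file, anonymous access is controlled under authentication.anonymous. The fragment below is a sketch of the hardening-related fields only; delegating authorization to the API server and closing the read-only port go beyond what the bullet above names and are included as commonly paired settings.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false     # reject unauthenticated requests to the kubelet API
  webhook:
    enabled: true      # validate bearer tokens against the API server
authorization:
  mode: Webhook        # ask the API server whether a request is allowed
readOnlyPort: 0        # disable the unauthenticated read-only port
```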

8. Admission control

Admission controllers run after authentication and authorization; they can mutate or validate requests before the object is stored in etcd. Use them to enforce security and governance.

  • Built-in: PodSecurity (enforces Pod Security Standards), NodeRestriction (limits what kubelets can modify), ResourceQuota, LimitRanger (default/limit resources per namespace or pod).
  • Validating / mutating webhooks: Your own services receive AdmissionReview requests and allow or deny (and optionally patch) the object. Use for custom rules (e.g. “all pods must have a sidecar”, “no hostPath”).
  • Policy engines: Open Policy Agent (OPA) Gatekeeper or Kyverno run as admission webhooks and enforce policies defined in CRDs (e.g. “only images from registry X”, “every Namespace must have a label”).
# Example: LimitRanger sets default requests/limits in a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        memory: "256Mi"
        cpu: "200m"
      defaultRequest:
        memory: "64Mi"
        cpu: "50m"
      type: Container
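
ResourceQuota, the other built-in controller listed above, caps the total a namespace may consume. A sketch with illustrative numbers and an assumed object name:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota   # hypothetical name
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
```

Once a quota covers CPU or memory, pods that omit those requests or limits are rejected, which is one reason to pair it with LimitRange defaults.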

9. Audit logging and compliance

  • Audit logging: The API server can log every request (metadata or full body) to a file or backend. Enable audit policy and store logs in a secure, append-only store. Use for compliance (who did what, when) and incident response.
  • Compliance: Map controls to frameworks (e.g. CIS Kubernetes Benchmark, SOC 2, PCI-DSS). Use tools such as kube-bench to check the cluster against the CIS Benchmark and report the results, and index audit logs in a searchable backend (e.g. OpenSearch).
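
An audit policy is a file passed to the API server with --audit-policy-file, with --audit-log-path selecting the log destination. The rule set below is a sketch and not a recommended baseline: rules are matched in order, and the choice of which resources get which level is an assumption for illustration.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived          # skip the duplicate event emitted before processing
rules:
  # Never log request or response bodies for sensitive objects.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Log full bodies for RBAC changes.
  - level: RequestResponse
    resources:
      - group: "rbac.authorization.k8s.io"
  # Everything else: who, what, when.
  - level: Metadata
```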

10. Summary: security checklist

| Area | Practices |
| --- | --- |
| Access | RBAC with least privilege; avoid cluster-admin; use dedicated ServiceAccounts and projected tokens. |
| Secrets | Encryption at rest; external secret manager; minimal exposure to pods. |
| Network | NetworkPolicy for segmentation; restrict egress where possible. |
| Workloads | Non-root, drop capabilities, read-only root; enforce PSS (Baseline/Restricted). |
| Images | Scan in CI and at admission; sign and verify images. |
| Admission | PodSecurity, LimitRanger, webhooks or Gatekeeper/Kyverno for custom policy. |
| Cluster | Harden API server, etcd, and nodes; enable audit logging; follow CIS benchmarks. |

Enhancements and operational best practices

Beyond basic Deployments and Services, you can improve reliability, performance, and observability with the following.

Resource requests and limits

Always set requests and limits for CPU and memory so the scheduler can place pods correctly and the node can enforce limits. Without them, pods can overcommit and cause noisy-neighbour or OOM issues.

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"
  • Requests: Minimum guaranteed; scheduler uses them for placement.
  • Limits: Hard cap; exceeding memory leads to OOM kill; exceeding CPU leads to throttling.
  • Use LimitRanger or namespace defaults so every container gets requests/limits even if the manifest omits them.

Horizontal and vertical scaling

  • Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas based on CPU, memory, or custom/external metrics (e.g. from Prometheus). Keeps latency and utilization in range under load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  • Vertical Pod Autoscaler (VPA): Recommends or updates CPU/memory requests and limits based on actual usage. Useful for right-sizing over time.

Pod Disruption Budgets (PDB)

A PodDisruptionBudget limits how many pods of a given selector can be down at once during voluntary disruptions (node drain, cluster upgrade). That way kubectl drain and the cluster autoscaler can evict pods without dropping below your desired availability.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app

Use minAvailable or maxUnavailable; combine with multiple replicas so at least one stays up during drains.

Topology spread and affinity

  • Topology spread constraints: Spread pods across zones or nodes (e.g. topologyKey: topology.kubernetes.io/zone) to reduce blast radius of a single failure.
  • Pod affinity / anti-affinity: Prefer (or require) pods to run on the same node (affinity) or on different nodes (anti-affinity) for high availability or colocation.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
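
Pod affinity is the colocation counterpart. A sketch, assuming a hypothetical cache workload labelled app: cache that this pod prefers to share a node with:

```yaml
spec:
  affinity:
    podAffinity:
      # Soft preference: schedule next to a cache pod when possible.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: cache                      # hypothetical label
            topologyKey: kubernetes.io/hostname
```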

Health probes and readiness

  • livenessProbe: If it fails, the kubelet restarts the container. Use for “is the process dead?” (e.g. HTTP /health or TCP).
  • readinessProbe: If it fails, the pod is removed from Service endpoints (no traffic). Use for “is the app ready to serve?” (e.g. dependencies up, cache warm).
  • Set initialDelaySeconds and periodSeconds so slow starters are not killed and probes do not overload the app.
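
The probes above might be configured as in the following sketch. The paths /healthz and /ready and port 8080 are assumptions about the application; the startupProbe is an addition not discussed above and is another way to protect slow starters, because liveness and readiness checks wait until it succeeds.

```yaml
spec:
  containers:
    - name: app
      image: my-registry.io/my-app:latest
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz        # assumed endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 30    # allows up to 150 s for startup
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3     # restart after 3 consecutive failures
      readinessProbe:
        httpGet:
          path: /ready          # assumed endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 3     # remove from endpoints after 3 failures
```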

Observability: metrics, logging, tracing

  • Metrics: Expose Prometheus-style metrics from the app or use cAdvisor/kubelet metrics. kube-state-metrics exposes cluster object state. Use HorizontalPodAutoscaler with custom metrics from Prometheus for scaling.
  • Logging: Centralize logs (e.g. Fluent Bit or Fluentd as DaemonSet, Loki or Elasticsearch as backend). Avoid storing secrets in log lines.
  • Tracing: Use OpenTelemetry or Jaeger so requests can be traced across services. Inject trace context via sidecar or SDK.

Backup and disaster recovery

  • etcd: Back up etcd regularly (snapshots); restore procedure is critical for control-plane recovery. Many managed offerings (EKS, AKS, GKE) handle this; self-managed clusters need a process.
  • Workloads and persistent data: Use Velero (or similar) to back up cluster resources (and optionally PV snapshots) to object storage. Restore to the same or another cluster for DR.
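
With Velero installed, recurring backups can be declared as a Schedule resource. The sketch below assumes Velero runs in the velero namespace and that a daily backup of production kept for 30 days is wanted; the name and timings are illustrative.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production   # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"    # cron syntax: every day at 02:00
  template:
    includedNamespaces:
      - production
    ttl: 720h0m0s          # keep each backup for 30 days
```

A backup is only as good as its restore, so rehearse restoring into a scratch cluster periodically.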

High availability (HA) for Kubernetes

High availability means your workloads stay available despite node failures, zone outages, voluntary disruptions (drains, upgrades), and load spikes. This section covers six mechanisms: pod anti-affinity, topology spread constraints, Pod Disruption Budgets, PriorityClass, HPA, and VPA, with architecture, trade-offs, and examples.

HA architecture overview

Together, these controls affect where pods run (spread and affinity), how many can be down at once (PDB), who gets evicted first under pressure (PriorityClass), and how many replicas and how much CPU/memory they get (HPA, VPA).

+--------------------------- High availability layers --------------------------+
|                                                                               |
|  Placement (where pods run)          Protection (during disruptions)          |
|  +------------------------+          +-------------------------------------+  |
|  | Pod anti-affinity      |          | PodDisruptionBudget (minAvailable / |  |
|  | Topology spread        |          |  maxUnavailable)                    |  |
|  +------------------------+          +-------------------------------------+  |
|                                                                               |
|  Preemption (who stays when nodes are full)                                   |
|  +------------------------+                                                   |
|  | PriorityClass          |  -->  Higher priority pods preferred; lower       |
|  | (priority, preemption) |        can be evicted to make room.               |
|  +------------------------+                                                   |
|                                                                               |
|  Scaling (replicas and resources)                                             |
|  +------------------------+  +------------------------+                       |
|  | HPA                    |  | VPA                    |                       |
|  | (replica count by      |  | (requests/limits by    |                       |
|  |  CPU/memory/metrics)   |  |  actual usage)         |                       |
|  +------------------------+  +------------------------+                       |
|                                                                               |
+-------------------------------------------------------------------------------+

1. Pod anti-affinity

Pod anti-affinity tells the scheduler to avoid placing new pods on the same node (or zone) as existing pods matching a label selector. That spreads replicas across nodes so a single node failure does not take down all replicas. Trade-off: Strict anti-affinity (requiredDuringSchedulingIgnoredDuringExecution) can consume more resources: the scheduler may leave nodes underutilized or even fail to schedule if there are fewer nodes than replicas. Use preferredDuringSchedulingIgnoredDuringExecution when you want a soft preference without blocking scheduling.

Example: required (hard) anti-affinity — one pod per node

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: my-registry.io/my-app:latest

If you have fewer than 3 nodes, one or more pods will stay Pending. Use only when you have enough nodes and accept the resource cost.

Example: preferred (soft) anti-affinity — spread when possible

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname

The scheduler tries to put pods on different nodes but will schedule anyway if it cannot (e.g. single node); no extra pending pods.

2. Pod topology spread constraints

Topology spread constraints spread pods across domains (e.g. node, zone, region) and limit the skew (difference in count between the most and least populated domain). They are a flexible way to achieve HA without the strict “one per node” rule of hard anti-affinity. You can combine multiple constraints (e.g. spread by hostname and by zone).

  • topologyKey: Domain to spread over (e.g. kubernetes.io/hostname, topology.kubernetes.io/zone).
  • maxSkew: Maximum allowed difference in the number of matching pods between any two domains. For example, maxSkew: 1 with 4 replicas and 3 zones allows a 2/1/1 distribution but not 2/2/0 or 3/1/0.
  • whenUnsatisfiable: DoNotSchedule (hard: do not schedule if it would violate skew) or ScheduleAnyway (soft: prefer spread but allow scheduling).

Example: spread across zones and nodes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: app
          image: my-registry.io/my-app:latest

Here, zone spread is hard (the pod counts of any two zones may differ by at most one), and node spread is soft (best-effort across nodes).

3. Pod Disruption Budget (PDB)

A PodDisruptionBudget limits how many pods of a given selector can be unavailable at once during voluntary disruptions (e.g. kubectl drain, node upgrade, cluster autoscaler scale-in). The eviction API and drain logic respect PDBs: they will not evict pods if that would break the PDB. PDBs do not protect against involuntary failures (node crash, OOM); use replicas + spread for that.

  • minAvailable: Minimum number of pods that must remain available (absolute number or percentage, e.g. minAvailable: 1 or minAvailable: "50%").
  • maxUnavailable: Maximum number of pods that can be unavailable (absolute or percentage). Use either minAvailable or maxUnavailable, not both.

Example: at least 2 pods available, or at most 25% unavailable

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
---
# Alternative: percentage-based. Apply only one PDB per set of pods; overlapping PDBs make evictions fail.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb-pct
  namespace: production
spec:
  maxUnavailable: "25%"
  selector:
    matchLabels:
      app: my-app

Combine PDB with multiple replicas and topology spread (or anti-affinity) so that when a node is drained, remaining pods are still spread and the service stays healthy.

4. PriorityClass

PriorityClass assigns a numeric priority to pods. When the scheduler or kubelet (e.g. under resource pressure) must choose which pods to run or evict, higher priority pods are preferred. Lower-priority pods can be preempted (evicted) to make room for pending higher-priority pods. Use this so critical workloads (e.g. production app) keep running and best-effort or batch workloads give way.

  • value: 32-bit integer; higher = higher priority. User-defined classes can use any value up to 1,000,000,000; larger values are reserved for the built-in system classes (system-cluster-critical and system-node-critical).
  • globalDefault: If true, this class is the default for pods that do not specify a priorityClassName. Only one PriorityClass should have this.
  • preemptionPolicy: PreemptLowerPriority (default) or Never. Never means the pod is scheduled only when resources are available without preempting others.

Example: high-priority app and low-priority batch

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-app
value: 10000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-batch
value: 100
globalDefault: false
preemptionPolicy: Never
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app               # must match spec.selector
    spec:
      priorityClassName: high-priority-app
      containers:
        - name: app
          image: my-registry.io/my-app:latest

Under node pressure, pods with low-priority-batch can be evicted before pods with high-priority-app. Use PriorityClass sparingly and document which workloads are critical.
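
The batch side of the example can be sketched as a Job that opts into the low-priority class. The Job name and image are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report                          # hypothetical name
spec:
  template:
    spec:
      priorityClassName: low-priority-batch     # defined above
      restartPolicy: Never
      containers:
        - name: report
          image: my-registry.io/report:1.0.0    # hypothetical image
```

Because low-priority-batch sets preemptionPolicy: Never, this Job waits for free capacity instead of evicting other pods, yet it can still be preempted by pods from high-priority-app.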

5. Horizontal Pod Autoscaler (HPA)

HPA automatically adjusts the number of pod replicas (Deployment, StatefulSet, or other scale subresource) based on metrics (CPU, memory, or custom/external). When load rises, HPA increases replicas; when load falls, it scales down within minReplicas and maxReplicas. This keeps the application available and responsive under variable load.

Metrics: Resource (CPU/memory from the metrics server), Pods (average of a per-pod metric), Object (a metric describing another object, e.g. an Ingress), or External (a metric from outside the cluster, e.g. queue length from a monitoring system). The behavior field (scaleUp/scaleDown) tunes how fast HPA reacts.

Example: CPU and memory-based HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

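The metrics-server must be installed in the cluster for Resource metrics, and the target containers must set resource requests, because Utilization is measured as a percentage of the request. For custom or external metrics, you need an adapter (e.g. Prometheus Adapter) that implements the custom.metrics.k8s.io or external.metrics.k8s.io API.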
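
A Pods-type metric looks like the sketch below. It assumes an adapter already exposes a per-pod metric called http_requests_per_second through custom.metrics.k8s.io; that metric name, the HPA name, and the target of 50 are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa-rps                       # hypothetical name
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second     # hypothetical metric served by the adapter
        target:
          type: AverageValue
          averageValue: "50"                 # aim for ~50 req/s per pod
```

A workload should be targeted by a single HPA, so in practice this entry would be added to the metrics list of the existing HPA instead of being applied as a second object.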

6. Vertical Pod Autoscaler (VPA)

VPA recommends or automatically updates CPU and memory requests and limits for pods based on actual usage. It helps right-size workloads over time so they get enough resources (fewer OOMs and throttling) without over-provisioning. VPA runs as a set of controllers (recommender, updater, and admission controller) and supports four update modes: Off (recommend only), Initial (apply only at pod creation), Recreate (evict and recreate pods to apply new values), and Auto (the default, which in current releases behaves like Recreate).

Note: VPA is not in-tree; install the Vertical Pod Autoscaler from the Kubernetes autoscaler repo. VPA and HPA together: If you use both for the same pods, avoid autoscaling on the same resource (e.g. use VPA for memory and HPA for CPU, or use VPA in recommendation mode and tune requests manually).

Example: VPA for a Deployment

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Recreate"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
        controlledResources: ["cpu", "memory"]

VPA will recommend or apply new requests/limits within minAllowed and maxAllowed and, with updateMode: Recreate, will recreate pods to apply changes. Use updateMode: Off to only get recommendations without automatic updates.

Summary: HA mechanisms

| Mechanism | Purpose | Trade-off / note |
| --- | --- | --- |
| Pod anti-affinity | Spread replicas across nodes/zones; avoid single point of failure. | Hard anti-affinity can waste capacity or block scheduling; prefer soft or topology spread. |
| Topology spread | Limit skew across topology domains (node, zone, region). | Flexible; combine zone (hard) + node (soft) for HA without over-constraining. |
| PDB | Limit voluntary disruption (drain, upgrade) so enough replicas stay up. | Only for voluntary disruptions; pair with replicas and spread. |
| PriorityClass | Ensure critical pods are scheduled and not preempted before best-effort. | Use sparingly; document and avoid too many high-priority workloads. |
| HPA | Scale replica count by CPU, memory, or custom metrics. | Needs metrics-server (and optionally custom metrics adapter); set min/max and behavior. |
| VPA | Right-size CPU/memory requests and limits from actual usage. | Install separately; use with HPA carefully (different resources or recommendation-only). |

Control and management

How you operate and govern clusters at scale—lifecycle, automation, and policy.

Cluster setup options

You can run Kubernetes on public clouds, on-prem, or locally. Common options:

  • Managed services (GKE, EKS, AKS): The provider runs the control plane and often offers managed node pools, automatic upgrades, and integrated logging/monitoring. Best for production when you want to focus on workloads rather than cluster ops.
  • kubeadm: Official tool to bootstrap a cluster (control plane and join nodes). Works on various Linux distros; you manage VMs, networking, and upgrades. Good for custom or air-gapped environments.
  • kops (Kubernetes Operations): Provisions and manages clusters on AWS (and other clouds); supports high availability, multiple AZs, and rolling updates. Alternative to hand-rolling with kubeadm on cloud.
  • Local / dev: minikube, kind (Kubernetes in Docker), or k3d (k3s in Docker) run a small cluster on your machine for development and learning.

Provider-specific scripts (e.g. legacy kube-up.sh for GCE/AWS) are largely superseded by kubeadm and managed offerings.
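
For local experiments with the HA features described earlier, a multi-node kind cluster is usually enough. A sketch of a kind configuration with one control-plane node and two workers; the file name in the note below is an assumption.

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

Save it as kind-config.yaml and create the cluster with kind create cluster --config kind-config.yaml.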

Cluster lifecycle and upgrades

  • Kubernetes version skew: The version skew policy constrains upgrade order: kubelets may be older than the API server by a limited number of minor versions, but never newer. Upgrade the control plane first, then nodes (kubelet and runtime), one minor version at a time.
  • Node management: Rolling node updates (drain, update, uncordon) or use managed node pools that handle OS and runtime upgrades. Cluster Autoscaler adds/removes nodes based on pending pods and utilization.
  • Managed services: AKS, EKS, GKE (and others) manage control-plane upgrades and often node images; you choose when to adopt new versions.

Operators and custom resources (CRDs)

  • Custom Resource Definitions (CRDs): Extend the API with your own resources (e.g. PostgresCluster, Certificate). Controllers watch these resources and reconcile real state (e.g. create pods, call external APIs).
  • Operators: Pattern that packages domain logic (install, upgrade, backup) for an app (e.g. Prometheus Operator, cert-manager, PostgreSQL Operator). They use CRDs and controllers so you manage the app declaratively like native Kubernetes resources.
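
As an illustration of a custom resource managed by an operator, here is a sketch of a cert-manager Certificate. It assumes cert-manager is installed and that a ClusterIssuer named letsencrypt-prod exists; the issuer name, host name, and Secret name are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-tls
  namespace: production
spec:
  secretName: my-app-tls          # Secret the controller creates and renews
  dnsNames:
    - my-app.example.com          # hypothetical host name
  issuerRef:
    name: letsencrypt-prod        # hypothetical ClusterIssuer
    kind: ClusterIssuer
```

You declare the desired certificate; the controller requests it, stores it in the named Secret, and renews it before expiry.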

GitOps and declarative delivery

  • GitOps: Git is the source of truth for desired cluster state. A controller (e.g. Flux, Argo CD) watches a Git repo (and optionally image registries) and applies manifests or Helm charts to the cluster. Changes are made by PR; rollback by revert.
  • Benefits: Audit trail, consistency across envs, approval workflows, and separation between “what to run” (Git) and “what is running” (cluster). Use with Kustomize or Helm for environment-specific overlays.
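
With Argo CD, the link between a Git path and a cluster namespace is itself a resource. A sketch, assuming Argo CD runs in the argocd namespace; the repository URL and overlay path are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/my-app-config.git   # hypothetical repo
    targetRevision: main
    path: overlays/production                             # hypothetical Kustomize overlay
  destination:
    server: https://kubernetes.default.svc                # the cluster Argo CD runs in
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift in the cluster
```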

Policy enforcement and governance

  • Policy as code: Define rules (e.g. “all images must be from registry X”, “no privileged pods”) and enforce them at admission (Gatekeeper, Kyverno) or in CI (e.g. Conftest, OPA) before apply.
  • Kyverno: Kubernetes-native policy (no separate language); generate resources (e.g. add NetworkPolicy when Namespace is created) and validate/mutate. Good for tenant isolation and compliance.
  • Multi-cluster and governance: Use fleet or multi-cluster tools (e.g. Rancher, Argo CD with multiple clusters, GKE Fleet) to apply policies and apps across many clusters from a single place.
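
A Kyverno validation rule might look like the sketch below, which rejects pods that lack an app.kubernetes.io/name label. The policy and rule names are illustrative, and the placement of the enforcement setting has changed between Kyverno releases (older releases use spec.validationFailureAction as shown, newer ones a per-rule field), so check it against the version you run.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-app-label            # hypothetical name
spec:
  validationFailureAction: Enforce   # assumption: field placement depends on the Kyverno version
  rules:
    - name: check-app-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label app.kubernetes.io/name is required."
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: "?*"   # any non-empty value
```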

Summary: control and management

| Area | Practices |
| --- | --- |
| Lifecycle | Plan upgrades (control plane then nodes); use managed node pools or Autoscaler. |
| Extensibility | CRDs and operators for custom apps; use well-maintained operators. |
| Delivery | GitOps (Flux, Argo CD) with Git as source of truth; Kustomize/Helm for overlays. |
| Governance | Admission and policy (Gatekeeper, Kyverno); multi-cluster policy where needed. |

Summary

  • Architecture: Control plane (API server, etcd, scheduler, controllers) manages the cluster; data plane (kubelet, kube-proxy, container runtime) runs pods on nodes.
  • Pods run your containers (init containers, multi-container pods); Deployments manage replicas and rolling updates via ReplicaSet; Services (ClusterIP, NodePort, LoadBalancer, headless) and Ingress expose pods on the network.
  • Deployment strategies: RollingUpdate (default; maxSurge/maxUnavailable) or Recreate; blue-green and canary are done via separate Deployments and traffic switching. Use kubectl rollout for status, pause, and undo.
  • Workloads: Deployment for stateless apps; DaemonSet for one pod per node (logging, monitoring); StatefulSet for stateful apps with stable names and PVCs; Job for batch/completion; CronJob for scheduled Jobs.
  • Config: ConfigMap and Secret inject configuration; PersistentVolumeClaim and StorageClass provide persistent storage.
  • Use kubectl to apply YAML and inspect pods, deployments, services, and other resources.
  • Security: RBAC, NetworkPolicy, Secrets (encryption at rest, external managers), pod security contexts and Pod Security Standards, admission control (PodSecurity, LimitRanger, webhooks, Gatekeeper/Kyverno), ServiceAccount and token projection, audit logging, and control-plane/node hardening.
  • Enhancements: Set resource requests/limits; tune health probes; add metrics, logging, and tracing; back up etcd and use Velero for DR.
  • High availability (HA): Use pod anti-affinity (soft preferred to avoid extra resource cost), topology spread constraints (zone + node), PodDisruptionBudget, PriorityClass for critical vs best-effort, HPA for replica scaling, and VPA for right-sizing requests/limits.
  • Control and management: Plan cluster upgrades and node lifecycle; use operators and CRDs for complex apps; adopt GitOps (Flux, Argo CD) for declarative delivery; enforce policy with Gatekeeper or Kyverno and govern multi-cluster where needed.

For production, add ConfigMaps, Secrets, Ingress, security and policy, and observability, and run on a managed cluster (e.g. AKS, EKS, GKE). For learning, a local setup such as minikube or kind is enough.
