Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. As the industry shifted from monolithic services toward microservices, the need emerged for an orchestrator that could run independent, lightweight application containers with high availability, scalability, and robustness — that is exactly what Kubernetes delivers.

⚡ Quick Takeaways
  • Pod — the smallest deployable unit; wraps one or more containers and gets a shared virtual IP inside the cluster.
  • Service — gives Pods a stable IP/DNS name that survives Pod restarts; doubles as an internal load balancer.
  • Control plane — API Server (single gateway), Scheduler (places Pods), Controller Manager (reconciliation loops), etcd (source of truth).
  • Deployment vs StatefulSet — Deployments for stateless apps; StatefulSets for databases that need stable ordered identities.
  • Secrets are base64, not encrypted — pair with HashiCorp Vault or a KMS for real secret management.
  • Helm — the Kubernetes package manager; bundles YAML manifests into versioned, parameterized Charts.
tldr

Kubernetes organizes containers into Pods, assigns them stable network identities via Services, and coordinates everything through a master node running the API Server, Scheduler, Controller Manager, and etcd. Worker nodes run Kubelet and Kube Proxy to host the actual workloads.

Kubernetes cluster architecture overview
Kubernetes cluster architecture overview

Core Primitives

Pod

The fundamental unit in Kubernetes. A Pod is an abstraction layer over one or more containers, making Kubernetes independent of the underlying container runtime (Docker, containerd, CRI-O). Each Pod receives a virtual IP address for intra-cluster communication, but this IP changes whenever the Pod is replaced — which is why Pods need a stable addressing mechanism on top.

Service

A Service provides a static IP address that persists independently of Pod lifecycles. When a Pod dies and a new one starts, the Service's IP stays fixed and continues routing traffic to healthy Pods. Services also act as an internal load balancer, distributing requests across multiple replicas of the same application.

Ingress

Ingress sits at the cluster boundary and behaves like a reverse proxy. It translates an external domain name (e.g. api.example.com) into the correct internal Service IP and port based on routing rules, so you expose one entry point rather than a NodePort per service.

ConfigMap & Secret

ConfigMap externalizes application configuration (environment variables, config files) so you can update settings without rebuilding the container image. Secret mirrors ConfigMap but stores sensitive data — credentials, API keys, TLS certificates — encoded in base64. Neither should be used as a secure vault; for production secrets management, pair them with tools like HashiCorp Vault.

Volumes

Container filesystems are ephemeral — data disappears when a Pod restarts. Volumes attach persistent storage (local disk or remote block/object storage) to a Pod, ensuring stateful workloads like databases survive Pod replacements.

Deployment vs. StatefulSet

AspectDeploymentStatefulSet
Use caseStateless apps (web servers, APIs)Stateful apps (databases, message brokers)
Pod identityInterchangeable — any replica can handle any requestStable, ordered identity (pod-0, pod-1 …)
Data syncNot managedOrdered rollouts ensure leader/follower sync
Common patternMost microservicesMany teams prefer external DBs over StatefulSet

Architecture: Master & Worker Nodes

Worker Node Components

Master Node Components

Configuration (YAML)

Every Kubernetes resource is declared as YAML. The four mandatory top-level keys are apiVersion, kind, metadata, and spec. The cluster continuously reconciles actual state with the declared spec, and writes the current state back into the status field automatically.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.0.0
          ports:
            - containerPort: 8080

Namespaces

Namespaces provide logical isolation within a single physical cluster, letting teams share infrastructure while keeping resources separated by function, team, or environment. Kubernetes ships with four built-in namespaces:

Important scoping rules: ConfigMaps and Secrets are namespace-scoped — each namespace needs its own copy. Services can be referenced across namespaces using the FQDN service.namespace.svc.cluster.local. Volumes and Nodes are global resources not bound to any namespace.

Helm

Helm is the package manager for Kubernetes. It bundles all the YAML manifests for an application into a distributable unit called a Chart. Helm also acts as a templating engine — you parameterize values (image tag, replica count, resource limits) in a values.yaml file, making it straightforward to deploy the same chart to dev, staging, and production with different settings.

bash
# install a chart from a public repo
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-redis bitnami/redis \
  --set auth.password=secret \
  --set replica.replicaCount=2

# list running releases
helm list

# roll back to previous release
helm rollback my-redis 1

Workload Controllers: DaemonSet and Jobs

DaemonSet

A DaemonSet ensures that exactly one copy of a Pod runs on every node (or every node matching a selector). It is the right tool for infrastructure-level agents that must be node-local: log collectors (Fluentd, Fluent Bit), metrics exporters (node-exporter), network plugins (CNI agents), and security scanners. As nodes join the cluster, the DaemonSet controller automatically places a Pod on each new node. As nodes leave, their Pod is garbage collected.

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      tolerations:                      # tolerate master-node taint so it runs everywhere
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.1
          volumeMounts:
            - name: varlog
              mountPath: /var/log
      volumes:
        - name: varlog
          hostPath:
            path: /var/log

Deployment vs. StatefulSet vs. DaemonSet — Decision Matrix

AspectDeploymentStatefulSetDaemonSet
Pod identityInterchangeable — random suffixStable: pod-0, pod-1, pod-2One per node, node-named
StorageShared or nonePer-Pod PVC, retained on deleteHostPath typical
Rollout orderConcurrent (maxSurge / maxUnavailable)Ordered: pod-0 → pod-1 → …Node-by-node rolling update
Scale downRandom Pod removedHighest ordinal removed firstPod per node, not manually scaled
Use casesWeb servers, APIs, workersDatabases, Kafka, ZooKeeper, ElasticsearchLog agents, node exporters, CNI plugins

Scheduling and Node Affinity

The Kubernetes Scheduler does more than just "find a node with enough CPU and RAM." It applies a sophisticated two-phase algorithm: filtering eliminates nodes that cannot satisfy the Pod's requirements, and scoring ranks the remaining nodes by preference. Understanding this lets you control exactly where workloads land.

Node Selectors and Node Affinity

nodeSelector is the simplest form: a map of key-value pairs that must match a node's labels. Node Affinity is more powerful — it supports requiredDuringSchedulingIgnoredDuringExecution (hard constraint) and preferredDuringSchedulingIgnoredDuringExecution (soft preference with a weight).

yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard: only GPU nodes
        nodeSelectorTerms:
          - matchExpressions:
            - key: accelerator
              operator: In
              values: [nvidia-a100]
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:  # soft: spread replicas across zones
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app
            topologyKey: topology.kubernetes.io/zone

Taints and Tolerations

Taints work in the opposite direction: they repel Pods from nodes unless a Pod explicitly tolerates the taint. Common uses include dedicating nodes to specific workloads (GPU nodes that only accept ML jobs), excluding Pods from control-plane nodes, or gracefully draining nodes for maintenance (kubectl drain adds a NoSchedule taint automatically).

bash
# Taint a node so only GPU-tolerating Pods land on it
kubectl taint nodes gpu-node-1 accelerator=gpu:NoSchedule

# Pod must declare this toleration to be scheduled on gpu-node-1
# tolerations:
#   - key: "accelerator"
#     operator: "Equal"
#     value: "gpu"
#     effect: "NoSchedule"

Networking: CNI, Services, and Ingress in Depth

Kubernetes networking follows four fundamental rules: every Pod gets a unique IP, every Pod can communicate with every other Pod without NAT, every node can communicate with every Pod, and Pods see the same IP that external agents use to address them. Enforcing these rules is the job of the Container Network Interface plugin.

CNI Plugins

The Container Network Interface (CNI) is a specification that Kubernetes calls when a Pod is created or deleted to configure network interfaces. Popular implementations differ significantly in their networking model:

Service Types

TypeBehaviorUse case
ClusterIPVirtual IP reachable only inside the cluster; default typeInternal microservice-to-microservice communication
NodePortExposes the Service on a port (30000–32767) on every node's external IPDev/test external access; not for production
LoadBalancerProvisions a cloud load balancer (AWS ELB, GCP LB) with an external IP pointing at the ServiceExposing a single Service to the internet in cloud environments
ExternalNameReturns a CNAME DNS alias to an external hostname; no proxyingPointing cluster services at external DBs or SaaS APIs
Headless (ClusterIP: None)No virtual IP; DNS returns Pod IPs directlyStatefulSets, service discovery by the application itself (Kafka, Cassandra)

Ingress and Ingress Controllers

An Ingress resource defines routing rules (hostname, path prefix → Service), but it does nothing without an Ingress Controller — a running Pod that watches Ingress objects and reconfigures itself accordingly. Common controllers include Nginx Ingress Controller, Traefik, HAProxy, and the cloud-native AWS ALB Ingress Controller. You can run multiple controllers in one cluster by annotating Ingresses with ingressClassName.

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: letsencrypt-prod   # auto TLS via cert-manager
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: api-tls-cert
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1/orders
            pathType: Prefix
            backend:
              service:
                name: order-service
                port:
                  number: 8080
          - path: /v1/products
            pathType: Prefix
            backend:
              service:
                name: product-service
                port:
                  number: 8080

Storage: PV, PVC, and StorageClass

Kubernetes abstracts storage through a three-layer hierarchy: PersistentVolume (PV) represents actual storage capacity in the cluster (provisioned by an admin or dynamically by a StorageClass); PersistentVolumeClaim (PVC) is a user's request for storage with specific size and access mode requirements; and StorageClass defines the provisioner and parameters for dynamic PV creation.

yaml
--- StorageClass: defines how PVs are dynamically provisioned
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Retain          # don't delete EBS volume when PVC is deleted
volumeBindingMode: WaitForFirstConsumer  # provision in same AZ as the Pod

--- PVC: user requests 50 GiB from the fast-ssd StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes: [ReadWriteOnce]    # only one node can mount read-write
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 50Gi

Access Modes

StatefulSet + PVC

StatefulSets use volumeClaimTemplates to automatically create one PVC per Pod (e.g., postgres-data-postgres-0, postgres-data-postgres-1). These PVCs are not deleted when the StatefulSet is scaled down or deleted — a deliberate safety net so you don't accidentally wipe your database. Delete them explicitly when you truly want the data gone.

Probes and Self-Healing

Kubernetes makes three kinds of health probes available for every container. Configuring them correctly is the difference between a self-healing cluster and one where failed Pods silently take traffic until an on-call engineer notices.

Liveness Probe

Determines whether the container is alive. A failing liveness probe causes kubelet to kill and restart the container. Use it to detect deadlocks and unrecoverable hangs — situations where the process is running but not actually doing work. Be careful not to set the initial threshold too low: a liveness probe that fires during startup kills the Pod before it's ready, causing a restart loop.

Readiness Probe

Determines whether the container is ready to serve traffic. A failing readiness probe removes the Pod from the Service's endpoints — traffic stops being routed to it — but the container is not restarted. This is the right probe for slow-starting applications, warmup periods, or temporary overload: the Pod stays alive but is taken out of rotation until it signals readiness again.

Startup Probe

Runs during container startup and disables liveness and readiness probes until it succeeds. Essential for applications with long startup times (e.g., JVM services, databases loading large datasets). Once the startup probe succeeds, liveness and readiness probes take over. Without it, a liveness probe might kill a legitimately-starting container.

yaml
containers:
  - name: order-service
    image: order-service:2.3.1
    startupProbe:               # allow up to 60s to start
      httpGet:
        path: /actuator/health
        port: 8080
      failureThreshold: 12      # 12 * 5s = 60s max startup time
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /actuator/health/liveness
        port: 8080
      initialDelaySeconds: 0    # startup probe already handled the delay
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /actuator/health/readiness
        port: 8080
      periodSeconds: 5
      failureThreshold: 2

Autoscaling: HPA, VPA, and Cluster Autoscaler

Kubernetes offers three complementary autoscaling mechanisms that operate at different granularities. Understanding when to use each — and how they interact — is a common interview topic.

Horizontal Pod Autoscaler (HPA)

HPA scales the number of Pod replicas up or down based on observed metrics. The default metric is CPU utilization relative to the Pod's resource request, but custom metrics (request rate, queue depth, external metrics from Prometheus) are supported via the custom metrics API. HPA polls the metrics server every 15 seconds and adjusts replicas within the bounds of minReplicas and maxReplicas.

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale up if CPU > 60% of request
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"    # 1000 RPS per replica
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min before scaling down

Vertical Pod Autoscaler (VPA)

VPA adjusts resource requests and limits (CPU and memory) for individual Pods rather than the replica count. In Auto mode it evicts and restarts Pods with updated requests; in Off mode it only provides recommendations. VPA is useful when you don't know the right resource requests up front, but it conflicts with HPA on CPU/memory — use VPA for requests and HPA on custom metrics, or use VPA in recommendation-only mode alongside HPA.

Cluster Autoscaler

Both HPA and VPA assume there are available nodes to schedule Pods onto. When there aren't, the Cluster Autoscaler (CA) talks to the cloud provider's API (AWS, GCP, Azure) to add or remove nodes from the cluster's node groups. CA adds nodes when Pods are unschedulable due to insufficient resources, and removes underutilized nodes after a configurable cool-down period (default 10 minutes). Importantly, CA respects Pod Disruption Budgets during scale-down: it won't remove a node if doing so would violate your PDB.

autoscaling recipe

The standard production setup: HPA on CPU + custom metrics to adjust replica count; VPA in recommendation mode to right-size your resource requests; Cluster Autoscaler to add/remove nodes as needed. Set resource requests and limits on every container — without requests, the HPA has no baseline to compute utilization against, and the scheduler can't make informed placement decisions.

The Operator Pattern

A Kubernetes Operator is a custom controller that encodes the operational knowledge of a specific application into code. It extends the Kubernetes API with Custom Resource Definitions (CRDs) — new resource types like PostgresCluster, KafkaCluster, or ElasticsearchCluster — and runs a control loop that watches those resources and drives the cluster toward the desired state they declare.

Why Operators Exist

Kubernetes natively handles stateless workloads beautifully: deploy, scale, rollback. But stateful systems — databases, message brokers, search engines — require domain-specific operational logic: initializing a cluster, electing a leader, performing a rolling upgrade that respects quorum, taking consistent backups, restoring from a snapshot. A human operator executing runbooks encodes this knowledge; a Kubernetes Operator automates it.

yaml
# Using the CloudNativePG Operator to declare a PostgreSQL cluster
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: orders-db
spec:
  instances: 3           # 1 primary + 2 replicas
  primaryUpdateStrategy: unsupervised
  storage:
    size: 100Gi
    storageClass: fast-ssd
  backup:
    barmanObjectStore:
      destinationPath: s3://my-bucket/backups/orders-db
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID

The CloudNativePG Operator sees this Cluster CR and handles the rest: creating the primary Pod, setting up streaming replication to the replicas, configuring a headless Service for each Pod's stable DNS name, configuring a separate Service for primary writes vs. replica reads, setting up continuous WAL archiving to S3, and performing rolling upgrades that promote a replica before shutting down the primary to minimize downtime. All of that operational complexity — which would require a lengthy runbook — is now encoded as code and triggered by a single YAML declaration.

Popular Operators

Resource Management: Requests, Limits, and QoS Classes

Every container should declare CPU and memory requests (guaranteed allocation used by the scheduler) and limits (the maximum the container may consume). The relationship between them determines a Pod's QoS class, which controls eviction priority under node memory pressure.

QoS ClassConditionEviction Priority
Guaranteedrequests == limits for every container in the PodLast to be evicted; reserved capacity
BurstableAt least one container has requests set, but requests < limitsEvicted after BestEffort; eviction order by memory usage relative to request
BestEffortNo requests or limits set at allFirst to be evicted under memory pressure
best practice

Set CPU requests = your steady-state usage and CPU limits = 2-4x requests to allow bursting. For memory, set requests = your p99 usage and limits = requests (i.e., Guaranteed class) for latency-sensitive services — a container that exceeds its memory limit is OOMKilled immediately; a pod that exceeds its CPU limit is only throttled (not killed), making memory limits far more consequential.

takeaway

Think of Kubernetes as a self-healing, declarative control loop: you describe what you want in YAML, and the master node's Controller Manager continuously drives actual state toward it. Pods give you isolation; Services give you stable addressing; etcd gives you durable cluster memory; and Helm gives you reusable, versioned packaging. Layer in HPA + Cluster Autoscaler for elasticity, Operators for stateful applications, and proper probe configuration to make the self-healing loop actually work.

🎯 interview hot-takes

What happens when a Pod crashes? The Controller Manager detects divergence from the desired replica count and instructs the Scheduler to place a new Pod — the Service's IP stays unchanged throughout.
Deployment vs StatefulSet vs DaemonSet? Deployments for stateless apps (interchangeable Pods); StatefulSets for databases (stable ordered Pod identity, per-Pod PVC); DaemonSets for node-local infrastructure agents (one Pod per node).
Why is etcd critical? It is the single source of truth for all cluster state; losing etcd without a backup means losing the entire cluster configuration.
HPA vs VPA vs Cluster Autoscaler? HPA scales replica count; VPA adjusts per-Pod resource requests; Cluster Autoscaler adds/removes nodes. Use all three together: HPA on custom metrics, VPA in recommendation mode, CA to provision capacity.
What is a liveness vs readiness probe? Liveness failure kills and restarts the container; readiness failure removes it from the Service endpoint without restarting — use readiness for warmup and temporary overload.
What is the Operator pattern? A custom controller that encodes operational runbooks as code, using CRDs to extend the Kubernetes API with domain-specific resource types like KafkaCluster or PostgresCluster.

← previous
Apache Kafka