Kubernetes
Introduction

Kubernetes is a container orchestration tool that provides high availability and scalability, and it also supports disaster recovery (backup and restore).
Architecture
A cluster is made up of a master node (control plane) and a number of worker nodes.
Namespace
Namespaces are intended for use in environments with many users spread across multiple teams, or projects. For clusters with a few to tens of users, you should not need to create or think about namespaces at all. Start using namespaces when you need the features they provide.
Namespaces are a way to divide cluster resources between multiple users
Prevent resource starvation: By setting resource quotas, you can prevent individual pods or containers from consuming too many resources and causing resource starvation for other pods or containers running on the same node.
Ensure fair resource allocation: Resource quotas can ensure that each namespace or user on the cluster receives a fair share of the available resources, preventing any one user or application from monopolizing resources.
Enforce compliance and governance: Resource quotas can be used to enforce compliance and governance policies, such as limiting the amount of data that can be stored in a particular namespace or restricting the use of certain types of resources.
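For example, a ResourceQuota object caps what a single namespace may consume (a minimal sketch; the namespace name team-a and the numbers are only illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"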
Worker Node
Each node runs a kubelet, which facilitates communication with the master node. For instance, the kubelet receives the signal to start a container, uses the container runtime to start the pod, monitors its life cycle (including readiness and liveness probes), and reports back to the kube-apiserver.
The worker node is where the actual work happens; the containers of the different applications are deployed on it.
Master Node
kube-apiserver accepts commands that view or change the state of the cluster, including launching pods; it is the component the kubectl command talks to.
etcd is the cluster's database. It holds all of the cluster configuration data as well as more dynamic information, such as which nodes are part of the cluster, which pods should be running, and where they should be running.
kube-scheduler is responsible for scheduling pods onto nodes. When it discovers a pod object that doesn't yet have a node assignment, it chooses a node and simply writes the name of that node into the pod object.
kube-controller-manager continuously monitors the state of the cluster through the kube-apiserver. Whenever the current state of the cluster doesn't match the desired state, it attempts to make changes to achieve the desired state.
Components
Node
A node corresponds to a virtual (or physical) machine.
Pod
It is the smallest deployable unit in Kubernetes.
A node can contain multiple pods.
It is an abstraction layer over one or more containers, which are managed by a container runtime.
There is usually one container per pod.
Each pod has its own internal IP address, so pods can communicate with each other inside the same virtual network by IP address.
If a pod dies, it is replaced by a new pod with a new internal IP address.
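A minimal Pod manifest looks like this (a sketch; the nginx image and names are only for illustration):
apiVersion: v1
kind: Pod
metadata:
  name: nginx-demo
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80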
Ingress
An Ingress may be configured to give Services externally-reachable URLs, load balance traffic, terminate SSL / TLS, and offer name-based virtual hosting. An Ingress controller is responsible for fulfilling the Ingress, usually with a load balancer, though it may also configure your edge router or additional frontends to help handle the traffic.
apiVersion: networking.k8s.io/v1 # replaces extensions/v1beta1, which was removed in Kubernetes 1.22
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/rewrite-target: /
  name: nginx-demo
  namespace: default
spec:
  rules:
  - http:
      paths:
      - path: /nginx
        pathType: Prefix
        backend:
          service:
            name: nginx
            port:
              number: 80
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - name: html-file
          mountPath: /usr/share/nginx/html
      volumes:
      - name: html-file
        configMap:
          name: nginx-index-v1
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default # must be in the same namespace as the Deployment and the Ingress backend
  labels:
    app: nginx
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-index-v1
data:
  index.html: Nginx V1
Service
An abstract way to expose an application running on a set of Pods as a network service
The set of Pods targeted by a Service is usually determined by a selector that you define
There are mainly 4 types of Services:
Cluster IP (default): Exposes the Service on a cluster-internal IP. Choosing this value makes the Service reachable only from within the cluster.
Node Port: Exposes the Service on each node's IP at a static port; clients send requests to a node's IP address at the defined nodePort.
Load Balancer: Clients send requests to the IP address of a network load balancer.
External Name: Maps the Service to the contents of the externalName field (e.g. foo.bar.example.com) by returning a CNAME record with its value. No proxying of any kind is set up.
apiVersion: v1
kind: Service
metadata:
  name: webapp-service
spec:
  type: NodePort
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 3000
    targetPort: 3000
    nodePort: 30100
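An ExternalName Service has no selector; it only returns a CNAME record (a sketch reusing the hostname from the list above; the Service name is hypothetical):
apiVersion: v1
kind: Service
metadata:
  name: external-backend
spec:
  type: ExternalName
  externalName: foo.bar.example.com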
Deployment
It is declared based on a blueprint (template) of the pods.
The desired state is defined in a Deployment, and the controller manager changes the actual state to the desired state at a controlled rate.
It creates a ReplicaSet, which in turn creates the pods. If you create a Deployment named counter, it creates a ReplicaSet named counter-<replica-set-id>, which in turn creates Pods named counter-<replica-set-id>-<pod-id>.
With the default RollingUpdate strategy, a new ReplicaSet is created and the Deployment moves the Pods from the old ReplicaSet to the new one at a controlled rate.
The number of replicas can be scaled automatically with a HorizontalPodAutoscaler based on rules you define (see the sketch after the example below).
In the example below, only one PVC is created, which all the replicas share.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: counter
spec:
  replicas: 3
  selector:
    matchLabels:
      app: counter
  template:
    metadata:
      labels:
        app: counter
    spec:
      containers:
      - name: counter
        image: "kahootali/counter:1.1"
        volumeMounts:
        - name: counter
          mountPath: /app/
      volumes:
      - name: counter
        persistentVolumeClaim:
          claimName: counter
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: counter
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
  storageClassName: efs
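A HorizontalPodAutoscaler for the Deployment above could look like the sketch below (the CPU target and replica bounds are only illustrative, and the cluster needs a metrics source such as metrics-server):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: counter
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: counter
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # scale out when average CPU usage exceeds 70% of the request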

Stateful Set
Similar to a Deployment, but intended for stateful applications (ones with persistent disk storage).
It creates the pods itself and does not rely on a ReplicaSet.
Every replica of a StatefulSet has its own state, and each pod creates its own PVC (PersistentVolumeClaim). So a StatefulSet with 3 replicas creates 3 pods, each having its own volume, i.e. 3 PVCs in total.
StatefulSets don't create a ReplicaSet or anything of that sort, so a rollback does not go through a ReplicaSet the way it does for a Deployment; you can delete or scale the StatefulSet up/down. If you update a StatefulSet, it also performs a RollingUpdate, i.e. one replica pod goes down, the updated pod comes up, then the next replica pod goes down in the same manner.
The serviceName field must reference a governing headless Service (see the sketch after the example below).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: counter
spec:
  serviceName: "counter-app"
  selector:
    matchLabels:
      app: counter
  replicas: 1
  template:
    metadata:
      labels:
        app: counter
    spec:
      containers:
      - name: counter
        image: "kahootali/counter:1.1"
        volumeMounts:
        - name: counter
          mountPath: /app/
  volumeClaimTemplates:
  - metadata:
      name: counter
    spec:
      accessModes: [ "ReadWriteMany" ]
      storageClassName: efs
      resources:
        requests:
          storage: 50Mi
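The serviceName above must refer to a governing headless Service; a minimal sketch matching the labels in the example (the port is only illustrative):
apiVersion: v1
kind: Service
metadata:
  name: counter-app
spec:
  clusterIP: None # headless: gives each pod a stable DNS entry instead of load balancing
  selector:
    app: counter
  ports:
  - port: 80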

DaemonSet
A DaemonSet is a controller that ensures that the pod runs on all the nodes of the cluster. If a node is added/removed from a cluster, DaemonSet automatically adds/deletes the pod.
The created number of pods is equal to the number of nodes
Some typical use cases of a DaemonSet are to run cluster-level applications like:
Monitoring Exporters: You would want to monitor all the nodes of your cluster so you will need to run a monitor on all the nodes of the cluster like NodeExporter.
Logs Collection Daemon: You would want to export logs from all nodes so you would need a DaemonSet of log collector like Fluentd to export logs from all your nodes.
DaemonSets don't create a ReplicaSet or anything of that sort either, so a rollback does not go through a ReplicaSet the way a Deployment rollback does.
In the example below, all pods share the same volume (a single PVC).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: counter-app
spec:
  selector:
    matchLabels:
      app: counter
  template:
    metadata:
      name: counter-app
      labels:
        app: counter
    spec:
      tolerations:
      - effect: NoSchedule
        operator: Exists
      containers:
      - name: counter
        image: "kahootali/counter:1.1"
        volumeMounts:
        - name: counter
          mountPath: /app/
      volumes:
      - name: counter
        persistentVolumeClaim:
          claimName: counter
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: counter
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Mi
  storageClassName: efs

Config Map & Secret
apiVersion: v1
kind: ConfigMap
metadata:
  name: game-demo
data:
  # property-like keys; each key maps to a simple value
  player_initial_lives: "3"
  ui_properties_file_name: "user-interface.properties"
  # file-like keys
  game.properties: |
    enemy.types=aliens,monsters
    player.maximum-lives=5
  user-interface.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
apiVersion: v1
kind: Secret
metadata:
  name: demo-secret
type: Opaque
data:
  username: YWRtaW4= # base64 of "admin"
  password: cGFzc3dvcmQ= # base64 of "password"
Both can be used to store configuration as key-value pairs, but a Secret is used for sensitive data (e.g. passwords). Secret values are stored base64-encoded in etcd, and etcd can additionally be configured to encrypt them at rest.
Kubernetes can expose a Secret to the container where the application is running either as a mounted volume or as environment variables, so the application reads the values as files or env vars.
Here is an example of attaching the ConfigMap and Secret above to a Deployment's environment via envFrom:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
spec:
  selector:
    matchLabels:
      app: loki # selector must match the template labels
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
      - name: loki
        image: grafana/loki
        envFrom:
        - configMapRef:
            name: game-demo
        - secretRef:
            name: demo-secret # must match the Secret name defined above
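The same Secret can also be mounted as files instead of environment variables (a pod-spec fragment sketch; the mount path is only illustrative):
spec:
  containers:
  - name: loki
    image: grafana/loki
    volumeMounts:
    - name: secret-volume
      mountPath: /etc/secrets # each key of the Secret becomes a file here
      readOnly: true
  volumes:
  - name: secret-volume
    secret:
      secretName: demo-secret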
Cron Job

Kubernetes will need at least three different objects (CronJob, Job, and Pod) to fulfill the cron task.
When the controller finds a CronJob to execute (meaning the current time matches the time specified via cron syntax), it will create another object called Job
A Job creates one or more Pods based on the configuration passed down from the CronJob via jobTemplate, and it will continue to retry execution of the Pods until a specified number of them successfully terminate.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: helloworld
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: task
            image: busybox
            # acts as the entry point of the docker image
            command:
            - /bin/sh
            - -c
            - date; echo 'Hello World from Cronitor.io';
Vs Application Cron job
Advantage
Does not rely on any framework or language
Can be managed simply via infrastructure as code with a YAML file
Disadvantage
Not suitable for complex logic, e.g. dependencies between cron jobs, or retries based on logical (business-level) failures
Difficult to trigger the cron job manually
Volume
There are several types of volumes:
emptyDir
Temporary storage that exists only during pod lifetime
Data is lost when pod is deleted
volumes:
- name: cache-volume
  emptyDir: {}
hostPath
Mounts a directory from the host node's filesystem
Data persists across pod restarts but is tied to a specific node
Good for accessing node logs or docker socket
volumes:
- name: docker-socket
  hostPath:
    path: /var/run/docker.sock
ConfigMap
For mounting configuration data as files
Read-only by default
volumes:
- name: config-volume
  configMap:
    name: my-config
PersistentVolume (PV) and PersistentVolumeClaim (PVC)
For persistent storage that survives pod restarts
volumes:
- name: data-volume
  persistentVolumeClaim:
    claimName: my-pvc
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce # Can be mounted as read-write by a single node
  resources:
    requests:
      storage: 10Gi # Requesting 10GB of storage
  storageClassName: standard # What kind of storage to use
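A PVC binds to a PersistentVolume (PV). With a StorageClass the PV is usually provisioned dynamically, but it can also be created by hand; a minimal sketch of a hypothetical hostPath-backed PV that could satisfy the claim above:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  storageClassName: standard
  hostPath:
    path: /data/my-pv # node-local path, suitable for demos only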
Storage Class
Defines what type of storage you want
Determines storage characteristics like:
Performance (IOPS, throughput)
Reliability
Backup policies
Cost tier
Here is an example
# Storage Class Definition
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd # The type of storage
reclaimPolicy: Delete # what happens to the provisioned volume when the claim is deleted
---
# PVC using the Storage Class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-storage
spec:
  storageClassName: fast-storage
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
# Pod using the PVC
apiVersion: v1
kind: Pod
metadata:
  name: database-pod
spec:
  containers:
  - name: database
    image: mysql:5.7
    volumeMounts:
    - name: storage
      mountPath: /var/lib/mysql
  volumes:
  - name: storage
    persistentVolumeClaim:
      claimName: database-storage
Volume Mount
To attach the volume to the container path
apiVersion: apps/v1
kind: Deployment
metadata:
  name: counter
spec:
  replicas: 3
  selector:
    matchLabels:
      app: counter
  template:
    metadata:
      labels:
        app: counter
    spec:
      containers:
      - name: counter
        image: "kahootali/counter:1.1"
        # Mount volume to container path
        volumeMounts:
        - name: counter
          mountPath: /app/
      # Declare the volume
      volumes:
      - name: counter
        persistentVolumeClaim:
          claimName: counter
Resource Management
resources:
  limits:
    cpu: 500m
    memory: 1024Mi
  requests:
    cpu: 500m
    memory: 1024Mi
Resource management can be declared as part of the deployment file.
A resource request is the amount of CPU and memory that a container requires to run; Kubernetes uses it to allocate resources when the container is scheduled onto a node. When a container with a resource request is scheduled, Kubernetes finds a node that has enough available resources to meet the request and reserves those resources for the container.
A resource limit, on the other hand, is the maximum amount of CPU and memory that a container is allowed to consume. If a container exceeds its CPU limit it is throttled; if a process in the container tries to consume more than the allowed amount of memory, the system kernel terminates the process that attempted the allocation with an out-of-memory (OOM) error.
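In context, the resources block sits under each container in the Pod template; a sketch reusing the counter Deployment from earlier:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: counter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: counter
  template:
    metadata:
      labels:
        app: counter
    spec:
      containers:
      - name: counter
        image: "kahootali/counter:1.1"
        resources:
          requests:
            cpu: 500m
            memory: 1024Mi
          limits:
            cpu: 500m
            memory: 1024Mi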
Health Detection

livenessProbe:
  httpGet:
    path: /v1/health
    port: 8080 # httpGet probes require a port; 8080 here is an assumed application port
  periodSeconds: 300
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /v1/health
    port: 8080
Liveness probes are crucial for ensuring your application stays up and running. If a liveness probe fails, Kubernetes restarts the container (per the pod's restartPolicy) to restore service.
Readiness probes check if your application is ready to receive requests. If a readiness probe fails, Kubernetes will remove the pod’s IP address from the service load balancer. This ensures no requests are forwarded to the pod until it becomes ready again.
Service Discovery
A cluster-aware DNS server, such as CoreDNS, watches the Kubernetes API for new Services and creates a set of DNS records for each one. If DNS has been enabled throughout your cluster then all Pods should automatically be able to resolve Services by their DNS name.
For example, if you have a Service called my-service in a Kubernetes namespace my-ns, the control plane and the DNS Service acting together create a DNS record for my-service.my-ns. Pods in the my-ns namespace should be able to find the service by doing a name lookup for my-service (my-service.my-ns would also work).
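As a sketch (assuming a Service named my-service exists in namespace my-ns), another Pod can reach it through its cluster DNS name:
apiVersion: v1
kind: Pod
metadata:
  name: dns-demo # hypothetical client pod
spec:
  restartPolicy: Never
  containers:
  - name: curl
    image: curlimages/curl
    # the Service is reachable by its cluster DNS name
    command: ["curl", "http://my-service.my-ns:80"]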
Commands
# get pod list
kubectl get pods --namespace <namespace>
# get the pod details
kubectl describe pod <podname> -n <namespace>
# get the log of pod
kubectl logs <podname> -n <namespace>
# open an interactive shell in the pod
kubectl exec --stdin --tty <podname> -n <namespace> -- /bin/bash
# port forward
kubectl port-forward <podname> <local port>:<container port>