StatefulSet is used instead of Deployment when we want to deploy stateful pods (such as databases) with replication between them. One of the database pods is set up as master and the rest as slaves.

StatefulSet is similar to Deployment. It’s a template to deploy pods. It supports scaling, updates, rollbacks etc.

StatefulSet deploys pods in sequential order (ordered, graceful deployment). The next pod is deployed only after the previous pod is Running and Ready. This helps ensure that the master pod is deployed first and only then the slaves are brought up one by one. When the StatefulSet is scaled in, the pods are terminated sequentially in reverse order. Deleting the StatefulSet itself does not guarantee this ordering; for an ordered shutdown, scale it down to 0 before deleting it.
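The ordering described above corresponds to the podManagementPolicy field of the StatefulSet spec, whose default value is OrderedReady. A minimal sketch that sets it explicitly, reusing the names db and mysql-h used elsewhere in these notes (setting the field is optional, since it is already the default):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: mysql-h
  replicas: 3
  # OrderedReady is the default: db-0, db-1, db-2 are created one at a time,
  # and scale-in removes them in reverse order (db-2 first).
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      name: db
  template:
    metadata:
      labels:
        name: db
    spec:
      containers:
        - name: db
          image: mysql
```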

StatefulSets assign each pod a name with an ordinal suffix as the pods are brought up: <stateful-set-name>-x, where x is 0, 1, 2, 3, and so on. Since the master is deployed first, the master pod in any StatefulSet will be named <stateful-set-name>-0. These names are stable, so a pod keeps the same name when it is recreated. Using a headless service allows us to use these ordinal pod names to form DNS names for the pods. This way, we can configure the database running in the slave pods to reach the master database at a predictable hostname.

<aside> 💡 A K8s Deployment object cannot be used in this scenario, since it brings up all the pods at the same time without any fixed order. Also, the generated pod names contain a random slug, which changes whenever a pod is recreated. So, the master pod cannot have a fixed pod name. This means the slave pods cannot reliably reach the master pod to set up continuous replication.

</aside>

StatefulSet definition file

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
  labels:
    name: db
spec:
  serviceName: mysql-h
  replicas: 3
  selector:
    matchLabels:
      name: db
  template:
    metadata:
      labels:
        name: db
    spec:
      containers:
      - name: db
        image: mysql

A StatefulSet definition file is written the same way as a Deployment definition file. Only the kind changes, and a serviceName property is added to the spec section, pointing to a headless service.

The StatefulSet uses the headless service to create unique predictable DNS records to reach a specific pod in the StatefulSet.

Headless Service

A headless service creates a predictable DNS entry for each pod in a StatefulSet. This allows any other pod in the cluster to reach any pod in the StatefulSet by its DNS name. Unlike a regular service in K8s, a headless service has no cluster IP and does not load balance requests. Instead, each pod's DNS name resolves directly to that pod's IP, so a request reaches the specific pod it was addressed to.

In the diagram, the green service load balances the read requests coming from the web pod across the database pods. The headless service mysql-h creates a DNS entry for each database pod. This allows the web pod to reach the master database pod mysql-0 to perform writes.

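The DNS names of the pods have the form <pod-name>.<headless-service-name>.<namespace>.svc.<cluster-domain>, for example mysql-0.mysql-h.default.svc.cluster.local in the default namespace of a cluster that uses the default cluster domain.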
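As a sketch of how a client might use such a name, the Pod below passes the master's DNS name to a web container through an environment variable. The pod name web, the image example/web-app, the variable name DB_WRITE_HOST, the default namespace, and the cluster.local cluster domain are all assumptions for illustration; only mysql-0 and mysql-h come from these notes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                      # hypothetical client pod
spec:
  containers:
    - name: web
      image: example/web-app     # hypothetical image
      env:
        # Format: <pod-name>.<headless-service>.<namespace>.svc.<cluster-domain>
        - name: DB_WRITE_HOST    # hypothetical variable read by the application
          value: mysql-0.mysql-h.default.svc.cluster.local
```

Because the name always resolves to the pod with ordinal 0, the application does not need to be reconfigured when that pod is recreated with a new IP.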

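[Diagram: the web pod sends reads to a load-balancing service and writes to mysql-0 through the headless service mysql-h]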

apiVersion: v1
kind: Service
metadata:
  name: mysql-h
spec:
  ports:
    - port: 3306
  selector:
    app: mysql
  clusterIP: None

Setting clusterIP: None in a service definition file makes it headless. In the example, 3306 is the port the database listens on in each pod. Because a headless service does no proxying, a client that resolves a pod's DNS name connects to that pod directly on this port. The selector picks the pods of the StatefulSet for which DNS entries are created, based on the pod name and the cluster domain.

Storage in StatefulSets

PV shared between pods

Attaching a single PVC (with a storage class configured) to the database pods provisions one PV and mounts it into every pod. This means all the pods (instances of the application) share the same storage volume.


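Note that simultaneous reads/writes by multiple instances are not supported by all volume types. Sharing a volume across pods on different nodes requires a volume type that supports the ReadWriteMany access mode.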

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  serviceName: mysql-h
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql
          volumeMounts:
            - name: data-volume
              mountPath: /var/lib/mysql
      # Every replica mounts the same PVC, so all pods share one PV.
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: data-volume
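The StatefulSet above refers to a PVC named data-volume that has to exist separately. A minimal sketch of such a claim follows; the storage class name shared-storage, the 1Gi size, and the ReadWriteMany access mode are assumptions, and the storage class must be backed by a volume type that actually supports shared access.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-volume                  # matches claimName in the StatefulSet
spec:
  accessModes:
    - ReadWriteMany                  # assumption: lets pods on different nodes share the volume
  storageClassName: shared-storage   # hypothetical storage class
  resources:
    requests:
      storage: 1Gi                   # assumed size
```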

Dedicated PV for each Pod

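We can configure the StatefulSet so that each database pod gets its own PVC (with a storage class configured), which provisions a dedicated PV for that pod. This is done with the volumeClaimTemplates field of the StatefulSet spec. Each replica then keeps its own copy of the data, which allows us to implement read-replicas at the database layer.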
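A sketch of this setup, using the volumeClaimTemplates field of the StatefulSet spec in place of a shared volumes entry. The storage class name standard and the 1Gi size are assumptions; the remaining names follow the mysql example in these notes.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql-h
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql
          volumeMounts:
            - name: data-volume          # refers to the claim template below
              mountPath: /var/lib/mysql
  # One PVC is created per pod from this template.
  volumeClaimTemplates:
    - metadata:
        name: data-volume
      spec:
        accessModes:
          - ReadWriteOnce                # each volume is used by a single pod
        storageClassName: standard       # hypothetical storage class
        resources:
          requests:
            storage: 1Gi                 # assumed size
```

The claims created from the template are named <template-name>-<pod-name>, so here they would be data-volume-mysql-0, data-volume-mysql-1, and data-volume-mysql-2. A recreated pod reattaches to the claim with its own ordinal.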
