Kubernetes Storage on Azure 1 of 3 – built-in storage and NFS

In the previous post, we learned that to host stateful workloads, we need to provide persistent storage on the Kubernetes platform. In this post, I will explore the different storage options. These options are specific to Azure Kubernetes Service; however, the principles apply to any Kubernetes platform regardless of cloud vendor.

In another, older post, I discussed in-tree and CSI storage classes and, from a developer's perspective, how to mount volumes statically and dynamically once a storage class is available. In this post, we are concerned with how to make storage classes available, from a platform specialist's perspective.

Storage in Azure Kubernetes

As soon as an Azure Kubernetes cluster is launched, a number of built-in storage classes are available. Unlike third-party storage classes, they do not require the kubelet identity to be a Contributor on the node resource group. As discussed previously, we will use the CSI-based storage classes:

  • managed-csi
  • managed-csi-premium
  • azurefile-csi
  • azurefile-csi-premium

The main difference between them is the backing technology. However, from a Kubernetes workload's perspective, Pods as storage consumers are concerned with the access mode rather than the backing technology. The supported access modes are:

  • ReadWriteOnce: read-write by a single node
  • ReadOnlyMany: read only by many nodes
  • ReadWriteMany: read-write by many nodes
  • ReadWriteOncePod: new in Kubernetes 1.22 to restrict volume access to a single Pod

The storage classes managed-csi and managed-csi-premium support ReadWriteOnce. The storage classes azurefile-csi and azurefile-csi-premium support ReadWriteMany.
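
To see what is available on a given cluster, we can list the storage classes and inspect one of them with kubectl (class names may differ slightly between AKS versions):

kubectl get storageclass                    # lists the built-in and any custom storage classes
kubectl describe storageclass managed-csi   # shows the provisioner (disk.csi.azure.com) and its parameters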

Apart from these built-in options, Azure also suggests a few more options based on other types of Azure resources. For example, AKS can integrate with Azure HPC Cache for HPC workloads. We can also self-manage a virtual machine configured as an NFS server, and use the NFS subdir external provisioner to configure a storage class, as sketched below. Despite the overhead of managing a VM, you gain more configurability. My previous client reported better performance with this setup than with the built-in options.
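
As a rough sketch of what the NFS route involves (the server address 10.0.1.4 and export path /export/k8s are hypothetical placeholders; check the provisioner's documentation for current chart values), the provisioner can be installed with Helm, and it creates a storage class (named nfs-client by default) backed by subdirectories of the export:

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=10.0.1.4 \
  --set nfs.path=/export/k8s

kubectl get storageclass nfs-client   # the chart's default storage class name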

Another alternative is Azure NetApp Files. However, being a full enterprise-grade solution (similar to FSx for ONTAP), Azure NetApp Files costs an arm and a leg. In this comparison, it costs 60 to 100 times as much as the built-in options.

Another option is Azure Ultra Disk, which needs to be enabled at the cluster level. You can provision performance targets (DiskIOPSReadWrite and DiskMBpsReadWrite) in the storage class. Ultra Disk is a good middle ground between the pricey NetApp Files and the less performant built-in options.
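
A minimal sketch of such a storage class, assuming the Azure disk CSI driver and illustrative IOPS and throughput values (tune the numbers and verify the parameter names against the current driver documentation):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ultra-disk-sc           # hypothetical name
provisioner: disk.csi.azure.com
parameters:
  skuName: UltraSSD_LRS
  DiskIOPSReadWrite: "2000"     # provisioned IOPS (illustrative value)
  DiskMBpsReadWrite: "320"      # provisioned throughput in MB/s (illustrative value)
  cachingMode: None             # Ultra Disk does not support host caching
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true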

Terraform Template for AKS

To explore the storage options, I use my own Terraform template to create an AKS cluster. The template is in the azure directory of the cloudkube repo. It includes a few configurations for Azure Kubernetes that facilitate storage configuration. First, it configures an SSH key pair to use between the bastion host and the nodes, so users can SSH to the Kubernetes nodes from the bastion host as soon as terraform apply completes. Second, the third-party storage options installed after cluster creation need their Pods to instruct Azure to create Azure disks. This requires that the Kubernetes node agent have permission to provision resources in the node resource group. This is important to understand because there are a couple of managed identities at play (refer to this post) when building an AKS cluster.

In Azure Kubernetes, it is the managed identity of the kubelet that needs Contributor permission over the resource group for the nodes (not the one for the AKS cluster itself). A managed identity is expressed by a client ID, an object ID (aka principal ID), and the identity's resource ID. We can find them with an az CLI command as below:
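
A sketch of that lookup, assuming hypothetical resource group and cluster names:

az aks show -g my-aks-rg -n my-aks-cluster \
  --query identityProfile.kubeletidentity -o json
# returns the kubelet identity's clientId, objectId and resourceId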

We can also tell that the kubelet managed identity represents the node agent by connecting to a node and looking at the argument (kubernetes.azure.com/kubelet-identity-client-id) of the kubelet process:
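
A minimal sketch of that check from the bastion host, assuming a hypothetical node IP and admin username:

kubectl get nodes -o wide                   # find a node's internal IP
ssh azureuser@10.240.0.4                    # SSH to the node from the bastion host (username is a placeholder)
ps aux | grep kubelet-identity-client-id    # the kubelet command line carries the label with the client ID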

In the template, I also chose to designate the same BYO identity for both the cluster and the kubelet (node agent), in order to minimize my permission requirements. If I had left the node agent with its default, automatically created identity, I would have to assign that identity as a Contributor on the node resource group, either as a user or via Terraform's identity. Either way, creating that role assignment is beyond what a Contributor is allowed to do.
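
For reference, a hedged sketch of both approaches using the az CLI (all names and IDs are placeholders; the Terraform template expresses the equivalent declaratively):

# approach used in the template: bring your own identity for both the control plane and the kubelet
az aks create -g my-aks-rg -n my-aks-cluster \
  --enable-managed-identity \
  --assign-identity $IDENTITY_RESOURCE_ID \
  --assign-kubelet-identity $IDENTITY_RESOURCE_ID

# alternative: grant the default kubelet identity Contributor on the node resource group,
# which requires Owner or User Access Administrator rights on that scope
az role assignment create \
  --assignee $KUBELET_OBJECT_ID \
  --role Contributor \
  --scope $(az group show -n MC_my-aks-rg_my-aks-cluster_eastus --query id -o tsv)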

Benchmarking with kubestr

I have used the fio utility for storage benchmarking on virtual machines. However, fio by itself is not container-aware: for fio testing on Kubernetes, I would have to build a Docker image and run the test with the target volume attached. Fortunately, the Kasten team shared their initiative in the open-source project Kubestr. The kubestr release is available as an executable for common platforms, and it connects to the cluster the same way kubectl does; here is a demo. To begin with, download the utility to the bastion host and run it without any argument, which prints the storage classes and volume snapshot classes:

curl -L -o kubestr.tar.gz https://github.com/kastenhq/kubestr/releases/download/v0.4.31/kubestr_0.4.31_Linux_amd64.tar.gz
tar -xvf kubestr.tar.gz && rm kubestr.tar.gz && chmod +x kubestr

./kubestr   # if kubectl is configured, this command will print out the details of storage classes and volume snapshot classes

In addition to printing those details, kubestr makes storage benchmarking very simple. All we need to do is give it a storage class name, and it will run four tests by default with common global options (ioengine=libaio verify=0 direct=1 gtod_reduce=1). The four tests are:

JobName     block_size  filesize  iodepth  rw
read_iops   4k          2G        64       randread
write_iops  4k          2G        64       randwrite
read_bw     128k        2G        64       randread
write_bw    128k        2G        64       randwrite

During each test, it measures and reports IOPS and bandwidth (throughput). If your I/O profile falls outside these four jobs, you can customize the test by supplying a fio config file, for example when you need a longer test duration or a larger total size. Before each test, kubestr automatically mounts a test volume using the storage class under test.

./kubestr fio -s my-storage-class  # benchmarking a storage class
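
If you do need a custom profile, a rough sketch looks like the following, assuming the -f (fio config file) and -z (PVC size) flags of the kubestr release used above; the job parameters are illustrative only:

cat <<'EOF' > custom-rw.fio
[global]
ioengine=libaio
direct=1
gtod_reduce=1
verify=0
time_based=1
runtime=120

[mixed_rw]
bs=8k
filesize=4G
iodepth=32
rw=randrw
rwmixread=70
EOF

./kubestr fio -s my-storage-class -f custom-rw.fio -z 200Gi   # larger PVC and a 2-minute mixed read/write job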

For read_iops and write_iops, we mainly look at the IOPS; for read_bw and write_bw, we mainly look at the bandwidth. The IOPS and bandwidth figures based on the samples are reported on the first line of each result, followed by the min, max and average.

Metrics

With kubestr, I ran a performance test across the built-in storage classes of Azure Kubernetes Service, with the results below:

Storage class          | read_iops                        | write_iops                       | read_bw                          | write_bw
managed-csi            | IOPS=314.729797 BW(KiB/s)=1275   | IOPS=297.071136 BW(KiB/s)=1204   | IOPS=315.311188 BW(KiB/s)=40887  | IOPS=261.048645 BW(KiB/s)=33941
managed-csi-premium    | IOPS=493.395844 BW(KiB/s)=1990   | IOPS=426.731812 BW(KiB/s)=1723   | IOPS=455.950348 BW(KiB/s)=58894  | IOPS=422.888855 BW(KiB/s)=54662
azurefile-csi          | IOPS=259.333282 BW(KiB/s)=1053   | IOPS=283.985779 BW(KiB/s)=1152   | IOPS=240.447403 BW(KiB/s)=31298  | IOPS=230.689804 BW(KiB/s)=30048
azurefile-csi-premium  | IOPS=394.044739 BW(KiB/s)=1586   | IOPS=371.181793 BW(KiB/s)=1494   | IOPS=380.360535 BW(KiB/s)=49018  | IOPS=491.313446 BW(KiB/s)=63310

These numbers are based on the default test parameters (e.g. 100Gi PVC size). They indicate that block storage generally performs better with the default settings, so we should use managed disks instead of Azure Files unless multiple Pods need to access the same volume.

Validate Persistent Storage

We can use MySQL as a quick-and-dirty test of a storage class, by deploying the following manifest:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
spec:
  storageClassName: managed-csi-premium # the storage class being tested
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: mysql
data:
  password: eHl6   # base64 code of xyz
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - image: mysql:5.6
          name: mysql
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql
                  key: password
          ports:
            - containerPort: 3306
              name: mysql
          volumeMounts:
            - name: mysql-persistent-storage
              mountPath: /var/lib/mysql
      volumes:
        - name: mysql-persistent-storage
          persistentVolumeClaim:
            claimName: mysql-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-service
  labels:
    service: mysql
spec:
  selector:
    app: mysql
  ports:
  - name: tcp-mysql
    protocol: TCP
    port: 3306
    targetPort: 3306
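
Assuming the manifest is saved as mysql.yaml (a hypothetical file name), we apply it and wait for the volume to be provisioned and the Pod to become ready:

kubectl apply -f mysql.yaml
kubectl get pvc mysql-pvc            # STATUS should become Bound once the volume is provisioned
kubectl get pods -l app=mysql -w     # wait until the mysql Pod is Running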

Once the Pod is running, we use a throwaway Pod to connect to the mysql service and create some data:

$ kubectl run mysql-cli --rm -i --tty --image imega/mysql-client -- /bin/sh
If you don't see a command prompt, try pressing enter.
/ # mysql --host=mysql-service.default.svc.cluster.local --user=root --password=xyz
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 5.6.51 MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [(none)]>

From the MySQL shell, we create a test database and populate it with dummy data:

CREATE DATABASE `testdb`; 
USE testdb; 
CREATE TABLE IF NOT EXISTS tasks ( 
  task_id INT AUTO_INCREMENT PRIMARY KEY, 
  title VARCHAR(255) NOT NULL, 
  description TEXT, 
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP 
) ENGINE=INNODB; 
INSERT INTO tasks (title, description) VALUES ('Job A','Morning Standup');
INSERT INTO tasks (title, description) VALUES ('Job B','Latte with two shots of espresso');
INSERT INTO tasks (title, description) VALUES ('Job C','Coding coding and coding');
INSERT INTO tasks (title, description) VALUES ('Job D','git commit');

Each SQL command should return the number of rows affected. Then we can exit the MySQL shell and the Pod shell; once we exit the Pod shell, the throwaway Pod is deleted. We can reconnect to validate that the data are still present:

$ kubectl run mysql-tester --rm -i --tty --image imega/mysql-client -- mysql --host=mysql-service.default.svc.cluster.local --user=root --password=xyz --database testdb --execute='SELECT * FROM tasks;'
If you don't see a command prompt, try pressing enter.
Got error: Access denied for user 'root'@'147.206.3.15' (using password: NO)
+---------+-------+----------------------------------+---------------------+
| task_id | title | description                      | created_at          |
+---------+-------+----------------------------------+---------------------+
|       1 | Job A | Morning Standup                  | 2022-06-22 20:29:12 |
|       2 | Job B | Latte with two shots of espresso | 2022-06-22 20:29:12 |
|       3 | Job C | Coding coding and coding         | 2022-06-22 20:29:12 |
|       4 | Job D | git commit                       | 2022-06-22 20:29:13 |
+---------+-------+----------------------------------+---------------------+
Session ended, resume using 'kubectl attach mysql-tester -c mysql-tester -i -t' command when the pod is running
pod "mysql-tester" deleted

The output validates that the data persists in storage. To test a different storage class, simply start over with a different storage class specified in the PVC part of the manifest, as sketched below.
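
A hedged sketch of the start-over step, assuming the mysql.yaml file name from above (note that with the built-in classes' Delete reclaim policy, removing the PVC also deletes the underlying disk or file share):

kubectl delete -f mysql.yaml    # removes the Deployment, Service, Secret and PVC
# edit spec.storageClassName in mysql.yaml, e.g. to azurefile-csi-premium, then:
kubectl apply -f mysql.yaml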

Summary

This post focused on storage options for persistent volumes on Azure Kubernetes Service. I used my own Terraform template with some custom configurations, covered kubestr as a benchmarking tool, and ran it against the built-in storage classes.

In some situations, the built-in options do not suit your needs; for example, you might want your application to consume persistent volumes in a consistent way across multiple cloud vendors. In that case, we have to resort to a third-party software-defined storage layer. In the next post, I will explore a couple of SDS-based options, namely Portworx and Ceph. The custom configurations in the Terraform template will be helpful when we configure Portworx.