doc: add design doc for QoS

Add design doc for QoS for rbd devices mapped with both krbd and rbd-nbd

closes: #521

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2025-05-30 10:36:41 +00:00 · 2023-08-29 11:59:16 +02:00 · 2023-08-29 11:59:16 +02:00 · 7c0ce53eeb
commit 7c0ce53eeb
parent 4eb2dbd4f2
1 changed files with 271 additions and 0 deletions
--- a/docs/design/proposals/rbd-qos.md
+++ b/docs/design/proposals/rbd-qos.md
@@ -0,0 +1,271 @@
 # Design Doc for RBD QoS using cgroup v2
 ## Introduction
 The RBD QoS (Quality of Service) design aims to address the issue of IO noisy
 neighbor problems encountered in early Ceph deployments catering to OpenStack
 environments. These problems were effectively managed by implementing QEMU
 throttling at the virtio-blk/scsi level. To further enhance this,
 capacity-based IOPS were introduced, providing a more dynamic experience
 similar to public cloud environments.
 The challenge arises in virtual environments, where a noisy neighbor can lead
 to performance degradation for other instances sharing the same resources.
 Although it's uncommon to observe noisy neighbor issues in Kubernetes
 environments backed by Ceph storage, the possibility exists. The existing QoS
 support with rbd-nbd doesn't apply to krbd, and as rbd-nbd isn't suitable for
 container production workloads, a solution is needed for krbd.
 To mitigate resource starvation, we can set QoS at the device level through
 cgroup v2 on nodes where it is enabled. This approach helps ensure that I/O
 capacity isn't overcommitted and is fairly distributed among workloads.
 ## Dependency
 * cgroup v2 must be enabled on the node
 * There may be a Kubernetes version dependency as well
 * The container runtime must support cgroup v2
 ## Manual steps for implementing RBD QoS in a Kubernetes Cluster
 ```bash
 [$] ssh root@node1
 sh-4.4# chroot /host
 sh-5.1# cat /proc/partitions
 major minor  #blocks  name
 259        0  125829120 nvme0n1
 259        1       1024 nvme0n1p1
 259        2     130048 nvme0n1p2
 259        3     393216 nvme0n1p3
 259        4  125303791 nvme0n1p4
 259        6   52428800 nvme2n1
   7        0  536870912 loop0
 259        5  536870912 nvme1n1
 252        0   52428800 rbd0
 sh-5.1#
 ```
 Once the rbd device is mapped on the node, we can read its major and minor
 numbers (`252:0` for `rbd0` above). We need them to set the I/O limit on the
 device, but first we have to find the cgroup file in which to set the limit.
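As a minimal sketch of the lookup described above, the helper below extracts `MAJOR:MINOR` for a device name from `/proc/partitions`-style input. The function name `dev_major_minor` is hypothetical, and the sample input is a shortened copy of the listing above.

```bash
# Print "MAJOR:MINOR" for a block device name, reading /proc/partitions-style
# input (columns: major minor #blocks name) from stdin.
# Exits non-zero if the device is not listed.
dev_major_minor() {
    awk -v dev="$1" '$4 == dev { print $1 ":" $2; found = 1 } END { exit !found }'
}

# Shortened sample of the listing above; on a node you would instead run
#   dev_major_minor rbd0 < /proc/partitions
sample='major minor  #blocks  name
259        0  125829120 nvme0n1
252        0   52428800 rbd0'

printf '%s\n' "$sample" | dev_major_minor rbd0   # prints 252:0
```

The `MAJOR:MINOR` pair is the key that the cgroup v2 `io.max` file uses to identify the device.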
 Kubernetes/OpenShift creates a custom cgroup hierarchy for the pods it creates,
 rooted at the `/sys/fs/cgroup` folder.
 ```bash
 sh-5.1# cd /sys/fs/cgroup/
 sh-5.1# ls
 cgroup.controllers cgroup.subtree_control cpuset.mems.effective  io.stat   memory.reclaim   sys-kernel-debug.mount
 cgroup.max.depth cgroup.threads  dev-hugepages.mount    kubepods.slice  memory.stat   sys-kernel-tracing.mount
 cgroup.max.descendants cpu.pressure  dev-mqueue.mount       machine.slice  misc.capacity   system.slice
 cgroup.procs  cpu.stat  init.scope        memory.numa_stat  sys-fs-fuse-connections.mount user.slice
 cgroup.stat  cpuset.cpus.effective io.pressure        memory.pressure  sys-kernel-config.mount
 ```
 `kubepods.slice` is the starting point and it contains multiple slices
 ```bash
 sh-5.1# cd kubepods.slice
 sh-5.1# ls
 cgroup.controllers cpuset.cpus    hugetlb.2MB.rsvd.max       memory.pressure
 cgroup.events  cpuset.cpus.effective   io.bfq.weight        memory.reclaim
 cgroup.freeze  cpuset.cpus.partition   io.latency        memory.stat
 cgroup.kill  cpuset.mems    io.max        memory.swap.current
 cgroup.max.depth cpuset.mems.effective   io.pressure        memory.swap.events
 cgroup.max.descendants hugetlb.1GB.current   io.stat        memory.swap.high
 cgroup.procs  hugetlb.1GB.events   kubepods-besteffort.slice      memory.swap.max
 cgroup.stat  hugetlb.1GB.events.local  kubepods-burstable.slice      memory.zswap.current
 cgroup.subtree_control hugetlb.1GB.max    kubepods-pod2b38830b_c2d6_4528_8935_b1c08511b1e3.slice  memory.zswap.max
 cgroup.threads  hugetlb.1GB.numa_stat   memory.current       misc.current
 cgroup.type  hugetlb.1GB.rsvd.current  memory.events        misc.max
 cpu.idle  hugetlb.1GB.rsvd.max   memory.events.local       pids.current
 cpu.max   hugetlb.2MB.current   memory.high        pids.events
 cpu.max.burst  hugetlb.2MB.events   memory.low        pids.max
 cpu.pressure  hugetlb.2MB.events.local  memory.max        rdma.current
 cpu.stat  hugetlb.2MB.max    memory.min        rdma.max
 cpu.weight  hugetlb.2MB.numa_stat   memory.numa_stat
 cpu.weight.nice  hugetlb.2MB.rsvd.current  memory.oom.group
 ```
 Depending on the pod's QoS class, our application pod ends up in one of the
 cgroups listed above: `kubepods-besteffort.slice`, `kubepods-burstable.slice`,
 or directly in `kubepods.slice` (Guaranteed QoS). The three QoS classes are
 defined
 [here](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#quality-of-service-classes).
 To identify the right cgroup file, we need the pod UID, the container ID, and
 the QoS class from the `pod yaml` output:
 ```bash
 [$]kubectl get po csi-rbd-demo-pod -oyaml |grep uid
  uid: cdf7b785-4eb7-44f7-99cc-ef53890f4dfd
 [$]kubectl get po csi-rbd-demo-pod -oyaml |grep -i containerID
  - containerID: cri-o://77e57fbbc0f0630f41f9f154f4b5fe368b6dcf7bef7dcd75a9c4b56676f10bc9
 [$]kubectl get po csi-rbd-demo-pod -oyaml |grep -i qosClass
  qosClass: BestEffort
 ```
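The three values above are enough to derive the container's cgroup directory. The sketch below assumes the systemd cgroup driver and CRI-O naming (`crio-<id>.scope`), as seen on the node in this document; other runtimes or cgroup drivers use different names. The function name `container_cgroup_dir` is hypothetical.

```bash
# Build the cgroup v2 directory of one container, relative to /sys/fs/cgroup.
# Arguments: QoS class, pod UID, container ID (with or without "cri-o://").
# Assumption: systemd cgroup driver with CRI-O, as on the node shown here.
container_cgroup_dir() {
    qos=$1
    # The slice names on the node use "_" where the pod UID has "-".
    pod_uid=$(printf '%s' "$2" | tr '-' '_')
    container_id=${3#cri-o://}
    case "$qos" in
        Guaranteed) parent="kubepods.slice"
                    prefix="kubepods" ;;
        Burstable)  parent="kubepods.slice/kubepods-burstable.slice"
                    prefix="kubepods-burstable" ;;
        BestEffort) parent="kubepods.slice/kubepods-besteffort.slice"
                    prefix="kubepods-besteffort" ;;
        *) echo "unknown QoS class: $qos" >&2; return 1 ;;
    esac
    printf '%s/%s-pod%s.slice/crio-%s.scope\n' \
        "$parent" "$prefix" "$pod_uid" "$container_id"
}

container_cgroup_dir BestEffort \
    cdf7b785-4eb7-44f7-99cc-ef53890f4dfd \
    cri-o://77e57fbbc0f0630f41f9f154f4b5fe368b6dcf7bef7dcd75a9c4b56676f10bc9
```

For the demo pod this prints the same path that is entered by hand on the node in the following steps.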
 Now look in `kubepods-besteffort.slice` and identify the right path using the
 pod UID and container ID.
 Before that, check `io.max` inside the application pod to see whether any limit
 is already set:
 ```bash
 [$]kubectl exec -it csi-rbd-demo-pod -- sh
 sh-4.4# cat /sys/fs/cgroup/io.max
 sh-4.4#
 ```
 Go back to the node, find the right cgroup scope, and set the limit:
 ```bash
 sh-5.1# cd kubepods-besteffort.slice/kubepods-besteffort-podcdf7b785_4eb7_44f7_99cc_ef53890f4dfd.slice/crio-77e57fbbc0f0630f41f9f154f4b5fe368b6dcf7bef7dcd75a9c4b56676f10bc9.scope/
 sh-5.1# echo "252:0 wbps=1048576" > io.max
 sh-5.1# cat io.max
 252:0 rbps=max wbps=1048576 riops=max wiops=max
 ```
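To verify a limit from a script rather than by eye, the `io.max` output above can be parsed per device and key. This is a small sketch; the function name `io_max_value` is hypothetical, and the sample line is the output shown above.

```bash
# Print the value of one key (rbps, wbps, riops or wiops) for a device from
# io.max-formatted input on stdin. Prints nothing if the device has no entry.
io_max_value() {
    awk -v dev="$1" -v key="$2" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }'
}

# On a node you would run: io_max_value 252:0 wbps < io.max
echo "252:0 rbps=max wbps=1048576 riops=max wiops=max" | io_max_value 252:0 wbps
```

For the sample line this prints `1048576`, the write limit in bytes per second that was set above.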
 Now go back to the application pod and check that the limit is in place:
 ```bash
 [$]kubectl exec -it csi-rbd-demo-pod -- sh
 sh-4.4# cat /sys/fs/cgroup/io.max
 252:0 rbps=max wbps=1048576 riops=max wiops=max
 sh-4.4#
 ```
 Note: We can only support the QoS settings that the cgroup v2 io controller
 supports. This means that cumulative read+write QoS limits won't be supported.
 Below are the configurations that will be supported:
 | Parameter              | Description                       |
 | ---------------------- | --------------------------------- |
 | MaxReadIOPS            | Max read IO operations per second  |
 | MaxWriteIOPS           | Max write IO operations per second |
 | MaxReadBytesPerSecond  | Max read bytes per second          |
 | MaxWriteBytesPerSecond | Max write bytes per second         |
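The four parameters correspond naturally to the `riops`, `wiops`, `rbps` and `wbps` keys of `io.max`. The sketch below builds one `io.max` line from them; the function name `io_max_line` and the argument order are assumptions made for illustration, not part of the proposal.

```bash
# Build one io.max line from the four proposed parameters.
# Arguments: MAJ:MIN MaxReadIOPS MaxWriteIOPS MaxReadBytesPerSecond
#            MaxWriteBytesPerSecond
# An empty argument leaves that key out of the line.
io_max_line() {
    line=$1
    if [ -n "$2" ]; then line="$line riops=$2"; fi
    if [ -n "$3" ]; then line="$line wiops=$3"; fi
    if [ -n "$4" ]; then line="$line rbps=$4"; fi
    if [ -n "$5" ]; then line="$line wbps=$5"; fi
    printf '%s\n' "$line"
}

# Only a write bandwidth limit, as in the manual steps above.
io_max_line 252:0 "" "" "" 1048576   # prints: 252:0 wbps=1048576
```

The resulting line is what would be written to the container's `io.max` file.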
 ## Different approaches
 The above solution can be implemented using 3 different approaches.
 ### 1. QoS using new parameters in RBD StorageClass
 ```yaml
 ---
 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: csi-rbd-sc
 provisioner: rbd.csi.ceph.com
 parameters:
  MaxReadIOPS: ""
  MaxWriteIOPS: ""
  MaxReadBytesPerSecond: ""
  MaxWriteBytesPerSecond: ""
 ```
 #### Implementation for StorageClass QoS
 1. Create a new StorageClass with the new QoS parameters
 1. Modify the CSIDriver object (set `podInfoOnMount: true`) so that pod details
   are passed to the NodePublishVolume CSI procedure
 1. During the NodePublishVolume CSI procedure
   * Retrieve the QoS configuration from the volumeContext in the
     NodePublishVolumeRequest
   * Identify the rbd device using the staging target path
   * Get the pod UID from the pod details in the volumeContext
   * Set the limits in the `io.max` file of every container in the pod
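The last step above can be sketched as a loop over the container scopes of a pod slice. This assumes that each container has a `*.scope` directory directly below the pod slice (as with CRI-O and the systemd cgroup driver); the function name `apply_io_max_to_pod` is hypothetical, and the demo runs against a throwaway directory instead of `/sys/fs/cgroup`.

```bash
# Write one io.max line into every container scope of a pod slice.
# Arguments: pod slice directory, io.max line.
# On a real node this must run as root against /sys/fs/cgroup.
apply_io_max_to_pod() {
    pod_slice=$1
    line=$2
    for scope in "$pod_slice"/*.scope; do
        # Skip an unmatched glob and scopes that have no io.max file.
        [ -e "$scope/io.max" ] || continue
        printf '%s\n' "$line" > "$scope/io.max" || return 1
    done
    return 0
}

# Dry run against a temporary directory that mimics a pod slice.
demo=$(mktemp -d)
mkdir -p "$demo/crio-aaa.scope" "$demo/crio-bbb.scope"
: > "$demo/crio-aaa.scope/io.max"
: > "$demo/crio-bbb.scope/io.max"
apply_io_max_to_pod "$demo" "252:0 wbps=1048576"
cat "$demo"/*.scope/io.max
rm -rf "$demo"
```

Writing the same line to every container keeps the limit consistent regardless of which container in the pod issues the I/O.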
 #### Drawbacks of StorageClass QoS
 1. No way to update the QoS at runtime
 1. Changing the QoS requires a backup and a restore into a new StorageClass
   that carries the new QoS values
 1. The PV object has to be deleted and recreated
 ### 2. QoS using parameters in VolumeAttributesClass
 ```yaml
 apiVersion: storage.k8s.io/v1alpha1
 kind: VolumeAttributesClass
 metadata:
  name: silver
 driverName: rbd.csi.ceph.com
 parameters:
  MaxReadIOPS: ""
  MaxWriteIOPS: ""
  MaxReadBytesPerSecond: ""
  MaxWriteBytesPerSecond: ""
 ```
 `volumeAttributesClassName` is a new field in the PVC object that references a
 VolumeAttributesClass. The user can set it, and it can also be updated or
 removed later.
 VolumeAttributesClass is designed with storage systems in mind that support
 setting QoS at the storage level, that is, by applying some configuration on
 the storage side (like QoS for nbd).
 #### Implementation of VolumeAttributesClass QoS
 1. Modify CSIDriver object to pass pod details to the NodePublishVolume CSI
   procedure
 1. Add support in Ceph-CSI to expose ModifyVolume CSI procedure
 1. Ceph-CSI will store QoS in the rbd image metadata
 1. During the NodeStageVolume operation, retrieve the image metadata and store
   it in the stagingPath
 1. Whenever a new pod comes in, apply the QoS
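Storing QoS in the rbd image metadata could be done with `rbd image-meta set`. The sketch below only prints the commands rather than running them. The `qos.` key prefix, the function name `qos_image_meta_cmds`, and the pool and image names are hypothetical; the proposal does not define metadata key names.

```bash
# Print the "rbd image-meta set" commands that would record QoS values on an
# image. Arguments: pool/image, then "Parameter=value" pairs.
# The "qos." key prefix is hypothetical; the proposal defines no key names.
qos_image_meta_cmds() {
    image=$1
    shift
    for pair in "$@"; do
        key=${pair%%=*}
        value=${pair#*=}
        # Skip parameters that were left empty.
        [ -n "$value" ] || continue
        printf 'rbd image-meta set %s qos.%s %s\n' "$image" "$key" "$value"
    done
}

# Hypothetical pool and image names.
qos_image_meta_cmds replicapool/csi-vol-example \
    MaxReadIOPS=500 MaxWriteIOPS= MaxWriteBytesPerSecond=1048576
```

On the node side, the matching read would be `rbd image-meta get <pool>/<image> <key>` for each key.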
 #### Drawbacks of VolumeAttributesClass QoS
 One problem with the above is that all applications need to be scaled down and
 scaled up again to pick up the new QoS value, even though it has already
 changed in the PVC object. This is sometimes impossible because it causes
 downtime.
 ### 3. QoS using parameters in VolumeAttributesClass with NodePublish Secret
 1. Modify CSIDriver object to pass pod details to the NodePublishVolume CSI
   procedure
 1. Add support in Ceph-CSI to expose ModifyVolume CSI procedure
 1. Ceph-CSI will store QoS in the rbd image metadata
 1. During the NodePublishVolume operation, retrieve the QoS from the image
   metadata
 1. Whenever a new pod comes in, apply the QoS
 This solution addresses the aforementioned issue, but it requires a secret to
 communicate with the ceph cluster. Therefore, we must create a new
 PublishSecret for the storageClass, which may be beneficial when Kubernetes
 eventually enables Node operations.
 Both options 2 and 3 are contingent upon changes to the CSI spec and Kubernetes
 support. Additionally,
 [VolumeAttributesClass](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/3751-volume-attributes-class/README.md)
 is currently being developed within the Kubernetes realm and will initially be
 in the Alpha stage. Consequently, it will be disabled by default.
 #### Advantages of QoS using VolumeAttributesClass
 1. No restore/clone operation is required to change the QoS
 1. QoS can easily be changed for an existing PVC, but only with the second
   approach; the third approach needs a new secret.
 ### Hybrid Approach
 Considering the advantages and drawbacks, we can use both StorageClass and
 VolumeAttributesClass to support QoS, with VolumeAttributesClass taking
 precedence over StorageClass. This approach offers a flexible solution that
 accounts for dynamic changes while addressing the challenges of existing
 approaches.
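The precedence rule can be sketched per parameter: a value from the VolumeAttributesClass wins when it is set, otherwise the StorageClass value applies. The function name `effective_qos` is hypothetical, and treating an empty string as "not set" is an assumption.

```bash
# Resolve one QoS parameter: a non-empty VolumeAttributesClass value wins,
# otherwise the StorageClass value is used (which may itself be empty).
effective_qos() {
    vac_value=$1
    sc_value=$2
    printf '%s\n' "${vac_value:-$sc_value}"
}

effective_qos 2000 1000   # VolumeAttributesClass overrides: prints 2000
effective_qos ""   1000   # falls back to the StorageClass: prints 1000
```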
 ### References
 Some of the useful links that helped me to understand cgroup v2 and how to set
 QoS on the device.
 * [Kubernetes cgroup v2
  Architecture](https://kubernetes.io/docs/concepts/architecture/cgroups/)
 * [cgroup v2 kernel doc](https://docs.kernel.org/admin-guide/cgroup-v2.html)
 * [ceph RBD QoS tracker](https://tracker.ceph.com/issues/36191)
 * [cgroup v2 io
  controller](https://facebookmicrosites.github.io/cgroup2/docs/io-controller.html)
 * [Kubernetes IOPS
  issue](https://github.com/kubernetes/kubernetes/issues/92287)