doc: add design doc for clusterid poolid mapping

added design doc to handle volumeID mapping in case of the failover in the Disaster Recovery. update #2118 Signed-off-by: Madhu Rajanna <madhupr007@gmail.com> (cherry picked from commit 5fc9c3a046)
2025-04-08 16:43:01 +00:00 · 2021-07-15 14:11:00 +05:30 · 2021-07-15 14:11:00 +05:30 · f65961d01e
commit f65961d01e
parent cbe3ac71f3
1 changed files with 176 additions and 0 deletions
--- a/docs/design/proposals/clusterid-mapping.md
+++ b/docs/design/proposals/clusterid-mapping.md
@ -0,0 +1,176 @@
+# Design to handle clusterID and poolID for DR
+
+During disaster recovery/migration of a cluster, as part of the failover, the
+kubernetes artifacts like deployment, PVC, PV, etc will be restored to a new
+cluster by the admin. Even if the kubernetes objects are restored the
+corresponding RBD/CephFS subvolume cannot be retrieved during CSI operations as
+the clusterID and poolID are not the same in both clusters. Let's see the
+problem in more detail below.
+
+`0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002`
+
+The above is the sample volumeID sent back in response to the CreateVolume
+operation and added as a volumeHandle in the PV spec. CO (Kubernetes) uses
+above as the identifier for other operations on the volume/PVC.
+
+The VolumeID is encoded as,
+
+```text
+0001 -->                              [csi_id_version=1:4byte] + [-:1byte]
+0009 -->                              [length of clusterID=1:4byte] + [-:1byte]
+rook-ceph -->                         [clusterID:36bytes (MAX)] + [-:1byte]
+0000000000000002 -->                  [poolID:16bytes] + [-:1byte]
+b0285c97-a0ce-11eb-8c66-0242ac110002 --> [ObjectUUID:36bytes]
+Total of constant field lengths, including '-' field separators would hence be,
+4+1+4+1+1+16+1+36 = 64
+```
+
+When mirroring is enabled volume which is `csi-vol-ObjectUUID` is mirrored to
+the other cluster.
+
+> `csi-vol` is const name and over has the option to override it in
+> storageclass.
+
+During the Disaster Recovery (failover operation) the PVC and PV will be
+recreated on the other cluster. When Ceph-CSI receives the request for
+operations like (NodeStage, ExpandVolume, DeleteVolume, etc) the volumeID is
+sent in the request which will help to identify the volume.
+
+```yaml=
+apiVersion: v1
+kind: ConfigMap
+data:
+  config.json: |-
+    [
+      {
+       "clusterID": "rook-ceph",
+       "radosNamespace": "<rados-namespace>",
+       "monitors": [
+         "192.168.39.82:6789"
+       ],
+       "cephFS": {
+         "subvolumeGroup": "<subvolumegroup for cephfs volumes>"
+       }
+      },
+      {
+       "clusterID": "fs-id",
+       "radosNamespace": "<rados-namespace>",
+       "monitors": [
+         "192.168.39.83:6789"
+       ],
+       "cephFS": {
+         "subvolumeGroup": "<subvolumegroup for cephfs volumes>"
+       }
+      }
+    ]
+metadata:
+  name: ceph-csi-config
+```
+
+During CSI/Replication operations, Ceph-CSI will decode the volumeID and gets
+the monitor configuration from the configmap and by the poolID will get the
+pool Name and retrieves the OMAP data stored in the rados OMAP and finally
+check the volume is present in the pool.
+
+## Problems with volumeID Replication
+
+* The clusterID can be different
+  * as the clusterID is the namespace where rook is deployed, the Rook might be
+    deployed in the different namespace on a secondary cluster
+  * In standalone Ceph-CSI the clusterID is fsID and fsID is unique per
+    cluster
+
+* The poolID can be different
+  * PoolID which is encoded in the volumeID won't remain the same across
+    clusters
+
+To solve this problem we need to have a new mapping between clusterID's and the
+poolID's.
+
+Example configmap Need to be created before failover to `site2-storage` from
+`site1-storage` and `site3-storage`.
+
+```yaml=
+apiVersion: v1
+kind: ConfigMap
+data:
+  cluster-mapping.json: |-
+  [{
+    "clusterIDMapping": {
+    "site1-storage" (clusterID on site1): "site2-storage" (clusterID on site2)
+   },
+    "RBDPoolIDMapping": [{
+    "1" (poolID on site1): "2" (poolID on site2),
+    "11": "12"
+   }],
+    "CephFSFsIDMapping": [{
+    "13" (FsID on site1): "34" (FsID on site2),
+    "3": "4"
+   }]
+  }, {
+   "clusterIDMapping": {
+   "site3-storage"  (clusterID on site3): "site2-storage" (clusterID on site2)
+   },
+   "RBDPoolIDMapping": [{
+   "5" (poolID on site3): "2" (poolID on site2),
+   "16": "12"
+   }],
+   "CephFSFsIDMapping": [{
+   "3"(FsID on site3): "34" (FsID on site2),
+   "4": "4"
+   }]
+ }]
+metadata:
+  name: ceph-csi-config
+```
+
+**Note:-** the configmap will be mounted as a volume to the CSI (provisioner
+and node plugin) pods.
+
+The above configmap will get created as it is or updated (if new Pools are
+created on the existing cluster) with new entries when the admin choose to
+failover/failback the cluster.
+
+Whenever Ceph-CSI receives a CSI/Replication request it will first decode the
+volumeHandle and try to get the required OMAP details. If it is not able to
+retrieve the poolID or clusterID details from the decoded volumeHandle, Ceph-CSI
+will check for the clusterID and PoolID mapping.
+
+If the old volumeID
+`0001-00013-site1-storage-0000000000000001-b0285c97-a0ce-11eb-8c66-0242ac110002`
+contains the `site1-storage` as the clusterID, now Ceph-CSI will look for the
+corresponding clusterID `site2-storage` from the above configmap. If the
+clusterID mapping is found now Ceph-CSI will look for the poolID mapping ie
+mapping between `1` and `2`.
+
+Example:- pool with the same name exists on both the clusters with different IDs
+Replicapool with ID `1` on site1 and Replicapool with ID `2` on site2.
+
+After getting the required mapping Ceph-CSI has the required information to get
+more details from the rados OMAP. If we have multiple clusterID mapping it will
+loop through all the mapping and checks the corresponding pool to get the OMAP
+data. If the clusterID mapping does not exist Ceph-CSI will return a `Not
+Found` error message to the caller.
+
+After failover to the cluster `site2-storage`, the admin might have created new
+PVCs on the primary cluster `site2-storage`. Later after recovering the
+cluster `site1-storage`, the admin might choose to failback from
+`site2-storage` to `site1-storage`. Now admin needs to copy all the newly
+created kubernetes artifacts to the failback cluster. For clusterID mapping, the
+admin needs to copy the above-created configmap `ceph-clusterid-mapping` to
+the failback cluster. When Ceph-CSI receives a CSI/Replication request for
+the volumes created on the `site2-storage` it will decode the volumeID and
+retrieves the clusterID ie `site2-storage`. In the above configmap
+`ceph-clusterid-mapping` the `site2-storage` is the value and `site1-storage`
+is the key in the `clusterIDMapping` entry.
+
+Ceph-CSI will check both `key` and `value` to check the clusterID mapping. If it
+is found in `key` it will consider `value` as the  corresponding mapping, if it
+is found in `value` place it will treat `key` as the corresponding mapping and
+retrieves all the poolID details of the cluster.
+
+This mapping on the remote cluster is only required when we are doing a
+failover operation from the primary cluster to a remote cluster. The existing
+volumes that are created on the remote cluster does not require
+any mapping as the volumeHandle already contains the required information about
+the local cluster (clusterID, poolID etc).