doc: add design doc for clusterid poolid mapping
Added a design doc to handle volumeID mapping in case
of failover during Disaster Recovery.

Updates #2118

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 5fc9c3a046)
docs/design/proposals/clusterid-mapping.md

# Design to handle clusterID and poolID for DR

During disaster recovery/migration of a cluster, as part of the failover, the
Kubernetes artifacts like Deployments, PVCs, PVs, etc. will be restored to a
new cluster by the admin. Even if the Kubernetes objects are restored, the
corresponding RBD/CephFS subvolumes cannot be retrieved during CSI operations,
as the clusterID and poolID are not the same in both clusters. Let's look at
the problem in more detail below.

`0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002`

The above is a sample volumeID sent back in response to the CreateVolume
operation and added as the volumeHandle in the PV spec. The CO (Kubernetes)
uses it as the identifier for other operations on the volume/PVC.

The volumeID is encoded as:

```text
0001 --> [csi_id_version=1:4byte] + [-:1byte]
0009 --> [length of clusterID=1:4byte] + [-:1byte]
rook-ceph --> [clusterID:36bytes (MAX)] + [-:1byte]
0000000000000002 --> [poolID:16bytes] + [-:1byte]
b0285c97-a0ce-11eb-8c66-0242ac110002 --> [ObjectUUID:36bytes]

Total of constant field lengths, including '-' field separators, would hence be:
4+1+4+1+1+16+1+36 = 64
```
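
As a rough illustration of this layout, the following sketch (not the actual
ceph-csi decoder; the names, and the exact numeric encoding of the fields, are
assumptions) splits such a volumeID back into its fields:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodedVolID holds the fields recovered from a volumeID. The struct, field
// and function names are illustrative, not the ceph-csi implementation.
type decodedVolID struct {
	Version    int
	ClusterID  string
	PoolID     int64
	ObjectUUID string
}

func decodeVolID(volID string) (decodedVolID, error) {
	var d decodedVolID
	// Layout: [version:4]-[clusterID length:4]-[clusterID]-[poolID:16]-[ObjectUUID:36]
	parts := strings.SplitN(volID, "-", 3)
	if len(parts) != 3 {
		return d, fmt.Errorf("malformed volumeID %q", volID)
	}
	// The version and length fields are parsed as plain integers here; the
	// exact numeric encoding is an assumption of this sketch.
	version, err := strconv.Atoi(parts[0])
	if err != nil {
		return d, err
	}
	d.Version = version

	idLen, err := strconv.Atoi(parts[1])
	if err != nil {
		return d, err
	}
	rest := parts[2]
	if len(rest) < idLen+1+16+1+36 {
		return d, fmt.Errorf("volumeID %q too short", volID)
	}
	d.ClusterID = rest[:idLen]
	// The poolID is assumed to be a 16-character hex field in this sketch.
	poolID, err := strconv.ParseInt(rest[idLen+1:idLen+1+16], 16, 64)
	if err != nil {
		return d, err
	}
	d.PoolID = poolID
	d.ObjectUUID = rest[len(rest)-36:]
	return d, nil
}

func main() {
	fmt.Println(decodeVolID(
		"0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002"))
	// {1 rook-ceph 2 b0285c97-a0ce-11eb-8c66-0242ac110002} <nil>
}
```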

When mirroring is enabled, the volume, which is named `csi-vol-<ObjectUUID>`,
is mirrored to the other cluster.

> `csi-vol` is a constant prefix, and one has the option to override it in the
> storageclass.

During Disaster Recovery (the failover operation) the PVC and PV will be
recreated on the other cluster. When Ceph-CSI receives a request for operations
like NodeStage, ExpandVolume, DeleteVolume, etc., the volumeID sent in the
request is used to identify the volume.

Below is the `ceph-csi-config` configmap that Ceph-CSI uses to map a clusterID
to its monitors and other cluster details:

```yaml
apiVersion: v1
kind: ConfigMap
data:
  config.json: |-
    [
      {
        "clusterID": "rook-ceph",
        "radosNamespace": "<rados-namespace>",
        "monitors": [
          "192.168.39.82:6789"
        ],
        "cephFS": {
          "subvolumeGroup": "<subvolumegroup for cephfs volumes>"
        }
      },
      {
        "clusterID": "fs-id",
        "radosNamespace": "<rados-namespace>",
        "monitors": [
          "192.168.39.83:6789"
        ],
        "cephFS": {
          "subvolumeGroup": "<subvolumegroup for cephfs volumes>"
        }
      }
    ]
metadata:
  name: ceph-csi-config
```

During CSI/Replication operations, Ceph-CSI decodes the volumeID, gets the
monitor configuration from the configmap, resolves the pool name from the
poolID, retrieves the data stored in the rados OMAP, and finally checks whether
the volume is present in the pool.
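
As a rough sketch of that flow (the type, helper name, and mount path are
assumptions, not ceph-csi's actual code), the clusterID decoded from the
volumeID selects the matching entry, and hence the monitors, from the mounted
`config.json`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// clusterInfo mirrors one entry of config.json; the JSON keys follow the
// example above, the Go type itself is only a sketch.
type clusterInfo struct {
	ClusterID      string   `json:"clusterID"`
	RadosNamespace string   `json:"radosNamespace"`
	Monitors       []string `json:"monitors"`
	CephFS         struct {
		SubvolumeGroup string `json:"subvolumeGroup"`
	} `json:"cephFS"`
}

// monitorsForClusterID returns the monitor list for the given clusterID, or an
// error when no entry matches.
func monitorsForClusterID(configPath, clusterID string) ([]string, error) {
	raw, err := os.ReadFile(configPath)
	if err != nil {
		return nil, err
	}
	var clusters []clusterInfo
	if err := json.Unmarshal(raw, &clusters); err != nil {
		return nil, err
	}
	for _, c := range clusters {
		if c.ClusterID == clusterID {
			return c.Monitors, nil
		}
	}
	return nil, fmt.Errorf("clusterID %q not found in %s", clusterID, configPath)
}

func main() {
	// The mount path of the configmap is an assumption of this sketch.
	mons, err := monitorsForClusterID("/etc/ceph-csi-config/config.json", "rook-ceph")
	fmt.Println(mons, err)
}
```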

## Problems with volumeID Replication

* The clusterID can be different
  * As the clusterID is the namespace where Rook is deployed, Rook might be
    deployed in a different namespace on the secondary cluster
  * In standalone Ceph-CSI the clusterID is the fsID, and the fsID is unique
    per cluster

* The poolID can be different
  * The poolID which is encoded in the volumeID won't remain the same across
    clusters

To solve this problem, we need a new mapping between the clusterIDs and the
poolIDs of the clusters.

Below is an example configmap that needs to be created before failing over to
`site2-storage` from `site1-storage` and `site3-storage`.

```yaml
apiVersion: v1
kind: ConfigMap
data:
  cluster-mapping.json: |-
    [{
      "clusterIDMapping": {
        "site1-storage" (clusterID on site1): "site2-storage" (clusterID on site2)
      },
      "RBDPoolIDMapping": [{
        "1" (poolID on site1): "2" (poolID on site2),
        "11": "12"
      }],
      "CephFSFsIDMapping": [{
        "13" (FsID on site1): "34" (FsID on site2),
        "3": "4"
      }]
    }, {
      "clusterIDMapping": {
        "site3-storage" (clusterID on site3): "site2-storage" (clusterID on site2)
      },
      "RBDPoolIDMapping": [{
        "5" (poolID on site3): "2" (poolID on site2),
        "16": "12"
      }],
      "CephFSFsIDMapping": [{
        "3" (FsID on site3): "34" (FsID on site2),
        "4": "4"
      }]
    }]
metadata:
  name: ceph-csi-config
```

**Note:** the configmap will be mounted as a volume to the CSI (provisioner
and node plugin) pods.
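
For illustration, the mounted `cluster-mapping.json` could be unmarshalled into
Go types like the sketch below (the type names are assumptions; the
`(poolID on siteX)` annotations in the example above are explanatory and would
not appear in the real JSON):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterMappingEntry is a sketch of one element of cluster-mapping.json;
// only the JSON keys come from the example above.
type clusterMappingEntry struct {
	ClusterIDMapping  map[string]string   `json:"clusterIDMapping"`
	RBDPoolIDMapping  []map[string]string `json:"RBDPoolIDMapping"`
	CephFSFsIDMapping []map[string]string `json:"CephFSFsIDMapping"`
}

func main() {
	// The same data as the first entry of the example configmap, without the
	// inline "(poolID on siteX)" annotations, which are not valid JSON.
	raw := `[{
		"clusterIDMapping": {"site1-storage": "site2-storage"},
		"RBDPoolIDMapping": [{"1": "2", "11": "12"}],
		"CephFSFsIDMapping": [{"13": "34", "3": "4"}]
	}]`
	var mappings []clusterMappingEntry
	if err := json.Unmarshal([]byte(raw), &mappings); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", mappings)
}
```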

The above configmap will be created as-is, or updated with new entries (if new
pools are created on the existing cluster), when the admin chooses to
failover/failback the cluster.

Whenever Ceph-CSI receives a CSI/Replication request it will first decode the
volumeHandle and try to get the required OMAP details. If it is not able to
retrieve the poolID or clusterID details from the decoded volumeHandle,
Ceph-CSI will fall back to the clusterID and poolID mapping.

If the old volumeID
`0001-00013-site1-storage-0000000000000001-b0285c97-a0ce-11eb-8c66-0242ac110002`
contains `site1-storage` as the clusterID, Ceph-CSI will look for the
corresponding clusterID `site2-storage` in the above configmap. If the
clusterID mapping is found, Ceph-CSI will then look for the poolID mapping,
i.e. the mapping between `1` and `2`.

Example: a pool with the same name exists on both clusters but with different
IDs, such as Replicapool with ID `1` on site1 and Replicapool with ID `2` on
site2.

After getting the required mapping, Ceph-CSI has the information needed to
fetch more details from the rados OMAP. If there are multiple clusterID
mappings, it will loop through all of them and check the corresponding pools
for the OMAP data. If no clusterID mapping exists, Ceph-CSI will return a `Not
Found` error message to the caller.
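
A minimal sketch of that fallback, assuming hypothetical types and helper names
(not the ceph-csi implementation): walk every mapping entry, translate the
clusterID and poolID, and return a `NotFound` error when no mapping applies.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// clusterMappingEntry matches the cluster-mapping.json schema sketched earlier.
type clusterMappingEntry struct {
	ClusterIDMapping  map[string]string   `json:"clusterIDMapping"`
	RBDPoolIDMapping  []map[string]string `json:"RBDPoolIDMapping"`
	CephFSFsIDMapping []map[string]string `json:"CephFSFsIDMapping"`
}

// translateRBDVolume maps a (clusterID, poolID) pair decoded from an old
// volumeID to the equivalent pair on the local cluster. Only the key->value
// direction is shown here; checking the value->key direction works
// analogously. The decoded numeric poolID is assumed to be formatted as a
// decimal string before the lookup.
func translateRBDVolume(mappings []clusterMappingEntry, clusterID, poolID string) (string, string, error) {
	for _, m := range mappings {
		mappedCluster, ok := m.ClusterIDMapping[clusterID]
		if !ok {
			continue
		}
		for _, pools := range m.RBDPoolIDMapping {
			if mappedPool, ok := pools[poolID]; ok {
				return mappedCluster, mappedPool, nil
			}
		}
	}
	return "", "", status.Errorf(codes.NotFound,
		"no clusterID/poolID mapping found for %s/%s", clusterID, poolID)
}

func main() {
	mappings := []clusterMappingEntry{{
		ClusterIDMapping: map[string]string{"site1-storage": "site2-storage"},
		RBDPoolIDMapping: []map[string]string{{"1": "2", "11": "12"}},
	}}
	fmt.Println(translateRBDVolume(mappings, "site1-storage", "1"))
	// site2-storage 2 <nil>
}
```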

After failover to the cluster `site2-storage`, the admin might have created new
PVCs on the primary cluster `site2-storage`. Later, after recovering the
cluster `site1-storage`, the admin might choose to failback from
`site2-storage` to `site1-storage`. Now the admin needs to copy all the newly
created Kubernetes artifacts to the failback cluster. For the clusterID
mapping, the admin needs to copy the above-created configmap
`ceph-clusterid-mapping` to the failback cluster. When Ceph-CSI receives a
CSI/Replication request for the volumes created on `site2-storage`, it decodes
the volumeID and retrieves the clusterID, i.e. `site2-storage`. In the above
configmap `ceph-clusterid-mapping`, `site2-storage` is the value and
`site1-storage` is the key in the `clusterIDMapping` entry.

Ceph-CSI will check both the `key` and the `value` to resolve the clusterID
mapping. If the ID is found as a `key`, it will consider the `value` as the
corresponding mapping; if it is found as a `value`, it will treat the `key` as
the corresponding mapping, and then retrieve all the poolID details of that
cluster.
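
A small sketch of that bidirectional check, with an illustrative helper name:
the decoded clusterID is looked up first among the keys and then among the
values of a `clusterIDMapping` entry.

```go
package main

import "fmt"

// mappedClusterID returns the counterpart of id in a clusterIDMapping entry,
// matching it against both the keys and the values, as described above.
// The helper name is illustrative, not from ceph-csi.
func mappedClusterID(clusterIDMapping map[string]string, id string) (string, bool) {
	// Found as a key: the value is the corresponding cluster.
	if mapped, ok := clusterIDMapping[id]; ok {
		return mapped, true
	}
	// Found as a value: the key is the corresponding cluster.
	for key, value := range clusterIDMapping {
		if value == id {
			return key, true
		}
	}
	return "", false
}

func main() {
	m := map[string]string{"site1-storage": "site2-storage"}
	fmt.Println(mappedClusterID(m, "site1-storage")) // site2-storage true
	fmt.Println(mappedClusterID(m, "site2-storage")) // site1-storage true
}
```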

This mapping on the remote cluster is only required when we are doing a
failover operation from the primary cluster to a remote cluster. The existing
volumes that were created on the remote cluster do not require any mapping, as
their volumeHandle already contains the required information about the local
cluster (clusterID, poolID, etc.).