ceph-csi/docs/design/proposals/clusterid-mapping.md

# Design to handle clusterID and poolID for DR

During disaster recovery or migration of a cluster, as part of the failover, the
admin restores Kubernetes artifacts such as Deployments, PVCs and PVs to a new
cluster. Even after these objects are restored, the corresponding RBD
image/CephFS subvolume cannot be retrieved during CSI operations, because the
clusterID and poolID are not the same in both clusters. The problem is described
in more detail below.

`0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002`

The above is a sample volumeID returned in response to a CreateVolume request
and stored as the volumeHandle in the PV spec. The CO (Kubernetes) uses it as
the identifier for all further operations on the volume/PVC.

The volumeID is encoded as follows:

```text
0001 -->                              [csi_id_version=1:4byte] + [-:1byte]
0009 -->                              [length of clusterID=1:4byte] + [-:1byte]
rook-ceph -->                         [clusterID:36bytes (MAX)] + [-:1byte]
0000000000000002 -->                  [poolID:16bytes] + [-:1byte]
b0285c97-a0ce-11eb-8c66-0242ac110002 --> [ObjectUUID:36bytes]
Total of constant field lengths, including '-' field separators would hence be,
4+1+4+1+1+16+1+36 = 64
```
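The layout above can be decoded by slicing on the fixed field widths. The
following is a minimal sketch, not the Ceph-CSI implementation: the names
`VolumeID` and `decode_volume_id` are hypothetical, and it assumes that the
version, clusterID length and poolID fields are hexadecimal strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeID:
    version: int
    cluster_id: str
    pool_id: int
    object_uuid: str


def decode_volume_id(volume_id: str) -> VolumeID:
    """Split a volumeID into its fields, following the layout above."""
    # Fixed-width prefix: 4 chars version, '-', 4 chars clusterID length, '-'.
    # Assumption: the numeric fields are hexadecimal.
    version = int(volume_id[0:4], 16)
    cluster_len = int(volume_id[5:9], 16)

    # The clusterID is sliced by its decoded length rather than split on '-',
    # because a clusterID such as "rook-ceph" itself contains '-'.
    cluster_start = 10
    cluster_end = cluster_start + cluster_len
    cluster_id = volume_id[cluster_start:cluster_end]

    pool_start = cluster_end + 1
    pool_end = pool_start + 16
    pool_id = int(volume_id[pool_start:pool_end], 16)

    uuid_start = pool_end + 1
    object_uuid = volume_id[uuid_start:uuid_start + 36]

    # Every field separator must be '-' and the trailing UUID must be complete.
    separators = (4, 9, cluster_end, pool_end)
    if any(volume_id[i:i + 1] != "-" for i in separators) or len(object_uuid) != 36:
        raise ValueError(f"malformed volumeID: {volume_id!r}")

    return VolumeID(version, cluster_id, pool_id, object_uuid)


sample = "0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002"
print(decode_volume_id(sample))
```

For the sample volumeID this yields the clusterID `rook-ceph` and the poolID
`2`, which are the two values that differ between clusters.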

When mirroring is enabled, the volume, which is named `csi-vol-ObjectUUID`, is
mirrored to the other cluster.

> `csi-vol` is the default name prefix; the user has the option to override it
> in the storageclass.

During disaster recovery (the failover operation), the PVC and PV are recreated
on the other cluster. When Ceph-CSI receives a request for an operation such as
NodeStage, ExpandVolume or DeleteVolume, the request carries the volumeID, which
Ceph-CSI uses to identify the volume. The clusterID in the volumeID refers to an
entry in the Ceph-CSI configmap:

```yaml
apiVersion: v1
kind: ConfigMap
data:
  config.json: |-
    [
      {
        "clusterID": "rook-ceph",
        "rbd": {
          "radosNamespace": "<rados-namespace>"
        },
        "monitors": [
          "192.168.39.82:6789"
        ],
        "cephFS": {
          "subvolumeGroup": "<subvolumegroup for cephfs volumes>"
        }
      },
      {
        "clusterID": "fs-id",
        "rbd": {
          "radosNamespace": "<rados-namespace>"
        },
        "monitors": [
          "192.168.39.83:6789"
        ],
        "cephFS": {
          "subvolumeGroup": "<subvolumegroup for cephfs volumes>"
        }
      }
    ]
metadata:
  name: ceph-csi-config
```

During CSI/Replication operations, Ceph-CSI decodes the volumeID, reads the
monitor configuration for the clusterID from the configmap, resolves the pool
name from the poolID, retrieves the data stored in the rados OMAP, and finally
checks that the volume is present in the pool.
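The first step of that lookup, finding the monitors for a decoded clusterID,
can be sketched as follows. This is an illustration only: the function name
`monitors_for_cluster` is hypothetical, and the JSON is a trimmed copy of the
`config.json` content shown above.

```python
import json
from typing import List

# Trimmed copy of the config.json content from the configmap above.
CONFIG_JSON = """
[
  {"clusterID": "rook-ceph", "monitors": ["192.168.39.82:6789"]},
  {"clusterID": "fs-id", "monitors": ["192.168.39.83:6789"]}
]
"""


def monitors_for_cluster(config_json: str, cluster_id: str) -> List[str]:
    """Return the monitor list of the entry whose clusterID matches."""
    for entry in json.loads(config_json):
        if entry.get("clusterID") == cluster_id:
            return entry["monitors"]
    # No entry for this clusterID: this is the situation after a failover,
    # when the volumeID still carries the clusterID of the other site.
    raise KeyError(f"clusterID {cluster_id!r} not found in config.json")


print(monitors_for_cluster(CONFIG_JSON, "rook-ceph"))
```

A clusterID that exists only on the other site raises `KeyError` here, which is
the failure that the rest of this document sets out to handle.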

## Problems with volumeID Replication

* The clusterID can be different
   * With Rook, the clusterID is the namespace where Rook is deployed, and Rook
    might be deployed in a different namespace on the secondary cluster
   * In standalone Ceph-CSI, the clusterID is the fsID, and the fsID is unique
    per cluster

* The poolID can be different
   * The poolID encoded in the volumeID won't remain the same across
    clusters

To solve this problem, we need a new mapping between the clusterIDs and the
poolIDs of the clusters.

The following example configmap needs to be created before failing over to
`site2-storage` from `site1-storage` or `site3-storage`. In every mapping, the
key is the ID on the source site (site1 or site3) and the value is the ID on
site2: `clusterIDMapping` maps clusterIDs, `RBDPoolIDMapping` maps RBD poolIDs
and `CephFSFscIDMapping` maps CephFS FscIDs.

```yaml
apiVersion: v1
kind: ConfigMap
data:
  cluster-mapping.json: |-
    [
      {
        "clusterIDMapping": {
          "site1-storage": "site2-storage"
        },
        "RBDPoolIDMapping": [{
          "1": "2",
          "11": "12"
        }],
        "CephFSFscIDMapping": [{
          "13": "34",
          "3": "4"
        }]
      },
      {
        "clusterIDMapping": {
          "site3-storage": "site2-storage"
        },
        "RBDPoolIDMapping": [{
          "5": "2",
          "16": "12"
        }],
        "CephFSFscIDMapping": [{
          "3": "34",
          "4": "4"
        }]
      }
    ]
metadata:
  name: ceph-csi-config
```

**Note:** the configmap is mounted as a volume in the CSI (provisioner and
node plugin) pods.

The admin creates the above configmap as it is, or updates it with new entries
(if new pools have been created on the existing cluster), when choosing to
failover/failback the cluster.

Whenever Ceph-CSI receives a CSI/Replication request, it first decodes the
volumeHandle and tries to get the required OMAP details. If it cannot retrieve
the poolID or clusterID details from the decoded volumeHandle, Ceph-CSI checks
the clusterID and poolID mapping.

For example, the old volumeID
`0001-000d-site1-storage-0000000000000001-b0285c97-a0ce-11eb-8c66-0242ac110002`
contains `site1-storage` as the clusterID. Ceph-CSI looks up the corresponding
clusterID, `site2-storage`, in the above configmap. If the clusterID mapping is
found, Ceph-CSI then looks for the poolID mapping, i.e. the mapping between `1`
and `2`.

Example: a pool with the same name exists on both clusters with different IDs,
Replicapool with ID `1` on site1 and Replicapool with ID `2` on site2.

After getting the required mapping, Ceph-CSI has the information it needs to get
more details from the rados OMAP. If there are multiple clusterID mappings, it
loops through all of them and checks the corresponding pool for the OMAP data.
If no clusterID mapping exists, Ceph-CSI returns a `Not Found` error to the
caller.
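The loop over multiple mappings can be sketched as below. This is a simplified
illustration under stated assumptions: `mapped_candidates` and `resolve` are
hypothetical names, only `RBDPoolIDMapping` is considered, the OMAP check
against each candidate pool is left out, and `LookupError` stands in for the
`Not Found` error returned to the caller.

```python
import json
from typing import Iterator, List, Tuple

# Trimmed copy of the cluster-mapping.json content shown earlier.
MAPPING_JSON = """
[
  {"clusterIDMapping": {"site1-storage": "site2-storage"},
   "RBDPoolIDMapping": [{"1": "2", "11": "12"}]},
  {"clusterIDMapping": {"site3-storage": "site2-storage"},
   "RBDPoolIDMapping": [{"5": "2", "16": "12"}]}
]
"""


def mapped_candidates(mapping_json: str, cluster_id: str,
                      pool_id: str) -> Iterator[Tuple[str, str]]:
    """Yield (local clusterID, local poolID) for every mapping entry that matches."""
    for entry in json.loads(mapping_json):
        local_cluster = entry.get("clusterIDMapping", {}).get(cluster_id)
        if local_cluster is None:
            continue  # this entry describes a different source cluster
        for pool_map in entry.get("RBDPoolIDMapping", []):
            if pool_id in pool_map:
                yield local_cluster, pool_map[pool_id]


def resolve(mapping_json: str, cluster_id: str, pool_id: str) -> List[Tuple[str, str]]:
    """Return all candidates to check for OMAP data, or fail with Not Found."""
    candidates = list(mapped_candidates(mapping_json, cluster_id, pool_id))
    if not candidates:
        raise LookupError(f"Not Found: no mapping for {cluster_id}/{pool_id}")
    return candidates


print(resolve(MAPPING_JSON, "site1-storage", "1"))
```

In a real driver each candidate would still have to be verified against the
OMAP data in the corresponding pool before it is used.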

After failover to the cluster `site2-storage`, the admin might have created new
PVCs on `site2-storage`, which is now the primary cluster. Later, after
recovering the cluster `site1-storage`, the admin might choose to failback from
`site2-storage` to `site1-storage`. The admin then needs to copy all the newly
created Kubernetes artifacts to the failback cluster. For the clusterID mapping,
the admin needs to copy the above-created configmap (named `ceph-csi-config` in
the example) to the failback cluster. When Ceph-CSI receives a CSI/Replication
request for a volume created on `site2-storage`, it decodes the volumeID and
retrieves the clusterID, i.e. `site2-storage`. In the above configmap,
`site2-storage` is the value and `site1-storage` is the key of the
`clusterIDMapping` entry.

Ceph-CSI checks both the `key` and the `value` of each clusterID mapping. If
the clusterID is found as a `key`, Ceph-CSI treats the `value` as the
corresponding clusterID; if it is found as a `value`, Ceph-CSI treats the `key`
as the corresponding clusterID. It then retrieves all the poolID details of
that cluster.
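The key/value check can be sketched as a lookup that works in both directions.
This is a minimal illustration, not the Ceph-CSI code: `lookup_both_ways` is a
hypothetical name, and inverting the poolID mapping when the match is on the
`value` side is an assumption about how the pool details would be oriented.

```python
from typing import Dict, Optional, Tuple


def lookup_both_ways(entry: dict, cluster_id: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (counterpart clusterID, poolID map seen from cluster_id), or None."""
    # Flatten the list of single maps into one {key-side poolID: value-side poolID}.
    pools = {
        src: dst
        for pool_map in entry.get("RBDPoolIDMapping", [])
        for src, dst in pool_map.items()
    }
    for key, value in entry.get("clusterIDMapping", {}).items():
        if cluster_id == key:
            # Failover direction: the IDs are used exactly as written.
            return value, pools
        if cluster_id == value:
            # Failback direction: swap the roles of key and value.
            # Assumption: the pool mapping is inverted in the same way.
            return key, {dst: src for src, dst in pools.items()}
    return None


entry = {
    "clusterIDMapping": {"site1-storage": "site2-storage"},
    "RBDPoolIDMapping": [{"1": "2", "11": "12"}],
}
print(lookup_both_ways(entry, "site2-storage"))
```

With this shape, the same configmap entry serves both the failover from
`site1-storage` and the later failback from `site2-storage`.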

This mapping is only required on the remote cluster when failing over from the
primary cluster to that remote cluster. Existing volumes that were created on
the remote cluster do not require any mapping, as their volumeHandle already
contains the required information about the local cluster (clusterID, poolID,
etc.).