mirror of
https://github.com/ceph/ceph-csi.git
synced 2025-01-07 12:29:31 +00:00
doc: add design doc for clusterid poolid mapping
added design doc to handle volumeID mapping in case
of the failover in the Disaster Recovery.
update #2118
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 5fc9c3a046)
parent cbe3ac71f3
commit f65961d01e
176
docs/design/proposals/clusterid-mapping.md
Normal file
@ -0,0 +1,176 @@
# Design to handle clusterID and poolID for DR

During disaster recovery/migration of a cluster, as part of the failover, the
Kubernetes artifacts like Deployment, PVC, PV, etc. will be restored to a new
cluster by the admin. Even if the Kubernetes objects are restored, the
corresponding RBD image or CephFS subvolume cannot be retrieved during CSI
operations, as the clusterID and poolID are not the same in both clusters.
Let's see the problem in more detail below.

`0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002`

The above is a sample volumeID sent back in response to the CreateVolume
operation and added as the volumeHandle in the PV spec. The CO (Kubernetes)
uses it as the identifier for all other operations on the volume/PVC.

The volumeID is encoded as:

```text
0001 --> [csi_id_version=1:4byte] + [-:1byte]
0009 --> [length of clusterID=9:4byte] + [-:1byte]
rook-ceph --> [clusterID:36bytes (MAX)] + [-:1byte]
0000000000000002 --> [poolID:16bytes] + [-:1byte]
b0285c97-a0ce-11eb-8c66-0242ac110002 --> [ObjectUUID:36bytes]
Total of constant field lengths, including '-' field separators would hence be,
4+1+4+1+1+16+1+36 = 64
```
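
The layout above can be decoded with plain string slicing. The following is a
minimal sketch, not the Ceph-CSI implementation: the `VolumeID` type and the
`decode_volume_id` function are hypothetical names, and the sketch assumes the
version, clusterID length, and poolID fields are written as hexadecimal digits.

```python
from dataclasses import dataclass


@dataclass
class VolumeID:
    """Decoded fields of a volumeID (hypothetical helper type)."""
    version: int
    cluster_id: str
    pool_id: int
    object_uuid: str


def decode_volume_id(volume_id: str) -> VolumeID:
    """Split a volumeID into its fields, following the layout above."""
    # Assumption: the numeric fields are hex-encoded.
    version = int(volume_id[0:4], 16)
    cluster_len = int(volume_id[5:9], 16)
    start = 10  # 4 (version) + 1 ('-') + 4 (length) + 1 ('-')
    cluster_id = volume_id[start:start + cluster_len]
    start += cluster_len + 1  # skip the clusterID and its '-' separator
    pool_id = int(volume_id[start:start + 16], 16)
    start += 16 + 1  # skip the poolID and its '-' separator
    object_uuid = volume_id[start:start + 36]
    return VolumeID(version, cluster_id, pool_id, object_uuid)


vol = decode_volume_id(
    "0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002"
)
print(vol.cluster_id, vol.pool_id, vol.object_uuid)
```

Only the clusterID has a variable length, which is why its length is carried
in a separate field ahead of it.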

When mirroring is enabled, the volume, which is named `csi-vol-ObjectUUID`, is
mirrored to the other cluster.

> `csi-vol` is the default name prefix, and the user has the option to
> override it in the storageclass.

During Disaster Recovery (the failover operation), the PVC and PV will be
recreated on the other cluster. When Ceph-CSI receives a request for
operations like NodeStage, ExpandVolume, DeleteVolume, etc., the volumeID is
sent in the request and is used to identify the volume.

```yaml
apiVersion: v1
kind: ConfigMap
data:
  config.json: |-
    [
      {
        "clusterID": "rook-ceph",
        "radosNamespace": "<rados-namespace>",
        "monitors": [
          "192.168.39.82:6789"
        ],
        "cephFS": {
          "subvolumeGroup": "<subvolumegroup for cephfs volumes>"
        }
      },
      {
        "clusterID": "fs-id",
        "radosNamespace": "<rados-namespace>",
        "monitors": [
          "192.168.39.83:6789"
        ],
        "cephFS": {
          "subvolumeGroup": "<subvolumegroup for cephfs volumes>"
        }
      }
    ]
metadata:
  name: ceph-csi-config
```

During CSI/Replication operations, Ceph-CSI decodes the volumeID and uses the
clusterID to get the monitor configuration from the configmap. It then uses
the poolID to get the pool name, retrieves the data stored in the rados OMAP,
and finally checks that the volume is present in the pool.
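
The first step of that lookup, and the way it fails after a failover, can be
sketched as follows. This is an illustration only: `monitors_for` is a
hypothetical helper, and `CONFIG_JSON` is a trimmed copy of the `config.json`
shown above.

```python
import json

# Trimmed copy of the config.json from the ceph-csi-config configmap above.
CONFIG_JSON = """
[
  {"clusterID": "rook-ceph", "monitors": ["192.168.39.82:6789"]},
  {"clusterID": "fs-id", "monitors": ["192.168.39.83:6789"]}
]
"""


def monitors_for(cluster_id, config_json=CONFIG_JSON):
    """Return the monitors configured for the clusterID decoded from a volumeID."""
    for entry in json.loads(config_json):
        if entry["clusterID"] == cluster_id:
            return entry["monitors"]
    raise KeyError(f"clusterID {cluster_id!r} not found in config.json")


print(monitors_for("rook-ceph"))

# A volumeID created on another cluster carries that cluster's clusterID,
# which the local config.json does not know about.
try:
    monitors_for("site1-storage")
except KeyError as err:
    print("lookup failed:", err)
```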

## Problems with volumeID Replication

* The clusterID can be different
  * As the clusterID is the namespace where Rook is deployed, Rook might be
    deployed in a different namespace on the secondary cluster
  * In standalone Ceph-CSI the clusterID is the fsID, and the fsID is unique
    per cluster

* The poolID can be different
  * The poolID encoded in the volumeID won't remain the same across
    clusters

To solve this problem, we need a new mapping between the clusterIDs and the
poolIDs.

An example configmap that needs to be created before failover to
`site2-storage` from `site1-storage` and `site3-storage`:

```yaml
apiVersion: v1
kind: ConfigMap
data:
  cluster-mapping.json: |-
    [{
      "clusterIDMapping": {
        "site1-storage" (clusterID on site1): "site2-storage" (clusterID on site2)
      },
      "RBDPoolIDMapping": [{
        "1" (poolID on site1): "2" (poolID on site2),
        "11": "12"
      }],
      "CephFSFsIDMapping": [{
        "13" (FsID on site1): "34" (FsID on site2),
        "3": "4"
      }]
    }, {
      "clusterIDMapping": {
        "site3-storage" (clusterID on site3): "site2-storage" (clusterID on site2)
      },
      "RBDPoolIDMapping": [{
        "5" (poolID on site3): "2" (poolID on site2),
        "16": "12"
      }],
      "CephFSFsIDMapping": [{
        "3" (FsID on site3): "34" (FsID on site2),
        "4": "4"
      }]
    }]
metadata:
  name: ceph-csi-config
```

**Note:** The configmap will be mounted as a volume in the CSI (provisioner
and node plugin) pods.

The above configmap will be created as it is, or updated with new entries (if
new pools are created on the existing cluster), when the admin chooses to
failover/failback the cluster.

Whenever Ceph-CSI receives a CSI/Replication request, it will first decode the
volumeHandle and try to get the required OMAP details. If it is not able to
resolve the poolID or clusterID decoded from the volumeHandle, Ceph-CSI
will check for the clusterID and poolID mapping.

If the old volumeID
`0001-000d-site1-storage-0000000000000001-b0285c97-a0ce-11eb-8c66-0242ac110002`
contains `site1-storage` as the clusterID, Ceph-CSI will look for the
corresponding clusterID `site2-storage` in the above configmap. If the
clusterID mapping is found, Ceph-CSI will then look for the poolID mapping,
i.e. the mapping between `1` and `2`.

Example: a pool with the same name exists on both clusters with different IDs,
such as Replicapool with ID `1` on site1 and Replicapool with ID `2` on site2.

After getting the required mapping, Ceph-CSI has the information it needs to
get more details from the rados OMAP. If there are multiple clusterID
mappings, it will loop through all of them and check the corresponding pool to
get the OMAP data. If the clusterID mapping does not exist, Ceph-CSI will
return a `Not Found` error message to the caller.
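
The forward lookup described above can be sketched as below. This is a rough
illustration under stated assumptions, not the Ceph-CSI code:
`mapped_rbd_locations` is a hypothetical function, and `CLUSTER_MAPPING` is the
example `cluster-mapping.json` with the explanatory annotations removed so
that it is valid data.

```python
# The example cluster-mapping.json above, without the inline annotations.
CLUSTER_MAPPING = [
    {
        "clusterIDMapping": {"site1-storage": "site2-storage"},
        "RBDPoolIDMapping": [{"1": "2", "11": "12"}],
        "CephFSFsIDMapping": [{"13": "34", "3": "4"}],
    },
    {
        "clusterIDMapping": {"site3-storage": "site2-storage"},
        "RBDPoolIDMapping": [{"5": "2", "16": "12"}],
        "CephFSFsIDMapping": [{"3": "34", "4": "4"}],
    },
]


def mapped_rbd_locations(cluster_id, pool_id, mappings=CLUSTER_MAPPING):
    """Return candidate (clusterID, poolID) pairs for a decoded RBD volumeID."""
    candidates = []
    # Loop through every mapping entry, as there can be more than one.
    for entry in mappings:
        target_cluster = entry["clusterIDMapping"].get(cluster_id)
        if target_cluster is None:
            continue
        for pool_map in entry["RBDPoolIDMapping"]:
            target_pool = pool_map.get(str(pool_id))
            if target_pool is not None:
                candidates.append((target_cluster, int(target_pool)))
    if not candidates:
        # Stands in for the Not Found error returned to the caller.
        raise LookupError("Not Found")
    return candidates


print(mapped_rbd_locations("site1-storage", 1))
```

Each returned pair is only a candidate: the OMAP data in the mapped pool still
has to be checked to confirm that the volume exists there.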

After failover to the cluster `site2-storage`, the admin might have created new
PVCs on `site2-storage`, which is now the primary cluster. Later, after
recovering the cluster `site1-storage`, the admin might choose to failback from
`site2-storage` to `site1-storage`. The admin then needs to copy all the newly
created Kubernetes artifacts to the failback cluster. For the clusterID
mapping, the admin needs to copy the above-created configmap
`ceph-clusterid-mapping` to the failback cluster. When Ceph-CSI receives a
CSI/Replication request for a volume created on `site2-storage`, it will
decode the volumeID and retrieve the clusterID, i.e. `site2-storage`. In the
above configmap `ceph-clusterid-mapping`, `site2-storage` is the value and
`site1-storage` is the key in the `clusterIDMapping` entry.

Ceph-CSI will check both the `key` and the `value` when looking for the
clusterID mapping. If the clusterID is found as a `key`, it will treat the
`value` as the corresponding mapping; if it is found as a `value`, it will
treat the `key` as the corresponding mapping. It will then retrieve all the
poolID details of the cluster.
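
A minimal sketch of that two-way lookup is shown below. The `peer_id` function
is a hypothetical name, and the dictionaries are taken from the first entry of
the example mapping; the same helper is applied to poolIDs here only as an
assumption about how the reverse direction could work.

```python
from typing import Dict, Optional


def peer_id(known_id: str, mapping: Dict[str, str]) -> Optional[str]:
    """Look up an ID on either side of a mapping and return the other side."""
    for key, value in mapping.items():
        if key == known_id:
            return value  # found as key: the value is the peer
        if value == known_id:
            return key  # found as value: the key is the peer
    return None


cluster_id_mapping = {"site1-storage": "site2-storage"}
rbd_pool_id_mapping = {"1": "2", "11": "12"}

# Failover: a volumeID created on site1 is resolved on site2.
print(peer_id("site1-storage", cluster_id_mapping))
# Failback: a volumeID created on site2 is resolved on site1.
print(peer_id("site2-storage", cluster_id_mapping))
print(peer_id("12", rbd_pool_id_mapping))
```

Because the same configmap is read in both directions, the admin can copy it
to the failback cluster unchanged.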

This mapping on the remote cluster is only required when we are doing a
failover operation from the primary cluster to a remote cluster. The existing
volumes that were created on the remote cluster do not require
any mapping, as the volumeHandle already contains the required information about
the local cluster (clusterID, poolID, etc.).