Mirror of https://github.com/ceph/ceph-csi.git, synced 2024-11-19 04:40:19 +00:00 (commit 0f8813d89f)
In the case of Async DR, the volumeID will not be the same if the clusterID or the PoolID is different. With the earlier implementation, the new volumeID mapping was expected to be stored in the rados omap pool. In the case of a ControllerExpand or DeleteVolume request, only the volumeID is sent, so it is not possible to find the corresponding poolID in the new cluster. With this change, it works as follows: the csi-rbdplugin-controller watches for PV objects, and when a PV object is created it checks whether the omap already exists. If the omap does not exist, it generates the new volumeID and checks for the volumeID mapping entry in the PV annotation; if the mapping does not exist, it adds the new entry to the PV annotation. If the omap does not exist, cephcsi checks the PV annotations; if the mapping exists in the PV annotation, it uses the new volumeID for further operations. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
# RBD MIRRORING

RBD mirroring is the process of replicating RBD images between two or more
Ceph clusters. Mirroring ensures point-in-time, crash-consistent RBD images
between clusters. RBD mirroring is mainly used for disaster recovery (i.e.
having a secondary site as a failover). See the [Ceph
documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on RBD
mirroring for complete information.

## Architecture

![mirror](rbd-mirror.png)

## Design

Currently, Ceph-CSI generates a unique ID for each RBD image and stores the
mapping between the PersistentVolume (PV) name and that unique ID. It creates
the RBD image with the unique ID and returns an encoded value which contains
all the information required for other operations. For mirroring, the same RBD
image will be mirrored to the secondary cluster. As the journal (OMAP data) is
not mirrored to the secondary cluster, the RBD images corresponding to the PV
cannot be identified there without the OMAP data.

**Pre-req:** It is expected that the Kubernetes Admin/User will create the
static PersistentVolumeClaim (PVC) on the secondary site during the failover.

**Note:** When the static PVC is created on the secondary site, we can no
longer use the VolumeHandle to identify the OMAP data or the image, because the
VolumeHandle contains only the PoolID and the ClusterID. We cannot identify the
correct pool name from the PoolID, because the pool name remains the same on
both clusters but the PoolID does not. Even the ClusterID can be different on
the secondary cluster.

> Sample PV spec which will be used by the rbdplugin controller to regenerate
> the OMAP data

```yaml
csi:
  controllerExpandSecretRef:
    name: rook-csi-rbd-provisioner
    namespace: rook-ceph
  driver: rook-ceph.rbd.csi.ceph.com
  fsType: ext4
  nodeStageSecretRef:
    name: rook-csi-rbd-node
    namespace: rook-ceph
  volumeAttributes:
    clusterID: rook-ceph
    imageFeatures: layering
    imageFormat: "2"
    imageName: csi-vol-0c23de1c-18fb-11eb-a903-0242ac110005
    journalPool: replicapool
    pool: replicapool
    radosNamespace: ""
  volumeHandle: 0001-0009-rook-ceph-0000000000000002-0c23de1c-18fb-11eb-a903-0242ac110005
```

> **VolumeHandle** is the unique volume name returned by the CSI volume plugin’s
> CreateVolume to refer to the volume on all subsequent calls.

Once the static PVC is created on the secondary cluster, the Kubernetes User
can try to delete, expand, or mount the PVC. When mounting (NodeStageVolume),
we get the volume context in the RPC call, but not in the Delete/Expand
requests. In a Delete/Expand RPC request, only the VolumeHandle
(`clusterID-poolID-volumeuniqueID`) is sent, which contains the encoded
ClusterID and PoolID. The VolumeHandle is not useful in the secondary cluster,
as the PoolID and ClusterID may not always be the same.

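To make the contents of a VolumeHandle concrete, here is a minimal Python sketch that splits the sample handle from the PV spec into its parts. The field layout (a 4-hex-digit version, a 4-hex-digit ClusterID length, the ClusterID, a 16-hex-digit PoolID, then the unique ID) is an assumption inferred from the sample handle, and `VolumeHandle` and `decode_volume_handle` are hypothetical names, not Ceph-CSI APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeHandle:
    version: int
    cluster_id: str
    pool_id: int
    volume_uuid: str


def decode_volume_handle(handle: str) -> VolumeHandle:
    """Split a handle into fields, assuming the layout described in the lead-in."""
    version = int(handle[0:4], 16)
    cluster_len = int(handle[5:9], 16)  # length of the ClusterID that follows
    cluster_end = 10 + cluster_len
    cluster_id = handle[10:cluster_end]
    pool_start = cluster_end + 1
    pool_id = int(handle[pool_start:pool_start + 16], 16)
    volume_uuid = handle[pool_start + 17:]

    # Every field is expected to be separated by a single '-'.
    for pos in (4, 9, cluster_end, pool_start + 16):
        if handle[pos:pos + 1] != "-":
            raise ValueError(f"unexpected separator at offset {pos}")
    if len(volume_uuid) != 36:
        raise ValueError("unique ID is not a 36-character UUID")
    return VolumeHandle(version, cluster_id, pool_id, volume_uuid)


sample = "0001-0009-rook-ceph-0000000000000002-0c23de1c-18fb-11eb-a903-0242ac110005"
decoded = decode_volume_handle(sample)
print(decoded.cluster_id, decoded.pool_id, decoded.volume_uuid)
```
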
> This design document covers a new controller (the rbdplugin controller), not
> the replication controller. In upcoming releases we will design the
> replication controller to perform mirroring operations. The rbdplugin
> controller will run as a sidecar in the RBD provisioner pod.

To solve this problem, we will have a new controller (the rbdplugin controller)
running as part of the provisioner pod, which watches for PV objects. When a PV
is created, it will extract the required information from the PV spec,
regenerate the OMAP data, generate a new VolumeHandle
(`newclusterID-newpoolID-volumeuniqueID`), and add a PV annotation
`csi.ceph.io/volume-handle` that maps the old VolumeHandle to the new
VolumeHandle. Whenever Ceph-CSI gets an RPC request with the older
VolumeHandle, it will check whether a new VolumeHandle exists for the old
VolumeHandle. If yes, it uses the new VolumeHandle for internal operations (to
get the pool name, Ceph monitor details from the ClusterID, etc.).

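The annotation-based lookup can be sketched in Python as follows. This is an illustration under stated assumptions, not the Ceph-CSI implementation: the PV is modelled as a plain dictionary, the annotation value is assumed to be the new VolumeHandle itself, the handle layout (including the fixed `0001` prefix) is inferred from the sample PV spec, and `encode_volume_handle`, `annotate_pv`, and `resolve_volume_handle` are hypothetical helper names. The secondary ClusterID `site-b` and PoolID `7` are made-up example values.

```python
import copy

ANNOTATION_KEY = "csi.ceph.io/volume-handle"


def encode_volume_handle(cluster_id: str, pool_id: int, volume_uuid: str) -> str:
    # Assumed layout: version, ClusterID length, ClusterID, PoolID, unique ID.
    return f"0001-{len(cluster_id):04x}-{cluster_id}-{pool_id:016x}-{volume_uuid}"


def annotate_pv(pv: dict, new_cluster_id: str, new_pool_id: int) -> dict:
    """Return a copy of the PV carrying the regenerated handle as an annotation."""
    old_handle = pv["spec"]["csi"]["volumeHandle"]
    volume_uuid = old_handle[-36:]  # the unique ID does not change across clusters
    updated = copy.deepcopy(pv)
    annotations = updated.setdefault("metadata", {}).setdefault("annotations", {})
    # Only add the mapping when it is not already present.
    annotations.setdefault(
        ANNOTATION_KEY, encode_volume_handle(new_cluster_id, new_pool_id, volume_uuid)
    )
    return updated


def resolve_volume_handle(pv: dict) -> str:
    """Prefer the annotated (new) handle and fall back to the original one."""
    annotations = pv.get("metadata", {}).get("annotations", {})
    return annotations.get(ANNOTATION_KEY, pv["spec"]["csi"]["volumeHandle"])


old = "0001-0009-rook-ceph-0000000000000002-0c23de1c-18fb-11eb-a903-0242ac110005"
pv = {"metadata": {"name": "example-pv"}, "spec": {"csi": {"volumeHandle": old}}}
updated = annotate_pv(pv, new_cluster_id="site-b", new_pool_id=7)
print(resolve_volume_handle(updated))
```
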
Currently, we make use of watchers in the node stage request to make sure a
ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time.
We need to change the watcher logic in the node stage request because, when
RBD mirroring is enabled on an image, the RBD mirroring daemon adds a watcher
on that image.

To solve the ClusterID problem, if the ClusterID is different on the second
cluster, the admin has to create a new ConfigMap for the mapped ClusterIDs.
Whenever Ceph-CSI gets a request, it will check whether a ClusterID mapping
exists and use the mapped ClusterID to get information such as the Ceph
monitors.

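The ClusterID lookup described above amounts to a dictionary lookup with a fallback. The sketch below assumes the ConfigMap data is a JSON object that maps a primary ClusterID to its secondary counterpart; that format, the `rook-ceph-secondary` value, and the `map_cluster_id` helper are all hypothetical, since this design does not specify the ConfigMap layout.

```python
import json

# Hypothetical ConfigMap payload: primary ClusterID -> secondary ClusterID.
CLUSTER_MAPPING_JSON = '{"rook-ceph": "rook-ceph-secondary"}'


def map_cluster_id(cluster_id: str, mapping: dict) -> str:
    """Return the mapped ClusterID, or the original one when no mapping exists."""
    return mapping.get(cluster_id, cluster_id)


cluster_mapping = json.loads(CLUSTER_MAPPING_JSON)
print(map_cluster_id("rook-ceph", cluster_mapping))
print(map_cluster_id("other-cluster", cluster_mapping))
```
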
**This design does not cover the below items:**

* Bootstrapping the RBD mirror daemon.
* Mirroring of PVC snapshots.
* Mirroring for topology provisioned PVCs.
* Documenting the steps to handle failover/failback of an image.
* Workflow of a replication controller.