mirror of
https://github.com/ceph/ceph-csi.git
synced 2024-11-09 16:00:22 +00:00
doc: design document for rbd mirroring
This document outlines the internal cephcsi design to handle mirrored RBD images.

Co-authored-by: ShyamsundarR <srangana@redhat.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit is contained in:
parent
39b1f2b4d3
commit
28793efc90
102
docs/design/proposals/rbd-mirror.md
Normal file
@ -0,0 +1,102 @@
# RBD MIRRORING

RBD mirroring is the process of replicating RBD images between two or more
Ceph clusters. Mirroring ensures point-in-time, crash-consistent RBD images
between clusters. RBD mirroring is mainly used for disaster recovery (i.e.
having a secondary site as a failover). See the [Ceph
documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on RBD
mirroring for complete information.

## Architecture

![mirror](rbd-mirror.png)

## Design

Currently, Ceph-CSI generates a unique ID for each RBD image and stores the
mapping between the corresponding PersistentVolume (PV) name and that unique
ID. It creates the RBD image with the unique ID and returns an encoded value
that contains all the information required for other operations. For
mirroring, the same RBD image will be mirrored to the secondary cluster. As the
journal (OMAP data) is not mirrored to the secondary cluster, the RBD image
corresponding to a PV cannot be identified there without the OMAP data.

**Pre-req:** It is expected that the Kubernetes admin/user will create the
static PersistentVolumeClaim (PVC) on the secondary site during the failover.

**Note:** When the static PVC is created on the secondary site, we can no
longer use the VolumeHandle to identify the OMAP data or the image, because the
VolumeHandle contains only the PoolID and the ClusterID. We cannot identify the
correct pool name from the PoolID because the pool name remains the same on
both clusters but the PoolID does not. Even the ClusterID can be different on
the secondary cluster.

> Sample PV spec which will be used by the rbdplugin controller to regenerate
> the OMAP data

```yaml
csi:
  controllerExpandSecretRef:
    name: rook-csi-rbd-provisioner
    namespace: rook-ceph
  driver: rook-ceph.rbd.csi.ceph.com
  fsType: ext4
  nodeStageSecretRef:
    name: rook-csi-rbd-node
    namespace: rook-ceph
  volumeAttributes:
    clusterID: rook-ceph
    imageFeatures: layering
    imageFormat: "2"
    imageName: csi-vol-0c23de1c-18fb-11eb-a903-0242ac110005
    journalPool: replicapool
    pool: replicapool
    radosNamespace: ""
  volumeHandle: 0001-0009-rook-ceph-0000000000000002-0c23de1c-18fb-11eb-a903-0242ac110005
```

> **VolumeHandle** is the unique volume name returned by the CSI volume plugin's
> CreateVolume to refer to the volume on all subsequent calls.

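To make the contents of the sample `volumeHandle` concrete, the following is a minimal sketch that splits it into its parts. The field layout (a 4-digit hex version, a 4-digit hex ClusterID length, the ClusterID, a 16-digit hex PoolID, then the volume UUID) is an assumption read off the sample value above, and `VolumeHandle` and `decode_volume_handle` are hypothetical names, not Ceph-CSI functions.

```python
from dataclasses import dataclass


@dataclass
class VolumeHandle:
    version: int
    cluster_id: str
    pool_id: int
    volume_uuid: str


def decode_volume_handle(handle: str) -> VolumeHandle:
    """Split a handle into its fields.

    Assumed layout (inferred from the sample PV spec, not from Ceph-CSI code):
    <version:4 hex>-<clusterID length:4 hex>-<clusterID>-<poolID:16 hex>-<uuid>
    """
    version = int(handle[0:4], 16)
    cluster_id_len = int(handle[5:9], 16)

    # The ClusterID may itself contain '-', so it is cut by length, not split.
    start = 10
    cluster_id = handle[start:start + cluster_id_len]

    pool_start = start + cluster_id_len + 1
    pool_end = pool_start + 16
    if handle[4] != "-" or handle[9] != "-" or handle[pool_end:pool_end + 1] != "-":
        raise ValueError(f"unexpected volume handle layout: {handle!r}")

    pool_id = int(handle[pool_start:pool_end], 16)
    volume_uuid = handle[pool_end + 1:]
    return VolumeHandle(version, cluster_id, pool_id, volume_uuid)


sample = "0001-0009-rook-ceph-0000000000000002-0c23de1c-18fb-11eb-a903-0242ac110005"
decoded = decode_volume_handle(sample)
print(decoded)
```

Under that assumed layout the handle carries the ClusterID (`rook-ceph`) and a numeric PoolID (`2`), but no pool name, which is why it cannot be resolved on a cluster where either value differs.
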
Once the static PVC is created on the secondary cluster, the Kubernetes user
can try to delete, expand, or mount the PVC. In the case of mounting
(NodeStageVolume), we get the volume context in the RPC call, but not in the
Delete/Expand requests. In a Delete/Expand RPC request, only the VolumeHandle
(`clusterID-poolID-volumeuniqueID`) is sent, which contains the encoded
ClusterID and PoolID. The VolumeHandle is not useful on the secondary cluster
as the PoolID and ClusterID may not be the same there.

> This design document covers a new controller (the rbdplugin controller), not
> the replication controller. In upcoming releases we will design the
> replication controller to perform mirroring operations. The rbdplugin
> controller will run as a sidecar in the RBD provisioner pod.

To solve this problem, we will have a new controller (the rbdplugin controller)
running as part of the provisioner pod, which watches for PV objects. When a PV
is created, it extracts the required information from the PV spec, regenerates
the OMAP data, generates a new VolumeHandle
(`newclusterID-newpoolID-volumeuniqueID`), and creates an OMAP object that maps
the old VolumeHandle to the new VolumeHandle. Whenever Ceph-CSI gets an RPC
request with the older VolumeHandle, it checks whether a new VolumeHandle
exists for the old VolumeHandle. If so, it uses the new VolumeHandle for
internal operations (to get the pool name, the Ceph monitor details from the
ClusterID, etc.).

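The old-to-new VolumeHandle lookup described above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory dictionary stands in for the OMAP object, the handle layout is inferred from the sample PV spec, and `encode_volume_handle`, `HandleMap`, and the `site-b-ceph` ClusterID are hypothetical names introduced for the example.

```python
def encode_volume_handle(cluster_id: str, pool_id: int, volume_uuid: str,
                         version: int = 1) -> str:
    # Assumed layout, inferred from the sample handle in the PV spec:
    # <version:4 hex>-<clusterID length:4 hex>-<clusterID>-<poolID:16 hex>-<uuid>
    return (f"{version:04x}-{len(cluster_id):04x}-{cluster_id}-"
            f"{pool_id:016x}-{volume_uuid}")


class HandleMap:
    """In-memory stand-in for the OMAP object mapping old to new handles."""

    def __init__(self) -> None:
        self._old_to_new: dict[str, str] = {}

    def record(self, old_handle: str, new_handle: str) -> None:
        self._old_to_new[old_handle] = new_handle

    def resolve(self, handle: str) -> str:
        # Fall back to the handle itself when no mapping exists, which is
        # the case for volumes created natively on this cluster.
        return self._old_to_new.get(handle, handle)


volume_uuid = "0c23de1c-18fb-11eb-a903-0242ac110005"

# Handle stored in the PV, generated on the primary cluster.
old_handle = encode_volume_handle("rook-ceph", 2, volume_uuid)

# Handle regenerated on the secondary cluster: same volume UUID, but the
# ClusterID and PoolID of the secondary site (hypothetical values).
new_handle = encode_volume_handle("site-b-ceph", 5, volume_uuid)

handles = HandleMap()
handles.record(old_handle, new_handle)

print(handles.resolve(old_handle))
```

Only the ClusterID and PoolID portions change between the two handles; the volume UUID is kept, so the regenerated handle still refers to the same mirrored image.
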
Currently, we use watchers in the node stage request to make sure a
ReadWriteOnce (RWO) PVC is mounted on a single node at any given point in time.
We need to change the watcher logic in the node stage request because, when RBD
mirroring is enabled on an image, the RBD mirroring daemon adds a watcher on
that RBD image.

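One possible shape for the adjusted watcher check is sketched below. It is not the Ceph-CSI implementation: `image_in_use` is a hypothetical helper, and the assumption that the RBD mirroring daemon contributes exactly one watcher per mirrored image is taken from the paragraph above rather than verified against Ceph.

```python
def image_in_use(watcher_count: int, mirroring_enabled: bool) -> bool:
    """Return True if another client (e.g. another node) has the image open.

    Assumption: when mirroring is enabled, exactly one of the watchers
    belongs to the RBD mirroring daemon and must not be counted as a user.
    """
    background_watchers = 1 if mirroring_enabled else 0
    return watcher_count > background_watchers


# Without mirroring, any watcher means the RWO volume is already in use.
print(image_in_use(watcher_count=1, mirroring_enabled=False))

# With mirroring, the daemon's own watcher alone must not block staging.
print(image_in_use(watcher_count=1, mirroring_enabled=True))

# A second watcher on a mirrored image indicates a real consumer.
print(image_in_use(watcher_count=2, mirroring_enabled=True))
```
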
To solve the ClusterID problem: if the ClusterID is different on the second
cluster, the admin has to create a new ConfigMap for the mapped ClusterIDs.
Whenever Ceph-CSI gets a request, it checks whether a ClusterID mapping exists
and, if so, uses the mapped ClusterID to get information such as the Ceph
monitors.

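The ClusterID lookup can be sketched as below. The design does not specify the ConfigMap layout, so the plain dictionaries, the `site-b-ceph` ClusterID, and the monitor addresses are hypothetical placeholders chosen for illustration.

```python
# Hypothetical contents of the admin-created mapping:
# ClusterID found in the VolumeHandle -> ClusterID on the local cluster.
cluster_id_mapping = {"rook-ceph": "site-b-ceph"}

# Hypothetical per-cluster connection details known to the local driver.
cluster_monitors = {"site-b-ceph": ["10.0.0.1:6789", "10.0.0.2:6789"]}


def resolve_cluster_id(cluster_id: str, mapping: dict[str, str]) -> str:
    # Use the mapped ClusterID when one exists; otherwise keep the original,
    # which covers volumes provisioned directly on this cluster.
    return mapping.get(cluster_id, cluster_id)


def monitors_for(cluster_id: str) -> list[str]:
    local_id = resolve_cluster_id(cluster_id, cluster_id_mapping)
    if local_id not in cluster_monitors:
        raise KeyError(f"no monitor information for ClusterID {local_id!r}")
    return cluster_monitors[local_id]


print(monitors_for("rook-ceph"))
```
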
**This design does not cover the following items:**

* Bootstrapping the RBD mirror daemon.
* Mirroring of PVC snapshots.
* Mirroring for topology-provisioned PVCs.
* Documenting the steps to handle failover/failback of an image.
* Workflow of a replication controller.

BIN
docs/design/proposals/rbd-mirror.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 59 KiB |