doc: few corrections or typo fixing in design documentation

- Fixes spelling mistakes.
- Grammatical error correction.
- Wrapping the text at 80 columns, etc.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Humble Chirammal 2021-12-20 14:51:47 +05:30 committed by mergify[bot]
parent 12e8e46bcf
commit 3196b798cc
6 changed files with 245 additions and 240 deletions

@ -6,26 +6,26 @@ snapshot contents and then mount that volume to workloads.
CephFS exposes snapshots as special, read-only directories of a subvolume
located in `<subvolume>/.snap`. cephfs-csi can already provision writable
volumes with snapshots as their data source, where snapshot contents are
cloned to the newly created volume. However, cloning a snapshot to volume
is a very expensive operation in CephFS as the data needs to be fully copied.
When the need is to only read snapshot contents, snapshot cloning is extremely
volumes with snapshots as their data source, where snapshot contents are cloned
to the newly created volume. However, cloning a snapshot to a volume is a very
expensive operation in CephFS as the data needs to be fully copied. When the
need is to only read snapshot contents, snapshot cloning is extremely
inefficient and wasteful.
This proposal describes a way for cephfs-csi to expose CephFS snapshots
as shallow, read-only volumes, without needing to clone the underlying
snapshot data.
This proposal describes a way for cephfs-csi to expose CephFS snapshots as
shallow, read-only volumes, without needing to clone the underlying snapshot
data.
## Use-cases
What's the point of such read-only volumes?
* **Restore snapshots selectively:** users may want to traverse snapshots,
restoring data to a writable volume more selectively instead of restoring
the whole snapshot.
* **Volume backup:** users can't backup a live volume, they first need
to snapshot it. Once a snapshot is taken, it still can't be backed-up,
as backup tools usually work with volumes (that are exposed as file-systems)
restoring data to a writable volume more selectively instead of restoring the
whole snapshot.
* **Volume backup:** users can't back up a live volume; they first need to
snapshot it. Once a snapshot is taken, it still can't be backed up, as backup
tools usually work with volumes (that are exposed as file-systems)
and not snapshots (which might have backend-specific format). What this means
is that in order to create a snapshot backup, users have to clone snapshot
data twice:
@ -37,19 +37,18 @@ What's the point of such read-only volumes?
The temporary backed-up volume will most likely be thrown away after the
backup transfer is finished. That's a lot of wasted work for what we
originally wanted to do! Having the ability to create volumes from
snapshots cheaply would be a big improvement for this use case.
originally wanted to do! Having the ability to create volumes from snapshots
cheaply would be a big improvement for this use case.
## Alternatives
* _Snapshots are stored in `<subvolume>/.snap`. Users could simply visit this
directory by themselves._
`.snap` is CephFS-specific detail of how snapshots are exposed.
Users / tools may not be aware of this special directory, or it may not fit
their workflow. At the moment, the idiomatic way of accessing snapshot
contents in CSI drivers is by creating a new volume and populating it
with snapshot.
`.snap` is a CephFS-specific detail of how snapshots are exposed. Users / tools
may not be aware of this special directory, or it may not fit their workflow.
At the moment, the idiomatic way of accessing snapshot contents in CSI drivers
is by creating a new volume and populating it with the snapshot contents.
## Design
@ -57,21 +56,21 @@ Key points:
* Volume source is a snapshot, volume access mode is `*_READER_ONLY`.
* No actual new subvolumes are created in CephFS.
* The resulting volume is a reference to the source subvolume snapshot.
This reference would be stored in `Volume.volume_context` map. In order
to reference a snapshot, we need subvol name and snapshot name.
* Mounting such volume means mounting the respective CephFS subvolume
and exposing the snapshot to workloads.
* Let's call a *shallow read-only volume with a subvolume snapshot
as its data source* just a *shallow volume* from here on out for brevity.
* The resulting volume is a reference to the source subvolume snapshot. This
reference would be stored in `Volume.volume_context` map. In order to
reference a snapshot, we need subvol name and snapshot name.
* Mounting such volume means mounting the respective CephFS subvolume and
exposing the snapshot to workloads.
* Let's call a *shallow read-only volume with a subvolume snapshot as its data
source* just a *shallow volume* from here on out for brevity.
### Controller operations
Care must be taken when handling life-times of relevant storage resources.
When a shallow volume is created, what would happen if:
Care must be taken when handling life-times of relevant storage resources. When
a shallow volume is created, what would happen if:
* _Parent subvolume of the snapshot is removed while the shallow volume
still exists?_
* _Parent subvolume of the snapshot is removed while the shallow volume still
exists?_
This shouldn't be a problem already. The parent volume has either
`snapshot-retention` subvol feature in which case its snapshots remain
@ -80,8 +79,8 @@ When a shallow volume is created, what would happen if:
* _Source snapshot from which the shallow volume originates is removed while
that shallow volume still exists?_
We need to make sure this doesn't happen and some book-keeping
is necessary. Ideally we could employ some kind of reference counting.
We need to make sure this doesn't happen and some book-keeping is necessary.
Ideally we could employ some kind of reference counting.
#### Reference counting for shallow volumes
@ -92,26 +91,26 @@ When creating a volume snapshot, a reference tracker (RT), represented by a
RADOS object, would be created for that snapshot. It would store information
required to track the references for the backing subvolume snapshot. Upon a
`CreateSnapshot` call, the reference tracker (RT) would be initialized with a
single reference record, where the CSI snapshot itself is the first reference
to the backing snapshot. Each subsequent shallow volume creation would add a
new reference record to the RT object. Each shallow volume deletion would
remove that reference from the RT object. Calling `DeleteSnapshot` would remove
the reference record that was previously added in `CreateSnapshot`.
single reference record, where the CSI snapshot itself is the first reference to
the backing snapshot. Each subsequent shallow volume creation would add a new
reference record to the RT object. Each shallow volume deletion would remove
that reference from the RT object. Calling `DeleteSnapshot` would remove the
reference record that was previously added in `CreateSnapshot`.
The subvolume snapshot would be removed from the Ceph cluster only once the RT
object holds no references. Note that this behavior would permit calling
`DeleteSnapshot` even if it is still referenced by shallow volumes.
* `DeleteSnapshot`:
* RT holds no references or the RT object doesn't exist:
* RT holds no references or the RT object doesn't exist:
delete the backing snapshot too.
* RT holds at least one reference: keep the backing snapshot.
* RT holds at least one reference: keep the backing snapshot.
* `DeleteVolume`:
* RT holds no references: delete the backing snapshot too.
* RT holds at least one reference: keep the backing snapshot.
* RT holds no references: delete the backing snapshot too.
* RT holds at least one reference: keep the backing snapshot.
To enable creating shallow volumes from snapshots that were provisioned by
older versions of cephfs-csi (i.e. before this feature is introduced),
To enable creating shallow volumes from snapshots that were provisioned by older
versions of cephfs-csi (i.e. before this feature is introduced),
`CreateVolume` for shallow volumes would also create an RT object in case it's
missing. It would be initialized with two references: the source snapshot and
the newly created shallow volume.
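
The reference-counting rules above can be sketched as follows. This is only an illustration under stated assumptions: the tracker lives in an in-memory dict instead of a RADOS object, and every name in the block (`ReferenceTracker`, `create_shallow_volume`, and so on) is hypothetical rather than taken from cephfs-csi.

```python
class ReferenceTracker:
    """In-memory stand-in for the RADOS-backed reference tracker (RT)."""

    def __init__(self, refs=()):
        self.refs = set(refs)

    def add(self, ref):
        self.refs.add(ref)

    def remove(self, ref):
        """Drop a reference; return True once no references remain."""
        self.refs.discard(ref)
        return not self.refs


def create_snapshot(trackers, snap_id):
    # The CSI snapshot itself is the first reference to the backing snapshot.
    trackers[snap_id] = ReferenceTracker({("snapshot", snap_id)})


def create_shallow_volume(trackers, snap_id, vol_id):
    rt = trackers.get(snap_id)
    if rt is None:
        # Snapshot provisioned before RTs existed: create the RT now, so it
        # ends up holding two references (snapshot + new shallow volume).
        rt = trackers[snap_id] = ReferenceTracker({("snapshot", snap_id)})
    rt.add(("volume", vol_id))


def delete_snapshot(trackers, snap_id):
    """Return True if the backing snapshot may be deleted too."""
    rt = trackers.get(snap_id)
    if rt is None or rt.remove(("snapshot", snap_id)):
        trackers.pop(snap_id, None)
        return True
    return False


def delete_shallow_volume(trackers, snap_id, vol_id):
    """Return True if the backing snapshot may be deleted too."""
    rt = trackers[snap_id]
    if rt.remove(("volume", vol_id)):
        del trackers[snap_id]
        return True
    return False
```

In this sketch `delete_snapshot` returns `False` while a shallow volume still holds a reference, and the later `delete_shallow_volume` call is the one that reports the backing snapshot as removable.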
@ -141,17 +140,17 @@ Things to look out for:
It doesn't consume any space on the filesystem. `Volume.capacity_bytes` is
allowed to contain zero. We could use that.
* _What should be the requested size when creating the volume (specified e.g.
in PVC)?_
* _What should be the requested size when creating the volume (specified e.g. in
PVC)?_
This one is tricky. CSI spec allows for
`CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be
zero. On the other hand,
`PersistentVolumeClaim.spec.resources.requests.storage` must be bigger
than zero. cephfs-csi doesn't care about the requested size (the volume
will be read-only, so it has no usable capacity) and would always set it
to zero. This shouldn't case any problems for the time being, but still
is something we should keep in mind.
`CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be zero.
On the other hand,
`PersistentVolumeClaim.spec.resources.requests.storage` must be bigger than
zero. cephfs-csi doesn't care about the requested size (the volume will be
read-only, so it has no usable capacity) and would always set it to zero. This
shouldn't cause any problems for the time being, but it is still something we
should keep in mind.
`CreateVolume` and behavior when using volume as volume source (PVC-PVC clone):
@ -167,8 +166,8 @@ Volume deletion is trivial.
### `CreateSnapshot`
Snapshotting read-only volumes doesn't make sense in general, and should
be rejected.
Snapshotting read-only volumes doesn't make sense in general, and should be
rejected.
### `ControllerExpandVolume`
@ -194,8 +193,8 @@ whole subvolume first, and only then perform the binds to target paths.
#### For case (a)
Subvolume paths are normally retrieved by
`ceph fs subvolume info/getpath <VOLUME NAME> <SUBVOLUME NAME> <SUBVOLUMEGROUP NAME>`,
which outputs a path like so:
`ceph fs subvolume info/getpath <VOLUME NAME> <SUBVOLUME NAME> <SUBVOLUMEGROUP NAME>`
, which outputs a path like so:
```
/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>
@ -217,12 +216,12 @@ itself still exists or not.
#### For case (b)
For cases where subvolumes are managed externally and not by cephfs-csi, we
must assume that the cephx user we're given can access only
For cases where subvolumes are managed externally and not by cephfs-csi, we must
assume that the cephx user we're given can access only
`/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>` so users won't be able to
benefit from snapshot retention. Users will need to be careful not to delete
the parent subvolumes and snapshots while they are associated by these shallow
RO volumes.
benefit from snapshot retention. Users will need to be careful not to delete the
parent subvolumes and snapshots while they are referenced by these shallow RO
volumes.
### `NodePublishVolume`, `NodeUnpublishVolume`
@ -235,38 +234,38 @@ mount.
## Volume parameters, volume context
This section provides a discussion around determinig what volume parameters and
This section provides a discussion around determining what volume parameters and
volume context parameters will be used to convey necessary information to the
cephfs-csi driver in order to support shallow volumes.
Volume parameters `CreateVolumeRequest.parameters`:
* Should "shallow" be the default mode for all `CreateVolume` calls that have
(a) snapshot as data source and (b) read-only volume access mode? If not,
a new volume parameter should be introduced: e.g `isShallow: <bool>`. On the
(a) snapshot as data source and (b) read-only volume access mode? If not, a
new volume parameter should be introduced: e.g. `isShallow: <bool>`. On the
other hand, does it even make sense for users to want to create full copies
of snapshots and still have them read-only?
Volume context `Volume.volume_context`:
* Here we definitely need `isShallow` or similar. Without it we wouldn't be
able to distinguish between a regular volume that just happens to have
a read-only access mode, and a volume that references a snapshot.
* Here we definitely need `isShallow` or similar. Without it we wouldn't be able
to distinguish between a regular volume that just happens to have a read-only
access mode, and a volume that references a snapshot.
* Currently cephfs-csi recognizes `subvolumePath` for dynamically provisioned
volumes and `rootPath` for pre-provisioned volumes. As mentioned in
[`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume),
snapshots cannot be mounted directly. How do we pass in path to the parent
[`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume)
, snapshots cannot be mounted directly. How do we pass in path to the parent
subvolume?
* a) Path to the snapshot is passed in via `subvolumePath` / `rootPath`,
* a) Path to the snapshot is passed in via `subvolumePath` / `rootPath`,
e.g.
`/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>/.snap/<SNAPSHOT NAME>`.
From that we can derive path to the subvolume: it's the parent of `.snap`
directory.
* b) Similar to a), path to the snapshot is passed in via `subvolumePath` /
* b) Similar to a), path to the snapshot is passed in via `subvolumePath` /
`rootPath`, but instead of trying to derive the right path we introduce
another volume context parameter containing path to the parent subvolume
explicitly.
* c) `subvolumePath` / `rootPath` contains path to the parent subvolume and
* c) `subvolumePath` / `rootPath` contains path to the parent subvolume and
we introduce another volume context parameter containing name of the
snapshot. Path to the snapshot is then formed by appending
`/.snap/<SNAPSHOT NAME>` to the subvolume path.
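
Options (a) and (c) above are two directions of the same path manipulation. A minimal sketch, assuming POSIX-style paths; the helper names `split_snapshot_path` and `snapshot_path` are hypothetical and not part of cephfs-csi:

```python
import posixpath


def split_snapshot_path(path):
    """Option (a): derive (subvolume_path, snapshot_name) from
    '<subvolume path>/.snap/<snapshot name>'."""
    snap_dir, snapshot_name = posixpath.split(path.rstrip("/"))
    subvolume_path, marker = posixpath.split(snap_dir)
    if marker != ".snap" or not snapshot_name or not subvolume_path:
        raise ValueError(f"not a snapshot path: {path!r}")
    return subvolume_path, snapshot_name


def snapshot_path(subvolume_path, snapshot_name):
    """Option (c): form the snapshot path by appending
    '/.snap/<snapshot name>' to the subvolume path."""
    return posixpath.join(subvolume_path, ".snap", snapshot_name)
```

Option (b) avoids the derivation entirely by carrying both values in the volume context, at the cost of one more parameter to keep consistent.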

@ -1,7 +1,7 @@
# Design to handle clusterID and poolID for DR
During disaster recovery/migration of a cluster, as part of the failover, the
kubernetes artifacts like deployment, PVC, PV, etc will be restored to a new
kubernetes artifacts like deployment, PVC, PV, etc. will be restored to a new
cluster by the admin. Even if the kubernetes objects are restored the
corresponding RBD/CephFS subvolume cannot be retrieved during CSI operations as
the clusterID and poolID are not the same in both clusters. Let's see the
@ -10,8 +10,8 @@ problem in more detail below.
`0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002`
The above is the sample volumeID sent back in response to the CreateVolume
operation and added as a volumeHandle in the PV spec. CO (Kubernetes) uses
above as the identifier for other operations on the volume/PVC.
operation and added as a volumeHandle in the PV spec. CO (Kubernetes) uses the
above as the identifier for other operations on the volume/PVC.
The VolumeID is encoded as,
@ -33,7 +33,7 @@ the other cluster.
During the Disaster Recovery (failover operation) the PVC and PV will be
recreated on the other cluster. When Ceph-CSI receives the request for
operations like (NodeStage, ExpandVolume, DeleteVolume, etc) the volumeID is
operations like (NodeStage, ExpandVolume, DeleteVolume, etc.) the volumeID is
sent in the request which will help to identify the volume.
```yaml=
@ -68,15 +68,15 @@ metadata:
```
During CSI/Replication operations, Ceph-CSI will decode the volumeID and get
the monitor configuration from the configmap and by the poolID will get the
pool Name and retrieves the OMAP data stored in the rados OMAP and finally
check the volume is present in the pool.
the monitor configuration from the configmap, use the poolID to get the pool
name, retrieve the OMAP data stored in the rados OMAP, and finally check that
the volume is present in the pool.
## Problems with volumeID Replication
* The clusterID can be different
* as the clusterID is the namespace where rook is deployed, the Rook might be
deployed in the different namespace on a secondary cluster
* as the clusterID is the namespace where Rook is deployed, Rook might be
deployed in a different namespace on a secondary cluster
* In standalone Ceph-CSI the clusterID is fsID and fsID is unique per
cluster
@ -124,8 +124,8 @@ metadata:
name: ceph-csi-config
```
**Note:-** the configmap will be mounted as a volume to the CSI (provisioner
and node plugin) pods.
**Note:-** the configmap will be mounted as a volume to the CSI (provisioner and
node plugin) pods.
The above configmap will get created as it is or updated (if new Pools are
created on the existing cluster) with new entries when the admin chooses to
@ -149,18 +149,18 @@ Replicapool with ID `1` on site1 and Replicapool with ID `2` on site2.
After getting the required mapping Ceph-CSI has the required information to get
more details from the rados OMAP. If we have multiple clusterID mappings it will
loop through all the mappings and check the corresponding pool to get the OMAP
data. If the clusterID mapping does not exist Ceph-CSI will return a `Not
Found` error message to the caller.
data. If the clusterID mapping does not exist Ceph-CSI will return a `Not Found`
error message to the caller.
After failover to the cluster `site2-storage`, the admin might have created new
PVCs on the primary cluster `site2-storage`. Later after recovering the
cluster `site1-storage`, the admin might choose to failback from
`site2-storage` to `site1-storage`. Now the admin needs to copy all the newly
created kubernetes artifacts to the failback cluster. For clusterID mapping, the
admin needs to copy the above-created configmap `ceph-clusterid-mapping` to
the failback cluster. When Ceph-CSI receives a CSI/Replication request for
the volumes created on the `site2-storage` it will decode the volumeID and
retrieves the clusterID ie `site2-storage`. In the above configmap
admin needs to copy the above-created configmap `ceph-clusterid-mapping` to the
failback cluster. When Ceph-CSI receives a CSI/Replication request for the
volumes created on the `site2-storage` it will decode the volumeID and retrieve
the clusterID, i.e. `site2-storage`. In the above configmap
`ceph-clusterid-mapping` the `site2-storage` is the value and `site1-storage`
is the key in the `clusterIDMapping` entry.
@ -169,8 +169,8 @@ is found in `key` it will consider `value` as the corresponding mapping, if it
is found in `value` place it will treat `key` as the corresponding mapping and
retrieve all the poolID details of the cluster.
This mapping on the remote cluster is only required when we are doing a
failover operation from the primary cluster to a remote cluster. The existing
volumes that are created on the remote cluster does not require
any mapping as the volumeHandle already contains the required information about
the local cluster (clusterID, poolID etc).
This mapping on the remote cluster is only required when we are doing a failover
operation from the primary cluster to a remote cluster. The existing volumes
that are created on the remote cluster do not require any mapping as the
volumeHandle already contains the required information about the local cluster
(clusterID, poolID, etc.).
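
The bidirectional key/value lookup described above can be sketched like this. The data shape is an assumption made for illustration: each `clusterIDMapping` entry is modelled as a one-pair dict, and `mapped_cluster_ids` is a hypothetical helper, not a Ceph-CSI function.

```python
def mapped_cluster_ids(cluster_id, mappings):
    """Return the peer clusterIDs for cluster_id.

    mappings is a list of dicts, each standing in for one clusterIDMapping
    entry of the form {key: value}. The lookup works in both directions, so
    the same ConfigMap can be reused for failover and failback.
    """
    peers = []
    for entry in mappings:
        for key, value in entry.items():
            if cluster_id == key:
                peers.append(value)  # found in key: value is the mapping
            elif cluster_id == value:
                peers.append(key)    # found in value: key is the mapping
    return peers
```

An empty result corresponds to the case where no clusterID mapping exists and a `Not Found` error is returned to the caller.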

@ -16,7 +16,7 @@ Some but not all the benefits of this approach:
* volume encryption: encryption of a volume attached by rbd
* encryption at rest: encryption of physical disk done by ceph
* LUKS: Linux Unified Key Setup: stores all of the needed setup information for
* LUKS: Linux Unified Key Setup: stores all the needed setup information for
dm-crypt on the disk
* dm-crypt: linux kernel device-mapper crypto target
* cryptsetup: the command line tool to interface with dm-crypt
@ -28,8 +28,8 @@ requirement by using dm-crypt module through cryptsetup cli interface.
### Implementation Summary
* Encryption is implemented using cryptsetup with LUKS extension.
A good introduction to LUKS and dm-crypt in general can be found
* Encryption is implemented using cryptsetup with LUKS extension. A good
introduction to LUKS and dm-crypt in general can be found
[here](https://wiki.archlinux.org/index.php/Dm-crypt/Device_encryption#Encrypting_devices_with_cryptsetup)
Functions to implement necessary interaction are implemented in a separate
`cryptsetup.go` file.
@ -45,8 +45,8 @@ requirement by using dm-crypt module through cryptsetup cli interface.
volume attach request
* `NodeStageVolume`: refactored to open encrypted device (`openEncryptedDevice`)
* `openEncryptedDevice`: looks up a passphrase matching the volume id,
returns the new device path in the form: `/dev/mapper/luks-<volume_id>`.
On the woker node where the attach is scheduled:
returns the new device path in the form: `/dev/mapper/luks-<volume_id>`. On
the worker node where the attach is scheduled:
```shell
$ lsblk
@ -63,7 +63,7 @@ requirement by using dm-crypt module through cryptsetup cli interface.
* StorageClass extended with following parameters:
1. `encrypted` ("true" or "false")
1. `encryptionKMSID` (string representing kms configuration of choice)
2. `encryptionKMSID` (string representing kms configuration of choice)
ceph-csi plugin may support different kms vendors with different types of
authentication
@ -133,14 +133,19 @@ metadata:
The main components that are used to support encrypted volumes:
1. the `EncryptionKMS` interface
* an instance is configured per volume object (`rbdVolume.KMS`)
* used to authenticate with a master key or token
* can store the KEK (Key-Encryption-Key) for encrypting and decrypting the
* an instance is configured per volume object (`rbdVolume.KMS`)
* used to authenticate with a master key or token
* can store the KEK (Key-Encryption-Key) for encrypting and decrypting the
DEKs (Data-Encryption-Key)
1. the `DEKStore` interface
* saves and fetches the DEK (Data-Encryption-Key)
* can be provided by a KMS, or by other components (like `rbdVolume`)
* saves and fetches the DEK (Data-Encryption-Key)
* can be provided by a KMS, or by other components (like `rbdVolume`)
1. the `VolumeEncryption` type
* combines `EncryptionKMS` and `DEKStore` into a single place
* easy to configure from other components or subsystems
* provides a simple API for all KMS operations
* combines `EncryptionKMS` and `DEKStore` into a single place
* easy to configure from other components or subsystems
* provides a simple API for all KMS operations
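
How the three components fit together can be sketched as below. This is a rough model only: the method names are hypothetical, the store is an in-memory dict, and `ToyKMS` uses base64, which is an encoding and NOT encryption, purely so the example is runnable.

```python
import base64
from abc import ABC, abstractmethod


class EncryptionKMS(ABC):
    """Wraps and unwraps DEKs with a key-encryption-key (KEK)."""

    @abstractmethod
    def encrypt_dek(self, volume_id, plain_dek): ...

    @abstractmethod
    def decrypt_dek(self, volume_id, encrypted_dek): ...


class DEKStore(ABC):
    """Saves and fetches the (wrapped) DEK of a volume."""

    @abstractmethod
    def store_dek(self, volume_id, dek): ...

    @abstractmethod
    def fetch_dek(self, volume_id): ...


class ToyKMS(EncryptionKMS):
    """Placeholder only: base64 stands in for real KEK-based wrapping."""

    def encrypt_dek(self, volume_id, plain_dek):
        return base64.b64encode(f"{volume_id}:{plain_dek}".encode()).decode()

    def decrypt_dek(self, volume_id, encrypted_dek):
        decoded = base64.b64decode(encrypted_dek).decode()
        prefix = volume_id + ":"
        if not decoded.startswith(prefix):
            raise ValueError("DEK does not belong to this volume")
        return decoded[len(prefix):]


class InMemoryDEKStore(DEKStore):
    def __init__(self):
        self.deks = {}

    def store_dek(self, volume_id, dek):
        self.deks[volume_id] = dek

    def fetch_dek(self, volume_id):
        return self.deks[volume_id]


class VolumeEncryption:
    """Combines an EncryptionKMS and a DEKStore behind one small API."""

    def __init__(self, kms, dek_store):
        self.kms = kms
        self.dek_store = dek_store

    def store_new_dek(self, volume_id, plain_dek):
        wrapped = self.kms.encrypt_dek(volume_id, plain_dek)
        self.dek_store.store_dek(volume_id, wrapped)

    def get_dek(self, volume_id):
        wrapped = self.dek_store.fetch_dek(volume_id)
        return self.kms.decrypt_dek(volume_id, wrapped)
```

Keeping the store separate from the KMS is what lets the DEK be saved either by the KMS itself or by another component, as the list above describes.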

@ -14,7 +14,8 @@ KMS implementation. Or, if changes would be minimal, a configuration option to
one of the implementations can be added.
Different KMS implementations and their configurable options can be found at
[`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml).
[`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml)
.
### VaultTokensKMS
@ -26,7 +27,8 @@ An example of the per Tenant configuration options are in
[`tenant-config.yaml`](../../../examples/kms/vault/tenant-config.yaml) and
[`tenant-token.yaml`](../../../examples/kms/vault/tenant-token.yaml).
Implementation is in [`vault_tokens.go`](../../../internal/util/vault_tokens.go).
Implementation is in [`vault_tokens.go`](../../../internal/util/vault_tokens.go)
.
### Vault
@ -36,7 +38,7 @@ Implementation is in [`vault.go`](../../../internal/util/vault.go).
## Extension or New KMS implementation
Normally ServiceAccounts are provided by Kubernetes in the containers
Normally ServiceAccounts are provided by Kubernetes in the containers'
filesystem. This only allows a single ServiceAccount and is static for the
lifetime of the Pod. Ceph-CSI runs in the namespace of the storage
administrator, and has access to the single ServiceAccount linked in the
@ -53,7 +55,7 @@ steps need to be taken:
replace the default (`AuthKubernetesTokenPath:
/var/run/secrets/kubernetes.io/serviceaccount/token`)
Currently the Ceph-CSI components may read Secrets and ConfigMaps from the
Currently, the Ceph-CSI components may read Secrets and ConfigMaps from the
Tenants namespace. These permissions need to be extended to allow Ceph-CSI to
read the contents of the ServiceAccount(s) in the Tenants namespace.
@ -61,7 +63,8 @@ read the contents of the ServiceAccount(s) in the Tenants namespace.
### Global Configuration
1. a StorageClass links to a KMS configuration by providing the `kmsID` parameter
1. a StorageClass links to a KMS configuration by providing the `kmsID`
parameter
1. a ConfigMap in the namespace of the Ceph-CSI deployment contains the KMS
configuration for the `kmsID`
([`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml))
@ -76,8 +79,8 @@ configuration from the ConfigMap.
1. needs ServiceAccount with a known name with permissions to connect to Vault
1. optional ConfigMap with options for Vault that override default settings
A `CreateVolume` request contains the owner (Namespace) of the Volume.
The KMS configuration indicates that additional attributes need to be fetched
from the Tenants namespace, so the provisioner will fetch these. The additional
configuration and ServiceAccount are merged in the provisioners configuration
A `CreateVolume` request contains the owner (Namespace) of the Volume. The KMS
configuration indicates that additional attributes need to be fetched from the
Tenants namespace, so the provisioner will fetch these. The additional
configuration and ServiceAccount are merged in the provisioners' configuration
for the KMS-implementation while creating the volume.

@ -1,11 +1,11 @@
# RBD MIRRORING
RBD mirroring is a process of replication of RBD images between two or more
Ceph clusters. Mirroring ensures point-in-time, crash-consistent RBD images
between clusters, RBD mirroring is mainly used for disaster recovery (i.e.
having a secondary site as a failover). See [Ceph
documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on RBD
mirroring for complete information.
RBD mirroring is a process of replication of RBD images between two or more Ceph
clusters. Mirroring ensures point-in-time, crash-consistent RBD images between
clusters. RBD mirroring is mainly used for disaster recovery (i.e. having a
secondary site as a failover).
See [Ceph documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on
RBD mirroring for complete information.
## Architecture
@ -28,8 +28,8 @@ PersistentVolumeClaim (PVC) on the secondary site during the failover.
VolumeHandle to identify the OMAP data nor the image anymore because we have
only PoolID and ClusterID in the VolumeHandle. We cannot identify the correct
pool name from the PoolID because pool name will remain the same on both
clusters but not the PoolID even the ClusterID can be different on the
secondary cluster.
clusters but not the PoolID; even the ClusterID can be different on the
secondary cluster.
> Sample PV spec which will be used by rbdplugin controller to regenerate OMAP
> data
@ -56,10 +56,10 @@ csi:
```
> **VolumeHandle** is the unique volume name returned by the CSI volume plugins
CreateVolume to refer to the volume on all subsequent calls.
> CreateVolume to refer to the volume on all subsequent calls.
Once the static PVC is created on the secondary cluster, the Kubernetes User
can try delete the PVC,expand the PVC or mount the PVC. In case of mounting
Once the static PVC is created on the secondary cluster, the Kubernetes User can
try to delete the PVC, expand the PVC or mount the PVC. In case of mounting
(NodeStageVolume) we will get the volume context in RPC call but not in the
Delete/Expand Request. In Delete/Expand RPC request only the VolumeHandle
(`clusterID-poolID-volumeuniqueID`) will be sent where it contains the encoded
@ -73,17 +73,17 @@ secondary cluster as the PoolID and ClusterID always may not be the same.
To solve this problem, we will have a new controller (rbdplugin controller)
running as part of provisioner pod which watches for the PV objects. When a PV
is created it will extract the required information from the PV spec and it
is created it will extract the required information from the PV spec, and it
will regenerate the OMAP data. Whenever Ceph-CSI gets an RPC request with an
older VolumeHandle, it will check if any new VolumeHandle exists for the old
VolumeHandle. If yes, it uses the new VolumeHandle for internal operations (to
get pool name, Ceph monitor details from the ClusterID, etc.).
Currently, we are making use of watchers in node stage request to make sure
ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time.
We need to change the watchers logic in the node stage request as when we
enable the RBD mirroring on an image, a watcher will be added on a RBD image by
the rbd mirroring daemon.
ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time. We
need to change the watcher logic in the node stage request because, when we
enable RBD mirroring on an image, a watcher is added to the RBD image by the
rbd mirroring daemon.
To solve the ClusterID problem, if the ClusterID is different on the second
cluster, the admin has to create a new ConfigMap for the mapped ClusterIDs.

@ -1,59 +1,57 @@
# RBD NBD VOLUME HEALER
- [RBD NBD VOLUME HEALER](#rbd-nbd-volume-healer)
- [Rbd Nbd](#rbd-nbd)
- [Advantages of userspace mounters](#advantages-of-userspace-mounters)
- [Side effects of userspace mounters](#side-effects-of-userspace-mounters)
- [Volume Healer](#volume-healer)
- [More thoughts](#more-thoughts)
- [Rbd Nbd](#rbd-nbd)
- [Advantages of userspace mounters](#advantages-of-userspace-mounters)
- [Side effects of userspace mounters](#side-effects-of-userspace-mounters)
- [Volume Healer](#volume-healer)
- [More thoughts](#more-thoughts)
## Rbd nbd
The rbd CSI plugin will provision new rbd images and attach and mount those
to workloads. Currently, the default mounter is krbd, which uses the kernel
rbd driver to mount the rbd images onto the application pod. Here on
at Ceph-CSI we will also have a userspace way of mounting the rbd images,
via rbd-nbd.
The rbd CSI plugin will provision new rbd images and attach and mount those to
workloads. Currently, the default mounter is krbd, which uses the kernel rbd
driver to mount the rbd images onto the application pod. From here on, Ceph-CSI
will also have a userspace way of mounting the rbd images, via rbd-nbd.
[Rbd-nbd](https://docs.ceph.com/en/latest/man/8/rbd-nbd/) is a client for
RADOS block device (rbd) images like the existing rbd kernel module. It
will map an rbd image to an nbd (Network Block Device) device, allowing
access to it as a regular local block device.
[Rbd-nbd](https://docs.ceph.com/en/latest/man/8/rbd-nbd/) is a client for RADOS
block device (rbd) images like the existing rbd kernel module. It will map an
rbd image to an nbd (Network Block Device) device, allowing access to it as a
regular local block device.
![csi-rbd-nbd](./images/csi-rbd-nbd.svg)
Its worth making a note that the rbd-nbd processes will run on the
client-side, which is inside the `csi-rbdplugin` node plugin.
It's worth noting that the rbd-nbd processes will run on the client-side,
which is inside the `csi-rbdplugin` node plugin.
### Advantages of userspace mounters
- It is easier to add features to rbd-nbd as it is released regularly with
Ceph, and more difficult and time consuming to add features to the kernel
rbd module as that is part of the Linux kernel release schedule.
- Container upgrades will be independent of the host node, which means if
there are any new features with rbd-nbd, we dont have to reboot the node
as the changes will be shipped inside the container.
- Because the container upgrades are host node independent, we will be a
better citizen in K8s by switching to the userspace model.
- It is easier to add features to rbd-nbd as it is released regularly with Ceph,
and more difficult and time-consuming to add features to the kernel rbd module
as that is part of the Linux kernel release schedule.
- Container upgrades will be independent of the host node, which means if there
are any new features with rbd-nbd, we don't have to reboot the node as the
changes will be shipped inside the container.
- Because the container upgrades are host node independent, we will be a better
citizen in K8s by switching to the userspace model.
- Unlike krbd, rbd-nbd uses the librbd user-space library, which gets most of
the development focus, and hence rbd-nbd will be feature-rich.
- Being entirely kernel space impacts fault-tolerance as any kernel panic
affects a whole node not only a single pod that is using rbd storage.
Thanks to the rbd-nbds userspace design, we are less bothered here, the
krbd is a complete kernel and vendor-specific driver which needs changes
on every feature basis, on the other hand, rbd-nbd depends on NBD generic
driver, while all the vendor-specific logic sits in the userspace. It's
worth taking note that NBD generic driver is mostly unchanged much from
years and consider it to be much stable. Also given NBD is a generic
driver there will be many eyes on it compared to the rbd driver.
affects a whole node, not only a single pod that is using rbd storage. Thanks
to rbd-nbd's userspace design, we are less affected here. krbd is a complete,
vendor-specific kernel driver that needs changes for every new feature,
whereas rbd-nbd depends on the generic NBD driver, with all the vendor-specific
logic sitting in userspace. It's worth noting that the generic NBD driver has
been mostly unchanged for years and is considered quite stable. Also, since NBD
is a generic driver, there will be many more eyes on it than on the rbd driver.
### Side effects of userspace mounters
Since the rbd-nbd processes run per volume map on the client side i.e.
inside the `csi-rbdplugin` node plugin, a restart of the node plugin will
terminate all the rbd-nbd processes, and there is no way to restore
these processes back to life currently, which could lead to IO errors
on all the application pods.
Since the rbd-nbd processes run per volume map on the client side i.e. inside
the `csi-rbdplugin` node plugin, a restart of the node plugin will terminate all
the rbd-nbd processes, and there is no way to restore these processes back to
life currently, which could lead to IO errors on all the application pods.
![csi-plugin-restart](./images/csi-plugin-restart.svg)
@ -61,42 +59,42 @@ This is where the Volume healer could help.
## Volume healer
Volume healer runs on the start of rbd node plugin and runs within the
node plugin driver context.
Volume healer runs on the start of rbd node plugin and runs within the node
plugin driver context.
Volume healer does the following:
- Get the Volume attachment list for the current node where it is running
- Filter the volume attachments list through matching driver name and
status attached
- For each volume attachment get the respective PV information and check
the criteria of PV Bound, mounter type
- Build the StagingPath where rbd images PVC is mounted, based on the
KUBELET path and PV object
- Filter the volume attachments list through matching driver name and status
attached
- For each volume attachment get the respective PV information and check the
criteria of PV Bound, mounter type
- Build the StagingPath where rbd images PVC is mounted, based on the KUBELET
path and PV object
- Construct the NodeStageVolume() request and send Request to CSI Driver.
- The NodeStageVolume() has a way to identify calls received from the
healer and when executed from the healer context, it just runs in the
minimal required form, where it fetches the previously mapped device to
the image, and the respective secrets and finally ensures to bringup the
respective process back to life. Thus enabling IO to continue.
- NodeStageVolume() has a way to identify calls received from the healer. When
executed from the healer context, it runs in the minimal required form: it
fetches the device previously mapped to the image and the respective secrets,
and finally brings the respective process back to life, thus enabling IO to
continue.
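
The healer steps listed above can be sketched roughly as follows. This is a
minimal illustration, not the Ceph-CSI implementation: the `VolumeAttachment`,
`PersistentVolume` and `StageRequest` types are simplified, hypothetical
stand-ins for the Kubernetes and CSI objects, the staging path layout is made
up for illustration, and the driver name used in `main` is only an example
value.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// VolumeAttachment is a hypothetical, trimmed-down stand-in for the
// Kubernetes VolumeAttachment object.
type VolumeAttachment struct {
	Attacher string // CSI driver name
	NodeName string
	PVName   string
	Attached bool
}

// PersistentVolume is a hypothetical, trimmed-down stand-in for a PV.
type PersistentVolume struct {
	Name    string
	Bound   bool
	Mounter string // e.g. "rbd-nbd" or "krbd"
}

// StageRequest is a hypothetical, minimal NodeStageVolume() request.
type StageRequest struct {
	PVName      string
	StagingPath string
}

// stagingPath builds an illustrative staging path from the kubelet directory
// and the PV name. The real layout is owned by kubelet; this one is assumed.
func stagingPath(kubeletDir, pvName string) string {
	return filepath.Join(kubeletDir, "staging", pvName)
}

// healRequests filters the attachments for this node and driver, checks the
// PV criteria, and returns the stage requests the healer would send.
func healRequests(vas []VolumeAttachment, pvs map[string]PersistentVolume,
	node, driver, kubeletDir string) []StageRequest {
	var reqs []StageRequest
	for _, va := range vas {
		// Keep only attachments on this node, for this driver, that are attached.
		if va.NodeName != node || va.Attacher != driver || !va.Attached {
			continue
		}
		pv, found := pvs[va.PVName]
		if !found {
			continue // PV not found: it may have been deleted, so skip it
		}
		// Only bound PVs that use the userspace mounter need healing.
		if !pv.Bound || pv.Mounter != "rbd-nbd" {
			continue
		}
		reqs = append(reqs, StageRequest{
			PVName:      pv.Name,
			StagingPath: stagingPath(kubeletDir, pv.Name),
		})
	}
	return reqs
}

func main() {
	vas := []VolumeAttachment{
		{Attacher: "rbd.csi.ceph.com", NodeName: "node-a", PVName: "pv-1", Attached: true},
		{Attacher: "rbd.csi.ceph.com", NodeName: "node-b", PVName: "pv-2", Attached: true},
	}
	pvs := map[string]PersistentVolume{
		"pv-1": {Name: "pv-1", Bound: true, Mounter: "rbd-nbd"},
		"pv-2": {Name: "pv-2", Bound: true, Mounter: "rbd-nbd"},
	}
	for _, r := range healRequests(vas, pvs, "node-a", "rbd.csi.ceph.com", "/var/lib/kubelet") {
		fmt.Println(r.PVName, r.StagingPath)
	}
}
```

In this sketch only `pv-1` yields a request, because `pv-2` is attached to a
different node. A real healer would then send each request to the CSI driver
as a NodeStageVolume() call.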
### More thoughts
- Currently the NodeStageVolume() call is safeguarded by the global Ceph-CSI
level lock (per volID) that needs to be acquired before doing any of the
NodeStage, NodeUnstage, NodePublish, NodeUnPulish operations. Hence none
of the operations happen in parallel.
NodeStage, NodeUnstage, NodePublish, NodeUnPublish operations. Hence none of
the operations happen in parallel.
- Any issues if the NodeUnstage is issued by kubelet?
- This cannot be a problem as we take a lock at the Ceph-CSI level
- If the NodeUnstage succeeds, Ceph-CSI will return StagingPath not found
error, we can then skip
- If the NodeUnstage fails with an operation already going on, in the
next NodeUnstage the volume gets unmounted
- If the NodeUnstage fails with an operation already going on, in the next
NodeUnstage the volume gets unmounted
- What if the PVC is deleted?
- If the PVC is deleted, the volume attachment list might already got
- If the PVC is deleted, the volume attachment list might already get
refreshed and entry will be skipped/deleted at the healer.
- If, for any reason, the request bails out with Error NotFound, skip the
PVC, assuming it might have deleted or the NodeUnstage might have
already happened.
- The Volume healer currently works with rbd-nbd, but the design can
PVC, assuming it might have been deleted or the NodeUnstage might have already
happened.
- The Volume healer currently works with rbd-nbd, but the design can
accommodate other userspace mounters (maybe ceph-fuse).
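
The per-volID locking mentioned in the notes above can be illustrated with a
small try-lock keyed by volume ID. This is a sketch under assumptions, not the
actual Ceph-CSI code: the `volumeLocks` type and its `tryAcquire` and `release`
methods are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLocks is a hypothetical per-volume try-lock: at most one operation
// (NodeStage, NodeUnstage, ...) may hold a given volume ID at a time.
type volumeLocks struct {
	mu    sync.Mutex
	inUse map[string]struct{}
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{inUse: make(map[string]struct{})}
}

// tryAcquire returns false if an operation is already in progress for volID.
func (l *volumeLocks) tryAcquire(volID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, busy := l.inUse[volID]; busy {
		return false
	}
	l.inUse[volID] = struct{}{}
	return true
}

// release marks the operation on volID as finished.
func (l *volumeLocks) release(volID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.inUse, volID)
}

func main() {
	locks := newVolumeLocks()

	// The healer's NodeStageVolume() call takes the lock first.
	fmt.Println(locks.tryAcquire("vol-1")) // true

	// A concurrent NodeUnstage from kubelet is rejected and can be retried.
	fmt.Println(locks.tryAcquire("vol-1")) // false

	// Once the healer is done, the next NodeUnstage attempt goes through.
	locks.release("vol-1")
	fmt.Println(locks.tryAcquire("vol-1")) // true
}
```

A try-lock, rather than a blocking lock, matches the behaviour described
above: a conflicting request fails with "operation already going on" and the
caller retries later.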