doc: fix a few typos and minor corrections in design documentation

- Fixes spelling mistakes.
- Corrects grammatical errors.
- Wraps text at 80 columns, etc.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Author: Humble Chirammal, 2021-12-20 14:51:47 +05:30 (committed by mergify[bot])
Parent: 12e8e46bcf
Commit: 3196b798cc
6 changed files with 245 additions and 240 deletions


@@ -6,50 +6,49 @@ snapshot contents and then mount that volume to workloads.
CephFS exposes snapshots as special, read-only directories of a subvolume
located in `<subvolume>/.snap`. cephfs-csi can already provision writable
volumes with snapshots as their data source, where snapshot contents are cloned
to the newly created volume. However, cloning a snapshot to a volume is a very
expensive operation in CephFS as the data needs to be fully copied. When the
need is to only read snapshot contents, snapshot cloning is extremely
inefficient and wasteful.

This proposal describes a way for cephfs-csi to expose CephFS snapshots as
shallow, read-only volumes, without needing to clone the underlying snapshot
data.

## Use-cases

What's the point of such read-only volumes?

* **Restore snapshots selectively:** users may want to traverse snapshots,
  restoring data to a writable volume more selectively instead of restoring the
  whole snapshot.
* **Volume backup:** users can't back up a live volume, they first need to
  snapshot it. Once a snapshot is taken, it still can't be backed up, as backup
  tools usually work with volumes (that are exposed as file-systems) and not
  snapshots (which might have a backend-specific format). What this means is
  that in order to create a snapshot backup, users have to clone snapshot data
  twice:

  1. first time, when restoring the snapshot into a temporary volume from
     where the data will be read,
  1. and second time, when transferring that volume into some backup/archive
     storage (e.g. object store).

  The temporary backed-up volume will most likely be thrown away after the
  backup transfer is finished. That's a lot of wasted work for what we
  originally wanted to do! Having the ability to create volumes from snapshots
  cheaply would be a big improvement for this use case.

## Alternatives

* _Snapshots are stored in `<subvolume>/.snap`. Users could simply visit this
  directory by themselves._

  `.snap` is a CephFS-specific detail of how snapshots are exposed. Users and
  tools may not be aware of this special directory, or it may not fit their
  workflow. At the moment, the idiomatic way of accessing snapshot contents in
  CSI drivers is by creating a new volume and populating it with snapshot
  contents.

## Design
@@ -57,21 +56,21 @@ Key points:
* Volume source is a snapshot, volume access mode is `*_READER_ONLY`.
* No actual new subvolumes are created in CephFS.
* The resulting volume is a reference to the source subvolume snapshot. This
  reference would be stored in the `Volume.volume_context` map. In order to
  reference a snapshot, we need the subvolume name and the snapshot name.
* Mounting such a volume means mounting the respective CephFS subvolume and
  exposing the snapshot to workloads.
* Let's call a *shallow read-only volume with a subvolume snapshot as its data
  source* just a *shallow volume* from here on out for brevity.

### Controller operations

Care must be taken when handling lifetimes of the relevant storage resources.
When a shallow volume is created, what would happen if:

* _Parent subvolume of the snapshot is removed while the shallow volume still
  exists?_

  This shouldn't be a problem already. The parent volume either has the
  `snapshot-retention` subvol feature, in which case its snapshots remain
@@ -80,8 +79,8 @@ When a shallow volume is created, what would happen if:
* _Source snapshot from which the shallow volume originates is removed while
  that shallow volume still exists?_

  We need to make sure this doesn't happen and some book-keeping is necessary.
  Ideally we could employ some kind of reference counting.

#### Reference counting for shallow volumes
@@ -92,26 +91,26 @@ When creating a volume snapshot, a reference tracker (RT), represented by a
RADOS object, would be created for that snapshot. It would store information
required to track the references for the backing subvolume snapshot. Upon a
`CreateSnapshot` call, the reference tracker (RT) would be initialized with a
single reference record, where the CSI snapshot itself is the first reference to
the backing snapshot. Each subsequent shallow volume creation would add a new
reference record to the RT object. Each shallow volume deletion would remove
that reference from the RT object. Calling `DeleteSnapshot` would remove the
reference record that was previously added in `CreateSnapshot`.

The subvolume snapshot would be removed from the Ceph cluster only once the RT
object holds no references. Note that this behavior would permit calling
`DeleteSnapshot` even if it is still referenced by shallow volumes.

* `DeleteSnapshot`:
  * RT holds no references or the RT object doesn't exist:
    delete the backing snapshot too.
  * RT holds at least one reference: keep the backing snapshot.
* `DeleteVolume`:
  * RT holds no references: delete the backing snapshot too.
  * RT holds at least one reference: keep the backing snapshot.

To enable creating shallow volumes from snapshots that were provisioned by older
versions of cephfs-csi (i.e. before this feature is introduced),
`CreateVolume` for shallow volumes would also create an RT object in case it's
missing. It would be initialized with two references: the source snapshot and
the newly created shallow volume.
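To make the life-cycle rules above easier to follow, here is a minimal,
self-contained sketch of the reference-tracker semantics. It is not the proposed
implementation (the RT is a RADOS object in the design); the type and method
names below are invented purely for illustration.

```go
// Hypothetical model of the reference-tracker (RT) semantics described above.
// In the real design the RT would be a RADOS object; here a plain map stands
// in for its records so the add/remove/decide logic is explicit.
package main

import "fmt"

type refTracker struct {
	refs map[string]struct{} // reference name -> present
}

func newRefTracker(initialRefs ...string) *refTracker {
	rt := &refTracker{refs: make(map[string]struct{})}
	for _, r := range initialRefs {
		rt.refs[r] = struct{}{}
	}
	return rt
}

// Add records a new reference (e.g. a newly created shallow volume).
func (rt *refTracker) Add(ref string) { rt.refs[ref] = struct{}{} }

// Remove drops a reference and reports whether the backing snapshot may now
// be deleted (i.e. no references remain).
func (rt *refTracker) Remove(ref string) (deleteBackingSnapshot bool) {
	delete(rt.refs, ref)
	return len(rt.refs) == 0
}

func main() {
	// CreateSnapshot: the RT starts with the CSI snapshot as its first reference.
	rt := newRefTracker("csi-snapshot")

	// CreateVolume for two shallow volumes adds two more references.
	rt.Add("shallow-vol-1")
	rt.Add("shallow-vol-2")

	// DeleteSnapshot removes the record added in CreateSnapshot; the backing
	// snapshot must stay because shallow volumes still reference it.
	fmt.Println(rt.Remove("csi-snapshot")) // false -> keep backing snapshot

	// DeleteVolume for the remaining shallow volumes.
	fmt.Println(rt.Remove("shallow-vol-1")) // false -> keep backing snapshot
	fmt.Println(rt.Remove("shallow-vol-2")) // true  -> delete backing snapshot
}
```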
@@ -141,17 +140,17 @@ Things to look out for:
  It doesn't consume any space on the filesystem. `Volume.capacity_bytes` is
  allowed to contain zero. We could use that.

* _What should be the requested size when creating the volume (specified e.g. in
  the PVC)?_

  This one is tricky. The CSI spec allows
  `CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be zero.
  On the other hand,
  `PersistentVolumeClaim.spec.resources.requests.storage` must be bigger than
  zero. cephfs-csi doesn't care about the requested size (the volume will be
  read-only, so it has no usable capacity) and would always set it to zero. This
  shouldn't cause any problems for the time being, but it is still something we
  should keep in mind.

`CreateVolume` and behavior when using a volume as the volume source (PVC-PVC clone):
@@ -167,8 +166,8 @@ Volume deletion is trivial.
### `CreateSnapshot`

Snapshotting read-only volumes doesn't make sense in general, and should be
rejected.
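As a rough illustration of that rejection (not the actual cephfs-csi handler;
the `ControllerServer` type and `isShallowVolume` helper are placeholders, only
the gRPC/CSI types are real), the controller could simply return an error
status:

```go
package main

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ControllerServer and isShallowVolume are placeholders for this sketch; the
// real driver would inspect its own volume metadata here.
type ControllerServer struct{}

func isShallowVolume(volumeID string) bool { return true /* placeholder */ }

// CreateSnapshot rejects requests whose source is a shallow, read-only volume.
func (cs *ControllerServer) CreateSnapshot(ctx context.Context,
	req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error) {
	if isShallowVolume(req.GetSourceVolumeId()) {
		return nil, status.Error(codes.InvalidArgument,
			"snapshotting read-only shallow volumes is not supported")
	}
	// ...normal snapshot creation would continue here...
	return nil, status.Error(codes.Unimplemented, "not covered by this sketch")
}

func main() {}
```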
### `ControllerExpandVolume`
@@ -194,8 +193,8 @@ whole subvolume first, and only then perform the binds to target paths.
#### For case (a)

Subvolume paths are normally retrieved by
`ceph fs subvolume info/getpath <VOLUME NAME> <SUBVOLUME NAME> <SUBVOLUMEGROUP NAME>`,
which outputs a path like so:

```
/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>
```
@@ -217,12 +216,12 @@ itself still exists or not.
#### For case (b)

For cases where subvolumes are managed externally and not by cephfs-csi, we must
assume that the cephx user we're given can access only
`/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>`, so users won't be able to
benefit from snapshot retention. Users will need to be careful not to delete the
parent subvolumes and snapshots while they are associated with these shallow RO
volumes.

### `NodePublishVolume`, `NodeUnpublishVolume`
@@ -235,38 +234,38 @@ mount.
## Volume parameters, volume context

This section provides a discussion around determining what volume parameters and
volume context parameters will be used to convey the necessary information to
the cephfs-csi driver in order to support shallow volumes.

Volume parameters `CreateVolumeRequest.parameters`:

* Should "shallow" be the default mode for all `CreateVolume` calls that have
  (a) a snapshot as data source and (b) a read-only volume access mode? If not,
  a new volume parameter should be introduced: e.g. `isShallow: <bool>`. On the
  other hand, does it even make sense for users to want to create full copies
  of snapshots and still have them read-only?

Volume context `Volume.volume_context`:

* Here we definitely need `isShallow` or similar. Without it we wouldn't be able
  to distinguish between a regular volume that just happens to have a read-only
  access mode, and a volume that references a snapshot.
* Currently cephfs-csi recognizes `subvolumePath` for dynamically provisioned
  volumes and `rootPath` for pre-provisioned volumes. As mentioned in the
  [`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume),
  snapshots cannot be mounted directly. How do we pass in the path to the parent
  subvolume?
  * a) The path to the snapshot is passed in via `subvolumePath` / `rootPath`,
    e.g.
    `/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>/.snap/<SNAPSHOT NAME>`.
    From that we can derive the path to the subvolume: it's the parent of the
    `.snap` directory (see the sketch after this list).
  * b) Similar to a), the path to the snapshot is passed in via `subvolumePath`
    / `rootPath`, but instead of trying to derive the right path we introduce
    another volume context parameter containing the path to the parent
    subvolume explicitly.
  * c) `subvolumePath` / `rootPath` contains the path to the parent subvolume
    and we introduce another volume context parameter containing the name of
    the snapshot. The path to the snapshot is then formed by appending
    `/.snap/<SNAPSHOT NAME>` to the subvolume path.
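For option a), deriving the parent subvolume path from the snapshot path is a
small path operation; option c) is simply the inverse. The sketch below is
illustrative only: the function names and example paths are made up, and a real
implementation would also validate that the path refers to an actual CephFS
snapshot directory.

```go
package main

import (
	"fmt"
	"path"
)

// parentSubvolumePath derives the subvolume path from a snapshot path such as
// /volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>/.snap/<SNAPSHOT NAME> (option a).
func parentSubvolumePath(snapshotPath string) (string, error) {
	snapDir := path.Dir(snapshotPath) // .../<UUID>/.snap
	if path.Base(snapDir) != ".snap" {
		return "", fmt.Errorf("%q does not point inside a .snap directory", snapshotPath)
	}
	return path.Dir(snapDir), nil // .../<UUID>
}

// snapshotPath forms the snapshot path from a subvolume path and a snapshot
// name (option c).
func snapshotPath(subvolumePath, snapshotName string) string {
	return path.Join(subvolumePath, ".snap", snapshotName)
}

func main() {
	snap := "/volumes/myfs/csi-vol-1234/8a3c0b1e/.snap/snap-1"
	subvol, err := parentSubvolumePath(snap)
	fmt.Println(subvol, err)                    // /volumes/myfs/csi-vol-1234/8a3c0b1e <nil>
	fmt.Println(snapshotPath(subvol, "snap-1")) // round-trips back to snap
}
```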


@@ -1,7 +1,7 @@
# Design to handle clusterID and poolID for DR

During disaster recovery/migration of a cluster, as part of the failover, the
kubernetes artifacts like deployment, PVC, PV, etc. will be restored to a new
cluster by the admin. Even if the kubernetes objects are restored, the
corresponding RBD/CephFS subvolume cannot be retrieved during CSI operations as
the clusterID and poolID are not the same in both clusters. Let's see the
@@ -10,8 +10,8 @@ problem in more detail below.
`0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002`

The above is a sample volumeID sent back in response to the CreateVolume
operation and added as a volumeHandle in the PV spec. The CO (Kubernetes) uses
the above as the identifier for other operations on the volume/PVC.

The VolumeID is encoded as,
@@ -33,7 +33,7 @@ the other cluster.
During Disaster Recovery (the failover operation) the PVC and PV will be
recreated on the other cluster. When Ceph-CSI receives a request for operations
like NodeStage, ExpandVolume, DeleteVolume, etc., the volumeID is sent in the
request, which helps to identify the volume.

```yaml=
@@ -68,15 +68,15 @@ metadata:
```

During CSI/Replication operations, Ceph-CSI will decode the volumeID, get the
monitor configuration from the configmap, use the poolID to get the pool name,
retrieve the OMAP data stored in rados, and finally check that the volume is
present in the pool.
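For orientation, decoding the composite volumeID could look roughly like the
sketch below. The field layout used here (a 16-bit version, a 16-bit clusterID
length, the clusterID, a 64-bit poolID in hex, and the object UUID) is inferred
from the sample volumeID shown earlier and is an assumption of this sketch, not
a statement of the driver's exact encoding.

```go
package main

import (
	"fmt"
	"strconv"
)

type volumeIdentifier struct {
	ClusterID  string
	PoolID     int64
	ObjectUUID string
}

// decodeVolumeID splits a composite volumeID into its assumed fields. Bounds
// checking is kept minimal for brevity.
func decodeVolumeID(id string) (volumeIdentifier, error) {
	var vi volumeIdentifier
	if len(id) < 10 {
		return vi, fmt.Errorf("volumeID %q too short", id)
	}
	// id[0:4] is the version field ("0001"), ignored here.
	clusterIDLen, err := strconv.ParseInt(id[5:9], 16, 32) // "0009" -> 9
	if err != nil {
		return vi, err
	}
	pos := 10 // start of the clusterID
	vi.ClusterID = id[pos : pos+int(clusterIDLen)]
	pos += int(clusterIDLen) + 1
	vi.PoolID, err = strconv.ParseInt(id[pos:pos+16], 16, 64) // 64-bit poolID
	if err != nil {
		return vi, err
	}
	vi.ObjectUUID = id[pos+17:]
	return vi, nil
}

func main() {
	vi, err := decodeVolumeID(
		"0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002")
	fmt.Printf("%+v %v\n", vi, err)
	// {ClusterID:rook-ceph PoolID:2 ObjectUUID:b0285c97-a0ce-11eb-8c66-0242ac110002} <nil>
}
```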
## Problems with volumeID Replication

* The clusterID can be different
  * as the clusterID is the namespace where rook is deployed, Rook might be
    deployed in a different namespace on the secondary cluster
  * In standalone Ceph-CSI the clusterID is the fsID, and the fsID is unique per
    cluster
@@ -124,8 +124,8 @@ metadata:
  name: ceph-csi-config
```

**Note:** the configmap will be mounted as a volume to the CSI (provisioner and
node plugin) pods.

The above configmap will get created as it is or updated (if new Pools are
created on the existing cluster) with new entries when the admin chooses to
@@ -149,28 +149,28 @@ Replicapool with ID `1` on site1 and Replicapool with ID `2` on site2.
After getting the required mapping, Ceph-CSI has the required information to get
more details from the rados OMAP. If we have multiple clusterID mappings it will
loop through all the mappings and check the corresponding pool to get the OMAP
data. If the clusterID mapping does not exist, Ceph-CSI will return a `Not Found`
error message to the caller.

After failover to the cluster `site2-storage`, the admin might have created new
PVCs on the primary cluster `site2-storage`. Later, after recovering the
cluster `site1-storage`, the admin might choose to failback from
`site2-storage` to `site1-storage`. Now the admin needs to copy all the newly
created kubernetes artifacts to the failback cluster. For clusterID mapping, the
admin needs to copy the above-created configmap `ceph-clusterid-mapping` to the
failback cluster. When Ceph-CSI receives a CSI/Replication request for the
volumes created on `site2-storage`, it will decode the volumeID and retrieve
the clusterID, i.e. `site2-storage`. In the above configmap
`ceph-clusterid-mapping`, `site2-storage` is the value and `site1-storage`
is the key in the `clusterIDMapping` entry.

Ceph-CSI will check both `key` and `value` to check the clusterID mapping. If it
is found in the `key`, it will consider the `value` as the corresponding mapping;
if it is found in the `value` place, it will treat the `key` as the corresponding
mapping, and then retrieve all the poolID details of the cluster.

This mapping on the remote cluster is only required when we are doing a failover
operation from the primary cluster to a remote cluster. The existing volumes
that are created on the remote cluster do not require any mapping, as the
volumeHandle already contains the required information about the local cluster
(clusterID, poolID etc).
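The key/value lookup described above can be sketched as follows. The Go types
here are invented for illustration; the map simply stands in for a decoded
`clusterIDMapping` entry of the `ceph-clusterid-mapping` configmap.

```go
// A minimal sketch of the clusterID mapping lookup: the clusterID decoded from
// a peer-cluster volumeID is resolved by checking both the key and the value
// side of each mapping entry.
package main

import "fmt"

// clusterIDMapping mirrors one entry of the ceph-clusterid-mapping configmap:
// key = clusterID on one cluster, value = clusterID on the peer cluster.
type clusterIDMapping map[string]string

// resolveClusterID returns the clusterID to use locally for a clusterID that
// was decoded from a volumeID generated on the peer cluster.
func resolveClusterID(mappings []clusterIDMapping, decoded string) (string, bool) {
	for _, m := range mappings {
		for key, value := range m {
			if key == decoded {
				return value, true // found in key: use the value
			}
			if value == decoded {
				return key, true // found in value: use the key
			}
		}
	}
	return "", false // no mapping: the caller returns a Not Found error
}

func main() {
	mappings := []clusterIDMapping{{"site1-storage": "site2-storage"}}

	fmt.Println(resolveClusterID(mappings, "site2-storage")) // site1-storage true
	fmt.Println(resolveClusterID(mappings, "unknown"))       // "" false
}
```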


@@ -16,7 +16,7 @@ Some but not all the benefits of this approach:
* volume encryption: encryption of a volume attached by rbd
* encryption at rest: encryption of physical disk done by ceph
* LUKS: Linux Unified Key Setup: stores all the needed setup information for
  dm-crypt on the disk
* dm-crypt: linux kernel device-mapper crypto target
* cryptsetup: the command line tool to interface with dm-crypt
@@ -28,8 +28,8 @@ requirement by using dm-crypt module through cryptsetup cli interface.
### Implementation Summary

* Encryption is implemented using cryptsetup with the LUKS extension. A good
  introduction to LUKS and dm-crypt in general can be found
  [here](https://wiki.archlinux.org/index.php/Dm-crypt/Device_encryption#Encrypting_devices_with_cryptsetup).
  Functions to implement the necessary interaction are implemented in a separate
  `cryptsetup.go` file.
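For orientation, a wrapper around the cryptsetup CLI could look roughly like the
sketch below. This is not the actual `cryptsetup.go` API; the function names,
example device paths, and the passphrase handling are simplified assumptions for
illustration.

```go
// Simplified sketch of shelling out to cryptsetup for LUKS format/open. Error
// handling and LUKS options are trimmed down; the real implementation differs.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// luksFormat initializes LUKS on the rbd device, reading the passphrase from stdin.
func luksFormat(devicePath, passphrase string) error {
	cmd := exec.Command("cryptsetup", "-q", "luksFormat", devicePath, "--key-file", "-")
	cmd.Stdin = strings.NewReader(passphrase)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("luksFormat on %s failed: %v (%s)", devicePath, err, out)
	}
	return nil
}

// luksOpen maps the encrypted device under /dev/mapper/<mapperName>.
func luksOpen(devicePath, mapperName, passphrase string) error {
	cmd := exec.Command("cryptsetup", "luksOpen", devicePath, mapperName, "--key-file", "-")
	cmd.Stdin = strings.NewReader(passphrase)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("luksOpen of %s failed: %v (%s)", devicePath, err, out)
	}
	return nil
}

func main() {
	// Example flow for a hypothetical volume: the opened mapper device would
	// then be used for mkfs/mount instead of the raw /dev/rbd0.
	volumeID := "example-volume-id"
	_ = luksFormat("/dev/rbd0", "passphrase-from-kms")
	_ = luksOpen("/dev/rbd0", "luks-"+volumeID, "passphrase-from-kms")
	fmt.Println("mapped at /dev/mapper/luks-" + volumeID)
}
```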
@@ -45,8 +45,8 @@ requirement by using dm-crypt module through cryptsetup cli interface.
  volume attach request
* `NodeStageVolume`: refactored to open encrypted device (`openEncryptedDevice`)
  * `openEncryptedDevice`: looks up a passphrase matching the volume id and
    returns the new device path in the form: `/dev/mapper/luks-<volume_id>`. On
    the worker node where the attach is scheduled:

```shell
$ lsblk
@@ -62,10 +62,10 @@ requirement by using dm-crypt module through cryptsetup cli interface.
  before detaching the volume.
* StorageClass extended with the following parameters:
  1. `encrypted` ("true" or "false")
  2. `encryptionKMSID` (string representing kms configuration of choice)
     ceph-csi plugin may support different kms vendors with different types of
     authentication
* New KMS Configuration created.
@@ -75,37 +75,37 @@ requirement by using dm-crypt module through cryptsetup cli interface.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd
provisioner: rbd.csi.ceph.com
parameters:
  # String representing Ceph cluster configuration
  clusterID: <cluster-id>
  # ceph pool
  pool: rbd
  # RBD image features, CSI creates image with image-format 2
  # CSI RBD currently supports only `layering` feature.
  imageFeatures: layering
  # The secrets have to contain Ceph credentials with required access
  # to the 'pool'.
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
  # Specify the filesystem type of the volume. If not specified,
  # csi-provisioner will set default as `ext4`.
  csi.storage.k8s.io/fstype: ext4
  # Encrypt volumes
  encrypted: "true"
  # Use external key management system for encryption passphrases by specifying
  # a unique ID matching KMS ConfigMap. The ID is only used for correlation to
  # configmap entry.
  encryptionKMSID: <kms-id>
reclaimPolicy: Delete
```
@@ -133,14 +133,19 @@ metadata:
The main components that are used to support encrypted volumes:

1. the `EncryptionKMS` interface

   * an instance is configured per volume object (`rbdVolume.KMS`)
   * used to authenticate with a master key or token
   * can store the KEK (Key-Encryption-Key) for encrypting and decrypting the
     DEKs (Data-Encryption-Key)

1. the `DEKStore` interface

   * saves and fetches the DEK (Data-Encryption-Key)
   * can be provided by a KMS, or by other components (like `rbdVolume`)

1. the `VolumeEncryption` type

   * combines `EncryptionKMS` and `DEKStore` into a single place
   * easy to configure from other components or subsystems
   * provides a simple API for all KMS operations
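To visualize how these pieces might relate, here is a rough Go sketch. The
method names and signatures below are assumptions made for illustration only;
the actual interfaces in the driver differ in detail.

```go
// Rough sketch of how the three components described above could fit together.
package main

// EncryptionKMS authenticates against a KMS (master key or token) and, for
// some KMS types, holds the KEK used to wrap/unwrap DEKs.
type EncryptionKMS interface {
	// EncryptDEK wraps a per-volume DEK with the KEK held by the KMS.
	EncryptDEK(volumeID, plainDEK string) (string, error)
	// DecryptDEK unwraps a previously stored DEK.
	DecryptDEK(volumeID, encryptedDEK string) (string, error)
}

// DEKStore saves and fetches the (possibly encrypted) DEK; it can be backed by
// the KMS itself or by volume metadata (like rbdVolume).
type DEKStore interface {
	StoreDEK(volumeID, dek string) error
	FetchDEK(volumeID string) (string, error)
}

// VolumeEncryption combines an EncryptionKMS and a DEKStore behind one small
// API that other subsystems use.
type VolumeEncryption struct {
	KMS   EncryptionKMS
	Store DEKStore
}

// GetCryptoPassphrase returns the plaintext DEK for a volume, fetching it from
// the DEKStore and unwrapping it with the KMS.
func (ve *VolumeEncryption) GetCryptoPassphrase(volumeID string) (string, error) {
	stored, err := ve.Store.FetchDEK(volumeID)
	if err != nil {
		return "", err
	}
	return ve.KMS.DecryptDEK(volumeID, stored)
}

func main() {}
```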


@@ -14,7 +14,8 @@ KMS implementation. Or, if changes would be minimal, a configuration option to
one of the implementations can be added.

Different KMS implementations and their configurable options can be found at
[`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml).

### VaultTokensKMS
@@ -26,7 +27,8 @@ An example of the per Tenant configuration options are in
[`tenant-config.yaml`](../../../examples/kms/vault/tenant-config.yaml) and
[`tenant-token.yaml`](../../../examples/kms/vault/tenant-token.yaml).

Implementation is in
[`vault_tokens.go`](../../../internal/util/vault_tokens.go).

### Vault
@@ -36,7 +38,7 @@ Implementation is in [`vault.go`](../../../internal/util/vault.go).
## Extension or New KMS implementation

Normally ServiceAccounts are provided by Kubernetes in the containers'
filesystem. This only allows a single ServiceAccount and is static for the
lifetime of the Pod. Ceph-CSI runs in the namespace of the storage
administrator, and has access to the single ServiceAccount linked in the
@@ -53,7 +55,7 @@ steps need to be taken:
  replace the default (`AuthKubernetesTokenPath:
  /var/run/secrets/kubernetes.io/serviceaccount/token`)

Currently, the Ceph-CSI components may read Secrets and ConfigMaps from the
Tenants namespace. These permissions need to be extended to allow Ceph-CSI to
read the contents of the ServiceAccount(s) in the Tenants namespace.
@@ -61,7 +63,8 @@ read the contents of the ServiceAccount(s) in the Tenants namespace.
### Global Configuration

1. a StorageClass links to a KMS configuration by providing the `kmsID`
   parameter
1. a ConfigMap in the namespace of the Ceph-CSI deployment contains the KMS
   configuration for the `kmsID`
   ([`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml))
@@ -76,8 +79,8 @@ configuration from the ConfigMap.
1. needs a ServiceAccount with a known name with permissions to connect to Vault
1. optional ConfigMap with options for Vault that override default settings

A `CreateVolume` request contains the owner (Namespace) of the Volume. The KMS
configuration indicates that additional attributes need to be fetched from the
Tenants namespace, so the provisioner will fetch these. The additional
configuration and ServiceAccount are merged into the provisioner's configuration
for the KMS-implementation while creating the volume.
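A sketch of how the provisioner might fetch such per-Tenant overrides with
client-go is shown below. The ConfigMap name used here is a made-up placeholder,
not necessarily the name Ceph-CSI actually uses.

```go
// Illustrative only: fetch a per-Tenant ConfigMap that overrides Vault
// connection options for volumes owned by that Namespace.
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// tenantVaultOverrides returns Vault options defined by the Tenant (the
// Namespace owning the volume); an empty map means the Tenant has no overrides.
func tenantVaultOverrides(ctx context.Context, tenantNamespace string) (map[string]string, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}

	cm, err := client.CoreV1().ConfigMaps(tenantNamespace).Get(
		ctx, "ceph-csi-kms-config" /* placeholder name */, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return map[string]string{}, nil // Tenant did not provide overrides
	}
	if err != nil {
		return nil, err
	}
	return cm.Data, nil
}

func main() {
	overrides, err := tenantVaultOverrides(context.TODO(), "tenant-namespace")
	fmt.Println(overrides, err)
}
```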


@@ -1,11 +1,11 @@
# RBD MIRRORING

RBD mirroring is a process of replication of RBD images between two or more Ceph
clusters. Mirroring ensures point-in-time, crash-consistent RBD images between
clusters. RBD mirroring is mainly used for disaster recovery (i.e. having a
secondary site as a failover).
See [Ceph documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on
RBD mirroring for complete information.

## Architecture
@@ -28,8 +28,8 @@ PersistentVolumeClaim (PVC) on the secondary site during the failover.
VolumeHandle to identify the OMAP data nor the image anymore, because we have
only the PoolID and ClusterID in the VolumeHandle. We cannot identify the
correct pool name from the PoolID because the pool name will remain the same on
both clusters but not the PoolID; even the ClusterID can be different on the
secondary cluster.

> Sample PV spec which will be used by rbdplugin controller to regenerate OMAP
> data
@@ -56,10 +56,10 @@ csi:
```

> **VolumeHandle** is the unique volume name returned by the CSI volume plugin's
> CreateVolume to refer to the volume on all subsequent calls.

Once the static PVC is created on the secondary cluster, the Kubernetes user can
try to delete the PVC, expand the PVC or mount the PVC. In case of mounting
(NodeStageVolume) we will get the volume context in the RPC call, but not in the
Delete/Expand request. In the Delete/Expand RPC request only the VolumeHandle
(`clusterID-poolID-volumeuniqueID`) will be sent, where it contains the encoded
@@ -73,17 +73,17 @@ secondary cluster as the PoolID and ClusterID always may not be the same.
To solve this problem, we will have a new controller (rbdplugin controller)
running as part of the provisioner pod which watches for the PV objects. When a
PV is created it will extract the required information from the PV spec, and it
will regenerate the OMAP data. Whenever Ceph-CSI gets an RPC request with an
older VolumeHandle, it will check if any new VolumeHandle exists for the old
VolumeHandle. If yes, it uses the new VolumeHandle for internal operations (to
get the pool name, Ceph monitor details from the ClusterID, etc).

Currently, we are making use of watchers in the node stage request to make sure
a ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time.
We need to change the watchers logic in the node stage request, as when we
enable RBD mirroring on an image, a watcher will be added on the RBD image by
the rbd mirroring daemon.

To solve the ClusterID problem, if the ClusterID is different on the second
cluster, the admin has to create a new ConfigMap for the mapped ClusterID's.
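A rough sketch of what such a PV watcher could look like with client-go
informers follows. It only shows picking out the RBD-backed PVs and the fields
mentioned above; the OMAP regeneration itself is out of scope here, and the
handler is illustrative rather than the actual rbdplugin controller.

```go
// Sketch of a controller that watches PersistentVolumes and selects the RBD
// PVs whose OMAP data would need to be regenerated.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

const rbdDriverName = "rbd.csi.ceph.com"

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	pvInformer := factory.Core().V1().PersistentVolumes().Informer()

	pvInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pv, ok := obj.(*corev1.PersistentVolume)
			if !ok || pv.Spec.CSI == nil || pv.Spec.CSI.Driver != rbdDriverName {
				return // not an RBD-backed PV
			}
			// VolumeHandle and VolumeAttributes carry the clusterID/poolID and
			// image metadata needed to regenerate the OMAP entries.
			fmt.Println("would regenerate OMAP for", pv.Name,
				pv.Spec.CSI.VolumeHandle, pv.Spec.CSI.VolumeAttributes)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	select {} // run forever (sketch only)
}
```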


@@ -1,59 +1,57 @@
# RBD NBD VOLUME HEALER

- [RBD NBD VOLUME HEALER](#rbd-nbd-volume-healer)
  - [Rbd Nbd](#rbd-nbd)
    - [Advantages of userspace mounters](#advantages-of-userspace-mounters)
    - [Side effects of userspace mounters](#side-effects-of-userspace-mounters)
  - [Volume Healer](#volume-healer)
    - [More thoughts](#more-thoughts)

## Rbd nbd

The rbd CSI plugin will provision new rbd images and attach and mount those to
workloads. Currently, the default mounter is krbd, which uses the kernel rbd
driver to mount the rbd images onto the application pod. From here on, Ceph-CSI
will also have a userspace way of mounting the rbd images, via rbd-nbd.

[Rbd-nbd](https://docs.ceph.com/en/latest/man/8/rbd-nbd/) is a client for RADOS
block device (rbd) images like the existing rbd kernel module. It will map an
rbd image to an nbd (Network Block Device) device, allowing access to it as a
regular local block device.

![csi-rbd-nbd](./images/csi-rbd-nbd.svg)

It's worth noting that the rbd-nbd processes will run on the client-side,
which is inside the `csi-rbdplugin` node plugin.

### Advantages of userspace mounters

- It is easier to add features to rbd-nbd as it is released regularly with Ceph,
  and more difficult and time consuming to add features to the kernel rbd module
  as that is part of the Linux kernel release schedule.
- Container upgrades will be independent of the host node, which means if there
  are any new features with rbd-nbd, we don't have to reboot the node as the
  changes will be shipped inside the container.
- Because the container upgrades are host node independent, we will be a better
  citizen in K8s by switching to the userspace model.
- Unlike krbd, rbd-nbd uses the librbd user-space library that gets most of the
  development focus, and hence rbd-nbd will be feature-rich.
- Being entirely kernel space impacts fault-tolerance, as any kernel panic
  affects a whole node, not only a single pod that is using rbd storage. Thanks
  to rbd-nbd's userspace design, we are less bothered here: krbd is a complete
  kernel and vendor-specific driver which needs changes on every feature basis;
  on the other hand, rbd-nbd depends on the generic NBD driver, while all the
  vendor-specific logic sits in userspace. It's worth noting that the generic
  NBD driver has been mostly unchanged for years and can be considered quite
  stable. Also, given that NBD is a generic driver, there will be many eyes on
  it compared to the rbd driver.

### Side effects of userspace mounters

Since the rbd-nbd processes run per volume map on the client side, i.e. inside
the `csi-rbdplugin` node plugin, a restart of the node plugin will terminate all
the rbd-nbd processes, and there is currently no way to restore these processes
back to life, which could lead to IO errors on all the application pods.

![csi-plugin-restart](./images/csi-plugin-restart.svg)
@@ -61,42 +59,42 @@ This is where the Volume healer could help.
## Volume healer

The Volume healer runs on the start of the rbd node plugin and runs within the
node plugin driver context.

The Volume healer does the following (see the sketch after this list):

- Get the VolumeAttachment list for the current node where it is running
- Filter the VolumeAttachment list by matching driver name and attached status
- For each VolumeAttachment get the respective PV information and check the
  criteria of PV Bound, mounter type
- Build the StagingPath where the rbd image's PVC is mounted, based on the
  KUBELET path and the PV object
- Construct the NodeStageVolume() request and send the request to the CSI driver
- The NodeStageVolume() has a way to identify calls received from the healer;
  when executed from the healer context, it just runs in the minimal required
  form, where it fetches the previously mapped device to the image and the
  respective secrets, and finally ensures to bring the respective process back
  to life, thus enabling IO to continue.
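A condensed sketch of that flow, using the Kubernetes API types involved, might
look like the following. The staging-path construction and the NodeStageVolume
call are only stubbed out, and the constants and function names are assumptions
for illustration.

```go
// Sketch of the volume-healer flow: list VolumeAttachments for this node,
// filter by driver and attached status, then re-issue NodeStageVolume for the
// matching PVs. Staging-path layout and the actual CSI call are stubbed.
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const (
	driverName  = "rbd.csi.ceph.com"
	kubeletPath = "/var/lib/kubelet" // assumption; configurable on real nodes
)

func healNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	vaList, err := client.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}

	for _, va := range vaList.Items {
		if va.Spec.NodeName != nodeName || va.Spec.Attacher != driverName || !va.Status.Attached {
			continue // not our driver, not this node, or not attached
		}
		pvName := va.Spec.Source.PersistentVolumeName
		if pvName == nil {
			continue
		}
		pv, err := client.CoreV1().PersistentVolumes().Get(ctx, *pvName, metav1.GetOptions{})
		if err != nil || pv.Status.Phase != "Bound" || pv.Spec.CSI == nil {
			continue // skip PVs that are gone, unbound, or not CSI-backed
		}
		// Hypothetical staging path; the real layout depends on the kubelet version.
		stagingPath := filepath.Join(kubeletPath, "plugins/kubernetes.io/csi/pv", *pvName, "globalmount")
		fmt.Println("would call NodeStageVolume for", pv.Spec.CSI.VolumeHandle, "at", stagingPath)
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	_ = healNode(context.TODO(), kubernetes.NewForConfigOrDie(cfg), "worker-node-1")
}
```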
### More thoughts

- Currently the NodeStageVolume() call is safeguarded by the global Ceph-CSI
  level lock (per volID) that needs to be acquired before doing any of the
  NodeStage, NodeUnstage, NodePublish, NodeUnpublish operations. Hence none of
  the operations happen in parallel.
- Any issues if the NodeUnstage is issued by kubelet?
  - This cannot be a problem as we take a lock at the Ceph-CSI level
  - If the NodeUnstage succeeds, Ceph-CSI will return a StagingPath not found
    error, and we can then skip
  - If the NodeUnstage fails with an operation already going on, in the next
    NodeUnstage the volume gets unmounted
- What if the PVC is deleted?
  - If the PVC is deleted, the volume attachment list might already have been
    refreshed and the entry will be skipped/deleted by the healer.
- For any reason, if the request bails out with Error NotFound, skip the
  PVC, assuming it might have been deleted or the NodeUnstage might have already
  happened.
- The Volume healer currently works with rbd-nbd, but the design can
  accommodate other userspace mounters (e.g. ceph-fuse).