diff --git a/docs/design/proposals/cephfs-snapshot-shallow-ro-vol.md b/docs/design/proposals/cephfs-snapshot-shallow-ro-vol.md index 73e2a3f16..114297ede 100644 --- a/docs/design/proposals/cephfs-snapshot-shallow-ro-vol.md +++ b/docs/design/proposals/cephfs-snapshot-shallow-ro-vol.md @@ -6,50 +6,49 @@ snapshot contents and then mount that volume to workloads. CephFS exposes snapshots as special, read-only directories of a subvolume located in `/.snap`. cephfs-csi can already provision writable -volumes with snapshots as their data source, where snapshot contents are -cloned to the newly created volume. However, cloning a snapshot to volume -is a very expensive operation in CephFS as the data needs to be fully copied. -When the need is to only read snapshot contents, snapshot cloning is extremely +volumes with snapshots as their data source, where snapshot contents are cloned +to the newly created volume. However, cloning a snapshot to volume is a very +expensive operation in CephFS as the data needs to be fully copied. When the +need is to only read snapshot contents, snapshot cloning is extremely inefficient and wasteful. -This proposal describes a way for cephfs-csi to expose CephFS snapshots -as shallow, read-only volumes, without needing to clone the underlying -snapshot data. +This proposal describes a way for cephfs-csi to expose CephFS snapshots as +shallow, read-only volumes, without needing to clone the underlying snapshot +data. ## Use-cases What's the point of such read-only volumes? * **Restore snapshots selectively:** users may want to traverse snapshots, - restoring data to a writable volume more selectively instead of restoring - the whole snapshot. -* **Volume backup:** users can't backup a live volume, they first need - to snapshot it. 
Once a snapshot is taken, it still can't be backed-up, - as backup tools usually work with volumes (that are exposed as file-systems) + restoring data to a writable volume more selectively instead of restoring the + whole snapshot. +* **Volume backup:** users can't backup a live volume, they first need to + snapshot it. Once a snapshot is taken, it still can't be backed-up, as backup + tools usually work with volumes (that are exposed as file-systems) and not snapshots (which might have backend-specific format). What this means is that in order to create a snapshot backup, users have to clone snapshot data twice: - 1. first time, when restoring the snapshot into a temporary volume from - where the data will be read, - 1. and second time, when transferring that volume into some backup/archive - storage (e.g. object store). + 1. first time, when restoring the snapshot into a temporary volume from + where the data will be read, + 1. and second time, when transferring that volume into some backup/archive + storage (e.g. object store). The temporary backed-up volume will most likely be thrown away after the backup transfer is finished. That's a lot of wasted work for what we - originally wanted to do! Having the ability to create volumes from - snapshots cheaply would be a big improvement for this use case. + originally wanted to do! Having the ability to create volumes from snapshots + cheaply would be a big improvement for this use case. ## Alternatives * _Snapshots are stored in `/.snap`. Users could simply visit this directory by themselves._ - `.snap` is CephFS-specific detail of how snapshots are exposed. - Users / tools may not be aware of this special directory, or it may not fit - their workflow. At the moment, the idiomatic way of accessing snapshot - contents in CSI drivers is by creating a new volume and populating it - with snapshot. + `.snap` is CephFS-specific detail of how snapshots are exposed. 
Users / tools + may not be aware of this special directory, or it may not fit their workflow. + At the moment, the idiomatic way of accessing snapshot contents in CSI drivers + is by creating a new volume and populating it with snapshot. ## Design @@ -57,21 +56,21 @@ Key points: * Volume source is a snapshot, volume access mode is `*_READER_ONLY`. * No actual new subvolumes are created in CephFS. -* The resulting volume is a reference to the source subvolume snapshot. - This reference would be stored in `Volume.volume_context` map. In order - to reference a snapshot, we need subvol name and snapshot name. -* Mounting such volume means mounting the respective CephFS subvolume - and exposing the snapshot to workloads. -* Let's call a *shallow read-only volume with a subvolume snapshot - as its data source* just a *shallow volume* from here on out for brevity. +* The resulting volume is a reference to the source subvolume snapshot. This + reference would be stored in `Volume.volume_context` map. In order to + reference a snapshot, we need subvol name and snapshot name. +* Mounting such volume means mounting the respective CephFS subvolume and + exposing the snapshot to workloads. +* Let's call a *shallow read-only volume with a subvolume snapshot as its data + source* just a *shallow volume* from here on out for brevity. ### Controller operations -Care must be taken when handling life-times of relevant storage resources. -When a shallow volume is created, what would happen if: +Care must be taken when handling life-times of relevant storage resources. When +a shallow volume is created, what would happen if: -* _Parent subvolume of the snapshot is removed while the shallow volume - still exists?_ +* _Parent subvolume of the snapshot is removed while the shallow volume still + exists?_ This shouldn't be a problem already. 
The parent volume has either `snapshot-retention` subvol feature in which case its snapshots remain @@ -80,8 +79,8 @@ When a shallow volume is created, what would happen if: * _Source snapshot from which the shallow volume originates is removed while that shallow volume still exists?_ - We need to make sure this doesn't happen and some book-keeping - is necessary. Ideally we could employ some kind of reference counting. + We need to make sure this doesn't happen and some book-keeping is necessary. + Ideally we could employ some kind of reference counting. #### Reference counting for shallow volumes @@ -92,26 +91,26 @@ When creating a volume snapshot, a reference tracker (RT), represented by a RADOS object, would be created for that snapshot. It would store information required to track the references for the backing subvolume snapshot. Upon a `CreateSnapshot` call, the reference tracker (RT) would be initialized with a -single reference record, where the CSI snapshot itself is the first reference -to the backing snapshot. Each subsequent shallow volume creation would add a -new reference record to the RT object. Each shallow volume deletion would -remove that reference from the RT object. Calling `DeleteSnapshot` would remove -the reference record that was previously added in `CreateSnapshot`. +single reference record, where the CSI snapshot itself is the first reference to +the backing snapshot. Each subsequent shallow volume creation would add a new +reference record to the RT object. Each shallow volume deletion would remove +that reference from the RT object. Calling `DeleteSnapshot` would remove the +reference record that was previously added in `CreateSnapshot`. The subvolume snapshot would be removed from the Ceph cluster only once the RT object holds no references. Note that this behavior would permit calling `DeleteSnapshot` even if it is still referenced by shallow volumes. 
* `DeleteSnapshot`: - * RT holds no references or the RT object doesn't exist: - delete the backing snapshot too. - * RT holds at least one reference: keep the backing snapshot. +* RT holds no references or the RT object doesn't exist: + delete the backing snapshot too. +* RT holds at least one reference: keep the backing snapshot. * `DeleteVolume`: - * RT holds no references: delete the backing snapshot too. - * RT holds at least one reference: keep the backing snapshot. +* RT holds no references: delete the backing snapshot too. +* RT holds at least one reference: keep the backing snapshot. -To enable creating shallow volumes from snapshots that were provisioned by -older versions of cephfs-csi (i.e. before this feature is introduced), +To enable creating shallow volumes from snapshots that were provisioned by older +versions of cephfs-csi (i.e. before this feature is introduced), `CreateVolume` for shallow volumes would also create an RT object in case it's missing. It would be initialized to two: the source snapshot and the newly created shallow volume. @@ -141,17 +140,17 @@ Things to look out for: It doesn't consume any space on the filesystem. `Volume.capacity_bytes` is allowed to contain zero. We could use that. -* _What should be the requested size when creating the volume (specified e.g. - in PVC)?_ +* _What should be the requested size when creating the volume (specified e.g. in + PVC)?_ This one is tricky. CSI spec allows for - `CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be - zero. On the other hand, - `PersistentVolumeClaim.spec.resources.requests.storage` must be bigger - than zero. cephfs-csi doesn't care about the requested size (the volume - will be read-only, so it has no usable capacity) and would always set it - to zero. This shouldn't case any problems for the time being, but still - is something we should keep in mind. + `CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be zero. 
+ On the other hand, + `PersistentVolumeClaim.spec.resources.requests.storage` must be bigger than + zero. cephfs-csi doesn't care about the requested size (the volume will be + read-only, so it has no usable capacity) and would always set it to zero. This + shouldn't cause any problems for the time being, but still is something we + should keep in mind. `CreateVolume` and behavior when using volume as volume source (PVC-PVC clone): @@ -167,8 +166,8 @@ Volume deletion is trivial. ### `CreateSnapshot` -Snapshotting read-only volumes doesn't make sense in general, and should -be rejected. +Snapshotting read-only volumes doesn't make sense in general, and should be +rejected. ### `ControllerExpandVolume` @@ -194,8 +193,8 @@ whole subvolume first, and only then perform the binds to target paths. #### For case (a) Subvolume paths are normally retrieved by -`ceph fs subvolume info/getpath `, -which outputs a path like so: +`ceph fs subvolume info/getpath ` +, which outputs a path like so: ``` /volumes/// @@ -217,12 +216,12 @@ itself still exists or not. #### For case (b) -For cases where subvolumes are managed externally and not by cephfs-csi, we -must assume that the cephx user we're given can access only +For cases where subvolumes are managed externally and not by cephfs-csi, we must +assume that the cephx user we're given can access only `/volumes///` so users won't be able to -benefit from snapshot retention. Users will need to be careful not to delete -the parent subvolumes and snapshots while they are associated by these shallow -RO volumes. +benefit from snapshot retention. Users will need to be careful not to delete the +parent subvolumes and snapshots while they are associated by these shallow RO +volumes. ### `NodePublishVolume`, `NodeUnpublishVolume` @@ -235,38 +234,38 @@ mount.
## Volume parameters, volume context -This section provides a discussion around determinig what volume parameters and +This section provides a discussion around determining what volume parameters and volume context parameters will be used to convey necessary information to the cephfs-csi driver in order to support shallow volumes. Volume parameters `CreateVolumeRequest.parameters`: * Should be "shallow" the default mode for all `CreateVolume` calls that have - (a) snapshot as data source and (b) read-only volume access mode? If not, - a new volume parameter should be introduced: e.g `isShallow: `. On the + (a) snapshot as data source and (b) read-only volume access mode? If not, a + new volume parameter should be introduced: e.g `isShallow: `. On the other hand, does it even makes sense for users to want to create full copies of snapshots and still have them read-only? Volume context `Volume.volume_context`: -* Here we definitely need `isShallow` or similar. Without it we wouldn't be - able to distinguish between a regular volume that just happens to have - a read-only access mode, and a volume that references a snapshot. +* Here we definitely need `isShallow` or similar. Without it we wouldn't be able + to distinguish between a regular volume that just happens to have a read-only + access mode, and a volume that references a snapshot. * Currently cephfs-csi recognizes `subvolumePath` for dynamically provisioned volumes and `rootPath` for pre-previsioned volumes. As mentioned in - [`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume), - snapshots cannot be mounted directly. How do we pass in path to the parent + [`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume) + , snapshots cannot be mounted directly. How do we pass in path to the parent subvolume? - * a) Path to the snapshot is passed in via `subvolumePath` / `rootPath`, - e.g. - `/volumes////.snap/`. 
- From that we can derive path to the subvolume: it's the parent of `.snap` - directory. - * b) Similar to a), path to the snapshot is passed in via `subvolumePath` / - `rootPath`, but instead of trying to derive the right path we introduce - another volume context parameter containing path to the parent subvolume - explicitly. - * c) `subvolumePath` / `rootPath` contains path to the parent subvolume and - we introduce another volume context parameter containing name of the - snapshot. Path to the snapshot is then formed by appending - `/.snap/` to the subvolume path. +* a) Path to the snapshot is passed in via `subvolumePath` / `rootPath`, + e.g. + `/volumes////.snap/`. + From that we can derive path to the subvolume: it's the parent of `.snap` + directory. +* b) Similar to a), path to the snapshot is passed in via `subvolumePath` / + `rootPath`, but instead of trying to derive the right path we introduce + another volume context parameter containing path to the parent subvolume + explicitly. +* c) `subvolumePath` / `rootPath` contains path to the parent subvolume and + we introduce another volume context parameter containing name of the + snapshot. Path to the snapshot is then formed by appending + `/.snap/` to the subvolume path. diff --git a/docs/design/proposals/clusterid-mapping.md b/docs/design/proposals/clusterid-mapping.md index acb734bcb..4f45e05d5 100644 --- a/docs/design/proposals/clusterid-mapping.md +++ b/docs/design/proposals/clusterid-mapping.md @@ -1,7 +1,7 @@ # Design to handle clusterID and poolID for DR During disaster recovery/migration of a cluster, as part of the failover, the -kubernetes artifacts like deployment, PVC, PV, etc will be restored to a new +kubernetes artifacts like deployment, PVC, PV, etc. will be restored to a new cluster by the admin. Even if the kubernetes objects are restored the corresponding RBD/CephFS subvolume cannot be retrieved during CSI operations as the clusterID and poolID are not the same in both clusters. 
Let's see the @@ -10,8 +10,8 @@ problem in more detail below. `0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002` The above is the sample volumeID sent back in response to the CreateVolume -operation and added as a volumeHandle in the PV spec. CO (Kubernetes) uses -above as the identifier for other operations on the volume/PVC. +operation and added as a volumeHandle in the PV spec. CO (Kubernetes) uses above +as the identifier for other operations on the volume/PVC. The VolumeID is encoded as, @@ -33,7 +33,7 @@ the other cluster. During the Disaster Recovery (failover operation) the PVC and PV will be recreated on the other cluster. When Ceph-CSI receives the request for -operations like (NodeStage, ExpandVolume, DeleteVolume, etc) the volumeID is +operations like (NodeStage, ExpandVolume, DeleteVolume, etc.) the volumeID is sent in the request which will help to identify the volume. ```yaml= @@ -68,15 +68,15 @@ metadata: ``` During CSI/Replication operations, Ceph-CSI will decode the volumeID and gets -the monitor configuration from the configmap and by the poolID will get the -pool Name and retrieves the OMAP data stored in the rados OMAP and finally -check the volume is present in the pool. +the monitor configuration from the configmap and by the poolID will get the pool +Name and retrieves the OMAP data stored in the rados OMAP and finally check the +volume is present in the pool. 
## Problems with volumeID Replication * The clusterID can be different - * as the clusterID is the namespace where rook is deployed, the Rook might be - deployed in the different namespace on a secondary cluster + * as the clusterID is the namespace where rook is deployed, the Rook might + be deployed in the different namespace on a secondary cluster * In standalone Ceph-CSI the clusterID is fsID and fsID is unique per cluster @@ -124,8 +124,8 @@ metadata: name: ceph-csi-config ``` -**Note:-** the configmap will be mounted as a volume to the CSI (provisioner -and node plugin) pods. +**Note:-** the configmap will be mounted as a volume to the CSI (provisioner and +node plugin) pods. The above configmap will get created as it is or updated (if new Pools are created on the existing cluster) with new entries when the admin choose to @@ -149,28 +149,28 @@ Replicapool with ID `1` on site1 and Replicapool with ID `2` on site2. After getting the required mapping Ceph-CSI has the required information to get more details from the rados OMAP. If we have multiple clusterID mapping it will loop through all the mapping and checks the corresponding pool to get the OMAP -data. If the clusterID mapping does not exist Ceph-CSI will return a `Not -Found` error message to the caller. +data. If the clusterID mapping does not exist Ceph-CSI will return a `Not Found` +error message to the caller. After failover to the cluster `site2-storage`, the admin might have created new PVCs on the primary cluster `site2-storage`. Later after recovering the cluster `site1-storage`, the admin might choose to failback from `site2-storage` to `site1-storage`. Now admin needs to copy all the newly created kubernetes artifacts to the failback cluster. For clusterID mapping, the -admin needs to copy the above-created configmap `ceph-clusterid-mapping` to -the failback cluster. 
When Ceph-CSI receives a CSI/Replication request for -the volumes created on the `site2-storage` it will decode the volumeID and -retrieves the clusterID ie `site2-storage`. In the above configmap +admin needs to copy the above-created configmap `ceph-clusterid-mapping` to the +failback cluster. When Ceph-CSI receives a CSI/Replication request for the +volumes created on the `site2-storage` it will decode the volumeID and retrieves +the clusterID, i.e. `site2-storage`. In the above configmap `ceph-clusterid-mapping` the `site2-storage` is the value and `site1-storage` is the key in the `clusterIDMapping` entry. Ceph-CSI will check both `key` and `value` to check the clusterID mapping. If it -is found in `key` it will consider `value` as the corresponding mapping, if it +is found in `key` it will consider `value` as the corresponding mapping, if it is found in `value` place it will treat `key` as the corresponding mapping and retrieves all the poolID details of the cluster. -This mapping on the remote cluster is only required when we are doing a -failover operation from the primary cluster to a remote cluster. The existing -volumes that are created on the remote cluster does not require -any mapping as the volumeHandle already contains the required information about -the local cluster (clusterID, poolID etc). +This mapping on the remote cluster is only required when we are doing a failover +operation from the primary cluster to a remote cluster. The existing volumes +that are created on the remote cluster do not require any mapping as the +volumeHandle already contains the required information about the local cluster +(clusterID, poolID, etc.).
diff --git a/docs/design/proposals/encrypted-pvc.md b/docs/design/proposals/encrypted-pvc.md index f99aa4943..c2ff751f8 100644 --- a/docs/design/proposals/encrypted-pvc.md +++ b/docs/design/proposals/encrypted-pvc.md @@ -16,7 +16,7 @@ Some but not all the benefits of this approach: * volume encryption: encryption of a volume attached by rbd * encryption at rest: encryption of physical disk done by ceph -* LUKS: Linux Unified Key Setup: stores all of the needed setup information for +* LUKS: Linux Unified Key Setup: stores all the needed setup information for dm-crypt on the disk * dm-crypt: linux kernel device-mapper crypto target * cryptsetup: the command line tool to interface with dm-crypt @@ -28,8 +28,8 @@ requirement by using dm-crypt module through cryptsetup cli interface. ### Implementation Summary -* Encryption is implemented using cryptsetup with LUKS extension. - A good introduction to LUKS and dm-crypt in general can be found +* Encryption is implemented using cryptsetup with LUKS extension. A good + introduction to LUKS and dm-crypt in general can be found [here](https://wiki.archlinux.org/index.php/Dm-crypt/Device_encryption#Encrypting_devices_with_cryptsetup) Functions to implement necessary interaction are implemented in a separate `cryptsetup.go` file. @@ -45,8 +45,8 @@ requirement by using dm-crypt module through cryptsetup cli interface. volume attach request * `NodeStageVolume`: refactored to open encrypted device (`openEncryptedDevice`) * `openEncryptedDevice`: looks up for a passphrase matching the volume id, - returns the new device path in the form: `/dev/mapper/luks-`. - On the woker node where the attach is scheduled: + returns the new device path in the form: `/dev/mapper/luks-`. On + the worker node where the attach is scheduled: ```shell $ lsblk @@ -62,10 +62,10 @@ requirement by using dm-crypt module through cryptsetup cli interface. before detaching the volume. * StorageClass extended with following parameters: - 1. 
`encrypted` ("true" or "false") - 1. `encryptionKMSID` (string representing kms configuration of choice) - ceph-csi plugin may support different kms vendors with different type of - authentication + 1. `encrypted` ("true" or "false") + 2. `encryptionKMSID` (string representing kms configuration of choice) + ceph-csi plugin may support different kms vendors with different type of + authentication * New KMS Configuration created. @@ -75,37 +75,37 @@ requirement by using dm-crypt module through cryptsetup cli interface. apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: - name: csi-rbd + name: csi-rbd provisioner: rbd.csi.ceph.com parameters: - # String representing Ceph cluster configuration - clusterID: - # ceph pool - pool: rbd + # String representing Ceph cluster configuration + clusterID: + # ceph pool + pool: rbd - # RBD image features, CSI creates image with image-format 2 - # CSI RBD currently supports only `layering` feature. - imageFeatures: layering + # RBD image features, CSI creates image with image-format 2 + # CSI RBD currently supports only `layering` feature. + imageFeatures: layering - # The secrets have to contain Ceph credentials with required access - # to the 'pool'. - csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret - csi.storage.k8s.io/provisioner-secret-namespace: default - csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret - csi.storage.k8s.io/controller-expand-secret-namespace: default - csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret - csi.storage.k8s.io/node-stage-secret-namespace: default - # Specify the filesystem type of the volume. If not specified, - # csi-provisioner will set default as `ext4`. - csi.storage.k8s.io/fstype: ext4 + # The secrets have to contain Ceph credentials with required access + # to the 'pool'. 
+ csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret + csi.storage.k8s.io/provisioner-secret-namespace: default + csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret + csi.storage.k8s.io/controller-expand-secret-namespace: default + csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret + csi.storage.k8s.io/node-stage-secret-namespace: default + # Specify the filesystem type of the volume. If not specified, + # csi-provisioner will set default as `ext4`. + csi.storage.k8s.io/fstype: ext4 - # Encrypt volumes - encrypted: "true" + # Encrypt volumes + encrypted: "true" - # Use external key management system for encryption passphrases by specifying - # a unique ID matching KMS ConfigMap. The ID is only used for correlation to - # configmap entry. - encryptionKMSID: + # Use external key management system for encryption passphrases by specifying + # a unique ID matching KMS ConfigMap. The ID is only used for correlation to + # configmap entry. + encryptionKMSID: reclaimPolicy: Delete ``` @@ -133,14 +133,19 @@ metadata: The main components that are used to support encrypted volumes: 1. the `EncryptionKMS` interface - * an instance is configured per volume object (`rbdVolume.KMS`) - * used to authenticate with a master key or token - * can store the KEK (Key-Encryption-Key) for encrypting and decrypting the - DEKs (Data-Encryption-Key) + +* an instance is configured per volume object (`rbdVolume.KMS`) +* used to authenticate with a master key or token +* can store the KEK (Key-Encryption-Key) for encrypting and decrypting the + DEKs (Data-Encryption-Key) + 1. the `DEKStore` interface - * saves and fetches the DEK (Data-Encryption-Key) - * can be provided by a KMS, or by other components (like `rbdVolume`) + +* saves and fetches the DEK (Data-Encryption-Key) +* can be provided by a KMS, or by other components (like `rbdVolume`) + 1. 
the `VolumeEncryption` type - * combines `EncryptionKMS` and `DEKStore` into a single place - * easy to configure from other components or subsystems - * provides a simple API for all KMS operations + +* combines `EncryptionKMS` and `DEKStore` into a single place +* easy to configure from other components or subsystems +* provides a simple API for all KMS operations diff --git a/docs/design/proposals/encryption-with-vault-sa.md b/docs/design/proposals/encryption-with-vault-sa.md index 733a0b31f..835ac8ac4 100644 --- a/docs/design/proposals/encryption-with-vault-sa.md +++ b/docs/design/proposals/encryption-with-vault-sa.md @@ -14,7 +14,8 @@ KMS implementation. Or, if changes would be minimal, a configuration option to one of the implementations can be added. Different KMS implementations and their configurable options can be found at -[`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml). +[`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml) +. ### VaultTokensKMS @@ -26,7 +27,8 @@ An example of the per Tenant configuration options are in [`tenant-config.yaml`](../../../examples/kms/vault/tenant-config.yaml) and [`tenant-token.yaml`](../../../examples/kms/vault/tenant-token.yaml). -Implementation is in [`vault_tokens.go`](../../../internal/util/vault_tokens.go). +Implementation is in [`vault_tokens.go`](../../../internal/util/vault_tokens.go) +. ### Vault @@ -36,7 +38,7 @@ Implementation is in [`vault.go`](../../../internal/util/vault.go). ## Extension or New KMS implementation -Normally ServiceAccounts are provided by Kubernetes in the containers +Normally ServiceAccounts are provided by Kubernetes in the containers' filesystem. This only allows a single ServiceAccount and is static for the lifetime of the Pod. 
Ceph-CSI runs in the namespace of the storage administrator, and has access to the single ServiceAccount linked in the @@ -53,7 +55,7 @@ steps need to be taken: replace the default (`AuthKubernetesTokenPath: /var/run/secrets/kubernetes.io/serviceaccount/token`) -Currently the Ceph-CSI components may read Secrets and ConfigMaps from the +Currently, the Ceph-CSI components may read Secrets and ConfigMaps from the Tenants namespace. These permissions need to be extended to allow Ceph-CSI to read the contents of the ServiceAccount(s) in the Tenants namespace. @@ -61,7 +63,8 @@ read the contents of the ServiceAccount(s) in the Tenants namespace. ### Global Configuration -1. a StorageClass links to a KMS configuration by providing the `kmsID` parameter +1. a StorageClass links to a KMS configuration by providing the `kmsID` + parameter 1. a ConfigMap in the namespace of the Ceph-CSI deployment contains the KMS configuration for the `kmsID` ([`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml)) @@ -76,8 +79,8 @@ configuration from the ConfigMap. 1. needs ServiceAccount with a known name with permissions to connect to Vault 1. optional ConfigMap with options for Vault that override default settings -A `CreateVolume` request contains the owner (Namespace) of the Volume. -The KMS configuration indicates that additional attributes need to be fetched -from the Tenants namespace, so the provisioner will fetch these. The additional -configuration and ServiceAccount are merged in the provisioners configuration +A `CreateVolume` request contains the owner (Namespace) of the Volume. The KMS +configuration indicates that additional attributes need to be fetched from the +Tenants namespace, so the provisioner will fetch these. The additional +configuration and ServiceAccount are merged in the provisioner's configuration for the KMS-implementation while creating the volume.
diff --git a/docs/design/proposals/rbd-mirror.md b/docs/design/proposals/rbd-mirror.md index 63bd345d0..bc2a41311 100644 --- a/docs/design/proposals/rbd-mirror.md +++ b/docs/design/proposals/rbd-mirror.md @@ -1,11 +1,11 @@ # RBD MIRRORING -RBD mirroring is a process of replication of RBD images between two or more -Ceph clusters. Mirroring ensures point-in-time, crash-consistent RBD images -between clusters, RBD mirroring is mainly used for disaster recovery (i.e. -having a secondary site as a failover). See [Ceph -documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on RBD -mirroring for complete information. +RBD mirroring is a process of replication of RBD images between two or more Ceph +clusters. Mirroring ensures point-in-time, crash-consistent RBD images between +clusters, RBD mirroring is mainly used for disaster recovery (i.e. having a +secondary site as a failover). +See [Ceph documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on +RBD mirroring for complete information. ## Architecture @@ -28,8 +28,8 @@ PersistentVolumeClaim (PVC) on the secondary site during the failover. VolumeHandle to identify the OMAP data nor the image anymore because as we have only PoolID and ClusterID in the VolumeHandle. We cannot identify the correct pool name from the PoolID because pool name will remain the same on both -clusters but not the PoolID even the ClusterID can be different on the -secondary cluster. +clusters but not the PoolID even the ClusterID can be different on the secondary +cluster. > Sample PV spec which will be used by rbdplugin controller to regenerate OMAP > data @@ -56,10 +56,10 @@ csi: ``` > **VolumeHandle** is the unique volume name returned by the CSI volume plugin’s -CreateVolume to refer to the volume on all subsequent calls. +> CreateVolume to refer to the volume on all subsequent calls. -Once the static PVC is created on the secondary cluster, the Kubernetes User -can try delete the PVC,expand the PVC or mount the PVC. 
In case of mounting +Once the static PVC is created on the secondary cluster, the Kubernetes User can +try to delete the PVC, expand the PVC or mount the PVC. In case of mounting (NodeStageVolume) we will get the volume context in RPC call but not in the Delete/Expand Request. In Delete/Expand RPC request only the VolumeHandle (`clusterID-poolID-volumeuniqueID`) will be sent where it contains the encoded @@ -73,17 +73,17 @@ secondary cluster as the PoolID and ClusterID always may not be the same. To solve this problem, We will have a new controller(rbdplugin controller) running as part of provisioner pod which watches for the PV objects. When a PV -is created it will extract the required information from the PV spec and it +is created it will extract the required information from the PV spec, and it will regenerate the OMAP data. Whenever Ceph-CSI gets a RPC request with older VolumeHandle, it will check if any new VolumeHandle exists for the old VolumeHandle. If yes, it uses the new VolumeHandle for internal operations (to get pool name, Ceph monitor details from the ClusterID etc). Currently, We are making use of watchers in node stage request to make sure -ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time. -We need to change the watchers logic in the node stage request as when we -enable the RBD mirroring on an image, a watcher will be added on a RBD image by -the rbd mirroring daemon. +ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time. We +need to change the watchers logic in the node stage request as when we enable +the RBD mirroring on an image, a watcher will be added on a RBD image by the rbd +mirroring daemon. To solve the ClusterID problem, If the ClusterID is different on the second cluster, the admin has to create a new ConfigMap for the mapped ClusterID's.
diff --git a/docs/design/proposals/rbd-volume-healer.md b/docs/design/proposals/rbd-volume-healer.md index 9328090aa..f5bb7d823 100644 --- a/docs/design/proposals/rbd-volume-healer.md +++ b/docs/design/proposals/rbd-volume-healer.md @@ -1,59 +1,57 @@ # RBD NBD VOLUME HEALER - [RBD NBD VOLUME HEALER](#rbd-nbd-volume-healer) - - [Rbd Nbd](#rbd-nbd) - - [Advantages of userspace mounters](#advantages-of-userspace-mounters) - - [Side effects of userspace mounters](#side-effects-of-userspace-mounters) - - [Volume Healer](#volume-healer) - - [More thoughts](#more-thoughts) +- [Rbd Nbd](#rbd-nbd) +- [Advantages of userspace mounters](#advantages-of-userspace-mounters) +- [Side effects of userspace mounters](#side-effects-of-userspace-mounters) +- [Volume Healer](#volume-healer) +- [More thoughts](#more-thoughts) ## Rbd nbd -The rbd CSI plugin will provision new rbd images and attach and mount those -to workloads. Currently, the default mounter is krbd, which uses the kernel -rbd driver to mount the rbd images onto the application pod. Here on -at Ceph-CSI we will also have a userspace way of mounting the rbd images, -via rbd-nbd. +The rbd CSI plugin will provision new rbd images and attach and mount those to +workloads. Currently, the default mounter is krbd, which uses the kernel rbd +driver to mount the rbd images onto the application pod. Here on at Ceph-CSI we +will also have a userspace way of mounting the rbd images, via rbd-nbd. -[Rbd-nbd](https://docs.ceph.com/en/latest/man/8/rbd-nbd/) is a client for -RADOS block device (rbd) images like the existing rbd kernel module. It -will map an rbd image to an nbd (Network Block Device) device, allowing -access to it as a regular local block device. +[Rbd-nbd](https://docs.ceph.com/en/latest/man/8/rbd-nbd/) is a client for RADOS +block device (rbd) images like the existing rbd kernel module. It will map an +rbd image to an nbd (Network Block Device) device, allowing access to it as a +regular local block device. 
![csi-rbd-nbd](./images/csi-rbd-nbd.svg) -It’s worth making a note that the rbd-nbd processes will run on the -client-side, which is inside the `csi-rbdplugin` node plugin. +It’s worth making a note that the rbd-nbd processes will run on the client-side, +which is inside the `csi-rbdplugin` node plugin. ### Advantages of userspace mounters -- It is easier to add features to rbd-nbd as it is released regularly with - Ceph, and more difficult and time consuming to add features to the kernel - rbd module as that is part of the Linux kernel release schedule. -- Container upgrades will be independent of the host node, which means if - there are any new features with rbd-nbd, we don’t have to reboot the node - as the changes will be shipped inside the container. -- Because the container upgrades are host node independent, we will be a - better citizen in K8s by switching to the userspace model. +- It is easier to add features to rbd-nbd as it is released regularly with Ceph, + and more difficult and time consuming to add features to the kernel rbd module + as that is part of the Linux kernel release schedule. +- Container upgrades will be independent of the host node, which means if there + are any new features with rbd-nbd, we don’t have to reboot the node as the + changes will be shipped inside the container. +- Because the container upgrades are host node independent, we will be a better + citizen in K8s by switching to the userspace model. - Unlike krbd, rbd-nbd uses librbd user-space library that gets most of the development focus, and hence rbd-nbd will be feature-rich. - Being entirely kernel space impacts fault-tolerance as any kernel panic - affects a whole node not only a single pod that is using rbd storage. 
- Thanks to the rbd-nbd’s userspace design, we are less bothered here, the - krbd is a complete kernel and vendor-specific driver which needs changes - on every feature basis, on the other hand, rbd-nbd depends on NBD generic - driver, while all the vendor-specific logic sits in the userspace. It's - worth taking note that NBD generic driver is mostly unchanged much from - years and consider it to be much stable. Also given NBD is a generic - driver there will be many eyes on it compared to the rbd driver. + affects a whole node not only a single pod that is using rbd storage. Thanks + to the rbd-nbd’s userspace design, we are less bothered here, the krbd is a + complete kernel and vendor-specific driver which needs changes on every + feature basis, on the other hand, rbd-nbd depends on NBD generic driver, while + all the vendor-specific logic sits in the userspace. It's worth taking note + that NBD generic driver is mostly unchanged much from years and consider it to + be much stable. Also given NBD is a generic driver there will be many eyes on + it compared to the rbd driver. ### Side effects of userspace mounters -Since the rbd-nbd processes run per volume map on the client side i.e. -inside the `csi-rbdplugin` node plugin, a restart of the node plugin will -terminate all the rbd-nbd processes, and there is no way to restore -these processes back to life currently, which could lead to IO errors -on all the application pods. +Since the rbd-nbd processes run per volume map on the client side i.e. inside +the `csi-rbdplugin` node plugin, a restart of the node plugin will terminate all +the rbd-nbd processes, and there is no way to restore these processes back to +life currently, which could lead to IO errors on all the application pods. ![csi-plugin-restart](./images/csi-plugin-restart.svg) @@ -61,42 +59,42 @@ This is where the Volume healer could help. 
## Volume healer -Volume healer runs on the start of rbd node plugin and runs within the -node plugin driver context. +Volume healer runs on the start of rbd node plugin and runs within the node +plugin driver context. Volume healer does the below, - Get the Volume attachment list for the current node where it is running -- Filter the volume attachments list through matching driver name and - status attached -- For each volume attachment get the respective PV information and check - the criteria of PV Bound, mounter type -- Build the StagingPath where rbd images PVC is mounted, based on the - KUBELET path and PV object +- Filter the volume attachments list through matching driver name and status + attached +- For each volume attachment get the respective PV information and check the + criteria of PV Bound, mounter type +- Build the StagingPath where rbd images PVC is mounted, based on the KUBELET + path and PV object - Construct the NodeStageVolume() request and send Request to CSI Driver. -- The NodeStageVolume() has a way to identify calls received from the - healer and when executed from the healer context, it just runs in the - minimal required form, where it fetches the previously mapped device to - the image, and the respective secrets and finally ensures to bringup the - respective process back to life. Thus enabling IO to continue. +- The NodeStageVolume() has a way to identify calls received from the healer and + when executed from the healer context, it just runs in the minimal required + form, where it fetches the previously mapped device to the image, and the + respective secrets and finally ensures to bringup the respective process back + to life. Thus enabling IO to continue. ### More thoughts - Currently the NodeStageVolume() call is safeguarded by the global Ceph-CSI level lock (per volID) that needs to be acquired before doing any of the - NodeStage, NodeUnstage, NodePublish, NodeUnPulish operations. 
Hence none - of the operations happen in parallel. + NodeStage, NodeUnstage, NodePublish, NodeUnPublish operations. Hence none of + the operations happen in parallel. - Any issues if the NodeUnstage is issued by kubelet? - This can not be a problem as we take a lock at the Ceph-CSI level - - If the NodeUnstage success, Ceph-CSI will return StagingPath not found - error, we can then skip - - If the NodeUnstage fails with an operation already going on, in the - next NodeUnstage the volume gets unmounted + - If the NodeUnstage success, Ceph-CSI will return StagingPath not found + error, we can then skip + - If the NodeUnstage fails with an operation already going on, in the next + NodeUnstage the volume gets unmounted - What if the PVC is deleted? - - If the PVC is deleted, the volume attachment list might already got + - If the PVC is deleted, the volume attachment list might already get refreshed and entry will be skipped/deleted at the healer. - - For any reason, If the request bails out with Error NotFound, skip the - PVC, assuming it might have deleted or the NodeUnstage might have - already happened. -- The Volume healer currently works with rbd-nbd, but the design can - accommodate other userspace mounters (may be ceph-fuse). + - For any reason, If the request bails out with Error NotFound, skip the + PVC, assuming it might have deleted or the NodeUnstage might have already + happened. + - The Volume healer currently works with rbd-nbd, but the design can + accommodate other userspace mounters (may be ceph-fuse). \ No newline at end of file
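
Note on the rbd-volume-healer.md hunk above: the healer's selection steps (filter the node's volume attachments by driver name and attached status, then check that the PV is bound and uses the expected mounter) can be sketched roughly as below. The `attachment` and `pv` structs and the `healCandidates` function are simplified, hypothetical types for illustration, not the Kubernetes API objects or the Ceph-CSI implementation, and the driver and mounter strings in `main` are example values only.

```go
package main

import "fmt"

// attachment is a hypothetical, simplified view of a volume attachment.
type attachment struct {
	Driver   string
	Node     string
	Attached bool
	PVName   string
}

// pv is a hypothetical, simplified view of a PersistentVolume.
type pv struct {
	Bound   bool
	Mounter string
}

// healCandidates returns the PV names that would need a NodeStageVolume()
// call from the healer: attachments on this node, for this driver, in the
// attached state, whose PV is bound and uses the given mounter.
func healCandidates(atts []attachment, pvs map[string]pv, driver, node, mounter string) []string {
	var out []string
	for _, a := range atts {
		if a.Driver != driver || a.Node != node || !a.Attached {
			continue
		}
		p, ok := pvs[a.PVName]
		if !ok || !p.Bound || p.Mounter != mounter {
			continue // PV missing, unbound, or not using the mounter
		}
		out = append(out, a.PVName)
	}
	return out
}

func main() {
	atts := []attachment{
		{Driver: "rbd.csi.ceph.com", Node: "node-1", Attached: true, PVName: "pv-a"},
		{Driver: "rbd.csi.ceph.com", Node: "node-2", Attached: true, PVName: "pv-b"},
		{Driver: "rbd.csi.ceph.com", Node: "node-1", Attached: true, PVName: "pv-c"},
	}
	pvs := map[string]pv{
		"pv-a": {Bound: true, Mounter: "rbd-nbd"},
		"pv-b": {Bound: true, Mounter: "rbd-nbd"},
		"pv-c": {Bound: true, Mounter: "krbd"},
	}
	fmt.Println(healCandidates(atts, pvs, "rbd.csi.ceph.com", "node-1", "rbd-nbd"))
}
```

Building the staging path and sending the actual NodeStageVolume() request are left out, since the hunk does not give enough detail to sketch them without guessing.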