mirror of
https://github.com/ceph/ceph-csi.git
synced 2024-12-18 11:00:25 +00:00
doc: few corrections or typo fixing in design documentation
- Fixes spelling mistakes.
- Grammatical error correction.
- Wrapping the text at 80 line count, etc.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
parent
12e8e46bcf
commit
3196b798cc
@ -6,26 +6,26 @@ snapshot contents and then mount that volume to workloads.
|
||||
|
||||
CephFS exposes snapshots as special, read-only directories of a subvolume
|
||||
located in `<subvolume>/.snap`. cephfs-csi can already provision writable
|
||||
volumes with snapshots as their data source, where snapshot contents are
|
||||
cloned to the newly created volume. However, cloning a snapshot to volume
|
||||
is a very expensive operation in CephFS as the data needs to be fully copied.
|
||||
When the need is to only read snapshot contents, snapshot cloning is extremely
|
||||
volumes with snapshots as their data source, where snapshot contents are cloned
|
||||
to the newly created volume. However, cloning a snapshot to volume is a very
|
||||
expensive operation in CephFS as the data needs to be fully copied. When the
|
||||
need is to only read snapshot contents, snapshot cloning is extremely
|
||||
inefficient and wasteful.
|
||||
|
||||
This proposal describes a way for cephfs-csi to expose CephFS snapshots
|
||||
as shallow, read-only volumes, without needing to clone the underlying
|
||||
snapshot data.
|
||||
This proposal describes a way for cephfs-csi to expose CephFS snapshots as
|
||||
shallow, read-only volumes, without needing to clone the underlying snapshot
|
||||
data.
|
||||
|
||||
## Use-cases
|
||||
|
||||
What's the point of such read-only volumes?
|
||||
|
||||
* **Restore snapshots selectively:** users may want to traverse snapshots,
|
||||
restoring data to a writable volume more selectively instead of restoring
|
||||
the whole snapshot.
|
||||
* **Volume backup:** users can't backup a live volume, they first need
|
||||
to snapshot it. Once a snapshot is taken, it still can't be backed-up,
|
||||
as backup tools usually work with volumes (that are exposed as file-systems)
|
||||
restoring data to a writable volume more selectively instead of restoring the
|
||||
whole snapshot.
|
||||
* **Volume backup:** users can't back up a live volume; they first need to
|
||||
snapshot it. Once a snapshot is taken, it still can't be backed up, as backup
|
||||
tools usually work with volumes (that are exposed as file-systems)
|
||||
and not snapshots (which might have backend-specific format). What this means
|
||||
is that in order to create a snapshot backup, users have to clone snapshot
|
||||
data twice:
|
||||
@ -37,19 +37,18 @@ What's the point of such read-only volumes?
|
||||
|
||||
The temporary backed-up volume will most likely be thrown away after the
|
||||
backup transfer is finished. That's a lot of wasted work for what we
|
||||
originally wanted to do! Having the ability to create volumes from
|
||||
snapshots cheaply would be a big improvement for this use case.
|
||||
originally wanted to do! Having the ability to create volumes from snapshots
|
||||
cheaply would be a big improvement for this use case.
|
||||
|
||||
## Alternatives
|
||||
|
||||
* _Snapshots are stored in `<subvolume>/.snap`. Users could simply visit this
|
||||
directory by themselves._
|
||||
|
||||
`.snap` is CephFS-specific detail of how snapshots are exposed.
|
||||
Users / tools may not be aware of this special directory, or it may not fit
|
||||
their workflow. At the moment, the idiomatic way of accessing snapshot
|
||||
contents in CSI drivers is by creating a new volume and populating it
|
||||
with snapshot.
|
||||
`.snap` is a CephFS-specific detail of how snapshots are exposed. Users / tools
|
||||
may not be aware of this special directory, or it may not fit their workflow.
|
||||
At the moment, the idiomatic way of accessing snapshot contents in CSI drivers
|
||||
is by creating a new volume and populating it with the snapshot's contents.
|
||||
|
||||
## Design
|
||||
|
||||
@ -57,21 +56,21 @@ Key points:
|
||||
|
||||
* Volume source is a snapshot, volume access mode is `*_READER_ONLY`.
|
||||
* No actual new subvolumes are created in CephFS.
|
||||
* The resulting volume is a reference to the source subvolume snapshot.
|
||||
This reference would be stored in `Volume.volume_context` map. In order
|
||||
to reference a snapshot, we need subvol name and snapshot name.
|
||||
* Mounting such volume means mounting the respective CephFS subvolume
|
||||
and exposing the snapshot to workloads.
|
||||
* Let's call a *shallow read-only volume with a subvolume snapshot
|
||||
as its data source* just a *shallow volume* from here on out for brevity.
|
||||
* The resulting volume is a reference to the source subvolume snapshot. This
|
||||
reference would be stored in the `Volume.volume_context` map. In order to
|
||||
reference a snapshot, we need the subvolume name and the snapshot name.
|
||||
* Mounting such a volume means mounting the respective CephFS subvolume and
|
||||
exposing the snapshot to workloads.
|
||||
* Let's call a *shallow read-only volume with a subvolume snapshot as its data
|
||||
source* just a *shallow volume* from here on out for brevity.
|
||||
|
||||
### Controller operations
|
||||
|
||||
Care must be taken when handling life-times of relevant storage resources.
|
||||
When a shallow volume is created, what would happen if:
|
||||
Care must be taken when handling life-times of relevant storage resources. When
|
||||
a shallow volume is created, what would happen if:
|
||||
|
||||
* _Parent subvolume of the snapshot is removed while the shallow volume
|
||||
still exists?_
|
||||
* _Parent subvolume of the snapshot is removed while the shallow volume still
|
||||
exists?_
|
||||
|
||||
This shouldn't be a problem. The parent volume either has the
|
||||
`snapshot-retention` subvol feature in which case its snapshots remain
|
||||
@ -80,8 +79,8 @@ When a shallow volume is created, what would happen if:
|
||||
* _Source snapshot from which the shallow volume originates is removed while
|
||||
that shallow volume still exists?_
|
||||
|
||||
We need to make sure this doesn't happen and some book-keeping
|
||||
is necessary. Ideally we could employ some kind of reference counting.
|
||||
We need to make sure this doesn't happen and some book-keeping is necessary.
|
||||
Ideally we could employ some kind of reference counting.
|
||||
|
||||
#### Reference counting for shallow volumes
|
||||
|
||||
@ -92,26 +91,26 @@ When creating a volume snapshot, a reference tracker (RT), represented by a
|
||||
RADOS object, would be created for that snapshot. It would store information
|
||||
required to track the references for the backing subvolume snapshot. Upon a
|
||||
`CreateSnapshot` call, the reference tracker (RT) would be initialized with a
|
||||
single reference record, where the CSI snapshot itself is the first reference
|
||||
to the backing snapshot. Each subsequent shallow volume creation would add a
|
||||
new reference record to the RT object. Each shallow volume deletion would
|
||||
remove that reference from the RT object. Calling `DeleteSnapshot` would remove
|
||||
the reference record that was previously added in `CreateSnapshot`.
|
||||
single reference record, where the CSI snapshot itself is the first reference to
|
||||
the backing snapshot. Each subsequent shallow volume creation would add a new
|
||||
reference record to the RT object. Each shallow volume deletion would remove
|
||||
that reference from the RT object. Calling `DeleteSnapshot` would remove the
|
||||
reference record that was previously added in `CreateSnapshot`.
|
||||
|
||||
The subvolume snapshot would be removed from the Ceph cluster only once the RT
|
||||
object holds no references. Note that this behavior would permit calling
|
||||
`DeleteSnapshot` even if it is still referenced by shallow volumes.
|
||||
|
||||
* `DeleteSnapshot`:
|
||||
* RT holds no references or the RT object doesn't exist:
|
||||
* RT holds no references or the RT object doesn't exist:
|
||||
delete the backing snapshot too.
|
||||
* RT holds at least one reference: keep the backing snapshot.
|
||||
* RT holds at least one reference: keep the backing snapshot.
|
||||
* `DeleteVolume`:
|
||||
* RT holds no references: delete the backing snapshot too.
|
||||
* RT holds at least one reference: keep the backing snapshot.
|
||||
* RT holds no references: delete the backing snapshot too.
|
||||
* RT holds at least one reference: keep the backing snapshot.
|
||||
|
||||
To enable creating shallow volumes from snapshots that were provisioned by
|
||||
older versions of cephfs-csi (i.e. before this feature is introduced),
|
||||
To enable creating shallow volumes from snapshots that were provisioned by older
|
||||
versions of cephfs-csi (i.e. before this feature is introduced),
|
||||
`CreateVolume` for shallow volumes would also create an RT object in case it's
|
||||
missing. It would be initialized with two references: the source snapshot and the newly
|
||||
created shallow volume.
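
The reference-counting life cycle described above can be sketched in Go as
follows. This is a minimal in-memory illustration under stated assumptions, not
cephfs-csi code: the `refTracker` type, its methods, and the reference key
strings are hypothetical, and a real tracker would persist its records in a
RADOS object rather than in a map.

```go
package main

import (
	"errors"
	"fmt"
)

// refTracker is a hypothetical in-memory stand-in for the RADOS-backed
// reference tracker (RT) object. Each key is one reference record.
type refTracker struct {
	refs map[string]struct{}
}

// newRefTracker creates an RT holding the given initial references.
// CreateSnapshot would pass one reference (the CSI snapshot itself);
// CreateVolume for a pre-existing snapshot would pass two.
func newRefTracker(initial ...string) *refTracker {
	rt := &refTracker{refs: make(map[string]struct{})}
	for _, ref := range initial {
		rt.refs[ref] = struct{}{}
	}
	return rt
}

// add records a new reference, e.g. a newly created shallow volume.
func (rt *refTracker) add(ref string) {
	rt.refs[ref] = struct{}{}
}

// remove drops a reference and reports whether the backing snapshot may
// now be deleted, i.e. whether the RT holds no references any more.
func (rt *refTracker) remove(ref string) (deleteBacking bool, err error) {
	if _, ok := rt.refs[ref]; !ok {
		return false, errors.New("unknown reference: " + ref)
	}
	delete(rt.refs, ref)
	return len(rt.refs) == 0, nil
}

func main() {
	rt := newRefTracker("snap:csi-snap-1") // CreateSnapshot
	rt.add("vol:shallow-a")                // CreateVolume (shallow)

	gone, _ := rt.remove("snap:csi-snap-1") // DeleteSnapshot
	fmt.Println("delete backing snapshot after DeleteSnapshot:", gone)

	gone, _ = rt.remove("vol:shallow-a") // DeleteVolume
	fmt.Println("delete backing snapshot after DeleteVolume:", gone)
}
```

In this sketch `DeleteSnapshot` succeeds while a shallow volume still holds a
reference, and the backing snapshot is only reported as removable once the last
reference is gone.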
|
||||
@ -141,17 +140,17 @@ Things to look out for:
|
||||
|
||||
It doesn't consume any space on the filesystem. `Volume.capacity_bytes` is
|
||||
allowed to contain zero. We could use that.
|
||||
* _What should be the requested size when creating the volume (specified e.g.
|
||||
in PVC)?_
|
||||
* _What should be the requested size when creating the volume (specified e.g. in
|
||||
PVC)?_
|
||||
|
||||
This one is tricky. The CSI spec allows for
|
||||
`CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be
|
||||
zero. On the other hand,
|
||||
`PersistentVolumeClaim.spec.resources.requests.storage` must be bigger
|
||||
than zero. cephfs-csi doesn't care about the requested size (the volume
|
||||
will be read-only, so it has no usable capacity) and would always set it
|
||||
to zero. This shouldn't cause any problems for the time being, but still
|
||||
is something we should keep in mind.
|
||||
`CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be zero.
|
||||
On the other hand,
|
||||
`PersistentVolumeClaim.spec.resources.requests.storage` must be bigger than
|
||||
zero. cephfs-csi doesn't care about the requested size (the volume will be
|
||||
read-only, so it has no usable capacity) and would always set it to zero. This
|
||||
shouldn't cause any problems for the time being, but still is something we
|
||||
should keep in mind.
|
||||
|
||||
`CreateVolume` and behavior when using volume as volume source (PVC-PVC clone):
|
||||
|
||||
@ -167,8 +166,8 @@ Volume deletion is trivial.
|
||||
|
||||
### `CreateSnapshot`
|
||||
|
||||
Snapshotting read-only volumes doesn't make sense in general, and should
|
||||
be rejected.
|
||||
Snapshotting read-only volumes doesn't make sense in general, and should be
|
||||
rejected.
|
||||
|
||||
### `ControllerExpandVolume`
|
||||
|
||||
@ -194,8 +193,8 @@ whole subvolume first, and only then perform the binds to target paths.
|
||||
#### For case (a)
|
||||
|
||||
Subvolume paths are normally retrieved by
|
||||
`ceph fs subvolume info/getpath <VOLUME NAME> <SUBVOLUME NAME> <SUBVOLUMEGROUP NAME>`,
|
||||
which outputs a path like so:
|
||||
`ceph fs subvolume info/getpath <VOLUME NAME> <SUBVOLUME NAME> <SUBVOLUMEGROUP NAME>`
|
||||
, which outputs a path like so:
|
||||
|
||||
```
|
||||
/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>
|
||||
@ -217,12 +216,12 @@ itself still exists or not.
|
||||
|
||||
#### For case (b)
|
||||
|
||||
For cases where subvolumes are managed externally and not by cephfs-csi, we
|
||||
must assume that the cephx user we're given can access only
|
||||
For cases where subvolumes are managed externally and not by cephfs-csi, we must
|
||||
assume that the cephx user we're given can access only
|
||||
`/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>` so users won't be able to
|
||||
benefit from snapshot retention. Users will need to be careful not to delete
|
||||
the parent subvolumes and snapshots while they are associated by these shallow
|
||||
RO volumes.
|
||||
benefit from snapshot retention. Users will need to be careful not to delete the
|
||||
parent subvolumes and snapshots while they are referenced by these shallow RO
|
||||
volumes.
|
||||
|
||||
### `NodePublishVolume`, `NodeUnpublishVolume`
|
||||
|
||||
@ -235,38 +234,38 @@ mount.
|
||||
|
||||
## Volume parameters, volume context
|
||||
|
||||
This section provides a discussion around determinig what volume parameters and
|
||||
This section provides a discussion around determining what volume parameters and
|
||||
volume context parameters will be used to convey necessary information to the
|
||||
cephfs-csi driver in order to support shallow volumes.
|
||||
|
||||
Volume parameters `CreateVolumeRequest.parameters`:
|
||||
|
||||
* Should "shallow" be the default mode for all `CreateVolume` calls that have
|
||||
(a) snapshot as data source and (b) read-only volume access mode? If not,
|
||||
a new volume parameter should be introduced: e.g `isShallow: <bool>`. On the
|
||||
(a) snapshot as data source and (b) read-only volume access mode? If not, a
|
||||
new volume parameter should be introduced: e.g. `isShallow: <bool>`. On the
|
||||
other hand, does it even make sense for users to want to create full copies
|
||||
of snapshots and still have them read-only?
|
||||
|
||||
Volume context `Volume.volume_context`:
|
||||
|
||||
* Here we definitely need `isShallow` or similar. Without it we wouldn't be
|
||||
able to distinguish between a regular volume that just happens to have
|
||||
a read-only access mode, and a volume that references a snapshot.
|
||||
* Here we definitely need `isShallow` or similar. Without it we wouldn't be able
|
||||
to distinguish between a regular volume that just happens to have a read-only
|
||||
access mode, and a volume that references a snapshot.
|
||||
* Currently cephfs-csi recognizes `subvolumePath` for dynamically provisioned
|
||||
volumes and `rootPath` for pre-provisioned volumes. As mentioned in
|
||||
[`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume),
|
||||
snapshots cannot be mounted directly. How do we pass in path to the parent
|
||||
[`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume)
|
||||
, snapshots cannot be mounted directly. How do we pass in path to the parent
|
||||
subvolume?
|
||||
* a) Path to the snapshot is passed in via `subvolumePath` / `rootPath`,
|
||||
* a) Path to the snapshot is passed in via `subvolumePath` / `rootPath`,
|
||||
e.g.
|
||||
`/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>/.snap/<SNAPSHOT NAME>`.
|
||||
From that we can derive path to the subvolume: it's the parent of `.snap`
|
||||
directory.
|
||||
* b) Similar to a), path to the snapshot is passed in via `subvolumePath` /
|
||||
* b) Similar to a), path to the snapshot is passed in via `subvolumePath` /
|
||||
`rootPath`, but instead of trying to derive the right path we introduce
|
||||
another volume context parameter containing path to the parent subvolume
|
||||
explicitly.
|
||||
* c) `subvolumePath` / `rootPath` contains path to the parent subvolume and
|
||||
* c) `subvolumePath` / `rootPath` contains path to the parent subvolume and
|
||||
we introduce another volume context parameter containing name of the
|
||||
snapshot. Path to the snapshot is then formed by appending
|
||||
`/.snap/<SNAPSHOT NAME>` to the subvolume path.
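
Options a) and c) differ only in the direction of the path computation. The
following Go sketch shows both directions; the function names and the example
path components are hypothetical and only illustrate the string handling, not
how cephfs-csi would store these values in the volume context.

```go
package main

import (
	"fmt"
	"path"
)

// snapshotPath forms the snapshot path from the parent subvolume path and
// the snapshot name, as in option c).
func snapshotPath(subvolumePath, snapshotName string) string {
	return path.Join(subvolumePath, ".snap", snapshotName)
}

// parentSubvolumePath derives the parent subvolume path from a snapshot
// path, as in option a): it is the parent of the `.snap` directory.
func parentSubvolumePath(snapPath string) (string, error) {
	snapDir := path.Dir(path.Clean(snapPath))
	if path.Base(snapDir) != ".snap" {
		return "", fmt.Errorf("%q is not a snapshot path", snapPath)
	}
	return path.Dir(snapDir), nil
}

func main() {
	// Placeholder names standing in for <VOLUME NAME>, <SUBVOLUME NAME>
	// and <UUID>.
	subvol := "/volumes/example-group/example-subvol/example-uuid"

	snap := snapshotPath(subvol, "example-snap")
	fmt.Println(snap)

	parent, err := parentSubvolumePath(snap)
	if err != nil {
		panic(err)
	}
	fmt.Println(parent)
}
```

Option b) avoids the derivation step entirely, at the cost of carrying one more
volume context parameter.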
|
||||
|
@ -1,7 +1,7 @@
|
||||
# Design to handle clusterID and poolID for DR
|
||||
|
||||
During disaster recovery/migration of a cluster, as part of the failover, the
|
||||
kubernetes artifacts like deployment, PVC, PV, etc will be restored to a new
|
||||
kubernetes artifacts like deployment, PVC, PV, etc. will be restored to a new
|
||||
cluster by the admin. Even if the kubernetes objects are restored the
|
||||
corresponding RBD/CephFS subvolume cannot be retrieved during CSI operations as
|
||||
the clusterID and poolID are not the same in both clusters. Let's see the
|
||||
@ -10,8 +10,8 @@ problem in more detail below.
|
||||
`0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002`
|
||||
|
||||
The above is the sample volumeID sent back in response to the CreateVolume
|
||||
operation and added as a volumeHandle in the PV spec. CO (Kubernetes) uses
|
||||
above as the identifier for other operations on the volume/PVC.
|
||||
operation and added as a volumeHandle in the PV spec. CO (Kubernetes) uses the above
|
||||
as the identifier for other operations on the volume/PVC.
|
||||
|
||||
The VolumeID is encoded as,
|
||||
|
||||
@ -33,7 +33,7 @@ the other cluster.
|
||||
|
||||
During the Disaster Recovery (failover operation) the PVC and PV will be
|
||||
recreated on the other cluster. When Ceph-CSI receives the request for
|
||||
operations like (NodeStage, ExpandVolume, DeleteVolume, etc) the volumeID is
|
||||
operations like (NodeStage, ExpandVolume, DeleteVolume, etc.) the volumeID is
|
||||
sent in the request which will help to identify the volume.
|
||||
|
||||
```yaml=
|
||||
@ -68,15 +68,15 @@ metadata:
|
||||
```
|
||||
|
||||
During CSI/Replication operations, Ceph-CSI decodes the volumeID, gets
|
||||
the monitor configuration from the configmap and by the poolID will get the
|
||||
pool Name and retrieves the OMAP data stored in the rados OMAP and finally
|
||||
check the volume is present in the pool.
|
||||
the monitor configuration from the configmap, uses the poolID to get the pool
|
||||
name, retrieves the OMAP data stored in RADOS, and finally checks that the
|
||||
volume is present in the pool.
|
||||
|
||||
## Problems with volumeID Replication
|
||||
|
||||
* The clusterID can be different
|
||||
* as the clusterID is the namespace where rook is deployed, the Rook might be
|
||||
deployed in the different namespace on a secondary cluster
|
||||
* as the clusterID is the namespace where Rook is deployed, Rook might
|
||||
be deployed in a different namespace on a secondary cluster
|
||||
* In standalone Ceph-CSI the clusterID is fsID and fsID is unique per
|
||||
cluster
|
||||
|
||||
@ -124,8 +124,8 @@ metadata:
|
||||
name: ceph-csi-config
|
||||
```
|
||||
|
||||
**Note:-** the configmap will be mounted as a volume to the CSI (provisioner
|
||||
and node plugin) pods.
|
||||
**Note:-** the configmap will be mounted as a volume to the CSI (provisioner and
|
||||
node plugin) pods.
|
||||
|
||||
The above configmap will get created as it is or updated (if new Pools are
|
||||
created on the existing cluster) with new entries when the admin chooses to
|
||||
@ -149,18 +149,18 @@ Replicapool with ID `1` on site1 and Replicapool with ID `2` on site2.
|
||||
After getting the required mapping Ceph-CSI has the required information to get
|
||||
more details from the rados OMAP. If we have multiple clusterID mappings, it will
|
||||
loop through all the mappings and check the corresponding pool to get the OMAP
|
||||
data. If the clusterID mapping does not exist Ceph-CSI will return a `Not
|
||||
Found` error message to the caller.
|
||||
data. If the clusterID mapping does not exist Ceph-CSI will return a `Not Found`
|
||||
error message to the caller.
|
||||
|
||||
After failover to the cluster `site2-storage`, the admin might have created new
|
||||
PVCs on the primary cluster `site2-storage`. Later after recovering the
|
||||
cluster `site1-storage`, the admin might choose to failback from
|
||||
`site2-storage` to `site1-storage`. Now the admin needs to copy all the newly
|
||||
created kubernetes artifacts to the failback cluster. For clusterID mapping, the
|
||||
admin needs to copy the above-created configmap `ceph-clusterid-mapping` to
|
||||
the failback cluster. When Ceph-CSI receives a CSI/Replication request for
|
||||
the volumes created on the `site2-storage` it will decode the volumeID and
|
||||
retrieves the clusterID ie `site2-storage`. In the above configmap
|
||||
admin needs to copy the above-created configmap `ceph-clusterid-mapping` to the
|
||||
failback cluster. When Ceph-CSI receives a CSI/Replication request for the
|
||||
volumes created on the `site2-storage` it will decode the volumeID and retrieve
|
||||
the clusterID, i.e. `site2-storage`. In the above configmap
|
||||
`ceph-clusterid-mapping` the `site2-storage` is the value and `site1-storage`
|
||||
is the key in the `clusterIDMapping` entry.
|
||||
|
||||
@ -169,8 +169,8 @@ is found in `key` it will consider `value` as the corresponding mapping, if it
|
||||
is found in `value`, it will treat `key` as the corresponding mapping and
|
||||
retrieve all the poolID details of the cluster.
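
The two-way lookup described above can be sketched in Go as follows. This is an
illustration only: the function name and the `[]map[string]string` shape are
assumptions made for the example and do not reflect how Ceph-CSI parses the
`clusterIDMapping` entries.

```go
package main

import "fmt"

// lookupPeerClusterID returns the cluster ID paired with id in any of the
// given mappings. A match on a key yields its value; a match on a value
// yields its key. The name and signature are hypothetical.
func lookupPeerClusterID(mappings []map[string]string, id string) (string, bool) {
	for _, m := range mappings {
		// id is found in `key`: the value is the corresponding mapping.
		if peer, ok := m[id]; ok {
			return peer, true
		}
		// id is found in `value`: the key is the corresponding mapping.
		for key, value := range m {
			if value == id {
				return key, true
			}
		}
	}
	// No mapping exists; the caller would report a Not Found error.
	return "", false
}

func main() {
	mappings := []map[string]string{
		{"site1-storage": "site2-storage"},
	}
	for _, id := range []string{"site1-storage", "site2-storage", "site3-storage"} {
		peer, found := lookupPeerClusterID(mappings, id)
		fmt.Printf("%s -> %q (found=%t)\n", id, peer, found)
	}
}
```

Because the lookup works in both directions, the same configmap can be copied
unchanged to the failback cluster.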
|
||||
|
||||
This mapping on the remote cluster is only required when we are doing a
|
||||
failover operation from the primary cluster to a remote cluster. The existing
|
||||
volumes that are created on the remote cluster does not require
|
||||
any mapping as the volumeHandle already contains the required information about
|
||||
the local cluster (clusterID, poolID etc).
|
||||
This mapping on the remote cluster is only required when we are doing a failover
|
||||
operation from the primary cluster to a remote cluster. The existing volumes
|
||||
that are created on the remote cluster do not require any mapping as the
|
||||
volumeHandle already contains the required information about the local cluster (
|
||||
clusterID, poolID etc).
|
||||
|
@ -16,7 +16,7 @@ Some but not all the benefits of this approach:
|
||||
|
||||
* volume encryption: encryption of a volume attached by rbd
|
||||
* encryption at rest: encryption of physical disk done by ceph
|
||||
* LUKS: Linux Unified Key Setup: stores all of the needed setup information for
|
||||
* LUKS: Linux Unified Key Setup: stores all the needed setup information for
|
||||
dm-crypt on the disk
|
||||
* dm-crypt: linux kernel device-mapper crypto target
|
||||
* cryptsetup: the command line tool to interface with dm-crypt
|
||||
@ -28,8 +28,8 @@ requirement by using dm-crypt module through cryptsetup cli interface.
|
||||
|
||||
### Implementation Summary
|
||||
|
||||
* Encryption is implemented using cryptsetup with LUKS extension.
|
||||
A good introduction to LUKS and dm-crypt in general can be found
|
||||
* Encryption is implemented using cryptsetup with LUKS extension. A good
|
||||
introduction to LUKS and dm-crypt in general can be found
|
||||
[here](https://wiki.archlinux.org/index.php/Dm-crypt/Device_encryption#Encrypting_devices_with_cryptsetup)
|
||||
Functions to implement necessary interaction are implemented in a separate
|
||||
`cryptsetup.go` file.
|
||||
@ -45,8 +45,8 @@ requirement by using dm-crypt module through cryptsetup cli interface.
|
||||
volume attach request
|
||||
* `NodeStageVolume`: refactored to open encrypted device (`openEncryptedDevice`)
|
||||
* `openEncryptedDevice`: looks up a passphrase matching the volume id,
|
||||
returns the new device path in the form: `/dev/mapper/luks-<volume_id>`.
|
||||
On the woker node where the attach is scheduled:
|
||||
returns the new device path in the form: `/dev/mapper/luks-<volume_id>`. On
|
||||
the worker node where the attach is scheduled:
|
||||
|
||||
```shell
|
||||
$ lsblk
|
||||
@ -63,7 +63,7 @@ requirement by using dm-crypt module through cryptsetup cli interface.
|
||||
|
||||
* StorageClass extended with the following parameters:
|
||||
1. `encrypted` ("true" or "false")
|
||||
1. `encryptionKMSID` (string representing kms configuration of choice)
|
||||
2. `encryptionKMSID` (string representing kms configuration of choice)
|
||||
ceph-csi plugin may support different kms vendors with different types of
|
||||
authentication
|
||||
|
||||
@ -133,14 +133,19 @@ metadata:
|
||||
The main components that are used to support encrypted volumes:
|
||||
|
||||
1. the `EncryptionKMS` interface
|
||||
* an instance is configured per volume object (`rbdVolume.KMS`)
|
||||
* used to authenticate with a master key or token
|
||||
* can store the KEK (Key-Encryption-Key) for encrypting and decrypting the
|
||||
|
||||
* an instance is configured per volume object (`rbdVolume.KMS`)
|
||||
* used to authenticate with a master key or token
|
||||
* can store the KEK (Key-Encryption-Key) for encrypting and decrypting the
|
||||
DEKs (Data-Encryption-Key)
|
||||
|
||||
1. the `DEKStore` interface
|
||||
* saves and fetches the DEK (Data-Encryption-Key)
|
||||
* can be provided by a KMS, or by other components (like `rbdVolume`)
|
||||
|
||||
* saves and fetches the DEK (Data-Encryption-Key)
|
||||
* can be provided by a KMS, or by other components (like `rbdVolume`)
|
||||
|
||||
1. the `VolumeEncryption` type
|
||||
* combines `EncryptionKMS` and `DEKStore` into a single place
|
||||
* easy to configure from other components or subsystems
|
||||
* provides a simple API for all KMS operations
|
||||
|
||||
* combines `EncryptionKMS` and `DEKStore` into a single place
|
||||
* easy to configure from other components or subsystems
|
||||
* provides a simple API for all KMS operations
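
The division of labour between the three components can be sketched in Go as
follows. All names and method signatures here (`kekWrapper`, `dekStore`,
`volumeEncryption`, `SaveDEK`, and so on) are hypothetical simplifications
chosen for the example; they are not the actual Ceph-CSI interfaces. The sketch
also assumes AES-GCM from the Go standard library for wrapping the DEK with the
KEK, which is an illustrative choice rather than something this document
specifies.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// kekWrapper plays the EncryptionKMS role: it holds the KEK and uses it to
// encrypt and decrypt DEKs. Hypothetical interface.
type kekWrapper interface {
	WrapDEK(dek []byte) ([]byte, error)
	UnwrapDEK(wrapped []byte) ([]byte, error)
}

// dekStore plays the DEKStore role: it saves and fetches (wrapped) DEKs.
// Hypothetical interface.
type dekStore interface {
	StoreDEK(volumeID string, wrapped []byte) error
	FetchDEK(volumeID string) ([]byte, error)
}

// aesGCMWrapper wraps DEKs with AES-GCM under a fixed KEK.
type aesGCMWrapper struct{ aead cipher.AEAD }

func newAESGCMWrapper(kek []byte) (*aesGCMWrapper, error) {
	block, err := aes.NewCipher(kek) // kek must be 16, 24 or 32 bytes
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return &aesGCMWrapper{aead: aead}, nil
}

func (w *aesGCMWrapper) WrapDEK(dek []byte) ([]byte, error) {
	nonce := make([]byte, w.aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so that UnwrapDEK can recover it.
	return w.aead.Seal(nonce, nonce, dek, nil), nil
}

func (w *aesGCMWrapper) UnwrapDEK(wrapped []byte) ([]byte, error) {
	n := w.aead.NonceSize()
	if len(wrapped) < n {
		return nil, errors.New("wrapped DEK too short")
	}
	return w.aead.Open(nil, wrapped[:n], wrapped[n:], nil)
}

// memoryStore is an in-memory dekStore keyed by volume ID.
type memoryStore map[string][]byte

func (s memoryStore) StoreDEK(volumeID string, wrapped []byte) error {
	s[volumeID] = wrapped
	return nil
}

func (s memoryStore) FetchDEK(volumeID string) ([]byte, error) {
	wrapped, ok := s[volumeID]
	if !ok {
		return nil, errors.New("no DEK stored for " + volumeID)
	}
	return wrapped, nil
}

// volumeEncryption combines both roles behind one small API.
type volumeEncryption struct {
	kms   kekWrapper
	store dekStore
}

// SaveDEK wraps the DEK with the KEK and stores the result.
func (ve volumeEncryption) SaveDEK(volumeID string, dek []byte) error {
	wrapped, err := ve.kms.WrapDEK(dek)
	if err != nil {
		return err
	}
	return ve.store.StoreDEK(volumeID, wrapped)
}

// FetchDEK loads the wrapped DEK and unwraps it with the KEK.
func (ve volumeEncryption) FetchDEK(volumeID string) ([]byte, error) {
	wrapped, err := ve.store.FetchDEK(volumeID)
	if err != nil {
		return nil, err
	}
	return ve.kms.UnwrapDEK(wrapped)
}

func main() {
	kms, err := newAESGCMWrapper([]byte("0123456789abcdef0123456789abcdef")) // demo KEK
	if err != nil {
		panic(err)
	}
	ve := volumeEncryption{kms: kms, store: memoryStore{}}
	if err := ve.SaveDEK("vol-1", []byte("demo-passphrase")); err != nil {
		panic(err)
	}
	dek, err := ve.FetchDEK("vol-1")
	fmt.Println(string(dek), err)
}
```

Keeping the store separate from the KEK holder mirrors the point above that a
DEK store can be provided by the KMS itself or by another component.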
|
||||
|
@ -14,7 +14,8 @@ KMS implementation. Or, if changes would be minimal, a configuration option to
|
||||
one of the implementations can be added.
|
||||
|
||||
Different KMS implementations and their configurable options can be found at
|
||||
[`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml).
|
||||
[`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml)
|
||||
.
|
||||
|
||||
### VaultTokensKMS
|
||||
|
||||
@ -26,7 +27,8 @@ An example of the per Tenant configuration options are in
|
||||
[`tenant-config.yaml`](../../../examples/kms/vault/tenant-config.yaml) and
|
||||
[`tenant-token.yaml`](../../../examples/kms/vault/tenant-token.yaml).
|
||||
|
||||
Implementation is in [`vault_tokens.go`](../../../internal/util/vault_tokens.go).
|
||||
Implementation is in [`vault_tokens.go`](../../../internal/util/vault_tokens.go)
|
||||
.
|
||||
|
||||
### Vault
|
||||
|
||||
@ -36,7 +38,7 @@ Implementation is in [`vault.go`](../../../internal/util/vault.go).
|
||||
|
||||
## Extension or New KMS implementation
|
||||
|
||||
Normally ServiceAccounts are provided by Kubernetes in the containers
|
||||
Normally ServiceAccounts are provided by Kubernetes in the containers'
|
||||
filesystem. This only allows a single ServiceAccount and is static for the
|
||||
lifetime of the Pod. Ceph-CSI runs in the namespace of the storage
|
||||
administrator, and has access to the single ServiceAccount linked in the
|
||||
@ -53,7 +55,7 @@ steps need to be taken:
|
||||
replace the default (`AuthKubernetesTokenPath:
|
||||
/var/run/secrets/kubernetes.io/serviceaccount/token`)
|
||||
|
||||
Currently the Ceph-CSI components may read Secrets and ConfigMaps from the
|
||||
Currently, the Ceph-CSI components may read Secrets and ConfigMaps from the
|
||||
Tenants namespace. These permissions need to be extended to allow Ceph-CSI to
|
||||
read the contents of the ServiceAccount(s) in the Tenants namespace.
|
||||
|
||||
@ -61,7 +63,8 @@ read the contents of the ServiceAccount(s) in the Tenants namespace.
|
||||
|
||||
### Global Configuration
|
||||
|
||||
1. a StorageClass links to a KMS configuration by providing the `kmsID` parameter
|
||||
1. a StorageClass links to a KMS configuration by providing the `kmsID`
|
||||
parameter
|
||||
1. a ConfigMap in the namespace of the Ceph-CSI deployment contains the KMS
|
||||
configuration for the `kmsID`
|
||||
([`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml))
|
||||
@ -76,8 +79,8 @@ configuration from the ConfigMap.
|
||||
1. needs ServiceAccount with a known name with permissions to connect to Vault
|
||||
1. optional ConfigMap with options for Vault that override default settings
|
||||
|
||||
A `CreateVolume` request contains the owner (Namespace) of the Volume.
|
||||
The KMS configuration indicates that additional attributes need to be fetched
|
||||
from the Tenants namespace, so the provisioner will fetch these. The additional
|
||||
configuration and ServiceAccount are merged in the provisioners configuration
|
||||
A `CreateVolume` request contains the owner (Namespace) of the Volume. The KMS
|
||||
configuration indicates that additional attributes need to be fetched from the
|
||||
Tenants namespace, so the provisioner will fetch these. The additional
|
||||
configuration and ServiceAccount are merged in the provisioners' configuration
|
||||
for the KMS-implementation while creating the volume.
|
||||
|
@ -1,11 +1,11 @@
|
||||
# RBD MIRRORING
|
||||
|
||||
RBD mirroring is a process of replication of RBD images between two or more
|
||||
Ceph clusters. Mirroring ensures point-in-time, crash-consistent RBD images
|
||||
between clusters, RBD mirroring is mainly used for disaster recovery (i.e.
|
||||
having a secondary site as a failover). See [Ceph
|
||||
documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on RBD
|
||||
mirroring for complete information.
|
||||
RBD mirroring is a process of replication of RBD images between two or more Ceph
|
||||
clusters. Mirroring ensures point-in-time, crash-consistent RBD images between
|
||||
clusters. RBD mirroring is mainly used for disaster recovery (i.e. having a
|
||||
secondary site as a failover).
|
||||
See [Ceph documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on
|
||||
RBD mirroring for complete information.
|
||||
|
||||
## Architecture
|
||||
|
||||
@ -28,8 +28,8 @@ PersistentVolumeClaim (PVC) on the secondary site during the failover.
|
||||
VolumeHandle to identify the OMAP data nor the image anymore because we have
|
||||
only PoolID and ClusterID in the VolumeHandle. We cannot identify the correct
|
||||
pool name from the PoolID because pool name will remain the same on both
|
||||
clusters but not the PoolID even the ClusterID can be different on the
|
||||
secondary cluster.
|
||||
clusters but the PoolID will not; even the ClusterID can be different on the secondary
|
||||
cluster.
|
||||
|
||||
> Sample PV spec which will be used by rbdplugin controller to regenerate OMAP
|
||||
> data
|
||||
@ -56,10 +56,10 @@ csi:
|
||||
```
|
||||
|
||||
> **VolumeHandle** is the unique volume name returned by the CSI volume plugin’s
|
||||
CreateVolume to refer to the volume on all subsequent calls.
|
||||
> CreateVolume to refer to the volume on all subsequent calls.
|
||||
|
||||
Once the static PVC is created on the secondary cluster, the Kubernetes User
|
||||
can try delete the PVC,expand the PVC or mount the PVC. In case of mounting
|
||||
Once the static PVC is created on the secondary cluster, the Kubernetes User can
|
||||
try to delete the PVC, expand the PVC or mount the PVC. In case of mounting
|
||||
(NodeStageVolume) we will get the volume context in RPC call but not in the
|
||||
Delete/Expand Request. In Delete/Expand RPC request only the VolumeHandle
|
||||
(`clusterID-poolID-volumeuniqueID`) will be sent where it contains the encoded
|
||||
@ -73,17 +73,17 @@ secondary cluster as the PoolID and ClusterID always may not be the same.
|
||||
|
||||
To solve this problem, we will have a new controller (rbdplugin controller)
|
||||
running as part of provisioner pod which watches for the PV objects. When a PV
|
||||
is created it will extract the required information from the PV spec and it
|
||||
is created it will extract the required information from the PV spec, and it
|
||||
will regenerate the OMAP data. Whenever Ceph-CSI gets an RPC request with an older
|
||||
VolumeHandle, it will check if any new VolumeHandle exists for the old
|
||||
VolumeHandle. If yes, it uses the new VolumeHandle for internal operations (to
|
||||
get pool name, Ceph monitor details from the ClusterID, etc.).
|
||||
|
||||
Currently, we are making use of watchers in node stage request to make sure
|
||||
ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time.
|
||||
We need to change the watchers logic in the node stage request as when we
|
||||
enable the RBD mirroring on an image, a watcher will be added on a RBD image by
|
||||
the rbd mirroring daemon.
|
||||
ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time. We
|
||||
need to change the watchers logic in the node stage request as when we enable
|
||||
the RBD mirroring on an image, a watcher will be added on a RBD image by the rbd
|
||||
mirroring daemon.
|
||||
|
||||
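One way to picture the adjusted watcher check is sketched below. It assumes, purely for illustration, that the mirroring daemon contributes exactly one watcher and that the decision is based on the watcher count alone; `image_in_use` is a hypothetical helper, not a Ceph-CSI function.

```python
def image_in_use(watcher_count, mirroring_enabled):
    """Decide whether an RBD image looks mounted elsewhere.

    Assumption: with mirroring enabled, one watcher belongs to the
    mirroring daemon and must not count as a client mount.
    """
    expected_idle_watchers = 1 if mirroring_enabled else 0
    return watcher_count > expected_idle_watchers


print(image_in_use(1, mirroring_enabled=True))
print(image_in_use(1, mirroring_enabled=False))
```

With mirroring enabled, a single watcher is treated as the daemon and the image is considered free; without mirroring, the same single watcher means the image is in use.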
To solve the ClusterID problem, if the ClusterID is different on the second
cluster, the admin has to create a new ConfigMap for the mapped ClusterIDs.
@ -1,59 +1,57 @@
# RBD NBD VOLUME HEALER

- [RBD NBD VOLUME HEALER](#rbd-nbd-volume-healer)
  - [Rbd Nbd](#rbd-nbd)
    - [Advantages of userspace mounters](#advantages-of-userspace-mounters)
    - [Side effects of userspace mounters](#side-effects-of-userspace-mounters)
  - [Volume Healer](#volume-healer)
    - [More thoughts](#more-thoughts)

## Rbd nbd

The rbd CSI plugin will provision new rbd images and attach and mount those
to workloads. Currently, the default mounter is krbd, which uses the kernel
rbd driver to mount the rbd images onto the application pod. Here on
at Ceph-CSI we will also have a userspace way of mounting the rbd images,
via rbd-nbd.
The rbd CSI plugin will provision new rbd images and attach and mount those to
workloads. Currently, the default mounter is krbd, which uses the kernel rbd
driver to mount the rbd images onto the application pod. From here on, Ceph-CSI
will also have a userspace way of mounting the rbd images, via rbd-nbd.

[Rbd-nbd](https://docs.ceph.com/en/latest/man/8/rbd-nbd/) is a client for
RADOS block device (rbd) images like the existing rbd kernel module. It
will map an rbd image to an nbd (Network Block Device) device, allowing
access to it as a regular local block device.
[Rbd-nbd](https://docs.ceph.com/en/latest/man/8/rbd-nbd/) is a client for RADOS
block device (rbd) images like the existing rbd kernel module. It will map an
rbd image to an nbd (Network Block Device) device, allowing access to it as a
regular local block device.

![csi-rbd-nbd](./images/csi-rbd-nbd.svg)

It’s worth making a note that the rbd-nbd processes will run on the
client-side, which is inside the `csi-rbdplugin` node plugin.
It’s worth noting that the rbd-nbd processes will run on the client side, that
is, inside the `csi-rbdplugin` node plugin.

### Advantages of userspace mounters

- It is easier to add features to rbd-nbd as it is released regularly with
  Ceph, and more difficult and time consuming to add features to the kernel
  rbd module as that is part of the Linux kernel release schedule.
- Container upgrades will be independent of the host node, which means if
  there are any new features with rbd-nbd, we don’t have to reboot the node
  as the changes will be shipped inside the container.
- Because the container upgrades are host node independent, we will be a
  better citizen in K8s by switching to the userspace model.
- It is easier to add features to rbd-nbd as it is released regularly with Ceph,
  and more difficult and time-consuming to add features to the kernel rbd module
  as that is part of the Linux kernel release schedule.
- Container upgrades will be independent of the host node, which means if there
  are any new features with rbd-nbd, we don’t have to reboot the node as the
  changes will be shipped inside the container.
- Because the container upgrades are host node independent, we will be a better
  citizen in K8s by switching to the userspace model.
- Unlike krbd, rbd-nbd uses the librbd user-space library that gets most of the
  development focus, and hence rbd-nbd will be feature-rich.
- Being entirely in kernel space impacts fault tolerance, as any kernel panic
  affects a whole node not only a single pod that is using rbd storage.
  Thanks to the rbd-nbd’s userspace design, we are less bothered here, the
  krbd is a complete kernel and vendor-specific driver which needs changes
  on every feature basis, on the other hand, rbd-nbd depends on NBD generic
  driver, while all the vendor-specific logic sits in the userspace. It's
  worth taking note that NBD generic driver is mostly unchanged much from
  years and consider it to be much stable. Also given NBD is a generic
  driver there will be many eyes on it compared to the rbd driver.
  affects a whole node, not only a single pod that is using rbd storage. Thanks
  to rbd-nbd’s userspace design, we are less exposed here. krbd is a complete
  kernel and vendor-specific driver which needs changes for every feature;
  rbd-nbd, on the other hand, depends on the generic NBD driver, while all the
  vendor-specific logic sits in userspace. It's worth noting that the generic
  NBD driver has been mostly unchanged for years and is considered quite
  stable. Also, given that NBD is a generic driver, there will be more eyes on
  it compared to the rbd driver.

### Side effects of userspace mounters

Since the rbd-nbd processes run per volume map on the client side i.e.
inside the `csi-rbdplugin` node plugin, a restart of the node plugin will
terminate all the rbd-nbd processes, and there is no way to restore
these processes back to life currently, which could lead to IO errors
on all the application pods.
Since the rbd-nbd processes run per volume map on the client side, i.e. inside
the `csi-rbdplugin` node plugin, a restart of the node plugin will terminate all
the rbd-nbd processes, and there is currently no way to bring these processes
back to life, which could lead to IO errors on all the application pods.

![csi-plugin-restart](./images/csi-plugin-restart.svg)

@ -61,42 +59,42 @@ This is where the Volume healer could help.

## Volume healer

Volume healer runs on the start of rbd node plugin and runs within the
node plugin driver context.
The volume healer runs at the start of the rbd node plugin, within the node
plugin driver context.

The volume healer does the following:

- Get the Volume attachment list for the current node where it is running
- Filter the volume attachments list through matching driver name and
  status attached
- For each volume attachment get the respective PV information and check
  the criteria of PV Bound, mounter type
- Build the StagingPath where rbd images PVC is mounted, based on the
  KUBELET path and PV object
- Filter the volume attachments list through matching driver name and status
  attached
- For each volume attachment get the respective PV information and check the
  criteria of PV Bound, mounter type
- Build the StagingPath where rbd images PVC is mounted, based on the KUBELET
  path and PV object
- Construct the NodeStageVolume() request and send Request to CSI Driver.
- The NodeStageVolume() has a way to identify calls received from the
  healer and when executed from the healer context, it just runs in the
  minimal required form, where it fetches the previously mapped device to
  the image, and the respective secrets and finally ensures to bringup the
  respective process back to life. Thus enabling IO to continue.
- NodeStageVolume() has a way to identify calls received from the healer. When
  executed from the healer context, it runs in the minimal required form: it
  fetches the device previously mapped to the image and the respective secrets,
  and finally brings the respective process back to life, thus enabling IO to
  continue.

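The healer steps listed above can be sketched roughly as follows. This is an illustrative Python model, not the Ceph-CSI code (which is written in Go): the `VolumeAttachment` and `PersistentVolume` classes are simplified stand-ins for the Kubernetes objects, the staging path layout in `staging_path` is an assumption, and the returned dictionaries only mimic the fields of a NodeStageVolume() request.

```python
from dataclasses import dataclass


@dataclass
class VolumeAttachment:
    attacher: str    # name of the CSI driver handling the attachment
    node_name: str
    pv_name: str
    attached: bool


@dataclass
class PersistentVolume:
    name: str
    phase: str       # e.g. "Bound"
    mounter: str     # e.g. "rbd-nbd"
    volume_handle: str


def staging_path(kubelet_dir, pv):
    # Assumed layout for illustration only; the real path is derived from
    # the kubelet directory and the PV object.
    return f"{kubelet_dir}/plugins/kubernetes.io/csi/pv/{pv.name}/globalmount"


def build_heal_requests(attachments, pvs, driver, node, kubelet_dir):
    """Return one NodeStageVolume-like request per volume that needs healing."""
    requests = []
    for va in attachments:
        # Keep only attachments of this driver, on this node, that are attached.
        if va.node_name != node or va.attacher != driver or not va.attached:
            continue
        pv = pvs.get(va.pv_name)
        # Skip PVs that are missing, not Bound, or not using the rbd-nbd mounter.
        if pv is None or pv.phase != "Bound" or pv.mounter != "rbd-nbd":
            continue
        requests.append({
            "volume_id": pv.volume_handle,
            "staging_target_path": staging_path(kubelet_dir, pv),
        })
    return requests


pvs = {"pv-a": PersistentVolume("pv-a", "Bound", "rbd-nbd", "handle-a")}
attachments = [
    VolumeAttachment("rbd.csi.ceph.com", "node-1", "pv-a", True),
    VolumeAttachment("rbd.csi.ceph.com", "node-2", "pv-a", True),
]
print(build_heal_requests(attachments, pvs, "rbd.csi.ceph.com", "node-1",
                          "/var/lib/kubelet"))
```

Only the attachment on `node-1` yields a request here; the one on `node-2` is filtered out because the healer handles only the node it runs on.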
### More thoughts

- Currently, the NodeStageVolume() call is safeguarded by the global Ceph-CSI
  level lock (per volID) that needs to be acquired before doing any of the
  NodeStage, NodeUnstage, NodePublish, NodeUnPulish operations. Hence none
  of the operations happen in parallel.
  NodeStage, NodeUnstage, NodePublish, NodeUnPublish operations. Hence none of
  the operations happen in parallel.
- Any issues if the NodeUnstage is issued by kubelet?
  - This cannot be a problem as we take a lock at the Ceph-CSI level
  - If the NodeUnstage succeeds, Ceph-CSI will return a StagingPath not found
    error, and we can then skip
  - If the NodeUnstage fails with an operation already going on, in the
    next NodeUnstage the volume gets unmounted
  - If the NodeUnstage fails with an operation already going on, in the next
    NodeUnstage the volume gets unmounted
- What if the PVC is deleted?
  - If the PVC is deleted, the volume attachment list might already got
  - If the PVC is deleted, the volume attachment list might already have been
    refreshed and the entry will be skipped/deleted at the healer.
  - If, for any reason, the request bails out with Error NotFound, skip the
    PVC, assuming it might have deleted or the NodeUnstage might have
    already happened.
    PVC, assuming it might have been deleted or the NodeUnstage might have
    already happened.
- The Volume healer currently works with rbd-nbd, but the design can
  accommodate other userspace mounters (maybe ceph-fuse).
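The per-volID lock mentioned above behaves like a try-lock: a second operation on the same volume is refused rather than queued, which is what produces the "operation already going on" failure. A minimal Python sketch of that idea follows; `VolumeLocks` and its method names are hypothetical and only model the behaviour, they are not the Ceph-CSI API.

```python
import threading


class VolumeLocks:
    """Try-lock keyed by volume ID: a second caller is refused, not blocked."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._busy = set()

    def try_acquire(self, vol_id):
        # Returns False when another operation already holds this volume ID.
        with self._mutex:
            if vol_id in self._busy:
                return False
            self._busy.add(vol_id)
            return True

    def release(self, vol_id):
        with self._mutex:
            self._busy.discard(vol_id)


locks = VolumeLocks()
if locks.try_acquire("vol-1"):
    try:
        # A concurrent NodeUnstage on the same volume would be refused here.
        print(locks.try_acquire("vol-1"))
    finally:
        locks.release("vol-1")
```

Because the caller is refused instead of blocked, a NodeUnstage that arrives while the healer's NodeStage is running fails fast and succeeds when it is retried.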