doc: few corrections or typo fixing in design documentation

- Fixes spelling mistakes.
- Grammatical error correction.
- Wrapping the text at 80 line count, etc.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
This commit is contained in: parent 12e8e46bcf, commit 3196b798cc
@@ -6,50 +6,49 @@ snapshot contents and then mount that volume to workloads.

CephFS exposes snapshots as special, read-only directories of a subvolume
located in `<subvolume>/.snap`. cephfs-csi can already provision writable
volumes with snapshots as their data source, where snapshot contents are cloned
to the newly created volume. However, cloning a snapshot to a volume is a very
expensive operation in CephFS as the data needs to be fully copied. When the
need is to only read snapshot contents, snapshot cloning is extremely
inefficient and wasteful.

This proposal describes a way for cephfs-csi to expose CephFS snapshots as
shallow, read-only volumes, without needing to clone the underlying snapshot
data.

## Use-cases

What's the point of such read-only volumes?

* **Restore snapshots selectively:** users may want to traverse snapshots,
  restoring data to a writable volume more selectively instead of restoring the
  whole snapshot.
* **Volume backup:** users can't backup a live volume, they first need to
  snapshot it. Once a snapshot is taken, it still can't be backed-up, as backup
  tools usually work with volumes (that are exposed as file-systems) and not
  snapshots (which might have a backend-specific format). What this means is
  that in order to create a snapshot backup, users have to clone snapshot data
  twice:

  1. first time, when restoring the snapshot into a temporary volume from where
     the data will be read,
  1. and second time, when transferring that volume into some backup/archive
     storage (e.g. object store).

  The temporary backed-up volume will most likely be thrown away after the
  backup transfer is finished. That's a lot of wasted work for what we
  originally wanted to do! Having the ability to create volumes from snapshots
  cheaply would be a big improvement for this use case.

## Alternatives

* _Snapshots are stored in `<subvolume>/.snap`. Users could simply visit this
  directory by themselves._

  `.snap` is a CephFS-specific detail of how snapshots are exposed. Users /
  tools may not be aware of this special directory, or it may not fit their
  workflow. At the moment, the idiomatic way of accessing snapshot contents in
  CSI drivers is by creating a new volume and populating it with the snapshot.

## Design
@@ -57,21 +56,21 @@ Key points:

* Volume source is a snapshot, volume access mode is `*_READER_ONLY`.
* No actual new subvolumes are created in CephFS.
* The resulting volume is a reference to the source subvolume snapshot. This
  reference would be stored in the `Volume.volume_context` map. In order to
  reference a snapshot, we need the subvolume name and the snapshot name.
* Mounting such a volume means mounting the respective CephFS subvolume and
  exposing the snapshot to workloads.
* Let's call a *shallow read-only volume with a subvolume snapshot as its data
  source* just a *shallow volume* from here on out for brevity.

### Controller operations

Care must be taken when handling life-times of relevant storage resources. When
a shallow volume is created, what would happen if:

* _Parent subvolume of the snapshot is removed while the shallow volume still
  exists?_

  This shouldn't be a problem already. The parent volume has either the
  `snapshot-retention` subvol feature, in which case its snapshots remain
@@ -80,8 +79,8 @@ When a shallow volume is created, what would happen if:

* _Source snapshot from which the shallow volume originates is removed while
  that shallow volume still exists?_

  We need to make sure this doesn't happen and some book-keeping is necessary.
  Ideally we could employ some kind of reference counting.

#### Reference counting for shallow volumes
@@ -92,26 +91,26 @@ When creating a volume snapshot, a reference tracker (RT), represented by a

RADOS object, would be created for that snapshot. It would store information
required to track the references for the backing subvolume snapshot. Upon a
`CreateSnapshot` call, the reference tracker (RT) would be initialized with a
single reference record, where the CSI snapshot itself is the first reference to
the backing snapshot. Each subsequent shallow volume creation would add a new
reference record to the RT object. Each shallow volume deletion would remove
that reference from the RT object. Calling `DeleteSnapshot` would remove the
reference record that was previously added in `CreateSnapshot`.

The subvolume snapshot would be removed from the Ceph cluster only once the RT
object holds no references. Note that this behavior would permit calling
`DeleteSnapshot` even if it is still referenced by shallow volumes.

* `DeleteSnapshot`:
  * RT holds no references or the RT object doesn't exist:
    delete the backing snapshot too.
  * RT holds at least one reference: keep the backing snapshot.
* `DeleteVolume`:
  * RT holds no references: delete the backing snapshot too.
  * RT holds at least one reference: keep the backing snapshot.

To enable creating shallow volumes from snapshots that were provisioned by older
versions of cephfs-csi (i.e. before this feature is introduced),
`CreateVolume` for shallow volumes would also create an RT object in case it's
missing. It would be initialized to two: the source snapshot and the newly
created shallow volume.
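To make the reference-counting rules above concrete, here is a minimal,
self-contained Go sketch. The `refTracker` type and its methods are illustrative
stand-ins only; the real reference tracker would be backed by a RADOS object
rather than an in-memory map.

```go
// Illustrative sketch only: refTracker, addRef and removeRef are hypothetical
// names and do not refer to actual cephfs-csi code.
package main

import "fmt"

// refTracker stands in for the RADOS-object-backed reference tracker (RT).
type refTracker struct {
	refs map[string]struct{} // one record per referrer (CSI snapshot, shallow volume)
}

func newRefTracker(firstRef string) *refTracker {
	// CreateSnapshot: the CSI snapshot itself is the first reference.
	return &refTracker{refs: map[string]struct{}{firstRef: {}}}
}

// addRef is called when a new shallow volume is created from the snapshot.
func (rt *refTracker) addRef(id string) { rt.refs[id] = struct{}{} }

// removeRef is called on DeleteVolume/DeleteSnapshot; it reports whether the
// backing subvolume snapshot may now be removed (no references remain).
func (rt *refTracker) removeRef(id string) (deleteBackingSnapshot bool) {
	delete(rt.refs, id)
	return len(rt.refs) == 0
}

func main() {
	rt := newRefTracker("csi-snapshot")        // CreateSnapshot
	rt.addRef("shallow-vol-1")                 // CreateVolume (shallow)
	fmt.Println(rt.removeRef("csi-snapshot"))  // DeleteSnapshot -> false, keep backing snapshot
	fmt.Println(rt.removeRef("shallow-vol-1")) // DeleteVolume   -> true, delete backing snapshot
}
```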
@@ -141,17 +140,17 @@ Things to look out for:

  It doesn't consume any space on the filesystem. `Volume.capacity_bytes` is
  allowed to contain zero. We could use that.
* _What should be the requested size when creating the volume (specified e.g. in
  the PVC)?_

  This one is tricky. The CSI spec allows
  `CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be zero.
  On the other hand, `PersistentVolumeClaim.spec.resources.requests.storage`
  must be bigger than zero. cephfs-csi doesn't care about the requested size
  (the volume will be read-only, so it has no usable capacity) and would always
  set it to zero. This shouldn't cause any problems for the time being, but it
  is still something we should keep in mind.

`CreateVolume` and behavior when using volume as volume source (PVC-PVC clone):
@@ -167,8 +166,8 @@ Volume deletion is trivial.

### `CreateSnapshot`

Snapshotting read-only volumes doesn't make sense in general, and should be
rejected.

### `ControllerExpandVolume`
@@ -194,8 +193,8 @@ whole subvolume first, and only then perform the binds to target paths.

#### For case (a)

Subvolume paths are normally retrieved by
`ceph fs subvolume info/getpath <VOLUME NAME> <SUBVOLUME NAME> <SUBVOLUMEGROUP NAME>`,
which outputs a path like so:

```
/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>
```
@@ -217,12 +216,12 @@ itself still exists or not.

#### For case (b)

For cases where subvolumes are managed externally and not by cephfs-csi, we must
assume that the cephx user we're given can access only
`/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>`, so users won't be able to
benefit from snapshot retention. Users will need to be careful not to delete the
parent subvolumes and snapshots while they are associated with these shallow RO
volumes.

### `NodePublishVolume`, `NodeUnpublishVolume`
@@ -235,38 +234,38 @@ mount.

## Volume parameters, volume context

This section provides a discussion around determining what volume parameters and
volume context parameters will be used to convey the necessary information to
the cephfs-csi driver in order to support shallow volumes.

Volume parameters `CreateVolumeRequest.parameters`:

* Should "shallow" be the default mode for all `CreateVolume` calls that have
  (a) a snapshot as data source and (b) a read-only volume access mode? If not,
  a new volume parameter should be introduced: e.g. `isShallow: <bool>`. On the
  other hand, does it even make sense for users to want to create full copies
  of snapshots and still have them read-only?

Volume context `Volume.volume_context`:

* Here we definitely need `isShallow` or similar. Without it we wouldn't be able
  to distinguish between a regular volume that just happens to have a read-only
  access mode, and a volume that references a snapshot.
* Currently cephfs-csi recognizes `subvolumePath` for dynamically provisioned
  volumes and `rootPath` for pre-provisioned volumes. As mentioned in the
  [`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume),
  snapshots cannot be mounted directly. How do we pass in the path to the parent
  subvolume?
  * a) The path to the snapshot is passed in via `subvolumePath` / `rootPath`,
    e.g. `/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>/.snap/<SNAPSHOT NAME>`.
    From that we can derive the path to the subvolume: it's the parent of the
    `.snap` directory (see the sketch after this list).
  * b) Similar to a), the path to the snapshot is passed in via `subvolumePath` /
    `rootPath`, but instead of trying to derive the right path we introduce
    another volume context parameter containing the path to the parent subvolume
    explicitly.
  * c) `subvolumePath` / `rootPath` contains the path to the parent subvolume and
    we introduce another volume context parameter containing the name of the
    snapshot. The path to the snapshot is then formed by appending
    `/.snap/<SNAPSHOT NAME>` to the subvolume path.
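A small Go sketch of option a), deriving the parent subvolume path and the
snapshot name from a snapshot path. The function name and the example paths are
hypothetical, not actual cephfs-csi identifiers.

```go
// Minimal sketch of option a): splitting a subvolumePath/rootPath that points
// at a snapshot into the parent subvolume path and the snapshot name.
package main

import (
	"fmt"
	"path"
	"strings"
)

// splitSnapshotPath splits e.g.
//   /volumes/<VOLUME>/<SUBVOLUME>/<UUID>/.snap/<SNAPSHOT>
// into the parent subvolume path and the snapshot name.
func splitSnapshotPath(p string) (subvolPath, snapName string, err error) {
	snapName = path.Base(p) // <SNAPSHOT>
	snapDir := path.Dir(p)  // .../<UUID>/.snap
	if path.Base(snapDir) != ".snap" || strings.HasPrefix(snapName, ".") {
		return "", "", fmt.Errorf("%q does not look like a snapshot path", p)
	}
	return path.Dir(snapDir), snapName, nil // parent of .snap, snapshot name
}

func main() {
	subvol, snap, err := splitSnapshotPath(
		"/volumes/csi/csi-vol-0001/8a4f8a48/.snap/snap-0001")
	fmt.Println(subvol, snap, err)
	// /volumes/csi/csi-vol-0001/8a4f8a48 snap-0001 <nil>
}
```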
@@ -1,7 +1,7 @@

# Design to handle clusterID and poolID for DR

During disaster recovery/migration of a cluster, as part of the failover, the
kubernetes artifacts like deployment, PVC, PV, etc. will be restored to a new
cluster by the admin. Even if the kubernetes objects are restored, the
corresponding RBD/CephFS subvolume cannot be retrieved during CSI operations as
the clusterID and poolID are not the same in both clusters. Let's see the
@@ -10,8 +10,8 @@ problem in more detail below.

`0001-0009-rook-ceph-0000000000000002-b0285c97-a0ce-11eb-8c66-0242ac110002`

The above is a sample volumeID sent back in the response to the CreateVolume
operation and added as a volumeHandle in the PV spec. The CO (Kubernetes) uses
the above as the identifier for other operations on the volume/PVC.

The VolumeID is encoded as,
@@ -33,7 +33,7 @@ the other cluster.

During the Disaster Recovery (failover operation) the PVC and PV will be
recreated on the other cluster. When Ceph-CSI receives a request for operations
like NodeStage, ExpandVolume, DeleteVolume, etc., the volumeID is sent in the
request, which helps to identify the volume.

```yaml=
@@ -68,15 +68,15 @@ metadata:

```

During CSI/Replication operations, Ceph-CSI will decode the volumeID, get the
monitor configuration from the configmap, use the poolID to get the pool name,
retrieve the OMAP data stored in the rados OMAP, and finally check that the
volume is present in the pool.

## Problems with volumeID Replication

* The clusterID can be different
  * as the clusterID is the namespace where rook is deployed, Rook might be
    deployed in a different namespace on a secondary cluster
  * In standalone Ceph-CSI the clusterID is the fsID, and the fsID is unique per
    cluster
@@ -124,8 +124,8 @@ metadata:

  name: ceph-csi-config
```

**Note:-** the configmap will be mounted as a volume to the CSI (provisioner and
node plugin) pods.

The above configmap will get created as it is, or updated (if new Pools are
created on the existing cluster) with new entries, when the admin chooses to
@@ -149,28 +149,28 @@ Replicapool with ID `1` on site1 and Replicapool with ID `2` on site2.

After getting the required mapping, Ceph-CSI has the required information to get
more details from the rados OMAP. If we have multiple clusterID mappings it will
loop through all the mappings and check the corresponding pools to get the OMAP
data. If the clusterID mapping does not exist, Ceph-CSI will return a `Not Found`
error message to the caller.

After failover to the cluster `site2-storage`, the admin might have created new
PVCs on the primary cluster `site2-storage`. Later, after recovering the cluster
`site1-storage`, the admin might choose to failback from `site2-storage` to
`site1-storage`. Now the admin needs to copy all the newly created kubernetes
artifacts to the failback cluster. For clusterID mapping, the admin needs to copy
the above-created configmap `ceph-clusterid-mapping` to the failback cluster.
When Ceph-CSI receives a CSI/Replication request for the volumes created on
`site2-storage`, it will decode the volumeID and retrieve the clusterID, i.e.
`site2-storage`. In the above configmap `ceph-clusterid-mapping`, `site2-storage`
is the value and `site1-storage` is the key in the `clusterIDMapping` entry.

Ceph-CSI will check both `key` and `value` to check the clusterID mapping. If it
is found in the `key`, it will consider the `value` as the corresponding mapping;
if it is found in the `value` place, it will treat the `key` as the corresponding
mapping and then retrieve all the poolID details of the cluster.

This mapping on the remote cluster is only required when we are doing a failover
operation from the primary cluster to a remote cluster. The existing volumes
that are created on the remote cluster do not require any mapping as the
volumeHandle already contains the required information about the local cluster
(clusterID, poolID etc).
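The key/value lookup described above can be sketched as follows; the
`clusterIDMapping` type and the `resolveClusterID` helper are illustrative
assumptions, not the actual Ceph-CSI implementation.

```go
// Illustrative sketch of resolving a mapped clusterID, assuming a simple
// key/value mapping loaded from the ceph-clusterid-mapping configmap.
package main

import "fmt"

// clusterIDMapping maps a clusterID on one site to its counterpart on the
// other site (e.g. "site1-storage" -> "site2-storage").
type clusterIDMapping map[string]string

// resolveClusterID checks both key and value: if the decoded clusterID is
// found as a key, the value is the corresponding mapping; if it is found as a
// value, the key is the corresponding mapping.
func resolveClusterID(m clusterIDMapping, decoded string) (string, bool) {
	if mapped, ok := m[decoded]; ok {
		return mapped, true
	}
	for key, value := range m {
		if value == decoded {
			return key, true
		}
	}
	return "", false // caller should return a Not Found error
}

func main() {
	m := clusterIDMapping{"site1-storage": "site2-storage"}
	fmt.Println(resolveClusterID(m, "site2-storage")) // site1-storage true
	fmt.Println(resolveClusterID(m, "unknown"))       // "" false
}
```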
@@ -16,7 +16,7 @@ Some but not all the benefits of this approach:

* volume encryption: encryption of a volume attached by rbd
* encryption at rest: encryption of physical disk done by ceph
* LUKS: Linux Unified Key Setup: stores all the needed setup information for
  dm-crypt on the disk
* dm-crypt: linux kernel device-mapper crypto target
* cryptsetup: the command line tool to interface with dm-crypt
@@ -28,8 +28,8 @@ requirement by using dm-crypt module through cryptsetup cli interface.

### Implementation Summary

* Encryption is implemented using cryptsetup with the LUKS extension. A good
  introduction to LUKS and dm-crypt in general can be found
  [here](https://wiki.archlinux.org/index.php/Dm-crypt/Device_encryption#Encrypting_devices_with_cryptsetup).
  Functions to implement the necessary interaction are implemented in a separate
  `cryptsetup.go` file. A hedged sketch of such a wrapper follows below.
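Below is a hedged sketch of a wrapper around the cryptsetup CLI. The function
names are hypothetical and the flags are kept to a minimum; the real
`cryptsetup.go` may use additional options (cipher, hash, LUKS version, etc.).

```go
// A minimal, illustrative wrapper around the cryptsetup CLI, in the spirit of
// what a cryptsetup.go helper might do.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// luksFormat formats devicePath as a LUKS device, reading the passphrase from stdin.
func luksFormat(devicePath, passphrase string) error {
	cmd := exec.Command("cryptsetup", "-q", "luksFormat", devicePath, "-")
	cmd.Stdin = strings.NewReader(passphrase)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("luksFormat failed: %w (%s)", err, out)
	}
	return nil
}

// luksOpen maps the LUKS device to /dev/mapper/<mapperName>.
func luksOpen(devicePath, mapperName, passphrase string) error {
	cmd := exec.Command("cryptsetup", "luksOpen", devicePath, mapperName, "--key-file=-")
	cmd.Stdin = strings.NewReader(passphrase)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("luksOpen failed: %w (%s)", err, out)
	}
	return nil
}

// luksClose removes the device-mapper mapping again.
func luksClose(mapperName string) error {
	return exec.Command("cryptsetup", "luksClose", mapperName).Run()
}

func main() {
	// Example only; expects root privileges and a real block device.
	_ = luksFormat("/dev/rbd0", "passphrase")
	_ = luksOpen("/dev/rbd0", "luks-volumeid", "passphrase")
	_ = luksClose("luks-volumeid")
}
```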
@@ -45,8 +45,8 @@ requirement by using dm-crypt module through cryptsetup cli interface.

  volume attach request
* `NodeStageVolume`: refactored to open the encrypted device (`openEncryptedDevice`)
  * `openEncryptedDevice`: looks up a passphrase matching the volume id and
    returns the new device path in the form `/dev/mapper/luks-<volume_id>`. On
    the worker node where the attach is scheduled:

  ```shell
  $ lsblk
  ```
@@ -62,10 +62,10 @@ requirement by using dm-crypt module through cryptsetup cli interface.

  before detaching the volume.

* StorageClass extended with the following parameters:
  1. `encrypted` ("true" or "false")
  2. `encryptionKMSID` (string representing the kms configuration of choice);
     the ceph-csi plugin may support different kms vendors with different types
     of authentication

* New KMS Configuration created.
@@ -75,37 +75,37 @@ requirement by using dm-crypt module through cryptsetup cli interface.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd
provisioner: rbd.csi.ceph.com
parameters:
  # String representing Ceph cluster configuration
  clusterID: <cluster-id>
  # ceph pool
  pool: rbd

  # RBD image features, CSI creates image with image-format 2
  # CSI RBD currently supports only `layering` feature.
  imageFeatures: layering

  # The secrets have to contain Ceph credentials with required access
  # to the 'pool'.
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
  # Specify the filesystem type of the volume. If not specified,
  # csi-provisioner will set default as `ext4`.
  csi.storage.k8s.io/fstype: ext4

  # Encrypt volumes
  encrypted: "true"

  # Use external key management system for encryption passphrases by specifying
  # a unique ID matching KMS ConfigMap. The ID is only used for correlation to
  # configmap entry.
  encryptionKMSID: <kms-id>

reclaimPolicy: Delete
```
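As an illustration of how the two encryption parameters above could be consumed,
here is a minimal sketch of reading them from `CreateVolumeRequest.parameters`;
the helper name and error handling are assumptions, not the actual rbd driver
code.

```go
// Illustrative only: interpreting the `encrypted` and `encryptionKMSID`
// StorageClass parameters as they arrive in CreateVolumeRequest.parameters.
package main

import (
	"fmt"
	"strconv"
)

// parseEncryptionParams returns whether encryption was requested and, if so,
// which KMS configuration ID should be used for the passphrase.
func parseEncryptionParams(parameters map[string]string) (encrypted bool, kmsID string, err error) {
	v, ok := parameters["encrypted"]
	if !ok {
		return false, "", nil // encryption not requested
	}
	encrypted, err = strconv.ParseBool(v)
	if err != nil {
		return false, "", fmt.Errorf("invalid value %q for 'encrypted': %w", v, err)
	}
	// encryptionKMSID is optional; without it a driver could fall back to a
	// locally stored passphrase (e.g. a Kubernetes Secret).
	return encrypted, parameters["encryptionKMSID"], nil
}

func main() {
	enc, kmsID, err := parseEncryptionParams(map[string]string{
		"encrypted":       "true",
		"encryptionKMSID": "<kms-id>",
	})
	fmt.Println(enc, kmsID, err) // true <kms-id> <nil>
}
```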
@@ -133,14 +133,19 @@ metadata:

The main components that are used to support encrypted volumes:

1. the `EncryptionKMS` interface

   * an instance is configured per volume object (`rbdVolume.KMS`)
   * used to authenticate with a master key or token
   * can store the KEK (Key-Encryption-Key) for encrypting and decrypting the
     DEKs (Data-Encryption-Key)

1. the `DEKStore` interface

   * saves and fetches the DEK (Data-Encryption-Key)
   * can be provided by a KMS, or by other components (like `rbdVolume`)

1. the `VolumeEncryption` type

   * combines `EncryptionKMS` and `DEKStore` into a single place
   * easy to configure from other components or subsystems
   * provides a simple API for all KMS operations
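A simplified sketch of how these three pieces could fit together is shown below;
the method sets are intentionally reduced and the exact signatures are
assumptions, not the real Ceph-CSI interfaces.

```go
// A simplified, illustrative sketch of the relationship between the KMS, the
// DEK store and the VolumeEncryption helper described above.
package sketch

// EncryptionKMS authenticates against a KMS and uses the KEK
// (Key-Encryption-Key) to wrap/unwrap DEKs.
type EncryptionKMS interface {
	EncryptDEK(volumeID, plainDEK string) (string, error)
	DecryptDEK(volumeID, encryptedDEK string) (string, error)
}

// DEKStore persists the (possibly encrypted) DEK (Data-Encryption-Key); it can
// be backed by the KMS itself or by other components such as the rbdVolume.
type DEKStore interface {
	StoreDEK(volumeID, dek string) error
	FetchDEK(volumeID string) (string, error)
}

// VolumeEncryption combines both and offers one place for callers to obtain
// the passphrase that is eventually handed to cryptsetup.
type VolumeEncryption struct {
	KMS      EncryptionKMS
	dekStore DEKStore
}

// getCryptoPassphrase fetches the stored DEK and unwraps it with the KMS.
func (ve *VolumeEncryption) getCryptoPassphrase(volumeID string) (string, error) {
	encrypted, err := ve.dekStore.FetchDEK(volumeID)
	if err != nil {
		return "", err
	}
	return ve.KMS.DecryptDEK(volumeID, encrypted)
}
```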
@@ -14,7 +14,8 @@ KMS implementation. Or, if changes would be minimal, a configuration option to

one of the implementations can be added.

Different KMS implementations and their configurable options can be found at
[`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml).

### VaultTokensKMS
@@ -26,7 +27,8 @@ An example of the per Tenant configuration options are in

[`tenant-config.yaml`](../../../examples/kms/vault/tenant-config.yaml) and
[`tenant-token.yaml`](../../../examples/kms/vault/tenant-token.yaml).

Implementation is in [`vault_tokens.go`](../../../internal/util/vault_tokens.go).

### Vault
@@ -36,7 +38,7 @@ Implementation is in [`vault.go`](../../../internal/util/vault.go).

## Extension or New KMS implementation

Normally ServiceAccounts are provided by Kubernetes in the containers'
filesystem. This only allows a single ServiceAccount and is static for the
lifetime of the Pod. Ceph-CSI runs in the namespace of the storage
administrator, and has access to the single ServiceAccount linked in the
@@ -53,7 +55,7 @@ steps need to be taken:

  replace the default (`AuthKubernetesTokenPath:
  /var/run/secrets/kubernetes.io/serviceaccount/token`)

Currently, the Ceph-CSI components may read Secrets and ConfigMaps from the
Tenants namespace. These permissions need to be extended to allow Ceph-CSI to
read the contents of the ServiceAccount(s) in the Tenants namespace.
@@ -61,7 +63,8 @@ read the contents of the ServiceAccount(s) in the Tenants namespace.

### Global Configuration

1. a StorageClass links to a KMS configuration by providing the `kmsID`
   parameter
1. a ConfigMap in the namespace of the Ceph-CSI deployment contains the KMS
   configuration for the `kmsID`
   ([`csi-kms-connection-details.yaml`](../../../examples/kms/vault/csi-kms-connection-details.yaml))
@@ -76,8 +79,8 @@ configuration from the ConfigMap.

1. needs a ServiceAccount with a known name with permissions to connect to Vault
1. optional ConfigMap with options for Vault that override default settings

A `CreateVolume` request contains the owner (Namespace) of the Volume. The KMS
configuration indicates that additional attributes need to be fetched from the
Tenants namespace, so the provisioner will fetch these. The additional
configuration and ServiceAccount are merged in the provisioner's configuration
for the KMS-implementation while creating the volume.
@@ -1,11 +1,11 @@

# RBD MIRRORING

RBD mirroring is a process of replication of RBD images between two or more Ceph
clusters. Mirroring ensures point-in-time, crash-consistent RBD images between
clusters; RBD mirroring is mainly used for disaster recovery (i.e. having a
secondary site as a failover).
See [Ceph documentation](https://docs.ceph.com/en/latest/rbd/rbd-mirroring) on
RBD mirroring for complete information.

## Architecture
@@ -28,8 +28,8 @@ PersistentVolumeClaim (PVC) on the secondary site during the failover.

VolumeHandle to identify the OMAP data nor the image anymore, because we have
only the PoolID and ClusterID in the VolumeHandle. We cannot identify the
correct pool name from the PoolID because the pool name will remain the same on
both clusters but not the PoolID, and even the ClusterID can be different on the
secondary cluster.

> Sample PV spec which will be used by rbdplugin controller to regenerate OMAP
> data
@@ -56,10 +56,10 @@ csi:

```

> **VolumeHandle** is the unique volume name returned by the CSI volume plugin’s
> CreateVolume to refer to the volume on all subsequent calls.

Once the static PVC is created on the secondary cluster, the Kubernetes user can
try to delete the PVC, expand the PVC or mount the PVC. In case of mounting
(NodeStageVolume) we will get the volume context in the RPC call, but not in the
Delete/Expand request. In the Delete/Expand RPC request only the VolumeHandle
(`clusterID-poolID-volumeuniqueID`) will be sent, where it contains the encoded
@@ -73,17 +73,17 @@ secondary cluster as the PoolID and ClusterID always may not be the same.

To solve this problem, we will have a new controller (rbdplugin controller)
running as part of the provisioner pod which watches for the PV objects. When a
PV is created it will extract the required information from the PV spec, and it
will regenerate the OMAP data. Whenever Ceph-CSI gets an RPC request with an
older VolumeHandle, it will check if any new VolumeHandle exists for the old
VolumeHandle. If yes, it uses the new VolumeHandle for internal operations (to
get the pool name, Ceph monitor details from the ClusterID etc).

Currently, we are making use of watchers in the node stage request to make sure
a ReadWriteOnce (RWO) PVC is mounted on a single node at a given point in time.
We need to change the watchers logic in the node stage request because when we
enable RBD mirroring on an image, a watcher will be added on the RBD image by
the rbd mirroring daemon.

To solve the ClusterID problem, if the ClusterID is different on the second
cluster, the admin has to create a new ConfigMap for the mapped ClusterID's.
@@ -1,59 +1,57 @@

# RBD NBD VOLUME HEALER

- [RBD NBD VOLUME HEALER](#rbd-nbd-volume-healer)
  - [Rbd Nbd](#rbd-nbd)
    - [Advantages of userspace mounters](#advantages-of-userspace-mounters)
    - [Side effects of userspace mounters](#side-effects-of-userspace-mounters)
  - [Volume Healer](#volume-healer)
    - [More thoughts](#more-thoughts)

## Rbd nbd

The rbd CSI plugin will provision new rbd images and attach and mount those to
workloads. Currently, the default mounter is krbd, which uses the kernel rbd
driver to mount the rbd images onto the application pod. From here on, at
Ceph-CSI we will also have a userspace way of mounting the rbd images, via
rbd-nbd.

[Rbd-nbd](https://docs.ceph.com/en/latest/man/8/rbd-nbd/) is a client for RADOS
block device (rbd) images like the existing rbd kernel module. It will map an
rbd image to an nbd (Network Block Device) device, allowing access to it as a
regular local block device.

![csi-rbd-nbd](./images/csi-rbd-nbd.svg)

It’s worth making a note that the rbd-nbd processes will run on the client-side,
which is inside the `csi-rbdplugin` node plugin.

### Advantages of userspace mounters

- It is easier to add features to rbd-nbd as it is released regularly with Ceph,
  and more difficult and time consuming to add features to the kernel rbd module
  as that is part of the Linux kernel release schedule.
- Container upgrades will be independent of the host node, which means if there
  are any new features with rbd-nbd, we don’t have to reboot the node as the
  changes will be shipped inside the container.
- Because the container upgrades are host node independent, we will be a better
  citizen in K8s by switching to the userspace model.
- Unlike krbd, rbd-nbd uses the librbd user-space library that gets most of the
  development focus, and hence rbd-nbd will be feature-rich.
- Being entirely kernel space impacts fault-tolerance, as any kernel panic
  affects a whole node, not only the single pod that is using rbd storage.
  Thanks to rbd-nbd’s userspace design, we are less bothered here: krbd is a
  complete kernel and vendor-specific driver which needs changes on every
  feature basis, while rbd-nbd depends on the generic NBD driver and all the
  vendor-specific logic sits in userspace. It's worth noting that the generic
  NBD driver has been mostly unchanged for years and can be considered quite
  stable. Also, given NBD is a generic driver, there will be many eyes on it
  compared to the rbd driver.

### Side effects of userspace mounters

Since the rbd-nbd processes run per volume map on the client side, i.e. inside
the `csi-rbdplugin` node plugin, a restart of the node plugin will terminate all
the rbd-nbd processes, and there is no way to restore these processes back to
life currently, which could lead to IO errors on all the application pods.

![csi-plugin-restart](./images/csi-plugin-restart.svg)
@@ -61,42 +59,42 @@ This is where the Volume healer could help.

## Volume healer

The Volume healer runs on the start of the rbd node plugin and runs within the
node plugin driver context.

The Volume healer does the below (a hedged sketch of this loop follows the
list):

- Get the Volume attachment list for the current node where it is running
- Filter the volume attachments list through matching driver name and status
  attached
- For each volume attachment get the respective PV information and check the
  criteria of PV Bound, mounter type
- Build the StagingPath where the rbd image's PVC is mounted, based on the
  KUBELET path and the PV object
- Construct the NodeStageVolume() request and send the request to the CSI driver
- The NodeStageVolume() has a way to identify calls received from the healer and
  when executed from the healer context, it just runs in the minimal required
  form, where it fetches the previously mapped device to the image and the
  respective secrets, and finally ensures to bring the respective process back
  to life, thus enabling IO to continue
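Below is a hedged sketch of that loop. The struct fields mirror the
VolumeAttachment/PV data the healer consults, but the types, the staging-path
layout and the `nodeStageVolume` callback are illustrative assumptions rather
than real client-go or CSI API usage.

```go
// A simplified sketch of the healer loop described above; all types and paths
// here are stand-ins for the real Kubernetes and CSI objects.
package main

import "path/filepath"

type volumeAttachment struct {
	driver   string // spec.attacher
	nodeName string // spec.nodeName
	attached bool   // status.attached
	pvName   string // spec.source.persistentVolumeName
}

type persistentVolume struct {
	name         string
	bound        bool   // status.phase == Bound
	mounter      string // e.g. volumeAttributes["mounter"]
	volumeHandle string
}

// stagingPath builds the path where the image for this PV is staged by kubelet.
// The layout is an assumption for illustration; the real healer derives it from
// the kubelet root directory and the PV object.
func stagingPath(kubeletDir string, pv persistentVolume) string {
	return filepath.Join(kubeletDir, "plugins/kubernetes.io/csi/pv", pv.name, "globalmount")
}

func healNode(node, kubeletDir string, vas []volumeAttachment, pvs map[string]persistentVolume,
	nodeStageVolume func(volumeHandle, stagingPath string) error) {
	for _, va := range vas {
		// Only attachments for this driver, on this node, and already attached.
		if va.driver != "rbd.csi.ceph.com" || va.nodeName != node || !va.attached {
			continue
		}
		pv, ok := pvs[va.pvName]
		if !ok || !pv.bound || pv.mounter != "rbd-nbd" {
			continue // the healer only cares about bound PVs using the rbd-nbd mounter
		}
		// Re-issue NodeStageVolume so the rbd-nbd process is brought back to life.
		_ = nodeStageVolume(pv.volumeHandle, stagingPath(kubeletDir, pv))
	}
}

func main() {}
```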
### More thoughts

- Currently the NodeStageVolume() call is safeguarded by the global Ceph-CSI
  level lock (per volID) that needs to be acquired before doing any of the
  NodeStage, NodeUnstage, NodePublish, NodeUnpublish operations. Hence none of
  the operations happen in parallel.
- Any issues if the NodeUnstage is issued by kubelet?
  - This can not be a problem as we take a lock at the Ceph-CSI level
  - If the NodeUnstage succeeds, Ceph-CSI will return a StagingPath not found
    error, and we can then skip
  - If the NodeUnstage fails with an operation already going on, in the next
    NodeUnstage the volume gets unmounted
- What if the PVC is deleted?
  - If the PVC is deleted, the volume attachment list might already have been
    refreshed and the entry will be skipped/deleted at the healer.
  - For any reason, if the request bails out with Error NotFound, skip the
    PVC, assuming it might have been deleted or the NodeUnstage might have
    already happened.
- The Volume healer currently works with rbd-nbd, but the design can
  accommodate other userspace mounters (may be ceph-fuse).