# Snapshots as shallow read-only volumes

The CSI spec doesn't have a notion of "mounting a snapshot". Instead, the
idiomatic way of accessing snapshot contents is to first create a volume
populated with the snapshot contents, and then mount that volume to workloads.

CephFS exposes snapshots as special, read-only directories of a subvolume,
located in `<subvolume>/.snap`. cephfs-csi can already provision writable
volumes with snapshots as their data source, where snapshot contents are
cloned to the newly created volume. However, cloning a snapshot to a volume
is a very expensive operation in CephFS, as the data needs to be fully copied.
When the need is only to read snapshot contents, snapshot cloning is extremely
inefficient and wasteful.

This proposal describes a way for cephfs-csi to expose CephFS snapshots
as shallow, read-only volumes, without needing to clone the underlying
snapshot data.

## Use-cases

What's the point of such read-only volumes?

* **Restore snapshots selectively:** users may want to traverse snapshots,
  restoring data to a writable volume more selectively instead of restoring
  the whole snapshot.
* **Volume backup:** users can't back up a live volume; they first need to
  snapshot it. Once a snapshot is taken, it still can't be backed up, as
  backup tools usually work with volumes (which are exposed as file systems)
  and not snapshots (which may have a backend-specific format). This means
  that in order to create a snapshot backup, users have to clone the snapshot
  data twice:

  1. first, when restoring the snapshot into a temporary volume from which
     the data will be read,
  1. and second, when transferring that volume into some backup/archive
     storage (e.g. an object store).

  The temporary backed-up volume will most likely be thrown away after the
  backup transfer is finished. That's a lot of wasted work for what we
  originally wanted to do! Having the ability to create volumes from
  snapshots cheaply would be a big improvement for this use case.

## Alternatives

* _Snapshots are stored in `<subvolume>/.snap`. Users could simply visit this
  directory by themselves._

  `.snap` is a CephFS-specific detail of how snapshots are exposed. Users /
  tools may not be aware of this special directory, or it may not fit their
  workflow. At the moment, the idiomatic way of accessing snapshot contents
  in CSI drivers is by creating a new volume and populating it with the
  snapshot contents.

## Design

Key points:

* Volume source is a snapshot, volume access mode is `*_READER_ONLY`.
* No actual new subvolumes are created in CephFS.
* The resulting volume is a reference to the source subvolume snapshot.
  This reference would be stored in the `Volume.volume_context` map. In order
  to reference a snapshot, we need the subvolume name and the snapshot name.
* Mounting such a volume means mounting the respective CephFS subvolume
  and exposing the snapshot to workloads.
* For brevity, let's call a *shallow read-only volume with a subvolume
  snapshot as its data source* just a *shallow volume* from here on out.
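
As a rough sketch, the reference stored in `Volume.volume_context` would need
to carry at least the following pieces of information. This is a hypothetical
Go shape; the actual parameter names are discussed in the final section of
this document:

```go
package shallowvol

// shallowVolumeContext sketches what a shallow volume's
// Volume.volume_context entries would need to carry. The names are
// placeholders, not a finalized schema.
type shallowVolumeContext struct {
	IsShallow     bool   // marks the volume as a snapshot reference
	SubvolumeName string // subvolume that owns the snapshot
	SnapshotName  string // snapshot exposed by this volume
}
```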

### Controller operations

Care must be taken when handling the lifetimes of the relevant storage
resources. When a shallow volume is created, what would happen if:

* _The parent subvolume of the snapshot is removed while the shallow volume
  still exists?_

  This shouldn't be a problem. The parent subvolume either has the
  `snapshot-retention` subvolume feature, in which case its snapshots remain
  available, or it doesn't have that feature, in which case it will fail to
  be deleted because it still has snapshots associated with it.

* _The source snapshot from which the shallow volume originates is removed
  while that shallow volume still exists?_

  We need to make sure this doesn't happen, so some book-keeping is
  necessary. Ideally we could employ some kind of reference counting.

#### Reference counting for shallow volumes

As mentioned above, this is to protect shallow volumes should their source
snapshot be requested for deletion.

When creating a volume snapshot, a reference tracker (RT), represented by a
RADOS object, would be created for that snapshot. It would store the
information required to track the references to the backing subvolume
snapshot. Upon a `CreateSnapshot` call, the RT would be initialized with a
single reference record, where the CSI snapshot itself is the first reference
to the backing snapshot. Each subsequent shallow volume creation would add a
new reference record to the RT object, and each shallow volume deletion would
remove that reference from the RT object. Calling `DeleteSnapshot` would
remove the reference record that was previously added in `CreateSnapshot`.

The subvolume snapshot would be removed from the Ceph cluster only once the
RT object holds no references. Note that this behavior would permit calling
`DeleteSnapshot` even if the snapshot is still referenced by shallow volumes:

* `DeleteSnapshot`:
  * RT holds no references, or the RT object doesn't exist:
    delete the backing snapshot too.
  * RT holds at least one reference: keep the backing snapshot.
* `DeleteVolume`:
  * RT holds no references: delete the backing snapshot too.
  * RT holds at least one reference: keep the backing snapshot.

To enable creating shallow volumes from snapshots that were provisioned by
older versions of cephfs-csi (i.e. before this feature is introduced),
`CreateVolume` for shallow volumes would also create the RT object in case
it's missing. It would be initialized with two references: the source
snapshot and the newly created shallow volume.
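
To make the deletion rules above concrete, here is a minimal sketch of the
bookkeeping logic, assuming a hypothetical `refTracker` interface over the
RADOS-backed RT object. None of these names exist in cephfs-csi today:

```go
package shallowvol

// refTracker is a hypothetical interface over the RADOS-backed reference
// tracker (RT) object described above.
type refTracker interface {
	// remove drops one reference record and reports how many remain.
	// A missing RT object counts as zero remaining references.
	remove(ref string) (remaining int, err error)
}

// onDeleteSnapshot implements the DeleteSnapshot rule: drop the CSI
// snapshot's own reference, and delete the backing CephFS snapshot only
// if no shallow volumes still reference it.
func onDeleteSnapshot(rt refTracker, csiSnapID string) error {
	remaining, err := rt.remove(csiSnapID)
	if err != nil {
		return err
	}
	if remaining == 0 {
		return deleteBackingSnapshot()
	}
	return nil // shallow volumes still reference the snapshot; keep it
}

// onDeleteVolume implements the DeleteVolume rule for shallow volumes:
// drop the volume's reference, and delete the backing snapshot only when
// this was the last reference (i.e. DeleteSnapshot was already called).
func onDeleteVolume(rt refTracker, shallowVolID string) error {
	remaining, err := rt.remove(shallowVolID)
	if err != nil {
		return err
	}
	if remaining == 0 {
		return deleteBackingSnapshot()
	}
	return nil
}

// deleteBackingSnapshot is a stub standing in for the actual removal of
// the subvolume snapshot from the Ceph cluster.
func deleteBackingSnapshot() error { return nil }
```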

##### Concurrent access to RT objects

The RADOS API provides compound atomic read and write operations. These will
be used to implement the reference tracking functionality, protecting
modifications of reference records.
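
For instance, adding a reference record could be a single compound write
operation via the go-ceph bindings. This is only a sketch, assuming reference
records are stored as omap keys on the RT object; the object and key naming
is illustrative:

```go
package shallowvol

import "github.com/ceph/go-ceph/rados"

// addReference atomically adds a reference record, stored as an omap key,
// to the RT object. The existence assertion and the omap update are
// applied as a single atomic RADOS operation.
func addReference(ioctx *rados.IOContext, rtObject, ref string) error {
	op := rados.CreateWriteOp()
	defer op.Release()

	op.AssertExists() // fail if the RT object disappeared in the meantime
	op.SetOmap(map[string][]byte{ref: nil})

	return op.Operate(ioctx, rtObject, rados.OperationNoFlag)
}
```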

#### `CreateVolume`

A read-only volume with a snapshot source would be created under these
conditions:

1. `CreateVolumeRequest.volume_content_source` is a snapshot,
1. `CreateVolumeRequest.volume_capabilities[*].access_mode` is any of the
   read-only volume access modes,
1. possibly other volume parameters in `CreateVolumeRequest.parameters`
   specific to shallow volumes.

`CreateVolumeResponse.Volume.volume_context` would then contain the necessary
information to identify the source subvolume / snapshot.
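
A minimal sketch of this check against the CSI spec's generated Go types;
`isShallowVolumeRequest` is a name made up for this sketch, not an existing
cephfs-csi helper:

```go
package shallowvol

import "github.com/container-storage-interface/spec/lib/go/csi"

// isShallowVolumeRequest reports whether a CreateVolume request matches
// the conditions above: a snapshot content source combined with
// exclusively read-only access modes.
func isShallowVolumeRequest(req *csi.CreateVolumeRequest) bool {
	if req.GetVolumeContentSource().GetSnapshot() == nil {
		return false // condition 1: volume source must be a snapshot
	}

	for _, vc := range req.GetVolumeCapabilities() {
		switch vc.GetAccessMode().GetMode() {
		case csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY,
			csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY:
			// condition 2: a read-only access mode
		default:
			return false
		}
	}

	return true
}
```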

Things to look out for:

* _What's the volume size?_

  A shallow volume doesn't consume any space on the filesystem.
  `Volume.capacity_bytes` is allowed to contain zero, so we could use that.

* _What should be the requested size when creating the volume (specified
  e.g. in a PVC)?_

  This one is tricky. The CSI spec allows
  `CreateVolumeRequest.capacity_range.{required_bytes,limit_bytes}` to be
  zero. On the other hand,
  `PersistentVolumeClaim.spec.resources.requests.storage` must be bigger
  than zero. cephfs-csi doesn't care about the requested size (the volume
  will be read-only, so it has no usable capacity) and would always set it
  to zero. This shouldn't cause any problems for the time being, but it is
  still something we should keep in mind.

`CreateVolume` behavior when using a volume as the volume source (PVC-PVC
clone):

| New volume     | Source volume  | Behavior                                                                          |
|----------------|----------------|-----------------------------------------------------------------------------------|
| shallow volume | shallow volume | Create a new reference to the parent snapshot of the source shallow volume.       |
| regular volume | shallow volume | Equivalent to a request to create a regular volume with a snapshot as its source. |
| shallow volume | regular volume | Such a request doesn't make sense and `CreateVolume` should return an error.      |
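
The table could translate into a dispatch along these lines; this is a
sketch, and the helper functions are hypothetical stand-ins for the real
operations:

```go
package shallowvol

import "errors"

var errShallowFromRegular = errors.New(
	"cannot create a shallow volume from a regular volume")

// cloneVolume sketches the PVC-PVC clone dispatch from the table above.
// newIsShallow and srcIsShallow would be derived from the request's
// access modes and the source volume's context, respectively.
func cloneVolume(newIsShallow, srcIsShallow bool) error {
	switch {
	case newIsShallow && srcIsShallow:
		// Add a new reference to the parent snapshot of the source volume.
		return addSnapshotReference()
	case !newIsShallow && srcIsShallow:
		// Behaves like CreateVolume with the parent snapshot as the
		// source: snapshot contents are cloned into a new subvolume.
		return cloneFromSnapshot()
	case newIsShallow && !srcIsShallow:
		return errShallowFromRegular
	default:
		return cloneRegularVolume() // ordinary PVC-PVC clone
	}
}

// Stubs standing in for the real operations.
func addSnapshotReference() error { return nil }
func cloneFromSnapshot() error    { return nil }
func cloneRegularVolume() error   { return nil }
```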

#### `DeleteVolume`

Volume deletion is trivial.

#### `CreateSnapshot`

Snapshotting read-only volumes doesn't make sense in general, and should
be rejected.

#### `ControllerExpandVolume`

Same as above: expanding read-only volumes doesn't make sense in general,
and should be rejected.

## Node operations

Two cases need to be considered:

* (a) Volume/snapshot provisioning is handled by cephfs-csi.
* (b) Volume/snapshot provisioning is handled externally (e.g. pre-provisioned
  manually, or by OpenStack Manila, ...).

### `NodeStageVolume`, `NodeUnstageVolume`

Here we're mounting the source subvolume onto the node. Subsequent volume
publish calls then use bind mounts to expose the snapshot directory located
in `.snap/<SNAPSHOT DIRECTORY NAME>`. Unfortunately, we cannot mount
snapshots directly because they are not visible at mount time. We need to
mount the whole subvolume first, and only then perform the bind mounts to
the target paths.

#### For case (a)

Subvolume paths are normally retrieved by
`ceph fs subvolume info/getpath <VOLUME NAME> <SUBVOLUME NAME> <SUBVOLUMEGROUP NAME>`,
which outputs a path like so:

```
/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>
```

Snapshots are then accessible in:

* `/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/.snap` and
* `/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>/.snap`.

`/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>` may be deleted if the source
subvolume is deleted, but thanks to the `snapshot-retention` feature, the
snapshots in `/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/.snap` will remain
available.

The CephFS mount should therefore have its root set to the parent of what
`fs subvolume getpath` returns, i.e. `/volumes/<VOLUME NAME>/<SUBVOLUME NAME>`.
That way, snapshots remain available regardless of whether the subvolume
itself still exists.
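
Deriving that mount root is then a single path operation; a sketch, with a
helper name of our own choosing:

```go
package shallowvol

import "path"

// subvolumeMountRoot returns the directory that should become the CephFS
// mount root for a shallow volume: the parent of the path reported by
// `ceph fs subvolume getpath`, so that retained snapshots in
// <parent>/.snap stay reachable even after the subvolume is deleted.
func subvolumeMountRoot(getpathOutput string) string {
	// "/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>"
	//   -> "/volumes/<VOLUME NAME>/<SUBVOLUME NAME>"
	return path.Dir(getpathOutput)
}
```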

#### For case (b)

For cases where subvolumes are managed externally and not by cephfs-csi, we
must assume that the cephx user we're given can access only
`/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>`, so users won't be able to
benefit from snapshot retention. Users will need to be careful not to delete
the parent subvolumes and snapshots while they are referenced by these
shallow RO volumes.

### `NodePublishVolume`, `NodeUnpublishVolume`

Node publish is trivial. We bind the staging path to the target path as a
read-only mount.

### `NodeGetVolumeStats`

`NodeGetVolumeStatsResponse.usage[*].available` should always be zero.

## Volume parameters, volume context

This section discusses what volume parameters and volume context parameters
will be used to convey the necessary information to the cephfs-csi driver in
order to support shallow volumes.

Volume parameters `CreateVolumeRequest.parameters`:

* Should "shallow" be the default mode for all `CreateVolume` calls that have
  (a) a snapshot as the data source and (b) a read-only volume access mode?
  If not, a new volume parameter should be introduced, e.g. `isShallow: <bool>`.
  On the other hand, does it even make sense for users to want to create full
  copies of snapshots and still have them read-only?

Volume context `Volume.volume_context`:

* Here we definitely need `isShallow` or similar. Without it, we wouldn't be
  able to distinguish between a regular volume that just happens to have a
  read-only access mode, and a volume that references a snapshot.
* Currently cephfs-csi recognizes `subvolumePath` for dynamically provisioned
  volumes and `rootPath` for pre-provisioned volumes. As mentioned in the
  [`NodeStageVolume`, `NodeUnstageVolume` section](#NodeStageVolume-NodeUnstageVolume),
  snapshots cannot be mounted directly. How do we pass in the path to the
  parent subvolume?
  * a) The path to the snapshot is passed in via `subvolumePath` / `rootPath`,
    e.g. `/volumes/<VOLUME NAME>/<SUBVOLUME NAME>/<UUID>/.snap/<SNAPSHOT NAME>`.
    From that we can derive the path to the subvolume: it's the parent of
    the `.snap` directory.
  * b) Similar to a), the path to the snapshot is passed in via
    `subvolumePath` / `rootPath`, but instead of trying to derive the right
    path, we introduce another volume context parameter containing the path
    to the parent subvolume explicitly.
  * c) `subvolumePath` / `rootPath` contains the path to the parent
    subvolume, and we introduce another volume context parameter containing
    the name of the snapshot. The path to the snapshot is then formed by
    appending `/.snap/<SNAPSHOT NAME>` to the subvolume path.
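
To illustrate, options a) and c) boil down to the following two path
manipulations; this is a sketch and the helper names are ours:

```go
package shallowvol

import "path"

// Option a): the volume context carries the full snapshot path, and the
// parent subvolume path is derived by stripping the trailing
// ".snap/<SNAPSHOT NAME>" components (the parent of the .snap directory).
func subvolumePathFromSnapshotPath(snapshotPath string) string {
	// ".../<SUBVOLUME PATH>/.snap/<SNAPSHOT NAME>" -> ".../<SUBVOLUME PATH>"
	return path.Dir(path.Dir(snapshotPath))
}

// Option c): the volume context carries the parent subvolume path and the
// snapshot name, and the snapshot path is formed by appending
// ".snap/<SNAPSHOT NAME>".
func snapshotPathFromSubvolumePath(subvolumePath, snapshotName string) string {
	return path.Join(subvolumePath, ".snap", snapshotName)
}
```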