As detailed in issue #279, current lock scheme has hash
buckets that are count of CPUs. This causes a lot of contention
when parallel requests are made to the CSI plugin. To reduce
lock contention, this commit introduces granular locks per
identifier.
The commit also changes the timeout for gRPC requests to Create
and Delete volumes, as the current timeout is 10s (kubernetes
documentation says 15s but code defaults are 10s). A virtual
setup takes about 12-15s to complete a request at times, that leads
to unwanted retries of the same request, hence the increased
timeout to enable operation completion with minimal retries.
Tests to create PVCs before and after these changes look like so,
Before:
Default master code + sidecar provisioner --timeout option set
to 30 seconds
20 PVCs
Creation: 3 runs, 396/391/400 seconds
Deletion: 3 runs, 218/271/118 seconds
- Once was stalled for more than 8 minutes and cancelled the run
After:
Current commit + sidecar provisioner --timeout option set to 30 sec
20 PVCs
Creation: 3 runs, 42/59/65 seconds
Deletion: 3 runs, 32/32/31 seconds
Fixes: #279
Signed-off-by: ShyamsundarR <srangana@redhat.com>
RBD plugin needs only a single ID to manage images and operations against a
pool, mentioned in the storage class. The current scheme of 2 IDs is hence not
needed and removed in this commit.
Further, unlike CephFS plugin, the RBD plugin splits the user id and the key
into the storage class and the secret respectively. Also the parameter name
for the key in the secret is noted in the storageclass making it a variant and
hampers usability/comprehension. This is also fixed by moving the id and the key
to the secret and not retaining the same in the storage class, like CephFS.
Fixes#270
Testing done:
- Basic PVC creation and mounting
Signed-off-by: ShyamsundarR <srangana@redhat.com>
The SnapshotClass for RBD requires a pool parameter. This is redundant
as a snapshot is not created on a different pool than the source image
of the snapshot (refer rbd man page).
Further, when a snapshot needs to be created its source CSI VolumeID
is passed to the creation call, and hence the source volumes pool
needs to be reused to create the snapshot.
Similarly to clone a snapshot, the create request would come in with a
SnapshotID to help identify the snapshot pool, and the same create
request parameters would contain the storage class based pool parameter
to create the clone into (as clones can be in different pools as
compared to their parent snapshots).
Thus, the parameter pool seems redundant in the snapshot class and
should be removed to improve ease of use.
Fixes#379
Signed-off-by: ShyamsundarR <srangana@redhat.com>
This is a part of the stateless set of commits for CephCSI.
This commit removes the dependency on config maps to store cephFS provisioned
volumes, and instead relies on RADOS based objects and keys, and required
CSI VolumeID encoding to detect the provisioned volumes.
Changes:
- Provide backward compatibility to provisioned volumes by older plugin versions (1.0.0 or older)
- Remove Create/Delete support for statically provisioned volumes (fixes#382)
- Added namespace support to RADOS OMaps and used the same to store RADOS CSI objects and keys in the CephFS metadata pool
- Added support to mention fsname for CephFS provisioning (fixes#359)
- Changed field name in CSI Identifier to 'location', to denote a pool or fscid
- Updated mounter cache to use new scheme
- Required Helm manifests are updated
- Required documentation and other manifests are updated
- Made driver option 'metadatastorage' as optional, as fresh installs do not need to specify the same
Testing done:
- Create/Mount/Delete PVC
- Create/Delete 5 PVCs
- Mount version 1.0.0 PVC
- Delete version 1.0.0 PV
- Mount Statically defined PV/PVC/Pod
- Mount Statically defined version 1.0.0 PV/PVC/Pod
- Delete Statically defined version 1.0.0 PV/PVC/Pod
- Node restart when mounted to test mountcache
- Use InstanceID other than 'default'
- RBD basic round of tests, as namespace is added to OMaps
- csitest against ceph-fs plugin
- NOTE: CephFS plugin still does not detect and address already created
volumes but of a different size
- Test not providing any value to the metadata storage parameter
Signed-off-by: ShyamsundarR <srangana@redhat.com>
Existing config maps are now replaced with rados omaps that help
store information regarding the requested volume names and the rbd
image names backing the same.
Further to detect cluster, pool and which image a volume ID refers
to, changes to volume ID encoding has been done as per provided
design specification in the stateless ceph-csi proposal.
Additional changes and updates,
- Updated documentation
- Updated manifests
- Updated Helm chart
- Addressed a few csi-test failures
Signed-off-by: ShyamsundarR <srangana@redhat.com>
Based on the review comments addressed the following,
- Moved away from having to update the pod with volumes
when a new Ceph cluster is added for provisioning via the
CSI driver
- The above now used k8s APIs to fetch secrets
- TBD: Need to add a watch mechanisim such that these
secrets can be cached and updated when changed
- Folded the Cephc configuration and ID/key config map
and secrets into a single secret
- Provided the ability to read the same config via mapped
or created files within the pod
Tests:
- Ran PV creation/deletion/attach/use using new scheme
StorageClass
- Ran PV creation/deletion/attach/use using older scheme
to ensure nothing is broken
- Did not execute snapshot related tests
Signed-off-by: ShyamsundarR <srangana@redhat.com>
This commit provides the option to pass in Ceph cluster-id instead
of a MON list from the storage class.
This helps in moving towards a stateless CSI implementation.
Tested the following,
- PV provisioning and staging using cluster-id in storage class
- PV provisioning and staging using MON list in storage class
Did not test,
- snapshot operations in either forms of the storage class
Signed-off-by: ShyamsundarR <srangana@redhat.com>
This commit reverts the initial implementation of the
multi-node-multi-writer feature:
commit: b5b8e46460
It replaces that implementation with a more restrictive version that
only allows multi-node-multi-writer for volumes of type `block`
With this change there are no volume parameters required in the stoarge
class, we also fail any attempt to create a file based device with
multi-node-multi-write being specified, this way a user doesn't have to
wait until they try and do the publish before realizing it doesn't work.
This change adds the ability to define a `multiNodeWritable` option in
the Storage Class.
This change does a number of things:
1. Allow multi-node-multi-writer access modes if the SC options is
enabled
2. Bypass the watcher checks for MultiNodeMultiWriter Volumes
3. Maintains existing watcher checks for SingleNodeWriter access modes
regardless of the StorageClass option.
fix lint-errors
currently all the created volumes are
stored in the metadata store, so we
can use this information to support
list volumes.
Signed-off-by: Madhu Rajanna <mrajanna@redhat.com>
The timeout value in external-provisioner is fairly low. It's not
uncommon that it times out and retries before the rbdplugin is done
with CreateVolume. rbdplugin has to serialize calls and ensure that
they are idempotent to deal with this.
When the initial DeleteVolume times out (as it does on slow clusters
due to the low 10 second limit), the external-provisioner calls it
again. The CSI standard requires the second call to succeed if the
volume has been deleted in the meantime. This didn't work because
DeleteVolume returned an error when failing to find the volume info
file:
rbdplugin: E1008 08:05:35.631783 1 utils.go:100] GRPC error: rbd: open err /var/lib/kubelet/plugins/csi-rbdplugin/controller/csi-rbd-622a252c-cad0-11e8-9112-deadbeef0101.json/open /var/lib/kubelet/plugins/csi-rbdplugin/controller/csi-rbd-622a252c-cad0-11e8-9112-deadbeef0101.json: no such file or directory
The fix is to treat a missing volume info file as "volume already
deleted" and return success. To detect this, the original os error
must be wrapped, otherwise the caller of loadVolInfo cannot determine
the root cause.
Note that further work may be needed to make the driver really
resilient, for example there are probably concurrency issues.
But for now this fixes: #82