Problem:
--------
1. rbd-nbd by default logs to /var/log/ceph/ceph-client.admin.log,
Unfortunately, container doesn't have /var/log/ceph directory hence
rbd-nbd is not logging now.
2. Rbd-nbd logs are not persistent across nodeplugin restarts.
Solution:
--------
Provide a host path so that log directory is made available, and the
logs persist on the hostnode across container restarts.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
- mount host's /etc/selinux in node plugins
- process mount options in all code paths for cephfs volume options
Signed-off-by: Alexandre Lossent <alexandre.lossent@cern.ch>
This change resolves a typo for installing the CSIDriver
resource in Kubernetes clusters before 1.18,
where the apiVersion is incorrect.
See also:
https://kubernetes-csi.github.io/docs/csi-driver-object.html
[ndevos: replace v1betav1 in examples with v1beta1]
Signed-off-by: Thomas Kooi <t.j.kooi@avisi.nl>
Problem:
-------
For rbd nbd userspace mounter backends, after a restart of the nodeplugin
all the mounts will start seeing IO errors. This is because, for rbd-nbd
backends there will be a userspace mount daemon running per volume, post
restart of the nodeplugin pod, there is no way to restore the daemons
back to life.
Solution:
--------
The volume healer is a one-time activity that is triggered at the startup
time of the rbd nodeplugin. It navigates through the list of volume
attachments on the node and acts accordingly.
For now, it is limited to nbd type storage only, but it is flexible and
can be extended in the future for other backend types as needed.
From a few feets above:
This solves a severe problem for nbd backed csi volumes. The healer while
going through the list of volume attachments on the node, if finds the
volume is in attached state and is of type nbd, then it will attempt to
fix the rbd-nbd volumes by sending a NodeStageVolume request with the
required volume attributes like secrets, device name, image attributes,
and etc.. which will finally help start the required rbd-nbd daemons in
the nodeplugin csi-rbdplugin container. This will allow reattaching the
backend images with the right nbd device, thus allowing the applications
to perform IO without any interruptions even after a nodeplugin restart.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Nodeplugin needs below cluster roles:
persistentvolumes: get
volumeattachments: list, get
These additional permissions are needed by the volume healer. Volume healer
aims at fixing the volume health issues at the very startup time of the
nodeplugin. As part of its operations, volume healer has to run through
the list of volume attachments and understand details about each
persistentvolume.
The later commits will use these additional cluster roles.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
The provisioner and node-plugin have the capability to connect to
Hashicorp Vault with a ServiceAccount from the Namespace where the PVC
is created. This requires permissions to read the contents of the
ServiceAccount from an other Namespace than where Ceph-CSI is deployed.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
csidriver object can be created on the kubernetes
for below reason.
If a CSI driver creates a CSIDriver object,
Kubernetes users can easily discover the CSI
Drivers installed on their cluster
(simply by issuing kubectl get CSIDriver)
Ref: https://kubernetes-csi.github.io/docs/csi-driver-object.html#what-is-the-csidriver-object
attachRequired is always required to be set to
true to avoid issue on RWO PVC.
more details about it at https://github.com/rook/rook/pull/4332
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
use the latest version of csi-snapshotter sidecar image at the
provisioner templates
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
set system-cluster-critical priorityclass on
provisioner pods. the system-cluster-critical is
having lowest priority compared to node-critical.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
set system-node-critical priority on the plugin
pods, as its the highest priority and this need to
be applied on plugin pods as its critical for
storage in cluster.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as provisioner need to get the configmap from
different namespace to check tenant configuration.
added the clusterrole get access for the same.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Tenants can have their own ConfigMap that contains connection parameters
to the Vault Service where the PV encyption keys are located. It is
possible for a Tenant to use a different Vault Service than the one
configured by the Storage Admin who deployed Ceph-CSI.
For this, the node-plugin needs to be able to read the ConfigMap from
the Tenants namespace.
See-also: docs/design/proposals/encryption-with-vault-tokens.md
Signed-off-by: Niels de Vos <ndevos@redhat.com>
if the kms encryption configmap is not mounted
as a volume to the CSI pods, add the code to
read the configuration from the kubernetes. Later
the code to fetch the configmap will be moved to
the new sidecar which is will talk to respective
CO to fetch the encryption configurations.
The k8s configmap uses the standard vault spefic
names to add the configurations. this will be converted
back to the CSI configurations.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
In order to fetch the Kubernetes Secret with the Vault Token for a
Tenant, the ClusterRole needs to allow reading Secrets from all
Kubernetes Namespaces (each Tenant has their own Namespace).
Signed-off-by: Niels de Vos <ndevos@redhat.com>
This argument in csi-provisioner sidecar allows us to receive pv/pvc
name/namespace metadata in the createVolume() request.
For ex:
csi.storage.k8s.io/pvc/name
csi.storage.k8s.io/pvc/namespace
csi.storage.k8s.io/pv/name
This is a useful information which can be used depend on the use case we
have at our driver. The features like vault token enablement for multi
tenancy, RBD mirroring ..etc can consume this based on the need.
Refer: #1305
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
external-provisioner is exposing a new argument
to set the default fstype while starting the provisioner
sidecar, if the fstype is not specified in the storageclass
the default fstype will be applied for the pvc created from
the storageclass.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
with csi-provisioner v2.x the topology based
provisioning will not have any backward compatibility
with older version of kubernetes, if the nodes are
not labeled with topology keys, the pvc creation
is going to get fail with error `accessibility
requirements: no available topology found`, disabling
the topology based provisioning by default, if user want
to use it he can always enable it.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This PR makes the changes in csi templates and
upgrade documentation required for updating
csi sidecar images.
Signed-off-by: Mudit Agarwal <muagarwa@redhat.com>
updated deployment template for the new controller and
also added `update` configmap RBAC for the controller
as the controller uses the configmap for the leader
election.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The added anti-affinity rules prevent provisioner operators from scheduling on
the same nodes. The kubernetes scheduler will spread the pods across nodes to
improve availability during node failures.
Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
The lifecycle preStop hook fails on container stop / exit
because /bin/sh is not present in the driver registrar container
image.
the driver-registrar will remove the socket file
before stopping. we dont need to have any preStop hook
to remove the socket as it was not working as expected
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The aggregate clusterrole were designed for the scenario where
the rules are not completely owned by one component.
the aggregate rules can be removed and simplify
certain issues around upgrades.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as v1.0.0 is deprecated we need to remove the support
for it in the Next coming (v3.0.0) release. This PR
removes the support for the same.
closes#882
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
added Hardlimit and Softlimit flags for cephcsi
arguments. When the Softlimit is reached cephcsi
will start a background task to flatten the rbd
image and return success and if the hardlimit
is reached it will start a background task
to flatten the rbd image and return ready
to use as false to make sure that the image
will not be used until it is flatten.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Considering this parameter is available for other sidecars we should
have a parity between the sidecars. Adding it for the same reason
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
unlike other containers, image pull policy of
csi-snapshotter was set to "Always", which can
be changed to pull only if not present.
Signed-off-by: Yug Gupta <ygupta@redhat.com>
Recently resizer 0.5.0 has been released.
This PR updated the resizer container from
v0.4.0 to v0.5.0
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
As kubernetes CSI sidecar is exposing the
GRPC mertics we can make use of the same in
ceph-csi we dont need to expose our own.
update: #881
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
There are currently unwanted RBAC permission
is given for ceph-csi, This PR reduces removes
such unwanted RBAC resources.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
- adds proposal document for PVC encryption from PR448
- adds per-volume encription by generating encryption passphrase
for each volume and storing it in a KMS
- adds HashiCorp Vault integration as a KMS for encryption passphrases
- avoids encrypting volume second time if it was already encrypted but
no file system created
- avoids unnecessary checks if volume is a mapped device when encryption
was not requested
- prevents resizing encrypted volumes (it is not currently supported)
- prevents creating snapshots from encrypted volumes to prevent attack
on encryption key (security guard until re-encryption of volumes
implemented)
Signed-off-by: Vasyl Purchel vasyl.purchel@workday.com
Signed-off-by: Andrea Baglioni andrea.baglioni@workday.comFixes#420Fixes#744
`/run/mount` need to be share between host and
csi-plugin containers for `/run/mount/utab`
this is required to ensures that the network
is not stopped prior to unmounting the network devices.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
On systems with SELinux enabled, non-privileged containers
can't access data of privileged containers. Since the socket
is exposed by privileged containers, all sidecars must be
privileged too. This is needed only for containers running
in daemonset as we are using bidirectional mounts in daemonset
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
currently, we are making use of host path directory
to store the provisioner socket, as this
the socket is not needed by anyone else other than
containers inside the provisioner pod using the
empty directory to store this socket is the best option.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
this time out value to 150s or higher. The higher timeout value can help to reduce the
load of our backend ceph cluster and also can avoid throttling issues at sidecars to an extent.
Fix# #602
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
rootfs dependency was removed from rbd
by removing support for `nsenter`, This
PR removed the `/` mount from provisioner
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
if both controller and nodeserver flags are set/unset
cephcsi will start both server,
if only one flag is set, it will start relavent
service.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The container runtime CRI-O limits the number of PIDs to 1024 by
default. When many PVCs are requested at the same time, it is possible
for the provisioner to start too many threads (or go routines) and
executing 'rbd' commands can start to fail. In case a go routine can not
get started, the process panics.
The PID limit can be changed by passing an argument to kubelet, but this
will affect all pids running on a host. Changing the parameters to
kubelet is also not a very elegant solution.
Instead, the provisioner pod can change the configuration itself. The
pod is running in privileged mode and can write to /sys/fs/cgroup where
the limit is configured.
With this change, the limit is configured to 'max', just as if there is
no limit at all. The logs of the csi-rbdplugin in the provisioner pod
will reflect the change it makes when starting the service:
$ oc -n rook-ceph logs -c csi-rbdplugin csi-rbdplugin-provisioner-0
..
I0726 13:59:19.737678 1 cephcsi.go:127] Initial PID limit is set to 1024
I0726 13:59:19.737746 1 cephcsi.go:136] Reconfigured PID limit to -1 (max)
..
It is possible to pass a different limit on the commandline of the
cephcsi executable. The following flag has been added:
--pidlimit=<int> the PID limit to configure through cgroups
This accepts special values -1 (max) and 0 (default, do not
reconfigure). Other integers will be the limit that gets configured in
cgroups.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
This change also starts mapping nbd based access using ther rbd CLI
as, it is a prerequisite to get device listing for nbd as well.
Signed-off-by: ShyamsundarR <srangana@redhat.com>
Use Deployment with leader election instead of StatefulSet
Deployment behaves better when a node gets disconnected
from the rest of the cluster - new provisioner leader
is elected in ~15 seconds, while it may take up to
5 minutes for StatefulSet to start a new replica.
Refer: kubernetes-csi/external-provisioner@52d1fbc
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Every Ceph CLI that is invoked at present passes the key via the
--key option, and hence is exposed to key being displayed on
the host using a ps command or such means.
This commit addresses this issue by stashing the key in a tmp
file, which is again created on a tmpfs (or empty dir backed by
memory). Further using such tmp files as arguments to the --keyfile
option for every CLI that is invoked.
This prevents the key from being visible as part of the argument list
of the invoked program on the system.
Fixes: #318
Signed-off-by: ShyamsundarR <srangana@redhat.com>
in NodeStage RPC call we have to map the
device to the node plugin and make sure the
the device will be mounted to the global path
in nodeUnstage request unmount the device from
global path and unmap the device
if the volume mode is block we will be creating
a file inside a stageTargetPath and it will be
considered as the global path
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
As detailed in issue #279, current lock scheme has hash
buckets that are count of CPUs. This causes a lot of contention
when parallel requests are made to the CSI plugin. To reduce
lock contention, this commit introduces granular locks per
identifier.
The commit also changes the timeout for gRPC requests to Create
and Delete volumes, as the current timeout is 10s (kubernetes
documentation says 15s but code defaults are 10s). A virtual
setup takes about 12-15s to complete a request at times, that leads
to unwanted retries of the same request, hence the increased
timeout to enable operation completion with minimal retries.
Tests to create PVCs before and after these changes look like so,
Before:
Default master code + sidecar provisioner --timeout option set
to 30 seconds
20 PVCs
Creation: 3 runs, 396/391/400 seconds
Deletion: 3 runs, 218/271/118 seconds
- Once was stalled for more than 8 minutes and cancelled the run
After:
Current commit + sidecar provisioner --timeout option set to 30 sec
20 PVCs
Creation: 3 runs, 42/59/65 seconds
Deletion: 3 runs, 32/32/31 seconds
Fixes: #279
Signed-off-by: ShyamsundarR <srangana@redhat.com>
Deployment behaves better when a node gets disconnected from the rest of
the cluster - new provisioner leader is elected in ~15 seconds, while
it may take up to 5 minutes for StatefulSet to start a new replica.
Refer: 52d1fbcf9dFixes: #335
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
currently, we have 3 docker files(cephcsi,rbd,cephfs) in the ceph-csi repo.
[commit ](85e121ebfe)
added by John to build a single image which can act as rbd or
cephfs based on the input configuration.
This PR updates the makefile and kubernetes templates to use
the unified image and also its deletes the other two dockerfiles.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Existing config maps are now replaced with rados omaps that help
store information regarding the requested volume names and the rbd
image names backing the same.
Further to detect cluster, pool and which image a volume ID refers
to, changes to volume ID encoding has been done as per provided
design specification in the stateless ceph-csi proposal.
Additional changes and updates,
- Updated documentation
- Updated manifests
- Updated Helm chart
- Addressed a few csi-test failures
Signed-off-by: ShyamsundarR <srangana@redhat.com>
The kubernetes manifests and Helm templates have been updated to use
aggregated ClusterRoles. The same change has been done in Rook as well.
Refer rook/rook#2634 and rook/rook#2975
Signed-off-by: Kaushal M <kshlmster@gmail.com>
PR #290 missed the update permission to persistentvolumes.
Without that permission, you will get the following error when attaching a RBD volume to a pod:
```
Warning FailedAttachVolume 100s (x11 over 7m52s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-d23f8745-60bb-11e9-bd35-5254001c78d6" : could not add PersistentVolume finalizer: persistentvolumes "pvc-d23f8745-60bb-11e9-bd35-5254001c78d6" is forbidden: User "system:serviceaccount:kube-system:rbd-csi-provisioner" cannot update resource "persistentvolumes" in API group "" at the cluster scope
```
currently we are deploying external-attacher
as a seperate statefulset, which leads to
attacher communicating with the node provisoner
daemonset, This PR deploys external-attacher
as a sidecar container inside provisioner
statefulset, so that external-provisioner
always communicates with the plugin responsible
for the provision controller capcabilities.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Based on the review comments addressed the following,
- Moved away from having to update the pod with volumes
when a new Ceph cluster is added for provisioning via the
CSI driver
- The above now used k8s APIs to fetch secrets
- TBD: Need to add a watch mechanisim such that these
secrets can be cached and updated when changed
- Folded the Cephc configuration and ID/key config map
and secrets into a single secret
- Provided the ability to read the same config via mapped
or created files within the pod
Tests:
- Ran PV creation/deletion/attach/use using new scheme
StorageClass
- Ran PV creation/deletion/attach/use using older scheme
to ensure nothing is broken
- Did not execute snapshot related tests
Signed-off-by: ShyamsundarR <srangana@redhat.com>