Commit Graph

2536 Commits

Author SHA1 Message Date
Humble Chirammal
c42d4768ca util: remove the deleteLock acquistion check for clone and snapshot
At present while acquiring the deleteLock on the volume, we check
for ongoing clone and snapshot creation operations on the same.
Considering snapshot and clone controllers does not allow parent
volume deletion on subjected operations, we can be free from this
extra check.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-22 15:07:49 +00:00
Niels de Vos
82557e3f34 util: allow configuring VAULT_BACKEND for Vault connection
It seems that the version of the key/value engine can not always be
detected for Hashicorp Vault. In certain cases, it is required to
configure the `VAULT_BACKEND` (or `vaultBackend`) option so that a
successful connection to the service can be made.

The `kv-v2` is the current default for development deployments of
Hashicorp Vault (what we use for automated testing). Production
deployments default to version 1 for now.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-22 13:02:47 +00:00
Thomas Kooi
75b9b9fe6d cleanup: fix beta apiVersion for csidriver
This change resolves a typo for installing the CSIDriver
resource in Kubernetes clusters before 1.18,
where the apiVersion is incorrect.

See also:
https://kubernetes-csi.github.io/docs/csi-driver-object.html

[ndevos: replace v1betav1 in examples with v1beta1]
Signed-off-by: Thomas Kooi <t.j.kooi@avisi.nl>
2021-07-22 09:12:44 +00:00
Rakshith R
43f753760b cleanup: resolve nlreturn linter issues
nlreturn linter requires a new line before return
and branch statements except when the return is alone
inside a statement group (such as an if statement) to
increase code clarity. This commit addresses such issues.

Updates: #1586

Signed-off-by: Rakshith R <rar@redhat.com>
2021-07-22 06:05:01 +00:00
Niels de Vos
5c016b4b94 e2e: retry on "connect: connection refused" errors
Sometimes there are failures in the e2e suite when connecting to the
etcdserver fails. The following error was caught:

    INFO: Error getting pvc "rbd-pvc" in namespace "rbd-1318": Get "https://192.168.39.222:8443/api/v1/namespaces/rbd-1318/persistentvolumeclaims/rbd-pvc": dial tcp 192.168.39.222:8443: connect: connection refused
    FAIL: failed to create PVC with error failed to get pvc: Get "https://192.168.39.222:8443/api/v1/namespaces/rbd-1318/persistentvolumeclaims/rbd-pvc": dial tcp 192.168.39.222:8443: connect: connection refused

If etcdserver was only briefly unavailable, one or more retries might be
sufficient to have the test pass.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-21 13:08:41 +00:00
Humble Chirammal
d85304c7c2 doc: lift clone feature support to Beta
Since kubernetes 1.16 clone ( create a new volume from exisiing volume)
functionality is at Beta state and kubernetes v1.18, the snapshot functionality
has been lifted to GA. Ceph CSI drivers have been supporting this feature
for last few releases and users are heavily using this feature since then.

We also have good amount of e2e test case which cover volume creation
from PVC backend or iow, PVC as a datasource. With that, this PR
proposes of lifting this feature support to Beta with v3.4.0 version.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-21 06:57:10 +00:00
Humble Chirammal
69836d7c02 doc: lift snapshot creation/deletion to Beta
Since kubernetes 1.17 snapshot functionality is at Beta state
and external snapshotter 3.0.3. Since v4.0.0 of snapshotter controller
and kubernetes v1.20, the snapshot functionality has been lifted to GA.
Ceph CSI drivers have been supporting this feature for last few releases
and users are heavily using this feature since then.

We also have good amount of e2e test case which cover volume creation
from snapshot backend or iow, snapshot as a datasource. With that, this PR
proposes of lifting this feature support to Beta with v3.4.0 version.

Updates# https://github.com/ceph/ceph-csi/issues/2199

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-21 06:57:10 +00:00
Humble Chirammal
b66f40793f doc: lift snapshot creation/deletion to Beta
Since kubernetes 1.17 snapshot functionality is at Beta state
and external snapshotter 3.0.3. Since v4.0.0 of snapshotter controller
and kubernetes v1.20, the snapshot functionality has been lifted to GA.
Ceph CSI drivers have been supporting this feature for last few releases
and users are heavily using this feature since then.

We also have good amount of e2e test case which cover volume creation
from snapshot backend or iow, snapshot as a datasource. With that, this PR
proposes of lifting this feature support to Beta with v3.4.0 version.

Updates# https://github.com/ceph/ceph-csi/issues/2199

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-21 06:57:10 +00:00
Humble Chirammal
8f3d8f4cfd doc: explicitly mention volume/pv metrics feature
we have volume PV metrics support available in our driver along
with the grpc metrics support (EnableGRPCMetrics) we added in between
for csi operations. The latter is getting deprecated and the current
mention in the support matrix on metrics support confuse many.
This PR explictly mention this support in the docs to volume/PV metrics

This PR also add a seperate row for block mode PV metrics

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-20 16:03:04 +00:00
Yati Padia
7f5df7c940 cleanup: resolves gofumpt issues in e2e
This commit resolves gofumpt issues in
e2e folder.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-20 15:37:58 +00:00
Niels de Vos
c4372b8567 doc: describe Hashicorp Vault with a ServiceAccount per Tenant
In addition to the single ServiceAccount KMS support for Hashicorp
Vault, Ceph-CSI can now use a ServiceAccount per Tenant as well. This
adds the user-documentation with references to the example deployment
files.

Closes: #2222
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-20 12:31:40 +00:00
Niels de Vos
841a53bc3d e2e: retry kubectl commands in case deploying Vault fails
Sometimes it happens that the deployment of Hashicorp Vault fails.
Deployment is one of the 1st steps that are done when starting the e2e
suite, and the Kubernetes cluster may still be a little overloaded while
it is settling down. It should be possible to retry and succeed after a
while.

Fixes: #2288
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-19 16:12:18 +00:00
Niels de Vos
d5ea89e603 e2e: add retryKubectlInput() for retrying kubectl calls
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-19 16:12:18 +00:00
Yati Padia
3469dfc753 cleanup: resolve errorlint issues
This commit resolves errorlint issues
which checks for the code that will cause
problems with the error wrapping scheme.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-19 13:31:29 +00:00
Yati Padia
bfda5fa57f cleanup: resolve revive linter issue
revive linter checks for var-declaration
format.
For example:
"e2e/rbd_helper.go:441:36: var-declaration:
should drop = nil from declaration of
var noPVCValidation; it is the zero value (revive)
var noPVCValidation validateFunc = nil"

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-19 08:39:32 +00:00
Humble Chirammal
bd947bbe31 util: remove deleteLock check while acquiring snapshot createLock
snapshot controller make sure the pvc which is the source for the
snapshot request wont get deleted while snapshot is getting created,
so we dont need to check for any ongoing delete operation here on the
volume.

Subjected code path in snapshot controller:

```
pvc, err := ctrl.getClaimFromVolumeSnapshot(snapshot)
.
..
pvcClone.ObjectMeta.Finalizers = append(pvcClone.ObjectMeta.Finalizers, utils.PVCFinalizer)
_, err = ctrl.client.CoreV1().PersistentVolumeClaims(pvcClone.Namespace).Update(..)
```

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-17 10:23:13 +00:00
Prasanna Kumar Kalever
10fc639d68 ci: fix nolintlint warnings
warnings from golangci-lint:

e2e/pod.go:207:122: directive `//nolint:unparam,lll // cn can be used
with different inputs later` is unused for linter unparam (nolintlint)
func execCommandInContainer(f *framework.Framework, c, ns, cn string,
opt *metav1.ListOptions) (string, string, error) { //nolint:unparam,lll
// cn can be used with different inputs later

e2e/pod.go:307:70: directive `//nolint:unparam // skipNotFound can be
used with different inputs later` is unused for linter unparam (nolintlint)
func deletePodWithLabel(label, ns string, skipNotFound bool) error {
//nolint:unparam // skipNotFound can be used with different inputs later

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
fd3bf1750b e2e: fix the testcases for rbd-nbd
Now that the healer functionaity for mounter processes is available,
lets start, using it.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
78f740d903 rbd: improve healer to run multiple NodeStageVolume req concurrently
This will bring down the healer run time by a great factor.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
b6a88dd728 rbd: add volume healer
Problem:
-------
For rbd nbd userspace mounter backends, after a restart of the nodeplugin
all the mounts will start seeing IO errors. This is because, for rbd-nbd
backends there will be a userspace mount daemon running per volume, post
restart of the nodeplugin pod, there is no way to restore the daemons
back to life.

Solution:
--------
The volume healer is a one-time activity that is triggered at the startup
time of the rbd nodeplugin. It navigates through the list of volume
attachments on the node and acts accordingly.

For now, it is limited to nbd type storage only, but it is flexible and
can be extended in the future for other backend types as needed.

From a few feets above:
This solves a severe problem for nbd backed csi volumes. The healer while
going through the list of volume attachments on the node, if finds the
volume is in attached state and is of type nbd, then it will attempt to
fix the rbd-nbd volumes by sending a NodeStageVolume request with the
required volume attributes like secrets, device name, image attributes,
and etc.. which will finally help start the required rbd-nbd daemons in
the nodeplugin csi-rbdplugin container. This will allow reattaching the
backend images with the right nbd device, thus allowing the applications
to perform IO without any interruptions even after a nodeplugin restart.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
6007fc9bfe cleanup: move static volume check to helper function
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
10e4eee481 deploy: add few more cluster-roles for rbd nodeplugin
Nodeplugin needs below cluster roles:
persistentvolumes: get
volumeattachments: list, get

These additional permissions are needed by the volume healer. Volume healer
aims at fixing the volume health issues at the very startup time of the
nodeplugin. As part of its operations, volume healer has to run through
the list of volume attachments and understand details about each
persistentvolume.

The later commits will use these additional cluster roles.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
874f6629fb rbd: get default plugin path
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
6d24080851 rbd: update per volume metadata stash-file with devicePath
As part of stage transaction if the mounter is of type nbd, then capture
device path after a successful rbd-nbd map.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
70998571aa cleanup: change variable name from path to metaDataPath
path is used by standard package.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Humble Chirammal
94c5c5e119 util: remove deleteLock while we acquire clone operation lock
clone controller make sure there is no delete operation happens
on the source PVC which has been referred as the datasource of
clone PVC, we are safe to operate without looking at delete
operation lock in this case.

Subjected code in the controller:

...
if claim.Spec.DataSource != nil && rc.clone {
		err = p.setCloneFinalizer(ctx, claim)
		...
}

if !checkFinalizer(claim, pvcCloneFinalizer) {
		claim.Finalizers = append(claim.Finalizers, pvcCloneFinalizer)
		_, err := p.client.CoreV1().PersistentVolumeClaims(claim.Namespace).Update(..claim..)
	}

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-16 12:32:28 +00:00
Humble Chirammal
e088e8fd2e cephfs: Get rid of locking at nodepublish
Considering kubelet make sure the stage and publish operations
are serialized, we dont need any extra locking in nodePublish

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-16 07:18:56 +00:00
Humble Chirammal
61bf49a4f5 rbd: Get rid of locking at nodePublish
Considering kubelet make sure the stage and publish operations
are serialized, we dont need any extra locking in nodePublish

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-16 07:18:56 +00:00
Humble Chirammal
ced3a0922f cephfs: Get rid of locking at nodeUnpublish call
Considering kubelet make sure the unstage and unpublish operations
are serialized, we dont need any extra locking in nodeUnpublish

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-16 07:18:56 +00:00
Humble Chirammal
ef852cc93d rbd: Get rid of locking at nodeUnpublish call
Considering kubelet make sure the unstage and unpublish operations
are serialized, we dont need any extra locking in nodeUnpublish

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-16 07:18:56 +00:00
Niels de Vos
276918db10 rebase: update minikube to v1.22.0
Minikube has bumped it's support for latest Kubernetes version to
1.22.0-beta.0. This might improve our CI jobs with Kubernetes 1.22 too.

See-also: https://github.com/kubernetes/minikube/releases/tag/v1.22.0
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-15 11:39:33 +00:00
Yati Padia
f36d611ef9 cleanup: resolves gofumpt issues of internal codes
This PR runs gofumpt for internal folder.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-14 19:50:56 +00:00
Yati Padia
696ee496fc cleanup: resolves gofumpt for cmd
This commit resolves gofumpt linter for
cmd folder

Updates: 1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-14 17:19:00 +00:00
Yati Padia
299979fc14 ci: add unit test for toError()
This commit adds unit test for the
func converting cephFSCloneState to error.

Fixes: #2259

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-14 15:02:12 +00:00
Yati Padia
c66872c3c6 cleanup: ineffective assignment
This commit resolves ineffective assignent of
snap.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-14 12:39:17 +00:00
Niels de Vos
4d4a2a7814 e2e: prevent re-using empty pvc object
When an error occurs, the pvc object is overwritten in the
PollImmediate() loop. Re-using the pvc.Namespace results in error
messages like

    Error getting pvc in namespace: '': an empty namespace may not be set when a resource name is provided

and prevents the retry by PollImmediate() to never succeed. Storing the
namespace in a local variable prevents this from happening.

Reported-by: Rakshith R <rar@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-14 10:18:51 +00:00
Niels de Vos
f7ae33c67c e2e: only call error check functions when err != nil
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-14 10:18:51 +00:00
Niels de Vos
075a4087d7 e2e: mark "etcdserver: request timed out" errors as retryable
There are regular CI failures where etcdserver times out. These errors
seem not to get caught by any of the existing error comparing. Matching
the error by string should prevent temporary etcdserver issues now too.

Updates: #2218
Closes: #1969
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-14 10:18:51 +00:00
Yati Padia
f210d5758b cleanup: spell check getImageMirroingStatus
This commit corrects the spelling for
getImageMirroingStatus() -> getImageMirroringStatus

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-14 07:32:01 +00:00
Niels de Vos
d941e5abac util: make parseTenantConfig() usable for modular KMSs
parseTenantConfig() only allowed configuring a defined set of options,
and KMSs were not able to re-use the implementation. Now, the function
parses the ConfigMap from the Tenants Namespace and returns a map with
options that the KMS supports.

The map that parseTenantConfig() returns can be inspected by the KMS,
and applied to the vaultTenantConnection type by calling parseConfig().

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-13 17:16:35 +00:00
Niels de Vos
96bb8bfd0e e2e: add securityContext.runAsUser to vault-init-job
Kubelet sometimes reports the following error:

    failed to "StartContainer" for "vault-init-job" with CreateContainerConfigError: container has runAsNonRoot and image will run as root

Setting securityContext.runAsUser resolves this.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-13 17:16:35 +00:00
Niels de Vos
e3c7dea7d6 e2e: add test for Vault with ServiceAccount per Tenant
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-13 17:16:35 +00:00
Niels de Vos
b700fa43e6 doc: add example for Tenant ServiceAccount
The ServiceAccount "ceph-csi-vault-sa" is expected to be placed in the
Namespace "tenant" so that the provisioner and node-plugin fetch the
ServiceAccount from a Namespace where Ceph-CSI is not deployed.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-13 17:16:35 +00:00
Niels de Vos
8662e01d2c deploy: allow RBD components to get ServiceAccounts
The provisioner and node-plugin have the capability to connect to
Hashicorp Vault with a ServiceAccount from the Namespace where the PVC
is created. This requires permissions to read the contents of the
ServiceAccount from an other Namespace than where Ceph-CSI is deployed.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-13 17:16:35 +00:00
Niels de Vos
3d7d48a4aa util: VaultTenantSA KMS implementation
This new KMS uses a Kubernetes ServiceAccount from a Tenant (Namespace)
to connect to Hashicorp Vault. The provisioner and node-plugin will
check for the configured ServiceAccount and use the token that is
located in one of the linked Secrets. Subsequently the Vault connection
is configured to use the Kubernetes token from the Tenant.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-13 17:16:35 +00:00
Niels de Vos
6dc5bf2b29 util: split vaultTenantConnection from VaultTokensKMS
This makes the Tenant configuration for Hashicorp Vault KMS connections
more modular. Additional KMS implementations that use Hashicorp Vault
with per-Tenant options can re-use the new vaultTenantConnection.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-13 17:16:35 +00:00
Niels de Vos
ed298341a6 doc: proposal for KMS with ServiceAccount per Tenant
A new KMS that supports Hashicorp Vault with the Kubernetes Auth backend
and ServiceAccounts per Tenant (Kubernetes Namespace).

Updates: #2222
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-13 12:12:25 +00:00
Yati Padia
69c9e5ffb1 cleanup: resolve parallel test issue
This commit resolves parallel test issues
and also excludes internal/util/conn_pool_test.go
as those test can't run in parallel.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-13 11:31:39 +00:00
Prasanna Kumar Kalever
8b3136e696 doc: add documentaion for rbd-nbd mounter
Closes #2124

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-13 10:19:17 +00:00
Yati Padia
4a649fe17f cleanup: resolve godot linter
This commit resolves godot linter issue
which says "Comment should end in a period (godot)".

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-13 06:50:03 +00:00