Commit Graph

3862 Commits

Author SHA1 Message Date
Niels de Vos
ce9e54e5bd ci: add Mergify backport rules for release-v3.4
The new `backport-to-release-v3.4` label can be added to PRs and Mergify
will create a backport once the PR for the devel branch has been merged.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-28 12:53:58 +05:30
Prasanna Kumar Kalever
52799da09d doc: add design doc for volume healer
Closes: #667

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-28 11:54:59 +05:30
Prasanna Kumar Kalever
ebe4e1f944 ci: ignore spell check for design proposal images
To avoid failures triggered by checking SVG image formats.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-28 11:54:59 +05:30
Prasanna Kumar Kalever
068e44bdb1 cleanup: move rbd-mirror image to a new directory
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-28 11:54:59 +05:30
Madhu Rajanna
080b251850 e2e: validate images in trash for rados namespace
added validation check to verify stale images in trash
for the rados namespace testing.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-07-28 03:48:33 +00:00
Madhu Rajanna
8f185bf7b2 rbd: use rados namespace for manager command
Currently we have a bug that we are not using rados
namespace when adding ceph manager command to
remove the image from the trash. This commit
adds the missing rados namespace when adding
ceph manager task.

without fix the image will be moved to trash
and no task will be added to remove from the
trash. it will become ceph responsibility to
remove the image from trash when it will cleanup
the trash.

workaroud: manually purge the trash

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-07-28 03:48:33 +00:00
Yug Gupta
d14c0afe28 doc: Add documentation for DR
Add documenation for Disaster Recovery
which steps to Failover and Failback in case
of a planned migration or a Disaster.

Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
2021-07-27 11:43:01 +00:00
Niels de Vos
ec6703ed58 rbd: rename encryption metadata keys to enable mirroring
RBD image metadata keys that start with '.rbd' are expected to be
internal to RBD itself and are not mirrored to remote sites. Renaming
the keys (dropping the '.' prefix) and using the new MigrateMetadata()
function now makes the keys available on remote sites too.

Closes: #2219
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-26 11:49:56 +00:00
Niels de Vos
607129171d rbd: move image metadata key migration to its own function
The new MigrateMetadata() function can be used to get the metadata of an
image with a deprecated and new key. Renaming metadata keys can be done
easily this way.

A default value will be set in the image metadata when it is missing
completely. But if the deprecated key was set, the data is stored under
the new key and the deprecated key is removed.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-26 11:49:56 +00:00
Yati Padia
6691951453 rbd: use go-ceph for getImageMirroringStatus
Currently, getImageMirroringStatus() is using RBD CLI.
This commit converts RBD CLI to go-ceph API.

Fixes: #2120

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-26 06:37:40 +00:00
Niels de Vos
4e6d9be826 ci: fix yamllint error in generated golangci.yml file
When running 'make containerized-test' the following error gets
reported:

    yamllint -s -d '{extends: default, rules: {line-length: {allow-non-breakable-inline-mappings: true}},ignore: charts/*/templates/*.yaml}' ./scripts/golangci.yml
    ./scripts/golangci.yml
      179:81    error    line too long (84 > 80 characters)  (line-length)

The golangci.yml.in is used to generate golangci.yml, addressing the
line-length there resolves the issue.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-26 04:05:50 +00:00
Niels de Vos
e75d308b9c e2e: isRetryableAPIError() should match any etcdserver timeout
framework.RunKubectl() returns an error that does not end with
"etcdserver: request timed out", but contains the text somewhere in the
middle:

    error running /usr/bin/kubectl --server=https://192.168.39.57:8443 --kubeconfig=/root/.kube/config --namespace=cephcsi-e2e-a44ec4b4 create -f -:
    Command stdout:

    stderr:
    Error from server: error when creating "STDIN": etcdserver: request timed out

    error:
    exit status 1

isRetryableAPIError() should  return `true` for this case as well, so
instead of using HasSuffix(), we'll use Contains().

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-23 12:20:16 +00:00
Prasanna Kumar Kalever
75dda7ac0d e2e: add test for expansion of encrypted volumes
Also adds a test case to validate the default encryption type

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-23 10:00:23 +00:00
Prasanna Kumar Kalever
526ff95f10 rbd: add support to expand encrypted volume
Previously in ControllerExpandVolume() we had a check for encrypted
volumes and we use to fail for all expand requests on an encrypted
volume. Also for Block VolumeMode PVCs NodeExpandVolume used to be
ignored/skipped.

With these changes, we add support for the expansion of encrypted volumes.
Also for raw Block VolumeMode PVCs with Encryption we call NodeExpandVolume.

That said,
With LUKS1, cryptsetup utility doesn't prompt for a passphrase on resizing
the crypto mapper device. This is because LUKS1 devices don't use kernel
keyring for volume keys.

Whereas, LUKS2 devices use kernel keyring for volume key by default, i.e.
cryptsetup utility asks for a passphrase if it detects volume key was
previously passed to dm-crypt via kernel keyring service, we are overriding
the default by --disable-keyring option during cryptsetup open command.
So that at the time of crypto mapper device resize we will not be
prompted for any passphrase.

Fixes: #1469

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-23 10:00:23 +00:00
Prasanna Kumar Kalever
4fa05cb3a1 util: add helper functions for resize of encrypted volume
such as:
ResizeEncryptedVolume() and LuksResize()

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-23 10:00:23 +00:00
Prasanna Kumar Kalever
572f39d656 util: fix log level in OpenEncryptedVolume()
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-23 10:00:23 +00:00
Prasanna Kumar Kalever
812003eb45 util: fix bug in DeviceEncryptionStatus()
With Luks1 device:
$ cryptsetup status /dev/mapper/crypto-rbd0
/dev/mapper/crypto-rbd0 is active and is in use.
  type:    LUKS1
  cipher:  aes-xts-plain64
  keysize: 512 bits
  key location: dm-crypt
  device:  /dev/rbd0
  sector size:  512
  offset:  4096 sectors
  size:    4190208 sectors
  mode:    read/write

With Luks2 device:
$ cryptsetup status /dev/mapper/crypto-rbd0
/dev/mapper/crypto-rbd0 is active and is in use.
  type:    LUKS2
  cipher:  aes-xts-plain64
  keysize: 512 bits
  key location: dm-crypt
  device:  /dev/rbd0
  sector size:  512
  offset:  32768 sectors
  size:    4161536 sectors
  mode:    read/write

This could lead to failures with unmap in the NodeUnstageVolume path
for the encrypted volumes.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-23 10:00:23 +00:00
Yati Padia
1ae2afe208 cleanup: modifies the error caused due to merged PRs
This commit modifies the error of godot, cyclop,
paralleltest linter caused due to merged PRs.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-22 18:15:48 +00:00
Yati Padia
4e890e9daf ci: disable gci and wrapcheck linter
This commit disables wrapcheck and gci
linters.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-22 18:15:48 +00:00
Yati Padia
172b66f73f cleanup: resolves cyclop linter issue
this commit adds `// nolint:cyclop` for the
fucntions whose complexity is above 20

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-22 18:15:48 +00:00
Yati Padia
e85c0eedc4 ci: set max-complexity of cyclop as 20
This commit sets the value of max-
complexity of cyclop linter as 20

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-22 18:15:48 +00:00
Yati Padia
45b40661e2 ci: disable gomoddirectives linter
This commit disables gomoddirectives
linter as it bans use of replace directive.

Update: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-22 18:15:48 +00:00
Yati Padia
9d6ce7c5dd ci: disable forbidigo linter
This commit disables the forbidigo linter as
this linter forbids the use of fmt.Printf
but we need to use it in various part of
our codebase.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-22 18:15:48 +00:00
Yati Padia
9414a76a86 ci: disable exhaustivestruct linter
This commit disables the exhaustivestruct linter
as it is meant to be used only for special cases.
We don't need to enable this for our project.

Fixes: #2224

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-22 18:15:48 +00:00
Yati Padia
c5bc3d38c4 ci: update static check tools
This PR updates the static check tools to
the latest version.
Further needs to resolve all the errors after
updating the version.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-22 18:15:48 +00:00
Humble Chirammal
abe6a6e5ac util: remove deleteLock test as it is enforced by the controller
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-22 15:07:49 +00:00
Humble Chirammal
c42d4768ca util: remove the deleteLock acquistion check for clone and snapshot
At present while acquiring the deleteLock on the volume, we check
for ongoing clone and snapshot creation operations on the same.
Considering snapshot and clone controllers does not allow parent
volume deletion on subjected operations, we can be free from this
extra check.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-22 15:07:49 +00:00
Niels de Vos
82557e3f34 util: allow configuring VAULT_BACKEND for Vault connection
It seems that the version of the key/value engine can not always be
detected for Hashicorp Vault. In certain cases, it is required to
configure the `VAULT_BACKEND` (or `vaultBackend`) option so that a
successful connection to the service can be made.

The `kv-v2` is the current default for development deployments of
Hashicorp Vault (what we use for automated testing). Production
deployments default to version 1 for now.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-22 13:02:47 +00:00
Thomas Kooi
75b9b9fe6d cleanup: fix beta apiVersion for csidriver
This change resolves a typo for installing the CSIDriver
resource in Kubernetes clusters before 1.18,
where the apiVersion is incorrect.

See also:
https://kubernetes-csi.github.io/docs/csi-driver-object.html

[ndevos: replace v1betav1 in examples with v1beta1]
Signed-off-by: Thomas Kooi <t.j.kooi@avisi.nl>
2021-07-22 09:12:44 +00:00
Rakshith R
43f753760b cleanup: resolve nlreturn linter issues
nlreturn linter requires a new line before return
and branch statements except when the return is alone
inside a statement group (such as an if statement) to
increase code clarity. This commit addresses such issues.

Updates: #1586

Signed-off-by: Rakshith R <rar@redhat.com>
2021-07-22 06:05:01 +00:00
Niels de Vos
5c016b4b94 e2e: retry on "connect: connection refused" errors
Sometimes there are failures in the e2e suite when connecting to the
etcdserver fails. The following error was caught:

    INFO: Error getting pvc "rbd-pvc" in namespace "rbd-1318": Get "https://192.168.39.222:8443/api/v1/namespaces/rbd-1318/persistentvolumeclaims/rbd-pvc": dial tcp 192.168.39.222:8443: connect: connection refused
    FAIL: failed to create PVC with error failed to get pvc: Get "https://192.168.39.222:8443/api/v1/namespaces/rbd-1318/persistentvolumeclaims/rbd-pvc": dial tcp 192.168.39.222:8443: connect: connection refused

If etcdserver was only briefly unavailable, one or more retries might be
sufficient to have the test pass.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-21 13:08:41 +00:00
Humble Chirammal
d85304c7c2 doc: lift clone feature support to Beta
Since kubernetes 1.16 clone ( create a new volume from exisiing volume)
functionality is at Beta state and kubernetes v1.18, the snapshot functionality
has been lifted to GA. Ceph CSI drivers have been supporting this feature
for last few releases and users are heavily using this feature since then.

We also have good amount of e2e test case which cover volume creation
from PVC backend or iow, PVC as a datasource. With that, this PR
proposes of lifting this feature support to Beta with v3.4.0 version.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-21 06:57:10 +00:00
Humble Chirammal
69836d7c02 doc: lift snapshot creation/deletion to Beta
Since kubernetes 1.17 snapshot functionality is at Beta state
and external snapshotter 3.0.3. Since v4.0.0 of snapshotter controller
and kubernetes v1.20, the snapshot functionality has been lifted to GA.
Ceph CSI drivers have been supporting this feature for last few releases
and users are heavily using this feature since then.

We also have good amount of e2e test case which cover volume creation
from snapshot backend or iow, snapshot as a datasource. With that, this PR
proposes of lifting this feature support to Beta with v3.4.0 version.

Updates# https://github.com/ceph/ceph-csi/issues/2199

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-21 06:57:10 +00:00
Humble Chirammal
b66f40793f doc: lift snapshot creation/deletion to Beta
Since kubernetes 1.17 snapshot functionality is at Beta state
and external snapshotter 3.0.3. Since v4.0.0 of snapshotter controller
and kubernetes v1.20, the snapshot functionality has been lifted to GA.
Ceph CSI drivers have been supporting this feature for last few releases
and users are heavily using this feature since then.

We also have good amount of e2e test case which cover volume creation
from snapshot backend or iow, snapshot as a datasource. With that, this PR
proposes of lifting this feature support to Beta with v3.4.0 version.

Updates# https://github.com/ceph/ceph-csi/issues/2199

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-21 06:57:10 +00:00
Humble Chirammal
8f3d8f4cfd doc: explicitly mention volume/pv metrics feature
we have volume PV metrics support available in our driver along
with the grpc metrics support (EnableGRPCMetrics) we added in between
for csi operations. The latter is getting deprecated and the current
mention in the support matrix on metrics support confuse many.
This PR explictly mention this support in the docs to volume/PV metrics

This PR also add a seperate row for block mode PV metrics

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-20 16:03:04 +00:00
Yati Padia
7f5df7c940 cleanup: resolves gofumpt issues in e2e
This commit resolves gofumpt issues in
e2e folder.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-20 15:37:58 +00:00
Niels de Vos
c4372b8567 doc: describe Hashicorp Vault with a ServiceAccount per Tenant
In addition to the single ServiceAccount KMS support for Hashicorp
Vault, Ceph-CSI can now use a ServiceAccount per Tenant as well. This
adds the user-documentation with references to the example deployment
files.

Closes: #2222
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-20 12:31:40 +00:00
Niels de Vos
841a53bc3d e2e: retry kubectl commands in case deploying Vault fails
Sometimes it happens that the deployment of Hashicorp Vault fails.
Deployment is one of the 1st steps that are done when starting the e2e
suite, and the Kubernetes cluster may still be a little overloaded while
it is settling down. It should be possible to retry and succeed after a
while.

Fixes: #2288
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-19 16:12:18 +00:00
Niels de Vos
d5ea89e603 e2e: add retryKubectlInput() for retrying kubectl calls
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-07-19 16:12:18 +00:00
Yati Padia
3469dfc753 cleanup: resolve errorlint issues
This commit resolves errorlint issues
which checks for the code that will cause
problems with the error wrapping scheme.

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-19 13:31:29 +00:00
Yati Padia
bfda5fa57f cleanup: resolve revive linter issue
revive linter checks for var-declaration
format.
For example:
"e2e/rbd_helper.go:441:36: var-declaration:
should drop = nil from declaration of
var noPVCValidation; it is the zero value (revive)
var noPVCValidation validateFunc = nil"

Updates: #1586

Signed-off-by: Yati Padia <ypadia@redhat.com>
2021-07-19 08:39:32 +00:00
Humble Chirammal
bd947bbe31 util: remove deleteLock check while acquiring snapshot createLock
snapshot controller make sure the pvc which is the source for the
snapshot request wont get deleted while snapshot is getting created,
so we dont need to check for any ongoing delete operation here on the
volume.

Subjected code path in snapshot controller:

```
pvc, err := ctrl.getClaimFromVolumeSnapshot(snapshot)
.
..
pvcClone.ObjectMeta.Finalizers = append(pvcClone.ObjectMeta.Finalizers, utils.PVCFinalizer)
_, err = ctrl.client.CoreV1().PersistentVolumeClaims(pvcClone.Namespace).Update(..)
```

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-07-17 10:23:13 +00:00
Prasanna Kumar Kalever
10fc639d68 ci: fix nolintlint warnings
warnings from golangci-lint:

e2e/pod.go:207:122: directive `//nolint:unparam,lll // cn can be used
with different inputs later` is unused for linter unparam (nolintlint)
func execCommandInContainer(f *framework.Framework, c, ns, cn string,
opt *metav1.ListOptions) (string, string, error) { //nolint:unparam,lll
// cn can be used with different inputs later

e2e/pod.go:307:70: directive `//nolint:unparam // skipNotFound can be
used with different inputs later` is unused for linter unparam (nolintlint)
func deletePodWithLabel(label, ns string, skipNotFound bool) error {
//nolint:unparam // skipNotFound can be used with different inputs later

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
fd3bf1750b e2e: fix the testcases for rbd-nbd
Now that the healer functionaity for mounter processes is available,
lets start, using it.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
78f740d903 rbd: improve healer to run multiple NodeStageVolume req concurrently
This will bring down the healer run time by a great factor.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
b6a88dd728 rbd: add volume healer
Problem:
-------
For rbd nbd userspace mounter backends, after a restart of the nodeplugin
all the mounts will start seeing IO errors. This is because, for rbd-nbd
backends there will be a userspace mount daemon running per volume, post
restart of the nodeplugin pod, there is no way to restore the daemons
back to life.

Solution:
--------
The volume healer is a one-time activity that is triggered at the startup
time of the rbd nodeplugin. It navigates through the list of volume
attachments on the node and acts accordingly.

For now, it is limited to nbd type storage only, but it is flexible and
can be extended in the future for other backend types as needed.

From a few feets above:
This solves a severe problem for nbd backed csi volumes. The healer while
going through the list of volume attachments on the node, if finds the
volume is in attached state and is of type nbd, then it will attempt to
fix the rbd-nbd volumes by sending a NodeStageVolume request with the
required volume attributes like secrets, device name, image attributes,
and etc.. which will finally help start the required rbd-nbd daemons in
the nodeplugin csi-rbdplugin container. This will allow reattaching the
backend images with the right nbd device, thus allowing the applications
to perform IO without any interruptions even after a nodeplugin restart.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
6007fc9bfe cleanup: move static volume check to helper function
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
10e4eee481 deploy: add few more cluster-roles for rbd nodeplugin
Nodeplugin needs below cluster roles:
persistentvolumes: get
volumeattachments: list, get

These additional permissions are needed by the volume healer. Volume healer
aims at fixing the volume health issues at the very startup time of the
nodeplugin. As part of its operations, volume healer has to run through
the list of volume attachments and understand details about each
persistentvolume.

The later commits will use these additional cluster roles.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
874f6629fb rbd: get default plugin path
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00
Prasanna Kumar Kalever
6d24080851 rbd: update per volume metadata stash-file with devicePath
As part of stage transaction if the mounter is of type nbd, then capture
device path after a successful rbd-nbd map.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-07-16 16:30:58 +00:00