Commit Graph

2546 Commits

Author SHA1 Message Date
Niels de Vos
1fa8939e84 e2e: retry when a "transport is closing" error is hit
There have been occasional CI job failures due to "transport is closing"
errors. Adding this error to the isRetryableAPIError() function should
make sure to retry the request until the connection is restored.

Fixes: #2613
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-11-17 14:07:07 +00:00
Niels de Vos
1f650e1204 rebase: replace vendored layeh.com/radius with GitHub source
The webserver at layeh.com seems to be misbehaving, which causes `go mod
verify` to fail. The layeh.com/radius repository is maintained on
GitHub, so the sources can be vendored/verified from there too.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-11-17 11:51:30 +00:00
Madhu Rajanna
0a5bd09a61 ci: fix branch name in retest action
updated the branch name from main to
devel in retest action workflow.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-17 05:50:43 +00:00
Madhu Rajanna
b62de1376d ci: update github workflow to test docker build
updated github action to test a retest action
docker build workflow.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-17 05:50:43 +00:00
Madhu Rajanna
c4ceadd06a ci: fix docker build issue for retest
fix docker build `cannot normalize nothing`

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-17 05:50:43 +00:00
Madhu Rajanna
46c40fe5ad ci: skip shell check in vendor directory
skip shell check in vendor directories.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-16 12:03:36 +00:00
Madhu Rajanna
ec34fdd505 ci: skip codespell for retest vendor
as there are lot of spell check failures
on the vendor directory, skipping it.

skipping retest action folder as
PullRequests word is not getting ignored

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-16 12:03:36 +00:00
Madhu Rajanna
f9f465073f ci: add github action to build retest
added basic github action for
retest building.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-16 12:03:36 +00:00
Madhu Rajanna
5a53f53166 ci: add retest github action
added source code of github retest
action.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-16 12:03:36 +00:00
Madhu Rajanna
f7e7172c7b doc: add documentation for retest job
added details about retest job and the
creteria to auto retest PR.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-16 12:03:36 +00:00
Madhu Rajanna
ed6d28a1fc ci: add action to retest failed approved PR's
Adding github action to retest the failed
approved PR's. sample output is available
at https://github.com/Madhu-1/retest-action/pull/3

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-11-16 12:03:36 +00:00
Rakshith R
191b603974 ci: remove gh action gosec linter,since it is already part of golangci
This commit removes gosec standalone linter and related parts,
since golangci linter runs gosec linter too.

Signed-off-by: Rakshith R <rar@redhat.com>
2021-11-16 12:29:56 +01:00
Prasanna Kumar Kalever
0bf9db822b e2e: validate encrypted image mount inside the nodeplugin
currently the mountType validation of the encrypted volume is done in
the application, we should rather validate this inside the nodeplugin
pod.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-16 10:12:46 +00:00
Prasanna Kumar Kalever
e6fa392df1 rbd: fix mapOptions passing with rbd-nbd mounter
This was a regression introduced by:
https://github.com/ceph/ceph-csi/pull/2556

Fixes: #2610
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-16 10:12:46 +00:00
Prasanna Kumar Kalever
cee6da5313 e2e: adding io-timeout for lower kernel versions
This got removed unintentionally with
https://github.com/ceph/ceph-csi/pull/2628

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-16 10:12:46 +00:00
Prasanna Kumar Kalever
c97b6432e3 e2e: restrict IO with lower version kernel at rbd-nbd tests
Currently, at "perform IO on rbd-nbd volume after nodeplugin restart"
test we are performing write on the rbd-nbd based mount after nodeplugin
restart. But due to a bug in NBD driver the writes are failing, please
note NBD zero cmd timeout handling is fixed with kernel >= 5.4 and hence
we should defend on writes based on kernel version to avoid unnecessary
CI failures.

For more information see
https://github.com/ceph/ceph-csi/issues/2204#issuecomment-930941047

updates: #2204
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-10 16:46:50 +00:00
Prasanna Kumar Kalever
50e9dfa5c5 cleanup: fix log level
This log line is seen frequently in the logs and its better to be at
Warning loglevel rather than Error based on its severity

E1109 08:30:45.612395   38328 util.go:247] kernel 4.19.202 does not support required features

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-10 10:54:29 +00:00
Humble Chirammal
5a4bf4d151 doc: add migration design documentation
This commit adds migration design doc which carry information about
the required changes and design for rbd intree to csi migration.

Fixes https://github.com/ceph/ceph-csi/issues/2596
Updates https://github.com/ceph/ceph-csi/issues/2509

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-11-09 12:06:50 +00:00
Prasanna Kumar Kalever
3686b6da8b rbd: utilize cookie support from rbd for nbd
Problem:
On remap/attach of device (i.e. nodeplugin restart), there is no way
for rbd-nbd to defend if the backend storage is matching with the initial
backend storage.

Say, if an initial map request for backend "pool1/image1" got mapped to
/dev/nbd0 and the userspace process is terminated (on nodeplugin restart).
A next remap/attach (nodeplugin start) request within reattach-timeout is
allowed to use /dev/nbd0 for a different backend "pool1/image2"

For example, an operation like below could be dangerous:

$ sudo rbd-nbd map --try-netlink rbd-pool/ext4-image
/dev/nbd0
$ sudo blkid /dev/nbd0
/dev/nbd0: UUID="bfc444b4-64b1-418f-8b36-6e0d170cfc04" TYPE="ext4"
$ sudo pkill -15 rbd-nbd   <-- nodeplugin terminate
$ sudo rbd-nbd attach --try-netlink --device /dev/nbd0 rbd-pool/xfs-image
/dev/nbd0
$ sudo blkid /dev/nbd0
/dev/nbd0: UUID="d29bf343-6570-4069-a9ea-2fa156ced908" TYPE="xfs"

Solution:
rbd-nbd/kernel now provides a way to keep some metadata in sysfs to identify
between the device and the backend, so that when a remap/attach request is
made, rbd-nbd can compare and avoid such dangerous operations.

With the provided solution, as part of the initial map request, backend
cookie (ceph-csi VOLID) can be stored in the sysfs per device config, so
that on a remap/attach request rbd-nbd will check and validate if the
backend per device cookie matches with the initial map backend with the help
of cookie.

At Ceph-csi we use VOLID as device cookie, which will be unique, we pass
the VOLID as cookie at map and use the same at the time of attach, that
way rbd-nbd can identify backends and their matching devices.

Requires:
https://github.com/ceph/ceph/pull/41323
https://lkml.org/lkml/2021/4/29/274

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-04 03:20:59 +00:00
Prasanna Kumar Kalever
793b22cf27 rbd: check for nbd cookie support
Change checkRbdNbdTools() to setRbdNbdToolFeatures()

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-04 03:20:59 +00:00
Niels de Vos
b95f3cdcbc ci: do not let dependabot automatically rebase
When dependabot creates a PR, and an other gets merged, the bot
automatically triggers a rebase. This will drop any approvals, causing
delays in the review/merge process.

The project uses Mergify to automatically rebase when needed, and
approvals are retained when Mergify rebases PR. By disabling the
auto-rebasing done by dependabot, fewer rebases should be needed,
contributors only need to review once, and CI jobs are triggered less
often.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-11-03 03:25:08 +00:00
dependabot[bot]
c286ab3c0a rebase: bump github.com/aws/aws-sdk-go from 1.41.10 to 1.41.15
Bumps [github.com/aws/aws-sdk-go](https://github.com/aws/aws-sdk-go) from 1.41.10 to 1.41.15.
- [Release notes](https://github.com/aws/aws-sdk-go/releases)
- [Changelog](https://github.com/aws/aws-sdk-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/aws/aws-sdk-go/compare/v1.41.10...v1.41.15)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-11-02 20:48:31 +00:00
dependabot[bot]
b344a9f463 rebase: bump github.com/hashicorp/vault/api from 1.2.0 to 1.3.0
Bumps [github.com/hashicorp/vault/api](https://github.com/hashicorp/vault) from 1.2.0 to 1.3.0.
- [Release notes](https://github.com/hashicorp/vault/releases)
- [Changelog](https://github.com/hashicorp/vault/blob/main/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/vault/compare/v1.2.0...v1.3.0)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/vault/api
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-11-02 10:39:23 +00:00
Prasanna Kumar Kalever
9a3170bf77 rbd: provide a way to disable the auto fallback to nbd mounter
This change allows the user to choose not to fallback to NBD mounter
when some ImageFeatures are absent with krbd driver, rather just fail
the NodeStage call.

Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-01 08:17:36 +00:00
Prasanna Kumar Kalever
bfc24f6f12 cleanup: generalize the parseBool function
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-01 08:17:36 +00:00
Prasanna Kumar Kalever
84ec797dda rbd: detect krbd features in runtime and fallback to nbd
Currently, we recognize and warn for the provided image features based on
our prior intelligence at ceph-csi (i.e based on supportedFeatures map
and validateImageFeatures) at image/PV creation time. It might be very
much possible that the cluster is heterogeneous i.e. the PV creation and
application container might both be on different nodes with different
kernel versions (krbd driver versions).

This PR adds a mechanism to check for the supported krbd features during
mount time, if the krbd driver doesn't have the specified image feature
then it will fall back to rbd-nbd mounter.

Fixes: #478
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
2021-11-01 08:17:36 +00:00
Shaohui Liu
af752dd38f helm: support adding annotations to StorageClasses
Signed-off-by: Shaohui Liu <liushaohui@xiaomi.com>
2021-10-28 16:56:12 +00:00
Niels de Vos
c852f487a5 util: set defaults for Vault config before converting
When using UPPER_CASE formatting for the HashiCorp Vault KMS
configuration, a missing `VAULT_DESTROY_KEYS` will cause the option to
be set to "false". The default for the option is intended for be "true".

This is a difference in behaviour between the `vaultDestroyKeys` and
`VAULT_DESTROY_KEYS` options. Both should use a default of "true" when
the configuration does not set the option explicitly.

By setting the default options in the `standardVault` struct before
unmarshalling the configuration in it, the default values will be
retained for the missing configuration options.

Reported-by: Rachael George <rgeorge@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-10-28 14:41:53 +00:00
Humble Chirammal
de57fa1804 e2e: adjust deletion, filesystem and block tests for migration volume
this commit create and make use of migration secret in the requests and
validate various csi operations

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-10-27 18:35:00 +00:00
Humble Chirammal
6aec858cba rbd: parse migration secret and set fields for nodestage operations
this commit make use of the migration request secret parsing and set
the required fields for further nodestage operations

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-10-27 18:35:00 +00:00
Humble Chirammal
5621f2cfca rbd: split the parsing and deletion logic to its own functions.
parseAndDeleteMigratedVolume() prviously clubbed the logic of
parsing of migration volume handle and then continued with the
deletion of the volume. however this commit split this
logic into two, ie parsing has been done in parseMigrationVolID()
and DeleteMigratedVolume() deletes the backend volume.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-10-27 18:35:00 +00:00
Humble Chirammal
ff0911fb6a rbd: add unittests for IsMigrationSecret and ParseAndSetSecretMapFromMigSecret
This commit adds unit tests for newly introduced migration specific
functions.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-10-27 18:35:00 +00:00
Humble Chirammal
b49bf4b987 rbd: parse migration secret and set it for controller server operations
This commit adds a couple of helper functions to parse the migration
request secret and set it for further csi driver operations.

More details:

The intree secret has a data field called "key" which is the base64
admin secret key. The ceph CSI driver currently expect the secret to
contain data field "UserKey" for the equivalant. The CSI driver also
expect the "UserID" field which is not available in the in-tree secret
by deafult. This missing userID will be filled (if the username differ
than 'admin') in the migration secret as 'adminId' field in the
migration request, this commit adds the logic to parse this migration
secret as below:

"key" field value will be picked up from the migraion secret to "UserKey"
field.
"adminId" field value will be picked up from the migration secret to "UserID"
field

if `adminId` field is nil or not set, `UserID` field will be filled with
default value ie `admin`.The above logic get activated only when the secret
is a migration secret, otherwise skipped to the normal workflow as we have
today.

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
2021-10-27 18:35:00 +00:00
Niels de Vos
b132696e54 rbd: note that thick-provisioning is deprecated
Thick-provisioning was introduced to make accounting of assigned space
for volumes easier. When thick-provisioned volumes are the only consumer
of the Ceph cluster, this works fine. However, it is unlikely that this
is the case. Instead, accounting of the requested (thin-provisioned)
size of volumes is much more practical as different types of volumes can
be tracked.

OpenShift already provides cluster-wide quotas, which can combine
accounting of requested volumes by grouping different StorageClasses.

In addition to the difficult practise of allowing only thick-provisioned
RBD backed volumes, the performance makes thick-provisioning
troublesome. As volumes need to be completely allocated, data needs to
be written to the volume. This can take a long time, depending on the
size of the volume. Provisioning, cloning and snapshotting becomes very
much noticeable, and because of the additional time consumption, more
prone to failures.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-10-27 06:54:07 +00:00
Madhu Rajanna
0838845c6a cleanup: remove FIXME from ResyncVolume
as the complexity of ResyncVolume is
reduced removing the FIXME which is not valid
anymore.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-10-26 12:00:36 +00:00
Madhu Rajanna
2017b8c621 rbd: log mirror daemon state for replication
log the mirror deamon state in the local and
remote cluster for better debugging.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-10-26 12:00:36 +00:00
Madhu Rajanna
7472338334 rbd: remove unwanted const
for comparing the image states use the states
defined in the go-ceph avoid creating of the
deplicate const in cephcsi.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-10-26 12:00:36 +00:00
Madhu Rajanna
b92a6f5ccb rbd: log the remote site details during resync
logging the remote site details during resyncing
for better debugging.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-10-26 12:00:36 +00:00
Madhu Rajanna
1fd2f28fee rbd: check local image state for resyncing
below are the local states of the mirrored image

"unknown"  -> If the image is in an error state
means data is completely synced
"error" -> If the image is in an error state
means it needs resync
"syncing"
"starting_replay"
"replaying"
"stopping_replay"
"stopped"

If the resync is successfully started which
means the image will be in "replaying" state.
we can consider "replaying" state to report
resync succesfully going on state.

we are discarding the intermediate states like
"syncing", "starting_replay" and "stopping_replay".

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
2021-10-26 12:00:36 +00:00
dependabot[bot]
c8e78089f7 rebase: bump github.com/aws/aws-sdk-go from 1.41.5 to 1.41.10
Bumps [github.com/aws/aws-sdk-go](https://github.com/aws/aws-sdk-go) from 1.41.5 to 1.41.10.
- [Release notes](https://github.com/aws/aws-sdk-go/releases)
- [Changelog](https://github.com/aws/aws-sdk-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/aws/aws-sdk-go/compare/v1.41.5...v1.41.10)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-10-26 06:55:16 +00:00
Rakshith R
41d894f98a e2e: add test cases for EnsureImageCleanup
This tests pvc,pvcsmartclone,snapshot deletion when
underlying images are in trash.

Signed-off-by: Rakshith R <rar@redhat.com>
2021-10-20 18:25:31 +00:00
Rakshith R
12cd05a408 rbd: add EnsureImageCleanup to snapshot deletion
Signed-off-by: Rakshith R <rar@redhat.com>
2021-10-20 18:25:31 +00:00
Rakshith R
1849076aab rbd: add EnsureImageCleanup to ensure image cleanup from trash
After moving moving image to trash, if `trash remove` step fails,
then external-provisioner will issue subsequent requests, in which
image will be absent in pool( will be in trash) and omap cleanup will
be done with stale image left in trash with no `trash remove` step on it.

To avoid this scenario list trash images and find corresponding id for given
image name and add a task to flatten when we encounter a ErrImageNotFound.

Fixes: #1728

Signed-off-by: Rakshith R <rar@redhat.com>
2021-10-20 18:25:31 +00:00
dependabot[bot]
5280b67327 rebase: bump github.com/hashicorp/vault/api from 1.1.1 to 1.2.0
Bumps [github.com/hashicorp/vault/api](https://github.com/hashicorp/vault) from 1.1.1 to 1.2.0.
- [Release notes](https://github.com/hashicorp/vault/releases)
- [Changelog](https://github.com/hashicorp/vault/blob/main/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/vault/compare/v1.1.1...v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/vault/api
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-10-20 13:57:39 +00:00
Niels de Vos
9bd9f5e91d rebase: update github.com/hashicorp/vault/sdk to latest
The github.com/hashicorp/vault/sdk was listed in the replace section,
most likely because using a newer version failed. By adding a missing
tagged version to the `exclude` section in go.mod, updating the package
works fine.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-10-20 13:57:39 +00:00
Niels de Vos
6d3e25f069 util: NodeGetVolumeStatsResponse.Usage may not contain negative values
Following the CSI specification, values that are included in the
VolumeUsage MUST NOT be negative. However, CephFS seems to return -1 for
the number of inodes that are available. Instead of returning a
negative value, set it to 0 so that it will not get included in the
encoded JSON response.

Updates: #2579
See-also: 5b0d454015/spec.md (L2477-L2487)
Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-10-20 07:18:48 +00:00
dependabot[bot]
6ffb91c047 rebase: bump github.com/aws/aws-sdk-go from 1.41.0 to 1.41.5
Bumps [github.com/aws/aws-sdk-go](https://github.com/aws/aws-sdk-go) from 1.41.0 to 1.41.5.
- [Release notes](https://github.com/aws/aws-sdk-go/releases)
- [Changelog](https://github.com/aws/aws-sdk-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/aws/aws-sdk-go/compare/v1.41.0...v1.41.5)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-10-19 17:52:53 +00:00
dependabot[bot]
a66012a5d4 rebase: bump github.com/ceph/go-ceph from 0.11.0 to 0.12.0
Bumps [github.com/ceph/go-ceph](https://github.com/ceph/go-ceph) from 0.11.0 to 0.12.0.
- [Release notes](https://github.com/ceph/go-ceph/releases)
- [Changelog](https://github.com/ceph/go-ceph/blob/master/docs/release-process.md)
- [Commits](https://github.com/ceph/go-ceph/compare/v0.11.0...v0.12.0)

---
updated-dependencies:
- dependency-name: github.com/ceph/go-ceph
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-10-19 13:27:19 +00:00
Robert Vasek
fedbb01ec3 doc: add proposal doc for CephFS snapshots as shallow RO volumes
This patch adds a proposal document for "CephFS snapshots
as shallow RO volumes".

Updates: #2142
Signed-off-by: Robert Vasek <robert.vasek@cern.ch>
2021-10-19 11:35:02 +00:00
Niels de Vos
85c84910d3 e2e: add a monitor container to the vault Pod
The command `vault monitor` can be used to stream logging from the Vault
service. This is very helpful while debugging Vault configuration
failures.

By adding a 2nd container to the Vault deployment, it is now possible to
get the messages from the Vault service by running

    $ kubectl logs -c monitor <vault-pod-0123abcd>

This will be very useful when the e2e tests do not delete the deployment
after a failure and fetch the logs from all containers.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
2021-10-19 03:37:42 +00:00