currently we have a bug in rbd mirror scheduling module.
After doing failover and failback the scheduling is not
getting updated and the mirroring snapshots are not
getting created periodically as per the scheduling
interval. This PR workarounds this one by doing below
operations
* Create a dummy (unique) image per cluster and this image
should be easily identified.
* During Promote operation on any image enable the
mirroring on the dummy image. when we enable the mirroring
on the dummy image the pool will get updated and the
scheduling will be reconfigured.
* During Demote operation on any image disable the mirroring
on the dummy image. the disable need to be done to enable
the mirroring again when we get the promote request to make
the image as primary
* When the DR is no more needed, this image need to be
manually cleanup as for now as we dont want to add a check
in the existing DeleteVolume code path for delete dummy image
as it impact the performance of the DeleteVolume workflow.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Moved to add scheduling to the promote
operation as scheduling need to be added
when the image is promoted and this is
the correct method of adding the scheduling
to make the scheduling take place.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
instead of logging the volumeID and the pool
name. log the poolname and image name for better
debugging.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Adding actions/retest to the dependabot configuration makes sure all
vendored packages will get updated when new releases are available.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Considering we are far out of these release and only care about
kubernetes releases from v1.20, there is no need to have this
version check in place for the tests.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Considering we are far out of these release and only care about
kubernetes releases from v1.20, there is no need to have this
version check in place for the tests.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Considering we are far out of these release and only care about
kubernetes releases from v1.20, there is no need to have this
version check in place for the tests.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
considering we are far out of this release and only care about
kubernetes releases from v1.20, there is no need to have this
version check in place for the tests.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
considering we are far out of this release and only care about
kubernetes releases from v1.20, there is no need to have this
version check in place for the tests.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
considering we are far out of this release and only care about
kubernetes releases from v1.20, there is no need to have this
version check in place for the tests.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
If we hit any error while running the cryptosetup
commands we are logging only the error message.
with only error message it is difficult to analyze
the problem, logging the stdError will help us to
check what is the problem.
updates: #2610
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
On line 341 a `transaction` is created. This is passed to the deferred
`undoStagingTransaction()` function when an error in the
`NodeStageVolume` procedure is detected. So far, so good.
However, on line 356 a new `transaction` is returned. This new
`transaction` is not used for the defer call.
By removing the empty `transaction` that is used in the defer call, and
calling `undoStagingTransaction()` on an error of `stageTransaction()`,
the code is a little simpler, and the cleanup of the transaction should
be done correctly now.
Updates: #2610
Signed-off-by: Niels de Vos <ndevos@redhat.com>
It seems that building the `retest` action makes it consume shared
libraries that are not part of the `scratch` base container layer. By
using the golang:1.16 container image as a base, all required shared
libraries are available.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
There have been occasional CI job failures due to "transport is closing"
errors. Adding this error to the isRetryableAPIError() function should
make sure to retry the request until the connection is restored.
Fixes: #2613
Signed-off-by: Niels de Vos <ndevos@redhat.com>
The webserver at layeh.com seems to be misbehaving, which causes `go mod
verify` to fail. The layeh.com/radius repository is maintained on
GitHub, so the sources can be vendored/verified from there too.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
as there are lot of spell check failures
on the vendor directory, skipping it.
skipping retest action folder as
PullRequests word is not getting ignored
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
currently the mountType validation of the encrypted volume is done in
the application, we should rather validate this inside the nodeplugin
pod.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Currently, at "perform IO on rbd-nbd volume after nodeplugin restart"
test we are performing write on the rbd-nbd based mount after nodeplugin
restart. But due to a bug in NBD driver the writes are failing, please
note NBD zero cmd timeout handling is fixed with kernel >= 5.4 and hence
we should defend on writes based on kernel version to avoid unnecessary
CI failures.
For more information see
https://github.com/ceph/ceph-csi/issues/2204#issuecomment-930941047
updates: #2204
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This log line is seen frequently in the logs and its better to be at
Warning loglevel rather than Error based on its severity
E1109 08:30:45.612395 38328 util.go:247] kernel 4.19.202 does not support required features
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Problem:
On remap/attach of device (i.e. nodeplugin restart), there is no way
for rbd-nbd to defend if the backend storage is matching with the initial
backend storage.
Say, if an initial map request for backend "pool1/image1" got mapped to
/dev/nbd0 and the userspace process is terminated (on nodeplugin restart).
A next remap/attach (nodeplugin start) request within reattach-timeout is
allowed to use /dev/nbd0 for a different backend "pool1/image2"
For example, an operation like below could be dangerous:
$ sudo rbd-nbd map --try-netlink rbd-pool/ext4-image
/dev/nbd0
$ sudo blkid /dev/nbd0
/dev/nbd0: UUID="bfc444b4-64b1-418f-8b36-6e0d170cfc04" TYPE="ext4"
$ sudo pkill -15 rbd-nbd <-- nodeplugin terminate
$ sudo rbd-nbd attach --try-netlink --device /dev/nbd0 rbd-pool/xfs-image
/dev/nbd0
$ sudo blkid /dev/nbd0
/dev/nbd0: UUID="d29bf343-6570-4069-a9ea-2fa156ced908" TYPE="xfs"
Solution:
rbd-nbd/kernel now provides a way to keep some metadata in sysfs to identify
between the device and the backend, so that when a remap/attach request is
made, rbd-nbd can compare and avoid such dangerous operations.
With the provided solution, as part of the initial map request, backend
cookie (ceph-csi VOLID) can be stored in the sysfs per device config, so
that on a remap/attach request rbd-nbd will check and validate if the
backend per device cookie matches with the initial map backend with the help
of cookie.
At Ceph-csi we use VOLID as device cookie, which will be unique, we pass
the VOLID as cookie at map and use the same at the time of attach, that
way rbd-nbd can identify backends and their matching devices.
Requires:
https://github.com/ceph/ceph/pull/41323https://lkml.org/lkml/2021/4/29/274
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
When dependabot creates a PR, and an other gets merged, the bot
automatically triggers a rebase. This will drop any approvals, causing
delays in the review/merge process.
The project uses Mergify to automatically rebase when needed, and
approvals are retained when Mergify rebases PR. By disabling the
auto-rebasing done by dependabot, fewer rebases should be needed,
contributors only need to review once, and CI jobs are triggered less
often.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
This change allows the user to choose not to fallback to NBD mounter
when some ImageFeatures are absent with krbd driver, rather just fail
the NodeStage call.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Currently, we recognize and warn for the provided image features based on
our prior intelligence at ceph-csi (i.e based on supportedFeatures map
and validateImageFeatures) at image/PV creation time. It might be very
much possible that the cluster is heterogeneous i.e. the PV creation and
application container might both be on different nodes with different
kernel versions (krbd driver versions).
This PR adds a mechanism to check for the supported krbd features during
mount time, if the krbd driver doesn't have the specified image feature
then it will fall back to rbd-nbd mounter.
Fixes: #478
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
When using UPPER_CASE formatting for the HashiCorp Vault KMS
configuration, a missing `VAULT_DESTROY_KEYS` will cause the option to
be set to "false". The default for the option is intended for be "true".
This is a difference in behaviour between the `vaultDestroyKeys` and
`VAULT_DESTROY_KEYS` options. Both should use a default of "true" when
the configuration does not set the option explicitly.
By setting the default options in the `standardVault` struct before
unmarshalling the configuration in it, the default values will be
retained for the missing configuration options.
Reported-by: Rachael George <rgeorge@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
this commit create and make use of migration secret in the requests and
validate various csi operations
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
this commit make use of the migration request secret parsing and set
the required fields for further nodestage operations
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
parseAndDeleteMigratedVolume() prviously clubbed the logic of
parsing of migration volume handle and then continued with the
deletion of the volume. however this commit split this
logic into two, ie parsing has been done in parseMigrationVolID()
and DeleteMigratedVolume() deletes the backend volume.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>