After failover of workloads to the secondary
cluster when the primary cluster is down,
the RBD image is not marked healthy and the VR
resources are not promoted to Primary:
in VolumeReplication, the `CURRENT STATE`
remains Unknown and doesn't change to Primary.
This happens because the primary cluster went down
and we have force-promoted the image on the
secondary cluster, so the image stays in
up+stopping_replay or could be in any other state.
The assumption so far was that the image will
always be `up+stopped`. But the image will be in
`up+stopped` only for a planned failover and it
could be in any other state if it is a forced
failover. For this reason, remove
checkHealthyPrimary from the PromoteVolume RPC call.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 8c5563a9bc)
The command to disable the apache arrow repo is removed, since
it is no longer needed.
The command to disable the tcmu repo is added to make the build pass.
refer: https://github.com/ceph/ceph-container/issues/2034
Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit 5ed305850f)
When using `lock_on_read`, the RBD image needs to have the
`exclusive-lock` feature enabled too.
Fixes: #3221
Signed-off-by: Niels de Vos <ndevos@redhat.com>
(cherry picked from commit 2df55a55a3)
Removed the code in checkHealthyPrimary which
makes the ceph call; the result is passed in as input now.
Added a unit test for the checkHealthyPrimary function.
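A minimal sketch of what such a table-driven test could look like, assuming a hypothetical `checkHealthyPrimary(mirrorStatus)` signature that takes the mirror status as input; the actual ceph-csi types and function signature differ:
```go
// checkhealthy_test.go: illustrative only, not the ceph-csi code.
package main

import "testing"

// mirrorStatus is a stand-in for the status that is now passed in
// instead of being fetched inside checkHealthyPrimary.
type mirrorStatus struct {
	state string // e.g. "up+stopped", "up+stopping_replay", "down+unknown"
}

// checkHealthyPrimary reports whether the image looks like a healthy
// primary, based solely on the status passed in.
func checkHealthyPrimary(st mirrorStatus) bool {
	return st.state == "up+stopped"
}

func TestCheckHealthyPrimary(t *testing.T) {
	tests := []struct {
		name string
		st   mirrorStatus
		want bool
	}{
		{"healthy primary", mirrorStatus{state: "up+stopped"}, true},
		{"stopping replay", mirrorStatus{state: "up+stopping_replay"}, false},
		{"daemon down", mirrorStatus{state: "down+unknown"}, false},
	}
	for _, tc := range tests {
		if got := checkHealthyPrimary(tc.st); got != tc.want {
			t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
		}
	}
}
```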
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 8a47904e8f)
We need to check that the image is in the up+stopped state,
not just in any one of the states; for that we need to use an
OR check, not an AND check.
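An illustrative sketch of the boolean fix, using hypothetical `up`/`state` values rather than the real ceph-csi/librbd types: the image is healthy only when it is both up and stopped, so the failure branch must trigger when either condition does not hold, i.e. with OR.
```go
package main

import (
	"errors"
	"fmt"
)

// isHealthyPrimary shows the corrected condition: the failure branch
// uses OR, so any unmet condition makes the check fail.
func isHealthyPrimary(up bool, state string) error {
	// wrong (old): if !up && state != "stopped" { ... }
	// right (new): fail when either check does not hold.
	if !up || state != "stopped" {
		return errors.New("image is not in up+stopped state")
	}
	return nil
}

func main() {
	// up but still replaying: the AND version would wrongly treat this
	// as healthy, the OR version correctly reports an error.
	fmt.Println(isHealthyPrimary(true, "stopping_replay"))
	fmt.Println(isHealthyPrimary(true, "stopped"))
}
```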
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 53e76fab69)
When the image is force-promoted to primary on the
cluster, the remote image might not be in the replaying
state due to the split-brain state. This
reverts commit c3c87f2ef3, which was added
to check the remote image status.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 704cb5c941)
Kubernetes 1.24 and newer use a different path for staging the volume.
That means the CSI-driver is requested to mount the volume at another
location, compared to previous versions of Kubernetes. CSI-drivers
implementing the volumeHealer must receive the correct path, otherwise,
after a nodeplugin restart, the NBD mounts will bail out when attempting
the NodeStageVolume() call and return an error.
See-also: kubernetes/kubernetes#107065
Fixes: #3176
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
(cherry picked from commit 1da446d2f2)
During failover we demote the volume on the primary.
As the image is not yet promoted on the remote cluster,
there are spurious split-brain errors reported by RBD,
and the Ceph-CSI resync will attempt to resync from the
"known" secondary and that will cause data loss.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 3acaa018db)
This version is required to support k8s 1.24,
which is in turn required to test the new SA token
behavior.
Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit 7a00736b4f)
Create the token if the Kubernetes version is
1.24+ and use it for the Vault SA.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit 7a2dd4c3cf)
In the case of a pre-provisioned volume the clusterID is
not set in the volume context; as the clusterID is missing,
we cannot extract the NetNamespaceFilePath from the
configuration file. For static volumes and dynamically
provisioned volumes the clusterID is set.
Note: this is a special case to support mounting a PV
without the clusterID parameter.
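A hedged sketch of the special case described above; `getNetNamespaceFilePath`, the map key, and the path are illustrative stand-ins for the actual ceph-csi helpers and configuration:
```go
package main

import "fmt"

// getNetNamespaceFilePath is a stand-in for the lookup that reads the
// NetNamespaceFilePath for a clusterID from the configuration file.
func getNetNamespaceFilePath(clusterID string) (string, error) {
	return "/plugin-dir/" + clusterID + "/net", nil // illustrative only
}

// netNamespacePath returns an empty path when the clusterID is absent
// (pre-provisioned PV without the clusterID parameter), instead of
// failing the mount.
func netNamespacePath(volumeContext map[string]string) (string, error) {
	clusterID := volumeContext["clusterID"]
	if clusterID == "" {
		// special case: no clusterID, so the NetNamespaceFilePath
		// cannot be looked up from the configuration file.
		return "", nil
	}
	return getNetNamespaceFilePath(clusterID)
}

func main() {
	p, _ := netNamespacePath(map[string]string{})
	fmt.Printf("netns path: %q\n", p)
}
```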
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit c9943320ac)
Before the change, the error msg was the following:
```
failed to set VAULT_AUTH_MOUNT_PATH in Vault config: path is empty
```
`vaultAuthPath` is the actual variable name set by the
user. The error message will now be the following:
```
failed to set "vaultAuthPath" in vault config: path is empty
```
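A minimal sketch of the formatting change, assuming a generic `setConfigString`-style helper (not the exact ceph-csi code): quoting the user-facing key with `%q` makes the error reference the variable name the user actually set.
```go
package main

import (
	"errors"
	"fmt"
)

var errConfigMissing = errors.New("path is empty")

// setConfigString is a stand-in for the Vault config helper; the point is
// only the error wording: report the user-facing key (e.g. "vaultAuthPath")
// instead of the internal VAULT_* environment variable name.
func setConfigString(key, value string) error {
	if value == "" {
		return fmt.Errorf("failed to set %q in vault config: %w", key, errConfigMissing)
	}
	return nil
}

func main() {
	fmt.Println(setConfigString("vaultAuthPath", ""))
	// prints: failed to set "vaultAuthPath" in vault config: path is empty
}
```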
Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit 7688306f87)
Continue running the rbd driver when the /sys/bus/rbd/supported_features
file is missing; do not bail out.
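A hedged sketch of the intended behavior, not the actual ceph-csi code: when /sys/bus/rbd/supported_features does not exist, log and continue with zero features instead of returning the error.
```go
package main

import (
	"log"
	"os"
	"strconv"
	"strings"
)

const krbdSupportedFeaturesFile = "/sys/bus/rbd/supported_features"

// readKrbdFeatures returns the krbd feature bitmask, or 0 when the
// sysfs file is missing (older kernels), without treating that as fatal.
func readKrbdFeatures() (uint64, error) {
	data, err := os.ReadFile(krbdSupportedFeaturesFile)
	if os.IsNotExist(err) {
		log.Printf("%s not found, continuing without krbd feature detection",
			krbdSupportedFeaturesFile)
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	// the file contains a value such as "0x3fffffffff"
	return strconv.ParseUint(strings.TrimSpace(string(data)), 0, 64)
}

func main() {
	features, err := readKrbdFeatures()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("krbd supported features: 0x%x", features)
}
```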
Fixes: #2678
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
(cherry picked from commit 6470cf3343)
krbdFeatures is set to zero when the kernel version is < 3.8, i.e. in the
case where /sys/bus/rbd/supported_features is absent and we are unable to
prepare the krbd attributes based on the kernel version.
When krbdFeatures is set to zero, fall back to NBD only when autofallback
is turned ON.
Fixes: #2678
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
(cherry picked from commit 83cc1b0e58)
Upstream, /sys/bus/rbd/supported_features is part of Linux kernel v4.11.0.
Prepare the attributes and use them in case
/sys/bus/rbd/supported_features is missing.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
(cherry picked from commit e53fd87154)
`bash -E` causes inheritance of the ERR trap into shell functions,
command substitutions, and commands executed in a subshell environment.
Because the `kubectl_retry` function depends on detecting an error of a
subshell, the ERR trap should not be executed. The trap contains
extra logging, and exits the script in the `rook.sh` case. Aborting
the script is not wanted when a retry is expected to be done.
While checking for known failures, the `grep` command may exit with 1
if there are no matches. That means the `ret` variable will be set to
0, but there will also be an error exit status. This causes `bash -E` to
abort the function and call the ERR trap.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
(cherry picked from commit 4891e534d3)
For the default mounter, the mounter option
will not be set in the storageclass, and as it is
not available in the storageclass it will not
be set in the volume context either. Because of this
the mapOptions are getting discarded. If the mounter
is not set, assume it is the rbd mounter.
Note: if the mounter is not set in the storageclass
we could set it in the volume context explicitly;
doing this check in the node server supports
existing volumes (backward compatibility), and as the
check is minimal we are not altering the volume context.
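A minimal sketch of the described node-server check, with hypothetical names: if the volume context carries no mounter key, treat it as the default rbd (krbd) mounter so the mapOptions are not discarded.
```go
package main

import "fmt"

const rbdDefaultMounter = "rbd"

// mounterFromContext picks the mounter from the volume context, falling
// back to the default krbd mounter when the storageclass (and therefore
// the volume context) did not set one.
func mounterFromContext(volumeContext map[string]string) string {
	if m, ok := volumeContext["mounter"]; ok && m != "" {
		return m
	}
	return rbdDefaultMounter
}

func main() {
	fmt.Println(mounterFromContext(map[string]string{}))                     // "rbd"
	fmt.Println(mounterFromContext(map[string]string{"mounter": "rbd-nbd"})) // "rbd-nbd"
}
```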
fixes: #3076
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 70674565df)
Still seeing the commitlint issue
below:
fatal: unsafe repository
('/go/src/github.com/ceph/ceph-csi'
is owned by someone else)
To add an exception for this directory,
call:
git config --global --add safe.directory \
/go/src/github.com/ceph/ceph-csi
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit df047ddaaf)
This commit is added to use the canary csi-provisioner image
to test the different-storageclass pvc-pvc cloning feature, which is not
yet present in released versions.
refer:
https://github.com/kubernetes-csi/external-provisioner/pull/699
Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit c880061882)
# Conflicts:
# charts/ceph-csi-rbd/values.yaml
# deploy/rbd/kubernetes/csi-rbdplugin-provisioner.yaml
This commit makes modifications to allow pvc-pvc clone
with different storageclasses having different encryption
configs.
This commit also modifies `copyEncryptionConfig()` to
include a `isEncrypted()` check within the function.
Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit f1ccc4eced)
Before the change, the error msg was the following:
```
failed to set VAULT_AUTH_MOUNT_PATH in Vault config: path is empty
```
`vaultAuthPath` is the actual variable name set by the
user. The error message will now be the following:
```
failed to set "vaultAuthPath" in vault config: path is empty
```
Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit bd57feb26e)
In case the NFS-export has already been removed from the NFS-server, but
the CSI Controller was restarted, a retry to remove the NFS-volume will
fail with an error like:
> GRPC error: ....: response status not empty: "Export does not exist"
When this error is reported, assume the NFS-export was already removed
from the NFS-server configuration, and continue with deleting the
backend volume.
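A hedged sketch of the idempotent-delete behavior described above; the error-string match and helper names are illustrative, not the exact ceph-csi implementation.
```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// deleteExport is a stand-in for the call that removes the NFS-export
// from the NFS-server configuration.
func deleteExport(name string) error {
	return errors.New(`response status not empty: "Export does not exist"`)
}

// removeExportIdempotent treats an already-removed export as success, so
// that deleting the backend volume can continue after a controller restart.
func removeExportIdempotent(name string) error {
	err := deleteExport(name)
	if err != nil && strings.Contains(err.Error(), "Export does not exist") {
		// the export is gone already, nothing left to clean up here
		return nil
	}
	return err
}

func main() {
	fmt.Println(removeExportIdempotent("myexport")) // <nil>
}
```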
Signed-off-by: Niels de Vos <ndevos@redhat.com>
(cherry picked from commit 9d7faf850f)
Depending on the Kubernetes version, the following warning is reported
regularly:
> Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+,
> unavailable in v1.25+
The warning is written to stderr, so skipping AlreadyExists or NotFound
is not sufficient to trigger a retry. Ignoring '^Warning:' in the stderr
output should prevent unneeded failures while deploying Rook or other
components.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
(cherry picked from commit 89e2ff39f1)
Rook deployments fail quite regularly in the CI environment now. It is
not clear what the cause is; hopefully a little better logging will
guide us to the issue.
Now executing `kubectl` in a sub-shell, ensuring that the redirection of
the command lands in the right files.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
(cherry picked from commit 1e21972956)
When running the Kubernetes cluster with one single privileged
PodSecurityPolicy which allows everything, the nodeplugin
daemonset can fail to start. To be precise, the problem is the
defaultAllowPrivilegeEscalation: false configuration in the PSP.
Containers of the nodeplugin daemonset won't start when they
have privileged: true but no allowPrivilegeEscalation in their
container securityContext.
Kubernetes will not schedule if this mismatch exists:
cannot set allowPrivilegeEscalation to false and privileged to true
Signed-off-by: Silvan Loser <silvan.loser@hotmail.ch>
Signed-off-by: Silvan Loser <33911078+losil@users.noreply.github.com>
(cherry picked from commit f2e0fa28fb)
When running the Kubernetes cluster with one single privileged
PodSecurityPolicy which allows everything, the nodeplugin
daemonset can fail to start. To be precise, the problem is the
defaultAllowPrivilegeEscalation: false configuration in the PSP.
Containers of the nodeplugin daemonset won't start when they
have privileged: true but no allowPrivilegeEscalation in their
container securityContext.
Kubernetes will not schedule if this mismatch exists:
cannot set allowPrivilegeEscalation to false and privileged to true
Signed-off-by: Silvan Loser <silvan.loser@hotmail.ch>
Signed-off-by: Silvan Loser <33911078+losil@users.noreply.github.com>
(cherry picked from commit 06c4477ff9)
Updated the doc for the 3.6.1 release; this will
be backported to the release-v3.6 branch, and
we will make the deployment changes and do the release.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 5e1a074ea3)
As the same host directory is not shared between
the cephfs and the rbd plugin pods, we need
to keep the netNamespaceFilePath separately
for both cephfs and rbd. The CephFS plugin will
use this path to execute mount -t commands.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit d2bc9743f7)
Adjusted the mix of tabs and spaces and also
used a block comment for better readability.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit eb4bfb7326)
As radosNamespace is more specific to
RBD than to the general ceph configuration, we
introduced a new RBD section for RBD-specific
options. Moving the radosNamespace to the RBD section
while still keeping radosNamespace under the
global ceph-level configuration for backward
compatibility.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit b4acbd08a5)
As the netNamespaceFilePath can be separate for
cephfs and rbd, adding the netNamespaceFilePath
for RBD. This will help us keep the RBD-specific
and CephFS-specific options separate.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 766346868e)
The NFS Controller returns a non-gRPC error in case the CreateVolume
call for the CephFS volume fails. It is better to return the gRPC-error
that the CephFS Controller passed along.
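A minimal sketch of passing the wrapped gRPC error through instead of returning a plain error, using the standard google.golang.org/grpc/status and codes packages; the createCephFSVolume helper is hypothetical.
```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// createCephFSVolume is a placeholder for the CephFS CreateVolume call,
// which already returns a gRPC status error on failure.
func createCephFSVolume() error {
	return status.Error(codes.InvalidArgument, "invalid volume parameters")
}

// createNFSVolume passes a gRPC error from the CephFS controller through
// unchanged, and only wraps non-gRPC errors itself.
func createNFSVolume() error {
	err := createCephFSVolume()
	if err == nil {
		return nil
	}
	if _, ok := status.FromError(err); ok {
		return err // keep the original gRPC code and message
	}
	return status.Error(codes.Internal, err.Error())
}

func main() {
	err := createNFSVolume()
	fmt.Println(status.Code(err), err)
}
```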
Signed-off-by: Niels de Vos <ndevos@redhat.com>
(cherry picked from commit 2b71aac752)
The NFS-Admin API has been added to go-ceph v0.15.0. As the API can not
be tested in the go-ceph CI, it requires build-tag `ceph_ci_untested`.
This additional build-tag has been added to the `Makefile` and should be
removed when the API does not require the build-tag anymore.
See-also: ceph/go-ceph#655
Signed-off-by: Niels de Vos <ndevos@redhat.com>
(cherry picked from commit 282c33cb58)
Recent versions of Ceph allow calling the NFS-export management
functions over the go-ceph API.
This seems incompatible with older versions that have been tested with
the `ceph nfs` commands that this commit replaces.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
(cherry picked from commit 28369702d2)
Use leases for leader election instead
of the deprecated configmap-based leader
election.
This PR makes leases the default leader election mechanism;
refer to
https://github.com/kubernetes-sigs/controller-runtime/pull/1773.
The default switch from configmap to configmap+leases was done with
https://github.com/kubernetes-sigs/controller-runtime/pull/1144.
Release notes:
https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.7.0
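A hedged sketch of switching controller-runtime to lease-based leader election; it assumes a controller-runtime version that exposes LeaderElectionResourceLock in manager.Options (v0.7.0 and newer), and the ID/namespace values are placeholders.
```go
package main

import (
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection: true,
		// use Lease objects instead of the deprecated ConfigMap lock
		LeaderElectionResourceLock: resourcelock.LeasesResourceLock,
		LeaderElectionID:           "example-ceph-csi-controller", // placeholder
		LeaderElectionNamespace:    "default",                     // placeholder
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```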
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit d886ab0d66)
Log the nsenter command and its arguments after executing
the command with the nsenter CLI.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit c245436ec4)
setRbdNbdToolFeatures was being called inside an init
function, which gets called in main.go for both the cephfs
and rbd drivers. Instead of calling it in the init function,
call it in the rbd driver.go, as this is specific
to rbd.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit dffb6e72c2)
This commit adds the nfs provisioner & plugin SA to
scc.yaml, to be used with OpenShift.
Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit 784b086ea5)
To consider the image healthy during the Promote
operation, currently we check only the image
state on the primary site. If the network is flaky
or the remote site is down, the image health is
not as expected. To make sure the image is healthy
across the clusters, check the state on both the local
and the remote clusters.
Some details:
https://bugzilla.redhat.com/show_bug.cgi?id=2014495
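An illustrative sketch (hypothetical types and expected states, not the go-ceph/ceph-csi API) of checking the status reported for both the local and the remote sites before promoting:
```go
package main

import (
	"errors"
	"fmt"
)

// siteStatus is a stand-in for one mirror site status entry.
type siteStatus struct {
	site  string // "" for the local site, otherwise the remote site name
	state string // e.g. "up+stopped", "up+replaying", "down+unknown"
}

// healthyAcrossClusters fails when any site (local or remote) does not
// report the state expected for it; the expected states are supplied by
// the caller.
func healthyAcrossClusters(statuses []siteStatus, expected map[string]string) error {
	if len(statuses) == 0 {
		return errors.New("no mirror status available")
	}
	for _, s := range statuses {
		if want, ok := expected[s.site]; ok && s.state != want {
			return fmt.Errorf("site %q is in state %q, expected %q", s.site, s.state, want)
		}
	}
	return nil
}

func main() {
	// example expectations: local image stopped, remote peer replaying
	expected := map[string]string{"": "up+stopped", "site-b": "up+replaying"}
	fmt.Println(healthyAcrossClusters([]siteStatus{
		{site: "", state: "up+stopped"},
		{site: "site-b", state: "up+replaying"},
	}, expected))
}
```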
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
(cherry picked from commit 64a9b1fa59)