Kubernetes 1.24 and newer use a different path for staging the volume.
That means the CSI-driver is requested to mount the volume at an other
location, compared to previous versions of Kubernetes. CSI-drivers
implementing the volumeHealer, must receive the correct path, otherwise
the after a nodeplugin restart the NBD mounts will bailout attempting
to NodeStageVolume() call and return an error.
See-also: kubernetes/kubernetes#107065
Fixes: #3176
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
During failover we do demote the volume on the primary
as the image is still not promoted yet on the remote cluster,
there are spurious split-brain errors reported by RBD,
the Cephcsi resync will attempt to resync from the "known"
secondary and that will cause data loss
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
create the token if kubernetes version in
1.24+ and use it for vault sa.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Signed-off-by: Rakshith R <rar@redhat.com>
This commit implements most of
docs/design/proposals/cephfs-snapshot-shallow-ro-vol.md design document;
specifically (de-)provisioning of snapshot-backed volumes, mounting such
volumes as well as mounting pre-provisioned snapshot-backed volumes.
Signed-off-by: Robert Vasek <robert.vasek@cern.ch>
RBD supports creating rbd images with
object size, stripe unit and stripe count
to support striping. This PR adds the support
for the same.
More details about striping at
https://docs.ceph.com/en/quincy/man/8/rbd/#stripingfixes: #3124
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This change helps read the cluster name from the cmdline args,
the provisioner will set the same on the RBD images.
Fixes: #2973
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
In case of pre-provisioned volume the clusterID is
not set in the volume context as the clusterID is missing
we cannot extract the NetNamespaceFilePath from the
configuration file. For static volume and dynamically
provisioned volume the clusterID is set.
Note:- This is a special case to support mounting PV
without clusterID parameter.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Before the change, the error msg was the following:
```
failed to set VAULT_AUTH_MOUNT_PATH in Vault config: path is empty
```
`vaultAuthPath` is the actual variable name set by the
user. The error message will now be the following:
```
failed to set "vaultAuthPath" in vault config: path is empty
```
Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds support for pvc-pvc clone.
Only capability needed to be advertised, the
underlying support is already provided by cephfs
backend.
Signed-off-by: Rakshith R <rar@redhat.com>
continue running rbd driver when /sys/bus/rbd/supported_features file is
missing, do not bailout.
Fixes: #2678
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
krbdFeatures is set to zero when kernel version < 3.8, i.e. in case where
/sys/bus/rbd/supported_features is absent and we are unable to prepare
the krbd attributes based on kernel version.
When krbdFeatures is set to zero fallback to NBD only when autofallback
is turned ON.
Fixes: #2678
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Upstream /sys/bus/rbd/supported_features is part of Linux kernel v4.11.0
Prepare the attributes and use them in case if
/sys/bus/rbd/supported_features is missing.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Move k8s.GetVolumeMetadata() out of setVolumeMetadata() and rename it to
setAllMetadata() so that the same can be used for setting volume and
snapshot metadata.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
There is not much the NFS-provisioner needs to do to expand a volume,
everything is handled by the CephFS components.
NFS does not need a resize on the node, so only ControllerExpandVolume
is required.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
For the default mounter the mounter option
will not be set in the storageclass and as it is
not available in the storageclass same will not
be set in the volume context, Because of this the
mapOptions are getting discarded. If the mounter
is not set assuming it's an rbd mounter.
Note:- If the mounter is not set in the storageclass
we can set it in the volume context explicitly,
Doing this check-in node server to support backward
existing volumes and the check is minimal we are not
altering the volume context.
fixes: #3076
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
With cgroup v2, the location of the pids.max file changed and so did the
/proc/self/cgroup file
new /proc/self/cgroup file
`
0::/user.slice/user-500.slice/session-14.scope
`
old file:
`
11:pids:/user.slice/user-500.slice/session-2.scope
10:blkio:/user.slice
9:net_cls,net_prio:/
8:perf_event:/
...
`
There is no directory per subsystem (e.g. /sys/fs/cgroup/pids) any more, all
files are now in one directory.
fixes: https://github.com/ceph/ceph-csi/issues/3085
Signed-off-by: Marcus Röder <m.roeder@yieldlab.de>
This commit makes modification so as to allow pvc-pvc clone
with different storageclass having different encryption
configs.
This commit also modifies `copyEncryptionConfig()` to
include a `isEncrypted()` check within the function.
Signed-off-by: Rakshith R <rar@redhat.com>
Before the change, the error msg was the following:
```
failed to set VAULT_AUTH_MOUNT_PATH in Vault config: path is empty
```
`vaultAuthPath` is the actual variable name set by the
user. The error message will now be the following:
```
failed to set "vaultAuthPath" in vault config: path is empty
```
Signed-off-by: Rakshith R <rar@redhat.com>
In case the NFS-export has already been removed from the NFS-server, but
the CSI Controller was restarted, a retry to remove the NFS-volume will
fail with an error like:
> GRPC error: ....: response status not empty: "Export does not exist"
When this error is reported, assume the NFS-export was already removed
from the NFS-server configuration, and continue with deleting the
backend volume.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
as same host directory is not shared between
the cephfs and the rbd plugin pod. we need
to keep the netNamespaceFilePath separately
for both cephfs and rbd. CephFS plugin will
use this path to execute mount -t commands.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
As radosNamespace is more specific to
RBD not the general ceph configuration. Now
we introduced a new RBD section for RBD specific
options, Moving the radosNamespace to RBD section
and keeping the radosNamespace still under the
global ceph level configration for backward
compatibility.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
As the netNamespaceFilePath can be separate for
both cephfs and rbd adding the netNamespaceFilePath
path for RBD, This will help us to keep RBD and
CephFS specific options separately.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The NFS Controller returns a non-gRPC error in case the CreateVolume
call for the CephFS volume fails. It is better to return the gRPC-error
that the CephFS Controller passed along.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Recent versions of Ceph allow calling the NFS-export management
functions over the go-ceph API.
This seems incompatible with older versions that have been tested with
the `ceph nfs` commands that this commit replaces.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
use leases for leader election instead
of the deprecated configmap based leader
election.
This PR is making leases as default leader election
refer https://github.com/kubernetes-sigs/
controller-runtime/pull/1773, default from configmap
to configmap leases was done with
https://github.com/kubernetes-sigs/
controller-runtime/pull/1144.
Release notes https://github.com/kubernetes-sigs/
controller-runtime/releases/tag/v0.7.0
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
To consider the image is healthy during the Promote
operation currently we are checking only the image
state on the primary site. If the network is flaky
or the remote site is down the image health is
not as expected. To make sure the image is healthy
across the clusters check the state on both local
and the remote clusters.
some details:
https://bugzilla.redhat.com/show_bug.cgi?id=2014495
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
calling setRbdNbdToolFeatures inside an init
gets called in main.go for both cephfs and rbd
driver. instead of calling it in init function
calling this in rbd driver.go as this is specific
to rbd.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Set snapshot-name/snapshot-namespace/snapshotcontent-name details
on RBD backend snapshot image as metadata on snapshot
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Make sure to set metadata when image exist, i.e. if the provisioner pod
is restarted while createVolume is in progress, say it created the image
but didn't yet set the metadata.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Example if a PVC was delete by setting `persistentVolumeReclaimPolicy` as
`Retain` on PV, and PV is reattached to a new PVC, we make sure to update
PV/PVC image metadata on a PV reattach.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This helps Monitoring solutions without access to Kubernetes clusters to
display the details of the PV/PVC/NameSpace in their dashboard.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Define and use PV and PVC metadata keys used by external provisioner.
The CSI external-provisioner (v1.6.0+) introduces the
--extra-create-metadata flag, which automatically sets map<string, string>
parameters in the CSI CreateVolumeRequest.
Add utility functions to set/Get PV/PVC/PVCNamespace metadata on image
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
add support to run rbd map and mount -t
commands with the nsenter.
complete design of pod/multus network
is added here https://github.com/rook/rook/
blob/master/design/ceph/multus-network.md#csi-pods
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Currently we only check if the rbd-nbd tool supports cookie feature.
This change will also defend cookie addition based on kernel version
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
The omap is stored with the requested
snapshot name not with the subvolume
snapshotname. This fix uses the correct
snapshot request name to cleanup the omap
once the subvolume snapshot is deleted.
fixes: #2974
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The `ceph nfs export ...` commands have changed in recent Ceph releases.
Use the most recent command as a default, fall back to the older command
when an error is reported.
This shoud make the NFS-provisioner work on any current Ceph version.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Increase the timeout to 2 minutes to give enough time
for rollback to complete.
As rollback is performed by the force-promote command it,
at times, may take more than a minute
(based on dirty blocks that need to be rolled
back approximately) to rollback.
The added extra 1 minute is useful though to avoid
multiple calls to complete the rollback and in
extremely corner cases to avoid failures in the
first instance of the call when the mirror watcher
is not yet removed (post scaling down the
RBD mirror instance)
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Restoring a snapshot with a new PVC results with a wrong
dataPoolName in case of initial volume linked
to a storageClass with topology constraints and erasure coding.
Signed-off-by: Thibaut Blanchard <thibaut.blanchard@gmail.com>
NFSVolume instances are short lived, they only extist for a certain gRPC
procedure. It is easier to store the calling Context in the NFSVolume
struct, than to pass it to some of the functions that require it.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
These NFS Controller and Identity servers are the base for the new
provisioner. The functionality is currently extremely limited, follow-up
PRs will implement various CSI procedures.
CreateVolume is implemented with the bare minimum. This makes it
possible to create a volume, and mount it with the
kubernetes-csi/csi-driver-nfs NodePlugin.
DeleteVolume unexports the volume from the Ceph managed NFS-Ganesha
service. In case the Ceph cluster provides multiple NFS-Ganesha
deployments, things might not work as expected. This is going to be
addressed in follow-up improvements.
Lots of TODO comments need to be resolved before this can be declared
"production ready". Unit- and e2e-tests are missing as well.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
RT, reference tracker, is key-based implementation of a reference counter.
Unlike an integer-based counter, RT counts references by tracking unique
keys. This allows accounting in situations where idempotency must be
preserved. It guarantees there will be no duplicit increments or decrements
of the counter.
Signed-off-by: Robert Vasek <robert.vasek@cern.ch>
OIDC token file path has been modified from
`/var/run/secrets/token` to `/run/secrets/tokens`.
This has been done to ensure compliance with
FHS 3.0.
refer:
https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch05s13.html
Signed-off-by: Rakshith R <rar@redhat.com>
Below are the 3 different cases where we need
the PVC namespace for encryption
* CreateVolume:- Read the namespace from the
createVolume parameters and store it in the omap
* NodeStage:- Read the namespace from the omap
not from the volumeContext
* Regenerate:- Read the pvc namespace from the claimRef
not from the volumeAttributes.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
remove kubernetes csi prefixed parameters
from the volumeContext as we dont want
to store it in the PV VolumeAttributes.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
remove kubernetes csi prefixed parameters
from the volumeContext as we dont want
to store it in the PV VolumeAttributes.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
added helper function to strip the kubernetes
specific parameters from the volumeContext as
volumeContext is storaged in the PV volumeAttributes
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit ensures that parent image is flattened before
creating volume.
- If the data source is a PVC, the underlying image's parent
is flattened(which would be a temp clone or snapshot).
hard & soft limit is reduced by 2 to account for depth that
will be added by temp & final clone.
- If the data source is a Snapshot, the underlying image is
itself flattened.
hard & soft limit is reduced by 1 to account for depth that
will be added by the clone which will be restored from the
snapshot.
Flattening step for resulting PVC image restored from snapshot is removed.
Flattening step for temp clone & final image is removed when pvc clone is
being created.
Fixes: #2190
Signed-off-by: Rakshith R <rar@redhat.com>
as per the CSI standard the size is optional parameter,
as we are allowing the clone to a bigger size
today we need to block the clone to a smaller size
as its a have side effects like data corruption etc.
Note:- Even though this check is present in kubernetes
sidecar as CSI is CO independent adding the check
here.
fixes: #2718
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
These RPCs( nodestage,unstage,volumestats) are
implemented RPCs for our drivers atm. This commit removes
the `unimplemented` responses from the common/default
server initialization routins.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
These RPCs ( controller expand, create and delete snapshots) are
no longer unimplmented and we dont have to declare these as with
`unimplemented` states. This commit remove the same.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
With Amazon STS and kubernetes cluster is configured with
OIDC identity provider, credentials to access Amazon KMS
can be fetched using oidc-token(serviceaccount token).
Each tenant/namespace needs to create a secret with aws region,
role and CMK ARN.
Ceph-CSI will assume the given role with oidc token and access
aws KMS, with given CMK to encrypt/decrypt DEK which will stored
in the image metdata.
Refer: https://docs.aws.amazon.com/STS/latest/APIReference/welcome.htmlResolves: #2879
Signed-off-by: Rakshith R <rar@redhat.com>
Currently, we support
mapOption: "krbd:v1,v2,v3;nbd:v1,v2,v3"
- By omitting `krbd:` or `nbd:`, the option(s) apply to
rbdDefaultMounter which is krbd.
- A user can _override_ the options for a mounter by specifying `krbd:`
or `nbd:`.
mapOption: "v1,v2,v3;nbd:v1,v2,v3"
is effectively the same as the 1st example.
- Sections are split by `;`.
- If users want to specify common options for both `krbd` and `nbd`,
they should mention them twice.
But in case if the krbd or nbd specifc options contian `:` within them,
then the parsing is failing now.
E0301 10:19:13.615111 7348 utils.go:200] ID: 63 Req-ID:
0001-0009-rook-ceph-0000000000000001-fd37c41b-9948-11ec-ad32-0242ac110004
GRPC error: badly formatted map/unmap options:
"krbd:read_from_replica=localize,crush_location=zone:zone1;"
This patch fix the above case where the options itself contain `:`
delimitor
ex: krbd:v1,v2,v3=v31:v32;nbd:v1,v2,v3"
Please note, if you are using such options which contain `:` delimiter,
then it is mandatory to specify the mounter-type.
Fixes: #2910
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Mounts managed by ceph-fuse may get corrupted by e.g. the ceph-fuse process
exiting abruptly, or its parent container being terminated, taking down its
child processes with it.
This commit adds checks to NodeStageVolume and NodePublishVolume procedures
to detect whether a mountpoint in staging_target_path and/or target_path is
corrupted, and remount is performed if corruption is detected.
Signed-off-by: Robert Vasek <robert.vasek@cern.ch>
Refactored a couple of helper functions for easier resue.
* Code for building store.VolumeOptions is factored out into a separate function.
* Changed args of getCredentailsForVolume() and NodeServer.mount() so that
instead of passing in whole csi.NodeStageVolumeRequest, only necessary
properties are passed explicitly. This is to allow these functions to be
called outside of NodeStageVolume() where NodeStageVolumeRequest is not
available.
Signed-off-by: Robert Vasek <robert.vasek@cern.ch>
blkdiscard cmd discards all data on the block device which
is not desired. Hence, return unimplemented code if the
volume access mode is block.
Signed-off-by: Rakshith R <rar@redhat.com>
When a tenant provides a configuration that includes the
`vaultNamespace` option, the `vaultAuthNamespace` option is still taken
from the global configuration. This is not wanted in all cases, as the
`vaultAuthNamespace` option defauls to the `vaultNamespace` option which
the tenant may want to override as well.
The following behaviour is now better defined:
1. no `vaultAuthNamespace` in the global configuration:
A tenant can override the `vaultNamespace` option and that will also
set the `vaultAuthNamespace` option to the same value.
2. `vaultAuthNamespace` and `vaultNamespace` in the global configuration:
When both options are set to different values in the global
configuration, the tenant `vaultNamespace` option will not override
the global `vaultAuthNamespace` option. The tenant can configure
`vaultAuthNamespace` with a different value if required.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Makes the rbd images features in the storageclass
as optional so that default image features of librbd
can be used. and also kept the option to user
to specify the image features in the storageclass.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as deep-flatten is long supported in ceph and its
enabled by default in the librbd, providing an option
to enable it in cephcsi for the rbd images we are
creating.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commits refactors the cephfs core
functions with interfaces. This helps in
better code structuring and writing the
unit test cases.
update #852
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
logging the error is not user-friendly and
it contains system error message. Log the
stderr which is user-friendly error message
for identifying the problem.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Return the dataPool used to create the image instead of the default one
provided by the createVolumeRequest.
In case of topologyConstrainedDataPools, they may differ.
Don't add datapool if it's not present
Signed-off-by: Sébastien Bernard <sebastien.bernard@sfr.com>
At present we are node staging with worldwide permissions which is
not correct. We should allow the CO to take care of it and make
the decision. This commit also remove `fuseMountOptions` and
`KernelMountOptions` as they are no longer needed
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
the omap is stored with the requested
snapshot name not with the subvolume
snapshotname. This fix uses the correct
snapshot request name to cleanup the omap
once the subvolume snapshot is deleted.
fixes: #2832
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit removes `kp-metadata` registration from existing HPCS
or Key Protect code as per the plan.
Fix#2816
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
As we are using optional additional auth data while wrapping
the DEK, we have to send the same additionally while unwrapping.
Error:
```
failed to unwrap the DEK: kp.Error: ..(INVALID_FIELD_ERR)',
reasons='[INVALID_FIELD_ERR: The field `ciphertext` must be: the
original base64 encoded ciphertext from the wrap operation
```
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
When a tenant configures `vaultNamespace` in their own ConfigMap, it is
not applied to the Vault configuration, unless `vaultAuthNamespace` is
set as well. This is unexpected, as the `vaultAuthNamespace` usually is
something configured globally, and not per tenant.
The `vaultAuthNamespace` is an advanced option, that is often not needed
to be configured. Only when tenants have to configure their own
`vaultNamespace`, it is possible that they need to use a different
`vaultAuthNamespace`. The default for the `vaultAuthNamespace` is now
the `vaultNamespace` value from the global configuration. Tenants can
still set it to something else in their own ConfigMap if needed.
Note that Hashicorp Vault Namespaces are only functional in the
Enterprise version of the product. Therefor this can not be tested in
the Ceph-CSI e2e with the Open Source version of Vault.
Fixes: https://bugzilla.redhat.com/2050056
Reported-by: Rachael George <rgeorge@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
This commit removes the thick provisioning
code as thick provisioning is deprecated in
cephcsi 3.5.0.
fixes: #2795
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
At present the KMS structs are exported and ideally we should be
able to work without exporting the same.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
At present the KMS structs are exported and ideally we should be
able to work without exporting the same.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Currently, we are using methods and all the methods
makes a network call to fetch details from the ceph
clusters, its difficult to write test cases for
these functions, if we move to the interfaces
we can make use of mock to write unit testing
for the caller functions.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
To be consistent with other components and also to explictly
state it belong to `ibm keyprotect` service introducing this
change
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
currently we are overriding the permission to `0o777` at time of node
stage which is not the correct action. That said, this permission
change causes an extra permission correction at time of nodestaging
by the CO while the FSGROUP change policy has been set to
`OnRootMismatch`.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
as ioutil.ReadFile is deprecated and
suggestion is to use os.ReadFile as
per https://pkg.go.dev/io/ioutil updating
the same.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as ioutil.WriteFile is deprecated and
suggestion is to use os.WriteFile as
per https://pkg.go.dev/io/ioutil updating
the same.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Currently most of the internal methods have the
rbdVolume as the received. As these methods
are completely internal and requires only
the fields of the rbdImage use rbdImage
as the receiver instead of rbdVolume.
updates #2742
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
During CreateVolume from snapshot/volume,
its difficult to identify if the clone is
failed and a new clone is created. In case
of clone failure logging the error message
for better debugging.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as per the CSI standard the size is optional parameter,
as we are allowing the clone to a bigger size
today we need to block the clone to a smaller size
as its a have side effects like data corruption etc.
Note:- Even though this check is present in kubernetes
sidecar as CSI is CO independent adding the check
here.
updates: #2718
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as per the CSI standard the size is optional parameter,
as we are allowing the restore to a bigger size
today we need to block the restore to a smaller size
as its a have side effects like data corruption.
Note:- Even though this check is present in kubernetes
sidecar as CSI is CO independent adding the check
here.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Currently, as a workaround, we are calling
the resize volume on the cloned, restore volumes
to adjust the cloned, restored volumes.
With this fix, we are calling the resize volume
only if there is a size mismatch with requested
and the volume from which the new volume needs
to be created.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
SINGLE_NODE_WRITER capability ambiguity has been fixed in csi spec v1.5
which allows the SP drivers to declare more granular WRITE capability in form
of SINGLE_NODE_SINGLE_WRITER or SINGLE_NODE_MULTI_WRITER.
These are not really new capabilities rather capabilities introduced to
get the desired functionality from CO side based on the capabilities SP
driver support for various CSI operations, this new capabilities also help
to address new access mode RWOP (readwriteoncepod).
This commit adds a helper function which identity the request is of
multiwriter mode and also validates whether it is filesystem mode or
block mode. Based on the inspection it fails to allow multi write
requests for filesystem mode and only allow multi write request against
block mode.
This commit also adds unit tests for isMultiWriterBlock function which
validates various accesstypes and accessmodes.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
SINGLE_NODE_WRITER capability ambiguity has been fixed in csi spec v1.5
which allows the SP drivers to declare more granular WRITE capability.
These are not really new capabilities rather capabilities introduced to
get the desired functionality from CO side based on the capabilities SP
driver support for various CSI operations.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
This commit adds optional BaseURL and TokenURL configuration to
key protect/hpcs configuration and client connections, if not
provided default values are used.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
implement UnfenceClusterNetwork grpc call
which allows to unblock the access to a
CIDR block by removing it from network fence.
Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
implement FenceClusterNetwork grpc call which
allows to blocks access to a CIDR block by
creating a network fence.
Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
Convert the CIDR block into a range of IPs,
and then add network fencing via "ceph osd blocklist"
for each IP in that range.
Signed-off-by: Yug Gupta <yuggupta27@gmail.com>
This commit removes rbdVol.getTrashPath() function
since it is no longer being used due to introduction
of go-ceph rbd admin task api for deletion.
Signed-off-by: Rakshith R <rar@redhat.com>
With introduction of go-ceph rbd admin task api, credentials are
no longer required to be passed as cli cmd is not invoked.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit removes `rv.Connect(cr)` since the rbdVolume should
have an active connection in this stage of the function call.
`rv.getCloneDepth(ctx)` will work after a connect to the cluster.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds support to go-ceph rbd task api
`trash remove` and `flatten` instead of using cli
cmds.
Fixes: #2186
Signed-off-by: Rakshith R <rar@redhat.com>
considering IBM has different crypto services (ex: SKLM) in place, its
good to keep the configmap key names with below format
`IBM_KP_...` instead of `KP_..`
so that in future, if we add more crypto services from IBM we can keep
similar schema specific to that specific service from IBM.
Ex: `IBM_SKLM_...`
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
The CSI Controller (provisioner) can call `rbd sparsify` to reduce the
space consumption of the volume.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
use ExecCommandWithTimeout with timeout
of 1 minute for the promote operation.
If the command doesnot returns error/response
in 1 minute the process will be killed
and error will be returned to the user.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
added ExecCommandWithTimeout helper function
to execute the commands with the timeout option,
if the command does not return any response with
in the timeout time the process will be terminated
and error will be returned back to the user.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
after creating the rbd image log the image
details corresponding for the request along
with the request name.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as getImageInfo is already called inside
cloneRbdImageFromSnapshot function right
after creating the clone. remove the extra
API call to get the details again.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
after creating the clone get the current
image details like size, creationTime,
imageFeatures etc from the ceph cluster.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
moved ParentName, ParentPool and ImageFeatureSet
fields to the rbdImage struct as these are the
first citizens on the rbdImage.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If the volume with a bigger size is created
from a snapshot or from another volume we
need to exapand the filesystem also in the
csidriver as nodeExpand request is not triggered
for this one, During NodeStageVolume we can
expand the filesystem by checking filesystem
needs expansion or not.
If its a encrypted device, check the device
size of rbd device and the LUKS device if required
the device will be expanded before
expanding the filesystem.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If the requested volume size is greater than
the snapshot size, resize the cloned volume
after creating a clone from a snapshot.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If the requested volume size is greater than
the parent volume size, resize the cloned volume
after creating a final clone from a parent volume.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
added a check to consider ErrImageNotFound error
during DeleteSnapshot operation, if the error
is ErrImageNotFound we need to ensure that image
is removed from the trash and also the rados
OMAP data is removed.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
we need actual size of the rbdVolume
created for the snapshot, as we are not
storing the size of the snapshot in OMAP
we need to fetch the size from ceph cluster
and update the same on rbdSnapshot struct.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as we are moving the VolSize to rbdImage struct
we should reuse the same instead of maintaining
one more field in rbdSnapshot struct.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
move the Volsize to the rbdImage struct
as size is more applicable for rbdImage
as rbdImage is used for both rbdVolume
and rbdSnapshot.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
as we are no longer supporting the v1.x
version of cephcsi. removing the json tag
used to store rbd volume details in configmap.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
when doing the internal operation to get the
latest details the rbd image size is also getting
updated and this will update the volume size also
without actual requested size we cannot do the
resize operation for bigger clones. This commit
adds a new field called RequestedVolSize to rbdVolume
struct to hold the user requested size.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
added a new helper function called cleanupThickClone
to cleanup the snapshot and clone if the thick
provisioning is not fully completed.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
remove the bigger size validation when
creating a volume from a snapshot or when
creation a clone from a volume as we resized
the volume after cloning.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
dummy image rbdVolume struct is derived
from the actual one rbdVolume of the
volumeID sent in the EnableVolumeReplication
request. and the dummy rbdVolume struct contains
the image id of the actual volume because
of that when we are repairing the dummy
image the image is sent to trash but not
deleted due to the wrong image ID. resetting
the image id will makes sure the image id
is fetching from ceph cluster and same
image id will be used for manager operation.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit adds the support for HPCS/Key Protect IBM KMS service
to Ceph CSI service. EncryptDEK() and DecryptDEK() of RBD volumes are
done with the help of key protect KMS server by wrapping and unwrapping
the DEK and by using the DEKStoreMetadata.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
The ceph changes are done on the both server and the
client side this change is not enough for remove
setting the size of cloned volumes.
this caused the regression like #2719#2720#2721#2722.
This reverts commit 3565a342d5.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
IsBlockMultiNode() is a new helper that takes a slice of
VolumeCapability objects and checks if it includes multi-node access
and/or block-mode support.
This can then easily be used in other services that need checking for
these particular capabilities, and preventing multi-node block-mode
access.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
we have added clusterID mapping to identify the volumes
in case of a failover in Disaster recovery in #1946.
with #2314 we are moving to a configuration in
configmap for clusterID and poolID mapping.
and with #2314 we have all the required information
to identify the image mappings.
This commit removes the workaround implementation done
in #1946.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The rbd package contains several functions that can be used by
CSI-Addons Service implmentations. Unfortunately it is not possible to
do this, as the rbd-driver needs to import the csi-addons/rbd package to
provide the CSI-Addons server. This causes a circular import when
services use the rbd package:
- rbd/driver.go import csi-addons/rbd
- csi-addons/rbd import rbd (including the driver)
By moving rbd/driver.go into its own package, the circular import can be
prevented.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
HexStringToInteger() used to return a uint64, but everywhere else uint
is used. Having HexStringToInteger() return a uint as well makes it a
little easier to use when setting it with SetGlobalInt().
Signed-off-by: Niels de Vos <ndevos@redhat.com>
When the rbd-driver starts, it initializes some global (yuck!) variables
in the rbd package. Because the rbd-driver is moved out into its own
package, these variables can not easily be set anymore.
Introcude SetGlobalInt(), SetGlobalBool() and InitJournals() so that the
rbd-driver can configure the rbd package.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
The rbd-driver calls rbd.runVolumeHealer() which is not available
outside the rbd package. By moving the rbd-driver into its own package,
RunVolumeHealer() needs to be exported.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
NodeServer.mounter is internal to the NodeServer type, but it needs to
be initialized by the rbd-driver. The rbd-driver is moved to its own
package, so .Mounter needs to be available from there in order to set
it.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
genVolFromVolID() is used by the CSI Controller service to create an
rbdVolume object from a CSI volume_id. This function is useful for
CSI-Addons Services as well, so rename it to GenVolFromVolID().
Signed-off-by: Niels de Vos <ndevos@redhat.com>
k8s.io/utils/mount has moved to k8s.io/mount-utils, and Ceph-CSI uses
that already in most locations. Only internal/util/util.go still imports
the old path.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
The dummy image will be created with 1Mib size.
during the snapshot transfer operation the 1Mib
will be transferred even if the dummy image doesnot
contains any data. adding the new image features
`fast-diff,layering,obj-map,exclusive-lock`on the
dummy image will ensure that only the diff is
transferred to the remote cluster.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
we added a workaround for rbd scheduling by creating
a dummy image in #2656. with the fix we are creating
a dummy image of the size of the first actual rbd
image which is sent in EnableVolumeReplication request
if the actual rbd image size is 1TiB we are creating
a dummy image of 1TiB which is not good. even though
its a thin provisioned rbd images this is causing
issue for the transfer of the snapshot during
the mirroring operation.
This commit recreates the rbd image with 1MiB size
which is the smaller supported size in rbd.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
rbd mirroring CLI calls are async and it doesn't wait
for the operation to be completed. ex:- `rbd mirror image enable`
it will enable the mirroring on the image but it doesn't
ensure that the image is mirroring enabled and healthy
primary. The same goes for the promote volume also.
This commits adds a check-in PromoteVolume to make sure
the image in a healthy state i.e `up+stopped`.
note:- not considering any intermediate states to make
sure the image is completely healthy before responding
success to the RPC call.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Journal-based RADOS block device mirroring ensures point-in-time
consistent replicas of all changes to an image, including reads and
writes, block device resizing, snapshots, clones, and flattening.
Journaling-based mirroring records all modifications to an image in the
order in which they occur. This ensures that a crash-consistent mirror
of an image is available.
Mirroring when configured in journal mode, mirroring will
utilize the RBD journaling image feature to replicate the image
contents. If the RBD journaling image feature is not yet enabled on the
image, it will be automatically enabled.
Fixes: #2018
Co-authored-by: Madhu Rajanna <madhupr007@gmail.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Depending on the way Ceph-CSI is deployed, the capabilities will be
configured for the GetCapabilities procedure. The other procedures are
more straight-forward.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
After adding the new CSI-Addons Server, golang-ci complains that
driver.Run() is too complex. By moving the profiling checks and starting
of the go-routines in their own function, golang-ci is happy again.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Add a new CSI-Addons Server and empty Identity Service for the RBD
plugin. The implementation of the Identity Service procedure calls will
be done in other PRs.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
currently we are fist operating on the dummy
image to refresh the pool and then we are adding
the scheduling. we think the scheduling should
be added first and than we should refresh the
pool. If we do this all the existing schedules
will be considered from the scheduler.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
with shallow copy of rbdVol to dummyVol
the image name update of the dummyVol is getting
reflected on the rbdVol which we dont want.
do deep copy to avoid this problem.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Uses the below schema to supply mounter specific map/unmapOptions to the
nodeplugin based on the discussion we all had at
https://github.com/ceph/ceph-csi/pull/2636
This should specifically be really helpful with the `tryOthermonters`
set to true, i.e with fallback mechanism settings turned ON.
mapOption: "kbrd:v1,v2,v3;nbd:v1,v2,v3"
- By omitting `krbd:` or `nbd:`, the option(s) apply to
rbdDefaultMounter which is krbd.
- A user can _override_ the options for a mounter by specifying `krbd:`
or `nbd:`.
mapOption: "v1,v2,v3;nbd:v1,v2,v3"
is effectively the same as the 1st example.
- Sections are split by `;`.
- If users want to specify common options for both `krbd` and `nbd`,
they should mention them twice.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
The dummy mirror image needs to be disabled and then
reenabled for mirroring, to ensure a newly promoted
primary is now starting to schedule snapshots.
Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
currently we have a bug in rbd mirror scheduling module.
After doing failover and failback the scheduling is not
getting updated and the mirroring snapshots are not
getting created periodically as per the scheduling
interval. This PR workarounds this one by doing below
operations
* Create a dummy (unique) image per cluster and this image
should be easily identified.
* During Promote operation on any image enable the
mirroring on the dummy image. when we enable the mirroring
on the dummy image the pool will get updated and the
scheduling will be reconfigured.
* During Demote operation on any image disable the mirroring
on the dummy image. the disable need to be done to enable
the mirroring again when we get the promote request to make
the image as primary
* When the DR is no more needed, this image need to be
manually cleanup as for now as we dont want to add a check
in the existing DeleteVolume code path for delete dummy image
as it impact the performance of the DeleteVolume workflow.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Moved to add scheduling to the promote
operation as scheduling need to be added
when the image is promoted and this is
the correct method of adding the scheduling
to make the scheduling take place.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
instead of logging the volumeID and the pool
name. log the poolname and image name for better
debugging.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If we hit any error while running the cryptosetup
commands we are logging only the error message.
with only error message it is difficult to analyze
the problem, logging the stdError will help us to
check what is the problem.
updates: #2610
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
On line 341 a `transaction` is created. This is passed to the deferred
`undoStagingTransaction()` function when an error in the
`NodeStageVolume` procedure is detected. So far, so good.
However, on line 356 a new `transaction` is returned. This new
`transaction` is not used for the defer call.
By removing the empty `transaction` that is used in the defer call, and
calling `undoStagingTransaction()` on an error of `stageTransaction()`,
the code is a little simpler, and the cleanup of the transaction should
be done correctly now.
Updates: #2610
Signed-off-by: Niels de Vos <ndevos@redhat.com>
This log line is seen frequently in the logs and its better to be at
Warning loglevel rather than Error based on its severity
E1109 08:30:45.612395 38328 util.go:247] kernel 4.19.202 does not support required features
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Problem:
On remap/attach of device (i.e. nodeplugin restart), there is no way
for rbd-nbd to defend if the backend storage is matching with the initial
backend storage.
Say, if an initial map request for backend "pool1/image1" got mapped to
/dev/nbd0 and the userspace process is terminated (on nodeplugin restart).
A next remap/attach (nodeplugin start) request within reattach-timeout is
allowed to use /dev/nbd0 for a different backend "pool1/image2"
For example, an operation like below could be dangerous:
$ sudo rbd-nbd map --try-netlink rbd-pool/ext4-image
/dev/nbd0
$ sudo blkid /dev/nbd0
/dev/nbd0: UUID="bfc444b4-64b1-418f-8b36-6e0d170cfc04" TYPE="ext4"
$ sudo pkill -15 rbd-nbd <-- nodeplugin terminate
$ sudo rbd-nbd attach --try-netlink --device /dev/nbd0 rbd-pool/xfs-image
/dev/nbd0
$ sudo blkid /dev/nbd0
/dev/nbd0: UUID="d29bf343-6570-4069-a9ea-2fa156ced908" TYPE="xfs"
Solution:
rbd-nbd/kernel now provides a way to keep some metadata in sysfs to identify
between the device and the backend, so that when a remap/attach request is
made, rbd-nbd can compare and avoid such dangerous operations.
With the provided solution, as part of the initial map request, backend
cookie (ceph-csi VOLID) can be stored in the sysfs per device config, so
that on a remap/attach request rbd-nbd will check and validate if the
backend per device cookie matches with the initial map backend with the help
of cookie.
At Ceph-csi we use VOLID as device cookie, which will be unique, we pass
the VOLID as cookie at map and use the same at the time of attach, that
way rbd-nbd can identify backends and their matching devices.
Requires:
https://github.com/ceph/ceph/pull/41323https://lkml.org/lkml/2021/4/29/274
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This change allows the user to choose not to fallback to NBD mounter
when some ImageFeatures are absent with krbd driver, rather just fail
the NodeStage call.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Currently, we recognize and warn for the provided image features based on
our prior intelligence at ceph-csi (i.e based on supportedFeatures map
and validateImageFeatures) at image/PV creation time. It might be very
much possible that the cluster is heterogeneous i.e. the PV creation and
application container might both be on different nodes with different
kernel versions (krbd driver versions).
This PR adds a mechanism to check for the supported krbd features during
mount time, if the krbd driver doesn't have the specified image feature
then it will fall back to rbd-nbd mounter.
Fixes: #478
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
When using UPPER_CASE formatting for the HashiCorp Vault KMS
configuration, a missing `VAULT_DESTROY_KEYS` will cause the option to
be set to "false". The default for the option is intended for be "true".
This is a difference in behaviour between the `vaultDestroyKeys` and
`VAULT_DESTROY_KEYS` options. Both should use a default of "true" when
the configuration does not set the option explicitly.
By setting the default options in the `standardVault` struct before
unmarshalling the configuration in it, the default values will be
retained for the missing configuration options.
Reported-by: Rachael George <rgeorge@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
this commit make use of the migration request secret parsing and set
the required fields for further nodestage operations
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
parseAndDeleteMigratedVolume() prviously clubbed the logic of
parsing of migration volume handle and then continued with the
deletion of the volume. however this commit split this
logic into two, ie parsing has been done in parseMigrationVolID()
and DeleteMigratedVolume() deletes the backend volume.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
This commit adds a couple of helper functions to parse the migration
request secret and set it for further csi driver operations.
More details:
The intree secret has a data field called "key" which is the base64
admin secret key. The ceph CSI driver currently expect the secret to
contain data field "UserKey" for the equivalant. The CSI driver also
expect the "UserID" field which is not available in the in-tree secret
by deafult. This missing userID will be filled (if the username differ
than 'admin') in the migration secret as 'adminId' field in the
migration request, this commit adds the logic to parse this migration
secret as below:
"key" field value will be picked up from the migraion secret to "UserKey"
field.
"adminId" field value will be picked up from the migration secret to "UserID"
field
if `adminId` field is nil or not set, `UserID` field will be filled with
default value ie `admin`.The above logic get activated only when the secret
is a migration secret, otherwise skipped to the normal workflow as we have
today.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Thick-provisioning was introduced to make accounting of assigned space
for volumes easier. When thick-provisioned volumes are the only consumer
of the Ceph cluster, this works fine. However, it is unlikely that this
is the case. Instead, accounting of the requested (thin-provisioned)
size of volumes is much more practical as different types of volumes can
be tracked.
OpenShift already provides cluster-wide quotas, which can combine
accounting of requested volumes by grouping different StorageClasses.
In addition to the difficult practise of allowing only thick-provisioned
RBD backed volumes, the performance makes thick-provisioning
troublesome. As volumes need to be completely allocated, data needs to
be written to the volume. This can take a long time, depending on the
size of the volume. Provisioning, cloning and snapshotting becomes very
much noticeable, and because of the additional time consumption, more
prone to failures.
Signed-off-by: Niels de Vos <ndevos@redhat.com>