This commit makes use of crush location labels from node
labels to supply `crush_location` and `read_from_replica=localize`
options during mount. Using these options, cephfs
will be able to redirect reads to the closest OSD,
improving performance.
Signed-off-by: Praveen M <m.praveen@ibm.com>
Snapshot procedures do not seem to contain the `Req-ID:` prefix in the
logs anymore (or weren't they there at all?) for some reason. This adds
them back.
Signed-off-by: Niels de Vos <ndevos@ibm.com>
Implemented the capability to include kernel mount options and
fuse mount options for individual clusters within the ceph-csi-config
ConfigMap.This allows users to configure the kernel/fuse mount options
for each cluster separately. The mount options specified in the ConfigMap
will supersede those provided via command line arguments.
Signed-off-by: Praveen M <m.praveen@ibm.com>
This commit adds GetCephFSMountOptions util method which returns
KernelMountOptions and fuseMountOptions for cluster `clusterID`.
Signed-off-by: Praveen M <m.praveen@ibm.com>
Implemented the capability to include read affinity options
for individual clusters within the ceph-csi-config ConfigMap.
This allows users to configure the crush location for each
cluster separately. The read affinity options specified in
the ConfigMap will supersede those provided via command line arguments.
Signed-off-by: Praveen M <m.praveen@ibm.com>
If any operations like Resize, Deleting
snapshot fails, we need to remove
both snapshot and the clone to avoid
resource leak.
closes: #4218
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The ReplicationServer is not used anymore, the functionality has moved
to CSI-Addons and the `internal/csi-addons/rbd` package. These last
references were not activated anywhere, so can be removed without any
impact.
See-also: #3314
Signed-off-by: Niels de Vos <ndevos@ibm.com>
When FilesystemNodeGetVolumeStats() succeeds, the volume must be
healthy. This can be included in the VolumeCondition CSI message by
default.
Checks that detect an abnormal VolumeCondition should prevent calling
FilesystemNodeGetVolumeStats() as it is possible that the function will
hang.
Signed-off-by: Niels de Vos <ndevos@ibm.com>
The HealthChecker is configured to use the Staging path pf the volume,
with a `.csi/` subdirectory. In the future this directory could be a
directory that is not under the Published directory.
Fixes: #4219
Signed-off-by: Niels de Vos <ndevos@ibm.com>
re-arrange the struct members to
fix below lint issue
```
struct of size 336 bytes could be of size 328 bytes
```
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit eliminates the code for protecting and unprotecting
snapshots, as the functionality to protect and unprotect snapshots
is being deprecated.
Signed-off-by: Praveen M <m.praveen@ibm.com>
this commit adds client eviction to cephfs, based
on the IPs in cidr block, it evicts those IPs from
the network.
Signed-off-by: Riya Singhal <rsinghal@redhat.com>
Issue:
The RoundOffCephFSVolSize() function omits the fractional
part when calculating the size for cephfs volumes, leading
to the created volume capacity to be lesser than the requested
volume capacity.
Fix:
Consider the fractional part during the size calculation so the
rounded off volume size will be greater than or equal to the
requested volume size.
Signed-off-by: karthik-us <ksubrahm@redhat.com>
Fixes: #4179
Multiple go-routines may simultaneously create the
subVolumeGroupCreated map or write into it
for a particular group.
This commit safeguards subVolumeGroupCreated map
from concurrent creation/writes while allowing for multiple
readers.
Signed-off-by: Rakshith R <rar@redhat.com>
Multiple go-routines may simultaneously check for a clusterID's
presence in clusterAdditionalInfo and create an entry if it is
absent. This set of operation needs to be serialized.
Therefore, this commit safeguards clusterAdditionalInfo map
from concurrent writes with a mutex to prevent the above problem.
Signed-off-by: Rakshith R <rar@redhat.com>
This PR updates the snapshot RbdImageName in
`createSnapshot` method. This resolves the
incorrect statement logged during snapshot creation.
Signed-off-by: Praveen M <m.praveen@ibm.com>
This commit updates the snapshot RbdImageName with the clone
RbdImageName before snapshot creation. This will fix the
incorrect log statement.
Signed-off-by: Praveen M <m.praveen@ibm.com>
During ResyncVolume call, discard not found
error from GetMetadata API. If the image gets
resynced the local image creation time will be
lost, if the key is not present in the image
metadata then we can assume that the image is
already resynced.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Add support to create RWX clone from the
ROX clone, in ceph no subvolume clone is
created when ROX clone is created from a
snapshot just a internal ref counter is
added. This PR allows creating a RWX clone
from a ROX clone which allows users to create
RW copy of PVC where cephcsi will identify
the snapshot created for the ROX volume and
creates a subvolume from the CephFS snapshot.
updates: #3603
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
During the Demote volume store
the image creation timestamp.
During Resync do below operation
* Check image creation timestamp
stored during Demote operation
and current creation timestamp during Resync
and check both are equal and its for
force resync then issue resync
* If the image on both sides is
not in unknown state, check
last_snapshot_timestamp on the
local mirror description, if its present
send volumeReady as false or else return
error message.
If both the images are in up+unknown the
send volumeReady as true.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Not sure why but go-lint is failing
with below error and this fix is required
to make it pass
```
directive `//nolint:staticcheck // See comment above.`
is unused for linter "staticcheck" (nolintlint)
```
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
MetricsBindAddress is replaced by Metrics in the
controller-runtime manager options in version 0.16.0
as part of
e59161ee8f
Updating the same here.
Signed-off-by: karthik-us <ksubrahm@redhat.com>
Set VolumeOptions.Pool parameter to empty for Snapshot-backed volumes.
This Pool parameter is optional and only used as 'pool-layout' parameter
during subvolume and subvolume clone create request in cephcsi
and not used for Snapshot-backed volume at all.
It is not saved anywhere for use in subsequent operations after create too.
Therefore, We can set it to empty and not error out.
Signed-off-by: rakshith-r <rar@redhat.com>
This commit makes sure sparsify() is not run when rbd
image is in use.
Running rbd sparsify with workload doing io and too
frequently is not desirable.
When a image is in use fstrim is run and sparsify will
be run only when image is not mapped.
Signed-off-by: Rakshith R <rar@redhat.com>
The clients parameter in the storage class is used to limit access to
the export to the set of hostnames, networks or ip addresses specified.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
When a volume has AccessType=Block and is encrypted with LUKS, a resize
of the filesystem on the (decrypted) block-device is attempted. This
should not be done, as the application that requested the Block volume
is the only authoritive reader/writer of the data.
In particular VirtualMachines that use RBD volumes as a disk, usually
have a partition table on the disk, instead of only a single filesystem.
The `resizefs` command will not be able to resize the filesystem on the
block-device, as it is a partition table.
When `resizefs` fails during NodeStageVolume, the volume is unstaged and
an error is returned.
Resizing an encrypted block-device requires `cryptsetup resize` so that
the LUKS header on the RBD-image is updated with the correct size. But
there is no need to call `resizefs` in this case.
Fixes: #3945
Signed-off-by: Niels de Vos <ndevos@ibm.com>
This commit modifies code to handle last sync duration being
empty & 0,returning nil & 0 on encountering it respectively.
Earlier both case return 0. Test case is added too.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit get more information from the description
like lastsyncbytes and lastsyncduration and send them
as a response of getvolumereplicationinfo request.
Signed-off-by: Yati Padia <ypadia@redhat.com>
this commit migrates the replication controller server
from internal/rbd and adds it to csi-addons.
Signed-off-by: riya-singhal31 <rsinghal@redhat.com>
this commit removes grpc import from replication.go
and replaced it with usual errors and passed gRPC
responses in csi-addons
Signed-off-by: riya-singhal31 <rsinghal@redhat.com>
There is no release for sigs.k8s.io/controller-runtime that supports
Kubernetes v1.27. The main branch has all the required modifications, so
we can use that for the time being.
Signed-off-by: Niels de Vos <ndevos@ibm.com>
golangci-lint reports that `grpc_middleware.WithUnaryServerChain` is
deprecated and `google.golang.org/grpc.ChainUnaryInterceptor` should be
used instead.
Signed-off-by: Niels de Vos <ndevos@ibm.com>
CephNFS can enable different security flavours for exported volumes.
This can be configured in the optional `secTypes` parameter in the
StorageClass.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
By default, `cryptsetup luksFormat` uses Argon2i as Password-Based Key
Derivation Function (PBKDF), which not only has a CPU cost, but also a memory
cost (to make brute-force attacks harder).
The memory cost is based on the available system memory by default, which in
the context of Ceph CSI can be a problem for two reasons:
1. Pods can have a memory limit (much lower that the memory available on the
node, usually) which isn't taken into account by `cryptsetup`, so it can get
OOM-killed when formating a new volume;
2. The amount of memory that was used during `cryptsetup luksFormat` will then
be needed for `cryptsetup luksOpen`, so if the volume was formated on a node
with a lot of memory, but then needs to be opened on a different node with
less memory, `cryptsetup` will get OOM-killed.
This commit sets the PBKDF memory limit to a fixed value to ensure consistent
memory usage regardless of the specifications of the nodes where the volume
happens to be formatted in the first place.
The limit is set to a relatively low value (32 MiB) so that the `csi-rbdplugin`
container in the `nodeplugin` pod doesn't require an extravagantly high memory
limit in order to format/open volumes (particularly with operations happening
in parallel), while at the same time not being so low as to render it
completely pointless.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Add `mkfsOptions` to the StorageClass and pass them to the `mkfs`
command while creating the filesystem on the RBD device.
Fixes: #374
Signed-off-by: Niels de Vos <ndevos@ibm.com>
Storing the default `mkfs` arguments in a map with key per filesystem
type makes this a little more modular. It prepares th code for fetching
the `mkfs` arguments from the VolumeContext.
Signed-off-by: Niels de Vos <ndevos@ibm.com>
This commit makes use of crush location labels from node
labels to supply `crush_location` and `read_from_replica=localize`
options during rbd map cmd. Using these options, ceph
will be able to redirect reads to the closest OSD,
improving performance.
Signed-off-by: Rakshith R <rar@redhat.com>
The StagingTargetPath is an optional entry in
NodeExpandVolumeRequest, We cannot expect it to be
set always and at the same time cephcsi depended
on the StaingTargetPath to retrieve some metadata
information.
This commit will check all the mount ref and identifies
the stagingTargetPath by checking the image-meta.json
file exists and this is a costly operation as we need to
loop through all the mounts and check image-meta.json
in each mount but this is happens only if the
StaingTargetPath is not set in the NodeExpandVolumeRequest
fixes#3623
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
set disableInUseChecks on rbd volume struct
as it will be used later to check whether
the rbd image is allowed to mount on multiple
nodes.
fixes: #3604
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
We should not call ExpandVolume for the BackingSnapshot
subvolume as there wont be any real subvolume created for
it and even if we call it the ExpandVolume will fail
fail as there is no real subvolume exists.
This commits fixes by adjusting the `if` check to ensure
that ExpandVolume will only be called either the
VolumeRequest is to create from a snapshot or volume
and BackingSnapshot is not true.
sample code here https://go.dev/play/p/PI2tNii5tTg
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Add Ceph FS fscrypt support, similar to the RBD/ext4 fscrypt
integration. Supports encrypted PVCs, snapshots and clones.
Requires kernel and Ceph MDS support that is currently not in any
stable release.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
this commit remove the protobuf dependency locking in the module
description.
Also, ptypes.TimestampProto is deprecated and this commit
make use of the timestamppb.New() for the construction.
ParseTime() function has been removed and callers adjusted to the
same.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
We need to unset the metadata on the clone
and restore PVC if the parent PVC was created
when setmetadata was set to true and it was
set to false when restore and clone pvc was
created.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
`ceph osd blocklist range add/rm <ip>` cmd is outputting
"blocklisting cidr:10.1.114.75:0/32 until 202..." messages
incorrectly into stdErr. This commit ignores stdErr when err
is nil.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds code to setup encryption on a rbdVol
being repaired in a followup CreateVolume request.
This is fixes a bug wherein encryption metadata may not
have been set in previous request due to container restart.
Fixes: #3402
Signed-off-by: Rakshith R <rar@redhat.com>
Checking volume details for the existing volumeID
first. if details like OMAP, RBD Image, Pool doesnot
exists try to use clusterIDMapping to look for the
correct informations.
fixes: #2929
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Sometime the json unmarshal might
get success and return empty time
stamp. add a check to make sure the
time is not zero always.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
As per the csiaddon spec last sync time is
required parameter in the GetVolumeReplicationInfo
if we are failed to parse the description, return
not found error message instead of nil
which is empty response
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If a PV is reattached to a new PVC in a different
namespace we need to update the namespace name
in the rados object.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If a PV is reattached to a new PVC in a different
namespace we need to update the namespace name
in the rbd image metadata.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
When we do stat on the targetpath, if there is
any error we can check is it due to corruption.
If yes, cephcsi can return abnormal in the
NodeGetVolumeStats so that consumer (CO/admin)
and detect and take further action.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
When we do stat on the targetpath, if there is
any error we can check is it due to corruption.
If yes, cephcsi can return abnormal in the
NodeGetVolumeStats so that consumer (CO/admin)
and detect and take further action.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
To avoid subvolume leaks if the SetAllMetadata
operations fails delete the subvolume.
If any operation fails after creating the subvolume
we will remove the omap as the omap gets
removed we will need to remove the subvolume to
avoid stale resources.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Different places have different meaningful fallback. When parsing
from user we should default to block, when parsing stored config we
should default to invalid and handle that as an error.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Integrate basic fscrypt functionality into RBD initialization. To
activate file encryption instead of block introduce the new
'encryptionType' storage class key.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Call Mount.Setup with SingleUserWritable constant instead of 0o755,
which is silently ignored and causes the /.fscrypt/{policy,protector}/
directories to have mode 000.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Revert once our google/fscrypt dependency is upgraded to a version
that includes https://github.com/google/fscrypt/pull/359 gets accepted
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Use constant protector name 'ceph-csi' instead of constant prefix
concatenated with the volume ID. When cloning volumes the ID changes
and fscrypt protected directories become inunlockable due to the
protector name change
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
NewContextFrom{Mountpoint,Path} functions use cached
`/proc/self/mountinfo` to find mounted file systems by device ID.
Since we run fscrypt as a library in a long-lived process the cached
information is likely to be stale. Stale entries may map device IDs to
mount points of already destroyed RBDs and fail context creation.
Updating the cache beforehand prevents this.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Currently fscrypt supports policies version 1 and 2. 2 is the best
choice and was the only choice prior to this commit. This adds support
for kernels < 5.4, by selecting policy version 1 there.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Fetch password when keyFn is invoked, not when it is created. This
allows creation of the keyFn before actually creating the passphrase.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Fetch keys from KMS before doing anything else. This will catch KMS
errors before setting up any fscrypt metadata.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Integrate google/fscrypt into Ceph CSI KMS and encryption setup. Adds
dependencies to google/fscrypt and pkg/xattr. Be as generic as
possible to support integration with both RBD and Ceph FS.
Add the following public functions:
InitializeNode: per-node initialization steps. Must be called
before Unlock at least once.
Unlock: All steps necessary to unlock an encrypted directory including
setting it up initially.
IsDirectoryUnlocked: Test if directory is really encrypted
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
In preparation of fscrypt support for RBD filesystems, rename block
encryption related function to include the word 'block'. Add struct
fields and IsFileEncrypted.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Add registry similar to the providers one. This allows testers to
add and use GetKMSTestDummy() to create stripped down provider
instances suitable for use in unit tests.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Add GetSecret() to allow direct access to passphrases without KDF and
wrapping by a DEKStore.
This will be used by fscrypt, which has its own KDF and wrapping. It
will allow users to take a k8s secret, for example, and use that
directly as a password in fscrypt.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Fetch encryption type from vol options. Make fallback type
configurable to support RBD (default block) and Ceph FS (default file)
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
fscrypt support requires keys longer than 20 bytes. As a preparation,
make the new passphrase length configurable, but default to 20 bytes.
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
The error message return from the GRPC
should be of GRPC error messages only
not the normal go errors. This commits
returns GRPC error if setAllMetadata
fails.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If any operations fails after the volume creation
we will cleanup the omap objects, but it is missing
if setAllMetadata fails. This commits adds the code
to cleanup the rbd image if metadata operation fails.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
As we need to compare the error type instead
of the error value we need to use errors.As
to check the API is implemented or not.
fixes: #3347
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
CephFS does not have a concept of "free inodes", inodes get allocated
on-demand in the filesystem.
This confuses alerting managers that expect a (high) number of free
inodes, and warnings get produced if the number of free inodes is not
high enough. This causes alerts to always get reported for CephFS.
To prevent the false-positive alerts from happening, the
NodeGetVolumeStats procedure for CephFS (and CephNFS) will not contain
inodes in the reply anymore.
See-also: https://bugzilla.redhat.com/2128263
Signed-off-by: Niels de Vos <ndevos@redhat.com>
To address the problem that snapshot
schedules are triggered for volumes
that are promoted, a dummy image was
disabled/enabled for replication.
This was done as a workaround, because the
promote operation was not triggering
the schedules for the image being promoted.
The bugs related to the same have been fixed in
RBD mirroring functionality and hence the
workaround #2656 can be removed from the code base.
ceph tracker https://tracker.ceph.com/issues/53914
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit gets the description from remote status
instead of local status.
Local status doesn't have ',' due to which we get
array index out of range panic.
Fixes: #3388
Signed-off-by: Yati Padia <ypadia@redhat.com>
Co-authored-by: shyam Ranganathan <srangana@redhat.com>
This commit implements getVolumeReplicationInfo
to get the last sync time and update it in volume
replication CR.
Signed-off-by: yati1998 <ypadia@redhat.com>
This commit adds blocklist range cmd feature,
while fallbacks to old blocklist one ip at a
time if the cmd is invalid(not available).
Signed-off-by: Rakshith R <rar@redhat.com>
Incase the subvolumegroup is deleted
and recreated we need to restart the
cephcsi provisioner pod to clear cache
that cephcsi maintains. With this PR
if cephcsi sees NotFound error duing
subvolume creation it will reset the cache
for that filesystem so that in next RPC
call cephcsi will try to create the
subvolumegroup again
Ref: https://github.com/rook/rook/issues/10623
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
In a cluster we can have multiple filesystem
for that we need to have a map of
subvolumegroups to check filesystem is created
nor not.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If the image is mirroring enabled
and primary consider it for mapping,
if the image is mirroring enabled but
not primary yet. return error message
until the image is marked as primary.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If the ceph cluster is of older version and doesnot
support metadata operation, Instead of failing
the request return the success if metadata
operation is not supported.
fixes#3347
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit updates csi-addons spec version
and modifies logging to strip replication
request secret using csi.StripSecret, then
with replication.protosanitizer if the former
fails. This is done in order to make sure
we strip csi and replication format of secrets.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit uses %q instead %v in error messages
and adds result reason and message in kmip
verifyresponse().
Signed-off-by: Rakshith R <rar@redhat.com>
This commit fixes a bug in kmip kms Decrypt
function, where emd.DEK was fed in a Nonce
instead of emd.Nonce by mistake.
Signed-off-by: Rakshith R <rar@redhat.com>
The github.com/google/uuid package is used by Kubernetes, and it is part
of the vendor/ directory already. Our usage of github.com/pborman/uuid
can be replaced by github.com/google/uuid, so that
github.com/pborman/uuid can be removed as a dependency.
Closes: #3315
Signed-off-by: Niels de Vos <ndevos@redhat.com>
csi-addons server will advertise replication capability and
replication service will run with csi-addons server too.
Signed-off-by: Rakshith R <rar@redhat.com>
The Key Management Interoperability Protocol (KMIP)
is an extensible communication protocol
that defines message formats for the manipulation
of cryptographic keys on a key management server.
Ceph-CSI can now be configured to connect to
various KMS using KMIP for encrypting RBD volumes.
https://en.wikipedia.org/wiki/Key_Management_Interoperability_Protocol
Signed-off-by: Rakshith R <rar@redhat.com>
getting is unused for linter "staticcheck"
(nolintlint) error message due to wrong
comment format. this the format now with
`//directive // comment`
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit cleans up for loop to use index to access
value instead of copying value into a new variable
while iterating.
```
internal/util/csiconfig.go:103:2: rangeValCopy: each \
iteration copies 136 bytes (consider pointers or indexing) \
(gocritic)
for _, cluster := range config {
```
Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds nfs nodeserver capable of
mounting nfs volumes, even with pod networking
using NSenter design similar to rbd and cephfs.
NodePublish, NodeUnpublish, NodeGetVolumeStats
and NodeGetCapabilities have been implemented.
The nodeserver implementation has been inspired
from https://github.com/kubernetes-csi/csi-driver-nfs,
which was previously used for mounted cephcsi exported
nfs volumes. The current implementation is also
backward compatible for the previously created
PVCs.
Signed-off-by: Rakshith R <rar@redhat.com>
Current code uses an !A && !B condition incorrectly to
test A:Up and B:status for a remote peer image.
This should be !A || !B as we require both conditions to
be in the specified state (Up: true, and status Unknown).
This is corrected by this commit, and further fixes:
- check and return ready only when a remote site is
found in the status output
- check if all peer sites are ready, if multiple are found
and return ready appropriately
Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
During ResyncVolume we check if the image
is in an error state, and we resync.
After resync, the image will move to
either the `Error` or the `Resyncing` state.
And if the image is in the above two
conditions, we will return a successful
response and Ready=false so that the
consumer can wait until the volume is
ready to use. If the image is in any
other state we return an error message
to indicate the syncing is not going on.
The whole resync and image state change
depends on the rbd mirror daemon. If the
mirror daemon is not running, the image
can be in Resyncing or Unknown state.
The Ramen marks the volume replication as
secondary, and once the resync starts, it
will delete the volume replication CR as a
cleanup process.
As we dont have a check for the rbd mirror
daemon, we are returning a resync success
response and Ready=false. Due to this false
response Ramen is assuming the resync started
and deleted the volume replication CR, and
because of this, the cluster goes into a bad
state and needs manual intervention.
fixes#3289
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
IsNotMountPoint() is deprecated and Mounter.IsMountPoint() is
recommended to be used instead.
Reported-by: golangci/staticcheck
Signed-off-by: Niels de Vos <ndevos@redhat.com>
NewWithoutSystemd() has been introduced in the k8s.io/mount-utils
package so that systemd is not called while executing functions. This
offers consumers the ability to prevent confusing and scary messages
from getting logged.
See-also: kubernetes/kubernetes#111218
Signed-off-by: Niels de Vos <ndevos@redhat.com>
previously, it was a requirement to have attacher sidecar for CSI
drivers and there had an implementation of dummy mode of operation.
However skipAttach implementation has been stabilized and the dummy
mode of operation is going to be removed from the external-attacher.
Considering this driver work on volumeattachment objects for NBD driver
use cases, we have to implement dummy controllerpublish and unpublish
and thus keep supporting our operations even in absence of dummy mode
of operation in the sidecar.
This commit make a NOOP controller publish and unpublish for RBD driver.
CephFS driver does not require attacher and it has already been made free
from the attachment operations.
Ref# https://github.com/ceph/ceph-csi/pull/3149
Ref# https://github.com/kubernetes-csi/external-attacher/issues/226
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
`--setmetadata` is false by default, honoring it
will keep the metadata disabled by default
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
`--setmetadata` is false by default, honoring it
will keep the metadata disabled by default
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Make sure to set metadata when subvolume snapshot exist, i.e. if the
provisioner pod is restarted while createSnapShot is in progress, say it
created the subvolume snapshot but didn't yet set the metadata.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Set snapshot-name/snapshot-namespace/snapshotcontent-name details
on subvolume snapshots as metadata on create.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This change helps read the cluster name from the cmdline args,
the provisioner will set the same on the subvolume.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Make sure to set metadata when subvolume exist, i.e. if the provisioner pod
is restarted while createVolume is in progress, say it created the subvolume
but didn't yet set the metadata.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This helps Monitoring solutions without access to Kubernetes clusters to
display the details of the PV/PVC/NameSpace in their dashboard.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
After Failover of workloads to the secondary
cluster when the primary cluster is down,
RBD Image is not marked healthy, and VR
resources are not promoted to the Primary,
In VolumeReplication, the `CURRENT STATE`
remains Unknown and doesn't change to Primary.
This happens because the primary cluster went down,
and we have force promoted the image on the
secondary cluster. and the image stays in
up+stopping_replay or could be any other states.
Currently assumption was that the image will
always be `up+stopped`. But the image will be in
`up+stopped` only for planned failover and it
could be in any other state if its a forced
failover. For this reason, removing
checkHealthyPrimary from the PromoteVolume RPC call.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Recently the k8s.io/mount-utils package added more runtime dectection.
When creating a new Mounter, the detect is run every time. This is
unfortunate, as it logs a message like the following:
```
mount_linux.go:283] Detected umount with safe 'not mounted' behavior
```
This message might be useful, so it probably good to keep it.
In Ceph-CSI there are various locations where Mounter instances are
created. Moving that to the DefaultNodeServer type reduces it to a
single place. Some utility functions need to accept the additional
parameter too, so that has been modified as well.
See-also: kubernetes/kubernetes#109676
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Looks like cephfs snapshot size is buggy and its
getting removed in ceph fs. we cannot get the size
of the snapshot during CreateVolume call, so we cannot
do any size check at CreateVolume to check if the
restore size is smaller or not.
As we are removing this check it also fixes#3147
but we dont have any validation at CSI level for
smaller restore we need to depend on kubernetes
external-provisioner for it.
fixes: #3147
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Due to the bug in the df stat we need to round off
the subvolume size to align with 4Mib.
Note:- Minimum supported size in cephcsi is 1Mib,
we dont need to take care of Kib.
fixes#3240
More details at https://github.com/ceph/ceph/pull/46905
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
While creating subvolumes, CephFS driver set the mode to `777`
and pass it along to go ceph apis which cause the subvolume
permission to be on 777, however if we create a subvolume
directly in the ceph cluster, the default permission bits are
set which is 755 for the subvolume. This commit try to stick
to the default behaviour even while creating the subvolume.
This also means that we can work with fsgrouppolicy set to
`File` in csiDriver object which is also addressed in this commit.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
When the Ceph user is restricted to a specific namespace in the pool, it is
crucial that evey interaction with the cluster is done within that namespace.
This wasn't the case in `getCloneDepth()`.
This issue was causing snapshot creation to fail with
> Failed to check and update snapshot content: failed to take snapshot of the
> volume X: "rpc error: code = Internal desc = rbd: ret=-1, Operation not
> permitted"
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
There are regular reports that identify a non-error as the cause of
failures. The Kubernetes mount-utils package has detection for systemd
based environments, and if systemd is unavailable, the following error
is logged:
Cannot run systemd-run, assuming non-systemd OS
systemd-run output: System has not been booted with systemd as init
system (PID 1). Can't operate.
Failed to create bus connection: Host is down, failed with: exit status 1
Because of the `failed` and `exit status 1` error message, users might
assume that the mounting failed. This does not need to be the case. The
container-images that the Ceph-CSI projects provides, do not use
systemd, so the error will get logged with each mount attempt.
By using the newer MountSensitiveWithoutSystemd() function from the
mount-utils package where we can, the number of confusing logs get
reduced.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
go-ceph provides a new GetFailure() method to retrieve details errors
when cloning failed. This is now included in the `cephFSCloneState`
struct, which was a simple string before.
While modifying the `cephFSCloneState` struct, the constants have been
removed, as go-ceph provides them as well.
Fixes: #3140
Signed-off-by: Niels de Vos <ndevos@redhat.com>
This commit removes the clone incase
unsetAllMetadata or copyEncryptionConfig or
expand fails for createVolumeFromSnapshot
and CreateSnapshot.
It also removes the clone in case of
any failure in createCloneFromImage.
issue: #3103
Signed-off-by: Yati Padia <ypadia@redhat.com>
Address: hugeParam linter
internal/controller/persistentvolume/persistentvolume.go:59:7:
hugeParam: r is heavy (80 bytes); consider passing it by pointer
(gocritic)
[...]
internal/controller/persistentvolume/persistentvolume.go:135:7:
hugeParam: r is heavy (80 bytes); consider passing it by pointer
(gocritic)
func (r ReconcilePersistentVolume) reconcilePV(ctx context.Context, obj
runtime.Object) error {}
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
As we added support to set the metadata on the rbd images created for
the PVC and volume snapshot, by default metadata is set on all the images.
As we have seen we are hitting issues#2327 a lot of times with this,
we start to leave a lot of stale images. Currently, we rely on
`--extra-create-metadata=true` to decide to set the metadata or not,
we cannot set this option to false to disable setting metadata because we
use this for encryption too.
This changes is to provide an option to disable setting the image
metadata when starting cephcsi.
Fixes: #3009
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Removed the code in checkHealthyPrimary which
makes the ceph call, passing it as input now.
Added unit test for checkHealthyPrimary function
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
we need to check for image should be in up+stopped state
not anyone of the state for that the we need to use
OR check not the AND check.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
When the image is force promoted to primary on the
cluster the remote image might not be in replaying
state because due to the split brain state. This
PR reverts back the commit
c3c87f2ef3. Which we added
to check the remote image status.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Kubernetes 1.24 and newer use a different path for staging the volume.
That means the CSI-driver is requested to mount the volume at an other
location, compared to previous versions of Kubernetes. CSI-drivers
implementing the volumeHealer, must receive the correct path, otherwise
the after a nodeplugin restart the NBD mounts will bailout attempting
to NodeStageVolume() call and return an error.
See-also: kubernetes/kubernetes#107065
Fixes: #3176
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
During failover we do demote the volume on the primary
as the image is still not promoted yet on the remote cluster,
there are spurious split-brain errors reported by RBD,
the Cephcsi resync will attempt to resync from the "known"
secondary and that will cause data loss
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
create the token if kubernetes version in
1.24+ and use it for vault sa.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Signed-off-by: Rakshith R <rar@redhat.com>
This commit implements most of
docs/design/proposals/cephfs-snapshot-shallow-ro-vol.md design document;
specifically (de-)provisioning of snapshot-backed volumes, mounting such
volumes as well as mounting pre-provisioned snapshot-backed volumes.
Signed-off-by: Robert Vasek <robert.vasek@cern.ch>
RBD supports creating rbd images with
object size, stripe unit and stripe count
to support striping. This PR adds the support
for the same.
More details about striping at
https://docs.ceph.com/en/quincy/man/8/rbd/#stripingfixes: #3124
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This change helps read the cluster name from the cmdline args,
the provisioner will set the same on the RBD images.
Fixes: #2973
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
In case of pre-provisioned volume the clusterID is
not set in the volume context as the clusterID is missing
we cannot extract the NetNamespaceFilePath from the
configuration file. For static volume and dynamically
provisioned volume the clusterID is set.
Note:- This is a special case to support mounting PV
without clusterID parameter.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Before the change, the error msg was the following:
```
failed to set VAULT_AUTH_MOUNT_PATH in Vault config: path is empty
```
`vaultAuthPath` is the actual variable name set by the
user. The error message will now be the following:
```
failed to set "vaultAuthPath" in vault config: path is empty
```
Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds support for pvc-pvc clone.
Only capability needed to be advertised, the
underlying support is already provided by cephfs
backend.
Signed-off-by: Rakshith R <rar@redhat.com>
continue running rbd driver when /sys/bus/rbd/supported_features file is
missing, do not bailout.
Fixes: #2678
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
krbdFeatures is set to zero when kernel version < 3.8, i.e. in case where
/sys/bus/rbd/supported_features is absent and we are unable to prepare
the krbd attributes based on kernel version.
When krbdFeatures is set to zero fallback to NBD only when autofallback
is turned ON.
Fixes: #2678
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Upstream /sys/bus/rbd/supported_features is part of Linux kernel v4.11.0
Prepare the attributes and use them in case if
/sys/bus/rbd/supported_features is missing.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Move k8s.GetVolumeMetadata() out of setVolumeMetadata() and rename it to
setAllMetadata() so that the same can be used for setting volume and
snapshot metadata.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
There is not much the NFS-provisioner needs to do to expand a volume,
everything is handled by the CephFS components.
NFS does not need a resize on the node, so only ControllerExpandVolume
is required.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
For the default mounter the mounter option
will not be set in the storageclass and as it is
not available in the storageclass same will not
be set in the volume context, Because of this the
mapOptions are getting discarded. If the mounter
is not set assuming it's an rbd mounter.
Note:- If the mounter is not set in the storageclass
we can set it in the volume context explicitly,
Doing this check-in node server to support backward
existing volumes and the check is minimal we are not
altering the volume context.
fixes: #3076
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
With cgroup v2, the location of the pids.max file changed and so did the
/proc/self/cgroup file
new /proc/self/cgroup file
`
0::/user.slice/user-500.slice/session-14.scope
`
old file:
`
11:pids:/user.slice/user-500.slice/session-2.scope
10:blkio:/user.slice
9:net_cls,net_prio:/
8:perf_event:/
...
`
There is no directory per subsystem (e.g. /sys/fs/cgroup/pids) any more, all
files are now in one directory.
fixes: https://github.com/ceph/ceph-csi/issues/3085
Signed-off-by: Marcus Röder <m.roeder@yieldlab.de>
This commit makes modification so as to allow pvc-pvc clone
with different storageclass having different encryption
configs.
This commit also modifies `copyEncryptionConfig()` to
include a `isEncrypted()` check within the function.
Signed-off-by: Rakshith R <rar@redhat.com>
Before the change, the error msg was the following:
```
failed to set VAULT_AUTH_MOUNT_PATH in Vault config: path is empty
```
`vaultAuthPath` is the actual variable name set by the
user. The error message will now be the following:
```
failed to set "vaultAuthPath" in vault config: path is empty
```
Signed-off-by: Rakshith R <rar@redhat.com>
In case the NFS-export has already been removed from the NFS-server, but
the CSI Controller was restarted, a retry to remove the NFS-volume will
fail with an error like:
> GRPC error: ....: response status not empty: "Export does not exist"
When this error is reported, assume the NFS-export was already removed
from the NFS-server configuration, and continue with deleting the
backend volume.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
as same host directory is not shared between
the cephfs and the rbd plugin pod. we need
to keep the netNamespaceFilePath separately
for both cephfs and rbd. CephFS plugin will
use this path to execute mount -t commands.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
As radosNamespace is more specific to
RBD not the general ceph configuration. Now
we introduced a new RBD section for RBD specific
options, Moving the radosNamespace to RBD section
and keeping the radosNamespace still under the
global ceph level configration for backward
compatibility.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
As the netNamespaceFilePath can be separate for
both cephfs and rbd adding the netNamespaceFilePath
path for RBD, This will help us to keep RBD and
CephFS specific options separately.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The NFS Controller returns a non-gRPC error in case the CreateVolume
call for the CephFS volume fails. It is better to return the gRPC-error
that the CephFS Controller passed along.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Recent versions of Ceph allow calling the NFS-export management
functions over the go-ceph API.
This seems incompatible with older versions that have been tested with
the `ceph nfs` commands that this commit replaces.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
use leases for leader election instead
of the deprecated configmap based leader
election.
This PR is making leases as default leader election
refer https://github.com/kubernetes-sigs/
controller-runtime/pull/1773, default from configmap
to configmap leases was done with
https://github.com/kubernetes-sigs/
controller-runtime/pull/1144.
Release notes https://github.com/kubernetes-sigs/
controller-runtime/releases/tag/v0.7.0
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
To consider the image is healthy during the Promote
operation currently we are checking only the image
state on the primary site. If the network is flaky
or the remote site is down the image health is
not as expected. To make sure the image is healthy
across the clusters check the state on both local
and the remote clusters.
some details:
https://bugzilla.redhat.com/show_bug.cgi?id=2014495
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
calling setRbdNbdToolFeatures inside an init
gets called in main.go for both cephfs and rbd
driver. instead of calling it in init function
calling this in rbd driver.go as this is specific
to rbd.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Set snapshot-name/snapshot-namespace/snapshotcontent-name details
on RBD backend snapshot image as metadata on snapshot
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Make sure to set metadata when image exist, i.e. if the provisioner pod
is restarted while createVolume is in progress, say it created the image
but didn't yet set the metadata.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Example if a PVC was delete by setting `persistentVolumeReclaimPolicy` as
`Retain` on PV, and PV is reattached to a new PVC, we make sure to update
PV/PVC image metadata on a PV reattach.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This helps Monitoring solutions without access to Kubernetes clusters to
display the details of the PV/PVC/NameSpace in their dashboard.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Define and use PV and PVC metadata keys used by external provisioner.
The CSI external-provisioner (v1.6.0+) introduces the
--extra-create-metadata flag, which automatically sets map<string, string>
parameters in the CSI CreateVolumeRequest.
Add utility functions to set/Get PV/PVC/PVCNamespace metadata on image
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
add support to run rbd map and mount -t
commands with the nsenter.
complete design of pod/multus network
is added here https://github.com/rook/rook/
blob/master/design/ceph/multus-network.md#csi-pods
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Currently we only check if the rbd-nbd tool supports cookie feature.
This change will also defend cookie addition based on kernel version
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
The omap is stored with the requested
snapshot name not with the subvolume
snapshotname. This fix uses the correct
snapshot request name to cleanup the omap
once the subvolume snapshot is deleted.
fixes: #2974
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The `ceph nfs export ...` commands have changed in recent Ceph releases.
Use the most recent command as a default, fall back to the older command
when an error is reported.
This shoud make the NFS-provisioner work on any current Ceph version.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Increase the timeout to 2 minutes to give enough time
for rollback to complete.
As rollback is performed by the force-promote command it,
at times, may take more than a minute
(based on dirty blocks that need to be rolled
back approximately) to rollback.
The added extra 1 minute is useful though to avoid
multiple calls to complete the rollback and in
extremely corner cases to avoid failures in the
first instance of the call when the mirror watcher
is not yet removed (post scaling down the
RBD mirror instance)
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Restoring a snapshot with a new PVC results with a wrong
dataPoolName in case of initial volume linked
to a storageClass with topology constraints and erasure coding.
Signed-off-by: Thibaut Blanchard <thibaut.blanchard@gmail.com>
NFSVolume instances are short lived, they only extist for a certain gRPC
procedure. It is easier to store the calling Context in the NFSVolume
struct, than to pass it to some of the functions that require it.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
These NFS Controller and Identity servers are the base for the new
provisioner. The functionality is currently extremely limited, follow-up
PRs will implement various CSI procedures.
CreateVolume is implemented with the bare minimum. This makes it
possible to create a volume, and mount it with the
kubernetes-csi/csi-driver-nfs NodePlugin.
DeleteVolume unexports the volume from the Ceph managed NFS-Ganesha
service. In case the Ceph cluster provides multiple NFS-Ganesha
deployments, things might not work as expected. This is going to be
addressed in follow-up improvements.
Lots of TODO comments need to be resolved before this can be declared
"production ready". Unit- and e2e-tests are missing as well.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
RT, reference tracker, is key-based implementation of a reference counter.
Unlike an integer-based counter, RT counts references by tracking unique
keys. This allows accounting in situations where idempotency must be
preserved. It guarantees there will be no duplicit increments or decrements
of the counter.
Signed-off-by: Robert Vasek <robert.vasek@cern.ch>
OIDC token file path has been modified from
`/var/run/secrets/token` to `/run/secrets/tokens`.
This has been done to ensure compliance with
FHS 3.0.
refer:
https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch05s13.html
Signed-off-by: Rakshith R <rar@redhat.com>
Below are the 3 different cases where we need
the PVC namespace for encryption
* CreateVolume:- Read the namespace from the
createVolume parameters and store it in the omap
* NodeStage:- Read the namespace from the omap
not from the volumeContext
* Regenerate:- Read the pvc namespace from the claimRef
not from the volumeAttributes.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
remove kubernetes csi prefixed parameters
from the volumeContext as we dont want
to store it in the PV VolumeAttributes.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>