The ceph changes are done on the both server and the
client side this change is not enough for remove
setting the size of cloned volumes.
this caused the regression like #2719#2720#2721#2722.
This reverts commit 3565a342d5.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
IsBlockMultiNode() is a new helper that takes a slice of
VolumeCapability objects and checks if it includes multi-node access
and/or block-mode support.
This can then easily be used in other services that need checking for
these particular capabilities, and preventing multi-node block-mode
access.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
we have added clusterID mapping to identify the volumes
in case of a failover in Disaster recovery in #1946.
with #2314 we are moving to a configuration in
configmap for clusterID and poolID mapping.
and with #2314 we have all the required information
to identify the image mappings.
This commit removes the workaround implementation done
in #1946.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The rbd package contains several functions that can be used by
CSI-Addons Service implmentations. Unfortunately it is not possible to
do this, as the rbd-driver needs to import the csi-addons/rbd package to
provide the CSI-Addons server. This causes a circular import when
services use the rbd package:
- rbd/driver.go import csi-addons/rbd
- csi-addons/rbd import rbd (including the driver)
By moving rbd/driver.go into its own package, the circular import can be
prevented.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
HexStringToInteger() used to return a uint64, but everywhere else uint
is used. Having HexStringToInteger() return a uint as well makes it a
little easier to use when setting it with SetGlobalInt().
Signed-off-by: Niels de Vos <ndevos@redhat.com>
When the rbd-driver starts, it initializes some global (yuck!) variables
in the rbd package. Because the rbd-driver is moved out into its own
package, these variables can not easily be set anymore.
Introcude SetGlobalInt(), SetGlobalBool() and InitJournals() so that the
rbd-driver can configure the rbd package.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
The rbd-driver calls rbd.runVolumeHealer() which is not available
outside the rbd package. By moving the rbd-driver into its own package,
RunVolumeHealer() needs to be exported.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
NodeServer.mounter is internal to the NodeServer type, but it needs to
be initialized by the rbd-driver. The rbd-driver is moved to its own
package, so .Mounter needs to be available from there in order to set
it.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
genVolFromVolID() is used by the CSI Controller service to create an
rbdVolume object from a CSI volume_id. This function is useful for
CSI-Addons Services as well, so rename it to GenVolFromVolID().
Signed-off-by: Niels de Vos <ndevos@redhat.com>
k8s.io/utils/mount has moved to k8s.io/mount-utils, and Ceph-CSI uses
that already in most locations. Only internal/util/util.go still imports
the old path.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
The dummy image will be created with 1Mib size.
during the snapshot transfer operation the 1Mib
will be transferred even if the dummy image doesnot
contains any data. adding the new image features
`fast-diff,layering,obj-map,exclusive-lock`on the
dummy image will ensure that only the diff is
transferred to the remote cluster.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
we added a workaround for rbd scheduling by creating
a dummy image in #2656. with the fix we are creating
a dummy image of the size of the first actual rbd
image which is sent in EnableVolumeReplication request
if the actual rbd image size is 1TiB we are creating
a dummy image of 1TiB which is not good. even though
its a thin provisioned rbd images this is causing
issue for the transfer of the snapshot during
the mirroring operation.
This commit recreates the rbd image with 1MiB size
which is the smaller supported size in rbd.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
rbd mirroring CLI calls are async and it doesn't wait
for the operation to be completed. ex:- `rbd mirror image enable`
it will enable the mirroring on the image but it doesn't
ensure that the image is mirroring enabled and healthy
primary. The same goes for the promote volume also.
This commits adds a check-in PromoteVolume to make sure
the image in a healthy state i.e `up+stopped`.
note:- not considering any intermediate states to make
sure the image is completely healthy before responding
success to the RPC call.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Journal-based RADOS block device mirroring ensures point-in-time
consistent replicas of all changes to an image, including reads and
writes, block device resizing, snapshots, clones, and flattening.
Journaling-based mirroring records all modifications to an image in the
order in which they occur. This ensures that a crash-consistent mirror
of an image is available.
Mirroring when configured in journal mode, mirroring will
utilize the RBD journaling image feature to replicate the image
contents. If the RBD journaling image feature is not yet enabled on the
image, it will be automatically enabled.
Fixes: #2018
Co-authored-by: Madhu Rajanna <madhupr007@gmail.com>
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Depending on the way Ceph-CSI is deployed, the capabilities will be
configured for the GetCapabilities procedure. The other procedures are
more straight-forward.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
After adding the new CSI-Addons Server, golang-ci complains that
driver.Run() is too complex. By moving the profiling checks and starting
of the go-routines in their own function, golang-ci is happy again.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Add a new CSI-Addons Server and empty Identity Service for the RBD
plugin. The implementation of the Identity Service procedure calls will
be done in other PRs.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
currently we are fist operating on the dummy
image to refresh the pool and then we are adding
the scheduling. we think the scheduling should
be added first and than we should refresh the
pool. If we do this all the existing schedules
will be considered from the scheduler.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
with shallow copy of rbdVol to dummyVol
the image name update of the dummyVol is getting
reflected on the rbdVol which we dont want.
do deep copy to avoid this problem.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Uses the below schema to supply mounter specific map/unmapOptions to the
nodeplugin based on the discussion we all had at
https://github.com/ceph/ceph-csi/pull/2636
This should specifically be really helpful with the `tryOthermonters`
set to true, i.e with fallback mechanism settings turned ON.
mapOption: "kbrd:v1,v2,v3;nbd:v1,v2,v3"
- By omitting `krbd:` or `nbd:`, the option(s) apply to
rbdDefaultMounter which is krbd.
- A user can _override_ the options for a mounter by specifying `krbd:`
or `nbd:`.
mapOption: "v1,v2,v3;nbd:v1,v2,v3"
is effectively the same as the 1st example.
- Sections are split by `;`.
- If users want to specify common options for both `krbd` and `nbd`,
they should mention them twice.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
The dummy mirror image needs to be disabled and then
reenabled for mirroring, to ensure a newly promoted
primary is now starting to schedule snapshots.
Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
currently we have a bug in rbd mirror scheduling module.
After doing failover and failback the scheduling is not
getting updated and the mirroring snapshots are not
getting created periodically as per the scheduling
interval. This PR workarounds this one by doing below
operations
* Create a dummy (unique) image per cluster and this image
should be easily identified.
* During Promote operation on any image enable the
mirroring on the dummy image. when we enable the mirroring
on the dummy image the pool will get updated and the
scheduling will be reconfigured.
* During Demote operation on any image disable the mirroring
on the dummy image. the disable need to be done to enable
the mirroring again when we get the promote request to make
the image as primary
* When the DR is no more needed, this image need to be
manually cleanup as for now as we dont want to add a check
in the existing DeleteVolume code path for delete dummy image
as it impact the performance of the DeleteVolume workflow.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Moved to add scheduling to the promote
operation as scheduling need to be added
when the image is promoted and this is
the correct method of adding the scheduling
to make the scheduling take place.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
instead of logging the volumeID and the pool
name. log the poolname and image name for better
debugging.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If we hit any error while running the cryptosetup
commands we are logging only the error message.
with only error message it is difficult to analyze
the problem, logging the stdError will help us to
check what is the problem.
updates: #2610
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
On line 341 a `transaction` is created. This is passed to the deferred
`undoStagingTransaction()` function when an error in the
`NodeStageVolume` procedure is detected. So far, so good.
However, on line 356 a new `transaction` is returned. This new
`transaction` is not used for the defer call.
By removing the empty `transaction` that is used in the defer call, and
calling `undoStagingTransaction()` on an error of `stageTransaction()`,
the code is a little simpler, and the cleanup of the transaction should
be done correctly now.
Updates: #2610
Signed-off-by: Niels de Vos <ndevos@redhat.com>
This log line is seen frequently in the logs and its better to be at
Warning loglevel rather than Error based on its severity
E1109 08:30:45.612395 38328 util.go:247] kernel 4.19.202 does not support required features
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Problem:
On remap/attach of device (i.e. nodeplugin restart), there is no way
for rbd-nbd to defend if the backend storage is matching with the initial
backend storage.
Say, if an initial map request for backend "pool1/image1" got mapped to
/dev/nbd0 and the userspace process is terminated (on nodeplugin restart).
A next remap/attach (nodeplugin start) request within reattach-timeout is
allowed to use /dev/nbd0 for a different backend "pool1/image2"
For example, an operation like below could be dangerous:
$ sudo rbd-nbd map --try-netlink rbd-pool/ext4-image
/dev/nbd0
$ sudo blkid /dev/nbd0
/dev/nbd0: UUID="bfc444b4-64b1-418f-8b36-6e0d170cfc04" TYPE="ext4"
$ sudo pkill -15 rbd-nbd <-- nodeplugin terminate
$ sudo rbd-nbd attach --try-netlink --device /dev/nbd0 rbd-pool/xfs-image
/dev/nbd0
$ sudo blkid /dev/nbd0
/dev/nbd0: UUID="d29bf343-6570-4069-a9ea-2fa156ced908" TYPE="xfs"
Solution:
rbd-nbd/kernel now provides a way to keep some metadata in sysfs to identify
between the device and the backend, so that when a remap/attach request is
made, rbd-nbd can compare and avoid such dangerous operations.
With the provided solution, as part of the initial map request, backend
cookie (ceph-csi VOLID) can be stored in the sysfs per device config, so
that on a remap/attach request rbd-nbd will check and validate if the
backend per device cookie matches with the initial map backend with the help
of cookie.
At Ceph-csi we use VOLID as device cookie, which will be unique, we pass
the VOLID as cookie at map and use the same at the time of attach, that
way rbd-nbd can identify backends and their matching devices.
Requires:
https://github.com/ceph/ceph/pull/41323https://lkml.org/lkml/2021/4/29/274
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This change allows the user to choose not to fallback to NBD mounter
when some ImageFeatures are absent with krbd driver, rather just fail
the NodeStage call.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Currently, we recognize and warn for the provided image features based on
our prior intelligence at ceph-csi (i.e based on supportedFeatures map
and validateImageFeatures) at image/PV creation time. It might be very
much possible that the cluster is heterogeneous i.e. the PV creation and
application container might both be on different nodes with different
kernel versions (krbd driver versions).
This PR adds a mechanism to check for the supported krbd features during
mount time, if the krbd driver doesn't have the specified image feature
then it will fall back to rbd-nbd mounter.
Fixes: #478
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
When using UPPER_CASE formatting for the HashiCorp Vault KMS
configuration, a missing `VAULT_DESTROY_KEYS` will cause the option to
be set to "false". The default for the option is intended for be "true".
This is a difference in behaviour between the `vaultDestroyKeys` and
`VAULT_DESTROY_KEYS` options. Both should use a default of "true" when
the configuration does not set the option explicitly.
By setting the default options in the `standardVault` struct before
unmarshalling the configuration in it, the default values will be
retained for the missing configuration options.
Reported-by: Rachael George <rgeorge@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
this commit make use of the migration request secret parsing and set
the required fields for further nodestage operations
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
parseAndDeleteMigratedVolume() prviously clubbed the logic of
parsing of migration volume handle and then continued with the
deletion of the volume. however this commit split this
logic into two, ie parsing has been done in parseMigrationVolID()
and DeleteMigratedVolume() deletes the backend volume.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
This commit adds a couple of helper functions to parse the migration
request secret and set it for further csi driver operations.
More details:
The intree secret has a data field called "key" which is the base64
admin secret key. The ceph CSI driver currently expect the secret to
contain data field "UserKey" for the equivalant. The CSI driver also
expect the "UserID" field which is not available in the in-tree secret
by deafult. This missing userID will be filled (if the username differ
than 'admin') in the migration secret as 'adminId' field in the
migration request, this commit adds the logic to parse this migration
secret as below:
"key" field value will be picked up from the migraion secret to "UserKey"
field.
"adminId" field value will be picked up from the migration secret to "UserID"
field
if `adminId` field is nil or not set, `UserID` field will be filled with
default value ie `admin`.The above logic get activated only when the secret
is a migration secret, otherwise skipped to the normal workflow as we have
today.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Thick-provisioning was introduced to make accounting of assigned space
for volumes easier. When thick-provisioned volumes are the only consumer
of the Ceph cluster, this works fine. However, it is unlikely that this
is the case. Instead, accounting of the requested (thin-provisioned)
size of volumes is much more practical as different types of volumes can
be tracked.
OpenShift already provides cluster-wide quotas, which can combine
accounting of requested volumes by grouping different StorageClasses.
In addition to the difficult practise of allowing only thick-provisioned
RBD backed volumes, the performance makes thick-provisioning
troublesome. As volumes need to be completely allocated, data needs to
be written to the volume. This can take a long time, depending on the
size of the volume. Provisioning, cloning and snapshotting becomes very
much noticeable, and because of the additional time consumption, more
prone to failures.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
for comparing the image states use the states
defined in the go-ceph avoid creating of the
deplicate const in cephcsi.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
below are the local states of the mirrored image
"unknown" -> If the image is in an error state
means data is completely synced
"error" -> If the image is in an error state
means it needs resync
"syncing"
"starting_replay"
"replaying"
"stopping_replay"
"stopped"
If the resync is successfully started which
means the image will be in "replaying" state.
we can consider "replaying" state to report
resync succesfully going on state.
we are discarding the intermediate states like
"syncing", "starting_replay" and "stopping_replay".
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
After moving moving image to trash, if `trash remove` step fails,
then external-provisioner will issue subsequent requests, in which
image will be absent in pool( will be in trash) and omap cleanup will
be done with stale image left in trash with no `trash remove` step on it.
To avoid this scenario list trash images and find corresponding id for given
image name and add a task to flatten when we encounter a ErrImageNotFound.
Fixes: #1728
Signed-off-by: Rakshith R <rar@redhat.com>
Following the CSI specification, values that are included in the
VolumeUsage MUST NOT be negative. However, CephFS seems to return -1 for
the number of inodes that are available. Instead of returning a
negative value, set it to 0 so that it will not get included in the
encoded JSON response.
Updates: #2579
See-also: 5b0d454015/spec.md (L2477-L2487)
Signed-off-by: Niels de Vos <ndevos@redhat.com>
In some corner case like `re-player shutdown` the
local image will not be in error state. It would
be also worth considering `description` field to
make sure about split-brain.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
previously we were retriving clusterID using the monitors field
in the volume context at node stage code path. however it is possible to
retrieve or use clusterID directly from the volume context. This
commit also remove the getClusterIDFromMigrationVolume() function
which was used previously and its tests
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
we reuse or overload the variable name in the test execution at present.
This commit use a different variable name as initialized in each run
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
For static volume, the user will manually mounts
already existing image as a volume to the application
pods. As its a rbd Image, if the PVC is of type
fileSystem the image will be mapped, formatted
and mounted on the node,
If the user resizes the image on the ceph cluster.
User cannot not automatically resize the filesystem
created on the rbd image. Even if deletes and
recreates the kubernetes objects, the new size
will not be visible on the node.
With this changes During the NodeStageVolumeRequest
the nodeplugin will check the size of the mapped rbd
image on the node using the devicePath. and also
the rbd image size on the ceph cluster.
If the size is not matching it will do the file
system resize on the node as part of the
NodeStageVolumeRequest RPC call.
The user need to do below operation to see new size
* Resize the rbd image in ceph cluster
* Scale down all the application pods using the static
PVC.
* Make sure no application pods which are using the
static PVC is running on a node.
* Scale up all the application pods.
Validate the new size in application pod mounted
volume.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
in NodeStage operation we are flattening
the image to support mounting on the older
clients. this commits moves it to a helper
function to reduce code complexity.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
During PVC snapshot/clone both kms config and passphrase needs to copied,
while for PVC restore only passphrase needs to be copied to dest rbdvol
since destination storageclass may have another kms config.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds the logic to detect a passed in volumeID
is a migrated volume ID and if yes, the driver connect to the
backend cluster and clean/delete the image. The logic
only applied if its a migration volume ID. The migration volume ID
carry the information like mons, pool and image name which is
good enough for the driver to identify and connect to the backend
cluster for its operations.
migration volID format:
<mig>_mons-<monsHash>_image-<imageUID>_<poolHash>
Details on the hash values:
* MonsHash: this carry a hash value (md5sum) which will be acted as the
`clusterID` for the operations in this context.
* ImageUID: this is the unique UUID generated by kubernetes for the created
volume.
* PoolHash: this is an encoded string of pool name.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
as we are refractoring the cephfs code,
Moving all the core functions to a new folder
/pkg called core. This will make things easier
to implement. For now onwards all the core
functionalities will be added to the core
package.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
the migration nodestage request does not carry the 'clusterID' in it
and only monitors are available with the volumeContext. The volume
context flag 'migration=true' and 'static=true' flags allow us to
fill 'clusterID' from the passed in monitors to the volume Context,so
that rest of the static operations on nodestage can be proceeded as we
do treat static volumes today.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
as part of migration support, the clusterID has to be fetched
from passed in mon. Because the intree RBD storage class only
got monitor and not `clusterID` parameter support. However, in
CSI, SC has the `clusterID` parameter support but not mon. Due
to that we have to fetch the clusterID from config file for the
passed in mon and use it in our operations. This adds a helper
function to retrieve clusterID from passed in mon string.
Updates https://github.com/ceph/ceph-csi/issues/2509
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Currently, we delete the ceph client log file on unmap/detach.
This patch provides additional alternatives for users who would like to
persist the log files.
Strategies:
-----------
`remove`: delete log file on unmap/detach
`compress`: compress the log file to gzip on unmap/detach
`preserve`: preserve the log file in text format
Note that the default strategy will be remove on unmap, and these options
can be tweaked from the storage class
Compression size details example:
On Map: (with debug-rbd=20)
---------
$ ls -lh
-rw-r--r-- 1 root root 526K Sep 1 18:15
rbd-nbd-0001-0024-fed5480a-f00f-417a-a51d-31d8a8144c03-0000000000000003-d2e89c87-0b4d-11ec-8ea6-160f128e682d.log
On unmap:
---------
$ ls -lh
-rw-r--r-- 1 root root 33K Sep 1 18:15
rbd-nbd-0001-0024-fed5480a-f00f-417a-a51d-31d8a8144c03-0000000000000003-d2e89c87-0b4d-11ec-8ea6-160f128e682d.gz
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Log:
internal/rbd/rbd_attach.go:424:2: hugeParam: dArgs is heavy (88 bytes);
consider passing it by pointer (gocritic)
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Currently we return a !ready status if an image
is not found when a replication resync is issued.
We also return a !ready just post issuing a resync.
The change is to ensure we return errors in these
cases for the caller to retry the operation till
we can determine we are actually resyncing, and then
return !ready with nil errors.
Part of addressing:
https://github.com/csi-addons/volume-replication-operator/issues/101
Signed-off-by: Shyamsundar Ranganathan <srangana@redhat.com>
This commit:
- modifies GetMonsAndClusterID() to take clusterID instead of options.
- moves out validation of clusterID is set or not out of GetMonsAndClusterID().
- defines ErrClusterIDNotSet new error for reusability.
- add GetClusterID() to obtain clusterID from options.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit adds capability to genVolFromVolumeOptions() to fetch
mapped clusted-id & mon ips for mirrored PVC on secondary cluster
which may have different cluster-id.
This is required for NodeStageVolume().
We also don't need to check for mapping during volume create requests,
so it can be disabled by passing a bool checkClusterIDMapping as false.
GetMonsAndClusterID() is modified to accept bool checkClusterIDMapping
based on which clustermapping is checked to fetch mapped cluster-id and
mon-ips.
Signed-off-by: Rakshith R <rar@redhat.com>
at present, eventhough the checkReservation works for both volume
and snapshot, the arg parentName make sense only for snapshot cases
renaming that arg to more approprite
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
This commit calls WriteCephConfig() in cephcsi.go to
create ceph.conf and keyring if it is not mounted to
be used by all cli calls and conn cmds.
Before this change, rbd-controller/omap-generator did not create
ceph.conf on startup.
Signed-off-by: Rakshith R <rar@redhat.com>
When we read the csi-kms-connection-details configmap
vaultAuthNamespace might not be set when we do the
conversion the vaultAuthNamespace might be set to empty
key and this commits check for the empty value of
vaultAuthNamespace and set the vaultAuthNamespace
to vaultNamespace.
setting empty value for vaultAuthNamespace happened due
to Marshalling at https://github.com/ceph/ceph-csi/blob/devel/
internal/kms/vault_tokens.go#L136-L139.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The configurations in cpeh.conf is not picked up by rados connection
automatically, hence we need to call conn.ReadConfigFile before calling
Connect().
Signed-off-by: Rakshith R <rar@redhat.com>
volumeID can be moved to the volumeOptions
as most of the volume related helper functions
are available on the volumeoptions.go
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Moved execCommandErr to the volumemounter.go
which is the only caller of this function and
moving the execCommandErr helps in reducing the
util file.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
moved the cephfs related validation like
validating the input parameters sent in the
GRPC request to a new file.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
checkStaticVolume() in the reconcilePV function has been unwantedly
introducing variables to confirm the pv spec is static or not. This
patch simplify it and make a smaller footprint of the functions.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
At present there is a 'todo' to check for zero volume size
in the createVolume request which in unwanted, ie the pvc
creation with size 0 fail from the kubernetes api validation itself:
For ex:
```
..spec.resources[storage]: Invalid value: "0": must be greater than zero```
```
so we dont need any extra check in the controller server
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Thanks to the random unmap failure on my local machine:
I0901 17:08:37.841890 2617035 cephcmds.go:55] ID: 11 Req-ID:
0001-0024-fed5480a-f00f-417a-a51d-31d8a8144c03-0000000000000003-024983f3-0b47-11ec-8fcb-e671f0b9f58e
an error (exit status 22) occurred while running rbd args: [unmap
rbd-pool/csi-vol-024983f3-0b47-11ec-8fcb-e671f0b9f58e --device-type nbd
--options try-netlink --options reattach-timeout=300 --options
io-timeout=0]
Noticed the map args are also getting passed to/as unmap args, which is not
correct. We have separate things for mapOptions and unmapOptions. This PR
makes sure that the map args are not passed at the time of unmap.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
At present the nodeStageVolume() handle many logic of filling rbdvol
struct based on the request received and this method is complex to
follow. with this patch, filling or populating volOptions has been
segregrated and handled hence make the stage functions' job easy.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
MirrorImageState (type C.rbd_mirror_image_state_t) has a string
method which can be used while returning error in the replication
controller. Previously, we were using int return in the error which
is not the proper usage.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
all the error check scenarios of genVolFromVolID() and unreserving
omap entries based on the error made deleteVolume method complex,
this patch create a new function which handle the error check and
unrerving omap entries accordingly and finally return the response
to deletevolume/caller.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
When NewK8sClient() detects and error, it used to call FatalLogMsg()
which causes a panic. There are additional features that can be used on
Kubernetes clusters, but these are not a requirement for most
functionalities of the driver.
Instead of causing a panic, returning an error should suffice. This
allows using the driver on non-Kubernetes clusters again.
Fixes: #2452
Signed-off-by: Niels de Vos <ndevos@redhat.com>
CephFS csi driver explictly set the size of the cloned volume
to the size of parent volume as cephfs mgr was lacking this
functionality previously. However it has been addressed in cephfs
so we dont need explicit size setting.
Ref#https://tracker.ceph.com/issues/46163
Supported Ceph releases:
Ceph versions equal or above - v16.0.0, v15.2.9, v14.2.12
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
This commit adds fetchMappedClusterIDAndMons() which returns
monitors and clusterID info after checking cluster mapping info.
This is required for regenerating omap entries in mirrored cluster
with different clusterID.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit moves getMappedID() from rbd to util
package since it is not rbd specific and exports
it from there.
Signed-off-by: Rakshith R <rar@redhat.com>
A new "internal/kms" package is introduced, it holds the API that can be
consumed by the RBD components.
The KMS providers are currently in the same package as the API. With
later follow-up changes the providers will be placed in their own
sub-package.
Because of the name of the package "kms", the types, functions and
structs inside the package should not be prefixed with KMS anymore:
internal/kms/kms.go:213:6: type name will be used as kms.KMSInitializerArgs by other packages, and that stutters; consider calling this InitializerArgs (golint)
Updates: #852
Signed-off-by: Niels de Vos <ndevos@redhat.com>
By placing the NewK8sClient() function in its own package, the KMS API
can be split from the "internal/util" package. Some of the KMS providers
use the NewK8sClient() function, and this causes circular dependencies
between "internal/utils" -> "internal/kms" -> "internal/utils", which
are not alowed in Go.
Updates: #852
Signed-off-by: Niels de Vos <ndevos@redhat.com>
if the volumeattachment has been fetched but marked for deletion
the nbd healer dont want to process further on this pv. This patch
adds a check for pv is marked for deletion and if so, make the
healer skip processing the same
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Moving the log functions into its own internal/util/log package makes it
possible to split out the humongous internal/util packages in further
smaller pieces. This reduces the inter-dependencies between utility
functions and components, preventing circular dependencies which are not
allowed in Go.
Updates: #852
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Unit-testing often fails due to a race condition while writing the
clusterMappingConfigFile from multiple go-routines at the same time.
Failures from `make containerized-test` look like this:
=== CONT TestGetClusterMappingInfo/site2-storage_cluster-id_mapping
cluster_mapping_test.go:153: GetClusterMappingInfo() = <nil>, expected data &[{map[site1-storage:site2-storage] [map[1:3]] [map[11:5]]} {map[site3-storage:site2-storage] [map[8:3]] [map[10:5]]}]
=== CONT TestGetClusterMappingInfo/site3-storage_cluster-id_mapping
cluster_mapping_test.go:153: GetClusterMappingInfo() = <nil>, expected data &[{map[site3-storage:site2-storage] [map[8:3]] [map[10:5]]}]
--- FAIL: TestGetClusterMappingInfo (0.01s)
--- PASS: TestGetClusterMappingInfo/mapping_file_not_found (0.00s)
--- PASS: TestGetClusterMappingInfo/mapping_file_found_with_empty_data (0.00s)
--- PASS: TestGetClusterMappingInfo/cluster-id_mapping_not_found (0.00s)
--- FAIL: TestGetClusterMappingInfo/site2-storage_cluster-id_mapping (0.00s)
--- FAIL: TestGetClusterMappingInfo/site3-storage_cluster-id_mapping (0.00s)
--- PASS: TestGetClusterMappingInfo/site1-storage_cluster-id_mapping (0.00s)
By splitting the public GetClusterMappingInfo() function into an
internal getClusterMappingInfo() that takes a filename, unit-testing can
use different files for each go-routine, and testing becomes more
predictable.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
With the tests at CI, it kind of looks like that the IO is timing out after
30 seconds (default with rbd-nbd). Since we have tweaked reattach-timeout
to 300 seconds at ceph-csi, we need to explicitly set io-timeout on the
device too, as it doesn't make any sense to keep
io-timeout < reattach-timeout
Hence we set io-timeout for rbd nbd to 0. Specifying io-timeout 0 tells
the nbd driver to not abort the request and instead see if it can be
restarted on another socket.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
Suggested-by: Ilya Dryomov <idryomov@redhat.com>
- Update the meta stash with logDir details
- Use the same to remove logfile on unstage/unmap to be space efficient
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
- One logfile per device/volume
- Add ability to customize the logdir, default: /var/log/ceph
Note: if user customizes the hostpath to something else other than default
/var/log/ceph, then it is his responsibility to update the `cephLogDir`
in storageclass to reflect the same with daemon:
```
cephLogDir: "/var/log/mynewpath"
```
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
ContollerManager had a typo in it, and if we correct it,
linter will fail and suggest not to use controller.ControllerManager
as the interface name and package name is redundant, keeping manager
as the interface name which is the practice and also address the
linter issues.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
https://github.com/ceph/go-ceph/pull/455/ added `state` field
to subvolume info struct which helps to identify the snapshot
retention state in the caller. This patch make use of the same
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
If the image is in a secondary state and its
up+replaying means its an healthy secondary
and the image is primary somewhere in the remote cluster
and the local image is getting replayed. Delete the
OMAP data generated as we cannot delete the
secondary image. When the image on the primary
cluster gets deleted/mirroring disabled, the image on
all the remote (secondary) clusters will get
auto-deleted. This helps in garbage collecting
the OMAP, PVC and PV objects after failback operation.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
If the image is in secondary state and its
up+replaying means its an healthy secondary
and the image is primary somewhere in the remote
cluster and the local image is getting replayed.
Return success for the Disabling mirroring as
we cannot disable the mirroring on the secondary
state, when the image on the remote site gets
disabled the image on all the remote (secondary)
will get auto deleted. This helps in garbage
collecting the volume replication kuberentes
artifacts
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit adds functionality of extracting encryption kmsID,
owner from volumeAttributes in RegenerateJournal() and adds utility
functions ParseEncryptionOpts and FetchEncryptionKMSID.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit refractors RegenerateJournal() to take in
volumeAttributes map[string]string as argument so it
can extract required attributes internally.
Signed-off-by: Rakshith R <rar@redhat.com>
For clusterMappingConfigFile using different
file name so that multiple unit test cases can
work without any data race.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
consider the empty mirroring mode when
validating the snapshot interval and
the scheduling time.
Even if the mirroring Mode is not set
validate the snapshot scheduling details
as cephcsi sets the mirroring mode to default
snapshot.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
This commit fixes snapshot id idempotency issue by
always returning an error when flattening is in progress
and not using `readyToUse:false` response.
Signed-off-by: Rakshith R <rar@redhat.com>
This commit fixes a bug in checkCloneImage() which was caused
by checking cloned image before checking on temp-clone image snap
in a subsequent request which lead to stale images. This was solved
by checking temp-clone image snap and flattening temp-clone if
needed.
This commit also fixes comparison bug in flattenCloneImage().
Signed-off-by: Rakshith R <rar@redhat.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
rbd flatten functions is a CLI call and it expects
the creds as the input and copying of creds is
required when we generate the temp clone image.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Volume generated from snap using genrateVolFromSnap
already copies volume ID correctly, therefore removing
`vol.VolID = rbdVol.VolID` which wrongly copies parent
Volume ID instead leading to error from copyEncryption()
on parent and clone volume ID being equal.
Signed-off-by: Rakshith R <rar@redhat.com>
Golang-ci complains about the following:
internal/util/vault_tokens.go:99:20: string `true` has 4 occurrences, but such constant `vaultDefaultDestroyKeys` already exists (goconst)
v.VaultCAVerify = "true"
^
This occurence of "true" can be replaced by vaultDefaultCAVerify so
address the warning.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Hashicorp Vault does not completely remove the secrets in a kv-v2
backend when the keys are deleted. The metadata of the keys will be
kept, and it is possible to recover the contents of the keys afterwards.
With the new `vaultDestroyKeys` configuration parameter, this behaviour
can now be selected. By default the parameter will be set to `true`,
indicating that the keys and contents should completely be destroyed.
Setting it to any other value will make it possible to recover the
deleted keys.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Whenever Ceph-CSI receives a CSI/Replication
request it will first decode the
volumeHandle and try to get the required
OMAP details if it is not able to
retrieve, receives a `Not Found` error
message and Ceph-CSI will check for the
clusterID mapping. If the old volumeID
`0001-00013-site1-storage-0000000000000001
-b0285c97-a0ce-11eb-8c66-0242ac110002`
contains the `site1-storage` as the clusterID,
now Ceph-CSI will look for the corresponding
clusterID `site2-storage` from the above configmap.
If the clusterID mapping is found now Ceph-CSI
will look for the poolID mapping ie mapping between
`1` and `2`. Example:- pool with name exists on
both the clusters with different ID's Replicapool
with ID `1` on site1 and Replicapool with ID `2`
on site2. After getting the required mapping Ceph-CSI
has the required information to get more details
from the rados OMAP. If we have multiple clusterID mapping
it will loop through all the mapping and checks the
corresponding pool to get the OMAP data. If the clusterID
mapping does not exist Ceph-CSI will return an `Not Found`
error message to the caller.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
added helper function to read the clusterID mapping
from the mounted file.
The clusterID mapping contains below mappings
* ClusterID mappings (to cluster to which we are failingover
and from which cluster failover happened)
* RBD PoolID mapping of between the clusters.
* CephFS FscID mapping between the clusters.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The VAULT_AUTH_MOUNT_PATH is a Vault configuration parameter that allows
a user to set a non default path for the Kubernetes ServiceAccount
integration. This can already be configured for the Vault KMS, and is
now added to the Vault Tenant SA KMS as well.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
The new `vaultAuthNamespace` configuration parameter can be set to the
Vault Namespace where the authentication is setup in the service. Some
Hashicorp Vault deployments use sub-namespaces for their users/tenants,
with a 'root' namespace where the authentication is configured. This
requires passing of different Vault namespaces for different operations.
Example:
- the Kubernetes Auth mechanism is configured for in the Vault
Namespace called 'devops'
- a user/tenant has a sub-namespace called 'devops/website' where the
encryption passphrases can be placed in the key-value store
The configuration for this, then looks like:
vaultAuthNamespace: devops
vaultNamespace: devops/homepage
Note that Vault Namespaces are a feature of the Hashicorp Vault
Enterprise product, and not part of the Open Source version. This
prevents adding e2e tests that validate the Vault Namespace
configuration.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
- mount host's /etc/selinux in node plugins
- process mount options in all code paths for cephfs volume options
Signed-off-by: Alexandre Lossent <alexandre.lossent@cern.ch>
This commit uses `string.SplitN` instead of `string.Split`.
The path for pids.max has extra `:` symbols in it due to which
getCgroupPidsFile() splits the string into 5 tokens instead of
3 leading to loss of part of the path.
As a result, the below error is reported:
`Failed to get the PID limit, can not reconfigure: open
/sys/fs/cgroup/pids/system.slice/containerd.service/
kubepods-besteffort-pod183b9d14_aed1_4b66_a696_da0c738bc012.slice/pids.max:
no such file or directory`
SplitN takes an argument n and splits the string
accordingly which helps us to get the desired
file path.
Fixes: #2337
Co-authored-by: Yati Padia <ypadia@redhat.com>
Signed-off-by: Yati Padia <ypadia@redhat.com>
Currently we have a bug that we are not using rados
namespace when adding ceph manager command to
remove the image from the trash. This commit
adds the missing rados namespace when adding
ceph manager task.
without fix the image will be moved to trash
and no task will be added to remove from the
trash. it will become ceph responsibility to
remove the image from trash when it will cleanup
the trash.
workaroud: manually purge the trash
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
RBD image metadata keys that start with '.rbd' are expected to be
internal to RBD itself and are not mirrored to remote sites. Renaming
the keys (dropping the '.' prefix) and using the new MigrateMetadata()
function now makes the keys available on remote sites too.
Closes: #2219
Signed-off-by: Niels de Vos <ndevos@redhat.com>
The new MigrateMetadata() function can be used to get the metadata of an
image with a deprecated and new key. Renaming metadata keys can be done
easily this way.
A default value will be set in the image metadata when it is missing
completely. But if the deprecated key was set, the data is stored under
the new key and the deprecated key is removed.
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Currently, getImageMirroringStatus() is using RBD CLI.
This commit converts RBD CLI to go-ceph API.
Fixes: #2120
Signed-off-by: Yati Padia <ypadia@redhat.com>
Previously in ControllerExpandVolume() we had a check for encrypted
volumes and we use to fail for all expand requests on an encrypted
volume. Also for Block VolumeMode PVCs NodeExpandVolume used to be
ignored/skipped.
With these changes, we add support for the expansion of encrypted volumes.
Also for raw Block VolumeMode PVCs with Encryption we call NodeExpandVolume.
That said,
With LUKS1, cryptsetup utility doesn't prompt for a passphrase on resizing
the crypto mapper device. This is because LUKS1 devices don't use kernel
keyring for volume keys.
Whereas, LUKS2 devices use kernel keyring for volume key by default, i.e.
cryptsetup utility asks for a passphrase if it detects volume key was
previously passed to dm-crypt via kernel keyring service, we are overriding
the default by --disable-keyring option during cryptsetup open command.
So that at the time of crypto mapper device resize we will not be
prompted for any passphrase.
Fixes: #1469
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
With Luks1 device:
$ cryptsetup status /dev/mapper/crypto-rbd0
/dev/mapper/crypto-rbd0 is active and is in use.
type: LUKS1
cipher: aes-xts-plain64
keysize: 512 bits
key location: dm-crypt
device: /dev/rbd0
sector size: 512
offset: 4096 sectors
size: 4190208 sectors
mode: read/write
With Luks2 device:
$ cryptsetup status /dev/mapper/crypto-rbd0
/dev/mapper/crypto-rbd0 is active and is in use.
type: LUKS2
cipher: aes-xts-plain64
keysize: 512 bits
key location: dm-crypt
device: /dev/rbd0
sector size: 512
offset: 32768 sectors
size: 4161536 sectors
mode: read/write
This could lead to failures with unmap in the NodeUnstageVolume path
for the encrypted volumes.
Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
This commit modifies the error of godot, cyclop,
paralleltest linter caused due to merged PRs.
Updates: #1586
Signed-off-by: Yati Padia <ypadia@redhat.com>