doc: added docs for ceph-fuse mount corruption

Signed-off-by: Robert Vasek <robert.vasek@cern.ch>
2025-04-11 18:13:00 +00:00 · 2022-02-02 13:20:37 +01:00 · 2022-02-02 13:20:37 +01:00 · 78697e4f7c
commit 78697e4f7c
parent 548a752ff0
1 changed files with 45 additions and 0 deletions
--- a/docs/ceph-fuse-corruption.md
+++ b/docs/ceph-fuse-corruption.md
@@ -0,0 +1,45 @@
+# ceph-fuse: detection of corrupted mounts and their recovery
+
+Mounts managed by ceph-fuse may become corrupted when, for example, the
+ceph-fuse process exits abruptly, or when its parent Node Plugin container is
+terminated, taking its child processes down with it.
+
+Affected workloads may then see errors such as:
+
+```
+# mount | grep fuse
+ceph-fuse on /cephfs-share type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
+# ls /cephfs-share
+ls: /cephfs-share: Socket not connected
+```
+
+or,
+
+```
+# stat /home/kubelet/pods/ae344b80-3b07-4589-b1a1-ca75fa9debf2/volumes/kubernetes.io~csi/pvc-ec69de59-7823-4840-8eee-544f8261fef0/mount: transport endpoint is not connected
+```
+
+This feature allows the CSI CephFS plugin to detect, during the volume
+publishing phase, whether a ceph-fuse mount is corrupted, and to attempt to
+recover it for the newly scheduled pod. Pods already running on a node whose
+ceph-fuse mountpoints were broken may still need to be restarted, however.
+
+## Detection
+
+A mountpoint is deemed corrupted if `stat()`-ing it returns one of the
+following errors:
+
+* `ENOTCONN`
+* `ESTALE`
+* `EIO`
+* `EACCES`
+* `EHOSTDOWN`
+
+## Recovery
+
+Once mountpoint corruption is detected, recovery is performed by remounting
+the volume associated with that mountpoint.
+
+Recovery is attempted only if the `/csi/mountinfo` directory is made available
+to the CSI CephFS plugin (it is available by default in the Helm chart and
+Kubernetes manifests).