pdcsi-node OOM because of fsck on large volumes

Hi,

I have some large (>5TB) volumes formatted with an EXT4 filesystem. When a pod tries to attach such a volume, an fsck process is spawned at some point, and that fsck process seems to get killed because the gce-pd-driver container in the pdcsi-node pod has a memory limit of 50MB.
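For reference, this is how I confirmed the limit; a quick sketch assuming the managed driver runs as the pdcsi-node DaemonSet in kube-system (which is how it shows up on our GKE clusters):

# Print the resource requests/limits of the gce-pd-driver container
kubectl -n kube-system get daemonset pdcsi-node \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="gce-pd-driver")].resources}'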


INFO 2024-07-10T14:11:01.219800489Z [resource.labels.containerName: gce-pd-driver] For disk restore-us-central1-d8ff-pg-data-pg-main-0-2938 the /dev/* path is /dev/sdb for disk/by-id path /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:01.221497320Z [resource.labels.containerName: gce-pd-driver] For disk restore-us-central1-d8ff-pg-data-pg-main-0-2938, device path /dev/sdb, found serial number restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:01.221534929Z [resource.labels.containerName: gce-pd-driver] Successfully found attached GCE PD "restore-us-central1-d8ff-pg-data-pg-main-0-2938" at device path /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938.
INFO 2024-07-10T14:11:01.221567009Z [resource.labels.containerName: gce-pd-driver] NodePublishVolume check volume path /var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/a001f3bad7990d3afb447355ea6314f7da31e3609310dfee75555b3d2e9f0687/globalmount is mounted false: error <nil>
INFO 2024-07-10T14:11:01.221573429Z [resource.labels.containerName: gce-pd-driver] Attempting to determine if disk "/dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938])
INFO 2024-07-10T14:11:01.287565794Z [resource.labels.containerName: gce-pd-driver] Output: "DEVNAME=/dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938\nTYPE=ext4\n"
INFO 2024-07-10T14:11:01.287604043Z [resource.labels.containerName: gce-pd-driver] Checking for issues with fsck on disk: /dev/disk/by-id/google-restore-us-central1-d8ff-pg-data-pg-main-0-2938
INFO 2024-07-10T14:11:08.508376947Z [resource.labels.containerName: gce-pd-driver] `fsck` error fsck from util-linux 2.36.1
ERROR 2024-07-10T14:11:08.508442197Z [resource.labels.containerName: gce-pd-driver] /dev/sdb: recovering journal
ERROR 2024-07-10T14:11:08.508450037Z [resource.labels.containerName: gce-pd-driver] fsck: Warning... fsck.ext4 for device /dev/sdb exited with signal 9.


The VM kernel logs say

[single.png: VM kernel log screenshot showing the OOM killer terminating the fsck.ext4 process]


Sometimes, the OOM killer decides to kill not only the child fsck process but also the gce-pd-csi-driver process itself, which crashes the whole pdcsi-node pod:


[multiple.png: VM kernel log screenshot showing the OOM killer terminating both fsck.ext4 and the gce-pd-csi-driver process]


Could we raise the 50MB memory limit for the gce-pd-driver container? It really doesn't look like enough to fsck a very large filesystem. What do you think?
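In case it helps, here is a sketch of what raising it could look like on a self-managed deployment of the driver (200Mi is just an arbitrary illustration; on GKE, pdcsi-node is a managed component, so I suspect a manual patch like this would simply be reverted):

# Strategic merge patch: bump the memory limit of the gce-pd-driver container
kubectl -n kube-system patch daemonset pdcsi-node --type strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"gce-pd-driver","resources":{"limits":{"memory":"200Mi"}}}]}}}}'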


Same issue here, really annoying. Any workarounds? I am halfway through migrating a cluster, and the two biggest disks now just fail to be mounted at all, great. Guess I will try an fsck from a compute node.
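Roughly what I have in mind, in case it's useful to anyone (the VM name, zone and device name below are placeholders, and the disk must not be attached to a GKE node while you do this):

# Attach the PD backing the PVC to a throwaway VM
gcloud compute instances attach-disk fsck-helper-vm --disk=<pd-name> \
  --device-name=pvc-to-check --zone=us-central1-a
# On that VM, run the check with the VM's full memory available
sudo e2fsck -f /dev/disk/by-id/google-pvc-to-check
# Detach the disk so it can be attached back to the cluster
gcloud compute instances detach-disk fsck-helper-vm --disk=<pd-name> --zone=us-central1-a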

Same issue; I haven't found any reliable solution yet.

I opened a « bug report / help request » here 

@JordanP I found this thread as well, and it was somewhat helpful. In our case, we concluded that we could stop the pod from which we're snapshotting the volume. So, our process is to stop the pod, take the snapshot, start a new pod from the snapshot, and then restart the original pod.
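In shell terms it's roughly this (all names are placeholders for our Postgres StatefulSet and the PD behind its PVC):

# Stop the pod so the filesystem is quiesced
kubectl scale statefulset pg-main --replicas=0
# Snapshot the underlying PD while nothing is writing to it
gcloud compute disks snapshot <pd-name> --zone=us-central1-a --snapshot-names=pg-main-snap
# Bring the original pod back; the restore pod is created separately from the snapshot
kubectl scale statefulset pg-main --replicas=1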
