GKE Autopilot Pod Stuck Terminating

mbh

Hello,

A few weeks back, one of our workloads running on a GKE Autopilot cluster failed. It coincided with a cluster upgrade (google.container.internal.ClusterManagerInternal.UpdateClusterInternal events were observed around the time the workload failed). The main pod hosting our workload was stuck in the 'Terminating' state with the following error:
 
Type : Warning 
Reason : FailedKillPod
Age : 63s (x646 over 5h35m)
From : kubelet
Message : error killing pod: failed to "KillPodSandbox" for "## REDACTED ##" with
KillPodSandboxError: "rpc error: code = Unknown
desc = failed to destroy network for sandbox \"## REDACTED ##\": plugin type=\"cilium-cni\"
failed (delete): unable to connect to Cilium daemon:
failed to create cilium agent client after 30.000000 seconds timeout:
Get \"http://localhost/v1/config\": dial unix /var/run/cilium/cilium.sock:
connect: connection refused\nIs the agent running?"
 
Force killing the pod kicked things back into life and a new pod was created, presumably on a new upgraded node.
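For anyone hitting the same thing, the force delete is the standard kubectl escape hatch (pod name and namespace below are placeholders):

# Removes the pod object immediately, without waiting for the kubelet to confirm cleanup
kubectl delete pod POD_NAME -n NAMESPACE --grace-period=0 --force

Note that this only deletes the API object; whatever sandbox state the CNI failed to clean up may linger on the node until it is upgraded or recycled.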
 
My specific questions are:
1. Does anyone have any thoughts on why this occurred? The error message suggests a race condition in which the low-level Cilium daemon was terminated before all the user-level pods had finished terminating.
2. Does anyone know how I can prevent something like this from happening again in the future?
 
Thanks 

 

 


Hi @mbh,

It seems your Cilium pod had an intermittent connectivity issue between your node and the pod, since once you killed the pod forcefully, things returned to normal.
To avoid this problem in the future, you can monitor the health of the Cilium pods and their DaemonSet in your kube-system namespace. If there is an error, check the pod logs for troubleshooting instead of killing the pod; you can also troubleshoot Cilium using its CLIs, as sketched below.
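A minimal sketch of that kind of check, assuming GKE Dataplane V2, where the Cilium-based agent runs as the anetd DaemonSet in kube-system (the DaemonSet name may differ in other setups):

# Confirm the CNI agent DaemonSet has a ready pod on every node
kubectl get daemonset -n kube-system
kubectl get pods -n kube-system -o wide | grep anetd

# Inspect the agent pod on the affected node; --previous shows logs from a crashed container
kubectl describe pod -n kube-system ANETD_POD_NAME
kubectl logs -n kube-system ANETD_POD_NAME --previous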

 

Regarding the suggestion above to monitor the Cilium DaemonSet and troubleshoot the agent with its CLIs: by design, users have no access to nodes in GKE Autopilot, so that would only be possible in GKE Standard.

I suspect this doesn't have to be a Dataplane V2 (i.e., Cilium) issue and could instead be related to higher system pod usage. It may be that 1.28.9-gke.1209000+ addresses this, but I really can't provide more detail here.

If you have a support package, please consider raising a support case.

As a general best practice, to reduce the impact of a single pod failure, it is recommended to run a Deployment with multiple replicas and to configure a PodDisruptionBudget (see the sketch below). I realize this is fairly general advice.
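A minimal sketch of that setup, assuming a hypothetical Deployment named my-app with the label app=my-app (both placeholders):

# Run multiple replicas so a single stuck pod is not an outage
kubectl scale deployment my-app --replicas=3

# Keep at least 2 pods up through voluntary disruptions such as node upgrades
kubectl create poddisruptionbudget my-app-pdb --selector=app=my-app --min-available=2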

 

 

@Arekkusu - thanks for your response and suggestions. The workload in question is deployed via a StatefulSet, due to a requirement for stable persistent storage. The use case can happily tolerate short service interruptions, and running replicated pods is probably overkill for what we are trying to achieve, but I appreciate that it would likely have helped in this situation.

The workload had been stable for ~18 months prior to this, so hopefully this is an isolated incident, and fingers crossed the update you reference will help prevent future occurrences.

 

Thank you for pointing that out, @Arekkusu; I missed the information about GKE Autopilot. On GKE Standard, I would still prefer to get into the COS node through the serial port and check the logs for what caused the Cilium agent failure inside the node. Wouldn't that give more insight into this issue and help anticipate it?
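For reference, on GKE Standard the node's serial console log can be pulled with gcloud (instance name and zone are placeholders):

# Dump the VM's serial console output, including COS kernel and system messages
gcloud compute instances get-serial-port-output NODE_NAME --zone=ZONE

# Or attach interactively (requires serial port access to be enabled on the VM)
gcloud compute connect-to-serial-port NODE_NAME --zone=ZONE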
