Pod has a restartable init container and the SidecarContainers feature is disabled

I tried the sidecar container feature on 2/12 on a RAPID release channel cluster and it was working as expected. When I tried reproducing it today on all 3 k8s versions in the RAPID channel (1.29.1-gke.1016000, 1.29.1-gke.1425000 & 1.29.0-gke.1381000), I got this error:

18 Pod has a restartable init container and the SidecarContainers feature is disabled, 1 node(s) had untolerated taint {cloud.google.com/gke-quick-remove: true}

Has this been disabled starting 2/13?
I want to use the SidecarContainers feature in the REGULAR release channel for our prod cluster since 1.29.0-gke.1381000 is now available in that channel.

The command kubectl get --raw /metrics | grep kubernetes_feature_enabled does show:
kubernetes_feature_enabled{name="SidecarContainers",stage="BETA"} 1
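
For context, our workloads declare the sidecar as an init container with restartPolicy: Always, roughly along these lines (names and images are illustrative, not our exact manifest):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar            # illustrative name
spec:
  initContainers:
    - name: log-forwarder           # the sidecar
      image: busybox:1.36
      command: ["sh", "-c", "touch /var/log/app.log && tail -F /var/log/app.log"]
      restartPolicy: Always         # marks this init container as a restartable sidecar
      volumeMounts:
        - name: logs
          mountPath: /var/log
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "while true; do date >> /var/log/app.log; sleep 5; done"]
      volumeMounts:
        - name: logs
          mountPath: /var/log
  volumes:
    - name: logs
      emptyDir: {}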

Very strange behavior since it was working yesterday. Please advise.
Thanks




Hi @rrajvanshi1947 

Welcome to Google Cloud Community!

Please note that the Rapid channel gives you the latest available features and capabilities, but it is subject to known issues. Consider re-creating the cluster with a newer version or contacting the support team for further investigation.

I hope this information is helpful.

Hi,

If you have already created that support ticket, please let me know so I don't repeat questions that may already have been asked there.

I just tried to repro manually and can see that sidecar containers work. The version I tried is `1.29.1-gke.1589017` and the sidecar example I tried is "kubectl apply -f https://raw.githubusercontent.com/kubernetes/website/main/content/en/examples/application/deployment...".

The message you see is coming from the scheduler: https://github.com/kubernetes/kubernetes/blob/d194e6d06c4f1004cebe00f2c539a564f77276ec/pkg/scheduler... This indicates either an older version of k8s (likely 1.28) or that the feature gate was somehow disabled. There are no conditions we have internally that disable the feature.

This message may appear if you use some private scheduler as well. Do you use any private schedulers or webhooks?

Also a few more things to verify:

- Both the control plane and the nodes were upgraded to 1.29+

- There are no webhooks or custom schedulers interfering with the default scheduler (a couple of quick checks are sketched below this list)

- Any mutating webhook installed (including third-party tools) recognizes the 1.28+ API types, which include the restartPolicy field on init containers
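
For example, a couple of generic checks (nothing here is GKE-specific; adjust names to your cluster):

kubectl get nodes -o wide                      # kubelet version on every node should be v1.29+
kubectl version                                # server (control plane) version should be v1.29+
kubectl get mutatingwebhookconfigurations      # anything listed here can rewrite Pod specs at admission
kubectl get validatingwebhookconfigurations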

Also any additional information you can provide on how to reproduce this would help.

/Sergey

Thanks for the detailed response @SergeyKanzhelev. We tried the feature in the REGULAR channel this week and it works as expected. Maybe this was a transient issue.

"This message may appear if you use some private scheduler as well. Do you use any private schedulers or webhooks?"

Not sure about these, but we were using an Autopilot private cluster from https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/modules/b...

Thank you for confirming; please let us know if there are any more issues with this feature.

It doesn't seem to work in GKE 1.29.4-gke.1043002. kubectl get --raw /metrics shows that the feature is enabled. However, restartPolicy: Always is being ignored.
This example doesn't seem to work either: https://raw.githubusercontent.com/kubernetes/website/main/content/en/examples/application/deployment...

Tested with 2 clusters, both running 1.29.4-gke.1043002

Hi @mdzigurski ,

Thank you for the report. When you say that the restartPolicy is ignored, what do you see? Does the Pod get stuck in the initialization stage forever?

Also, did you check that both the cluster and node versions are 1.29+?

@SergeyKanzhelev All nodes were on the same 1.29.4-gke.1043002 version. Correct, the pod is stuck in the initialization stage forever. The sidecar container stays in status Running with Ready false forever, and the main container keeps waiting for the sidecar container to become ready. Initially, I had a readiness probe defined on the sidecar container together with restartPolicy: Always. The GKE UI showed an error saying the readiness probe can only be defined on init containers with restartPolicy: Always, even though I had set it, which is why I said the restartPolicy setting was being ignored. One of the clusters had a Berglas mutation webhook, but the other one did not; that made no difference, it seems.
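
For reference, the init container spec I was testing looked roughly like this (names, image, and port are illustrative, not the exact manifest):

  initContainers:
    - name: proxy-sidecar
      image: example.com/proxy:latest
      restartPolicy: Always          # set explicitly, but seemingly ignored by the cluster
      readinessProbe:                # only allowed on init containers with restartPolicy: Always
        httpGet:
          path: /healthz
          port: 15020
        initialDelaySeconds: 2
        periodSeconds: 5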

I will be testing more on my side. Can you confirm a few things:

1. It doesn't work for you even with the https://raw.githubusercontent.com/kubernetes/website/main/content/en/examples/application/deployment...

2. When you look in the Cloud console and check the Pod's YAML, does it show restartPolicy: Always on the init container there? If you do kubectl describe pod (with a fresh version of kubectl), does it show this field? A quick way to check is sketched right after this list.

3. Do you have any other webhooks installed in your cluster that might have updated the Pod spec by stripping this field? Maybe Istio or something like that? That is not uncommon when the webhook is compiled against an old k8s API.
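
For example (the pod name is hypothetical), a quick way to check whether the field survived admission:

kubectl get pod my-app-pod -o jsonpath='{.spec.initContainers[*].restartPolicy}'
# prints "Always" if the field is present; empty output means something removed it before the Pod was persisted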

I just confirmed I was able to get a sidecar working on that exact version of GKE cluster. I cannot rule out that some GKE setting could break sidecars, but based on all the work and usage we have, I wouldn't consider that a likely issue. There might be some third-party tooling you use that is incompatible with sidecars. I would be interested to learn what it is so we can think of ways to improve the experience.

@SergeyKanzhelev Thanks for providing the steps to debug further and confirming that it worked for you. You were right about a webhook stripping the field. I discovered that both clusters had the Berglas webhook installed. After disabling it, the example you provided worked fine, and the Berglas logs confirmed it. Looking further into the Berglas source code, I saw it was compiled against the 0.27.2 API [ref], which, as you mentioned, is where the problem could be. I will look into rebuilding it with a more recent version of the K8s API. The Berglas log below shows the restartPolicy field being removed by the mutating webhook:

 

time="2024-06-14T05:10:39Z" level=debug msg="Webhook mutating review finished with: '[{\"op\":\"remove\",\"path\":\"/spec/initContainers/0/restartPolicy\"}]' JSON Patch" dry-run=false kind=v1/Pod name= ns=staging-west op=create path=/ trace-id= webhhok-type=mutating webhook-id=berglasSecrets webhook-kind=mutating wh-version=v1