Memory pressure shouldn't hard evict static pods #38322
Is there any reason any static pods are best effort? Shouldn't, ideally, all static pods be guaranteed? Rather than including a "scheduler.alpha.kubernetes.io/critical-pod" annotation, I'd rather modify the pod resource requests/limits to make them guaranteed. |
Two possible solutions (of others, I'm sure) and why I'd prefer guaranteed:

**Make static pods 'special'.** Having static pods skip all admittance checks is implemented in #38795. I don't like this because of the special-casing, the security implications, and because I think there's probably a more correct solution. I think this solution will give static pods a chance to thrash if the node's in a messy state.

**Make static pods guaranteed.** I think that making all static pods guaranteed might be a better solution. It results in a lower total node-allocatable for users, but in reality users shouldn't have allocated those resources anyway, because the static pods needed them. The hardest part of this solution is picking the request/limit numbers for each static pod correctly, but honestly we need to do that anyway. |
I think the issue to discuss big-picture solutions is #22212. The suggested one-liner to fix this can go right after https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/eviction_manager.go#L110:

```go
if notBestEffort || criticalPod {
	return lifecycle.PodAdmitResult{Admit: true}
}
```

where `criticalPod` is set based on the annotation. |
@bprashanth I'm not really happy with such an annotation though. The essentially one-line change in #38795 is now "don't apply memory-admit failures to static pods", but I'd only see that as a temporary workaround until a proper solution to this. |
Fixes kubernetes#38322. This assumes that all static pods are important to the point that eviction denials do not apply to them. |
Yeah, I'm fine doing that too. It's basically the recommended fix, but looking at the source instead of an annotation. I think we'll need the annotation though. To recap various other bugs, there are actually multiple issues here:
Your PR fixes 1. We still need something like an annotation to fix 2 in the short term (obviously fixable with preemption priority). We could use the existing podSource again (but that means something like fluentd running in burstable would also get a low oom score, which may not be the best tradeoff, because we know that thing is a pig). 3 involves treating the file source differently from the apiserver source in many more places. Right now, failing an admission check marks a pod Failed, and the kubelet assumes that some other controller will recreate the pod and the scheduler will re-assign it to a healthy node. The problem, of course, is that the kubelet is the controller for static pods, so it really needs to continuously reconcile static pods that have failed admission checks and process them as new creates instead of terminated pods. |
@bprashanth I like your idea of using the scheduler.alpha.kubernetes.io/critical-pod annotation here, because:
@euank I don't like treating static pods as special, because only the kubelet knows those pods are file-based. There is no API field indicating it, so very likely we would run into the issues described above in 3) and 4). |
+1, using the critical pod annotation is a good idea and (I think) takes pressure off of us to design a full solution for #22212 right away. |
Just to be clear, though -- the "critical pod" annotation was never intended for this purpose. It is intended to indicate which pods rescheduler should work to get scheduled. Rescheduler only operates on pods that go through the normal default scheduler -- but static pods (and DaemonSet pods) never go through the scheduler. So, it may be confusing to use it here, and instead maybe we should use a different annotation. Reusing the annotation here won't break anything but it seems likely to cause confusion. Can you use a new annotation? |
I had suggested a new annotation above (note the "node" prefix). Dawn's argument for reusing the existing one is that it future-proofs the pod template (e.g. moved to a deployment or daemonset, it'll still work). I'm currently going to keep it the same. |
Thanks for the explanation @dchen1107; having a temporary annotation as a stop-gap for a proper solution makes more sense to me now. |
The issue can also happen to other pods, like fluentd or other logging agents. We should apply a similar "fix" as in #38836 for fluentd. @crassirostris can you please patch the fluentd configuration to be treated as critical? I think it should be patched into 1.5 together with #38836. |
@fgrzadkowski Sure |
Automatic merge from submit-queue (batch tested with PRs 39114, 36004)

Assign -998 as the oom_score_adj for critical pods (e.g. kube-proxy). I also validated this with a testing cluster: fresh built cluster, then kill the kube-proxy pod, etc.

```
root 2660 2643 0 Dec21 ? 00:00:00 /bin/sh -c kube-proxy --master=https://104.198.79.64 --kubeconfig=/var/lib/kube-proxy/kubeconfig --cluster-cidr=10.180.0.0/14 --resource-container="" --v=4 1>>/var/log/kube-proxy.log 2>&1
root 2667 2660 0 Dec21 ? 00:03:14 kube-proxy --master=https://104.198.79.64 --kubeconfig=/var/lib/kube-proxy/kubeconfig --cluster-cidr=10.180.0.0/14 --resource-container= --v=4
# cat /proc/2660/oom_score_adj
-998
# cat /proc/2667/oom_score_adj
-998
```

In this PR, I also include a small fix for an import-cycle issue. The right fix would remove the dependency on the qos package from pkg/apis/componentconfig/v1alpha1, but since we plan to cherry-pick this PR to both 1.5 and (possibly) 1.4, I want to touch the source as little as possible. Partial fix: #38322
@davidopp |
To the best of my knowledge nobody has this goal on their 1.6 critical list. @mikedanese or ??? is there a state dump to be had?

> On Wed, Jan 11, 2017, Vish Kannan wrote: Requires that they all (including kube-proxy) run as daemonsets so that things like resources are modifiable. kube-proxy needs to scale with the size of the cluster. @thockin can we prioritize porting kube-proxy to a daemonset? Have we scoped out the work involved? Is anyone working on it? As a stop-gap solution, maybe we can keep updating the static pod manifest using a daemonset or something.
|
I didn't like the short-term solution initially, and tried to prioritize the proper fix through #22212 for more than a year. But we just have too much to carry as a team, and some things failed to be properly prioritized. In the end, I accepted the short-term fixes for two reasons:
Meanwhile, I did some risk analysis on the three PRs being cherry-picked into the release branch a couple of weeks ago, through a private email thread. I've copied and pasted it here. From dawnchen@ to the customers: The cherry-pick PR(s) include the following fixes:
One suggestion from me: please do not use the critical-pod annotation for your own pods; leave it only for kube-system pods for now.
The suggestion I have for this is: please do not introduce new static pods to your node. Obviously all of the above are short-term workarounds, and the long-term solution should be addressed through issue #22212. Issue #39124 includes two issues: 1) one small part of issue #22212, and 2) a disk management issue. We have prioritized both as P0 for Q1. Hope I answered your questions and concerns here. Please let me know if you have more issues. |
@thockin @davidopp @dchen1107 @erictune @janetkuo and I discussed this issue further today. Following is the summary.

The primary problems we addressed are as follows:

1. Static system pods (kube-proxy & fluentd on certain deployments) get evicted and are not scheduled back onto nodes.
2. Static system pods are not "guaranteed" to run on a node and can fail feasibility checks.
3. The existing alpha API for specifying pod priority via pod-level annotations causes security issues.

In the short term:

1. Continue using pod-level annotations for priority.
2. Feature-gate the "critical pod" annotation across the system, tie it to a configurable namespace (or just to kube-system), and disable it by default. GKE will enable this feature.
3. Static pods can be marked as "critical".
4. The kubelet will not evict "critical" pods and will instead restart them.
5. The kubelet will guarantee that static pod manifests are processed before API pods. This is to ensure that static pods can fit on a node.
6. A static pod or a daemonset that gets updated might not be accepted by the node due to resource constraints, even if it is critical.

We decided to have only two levels of priority (Critical & QoS) instead of three (Static, Critical, and QoS) for evictions. This is to align with the long-term plan. This short-term solution will solve problems 1 & 3 mentioned above.

Now for the long term:

1. A preemption & priority scheme will be designed that will allow us to deprecate the "critical" pod-level annotation. @davidopp is working on this. Solves problem 3. ETA: design in v1.6, alpha (or beta?) in v1.7.
2. As part of this design, the kubelet's role in preemptions will be finalized, based on which static pods can be admitted onto a fully committed node. Solves problem 2. ETA: TBD.
3. Critical static pods will use this new priority scheme to guarantee their availability. Solves problem 1. ETA: TBD (based on availability of the feature).
4. [Closely related] Existing static pods (kube-proxy & fluentd) will switch to using DaemonSets once high-priority daemonset pods are "guaranteed" to run on a node. As of now, DaemonSet pods bypass the scheduler and can get "rejected" by the kubelet.

@derekwaynecarr @liggitt @smarterclayton does this satisfy your needs? |
4 is acceptable - we prohibit regular users from creating daemon sets. |
@vishh -- for the near-term plan, does a critical pod get a higher oom_score, or do we just let it be restarted as normal? For 1.6, I assume we are OK with critical pods being constrained in their memory usage relative to their QoS tier? |
@derekwaynecarr I was thinking of not making "Critical" a new QoS class, but instead use it just for preemption and eviction decisions. So oom_score will not be impacted. A Critical pod can still get OOM killed by the kernel, but will not be evicted. I think that is a good design because even critical pods can have memory leaks. |
@vishh Thanks for summarizing the discussion. @smarterclayton @derekwaynecarr @liggitt #38322 (comment) is aligned with the node team roadmap: https://docs.google.com/a/google.com/spreadsheets/d/1-hADEbGEUrW04QP4bVk7xf1CWuqbFU6ulMAH1jqZQrU/edit?usp=sharing (row 25). This is P0 for both the sig-scheduling and sig-node teams. |
that addresses my security concerns, thanks |
Can you clarify what "feature gating" means here? Does it mean critical pod annotation cannot be added when the feature is disabled, or that it can always be added but will be ignored for the purposes of the eviction policy when the feature is disabled? The second is fine but the first is not because rescheduler uses it. I assume you mean the second but I just want to be sure. |
@davidopp -- I assume the feature gate is not restricting the presence of the annotation, but is instead just a kubelet flag that says whether the annotation's presence affects node-local decisions. The annotation is free to be used by the rescheduler, assuming the rescheduler is enabled (which also appears to default to false). |
@davidopp -- ignore my previous comment, I see it's on by default. BTW, I see the rescheduler also restricts the annotation to only being meaningful in a single namespace (which is good). Anyway, I assume it's the second option in your original comment. |
@derekwaynecarr would having the critical pod annotation turned on by default be OK for OpenShift? |
@vishh alpha features should not be on by default. |
@vishh - to elaborate: in code, the default must be off. In kube-up deployments, the default can change, and I have no opinion there. |
Ack! That's what I was thinking. I think @davidopp is concerned because the critical pods annotation for rescheduling has already been exposed to users by default. |
Right -- it's fine to disable the eviction behavior of the annotation by default. Just don't prevent people from setting the annotation by default, since it is used by a completely unrelated subsystem (the "rescheduler"). |
I was not proposing to "prevent" users from using the pod annotation; instead I was recommending "disabling" any special behavior across Kubernetes for that pod annotation. The rescheduler will ignore this annotation if the feature is "disabled", which it will be by default. I assume consuming the pod annotation for rescheduling use cases is also considered problematic.
|
Yes, the annotation being settable by normal users is the problem.
|
So we are saying that it's OK if upgrading to 1.6 breaks this feature for everyone who is currently relying on it to get their DNS, Heapster, etc. scheduled today on GCE, AWS, Azure, on-prem, etc., unless they take the additional action of re-enabling it? |
cc/ @piosz @crassirostris |
It sounds like it would be OK if the behavior is flag-controllable, off by default in head, but enabled on GKE and in kube-up.sh. Not a great long-term solution, but it could bridge the gap until another solution is ready. |
I am confused here. I thought we had a design meeting and reached agreement on the following action items:
Are we changing the timeline or reverting some of the decisions here? |
Dawn, the part you're missing (I think) is that we should also make the critical annotation disabled or disableable in rescheduler by default. |
I think there are a lot of people who run Kubernetes without GKE or kube-up.sh. But I guess we can say the rescheduler is alpha, so it's OK if upgrading disables it... (The documentation doesn't say anything about it being alpha or beta or GA.) |
Let's move this discussion to #40573 |
If a node runs out of memory, we kill something. Then if there is memory pressure, we don't admit best effort (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/eviction_manager.go#L110). Maybe we can admit static-pods, even if they're best effort?
@kubernetes/sig-node if we can't differentiate static from non-static, maybe we can use the `scheduler.alpha.kubernetes.io/critical-pod` annotation?