EvenPodsSpread discussion on tainted nodes #80921
Comments
For case 1, if we exclude tainted nodes, the incoming pod should go to NodeC, since zone 1 already has 4 pods. Putting the incoming pod on NodeC would achieve better evenness.
@tedyu Nope. "Exclude" means we no longer count the matching pods on that node, so zone1 only has 1 pod, while zone2 has 3 pods.
We should think of this more generally: do we want nodes filtered out (by other filters) to influence the skew calculation? I believe that every filter should be independent of the others. It is my understanding that this is the pattern the new framework promotes too. In this case, the simplest solution is to count all the pods, considering only the spreading selectors. We should keep in mind that the most common use case for Even Pod Spreading is Deployments, in …
In the scheduler's history, it has always been like this, until this feature :) In the alpha implementation, it respects nodeSelector and hard nodeAffinity.
This is the current behavior. But we're open to making changes if this doesn't fit the majority usage. BTW: the "change" could be another option in the API (see kubernetes/staging/src/k8s.io/api/core/v1/types.go, lines 3041 to 3048 at a8d7139).
I missed the fact that when a node doesn't satisfy affinity or taints, it's likely to have 0 pods that match the spreading constraint labels. Then the minimum count would be 0 for that node, messing up the skew. That said, simply excluding nodes that don't satisfy taints wouldn't solve the general issue. We are considering taints here, but potentially other filters could affect the skew (maybe PodFitsResources?). Maybe we should solve this with predicate (and later plugin) ordering. Instead of considering all the pods, we only consider the pods on nodes that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodsSpread; otherwise we run it after.
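The ordering idea above can be sketched as follows. This is a toy illustration, not the real scheduler framework API: `filter` and `survivors` are hypothetical names, and the point is only that a spreading step placed later in the chain would compute skew over the nodes that survived the earlier filters.

```go
package main

import "fmt"

// filter reports whether a node passes; this signature is a hypothetical
// stand-in for a scheduler predicate, not the real framework interface.
type filter func(node string) bool

// survivors applies filters in order and returns the nodes that passed
// every filter so far. A spreading filter placed after the taint filter
// would then count matching pods only on these surviving nodes.
func survivors(nodes []string, filters []filter) []string {
	out := nodes
	for _, f := range filters {
		var next []string
		for _, n := range out {
			if f(n) {
				next = append(next, n)
			}
		}
		out = next
	}
	return out
}

func main() {
	nodes := []string{"NodeA", "NodeB", "NodeC", "NodeD"}
	// Suppose NodeB is tainted and the incoming pod has no toleration.
	taintFilter := func(n string) bool { return n != "NodeB" }
	fmt.Println(survivors(nodes, []filter{taintFilter})) // [NodeA NodeC NodeD]
}
```

As the later comments note, the real predicates run per node in parallel, so this sequential picture does not map directly onto the current implementation.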
How are tainted nodes currently implemented? I assume they are yet another predicate like …
In case a tainted node has existing pods, then probably the right behavior is to consider them while spreading new pods. In case a tainted node has no existing pods (the more likely case), the right behavior is to ignore the node completely, as if it doesn't exist. So maybe we need this to be customizable. Also, what is the point of considering a tainted node and eventually selecting it to place a pod that then remains Pending? In summary, taints should act like predicates, which completely eliminate nodes whose taints are not tolerated by the incoming pod. The only case worth considering is when a tainted node has existing pods; the user might be interested in counting them, but I don't want to optimize/code for this case yet.
Yes.
This is the current behavior, regardless of whether the tainted node has matching pods or not.
I don't think this is the right way to go. If we want to customize some special case, it should be based on taints (esp. some "permanent" ones like "unschedulable" caused by …
It's because either the taint can be taken off (automatically or manually), or the workload can be given tolerations, so that the Pending state can be resolved.
It's always been the case that we have a separate taint predicate, and it functions in a standalone way. The core question I'm bringing up here is whether we should respect the result of the taint predicate within the EvenPodsSpread predicate, probably for some cases.
@alculquicondor Today we run the (ordered) predicates on N nodes in parallel. So it's not practical to continuously communicate the result of the current predicate on the current node to the other N-1 nodes.
@alculquicondor, I guess you meant "nodes" instead of "pods", since filters filter out nodes, not pods. Forgive me if my questions are dumb; I didn't follow the even pod spreading feature closely.
Indeed, I meant nodes, sorry.
@liu-cong That was just a hypothetical idea. We don't do that atm, and I don't think we will in the future :) @alculquicondor I haven't checked the current filter plugins flow. In the current predicates flow, the predicate chain runs in parallel across nodes, so you have no chance to know the results of "previous" predicates on other nodes. The term "previous" makes no sense here; some nodes may not even have started their predicates cycle yet. One question: in the current filter plugins flow, are we going to run each filter plugin to completion on all nodes, and then go to the next? (I hope the answer is no)
You were right; my memory was vague. The filters run sequentially for each node, with nodes processed in parallel. But, implementation details apart, my opinion is that unsatisfied taints should take the node out of the calculations, regardless of the type of taint (for simplicity).
Yup, this is aligned with what we discussed in the sig-meeting. Before making any changes, let's roll this feature out first and see what users think.
Sorry I could not attend the meeting. So this means a node is taken out of consideration for EvenPodsSpread if the incoming pod does not tolerate the taint on the node. How does this happen if all predicates are processed in parallel? Does EvenPodsSpread also apply the taint predicate again? In general I agree that, for simplicity, taking a node out of the calculations when it doesn't satisfy the taints is the right behavior. Although we should document that, if the node still has existing pods, that could lead to unwanted behavior and unexpected spreading?
Oh, it's just a possible solution. We won't roll it out until we see compelling use cases.
Not at the moment. Pods on tainted nodes are considered as matching candidates during the EvenPodsSpread predicate/priority.
It can. And things get trickier when combined with the "NoExecute" taint, which applies tolerationSeconds=300 to each existing pod and evicts them after 300 seconds. So this issue is for (early-phase) brainstorming; we won't come to a conclusion until we get enough data points.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@Huang-Wei how do you think we can gather the feedback? It looks like our conclusion was to skip the nodes whose taints the pod does not tolerate.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Background
EvenPodsSpread is a feature to make pod placement decisions based on the distribution of existing pods. The concept of maxSkew brings a question: how do we define the scope (i.e., which nodes) over which to count matching pods so as to calculate the skew? Here are some implications implemented in the alpha:

- Only nodes with the spec.topologySpreadConstraints[*].topologyKey label present are in the scope.
- Only nodes matching the incoming pod's nodeSelector/nodeAffinity (if defined) are in the scope.

The above implications have been discussed and agreed on, so they are not likely to change in future releases. However, within the scope, there is one special case I want to bring up to get more feedback.
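For context, a spreading constraint is declared on the pod spec. A minimal sketch of the manifest shape (the zone label key, pod name, labels, and image below are illustrative, and the feature requires the EvenPodsSpread feature gate in the alpha):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod          # illustrative name
  labels:
    app: demo             # must match the labelSelector below to be counted
spec:
  topologySpreadConstraints:
  - maxSkew: 1                              # max allowed pod-count difference between domains
    topologyKey: topology.kubernetes.io/zone  # nodes without this label are out of scope
    whenUnsatisfiable: DoNotSchedule        # hard constraint (predicate, not just priority)
    labelSelector:
      matchLabels:
        app: demo
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9        # illustrative image
```

Nodes carrying the topologyKey label define the domains (here, zones) over which matching pods are counted.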
Core Question
Here is the core question: what if a node has taints applied? Should we count the matching pods on that node? (The alpha answer is yes.)
NOTE: to simplify the wording, in this post terms like "tainted nodes" and "applied taints" technically mean "the node is tainted AND the incoming pod doesn't have corresponding tolerations". If the incoming pod's tolerations match the taints, we consider that node for sure.
Case 1 - spreading on zones
We have a cluster with 2 zones and 4 nodes, and NodeB is tainted. (P = Pod)
Suppose an incoming pod wants to be scheduled evenly across zones with maxSkew=1. In the current alpha implementation, NodeB is still considered a valid node, so its 3 pods contribute to the final matching number in Zone1, which sums to 4. So the scheduling result is that the incoming pod goes to NodeC.
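The arithmetic can be made concrete with a tiny sketch. The per-zone totals (4 and 3, dropping to 1 and 3 when NodeB's pods are excluded) come from this discussion; the `skew` helper is just an illustration of "max count minus min count", not the scheduler's actual code.

```go
package main

import "fmt"

// skew returns max(count) - min(count) across topology domains,
// the quantity that maxSkew constrains.
func skew(counts map[string]int) int {
	first := true
	var min, max int
	for _, c := range counts {
		if first {
			min, max = c, c
			first = false
			continue
		}
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	return max - min
}

func main() {
	// Case 1 totals: zone1 = 4 matching pods (NodeA: 1, tainted NodeB: 3),
	// zone2 = 3 (NodeC + NodeD combined).
	withTainted := map[string]int{"zone1": 4, "zone2": 3}
	// Excluding tainted NodeB's pods leaves zone1 with only 1 matching pod.
	withoutTainted := map[string]int{"zone1": 1, "zone2": 3}

	fmt.Println(skew(withTainted))    // 1: zone2 looks lighter, pod goes to NodeC
	fmt.Println(skew(withoutTainted)) // 2: zone1 now looks lighter instead
}
```

So whether NodeB's pods are counted flips which zone the constraint steers the incoming pod toward.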
Sounds perfectly reasonable. But hold on, let's take a look at another case.
Case 2: spreading on nodes
In this case, an incoming pod wants to be scheduled evenly across nodes, with maxSkew=1. As NodeB has 0 pods, to make it even we should put the incoming pod on NodeB, and it would then be Pending... (we've assumed the incoming pod doesn't have corresponding tolerations defined). This is also based on the current alpha implementation.
What if we exclude tainted nodes?
Things are a little different. Here is the comparison:
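A sketch of the node-level case, to make the comparison concrete. The per-node counts here are assumed for illustration (NodeA: 1, tainted NodeB: 0, and an assumed NodeC: 1 / NodeD: 2 split of zone2's pods); `pickNode` is a toy stand-in for preferring the least-loaded node, not the real priority function.

```go
package main

import "fmt"

// pickNode returns the node with the fewest matching pods, breaking
// ties by name, mimicking the "fill the emptiest node" preference
// that node-level spreading expresses.
func pickNode(counts map[string]int) string {
	best := ""
	for n, c := range counts {
		if best == "" || c < counts[best] || (c == counts[best] && n < best) {
			best = n
		}
	}
	return best
}

func main() {
	// Assumed per-node matching-pod counts; NodeB is tainted and empty.
	counts := map[string]int{"NodeA": 1, "NodeB": 0, "NodeC": 1, "NodeD": 2}
	fmt.Println(pickNode(counts)) // NodeB: evenness favors the empty node, but the pod would stay Pending

	delete(counts, "NodeB") // exclude the tainted node from the calculation
	fmt.Println(pickNode(counts)) // NodeA: a schedulable node wins instead
}
```

Excluding the tainted node fixes the Pending outcome in Case 2, at the cost of the changed Case 1 behavior shown earlier.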
Maybe more options
Maybe it's not a YES-or-NO answer. It's possible to provide more options to decide which kinds of taints are included/excluded.
Here are the taints I can think of:

- the unschedulable taint applied by cordon
- NoExecute taints, where existing pods are evicted

Anything else?
Things will be more complicated if the TaintNodesByCondition and TaintBasedEvictions features are disabled. In those cases this is internally controlled by node conditions and no taints are applied. But these two features will eventually GA, and adopting taints/tolerations is the direction, so I guess we don't need to consider those combinations.

/sig scheduling
cc/ @bsalamat @k82cn @ahg-g @ravisantoshgudimetla @liu-cong @alculquicondor @draveness @krmayankk @resouer