EvenPodsSpread discussion on tainted nodes #80921

Closed
Huang-Wei opened this issue Aug 2, 2019 · 20 comments
Labels
kind/feature - Categorizes issue or PR as related to a new feature.
lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@Huang-Wei
Member

Background

EvenPodsSpread is a feature that makes pod placement decisions based on the distribution of existing pods. The concept of maxSkew raises a question: how do we define the scope (i.e. which nodes) over which we count the matching pods and hence calculate the skew?
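For readers less familiar with the API shape, here is a minimal sketch in Go of an incoming pod spec carrying such a constraint. It uses the core/v1 types; the topology key, labels, and image are illustrative placeholders, not values from this issue.

package example

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// spreadAcrossZones sketches an incoming pod spec that asks to be spread
// evenly across zones with maxSkew=1; the concrete key and labels are
// placeholders for illustration only.
func spreadAcrossZones() v1.PodSpec {
    return v1.PodSpec{
        Containers: []v1.Container{{Name: "app", Image: "nginx"}},
        TopologySpreadConstraints: []v1.TopologySpreadConstraint{{
            MaxSkew:           1,
            TopologyKey:       "topology.kubernetes.io/zone",
            WhenUnsatisfiable: v1.DoNotSchedule,
            LabelSelector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"app": "foo"},
            },
        }},
    }
}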

Here are some implications implemented in the alpha:

  • Nodes with all spec.topologySpreadConstraints[*].topologyKey present are in the scope.
  • Nodes matching incomingPod's nodeSelector/nodeAffinity (if defined) are in the scope.
  • Other nodes are NOT in the scope.

The above implications have been discussed and agreed upon, so they are not likely to change in future releases. However, within that scope, there is one special case I want to bring up to get more feedback.

Core Question

Here is the core question: what if a node has taints applied? Should we count the matching pods on that node? (The alpha answer is yes.)

NOTE: to simplify the wording, in this post terms like "tainted nodes" and "applied taints" technically mean "the node is tainted AND the incoming pod doesn't have corresponding tolerations". If the incoming pod's tolerations match the taints, we of course consider that node.

Case 1 - spreading on zones

We have a cluster with 2 zones and 4 nodes, and NodeB is tainted. (P = Pod)

+-------------------+---------------+
|       Zone1       |     Zone2     |
+-------+-----------+-------+-------+
| NodeA |   NodeB   | NodeC | NodeD |
|       | (tainted) |       |       |
+-------+-----------+-------+-------+
|   P   |   P P P   |   P   |  P P  |
+-------+-----------+-------+-------+

Suppose an incoming pod wants to be scheduled evenly across zones with maxSkew=1. In the current alpha implementation, NodeB is still considered a valid node, so its 3 pods contribute to the matching number of Zone1, which sums up to 4. So the scheduling result is that the incoming pod goes to NodeC.

Sounds perfectly reasonable. But hold on and take a look at another case.

Case 2 - spreading on nodes

+-------------------+---------------+
|       Zone1       |     Zone2     |
+-------+-----------+-------+-------+
| NodeA |   NodeB   | NodeC | NodeD |
|       | (tainted) |       |       |
+-------+-----------+-------+-------+
|   P   |           |   P   |  P P  |
+-------+-----------+-------+-------+

In this case, an incoming pod wants to be scheduled evenly across nodes, with maxSkew=1. As NodeB has 0 pods, to make the distribution even we should put the incoming pod on NodeB, and it will be pending... (we've assumed that the incoming pod doesn't have corresponding tolerations defined). This is also based on the current alpha implementation.

What if we exclude tainted nodes?

Things are a little different. Here is the comparison:

+-------+-------------------------------+------------------------------------------+
|       | include tainted nodes (Alpha) | exclude tainted nodes (to be discussed)  |
+-------+-------------------------------+------------------------------------------+
| case1 | pod goes to NodeC             | pod goes to NodeA                        |
| case2 | pod goes to NodeB (pending)   | pod goes to NodeA or NodeC               |
+-------+-------------------------------+------------------------------------------+
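To make the comparison concrete, here is a toy Go sketch (not scheduler code) that simply tallies matching pods per zone with and without the tainted node; the numbers reproduce Case 1 from the table above.

package main

import "fmt"

// node is a toy model for this discussion: its zone, whether its taints are
// NOT tolerated by the incoming pod, and how many matching pods it hosts.
type node struct {
    name         string
    zone         string
    tainted      bool
    matchingPods int
}

// countsPerZone tallies matching pods per zone, optionally skipping tainted
// nodes, mirroring the two columns of the comparison above.
func countsPerZone(nodes []node, includeTainted bool) map[string]int {
    counts := map[string]int{}
    for _, n := range nodes {
        if n.tainted && !includeTainted {
            continue
        }
        counts[n.zone] += n.matchingPods
    }
    return counts
}

func main() {
    // Case 1 topology: NodeB (Zone1) is tainted and hosts 3 matching pods.
    case1 := []node{
        {"NodeA", "Zone1", false, 1},
        {"NodeB", "Zone1", true, 3},
        {"NodeC", "Zone2", false, 1},
        {"NodeD", "Zone2", false, 2},
    }
    fmt.Println(countsPerZone(case1, true))  // map[Zone1:4 Zone2:3] -> pod lands in Zone2 (NodeC)
    fmt.Println(countsPerZone(case1, false)) // map[Zone1:1 Zone2:3] -> pod lands in Zone1 (NodeA)
}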

Maybe more options

Maybe it's not a YES or NO answer. It's possible to provide more options to decide which kinds of taints should be included or excluded.

Here are the taints I can think of (the corresponding built-in taint keys are sketched after the list):

  1. The taint is applied by users (admin or operator)
  2. The taint is applied by Kubernetes
    1. Node becomes not-ready or unreachable. It can be a network glitch, CNI plugin error, or node reboot.
    2. Node becomes unschedulable. This is usually controlled by users via:
      1. kubectl cordon - node is marked as unschedulable, existing pods are kept
      2. kubectl drain - similar to cordon, but existing pods are evicted
    3. Node is under resource pressure.

Anything else?
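(For reference, category 2 above roughly corresponds to the well-known taint keys that Kubernetes itself applies. This is a sketch, not an exhaustive list; the constant names below are local to this illustration, only the key strings are the actual taint keys.)

// Well-known taint keys applied by Kubernetes itself, roughly matching
// the categories above. The constant names are local to this sketch.
const (
    // 2.1 node becomes not-ready or unreachable
    taintNodeNotReady    = "node.kubernetes.io/not-ready"
    taintNodeUnreachable = "node.kubernetes.io/unreachable"
    // 2.2 node becomes unschedulable (kubectl cordon / drain)
    taintNodeUnschedulable = "node.kubernetes.io/unschedulable"
    // 2.3 node is under resource pressure
    taintNodeMemoryPressure = "node.kubernetes.io/memory-pressure"
    taintNodeDiskPressure   = "node.kubernetes.io/disk-pressure"
    taintNodePIDPressure    = "node.kubernetes.io/pid-pressure"
)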

Things would be more complicated if the TaintNodesByCondition and TaintBasedEvictions features are disabled. In those cases the behavior is internally controlled by node conditions and no taints are applied. But these two features will eventually GA, and adopting taints/tolerations is the direction, so I guess we don't need to consider those combinations.

/sig scheduling
cc/ @bsalamat @k82cn @ahg-g @ravisantoshgudimetla @liu-cong @alculquicondor @draveness @krmayankk @resouer

@Huang-Wei Huang-Wei added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 2, 2019
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Aug 2, 2019
@tedyu
Contributor

tedyu commented Aug 2, 2019

For case 1, if we exclude tainted nodes, the incomingPod should go to NodeC - since zone 1 already has 4 pods. Putting incomingPod on NodeC would achieve better evenness.

@Huang-Wei
Member Author

Huang-Wei commented Aug 2, 2019

@tedyu Nope. "Exclude" means we don't count the matching pods on that node any more, so zone1 only has 1 pod, while zone2 has 3 pods.

@alculquicondor
Member

We should think of this more generally: do we want nodes filtered out by other filters to influence the skew calculation? I believe that every filter should be independent of each other. It is my understanding that this is the pattern that the new framework promotes too.

In this case, the simplest solution is to count all the pods, only considering the spreading selectors.

We should keep in mind that the most common use case for Even Pod Spreading is Deployments, in which all pods have the same characteristics (tolerations, affinity, etc.). This should fit within the independent-plugins paradigm easily.

@Huang-Wei
Member Author

Huang-Wei commented Aug 7, 2019

I believe that every filter should be independent of each other. It is my understanding that this is the pattern that the new framework promotes too.

In the scheduler's history, it has always been like this, until this feature :) In the alpha implementation, it respects nodeSelector and hard nodeAffinity.

In this case, the simplest solution is to count all the pods, only considering the spreading selectors.

This is the current behavior. But we're open to making changes if this doesn't fit the majority of use cases.

BTW: the "change" could be another option of UnsatisfiableConstraintAction:

const (
    // DoNotSchedule instructs the scheduler not to schedule the pod
    // when constraints are not satisfied.
    DoNotSchedule UnsatisfiableConstraintAction = "DoNotSchedule"
    // ScheduleAnyway instructs the scheduler to schedule the pod
    // even if constraints are not satisfied.
    ScheduleAnyway UnsatisfiableConstraintAction = "ScheduleAnyway"
)
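Purely for illustration, such an additional action could be sketched along the lines below; the name and semantics are made up for this discussion and are not a real or proposed API value.

const (
    // ScheduleAnywayOnToleratedNodes is a hypothetical action (name invented
    // for this discussion) that would exclude nodes whose taints the incoming
    // pod does not tolerate from the skew calculation.
    ScheduleAnywayOnToleratedNodes UnsatisfiableConstraintAction = "ScheduleAnywayOnToleratedNodes"
)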

@alculquicondor
Member

alculquicondor commented Aug 8, 2019

I missed the fact that when a node doesn't satisfy affinity or taints, it's likely to have 0 pods that match the spreading constraint labels. Then the minimum count would be 0 for that node, messing up the skew.

That said, simply excluding nodes that don't satisfy taints wouldn't solve the general issue. We are considering taints here, but potentially other filters could affect the skew (maybe PodFitsResources?).

Maybe we should solve this with predicates (and later plugins) order. Instead of considering all the pods, we only consider the pods that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodSpread, otherwise we run it after.

@krmayankk

How are tainted nodes currently implemented? I assume they are yet another predicate like PodSelectorMatches. I agree with this general comment from @alculquicondor:

Maybe we should solve this with predicates (and later plugins) order. Instead of considering all the pods, we only consider the pods that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodSpread, otherwise we run it after.

In case a tainted node has existing pods, then probably the right behavior is to consider them while spreading new pods

In case a tainted node has no existing pods (more likely case), then the right behavior is to ignore this node completely, as if it doesn't exist. So maybe we need to make this customizable.

Also, what is the point of considering a tainted node and eventually selecting it to place a pod which remains pending?

In summary, taints should be like predicates, which completely eliminates the nodes that have taints which are not being tolerated by incoming pods. The only case I find worth considering is when a tainted node has existing pods; then the user might be interested in counting them, but I don't want to optimize/code for this case yet.

@Huang-Wei
Member Author

How are tainted nodes currently implemented? I assume they are yet another predicate like PodSelectorMatches.

Yes.

In case a tainted node has existing pods, then probably the right behavior is to consider them while spreading new pods

This is the current behavior, no matter whether the tainted node has matching pods or not.

In case a tainted node has no existing pods (more likely case), then the right behavior is to ignore this node completely, as if it doesn't exist. So maybe we need to make this customizable.

I don't think this is the right way to go. If we want to customize some special case, it should be based on taints (esp. some "permanent" ones, like "unschedulable" caused by kubectl drain) instead of on the existing pods.

Also, what is the point of considering a tainted node and eventually selecting it to place a pod which remains pending?

It's because either the taint can be removed (automatically or manually), or the workload can have tolerations applied, so that the Pending state can be resolved.

In summary, taints should be like predicates, which completely eliminates the nodes that have taints which are not being tolerated by incoming pods.

That's already the case - we have a separate taint predicate and it functions in a standalone way. The core question I'm bringing up here is whether we should respect the result of the taint predicate within the EvenPodsSpread predicate, probably for some cases.

@Huang-Wei
Member Author

Maybe we should solve this with predicates (and later plugins) order. Instead of considering all the pods, we only consider the pods that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodSpread, otherwise we run it after.

@alculquicondor Today we run the (ordered) predicates across N nodes in parallel. So it's not practical to continuously communicate the result of the current predicate on the current node to the other N-1 nodes.

@liu-cong
Contributor

liu-cong commented Aug 8, 2019

Maybe we should solve this with predicates (and later plugins) order. Instead of considering all the pods, we only consider the pods that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodSpread, otherwise we run it after.

@alculquicondor Today we run the (ordered) predicates across N nodes in parallel. So it's not practical to continuously communicate the result of the current predicate on the current node to the other N-1 nodes.

@alculquicondor, I guess you meant "nodes" instead of "pods" since filters filter out nodes, not pods.
@Huang-Wei, why do we need to communicate the result on the current node to other nodes? When we reach the even pod spreading filter, my understanding is that it currently applies against all nodes. Can't we just change it to consider only the remaining nodes after previous filters?

Forgive me if my questions are dumb. I haven't followed the even pod spreading feature closely.

@alculquicondor
Member

Indeed, I meant nodes, sorry.
@Huang-Wei, there shouldn't be a need to communicate the results of the current predicate (EvenPodSpread in this case), only those of previous ones. I believe this information should be available in the cache.

@Huang-Wei
Member Author

@liu-cong That was just a hypothetical idea. We don't do that atm, and I don't think we will do that in the future :)

@alculquicondor I haven't checked the current filter plugin flow. In the current predicate flow, the predicate chain runs in parallel across nodes, so you have no chance to know the results of "previous" ones on other nodes. The term "previous" makes no sense here - some nodes may not even have started their predicate cycle yet.

One question: in the current filter plugin flow, are we going to run each filter plugin to completion on all nodes, and then move to the next? (I hope the answer is no.)

@alculquicondor
Member

You were right. My memory was vague. The filters run sequentially for each node, with all nodes processed in parallel.
Sounds like we can only use the metadata (today) or prefilter plugins (in the framework).

But, apart from implementation details, my opinion is that unsatisfied taints should take the node out of the calculations, regardless of the type of taint (for simplicity).

@Huang-Wei
Member Author

But, apart from implementation details, my opinion is that unsatisfied taints should take the node out of the calculations, regardless of the type of taint (for simplicity).

Yup, this is aligned with what we discussed in the sig meeting. Before making any changes, let's roll this feature out first and see what users think.

@krmayankk

Sorry I could not attend the meeting. So this means a node is taken out of consideration for EvenPodSpread if the incoming pod does not tolerate the taint on the node. How does this happen if all predicates are processed in parallel? Does EvenPodSpread also apply the taint predicate again?

In general I agree that, for simplicity, taking a node out of the calculations if it doesn't satisfy the taints is the right behavior. Although we should document that if the node still has existing pods, that could lead to unwanted behavior and unexpected spreading?

@Huang-Wei
Member Author

Sorry I could not attend the meeting. So this means a node is taken out of consideration for EvenPodSpread if the incoming pod does not tolerate the taint on the node.

Oh, it's just a possible solution. We won't roll it out until we see compelling use cases.

Does EvenPodSpread also apply the taint predicate again?

Not at the moment. Pods on a tainted node are considered as matching candidates during the EvenPodsSpread predicate/priority.

that if the node still has existing pods, that could lead to unwanted behavior and unexpected spreading?

It can. And things are trickier when combined with the "NoExecute" taint - which will apply tolerationSeconds=300 to each existing pod and evict them after 300 seconds.

So this issue is for (early-phase) brainstorming; we won't come to a conclusion until we get enough data points.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 10, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 10, 2019
@alculquicondor
Member

@Huang-Wei how do you think we can gather the feedback?

It looks like our conclusion was to skip the nodes whose taints the pod doesn't tolerate.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
