EvenPodsSpread discussion on tainted nodes #80921

Closed
Huang-Wei opened this issue Aug 2, 2019 · 20 comments
Labels
kind/feature - Categorizes issue or PR as related to a new feature.
lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@Huang-Wei
Member

Background

EvenPodsSpread is a feature that makes pod placement decisions based on the distribution of existing pods. The concept of maxSkew raises a question: how do we define the scope (i.e. which nodes) over which we count the matching pods and hence calculate the skew?
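For readers less familiar with the API shape, here is a minimal sketch in Go of an incoming pod spec carrying such a constraint. It uses the core/v1 types; the topology key, labels, and image are illustrative placeholders, not values from this issue.

package example

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// spreadAcrossZones sketches an incoming pod spec that asks to be spread
// evenly across zones with maxSkew=1; the concrete key and labels are
// placeholders for illustration only.
func spreadAcrossZones() v1.PodSpec {
    return v1.PodSpec{
        Containers: []v1.Container{{Name: "app", Image: "nginx"}},
        TopologySpreadConstraints: []v1.TopologySpreadConstraint{{
            MaxSkew:           1,
            TopologyKey:       "topology.kubernetes.io/zone",
            WhenUnsatisfiable: v1.DoNotSchedule,
            LabelSelector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"app": "foo"},
            },
        }},
    }
}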

Here are some implications implemented in the alpha:

  • Nodes with all spec.topologySpreadConstraints[*].topologyKey present are in the scope.
  • Nodes matching incomingPod's nodeSelector/nodeAffinity (if defined) are in the scope.
  • Other nodes are NOT in the scope.

The above implications have been discussed and agreed upon, so they are not likely to change in future releases. However, within that scope, there is one special case I want to bring up to get more feedback.

Core Question

Here is the core question: what if a node has taints applied? Should we count the matching pods on that node? (The alpha answer is yes.)

NOTE: to simplify the wording, in this post terms like "tainted nodes" and "applied taints" technically mean "the node is tainted AND the incoming pod doesn't have corresponding tolerations". If the incoming pod's tolerations match the taints, we of course consider that node.

Case 1 - spreading on zones

We have a cluster with 2 zones and 4 nodes, and NodeB is tainted. (P = Pod)

+-------------------+---------------+
|       Zone1       |     Zone2     |
+-------+-----------+-------+-------+
| NodeA |   NodeB   | NodeC | NodeD |
|       | (tainted) |       |       |
+-------+-----------+-------+-------+
|   P   |   P P P   |   P   |  P P  |
+-------+-----------+-------+-------+

Suppose an incoming pod wants to be scheduled evenly across zones with maxSkew=1. In the current alpha implementation, NodeB is still considered a valid node, so its 3 pods contribute to the matching number of Zone1, which sums up to 4. So the scheduling result is that the incoming pod goes to NodeC.

Sounds perfectly reasonable. But hold on and take a look at another case.

Case 2 - spreading on nodes

+-------------------+---------------+
|       Zone1       |     Zone2     |
+-------+-----------+-------+-------+
| NodeA |   NodeB   | NodeC | NodeD |
|       | (tainted) |       |       |
+-------+-----------+-------+-------+
|   P   |           |   P   |  P P  |
+-------+-----------+-------+-------+

In this case, an incoming pod wants to be scheduled evenly across nodes, with maxSkew=1. As NodeB has 0 pods, to make the distribution even we should put the incoming pod on NodeB, and it will be pending... (we've assumed that the incoming pod doesn't have corresponding tolerations defined). This is also based on the current alpha implementation.

What if we exclude tainted nodes?

Things are a little different. Here is the comparison:

+-------+-------------------------------+------------------------------------------+
|       | include tainted nodes (Alpha) | exclude tainted nodes (to be discussed)  |
+-------+-------------------------------+------------------------------------------+
| case1 | pod goes to NodeC             | pod goes to NodeA                        |
| case2 | pod goes to NodeB (pending)   | pod goes to NodeA or NodeC               |
+-------+-------------------------------+------------------------------------------+
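To make the comparison concrete, here is a toy Go sketch (not scheduler code) that simply tallies matching pods per zone with and without the tainted node; the numbers reproduce Case 1 from the table above.

package main

import "fmt"

// node is a toy model for this discussion: its zone, whether its taints are
// NOT tolerated by the incoming pod, and how many matching pods it hosts.
type node struct {
    name         string
    zone         string
    tainted      bool
    matchingPods int
}

// countsPerZone tallies matching pods per zone, optionally skipping tainted
// nodes, mirroring the two columns of the comparison above.
func countsPerZone(nodes []node, includeTainted bool) map[string]int {
    counts := map[string]int{}
    for _, n := range nodes {
        if n.tainted && !includeTainted {
            continue
        }
        counts[n.zone] += n.matchingPods
    }
    return counts
}

func main() {
    // Case 1 topology: NodeB (Zone1) is tainted and hosts 3 matching pods.
    case1 := []node{
        {"NodeA", "Zone1", false, 1},
        {"NodeB", "Zone1", true, 3},
        {"NodeC", "Zone2", false, 1},
        {"NodeD", "Zone2", false, 2},
    }
    fmt.Println(countsPerZone(case1, true))  // map[Zone1:4 Zone2:3] -> pod lands in Zone2 (NodeC)
    fmt.Println(countsPerZone(case1, false)) // map[Zone1:1 Zone2:3] -> pod lands in Zone1 (NodeA)
}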

Maybe more options

Maybe it's not a YES or NO answer. It's possible to provide more options to decide which kinds of taints should be included or excluded.

Here are the taints I can think of (the corresponding built-in taint keys are sketched after the list):

  1. The taint is applied by users (admin or operator)
  2. The taint is applied by Kubernetes
    1. Node becomes not-ready or unreachable. It can be a network glitch, CNI plugin error, or node reboot.
    2. Node becomes unschedulable. This is usually controlled by users via:
      1. kubectl cordon - node is marked as unschedulable, existing pods are kept
      2. kubectl drain - similar to cordon, but existing pods are evicted
    3. Node is under resource pressure.

Anything else?
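(For reference, category 2 above roughly corresponds to the well-known taint keys that Kubernetes itself applies. This is a sketch, not an exhaustive list; the constant names below are local to this illustration, only the key strings are the actual taint keys.)

// Well-known taint keys applied by Kubernetes itself, roughly matching
// the categories above. The constant names are local to this sketch.
const (
    // 2.1 node becomes not-ready or unreachable
    taintNodeNotReady    = "node.kubernetes.io/not-ready"
    taintNodeUnreachable = "node.kubernetes.io/unreachable"
    // 2.2 node becomes unschedulable (kubectl cordon / drain)
    taintNodeUnschedulable = "node.kubernetes.io/unschedulable"
    // 2.3 node is under resource pressure
    taintNodeMemoryPressure = "node.kubernetes.io/memory-pressure"
    taintNodeDiskPressure   = "node.kubernetes.io/disk-pressure"
    taintNodePIDPressure    = "node.kubernetes.io/pid-pressure"
)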

Things would be more complicated if the TaintNodesByCondition and TaintBasedEvictions features are disabled. In those cases the behavior is internally controlled by node conditions and no taints are applied. But these two features will eventually GA, and adopting taints/tolerations is the direction, so I guess we don't need to consider those combinations.

/sig scheduling
cc/ @bsalamat @k82cn @ahg-g @ravisantoshgudimetla @liu-cong @alculquicondor @draveness @krmayankk @resouer

@Huang-Wei Huang-Wei added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 2, 2019
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Aug 2, 2019
@tedyu
Contributor

tedyu commented Aug 2, 2019

For case 1, if we exclude tainted nodes, the incomingPod should go to NodeC - since zone 1 already has 4 pods. Putting incomingPod on NodeC would achieve better evenness.

@Huang-Wei
Member Author

Huang-Wei commented Aug 2, 2019

@tedyu Nope. "Exclude" means we don't count the matching pods on that node any more, so zone1 only has 1 pod, while zone2 has 3 pods.

@alculquicondor
Member

We should think of this more generally: do we want nodes filtered out by other filters to influence the skew calculation? I believe that every filter should be independent of each other. It is my understanding that this is the pattern that the new framework promotes too.

In this case, the simplest solution is to count all the pods, only considering the spreading selectors.

We should keep in mind that the most common use case for Even Pod Spreading is Deployments, in which all pods have the same characteristics (tolerations, affinity, etc.). This should fit within the independent-plugins paradigm easily.

@Huang-Wei
Member Author

Huang-Wei commented Aug 7, 2019

I believe that every filter should be independent of each other. It is my understanding that this is the pattern that the new framework promotes too.

In the scheduler's history, it has always been like this, until this feature :) In the alpha implementation, it respects nodeSelector and hard nodeAffinity.

In this case, the simplest solution is to count all the pods, only considering the spreading selectors.

This is the current behavior. But we're open to making changes if this doesn't fit the majority of use cases.

BTW: the "change" could be another option of UnsatisfiableConstraintAction:

const (
    // DoNotSchedule instructs the scheduler not to schedule the pod
    // when constraints are not satisfied.
    DoNotSchedule UnsatisfiableConstraintAction = "DoNotSchedule"
    // ScheduleAnyway instructs the scheduler to schedule the pod
    // even if constraints are not satisfied.
    ScheduleAnyway UnsatisfiableConstraintAction = "ScheduleAnyway"
)
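Purely for illustration, such an additional action could be sketched along the lines below; the name and semantics are made up for this discussion and are not a real or proposed API value.

const (
    // ScheduleAnywayOnToleratedNodes is a hypothetical action (name invented
    // for this discussion) that would exclude nodes whose taints the incoming
    // pod does not tolerate from the skew calculation.
    ScheduleAnywayOnToleratedNodes UnsatisfiableConstraintAction = "ScheduleAnywayOnToleratedNodes"
)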

@alculquicondor
Member

alculquicondor commented Aug 8, 2019

I missed the fact that when a node doesn't satisfy affinity or taints, it's likely to have 0 pods that match the spreading constraint labels. Then the minimum count would be 0 for that node, messing up the skew.

That said, simply excluding nodes that don't satisfy taints wouldn't solve the general issue. We are considering taints here, but potentially other filters could affect the skew (maybe PodFitsResources?).

Maybe we should solve this with predicates (and later plugins) order. Instead of considering all the pods, we only consider the pods that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodSpread, otherwise we run it after.

@krmayankk

How are tainted nodes currently implemented? I assume they are yet another predicate like PodSelectorMatches. I agree with this general comment from @alculquicondor:

Maybe we should solve this with predicates (and later plugins) order. Instead of considering all the pods, we only consider the pods that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodSpread, otherwise we run it after.

In case a tainted node has existing pods, then probably the right behavior is to consider them while spreading new pods

In case a tainted node has no existing pods (more likely case), then the right behavior is to ignore this node completely, as if it doesn't exist. So maybe we need to make this customizable.

Also, what is the point of considering a tainted node and eventually selecting it to place a pod which remains pending?

In summary, taints should be like predicates, which completely eliminates the nodes that have taints which are not being tolerated by incoming pods. The only case I find worth considering is when a tainted node has existing pods; then the user might be interested in counting them, but I don't want to optimize/code for this case yet.

@Huang-Wei
Member Author

How are tainted nodes currently implemented? I assume they are yet another predicate like PodSelectorMatches.

Yes.

In case a tainted node has existing pods, then probably the right behavior is to consider them while spreading new pods

This is the current behavior, no matter whether the tainted node has matching pods or not.

In case a tainted node has no existing pods (more likely case), then the right behavior is to ignore this node completely, as if it doesn't exist. So maybe we need to make this customizable.

I don't think this is the right way to go. If we want to customize some special case, it should be based on taints (esp. some "permanent" ones, like "unschedulable" caused by kubectl drain) instead of on the existing pods.

Also, what is the point of considering a tainted node and eventually selecting it to place a pod which remains pending?

It's because either the taint can be removed (automatically or manually), or the workload can have tolerations applied, so that the Pending state can be resolved.

In summary, taints should be like predicates, which completely eliminates the nodes that have taints which are not being tolerated by incoming pods.

That's already the case - we have a separate taint predicate and it functions in a standalone way. The core question I'm bringing up here is whether we should respect the result of the taint predicate within the EvenPodsSpread predicate, probably for some cases.

@Huang-Wei
Member Author

Maybe we should solve this with predicates (and later plugins) order. Instead of considering all the pods, we only consider the pods that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodSpread, otherwise we run it after.

@alculquicondor Today we run the (ordered) predicates across N nodes in parallel. So it's not practical to continuously communicate the result of the current predicate on the current node to the other N-1 nodes.

@liu-cong
Contributor

liu-cong commented Aug 8, 2019

Maybe we should solve this with predicates (and later plugins) order. Instead of considering all the pods, we only consider the pods that passed the previous filters. So, if we decide that a filter should affect the skew, we run it before EvenPodSpread, otherwise we run it after.

@alculquicondor Today we run the (ordered) predicates across N nodes in parallel. So it's not practical to continuously communicate the result of the current predicate on the current node to the other N-1 nodes.

@alculquicondor, I guess you meant "nodes" instead of "pods" since filters filter out nodes, not pods.
@Huang-Wei, why do we need to communicate the result on the current node to other nodes? When we reach the even pod spreading filter, my understanding is that it currently applies against all nodes. Can't we just change it to consider only the remaining nodes after previous filters?

Forgive me if my questions are dumb. I haven't followed the even pod spreading feature closely.

@alculquicondor
Member

Indeed, I meant nodes, sorry.
@Huang-Wei, there shouldn't be a need to communicate the results of the current predicate (EvenPodSpread in this case), only those of previous ones. I believe this information should be available in the cache.

@Huang-Wei
Member Author

@liu-cong That was just a hypothetical idea. We don't do that atm, and I don't think we will do that in the future :)

@alculquicondor I haven't checked the current filter plugin flow. In the current predicate flow, the predicate chain runs in parallel across nodes, so you have no chance to know the results of "previous" ones on other nodes. The term "previous" makes no sense here - some nodes may not even have started their predicate cycle yet.

One question: in the current filter plugin flow, are we going to run each filter plugin to completion on all nodes, and then move to the next? (I hope the answer is no.)

@alculquicondor
Member

You were right. My memory was vague. The filters run sequentially for each node, with all nodes processed in parallel.
Sounds like we can only use the metadata (today) or prefilter plugins (in the framework).

But, apart from implementation details, my opinion is that unsatisfied taints should take the node out of the calculations, regardless of the type of taint (for simplicity).

@Huang-Wei
Member Author

But, apart from implementation details, my opinion is that unsatisfied taints should take the node out of the calculations, regardless of the type of taint (for simplicity).

Yup, this is aligned with what we discussed in the sig meeting. Before making any changes, let's roll this feature out first and see what users think.

@krmayankk

Sorry I could not attend the meeting. So this means a node is taken out of consideration for EvenPodSpread if the incoming pod does not tolerate the taint on the node. How does this happen if all predicates are processed in parallel? Does EvenPodSpread also apply the taint predicate again?

In general I agree that, for simplicity, taking a node out of the calculations if it doesn't satisfy the taints is the right behavior. Although we should document that if the node still has existing pods, that could lead to unwanted behavior and unexpected spreading?

@Huang-Wei
Member Author

Sorry I could not attend the meeting. So this means a node is taken out of consideration for EvenPodSpread if the incoming pod does not tolerate the taint on the node.

Oh, it's just a possible solution. We won't roll it out until we see compelling use cases.

Does EvenPodSpread also apply the taint predicate again?

Not at the moment. Pods on a tainted node are considered as matching candidates during the EvenPodsSpread predicate/priority.

that if the node still has existing pods, that could lead to unwanted behavior and unexpected spreading?

It can. And things are trickier when combined with the "NoExecute" taint - which will apply tolerationSeconds=300 to each existing pod and evict them after 300 seconds.

So this issue is for (early-phase) brainstorming; we won't come to a conclusion until we get enough data points.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 10, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 10, 2019
@alculquicondor
Member

@Huang-Wei how do you think we can gather the feedback?

It looks like our conclusion was to skip the nodes whose taints the pod doesn't tolerate.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
