Advanced Scheduling and Taints and Tolerations - Scheduling | Cluster Administration | OpenShift Container Platform Branch Build

Overview
Taints and Tolerations
- Using Multiple Taints
Adding a Taint to an Existing Node
Adding a Toleration to a Pod
- Using Toleration Seconds to Delay Pod Evictions
Preventing Pod Eviction for Node Problems
Daemonsets and Tolerations
Examples

Overview

Taints and tolerations allow the node to control which pods should (or should not) be scheduled on them.

Taints and Tolerations

A taint allows a node to refuse pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.

Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.

Parameter Description

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule

New pods that do not match the taint are not scheduled onto that node.
Existing pods on the node remain.

PreferNoSchedule

New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.
Existing pods on the node remain.

NoExecute

New pods that do not match the taint cannot be scheduled onto that node.
Existing pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave a blank value parameter, which matches any.

A toleration matches a taint:

If the operator parameter is set to Equal:
- the key parameters are the same;
- the value parameters are the same;
- the effect parameters are the same.
If the operator parameter is set to Exists:
- the key parameters are the same;
- the effect parameters are the same.

Using Multiple Taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:

Process the taints for which the pod has a matching toleration.
The remaining unmatched taints have the indicated effects on the pod:
- If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule a pod onto that node.
- If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.
- If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node (if it is already running on the node), or the pod is not scheduled onto the node (if it is not yet running on the node).
  - Pods that do not tolerate the taint are evicted immediately.
  - Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.
  - Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

The node has the following taints:

$ oc adm taint nodes node1 key1=value1:NoSchedule
$ oc adm taint nodes node1 key1=value1:NoExecute
$ oc adm taint nodes node1 key2=value2:NoSchedule

The pod has the following tolerations:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"

In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.

Adding a Taint to an Existing Node

You add a taint to a node using the oc adm taint command with the parameters described in the Taint and toleration components table:

$ oc adm taint nodes <node-name> <key>=<value>:<effect>

For example:

$ oc adm taint nodes node1 key1=value1:NoSchedule

The example places a taint on node1 that has key key1, value value1, and taint effect NoSchedule.

Adding a Toleration to a Pod

To add a toleration to a pod, edit the pod specification to include a tolerations section:

Sample pod configuration file with Equal operator

tolerations:
- key: "key1" (1)
  operator: "Equal" (1)
  value: "value1" (1)
  effect: "NoExecute" (1)
  tolerationSeconds: 3600 (2)

1	The toleration parameters, as described in the Taint and toleration components table.
2	The `tolerationSeconds` parameter specifies how long a pod can remain bound to a node before being evicted. See Using Toleration Seconds to Delay Pod Evictions below.

Sample pod configuration file with Exists operator

tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 3600

Both of these tolerations match the taint created by the oc adm taint command above. A pod with either toleration would be able to schedule onto node1.

Using Toleration Seconds to Delay Pod Evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the pod specification. If a taint with the NoExecute effect is added to a node, any pods that do not tolerate the taint are evicted immediately (pods that do tolerate the taint are not evicted). However, if a pod that to be evicted has the tolerationSeconds parameter, the pod is not evicted until that time period expires.

For example:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

Here, if this pod is running but does not have a matching taint, the pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the pod is not evicted.

Setting a Default Value for Toleration Seconds

This plug-in sets the default forgiveness toleration for pods, to tolerate the node.alpha.kubernetes.io/notReady:NoExecute and node.alpha.kubernetes.io/notReady:NoExecute taints for five minutes.

If the pod configuration provided by the user already has either toleration, the default is not added.

To enable Default Toleration Seconds:

Modify the master configuration file (/etc/origin/master/master-config.yaml) to Add DefaultTolerationSeconds to the admissionConfig section:

admissionConfig:
  pluginConfig:
    DefaultTolerationSeconds:
      configuration:
        kind: DefaultAdmissionConfig
        apiVersion: v1
        disable: false

Restart OpenShift for the changes to take effect:

# systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers

Verify that the default was added:

Create a pod:

$ oc create -f </path/to/file>

For example:

$ oc create -f hello-pod.yaml
pod "hello-pod" created

Check the pod tolerations:

$ oc describe pod <pod-name> |grep -i toleration

For example:

$ oc describe pod hello-pod |grep -i toleration
Tolerations:    node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s

Preventing Pod Eviction for Node Problems

OpenShift Container Platform can be configured to represent node unreachable and node not ready conditions as taints. This allows per-pod specification of how long to remain bound to a node that becomes unreachable or not ready, rather than using the default of five minutes.

When the Taint Based Evictions feature is enabled, the taints are automatically added by the node controller and the normal logic for evicting pods from Ready nodes is disabled.

If a node enters a not ready state, the node.alpha.kubernetes.io/notReady:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.
If a node enters a not reachable state, the node.alpha.kubernetes.io/unreachable:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.

To enable Taint Based Evictions:

Modify the master configuration file (/etc/origin/master/master-config.yaml) to add the following to the kubernetesMasterConfig section:
```
kubernetesMasterConfig:
   controllerArguments:
        feature-gates:
        - "TaintBasedEvictions=true"
```

Check that the taint is added to a node:

oc describe node $node | grep -i taint

Taints: node.alpha.kubernetes.io/notReady:NoExecute

Restart OpenShift for the changes to take effect:

# systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers

Add a toleration to pods:

tolerations:
- key: "node.alpha.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000

tolerations:
- key: "node.alpha.kubernetes.io/notReady"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000

To maintain the existing rate limiting behavior of pod evictions due to node problems, the system adds the taints in a rate-limited way. This prevents massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

Daemonsets and Tolerations

DaemonSet pods are created with NoExecute tolerations for node.alpha.kubernetes.io/unreachable and node.alpha.kubernetes.io/notReady with no tolerationSeconds to ensure that DaemonSet pods are never evicted due to these problems, even when the Default Toleration Seconds feature is disabled.

Examples

Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that should not be running on a node. A few of typical scenrios are:

Dedicating a node for a user
Binding a user to a node
Dedicating nodes with special hardware

Dedicating a Node for a User

You can specify a set of nodes for exclusive use by a particular set of users.

To specify dedicated nodes:

Add a taint to those nodes:

For example:

$ oc adm taint nodes node1 dedicated=groupName:NoSchedule

Add a corresponding toleration to the pods by writing a custom admission controller.

Only the pods with the tolerations are allowed to use the dedicated nodes.

Binding a User to a Node

You can configure a node so that particular users can use only the dedicated nodes.

To configure a node so that users can use only that node:

Add a taint to those nodes:

For example:

$ oc adm taint nodes node1 dedicated=groupName:NoSchedule

Add a corresponding toleration to the pods by writing a custom admission controller.

The admission controller should add a node affinity to require that the pods can only schedule onto nodes labeled with the key:value label (dedicated=groupName).
Add a label similar to the taint (such as the key:value label) to the dedicated nodes.

Nodes with Special Hardware

In a cluster where a small subset of nodes have specialized hardware (for example GPUs), you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

To ensure pods are blocked from the specialized hardware:

Taint the nodes that have the specialized hardware using one of the following commands:

$ oc adm taint nodes <node-name> disktype=ssd:NoSchedule
$ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule

Adding a corresponding toleration to pods that use the special hardware using an admission controller.

For example, the admission controller could use some characteristic(s) of the pod to determine that the pod should be allowed to use the special nodes by adding a toleration.

To ensure pods can only use the specialized hardware, you need some additional mechanism. For example, you could label the nodes that have the special hardware and use node affinity on the pods that need the hardware.