OpenShift Container Platform may kill a process in a container if the total memory usage of
all the processes in the container exceeds the memory limit, or in serious cases
of node memory exhaustion.
When a process is OOM killed, this may or may not result in the container
exiting immediately. If the container PID 1 process receives the SIGKILL, the
container will exit immediately. Otherwise, the container behavior is
dependent on the behavior of the other processes.
If the container does not exit immediately, an OOM kill is detectable as
follows:
-
A container process exited with code 137, indicating it received a SIGKILL
signal
-
The oom_kill counter in /sys/fs/cgroup/memory/memory.oom_control
is
incremented
$ grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
oom_kill 0
$ sed -e '' </dev/zero # provoke an OOM kill
Killed
$ echo $?
137
$ grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
oom_kill 1
If one or more processes in a pod are OOM killed, when the pod subsequently
exits, whether immediately or not, it will have phase Failed and reason
OOMKilled. An OOM killed pod may be restarted depending on the value of
restartPolicy
. If not restarted, controllers such as the
ReplicationController will notice the pod’s failed status and create a new pod
to replace the old one.
If not restarted, the pod status is as follows:
$ oc get pod test
NAME READY STATUS RESTARTS AGE
test 0/1 OOMKilled 0 1m
$ oc get pod test -o yaml
...
status:
containerStatuses:
- name: test
ready: false
restartCount: 0
state:
terminated:
exitCode: 137
reason: OOMKilled
phase: Failed
If restarted, its status is as follows:
$ oc get pod test
NAME READY STATUS RESTARTS AGE
test 1/1 Running 1 1m
$ oc get pod test -o yaml
...
status:
containerStatuses:
- name: test
ready: true
restartCount: 1
lastState:
terminated:
exitCode: 137
reason: OOMKilled
state:
running:
phase: Running