Managing Resources

This guide describes good practice concepts for managing resources.

Managing Resources

Note that pretty much all controllers consume:

  • CPU: largely based on the number of reconciliations they perform, which are generally related to event activity for resources they’re watching.
  • Memory: largely based on the number of primary resources that exist (multiplied by some factor based on the number of operand resources they need to watch as a result) via informer caches.

There is also the concern that a single Pod or Container could monopolize all available resources, and cluster admins must consider the effects that one Pod or Container may have on other components.

To prevent a container from consuming all the resources on a cluster or keeping other workloads from being scheduled, many production clusters define ResourceQuota configurations.
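
For illustration, a minimal ResourceQuota sketch is shown below; the namespace, name, and amounts are placeholder values that a cluster admin would adjust to their environment:

  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: compute-quota        # example name
    namespace: tenant-a        # example tenant namespace
  spec:
    hard:
      requests.cpu: "2"        # total CPU that can be requested in the namespace
      requests.memory: 2Gi     # total memory that can be requested in the namespace
      limits.cpu: "4"          # total CPU limit across all Pods in the namespace
      limits.memory: 4Gi       # total memory limit across all Pods in the namespace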

The ResourceQuota configuration also applies to tenant workloads that are managed by your Operator. Cluster administrators will typically set a ResourceQuota for each tenant’s namespace as part of onboarding. If a LimitRange with default values has not been created in each namespace and your Operator creates Containers inside the tenant namespace without specifying at least CPU and Memory resource requests for its Pods, then the system or quota may reject Pod creation. Check the following statements from the K8s docs:

  • “If a LimitRange is activated in a namespace for computing resources like CPU and memory, users must specify requests or limits for those values. Otherwise, the system may reject Pod creation.” (Reference).

  • “If quota is enabled in a namespace for compute resources like cpu and memory, users must specify requests or limits for those values; otherwise, the quota system may reject pod creation.” (Reference).

To support clusters with the above configuration, ensure safe operation, and avoid negatively impacting other workloads, Operators should always include reasonable memory and CPU resource requests for their own deployment as well as for the operands they deploy.

HINT Cluster admins might also be able to avoid the above scenario by setting default values that are applied to each Pod and/or Container in a namespace when none are specified.
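
As an illustration, a LimitRange sketch like the following (hypothetical name, namespace, and amounts) could provide such defaults for Containers that do not declare their own requests and limits:

  apiVersion: v1
  kind: LimitRange
  metadata:
    name: default-limits       # example name
    namespace: tenant-a        # example tenant namespace
  spec:
    limits:
      - type: Container
        defaultRequest:        # applied when a container does not set requests
          cpu: 100m
          memory: 128Mi
        default:               # applied when a container does not set limits
          cpu: 500m
          memory: 256Mi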

Therefore, your Operator should always apply at least CPU and memory resource requests to the Pods/Deployments it creates as part of its reconciliation logic. Ideally, your Operator also applies memory limits to those Pods/Deployments; you may also consider CPU limits.

Resource requests and limits for the Operator Deployment can be defined by modifying the config/manager/manager.yaml as shown below:

  ...
  # TODO(user): Configure the resources accordingly based on the project requirements.
  # More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
  resources:
    requests:
      cpu: 10m
      memory: 64Mi
    ...

How to compute default values

IMPORTANT: A single configuration that fits all scenarios is not possible. Therefore, Operator authors MUST ensure that Cluster Admins and their users can change the resource requests/limits of the Operator/manager and of its Operands.

However, you can benchmark your Operator by monitoring its resource usage to arrive at good and reasonable values for the general cases. Kubebuilder and SDK provide some metrics which can help you with this.

NOTE Also, be aware that if the project was generated by the Kubebuilder or SDK scaffold then some values for the Operator/manager (see config/manager/manager.yaml) are populated by default to get you started; however, you ought to optimize them based on your own tests and the specific needs of your operator.

How to change the Operator/manager resource values when under OLM management

If your operator is managed by OLM, administrators or users can configure your operator’s resource requests and limits via the subscription.
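
For example, a Subscription similar to the sketch below can carry a config section whose resources are applied to the operator’s Deployment; the names, channel, catalog source, and amounts are placeholders only:

  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: my-operator              # example subscription name
    namespace: operators           # example namespace
  spec:
    channel: stable                # example channel
    name: my-operator              # example package name
    source: operatorhubio-catalog  # example catalog source
    sourceNamespace: olm
    config:
      resources:                   # overrides the manager container resources
        requests:
          cpu: 10m
          memory: 64Mi
        limits:
          memory: 128Mi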

General guidelines

The following are some general recommendations for managing resources:

  • MUST declare resource requests for both CPU and Memory for the Operator/manager and any Pod/Deployment managed by it.
  • OPTIONALLY set resource limits for CPU and Memory for the Operator Pod and any Pod/Deployment managed by it.
  • SHOULD provide mechanisms for monitoring compute and memory resource usage so that Cluster Admins can use these metrics to monitor and resize the Operator and its Operands. CAVEAT: If the Operator is integrated with OLM and the bundle has a PodMonitor or a ServiceMonitor, the complete InstallPlan will fail on a cluster that does not have these CRDs/the Prometheus operator installed. In this case, you might want to ensure the dependency requirement with an OLM dependency or make this requirement clear to the Operator consumers.
  • SHOULD allow admins to customize the request/limit amounts defined for the Pods/Deployments created by the Operator and not hardcode these values.
  • SHOULD document how your Operator consumers can customize/rightsize the resource requests and limits for the Operator and Operand Pods/Deployments, or describe how the solution could be configured so that the Operator automatically adjusts these values to the environment rather than asking its consumers to amend them. You might also consider leveraging the Vertical Pod Autoscaler to have the resources requested by the Operator automatically adjusted to the cluster where it is deployed. You might also look at allowing horizontal pod autoscaling for the other Pods/Deployments created by your Operator; in this case, be aware that resource requests are also required for horizontal pod autoscaling to work (see the sketch after this list). CAVEAT: If you are using OpenShift as your Kubernetes distribution you might want to check the doc Automatically adjust pod resource levels with the vertical pod autoscaler. Also, if the VPA CRD is not available in the cluster where the operator gets deployed, the InstallPlan will fail. Please see how to work with OLM dependencies if your project integrates with it.
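
As an example of why requests matter for autoscaling, CPU-utilization targets in a HorizontalPodAutoscaler are expressed relative to the container requests. A minimal sketch for a hypothetical operand Deployment could look like this (names and thresholds are examples only):

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: my-operand               # example name
    namespace: tenant-a            # example namespace
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-operand             # example operand Deployment
    minReplicas: 1
    maxReplicas: 5
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80 # scale out when average usage exceeds 80% of the CPU request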

Why should you set these?

Resource Requests

What happens when the resource requests are not set?

  • configurations made by the cluster administrators such as ResourceQuota might not work without LimitRanges; the LimitRanger admission controller can provide default values for resource requests and limits when they have not been defined.
  • the Operator’s consumers might face resource shortages on a node when resource usage increases, for example, during a daily peak in request rate.
  • the Operator’s consumers might be unable to successfully deploy the Operator because the minimal resources are not available.
  • the scheduler cannot make an informed placement choice when it picks the nodes the operator pods will be running on.
  • when there is memory contention on the node, the pod is likely to get either evicted or OOM killed.
  • when there is CPU contention on the node, the pod is likely to get starved of CPU cycles, making the operator unresponsive.

Resource Limits

What happens when the resource limits are not set?

Wrong configurations or code implementations can consume all the resources available, affecting other components on the cluster. It might also leave the containers more vulnerable, for example to DoS attacks (more info). Note that when there is memory contention on a node, pods will start getting evicted and possibly killed according to their OOM score, and the node will be flagged with a MemoryPressure condition and eventually made unschedulable. When there is CPU contention on the node, neighbouring pods may get slowed down to the CPU they requested.

However, a popular practice among cluster administrators is to leverage ResourceQuota to limit the total amount of resources that can be requested or allowed in a single namespace. This may protect against overconsumption of resources by operands of a faulty or wrongly configured operator. On the other hand, it also means that the operator may not be able to create additional pods once the limit has been reached, limiting its functionality.

Also, you might want to check the related sections in the K8s docs.

What happens when:

Limits reached

What happens when the resource limits have been reached?

  • For Memory: the container might be terminated with the reason OOMKilled. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
  • For CPU: the container might or might not be allowed to exceed its CPU limit for extended periods of time; however, it will not be killed for excessive CPU usage. CPU is considered a “compressible” resource, and if the Pod starts hitting its CPU limits, the kernel starts throttling the container. That means the CPU will be artificially restricted, potentially resulting only in worse performance.

You might want to check the Troubleshooting section in the Kubernetes documentation to better understand how to debug these scenarios.

Limits are specified but not requests

What happens when the resource limits are defined but not the requests?

If you specify a CPU or Memory limit for a Container but do not specify a request, Kubernetes automatically assigns a CPU or Memory request that matches the limit. In this way, you will always be requesting the limit and will be allocating more resources than required. (NOT RECOMMENDED)
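
For example, in the container spec fragment below (hypothetical container name and image), only limits are declared, so Kubernetes assigns requests equal to those limits:

  ...
  containers:
    - name: example            # hypothetical container name
      image: example:latest    # hypothetical image
      resources:
        limits:
          cpu: 500m
          memory: 256Mi
        # no requests set: Kubernetes assigns requests equal to the limits above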

Values are too big

Memory and CPU requests and limits are associated with Containers, but be aware that the Memory and CPU requests and limits of a Pod are the sum of the requests and limits of each type across all the containers in the Pod.
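
For example, in the sketch below (hypothetical Pod, container names, and images), the effective Pod request is the sum of the two containers' requests:

  apiVersion: v1
  kind: Pod
  metadata:
    name: example-pod              # hypothetical name
  spec:
    containers:
      - name: app                  # hypothetical container
        image: example-app:latest
        resources:
          requests:
            cpu: 250m
            memory: 128Mi
      - name: sidecar              # hypothetical container
        image: example-sidecar:latest
        resources:
          requests:
            cpu: 250m
            memory: 128Mi
    # Effective Pod request: cpu: 500m, memory: 256Mi (sum of both containers)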

If you define Memory or CPU requests for your Pods that are too big, you will unnecessarily allocate and block the usage of more resources than you ought to. Also, your Operator consumers might be able to install your Operator via OLM, for example, but will find that the Pods/Deployments do not run successfully when the amount defined exceeds the capacity available. In these scenarios, the Operator consumers will see that the Pod(s) failed to schedule with event errors like Insufficient cpu and/or Insufficient memory.
