
Kubernetes has become popular for many organizations running enterprise and production-grade containerized applications. It allows users to efficiently run and scale containers in production based on certain criteria that engineers need to configure.
While K8s offers a great deal of flexibility, incorrect or inadequate configurations can sometimes lead to degraded performance and hamper reliability if not kept in check.
This guide will walk you through some of the most common challenges and misconfigurations that impact application performance and lead to reliability issues while running Kubernetes at scale. We’ll also provide some best practices on how to tackle each challenge.
Kubernetes Misconfiguration #1: Misconfigured Resource Requests and Limits
Containers scheduled to run in Kubernetes request CPU and memory from the cluster. Based on this request, the cluster determines on which node the pod should be scheduled and how much of each resource needs to be allocated for a given pod.
Resource request: The minimum CPU and memory resources a running container requires. Kubernetes guarantees that pods will get the requested amount of resources.
Resource limit: The maximum CPU and memory resources a pod is able to use. This prevents the container from using more resources than specified, minimizing the impact on other containers running on the same node.
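For reference, here is a minimal sketch of how requests and limits are declared on a container. The deployment name, image, and values are illustrative assumptions, not recommendations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api               # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
      - name: api
        image: nginx              # placeholder image
        resources:
          requests:
            cpu: 250m             # minimum CPU guaranteed to the container
            memory: 256Mi         # minimum memory guaranteed to the container
          limits:
            cpu: 500m             # hard ceiling; CPU usage above this is throttled
            memory: 512Mi         # hard ceiling; exceeding it gets the container OOM-killed

The scheduler uses the requests to pick a node with enough free capacity, while the limits cap what the container can actually consume at runtime.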
How Can Resource Requests and Limits Be Misconfigured?
Under-Provisioning Request
If resource requests are configured lower than necessary, Kubernetes can end up scheduling pods on a node with fewer available resources than they actually need. The pods will then not be able to handle their normal load, resulting in performance degradation as they compete for CPU and memory.
Example: A pod that needs 500m CPU is configured to request 200m and is thus scheduled on a node where only 200m CPU is available. This will lead to CPU throttling and increased latency.
Overprovisioning Request
When resource requests are set higher than required, they can lead to inefficient CPU and memory utilization. Nodes will appear full, preventing the scheduler from adding additional pods on that node. This will result in underutilization of resources and increased costs.
Example: A pod that needs 500m CPU to run has a request for 2 CPUs. This wastes 1.5 CPUs that could have been allocated to other pods.
Underprovisioning Limit
Setting resource limits lower than required can lead to the application being CPU-throttled or, in the case of memory, to Kubernetes killing the pod (OOMKilled). This will lead to application instability and degraded performance.
Example: A pod that requires 2 CPUs during peak load is set to a limit of 1 CPU. Kubernetes will start throttling the pod, leading to increased response times and potential timeouts.
Overprovisioning Limit
Similarly, overprovisioning resource limits can allow a single pod to consume excessive CPU and memory, starving other pods on the same node of resources.
Example: A pod with a limit of 4 CPUs running on a node with 4 CPUs can consume the entire node under load, leaving other pods to be throttled or evicted.
Best Practices for Provisioning Resource Requests and Limits
- Continuously monitor pod resource utilization and make informed decisions by adjusting the requests and limits based on application workload.
- Set alerts based on thresholds and take necessary actions when limits are reached.
- Use the Horizontal Pod Autoscaler for pods to automatically scale depending on CPU utilization, memory consumption, and other metrics.
- Consider custom metrics exported from a metrics server for autoscaling pods as well.
Here’s an example of a simple HPA definition; it assumes you have an Nginx deployment named nginx-deployment running on your cluster, and the autoscaling is based on the CPU utilization of its pods:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
As the load on the pods increases, their CPU utilization will also rise, and the HPA will ensure more replicas are started to handle the increased load.
Kubernetes Misconfiguration #2: Improper Pod Affinity and Anti-Affinity Rules
Pod affinity and anti-affinity are clever mechanisms in Kubernetes that help the scheduler determine on which nodes pods should be scheduled based on the labels of other pods and nodes. They help Kubernetes maintain a healthy cluster by optimizing resource utilization and increasing performance and fault tolerance.
Pod affinity defines that a pod should be scheduled on a node where pods with a given label are already running. Making sure pods that communicate often are on the same node will reduce network latency.
Pod anti-affinity tells the scheduler not to place a pod on a node where pods with a given label are already running. This can improve fault tolerance by spreading replicas of a pod across different nodes and availability zones.
How Can Pod Affinity and Anti-Affinity Rules Be Improperly Configured?
Resource Contention
If pod affinity rules are too strict, you can have multiple pods being scheduled on the same node, leading to resource contention (CPU, memory, I/O). On the other hand, if pod anti-affinity rules are too strict, it can restrict pods from being scheduled on nodes that have enough available resources, resulting in underutilization of the cluster.
Increased Response Times
If pods that communicate frequently are scheduled on different nodes, it will introduce network latency and increased response times. Overly strict anti-affinity rules can schedule similar pods too sparsely within the cluster, increasing the latency between communicating pods.
Scheduling Delays
Stricter affinity rules can cause delays in pods being scheduled, as the scheduler needs to evaluate and find suitable nodes. Similarly, anti-affinity rules that are too strict might limit the number of nodes available for scheduling, causing delays.
Best Practices for Configuring Pod Affinity and Anti-Affinity Rules
- Continuously monitor the performance and resource utilization of your pods.
- Make sure your rules are not too strict and allow for flexible scheduling of pods.
- Use pod topology spread constraints in addition to affinity and anti-affinity rules for more granular control over pod placement and to ensure high availability.
The following is an example of using a topology spread constraint along with affinity and anti-affinity rules:
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: my-app
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - frontend
        topologyKey: "kubernetes.io/hostname"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - backend
        topologyKey: "kubernetes.io/hostname"
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  containers:
  - name: example-container
    image: nginx
In the code above:
- Pod affinity makes sure the pod is scheduled on a node where pods with the label app=frontend are already running. The topologyKey specifies that the rule applies at the node level.
- Pod anti-affinity ensures that the pod is not scheduled on a node that runs other pods with the label app=backend.
- Pod topology spread constraints ensure that pods with the label app=my-app are evenly spread across all nodes. The maxSkew: 1 states that the number of matching pods on any two nodes may differ by at most one.
Kubernetes Misconfiguration #3: Persistent Volume Mismanagement
Persistent volumes (PVs) in Kubernetes are a special type of K8s resource, either statically provisioned by an administrator or dynamically provisioned using storage classes. They allow you to store data in volumes that remain independent of the lifecycle of a pod. PVs are often used to store files such as logs and images and ensure that data persists even when pods are restarted or rescheduled.
How Can PVs Be Misconfigured?
Incorrect Access Modes
If you specify an incorrect access mode for a persistent volume (e.g., ReadWriteOnce instead of ReadWriteMany), it will prevent the pod from mounting the volume. This, in turn, will lead to reliability issues since the pod will not be able to start or access the data it needs.
Mismatched Storage Requests
Persistent volumes are requested through a persistent volume claim (PVC). If the claim requests more storage than any available PV can provide, the PVC will remain unbound. In this case, the pod will not be able to start, leading to application downtime.
Incorrect Storage Class
If an incorrect or non-existent storage class is referenced in the PVC, the claim will remain unbound, also leading to application downtime.
Best Practices for Configuring Persistent Volumes
- Validate access modes specified in the PVC and ensure they match the requirements of the application and the capabilities of the underlying storage.
- Verify that a PV with the requested storage capacity is available, or can be dynamically provisioned, before creating the claim.
- In a multi-zone cluster, use volumeBindingMode: WaitForFirstConsumer so that the PV is provisioned in the same zone as the pod using it.
- Set reclaimPolicy to Retain to prevent accidental deletion of data.
The following is an example of a well-configured PVC and Storage Class.
Storage Class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-storage
  resources:
    requests:
      storage: 10Gi
Kubernetes Misconfiguration #4: Misconfigured Network Policies
In Kubernetes, network policies define how pods are allowed to communicate with each other and with other network endpoints within the cluster.
Ingress policies dictate which other pods, namespaces, IP blocks, etc. are allowed to send traffic to the pod in question. Egress policies control outgoing traffic from the pod, defining the destinations to which the selected pod can send traffic.
How Can Network Policies Be Misconfigured?
Overly Permissive Network Policies
If your network policy allows too much traffic into your pods, you increase your attack surface. This exposes the pods to potential security threats and DDoS attacks, which can saturate network bandwidth and degrade performance. Allowing unrestricted access to your pods can also lead to resource contention, as malicious or unintended traffic consumes CPU, memory, or network bandwidth, impacting the performance of your applications.
Overly Restrictive Network Policies
Overly restrictive policies, on the other hand, can block essential traffic to or from your pods. This can lead to retries or timeouts, increasing the latency of your app. With overly restrictive policies, you will also have higher policy evaluation overhead, which might force traffic to take longer routes or go through additional hops.
Best Practices for Configuring Network Policies
- Start small and enforce your policies incrementally, testing them thoroughly so that no essential communication is blocked.
- Follow the principle of least privilege by permitting traffic only from known sources (see the example policy after this list).
- Implement monitoring and logging to uncover bottlenecks and resolve issues immediately.
- Regularly audit network policies to evaluate if they are still valid and adjust as necessary.
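The following is a minimal sketch of a least-privilege ingress policy. The namespace, labels, and port are illustrative assumptions; it allows traffic to backend pods only from frontend pods on port 8080 and blocks all other ingress to the selected pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend    # illustrative name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend                # assumed label on the protected pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend           # assumed label on the allowed client pods
    ports:
    - protocol: TCP
      port: 8080                  # assumed application port

Keep in mind that network policies are only enforced if the cluster’s CNI plugin supports them.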
Kubernetes Misconfiguration #5: Unmanaged Secret and Configuration Management
Applications will frequently need to interact with external services like APIs or read from data stores that require authentication. Credentials for such services, like API keys or database passwords, can be managed in Kubernetes through a dedicated resource called a Secret.
However, Secrets in Kubernetes are stored as base64-encoded strings; base64 is an encoding, not encryption, so the values can be trivially decoded back to their original plaintext. This increases the risk of those credentials being leaked or mishandled.
How Do Unmanaged Secrets Degrade Overall Application Reliability?
Security Risks
Unmanaged or poorly managed secrets in Kubernetes risk being exposed to external actors, which can lead to data breaches. Hardcoded secrets within the application or in configuration files run the risk of getting into version control systems or logs, making the application vulnerable to attacks.
Operational Challenges
The culture of DevSecOps entails the rotation of security keys at regular intervals. If keys or secrets are not managed centrally, it might become painful to rotate keys for all applications running in production. Without the use of a centralized secret management platform, different environments might end up with inconsistent secrets, leading to configuration drift and degraded developer experience.
Compliance Issues
Many Fortune 500 companies handling sensitive data must follow relevant regulations. Poor secret management can lead to non-compliance with popular industry standards like GDPR, HIPAA, PCI-DSS, etc.
Best Practices for Managing Secrets in Kubernetes
- Use an external, centralized secret storage and management solution, e.g., AWS Secrets Manager, HashiCorp Vault, or Azure KeyVault.
- Mirror the secret into your Kubernetes environment using an External Secrets Operator.
- Use a secret management tool like Mozilla’s SOPS (Secrets OPerationS) to create and manage secrets; it lets you securely encrypt and decrypt secrets and store the encrypted files in version control.
Scenario: Using an External Secrets Operator
The following presents an example of using the External Secrets Operator with Kubernetes. This assumes that a secret has already been created in AWS Secrets Manager, with an IAM policy attached that allows the cluster to fetch the secret via IAM Roles for Service Accounts (IRSA). It also assumes that a SecretStore with the name secretstore-sample already exists within the cluster.
Step 1: Deploy the External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace
Step 2: Create an ExternalSecret Resource in Your Cluster
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-external-secret
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: secretstore-sample
    kind: SecretStore
  target:
    name: secret-to-be-created
    creationPolicy: Owner
  data:
  - secretKey: secret-key-to-be-managed
    remoteRef:
      key: provider-key
      version: provider-key-version
      property: provider-key-property
Here, you specify the name of the secret in AWS Secrets Manager in the key section for remoteRef.
Step 3: Use the External Secret in Your Application
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: nginx
    env:
    - name: USERNAME
      valueFrom:
        secretKeyRef:
          name: secret-to-be-created
          key: username
    - name: PASSWORD
      valueFrom:
        secretKeyRef:
          name: secret-to-be-created
          key: password
  restartPolicy: Never
The pod consumes the Kubernetes Secret (secret-to-be-created) that the ExternalSecret keeps in sync with AWS Secrets Manager. With this approach, you can ensure your secrets are being managed and accessed securely within your cluster.
Kubernetes Misconfiguration #6: Outdated or Incompatible Kubernetes Versions
Kubernetes releases a new minor version roughly three times a year, each with new features, security patches, and other improvements. Each minor version is supported for about a year with critical bug fixes and patch updates.
What Are the Implications of Running an Outdated Kubernetes Version?
Running an outdated or incompatible Kubernetes version can have several implications, including security risks, compliance issues, and lack of support.
Security Risks
Using an older version of Kubernetes opens the door for unpatched vulnerabilities. Attackers can leverage a known vulnerability in an older version to compromise the cluster.
Lack of Support
Kubernetes officially supports each minor version for a fixed period. Once that window is over, you will stop receiving updates or patches, which can make it more difficult to resolve issues.
Missing Out on Performance Upgrades
New Kubernetes releases often include performance improvements. Keeping up to date with the latest version also means you can take advantage of these enhancements.
Best Practices for Updating Kubernetes
- Monitor the Kubernetes release notes and keep an eye on announcements regarding new versions and updates.
- Schedule and plan for regular maintenance windows every quarter to keep your cluster up-to-date.
- Migrate to a managed solution such as Google Kubernetes Engine, Azure AKS, or Amazon EKS that offers better ways to handle cluster upgrades.
- Leverage all K8s community resources and learn from devs’ experiences and best practices.
- Document your upgrade process for future upgrades.
- Continuously monitor your cluster health and application performance after each upgrade to quickly discover and resolve issues.
Kubernetes Misconfiguration #7: Incorrect Autoscaling Strategies
Once containers are deployed to Kubernetes, the pods must scale based on demand. Kubernetes has some special mechanisms in place that handle pod scaling:
- Horizontal Pod Autoscaler: Scales the number of running pods up or down depending on resource metrics including CPU usage or memory consumption
- Vertical Pod Autoscaler: Adjusts the resource requests and limits for a pod while keeping the number of pods consistent (see the sketch after this list)
- Cluster Autoscaler: Automatically provisions additional nodes, or removes them, from the cluster as needed, based on pending pods and node utilization
- Custom metric autoscaler: Allows scaling based on a combination of simple metrics or metrics obtained from external tools like Prometheus
- Scheduled scaling: Scales pods based on a pre-defined schedule
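As a reference for the Vertical Pod Autoscaler mentioned above, here is a minimal sketch. It assumes the VPA components are installed in the cluster and that a Deployment named nginx-deployment exists:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment        # assumed existing deployment
  updatePolicy:
    updateMode: "Auto"            # VPA may evict pods to apply updated requests

Because the VPA can evict pods to apply new resource values, avoid combining it with an HPA that scales on the same CPU or memory metrics.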
What Can Go Wrong with Autoscaling Strategies?
Horizontal Pod Autoscaler (HPA)
If the metrics used for scaling are not representative of the actual application load, HPA can over- or under-scale the pods. This might cause delays in scaling and lead to degraded pod performance.
Also, if there are not enough available resources within the cluster, the pods might not be able to scale efficiently.
Custom Metric Autoscaler
While using custom metrics for autoscaling is considered a better alternative than using basic resource metrics, it often comes with its own set of challenges.
Custom metrics can be a single metric from the application or a composite metric like a combination of CPU, memory, P99 latency, etc. This combination makes them quite complex and error-prone, which if not handled properly can lead to delayed or inaccurate scaling and unreliable application behavior.
Scheduled Scaling
Scheduled scaling assumes application load will increase and decrease on a predictable schedule. However, in practice, this is not always the case. If a spike in load does not fall within the scaling window, pods might start throttling, leading to degraded application performance.
Best Practices for Autoscaling
- Understand and analyze the behavior of your application before choosing an autoscaling strategy.
- If your application exposes a REST API, you can monitor the request rate, response latency, CPU utilization, etc., and then come up with a custom metric for autoscaling (see the HPA sketch after this list).
- If your application is a data-intensive application that processes data from upstream queues, the incoming volume and consumer lag can be interesting metrics to scale upon.
- Implement health checks via Kubernetes, e.g., liveness and readiness probes, to determine the health of your pods (an example follows below).
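As a sketch of the REST API case above, the following HPA scales on a per-pod request-rate metric. It assumes a metrics adapter (such as Prometheus Adapter) already exposes a metric named http_requests_per_second for the pods and that the target Deployment is called api-deployment; both names are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment                  # assumed deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second    # assumed custom metric exposed by an adapter
      target:
        type: AverageValue
        averageValue: "100"               # target of roughly 100 requests/second per pod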
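And here is a minimal sketch of liveness and readiness probes on a container; the endpoints, port, and image are assumptions about the application:

apiVersion: v1
kind: Pod
metadata:
  name: probed-app                # illustrative name
spec:
  containers:
  - name: app
    image: nginx                  # placeholder image
    ports:
    - containerPort: 8080
    livenessProbe:                # restarts the container if the check keeps failing
      httpGet:
        path: /healthz            # assumed health endpoint
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:               # removes the pod from Service endpoints while failing
      httpGet:
        path: /ready              # assumed readiness endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5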
Kubernetes Misconfiguration #8: Outdated Container Base Images
All application containers are built using foundational layers that provide the essential operating system libraries and dependencies. These foundational layers, a.k.a. base images, can be language-specific or plain operating system images that provide a starting point for building application container images.
Common OS base images include Alpine, Ubuntu, and Debian while language-specific images include Python, Node, and OpenJDK.
Why Should You Update Your Base Container Images?
Security Vulnerabilities
Each version update comes with essential security patches or updates. If you do not upgrade your container base images regularly, then you may lack critical patches and increase your attack surface. An attacker can then exploit known vulnerabilities via these images and ultimately cause a potential breach or data loss.
Compatibility Issues
Older image versions might include outdated libraries and dependencies that are no longer compatible with your modern application requirements. Also, newer versions of Kubernetes might not support the older version of the container base image, which can lead to potential runtime errors or lack of functionality.
Lack of Support
Using outdated base images means you will no longer receive official support for critical bug fixes or security patches. This can also lead to compliance issues and increased risk of threats or exploits.
Best Practices for Updating Container Base Images
- Use minimal base images like alpine for a smaller attack surface.
- When extending your base images, only install libraries and packages essential to your application; also, remove all unnecessary helper tools from the image.
- Pin specific image tags, such as alpine:3.20.3, instead of using alpine:latest.
- Invest in a container scanning tool like Snyk that scans all your images before you push them to your registry.
- Integrate image scanning as a part of your continuous integration pipeline to help you identify images with vulnerabilities faster and take appropriate actions.
- Continuously monitor your images for critical security issues or misconfigurations.