The Kubernetes Cost Problem
Kubernetes makes it easy to deploy applications. It also makes it easy to waste money. Developers request "generous" CPU and memory for their pods because under-provisioning causes crashes, while over-provisioning has no visible consequence -- until the cloud bill arrives.
Industry studies consistently report that Kubernetes clusters run at 20-40% average utilization, meaning 60-80% of the capacity you pay for sits idle. This guide covers practical techniques to close that gap.
Understanding Kubernetes Resource Model
Requests vs Limits
Requests: The guaranteed resources Kubernetes reserves for your pod. This is what the scheduler uses to place pods on nodes. Over-requesting means nodes fill up quickly, forcing you to add more nodes than needed.
Limits: The maximum resources your pod can use. Setting limits too low causes CPU throttling and OOM kills; setting them too high (or not setting them at all) lets a runaway process starve every other pod on the node.
The gap between requests and actual usage is your optimization opportunity.
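As a concrete illustration, both settings live in the container spec. The workload name, image, and values below are placeholders for a typical web service:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server           # hypothetical workload name
spec:
  containers:
  - name: api
    image: example/api:1.0   # placeholder image
    resources:
      requests:              # guaranteed; the scheduler uses these for placement
        cpu: 500m
        memory: 1Gi
      limits:                # hard ceiling; CPU is throttled, excess memory is OOM-killed
        cpu: "1"
        memory: 1536Mi
```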
The Real Cost Driver
Your cloud bill is based on node count and size, not pod resource requests. But pod requests determine how many pods fit on each node, which determines how many nodes you need.
If your pods request 2 CPU / 4 GB but actually use 0.5 CPU / 1 GB, you are paying for 4x the compute you need.
Pod Rightsizing
Step 1: Measure Actual Usage
Before changing anything, collect utilization data:
- Deploy Prometheus with kube-state-metrics and node-exporter
- Collect CPU and memory usage per pod over at least 14 days (to capture weekly patterns)
- Track P50, P95, and P99 usage -- not just averages
- Note any periodic spikes (batch jobs, deployments, traffic peaks)
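With Prometheus in place, P95 usage can be pulled with queries along these lines (the `namespace` selector and windows are illustrative; adjust to your setup):

```promql
# P95 memory working set per container over 14 days
quantile_over_time(0.95,
  container_memory_working_set_bytes{namespace="prod", container!=""}[14d])

# P95 of the 5m CPU usage rate over 14 days (subquery at 1h resolution)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="prod", container!=""}[5m])[14d:1h])
```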
Step 2: Identify Over-Provisioned Pods
Look for pods where:
- CPU request is more than 2x the P95 CPU usage
- Memory request is more than 1.5x the P95 memory usage
- Limits are set to values the pod never approaches
Step 3: Set Optimal Requests
Recommended formula:
- CPU request: P95 usage + 20% buffer
- Memory request: P95 usage + 25% buffer (memory spikes cause OOM kills, so be more conservative)
- CPU limit: 2-3x the request (or no limit if your cluster enforces resource quotas)
- Memory limit: 1.5x the request (a hard cap; a pod exceeding it is OOM-killed before it can destabilize the node)
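Applying the formula to a hypothetical pod whose measured P95 is 400m CPU and 800Mi memory gives roughly:

```yaml
resources:
  requests:
    cpu: 480m        # 400m P95 + 20% buffer
    memory: 1000Mi   # 800Mi P95 + 25% buffer
  limits:
    cpu: "1"         # ~2x the request
    memory: 1500Mi   # 1.5x the request
```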
Step 4: Use VPA for Automation
The Vertical Pod Autoscaler (VPA) automates rightsizing:
- Recommendation mode: VPA suggests optimal requests based on observed usage
- Auto mode: VPA automatically adjusts pod requests (requires pod restart)
- Start with recommendation mode, review suggestions, then enable auto mode for stable workloads
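A minimal VPA object in recommendation-only mode might look like this (the target Deployment name is a placeholder):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server       # hypothetical target workload
  updatePolicy:
    updateMode: "Off"      # recommendation mode: suggest, never evict
```

Switching `updateMode` to `"Auto"` lets VPA evict pods and recreate them with the recommended requests.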
Node Optimization
Right Node Types
Choose instance types that match your workload profile:
- Compute-optimized (c-series): For CPU-intensive workloads (API servers, data processing)
- Memory-optimized (r-series): For memory-heavy workloads (databases, caches, JVM applications)
- General-purpose (m-series): For mixed workloads -- the safest default
- ARM-based (Graviton, Ampere): 20-40% cheaper for compatible workloads
Cluster Autoscaler Configuration
The Cluster Autoscaler adds and removes nodes based on demand. Optimize its configuration:
- Set scale-down delay to 10 minutes (avoid thrashing)
- Configure the scale-down utilization threshold to 50% (remove nodes below 50% usage)
- Use multiple node groups with different instance types for workload diversity
- Set appropriate minimum and maximum node counts per group
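These settings map onto upstream Cluster Autoscaler flags roughly as follows (node-group names and counts are illustrative; verify flag support against your provider and version):

```shell
cluster-autoscaler \
  --scale-down-delay-after-add=10m \
  --scale-down-unneeded-time=10m \
  --scale-down-utilization-threshold=0.5 \
  --nodes=2:20:spot-node-group \
  --nodes=3:10:on-demand-node-group
```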
Spot/Preemptible Nodes
For fault-tolerant workloads, spot instances offer 60-90% savings:
- Run stateless application pods on spot nodes
- Keep stateful workloads (databases, message queues) on on-demand nodes
- Use pod topology spread constraints to distribute across spot and on-demand nodes
- Configure pod disruption budgets to handle spot interruptions gracefully
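A PodDisruptionBudget for a spot-hosted stateless service could look like this (the app label and replica count are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2        # keep at least 2 replicas up through interruptions
  selector:
    matchLabels:
      app: api-server    # hypothetical app label
```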
Namespace-Level Governance
Resource Quotas
Set resource quotas per namespace to prevent any team from consuming unbounded resources:
- Total CPU and memory requests per namespace
- Maximum number of pods per namespace
- Storage request limits per namespace
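A sketch of such a quota, with a hypothetical namespace and limits chosen for illustration:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a        # hypothetical namespace
spec:
  hard:
    requests.cpu: "40"     # total CPU requests across the namespace
    requests.memory: 80Gi  # total memory requests
    pods: "100"            # pod count cap
    requests.storage: 500Gi
```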
Limit Ranges
Set default requests and limits for pods that do not specify them:
- Default CPU request: 100m
- Default memory request: 128Mi
- Maximum CPU per pod: 4 cores
- Maximum memory per pod: 8Gi
This prevents both under-provisioned pods (no requests) and over-provisioned pods (requesting 64 GB for a simple API).
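A LimitRange expressing these defaults might look like this (namespace and the default limit values are illustrative; note this example caps per container rather than per pod):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a        # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
    default:               # applied when a container omits limits
      cpu: 200m
      memory: 256Mi
    max:                   # hard ceiling per container
      cpu: "4"
      memory: 8Gi
```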
Cost Visibility Tools
Kubecost
Open-source Kubernetes cost monitoring:
- Per-namespace, per-deployment, per-pod cost allocation
- Efficiency metrics (CPU and memory utilization vs. requests)
- Savings recommendations based on actual usage patterns
- Integration with cloud billing for accurate cost attribution
Cloud Provider Tools
- AWS: EKS cost monitoring in Cost Explorer with split cost allocation
- Azure: AKS cost analysis in Azure Cost Management
- GCP: GKE usage metering with BigQuery integration
Quick Wins Checklist
- Remove idle workloads: Delete deployments with zero traffic for 30+ days
- Shut down dev/staging at night: Scale non-production namespaces to zero outside business hours
- Right-size the top 10 pods: Focus on the largest resource consumers first
- Enable Cluster Autoscaler: Ensure nodes are removed when no longer needed
- Add spot nodes: Move stateless workloads to spot instances
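The night-time shutdown item can start as a pair of scheduled commands (namespace name is illustrative; restoring requires tracking the original replica counts separately):

```shell
# Evenings: scale every Deployment in staging to zero
kubectl scale deployment --all --replicas=0 -n staging

# Mornings: restore (a single count only works if all Deployments share it)
kubectl scale deployment --all --replicas=2 -n staging
```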
At Optivulnix, Kubernetes cost optimization is a specialty within our FinOps practice. We typically find 30-50% savings in Kubernetes infrastructure costs. Contact us for a free cluster cost assessment.
