Reserved Instances vs Savings Plans vs Spot for AI/ML workloads

Mohit Sharma | May 10, 2026 | 10 min read

For AI/ML workloads on AWS, the right commitment mix depends on workload type. Training: heavy Spot, no commitments. Steady inference: Compute Savings Plans or EC2 Instance Savings Plans. Bursty inference: On-Demand with optional commitment for the steady-state floor. GPU-specific Reserved Instances rarely pay back due to instance-type churn. Most AI/ML cost over-spend is from picking the wrong commitment type, not the wrong amount.

Why AI/ML commitment is its own problem

The general FinOps commitment guidance — "commit to your steady-state baseline, run growth on On-Demand" — applies to AI/ML workloads with significant modifications. Three properties of AI/ML workloads make standard commitment guidance less reliable:

  1. GPU instance types churn fast. AWS releases new GPU instance types regularly: P5 launched in 2023 and P6 followed in 2025. A 3-year RI on a P4d instance type bought in 2024 looks worse every quarter as newer instances offer better price-performance.
  2. Workload patterns are bimodal. Training is bursty (sometimes running for days, sometimes idle for weeks). Inference is steady (or follows business-hour patterns). Treating "AI/ML compute" as one workload and applying one commitment strategy produces wrong answers for both.
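  3. The base rate is high enough that mistakes are expensive. GPU On-Demand is meaningfully more expensive than CPU On-Demand, so $30k/month of GPU spend is concentrated in a handful of instances. Mis-sizing a commitment by even one or two instances strands a large share of that spend, where the same instance-count error on $30k/month of CPU would barely register.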

This piece is for mid-market customers running AI/ML workloads on AWS at $20k-$300k/month of AI-specific spend. The general principles apply to GCP and Azure with different specifics.

The commitment instruments — what actually applies to GPU

| Instrument | Applies to GPU? | Discount range | Term flexibility |
| --- | --- | --- | --- |
| Compute Savings Plan | Yes (covers EC2, Fargate, Lambda) | 10-25% | 1 or 3 year, $/hr commit |
| EC2 Instance Savings Plan | Yes (specific instance family in specific region) | 18-50% | 1 or 3 year, $/hr commit |
| Standard Reserved Instances | Yes (specific instance type) | 25-55% | 1 or 3 year, full or partial upfront |
| Convertible Reserved Instances | Yes | 12-35% | 1 or 3 year, can be exchanged |
| Spot Instances | Yes (with caveats) | 60-90% | No commitment, can be reclaimed |
| Capacity Reservations | Yes (no discount but availability guarantee) | 0% | On-Demand pricing |
| Capacity Blocks for ML | Yes (P5, P4d) | 0% (premium pricing for guaranteed access) | Specific date range |

The most common mistake: assuming Standard RIs are always better than Savings Plans because the discount is higher. For GPU specifically, the instance-type lock of a Standard RI is often the wrong trade.

Workload type 1: Model training

Training workloads are episodic — a training run might consume 100% of a P5 instance for 3 days and then nothing for two weeks. The cost-per-month varies dramatically.

Recommended: Spot, with no commitment.

Why Spot works for training:

  • Most training frameworks (PyTorch DDP, FSDP, DeepSpeed; HuggingFace Accelerate; Ray Train) support checkpointing. After a Spot interruption, training resumes from the last checkpoint once a replacement instance comes up.
  • Spot pricing for ML-relevant GPU instances (g5, p4d, p5) is typically 50-80% off On-Demand.
  • The bursty workload pattern means a multi-instance commitment would be wasted during idle periods.
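The resume-from-checkpoint behaviour is what makes Spot viable, so it is worth seeing in miniature. The sketch below is a toy loop in plain Python, not any framework's API: `train_with_checkpoints`, the JSON state file, and the `interrupt_at` switch that simulates a reclaim are all hypothetical names chosen for illustration.

```python
import json
import os


def train_with_checkpoints(total_steps, ckpt_path, ckpt_every, interrupt_at=None):
    """Toy training loop that resumes from ckpt_path when it exists.

    interrupt_at (hypothetical) simulates a Spot reclaim by raising
    once the loop reaches that step.
    """
    # Resume from the last checkpoint if a previous instance wrote one.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "loss": 1.0}

    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            raise RuntimeError("simulated Spot interruption")
        state["step"] += 1
        state["loss"] *= 0.99  # stand-in for a real optimizer step
        if state["step"] % ckpt_every == 0:
            # Write to a temp file, then rename, so a reclaim mid-write
            # never leaves a half-written checkpoint behind.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)
    return state
```

Calling it once with `interrupt_at=37` and `ckpt_every=10`, then again without `interrupt_at`, loses only steps 31 to 37: the second call picks up from the step-30 checkpoint. In a real job the checkpoint would go to durable storage (S3, EFS, FSx) rather than the instance's local disk.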

The Spot strategy:

  • Diversify across instance types and AZs. Spot pools are per-instance-type-per-AZ. A training job that only requests p4d.24xlarge in one AZ has narrow Spot availability. Configuring it to also accept p4de.24xlarge across multiple AZs widens the pool.
  • Set checkpoint intervals to a level you can afford to lose. A checkpoint every 30 minutes means at most 30 minutes of work lost per interruption. The overhead of more frequent checkpointing trades against the cost of redoing work.
  • Use Spot Instance Advisor for the ML-relevant instance types to see typical interruption rates and savings. In our recent observations, p4d Spot interruption rates have been well under 5% in most regions; p5 runs higher because it is newer and has less Spot supply.
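The checkpoint-interval trade-off can be put in rough numbers. The sketch below uses Young's first-order approximation for the interval, which is a standard rule of thumb brought in here for illustration rather than something the guidance above prescribes. The inputs (a 2-minute checkpoint write, one interruption every two days for the whole job) are assumptions.

```python
import math


def young_interval_minutes(checkpoint_cost_min, mtbi_min):
    """Young's approximation: interval = sqrt(2 * checkpoint cost * MTBI).

    mtbi_min is the mean time between interruptions for the whole job.
    """
    return math.sqrt(2 * checkpoint_cost_min * mtbi_min)


def overhead_fraction(interval_min, checkpoint_cost_min, mtbi_min):
    """Approximate share of wall-clock time lost to checkpointing and rework.

    First term: time spent writing checkpoints.
    Second term: on average half an interval is redone per interruption.
    """
    return checkpoint_cost_min / interval_min + interval_min / (2 * mtbi_min)


# Assumed inputs, for illustration only.
CHECKPOINT_COST_MIN = 2.0
MTBI_MIN = 2 * 24 * 60  # one interruption every two days

best = young_interval_minutes(CHECKPOINT_COST_MIN, MTBI_MIN)
for interval in (30.0, best, 300.0):
    share = overhead_fraction(interval, CHECKPOINT_COST_MIN, MTBI_MIN)
    print(f"interval {interval:6.1f} min -> overhead {share:.1%}")
```

The point of the comparison is the shape of the curve: very frequent checkpoints pay mostly in write time, very infrequent ones pay mostly in redone work, and the minimum sits between them.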

When Spot doesn't work for training:

  • Multi-week training runs with poor checkpointing support
  • Workloads with hard deadline requirements (model needs to ship by date X)
  • Specific GPU types with very tight Spot supply (you may not be able to get the instances at all)

For these cases, the alternatives are On-Demand for short windows (acceptable) or Capacity Blocks for ML (purchase guaranteed access for a date range, premium pricing).

We have not found commitment-based instruments to make sense for pure training workloads in any mid-market engagement we've run. The workload pattern doesn't fit.

Workload type 2: Steady inference

Inference workloads serving production traffic look more like traditional web services — steady baseline traffic with predictable peaks. Commitment economics apply.

Recommended: Compute Savings Plan covering steady-state baseline.

Why Compute Savings Plan over Reserved Instances:

  • Instance-type flexibility. A CSP applies to any EC2 instance family. If you migrate inference workloads from g5.xlarge to a newer GPU instance type or to an Inferentia/Trainium instance, the CSP follows. A Standard RI on g5.xlarge would not.
  • Region flexibility. A CSP applies across regions. Training workloads typically target a single region; inference may serve multiple. CSP fits both.
  • Discount level. CSP discount on GPU is typically 10-25%, lower than a Standard RI's potential 25-55%, but the flexibility is worth more than the additional discount in our experience for AI/ML.
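To make the flexibility-versus-discount trade concrete, here is a simplified model of a mid-term migration. It is a sketch under stated assumptions: the workload moves to a new instance family after `migrate_after` months, the Standard RI keeps billing but covers nothing after the move, the replacement runs at the same On-Demand monthly cost, and resale on the RI Marketplace is ignored. The $30k/month figure and the 40% and 20% discounts are illustrative picks from the ranges above.

```python
def ri_total(on_demand_monthly, ri_discount, term_months, migrate_after):
    """Total cost with a Standard RI when the workload leaves the instance type."""
    committed = term_months * on_demand_monthly * (1 - ri_discount)  # billed regardless
    uncovered = (term_months - migrate_after) * on_demand_monthly    # new family at On-Demand
    return committed + uncovered


def csp_total(on_demand_monthly, csp_discount, term_months):
    """Total cost with a Compute Savings Plan, which follows the migration."""
    return term_months * on_demand_monthly * (1 - csp_discount)


def break_even_month(ri_discount, csp_discount, term_months):
    """Migration month at which both options cost the same in this model."""
    return term_months * (1 - (ri_discount - csp_discount))


if __name__ == "__main__":
    spend = 30_000
    print("RI, migrate at month 6:", ri_total(spend, 0.40, 12, 6))
    print("RI, no migration:      ", ri_total(spend, 0.40, 12, 12))
    print("Compute SP:            ", csp_total(spend, 0.20, 12))
    print("Break-even month:      ", break_even_month(0.40, 0.20, 12))
```

Under these assumptions the RI only wins if the workload stays on the same instance type for most of the term; any migration before the break-even month makes the Compute Savings Plan cheaper despite its lower discount.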

The CSP strategy:

  • Compute the steady-state baseline of inference compute spend over the trailing 90 days. Exclude bursty training spend from this calculation.
  • Commit to 60-70% of that baseline on a 1-year term. The remaining 30-40% absorbs growth and load fluctuation.
  • Reassess quarterly. If inference volume has grown, the CSP may be undersized. If you've migrated to a more efficient instance family, the CSP $/hr commit still applies but the actual hours covered shift.
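The sizing steps above can be sketched as a small calculation. Two assumptions are baked in and should be adjusted to taste: the "steady-state baseline" is taken as the median of daily On-Demand-equivalent inference spend over the trailing 90 days, and the result is converted to Savings Plan dollars, because a Savings Plan commitment is expressed in discounted $/hr rather than On-Demand $/hr. The `sp_discount` value is a placeholder for whatever rate applies to your instance mix.

```python
from statistics import median


def csp_commit_range(daily_on_demand_spend, sp_discount, low=0.60, high=0.70):
    """Return (baseline $/hr at On-Demand rates, low commit $/hr, high commit $/hr).

    daily_on_demand_spend: daily inference spend at On-Demand rates,
    with training spend already stripped out.
    """
    window = daily_on_demand_spend[-90:]    # trailing 90 days
    baseline_hourly = median(window) / 24   # assumed definition of "baseline"

    def to_commit(coverage):
        # Covered usage is billed at the discounted Savings Plan rate,
        # so the commitment is smaller than the On-Demand spend it covers.
        return baseline_hourly * coverage * (1 - sp_discount)

    return baseline_hourly, to_commit(low), to_commit(high)


if __name__ == "__main__":
    history = [2_400.0] * 90  # hypothetical: a flat $2,400/day of inference
    baseline, low, high = csp_commit_range(history, sp_discount=0.20)
    print(f"baseline ${baseline:.2f}/hr, commit ${low:.2f} to ${high:.2f}/hr")
```

Using the median rather than the mean keeps a few unusually busy days from inflating the baseline; a lower percentile would be more conservative still.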

When EC2 Instance Savings Plan can be better than Compute SP:

  • The inference workload is locked to a specific instance family for 12+ months (e.g., you're committed to g5.xlarge for technical reasons)
  • The discount difference (Instance SP can be 18-50% vs Compute SP 10-25%) outweighs the flexibility loss
  • The instance family has stable pricing not subject to imminent replacement by a newer family

For most mid-market AI/ML inference workloads, Compute SP is the better default. EC2 Instance SP is the optimization for cases with high confidence in instance-type stability.

When Standard RIs apply:

  • Very large, very stable inference deployments
  • Instance types with no plausible newer-generation replacement (rare for GPU)
  • Customer is willing to accept the rigidity for the higher discount

We see Standard RIs used appropriately for AI/ML in maybe 15% of cases. For the rest, the flexibility cost is too high.

Workload type 3: Bursty inference

Some inference workloads have a small steady-state floor (e.g., always-on for the few customers using the feature) plus large bursts (e.g., during business hours, or when a campaign drives traffic).

Recommended: Compute SP covering the floor, On-Demand for the bursts, Spot for any non-critical bursts.

The strategy mirrors the inference-baseline approach but with sharper sizing:

  • Identify the steady-state floor accurately. This is the minimum hourly spend sustained continuously over the trailing 90 days. It is usually smaller than people assume.
  • Commit to ~80% of the floor (higher coverage ratio because the floor is more predictable than the overall baseline).
  • Run bursts on On-Demand. If the burst is non-critical (e.g., asynchronous batch inference), Spot.

The trap: committing to the average of the bursty workload rather than the floor. This produces over-commitment that goes unused during low-traffic periods.
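A small worked example shows how far apart the floor and the average can be. The traffic shape below (16 quiet hours at $10/hr, 8 busy hours at $70/hr) is invented for illustration, and the sketch ignores the discount conversion to keep the arithmetic readable.

```python
def floor_and_mean(hourly_spend):
    """Return (minimum hourly spend, mean hourly spend) for the window."""
    return min(hourly_spend), sum(hourly_spend) / len(hourly_spend)


def unused_commitment(hourly_spend, commit_per_hour):
    """Committed dollars not matched by usage, summed over the window."""
    return sum(max(commit_per_hour - h, 0.0) for h in hourly_spend)


if __name__ == "__main__":
    # Hypothetical day: 16 quiet hours, 8 business-hour burst hours.
    day = [10.0] * 16 + [70.0] * 8
    floor, mean = floor_and_mean(day)
    print("floor:", floor, "mean:", mean)
    print("unused if committing to the mean:  ", unused_commitment(day, mean))
    print("unused if committing to 80% floor: ", unused_commitment(day, 0.8 * floor))
```

Committing to the mean pays for capacity during every quiet hour that goes unused, while committing to a fraction of the floor is fully consumed in every hour of the window.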

Workload type 4: Background batch inference

Workloads that process queued batches (overnight inference jobs, async generation, periodic embedding refresh) are usually a strong fit for Spot.

Recommended: Spot, no commitment.

The pattern is similar to training — workload is interruptible, batch-shaped, can resume. The diversification and instance-type flexibility apply the same way.

For batch inference specifically, orchestration layers such as Ray, or managed platforms such as Modal or RunPod, handle much of the interruption lifecycle automatically. The application code rarely needs to handle interruption directly.

When Capacity Blocks for ML make sense

AWS Capacity Blocks for ML let you reserve GPU capacity for a specific date range — pre-purchase access to p4d or p5 instances at premium pricing for a known training window.

This makes sense when:

  • You have a planned training run that needs a specific number of GPUs for a specific window
  • Spot availability for the instance type is unreliable
  • The cost of On-Demand for the window is acceptable but you need certainty of capacity

It does not make sense as a default. The pricing is higher than On-Demand for the same instances; the value is the availability guarantee. Use it for known-large training runs; don't use it for ongoing capacity.

Inferentia and Trainium

AWS's custom AI silicon (Inferentia for inference, Trainium for training) deserves mention because the cost-per-token economics can be substantially better than NVIDIA GPUs for certain workloads.

The current state (May 2026):

  • Inferentia2 is mature; integrates well with PyTorch via the Neuron SDK; cost-per-inference for transformer-class models is typically 30-60% below g5/g6 GPUs for compatible workloads.
  • Trainium is improving; not yet a like-for-like replacement for NVIDIA training in most production scenarios; specific use cases work well.
  • Both are available with the standard commitment instruments (Compute SP, Standard RI).

For mid-market customers running production inference workloads at meaningful scale, evaluating Inferentia is the highest-ROI architectural change. The migration cost is real (model recompilation, framework adjustments) but the per-call cost reduction can be the difference between profitable and unprofitable AI features at the unit-economics level.

We have run two Inferentia migrations on client engagements; both reduced inference cost by more than 40% with comparable latency.
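Whether the migration is worth it comes down to a payback calculation. The sketch below is deliberately simple; the $50k/month inference spend and the $60k one-off migration cost (engineering time for recompilation and validation) are hypothetical inputs, and the 40% reduction mirrors the results quoted above.

```python
def payback_months(monthly_inference_spend, cost_reduction, migration_cost):
    """Months until cumulative savings cover the one-off migration cost."""
    monthly_savings = monthly_inference_spend * cost_reduction
    if monthly_savings <= 0:
        raise ValueError("no savings, so the migration never pays back")
    return migration_cost / monthly_savings


if __name__ == "__main__":
    months = payback_months(
        monthly_inference_spend=50_000,  # hypothetical
        cost_reduction=0.40,             # in line with the migrations described above
        migration_cost=60_000,           # hypothetical engineering cost
    )
    print(f"payback in {months:.1f} months")
```

A payback period shorter than the remaining life of the model or feature is the basic test; the calculation is most sensitive to the migration cost estimate, which is usually the least certain input.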

Common mistakes

Patterns we see frequently:

Committing during training-heavy months. Companies sometimes commit based on a peak month that included a one-off training campaign. The committed level then exceeds steady-state for the next 11 months. Always strip identifiable one-off training spend from the baseline before committing.

3-year RIs on bleeding-edge GPU types. A 3-year commitment on H100 instances bought in 2024 will compete against more capable instance types released within the term. Stick to 1-year terms or Convertible RIs for GPU.

Treating GPU and CPU spend as one commitment pool. Compute SPs technically cover both, but the workload patterns are different enough that mixing them in a single commitment produces sub-optimal coverage. Run separate commitment analyses.

Ignoring inference Spot eligibility. Some inference workloads are interruptible (batch processing, async generation, even some real-time workloads with retry tolerance). Defaulting all inference to On-Demand or commitments leaves Spot savings on the table.

Over-using Capacity Reservations. Capacity Reservations don't discount; they reserve capacity at On-Demand pricing. They're for availability, not cost. Buying them as a "commitment" produces no savings.

Implementation order

For a mid-market AI/ML workload running primarily on AWS:

  1. Categorize workloads into training, steady inference, bursty inference, and background batch.
  2. Move all training and background batch to Spot with appropriate diversification.
  3. Compute steady-state inference baseline (90-day trailing, excluding training spikes).
  4. Apply Compute Savings Plan at 60-70% of baseline, 1-year term.
  5. For very stable inference workloads: evaluate EC2 Instance Savings Plan or Standard RI for the additional discount on the locked portion.
  6. Reassess quarterly. Workloads shift; commitment levels shift.

This sequence captures most of the achievable savings within 2-3 weeks of analysis and procurement work.
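Step 1 is the one that most often gets skipped, so here is a sketch of the categorization as a lookup. The `Workload` fields and the peak-to-floor threshold of 3.0 that separates "bursty" from "steady" are assumptions made for illustration; the recommendations in `PLAN` restate the ones in this piece.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    is_training: bool
    interruptible: bool   # can it be stopped and resumed without user impact?
    peak_to_floor: float  # peak hourly spend divided by the steady floor


PLAN = {
    "training": "Spot, no commitment",
    "background batch": "Spot, no commitment",
    "steady inference": "Compute Savings Plan at 60-70% of baseline, 1-year term",
    "bursty inference": "Compute Savings Plan at ~80% of floor; On-Demand or Spot for bursts",
}


def categorize(w, bursty_ratio=3.0):
    """Map a workload to one of the four categories (threshold is an assumption)."""
    if w.is_training:
        return "training"
    if w.interruptible:
        return "background batch"
    if w.peak_to_floor >= bursty_ratio:
        return "bursty inference"
    return "steady inference"


def plan(workloads):
    """Return {workload name: recommended instrument}."""
    return {w.name: PLAN[categorize(w)] for w in workloads}


if __name__ == "__main__":
    fleet = [
        Workload("fine-tune", True, True, 50.0),
        Workload("nightly-embeddings", False, True, 20.0),
        Workload("search-ranking", False, False, 1.5),
        Workload("campaign-assistant", False, False, 8.0),
    ]
    for name, rec in plan(fleet).items():
        print(f"{name}: {rec}")
```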

Where this advice doesn't fit

  • Customers consuming models primarily through SaaS APIs (OpenAI, Anthropic, Google). The cost optimization is at the prompt and request layer, not the infrastructure commitment layer. See our pillar on GenAI workload cost frameworks for that side of the problem.
  • Multi-cloud AI/ML. The commitment instruments are AWS-specific. GCP CUDs for GPU and Azure Reserved VM Instances for GPU follow analogous patterns; the specifics differ.
  • Very small AI/ML spend (under $5k/month). Commitment work doesn't pay back. Run On-Demand or Spot, revisit when scale justifies.
  • Customers with strict capacity-guarantee requirements. Capacity Blocks or zonal Standard RIs are the answer (regional RIs do not reserve capacity); cost optimization takes a back seat to availability.

FAQ

Q: How does Spot interruption affect training time on average? For p4d instances in well-supplied regions, Spot interruption rates are typically 2-5% per day per instance. Multi-instance training runs see proportionally higher cumulative interruption, but checkpointing keeps the cost manageable. Median training time impact in our experience: 5-15% longer than equivalent On-Demand.
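The "proportionally higher cumulative interruption" point is easy to quantify. The sketch assumes interruptions are independent across instances, which is optimistic: instances in the same Spot pool tend to be reclaimed together. The 3% daily rate and the 8-instance job are illustrative.

```python
def p_any_interruption(daily_rate, instances, days=1):
    """Probability that at least one instance is interrupted in the window.

    Assumes independent interruptions, which understates correlated reclaims.
    """
    return 1 - (1 - daily_rate) ** (instances * days)


def expected_interruptions(daily_rate, instances, days=1):
    """Expected number of interruptions across the job in the window."""
    return daily_rate * instances * days


if __name__ == "__main__":
    rate, n = 0.03, 8  # illustrative values
    print(f"P(any interruption in a day): {p_any_interruption(rate, n):.1%}")
    print(f"Expected interruptions over 3 days: {expected_interruptions(rate, n, 3):.2f}")
```

With these inputs an 8-instance job has roughly a one-in-five chance of at least one interruption on any given day, which is why the checkpoint interval matters more as the job gets wider.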

Q: Should we use Convertible RIs for GPU instead of Standard? Convertible RIs let you exchange between instance types without losing the commitment. For GPU specifically, this is valuable as new instance types release. The trade-off: lower discount than Standard. For mid-market AI/ML where instance-type flexibility matters, Convertible is often the right choice when you do want an RI rather than a Savings Plan.

Q: What about Bedrock and SageMaker pricing? Bedrock pricing is per-token; commitment options are different (Provisioned Throughput is the closest analog). SageMaker has its own Savings Plan that covers training and inference instances. The patterns described above apply to direct EC2 GPU usage; managed services have separate optimization.

Q: How do we model training cost projections for budgeting? Project per planned training run, not per month. Each run has a known instance-hour cost (with Spot discount applied). Sum the planned runs for the quarter; add a 30% buffer for unplanned experiments. This is more accurate than monthly trend extrapolation for training, which is bursty.
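The per-run projection can be sketched in a few lines. The run sizes, the $30/hr On-Demand rate, and the 60% Spot discount are placeholders rather than real prices; the 30% buffer is the one suggested above.

```python
def run_cost(instances, hours, on_demand_hourly, spot_discount):
    """Projected cost of one training run at Spot pricing."""
    return instances * hours * on_demand_hourly * (1 - spot_discount)


def quarterly_training_budget(planned_runs, buffer=0.30):
    """Sum the planned runs and add a buffer for unplanned experiments."""
    base = sum(run_cost(**run) for run in planned_runs)
    return base * (1 + buffer)


if __name__ == "__main__":
    # Hypothetical plan for the quarter; rates are placeholders.
    runs = [
        {"instances": 4, "hours": 72, "on_demand_hourly": 30.0, "spot_discount": 0.60},
        {"instances": 2, "hours": 48, "on_demand_hourly": 30.0, "spot_discount": 0.60},
    ]
    print(f"quarterly training budget: ${quarterly_training_budget(runs):,.2f}")
```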

Q: Are there alternatives to AWS for AI/ML at mid-market scale? Yes — GCP, Azure, and specialty clouds (RunPod, Lambda Labs, CoreWeave). For pure training, the specialty clouds often have better price-performance and Spot-equivalent flexibility. For inference deployed alongside production services, hyperscaler integration usually wins. We see hybrid configurations where training happens on a specialty cloud and inference on AWS.

*For broader mid-market FinOps framework context, see our pillar on FinOps for 50-500 person companies. For a GenAI-specific cost framework that addresses API-priced workloads, see our companion piece on GenAI cost frameworks.*

Mohit Sharma

Principal Consultant

Specializes in Cloud Architecture, Cybersecurity, and Enterprise AI Automation. Designs secure, scalable, and high-performance cloud ecosystems aligned with business strategy and long-term growth.
