Case study: a 31% AWS bill reduction for a Series B SaaS company

A Series B SaaS company (anonymized; ~140 employees, AWS spend ~$112k/month at engagement start) brought us in for a 6-week FinOps engagement. The CFO had flagged AWS spend growing faster than ARR; engineering leadership knew there was waste but did not have the bandwidth to address it. We identified $35k/month in achievable savings, implemented $34.7k of it (a 31% reduction), and stood up a practice that is still holding the savings nine months later. This is what we did and what we would do differently.

The customer

Anonymized profile:

B2B SaaS, vertical-specific (analytics product for a specific industry)
~140 employees, ~50 in engineering
Series B raised in early 2025
AWS-only cloud spend, ~$112k/month at engagement start, growing 8-12% month-over-month for the prior six months
ARR growing roughly half that rate, hence the CFO's concern
No dedicated FinOps person; platform engineering team of three handled cost reactively

The engineering team was aware they had cost issues. They had a list of ideas. They did not have the cycles to act on them while also shipping the product roadmap.

Starting state

We spent week one on diagnosis. The cost picture, broken down by service:

Service	% of monthly spend	Monthly $
EKS (compute)	38%	$42.6k
RDS (Postgres)	18%	$20.2k
S3 + data transfer	14%	$15.7k
Other EC2 (legacy non-EKS workloads)	11%	$12.3k
ElastiCache + OpenSearch	9%	$10.1k
Lambda + API Gateway	5%	$5.6k
Other	5%	$5.5k

EKS being 38% of spend made it the obvious primary target. The "other" category included a long tail of services worth flagging but not worth deep optimization at this scale.

The starting state of practices, scored against the four-practice framework from our mid-market FinOps pillar:

Practice	Score (0-3)	Notes
Visibility	1	Tagging existed but not enforced; ~22% of spend was unattributed
Commitment	1	A few 3-year RIs purchased opportunistically; no systematic commitment management
Rightsizing	0	Compute Optimizer recommendations were visible but had not been acted on in the last quarter
Lifecycle	0	No process; manual cleanups happened sporadically when someone noticed

Total: 2 of 12. Pre-framework. Lots of room to move.

Approach

We sequenced the engagement around the four-practice framework, with EKS as the primary cost surface and lifecycle/rightsizing as the highest-ROI early levers.

The plan we presented to the customer:

Week	Focus	Expected savings
1	Diagnosis + visibility setup	None (foundational)
2	Lifecycle audit + non-prod scheduling	$4-7k/month
3	EKS pod rightsizing (top 25 deployments)	$9-13k/month
4	Karpenter migration + Spot adoption	$7-10k/month
5	RDS rightsizing + Compute Savings Plan	$8-11k/month
6	Practice handoff + 30-day review schedule	None (sustaining)

Total expected: $28-41k/month. We landed near the middle, at $34.7k/month implemented.

What we did and what we considered and rejected

Visibility (week 1)

We pushed tag enforcement into Terraform via a small policy module and into the EKS cluster via Kyverno. New resources without team, product, and environment tags failed at provisioning. Existing untagged resources got bulk-tagged via an inventory script with team owners verifying assignments.
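
The rule both enforcement points apply is simple to state. As a minimal sketch in Python (an illustration only, not the Terraform policy module or the Kyverno policy that was deployed; the function name and the sample resource are hypothetical):

```python
REQUIRED_TAGS = ("team", "product", "environment")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not (tags.get(key) or "").strip()]


# Hypothetical resource: tagged with team and environment, but no product.
example = {"team": "platform", "environment": "staging"}
print(missing_tags(example))  # ['product']
```

A provisioning gate built on this idea rejects any resource for which the returned list is non-empty.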

The unattributed spend dropped from 22% to under 4% within two weeks. The remainder was infrastructure shared across teams (the cluster control plane, monitoring, networking), which we tagged as "shared-infra" rather than forcing artificial allocation.

We considered standing up Vantage for the visibility layer. The customer was already paying for Datadog, so we deferred the cost of an additional tool. Native Cost Explorer plus a simple Looker dashboard built on the CUR (Cost and Usage Report) was sufficient for week-one visibility. We later recommended Vantage as a future enhancement; the customer adopted it 4 months after engagement end.

Lifecycle audit (week 2)

The lifecycle audit surfaced four categories of waste:

Idle EC2 instances: 14 instances at <5% CPU for 30+ days, total $1.8k/month. Owners contacted; 12 confirmed for decommissioning, 2 had a reason to stay.
Orphaned EBS volumes: 87 volumes attached to nothing, totaling $0.9k/month. Mass-deleted after a 7-day soft-delete grace period.
Old RDS snapshots: Snapshots older than 90 days, totaling $0.7k/month. Deleted with finance approval.
Non-prod environments running 24/7: Three EKS namespaces (dev, staging, qa) running continuously. We implemented scheduled scale-down to 0 outside 8am-7pm weekdays via a CronJob.

The non-prod scheduling was the largest single win in this category: about $3.8k/month from cutting weekly runtime from 168 hours (24/7) to 55 hours (11 hours a day, five days a week).
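
The runtime arithmetic behind the schedule can be checked with a short Python sketch. It assumes the 8am-7pm weekday window is evaluated in a single local timezone; the function names are illustrative and this is not the CronJob itself.

```python
from datetime import datetime, time

WINDOW_START = time(8, 0)   # 8am
WINDOW_END = time(19, 0)    # 7pm
WEEKDAYS = range(0, 5)      # Monday=0 .. Friday=4


def should_be_running(now: datetime) -> bool:
    """True if a non-prod namespace should be scaled up at this moment."""
    return now.weekday() in WEEKDAYS and WINDOW_START <= now.time() < WINDOW_END


def scheduled_hours_per_week() -> int:
    """Hours per week that fall inside the running window."""
    hours_per_day = WINDOW_END.hour - WINDOW_START.hour
    return hours_per_day * len(WEEKDAYS)


always_on = 24 * 7
scheduled = scheduled_hours_per_week()
print(scheduled, always_on - scheduled)  # 55 113
```

The window keeps 55 of 168 weekly hours, so roughly two thirds of non-prod runtime goes away.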

Total lifecycle savings: $7.2k/month. Above the high end of the estimate.

EKS pod rightsizing (week 3)

We deployed VPA in recommendation-only mode across the cluster. After 7 days of observation, we had recommendations for the top 25 deployments by spend.

The pattern was consistent: most deployments requested 3-4x the CPU and 2-3x the memory they actually used. The most extreme case was a background-job processor requesting 8 CPU per pod and using 0.4 CPU at P95.

For each deployment, we worked with the owning team to:

Review the VPA recommendation
Decide on the new request value (P95 + 30% buffer was our default; teams adjusted for known spike behavior)
Apply the change through the team's normal deployment process
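
The default rule in the second step (P95 plus a 30% buffer) can be sketched in Python as follows. This is an illustration, not VPA's own recommender: the nearest-rank percentile method and the sample data are assumptions chosen to echo the background-job processor above.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of usage samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recommended_request(samples: list[float], buffer: float = 0.30) -> float:
    """P95 of observed usage plus a fractional safety buffer."""
    return percentile(samples, 95) * (1 + buffer)


# Hypothetical CPU samples (cores): mostly idle, with bursts to 0.4.
cpu_samples = [0.1] * 90 + [0.4] * 10
print(round(recommended_request(cpu_samples), 2))  # 0.52
```

Against a request of 8 CPU per pod, a result near 0.5 CPU shows how much headroom the original requests carried. Teams with known spike behavior would pass a larger buffer.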

Cluster-wide requested CPU dropped 41%; requested memory dropped 38%. Cluster Autoscaler responded over the following week by scaling the average node count down from 47 to 29.

Savings on EKS compute: $11.4k/month. Within the estimate.

We considered also enabling VPA in Auto mode on a subset of low-traffic services. We deferred this because the customer's deployment process was deeply integrated with their CI/CD and we did not want to introduce a separate mutation path during the engagement. We recommended it as a future option.

Karpenter migration + Spot (week 4)

Cluster Autoscaler was replaced with Karpenter. The migration took 4 days end-to-end, including a phased rollout (Karpenter handled new pod placements while CA handled existing nodes; we drained CA-managed nodes over 48 hours).

Karpenter's instance-type flexibility allowed it to pick smaller, cheaper instance types more aggressively than CA. The bin-packing improvement reduced average node count by another 4 (from 29 to 25).

For Spot adoption, we identified the production workloads suitable for Spot (stateless services with appropriate retry behavior), enabled the AWS Node Termination Handler, and configured Karpenter to prefer Spot for those workloads with On-Demand fallback. Spot capture rate stabilized at 71% of those workloads' compute over the following two weeks.

Combined savings from Karpenter + Spot: $8.6k/month. Within the estimate, slightly above its midpoint.

We considered using Spot.io for managed Spot optimization. The cost was meaningful at this scale and the customer's team felt comfortable owning the Spot configuration directly. We deferred Spot.io as a future option if the team's bandwidth changed.

RDS rightsizing + Compute Savings Plan (week 5)

The customer's primary RDS instance was over-provisioned for the actual database workload: peak CPU was under 25% over the trailing 90 days on a db.r6i.4xlarge. We worked with the team to right-size to db.r6i.2xlarge, which has half the vCPUs and memory and so should keep peak CPU under roughly 50% if load scales linearly, with a planned re-evaluation after 30 days.

For commitment, we computed the trailing 90-day baseline of On-Demand compute (excluding Spot, which is not eligible for Savings Plans) and recommended a Compute Savings Plan covering 65% of that baseline at the 1-year term.
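
The sizing logic can be sketched in Python. Several things here are assumptions rather than engagement data: the baseline is taken as the mean hourly On-Demand spend over the window, the $480/day input is a placeholder, and the 25% discount is a stand-in (actual Savings Plan rates vary by instance family, region, term, and payment option). Because a Savings Plan commitment is expressed in dollars per hour at the discounted rate, the commitment is smaller than the On-Demand spend it covers.

```python
def baseline_per_hour(daily_on_demand_usd: list[float]) -> float:
    """Mean hourly On-Demand compute spend over the window (Spot already excluded)."""
    if not daily_on_demand_usd:
        raise ValueError("need at least one day of spend")
    return sum(daily_on_demand_usd) / (24 * len(daily_on_demand_usd))


def covered_per_hour(baseline: float, coverage: float = 0.65) -> float:
    """On-Demand-equivalent spend per hour that the plan is meant to cover."""
    return baseline * coverage


def hourly_commitment(covered: float, assumed_discount: float) -> float:
    """Commitment in $/hour, expressed at the discounted Savings Plan rate."""
    return covered * (1 - assumed_discount)


# Hypothetical inputs: a flat $480/day for 90 days and an assumed 25% discount.
daily = [480.0] * 90
baseline = baseline_per_hour(daily)
covered = covered_per_hour(baseline)
print(round(baseline, 2), round(covered, 2), round(hourly_commitment(covered, 0.25), 2))
# 20.0 13.0 9.75
```

Covering 65% rather than 100% leaves room for the baseline to shrink (for example, from further rightsizing or Spot expansion) without stranding the commitment.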

The customer's CFO approved the commitment. Purchase processed.

Savings: $4.1k from RDS rightsizing + $4.8k from Savings Plan effective discount = $8.9k/month.

Practice handoff (week 6)

The final week was knowledge transfer. We documented:

The visibility setup (where the dashboards live, who owns them, how to add a new team)
The lifecycle process (monthly audit checklist, default-to-decommission rule)
The rightsizing process (quarterly VPA review of top 25 deployments)
The commitment review process (quarterly trailing-90-day analysis)

We named an internal owner, one of the platform engineers, who would run the practice ongoing at ~25% time. We held a check-in on the practice's health 90 days after engagement end.

Results

End-of-engagement state versus starting state:

Metric	Before	After	Delta
Monthly AWS spend	$112k	$77.3k	-$34.7k (-31%)
EKS compute	$42.6k	$24.8k	-$17.8k (-42%)
RDS	$20.2k	$13.5k	-$6.7k (-33%)
Practice maturity score	2/12	9/12	+7
Tagged spend percentage	78%	96%	+18pp

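Savings by lever, as reported in the sections above. These figures sum to $36.1k/month, about $1.4k more than the $34.7k net reduction in the monthly bill: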

Lifecycle (idle resources + non-prod scheduling): $7.2k
EKS pod rightsizing: $11.4k
Karpenter + Spot: $8.6k
RDS rightsizing: $4.1k
Compute Savings Plan: $4.8k (effective discount on covered baseline)

Note that the Savings Plan figure is the discount on On-Demand pricing for the covered portion. The cash purchase was significantly more than $4.8k/month on a 1-year commitment basis, which the CFO budgeted explicitly.

What we'd do differently

In retrospect:

We should have moved RDS rightsizing earlier. It was the second-largest single-resource opportunity behind EKS compute and we addressed it in week 5. If we had run RDS in parallel with EKS in week 3, we would have captured an additional 2 weeks of savings.

The Spot migration was more conservative than necessary. We moved 60% of stateless workloads to Spot in week 4. The customer's team was comfortable enough with Spot behavior that we could have moved closer to 80% in the same timeframe. A second wave of Spot expansion happened in month 3 post-engagement.

We did not address Lambda or DynamoDB usage. Both were small enough as a percentage of spend ($5.6k for Lambda + API Gateway, ~$2k for DynamoDB) that we deprioritized them. In hindsight, the Lambda layer had clear over-provisioning (memory configurations 2-4x higher than usage) that would have been a quick week-3 sub-task.

S3 and data transfer were under-investigated. $15.7k/month in S3+egress was the third-largest line item. We did a surface-level review (lifecycle policies, intelligent tiering) but did not deeply analyze data transfer patterns. A full data-transfer audit would have been another week and may have surfaced 10-20% additional savings on that line.

Where this engagement applies and where it doesn't

This engagement profile applies to:

Mid-market SaaS companies with single-cloud (AWS-dominant) spend in the $50k-$300k/month range
Companies with an existing platform engineering function that can absorb the practice ongoing
Companies in growth stages where the CFO is asking the cost question explicitly (the executive air cover matters for the prioritization)

It does not apply to:

Pre-Series-A companies where the cloud spend is under $20k/month and the optimization opportunity is smaller than the engagement cost
Multi-cloud or hybrid-cloud organizations where the surface area is materially different
Highly regulated companies where the change-management overhead would extend each phase by 2-3x
Companies that have already done a thorough first-pass FinOps engagement; the savings runway is smaller for them

The 31% number is not a guarantee. Customer outcomes vary based on starting state. We have done engagements that produced 18% savings (already-tuned customer, smaller runway) and engagements that produced 47% savings (very early-stage waste). The four-practice framework has been the consistent through-line; the magnitude is starting-state-dependent.

Sustaining results

Nine months after engagement end, the customer's spend is $84k/month, about $28k/month below the $112k starting point, against an underlying workload that has grown ~10% in that time. Without the practice, projected spend would be roughly $124k/month at this point, so the gap against that projection is closer to $40k/month.
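
To make the two comparison points explicit, here is a back-of-the-envelope check in Python. It assumes spend would simply have scaled with the ~10% workload growth, which is a simplification and lands slightly under the roughly $124k projection; the function names are illustrative.

```python
def projected_spend_k(start_k: float, workload_growth: float) -> float:
    """Spend in $k/month if cost had simply tracked workload growth."""
    return start_k * (1 + workload_growth)


def savings_k(reference_k: float, actual_k: float) -> float:
    """Gap in $k/month between a reference spend level and actual spend."""
    return reference_k - actual_k


START_K = 112.0    # $k/month at engagement start
CURRENT_K = 84.0   # $k/month nine months after engagement end
GROWTH = 0.10      # approximate underlying workload growth

projected = projected_spend_k(START_K, GROWTH)
print(round(projected, 1))                        # 123.2
print(round(savings_k(START_K, CURRENT_K), 1))    # 28.0
print(round(savings_k(projected, CURRENT_K), 1))  # 39.2
```

The $28k figure compares against where the bill started; the larger figure compares against where it would plausibly be today.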

The practice owner — the named engineer — has run the monthly lifecycle audit and quarterly rightsizing reviews on schedule. We held a 90-day check-in (which surfaced a few minor process tweaks) and have not been engaged again since. This is the right outcome; the goal of the engagement was to enable the customer's team, not to create dependency.

*For broader mid-market FinOps framework context, see our pillar on FinOps for 50-500 person companies. If your situation looks similar to this customer's, our team can run a 60-minute initial diagnostic call to assess the magnitude of opportunity in your environment.*