AWS idle resource detection: scripts, platforms, or manual reviews -- what actually moves the needle

AWS idle resource detection is the practice of finding compute, storage, and network resources that are running and billing but doing no useful work — unattached EBS volumes, idle ELBs, dormant EIPs, EC2 instances at single-digit CPU, RDS instances no application is connected to, snapshots whose parent volumes were deleted years ago. Most mid-market AWS accounts carry 25-30% of monthly spend in this category. There are three credible ways to find it: custom scripts hitting AWS APIs, a commercial cost platform, or a disciplined quarterly manual review. Each catches different waste. Picking the wrong one is what produces the next surprise bill.

Why this question keeps coming back

A predictable sequence plays out at 50-500 person companies. Someone reads the Flexera State of the Cloud number — waste rates in the 27-32% range across segments — runs Cost Explorer for the first time, and finds the same number on their own bill. The reaction is almost always the same: buy a platform that promises to "kill idle resources automatically." Six months later, the savings are real but smaller than the contract, the platform has flagged a hundred items engineering will never act on, and the actual strategic waste — the multi-AZ failover capacity nobody uses, the data lake nobody queries — is still there.

The mistake is not buying the platform. The mistake is treating idle resource detection as one problem instead of three. Easy waste, dev/staging waste, and strategic overcommit are three different categories. They respond to three different tools.

We will walk what each approach actually catches, what it misses, and where the boundary sits for a 50-200 person team versus a 200-500 person team.

What "idle" actually means on AWS

Before comparing approaches, the category needs precision. "Idle" on AWS resolves to about a dozen specific patterns, and the detection signal differs for each:

Unattached EBS volumes. Volume exists, has no Attachments, has not been attached for N days. Pure waste. Detectable by API call, no metrics needed.
Unused EIPs. Elastic IPs that are not associated with a running instance or network interface bill at $0.005 per hour (roughly $3.65/month per IP). Trivial to detect.
Old EBS snapshots. Snapshots whose source volume no longer exists, or that are older than your retention policy. Requires policy definition first.
Idle ELBs / NLBs. Load balancers with RequestCount (or NewFlowCount for NLBs) consistently at zero. Needs CloudWatch metrics over a 14-30 day window.
Idle EC2 instances. Running instances with sustained CPU < 5%, network in/out < 5 KB/s, and disk I/O near zero. Needs multi-metric correlation — a single metric will produce false positives on legitimately quiet boxes.
Idle RDS / Aurora. RDS instances with zero DatabaseConnections over the observation window, or near-zero IOPS. Needs CloudWatch.
Stopped instances with attached EBS. Stopped EC2 instances are not billed for compute, but their EBS volumes continue to bill. A stopped instance for nine months is paying for storage.
Orphaned NAT Gateways. NAT Gateways in VPCs with no running workloads, or with traffic so low the gateway cost dwarfs the data transfer cost.
Unused Lambda provisioned concurrency. Provisioned concurrency configured on functions that no longer receive traffic.
Idle Kinesis shards / MSK clusters / OpenSearch domains. Often the largest single line items in this category. Easy to find once you know to look; not surfaced by default.
Over-provisioned ASG ceilings. Auto Scaling Groups with max_size set 5-10x above peak actual demand. Not strictly idle, but contributes to surprise scaling cost.

The detection problem differs by category. The first three, along with stopped instances that still hold attached EBS, need API calls only. The middle group needs metric history. The last two need engineering judgment.
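The API-only categories can be sketched as pure filters over the response shapes that boto3 returns. This is a minimal sketch, not a complete scanner: the function names are illustrative, the 30-day default is an assumption borrowed from the thresholds used later in this article, and `CreateTime` is only a rough stand-in for "unattached for N days" because `describe_volumes` does not report when a volume was detached.

```python
from datetime import datetime, timedelta, timezone

EIP_HOURLY_USD = 0.005   # rate quoted in the list above
HOURS_PER_MONTH = 730    # average month, a common billing approximation


def unattached_volumes(volumes, min_age_days=30, now=None):
    """IDs of volumes in the 'available' state (no attachments).

    `volumes` is the "Volumes" list from ec2.describe_volumes().
    CreateTime is a rough age proxy; true detach time needs CloudTrail.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and v["CreateTime"] <= cutoff
    ]


def unassociated_eips(addresses):
    """`addresses` is the "Addresses" list from ec2.describe_addresses()."""
    return [a["AllocationId"] for a in addresses if "AssociationId" not in a]


def orphaned_snapshots(snapshots, existing_volume_ids):
    """Snapshots whose source volume no longer exists.

    Snapshots made by CopySnapshot carry an arbitrary volume ID,
    so they will be listed here too and need a manual look.
    """
    return [
        s["SnapshotId"]
        for s in snapshots
        if s.get("VolumeId") not in existing_volume_ids
    ]


def eip_monthly_cost(count):
    """Approximate monthly cost in USD of `count` idle Elastic IPs."""
    return round(count * EIP_HOURLY_USD * HOURS_PER_MONTH, 2)


def fetch_inventory(region_name):
    """Pull the three inventories with boto3 (needs AWS credentials)."""
    import boto3  # imported here so the filters above run without it

    ec2 = boto3.client("ec2", region_name=region_name)
    volumes = [
        v
        for page in ec2.get_paginator("describe_volumes").paginate()
        for v in page["Volumes"]
    ]
    snapshots = [
        s
        for page in ec2.get_paginator("describe_snapshots").paginate(OwnerIds=["self"])
        for s in page["Snapshots"]
    ]
    addresses = ec2.describe_addresses()["Addresses"]
    return volumes, snapshots, addresses
```

Keeping the filters separate from the fetch step means the classification logic can be unit-tested against recorded responses, which is most of what makes a script like this auditable.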

Approach 1: Custom scripts against AWS APIs

This is the approach most platform teams start with, and for good reason. The AWS APIs that matter are free, the data is yours, and the script is auditable.

The minimum useful script set hits four sources:

AWS Trusted Advisor. The full cost optimization check set requires a paid Support tier. Important context for a 2026 plan: AWS announced that Developer Support, the original Business Support plan, and Enterprise On-Ramp are all being discontinued on January 1, 2027. The migration targets are Business Support+ (the replacement SKU, starting at $29/month minimum) and Enterprise Support (which had its minimum reduced from $15,000 to $5,000/month and absorbs Enterprise On-Ramp customers via auto-upgrade through 2026). The legacy Business Support plan, still listed at $100/month (or 10% of monthly AWS usage, whichever is greater), is a sunset SKU through January 1, 2027. For forward planning, model Business Support+ at $29/month minimum or Enterprise Support at the new $5,000/month minimum, not the legacy pricing. The free tier (available to all accounts) exposes a limited set of checks; the full cost optimization category remains gated behind a paid tier.
AWS Compute Optimizer. Free. Produces rightsizing recommendations for EC2 instances, EBS volumes, Lambda functions, ECS services on Fargate, RDS databases, EC2 Auto Scaling groups, and commercial software licenses (SQL Server on EC2), based on 14 days of CloudWatch data. Underused by mid-market teams because the UI is buried. The API (get-ec2-instance-recommendations, get-ebs-volume-recommendations, get-auto-scaling-group-recommendations) is what you actually want. The ASG coverage matters specifically because over-provisioned max_size ceilings are one of the harder-to-detect waste categories — Compute Optimizer is one of the few free tools that surfaces them.
Cost Optimization Hub. AWS rolled this out as a centralized recommendations layer that aggregates Compute Optimizer, Savings Plans, RIs, and idle resource recommendations across linked accounts. If you have an AWS Organization, enable it once at the management account and you get cross-account recommendations without per-account setup.
CloudWatch + Cost Explorer. For everything the above does not cover — idle ELBs, idle RDS, dormant Kinesis — you query CloudWatch directly for metric history and Cost Explorer for spend attribution.
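For the metric-history categories, the idle load balancer check can be sketched as a CloudWatch query plus a zero-traffic test. This is a sketch under assumptions: the helper names are illustrative, the 30-day default sits at the top of the 14-30 day window mentioned earlier, and the example dimension value in the docstring is made up. The `cloudwatch` argument is expected to be a `boto3.client("cloudwatch")`.

```python
from datetime import datetime, timedelta, timezone

# CloudWatch namespace and traffic metric per load balancer type.
TRAFFIC_METRIC = {
    "application": ("AWS/ApplicationELB", "RequestCount"),
    "network": ("AWS/NetworkELB", "NewFlowCount"),
}


def metric_query(lb_type, lb_dimension, days=30, now=None):
    """Keyword arguments for cloudwatch.get_metric_statistics().

    `lb_dimension` is the LoadBalancer dimension value, the tail of the
    ARN, for example "app/my-alb/0123456789abcdef" (a made-up value).
    """
    namespace, metric = TRAFFIC_METRIC[lb_type]
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": "LoadBalancer", "Value": lb_dimension}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": 86400,  # one datapoint per day
        "Statistics": ["Sum"],
    }


def is_idle(datapoints):
    """True when every daily sum is zero.

    The metric is not reported for periods without traffic, so an
    empty datapoint list also counts as idle.
    """
    return all(dp["Sum"] == 0 for dp in datapoints)


def idle_load_balancers(cloudwatch, load_balancers, days=30):
    """`load_balancers` is a list of (lb_type, lb_dimension) pairs."""
    idle = []
    for lb_type, lb_dimension in load_balancers:
        response = cloudwatch.get_metric_statistics(
            **metric_query(lb_type, lb_dimension, days)
        )
        if is_idle(response["Datapoints"]):
            idle.append(lb_dimension)
    return idle
```

A load balancer created inside the window also returns no datapoints, so a real script would likely cross-check the creation time before reporting it.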

A pragmatic implementation looks like a weekly cron job, ~400 lines of Python with boto3, output as a CSV in S3 with a Slack notification. We have built variants of this for clients across the 50-500 segment. The work takes a senior platform engineer about a week, and the script keeps running essentially forever with minor upkeep.
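The output side of that job (CSV in S3, Slack notification) might look like the sketch below. The column set, the message wording, and the `bucket`, `report_key`, and `webhook_url` parameters are all illustrative assumptions, not a fixed schema; the Slack call assumes a standard incoming webhook that accepts a JSON body with a `text` field.

```python
import csv
import io
import json
import urllib.request

# Hypothetical column set for the weekly report.
FIELDS = ["account_id", "region", "resource_id", "category", "est_monthly_usd"]


def render_csv(findings):
    """Render a list of finding dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(findings)
    return buf.getvalue()


def slack_summary(findings, report_key):
    """Build the JSON payload for a Slack incoming webhook."""
    total = sum(f["est_monthly_usd"] for f in findings)
    return {
        "text": (
            f"Idle resource scan: {len(findings)} findings, "
            f"about ${total:,.2f}/month. Report: {report_key}"
        )
    }


def publish(findings, bucket, report_key, webhook_url):
    """Upload the CSV to S3, then post the summary to Slack."""
    import boto3  # imported here so the helpers above run without it

    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=report_key,
        Body=render_csv(findings).encode("utf-8"),
    )
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(slack_summary(findings, report_key)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```

Writing a dated key per run (one object per week) is what turns the CSV into the audit record described below, since each review period keeps its own file.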

What scripts catch well:

All of the unambiguous categories — unattached EBS, unused EIPs, orphaned snapshots, stopped-with-storage. The easy 8-15% of waste that does not require judgment.
Anomaly-style findings (a new untagged resource appeared, an EBS volume crossed an age threshold).
Per-account inventory that survives auditors asking "do you have a record of all idle resources you reviewed last quarter."

What scripts miss:

Anything requiring multi-metric judgment without a lot of tuning. We have seen a "low CPU" script flag an in-house Postgres replica because the primary handled all writes — the script was technically correct and operationally wrong.
Dev/staging-specific patterns where the resource is meant to be idle 16 hours a day. A script that flags an EC2 instance at 4% average CPU does not know it serves the QA team only during business hours.
Anything related to commitment optimization (Savings Plans utilization, RI coverage gaps). Compute Optimizer and Cost Optimization Hub help here, but interpreting the output requires finance context the script does not have.
Strategic overprovisioning — the multi-AZ standby that has not failed over in two years, the disaster recovery region that runs warm "because we are SOC 2."

For a 50-200 person company with one platform engineer and a $30-150k monthly AWS bill, the script approach is usually correct. The waste it catches is the waste that exists; the categories it misses are smaller in absolute dollars at that scale.

For more on storage-specific patterns the scripts approach catches well, see our deeper treatment of S3 storage cost reduction.

Approach 2: Commercial platforms

The vendor landscape has consolidated. The names worth knowing in mid-2026:

CAST AI. Strongest on Kubernetes — node bin-packing, spot instance automation, Karpenter-style optimization. If most of your AWS spend is EKS, this is the platform with the deepest cluster-level automation.
Spot (now part of Flexera). Strong on spot instance management, Ocean for container workloads. Sales motion now goes through Flexera; standalone SKU still exists but expect cross-sell pressure for the broader Flexera One platform.
Flexera One. The umbrella suite, originally built around RightScale assets. Strong on FinOps reporting and multi-cloud spend attribution; less aggressive on automated remediation than CAST AI.
Vantage, ProsperOps, Zesty. Mid-tier specialists. Vantage on visibility and per-resource attribution, ProsperOps on commitment automation, Zesty on storage and instance automation.

Platforms catch what scripts miss in two ways: they correlate signals scripts cannot easily correlate, and they automate remediation in environments where engineers are unwilling to write the destructive code paths themselves.

What platforms catch well:

Dev/staging waste. This is where the platforms genuinely pay back. Automated scheduling (stop dev EC2 / RDS nights and weekends), idle EKS namespace detection, Karpenter-aware node consolidation. The waste here is real, the patterns are repeatable, and engineering will not write the cron job to do it themselves.
Spot instance automation. Diversified spot pool management, graceful interruption handling, spot-to-on-demand failover. A weekend project to build badly, a quarter to build well, and a platform's actual core competency.
Container-level rightsizing. Live cluster pod rightsizing based on observed usage, with rollback safety. Hard to get right with scripts.
Cross-account commitment optimization. Especially ProsperOps and Spot — continuous Savings Plans rebalancing as your spend mix changes. Worth real money at 200-500 person scale.

What platforms miss or get wrong:

Strategic context. A platform does not know that the idle RDS instance is the warm standby your CISO insists on for SOC 2 evidence. It will recommend termination. Engineers learn to ignore the recommendation queue; the queue gets longer; the platform pays back less than the contract assumed.
The "kills idle resources automatically" pitch. No serious platform actually does this in production at a 200-person company. What they do is queue recommendations, require human approval, and execute approved actions. The "automation" is workflow, not autonomous destruction. Read the SOW carefully.
Long tail. The unused EIP at roughly $3.65/month. The five-year-old EBS snapshot. Platforms see these, but the recommendation queue is dominated by the higher-dollar items, and the long tail stays.

For a 200-500 person company with a $200k-$1M monthly bill and meaningful dev/staging spend or significant EKS footprint, a platform usually pays back. Below 200 people, the platform's annual contract value frequently exceeds the addressable waste the platform actually catches — you need to do the math on your specific bill before signing.
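The math on your own bill can be as simple as the sketch below. Every input is an assumption you supply: `addressable_waste_pct` is the share of spend in categories the platform actually targets, and `capture_rate` is the fraction of that waste you expect the team to act on. Neither is a vendor figure.

```python
def platform_payback(
    monthly_bill_usd,
    addressable_waste_pct,
    capture_rate,
    annual_contract_usd,
    implementation_usd=0.0,
):
    """Compare expected first-year savings with first-year platform cost.

    addressable_waste_pct and capture_rate are fractions (0.15 = 15%).
    """
    annual_savings = (
        monthly_bill_usd * 12 * addressable_waste_pct * capture_rate
    )
    first_year_cost = annual_contract_usd + implementation_usd
    return {
        "annual_savings": round(annual_savings, 2),
        "first_year_cost": round(first_year_cost, 2),
        "net_first_year": round(annual_savings - first_year_cost, 2),
        "pays_back": annual_savings > first_year_cost,
    }
```

As an illustration with made-up inputs: a $100k/month bill with 15% addressable waste, half of it captured, yields $90k a year in savings. Against a $120k contract that does not pay back; the same percentages on a $500k/month bill yield $450k.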

Our broader treatment of the buy-vs-build question for FinOps tooling sits inside the FinOps framework for 50-500 person companies.

Approach 3: Quarterly manual reviews

The least fashionable approach, and frequently the highest-ROI one for the categories that matter most.

A manual review is a half-day workshop, once a quarter, with the platform lead, a representative from each product engineering team, and (for the strategic categories) the head of infrastructure. The agenda is fixed:

Walk the top 20 line items by spend. Cost Explorer grouped by service, filtered to the largest accounts. For each line item, name the workload, the owner, and the reason for the current shape.
Walk the "long-lived expensive things." Multi-AZ deployments, cross-region replication, DR environments, warm standbys, reserved capacity that was bought for a workload that has since changed. The question is not "is it idle" — it is "is it still load-bearing for the assumption that justified it."
Walk the dev/staging inventory. Not for line-item waste (scripts and platforms catch that) but for environmental sprawl. How many staging environments exist. How many of them have a current owner. Which ones can be collapsed.
Walk the data layer. S3 buckets nobody has read in 90 days. Glacier vaults from previous compliance regimes. RDS read replicas added for a load test that ended in 2024. Kinesis streams retained because nobody wants to be the one to delete them.

Three to four hours, no tooling beyond Cost Explorer and a shared doc, and the strategic waste surfaces.

What manual reviews catch well:

The strategic-context categories scripts and platforms cannot reach. Almost every meaningful manual review we have facilitated has surfaced something in the $5-50k/month range that no automated tool had flagged — because the resource was genuinely doing something, just not something that mattered anymore.
Architectural drift. Decisions made 18 months ago that no longer fit the current workload mix.
Team-level ownership questions. The conversation forces an answer to "who owns this" in a room where the answer can be recorded.

What manual reviews miss:

Everything that needs continuous attention. A quarterly review will not catch a new untagged $8,000/month EKS cluster spun up in week two of the quarter. It will be there, billing, for eleven more weeks before anyone notices.
The long tail of small items. The unattached EBS volumes are too numerous to walk individually.
Pace. If your spend is growing 15%/quarter, four review cycles a year is too few to keep up with the rate of new waste creation.

For a 50-200 person company, a quarterly manual review layered on top of scripts is what we typically recommend as the steady-state. For a 200-500 person company, the same review cadence still applies, but the scripts step is usually replaced or supplemented by a platform.

Where each approach actually sits

Approach	Best for	Catches	Misses	Time / cost
Custom scripts (AWS APIs)	50-200 person, $30-150k/month bill, one platform engineer	Unattached EBS, unused EIPs, orphaned snapshots, basic idle EC2 / RDS / ELB	Dev/staging patterns, strategic context, commitment optimization	1 week build, 2 hrs/week ongoing, AWS Support tier from $29/month (Business Support+) up to $5,000/month (Enterprise Support, new floor) if not already in place
Commercial platform (CAST AI / Spot / Flexera / Vantage / ProsperOps)	200-500 person, $200k-$1M/month, significant EKS or dev/staging	Dev/staging automation, spot management, container rightsizing, commitment rebalancing	Strategic context, long tail under platform threshold	$30k-$300k/year ACV plus implementation; 4-12 weeks to value
Quarterly manual review	All sizes, layered on top of one of the above	Strategic overprovisioning, architectural drift, ownership gaps	Continuous detection, long tail	Half-day per quarter, four people

The honest reading: scripts catch the easy wins. Platforms catch the dev/staging and commitment patterns. Manual reviews catch the strategic stuff. The teams that materially reduce waste use two of the three; the teams that buy a platform expecting it to do all three are the teams that are disappointed in six months.

The categories worth automating vs the ones that break automation

Some categories are safe to automate destructively. Some are not. The distinction matters because it determines where your engineering trust gets spent.

Safe to automate (script or platform, low risk):

Unattached EBS volumes older than 30 days — snapshot first, then delete.
Unused EIPs older than 7 days — release.
EBS snapshots older than your retention policy with a deleted source volume — delete.
Stopped EC2 instances older than 90 days — snapshot and terminate.
Empty EKS namespaces older than 30 days — delete.
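The "snapshot first, then delete" path for unattached volumes can be sketched with three EC2 calls and a dry-run default. This is a sketch under assumptions: the function name, the return shape, and the snapshot description are illustrative, and `ec2` is expected to be a `boto3.client("ec2")`. The state is re-checked at execution time because a volume flagged by last week's scan may have been reattached since.

```python
def snapshot_then_delete(ec2, volume_id, dry_run=True):
    """Snapshot an unattached volume, wait for completion, then delete it.

    With dry_run=True (the default) nothing is changed; the function
    only reports what it would do.
    """
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    if volume["State"] != "available":
        # Attached again (or being deleted) since the scan: leave it alone.
        return {"volume_id": volume_id, "action": "skipped", "snapshot_id": None}

    if dry_run:
        return {"volume_id": volume_id, "action": "would_delete", "snapshot_id": None}

    snapshot_id = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"Final snapshot of {volume_id} before cleanup",
    )["SnapshotId"]

    # Block until the snapshot is complete so the delete is recoverable.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    ec2.delete_volume(VolumeId=volume_id)
    return {"volume_id": volume_id, "action": "deleted", "snapshot_id": snapshot_id}
```

Running with `dry_run=True` for the first few weeks and comparing the "would_delete" list against what engineers expect is a cheap way to build trust before the destructive path is switched on.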

Automate with approval workflow, not destructively:

Idle EC2 instances. The false positive rate on "low CPU" is too high to act without confirmation.
Idle RDS / Aurora. A database with zero connections might be a dev DB that comes online once a week.
Idle ELBs. We have seen ELBs with zero requests for 60 days that turned out to front a critical quarterly batch job.

Do not automate — requires human decision:

Multi-AZ topology changes. The standby exists for failure modes that have not occurred this quarter; that does not mean they will not.
DR region warm standby reduction. This is a risk decision, not a cost decision.
Reserved capacity adjustments. Wrong calls compound monthly for the term of the commitment.
Expensive dev environments tied to specific workloads (ML training, performance testing). Cheaper to ask the owner than to delete and restore.
Auto Scaling Group max_size adjustments. Lowering the ceiling can cap legitimate scale-out during incidents.

The 50-200 person team that automates the first list, runs the second list through a Slack approval bot, and reviews the third list quarterly will catch >90% of addressable AWS waste at <10% of the time cost of a "kill everything idle" approach.
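The three lists above can be encoded as a small routing table so that every finding lands in exactly one lane. The category keys and the `Tier` names below are illustrative labels, not an established taxonomy; the age thresholds are the ones from the first list. Orphaned snapshots are routed to a human until a retention policy is supplied, since that list item depends on one.

```python
from enum import Enum


class Tier(Enum):
    AUTO = "automate"
    WAIT = "too young, re-check next run"
    APPROVAL = "approval workflow"
    HUMAN = "quarterly review"


# Minimum age in days before destructive automation applies.
AUTO_MIN_AGE_DAYS = {
    "unattached_ebs": 30,
    "unused_eip": 7,
    "stopped_ec2": 90,
    "empty_eks_namespace": 30,
}

# Findings that need a human "yes" before any action.
APPROVAL_CATEGORIES = {"idle_ec2", "idle_rds", "idle_elb"}


def route(category, age_days, snapshot_retention_days=None):
    """Return the handling tier for one finding."""
    if category == "orphaned_snapshot":
        if snapshot_retention_days is None:
            return Tier.HUMAN  # no retention policy defined yet
        return Tier.AUTO if age_days >= snapshot_retention_days else Tier.WAIT
    if category in AUTO_MIN_AGE_DAYS:
        return Tier.AUTO if age_days >= AUTO_MIN_AGE_DAYS[category] else Tier.WAIT
    if category in APPROVAL_CATEGORIES:
        return Tier.APPROVAL
    # Multi-AZ, DR, commitments, ASG ceilings, and anything unrecognized.
    return Tier.HUMAN
```

Defaulting unknown categories to the human lane is the deliberate choice here: a new finding type should have to be promoted into automation, never fall into it.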

What scope changes between 50-200 and 200-500 people

The same three approaches apply at both ends, but the weighting shifts.

50-200 person company:

Scripts plus a quarterly review is usually the right stack.
Trusted Advisor on a paid Support tier (not the limited free checks) is worth the cost. Note the 2027 transition — if you are currently on Developer Support, legacy Business Support, or Enterprise On-Ramp, plan the move to Business Support+ ($29/month minimum) or Enterprise Support (new $5,000/month minimum) before the January 1, 2027 discontinuation date so your Trusted Advisor coverage is not interrupted. Enterprise On-Ramp customers will be auto-upgraded to Enterprise Support through 2026; confirm the timeline with your AWS account team rather than assuming.
Compute Optimizer and Cost Optimization Hub are free and underused — start there. Compute Optimizer's ASG and commercial-license recommendations in particular are surfaces most teams have never opened.
A commercial platform is hard to justify unless you have a specific high-spend workload (large EKS footprint, heavy spot eligibility) where the platform's automation pays back its ACV.
Dev/staging waste is typically 10-20% of total spend. Scheduled shutdowns (even via a 50-line Lambda) usually capture most of it.
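A scheduled-shutdown Lambda of roughly that size might look like the sketch below. The `Environment` tag key, its `dev` and `staging` values, the 08:00-20:00 working window, and the use of UTC are all assumptions to adapt; the handler would be triggered by an hourly EventBridge schedule.

```python
from datetime import datetime, timezone


def is_off_hours(now, start_hour=8, end_hour=20):
    """True on weekends, or outside [start_hour, end_hour) on weekdays.

    Hours are interpreted in the timezone of `now`.
    """
    if now.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (start_hour <= now.hour < end_hour)


def running_instance_ids(ec2, environments=("dev", "staging")):
    """IDs of running instances whose Environment tag matches."""
    filters = [
        {"Name": "tag:Environment", "Values": list(environments)},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
    ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=filters):
        for reservation in page["Reservations"]:
            ids.extend(i["InstanceId"] for i in reservation["Instances"])
    return ids


def handler(event, context):
    """Lambda entry point: stop tagged instances outside working hours."""
    import boto3  # available by default in the Lambda Python runtime

    if not is_off_hours(datetime.now(timezone.utc)):
        return {"stopped": []}
    ec2 = boto3.client("ec2")
    ids = running_instance_ids(ec2)
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```

Instances managed by an Auto Scaling Group will simply be replaced if stopped this way, so those are better handled with scheduled scaling actions on the group itself. A matching start function on a morning schedule completes the loop.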

200-500 person company:

A platform is usually justified, especially if you have meaningful EKS spend or production spot workloads.
Scripts remain useful for the categories the platform underweights — the long tail, audit evidence, cross-account inventory.
The manual review shifts in character. It is less about "find waste" and more about "validate the platform's recommendations are being acted on, and surface the categories the platform cannot see."
Commitment optimization (Savings Plans, RI coverage) becomes its own line of work, often with ProsperOps or similar handling continuous rebalancing.

Across our 2024-2026 engagement window (roughly 30 mid-market AWS engagements), the 200-500 segment that combines a commercial platform with a disciplined quarterly review consistently lands in the 8-12% steady-state waste range. The 50-200 segment that runs scripts plus a quarterly review lands in the 10-15% range. Both are well below the Flexera benchmark; both leave waste on the table, because some waste is genuinely cheaper to tolerate than to eliminate.

For a deeper, framework-level treatment of how this fits into a broader cost program, see our Ultimate Cloud FinOps Savings Guide and the dedicated Optivulnix FinOps practice.

What we recommend by stage

Month 1. Enable Cost Optimization Hub at the management account. Enable Compute Optimizer (including ASG and commercial-license recommendations). Pull a Trusted Advisor report — if you are on the free tier or Developer Support, decide now whether the paid Support upgrade is in scope. For forward planning, the realistic targets are Business Support+ (starting at $29/month) for most 50-200 person teams, or Enterprise Support (new $5,000/month minimum) if you already had Enterprise On-Ramp or need TAM-level engagement. Avoid signing onto the legacy Business Support plan now — it is a sunset SKU through January 1, 2027.
Month 2-3. Build the scripts. Unattached EBS, unused EIPs, orphaned snapshots, stopped-with-storage, idle ELBs. Cron weekly, Slack output, CSV in S3. Approve the destructive paths for the first list above.
Month 4. First quarterly manual review. Walk top 20 line items. Surface strategic categories.
Month 6. Decide on platform. If your bill is < $150k/month and you do not have significant EKS or spot workloads, scripts plus reviews is usually the steady-state. If your bill is > $200k/month with meaningful container or spot exposure, the platform evaluation starts now.
Month 9+. Whichever combination is in place, treat steady-state waste of 8-15% as healthy. Below 8% is usually achievable only by over-engineering. Above 15% is a sign one of the three layers is not running.

FAQ

Is AWS Trusted Advisor enough on its own?

No, even on a paid Support tier. Trusted Advisor's cost optimization checks cover a useful subset — low utilization EC2, idle load balancers, underutilized RDS, unused EIPs, idle Redshift, among others. It does not cover the strategic-context categories at all, and it lags newer service types (recent EKS, Lambda, container patterns). Pair it with Compute Optimizer and Cost Optimization Hub at minimum. The complete check reference is at https://docs.aws.amazon.com/awssupport/latest/user/trusted-advisor-check-reference.html.

Does Cost Optimization Hub replace the need for scripts?

For the rightsizing and commitment recommendations it covers, yes. For inventory, audit evidence, custom tag-based detection, and categories AWS does not yet surface (idle Kinesis shards, oversize NAT Gateways), no. Most mid-market teams need both.

Can we just use the free tier of a commercial platform to test?

Sometimes. CAST AI, Vantage, and ProsperOps offer free trials or freemium tiers that let you see the recommendation queue before signing. Use the trial to validate the recommendation queue against your actual environment — the question is not whether the platform finds waste (it will) but whether your team will act on what it finds. A queue nobody acts on is not savings.

How do unattached EBS volumes get there in the first place?

EC2 termination with EBS volumes whose DeleteOnTermination attribute is set to false is the most common path. The others are CloudFormation / Terraform stack deletions that fail partway, manual instance terminations from the console, and EKS PersistentVolumes that outlive their PersistentVolumeClaims. The pattern is universal enough that auto-snapshot-then-delete after 30 days unattached is one of the safest destructive automations available.

What about idle EC2 detection — which CPU threshold is right?

There is no universal threshold. We typically use: average CPU < 5% AND average network in/out < 5 KB/s AND no SSH sessions in the last 14 days. Single-metric thresholds produce false positives. The two-week window matters: a one-week window can mislabel a normally quiet workload that only spikes on a weekly cycle, and four weeks is too slow to act on.
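The three-signal rule can be written as a single predicate, which makes the AND explicit. This is a sketch under assumptions: the `InstanceStats` shape is illustrative, 5 KB/s is taken as 5 × 1024 bytes, and `days_since_last_session` has to come from your own source (session logs or similar), since CloudWatch has no SSH metric.

```python
from dataclasses import dataclass

CPU_PCT_MAX = 5.0
NETWORK_BPS_MAX = 5 * 1024   # 5 KB/s, taking 1 KB as 1024 bytes
SESSION_QUIET_DAYS = 14


@dataclass
class InstanceStats:
    instance_id: str
    avg_cpu_pct: float                # mean CPUUtilization over the window
    avg_network_bytes_per_sec: float  # combined in + out
    days_since_last_session: float    # from your own session logs


def bytes_per_sec(total_bytes, period_seconds):
    """Convert a CloudWatch byte Sum over a period into a rate."""
    return total_bytes / period_seconds


def is_idle_candidate(stats):
    """All three signals must agree before an instance is flagged."""
    return (
        stats.avg_cpu_pct < CPU_PCT_MAX
        and stats.avg_network_bytes_per_sec < NETWORK_BPS_MAX
        and stats.days_since_last_session >= SESSION_QUIET_DAYS
    )
```

The result is a candidate for the approval queue, not a verdict: the read-replica case described earlier would still pass the CPU test and be rescued only by its network traffic.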

Which AWS Support plan should we be on heading into 2027?

Three plans are being discontinued on January 1, 2027: Developer Support, the original Business Support, and Enterprise On-Ramp. For most 50-200 person companies, Business Support+ (starting at $29/month minimum) is the right destination — it retains the full Trusted Advisor cost optimization check set and 24/7 support. For 200-500 person companies that already had Enterprise On-Ramp, expect an auto-upgrade path to Enterprise Support during 2026; the Enterprise Support minimum has been reduced from $15,000 to $5,000/month, which makes it materially more accessible than it used to be. Verify your specific transition timeline with your AWS account team — the public schedule and your individual account schedule are not always identical.

Do we need a dedicated FinOps engineer to run this?

For 50-200 person companies, no — this is 25-30% of one platform engineer's time, integrated into the platform team's normal cadence. For 200-500 person companies, a dedicated FinOps lead becomes credible somewhere around the $500k-$1M monthly bill mark. Below that, the cost of the hire exceeds the marginal waste the hire eliminates above what a focused platform engineer would have caught.

If you are trying to decide between scripts, a platform, or a mixed approach for your AWS estate, our FinOps team runs a structured assessment that produces a sized recommendation in two weeks. We do not resell any of the platforms named above; the recommendation is based on your bill shape, your team structure, and the specific waste categories that matter at your scale. Reach out via the Optivulnix FinOps practice page.

References

AWS Trusted Advisor: https://docs.aws.amazon.com/awssupport/latest/user/trusted-advisor.html
AWS Trusted Advisor check reference: https://docs.aws.amazon.com/awssupport/latest/user/trusted-advisor-check-reference.html
AWS Support plans (2027 transition context): https://docs.aws.amazon.com/awssupport/latest/user/aws-support-plans.html
AWS Compute Optimizer: https://docs.aws.amazon.com/compute-optimizer/
AWS Cost Optimization Hub: https://docs.aws.amazon.com/cost-management/latest/userguide/cost-optimization-hub.html
Flexera State of the Cloud Report: https://info.flexera.com/CM-REPORT-State-of-the-Cloud
FinOps Foundation State of FinOps: https://www.finops.org/wg/state-of-finops/

Mohit Sharma
