In 2021, GEICO had 80% of its workloads in the public cloud and was spending over $300 million annually across eight different cloud providers — half of that with a single hyperscaler. The company had migrated more than 600 applications starting in 2014, driven by a reasonable goal: modernize aging infrastructure and innovate at scale. What it got instead was a cloud estate comprising over 200,000 cores of compute, 30,000 instances of containers and virtual machines, and a bill that had become, in the words of its own head of hardware and storage engineering, Sahid Jaffa, fundamentally unsustainable. GEICO eventually started the long process of repatriation, moving workloads off hyperscalers and onto open-standard hardware from original design manufacturers.
GEICO is not an outlier. It is a case study that scales.
The cloud migration wave of the 2010s was real, necessary, and broadly beneficial. It was also, for a significant slice of enterprise infrastructure, a decision deferred rather than a problem solved. Organizations rushed workloads to AWS and Azure under pressure to exit data center leases, meet digital transformation mandates, or simply keep pace with competitors who were doing the same. The architectural work — refactoring monolithic applications, rightsizing compute, designing cost-aware systems — was routinely skipped in favor of speed. That shortcut has compounded. The Flexera 2025 State of the Cloud Report, drawing on 759 cloud decision-makers globally, found that 27% of cloud spend is wasted — a number that has not meaningfully changed since 2019. At Gartner’s projected global cloud infrastructure spend, that figure represents somewhere between $100 billion and $182 billion in value being destroyed annually.
The industry has named the symptom — cloud waste — without fully reckoning with the cause. The cause is technical debt, migrated.
The Architecture of a Bad Decision
Technical debt, as Ward Cunningham defined it when he coined the term in 1992, is not inherently reckless. Sometimes it is a conscious trade: move fast, ship something that works, accept that you will clean it up later. The mistake is not recognizing that “later” carries interest. In software development, interest compounds as dependencies pile up, teams turn over, and the original shortcuts become load-bearing walls. The same dynamic applies to cloud architecture — except the interest is billed hourly.
The dominant migration pattern of the 2010s was lift-and-shift, or rehosting: take an application as it is, move it from a physical server to a virtual machine in the cloud, and declare migration complete. From a project management perspective, lift-and-shift is attractive. It is fast, it is low-risk in the narrow sense of not requiring code changes, and it produces the board-ready headline that the organization has moved to the cloud. From an architectural perspective, it is a vehicle for transporting every existing inefficiency into a metered billing environment.
An on-premises monolith running at 20% CPU utilization on dedicated hardware represents sunk cost. The same monolith running at 20% CPU utilization on an AWS EC2 instance is an ongoing monthly charge. Datadog’s State of Cloud Costs report found that 65% of EC2 instances had average CPU utilization below 20% over a 30-day window — a direct artifact of on-premises provisioning assumptions migrated intact. On-premises teams traditionally provisioned for peak load with a safety buffer, because hardware was a capital expense that took months to procure. That logic evaporated the moment compute became variable-cost and available in seconds. The provisioning habits did not.
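The utilization gap described above can be made mechanical. The sketch below is a hypothetical rightsizing sweep: given per-instance CPU samples, it flags anything averaging below the 20% threshold that Datadog's data uses. The instance names and sample values are invented for illustration; real figures would come from a monitoring service.

```python
# Hypothetical rightsizing check: flag instances whose average CPU
# utilization over a window falls below a threshold. All names and
# samples here are invented illustration data.

def flag_underutilized(instances, threshold_pct=20.0):
    """Return (name, avg_cpu) for instances averaging below threshold_pct."""
    flagged = []
    for name, samples in instances.items():
        avg = sum(samples) / len(samples)
        if avg < threshold_pct:
            flagged.append((name, round(avg, 1)))
    return flagged

# Simplified daily-average CPU samples for three instances.
fleet = {
    "billing-api":   [12, 14, 9, 11],   # monolith sized for on-prem peak
    "batch-runner":  [65, 70, 58, 72],  # reasonably sized
    "legacy-portal": [4, 5, 6, 3],      # near-idle, billing regardless
}

print(flag_underutilized(fleet))  # → [('billing-api', 11.5), ('legacy-portal', 4.5)]
```

Two of the three instances surface immediately; the point is that the detection logic is trivial, and what organizations lack is the habit of running it.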
Storage compounds the problem. AWS EBS volumes persist after EC2 instance termination unless explicitly configured otherwise. Azure managed disks must be deleted separately from the VMs they attach to. Flexera’s 2025 data identifies unattached storage volumes as among the top three identified waste items across all organization sizes — a finding consistent across years of reporting. Orphaned disks are not a technology problem. They are a migration-era process gap: teams that moved quickly to the cloud never established the operational discipline to decommission what they left behind.
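The orphaned-disk sweep that migration-era teams never established is equally simple in principle. This sketch runs the detection logic over hard-coded sample records so the shape is visible; in practice the records would come from a cloud inventory API (for AWS, something like boto3's describe_volumes), and the volume IDs here are invented.

```python
# Sketch of an orphaned-disk sweep: a volume with no attachments is
# billing for storage nobody is using. Sample records are invented.

def find_orphaned(volumes):
    """Return (volume_id, size_gb) for volumes with no attachment."""
    return [(v["id"], v["size_gb"]) for v in volumes if not v["attachments"]]

volumes = [
    {"id": "vol-aaa", "size_gb": 500,  "attachments": ["i-123"]},
    {"id": "vol-bbb", "size_gb": 200,  "attachments": []},  # left behind
    {"id": "vol-ccc", "size_gb": 1000, "attachments": []},  # left behind
]

orphans = find_orphaned(volumes)
total_gb = sum(size for _, size in orphans)
print(orphans, total_gb)  # → [('vol-bbb', 200), ('vol-ccc', 1000)] 1200
```

Scheduling a sweep like this is the "operational discipline" the paragraph above refers to; the technology is a list comprehension.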
Networking adds another layer. NAT Gateway charges on AWS run $0.045 per GB of data processed. In microservice architectures — themselves often a cloud-era aspiration grafted imperfectly onto legacy monoliths — internal traffic that routes unnecessarily through NAT gateways can generate costs invisible to the teams creating them. Similar traps exist in Azure’s egress pricing, GCP’s BigQuery scan costs, and the commitment discount structures all three hyperscalers offer. Reserved instances and Savings Plans can reduce compute costs by up to 66%, but they require teams to forecast and commit one to three years out. Organizations that provisioned for on-premises peak loads, moved those configurations to cloud, and then changed their workload profiles — through product pivots, customer growth, or architectural changes — found themselves paying for committed capacity they no longer needed.
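The NAT Gateway figure is worth working through, because the per-GB rate looks negligible until it is multiplied by internal traffic volume. A back-of-envelope calculation at the quoted $0.045/GB, ignoring the gateway's hourly charge and regional price variation, with an assumed traffic volume:

```python
# Back-of-envelope NAT Gateway data-processing cost at $0.045/GB.
# Hourly gateway charges and regional variation are ignored.

NAT_PER_GB = 0.045

def monthly_nat_cost(gb_per_day):
    return round(gb_per_day * 30 * NAT_PER_GB, 2)

# A chatty set of microservices pushing an assumed 2 TB/day of internal
# traffic through NAT rather than private routing:
print(monthly_nat_cost(2048))  # → 2764.8
```

Nearly $2,800 a month, recurring, for traffic that a different route would carry at little or no cost — and invisible to the team generating it unless someone attributes the line item.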
This is the structure of cloud technical debt: not a single bad decision but a cascade of deferred ones, each individually defensible, collectively costly.
The Numbers Behind the Crisis
What 27% Actually Means
A 27% waste rate sounds abstract until it is applied to real spend. Gartner projects global public cloud spending at $723 billion in 2025, up 21% from 2024. If Flexera’s 27% figure holds — and it has been consistent for six consecutive years — roughly $195 billion in cloud spend delivers no business value. Even the more conservative estimate, limiting waste to idle and unattached resources rather than all suboptimal spend, produces a figure above $100 billion. Harness projects that enterprises alone will waste $44.5 billion in cloud infrastructure costs in 2025 due to underutilized resources and misaligned spending.
These are not abstract projections. A 2024 Stacklet survey found that 78% of organizations estimate between 21% and 50% of their cloud expenditure is wasted, and more than half believe the waste rate exceeds 40%. The width of that range, from barely a fifth of spend to half of it, is itself diagnostic: organizations do not have reliable visibility into what they are actually wasting, which means they cannot fix it systematically. Flexera’s data shows that cloud budgets are being exceeded by 17% annually despite 87% of respondents citing cost efficiency as their top metric for six consecutive years. The stated priority and the operational reality are not aligned.
The Container Problem
Container spend adds a particularly acute dimension to this. Datadog’s research found that over 80% of container spend is wasted on idle resources, largely because Kubernetes pod resource requests are set manually by engineers who have no reliable feedback loop on actual consumption. A pod requesting 4GB of RAM but using 400MB is invisible to standard cloud billing dashboards. The waste persists because it is invisible — and it persists at scale because 68% of organizations saw their Kubernetes-related costs increase in the last 12 months, with roughly half of those increases exceeding 20% year-over-year, according to CNCF data.
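The 4GB-requested, 400MB-used pod can be expressed as a waste ratio, which is the metric a feedback loop would surface to the engineer who set the request. The pod names and figures below are invented; real numbers would come from the Kubernetes metrics API.

```python
# Request-vs-usage waste for Kubernetes pods. Names and figures are
# illustrative; real data would come from the metrics API.

def waste_ratio(requested_mb, used_mb):
    """Fraction of requested memory sitting idle."""
    return round(1 - used_mb / requested_mb, 2)

pods = [
    ("checkout",  4096, 400),   # the 4GB-requested, 400MB-used case
    ("search",    2048, 1800),  # close to right-sized
    ("reporting", 8192, 512),   # request set once, never revisited
]

for name, req, used in pods:
    print(name, waste_ratio(req, used))
# → checkout 0.9 / search 0.12 / reporting 0.94
```

A 0.9 waste ratio never appears on a billing dashboard, because billing sees the node, not the request — which is exactly why the waste persists.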
Only 43% of organizations track cloud costs at the unit level, meaning the majority cannot translate cloud spend into cost per product, per customer, or per feature. Without that granularity, optimization is guesswork. Teams know their monthly AWS bill increased; they do not know which product or engineering decision drove the increase. The accountability loop is broken by design.
The Invisible Incentive Structure
The cost accountability problem has a structural cause that the FinOps community has started to name clearly. 55% of developers report ignoring cost management, according to Harness. This is not indifference — it is rational behavior given how engineering organizations are typically structured. Engineers are measured on shipping features, maintaining uptime, and reducing latency. Cloud cost is someone else’s budget line, managed by a FinOps team or a finance function that reviews bills monthly, by which point the decisions are already sunk. The a16z analysis that framed this dynamic most sharply, The Cost of Cloud, a Trillion Dollar Paradox by Sarah Wang and Martin Casado, argued that cloud spend functions as a drag on software company market caps — effectively suppressing equity value by hundreds of billions of dollars across the sector.
Their most actionable case study was a company where the CFO endorsed an unusual incentive program: any engineer who saved a certain amount of cloud spend by optimizing or shutting down workloads received a spot bonus. The program paid out bonuses to 10% of the engineering organization and brought down overall spend by $3 million in six months. The ROI was unambiguous. The lesson is also uncomfortable: the only way to get engineers to treat cloud cost as a first-class metric is to make cloud cost their first-class problem.
The Repatriation Signal
The most visible symptom of the cloud technical debt crisis is the repatriation trend — companies moving workloads off public cloud and back to on-premises or colocation infrastructure. A Barclays CIO survey in Q4 2024 found that 86% of enterprise CIOs planned to move some public cloud workloads back to private cloud or on-premises environments, the highest figure the survey has ever recorded. An IDC study from the same period found that 80% of organizations expected to repatriate a share of compute or storage within 12 months. Only 8% were planning a full cloud exit — the movement is selective, not ideological.
The case studies are specific and public. Dropbox built its own colocation infrastructure between 2013 and 2016, via an internal project codenamed Magic Pocket, saving nearly $74.6 million over two years and dramatically reducing its AWS dependency. 37signals, the company behind Basecamp and HEY, began exiting AWS in 2022. Its CTO David Heinemeier Hansson had been paying $3.2 million annually for AWS compute; he spent $700,000 on Dell servers, recouped that capital cost within the first year as contracts expired, and reported savings of nearly $2 million per year in 2024 — ahead of his original estimate of $7 million over five years. By 2025, 37signals was moving its S3 storage off AWS as well, with AWS waiving $250,000 in egress fees to facilitate the exit. The total projected savings now exceed $10 million over five years.
These examples share a common architecture: predictable, steady-state workloads that do not require the elasticity cloud was designed to provide. When a workload runs at consistent utilization 24 hours a day, 365 days a year, the pay-as-you-go model offers no advantage. It charges a premium for flexibility that is never exercised. For Dropbox, the premium was storage at scale. For 37signals, it was compute for a product with known, stable load. For GEICO, it was both — compounded across eight providers and six years of accumulated commitments.
The a16z analysis captures the economics cleanly: repatriating $100 million of annual public cloud spend can translate to roughly half that amount in all-in annual total cost of ownership — accounting for server racks, real estate, cooling, network, and engineering costs. For predictable workloads, the owned-infrastructure model is simply cheaper over a three-to-five-year horizon.
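That rule of thumb reduces to simple arithmetic. The sketch below applies the roughly-half TCO ratio from the a16z analysis to the $100 million figure over a five-year horizon; the 0.5 ratio is their estimate, not a universal constant, and a real model would itemize racks, power, network, and staffing.

```python
# The a16z repatriation rule of thumb as arithmetic: owned-infrastructure
# all-in TCO at roughly half of equivalent cloud spend. The 0.5 ratio is
# their estimate; a real model would itemize the cost components.

def repatriation_savings(annual_cloud_spend, tco_ratio=0.5, years=5):
    owned = annual_cloud_spend * tco_ratio
    return (annual_cloud_spend - owned) * years

print(repatriation_savings(100_000_000))  # → 250000000.0
```

A quarter of a billion dollars over five years on $100 million of annual spend — provided, as the text stresses, the workload is predictable enough that the elasticity premium is genuinely unexercised.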
Where the Counterargument Holds
The repatriation narrative, if taken too literally, overreaches. The same Flexera report that documents 27% waste and 86% CIO repatriation intent also forecasts cloud spend increasing 28% in the coming year. Net cloud adoption is still growing. Organizations are repatriating selectively — the Flexera 2025 data shows 21% of workloads have been repatriated, while net-new cloud workloads continue to outpace exits.
The correct reading is not “cloud was a mistake” but “lift-and-shift was a mistake for workloads that warranted architecture, not migration.” Public cloud remains genuinely superior for variable demand — development and test environments, burst compute, global content delivery, and workloads where the cost of building on-premises capacity exceeds the cost of paying for elasticity. A startup without the capital to provision hardware, a retailer absorbing holiday traffic spikes, or an AI team running GPU-intensive training jobs on irregular schedules all have legitimate use cases for cloud-native deployment.
The critics also undercount the operational cost of on-premises ownership. 37signals’ CTO acknowledged that the team managing their repatriated infrastructure is the same size as before — no hidden operational dragon materialized. But that outcome reflects a company with experienced systems engineers and moderate infrastructure complexity. GEICO’s repatriation required significant investment in open-compute-standard hardware, new tooling, and operational processes the cloud had previously abstracted away. Dropbox’s Magic Pocket project took three years and required building proprietary storage infrastructure at a scale most enterprises will never reach. For the median enterprise IT team, repatriation is not a cost-cutting option — it is a capability question.
The more honest diagnosis is vendor lock-in risk. Cloud providers have engineered friction against migration by design: proprietary managed services, serverless platforms, and ML toolchains that cost little on the way in, where data ingress is free or near-free, and a great deal on the way out, where egress is metered and proprietary APIs must be replaced. AWS waived $250,000 in egress fees for 37signals specifically because the public exit was becoming a reputational event. Most organizations will not be extended that courtesy, and most have accumulated dependencies — on RDS, Lambda, Azure Cognitive Services, or GCP’s BigQuery — that make departure from any single provider far more expensive than the original architecture review anticipated.
The Path Forward: From Reactive to Structural
The FinOps movement is the industry’s primary response to cloud technical debt, and it has made meaningful progress. 59% of organizations now have or are expanding FinOps teams, up from 51% the prior year. IDC reports that 75% of Forbes Global 2000 companies have adopted some form of FinOps practice. The FinOps Foundation updated its framework in 2025 to expand scope beyond public cloud to include SaaS licensing, private cloud, and — increasingly — AI workload spend, where 98% of FinOps practitioners now report managing AI spend, up from 63% the prior year and 31% the year before that.
The problem is that FinOps, as typically practiced, is a remediation function rather than an architectural one. It finds waste after it has accumulated. It cannot prevent an engineering team from over-provisioning a Kubernetes cluster at the moment of deployment — it can only surface the utilization data 30 days later and recommend a remediation that the engineering team may or may not act on. Only 43% of organizations track costs at the unit level, which means most FinOps practices are working from aggregate billing data rather than the cost-per-feature intelligence needed to make architectural trade-offs visible at the time they matter.
The organizations closing this gap are doing three things that most are not.
First, they are pushing cost accountability to the engineering teams that create the spend. The FinOps Foundation’s own principles are explicit: “accountability of usage and cost is pushed to the edge, with engineers taking ownership of costs from architecture design to ongoing operations.” This requires tooling — cost dashboards embedded in deployment pipelines, Slack integrations that surface anomalies to the team that owns the workload, budget thresholds enforced at the CI/CD level — and it requires cultural change that engineering leadership has to sponsor, not outsource to a finance function.
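What a budget threshold "enforced at the CI/CD level" might look like in its smallest form: a gate that compares a deployment's estimated monthly cost against the owning team's limit and blocks the pipeline when it is exceeded. The team names, thresholds, and estimates here are all invented; a real pipeline would pull the estimate from a cost-estimation tool and the limit from a budget store.

```python
# Minimal deploy-time budget gate. Teams, thresholds, and estimates are
# invented; real values would come from cost tooling and a budget store.

def budget_gate(team, estimated_monthly_cost, thresholds):
    limit = thresholds.get(team)
    if limit is not None and estimated_monthly_cost > limit:
        return (f"BLOCKED: {team} estimate ${estimated_monthly_cost:,.0f} "
                f"exceeds ${limit:,.0f}")
    return "OK"

thresholds = {"payments": 12_000, "search": 8_000}
print(budget_gate("payments", 15_500, thresholds))  # blocked
print(budget_gate("search", 7_200, thresholds))     # → OK
```

The mechanism is trivial; the hard part, as the paragraph above notes, is the cultural decision to let a cost check fail a build at all.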
Second, they are conducting workload classification before making optimization decisions. Not every workload should be in public cloud. Not every workload should be on-premises. The analysis that drives the decision is simple in principle: is this workload variable or predictable? Does it require global distribution? Does it depend on managed services that would be expensive to replicate? What is the three-year total cost of ownership under each deployment model? The a16z framework argues that companies should think about repatriation before they reach scale — before inertia and lock-in strip the decision from their hands. Most did not. The opportunity now is to apply that analysis retroactively, systematically, and without ideological attachment to either cloud or on-premises as the default.
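The classification questions above can be encoded as a toy decision rule. This sketch deliberately captures only two of the axes named in the text — demand variability and managed-service dependency — and a real audit would weigh far more, including the three-year TCO comparison itself.

```python
# Toy workload-classification rule encoding two of the axes from the
# text. A real audit weighs many more factors, including full TCO.

def classify(workload):
    if workload["managed_service_dependency"]:
        return "public cloud"        # replication cost likely dominates
    if workload["demand"] == "variable":
        return "public cloud"        # elasticity is actually exercised
    return "evaluate repatriation"   # steady-state: compare 3-5 year TCO

print(classify({"demand": "steady",   "managed_service_dependency": False}))
print(classify({"demand": "variable", "managed_service_dependency": False}))
# → evaluate repatriation / public cloud
```

Note the asymmetry: the rule never outputs "repatriate," only "evaluate." The point of classification is to decide where the TCO analysis is worth running, not to replace it.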
Third, they are treating commitment discount management as a treasury function, not an engineering one. Reserved instances and Savings Plans reduce compute costs by 37% on average, but only 49% of enterprises actively manage or renew long-term pricing agreements. The gap between potential and realized discount is not a technical problem — it is a procurement and forecasting problem. Organizations that have closed it treat their cloud commitment portfolio with the same rigor they apply to software license negotiations and vendor contracts. The tools exist: AWS Cost Explorer, Azure Cost Management, and third-party platforms like ProsperOps and nOps can automate commitment recommendations and rebalancing. The gap is organizational will, not technical capability.
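The gap between potential and realized discount is a product of two numbers: the discount rate and the share of the fleet actually covered by commitments. A quick illustration, using the article's 37% average discount and assumed coverage levels:

```python
# Effective savings are discount rate times commitment coverage.
# The 37% discount follows the text; coverage levels are illustrative.

def effective_savings_pct(coverage, discount=0.37):
    return round(coverage * discount * 100, 1)

print(effective_savings_pct(0.49))  # → 18.1  (half the fleet covered)
print(effective_savings_pct(0.90))  # → 33.3  (disciplined coverage)
```

An organization can hold perfectly priced commitments and still realize half the available savings if coverage is neglected — which is why the text frames this as a treasury and forecasting function rather than an engineering one.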
The AI Multiplier
Every issue described in this article is about to get larger. AI infrastructure is the fastest-growing component of cloud spend, with 83% of cloud organizations already using or experimenting with generative AI services, according to Flexera. AI workloads are more expensive, more variable, and far less understood from a cost-attribution perspective than traditional compute. Only 63% of organizations track AI spend at all, and among those that do, most are still in basic visibility mode — understanding what they spend, not yet whether that spend delivers value.
The pattern will be familiar. Organizations rushed to cloud in 2012–2016 without building the governance infrastructure to manage what they were buying. They are now repeating the same mistake with AI, provisioning GPU clusters, committing to foundation model APIs, and standing up RAG pipelines without the unit economics discipline to know whether any of it is worth what it costs. The FinOps Foundation has been explicit about this: “just as early cloud usage led to unwieldy costs, AI spend could undoubtedly be non-optimal as well.” The wave is already visible. Cloud spend is expected to increase 28% next year, with AI as the primary driver. The question is whether organizations have learned enough from the first cycle to manage the second one differently.
Conclusion
The cloud promised to turn capital expense into operational expense, and it delivered. What the pitch elided was that operational expense, unlike capital expense, does not naturally discipline itself. A server you own sits in a rack whether it runs at 5% or 95% utilization. A virtual machine you rent bills you either way. The lift-and-shift era moved a decade of on-premises provisioning assumptions into a metered environment without changing the behavior that created those assumptions. The result is a $182 billion annual charge for infrastructure that either sits idle, runs oversized, or operates in architectures designed for a billing model that no longer applies.
The solution is not to abandon cloud. It is to stop treating cloud as a destination and start treating it as a procurement decision that requires the same continuous discipline as any other major operating expense. For some workloads, that means repatriation. For most, it means rightsizing, tagging discipline, commitment management, and engineering accountability structures that do not currently exist.
The companies that got this right early — Dropbox in 2016, Spotify with its Cost Insights tool, the unnamed enterprise whose CFO signed off on cloud-savings SPIFFs — share a common feature: they decided that infrastructure economics was a first-class engineering problem, not a finance problem that happened to involve engineers. That reframe is the one the rest of the industry still needs to make. The bill for not making it has been running for a decade, and it is still compounding.
What Executives and Practitioners Should Do Now
The strategic diagnosis is clear. The operational question is what to actually do Monday morning. The answer splits cleanly by role.
For CIOs and CTOs, cloud technical debt should be on the same risk register as software technical debt. That means quantifying it. Conduct a workload classification audit — not a cost-optimization exercise, but a genuine architectural review that asks, for each major workload: is this better served by public cloud, private cloud, or on-premises colocation, and what is the three-to-five-year TCO under each model? This is not a one-time exercise. Workload profiles change as products evolve, customer bases shift, and AI integrations alter the cost structure of downstream compute.
Data sovereignty pressure is adding urgency to this analysis. A 2025 survey found that 97% of mid-market organizations plan to move workloads off public clouds for better sovereignty, driven partly by the EU’s DORA regulation, France’s Doctrine Cloud mandate, and Germany’s GAIA-X initiative. DORA entered into force on January 17, 2025, requiring financial institutions to directly manage ICT risk including cloud concentration risk. Organizations in regulated industries that have not yet assessed their hyperscaler dependencies against these frameworks are already operating in a compliance gap.
For VP Engineering and Platform teams, the most durable lever for cost reduction is deploying cost visibility at the moment engineering decisions are made — not 30 days later when the bill arrives. This means integrating cost data into deployment pipelines: surfacing estimated monthly cost at the pull request stage, flagging oversized instances at provisioning time, and building showback dashboards that engineering teams actually use. Cloud environments with poor tagging and ownership tracking have 40% higher waste rates than those with mature tagging discipline. Tagging is unglamorous. It is also the precondition for every other optimization effort. Reserved instance coverage deserves the same rigor as any vendor contract negotiation. Organizations adopting reserved instances or savings plans reduce costs by 37% on average, but coverage remains low because the commitment process is owned by finance teams who lack the technical context, while engineering teams who have the context lack the incentive to act.
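Because tagging is the precondition for everything else, a tag-compliance sweep is usually the first piece of tooling a platform team builds. The sketch below checks a set of resource records against required tags; the resource IDs and tag keys are invented, and in practice the records would come from a cloud inventory API.

```python
# Minimal tag-compliance sweep: resources missing owner or cost-center
# tags cannot be attributed in showback. Records here are invented.

REQUIRED_TAGS = {"owner", "cost_center"}

def untagged(resources):
    """Return IDs of resources missing any required tag."""
    return [r["id"] for r in resources
            if not REQUIRED_TAGS <= set(r["tags"])]

resources = [
    {"id": "i-web-01",  "tags": {"owner": "platform", "cost_center": "1001"}},
    {"id": "i-etl-07",  "tags": {"owner": "data"}},  # missing cost_center
    {"id": "vol-old-3", "tags": {}},                 # migration leftover
]

print(untagged(resources))  # → ['i-etl-07', 'vol-old-3']
```

Every resource on that list is spend that no showback dashboard can assign to a team — which is the mechanism behind the 40% higher waste rate cited above for poorly tagged environments.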
For FinOps practitioners, the scope of the job is expanding faster than most teams have staffed for it. The FinOps Foundation’s 2026 State of FinOps data shows 98% of practitioners now managing AI spend, up from 63% the prior year. The same principles that apply to EC2 rightsizing — instrument, observe, rightsize, repeat — apply to GPU clusters and foundation model API calls. The difference is that AI spend is less transparent, more variable, and almost never attributed to a specific product outcome. Building cost-per-inference and cost-per-outcome metrics for AI workloads is the next hard problem in FinOps, and organizations that solve it first will have a meaningful advantage in managing the next cycle of cloud spend growth.
The FinOps Foundation’s core principles are instructive: centralized enablement with decentralized accountability, real-time visibility, and cost treated as a first-class engineering metric from the start of the development lifecycle — not appended as a finance review at the end. The gap between that principle and how most organizations actually operate is where the $182 billion annually goes.
