
5 LLM Cost Metrics Every CFO Should See on Monday Morning

Your CFO can tell you, to the cent, what the company spent on cloud infrastructure last quarter. They can break down headcount costs by department, software licenses by vendor, and travel expenses by region. Ask them what the company spent on AI last month, and you will likely get a blank stare, or worse, a confident answer that is catastrophically wrong.

This is not their fault. The tooling does not exist in most organisations to make AI spend visible at the executive level. What finance teams typically see is a single line item on an AWS or Azure invoice labelled something like “Amazon Bedrock — $47,312.89.” That number tells you almost nothing. It does not tell you whether that money was well spent, which teams drove the cost, or whether you could achieve the same outcomes for half the price.

The companies that will dominate the next phase of AI adoption are the ones that treat LLM spend with the same rigour they apply to every other operational cost. That starts with surfacing the right metrics to the right people.

Here are the five numbers your CFO should be looking at every Monday morning.

1. Cost Per Successful Completion

What it is: The total LLM spend divided by the number of requests that actually achieved their intended purpose.

Why it matters: Raw API spend is a vanity metric. A $50,000 monthly bill means nothing without context. If your customer support agent resolved 200,000 tickets with that spend, you are paying $0.25 per resolution — almost certainly cheaper than a human agent. If it resolved 500 tickets because the other 199,500 failed, hallucinated, or required human escalation anyway, you are paying $100 per resolution and would have been better off hiring temps.

Cost Per Successful Completion forces your teams to define what “success” actually means for each AI use case, and then measure whether they are achieving it economically. It is the single most important metric for determining whether your AI investment is generating returns or burning runway.

The formula:

Cost Per Successful Completion = Total LLM Spend / Count of Successful Completions

What good looks like: This varies wildly by use case, but the trend matters more than the absolute number. If this metric is climbing week over week, something is degrading — prompt quality, model performance after an update, or increased complexity of incoming requests.
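As a minimal sketch, this is how the metric falls out of a per-request log. The log fields (`cost_usd`, `success`) are illustrative, not a real provider API — in practice cost comes from token usage times the model's price, and "success" from whatever definition the team agreed on:

```python
# Sketch: Cost Per Successful Completion from a per-request log.
# Field names (cost_usd, success) are illustrative assumptions.
def cost_per_successful_completion(requests):
    total_spend = sum(r["cost_usd"] for r in requests)
    successes = sum(1 for r in requests if r["success"])
    if successes == 0:
        return float("inf")  # all spend, zero successful outcomes
    return total_spend / successes

log = [
    {"cost_usd": 0.02, "success": True},
    {"cost_usd": 0.05, "success": False},  # failed or escalated: still paid for
    {"cost_usd": 0.03, "success": True},
]
print(cost_per_successful_completion(log))  # 0.10 spend / 2 successes = 0.05
```

Note that failed requests still count toward spend — that is the whole point of the metric.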

2. Token Waste Ratio

What it is: The percentage of tokens consumed that did not contribute to the final output delivered to the user or downstream system.

Why it matters: Most enterprise AI systems are haemorrhaging tokens. Every retry, every failed function call, every overstuffed context window, and every verbose system prompt is money evaporating into the ether. In a typical RAG application, we routinely see 60–80% of input tokens consisting of retrieved context that the model never actually references in its response. You are paying to send the model a 50-page document so it can extract a single paragraph.

Token Waste Ratio makes this invisible tax visible. When your CFO sees that 72% of tokens are being wasted, they will ask the obvious question: “Can we get that number down?” And the answer, almost always, is yes.

The formula:

Token Waste Ratio = (Total Tokens − Tokens in Successful Final Output) / Total Tokens × 100

What good looks like: Below 40% for chat applications. Below 50% for RAG systems. Anything above 70% is a red flag screaming for prompt optimisation or context windowing improvements.
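A sketch of the computation, assuming you log total and output token counts per call (the field names here are illustrative; real counts come from the provider's usage metadata). Output tokens from failed calls count as waste along with everything else:

```python
# Sketch: Token Waste Ratio across a batch of calls.
# Field names (total_tokens, output_tokens, success) are assumptions.
def token_waste_ratio(calls):
    total = sum(c["total_tokens"] for c in calls)
    useful = sum(c["output_tokens"] for c in calls if c["success"])
    return (total - useful) / total * 100

calls = [
    {"total_tokens": 2500, "output_tokens": 800, "success": True},
    {"total_tokens": 1500, "output_tokens": 400, "success": False},  # retry: 100% waste
]
print(f"{token_waste_ratio(calls):.0f}%")  # (4000 - 800) / 4000 = 80%
```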

3. Model-Task Mismatch Rate

What it is: The percentage of API calls where the model used was more capable (and expensive) than the task required.

Why it matters: This is the metric that directly quantifies the “Model Laziness” problem we discussed in our previous post on Model Arbitrage. If your engineering team is routing every request to GPT-5 or Claude Opus regardless of complexity, your Mismatch Rate will be sky-high, and so will your bill.

| Task | Model Used | Model Needed | Overspend |
|---|---|---|---|
| Password reset FAQ | GPT-5.2 | Llama-3-8B | ~95% |
| Email classification | Claude Opus | Claude Haiku | ~90% |
| Sentiment tagging | GPT-5.2 | Fine-tuned BERT | ~99% |
| Contract analysis | Claude Opus | Claude Opus | 0% ✓ |

A healthy organisation should see a Mismatch Rate below 20%. If more than half your calls are mismatched, you are likely overspending by 40–60% with zero impact on output quality.

The formula:

Mismatch Rate = Count of Over-Provisioned API Calls / Total API Calls × 100
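One way to sketch this, assuming each call is annotated with the model actually used and the cheapest model that would have sufficed. Both the tier ranking and the "needed" annotation are assumptions — in practice the needed tier comes from an eval suite or a routing policy, not from the log itself:

```python
# Sketch: Model-Task Mismatch Rate. The tier ranking and the
# model_needed annotation per call are illustrative assumptions.
TIER = {"haiku": 1, "sonnet": 2, "opus": 3}

def mismatch_rate(calls):
    over = sum(1 for c in calls if TIER[c["model_used"]] > TIER[c["model_needed"]])
    return over / len(calls) * 100

calls = [
    {"model_used": "opus", "model_needed": "haiku"},    # over-provisioned
    {"model_used": "opus", "model_needed": "opus"},     # matched
    {"model_used": "sonnet", "model_needed": "haiku"},  # over-provisioned
    {"model_used": "haiku", "model_needed": "haiku"},   # matched
]
print(f"{mismatch_rate(calls):.0f}%")  # 2 of 4 calls over-provisioned = 50%
```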

4. Agent Loop Depth

What it is: The average number of LLM calls an autonomous agent makes before completing (or abandoning) a task.

Why it matters: Agentic AI is the fastest-growing cost vector in enterprise AI, and it is the hardest to predict. A well-designed agent might resolve a task in 3–5 LLM calls. A poorly designed one, or one encountering an edge case, might spiral into 50, 100, or 500 calls before hitting a timeout, each one burning tokens at frontier-model prices.

We covered the catastrophic version of this in our post on The $10,000 Weekend. But you do not need a full-blown runaway to lose money. Even a modest increase in average loop depth from 5 to 8 calls per task represents a 60% cost increase that will never show up in a standard cloud bill. It will just look like “AI costs went up.”

What good looks like: Establish baselines per agent type and track the distribution, not just the average. A mean of 6 with a max of 200 tells a very different story than a mean of 6 with a max of 9. The outliers are where the money hides.

What to track:

  • Mean loop depth per agent (is it trending up?)
  • 95th percentile loop depth (where are the outliers?)
  • Timeout/abandonment rate (how often do agents give up?)
  • Cost per loop iteration (is the agent calling Opus or Haiku for each step?)
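The points above can be sketched as a single summary over a run log. The run fields (`llm_calls`, `timed_out`) are illustrative; the example deliberately includes one runaway run to show why the max and the tail matter more than the mean:

```python
# Sketch: loop-depth distribution per agent run.
# Run-log fields (llm_calls, timed_out) are illustrative assumptions.
import statistics

def loop_depth_stats(runs):
    depths = sorted(r["llm_calls"] for r in runs)
    p95_idx = max(0, round(0.95 * len(depths)) - 1)  # nearest-rank 95th percentile
    return {
        "mean": statistics.mean(depths),
        "p95": depths[p95_idx],
        "max": depths[-1],
        "abandon_rate_pct": 100 * sum(r["timed_out"] for r in runs) / len(runs),
    }

# 19 routine runs plus one runaway: the mean looks tame, the max does not.
runs = [{"llm_calls": 5, "timed_out": False}] * 19 + [{"llm_calls": 200, "timed_out": True}]
print(loop_depth_stats(runs))
```

A single runaway run drags the mean from 5 to nearly 15 while the p95 barely moves — which is exactly why you track the full distribution.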

5. Cost Per Business Outcome

What it is: The total AI spend attributed to producing a specific, measurable business result — a closed support ticket, a processed invoice, a generated lead, a completed code review.

Why it matters: This is the metric that bridges the gap between engineering and the boardroom. Your CFO does not care about tokens. They do not care about model versions. They care about unit economics. If AI-powered invoice processing costs $0.12 per invoice compared to $4.50 for manual processing, that is a story the board understands. If AI-assisted code review costs $2.30 per pull request but catches 40% more bugs before production, that is an ROI calculation finance can model.

Without this metric, AI remains a mysterious line item that finance tolerates during good times and targets during cost cuts. With it, AI becomes a quantifiable investment with measurable returns, which is exactly what it should be.

The formula:

Cost Per Business Outcome = Total LLM Spend for Use Case / Count of Completed Business Outcomes
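A sketch of the breakdown, assuming spend is tagged with a use case at logging time (the records here are illustrative, with numbers chosen to mirror the invoice and code-review examples above):

```python
# Sketch: Cost Per Business Outcome per use case.
# Records are illustrative; spend must be attributed to a use case
# at logging time (e.g. via request tags) for this to work.
from collections import defaultdict

def cost_per_outcome(records):
    spend, outcomes = defaultdict(float), defaultdict(int)
    for r in records:
        spend[r["use_case"]] += r["cost_usd"]
        outcomes[r["use_case"]] += r["outcomes"]
    return {uc: spend[uc] / outcomes[uc] for uc in spend if outcomes[uc]}

records = [
    {"use_case": "invoices", "cost_usd": 60.0, "outcomes": 500},
    {"use_case": "invoices", "cost_usd": 30.0, "outcomes": 250},
    {"use_case": "code_review", "cost_usd": 46.0, "outcomes": 20},
]
print(cost_per_outcome(records))  # invoices: $0.12, code_review: $2.30
```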

Putting It All Together: The Monday Morning Dashboard

None of these metrics are useful in isolation. The power comes from seeing them together, on a single screen, every Monday morning. Here is what that view looks like:

| Metric | This Week | Last Week | Trend |
|---|---|---|---|
| Cost / Successful Completion | $0.31 | $0.28 | ↑ 10.7% |
| Token Waste Ratio | 58% | 62% | ↓ 4pts |
| Model-Task Mismatch | 34% | 41% | ↓ 7pts |
| Avg Agent Loop Depth | 6.2 | 5.8 | ↑ 6.9% |
| Cost / Business Outcome | $0.47 | $0.52 | ↓ 9.6% |
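The trend column is trivial to compute once the metrics exist. A sketch, using the same convention as the table: dollar metrics compared as relative change, percentage metrics as point differences:

```python
# Sketch: week-over-week trend strings for the dashboard.
# Dollar/ratio metrics use relative change; percentage metrics
# (waste, mismatch) use point differences, matching the table.
def wow_trend(this_week, last_week, as_points=False):
    if as_points:
        delta = this_week - last_week
        return f"{'↑' if delta > 0 else '↓'} {abs(delta):g}pts"
    pct = (this_week - last_week) / last_week * 100
    return f"{'↑' if pct > 0 else '↓'} {abs(pct):.1f}%"

print(wow_trend(0.31, 0.28))              # ↑ 10.7%
print(wow_trend(58, 62, as_points=True))  # ↓ 4pts
```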

A dashboard like this tells a complete story in thirty seconds. In the example above, the CFO can immediately see that while overall cost per business outcome is improving (good), the cost per completion is creeping up and agent loop depth is rising (investigate). Token waste is improving, probably because the team just optimised their RAG pipeline. Model mismatch is down, suggesting that routing improvements are taking effect.

This is the kind of operational intelligence that turns AI from a speculative expense into a managed investment.

Your AI Bill Deserves the Same Scrutiny as Your Headcount

We have entered an era where AI spend will rival — and for some companies, exceed — traditional cloud infrastructure costs. The organisations that thrive will not be the ones that spend the most on AI. They will be the ones that spend the most intelligently.

That starts with measurement. You cannot optimise what you cannot see. And right now, most enterprises are flying blind.

These five metrics give your finance team, your engineering leadership, and your board a shared language for discussing AI economics. They transform vague concerns about “AI costs going up” into specific, actionable insights that drive real optimisation.

PromptLeash surfaces all five of these metrics automatically. No custom instrumentation. No data engineering projects. Connect your LLM providers, and within minutes you will see exactly where your money is going and where it is being wasted.

Schedule a demo at promptleash.ai/contact and give your CFO something to smile about on Monday.
