You’ve been coding with Claude Code all day. When you look back, the sentence in your head is “I shipped three features today” — not “I burned twenty million tokens today.”
Most builders think in the first language. But the industry keeps its books in the second — APIs charge by token, investors track token consumption for growth, and NVIDIA’s blog once ran the headline “cost per token is the only metric that matters.”
These are two fundamentally different metrics. A token is an input-side metric: it measures how much compute you consumed. “Three features shipped” is an output-side metric: it measures what you produced. Input and output metrics point toward different optimization targets, an old problem in business management, but one the AI industry can no longer postpone.
Every input metric has a built-in flaw: it rewards consumption, not efficiency.
This isn’t unique to tokens. Track programmers by lines of code, and they’ll write more lines. Track managers by meetings held, and they’ll hold more meetings. Input metrics are inherently gameable — because gaming them is cheaper than gaming output metrics. Inside Amazon and Meta, employees have already been caught running unnecessary agent tasks to inflate their token numbers, because those numbers represent “AI usage activity” on internal leaderboards.
Agents amplify this flaw. When a person chats with Claude, token consumption is linear and predictable. When an agent executes a task, it loops, retries tools, expands context — its consumption curve can jump an order of magnitude at any step. Portal26, which handles enterprise billing, observed that customers are “afraid to scale agent usage because they can’t tell whether next month’s bill will be three digits or five.” When your metric is both gameable and unpredictable in an agent context, it fails at even the most basic function: cost management.
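A toy simulation makes the non-linearity concrete. Every number here is invented for illustration (per-turn token counts, failure rates, retry caps are assumptions, not measurements from any real platform): a chat grows linearly with turns, while an agent that retries failing tool calls replays an ever-growing context on every attempt.

```python
import random

random.seed(7)  # fixed seed so the demo is reproducible

def chat_tokens(turns: int, per_turn: int = 500) -> int:
    """A human chat: token use grows linearly with turns."""
    return turns * per_turn

def agent_tokens(steps: int, per_step: int = 500,
                 fail_rate: float = 0.3, max_retries: int = 5) -> int:
    """An agent loop: failed tool calls trigger retries, and every
    attempt re-sends the ever-growing context, so cost compounds."""
    total = context = 0
    for _ in range(steps):
        attempts = 1
        while random.random() < fail_rate and attempts <= max_retries:
            attempts += 1              # a retry on this step
        context += per_step            # the context window keeps growing
        total += attempts * context    # each attempt replays all of it
    return total

print("20-turn chat: ", chat_tokens(20))   # always 10,000
print("20-step agent:", agent_tokens(20))  # far larger, and sensitive to luck
```

The point of the sketch is the `attempts * context` line: consumption is a product of two quantities that each grow during the run, which is why next month’s bill is so hard to forecast.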
But even if you solved predictability, the token as a metric has a deeper problem: it tells you nothing about whether your money was well spent. A bill for eight million tokens doesn’t tell you whether those tokens completed twenty tasks or two hundred, whether the success rate was 30% or 90%, or whether the agent kept retrying the same failed approach. Tokens measure how much you burned. Not what you bought.
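The gap between “burned” and “bought” is plain arithmetic. In this sketch the price and the two teams are hypothetical (the $3 per million tokens rate is an assumed blended figure, not any provider’s published pricing), but the eight-million-token bill and the task/success-rate ranges come straight from the scenario above:

```python
# Two teams with an identical token bill, wildly different unit economics.
TOKENS_BILLED = 8_000_000
PRICE_PER_MILLION = 3.00  # assumed blended $/1M tokens, illustrative only

def cost_per_success(tasks_attempted: int, success_rate: float) -> float:
    """Cost of one *successful* task, given the same token bill."""
    bill = TOKENS_BILLED / 1_000_000 * PRICE_PER_MILLION
    successes = tasks_attempted * success_rate
    return bill / successes

# The same $24 line item on the invoice...
team_a = cost_per_success(tasks_attempted=20, success_rate=0.30)   # 6 successes
team_b = cost_per_success(tasks_attempted=200, success_rate=0.90)  # 180 successes

print(f"Team A: ${team_a:.2f} per completed task")  # $4.00
print(f"Team B: ${team_b:.2f} per completed task")  # $0.13
```

A thirty-fold difference in what the money actually bought, and the token bill alone cannot distinguish the two.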
The natural question at this point: why not just measure output directly?
The problem is that real business output — “how much more money did this agent make the company,” “how much time did it save the team” — is the hardest thing to quantify. Measuring one person’s performance is hard enough; measuring a fleet of agents is harder. This is why every economic era settles for a proxy metric: it’s imperfect, but as long as it points roughly in the same direction as real output, it beats an input metric.
Facebook’s 2012 IPO is a textbook case. What was Facebook’s real value? The density of its social graph, the frequency of user interactions, the precision of its ad targeting: none of these are easy to measure directly. So it found a proxy: Monthly Active Users (MAU). MAU isn’t real output (someone logging in doesn’t guarantee they generate ad value), but it correlates roughly with real value creation: more users, denser network, better ads. Before MAU, the internet used pageviews, an input metric: more pages, more ad slots. Facebook replaced that proxy on its way to the largest tech IPO in history to that point, not because it proved MAU was more accurate, but because it told a story the capital markets were willing to buy.
Twitter, Snapchat, WeChat all followed. Once a proxy is established by the dominant player in capital markets, it becomes the industry’s default language.
Back to AI. What the industry needs isn’t a “better token” — it’s a proxy like MAU: rough, but directionally aligned with real agent output.
Two candidates have taken shape.
One is Salesforce’s AWU (Agentic Work Unit), launched in 2026 and defined as one verifiable, discrete task completed by an agent. CEO Marc Benioff put it this way: “value isn’t in the token — it’s in what the platform does with the token.” Salesforce is already using AWU for outcome-based pricing: you pay only when the agent successfully resolves an issue.
The same month, Baidu proposed a concept called “Daily Active Agents” (DAA) at its Create 2026 conference — counting how many agents are actively working each day.
A simple analogy captures the logic behind both: AWU counts piecework, DAA counts headcount.
Piecework is closer to real output — “completed a task” already carries an output implication. But it faces a granularity problem: a customer service agent answering one question counts as one piece, a coding agent fixing a bug counts as one piece, but the difficulty and business value differ by an order of magnitude. Without weighting, this proxy mashes different value densities into a single number.
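One obvious patch for the granularity problem is to weight each completed piece by an assumed value class. The classes and weights below are invented for illustration only; they are not part of Salesforce’s published AWU definition:

```python
# Hypothetical value weights per task class (not from any published spec).
WEIGHTS = {
    "faq_answer": 1.0,       # routine customer-service reply
    "bug_fix": 8.0,          # coding task, an order of magnitude harder
    "contract_review": 25.0, # high-stakes, high-value work
}

def weighted_awu(completed_tasks: list[tuple[str, bool]]) -> float:
    """Sum weights over *verified* completions; failed tasks earn zero,
    keeping the piecework spirit of paying only for resolved work."""
    return sum(WEIGHTS[kind] for kind, verified in completed_tasks if verified)

day = [("faq_answer", True)] * 40 + [("bug_fix", True)] * 3 + [("bug_fix", False)]
print(weighted_awu(day))  # 40*1.0 + 3*8.0 = 64.0
```

The hard part is not the summation, of course, but who sets the weights and who audits the “verified” flag; an unweighted count just hides that dispute inside a single number.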
Headcount is simpler — one number, one slide. But it’s further from real output. “Active” doesn’t mean “productive” — an agent can perform useless operations in the background and still count as active. And Baidu hasn’t published what counts as “one agent,” how multi-agent collaboration is de-duplicated, or whether failed tasks count as active. Without answers, “headcount” faces the same credibility problem early MAU had: whoever reports it, wins.
Both proposals are immature. Neither has third-party auditing. Both rely on the proposer’s own platform as the counting base. Salesforce’s version at least has a market check — if customers don’t pay, the pricing changes. Baidu’s version, for now, is closer to a narrative tool.
But their importance isn’t in the specifics. It’s in the direction: two companies of completely different natures, in the same time window, independently pushing the same shift — changing the accounting metric from “how much was consumed” to “how much was completed.”
This isn’t an accounting question. It’s a product-direction switch.
When the industry treats token as the core metric, platforms optimize for making you consume more tokens — longer contexts, more complex reasoning chains, more tool calls. This isn’t conspiracy; it’s what any company’s KPI pushes toward. Token growth is revenue growth; product teams are measured by token consumption.
When the industry starts taking an output-side proxy seriously — piecework or headcount — the incentive structure flips. The platform’s optimal strategy is no longer to burn more tokens, but to help your agents complete more tasks with fewer resources. This means leaner tool calls, more accurate first-pass reasoning, fewer unnecessary retries.
Two metric orientations produce two different product directions. In the first world, the agent frameworks builders receive grow more complex — because complexity consumes tokens. In the second, frameworks grow more restrained — because restraint completes tasks faster. You probably already know which one you’d prefer.
This transition is still early. There aren’t enough strong backers yet: if Microsoft proposed its own agent proxy at a Copilot earnings call, or Google at an Android agent runtime launch, the industry would rapidly enter the “dual-reporting” transition phase where old and new metrics coexist. But the direction itself no longer needs arguing. Model inference costs are dropping by factors of 9 to 900 per year; the token as a revenue unit is depreciating in real time. When the billing unit itself is losing value while its consumption spins out of control in agent contexts, finding a new proxy stops being a question of “whether” and becomes a question of “which.”
Every era’s proxy starts as a rough draft: GDP’s 1934 debut was just a report Simon Kuznets prepared for the U.S. Senate. Rough is fine. Get the direction right, and let the market pick the rest.