01: Metrics and Meat Shields
Welcome to our very first Friday briefing, on measuring AI adoption
This is Superadditive’s first weekly newsletter, so let’s start with what it is:
These briefings are for the people inside organizations shaping how AI actually gets used.
We want this to be worth reading on your weekend—something you can sit with and show up Monday ready to act on.
Each week, we look for the signal inside the noise: where AI is making teams smarter, where it’s quietly degrading judgment, and where we’ve seen this pattern before.
There’s no shortage of newsletters. We aim to be the best read in your inbox. If we’re not, tell us. If we are, spread the word.
Takeaways from this week
A synthesis of our top stories and what it means right now
Three things happened in the same six days, and they are worth considering together.
On April 10, Duolingo CEO Luis von Ahn said on the Silicon Valley Girl podcast that the company would no longer factor AI usage into employee performance reviews. A year earlier, in April 2025, von Ahn had declared Duolingo “AI-first” and announced that employees would be evaluated, in part, on how much they used AI in their work. That last piece is what he is walking back. Employees had started asking whether they were being graded on their work or on their tool use, and he could not give them a good answer.
We also got two separate studies that demonstrate how complicated AI integration really is. First, researchers from UCLA, MIT, Carnegie Mellon, and Oxford released a preprint showing that participants given AI help on reasoning tasks performed worse, and gave up faster, once the AI was taken away. Second, the American Psychological Association published a peer-reviewed study of 1,923 workers showing that how people interact with AI outputs (whether they accept them as-is or edit them critically) affects how they rate their own ability to think.
Clearly, understanding how people are using AI is more important than just counting whether they use it. But Shopify, Meta, Nvidia, and McKinsey are still pushing in the direction Duolingo just came back from. Meta requires engineers to hit a quota of agent-assisted code changes. Jensen Huang told Nvidia engineers earning over $500,000 that he expects them to burn through at least $250,000 in AI tokens per year. These are the modern equivalent of lines-of-code quotas. They will produce exactly what such quotas always produce, which is more of the thing being measured and less of the thing the measurement was supposed to be a proxy for.
What to say to your CEO this week: Our peer companies are installing AI-usage targets. The newest evidence says the targets measure the wrong thing. The research that came out this week shows that outcomes depend on whether people edit and push back on what AI produces, which is the opposite of what a usage quota rewards. If we need an AI metric for the board, make it about work quality with AI help, not tool use. Duolingo, a year ahead of us on this, just reversed course.
This week’s move
If your board or CEO is asking for an AI metric, the question to run at them is: would we have accepted this metric for a software tool in 2019? Nobody measured Excel usage. The tool is a means. The work is the end.
A usable AI metric has three properties. It measures outcomes, not inputs. It is resistant to gaming. And it can be computed from work that already exists, not from a new surveillance layer.
Three candidates that clear this bar:

Rework rate: the share of first-draft AI-assisted output that has to be substantially redone downstream. The Stanford and BetterUp “workslop” research puts this at around 15% of all work content in the average organization, with receivers spending close to two hours per instance cleaning it up. If your rework rate is above that, your AI is not yet a net positive.

Cycle time on well-defined tasks: the time from request to shippable deliverable on tasks with clear specs, compared to the same cohort of work last year. This one is gameable, but only in directions you want (smaller tasks, cleaner specs).

Error-catch rate: in regulated or safety-critical workflows, the share of AI-suggested outputs that a human reviewer changes before release. A high catch rate is good. A catch rate of zero means your reviewers have stopped looking.
None of these require new tooling. All three survive the question “what would this incentivize?”
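To make that concrete, here is a minimal sketch, in Python, of how the three candidates could be computed from a routine export of work items, say from a ticket tracker or review queue. The field names (ai_assisted, substantially_redone, and so on) are illustrative assumptions, not the schema of any particular tool; the point is that each metric reduces to a few lines over records most teams already keep.

from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional

@dataclass
class WorkItem:
    ai_assisted: bool                # first draft was produced with AI help
    substantially_redone: bool       # downstream rework was required
    well_specified: bool             # task had a clear spec up front
    requested: date                  # date the work was requested
    shipped: Optional[date]          # date a shippable deliverable landed, if it did
    reviewer_changed_output: Optional[bool] = None  # None if the item never went through human review

def rework_rate(items):
    # Share of AI-assisted first drafts that had to be substantially redone.
    assisted = [i for i in items if i.ai_assisted]
    if not assisted:
        return None
    return sum(i.substantially_redone for i in assisted) / len(assisted)

def cycle_time_days(items):
    # Median request-to-shippable time, in days, on well-specified tasks.
    days = [(i.shipped - i.requested).days
            for i in items if i.well_specified and i.shipped]
    return median(days) if days else None

def error_catch_rate(items):
    # Share of reviewed AI-assisted outputs a human changed before release.
    reviewed = [i for i in items
                if i.ai_assisted and i.reviewer_changed_output is not None]
    if not reviewed:
        return None
    return sum(i.reviewer_changed_output for i in reviewed) / len(reviewed)

Each output maps straight onto the thresholds above: a rework rate over the roughly 15% workslop baseline, a cycle time that is not beating last year's comparable work, and an error-catch rate drifting toward zero are the signals to watch.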
Avoid: token spend, prompts per day, hours logged with an AI tool, percent of code or copy AI-generated. These are body counts. They look precise. They reward exactly the wrong behavior, which is the finding from the Dubey study this week.
Top stories
Duolingo drops AI usage from performance reviews. CEO Luis von Ahn, who introduced the “AI-first” policy in April 2025, reversed the performance-review component in an April 10 podcast interview. Employees had raised concerns that the metric was rewarding tool use rather than work quality. Duolingo is still using AI internally and launched 148 AI-generated courses last year. The reversal is specific to how employees are evaluated. It puts Duolingo at odds with Meta, Nvidia, McKinsey, and Shopify, which are moving in the opposite direction. Fortune, April 13, 2026
New preprint: AI assistance reduces persistence and independent performance. Researchers from UCLA, MIT, Carnegie Mellon, and Oxford ran three experiments with roughly 1,200 total participants on math reasoning and reading comprehension tasks. Half got AI help. Halfway through, the AI was taken away. The AI group’s accuracy dropped sharply and they gave up on problems at higher rates than the control group. The effect held after roughly ten minutes of AI exposure. The researchers describe the risk as “boiling frog” dynamics, in which gradual erosion is hard to reverse once it becomes visible. The paper has not yet been peer-reviewed. Futurism, April 14, 2026
APA study: how workers use AI determines cognitive outcomes. A peer-reviewed study of 1,923 North American workers published April 16 in the APA journal Technology, Mind, and Behavior found that participants who modified, challenged, or rejected AI outputs reported higher confidence in their own reasoning and a stronger sense of authorship. Participants who accepted AI output without editing it reported lower executive function and what the authors call cognitive offload. The study, led by Sarah Baldeo of ID Quotient, argues AI is neither inherently harmful nor helpful. The outcome depends on whether users stay mentally engaged. PR Newswire via Yahoo Finance, April 16, 2026
Goldman Sachs: AI now eliminating roughly 16,000 U.S. jobs per month, net. New Goldman analysis released this week estimates AI substitution is wiping out about 25,000 U.S. jobs per month, with AI augmentation adding back about 9,000. A parallel Robert Half survey found 29% of companies that conducted AI-driven layoffs have already rehired workers into similar roles. Forrester’s 2026 Future of Work report puts the regret rate at 55%. A February 2026 Careerminds survey of 600 HR leaders found two-thirds had already rehired workers they eliminated for AI reasons, with 33% reporting lost institutional knowledge as a consequence. AZFamily, April 15, 2026
MIT “Rising Tides” study on AI capability growth. MIT’s FutureTech group, working with Neil Thompson, published preliminary findings from more than 17,000 worker evaluations covering 3,000-plus text-based tasks from U.S. Labor Department job categories. The headline finding is that AI capability is rising steadily across many tasks at once, not surging in sudden breakthroughs over narrow tasks. AI completed about 60% of tasks at a “minimally sufficient” level in mid-2025, up from 50% a year earlier. Only 26% of outputs were rated superior quality. The authors argue the gradual pattern gives workers and policymakers more warning than the “crashing wave” framing suggests. TechXplore, April 2, 2026
PwC: 20% of companies are capturing 74% of AI’s economic value. PwC’s April 13 AI Performance Study surveyed 1,217 senior executives across 25 sectors. The top quintile of AI adopters is pulling sharply ahead of the rest on revenue and efficiency gains tied to AI. The study frames this not as a function of tool access — most companies have the same models — but of what PwC calls “AI fitness,” a combination of governance, data foundations, and how aggressively companies reinvest gains into new revenue opportunities. The majority of companies remain stuck in pilot mode. PwC press release, April 13, 2026
Upstart class action: AI model blamed for misstated guidance. Investors filed a class action against Upstart Holdings on April 8 in the Northern District of California alleging the company failed to disclose that its Model 22 AI lending system was overreacting to macroeconomic signals and reducing borrower approvals, materially affecting revenue and financial guidance. The suit covers investors who bought Upstart stock between May and November 2025. The case is notable because it targets the gap between what the company said about its AI and what the AI was actually doing — a framing that is likely to show up in more securities litigation. NYC Today via National Today, April 8, 2026
“History does not repeat itself, but it does rhyme.”
In 1961, Robert McNamara arrived at the Pentagon from Ford, where he had been part of the Whiz Kids group that used statistical process control to turn manufacturing around. He brought the same instinct to the Vietnam conflict. The problem was that the war did not have a tonnage or defect rate to optimize. So McNamara’s team picked a metric that could be counted, which was enemy dead, and made it the primary measure of progress. Body counts were published daily.
By 1977, Army historian Douglas Kinnard surveyed the generals who had commanded in Vietnam. Only 2% of them considered the body count a valid measure of progress. One called it “a fake, totally worthless.” Another wrote that the numbers were “grossly exaggerated by many units primarily because of the incredible interest shown by people like McNamara.” The Army had won on the metric and lost on the war.
Sociologist Daniel Yankelovich described the McNamara fallacy in four steps. Measure what is easy to measure. Assign arbitrary values to what is not. Assume what is not measured is unimportant. Conclude that what is not measured does not exist. Yankelovich called the final step suicide.
The parallel to the current moment is close enough to be uncomfortable. The CEOs installing AI-usage quotas are usually doing it because a board is asking how the AI investment is going. The measurable answer is time spent, tokens consumed, and percent of output AI-assisted. The unmeasurable answer — whether the work got better, whether judgment was preserved, whether institutional knowledge was retained, whether customers could tell the difference — is the one that matters. McNamara’s original sin was not bad math. It was confusing what was countable with what was important. Duolingo, a year into running its own body count, seems to have figured this out. The question is whether the companies that follow it will need a full cycle to learn the same thing. MIT Technology Review on McNamara and Kinnard’s survey, 2013; Yankelovich’s four-step framing via Wikipedia
Potpourri
From the floor. The Stanford and BetterUp research group that coined “workslop” — AI-generated content that looks polished but does not actually advance the work — ran a follow-up survey this spring. In their original study, 40% of 1,150 full-time employees reported having received workslop in the past month, and receivers were spending close to two hours of cleanup per incident. A finding that has stuck with practitioners: roughly half of recipients said they now considered their co-workers less creative and less reliable after receiving AI-generated work from them. The productivity drain is measurable. The trust drain is worse. Harvard Business Review, September 22, 2025
Overheard. A Forrester researcher, quoted in the firm’s 2026 Future of Work report, on why so many companies regret AI-driven layoffs: too many executives “lay workers off for the future promise of AI.” The problem is not that AI cannot do the work. The problem is that it often cannot do it yet, and the cost of waiting is being paid by the people who were let go. Computerworld, November 4, 2025
Practitioner voices
Kyle Kingsbury, a distributed-systems engineer who runs the Jepsen consultancy and writes at aphyr.com, published a long essay on April 15 titled “The Future of Everything is Lies, I Guess: New Jobs.” One section anticipates a role Kingsbury calls the meat shield: the human whose job is to be accountable when an automated system fails. He writes that when Meta employs human reviewers to check automated moderation decisions, or when lawyers get sanctioned for submitting AI confabulations to a court, the company is “dangling a warm body over the maws of the legal system and public opinion.” You can fine a corporation that uses an LLM, but only a human can apologize or go to jail.
Kingsbury connects this directly to Madeleine Clare Elish’s concept of the moral crumple zone, where the human operator absorbs legal and moral blame for a mostly-automated system’s failures. The framing is useful because it names something that is showing up in the quiet rewrites of employment contracts and job descriptions: accountability is being pushed down even as decision authority is being pushed up to the model. aphyr.com, April 15, 2026



