
Grading OKRs is just resulting with better branding

Annie Duke's poker term for judging a decision by how it turned out is resulting, and on the last Friday of the quarter you did it in front of the whole room. Why OKR scores reward sandbaggers, punish good bets, and survive because they sound like rigour - and the one-sentence fix that changes grading Friday.


It’s the last Friday of the quarter. The OKRs are up on the screen, a number sitting next to each Key Result. 0.8. 0.6. One sad little 0.3 that everyone scrolls past a bit too quickly.

The room runs the ritual. The 0.8s get a nod. The 0.3 gets a wince and an action to “dig into what went wrong there.” Someone reminds everyone that 0.7 is the sweet spot, so really, we’re doing fine. Scores logged. Next quarter’s targets loaded. Everyone feels like they’ve been held to account.

You just spent an hour judging the quality of your decisions by the quality of your outcomes.

There’s a name for that. It isn’t accountability.

Resulting, for people who’ve never lost money at poker

Resulting is judging a decision by how it turned out. Annie Duke named it, and poker players learn it the hard way: if you grade every hand by whether you won the pot, you start praising the idiot who called all-in with rubbish and happened to hit, and bollocking the player who made the correct fold and watched the cards fall the other way.

A good decision can have a bad outcome. A bad decision can have a good outcome. On any single hand, the result tells you almost nothing about the quality of the call.

You already know this. You’d never let someone say “I bet my mortgage on red, it came up red, therefore that was sound financial planning.” It’s obviously broken. The result was a coin flip; the decision was reckless either way.

And then on the last Friday of the quarter, you put a 0.3 on the screen and call it a disappointment.

But a quarter isn’t one hand

Fair. And this is where the poker comparison hits its limit, so let me not paper over it.

Poker has a clean break. You make the call, then you have no say in the cards. That gap is the whole reason resulting is a fallacy there: the decision and the outcome are genuinely separate events.

A quarter has no such gap. It’s 90 days of continuous decisions, adjustments and execution, all of it feeding the result. The team that ran honest experiments and learned a pile made dozens of calls after the opening bet. So a low score can genuinely reflect bad execution, not only bad luck. Outcomes carry signal, just a noisy one.

Here’s the twist: that makes the score worse to grade, not safer. Because the number is carrying three things at once. The quality of the bet. The quality of the doing. And the weather, the part nobody in the room controlled. A single figure between 0 and 1 can’t tell you which is which. Was that 0.4 a sharp, well-executed bet in a market that simply didn’t move? A sound bet fumbled in delivery? A weak bet that got lucky to reach 0.4 at all? The score blends all three and hands you the blend as a verdict.

And you can’t lean on volume to wash the noise out, the way a poker player can. A quarter is three or four bets, each one different, each played once. The clock ran for 90 days; the result is still a single tangled resolution, not a sample you can trust.

Over enough bets and enough quarters, a pattern does surface. A team that consistently places good bets wins more of them over the year. That’s an argument for watching the quality of your bets over time, not for reading this quarter’s number as a grade.
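The small-sample point can be put in rough numbers. What follows is a minimal sketch under loudly simplified assumptions, none of which come from real data: every bet either lands or doesn't, bets are independent, one team's bets land 60% of the time and another's 40%. Real quarters are messier than this, since each bet is different, so treat it as an illustration of the arithmetic rather than a model of your portfolio.

```python
from math import comb


def win_count_distribution(n_bets: int, p_win: float) -> list[float]:
    """Probability of winning exactly k of n_bets, for k = 0..n_bets."""
    return [
        comb(n_bets, k) * p_win**k * (1 - p_win) ** (n_bets - k)
        for k in range(n_bets + 1)
    ]


def chance_stronger_team_does_not_lead(
    n_bets: int, p_strong: float, p_weak: float
) -> float:
    """Probability the stronger team wins no more bets than the weaker one."""
    strong = win_count_distribution(n_bets, p_strong)
    weak = win_count_distribution(n_bets, p_weak)
    # For each strong-team win count k, add the chance the weak team wins >= k.
    return sum(strong[k] * sum(weak[k:]) for k in range(n_bets + 1))


# Assumed numbers, for illustration only: 60% vs 40% chance of a bet landing.
one_quarter = chance_stronger_team_does_not_lead(4, 0.6, 0.4)    # 4 bets
ten_quarters = chance_stronger_team_does_not_lead(40, 0.6, 0.4)  # 40 bets

print(f"one quarter:  {one_quarter:.2f}")
print(f"ten quarters: {ten_quarters:.2f}")
```

With these assumed odds, the team placing better bets fails to come out ahead in roughly four quarters out of ten. Across forty bets the same figure drops below one in ten. The quarter's scoreboard is the four-bet case.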

What the number is actually measuring

Picture two Key Results.

The first was a real bet. The team looked at a genuinely uncertain market, picked a hard target, ran honest experiments, executed well, and landed at 0.4 because the world didn’t move the way anyone hoped. Good call, bad cards.

The second was sandbagged. The team quietly picked a target they were already 90% sure they’d hit, did the obvious work, and cruised to a 1.0. No risk taken. Nothing learned. A foregone conclusion wearing the costume of an ambitious goal.

On grading Friday, the 0.4 gets the wince and the 1.0 gets the nod.

You’ve just rewarded the worst decision in the room and punished the best one. And you’ve taught everyone watching exactly what to do next quarter.

It cuts both ways, and the upward cut is worse

Most people, when they finally clock resulting, worry about the unfair 0.3. The good bet that got marked down.

The dangerous one is the other direction. The 1.0 that was luck or sandbagging gets canonised. The approach that produced it becomes “what good looks like.” It goes in the deck. It gets copied to other teams. You scale the thing that happened to win, with no idea whether the decision behind it was any good.

A team that grades outcomes drifts toward hittable targets. They get safer every quarter, because safe targets score well and ambitious bets score badly, and the spreadsheet doesn’t know the difference between a sandbag and a triumph. Goodhart’s law, on a quarterly timer.

The branding bit
#

Here’s why it survives when roadmap grading and velocity worship get laughed out of the room.

The score feels like rigour. A number between 0 and 1, logged against a target, reviewed on a cadence. It has the grammar of measurement. It looks like the opposite of hand-waving. Leaders reach for it because “we scored 0.68 against our objectives” sounds defensible in a way that “we placed some sensible bets and a few didn’t land” never does, even when the second sentence is the more honest one.

So resulting gets a dashboard, a quarterly ceremony, and a respectable name. Same bias a poker player would recognise in a heartbeat, with better branding.

The steelman, and the part that actually bites

The serious OKR people are ahead of me on some of this. The “0.7 is the sweet spot” convention exists precisely to punish sandbagging: hit 1.0, and you’re told you aimed too low. The better guidance has said for years that grades are meant to start a conversation, not end one. So before anyone tells me I’m flogging a 2014 ritual: I know the good version exists.

Hold the good version in your head, though, and the real problem gets sharper, because now you have to pick your poison.

A score has one honest virtue: it’s hard to fake after the fact. You hit the number, or you didn’t. What it invites instead is faking before the fact, which is what sandbagging is. You game the scoreboard by setting the bar low. The 0.7 norm is a patch for that, and a leaky one, because a culture can learn to sandbag to 0.7 as happily as to 1.0.

Now grade the bet instead of the result, which is what I’ve been pointing to. That kills the sandbag because a safe bet is, by definition, a bad bet. But it opens the opposite hole. Once the outcome is known, “it was a brave, well-reasoned bet” is the easiest story in the world to tell, and the most articulate person in the room tells it best. Grade bets carelessly and you’ve traded a scoreboard you can rig in advance for a story you can rewrite in hindsight.

So neither end is safe alone. The score resists the story and invites the sandbag. The bet resists the sandbag and invites the story. There’s only one thing that closes both holes, and it’s the same move either way. You write it down first.

Grade what you wrote down before you knew

A score grades the destination. The road you took, the only part you get to keep, goes unmarked. So grade the road, under one rule that keeps it honest: you don’t get to invent the standard after you’ve seen the result.

Which means the bet has to be written down before the quarter, in a form that can embarrass you later.

The bet, and what would prove it wrong. “Improve retention” won’t do it. The sentence has to have teeth: “We believe personalised re-engagement lifts 30-day retention; if it hasn’t moved 5 points by quarter end, the bet is dead, and the money moves.” Now “was it a good bet” has an anchor that isn’t a story. You’re holding the call to a standard you set before you knew the answer. That’s the thing poker has, and OKR grading throws away: the odds were knowable before the cards turned.

The learning, measured against what you said you’d learn. “We learned loads” doesn’t count, because everyone learned loads. What did you believe in week one that you don’t believe now, and did you kill the bets that earned killing? A hypothesis you wrote down and can hold up against reality is hard to fake. The free-floating “we did our best” is the easiest fake there is.

And yes, this is gameable too. Put “bets killed” on the wall as the number to beat, and teams will manufacture theatrical kills and perform their learning, the same sandbag in a new costume. Goodhart doesn’t spare my metric just because I prefer it. A tighter measure won’t save you. The guard is that you’re checking pre-committed reasoning against what actually happened, out loud, and a bet you timestamped in week one is much harder to narrate your way out of than a result you’re explaining after the fact. The honesty comes from the timestamp.
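For teams that keep their OKRs in a tool rather than a slide, the timestamped bet can be sketched as a record that is written once and then only read. This is a hypothetical shape, not a prescribed format: the `Bet` class, its field names, the `review` function and the dates are all made up for illustration, and the only figures taken from the example above are the retention belief and the 5-point threshold.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: fields can't be reassigned by normal assignment
class Bet:
    belief: str
    metric: str
    minimum_lift_points: float  # below this by the deadline, the bet is dead
    deadline: date
    written_on: date

    def __post_init__(self) -> None:
        # The standard must be set before the result is known.
        if self.written_on >= self.deadline:
            raise ValueError("kill criterion must be written before the deadline")


def review(bet: Bet, observed_lift_points: float, reviewed_on: date) -> str:
    """Hold the result against the standard written down in advance."""
    if reviewed_on < bet.deadline:
        return "open"
    if observed_lift_points >= bet.minimum_lift_points:
        return "alive"
    return "dead"


retention_bet = Bet(
    belief="Personalised re-engagement lifts 30-day retention",
    metric="30-day retention",
    minimum_lift_points=5.0,
    deadline=date(2025, 3, 31),   # hypothetical quarter end
    written_on=date(2025, 1, 6),  # hypothetical week-one timestamp
)

# A 2-point lift at quarter end falls short of the pre-committed 5 points.
print(review(retention_bet, observed_lift_points=2.0, reviewed_on=date(2025, 3, 31)))
# prints "dead"
```

The verdict only says whether the pre-committed kill criterion fired. Whether the call was sound is still an argument for the room, now anchored to what was written in week one.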

One honest limit, because it’s a different fight. Grading the bet well tells you whether the call was sound. It doesn’t widen the motorway. A good bet placed on a system that can’t carry it still stalls, and no grading ritual fixes that, which is the argument in You Don’t Rise to the Level of Your Goals, You Flow to the Level of Your Systems. This piece is about judging the call. That one’s about building the road.

Why we won’t, and the small thing that helps

The honest reason teams keep grading outcomes is that the number is easy, and bet quality is hard to defend upward. “We killed three bets and learned our retention assumption was wrong” doesn’t fit in a cell. A score does. It’s comforting precisely because it hides everything that matters.

So here’s the small move, and it costs one sentence per Key Result. At the start of next quarter, before any work begins, write one sentence next to each one: “this bet is dead if ___ hasn’t happened by ___.” That’s it.

Annie Duke calls these kill criteria. Her argument in Quit is that you have to set them while you can still think straight, because in the moment, you’ll always find a reason to push on. Sunk cost doesn’t show up on day one.

Do that, and grading Friday changes character. You stop inventing a story to fit a number. You hold this quarter’s result against a standard your past self set before anyone knew how it would land. The sandbagged bet has no kill criteria worth the name, and it shows. The brave one that missed has people reading what you wrote in week one and arguing about the call, which is the conversation you actually wanted.

The scoreboard was never the point. The quality of the bets was. One of those you can improve. The other one you just read out.

When did your team last call a low score a good decision, out loud, in the room? That’s the tell. If the answer is never, you’re not grading performance. You’re resulting.

Paul Brown
Author
Partner at Thrivve Partners. Product and flow practitioner. I help organisations build delivery systems that scale capability, not dependency, and shape product decisions with evidence rather than opinion. ProKanban Trainer.
