The 5 numbers that saved £2M


The Platform Fix | Issue #008

Hello Reader—

At 3am last Tuesday, alerts started screaming.

“CRITICAL: CPU usage at 95%! Memory at 87%! Disk I/O spiking!”

The platform team scrambled. Emergency scaling. Incident calls. War room activated.

Six hours later, they discovered the truth:
Every single alert was meaningless.

The platform was handling traffic perfectly. Users were happy. Revenue was flowing.

They’d spent £2M building dashboards that tracked everything except what actually mattered.

Here’s what I learned from 50+ platform disasters: The metrics you’re obsessing over are probably the wrong ones.


THE VANITY METRICS EPIDEMIC

After analysing failed platforms, I found a terrifying pattern:

90% of platform teams track metrics that don’t predict problems.

The “standard” dashboard:

  • ✅ CPU utilisation (meaningless without context)
  • ✅ Memory usage (rarely the real bottleneck)
  • ✅ Pod count (vanity metric extraordinaire)
  • ✅ Deployment frequency (quantity ≠ quality)
  • ✅ Uptime percentage (hides user experience)

What they DON’T track:

  • Time to detect real user impact
  • Mean time to customer resolution
  • Developer velocity (features shipped)
  • Business metric correlation
  • Actual user experience

Take one financial services client: their “99.9% uptime” dashboard showed green while customers couldn’t log in for 3 hours. The login service was “up”; it was just returning errors.
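To make the gap concrete, here is a minimal Python sketch that contrasts an "is it up?" check with a request success rate. The `Request` type, the sample numbers and the choice to count only 5xx responses as failures are all assumptions for illustration, not the client's actual monitoring.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int  # HTTP status code returned to the user


def is_up(requests: list[Request]) -> bool:
    """Process-level 'uptime': the service answered at all."""
    return len(requests) > 0


def success_rate(requests: list[Request]) -> float:
    """Share of requests that did not end in a server error (5xx)."""
    if not requests:
        return 0.0
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


# Hypothetical sample: the login service responds, but mostly with errors.
sample = [Request(200)] * 5 + [Request(500)] * 95

print(is_up(sample))         # True
print(success_rate(sample))  # 0.05
```

Both numbers describe the same traffic. Only the second one says anything about whether customers could log in.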


THE 5 METRICS THAT ACTUALLY PREDICT DISASTER

After rescuing 50+ platforms, these 5 numbers tell me everything:

1. Time to Customer Impact Detection

What it measures: How long between a problem starting and you knowing customers are affected
Target: Under 2 minutes
Why it matters: Infrastructure metrics lie. Customer impact is truth.
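As a rough sketch, this metric is just the gap between two timestamps, compared against the 2-minute target. The function name and the sample timestamps below are hypothetical. In practice the hard part is deciding when impact actually started, which this sketch takes as given.

```python
from datetime import datetime, timedelta

TARGET = timedelta(minutes=2)  # target quoted in the text above


def detection_delay(impact_started: datetime, detected: datetime) -> timedelta:
    """Time between customers first being affected and the team knowing."""
    if detected < impact_started:
        raise ValueError("detection cannot precede the start of impact")
    return detected - impact_started


# Hypothetical incident: impact at 03:00:00, detected at 03:01:30.
started = datetime(2025, 1, 1, 3, 0, 0)
detected = datetime(2025, 1, 1, 3, 1, 30)

delay = detection_delay(started, detected)
print(delay, delay <= TARGET)  # 0:01:30 True
```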

2. Developer Deployment Confidence Score

What it measures: % of deployments that happen without fear
Target: 90%+
Why it matters: If developers are scared to deploy, your platform is broken

3. Mean Time to Business Resolution

What it measures: Time from customer impact to business problem solved
Target: Under 15 minutes
Why it matters: Technical fixes don’t matter if customers are still suffering

4. Platform Adoption Velocity

What it measures: % of new services choosing your platform vs. alternatives
Target: 80%+
Why it matters: Developers vote with their feet. Low adoption = platform failure

5. Revenue-Correlated Incidents

What it measures: % of critical alerts that actually correlate with revenue drops
Target: 90%+ of critical alerts should show business impact
Why it matters: If your alerts don’t affect money, they’re noise
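One simple way to approximate metric 5 is to count how many critical alerts fired close to a known revenue drop. The sketch below assumes you already have a list of drop windows from your business data. The 10-minute `slack`, the function name and the sample data are illustrative assumptions, and "fired near a drop" is a crude stand-in for real correlation.

```python
from datetime import datetime, timedelta

Window = tuple[datetime, datetime]  # (start, end) of a revenue drop


def revenue_correlated_share(
    alert_times: list[datetime],
    drop_windows: list[Window],
    slack: timedelta = timedelta(minutes=10),
) -> float:
    """Fraction of alerts that fired within `slack` of a revenue-drop window."""
    if not alert_times:
        return 0.0

    def near_drop(t: datetime) -> bool:
        return any(start - slack <= t <= end + slack for start, end in drop_windows)

    return sum(1 for t in alert_times if near_drop(t)) / len(alert_times)


# Hypothetical data: four critical alerts, one revenue drop from 14:00 to 15:00.
day = datetime(2025, 1, 1)
alerts = [
    day.replace(hour=3),
    day.replace(hour=9),
    day.replace(hour=14, minute=5),
    day.replace(hour=22),
]
drops = [(day.replace(hour=14), day.replace(hour=15))]

share = revenue_correlated_share(alerts, drops)
print(f"{share:.0%} of critical alerts coincided with a revenue drop")  # 25%
```

Against the 90%+ target, a result like 25% would suggest most of those critical alerts are candidates for demotion or deletion.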


REAL STORY: THE DASHBOARD THAT SAVED £5M

E-commerce client. Black Friday approaching. Their dashboard showed:

  • ✅ All systems green
  • ✅ 99.8% uptime
  • ✅ CPU under 60%
  • ✅ Memory stable

But their Customer Impact Detection metric was spiking. Checkout completion rate dropped 15%.

Investigation revealed: Payment gateway was responding with 200 OK… but processing 0 transactions.

Traditional monitoring: “Everything’s fine!”
Business-focused metrics: “We’re losing £50K/hour!”

They fixed it in 12 minutes. Saved Black Friday. The “perfect” infrastructure metrics would have hidden the disaster until Monday morning.
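The failure mode in this story (200 OK responses with no completed transactions) can be caught by comparing the two counts over the same time window. Here is a minimal sketch. The `GatewayWindow` type, the 0.5 threshold and the sample counts are assumptions for illustration, not the client's real check.

```python
from dataclasses import dataclass


@dataclass
class GatewayWindow:
    ok_responses: int            # HTTP 200 responses in the window
    completed_transactions: int  # payments actually processed in the window


def classify(window: GatewayWindow, min_completion_ratio: float = 0.5) -> str:
    """Label a window by comparing completed payments to OK responses."""
    if window.ok_responses == 0:
        return "no_traffic"  # nothing to judge in this window
    ratio = window.completed_transactions / window.ok_responses
    return "healthy" if ratio >= min_completion_ratio else "silent_failure"


print(classify(GatewayWindow(1200, 1100)))  # healthy
print(classify(GatewayWindow(1200, 0)))     # silent_failure
```

A status-code check alone would report both windows as fine, because every response in both is a 200.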


THE METRIC AUDIT FRAMEWORK

This week, audit your dashboards:

Step 1: The Business Impact Test

For each metric, ask: “If this number changes, does revenue change?” If no, delete it.

Step 2: The Action Test

For each alert, ask: “What specific action does this trigger?” If the answer is “investigate,” it’s noise.

Step 3: The Customer Test

For each dashboard, ask: “Does this tell me about customer experience?” If no, you’re tracking vanity.

Step 4: The Prediction Test

For each metric, ask: “Does this predict problems before customers notice?” If no, you’re always reactive.

One client deleted 73% of their metrics using this framework. Incident response time dropped from 45 minutes to 8 minutes.
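If you keep your metric inventory in a spreadsheet or script, the four tests can be recorded as yes/no answers per metric. The sketch below is one hypothetical way to do that. It applies all four tests to a single record for simplicity, although the steps above apply them to metrics, alerts and dashboards separately. The field names and sample metrics are made up.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    moves_with_revenue: bool            # Step 1: Business Impact Test
    triggers_specific_action: bool      # Step 2: Action Test
    reflects_customer_experience: bool  # Step 3: Customer Test
    predicts_problems: bool             # Step 4: Prediction Test


TESTS = [
    ("business impact", "moves_with_revenue"),
    ("action", "triggers_specific_action"),
    ("customer", "reflects_customer_experience"),
    ("prediction", "predicts_problems"),
]


def failed_tests(metric: Metric) -> list[str]:
    """Names of the audit tests this metric does not pass."""
    return [label for label, attr in TESTS if not getattr(metric, attr)]


# Hypothetical inventory with answers filled in by the team.
inventory = [
    Metric("pod_count", False, False, False, False),
    Metric("checkout_completion_rate", True, True, True, True),
]

for m in inventory:
    print(m.name, failed_tests(m))
```

Metrics with an empty list pass the audit. What to do with partial failures is a judgement call for your team.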


THIS WEEK’S PLATFORM PSYCHOLOGY INSIGHT

From last week’s Hero Dependency Assessments (thank you to the 127 who took it!):

Average hero dependency score: 7.2/10 (dangerously high)

But here’s the pattern: Teams with the LOWEST hero dependency had the BEST metrics.

Why? Heroes hide problems with manual fixes. When you remove heroes, you’re forced to build systems that actually work.

The uncomfortable truth: Your hero is probably covering up metric failures you don’t even know about.


YOUR 5-MINUTE METRIC MAKEOVER

Replace these vanity metrics with business metrics:

Instead of: CPU utilisation
Track: Response time at 95th percentile

Instead of: Pod count
Track: Cost per customer transaction

Instead of: Deployment frequency
Track: Features shipped per sprint

Instead of: Memory usage
Track: User session success rate

Instead of: Uptime percentage
Track: Revenue-impacting incidents
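For the first swap, here is a minimal sketch of how a 95th-percentile response time can be computed with the nearest-rank method. The function name and the sample latencies are assumptions for illustration. Most monitoring tools compute percentiles for you, often with a slightly different interpolation.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in ms: 90 fast requests, 10 very slow ones.
latencies_ms = [100] * 90 + [2000] * 10

print(sum(latencies_ms) / len(latencies_ms))      # 290.0
print(percentile_nearest_rank(latencies_ms, 95))  # 2000
```

The average looks tolerable at 290 ms, while the 95th percentile shows that one request in ten is taking two seconds.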


READER SUCCESS: THE METRIC REVOLUTION

“Steve, we implemented your 5 metrics last month. Yesterday, our Customer Impact Detection caught a problem 47 minutes before our old monitoring would have. Saved us £200K in lost sales and our reputation with a major client. Our CEO now asks for these numbers in every board meeting.”

  • Platform Lead, Retail (Manchester)

EXCLUSIVE: SLASH PLATFORM COSTS 40% WHILE DOUBLING PERFORMANCE

Speaking of metrics that matter… I’m hosting an exclusive masterclass where I’ll show you exactly how to:

  • Cut platform costs by 40% using the 5-Minute Complexity Audit
  • Double performance with the Revenue Recovery Framework
  • Live architecture teardown - I’ll review real platforms and show what to delete
  • Post-cleanup playbook - Your step-by-step guide to sustainable platforms

When: August 19, 2025, 2:00pm GMT
Who: Limited to 12 platform leaders
Investment: Free (but the savings will be massive)

REGISTER NOW - Only 4 Spots Remaining

This session will transform how you think about platform costs and performance.


📢 FROM THE NEWSLETTER TO THE STAGE

I'll be at BitSummit Hamburg (Sept 4th) sharing the full story behind our biggest platform transformation - the one that started with pink post-its and ended with GitOps clarity.

"From Console Chaos to GitOps Clarity: A FinTech Transformation Tale"

Newsletter readers get 15% off with code: STEVE_BITSUMMIT
Register: https://bitsummitapp.eventify.io/t2/tickets

See you in Hamburg? Reply and let me know!


WHAT’S COMING NEXT WEEK

Issue #009: “The Platform Team That Fired Themselves (And Why It Worked)”

  • When platform teams become the bottleneck
  • The self-service revolution that saved £3M
  • Your platform autonomy scorecard

Plus: Live results from this week’s Metric Makeovers!


Track what matters. Ignore the noise.

Steve

P.S. That £2M dashboard disaster? We replaced 47 metrics with 5. Incident detection improved 600%. Sometimes less really is more.

P.P.S. In the masterclass, I’ll show you the exact metrics that predicted 3 major platform failures before they happened. These numbers don’t lie.

© 2025 Steven Wade Consulting Ltd
