Steve Wade

Platform Engineering leaders are drowning in failed Kubernetes migrations. Get weekly stories of £3M disasters turned into 30-day wins, plus frameworks that actually work. No fluff, just battle-tested CNCF insights.

Featured Post

Your 'bus factor' is probably 1 (here's the test)

The Platform Fix Hello Reader— Your lead platform engineer books a two-week holiday. Your first thought: "Can we survive without them?" If the answer is "I don't know" or "probably not," you have what I call a Bus Factor of 1. One person gets hit by a bus (or finds a better job), and your platform collapses. I've tested this pattern across 47 enterprise platforms. 41 had a Bus Factor of 1 or 2. Most didn't realise it until that critical person resigned. The Bus Factor Reality Check: Here are...

The Platform Fix | Issue #013 Hello Reader— Remember that fintech where 3 engineers quit? We found something else. One K8s feature burning £67k/month. They’d enabled it because “everyone does.” Sound familiar? The Cluster Autoscaler Conspiracy: Here’s what vendors don’t tell you about autoscaling: 73% of clusters scale for peaks that happen <1% of the time. Average utilisation: 23%. You’re paying for 77% air. The math is brutal: 100 nodes provisioned. 23 actually needed. £50k+/month up in smoke....
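The arithmetic in the autoscaler teaser above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `idle_capacity` and its parameters are hypothetical, and treating the £67k/month figure as the total node bill is an assumption, since the excerpt does not say how the £67k and £50k+ figures relate.

```python
def idle_capacity(provisioned_nodes: int, needed_nodes: int,
                  monthly_bill_gbp: float) -> tuple[float, float]:
    """Return (idle fraction, monthly spend attributable to idle nodes).

    Hypothetical helper: assumes every node costs the same, so spend on
    idle capacity is proportional to the share of nodes not needed.
    """
    if provisioned_nodes <= 0 or not 0 <= needed_nodes <= provisioned_nodes:
        raise ValueError("need 0 <= needed_nodes <= provisioned_nodes, provisioned > 0")
    idle_fraction = (provisioned_nodes - needed_nodes) / provisioned_nodes
    return idle_fraction, idle_fraction * monthly_bill_gbp


# Figures from the teaser: 100 nodes provisioned, 23 actually needed.
# ASSUMPTION: the £67k/month is read here as the total monthly node bill.
fraction, wasted = idle_capacity(100, 23, 67_000)
print(f"{fraction:.0%} idle, roughly £{wasted:,.0f}/month")
```

Under that reading, the idle share comes out at 77% and the idle spend at a little over £50k a month, which is consistent with the teaser's "£50k+/month" line. A different cost basis would change the monetary figure but not the idle fraction.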

The Platform Fix | Issue #013 Hello Reader— Last week, three platform engineers handed in their notice at a UK fintech. All on the same day. The CEO called me in a panic. "Steve, they're our best people. What the hell happened?" I'll tell you what happened. And it's happening at your company too. The Brutal Truth: Your platform engineers aren't engineers anymore. They're YAML therapists. The numbers don't lie: 74% of their time on "keeping lights on". Average 3am wake-up calls per week: 4...

The Platform Fix | Issue #012 Hello Reader— Just got back from Hamburg. 30 platform conversations. 3 interventions. But one haunts me… A CTO pulled me aside: “Steve, we’re already falling. How do we land without dying?” Pinterest called their K8s failure “one-in-a-million.” I’ve seen that exact pattern 47 times. This year alone. Every Doomed Migration Follows Three Stages: Stage 1: Honeymoon (Months 1-3) “Ahead of schedule!” Team excited. Everything simple. Stage 2: Struggle (Months 4-9)...

The Platform Fix | Issue #011 Hello Reader— I'm writing this from my hotel in Hamburg. In 3 hours, I'll be on stage at Bit Summit talking about platform simplification. Deutsche Bank and ING will be in the front row. But here's what I won't mention in my talk: "Steve, we need Istio. Everyone's using service mesh." Those words cost a UK retail bank £2M and 18 months. Last Tuesday, we ripped it all out. Service Mesh Is Like Insurance: It sounds responsible until you read the fine print. What...

The Platform Fix Hello Reader— Last week, I reviewed a platform with 47 monitoring tools. 47. That’s not architecture. That’s hoarding with a YAML addiction. Today, I’m done being polite about platform complexity. Main Teaching: I’ve just published something that might get me uninvited from a few conferences: The Pragmatic CNCF Manifesto. After 50+ migrations and £100M+ in complexity eliminated, I’ve written the guide I wish existed when I started. The one vendors don’t want you to read. The...

The Platform Fix | Issue #009 Hello Reader— “We’re shutting down the platform team.” The Slack channel went silent. 15 engineers. £4.2M annual budget. Gone. But here’s the twist: It was their idea. Six months later, deployment frequency increased 400%. Developer satisfaction hit 9.2/10. Platform costs dropped £3M annually. The platform team didn’t get fired. They got promoted to “Product Engineering” and became the most valuable team in the company. Here’s how they did it - and why your...

The Platform Fix | Issue #008 Hello Reader— At 3am last Tuesday, alerts started screaming. “CRITICAL: CPU usage at 95%! Memory at 87%! Disk I/O spiking!” The platform team scrambled. Emergency scaling. Incident calls. War room activated. Six hours later, they discovered the truth: Every single alert was meaningless. The platform was handling traffic perfectly. Users were happy. Revenue was flowing. They’d spent £2M building dashboards that tracked everything except what actually mattered....

The Platform Fix | Issue #007 Hello Reader— One Monday, I got the call every CTO dreads. “Steve, our 10X engineer just quit. The platform is completely down. We can’t deploy anything. The board is asking if we should shut down the entire engineering division.” Three years. £10M invested. One person held it all together. When he left, everything collapsed in 72 hours. Here’s the uncomfortable truth: Your platform heroes aren’t saving you. They’re slowly killing your business. THE £500K HERO...

The Platform Fix | Issue #006 At 4am, James was a hero. Again. He’d fixed production. Saved the company. Everyone would thank him on Monday. Six months later, James burned out and quit. The platform collapsed within a week. Your heroes aren’t saving your platform. They’re hiding its failures. THE HERO PARADOX™ Every failing platform has the same story: One brilliant engineer holds it all together. They know every system. Fix every issue. Answer every question. Everyone says: “Thank god for...