The £2M service mesh mistake (sharing this on stage today)


The Platform Fix | Issue #011

Hello Reader—

I'm writing this from my hotel in Hamburg. In 3 hours, I'll be on stage at Bit Summit talking about platform simplification.

Deutsche Bank and ING will be in the front row. But here's what I won't mention in my talk:

"Steve, we need Istio. Everyone's using service mesh."

Those words cost a UK retail bank £2M and 18 months. Last Tuesday, we ripped it all out.

Service Mesh Is Like Insurance:

It sounds responsible until you read the fine print.

What vendors promise (and what I'll diplomatically acknowledge on stage):

  • "Automatic security!"
  • "Observable everything!"
  • "Traffic management magic!"

What you actually get (what I'm telling YOU):

  • 10x YAML complexity
  • Sidecars that never die
  • Latency you can't explain
  • Engineers who can't debug production

Real data from that £2M disaster:

  • Setup time: 6 months (promised 6 weeks)
  • Additional engineers needed: 4
  • Performance overhead: 23%
  • Problems it actually solved: 1

The killer? Their original problem could've been solved with basic network policies. 20 lines of YAML. One afternoon.
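To make that concrete, here's the shape of the policy I mean — a hypothetical sketch with made-up names, not the client's actual config. It locks a payments API down so only the checkout namespace can reach it, which was essentially their whole requirement:

```yaml
# Hypothetical example — service names and namespaces are placeholders.
# Once this policy selects the payments-api pods, all other ingress
# traffic to them is denied by default; only checkout gets through.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-allow-checkout
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: checkout
      ports:
        - protocol: TCP
          port: 8443
```

Twenty-odd lines, enforced by the CNI you already run. No sidecars, no control plane, nothing new to page anyone about.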

But here's what I WILL share on stage today: Our GitOps-based simplification approach that's helping major European banks cut platform complexity by 70%.

Platform Psychology Insight:

Engineers adopt service mesh for the same reason they buy gym memberships in January - it feels like the responsible thing to do. But complexity compounds. Every abstraction you add is a mortgage on your team's future. Choose your debt wisely.

[I might actually use this gym membership line on stage. We'll see how brave I feel.]

Reader Success:

Sarah at a well-known UK telco was 4 months into an Istio implementation. Ran our Reality Check. Realized they needed basic TLS, not a service mesh. Switched to nginx ingress + cert-manager. Done in 2 weeks. Saved £430k/year in infrastructure and salaries.
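For anyone in the same spot, the setup is roughly this — a sketch with placeholder hostnames and email, not Sarah's actual manifests. A ClusterIssuer for Let's Encrypt, plus an Ingress annotation, and cert-manager handles issuance and renewal:

```yaml
# Sketch only — hostnames, email, and service names are placeholders.
# ClusterIssuer: tells cert-manager how to get certs from Let's Encrypt.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
---
# Ingress: the annotation triggers cert-manager to provision the TLS
# secret; nginx terminates TLS and routes to the backend service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```

Two manifests. That's the entire "mTLS at the edge" story for most teams.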

Quick Win:

Before adopting ANY new CNCF tool, ask: "What specific problem does this solve that we have TODAY?" Not tomorrow. Not "might have." TODAY. If the answer takes more than one sentence, you don't need it.

What's Coming Next Week:

The Kubernetes feature that's secretly burning £50k/month (hint: it's not what you think). Plus: How to cut platform costs by 50% without touching a single workload.

Keep simplifying,

-Steve

P.S. If you're at Bit Summit today, find me after my talk. Happy to share the stories that didn't make it into the "official" presentation. First round's on me.

P.P.S. Help train the spam filters - if this landed anywhere but primary, give it a quick drag. Platform Fix is too pragmatic for the promotions tab.

P.P.P.S. Service mesh has its place. In about 20% of architectures. For the other 80%? You're buying a Ferrari to sit in traffic. (Yes, I'm putting this in my slides.)

© 2025 Platform Fix

113 Cherry St #92768, Seattle, Washington 98104-2205

Unsubscribe · Preferences
