The £2M Service Mesh Nobody Needed

The Platform Fix | Issue #002

At 2am on a Tuesday, my phone rang.

“Steve, our service mesh is down. Everything’s broken. The entire platform team is panicking.”

I asked one question: “What was it actually doing for you?”

Silence.

After 10 minutes of investigation, we discovered their £2M service mesh was handling… basic load balancing.

Something their existing ingress controller already did.

They’d spent 18 months implementing a solution to a problem they never had.

HOW I LEARNED THIS THE HARD WAY

Here's a confession: In 2019, I was that architect.

Fresh from a conference, I convinced a startup to implement Istio. "It's what Google uses!" I said. Six months later, I was debugging Envoy proxies at 4am, trying to understand why requests were randomly failing. The CEO asked the same question: "What problem is this solving?"

I couldn't answer. That night, staring at Grafana dashboards showing metrics no one understood, I realized I'd become part of the problem. We ripped it all out the next week.

That failure? It taught me to always ask "why" before "how."

THE OVER-ENGINEERING EPIDEMIC

After analysing 50+ failed migrations, I’ve found a pattern:

80% implement service mesh “because Netflix has one”
70% add GitOps “because it’s best practice”
90% build complex CI/CD before having anything to deploy

The real killer? Failure points compound. Every layer you add is one more thing that has to be healthy for a request to succeed.
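
Here's a back-of-the-envelope sketch of that compounding. It assumes every component sits in the request path and fails independently, and the 99.9% figure is purely illustrative (it is not from any client):

```python
def chain_availability(per_component: float, components: int) -> float:
    """Availability of a request that must pass through every component in series.

    Assumes independent failures, which real platforms rarely have.
    """
    return per_component ** components


if __name__ == "__main__":
    # 0.999 is an illustrative per-component availability, not a measured one.
    for n in (1, 5, 10, 20):
        overall = chain_availability(0.999, n)
        print(f"{n:>2} components at 99.9% each -> {overall:.2%} overall")
```

Under those assumptions, twenty "three nines" components leave you with roughly 98% overall, which is about a week of downtime a year.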

THE SIMPLICITY TEST™

Before adding ANY new component to your platform, answer these three questions:

1. THE PROBLEM TEST

Can you explain the specific problem this solves in one sentence? No buzzwords allowed.

❌ Bad: “We need better observability and resilience.”
✅ Good: “Our checkout service fails 50 times daily with no alerts.”

2. THE ALTERNATIVE TEST

What’s the simplest possible solution?
Could kubectl do this?
Could a bash script handle it?
Is there a managed service?

One client replaced their custom deployment pipeline with GitHub Actions. Saved £100K/year.

3. THE DELETION TEST

If you removed this tomorrow, what would actually break?

Try it in staging. You’ll be amazed how much “critical” infrastructure does… nothing.

THIS WEEK’S PLATFORM PSYCHOLOGY INSIGHT

From your Reality Check scores this week, the average was 4.2/10.

The pattern? Teams scoring 7+ had one thing in common: They could explain their platform to a new developer in under 5 minutes.

Those scoring below 3? Their onboarding docs averaged 47 pages.

Here's what's really happening:

Complexity isn't sophistication. It's fear disguised as engineering.

I learned this during an awkward elevator ride. A CEO asked me to explain what we'd built. By floor 15, I was still talking about our "distributed event-driven microservices architecture with autonomous scaling capabilities." His response? "But what does it do?"

I couldn't give a simple answer. We'd built a masterpiece no one understood.

REAL STORY: THE £50K YAML FILE

Last month, a financial services client called. Their deployment configs were 3,000+ lines of YAML. Developers were terrified to touch anything.

The Full Story:

It started innocently. One engineer added some "helpful" templates. Another added abstractions. A third added configuration for edge cases. Two years and 15 contributors later, deploying a simple API required:

47 environment variables
12 config maps
8 "required" sidecars
3 init containers
Custom annotations whose purpose no one could remember

The breaking point? A junior developer needed to change a single environment variable. It took three senior engineers two hours to figure out where.

We ran the Simplicity Test:

Removed 80% of "just in case" configurations
Deleted entire abstraction layers
Stripped back to Kubernetes basics

Result? Deployment time dropped from 45 minutes to 7.
One developer actually cried with relief.

The kicker: Those deleted configurations? They were handling edge cases that had never occurred in 3 years. The engineer who added them? He'd left 18 months ago.
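
If you want to put numbers like these on your own manifests, here is a rough sketch. It assumes the Deployment has already been loaded into a Python dict (for example from the JSON that `kubectl get deployment <name> -o json` prints). The function name and the sample manifest are made up for illustration.

```python
def manifest_footprint(deployment: dict) -> dict:
    """Count the moving parts in a single Kubernetes Deployment manifest."""
    pod_spec = deployment["spec"]["template"]["spec"]
    containers = pod_spec.get("containers", [])
    init_containers = pod_spec.get("initContainers", [])
    annotations = deployment.get("metadata", {}).get("annotations", {})
    # Env vars are counted across regular and init containers alike.
    env_vars = sum(len(c.get("env", [])) for c in containers + init_containers)
    return {
        "containers": len(containers),  # anything beyond the first is likely a sidecar
        "init_containers": len(init_containers),
        "env_vars": env_vars,
        "annotations": len(annotations),
    }


# Hypothetical manifest, trimmed to the fields the function reads.
sample = {
    "metadata": {"annotations": {"team": "payments", "legacy/flag": "true"}},
    "spec": {"template": {"spec": {
        "containers": [
            {"name": "api", "env": [{"name": "A"}, {"name": "B"}, {"name": "C"}]},
            {"name": "log-shipper", "env": [{"name": "D"}]},
        ],
        "initContainers": [{"name": "migrate", "env": [{"name": "E"}]}],
    }}},
}

if __name__ == "__main__":
    print(manifest_footprint(sample))
```

Run it before and after a cleanup. If the numbers don't go down, you haven't deleted anything that matters.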

MY WORST OVER-ENGINEERING MOMENT

Since I'm asking you to be honest, here's mine:

2018: I built a "future-proof" platform for a media company. Custom operators for everything. Operators watching operators. GitOps controlling GitOps. It was beautiful. It was elegant.

It was completely unusable.

Training time: 3 weeks.
Debug time for simple issues: Hours.

The team nicknamed it "The Death Star" - impressive, but one wrong move and everything exploded.

The day we replaced it with vanilla Kubernetes? Productivity jumped 300%.

YOUR 5-MINUTE ARCHITECTURE REVIEW

This week, pick ONE component of your platform:

Draw it on a napkin (literally - if it doesn’t fit, it’s too complex)
Count the dependencies (each one is a failure point)
Time the explanation (can you explain it in 2 minutes?)
Find one thing to delete (there’s always something)
Measure the impact (or lack thereof)
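
For the dependency count, here is a minimal sketch. The dependency map is hypothetical; you would write your own by hand, napkin-style, listing what each component needs in order to work.

```python
def transitive_dependencies(graph: dict[str, list[str]], component: str) -> set[str]:
    """Everything `component` depends on, directly or through other components."""
    seen: set[str] = set()
    stack = list(graph.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    seen.discard(component)  # ignore cycles that lead back to the starting point
    return seen


# Hypothetical platform: component -> the things it needs to be healthy.
graph = {
    "checkout-api": ["postgres", "payments-client", "service-mesh"],
    "payments-client": ["vault"],
    "service-mesh": ["cert-manager", "vault"],
}

if __name__ == "__main__":
    direct = graph["checkout-api"]
    everything = transitive_dependencies(graph, "checkout-api")
    print(f"direct: {len(direct)}, including transitive: {len(everything)}")
```

The gap between the direct count and the transitive count is usually where the surprises live.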

Pro tip: Start with your newest component. It’s usually the most over-engineered.

THE MILLION-POUND QUESTION

“But Steve, what if we need it later?”

In 10 years of platform engineering, I’ve seen teams add back deleted complexity exactly… twice.

You know what I see daily? Teams drowning in “what if” architecture while their actual users suffer.

True story: I once kept a service mesh "just in case" for an entire year.
Cost: £8K/month. Times used: Zero.
My therapist says I'm making progress on my "architectural hoarding" issues. 😅

Here’s the brutal truth: Your clever abstractions are killing your platform.

LAST WEEK’S REALITY CHECK RESULTS

Thank you to the 47 of you who took the Reality Check! Here’s what I learned:

Average score: 4.2/10
Biggest weakness: Business alignment (average 2/5)
Surprising finding: Teams with LESS experience scored higher

Why? They hadn’t learned to over-complicate things yet.

One reader with a score of 2 wrote: “I thought we were failing. Your review showed we just need basic monitoring, not a complete platform rebuild. Saved us 6 months.”

Haven’t taken it yet? [Take the 5-Minute K8s Reality Check →]

Bring your results to the masterclass - I’ll explain exactly what your score means and what to focus on first.

READER QUESTION: “CAN YOU REVIEW MY ARCHITECTURE?”

Great news - I do this live every month!

Join my CNCF Migration Masterclass where I:

Apply the Simplicity Test to real architectures
Show exactly what to delete (prepare to be shocked)
Answer your specific questions
Give you templates to take away

Next session: Tuesday 22nd July, 2pm UK / 9am ET

I’ll review 2-3 volunteer architectures live. One could be yours.

Limited to 12 platform leaders. Free. No fluff.

Reserve Your Spot →

P.S. Can’t make it? Register anyway for the recording. But the live Q&A is where the magic happens - I solve problems in real time.

FROM COMPLEXITY TO CLARITY

After seeing your Reality Check scores, here’s what jumped out:

Scores 0-3: You’re not ready for K8s (and that might save you £500K)
Scores 4-6: You have foundation issues that will snowball
Scores 7-10: You’re ready, but are you over-engineering?

The interesting pattern? Teams scoring 4-6 all had the same blind spot: They couldn’t see their own complexity.

That’s exactly why the live masterclass works so well. When I diagram your architecture and start crossing out unnecessary elements, the complexity becomes obvious. Painfully obvious.

WHAT’S COMING NEXT WEEK

Issue #003: “Why Your Developers Secretly Hate Your Platform”

The 3 words that predict platform failure
How to get brutal honesty from your team
The 10-minute platform empathy test

Plus: I’ll share the worst over-engineering stories you send me (anonymously).

Ready to simplify?
Steve

P.S. Worst over-engineering you’ve seen? Reply and tell me. The most outrageous story gets featured next week (anonymously, of course).

P.P.S. That £2M service mesh client? They’re now running everything on basic Kubernetes with an nginx ingress controller. Deploy time: 3 minutes. Complexity: Minimal. Developer happiness: Through the roof. I’ll show you exactly what we deleted in the masterclass.

📅 NEXT MASTERCLASS: Tuesday 22nd July, 2pm UK
Topic: The Simplicity Test - Live Architecture Reviews
Save Your Free Spot →

The 3am Call is published every Thursday. Created for platform engineering leaders who are tired of midnight emergencies.


Steve Wade
