At 2am on a Tuesday, my phone rang.
“Steve, our service mesh is down. Everything’s broken. The entire platform team is panicking.”
I asked one question: “What was it actually doing for you?”
Silence.
After 10 minutes of investigation, we discovered their £2M service mesh was handling… basic load balancing.
Something their existing ingress controller already did.
They’d spent 18 months implementing a solution to a problem they never had.
HOW I LEARNED THIS THE HARD WAY
Here's a confession: In 2019, I was that architect.
Fresh from a conference, I convinced a startup to implement Istio. "It's what Google uses!" I said. Six months later, I was debugging Envoy proxies at 4am, trying to understand why requests were randomly failing. The CEO asked the same question: "What problem is this solving?"
I couldn't answer. That night, staring at Grafana dashboards showing metrics no one understood, I realized I'd become part of the problem. We ripped it all out the next week.
That failure? It taught me to always ask "why" before "how."
THE OVER-ENGINEERING EPIDEMIC
After analysing 50+ failed migrations, I’ve found a pattern:
- 80% implement service mesh “because Netflix has one”
- 70% add GitOps “because it’s best practice”
- 90% build complex CI/CD before having anything to deploy
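The real killer? Failure points compound. Every extra component a request has to pass through is one more thing that must stay up, so each new layer drags down the reliability of everything behind it.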
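To put rough numbers on that compounding, here is a minimal sketch. It assumes every component fails independently and that a request needs all of them to be up; the 99.9% per-component figure is purely illustrative, not measured from any client.

```python
def serial_availability(per_component: float, components: int) -> float:
    """Availability of a request path that needs every component to be up.

    Assumes independent failures, which real platforms rarely satisfy exactly.
    """
    return per_component ** components


# Hypothetical example: each layer is "three nines" on its own.
for n in (1, 5, 10, 20):
    print(f"{n:>2} components in series: {serial_availability(0.999, n):.2%}")
```

Under those assumptions, ten components in series land at roughly 99%, and twenty at about 98%. Nothing got less reliable individually. You just added more things that all have to work at once.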
THE SIMPLICITY TEST™
Before adding ANY new component to your platform, answer these three questions:
1. THE PROBLEM TEST
Can you explain the specific problem this solves in one sentence? No buzzwords allowed.
❌ Bad: “We need better observability and resilience.”
✅ Good: “Our checkout service fails 50 times daily with no alerts.”
2. THE ALTERNATIVE TEST
What’s the simplest possible solution?
Could kubectl do this?
Could a bash script handle it?
Is there a managed service?
One client replaced their custom deployment pipeline with GitHub Actions. Saved £100K/year.
3. THE DELETION TEST
If you removed this tomorrow, what would actually break?
Try it in staging. You’ll be amazed how much “critical” infrastructure does… nothing.
THIS WEEK’S PLATFORM PSYCHOLOGY INSIGHT
From your Reality Check scores this week, the average was 4.2/10.
The pattern? Teams scoring 7+ had one thing in common: They could explain their platform to a new developer in under 5 minutes.
Those scoring below 3? Their onboarding docs averaged 47 pages.
Here's what's really happening:
Complexity isn't sophistication. It's fear disguised as engineering.
I learned this during an awkward elevator ride. A CEO asked me to explain what we'd built. By floor 15, I was still talking about our "distributed event-driven microservices architecture with autonomous scaling capabilities." His response? "But what does it do?"
I couldn't give a simple answer. We'd built a masterpiece no one understood.
REAL STORY: THE £50K YAML FILE
Last month, a financial services client called. Their deployment configs were 3,000+ lines of YAML. Developers were terrified to touch anything.
The Full Story:
It started innocently. One engineer added some "helpful" templates. Another added abstractions. A third added configuration for edge cases. Two years and 15 contributors later, deploying a simple API required:
- 47 environment variables
- 12 config maps
- 8 "required" sidecars
- 3 init containers
- Custom annotations whose purpose no one could remember
The breaking point? A junior developer needed to change a single environment variable. It took three senior engineers two hours to figure out where.
We ran the Simplicity Test:
- Removed 80% of "just in case" configurations
- Deleted entire abstraction layers
- Stripped back to Kubernetes basics
Result? Deployment time dropped from 45 minutes to 7.
One developer actually cried with relief.
The kicker: Those deleted configurations? They were handling edge cases that had never occurred in 3 years. The engineer who added them? He'd left 18 months ago.
MY WORST OVER-ENGINEERING MOMENT
Since I'm asking you to be honest, here's mine:
2018: I built a "future-proof" platform for a media company. Custom operators for everything. Operators watching operators. GitOps controlling GitOps. It was beautiful. It was elegant.
It was completely unusable.
Training time: 3 weeks.
Debug time for simple issues: Hours.
The team nicknamed it "The Death Star" - impressive, but one wrong move and everything exploded.
The day we replaced it with vanilla Kubernetes? Productivity jumped 300%.
YOUR 5-MINUTE ARCHITECTURE REVIEW
This week, pick ONE component of your platform:
- Draw it on a napkin (literally - if it doesn’t fit, it’s too complex)
- Count the dependencies (each one is a failure point)
- Time the explanation (can you explain it in 2 minutes?)
- Find one thing to delete (there’s always something)
- Measure the impact (or lack thereof)
Pro tip: Start with your newest component. It’s usually the most over-engineered.
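If the napkin gets crowded, the "count the dependencies" step can be done with a few lines of code instead. This is only a sketch: the component names and the shape of the `platform` map are made up for illustration, and you would fill it in by hand from your own diagram.

```python
from typing import Dict, List, Set


def transitive_dependencies(graph: Dict[str, List[str]], component: str) -> Set[str]:
    """Return everything `component` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(component, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already counted; also protects against cycles
        seen.add(dep)
        stack.extend(graph.get(dep, []))
    seen.discard(component)  # a cycle should not make a component depend on itself
    return seen


# Hypothetical platform map: component -> things it directly needs.
platform = {
    "checkout-api": ["ingress", "postgres", "service-mesh"],
    "service-mesh": ["control-plane", "sidecar-injector"],
    "control-plane": ["cert-manager"],
}

deps = transitive_dependencies(platform, "checkout-api")
print(len(deps), sorted(deps))
```

In this made-up map, the API lists three direct dependencies but actually relies on six components. The indirect ones are usually where the surprises hide.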
THE MILLION-POUND QUESTION
“But Steve, what if we need it later?”
In 10 years of platform engineering, I’ve seen teams add back deleted complexity exactly… twice.
You know what I see daily? Teams drowning in “what if” architecture while their actual users suffer.
True story: I once kept a service mesh "just in case" for an entire year.
Cost: £8K/month. Times used: Zero.
My therapist says I'm making progress on my "architectural hoarding" issues. 😅
Here’s the brutal truth: Your clever abstractions are killing your platform.
LAST WEEK’S REALITY CHECK RESULTS
Thank you to the 47 of you who took the Reality Check! Here’s what I learned:
- Average score: 4.2/10
- Biggest weakness: Business alignment (average 2/5)
- Surprising finding: Teams with LESS experience scored higher
Why? They hadn’t learned to over-complicate things yet.
One reader with a score of 2 wrote: “I thought we were failing. Your review showed we just need basic monitoring, not a complete platform rebuild. Saved us 6 months.”
Haven’t taken it yet? [Take the 5-Minute K8s Reality Check →]
Bring your results to the masterclass - I’ll explain exactly what your score means and what to focus on first.
READER QUESTION: “CAN YOU REVIEW MY ARCHITECTURE?”
Great news - I do this live every month!
Join my CNCF Migration Masterclass where I:
- Apply the Simplicity Test to real architectures
- Show exactly what to delete (prepare to be shocked)
- Answer your specific questions
- Give you templates to take away
Next session: Tuesday 22nd July, 2pm UK / 9am ET
I’ll review 2-3 volunteer architectures live. One could be yours.
Limited to 12 platform leaders. Free. No fluff.
Reserve Your Spot →
P.S. Can’t make it? Register anyway for the recording. But the live Q&A is where the magic happens - I solve problems in real time.
FROM COMPLEXITY TO CLARITY
After seeing your Reality Check scores, here’s what jumped out:
- Scores 0-3: You’re not ready for K8s (and that might save you £500K)
- Scores 4-6: You have foundation issues that will snowball
- Scores 7-10: You’re ready, but are you over-engineering?
The interesting pattern? Teams scoring 4-6 all had the same blind spot: They couldn’t see their own complexity.
That’s exactly why the live masterclass works so well. When I diagram your architecture and start crossing out unnecessary elements, the complexity becomes obvious. Painfully obvious.
WHAT’S COMING NEXT WEEK
Issue #003: “Why Your Developers Secretly Hate Your Platform”
- The 3 words that predict platform failure
- How to get brutal honesty from your team
- The 10-minute platform empathy test
Plus: I’ll share the worst over-engineering stories you send me (anonymously).
Ready to simplify?
Steve
P.S. Worst over-engineering you’ve seen? Reply and tell me. The most outrageous story gets featured next week (anonymously, of course).
P.P.S. That £2M service mesh client? They’re now running everything on basic Kubernetes with an nginx ingress controller. Deploy time: 3 minutes. Complexity: Minimal. Developer happiness: Through the roof. I’ll show you exactly what we deleted in the masterclass.
📅 NEXT MASTERCLASS: Tuesday 22nd July, 2pm UK
Topic: The Simplicity Test - Live Architecture Reviews
Save Your Free Spot →
The 3am Call is published every Thursday. Created for platform engineering leaders who are tired of midnight emergencies.