Everyone talks about the need for observability; it even comes up in board meetings. But let’s talk about the reasons you shouldn’t add observability to production.
Observability is the enemy: ten reasons
1. Our users will tell us when the site is down
If you’ve got a successful app, it’s likely that your users are testing every single route in your service every minute of the day. And if they really like using your service, then surely some of them will bother to let you know that your site is down before they sign up with a competitor. Chances are most of your reports will come via Twitter or Reddit, so be sure to keep an eye on those sites. Really, it’s better that way, since you want everyone to know when your site is down.
As long as reports of your outage are spread far and wide, people will know not to bother you while you work on a fix. If users end up leaving your app for good? Perfect, you’ll reduce the overall load on the system!
2. Who needs an SLA?
“Look, we’re sorry, okay? That’s a lot, for us to apologize. It’s a big deal. I’m the BOSS here and I’m apologizing. That should be enough for you. What’s our plan for the next downtime? Well, there won’t be a next time! Don’t worry, I’ll bully everyone from the directors on down until they understand that this is not acceptable. I’ll get everyone terrified of the slightest outage. Isn’t that better than any so-called ‘contractual SLA’ with ‘guaranteed refunds on quarterly costs’? There’s no need for that when I’ve promised to yell at people, is there?”
3. It’s easier to reduce your developer velocity to zero
Last outage, the CEO had to apologize. When’s that feature you want coming out? Never! All developer velocity is going to grind to a halt. That’s how committed we are to never having our boss apologize to a customer ever again. Hell, if the boss has to so much as speak to an existing customer in the next year, it’ll be a failure on our part.
4. AWS needs to make their money
Downscaling? Resource overuse? You call it a wasted Operations budget; I call it stimulating the economy. Observability might be a great way to identify possible efficiencies, and even whole clusters that nobody is using (see the sketch at the end of this section), but really, who needs the bother?
When the company has to massively reduce size due to budget overruns, then we can downscale whole departments at a time. And without observability, that should happen in no time!
How else will Jeff Bezos afford another 10-and-a-half minutes in the upper atmosphere?
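For the morbidly curious, here is a rough sketch of the dreadful efficiency-hunting you’d be avoiding: assuming boto3 credentials, EC2 instances, and CloudWatch’s CPUUtilization metric, it flags anything averaging under 5% CPU for a week. The threshold and the window are arbitrary illustrations, not a recommendation.

```python
# Sketch: flag EC2 instances that have been nearly idle for a week.
# Assumes boto3 is configured with credentials and a default region.
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
week_ago = now - timedelta(days=7)

for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=week_ago,
            EndTime=now,
            Period=3600,          # one datapoint per hour
            Statistics=["Average"],
        )
        datapoints = stats["Datapoints"]
        if not datapoints:
            continue
        avg_cpu = sum(p["Average"] for p in datapoints) / len(datapoints)
        if avg_cpu < 5.0:         # arbitrary "nobody is using this" threshold
            print(f"{instance_id}: {avg_cpu:.1f}% average CPU over 7 days")
```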
5. We have logs at home
Every system emits logs of some kind, and when you think about it, the answers are always in there. Sure, those logs are scattered across dozens of microservices, often with no clear connection between requests. But I’m sure that with a few hours of digging and writing regexes to connect disparate request IDs, you’ll find logs that give some clue about the cause.
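For flavor, a minimal sketch of the kind of digging in question, assuming every service dumps plain-text logs under /var/log/services/ and buries something like request_id=abc123 in each relevant line. Both the path and the ID format are made up for illustration; your logs will be messier.

```python
# Sketch: manually stitch together one request's logs across many services.
# Assumes logs live under /var/log/services/<service>/*.log and that each
# relevant line happens to contain something like "request_id=abc123".
import re
from pathlib import Path

REQUEST_ID = "abc123"  # the ID you fished out of a customer complaint
PATTERN = re.compile(rf"request_id={re.escape(REQUEST_ID)}\b")

matches = []
for log_file in Path("/var/log/services").glob("*/*.log"):
    with open(log_file, errors="replace") as f:
        for line_number, line in enumerate(f, start=1):
            if PATTERN.search(line):
                matches.append((log_file, line_number, line.rstrip()))

# Hope the timestamps at the start of each line are comparable across services.
for log_file, line_number, line in sorted(matches, key=lambda m: m[2]):
    print(f"{log_file}:{line_number}: {line}")
```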
Imagine doing this kind of digging while a critical service is down. All those angry messages from sales, support, and even direct from customers will certainly be motivating!
6. Post Mortem? Most Shmortem!
Did a restart solve the problem during the last outage? Good enough for me! Is memory usage spiraling out of control? Nothing a restart can’t solve. Rather than finding a root cause, spend that time writing a cron job to restart services every few hours. And when it’s time to place blame at the post mortem, just blame SRE! Clearly the team that measures SLA violations must be responsible for the SLA violations. The circle is complete.
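If even writing the cron job sounds like too much root-cause analysis, here is a minimal sketch of the restart-instead-of-diagnose school, assuming psutil, systemd, and a hypothetical service name; the 85% threshold is as arbitrary as the approach.

```python
# Sketch: restart the service whenever memory gets uncomfortable,
# so nobody ever has to find out *why* it gets uncomfortable.
# Assumes psutil is installed and the service is managed by systemd.
import subprocess
import time

import psutil

SERVICE = "my-leaky-service"   # hypothetical unit name
THRESHOLD_PERCENT = 85.0       # arbitrary "spiraling out of control" line
CHECK_INTERVAL_SECONDS = 300

while True:
    if psutil.virtual_memory().percent > THRESHOLD_PERCENT:
        # The root cause will still be there tomorrow. And the day after.
        subprocess.run(["systemctl", "restart", SERVICE], check=False)
    time.sleep(CHECK_INTERVAL_SECONDS)
```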
7. It’s not tech debt if you don’t know that it’s there
We all have areas of our services that need improvement, but prioritizing tech debt is always a challenge. An observability tool can show you where your service has performance issues, and throughput measurements can show you which fixes will have the biggest impact. And who needs that on your mind? Without observability, you can sweep tech debt under the rug. Out of sight, out of mind, I say.
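Just so it’s clear exactly what you’d be sweeping under the rug, here is a rough sketch of the prioritization an observability tool hands you for free, faked here with made-up latency samples and plain stdlib Python.

```python
# Sketch: rank endpoints by p95 latency weighted by request volume,
# i.e. which fix buys the most. The sample data is entirely made up.
import statistics

# Hypothetical latency samples (seconds) per endpoint.
latencies_by_endpoint = {
    "/checkout": [0.8, 1.2, 2.5, 0.9, 3.1, 1.1, 0.7, 2.8],
    "/search":   [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.1],
    "/profile":  [0.4, 0.5, 0.6, 0.4, 0.5, 0.7, 0.4, 0.5],
}

def p95(samples):
    # 95th percentile; "inclusive" keeps the estimate within the observed range.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]

impact = {
    endpoint: p95(samples) * len(samples)   # latency pain x traffic volume
    for endpoint, samples in latencies_by_endpoint.items()
}

for endpoint, score in sorted(impact.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{endpoint}: impact score {score:.2f}")
```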
8. What’s the worst that could happen?
Tech debt leads to greater tech debt, but as long as your company’s balance sheets look good, fundamentally things must be okay. Right? Right. After all, as long as operations are working right now, albeit at 99% capacity, what’s the worst that could happen? It’s not like even a tiny spike could shut down your system, a lack of observability could make the problem impossible to diagnose, and a quarter million people could miss Christmas as a result.
9. Why share operations data when you could hoard?
A key component of any observability tool is how data will be analyzed and shared. Instead of Loom captures of an error in progress or screenshots of a console, sending a link to view a trace on an observability dashboard is faster, easier, and communicates much more clearly.
And that’s why you should NEVER do it. After all, if everyone knows how and why problems are happening on production, how will they know you’re a genius for fixing the problem?
That leads us to the next big benefit:
10. A little thing called job security
Observability offers basic insights into how your data and events move through your system. With a little mentorship and the right observability tools, any intermediate engineer can master even a complex service in just a few months. If you’re an operations or architecture expert, you know that training up-and-coming engineers is the last thing you want to do. Where’s the job security in sharing knowledge? As long as you’re the only one who can fix the service, as long as you’re the one getting woken up day and night when something goes wrong, and as long as the blame falls on you every time something breaks, you know they’ll never fire you. You and the service will be forever interlinked. You’ll never get to work on anything more advanced, and the system will never significantly change as long as you keep nosy engineers out. Preventing full-stack observability is the first step to making sure your knowledge, and your application, are always together in your black box.
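And for anyone still tempted to let the nosy engineers in, a minimal sketch of what “seeing how data and events move through your system” looks like with OpenTelemetry; the service and span names are hypothetical, and the console exporter stands in for a real backend.

```python
# Sketch: minimal OpenTelemetry tracing so anyone can see how a request
# flows through the service. Spans go to the console here; a real setup
# would point the exporter at an observability backend instead.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    # Each nested span shows up as one step of the request's journey.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("reserve_inventory"):
            pass  # call the inventory service here
        with tracer.start_as_current_span("charge_card"):
            pass  # call the payment provider here

handle_checkout("order-42")
```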
Conclusion: The Illusion of Control Without Observability
When uptime, performance, and customer experience are paramount, the idea of skipping observability might seem like a shortcut to operational bliss. But as we've explored, this approach is more akin to flying blind than to any form of enlightened management.
- Relying on users to report issues is not just risky; it's a surefire way to erode trust and send customers to competitors.
- Dismissing the importance of SLAs in favor of verbal apologies from the higher-ups is a gamble with your brand's reputation.
- Slowing down developer velocity to avoid outages is a self-defeating strategy that stifles innovation and growth.
- Ignoring the potential cost savings and efficiencies that observability can bring is a fast track to budget overruns and, ultimately, downsizing.
- Depending solely on logs scattered across services is a recipe for long, stressful debugging sessions, especially during critical outages.
- Neglecting post-mortems and root cause analysis perpetuates a cycle of blame and short-term fixes.
- Unseen tech debt accumulates until it becomes an unmanageable burden, impacting both performance and maintainability.
- Underestimating the worst-case scenarios can lead to catastrophic failures that could have been easily avoided with proper monitoring and alerting.
- Hoarding operational data in silos hampers effective communication and problem-solving, making incidents harder to resolve.
- Finally, the notion that job security comes from being the sole keeper of operational knowledge is outdated. In reality, fostering a culture of shared understanding and continuous improvement is the key to long-term success.
While the allure of avoiding observability may seem tempting, the costs—both seen and unseen—are far too great. Observability isn't the enemy; it's a tool that, when used wisely, can help organizations thrive in an increasingly complex and competitive landscape.