One of the best ways to learn about a domain is to study how the top companies have approached it. Here's a compilation of 30+ curated articles from the engineering blogs of top companies such as Uber, Netflix, and GitHub on observability, monitoring, and site reliability.
Building Netflix's Distributed Tracing infrastructure
In this post, the Netflix engineering team describes how they designed the tracing infrastructure behind Edgar, a tool that helps Netflix troubleshoot distributed systems.
Lessons from Building Observability Tools at Netflix
Five key lessons from the Netflix engineering team on building observability tools: scaling log ingestion, contextual distributed tracing, analysis of metrics, choosing an observability database, and data visualization.
Our journey into building an Observability platform at Razorpay
A detailed talk on how Razorpay, the fintech unicorn from India, moved from paid APM tools to open-source observability platforms. Great insights on when it makes sense to build vs. buy for observability use cases.
Cindy Sridharan discusses the problems with distributed tracing, specifically with traceviews and spans. She then suggests alternatives to the traceview, with details on service-centric views and service topology graphs.
Structured Logging: The Best Friend You’ll Want When Things Go Wrong
An article from a Grab engineer explaining what structured logging is, why it is better, and how the Grab team built a framework that integrates well with their current Elastic stack-based logging backend.
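For readers new to the idea, here is a minimal sketch of structured logging using only Python's standard library. It is not Grab's framework. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields through `extra={"fields": ...}` are all assumptions made for illustration. The point it shows is that each log line becomes a JSON object with named fields, which a backend such as the Elastic stack can index and query without regex parsing.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative only)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers attach structured fields
        # with extra={"fields": {...}}; they are merged into the output.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    # The order ID and latency are made-up values for the example.
    log.info("payment failed", extra={"fields": {"order_id": "o-123", "latency_ms": 87}})
```

Because `order_id` and `latency_ms` are separate keys rather than text embedded in the message, a query like "all failures for order o-123" becomes a simple field filter.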
A Lean and Scalable Data Pipeline to Capture Large Scale Events and Support Experimentation Platform
One of the major challenges of observability is data storage. In this article, the Grab engineering team shares the lessons learned in building a system that ingests and processes petabytes of data for analytics.
In this post, GitHub explains why it is adopting OpenTelemetry for its observability practices. According to GitHub, OpenTelemetry will allow it to standardize telemetry usage, making it easier for developers to instrument code.