Observability has gained a lot of popularity in recent years. Modern DevOps paradigms encourage building robust applications by incorporating automation, Infrastructure as Code, and agile development. To assess the health and “robustness” of IT systems, engineering teams typically use logs, metrics, and traces, which are used by various developer tools to facilitate observability. But what is observability exactly, and how does it differ from monitoring?
- Wikipedia’s definition of observability
- Monitoring vs. Observability
- Examples showing the distinction between observability and monitoring
- Observability of serverless microservices
- How is Twitter doing it?
- How can a serverless observability platform help?
Wikipedia’s definition of observability
“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” — Wikipedia
An observable system allows us to assess how the system works without interfering or even interacting with it. Simply by looking at the outputs of a system (such as logs, metrics, traces), we can assess how this system is performing.
Monitoring vs. Observability
One of the best explanations of monitoring and observability I’ve seen comes from the online course “Building Modern Python Applications on AWS” by Morgan Willis, a Senior Cloud Technologist at AWS.
“Monitoring is the act of collecting data. What types of data we collect, what we do with the data, and if that data is readily analyzed or available is a different story. This is where observability comes into play. Observability is not a verb, it’s not something you do. Instead, observability is more of a property of a system.” — Morgan Willis
According to this explanation, tools such as CloudWatch or X-Ray can be viewed as monitoring or tracing tools. They allow us to collect logs and metrics about our system and send alerts about errors and incidents. Monitoring is therefore the active process of collecting the data that helps us assess the health of our system and how its components work together. Once we establish monitoring that continuously collects logs, system outputs, metrics, and traces, our system becomes observable.
As a data engineer, I like to think of monitoring as the data ingestion part of ETL (extract, transform, load): you gather data from multiple sources (logs, traces, metrics) and put it into a data lake. Once all this data is available, a skilled analyst can gain insights from it and build beautiful dashboards that tell the story the data conveys. That’s the observability part: gaining insights from the collected data. Observability platforms such as Dashbird play the role of the skilled analyst. They provide you with visualizations and insights about the health of your system.
Monitoring is a prerequisite for observability. A system that we don’t monitor is not observable.
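To make the ETL analogy a bit more concrete, here is a minimal Python sketch. The `Event` shape, the service names, and both function names are made up for illustration. The only point is that the ingestion step (monitoring) and the analysis step are separate, and that the second cannot happen without the first.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One collected output of the system (hypothetical shape)."""
    source: str   # "logs", "metrics" or "traces"
    service: str  # which component emitted the event
    level: str    # "INFO" or "ERROR"


def ingest(*sources):
    """Monitoring step: gather raw events from every source into one store."""
    store = []
    for source in sources:
        store.extend(source)
    return store


def errors_per_service(store):
    """Analysis step: answer a question using only the collected data."""
    return Counter(event.service for event in store if event.level == "ERROR")


# Made-up sample data standing in for real logs and traces.
logs = [Event("logs", "checkout", "ERROR"), Event("logs", "checkout", "INFO")]
traces = [Event("traces", "payments", "ERROR"), Event("traces", "checkout", "ERROR")]

store = ingest(logs, traces)
print(errors_per_service(store))
```

With an empty store, `errors_per_service` has nothing to report, which is the code-level version of "a system that we don’t monitor is not observable".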
Examples showing the distinction between observability and monitoring
The ultimate purpose of monitoring is to control a system’s health by actively collecting error logs and system metrics and then leveraging those to alert about incidents. This means:
- tracking errors and alerting about them as soon as they happen,
- tracking metrics about CPU utilization or network traffic to later observe whether specific compute resources are healthy or not,
- reacting to outages and security incidents through alerting, alarms, and notifications.
Even though monitoring is an active process, AWS takes care of that automatically when we use CloudWatch or X-Ray.
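As a rough sketch of what "tracking errors and alerting about them" boils down to, the snippet below keeps a sliding window of recent invocation outcomes and fires once the error rate in that window crosses a threshold. The class name, the window size, and the threshold are assumptions chosen for illustration. This is not a description of how CloudWatch alarms are implemented.

```python
from collections import deque


class ErrorRateAlarm:
    """Fires when the error rate over the last `window` invocations exceeds `threshold`."""

    def __init__(self, window=5, threshold=0.5):
        # Each entry is True if the invocation failed; old entries drop out automatically.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, failed):
        """Record one invocation outcome and return True if the alarm should fire."""
        self.outcomes.append(failed)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        error_rate = sum(self.outcomes) / len(self.outcomes)
        # Wait for a full window so one early failure does not trigger an alert.
        return window_full and error_rate > self.threshold


alarm = ErrorRateAlarm(window=5, threshold=0.5)
outcomes = [False, False, True, False, True, True]  # made-up sequence of results
fired = [alarm.record(failed) for failed in outcomes]
print(fired)  # only the last invocation pushes the window above the threshold
```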
The purpose of observability is to use the system’s outputs to gather insights and act on them. Examples:
- identify the percentage of errors across all function or container invocations,
- identify bottlenecks in microservices by observing traces that show latency between individual function calls and transition between components,
- identify patterns of when the errors or bottlenecks occur and use the insights to take action in order to prevent such scenarios in the future,
- measure and assess the performance of an entire application,
- identify cold starts,
- identify how much memory your application consumes,
- identify when and how long your code runs,
- identify how much cost is incurred per resource,
- identify outliers, e.g., a specific function invocation that took considerably longer than usual,
- identify how changes to one component affect other parts of the system,
- identify and troubleshoot the flow of traffic through our microservices,
- identify how the system performs over time: how many invocations of each function we see per day, per week, or per month, and how many of them are successful.
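A few of the insights listed above can be sketched in Python once invocation data has been collected. The `Invocation` record, the function name `resize-image`, and the outlier rule (anything slower than three times the median duration) are assumptions made for this example. A real platform would likely use more careful statistics.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Invocation:
    """One collected function invocation (hypothetical shape)."""
    function: str
    duration_ms: float
    memory_mb: int
    cold_start: bool
    error: bool


def summarize(invocations, outlier_factor=3.0):
    """Derive a handful of insights from already collected invocation records."""
    typical = median(i.duration_ms for i in invocations)
    return {
        "error_pct": 100 * sum(i.error for i in invocations) / len(invocations),
        "cold_starts": sum(i.cold_start for i in invocations),
        "max_memory_mb": max(i.memory_mb for i in invocations),
        # Assumed rule: an outlier runs longer than outlier_factor times the median.
        "outliers": [i for i in invocations if i.duration_ms > outlier_factor * typical],
    }


# Made-up records standing in for data gathered by monitoring.
records = [
    Invocation("resize-image", 120.0, 64, True, False),
    Invocation("resize-image", 40.0, 60, False, False),
    Invocation("resize-image", 45.0, 61, False, True),
    Invocation("resize-image", 50.0, 62, False, False),
    Invocation("resize-image", 900.0, 118, False, False),
]

report = summarize(records)
print(report["error_pct"], report["cold_starts"], report["max_memory_mb"])
print([i.duration_ms for i in report["outliers"]])
```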
Observability of serverless microservices
Although serverless microservices offer a myriad of benefits in terms of decoupling, reducing dependencies between individual components, and overall faster development cycles, the biggest challenge is to ensure that all those small “moving parts” are working well together. It’s highly impractical, if not impossible, to track all microservices by manually looking up the logs, metrics, and traces scattered across different cloud services.
On AWS, you would have to go to CloudWatch to see the logs, find your Lambda function’s log group, and then find the logs you are really interested in. Then, to see the corresponding API traces, you would go either to X-Ray or to CloudTrail and again search across potentially hundreds of components to find the one you want to investigate. Finding and accessing the logs and traces of every single component this way is quite time-consuming. Additionally, debugging single parts doesn’t give you the “big-picture” view of how those components work together.
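Even scripting that lookup for a single function takes some care. The sketch below searches one Lambda function's log group for a pattern and follows pagination tokens. It assumes a client that behaves like `boto3.client("logs")` and relies on the default `/aws/lambda/<function-name>` log group naming. The function name `resize-image`, the helper names, and the `FakeLogsClient` stand-in (used so the example runs without AWS credentials) are all hypothetical.

```python
import time


def find_error_events(logs_client, function_name, minutes=60, pattern="ERROR"):
    """Return log messages matching `pattern` from a Lambda function's log group.

    `logs_client` is expected to behave like boto3.client("logs").
    """
    kwargs = {
        "logGroupName": f"/aws/lambda/{function_name}",  # default Lambda naming
        "filterPattern": pattern,
        "startTime": int((time.time() - minutes * 60) * 1000),  # epoch milliseconds
    }
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token  # request the next page of results


class FakeLogsClient:
    """Stand-in that replays canned pages instead of calling AWS."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.calls = []

    def filter_log_events(self, **kwargs):
        self.calls.append(kwargs)
        return self.pages.pop(0)


client = FakeLogsClient([
    {"events": [{"message": "ERROR timeout"}], "nextToken": "page-2"},
    {"events": [{"message": "ERROR out of memory"}]},
])
messages = find_error_events(client, "resize-image")
print(messages)
```

This covers one function and one kind of signal. Repeating it for every component, and then correlating the results with traces, is where the manual approach stops scaling.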
With a growing architecture of microservices, we need an easier (automated) way to add observability to the serverless ecosystem.
How is Twitter doing it?
Here’s an example of a service we’re all too familiar with: Twitter. As you might imagine, a product like Twitter has a lot of moving parts, and when something breaks, it can be difficult to understand why or what caused the problem. Imagine having 350 million active users that interact with each other through your system, tweeting, liking, DMing, retweeting, and so on. When thousands of small services communicate asynchronously with each other, it becomes increasingly difficult to find the root cause of an error, such as why a tweet isn’t posted or why a message took longer than usual to be delivered. Twitter wrote about their migration to microservices in 2013; you can find the post here.
With distributed systems or microservices at scale, observability becomes a necessity. You need a system that provides you with the right information to act on. Twitter’s observability system is humongous and took years to develop into the well-oiled machine it is today.
“Our time series metric ingestion service handles more than 2.8 billion write requests per minute, stores 4.5 petabytes of time series data, and handles 25,000 query requests per minute.” This is how Antony Asta described the scope of Twitter’s observability systems in a post published in 2016. For more information, see part one and part two.
How can a serverless observability platform help?
Understandably, not all businesses have the scale of Twitter and they may not have the resources and time to build their own observability system. Therefore, I want to demonstrate a simple and intuitive observability platform. With a 2-minute setup, you can sign up for Dashbird and add observability to your serverless AWS architecture immediately. Each serverless component in your AWS account, on which you enabled CloudWatch logs and X-Ray or CloudTrail traces, is automatically monitored with those tools. But it’s not yet observable until you do something with this collected data.
The true benefit of Dashbird is that it doesn’t require any code changes or any effort on your side: it simply uses the data that already exists, i.e., data for which you already enabled monitoring with AWS-native services designed for that purpose.
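To give a feel for how much information is already sitting in the data AWS collects, here is a small sketch that parses the REPORT line Lambda writes to CloudWatch Logs at the end of each invocation. The sample lines and request IDs are made up, the parser only handles the fields shown, and it treats the presence of `Init Duration` as the sign of a cold start. It is an illustration only, not a description of how Dashbird processes logs.

```python
def parse_report(line):
    """Turn one Lambda REPORT log line into a dict, or return None for other lines."""
    if not line.startswith("REPORT"):
        return None
    fields = {}
    # Fields are tab-separated "Name: value" pairs.
    for part in line.strip().split("\t"):
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return {
        "request_id": fields["REPORT RequestId"],
        "duration_ms": float(fields["Duration"].split()[0]),
        "memory_size_mb": int(fields["Memory Size"].split()[0]),
        "max_memory_used_mb": int(fields["Max Memory Used"].split()[0]),
        # Init Duration is only reported when a new execution environment was set up.
        "cold_start": "Init Duration" in fields,
    }


# Made-up sample lines in the REPORT layout.
cold = ("REPORT RequestId: 8f5a-example\tDuration: 102.25 ms\tBilled Duration: 103 ms\t"
        "Memory Size: 128 MB\tMax Memory Used: 51 MB\tInit Duration: 187.40 ms\t\n")
warm = ("REPORT RequestId: 91c2-example\tDuration: 12.80 ms\tBilled Duration: 13 ms\t"
        "Memory Size: 128 MB\tMax Memory Used: 52 MB\t\n")

for line in (cold, warm):
    print(parse_report(line))
```

Duration, memory usage, and cold starts are all recoverable from a single log line that exists whether or not anyone ever reads it.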
As an observability platform, Dashbird allows you to accomplish all of the points addressed when discussing examples of insights gathered from an observable system:
- be notified about incidents, cold starts, and errors as they happen via custom alerting,
- observe the percentage of errors across all invocations and identify potential outliers,
- find out how much memory your application consumes, as well as when and how long your code runs,
- identify how much cost is incurred per resource,
- …and so much more.
Dashbird project view — image by the author
Monitoring tools allow you to collect application logs, metrics about resource utilization and network traffic, and traces of HTTP requests made to specific services. Observability is a property of the system that emerges once this collected data is analyzed and visualized, allowing you to improve your application lifecycle by gathering insights about the underlying system.
“Building Modern Python Applications on AWS” by Morgan Willis
“Monitoring vs Observability: What’s the Difference?” by James Yaria