Monitoring vs. Observability: Can You Tell The Difference?

Published
January 28, 2021
Tags
Data Engineering, Serverless, Architecture & Design, Observability & SRE
Photo by Scott Webb from Pexels

Observability has gained a lot of popularity in recent years. Modern DevOps paradigms encourage building robust applications by incorporating automation, Infrastructure as Code, and agile development. To assess the health and “robustness” of IT systems, engineering teams typically rely on logs, metrics, and traces, which various developer tools consume to facilitate observability. But what exactly is observability, and how does it differ from monitoring?

Wikipedia’s definition of observability

“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” — Wikipedia

An observable system allows us to assess how the system works without interfering or even interacting with it. Simply by looking at the outputs of a system (such as logs, metrics, traces), we can assess how this system is performing.

Monitoring vs. Observability

One of the best explanations of monitoring and observability I’ve seen comes from the online course “Building Modern Python Applications on AWS” by Morgan Willis, a Senior Cloud Technologist at AWS.

“Monitoring is the act of collecting data. What types of data we collect, what we do with the data, and if that data is readily analyzed or available is a different story. This is where observability comes into play. Observability is not a verb, it’s not something you do. Instead, observability is more of a property of a system.” — Morgan Willis

According to this explanation, tools such as CloudWatch or X-Ray can be viewed as monitoring or tracing tools. They allow us to collect logs and metrics about our system and send alerts about errors and incidents. Monitoring is therefore the active process of collecting data that helps us assess the health of our system and how its different components work together. Once we establish monitoring that continuously collects logs, system outputs, metrics, and traces, our system becomes observable.

As a data engineer, I like to think of monitoring as the data ingestion part of ETL (extract, transform, load) — you gather data from multiple sources (logs, traces, metrics) and put it into a data lake. Once all this data is available, a skilled analyst can gain insights from it and build dashboards that tell the story the data conveys. That’s the observability part — gaining insights from the collected data. Observability platforms such as Dashbird play the role of that skilled analyst: they provide you with visualizations and insights about the health of your system.

Monitoring is a prerequisite for observability. A system that we don’t monitor is not observable.

Examples showing the distinction between observability and monitoring

Monitoring

The ultimate purpose of monitoring is to control a system’s health by actively collecting error logs and system metrics and then leveraging those to alert about incidents. This means:

  • tracking errors and alerting about them as soon as they happen,
  • tracking metrics about CPU utilization or network traffic to later observe whether specific compute resources are healthy or not,
  • reacting to outages and security incidents through alerting, alarms, and notifications.

Even though monitoring is an active process, AWS takes care of that automatically when we use CloudWatch or X-Ray.
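
To make the alerting idea above more concrete, here is a minimal sketch in plain Python. It checks whether a metric stayed above a threshold for several consecutive periods, which is roughly how threshold-based alarms tend to work. The function name `evaluate_alarm`, its parameters, and the sample values are hypothetical illustrations; this is not the CloudWatch API.

```python
from typing import Sequence


def evaluate_alarm(datapoints: Sequence[float], threshold: float, periods: int) -> bool:
    """Return True if the last `periods` datapoints all exceed `threshold`."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(datapoints) < periods:
        return False  # not enough data collected yet to decide
    return all(value > threshold for value in datapoints[-periods:])


# Hypothetical CPU utilization samples in percent, one per minute.
cpu_utilization = [35.0, 42.5, 91.0, 93.2, 95.8]

if evaluate_alarm(cpu_utilization, threshold=90.0, periods=3):
    print("ALARM: CPU above 90% for 3 consecutive periods")
```

Requiring several consecutive breaches rather than a single one is a common way to avoid alerting on short spikes.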

Observability

The purpose of observability is to use the system’s outputs to gather insights and act on them. Examples:

  • identify the percentage of errors across all function or container invocations,
  • identify bottlenecks in microservices by observing traces that show latency between individual function calls and transition between components,
  • identify patterns of when the errors or bottlenecks occur and use the insights to take action in order to prevent such scenarios in the future,
  • measure and assess the performance of an entire application,
  • identify cold starts,
  • identify how much memory your application consumes,
  • identify when and how long your code runs,
  • identify the costs incurred per specific resource,
  • identify outliers — e.g., a specific function invocation that took considerably longer than usual,
  • identify how changes to one component affect other parts of the system,
  • identify and troubleshoot the flow of traffic through our microservices,
  • identify how the system performs over time — how many invocations of each function we see per day, per week, or per month, and how many of them are successful.
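
A few of the insights listed above can be sketched in plain Python. The example below assumes the monitoring data has already been collected into simple invocation records, and derives an error percentage, a cold-start count, and slow outliers from them. The `Invocation` record, the function names, the sample data, and the "three times the median" outlier rule are all assumptions made for illustration, not how any particular observability platform works.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Invocation:
    function_name: str
    duration_ms: float
    success: bool
    cold_start: bool


def error_rate(invocations):
    """Percentage of invocations that failed."""
    if not invocations:
        return 0.0
    failures = sum(1 for inv in invocations if not inv.success)
    return 100.0 * failures / len(invocations)


def slow_outliers(invocations, factor=3.0):
    """Invocations that took longer than `factor` times the median duration."""
    if not invocations:
        return []
    typical = median(inv.duration_ms for inv in invocations)
    return [inv for inv in invocations if inv.duration_ms > factor * typical]


# Hypothetical records for a single function.
records = [
    Invocation("resize-image", 120.0, True, False),
    Invocation("resize-image", 110.0, True, False),
    Invocation("resize-image", 950.0, True, True),
    Invocation("resize-image", 130.0, False, False),
]

print(f"error rate: {error_rate(records):.1f}%")
print("cold starts:", sum(inv.cold_start for inv in records))
print("slow outliers (ms):", [inv.duration_ms for inv in slow_outliers(records)])
```

The median is used as the baseline here because a single very slow invocation distorts it far less than it would distort the mean.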

Observability of serverless microservices

Although serverless microservices offer a myriad of benefits in terms of decoupling, reducing dependencies between individual components, and overall faster development cycles, the biggest challenge is to ensure that all those small “moving parts” are working well together. It’s highly impractical, if not impossible, to track all microservices by manually looking up the logs, metrics, and traces scattered across different cloud services.

When looking at AWS, you would have to go to CloudWatch to see the logs, find your Lambda function’s log group, and then find the logs you are really interested in. Then, to see the corresponding API traces, you would go either to X-Ray or to CloudTrail and again search across potentially hundreds of components to find the one you want to investigate. As you can see, finding and accessing the logs and traces of every single component is quite time-consuming. Additionally, debugging single parts doesn’t give you the “big-picture” view of how those components work together.
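
One way to picture the “big-picture” view is to stitch together events from different components that belong to the same request. The sketch below assumes every component writes a shared request ID and a millisecond timestamp into its log events. The field names (`request_id`, `component`, `timestamp_ms`), the helper functions, and the sample events are hypothetical; real log and trace formats will differ.

```python
from collections import defaultdict

# Hypothetical log events gathered from several components, in arbitrary order.
events = [
    {"request_id": "req-1", "component": "lambda", "timestamp_ms": 50},
    {"request_id": "req-2", "component": "api-gateway", "timestamp_ms": 1000},
    {"request_id": "req-1", "component": "api-gateway", "timestamp_ms": 0},
    {"request_id": "req-1", "component": "dynamodb", "timestamp_ms": 120},
    {"request_id": "req-2", "component": "lambda", "timestamp_ms": 1400},
]


def group_by_request(events):
    """Group events by request ID and order each group by timestamp."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["request_id"]].append(event)
    return {
        request_id: sorted(group, key=lambda e: e["timestamp_ms"])
        for request_id, group in grouped.items()
    }


def end_to_end_latency_ms(timeline):
    """Time between the first and last event of one request."""
    return timeline[-1]["timestamp_ms"] - timeline[0]["timestamp_ms"]


for request_id, timeline in group_by_request(events).items():
    path = " -> ".join(event["component"] for event in timeline)
    print(f"{request_id}: {path} ({end_to_end_latency_ms(timeline)} ms)")
```

Doing this by hand across hundreds of components is exactly the kind of work that becomes impractical as the architecture grows.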

With a growing architecture of microservices, we need an easier (automated) way to add observability to the serverless ecosystem.

Conclusion

Monitoring tools allow you to collect application logs, metrics about resource utilization and network traffic, and traces of HTTP requests made to specific services. Observability, in contrast, is a property of a system: when the collected data can be analyzed and visualized, you can gather insights about the underlying system and use them to improve your application lifecycle.

Resources:

[2] “Building Modern Python Applications on AWS” Morgan Willis

[3] “Monitoring vs Observability: What’s the Difference?”— James Yaria