To the Layman: Observability is not APM

This post is a small attempt at explaining Observability to any software developer or sysadmin, regardless of experience level.

It is quite astonishing that a lot of people in the industry still think Observability is the same as APM. It is not.

Background

I work as an SRE at a consulting firm, and many times I have faced a situation where stakeholders and developers use the terms APM and Observability interchangeably. Now, they're not at fault. It's mostly down to SaaS companies that started out selling APM products but rebranded themselves as Observability platforms during the late 2010s and early 2020s.

Now some of you folks might say, "Hey, it doesn't matter. In the end we want some dashboards that help us debug issues or provide important telemetry for developers and stakeholders." And you might be right. But the real value of knowing the difference shows up when you are the one responsible for architecting o11y from scratch, or for evaluating offers from third-party vendors who promise to set up o11y with the bare minimum of operational or developer effort.

In the wider developer community we have popularised the notion of the 3 pillars of Observability: Metrics, Traces and Logs. This is oversimplified and often misleading, and it helps vendors with sub-par o11y offerings, often built on top of the OG PLG stacks, sell themselves as full o11y offerings. The three pillars are necessary but not sufficient for observability. The real differentiator is automatic correlation, context propagation, and intelligent insights across these signals, not just their presence. This blog will cover the problem statement and the real-world view of o11y in a more developer-friendly and PM-friendly manner.

So what is Monitoring in the first place?

Good question! But a better question to ask is, "What do I intend to solve with Monitoring?" Often in technology, answering the WHY helps in understanding the WHAT, and that's what we will be answering today.

For any enterprise, the most basic need is the ability to reason about and find issues when systems fail or something goes wrong. Over time, this also created the need to take data-driven decisions when planning and implementing features or architecture-level changes. For the business, it also meant being able to answer questions like, "When is the best time to launch our cashback campaign?" We would ideally want to do it when the traffic is the lowest...

Now, to solve this problem, we asked the following questions:

Is there a way to get both a historical and a real-time view of system-level and application-level stats? We came up with Metrics.
Is there a way to have a centralised store for the stdout and stderr streams of all applications? We came up with Logs.
Is there a way to correlate a request as it travels between multiple services, or events as they propagate between multiple components in a distributed system? We came up with Tracing.

Now some smart readers like you might say, "Metrics, I get it! But isn't tracing part of an o11y pipeline?" And, to sound sweet, I might say, "You're partially correct."

But:

This is a common misconception that needs clarification. Tracing technology exists in both monitoring and observability contexts, but how it's used differs fundamentally.

Now for the seeker, I will explain what Monitoring is:

Monitoring is the practice of collecting, aggregating, and alerting on predefined metrics and events from systems. It answers the questions you know to ask ahead of time.
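
To make "questions you know to ask ahead of time" concrete, here is a minimal sketch of a monitoring-style rule evaluator. The rule names, metric names and thresholds are all made up for illustration; this is not how any particular tool implements alerting.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlertRule:
    """A predefined question: 'is this metric above this threshold?'"""
    name: str
    metric: str
    threshold: float


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Return the names of rules whose metric is above its threshold.

    Metrics that no rule refers to are silently ignored.
    """
    return [r.name for r in rules if metrics.get(r.metric, 0.0) > r.threshold]


# Hypothetical rules written ahead of time.
rules = [
    AlertRule("HighErrorRate", "http_5xx_per_10s", 5),
    AlertRule("HighLatency", "p95_latency_seconds", 1.0),
]

# Hypothetical snapshot of current metric values.
snapshot = {
    "http_5xx_per_10s": 10,
    "p95_latency_seconds": 0.4,
    "redis_connection_failures": 37,  # nobody wrote a rule for this one
}

fired = evaluate(rules, snapshot)
print(fired)
```

Only the error-rate rule fires. The Redis failures are sitting right there in the data, but since nobody predicted that question, nothing tells you about them.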

Monitoring falls in two broad domains:

Infrastructure Monitoring: This area of monitoring revolves around providing visibility into "infrastructure". What does that mean? Anything that provides compute, deployment, and operational support to the application; anything at a lower level of abstraction than the external APIs and libraries. This includes virtual machines, containers, databases, queues, buckets, etc.
Application Performance Monitoring (APM): This area of monitoring is concerned with the performance and behaviour of applications: everything that lives at the abstraction level of the code that actually executes. For an API, that means metrics like latency, 5xx errors, 4xx errors, etc. For an ETL pipeline, it could be QoS factors such as data throughput.

What the heck is Observability, then...

Observability is the ability to understand the internal state of a system based on its external outputs, with automatic correlation across signals (metrics, traces, logs, profiles, user telemetry) that enables debugging unknown issues (the "unknown unknowns") without shipping new code or instrumentation.

The key differentiator is that Observability gives you correlation, context propagation, and intelligent insights automatically. Monitoring gives you data points; Observability gives you answers.

In essence, with monitoring you get to know, "Hey, Service A had 10 5xx errors in the last 10 seconds"... That's it! If you're lucky, you will notice that Redis went down too and you just missed the alert... If you're luckier, you might find out that calls from the service to the Redis instance failed... And if your luck is a godsend, you might get to an error trace with exceptions from the Redis connector.

With Observability, your system gives you an additional insight alongside the alert (tools like Datadog and New Relic provide insights out of the box using ML/heuristics, not just data aggregation): "All upstream calls from Service A to the Redis instance failed". You click on the error trace, and you get not only the trace but also a link to the exact log line where the failure happened. You check the log line and click on an option that says "Show all logs in context"... You see audit logs from your cloud provider alongside, and it turns out "Bob" accidentally deleted the VPC Endpoint used to connect to the managed Redis service from your cloud provider. You restore the configuration, and everything takes a few minutes. Low MTTD and MTTR. The Redis alert fired because the agent trying to reach your Redis instance was on the same VPC.

Now think of the prior case. With monitoring, you would be manually trying to establish a correlation between traces, metrics and logs using timestamps. The incident escalates to P0, and it is quite possible to wrongly assume that the underlying issue lies in the Redis service itself, and you might end up worsening the situation.
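
Here is a rough sketch of why timestamp-based correlation hurts and why a shared trace ID helps. The span and log records below are invented for illustration, and the field names (`trace_id`, `ts`, `status`) are assumptions, not any vendor's schema.

```python
# Hypothetical telemetry: every record that belongs to a request carries its trace ID.
spans = [
    {"trace_id": "t1", "service": "service-a", "status": "error"},
    {"trace_id": "t2", "service": "service-a", "status": "ok"},
]
logs = [
    {"trace_id": "t1", "ts": 100.0, "message": "redis connect failed"},
    {"trace_id": "t2", "ts": 100.0, "message": "cache hit"},
    {"trace_id": None, "ts": 100.0, "message": "health check ok"},
]


def logs_by_timestamp(logs, ts, window=1.0):
    """Manual approach: grab every log line close to the time of the alert."""
    return [l["message"] for l in logs if abs(l["ts"] - ts) <= window]


def logs_for_failed_traces(spans, logs):
    """Correlated approach: keep only log lines that share a trace ID with a failed span."""
    failed = {s["trace_id"] for s in spans if s["status"] == "error"}
    return [l["message"] for l in logs if l["trace_id"] in failed]


print(logs_by_timestamp(logs, 100.0))       # every line from that second
print(logs_for_failed_traces(spans, logs))  # only the line tied to the failure
```

With three log lines the difference looks trivial. With a few thousand lines per second, the timestamp approach is where the guessing (and the wrong assumptions) start.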

This is where Observability helps, in a nutshell.

A simpler yet common example

Let us start with a simple service graph.


Service Graph:

(P ---> Q means component P depends on a network component Q to service a request.)

Service A ---> Service B ---> Service C ---> [External Service]

Alerts Received in a traditional monitoring setup

p95 latency in service A exceeds 1s over the last 5 minutes
p95 latency in service B exceeds 1s over the last 5 minutes
p95 latency in service C exceeds 1s over the last 5 minutes

Alerts + Insights from an Observability setup

p95 latency in service A exceeds 1s over the last 5 minutes
p95 latency in service B exceeds 1s over the last 5 minutes
p95 latency in service C exceeds 1s over the last 5 minutes
Insight from logging system: "Elevated error logs repeated more than 20 times in last 5 minutes in Service C. LOG LINE=requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.external.service.data/...Retrying in 1s."
Insight from the tracing system: A service graph which shows network relationship between A, B and C along with latency increase, with automatic correlation showing trace IDs linking errors across all three services.

In which of these two cases would you get to the actual issue faster?

Of course, the latter, where you go straight to Service C to figure out why there were multiple retries against the [External Service].

You see no blackbox check failures from your system for the [External Service].
You add precautionary rate limits at the load balancer/firewall level.
You call the on-call developer to help you debug the issue.
It turns out the [External Service] silently changed its rate limits, and you missed the update. Serving even a few requests took multiple retries to fetch the data.
You make a quick application configuration change to reduce the number of concurrent requests to the [External Service] and enqueue the rest.

Imagine doing that in a traditional monitoring setup. If you were lucky, you'd find the 429 error in the logs very quickly, but if the log volume was high, it would have taken a lot of time to search through the log lines to find out what exactly failed.
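
The "start with Service C" instinct can be written down as a tiny heuristic over the service graph: among the alerting services, look first at the ones whose own dependencies are not alerting. This is a sketch of the idea only, with hypothetical names; real platforms use far richer signals than this.

```python
# The service graph from the example: each service maps to what it depends on.
DEPENDS_ON = {
    "A": ["B"],
    "B": ["C"],
    "C": ["External"],
    "External": [],
}

# Services with a firing latency alert. The external service is not
# monitored by us, so it never appears here.
alerting = {"A", "B", "C"}


def likely_origin(graph, alerting):
    """Return alerting services that have no alerting dependency.

    Latency usually bubbles up the call chain, so the deepest alerting
    service is a reasonable first place to look.
    """
    return sorted(
        svc for svc in alerting
        if not any(dep in alerting for dep in graph.get(svc, []))
    )


print(likely_origin(DEPENDS_ON, alerting))
```

Three alerts collapse into one starting point, which is the same shortcut the log and trace insights gave you in the example above.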

What makes Observability different?

One common misconception among a wide tech audience is:

"Monitoring means metrics and logs, add tracing and you have Observability."

A second common misconception, which we already discussed at the start of the article, is:

"APM and Observability both provide visibility into the system, so they are synonyms."

Let's bust them one by one to truly understand what Observability is:

Monitoring can include any combination of signals: metrics, logs, telemetry data, and traces. But the mere presence of these signals is just monitoring, that is, collection and alerting on predefined conditions.
The difference lies in correlation and cardinality:
- Monitoring: Works with low-cardinality aggregates. Collects traces as just another data type. You query each signal separately.
- Observability: Handles high-cardinality data (user IDs, trace IDs, session IDs, etc.). Automatically correlates traces with metrics, logs, profiles, and even analytics data. Context flows through the entire system via trace propagation.
Observability platforms can ingest data from multiple sources and provide automatic correlation out of the box. For example, Pixie can correlate profiling data with traces; modern platforms link deployment events to performance changes; some correlate user session data with backend errors.
The key capability: Observability lets you investigate issues you didn't anticipate (the "unknown unknowns") without deploying new instrumentation. Monitoring alerts you to problems you predicted; Observability helps you debug novel failures.
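
"Context flows through the entire system" sounds abstract, so here is a stdlib-only sketch of what gets passed along. It uses the header layout from the W3C Trace Context `traceparent` header (version, 32-hex trace ID, 16-hex parent span ID, flags). The helper function names are hypothetical, and in a real system an instrumentation library such as OpenTelemetry would do this for you.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace at the edge of the system (e.g. in Service A)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # version 00, flags 01 = sampled


def child_traceparent(parent: str) -> str:
    """Build the header a service sends downstream.

    The trace ID is carried over unchanged; only the span ID is new.
    """
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent: str) -> str:
    return traceparent.split("-")[1]


header_a = new_traceparent()            # Service A receives the user request
header_b = child_traceparent(header_a)  # A calls B
header_c = child_traceparent(header_b)  # B calls C

# Every hop shares one trace ID, so logs and spans from A, B and C can be joined on it.
print(trace_id_of(header_a) == trace_id_of(header_c))
```

That shared ID is the high-cardinality value that ties the signals together. Without it, you are back to matching timestamps.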

In essence:

Monitoring tells you what happened (events that occurred)
Observability shows you why it happened (current state of the system as a whole with automatic context)

The Real-World Trade-offs

Now, before you rush to implement full observability, understand the costs:

Technical Costs:

Storage: High-cardinality trace data is expensive (10-100x more than metrics)
Query Performance: Correlating across billions of spans requires sophisticated indexing
Instrumentation Complexity: Proper context propagation across services takes effort
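
To get a feel for why high cardinality is expensive, here is a back-of-the-envelope sketch. The label names and counts are made-up numbers, and the result is a worst-case upper bound that assumes every label combination actually occurs.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Upper bound on distinct series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical low-cardinality labels, typical of monitoring dashboards.
low = {"service": 20, "status_code": 5, "region": 3}

# The same labels plus one high-cardinality dimension.
high = {**low, "user_id": 100_000}

print(series_count(low))
print(series_count(high))
```

Adding a single user-level dimension multiplies the worst case by the number of users, which is why the storage and indexing choices behind an observability backend matter so much.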

Organizational Costs:

Team training on new querying paradigms
Standardizing instrumentation across teams
Cultural shift to blameless postmortems and data-driven debugging

The OpenTelemetry Factor: Consider using OpenTelemetry as your instrumentation layer. It's vendor-neutral, widely supported, and prevents vendor lock-in. You can send the same telemetry to multiple backends or switch vendors without re-instrumenting your entire stack.

Sampling Strategies: At scale, you can't keep every trace. Implement intelligent sampling:

Head-based sampling (decide at trace start)
Tail-based sampling (decide after seeing the whole trace)
Error-biased sampling (keep all errors, sample success)
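
The three strategies can be sketched in a few lines. This is a simplified illustration with hypothetical function names and span fields, not the sampler of any particular SDK or collector.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before any span is seen.

    Hashing the ID makes the decision deterministic, so every service
    reaches the same verdict for the same trace.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


def tail_sample(spans: list, rate: float) -> bool:
    """Tail-based with error bias: decide after the whole trace is collected.

    Keep every trace that contains an error; sample successful traces at `rate`.
    """
    if any(span["status"] == "error" for span in spans):
        return True
    return head_sample(spans[0]["trace_id"], rate)


# Hypothetical completed traces.
ok_trace = [{"trace_id": "t-ok", "status": "ok"}]
bad_trace = [
    {"trace_id": "t-bad", "status": "ok"},
    {"trace_id": "t-bad", "status": "error"},
]

print(tail_sample(bad_trace, rate=0.0))  # errors are always kept
print(tail_sample(ok_trace, rate=0.0))   # successes are dropped at rate 0
```

The trade-off: head-based sampling is cheap but blind to how the request ends, while tail-based sampling needs to buffer whole traces before it can decide.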

What Observability is NOT

Just buying an expensive tool: Without proper instrumentation strategy, you're just paying for expensive monitoring
Collecting everything without purpose: Storage costs will bankrupt you; focus on high-value signals
Ignoring organizational readiness: The best observability platform is useless if your team doesn't know how to query it or your culture blames individuals for outages

Conclusion

The difference between Monitoring and Observability isn't academic. It's the difference between:

Knowing your service threw errors vs. knowing which customer's payment failed and why
Getting 10 alerts vs. getting 1 alert with 9 correlated insights
Spending 4 hours correlating logs manually vs. clicking through linked traces in 15 minutes

Build or buy observability when:

Your system complexity demands it (microservices, distributed systems)
MTTD and MTTR directly impact revenue
You need to debug issues you can't predict in advance

Start with monitoring when:

You have a monolith or simple architecture
Predefined dashboards meet 90% of your needs
Budget constraints are tight

And remember: Observability = Monitoring + Automatic Correlation + Context Propagation + High-Cardinality Support

That's the formula. Everything else is just implementation details.
