Honeycomb O'Reilly Book on Observability Engineering
This book does not shirk away from shedding light on the challenges one might face when
bootstrapping a culture of observability on a team, and provides valuable guidance on
how to go about it in a sustainable manner that should stand observability practitioners in
good stead for long-term success.
—Cindy Sridharan, infrastructure engineer
As your systems get more complicated and distributed, monitoring doesn’t really help you
work out what has gone wrong. You need to be able to solve problems you haven’t seen
before, and that’s where observability comes in. I’ve learned so much about observability
from these authors over the last five years, and I’m delighted that they’ve now written
this book that covers both the technical and cultural aspects of introducing and benefiting
from observability of your production systems.
—Sarah Wells, former technical director at the Financial Times
and O’Reilly author
This excellent book is the perfect companion for any engineer or manager wanting to get
the most out of their observability efforts. It strikes the perfect balance between being
concise and comprehensive: It lays a solid foundation by defining observability, explains
how to use it to debug and keep your services reliable, guides you on how to build a
strong business case for it, and finally provides the means to assess your efforts to help
with future improvements.
—Mads Hartmann, SRE at Gitpod
Observability Engineering
Achieving Production Excellence
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Observability Engineering, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained in this work is at your
own risk. If any code samples or other technology this work contains or describes is subject to open
source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Honeycomb. See our statement of editorial
independence.
978-1-098-13363-4
Table of Contents
Foreword
Preface
3. Lessons from Scaling Without Observability
    An Introduction to Parse
    Scaling at Parse
    The Evolution Toward Modern Systems
    The Evolution Toward Modern Practices
    Shifting Practices at Parse
    Conclusion
8. Analyzing Events to Achieve Observability
    Debugging from Known Conditions
    Debugging from First Principles
    Using the Core Analysis Loop
    Automating the Brute-Force Portion of the Core Analysis Loop
    The Misleading Promise of AIOps
    Conclusion
    User Experience Is a North Star
    What Is a Service-Level Objective?
    Reliable Alerting with SLOs
    Changing Culture Toward SLO-Based Alerts: A Case Study
    Conclusion
16. Efficient Data Storage
    The Functional Requirements for Observability
    Time-Series Databases Are Inadequate for Observability
    Other Possible Data Stores
    Data Storage Strategies
    Case Study: The Implementation of Honeycomb’s Retriever
    Partitioning Data by Time
    Storing Data by Column Within Segments
    Performing Query Workloads
    Querying for Traces
    Querying Data in Real Time
    Making It Affordable with Tiering
    Making It Fast with Parallelism
    Dealing with High Cardinality
    Scaling and Durability Strategies
    Notes on Building Your Own Efficient Data Store
    Conclusion
Index
Foreword
Over the past couple of years, the term “observability” has moved from the niche
fringes of the systems engineering community to the vernacular of the software
engineering community. As this term gained prominence, it also suffered the (alas,
inevitable) fate of being used interchangeably with another term with which it shares
a certain adjacency: “monitoring.”
What then followed was every bit as inevitable as it was unfortunate: monitoring
tools and vendors started co-opting and using the same language and vocabulary
used by those trying to differentiate the philosophical, technical, and sociotechnical
underpinnings of observability from those of monitoring. This muddying of the waters
wasn’t particularly helpful, to say the least. It risked conflating “observability” and
“monitoring” into a homogenous construct, thereby making it all the more difficult
to have meaningful, nuanced conversations about the differences.
To treat the difference between monitoring and observability as a purely semantic
one is a folly. Observability isn’t purely a technical concept that can be achieved by
buying an “observability tool” (no matter what any vendor might say) or adopting
the open standard du jour. To the contrary, observability is more a sociotechnical
concept. Successfully implementing observability depends just as much, if not more, on
having the appropriate cultural scaffolding to support the way software is developed,
deployed, debugged, and maintained, as it does on having the right tool at one’s
disposal.
In most (perhaps even all) scenarios, teams need to leverage both monitoring and
observability to successfully build and operate their services. But any such successful
implementation requires that practitioners first understand the philosophical
differences between the two.
What separates monitoring from observability is the state space of system behavior,
and moreover, how one might wish to explore the state space and at precisely what
level of detail. By “state space,” I’m referring to all the possible emergent behaviors a
system might exhibit during various phases: starting from when the system is being
designed, to when the system is being developed, to when the system is being tested,
to when the system is being deployed, to when the system is being exposed to users,
to when the system is being debugged over the course of its lifetime. The more
complex the system, the more expansive and protean its state space.
Observability allows for this state space to be painstakingly mapped out and explored
in granular detail with a fine-tooth comb. Such meticulous exploration is often
required to better understand unpredictable, long-tail, or multimodal distributions
in system behavior. Monitoring, in contrast, provides an approximation of overall
system health in broad brushstrokes.
It thus follows that everything from the data that’s being collected to this end, to how
this data is being stored, to how this data can be explored to better understand system
behavior varies vis-à-vis the purposes of monitoring and observability.
Over the past couple of decades, the ethos of monitoring has influenced the
development of myriad tools, systems, processes, and practices, many of which have become
the de facto industry standard. Because these tools, systems, processes, and practices
were designed for the explicit purpose of monitoring, they do a stellar job to this end.
However, they cannot—and should not—be rebranded or marketed to unsuspecting
customers as “observability” tools or processes. Doing so would provide little to no
discernible benefit, in addition to running the risk of being an enormous time, effort,
and money sink for the customer.
Furthermore, tools are only one part of the problem. Building or adopting
observability tooling and practices that have proven to be successful at other companies won’t
necessarily solve all the problems faced by one’s organization, inasmuch as a finished
product doesn’t tell the story behind how the tooling and concomitant processes
evolved, what overarching problems it aimed to solve, what implicit assumptions
were baked into the product, and more.
Building or buying the right observability tool won’t be a panacea without first
instituting a conducive cultural framework within the company that sets teams up for
success. A mindset and culture rooted in the shibboleths of monitoring—dashboards,
alerts, static thresholds—won’t help unlock the full potential of observability. An
observability tool might have access to a very large volume of very granular data, but
successfully making sense of the mountain of data—which is the ultimate arbiter of
the overall viability and utility of the tool, and arguably that of observability itself!—
requires a hypothesis-driven, iterative debugging mindset.
Simply having access to state-of-the-art tools doesn’t automatically cultivate this
mindset in practitioners. Nor does waxing eloquent about nebulous philosophical
distinctions between monitoring and observability without distilling these ideas into
cross-cutting practical solutions. For instance, there are chapters in this book that
take a dim view of holding up logs, metrics, and traces as the “three pillars of
observability.” While the criticisms aren’t without merit, the truth is that logs, metrics,
and traces have long been the only concrete examples of telemetry people running
real systems in the real world have had at their disposal to debug their systems, and it
was thus inevitable that the narrative of the “three pillars” cropped up around them.
What resonates best with practitioners building systems in the real world isn’t
abstract, airy-fairy ideas but an actionable blueprint that addresses and proposes
solutions to pressing technical and cultural problems they are facing. This book
manages to bridge the chasm that yawns between the philosophical tenets of observability
and the praxis thereof, by providing a concrete (if opinionated) blueprint of what
putting these ideas into practice might look like.
Instead of focusing on protocols or standards or even low-level representations of
various telemetry signals, the book envisages the three pillars of observability as
the triad of structured events (or traces without a context field, as I like to call
them), iterative verification of hypotheses (or hypothesis-driven debugging, as I like
to call it), and the “core analysis loop.” This holistic reframing of the building blocks
of observability from first principles helps underscore that telemetry signals
alone (or tools built to harness these signals) don’t make system behavior maximally
observable. The book does not shirk away from shedding light on the challenges one
might face when bootstrapping a culture of observability on a team, and provides
valuable guidance on how to go about it in a sustainable manner that should stand
observability practitioners in good stead for long-term success.
— Cindy Sridharan
Infrastructure Engineer
San Francisco
April 26, 2022
Preface
Thank you for picking up our book on observability engineering for modern software
systems. Our goal is to help you develop a practice of observability within your
engineering organization. This book is based on our experience as practitioners of
observability, and as makers of observability tooling for users who want to improve
their own observability practices.
As outspoken advocates for driving observability practices in software engineering,
our hope is that this book can set a clear record of what observability means in the
context of modern software systems. The term “observability” has seen quite a bit of
recent uptake in the software development ecosystem. This book aims to help you
separate facts from hype by providing a deep analysis of the following:
Additionally, managers of software delivery and operations teams who are interested
in understanding how the practice of observability can benefit their organization will
find value in this book, particularly in the chapters that focus on team dynamics,
culture, and scale.
Anyone who helps teams deliver and operate production software (for example,
product managers, support engineers, and stakeholders) and is curious about this
new thing called “observability” and why people are talking about it should also find
this book useful.
What You Will Learn
You will learn what observability is, how to identify an observable system, and why
observability is best suited for managing modern software systems. You’ll learn how
observability differs from monitoring, as well as why and when a different approach
is necessary. We will also cover why industry trends have helped popularize the
need for observability and how that fits into emerging spaces, like the cloud native
ecosystem.
Next, we’ll cover the fundamentals of observability. We’ll examine why structured
events are the building blocks of observable systems and how to stitch those events
together into traces. Events are generated by telemetry that is built into your software,
and you will learn about open source initiatives, like OpenTelemetry, that help
jump-start the instrumentation process. You will learn about the data-based investigative
process used to locate the source of issues in observable systems, and how it differs
substantially from the intuition-based investigative process used in traditional
monitoring. You will also learn how observability and monitoring can coexist.
After an introduction to these fundamental technical concepts, you will learn about
the social and cultural elements that often accompany the adoption of observability.
Managing software in production is a team sport, and you will learn how
observability can be used to help better shape team dynamics. You will learn about how
observability fits into business processes, affects the software supply chain, and
reveals hidden risks. You will also learn how to put both these technical and social
concepts into practice when we examine how to use service-level objectives for more
effective alerting and dive into the technical details behind why they make alerts both
actionable and debuggable when using observability data.
Then, you’ll learn about inherent challenges when implementing observability
solutions at scale. We’ll start with the considerations you should take into account when
deciding whether to buy or build an observability solution. An essential property
of observability solutions is that they must provide fast answers during iterative
investigations. Therefore, we will show you how to address the inherent challenges
of efficient data storage and retrieval when managing extremely large data sets. You
will also learn when to introduce solutions like event sampling and how to navigate
its trade-offs to find the right approach to fit your needs. You will also learn how to
manage extremely large quantities of data with telemetry pipelines.
Finally, we look at organizational approaches to adopting a culture of observability.
Beyond introducing observability to your team, you will learn practical ways to scale
observability practices across an entire organization. You will learn how to identify
and work with key stakeholders, use technical approaches to win allies, and make a
business case for adopting observability practices.
We started writing this book nearly three years ago. Part of the reason it has taken
so long to produce is that the observability landscape has been changing rapidly
and practices are advancing. We believe this book is the most up-to-date and
comprehensive look at the state of the art in observability practices as of the time of its
publication. We hope you find the journey as fascinating as we have.
We appreciate, but generally do not require, attribution. An attribution usually
includes the title, author, publisher, and ISBN. For example: “Observability
Engineering by Charity Majors, Liz Fong-Jones, and George Miranda (O’Reilly). Copyright
2022 Hound Technology Inc., 978-1-492-07644-5.”
If you feel your use of code examples falls outside fair use or the permission given
above, feel free to contact us at [email protected].
Our unique network of experts and innovators share their knowledge and expertise
through books, articles, and our online learning platform. O’Reilly’s online learning
platform gives you on-demand access to live training courses, in-depth learning
paths, interactive coding environments, and a vast collection of text and video from
O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at https://oreil.ly/observability-engineering.
Email [email protected] to comment or ask technical questions about this
book.
For news and information about our books and courses, visit https://oreilly.com.
Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia
Acknowledgments
This book would not have been possible without the support of executive sponsors
at Honeycomb: many thanks to Christine Yen, Deirdre Mahon, and Jo Ann Sanders.
Nor would this book have been possible without the domestic sponsors who put
up with many odd hours, lost weekends, sleepless weeknights, and cranky partners:
many thanks to Rebekah Howard, Elly Fong-Jones, and Dino Miranda. Without all
of them, we would probably still be trying to find the time to fully develop and tie
together the many ideas expressed in this book.
We’d especially like to thank additional contributors who have made the content
in this book much stronger by sharing their varied perspectives and expertise.
Chapter 16, “Efficient Data Storage”, was made possible by Ian Wilkes (author of
Honeycomb’s Retriever engine, the basis of the case study), and Joan Smith (reviewing for
technical accuracy of references to external literature). Chapter 14, “Observability
and the Software Supply Chain”, was authored by Frank Chen, and Chapter 18,
“Telemetry Management with Pipelines”, was authored by Suman Karumuri and Ryan
Katkov—all of whom we thank for sharing their knowledge and the lessons they’ve
learned from managing incredibly large-scale applications with observability at Slack.
Many thanks to Rachel (pie) Perkins for contributions to several early chapters in this
book. And thanks to the many bees at Honeycomb who, over the years, have helped
us explore what’s achievable with observability.
Finally, many thanks to our many external reviewers: Sarah Wells, Abby Bangser,
Mads Hartmann, Jess Males, Robert Quinlivan, John Feminella, Cindy Sridharan,
Ben Sigelman, and Daniel Spoonhower. We’ve revised our takes, incorporated
broader viewpoints, and revisited concepts throughout the authoring process to
ensure that we’re reflecting an inclusive state of the art in the world of observability.
Although we (the authors of this book) all work for Honeycomb, our goal has always
been to write an objective and inclusive book detailing how observability works in
practice, regardless of specific tool choices. We thank our reviewers for keeping us
honest and helping us develop a stronger and all-encompassing narrative.
PART I
The Path to Observability
This section defines concepts that are referenced throughout the rest of this book.
You will learn what observability is, how to identify an observable system, and why
observability-based debugging techniques are better suited for managing modern
software systems than monitoring-based debugging techniques.
Chapter 1 examines the roots of the term “observability,” shows how that concept has
been adapted for use in software systems, and provides concrete questions you can
ask yourself to determine if you have an observable system.
Chapter 2 looks at the practices engineers use to triage and locate sources of issues
using traditional monitoring methods. Those methods are then contrasted with
methods used in observability-based systems. This chapter describes these methods
at a high level, but the technical and workflow implementations will become more
concrete in Part II.
Chapter 3 is a case study written by coauthor Charity Majors and told from her
perspective. This chapter brings concepts from the first two chapters into a practical case
study illustrating when and why the shift toward observability becomes absolutely
necessary.
Chapter 4 illustrates how and why industry trends have helped popularize the
need for observability and how that fits into emerging spaces, like the cloud native
ecosystem.
CHAPTER 1
What Is Observability?
In the software development industry, the subject of observability has garnered a lot
of interest and is frequently found in lists of hot new topics. But, as things seem
to inevitably go when a hot new topic sees a surging level of interest in adoption,
complex ideas become all too ripe for misunderstanding without a deeper look at the
many nuances encapsulated by a simple topical label. This chapter looks at the
mathematical origins of the term “observability” and examines how software development
practitioners adapted it to describe characteristics of production software systems.
We also look at why the adaptation of observability for use in production software
systems is necessary. Traditional practices for debugging the internal state of software
applications were designed for legacy systems that were much simpler than those
we typically manage today. As systems architecture, infrastructure platforms, and
user expectations have continued to evolve, the tools we use to reason about those
components have not. By and large, the debugging practices developed decades ago
with nascent monitoring tools are still the same as those used by many engineering
teams today—even though the systems they manage are infinitely more complex.
Observability tools were born out of sheer necessity, when traditional tools and
debugging methods simply were not up to the task of quickly finding deeply hidden
and elusive problems.
This chapter will help you understand what “observability” means, how to determine
if a software system is observable, why observability is necessary, and how
observability is used to find problems in ways that are not possible with other approaches.
The Mathematical Definition of Observability
The term “observability” was coined by engineer Rudolf E. Kálmán in 1960. It has
since grown to mean many different things in different communities. Let’s explore the
landscape before turning to our own definition for modern software systems.
In his 1960 paper, Kálmán introduced a characterization he called observability to
describe mathematical control systems.1 In control theory, observability is defined as
a measure of how well internal states of a system can be inferred from knowledge of
its external outputs.
This definition of observability would have you study observability and
controllability as mathematical duals, along with sensors, linear algebra equations, and formal
methods. This traditional definition of observability is the realm of mechanical
engineers and those who manage physical systems with a specific end state in mind.
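For readers who want a glimpse of that formalism, here is a minimal sketch of the standard control-theory statement (a textbook result, not something the rest of this book depends on). A linear time-invariant system

    \dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)

is observable exactly when its observability matrix

    \mathcal{O} = \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix}

has full rank n: the outputs y carry enough information to reconstruct any internal state x.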
If you are looking for a mathematical and process engineering oriented textbook,
you’ve come to the wrong place. Those books definitely exist, and any mechanical
engineer or control systems engineer will inform you (usually passionately and at
great length) that observability has a formal meaning in traditional systems
engineering terminology. However, when that same concept is adapted for use with squishier
virtual software systems, it opens up a radically different way of interacting with and
understanding the code you write.
1 Rudolf E. Kálmán, “On the General Theory of Control Systems”, IFAC Proceedings Volumes 1, no. 1 (August
1960): 491–502.
• Can you continually answer open-ended questions about the inner workings of
your applications to explain any anomalies, without hitting investigative dead
ends (i.e., the issue might be in a certain group of things, but you can’t break it
down any further to confirm)?
• Can you understand what any particular user of your software may be
experiencing at any given time?
• Can you quickly see any cross-section of system performance you care about,
from top-level aggregate views, down to the single and exact user requests that
may be contributing to any slowness (and anywhere in between)?
• Can you compare any arbitrary groups of user requests in ways that let you
correctly identify which attributes are commonly shared by all users who are
experiencing unexpected behavior in your application?
• Once you do find suspicious attributes within one individual user request, can
you search across all user requests to identify similar behavioral patterns to
confirm or rule out your suspicions?
• Can you identify which system user is generating the most load (and therefore
slowing application performance the most), as well as the 2nd, 3rd, or 100th most
load-generating users?
• Can you identify which of those most-load-generating users only recently started
impacting performance?
• If the 142nd slowest user complained about performance speed, can you isolate
their requests to understand why exactly things are slow for that specific user?
• If users complain about timeouts happening, but your graphs show that the
99th, 99.9th, even 99.99th percentile requests are fast, can you find the hidden
timeouts?
• Can you answer questions like the preceding ones without first needing to
predict that you might need to ask them someday (and therefore set up specific
monitors in advance to aggregate the necessary data)?
• Can you answer questions like these about your applications even if you have
never seen or debugged this particular issue before?
• Can you get answers to questions like the preceding ones quickly, so that you
can iteratively ask a new question, and another, and another, until you get to the
correct source of issues, without losing your train of thought (which typically
means getting answers within seconds instead of minutes)?
• Can you answer questions like the preceding ones even if that particular issue has
never happened before?
Meeting all of the preceding criteria is a high bar for many software engineering
organizations to clear. If you can clear that bar, you, no doubt, understand why
observability has become such a popular topic for software engineering teams.
Put simply, our definition of “observability” for software systems is a measure of how
well you can understand and explain any state your system can get into, no matter
how novel or bizarre. You must be able to comparatively debug that bizarre or novel
state across all dimensions of system state data, and combinations of dimensions, in
an ad hoc iterative investigation, without being required to define or predict those
debugging needs in advance. If you can understand any bizarre or novel state without
needing to ship new code, you have observability.
We believe that adapting the traditional concept of observability for software systems
in this way is a unique approach with additional nuances worth exploring. For
modern software systems, observability is not about the data types or inputs, nor
is it about mathematical equations. It is about how people interact with and try to
understand their complex systems. Therefore, observability requires recognizing the
interaction between both people and technology to understand how those complex
systems work together.
If you accept that definition, many additional questions emerge that demand
answers:
• How does one gather that data and assemble it for inspection?
• What are the technical requirements for processing that data?
• What team capabilities are necessary to benefit from that data?
We will get to these questions and more throughout the course of this book. For now,
let’s put some additional context behind observability as it applies to software.
The application of observability to software systems has much in common with its
control theory roots. However, it is far less mathematical and much more practical. In
part, that’s because software engineering is a much younger and more rapidly
evolving discipline than its more mature mechanical engineering predecessor. Production
software systems are much less subject to formal proofs. That lack of rigor is, in
part, betrayed by the scars we, as an industry, have earned through operating the
software code we write in production.
2 Sometimes these claims include time spans to signify “discrete occurrences of change” as a fourth pillar of a
generic synonym for “telemetry.”
When compared to the reality of modern systems, it becomes clear that traditional
monitoring approaches fall short in several ways. The reality of modern systems is as
follows:
The last point is important, because it describes the breakdown that occurs between
the limits of correlated knowledge that one human can be reasonably expected to
think about and the reality of modern system architectures. So many possible
dimensions are involved in discovering the underlying correlations behind performance
issues that no human brain, and in fact no schema, can possibly contain them.
With observability, comparing high-dimensionality and high-cardinality data
becomes a critical component of being able to discover otherwise hidden issues
buried in complex system architectures.
3 For a more in-depth analysis, see Pete Hodgson’s blog post “Why Intuitive Troubleshooting Has Stopped
Working for You”.
Modern distributed systems architectures notoriously fail in novel ways that no one is
able to predict and that no one has experienced before. This condition happens often
enough that an entire set of assertions has been coined about the false assumptions
that programmers new to distributed computing often make. Modern distributed
systems are also made accessible to application developers as abstracted infrastructure
platforms. As users of those platforms, application developers are now left to deal
with an inherent amount of irreducible complexity that has landed squarely on their
plates.
The previously submerged complexity of application code subroutines that interacted
with one another inside the hidden random access memory internals of one physical
machine has now surfaced as service requests between hosts. That newly exposed
complexity then hops across multiple services, traversing an unpredictable network
many times over the course of a single function. When modern architectures started
to favor decomposing monoliths into microservices, software engineers lost the
ability to step through their code with traditional debuggers. Meanwhile, their tools have
yet to come to grips with that seismic shift.
In short: we blew up the monolith. Now every request has to hop the network
multiple times, and every software developer needs to be better versed in systems and
operations just to get their daily work done.
Examples of this seismic shift can be seen with the trend toward containerization,
the rise of container orchestration platforms, the shift to microservices, the common
use of polyglot persistence, the introduction of the service mesh, the popularity of
ephemeral autoscaling instances, serverless computing, lambda functions, and the
myriad other SaaS applications in a software developer’s typical tool set. Stringing
Conclusion
Although the term observability has been defined for decades, its application to
software systems is a new adaptation that brings with it several new considerations
and characteristics. Compared to their simpler early counterparts, modern systems
In the previous chapter, we covered the origins and common use of the metrics data
type for debugging. In this chapter, we’ll more closely examine the specific debugging
practices associated with traditional monitoring tools and how those differ from the
debugging practices associated with observability tools.
Traditional monitoring tools work by checking system conditions against known
thresholds that indicate whether previously known error conditions are present.
That is a fundamentally reactive approach because it works well only for identifying
previously encountered failure modes.
In contrast, observability tools work by enabling iterative exploratory investigations
to systematically determine where and why performance issues may be occurring.
Observability enables a proactive approach to identifying any failure mode, whether
previously known or unknown.
In this chapter, we focus on understanding the limitations of monitoring-based
troubleshooting methods. First, we unpack how monitoring tools are used within
the context of troubleshooting software performance issues in production. Then
we examine the behaviors institutionalized by those monitoring-based approaches.
Finally, we show how observability practices enable teams to identify both previously
known and unknown issues.
performance of an application over time; then they report an aggregate measure of
performance over that interval. Monitoring systems collect, aggregate, and analyze
metrics to sift through known patterns that indicate whether troubling trends are
occurring.
This monitoring data has two main consumers: machines and humans. Machines use
monitoring data to make decisions about whether a detected condition should trigger
an alert or a recovery should be declared. A metric is a numerical representation
of system state over the particular interval of time when it was recorded. Similar
to looking at a physical gauge, we might be able to glance at a metric that conveys
whether a particular resource is over- or underutilized at a particular moment in
time. For example, CPU utilization might be at 90% right now.
But is that behavior changing? Is the measure shown on the gauge going up or going
down? Metrics are typically more useful in aggregate. Understanding the trending
values of metrics over time provides insights into system behaviors that affect
software performance. Monitoring systems collect, aggregate, and analyze metrics to sift
through known patterns that indicate trends their humans want to know about.
If CPU utilization continues to stay over 90% for the next two minutes, someone
may have decided that’s a condition they want to be alerted about. For clarity, it’s
worth noting that to the machines, a metric is just a number. System state in the
metrics world is very binary. Below a certain number and interval, the machine will
not trigger an alert. Above a certain number and interval, the machine will trigger an
alert. Where exactly that threshold lies is a human decision.
When monitoring systems detect a trend that a human identified as important, an
alert is sent. Similarly, if CPU utilization drops below 90% for a preconfigured
timespan, the monitoring system will determine that the error condition for the triggered
alert no longer applies and therefore declare the system recovered. It’s a rudimentary
system, yet so many of our troubleshooting capabilities rely on it.
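To make that binary nature concrete, here is a minimal sketch of the kind of threshold evaluation a monitoring system performs (our illustration, not the internals of any particular product; the metric, the 90% threshold, and the two-minute window are hypothetical):

    package main

    import (
        "fmt"
        "time"
    )

    // sample is one recorded metric value and the moment it was observed.
    type sample struct {
        at    time.Time
        value float64
    }

    // shouldAlert reports whether every sample inside the trailing window
    // exceeded the threshold -- the classic "CPU > 90% for two minutes" rule.
    // Below the line: silence. Above the line for long enough: page someone.
    func shouldAlert(samples []sample, threshold float64, window time.Duration, now time.Time) bool {
        breached := false
        for _, s := range samples {
            if now.Sub(s.at) > window {
                continue // outside the evaluation window; ignored
            }
            breached = true
            if s.value <= threshold {
                return false // any sample at or under the threshold resets the condition
            }
        }
        return breached
    }

    func main() {
        now := time.Now()
        samples := []sample{
            {at: now.Add(-90 * time.Second), value: 93.0},
            {at: now.Add(-60 * time.Second), value: 95.5},
            {at: now.Add(-30 * time.Second), value: 91.2},
        }
        fmt.Println(shouldAlert(samples, 90.0, 2*time.Minute, now)) // true: the alert fires
    }

The point is not the code but its shape: the machine compares a number against a line a human drew in advance. It has no notion of why the line is there, or of anything else that might be wrong.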
The way humans use that same data to debug issues is a bit more interesting. Those
numerical measurements are fed into TSDBs, and a graphical interface uses that
database to source graphical representations of data trends. Those graphs can be
collected and assembled into progressively more complicated combinations, known
as dashboards.
Static dashboards are commonly assembled one per service, and they’re a useful
starting point for an engineer to begin understanding particular aspects of the underlying
system. This is the original intent for dashboards: to provide an overview of how a set
of metrics is tracking and to surface noteworthy trends. However, dashboards are a
poor choice for discovering new problems with debugging.
When dashboards were first built, we didn’t have many system metrics to worry
about. So it was relatively easy to build a dashboard that showed the critical data
Those are just a few example questions, and they can ask more. However, what
they have available is a dashboard and graphs for CPU load average, memory usage,
index counters, and lots of other internal statistics for the host and running database.
They cannot slice and dice or break down by user, query, destination or source IP,
Example 3: Tool-hopping
An engineer sees a spike in errors at a particular time. They start paging through
dashboards, looking for spikes in other metrics at the same time, and they find some,
but they can’t tell which are the cause of the error and which are the effects. So they
jump over into their logging tool and start grepping for errors. Once they find the
request ID of an error, they turn to their tracing tool and copy-paste the error ID into
the tracing tool. (If that request isn’t traced, they repeat this over and over until they
catch one that is.)
These monitoring tools can get better at detecting finer-grained problems over
time—if you have a robust tradition of always running retrospectives after outages
and adding custom metrics where possible in response. Typically, the way this
happens is that the on-call engineer figures it out or arrives at a reasonable hypothesis,
and also figures out exactly which metric(s) would answer the question if it exists.
They ship a change to create that metric and begin gathering it. Of course, it’s too
late now to see if your last change had the impact you’re guessing it did—you can’t go
back in time and capture that custom metric a second time unless you can replay the
whole exact scenario—but if it happens again, the story goes, next time you’ll know
for sure.
The engineers in the preceding examples might go back and add custom metrics for
each query family, for expiration rates per collection, for error rates per shard, etc.
(They might go nuts and add custom metrics for every single query family’s lock
usage, hits for each index, buckets for execution times, etc.—and then find out they
doubled their entire monitoring budget the next billing period.)
• When issues occur in production, are you determining where you need to
investigate based on an actual visible trail of system information breadcrumbs? Or are
you following your intuition to locate those problems? Are you looking in the
place where you know you found the answer last time?
• Are you relying on your expert familiarity of this system and its past problems?
When you use a troubleshooting tool to investigate a problem, are you
exploratively looking for clues? Or are you trying to confirm a guess? For example, if
latency is slow across the board and you have dozens of databases and queues
that could be producing it, are you able to use data to determine where the
latency is coming from? Or do you guess it must be your MySQL database, per
usual, and then go check your MySQL graphs to confirm your hunch?
How often do you intuitively jump to a solution and then look for confirmation
that it’s right and proceed accordingly—but actually miss the real issue because
that confirmed assumption was only a symptom, or an effect rather than the
cause?
• Are your troubleshooting tools giving you precise answers to your questions
and leading you to direct answers? Or are you performing translations based on
system familiarity to arrive at the answer you actually need?
• How many times are you leaping from tool to tool, attempting to correlate
patterns between observations, relying on yourself to carry the context between
disparate sources?
• Most of all, is the best debugger on your team always the person who has been
there the longest? This is a dead giveaway that most of your knowledge about
your system comes not from a democratized method like a tool, but through
personal hands-on experience alone.
Guesses aren’t good enough. Correlation is not causation. A vast disconnect often
exists between the specific questions you want to ask and the dashboards available
to provide answers. You shouldn’t have to make a leap of faith to connect cause and
effect.
It gets even worse when you consider the gravity-warping impact that confirmation
bias can have. With a system like this, you can’t find what you don’t know to look for.
You can’t ask questions that you didn’t predict you might need to ask, far in advance.
So far, we’ve defined observability and how it differs from traditional monitoring.
We’ve covered some of the limitations of traditional monitoring tools when managing
modern distributed systems and how observability solves them. But an evolutionary
gap remains between the traditional and modern world. What happens when trying
to scale modern systems without observability?
In this chapter, we look at a real example of slamming into the limitations of
traditional monitoring and architectures, along with why different approaches are needed
when scaling applications. Coauthor Charity Majors shares her firsthand account on
lessons learned from scaling without observability at her former company, Parse. This
story is told from her perspective.
An Introduction to Parse
Hello, dear reader. I’m Charity, and I’ve been on call since I was 17 years old. Back
then, I was racking servers and writing shell scripts at the University of Idaho. I
remember the birth and spread of many notable monitoring systems: Big Brother,
Nagios, RRDtool and Cacti, Ganglia, Zabbix, and Prometheus. I’ve used most—not
quite all—of them. They were incredibly useful in their time. Once I got a handle on
TSDBs and their interfaces, every system problem suddenly looked like a nail for the
time-series hammer: set thresholds, monitor, rinse, and repeat.
During my career, my niche has been coming in as the first infrastructure engineer
(or one of the first) to join an existing team of software engineers in order to help
mature their product to production readiness. I’ve made decisions about how best to
understand what’s happening in production systems many, many times.
That’s what I did at Parse. Parse was a mobile-backend-as-a-service (MBaaS)
platform, providing mobile-app developers a way to link their apps to backend cloud
storage systems, and APIs to backend systems. The platform enabled features like
user management, push notifications, and integration with social networking
services. In 2012, when I joined the team, Parse was still in beta. At that time, the
company was using a bit of Amazon CloudWatch and, somehow, was being alerted by
five different systems. I switched us over to using Icinga/Nagios and Ganglia because
those were the tools I knew best.
Parse was an interesting place to work because it was so ahead of its time in many
ways (we would go on to be acquired by Facebook in 2013). We had a microservice
architecture before we had the name “microservices” and long before that pattern
became a movement. We were using MongoDB as our data store and very much
growing up alongside it: when we started, it was version 2.0 with a single lock per
replica set. We were developing with Ruby on Rails and we had to monkey-patch
Rails to support multiple database shards.
We had complex multitenant architectures with shared tenancy pools. In the early
stages, we were optimizing for development speed, full stop.
I want to pause here to stress that optimizing for development speed was the right
thing to do. With that decision, we made many early choices that we later had to undo
and redo. But most start-ups don’t fail because they make the wrong tooling choices.
And let’s be clear: most start-ups do fail. They fail because there’s no demand for their
product, or because they can’t find product/market fit, or because customers don’t
love what they built, or any number of reasons where time is of the essence. Choosing
a stack that used MongoDB and Ruby on Rails enabled us to get to market quickly
enough that we delighted customers, and they wanted a lot more of what we were
selling.
Around the time Facebook acquired Parse, we were hosting over 60,000 mobile apps.
Two and a half years later, when I left Facebook, we were hosting over a million
mobile apps. But even when I first joined, in 2012, the cracks were already starting to
show.
Parse officially launched a couple of months after I joined. Our traffic doubled, then
doubled, and then doubled again. We were the darling of Hacker News, and every
time a post about us showed up there, we’d get a spike of new sign-ups.
In August of 2012, one of our hosted apps moved into a top 10 spot in the iTunes
Store for the first time. The app was marketing a death-metal band from Norway. The
band used the app to livestream broadcasts. For the band, this was in the evening; for
Parse, it was the crack of dawn. Every time the band livestreamed, Parse went down
in seconds flat. We had a scaling problem.
It’s important to note here that Ruby is not a threaded language. So that pool of
API workers was a fixed pool. Whenever any one of the backends got just a little
bit slower at fulfilling requests, the pool would rapidly fill itself up with pending
requests to that backend. Whenever a backend became very slow (or completely
unresponsive), the pools would fill up within seconds—and all of Parse would go
down.
At first, we attacked that problem by overprovisioning instances: our Unicorns ran at
20% utilization during their normal steady state. That approach allowed us to survive
some of the gentler slowdowns. But at the same time, we also made the painful
decision to undergo a complete rewrite from Ruby on Rails to Go. We realized that
the only way out of this hellhole was to adopt a natively threaded language. It took
us over two years to rewrite the code that had taken us one year to write. In the
meantime, we were on the bleeding edge of experiencing all the ways that traditional
operational approaches were fundamentally incompatible with modern architectural
problems.
This was a particularly brutal time all around at Parse. We had an experienced
operations engineering team doing all the “right things.” We were discovering that the best
practices we all knew, which were born from using traditional approaches, simply
weren’t up to the task of tackling problems in the modern distributed microservices
era.
At Parse, we were all-in on infrastructure as code. We had an elaborate system of
Nagios checks, PagerDuty alerts, and Ganglia metrics. We had tens of thousands
of Ganglia graphs and metrics. But those tools were failing us because they were
valuable only when we already knew what the problem was going to be—when we
knew which thresholds were good and where to check for problems.
For instance, TSDB graphs were valuable when we knew which dashboards to
carefully curate and craft—if we could predict which custom metrics we would need in
order to diagnose problems. Our logging tools were valuable when we had a pretty
good idea of what we were looking for—if we’d remembered to log the right things
in advance, and if we knew the right regular expression to search for. Our application
performance management (APM) tools were great when problems manifested as one
of the top 10 bad queries, bad users, or bad endpoints that they were looking for.
But we had a whole host of problems that those solutions couldn’t help us solve. That
previous generation of tools was falling short under these circumstances:
• Every other day, we had a brand-new user skyrocketing into the top 10 list for
one of the mobile-app stores.
• Load coming from any of the users identified in any of our top 10 lists wasn’t the
cause of our site going down.
These types of problems are all categorically different from the last generation of
problems: the ones for which that set of tools was built. Those tools were built for a
world where predictability reigned. In those days, production software systems had
“the app”—with all of its functionality and complexity contained in one place—and
“the database.” But now, scaling meant that we blew up those monolithic apps into
many services used by many different tenants. We blew up the database into a diverse
range of many storage systems.
At many companies, including Parse, our business models turned our products into
platforms. We invited users to run any code they saw fit to run on our hosted services.
We invited them to run any query they felt like running against our hosted databases.
And in doing so, suddenly, all of the control we had over our systems evaporated in a
puff of market dollars to be won.
In this era of services as platforms, our customers love how powerful we make them.
That drive has revolutionized the software industry. And for those of us running
the underlying systems powering those platforms, it meant that everything became
massively—and exponentially—harder not just to operate and administer, but also to
understand.
How did we get here, and when did the industry seemingly change overnight? Let’s
look at the various small iterations that created such a seismic shift.
These technical changes have also had powerful ripple effects at the human and
organizational level: our systems are sociotechnical. The complexity introduced by
these shifts at the social level (and their associated feedback loops) has driven
further changes to the systems and the way we think about them.
The tools and techniques needed to manage monolithic systems like the LAMP stack
were radically ineffective for running modern systems. Systems with applications
deployed in one “big bang” release are managed rather differently than microservices.
With microservices, applications are often rolled out piece by piece, and code
deployments don’t necessarily release features because feature flags now enable or disable
code paths with no deployment required.
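As a small illustration of that decoupling, the sketch below shows a hypothetical feature-flag check (the flag name and the flag client are invented for this example): the new code path ships with the deployment but stays dormant until the flag is flipped, so releasing the feature is a flag change rather than another deploy.

    package main

    import "fmt"

    // flagClient stands in for whatever feature-flag service a team uses;
    // in a real system this would consult a flag store, often per user or per cohort.
    type flagClient struct {
        enabled map[string]bool
    }

    func (f flagClient) IsEnabled(flag string) bool { return f.enabled[flag] }

    // renderCheckout chooses between an old and a new code path at runtime.
    // Both paths are already deployed; only the flag decides which one users see.
    func renderCheckout(flags flagClient, userID string) string {
        if flags.IsEnabled("new-checkout-flow") {
            return "new checkout flow for " + userID // dark-launched path, gated by the flag
        }
        return "legacy checkout flow for " + userID // default until the flag is enabled
    }

    func main() {
        flags := flagClient{enabled: map[string]bool{"new-checkout-flow": false}}
        fmt.Println(renderCheckout(flags, "user-42")) // legacy path

        flags.enabled["new-checkout-flow"] = true     // flag flipped; no redeploy
        fmt.Println(renderCheckout(flags, "user-42")) // new path
    }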
Similarly, in a distributed world, staging systems have become less useful or reliable
than they used to be. Even in a monolithic world, replicating a production
environment to staging was always difficult. Now, in a distributed world, it’s effectively
impossible. That means debugging and inspection have become ineffective in staging,
and we have shifted to a model requiring those tasks to be accomplished in
production itself.
• User experience can no longer be generalized as being the same for all service
users. In the new model, different users of a service may be routed through the
system in different ways, using different components, providing experiences that
can vary widely.
• Monitoring alerts that look for edge cases in production that have system
conditions exceeding known thresholds generate a tremendous number of false
positives, false negatives, and meaningless noise. Alerting has shifted to a model
in which fewer alerts are triggered, by focusing only on symptoms that directly
impact user experience.
• Debuggers can no longer be attached to one specific runtime. Fulfilling service
requests now requires hopping across a network, spanning multiple runtimes,
often multiple times per individual request.
• Known recurring failures that require manual remediation and can be defined in
a runbook are no longer the norm. Service failures have shifted from that model
toward one in which known recurring failures can be recovered automatically.
Failures that cannot be automatically recovered, and therefore trigger an alert,
likely mean the responding engineer will be facing a novel problem.
These tertiary signals mean that a massive gravitational shift in focus is happening
away from the importance of preproduction and toward the importance of being
intimately familiar with production. Traditional efforts to harden code and ensure
its safety before it goes to production are starting to be accepted as limiting and
somewhat futile. Test-driven development and running tests against staging environments
still have use. But they can never replicate the wild and unpredictable nature of how
that code will be used in production.
As developers, we all have a fixed number of cycles we can devote to accomplishing
our work. The limitation of the traditional approach is that it focuses on
preproduction hardening first and foremost. Any leftover scraps of attention, if they even exist,
are then given to focusing on production systems. If we want to build reliable services
in production, that ordering must be reversed.
In modern systems, we must focus the bulk of our engineering attention and tooling
on production systems, first and foremost. The leftover cycles of attention should be
applied to staging and preproduction systems. There is value in staging systems. But
it is secondary in nature.
Staging systems are not production. They can never replicate what is happening in
production. The sterile lab environment of preproduction systems can never mimic
Even with our few remaining monolithic systems that used boring technology, where
all possible problems were pretty well understood, we still experienced some gains.
We didn’t necessarily discover anything new, but we were able to ship software more
swiftly and confidently as a result.
At Parse, what allowed us to scale our approach was learning how to work with a
system that was observable. By gathering application telemetry, at the right level of
abstraction, aggregated around the user’s experience, with the ability to analyze it in
real time, we gained magical insights. We removed the limitations of our traditional
tools once we had the capability to ask any question, trace any sequence of steps,
and understand any internal system state, simply by observing the outputs of our
applications. We were able to modernize our practices once we had observability.
Conclusion
This story, from my days at Parse, illustrates how and why organizations make a
transition from traditional tools and monitoring approaches to scaling their practices
with modern distributed systems and observability. I departed Facebook in 2015,
shortly before it announced the impending shutdown of the Parse hosting service.
Since then, many of the problems my team and I faced in managing modern
distributed systems have only become more common as the software industry has
shifted toward adopting similar technologies.
Liz, George, and I believe that shift accounts for the enthusiasm behind, and the rise
in popularity of, observability. Observability is the solution to problems that have
become prevalent when scaling modern systems. In further chapters, we’ll explore the
many facets, impacts, and benefits that observability delivers.
CHAPTER 4
How Observability Relates to
DevOps, SRE, and Cloud Native
across a large number of virtual and physical servers means the business benefits
from better cost controls and scalability.
The shift to cloud native requires not only adopting a complete set of new
technologies, but also changing how people work. That shift is inherently sociotechnical. On
the surface, using the microservices toolchain itself has no explicit requirement to
adopt new social practices. But to achieve the promised benefits of the technology,
changing the work habits also becomes necessary. Although this should be evident
from the stated definition and goals, teams typically get several steps in before
realizing that their old work habits do not help them address the management costs
introduced by this new technology. That is why successful adoption of cloud native
design patterns is inextricably tied to the need for observable systems and for DevOps
and SRE practices.
1 Jonathan Smart, “Want to Do an Agile Transformation? Don’t. Focus on Flow, Quality, Happiness, Safety, and
Value”, July 21, 2018.
In the first part of this book, we examined the definition of “observability,” its
necessity in modern systems, its evolution from traditional practices, and the way it
is currently being used in practice. This second section delves deeper into technical
aspects and details why particular requirements are necessary in observable systems.
Chapter 5 introduces the basic data type necessary to build an observable system—
the arbitrarily wide structured event. It is this fundamental data type for telemetry
that makes the analysis described later in this part possible.
Chapter 6 introduces distributed tracing concepts. It breaks down how tracing sys‐
tems work in order to illustrate that trace data is simply a series of interrelated
arbitrarily wide structured events. This chapter walks you through manually creating
a minimal trace with code examples.
Chapter 7 introduces the OpenTelemetry project. While the manual code examples
in Chapter 6 help illustrate the concept, you would more than likely start with an
instrumentation library. Rather than choosing a proprietary library or agent that
locks you into one vendor’s particular solution, we recommend starting with an
open source and vendor-neutral instrumentation framework that allows you to easily
switch between observability tools of your choice.
Chapter 8 introduces the core analysis loop. Generating and collecting telemetry data
is only a first step. Analyzing that data is what helps you achieve observability. This
chapter introduces the workflow required to sift through your telemetry data in order
to surface relevant patterns and quickly locate the source of issues. It also covers
approaches for automating the core analysis loop.
Chapter 9 reintroduces the role of metrics as a data type, and where and when to best
use metrics-based traditional monitoring approaches. This chapter also shows how
traditional monitoring practices can coexist with observability practices.
This part focuses on technical requirements in relation to the workflow necessary
to shift toward observability-based debugging practices. In Part III, we’ll examine
how those individual practices impact team dynamics and how to tackle adoption
challenges.
CHAPTER 5
Structured Events Are the
Building Blocks of Observability
system, you must be able to iteratively explore system aspects ranging from high-level
aggregate performance all the way down to the raw data used in individual requests.
The technical requirement making that possible starts with the data format needed
for observability: the arbitrarily wide structured event. Collecting these events is not
optional. It is not an implementation detail. It is a requirement that makes any level of
analysis possible within that wide-ranging view.
Unstructured Logs
Log files are essentially large blobs of unstructured text, designed to be readable by
humans but difficult for machines to process. These files are documents generated by
applications and various underlying infrastructure systems that contain a record of
all notable events—as defined by a configuration file somewhere—that have occurred.
For decades, they have been an essential part of system debugging applications in any
environment. Logs typically contain tons of useful information: a description of the
event that happened, an associated timestamp, a severity-level type associated with
that event, and a variety of other relevant metadata (user ID, IP address, etc.).
Traditional logs are unstructured because they were designed to be human readable.
Unfortunately, for purposes of human readability, logs often separate the vivid details
of one event into multiple lines of text, like so:
6:01:00 accepted connection on port 80 from 10.0.0.3:63349
6:01:03 basic authentication accepted for user foo
6:01:15 processing request for /super/slow/server
6:01:18 request succeeded, sent response code 200
6:01:19 closed connection to 10.0.0.3:63349
While that sort of narrative structure can be helpful when first learning the intricacies
of a service in development, it generates huge volumes of noisy data that becomes
slow and clunky in production. In production, these chunks of narrative are often
interspersed throughout millions upon millions of other lines of text. Typically,
they’re useful in the course of debugging once a cause is already suspected and an
investigator is verifying their hypothesis by digging through logs for verification.
However, modern systems no longer run at an easily comprehensible human scale. In
traditional monolithic systems, human operators had a very small number of services
to manage. Logs were written to the local disk of the machines where applications
ran. In the modern era, logs are often streamed to a centralized aggregator, where
they’re dumped into very large storage backends.
Structured Logs
The solution is to instead create structured log data designed for machine parsability.
From the preceding example, a structured version might instead look something like
this:
time="6:01:00" msg="accepted connection" port="80" authority="10.0.0.3:63349"
time="6:01:03" msg="basic authentication accepted" user="foo"
time="6:01:15" msg="processing request" path="/super/slow/server"
time="6:01:18" msg="sent response code" status="200"
time="6:01:19" msg="closed connection" authority="10.0.0.3:63349"
Many logs are only portions of events, regardless of whether those logs are structured.
When connecting observability approaches to logging, it helps to think of an event as
a unit of work within your systems. A structured event should contain information
about what it took for a service to perform a unit of work. A unit of work can be
seen as somewhat relative. For example, a unit of work could be downloading a
single file, parsing it, and extracting specific pieces of information. Yet other times, it
could mean processing an answer after extracting specific pieces of information from
dozens of files. In the context of services, a unit of work could be accepting an HTTP
request and doing everything necessary to return a response. Yet other times, one
HTTP request can generate many other events during its execution.
Ideally, a structured event should be scoped to contain everything about what it took
to perform that unit of work. It should record the input necessary to perform the
work, any attributes gathered—whether computed, resolved, or discovered—along
the way, the conditions of the service as it was performing the work, and details about
the result of the work performed.
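As a rough sketch (these field names are illustrative rather than a required schema), a single wide structured event for one HTTP request could be assembled as a map and emitted as one JSON object once the work completes:

event := map[string]interface{}{
    // the input necessary to perform the work
    "request.method":  "GET",
    "request.path":    "/super/slow/server",
    "request.user_id": "foo",
    // attributes computed, resolved, or discovered along the way
    "db.rows_examined": 433,
    "cache.hit":        false,
    // the conditions of the service as it performed the work
    "service.version": "v2.3.1",
    "host.name":       "web-14",
    // details about the result of the work performed
    "response.status_code": 200,
    "duration_ms":          1213,
}
// serialize as a single JSON object and send it to your telemetry backend

Because everything about that unit of work lives in one record, the event can later be sliced and filtered by any of these fields.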
It’s common to see anywhere from a few to a few dozen log lines or entries that,
when taken together, represent what could be considered one unit of work. So far,
the example we’ve been using does just that: one unit of work (the handling of one
connection) is represented by five separate log entries. Rather than being helpfully
grouped into one event, the messages are spread out into many messages. Sometimes,
a common field—like a request ID—might be present in each entry so that the
separate entries can be stitched together. But typically there won’t be.
CHAPTER 6
Stitching Events into Traces
In the previous chapter, we explored why events are the fundamental building blocks
of an observable system. This chapter examines how you can stitch events together
into a trace. Within the last decade, distributed tracing has become an indispensable
troubleshooting tool for software engineering teams.
Distributed traces are simply an interrelated series of events. Distributed tracing sys‐
tems provide packaged libraries that “automagically” create and manage the work of
tracking those relationships. The concepts used to create and track the relationships
between discrete events can be applied far beyond traditional tracing use cases. To
further explore what’s possible with observable systems, we must first explore the
inner workings of tracing systems.
In this chapter, we demystify distributed tracing by examining its core components
and why they are so useful for observable systems. We explain the components of
a trace and use code examples to illustrate the steps necessary to assemble a trace
by hand and how those components work. We present examples of adding relevant
data to a trace event (or span) and why you may want that data. Finally, after
showing you how a trace is assembled manually, we’ll apply those same techniques to
nontraditional tracing use cases (like stitching together log events) that are possible
with observable systems.
best efforts, and in an age when distributed systems are the norm, the debugging
techniques we use must adapt to meet more complex needs.
Distributed tracing is a method of tracking the progression of a single request—called
a trace—as it is handled by various services that make up an application. Tracing in
this sense is “distributed” because in order to fulfill its functions, a singular request
must often traverse process, machine, and network boundaries. The popularity of
microservice architectures has led to a sharp increase in debugging techniques that
pinpoint where failures occur along that route and what might be contributing to
poor performance. But anytime a request traverses boundaries—such as from on-
premises to cloud infrastructure, or from infrastructure you control to SaaS services
you don’t, and back again—distributed tracing can be incredibly useful to diagnose
problems, optimize code, and build more reliable services.
The rise in popularity of distributed tracing also means that several approaches and
competing standards for accomplishing that task have emerged. Distributed tracing
first gained mainstream traction after Google’s publication of the Dapper paper by
Ben Sigelman et al. in 2010. Two notable open source tracing projects emerged
shortly after: Twitter’s Zipkin in 2012 and Uber’s Jaeger in 2017, in addition to several
commercially available solutions such as Honeycomb or Lightstep.
Despite the implementation differences in these tracing projects, the core method‐
ology and the value they provide are the same. As explored in Part I, modern
distributed systems have a tendency to scale into a tangled knot of dependencies.
Distributed tracing is valuable because it clearly shows the relationships among
various services and components in a distributed system.
Traces help you understand system interdependencies. Those interdependencies can
obscure problems and make them particularly difficult to debug unless the relation‐
ships between them are clearly understood. For example, if a downstream database
service experiences performance bottlenecks, that latency can cumulatively stack up.
By the time that latency is detected three or four layers upstream, identifying which
component of the system is the root of the problem can be incredibly difficult
because now that same latency is being seen in dozens of other services.
In an observable system, a trace is simply a series of interconnected events. To
understand how traces relate to the fundamental building blocks of observability, let’s
start by looking at how traces are assembled.
Figure 6-1. This waterfall-style trace visualization displays four trace spans during one
request
Each chunk of this waterfall is called a trace span, or span for short. Within any given
trace, spans are either the root span—the top-level span in that trace—or are nested
within the root span. Spans nested within the root span may also have nested spans of
their own. That relationship is sometimes referred to as parent-child. For example, in
Figure 6-2, if Service A calls Service B, which calls Service C, then for that trace, Span
A is the parent of Span B, which is in turn the parent of Span C.
Note that a service might be called and appear multiple times within a trace as
separate spans, such as in the case of circular dependencies or intense calculations
broken into parallel functions within the same service hop. In practice, requests often
traverse messy and unpredictable paths through a distributed system. To construct
the view we want for any path taken, no matter how complex, we need five pieces of
data for each component:
Trace ID
We need a unique identifier for the trace we’re about to create so that we can map
it back to a particular request. This ID is created by the root span and propagated
throughout each step taken to fulfill the request.
Span ID
We also need a unique identifier for each individual span created. Spans contain
information captured while a unit of work occurred during a single trace. The
unique ID allows us to refer to this span whenever we need it.
Parent ID
This field is used to properly define nesting relationships throughout the life of
the trace. A Parent ID is absent in the root span (that’s how we know it’s the
root).
Timestamp
Each span must indicate when its work began.
Duration
Each span must also record how long that work took to finish.
Those fields are absolutely required in order to assemble the structure of a trace.
However, you will likely find a few other fields helpful when identifying these spans
or how they relate to your system. Any additional data added to a span is essentially a
series of tags.
For illustration purposes, we’ll presume that a backend for collection of this data
already exists and will focus on just the client-side instrumentation necessary for
tracing. We’ll also presume that we can send data to that system via HTTP.
Let’s say that we have a simple web endpoint. For quick illustrative purposes, we will
create this example with Go as our language. When we issue a GET request, it makes
calls to two other services to retrieve data based on the payload of the request (such
as whether the user is authorized to access the given endpoint) and then it returns the
results:
func rootHandler(r *http.Request, w http.ResponseWriter) {
    authorized := callAuthService(r)
    name := callNameService(r)

    if authorized {
        w.Write([]byte(fmt.Sprintf(`{"message": "Waddup %s"}`, name)))
    } else {
        w.Write([]byte(`{"message": "Not cool dawg"}`))
    }
}
First, let’s generate a unique trace ID so we can group any subsequent spans back
to the originating request. We’ll use UUIDs to avoid any data duplication issues and
store the attributes and data for this span in a map (we could then later serialize that
data as JSON to send it to our data backend). We’ll also generate a span ID that can be
used as an identifier for relating different spans in the same trace together:
func rootHandler(...) {
    traceData := make(map[string]interface{})
    traceData["trace_id"] = uuid.New().String()
    traceData["span_id"] = uuid.New().String()

    startTime := time.Now()
    traceData["timestamp"] = startTime.Unix()

    // ... handle the request ...

    traceData["duration_ms"] = time.Since(startTime).Milliseconds()
}
Finally, we’ll add two descriptive fields: service_name indicates which service the
work occurred in, and span name indicates the type of work we did. Additionally, we’ll
set up this portion of our code to send all of this data to our tracing backend via a
remote procedure call (RPC) once it’s all complete:
func loginHandler(...) {
    // ... trace id and duration setup from above ...
    traceData["name"] = "/oauth2/login"
    traceData["service_name"] = "authentication_svc"

    sendSpan(traceData)
}
We have the portions of data we need for this one singular trace span. However,
we don’t yet have a way to relay any of this trace data to the other services that
we’re calling as part of our request. At the very least, we need to know which span
this is within our trace, which parent this span belongs to, and that data should be
propagated throughout the life of the request.
The most common way that information is shared in distributed tracing systems
is by setting it in HTTP headers on outbound requests. In our example, we could
expand our helper functions callAuthService and callNameService to accept the
traceData map and use it to set special HTTP headers on their outbound requests.
You could call those headers anything you want, as long as the programs on the
receiving end understand those same names. Typically, HTTP headers follow a par‐
ticular standard, such as those of the World Wide Web Consortium (W3C) or B3. For
our example, we’ll use the B3 standard. We would need to send the following headers
(as in Figure 6-3) to ensure that child spans are able to build and send their spans
correctly:
X-B3-TraceId
Contains the trace ID for the entire trace (from the preceding example)
X-B3-ParentSpanId
Contains the current span ID, which will be set as the parent ID in the child’s
generated span
Now let’s ensure that those headers are sent in our outbound HTTP request:
func callAuthService(req *http.Request, traceData map[string]interface{}) {
    aReq, _ := http.NewRequest("GET", "http://authz/check_user", nil)
    aReq.Header.Set("X-B3-TraceId", traceData["trace_id"].(string))
    aReq.Header.Set("X-B3-ParentSpanId", traceData["span_id"].(string))
    // ... send the request and handle the response ...
}
We would also make a similar change to our callNameService function. With that,
when each service is called, it can pull the information from these headers and add
them to trace_id and parent_id in their own generation of traceData. Each of
those services would also send their generated spans to the tracing backend. On the
backend, those traces are stitched together to create the waterfall-type visualization
we want to see.
Now that you’ve seen what goes into instrumenting and creating a useful trace view,
let’s see what else we might want to add to our spans to make them more useful for
debugging.
func rootHandler(r *http.Request, w http.ResponseWriter) {
    traceData := make(map[string]interface{})
    traceData["tags"] = make(map[string]interface{})

    startTime := time.Now()
    traceData["timestamp"] = startTime.Unix()
    traceData["trace_id"] = uuid.New().String()
    traceData["span_id"] = uuid.New().String()
    traceData["name"] = "/oauth2/login"
    traceData["service_name"] = "authentication_svc"

    authorized := callAuthService(r)
    name := callNameService(r)

    if authorized {
        w.Write([]byte(fmt.Sprintf(`{"message": "Waddup %s"}`, name)))
    } else {
        w.Write([]byte(`{"message": "Not cool dawg"}`))
    }

    traceData["duration_ms"] = time.Since(startTime).Milliseconds()
    sendSpan(traceData)
}
The code examples used in this section are a bit contrived to illustrate how these
concepts come together in practice. The good news is that you would typically not
have to generate all of this code yourself. Distributed tracing systems commonly have
their own supporting libraries to do most of this boilerplate setup work.
These shared libraries are typically unique to the particular needs of the tracing
solution you wish to use. Unfortunately, vendor-specific solutions do not work well
with other tracing solutions, meaning that you have to re-instrument your code if
you want to try a different solution. In the next chapter, we'll look at the open source OpenTelemetry project, which addresses that problem.
Conclusion
Events are the building blocks of observability, and traces are simply an interrelated
series of events. The concepts used to stitch together spans into a cohesive trace
are useful in the setting of service-to-service communication. In observable systems,
those same concepts can also be applied beyond making RPCs to any discrete events
in your systems that are interrelated (like individual file uploads all created from the
same batch job).
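For instance, a nightly batch job could mint a trace ID once and hand it to each file upload, so that every upload becomes a child span of the job. A rough sketch, reusing the illustrative traceData map and sendSpan helper from earlier in this chapter:

func runBatchJob(files []string) {
    start := time.Now()
    jobSpan := map[string]interface{}{
        "trace_id":     uuid.New().String(),
        "span_id":      uuid.New().String(),
        "name":         "nightly_upload_job",
        "service_name": "batch_uploader",
        "timestamp":    start.Unix(),
    }

    for _, f := range files {
        uploadFile(f, jobSpan) // each upload emits its own child span
    }

    jobSpan["duration_ms"] = time.Since(start).Milliseconds()
    sendSpan(jobSpan)
}

func uploadFile(path string, parent map[string]interface{}) {
    start := time.Now()
    span := map[string]interface{}{
        "trace_id":  parent["trace_id"], // same trace as the batch job
        "parent_id": parent["span_id"],  // nests this upload under the job
        "span_id":   uuid.New().String(),
        "name":      "upload_file",
        "file.path": path,
        "timestamp": start.Unix(),
    }

    // ... perform the upload ...

    span["duration_ms"] = time.Since(start).Milliseconds()
    sendSpan(span)
}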
In this chapter, we instrumented a trace the hard way by coding each necessary step
by hand. A more practical way to get started with tracing is to use an instrumentation
framework. In the next chapter, we’ll look at the open source and vendor-neutral
OpenTelemetry project as well as how and why you would use it to instrument your
production applications.
CHAPTER 7
Instrumentation with OpenTelemetry
In the previous two chapters, we described the principles of structured events and
tracing. Events and traces are the building blocks of observability that you can use
to understand the behavior of your software applications. You can generate those
fundamental building blocks by adding instrumentation code into your application
to emit telemetry data alongside each invocation. You can then route the emitted
telemetry data to a backend data store, so that you can later analyze it to understand
application health and help debug issues.
In this chapter, we’ll show you how to instrument your code to emit telemetry
data. The approach you choose might depend on the instrumentation methods your
observability backend supports. It is common for vendors to create proprietary APM,
metrics, or tracing libraries to generate telemetry data for their specific solutions.
However, for the purposes of this vendor-neutral book, we will describe how to
implement instrumentation with open source standards that will work with a wide
variety of backend telemetry stores.
This chapter starts by introducing the OpenTelemetry standard and its approach
for automatically generating telemetry from applications. Telemetry from automatic
instrumentation is a fine start, but the real power of observability comes from custom
attributes that add context to help you debug how your intended business logic is
actually working. We’ll show you how to use custom instrumentation to augment
the out-of-the-box instrumentation included with OpenTelemetry. By the end of
the chapter, you’ll have an end-to-end instrumentation strategy to generate useful
telemetry data. And in later sections of this book, we’ll show you how to analyze that
telemetry data to find the answers you need.
A Brief Introduction to Instrumentation
It is a well-established practice in the software industry to instrument applications
to send data about system state to a central logging or monitoring solution. That
data, known as telemetry, records what your code did when processing a particular
set of requests. Over the years, software developers have defined various overlapping
categories of telemetry data, including logs, metrics, and, more recently, traces and
profiles.
As discussed in Chapter 5, arbitrarily wide events are the ideal building blocks for
observability. But wide events are not new; they are just a special kind of log that
consists of many structured data fields rather than fewer, more ambiguous blobs.
And a trace (see Chapter 6) comprises multiple wide events with embedded fields to
connect disparate events.
However, the approach to application instrumentation has typically been proprietary.
Sometimes producing the data requires manually adding code to generate telemetry
data within your services; other times, an agent can assist in gathering data automati‐
cally. Each distinct monitoring solution (unless it’s printf()) has its own custom set
of necessary steps to generate and transmit telemetry data in a format appropriate for
that product’s backend data store.
In the past, you may have installed backend-specific instrumentation libraries or agents in your applications. You would then add in your own custom
instrumentation to capture any data you deem relevant to understanding the internal
state of your applications, using the functions made available by those client libraries.
However, the product-specific nature of those instrumentation approaches also cre‐
ates vendor lock-in. If you wanted to send telemetry data to a different product,
you needed to repeat the entire instrumentation process using a different library,
wastefully duplicating code and doubling the measurement overhead.
mux.Handle("/route",
    otelhttp.NewHandler(
        otelhttp.WithRouteTag("/route", http.HandlerFunc(h)),
        "handler_span_name"))
In contrast, gRPC provides an Interceptor interface that you can provide OTel with
to register its instrumentation:
import (
    "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
)

s := grpc.NewServer(
    grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
    grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor()),
)
Armed with this automatically generated data, you will be able to find problems
such as uncached hot database calls, or downstream dependencies that are slow for
a subset of their endpoints, from a subset of your services. But this is only the very
beginning of the insight you’ll get.
// Within each module, define a private tracer once for easy identification
var tr = otel.Tracer("module_name")
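Once a span is active, you can decorate it with custom attributes pulled from your application's context. A brief sketch using the OpenTelemetry attribute package (the resp and username variables stand in for values from your own handler):

import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

span := trace.SpanFromContext(ctx)
span.SetAttributes(
    attribute.Int("http.code", resp.StatusCode),
    attribute.String("app.user", username),
)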
This snippet sets two key-value pair attributes. The attribute http.code contains an
integer value (the HTTP response code), and app.user contains a string including
the current username in the execution request.
goroutines, _ := metric.Must(meter).NewInt64Measure("num_goroutines",
    metric.WithKeys(appKey, containerKey),
    metric.WithDescription("Amount of goroutines running."),
)
func main() {
    exporterX = x.NewExporter(...)
    exporterY = y.NewExporter(...)
    tp, err := sdktrace.NewTracerProvider(
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
        sdktrace.WithSyncer(exporterX),
        sdktrace.WithBatcher(exporterY),
    )
    otel.SetTracerProvider(tp)
In this example code, we import both my/backend/exporter and some/backend/
exporter, configure them to synchronously or in batches receive trace spans from
a tracer provider, and then set the tracer provider as the default tracer provider. This
machinery causes all subsequent calls to otel.Tracer() to retrieve an appropriately
configured tracer.
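With the provider configured, any module can create spans through its private tracer (the tr variable defined earlier). A minimal sketch, in which the operation name and attribute are illustrative:

ctx, span := tr.Start(ctx, "process_payment")
defer span.End()

// anything recorded on this span is exported by both configured exporters
span.SetAttributes(attribute.String("app.payment_method", "credit_card"))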
CHAPTER 8
Analyzing Events to Achieve Observability
In the first two chapters of this part, you learned about telemetry fundamentals that
are necessary to create a data set that can be properly debugged with an observability
tool. While having the right data is a fundamental requirement, observability is
measured by what you can learn about your systems from that data. This chapter
explores debugging techniques applied to observability data and what separates them
from traditional techniques used to debug production systems.
We’ll start by closely examining common techniques for debugging issues with tradi‐
tional monitoring and application performance monitoring tools. As highlighted in
previous chapters, traditional approaches presume a fair amount of familiarity with
previously known failure modes. In this chapter, that approach is unpacked a bit
more so that it can then be contrasted with debugging approaches that don’t require
the same degree of system familiarity to identify issues.
Then, we’ll look at how observability-based debugging techniques can be automated
and consider the roles that both humans and computers play in creating effective
debugging workflows. When combining those factors, you’ll understand how observ‐
ability tools help you analyze telemetry data to identify issues that are impossible to
detect with traditional tools.
This style of hypothesis-driven debugging—in which you form hypotheses and then
explore the data to confirm or deny them—is not only more scientific than relying
on intuition and pattern matching, but it also democratizes the act of debugging. As
opposed to traditional debugging techniques, which favor those with the most system
familiarity and experience to quickly find answers, debugging with observability
favors those who are the most curious or the most diligent about checking up on their
code in production. With observability, even someone with very little knowledge of
the system should be able to jump in and debug an issue.
Debugging from Known Conditions
Prior to observability, system and application debugging mostly occurred by building
upon what you know about a system. This can be observed when looking at the way
the most senior members of an engineering team approach troubleshooting. It can
seem downright magical when they know which questions are the right ones to ask
and instinctively know the right place to look. That magic is born from intimate
familiarity with their application and systems.
To capture this magic, managers urge their senior engineers to write detailed run‐
books in an attempt to identify and solve every possible “root cause” they might
encounter out in the wild. In Chapter 2, we covered the escalating arms race of
dashboard creation embarked upon to create just the right view that identifies a
newly encountered problem. But that time spent creating runbooks and dashboards
is largely wasted, because modern systems rarely fail in precisely the same way twice.
And when they do, it’s increasingly common to configure an automated remediation
that can correct that failure until someone can investigate it properly.
Anyone who has ever written or used a runbook can tell you a story about just how
woefully inadequate they are. Perhaps they work to temporarily address technical
debt: there’s one recurring issue, and the runbook tells other engineers how to miti‐
gate the problem until the upcoming sprint when it can finally be resolved. But more
often, especially with distributed systems, a long, thin tail of problems that almost never happen is responsible for cascading failures in production. Or, five seemingly
impossible conditions will align just right to create a large-scale service failure in ways
that might happen only once every few years.
1. Start with the overall view of what prompted your investigation: what did the
customer or alert tell you?
2. Verify that what you know so far is true: is a notable change in performance
happening somewhere in this system? Data visualizations can help you identify
changes of behavior as a change in a curve somewhere in the graph.
3. Search for dimensions that might drive that change in performance. Approaches
to accomplish that might include:
a. Examining sample rows from the area that shows the change: are there any
outliers in the columns that might give you a clue?
b. Slicing those rows across various dimensions looking for patterns: do any of
those views highlight distinct behavior across one or more dimensions? Try an
experimental group by on commonly useful fields, like status_code.
c. Filtering for particular dimensions or values within those rows to better
expose potential outliers.
4. Do you now know enough about what might be occurring? If so, you’re done!
If not, filter your view to isolate this area of performance as your next starting
point. Then return to step 3.
This is the basis of debugging from first principles. You can use this loop as a
brute-force method to cycle through all available dimensions to identify which ones
explain or correlate with the outlier graph in question, with no prior knowledge or
wisdom about the system required.
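A sketch of what that brute-force cycle could look like in code, comparing how often each dimension's values occur in the suspect events versus a baseline and surfacing the values that are most over-represented (the Event type and scoring here are simplified assumptions, not a production algorithm):

// Event is a simplified stand-in for an arbitrarily wide structured event.
type Event map[string]string

// valueFrequencies returns, for each "dimension=value" pair, the fraction of
// events in which it appears.
func valueFrequencies(events []Event) map[string]float64 {
    freq := make(map[string]float64)
    for _, e := range events {
        for dim, val := range e {
            freq[dim+"="+val]++
        }
    }
    for k := range freq {
        freq[k] /= float64(len(events))
    }
    return freq
}

// rankDimensions scores every "dimension=value" pair by how much more often it
// appears in the suspect events than in the baseline; the highest-scoring
// values are the ones most likely to explain the outlier behavior.
func rankDimensions(suspect, baseline []Event) map[string]float64 {
    scores := make(map[string]float64)
    baselineFreq := valueFrequencies(baseline)
    for key, f := range valueFrequencies(suspect) {
        scores[key] = f - baselineFreq[key]
    }
    return scores
}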
At a glance, this tells you that the events in this specific area of performance you care
about are different from the rest of the system in all of these ways, whether that be
one deviation or dozens. Let’s look at a more concrete example of the core analysis
loop by using Honeycomb’s BubbleUp feature.
With Honeycomb, you start by visualizing a heatmap to isolate a particular area of
performance you care about. Honeycomb automates the core analysis loop with the
BubbleUp feature: you point and click at the spike or anomalous shape in the graph
that concerns you, and draw a box around it. BubbleUp computes the values of all
dimensions both inside the box (the anomaly you care about and want to explain)
and outside the box (the baseline), and then compares the two and sorts the resulting
view by percent of differences, as shown in Figure 8-2.
In this example, you can quickly see that slow-performing events are mostly orig‐
inating from one particular availability zone (AZ) from our cloud infrastructure
provider. The other information automatically surfaced also points out one particular
virtual machine instance type that appears to be more affected than others. Other
dimensions surfaced, but in this example the differences tended to be less stark,
indicating that they were perhaps not as relevant to our investigation.
This information has been tremendously helpful: we now know the conditions that
appear to be triggering slow performance. A particular type of instance in one partic‐
ular AZ is much more prone to very slow performance than other infrastructure we
care about. In that situation, the glaring difference pointed to what turned out to be
an underlying network issue with our cloud provider’s entire AZ.
Not all issues are as immediately obvious as this underlying infrastructure issue.
Often you may need to look at other surfaced clues to triage code-related issues.
The core analysis loop remains the same, and you may need to slice and dice across
dimensions until one clear signal emerges, similar to the preceding example. In this
case, we contacted our cloud provider and were also able to independently verify the
unreported availability issue when our customers also reported similar issues in the
same zone. If this had instead been a code-related issue, we might decide to reach out
to those users, or figure out the path they followed through the UI to see those errors,
and fix the interface or the underlying system.
Note that the core analysis loop can be achieved only by using the baseline building
blocks of observability, which is to say arbitrarily wide structured events. You cannot conduct this kind of analysis with metrics or other pre-aggregated data, because the per-event context needed for those comparisons has already been discarded.
Conclusion
Collecting the right telemetry data is only the first step in the journey toward observ‐
ability. That data must be analyzed according to first principles in order to objectively
and correctly identify application issues in complex environments. The core analysis
loop is an effective technique for fast fault localization. However, that work can be
time-consuming for humans to conduct methodically as a system becomes increas‐
ingly complex.
CHAPTER 9
How Observability and
Monitoring Come Together
Where Monitoring Fits
In Chapter 2, we focused on differentiating observability and monitoring. That chap‐
ter mostly focuses on the shortcomings of monitoring systems and how observability
fills in those gaps. But monitoring systems still continue to provide valuable insights.
Let’s start by examining where traditional monitoring systems continue to be the
right tool for the job.
The traditional monitoring approach to understanding system state is a mature and
well-developed process. Decades of iterative improvement have evolved monitoring
tools beyond their humble beginnings with simple metrics and round-robin data‐
bases (RRDs), toward TSDBs and elaborate tagging systems. A wealth of sophistica‐
ted options also exist to provide this service—from open source software solutions, to
start-ups, to publicly traded companies.
Monitoring practices are well-known and widely understood beyond the commun‐
ities of specialists that form around specific tools. Across the software industry,
monitoring best practices exist that anyone who has operated software in production
can likely agree upon.
For example, a widely accepted core tenet of monitoring is that a human doesn’t need
to sit around watching graphs all day; the system should proactively inform its users
when something is wrong. In this way, monitoring systems are reactive. They react to
known failure states by alerting humans that a problem is occurring.
Monitoring systems and metrics have evolved to optimize themselves for that job.
They automatically report whether known failure conditions are occurring or about
to occur. They are optimized for reporting on unknown conditions about known
failure modes (in other words, they are designed to detect known-unknowns).
Because monitoring systems are optimized to find known-unknowns, they are a best fit for understanding the state of your systems, which change much less
frequently and in more predictable ways than your application code. By systems, we
mean your infrastructure, or your runtime, or counters that help you see when you’re
about to slam up against an operating constraint.
As seen in Chapter 1, metrics and monitoring were created to examine hardware-
level performance. Over time, they’ve been adapted to encompass a wider range of
infrastructure and system-level concerns. Most readers of this book, who work in
technology companies, should recognize that the underlying systems are not what
matter to your business. At the end of the day, what matters to your business is
how the applications you wrote perform in the hands of your customers. The only
reason your business is concerned about those underlying systems is that they could
negatively impact application performance.
With infrastructure, only one perspective really matters: the perspective of the team
responsible for its management. The important question to ask about infrastructure
is whether the service it provides is essentially healthy. If it’s not, that team must
quickly take action to restore the infrastructure to a healthy condition. The system
Real-World Examples
While observability is still a nascent category, a few patterns are emerging for the
coexistence of monitoring and observability. The examples cited in this section repre‐
sent the patterns we’ve commonly seen among our customers or within the larger
observability community, but they are by no means exhaustive or definitive. These examples should nonetheless give you a sense of how a similar balance might work for your own teams.
Conclusion
The guiding principle for determining how observability and monitoring coexist
within your organization should be dictated by the software and infrastructure
responsibilities adopted within its walls. Monitoring is best suited to evaluating
the health of your systems. Observability is best suited to evaluating the health of
your software. Exactly how much of each solution will be necessary in any given
organization depends on how much management of that underlying infrastructure
has been outsourced to third-party (aka cloud) providers.
The most notable exceptions to that neat dividing line are higher-order infrastructure
metrics on physical devices that directly impact software performance, like CPU,
memory, and disk. Metrics that indicate consumption of these physical infrastructure
constraints are critical for understanding the boundaries imposed by underlying
infrastructure. If these metrics are available from your cloud infrastructure provider,
they should be included as part of your approach to observability.
By illustrating a few common approaches to balancing monitoring and observability
in complementary ways, you can see how the considerations outlined throughout
this chapter are implemented in the real world by different teams. Now that we’ve
covered the fundamentals of observability in depth, the next part of this book goes
beyond technology considerations to also explore the cultural changes necessary for
successfully adopting observability practices and driving that adoption across teams.
PART III
Observability for Teams
In Part II, we examined various technical aspects of observability, how those concepts
build on one another to enable the core analysis loop and debugging from first
principles, and how that practice can coexist with traditional monitoring. In this part,
we switch gears to look at the changes in social and cultural practices that help drive
observability adoption across different teams.
Chapter 10 tackles many of the common challenges teams face when first starting
down the path of observability. How and where you start will always depend on
multiple factors, but this chapter recaps many of the techniques we’ve seen work
effectively.
Chapter 11 focuses on how developer workflows change when using observability.
Though we’ve referenced this topic in earlier chapters, here we walk through more
concrete steps. You’ll learn about the benefits developers gain by adding custom
instrumentation into their code early in the development phase, and how that’s used
to debug their tests and to ensure that their code works correctly all the way through
production.
Chapter 12 looks at the potential that observability unlocks when it comes to using
more sophisticated methods for monitoring the health of your services in production.
This chapter introduces service-level objectives (SLOs) and how they can be used for
more effective alerting.
Chapter 13 builds on the preceding chapter by demonstrating why event data is a key part of creating SLO-based alerts that are more accurate, actionable, and debuggable than alerts based on metrics data.
Chapter 14 looks at how teams can use observability to debug and better understand
other parts of their stack, like their CI/CD build pipelines. This guest-contributed
chapter is written by Frank Chen, senior staff software engineer at Slack.
This part of the book focuses on team workflows that can change and benefit from
observability practices by detailing various scenarios and use cases to address com‐
mon pain points for engineering teams managing modern software systems operating
at any scale. In Part IV, we’ll look at specific and unique challenges that occur when
using observability tools at scale.
CHAPTER 10
Applying Observability Practices
in Your Team
Let’s switch gears to focus on the fundamentals of observability from a social and
cultural practice perspective. In this chapter, we provide several tips to help you
get started with observability practices. If you’re in a leadership role within your engi‐
neering team—such as a team lead, a manager, or maybe the resident observability
fan/champion—the hardest thing to figure out (after getting management approval)
in an observability implementation strategy is knowing where to start.
For us, this is a particularly tricky chapter to write. Having helped many teams start
down this path, we know that no universal recipe for success exists. How and where
you get started will always depend on many factors. As unsatisfying as “it depends”
can be for an answer, the truth is that your journey with observability depends
on particulars including the problems most pertinent to you and your team, the
gaps in your existing tooling, the level of support and buy-in from the rest of your
organization, the size of your team, and other such considerations.
Whatever approach works best for you is, by definition, not wrong. The advice in
this chapter is not intended to suggest that this is the one true way to get started
with observability (there is no singular path!). That said, we have seen a few emergent
patterns and, if you are struggling with where to begin, some of these suggestions
may be helpful to you. Feel free to pick and choose from any of the tips in this
chapter.
our sociotechnical systems are rapidly evolving, one of the best ways to learn and
improve is by participating in a community of people who are struggling with varia‐
tions on the same themes as you. Community groups connect you with other profes‐
sionals who can quickly become a helpful network of friends and acquaintances.
As you and your community face similar challenges, you’ll have an opportunity to
learn so much very quickly just by hanging out in Slack groups and talking to other
people who are banging against the same types of problems. Community groups
allow you to connect with people beyond your normal circles from a variety of
backgrounds. By actively participating and understanding how other teams handle
some of the same challenges you have, you’ll make comparative observations and
learn from the experiences of others.
Over time, you’ll also discover other community members with common similarities
in tech stack, team size, organizational dynamics, and so forth. Those connections
will give you someone to turn to as a sounding board, for background, or personal
experiences with solutions or approaches you might also be considering. Having
that type of shared context before you pull the trigger on new experiments can be
invaluable. Actively participating in a community group will save you a ton of time
and heartbreak.
Participating in a community will also keep you attuned to developments you may
have otherwise missed. Different providers of observability tools will participate in
different communities to better understand user challenges, gather feedback on new
ideas, or just generally get a pulse on what’s happening. Participating in a community
specific to your observability tool of choice can also give you a pulse on what’s
happening as it happens.
When joining a community, remember that community relationships are a two-way
street. Don’t forget to do your share of chopping wood and carrying water: show up
and start contributing by helping others first. Being a good community citizen means
participating and helping the group for a while before dropping any heavy asks for
help. In other words, don’t speak up only when you need something from others.
Communities are only as strong as you make them.
If you need a place to start, we recommend checking out the CNCF Technical
Advisory Group (TAG) for Observability. There you’ll find both Slack chat groups
as well as regular online meetings. The OpenTelemetry Community page also lists
useful resources. More generally, a lot of conversations around observability happen
via Twitter (search for “observability” and you’ll find people and topics to follow).
More specifically, product-focused Slack groups such as Honeycomb’s Pollinators
Slack exist, where you’ll find a mix of general and vendor-specific information. We
also recommend Michael Hausenblas’s newsletter o11y news.
One of the best strategies for rolling out observability across an entire organization
is to instrument a painful service or two as you first get started. Use that instrumenta‐
tion exercise as a reference point and learning exercise for the rest of the pilot team.
Once the pilot team is familiar with the new tooling, use any new debugging situation
as a way to introduce more and more useful instrumentation.
Whenever an on-call engineer is paged about a problem in production, the first
thing they should do is use the new tooling to instrument problem areas of your
application. Use the new instrumentation to figure out where issues are occurring.
After the second or third time people take this approach, they usually catch on
to how much easier and less time-consuming it is to debug issues by introducing
instrumentation first. Debugging from instrumentation first allows you to see what’s
really happening.
Once the pilot team members are up to speed, they can help others learn. They
can provide coaching on creating helpful instrumentation, suggest helpful queries,
or point others toward more examples of helpful troubleshooting patterns. Each
new debugging issue can be used to build out the instrumentation you need. You
don’t need a fully developed set of instrumentation to get immediate value with
observability.
• If you’re using an ELK stack—or even just the Logstash part—it’s trivial to
add a snippet of code to fork the output of a source stream to a secondary
destination. Send that stream to your observability tool. Invite users to compare
the experience.
• If you’re already using structured logs, all you need to do is add a unique ID to log events as they propagate throughout your entire stack. You can keep those logs in your existing log analysis tool, while also sending them as trace events to your observability tool (see the sketch after this list).
• Try running observability instrumentation (for example, Honeycomb’s Beelines
or OTel) alongside your existing APM solution. Invite users to compare and
contrast the experience.
• If you’re using Ganglia, you can leverage that data by parsing the Extensible
Markup Language (XML) dump it puts into /var/tmp with a once-a-minute
cronjob that shovels that data into your observability tool as events. That’s a less
than optimal use of observability, but it certainly creates familiarity for Ganglia
users.
• Re-create the most useful of your old monitoring dashboards as easily reference‐
able queries within your new observability tool. While dashboards certainly have
their shortcomings (see Chapter 2), this gives new users a landing spot where
they can understand the system performance they care about at a glance, and also
gives them an opportunity to explore and know more.
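For the structured-logs suggestion above, the change can be as small as attaching one shared ID to every log event emitted while handling a request. A rough sketch, in which the handler, logEvent helper, and header name are illustrative:

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // One ID shared by every log event for this request; downstream services
    // receive it via the same header and reuse it in their own events.
    requestID := r.Header.Get("X-Request-Id")
    if requestID == "" {
        requestID = uuid.New().String()
    }

    logEvent(map[string]interface{}{
        "request_id": requestID,
        "msg":        "processing request",
        "path":       r.URL.Path,
    })

    // ... do the work, forwarding requestID on any outbound calls ...
}

// logEvent writes the event to your existing log pipeline; the same payload
// can also be sent to an observability tool as a trace event keyed on
// request_id.
func logEvent(fields map[string]interface{}) {
    line, _ := json.Marshal(fields)
    log.Println(string(line))
}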
1 Hal R. Arkes and Catherine Blumer, “The Psychology of Sunk Costs,” Organizational Behavior and Human
Decision Processes 35 (1985): 124–140.
Conclusion
Knowing exactly where and how to start your observability journey depends on the
particulars of your team. Hopefully, these general recommendations are useful to
help you figure out places to get started. Actively participating in a community of
peers can be invaluable as your first place to dig in. As you get started, focus on
solving the biggest pain points rather than starting in places that already operate
smoothly enough. Throughout your implementation journey, remember to keep an
inclination toward moving fast, demonstrating high value and ROI, and tackling
work iteratively. Find opportunities to include as many parts of your organization as
possible. And don’t forget to plan for completing the work in one big final push to get
your implementation project over the finish line.
The tips in this chapter can help you complete the work it takes to get started with
observability. Once that work is complete, using observability on a daily basis helps
unlock other new ways of working by default. The rest of this part of the book
explores those in detail. In the next chapter, we’ll examine how observability-driven
development can revolutionize your understanding of the way new code behaves in
production.
CHAPTER 11
Observability-Driven Development
Test-Driven Development
Today’s gold standard for testing software prior to its release in production is test-
driven development (TDD). TDD is arguably one of the more successful practices to
take hold across the software development industry within the last two decades. TDD
has provided a useful framework for shift-left testing that catches, and prevents, many
potential problems long before they reach production. Adopted across wide swaths of
the software development industry, TDD should be credited with having uplifted the
quality of code running production services.
TDD is a powerful practice that provides engineers a clear way to think about
software operability. Applications are defined by a deterministic set of repeatable
tests that can be run hundreds of times per day. If these repeatable tests pass, the
application must be running as expected. Before changes to the application are
produced, they start as a set of new tests that exist to verify that the change would
work as expected. A developer can then begin to write new code in order to ensure
the test passes.
TDD is particularly powerful because tests run the same way every time. Data typi‐
cally doesn’t persist between test runs; it gets dropped, erased, and re-created from
scratch for each run. Responses from underlying or remote systems are stubbed or
mocked. With TDD, developers are tasked with creating a specification that precisely
defines the expected behaviors for an application in a controlled state. The role of
tests is to identify any unexpected deviations from that controlled state so that engi‐
neers can then deal with them immediately. In doing so, TDD removes guesswork
and provides consistency.
But that very consistency and isolation also limits TDD’s revelations about what is
happening with your software in production. Running isolated tests doesn’t reveal
whether customers are having a good experience with your service. Nor does passing
those tests mean that any errors or regressions could be quickly and directly isolated
and fixed before releasing that code back into production.
Any reasonably experienced engineer responsible for managing software running in
production can tell you that production environments are anything but consistent.
Production is full of interesting deviations that your code might encounter out in
the wild but that have been excised from tests because they’re not repeatable, don’t
quite fit the specification, or don’t go according to plan. While the consistency and
isolation of TDD makes your code tractable, it does not prepare your code for the
interesting anomalies that should be surfaced, watched, stressed, and tested because
they ultimately shape your software’s behavior when real people start interacting
with it.
Observability can help you write and ship better code even before it lands in source
control—because it’s part of the set of tools, processes, and culture that allows engi‐
neers to find bugs in their code quickly.
• You see a spike in latency. You start at the edge; group by endpoint; calculate
average, 90th-, and 99th-percentile latencies; identify a cadre of slow requests;
and trace one of them. It shows the timeouts begin at service3. You copy the
context from the traced request into your local copy of the service3 binary and
attempt to reproduce it in the debugger or profiler.
• You see a spike in latency. You start at the edge; group by endpoint; calculate
average, 90th-, and 99th-percentile latencies; and notice that exclusively write
endpoints are suddenly slower. You group by database destination host, and
note that the slow queries are distributed across some, but not all, of your
database primaries. For those primaries, this is happening only to ones of a
certain instance type or in a particular AZ. You conclude the problem is not a
code problem, but one of infrastructure.
Without observability, all you may see is that all of the performance graphs are either
spiking or dipping at the same time.
A more advanced approach is to enable engineers to test their code against a small
subset of production traffic. With sufficient instrumentation, the best way to under‐
stand how a proposed change will work in production is to measure how it will
work by deploying it to production. That can be done in several controlled ways.
For example, that can happen by deploying new features behind a feature flag and
exposing it to only a subset of users. Alternatively, a feature could also be deployed
directly to production and have only select requests from particular users routed to
the new feature. These types of approaches shorten feedback loops to mere seconds
or minutes, rather than what are usually substantially longer periods of time waiting
for the release.
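A brief sketch of the feature-flag variant, gating the new code path to a subset of users and recording which path served each request so the two can be compared in your telemetry (the flags client, handler, and field names are assumptions, not a specific product's API):

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    span := trace.SpanFromContext(r.Context())
    userID := r.Header.Get("X-User-Id")

    // The flag service decides which users see the new pricing engine.
    useNewPath := flags.IsEnabled("new-pricing-engine", userID)
    span.SetAttributes(attribute.Bool("app.new_pricing_engine", useNewPath))

    if useNewPath {
        newPricingEngine(w, r)
        return
    }
    legacyPricingEngine(w, r)
}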
If you are capturing sufficient instrumentation detail in the context of your requests,
you can systematically start at the edge of any problem and work your way to the
correct answer every single time, with no guessing, intuition, or prior knowledge
needed. This is one revolutionary advance that observability has over monitoring
systems, and it does a lot to move operations engineering back into the realm of
science, not magic and intuition.
• Ship a single coherent bundle of changes one at a time, one merge by one
engineer. The single greatest cause of deployments that break “something” and
take hours or days to detangle and roll back is the batching of many changes by
many people over many days.
• Spend real engineering effort on your deployment process and code. Assign
experienced engineers to own it (not the intern). Make sure everyone can under‐
stand your deployment pipelines and that they feel empowered to continuously
improve them. Don’t let them be the sole province of a single engineer or team.
Instead, get everyone’s usage, buy-in, and contributions. For more tips on making
your pipelines understandable for everyone, see Chapter 14.
1 Nicole Forsgren et al., Accelerate: Building and Scaling High Performing Technology Organizations (Portland,
OR: IT Revolution Press, 2018).
CHAPTER 12
Using Service-Level
Objectives for Reliability
While observability and traditional monitoring can coexist, observability unlocks the
potential to use more sophisticated and complementary approaches to monitoring.
The next two chapters will show you how practicing observability and service-level
objectives (SLOs) together can improve the reliability of your systems.
In this chapter, you will learn about the common problems that traditional threshold-
based monitoring approaches create for your team, how distributed systems exacer‐
bate those problems, and how using an SLO-based approach to monitoring instead
solves those problems. We’ll conclude with a real-world example of replacing tradi‐
tional threshold-based alerting with SLOs. And in Chapter 13, we’ll examine how
observability makes your SLO-based alerts actionable and debuggable.
Let’s begin with understanding the role of monitoring and alerting and the previous
approaches to them.
While such simplistic “potential-cause” measures are easy to collect, they don’t pro‐
duce meaningful alerts for you to act upon. Deviations in CPU utilization may also
be indicators that a backup process is running, or a garbage collector is doing its
cleanup job, or that any other phenomenon may be happening on a system. In
other words, those conditions may reflect any number of system factors, not just the
problematic ones we really care about. Triggering alerts from these measures based
on the underlying hardware creates a high percentage of false positives.
Experienced engineering teams that own the operation of their software in produc‐
tion will often learn to tune out, or even suppress, these types of alerts because they’re
so unreliable. Teams that do so regularly adopt phrases like “Don’t worry about that
alert; we know the process runs out of memory from time to time.”
Becoming accustomed to alerts that are prone to false positives is a known problem
and a dangerous practice. In other industries, that problem is known as normalization
of deviance: a term coined during the investigation of the Challenger disaster. When
individuals in an organization regularly shut off alarms or fail to take action when
alarms occur, they eventually become so desensitized to deviating from the expected response that it no longer feels wrong to them. Failures that are
“normal” and disregarded are, at best, simply background noise. At worst, they lead
to disastrous oversights from cascading system failures.
In the software industry, the poor signal-to-noise ratio of monitoring-based alerting
often leads to alert fatigue—and to gradually paying less attention to all alerts, because
so many of them are false alarms, not actionable, or simply not useful. Unfortunately,
with monitoring-based alerting, that problem is often compounded when incidents
occur. Post-incident reviews often generate action items that create new, more impor‐
tant alerts that, presumably, would have alerted in time to prevent the problem. That
leads to an even larger set of alerts generated during the next incident. That pattern
of alert escalation creates an ever-increasing flood of alerts and an ever-increasing
cognitive load on responding engineers to determine which alerts matter and which
don’t.
That type of dysfunction is so common in the software industry that many moni‐
toring and incident-response tool vendors proudly offer various solutions labeled
“AIOps” to group, suppress, or otherwise try to process that alert load for you (see
“This Misleading Promise of AIOps” on page 91). Engineering teams have become
so accustomed to alert noise that this pattern is now seen as normal. If the future
of running software in production is doomed to generate so much noise that it
must be artificially processed, it’s safe to say the situation has gone well beyond a
normalization of deviance. Industry vendors have now productized that deviance and
will happily sell you a solution for its management.
Again, we believe that type of dysfunction exists because of the limitations imposed
by using metrics and monitoring tools that used to be the best choice we had to
1 Betsy Beyer et al., “Tying These Principles Together”, in Site Reliability Engineering (Sebastopol, CA: O’Reilly,
2016), 63–64.
Much has been written about the use of SLOs for service reliability, and they are
not unique to the world of observability. However, using an SLO-based approach to
Any event that is an error would spend some of the error budget allowed in your
SLO. We’ll closely examine patterns for proactively managing SLO error budgets and
triggering alerts in the next chapter. For now, we’ll summarize by saying that given
enough errors, your systems could alert you to a potential breach of the SLO.
SLOs narrow the scope of your alerts to consider only symptoms that impact the experience of your service's users. If an underlying condition is impacting “a user loading our home page and seeing it quickly,” an alert should be triggered, because someone needs to investigate why. However, the alert alone tells you nothing about why or how the service is degraded. You simply know that something is wrong.
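To make that concrete, here is a tiny sketch of event-based SLO qualification; the route, latency threshold, and field names are invented for illustration, not taken from any particular system:

// Event carries only the fields this example SLO cares about; real events
// would have many more dimensions.
type Event struct {
	Route      string
	StatusCode int
	DurationMs float64
}

// qualify reports whether an event is eligible for the home-page SLO and, if
// eligible, whether the user's experience was good. Only eligible events that
// are not good spend error budget.
func qualify(e Event) (eligible, good bool) {
	eligible = e.Route == "/home"
	good = e.StatusCode < 500 && e.DurationMs < 100
	return eligible, good
}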
In contrast, traditional monitoring relies on a cause-based approach: a previously
known cause is detected (e.g., an abnormal number of threads), signaling that users
At this point, we have to acknowledge that many engineers who are reading this story
might challenge that traditional cause-based alerting would have worked well enough
in this case. Why not alert on memory and therefore get alerts when the system
runs out of memory (OOM)? There are two points to consider when addressing that
challenge.
First, at Honeycomb, our engineers had long since been trained out of tracking
OOMs. Caching, garbage collection, and backup processes all opportunistically
used—and occasionally used up—system memory. “Ran out of memory” turned out
to be common for our applications. Given our architecture, even having a process
crash from time to time turned out not to be fatal, so long as all didn’t crash at once.
For our purposes, tracking those individual failures had not been useful at all. We
were more concerned with the availability of the cluster as a whole.
Given that scenario, traditional monitoring alerts did not—and never would have—
noticed this gradual degradation at all. Those simple coarse synthetic probes could
detect only a total outage, not one out of 50 probes failing and then recovering. In
that state, machines were still available, so the service was up, and most data was
making it through.
Second, even if it were somehow possible to introduce enough complicated logic to
trigger alerts when only a certain notable number of specific types of OOMs were
detected, we would have needed to predict this exact failure mode, well in advance
of this one bespoke issue ever occurring, in order to devise the right incantation of
conditions on which to trigger a useful alert. That theoretical incantation might have
Figure 12-2. The SLO error budget had burned down to –566% by the time the incident was over; compliance dropped to 99.97% (from the target of 99.995%). The boxed areas on the timeline show events marked as bad—and that they occurred at fairly regular intervals until corrected.
By that stage, if the team had started treating SLO-based alerts as primary alerting, we
would have been less likely to look for external and transient explanations. We would
have instead moved to actually fix the issue, or at least roll back the latest deploy.
SLOs proved their ability to detect brownouts and prompt the appropriate response.
That incident changed our culture. Once SLO burn alerts had proven their value,
our engineering team had as much respect for SLO-based alerts as they did for
traditional alerts. After a bit more time relying on SLO-based alerts, our team became
increasingly comfortable with the reliability of alerting purely on SLO data.
At that point, we deleted all of our traditional monitoring alerts that were based
on percentages of errors for traffic in the past five minutes, the absolute number
of errors, or lower-level system behavior. We now rely on SLO-based alerts as our
primary line of defense.
minutes, 50 seconds per month). As shown in the previous chapter, an event-based
calculation considers each individual event against qualification criteria and keeps a
running tally of “good” events versus “bad” (or errored) events.
Because availability targets are represented as percentages, the error budget corre‐
sponding to an SLO is based on the number of requests that came in during that
time period. For any given period of time, only so many errors can be tolerated.
A system is out of compliance with its SLO when its entire error budget has been spent. Subtract the number of failed (burned) requests from your total calculated error budget, and the result is what's colloquially known as the amount of error budget remaining.
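As a concrete sketch of that arithmetic (the SLO target and counts below are made up purely for illustration):

package main

import "fmt"

func main() {
	const sloTarget = 0.999 // 99.9% of eligible events must be good

	totalEvents := 1_000_000 // eligible events seen so far in the SLO window
	badEvents := 600.0       // events that failed the qualification criteria

	// The error budget is the number of events allowed to fail in this window.
	errorBudget := float64(totalEvents) * (1 - sloTarget) // 1,000 events

	remaining := errorBudget - badEvents
	fmt.Printf("error budget remaining: %.0f events (%.0f%%)\n",
		remaining, 100*remaining/errorBudget)
	// Prints: error budget remaining: 400 events (40%)
}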
To proactively manage SLO compliance, you need to become aware of and resolve
application and system issues long before your entire error budget is burned. Time
is of the essence. To take corrective action that averts the burn, you need to know
whether you are on a trajectory to consume that entire budget well before it happens.
The higher your SLO target, the less time you have to react. Figure 13-1 shows an
example graph indicating error budget burn.
Figure 13-1. A simple graph showing error budget burn over approximately three weeks
The rest of this chapter closely examines and contrasts various approaches to making
effective calculations to trigger error budget burn alerts. Let’s examine what it takes
to get an error budget burn alert working. First, you must start by setting a frame for
considering the all-too-relative dimension of time.
Figure 13-2. A rolling three-day window (left) and a three-day resetting window (right)
For most SLOs, a 30-day window is the most pragmatic period to use. Shorter
windows, like 7 or 14 days, won’t align with customer memories of your reliability or
with product-planning cycles. A window of 90 days tends to be too long; you could
burn 90% of your budget in a single day and still technically fulfill your SLO even
if your customers don’t agree. Long periods also mean that incidents won’t roll off
quickly enough.
You might choose a fixed window to start with, but in practice, fixed window
availability targets don’t match the expectations of your customers. You might issue
a customer a refund for a particular bad outage that happens on the 31st of the
month, but that does not wave a magic wand that suddenly makes them tolerant of
Figure 13-3. In this model, an alert triggers when the remaining error budget (solid line)
dips below the selected threshold (dashed line)
A challenge with this model is that it effectively just moves the goalpost by setting
a different empty threshold. This type of “early warning” system can be somewhat
effective, but it is crude. In practice, after crossing the threshold, your team will act
as if the entire error budget has been spent. This model optimizes to ensure a slight
bit of headroom so that your team meets its objectives. But that comes at the cost of
forfeiting additional time that you could have spent delivering new features. Instead,
your team sits in a feature freeze while waiting for the remaining error budget to
climb back up above the arbitrary threshold.
A second model for triggering alerts above the zero-level mark is to create predictive
burn alerts (Figure 13-4). These forecast whether current conditions will result in
burning your entire error budget.
When using predictive burn alerts, you need to consider the lookahead window and
the baseline (or lookback) window: how far into the future are you modeling your
forecast, and how much recent data should you be using to make that prediction?
Let's start with the lookahead window, since it is the simpler of the two.
This same technique of linear extrapolation often surfaces in areas such as capacity
planning or project management. For example, if you use a weighted system to
estimate task length in your ticketing system, you will have likely used this same
approach to extrapolate when a feature might be delivered during your future sprint
planning. With SLOs and error budget burn alerts, a similar logic is being applied to
help prioritize production issues that require immediate attention.
Calculating a first guess is relatively straightforward. However, you must weigh addi‐
tional nuances when forecasting predictive burn alerts that determine the quality and
accuracy of those future predictions.
In practice, we can use two approaches to calculate the trajectory of predictive burn
alerts. Short-term burn alerts extrapolate trajectories using only baseline data from the
most recent time period and nothing else. Context-aware burn alerts take historical
performance into account and use the total number of successful and failed events for
the SLO’s entire trailing window to make calculations.
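As a rough sketch of the short-term approach (the function and parameter names are ours, purely for illustration), the baseline window's burn rate is extrapolated across the lookahead window and compared against what's left of the budget:

// shortTermBurnAlert reports whether, at the failure rate observed during the
// baseline window, the remaining error budget would be exhausted before the
// lookahead window ends. It uses only recent data and ignores the rest of the
// SLO's trailing window.
func shortTermBurnAlert(failsInBaseline, remainingBudget float64,
	baselineHours, lookaheadHours float64) bool {
	if failsInBaseline <= 0 {
		return false // nothing is burning right now
	}
	burnPerHour := failsInBaseline / baselineHours
	projectedBurn := burnPerHour * lookaheadHours
	return projectedBurn >= remainingBudget
}

A one-hour baseline with a four-hour lookahead window, for example, asks: if the last hour repeats itself four more times, will any budget be left? A context-aware alert would additionally weigh how much budget the rest of the trailing window has already consumed.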
The decision to use one method or the other typically hinges on two factors. The first
is a trade-off between computational cost and sensitivity or specificity. Context-aware
burn alerts are computationally more expensive than short-term burn alerts. How‐
ever, the second factor is a philosophical stance on whether the total amount of error
budget remaining should influence how responsive you are to service degradation.
If resolving a significant error when only 10% of your burn budget remains carries
more urgency than resolving a significant error when 90% of your burn budget
remains, you may favor context-aware burn alerts.
Let’s look at examples of how decisions are made when using each of those
approaches.
Figure 13-6. An adjusted SLO 30-day sliding window. When projecting forward 4
days, the lookback period to consider must be shortened to 26 days before today. The
projection is made by replicating results from the baseline window for the next 4 days
and adding those to the adjusted sliding window.
With those adjusted timeframes defined, you can now calculate how the future looks
four days from now. For that calculation, you would do the following:
1. Examine every entry in the map of SLO events that has occurred in the past 26
days.
2. Store both the total number of events in the past 26 days and the total number of
errors.
3. Reexamine map entries that occurred within that last 1 day to determine the
baseline window failure rate.
4. Extrapolate the next 4 days of performance by presuming they will behave simi‐
larly to the 1-day baseline window.
5. Calculate the adjusted SLO 30-day sliding window as it would appear 4 days from
now.
6. Trigger an alert if the error budget would be exhausted by then.
The following Go fragment sketches that calculation (the enclosing function and variable definitions, such as ba, slo, tm, lookbackRatio, now, and the running/projected counters, are assumed, and the loop over the per-timestamp buckets is reconstructed here):

// Compute the window we will use to do the projection, with the projection
// offset as the earliest bound.
pOffset := time.Duration(ba.ExhaustionMinutes/lookbackRatio) * time.Minute
pWindow := now.Add(-pOffset)

// Set the end of the total window to the beginning of the SLO time period,
// plus ExhaustionMinutes.
tWindow := now.AddDate(0, 0, -slo.TimePeriodDays).Add(
	time.Duration(ba.ExhaustionMinutes) * time.Minute)

// Walk each timestamped bucket of SLO events inside the adjusted window,
// keeping a running tally of totals and failures.
for t := range tm.Total {
	if t.Before(tWindow) {
		continue
	}
	runningTotal += tm.Total[t]
	runningFails += tm.Fails[t]
	// If we're within the projection window, use this value to project forward,
	// counting it an extra lookbackRatio times.
	if t.After(pWindow) {
		projectedTotal += lookbackRatio * tm.Total[t]
		projectedFails += lookbackRatio * tm.Fails[t]
	}
}
Figure 13-7. One day ago, 65% of the error budget remained. Now, only 35% of the error
budget remains. At the current rate, the error budget will be exhausted in less than two
days.
Figure 13-10. A system that usually burns slowly but had an incident that burned a
significant portion of the error budget all at once, with only 1.3% remaining
When a burn alert is triggered, you should assess whether it is part of a burst
condition or is an incident that could burn a significant portion of your error budget
all at once. Comparing the current situation to historical rates can add helpful context
for triaging its importance.
Instead of showing the instantaneous 31.4% remaining and how we got there over
the trailing 30 days as we did in Figure 13-9, Figure 13-11 zooms out to examine
the 30-day-cumulative state of the SLO for each day in the past 90 days. Around the
beginning of February, this SLO started to recover above its target threshold. This
Figure 13-11. A 90-day sliding window for an SLO performs below a 99% target before
recovering toward the start of February
Understanding the general trend of the SLO can also answer questions about how
urgent the incident feels—and can give a hint for solving it. Burning budget all at
once suggests a different sort of failure than burning budget slowly over time.
Conclusion
We’ve examined the role that error budgets play and the mechanisms available to
trigger alerts when using SLOs. Several forecasting methods are available that can be
used to predict when your error budget will be burned. Each method has its own
considerations and trade-offs, and the hope is that this chapter shows you which
method to use to best meet the needs of your specific organization.
SLOs are a modern form of monitoring that solve many of the problems with
noisy monitoring we outlined before. SLOs are not specific to observability. What
is specific to observability is the additional power that event data adds to the SLO
model. When calculating error budget burn rates, events provide a more accurate
assessment of the actual state of production services. Additionally, merely knowing
that an SLO is in danger of a breach does not necessarily provide the insight you need
to determine which users are impacted, which dependent services are affected, or
which combinations of user behavior are triggering errors in your service. Coupling
observability data to SLOs helps you see where and when failures happened after an error budget burn alert is triggered.
Using SLOs with observability data is an important component of both the SRE
approach and the observability-driven-development approach. As seen in previous
chapters, analyzing events that fail can give rich and detailed information about what
is going wrong and why. It can help differentiate systemic problems and occasional
sporadic failures. In the next chapter, we’ll examine how observability can be used to
monitor another critical component of a production application: the software supply
chain.
your build pipelines, and the types of instrumentation that are particularly useful in
this context.
Slack’s story is framed from the point of view of an organization with large-scale
software supply chains, though we believe it has lessons applicable at any scale. In
Part IV, we’ll specifically look at challenges that present themselves when implement‐
ing observability at scale.
I’m delighted to share this chapter on practices and use cases for integrating observa‐
bility into your software supply chain. A software supply chain comprises “anything
that goes into or affects your software from development, through your CI/CD
pipeline, until it gets deployed into production.”1
For the past three years, I have spent time building and learning about systems and
human processes to deliver frequent, reliable, and high-quality releases that provide
a simpler, more pleasant, and productive experience for Slack customers. For teams
working on the software supply chain, the pipelines and tools to support CI/CD used
by our wider organization are our production workload.
Slack invested early in CI development for collaboration and in CD for releasing
software into the hands of customers. CI is a development methodology that requires
engineers to build, test, and integrate new code as frequently as possible to a shared
codebase. Integration and verification of new code in a shared codebase increases
confidence that new code does not introduce faults that affect customers. Systems
for CI enable developers to automatically trigger builds, test, and receive feedback
when they commit new code.
Slack evolved from a single web app PHP monorepo (now mostly in Hack) to a
topology of many languages, services, and clients to serve various needs. Slack’s
core business logic still lives in the web app and routes to downstream services like
Flannel. CI workflows at Slack include unit tests, integration tests, and end-to-end
functional tests for a variety of codebases.
1 Maya Kaczorowski, “Secure at Every Step: What is Software Supply Chain Security and Why Does It Matter?”,
GitHub Blog, September 2, 2020.
Figure 14-1. An example end-to-end workflow for testing the web app
In Figure 14-2, you can see a simplified view of a test run after a user pushes a
commit to GitHub. This test run is orchestrated by Checkpoint and subsequently
passed on to a build step and then a test step, each performed by Jenkins executors.
In each stage, you see additional dimensions that contextualize execution in CI.
Slack engineers then use single, or combinations of, dimensions as breadcrumbs to
explore executions when performance or reliability issues arise in production. Each
Figure 14-2. A simplified view of a single end-to-end test run orchestrated by our CI
orchestration layer, highlighting the common dimensions shared across our workflow
Client dimensions are configured within each trace. (Figure 14-3 shows example
dimensions.) Slack uses a TraceContext singleton that sets up these dimensions.
Each TraceContext builds an initial trace with common dimensions and a new trace ID.
Each trace contains multiple spans and an array of specific dimensions at each span.
An individual span (e.g., in Figure 14-4 from runner.test_execution) can contain
context on the original request and add dimensions of interest to the root span.
As you add more dimensions, you add more context and richness to power issue
investigations.
Figure 14-3. Common dimensions in our Hack codebase. You can use these as examples
for structuring dimensions across spans and services in CI.
For example, Slack engineers might want to identify concurrency issues along com‐
mon dimensions (like hostnames or groups of Jenkins workers). The TraceContext
already provides a hostname tag. The CI runner client then appends a tag for the
Jenkins worker label. Using a combination of these two dimensions, Slack engineers
can then group individual hosts or groups of Jenkins workers that have runtime
issues.
Similarly, Slack engineers might want to identify common build failures. The CI
runner client appends a tag for the commit head or commit main branch. This
combination allows for identifying which commits a broken build might come from.
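Slack's instrumentation is written in Hack, but the pattern translates directly; as an illustration only, here is roughly how appending those kinds of dimensions to the active span might look with the OpenTelemetry Go API (the attribute keys are invented, not Slack's actual schema):

import (
	"context"
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// annotateCISpan adds CI-specific dimensions to the current span so that runs
// can later be grouped by host, worker pool, or commit during an investigation.
func annotateCISpan(ctx context.Context, workerLabel, commitSHA, branch string) {
	span := trace.SpanFromContext(ctx)
	hostname, _ := os.Hostname()
	span.SetAttributes(
		attribute.String("ci.hostname", hostname),
		attribute.String("ci.jenkins_worker_label", workerLabel),
		attribute.String("ci.commit_sha", commitSHA),
		attribute.String("ci.branch", branch),
	)
}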
The dimensions in Figure 14-3 are then used in various script and service calls as
they communicate with one another to complete a test run (Figure 14-4).
In the following sections, I’ll share how Slack uses trace tooling and queries to make
sense of the supply chain and how provenance can result in actionable alerting to
resolve issues.
Developer frustration across Slack engineering was increasing because of flaky end-
to-end test runs in 2020. Test turnaround time (p95) was consistently above 30
minutes for a single commit (the time between an engineer pushing a commit to
GitHub and all test executions returning). During this period, most of Slack’s code
testing was driven by end-to-end tests before an engineer merged their code to the
mainline. Many end-to-end test suites had an average suite execution flake rate of
nearly 15%. Cumulatively, these flaky test executions peaked at 100,000 weekly hours
of compute time on discarded test executions.
By mid-2020, the combination of these metrics led to automation teams across Slack
sharing a daily 30-minute triage session to dig into specific test issues. Automation
team leads hesitated to introduce any additional variance to the way Slack used
Cypress, an end-to-end testing platform. The belief was that flakiness was from the
test code itself. Yet no great progress was made in verifying or ruling out that belief.
In late 2020, observability through tracing had shown great promise and impact in
identifying infrastructure bottlenecks in other internal tooling. Internal tooling and
automation teams worked to add tracing for a few runtime parameters and spans in
Cypress.
Within days of instrumentation, multiple dimensions appeared strongly correlated with test suites that had higher flake rates. Engineers from these teams looked at this instrumentation and discovered that users and test suite owners of the test platform had drastically different configurations. During this discovery process, additional telemetry was added to the Docker runtime to provide more context on some flakes.
Empowered with data, these engineers experimented to place better defaults for the
platform and to place guardrails for flaky configurations. After these initial adjust‐
ments, test suite flake rates decreased significantly for many users (suites went from
15% to under 0.5% flake rates), as shown in Figure 14-5.
For more on how Slack evolved its testing strategy and culture of
safety, see the Slack blog post “Balancing Safety and Velocity in
CI/CD at Slack”. Slack describes how engineers initiated a project
to transform testing pipelines and de-emphasize end-to-end testing
for code safety. This drastically reduced user-facing flakiness and
increased developer velocity in 2021.
Figure 14-5. Time spent on flaking test runs between major classes of test runs for web
app. The light colored bar shows flaky test executions from the Cypress platform tests.
With this shared understanding of context through tooling, Slack’s next step was to
embed actionable workflows through alerting.
• Test suite owner or platform teams might care about flakiness, reliability, or
memory usage.
• Test infrastructure teams might care about performance and reliability of specific
operations (like Docker ops or cost per test).
• Deployment owners might care about what was tested or upcoming hotfixes
coming through CI.
The prompt for identifying an issue might be anomaly detection alerts for a high-
level business metric or a specific issue that’s suite based (e.g., in Figure 14-6). The
link to our observability tool might direct the user to a collection of views available
based on the test_suite dimension.
Figure 14-6. Identifying runtime increase for test suite above p50
At Slack, we’ve encouraged teams to make dashboards based on specific use cases.
The Honeycomb link brings up a query from our CI Service Traces dashboard
(Figure 14-7) that has parameters set for a potential issue for a test suite. This
message helps inform responders of a specific issue—for example, a test suite called
backend-php-integration is showing signs of a longer runtime—and responders might
use Honeycomb to look at associated traces for potential issues.
Figure 14-7. Slack’s CI Service Traces dashboard displaying queries available to view
different pieces of CI interactions
Figure 14-8. This sample drill-down query approximates a rate, error, and duration
(RED) dashboard with visualizations by grouping individual methods between services
in Checkpoint
A few potential culprit commits were reverted. Because of the potentially large number of variables during an incident investigation, a holding pattern frequently occurs after a change, while waiting for telemetry to confirm system health. Figure 14-10 shows a long thread
by subject-matter experts who were quickly able to test the hypothesis and see system
health using distributed tracing in near-real time.
Figure 14-10. A Slack thread of questions during incident investigation that starts with
testing a hypothesis, taking action, and using observability to validate the hypothesis
(responder names blurred for privacy)
Conclusion
This chapter illustrates how observability can be useful in the software supply chain.
I shared how Slack instrumented the CI pipeline and recent examples of debugging
distributed systems. The intricacies of debugging distributed systems are generally
top of mind for application developers trying to understand how their code behaves
in production environments. But, prior to production, other distributed systems may
be equally challenging to properly understand and debug.
With the right tools and dimensionality in the software supply chain, Slack engineers
were able to solve complex problems throughout the CI workflow that were previ‐
ously invisible or undetected. Whether debugging complaints that an application is
slow or that CI tests are flaky, observability can help developers correlate problems in
high-complexity systems that interact.
In Part III, we focused on overcoming barriers to getting started and new workflows
that help change social and cultural practices in order to put some momentum
behind your observability adoption initiatives. In this part, we examine considera‐
tions on the other end of the adoption spectrum: what happens when observability
adoption is successful and practiced at scale?
When it comes to observability, “at scale” is probably larger than most people think.
As a rough ballpark measure, if you are generating telemetry events in the high hundreds of millions or low billions per day, you might have a scale issue. The concepts explored in this part of the book are most acutely felt when operating observability
solutions at scale. However, these lessons are generally useful to anyone going down
the path of observability.
Chapter 15 explores the decision of whether to buy or build an observability solution.
At a large enough scale, as the bill for commercial solutions grows, teams will start
to consider whether they can save more by simply building an observability solution
themselves. This chapter provides guidance on how best to approach that decision.
Chapter 16 explores how a data store must be configured in order to serve the needs
of an observability workload. To achieve the functional requirements of iterative
and open-ended investigations, several technical criteria must be met. This chapter
presents a case study of Honeycomb’s Retriever engine as a model for meeting these
requirements.
Chapter 17 looks at how to reduce the overhead of managing large volumes of tele‐
metry data at scale. This chapter presents several techniques to ensure high-fidelity
observability data, while reducing the overall number of events that must be captured
and stored in your backend data store.
Chapter 18 takes a look at another technique for managing large volumes of telemetry data at scale: management via pipelines. This is a guest chapter by Suman
Karumuri, senior staff software engineer, and Ryan Katkov, director of engineering, at
Slack. It presents an in-depth look at how Slack uses telemetry management pipelines
to route observability data at scale.
This part of the book focuses on observability concepts that are useful to understand
at any scale but that become critical in large-scale use cases. In Part V, we’ll look at
techniques for spreading observability culture at any scale.
CHAPTER 15
Build Versus Buy and Return on Investment
So far in this book, we’ve examined both the technical fundamentals of observability
and the social steps necessary to initiate the practice. In this part of the book, we will
examine the considerations necessary when implementing observability at scale. We’ll
focus on the functional requirements that are necessary to achieve the observability
workflows described in earlier parts.
At a large enough scale, the question many teams will grapple with is whether they
should build or buy an observability solution. Observability can seem relatively inex‐
pensive on the surface, especially for smaller deployments. As user traffic grows, so
too does the infrastructure footprint and volume of events your application generates.
When dealing with substantially more observability data and seeing a much larger
bill from a vendor, teams will start to consider whether they can save more by simply
building an observability solution themselves.
Alternatively, some organizations consider building an observability solution when
they perceive that a vendor’s ability to meet their specific needs is inadequate. Why
settle for less than you need when software engineers can build the exact thing you
want? As such, we see a variety of considerations play into arguments on whether the
right move for any given team is to build a solution or buy one.
This chapter unpacks those considerations for teams determining whether they
should build or buy an observability solution. It also looks at both quantifiable
and unquantifiable factors when considering return on investment (ROI). The build-
versus-buy choice is also not binary; in some situations, you may want to both buy
and build.
We’ll start by examining the true costs for buying and building. Then we’ll consider
circumstances that may necessitate one or the other. We’ll also look at ways to
potentially strike a balance between building everything yourself or just using a
vendor solution. The recommendations in this chapter are most applicable to larger
organizations, but the advice applies to teams weighing this decision at any scale.
1 Dustin Smith, “2021 Accelerate State of DevOps Report Addresses Burnout, Team Performance”, Google
Cloud Blog, September 21, 2021.
Conclusion
This chapter presents general advice, and your own situation may, of course, be
different. When making your own calculations around building or buying, you must
first start by determining the real TCO of both options. Start with the more visibly
quantifiable costs of both (time when considering building, and money when consid‐
ering buying). Then be mindful of the hidden costs of each (opportunity costs and
less visibly spent money when considering building, and future usage patterns and
vendor lock-in when considering buying).
When considering building with open source tools, ensure that you weigh the full
impact of hidden costs like recruiting, hiring, and training the engineers necessary to
develop and maintain bespoke solutions (including their salaries and your infrastruc‐
ture costs) in addition to the opportunity costs of devoting those engineers to run‐
ning tools that are not delivering against core business value. When purchasing an
observability solution, ensure that vendors give you the transparency to understand
their complicated pricing schemes and apply logical rubrics when factoring in both
system architecture and organizational adoption patterns to determine your likely
future costs.
When adding up these less visible costs, the TCO for free solutions can be more
adequately weighed against commercial solutions. Then you can also factor in the
less quantifiable benefits of each approach to determine what’s right for you. Also
remember that you can buy and build to reap the benefits of either approach.
Remember that, as employees of a software vendor, we authors have some implicit
bias in this conversation. Even so, we believe the advice in this chapter is fair and
methodical, and in alignment with our past experience as consumers of observability
tooling rather than producers. Most of the time, the best answer for any given team
focused on delivering business value is to buy an observability solution rather than
building one themselves. However, that advice comes with the caveat that your team
should be building an integration point enabling that commercial solution to be
adapted to the needs of your business.
In the next chapter, should you decide to build your own observability solution, we
will look at what it takes to optimize a data store for the needs of delivering against an
observability workload.
In this chapter, we’ll look at the challenges that must be addressed to effectively store
and retrieve your observability data when you need it most. Speed is a common
concern with data storage and retrieval, but other functional constraints impose key
challenges that must be addressed at the data layer. At scale, the challenges inherent to
observability become especially pronounced. We will lay out the functional require‐
ments necessary to enable observability workflows. Then we will examine real-life
trade-offs and possible solutions by using the implementation of Honeycomb’s pro‐
prietary Retriever data store as inspiration.
You will learn about the various considerations required at the storage and retrieval
layers to ensure speed, scalability, and durability for your observability data. You
will learn about a columnar data store and why it is particularly well suited for
observability data, how querying workloads must be handled, and considerations for
making data storage durable and performant. The solutions presented in this chapter
are not the only possible solutions to the various trade-offs you may encounter.
However, they’re presented as real-world examples of achieving the necessary results
when building an observability solution.
As covered in Part II, events are the building blocks of observability, and traces are a
collection of interrelated events (or trace spans). Finding meaningful patterns within
those events requires an ability to analyze high-cardinality and high-dimensionality
data. Any field within any event (or within any trace span) must be queryable. Those
events cannot be pre-aggregated since, in any given investigation, you won’t know in
advance which fields may be relevant. All telemetry data must be available to query in
pre-aggregate resolution, regardless of its complexity, or you risk hitting investigative
dead ends.
Further, because you don’t know which dimensions in an event may be relevant,
you cannot privilege the data-retrieval performance of any particular dimension over
others (they must all be equally fast). Therefore, all possible data needed must be
indexed (which is typically prohibitively expensive), or data retrieval must always be
fast without indexes in place.
Typically, in observability workflows, users are looking to retrieve data in specific
time ranges. That means the only exception to privileged data is the dimension
of time. It is imperative that queries return all data recorded within specific time
intervals, so you must ensure that it is indexed appropriately. A TSDB would seem
to be the obvious choice here, but as you’ll see later in this chapter, using one for
observability presents its own set of incompatible constraints.
Because your observability data is used to debug production issues, it’s imperative to
know whether the specific actions you’ve taken have resolved the problem. Stale data
can cause engineers to waste time on red herrings or make false conclusions about
the current state of their systems. Therefore, an efficient observability system should
include not just historical data but also fresh data that reflects the current state in
close to real time. No more than seconds should elapse between when data is received
into the system and when it becomes available to query.
Lastly, that data store must also be durable and reliable. You cannot lose the observ‐
ability data needed during your critical investigations. Nor can you afford to delay
your critical investigations because any given component within your data store
failed. Any mechanisms you employ to retrieve data must be fault-tolerant and
designed to return fast query results despite the failure of any underlying workers.
The durability of your data store must also be able to withstand failures that occur
within your own infrastructure. Otherwise, your observability solution may also be
inoperable while you’re attempting to debug why the production services it tracks are
inoperable.
Given these functional requirements necessary to enable real-time debugging work‐
flows, traditional data storage solutions are often inadequate for observability. At a
small enough scale, data performance within these parameters can be more easily
achieved. In this chapter, we’ll examine how these problems manifest when you go
beyond single-node storage solutions.
Figure 16-1. A prototypical TSDB showing a limited amount of cardinality and dimen‐
sionality: tags for HTTP method and status code, bucketed by timestamp
Figure 16-2. The explosion of that same TSDB when a high-cardinality index, userid, is
added
1 Lior Abraham et al., “Scuba: Diving into Data at Facebook”, Proceedings of the VLDB Endowment 6.11
(2013): 1057–1067.
Bigtable uses a row-based approach, meaning that the retrieval of individual traces
and spans is fast because the data is serialized and the primary key is indexed (e.g., by
time). To obtain one row (or scan a few contiguous rows), a Bigtable server needs to
retrieve only one set of files with their metadata (a tablet). To function efficiently for
tracing, this approach requires Bigtable to maintain a list of rows sorted by a primary
row key such as trace ID or time. Row keys that arrive out of strict order require the
server to insert them at the appropriate position, causing additional sorting work at
write time that is not required for the nonsequential read workload.
As a mutable data store, Bigtable supports update and deletion semantics and
dynamic repartitioning. In other words, data in Bigtable has flexibility in data pro‐
cessing at the cost of complexity and performance. Bigtable temporarily manages
updates to data as an in-memory mutation log plus an ongoing stack of overlaid files
containing key-value pairs that override values set lower in the stack. Periodically,
once enough of these updates exist, the overlay files must be “compacted” back into
a base immutable layer with a process that rewrites records according to their prece‐
dence order. That compaction process is expensive in terms of disk I/O operations.
For observability workloads, which are effectively write-once read-many, the compac‐
tion process presents a performance quandary. Because newly ingested observability
data must be available to query within seconds, with Bigtable the compaction process
would either need to run constantly, or you would need to read through each of the
stacked immutable key-value pairs whenever a query is submitted. It’s impractical to
perform analysis of arbitrary fields without performing a read of all columns for the
row’s tablet to reproduce the relevant fields and values. It’s similarly impractical to
have compaction occur after each write.
• Individual events do not need to be sorted and put into a strict order by event
timestamp, as long as the start and end timestamp of each segment is stored as
metadata and events arrive with a consistent lag.
• The contents of each segment are append-only artifacts that can be frozen as
is once finished rather than needing to be built as mutable overlays/layers and
compacted periodically.
The segment-partitioning approach has one potential weakness: when backfilled data
is intermingled with current data (for instance, if a batch job finishes and reports
timestamps from hours or days ago), each segment written will have metadata indi‐
cating it spans a broad time window covering not just minutes, but potentially hours’
or days’ worth of time. In this case, segments will need to be scanned for any query
in that wide window, rather than being scanned for data in only the two narrower
windows of time—the current time, and the time around when the backfilled events
happened. Although the Retriever workload has not necessitated this to date, you
could layer on more sophisticated segment partitioning mechanisms if it became a
significant problem.
4 Terry A. Welch, “A Technique for High-Performance Data Compression,” Computer 17, no. 6 (1984): 8–19.
presence: [Bitmask indicating which rows have non-null values]
1. Identify all segments that potentially overlap with the time range of the query, using the start/end time of the query and the start/end time of each eligible segment (see the sketch after this list).
2. Independently, for each matching segment: for the columns used for the filters of
the query (e.g., WHERE) or used for the output (e.g., used as a SELECT or GROUP),
scan the relevant column files. To perform a scan, track the current offset you
are working on. Evaluate the timestamp of the row at the current offset first to
validate whether the row falls within the time range of the query. If not, advance
to the next offset and try again.
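A minimal sketch of step 1, assuming each segment records only its start and end timestamps as metadata (the struct and function names here are invented for illustration):

import "time"

type Segment struct {
	Start, End time.Time
	// ...references to the per-column files for this segment would live here
}

// segmentsForQuery returns the segments whose time bounds overlap the query's
// [qStart, qEnd) range; only these segments need their column files scanned.
func segmentsForQuery(all []Segment, qStart, qEnd time.Time) []Segment {
	var matched []Segment
	for _, s := range all {
		if s.End.After(qStart) && s.Start.Before(qEnd) {
			matched = append(matched, s)
		}
	}
	return matched
}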
We can work through this in pseudocode, calculating SUM, COUNT, AVG, MAX, etc. based on keeping a cumulative sum or highest value seen to date—for instance, on the value of the field x grouped by fields a and b, where y is greater than zero:

groups := make(map[Key]Aggregation)
for _, s := range segments {
	for _, row := range fieldSubset(s, []string{"a", "b", "x", "y"}) {
		// Keep only rows that match the filter (y greater than zero).
		if row["y"] <= 0 {
			continue
		}
		key := Key{A: row["a"], B: row["b"]}
		aggr := groups[key]
		aggr.Count++
		aggr.Sum += row["x"]
		if aggr.Max < row["x"] {
			aggr.Max = row["x"]
		}
		groups[key] = aggr
	}
}
for k, aggr := range groups {
	aggr.Avg = aggr.Sum / float64(aggr.Count)
	groups[k] = aggr
}
5 This may not seem impressive, until you realize routine queries scan hundreds of millions of records!
CHAPTER 17
Cheap and Accurate Enough: Sampling
In the preceding chapter, we covered how a data store must be configured in order
to efficiently store and retrieve large quantities of observability data. In this chapter,
we’ll look at techniques for reducing the amount of observability data you may need
to store. At a large enough scale, the resources necessary to retain and process every
single event can become prohibitive and impractical. Sampling events can mitigate
the trade-offs between resource consumption and data fidelity.
This chapter examines why sampling is useful (even at a smaller scale), the various
strategies typically used to sample data, and trade-offs between those strategies. We
use code-based examples to illustrate how these strategies are implemented and pro‐
gressively introduce concepts that build upon previous examples. The chapter starts
with simpler sampling schemes applied to single events as a conceptual introduction
to using a statistical representation of data when sampling. We then build toward
more complex sampling strategies as they are applied to a series of related events
(trace spans) and propagate the information needed to reconstruct your data after
sampling.
The reality of most applications is that many of their events are virtually identical
and successful. The core function of debugging is searching for emergent patterns or
examining failed events during an outage. Through that lens, it is then wasteful to
transmit 100% of all events to your observability data backend. Certain events can
be selected as representative examples of what occurred, and those sample events can
be transmitted along with metadata your observability backend needs to reconstruct
what actually occurred among the events that weren’t sampled.
To debug effectively, what’s needed is a representative sample of successful, or “good,”
events, against which to compare the “bad” events. Using representative events to
reconstruct your observability data enables you to reduce the overhead of transmit‐
ting every single event, while also faithfully recovering the original shape of that data.
Sampling events can help you accomplish your observability goals at a fraction of the
resource cost. It is a way to refine the observability process at scale.
Historically in the software industry, when facing resource constraints in reporting
high-volume system state, the standard approach to surfacing the signal from the
noise has been to generate aggregated metrics containing a limited number of tags.
As covered in Chapter 2, aggregated views of system state that cannot be decomposed are far too coarse to meet the troubleshooting needs of modern distributed systems.
Pre-aggregating data before it arrives in your debugging tool means that you can’t dig
further past the granularity of the aggregated values.
With observability, you can sample events by using the strategies outlined in this
chapter and still provide granular visibility into system state. Sampling gives you
the ability to decide which events are useful to transmit and which ones are not.
Unlike pre-aggregated metrics that collapse all events into one coarse representation
of system state over a given period of time, sampling allows you to make informed
decisions about which events can help you surface unusual behavior, while still
optimizing for resource constraints. The difference between sampled events and
aggregated metrics is that full cardinality is preserved on each dimension included in
the representative event.
At scale, the need to refine your data set to optimize for resource costs becomes
critical. But even at a smaller scale, where the need to shave resources is less pressing,
refining the data you decide to keep can still provide valuable cost savings. First, let’s
start by looking at the strategies that can be used to decide which data is worth sam‐
pling. Then, we’ll look at when and how that decision can be made when handling
trace events.
Constant-Probability Sampling
Because of its understandable and easy-to-implement approach, constant-probability
sampling is what most people think of when they think of sampling: a constant
percentage of data is kept rather than discarded (e.g., keep 1 out of every 10 requests).
When performing analysis of sampled data, you will need to transform the data
to reconstruct the original distribution of requests. Suppose that your service is
instrumented with both events and metrics, and receives 100,000 requests. It is mis‐
leading for your telemetry systems to report receiving only 100 events if each received
event represents approximately 1,000 similar events. Other telemetry systems such
as metrics will record increment a counter for each of the 100,000 requests that
your service has received. Only a fraction of those requests will have been sampled,
so your system will need to adjust the aggregation of events to return data that is
approximately correct. For a fixed sampling rate system, you can multiply each event
by the sampling rate in effect to get the estimated count of total requests and sum of
their latency. Scalar distribution properties such as the p99 and median do not need
to be adjusted for a constant probability sampling, as they are not distorted by the
sampling process.
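For instance (a toy illustration, not code from any particular system), reconstructing totals from a constant 1-in-100 sample is a single multiplication per kept event:

const sampleRate = 100 // keep 1 in 100 events

// estimateTotals weights each kept event by the constant rate. Percentile
// values such as the median or p99 are computed on the kept events directly,
// since constant-probability sampling does not distort them.
func estimateTotals(keptDurationsMs []float64) (count int, sumMs float64) {
	for _, d := range keptDurationsMs {
		count += sampleRate
		sumMs += d * sampleRate
	}
	return count, sumMs
}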
The basic idea of constant sampling is that, if you have enough volume, any error
that comes up will happen again. If that error is happening enough to matter, you’ll
see it. However, if you have a moderate volume of data, constant sampling does not
maintain the statistical likelihood that you still see what you need to see. Constant
sampling is not effective in the following circumstances:
• You care a lot about error cases and not very much about success cases.
• Some customers send orders of magnitude more traffic than others, and you
want all customers to have a good experience.
• You want to ensure that a huge increase in traffic on your servers can’t over‐
whelm your analytics backend.
For an observable system, a more sophisticated approach ensures that enough tele‐
metry data is captured and retained so that you can see into the true state of any given
service at any given time.
• Events with errors are more important than those with successes.
• Events for newly placed orders are more important than those checking on order
status.
• Events affecting paying customers are more important to keep than those for
customers using the free tier.
Fixed-Rate Sampling
A naive approach might be probabilistic sampling using a fixed rate, by randomly
choosing to send 1 in 1,000 events:
var sampleRate = flag.Int("sampleRate", 1000, "Static sample rate")
r := rand.Float64()
if r < 1.0/float64(*sampleRate) {
RecordEvent(req, start, err)
}
}
Every 1,000th event would be kept, regardless of its relevance, as representative of
the other 999 events discarded. To reconstruct your data on the backend, you would
need to remember that each event stood for sampleRate events and multiply out
all counter values accordingly on the receiving end at the instrumentation collector.
Otherwise, your tooling would misreport the total number of events actually encoun‐
tered during that time period.
Recording the sample rate within an event can look like this:
var sampleRate = flag.Int("sampleRate", 1000, "Service's sample rate")
r := rand.Float64()
if r < 1.0/float64(*sampleRate) {
RecordEvent(req, *sampleRate, start, err)
}
}
With this approach, you can keep track of the sampling rate in effect when each
sampled event was recorded. That gives you the data necessary to accurately calculate
values when reconstructing your data, even if the sampling rate dynamically changes.
For example, if you were trying to calculate the total number of events meeting a filter
such as "err != nil", you would multiply the count of seen events with "err !=
nil" by each one’s sampleRate (see Figure 17-2). And, if you were trying to calculate
the sum of durationMs, you would need to weight each sampled event’s durationMs
and multiply it by sampleRate before adding up the weighted figures.
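A sketch of that reconstruction, assuming each stored event carries the sample rate that was in effect when it was kept (the struct and field names are illustrative):

type SampledEvent struct {
	Err        error
	DurationMs float64
	SampleRate int // rate in effect when this event was recorded
}

// estimateErrors returns the approximate number of original events matching
// "err != nil" and the approximate sum of their durations, weighting each kept
// event by the sample rate it was recorded with.
func estimateErrors(events []SampledEvent) (count int, sumDurationMs float64) {
	for _, e := range events {
		if e.Err == nil {
			continue
		}
		count += e.SampleRate
		sumDurationMs += e.DurationMs * float64(e.SampleRate)
	}
	return count, sumDurationMs
}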
Consistent Sampling
So far in our code, we’ve looked at how a sampling decision is made. But we have yet
to consider when a sampling decision gets made in the case of sampling trace events.
The strategy of using head-based, tail-based, or buffered sampling matters when
considering how sampling interacts with tracing. We’ll cover how those decisions get
implemented toward the end of the chapter. For now, let’s examine how to propagate
context to downstream handlers in order to (later) make that decision.
To properly manage trace events, you should use a centrally generated sampling/trac‐
ing ID propagated to all downstream handlers instead of independently generating
a sampling decision inside each one. Doing so lets you make consistent sampling
decisions for different manifestations of the same end user’s request (see Figure 17-3).
In other words, this ensures that you capture a full end-to-end trace for any given
sampled request. It would be unfortunate to discover that you have sampled an error
far downstream for which the upstream context is missing because it was dropped
because of how your sampling strategy was implemented.
Consistent sampling ensures that when the sample rate is held constant, traces are
either kept or sampled away in their entirety. And if children are sampled at a higher
sample rate—for instance, noisy Redis calls being sampled 1 for 1,000, while their
parents are kept 1 for 10—it will never be the case that a broken trace is created from
a Redis child being kept while its parent is discarded.
start := time.Now()
// Propagate the Sampling-ID when creating a child span
i, err := callAnotherService(r)
resp.Write(i)
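One common way to make that decision consistently (a sketch of the general technique, not any particular library's implementation) is to derive it deterministically from the propagated trace or sampling ID, so every service seeing the same ID makes the same keep/drop choice:

import (
	"crypto/sha1"
	"encoding/binary"
	"math"
)

// shouldSample hashes the propagated trace ID onto [0, 1) and keeps the event
// if that value falls below 1/sampleRate. Because the hash is deterministic,
// a trace kept at a 1-in-1,000 rate is always also kept at 1-in-10, so parents
// and children agree even when they sample at different rates.
func shouldSample(traceID string, sampleRate int) bool {
	sum := sha1.Sum([]byte(traceID))
	v := binary.BigEndian.Uint64(sum[:8])
	fraction := float64(v) / float64(math.MaxUint64)
	return fraction < 1.0/float64(sampleRate)
}

Each sampled event still needs to record the rate it was kept at so that the backend can weight it correctly during reconstruction.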
Figure 17-4. You can automate the calculation of overall sample volume
// requestsInPastMinute, targetEventsPerSec, and sampleRate are assumed to be
// declared at package level and shared with the request handler.
func main() {
	// Initialize counters.
	rc := 0
	requestsInPastMinute = &rc

	go func() {
		for {
			time.Sleep(time.Minute)
			newSampleRate := *requestsInPastMinute / (60 * *targetEventsPerSec)
			if newSampleRate < 1 {
				sampleRate = 1
			} else {
				sampleRate = newSampleRate
			}
			newRequestCounter := 0
			// Real production code would do something less prone to race
			// conditions
			requestsInPastMinute = &newRequestCounter
		}
	}()
	http.Handle("/", handler)
	[...]
}
	start := time.Now()
	*requestsInPastMinute++
	i, err := callAnotherService(r)
	resp.Write(i)

	sample := rand.Float64()
	if err != nil || time.Since(start) > 500*time.Millisecond {
		// Errors and slow requests are outliers; sample them at their own,
		// higher rate.
		if sample < 1.0/float64(*outlierSampleRate) {
			RecordEvent(req, *outlierSampleRate, start, err)
		}
	} else {
		if sample < 1.0/float64(*sampleRate) {
			RecordEvent(req, *sampleRate, start, err)
		}
	}
}
Although this is a good example of using multiple static sample rates, the approach
is still susceptible to spikes of instrumentation traffic. If the application experiences a
spike in the rate of errors, every single error gets sampled. Next, we will address that
shortcoming with target rate sampling.
func main() {
	// Initialize counters.
	rc := 0
	requestsInPastMinute = &rc
	oc := 0
	outliersInPastMinute = &oc

	go func() {
		for {
			time.Sleep(time.Minute)
			newSampleRate := *requestsInPastMinute / (60 * *targetEventsPerSec)
			if newSampleRate < 1 {
				sampleRate = 1
			} else {
				sampleRate = newSampleRate
			}
			newRequestCounter := 0
			requestsInPastMinute = &newRequestCounter
// Boilerplate main() and goroutine init to overwrite maps and roll them over
// every interval goes here.
	counts[k]++
	if r, ok := sampleRates[k]; ok {
		return r
	} else {
		return 1.0
	}
}
start := time.Now()
i, err := callAnotherService(r)
resp.Write(i)
Putting It All Together: Head and Tail per Key Target Rate Sampling
Earlier in this chapter, we noted that head-based sampling requires setting a header
to propagate a sampling decision downstream. For the code example we’ve been
iterating, that means the parent span must pass both the head-sampling decision and
its corresponding rate to all child spans. Doing so forces sampling to occur for all
child spans, even if the dynamic sampling rate at that level would not have chosen to
sample the request:
var headCounts, tailCounts map[interface{}]int
var headSampleRates, tailSampleRates map[interface{}]float64
// Boilerplate main() and goroutine init to overwrite maps and roll them over
// every interval goes here. checkSampleRate() etc. from above as well
start := time.Now()
i, err := callAnotherService(r, headSampleRate)
resp.Write(i)
if headSampleRate > 0 {
RecordEvent(req, headSampleRate, start, err)
} else {
Conclusion
Sampling is a useful technique for refining your observability data. While sampling is
necessary when running at scale, it can be useful in a variety of circumstances even at
smaller scales. The code-based examples illustrate how various sampling strategies are
implemented. It’s becoming increasingly common for open source instrumentation
libraries—such as OTel—to implement that type of sampling logic for you. As those
libraries become the standard for generating application telemetry data, it should
become less likely that you would need to reimplement these sampling strategies in
your own code.
However, even if you rely on third-party libraries to manage that strategy for you, it is
essential that you understand the mechanics behind how sampling is implemented so
you can understand which method is right for your particular situation. Understand‐
ing how the strategies (static versus dynamic, head versus tail, or a combination
thereof) work in practice enables you to use them wisely to achieve data fidelity while
also optimizing for resource constraints.
Similar to deciding what and how to instrument your code, deciding what, when,
and how to sample is best defined by your unique organizational needs. The fields in
your events that influence how interesting they are to sample largely depend on how
useful they are to understanding the state of your environment and their impact on
achieving your business goals.
In the next chapter, we’ll examine an approach to routing large volumes of telemetry
data: telemetry management with pipelines.
In this chapter, we will go over how telemetry pipelines can benefit your organi‐
zation’s observability capabilities, describe the basic structure and components of
a telemetry pipeline, and show concrete examples of how Slack uses a telemetry
pipeline, using mainly open source software components. Slack has been using this
pattern in production for the past three years, scaling up to millions of events per
second.
Establishing a telemetry management practice is key for organizations that want to
focus on observability adoption and decrease the amount of work a developer needs
to do to make their service sufficiently observable. A strong telemetry management
practice lays the foundation for a consolidated instrumentation framework and cre‐
ates a consistent developer experience, reducing complexity and churn, especially
when it comes to introducing new telemetry from new software.
At Slack, we generally look for these characteristics when we envision an ideal
telemetry system: we want the pipeline to be able to collect, route, and enrich data
streams coming from applications and services. We are also opinionated about the
components that operate as part of a stream, and we make available a consistent
set of endpoints or libraries. Finally, we use a prescribed common event format that
applications can leverage quickly to realize value.
As an organization grows, observability systems tend to evolve from a simple sys‐
tem in which applications and services produce events directly to the appropriate
backend, to more-complex use cases. If you find yourself needing greater security,
workload isolation, retention requirement enforcement, or a greater degree of control
over the quality of your data, then telemetry management via pipelines can help you
address those needs. At a high level, a pipeline consists of components between the
application and the backend in order to process and route your observability data.
By the end of this chapter, you’ll understand how and when to design a telemetry
pipeline as well as the fundamental building blocks necessary to manage your grow‐
ing observability data needs.
Routing
At its simplest, the primary purpose of a telemetry pipeline is to route data from
where it is generated to different backends, while centrally controlling the configuration
of what telemetry goes where. Statically configuring these routes at the source, so that
each producer sends data directly to a particular data store, is often not desirable,
because you typically want to change destinations, add backends, or adjust what goes
where without touching every application that produces telemetry.
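As a rough illustration, a routing stage can be little more than a centrally managed lookup from telemetry category to destination, so that changing where data lands is a pipeline configuration change rather than a code change in every service. The Event and Backend types below are hypothetical stand-ins, not part of any particular pipeline:

    // Event is a minimal stand-in for a telemetry record (hypothetical).
    type Event struct {
        Category string            // e.g., "traces", "app-logs", "host-logs"
        Fields   map[string]string
    }

    // Backend is any destination the pipeline can write to (hypothetical).
    type Backend interface {
        Write(events []Event) error
    }

    // Router sends each event to the backends configured for its category.
    type Router struct {
        // Central routing table, e.g., "traces" -> tracing store,
        // "app-logs" -> log cluster A, "host-logs" -> log cluster B.
        routes map[string][]Backend
    }

    func (rt *Router) Route(e Event) error {
        backends, ok := rt.routes[e.Category]
        if !ok {
            backends = rt.routes["default"]
        }
        for _, b := range backends {
            if err := b.Write([]Event{e}); err != nil {
                return err
            }
        }
        return nil
    }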
Workload Isolation
Workload isolation allows you to protect the reliability and availability of data sets
in critical scenarios. Partitioning your telemetry data across multiple clusters allows
you to isolate workloads from one another. For example, you may wish to separate
an application that produces a high volume of logs from an application that produces
a very low volume of log data. If the logs of these applications are put in the same
cluster, an expensive query against the high-volume logs can frequently slow cluster
performance, negatively affecting the experience for other users on the same cluster.
For lower-volume logs such as host logs, having a higher retention period for this
data may be desirable, as it may provide historical context. By isolating workloads,
you gain flexibility and reliability.
Capacity Management
Often for capacity planning or cost-control reasons, you may want to assign quotas
for categories of telemetry and enforce them with rate limiting, sampling, or queuing.
Rate limiting
Since telemetry data is often produced in response to natural user requests, the
volume of telemetry from applications tends to follow unpredictable patterns. A telemetry
pipeline can smooth these data spikes for the backend by forwarding telemetry at
no more than a constant rate. If data arrives faster than that rate, the pipeline can
often hold it in memory until the backend can consume it.
If your systems consistently produce data at a higher rate than the backend can
consume, your telemetry pipeline can use rate limits to mitigate impacts. For instance,
you could employ a hard rate limit and drop data that exceeds the limit, or ingest the
excess on a best-effort basis by sampling or queuing it until capacity allows.
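As a rough sketch of what enforcement can look like in code, the snippet below uses Go's golang.org/x/time/rate limiter to forward events that fit within a quota and hand the excess to whatever overflow policy you choose. The Event type, forward, and overflow functions are placeholders, not part of any real pipeline:

    import "golang.org/x/time/rate"

    // Allow up to 5,000 events per second toward the backend, with short bursts.
    var limiter = rate.NewLimiter(rate.Limit(5000), 10000)

    func handleEvent(e Event, forward func(Event) error, overflow func(Event)) error {
        if limiter.Allow() {
            return forward(e) // within the quota: send to the backend
        }
        // Over the quota: apply the overflow policy (drop, sample, or queue).
        overflow(e)
        return nil
    }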
Sampling
As discussed in Chapter 17, your ingestion component can utilize moving average
sampling, progressively increasing sample rates as volume increases to preserve signal
and avoid saturating backends downstream in the pipeline.
Queuing
You can prioritize ingesting recent data over older data to maximize utility to devel‐
opers. This feature is especially useful during log storms in logging systems. Log
storms happen when the system receives more logs than it was designed to handle.
For instance, a large-scale incident like a critical service being down would cause
clients to report a higher volume of errors and would overwhelm the backend. In this
case, prioritizing fresh logs is more important than catching up with old logs, since
fresh logs indicate the current state of the system, whereas older logs tell you the state
of the system at a past time, which becomes less relevant the further you are from
the incident. A backfill operation can tidy up the historical data afterward, when the
system has spare capacity.
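One way to express that preference in code is a bounded buffer that always accepts fresh events and, when full, diverts the oldest entries to a lower-priority backfill queue instead of blocking new arrivals. This is only a sketch of the idea, not how any particular pipeline implements it, and it reuses the hypothetical Event type from the routing sketch:

    // freshFirst holds at most max recent events; overflow goes to a backfill
    // queue that is drained only when the recent queue is empty.
    type freshFirst struct {
        recent   []Event
        backfill []Event
        max      int
    }

    func (q *freshFirst) Add(e Event) {
        if len(q.recent) >= q.max {
            // Evict the oldest recent event to backfill so the newest data
            // keeps flowing during a log storm.
            q.backfill = append(q.backfill, q.recent[0])
            q.recent = q.recent[1:]
        }
        q.recent = append(q.recent, e)
    }

    func (q *freshFirst) Next() (Event, bool) {
        if len(q.recent) > 0 {
            e := q.recent[0]
            q.recent = q.recent[1:]
            return e, true
        }
        if len(q.backfill) > 0 {
            e := q.backfill[0]
            q.backfill = q.backfill[1:]
            return e, true
        }
        return Event{}, false
    }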
For trace data in particular at Slack, one way we ensure data consistency is by using
simple data-filtering operations, such as filtering out low-value spans before they reach
the backend, which increases the overall value of the data set. Other examples of ensuring
quality include techniques like tail sampling, in which only a small subset of the
reported traces is selected for storage in the backend system based on desirable
attributes, such as higher reported latency.
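A sketch of that kind of filter follows: drop spans that carry little investigative value (for example, fast, successful health checks) and always keep spans with errors or high latency. The Span fields and the health-check name are illustrative, not Slack's actual schema:

    import "time"

    // Span is an illustrative, simplified span record.
    type Span struct {
        Name     string
        Duration time.Duration
        Error    bool
    }

    // keepSpan reports whether a span should be forwarded to the backend.
    func keepSpan(s Span) bool {
        if s.Error || s.Duration > 500*time.Millisecond {
            return true // always keep errors and slow requests
        }
        if s.Name == "/healthz" {
            return false // low value: fast, successful health checks
        }
        return true
    }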
In a simple setup, the pipeline is a single chain of a receiver, a buffer, and an exporter.
However, a complex setup can have a chain of receiver → buffer → receiver → buffer
→ exporter, as shown in Figure 18-2.
A receiver or exporter in a pipeline is often responsible for only one of the
possible operations, such as capacity planning, routing, or data transformation.
Table 18-1 shows a sample of operations that can be performed on various types
of telemetry data.
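In code, it can help to think of each stage as a small component with a single job, composed into a chain. The interface names below are our own shorthand for this sketch, not from any particular framework:

    // Receiver ingests raw payloads from producers and turns them into events.
    type Receiver interface {
        Receive() ([]Event, error)
    }

    // Processor performs one operation: routing, filtering, enrichment, and so on.
    type Processor interface {
        Process(events []Event) ([]Event, error)
    }

    // Exporter writes events to a buffer (such as Kafka) or to a backend.
    type Exporter interface {
        Export(events []Event) error
    }

    // runOnce pulls one batch through the chain; a real pipeline would loop,
    // batch, retry, and report its own telemetry.
    func runOnce(r Receiver, procs []Processor, e Exporter) error {
        events, err := r.Receive()
        if err != nil {
            return err
        }
        for _, p := range procs {
            if events, err = p.Process(events); err != nil {
                return err
            }
        }
        return e.Export(events)
    }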
Performance
Since applications can produce data in any format and the nature of the data they
produce can change, keeping the pipeline performant can be a challenge. For exam‐
ple, if an application generates a lot of logs that are expensive to process in a logging
pipeline, the log pipeline becomes slower and needs to be scaled up. Often slowness
in one part of the pipeline may cause issues in other parts of the pipeline—like spiky
loads, which, in turn, can destabilize the entire pipeline.
Correctness
Since the pipeline is made up of multiple components, determining whether the
end-to-end operation of the pipeline is correct can be difficult. For example, in a
complex pipeline, it can be difficult to know whether the data you are writing is
transformed correctly or to ensure that the data being dropped is the only type of
data being dropped. Further, since the data format of the incoming data is unknown,
debugging the issues can be complex. You must, therefore, monitor for errors and
data-quality issues in the pipeline.
Availability
Often the backends or various components of the pipeline can be unreliable. As
long as the software components and sinks are designed to ensure resiliency and
availability, you can withstand disruptions in the pipeline.
Isolation
If logs or metrics from a high-volume customer and a low-volume customer are
colocated in the same cluster, availability issues may occur when the high volume of
logs saturates the cluster. The telemetry pipeline should therefore be set up such that
these streams can be isolated from each other and possibly written to different
backends.
Data Freshness
In addition to being performant, correct, and reliable, a telemetry pipeline should
also operate at or near real time. Often the end-to-end latency between when data is
produced and when it is available for consumption is on the order of seconds, or tens
of seconds in the worst case. However, monitoring the pipeline for data freshness can
be a challenge since you need to have a known data source that produces the data at a
consistent pace.
Host metrics, such as those scraped by Prometheus, can be used because they are
collected at a consistent interval; comparing that cadence with when the data becomes
queryable gives you a freshness measure. For logs or traces, a good, consistent data
source is often not available. In those cases, it can be valuable to add a synthetic
data source.
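A synthetic source can be as simple as a heartbeat event emitted on a fixed schedule and tagged with its creation time; the difference between that timestamp and the time the event becomes queryable is your end-to-end freshness. A sketch follows, where emitLog is a placeholder for however your service writes structured logs:

    import "time"

    // emitHeartbeats writes one synthetic log event per interval. Comparing the
    // embedded emitted_at timestamp with the event's arrival time downstream
    // gives a continuous measure of pipeline freshness.
    func emitHeartbeats(interval time.Duration, emitLog func(fields map[string]string)) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for t := range ticker.C {
            emitLog(map[string]string{
                "source":     "freshness-heartbeat",
                "emitted_at": t.UTC().Format(time.RFC3339Nano),
            })
        }
    }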
Metrics Aggregation
Prometheus is the primary system for metrics at Slack. Our backend was first written
in PHP and later in Hack. Since PHP and Hack use a process-per-request model, the
Prometheus pull model doesn't work well: a scrape sees only host-level context, not
the per-process context of each request. Instead, Slack uses a custom Prometheus library to
emit metrics per request to a local daemon written in Go.
These per request metrics are collected and locally aggregated over a time window by
that daemon process. The daemon process also exposes a metrics endpoint, which is
scraped by our Prometheus servers, as shown in Figure 18-3. This allows us to collect
metrics from our PHP/Hack application servers.
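The core of such a daemon is a thread-safe aggregation of per-request increments into counters that a scraper reads on its own schedule. The following is a simplified sketch of that shape, not Slack's actual library:

    import (
        "fmt"
        "net/http"
        "sync"
    )

    // aggregator accumulates per-request counter increments from many short-lived
    // PHP/Hack processes and exposes the totals for Prometheus to scrape.
    type aggregator struct {
        mu       sync.Mutex
        counters map[string]float64
    }

    func (a *aggregator) Add(name string, value float64) {
        a.mu.Lock()
        defer a.mu.Unlock()
        a.counters[name] += value
    }

    // ServeHTTP renders the aggregated counters in a Prometheus-style text format.
    func (a *aggregator) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
        a.mu.Lock()
        defer a.mu.Unlock()
        for name, value := range a.counters {
            fmt.Fprintf(w, "%s %g\n", name, value)
        }
    }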
Outside of PHP/Hack, Slack also runs applications in Go or Java that expose metrics
endpoints, and their metrics can be scraped by Prometheus directly.
Figure 18-4. Slack telemetry pipeline with receivers, buffers, and exporters for trace data
Figure 18-5. Slack’s tracing infrastructure, with applications in pink (light gray in print),
receivers and exporters in blue (medium gray)
Our internal Java and Go applications use the open source instrumentation libraries
from Zipkin and Jaeger, respectively. To capture the spans from these applications,
Wallace exposes receivers for both types of span data. These receivers, called trace
adapters, translate the reported spans into our SpanEvent format and write them to
Wallace, which in turn forwards them to a Murron receiver that writes the data to
Kafka.
Conclusion
You can get started today by using several off-the-shelf services in conjunction with
one another. You may be tempted to write a new system from scratch, but we
recommend adapting an open source service. Smaller pipelines can run on autopilot,
but as the organization grows, managing a telemetry pipeline becomes a complex
endeavor and introduces several challenges.
As you build out your telemetry management system, try to build for the current
needs of the business and anticipate—but don’t implement for—new needs. For
example, you may want to add compliance features sometime down the road, or you
may want to introduce advanced enrichment or filtering. Keeping the pipeline modular
and following the producer, buffer, processor, and exporter model will keep your
observability function running smoothly while providing value to your business.
Observability often starts within one particular team or business unit in an organiza‐
tion. To spread a culture of observability, teams need support from various stakehold‐
ers across the business.
In this chapter, we’ll start breaking down how that support comes together by laying
out the business case for observability. Some organizations adopt observability
practices in response to dire challenges that cannot be addressed by traditional
approaches. Others may need a more proactive approach to changing
traditional practices. Regardless of where in your observability journey you may be,
this chapter will show you how to make a business case for observability within your
own company.
We start by looking at both the reactive and proactive approaches to instituting
change. We’ll examine nonemergency situations to identify a set of circumstances
that can point to a critical need to adopt observability outside the context of cata‐
strophic service outages. Then we’ll cover the steps needed to support creation of an
observability practice, evaluate various tools, and know when your organization has
achieved a state of observability that is “good enough” to shift your focus to other
initiatives.
A catalytic event, such as a major outage, often creates a now-critical need for
observability. But introducing fundamental change into an organization in reactive,
knee-jerk ways can have unintended consequences. The rush
to fix mission-critical business problems often leads to oversimplified approaches
that rarely lead to useful outcomes.
Consider the case of reactive change introduced as the result of critical service
outages. For example, an organization might perform a root-cause analysis to deter‐
mine why an outage occurred, and the analysis might point to a singular reason. In
mission-critical situations, executives are often tempted to use that reason to drive
simplified remediations that demonstrate the problem has been swiftly dealt with.
When the smoking gun for an outage can be pointed to as the line in the root-cause
analysis that says, “We didn’t have backups,” that can be used to justify demoting the
employee who deleted the important file, engaging consultants to introduce a new
backup strategy, and the executives breathing a sigh of relief once they believe the
appropriate gap has been closed.
While that approach might seem to offer a sense of security, it’s ultimately false.
Why was that one file able to create a cascading system failure? Why was a file that
critical so easily deleted? Could the situation have been better mitigated with more
immutable infrastructure? Any number of approaches in this hypothetical scenario
might better treat the underlying causes rather than the most obvious symptoms. In a
rush to fix problems quickly, often the oversimplified approach is the most tempting
to take.
Another reactive approach in organizations originates from the inability to recognize
dysfunction that no longer has to be tolerated. The most common obsolete dysfunc‐
tion tolerated with traditional tooling is an undue burden on software engineering
and operations teams that prevents them from focusing on delivering innovative
work.
As seen in Chapter 3, teams without observability frequently waste time chasing
down incidents with identical symptoms (and underlying causes). Issues often repeat‐
edly trigger fire drills, and those drills cause stress for engineering teams and the
business. Engineering teams experience alert fatigue that leads to burnout and,
eventually, churn—costing the business lost expertise among staff and the time it
takes to rebuild that expertise. Customers experiencing issues will abandon their
transactions—costing the business revenue and customer loyalty. Being stuck in
this constant firefighting and high-stress mode creates a downward spiral that under‐
mines engineering team confidence when making changes to production, which in
turn creates more fragile systems, which in turn require more time to maintain,
which in turn slows the delivery of new features that provide business value.
Unfortunately, many business leaders accept these hurdles as the normal state of
operations. They introduce processes that they believe help mitigate these problems,
such as change advisory boards or rules prohibiting teams from deploying code during
certain windows. Warning signs that this dysfunction is being needlessly tolerated
include the following:
• Customers discover and report critical bugs in production services long before
they are detected and addressed internally.
• When minor incidents occur, detecting and recovering from them often takes so long
that they escalate into prolonged service outages.
• The backlog of investigation necessary to troubleshoot incidents and bugs con‐
tinues to grow because new problems pile up faster than they can be retrospected
or triaged.
• The amount of time spent on break/fix operational work exceeds the amount of
time your teams spend on delivering new features.
• Customer satisfaction with your services is low because of repeated poor perfor‐
mance that your support teams cannot verify, replicate, or resolve.
• New features are delayed by weeks or months because engineering teams are
dealing with disproportionately large amounts of unexpected work necessary to
figure out how various services are all interacting with one another.
1 You can reference a recap of Forrester Consulting’s Total Economic Impact (TEI) framework findings for
Honeycomb in the blog post “What Is Honeycomb’s ROI? Forrester’s Study on the Benefits of Observability”
by Evelyn Chea.
An initial business case for introducing observability into your systems can be
twofold. First, it provides your teams a way to find individual user issues that are
typically hidden when using traditional monitoring tools, thereby lowering TTD (see
Chapter 5). Second, automating the core analysis loop can dramatically reduce the
time necessary to isolate the correct source of issues, thereby lowering TTR (see
Chapter 8).
Once early gains in these areas are proven, it is easier to garner support for intro‐
ducing more observability throughout your application stack and organization. Fre‐
quently, we see teams initially approach the world of observability from a reactive
state—typically, seeking a better way to detect and resolve issues. Observability can
immediately help in these cases. But second-order benefits should also be measured
and presented when making a business case.
The upstream impact of detecting and resolving issues faster is that it reduces the
amount of unexpected break/fix operational work for your teams. A qualitative
improvement is often felt here by reducing the burden of triaging issues, which
lowers on-call stress. This same ability to detect and resolve issues also leads to
reducing the backlog of application issues, spending less time resolving bugs, and
spending more time creating and delivering new features. Measuring this qualitative
improvement—even just anecdotally—can help you build a business case that observ‐
ability leads to happier and healthier engineering teams, which in turn creates greater
employee retention and satisfaction.
A third-order benefit comes from the ability to understand the performance of
individual user requests and the cause of bottlenecks: teams can quickly understand
how best to optimize their services. More than half of mobile users will abandon
transactions after three seconds of load time.2 In an observable application, you can
measure the rate of successful user transactions and correlate it with gains in service
performance. Another obvious business use case for observability is therefore higher
customer satisfaction and retention.
2 Tammy Everts, “Mobile Load Time and User Abandonment”, Akamai Developer Blog, September 9, 2016.
With a blameless culture in practice, business leaders should also ensure that a clear
scope of work exists when introducing observability (for example, happening entirely
within one introductory team or line of business). Baseline performance measures
for TTD and TTR can be used as a benchmark to measure improvement within that
scope. The infrastructure and platform work required should be identified, allocated,
and budgeted in support of this effort. Only then should the technical work of
instrumentation and analysis of that team’s software begin.
Instrumentation
The first step to consider is how your applications will emit telemetry data. Tradi‐
tionally, vendor-specific agents and instrumentation libraries were your only choice,
and those choices brought with them a large degree of vendor lock-in. Currently,
for instrumentation of both frameworks and application code, OpenTelemetry is
the emerging standard (see Chapter 7). It supports every open source metric and
trace analytics platform, and is supported by almost every commercial vendor in the
space. There is no longer a reason to lock into one specific vendor’s instrumentation
framework, nor to roll your own agents and libraries.
OTel allows you to configure your instrumentation to send data to the analytics
tool of your choice. By using a common standard, it’s possible to easily demo the
capabilities of any analytics tool by simply sending your instrumentation data to
multiple backends at the same time.
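For example, with the OpenTelemetry Go SDK you can attach more than one exporter to the same tracer provider while you evaluate tools side by side; in larger setups, the same fan-out is usually done in an OpenTelemetry Collector instead. A minimal sketch with placeholder endpoints:

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func initDualExport(ctx context.Context) (*sdktrace.TracerProvider, error) {
        // Placeholder endpoints: point each exporter at a different analytics tool.
        a, err := otlptracehttp.New(ctx, otlptracehttp.WithEndpoint("tool-a.example.com:443"))
        if err != nil {
            return nil, err
        }
        b, err := otlptracehttp.New(ctx, otlptracehttp.WithEndpoint("tool-b.example.com:443"))
        if err != nil {
            return nil, err
        }
        tp := sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(a), // each WithBatcher adds its own span processor
            sdktrace.WithBatcher(b),
        )
        otel.SetTracerProvider(tp)
        return tp, nil
    }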
When considering the data that your team must analyze, it’s an oversimplification to
simply break observability into categories like metrics, logging, and tracing. While
those can be valid categories of observability data, achieving observability requires
those data types to interact in a way that gives your teams an appropriate view of their
systems. While messaging that describes observability as three pillars is useful as a
marketing headline, it misses the big picture. At this point, it is more useful to instead
think about which data type or types are best suited to your use case, and which can
be generated on demand from the others.
Consider how the open source software you choose is licensed and
how that impacts your usage. For example, both Elasticsearch and
Grafana have recently made licensing changes you should consider
before using these tools.
Having so many options is great. But you must also carefully consider and be wary of
the operational load incurred by running your own data storage cluster. For example,
the ELK stack is popular because it fulfills needs in the log management and analytics
space. But end users frequently report that their maintenance and care of their
ELK cluster gobbles up systems engineering time and grows quickly in associated
management and infrastructure costs. As a result, you’ll find a competitive market for
managed open source telemetry data storage (e.g., ELK as a service).
When considering data storage, we also caution against finding separate solutions
for each category (or pillar) of observability data you need. Similarly, attempting
to bolt modern observability functionality onto a traditional monitoring system is
likely to be fraught with peril. Since observability arises from the way your engineers
interact with your data to answer questions, having one cohesive solution that works
seamlessly is better than maintaining three or four separate systems. Using disjointed
systems for analysis places the burden of carrying context and translation between
those systems on engineers and creates a poor usability and troubleshooting experi‐
ence. For more details on how approaches can coexist, refer to Chapter 9.
Conclusion
The need for observability is recognized within teams for a variety of reasons.
Whether that need arises reactively in response to a critical outage, or proactively
by realizing how its absence is stifling innovation on your teams, it’s critical to create
a business case in support of your observability initiative.
Similar to security and testability, observability must be approached as an ongoing
practice. Teams practicing observability must make a habit of ensuring that any
changes to code are bundled with proper instrumentation, just as they’re bundled
with tests. Code reviews should ensure that the instrumentation for new code
achieves proper observability standards, just as they ensure it also meets security
standards. Observability requires ongoing care and maintenance, but you’ll know
that observability has been achieved well enough by looking for the cultural behaviors
and key results outlined in this chapter.
In the next chapter, we’ll look at how engineering teams can create alliances with
other internal teams to help accelerate the adoption of observability culture.
CHAPTER 20
Observability’s Stakeholders and Allies
Most of this book has focused on introducing the practice of observability to software
engineering teams. But when it comes to organization-wide adoption, engineering
teams cannot, and should not, go forward alone. Once you’ve instrumented rich wide
events, your telemetry data set contains a treasure trove of information about your
services’ behavior in the marketplace.
Observability’s knack for providing fast answers to any arbitrary question means it
can also fill knowledge gaps for various nonengineering stakeholders across your
organization. A successful tactic for spreading a culture of observability is to build
allies in engineering-adjacent teams by helping them address those gaps. In this chap‐
ter, you’ll learn about engineering-adjacent use cases for observability, which teams
are likely adoption allies, and how helping them can help you build momentum
toward making observability a core part of organizational practices.
Software delivers value only once it is in the hands of real users. With observability,
you can understand, at any given point in time, your customers’ experience of using
your software in the real world. It’s
everyone’s job to understand and improve that experience.
The need for observability is recognized by engineering teams for a variety of rea‐
sons. Functional gaps may exist for quite some time before they’re recognized, and a
catalytic event may spur the need for change—often, a critical outage. Or perhaps the
need is recognized more proactively, such as realizing that the constant firefighting
that comes with chasing elusive bugs is stifling a development team’s ability to
innovate. In either case, a supporting business case exists that drives an observability
adoption initiative.
Similarly, when it comes to observability adoption for nonengineering teams, you
must ask yourself which business cases it can support. Which business cases exist
for understanding, at any given point in time, your customers’ experience using
your software in the real world? Who in your organization needs to understand and
improve customer experience?
Let’s be clear: not every team will specialize in observability. Even among engineering
teams that do specialize in it, some will do far more coding and instrumentation than
others. But almost everyone in your company has a stake in being able to query your
observability data to analyze details about the current state of production.
Because observability allows you to arbitrarily slice and dice data across various
dimensions, you can use it to understand the behaviors of individual users, groups
of users, or the entire system. Those views can be compared, contrasted, or further
mined to answer any combination of questions that are extremely relevant to nonen‐
gineering business units in your organization.
Some business use cases that are supported by observability might include:
• Understanding the adoption of new features. Which customers are using your
newly shipped features? Does that match the list of customers who expressed
interest in it? In what ways do usage patterns of active feature users differ from
those who tried it but later abandoned the experience?
• Finding successful product usage trends for new customers. Does the sales team
understand which combination of features seems to resonate with prospects who
go on to become customers? Do you understand the product usage commonali‐
ties in users that failed to activate your product? Do those point to friction that
needs to be eroded somehow?
• Accurately relaying service availability information to both customers and inter‐
nal support teams via up-to-date service status pages. Can you provide templated
queries so that support teams can self-serve when users report outages?
Accuracy
For observability workloads, if you have to choose, it’s better to return results that
are fast as opposed to perfect (as long as they are very close to correct). For iterative
investigations, you would almost always rather get a result that scans 99.5% of the
events in one second than a result that scans 100% in one minute. This is a real and
common trade-off that must be made in massively parallelized distributed systems
across imperfect and flaky networks (see Chapter 16).
Also (as covered in Chapter 17), some form of dynamic sampling is often employed
to achieve observability at scale. Both of these approaches trade a slight bit of accu‐
racy for massive gains in performance. When it comes to BI tools and business data
warehouses, both sampling and a “close to right” approach are typically verboten.
When it comes to billing, for example, you will always want the accurate result no
matter how long it takes.
Recency
The questions you answer with observability tools have a strong recency bias, and
the most important data is often the freshest. A delay of more than a few seconds
between when something happened in production and when you can query for those
results is unacceptable, especially when you’re dealing with an incident.
As data fades into months past, you tend to care about historical events more in
terms of aggregates and trends, rather than granular individual requests. And when
you do care about specific requests, taking a bit longer to find them is acceptable.
But when data is fresh, you need query results to be raw, rich, and up-to-the-second
current.
Structure
Observability data is built from arbitrarily wide and structured data blobs: one event
per request per service (or per polling interval in long-running batch processes). For
observability needs (to answer any question about what’s happening at any time),
you need to append as many event details as possible, to provide as much context as
needed, so investigators can spy something that might be relevant in the future. Often
those details quickly change and evolve as teams learn what data may be helpful.
Defining a data schema up front would defeat that purpose.
Therefore, with observability workloads, schemas must be inferred after the fact or
changed on the fly (just start sending a new dimension or stop sending it at any
time). Indexes are similarly unhelpful (see Chapter 16).
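In practice, this is why instrumentation for observability tends to treat an event as an open map of fields: any code path can attach a new dimension at any time, and the storage layer accepts it without a schema migration. A sketch of that shape, with illustrative field names:

    // WideEvent is one arbitrarily wide, structured record per request.
    type WideEvent map[string]interface{}

    func recordCheckout(userID string, cartSize int, latencyMillis float64) WideEvent {
        e := WideEvent{
            "service":    "checkout",
            "user_id":    userID,
            "cart_size":  cartSize,
            "latency_ms": latencyMillis,
        }
        // A new dimension can be added wherever it's discovered to be useful;
        // no predefined schema has to change first.
        e["feature_flag.new_payment_flow"] = true
        return e
    }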
In comparison, BI tools often collect and process large amounts of unstructured data
into structured, queryable form. BI data warehouses would be an ungovernable mess
without structures and predefined schemas. You need consistent schemas in order
to perform any kind of useful analysis over time. And BI workloads tend to ask
similar questions in repeatable ways to power things like dashboards. BI data can be
optimized with indexes, compound indexes, summaries, etc.
Because BI data warehouses are designed to grow forever, it is important that
they have predefined schemas and grow at a predictable rate. Observability data is
designed for rapid feedback loops and flexibility: it is typically most important under
times of duress, when predictability is far less important than immediacy.
Time Windows
Observability and BI tools both have the concept of a session, or trace. But observa‐
bility tends to be limited to a time span measured in seconds, or minutes at most.
BI tools can handle long-term journeys, or traces that can take days or weeks to
complete. That type of trace longevity is not a use case typically supported by observ‐
ability tools. With observability tools, longer-running processes (like import/export
jobs or queues) are typically handled with a polling process, and not with a single
trace.
CHAPTER 21
An Observability Maturity Model
A maturity model reflects the state of practice at the time when the model was created,
often with the biases of its authors baked into its many assumptions. Objectives shift,
priorities change, better approaches are discovered,
and—more to the point—each approach is unique to individual organizations and
cannot be universally scored.
When looking at a maturity model, it’s important to always remember that no one-
size-fits-all model applies to every organization. Maturity models can, however, be
useful as starting points against which you can critically and methodically weigh your
own needs and desired outcomes to create an approach that’s right for you. Maturity
models can help you identify and quantify tangible and measurable objectives that are
useful in driving long-term initiatives. It is critical that you create hypotheses to test
the assumptions of any maturity model within your own organization and evaluate
which paths are and aren’t viable given your particular constraints. Those hypotheses,
and the maturity model itself, should be continuously improved over time as more
data becomes available.
Measuring technical outcomes can, at first approximation, take the form of examin‐
ing the amount of time it takes to restore service and the number of people who
become involved when the system experiences a failure. For example, the DORA
2018 Accelerate State of DevOps Report defines elite performers as those whose
MTTR is less than one hour, and low performers as those with a MTTR that is
between one week and one month.
Emergency response is a necessary part of running a scalable, reliable service. But
emergency response may have different meanings to different teams. One team might
consider a satisfactory emergency response to mean “power cycle the box,” while
another might interpret that to mean “understand exactly how the auto-remediation
that restores redundancy in data striped across multiple disks broke, and mitigate
future risk.” There are three distinct areas to measure: the amount of time it takes to
3 Betsy B. Beyer et al., “Invent More, Toil Less”, ;login: 41, no. 3 (Fall 2016).
4 Darragh Curran, “Shipping Is Your Company’s Heartbeat”, Intercom, last modified August 18, 2021.
Conclusion
The Observability Maturity Model provides a starting point against which your orga‐
nization can measure its desired outcomes and create its own customized adoption
path. The key capabilities driving high-performing teams that have matured their
observability practice are measured along the axes described throughout this chapter.
In this book, we’ve looked at observability for software systems from many angles.
We’ve covered what observability is and how that concept operates when adapted
for software systems—from its functional requirements, to functional outcomes, to
sociotechnical practices that must change to support its adoption.
To review, this is how we defined observability at the start of this book:
Observability for software systems is a measure of how well you can understand and
explain any state your system can get into, no matter how novel or bizarre. You must
be able to comparatively debug that bizarre or novel state across all dimensions of
system state data, and combinations of dimensions, in an ad hoc iterative investigation,
without being required to define or predict those debugging needs in advance. If you
can understand any bizarre or novel state without needing to ship new code, you have
observability.
Now that we’ve covered the many concepts and practices intertwined with observabil‐
ity in this book, we can tighten that definition a bit:
If you can understand any state of your software system, no matter how novel or
bizarre, by arbitrarily slicing and dicing high-cardinality and high-dimensionality
telemetry data into any view you need, and use the core analysis loop to comparatively
debug and quickly isolate the correct source of issues, without being required to define
or predict those debugging needs in advance, then you have observability.
When we started writing this book, few people were familiar with the term
“observability.” Nobody really understood what we meant whenever we talked about
the cardinality of data or its dimensionality. We would frequently and passionately
need to argue that the so-called three pillars view of observability was only about the
data types, and that it completely ignores the analysis and practices needed to gain
new insights.
As Cindy Sridharan states in the Foreword, the rise in prominence of the term
“observability” has also led (inevitably and unfortunately) to it being used inter‐
changeably with an adjacent concept: monitoring. We would frequently need to
explain that “observability” is not a synonym for “monitoring,” or “telemetry,” or even
“visibility.”
Back then, OpenTelemetry was in its infancy, and that was yet another thing to
explain: How was it different from (or inherited from) OpenTracing and OpenCensus?
Why would you use a new open standard that required a bit more setup work
instead of your vendor’s more mature agent that worked right away? Why should
anyone care?
Now, many people we speak to don’t need those explanations. There’s more agree‐
ment on how observability is different from monitoring. More people understand the
basic concepts and that data misses the point without analysis. They also understand
the benefits and the so-called promised land of observability, because they hear about
the results from many of their peers. What many of the people we speak to today
are looking for is more sophisticated analyses and low-level, specific guidance on
how to get from where they are today to a place where they’re successfully practicing
observability.
Second, this book initially started with a much shorter list of chapters. It had more
basic material and a smaller scope. As we started to better understand which con‐
cerns were common and which had successful emergent patterns, we added more
depth and detail. As we encountered more and more organizations using observa‐
bility at massive scale, we were able to learn comparatively and incorporate those
lessons by inviting direct participation in this book (we’re looking at you, Slack!).
Third, this book has been a collaborative effort with several reviewers, including
those who work for our competitors. We’ve revised our takes, incorporated broader
viewpoints, and revisited concepts throughout the authoring process to ensure that
we’re reflecting an inclusive state of the art in the world of observability. Although
we (the authors of this book) all work for Honeycomb, our goal has always been to
write an objective and inclusive book detailing how observability works in practice,
regardless of specific tool choices. We thank our reviewers for keeping us honest and
helping us develop a stronger narrative.
Based on your feedback, we added more content around the sociotechnical challenges
in adopting observability. Like any technological shift, adopting observability also
requires changing how people work together, and those cultural changes take deliberate,
ongoing effort.
Additional Resources
The following are some resources we recommend:
Site Reliability Engineering by Betsy Beyer et al. (O’Reilly)
We’ve referenced this book a few times within our own. Also known as “the
Google SRE book,” this book details how Google implemented DevOps practices
within its SRE teams. This book details several concepts and practices that are
adjacent to using observability practices when managing production systems. It
focuses on practices that make production software systems more scalable, relia‐
ble, and efficient. The book introduces SRE practices and details how they are
different from conventional industry approaches. It explores both the theory and
practice of building and operating large-scale distributed systems. It also covers
management practices that can help guide your own SRE adoption initiatives.
Many of the techniques described in this book are most valuable when managing
distributed systems. If you haven’t started down the path of using SRE principles
within your own organization, this book will help you establish practices that will
be complemented by the information you’ve learned in our book.
Implementing Service Level Objectives by Alex Hidalgo (O’Reilly)
This book provides an in-depth exploration of SLOs, which our book only
briefly touches on (see Chapter 12 and Chapter 13). Hidalgo is a site reliability
engineer, an expert at all things related to SLOs, and a friend to Honeycomb.
His book outlines many more concepts, philosophies, and definitions relevant
to the SLO world to introduce fundamentals you need in order to take further
steps. He covers the implementation of SLOs in great detail with mathematical
and statistical models, which are helpful to further understand why observability
data is so uniquely suited to SLOs (the basis of Chapter 13). His book also covers
cultural practices that must shift as a result of adopting SLOs and that further
illustrate some of the concepts introduced in our book.
1 Vera Reynolds, for example, provides the tutorial “OpenTelemetry (OTel) Is Key to Avoiding Vendor Lock-in”
on sending trace data to Honeycomb and New Relic by using OTel.
About the Authors
Charity Majors is the cofounder and CTO at Honeycomb, and the coauthor of
Database Reliability Engineering. Before that, she worked as a systems engineer and
engineering manager for companies like Parse, Facebook, and Linden Lab.
Liz Fong-Jones is a developer advocate and site reliability engineer (SRE) with more
than 17 years of experience. She is an advocate at Honeycomb for the SRE and
observability communities.
George Miranda is a former systems engineer turned product marketer and GTM
leader at Honeycomb. Previously, he spent more than 15 years building large-scale
distributed systems in the finance and video game industries.
Colophon
The animal on the cover of Observability Engineering is a maned wolf (Chrysocyon
brachyurus). Maned wolves are the largest canids in South America and can be found
in Argentina, Brazil, Paraguay, Bolivia, and parts of Peru. Their habitat includes
the Cerrado biome, which is made up of wet and dry forests, grasslands, savannas,
marshes, and wetlands. Despite the name, maned wolves are not actually wolves but a
separate species entirely.
Maned wolves have narrow bodies, large ears, and long black legs that allow them to
see above tall grasses as they run. They stand nearly 3 feet tall at the shoulder and
weigh around 50 pounds. Much of their bodies are covered in a mix of black and
long, golden-red hairs that stand straight up when danger is near. Unlike gray wolves
and most other canid species that form packs, the maned wolf is solitary and often
hunts alone. They are omnivorous and crepuscular, venturing out during dusk and
dawn to prey on small mammals, rabbits, birds, and insects, and to scavenge for fruit
and vegetables like lobeira, a small berry whose name means “fruit of the wolf.”
Maned wolves live in monogamous pairs sharing a territory of 10 miles during the
breeding season, which lasts from April to June in the wild. In captivity, maned
wolves give birth to litters of one to five pups. Both parents have been known to
groom, feed, and defend their young in captivity, but they are rarely seen with their
pups in the wild. Pups develop quickly; they are usually considered fully grown and
ready to leave their parents’ territory after one year. The maned wolf is classified as
“near threatened” by the IUCN, mostly due to loss of habitat. Many of the animals on
O’Reilly covers are endangered; all of them are important to the world.
The cover illustration is by Karen Montgomery, based on a black and white engraving
from Braukhaus Lexicon. The cover fonts are Gilroy Semibold and Guardian Sans.
The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed;
and the code font is Dalton Maag’s Ubuntu Mono.