Release It!: Design and Deploy Production-Ready Software

A single dramatic software failure can cost a company millions of dollars - but can be avoided with simple changes to design and architecture. This new edition of the best-selling industry standard shows you how to create systems that run longer, with fewer failures, and recover better when bad things happen. New coverage includes DevOps, microservices, and cloud-native architecture. Stability antipatterns have grown to include systemic problems in large-scale systems. This is a must-have pragmatic guide to engineering for production systems.

If you're a software developer, and you don't want to get alerts every night for the rest of your life, help is here. With a combination of case studies about huge losses - lost revenue, lost reputation, lost time, lost opportunity - and practical, down-to-earth advice that was all gained through painful experience, this book helps you avoid the pitfalls that cost companies millions of dollars in downtime and reputation. Eighty percent of project life-cycle cost is in production, yet few books address this topic.

This updated edition deals with the production of today's systems - larger, more complex, and heavily virtualized - and is the first book to cover chaos engineering, the discipline of applying randomness and deliberate stress to reveal systematic problems. Build systems that survive the real world, avoid downtime, implement zero-downtime upgrades and continuous delivery, and make cloud-native applications resilient. Examine ways to architect, design, and build software - particularly distributed systems - that stands up to the typhoon winds of a flash mob, a Slashdotting, or a link on Reddit. Take a hard look at software that failed the test and find ways to make sure your software survives.

To skip the pain and get the experience...get this book.

356 pages, Paperback

First published March 30, 2007


About the author

Michael T. Nygard

3 books, 51 followers

Ratings & Reviews

Community Reviews

5 stars: 1,477 (47%)
4 stars: 1,122 (35%)
3 stars: 439 (13%)
2 stars: 75 (2%)
1 star: 26 (<1%)
Displaying 1 - 30 of 255 reviews

Rod Hilton
151 reviews, 3,120 followers
September 26, 2017
This remains one of the most important books software engineers can read. The second edition is even better than the first, updated to fix a lot of the "outdated" criticisms the first book gets, incorporating the modern DevOps movement, microservices, and modern technologies used in software engineering.

I really just can't say enough about this book. It's required reading. If you're responsible for code that runs on networked production systems, failing to read this book should be a fireable offense. Skipping "Release It!" is professional negligence. Stop what you're doing and read this before shipping another line of code.

Release It! is all about how to build cynical software, and once you start down that path you find that you can no longer think any other way. This book changes you and your career, it's just phenomenal. Even if you think you know everything in it because the patterns and practices it describes have become widespread, it's still worth reading.

The second edition fixes - or at least somewhat improves - every minor complaint I had about the first edition, and reorganizes the information, adds a lot of great new sections, and removes outdated cruft. It's superior to the first edition in every way, and that was a 5-star book for me.

Rod Hilton
151 reviews, 3,120 followers
September 26, 2017
You're wasting time reading this review, you could be reading this book instead.

Release It! is one of the most important books I think programmers can read, easily as important as the oft-cited classics like The Pragmatic Programmer or the GoF book. Release It! isn't about writing super-spiffy code, or object-oriented design, but it should drastically affect how professional programmers write their code. It focuses on what engineers need to do to get their software into a state where it can actually be deployed safely in a production environment. It covers patterns and antipatterns to support (or subvert) stability as well as capacity, and the section of the book covering this is simply excellent. But then it goes beyond that to also discuss operational enablement. Even if you're not into DevOps, and don't really want to be involved in DevOps work, this book gives you the tools and tips to handle the part of DevOps that is the purview of pure developers.

Nobody who writes production enterprise software should write another line of code until they read this book. I honestly can't give it enough of a glowing endorsement. Like any other patterns book, a great deal of it will be familiar to people who have been in the industry for a while, and have come up with (or encountered) its principles on their own. But I guarantee, if there's a single thing in the book you haven't seen before, it'll be worth reading the entire book, pretty much every section is gold.

My only complaint about it is that Michael Nygard has a tendency to go on randomish tangents, especially when talking about "case studies," which generally come off as the author trying too hard to convince the reader he knows what he's talking about. A lot of these stories are about things he personally encountered in his career, and how he fixed them, and each one has a weird arrogant quality, it's a little offputting to be honest, like Nygard is almost bragging. I get that these sections underscore the value of the ideas in the book, but there's something about how they're written that comes off boastful, or irritating in a way I can't quite put into words. What's more, the book STARTS with a particularly long one of these kinds of stories, making it kind of tough to get into the book. I actually tried to read it years ago and gave up on it very early. Only recently did I decide to revisit it and keep pushing forward (based on a co-worker's recommendation) and I'm extremely glad I did.

I highly, highly recommend picking up this book and reading it cover to cover. It's a struggle at first due to the unfortunate decision to start it off with one of the most annoying sections of the book, but I implore you to power through it and keep reading. It's worth it.

21 reviews, 8 followers
June 5, 2010
I need to start by saying that this is one of the best technical books I have ever read. To me, it's easily as enjoyable and useful as Code Complete, The Pragmatic Programmer, or The Mythical Man Month. If you're a sysadmin, an architect, or a developer that works with medium-to-large-sized systems, then do the following:

1. Stop reading this post
2. Order this book from your library or buy it from The Pragmatic Programmer's web site
3. Owe me a pint :D


What The Book Is Really About

Actually, there is one thing that I don't like about this book, but it really has nothing to do with the book. The description of this book on the Pragmatic Programmer's web site sucks. It's vague, and it really gives the potential reader a tiny amount of insight into the book's contents.

What it should have said is that this book contains *tons* of great information on designing, deploying, maintaining and *improving* medium-to-large-sized IT systems. It's filled with patterns, anti-patterns, and general best practices that should be part of the shared lexicon of every developer, administrator, and system architect. Also, it does a good job of giving you enough information to be useful without boring you to death. And finally, it's written very well and is a joy to read.

The Highlights

Thread Dumps & Garbage Collection Tuning

The internals of the Java Virtual Machine (JVM) have been a black box to me for the majority of my career in IT. Thankfully, this book has provided excellent examples of how you can troubleshoot and improve your system using tools that interrogate and manipulate a JVM at runtime.

For me, this was the most interesting and useful part of the book, and I am looking forward to seeing what can be gained by tuning and "poking at" the JVMs that are in the system that I maintain.

Patterns and Anti-Patterns

It's great to finally find a book that codifies some patterns that administrators and architects can use.

Transparency

I thought that I knew a lot about monitoring and transparency before reading this book, but now I know better. I especially like the concept of a unified "OpsDB", and I am eager to build something like this myself for the system that I maintain.

Integration Point Risks

I always knew that integration points (e.g. data feeds, databases, LDAP providers, etc.) added risk to your system, but the author does a great job calculating the actual risk. Also, he shows you many ways in which you can avoid brittle integration points.

Caveats

I have one warning about this book, but it's half-hearted. This book is what I would call Java-centric. All of the case studies involve systems that are written in Java, and some of the sections will only apply directly to you if you are working with Java-based software.

But does that mean that you should avoid this book if you are working with Ruby, PHP, or .Net-based software? Absolutely not. Even though there are a few small sections of the book that won't directly apply to your line of work, most of them will apply in an indirect way, regardless of your platform. And the other 94% of the book will directly apply to medium-to-large systems of every stripe.

Emre Sevinç
165 reviews, 379 followers
June 24, 2020
If you're in the business of designing, building, and operating complex and distributed software systems that serve online customers via web and mobile interfaces, systems that span many different technological components and have strict operational constraints, systems that might be stressed because of unpredictable and sometimes massive spikes in online requests, then this book is very good and highly recommended: it presents a no-nonsense overall approach.

But make no mistake, this is not a programming book that tells you how to write code or follow step-by-step recipes. People who appreciate this book probably already have a lot of experience and have been bitten by at least one of the horror stories. In other words, the author assumes that the reader is working at a higher level and has bigger concerns than the intricacies of some particular piece of software code.

One of my favorite chapters is Chapter 4, Stability Antipatterns. Why do I like it? Because instead of first writing "you should do this!", the author first shows how massively online, consumer-facing complex software and data systems can grind to a halt, by presenting various problems that he's seen happening again and again. This, in turn, motivates the following chapter, "Stability Patterns", very well.

Most of the remaining chapters are also very good, making it difficult for me to single out any one of them! Moreover, I highly appreciated the author's perspective, which stays away from the hype: the hard-earned lessons he compiled from decades-long experience are presented in such a manner that no matter if you're building on a public cloud or in a private data center, no matter if your tech stack is based on .NET & C#, Java, or another popular system, most of the patterns will apply most of the time to most of the aspects you and your team are or will be working on.

I consider this second edition of "Release It!: Design and Deploy Production-Ready Software" a valuable addition to the library of experienced software and data architects, CTOs, people building complex cloud-based systems, and people working on antifragile operational and mission-critical systems. There are of course other very good books that focus on particular aspects on which this book spends only a single chapter, but that's its strong point: it ties a lot of related topics together in a logically consistent and pedagogically motivating approach, making it worth its weight in gold.

Long story short, I don't know you, but as someone working on design of such data systems, I've already marked a lot of pages, highlighted a lot of lines, and know that I'll be returning to this book time and again.

Sebastian Gebski
1,077 reviews, 1,104 followers
May 21, 2014
To be honest, I've got something different than expected. What I was looking for was another 'Continuous Delivery' - a book about modern practices of shortening the delivery cycle, filled with examples / cases / ideas.

What this book is actually about is the overall quality of software products (looked at from various perspectives) and how different types of flaws impact the product / service in the end. As you can see, this is a far wider (& more generic) topic - if this is what you're looking for, you'll be very happy as the book is quite decently written. I've enjoyed it as well, even if I was looking for something else.

Trisha Gee
Author of 9 books, 50 followers
July 3, 2023
An absolute Must-Read. This second edition has been updated to include problems, patterns, and solutions that weren't around when the first one was written.

dantelk
165 reviews, 14 followers
August 7, 2024
Easily five stars. The book was so gripping that I almost wished it didn't end. And I can't make the same comment for many "literature" books!

The book may seem a bit abstract (strictly no code); however, the matters it deals with are issues which software developers face every day. In fact, while I was reading it, it gave me some insight into a migration issue which was keeping my brain busy during the week's iteration. That is how relevant this book is to the daily work of engineers.

Some topics of the book were already somewhat known or familiar to me. However, Nygard did a terrific job of knitting my existing knowledge together. Also, I realised how poor my knowledge of the network layer is, and decided to read more on that topic.

The book is easy to read while commuting on Marmaray, which is a bonus for me kehkeh!

Recommended to all developers of the industry, junior or experienced...

Ioana Balas
738 reviews, 80 followers
February 2, 2021
While it may not be exactly about how you release software, it definitely encompasses it. This book will teach you about operation, scale and extending and future proofing your software through hysterical anecdotes and hair-pulling war stories, and you will be a better professional for it.

I was initially a little taken aback because I was expecting the book to have more details about how you prep for a release, how you come up with a plan, your checklist if you will. But it is more about operating gotchas, what goes beyond the actual plan. It is for sure a technical book, the author deep dives into incidents he has experienced in terms of how load balancers might work, how exceptions may be triggered, and how threads may block each other, versioning, he doesn't steer away from any of the nasty detail. So be prepared to put in the time and the reading focus.

The chapters focus either on different stages in your prep pipeline or on different levels of the stack, depending on the topic at hand. Some of my notes during the read:
- Load testing including with noise such as not waiting for requests to receive responses;
- Minimise blast radius at all costs - be cynical about how your component communicates with others;
- Fast fails are superior to slow or blocked calls;
- Once established, a TCP connection can exist for days without being used, as long as both ends have that socket state in memory;
- A bogon is a packet that got routed inefficiently and arrives late, out of sequence, and after the connection is closed;
- Ways of versioning: by URL, by the 'accept' in the header;
- All about Circuit Breakers;
- Taking down an instance is the easiest thing for a Chaos Monkey to do;
- Physical redundancy is the most common form of bulkheads;
- Load balancers as shock absorbers;
- Queues must be finite for responses to be finite;
- Servers that require a warm cache or loading reference data are not suitable for containers - opt for short initialisation sequences or startup for these.
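
Two of the notes above, failing fast and keeping queues finite, can be illustrated together. This is only a minimal sketch, not code from the book: the `work` queue, its capacity of 2, and the `submit` helper are all made up for the illustration.

```python
import queue

# Hypothetical work queue with a hard capacity limit (the limit of 2 is arbitrary).
work = queue.Queue(maxsize=2)


def submit(job):
    """Try to enqueue a job; reject immediately instead of blocking when full."""
    try:
        work.put_nowait(job)
        return True
    except queue.Full:
        # Fail fast: the caller learns right away that the system is saturated.
        return False


# Four submissions against a capacity of two: the last two are rejected.
results = [submit(n) for n in range(4)]
```

Because the queue is bounded, the time a job can spend waiting in it is bounded too, and a rejected caller can back off or report the overload instead of hanging.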

Michael Koltsov
105 reviews, 63 followers
May 18, 2017
There's a relatively short list of books I would like to keep on my desk. Most often those books are references and a composition of famous quotes. After I've read this chap I'd like to have it on my work desk at any moment.

This book is a perfect mix of lots of useful technical insights, practices and recommendations drawn from the author's hard-earned experience, combined with some of the soft skills you need to make your software and its maintenance (which, as the author states, costs more than the initial 1.0 version) as smooth as possible, with as much uninterrupted sleep as you could possibly get.

The book is definitely outdated; some of the references to particular technologies look odd and obvious (if not even funny). Nevertheless, I will put this book in one row with the "SRE book" and "The Phoenix Project", as it combines them both.

My score is 5/5

Lino
177 reviews, 6 followers
February 12, 2018
It's an overview of how systems break in production and how to avoid it. The author manages to take a pretty dry subject and produce a book that is very easy to read.

Now, it's pretty dated in a few places. It's from 2007, when the term DevOps wasn't even widely used yet. At the same time it's interesting to see how a lot of it still holds.

There are too many references to JVM-specific issues. I don't think this ruins the experience for readers working outside the JVM, but it's strange seeing casual references (garbage collection tuning, permgen size, JSP, JMX, etc) in several places while the book title/subtitle/cover/description say nothing about it. Perhaps that's from a time when coding for enterprise meant exclusively Java.

Simon Eskildsen
215 reviews, 1,096 followers
April 6, 2015
This book is an incredible introduction to creating and maintaining resilient web applications in realistic, chaotic environments. This book has changed how I approach development more than any other. Every developer with something in production should read it.

Liviu Costea
25 reviews, 2 followers
June 19, 2021
I really enjoyed the book and I think every developer should read it. For me, an SRE working on continuous deployments, many things are already being put into practice. If I had read it 3 years ago (it's from 2018), I would definitely have given it 5 stars. The idea of the book is to design your code/application/services for production usage and not just to pass the standard QA tests. As one of my colleagues used to say: "production says it all".
April 18, 2019
A stellar career in software engineering with all its hard-earned lessons packed into a single, easy-to-digest book. Release It! is an essentially practical book that stems from the author's perceived lack of focus on developing software so that it runs in a production environment.

What would these kinds of systems, with a focus on real-world use, look like? He starts by outlining stability anti-patterns. These are bad practices often found in the design of systems which render them more fragile and, in consequence, leave operators with less sleep. I divided these into three categories: coding deficiencies, systemic effects and spontaneous effects. On the other hand, there are also stability patterns, schemas we should consider implementing in order to counter anti-patterns and make our systems more robust: for example, adding timeouts to our calls or using separating middleware to reduce coupling and complexity. These I classified into: self-preservation, systemic cooperation and system management.

Subsequent chapters touch on important aspects of a system, for example, analyzing best practices at different layers of abstraction, from the bare metal and bits up to the end-user-facing application. They also cover how to make systems evolvable, using chaos as an ally in increasing robustness, and continuous deployment with its significance in guaranteeing well-functioning systems and happy customers.

All in all, this is a great book that should be read by all software engineers dealing with complex systems, especially if you're just starting your career and have started taking on more responsibility. I promise you'll find much of value in these pages.

Sergey Shishkin
159 reviews, 45 followers
January 2, 2015
This book is a battle-proven must-read for any software engineer. Even 7 years after the book was published, pretty much every piece of advice in it remains valid and relevant. The idea of the operations database has found market confirmation in products like Splunk. The Circuit Breaker pattern has since been implemented numerous times in various OSS libraries.

A seasoned developer would probably have learned some of the advice the hard way. Nonetheless I've picked up a lot of wisdom from the book, liked Michael's storytelling and appreciated his very broad perspective.

Author of 3 books, 3 followers
August 8, 2021
I started reading the first edition some time in 2014 when using circuit breakers started to become trendy. Even though due to lack of time I only managed to read the first part it was clear to me that this is an important book, giving a lot of practical advice on building systems for production.
Fast forward to 2020 and I am still working on systems that suffer from problems the book has solutions for. And there's a second edition.

Besides the well known stability patterns like circuit breakers and bulkheads the book will give you a lot of foundational knowledge about running systems in production, from processes to network to security, also covering more process oriented questions like deployments or handling security.

The book is a great read and the case studies paint a vivid picture of why the content is relevant. No wonder this book is widely recommended, go read it if you haven't.

30 reviews, 1 follower
February 21, 2021
This book captures the essence of developing software that is not very common in other types of books and programming classes.

Software is not just about creating code, it is about releasing it, operating it, and administering the system in production. Once this is understood, the way someone develops software changes drastically. Compatibility, monitoring, observability, etc start to become part of the design and we start to embrace change and adaptability.

This book gets into the specificity of this with concrete examples that don't talk only about the technical aspect of the problem and solutions, but the relations between teams, the business aspect, and cost, etc.

Zbyszek Sokolowski
284 reviews, 13 followers
September 29, 2018
Great book showing different aspects of delivering software to production. Many software architects are "short-sighted" and think only about making software that passes unit/integration tests, which then fails shortly after release. The book describes such cases and shows good approaches and different ways to deal with the problem, along with testing techniques like Chaos Monkey.

quotes:

Armed with a thread dump, the application is an open book, if you know how to read it. You can deduce a great deal about applications for which you’ve never seen the source code. You can tell:
- What third-party libraries an application uses
- What kind of thread pools it has
- How many threads are in each
- What background processing the application uses
- What protocols the application uses (by looking at the classes and methods in each thread’s stack trace)

As much as RMI made cross-machine communication feel like local programming, it can be dangerous because calls cannot be made to time out. As a result, the caller is vulnerable to problems in the remote server.

The amazing thing is that the highly stable design usually costs the same to implement as the unstable one.

A robust system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing. This is what most people mean by “stability.” It’s not just that your individual servers or applications stay up and running but rather that the user can still get work done.

The more tightly coupled the architecture, the greater the chance this coding error can propagate. Conversely, the less-coupled architectures act as shock absorbers, diminishing the effects of this error instead of amplifying them.

events that caused the failure is not independent. A failure in one point or layer actually increases the probability of other failures. If the database gets slow, then the application servers are more likely to run out of memory. Because the layers are coupled, the events are not independent.

precise about these chains of events:

Fault: A condition that creates an incorrect internal state in your software. A fault may be due to a latent bug that gets triggered, or it may be due to an unchecked condition at a boundary or external interface.

Error: Visibly incorrect behavior. When your trading system suddenly buys ten billion dollars of Pokemon futures, that is an error.

Failure: An unresponsive system. When a system doesn’t respond, we say it has failed. Failure is in the eye of the beholder...a computer may have the power on but not respond to any requests.

Triggering a fault opens the crack. Faults become errors, and errors provoke failures. That’s how the cracks propagate. At each step in the chain of failure, the crack from a fault may accelerate, slow, or stop.

caused a remote problem to turn into downtime. One way to prepare for every possible failure is to look at every external call, every I/O, every use of resources, and every expected outcome and ask, “What are all the ways this can go wrong?” Think about the different types of impulse and stress that can be applied:

Tight coupling allows cracks in one part of the system to propagate themselves—or multiply themselves—across layer or system boundaries. A failure in one component causes load to be redistributed to its peers and introduces delays and stress to its callers. This increased stress makes it extremely likely that another component in the system will fail. That in turn makes the next failure more likely, eventually resulting in total collapse. In your systems, tight coupling can appear within application code, in calls between systems, or any place a resource has multiple consumers.

A butterfly style has 2N connections, a spiderweb might have up to , and yours falls somewhere in between.

One wrinkle to watch out for, though, is that it can take a long time to discover that you can’t connect. Hang on for a quick dip into the details of TCP/IP networking. Every architecture diagram ever drawn has boxes and arrows, similar to the ones in the following figure. (A new architect will focus on the boxes; an experienced one is more interested in the arrows.)

You have to set the socket timeout if you want to break out of the blocking call. In that case, be prepared for an exception when the timeout occurs.
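
A minimal sketch of the point about socket timeouts, assuming Python's standard `socket` module rather than the Java APIs the book discusses. The local listener is a made-up stand-in for an unresponsive remote server, and the 0.2 second timeout is arbitrary.

```python
import socket

# A local listener that accepts connections but never sends anything,
# standing in for an unresponsive remote server (hypothetical setup).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
server.listen(1)

client = socket.create_connection(server.getsockname(), timeout=1.0)
client.settimeout(0.2)          # without a timeout, recv() could block indefinitely

try:
    client.recv(1024)
    timed_out = False
except socket.timeout:
    # The timeout surfaces as an exception, so the caller must handle it.
    timed_out = True
finally:
    client.close()
    server.close()
```

The read gives up after the timeout instead of hanging, at the cost of an exception path the caller has to handle.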

Once we understood all the links in that chain of failure, we had to find a solution. The resource pool has the ability to test JDBC connections for validity before checking them out. It checked validity by executing a SQL query like “SELECT SYSDATE FROM DUAL.”
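
The validity check described in the quote can be sketched roughly as follows. This is an illustration only: it uses an in-memory SQLite database in place of Oracle over JDBC, `SELECT 1` in place of the quoted query, and the helper names `open_connection` and `checkout` are hypothetical.

```python
import sqlite3


def open_connection():
    # Stand-in for opening a real database connection (assumption: in-memory SQLite).
    return sqlite3.connect(":memory:")


def checkout(idle):
    """Return a connection that passed a trivial validation query."""
    while idle:
        conn = idle.pop()
        try:
            conn.execute("SELECT 1").fetchone()   # cheap validity probe
            return conn
        except sqlite3.Error:
            pass                                  # dead connection: discard it
    return open_connection()                      # nothing usable left in the pool


dead = open_connection()
dead.close()                                      # simulate a connection that went bad
good = open_connection()
conn = checkout([good, dead])                     # 'dead' is popped and tested first
```

The probe adds a round trip to every checkout, which is the price of never handing a broken connection to a request thread.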

Fortunately, a sharp DBA recalled just the thing. Oracle has a feature called dead connection detection that you can enable to discover when clients have crashed. When enabled, the database server sends a ping packet to the client at some periodic interval. If the client responds, then the database knows it’s still alive. If the client fails to respond after a few retries, the database server assumes the client has crashed and frees up all the resources held by that connection.

The most effective stability patterns to combat integration point failures are Circuit Breaker and Decoupling Middleware.
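
For readers who have not seen a Circuit Breaker, here is one possible minimal shape of the idea. It is a sketch under stated assumptions, not the book's implementation: the class name, the `max_failures` and `reset_after` parameters, and the `flaky` function are invented for the example, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to attempt the call."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; remote call not attempted")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    # Hypothetical integration point that is currently down.
    raise ConnectionError("remote system is down")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")
```

After two real failures the third call is rejected without touching the remote system, which protects both the caller's threads and the struggling dependency.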

Hunt for resource leaks. Most of the time, a chain reaction happens when your application has a memory leak. As one server runs out of memory and goes down, the other servers pick up the dead one’s burden. The increased traffic means they leak memory faster.

Stop cracks from jumping the gap. A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.

Scrutinize resource pools. A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.
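
A rough sketch of a pool that limits how long a thread can wait to check out a resource. The `BoundedPool` and `PoolExhausted` names are hypothetical, and the string `"conn-1"` merely stands in for a real connection.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no resource becomes free within the allowed wait."""


class BoundedPool:
    def __init__(self, resources):
        self._free = queue.Queue()
        for r in resources:
            self._free.put(r)

    def checkout(self, timeout):
        try:
            return self._free.get(timeout=timeout)   # bounded wait, never forever
        except queue.Empty:
            raise PoolExhausted(f"no resource free after {timeout}s") from None

    def checkin(self, resource):
        self._free.put(resource)


pool = BoundedPool(["conn-1"])
held = pool.checkout(timeout=0.1)
try:
    pool.checkout(timeout=0.1)       # the only resource is already checked out
    exhausted = False
except PoolExhausted:
    exhausted = True
pool.checkin(held)
```

A waiting thread gets an error it can handle after a short delay, rather than joining a growing pile of threads blocked on an exhausted pool.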

If you are running in the cloud, then autoscaling is your friend. But beware! It’s not hard to run up a huge bill by autoscaling buggy applications.

Make sure your systems are easy to patch—you’ll be doing a lot of it. Keep your frameworks up-to-date, and keep yourself educated.

That’s why I advocate supplementing internal monitors (such as log file scraping, process monitoring, and port monitoring) with external monitoring. A mock client somewhere (not in the same data center) can run synthetic transactions on a regular basis. That client experiences the same view of the system that real users experience. If that client cannot process the synthetic transactions, then there is a problem, whether or not the server process is running.

If you find yourself synchronizing methods on your domain objects, you should probably rethink the design. Find a way that each thread can get its own copy of the object in question. This is important for two reasons. First, if you are synchronizing the methods to ensure data integrity, then your application will break when it runs on more than one server. In-memory coherence doesn’t matter if there’s another server out there changing the data. Second, your application will scale better if request-handling threads never block each other.

One elegant way to avoid synchronization on domain objects is to make your domain objects immutable.

When the time comes to alter their state, do it by constructing and issuing a “command object.” This style is called “Command Query Responsibility Separation,” and it nicely avoids a large number of concurrency issues.
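
One way to picture immutable domain objects combined with command objects is sketched below. The `Account` and `Deposit` classes are made up for the illustration, and having the command return a new object is just one possible interpretation of the quoted idea.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Account:
    """Immutable domain object: safe to share between threads without locks."""
    owner: str
    balance: int


@dataclass(frozen=True)
class Deposit:
    """Command object describing a requested state change."""
    amount: int

    def apply(self, account):
        # Returns a new Account; the original is left untouched.
        return replace(account, balance=account.balance + self.amount)


before = Account(owner="alice", balance=100)
after = Deposit(amount=25).apply(before)

try:
    before.balance = 0               # direct mutation is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Since no thread can change `before` in place, request-handling threads can read it concurrently without synchronized methods.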

In object theory, the Liskov substitution principle states that any property that is true about objects of a type T should also be true for objects of any subtype of T. In other words, a method without side effects in a base class should also be free of side effects in derived classes. A method that throws the exception E in base classes should throw only exceptions of type E (or subtypes of E) in derived classes.
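
The exception rule in this quote can be shown with a small example. All class names here (`StorageError`, `DiskFullError`, `Store`, `QuotaStore`) are hypothetical, and the length limit of 4 is arbitrary.

```python
class StorageError(Exception):
    """The exception E that the base class is documented to raise."""


class DiskFullError(StorageError):
    """A subtype of E, so it is allowed in a derived class."""


class Store:
    def save(self, data):
        if not data:
            raise StorageError("nothing to save")
        return len(data)


class QuotaStore(Store):
    def save(self, data):
        if len(data) > 4:
            raise DiskFullError("quota exceeded")   # still a StorageError
        return super().save(data)


def try_save(store, data):
    # Written against the base type; works unchanged for any subtype.
    try:
        return store.save(data)
    except StorageError as exc:
        return type(exc).__name__
```

If `QuotaStore.save` raised an unrelated exception type instead, `try_save` would let it escape, which is exactly the kind of surprise the principle rules out.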

Libraries are notorious sources of blocking threads, whether they are open-source packages or vendor code. Many libraries that work as service clients do their own resource pooling inside the library. These often make request threads block forever when a problem occurs. Of course, these never allow you to configure their failure modes, like what to do when all connections are tied up waiting for replies that’ll never come.

A blocked thread is often found near an integration point. These blocked threads can quickly lead to chain reactions if the remote end of the integration fails. Blocked threads and slow responses can create a positive feedback loop, amplifying a minor problem into a total failure.

Remember This: Recall that the Blocked Threads antipattern is the proximate cause of most failures.

Use proven primitives. Learn and apply safe primitives. It might seem easy to roll your own producer/consumer queue: it isn’t. Any library of concurrency utilities has more testing than your newborn queue. Defend with Timeouts. You cannot prove that your code has no deadlocks in it, but you can make sure that no deadlock lasts forever. Avoid infinite waits in function calls; use a version that takes a timeout parameter. Always use timeouts, even though it means you need more error-handling code.
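A small sketch of both points, assuming Python: it uses the standard library's `queue.Queue` rather than a hand-rolled queue, and always waits with a timeout. The helper name `take_with_timeout` is hypothetical.

```python
import queue


def take_with_timeout(work_queue, timeout_seconds):
    """Return the next item, or None if nothing arrives within the timeout.

    Assumes None is never a legitimate work item.
    """
    try:
        # Bounded wait: a stalled producer cannot block this thread forever.
        return work_queue.get(timeout=timeout_seconds)
    except queue.Empty:
        return None
```

The cost is the extra branch for the timed-out case, which is exactly the additional error handling the passage warns about.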

Autoscaling can help when the traffic surge does arrive, but watch out for the lag time. Spinning up new virtual machines takes precious minutes. My advice is to “pre-autoscale” by upping the configuration before the marketing event goes

Self-denial attacks originate inside your own organization, when people cause self-inflicted wounds by creating their own flash mobs and traffic spikes. You can aid and abet these marketing efforts and protect your system at the same time, but only if you know what’s coming. Make sure nobody sends mass emails with deep links. Send mass emails in waves to spread out the peak load. Create static “landing zone” pages for the first click from these offers. Watch out for embedded session IDs in URLs.

Too often, though, the shared resource will be allocated for exclusive use while a client is processing some unit of work. In these cases, the probability of contention scales with the number of transactions processed by the layer and the number of clients in that layer. When the shared resource saturates, you get a connection backlog. When the backlog exceeds the listen queue, you get failed transactions. At that point, nearly anything can happen. It depends on what function the caller needs the shared resource to provide. Particularly in the case of cache managers (providing coherency for distributed caches), failed transactions lead to stale data or—worse—loss of data integrity.

When a bunch of servers impose this transient load all at once, it’s called a dogpile. (“Dogpile” is a term from American football in which the ball-carrier gets compressed at the base of a giant pyramid of steroid-infused flesh.)

A pulse can develop during load tests, if the virtual user scripts have fixed-time waits in them. Instead, every pause in a script should have a small random delta applied.

Dogpiles force you to spend too much to handle peak demand. A dogpile concentrates demand. It requires a higher peak capacity than you’d need if you spread the surge out. Use random clock slew to diffuse the demand. Don’t set all your cron jobs for midnight or any other on-the-hour time. Mix them up to spread the load out. Use increasing backoff times to avoid pulsing. A fixed retry interval will concentrate demand from callers on that period. Instead, use a backoff algorithm so different callers will be at different points in their backoff periods.
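The two ideas above, a random delta on fixed pauses and an increasing backoff, could look like the following sketch. The source only says "a backoff algorithm"; exponential backoff with random jitter is one common choice and is an assumption here, as are the function names and default values.

```python
import random


def jittered_pause(base_seconds, delta_seconds, rng=random):
    """Fixed pause plus a small random delta, so virtual users do not pulse in step."""
    return base_seconds + rng.uniform(-delta_seconds, delta_seconds)


def backoff_delay(attempt, base_seconds=0.5, cap_seconds=30.0, rng=random):
    """Random delay between 0 and min(cap, base * 2**attempt).

    The ceiling grows with each attempt, and the randomness puts different
    callers at different points in their backoff periods.
    """
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A caller would sleep for `backoff_delay(attempt)` seconds before retry number `attempt`, starting from zero.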

We can implement similar safeguards in our control plane software: If observations report that more than 80 percent of the system is unavailable, it’s more likely to be a problem with the observer than the system. Apply hysteresis. (See Governor.) Start machines quickly, but shut them down slowly. Starting new machines is safer than shutting old ones off. When the gap between expected state and observed state is large, signal for confirmation. This is equivalent to a big yellow rotating warning lamp on an industrial robot. Systems that consume resources should be stateful enough to detect if they’re trying to spin up infinity instances. Build in deceleration zones to account for momentum. Suppose your control plane senses excess load every second, but it takes five minutes to start a virtual machine to handle the load. It must make sure not to start 300 virtual machines because the high load persists.
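One way to picture the deceleration zone is to subtract machines that are already starting from the number the current load seems to call for. This is only a sketch under that assumption; `machines_to_start` and its parameters are hypothetical names, not part of any real control plane.

```python
import math


def machines_to_start(excess_load, capacity_per_machine, pending_starts):
    """How many additional machines to request on this control-loop tick.

    pending_starts counts machines already requested but not yet serving
    traffic, so a load that persists for minutes is not answered again on
    every tick.
    """
    needed = math.ceil(excess_load / capacity_per_machine)
    return max(0, needed - pending_starts)
```

With this accounting, a loop that runs every second while a machine takes five minutes to boot requests the shortfall once, not 300 times.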

A quick failure allows the calling system to finish processing the transaction rapidly. Whether that is ultimately a success or a failure depends on the application logic. A slow response, on the other hand, ties up resources in the calling system and the called system.

Memory leaks often manifest via Slow Responses as the virtual machine works harder and harder to reclaim enough space to process a transaction.

More frequently, however, I see applications letting their sockets’ send buffers get drained and their receive buffers fill up, causing a TCP stall. This usually happens in a hand-rolled, low-level socket protocol, in which the read routine does not loop until the receive buffer is drained.
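For illustration, here is a read routine that does loop, sketched in Python with the standard `socket` module. It assumes the protocol tells the reader how many bytes to expect; the name `recv_exactly` and the default timeout are hypothetical.

```python
import socket


def recv_exactly(sock, n_bytes, timeout_seconds=5.0):
    """Keep reading until n_bytes have arrived; a single recv() may return fewer."""
    sock.settimeout(timeout_seconds)  # never wait forever on a silent peer
    chunks = []
    remaining = n_bytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```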

Many APIs offer both a call with a timeout and a simpler, easier call that blocks forever. It would be better if, instead of overloading a single function, the no-timeout version were labeled “CheckoutAndMaybeKillMySystem.”

Apply Timeouts to Integration Points, Blocked Threads, and Slow Responses. The Timeouts pattern prevents calls to Integration Points from becoming Blocked Threads. Thus, timeouts avert Cascading Failures. Apply Timeouts to recover from unexpected failures. When an operation is taking too long, sometimes we don’t care why…we just need to give up and keep moving. The Timeouts pattern lets us do that. Consider delayed retries. Most of the explanations for a timeout involve problems in the network or the remote system that won’t be resolved right away. Immediate retries are liable to hit the same problem and result in another timeout. That just makes the user wait even longer for her error message. Most of the time, you should queue the operation and retry it later.

Circuit Breaker

Not too long ago, when electrical wiring was first being built into houses, many people fell victim to physics.

Now, circuit breakers protect overeager gadget hounds from burning their houses down. The principle is the same: detect excess usage, fail first, and open the circuit.

More abstractly, the circuit breaker exists to allow one subsystem (an electrical circuit) to fail (excessive current draw, possibly from a short circuit) without destroying the entire system (the house). Furthermore, once the danger has passed, the circuit breaker can be reset to restore full function to the system.
Leaky Bucket pattern from Pattern Languages of Program Design: it’s a simple counter that you can increment every time you observe a fault. In the background, a thread or timer decrements the counter periodically (down to zero, of course). If the count exceeds a threshold, then you know that faults are arriving quickly.
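The counter described above fits in a few lines. This is a minimal sketch: the class name `LeakyBucket` and its method names are hypothetical, and the periodic call to `leak()` is left to whatever timer or background thread the application already has.

```python
import threading


class LeakyBucket:
    """Fault counter that drains over time; exceeding the threshold means faults are arriving quickly."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._count = 0
        self._lock = threading.Lock()  # faults and leaks come from different threads

    def record_fault(self):
        with self._lock:
            self._count += 1

    def leak(self):
        """Call periodically from a timer; the count never drops below zero."""
        with self._lock:
            if self._count > 0:
                self._count -= 1

    @property
    def count(self):
        with self._lock:
            return self._count

    def is_overflowing(self):
        return self.count > self.threshold
```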

Operations needs some way to directly trip or reset the circuit breaker. The circuit breaker is also a convenient place to gather metrics about call volumes and response times.

Circuit breakers are effective at guarding against integration points, cascading failures, unbalanced capacities, and slow responses. They work so closely with timeouts that they often track timeout failures separately from execution failures.
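The behavior described in these passages can be sketched roughly as follows. This is a simplified illustration, not the book's implementation: the class, the consecutive-failure threshold, the reset timeout, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the protected operation while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.reset()
        return result

    def trip(self):
        """Manual control for operations: force the circuit open."""
        self.opened_at = self.clock()

    def reset(self):
        """Manual control for operations: force the circuit closed."""
        self.failures = 0
        self.opened_at = None
```

The `call` method is also the natural place to record call counts and response times, as the passage above suggests.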
The Bulkheads pattern partitions capacity to preserve partial functionality when bad things happen. Pick a useful granularity. You can partition thread pools inside an application, CPUs in a server, or servers in a cluster. Consider Bulkheads particularly with shared services models. Failures in service-oriented or microservice architectures can propagate very quickly. If your service goes down because of a Chain Reaction, does the entire company come to a halt? Then you’d better put in some Bulkheads.
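At the thread-pool granularity, one possible sketch gives each kind of work its own fixed-size pool, so that exhausting one pool leaves the others untouched. The `Bulkheads` class and the partition names are hypothetical; `ThreadPoolExecutor` is from Python's standard library.

```python
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One fixed-size thread pool per partition."""

    def __init__(self, sizes):
        # sizes maps a partition name to its maximum number of worker threads.
        self.pools = {
            name: ThreadPoolExecutor(max_workers=workers, thread_name_prefix=name)
            for name, workers in sizes.items()
        }

    def submit(self, partition, func, *args):
        """Run func on the named partition's pool and return its Future."""
        return self.pools[partition].submit(func, *args)

    def shutdown(self):
        for pool in self.pools.values():
            pool.shutdown(wait=True)
```

For example, `Bulkheads({"checkout": 8, "search": 2})` would let slow search calls tie up at most two threads while checkout keeps its eight.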

Nevertheless, someday your little database will grow up. When it hits the teenage years—about two in human years—it’ll get moody, sullen, and resentful. In the worst case, it’ll start undermining the whole system (and it will probably complain that nobody understands it, too).

There are few general rules here. Much depends on the database and libraries in use. RDBMS plus ORM tends to deal badly with dangling references, for example, whereas a document-oriented database won’t even notice.

One log file is like one pile of cow dung—not very valuable, and you’d rather not dig through it. Collect tons of cow dung and it becomes “fertilizer.” Likewise, if you collect enough log files you can discover value.
Ship the log files to a centralized logging server, such as Logstash, where they can be indexed, searched, and monitored.
To a long-running server, memory is like oxygen. Cache, left untended, will suck up all the oxygen. Low memory conditions are a threat to both stability and capacity.

Improper use of caching is the major cause of memory leaks, which in turn lead to horrors like daily server restarts. Nothing gets administrators in the habit of being logged onto production like daily (or nightly) chores.
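One common way to keep a cache from consuming all available memory is to cap its size and evict the least recently used entry. The sketch below assumes that approach; the excerpt itself does not name a specific remedy, and `BoundedCache` is a hypothetical class.

```python
from collections import OrderedDict


class BoundedCache:
    """Cache with a hard cap on entries; evicts the least recently used one."""

    def __init__(self, max_entries):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self.max_entries = max_entries
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._items)
```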

Even when failing fast, be sure to report a system failure (resources not available) differently than an application failure (parameter violations or invalid state). Reporting a generic “error” message may cause an upstream system to trip a circuit breaker just because some user entered bad data and hit Reload three or four times.
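A sketch of that distinction, assuming two hypothetical exception types and a made-up `place_order` function: only the system failure should count toward an upstream circuit breaker.

```python
class ApplicationFailure(Exception):
    """The caller's request was bad: parameter violations or invalid state."""


class SystemFailure(Exception):
    """A resource this service needs is not available."""


def place_order(quantity, database_available):
    """Fail fast, and say which kind of failure it was."""
    if quantity <= 0:
        raise ApplicationFailure("quantity must be positive")
    if not database_available:
        raise SystemFailure("order database unavailable")
    return "accepted"


def counts_toward_breaker(error):
    """Bad user input should not trip a circuit breaker; missing resources should."""
    return isinstance(error, SystemFailure)
```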

Avoid Slow Responses and Fail Fast. If your system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes your problem into their problem.

Reserve resources, verify Integration Points early. In the theme of “don’t do useless work,” make sure you’ll be able to complete the transaction before you start. If critical resources aren’t available—for example, a popped Circuit Breaker on a required callout—then don’t waste work by getting to that point. The odds of it changing between the beginning and the middle of the transaction are slim.

Sometimes the best thing you can do to create system-level stability is to abandon component-level stability. In the Erlang world, this is called the “let it crash” philosophy.
We must be able to get back into that clean state and resume normal operation as quickly as possible. Otherwise, we’ll see performance degrade when too many of our instances are restarting at the same time. In the limit, we could have loss of service because all of our instances are busy restarting. With in-process components like actors, the restart time is measured in microseconds. Callers are unlikely to really notice that kind of disruption. You’d have to set up a special test case just to measure it.

Actor systems use a hierarchical tree of supervisors to manage the restarts. Whenever an actor terminates, the runtime notifies the supervisor. The supervisor can then decide to restart the child actor, restart all of its children, or crash itself. If the supervisor crashes, the runtime will terminate all its children and notify the supervisor’s supervisor. Ultimately you can get whole branches of the supervision tree to restart with a clean state. The design of the supervision tree is integral to the system design.

Crash components to save systems. It may seem counterintuitive to create system-level stability through component-level instability. Even so, it may be the best way to get back to a kn
Bill
224 reviews, 82 followers
December 28, 2018
The original edition of this book introduced me to stability patterns and their evil twins, stability antipatterns. I've referenced its terminology countless times since reading it, especially Steady State, Circuit Breaker, and Fail Fast. I always include this on my list when newer developers ask what they should read. It's chock full of wisdom about distributed systems that no one bothers to teach you in school. The techniques here make the difference between code that will topple over at the slightest breeze and the architecture that will get you through your business's peak season without getting paged.

The second edition builds on this and adds even more. Rather than just tacking on a new chapter about Docker or something at the end, it's completely overhauled and reorganized. The most obvious change is that it really seems plugged into the zeitgeist, covering the latest crop of open-source PaaS tools with a deeply pragmatic eye from someone who has seen both thoughtful applications of tools and inexperienced devs chasing shiny things. I especially appreciated the expanded section on deep networking issues (Interconnect). And the sections covering zero downtime deployments included nicely articulated concepts that I hadn't encountered before, such as grouping steps under "Expansion" and "Cleanup." I see conceptualizations as one of the most useful parts of reading these sorts of books, because they'll help you communicate about these ideas for years to come.

This edition suffers a bit from the Second System effect. Despite having a nearly identical page count, it didn't feel as tightly focused and tries to cover a few too many related subjects like security and even organizational issues like decision-making feedback loops. I would also have loved more case studies because I'm a sucker for disaster root cause analysis. The only two new ones in this edition were publicly published (Reddit and S3), so they didn't include as many juicy details.

Those minor critiques aside, this remains an indispensable and peerless book on building software for the real world.
James Healy
32 reviews, 5 followers
June 1, 2018
This book radically influenced the way I build and deploy software.

It's a whirlwind tour through designing code that behaves well in production, the many ways interaction between multiple systems can fail, deployment styles that avoid scheduled downtime, and case studies to demonstrate the surprises that happen in the real world.

For those new to developing and deploying production software, the pace might be hard to follow, but those with a bit of experience under their belt will find this triggers memories, provides a language and framework to understand the issues you've encountered in production, and offers patterns to help you manage those issues when they reoccur.

For those who have read the first edition of Release It!, the second edition is worth a revisit. A lot has changed in 10 years, and the book has been significantly updated to account for that. I like the logical progression of the new book outline too - Creating Stability, Designing for Production, Delivering your System.
Author of 2 books, 110 followers
November 8, 2016
I've been working on a project that heavily used clouds and high availability for a relatively short period of time, but even that experience helped me to appreciate this book.

The book predates all the dev-ops hype, but still gives you tons of suggestions on how to build a robust, scalable and easy-to-understand-when-something-goes-wrong application: think about failure, every possible component WILL fail in production. Every possible 'joint' like external system interaction will be broken. Every possible and impossible situation will occur and you should be prepared for that: not by trying to eliminate it, but by accepting that disaster will happen.

Some of the advice is a bit outdated (but look at the title, the book is from 2007!), and some of it is less clear than I wanted, but overall the book is helpful.
Bodo Tasche
97 reviews, 12 followers
August 30, 2018
Oh boy, this is difficult. I really want to like this book. It has tons of great content. Sadly it has no flow in it. Feels more like a random list of Post-its slapped together as a book. Reading through it was really hard for me because of that. I know it’s a technical book, but I expect more than a boring list of things with a couple of small anecdotes in between.

Sadly it also shows some minor forms of toxic work ethics. One example: the “Fire me when the deploy fails” button. You should never fire someone because of an error. If this was a joke, it was a really bad one.

Ah, and the last couple of chapters got more and more erratic. As if the author was under time pressure and hammered in the last pages at 6 o’clock in the morning.
82 reviews, 28 followers
May 3, 2022
The book started a bit underwhelming. I didn't like Part 1 that much. The content was informative, but a bit outdated. Parts 2 and 3 were more interesting though. I liked the chapters on virtualization and containerization. The author was building software systems in the 90s and early 00s, when the concepts of continuous integration and continuous delivery didn't exist. So it was interesting to read his stories about how stressful deploying a new version of a website was. That helped me to learn about the history of some of the concepts we take for granted these days, such as CI/CD, virtual machines, containers, and immutable infrastructure, and understand how things worked before they existed, and what problems they were designed to solve.
Roman
140 reviews, 81 followers
September 25, 2014
Even almost seven years after publication, the book is a source of inspiration for designing production-friendly software. I wish I could have read the book three years ago. It would have saved a few sleepless nights for me and my colleagues. The book deserves to be extended to cover, e.g., Cloud, Software as a Service, and even DevOps, since they are key change drivers in the release/deployment process nowadays. Anyway, worth reading and thumbs up.
Andreea Ratiu
191 reviews, 31 followers
October 30, 2014
You can tell from the first use case that the writer worked with big websites and with Java. Still, the book is full of useful advice on how to design software projects in terms of scalability, transparency, adaptability and ease of troubleshooting. I enjoyed the style - the examples are well chosen and the level of detail is not too deep, just enough to explain why some decisions are better than others and how to apply good judgement when needed.
2 reviews
March 21, 2016
This book is a rare gem. It is full of valuable insights and is very well written, which makes it not only a valuable source of information but also a pleasure to read.
I would give it 10 stars out of a possible 5 if I could.

Definitely recommend it to any software developer or system engineer.
Luca
77 reviews, 15 followers
June 3, 2016
I thought I'd go for 5 stars while reading most of it. I ended up with four because I found the last chapters to be a bit confusing. Anyway, the book is full of wisdom and I would argue for calling it a must.
45 reviews, 3 followers
July 26, 2019
An extremely sensible and useful book. I highly recommend it! Even though it was written more than 10 years ago, it is still relevant. The book is not so much about programming as about practices across different areas of software development that let a freshly built application bring in revenue rather than headaches. Namely: how to survive a high-load season (say, Black Friday for retail); how to build an application so that under unforeseen circumstances (the same high load, or the failure of some external services) the whole system does not die outright and can eventually recover; how to write an SLA properly so that everyone is satisfied; how to write logs; how and what to monitor; how to organize deployments that users do not notice… All the chapters are very practical, and it is clear that the author has experienced the problems of maturing software firsthand.

I really liked the idea of cracks in a software system: if some part of the system starts to break down (a "crack" appears in it), that part will affect other parts, and the "crack" will spread. If the crack reaches a large number of subsystems, the whole system can collapse. You will have to stop all the servers and start them again. The book offers many ideas on how to recognize the places where a "crack" can accelerate its spread and how to design subsystems so that they stop such "cracks" from spreading.

While reading, I kept thinking that the book fills a gap between two kinds of literature for programmers: kind 1) how to write a program; kind 2) how to solve problems when you are already Google. Release It! answers the question of what difficulties will arise when the program is already written but the release date was only yesterday, so the problems of Google and Facebook are still very far off. And also how to prepare for those difficulties.

The writing style is very lively, with plenty of humor and real-life stories (without overdoing it). The book is not very long, and I honestly wanted more. Not because the topics are not covered, but because it is interesting to read. I read it in English, but a Russian translation also exists.

I recommend it to people involved in developing and designing software systems: from regular programmers to team leads, PMs, and architects. And I recommend that I reread it myself in a year or so.
Mariano
Author of 2 books, 11 followers
August 1, 2020
It's a fantastic book about good software engineering from non-traditional viewpoints.

It takes another approach to good practices of software architecture: it considers more than just classic quality attributes, and it makes you think about how to architect your system in a way that's not only reliable and of good quality, but also easy to operate. Concepts such as evolutionary architecture and adaptable architecture are reviewed throughout the chapters of the last section. In particular I enjoyed reading more about how to make the architecture easy to build and integrate continuously, deploy it safely to production, and make changes to it (because, of course, "change is the defining characteristic of software"). It finishes with a great introduction to chaos engineering.

It covers all important topics on good software architecture: stability patterns, deployability, security, how to avoid typical errors (like cascading failures, and what to do in such scenarios), 12-factor app, and more.

I really liked the concept of cynical software: rather than assuming everything is going to be fine, ask what could possibly go wrong, and expect (and be prepared for) the software to fail. Failures will inevitably occur, and we have to think about what to do about them.

As an experienced software engineering practitioner, it was highly enjoyable for me to read the case studies presented, as their analysis and conclusions were deeply enlightening.

All in all, a fantastic read, which gave me a lot of food for thought, and lots of materials and references to follow up on!