Becoming a Rockstar SRE: Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems
By Jeremy Proffitt and Rod Anami
About this ebook
Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples.
This book will acquaint you with the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters and exploring observability, outages, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions.
By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!
BIRMINGHAM—MUMBAI
Becoming a Rockstar SRE
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Mohd Riyan Khan
Publishing Product Manager: Surbhi Suman
Senior Editor: Romy Dias
Technical Editor: Shruthi Shetty
Copy Editor: Safis Editing
Project Coordinator: Ashwin Kharwa
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Alishon Mendonca
Marketing Coordinator: Agnes D’souza
First published: March 2023
Production reference: 1290323
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80323-922-4
www.packtpub.com
For my wonderful wife, who still likes me after 18 years. I like you too.
– Jeremy Proffitt
To my God, wife Tati, and son Gabe.
– Rod Anami
Contributors
About the authors
Jeremy Proffitt (born January 1977) is obsessed with constantly improving systems and solving problems with an unmatched sense of urgency – the definition of a Site Reliability Engineer (SRE). A master of solutions and technological knowledge, Jeremy is a rockstar SRE with AWS professional certifications in Architecture and DevOps – and has routinely saved millions in potential lost revenue in his career. In his free time, Jeremy enjoys spending time in his rockstar-appropriate technology cave and loves venturing into 3D printing, electronics, and Internet of Things (IoT) projects. By day, Jeremy currently manages a team of top SRE and DevOps talent driving constant improvement and is often cited in the company as a visionary in terms of observability and emergency response.
To the leaders who have helped me see the truth in our work and friends who have stood by and given me the encouragement to follow the wonders of technology, often while in awe of their own work, I say thank you! To my arch-enemies, you have been a wonderful addition that has always challenged me to become better. And finally, to my wife, Jamie, who I still desperately love after 18 years – and mind you, still likes me – I still remember our first date when you took my arm, you stole my heart, and in all our years, I’ve never felt you let go once.
Rod Anami is a seasoned engineer who works with cloud infrastructure and software engineering technologies. As one of the SREs at the Kyndryl CoE, he coaches other SREs on running IT modernization, transformation, and automation projects for clients worldwide. Rod leads the global SRE guild inside Kyndryl, where he helps plant and grow SRE chapters in many countries. Rod is certified as an SRE, technical specialist, and DevOps engineer professional at the ultimate level. He holds AWS, HashiCorp, Azure, and Kubernetes certifications, among many others. He is passionate about contributing to open source software at large with Node.js libraries.
I want to thank my wonderful wife, Tatiana, and my beloved son, Gabriel, for giving me the space and support needed to write this book, and my parents, Shizuo and Rita, for raising me with solid character. The Google site reliability engineering organization made this fantastic approach and profession open source. I want to thank Kyndryl for backing me on this journey. I had many bosses and leaders, good, bad, and inspiring ones. I want to mention a few who impacted my career immensely by helping me acquire the skills and knowledge for this book: Marcos Cimmino, Tara Sims, Andy Barnes, and Gene Brown. Nothing great is accomplished alone: it requires effort, endurance, enjoyment, colleagues, and God.
About the reviewers
Chris Smith is a strategic IT leader with a proven track record across the financial service industry. His passion is to lead organization-wide transformational efforts for Fortune 500 institutions within digital and contact center technology and operations. He is skilled at driving agile adoption, building an engineering-first mindset, and facilitating cloud modernization of core banking services at scale.
Itohanoghosa Eregie is the founder of techinanutshellhack, a platform dedicated to explaining cloud and SRE concepts in their simplest form through short video clips on LinkedIn. She worked as a software developer at Cyberspace Limited before finding her passion as a platform engineer, which earned her an opportunity to work with Dell EMC as a resident platform engineer for one of Africa’s largest telecommunications companies, MTN Nigeria. Altoros Americas currently employs her as a VMware Tanzu engineer, involved in customer engagement. Itohan is passionate about building resilient systems in the cloud and ensuring organizations adhere to SRE practices.
Brannen Taylor has almost 30 years of experience in corporate IT from the healthcare, managed services, power, hosted DR, and financial services industries. He has worked with small mom-and-pop operations up to ITIL-heavy Fortune 10 companies. He was a network engineer for 20 years and has been a network operations manager for the past 2 years. He has certifications from many vendors such as Nortel, Cisco, and Palo Alto, as well as a few that are vendor-agnostic, many cloud certifications from AWS and Azure, and is now moving into Network DevOps (NetDevOps), focusing on Nautobot, Ansible, and various vendor SDKs. He enjoys scuba diving with his wife and friends and has two grown children.
I would like to thank God for leading me into a career that I love. I want to thank my children for only eye-rolling me a little when I launch into an explanation about binary when they ask me how email works. I want to thank my wife Lara for putting up with me being on call these past 23 years, working unexpectedly long days, nights, and weekends, and non-stop studying. Thank you to my colleagues and the friends I’ve made along the way.
Gene Brown is the Vice President and a Distinguished Engineer at Kyndryl. He leads the SRE profession and certification program and is the global site reliability engineering leader. He is responsible for driving the enablement of SREs across Kyndryl’s countries, practices, and strategic markets through a Center of Excellence with SRE chapter leaders across the services organization globally.
Gene enjoys spending time with clients interested in adopting SRE and likes comparing notes on what has worked well and how to overcome the challenges that come with cultural change. Gene was the co-founder of IBM’s and Kyndryl’s SRE profession with a focus on certifying SREs based on their applied experience in the field of site reliability engineering.
Table of Contents
Preface
Part 1 - Understanding the Basics of Who, What, and Why
1
SRE Job Role – Activities and Responsibilities
Making this journey personal
SRE driving forces
SRE skills
SRE traits
Understanding the mindset and hobbies of an SRE
SRE affinity game
SRE guiding principles
SRE hobbies
DevOps engineers versus SRE versus others
DevOps and site reliability engineers
Software and site reliability engineers
Describing an SRE’s main responsibilities
An overview of the daily activities of an SRE
People that inspire
Jeremy’s recognition – Paul Tyma, former CTO, LendingTree
Rod’s recognition – Ingo Averdunk, Distinguished Engineer, IBM, and Gene Brown, Distinguished Engineer, Kyndryl
Summary
Further reading
2
Fundamental Numbers – Reliability Statistics
SLA commitment – a conversation, not a number
Internal partner SLAs
External partner SLAs
The cost of more 9s in an SLA
A final word on SLAs
Defining and leveraging SLOs and SLIs
SLOs
SLOs and time
Tracking outage frequency with the MTBF
Measuring the downtime with the MTTR
Understanding the customer and revenue impact
Transparency in outages
The rockstar SRE’s SLA
Summary
3
Imperfect Habits – Duct Tape Architecture and Spaghetti Code
The business of software development – let’s start with the dollars
Defining the value of software to a business
The value of protecting business
The value of growing a business
The value of saving labor costs
The A/B testing mindset – the art of change in customer interaction
A/B testing in customer flows
Analyzing the results of A/B testing
Leveraging A/B testing to satisfy quarterly numbers
Dedication to the craft of development – and why some are just here for a job
A quick guide to communicating with your colleagues
Reviewing the merge request – it’s about training, oversight, and reliability
Avoiding the typical rubber stamp mentality
A word on production deployments
Why businesses want us to outright ignore best practices
The truth about the ownership of a developer’s time
Understanding the flaws in how we estimate development cost
Fast, good, cheap – pick one
Why is observability the answer to reliability issues?
The cost of highly available architecture
Mixing good and bad – tricks to wrapping bad code and making it resilient
Alerting that fires actions
Adding additional logging to monitor potential issues
Using try catch to encapsulate exceptions
Retries to the rescue…or not
Summary
Part 2 - Implementing Observability for Site Reliability Engineering
4
Essential Observability – Metrics, Events, Logs, and Traces (MELT)
Technical requirements
Accomplishing systems monitoring and telemetry
Monitoring targets for infrastructure
Monitoring types and tools
Monitoring golden signals
Monitoring data
Understanding APM
Getting to know topology self-discovery, the blast radius, predictability, and correlation
Alerting – the art of doing it quietly
The user perspective notification trigger principle
Event-to-incident mapping principle
Mixing everything into observability
Outages versus downtime
Observability architecture
Observability effectiveness
In practice – applying what you have learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
5
Resolution Path – Master Troubleshooting
Properly defining the problem – and what to ask and not ask
Source of information
The knowledge base of the reporter
Naming conventions
False urgency
Executive summary
Breaking down and testing systems
Breaking down hardware versus the operating system
Breaking down a web API
Understanding the steps
The problems with this method of troubleshooting
Previous and common events – checking for the simple problems
Prior Root Cause Analysis (RCA) documents
Timeline analysis
Comparison
The best approach
Effective research both online and among peers
The art of the Google search
Skimming the content quickly and refining it
Never forget your internal resources
Breaking down source code efficiently
Code you’ve never seen
When that fails
Logging plus code
In practice – applying what you’ve learned
Summary
6
Operational Framework – Managing Infrastructure and Systems
Technical requirements
Approaching systems administration as a discipline
Design
Installation
Configuration
App deployment
Management
Upgrade
Uninstallation
Understanding IT service management
ITIL
DevOps
Seeing systems administration as multiple layers and multiple towers
Automating systems provisioning and management
Infrastructure as Code
Immutable infrastructure
In practice – applying what you’ve learned
Lab architecture
Lab contents
Lab instructions
Summary
Further readings
7
Data Consumed – Observability Data Science
Technical requirements
Making data-driven decisions
Defining the question and options
Determining which data to use
Identifying which data is already available
Collecting the missing data
Analyzing all datasets together
Presenting the decision as a record
Documenting the lessons learned in the process
Solving problems through a scientific approach
Formulation
Hypothesis
Prediction
Experiment
Analysis
Understanding the most common statistical methods
Percentages
Mean, average, and standard deviation
Quantiles and percentiles
Histograms
Using other mathematical models in observability
Visualizing histograms with Grafana
In practice – applying what you’ve learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
Part 3 - Applying Architecture for Reliability
8
Reliable Architecture – Systems Strategy and Design
Technical requirements
Designing for reliability
Architectural aspects
Reliability equations
Design patterns
Modern applications
Splitting and balancing the workload
Splitting
Balancing
Failing over – almost as good
Scaling up and out – horizontal versus vertical
Horizontal
Vertical
Autoscaling
In practice – applying what you’ve learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
9
Valued Automation – Toil Discovery and Elimination
Technical requirements
Eliminating toil
Toil redefined
Why toil is bad
Handling toil the right way
Treating automation as a software problem
Document
Algorithm
Code
Automating the (in)famous CI/CD pipeline
Continuous integration
Continuous delivery
Production releases
In practice – applying what you’ve learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
10
Exposing Pipelines – GitOps and Testing Essentials
A basic pipeline – building automation to deploy infrastructure as code architecture and code
Pipelines in chronological order
Pipeline templates
Errors or breaks in pipelines
Using containers in pipelines
Pipeline artifacts
Pipeline troubleshooting tips
Automating compliance and security in pipelines
Library age
Application security testing
Dynamic Application Security Testing (DAST)
Static Application Security Testing (SAST)
Secrets scanning
Automated linting for code quality and standards
Compiling with linting feedback
Validating functionality during deployment with automated testing
Why is testing so important to reliability?
Test data
The types of testing
When to test a pipeline
Testing observability
Automated rollbacks
The reduction of developer toil through automated processes
What is the impact of addressing toil?
In practice – applying what you’ve learned
Preparing AWS for the lab
Creating your repository
Adding secrets to your repository
Downloading and committing the lab files
Understanding the pipeline
Adding more steps
Testing but not deploying
Lab final thoughts
Summary
11
Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes
Technical requirements
The multiple definitions of serverless
Serverless Framework
Serverless computing
Serverless functions
Monitoring serverless functions
Errors
Containers and why we love them
Isolation
Immutability
Promotability
Tagging
Rollbacks
Security
Signable
Monitoring containers
Kubernetes and other ways to orchestrate containers
Health checks
Crashing and force-closing containers
HTTP-based load balancing
Server load balancing
Containers as a Service (CaaS)
Simple container orchestration
Kubernetes
Deployment techniques and workers
Traditional replacement deployment
Rolling deployment
A/B or blue/green deployment
Canary deployment
Automation and rolling back failed deployments
Rollback metrics
When to roll back
How to roll back
In practice – applying what you’ve learned
Leveraging Gitpod – a containerized workspace
The emulation source code
Running the emulation
Summary
12
Final Exam – Tests and Capacity Planning
Technical requirements
Understanding types of testing
Development tests
Build tests
Delivery tests
Deployment tests
Production tests
Adopting TDD
Unit testing the hard way
Unit testing with a framework
Using test automation frameworks
Staying ahead with capacity planning
Load test data
The capacity curve
The demand curve
In practice – applying what you’ve learned
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
Part 4 - Mastering the Outage Moments
13
First Thing – Runbooks and Low Noise Outage Notifications
Technical requirements
What makes a good runbook – the basics
Runbooks as living documents
Understanding the runbook audience knowledge level
Runbook audience permissions
What do you put into a runbook anyway?
Beyond the runbook – code and comments
Quickly understanding source code
Searching source code for your needle in a haystack
Commenting for understanding
What’s in a good dashboard?
Types of dashboards
NOC-style red and green
Displaying trends
Aggregates and breakdowns
What dashboards are not
The basics of priority levels
Response effort
Engineer retention
Incident response systems and priority
Incident response systems and phone-based alerts
What is a priority one event?
Defining priority based on...
The priority level of observability failures
Forcing the priority – the rockstar way!
Adjusting alerts
Logs and alerting
Pausing alerts
In practice – applying what you’ve learned
Defining priority levels
Custom hat pricing API runbook
Alerting
Summary
14
Rapid Response – Outage Management Techniques
Where to meet – an effective strategy for communicating good information
Online collaboration
In-person collaboration
The historical data found in outage responses
Participants
Follow-up work
Leveraging the people involved in the response
Tasks
Participants and personalities
Break strategy and stress management
The opportunity to respond at the right time
Training
Runbook and contact list revisions
Team building
Executive messaging bugs in the ear
Opportunities to call out during the RCA
Messaging customers and leadership
Customer versus leadership messaging
Cadence
Email groups
Status sites
Over-messaging
Notes, notes, notes...
In practice – applying what you’ve learned
Outage and alarm
Notification and response
Troubleshooting
The conclusion
Summary
15
Postmortem Candor – Long-Term Resolution
The content of the postmortem in executive summary style
Executive summary style
Overview
Impact
Timeline
Detailed technical description
Response
Resolution
Future actions
Decisions are not blame
Business is business
Resource and time constraints
Monitoring
The cost of more reliability as a business decision
Active:Active
Manual failover
Cost of time to identify
The cost of time to move a load
Hidden development costs
Training and skill sets – they matter
Identifying gaps
Training and certification targets
Creating future action plans
Immediate follow-up
Who to involve
Timelines and priority
Assigning ownership
Tracking the work
In-practice – an example of a postmortem
Writing the overview
Rounding out the postmortem
Custom Hat Company postmortem
Impact
Timeline
Technical details and response
Resolution
Future actions
Summary
Part 5 - Looking into Future Trends and Preparing for SRE Interviews
16
Chaos Injector – Advanced Systems Stability
Technical requirements
Comprehending the wheel-of-misfortune game
All ends are new beginnings
Lessons to be learned
Role-playing scenarios
A little bit of gamification
Understanding chaos engineering for reliability
Principles of chaos engineering
Chaos system architecture
Chaos experiments
In practice – employing the wheel-of-misfortune game
Lab architecture
Lab contents
Lab instructions
In practice – injecting chaos into systems
Lab architecture
Lab contents
Lab instructions
Summary
Further reading
17
Interview Advice – Hiring and Being Hired
What we’re looking for in a candidate
Are you qualified?
Entry-level SRE job
Problem-solving
The ability to accept feedback and direction
A broad knowledge base and skill set
Research and learning skill set
The ability to say No
Culture fit
The X factor
Passion
Experience
Personal responsibility
Common interview questions and answers
Technical questions
Non-technical questions
Insightfully odd questions
What should you look for in a career?
Define a good boss
Dotted line reporting
Morals
Researching the company
Business model
Profitability for the next decade
Structure
Large versus small
Public versus private
Online reviews
Are you over- or under-certified?
Certifications that matter
How many are too many certifications?
Relevancy
Tips for landing the job with a great salary
Interview tips
Salary negotiations
Summary
Appendix A – The Site Reliability Engineer Manifesto
The manifesto
How to adopt it
How to contribute to it
Appendix B – The 12-Factor App Questionnaire
The questionnaire
Factor I – Code base
Factor II – Dependencies
Factor III – Config (configuration)
Factor IV – Backing (backend) services
Factor V – Build, release, run
Factor VI – Processes
Factor VII – Port binding
Factor VIII – Concurrency
Factor IX – Disposability
Factor X – Development/production (dev/prod) parity
Factor XI – Logs
Factor XII – Admin processes
How to adopt this questionnaire
How to contribute to this questionnaire
Index
Other Books You May Enjoy
Preface
Site reliability engineering relates to constant improvement, bridging business and product issues as per customer requirements and technology limitations, thereby generating higher revenue. Quantifying and understanding reliability, resource handling, and developer needs can sometimes be overwhelming. Becoming a Rockstar SRE explores reliability from an infrastructure and coding perspective and uses real-world examples to bring forth the site reliability engineer (SRE) persona.
This book will acquaint you with who an SRE is, followed by discussions on the why and how of site reliability engineering. It walks you through the jobs of an SRE, from automation of continuous integration/continuous delivery (CI/CD) pipelines and reducing toil to the details of reliability and the best practices to excel in it. You’ll learn why harmful code is created and how to circumvent that with reliable designs and patterns. You’ll explore how to interact and negotiate with businesses and vendors on various technical matters. You’ll then deep dive into observability, outage, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications, interview tips, and questions.
By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!
Who this book is for
This book is intended for IT professionals, from developers looking to advance into an SRE role to system administrators mastering technologies and executives experiencing repeated downtime in their organizations. This book will also be helpful to anyone interested in bringing reliability and automation to their organization to drive down customer impact and revenue loss while increasing development throughput. A basic understanding of API and web architecture and some experience with cloud computing and services will be helpful while reading this book.
What this book covers
Chapter 1, SRE Job Role – Activities and Responsibilities, talks about the site reliability engineer persona, addressing who an SRE is.
Chapter 2, Fundamental Numbers – Reliability Statistics, shows how the site reliability engineering work and business impact are measured.
Chapter 3, Imperfect Habits – Duct Tape Architecture and Spaghetti Code, explains why systems are naturally unreliable.
Chapter 4, Essential Observability – Metrics, Events, Logs, and Traces (MELT), discusses how we go from monitoring to true observability.
Chapter 5, Resolution Path – Master Troubleshooting, lectures on the SRE way of precisely and concisely troubleshooting.
Chapter 6, Operational Framework – Managing Infrastructure and Systems, describes why and how SREs tackle operational work and not just engineering duties.
Chapter 7, Data Consumed – Observability Data Science, teaches the basic mathematical models and statistical methods for SREs.
Chapter 8, Reliable Architecture – Systems Strategy and Design, describes systems thinking applied to reliability and reliable architectural patterns.
Chapter 9, Valued Automation – Toil Discovery and Elimination, familiarizes readers with a critical pillar of site reliability engineering: making operations scalable.
Chapter 10, Exposing Pipelines – GitOps and Testing Essentials, illustrates how to leverage reliability inside DevOps delivery pipelines.
Chapter 11, Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes, presents how workload management affects the reliability of systems.
Chapter 12, Final Exam – Tests and Capacity Planning, demonstrates how good testing and capacity planning keep the performance of systems ahead.
Chapter 13, First Thing – Runbooks and Low Noise Outage Notifications, discusses how well-designed procedures and notifications prepare SREs for problems.
Chapter 14, Rapid Response – Outage Management Techniques, teaches positive SRE behaviors and how to keep interactions focused on resolution during a significant incident.
Chapter 15, Postmortem Candor – Long-Term Resolution, portrays how postmortems should lead to actions that will make systems more reliable.
Chapter 16, Chaos Injector – Advanced Systems Stability, clarifies how SREs inject chaos into systems to learn more and use gamification to hone their skills.
Chapter 17, Interview Advice – Hiring and Being Hired, displays how companies should hire SREs and how SREs should demonstrate their knowledge during an interview.
Appendix A, The Site Reliability Engineer Manifesto, depicts the primary responsibilities of any SRE in the world.
Appendix B, The 12-Factor App Questionnaire, consolidates a series of questions to test whether an application design is reliable according to the twelve-factor app manifesto from Heroku.
To get the most out of this book
We purposefully used SRE as the acronym for site reliability engineer and kept site reliability engineering in its extended form throughout the book. For us, site reliability engineering is only accomplishable if you have an SRE and not the other way around. Although it’s common to see SRE standing for both site reliability engineer and engineering interchangeably, we want to emphasize the persona and the who in this book.
This book contains simulation labs to give its readers practical knowledge. Each has a prerequisite knowledge set, such as Kubernetes, cloud computing, or software development. This book does not aim to teach you specific technologies and products, but rather the most effective practices and principles, which are technology agnostic. However, we must adopt some technology to demonstrate the site reliability engineering concepts and techniques. For that, we preferred open source software and platforms with free tier accounts in the labs.
Each simulation lab states its learning requirements and points to where the reader can find more information and instructions. We divided each practical exercise into three parts:
Lab architecture
Lab contents
Lab instructions
The lab architecture explains the big picture around the design and connections among its main components. The contents section explains what’s inside the GitHub repository, such as files and folders. And the lab instructions have a procedure for installing, configuring, and using the lab properly.
The following is a list of software covered in this book’s simulation labs and the required execution environment:
You will require a laptop with reasonable access to the internet to work in the book’s labs.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Becoming-a-Rockstar-SRE. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/W6q5Y.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Within this repository, under the Chapter-8 folder, there is just one subfolder called terraform.
A block of code is set as follows:
provider "google" {
  credentials = file("project-service-account-key.json")
  project     = "autoscaling-simulation-lab"
  region      = "southamerica-east1"
  zone        = "southamerica-east1-a"
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
resource "google_compute_autoscaler" "foobar" {
  ...
  autoscaling_policy {
    max_replicas    = 5
    min_replicas    = 1
    cooldown_period = 60
  }
}
Any command-line input or output is written as follows:
$ terraform init
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: To do this, we navigate to the Settings tab in our GitHub repository. Then select Secrets on the left side, and choose Actions.
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
Join the SRE community: We invite you to join the large Site Reliability Engineers community at the sreterminus.slack.com Slack public workspace.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you’ve read Becoming a Rockstar SRE, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there: you can also get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/9781803239224
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits directly to your email.
Part 1 - Understanding the Basics of Who, What, and Why
In this first part, you will learn about site reliability engineering, its roots, and current usage outside Google. We emphasize how the site reliability engineer (SRE) persona is the center of gravity of everything orbiting systems reliability. It’s impossible to talk about site reliability engineering without discussing the business of software development, so we tie that business not only to the statistics used for reliability but also to how those statistics affect what companies are ultimately interested in: customer satisfaction and revenue. Finally, we’ll explore why the lack of reliability persists in organizations and discuss some of the lesser-known truths that make site reliability engineering critical and complex.
The following chapters will be covered in this section:
Chapter 1, SRE Job Role – Activities and Responsibilities
Chapter 2, Fundamental Numbers – Reliability Statistics
Chapter 3, Imperfect Habits – Duct Tape Architecture and Spaghetti Code
1
SRE Job Role – Activities and Responsibilities
A lot has been said about site reliability engineering: what it is, what it is not, and the many practices and techniques that we should apply to adopt the site reliability engineering model. Who site reliability engineers (SREs) are is often put aside, even though it is a crucial aspect. The same goes for how people from various parts of information technology (IT) become SREs and how some of them come to be recognized as thought leaders in this domain.
In particular, little has been said about the site reliability engineer persona and the questions that define it:
What do they know?
Which skills have they developed?
What do they do daily?
What are their primary responsibilities?
Those characteristics would explain, at a bare minimum, why someone should start the journey to becoming an SRE rockstar. That’s precisely why we decided to start this book by outlining the SRE job role.
In this chapter, we’re going to cover the following main topics:
Making this journey personal
Understanding the mindset and hobbies of an SRE
DevOps engineers versus SRE versus others
Describing an SRE’s main responsibilities
An overview of the daily activities of an SRE
People that inspire
Making this journey personal
Unfortunately, when an enterprise starts to adopt site reliability engineering into its IT governance processes, it often fails to use a people-processes-tools (PPT) model, with a clear vision of each pillar, to transform its operations and software development areas. Even more often, it doesn’t emphasize the people element of PPT in such transformations. We want to change that by making this learning journey personal and centered on individuals rather than on the processes or technologies involved.
It’s critical to understand (and learn) what drives typical SREs forward, which fundamental skills they have developed, and how they hone their skills over time to go above and beyond at work. For that purpose, we will divide this subject into three sections:
SRE driving forces
SRE skills
SRE traits
Let’s start this personal journey by understanding why you should become an SRE.
SRE driving forces
We want to explore what motivates or incentivizes site reliability engineers. There’s no journey of any nature if there is no driving force pushing you through. As a word of advice, we should warn you that learning about site reliability engineering is more of an expedition than a tourism trip. In other words, it’s more a marathon than a sprint. Having clarified that, we’ll begin by putting the possible rewards of this journey on the table. Let’s depict each driving force as a mockup code snippet (JavaScript) to make it fun.
Money
If we could represent in the form of an algorithm how money drives people when they don’t earn enough, it would look like the following:
// money
if (money < MyMinimumSalary) {
  motivated = false;
  excitement--;
}
doMyWork();
if (motivated && jobSatisfaction) {
  honeSRESkills();
  doExtraWork();
} else {
  lookForAnotherJob();
}
Site reliability engineers make more money than most other technical professionals. According to a Glassdoor (2022) report, they earn more than USD 118K per year on average. In similar reports, SREs are even noted to have surpassed DevOps engineers in a salary comparison. Nevertheless, not making enough money can be a key demotivating factor. It is hard for anyone to move forward with their career if they are preoccupied with expenses.
Although SREs have a notable income on average, their salaries vary by country, years of experience, and employer. Companies justify SRE salary levels based on the reliability value they bring to the table. Rest assured, a site reliability engineering career is well rewarded in terms of compensation.
Job satisfaction
What affects our job satisfaction can be depicted as code logic as follows:
// jobSatisfaction
if (interestingJob || purposefulWorkActivities || challengingSkillDevelopment || technicalAppreciation) {
  jobSatisfaction = true;
  excitement++;
}
Job satisfaction is another driving force of site reliability engineers, and it has many factors. We usually translate job satisfaction to employee happiness at work. Site reliability engineering leads to job satisfaction when we look at the following profession characteristics: exciting job content, purposeful work activities, challenging skill development, and technical appreciation.
The job content of site reliability engineering spans multiple domains. You can work with developers one day and help systems administrators the next. You may need to assist in redesigning an app to increase its service reliability. As with any generalist job that demands technical depth in many subject areas, you are unlikely ever to get bored.
As we will see later in this chapter, SRE work activities have clear business value. They improve not just the service quality, availability, and resiliency, but also the system’s reliability. Reliable services might help with customer loyalty, bringing additional revenue to the service provider. There is a direct relationship between SRE work and business metrics improvement, making their efforts purposeful.
Since site reliability engineering is a cross-technology domain engineering discipline, any skills acquisition is challenging. SREs have knowledge and skills that a systems administrator or software developer doesn’t have. They are required to keep those skills updated and hone them over time. This necessity to keep learning brings the always-moving-forward feeling that may not happen if you only need to master a single product or technology.
The last factor on our list is technical appreciation. According to Boston Consulting Group (BCG) research, appreciation is the number one job happiness factor. Being an SRE, you will aid customers, users, and other technical professionals because of your keen holistic view of the systems. Consequently, technical appreciation for the job you do is common, and who doesn’t like that?
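For readers who want to run the money and job satisfaction logic rather than just read it, here is a minimal sketch that combines both mock snippets into plain functions. The salary threshold, the shape of the person object, the default of motivated being true, and the function names evaluateDrivingForces and hasJobSatisfaction are all our own assumptions for illustration; the original snippets leave those details open.

```javascript
// Hypothetical threshold, chosen only for illustration.
const MY_MINIMUM_SALARY = 100000;

// Job satisfaction holds when at least one of the four factors is true,
// mirroring the OR condition in the mock snippet.
function hasJobSatisfaction(factors) {
  return [
    factors.interestingJob,
    factors.purposefulWorkActivities,
    factors.challengingSkillDevelopment,
    factors.technicalAppreciation,
  ].some(Boolean);
}

// Returns a new state object instead of mutating globals,
// so every branch can be checked in isolation.
function evaluateDrivingForces(person) {
  let excitement = person.excitement;
  let motivated = true; // assumption: motivated unless underpaid

  if (person.money < MY_MINIMUM_SALARY) {
    motivated = false;
    excitement--;
  }

  const jobSatisfaction = hasJobSatisfaction(person.factors);
  if (jobSatisfaction) {
    excitement++;
  }

  // The work always gets done; what follows depends on both driving forces.
  const actions = ["doMyWork"];
  if (motivated && jobSatisfaction) {
    actions.push("honeSRESkills", "doExtraWork");
  } else {
    actions.push("lookForAnotherJob");
  }

  return { motivated, jobSatisfaction, excitement, actions };
}

const result = evaluateDrivingForces({
  money: 120000,
  excitement: 0,
  factors: {
    interestingJob: true,
    purposefulWorkActivities: false,
    challengingSkillDevelopment: false,
    technicalAppreciation: false,
  },
});
console.log(result.actions.join(", ")); // prints "doMyWork, honeSRESkills, doExtraWork"
```

Passing the state in and returning a new object keeps the sketch free of undeclared global variables, which the mock snippets rely on for brevity.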
Innovative solutions
The following code gives you an idea of how exciting exploring uncharted terrains is:
if (!solutionExists) {
  deviseNewSolution();
  excitement++;
}
Site reliability engineers are natural trailblazers as they explore new technologies and processes to obtain better reliability and eliminate toil (manual and repetitive tasks that are devoid of value). They face many scenarios and situations that are a first of their kind. Moreover, they are responsible for paving the path for others by documenting procedures in runbooks when none exist. There’s nothing more exciting than devising new solutions or improving existing ones. Imagine how you would feel if they named a technical operating procedure after you.
Nevertheless, SREs want to minimize complexity and reduce technical debt. They don’t create a solution just for the sake of doing it unless it adds value and resolves or prevents events that impact customers.
Good relationships
The following code snippet is a representation of how good relationships are a result of an exciting working environment:
if (excitement > HIGH) {
  motivateOthers();
  relationships.healthy = true;
}
Also, good work environment relationships are one of the top 10 factors contributing to employee happiness. SREs have good relationships in their work environment. The reason is straightforward; they act as integration hubs among different tribes and have the mission to break company siloes. SREs need cooperation from both development and operations teams. They are technical diplomats and have strong communication skills. Since they are