Becoming a Rockstar SRE: Electrify your site reliability engineering mindset to build reliable, resilient, and efficient systems
Ebook, 1,017 pages, 7 hours

About this ebook

Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples.
This book will acquaint you with the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters and exploring observability, outages, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions.
By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!

Language: English
Release date: Apr 28, 2023
ISBN: 9781804614563

    Book preview

    Becoming a Rockstar SRE - Jeremy Proffitt


    BIRMINGHAM—MUMBAI

    Becoming a Rockstar SRE

    Copyright © 2023 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Mohd Riyan Khan

    Publishing Product Manager: Surbhi Suman

    Senior Editor: Romy Dias

    Technical Editor: Shruthi Shetty

    Copy Editor: Safis Editing

    Project Coordinator: Ashwin Kharwa

    Proofreader: Safis Editing

    Indexer: Tejal Daruwale Soni

    Production Designer: Alishon Mendonca

    Marketing Coordinator: Agnes D’souza

    First published: March 2023

    Production reference: 1290323

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    ISBN 978-1-80323-922-4

    www.packtpub.com

    For my wonderful wife, who still likes me after 18 years. I like you too.

    – Jeremy Proffitt

    To my God, wife Tati, and son Gabe.

    – Rod Anami

    Contributors

    About the authors

    Jeremy Proffitt (born January 1977) is obsessed with constantly improving systems and solving problems with an unmatched sense of urgency – the definition of a Site Reliability Engineer (SRE). A master of solutions and technological knowledge, Jeremy is a rockstar SRE with AWS professional certifications in Architecture and DevOps – and has routinely saved millions in potential lost revenue in his career. In his free time, Jeremy enjoys spending time in his rockstar-appropriate technology cave and loves venturing into 3D printing, electronics, and Internet of Things (IoT) projects. By day, Jeremy currently manages a team of top SRE and DevOps talent driving constant improvement and is often cited in the company as a visionary in terms of observability and emergency response.

    To the leaders who have helped me see the truth in our work and friends who have stood by and given me the encouragement to follow the wonders of technology, often while in awe of their own work, I say thank you! To my arch-enemies, you have been a wonderful addition that has always challenged me to become better. And finally, to my wife, Jamie, who I still desperately love after 18 years – and mind you, still likes me – I still remember our first date when you took my arm, you stole my heart, and in all our years, I’ve never felt you let go once.

    Rod Anami is a seasoned engineer who works with cloud infrastructure and software engineering technologies. As one of the SREs at the Kyndryl CoE, he coaches other SREs on running IT modernization, transformation, and automation projects for clients worldwide. Rod leads the global SRE guild inside Kyndryl, where he helps plant and grow SRE chapters in many countries. Rod is certified as an SRE, technical specialist, and DevOps engineer professional at the ultimate level. He holds AWS, HashiCorp, Azure, and Kubernetes certifications, among many others. He is passionate about contributing to open source software at large with Node.js libraries.

    I want to thank my wonderful wife, Tatiana, and my beloved son Gabriel, for giving me the space and support needed to write this book. My parents, Shizuo and Rita, for raising me with solid character. The Google site reliability engineering organization made this fantastic approach and profession open source. I want to thank Kyndryl for backing me on this journey. I had many bosses and leaders, good, bad, and inspiring ones. I want to mention a few who impacted my career immensely by helping me acquire the skills and knowledge for this book: Marcos Cimmino, Tara Sims, Andy Barnes, and Gene Brown. Nothing great is accomplished alone: it requires effort, endurance, enjoyment, colleagues, and God.

    About the reviewers

    Chris Smith is a strategic IT leader with a proven track record across the financial service industry. His passion is to lead organization-wide transformational efforts for Fortune 500 institutions within digital and contact center technology and operations. He is skilled at driving agile adoption, building an engineering-first mindset, and facilitating cloud modernization of core banking services at scale.

    Itohanoghosa Eregie is the founder of techinanutshellhack, a platform dedicated to explaining cloud and SRE concepts in their simplest form through short video clips on LinkedIn. She worked as a software developer at Cyberspace Limited before finding her passion as a platform engineer, which earned her an opportunity to work with Dell EMC as a resident platform engineer for one of Africa’s largest telecommunications companies, MTN Nigeria. Altoros Americas currently employs her as a VMware Tanzu engineer, involved in customer engagement. Itohan is passionate about building resilient systems in the cloud and ensuring organizations adhere to SRE practices.

    Brannen Taylor has almost 30 years of experience in corporate IT from the healthcare, managed services, power, hosted DR, and financial services industries. He has worked with small mom-and-pop operations up to ITIL-heavy Fortune 10 companies. He was a network engineer for 20 years and has been a network operations manager for the past 2 years. He has certifications from many vendors such as Nortel, Cisco, and Palo Alto, as well as a few that are vendor-agnostic, many cloud certifications from AWS and Azure, and is now moving into Network DevOps (NetDevOps), focusing on Nautobot, Ansible, and various vendor SDKs. He enjoys scuba diving with his wife and friends and has two grown children.

    I would like to thank God for leading me into a career that I love. I want to thank my children for only eye-rolling me a little when I launch into an explanation about binary when they ask me how email works. I want to thank my wife Lara for putting up with me being on call these past 23 years, working unexpectedly long days, nights, and weekends, and non-stop studying. Thank you to my colleagues and the friends I’ve made along the way.

    Gene Brown is the Vice President and a Distinguished Engineer at Kyndryl. He leads the SRE profession and certification program and is the global site reliability engineering leader. He is responsible for driving the enablement of SREs across Kyndryl’s countries, practices, and strategic markets through a Center of Excellence with SRE chapter leaders across the services organization globally.

    Gene enjoys spending time with clients interested in adopting SRE and likes comparing notes on what has worked well and how to overcome the challenges that come with cultural change. Gene was the co-founder of IBM’s and Kyndryl’s SRE profession with a focus on certifying SREs based on their applied experience in the field of site reliability engineering.

    Table of Contents

    Preface

    Part 1 - Understanding the Basics of Who, What, and Why

    1

    SRE Job Role – Activities and Responsibilities

    Making this journey personal

    SRE driving forces

    SRE skills

    SRE traits

    Understanding the mindset and hobbies of an SRE

    SRE affinity game

    SRE guiding principles

    SRE hobbies

    DevOps engineers versus SRE versus others

    DevOps and site reliability engineers

    Software and site reliability engineers

    Describing an SRE’s main responsibilities

    An overview of the daily activities of an SRE

    People that inspire

    Jeremy’s recognition – Paul Tyma, former CTO, LendingTree

    Rod’s recognition – Ingo Averdunk, Distinguished Engineer, IBM, and Gene Brown, Distinguished Engineer, Kyndryl

    Summary

    Further reading

    2

    Fundamental Numbers – Reliability Statistics

    SLA commitment – a conversation, not a number

    Internal partner SLAs

    External partner SLAs

    The cost of more 9s in an SLA

    A final word on SLAs

    Defining and leveraging SLOs and SLIs

    SLOs

    SLOs and time

    Tracking outage frequency with the MTBF

    Measuring the downtime with the MTTR

    Understanding the customer and revenue impact

    Transparency in outages

    The rockstar SRE’s SLA

    Summary

    3

    Imperfect Habits – Duct Tape Architecture and Spaghetti Code

    The business of software development – let’s start with the dollars

    Defining the value of software to a business

    The value of protecting business

    The value of growing a business

    The value of saving labor costs

    The A/B testing mindset – the art of change in customer interaction

    A/B testing in customer flows

    Analyzing the results of A/B testing

    Leveraging A/B testing to satisfy quarterly numbers

    Dedication to the craft of development – and why some are just here for a job

    A quick guide to communicating with your colleagues

    Reviewing the merge request – it’s about training, oversight, and reliability

    Avoiding the typical rubber stamp mentality

    A word on production deployments

    Why businesses want us to outright ignore best practices

    The truth about the ownership of a developer’s time

    Understanding the flaws in how we estimate development cost

    Fast, good, cheap – pick one

    Why is observability the answer to reliability issues?

    The cost of highly available architecture

    Mixing good and bad – tricks to wrapping bad code and making it resilient

    Alerting that fires actions

    Adding additional logging to monitor potential issues

    Using try catch to encapsulate exceptions

    Retries to the rescue…or not

    Summary

    Part 2 - Implementing Observability for Site Reliability Engineering

    4

    Essential Observability – Metrics, Events, Logs, and Traces (MELT)

    Technical requirements

    Accomplishing systems monitoring and telemetry

    Monitoring targets for infrastructure

    Monitoring types and tools

    Monitoring golden signals

    Monitoring data

    Understanding APM

    Getting to know topology self-discovery, the blast radius, predictability, and correlation

    Alerting – the art of doing it quietly

    The user perspective notification trigger principle

    Event-to-incident mapping principle

    Mixing everything into observability

    Outages versus downtime

    Observability architecture

    Observability effectiveness

    In practice – applying what you have learned

    Lab architecture

    Lab contents

    Lab instructions

    Summary

    Further reading

    5

    Resolution Path – Master Troubleshooting

    Properly defining the problem – and what to ask and not ask

    Source of information

    The knowledge base of the reporter

    Naming conventions

    False urgency

    Executive summary

    Breaking down and testing systems

    Breaking down hardware versus the operating system

    Breaking down a web API

    Understanding the steps

    The problems with this method of troubleshooting

    Previous and common events – checking for the simple problems

    Prior Root Cause Analysis (RCA) documents

    Timeline analysis

    Comparison

    The best approach

    Effective research both online and among peers

    The art of the Google search

    Skimming the content quickly and refining it

    Never forget your internal resources

    Breaking down source code efficiently

    Code you’ve never seen

    When that fails

    Logging plus code

    In practice – applying what you’ve learned

    Summary

    6

    Operational Framework – Managing Infrastructure and Systems

    Technical requirements

    Approaching systems administration as a discipline

    Design

    Installation

    Configuration

    App deployment

    Management

    Upgrade

    Uninstallation

    Understanding IT service management

    ITIL

    DevOps

    Seeing systems administration as multiple layers and multiple towers

    Automating systems provisioning and management

    Infrastructure as Code

    Immutable infrastructure

    In practice – applying what you’ve learned

    Lab architecture

    Lab contents

    Lab instructions

    Summary

    Further reading

    7

    Data Consumed – Observability Data Science

    Technical requirements

    Making data-driven decisions

    Defining the question and options

    Determining which data to use

    Identifying which data is already available

    Collecting the missing data

    Analyzing all datasets together

    Presenting the decision as a record

    Documenting the lessons learned in the process

    Solving problems through a scientific approach

    Formulation

    Hypothesis

    Prediction

    Experiment

    Analysis

    Understanding the most common statistical methods

    Percentages

    Mean, average, and standard deviation

    Quantiles and percentiles

    Histograms

    Using other mathematical models in observability

    Visualizing histograms with Grafana

    In practice – applying what you’ve learned

    Lab architecture

    Lab contents

    Lab instructions

    Summary

    Further reading

    Part 3 - Applying Architecture for Reliability

    8

    Reliable Architecture – Systems Strategy and Design

    Technical requirements

    Designing for reliability

    Architectural aspects

    Reliability equations

    Design patterns

    Modern applications

    Splitting and balancing the workload

    Splitting

    Balancing

    Failing over – almost as good

    Scaling up and out – horizontal versus vertical

    Horizontal

    Vertical

    Autoscaling

    In practice – applying what you’ve learned

    Lab architecture

    Lab contents

    Lab instructions

    Summary

    Further reading

    9

    Valued Automation – Toil Discovery and Elimination

    Technical requirements

    Eliminating toil

    Toil redefined

    Why toil is bad

    Handling toil the right way

    Treating automation as a software problem

    Document

    Algorithm

    Code

    Automating the (in)famous CI/CD pipeline

    Continuous integration

    Continuous delivery

    Production releases

    In practice – applying what you’ve learned

    Lab architecture

    Lab contents

    Lab instructions

    Summary

    Further reading

    10

    Exposing Pipelines – GitOps and Testing Essentials

    A basic pipeline – building automation to deploy infrastructure as code architecture and code

    Pipelines in chronological order

    Pipeline templates

    Errors or breaks in pipelines

    Using containers in pipelines

    Pipeline artifacts

    Pipeline troubleshooting tips

    Automating compliance and security in pipelines

    Library age

    Application security testing

    Dynamic Application Security Testing (DAST)

    Static Application Security Testing (SAST)

    Secrets scanning

    Automated linting for code quality and standards

    Compiling with linting feedback

    Validating functionality during deployment with automated testing

    Why is testing so important to reliability?

    Test data

    The types of testing

    When to test a pipeline

    Testing observability

    Automated rollbacks

    The reduction of developer toil through automated processes

    What is the impact of addressing toil?

    In practice – applying what you’ve learned

    Preparing AWS for the lab

    Creating your repository

    Adding secrets to your repository

    Downloading and committing the lab files

    Understanding the pipeline

    Adding more steps

    Testing but not deploying

    Lab final thoughts

    Summary

    11

    Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes

    Technical requirements

    The multiple definitions of serverless

    Serverless Framework

    Serverless computing

    Serverless functions

    Monitoring serverless functions

    Errors

    Containers and why we love them

    Isolation

    Immutability

    Promotability

    Tagging

    Rollbacks

    Security

    Signable

    Monitoring containers

    Kubernetes and other ways to orchestrate containers

    Health checks

    Crashing and force-closing containers

    HTTP-based load balancing

    Server load balancing

    Containers as a Service (CaaS)

    Simple container orchestration

    Kubernetes

    Deployment techniques and workers

    Traditional replacement deployment

    Rolling deployment

    A/B or blue/green deployment

    Canary deployment

    Automation and rolling back failed deployments

    Rollback metrics

    When to roll back

    How to roll back

    In practice – applying what you’ve learned

    Leveraging Gitpod – a containerized workspace

    The emulation source code

    Running the emulation

    Summary

    12

    Final Exam – Tests and Capacity Planning

    Technical requirements

    Understanding types of testing

    Development tests

    Build tests

    Delivery tests

    Deployment tests

    Production tests

    Adopting TDD

    Unit testing the hard way

    Unit testing with a framework

    Using test automation frameworks

    Staying ahead with capacity planning

    Load test data

    The capacity curve

    The demand curve

    In practice – applying what you’ve learned

    Lab architecture

    Lab contents

    Lab instructions

    Summary

    Further reading

    Part 4 - Mastering the Outage Moments

    13

    First Thing – Runbooks and Low Noise Outage Notifications

    Technical requirements

    What makes a good runbook – the basics

    Runbooks as living documents

    Understanding the runbook audience knowledge level

    Runbook audience permissions

    What do you put into a runbook anyway?

    Beyond the runbook – code and comments

    Quickly understanding source code

    Searching source code for your needle in a haystack

    Commenting for understanding

    What’s in a good dashboard?

    Types of dashboards

    NOC-style red and green

    Displaying trends

    Aggregates and breakdowns

    What dashboards are not

    The basics of priority levels

    Response effort

    Engineer retention

    Incident response systems and priority

    Incident response systems and phone-based alerts

    What is a priority one event?

    Defining priority based on...

    The priority level of observability failures

    Forcing the priority – the rockstar way!

    Adjusting alerts

    Logs and alerting

    Pausing alerts

    In practice – applying what you’ve learned

    Defining priority levels

    Custom hat pricing API runbook

    Alerting

    Summary

    14

    Rapid Response – Outage Management Techniques

    Where to meet – an effective strategy for communicating good information

    Online collaboration

    In-person collaboration

    The historical data found in outage responses

    Participants

    Follow-up work

    Leveraging the people involved in the response

    Tasks

    Participants and personalities

    Break strategy and stress management

    The opportunity to respond at the right time

    Training

    Runbook and contact list revisions

    Team building

    Executive messaging bugs in the ear

    Opportunities to call out during the RCA

    Messaging customers and leadership

    Customer versus leadership messaging

    Cadence

    Email groups

    Status sites

    Over-messaging

    Notes, notes, notes...

    In practice – applying what you’ve learned

    Outage and alarm

    Notification and response

    Troubleshooting

    The conclusion

    Summary

    15

    Postmortem Candor – Long-Term Resolution

    The content of the postmortem in executive summary style

    Executive summary style

    Overview

    Impact

    Timeline

    Detailed technical description

    Response

    Resolution

    Future actions

    Decisions are not blame

    Business is business

    Resource and time constraints

    Monitoring

    The cost of more reliability as a business decision

    Active:Active

    Manual failover

    Cost of time to identify

    The cost of time to move a load

    Hidden development costs

    Training and skill sets – they matter

    Identifying gaps

    Training and certification targets

    Creating future action plans

    Immediate follow-up

    Who to involve

    Timelines and priority

    Assigning ownership

    Tracking the work

    In-practice – an example of a postmortem

    Writing the overview

    Rounding out the postmortem

    Custom Hat Company postmortem

    Impact

    Timeline

    Technical details and response

    Resolution

    Future actions

    Summary

    Part 5 - Looking into Future Trends and Preparing for SRE Interviews

    16

    Chaos Injector – Advanced Systems Stability

    Technical requirements

    Comprehending the wheel-of-misfortune game

    All ends are new beginnings

    Lessons to be learned

    Role-playing scenarios

    A little bit of gamification

    Understanding chaos engineering for reliability

    Principles of chaos engineering

    Chaos system architecture

    Chaos experiments

    In practice – employing the wheel-of-misfortune game

    Lab architecture

    Lab contents

    Lab instructions

    In practice – injecting chaos into systems

    Lab architecture

    Lab contents

    Lab instructions

    Summary

    Further reading

    17

    Interview Advice – Hiring and Being Hired

    What we’re looking for in a candidate

    Are you qualified?

    Entry-level SRE job

    Problem-solving

    The ability to accept feedback and direction

    A broad knowledge base and skill set

    Research and learning skill set

    The ability to say No

    Culture fit

    The X factor

    Passion

    Experience

    Personal responsibility

    Common interview questions and answers

    Technical questions

    Non-technical questions

    Insightfully odd questions

    What should you look for in a career?

    Define a good boss

    Dotted line reporting

    Morals

    Researching the company

    Business model

    Profitability for the next decade

    Structure

    Large versus small

    Public versus private

    Online reviews

    Are you over-or under-certified?

    Certifications that matter

    How many are too many certifications?

    Relevancy

    Tips for landing the job with a great salary

    Interview tips

    Salary negotiations

    Summary

    Appendix A – The Site Reliability Engineer Manifesto

    The manifesto

    How to adopt it

    How to contribute to it

    Appendix B – The 12-Factor App Questionnaire

    The questionnaire

    Factor I – Code base

    Factor II – Dependencies

    Factor III – Config (configuration)

    Factor IV – Backing (backend) services

    Factor V – Build, release, run

    Factor VI – Processes

    Factor VII – Port binding

    Factor VIII – Concurrency

    Factor IX – Disposability

    Factor X – Development/production (dev/prod) parity

    Factor XI – Logs

    Factor XII – Admin processes

    How to adopt this questionnaire

    How to contribute to this questionnaire

    Index

    Other Books You May Enjoy

    Preface

    Site reliability engineering relates to constant improvement, bridging business and product issues as per customer requirements and technology limitations, thereby generating higher revenue. Quantifying and understanding reliability, resource handling, and developer needs can sometimes be overwhelming. Becoming a Rockstar SRE explores reliability from an infrastructure and coding perspective and uses real-world examples to bring forth the site reliability engineer (SRE) persona.

    This book will acquaint you with who an SRE is, followed by discussions on the why and how of site reliability engineering. It walks you through the jobs of an SRE, from automation of continuous integration/continuous delivery (CI/CD) pipelines and reducing toil to the details of reliability and the best practices to excel in it. You’ll learn why harmful code is created and how to circumvent that with reliable designs and patterns. You’ll explore how to interact and negotiate with businesses and vendors on various technical matters. You’ll then deep dive into observability, outage, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications, interview tips, and questions.

    By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!

    Who this book is for

    This book is intended for IT professionals, from developers looking to advance into an SRE role to system administrators mastering technologies and executives experiencing repeated downtime in their organizations. This book will also be helpful to anyone interested in bringing reliability and automation to their organization to drive down customer impact and revenue loss while increasing development throughput. A basic understanding of API and web architecture and some experience with cloud computing and services will be helpful while reading this book.

    What this book covers

    Chapter 1, SRE Job Role – Activities and Responsibilities, talks about the site reliability engineer persona, addressing who an SRE is.

    Chapter 2, Fundamental Numbers – Reliability Statistics, shows how site reliability engineering work and its business impact are measured.

    Chapter 3, Imperfect Habits – Duct Tape Architecture and Spaghetti Code, explains why systems are naturally unreliable.

    Chapter 4, Essential Observability – Metrics, Events, Logs, and Traces (MELT), discusses how we go from monitoring to true observability.

    Chapter 5, Resolution Path – Master Troubleshooting, lectures on the SRE way of precisely and concisely troubleshooting.

    Chapter 6, Operational Framework – Managing Infrastructure and Systems, describes why and how SREs tackle operational work and not just engineering duties.

    Chapter 7, Data Consumed – Observability Data Science, teaches the basic mathematical models and statistical methods for SREs.

    Chapter 8, Reliable Architecture – Systems Strategy and Design, describes systems thinking applied to reliability and reliable architectural patterns.

    Chapter 9, Valued Automation – Toil Discovery and Elimination, familiarizes readers with a critical pillar of site reliability engineering: making operations scalable.

    Chapter 10, Exposing Pipelines – GitOps and Testing Essentials, illustrates how to leverage reliability inside DevOps delivery pipelines.

    Chapter 11, Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes, presents how workload management affects the reliability of systems.

    Chapter 12, Final Exam – Tests and Capacity Planning, demonstrates how good testing and capacity planning keep the performance of systems ahead.

    Chapter 13, First Thing – Runbooks and Low Noise Outage Notifications, discusses how well-designed procedures and notifications prepare SREs for problems.

    Chapter 14, Rapid Response – Outage Management Techniques, teaches positive SRE behaviors and how to keep interactions focused on resolution during a significant incident.

    Chapter 15, Postmortem Candor – Long-Term Resolution, portrays how postmortems should lead to actions that will make systems more reliable.

    Chapter 16, Chaos Injector – Advanced Systems Stability, clarifies how SREs inject chaos into systems to learn more and use gamification to hone their skills.

    Chapter 17, Interview Advice – Hiring and Being Hired, displays how companies should hire SREs and how SREs should demonstrate their knowledge during an interview.

    Appendix A, The Site Reliability Engineer Manifesto, depicts the primary responsibilities of any SRE in the world.

    Appendix B, The 12-Factor App Questionnaire, consolidates a series of questions to test whether an application design is reliable according to the twelve-factor app manifesto from Heroku.

    To get the most out of this book

    We purposefully used SRE as the acronym for site reliability engineer and kept site reliability engineering in its extended form throughout the book. For us, site reliability engineering is only accomplishable if you have an SRE and not the other way around. Although it’s common to see SRE standing for both site reliability engineer and engineering interchangeably, we want to emphasize the persona and the who in this book.

    This book contains simulation labs to give its readers practical knowledge. Each has a prerequisite knowledge set, such as Kubernetes, cloud computing, or software development. This book does not aim to teach specific technologies and products; it teaches the most effective practices and principles, which are technology agnostic. However, we must adopt some technology to demonstrate the site reliability engineering concepts and techniques. For that, we preferred open source software and platforms with free tier accounts in the labs.

    Each simulation lab states its learning requirements and points to where the reader can find more information and instructions. We divided each practical exercise into three parts:

    Lab architecture

    Lab contents

    Lab instructions

    The lab architecture explains the big picture around the design and connections among its main components. The contents section explains what’s inside the GitHub repository, such as files and folders. And the lab instructions have a procedure for installing, configuring, and using the lab properly.

    The following is a list of software covered in this book’s simulation labs and the required execution environment:

    You will require a laptop with reasonable access to the internet to work in the book’s labs.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Becoming-a-Rockstar-SRE. If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/W6q5Y.

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Within this repository, under the Chapter-8 folder, there is just one subfolder called terraform.

    A block of code is set as follows:

    provider "google" {
      credentials = file("project-service-account-key.json")
      project = "autoscaling-simulation-lab"
      region  = "southamerica-east1"
      zone    = "southamerica-east1-a"
    }

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    resource "google_compute_autoscaler" "foobar" {
    ...
      autoscaling_policy {
        max_replicas    = 5
        min_replicas    = 1
        cooldown_period = 60
      }
    }

    Any command-line input or output is written as follows:

    $ terraform init

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: To do this, we navigate to the Settings tab in our GitHub repository. Then select Secrets on the left side, and choose Actions.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    Join the SRE community: We invite you to join the large Site Reliability Engineers community at the sreterminus.slack.com Slack public workspace.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share Your Thoughts

    Once you’ve read Becoming a Rockstar SRE, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    https://packt.link/free-ebook/9781803239224

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly

    Part 1 - Understanding the Basics of Who, What, and Why

    In this first part, you will learn about site reliability engineering, its roots, and current usage outside Google. We emphasize how the site reliability engineer (SRE) persona is the center of gravity of everything orbiting systems reliability. It’s impossible to talk about site reliability engineering without discussing the business of software development. We tie that discussion not only to the statistics used for reliability but also to how those statistics affect what companies are ultimately interested in: customer satisfaction and revenue. Finally, we’ll explore why the lack of reliability persists in organizations and discuss some of the lesser-known truths that make site reliability engineering critical and complex.

    The following chapters will be covered in this section:

    Chapter 1, SRE Job Role – Activities and Responsibilities

    Chapter 2, Fundamental Numbers – Reliability Statistics

    Chapter 3, Imperfect Habits – Duct Tape Architecture and Spaghetti Code

    1

    SRE Job Role – Activities and Responsibilities

    A lot has been said about site reliability engineering: what it is, what it is not, and the many practices and techniques we should apply to adopt the site reliability engineering model. Who site reliability engineers (SREs) are is often put aside, even though it is a crucial aspect. So is how people from various parts of information technology (IT) become SREs and how some of them come to be recognized as thought leaders in this domain.

    In particular, little has been said about the site reliability engineer persona, which the following questions capture:

    What do they know?

    Which skills have they developed?

    What do they do daily?

    What are their primary responsibilities?

    Those characteristics would explain, at a bare minimum, why someone should start the journey to becoming an SRE rockstar. That’s precisely why we decided to start this book by outlining the SRE job role.

    In this chapter, we’re going to cover the following main topics:

    Making this journey personal

    Understanding the mindset and hobbies of an SRE

    DevOps engineers versus SRE versus others

    Describing an SRE’s main responsibilities

    An overview of the daily activities of an SRE

    People that inspire

    Making this journey personal

    Unfortunately, when an enterprise starts to adopt site reliability engineering into its IT governance processes, it often fails to use a people-processes-tools (PPT) model, with a clear vision of each pillar, to transform its operations and software development areas. Even more often, it doesn’t emphasize the people element of PPT in such transformations. We want to change that by making this learning journey personal and centered on individuals rather than on the processes or technologies involved.

    It’s critical to understand (and learn) what drives typical SREs forward, which fundamental skills they have developed, and how they hone their skills over time to go above and beyond at work. For that purpose, we will divide this subject into three sections:

    SRE driving forces

    SRE skills

    SRE traits

    Let’s start this personal journey by understanding why you should become an SRE.

    SRE driving forces

    We want to explore what motivates or incentivizes site reliability engineers. There’s no journey of any nature if there is no driving force pushing you through. As a word of advice, we should warn you that learning about site reliability engineering is more of an expedition than a tourism trip. In other words, it’s more a marathon than a sprint. Having clarified that, we’ll begin by putting the possible rewards of this journey on the table. Let’s depict each driving force as a mockup code snippet (JavaScript) to make it fun.

    Money

    If we could represent in the form of an algorithm how money drives people when they don’t earn enough, it would look like the following:

    // money
    if (money < MyMinimumSalary) {
        motivated = false;
        excitement--;
    }
    doMyWork();
    if (motivated && jobSatisfaction) {
        honeSRESkills();
        doExtraWork();
    } else {
        lookForAnotherJob();
    }

    Site reliability engineers make more money than most other technical professionals. According to a Glassdoor (2022) report, they can earn more than USD 118K per year on average. In similar reports, SREs are even noted to have surpassed DevOps engineers in a salary comparison. Nevertheless, not making enough money can be a key demotivating factor. It is hard for anyone to move forward with their career if they are preoccupied with expenses.

    Although SREs earn a notable income on average, their salaries vary by country, years of experience, and employer. Companies justify SRE salary levels based on the reliability value they bring to the table. Rest assured, site reliability engineering is a well-compensated career.

    Job satisfaction

    What affects our job satisfaction can be depicted as code logic as follows:

    // jobSatisfaction
    if (interestingJob || purposefulWorkActivities || challengingSkillDevelopment || technicalAppreciation) {
        jobSatisfaction = true;
        excitement++;
    }

    Job satisfaction is another driving force of site reliability engineers, and it has many factors. We usually translate job satisfaction to employee happiness at work. Site reliability engineering leads to job satisfaction when we look at the following profession characteristics: exciting job content, purposeful work activities, challenging skill development, and technical appreciation.

    The job content of site reliability engineering spans multiple domains. You can work with developers one day and help systems administrators the next. You may need to assist in redesigning an app to increase its service reliability. As with any generalist job that demands technical depth in many subject areas, you are unlikely to get bored.

    As we will see later in this chapter, SRE work activities have clear business value. They improve not just the service quality, availability, and resiliency, but also the system’s reliability. Reliable services might help with customer loyalty, bringing additional revenue to the service provider. There is a direct relationship between SRE work and business metrics improvement, making their efforts purposeful.

    Since site reliability engineering is a cross-technology domain engineering discipline, any skills acquisition is challenging. SREs have knowledge and skills that a systems administrator or software developer doesn’t have. They are required to keep those skills updated and hone them over time. This necessity to keep learning brings the always-moving-forward feeling that may not happen if you only need to master a single product or technology.

    The last factor on our list is technical appreciation. According to Boston Consulting Group (BCG) research, appreciation is the number one job happiness factor. Being an SRE, you will aid customers, users, and other technical professionals because of your keen holistic view of the systems. Consequently, technical appreciation for the job you do is common, and who doesn’t like that?
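
    The money and job satisfaction mockups share state: motivated, jobSatisfaction, and excitement. As a minimal runnable sketch, they could be wired into one function as shown here. The function name, the input field names, the sample salaries, and the starting values (excitement at 0, motivated at true) are all assumptions made for illustration; the text does not specify them.

```javascript
// Runnable sketch combining the "money" and "jobSatisfaction" mockups.
// Field names, sample values, and starting values are illustrative assumptions.
function evaluateDrivingForces(person) {
  let excitement = 0;          // assumed starting level
  let motivated = true;        // assumed default until money says otherwise
  let jobSatisfaction = false;

  // jobSatisfaction: any one of the four factors is enough.
  if (
    person.interestingJob ||
    person.purposefulWorkActivities ||
    person.challengingSkillDevelopment ||
    person.technicalAppreciation
  ) {
    jobSatisfaction = true;
    excitement++;
  }

  // money: earning below the personal minimum demotivates.
  if (person.money < person.myMinimumSalary) {
    motivated = false;
    excitement--;
  }

  // The daily work happens either way; the extras need both forces.
  const actions = ["doMyWork"];
  if (motivated && jobSatisfaction) {
    actions.push("honeSRESkills", "doExtraWork");
  } else {
    actions.push("lookForAnotherJob");
  }

  return { motivated, jobSatisfaction, excitement, actions };
}

// An interesting job that pays below the minimum still ends in a job search.
const underpaid = evaluateDrivingForces({
  money: 50000,
  myMinimumSalary: 80000,
  interestingJob: true,
});
console.log(underpaid.actions.join(", ")); // doMyWork, lookForAnotherJob
```

    Returning the list of actions instead of calling doMyWork() and the other functions directly keeps the sketch self-contained, since the mockups never define those functions.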

    Innovative solutions

    The following code gives you an idea of how exciting exploring uncharted terrains is:

    if (!solutionExists) {
        deviseNewSolution();
        excitement++;
    }

    Site reliability engineers are natural trailblazers as they explore new technologies and processes to obtain better reliability and eliminate toil (manual and repetitive tasks that are devoid of value). They face many scenarios and situations that are a first of their kind. Moreover, they are responsible for paving the path for others by documenting procedures in runbooks when none exist. There’s nothing more exciting than devising new solutions or improving existing ones. Imagine how you would feel if they named a technical operating procedure after you.

    Nevertheless, SREs want to minimize complexity and reduce technical debt. They don’t create a solution just for the sake of doing it unless it adds value and resolves or prevents events that impact customers.

    Good relationships

    The following code snippet is a representation of how good relationships are a result of an exciting working environment:

    if (excitement > HIGH) {
        motivateOthers();
        relationships.healthy = true;
    }

    Good work environment relationships are also one of the top 10 factors contributing to employee happiness. SREs have good relationships in their work environment. The reason is straightforward: they act as integration hubs among different tribes and have the mission to break down company silos. SREs need cooperation from both development and operations teams. They are technical diplomats and have strong communication skills. Since they are
