
© Copyright Microsoft Corporation. All rights reserved.

FOR USE ONLY AS PART OF MICROSOFT VIRTUAL TRAINING DAYS PROGRAM. THESE MATERIALS ARE NOT AUTHORIZED
FOR DISTRIBUTION, REPRODUCTION OR OTHER USE BY NON-MICROSOFT PARTIES.
Microsoft Azure Virtual
Training Day:
Well-Architected
Well-Architected
Overview
Agenda

▪ Why is being well-architected important?

▪ Overview: Microsoft Azure Well-Architected

▪ Overcoming workload quality inhibitors

▪ How to get started? – Well-Architected Review & Azure Advisor Demo

▪ Resources
Data breaches cost you
—and your customers
Customer PII was the most frequently compromised, and the costliest, type of record in the latest data breach study*

$3.86M Average total cost of a data breach

80% Share of breaches that involved customer PII

$150 Average cost per record of customer PII

$175 Increased cost per record of customer PII in breaches caused by a malicious attack

$137,000+ Remote workforce impact on average total cost of data breaches
*Cost of a Data Breach Report 2020, IBM Security, Ponemon Institute.
Run Well-Architected cloud workloads—
to create value

Invest in these actions:
▪ Manage budget
▪ Improve workload security
▪ Increase incident response
▪ Streamline internal processes
▪ Find costly mistakes
▪ Enhance workload performance

To avoid these consequences:
▪ Expenses, losses
▪ Trust
▪ Damages
Well-architect—
optimize workloads for performance

▪ Build workloads with confidence with proven best practices
▪ Design performant workloads using deep technical guidance
▪ Optimize workloads with actionable focus areas
Microsoft Well-Architected—
Build and manage high-performing workloads

[Diagram: the Azure Well-Architected Framework and its five pillars (Reliability, Cost Optimization, Operational Excellence, Performance Efficiency, Security), surrounded by the Azure Well-Architected Review, Azure Advisor, Design Principles, Documentation, Reference Architectures, and Partners, Support & Service Offers]
Building well-architected workloads—
is a shared responsibility
Scope of
Well-Architected
Assessments
Customer application
Customer app or workload, built on the Azure platform

Platform features
Optional Azure capabilities a customer enables – to ensure security, reliability, operability, performance

Platform foundation
Core capabilities built into the Azure platform – how the foundation is designed, operated, and monitored
Business requirements influence decisions about workload architectures

DEV/TEST WORKLOADS

MISSION-CRITICAL WORKLOADS

SECURING ALL WORKLOADS

What tradeoff decisions must you make in a business context?
Overcoming workload quality inhibitors

Cost Optimization
▪ No cost and usage monitoring
▪ Unclear on underused/orphaned resources
▪ Lack of structured billing management
▪ Budget reductions from lack of support for cloud adoption by leadership

Operational Excellence
▪ No rapid issue identification
▪ No deployment automation
▪ No communication mechanisms & dashboards
▪ Unclear expectations and business outcomes
▪ No visibility on root cause for events

Performance Efficiency
▪ No monitoring of new services
▪ No monitoring of current workload health
▪ No design for scaling
▪ Lack of rigor and guidance for technology and architecture selection

Reliability
▪ Unclear on resiliency capabilities for improved architecture design
▪ Lack of data backup practices
▪ No monitoring of current workload health
▪ No resiliency testing
▪ No support for disaster recovery

Security
▪ No access control mechanism (authentication)
▪ No security threat detection mechanism
▪ Lack of security threat response plan
▪ No encryption process
Best practices to drive workload quality
Cost Optimization
▪ Azure Hybrid Benefit
▪ Reserved Instances
▪ Shutdown
▪ Resize
▪ Move to PaaS

Operational Excellence
▪ DevOps
▪ Deployment
▪ Monitor
▪ Processes & cadence

Performance Efficiency
▪ Design for scaling
▪ Monitor performance

Reliability
▪ Define requirements
▪ Test with simulations & forced failovers
▪ Deploy consistently
▪ Monitor workload health
▪ Respond to failure & disaster

Security
▪ Identity & access management
▪ Infra protection
▪ App security
▪ Data encryption & sovereignty
▪ Security operations
How do you get started?

Optimize existing workloads
▪ Identify optimization opportunities with the Azure Advisor Score
▪ Understand necessary changes or past incident occurrences
▪ Review technical guidance of Well-Architected Framework
▪ Consider architecture design tradeoffs to achieve business goals
▪ Define & implement technical recommendations
▪ Implement workload optimizations on a regular cadence

Design & deploy new workloads
▪ Map workload architectures across business priorities
▪ Review technical guidance of Well-Architected Framework
▪ Assess workload architecture design with the Well-Architected Review
▪ Consider architecture design tradeoffs to achieve business goals
▪ Build, deploy and manage Well-Architected, optimized workloads on Azure
Optimize existing workloads - Process
A continuous process with four stages:
▪ Gather: Collect data to identify optimization opportunities
▪ Analyze: Confirm optimization opportunities
▪ Advise: Actionable plan and recommendations to optimize your workloads
▪ Implement: Implement recommendations

Next Step: Take the Well-Architected Review
Using the Azure Well-Architected Review
This web-based assessment helps improve the quality of a workload by:
▪ Examining the workload across the five pillars of the Azure Well-Architected Framework (Reliability, Cost Optimization, Security, Operational Excellence, and Performance Efficiency)
▪ Providing specific guidance to improve architecture and overcome detected hurdles effectively
▪ Proactively focusing on the pillar where most attention is needed
▪ Driving consistency into workload discussions throughout the team

https://aka.ms/wellarchitected/review
Azure Well-Architected
Review
Assess workloads with the pillars of the
Microsoft Azure Well-Architected Framework:

▪ Understand the Well-Architected level of your workload environment.
▪ Follow technical guidance for next steps on how to improve the quality of your workloads.

Demo
https://aka.ms/wellarchitected/review
Architect and optimize workloads for success

▪ Well-Architected Channel 9 Show
▪ Azure Well-Architected Review
▪ Learning Path
▪ Azure Architectures
▪ Well-Architected Design Principles
▪ Azure Well-Architected Framework
▪ MS Consulting Services
Security Pillar
Azure Security
▪ Operations: Security operations that work for you
▪ Technology: Enterprise-class technology
▪ Partnerships: Partnerships for a heterogeneous world
Situation
At a high level, the RxWell solution’s architecture has three main zones: on-premises, the cloud, and the end-user apps. RXR and its partners committed to strict adherence to the principles of Responsible AI. All data is anonymized, and only aggregated data gets stored. Even that aggregated data is treated as sensitive, and none of it is shared externally. Meanwhile, the multilayered security controls of Azure help keep it all safe.

Solution
“When it came to developing RxWell, there was simply no other company that had the capability and the infrastructure to meet our comprehensive data, analytics, and security needs than Microsoft. With our partnership, the RxWell program provides our customers the tools they need to safely navigate the ‘new abnormal’ of COVID-19 and beyond.”
- Scott Rechler, Chairman and CEO, RXR Realty

Impact
“The well-architected Azure framework, fundamentally addresses things like scalability, reliability, security, and operational excellence. Because we started building our solution from day one on those pillars, that helped us to absorb all these nuances from the integration perspective.”
- Saurav K. Chandra, Principal Architect for Internet of Things, Infosys
Build and manage proactively secured workloads
Security provides principles to protect, detect, and respond to threats
across your Azure environment.

Build upon a secure foundation
▪ Design assuming workload failure with multi-layer protection controls.
▪ Build workloads using zero-trust principles in both IaaS and PaaS
▪ Embrace Azure’s security investments, resources, and compliance certifications

Proactively stay secure with native controls
▪ Continuously manage your workload security from a single pane of glass with Azure Security Center.
▪ Protect your workloads from malicious attacks with cloud-native Azure Web Application Firewall
▪ Manage identity and access for your workload with Azure Active Directory

Detect and respond to threats
▪ Leverage large-scale intelligence from decades of Microsoft security experience to work with the Microsoft Intelligent Security Graph, collected from 8 trillion threat signals analyzed daily
▪ Embrace automation with Azure Defender to get threat protection for your workload
▪ Establish procedures to identify and mitigate threats for your workloads with Azure Sentinel
Build on a secure foundation
Principle: Build a comprehensive strategy
A security strategy should consider investments in culture, processes, and security controls across all system
components. The strategy should also consider security for the full lifecycle of system components including the supply
chain of software, hardware, and services.

Protect customer data
▪ Use Azure Active Directory to manage access to Azure resources.
▪ Use Azure Key Vault to store sensitive data such as certificates, connection strings, and tokens.
▪ The Azure Security Benchmark provides recommendations to improve the security of your workloads, data, and services.

Secure hardware
▪ Azure is hosted on custom-built hardware with integrated security.
▪ Host Attestation Service ensures that host machines are trustworthy before they're allowed to interact with customer data.

Test and monitor
▪ Run simulated penetration attacks to detect system vulnerabilities and validate defenses.
▪ Classify, protect, and monitor sensitive data assets using access control, encryption, and logging.

[Figure: high-level architecture of the Host Attestation Service]


Build upon a secure foundation
Principle: Assume Zero Trust

▪ DDoS protection: DDoS protection tuned to your application traffic patterns
▪ Web Application Firewall: Centralized inbound web application protection from common exploits and vulnerabilities
▪ Azure Firewall: Data exfiltration protection using centralized outbound and inbound (non-HTTP/S) network and application (L3-L7) filtering
▪ Network Security Groups: Distributed inbound & outbound network (L3-L4) traffic filtering on VM, Container or subnet
▪ VNET Integration: Restrict access to Azure service resources (PaaS) to only your Virtual Network

Application protection Segmentation


Proactively stay secure with native controls
Principle: Leverage native controls
Native security controls are maintained and supported by the service provider, eliminating or reducing
effort required to integrate external security tooling and update those integrations over time.

Built-in Azure controls
▪ Identity & access
▪ Apps & data security
▪ Network security
▪ Threat protection
▪ Security management

In-depth defense
Proactively stay secure with native controls
Principle: Leverage native controls

Azure Security Center
A unified infrastructure security management system that:
▪ Strengthens the security posture of your data center
▪ Provides advanced threat protection across your hybrid workloads in the cloud, on Azure, or on premises

Web Application Firewall
▪ Centralized protection and inspection of HTTP requests to prevent attacks such as SQL Injection or Cross-Site Scripting.

Azure Active Directory
▪ Microsoft’s cloud-based identity and access management service, which helps your employees securely access resources.
▪ Managed Identities eliminate the need to store credentials that could be leaked.
▪ Use Azure AD Connect for synchronizing Azure AD with your existing on-premises directory.
Detect and respond to threats
Principle: Design for resilience
Native threat detection

[Diagram labels: Native security and governance; Azure Sentinel; 3rd-party, multi-cloud and partners; ASC/Secure Score; Firewall; Web App Firewall; SQL Protection; API Protection; Microsoft 365 Defender; Azure Defender; Email/docs; Endpoints; Identities; Apps; SQL; Server VMs; Containers; Network traffic; IoT Apps; XDR; Microsoft Defender]
Security
❑ Identity as Primary Access Control
❑ Assign permissions to users, groups, and applications
❑ Use built-in roles
❑ Restrict control plane access

❑ Enforce multi-factor verification


❑ Protect all public endpoints

❑ Prevent direct access of virtual machines


Let’s walk through some questions for Security in the Well-Architected Review
Have you done a threat analysis of your workload?

Threat modeling is an engineering technique which can be used to help identify threats, attacks,
vulnerabilities and countermeasures that could affect an application. Threat analysis consists of
defining security requirements, identifying threats, mitigating threats, validating threat
mitigation. All of those are needed to ensure proper security of a workload on both the
prevention and reaction fronts.

❑ Threat modeling processes are adopted, identified threats are ranked based on organizational impact, mapped to
mitigations and communicated to stakeholders.
❑ Timelines and processes are established to deploy mitigations (security fixes) for identified threats.
❑ Security requirements are defined for this workload.
❑ Threat protection was addressed for this workload.
❑ Security posture was evaluated with standard benchmarks (CIS Control Framework, MITRE framework etc.).
❑ Business critical workloads, which may adversely affect operations if they are compromised or become unavailable,
were identified and classified.
❑ None of the above.
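
The first checklist item, ranking identified threats by impact and mapping them to mitigations, can be sketched as a small risk register. This is only an illustration and not part of the Well-Architected Review: the `Threat` class, the 1 to 5 likelihood and impact scales, the likelihood-times-impact score, and the sample entries are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Threat:
    """One entry in a hypothetical threat register."""
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    mitigations: list[str] = field(default_factory=list)

    @property
    def risk_score(self) -> int:
        # Simple likelihood x impact scoring; a real program may weight differently.
        return self.likelihood * self.impact


def rank_threats(threats: list[Threat]) -> list[Threat]:
    """Return threats ordered from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk_score, reverse=True)


def unmitigated(threats: list[Threat]) -> list[str]:
    """Names of threats that have no mitigation mapped yet."""
    return [t.name for t in threats if not t.mitigations]


# Sample entries are invented for illustration.
register = [
    Threat("Leaked storage key", likelihood=3, impact=5,
           mitigations=["Rotate keys", "Store keys in a vault"]),
    Threat("SQL injection", likelihood=4, impact=4,
           mitigations=["Web Application Firewall"]),
    Threat("Phishing of operators", likelihood=4, impact=3),
]

for threat in rank_threats(register):
    print(threat.risk_score, threat.name, threat.mitigations or "NO MITIGATION")
```

Listing the unmitigated entries separately gives a starting point for the timelines mentioned in the second checklist item.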
How is security validated, and how do you handle
incident response when a breach happens?
If prevention fails and security of the application is breached, proper response and mitigation
can minimize damage and contain the attacker within minimal boundaries.

❑ For containerized workloads, Azure Defender (Azure Security Center) or other third-party solution is used to scan
for vulnerabilities.
❑ Penetration testing is performed in-house, or a third-party entity performs penetration testing of this workload to
validate the current security defenses.
❑ Simulated attacks on users of this workload, such as phishing campaigns, are carried out regularly.
❑ Operational processes for incident response are defined and tested for this workload.
❑ Playbooks are built to help incident responders quickly understand the workload and components, to mitigate an
attack and do an investigation.
❑ There's a security operations center (SOC) that leverages a modern security approach.
❑ A security training program is developed and maintained to ensure security staff of this workload are well-informed
and equipped with the appropriate skills.
❑ None of the above.
Are keys, secrets, and certificates managed in a
secure way?
Secrets like API keys and certificates are sensitive pieces of information that need to be
managed in a secure way - that includes proper storage, encryption and access control.

❑ There's clear guidance or a requirement on what type of keys (PMK - Platform Managed Keys vs. CMK - Customer
Managed Keys) should be used for this workload.
❑ Passwords and secrets are managed outside of application artifacts, using tools like Azure Key Vault.
❑ Access model for keys and secrets is defined for this workload.
❑ A clear responsibility / role concept for managing keys and secrets is defined for this workload.
❑ Secret/key rotation procedures are in place.
❑ Expiry dates of SSL/TLS certificates are monitored and there are renewal processes in place.
❑ None of the above.
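
Monitoring certificate expiry dates, as the checklist suggests, can be reduced to a small calculation. The sketch below uses Python's standard `ssl.cert_time_to_seconds` to parse a certificate's `notAfter` value; the 30-day renewal threshold and the function names are assumptions, and fetching the certificate itself is left out.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the certificate date format, for example
    "Jan  5 09:34:43 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_renewal(not_after: str, threshold_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within the assumed threshold."""
    return days_until_expiry(not_after, now) <= threshold_days
```

In practice the `notAfter` string could come from `ssl.SSLSocket.getpeercert()`, and a scheduled job could raise an alert for any endpoint where `needs_renewal` returns true.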
Performance Efficiency Pillar
Build and manage—scalable, efficient workloads
Design and manage workloads that scale according to load changes, with efficient systems, processes, and resources

Tools to provide scalability
▪ Manage resource scaling with Azure SQL Database and Azure App Service, or scale dynamically with demand with Azure Autoscale
▪ Optimize your network and storage with Azure Cosmos DB, Azure Traffic Manager, and Azure Cache for Redis

Efficient architecture tradeoffs
▪ Design parts of the process to be discrete and decomposable to maximize compute resources, and take microservices architecture into account

[Diagram: Client, API Gateway, Services, Microservices, DevOps, Management/Orchestration]

Active response to performance issues
▪ Evaluate health levels of workloads with Azure Monitor and Log Analytics to provision resources dynamically, and scale to match demand
▪ Assess and remediate deep application performance issues and trends with Azure Application Insights
▪ Embrace a data-driven culture to deliver timely insights across data to your entire organization
Optimal service execution
Principle: Invest in capacity planning

Test continuously
▪ Establish baselines for your application and its supporting infrastructure
▪ Always test the effect on performance when code or infrastructure changes are made
▪ Monitor typical and peak system loads to provide visibility on operational peaks outside designed limits

Anticipate load fluctuations
▪ Test for expected loads because of planned events, for example, sales promotions, or holidays
▪ Plan for unexpected political, economic, and weather events
▪ Choose paired regions, and ensure that all regions can adequately scale to maximize uptime

Carefully evaluate services and costs
▪ Review service-level agreements (SLAs) of similar services to calculate the best fit for your application
▪ Consider the effects of business requirements when making trade-offs between cost and performance
▪ Use cost calculators to estimate initial and operational costs
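
When comparing SLAs, it helps to work out the combined figure for a whole workload. The sketch below assumes independent failures: services that must all be up multiply their availabilities, and redundant copies fail only if every copy fails. The function names and the availability figures are illustrative and are not quoted from any published SLA.

```python
from math import prod


def composite_sla(availabilities: list[float]) -> float:
    """Availability of a chain of services that must all be up."""
    return prod(availabilities)


def redundant_sla(availability: float, copies: int) -> float:
    """Availability when any one of `copies` independent deployments is enough."""
    return 1 - (1 - availability) ** copies


# Illustrative figures only.
web_tier = 0.9995
database = 0.9999

single_region = composite_sla([web_tier, database])
two_regions = redundant_sla(single_region, copies=2)

print(f"single region: {single_region:.4%}")
print(f"two regions:   {two_regions:.6%}")
```

The chained figure is always lower than its weakest component, which is one reason the choice of paired regions and the ability of every region to scale both matter.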
Efficient architecture tradeoffs
Principle: Run performance testing in the scope of development

Distributed systems require more effort
▪ Evaluate the systemic effect of each application: its supporting services, and the latency between application layers
▪ Ensure that all services can scale to support loads, and that one service will not be a bottleneck
▪ Services may need to scale differently under loads

Test & tune performance
▪ Establish an SLA that defines performance targets for latency, number of requests, and exception rate for each workload
▪ Use proven best practices such as properly instrumenting code, monitoring multiple load percentages, and systemic troubleshooting

Avoid performance antipatterns
▪ Performance antipatterns are common, defective processes and implementations within organizations that are likely to cause scalability problems when an application is under pressure
▪ Antipatterns may be obvious, for example, the inability to scale from on-premises to the cloud
Active response to performance issues
Principle: Continuously monitor the application and supporting infrastructure

Azure Monitor
▪ Comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments.
▪ Helps to maximize the availability and performance of applications and services

Log Analytics
▪ Edit and run log queries from data collected by Azure Monitor Logs, and interactively analyze the results
▪ Retrieve records matching precise criteria, identify trends, analyze patterns, and provide a variety of data insights

Application Insights
▪ Provides visibility into app performance and utilization patterns
▪ Monitors various data sources, including request, response, and failure rates, exceptions, page views, and load performance
Scaling design














Let’s walk through some questions for Performance Efficiency in the Well-Architected Review
What design considerations have you made for
performance efficiency in your workload?
As traffic to your application fluctuates, the number of underlying resources that you need
will vary over time.
❑ The workload is deployed across multiple regions.
❑ Regions were chosen based on location, proximity to users, and resource type availability.
❑ Paired regions are used appropriately.
❑ You have ensured that both (all) regions in use have the same performance and scale SKUs that are currently leveraged in the primary
region.
❑ Within a region the application architecture is designed to use Availability Zones.
❑ The application is implemented with strategies for resiliency and self-healing.
❑ Component proximity is considered for application performance reasons.
❑ The application can operate with reduced functionality or degraded performance in the case of an outage.
❑ You chose appropriate datastores for the workload during the application design.
❑ Your application is using a microservice architecture.
❑ You understand where state will be stored for the workload.
❑ None of the above.
How have you modeled the health of your workload?
❑ Application and resource level logs are aggregated in a single data sink or able to be cross-queried.
❑ A health model is used to qualify what 'healthy' and 'unhealthy' states represent for the application.
❑ Critical system flows are used to inform the health model.
❑ The health model can distinguish between transient and non-transient faults.

❑ The health model can determine if the workload is performing at the expected targets.
❑ Retention times for logs and metrics have been defined and housekeeping mechanisms are configured.
❑ Long-term trends are analyzed to predict performance issues before they occur.
❑ None of the above.
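
A health model that separates transient from non-transient faults can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three states, the `classify` function, and the rule that up to two consecutive failed probes count as transient are all hypothetical choices, not a definition taken from the framework.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def classify(results: list[bool], transient_tolerance: int = 2) -> Health:
    """Classify health from recent probe results (True = success), oldest first.

    A short run of failures at the end is treated as transient (DEGRADED);
    a run longer than `transient_tolerance` is treated as non-transient
    (UNHEALTHY).
    """
    consecutive_failures = 0
    for ok in reversed(results):
        if ok:
            break
        consecutive_failures += 1

    if consecutive_failures == 0:
        return Health.HEALTHY
    if consecutive_failures <= transient_tolerance:
        return Health.DEGRADED
    return Health.UNHEALTHY


print(classify([True, True, False]))          # one recent failure
print(classify([True, False, False, False]))  # sustained failures
```

A real health model would feed this from the critical system flows and performance targets listed above, and the tolerance would be tuned per dependency.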
How are you benchmarking your workload?
❑ You have identified goals or a baseline for workload performance.
❑ Performance goals are based on device and/or connectivity type as appropriate.
❑ You have defined an initial connection goal for your workload.
❑ There is a goal defined for complete page load times.
❑ You have defined goals for an API (service) endpoint complete response.
❑ There are goals defined for server response time.
❑ You have goals for latency between the systems & microservices of your workload.
❑ There are goals on database query efficiency.
❑ You have a methodology to determine what acceptable performance is.
❑ None of the above.
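
One possible methodology for deciding what acceptable performance is: compare a percentile of measured response times against a stated goal. The sketch below uses the nearest-rank percentile; the sample timings, the 300 ms goal, and the function names are assumptions for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_goal(samples_ms: list[float], goal_ms: float, pct: float = 95) -> bool:
    """True when the chosen percentile of response times is within the goal."""
    return percentile(samples_ms, pct) <= goal_ms


# Hypothetical server response times in milliseconds.
response_times_ms = [120, 135, 110, 480, 125, 140, 130, 115, 150, 128]

print(percentile(response_times_ms, 50))
print(meets_goal(response_times_ms, goal_ms=300))
```

Percentiles are used here instead of the average because a single slow request, like the 480 ms sample, is easy to miss in a mean but is exactly what some users experience.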
Operational Excellence Pillar
Build, deploy, and manage workloads—
with trustworthy processes

Agile and accurate processes
• Apply DevOps to break down silos between development and operations across your organization
• Build and test workloads with Continuous Integration and Continuous Delivery (CI/CD) both in development and production stages
• Perform extensive automated testing with Azure Pipelines or manual testing with Azure Test Plans

Focused and assertive application monitoring
• Dive deep into your workload's information with Log Analytics for infrastructure and with Azure Application Insights for application trends
• Manage the health of your system and activity logging by consuming core monitoring insights provided by Azure Monitor

Continuous improvement
• Enjoy the flexibility of creating agile and independent workloads with microservices and loosely coupled architectures
• Use Chaos Engineering practices and run regular tests to reach higher levels of maturity and operational effectiveness.
• Reduce process risks by automating operational tasks and deployments with Azure Automation, Azure CLI and Azure PowerShell
Agile and accurate processes
Principle: Optimize build and release processes

Infrastructure as Code
• Define the entire Infrastructure as Code just as you define your application
• Increase accuracy and reduce process risks by preventing configuration drift
• Enable easy recreation of new environments, e.g., for developing new features

Continuous Integration & Continuous Delivery
• Build and test workloads with Continuous Integration and Continuous Delivery (CI/CD) both in development and production stages, to achieve a single and consistent way of building and deploying.
• Eliminate error-prone manual interventions
• Version CI/CD pipelines for traceability of changes

Automated testing
• Perform extensive automated testing to ensure a stable code base and resource composition before deploying to critical systems
• Achieve a faster time-to-ship with fewer errors
Focused and assertive application monitoring
Principle: Monitor system and understand operational health

Monitor build and release processes
• Give developers early feedback on pushed code changes
• Avoid outages caused by the rollout of new features

Monitor infrastructure and application health
• Build confidence in the overall health of your workload
• Dive deep into instrumentation with Log Analytics for infrastructure monitoring
• Instrument your code to collect all relevant events and metrics
• Use comprehensive dashboards that are tailored to your audiences
• Leverage Azure Application Insights for observing application trends

Understand workload health to meet business goals
• Understand the business impact of reduced workload health
• Correlate events and metrics across different parts of your solution
• Respond to issues with self-healing capabilities
Continuous improvement
Principle: Use loosely coupled architecture

Strive for a true DevOps model
• Apply DevOps to break down silos between development and operations across your organization.
• Run agile and independent teams that are in charge of developing and running their parts of the workload

Microservices design
• Enjoy the flexibility of creating agile and independent workloads with microservices.
• Limit impact of issues by having clear boundaries between services
Continuous improvement
Principle: Rehearse recovery and practice failure

Rehearse recovery
• Only tested recovery procedures will work in times of emergency
• Validate operation runbooks
• Run regular tests and conduct dry runs of failover scenarios

Practice failure
• Test your workload with injected faults in a safe environment
• Use Chaos Engineering practices to reach higher levels of maturity
• Employ a Red Team to find issues and weak points
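
Testing with injected faults can start very small, well before any dedicated chaos tooling is involved. The sketch below wraps a function so that a configurable share of calls fail, then exercises a simple retry procedure against it. Every name here (`InjectedFault`, `with_fault_injection`, `call_with_retry`, `lookup_price`) is hypothetical, and the retry logic is only a stand-in for a real recovery procedure.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_fault_injection(func, failure_rate: float, rng: random.Random):
    """Wrap `func` so that roughly `failure_rate` of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts: int = 3):
    """Minimal recovery procedure: retry on injected faults, then give up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error


def lookup_price() -> float:
    return 9.99  # stand-in for a real dependency call


# A seeded generator keeps a test run repeatable.
flaky_lookup = with_fault_injection(lookup_price, failure_rate=0.3,
                                    rng=random.Random(42))
```

Running `call_with_retry(flaky_lookup)` in a test environment shows whether the recovery path actually works, which is the point of rehearsing failure before an emergency.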
Continuous improvement
Principle: Embrace operational improvement

Evolve processes
• Establish well-defined owners and playbooks for procedures and tasks to optimize operational effectiveness.
• Establish regular cadences for testing operational procedures and tasks.
• Review operational incidents to improve operational effectiveness.
• Establish Root Cause Analysis processes.

Optimizing inefficiencies through automation
• Save time, reduce risks and avoid errors by automating operational tasks or any deployments that may occur on a schedule, in response to events or monitoring alerts, or ad hoc based on external factors.
• Automate deployments with Infrastructure as Code to define the infrastructure that needs to be deployed.
• Optimize workload configurations by automating software installs, adding data to a database, updating networking and other actions.
Automation
Reduce toil, improve efficiency, and ensure consistency

▪ Infrastructure deployment checklist
▪ Infrastructure configuration checklist
▪ Operational task checklist
Infrastructure deployment

❑ declarative over imperative




❑ repeatable infrastructure
❑ Avoid configuration drift
❑ Dynamically provision
❑ disaster recovery plan
Infrastructure configuration

❑ Azure data plane


❑ automation


❑ Configuration



Operational task

❑ on demand, on a schedule, through a webhook
❑ Use Azure Functions

❑ Configure Azure Monitor


❑ Configure Azure Kubernetes Service


Let’s walk through some questions for Operational Excellence in the Well-Architected Review
How do you interpret the collected data to inform
application health?
Log aggregation technologies should be used to collate logs and metrics across all workload
components for later evaluation. Resources for which logs are captured may include Azure IaaS
and PaaS services as well as 3rd-party appliances, such as firewalls or anti-malware solutions,
used in the workload.

• A log aggregation technology, such as Azure Log Analytics or Splunk, is used to collect logs and metrics from Azure
resources
• Azure Activity Logs are collected within the log aggregation tool
• Resource-level monitoring is enforced throughout the application
• Logs and metrics are available for critical internal dependencies
• Log levels are used to capture different types of application events.
• There are no known gaps in application observability that led to missed incidents and/or false positives.
• The workload is instrumented to measure customer experience.
• None of the above.
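
Log levels and aggregation work best when every component emits logs in a shape the aggregator can index. The sketch below uses Python's standard `logging` module to write one JSON object per event; the `JsonFormatter` class, the `component` field, and the logger name are assumptions, and shipping the output to a tool such as Log Analytics or Splunk is not shown.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so an aggregator can index fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `component` is a hypothetical field passed through `extra=`.
        component = getattr(record, "component", None)
        if component is not None:
            payload["component"] = component
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")  # hypothetical workload component
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.debug("cache lookup details")  # below INFO, so not emitted
logger.info("order accepted", extra={"component": "checkout"})
logger.error("payment provider timeout", extra={"component": "payments"})
```

Keeping the level in a dedicated field lets the aggregation tool filter different types of application events without parsing message text.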
How are you managing the configuration of the
workload?
Cloud-based applications often run on multiple virtual machines or containers in multiple
regions and use multiple external services. How do you manage and store all your app's
configuration settings, feature flags, and secure access settings?

❑ You monitor and take advantage of new features and capabilities of underlying services used in your workload.
❑ Application configuration information is stored using a dedicated management system such as Azure App
Configuration or Azure Key Vault.
❑ Soft delete is enabled for your keys and credentials, such as Key Vaults and the objects stored in them.
❑ Configuration settings can be changed or modified without rebuilding or redeploying the application.
❑ Passwords and other secrets are managed in a secure store like Azure Key Vault or HashiCorp Vault.
❑ The application uses Azure Managed Identities.
❑ The expiry dates of SSL certificates are monitored and there are processes in place to renew them.
❑ Components are hosted on shared application or data platforms as appropriate.
❑ Your workload takes advantage of multiple Azure subscriptions.
❑ The workload is designed to leverage managed services.
❑ None of the above.
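
Changing configuration without rebuilding or redeploying usually means the application reads its settings at startup from outside the build artifact. The sketch below reads them from environment variables only; the `Settings` fields and variable names are hypothetical, and it does not call Azure App Configuration or Azure Key Vault, which could supply the same values in a real deployment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings for a workload; names are illustrative only."""
    api_base_url: str
    request_timeout_s: float
    new_checkout_enabled: bool  # example feature flag


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment so they can change without a rebuild."""
    return Settings(
        # Required value: a missing key raises KeyError at startup.
        api_base_url=env["API_BASE_URL"],
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "5")),
        new_checkout_enabled=env.get("FEATURE_NEW_CHECKOUT", "false").lower() == "true",
    )
```

Passing the mapping as a parameter keeps the loader easy to test, for example `load_settings({"API_BASE_URL": "https://example.invalid"})`. Secrets such as passwords should still come from a secure store, not from plain configuration.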
What operational considerations are you making
regarding infrastructure deployment?
As you provision and update Azure resources, application code, and configuration settings, a
repeatable and predictable process will help you avoid errors and downtime.
❑ The entire application infrastructure is defined as code
❑ No operational changes are performed outside of infrastructure as code
❑ Configuration drift is tracked and addressed
❑ The process to deploy infrastructure is automated
❑ Critical test environments have 1:1 parity with the production environment
❑ Direct write access to infrastructure is not possible and all resources are provisioned or configured through IaC
processes.
❑ None of the above.
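
One way to picture "configuration drift is tracked" is a comparison between the settings declared in code and the settings observed on the deployed resource. The sketch below is only an illustration. The function name, the ABSENT marker, and the setting names in the usage note are hypothetical, and a real check would read the actual state from the platform instead of a hand-written dictionary.

```python
ABSENT = "<absent>"  # marker for a setting that exists on only one side


def find_drift(desired: dict, actual: dict) -> dict:
    """Return {setting: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift
```

For example, comparing a declared instance count of 3 with an observed count of 2 reports that one setting as drifted, along with any setting that was added or removed by hand.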
Reliability Pillar
Why is reliability important?
Because avoiding failure is impossible in the public cloud

Applications require resilience to respond to failures and deliver reliability

Reliability: the what
▪ Ensuring availability of services = the goal for production systems.
▪ End goal = maintain reliable systems with the appropriate level of availability (uptime).

Resilience: the how
▪ How production systems achieve reliability.
▪ End goal = not to avoid all failures but to respond to failure in ways that avoid downtime and data loss.
Customer: Push Doctor
Industry: Professional Services
Size: 50-999 employees
Country: United Kingdom
Products and services:
Microsoft Azure
Microsoft Azure App Service
Microsoft Azure Application Gateway
Microsoft Azure Availability Zones
Microsoft Azure Monitor
Microsoft Azure Service Bus
Microsoft Azure SQL Database
Microsoft Power BI

“We’ve used Azure to build a resilient platform and help countless people get quick and easy healthcare access they can count on.”
Paul Smith, Enterprise Architect, Push Doctor

Situation:
Push Doctor, a patient/doctor video consultation platform based in the United Kingdom, needed highly available and scalable infrastructure that would provide the reliability its patients need to access remote healthcare support on their terms.

Solution:
Using Microsoft Azure platform as a service resources such as Azure App Service, Push Doctor’s platform is now instantly scalable and highly secure, with an impressive 99.99 percent uptime. And thanks to duplicated workloads, it can seamlessly manage failovers.

Impact:
Push Doctor can now match patients with a general practitioner in a matter of hours, helping to potentially save the lives of people who would have otherwise waited much longer for a consultation, a service that has proved invaluable during the COVID-19 crisis.
Building reliable systems is a shared responsibility

Your application
Your app or workload, built on the Azure platform.
Scope of Reliability Reviews
Resiliency features
Optional Azure capabilities you can enable as needed—high availability, disaster recovery, and backup.

Reliable foundation
Core capabilities built into the Azure platform – how the foundation is designed, operated,
and monitored to ensure availability.
Building reliable applications in the cloud
Enable systems to recover from failures and continue to function

Design for reliability
▪ Use Availability Zones where applicable to improve reliability and optimize costs.
▪ Design applications to operate when impacted by failures.
▪ Use the native resiliency capabilities of PaaS to support overall app reliability.
▪ Validate that required capacity is within Azure service scale limits and quotas.

Testing overall availability & resiliency
▪ Test regularly to validate existing thresholds, targets and assumptions.
▪ Verify how the end-to-end workload performs under failure conditions.
▪ Conduct load testing with expected peak volumes to test scalability and performance under load.
▪ Perform chaos testing by injecting faults.

Overall monitoring & diagnostics
▪ Define alerts that are actionable and effectively prioritized.
▪ Create alerts that poll for services nearing their limits and quotas.
▪ Use application instrumentation to detect and resolve performance anomalies.
▪ Troubleshoot issues to gain an overall view of application health.
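
The advice to alert on services nearing their limits and quotas can be sketched as a simple headroom check. This is a minimal illustration, not an Azure API call. The quota names and numbers in the usage note are made up, and the 80 percent threshold is an assumption to tune per workload.

```python
def quotas_near_limit(usage: dict, limits: dict, threshold: float = 0.8) -> list:
    """Return (name, ratio) pairs for quotas at or above the threshold, highest first."""
    flagged = []
    for name, limit in limits.items():
        if limit <= 0:
            raise ValueError(f"limit for {name!r} must be positive")
        ratio = usage.get(name, 0) / limit
        if ratio >= threshold:
            flagged.append((name, round(ratio, 3)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

With hypothetical limits of 100 vCPUs and 20 public IPs, usage of 85 vCPUs and 20 IPs flags both, with the fully consumed quota listed first.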
Design for reliability
Principle: design applications to be resistant to failures

Use Availability Zones within a region
▪ If greater failure isolation is required than Availability Zones alone can offer, consider deploying to multiple regions.
▪ Multiple regions should be used for failover purposes in a disaster state.
▪ Additional costs (data, networking, and the Azure Site Recovery service) should be considered.

Design for failure recovery
▪ Resilient application architectures should be designed to recover gracefully from failures in alignment with defined reliability targets.
▪ Define an availability strategy to capture how the application remains available when in a failure state.
▪ Define a Business Continuity Disaster Recovery strategy for the application and/or its key scenarios.

Criteria for improving application reliability
▪ Use Platform as a Service (PaaS), which offers native resiliency capabilities to support overall application reliability.
▪ Design your application to automatically scale in and out.
▪ Review Azure subscription and service limits to validate that required capacity is within quotas.
Test for availability and resiliency
Principle: define, automate, and test operational processes

End-to-end workload testing
▪ Simulation testing involves creating real-life situations and demonstrates the effectiveness of proposed solutions.
▪ Use fault injection testing to check the system resiliency during failures, either by triggering failures or by simulating them.
▪ Load testing is crucial for identifying failures that only happen under load (e.g., an overwhelmed back-end database, or service throttling).

Build high availability & resiliency testing into strategy
▪ Resilient application architectures should be designed to recover gracefully from failures in alignment with defined reliability targets.
▪ Define an availability strategy to capture how the application remains available when in a failure state.
▪ Define a Business Continuity Disaster Recovery strategy for the application and/or its key scenarios.

Automate testing across BCDR strategy & prepare for failure
▪ Create and fully test a disaster recovery plan using the actual resources needed to restore functionality.
▪ Perform an operational readiness test for failover to the secondary region and for failback to the primary region.
▪ Codify the steps required to recover or failover to a secondary region to limit the impact of an outage.
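
Fault injection by simulation can be as small as a wrapper that makes a dependency fail on purpose, so a test can check that the caller degrades gracefully. The sketch below is an assumption-laden illustration. FaultInjector and get_price_with_fallback are hypothetical names, and a real chaos test would inject faults at the infrastructure level instead of inside the process.

```python
import random


class FaultInjector:
    """Wraps a callable and makes a fraction of calls fail on purpose."""

    def __init__(self, func, failure_rate: float, seed: int = 0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so test runs are repeatable

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def get_price_with_fallback(fetch, item: str, cached: dict) -> float:
    """Serve the live price, or the last cached price if the dependency fails."""
    try:
        price = fetch(item)
        cached[item] = price
        return price
    except ConnectionError:
        return cached[item]
```

Setting failure_rate to 1.0 forces the failure path on every call, which is the simplest way to confirm the fallback actually works.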
Monitoring application health
Principle: define, automate, and test operational processes

Azure services & resources alerts & dashboards
▪ Azure Service Health provides a view into the health of Azure services and regions, as well as communications about outages and planned maintenance activities.
▪ Azure Resource Health provides information about the health of individual resources and is highly useful when diagnosing unavailable resources.
▪ Azure dashboards provide a consolidated view of data from Application Insights, Log Analytics, Azure Monitor metrics, and Service Health.

Scaling subscription & service targets
▪ If your application requires more storage accounts than are currently available in your subscription, create a new subscription with additional storage accounts.
▪ Identify scalability targets for VMs, including VM size, number of disks, CPU, and memory.
▪ To avoid data throttling, review your Azure SQL Database requirements to ensure that they are adequate.

Fully test BCDR plan
▪ Create and fully test a disaster recovery plan using the actual resources needed to restore functionality.
▪ Perform an operational readiness test for failover to the secondary region and for failback to the primary region.
▪ Codify the steps required to recover or failover to a secondary region to limit the impact of an outage.
Azure Well-Architected Review
Assess workloads with the pillars of the Microsoft Azure Well-Architected Framework:
▪ Understand the Well-Architected level of your workload environment.
▪ Follow technical guidance for next steps of how to create and optimize your workloads.
aka.ms/wellarchitected/review
Let’s walk through some questions for Reliability in the Well-Architected Review
What reliability targets and metrics have you
defined for your application?
Availability targets, such as Service Level Agreements (SLA) and Service Level Objectives (SLO), and
Recovery targets, such as Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO),
should be defined and tested to ensure application reliability aligns with business requirements.

❑ Recovery targets to identify how long the workload can be unavailable (Recovery Time Objective) and how much data is
acceptable to lose during a disaster (Recovery Point Objective).
❑ Availability targets such as Service Level Agreements (SLAs) and Service Level Objectives (SLOs).
❑ Availability metrics to measure and monitor availability such as Mean Time To Recover (MTTR) and Mean Time Between
Failure (MTBF).
❑ Composite SLA for the workload derived using the Azure SLAs for all relevant resources.
❑ SLAs for all internal and external dependencies.
❑ Independent availability and recovery targets for critical application subsystems and scenarios.
❑ None of the above.
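
To make the composite SLA and the MTBF/MTTR items concrete, here is a minimal sketch of the usual arithmetic. It assumes component failures are independent, which is a simplification. The SLA figures in the usage note are illustrative and are not quoted from any Azure SLA.

```python
def serial_sla(*slas: float) -> float:
    """All components are required: multiply their availabilities."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def redundant_sla(*slas: float) -> float:
    """Any one component suffices: 1 minus the chance that all fail together."""
    all_fail = 1.0
    for sla in slas:
        all_fail *= 1.0 - sla
    return 1.0 - all_fail


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from Mean Time Between Failure and Mean Time To Recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For example, a web tier at 99.95 percent in front of a database at 99.99 percent gives serial_sla(0.9995, 0.9999), roughly 99.94 percent, which is lower than either component alone. Two redundant instances at 99 percent each give redundant_sla(0.99, 0.99), roughly 99.99 percent.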
How have you ensured that your application
architecture is resilient to failures?
Resilient application architectures should be designed to recover gracefully from failures in alignment
with defined reliability targets.

❑ Deployed the application across multiple regions.
❑ Removed all single points of failure by running multiple instances of application components.
❑ Deployed the application across Availability Zones within a region.
❑ Performed Failure Mode Analysis (FMA) to identify fault-points and fault-modes.
❑ Planned for component level faults to minimize application downtime.
❑ Planned for dependency failures to minimize application downtime.
❑ None of the above.
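
Planning for dependency failures often starts with retrying transient errors. The sketch below shows retry with exponential backoff as one possible approach. The function name and default values are assumptions, and it treats only ConnectionError as transient, which a real workload would need to widen or narrow.

```python
import time


def call_with_retry(func, attempts: int = 3, base_delay_s: float = 0.5, sleep=time.sleep):
    """Call func, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)  # wait 0.5s, 1s, 2s, ...
```

The sleep parameter is injectable so that tests can record the delays instead of actually waiting.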
How do you monitor and measure application health?
Monitoring and measuring application availability is vital to qualifying overall application
health and progress towards defined reliability targets.

❑ The application is instrumented with semantic logs and metrics.
❑ Application logs are correlated across components.
❑ All components are monitored and correlated with application telemetry.
❑ Key metrics, thresholds, and indicators are defined and captured.
❑ A health model has been defined based on performance, availability, and recovery targets and is represented
through monitoring dashboard and alerts.
❑ Azure Service Health events are used to alert on applicable Service level events.
❑ Azure Resource Health events are used to alert on resource health events.
❑ None of the above.
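
A health model can be pictured as a mapping from key metrics and thresholds to a small set of states. The sketch below is one hypothetical shape for it. The state names, metric names, and threshold values are assumptions, and it only handles metrics where a higher value is worse.

```python
HEALTHY, DEGRADED, UNHEALTHY = "healthy", "degraded", "unhealthy"
_SEVERITY = [HEALTHY, DEGRADED, UNHEALTHY]


def classify(value: float, degraded_at: float, unhealthy_at: float) -> str:
    """Map one metric to a state; higher values are assumed to be worse."""
    if value >= unhealthy_at:
        return UNHEALTHY
    if value >= degraded_at:
        return DEGRADED
    return HEALTHY


def overall_health(metrics: dict, thresholds: dict) -> str:
    """The workload is as healthy as its worst metric."""
    states = [classify(metrics[name], *thresholds[name]) for name in thresholds]
    return max(states, key=_SEVERITY.index, default=HEALTHY)
```

With thresholds of (0.01, 0.05) for error rate and (500, 2000) for 95th percentile latency in milliseconds, a latency of 750 ms alone is enough to report the workload as degraded.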
Cost Optimization Pillar
Cost optimization =
top cloud initiative for the fifth year running
Customer: H&R Block
Industry: Professional Services
Size: 10,000+ employees
Country: United States
Products and services:
Microsoft Azure
Microsoft Azure Advisor
Microsoft Azure Cost Management and Billing
Microsoft Azure Well-Architected Framework

"Our monthly spend year-over-year is nearly flat, while we now have approximately 30 percent more of our total compute in the cloud. Thanks to our partnership with Microsoft, our team has learned valuable techniques and strategies to continue optimizing our spend."
Paul Clark, Director of Cloud, H&R Block

Situation:
As a leader in the effort to modernize the tax industry, H&R Block wanted to optimize its cloud infrastructure, and provide better service for its customers.

Solution:
By engaging with its Microsoft account team and operationalizing conceptual pillars of the Azure Well-Architected Framework, the company was able to optimize its investment by replatforming to cloud-native services and modernizing its operating models.

Impact:
H&R Block is now equipped to take control of its monthly spend and able to move its total compute to the cloud, using its capabilities to benefit its business and customers.
Optimize costs with tools, offers, and guidance
Cost optimization offers guidelines—accelerating time to market, while avoiding
capital-intensive solutions

Understand and forecast your costs
▪ Monitor your bill, set budgets, and allocate spending to teams and projects with Azure Cost Management + Billing
▪ Forecast costs for future investments with the Azure pricing and TCO calculator

Cost optimize your workloads
▪ Optimize your resources with Azure Advisor
▪ Follow best practices for workload design with the Azure Well-Architected Framework
▪ Save with Azure offers and licensing terms like the Azure Hybrid Benefit and Reservations

Control your costs
▪ Establish spending objectives and policies using the Microsoft Cloud Adoption Framework for Azure
▪ Implement cost controls in Azure Policy so your teams can go fast while complying with policy
Optimize your costs with tools, offers, and guidance
Principle: Monitor and optimize

Use alerts to monitor usage and spending
▪ Budget alerts notify you when spending reaches predetermined thresholds.
▪ Credit alerts notify you when your Azure Prepayment is consumed.
▪ Department spending quota alerts notify you when quotas are reached.

Auto-scaling policies provide cost savings
▪ When workloads are highly variable, choose smaller VM instances, then scale out, rather than up, to get the needed performance.
▪ Many applications can be made stateless, then auto-scaled for cost benefits.

Reserved instances can reduce costs
▪ Use Azure Reservations to lower costs by pre-paying for capacity.
▪ Analyze existing pay-as-you-go usage data in Azure Portal before opting into reserved instances.
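
The advice to analyze pay-as-you-go usage before reserving capacity comes down to a break-even calculation. The sketch below assumes a reservation is billed for every hour of the term whether or not the capacity is used. The hourly rates in the usage note are hypothetical and are not Azure prices.

```python
def breakeven_utilization(payg_hourly: float, reserved_hourly: float) -> float:
    """Fraction of the term a VM must run for the reservation to cost the same as pay-as-you-go."""
    return reserved_hourly / payg_hourly


def cheaper_option(hours_used: float, hours_in_term: float,
                   payg_hourly: float, reserved_hourly: float) -> str:
    """Compare total cost over the term for the observed usage."""
    payg_cost = hours_used * payg_hourly
    reserved_cost = hours_in_term * reserved_hourly  # paid whether or not it is used
    return "reserved" if reserved_cost < payg_cost else "pay-as-you-go"
```

With a hypothetical pay-as-you-go rate of 0.10 per hour and an effective reserved rate of 0.06 per hour, the break-even point is about 60 percent utilization. A VM that runs half the year stays cheaper on pay-as-you-go, while one that runs 7,000 of 8,760 hours is cheaper reserved.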
Optimize your costs with tools, offers, and guidance
Principle: Keep within cost constraints

Develop a cost model
▪ Map your organization's needs to specific offerings.
▪ Start with high-level requirements before considering design.
▪ Geographic and security decisions can have a huge impact on your costs.

Capture requirements
▪ Break down high-level goals into functional requirements.
▪ For each functional requirement, define metrics to estimate costs.

Cost tradeoffs
▪ Determine if the cost of high availability exceeds acceptable downtime.
▪ Increasing security of the workload will increase cost.
▪ Systems monitoring and automation might increase the cost initially but will reduce cost over time.
Design Checklist
Principle: Aim for scalable costs

❑ Consider tradeoffs: security, scalability, resilience, operability
❑ Choose managed services
❑ consumption-based pricing pre-provisioned costs
❑ appropriate subscription
❑ proof-of-concept
❑ Optimize
❑ Reduce server load
Let’s walk through some
questions for
Cost Optimization
in the Well-Architected
Review
How are you modeling cloud costs?

Cost modeling is an exercise where you create logical groups of cloud resources that are
mapped to the organization's hierarchy and then estimate costs for those groups. The goal of
cost modeling is to estimate the overall cost of the organization in the cloud.

❑ Cloud costs are being modeled for this workload.
❑ The price model of the workload is clear.
❑ Critical system flows through the application have been defined for all key business scenarios.
❑ There is a well-understood capacity model for the workload.
❑ Internal and external dependencies are identified, and cost implications understood.
❑ Cost implications of each Azure service used by the application are understood.
❑ The right operational capabilities are used for Azure services.
❑ Special discounts given to services or licenses are factored in when calculating new cost models for services being
moved to the cloud.
❑ Azure Hybrid Benefit is used to drive down cost in the cloud.
❑ None of the above.
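
Cost modeling as described above (logical groups of resources mapped to the organization, then a cost estimate per group) can be sketched as a roll-up over tagged resources. The resource records, the "team" tag, and the cost figures below are hypothetical. Real estimates would come from a pricing tool or billing export.

```python
from collections import defaultdict


def cost_by_group(resources: list, group_key: str = "team") -> dict:
    """Sum estimated monthly cost per organizational group, based on a tag."""
    totals = defaultdict(float)
    for resource in resources:
        group = resource.get("tags", {}).get(group_key, "untagged")
        totals[group] += resource["monthly_cost"]
    return dict(totals)
```

Resources without the tag land in an "untagged" bucket, which makes gaps in the mapping visible instead of silently dropping their cost.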
How are you monitoring costs?
Consider the metrics for each resource in the workload. For each metric, build alerts on baseline
thresholds.

❑ Alerts are set for cost thresholds and limits.
❑ Specific owners and processes are defined for each alert type.
❑ Application Performance Management (APM) tools and log aggregation technologies are used to collect logs and
metrics from Azure resources.
❑ Cost Management Tools (such as Azure Cost Management) are being used to track spending in this workload.
❑ None of the above.
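
Alerts on cost thresholds can be pictured as a check of current spend against percentages of a budget. This is a minimal sketch of that logic and not how any Azure service implements it. The default thresholds of 50, 80, and 100 percent are assumptions.

```python
def crossed_thresholds(spend: float, budget: float, thresholds=(0.5, 0.8, 1.0)) -> list:
    """Return the budget fractions that current spend has reached, lowest first."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in sorted(thresholds) if used >= t]
```

Each returned threshold would map to an alert with a named owner, so that someone is responsible for acting when, for example, 80 percent of the budget is consumed.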
How do you ensure that cloud services are
appropriately provisioned?
Deployment of cloud resources of a workload is known as provisioning.

❑ Performance requirements are well-defined.
❑ Targets for the time it takes to perform scale operations are defined and monitored.
❑ The workload is designed to scale independently.
❑ The application has been designed to scale both in and out.
❑ Application components and data are split into groups as part of your disaster recovery strategy.
❑ Tools (such as Azure Advisor) are being used to optimize SKUs discovered in this workload.
❑ Resources are reviewed weekly or bi-weekly for optimization.
❑ Cost-effective regions are considered as part of the deployment selection.
❑ Dev/Test offerings are used correctly.
❑ Shared hosting platforms are used correctly.
❑ None of the above.
Next steps

▪ Assess your workload with a Well-Architected Review:
https://aka.ms/wellarchitected/review
▪ Gather technical recommendations and optimize deployments with Azure Advisor:
https://aka.ms/azureadvisor
▪ Learn how to build great solutions with Well-Architected Framework:
https://docs.microsoft.com/en-us/learn/
