Azure Well Architected
FOR USE ONLY AS PART OF MICROSOFT VIRTUAL TRAINING DAYS PROGRAM. THESE MATERIALS ARE NOT AUTHORIZED
FOR DISTRIBUTION, REPRODUCTION OR OTHER USE BY NON-MICROSOFT PARTIES.
Microsoft Azure Virtual
Training Day:
Well-Architected
Well-Architected
Overview
Agenda
▪ Resources
Data breaches cost you
—and your customers
Customer PII was the most frequently compromised, and the costliest, type of
record according to the latest data breach study*
Azure Well-Architected Framework
Pillars: Reliability; Cost Optimization; Operational Excellence; Performance Efficiency; Security
Supporting elements: Azure Well-Architected Review; Azure Advisor; Documentation; Partners, Support & Service Offers; Reference Architectures; Design Principles
Building well-architected workloads—
is a shared responsibility
Scope of
Well-Architected
Assessments
Customer application
Customer app or workload, built on the Azure platform
Platform features
Optional Azure capabilities a customer enables – to ensure security, reliability, operability, performance
Platform foundation
Core capabilities built into the Azure platform – how the foundation is designed, operated, and monitored
Business requirements influence decisions about workload architectures
DEV/TEST WORKLOADS
MISSION-CRITICAL WORKLOADS
Cost Optimization
▪ No cost and usage monitoring
▪ Unclear on underused/orphaned resources
▪ Lack of structured billing management
▪ Budget reductions from lack of support for cloud adoption by leadership
Operational Excellence
▪ No rapid issue identification
▪ No deployment automation
▪ No communication mechanisms & dashboards
▪ Unclear expectations and business outcomes
▪ No visibility on root cause for events
Performance Efficiency
▪ No monitoring of new services
▪ No monitoring of current workload health
▪ No design for scaling
▪ Lack of rigor and guidance for technology and architecture selection
Reliability
▪ Unclear on resiliency capabilities for improved architecture design
▪ Lack of data backup practices
▪ No monitoring of current workload health
▪ No resiliency testing
▪ No support for disaster recovery
Security
▪ No access control mechanism (authentication)
▪ No security threat detection mechanism
▪ Lack of security threat response plan
▪ No encryption process
Best practices to drive workload quality
Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, Security
Analyze: Confirm optimization opportunities
Advise: Actionable plan and recommendations to optimize your workloads
Implement: Implement recommendations and continuous process
Using the Azure Well-Architected Review
▪ This web-based assessment helps improve the quality of a workload by:
▪ Examining the workload across the 5 pillars of the Azure Well-Architected Framework (Reliability, Cost Optimization, Security, Operational Excellence, and Performance Efficiency)
▪ Providing specific guidance to improve architecture and overcome detected hurdles effectively
▪ Proactively focusing on the pillar where most attention is needed
▪ Driving consistency into workload discussions throughout the team
https://aka.ms/wellarchitected/review
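The idea of focusing on the pillar that needs the most attention can be sketched in a few lines of Python. This is only an illustration: the scoring rule (fraction of checked items per pillar), the function names, and the sample answers are assumptions made here, not the scoring used by the actual Well-Architected Review.

```python
from collections import defaultdict

PILLARS = (
    "Reliability",
    "Cost Optimization",
    "Security",
    "Operational Excellence",
    "Performance Efficiency",
)

def pillar_scores(answers):
    """Return the fraction of checked items per pillar.

    `answers` is a list of (pillar, checked) pairs, one per checklist item.
    """
    checked = defaultdict(int)
    total = defaultdict(int)
    for pillar, is_checked in answers:
        if pillar not in PILLARS:
            raise ValueError(f"unknown pillar: {pillar}")
        total[pillar] += 1
        checked[pillar] += int(is_checked)
    return {p: checked[p] / total[p] for p in total}

def weakest_pillar(scores):
    """Pick the pillar with the lowest score (ties broken alphabetically)."""
    return min(sorted(scores), key=lambda p: scores[p])

# Hypothetical answers, for illustration only.
example_answers = [
    ("Security", True), ("Security", False), ("Security", False),
    ("Reliability", True), ("Reliability", True),
    ("Cost Optimization", True), ("Cost Optimization", False),
]
example_scores = pillar_scores(example_answers)
print(weakest_pillar(example_scores))
```

A simple ratio like this treats every checklist item as equally important; a real assessment would likely weight items differently.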
Azure Well-Architected
Review
Assess workloads with the pillars of the Microsoft Azure Well-Architected Framework:
Reliability, Cost Optimization, Operational Excellence, Performance Efficiency, and Security.
Azure Security
Operations: Security operations that work for you
Technology: Enterprise-class technology
Partnerships: Partnerships for a heterogeneous world
Situation
At a high level, the RxWell solution’s architecture has three main zones: on-premises, the cloud, and the end-user apps. RXR and its partners committed to strict adherence to the principles of Responsible AI. All data is anonymized, and only aggregated data gets stored. Even that aggregated data is treated as sensitive, and none of it is shared externally. Meanwhile, the multilayered security controls of Azure help keep it all safe.
Solution
“When it came to developing RxWell, there was simply no other company that had the capability and the infrastructure to meet our comprehensive data, analytics, and security needs than Microsoft. With our partnership, the RxWell program provides our customers the tools they need to safely navigate the ‘new abnormal’ of COVID-19 and beyond.”
- Scott Rechler, Chairman and CEO, RXR Realty
Impact
“The well-architected Azure framework, fundamentally addresses things like scalability, reliability, security, and operational excellence. Because we started building our solution from day one on those pillars, that helped us to absorb all these nuances from the integration perspective.”
- Saurav K. Chandra, Principal Architect for Internet of Things, Infosys
Build and manage proactively secured workloads
Security provides principles to protect, detect, and respond to threats
across your Azure environment.
Build upon a secure foundation
▪ Design assuming workload failure with multi-layer protection controls.
▪ Build workloads using zero-trust principles in both IaaS and PaaS
▪ Embrace Azure’s security investments, resources, and compliance certifications
Proactively stay secure with native controls
▪ Continuously manage your workload security from a single pane of glass with Azure Security Center.
▪ Protect your workloads from malicious attacks with cloud-native Azure Web Application Firewall
▪ Manage identity and access for your workload with Azure Active Directory
Detect and respond to threats
▪ Leverage large-scale intelligence from decades of Microsoft security experience to work with the Microsoft Intelligent Security Graph, collected from 8 trillion threat signals analyzed daily
▪ Embrace automation with Azure Defender to get threat protection for your workload
▪ Establish procedures to identify and mitigate threats for your workloads with Azure Sentinel
Build on a secure foundation
Principle: Build a comprehensive strategy
A security strategy should consider investments in culture, processes, and security controls across all system
components. The strategy should also consider security for the full lifecycle of system components including the supply
chain of software, hardware, and services.
DDoS protection: DDoS protection tuned to your application traffic patterns
Web Application Firewall: Centralized inbound web application protection from common exploits and vulnerabilities
Azure Firewall: Data exfiltration protection using centralized outbound and inbound (non-HTTP/S) network and application (L3-L7) filtering
Network Security Groups: Distributed inbound & outbound network (L3-L4) traffic filtering on VM, Container or subnet
VNET Integration: Restrict access to Azure service resources (PaaS) to only your Virtual Network
In-depth defense
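To make the L3-L4 filtering layer more concrete, here is a minimal Python sketch of priority-ordered rule evaluation in the style of a network security group, where rules are processed from the lowest priority number upward and the first match decides. The `Rule` type, the `is_allowed` function, and the sample rules are hypothetical names made up for this illustration; real NSG rules carry more fields (direction, protocol, destination) than shown here.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    priority: int          # lower number is evaluated first
    source: str            # source CIDR block, e.g. "10.0.1.0/24"
    port: Optional[int]    # destination port; None matches any port
    allow: bool            # True = allow, False = deny

def is_allowed(rules, source_ip, dest_port):
    """Return the decision of the first matching rule, in priority order."""
    addr = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_range = addr in ipaddress.ip_network(rule.source)
        port_ok = rule.port is None or rule.port == dest_port
        if in_range and port_ok:
            return rule.allow
    return False  # nothing matched: this sketch denies by default

# Hypothetical rule set, for illustration only.
example_rules = [
    Rule(priority=100, source="10.0.1.0/24", port=443, allow=True),
    Rule(priority=200, source="10.0.0.0/16", port=None, allow=False),
    Rule(priority=4096, source="0.0.0.0/0", port=None, allow=False),
]
print(is_allowed(example_rules, "10.0.1.5", 443))
```

Because the first match wins, a narrow allow rule must have a lower priority number than the broader deny rule that would otherwise cover the same traffic.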
Proactively stay secure with native controls
Principle: Leverage native controls
Azure Security Center
A unified infrastructure security management system that:
▪ Strengthens the security posture of your data center
▪ Provides advanced threat protection across your hybrid workloads in the cloud, on Azure or on premises
Azure Web Application Firewall
▪ Centralized protection and inspection of HTTP requests to prevent attacks such as SQL Injection or Cross-Site Scripting.
Azure Active Directory
▪ Microsoft’s cloud-based identity and access management service, which helps your employees securely access resources.
▪ Managed Identities eliminate the need to store credentials that could be leaked.
▪ Use Azure AD Connect for synchronizing Azure AD with your existing on-premises directory.
Detect and respond to threats
Principle: Design for resilience
Native threat detection
SQL Protection
Microsoft Defender
Security
❑ Identity as Primary Access Control
❑ Assign permissions to users, groups, and applications
❑ Use built-in roles
❑ Restrict control plane access
aka.ms/wellarchitected/review
—Follow technical guidance for next steps of
how to create and optimize your workloads.
Let’s walk through
some questions for
Security
in the Well-Architected
Review
Have you done a threat analysis of your workload?
Threat modeling is an engineering technique that can be used to help identify threats, attacks,
vulnerabilities, and countermeasures that could affect an application. Threat analysis consists of
defining security requirements, identifying threats, mitigating threats, and validating threat
mitigation. All of these are needed to ensure proper security of a workload on both the
prevention and reaction fronts.
❑ Threat modeling processes are adopted, identified threats are ranked based on organizational impact, mapped to
mitigations and communicated to stakeholders.
❑ Timelines and processes are established to deploy mitigations (security fixes) for identified threats.
❑ Security requirements are defined for this workload.
❑ Threat protection was addressed for this workload.
❑ Security posture was evaluated with standard benchmarks (CIS Control Framework, MITRE framework etc.).
❑ Business critical workloads, which may adversely affect operations if they are compromised or become unavailable,
were identified and classified.
❑ None of the above.
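The first checklist item (rank identified threats by organizational impact and map them to mitigations) can be sketched as below. The risk score used here (impact multiplied by likelihood on 1 to 5 scales), the `Threat` type, and the sample threats are assumptions chosen for illustration, not a method prescribed by the review.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Threat:
    name: str
    impact: int                       # 1 (low) to 5 (severe) organizational impact
    likelihood: int                   # 1 (rare) to 5 (expected)
    mitigation: Optional[str] = None  # None means no mitigation is mapped yet

def rank_threats(threats):
    """Order threats from highest to lowest risk score (impact x likelihood)."""
    return sorted(threats, key=lambda t: t.impact * t.likelihood, reverse=True)

def unmitigated(threats):
    """Names of threats that still need a mitigation before sign-off."""
    return [t.name for t in threats if not t.mitigation]

# Hypothetical threat register, for illustration only.
example_threats = [
    Threat("SQL injection on search API", 4, 3, "Parameterized queries and WAF rule"),
    Threat("Leaked storage key", 5, 2),
    Threat("Phishing of operators", 3, 4, "MFA and phishing simulations"),
]
for threat in rank_threats(example_threats):
    print(threat.impact * threat.likelihood, threat.name, "->", threat.mitigation)
```

The ranked list is what would be communicated to stakeholders; the `unmitigated` list feeds the timeline for deploying security fixes.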
How is security validated, and how do you handle
incident response when a breach happens?
If prevention fails and security of the application is breached, proper response and mitigation
can minimize damage and contain the attacker within minimal boundaries.
❑ For containerized workloads, Azure Defender (Azure Security Center) or other third-party solution is used to scan
for vulnerabilities.
❑ Penetration testing is performed in-house, or a third-party entity performs penetration testing of this workload to
validate the current security defenses.
❑ Simulated attacks on users of this workload, such as phishing campaigns, are carried out regularly.
❑ Operational processes for incident response are defined and tested for this workload.
❑ Playbooks are built to help incident responders quickly understand the workload and components, to mitigate an
attack and do an investigation.
❑ There's a security operations center (SOC) that leverages a modern security approach.
❑ A security training program is developed and maintained to ensure security staff of this workload are well-informed
and equipped with the appropriate skills.
❑ None of the above.
Are keys, secrets, and certificates managed in a
secure way?
Secrets like API keys and certificates are sensitive pieces of information that need to be
managed in a secure way - that includes proper storage, encryption and access control.
❑ There's a clear guidance or requirement on what type of keys (PMK - Platform Managed Keys vs. CMK - Customer
Managed Keys) should be used for this workload.
❑ Passwords and secrets are managed outside of application artifacts, using tools like Azure Key Vault.
❑ Access model for keys and secrets is defined for this workload.
❑ A clear responsibility / role concept for managing keys and secrets is defined for this workload.
❑ Secret/key rotation procedures are in place.
❑ Expiry dates of SSL/TLS certificates are monitored and there are renewal processes in place.
❑ None of the above.
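Monitoring certificate expiry dates, as the checklist asks, comes down to comparing each certificate's expiry against a warning window. The sketch below assumes the expiry dates have already been collected into a dictionary; the function name, the 30-day window, and the host names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def expiring_soon(certificates, now, warn_days=30):
    """Return names of certificates that expire within `warn_days` of `now`.

    Already-expired certificates are included, since they also need renewal.
    `certificates` maps a name to its timezone-aware expiry datetime.
    """
    cutoff = now + timedelta(days=warn_days)
    return sorted(name for name, not_after in certificates.items()
                  if not_after <= cutoff)

# Hypothetical inventory, for illustration only.
example_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
example_certs = {
    "api.example.com": datetime(2024, 1, 20, tzinfo=timezone.utc),
    "www.example.com": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "old.example.com": datetime(2023, 12, 1, tzinfo=timezone.utc),
}
print(expiring_soon(example_certs, example_now))
```

Running a check like this on a schedule and alerting on a non-empty result gives the renewal process a trigger.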
Performance Efficiency Pillar
Microsoft Well-Architected—
Build and manage high-performing workloads
Build and manage—scalable, efficient workloads
Design and manage workloads that scale according to load changes, and efficient systems,
processes, and resources
▪ Manage resource scaling with Azure SQL Database and Azure App Service, or scale dynamically with demand with Azure Autoscale
▪ Optimize your network and storage with Azure Cosmos DB, Azure Traffic Manager, and Azure Cache for Redis
▪ Design parts of the process to be discrete and decomposable to maximize compute resources, and take microservices architecture into account
▪ Evaluate health levels of workloads with Azure Monitor and Log Analytics to provision resources dynamically, and scale to match demand
▪ Assess and remediate deep application performance issues and trends with Azure Application Insights
▪ Embrace a data-driven culture to deliver timely insights across data to your entire organization
Diagram: Client, API Gateway, Microservices (four Services), Management/Orchestration, DevOps
Optimal service execution
Principle: Invest in capacity planning
Azure Monitor
▪ Comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments.
▪ Helps to maximize the availability and performance of applications and services
Log Analytics
▪ Edit and run log queries from data collected by Azure Monitor Logs, and interactively analyze the results
▪ Retrieve records matching precise criteria, identify trends, analyze patterns, and provide a variety of data insights
Application Insights
▪ Provides visibility into app performance and utilization patterns
▪ Monitors various data sources, including request, response, and failure rates, exceptions, page views, and load performance
Scaling design
Azure Well-Architected
Review
Assess workloads with the pillars of the
Microsoft Azure Well-Architected Framework:
❑ The health model can determine if the workload is performing at the expected targets.
❑ Retention times for logs and metrics have been defined and housekeeping mechanisms are configured.
❑ Long-term trends are analyzed to predict performance issues before they occur.
❑ None of the above.
How are you benchmarking your workload?
❑ You have identified goals or a baseline for workload performance.
❑ Performance goals are based on device and/or connectivity type as appropriate.
❑ You have defined an initial connection goal for your workload.
❑ There is a goal defined for complete page load times.
❑ You have defined goals for an API (service) endpoint complete response.
❑ There are goals defined for server response time.
❑ You have goals for latency between the systems & microservices of your workload.
❑ There are goals on database query efficiency.
❑ You have a methodology to determine what acceptable performance is.
❑ None of the above.
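Once a goal exists for, say, server response time, checking a benchmark run against it is straightforward. The sketch below compares the 95th percentile of measured latencies with a goal; the choice of the 95th percentile, the function names, and the sample data are assumptions for illustration, since the checklist does not prescribe a statistic.

```python
import statistics

def p95(samples_ms):
    """95th percentile of the samples (needs at least two data points)."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100)[94]

def meets_goal(samples_ms, goal_ms):
    """True when the 95th percentile latency is within the goal."""
    return p95(samples_ms) <= goal_ms

def evaluate(measurements, goals):
    """Check every metric that has a goal; returns {metric: bool}."""
    return {metric: meets_goal(measurements[metric], goal)
            for metric, goal in goals.items()}

# Hypothetical measurements and goals (milliseconds), for illustration only.
example_measurements = {
    "server_response": list(range(1, 101)),      # 1 ms .. 100 ms
    "api_endpoint": [120, 130, 125, 400, 128],
}
example_goals = {"server_response": 100, "api_endpoint": 200}
print(evaluate(example_measurements, example_goals))
```

Using a percentile rather than the mean keeps a handful of slow requests from being hidden by many fast ones.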
Operational Excellence Pillar
Microsoft Well-Architected—
Build and manage high-performing workloads
Build, deploy, and manage workloads—
with trustworthy processes
• Apply DevOps to break down silos between development and operations across your organization
• Build and test workloads with Continuous Integration and Continuous Delivery (CI/CD) both in development and production stages
• Perform extensive automated testing with Azure Pipelines or manual testing with Azure Test Plans
• Dive deep into your workload's information with Log Analytics for infrastructure and with Azure Application Insights for application trends
• Manage the health of your system and activity logging by consuming core monitoring insights provided by Azure Monitor
• Reduce process risks by automating operational tasks and deployments with Azure Automation, Azure CLI and Azure PowerShell
• Enjoy the flexibility of creating agile and independent workloads with microservices and loosely coupled architectures
• Use Chaos Engineering practices and run regular tests to reach higher levels of maturity and operational effectiveness.
Agile and accurate processes
Principle: Optimize build and release processes
Infrastructure as Code
• Define the entire Infrastructure as Code just as you define your application
• Increase accuracy and reduce process risks by preventing configuration drift
• Enable easy recreation of new environments, e.g., for developing new features
Continuous Integration & Continuous Delivery
• Build and test workloads with Continuous Integration and Continuous Delivery (CI/CD) both in development and production stages, to achieve a single and consistent way of building and deploying.
• Eliminate error-prone manual interventions
• Versioning of CI/CD pipelines for traceability of changes
Automated testing
• Perform extensive automated testing to ensure a stable code base and resource composition before deploying to critical systems
• Achieve a faster time-to-ship with fewer errors
Focused and assertive application monitoring
Principle: Monitor system and understand operational health
• Give developers early feedback on pushed code changes
• Avoid outages caused by the rollout of new features
• Build confidence in the overall health of your workload
• Dive deep into instrumentation with Log Analytics for infrastructure monitoring
• Instrument your code to collect all relevant events and metrics
• Understand the business impact of reduced workload health
• Correlate events and metrics across different parts of your solution
• Respond to issues with self-healing capabilities
• Apply DevOps to break down silos between development and operations across your organization.
• Enjoy the flexibility of creating agile and independent workloads with microservices.
• Only tested recovery procedures will work in times of emergency
• Test your workload with injected faults in a safe environment
• Establish well-defined owners and playbooks for procedures and tasks to optimize operational effectiveness.
• Establish regular cadences for testing operational procedures and tasks.
• Review operational incidents to improve operational effectiveness.
• Establish Root Cause Analysis processes.
• Save time, reduce risks and avoid errors by automating operational tasks or any deployments that may occur on a schedule, in response to events or monitoring alerts, or ad hoc based on external factors.
• Automate deployments with Infrastructure as Code to define the infrastructure that needs to be deployed.
• Optimize workload configurations by automating software installs, adding data to a database, updating networking and other actions.
Automation
Reduce toil, improve efficiency, and ensure consistency
❑ repeatable infrastructure
❑ Avoid configuration drift
❑ Dynamically provision
❑ disaster recovery plan
Infrastructure configuration
❑ Configuration
Operational task
• A log aggregation technology, such as Azure Log Analytics or Splunk, is used to collect logs and metrics from Azure
resources
• Azure Activity Logs are collected within the log aggregation tool
• Resource-level monitoring is enforced throughout the application
• Logs and metrics are available for critical internal dependencies
• Log levels are used to capture different types of application events.
• There are no known gaps in application observability that led to missed incidents and/or false positives.
• The workload is instrumented to measure customer experience.
• None of the above.
How are you managing the configuration of the
workload?
Cloud-based applications often run on multiple virtual machines or containers in multiple
regions and use multiple external services. How do you manage and store all your app's
configuration settings, feature flags, and secure access settings?
❑ You monitor and take advantage of new features and capabilities of underlying services used in your workload.
❑ Application configuration information is stored using a dedicated management system such as Azure App
Configuration or Azure Key Vault.
❑ Soft-delete is enabled for your keys and credentials, such as the secrets and other objects stored in Key Vault.
❑ Configuration settings can be changed or modified without rebuilding or redeploying the application.
❑ Passwords and other secrets are managed in a secure store like Azure Key Vault or HashiCorp Vault.
❑ The application uses Azure Managed Identities.
❑ The expiry dates of SSL certificates are monitored and there are processes in place to renew them.
❑ Components are hosted on shared application or data platforms as appropriate.
❑ Your workload takes advantage of multiple Azure subscriptions.
❑ The workload is designed to leverage managed services.
❑ None of the above.
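The item about changing settings without rebuilding or redeploying usually means the application reads configuration and feature flags from outside its build artifact at startup or at runtime. The sketch below uses environment variables as a stand-in for that external source; the setting names and the `load_settings` helper are hypothetical, and a dedicated store such as Azure App Configuration would be read through its own client library rather than this code.

```python
import os

def load_settings(env=os.environ):
    """Read settings from an external source (here: environment variables).

    Defaults keep the application usable when a value is not set.
    """
    return {
        # Hypothetical setting names, for illustration only.
        "cache_ttl_seconds": int(env.get("CACHE_TTL_SECONDS", "60")),
        "new_checkout_enabled": env.get("FEATURE_NEW_CHECKOUT", "false").lower() == "true",
    }

# The same code behaves differently when only the external values change.
print(load_settings({}))
print(load_settings({"CACHE_TTL_SECONDS": "5", "FEATURE_NEW_CHECKOUT": "True"}))
```

Secrets such as passwords do not belong in this kind of plain configuration; the checklist routes those to a secure store.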
What operational considerations are you making
regarding infrastructure deployment?
As you provision and update Azure resources, application code, and configuration settings, a
repeatable and predictable process will help you avoid errors and downtime.
❑ The entire application infrastructure is defined as code
❑ No operational changes are performed outside of infrastructure as code
❑ Configuration drift is tracked and addressed
❑ The process to deploy infrastructure is automated
❑ Critical test environments have 1:1 parity with the production environment
❑ Direct write access to infrastructure is not possible and all resources are provisioned or configured through IaC
processes.
❑ None of the above.
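Tracking configuration drift means comparing what the infrastructure code declares with what is actually deployed. A minimal sketch of that comparison follows; the property names and the flat dictionary shape are assumptions for illustration, and fetching the actual state from Azure is outside the scope of this example.

```python
MISSING = "<missing>"

def detect_drift(desired, actual):
    """Return {property: (desired_value, actual_value)} for every difference.

    A property present on only one side is reported with MISSING on the other.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift

# Hypothetical resource properties, for illustration only.
example_desired = {"sku": "Standard", "tls_min": "1.2", "replicas": 3}
example_actual = {"sku": "Standard", "tls_min": "1.0", "replicas": 3,
                  "public_access": True}
print(detect_drift(example_desired, example_actual))
```

An empty result means the deployed resource matches its definition; anything else is either a manual change to revert or a change to bring back into code.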
Reliability Pillar
Microsoft Well-Architected—
Build and manage high-performing workloads
Why is reliability important?
Because avoiding failure is impossible in the public cloud
Reliability Resilience
Your application
Your app or workload, built on the Azure platform.
Scope of
Reliability
Reviews
Resiliency features
Optional Azure capabilities you can enable as needed—high availability, disaster recovery, and backup.
Reliable foundation
Core capabilities built into the Azure platform – how the foundation is designed, operated,
and monitored to ensure availability.
Building reliable applications in the cloud
Enable systems to recover from failures and continue to function
▪ Use Availability Zones where applicable to improve reliability and optimize costs.
▪ Design applications to operate when impacted by failures.
▪ Use the native resiliency capabilities of PaaS to support overall app reliability.
▪ Validate that required capacity is within Azure service scale limits and quotas.
▪ Test regularly to validate existing thresholds, targets and assumptions.
▪ Verify how the end-to-end workload performs under failure conditions.
▪ Conduct load testing with expected peak volumes to test scalability and performance under load.
▪ Perform chaos testing by injecting faults.
▪ Define alerts that are actionable and effectively prioritized.
▪ Create alerts that poll for services nearing their limits and quotas.
▪ Use application instrumentation to detect and resolve performance anomalies.
▪ Troubleshoot issues to gain an overall view of application health.
Design for reliability
Principle: design applications to be resistant to failures
End-to-end workload testing
▪ Simulation testing involves creating real-life situations and demonstrates the effectiveness of proposed solutions.
▪ Use fault injection testing to check the system resiliency during failures, by triggering failures or by simulating them.
▪ Load testing is crucial for identifying failures that only happen under load (e.g., an overwhelmed back-end database, or service throttling).
Build high availability & resiliency testing into strategy
▪ Resilient application architectures should be designed to recover gracefully from failures in alignment with defined reliability targets.
▪ Define an availability strategy to capture how the application remains available when in a failure state.
▪ Define a Business Continuity Disaster Recovery strategy for the application and/or its key scenarios.
Automate testing across BCDR strategy & prepare for failure
▪ Create and fully test a disaster recovery plan using the actual resources needed to restore functionality.
▪ Perform an operational readiness test for failover to the secondary region and for failback to the primary region.
▪ Codify the steps required to recover or failover to a secondary region to limit the impact of an outage.
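Fault injection by simulation can be tried at small scale before touching real infrastructure. The sketch below wraps an operation so that it fails at a chosen rate, then calls it through a retry loop with exponential backoff to see whether the caller recovers gracefully. All names here (`TransientError`, `with_faults`, `call_with_retry`) are hypothetical helpers written for this illustration, not part of any Azure tooling.

```python
import random
import time

class TransientError(Exception):
    """Stands in for a recoverable failure such as throttling or a timeout."""

def with_faults(func, failure_rate, rng):
    """Wrap `func` so that a fraction `failure_rate` of calls raise a fault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TransientError("injected fault")
        return func(*args, **kwargs)
    return wrapper

def call_with_retry(func, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `func`, retrying transient faults with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)

# Seeded generator so a test run is repeatable.
rng = random.Random(42)
flaky_lookup = with_faults(lambda: "ok", failure_rate=0.3, rng=rng)
print(call_with_retry(flaky_lookup, base_delay=0.0))
```

Raising `failure_rate` toward 1.0 shows what the caller does when retries are exhausted, which is the behavior a resiliency test is meant to reveal.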
Monitoring application health
Principle: define, automate, and test operational processes
Azure services & resources alerts & dashboards
▪ Azure Service Health provides a view into the health of Azure services and regions, as well as communications about outages and planned maintenance activities.
▪ Azure Resource Health provides information about the health of individual resources and is highly useful when diagnosing unavailable resources.
▪ Azure dashboards provide a consolidated view of data from Application Insights, Log Analytics, Azure Monitor metrics, and Service Health.
Scaling subscription & service targets
▪ If your application requires more storage accounts than are currently available in your subscription, create a new subscription with additional storage accounts.
▪ Identify scalability targets for VMs including VM size, number of disks, CPU, and memory.
▪ To avoid data throttling, review your Azure SQL Database requirements to ensure that they are adequate.
Fully test BCDR plan
▪ Create and fully test a disaster recovery plan using the actual resources needed to restore functionality.
▪ Perform an operational readiness test for failover to the secondary region and for failback to the primary region.
▪ Codify the steps required to recover or failover to a secondary region to limit the impact of an outage.
Azure Well-Architected
Review
Assess workloads with the pillars of the
Microsoft Azure Well-Architected Framework:
❑ Recovery targets to identify how long the workload can be unavailable (Recovery Time Objective) and how much data is
acceptable to lose during a disaster (Recovery Point Objective).
❑ Availability targets such as Service Level Agreements (SLAs) and Service Level Objectives (SLOs).
❑ Availability metrics to measure and monitor availability such as Mean Time To Recover (MTTR) and Mean Time Between
Failure (MTBF).
❑ Composite SLA for the workload derived using the Azure SLAs for all relevant resources.
❑ SLAs for all internal and external dependencies.
❑ Independent availability and recovery targets for critical application subsystems and scenarios.
❑ None of the above.
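Two of these targets are simple arithmetic. For components that must all be available at once, the composite SLA is the product of the individual SLAs, and availability can be estimated from MTBF and MTTR as MTBF / (MTBF + MTTR). The sketch below works through both; the SLA figures are made-up inputs for illustration, not published Azure SLAs.

```python
import math

def composite_sla(slas):
    """Composite SLA for a serial chain: every component must be up."""
    return math.prod(slas)

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from Mean Time Between Failures and
    Mean Time To Recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def allowed_downtime_minutes(sla, days=30):
    """Downtime budget implied by an availability target over `days`."""
    return (1 - sla) * days * 24 * 60

# Hypothetical figures: a web tier at 99.95% in front of a database at 99.99%.
example_composite = composite_sla([0.9995, 0.9999])
print(f"composite SLA: {example_composite:.6f}")
print(f"downtime budget: {allowed_downtime_minutes(example_composite):.1f} min / 30 days")
print(f"availability: {availability(mtbf_hours=999, mttr_hours=1):.3f}")
```

The composite figure is always lower than the weakest individual SLA, which is why dependencies belong in the target-setting exercise.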
How have you ensured that your application
architecture is resilient to failures?
Resilient application architectures should be designed to recover gracefully from failures in alignment
with defined reliability targets.
Cost optimization =
top cloud initiative for the fifth year running
Customer: H&R Block
Industry: Professional Services
Size: 10,000+ employees
Country: United States
"Our monthly spend year-over-year is nearly flat, while we now have approximately 30 percent more of our total compute in the cloud. Thanks to our partnership with Microsoft, our team has learned valuable techniques and strategies to continue optimizing our spend."
- Paul Clark, Director of Cloud, H&R Block
▪ Monitor your bill, set budgets, and allocate spending to teams and projects with Azure Cost Management + Billing
▪ Forecast costs for future investments with the Azure pricing and TCO calculators
▪ Optimize your resources with Azure Advisor
▪ Follow best practices for workload design with the Azure Well-Architected Framework
▪ Save with Azure offers and licensing terms like the Azure Hybrid Benefit and Reservations
▪ Establish spending objectives and policies using the Microsoft Cloud Adoption Framework for Azure
▪ Implement cost controls in Azure Policy so your teams can go fast while complying with policy
Optimize your costs with tools, offers, and guidance
Principle: Monitor and optimize
▪ Budget alerts notify you when spending reaches predetermined thresholds.
▪ Credit alerts notify you when your Azure Prepayment is consumed.
▪ Department spending quota alerts notify you when quotas are reached.
▪ When workloads are highly variable, choose smaller VM instances, then scale out, rather than up, to get the needed performance.
▪ Many applications can be made stateless, then auto-scaled for cost benefits.
▪ Use Azure Reservations to lower costs by pre-paying for capacity.
▪ Analyze existing pay-as-you-go usage data in the Azure portal before opting into reserved instances.
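The budget alert behavior described above reduces to checking spend against a set of thresholds. A minimal sketch follows; the threshold values (50%, 80%, 100%) and the function name are assumptions for illustration, and a real budget in Azure Cost Management is configured in the service rather than in code like this.

```python
def triggered_alerts(budget, spend, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget thresholds (as fractions) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in sorted(thresholds) if used >= t]

# Hypothetical monthly budget and spend, for illustration only.
print(triggered_alerts(budget=1000, spend=850))
```

Several thresholds below 100% give teams time to react before the budget is actually exhausted.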
Optimize your costs with tools, offers, and guidance
Principle: Keep within cost constraints
Cost modeling is an exercise where you create logical groups of cloud resources that are
mapped to the organization's hierarchy and then estimate costs for those groups. The goal of
cost modeling is to estimate the overall cost of the organization in the cloud.
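As a small worked example of cost modeling, the sketch below groups resources by a level of the organization's hierarchy and sums the estimated monthly cost per group and overall. The field names (`business_unit`, `team`, `monthly_cost`) and the figures are hypothetical.

```python
from collections import defaultdict

def cost_by_group(resources, level):
    """Sum estimated monthly cost per group at the given hierarchy level."""
    totals = defaultdict(float)
    for resource in resources:
        totals[resource[level]] += resource["monthly_cost"]
    return dict(totals)

def total_cost(resources):
    """Overall estimated monthly cost for the organization."""
    return sum(r["monthly_cost"] for r in resources)

# Hypothetical resource estimates, for illustration only.
example_resources = [
    {"name": "web-vm", "business_unit": "Retail", "team": "Storefront", "monthly_cost": 1200.0},
    {"name": "orders-db", "business_unit": "Retail", "team": "Orders", "monthly_cost": 300.0},
    {"name": "bi-warehouse", "business_unit": "Finance", "team": "Analytics", "monthly_cost": 2500.0},
]
print(cost_by_group(example_resources, "business_unit"))
print(total_cost(example_resources))
```

Grouping by `team` instead of `business_unit` gives the finer-grained view needed when allocating spending to individual teams and projects.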