Adelaide Tech Series - Operational Excellence 29Mar2023

In our second Adelaide AWS Tech Series event of 2023 we’re going to ‘take a peek’ at how Amazon and AWS ‘Operate at Scale’. The customer benefit is twofold: AWS looks after reliability ‘of the cloud’, and we share our operational experience with our customers, which lets you run reliable workloads ‘in the cloud’. This partnership is laid out in the Shared Responsibility Model. Read more at https://aws.amazon.com/compliance/shared-responsibility-model/ .

Today’s event is in three parts, but our main focus will be getting you hands-on with a ‘prod’ type environment where we explore ‘chaos engineering’. We’re focusing on the Reliability and Operational Excellence pillars of AWS Well-Architected. https://aws.amazon.com/architecture/well-architected

  • Part One - AWS Well Architected - A brief history and how it provides the backbone for you, our customers, to ‘Operate at Scale’.
  • Part Two - Hands on lab
  • Part Three - An Operational Readiness Review (ORR) scenario (if we have time), or we can go straight to the networking and refreshments. Your choice…

Part One - Operational Excellence and the AWS Well Architected Framework (WAFR)

Before we start our hands on lab, titled ‘Health Checks and Dependencies’, it’s important to understand how Amazon and AWS ‘Operate at Scale’. There are a multitude of learnings here for any organisation, large or small, and in any sector. These learnings apply much more broadly than just IT organisations and ‘born in the cloud’ startups. In fact many of Amazon’s best practices in ‘Operating at Scale’ were developed in the second half of the 20th Century.

AWS’ Approach to Operational Excellence

We are never satisfied with operational performance that is anything less than perfect

A few quotes from AWS and Amazon personnel on Operational Excellence from https://aws.amazon.com/blogs/aws/reinvent-2020-liveblog-werner-vogels-keynote/

  • Steve Roberts “Automation is necessary, but not entirely sufficient, for operational excellence.” Enter pessimistic curiosity!
  • Martin Beeby, “Observability has three main components.
    • Logging
    • Monitoring
    • Tracing
    • You should Log everything”
  • Channy Yun “Yes! The goal of chaos engineering is to understand how your application responds to issues by injecting failures into your infrastructure like automated reasoning –it’s for everyone, not only large scale like Amazon or Netflix.”
  • Werner Vogels “Operations are Forever”
  • Werner Vogels “Everything fails, all the time”
  • Steve Roberts “When traffic is predictable, manual scaling is possible. The pandemic has exposed large traffic swings and unpredictability and we need to rethink architectures and scaling approaches.”

Let’s start with a summary of AWS from an Operations perspective:

In the presentation just before we start the lab, we’ll explore some learnings from AWS and a real customer journey that used Well-Architected to work towards Operational Excellence.

Resources

This section provides attendees with a set of resources, further reading and automation, that dive deeper into our hands on lab.

Part Two - Hands on Lab - Health Checks and Dependencies

Lab Instructions - https://www.wellarchitectedlabs.com/reliability/300_labs/300_health_checks_and_dependencies/

Accessing the Lab - go to dashboard.eventengine.run/ and enter the hash code provided by your instructor (NOTE: shared to your email address)

Health Checks and Dependencies

The lab is ‘Level 300: Implementing Health Checks and Managing Dependencies to improve Reliability’, found under AWS Well-Architected Labs > Reliability > 300 Labs. It is part of the Well-Architected Labs series, which is grouped by the Well-Architected pillars: security, performance efficiency, reliability, cost optimization[sic], operational excellence and sustainability.

In this lab we will:

  • Deploy our network and our sample prod app as Infrastructure as Code (IaC) using AWS CloudFormation. There are two templates here: we deploy the network first (or IaaS if you prefer an ‘as a service’ term), then we build the app.
  • Once our environment is built we’ll conduct some chaos engineering by injecting faults into our highly available production workload and then mitigating them. NOTE: We’ll be accessing workload configuration settings to trigger faults and we’ll be monitoring the impacts; but in a real production environment we would need to lock this access down and apply the principle of least privilege to meet Business Continuity Planning (BCP), Disaster Recovery (DR) and Governance expectations.
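To make the ‘network first, then app’ ordering concrete, here is a minimal Python sketch of deploying two templates in sequence. It is not the lab’s own tooling (the lab uses the console). The stack names, the placeholder template bodies and the `CAPABILITY_NAMED_IAM` capability are assumptions for illustration. The `cfn` argument is expected to behave like a boto3 CloudFormation client, and a small fake client is included so the sketch runs without AWS credentials.

```python
from typing import Any, Dict, List


def deploy_in_order(cfn: Any, stacks: List[Dict[str, str]]) -> List[str]:
    """Create each stack and wait for it to complete before starting the next.

    `cfn` should behave like boto3.client("cloudformation").
    """
    created = []
    for stack in stacks:
        cfn.create_stack(
            StackName=stack["name"],
            TemplateBody=stack["template_body"],
            # Assumption: the templates create named IAM resources.
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )
        # Block until CloudFormation reports the stack as created.
        cfn.get_waiter("stack_create_complete").wait(StackName=stack["name"])
        created.append(stack["name"])
    return created


class FakeCloudFormation:
    """Stand-in client that only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def create_stack(self, **kwargs: Any) -> Dict[str, str]:
        self.calls.append(("create", kwargs["StackName"]))
        return {"StackId": "fake/" + kwargs["StackName"]}

    def get_waiter(self, waiter_name: str) -> Any:
        parent = self

        class _Waiter:
            def wait(self, StackName: str) -> None:
                parent.calls.append(("wait", StackName))

        return _Waiter()


# Hypothetical stack names and placeholder template bodies.
LAB_STACKS = [
    {"name": "lab-network", "template_body": "<network template text>"},
    {"name": "lab-app", "template_body": "<app template text>"},
]
```

Against a real account you would pass `boto3.client("cloudformation")` instead of `FakeCloudFormation()`. Waiting on each stack matters because the app template depends on the network already being there.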

The following sub sections include some hints and tips or some User Interface (UI) deviations from the lab instructions.

RECOMMENDATION: Where you see the ‘option 1 / option 2’ choice in the ‘modify your code’ sections, I’d suggest you grab the URL of the code changes and not worry about modifying the code yourself. We’re not here to rewrite and debug Python code today.

We use these URLs to modify our templates to break / fix our environment. These changes are applied to our app template, not the network (or IaaS) template.

2.3 Error Handling Code

We’re going to update our CloudFormation template to redeploy our EC2 instances with updated application code that can operate in a degraded state, which is more graceful than just returning a 502 error. Here’s the updated code url.
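For a feel of what ‘operating in a degraded state’ means in code, here is a minimal sketch. It is not the lab’s actual server code. The function names and the static fallback text are hypothetical. The idea is that when the dependency call fails, the handler still answers with a generic response and a short diagnostic note, so the error doesn’t bubble up to the user.

```python
from typing import Callable, Tuple

# Hypothetical static content served when the dependency is unavailable.
STATIC_RECOMMENDATION = "our most popular show"


def handle_request(get_recommendation: Callable[[], str]) -> Tuple[int, str]:
    """Return (status, body), falling back to static content on failure."""
    try:
        recommendation = get_recommendation()
        return 200, f"Recommended for you: {recommendation}"
    except Exception as exc:
        # Degraded mode: still a useful page, plus a note for operators.
        note = f"degraded, dependency error: {type(exc).__name__}"
        return 200, f"Recommended for you: {STATIC_RECOMMENDATION} ({note})"


def healthy_dependency() -> str:
    """Stand-in for a working personalised recommendation call."""
    return "a show picked just for you"


def broken_dependency() -> str:
    """Stand-in for the dependency being switched off or unreachable."""
    raise ConnectionError("dependency unavailable")
```

Calling `handle_request(broken_dependency)` still returns a 200 with the static text, which is the behaviour we want to see in the browser once the updated template is deployed.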

3.1 Re-enable the dependency service

NOTE: For convenience in this workshop we’re accessing our secrets via the AWS Systems Manager Parameter Store. In a real-world environment, access to this service and to our secrets would be controlled in line with our organisation’s least privilege controls. In this lab our ‘parameter’ is a boolean flag.
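As a rough illustration of how an application might read a boolean flag like this, here is a small Python sketch. The parameter name is hypothetical, and storing the flag as the text ‘true’ or ‘false’ is an assumption. The `ssm` argument is expected to behave like a boto3 SSM client, whose `get_parameter(Name=...)` call returns the value under `["Parameter"]["Value"]`. A fake client is included so the sketch runs offline.

```python
from typing import Any, Dict

# Hypothetical parameter name, not the one used by the lab.
FLAG_NAME = "/lab/dependency-enabled"


def dependency_enabled(ssm: Any, name: str = FLAG_NAME) -> bool:
    """Read a Parameter Store value and interpret it as a boolean flag."""
    response = ssm.get_parameter(Name=name)
    value = response["Parameter"]["Value"]
    # Assumption: the flag is stored as the text "true" or "false".
    return value.strip().lower() == "true"


class FakeSsm:
    """Stand-in for boto3.client("ssm") backed by a plain dictionary."""

    def __init__(self, values: Dict[str, str]) -> None:
        self.values = values

    def get_parameter(self, Name: str) -> Dict[str, Dict[str, str]]:
        return {"Parameter": {"Name": Name, "Value": self.values[Name]}}
```

Flipping the stored value from ‘true’ to ‘false’ is then all it takes to simulate the dependency going away, which is why access to the parameter needs tight control outside a workshop.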

Inject fault on a single server

There have been some console changes here. The menu looks a little different. Ask for help as needed.

3.4.1 Expert option: make and deploy your changes to the code

For those who don’t want to modify the code, you can use the following link to update your CloudFormation template. https://www.wellarchitectedlabs.com/Reliability/300_Health_Checks_and_Dependencies/Code/Python/server_healthcheck.py
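The linked server_healthcheck.py isn’t reproduced here. As a minimal sketch of the general idea, a health check can exercise the dependency and report unhealthy when that call fails, so the load balancer can stop sending traffic to that server. The function names and the choice of a 503 status are assumptions for illustration, not taken from the lab code.

```python
from typing import Callable, Tuple


def healthcheck(call_dependency: Callable[[], object]) -> Tuple[int, str]:
    """Return (status, message) for a health check that tests the dependency."""
    try:
        call_dependency()
    except Exception as exc:
        # Assumption: 503 signals "unhealthy" to whatever polls this endpoint.
        return 503, f"unhealthy: dependency check failed ({type(exc).__name__})"
    return 200, "healthy"


def dependency_ok() -> str:
    """Stand-in for a dependency call that succeeds."""
    return "ok"


def dependency_down() -> str:
    """Stand-in for a dependency call that times out."""
    raise TimeoutError("dependency timed out")
```

The trade-off to discuss in the room: the request handler keeps serving users in a degraded state, while a health check like this tells the platform the truth about the server.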

Part Three - ORR Whiteboard Scenario

If we have time, and folks are keen to keep exploring Operational Excellence, we can use the following link as our ORR whiteboard scenario. https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/appendix-a-creating-orr-guidance-from-an-incident.html

Wrapping up the Event

Here are some words of wisdom on ORR, and two of Werner Vogels’ earlier statements of fact, that make for a good wrap-up.

Everything fails, all the time and Operations are Forever

Some good links on ORR at https://aws.amazon.com/builders-library/?cards-body.sort-by=item.additionalFields.sortDate&cards-body.sort-order=desc&awsf.filter-content-category=*all&awsf.filter-content-type=*all&awsf.filter-content-level=*all&cards-body.q=operational&cards-body.q_operator=AND

Here’s the pdf on ORR for AWS Well Architected https://docs.aws.amazon.com/pdfs/wellarchitected/latest/operational-readiness-reviews/operational-readiness-reviews.pdf#wa-operational-readiness-reviews

Homework: watch AWS re:Invent 2021 - Amazon Builders’ Library: Operational Excellence at Amazon https://www.youtube.com/watch?v=7MrD4VSLC_w

Now ‘Go build Well-Architected workloads’…
