Adelaide Tech Series - Operational Excellence 29Mar2023

In our second Adelaide AWS Tech Series event of 2023 we’re going to ‘take a peek’ at how Amazon and AWS ‘Operate at Scale’. The customer benefits include reliability ‘of the cloud’, and we share our operational experience with our customers, which lets you run reliable workloads ‘in the cloud’. This partnership is laid out in the Shared Responsibility Model. Read more at https://aws.amazon.com/compliance/shared-responsibility-model/ .

This event is in three parts, but our main focus will be getting you hands on with a ‘prod’ type environment where we explore ‘chaos engineering’. We’re focusing on the Reliability and Operational Excellence pillars of AWS Well Architected. https://aws.amazon.com/architecture/well-architected

Part One - AWS Well Architected - A brief history and how it provides the backbone for you, our customers, to ‘Operate at Scale’.
Part Two - Hands on lab
Part Three - An Operational Readiness Review (ORR) scenario (if we have time), or we can go straight to the networking and refreshments. Your choice…

Part One - Operational Excellence and the AWS Well Architected Framework (WAFR)

Before we start our hands on lab, titled ‘Health Checks and Dependencies’, it’s important to understand how Amazon and AWS ‘Operate at Scale’. There are a multitude of learnings here for any organisation, large or small, and in any sector. These learnings apply much more broadly than just IT organisations and ‘born in the cloud’ startups. In fact many of Amazon’s best practices in ‘Operating at Scale’ were developed in the second half of the 20th Century.

AWS’ Approach to Operational Excellence

We are never satisfied with operational performance that is anything less than perfect

A few quotes from AWS and Amazon personnel on Operational Excellence from https://aws.amazon.com/blogs/aws/reinvent-2020-liveblog-werner-vogels-keynote/

Steve Roberts “Automation is necessary, but not entirely sufficient, for operational excellence.” Enter pessimistic curiosity!
Martin Beeby, “Observability has three main components.
- Logging
- Monitoring
- Tracing
- You should Log everything”
Channy Yun “Yes! The goal of chaos engineering is to understand how your application responds to issues by injecting failures into your infrastructure like automated reasoning –it’s for everyone, not only large scale like Amazon or Netflix.”
Werner Vogels “Operations are Forever”
Werner Vogels “Everything fails, all the time”
Steve Roberts “When traffic is predictable, manual scaling is possible. The pandemic has exposed large traffic swings and unpredictability and we need to rethink architectures and scaling approaches.”

Let’s start with a summary of AWS from an Operations perspective:

Amazon S3 now contains more than 280 Trillion objects and experiences more than 100 Million requests per second. https://aws.amazon.com/blogs/aws/celebrate-amazon-s3s-17th-birthday-at-aws-pi-day-2023/ ; our data shows that the majority of S3 access is now machine-to-machine.
Infinite scale (practically not mathematically) to support any level of customer scale. Amazon’s migration to AWS without production disruption is described in a post titled ‘Migration Complete – Amazon’s Consumer Business Just Turned off its Final Oracle Database’ https://aws.amazon.com/blogs/aws/migration-complete-amazons-consumer-business-just-turned-off-its-final-oracle-database/ . We’ll talk more about this in our Well Architected introduction. You can also learn more about ‘Amazon.com on AWS’ scale at https://aws.amazon.com/solutions/case-studies/innovators/amazon/
Long run cloud reliability and outages. AWS reliability over more than a decade is obvious in the public data available from AWS. When it comes to uptime, not all cloud providers are created equal. Private cloud is even more confusing. The key takeaway here is availability to data. You can access the AWS Health API at https://docs.aws.amazon.com/health/latest/ug/health-api.html and https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/. Don’t forget AWS Partner observability tools that access the AWS APIs for you; find AWS Partners at https://partners.amazonaws.com/
AWS Post Event Summaries. These are public releases of Amazon Correction of Errors (CoE) investigations into events that impacted customers. The list is small compared to what we resolve internally before there is any customer impact. https://aws.amazon.com/premiumsupport/technology/pes/
AWS Global Infrastructure is all ‘prod’ and all ‘hot’. https://aws.amazon.com/about-aws/global-infrastructure/ Everything is running and available at any time. This leads to the concept of ‘constant work’ and ‘self healing’ that is key to maintaining control of the very dynamic and unpredictable nature of individual customer cloud usage. You can read more in the Amazon Builders Library post titled ‘Reliability, constant work, and a good cup of coffee’ https://aws.amazon.com/builders-library/reliability-and-constant-work/
AWS Well Architected Framework - specifically the Operational Excellence pillar we’re focussing on today. Start diving deep into Well Architected at https://aws.amazon.com/architecture/well-architected This is our primary mechanism for sharing our learnings, suggestions and mandates for being Well Architected. It’s a methodology, guidance, instructions, checklist, snapshot, baseline, API, service and way of thinking…

We’ll explore some learnings from AWS, and a real customer journey to Operational Excellence using Well Architected, in the presentation just before we start the lab.

Resources

This section provides attendees with a set of resources, further reading and automation, that dive deeper into our hands on lab.

Amazon Builders Library
- Implementing Health Checks https://aws.amazon.com/builders-library/implementing-health-checks/ is what we’ll be focussing on in our hands on lab.
- Here is a cartoon journey to Operational Excellence based on the Amazon Builders Library. It’s titled ‘Amazon Builders’ Library: Operational Excellence at Amazon’ https://www.youtube.com/watch?v=7MrD4VSLC_w
- Here is another prescriptive article on what many consider a trivial topic: ‘Building dashboards for operational visibility’ https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/?did=ba_card&trk=ba_card At Amazon there are some key capabilities we think are essential for everyone to have in their toolkit. Writing well is one, as we don’t communicate internally using PowerPoint or by meetings and minutes. If you want to communicate something important then you write in narrative form, using data, by working backwards from the customer, or both. Dashboarding is another essential for dealing with data. In this article we describe dashboarding (as a mechanism) that can be applied broadly.
- ‘Instrumenting distributed systems for operational visibility’ is another deep dive into measurement, instrumentation, and logging best practices at Amazon. Here we talk about ‘unit of work’, ‘queue depth’, and simplifying the Signal to Noise Ratio (SNR) to reduce cognitive overload by ‘counting things’ and ‘categorisation of faults’. We humans are good at cognitive or heuristic tasks, while the machines are good at algorithmic tasks.
Automation of Well Architected
- Automating Well Architected Assessments at scale aws-well-architected-tool-template-automation https://github.com/aws-samples/aws-well-architected-tool-template-automation
Reference Architectures provide important starting points or bookends to explore Operational Excellence concepts. They are repeatable, mostly self describing and testable.
- aws-elastic-beanstalk-hardened-security-cdk-sample https://github.com/aws-samples/aws-elastic-beanstalk-hardened-security-cdk-sample
- AWS Solutions are deployable reference architectures that get you 80% of the way to a solution. Virtual Andon on AWS supports the ‘Andon cord’ popularised by Toyota in the 20th century to allow anyone to quickly stop a process, or processes, before compounding creates a bigger problem. https://aws.amazon.com/solutions/implementations/virtual-andon-on-aws/
AWS re:Invent and Summit Videos
- Nearly everything presented is recorded and shared on YouTube and GitHub. We like to share and so too do our customers and partners.
- Scaling on AWS for your first 10 million users (2022 update) This presentation walks through starting small and then architecting for progressively larger numbers of users https://www.youtube.com/watch?v=yrP3M4_13QM . I often suggest this as the ‘first video to watch’ about AWS. I revisit it often.
- Search YouTube for AWS AND Operational Excellence
- Search the AWS Blog for Operational Excellence https://aws.amazon.com/blogs/
- ‘Know Before You Go – AWS re:Invent 2022 Monitoring & Observability’ is a guide to relevant sessions at re:Invent 2022 https://aws.amazon.com/blogs/mt/know-before-you-go-aws-reinvent-2022-monitoring-observability/
Just announced in Mar 2023, ‘Announcing the AWS Well-Architected Operational Readiness Review lens’ https://aws.amazon.com/blogs/publicsector/announcing-aws-well-architected-operational-readiness-review-lens/

Part Two - Hands on Lab - Health Checks and Dependencies

Lab Instructions - https://www.wellarchitectedlabs.com/reliability/300_labs/300_health_checks_and_dependencies/
Accessing the Lab - go to dashboard.eventengine.run/ and enter the hash code provided by your instructor (NOTE: shared to your email address)

Health Checks and Dependencies

Level 300: Implementing Health Checks and Managing Dependencies to improve Reliability. This lab is part of the Well Architected labs series, which is grouped by the Well Architected pillars: security, performance, reliability, cost optimization, operational excellence and sustainability.

In this lab we will:

Deploy our network and our sample prod app as Infrastructure as Code (IaC) using AWS CloudFormation. There are two templates here: we deploy the network (or IaaS if you prefer an ‘aaS’ term), then we build the app.
Once our environment is built we’ll conduct some chaos engineering by injecting faults into our highly available production workload and then mitigating them. NOTE: We’ll be accessing workload configuration settings to trigger faults and we’ll be monitoring the impacts; but in a real production environment we would need to reconfigure and apply the principle of least privilege to meet Business Continuity Planning (BCP), Disaster Recovery (DR) and Governance expectations.

The following sub sections include some hints and tips or some User Interface (UI) deviations from the lab instructions.

RECOMMENDATION: Where you see the ‘option 1 / option 2 - modify your code’ sections, I’d suggest you grab the URL of the code changes and not worry about modifying the code. We’re not here to rewrite and debug Python code today.

We use these URLs to modify our templates to break / fix our environment. These changes are applied to our app template, not the network (or IaaS) template.

2.3 Error Handling Code

We’re going to update our CloudFormation template to redeploy our EC2 instances with updated application code that can operate in a degraded state, which is more graceful than just returning a 502 error. Here’s the URL of the updated code.

https://www.wellarchitectedlabs.com/Reliability/300_Health_Checks_and_Dependencies/Code/Python/server_errorhandling.py
NOTE: We’re exposing diagnostic info publicly here. This is convenient for this workshop, but not appropriate for our real world applications. A Well Architected way to access the diagnostic info is via the application logs, which can be lifecycle managed in a dedicated ‘logging’ account with read only access.
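To make ‘operating in a degraded state’ concrete, here’s a minimal sketch of the idea. It is not the lab’s server_errorhandling.py; the function names and the fallback content are hypothetical, and we assume the dependency call raises an exception when the service is unavailable.

```python
import logging
from typing import Callable, Dict, Tuple

# Hypothetical static content served when the dependency is unavailable.
FALLBACK = {"user": "Valued Customer", "recommendation": "our most popular item"}


def handle_request(
    fetch_recommendation: Callable[[], Dict[str, str]],
) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, body), degrading gracefully if the dependency fails."""
    try:
        body = dict(fetch_recommendation())
        body["degraded"] = "false"
    except Exception as exc:  # broad on purpose: any dependency failure degrades
        # Keep the failure detail in the logs rather than in the public response.
        logging.warning("dependency call failed: %s", exc)
        body = dict(FALLBACK)
        body["degraded"] = "true"
    return 200, body


def broken_dependency() -> Dict[str, str]:
    """Stand-in for a dependency that is currently down (hypothetical)."""
    raise ConnectionError("recommendation service unavailable")


if __name__ == "__main__":
    print(handle_request(broken_dependency))
```

The design choice here is that the customer still gets a useful, if generic, page while the failure detail goes to the logs for the operators.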

3.1 Re-enable the dependency service

NOTE: For convenience in this workshop we’re accessing our secrets via the AWS Systems Manager Parameter Store. In our real world environment, access to this service, and to our secrets, would be controlled to align with our organisation’s least privilege controls. In this lab our ‘parameter’ is a boolean flag.
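If you’re curious how an application might read a flag like this, here’s a rough sketch. It assumes a client that behaves like boto3.client("ssm"), whose get_parameter call returns the value under ["Parameter"]["Value"]. The parameter name and the FakeSsm stand-in are hypothetical, and are only there so the sketch runs without an AWS account.

```python
from typing import Any


def dependency_enabled(ssm_client: Any, name: str = "/lab/DependencyEnabled") -> bool:
    """Read a Parameter Store value and interpret it as a boolean flag.

    ssm_client is expected to behave like boto3.client("ssm"); the default
    parameter name is hypothetical, not the one used in the lab.
    """
    response = ssm_client.get_parameter(Name=name)
    return response["Parameter"]["Value"].strip().lower() == "true"


class FakeSsm:
    """In-memory stand-in so the sketch runs without AWS credentials."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_parameter(self, Name: str) -> dict:
        # Mirrors the shape of the real response: {"Parameter": {"Value": ...}}
        return {"Parameter": {"Name": Name, "Type": "String", "Value": self._value}}


if __name__ == "__main__":
    print(dependency_enabled(FakeSsm("false")))
```

Flipping the stored value between ‘true’ and ‘false’ is then enough to simulate the dependency going away and coming back, without redeploying anything.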

Inject fault on a single server

There have been some console changes here. The menu looks a little different. Ask for help as needed.

3.4.1 Expert option: make and deploy your changes to the code

For those who don’t want to modify the code, you can use the following link to update your CloudFormation template. https://www.wellarchitectedlabs.com/Reliability/300_Health_Checks_and_Dependencies/Code/Python/server_healthcheck.py
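For anyone who wants a feel for what a ‘deep’ health check does before opening the lab code, here’s a minimal sketch. It is not the lab’s server_healthcheck.py; the function names are hypothetical, and we assume the load balancer treats only a 200 response as healthy.

```python
from typing import Callable, Tuple


def deep_health_check(check_dependency: Callable[[], None]) -> Tuple[int, str]:
    """Return (HTTP status, message) for a load balancer health check.

    A deep check exercises the dependency instead of only confirming that
    the web server process is answering.
    """
    try:
        check_dependency()
    except Exception as exc:
        # A non-200 status tells the load balancer to stop routing here.
        return 503, f"unhealthy: {exc}"
    return 200, "healthy"


def dependency_down() -> None:
    """Stand-in for a dependency probe that fails (hypothetical)."""
    raise TimeoutError("dependency timed out")


if __name__ == "__main__":
    print(deep_health_check(lambda: None))
    print(deep_health_check(dependency_down))
```

One trade-off worth keeping in mind: if every server checks the same shared dependency, they can all report unhealthy at once. The ‘Implementing Health Checks’ article in the Amazon Builders Library discusses this and the idea of failing open.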

Part Three - ORR Whiteboard Scenario

If we have time, and folks are keen to keep exploring Operational Excellence, we can use the following link as our ORR whiteboard scenario. https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/appendix-a-creating-orr-guidance-from-an-incident.html

Wrapping up the Event

Here are some words of wisdom on ORR, and one of Werner Vogels’ earlier statements of fact, that make for a good wrap-up.

Everything fails, all the time and Operations are Forever

Some good links on ORR at https://aws.amazon.com/builders-library/?cards-body.sort-by=item.additionalFields.sortDate&cards-body.sort-order=desc&awsf.filter-content-category=*all&awsf.filter-content-type=*all&awsf.filter-content-level=*all&cards-body.q=operational&cards-body.q_operator=AND

Here’s the pdf on ORR for AWS Well Architected https://docs.aws.amazon.com/pdfs/wellarchitected/latest/operational-readiness-reviews/operational-readiness-reviews.pdf#wa-operational-readiness-reviews

Set this video as homework: ‘AWS re:Invent 2021 - Amazon Builders’ Library: Operational Excellence at Amazon’ https://www.youtube.com/watch?v=7MrD4VSLC_w

Now ‘Go build Well Architected workloads’…
