Contingency Planning for your AWS Environment

Here in Australia we’re approaching the summer fire season. In recent years we’ve seen rain and flooding events dominate, which has resulted in a significant increase in vegetation growth; this becomes potential fuel as summer temperatures rise.

In my Disaster Response role at Amazon, I’m always thinking of how I, and Amazon, can help our customers prepare for force majeure and unplanned incidents.

There are many models of complex systems that can help us form mental models, build checklists, simulate events and incidents, and categorise and prioritise actions. I generally use the PACE communications model to structure my thinking. PACE is a cascading approach based on having Primary, Alternate, Contingency and Emergency mitigations.
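As a rough illustration of the cascade, here is a minimal Python sketch. The `Mitigation` class, `run_pace` and `missing_tiers` are hypothetical names invented for this example, not part of any AWS tooling; the only thing taken from PACE itself is the ordering of the four tiers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# The four PACE tiers, in the order they should be attempted.
PACE_TIERS = ("Primary", "Alternate", "Contingency", "Emergency")


@dataclass
class Mitigation:
    tier: str                     # one of PACE_TIERS
    description: str              # human readable summary of the option
    attempt: Callable[[], bool]   # returns True if this option worked


def run_pace(plan: List[Mitigation]) -> Optional[Mitigation]:
    """Try each mitigation in PACE order and return the first that succeeds."""
    ordered = sorted(plan, key=lambda m: PACE_TIERS.index(m.tier))
    for mitigation in ordered:
        try:
            if mitigation.attempt():
                return mitigation
        except Exception:
            # A broken option must not stop the cascade; fall through.
            continue
    return None  # every tier failed


def missing_tiers(plan: List[Mitigation]) -> List[str]:
    """List the PACE tiers that the plan does not cover at all."""
    present = {m.tier for m in plan}
    return [tier for tier in PACE_TIERS if tier not in present]
```

Running `missing_tiers` over each critical capability (access, comms, logging) is a quick way to spot where you only have a Primary and nothing behind it.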

Here’s my checklist that summarises the talking points that relate to emergent properties like resilience, recoverability and keeping systems running in degraded situations.

TL;DR

  • System Access
  • Comms, SLAs and Single Points of Failure
  • AWS Well Architected
  • Observability
  • Pre Production Supply Chain Integrity
  • Production Integrity
  • Proof of Resiliency

Checklist in Detail

System Access

Ensure your primary identity management capability is robust, configured as a critical service, monitored and managed. Test it when other parts of the system fail or morph to degraded modes. Design, configure and prove (test, verify and validate) your alternate, contingency and emergency means of access to critical systems, including your AWS accounts and the critical workloads that other systems depend on. Run scheduled and unscheduled gamedays. These can be run as people onboard and offboard. They also make great team building opportunities.

Comms, SLAs and Single Points of Failure

Many incident management failures can be attributed to communications issues, not knowing who to call and a general lack of situational awareness.

AWS Well Architected

AWS Well Architected emerged as a way to understand how well architected a system is. It’s a method of inquiry, of assessing status and of baselining what is important about a system. It can be applied to any complex system, but its specific use is in helping customers understand what is important about a system they care about that is built on AWS.

AWS Well Architected is structured around six pillars:

  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimisation
  • Operational Excellence
  • Sustainability

Observability

Observability is an umbrella term that implies situational awareness, measurement and control of a system in all situations; including its normal state.

Things to watch for here include:

  • All infrastructure, platform, application and API logs are pushed to a perimeter logging account that is read only and has lifecycle rules to manage the retention and disposal of all logging data.
  • The CloudTrail logs of all AWS accounts are aggregated in the perimeter logging account.
  • AWS accounts, and all security, support and operations email aliases are monitored, managed and integrated into your incident management mechanisms.
  • Metrics for all infrastructure, platform, application and APIs are monitored. Most incident response can be automated in terms of notification, remediation and quarantine from metrics. Normal behaviour is a key input here.
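To show why normal behaviour is such a key input, here is a minimal Python sketch of flagging a metric reading that departs from its baseline. The function name `is_anomalous` and the threshold of three standard deviations are assumptions made for this example; real alerting would normally rely on your monitoring service's own alarm and anomaly detection features.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], observed: float,
                 threshold: float = 3.0) -> bool:
    """Return True when `observed` is more than `threshold` standard
    deviations away from the mean of the `baseline` samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > threshold
```

A result of True is the trigger point for notification, remediation or quarantine; without a recorded baseline there is nothing to compare against.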

Pre Production Supply Chain Integrity

All the stages your system is subjected to in getting to production are considered here. Patching, provisioning and all system changes need to be managed, verified and validated so you can ensure the integrity of your production systems. Zero touch production is ideal, but however you maintain your systems, your configuration, change and release management stages need to be monitored, managed and resilient.

  • Service catalog of trusted services
  • Patching, update and configuration drift need to be controlled and reversible
  • Work backwards from known vulnerabilities to ensure your pre prod supply chain is trustable
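Controlling configuration drift starts with being able to see it. The following is a minimal Python sketch that compares a desired configuration against what is actually deployed. The `detect_drift` function, the `ABSENT` marker and the sample keys are all hypothetical, chosen only to illustrate the comparison.

```python
from typing import Any, Dict, Tuple

# Marker used when a key exists on one side only (hypothetical convention).
ABSENT = "<absent>"


def detect_drift(desired: Dict[str, Any],
                 actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, actual_value)} for every key that differs."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift
```

An empty result means the deployed state matches the declared state; anything else is a change that bypassed your release process and should be reviewed or reverted.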

Production Integrity

Zero touch production where all changes are automated and reversible is the aspiration; but too many legacy systems, technologies and ways of working make this difficult to achieve. Your ability to make reversible changes to production systems is the key measure of success here.

Proof of Resiliency

Documents and point-in-time information are not proof of the resiliency of a system. The system is the proof, and we have complete visibility of all resources on AWS. (A resource is anything with an ARN.) Don’t trust isolated views of a system. IaC, metrics, system responses and even billing data can all be used to enable a system to provide its own status.
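Since an ARN identifies every resource, it is a handy key for joining these different views together. Here is a minimal Python sketch that splits an ARN into its parts, assuming the general `arn:partition:service:region:account-id:resource` layout. The `Arn` type and `parse_arn` function are names invented for this example, and the account number and resource names used with it are made up.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str       # empty for global services such as IAM
    account_id: str   # empty for some resources such as S3 buckets
    resource: str     # may itself contain ":" or "/"


def parse_arn(arn: str) -> Arn:
    """Split an ARN string into its five fields after the 'arn' prefix."""
    # Split at most 5 times so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return Arn(*parts[1:])
```

Grouping inventory, metrics and billing records by the parsed `account_id`, `region` and `service` gives a quick cross-check that each view is describing the same set of resources.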
