Amazon Redshift Immersion Day - Tips and Tricks

This is my list of hints and tips for this course. It’s markdown so you can save it, access it or store it anywhere. I’ll add specific answers to questions I get during the course. I’ll share it with everyone.

Administrivia, Schedule and Planning

This immersion day will be delivered in multiple half day blocks. This aligns users and supporters with their specific areas of interest. Users want to run queries on data and the support folks build, manage and configure the underlying infrastructure.

The Redshift Immersion Labs https://catalog.us-east-1.prod.workshops.aws/workshops/9f29cdba-66c0-445e-8cbb-28a092cb5ba7/en-US/ contains more than 23 labs. This is a significant body of work and not practical to squeeze into a single day.

The events will be delivered as follows:

Session Topics Who When
One Intro to Amazon Redshift All 26Sep2022
Two Building and Managing Clusters 1of2 Support 28Sep2022
Three Building and Managing Clusters 2of2 Support 28Sep2022
Four Query Data in Amazon Redshift Users 29Sep2022
Five Federated Queries Users 30Sep2022
Six ETL/ELT Users 4Oct2022

Redshift Immersion Day Labs

I’ve created new accounts and clusters with a clean data load as Session 5 accounts are now terminated. We’ll use this new environment for session 4, AND 5. That way, whatever you don’t finish in session 4, you can continue in session 5. Session 5 has some spare time.

You’ll use the same login url with a new hash which I’ll distribute in the session. Lab1 describes this.

Session 6 - ETL/ELT - Data Sharing - Semi Structured Data

Lab 13 - ETL/ELT Strategies Lab 14 - Data Sharing Lab 15 - Loading & querying semi-structured data

In these 3 labs we’re focussing on data. In lab 13 we’re investigating materialised views and stored procedures. In lab 14 we start by building a second data sharing cluster using automation. In lab 15 we’ll work with semi structured JSON data using the Amazon Redshift SUPER data type

NOTE: You might like to build your data sharing cluster (for lab 14) early so that it is ready for you to dive straight in.

References

30Sep2022

Session 5 Federated Queries

Lab 6 - Query Aurora PostgreSQL using Federation Lab 8 - Querying nested JSON in S3

In this session we’re using our Amazon Redshift cluster to query data that resides somewhere else. This reduces the need to move and duplicate data, reducing costs, reducing unnecessary work and allowing for much faster experimentation. The integrations and interoperability of Amazon Redshift with other AWS services and more broadly other data sources can make data analysts very productive.

Lab6 - Launch an Aurora PostgreSQL DB

In lab6 we’re accessing data in a PostgreSQL database. This could be a Postgis db. In this case we’re Amazon Aurora serverless V1 which allows us to access that db, create the schema and load the data; then we can query it from our Amazon Redshift cluster. The benefit of using Amazon Aurora Serverless V1 is that we can enable the data api which allows us to run sql queries direct from the AWS console.

NOTE We need to deviate from the script in lab6 to get our db setup. Ideally in this lab we’d just use the non serverless db option and then we could either use a JDBC linked tool of choice to create the schema and load the data; or even simpler just automate this work. The following section describes the networking setup we need to do manually, before we create our Amazon Aurora Serverless V1 db.

We’re only using Amazon Aurora Serverless V1 here because it allows us to query the db from the console. Amazon Aurora Serverless V1 has limited availability in most regions and is not configured to support production workloads that have short RTO.

Creating the db

for lab 6 use Postgresql 11.13 to leverage Serverless V1 and the data api; this link lists the AWS regions and availability zones that support Amazon Aurora Serverless V1 from https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.Aurora_Fea_Regions_DB-eng.Feature.Data_API.html

Before you create the db we need to update the subnet group and create our db credentials as a secret.

Update the default subnet group

  • You’ll need to update the default subnet group, first to include the us-west-2 availability zone 2a. If not you’ll see an error message when you attempt to create the db because it’s expecting more than one availability zone.
  • You do this from the RDS page, look for the subnet groups link on the left hand side
  • You need to create a subnet in us-west-2a. I gave it a CIDR range of 10.0.30/24
  • You’ll need us-west-2d and us-west-2a as subnets listed in your subnet group

Create db credentials in Secrets Manager

Here you’ll need to create db credentials and save them in AWS Secrets Manager. We’ll use the secret arn when connecting to the posgresql db in the RDS console and we’ll also use the secret again when we update the policy attached to the Amazon Redshift role.

29Sep2022

Session 4 Query Data in Amazon Redshift

Lab 1 - Creating & Connecting to your Cluster Lab 2 - Data Loading Lab 3 - Table Design and Query Tuning Lab 4 - Modernize w/ Spectrum Lab 5 - Spectrum Query Tuning

Labs 5 and 6

In Lab4 we’re using the TPC order fulfillment data set, but in Lab5 we use a portion of the New York City taxi data set. This is a cool big data set that also includes spatial data, is traffic related and is big enough to be interesting.

In Lab6 we use an Amazon clickstream data set. It has a table named customer; which is the same table name as our TPC data set. We need to now think about how we’re going to seperate these data sets in our single Redshift cluster. We’re actually not loading these new data sets in our cluster, we’re only loading the schemas and then using Amazon Redshift Spectrum to run our queries from our cluster.

In these two labs we’re accessing different data sets. To avoid overwriting the original customer table, and to avoid lumping together new tables into one schema think about creating a new schema for lab5 and lab6 using the suggested queries below before executing the data queries in each lab.

before creating tables create a new schema and set it as default

  • CREATE SCHEMA lab5;

  • alter user awsuser set search_path to lab5;

  • CREATE SCHEMA lab6;

  • alter user awsuser set search_path to lab6;

We’re also leveraging some other AWS services like AWS Glue and the data catalog to allow us to leverage a ‘schema on read’ approach. This allows to have multiple environments leveraging the same data set but avoiding the need to create new infrastructure and copy and load data copies. Check out Build a Schema-On-Read Analytics Pipeline Using Amazon Athena https://aws.amazon.com/blogs/big-data/build-a-schema-on-read-analytics-pipeline-using-amazon-athena/

28Sep2022

Session 2 and 3 - building and managing clusters 1 and 2

We’ll start by building and managing clusters then switch to an operations focus(ish). There is some overlap with some labs because we’ve split the sessions around users and supporters of the environment.

This and the remaining sessions will provide you with a broad overview of Amazon Redshift and how it supports analytics broadly and specifically federated analytics. You can think of Redshift as being the coordinator across large and fragmented data sets, on prem dbs, other data stores in AWS and even reaching into other prod systems.

There is an advanced Amazon Redshift Immersion day and also instructor led training from AWS Training and Certification

  • Building Data Analytics Solutions Using Amazon Redshift (1day)
  • Data Warehousing on AWS (3day)
  • These options are a good next step that will setup you up for success as you leverage AWS ProServe or our Analytics partners to build solutions in this space. 3 steps - Immersion days to see the art of the possible, training to solidify your skills and then go build.

Building and Managing Clusters

Who - Support Mode - Hands on Labs Prerequisites - Cloud Team skills

Session 2 Building and Managing Clusters 1of2

Lab 1 - Creating & Connecting to your Cluster Lab 2 - Data Loading Lab 6 - Query Aurora PostgreSQL using Federation

Session 3 Building and Managing Clusters 2of2

Lab 7 - Operations Lab 9 - Use SAML 2.0 for SSO with Redshift - Walkthrough only as this requires Okta account access. Lab 16 - Redshift Data API

26Sep2022

Session One - Intro to Amazon Redshift

This session is two hours, split into 1 hour blocks

  • Intro to Amazon Redshift - Powerpoint and demo of the lab environment that we’ll use in subsequent sessions
  • Discussion around capability uplift including use cases, training required and SME support required.

What is Amazon Redshift

Amazon Redshift is a Fast, fully managed, petabyte-scale data warehouse service A good place to start learning about Amazon Redshift is https://aws.amazon.com/redshift/faqs/

Data and Analytics Architecture

While this event is themed around Amazon Redshift let’s ensure we have a broad view of relevant architectures. AWS Services don’t constrain you to picking ‘one size fits all’.

Backups

Several options to ‘backup’ data in Redshift and the clusters themselves Amazon Redshift snapshots, manual or automatic, provide backups of your data and allow the cluster to be rebuilt, relocated and restored. https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html Additionally data can be unloaded to Amazon S3. https://docs.aws.amazon.com/redshift/latest/dg/c_unloading_data.html AWS Backup also provides a broader backup at scale option for many AWS services that store data, including on prem, but not Amazon Redshift. https://aws.amazon.com/backup/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc

Amazon Redshift and Spatial Data

One Cluster or Federation?

This question arose in Session 3. Here’s some links and comments as a starting point on this contextual question.

Amazon Redshift versus PostGIS?

The question arose; ‘Redshift versus PostGIS?’ While not having context on use cases the following links may be help to answer that question:

ESRI and AWS

Spatial queries on AWS Blogs

Redshift Links

Self Paced Learning

Best Practices for Data Warehousing with Amazon Redshift https://explore.skillbuilder.aws/learn/course/internal/view/elearning/100/best-practices-for-data-warehousing-with-amazon-redshift?lada=tile&tile=dt

Continue reading articles in my Amazon Web Services series

Series: [AWS Data and Analytics]