Amazon Redshift Immersion Day - Tips and Tricks
This is my list of hints and tips for this course. It’s markdown so you can save it, access it or store it anywhere. I’ll add specific answers to questions I get during the course. I’ll share it with everyone.
Administrivia, Schedule and Planning
This immersion day will be delivered in multiple half day blocks. This aligns users and supporters with their specific areas of interest. Users want to run queries on data and the support folks build, manage and configure the underlying infrastructure.
The Redshift Immersion Labs https://catalog.us-east-1.prod.workshops.aws/workshops/9f29cdba-66c0-445e-8cbb-28a092cb5ba7/en-US/ contains more than 23 labs. This is a significant body of work and not practical to squeeze into a single day.
The events will be delivered as follows:
Session | Topics | Who | When |
---|---|---|---|
One | Intro to Amazon Redshift | All | 26Sep2022 |
Two | Building and Managing Clusters 1of2 | Support | 28Sep2022 |
Three | Building and Managing Clusters 2of2 | Support | 28Sep2022 |
Four | Query Data in Amazon Redshift | Users | 29Sep2022 |
Five | Federated Queries | Users | 30Sep2022 |
Six | ETL/ELT | Users | 4Oct2022 |
Redshift Immersion Day Labs
I’ve created new accounts and clusters with a clean data load as Session 5 accounts are now terminated. We’ll use this new environment for session 4, AND 5. That way, whatever you don’t finish in session 4, you can continue in session 5. Session 5 has some spare time.
You’ll use the same login url with a new hash which I’ll distribute in the session. Lab1 describes this.
Session 6 - ETL/ELT - Data Sharing - Semi Structured Data
Lab 13 - ETL/ELT Strategies Lab 14 - Data Sharing Lab 15 - Loading & querying semi-structured data
In these 3 labs we’re focussing on data. In lab 13 we’re investigating materialised views and stored procedures. In lab 14 we start by building a second data sharing cluster using automation. In lab 15 we’ll work with semi structured JSON data using the Amazon Redshift SUPER data type
NOTE: You might like to build your data sharing cluster (for lab 14) early so that it is ready for you to dive straight in.
References
- PartiQL is a SQL compatible query language that is db independent. https://aws.amazon.com/blogs/opensource/announcing-partiql-one-query-language-for-all-your-data/
- Amazon Redshift SUPER type https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html
30Sep2022
Session 5 Federated Queries
Lab 6 - Query Aurora PostgreSQL using Federation Lab 8 - Querying nested JSON in S3
In this session we’re using our Amazon Redshift cluster to query data that resides somewhere else. This reduces the need to move and duplicate data, reducing costs, reducing unnecessary work and allowing for much faster experimentation. The integrations and interoperability of Amazon Redshift with other AWS services and more broadly other data sources can make data analysts very productive.
Lab6 - Launch an Aurora PostgreSQL DB
In lab6 we’re accessing data in a PostgreSQL database. This could be a Postgis db. In this case we’re Amazon Aurora serverless V1 which allows us to access that db, create the schema and load the data; then we can query it from our Amazon Redshift cluster. The benefit of using Amazon Aurora Serverless V1 is that we can enable the data api which allows us to run sql queries direct from the AWS console.
NOTE We need to deviate from the script in lab6 to get our db setup. Ideally in this lab we’d just use the non serverless db option and then we could either use a JDBC linked tool of choice to create the schema and load the data; or even simpler just automate this work. The following section describes the networking setup we need to do manually, before we create our Amazon Aurora Serverless V1 db.
We’re only using Amazon Aurora Serverless V1 here because it allows us to query the db from the console. Amazon Aurora Serverless V1 has limited availability in most regions and is not configured to support production workloads that have short RTO.
Creating the db
for lab 6 use Postgresql 11.13 to leverage Serverless V1 and the data api; this link lists the AWS regions and availability zones that support Amazon Aurora Serverless V1 from https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.Aurora_Fea_Regions_DB-eng.Feature.Data_API.html
Before you create the db we need to update the subnet group and create our db credentials as a secret.
Update the default subnet group
- You’ll need to update the default subnet group, first to include the us-west-2 availability zone 2a. If not you’ll see an error message when you attempt to create the db because it’s expecting more than one availability zone.
- You do this from the RDS page, look for the subnet groups link on the left hand side
- You need to create a subnet in us-west-2a. I gave it a CIDR range of 10.0.30/24
- You’ll need us-west-2d and us-west-2a as subnets listed in your subnet group
Create db credentials in Secrets Manager
Here you’ll need to create db credentials and save them in AWS Secrets Manager. We’ll use the secret arn when connecting to the posgresql db in the RDS console and we’ll also use the secret again when we update the policy attached to the Amazon Redshift role.
29Sep2022
Session 4 Query Data in Amazon Redshift
Lab 1 - Creating & Connecting to your Cluster Lab 2 - Data Loading Lab 3 - Table Design and Query Tuning Lab 4 - Modernize w/ Spectrum Lab 5 - Spectrum Query Tuning
Labs 5 and 6
In Lab4 we’re using the TPC order fulfillment data set, but in Lab5 we use a portion of the New York City taxi data set. This is a cool big data set that also includes spatial data, is traffic related and is big enough to be interesting.
In Lab6 we use an Amazon clickstream data set. It has a table named customer; which is the same table name as our TPC data set. We need to now think about how we’re going to seperate these data sets in our single Redshift cluster. We’re actually not loading these new data sets in our cluster, we’re only loading the schemas and then using Amazon Redshift Spectrum to run our queries from our cluster.
In these two labs we’re accessing different data sets. To avoid overwriting the original customer table, and to avoid lumping together new tables into one schema think about creating a new schema for lab5 and lab6 using the suggested queries below before executing the data queries in each lab.
before creating tables create a new schema and set it as default
-
CREATE SCHEMA lab5;
-
alter user awsuser set search_path to lab5;
-
CREATE SCHEMA lab6;
-
alter user awsuser set search_path to lab6;
We’re also leveraging some other AWS services like AWS Glue and the data catalog to allow us to leverage a ‘schema on read’ approach. This allows to have multiple environments leveraging the same data set but avoiding the need to create new infrastructure and copy and load data copies. Check out Build a Schema-On-Read Analytics Pipeline Using Amazon Athena https://aws.amazon.com/blogs/big-data/build-a-schema-on-read-analytics-pipeline-using-amazon-athena/
28Sep2022
Session 2 and 3 - building and managing clusters 1 and 2
We’ll start by building and managing clusters then switch to an operations focus(ish). There is some overlap with some labs because we’ve split the sessions around users and supporters of the environment.
This and the remaining sessions will provide you with a broad overview of Amazon Redshift and how it supports analytics broadly and specifically federated analytics. You can think of Redshift as being the coordinator across large and fragmented data sets, on prem dbs, other data stores in AWS and even reaching into other prod systems.
There is an advanced Amazon Redshift Immersion day and also instructor led training from AWS Training and Certification
- Building Data Analytics Solutions Using Amazon Redshift (1day)
- Data Warehousing on AWS (3day)
- These options are a good next step that will setup you up for success as you leverage AWS ProServe or our Analytics partners to build solutions in this space. 3 steps - Immersion days to see the art of the possible, training to solidify your skills and then go build.
Building and Managing Clusters
Who - Support Mode - Hands on Labs Prerequisites - Cloud Team skills
Session 2 Building and Managing Clusters 1of2
Lab 1 - Creating & Connecting to your Cluster Lab 2 - Data Loading Lab 6 - Query Aurora PostgreSQL using Federation
Session 3 Building and Managing Clusters 2of2
Lab 7 - Operations Lab 9 - Use SAML 2.0 for SSO with Redshift - Walkthrough only as this requires Okta account access. Lab 16 - Redshift Data API
26Sep2022
Session One - Intro to Amazon Redshift
This session is two hours, split into 1 hour blocks
- Intro to Amazon Redshift - Powerpoint and demo of the lab environment that we’ll use in subsequent sessions
- Discussion around capability uplift including use cases, training required and SME support required.
What is Amazon Redshift
Amazon Redshift is a Fast, fully managed, petabyte-scale data warehouse service A good place to start learning about Amazon Redshift is https://aws.amazon.com/redshift/faqs/
Data and Analytics Architecture
While this event is themed around Amazon Redshift let’s ensure we have a broad view of relevant architectures. AWS Services don’t constrain you to picking ‘one size fits all’.
- https://aws.amazon.com/architecture/analytics-big-data
- Filter for Reference Architecture Diagrams to see how AWS services work together and integrate
- Leverage AWS Well Architected, particularly the Data Analytics Well Architected Lens https://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/analytics-lens.pdf#analytics-lens
- Filter for whitepapers. There are 38 relating to Analytics and Big Data architectures
- Modern Data Architecture on AWS describes how AWS Services are complementary, integrate well and flexible, to allow you to meet your specific needs https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/
Backups
Several options to ‘backup’ data in Redshift and the clusters themselves Amazon Redshift snapshots, manual or automatic, provide backups of your data and allow the cluster to be rebuilt, relocated and restored. https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html Additionally data can be unloaded to Amazon S3. https://docs.aws.amazon.com/redshift/latest/dg/c_unloading_data.html AWS Backup also provides a broader backup at scale option for many AWS services that store data, including on prem, but not Amazon Redshift. https://aws.amazon.com/backup/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
Amazon Redshift and Spatial Data
- spatial data types supported including GeoJSON https://docs.aws.amazon.com/redshift/latest/dg/geospatial-overview.html
One Cluster or Federation?
This question arose in Session 3. Here’s some links and comments as a starting point on this contextual question.
- My thought process generally starts like we think about account seperation for coarse grained perms and then I’ll think about segregation within an account. Then I’ll consider minimising data movement and minimising copies of data; this is really where federation starts to make sense.
- We’ll also generally have external drivers like commercial and legislative and corporate policy on data / network access.
- Much of the large scale federation has been pioneered in the health care sector specifically around genome, drug discovery, pandemic response and also where legislation like GDPR has mandated data residency. The AWS Blog has some good articles here. Also check out https://www.linkedin.com/in/michelleperugini/ Dr Michelle Perugini, Adelaide based Presagen founder has been able to address data sovereignty, IP protection and global insights.
- There are also numerous AWS resources on the topic of federation
- Federate data in a Modern Data Architecture | Amazon Web Services https://www.youtube.com/watch?v=tUp4uINZr90
- Design a data mesh architecture using AWS Lake Formation and AWS Glue https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/
- Schema on Read is an efficient and scalable enabler for federation. Build a Schema-On-Read Analytics Pipeline Using Amazon Athena https://aws.amazon.com/blogs/big-data/build-a-schema-on-read-analytics-pipeline-using-amazon-athena/
Amazon Redshift versus PostGIS?
The question arose; ‘Redshift versus PostGIS?’ While not having context on use cases the following links may be help to answer that question:
- comparison from the National University of Singapore may be helpful as it contains tables mapping PostGIS types and functions to other db types. https://ual.sg/post/2020/07/03/a-comparison-of-spatial-functions-postgis-athena-prestodb-bigquery-vs-redshift/
- PostGIS is an open source spatial relational database and Amazon Redshift is a data warehouse that supports GIS data sets. A more common comparison is PostGIS to Amazon RDS and Aurora.
- There are 510 questions on ‘PostGIS’ on AWS re:Post. https://repost.aws/search/questions?globalSearch=postgis There are no PostGIS versus Amazon Redshift questions but there are numerous PostGIS versus Amazon Aurora questions; because PostGIS is an extension to PostgreSQL on Amazon RDS https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.PostGIS.html
- Geo-spatial support in Redshift became generally available in November 2019. It is not an exact match to PostGIS GIS, customers can use common spatial geometry data types and functions as described in the documentation.
- Again without context on this question the ‘best answer’ I can provide is that federated queries across Amazon Redshift, Amazon Aurora PostgreSQL database, Amazon Athena and Amazon S3 leverage GIS type and query support broadly. https://docs.aws.amazon.com/redshift/latest/dg/federated_query_example.html
- Amazon Redshift spatial functions https://docs.aws.amazon.com/redshift/latest/dg/geospatial-functions.html and are based on DE-9IM. Limitations when using spatial data with Amazon Redshift https://docs.aws.amazon.com/redshift/latest/dg/spatial-limitations.html
- Amazon Athena spatial functions https://docs.aws.amazon.com/athena/latest/ug/geospatial-functions-list-v2.html
- Amazon Aurora spatial functions - MySQL https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraMySQL.Overview.html#Aurora.AuroraMySQL.Spatial
- Amazon Aurora PostgreSQL Spatial https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-aurora-postgresql-supports-postgis/
- Amazon RDS Spatial and Amazon Aurora Spatial support are kept in sync
- Search across all AWS Service documentation for the term spatial https://docs.aws.amazon.com/search/doc-search.html?searchPath=documentation-guide&searchQuery=spatial
ESRI and AWS
- Leveraging the Power of Esri’s ArcGIS Enterprise Through Amazon Redshift https://aws.amazon.com/blogs/apn/leveraging-the-power-of-esri-arcgis-enterprise-through-amazon-redshift/
Spatial queries on AWS Blogs
- https://aws.amazon.com/blogs/ Search for keywords like ‘spatial’, ‘gis’, ’esri’ etc to find articles of interest
Redshift Links
- Getting started: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
- Redshift Best Practices: http://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
- Official AWS Redshift tutorials: http://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables.html http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html http://docs.aws.amazon.com/redshift/latest/dg/tutorial-configuring-workload-management.html
- Re:Invent videos (Level 400) https://www.youtube.com/watch?v=n1puzWLWS38 AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift (BDM402). Start here for a deep dive into Redshift
- https://www.youtube.com/watch?v=Q_K3qH5OYaM AWS re:Invent 2017: Best Practices for Data Warehousing with Amazon Redshift (ABD304) Watch this one after BDM402. It’s a refresher and a deeper dive on how Redshift works
- https://www.youtube.com/watch?v=fmy3jCxUliM&t=71s AWS re:Invent 2015 | (BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices and a well explained slide deck at https://d0.awsstatic.com/events/aws-hosted-events/2015/israel/redshift.pdf
- A deep dive on slices https://stackoverflow.com/questions/39090962/why-does-redshift-need-to-do-a-full-table-scan-to-find-the-max-value-of-the-dist. Also links to the documention which explains encoding on performance https://docs.aws.amazon.com/redshift/latest/dg/t_Verifying_data_compression.html
- Understanding Redshift Query Plans. The returned syntax requires some explanation. https://docs.aws.amazon.com/redshift/latest/dg/c-the-query-plan.html
- STV_BLOCKLIST field descriptions https://docs.aws.amazon.com/redshift/latest/dg/r_STV_BLOCKLIST.html
- STV_QUERY_SUMMARY also has useful info about query efficiency such as writing to disk https://docs.aws.amazon.com/redshift/latest/dg/using-SVL-Query-Summary.html
- Collection of Redshift database tuning queries and even a Docker image on GitHub at https://github.com/awslabs/amazon-redshift-utils
- using the ‘over()’’ windows function in Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html and https://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html
- CTEs and Window Functions: Unleashing the Power of Redshift which is a blog post by Yelp. https://engineeringblog.yelp.com/2015/01/title-ctes-and-window-functions-unleashing-the-power-of-redshift.html
- Redshift performance tuning guidance https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
- Optimizing for Star Schemas and Interleaved Sorting on Amazon Redshift https://aws.amazon.com/blogs/big-data/optimizing-for-star-schemas-and-interleaved-sorting-on-amazon-redshift/ is worth a read if you are struggling with relating OLTP best practice with OLAP schemas.
Self Paced Learning
Best Practices for Data Warehousing with Amazon Redshift https://explore.skillbuilder.aws/learn/course/internal/view/elearning/100/best-practices-for-data-warehousing-with-amazon-redshift?lada=tile&tile=dt
Continue reading articles in my Amazon Web Services series