Data Warehousing on AWS - Tips and Tricks
This is my list of hints and tips for this course. It’s markdown so you can save it, access it or store it anywhere. I might also give you other links that are course specific. I’ll add specific answers to questions I get during the course. I’ll share it with everyone.
Group Exercises
These exercises augment the course material. The instructor will decide when and if these group exercises are conducted.
Lunch time entertainment
- Here we have some funny or interesting videos.
  - Like the New Zealand police recruitment video: https://www.youtube.com/watch?v=f9psILoYmCc
  - Perhaps even the Australia Day lamb video with the historical boat people and a history of Australia in 3 minutes: https://www.youtube.com/watch?v=yugymulPx9Y
  - Or how about learning Aussie slang: https://www.youtube.com/watch?v=yDb_WsAt_Z0
- Here’s a sample of the AWS Podcast http://d1le29qyzha1u4.cloudfront.net/AWS_Podcast_Episode_230.mp3
- Come visit Australia, but be wary of the critters. Just kidding, most are cuddly, except for the crocs. https://www.youtube.com/watch?v=iQSxuqWQ_4c
- Subjecting Snowball to a MIL-STD-810 mine blast test. https://www.youtube.com/watch?v=__ooXhq5gZ4&feature=youtu.be
Administrivia
We need to jump through some hoops to get access to the labs, notes and my hints and tips. Be consistent with the email address you use for all sites. There are three separate sites you need to access, plus one bitly link, which is this page:
- Join or log in to https://www.aws.training/ to ensure your training and certifications are captured. No, we don’t spam you or sell your details.
- Access Qwiklabs (yes, it is spelt INCORRECTLY):
  - aws.qwiklabs.com for the labs in this class
  - run.qwiklabs.com for outside of the class or to do other labs at your own pace. NOTE: Some are free; others require course credits.
- Access the course notes and slides. You’ll receive two emails. One confirms your attendance at this course and contains the following links. The download link seems broken. You can download apps for phones, tablets and laptops, or use your browser if you don’t want to install anything.
- Go to www.vitalsource.com and look for a signup link and download link, or just go to https://evantage.gilmoreglobal.com/#/user/signin
- Once you’ve logged into Vitalsource (aka Bookshelf, Gilmore, eVantage) you can redeem your unique course materials code (in a separate email) and update your book list. You should see a lab guide and student guide for Architecting on AWS, version 5. The student guide contains the PowerPoint decks and notes, and the lab guide contains the step-by-step instructions for the labs.
- Solutions and AWS CloudFormation templates for the labs can be downloaded from http://bit.ly/2HnUkLc
Some Fundamentals
- What is Big Data? (circa 2016) http://cueris.com/an-elephant-among-the-blind-men-the-big-data-quagmire/
- Why big data initiatives fail. (circa 2016) https://www.thoughtworks.com/insights/blog/7-reasons-big-data-analytics-initiatives-fail
Cool links
- Netflix Technical Blog. Always an interesting read of AWS usage at scale. https://medium.com/netflix-techblog
- AWS Case Studies. These are useful to get non-technical types on board with AWS. https://aws.amazon.com/solutions/case-studies/
- AWS-created solutions, reference architectures and quickstarts at https://aws.amazon.com/big-data/getting-started/tutorials/ . Look for the self-paced labs that are accessible at run.qwiklabs.com
- There are some interesting solutions on the AWS Big Data blog at https://aws.amazon.com/blogs/big-data/
- Which services are available in which regions? https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/
Ian’s list of links for weekly review of all stuff AWS. (trying to keep up with the firehose)
http://www.tinyurl.com/awswhatsnew Please help spread the word:
- AWS What’s New: https://aws.amazon.com/new/
- AWS Support What’s New: https://aws.amazon.com/premiumsupport/whats-new/
- Jeff Barr Blog Post: https://aws.amazon.com/blogs/aws/new-action-links-for-aws-trusted-advisor/
- Jeff Barr Tweet: https://twitter.com/jeffbarr/status/558429034712162305
- Twitter: https://twitter.com/awscloud/status/558429091561340929
- Facebook: http://on.fb.me/1CHt54F
- Google+: http://bit.ly/1BiVrTt
- LinkedIn: http://linkd.in/1AVPh9a
https://aws.amazon.com/podcasts/aws-podcast/ and all the FAQ pages for each product (this is where I start reading)
Best Practice
AWS Answers contains deployable solutions for common big data problems. Great for prototyping and reverse engineering. These solutions also contain best practices in terms of tagging, nomenclature, applying the principle of least privilege and integrating services. https://aws.amazon.com/answers/big-data/
Compute links
- An aws cli wait command polls a fixed number of times and then times out (120 checks for some waiters). The documentation labels these as failed checks, but they aren’t strictly failures: a timeout just means that whatever the command was waiting on did not reach the desired state in time. The number of checks and the polling interval vary, so check the documentation for the service and wait state you’re interested in. Here’s EBS: https://docs.aws.amazon.com/cli/latest/reference/ec2/wait/snapshot-completed.html
- Here’s a useful sortable table of EC2 instance types, sizes and specifications. Note that you can add columns, such as availability for EMR, which makes instance selection very simple. https://ec2instances.info/
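The wait behaviour described in the list above can be sketched as a plain polling loop. This is a minimal illustration, not the AWS CLI's implementation: `wait_until`, `check`, `delay`, `max_attempts` and `sleep` are hypothetical names, the defaults are only example values, and the return code 255 mirrors what the CLI documentation describes for a wait that runs out of checks.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               delay: float = 15.0,
               max_attempts: int = 40,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll `check` until it returns True or the attempts run out.

    Returns 0 on success and 255 on timeout. The defaults are assumptions
    for illustration; each real waiter documents its own interval and limit.
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return 0  # desired state reached
        if attempt < max_attempts:
            sleep(delay)  # pause before the next check
    # Every check came back "not yet": a timeout, not necessarily an error.
    return 255
```

A caller would pass a function that asks the service for the current state, for example one that reports whether a snapshot has completed, and treat 255 as "still not ready" rather than as a hard failure.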
Streaming links
- Kinesis can be configured to process large volumes of data or to throttle to protect back-end services. https://docs.aws.amazon.com/firehose/latest/dev/basic-deliver.html . Also check out the Kinesis service limits https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
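The service limits matter because they drive how many shards a stream needs. Here is a rough sizing sketch, assuming the per-shard limits commonly quoted for Kinesis Data Streams (1 MB/s or 1,000 records/s in, 2 MB/s out). Treat the constants as assumptions and confirm them against the limits page; `shards_needed` is a hypothetical helper, not an AWS API.

```python
import math

# Assumed per-shard limits; verify against the Kinesis service limits page.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s: float,
                  records_per_s: float,
                  read_mb_per_s: float) -> int:
    """Estimate the shard count from peak write and read throughput."""
    by_write_bytes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_write_records = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # The tightest limit wins, and a stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)
```

For example, 5 MB/s and 2,500 records/s written with 12 MB/s read works out to 6 shards, because the read side is the tightest constraint.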
Athena Links
- Athena uses a Serializer/Deserializer (a SerDe) to interpret the structure of data in S3. You name the SerDe in the table DDL rather than describing the file layout yourself. https://docs.aws.amazon.com/athena/latest/ug/serde-about.html and https://docs.aws.amazon.com/athena/latest/ug/supported-format.html
- NOTE that Athena has limited support for SerDes. http://aws.mannem.me/?tag=athena and https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/
- Athena blog posts and solutions. Worth a look to deep dive into Athena. https://aws.amazon.com/blogs/big-data/tag/amazon-athena/
- Athena performance tuning https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
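To make the SerDe point concrete, here is a small sketch that builds an Athena table definition for JSON data. The helper `athena_json_table_ddl`, the table name, the columns and the bucket are all hypothetical; the SerDe class shown is the OpenX JSON SerDe, one of the JSON SerDes Athena lists as supported.

```python
from typing import List, Tuple


def athena_json_table_ddl(table: str,
                          columns: List[Tuple[str, str]],
                          s3_location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement that reads JSON via a SerDe."""
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_sql}\n"
        ")\n"
        # The SerDe tells Athena how to turn each S3 object into rows.
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table and bucket, for illustration only.
ddl = athena_json_table_ddl(
    "clickstream",
    [("user_id", "string"), ("event_time", "timestamp"), ("page", "string")],
    "s3://example-bucket/clickstream/",
)
print(ddl)
```

The resulting statement can be pasted into the Athena query editor. Swapping the SerDe class is how you switch between formats such as JSON and CSV without changing the data in S3.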
EMR Links
- EMR Deep Dive 2016 https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-deep-dive-amazon-emr-best-practices-design-patterns-bdm401
- Cluster planning guidelines https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
- Yahoo Hadoop cluster experience. https://www.techrepublic.com/article/why-the-worlds-largest-hadoop-installation-may-soon-become-the-norm/
- Hadoop file formats and compression options https://cloud.netapp.com/blog/optimizing-aws-emr-best-practices
- EMR best practices whitepaper (circa 2013) https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf
- Instance types supported https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html
- Some facts on Facebook’s use of Hadoop. It’s always interesting to ‘bookend’ the larger use cases. https://code.facebook.com/posts/423120391138341/hadoop/
- EMR cluster resizing guidance from the AWS Big Data blog is worth reading to understand the default EMR cluster constraints. https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/ This post also links to a good description of how to use S3DistCp https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
- Spark versus Tez https://www.xplenty.com/blog/apache-spark-vs-tez-comparison/
- Spark documentation http://spark.apache.org/docs/latest/index.html
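As a companion to the S3DistCp tips linked above, here is a sketch of an EMR step definition that runs `s3-dist-cp` through `command-runner.jar`. The helper name `s3distcp_step`, the paths and the pattern are hypothetical; `--src`, `--dest`, `--groupBy` and `--targetSize` are S3DistCp options covered in that post. The dictionary follows the step shape that the EMR API expects, but check it against the current API reference before relying on it.

```python
from typing import Dict, Optional


def s3distcp_step(src: str,
                  dest: str,
                  group_by: Optional[str] = None,
                  target_size_mib: Optional[int] = None) -> Dict:
    """Build an EMR step that copies data with S3DistCp."""
    args = ["s3-dist-cp", f"--src={src}", f"--dest={dest}"]
    if group_by:
        # Regex whose capture group decides which files are merged together.
        args.append(f"--groupBy={group_by}")
    if target_size_mib:
        # Approximate size of the merged output files, in MiB.
        args.append(f"--targetSize={target_size_mib}")
    return {
        "Name": "S3DistCp copy",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Hypothetical paths: merge small daily files into ~128 MiB objects on S3.
step = s3distcp_step(
    "hdfs:///output",
    "s3://example-bucket/output/",
    group_by=r".*/(\d{4}-\d{2}-\d{2})/.*",
    target_size_mib=128,
)
```

A dictionary like this would be submitted in the `Steps` list when adding steps to a running cluster. Combining small files this way is one of the main reasons to use S3DistCp instead of a plain copy.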
DynamoDB links
- Because DynamoDB is fully managed, it’s not always easy to correlate observed performance with the capacity you provisioned, as the service provides adaptive capacity to avoid throttling a hot partition. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
- Using Hive external DynamoDB tables. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.ExternalTableForDDB.html
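For the Hive external table link, this sketch shows roughly what the Hive DDL looks like. The helper `hive_dynamodb_table_ddl` and the table, column and attribute names are hypothetical; the storage handler class and the two table properties are the ones the EMR DynamoDB connector documentation uses.

```python
from typing import List, Tuple


def hive_dynamodb_table_ddl(hive_table: str,
                            dynamodb_table: str,
                            columns: List[Tuple[str, str, str]]) -> str:
    """Build Hive DDL for an external table backed by a DynamoDB table.

    `columns` holds (hive_column, hive_type, dynamodb_attribute) tuples.
    """
    column_sql = ", ".join(f"{col} {typ}" for col, typ, _ in columns)
    # Maps each Hive column to the DynamoDB attribute it reads from.
    mapping = ",".join(f"{col}:{attr}" for col, _, attr in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_sql})\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        f'TBLPROPERTIES ("dynamodb.table.name" = "{dynamodb_table}",\n'
        f'  "dynamodb.column.mapping" = "{mapping}")'
    )


# Hypothetical names, for illustration only.
hive_ddl = hive_dynamodb_table_ddl(
    "hive_orders",
    "Orders",
    [("order_id", "string", "OrderId"), ("total", "double", "Total")],
)
print(hive_ddl)
```

Queries against such a table consume the DynamoDB table's read capacity, so the partition key guidance in the first link still applies when you run Hive jobs over it.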
Glue links
- Glue sample programs on Github from AWS. https://github.com/aws-samples/aws-glue-samples
Security
Redshift Links
- Getting started: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
- Redshift Best Practices: http://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
- Official AWS Redshift tutorials:
  - http://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables.html
  - http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html
  - http://docs.aws.amazon.com/redshift/latest/dg/tutorial-configuring-workload-management.html
- re:Invent videos (Level 400): https://www.youtube.com/watch?v=n1puzWLWS38 AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift (BDM402). Start here for a deep dive into Redshift.
- https://www.youtube.com/watch?v=Q_K3qH5OYaM AWS re:Invent 2017: Best Practices for Data Warehousing with Amazon Redshift & Redsh (ABD304). Watch this one after BDM402. It’s a refresher and a deeper dive on how Redshift works.
- https://www.youtube.com/watch?v=fmy3jCxUliM&t=71s AWS re:Invent 2015 | (BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices, with a well-explained slide deck at https://d0.awsstatic.com/events/aws-hosted-events/2015/israel/redshift.pdf
- A deep dive on slices https://stackoverflow.com/questions/39090962/why-does-redshift-need-to-do-a-full-table-scan-to-find-the-max-value-of-the-dist . It also links to the documentation that explains the effect of encoding on performance https://docs.aws.amazon.com/redshift/latest/dg/t_Verifying_data_compression.html
- Understanding Redshift Query Plans. The returned syntax requires some explanation. https://docs.aws.amazon.com/redshift/latest/dg/c-the-query-plan.html
- STV_BLOCKLIST field descriptions https://docs.aws.amazon.com/redshift/latest/dg/r_STV_BLOCKLIST.html
- SVL_QUERY_SUMMARY also has useful info about query efficiency, such as whether a step wrote to disk https://docs.aws.amazon.com/redshift/latest/dg/using-SVL-Query-Summary.html
- Collection of Redshift database tuning queries and even a Docker image on GitHub at https://github.com/awslabs/amazon-redshift-utils
- Using window functions and the ‘over()’ clause in Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html and https://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html
- CTEs and Window Functions: Unleashing the Power of Redshift which is a blog post by Yelp. https://engineeringblog.yelp.com/2015/01/title-ctes-and-window-functions-unleashing-the-power-of-redshift.html
- Redshift performance tuning guidance https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
- Optimizing for Star Schemas and Interleaved Sorting on Amazon Redshift https://aws.amazon.com/blogs/big-data/optimizing-for-star-schemas-and-interleaved-sorting-on-amazon-redshift/ is worth a read if you are struggling with relating OLTP best practice with OLAP schemas.
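The CTE and window function links above are easier to follow with a query you can run locally. This sketch uses Python's built-in SQLite as a stand-in for Redshift, which is an assumption made purely for convenience and needs SQLite 3.25 or later. The `sales` table and its rows are made up, but the `WITH` and `RANK() OVER (PARTITION BY ... ORDER BY ...)` syntax is standard SQL that Redshift also accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, seller TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "ann", 300), ("east", "bob", 500), ("east", "cat", 500),
     ("west", "dan", 200), ("west", "eve", 100)],
)

query = """
WITH regional AS (            -- CTE: names an intermediate result set
    SELECT region, seller, amount FROM sales
)
SELECT region, seller, amount,
       -- Rank sellers within each region; ties share a rank
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS sales_rank
FROM regional
ORDER BY region, sales_rank, seller
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In the east region the two sellers tied on 500 both get rank 1 and the next seller gets rank 3, which is how RANK differs from ROW_NUMBER.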