Big Data on AWS - Tips and Tricks
This is my list of hints and tips for this course. It’s markdown so you can save it, access it or store it anywhere. I might also give you other links that are course specific. I’ll add specific answers to questions I get during the course. I’ll share it with everyone.
Your Instructor
- Ian Falconer https://www.linkedin.com/in/leftbrainstuff/
Administrivia
We need to jump through some hoops to get access to the labs, notes and my hints and tips. Be consistent with the email address you use for all sites. There are three separate sites you need to access and one bitly link which is this page:
- Join or log in to https://www.aws.training/ to ensure your training and certifications are captured. No, we don’t spam you or sell your details.
- Access Qwiklab (yes it is spelt INCORRECTLY)
- aws.qwiklabs.com for the labs in this class
- run.qwiklabs.com for outside of the class or to do other labs at your own pace.
- NOTE: Some are free others require course credits. Also check out the AWS Professional Developer Series of MOOCs on edX https://www.edx.org/aws-developer-professional-series
- Access the course notes and slides. You’ll receive two emails. One confirming your attendance at this course and with the following links. The download link seems broken. You can download apps for phones, tablets and laptops. Or use your browser.
- www.vitalsource.com look for a signup link and download link. Or just go to https://evantage.gilmoreglobal.com/#/user/signin
- Once you’ve logged into Vitalsource (aka Bookshelf, Gilmore, eVantage) you can redeem your unique course materials code (in a separate email) and update your book list. You should see a lab guide and student guide for Big Data on AWS, version 3.8. The student guide is the PowerPoint decks and notes and the lab guide is the step by step instructions for the labs. The lab guide is included in the labs so this document is somewhat redundant. You can download the Vitalsource Bookshelf app for Windows, Mac, iOS and Android at https://support.vitalsource.com/hc/en-us/articles/201344733-Bookshelf-Download-Page
- You can print the student and lab guides to pdf from the app.
Academic papers
- There is a distinct lack of good papers and published data on large data set processing from academia. Most of the big data work and tooling development is occurring within corporations.
- The MapReduce programming model applies split-apply-combine as a data strategy https://en.wikipedia.org/wiki/MapReduce . Successive iterations of this strategy include Apache Mahout https://en.wikipedia.org/wiki/Apache_Mahout and other less network intensive approaches for clustering, classification and batch based collaborative filtering using multi threaded, in memory methods and tools.
- Academic papers starting from ancient (circa 2008)
- Yahoo EMR circa 2008 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.178.1187&rep=rep1&type=pdf
- Usability of Apache MapReduce, Spark and Flink (circa 2018) https://arxiv.org/pdf/1803.10836.pdf
- Comparative study of Apache Hadoop on log processing http://dspace.uib.no/bitstream/handle/1956/17308/A-Comparative-Study-on-Distributed-Storage-and-Erasure-Coding-Techniques-Using-Apache-Hadoop-Over-NorNet-Core.pdf?sequence=1&isAllowed=y
- Perhaps the shortest paper on big data but also a very simple overview of MapReduce, Hadoop and some related technologies. https://www.ijariit.com/manuscripts/v4i1/V4I1-1218.pdf
- Another short paper which is also a good first read on MapReduce, Apache technologies for big data and how they interact. http://ijesc.org/upload/2e5dd1d9582d5491b12214524880f6a4.Big%20Data%20Technologies%20for%20Batch%20and%20Real%20Time%20Data%20Processing%20A%20Review.pdf
- Masters Thesis - Performance Comparison of Apache Spark and Tez for Entity Resolution https://pdfs.semanticscholar.org/2542/232e9769399517dac3c863791cf732f3812d.pdf
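The split-apply-combine idea behind MapReduce, mentioned in the list above, can be sketched in a few lines of plain Python. This is a single-process illustration only, not how Hadoop implements it: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are made up for this example, and a real cluster would run the map and reduce steps on different nodes with the shuffle moving data over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the list of values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map over every split, then shuffle and reduce the combined output."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Each string stands in for one input split stored on a different node.
    splits = ["big data on AWS", "big clusters process big data"]
    print(word_count(splits))
```

The shuffle step is where the network cost comes from on a real cluster, which is why the later in-memory tools try to reduce it.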
Some Fundamentals
- What is Big Data? (circa 2016) http://cueris.com/an-elephant-among-the-blind-men-the-big-data-quagmire/
- Why big data initiatives fail. (circa 2016) https://www.thoughtworks.com/insights/blog/7-reasons-big-data-analytics-initiatives-fail
- Apache Hadoop and HDFS https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS Hadoop reached version 1.0 in 2011 and is at V3.0.0 as of Dec 2017
- Apache Mahout is the open source wrapper for big data tools post Hadoop https://en.wikipedia.org/wiki/Apache_Mahout Mahout is a distributed linear algebra framework and mathematically expressive domain specific language. V0.13.0 as of Jun 2018 https://mahout.apache.org/docs/latest/index.html
- Gartner’s hype cycle for data management comments on the switch from self managed Hadoop environments to cloud based managed solutions. https://www.gartner.com/newsroom/id/3809163 You may have to web search for the hype cycle graphic if it doesn’t render. Also read https://www.datanami.com/2015/08/26/why-gartner-dropped-big-data-off-the-hype-curve/ on why Gartner thinks big data is now the new normal
- Apache Hadoop https://hadoop.apache.org/docs/current/index.html
- Making sense of the very extensive Hadoop ecosystem. https://siliconangle.com/blog/2015/09/09/wikibon-analyst-sees-urgent-need-to-simplify-hadoop-ecosystem/ and for some common terms visit wikipedia or https://www.thinkbiganalytics.com/leading_big_data_technologies/hadoop/
- The Hadoop ecosystem
- A long Tabular list of Hadoop tools and implementations https://hadoopecosystemtable.github.io/
- Hadoop in 5 cartoons https://content.pivotal.io/blog/demystifying-apache-hadoop-in-5-pictures
- Firecracker Announcement (circa Nov 2018) Firecracker – Lightweight Virtualization for Serverless Computing (Secure and fast microVMs for serverless computing and containers). Think of firecracker as next generation ‘fabric’ to replace legacy compute underlying containers, Lambda and edge computing. Firecracker also brings first class security to containers. https://aws.amazon.com/blogs/aws/firecracker-lightweight-virtualization-for-serverless-computing/ and here’s the github page https://firecracker-microvm.github.io/
- For those who need an introduction to data analytics fundamentals check out the self paced digital course at https://www.aws.training/learningobject/wbc?id=35364 To make effective and efficient use of any big data tools, frameworks and solutions you need a good background in the fundamentals, such as OLTP versus OLAP, schema design, query (SQL) structures and best practices (i.e. SELECT * on 100M+ rows is not a good query).
- Use the AWS Systems Manager Parameter Store to Query for AWS Regions, Endpoints at https://aws.amazon.com/blogs/aws/new-query-for-aws-regions-endpoints-and-more-using-aws-systems-manager-parameter-store/
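To make the SELECT * point from the fundamentals item above concrete, here is a small sketch using the SQLite module from the Python standard library. The `trips` table, its columns and the row count are invented for illustration, and SQLite is only a stand-in for a real analytics engine. The idea carries over: name the columns you need and let the engine aggregate, instead of dragging every column of every row back to the client.

```python
import sqlite3


def build_demo_db(row_count=1000):
    """Create an in-memory table of hypothetical taxi trips."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trips (id INTEGER PRIMARY KEY, city TEXT, fare REAL, notes TEXT)"
    )
    rows = [
        (i, "sydney" if i % 2 else "melbourne", 10.0 + (i % 5), "x" * 200)
        for i in range(row_count)
    ]
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", rows)
    return conn


def average_fare_select_star(conn):
    """Anti-pattern: fetch every column of every row, then aggregate in the client."""
    rows = conn.execute("SELECT * FROM trips").fetchall()
    totals = {}
    for _id, city, fare, _notes in rows:
        count, total = totals.get(city, (0, 0.0))
        totals[city] = (count + 1, total + fare)
    averages = {city: total / count for city, (count, total) in totals.items()}
    return averages, len(rows)


def average_fare_pushed_down(conn):
    """Better: select only the needed columns and aggregate inside the engine."""
    rows = conn.execute("SELECT city, AVG(fare) FROM trips GROUP BY city").fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print(average_fare_select_star(conn)[1], "rows transferred with SELECT *")
    print(average_fare_pushed_down(conn)[1], "rows transferred with GROUP BY")
```

Both functions return the same averages. The difference is how many rows and bytes cross the wire, which is what you pay for in scanned data and time on large tables.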
Cool links
- AWS Global Infrastructure. Here are several videos that ‘open the kimono’ on how AWS is designed and built to support millions of customers across the globe.
- James Hamilton, AWS SVP and Distinguished Engineer, talks about the design decisions and inner workings of the AWS global infrastructure. James also provides the history behind major technological innovations like we’re seeing now in Cloud Computing. This deck is over 4 years old but still a good summary. This should be the first AWS video you watch. https://www.slideshare.net/AmazonWebServices/spot301-aws-innovation-at-scale-aws-reinvent-2014 . Here’s the youtube video. https://www.youtube.com/watch?v=JIQETrFC_SQ There are Youtube videos from more recent ReInvents with some updates too. Here is James in 2016. It’s titled as AWS re:Invent 2016: Amazon Global Network Overview with James Hamilton https://www.youtube.com/watch?v=uj7Ting6Ckk
- Here is a 4 min snippet from 2016 titled AWS re:Invent 2016: Introduction to Amazon Global Network and CloudFront PoPs with James Hamilton https://www.youtube.com/watch?v=FjHBGjLnou0&feature=youtu.be
- AWS re:Invent 2017 Keynote - Tuesday Night Live with Peter DeSantis, VP AWS Global Infrastructure talks about the AWS global infrastructure. Up to 15:46 minutes is about the infrastructure. https://www.youtube.com/watch?v=dfEcd3zqPOA&feature=youtu.be&t=1h17m0s
- https://www.infrastructure.aws/ now has an interactive map and animations describing the AWS Global Infrastructure. 100 GBps intercontinental network.
- This one is self explanatory AWS re:Invent 2018: Amazon VPC: Security at the Speed Of Light (NET313) https://www.youtube.com/watch?v=UP7wDBjZ37o&feature=youtu.be
- James Hamilton also publishes blog posts on AWS Infrastructure regularly. Here’s one on number of data centers titled How Many Data Centers Needed World-Wide at https://perspectives.mvdirona.com/2017/04/how-many-data-centers-needed-world-wide/
- AWS re:Invent 2017: Scaling Up to Your First 10 Million Users (ARC201). This is like the Tech Essentials course in a single video. Well worth a watch. https://www.youtube.com/watch?v=w95murBkYmU
- To get a good overview in book form of AWS Services get a copy of AWS Certified Solutions Architect Study Guide https://www.amazon.com/Certified-Solutions-Architect-Study-Guide-ebook/dp/B07PL986GY/ref=sr_1_4?keywords=aws+exam+certification+study+guide&qid=1557530288&s=gateway&sr=8-4
- Amazon EC2 Instance Types explained in neat tabular comparisons. https://aws.amazon.com/ec2/instance-types/ . Also here’s a third party site that has a table that lets you sort on memory, network performance, cost and instance type. You can also quickly compare costs here too. https://ec2instances.info/ . Also check out http://instancetyp.es too. Here’s a stackoverflow thread on non AWS benchmarks of different instance types. https://stackoverflow.com/questions/20663619/what-does-amazon-aws-mean-by-network-performance
- List of Big Data sessions from ReInvent 2018 titled AWS Big Data and Analytics Sessions at Re:Invent 2018. Search for them on YouTube or Slideshare https://aws.amazon.com/blogs/big-data/aws-data-analytics-sessions-at-reinvent-2018/
- Netflix Technical Blog. Always an interesting read of AWS usage at scale. https://medium.com/netflix-techblog
- AWS Case Studies. These are useful to get non technical types on board with AWS. https://aws.amazon.com/solutions/case-studies/
- AWS created solutions, reference architectures and quickstarts at https://aws.amazon.com/big-data/getting-started/tutorials/ . Look for the self paced labs that are accessible at run.qwiklabs.com
- There are some interesting solutions on the AWS Big Data blog post at https://aws.amazon.com/blogs/big-data/
- Which services are available in which regions? https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/
- If you’ve never watched anything of Hans Rosling on Youtube you are missing one of the world’s best big data speakers. Hans talks about everything from washing machines to how chimpanzees are smarter than professors. The best stats you’ve ever seen - Hans Rosling (circa 2013) https://www.youtube.com/watch?v=usdJgEwMinM
- AWS hosts more than 110 big data and open data sets. https://registry.opendata.aws/
- Tableau Server on AWS. This is an AWS Quickstart which is a reference architecture you can build from the supplied Cloudformation template. https://aws.amazon.com/quickstart/architecture/tableau-server/
- AWS Free Tier. No explanation needed. https://aws.amazon.com/free/?all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc&awsf.Free%20Tier%20Types=categories%23alwaysfree&awsm.page-all-free-tier=3
- Big data case study, github repo and more from Snowplow
- Snowplow on AWS including EMR, Kinesis and S3 https://github.com/snowplow/snowplow
- The AWS Snowplow case study https://aws.amazon.com/solutions/case-studies/snowplow/
- Unpicking the Snowplow data pipeline and how it drives AWS costs https://snowplowanalytics.com/blog/2013/07/09/understanding-how-different-parts-of-the-Snowplow-data-pipeline-drive-AWS-costs/
- S3 Transfer Acceleration Speed Checker http://s3-accelerate-speedtest.s3-accelerate.amazonaws.com/en/accelerate-speed-comparsion.html uses a multi part upload to check the speed difference when using S3 transfer acceleration between regions.
Best Practice
- Choosing the right AWS big data services for the right use cases. Here’s a ReInvent Deep Dive on the AWS Big Data ecosystem. AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS (BDM201) https://www.youtube.com/watch?v=RNrsIlweCno
- AWS Answers contains deployable solutions for common big data problems. Great for prototyping and reverse engineering. These solutions also contain best practices in terms of tagging, nomenclature, applying the rule of least privilege and integrating services. https://aws.amazon.com/answers/big-data/
- AWS Big Data and Analytics Sessions at Re:Invent 2018 which summarizes many of the big data sessions in one page. https://noise.getoto.net/2018/11/14/aws-big-data-and-analytics-sessions-at-reinvent-2018/
- Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3 https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/
- Centralized Logging (aka log everything) is a recommended best practice cloud pattern. You can access the deployment guide and cloudformation templates at https://aws.amazon.com/solutions/centralized-logging/
- We’re starting to see some novel, adaptable and cost effective approaches to serverless ETL using AWS Step Functions. Here’s one example Serverless Data Processing with AWS Step Functions — An Example https://medium.com/weareservian/serverless-data-processing-with-aws-step-functions-an-example-6876e9bea4c0 Of course you can use AWS Data Pipeline, see the FAQ https://aws.amazon.com/datapipeline/faqs/, for building custom ETL jobs.
- Lambda functions should be stateless and not have external dependencies that may see them time out. Here’s an example of how AWS Lambda and AWS Step Functions can be used to asynchronously transform large data sets by iterating their way through the data. Exception handling, loss of availability and timeouts by external services are handled by the state machine logic eliminating the need to deal with this in the Lambda functions. This solution transcribes podcasts, pushes state to Elasticsearch and even cleans up the transient data transforms. https://aws.amazon.com/blogs/machine-learning/discovering-and-indexing-podcast-episodes-using-amazon-transcribe-and-amazon-comprehend/ and on github https://github.com/aws-samples/amazon-transcribe-comprehend-podcast
- What is DevOps?
- Agile manifesto https://agilemanifesto.org/ 4 behaviours and 12 principles
- Modern summary of agile and DevOps https://gist.github.com/jpswade/4135841363e72ece8086146bd7bb5d91
- Key design principles for any application involving big data. Cost management opportunities are significant. It’s up to you to exploit them. Here’s a comparison between file types and analytics options that highlights the large cost deltas on offer. It’s titled ‘1.1 Billion Taxi Rides on Amazon Athena’ https://tech.marksblogg.com/billion-nyc-taxi-rides-aws-athena.html Also check out https://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html
- Consider storage and request scenarios between Amazon DynamoDB and Amazon S3. (but check the pricing as it drops over time)
- Has anyone done the costs math on S3 vs. DynamoDB for “small” (json) objects? And, under which circumstances would you have to use one over the other? https://www.reddit.com/r/aws/comments/5haamf/has_anyone_done_the_costs_math_on_s3_vs_dynamodb/
- Here’s a great explanation of using costs (with data) as an input to your architectural choices. It’s titled S3 vs DynamoDB price comparison https://www.cirrusup.cloud/s3-vs-dynamodb-price-comparison/ In this example the threshold is a 20kB file size. But of course the threshold may vary for your use case. Do the math.
- Using multiple Cloudformation stacks to build an ETL workflow, in a post titled Orchestrate Amazon Redshift-Based ETL workflows with AWS Step Functions and AWS Glue https://aws.amazon.com/blogs/big-data/orchestrate-amazon-redshift-based-etl-workflows-with-aws-step-functions-and-aws-glue/
- Scaling automation and best practice (as IaC) across multiple accounts using AWS Control Tower, AWS Service Catalog and Cloudformation https://aws.amazon.com/blogs/mt/enabling-self-service-provisioning-of-aws-resources-with-aws-control-tower/
- Start planning your 2019 AWS re:Invent schedule with AWS re:Invent 2019 – Financial Services Industry Guide https://aws.amazon.com/blogs/industries/aws-reinvent-2019-financial-services-industry-guide/
- Using SAM to build a serverless streaming processor with Kinesis, DynamoDB and S3, in a post titled Increasing real-time stream processing performance with Amazon Kinesis Data Streams enhanced fan-out and AWS Lambda https://aws.amazon.com/blogs/compute/increasing-real-time-stream-processing-performance-with-amazon-kinesis-data-streams-enhanced-fan-out-and-aws-lambda/?nc1=b_rp
- A great way to understand how Cloudformation can build, update and delete immutable or mutable environments is to reverse engineer AWS Quickstarts (gold standard reference architectures). Check out https://aws.amazon.com/quickstart/saas/identity-with-cognito/ for the deployment guide and https://github.com/aws-quickstart/saas-identity-cognito for all the Cloudformation templates.
- Adrian Cockcroft AWS VP of Cloud Architecture Strategy and former CTO of Netflix. Here’s his Youtube playlist with talks about DevOps, migrations, Netflix lessons learned and digital transformation topics. https://www.youtube.com/playlist?list=PL_KXMLr8jNTnwkzV7SePa0jHFUG2qn0MA
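For the S3 versus DynamoDB cost question raised in the list above ("do the math"), a rough calculator along these lines may help. Treat it as a sketch under stated assumptions: every price is a parameter you must fill in from the current pricing pages (the numbers in the usage example are placeholders, not real prices), the model only covers storage and request charges, and it assumes DynamoDB writes are metered in 1 KB units and strongly consistent reads in 4 KB units. The class and function names are made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Prices:
    """Unit prices supplied by the caller; look up current values before use."""
    s3_storage_gb_month: float
    s3_put_per_1000: float
    s3_get_per_1000: float
    ddb_storage_gb_month: float
    ddb_write_unit_per_million: float
    ddb_read_unit_per_million: float


def request_units(size_kb):
    """Return (write units, read units) consumed by one item of size_kb."""
    return math.ceil(size_kb / 1), math.ceil(size_kb / 4)


def s3_monthly_cost(p, objects, size_kb, puts, gets):
    storage_gb = objects * size_kb / (1024 * 1024)
    return (storage_gb * p.s3_storage_gb_month
            + puts / 1000 * p.s3_put_per_1000
            + gets / 1000 * p.s3_get_per_1000)


def ddb_monthly_cost(p, objects, size_kb, puts, gets):
    storage_gb = objects * size_kb / (1024 * 1024)
    write_units, read_units = request_units(size_kb)
    return (storage_gb * p.ddb_storage_gb_month
            + puts * write_units / 1_000_000 * p.ddb_write_unit_per_million
            + gets * read_units / 1_000_000 * p.ddb_read_unit_per_million)


def cheaper_store(p, objects, size_kb, puts, gets):
    """Name the cheaper option for this workload under the supplied prices."""
    s3 = s3_monthly_cost(p, objects, size_kb, puts, gets)
    ddb = ddb_monthly_cost(p, objects, size_kb, puts, gets)
    return "s3" if s3 <= ddb else "dynamodb"


if __name__ == "__main__":
    # Placeholder prices only, chosen to make the arithmetic easy to follow.
    placeholder = Prices(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
    for size_kb in (1, 20, 200):
        print(size_kb, "KB ->", cheaper_store(placeholder, 1_000_000, size_kb, 1000, 2000))
```

Sweeping `size_kb` with real prices and your own request mix shows where the threshold sits for your use case, which may be well away from the 20 kB figure in the linked comparison.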
Migration Best Practice
- Migrating to AWS - Best Practices and Strategies is a good starting point for execs and planners https://d1.awsstatic.com/Migration/migrating-to-aws-ebook.pdf
- AWS Cloud Adoption Framework (CAF) https://aws.amazon.com/professional-services/CAF/
- AWS Cloud Adoption Readiness Tool (CART) https://cloudreadiness.amazonaws.com/#/cart
- AWS Server Migration Service requirements https://docs.aws.amazon.com/server-migration-service/latest/userguide/prereqs.html
- Migrating to AWS https://aws.amazon.com/cloud-migration/
- Cloud stages of adoption in the AWS blog titled Cloud Transformation Maturity Model: Guidelines to Develop Effective Strategies for Your Cloud Adoption Journey https://aws.amazon.com/blogs/publicsector/cloud-adoption-maturity-model-guidelines-to-develop-effective-strategies-for-your-cloud-adoption-journey/
- Stephen Orban’s 2017 post on how Capital One journeyed through the Cloud stages of adoption titled Capital One’s Cloud Journey Through the Stages of Adoption https://medium.com/aws-enterprise-collection/capital-ones-cloud-journey-through-the-stages-of-adoption-bb0895d7772c
- Check out the AWS Migration Hub https://aws.amazon.com/migration-hub/ and related tooling to support your Migrations
- AWS Database Migration Service Best Practices https://docs.aws.amazon.com/dms/latest/userguide/dms-ug.pdf#CHAP_BestPractices
- Getting Started with the Migration Hub https://docs.aws.amazon.com/migrationhub/latest/ug/getting-started.html
- Microsoft SQL Server to Amazon Aurora MySQL Compatibility Migration Playbook https://aws.amazon.com/blogs/database/another-database-migration-playbook-goes-live-migrate-from-microsoft-sql-server-to-amazon-aurora-mysql/ and find each migration playbook at https://aws.amazon.com/dms/resources/ . Also compare DB instance costs between database options at https://aws.amazon.com/rds/pricing/?nc=sn&loc=4 Note the extra expense (up to an order of magnitude greater) of MS SQL Server compared with Amazon Aurora, for example. Most of this cost is licensing and the need to run more complex mirroring servers for MS SQL Server.
Networking Links
- List of CIDR ranges of AWS regions http://ec2-reachability.amazonaws.com/
- Latency between AWS regions. Lots of good empirical data points. Note these are averages over a 24 hour period. https://www.cloudping.co/
Compute links
- The aws cli wait command will time out after 120 checks. They’re labeled as failed checks in the documentation but they aren’t strictly a failure. If it times out, whatever the wait command is waiting on never reached the desired state. The timeout period can vary so check the documentation for the service and wait state you’re interested in. Here’s EBS https://docs.aws.amazon.com/cli/latest/reference/ec2/wait/snapshot-completed.html
- Here’s a useful sortable table of EC2 instance types, sizes and specifications. Note you can add columns such as availability for EMR, which makes instance selection very simple. https://ec2instances.info/ . Also check out http://instancetyp.es too.
- Deep Dive on the Nitro hypervisor and the security benefits of loosely coupling the hypervisor (or more correctly the compute management system and control plane) in a video titled AWS Live re:Inforce - Security Benefits of the EC2 Nitro Architecture https://www.youtube.com/watch?v=t_9CASbagag And Nitro is also described in detail in Amazon EC2 High Memory instances for SAP HANA: simple, flexible, powerful https://aws.amazon.com/blogs/awsforsap/amazon-ec2-high-memory-instances-for-sap-hana-simple-flexible-powerful/
- How do I stop and start my instances using the AWS Instance Scheduler? https://aws.amazon.com/premiumsupport/knowledge-center/stop-start-instance-scheduler/
- Use the AWS Instance Scheduler Solution https://aws.amazon.com/solutions/instance-scheduler/
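The wait behaviour described in the first item of this list is just polling with a fixed delay and a maximum number of checks. The sketch below shows that pattern in plain Python so the timeout arithmetic is easy to see. It is not the AWS CLI or SDK implementation: `wait_until`, `WaiterTimeout` and the `check` callback are names invented here, and the delay and attempt values are whatever you pass in.

```python
import time


class WaiterTimeout(Exception):
    """Raised when the resource never reached the desired state."""


def wait_until(check, *, delay_seconds, max_attempts, sleep=time.sleep):
    """Poll check() until it returns True or max_attempts checks have been made.

    Returns the number of checks used. The longest possible wait is roughly
    delay_seconds * (max_attempts - 1) plus the time spent inside check().
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise WaiterTimeout(f"gave up after {max_attempts} checks")


if __name__ == "__main__":
    # Simulated snapshot that completes on the third status check.
    states = iter(["pending", "pending", "completed"])
    used = wait_until(lambda: next(states) == "completed",
                      delay_seconds=0.1, max_attempts=5)
    print("completed after", used, "checks")
```

A timeout here only means the state was not seen in time. The underlying operation may still finish later, so treat it as "unknown" rather than "failed" in your automation.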
Serverless Links
Using Amazon SNS and AWS Lambda together in serverless event driven architectures:
- Amazon SNS and AWS X-Ray from the documentation https://docs.aws.amazon.com/xray/latest/devguide/xray-services-sns.html
- Amazon SNS Adds Support for AWS X-Ray in the blog post announcement https://aws.amazon.com/about-aws/whats-new/2019/07/amazon-sns-adds-support-for-aws-x-ray/
- Tracing Lambda-Based Applications with AWS X-Ray https://docs.aws.amazon.com/lambda/latest/dg/using-x-ray.html
- aws x-ray and lambda : the good, the bad and the ugly https://theburningmonk.com/2017/06/aws-x-ray-and-lambda-the-good-the-bad-and-the-ugly/
Storage Links
- Amazon S3 Path Deprecation Plan – The Rest of the Story https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/
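The post above is about moving from path-style S3 URLs (bucket name in the path) to virtual-hosted-style URLs (bucket name in the host name). As a rough illustration of the difference, this sketch rewrites one form into the other using only the standard library. The helper name `to_virtual_hosted` and the bucket and key names are made up, and it only recognises the `s3.amazonaws.com` and `s3.<region>.amazonaws.com` host forms, so treat it as a starting point rather than a complete converter.

```python
from urllib.parse import urlsplit, urlunsplit


def to_virtual_hosted(url):
    """Rewrite a path-style S3 URL as a virtual-hosted-style URL."""
    parts = urlsplit(url)
    host = parts.netloc
    is_s3_host = host == "s3.amazonaws.com" or (
        host.startswith("s3.") and host.endswith(".amazonaws.com")
    )
    if not is_s3_host:
        raise ValueError(f"not a recognised path-style S3 host: {host}")

    # In path-style URLs the first path segment is the bucket name.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError("no bucket name found in the URL path")

    return urlunsplit(
        (parts.scheme, f"{bucket}.{host}", "/" + key, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_virtual_hosted("https://s3.us-west-2.amazonaws.com/example-bucket/logs/app.log"))
```

Searching your code and configuration for hosts that begin with `s3.` followed by a bucket in the path is a quick way to find URLs that need this treatment.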
Database Links
- Microsoft SQL Server to Amazon Aurora MySQL Compatibility Migration Playbook https://aws.amazon.com/blogs/database/another-database-migration-playbook-goes-live-migrate-from-microsoft-sql-server-to-amazon-aurora-mysql/ and find each migration playbook at https://aws.amazon.com/dms/resources/ . Also compare DB instance costs between database options at https://aws.amazon.com/rds/pricing/?nc=sn&loc=4 Note the extra expense (up to an order of magnitude greater) of MS SQL Server compared with Amazon Aurora, for example. Most of this cost is licensing and the need to run more complex mirroring servers for MS SQL Server.
- Let Me Graph That For You – Part 1 – Air Routes is an interesting big data approach using Jupyter notebooks running on Amazon Sagemaker to query airline data in a Neptune graph database. This environment can be built and torn down and the data can be retrieved fast from an S3 bucket. A good example of a cost effective big data solution. https://aws.amazon.com/blogs/database/let-me-graph-that-for-you-part-1-air-routes/
Streaming links
- Kinesis can be configured to process large volumes of data or to throttle to protect back-end services. https://docs.aws.amazon.com/firehose/latest/dev/basic-deliver.html . Also check out the Kinesis service limits https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
- Flume vs. Kafka vs. Kinesis - A Detailed Guide on Hadoop Ingestion Tools https://www.newgenapps.com/blog/flume-vs.-kafka-vs.-kinesis-guide-hadoop-ingestion-tools and 5 Most Important Difference Between Apache Kafka vs Flume https://www.educba.com/apache-kafka-vs-flume/ . Don’t forget the FAQs for Amazon Managed Streaming for Apache Kafka (MSK) https://aws.amazon.com/msk/faqs/ , Amazon Kinesis Data Streams https://aws.amazon.com/kinesis/data-streams/faqs/ , Amazon Kinesis Firehose (which can batch, transform and compress) https://aws.amazon.com/kinesis/data-firehose/faqs/ and Amazon Kinesis Analytics https://aws.amazon.com/kinesis/data-analytics/faqs/
Athena Links
- Athena uses a Serializer / Deserializer (a SerDe) in preference to a DDL to extract the structure of data in S3. https://docs.aws.amazon.com/athena/latest/ug/serde-about.html and https://docs.aws.amazon.com/athena/latest/ug/supported-format.html
- NOTE that Athena has limited support for SerDes. http://aws.mannem.me/?tag=athena and https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/
- Athena blog posts and solutions. Worth a look to deep dive into Athena. https://aws.amazon.com/blogs/big-data/tag/amazon-athena/
- Athena performance tuning https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
- From Athena documentation Bucketing vs Partitioning https://docs.aws.amazon.com/athena/latest/ug/bucketing-vs-partitioning.html Athena has a 100 partition upper limit for a CTAS query.
- Key design principles for any application involving big data. Cost management opportunities are significant. It’s up to you to exploit them. Here’s a comparison between file types and analytics options that highlights the large cost deltas on offer. It’s titled ‘1.1 Billion Taxi Rides on Amazon Athena’ https://tech.marksblogg.com/billion-nyc-taxi-rides-aws-athena.html
- Top 10 Performance Tuning Tips for Amazon Athena by the numbers https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
- Setting up some automation around queries that Athena puts in Amazon S3 can be useful to find and categorize your queries. This post does something similar and is titled Separate queries and managing costs using Amazon Athena workgroups https://aws.amazon.com/blogs/big-data/separating-queries-and-managing-costs-using-amazon-athena-workgroups/ . Alternatively this blog post titled Visualizing big data with AWS AppSync, Amazon Athena, and AWS Amplify uses a Lambda function to return an Athena query to a Cognito authenticated user in a dashboard. https://aws.amazon.com/blogs/mobile/visualizing-big-data-with-aws-appsync-amazon-athena-and-aws-amplify/?nc1=b_rp
- Why use MSCK REPAIR TABLE with Athena https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html
- Best Practices When Using Athena with AWS Glue https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
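Several items in this list (partitioning, MSCK REPAIR TABLE and the 100 partition limit for a CTAS query) rely on Hive-style partition paths, where each partition column appears in the object key as a `column=value` segment. The sketch below builds and parses such prefixes and counts how many distinct partitions a set of keys touches. The bucket name, table root and helper names are hypothetical, and the limit constant simply repeats the figure quoted in the notes above, so check the current Athena documentation before relying on it.

```python
from datetime import date

CTAS_PARTITION_LIMIT = 100  # figure quoted in the notes above; verify before use


def partition_prefix(table_root, day):
    """Build a Hive-style key prefix (column=value segments) for one day of data."""
    return f"{table_root}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def parse_partition(key):
    """Extract partition column values from the column=value path segments."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def partitions_touched(keys):
    """Count the distinct partitions that a collection of object keys belongs to."""
    return len({tuple(sorted(parse_partition(k).items())) for k in keys})


def fits_in_one_ctas(keys):
    """True if one CTAS query would stay within the partition limit."""
    return partitions_touched(keys) <= CTAS_PARTITION_LIMIT


if __name__ == "__main__":
    root = "s3://example-bucket/trips"  # hypothetical table location
    start = date(2019, 1, 1).toordinal()
    keys = [
        partition_prefix(root, date.fromordinal(start + i)) + "part-0000.parquet"
        for i in range(120)
    ]
    print(partitions_touched(keys), "partitions, fits in one CTAS:", fits_in_one_ctas(keys))
```

If a CTAS would touch too many partitions, one option is to split the work by a coarser column, for example one query per month, so each statement stays under the limit.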
Redshift Links
- Benchmarking the Nov 2018 Redshift performance improvements. https://aws.amazon.com/blogs/big-data/performance-matters-amazon-redshift-is-now-up-to-3-5x-faster-for-real-world-workloads/
- Using Amazon Redshift Spectrum with Enhanced VPC Routing https://docs.aws.amazon.com/redshift/latest/mgmt/spectrum-enhanced-vpc.html
- Differences Between Amazon Redshift and PostgreSQL for Stored Procedure Support https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-constraints.html
- When performance tuning you need to be digging into the information Redshift makes available. For example check out Logging & Details about stored procedures https://medium.com/@abacigil/using-amazon-redshift-stored-procedures-e529fe23efa8 and of course in the documentation such as Managing Transactions https://github.com/awsdocs/amazon-redshift-developer-guide/blob/master/doc_source/stored-procedure-transaction-management.md
- Twelve Best Practices for Amazon Redshift Spectrum https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
EMR Links
- AWS Big Data Blog - Tag = EMR https://aws.amazon.com/blogs/big-data/tag/emr/page/2/
- ANT312 – Migrate Your Hadoop/Spark Workload to Amazon EMR and Architect It for Security and Governance on AWS TODO add youtube link
- ANT389 – Ask an Amazon Redshift Customer Anything TODO add youtube link Run Hadoop on EC2 combined with Redshift and Athena
- ANT206 – Under the Hood: How Amazon Uses AWS Services for Analytics at a Massive Scale TODO youtube video here about Amazon’s internal experience with EMR and Redshift
- SRV316-R1 – Serverless Stream Processing Pipeline Best Practices TODO youtube video here Serverless architectures as alternatives to traditional hadoop environments
- ANT348 – [BS] Amazon EMR: Optimize Transient Clusters for Data Processing and ETL TODO youtube video here EMR best practices
- ANT344 – [BS] One Data Lake, Many Uses: Enable Multi-Tenant Analytics with Amazon EMR TODO youtube video link here Multi tenant EMR
- ANT318 – Build, Deploy and Serve Machine learning models on streaming data using Amazon Sagemaker, Apache Spark on Amazon EMR and Amazon Kinesis TODO youtube link here
- AWS re:Invent 2017: Design Patterns and Best Practices for Data Analytics with Amazo (ABD305) https://www.youtube.com/watch?v=1CAWf9VDgFM
- EMR Deep Dive 2016 https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-deep-dive-amazon-emr-best-practices-design-patterns-bdm401
- Cluster planning guidelines https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
- Yahoo Hadoop cluster experience. https://www.techrepublic.com/article/why-the-worlds-largest-hadoop-installation-may-soon-become-the-norm/
- Hadoop file formats and compression options https://cloud.netapp.com/blog/optimizing-aws-emr-best-practices
- EMR best practices whitepaper (circa 2013) https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf
- Instance types supported https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html Also see http://ec2instances.info for a consolidated and sortable list of all Amazon EC2 instances and their key features.
- Some facts on Facebook use of hadoop. It’s always interesting to ‘bookend’ the larger use cases. https://code.facebook.com/posts/423120391138341/hadoop/
- EMR cluster resizing guidance from the AWS Big Data blog is worth reading to understand the default EMR cluster constraints. https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/ This post also links to a good description of how to use S3DistCp https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
- Spark versus Tez https://www.xplenty.com/blog/apache-spark-vs-tez-comparison/
- Spark documentation http://spark.apache.org/docs/latest/index.html
- Spark didn’t support Hive bucketing
- There’s an outstanding pull request on Spark github for Hive-like bucketing in Spark: https://github.com/apache/spark/pull/19001
- This document summarizes the differences between bucketing in Hive and bucketing currently implemented in spark: https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing
- Also see: https://issues.apache.org/jira/browse/SPARK-19256
- https://www.youtube.com/watch?v=6BD-Vv-ViBw
- custom host name for emr instances for Eric / Russ Using puppet
- Part 1 Launching and Running an Amazon EMR Cluster inside a VPC https://aws.amazon.com/blogs/big-data/launching-and-running-an-amazon-emr-cluster-inside-a-vpc/
- Part 2 Launching and Running an Amazon EMR Cluster in your VPC – Part 2: Custom DNS https://aws.amazon.com/blogs/big-data/launching-and-running-an-amazon-emr-cluster-in-your-vpc-part-2-custom-dns/
- Monitor Metrics with CloudWatch - Amazon EMR https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
- Your options for installing dependencies/libraries for EMR for spark-shell https://stackoverflow.com/questions/36511017/installing-dependencies-libraries-for-emr-for-spark-shell
- Debug EMR by accessing the logs either on the master node by SSH, in S3 or in the console https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
- Run multi tenant and ephemeral EMR in Orchestrate big data workflows with Apache Airflow, Genie, and Amazon EMR: Part 1 https://aws.amazon.com/blogs/big-data/orchestrate-big-data-workflows-with-apache-airflow-genie-and-amazon-emr-part-1/
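On the Hive bucketing items earlier in this list: bucketing assigns each row to one of a fixed number of files by hashing the bucketing column and taking the modulus. If two engines hash that column differently, rows land in different buckets and neither engine can trust the other's layout. The sketch below illustrates that idea only. The two hash functions are stand-ins from the Python standard library, not the functions Hive or Spark actually use, and the assumption that the hash function is one of the differences should be checked against the comparison document linked above.

```python
import zlib


def bucket_for(value, num_buckets, hash_fn):
    """Assign a value to a bucket: hash the bucketing column, then take the modulus."""
    return hash_fn(value.encode("utf-8")) % num_buckets


# Stand-in hash functions for two hypothetical engines (NOT Hive's or Spark's).
engine_a_hash = zlib.crc32
engine_b_hash = zlib.adler32


def bucket_layout(values, num_buckets, hash_fn):
    """Return {bucket number: [values]} as one engine would write them."""
    layout = {bucket: [] for bucket in range(num_buckets)}
    for value in values:
        layout[bucket_for(value, num_buckets, hash_fn)].append(value)
    return layout


if __name__ == "__main__":
    customer_ids = ["a", "b", "c", "d", "e", "f"]
    print("engine A:", bucket_layout(customer_ids, 4, engine_a_hash))
    print("engine B:", bucket_layout(customer_ids, 4, engine_b_hash))
```

A bucketed join only skips the shuffle when both sides agree on which bucket each key lives in, which is why a layout written with one hash function has to be treated as unbucketed by an engine using another.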
DynamoDB links
- Because DynamoDB is fully managed, it’s not always easy to correlate its actual performance, as the service provides adaptive capacity to avoid throttling a partition. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
- Using Hive external DynamoDB tables. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.ExternalTableForDDB.html
- DynamoDB Provisioned Throughput describes how to predict, identify and tune bottlenecks or throughput mismatches between EMR and DynamoDB. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.PerformanceTuning.Throughput.html
- Also read
- Your guide to Amazon DynamoDB sessions, workshops, and chalk talks at AWS re:Invent 2018 https://aws.amazon.com/blogs/database/your-guide-to-amazon-dynamodb-sessions-workshops-and-chalk-talks-at-aws-reinvent-2018/
- How Amazon handles prime day (DynamoDB) https://www.youtube.com/watch?v=83-IWlvJ__8
- Search the AWS Big Data Blog for DynamoDB related posts https://aws.amazon.com/search/?searchQuery=dynamo#facet_type=blogs&limit=25&page=1&sortResults=modification_date%20desc
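The provisioned throughput link above is about predicting how long an EMR job against DynamoDB will take. Here is a minimal back-of-the-envelope sketch of that arithmetic. It assumes one read capacity unit covers one strongly consistent read of up to 4 KB per second; the function name and the `usable_fraction` parameter are hypothetical, added for illustration only.

```python
def estimate_scan_seconds(table_bytes, read_capacity_units,
                          bytes_per_rcu=4096, usable_fraction=1.0):
    """Rough lower bound on full-table-scan time when reads are capped
    by provisioned read capacity.

    usable_fraction is a hypothetical knob for the share of provisioned
    capacity the job is allowed to consume (1.0 = all of it).
    """
    if read_capacity_units <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("capacity and usable_fraction must be positive")
    bytes_per_second = read_capacity_units * bytes_per_rcu * usable_fraction
    return table_bytes / bytes_per_second


# Example: a 20 GB table scanned with 100 provisioned read capacity units.
table_bytes = 20 * 1024 ** 3
seconds = estimate_scan_seconds(table_bytes, 100)
print(f"{seconds:,.1f} seconds (about {seconds / 3600:.2f} hours)")
```

If the estimate is far longer than you can tolerate, that is a hint to raise read capacity for the duration of the job or to reduce how much data the query scans.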
Glue links
- Glue sample programs on Github from AWS. https://github.com/aws-samples/aws-glue-samples
- ANT333 – [BS] Building Advanced Workflows with AWS Glue, which is an intro to Glue. Here’s the slideshare https://www.slideshare.net/AmazonWebServices/building-advanced-workflows-with-aws-glue-ant333-aws-reinvent-2018
- ANT331 – [BS] Metrics-Driven Performance Tuning for AWS Glue ETL Jobs. Here’s the slideshare https://www.slideshare.net/AmazonWebServices/metricsdriven-performance-tuning-for-aws-glue-etl-jobs-ant331-aws-reinvent-2018 This re:Invent 2018 session deep dives into solutions to common AWS Glue errors.
- Relationalizing JSON in Glue
- Building Serverless ETL Pipelines with AWS Glue https://www.youtube.com/watch?v=PHYWI4Y9mzs
- aws-etl-orchestrator on GitHub https://github.com/aws-samples/aws-etl-orchestrator
- Flattening JSON
- Try the relationalize transform / serde in Glue. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html
- Grab the Joining, Filtering, and Loading Relational Data with AWS Glue GitHub Python script and use it in Glue. Customize it as you need. https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md
- Or try Simplify Querying Nested JSON with the AWS Glue Relationalize Transform (circa 2017) https://noise.getoto.net/2017/12/14/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/
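To get a feel for what flattening nested JSON means before you try it in Glue, here is a plain-Python sketch. It is not the Glue Relationalize transform and does not use the Glue libraries; it only illustrates turning nested objects into dotted column names. The `flatten` helper and the sample record are hypothetical, and arrays are deliberately left untouched.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys.

    Lists are left as-is; this sketch only handles nested objects.
    """
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict) and value:
            # Recurse into nested objects, carrying the column prefix down.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical sample record, for illustration only.
sample = {
    "id": "p1",
    "name": {"first": "Ada", "last": "Lovelace"},
    "links": [{"url": "https://example.com"}],
}
flat = flatten(sample)
for column, value in flat.items():
    print(column, "=", value)
```

The dotted names map naturally onto table columns, which is what makes the flattened data easy to query with SQL.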
Security
- ANT346 – [BS] Lock It Down: Configure End-to-End Security & Access Control on Amazon EMR TODO youtube link here
Big Data architectures
- Data Lake Foundation on AWS. This architecture combines Apache Zeppelin, Amazon RDS, Amazon S3, and other AWS services. This is a more traditional compute and relational database architecture. https://aws.amazon.com/quickstart/architecture/data-lake-foundation-with-zeppelin-and-rds/
- Predictive Data Science with Amazon SageMaker and a Data Lake on AWS. This architecture is a serverless approach for storing and transforming data for building predictive and prescriptive applications using machine learning services. https://aws.amazon.com/quickstart/architecture/predictive-data-science-sagemaker-and-data-lake/
- Create cross-account and cross-region AWS Glue connections https://aws.amazon.com/blogs/big-data/create-cross-account-and-cross-region-aws-glue-connections/
- How Annalect built an event log data analytics solution using Amazon Redshift. This is a simple compute and S3-based architecture. https://aws.amazon.com/blogs/big-data/how-annalect-built-an-event-log-data-analytics-solution-using-amazon-redshift/
- Easily manage table metadata for Presto running on Amazon EMR using the AWS Glue Data Catalog https://aws.amazon.com/blogs/big-data/easily-manage-table-metadata-for-presto-running-on-amazon-emr-using-the-aws-glue-data-catalog/
- Dynamically Create Friendly URLs for Your Amazon EMR Web Interfaces https://aws.amazon.com/blogs/big-data/dynamically-create-friendly-urls-for-your-amazon-emr-web-interfaces/
- Custom Log Presto Query Events on Amazon EMR for Auditing and Performance Insights https://aws.amazon.com/blogs/big-data/custom-log-presto-query-events-on-amazon-emr-for-auditing-and-performance-insights/
- Create Custom AMIs and Push Updates to a Running Amazon EMR Cluster Using Amazon EC2 Systems Manager https://aws.amazon.com/blogs/big-data/create-custom-amis-and-push-updates-to-a-running-amazon-emr-cluster-using-amazon-ec2-systems-manager/
Time Series Forecasting
- Amazon Forecast Predefined Dataset Domains and Dataset Types https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html
Start with the Custom Domain for Yelp data (not accessible at https://data.ny.gov/NYC-BigApps/Yelp-API/65z6-rsii or https://github.com/Yelp/dataset-examples). You need to transform the data to have the following attributes:
- item_id (string)
- timestamp (timestamp)
- target_value (floating-point integer) – This is the target field for which Amazon Forecast generates a forecast.
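As a rough illustration of that transform, the sketch below reshapes review-style records into the three attributes listed above by counting reviews per business per day. The input field names (`business_id`, `date`), the sample rows, the daily `yyyy-MM-dd` timestamp format and the choice to write no header row are all assumptions for illustration; check them against your own data and the Forecast import requirements.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical input rows; the field names are assumptions.
reviews = [
    {"business_id": "b1", "date": "2018-01-01 09:15:00"},
    {"business_id": "b1", "date": "2018-01-01 17:40:00"},
    {"business_id": "b1", "date": "2018-01-02 12:00:00"},
    {"business_id": "b2", "date": "2018-01-01 08:05:00"},
]


def to_target_time_series(rows):
    """Aggregate rows into item_id / timestamp / target_value records,
    where target_value is the number of reviews per item per day."""
    counts = defaultdict(float)
    for row in rows:
        parsed = datetime.strptime(row["date"], "%Y-%m-%d %H:%M:%S")
        counts[(row["business_id"], parsed.strftime("%Y-%m-%d"))] += 1.0
    return [
        {"item_id": item, "timestamp": day, "target_value": value}
        for (item, day), value in sorted(counts.items())
    ]


def to_csv(records):
    """Serialize records in item_id, timestamp, target_value column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["item_id", "timestamp", "target_value"],
        lineterminator="\n",
    )
    writer.writerows(records)  # header row deliberately omitted (assumption)
    return buffer.getvalue()


records = to_target_time_series(reviews)
print(to_csv(records))
```

Counting reviews is only one possible target; any numeric measure you want to forecast per item and time bucket could take its place.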
Java Serverless Links
- Here are some links to running Spring Boot microservices on AWS:
- Deploy Spring Boot App to AWS Fargate https://dzone.com/articles/deploy-spring-boot-app-to-aws-fargate and note the link to a GitHub repo.
- Here’s a serverless (Lambda and API Gateway) example https://www.rowellbelen.com/serverless-microservices-with-spring-boot-and-spring-data/
- Here’s justification for running on Elastic Beanstalk https://stackoverflow.com/questions/48934158/spring-boot-cloud-microservices-on-aws
- Some AWS guidance on Spring Boot:
- Elastic Beanstalk is called out specifically in the Spring Boot documentation https://docs.spring.io/spring-boot/docs/1.5.10.RELEASE/reference/htmlsingle/#production-ready-metrics
- We (AWS) have made a small library that makes Spring Boot AWS Lambda friendly. It’s called the Java Serverless Container. It supports Lambda, API Gateway, Load Balancing and Route 53 https://github.com/awslabs/aws-serverless-java-container
- Spring has another module (which they maintain) that allows you to consume some AWS services: https://spring.io/projects/spring-cloud-aws which supports SQS, SNS, ElastiCache, RDS and CloudFormation
- Lambda Execution Context explained. https://docs.aws.amazon.com/lambda/latest/dg/running-lambda-code.html And here’s an AWS X-Ray link that describes how to see how long your Lambda function spends initializing and running your handler function. https://docs.aws.amazon.com/lambda/latest/dg/lambda-x-ray.html
- Understanding container reuse in Lambda https://aws.amazon.com/blogs/compute/container-reuse-in-lambda/
- Best practices for Lambda functions, especially Java-based ones. https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html and https://docs.aws.amazon.com/lambda/latest/dg/java-programming-model.html
- Lambda cold start comparison https://medium.com/@nathan.malishev/lambda-cold-starts-language-comparison-%EF%B8%8F-a4f4b5f16a62 and this also links to the cheekily titled I’m afraid you’re thinking about AWS Lambda cold starts all wrong https://hackernoon.com/im-afraid-you-re-thinking-about-aws-lambda-cold-starts-all-wrong-7d907f278a4f
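The execution context and container reuse links above come down to one pattern: do expensive setup once, outside the handler, so that warm invocations can reuse it. Here is a minimal sketch of that pattern, written in Python for brevity even though this section is about Java. The config contents and the returned fields are hypothetical, and the two calls at the bottom only simulate invocations locally; they are not how Lambda itself drives the handler.

```python
import time

# Module-level code runs once per execution context (the "cold start").
_init_started = time.perf_counter()
_CONFIG = {"table_name": "example-table"}  # stand-in for expensive setup
_INIT_SECONDS = time.perf_counter() - _init_started

_invocations = 0  # survives between invocations while the context is reused


def handler(event, context):
    """Handler that reuses module-level state instead of rebuilding it."""
    global _invocations
    _invocations += 1
    return {
        "cold_start": _invocations == 1,
        "invocations_in_this_context": _invocations,
        "init_seconds": _INIT_SECONDS,
        "table_name": _CONFIG["table_name"],
    }


# Local simulation of two invocations landing on the same context.
first = handler({}, None)
second = handler({}, None)
print(first["cold_start"], second["cold_start"])
```

In Java the equivalent is to create clients and load configuration in static initializers or the handler class constructor rather than inside the handler method.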
Self paced Learning and Building
- AWS Certification roadmap https://aws.amazon.com/certification/ Check out the learning paths link at the bottom of the page.
- Upgrade your resume with the AWS Certified Big Data — Specialty Certification https://aws.amazon.com/blogs/big-data/upgrade-your-resume-with-the-aws-certified-big-data-specialty-certification/
- Read the service FAQ pages, http://aws.amazon.com/faqs/, and documentation for each of the services. Just search for AWS + service name + documentation in any search engine. You can keep the documentation as PDF, HTML online or even in your Kindle. You can also git clone the documentation for most services.
- Find and build interesting AWS and partner solutions you find in the AWS Blog https://aws.amazon.com/blogs/ . Any post you find with a yellow launch button will build that solution using CloudFormation.
- AWS free digital training is mostly 100 level but we also have over 40 hours of Machine Learning training available for free. You can search by topic, role or level. https://www.aws.training/LearningLibrary?src=courses You’ll find specialist deep dives from level 100 through 300 like this video describing the differences between NACLs and Security groups. https://www.aws.training/Details/Video?id=16486 NOTE: You’ll need to enroll and allow popups in your browser.
- You can also take AWS Qwiklabs Labs for free at https://aws.amazon.com/training/self-paced-labs/
- Get a sandbox or personal account. There are free tiers for many services. https://aws.amazon.com/free/
- Visit http://run.qwiklabs.com and complete quests and labs. These enhance your familiarity with AWS services without you having to use your own account. Some labs are free. Others will require you to redeem Qwiklab credits. Reach out to your training manager or AWS account manager. Also check out the exam guides for SA, SysOps and Advanced Networking https://www.amazon.com/Certified-Advanced-Networking-Official-Study/dp/1119439833/ref=sr_1_1?s=books&ie=UTF8&qid=1519925473&sr=1-1&keywords=advanced+networking
- Search GitHub, https://github.com/aws , and the AWS blogs, https://aws.amazon.com/blogs/ , for solutions that interest you. Look for posts with a launch button. These will build a complete environment using CloudFormation. Retrieve the CloudFormation templates either from the built environment in your account or from GitHub. You can reverse engineer or use these templates as scaffolds for your own use.
- Visit Stack Overflow and the AWS discussion forum to pose questions or to contribute to answers about AWS.
- You can also take a number of AWS MOOCs (Massive Open Online Courses) on edX and Coursera.
- There are many other self paced labs and solutions you can build on AWS. Try:
- Build a Serverless Web Application https://aws.amazon.com/getting-started/projects/build-serverless-web-app-lambda-apigateway-s3-dynamodb-cognito/
- How about the AWS Developer Center https://aws.amazon.com/developer/ where you can build the Mythical Mysfits app in your choice of programming language.
- The AWS Podcast has a monthly update which is a great way to keep up with the latest changes, releases and interviews with domain experts https://aws.amazon.com/podcasts/aws-podcast/
- AWS has released a number of webinars and now has a monthly cadence https://aws.amazon.com/about-aws/events/monthlywebinarseries/
- AWS TechChat is another AWS podcast https://aws.amazon.com/podcasts/aws-techchat/
- AWS Answers is now available to the public. It contains some interesting links. https://aws.amazon.com/answers/
- Get to know your AWS Solutions Architects (SAs) and your Technical Account Manager (TAM). The SAs help you to architect and understand best practice. The TAMs provide support for your applications running on AWS. They can help you prepare for major events like testing and scaling, and they can help you troubleshoot by providing visibility into AWS infrastructure metrics. https://aws.amazon.com/premiumsupport/faqs/
- AWS Glossary contains service names and nomenclature https://docs.aws.amazon.com/general/latest/gr/glos-chap.html
- Now go build stuff…