AWS re:Invent 2020: Deep dive on best practices for Amazon Redshift

AWS Events

3 years ago

20,657 views


Comments:

Punit Kumar - 27.03.2023 05:00

Thanks and nice explanation ..

Siva Sakthi - 21.02.2023 14:40

Amazing presentation and precious content.

Arnold Angel - 23.08.2022 02:54

Wow! I watched it twice!

Obira Daniel - 28.07.2022 22:59

Thank you Patel, very good technical guidelines well summarized and straight to the point, Thank you very much.

sanooosai - 22.03.2022 21:36

brilliant

AnilK Ak - 14.03.2022 06:47

Please let me know the best practices / ways to implement a MERGE operation in Redshift (instead of following a multi-step process). Thanks.

Marc Allen - 31.10.2021 06:19

Thanks, Great job. Not all Re-Invent talks are as technical as this.

Scttrbrn - 08.10.2021 09:03

very good presentation - thank you

Jose Borja - 29.09.2021 16:52

The slides make reference to ELT rather than ETL... ELT is for "schema on read" whereas ETL is for "schema on write" solutions. ELT is more suitable for file-system-based data lakes (object storage in AWS S3), while ETL and ELTL are suitable for relational database data warehouse environments (Redshift). ETL is meant to be done by a separate middleware service between the source and the Redshift data warehouse, whereas ELTL is meant to be done entirely in the Redshift data warehouse. ELTL is the preferred, modern method of working in an MPP DW environment.

Kevin C - 08.09.2021 03:30

this is cool

sukul mahadik - 29.05.2021 12:57

Notes from this session (Part 2)

Important Terminologies and Concepts:

- Compression:
§ When we ingest raw data into Redshift, compression typically gives 3x to 4x reduction. This lets us store more data in the cluster, saves cost by reducing the storage footprint, and improves query performance by decreasing I/O.
§ CREATE TABLE assigns default encodings when no encodings are specified.
§ The COPY command by default automatically analyzes and compresses data on the first load into an empty table.
§ Redshift now supports an ALTER TABLE statement to change the encoding of an existing column.
§ ANALYZE COMPRESSION is a built-in command that finds the optimal compression for each column. When it runs, it gives a brute-force recommendation on encoding that saves the most storage, but the result might not be the most performant.
§ When we create a table without specifying encodings, Redshift automatically assigns an encoding of AZ64 for numeric, date, and timestamp columns (the most performant compression for these column types) and LZO for varchar and char columns. Each column can have a different encoding, so columns can grow and shrink independently. (See the sketch just below.)
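A minimal DDL sketch of the encodings described above (table and column names are hypothetical):

    -- Hypothetical table; AZ64 for the numeric/date columns, LZO for the varchar column
    CREATE TABLE sales_example (
        sale_id    BIGINT      ENCODE az64,
        sale_date  DATE        ENCODE az64,
        channel    VARCHAR(32) ENCODE lzo
    );

    -- Built-in utility that recommends the optimal encoding per column
    ANALYZE COMPRESSION sales_example;

    -- Change the encoding of an existing column in place
    ALTER TABLE sales_example ALTER COLUMN channel ENCODE zstd;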

- Blocks, Zone maps and data sorting

§ Redshift uses very large, 1 MB immutable blocks; with strong compression a single block can store close to millions of values.
§ Blocks are individually encoded with one of the 13 encodings.
§ Zone maps are in-memory metadata that store the min and max values per block. All blocks automatically have zone maps.
§ The main purpose of zone maps is to tell the optimizer, when a query runs, which blocks contain the relevant data. This lets Redshift prune blocks on disk that contain no data for a given query, reducing I/O.
§ Data sorting is how the table is physically sorted on disk. When we define a table, we can define a sort key; the main candidates are columns frequently filtered on in our queries, with a low-cardinality column placed first.
§ Sorting makes queries run faster by increasing the effectiveness of the zone maps and reducing I/O. Range-restricted scans can prune blocks by leveraging the zone maps.
§ A sort key can have one or more columns, and candidates are chosen based on query patterns, business requirements, and the data profile. (See the sketch just below.)
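A small sketch of a sort key and a range-restricted scan (all names are hypothetical):

    -- Compound sort key: low-cardinality region first, then the date column
    CREATE TABLE web_events (
        region     VARCHAR(16),
        event_date DATE,
        user_id    BIGINT,
        page_url   VARCHAR(256)
    )
    SORTKEY (region, event_date);

    -- Zone maps let Redshift skip blocks whose min/max for event_date fall outside this range
    SELECT COUNT(*)
    FROM web_events
    WHERE event_date BETWEEN '2020-12-01' AND '2020-12-07';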

- Data distribution:
§ The Data distribution style is a table property that dictates how the data will be distributed throughout the cluster.
§ The goals of data distribution are to get each and every compute node working in parallel and, from a join performance perspective, to co-locate joins, which reduces the amount of data movement.
§ Data distribution styles in Redshift (see the sketch after this list):
□ Key: Redshift hashes the key value, applies a modulo, and distributes the rows across the compute nodes and slices. The same values always go to the same location (slice).
□ ALL: Redshift puts a copy of the table on each and every compute node (on the 1st slice). Smaller dimension tables, with fewer than 3 million records, are good candidates for this style.
□ Even: Redshift distributes the data round-robin across the compute nodes.
□ Auto: Redshift automatically handles the distribution style for you, based on its recommendations and your workload patterns.
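A sketch of the explicit styles plus AUTO (table and column names are hypothetical):

    -- KEY distribution on a high-cardinality join column
    CREATE TABLE orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  DATE,
        amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY DISTKEY (customer_id);

    -- Small dimension replicated to every compute node
    CREATE TABLE dim_country (
        country_code CHAR(2),
        country_name VARCHAR(64)
    )
    DISTSTYLE ALL;

    -- Round-robin distribution
    CREATE TABLE audit_log (log_id BIGINT, message VARCHAR(512)) DISTSTYLE EVEN;

    -- Let Redshift choose and adjust the style over time
    CREATE TABLE clickstream (click_id BIGINT, url VARCHAR(256)) DISTSTYLE AUTO;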

- Materialized views:
§ Materialized views are supported in Redshift. Pre-computed results are persisted in the materialized view, so we can use them to physicalize very complex queries.
§ These views can be created over a single base table or multiple base tables. They can include projections, filtering, inner joins across multiple base tables, and aggregates (GROUP BY, HAVING).
§ When the underlying base tables change incrementally, the materialized view can be refreshed incrementally.
§ Redshift also supports automatic query rewrite and auto refresh for materialized views.
§ A materialized view can have a different distribution style and sort key than its base tables. (See the sketch just below.)
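A short sketch, reusing the hypothetical orders table from the distribution example:

    -- Pre-computed aggregate with auto refresh enabled
    CREATE MATERIALIZED VIEW daily_revenue
    AUTO REFRESH YES
    AS
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date;

    -- Manual refresh (incremental where possible) after base-table changes
    REFRESH MATERIALIZED VIEW daily_revenue;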


Best Practices: Table design summary

- Use the default auto table optimization.
- Make use of materialized views to physicalize complex queries.
- Add compression to columns. Use AZ64 where possible. For (var)char columns use LZO.
- Add sort keys on columns that are frequently filtered on. First column in sort key should have low cardinality.
- When using a Key distribution style for tables, use a column that has high cardinality. This will help spread the data across the compute nodes.
- Co-locate large tables using DISTSTYLE KEY using high cardinality column.
- Avoid distributing on temporal columns (even if they may look like good candidates for distribution keys); with date filters, only a few compute nodes may end up doing the work. (A consolidated DDL sketch follows this list.)
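The summary rolled into one hypothetical DDL sketch:

    -- High-cardinality DISTKEY, low-cardinality leading sort column,
    -- AZ64 for numeric/date columns and LZO for the varchar column
    CREATE TABLE fact_sales (
        customer_id BIGINT        ENCODE az64,
        region      VARCHAR(16)   ENCODE lzo,
        sale_date   DATE          ENCODE az64,
        amount      DECIMAL(12,2) ENCODE az64
    )
    DISTSTYLE KEY DISTKEY (customer_id)
    SORTKEY (region, sale_date);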

Data Ingestion:

- The most optimal way to ingest data into Redshift is the COPY command. Say we have an RA3.16XL compute node with 16 slices per node. If we have a single big file, only one slice performs the load, which hurts performance. If the file is instead split into 16 smaller files, all the slices work in parallel to maximize ingestion performance. So, as a best practice, the number of files should be a multiple of the number of slices. The COPY command continues to scale linearly as we add nodes to the cluster.

- It's recommended to use delimited files with a gzip-compressed size between 1 MB and 1 GB. (See the COPY sketch just below.)
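A hedged sketch of such a load (bucket, prefix, table, and IAM role are hypothetical; the key prefix matches all of the split, gzipped files):

    COPY sales_staging
    FROM 's3://example-bucket/sales/2020-12-01/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    DELIMITER '|'
    GZIP;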

- An alternate way to ingest data into Redshift is Redshift Spectrum. We can use INSERT INTO ... SELECT syntax to fetch data from external Amazon S3 tables. In the SELECT query we can aggregate, filter, project, and manipulate the incoming data. This also gives us the option to load data from several file formats: Amazon ION, Grok, Parquet, ORC, RCFile, and SequenceFile.

- Using the Redshift Spectrum approach, we can do away with ELT and save cluster resources for querying and reporting.
Filtering and aggregating the incoming data can improve performance over COPY.

- When using Redshift Spectrum, we are charged $5 per TB of data scanned (compressed files reduce the bytes scanned). (See the sketch just below.)
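A sketch of the Spectrum path (the external schema, Glue database, IAM role, and table names are hypothetical):

    -- External schema backed by the AWS Glue Data Catalog
    CREATE EXTERNAL SCHEMA spectrum_sales
    FROM DATA CATALOG
    DATABASE 'sales_lake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';

    -- Filter and aggregate in S3 before the data lands in the local table
    INSERT INTO sales_local
    SELECT region, sale_date, SUM(amount) AS amount
    FROM spectrum_sales.sales_raw
    WHERE sale_date >= '2020-12-01'
    GROUP BY region, sale_date;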

Deletes and updates in Redshift:

- Redshift is meant for OLAP workloads, not so much for online transaction processing with pinpointed inserts, deletes, or updates.
It's a batch processing engine, optimized for processing massive amounts of data.

- Redshift uses 1 MB immutable blocks. This means that on write the blocks are cloned, which prevents introducing fragmentation.

- Small writes (~1-10 rows) have a cost similar to that of large writes (~100K rows).

- Since blocks are immutable, a delete in Redshift is a logical delete, and an update is a logical delete plus an insert. (AUTO) VACUUM can be used to remove the ghost rows from the tables. (A common merge/upsert sketch follows below.)
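One commonly used merge/upsert pattern (also relevant to the MERGE question in the comments) is to batch changes into a staging table and then delete-and-insert inside a single transaction. A hedged sketch with hypothetical table and key names:

    BEGIN;

    -- Remove target rows that will be replaced by the staged versions
    DELETE FROM target_table
    USING staging_table
    WHERE target_table.id = staging_table.id;

    -- Insert the new and updated rows as one batched write
    INSERT INTO target_table
    SELECT * FROM staging_table;

    COMMIT;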

(AUTO) VACUUM and (AUTO) ANALYZE:

- The VACUUM process can be run manually or automatically in the background.
- The goals of the VACUUM process are to remove rows that are marked as deleted and to globally sort tables. (Note that for a table with a sort key, ingestion operations locally sort the new data and write it into an unsorted region.)
- VACUUM should be run only when necessary. Redshift has support for AUTO VACUUM, which runs in the background; AUTO VACUUM DELETE can be used to reclaim the space.
- Redshift also supports AUTO TABLE SORT, which, based on your query patterns, identifies hot blocks and sorts that data; it is a lightweight version of VACUUM SORT.
- We can use VACUUM BOOST to significantly improve VACUUM performance. However, it is resource intensive and should only be run during off-peak times.
- ANALYZE is the process for collecting table statistics, which are used for optimal query planning. AUTO ANALYZE handles statistics gathering automatically.
- ANALYZE can be run periodically after ingestion on just the columns that are filtered on in WHERE predicates. (See the sketch just below.)
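A few hedged examples of the commands above (table and column names are hypothetical):

    -- Reclaim space from logically deleted (ghost) rows only
    VACUUM DELETE ONLY fact_sales;

    -- Re-sort the table without reclaiming space
    VACUUM SORT ONLY fact_sales;

    -- Full vacuum with BOOST: faster, but resource intensive, so run off-peak
    VACUUM FULL fact_sales BOOST;

    -- Refresh statistics only on the columns used in WHERE predicates
    ANALYZE fact_sales (sale_date, region);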

sukul mahadik - 29.05.2021 12:46

Notes from this session (Part 1)

• AWS Redshift has columnar architecture - Stores data by columns.

• AWS Redshift has MPP (Massively parallel processing) architecture.

• A Redshift cluster can be scaled horizontally from 2 to 128 nodes.

• AWS Redshift is a fully managed service.

• AWS Redshift supports
○ a number of analytical functions, such as window functions
○ a geometric data type for spatial functions
○ native support for JSON, which we can store and analyze using the SUPER data type (in preview at the time)
○ stored procedures using the PL/pgSQL language
○ materialized views
○ user-defined functions integrated with AWS Lambda, which lets us write UDFs in any of the languages supported by Lambda. Using Lambda functions, we can integrate with Protegrity for data tokenization, as well as do data enrichment and integrate with other AWS services.
○ invoking machine learning models from SQL through its integration with Amazon SageMaker Autopilot
○ seamless integration with S3 (serving as our data lake), which allows us to run queries that combine data in Redshift with data in the S3 data lake. This provides a lakehouse architecture. (See the SUPER sketch just below.)
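A tiny sketch of the SUPER data type mentioned above (table, column, and JSON shape are hypothetical; SUPER was in preview at the time of the talk):

    -- Semi-structured JSON stored natively in a SUPER column
    CREATE TABLE event_log (
        event_id BIGINT,
        details  SUPER
    );

    INSERT INTO event_log
    VALUES (1, JSON_PARSE('{"browser": {"name": "Firefox"}, "country": "US"}'));

    -- PartiQL-style navigation into the document
    SELECT event_id, details.browser.name
    FROM event_log
    WHERE details.browser IS NOT NULL;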


Amazon Redshift architecture:

- Leader Node:
§ As end users, our SQL clients or BI tools connect to Redshift using JDBC or ODBC drivers. They connect only to the leader node, which provides the SQL endpoint for us to connect to.
§ There is also support for the Data API, which allows asynchronous calls to Redshift. We can use the Data API from the command line interface or from any language with an AWS SDK (such as Boto3 for Python).
§ The leader node also acts as the query coordinator. It stores metadata, parses the query, compiles the query, and sends the compiled code to the compute nodes.

- Compute Nodes:
§ Compute nodes execute queries in parallel. This is what makes Redshift an MPP, shared-nothing architecture.
§ The parallelism of the compute nodes extends to Amazon S3. When we ingest data from S3 into Redshift using the COPY command, all compute nodes work in parallel; similarly, when unloading data to S3, all compute nodes work in parallel. Given that Amazon Redshift is a fully managed service, it takes incremental backups every eight hours or every 5 GB of data change. These are asynchronous and are streamed in parallel to Redshift-managed S3. For backup and restore as well, the compute nodes work in parallel. (See the UNLOAD sketch just below.)
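A hedged sketch of a parallel unload (bucket, prefix, table, and IAM role are hypothetical):

    -- PARALLEL ON (the default) writes output files in parallel across the slices
    UNLOAD ('SELECT * FROM fact_sales WHERE sale_date >= ''2020-12-01''')
    TO 's3://example-bucket/exports/sales_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    PARALLEL ON
    GZIP;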

- Redshift Spectrum:
§ Redshift Spectrum is an elastic, serverless compute layer that sits between the compute nodes and Amazon S3.
§ All Redshift clusters can leverage Redshift Spectrum, which gives us the ability to query the data sitting in our data lake in any of the open formats: JSON, CSV, Avro, Parquet, or ORC. We can also write queries that combine the data in our data lake with the data local to Redshift.

- RA3 instance type:
§ The RA3 instance type with Redshift managed storage gives us separation of compute and storage. We can scale and size our Redshift cluster based on the compute needs of our workload.
§ We pay separately only for the storage that we consume.
§ Redshift managed storage comprises a local SSD cache and Redshift-managed S3. Based on our query patterns, the most frequently accessed blocks are cached locally, and the movement of data between the tiers is completely transparent to us.
§ One node in the RA3 instance family provides a storage quota of 128 TB. We can scale the cluster from 2 to 128 compute nodes, providing up to 16 PB of compressed storage.
§ When we extend the data warehouse to our S3 data lake, the data size can reach exabytes.

- Advanced query accelerator (AQUA) for Amazon Redshift:
§ With the RA3 instance family, AQUA serves as a hardware caching layer sitting between the managed storage and the compute nodes. This significantly enhances read performance.
§ AQUA provides a new distributed and hardware-accelerated caching layer.
§ AQUA nodes use custom AWS-designed analytics processors to make operations (compression, encryption, filtering, and aggregation) faster than traditional CPUs.
§ AQUA is included with the RA3 instance type at no additional cost.

- Data Sharing:
§ With RA3 cluster types, the data sharing feature provides read workload isolation. For example, we can have one cluster producing the data, which can then be shared with different departments without needing to unload the data from the main cluster and ingest it into the consuming clusters. So no data movement is required, and we can share the data in a secure fashion.
§ This feature allows instant, granular, high-performance data access without data copies or movement.
§ It allows secure and governed collaboration within and across organizations and with external parties.
§ Use cases: cross-group collaboration and sharing, workload isolation and chargeability, data as a service.
§ We can also do bi-directional sharing of data between clusters. (See the sketch just below.)
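A hedged sketch of the producer/consumer flow (the datashare, schema, and namespace GUIDs are hypothetical placeholders):

    -- On the producer cluster
    CREATE DATASHARE sales_share;
    ALTER DATASHARE sales_share ADD SCHEMA sales;
    ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales;
    GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';

    -- On the consumer cluster
    CREATE DATABASE sales_from_share FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-guid>';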

Amazon Redshift Node types:

- RA3 nodes allow separation of compute and storage with Amazon Redshift managed storage.
- RA3 managed storage is made up of SSD disks and Amazon S3.
- There are 3 RA3 node types:
§ RA3 4xlarge - vCPU: 12, Memory: 96 GB, Storage quota per node: 128 TB
§ RA3 16xlarge - vCPU: 48, Memory: 384 GB, Storage quota per node: 128 TB
§ RA3 XLPlus - vCPU: 4, Memory: 32 GB, Storage quota per node: 32 TB
- Dense Compute - DC2 - SSD backed
- Dense Storage - DS2 - Magnetic disks.

Amazon Redshift Advisor:
- Amazon Redshift Advisor is available in our Amazon Redshift Console.
- It runs daily, scanning our operational metadata and statistics through the lens of best practices, and provides tailored, high-impact recommendations to optimize our Redshift cluster for performance and cost savings.
- Types of Advisor recommendations include:
§ Data ingest: Speed up COPY (compression), skip compression analysis during COPY, split Amazon S3 objects loaded by COPY.
§ Query Tuning : Reallocate WLM memory, Enable SQA
§ Table optimization: Alter distribution keys on tables, Alter sort keys on tables, column encoding.
§ Cost Savings: Delete unused clusters.

Automatic Table optimization:
- This feature continuously scans workload patterns.
- We can enable or disable this feature at the table level; when enabled, it automatically applies the distribution key and the sort key over time to account for changes in the workload.
- We can also see what recommendations were made and get an audit trail of when those changes were applied using the system tables (e.g., svv_alter_table_recommendations logs the recommended changes, and svl_auto_worker_action logs the audit trail of applied changes). (See the sketch just below.)
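A small sketch of opting an existing table in and checking the system views named above (the table name is hypothetical):

    -- Hand distribution and sort key choices over to automatic table optimization
    ALTER TABLE fact_sales ALTER DISTSTYLE AUTO;
    ALTER TABLE fact_sales ALTER SORTKEY AUTO;

    -- Recommendations generated for the workload
    SELECT * FROM svv_alter_table_recommendations;

    -- Audit trail of the changes that were actually applied
    SELECT * FROM svl_auto_worker_action;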

SUMATHI M - 25.05.2021 14:22

Please help me with this query! What will happen if leader node fails

Aditya Verma - 04.04.2021 23:53

How can I apply zone maps in my redshift table?

Vaibhav Bhardwaj - 21.03.2021 21:26

What a nice way of summarizing all important concepts of Redshift.
