
Today, we're happy to announce a new capability for the AWS Glue Data Catalog: generating column-level aggregation statistics for Apache Iceberg tables to accelerate queries. These statistics are used by the cost-based optimizer (CBO) in Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.
Apache Iceberg is an open table format that provides ACID transaction capabilities on your data lakes. It's designed to process large analytics datasets and is efficient even for small row-level operations. It also enables useful features such as time travel, schema evolution, hidden partitioning, and more.
AWS has invested in service integrations with Iceberg to enable Iceberg workloads based on customer feedback. One example is the AWS Glue Data Catalog. The Data Catalog is a centralized repository that stores metadata about your organization's datasets, making the data visible, searchable, and queryable for users. The Data Catalog supports Iceberg tables and tracks each table's current metadata. It also enables automatic compaction of the individual small files produced by each transactional write into a few large files for faster read and scan operations.
In 2023, the Data Catalog announced support for column-level statistics for non-Iceberg tables. That feature collects table statistics used by the query engine's CBO. Now, the Data Catalog expands this support to Iceberg tables. The Iceberg column statistics that the Data Catalog generates are based on the Puffin spec and stored on Amazon Simple Storage Service (Amazon S3) alongside the other table data, so the various engines that support Iceberg can use and update them.
This post demonstrates how column-level statistics for Iceberg tables work with Redshift Spectrum. We also showcase the performance benefit of Iceberg column statistics with the TPC-DS dataset.
How Iceberg table column statistics work
The AWS Glue Data Catalog generates table column statistics using the Theta Sketch algorithm from Apache DataSketches to estimate the number of distinct values (NDV) and stores them in a Puffin file.
For SQL planners, NDV is an important statistic for optimizing query planning. There are several scenarios where NDV statistics can potentially improve query performance. For example, when joining two tables on a column, the optimizer can use the NDV to estimate the selectivity of the join. If one table has a low NDV for the join column compared to the other table, the optimizer may choose a broadcast join instead of a shuffle join, reducing data movement and improving query performance. Moreover, when more than two tables are joined, the optimizer can estimate the output size of each join and plan an efficient join order. Additionally, NDV can be used for various other optimizations, such as GROUP BY, DISTINCT, and COUNT queries.
However, calculating NDV repeatedly with 100% accuracy requires O(N) space complexity. Instead, Theta Sketch is an efficient algorithm that lets you estimate the NDV in a dataset without storing all the distinct values in memory or on storage. The key idea behind Theta Sketch is to hash the data into a range between 0 and 1, and then retain only a small portion of the hashed values based on a threshold (denoted as θ). By analyzing this small subset of data, the Theta Sketch algorithm can provide an accurate estimate of the NDV in the original dataset.
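As a rough illustration of the technique (not the Data Catalog's actual implementation), the following Python snippet uses the Apache DataSketches library (pip install datasketches) to estimate the NDV of a high-cardinality stream of values:

```python
# Minimal sketch of NDV estimation with Apache DataSketches (pip install datasketches).
# This only illustrates the idea; it is not the Data Catalog's internal code.
import random

from datasketches import update_theta_sketch

sketch = update_theta_sketch()  # default lg_k trades accuracy for sketch size

# Feed 1 million values drawn from 100,000 distinct keys.
for _ in range(1_000_000):
    sketch.update(f"user-{random.randrange(100_000)}")

# The estimate is close to the true NDV (100,000), while the sketch itself stays
# a few kilobytes no matter how many rows are processed.
print(f"estimated NDV: {sketch.get_estimate():,.0f}")
print(f"~95% bounds: [{sketch.get_lower_bound(2):,.0f}, {sketch.get_upper_bound(2):,.0f}]")
```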
Iceberg's Puffin file format is designed to store information such as indexes and statistics as blobs. One of the representative blob types is apache-datasketches-theta-v1, which holds serialized values for estimating the NDV using the Theta Sketch algorithm. Puffin files are linked to a snapshot-id in Iceberg's metadata and are used by the query engine's CBO to optimize query plans.
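For reference, the statistics entries that point to Puffin files live in the table's metadata JSON. The following hedged Python snippet prints those entries; the bucket and key are placeholders for your own table location, and the field names follow the Iceberg table spec:

```python
# Hypothetical illustration of where Puffin statistics appear in an Iceberg
# table's metadata JSON. The bucket, prefix, and file name are placeholders.
import json

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="your-bucket",
    Key="warehouse/tpcdsdbwithstats/call_center/metadata/00002-xxxx.metadata.json",
)
metadata = json.loads(obj["Body"].read())

# Each entry links a Puffin file to the snapshot it was computed for and lists
# the blobs (for example, apache-datasketches-theta-v1) that it contains.
for stats_file in metadata.get("statistics", []):
    print(stats_file["snapshot-id"], stats_file["statistics-path"])
    for blob in stats_file.get("blob-metadata", []):
        print("  ", blob["type"], blob.get("fields"))
```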
Leverage Iceberg column statistics through Amazon Redshift
To demonstrate the performance benefit of this capability, we use the industry-standard TPC-DS 3 TB dataset. We compare the query performance with and without Iceberg column statistics for the tables by running queries in Redshift Spectrum. We have included the queries used in this post, and we recommend trying your own queries by following the same workflow.
The following are the overall steps:
- Run an AWS Glue job that extracts the TPC-DS dataset from the public Amazon S3 bucket and saves it as Iceberg tables in your S3 bucket. The AWS Glue Data Catalog stores these tables' metadata location. Query these tables using Redshift Spectrum.
- Generate column statistics: Use the enhanced capabilities of the AWS Glue Data Catalog to generate column statistics for each table. It generates Puffin files storing Theta Sketches.
- Query with Amazon Redshift Spectrum: Evaluate the impact of column statistics on query performance by using Redshift Spectrum to run queries on the dataset.
The following diagram illustrates the architecture.
To try this new capability, we complete the following steps:
- Set up resources with AWS CloudFormation.
- Run an AWS Glue job to create Iceberg tables for the 3 TB TPC-DS dataset in your S3 bucket. The Data Catalog stores these tables' metadata location.
- Run queries on Redshift Spectrum and note the query duration.
- Generate Iceberg column statistics for the Data Catalog tables.
- Run queries on Redshift Spectrum and compare the query duration with the previous run.
- Optionally, schedule AWS Glue column statistics jobs using AWS Lambda and an Amazon EventBridge schedule.
Set up resources with AWS CloudFormation
This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs. Note that this CloudFormation template requires a Region with at least three Availability Zones. The template generates the following resources:
- A virtual private cloud (VPC), public subnet, private subnets, and route tables
- An Amazon Redshift Serverless workgroup and namespace
- An S3 bucket to store the TPC-DS dataset, column statistics, job scripts, and so on
- Data Catalog databases
- An AWS Glue job to extract the TPC-DS dataset from the public S3 bucket and save the data as Iceberg tables in your S3 bucket
- AWS Identity and Access Management (IAM) roles and policies
- A Lambda function and EventBridge schedule to run the AWS Glue column statistics generation on a schedule
To launch the CloudFormation stack, complete the following steps:
- Sign in to the AWS CloudFormation console.
- Choose Launch Stack.
- Choose Next.
- Leave the parameters as default or make appropriate changes based on your requirements, then choose Next.
- Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create.
This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.
Run an AWS Glue job to create Iceberg tables for the 3 TB TPC-DS dataset
When the CloudFormation stack creation is complete, run the AWS Glue job to create Iceberg tables for the TPC-DS dataset. This AWS Glue job extracts the TPC-DS dataset from the public S3 bucket and transforms the data into Iceberg tables. These tables are loaded into your S3 bucket and registered in the Data Catalog.
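The job's exact script is provisioned by the CloudFormation template; the following is only a simplified sketch of the general pattern, assuming an AWS Glue 4.0 job with Iceberg enabled (--datalake-formats iceberg) and placeholder bucket, database, and table names:

```python
# Simplified sketch of a Glue job that writes a source table out as Iceberg,
# assuming an Iceberg-enabled Glue job; not the stack's actual script.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/warehouse/")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Read one TPC-DS table from the public source location (placeholder path) ...
src = spark.read.parquet("s3://public-tpcds-bucket/3tb/call_center/")

# ... and write it out as an Iceberg table registered in the Data Catalog.
(
    src.writeTo("glue_catalog.tpcdsdbwithstats.call_center")
    .using("iceberg")
    .createOrReplace()
)
```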
To run the AWS Glue job, complete the following steps:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Choose InitialDataLoadJob-<your-stack-name>.
- Choose Run.
This AWS Glue job can take around 30 minutes to complete. The process is complete when the job processing status shows as Succeeded.
The AWS Glue job creates tables storing the TPC-DS dataset in two identical databases: tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats will have no generated statistics, and we use them as a reference. We generate statistics on the tables in tpcdsdbwithstats. Confirm the creation of these two databases and the underlying tables on the AWS Glue console. At this point, the databases hold the same data and no statistics have been generated on the tables.
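If you prefer the AWS SDK to the console, a quick check such as the following (a hedged Boto3 example using the database names created above) lists the registered tables:

```python
# Optional: confirm the registered tables programmatically instead of in the
# console. Assumes default AWS credentials and the databases created above.
import boto3

glue = boto3.client("glue")

for database in ("tpcdsdbnostats", "tpcdsdbwithstats"):
    paginator = glue.get_paginator("get_tables")
    tables = [
        table["Name"]
        for page in paginator.paginate(DatabaseName=database)
        for table in page["TableList"]
    ]
    print(f"{database}: {len(tables)} tables -> {sorted(tables)}")
```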
Run queries on Redshift Spectrum without statistics
In the previous steps, you set up a Redshift Serverless workgroup with the given RPU capacity (128 by default), prepared the TPC-DS 3 TB dataset in your S3 bucket, and created Iceberg tables (which currently have no statistics).
To run your queries in Amazon Redshift, complete the following steps:
- Download the Amazon Redshift queries.
- In the Redshift query editor v2, run the queries listed in the Redshift Query for tables without column statistics section of the downloaded file redshift-tpcds-sample.sql.
- Note the query runtime of each query; a programmatic way to capture the runtime is sketched after this list.
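If you want to capture runtimes outside the query editor, a hedged alternative is the Redshift Data API. The workgroup name, database, and SQL below are placeholders; the actual TPC-DS queries come from redshift-tpcds-sample.sql:

```python
# Hedged example: run one query through the Redshift Data API and read back
# its duration. Workgroup, database, schema, and SQL are placeholders.
import time

import boto3

client = boto3.client("redshift-data")

resp = client.execute_statement(
    WorkgroupName="your-redshift-serverless-workgroup",  # placeholder
    Database="dev",                                       # placeholder
    Sql="SELECT count(*) FROM your_external_schema.store_sales;",  # placeholder
)

# Poll until the statement finishes, then report its duration.
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

# Duration is reported in nanoseconds.
print(desc["Status"], f'{desc.get("Duration", 0) / 1e9:.1f} seconds')
```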
Generate Iceberg column statistics
To generate statistics on the Data Catalog tables, complete the following steps:
- On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
- Choose the tpcdsdbwithstats database to view all available tables.
- Select any of these tables (for example, call_center).
- Go to Column statistics – new and choose Generate statistics.
- Keep the default options:
  - For Choose columns, select Table (All columns).
  - For Row sampling options, select All rows.
  - For IAM role, choose AWSGluestats-blog-<your-stack-name>.
- Choose Generate statistics.
You will be able to see the status of the statistics generation run, as shown in the following screenshot.
After you generate the Iceberg table column statistics, you should be able to see detailed column statistics for that table.
After the statistics generation, you will find an <id>.stat file in the AWS Glue table's underlying data location in Amazon S3. This is a Puffin file that stores the Theta Sketch data structures. Query engines can use this Theta Sketch data to efficiently estimate the NDV when working with the table, which helps optimize query performance.
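As an optional check, you can list those Puffin files directly; the bucket and prefix below are placeholders for your own table location:

```python
# Optional check: list the Puffin statistics files written under the table's
# data location. Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(
    Bucket="your-bucket", Prefix="warehouse/tpcdsdbwithstats/call_center/"
):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".stat"):
            print(obj["Key"], obj["Size"], "bytes")
```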
Repeat the previous steps to generate statistics for all the tables, such as catalog_sales, catalog_returns, warehouse, item, date_dim, store_sales, customer, customer_address, web_sales, time_dim, ship_mode, web_site, and web_returns. Alternatively, you can manually run the Lambda function that instructs AWS Glue to generate column statistics for all tables. We discuss the details of this function later in this post.
After you generate statistics for all the tables, you can assess the query performance for each query.
Run queries on Redshift Spectrum with statistics
In the previous steps, you set up a Redshift Serverless workgroup with the given RPU capacity (128 by default), prepared the TPC-DS 3 TB dataset in your S3 bucket, and created Iceberg tables with column statistics.
To run the provided queries using Redshift Spectrum on the tables with statistics, complete the following steps:
- In the Redshift query editor v2, run the queries listed in the Redshift Query for tables with column statistics section of the downloaded file redshift-tpcds-sample.sql.
- Note the query runtime of each query.
With Redshift Serverless at 128 RPU and the TPC-DS 3 TB dataset, we performed sample runs for 10 selected TPC-DS queries where NDV information was expected to be helpful. We ran each query 10 times. The results shown in the following table are sorted by the percentage of the performance improvement for the queries with column statistics.
| TPC-DS 3 TB Queries | Without Column Statistics | With Column Statistics | Performance Improvement (%) |
| --- | --- | --- | --- |
| Query 16 | 305.0284 | 51.7807 | 489.1 |
| Query 75 | 398.0643 | 110.8366 | 259.1 |
| Query 78 | 169.8358 | 52.8951 | 221.1 |
| Query 95 | 35.2996 | 11.1047 | 217.9 |
| Query 94 | 160.52 | 57.0321 | 181.5 |
| Query 68 | 14.6517 | 7.4745 | 96 |
| Query 4 | 217.8954 | 121.996 | 78.6 |
| Query 72 | 123.8698 | 76.215 | 62.5 |
| Query 29 | 22.0769 | 14.8697 | 48.5 |
| Query 25 | 43.2164 | 32.8602 | 31.5 |
The results demonstrated clear performance benefits ranging from 31.5% to 489.1%.
To dive deep, let's explore query 16, which showed the greatest performance benefit:
TPC-DS Query 16:
You can compare the difference between the query plans with and without column statistics by running EXPLAIN on the query.
The following screenshot shows the results without column statistics.
The following screenshot shows the results with column statistics.
You can observe some notable differences as a result of using column statistics. At a high level, the overall estimated cost of the query is significantly reduced, from 20633217995813352.00 to 331727324110.36.
The two query plans chose different join strategies.
The following is one line included in the query plan without column statistics:
The following is the corresponding line in the query plan with column statistics:
The query plan for the table without column statistics used DS_DIST_BOTH when joining large tables, whereas the query plan for the table with column statistics chose DS_BCAST_INNER. The join order has also changed based on the column statistics. These join strategy and join order changes are primarily driven by more accurate join cardinality estimates, which are possible with column statistics, and result in a more optimized query plan.
Schedule AWS Glue column statistics runs
Maintaining up-to-date column statistics is crucial for optimal query performance. This section guides you through automating the process of generating Iceberg table column statistics using Lambda and EventBridge Scheduler. This automation keeps your column statistics up to date without manual intervention.
The required Lambda function and EventBridge schedule are already created by the CloudFormation template. The Lambda function is used to invoke the AWS Glue column statistics run. First, complete the following steps to explore how the Lambda function is configured:
- On the Lambda console, choose Functions in the navigation pane.
- Open the function GlueTableStatisticsFunctionv1.
For a clearer understanding of the Lambda function, we recommend reviewing the code in the Code section and inspecting the environment variables under Configuration.
As shown in the following code snippet, the Lambda function invokes the start_column_statistics_task_run API through the AWS SDK for Python (Boto3) library.
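The deployed code ships with the CloudFormation template; the following is only a minimal sketch of what such a handler might look like, assuming the database name and IAM role arrive through environment variables (the variable names here are illustrative, not necessarily those used by the template):

```python
# Illustrative Lambda handler only: it mirrors the behavior described above but
# is not the exact code deployed by the CloudFormation template. Environment
# variable names (DATABASE_NAME, GLUE_STATS_ROLE) are assumptions.
import os

import boto3

glue = boto3.client("glue")


def lambda_handler(event, context):
    database = os.environ["DATABASE_NAME"]
    role = os.environ["GLUE_STATS_ROLE"]

    run_ids = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            # Kick off an asynchronous AWS Glue column statistics task run.
            response = glue.start_column_statistics_task_run(
                DatabaseName=database,
                TableName=table["Name"],
                Role=role,
            )
            run_ids.append(response["ColumnStatisticsTaskRunId"])

    return {"startedTaskRuns": run_ids}
```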
Next, complete the following steps to explore how the EventBridge schedule is configured:
- On the EventBridge console, choose Schedules under Scheduler in the navigation pane.
- Locate the schedule created by the CloudFormation template.
This page is where you manage and configure the schedules for your events. As shown in the following screenshot, the schedule is configured to invoke the Lambda function daily at a specific time (in this case, 08:27 PM UTC). This ensures the AWS Glue column statistics run happens on a regular and predictable basis.
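The stack already provisions this schedule, so there is nothing to create manually. Purely as an illustration, a comparable schedule could be defined through the EventBridge Scheduler API roughly as follows (names, ARNs, and the cron expression are placeholders):

```python
# Illustration only: the CloudFormation template already creates the schedule.
# All names and ARNs below are placeholders, not the stack's actual values.
import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="glue-column-statistics-daily",
    # cron(minute hour day-of-month month day-of-week year) in UTC;
    # this expression fires every day at 20:27 UTC.
    ScheduleExpression="cron(27 20 * * ? *)",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:GlueTableStatisticsFunctionv1",
        "RoleArn": "arn:aws:iam::123456789012:role/your-scheduler-invoke-role",
    },
)
```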
Clean up
When you have finished all the preceding steps, remember to clean up the AWS resources you created with AWS CloudFormation:
- Delete the CloudFormation stack.
- Delete the S3 bucket storing the Iceberg tables for the TPC-DS dataset and the AWS Glue job script.
Conclusion
This post introduced a new feature in the Data Catalog that allows you to create Iceberg table column-level statistics. The Iceberg table stores a Theta Sketch, which can be used to estimate NDV efficiently, in a Puffin file. The Redshift Spectrum CBO can use that information to optimize its query plan, resulting in improved query performance and potential cost savings.
Try out this new feature in the Data Catalog to generate column-level statistics and improve query performance, and let us know your feedback in the comments section. Visit the AWS Glue Data Catalog documentation to learn more.
About the Authors
Sotaro Hikita is a Solutions Architect. He supports customers in a wide range of industries, especially the financial industry, to build better solutions. He is particularly passionate about big data technologies and open source software.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.
Kyle Duong is a Senior Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems.
Kalaiselvi Kamaraj is a Senior Software Development Engineer with Amazon. She has worked on several projects within the Amazon Redshift query processing team and is currently focusing on performance-related projects for Redshift data lakes.
Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.