Managing Tabular Data Made Easy with Amazon S3 Tables

Udaara Jayawardana
6 min read · Jan 11, 2025


Imagine you’ve entered an Amazon warehouse: a big, expansive, seemingly endless warehouse. It’s organised with labeled shelves, rows upon rows of boxes filled with everything you can think of. This warehouse is your data, sitting neatly in Amazon S3 buckets, waiting for you to use it. But, and here’s the catch, finding exactly what you need quickly can feel like trying to find a needle in a haystack.

Enter S3 Tables. It’s as if someone’s taken all those boxes of data, unpacked them, and laid their contents out on a beautifully designed table, organised in perfect rows and columns. Suddenly, querying your data is as simple as doing a Sunday crossword. You ask, and the answer is right there.

S3 Tables combine the strength of S3’s storage with the elegance of structured table data. It’s fast, efficient, and automated. Maintenance? Sorted. Governance? Done. Performance? Unmatched.

But at this point, I’m sure you already have a well-designed data warehouse at your disposal. So why should you care? Because in the world of big data, every second counts, and S3 Tables? They’re shaving those seconds down to milliseconds!

S3 Tables was introduced at AWS re:Invent 2024

Before we go on to S3 Tables, let’s explore the three major cogs inside: tabular data, columnar file formats (like Parquet), and Apache Iceberg.

Tabular data is information structured into rows and columns, similar to a spreadsheet. Each row represents a single record or item (such as a customer or an order), and each column represents a specific piece of that record’s information (such as name, date, or price). It is structured, clean, and easy to read or analyze, making it suitable for tasks such as sorting, searching, and doing calculations.

While tabular data is excellent for managing smaller datasets in tools such as spreadsheets or databases, things become more complex as datasets grow larger. Enter columnar files such as Parquet files, a storage format built for large volumes of tabular data. Unlike typical CSVs and plain text formats, Parquet is columnar, meaning it stores data by columns rather than rows. This structure significantly improves performance for tasks such as querying specific columns while being storage efficient.
For example, consider examining millions of customer records to calculate total sales from specific regions. Instead of reading the full dataset row by row, Parquet allows you to read only the relevant columns, saving time, resources, and money.

But managing many Parquet files can be challenging. As the dataset grows, you might end up with thousands of files spread across your data lake. Querying them efficiently, handling schema changes, or even tracking which file contains the latest data becomes a logistical nightmare. Enter Apache Iceberg. Iceberg is an open table format that aims to bring order to the chaos of maintaining large sets of Parquet (and other columnar) files. Think of Iceberg as the template for organizing your Parquet files into a logical table, with features such as:

  • Schema Evolution: Add or remove columns without rewriting all your data.
  • Time Travel Queries: Access past versions of your data without complex backup systems.
  • Partition Pruning: Automatically optimize queries by narrowing down to only relevant files.

Combine that with Amazon S3’s scalability and automation, and you get Amazon S3 Tables: a seamless integration of Iceberg and S3, purpose-built for high-performance tabular data analytics.

Key Features of Amazon S3 Tables

1. Integrated with Apache Iceberg for Analytics Workload Optimization
Built on Apache Iceberg, Amazon S3 Tables are designed to handle large datasets efficiently. Iceberg makes it easier to query, update, and analyze data by organizing it into structured tables, without requiring complex configuration. Large-scale analytics workloads, where accuracy and speed are crucial, benefit greatly.

2. Performance Enhancements with Faster Queries and Higher Transaction Rates
Amazon S3 Tables are optimized for speed and efficiency. Compared to general-purpose S3 buckets, they offer up to 10x higher transactions per second (TPS) and up to 3x faster query performance. This makes them ideal for analytics workloads, where fast data processing and high transaction throughput are essential.

3. Integrated Compaction and Snapshot Management for Automated Maintenance
Large dataset management often requires ongoing maintenance, including merging files, cleaning up old data, and maintaining records. Amazon S3 Tables handles all of this automatically, ensuring your data stays organized and ready to use without requiring extra effort.
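These maintenance settings can also be tuned through the S3 Tables API. The following is a hedged sketch using boto3; the bucket ARN, namespace, and table name are placeholders, the parameter shapes follow the S3 Tables API reference at the time of writing, and actually calling it requires AWS credentials and a recent boto3 release that includes the `s3tables` service:

```python
# Hypothetical request to tune automatic compaction for one table.
# All identifiers below are placeholders, not real resources.
compaction_request = {
    "tableBucketARN": "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket",
    "namespace": "analytics",
    "name": "daily_sales",
    "type": "icebergCompaction",
    "value": {
        "status": "enabled",
        "settings": {"icebergCompaction": {"targetFileSizeMB": 256}},
    },
}

def enable_compaction():
    import boto3  # deferred import: only needed for the actual API call
    s3tables = boto3.client("s3tables", region_name="us-east-1")
    return s3tables.put_table_maintenance_configuration(**compaction_request)
```

In practice you rarely need to touch this: compaction and snapshot management are on by default, which is the point of the feature.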

4. Table-Level Permissions
Governance is simplified, as permissions can be set at the table level, making it easier to control who has access to what data.

5. Seamless Integration with Analytics Services
Amazon S3 Tables interact directly with leading analytics services, including Amazon EMR, Athena, QuickSight, Data Firehose, Redshift, and Apache Spark. This compatibility lets you query, visualize, and process data with your preferred tools without any additional configuration or complexity.

S3 Tables integration with Amazon Data Tools

Setting Up an Amazon S3 Table Bucket & Creating Tables

At the time of this post, S3 Tables were only available in N. Virginia (us-east-1), Ohio (us-east-2), and Oregon (us-west-2). Keep an eye on this documentation page for updates on available Regions.

Before creating an S3 Table, you must enable integration with AWS analytics services such as Amazon Athena, Amazon Redshift, and Amazon EMR.

Enable S3 Table Integration with Analytics Services

After this integration is enabled, all table buckets in this account and Region will automatically be available in AWS Glue Data Catalog under the catalog named s3tablescatalog.
This integration uses the AWS Glue and AWS Lake Formation services and might incur Glue request and storage costs.

There are two key differences between S3 Table buckets and regular S3 buckets:

  • S3 Table bucket names are not globally unique
    *** Refer to these naming rules
  • S3 Table buckets cannot be deleted through the web console. To delete one, use the AWS CLI, AWS SDKs, or the Amazon S3 REST API.
    *** Older versions of the AWS CLI do not support S3 Tables, so you may need to upgrade first
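A small sketch of both points: a validator reflecting the documented naming rules (3–63 characters; lowercase letters, numbers, and hyphens; must begin and end with a letter or number), and deletion through the API with boto3. The ARN is a placeholder, and the delete call needs credentials and a boto3 version that knows the `s3tables` service:

```python
import re

# Table bucket naming rules as documented: 3-63 characters, lowercase
# letters, numbers, and hyphens only, beginning and ending with a
# letter or number.
NAME_RULE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def is_valid_table_bucket_name(name: str) -> bool:
    return bool(NAME_RULE.match(name))

def delete_table_bucket(arn: str):
    """Table buckets can't be deleted from the console; use the API instead."""
    import boto3  # deferred import: only needed for the actual API call
    s3tables = boto3.client("s3tables", region_name="us-east-1")
    return s3tables.delete_table_bucket(tableBucketARN=arn)
```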

Create Tables in S3 Tables Buckets with EMR

Creating an Amazon S3 Table is straightforward and well-documented in the AWS guides. Follow the official step-by-step instructions provided in the AWS S3 Tables Documentation.

S3 Table Bucket with EMR Created Tables
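For orientation, the Spark-side shape of that setup looks roughly like the following. This is a hedged sketch: the catalog name, namespace, table, and bucket ARN are placeholders, and the catalog classes follow the AWS documentation for the S3 Tables Iceberg catalog. Actually running `create_table` requires a Spark cluster (e.g. EMR) with the S3 Tables catalog package on the classpath:

```python
# Placeholder ARN for the table bucket created earlier
TABLE_BUCKET_ARN = "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket"

# Spark session configuration pointing an Iceberg catalog at the table bucket
spark_conf = {
    "spark.sql.catalog.s3tablesbucket": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.s3tablesbucket.catalog-impl":
        "software.amazon.s3tables.iceberg.S3TablesCatalog",
    "spark.sql.catalog.s3tablesbucket.warehouse": TABLE_BUCKET_ARN,
}

def create_table(spark):
    """Run the DDL against a SparkSession built with spark_conf applied."""
    spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.analytics")
    spark.sql(
        "CREATE TABLE IF NOT EXISTS s3tablesbucket.analytics.daily_sales "
        "(sale_date date, region string, sales double) USING iceberg"
    )
```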

Visualize with Athena & QuickSight

I stored the S3 Tables data as Apache Parquet files. Unfortunately, AWS QuickSight does not natively support importing Parquet files directly from S3, so we’ll need Amazon Athena to act as an intermediary. Athena queries the Parquet files stored in S3, and QuickSight then connects to Athena to access and visualize the results.

Load Data to QuickSight via Athena

Then we can provide the custom SQL queries that we need for creating a QuickSight Dashboard.

Custom SQL Queries for the QuickSight Dashboard
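As an illustration, such a query could also be submitted to Athena programmatically with boto3. The catalog name follows the Glue integration described earlier, but the database, table, and results bucket below are hypothetical:

```python
# A custom aggregation of the kind a QuickSight dashboard might use
QUERY = """
SELECT region, SUM(sales) AS total_sales
FROM daily_sales
GROUP BY region
ORDER BY total_sales DESC
"""

def run_athena_query():
    import boto3  # deferred import: only needed for the actual API call
    athena = boto3.client("athena", region_name="us-east-1")
    return athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Catalog": "s3tablescatalog", "Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
```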

To Sum It Up

Amazon S3 Tables are a huge step forward for organizing and analyzing data lakes. With real-time data ingestion through tools like Amazon Data Firehose, S3 Tables simplify data management, enhance performance, and integrate effortlessly with AWS analytics services such as Athena and QuickSight. By merging the scalability of S3 with Apache Iceberg’s powerful features, S3 Tables offer a streamlined and efficient approach to data analytics.

If you’re working with large datasets in columnar formats, S3 Tables provide a reliable, efficient solution for modern analytics workloads. From automated maintenance to rapid queries and compatibility with a variety of analytics tools, they streamline the whole data lifecycle, allowing you to concentrate on insights rather than infrastructure.

As businesses become increasingly reliant on data for decision-making, solutions like S3 Tables will be critical for streamlining complex data pipelines and providing teams with real-time, actionable insights. This is more than simply storage; it’s the future of scalable, efficient, and intelligent data management.


Written by Udaara Jayawardana

A DevOps Engineer who specialises in the design and implementation of AWS and Containerized Infrastructure.
