Minimal Rust Iceberg S3 Example

This repository demonstrates how to use Apache Iceberg with Rust and AWS S3 storage. It includes examples of writing data to and reading data from Iceberg tables using the REST catalog.

Prerequisites

  • Rust (latest stable version)
  • Docker
  • AWS Account with S3 access
  • Git

Setup

  1. Clone the repository:
git clone https://github.com/definite-app/minimal-rust-ice-s3.git
cd minimal-rust-ice-s3
  2. Set up your environment variables:
# Copy the example .env file
cp .env.example .env

# Edit .env with your AWS credentials and S3 configuration
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=your_aws_region
S3_BUCKET=your_bucket_name
S3_PATH=your_path
  3. Start the REST catalog server:
docker compose up -d

Usage

Writing Data

The project includes an example that writes sample data to an Iceberg table:

cargo run --bin mjr

This will:

  • Create a namespace if it doesn't exist
  • Create a table with a simple schema (id: Int, name: String)
  • Write sample records to the table
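
For orientation, the core of the write path looks roughly like the sketch below. It assumes the Apache iceberg and iceberg-catalog-rest crates; the catalog and builder method names are taken from those crates and may differ between versions, so treat this as an illustration of the flow rather than a copy of src/main.rs.

use std::collections::HashMap;

use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
use iceberg::{Catalog, NamespaceIdent, TableCreation};
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the REST catalog started by `docker compose up -d`.
    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Create the namespace if it doesn't exist yet.
    let namespace = NamespaceIdent::new("default".to_string());
    if !catalog.namespace_exists(&namespace).await? {
        catalog.create_namespace(&namespace, HashMap::new()).await?;
    }

    // Simple schema used by the example: id (Int), name (String).
    let schema = Schema::builder()
        .with_fields(vec![
            NestedField::required(1, "id", Type::Primitive(PrimitiveType::Int)).into(),
            NestedField::required(2, "name", Type::Primitive(PrimitiveType::String)).into(),
        ])
        .build()?;

    // Register the table with the catalog; sample records are then written
    // to Parquet data files and committed in a transaction.
    let creation = TableCreation::builder()
        .name("my_table".to_string())
        .schema(schema)
        .build();
    catalog.create_table(&namespace, creation).await?;
    Ok(())
}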

Reading Data

To read the data back from the table:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_table

This will:

  • Connect to the same table
  • Execute a SELECT query
  • Display the results
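
If you want to read without going through SQL, the iceberg crate can also scan a table straight into Arrow record batches. The following is a minimal sketch under the same crate assumptions as above; the scan and to_arrow method names may vary by crate version.

use futures::TryStreamExt;
use iceberg::{Catalog, TableIdent};
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Load the table created by the write example.
    let table = catalog
        .load_table(&TableIdent::from_strs(["default", "my_table"])?)
        .await?;

    // Scan all columns and collect the resulting Arrow record batches.
    let batches: Vec<_> = table
        .scan()
        .select_all()
        .build()?
        .to_arrow()
        .await?
        .try_collect()
        .await?;

    for batch in &batches {
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}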

Listing Tables

To list available tables in a namespace:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin list_tables

This will:

  • Connect to the catalog
  • List tables in the default namespace
  • Display sample data from the default table

Listing All Tables Across Namespaces

To list all tables across all namespaces using the REST catalog API:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin list_catalog_tables

This will:

  • Connect to the REST catalog
  • List all namespaces
  • For each namespace, list all tables
  • Display the results
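
The listing binaries only need the catalog API, not table scans. A minimal sketch, again assuming the iceberg and iceberg-catalog-rest crates:

use iceberg::Catalog;
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Walk every namespace and print the tables it contains.
    for namespace in catalog.list_namespaces(None).await? {
        println!("namespace: {:?}", namespace);
        for table in catalog.list_tables(&namespace).await? {
            println!("  table: {:?}", table);
        }
    }
    Ok(())
}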

Reading Custom Tables

To read data from a specific table with a limit:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_custom_table <namespace> <table_name> <limit>

For example:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_custom_table tpch customer 10

This will:

  • Connect to the specified table in the given namespace
  • Execute a SELECT query with the specified limit
  • Display the results
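
The only extra work compared to read_table is argument handling. A small sketch of the expected invocation, using only the standard library (names are illustrative):

// Expected invocation: read_custom_table <namespace> <table_name> <limit>
fn parse_args() -> Result<(String, String, usize), Box<dyn std::error::Error>> {
    let mut args = std::env::args().skip(1);
    let namespace = args.next().ok_or("missing <namespace>")?;
    let table_name = args.next().ok_or("missing <table_name>")?;
    let limit: usize = args.next().ok_or("missing <limit>")?.parse()?;
    Ok((namespace, table_name, limit))
}

The resolved <namespace>.<table_name> pair is then loaded from the catalog and read as in the earlier example, with the output truncated to limit rows.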

Running Predefined Queries

To run a predefined query that joins tables:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_query

This will:

  • Execute a default query that joins customer and nation tables
  • Display the results

Running Custom SQL Queries

To run any custom SQL query against your Iceberg tables:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "<SQL_QUERY>"

For example:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT l_shipmode, COUNT(*) as count FROM my_catalog.tpch.lineitem GROUP BY l_shipmode ORDER BY count DESC"

This will:

  • Execute your custom SQL query against the Iceberg tables
  • Display the results
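
Under the hood, running arbitrary SQL means exposing the Iceberg catalog to a query engine. A rough sketch using DataFusion with the iceberg-datafusion bridge is shown below; the IcebergCatalogProvider name and constructor are assumptions based on that crate, not taken from this repository. Registering the provider as my_catalog is also why the example queries reference tables as my_catalog.<namespace>.<table>.

use std::sync::Arc;

use datafusion::prelude::SessionContext;
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};
use iceberg_datafusion::IcebergCatalogProvider;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The SQL statement arrives as the first command-line argument.
    let sql = std::env::args()
        .nth(1)
        .ok_or("usage: run_custom_query \"<SQL_QUERY>\"")?;

    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Expose the Iceberg REST catalog to DataFusion under the name "my_catalog".
    let provider = IcebergCatalogProvider::try_new(Arc::new(catalog)).await?;
    let ctx = SessionContext::new();
    ctx.register_catalog("my_catalog", Arc::new(provider));

    // Run the query and print the result set.
    ctx.sql(&sql).await?.show().await?;
    Ok(())
}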

Working with Partitioned Data

Writing Partitioned Data

To create a table with partitioned data:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin write_partitioned_data

This will:

  • Create a namespace called my_namespace
  • Create a table called partitioned_table with fields: id (Int), name (String), category (String)
  • Partition the data by the category field using Identity transform
  • Write sample data with categories A, B, and C
  • Store the data in S3 with appropriate partitioning
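
Declaring the partition spec is the interesting part of this binary. A hedged sketch, assuming the iceberg crate's UnboundPartitionSpec and Transform types; exact builder signatures may differ by crate version.

use iceberg::spec::{NestedField, PrimitiveType, Schema, Transform, Type, UnboundPartitionSpec};
use iceberg::TableCreation;

fn build_creation() -> Result<TableCreation, Box<dyn std::error::Error>> {
    // Schema: id (Int), name (String), category (String).
    let schema = Schema::builder()
        .with_fields(vec![
            NestedField::required(1, "id", Type::Primitive(PrimitiveType::Int)).into(),
            NestedField::required(2, "name", Type::Primitive(PrimitiveType::String)).into(),
            NestedField::required(3, "category", Type::Primitive(PrimitiveType::String)).into(),
        ])
        .build()?;

    // Partition by the raw value of `category` (Identity transform),
    // so each distinct category (A, B, C) becomes its own partition.
    let partition_spec = UnboundPartitionSpec::builder()
        .add_partition_field(3, "category", Transform::Identity)?
        .build();

    Ok(TableCreation::builder()
        .name("partitioned_table".to_string())
        .schema(schema)
        .partition_spec(partition_spec)
        .build())
}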

Reading Partitioned Data

To read data from the partitioned table:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_partitioned_data

This will:

  • Connect to the partitioned table
  • Read all data from the table
  • Read data from specific partitions (categories A, B, and C)
  • Perform aggregations (count by category)
  • Display the results

Writing Output Data to S3

To process Parquet files from the output directory and write them to S3 as Iceberg tables:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin write_output_data_to_s3

This will:

  • Create a namespace called output_data
  • Process lineitem files (months 1-5) and create a table partitioned by month
  • Process orders files (years 2020-2024) and create a table partitioned by year
  • Write the data to S3 with appropriate partitioning
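
Reading the local Parquet files is plain parquet/Arrow work before any Iceberg writes happen. A small sketch using the parquet crate's Arrow reader (the file path is hypothetical; the real binary walks the output directory):

use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn read_parquet(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Open one local Parquet file and stream it out as Arrow record batches,
    // ready to be appended to an Iceberg table with the right partition value.
    let file = File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    for batch in reader {
        let batch = batch?;
        println!("read {} rows from {}", batch.num_rows(), path);
    }
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    read_parquet("output/lineitem_month_1.parquet")
}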

Example Queries

Here are some example complex queries you can run:

Analyze shipping modes in lineitem table:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT l_shipmode, COUNT(*) as count, SUM(l_quantity) as total_quantity, AVG(l_extendedprice) as avg_price FROM my_catalog.tpch.lineitem GROUP BY l_shipmode ORDER BY count DESC"

Analyze order data by region:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT r.r_name as region, COUNT(DISTINCT c.c_custkey) as customer_count, COUNT(o.o_orderkey) as order_count, SUM(o.o_totalprice) as total_sales FROM my_catalog.tpch.orders o JOIN my_catalog.tpch.customer c ON o.o_custkey = c.c_custkey JOIN my_catalog.tpch.nation n ON c.c_nationkey = n.n_nationkey JOIN my_catalog.tpch.region r ON n.n_regionkey = r.r_regionkey GROUP BY r.r_name ORDER BY total_sales DESC"

Find top customers by total purchase amount:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT c.c_name, n.n_name as nation, r.r_name as region, COUNT(o.o_orderkey) as order_count, SUM(o.o_totalprice) as total_spent FROM my_catalog.tpch.customer c JOIN my_catalog.tpch.orders o ON c.c_custkey = o.o_custkey JOIN my_catalog.tpch.nation n ON c.c_nationkey = n.n_nationkey JOIN my_catalog.tpch.region r ON n.n_regionkey = r.r_regionkey GROUP BY c.c_name, n.n_name, r.r_name ORDER BY total_spent DESC LIMIT 10"

Query partitioned data:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT category, COUNT(*) FROM my_catalog.my_namespace.partitioned_table GROUP BY category"

Query lineitem data by month:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT month, COUNT(*) FROM my_catalog.output_data.lineitem_monthly GROUP BY month ORDER BY month"

Query orders data by year:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT year, COUNT(*) FROM my_catalog.output_data.orders_yearly GROUP BY year ORDER BY year"

Partitioned Tables

The project includes functionality to create partitioned versions of the TPC-H tables for improved query performance:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin create_partitioned_tpch

This will:

  • Create a new namespace called tpch_partitioned
  • Create a partitioned version of the lineitem table (partitioned by month of shipdate)
  • Create a partitioned version of the orders table (partitioned by year of orderdate)

Note: The tables are created without data. To populate them with data, you would need to implement a separate process to:

  1. Read data from the original tables
  2. Convert data types as needed (e.g., Int32 to Int64 for key fields)
  3. Write the data to the partitioned tables with appropriate partition values

You can verify the table structures using:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "DESCRIBE my_catalog.tpch_partitioned.lineitem"
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "DESCRIBE my_catalog.tpch_partitioned.orders"

Once populated, partitioning would improve query performance when filtering on the partition columns:

# Query using partition pruning on lineitem (example for when data is populated)
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT COUNT(*) FROM my_catalog.tpch_partitioned.lineitem WHERE l_shipdate BETWEEN DATE '1992-01-01' AND DATE '1992-12-31'"
# Query using partition pruning on orders (example for when data is populated)
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT COUNT(*) FROM my_catalog.tpch_partitioned.orders WHERE o_orderdate >= DATE '1993-01-01'"

Partitioning Strategies

This project demonstrates several partitioning strategies:

  1. Category-based partitioning (write_partitioned_data.rs):

    • Partitions data by a categorical field (e.g., "category")
    • Uses the Identity transform to create partitions based on the exact value
    • Useful for data with a small number of distinct categories
  2. Time-based partitioning (write_output_data_to_s3.rs):

    • Partitions data by time periods (month, year)
    • Creates separate partitions for each time period
    • Enables efficient time-range queries
  3. TPC-H partitioning (create_partitioned_tpch.rs):

    • Partitions lineitem by month of shipdate
    • Partitions orders by year of orderdate
    • Demonstrates partitioning on date fields
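
The time-based specs in strategies 2 and 3 differ from the Identity example only in the transform used. A hedged sketch of month/year partition specs for the TPC-H tables; the source ids and target field names are illustrative and must match the real schemas, and the builder API is assumed from the iceberg crate.

use iceberg::spec::{Transform, UnboundPartitionSpec};

fn tpch_partition_specs() -> Result<(UnboundPartitionSpec, UnboundPartitionSpec), iceberg::Error> {
    // lineitem: partition by month(l_shipdate).
    let lineitem_spec = UnboundPartitionSpec::builder()
        .add_partition_field(11, "l_shipdate_month", Transform::Month)?
        .build();

    // orders: partition by year(o_orderdate).
    let orders_spec = UnboundPartitionSpec::builder()
        .add_partition_field(5, "o_orderdate_year", Transform::Year)?
        .build();

    Ok((lineitem_spec, orders_spec))
}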

S3 Storage Locations

The data is stored in S3 with the following structure:

  • Base path: s3://${S3_BUCKET}/${S3_PATH}
  • Default table: s3://${S3_BUCKET}/${S3_PATH}/default/my_table
  • Partitioned table: s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table
  • Output data:
    • Lineitem monthly: s3://${S3_BUCKET}/${S3_PATH}/output_data/lineitem_monthly
    • Orders yearly: s3://${S3_BUCKET}/${S3_PATH}/output_data/orders_yearly

Within each table directory, data files are organized according to the partition structure. For example:

  • s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table/category=A/data_A-xxx.parquet
  • s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table/category=B/data_B-xxx.parquet
  • s3://${S3_BUCKET}/${S3_PATH}/output_data/lineitem_monthly/month=1/lineitem_month_1-xxx.parquet
  • s3://${S3_BUCKET}/${S3_PATH}/output_data/orders_yearly/year=2020/orders_year_2020-xxx.parquet

Project Structure

  • src/main.rs - Main program for writing data
  • src/bin/read_table.rs - Example of reading data from the default table
  • src/bin/list_tables.rs - Example of listing available tables
  • src/bin/list_catalog_tables.rs - Example of listing all tables across namespaces
  • src/bin/read_custom_table.rs - Example of reading data from a specific table
  • src/bin/run_query.rs - Example of running a predefined query
  • src/bin/run_custom_query.rs - Example of running custom SQL queries
  • src/bin/create_partitioned_tpch.rs - Example of creating partitioned TPC-H tables
  • src/bin/write_partitioned_data.rs - Example of writing partitioned data
  • src/bin/read_partitioned_data.rs - Example of reading partitioned data
  • src/bin/write_output_data_to_s3.rs - Example of processing Parquet files and writing to S3
  • docker-compose.yml - REST catalog server configuration
  • .env - Environment variables (not tracked in git)

Configuration

The project uses the following configuration:

  • REST Catalog Server: http://localhost:8181
  • S3 Storage: Configured via environment variables
  • Table Location: s3://${S3_BUCKET}/${S3_PATH}
  • Default Table Schema:
    • id (Int32)
    • name (String)
  • Partitioned Table Schema:
    • id (Int32)
    • name (String)
    • category (String)
  • TPC-H Tables: Available in the tpch namespace

Development

  1. Make sure Docker is running
  2. Set up your environment variables
  3. Start the REST catalog server
  4. Run the examples

Troubleshooting

  1. If you see connection errors, ensure:

    • Docker is running
    • The REST catalog server is up (docker compose ps)
    • Your AWS credentials are correct
  2. For S3 access issues:

    • Verify your AWS credentials
    • Check S3 bucket permissions
    • Ensure the bucket exists
  3. For OpenSSL-related errors:

    • Make sure to include OPENSSL_DIR=/opt/homebrew/opt/openssl@3 when running commands
    • On non-macOS systems, you may need to adjust this path
  4. For Rust compilation errors:

    • Ensure you're using the correct type for FileIO (it's a struct, not a trait)
    • When working with Arrow schemas, use arrow_schema.into() to convert to Arc<Schema>
    • Check that all dependencies are correctly specified in Cargo.toml

Key Learnings

  1. Iceberg Table Structure:

    • Tables are organized with metadata and data files
    • Metadata tracks schema, partitioning, and data file locations
    • Data files (Parquet) contain the actual data
  2. Partitioning Benefits:

    • Improves query performance by pruning irrelevant data
    • Enables efficient filtering on partition columns
    • Organizes data logically in storage
  3. REST Catalog:

    • Provides a centralized metadata repository
    • Manages namespaces, tables, and schemas
    • Tracks data file locations and partitioning
  4. Rust Implementation:

    • Uses async/await for efficient I/O operations
    • Leverages Arrow for in-memory data representation
    • Uses Parquet for efficient data storage
    • Implements transactions for atomic operations
