Minimal Rust Iceberg S3 Example

This repository demonstrates how to use Apache Iceberg with Rust and AWS S3 storage. It includes examples of writing data to and reading data from Iceberg tables using the REST catalog.

Prerequisites

  • Rust (latest stable version)
  • Docker
  • AWS Account with S3 access
  • Git

Setup

  1. Clone the repository:
git clone https://github.com/definite-app/minimal-rust-ice-s3.git
cd minimal-rust-ice-s3
  2. Set up your environment variables:
# Copy the example .env file
cp .env.example .env

# Edit .env with your AWS credentials and S3 configuration
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=your_aws_region
S3_BUCKET=your_bucket_name
S3_PATH=your_path
  3. Start the REST catalog server:
docker compose up -d

Usage

Writing Data

The project includes an example that writes sample data to an Iceberg table:

cargo run --bin mjr

This will:

  • Create a namespace if it doesn't exist
  • Create a table with a simple schema (id: Int, name: String)
  • Write sample records to the table
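
For orientation, the core of the write path looks roughly like the sketch below. It assumes the Apache iceberg and iceberg-catalog-rest crates; the catalog and builder method names are taken from those crates and may differ between versions, so treat this as an illustration of the flow rather than a copy of src/main.rs.

use std::collections::HashMap;

use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
use iceberg::{Catalog, NamespaceIdent, TableCreation};
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the REST catalog started by `docker compose up -d`.
    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Create the namespace if it doesn't exist yet.
    let namespace = NamespaceIdent::new("default".to_string());
    if !catalog.namespace_exists(&namespace).await? {
        catalog.create_namespace(&namespace, HashMap::new()).await?;
    }

    // Simple schema used by the example: id (Int), name (String).
    let schema = Schema::builder()
        .with_fields(vec![
            NestedField::required(1, "id", Type::Primitive(PrimitiveType::Int)).into(),
            NestedField::required(2, "name", Type::Primitive(PrimitiveType::String)).into(),
        ])
        .build()?;

    // Register the table with the catalog; sample records are then written
    // to Parquet data files and committed in a transaction.
    let creation = TableCreation::builder()
        .name("my_table".to_string())
        .schema(schema)
        .build();
    catalog.create_table(&namespace, creation).await?;
    Ok(())
}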

Reading Data

To read the data back from the table:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_table

This will:

  • Connect to the same table
  • Execute a SELECT query
  • Display the results
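
If you want to read without going through SQL, the iceberg crate can also scan a table straight into Arrow record batches. The following is a minimal sketch under the same crate assumptions as above; the scan and to_arrow method names may vary by crate version.

use futures::TryStreamExt;
use iceberg::{Catalog, TableIdent};
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Load the table created by the write example.
    let table = catalog
        .load_table(&TableIdent::from_strs(["default", "my_table"])?)
        .await?;

    // Scan all columns and collect the resulting Arrow record batches.
    let batches: Vec<_> = table
        .scan()
        .select_all()
        .build()?
        .to_arrow()
        .await?
        .try_collect()
        .await?;

    for batch in &batches {
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}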

Listing Tables

To list available tables in a namespace:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin list_tables

This will:

  • Connect to the catalog
  • List tables in the default namespace
  • Display sample data from the default table

Listing All Tables Across Namespaces

To list all tables across all namespaces using the REST catalog API:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin list_catalog_tables

This will:

  • Connect to the REST catalog
  • List all namespaces
  • For each namespace, list all tables
  • Display the results
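
The listing binaries only need the catalog API, not table scans. A minimal sketch, again assuming the iceberg and iceberg-catalog-rest crates:

use iceberg::Catalog;
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Walk every namespace and print the tables it contains.
    for namespace in catalog.list_namespaces(None).await? {
        println!("namespace: {:?}", namespace);
        for table in catalog.list_tables(&namespace).await? {
            println!("  table: {:?}", table);
        }
    }
    Ok(())
}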

Reading Custom Tables

To read data from a specific table with a limit:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_custom_table <namespace> <table_name> <limit>

For example:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_custom_table tpch customer 10

This will:

  • Connect to the specified table in the given namespace
  • Execute a SELECT query with the specified limit
  • Display the results
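
The only extra work compared to read_table is argument handling. A small sketch of the expected invocation, using only the standard library (names are illustrative):

// Expected invocation: read_custom_table <namespace> <table_name> <limit>
fn parse_args() -> Result<(String, String, usize), Box<dyn std::error::Error>> {
    let mut args = std::env::args().skip(1);
    let namespace = args.next().ok_or("missing <namespace>")?;
    let table_name = args.next().ok_or("missing <table_name>")?;
    let limit: usize = args.next().ok_or("missing <limit>")?.parse()?;
    Ok((namespace, table_name, limit))
}

The resolved <namespace>.<table_name> pair is then loaded from the catalog and read as in the earlier example, with the output truncated to limit rows.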

Running Predefined Queries

To run a predefined query that joins tables:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_query

This will:

  • Execute a default query that joins customer and nation tables
  • Display the results

Running Custom SQL Queries

To run any custom SQL query against your Iceberg tables:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "<SQL_QUERY>"

For example:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT l_shipmode, COUNT(*) as count FROM my_catalog.tpch.lineitem GROUP BY l_shipmode ORDER BY count DESC"

This will:

  • Execute your custom SQL query against the Iceberg tables
  • Display the results
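
Under the hood, running arbitrary SQL means exposing the Iceberg catalog to a query engine. A rough sketch using DataFusion with the iceberg-datafusion bridge is shown below; the IcebergCatalogProvider name and constructor are assumptions based on that crate, not taken from this repository. Registering the provider as my_catalog is also why the example queries reference tables as my_catalog.<namespace>.<table>.

use std::sync::Arc;

use datafusion::prelude::SessionContext;
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};
use iceberg_datafusion::IcebergCatalogProvider;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The SQL statement arrives as the first command-line argument.
    let sql = std::env::args()
        .nth(1)
        .ok_or("usage: run_custom_query \"<SQL_QUERY>\"")?;

    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Expose the Iceberg REST catalog to DataFusion under the name "my_catalog".
    let provider = IcebergCatalogProvider::try_new(Arc::new(catalog)).await?;
    let ctx = SessionContext::new();
    ctx.register_catalog("my_catalog", Arc::new(provider));

    // Run the query and print the result set.
    ctx.sql(&sql).await?.show().await?;
    Ok(())
}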

Working with Partitioned Data

Writing Partitioned Data

To create a table with partitioned data:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin write_partitioned_data

This will:

  • Create a namespace called my_namespace
  • Create a table called partitioned_table with fields: id (Int), name (String), category (String)
  • Partition the data by the category field using Identity transform
  • Write sample data with categories A, B, and C
  • Store the data in S3 with appropriate partitioning
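
Declaring the partition spec is the interesting part of this binary. A hedged sketch, assuming the iceberg crate's UnboundPartitionSpec and Transform types; exact builder signatures may differ by crate version.

use iceberg::spec::{NestedField, PrimitiveType, Schema, Transform, Type, UnboundPartitionSpec};
use iceberg::TableCreation;

fn build_creation() -> Result<TableCreation, Box<dyn std::error::Error>> {
    // Schema: id (Int), name (String), category (String).
    let schema = Schema::builder()
        .with_fields(vec![
            NestedField::required(1, "id", Type::Primitive(PrimitiveType::Int)).into(),
            NestedField::required(2, "name", Type::Primitive(PrimitiveType::String)).into(),
            NestedField::required(3, "category", Type::Primitive(PrimitiveType::String)).into(),
        ])
        .build()?;

    // Partition by the raw value of `category` (Identity transform),
    // so each distinct category (A, B, C) becomes its own partition.
    let partition_spec = UnboundPartitionSpec::builder()
        .add_partition_field(3, "category", Transform::Identity)?
        .build();

    Ok(TableCreation::builder()
        .name("partitioned_table".to_string())
        .schema(schema)
        .partition_spec(partition_spec)
        .build())
}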

Reading Partitioned Data

To read data from the partitioned table:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_partitioned_data

This will:

  • Connect to the partitioned table
  • Read all data from the table
  • Read data from specific partitions (categories A, B, and C)
  • Perform aggregations (count by category)
  • Display the results

Writing Output Data to S3

To process Parquet files from the output directory and write them to S3 as Iceberg tables:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin write_output_data_to_s3

This will:

  • Create a namespace called output_data
  • Process lineitem files (months 1-5) and create a table partitioned by month
  • Process orders files (years 2020-2024) and create a table partitioned by year
  • Write the data to S3 with appropriate partitioning
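
Reading the local Parquet files is plain parquet/Arrow work before any Iceberg writes happen. A small sketch using the parquet crate's Arrow reader (the file path is hypothetical; the real binary walks the output directory):

use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn read_parquet(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Open one local Parquet file and stream it out as Arrow record batches,
    // ready to be appended to an Iceberg table with the right partition value.
    let file = File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    for batch in reader {
        let batch = batch?;
        println!("read {} rows from {}", batch.num_rows(), path);
    }
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    read_parquet("output/lineitem_month_1.parquet")
}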

Example Queries

Here are some example complex queries you can run:

Analyze shipping modes in lineitem table:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT l_shipmode, COUNT(*) as count, SUM(l_quantity) as total_quantity, AVG(l_extendedprice) as avg_price FROM my_catalog.tpch.lineitem GROUP BY l_shipmode ORDER BY count DESC"

Analyze order data by region:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT r.r_name as region, COUNT(DISTINCT c.c_custkey) as customer_count, COUNT(o.o_orderkey) as order_count, SUM(o.o_totalprice) as total_sales FROM my_catalog.tpch.orders o JOIN my_catalog.tpch.customer c ON o.o_custkey = c.c_custkey JOIN my_catalog.tpch.nation n ON c.c_nationkey = n.n_nationkey JOIN my_catalog.tpch.region r ON n.n_regionkey = r.r_regionkey GROUP BY r.r_name ORDER BY total_sales DESC"

Find top customers by total purchase amount:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT c.c_name, n.n_name as nation, r.r_name as region, COUNT(o.o_orderkey) as order_count, SUM(o.o_totalprice) as total_spent FROM my_catalog.tpch.customer c JOIN my_catalog.tpch.orders o ON c.c_custkey = o.o_custkey JOIN my_catalog.tpch.nation n ON c.c_nationkey = n.n_nationkey JOIN my_catalog.tpch.region r ON n.n_regionkey = r.r_regionkey GROUP BY c.c_name, n.n_name, r.r_name ORDER BY total_spent DESC LIMIT 10"

Query partitioned data:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT category, COUNT(*) FROM my_catalog.my_namespace.partitioned_table GROUP BY category"

Query lineitem data by month:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT month, COUNT(*) FROM my_catalog.output_data.lineitem_monthly GROUP BY month ORDER BY month"

Query orders data by year:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT year, COUNT(*) FROM my_catalog.output_data.orders_yearly GROUP BY year ORDER BY year"

Partitioned Tables

The project includes functionality to create partitioned versions of the TPC-H tables for improved query performance:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin create_partitioned_tpch

This will:

  • Create a new namespace called tpch_partitioned
  • Create a partitioned version of the lineitem table (partitioned by month of shipdate)
  • Create a partitioned version of the orders table (partitioned by year of orderdate)

Note: The tables are created without data. To populate them with data, you would need to implement a separate process to:

  1. Read data from the original tables
  2. Convert data types as needed (e.g., Int32 to Int64 for key fields)
  3. Write the data to the partitioned tables with appropriate partition values

You can verify the table structures using:

OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "DESCRIBE my_catalog.tpch_partitioned.lineitem"
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "DESCRIBE my_catalog.tpch_partitioned.orders"

Once populated, partitioning would improve query performance when filtering on the partition columns:

# Query using partition pruning on lineitem (example for when data is populated)
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT COUNT(*) FROM my_catalog.tpch_partitioned.lineitem WHERE l_shipdate BETWEEN DATE '1992-01-01' AND DATE '1992-12-31'"
# Query using partition pruning on orders (example for when data is populated)
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT COUNT(*) FROM my_catalog.tpch_partitioned.orders WHERE o_orderdate >= DATE '1993-01-01'"

Partitioning Strategies

This project demonstrates several partitioning strategies:

  1. Category-based partitioning (write_partitioned_data.rs):

    • Partitions data by a categorical field (e.g., "category")
    • Uses the Identity transform to create partitions based on the exact value
    • Useful for data with a small number of distinct categories
  2. Time-based partitioning (write_output_data_to_s3.rs):

    • Partitions data by time periods (month, year)
    • Creates separate partitions for each time period
    • Enables efficient time-range queries
  3. TPC-H partitioning (create_partitioned_tpch.rs):

    • Partitions lineitem by month of shipdate
    • Partitions orders by year of orderdate
    • Demonstrates partitioning on date fields
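
The time-based specs in strategies 2 and 3 differ from the Identity example only in the transform used. A hedged sketch of month/year partition specs for the TPC-H tables; the source ids and target field names are illustrative and must match the real schemas, and the builder API is assumed from the iceberg crate.

use iceberg::spec::{Transform, UnboundPartitionSpec};

fn tpch_partition_specs() -> Result<(UnboundPartitionSpec, UnboundPartitionSpec), iceberg::Error> {
    // lineitem: partition by month(l_shipdate).
    let lineitem_spec = UnboundPartitionSpec::builder()
        .add_partition_field(11, "l_shipdate_month", Transform::Month)?
        .build();

    // orders: partition by year(o_orderdate).
    let orders_spec = UnboundPartitionSpec::builder()
        .add_partition_field(5, "o_orderdate_year", Transform::Year)?
        .build();

    Ok((lineitem_spec, orders_spec))
}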

S3 Storage Locations

The data is stored in S3 with the following structure:

  • Base path: s3://${S3_BUCKET}/${S3_PATH}
  • Default table: s3://${S3_BUCKET}/${S3_PATH}/default/my_table
  • Partitioned table: s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table
  • Output data:
    • Lineitem monthly: s3://${S3_BUCKET}/${S3_PATH}/output_data/lineitem_monthly
    • Orders yearly: s3://${S3_BUCKET}/${S3_PATH}/output_data/orders_yearly

Within each table directory, data files are organized according to the partition structure. For example:

  • s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table/category=A/data_A-xxx.parquet
  • s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table/category=B/data_B-xxx.parquet
  • s3://${S3_BUCKET}/${S3_PATH}/output_data/lineitem_monthly/month=1/lineitem_month_1-xxx.parquet
  • s3://${S3_BUCKET}/${S3_PATH}/output_data/orders_yearly/year=2020/orders_year_2020-xxx.parquet

Project Structure

  • src/main.rs - Main program for writing data
  • src/bin/read_table.rs - Example of reading data from the default table
  • src/bin/list_tables.rs - Example of listing available tables
  • src/bin/list_catalog_tables.rs - Example of listing all tables across namespaces
  • src/bin/read_custom_table.rs - Example of reading data from a specific table
  • src/bin/run_query.rs - Example of running a predefined query
  • src/bin/run_custom_query.rs - Example of running custom SQL queries
  • src/bin/create_partitioned_tpch.rs - Example of creating partitioned TPC-H tables
  • src/bin/write_partitioned_data.rs - Example of writing partitioned data
  • src/bin/read_partitioned_data.rs - Example of reading partitioned data
  • src/bin/write_output_data_to_s3.rs - Example of processing Parquet files and writing to S3
  • docker-compose.yml - REST catalog server configuration
  • .env - Environment variables (not tracked in git)

Configuration

The project uses the following configuration:

  • REST Catalog Server: http://localhost:8181
  • S3 Storage: Configured via environment variables
  • Table Location: s3://${S3_BUCKET}/${S3_PATH}
  • Default Table Schema:
    • id (Int32)
    • name (String)
  • Partitioned Table Schema:
    • id (Int32)
    • name (String)
    • category (String)
  • TPC-H Tables: Available in the tpch namespace

Development

  1. Make sure Docker is running
  2. Set up your environment variables
  3. Start the REST catalog server
  4. Run the examples

Troubleshooting

  1. If you see connection errors, ensure:

    • Docker is running
    • The REST catalog server is up (docker compose ps)
    • Your AWS credentials are correct
  2. For S3 access issues:

    • Verify your AWS credentials
    • Check S3 bucket permissions
    • Ensure the bucket exists
  3. For OpenSSL-related errors:

    • Make sure to include OPENSSL_DIR=/opt/homebrew/opt/openssl@3 when running commands
    • On non-macOS systems, you may need to adjust this path
  4. For Rust compilation errors:

    • Ensure you're using the correct type for FileIO (it's a struct, not a trait)
    • When working with Arrow schemas, use arrow_schema.into() to convert to Arc<Schema>
    • Check that all dependencies are correctly specified in Cargo.toml

Key Learnings

  1. Iceberg Table Structure:

    • Tables are organized with metadata and data files
    • Metadata tracks schema, partitioning, and data file locations
    • Data files (Parquet) contain the actual data
  2. Partitioning Benefits:

    • Improves query performance by pruning irrelevant data
    • Enables efficient filtering on partition columns
    • Organizes data logically in storage
  3. REST Catalog:

    • Provides a centralized metadata repository
    • Manages namespaces, tables, and schemas
    • Tracks data file locations and partitioning
  4. Rust Implementation:

    • Uses async/await for efficient I/O operations
    • Leverages Arrow for in-memory data representation
    • Uses Parquet for efficient data storage
    • Implements transactions for atomic operations
