This repository demonstrates how to use Apache Iceberg with Rust and AWS S3 storage. It includes examples of writing data to and reading data from Iceberg tables using the REST catalog.
You will need:
- Rust (latest stable version)
- Docker
- AWS Account with S3 access
- Git
- Clone the repository:
git clone https://github.com/definite-app/minimal-rust-ice-s3.git
cd minimal-rust-ice-s3
- Set up your environment variables:
# Copy the example .env file
cp .env.example .env
# Edit .env with your AWS credentials and S3 configuration
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=your_aws_region
S3_BUCKET=your_bucket_name
S3_PATH=your_path
- Start the REST catalog server:
docker compose up -d
The project includes an example that writes sample data to an Iceberg table:
cargo run --bin mjr
This will:
- Create a namespace if it doesn't exist
- Create a table with a simple schema (id: Int, name: String)
- Write sample records to the table (a sketch of this flow follows)
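Conceptually, the writer boils down to the following. This is a minimal sketch assuming the `iceberg` and `iceberg-catalog-rest` crates; constructor and builder signatures vary between crate versions, so treat it as an outline of the flow rather than the exact source.

```rust
use std::collections::HashMap;

use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
use iceberg::{Catalog, NamespaceIdent, TableCreation};
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

#[tokio::main]
async fn main() -> iceberg::Result<()> {
    // Connect to the REST catalog started by `docker compose up -d`.
    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Create the namespace if it doesn't exist yet.
    let ns = NamespaceIdent::new("default".to_string());
    if !catalog.namespace_exists(&ns).await? {
        catalog.create_namespace(&ns, HashMap::new()).await?;
    }

    // Define the simple (id: Int, name: String) schema.
    let schema = Schema::builder()
        .with_fields(vec![
            NestedField::required(1, "id", Type::Primitive(PrimitiveType::Int)).into(),
            NestedField::required(2, "name", Type::Primitive(PrimitiveType::String)).into(),
        ])
        .build()?;

    // Register the table; the catalog assigns its S3 location.
    let creation = TableCreation::builder()
        .name("my_table".to_string())
        .schema(schema)
        .build();
    catalog.create_table(&ns, creation).await?;

    // Writing the sample records (Arrow batches -> Parquet -> append) follows.
    Ok(())
}
```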
To read the data back from the table:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_table
This will:
- Connect to the same table
- Execute a SELECT query
- Display the results (see the scan sketch below)
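The read path issues a SELECT through a query engine, but the same rows can also be fetched with the `iceberg` crate's table-scan API. A hedged sketch, reusing a connected catalog like the one above:

```rust
use futures::TryStreamExt;
use iceberg::{Catalog, TableIdent};

// Assumes `catalog` is a connected RestCatalog as in the previous sketch.
async fn read_all(catalog: &impl Catalog) -> iceberg::Result<()> {
    let table = catalog
        .load_table(&TableIdent::from_strs(["default", "my_table"])?)
        .await?;

    // Scan the whole table and collect the Arrow record batches.
    let batches: Vec<_> = table
        .scan()
        .select(["id", "name"])
        .build()?
        .to_arrow()
        .await?
        .try_collect()
        .await?;

    for batch in batches {
        println!("{} rows", batch.num_rows());
    }
    Ok(())
}
```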
To list available tables in a namespace:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin list_tables
This will:
- Connect to the catalog
- List tables in the default namespace
- Display sample data from the default table
To list all tables across all namespaces using the REST catalog API:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin list_catalog_tables
This will:
- Connect to the REST catalog
- List all namespaces
- For each namespace, list all tables
- Display the results (a listing sketch follows)
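Both listing binaries reduce to two catalog calls: one for namespaces, one for the tables within each. A sketch under the same crate assumptions as above:

```rust
use iceberg::Catalog;

// Assumes `catalog` is a connected RestCatalog.
async fn list_everything(catalog: &impl Catalog) -> iceberg::Result<()> {
    // Top-level namespaces; pass Some(parent) to descend into nested ones.
    for ns in catalog.list_namespaces(None).await? {
        println!("namespace: {:?}", ns);
        for table in catalog.list_tables(&ns).await? {
            println!("  table: {:?}", table);
        }
    }
    Ok(())
}
```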
To read data from a specific table with a limit:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_custom_table <namespace> <table_name> <limit>
For example:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_custom_table tpch customer 10
This will:
- Connect to the specified table in the given namespace
- Execute a SELECT query with the specified limit
- Display the results
To run a predefined query that joins tables:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_query
This will:
- Execute a default query that joins customer and nation tables
- Display the results
To run any custom SQL query against your Iceberg tables:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "<SQL_QUERY>"
For example:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT l_shipmode, COUNT(*) as count FROM my_catalog.tpch.lineitem GROUP BY l_shipmode ORDER BY count DESC"
This will:
- Execute your custom SQL query against the Iceberg tables
- Display the results (see the DataFusion sketch below)
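The SQL binaries work by registering the Iceberg catalog with DataFusion. A minimal sketch assuming the `iceberg-datafusion` crate's `IcebergCatalogProvider`; the crate's API is still evolving, so check the version you depend on:

```rust
use std::sync::Arc;

use datafusion::prelude::SessionContext;
use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};
use iceberg_datafusion::IcebergCatalogProvider;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let catalog = RestCatalog::new(
        RestCatalogConfig::builder()
            .uri("http://localhost:8181".to_string())
            .build(),
    );

    // Expose the Iceberg catalog to DataFusion under the name `my_catalog`.
    let provider = IcebergCatalogProvider::try_new(Arc::new(catalog)).await?;
    let ctx = SessionContext::new();
    ctx.register_catalog("my_catalog", Arc::new(provider));

    // Run whatever SQL was passed on the command line and print the result.
    let sql = std::env::args().nth(1).expect("usage: run_custom_query <SQL>");
    ctx.sql(&sql).await?.show().await?;
    Ok(())
}
```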
To create a table with partitioned data:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin write_partitioned_data
This will:
- Create a namespace called my_namespace
- Create a table called partitioned_table with fields: id (Int), name (String), category (String)
- Partition the data by the category field using the Identity transform (a partition-spec sketch follows)
- Write sample data with categories A, B, and C
- Store the data in S3 with appropriate partitioning
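The partitioning step amounts to attaching an unbound partition spec to the table creation. A sketch assuming field id 3 is the `category` column; field ids and builder signatures depend on your schema and crate version:

```rust
use iceberg::spec::{Schema, Transform, UnboundPartitionSpec};
use iceberg::TableCreation;

// Hypothetical helper: builds the creation request for partitioned_table,
// assuming the schema assigns field id 3 to `category`.
fn partitioned_creation(schema: Schema) -> iceberg::Result<TableCreation> {
    // Identity transform: one partition per distinct category value.
    let spec = UnboundPartitionSpec::builder()
        .add_partition_field(3, "category", Transform::Identity)?
        .build();

    Ok(TableCreation::builder()
        .name("partitioned_table".to_string())
        .schema(schema)
        .partition_spec(spec)
        .build())
}
```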
To read data from the partitioned table:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_partitioned_data
This will:
- Connect to the partitioned table
- Read all data from the table
- Read data from specific partitions (categories A, B, and C)
- Perform aggregations (count by category)
- Display the results (a filtered-scan sketch follows)
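A partition-aware read is an ordinary scan with a filter; because `category` uses an Identity partition, the filter lets Iceberg prune whole data files. A sketch using the `iceberg` crate's expression API:

```rust
use futures::TryStreamExt;
use iceberg::expr::Reference;
use iceberg::spec::Datum;
use iceberg::table::Table;

// Read only the rows whose `category` is "A".
async fn read_category_a(table: &Table) -> iceberg::Result<()> {
    let batches: Vec<_> = table
        .scan()
        .with_filter(Reference::new("category").equal_to(Datum::string("A")))
        .build()?
        .to_arrow()
        .await?
        .try_collect()
        .await?;
    println!("{} batches from partition A", batches.len());
    Ok(())
}
```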
To process Parquet files from the output directory and write them to S3 as Iceberg tables:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin write_output_data_to_s3
This will:
- Create a namespace called output_data
- Process lineitem files (months 1-5) and create a table partitioned by month
- Process orders files (years 2020-2024) and create a table partitioned by year
- Write the data to S3 with appropriate partitioning (a Parquet-reading sketch follows)
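Reading the local Parquet files is standard `parquet` crate usage. A sketch with an illustrative path:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Stream record batches out of one local output file; the path is illustrative.
fn read_parquet(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    for batch in reader {
        let batch = batch?;
        println!("{} rows", batch.num_rows());
    }
    Ok(())
}
```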
Here are some example complex queries you can run:
# Aggregate lineitem by ship mode
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT l_shipmode, COUNT(*) as count, SUM(l_quantity) as total_quantity, AVG(l_extendedprice) as avg_price FROM my_catalog.tpch.lineitem GROUP BY l_shipmode ORDER BY count DESC"
# Sales by region across orders, customer, nation, and region
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT r.r_name as region, COUNT(DISTINCT c.c_custkey) as customer_count, COUNT(o.o_orderkey) as order_count, SUM(o.o_totalprice) as total_sales FROM my_catalog.tpch.orders o JOIN my_catalog.tpch.customer c ON o.o_custkey = c.c_custkey JOIN my_catalog.tpch.nation n ON c.c_nationkey = n.n_nationkey JOIN my_catalog.tpch.region r ON n.n_regionkey = r.r_regionkey GROUP BY r.r_name ORDER BY total_sales DESC"
# Top 10 customers by total spend
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT c.c_name, n.n_name as nation, r.r_name as region, COUNT(o.o_orderkey) as order_count, SUM(o.o_totalprice) as total_spent FROM my_catalog.tpch.customer c JOIN my_catalog.tpch.orders o ON c.c_custkey = o.o_custkey JOIN my_catalog.tpch.nation n ON c.c_nationkey = n.n_nationkey JOIN my_catalog.tpch.region r ON n.n_regionkey = r.r_regionkey GROUP BY c.c_name, n.n_name, r.r_name ORDER BY total_spent DESC LIMIT 10"
# Row counts per category in the partitioned table
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT category, COUNT(*) FROM my_catalog.my_namespace.partitioned_table GROUP BY category"
# Monthly lineitem counts
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT month, COUNT(*) FROM my_catalog.output_data.lineitem_monthly GROUP BY month ORDER BY month"
# Yearly order counts
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT year, COUNT(*) FROM my_catalog.output_data.orders_yearly GROUP BY year ORDER BY year"
The project includes functionality to create partitioned versions of the TPC-H tables for improved query performance:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin create_partitioned_tpch
This will:
- Create a new namespace called tpch_partitioned
- Create a partitioned version of the lineitem table (partitioned by month of l_shipdate)
- Create a partitioned version of the orders table (partitioned by year of o_orderdate)
Note: The tables are created without data. To populate them with data, you would need to implement a separate process to:
- Read data from the original tables
- Convert data types as needed (e.g., Int32 to Int64 for key fields; see the cast sketch after this list)
- Write the data to the partitioned tables with appropriate partition values
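For the type conversion step, the `arrow` crate's `cast` kernel is one option. A minimal sketch with made-up values:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn widen_keys() -> Result<(), ArrowError> {
    // Key column as read from the original table (values are made up).
    let keys: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));

    // Cast Int32 -> Int64 to match the partitioned table's schema.
    let widened = cast(&keys, &DataType::Int64)?;
    assert_eq!(widened.data_type(), &DataType::Int64);
    Ok(())
}
```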
You can verify the table structures using:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "DESCRIBE my_catalog.tpch_partitioned.lineitem"
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "DESCRIBE my_catalog.tpch_partitioned.orders"
Once populated, partitioning would improve query performance when filtering on the partition columns:
# Query using partition pruning on lineitem (example for when data is populated)
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT COUNT(*) FROM my_catalog.tpch_partitioned.lineitem WHERE l_shipdate BETWEEN DATE '1992-01-01' AND DATE '1992-12-31'"
# Query using partition pruning on orders (example for when data is populated)
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT COUNT(*) FROM my_catalog.tpch_partitioned.orders WHERE o_orderdate >= DATE '1993-01-01'"
This project demonstrates several partitioning strategies:
- Category-based partitioning (write_partitioned_data.rs):
  - Partitions data by a categorical field (e.g., "category")
  - Uses the Identity transform to create partitions based on the exact value
  - Useful for data with a small number of distinct categories
- Time-based partitioning (write_output_data_to_s3.rs):
  - Partitions data by time periods (month, year)
  - Creates separate partitions for each time period
  - Enables efficient time-range queries
- TPC-H partitioning (create_partitioned_tpch.rs):
  - Partitions lineitem by month of l_shipdate
  - Partitions orders by year of o_orderdate
  - Demonstrates partitioning on date fields (see the sketch below)
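The time-based specs mirror the Identity sketch shown earlier, swapping in the Month and Year transforms. Source field ids here are illustrative, not the ids the TPC-H schemas actually assign:

```rust
use iceberg::spec::{Transform, UnboundPartitionSpec};

fn time_based_specs() -> iceberg::Result<()> {
    // Month transform: one partition per calendar month of l_shipdate
    // (source field id 11 is illustrative).
    let by_month = UnboundPartitionSpec::builder()
        .add_partition_field(11, "shipdate_month", Transform::Month)?
        .build();

    // Year transform: one partition per calendar year of o_orderdate
    // (source field id 5 is illustrative).
    let by_year = UnboundPartitionSpec::builder()
        .add_partition_field(5, "orderdate_year", Transform::Year)?
        .build();

    let _ = (by_month, by_year);
    Ok(())
}
```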
The data is stored in S3 with the following structure:
- Base path: s3://${S3_BUCKET}/${S3_PATH}
- Default table: s3://${S3_BUCKET}/${S3_PATH}/default/my_table
- Partitioned table: s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table
- Output data:
  - Lineitem monthly: s3://${S3_BUCKET}/${S3_PATH}/output_data/lineitem_monthly
  - Orders yearly: s3://${S3_BUCKET}/${S3_PATH}/output_data/orders_yearly
Within each table directory, data files are organized according to the partition structure. For example:
s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table/category=A/data_A-xxx.parquet
s3://${S3_BUCKET}/${S3_PATH}/my_namespace/partitioned_table/category=B/data_B-xxx.parquet
s3://${S3_BUCKET}/${S3_PATH}/output_data/lineitem_monthly/month=1/lineitem_month_1-xxx.parquet
s3://${S3_BUCKET}/${S3_PATH}/output_data/orders_yearly/year=2020/orders_year_2020-xxx.parquet
The repository is laid out as follows:
- src/main.rs - Main program for writing data
- src/bin/read_table.rs - Example of reading data from the default table
- src/bin/list_tables.rs - Example of listing available tables
- src/bin/list_catalog_tables.rs - Example of listing all tables across namespaces
- src/bin/read_custom_table.rs - Example of reading data from a specific table
- src/bin/run_query.rs - Example of running a predefined query
- src/bin/run_custom_query.rs - Example of running custom SQL queries
- src/bin/create_partitioned_tpch.rs - Example of creating partitioned TPC-H tables
- src/bin/write_partitioned_data.rs - Example of writing partitioned data
- src/bin/read_partitioned_data.rs - Example of reading partitioned data
- src/bin/write_output_data_to_s3.rs - Example of processing Parquet files and writing to S3
- docker-compose.yml - REST catalog server configuration
- .env - Environment variables (not tracked in git)
The project uses the following configuration:
- REST Catalog Server: http://localhost:8181
- S3 Storage: Configured via environment variables
- Table Location: s3://${S3_BUCKET}/${S3_PATH}
- Default Table Schema: id (Int32), name (String)
- Partitioned Table Schema: id (Int32), name (String), category (String)
- TPC-H Tables: Available in the tpch namespace
To run everything end to end:
- Make sure Docker is running
- Set up your environment variables
- Start the REST catalog server
- Run the examples
- If you see connection errors, ensure:
  - Docker is running
  - The REST catalog server is up (docker compose ps)
  - Your AWS credentials are correct
- For S3 access issues:
  - Verify your AWS credentials
  - Check S3 bucket permissions
  - Ensure the bucket exists
- For OpenSSL-related errors:
  - Make sure to include OPENSSL_DIR=/opt/homebrew/opt/openssl@3 when running commands
  - On non-macOS systems, you may need to adjust this path
- For Rust compilation errors:
  - Ensure you're using the correct type for FileIO (it's a struct, not a trait)
  - When working with Arrow schemas, use arrow_schema.into() to convert to Arc<Schema> (see the sketch below)
  - Check that all dependencies are correctly specified in Cargo.toml
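The `arrow_schema.into()` conversion works because `SchemaRef` is an alias for `Arc<Schema>`; `.into()` just wraps the schema in an `Arc`. A minimal sketch:

```rust
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};

fn to_schema_ref() -> SchemaRef {
    let arrow_schema = Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]);
    // From<Schema> for Arc<Schema> does the wrapping; no data is copied.
    arrow_schema.into()
}
```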
Some key concepts behind the examples:
- Iceberg Table Structure:
  - Tables are organized with metadata and data files
  - Metadata tracks schema, partitioning, and data file locations
  - Data files (Parquet) contain the actual data
- Partitioning Benefits:
  - Improves query performance by pruning irrelevant data
  - Enables efficient filtering on partition columns
  - Organizes data logically in storage
- REST Catalog:
  - Provides a centralized metadata repository
  - Manages namespaces, tables, and schemas
  - Tracks data file locations and partitioning
- Rust Implementation:
  - Uses async/await for efficient I/O operations
  - Leverages Arrow for in-memory data representation (a RecordBatch sketch follows)
  - Uses Parquet for efficient data storage
  - Implements transactions for atomic operations
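To make the Arrow point concrete, here is how the sample `(id, name)` rows from the first example look as an in-memory `RecordBatch` (values are illustrative):

```rust
use std::sync::Arc;

use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Build the in-memory batch that the writers serialize to Parquet.
fn sample_batch() -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec!["alice", "bob", "carol"])),
        ],
    )
}
```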