This repository demonstrates how to use Apache Iceberg with Rust and AWS S3 storage. It includes examples of writing data to and reading data from Iceberg tables using the REST catalog.

To run the examples you will need:
- Rust (latest stable version)
- Docker
- AWS Account with S3 access
- Git
- Clone the repository:
git clone https://github.com/definite-app/minimal-rust-ice-s3.git
cd minimal-rust-ice-s3
- Set up your environment variables:
# Copy the example .env file
cp .env.example .env
# Edit .env with your AWS credentials and S3 configuration
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=your_aws_region
S3_BUCKET=your_bucket_name
S3_PATH=your_path
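The binaries read this configuration from the environment at startup. As a rough sketch (assuming the `dotenvy` crate; the actual code in this repo may load it differently):

```rust
// Minimal sketch of loading the .env values at startup, assuming the
// `dotenvy` crate; the binaries in this repo may wire this up differently.
use std::env;

fn main() {
    // Load variables from .env into the process environment (no-op if the file is absent).
    dotenvy::dotenv().ok();

    let bucket = env::var("S3_BUCKET").expect("S3_BUCKET must be set");
    let path = env::var("S3_PATH").expect("S3_PATH must be set");
    let region = env::var("AWS_REGION").expect("AWS_REGION must be set");

    println!("Iceberg warehouse: s3://{bucket}/{path} (region {region})");
}
```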
- Start the REST catalog server:
docker compose up -d
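If you want to confirm the catalog is reachable before running the examples, a quick check against the standard Iceberg REST `/v1/config` endpoint works. This is a hypothetical helper using `reqwest` and `tokio`, not one of the repo's binaries:

```rust
// Hypothetical connectivity check (not one of this repo's binaries).
// GET /v1/config is part of the Iceberg REST catalog specification and
// returns the catalog's default configuration as JSON.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let config = reqwest::get("http://localhost:8181/v1/config")
        .await?
        .text()
        .await?;
    println!("REST catalog config: {config}");
    Ok(())
}
```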
The project includes an example that writes sample data to an Iceberg table:
cargo run --bin mjr
This will:
- Create a namespace if it doesn't exist
- Create a table with a simple schema (id: Int32, name: String)
- Write sample records to the table
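In terms of the `iceberg` crate's `Catalog` trait, the setup above looks roughly like the sketch below. This is illustrative rather than a copy of `src/main.rs`, and exact method signatures may vary between crate versions:

```rust
// Rough sketch of the namespace/table setup, assuming the `iceberg` crate's
// Catalog trait; the actual code in src/main.rs may differ.
use std::collections::HashMap;

use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
use iceberg::{Catalog, NamespaceIdent, TableCreation};

async fn ensure_table(catalog: &impl Catalog) -> iceberg::Result<()> {
    // Create the namespace if it does not already exist.
    let ns = NamespaceIdent::new("default".to_string());
    if !catalog.namespace_exists(&ns).await? {
        catalog.create_namespace(&ns, HashMap::new()).await?;
    }

    // Define the simple (id: Int32, name: String) schema.
    let schema = Schema::builder()
        .with_fields(vec![
            NestedField::required(1, "id", Type::Primitive(PrimitiveType::Int)).into(),
            NestedField::required(2, "name", Type::Primitive(PrimitiveType::String)).into(),
        ])
        .build()?;

    // Register the table with the REST catalog; appending records then goes
    // through the crate's writer APIs (not shown here).
    let creation = TableCreation::builder()
        .name("my_table".to_string())
        .schema(schema)
        .build();
    catalog.create_table(&ns, creation).await?;
    Ok(())
}
```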
To read the data back from the table:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_table
This will:
- Connect to the same table
- Execute a SELECT query
- Display the results
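Reads go through SQL, typically by exposing the Iceberg catalog to DataFusion. A rough sketch, assuming the `datafusion`, `iceberg-datafusion`, and `anyhow` crates (the catalog and table names here are placeholders; `read_table.rs` may differ in detail):

```rust
// Rough sketch of querying an Iceberg table through DataFusion, assuming the
// `iceberg-datafusion` integration crate; read_table.rs may differ in detail.
use std::sync::Arc;

use datafusion::prelude::SessionContext;
use iceberg_datafusion::IcebergCatalogProvider;

async fn read_default_table(catalog: Arc<dyn iceberg::Catalog>) -> anyhow::Result<()> {
    // Expose the Iceberg catalog to DataFusion under the name "my_catalog".
    let provider = IcebergCatalogProvider::try_new(catalog).await?;
    let ctx = SessionContext::new();
    ctx.register_catalog("my_catalog", Arc::new(provider));

    // Run a SELECT against the default table and print the result batches.
    let df = ctx.sql("SELECT id, name FROM my_catalog.default.my_table").await?;
    df.show().await?;
    Ok(())
}
```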
To list available tables in a namespace:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin list_tables
This will:
- Connect to the catalog
- List tables in the default namespace
- Display sample data from the default table
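Listing is a thin wrapper over the catalog API. Roughly, assuming the `iceberg` crate's `Catalog` trait (not a copy of `list_tables.rs`):

```rust
// Rough sketch of listing namespaces and tables via the `iceberg` Catalog
// trait; list_tables.rs may organize this differently.
use iceberg::{Catalog, NamespaceIdent};

async fn print_tables(catalog: &impl Catalog) -> iceberg::Result<()> {
    // Top-level namespaces known to the catalog.
    for ns in catalog.list_namespaces(None).await? {
        println!("namespace: {ns:?}");
    }

    // Tables registered in the default namespace.
    let default_ns = NamespaceIdent::new("default".to_string());
    for table in catalog.list_tables(&default_ns).await? {
        println!("table: {}", table.name());
    }
    Ok(())
}
```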
To read data from a specific table with a limit:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_custom_table <namespace> <table_name> <limit>
For example:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin read_custom_table tpch customer 10
This will:
- Connect to the specified table in the given namespace
- Execute a SELECT query with the specified limit
- Display the results
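Conceptually, the binary just turns its arguments into a SQL statement. A minimal sketch of that mapping (the real `read_custom_table.rs` may validate or structure this differently):

```rust
// Minimal sketch of mapping CLI arguments onto a query string; the actual
// read_custom_table.rs may handle arguments differently.
fn main() {
    let args: Vec<String> = std::env::args().collect();
    // Expected invocation: read_custom_table <namespace> <table_name> <limit>
    let namespace = args.get(1).expect("missing <namespace>");
    let table = args.get(2).expect("missing <table_name>");
    let limit: usize = args
        .get(3)
        .map(|s| s.parse().expect("<limit> must be a number"))
        .unwrap_or(10);

    // Tables are addressed through the catalog registered as "my_catalog".
    let query = format!("SELECT * FROM my_catalog.{namespace}.{table} LIMIT {limit}");
    println!("{query}");
    // The query string would then be handed to DataFusion, as in the read example above.
}
```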
To run a predefined query that joins tables:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_query
This will:
- Execute a default query that joins customer and nation tables
- Display the results
To run any custom SQL query against your Iceberg tables:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "<SQL_QUERY>"
For example:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT l_shipmode, COUNT(*) as count FROM my_catalog.tpch.lineitem GROUP BY l_shipmode ORDER BY count DESC"
This will:
- Execute your custom SQL query against the Iceberg tables
- Display the results
Here are some example complex queries you can run:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT l_shipmode, COUNT(*) as count, SUM(l_quantity) as total_quantity, AVG(l_extendedprice) as avg_price FROM my_catalog.tpch.lineitem GROUP BY l_shipmode ORDER BY count DESC"
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT r.r_name as region, COUNT(DISTINCT c.c_custkey) as customer_count, COUNT(o.o_orderkey) as order_count, SUM(o.o_totalprice) as total_sales FROM my_catalog.tpch.orders o JOIN my_catalog.tpch.customer c ON o.o_custkey = c.c_custkey JOIN my_catalog.tpch.nation n ON c.c_nationkey = n.n_nationkey JOIN my_catalog.tpch.region r ON n.n_regionkey = r.r_regionkey GROUP BY r.r_name ORDER BY total_sales DESC"
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT c.c_name, n.n_name as nation, r.r_name as region, COUNT(o.o_orderkey) as order_count, SUM(o.o_totalprice) as total_spent FROM my_catalog.tpch.customer c JOIN my_catalog.tpch.orders o ON c.c_custkey = o.o_custkey JOIN my_catalog.tpch.nation n ON c.c_nationkey = n.n_nationkey JOIN my_catalog.tpch.region r ON n.n_regionkey = r.r_regionkey GROUP BY c.c_name, n.n_name, r.r_name ORDER BY total_spent DESC LIMIT 10"
The project includes functionality to create partitioned versions of the TPC-H tables for improved query performance:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin create_partitioned_tpch
This will:
- Create a new namespace called `tpch_partitioned`
- Create a partitioned version of the `lineitem` table (partitioned by month of shipdate)
- Create a partitioned version of the `orders` table (partitioned by year of orderdate)
Note: The tables are created without data. To populate them with data, you would need to implement a separate process to:
- Read data from the original tables
- Convert data types as needed (e.g., Int32 to Int64 for key fields)
- Write the data to the partitioned tables with appropriate partition values
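For reference, declaring a partitioned table with the `iceberg` crate looks roughly like the sketch below. The field ID, names, and builder calls are illustrative and version-dependent; see `create_partitioned_tpch.rs` for the real implementation:

```rust
// Rough, version-dependent sketch of declaring a partitioned table with the
// `iceberg` crate's unbound partition spec builder. The field ID and names
// are illustrative and may not match create_partitioned_tpch.rs.
use iceberg::spec::{Schema, Transform, UnboundPartitionSpec};
use iceberg::TableCreation;

fn partitioned_lineitem_creation(schema: Schema) -> iceberg::Result<TableCreation> {
    // Partition lineitem by month(l_shipdate); `11` stands in for the
    // l_shipdate field ID in the table schema.
    let spec = UnboundPartitionSpec::builder()
        .add_partition_field(11, "l_shipdate_month", Transform::Month)?
        .build();

    Ok(TableCreation::builder()
        .name("lineitem".to_string())
        .schema(schema)
        .partition_spec(spec)
        .build())
}
```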
You can verify the table structures using:
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "DESCRIBE my_catalog.tpch_partitioned.lineitem"
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "DESCRIBE my_catalog.tpch_partitioned.orders"
Once populated, partitioning would improve query performance when filtering on the partition columns:
# Query using partition pruning on lineitem (example for when data is populated)
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT COUNT(*) FROM my_catalog.tpch_partitioned.lineitem WHERE l_shipdate BETWEEN DATE '1992-01-01' AND DATE '1992-12-31'"
# Query using partition pruning on orders (example for when data is populated)
OPENSSL_DIR=/opt/homebrew/opt/openssl@3 cargo run --bin run_custom_query "SELECT COUNT(*) FROM my_catalog.tpch_partitioned.orders WHERE o_orderdate >= DATE '1993-01-01'"
The repository contains the following key files:

- `src/main.rs` - Main program for writing data
- `src/bin/read_table.rs` - Example of reading data from the default table
- `src/bin/list_tables.rs` - Example of listing available tables
- `src/bin/read_custom_table.rs` - Example of reading data from a specific table
- `src/bin/run_query.rs` - Example of running a predefined query
- `src/bin/run_custom_query.rs` - Example of running custom SQL queries
- `src/bin/create_partitioned_tpch.rs` - Example of creating partitioned TPC-H tables
- `docker-compose.yml` - REST catalog server configuration
- `.env` - Environment variables (not tracked in git)
The project uses the following configuration:
- REST Catalog Server: `http://localhost:8181`
- S3 Storage: Configured via environment variables
- Table Location: `s3://${S3_BUCKET}/${S3_PATH}`
- Default Table Schema: `id` (Int32), `name` (String)
- TPC-H Tables: Available in the `tpch` namespace
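Putting the pieces together, the catalog and S3 configuration might be wired up roughly like this (assuming the `iceberg-catalog-rest` crate; property keys follow Iceberg's S3 FileIO conventions, and exact builder methods may differ between crate versions):

```rust
// Rough sketch of wiring the environment into a REST catalog client, assuming
// the `iceberg-catalog-rest` crate. Property keys follow Iceberg's S3 FileIO
// conventions; exact builder methods may differ between crate versions.
use std::collections::HashMap;
use std::env;

use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};

fn build_catalog() -> RestCatalog {
    let props = HashMap::from([
        ("s3.region".to_string(), env::var("AWS_REGION").expect("AWS_REGION")),
        ("s3.access-key-id".to_string(), env::var("AWS_ACCESS_KEY_ID").expect("AWS_ACCESS_KEY_ID")),
        ("s3.secret-access-key".to_string(), env::var("AWS_SECRET_ACCESS_KEY").expect("AWS_SECRET_ACCESS_KEY")),
    ]);

    let config = RestCatalogConfig::builder()
        .uri("http://localhost:8181".to_string())
        .props(props)
        .build();

    RestCatalog::new(config)
}
```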
In short, the workflow is:

- Make sure Docker is running
- Set up your environment variables
- Start the REST catalog server
- Run the examples
- If you see connection errors, ensure:
  - Docker is running
  - The REST catalog server is up (`docker compose ps`)
  - Your AWS credentials are correct
- For S3 access issues:
  - Verify your AWS credentials
  - Check S3 bucket permissions
  - Ensure the bucket exists
- For OpenSSL-related errors:
  - Make sure to include `OPENSSL_DIR=/opt/homebrew/opt/openssl@3` when running commands
  - On non-macOS systems, you may need to adjust this path