tttao/mini-data-platform

Required software (prerequisites)

  • Python
  • Docker

Mini-Data-Platform

This project showcases a low-cost data platform built with license-free software that can run on your personal infrastructure. The objective is not to provide an enterprise-grade solution, but to demonstrate, for educational purposes, the features of powerful open-source or license-free software, such as:

  • Trino
  • MinIO
  • Hive
  • Apache Superset
  • DBT

A detailed description of the architecture and its components can be found in my blog post: https://fabricemonnier.substack.com/p/self-hosted-data-analytics-delta?r=43vece

Deployment instructions

Set up connectivity from Trino to Google Workspace API

For Trino to interact with a Google Sheet, you need to set up credentials. Create them in your Google Workspace account, then share them with Trino: follow the steps described in https://trino.io/docs/current/connector/googlesheets.html#credentials, and copy the resulting JSON file to ./trinodb/etc/gsheets-credentials.json. An example file is provided in that same directory.
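Before building the images, it can help to confirm that the copied file looks like a usable key. The sketch below assumes the file is a Google service-account key in JSON format; the helper name `missing_credential_keys` and the list of required fields are illustrative choices, not part of this repository.

```python
import json
from pathlib import Path

# Location given in the instructions above.
CREDENTIALS_PATH = "./trinodb/etc/gsheets-credentials.json"

# Fields normally present in a Google service-account key file
# (assumption: the credentials are a service-account key).
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def missing_credential_keys(path=CREDENTIALS_PATH):
    """Return the required keys that are absent or empty in the JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return [key for key in REQUIRED_KEYS if not data.get(key)]
```

Calling `missing_credential_keys()` from the repository root returns an empty list when all four fields are present. The `client_email` value is the address the Google Sheet should be shared with.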

Build the customised Trino and Superset Docker images

docker compose build --no-cache

Services Overview

  1. MinIO (minio)
  • S3-compatible object store.
  • Runs on ports 9000 (API) and 9001 (console).
  • Root user/pass set to admin / password.
  • Data persisted under ./minio-data.
  2. MinIO Client (mc)
  • Waits until MinIO is up.
  • Sets up an alias to MinIO.
  • On container creation: deletes the bucket warehouse if it exists, then recreates it.
  • Applies a public policy so everything inside is accessible.
  • Keeps the container alive with tail -f /dev/null.
  3. Postgres (postgresdb)
  • Backend DB for Hive Metastore.
  • Username/password: hive / hive, database metastore.
  • Data persisted under ./data/postgres.
  4. Hive Metastore (hive-metastore)
  • Uses starburstdata/hive image.
  • Exposes Thrift service on port 9083.
  • Connects to Postgres at metastore_db:5432.
  • Configured with MinIO as warehouse store (s3://warehouse/).
  • Uses S3_PATH_STYLE_ACCESS=true (needed for MinIO).
  • Healthcheck ensures port 9083 is open.
  5. Trino (trino)
  • Built from local ./trinodb/Dockerfile.
  • Exposed on 8082 (maps to container’s 8080).
  • One catalog points to the Google Sheets source data.
  • One catalog points to the Hive Metastore, for use by Superset (BI).
  6. Superset (superset)
  • Built from local ./superset/Dockerfile.
  • Web UI on 8088.
  • Admin credentials: admin / admin

Overall flow

  • Data from Google Sheets is ingested and stored as Delta Lake tables (Parquet files) in MinIO (warehouse bucket).

  • Hive Metastore tracks metadata in Postgres.

  • Trino queries data via Hive Metastore.

  • Superset connects to Trino for BI dashboards.
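
Once the stack is running, a quick reachability check can confirm that each service is listening. This is a minimal sketch using only the Python standard library. The port numbers come from the service list above; it assumes every one of them is published on the Docker host, which depends on the compose file. The names `SERVICE_PORTS`, `is_port_open` and `check_services` are illustrative and do not exist in this repository.

```python
import socket

# Ports taken from the service list above. Whether each one is
# published on the host is an assumption about the compose file.
SERVICE_PORTS = {
    "minio-api": 9000,
    "minio-console": 9001,
    "hive-metastore": 9083,
    "trino": 8082,
    "superset": 8088,
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost", ports=None):
    """Map each service name to True (reachable) or False."""
    ports = SERVICE_PORTS if ports is None else ports
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Calling `check_services()` returns a dictionary keyed by service name. An open port only shows that something is listening there, not that the service behind it is healthy.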
