All-in-One Docker BigDataOps is a comprehensive Docker Compose environment that bundles Hadoop, Spark, Hive, Hue, and Airflow into a ready-to-run stack. This project simplifies Big Data operations, but it takes an academic approach to Big Data; do not treat it as a set of best practices for production environments.
- Overview
- Key Features
- Why Use This Repository?
- Getting Started
- Airflow Configuration
- Documentation & Examples
- Contributing
- License
- Support
## Overview

This repository provides an all-in-one solution for Big DataOps by integrating industry-leading tools into a single Docker Compose setup. With minimal configuration, you can deploy a powerful stack that covers everything from distributed processing with Hadoop to data orchestration with Airflow.
## Key Features

- **Fully Integrated Tools**
  Enjoy a pre-configured environment featuring:
  - Hadoop: Distributed storage and processing
  - Spark: Big data analytics
  - Hive: SQL-based querying
  - Hue: User-friendly web interface
  - Airflow: Robust data pipeline orchestration
- **Quick Setup**
  Start everything with a single command: `make start-all`
- **Versatile Usage**
  Perfect for:
  - Learning and experimentation
  - Development and testing
  - Small-scale production deployments
- **Example Workflows**
  Includes sample jobs and scripts to help you kickstart your big data projects.
- **Customizable**
  Easily extend or modify the stack to match your unique requirements.
## Why Use This Repository?

- Ease of Use: Say goodbye to tedious configurations. Focus on building and testing your data solutions.
- All-in-One Solution: All necessary tools are bundled together for a seamless experience.
- Community-Driven: Designed for accessibility and collaboration, making it a great resource for the Big DataOps community.
## Getting Started

If you are using Windows, install one of the following first:

- Cygwin64: make sure you select "Make" as an additional package.
- Make: make sure you also download grep and the core utilities.
Clone the repository and launch the full stack with one command:

    git clone https://github.com/heirinsinho/all-in-one-docker-bigdataops.git
    cd all-in-one-docker-bigdataops
    make start-all
If you prefer to run only a specific part of the stack, you can start it individually:

    make start-spark
    make start-hadoop
    make start-streaming
    make start-airflow
## Airflow Configuration

After starting the environment, configure Airflow with the following connections:

- Spark Connection
  - conn_id: `spark_docker`
  - conn_type: `spark`
  - conn_host: `spark://spark-master`
  - conn_port: `7077`
- Hadoop SSH Connection
  - conn_id: `ssh_hadoop`
  - conn_type: `ssh`
  - conn_host: `namenode`
  - conn_port: `22`
  - conn_login: `root`
  - conn_password: `root123`
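Connections are normally created in the Airflow UI (Admin → Connections), but they can also be scripted. The snippet below is a minimal sketch of that approach, not part of this repository: it assumes you run it inside the Airflow container so it can reach Airflow's metadata database, and it simply registers the two connections listed above.

```python
# Minimal sketch (assumption, not shipped in this repo): register the two
# connections listed above programmatically instead of via the Airflow UI.
from airflow import settings
from airflow.models import Connection

connections = [
    Connection(
        conn_id="spark_docker",
        conn_type="spark",
        host="spark://spark-master",
        port=7077,
    ),
    Connection(
        conn_id="ssh_hadoop",
        conn_type="ssh",
        host="namenode",
        port=22,
        login="root",
        password="root123",
    ),
]

session = settings.Session()
for conn in connections:
    # Skip connections that already exist so the script stays idempotent
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
session.commit()
```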
Refer to the Makefile for additional commands and usage examples.
## Documentation & Examples

There is a complete example of a big data streaming application within this repo:

MadFlow: a real-time metric of the occupancy status in Madrid.

You can find the scripts for its execution both as an Airflow DAG and as Jupyter notebooks. Finally, the api_madflow folder contains the backend and frontend needed to run the application locally.
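For orientation, the sketch below shows how a DAG typically uses the `spark_docker` and `ssh_hadoop` connections configured earlier. It is a hypothetical minimal example, not the actual MadFlow pipeline: the DAG id, task names, and application path are placeholders, and it assumes a recent Airflow 2.x image with the Apache Spark and SSH providers installed.

```python
# Hypothetical minimal DAG: not the MadFlow pipeline shipped in this repo.
# Demonstrates the spark_docker and ssh_hadoop connections described above.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="example_spark_and_hdfs",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    # Submit a PySpark application through the spark_docker connection
    run_spark_job = SparkSubmitOperator(
        task_id="run_spark_job",
        conn_id="spark_docker",
        application="/opt/airflow/dags/scripts/example_job.py",  # placeholder path
    )

    # Run an HDFS command on the namenode over the ssh_hadoop connection
    list_hdfs_output = SSHOperator(
        task_id="list_hdfs_output",
        ssh_conn_id="ssh_hadoop",
        command="hdfs dfs -ls /",
    )

    run_spark_job >> list_hdfs_output
```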
## Contributing

Contributions are encouraged! If you'd like to enhance the project or fix an issue, please fork the repository and submit a pull request. See our Contribution Guidelines for more details.
## License

This project is licensed under the MIT License. See the LICENSE file for further information.
## Support

If you encounter any issues or have suggestions, please open an issue or submit a pull request. Your feedback helps improve the project for everyone.
Happy Big DataOps!