All-in-One Docker BigDataOps is a comprehensive Docker Compose environment that bundles Hadoop, Spark, Hive, Hue, and Airflow into a ready-to-run stack. This project simplifies Big Data operations, but it takes an academic approach to Big Data; do not treat it as a set of best practices for production environments.
- Overview
- Key Features
- Why Use This Repository?
- Getting Started
- Airflow Configuration
- Documentation & Examples
- Contributing
- License
- Support
## Overview

This repository provides an all-in-one solution for Big DataOps by integrating industry-leading tools into a single Docker Compose setup. With minimal configuration, you can deploy a powerful stack that covers everything from distributed processing with Hadoop to data orchestration with Airflow.
## Key Features

- **Fully Integrated Tools**
  Enjoy a pre-configured environment featuring:
  - Hadoop: Distributed storage and processing
  - Spark: Big data analytics
  - Hive: SQL-based querying
  - Hue: User-friendly web interface
  - Airflow: Robust data pipeline orchestration
- **Quick Setup**
  Start everything with a single command: `make start-all`
- **Versatile Usage**
  Perfect for:
  - Learning and experimentation
  - Development and testing
  - Small-scale production deployments
- **Example Workflows**
  Includes sample jobs and scripts to help you kickstart your big data projects.
- **Customizable**
  Easily extend or modify the stack to match your unique requirements.
## Why Use This Repository?

- Ease of Use: Say goodbye to tedious configurations. Focus on building and testing your data solutions.
- All-in-One Solution: All necessary tools are bundled together for a seamless experience.
- Community-Driven: Designed for accessibility and collaboration, making it a great resource for the Big DataOps community.
## Getting Started

If you are using Windows, install one of the following first:

- Cygwin64: make sure you select "Make" as an additional package.
- Make: make sure you also download grep and the core utilities.
Clone the repository and launch the full stack with one command:

    git clone https://github.com/heirinsinho/all-in-one-docker-bigdataops.git
    cd all-in-one-docker-bigdataops
    make start-all
If you prefer to run only a specific part of the stack, you can start it individually:

    make start-spark
    make start-hadoop
    make start-streaming
    make start-airflow
## Airflow Configuration

After starting the environment, configure Airflow with the following connections:

- Spark Connection
  - conn_id: `spark_docker`
  - conn_type: `spark`
  - conn_host: `spark://spark-master`
  - conn_port: `7077`
- Hadoop SSH Connection
  - conn_id: `ssh_hadoop`
  - conn_type: `ssh`
  - conn_host: `namenode`
  - conn_port: `22`
  - conn_login: `root`
  - conn_password: `root123`
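Connections are normally created in the Airflow UI (Admin → Connections), but they can also be scripted. The snippet below is a minimal sketch of that approach, not part of this repository: it assumes you run it inside the Airflow container so it can reach Airflow's metadata database, and it simply registers the two connections listed above.

```python
# Minimal sketch (assumption, not shipped in this repo): register the two
# connections listed above programmatically instead of via the Airflow UI.
from airflow import settings
from airflow.models import Connection

connections = [
    Connection(
        conn_id="spark_docker",
        conn_type="spark",
        host="spark://spark-master",
        port=7077,
    ),
    Connection(
        conn_id="ssh_hadoop",
        conn_type="ssh",
        host="namenode",
        port=22,
        login="root",
        password="root123",
    ),
]

session = settings.Session()
for conn in connections:
    # Skip connections that already exist so the script stays idempotent
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
session.commit()
```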
Refer to the Makefile for additional commands and usage examples.
## Documentation & Examples

There is a complete example of a big data streaming application within this repo:

MadFlow: a real-time metric of the occupancy status in Madrid.

You can find the scripts for its execution both as an Airflow DAG and as Jupyter notebooks. Finally, the api_madflow folder contains the backend and frontend needed to run the application locally.
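For orientation, the sketch below shows how a DAG typically uses the `spark_docker` and `ssh_hadoop` connections configured earlier. It is a hypothetical minimal example, not the actual MadFlow pipeline: the DAG id, task names, and application path are placeholders, and it assumes a recent Airflow 2.x image with the Apache Spark and SSH providers installed.

```python
# Hypothetical minimal DAG: not the MadFlow pipeline shipped in this repo.
# Demonstrates the spark_docker and ssh_hadoop connections described above.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="example_spark_and_hdfs",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    # Submit a PySpark application through the spark_docker connection
    run_spark_job = SparkSubmitOperator(
        task_id="run_spark_job",
        conn_id="spark_docker",
        application="/opt/airflow/dags/scripts/example_job.py",  # placeholder path
    )

    # Run an HDFS command on the namenode over the ssh_hadoop connection
    list_hdfs_output = SSHOperator(
        task_id="list_hdfs_output",
        ssh_conn_id="ssh_hadoop",
        command="hdfs dfs -ls /",
    )

    run_spark_job >> list_hdfs_output
```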
## Contributing

Contributions are encouraged! If you'd like to enhance the project or fix an issue, please fork the repository and submit a pull request. See our Contribution Guidelines for more details.
## License

This project is licensed under the MIT License. See the LICENSE file for further information.
## Support

If you encounter any issues or have suggestions, please open an issue or submit a pull request. Your feedback helps improve the project for everyone.
Happy Big DataOps!