
Commit 1abcbef

Initial commit
0 parents  commit 1abcbef

19 files changed · +418 −0 lines changed

.dockerignore

Whitespace-only changes.

.gitignore

Whitespace-only changes.

README.md

Lines changed: 214 additions & 0 deletions
@@ -0,0 +1,214 @@
# Spark Cluster with Docker & docker-compose

# General

A simple Spark standalone cluster for your testing environment purposes. You are a *docker-compose up* away from your Spark development environment.

The docker-compose file will create the following containers:

Container|IP address
---|---
spark-master|10.5.0.2
spark-worker-1|10.5.0.3
spark-worker-2|10.5.0.4
spark-worker-3|10.5.0.5

# Installation

The following steps will get your Spark cluster's containers up and running.

## Prerequisites

* Docker installed

* Docker compose installed (a quick check for both follows this list)

* A Spark application JAR to play with (optional)

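A quick way to confirm the first two are in place (any reasonably recent version should do):

```sh
# Print the installed versions of Docker and docker-compose
docker --version
docker-compose --version
```
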
## Build the images

The first step to deploy the cluster is to build the custom images; these builds can be performed with the *build-images.sh* script.

The execution is as simple as the following steps:

```sh
chmod +x build-images.sh
./build-images.sh
```

This will create the following docker images (a quick way to verify them follows the list):

* spark-base:2.3.1: A base image based on java:8-jdk-alpine which ships Scala, Python 3 and Spark 2.3.1.

* spark-master:2.3.1: An image based on the previously created spark-base image, used to create Spark master containers.

* spark-worker:2.3.1: An image based on the previously created spark-base image, used to create Spark worker containers.

* spark-submit:2.3.1: An image based on the previously created spark-base image, used to create Spark submit containers (run, deliver the driver and exit gracefully).

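If the script finishes without errors, the four tags above should show up among your local images:

```sh
# List the freshly built Spark images
docker images | grep spark
```
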
## Run the docker-compose

The final step to create your test cluster will be to run the compose file:

```sh
docker-compose up
```

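The command above keeps the cluster attached to your terminal; a detached start plus a quick listing of the containers (the names come straight from docker-compose.yml) works just as well:

```sh
# Start the cluster in the background and verify the containers are up
docker-compose up -d
docker ps --filter "name=spark"
```
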
## Validate your cluster

Just validate your cluster by accessing the Spark UI on each worker & master URL; a quick command-line check is shown after the screenshots.

### Spark Master

http://10.5.0.2:8080/

![alt text](docs/spark-master.png "Spark master UI")

### Spark Worker 1

http://10.5.0.3:8081/

![alt text](docs/spark-worker-1.png "Spark worker 1 UI")

### Spark Worker 2

http://10.5.0.4:8081/

![alt text](docs/spark-worker-2.png "Spark worker 2 UI")

### Spark Worker 3

http://10.5.0.5:8081/

![alt text](docs/spark-worker-3.png "Spark worker 3 UI")

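Besides the static IPs, docker-compose.yml also publishes the web UI ports on the host (8080 for the master, 8081–8083 for the workers), so a reachability check from the host could look like this:

```sh
# The compose file maps the web UIs to host ports 8080-8083
curl -sI http://localhost:8080/ | head -n 1   # master
curl -sI http://localhost:8081/ | head -n 1   # worker 1
curl -sI http://localhost:8082/ | head -n 1   # worker 2
curl -sI http://localhost:8083/ | head -n 1   # worker 3
```
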
# Resource Allocation

This cluster ships with three workers and one Spark master, each of which has a particular resource allocation (basically RAM & CPU core allocation).

* The default CPU core allocation for each Spark worker is 1 core.

* The default RAM for each spark-worker is 1024 MB.

* The default RAM allocation for Spark executors is 256 MB.

* The default RAM allocation for the Spark driver is 128 MB.

* If you wish to modify these allocations, just edit the env/spark-worker.sh file (see the sketch below).

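The env/spark-worker.sh file itself is not shown in this diff. As a rough sketch only, an env file matching the defaults above could look like the following (the variable names are standard Spark worker/daemon settings; the real file may differ):

```sh
# Hypothetical sketch of env/spark-worker.sh -- values mirror the defaults listed above
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1G
SPARK_DRIVER_MEMORY=128m
SPARK_EXECUTOR_MEMORY=256m
```
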
# Bind-Mounted Volumes

To make running apps easier, I've shipped two volume mounts, described in the following chart:

Host Mount|Container Mount|Purpose
---|---|---
/mnt/spark-apps|/opt/spark-apps|Used to make your app's JARs available on all workers & the master
/mnt/spark-data|/opt/spark-data|Used to make your app's data available on all workers & the master

This is basically a dummy DFS created from docker volumes... (maybe not...)

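These are plain bind mounts; if the host directories don't exist, Docker will create them owned by root, so it is easier to create them yourself up front:

```sh
# Create the host directories backing the bind mounts (adjust ownership to taste)
sudo mkdir -p /mnt/spark-apps /mnt/spark-data
sudo chown -R "$USER" /mnt/spark-apps /mnt/spark-data
```
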
# Run a sample application

Now let's make a **wild spark-submit** to validate the distributed nature of our new toy, following these steps:

## Create a Scala spark app

The first thing you need to do is make a Spark application. Our spark-submit image is designed to run Scala code (PySpark support will ship soon; I guess I was just too lazy to do it yet).

In my case I am using an app called [crimes-app](https://). You can make or use your own Scala app; I've just used this one because I had it at hand.

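How you produce the JAR is up to you; the build/libs path used below suggests a Gradle project, so, as a sketch only (assuming a Gradle wrapper is present in your app), building the bundle might look like:

```sh
# Hypothetical build of the sample app -- adjust to your own project and build tool
cd /home/workspace/crimes-app
./gradlew clean build
ls build/libs/
```
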
## Ship your JAR & dependencies to the Workers and Master

A necessary step before a **spark-submit** is to copy your application bundle to all workers, along with any configuration or input files it needs.

Luckily for us, we are using docker volumes, so you just have to copy your app and configs into /mnt/spark-apps, and your input files into /mnt/spark-data.

```bash
# Copy the Spark application into every worker's app folder
cp /home/workspace/crimes-app/build/libs/crimes-app.jar /mnt/spark-apps

# Copy the Spark application configs into every worker's app folder
cp -r /home/workspace/crimes-app/config /mnt/spark-apps

# Copy the file to be processed into every worker's data folder
cp /home/Crimes_-_2001_to_present.csv /mnt/spark-data
```

## Check the successful copy of the data and app JAR (optional)

This is not a necessary step; if you are curious, you can check whether your app code and files are in place before running the spark-submit.

```sh
# Worker 1 validations
docker exec -ti spark-worker-1 ls -l /opt/spark-apps

docker exec -ti spark-worker-1 ls -l /opt/spark-data

# Worker 2 validations
docker exec -ti spark-worker-2 ls -l /opt/spark-apps

docker exec -ti spark-worker-2 ls -l /opt/spark-data

# Worker 3 validations
docker exec -ti spark-worker-3 ls -l /opt/spark-apps

docker exec -ti spark-worker-3 ls -l /opt/spark-data
```

After running any of these commands you should see your app's JAR and files.

## Use docker spark-submit

```bash
# Create some variables to make the docker run command more readable
# App JAR environment variable used by the spark-submit image
SPARK_APPLICATION_JAR_LOCATION="/opt/spark-apps/crimes-app.jar"
# App main class environment variable used by the spark-submit image
SPARK_APPLICATION_MAIN_CLASS="org.mvb.applications.CrimesApp"
# Extra submit args used by the spark-submit image
SPARK_SUBMIT_ARGS="--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/dev/config.conf'"

# We have to use the same network as the spark cluster (internally the image resolves the master as spark://spark-master:7077)
docker run --network docker-spark-cluster_spark-network -v /mnt/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS --env SPARK_SUBMIT_ARGS="$SPARK_SUBMIT_ARGS" spark-submit:2.3.1
```

After running this you will see an output pretty much like this:

```bash
Running Spark using the REST application submission protocol.
2018-09-23 15:17:52 INFO RestSubmissionClient:54 - Submitting a request to launch an application in spark://spark-master:6066.
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Submission successfully created as driver-20180923151753-0000. Polling submission state...
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180923151753-0000 in spark://spark-master:6066.
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - State of driver driver-20180923151753-0000 is now RUNNING.
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Driver is running on worker worker-20180923151711-10.5.0.4-45381 at 10.5.0.4:45381.
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20180923151753-0000",
  "serverSparkVersion" : "2.3.1",
  "submissionId" : "driver-20180923151753-0000",
  "success" : true
}
```

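Since the job goes in through the master's REST submission port (6066), you can also poll the driver state afterwards. A sketch, reusing the submission id from the output above and assuming the master IP is reachable from where you run it:

```bash
# Ask the standalone master's REST API for the state of the submitted driver
curl http://10.5.0.2:6066/v1/submissions/status/driver-20180923151753-0000
```
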
# Summary (What have I done :O?)

* We compiled the necessary docker images to run Spark master and worker containers.

* We created a Spark standalone cluster with 3 worker nodes and 1 master node using docker & docker-compose.

* We copied the resources necessary to run a sample application.

* We submitted an application to the cluster using a **spark-submit** docker image.

* We ran a distributed application at home (you just need enough CPU cores and RAM to do so).

# Why a standalone cluster?

* This is intended to be used for testing purposes, basically a way of running distributed Spark apps on your laptop or desktop.

* Right now I don't have enough resources to make a YARN, Mesos or Kubernetes based cluster :(.

* This will be useful for wiring CI/CD pipelines for your Spark apps (a really difficult and hot topic).

build-images.sh

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
#!/bin/bash

set -e

docker build -t spark-base:2.3.1 ./docker/base
docker build -t spark-master:2.3.1 ./docker/spark-master
docker build -t spark-worker:2.3.1 ./docker/spark-worker
docker build -t spark-submit:2.3.1 ./docker/spark-submit

docker-compose.yml

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
version: "3.7"
services:
  spark-master:
    image: spark-master:2.3.1
    container_name: spark-master
    hostname: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      spark-network:
        ipv4_address: 10.5.0.2
    volumes:
      - /mnt/spark-apps:/opt/spark-apps
      - /mnt/spark-data:/opt/spark-data
    environment:
      - "SPARK_LOCAL_IP=spark-master"
  spark-worker-1:
    image: spark-worker:2.3.1
    container_name: spark-worker-1
    hostname: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    env_file: ./env/spark-worker.sh
    environment:
      - "SPARK_LOCAL_IP=spark-worker-1"
    networks:
      spark-network:
        ipv4_address: 10.5.0.3
    volumes:
      - /mnt/spark-apps:/opt/spark-apps
      - /mnt/spark-data:/opt/spark-data
  spark-worker-2:
    image: spark-worker:2.3.1
    container_name: spark-worker-2
    hostname: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8081"
    env_file: ./env/spark-worker.sh
    environment:
      - "SPARK_LOCAL_IP=spark-worker-2"
    networks:
      spark-network:
        ipv4_address: 10.5.0.4
    volumes:
      - /mnt/spark-apps:/opt/spark-apps
      - /mnt/spark-data:/opt/spark-data
  spark-worker-3:
    image: spark-worker:2.3.1
    container_name: spark-worker-3
    hostname: spark-worker-3
    depends_on:
      - spark-master
    ports:
      - "8083:8081"
    env_file: ./env/spark-worker.sh
    environment:
      - "SPARK_LOCAL_IP=spark-worker-3"
    networks:
      spark-network:
        ipv4_address: 10.5.0.5
    volumes:
      - /mnt/spark-apps:/opt/spark-apps
      - /mnt/spark-data:/opt/spark-data
networks:
  spark-network:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 10.5.0.0/16

docker/base/Dockerfile

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
FROM java:8-jdk-alpine

ENV DAEMON_RUN=true
ENV SPARK_VERSION=2.3.1
ENV HADOOP_VERSION=2.7
ENV SCALA_VERSION=2.12.4
ENV SCALA_HOME=/usr/share/scala

# Scala installation
RUN apk add --no-cache --virtual=.build-dependencies wget ca-certificates && \
    apk add --no-cache bash curl jq && \
    cd "/tmp" && \
    wget --no-verbose "https://downloads.typesafe.com/scala/${SCALA_VERSION}/scala-${SCALA_VERSION}.tgz" && \
    tar xzf "scala-${SCALA_VERSION}.tgz" && \
    mkdir "${SCALA_HOME}" && \
    rm "/tmp/scala-${SCALA_VERSION}/bin/"*.bat && \
    mv "/tmp/scala-${SCALA_VERSION}/bin" "/tmp/scala-${SCALA_VERSION}/lib" "${SCALA_HOME}" && \
    ln -s "${SCALA_HOME}/bin/"* "/usr/bin/" && \
    apk del .build-dependencies && \
    rm -rf "/tmp/"*

# sbt installation
RUN export PATH="/usr/local/sbt/bin:$PATH" && apk update && apk add ca-certificates wget tar && mkdir -p "/usr/local/sbt" && wget -qO - --no-check-certificate "https://cocl.us/sbt-0.13.16.tgz" | tar xz -C /usr/local/sbt --strip-components=1 && sbt sbtVersion

# Python installation
RUN apk add --no-cache python3

# Spark installation
RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
    && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

docker/spark-master/Dockerfile

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
FROM spark-base:2.3.1

COPY start-master.sh /

ENV SPARK_MASTER_PORT 7077
ENV SPARK_MASTER_WEBUI_PORT 8080
ENV SPARK_MASTER_LOG /spark/logs

EXPOSE 8080 7077 6066

CMD ["/bin/bash", "/start-master.sh"]

docker/spark-master/start-master.sh

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
#!/bin/bash

export SPARK_MASTER_HOST=`hostname`

. "/spark/sbin/spark-config.sh"

. "/spark/bin/load-spark-env.sh"

mkdir -p $SPARK_MASTER_LOG

export SPARK_HOME=/spark

ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out

cd /spark/bin && /spark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG/spark-master.out

docker/spark-submit/Dockerfile

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
FROM spark-base:2.3.1

COPY spark-submit.sh /

ENV SPARK_MASTER_URL="spark://spark-master:7077"
ENV SPARK_SUBMIT_ARGS=""
ENV SPARK_APPLICATION_ARGS ""
#ENV SPARK_APPLICATION_JAR_LOCATION /opt/spark-apps/myjar.jar
#ENV SPARK_APPLICATION_MAIN_CLASS my.main.Application

CMD ["/bin/bash", "/spark-submit.sh"]

docker/spark-submit/spark-submit.sh

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
#!/bin/bash

/spark/bin/spark-submit \
  --class ${SPARK_APPLICATION_MAIN_CLASS} \
  --master ${SPARK_MASTER_URL} \
  --deploy-mode cluster \
  --total-executor-cores 1 \
  ${SPARK_SUBMIT_ARGS} \
  ${SPARK_APPLICATION_JAR_LOCATION} \
  ${SPARK_APPLICATION_ARGS}

docker/spark-worker/Dockerfile

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
FROM spark-base:2.3.1

COPY start-worker.sh /

ENV SPARK_WORKER_WEBUI_PORT 8081
ENV SPARK_WORKER_LOG /spark/logs
ENV SPARK_MASTER "spark://spark-master:7077"

EXPOSE 8081

CMD ["/bin/bash", "/start-worker.sh"]
