# Spark Cluster with Docker & docker-compose

# General

A simple Spark standalone cluster for your testing environment purposes. A *docker-compose up* away from your solution for your Spark development environment.

The Docker compose will create the following containers:

container|IP address
---|---
spark-master|10.5.0.2
spark-worker-1|10.5.0.3
spark-worker-2|10.5.0.4
spark-worker-3|10.5.0.5

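These fixed IPs come from a user-defined bridge network with a static subnet declared in the compose file. Below is a minimal sketch of how that is typically wired up; the network name matches the one used later in the spark-submit step, but the exact compose file in this repo may differ in its details:

```yaml
# Sketch only: how static IPs like 10.5.0.2 are usually assigned in docker-compose
version: "3"
services:
  spark-master:
    image: spark-master:2.3.1
    networks:
      spark-network:
        ipv4_address: 10.5.0.2
networks:
  spark-network:
    driver: bridge
    ipam:
      config:
        - subnet: 10.5.0.0/16
```
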
# Installation

The following steps will get your Spark cluster's containers up and running.

## Prerequisites

* Docker installed

* Docker Compose installed (you can verify both installs with the commands below)

* A Spark application JAR to play with (optional)

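A quick way to confirm the first two prerequisites (standard version checks, nothing specific to this repo):

```sh
docker --version
docker-compose --version
```
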
## Build the images

The first step to deploy the cluster is building the custom images; these builds can be performed with the *build-images.sh* script (a sketch of what it does is shown at the end of this section).

Running it is as simple as:

```sh
chmod +x build-images.sh
./build-images.sh
```

This will create the following docker images:

* spark-base:2.3.1: A base image based on java:alpine-jdk-8 which ships Scala, Python 3 and Spark 2.3.1.

* spark-master:2.3.1: An image based on the previously created spark-base image, used to create Spark master containers.

* spark-worker:2.3.1: An image based on the previously created spark-base image, used to create Spark worker containers.

* spark-submit:2.3.1: An image based on the previously created spark-base image, used to create spark-submit containers (run, deliver the driver and die gracefully).

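If you are curious, the script essentially runs one `docker build` per image. A rough sketch is shown below; the Dockerfile locations are assumptions, so check the actual script for the real paths:

```sh
# Sketch of what build-images.sh roughly does (directory names are assumptions)
docker build -t spark-base:2.3.1 ./docker/base
docker build -t spark-master:2.3.1 ./docker/spark-master
docker build -t spark-worker:2.3.1 ./docker/spark-worker
docker build -t spark-submit:2.3.1 ./docker/spark-submit
```
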
## Run the docker-compose

The final step to create your test cluster is to run the compose file:

```sh
docker-compose up
```

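If you prefer to keep your terminal free, you can also start the cluster in detached mode and tear it down when you are done:

```sh
docker-compose up -d   # start the cluster in the background
docker-compose down    # stop and remove the containers
```
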
## Validate your cluster

Just validate your cluster by accessing the Spark UI on the master and each worker URL (a command-line check is sketched at the end of this section).

### Spark Master

http://10.5.0.2:8080/

### Spark Worker 1

http://10.5.0.3:8081/

### Spark Worker 2

http://10.5.0.4:8081/

### Spark Worker 3

http://10.5.0.5:8081/

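If you would rather check from a terminal, a quick loop over the web UIs should print a 200 status for every node (this assumes curl is available on your host):

```sh
# Expect HTTP 200 from the master (port 8080) and each worker (port 8081)
for url in 10.5.0.2:8080 10.5.0.3:8081 10.5.0.4:8081 10.5.0.5:8081; do
  curl -s -o /dev/null -w "%{http_code}  http://$url/\n" "http://$url/"
done
```
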
# Resource Allocation

This cluster ships with three workers and one Spark master, each of them with a particular resource allocation (basically RAM & CPU core allocation).

* The default CPU core allocation for each Spark worker is 1 core.

* The default RAM for each spark-worker is 1024 MB.

* The default RAM allocation for Spark executors is 256 MB.

* The default RAM allocation for the Spark driver is 128 MB.

* If you wish to modify these allocations, just edit the env/spark-worker.sh file (a sketch of it is shown below).

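For illustration, the file looks roughly like the sketch below. The variable names are standard Spark standalone settings and the values mirror the defaults listed above, but treat this as an approximation and check the actual file in the repo:

```sh
# env/spark-worker.sh (sketch only; verify against the real file)
SPARK_WORKER_CORES=1          # CPU cores offered by each worker
SPARK_WORKER_MEMORY=1024m     # RAM offered by each worker
SPARK_EXECUTOR_MEMORY=256m    # default executor memory
SPARK_DRIVER_MEMORY=128m      # default driver memory
```
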
# Bound Volumes

To make running apps easier I've shipped two volume mounts, described in the following chart:

Host Mount|Container Mount|Purpose
---|---|---
/mnt/spark-apps|/opt/spark-apps|Used to make your app's JARs available on all workers & the master
/mnt/spark-data|/opt/spark-data|Used to make your app's data available on all workers & the master

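In docker-compose terms these are plain bind mounts declared on the master and every worker, along these lines (sketch only; the rest of the service definition is omitted):

```yaml
# Sketch: the same two bind mounts are declared for the master and each worker
services:
  spark-worker-1:
    volumes:
      - /mnt/spark-apps:/opt/spark-apps
      - /mnt/spark-data:/opt/spark-data
```
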
This is basically a dummy DFS created from Docker volumes... (well, maybe not).

# Run a sample application

Now let's make a **wild spark-submit** to validate the distributed nature of our new toy, following these steps:

## Create a Scala Spark app

The first thing you need to do is to make a Spark application. Our spark-submit image is designed to run Scala code (PySpark support will ship soon; I guess I was just too lazy to do it so far).

In my case I am using an app called [crimes-app](https://). You can make or use your own Scala app; I've just used this one because I had it at hand.

## Ship your jar & dependencies to the Workers and Master

A necessary step before a **spark-submit** is to copy your application bundle to all workers, along with any configuration or input files you need.

Luckily for us we are using Docker volumes, so you just have to copy your app and configs into /mnt/spark-apps, and your input files into /mnt/spark-data.

```bash
# Copy the spark application into all workers' app folder
cp /home/workspace/crimes-app/build/libs/crimes-app.jar /mnt/spark-apps

# Copy the spark application configs into all workers' app folder
cp -r /home/workspace/crimes-app/config /mnt/spark-apps

# Copy the file to be processed into all workers' data folder
cp /home/Crimes_-_2001_to_present.csv /mnt/spark-data
```

## Check the successful copy of the data and app jar (Optional)

This is not a necessary step; if you are curious, you can check that your app code and files are in place before running the spark-submit.

```sh
# Worker 1 validations
docker exec -ti spark-worker-1 ls -l /opt/spark-apps

docker exec -ti spark-worker-1 ls -l /opt/spark-data

# Worker 2 validations
docker exec -ti spark-worker-2 ls -l /opt/spark-apps

docker exec -ti spark-worker-2 ls -l /opt/spark-data

# Worker 3 validations
docker exec -ti spark-worker-3 ls -l /opt/spark-apps

docker exec -ti spark-worker-3 ls -l /opt/spark-data
```

After running any of these commands you should see your app's jar and files.

## Use docker spark-submit

```bash
# Create some variables to make the docker run command more readable
# App jar environment variable used by the spark-submit image
SPARK_APPLICATION_JAR_LOCATION="/opt/spark-apps/crimes-app.jar"
# App main class environment variable used by the spark-submit image
SPARK_APPLICATION_MAIN_CLASS="org.mvb.applications.CrimesApp"
# Extra submit args used by the spark-submit image
SPARK_SUBMIT_ARGS="--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/dev/config.conf'"

# We have to use the same network as the spark cluster (internally the image resolves the master as spark://spark-master:7077)
docker run --network docker-spark-cluster_spark-network \
  -v /mnt/spark-apps:/opt/spark-apps \
  --env SPARK_APPLICATION_JAR_LOCATION="$SPARK_APPLICATION_JAR_LOCATION" \
  --env SPARK_APPLICATION_MAIN_CLASS="$SPARK_APPLICATION_MAIN_CLASS" \
  --env SPARK_SUBMIT_ARGS="$SPARK_SUBMIT_ARGS" \
  spark-submit:2.3.1
```

After running this you will see an output pretty much like this:

```bash
Running Spark using the REST application submission protocol.
2018-09-23 15:17:52 INFO RestSubmissionClient:54 - Submitting a request to launch an application in spark://spark-master:6066.
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Submission successfully created as driver-20180923151753-0000. Polling submission state...
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180923151753-0000 in spark://spark-master:6066.
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - State of driver driver-20180923151753-0000 is now RUNNING.
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Driver is running on worker worker-20180923151711-10.5.0.4-45381 at 10.5.0.4:45381.
2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20180923151753-0000",
  "serverSparkVersion" : "2.3.1",
  "submissionId" : "driver-20180923151753-0000",
  "success" : true
}
```

# Summary (What have I done :O?)

* We compiled the necessary docker images to run Spark master and worker containers.

* We created a Spark standalone cluster using 3 worker nodes and 1 master node using docker && docker-compose.

* We copied the resources necessary to run a sample application.

* We submitted an application to the cluster using a **spark-submit** docker image.

* We ran a distributed application at home (you just need enough CPU cores and RAM to do so).

# Why a standalone cluster?

* This is intended to be used for test purposes, basically a way of running distributed Spark apps on your laptop or desktop.

* Right now I don't have enough resources to make a YARN, Mesos or Kubernetes based cluster :(.

* This will be useful for building CI/CD pipelines for your Spark apps (a really difficult and hot topic).