Keyvan Soleimani

Spark On Kubernetes

Spark on Kubernetes via a Helm chart

This article introduces Apache Spark on Kubernetes, a feature that was officially announced as production-ready and Generally Available with the Apache Spark 3.1 release in March 2021.

To gain hands-on experience with the feature, we walk through running Apache Spark on a local Kubernetes cluster, exploring the nuances of environment setup and the arguments used when submitting jobs.


The control-plane and worker node addresses are:

192.168.56.115
192.168.56.116
192.168.56.117

Kubernetes cluster nodes:

[Image: Kubernetes cluster nodes]
You can install Helm by following the official installation guide: https://helm.sh/docs/intro/install/


Installing Apache Spark

The steps:

1) Install Spark via the Bitnami Helm chart:

$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm search repo bitnami
$ helm install kayvan-release oci://registry-1.docker.io/bitnamicharts/spark
$ helm upgrade kayvan-release bitnami/spark --set worker.replicaCount=5

The six installed pods (one master plus five workers):

[Image: Spark pods]

And the Services (headless, as required for the StatefulSet):

[Image: Spark services]

And the Spark master UI:

[Image: Spark master web UI]
2) Run the commands below against the cluster with kubectl:

kubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit \
 --class org.apache.spark.examples.SparkPi \
 --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
 ./examples/jars/spark-examples_2.12-3.4.1.jar 1000

or

kubectl exec -it kayvan-release-spark-master-0 -- /bin/bash


./bin/spark-submit \
 --class org.apache.spark.examples.SparkPi \
 --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
 ./examples/jars/spark-examples_2.12-3.4.1.jar 1000


# for Python scripts, no --class argument is needed
./bin/spark-submit \
 --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
 ./examples/src/main/python/pi.py 1000


./bin/spark-submit \
 --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
 ./examples/src/main/python/wordcount.py //filepath

The exact Scala and Python code for spark-examples_2.12-3.4.1.jar, pi.py, and wordcount.py:

SparkPi.scala

pi.py

wordcount.py
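
For orientation, pi.py estimates π by Monte Carlo sampling: it throws random points into the unit square and counts how many land inside the unit circle. A condensed sketch of its logic (the trailing 1000 in the commands above is the number of partitions):

import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions  # total number of random samples

def inside(_):
    # draw a point in the square [-1, 1] x [-1, 1]; a hit is inside the unit circle
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()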

3) The final result 🍹:

For Scala:

[Image: SparkPi result (Scala)]

For Python:

[Image: pi.py result]

[Image: wordcount.py result]


Another Python Program

1) Copy people.csv (a large file) into the Spark worker pods:


kubectl cp people.csv kayvan-release-spark-worker-{x}:/opt/bitnami/spark

Notes:

  • You can download the sample people.csv file from the link.
  • Instead of copying the file into each pod, you can also mount an NFS shared folder and read the large CSV from it; see the sketch below.
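
A minimal sketch of the NFS variant, assuming the share is mounted at /mnt/data in every pod (the mount point is hypothetical; adjust it to your volume configuration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadFromNFS").getOrCreate()

# /mnt/data is a hypothetical NFS mount point shared by all pods;
# every executor can then read the file without a per-pod copy
df = spark.read.options(delimiter=",", header=True).csv("/mnt/data/people.csv")
df.show()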

2) Write the following Python code in readcsv.py:

from pyspark.sql import SparkSession

# create (or reuse) a SparkSession
spark = SparkSession\
 .builder\
 .appName("Mahla")\
 .getOrCreate()

path = "people.csv"

# read the CSV file with a header row and comma delimiter
df = spark.read.options(delimiter=",", header=True).csv(path)

df.show()

# register the DataFrame as a temp view so it can be queried with SQL
df.createOrReplaceTempView("peopletable")
df2 = spark.sql("select Sex, count(1) countsex, sum(Index) sex_sum "
                "from peopletable group by Sex")

df2.show()
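
For comparison, the same aggregation can be written with the DataFrame API instead of SQL; a sketch assuming the df and column names (Sex, Index) from readcsv.py above:

from pyspark.sql import functions as F

# group by Sex, counting rows and summing the Index column,
# mirroring the SQL query in readcsv.py
df2 = (df.groupBy("Sex")
         .agg(F.count(F.lit(1)).alias("countsex"),
              F.sum("Index").alias("sex_sum")))
df2.show()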

3) Copy the readcsv.py file into the Spark master pod:

kubectl cp readcsv.py kayvan-release-spark-master-0:/opt/bitnami/spark

4) Run the code (again, no --class argument is needed for a Python script):

kubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit \
   --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
   readcsv.py

5) Showing some data:

[Image: df.show() output]

6) The aggregated result:

[Image: grouped query output]

7) The time taken for processing:

[Image: job duration]


Another Python Program, on Docker Desktop

docker-compose.yml :

version: '3.6'

services:

  spark:
    container_name: spark
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root   
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
    ports:
      - 127.0.0.1:8081:8080

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
Bring up the stack with two workers:

docker-compose up --scale spark-worker=2
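
To confirm that both workers registered with the master, you can query the standalone master UI's JSON status endpoint; a quick sketch, assuming the 127.0.0.1:8081 -> 8080 port mapping from the compose file above:

import json
import urllib.request

# the Spark standalone master UI serves a machine-readable status page at /json/
with urllib.request.urlopen("http://127.0.0.1:8081/json/") as resp:
    status = json.load(resp)

print(status["status"])                         # expected: ALIVE
print([w["state"] for w in status["workers"]])  # expected: two ALIVE workers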

[Image: running Spark containers]

Copy the required files into the containers, for example:

docker cp file.csv spark-worker-1:/opt/bitnami/spark

Python code on the master:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Writingjson").getOrCreate()

# read the CSV and coalesce to 2 partitions, matching the two workers
df = spark.read.option("header", True).csv("csv/file.csv").coalesce(2)

df.show()

# write the data as JSON, partitioned into one directory per distinct name
df.write.partitionBy('name').mode('overwrite').format('json').save('file_name.json')
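
As a quick check, the partitioned output can be read back; a sketch assuming the same file_name.json output path, where Spark recovers the name column from the name=<value> directory layout:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Readingjson").getOrCreate()

# partition discovery turns the name=<value> directories back into a column
df_back = spark.read.json("file_name.json")
df_back.filter(df_back.name == "kayvan").show()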

Run the code in the Spark master Docker container (the hostname here is the master container's ID):

./bin/spark-submit --master spark://4f28330ce077:7077 csv/ctp.py

Showing some data on the master:

[Image: df.show() output on the master]
And the separate JSON files, split by the name partitioning, on worker 2:

[Image: partitioned JSON directories on worker 2]
Data for name=kayvan:

[Image: JSON data for name=kayvan]
Congratulations 🍹
