This is the material for the 2019 Silicon Valley Code Camp Session "Realish Time Predictive Analytics with Spark Structured Streaming"
Find me on Twitter: @newfront
Find me on Medium: @newfrontcreative
About Twilio: Twilio
If you are planning to follow along during the presentation, you will need the following installed on your local machine.
- Docker
- System Terminal (iTerm, Terminal, etc.)
- Working Web Browser (Chrome or Firefox)
- Install Docker Desktop (https://www.docker.com/products/docker-desktop)

Additional Docker Resources:
- 2 or more CPU cores
- 8 GB of RAM or higher
Zeppelin Project Info: https://zeppelin.apache.org/docs/latest/interpreter/spark.html
IMPORTANT: Notebooks use Spark 2.4.4
You must download Spark 2.4.4 in order to run the examples. Using `wget`:

```bash
brew install wget && cd ~/Desktop && wget http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz && tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz && mv spark-2.4.4-bin-hadoop2.7 spark-2.4.4
```

Or using `curl`:

```bash
curl -XGET http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz > ~/Desktop/spark-2.4.4.tgz && cd ~/Desktop && tar -xvzf spark-2.4.4.tgz && rm spark-2.4.4.tgz && mv spark-2.4.4-bin-hadoop2.7 spark-2.4.4
```
- Export SPARK_HOME to your bash/terminal session: `export SPARK_HOME=~/Desktop/spark-2.4.4/`
- Starting the Zeppelin environment: `cd /path/to/svcc-2019-realish-spark/ && ./install.sh && ./run.sh deployCustom && ./run.sh start`
- Stopping the environment: `cd /path/to/svcc-2019-realish-spark/ && ./run.sh stop`
- Zeppelin should be running on http://localhost:8080
- Click on the Notebooks `SVCC` and start with `1-SparkIntro`. Click the Play button at the top of the notebook (Play All) to load and run everything. Follow the links at the bottom of each notebook to go from 1 to 5. Enjoy!
- Zeppelin Configuration http://localhost:8080/#/configuration
- Zeppelin Interpreters http://localhost:8080/#/interpreter (configure spark here)
- If you want to use `spark-redis` from Zeppelin (for Notebook #5), update the Zeppelin Interpreter for Spark, add `com.redislabs:spark-redis:2.4.0` under Dependencies, and restart the interpreter. If you have problems getting it to install, try running `docker exec -it zeppelin /spark/bin/spark-shell` and check the interpreter's local dependency repository (`ZEPPELIN_INTERPRETER_LOCALREPO`).
- Suggested Spark configurations (double the Docker cores; use 1 GB less than the total Docker memory allocation):
  - spark.cores.max: 4
  - spark.executor.memory: 12g
- Jump onto the Docker process: `docker exec -it zeppelin bash`
- View env variables: `cat /conf/zeppelin-env.sh.template`
- See `SPARK_HOME`; you can now point this to a locally installed Spark.
- Loading data (read / readStream); a batch read/write sketch follows this list
- Schemas / Schema Inference
- StructTypes and structures
- Creating tempViews and SparkSQL
- Writing data (write / writeStream)
- Discuss the built-in `org.apache.spark.sql.functions._` (plus `describe` and the `stat` methods) and how to explore data statistics; see the exploration sketch after this list
- Show how to transform a DataFrame (withColumn)
- Working with timestamps / windowing (sql functions, groupBy and groupBy(window))
- Show how to add a UDF (zScoreNormalize)
- Zeppelin graphing to the rescue: visualizing the EDA flow
- Joins and fixing data (field imputation strategy)
- Introduction to SparkML Pipelines (a pipeline sketch follows this list)
- Introduction to SparkML Transformers
- Introduction to basic SparkML Models
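Below is a minimal batch sketch of the read/readStream, schema/StructType, tempView/SparkSQL, and write/writeStream items above. The file paths and the sensor_id/temperature/event_time columns are illustrative assumptions only, not the session's actual dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In Zeppelin the session is already available as `spark`
val spark = SparkSession.builder.appName("svcc-batch-sketch").getOrCreate()

// An explicit StructType schema avoids a second inference pass over the data
val schema = StructType(Seq(
  StructField("sensor_id", StringType, nullable = false),
  StructField("temperature", DoubleType, nullable = true),
  StructField("event_time", TimestampType, nullable = true)
))

// Batch read (spark.readStream would take the same schema for a streaming source)
val readings = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("/path/to/readings.csv")

// Register a temp view so the same DataFrame can be queried with SparkSQL
readings.createOrReplaceTempView("readings")
val hot = spark.sql("SELECT sensor_id, temperature FROM readings WHERE temperature > 30.0")

// Batch write (writeStream is the streaming counterpart)
hot.write.mode("overwrite").parquet("/path/to/output/hot_readings")
```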
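Continuing with the same hypothetical readings DataFrame, a sketch of the exploration and transformation items: describe for statistics, a withColumn transform, groupBy(window) over the event time, and a zScoreNormalize-style UDF.

```scala
import org.apache.spark.sql.functions._

// Quick summary statistics for exploration (stat methods such as approxQuantile are also available)
readings.describe("temperature").show()

// Pull the mean and standard deviation once so the UDF can normalize against them
val statsRow = readings.agg(avg("temperature"), stddev("temperature")).first()
val (tMean, tStd) = (statsRow.getDouble(0), statsRow.getDouble(1))

// A z-score normalization UDF, similar in spirit to the notebook's zScoreNormalize
val zScoreNormalize = udf((value: Double) => (value - tMean) / tStd)

// withColumn adds the normalized column; groupBy(window(...)) buckets by event time
val normalized = readings.withColumn("temperature_z", zScoreNormalize(col("temperature")))

val perMinute = normalized
  .groupBy(window(col("event_time"), "1 minute"), col("sensor_id"))
  .agg(avg("temperature_z").as("avg_z"))

perMinute.show(truncate = false)
```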
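And a sketch of chaining Transformers and a basic model into a SparkML Pipeline. The tiny inline training set, the humidity and label columns, the LinearRegression choice, and the save path are assumptions for illustration; the notebooks build the pipeline for the session's own use case.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import spark.implicits._

// A tiny hypothetical training set: feature columns in, a numeric label out
val trainingData = Seq(
  (0.2, 1.1, 10.0),
  (0.5, 0.9, 12.5),
  (0.8, 1.4, 15.0)
).toDF("temperature_z", "humidity", "label")

// Transformer: assemble raw feature columns into the single vector column SparkML expects
val assembler = new VectorAssembler()
  .setInputCols(Array("temperature_z", "humidity"))
  .setOutputCol("features")

// Estimator: a basic model chosen only for illustration
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

// Chain the stages into a single Pipeline and fit it
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model = pipeline.fit(trainingData)

// Persist the fitted PipelineModel so a streaming job can load and apply it later
model.write.overwrite().save("/path/to/models/readings-model")
```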
Spark Structured Streaming (loading a model into a stream). Start here and work backwards through the use case (5 minutes). A streaming scoring sketch follows the list below.
- Read a Stream of data from Redis/Kafka
- Transform the data
- Apply a new column to the DataFrame by passing it through a loaded SparkML model
- Write the prediction Stream to Redis/Kafka
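A sketch of the streaming flow above, shown against Kafka (the notebooks also cover the Redis variant). The topic names, bootstrap servers, event schema, model path, and checkpoint location are assumptions; the point is the shape: readStream, transform, score with the loaded PipelineModel, writeStream.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Load the PipelineModel fitted and saved in the batch notebooks
val model = PipelineModel.load("/path/to/models/readings-model")

// Schema of the incoming JSON events (an assumption; it must contain the model's feature columns)
val eventSchema = StructType(Seq(
  StructField("sensor_id", StringType),
  StructField("temperature_z", DoubleType),
  StructField("humidity", DoubleType)
))

// Read a stream of events from Kafka and parse the JSON payload
// (requires the spark-sql-kafka-0-10 package on the classpath)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "sensor-readings")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("event"))
  .select("event.*")

// Scoring the stream adds a "prediction" column via the loaded model
val scored = model.transform(events)

// Write the prediction stream back to Kafka (only plain columns are serialized to JSON)
val query = scored
  .select(col("sensor_id").as("key"),
          to_json(struct(col("sensor_id"), col("prediction"))).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "sensor-predictions")
  .option("checkpointLocation", "/tmp/checkpoints/sensor-predictions")
  .start()

query.awaitTermination()
```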