This is the material for the 2019 Silicon Valley Code Camp Session "Realish Time Predictive Analytics with Spark Structured Streaming"
Find me on Twitter: @newfront
Find me on Medium: @newfrontcreative
About Twilio: Twilio
If you are planning to follow along during the presentation, you will need the following installed on your local machine.
- Docker
- System Terminal (iTerm, Terminal, etc.)
- Working Web Browser (Chrome or Firefox)
- Install Docker Desktop (https://www.docker.com/products/docker-desktop)

Additional Docker Resources:
- 2 or more CPU cores
- 8 GB of RAM or higher
Zeppelin Project Info: https://zeppelin.apache.org/docs/latest/interpreter/spark.html
IMPORTANT: Notebooks use Spark 2.4.4
You must download Spark 2.4.4 in order to run the examples. Using `wget`:

```bash
brew install wget && cd ~/Desktop && wget http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz && tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz && mv spark-2.4.4-bin-hadoop2.7 spark-2.4.4
```

Or using `curl`:

```bash
curl -XGET http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz > ~/Desktop/spark-2.4.4.tgz && cd ~/Desktop && tar -xvzf spark-2.4.4.tgz && rm spark-2.4.4.tgz && mv spark-2.4.4-bin-hadoop2.7 spark-2.4.4
```
- Export SPARK_HOME to your bash/terminal session: `export SPARK_HOME=~/Desktop/spark-2.4.4/`
- Starting the Zeppelin environment: `cd /path/to/svcc-2019-realish-spark/ && ./install.sh && ./run.sh deployCustom && ./run.sh start`
- Stopping the environment: `cd /path/to/svcc-2019-realish-spark/ && ./run.sh stop`
- Zeppelin should be running on http://localhost:8080
- Click on the Notebooks `SVCC` and start with `1-SparkIntro`. Click the Play button at the top of the notebook (Play All) to load and run everything. Follow the links at the bottom of each notebook to go from 1 to 5. Enjoy!
- Zeppelin Configuration http://localhost:8080/#/configuration
- Zeppelin Interpreters http://localhost:8080/#/interpreter (configure spark here)
- If you want to use `spark-redis` from Zeppelin (for Notebook #5), update the Zeppelin Interpreter for Spark, add `com.redislabs:spark-redis:2.4.0` under Dependencies, and restart the interpreter. If you have problems getting it to install, try running `docker exec -it zeppelin /spark/bin/spark-shell` and check the interpreter's local dependency repository (`ZEPPELIN_INTERPRETER_LOCALREPO`).
- Suggested Spark configurations (double the Docker cores; use 1 GB less than the total Docker memory allocation):
  - spark.cores.max: 4
  - spark.executor.memory: 12g
- Jump onto the Docker process: `docker exec -it zeppelin bash`
- View env variables: `cat /conf/zeppelin-env.sh.template`
- See `SPARK_HOME`; you can now point this to a locally installed Spark.
- Loading data (read / readStream); a batch read/write sketch follows this list
- Schemas / Schema Inference
- StructTypes and structures
- Creating tempViews and SparkSQL
- Writing data (write / writeStream)
- Discuss the built-in `org.apache.spark.sql.functions._` (plus `describe` and the `stat` methods) and how to explore data statistics; see the exploration sketch after this list
- Show how to transform a DataFrame (withColumn)
- Working with timestamps / windowing (sql functions, groupBy and groupBy(window))
- Show how to add a UDF (zScoreNormalize)
- Zeppelin graphing to the rescue: visualizing the EDA flow
- Joins and fixing data (field imputation strategy)
- Introduction to SparkML Pipelines (a pipeline sketch follows this list)
- Introduction to SparkML Transformers
- Introduction to basic SparkML Models
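Below is a minimal batch sketch of the read/readStream, schema/StructType, tempView/SparkSQL, and write/writeStream items above. The file paths and the sensor_id/temperature/event_time columns are illustrative assumptions only, not the session's actual dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In Zeppelin the session is already available as `spark`
val spark = SparkSession.builder.appName("svcc-batch-sketch").getOrCreate()

// An explicit StructType schema avoids a second inference pass over the data
val schema = StructType(Seq(
  StructField("sensor_id", StringType, nullable = false),
  StructField("temperature", DoubleType, nullable = true),
  StructField("event_time", TimestampType, nullable = true)
))

// Batch read (spark.readStream would take the same schema for a streaming source)
val readings = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("/path/to/readings.csv")

// Register a temp view so the same DataFrame can be queried with SparkSQL
readings.createOrReplaceTempView("readings")
val hot = spark.sql("SELECT sensor_id, temperature FROM readings WHERE temperature > 30.0")

// Batch write (writeStream is the streaming counterpart)
hot.write.mode("overwrite").parquet("/path/to/output/hot_readings")
```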
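Continuing with the same hypothetical readings DataFrame, a sketch of the exploration and transformation items: describe for statistics, a withColumn transform, groupBy(window) over the event time, and a zScoreNormalize-style UDF.

```scala
import org.apache.spark.sql.functions._

// Quick summary statistics for exploration (stat methods such as approxQuantile are also available)
readings.describe("temperature").show()

// Pull the mean and standard deviation once so the UDF can normalize against them
val statsRow = readings.agg(avg("temperature"), stddev("temperature")).first()
val (tMean, tStd) = (statsRow.getDouble(0), statsRow.getDouble(1))

// A z-score normalization UDF, similar in spirit to the notebook's zScoreNormalize
val zScoreNormalize = udf((value: Double) => (value - tMean) / tStd)

// withColumn adds the normalized column; groupBy(window(...)) buckets by event time
val normalized = readings.withColumn("temperature_z", zScoreNormalize(col("temperature")))

val perMinute = normalized
  .groupBy(window(col("event_time"), "1 minute"), col("sensor_id"))
  .agg(avg("temperature_z").as("avg_z"))

perMinute.show(truncate = false)
```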
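And a sketch of chaining Transformers and a basic model into a SparkML Pipeline. The tiny inline training set, the humidity and label columns, the LinearRegression choice, and the save path are assumptions for illustration; the notebooks build the pipeline for the session's own use case.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import spark.implicits._

// A tiny hypothetical training set: feature columns in, a numeric label out
val trainingData = Seq(
  (0.2, 1.1, 10.0),
  (0.5, 0.9, 12.5),
  (0.8, 1.4, 15.0)
).toDF("temperature_z", "humidity", "label")

// Transformer: assemble raw feature columns into the single vector column SparkML expects
val assembler = new VectorAssembler()
  .setInputCols(Array("temperature_z", "humidity"))
  .setOutputCol("features")

// Estimator: a basic model chosen only for illustration
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

// Chain the stages into a single Pipeline and fit it
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model = pipeline.fit(trainingData)

// Persist the fitted PipelineModel so a streaming job can load and apply it later
model.write.overwrite().save("/path/to/models/readings-model")
```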
Spark Structured Streaming (loading a model into a stream). Start here and work backwards through the use case (5 minutes). A streaming scoring sketch follows the list below.
- Read a Stream of data from Redis/Kafka
- Transform the data
- Apply a new column to the DataFrame by passing it through a loaded SparkML model
- Write the prediction Stream to Redis/Kafka
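A sketch of the streaming flow above, shown against Kafka (the notebooks also cover the Redis variant). The topic names, bootstrap servers, event schema, model path, and checkpoint location are assumptions; the point is the shape: readStream, transform, score with the loaded PipelineModel, writeStream.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Load the PipelineModel fitted and saved in the batch notebooks
val model = PipelineModel.load("/path/to/models/readings-model")

// Schema of the incoming JSON events (an assumption; it must contain the model's feature columns)
val eventSchema = StructType(Seq(
  StructField("sensor_id", StringType),
  StructField("temperature_z", DoubleType),
  StructField("humidity", DoubleType)
))

// Read a stream of events from Kafka and parse the JSON payload
// (requires the spark-sql-kafka-0-10 package on the classpath)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "sensor-readings")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("event"))
  .select("event.*")

// Scoring the stream adds a "prediction" column via the loaded model
val scored = model.transform(events)

// Write the prediction stream back to Kafka (only plain columns are serialized to JSON)
val query = scored
  .select(col("sensor_id").as("key"),
          to_json(struct(col("sensor_id"), col("prediction"))).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "sensor-predictions")
  .option("checkpointLocation", "/tmp/checkpoints/sensor-predictions")
  .start()

query.awaitTermination()
```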