Merged
Commits
31 commits
9c636e8
Introductory pages migrated
jaceklaskowski Jun 29, 2020
d1137af
Hive Data Source MOVED
jaceklaskowski Jun 29, 2020
e62263a
requirements
jaceklaskowski Jun 29, 2020
7295347
awesome plugin
jaceklaskowski Jun 29, 2020
6ddbc5a
Force mkdocs for readthedocs
jaceklaskowski Jun 29, 2020
ee6ce3d
Disable extension due to RTD error
jaceklaskowski Jun 29, 2020
3efaa6f
Notable Features MOVED
jaceklaskowski Jun 30, 2020
2f0db87
Developing Spark SQL Applications MOVED
jaceklaskowski Jun 30, 2020
b327db3
SparkSession Registries MIGRATED
jaceklaskowski Jul 1, 2020
8502be6
File-Based Data Sources MIGRATED
jaceklaskowski Jul 1, 2020
cdc3c48
Kafka Data Source MIGRATED
jaceklaskowski Jul 1, 2020
ea9b229
All Data Sources MIGRATED
jaceklaskowski Jul 1, 2020
f404e65
Data Source APIs MIGRATED
jaceklaskowski Jul 1, 2020
df545a5
Structured Query Execution MIGRATED
jaceklaskowski Jul 2, 2020
df24c30
Catalyst, Catalyst Expressions and Vectorized Parquet Decoding MIGRATED
jaceklaskowski Jul 2, 2020
013f33e
Menu reorg
jaceklaskowski Jul 2, 2020
6a2c51f
Base Logical Operators MIGRATED
jaceklaskowski Jul 2, 2020
72b2ed5
Concrete Logical Operators MIGRATED
jaceklaskowski Jul 2, 2020
7604124
SQL Support MIGRATED
jaceklaskowski Jul 2, 2020
6f2da2b
Tungsten Execution Backend MIGRATED
jaceklaskowski Jul 2, 2020
d51458b
Spark Thrift Server MIGRATED
jaceklaskowski Jul 2, 2020
1d44a54
Physical Operators MIGRATED
jaceklaskowski Jul 3, 2020
7a1d3ff
Concrete Physical Operators MIGRATED
jaceklaskowski Jul 3, 2020
cffe600
Logical Analysis Rules MIGRATED
jaceklaskowski Jul 3, 2020
7b49a27
Logical Optimizations MIGRATED
jaceklaskowski Jul 3, 2020
c813772
More sections MIGRATED
jaceklaskowski Jul 3, 2020
16eea5c
All sections MIGRATED
jaceklaskowski Jul 3, 2020
e48de88
Migration to mkdocs DONE!
jaceklaskowski Jul 3, 2020
1413165
README + gitignore
jaceklaskowski Jul 5, 2020
a163b8c
Page rename
jaceklaskowski Jul 5, 2020
ce674b6
MkDocs setup + Page renames
jaceklaskowski Jul 5, 2020
Introductory pages migrated
jaceklaskowski committed Jun 29, 2020
commit 9c636e8615921eff6364582c2f515044c5b76b81
67 changes: 67 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,67 @@
```yaml
# many settings are copied verbatim from mkdocs-material
# https://github.com/squidfunk/mkdocs-material/blob/master/mkdocs.yml

site_name: The Internals of Spark SQL
site_url: https://books.japila.pl/spark-sql-internals
site_author: Jacek Laskowski

docs_dir: mkdocs

repo_name: mastering-spark-sql-book
repo_url: https://github.com/jaceklaskowski/mastering-spark-sql-book
edit_uri: ""

copyright: Copyright © 2020 Jacek Laskowski

theme:
  name: 'material'
  language: en
  features:
    # - tabs
    - instant

markdown_extensions:
  - admonition
  - codehilite
  - toc:
      permalink: true
  - pymdownx.arithmatex
  - pymdownx.betterem:
      smart_enable: all
  - pymdownx.caret
  - pymdownx.critic
  - pymdownx.details
  - pymdownx.emoji:
      emoji_index: !!python/name:materialx.emoji.twemoji
      emoji_generator: !!python/name:materialx.emoji.to_svg
  - pymdownx.inlinehilite
  - pymdownx.magiclink
  - pymdownx.mark
  - pymdownx.smartsymbols
  - pymdownx.superfences
  - pymdownx.tasklist:
      custom_checkbox: true
  - pymdownx.tabbed
  - pymdownx.tilde

plugins:
  - search
  - minify:
      minify_html: true
  - git-revision-date-localized:
      type: timeago

extra:
  social:
    - icon: fontawesome/brands/github-alt
      link: https://github.com/jaceklaskowski
    - icon: fontawesome/brands/twitter
      link: https://twitter.com/jaceklaskowski
    - icon: fontawesome/brands/linkedin
      link: https://linkedin.com/in/jaceklaskowski

nav:
  - Home: index.md
  - spark-sql.md
  - spark-sql-dataset-rdd.md
  - spark-sql-dataset-vs-sql.md
```
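Entries in `nav` that carry no title (like `spark-sql.md` above) take the page's own H1 as the menu label. Later commits in this PR reorganize the menu; as a hypothetical sketch of what that looks like (the `Data Sources` section and its file names below are illustrative, not from this commit):

```yaml
nav:
  - Home: index.md
  - spark-sql.md            # untitled entries use the page's own H1
  - Data Sources:           # a nested section (illustrative names)
      - spark-sql-hive-integration.md
      - spark-sql-kafka-data-source.md
```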
17 changes: 17 additions & 0 deletions mkdocs/index.md
@@ -0,0 +1,17 @@
# The Internals of Spark SQL (Apache Spark 3.0.0)

Welcome to **The Internals of Spark SQL** online book!

I'm [Jacek Laskowski](https://pl.linkedin.com/in/jaceklaskowski), a freelance IT consultant specializing in [Apache Spark](https://spark.apache.org/), [Apache Kafka](https://kafka.apache.org/), [Delta Lake](https://delta.io/) and [Kafka Streams](https://kafka.apache.org/documentation/streams/).

I'm very excited to have you here and hope you will enjoy exploring the internals of Spark SQL as much as I have.

!!! quote "Flannery O'Connor"
I write to discover what I know.

??? note '"The Internals Of" series'
I'm also writing other online books in the "The Internals Of" series. Please visit ["The Internals Of" Online Books](https://books.japila.pl) home page.

Expect text and code snippets from a variety of public sources. Attribution follows.

Now, let me introduce you to [Spark SQL and Structured Queries](spark-sql.md).
6 files renamed without changes.
@@ -1,16 +1,16 @@
== Datasets vs DataFrames vs RDDs
# Datasets, DataFrames and RDDs

Many may have been asking themselves why they should use Datasets rather than RDDs of case classes, the foundation of all Spark.

This document collects advantages of `Dataset` vs `RDD[CaseClass]` to answer https://twitter.com/danosipov/status/704421546203308033[the question Dan has asked on twitter]:

> "In #Spark, what is the advantage of a DataSet over an RDD[CaseClass]?"

=== Saving to or Writing from Data Sources
## Saving to or Writing from Data Sources

With Dataset API, loading data from a data source or saving it to one is as simple as using <<spark-sql-SparkSession.adoc#read, SparkSession.read>> or <<spark-sql-dataset-operators.adoc#write, Dataset.write>> methods, respectively.

=== Accessing Fields / Columns
## Accessing Fields / Columns

You `select` columns in a dataset without worrying about their positions.
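The point is easier to see in a sketch. The following is plain Python, not the Spark API; the `select` helper is an illustrative stand-in for the Dataset operator. With name-based access a query keeps working even when the column order changes, whereas positional access silently returns the wrong field.

```python
# Plain-Python sketch (not Spark): name-based vs positional column access.

def select(rows, *cols):
    """Pick columns by name, regardless of their physical order."""
    return [tuple(row[c] for c in cols) for row in rows]

# The same record with two different column orders.
rows_v1 = [{"name": "Jacek", "age": 10}]
rows_v2 = [{"age": 10, "name": "Jacek"}]  # columns reordered

# Name-based access is stable across both layouts...
assert select(rows_v1, "name") == select(rows_v2, "name") == [("Jacek",)]

# ...while positional access depends on the layout and breaks silently.
tuples_v1 = [tuple(r.values()) for r in rows_v1]
tuples_v2 = [tuple(r.values()) for r in rows_v2]
assert tuples_v1[0][0] == "Jacek"
assert tuples_v2[0][0] != "Jacek"  # position 0 is now `age`
```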

@@ -1,4 +1,4 @@
== Dataset API vs SQL
# Dataset API and SQL

Spark SQL supports two "modes" to write structured queries: xref:spark-sql-dataset-operators.adoc[Dataset API] and xref:spark-sql-SparkSession.adoc#sql[SQL].

@@ -15,10 +15,3 @@ This section describes the differences between Spark SQL features to develop Spark applications using Dataset API and SQL mode
. link:spark-sql-Expression-RuntimeReplaceable.adoc#implementations[RuntimeReplaceable Expressions] are only available using SQL mode by means of SQL functions like `nvl`, `nvl2`, `ifnull`, `nullif`, etc.

. <<spark-sql-column-operators.adoc#isin, Column.isin>> and link:spark-sql-AstBuilder.adoc#withPredicate[SQL IN predicate with a subquery] (and link:spark-sql-Expression-In.adoc[In Predicate Expression])

[[demo]]
.Demo: Structured Query in SQL Mode VS Dataset API
[source,scala]
----
// FIXME: Example of a structured query that is only possible in SQL mode
----
File renamed without changes.
91 changes: 31 additions & 60 deletions spark-sql.adoc → mkdocs/spark-sql.md
@@ -1,31 +1,34 @@
== Spark SQL -- Structured Data Processing with Relational Queries on Massive Scale
title: Spark SQL

Like Apache Spark in general, *Spark SQL* in particular is all about distributed in-memory computations on massive scale.
# Spark SQL

Quoting the http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf[Spark SQL: Relational Data Processing in Spark] paper on Spark SQL:
## Structured Data Processing with Relational Queries on Massive Scale

> Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API.
Like Apache Spark in general, **Spark SQL** in particular is all about distributed in-memory computations on massive scale.

> Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative
queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning).
!!! quote "[Spark SQL: Relational Data Processing in Spark](http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf) paper on Spark SQL"
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API.

The primary difference between the computation models of Spark SQL and Spark Core is the relational framework for ingesting, querying and persisting (semi)structured data using *relational queries* (aka *structured queries*) that can be expressed in _good ol'_ *SQL* (with many features of HiveQL) and the high-level SQL-like functional declarative link:spark-sql-Dataset.adoc[Dataset API] (aka *Structured Query DSL*).
Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative
queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning).

NOTE: Semi- and structured data are collections of records that can be described using link:spark-sql-schema.adoc[schema] with column names, their types and whether a column can be null or not (aka _nullability_).
The primary difference between the computation models of Spark SQL and Spark Core is the relational framework for ingesting, querying and persisting (semi)structured data using **structured queries** (aka **relational queries**) that can be expressed in _good ol'_ **SQL** (with many features of HiveQL) and the high-level SQL-like functional declarative [Dataset API](spark-sql-Dataset.md) (_Structured Query DSL_).

Whichever query interface you use to describe a structured query, i.e. SQL or Query DSL, the query becomes a link:spark-sql-Dataset.adoc[Dataset] (with a mandatory link:spark-sql-Encoder.adoc[Encoder]).
!!! note
Semi- and structured data are collections of records that can be described using [schema](spark-sql-schema.md) with column names, their types and whether a column can be null or not (_nullability_).

From https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html[Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark]:
Whichever query interface you use to describe a structured query, i.e. SQL or Query DSL, the query becomes a [Dataset](spark-sql-Dataset.md) (with a mandatory [Encoder](spark-sql-Encoder.md)).

> For *SQL users*, Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore.
!!! quote "[Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html)"
For **SQL users**, Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore.

> For *Spark users*, Spark SQL becomes the narrow-waist for manipulating (semi-) structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It truly unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
For **Spark users**, Spark SQL becomes the narrow-waist for manipulating (semi-) structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It truly unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

> For *open source hackers*, Spark SQL proposes a novel, elegant way of building query planners. It is incredibly easy to add new optimizations under this framework.
For **open source hackers**, Spark SQL proposes a novel, elegant way of building query planners. It is incredibly easy to add new optimizations under this framework.

A `Dataset` is a programming interface to the link:spark-sql-QueryExecution.adoc[structured query execution pipeline] with link:spark-sql-dataset-operators.adoc[transformations and actions] (as in the good old days of RDD API in Spark Core).
A `Dataset` is a programming interface to the [structured query execution pipeline](spark-sql-QueryExecution.md) with [transformations and actions](spark-sql-dataset-operators.md) (as in the good old days of RDD API in Spark Core).

Internally, a structured query is a link:spark-sql-catalyst.adoc[Catalyst tree] of (logical and physical) link:spark-sql-catalyst-QueryPlan.adoc[relational operators] and link:spark-sql-Expression.adoc[expressions].
Internally, a structured query is a [Catalyst tree](spark-sql-catalyst.md) of (logical and physical) [relational operators](spark-sql-catalyst-QueryPlan.md) and [expressions](spark-sql-Expression.md).

When an action is executed on a `Dataset` (directly, e.g. link:spark-sql-dataset-operators.adoc#show[show] or link:spark-sql-dataset-operators.adoc#count[count], or indirectly, e.g. link:spark-sql-DataFrameWriter.adoc#save[save] or link:spark-sql-DataFrameWriter.adoc#saveAsTable[saveAsTable]), the structured query (behind `Dataset`) goes through the link:spark-sql-QueryExecution.adoc#execution-pipeline[execution stages]:
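Those stages can be pictured as a chain of functions that an action triggers end-to-end, while a `Dataset` by itself only describes the query. A minimal Python sketch (the stage names follow the text; everything else, including the function bodies, is illustrative, not Spark's implementation):

```python
# Illustrative sketch: a Dataset lazily holds a logical plan; an action
# pushes it through analysis -> optimization -> planning -> execution.

def analyze(plan):       return plan + ["analyzed"]
def optimize(plan):      return plan + ["optimized"]
def plan_physical(plan): return plan + ["physical plan"]
def execute(plan):       return plan + ["executed"]

class Dataset:
    def __init__(self, logical_plan):
        self.logical_plan = logical_plan  # nothing runs yet (lazy)

    def count(self):  # an action triggers the whole pipeline
        stages = analyze(self.logical_plan)
        stages = optimize(stages)
        stages = plan_physical(stages)
        return execute(stages)

ds = Dataset(["logical plan"])
print(ds.count())
# -> ['logical plan', 'analyzed', 'optimized', 'physical plan', 'executed']
```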

@@ -48,8 +51,7 @@ Spark SQL supports structured queries in *batch* and *streaming* modes (with the

NOTE: You can find out more on Spark Structured Streaming in https://bit.ly/spark-structured-streaming[Spark Structured Streaming (Apache Spark 2.2+)] gitbook.

[source, scala]
----
```text
// Define the schema using a case class
case class Person(name: String, age: Int)

@@ -89,12 +91,11 @@ scala> teenagers.show
+-----+---+
|Jacek| 10|
+-----+---+
----
```

Spark SQL supports loading datasets from various data sources including tables in Apache Hive. With Hive support enabled, you can load datasets from existing Apache Hive deployments and save them back to Hive tables if needed.

[source, scala]
----
```text
sql("CREATE OR REPLACE TEMPORARY VIEW v1 (key INT, value STRING) USING csv OPTIONS ('path'='people.csv', 'header'='true')")

// Queries are expressed in HiveQL
@@ -108,7 +109,7 @@ scala> sql("desc EXTENDED v1").show(false)
|key |int |null |
|value |string |null |
+----------+---------+-------+
----
```

Like SQL and NoSQL databases, Spark SQL offers performance query optimizations using link:spark-sql-Optimizer.adoc[rule-based query optimizer] (aka *Catalyst Optimizer*), link:spark-sql-whole-stage-codegen.adoc[whole-stage Java code generation] (aka *Whole-Stage Codegen* that could often be better than your own custom hand-written code!) and link:spark-sql-tungsten.adoc[Tungsten execution engine] with its own link:spark-sql-InternalRow.adoc[internal binary row format].
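Rule-based optimization of the Catalyst kind can be sketched in a few lines: represent an expression as a tree and apply rewrite rules bottom-up. The toy constant-folding rule below illustrates the idea only; it is not Catalyst's actual implementation (Catalyst works on Scala trees with many rules run to a fixed point).

```python
# Toy rule-based optimizer: fold Add(Literal, Literal) into a Literal,
# applied bottom-up over the expression tree.
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    value: int

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def fold_constants(node):
    if isinstance(node, Add):
        left, right = fold_constants(node.left), fold_constants(node.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return node

# (1 + 2) + 3 folds all the way down to 6 before anything executes.
expr = Add(Add(Literal(1), Literal(2)), Literal(3))
assert fold_constants(expr) == Literal(6)
```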

@@ -125,8 +126,7 @@ Quoting https://drill.apache.org/[Apache Drill] which applies to Spark SQL perfectly:

The following snippet shows a *batch ETL pipeline* to process JSON files and saving their subset as CSVs.

[source, scala]
----
```scala
spark.read
.format("json")
.load("input-json")
@@ -135,37 +135,9 @@ spark.read
.write
.format("csv")
.save("output-csv")
----
```
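For comparison, the same shape of pipeline -- read JSON, project and filter, write CSV -- can be sketched with the Python standard library alone. The in-memory "files", the column names, and the `score > 15` predicate are illustrative; this shows the ETL shape, not the Spark API.

```python
# Plain-Python sketch of the batch ETL shape: JSON lines in, filtered CSV out.
import csv
import io
import json

input_json = io.StringIO(
    '{"name": "Jacek", "score": 20.5}\n'
    '{"name": "Agata", "score": 10.0}\n'
)

# "load" + "select" + "where"
rows = [json.loads(line) for line in input_json]
subset = [(r["name"], r["score"]) for r in rows if r["score"] > 15]

# "write.format(csv).save"
output_csv = io.StringIO()
writer = csv.writer(output_csv, lineterminator="\n")
writer.writerow(["name", "score"])
writer.writerows(subset)

print(output_csv.getvalue())
# name,score
# Jacek,20.5
```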

With the link:spark-structured-streaming.adoc[Structured Streaming] feature, however, the above static batch query becomes dynamic and continuous, paving the way for *continuous applications*.

[source, scala]
----
import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", LongType, nullable = false) ::
StructField("name", StringType, nullable = false) ::
StructField("score", DoubleType, nullable = false) :: Nil)

spark.readStream
.format("json")
.schema(schema)
.load("input-json")
.select("name", "score")
.where('score > 15)
.writeStream
.format("console")
.start

// -------------------------------------------
// Batch: 1
// -------------------------------------------
// +-----+-----+
// | name|score|
// +-----+-----+
// |Jacek| 20.5|
// +-----+-----+
----
With the Structured Streaming feature, however, the above static batch query becomes dynamic and continuous, paving the way for **continuous applications**.

As of Spark 2.0, the main data abstraction of Spark SQL is link:spark-sql-Dataset.adoc[Dataset]. It represents *structured data*, i.e. records with a known schema. This structured data representation, `Dataset`, enables link:spark-sql-tungsten.adoc[compact binary representation] using a compressed columnar format that is stored in managed objects outside the JVM heap. It is supposed to speed computations up by reducing memory usage and GCs.

@@ -199,8 +171,7 @@ to find max/min or any other aggregates? SELECT MAX(column_name) FROM dftable_name

You can parse data from external data sources and let the _schema inferencer_ deduce the schema.

[source, scala]
----
```text
// Example 1
val df = Seq(1 -> 2).toDF("i", "j")
val query = df.groupBy('i)
@@ -219,10 +190,10 @@ scala> sql("SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp").show
| 1|
| 0|
+-------------------+
----
```
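What a schema inferencer does can be sketched without Spark: scan a sample of records and deduce, per column, a type and whether nulls occur. This toy pass is illustrative only; Spark's inference also widens conflicting types, which is omitted here.

```python
# Toy schema inference: deduce (type, nullable) per column from sample records.

def infer_schema(rows):
    fields = {}
    for row in rows:
        for name, value in row.items():
            t, nullable = fields.get(name, (None, False))
            if value is None:
                nullable = True  # a null makes the column nullable
            else:
                t = type(value).__name__
            fields[name] = (t, nullable)
    return {name: {"type": t, "nullable": n} for name, (t, n) in fields.items()}

sample = [{"i": 1, "j": 2}, {"i": 3, "j": None}]
print(infer_schema(sample))
# {'i': {'type': 'int', 'nullable': False}, 'j': {'type': 'int', 'nullable': True}}
```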

=== [[i-want-more]] Further Reading and Watching
## Further Reading and Watching

. http://spark.apache.org/sql/[Spark SQL] home page
. (video) https://youtu.be/e-Ys-2uVxM0?t=6m44s[Spark's Role in the Big Data Ecosystem - Matei Zaharia]
. https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html[Introducing Apache Spark 2.0]
* [Spark SQL](http://spark.apache.org/sql/) home page
* (video) [Spark's Role in the Big Data Ecosystem - Matei Zaharia](https://youtu.be/e-Ys-2uVxM0?t=6m44s)
* [Introducing Apache Spark 2.0](https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html)