Merged
Changes from 1 commit
Commits
31 commits
9c636e8
Introductory pages migrated
jaceklaskowski Jun 29, 2020
d1137af
Hive Data Source MOVED
jaceklaskowski Jun 29, 2020
e62263a
requirements
jaceklaskowski Jun 29, 2020
7295347
awesome plugin
jaceklaskowski Jun 29, 2020
6ddbc5a
Force mkdocs for readthedocs
jaceklaskowski Jun 29, 2020
ee6ce3d
Disable extension due to RTD error
jaceklaskowski Jun 29, 2020
3efaa6f
Notable Features MOVED
jaceklaskowski Jun 30, 2020
2f0db87
Developing Spark SQL Applications MOVED
jaceklaskowski Jun 30, 2020
b327db3
SparkSession Registries MIGRATED
jaceklaskowski Jul 1, 2020
8502be6
File-Based Data Sources MIGRATED
jaceklaskowski Jul 1, 2020
cdc3c48
Kafka Data Source MIGRATED
jaceklaskowski Jul 1, 2020
ea9b229
All Data Sources MIGRATED
jaceklaskowski Jul 1, 2020
f404e65
Data Source APIs MIGRATED
jaceklaskowski Jul 1, 2020
df545a5
Structured Query Execution MIGRATED
jaceklaskowski Jul 2, 2020
df24c30
Catalyst, Catalyst Expressions and Vectorized Parquet Decoding MIGRATED
jaceklaskowski Jul 2, 2020
013f33e
Menu reorg
jaceklaskowski Jul 2, 2020
6a2c51f
Base Logical Operators MIGRATED
jaceklaskowski Jul 2, 2020
72b2ed5
Concrete Logical Operators MIGRATED
jaceklaskowski Jul 2, 2020
7604124
SQL Support MIGRATED
jaceklaskowski Jul 2, 2020
6f2da2b
Tungsten Execution Backend MIGRATED
jaceklaskowski Jul 2, 2020
d51458b
Spark Thrift Server MIGRATED
jaceklaskowski Jul 2, 2020
1d44a54
Physical Operators MIGRATED
jaceklaskowski Jul 3, 2020
7a1d3ff
Concrete Physical Operators MIGRATED
jaceklaskowski Jul 3, 2020
cffe600
Logical Analysis Rules MIGRATED
jaceklaskowski Jul 3, 2020
7b49a27
Logical Optimizations MIGRATED
jaceklaskowski Jul 3, 2020
c813772
More sections MIGRATED
jaceklaskowski Jul 3, 2020
16eea5c
All sections MIGRATED
jaceklaskowski Jul 3, 2020
e48de88
Migration to mkdocs DONE!
jaceklaskowski Jul 3, 2020
1413165
README + gitignore
jaceklaskowski Jul 5, 2020
a163b8c
Page rename
jaceklaskowski Jul 5, 2020
ce674b6
MkDocs setup + Page renames
jaceklaskowski Jul 5, 2020

MkDocs setup + Page renames
jaceklaskowski committed Jul 5, 2020
commit ce674b6d603f6a55a79883c105ebb1bf4cccd0be
1,137 changes: 571 additions & 566 deletions mkdocs.yml

Large diffs are not rendered by default.

@@ -4,7 +4,7 @@ title: DataSourceV2Relation

`DataSourceV2Relation` is a <<spark-sql-LogicalPlan-LeafNode.adoc#, leaf logical operator>> that represents a data scan (_data reading_) or data writing in the <<spark-sql-data-source-api-v2.adoc#, Data Source API V2>>.

-`DataSourceV2Relation` is <<creating-instance, created>> (indirectly via <<create, create>> helper method) exclusively when `DataFrameReader` is requested to ["load" data (as a DataFrame)](DataFrameReader.md#load) (from a data source with <<spark-sql-ReadSupport.adoc#, ReadSupport>>).
+`DataSourceV2Relation` is <<creating-instance, created>> (indirectly via <<create, create>> helper method) exclusively when `DataFrameReader` is requested to ["load" data (as a DataFrame)](../DataFrameReader.md#load) (from a data source with <<spark-sql-ReadSupport.adoc#, ReadSupport>>).

[[creating-instance]]
`DataSourceV2Relation` takes the following to be created:
@@ -38,7 +38,7 @@ create(

In the end, `create` <<creating-instance, creates a DataSourceV2Relation>>.

-NOTE: `create` is used exclusively when `DataFrameReader` is requested to ["load" data (as a DataFrame)](DataFrameReader.md#load) (from a data source with [ReadSupport](spark-sql-ReadSupport.md)).
+NOTE: `create` is used exclusively when `DataFrameReader` is requested to ["load" data (as a DataFrame)](../DataFrameReader.md#load) (from a data source with [ReadSupport](../spark-sql-ReadSupport.md)).

=== [[computeStats]] Computing Statistics -- `computeStats` Method

@@ -139,7 +139,7 @@ Used when:

* `DataSourceV2Relation` logical operator is requested to <<newReader, create a DataSourceReader>>

-* `DataSourceV2Relation` factory object is requested to <<create, create a DataSourceV2Relation>> (when `DataFrameReader` is requested to ["load" data (as a DataFrame)](DataFrameReader.md#load) from a data source with [ReadSupport](spark-sql-ReadSupport.md))
+* `DataSourceV2Relation` factory object is requested to <<create, create a DataSourceV2Relation>> (when `DataFrameReader` is requested to ["load" data (as a DataFrame)](../DataFrameReader.md#load) from a data source with [ReadSupport](../spark-sql-ReadSupport.md))

| createWriter
a| [[createWriter]]
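
All of the `DataSourceV2Relation` changes above concern the `DataFrameReader.load` path. The following is a minimal sketch of that path, assuming the `spark` session of `spark-shell`; the format class and the `path` option are placeholders for any data source that implements `DataSourceV2` with `ReadSupport`:

```scala
// Hypothetical DataSourceV2 source: the format class and path option are assumptions.
// When the resolved source implements ReadSupport, DataFrameReader.load builds the
// DataFrame around a DataSourceV2Relation (created via the create helper).
val events = spark.read
  .format("com.example.spark.EventSourceV2") // assumed: DataSourceV2 with ReadSupport
  .option("path", "/tmp/events")             // hypothetical option
  .load()

// The relation should show up as the leaf of the logical plan.
println(events.queryExecution.logical.numberedTreeString)
```
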
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -17,8 +17,8 @@ scala> println(q2.queryExecution.optimizedPlan.numberedTreeString)

`LogicalRelation` is <<creating-instance, created>> when:

-* `DataFrameReader` [loads data from a data source that supports multiple paths](DataFrameReader.md#load) (through link:spark-sql-SparkSession.adoc#baseRelationToDataFrame[SparkSession.baseRelationToDataFrame])
-* `DataFrameReader` is requested to load data from an external table using [JDBC](DataFrameReader.md#jdbc) (through link:spark-sql-SparkSession.adoc#baseRelationToDataFrame[SparkSession.baseRelationToDataFrame])
+* `DataFrameReader` [loads data from a data source that supports multiple paths](../DataFrameReader.md#load) (through link:spark-sql-SparkSession.adoc#baseRelationToDataFrame[SparkSession.baseRelationToDataFrame])
+* `DataFrameReader` is requested to load data from an external table using [JDBC](../DataFrameReader.md#jdbc) (through link:spark-sql-SparkSession.adoc#baseRelationToDataFrame[SparkSession.baseRelationToDataFrame])
* `TextInputCSVDataSource` and `TextInputJsonDataSource` are requested to infer schema
* `ResolveSQLOnFile` converts a logical plan
* `FindDataSourceTable` logical evaluation rule is link:spark-sql-Analyzer-FindDataSourceTable.adoc#apply[executed]
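
As a small illustration of the JDBC case in the list above (a sketch only: the H2 URL and the `people` table are made up, and an H2 driver would have to be on the classpath):

```scala
import java.util.Properties

// DataFrameReader.jdbc builds a JDBCRelation (a BaseRelation) and hands it to
// SparkSession.baseRelationToDataFrame, which wraps it in a LogicalRelation.
val people = spark.read.jdbc("jdbc:h2:mem:demo", "people", new Properties())

// LogicalRelation should be the leaf of the analyzed logical plan.
println(people.queryExecution.analyzed.numberedTreeString)
```
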
File renamed without changes.
File renamed without changes.
@@ -6,7 +6,7 @@

`PushDownPredicate` is simply a <<spark-sql-catalyst-Rule.adoc#, Catalyst rule>> for transforming <<spark-sql-LogicalPlan.adoc#, logical plans>>, i.e. `Rule[LogicalPlan]`.

-When you execute link:spark-sql-Dataset.adoc#where[where] or link:spark-sql-Dataset.adoc#filter[filter] operators right after [loading a dataset](DataFrameReader.md#load), Spark SQL will try to push the where/filter predicate down to the data source using a corresponding SQL query with a `WHERE` clause (or whatever the proper language for the data source is).
+When you execute link:spark-sql-Dataset.adoc#where[where] or link:spark-sql-Dataset.adoc#filter[filter] operators right after [loading a dataset](../DataFrameReader.md#load), Spark SQL will try to push the where/filter predicate down to the data source using a corresponding SQL query with a `WHERE` clause (or whatever the proper language for the data source is).

This optimization is called *filter pushdown* or *predicate pushdown* and aims at pushing the filtering down to the "bare metal", i.e. the data source engine. The goal is to increase query performance, since the filtering is performed at a very low level instead of on the entire dataset after it has been loaded into Spark's memory (which could also cause memory issues).

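A quick way to see the effect described above (a sketch; the Parquet path and the `name` column are hypothetical) is to filter right after a load and inspect the plans:

```scala
import spark.implicits._

// Hypothetical Parquet dataset and column; the filter comes right after the load.
val cities = spark.read.parquet("/tmp/cities")
val q = cities.where($"name" === "Warsaw")

// The optimized and physical plans should show the predicate pushed to the scan,
// e.g. a Parquet scan reporting PushedFilters: [IsNotNull(name), EqualTo(name,Warsaw)].
q.explain(extended = true)
```
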
2 changes: 0 additions & 2 deletions mkdocs/spark-sql.md
@@ -1,5 +1,3 @@
-title: Spark SQL
-
# Spark SQL

## Structured Data Processing with Relational Queries on Massive Scale