[SPARK-52598][DOCS] Reorganize Spark Connect programming guide #51305

Open
wants to merge 2 commits into base: master
4 changes: 4 additions & 0 deletions docs/_data/menu-spark-connect.yaml
@@ -0,0 +1,4 @@
- text: Setting up Spark Connect
  url: spark-connect-setup.html
- text: Extending Spark with Spark Server Libraries
  url: spark-connect-server-libs.html
6 changes: 6 additions & 0 deletions docs/_includes/nav-left-wrapper-spark-connect.html
@@ -0,0 +1,6 @@
<div class="left-menu-wrapper">
<div class="left-menu">
<h3><a href="spark-connect-overview.html">Spark Connect Guide</a></h3>
{% include nav-left.html nav=include.nav-spark-connect %}
</div>
</div>
9 changes: 5 additions & 4 deletions docs/_layouts/global.html
@@ -75,6 +75,7 @@
<a class="dropdown-item" href="{{ rel_path_to_root }}rdd-programming-guide.html">RDDs, Accumulators, Broadcasts Vars</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}sql-programming-guide.html">SQL, DataFrames, and Datasets</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}streaming/index.html">Structured Streaming</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}spark-connect-overview.html">Spark Connect</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}streaming-programming-guide.html">Spark Streaming (DStreams)</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}ml-guide.html">MLlib (Machine Learning)</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}graphx-programming-guide.html">GraphX (Graph Processing)</a>
@@ -155,15 +156,17 @@ <h1 style="max-width: 680px;">Apache Spark - A Unified engine for large-scale da
{% endif %}

<div class="container">
{% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "/streaming/" or page.url contains "migration-guide.html" %}
{% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "/streaming/" or page.url contains "migration-guide.html" or page.url contains "/spark-connect" %}
{% if page.url contains "migration-guide.html" %}
{% include nav-left-wrapper-migration.html nav-migration=site.data.menu-migration %}
{% elsif page.url contains "/ml" %}
{% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %}
{% elsif page.url contains "/streaming/" %}
{% include nav-left-wrapper-streaming.html nav-streaming=site.data.menu-streaming %}
{% else %}
{% elsif page.url contains "/sql" %}
{% include nav-left-wrapper-sql.html nav-sql=site.data.menu-sql %}
{% elsif page.url contains "/spark-connect" %}
{% include nav-left-wrapper-spark-connect.html nav-spark-connect=site.data.menu-spark-connect %}
{% endif %}
<input id="nav-trigger" class="nav-trigger" checked type="checkbox">
<label for="nav-trigger"></label>
@@ -173,9 +176,7 @@ <h1 class="title">{{ page.displayTitle }}</h1>
{% else %}
<h1 class="title">{{ page.title }}</h1>
{% endif %}

{{ content }}

</div>
{% else %}
<div class="content mr-3" id="content">
284 changes: 6 additions & 278 deletions docs/spark-connect-overview.md
@@ -17,7 +17,6 @@ license: |
See the License for the specific language governing permissions and
limitations under the License.
---
**Building client-side Spark applications**

In Apache Spark 3.4, Spark Connect introduced a decoupled client-server
architecture that allows remote connectivity to Spark clusters using the
@@ -28,6 +27,9 @@ in IDEs, Notebooks and programming languages.

To get started, see [Quickstart: Spark Connect](api/python/getting_started/quickstart_connect.html).

* This will become a table of contents (this text will be scraped).
{:toc}

<p style="text-align: center;">
<img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
</p>
@@ -91,287 +93,13 @@ of applications, for example to benefit from performance improvements and securi
This means applications can be forward-compatible, as long as the server-side RPC
definitions are designed to be backwards compatible.

**Remote connectivity**: The decoupled architecture allows remote connectivity to Spark beyond SQL
and JDBC: any application can now interactively use Spark “as a service”.

**Debuggability and observability**: Spark Connect enables interactive debugging
during development directly from your favorite IDE. Similarly, applications can
be monitored using the application's framework native metrics and logging libraries.

# How to use Spark Connect

Spark Connect supports PySpark and Scala
applications. We will walk through how to run an Apache Spark server with Spark
Connect and connect to it from a client application using the Spark Connect client
library.

## Download and start Spark server with Spark Connect

First, download Spark from the
[Download Apache Spark](https://spark.apache.org/downloads.html) page. Choose the
latest release in the release drop-down at the top of the page. Then choose your package type, typically
“Pre-built for Apache Hadoop 3.3 and later”, and click the link to download.
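
If you prefer to fetch the package from the command line, the general pattern looks like the following sketch; the mirror URL and file name are illustrative and should be adjusted to match the release and package type you selected on the downloads page:

{% highlight bash %}
# Illustrative only: adjust the mirror URL and file name to match the release
# and package type chosen on the downloads page.
curl -O https://dlcdn.apache.org/spark/spark-{{site.SPARK_VERSION_SHORT}}/spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz
{% endhighlight %}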

Now extract the Spark package you just downloaded on your computer, for example:

{% highlight bash %}
tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz
{% endhighlight %}

In a terminal window, go to the `spark` folder in the location where you extracted
Spark and run the `start-connect-server.sh` script to start the Spark server with
Spark Connect, as in this example:

{% highlight bash %}
./sbin/start-connect-server.sh
{% endhighlight %}

Make sure to use the same version of the package as the Spark version you
downloaded previously. In this example, Spark {{site.SPARK_VERSION_SHORT}} with Scala 2.13.

The Spark server is now running and ready to accept Spark Connect sessions from client
applications. In the next section, we will walk through how to use Spark Connect
when writing client applications.
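
If you want to confirm the server is up before continuing, one quick check, assuming the default Spark Connect port (15002) and that `nc` (netcat) is available on your machine, is:

{% highlight bash %}
# Assumes the default Spark Connect port (15002) and that netcat (nc) is installed.
nc -z localhost 15002 && echo "Spark Connect server is listening"
{% endhighlight %}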

## Use Spark Connect for interactive analysis
<div class="codetabs">

<div data-lang="python" markdown="1">
When creating a Spark session, you can specify that you want to use Spark Connect.
There are a few ways to do that, outlined below.

If you do not use one of the mechanisms outlined here, your Spark session will
work just like before, without leveraging Spark Connect.

### Set SPARK_REMOTE environment variable

If you set the `SPARK_REMOTE` environment variable on the client machine where your
Spark client application is running and create a new Spark Session as in the following
example, the session will be a Spark Connect session. With this approach, there is no
code change needed to start using Spark Connect.

In a terminal window, set the `SPARK_REMOTE` environment variable to point to the
local Spark server you started previously on your computer:

{% highlight bash %}
export SPARK_REMOTE="sc://localhost"
{% endhighlight %}

And start the Spark shell as usual:

{% highlight bash %}
./bin/pyspark
{% endhighlight %}

The PySpark shell is now connected to Spark using Spark Connect as indicated in the welcome message:

{% highlight python %}
Client connected to the Spark Connect server at localhost
{% endhighlight %}

### Specify Spark Connect when creating Spark session

You can also specify that you want to use Spark Connect explicitly when you
create a Spark session.

For example, you can launch the PySpark shell with Spark Connect as
illustrated here.

To launch the PySpark shell with Spark Connect, simply include the `remote`
parameter and specify the location of your Spark server. We are using `localhost`
in this example to connect to the local Spark server we started previously:

{% highlight bash %}
./bin/pyspark --remote "sc://localhost"
{% endhighlight %}

And you will notice that the PySpark shell welcome message tells you that
you have connected to Spark using Spark Connect:

{% highlight python %}
Client connected to the Spark Connect server at localhost
{% endhighlight %}

You can also check the Spark session type. If it includes `.connect.`, you
are using Spark Connect, as shown in this example:

{% highlight python %}
SparkSession available as 'spark'.
>>> type(spark)
<class 'pyspark.sql.connect.session.SparkSession'>
{% endhighlight %}

Now you can run PySpark code in the shell to see Spark Connect in action:

{% highlight python %}
>>> columns = ["id", "name"]
>>> data = [(1,"Sarah"), (2,"Maria")]
>>> df = spark.createDataFrame(data).toDF(*columns)
>>> df.show()
+---+-----+
| id| name|
+---+-----+
| 1|Sarah|
| 2|Maria|
+---+-----+
{% endhighlight %}

</div>

<div data-lang="scala" markdown="1">
For the Scala shell, we use an Ammonite-based REPL. Otherwise, it is very similar to the PySpark shell.

{% highlight bash %}
./bin/spark-shell --remote "sc://localhost"
{% endhighlight %}

A greeting message will appear when the REPL successfully initializes:
{% highlight bash %}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-SNAPSHOT
      /_/

Type in expressions to have them evaluated.
Spark session available as 'spark'.
{% endhighlight %}

By default, the REPL will attempt to connect to a local Spark server.
Run the following Scala code in the shell to see Spark Connect in action:

{% highlight scala %}
@ spark.range(10).count
res0: Long = 10L
{% endhighlight %}

### Configure client-server connection

By default, the REPL will attempt to connect to a local Spark server on port 15002.
The connection, however, may be configured in several ways, as described in this configuration
[reference](https://github.com/apache/spark/blob/master/sql/connect/docs/client-connection-string.md).
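
For a rough idea of what such connection strings look like, here are two sketches; the parameter names follow the linked reference, and the hosts, ports, and token values are placeholders:

{% highlight bash %}
# Placeholders only: substitute your own host, port, and parameters.
sc://localhost                                            # local server on the default port 15002
sc://myhost.example.com:443/;use_ssl=true;token=ABCDEFG   # TLS plus a bearer token
{% endhighlight %}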

#### Set SPARK_REMOTE environment variable

The `SPARK_REMOTE` environment variable can be set on the client machine to customize the client-server
connection that is initialized at REPL startup.

{% highlight bash %}
export SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG"
./bin/spark-shell
{% endhighlight %}

or

{% highlight bash %}
SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG" spark-connect-repl
{% endhighlight %}

#### Configure programmatically with a connection string

The connection may also be programmatically created using _SparkSession#builder_ as in this example:

{% highlight scala %}
@ import org.apache.spark.sql.SparkSession
@ val spark = SparkSession.builder.remote("sc://localhost:443/;token=ABCDEFG").getOrCreate()
{% endhighlight %}

</div>
</div>

## Use Spark Connect in standalone applications

<div class="codetabs">


<div data-lang="python" markdown="1">

First, install PySpark with `pip install pyspark[connect]=={{site.SPARK_VERSION_SHORT}}`, or, if building a packaged PySpark application/library,
add it to your `setup.py` file as:
{% highlight python %}
install_requires=[
'pyspark[connect]=={{site.SPARK_VERSION_SHORT}}'
]
{% endhighlight %}

When writing your own code, include the `remote` function with a reference to
your Spark server when you create a Spark session, as in this example:

{% highlight python %}
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
{% endhighlight %}


For illustration purposes, we’ll create a simple Spark Connect application, SimpleApp.py:
{% highlight python %}
"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
spark = SparkSession.builder.remote("sc://localhost").appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()
{% endhighlight %}

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file.
Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed.

We can run this application with the regular Python interpreter as follows:
{% highlight bash %}
# Use the Python interpreter to run your application
$ python SimpleApp.py
...
Lines with a: 72, lines with b: 39
{% endhighlight %}
</div>


<div data-lang="scala" markdown="1">
To use Spark Connect as part of a Scala application/project, we first need to include the right dependencies.
Using the `sbt` build system as an example, we add the following dependency to the `build.sbt` file:
{% highlight sbt %}
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "{{site.SPARK_VERSION_SHORT}}"
{% endhighlight %}

When writing your own code, include the `remote` function with a reference to
your Spark server when you create a Spark session, as in this example:

{% highlight scala %}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()
{% endhighlight %}


**Note**: Operations that reference user-defined code, such as UDFs, `filter`, `map`, etc., require a
[ClassFinder](https://github.com/apache/spark/blob/master/sql/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ClassFinder.scala)
to be registered to pick up and upload any required classfiles. Also, any JAR dependencies must be uploaded to the server using `SparkSession#addArtifact`.

Example:
{% highlight scala %}
import org.apache.spark.sql.connect.client.REPLClassDirMonitor
// Register a ClassFinder to monitor and upload the classfiles from the build output.
val classFinder = new REPLClassDirMonitor(<ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR>)
spark.registerClassFinder(classFinder)

// Upload JAR dependencies
spark.addArtifact(<ABSOLUTE_PATH_JAR_DEP>)
{% endhighlight %}
Here, `ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR` is the output directory where the build system writes classfiles,
and `ABSOLUTE_PATH_JAR_DEP` is the location of the JAR on the local file system.

`REPLClassDirMonitor` is a provided implementation of `ClassFinder` that monitors a specific directory, but
one may implement their own class extending `ClassFinder` for customized search and monitoring.

</div>
</div>

For more information on application development with Spark Connect as well as extending Spark Connect
with custom functionality, see [Application Development with Spark Connect](app-dev-spark-connect.html).
# Client application authentication

While Spark Connect does not have built-in authentication, it is designed to