[SPARK-52598][DOCS] Reorganize Spark Connect programming guide #51305

Open
wants to merge 2 commits into base: master
4 changes: 4 additions & 0 deletions docs/_data/menu-spark-connect.yaml
@@ -0,0 +1,4 @@
- text: Setting up Spark Connect
  url: spark-connect-setup.html
- text: Extending Spark with Spark Server Libraries
  url: spark-connect-server-libs.html
6 changes: 6 additions & 0 deletions docs/_includes/nav-left-wrapper-spark-connect.html
@@ -0,0 +1,6 @@
<div class="left-menu-wrapper">
<div class="left-menu">
<h3><a href="spark-connect-overview.html">Spark Connect Guide</a></h3>
{% include nav-left.html nav=include.nav-spark-connect %}
</div>
</div>
9 changes: 5 additions & 4 deletions docs/_layouts/global.html
@@ -75,6 +75,7 @@
<a class="dropdown-item" href="{{ rel_path_to_root }}rdd-programming-guide.html">RDDs, Accumulators, Broadcasts Vars</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}sql-programming-guide.html">SQL, DataFrames, and Datasets</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}streaming/index.html">Structured Streaming</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}spark-connect-overview.html">Spark Connect</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}streaming-programming-guide.html">Spark Streaming (DStreams)</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}ml-guide.html">MLlib (Machine Learning)</a>
<a class="dropdown-item" href="{{ rel_path_to_root }}graphx-programming-guide.html">GraphX (Graph Processing)</a>
@@ -155,15 +156,17 @@ <h1 style="max-width: 680px;">Apache Spark - A Unified engine for large-scale da
{% endif %}

<div class="container">
{% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "/streaming/" or page.url contains "migration-guide.html" %}
{% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "/streaming/" or page.url contains "migration-guide.html" or page.url contains "/spark-connect" %}
{% if page.url contains "migration-guide.html" %}
{% include nav-left-wrapper-migration.html nav-migration=site.data.menu-migration %}
{% elsif page.url contains "/ml" %}
{% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %}
{% elsif page.url contains "/streaming/" %}
{% include nav-left-wrapper-streaming.html nav-streaming=site.data.menu-streaming %}
{% else %}
{% elsif page.url contains "/sql" %}
{% include nav-left-wrapper-sql.html nav-sql=site.data.menu-sql %}
{% elsif page.url contains "/spark-connect" %}
{% include nav-left-wrapper-spark-connect.html nav-spark-connect=site.data.menu-spark-connect %}
{% endif %}
<input id="nav-trigger" class="nav-trigger" checked type="checkbox">
<label for="nav-trigger"></label>
@@ -173,9 +176,7 @@ <h1 class="title">{{ page.displayTitle }}</h1>
{% else %}
<h1 class="title">{{ page.title }}</h1>
{% endif %}

{{ content }}

</div>
{% else %}
<div class="content mr-3" id="content">
284 changes: 6 additions & 278 deletions docs/spark-connect-overview.md
@@ -17,7 +17,6 @@ license: |
See the License for the specific language governing permissions and
limitations under the License.
---
**Building client-side Spark applications**

In Apache Spark 3.4, Spark Connect introduced a decoupled client-server
architecture that allows remote connectivity to Spark clusters using the
@@ -28,6 +27,9 @@ in IDEs, Notebooks and programming languages.

To get started, see [Quickstart: Spark Connect](api/python/getting_started/quickstart_connect.html).

* This will become a table of contents (this text will be scraped).
{:toc}

<p style="text-align: center;">
<img src="img/spark-connect-api.png" title="Spark Connect API" alt="Spark Connect API Diagram" />
</p>
@@ -91,287 +93,13 @@ of applications, for example to benefit from performance improvements and securi
This means applications can be forward-compatible, as long as the server-side RPC
definitions are designed to be backwards compatible.

**Remote connectivity**: The decoupled architecture allows remote connectivity to Spark beyond SQL
and JDBC: any application can now interactively use Spark “as a service”.

**Debuggability and observability**: Spark Connect enables interactive debugging
during development directly from your favorite IDE. Similarly, applications can
be monitored using the application's framework native metrics and logging libraries.

# How to use Spark Connect

Spark Connect supports PySpark and Scala
applications. We will walk through how to run an Apache Spark server with Spark
Connect and connect to it from a client application using the Spark Connect client
library.

## Download and start Spark server with Spark Connect

First, download Spark from the
[Download Apache Spark](https://spark.apache.org/downloads.html) page. Choose the
latest release in the release drop-down at the top of the page. Then choose your package type, typically
“Pre-built for Apache Hadoop 3.3 and later”, and click the link to download.
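
If you prefer to fetch the package from the command line, the general pattern looks like the following sketch; the mirror URL and file name are illustrative and should be adjusted to match the release and package type you selected on the downloads page:

{% highlight bash %}
# Illustrative only: adjust the mirror URL and file name to match the release
# and package type chosen on the downloads page.
curl -O https://dlcdn.apache.org/spark/spark-{{site.SPARK_VERSION_SHORT}}/spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz
{% endhighlight %}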

Now extract the Spark package you just downloaded on your computer, for example:

{% highlight bash %}
tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz
{% endhighlight %}

In a terminal window, go to the `spark` folder in the location where you extracted
Spark and run the `start-connect-server.sh` script to start the Spark server with
Spark Connect, as in this example:

{% highlight bash %}
./sbin/start-connect-server.sh
{% endhighlight %}

Make sure to use the same version of the package as the Spark version you
downloaded previously. In this example, Spark {{site.SPARK_VERSION_SHORT}} with Scala 2.13.

The Spark server is now running and ready to accept Spark Connect sessions from client
applications. In the next section, we will walk through how to use Spark Connect
when writing client applications.
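
If you want to confirm the server is up before continuing, one quick check, assuming the default Spark Connect port (15002) and that `nc` (netcat) is available on your machine, is:

{% highlight bash %}
# Assumes the default Spark Connect port (15002) and that netcat (nc) is installed.
nc -z localhost 15002 && echo "Spark Connect server is listening"
{% endhighlight %}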

## Use Spark Connect for interactive analysis
<div class="codetabs">

<div data-lang="python" markdown="1">
When creating a Spark session, you can specify that you want to use Spark Connect.
There are a few ways to do that, outlined below.

If you do not use one of the mechanisms outlined here, your Spark session will
work just like before, without leveraging Spark Connect.

### Set SPARK_REMOTE environment variable

If you set the `SPARK_REMOTE` environment variable on the client machine where your
Spark client application is running and create a new Spark Session as in the following
example, the session will be a Spark Connect session. With this approach, there is no
code change needed to start using Spark Connect.

In a terminal window, set the `SPARK_REMOTE` environment variable to point to the
local Spark server you started previously on your computer:

{% highlight bash %}
export SPARK_REMOTE="sc://localhost"
{% endhighlight %}

And start the Spark shell as usual:

{% highlight bash %}
./bin/pyspark
{% endhighlight %}

The PySpark shell is now connected to Spark using Spark Connect as indicated in the welcome message:

{% highlight python %}
Client connected to the Spark Connect server at localhost
{% endhighlight %}

### Specify Spark Connect when creating Spark session

You can also specify that you want to use Spark Connect explicitly when you
create a Spark session.

For example, you can launch the PySpark shell with Spark Connect as
illustrated here.

To launch the PySpark shell with Spark Connect, simply include the `remote`
parameter and specify the location of your Spark server. We are using `localhost`
in this example to connect to the local Spark server we started previously:

{% highlight bash %}
./bin/pyspark --remote "sc://localhost"
{% endhighlight %}

And you will notice that the PySpark shell welcome message tells you that
you have connected to Spark using Spark Connect:

{% highlight python %}
Client connected to the Spark Connect server at localhost
{% endhighlight %}

You can also check the Spark session type. If it includes `.connect.`, you
are using Spark Connect, as shown in this example:

{% highlight python %}
SparkSession available as 'spark'.
>>> type(spark)
<class 'pyspark.sql.connect.session.SparkSession'>
{% endhighlight %}

Now you can run PySpark code in the shell to see Spark Connect in action:

{% highlight python %}
>>> columns = ["id", "name"]
>>> data = [(1,"Sarah"), (2,"Maria")]
>>> df = spark.createDataFrame(data).toDF(*columns)
>>> df.show()
+---+-----+
| id| name|
+---+-----+
| 1|Sarah|
| 2|Maria|
+---+-----+
{% endhighlight %}

</div>

<div data-lang="scala" markdown="1">
For the Scala shell, we use an Ammonite-based REPL. Otherwise, it is very similar to the PySpark shell.

{% highlight bash %}
./bin/spark-shell --remote "sc://localhost"
{% endhighlight %}

A greeting message will appear when the REPL successfully initializes:
{% highlight bash %}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-SNAPSHOT
      /_/

Type in expressions to have them evaluated.
Spark session available as 'spark'.
{% endhighlight %}

By default, the REPL will attempt to connect to a local Spark server.
Run the following Scala code in the shell to see Spark Connect in action:

{% highlight scala %}
@ spark.range(10).count
res0: Long = 10L
{% endhighlight %}

### Configure client-server connection

By default, the REPL will attempt to connect to a local Spark server on port 15002.
The connection, however, may be configured in several ways, as described in this configuration
[reference](https://github.com/apache/spark/blob/master/sql/connect/docs/client-connection-string.md).
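
For a rough idea of what such connection strings look like, here are two sketches; the parameter names follow the linked reference, and the hosts, ports, and token values are placeholders:

{% highlight bash %}
# Placeholders only: substitute your own host, port, and parameters.
sc://localhost                                            # local server on the default port 15002
sc://myhost.example.com:443/;use_ssl=true;token=ABCDEFG   # TLS plus a bearer token
{% endhighlight %}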

#### Set SPARK_REMOTE environment variable

The `SPARK_REMOTE` environment variable can be set on the client machine to customize the client-server
connection that is initialized at REPL startup.

{% highlight bash %}
export SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG"
./bin/spark-shell
{% endhighlight %}

or

{% highlight bash %}
SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG" spark-connect-repl
{% endhighlight %}

#### Configure programmatically with a connection string

The connection may also be programmatically created using _SparkSession#builder_ as in this example:

{% highlight scala %}
@ import org.apache.spark.sql.SparkSession
@ val spark = SparkSession.builder.remote("sc://localhost:443/;token=ABCDEFG").getOrCreate()
{% endhighlight %}

</div>
</div>

## Use Spark Connect in standalone applications

<div class="codetabs">


<div data-lang="python" markdown="1">

First, install PySpark with `pip install pyspark[connect]=={{site.SPARK_VERSION_SHORT}}`, or, if building a packaged PySpark application/library,
add it to your `setup.py` file as:
{% highlight python %}
install_requires=[
'pyspark[connect]=={{site.SPARK_VERSION_SHORT}}'
]
{% endhighlight %}

When writing your own code, include the `remote` function with a reference to
your Spark server when you create a Spark session, as in this example:

{% highlight python %}
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
{% endhighlight %}


For illustration purposes, we’ll create a simple Spark Connect application, SimpleApp.py:
{% highlight python %}
"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
spark = SparkSession.builder.remote("sc://localhost").appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()
{% endhighlight %}

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file.
Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed.

We can run this application with the regular Python interpreter as follows:
{% highlight bash %}
# Use the Python interpreter to run your application
$ python SimpleApp.py
...
Lines with a: 72, lines with b: 39
{% endhighlight %}
</div>


<div data-lang="scala" markdown="1">
To use Spark Connect as part of a Scala application/project, we first need to include the right dependencies.
Using the `sbt` build system as an example, we add the following dependency to the `build.sbt` file:
{% highlight sbt %}
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "{{site.SPARK_VERSION_SHORT}}"
{% endhighlight %}

When writing your own code, include the `remote` function with a reference to
your Spark server when you create a Spark session, as in this example:

{% highlight scala %}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()
{% endhighlight %}


**Note**: Operations that reference user-defined code, such as UDFs, `filter`, `map`, etc., require a
[ClassFinder](https://github.com/apache/spark/blob/master/sql/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ClassFinder.scala)
to be registered to pick up and upload any required classfiles. Also, any JAR dependencies must be uploaded to the server using `SparkSession#addArtifact`.

Example:
{% highlight scala %}
import org.apache.spark.sql.connect.client.REPLClassDirMonitor
// Register a ClassFinder to monitor and upload the classfiles from the build output.
val classFinder = new REPLClassDirMonitor(<ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR>)
spark.registerClassFinder(classFinder)

// Upload JAR dependencies
spark.addArtifact(<ABSOLUTE_PATH_JAR_DEP>)
{% endhighlight %}
Here, `ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR` is the output directory where the build system writes classfiles,
and `ABSOLUTE_PATH_JAR_DEP` is the location of the JAR on the local file system.

`REPLClassDirMonitor` is a provided implementation of `ClassFinder` that monitors a specific directory, but
one may implement their own class extending `ClassFinder` for customized search and monitoring.

</div>
</div>

For more information on application development with Spark Connect as well as extending Spark Connect
with custom functionality, see [Application Development with Spark Connect](app-dev-spark-connect.html).
# Client application authentication

While Spark Connect does not have built-in authentication, it is designed to