[Documentation] Deploy Worker and Udf Binaries #329

Merged · 20 commits · Nov 19, 2019
116 changes: 116 additions & 0 deletions docs/deploy-worker-udf-binaries.md
# Deploy Worker and UDF Binaries: General Instructions

This how-to provides general instructions on deploying Worker and UDF (User-Defined Function) binaries,
including which environment variables to set and some commonly used parameters
when launching applications with `spark-submit`.

## Configurations

### 1. Environment Variables
When deploying workers and writing UDFs, there are a few commonly used environment variables that you may need to set:

<table>
<tr>
<td width="25%"><b>Environment Variable</b></td>
<td width="75%"><b>Description</b></td>
</tr>
<tr>
<td><b>DOTNET_WORKER_DIR</b></td>
<td>Path where the <code>Microsoft.Spark.Worker</code> binary has been generated.<br/>It is used by the Spark driver and will be passed to Spark executors. If this variable is not set, the Spark executors will search the path specified in the <code>PATH</code> environment variable.<br/><i>e.g. "C:\bin\Microsoft.Spark.Worker-0.6.0"</i></td>
</tr>
<tr>
<td><b>DOTNET_ASSEMBLY_SEARCH_PATHS</b></td>
<td>Comma-separated paths where <code>Microsoft.Spark.Worker</code> loads assemblies from.<br/>Note that if a path starts with ".", the working directory will be prepended. In <b>yarn mode</b>, "." represents the container's working directory.<br/><i>e.g. "C:\Users\&lt;user name&gt;\&lt;mysparkapp&gt;\bin\Debug\&lt;dotnet version&gt;"</i></td>
</tr>
<tr>
<td><b>DOTNET_WORKER_DEBUG</b></td>
<td>If you want to <a href="https://github.com/dotnet/spark/blob/master/docs/developer-guide.md#debugging-user-defined-function-udf">debug a UDF</a>, then set this environment variable to <code>1</code> before running <code>spark-submit</code>.</td>
</tr>
</table>
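
For example, on a Windows machine you might set these variables in the command window before calling `spark-submit`. This is a minimal sketch; the paths below are illustrative placeholders, so substitute the locations of your own worker binaries and application build output.

```shell
REM Replace the placeholder paths with the locations on your machine.
set DOTNET_WORKER_DIR=C:\bin\Microsoft.Spark.Worker-0.6.0
set DOTNET_ASSEMBLY_SEARCH_PATHS=C:\Users\<user name>\<mysparkapp>\bin\Debug\<dotnet version>
REM Optional: only needed when debugging a UDF.
set DOTNET_WORKER_DEBUG=1
```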

### 2. Parameter Options
Once the Spark application is [bundled](https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies), you can launch it using `spark-submit`. The following table shows some of the commonly used options:

<table>
<tr>
<td width="25%"><b>Parameter Name</b></td>
<td width="75%"><b>Description</b></td>
</tr>
<tr>
<td><b>--class</b></td>
<td>The entry point for your application.<br/><i>e.g. org.apache.spark.deploy.dotnet.DotnetRunner</i></td>
</tr>
<tr>
<td><b>--master</b></td>
<td>The <a href="https://spark.apache.org/docs/latest/submitting-applications.html#master-urls">master URL</a> for the cluster.<br/><i>e.g. yarn</i></td>
</tr>
<tr>
<td><b>--deploy-mode</b></td>
<td>Whether to deploy your driver on the worker nodes (<code>cluster</code>) or locally as an external client (<code>client</code>).<br/>Default: <code>client</code></td>
</tr>
<tr>
<td><b>--conf</b></td>
<td>Arbitrary Spark configuration property in <code>key=value</code> format.<br/><i>e.g. spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=.\worker\Microsoft.Spark.Worker-0.6.0</i></td>
</tr>
<tr>
<td><b>--files</b></td>
<td>Comma-separated list of files to be placed in the working directory of each executor.<br/>
<ul>
<li>Note that this option is only applicable in yarn mode.</li>
<li>It supports specifying file names with <code>#</code>, similar to Hadoop.</li>
</ul>
<i>e.g. <code>myLocalSparkApp.dll#appSeen.dll</code>. Your application should use the name <code>appSeen.dll</code> to reference <code>myLocalSparkApp.dll</code> when running on YARN.</i></td>
</tr>
<tr>
<td><b>--archives</b></td>
<td>Comma-separated list of archives to be extracted into the working directory of each executor.<br/>
<ul>
<li>Note that this option is only applicable in yarn mode.</li>
<li>It supports specifying file names with <code>#</code>, similar to Hadoop.</li>
</ul>
<i>e.g. <code>hdfs://&lt;path to your worker file&gt;/Microsoft.Spark.Worker.net461.win-x64-0.6.0.zip#worker</code>. This will copy and extract the zip file into a folder named <code>worker</code>.</i></td>
</tr>
<tr>
<td><b>application-jar</b></td>
<td>Path to a bundled jar including your application and all dependencies.<br/>
<i>e.g. hdfs://&lt;path to your jar&gt;/microsoft-spark-2.4.x-0.6.0.jar</i></td>
</tr>
<tr>
<td><b>application-arguments</b></td>
<td>Arguments passed to the main method of your main class, if any.<br/><i>e.g. hdfs://&lt;path to your app&gt;/&lt;your app&gt;.zip &lt;your app name&gt; &lt;app args&gt;</i></td>
</tr>
</table>

> Note: Please specify all the `--options` before `application-jar` when launching applications with `spark-submit`; otherwise they will be ignored. See more `spark-submit` options [here](https://spark.apache.org/docs/latest/submitting-applications.html) and details on running Spark on YARN [here](https://spark.apache.org/docs/latest/running-on-yarn.html).
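
For instance, a minimal local (client-mode) run that follows this ordering might look like the sketch below (shown for a Windows command prompt; the jar and app names are the 0.6.0 examples used on this page, so adjust them to your own versions and paths):

```shell
%SPARK_HOME%\bin\spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master local ^
  microsoft-spark-2.4.x-0.6.0.jar ^
  mySparkApp.exe <app args>
```

All `--options` appear before the application jar, and the application arguments (`mySparkApp.exe <app args>` here) come after it.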

## FAQ
#### 1. Question: When I run a Spark app with UDFs, I get the following error. What should I do?
> **Error:** [ ] [ ] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.IO.FileNotFoundException: Assembly 'mySparkApp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' file not found: 'mySparkApp.dll'

**Answer:** Please check if the `DOTNET_ASSEMBLY_SEARCH_PATHS` environment variable is set correctly. It should be the path that contains your `mySparkApp.dll`.
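
For example, if your application was built to a folder like the one below (an illustrative path), the variable would be set as follows before running `spark-submit`:

```shell
REM Illustrative path - use the folder that actually contains mySparkApp.dll.
set DOTNET_ASSEMBLY_SEARCH_PATHS=C:\Users\<user name>\mySparkApp\bin\Debug\<dotnet version>
```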

#### 2. Question: After I upgraded my .NET for Apache Spark version and updated the `DOTNET_WORKER_DIR` environment variable, why do I still get the following error?
> **Error:** Lost task 0.0 in stage 11.0 (TID 24, localhost, executor driver): java.io.IOException: Cannot run program "Microsoft.Spark.Worker.exe": CreateProcess error=2, The system cannot find the file specified.

**Answer:** Please try restarting your PowerShell window (or other command windows) first so that it picks up the latest environment variable values, and then start your program.
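
As a quick check (a hypothetical example for a Windows command prompt), you can print the value visible to the new window and confirm it matches what you set:

```shell
REM The output should be the worker directory you configured.
echo %DOTNET_WORKER_DIR%
```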

#### 3. Question: After submitting my Spark application, I get the error `System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context'`.
> **Command:** %SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.6.0.jar mySparkApp.exe<br/>
**Error:** [ ] [ ] [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=...'.

**Answer:** Please check the `Microsoft.Spark.Worker` version you are using. We currently provide two versions: **.NET Framework 4.6.1** and **.NET Core 2.1.x**. In this case, `Microsoft.Spark.Worker.net461.win-x64-0.6.0` (which you can download [here](https://github.com/dotnet/spark/releases)) should be used since `System.Runtime.Remoting.Contexts.Context` is only available in .NET Framework.
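
One way to resolve this, assuming you keep the .NET Framework worker extracted to a local folder, is to point `DOTNET_WORKER_DIR` at that folder before re-running `spark-submit` (the path below is an example only):

```shell
REM Example path - use wherever you extracted the net461 worker package.
set DOTNET_WORKER_DIR=C:\bin\Microsoft.Spark.Worker.net461.win-x64-0.6.0
```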

#### 4. Question: How do I run my Spark application with UDFs on YARN? Which environment variables and parameters should I use?

**Answer:** To launch a Spark application on YARN, environment variables should be specified as `spark.yarn.appMasterEnv.[EnvironmentVariableName]`. See the example below using `spark-submit`:
```shell
spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=./worker/Microsoft.Spark.Worker-0.6.0 \
--conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs \
--archives hdfs://<path to your files>/Microsoft.Spark.Worker.net461.win-x64-0.6.0.zip#worker,hdfs://<path to your files>/mySparkApp.zip#udfs \
hdfs://<path to jar file>/microsoft-spark-2.4.x-0.6.0.jar \
hdfs://<path to your files>/mySparkApp.zip mySparkApp
```