# Installation Guide

## Requirements

Flowman brings many dependencies with the installation archive, but everything related to Hadoop or Spark needs to
be provided by your platform. This approach ensures that the existing Spark and Hadoop installation is used together
with all patches and extensions available on your platform. Specifically, this means that Flowman requires the following
components to be present on your system:

* Java 1.8
* Apache Spark with a matching minor version
* Apache Hadoop with a matching minor version

Note that Flowman can be built for different Hadoop and Spark versions, and the major and minor version of the build
needs to match those of your platform.
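
If you are unsure which versions are present on your platform, the following commands print the relevant version
information (assuming the standard Java, Spark and Hadoop command line tools are available on your `PATH`):
```shell script
# Print the versions provided by your platform
java -version
spark-submit --version
hadoop version
```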


## Downloading Flowman

Since version 0.14.1, prebuilt releases are provided on [GitHub](https://github.com/dimajix/flowman/releases).
This is probably the simplest way to get a working Flowman package. Note that for each release, different packages
are provided for different Spark and Hadoop versions. The naming scheme is simple:
```
flowman-dist-<version>-oss-spark<spark-version>-hadoop<hadoop-version>-bin.tar.gz
```
You simply have to use the package which matches the Spark and Hadoop versions of your environment. For example, the
package for Flowman 0.14.1 with Spark 3.0 and Hadoop 3.2 would be
```
flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
```
and the full URL would then be
```
https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
```
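
The package can then be downloaded with any HTTP client; the following sketch uses `wget`, but `curl -L -O` works
just as well:
```shell script
# Download the package matching Spark 3.0 and Hadoop 3.2 (URL as constructed above)
wget https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
```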


## Building Flowman

As an alternative to downloading a pre-built distribution of Flowman, you might also want to
[build Flowman](building.md) yourself in order to match your environment. This is not difficult for anyone who has
basic experience with Maven.
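
As a rough sketch, a build from source looks like the following; the exact Maven profiles for selecting specific
Spark and Hadoop versions are described in the build instructions linked above and may differ from this minimal
invocation:
```shell script
# Clone the sources and build the distribution with Maven.
# Additional profiles for selecting Spark/Hadoop versions may be required - see building.md.
git clone https://github.com/dimajix/flowman.git
cd flowman
mvn clean install -DskipTests
```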


## Local Installation

Flowman is distributed as a `tar.gz` file, which simply needs to be extracted at some location on your computer or
server. This can be done via
```shell script
tar xvzf flowman-dist-X.Y.Z-bin.tar.gz
```
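
For example, to install Flowman under `/opt` (the target directory is only a suggestion) and point the
`FLOWMAN_HOME` environment variable at the installation:
```shell script
# Extract the archive to /opt and export FLOWMAN_HOME
sudo tar -C /opt -xvzf flowman-dist-X.Y.Z-bin.tar.gz
# Adjust the directory name to whatever the archive actually creates
export FLOWMAN_HOME=/opt/flowman-X.Y.Z
```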

### Directory Layout

```bash
├── bin
├── conf
├── examples
│   ├── plugin-example
│   │   └── job
│   ├── sftp-upload
│   │   ├── config
│   │   ├── data
│   │   └── job
│   └── weather
│       ├── config
│       ├── job
│       ├── mapping
│       ├── model
│       └── target
├── lib
├── libexec
└── plugins
    ├── flowman-aws
    ├── flowman-azure
    ├── flowman-example
    ├── flowman-impala
    ├── flowman-kafka
    └── flowman-mariadb
```

* The `bin` directory contains the Flowman executables
* The `conf` directory contains global configuration files
* The `lib` directory contains all Java jars
* The `libexec` directory contains some internal helper scripts
* The `plugins` directory contains more subdirectories, each containing a single plugin
* The `examples` directory contains some example projects


## Configuration

You probably need to perform some basic global configuration of Flowman. The relevant files are stored in the `conf`
directory.
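
If you prefer to keep your site-specific configuration outside of the installation directory, you can copy the
contents of `conf` to a separate location and point `FLOWMAN_CONF_DIR` (see `flowman-env.sh` below) at it; the
directory `/etc/flowman` used here is only an example:
```shell script
# Keep a site-specific copy of the configuration outside the installation directory
mkdir -p /etc/flowman
cp -r $FLOWMAN_HOME/conf/* /etc/flowman/

# Point Flowman at the custom configuration directory
export FLOWMAN_CONF_DIR=/etc/flowman
```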

### `flowman-env.sh`

The `flowman-env.sh` script sets up the execution environment on a system level. Here, some fundamental Spark
and Hadoop properties can be configured, for example:
* `SPARK_HOME`, `HADOOP_HOME` and related environment variables
* `KRB_PRINCIPAL` and `KRB_KEYTAB` for using Kerberos
* Generic Java options like an HTTP proxy and more

#### Example
```shell script
#!/usr/bin/env bash

# Specify Java home (just in case)
#
#export JAVA_HOME

# Explicitly override Flowman's home. These settings are detected automatically,
# but can be overridden
#
#export FLOWMAN_HOME
#export FLOWMAN_CONF_DIR

# Specify any environment settings and paths
#
#export SPARK_HOME
#export HADOOP_HOME
#export HADOOP_CONF_DIR=${HADOOP_CONF_DIR=$HADOOP_HOME/conf}
#export YARN_HOME
#export HDFS_HOME
#export MAPRED_HOME
#export HIVE_HOME
#export HIVE_CONF_DIR=${HIVE_CONF_DIR=$HIVE_HOME/conf}

# Set the Kerberos principal in YARN cluster
#
#KRB_PRINCIPAL=
#KRB_KEYTAB=

# Specify the YARN queue to use
#
#YARN_QUEUE=

# Use a different spark-submit (for example spark2-submit in Cloudera)
#
#SPARK_SUBMIT=


# Apply any proxy settings from the system environment
#
if [[ "$PROXY_HOST" != "" ]]; then
    SPARK_DRIVER_JAVA_OPTS="
      -Dhttp.proxyHost=${PROXY_HOST}
      -Dhttp.proxyPort=${PROXY_PORT}
      -Dhttps.proxyHost=${PROXY_HOST}
      -Dhttps.proxyPort=${PROXY_PORT}
      $SPARK_DRIVER_JAVA_OPTS"

    SPARK_EXECUTOR_JAVA_OPTS="
      -Dhttp.proxyHost=${PROXY_HOST}
      -Dhttp.proxyPort=${PROXY_PORT}
      -Dhttps.proxyHost=${PROXY_HOST}
      -Dhttps.proxyPort=${PROXY_PORT}
      $SPARK_EXECUTOR_JAVA_OPTS"

    SPARK_OPTS="
      --conf spark.hadoop.fs.s3a.proxy.host=${PROXY_HOST}
      --conf spark.hadoop.fs.s3a.proxy.port=${PROXY_PORT}
      $SPARK_OPTS"
fi

# Set AWS credentials if required. You can also specify these in project config
#
if [[ "$AWS_ACCESS_KEY_ID" != "" ]]; then
    SPARK_OPTS="
      --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}
      --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}
      $SPARK_OPTS"
fi
```
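
The proxy and AWS blocks above only react to variables from the system environment, so a typical way to use them is
to export these variables before starting Flowman; all values below are placeholders:
```shell script
# Placeholders - replace with the actual proxy and credentials of your environment
export PROXY_HOST=proxy.example.com
export PROXY_PORT=3128
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
```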

### `system.yml`
After the execution environment has been set up, `system.yml` is the first configuration file processed by the Java
application. Its main purpose is to load some fundamental plugins which are already required by the next level of
configuration.

#### Example
```yaml
plugins:
  - flowman-impala
```

### `default-namespace.yml`
On top of these very global settings, Flowman also supports so-called *namespaces*. Each project is executed within
the context of one namespace, which is the *default* namespace if nothing else is specified. Each namespace contains
some configuration, such that different namespaces might represent different tenants or different staging environments.
| 190 | + |
| 191 | +#### Example |
| 192 | +```yaml |
| 193 | +name: "default" |
| 194 | +
|
| 195 | +history: |
| 196 | + kind: jdbc |
| 197 | + connection: flowman_state |
| 198 | + retries: 3 |
| 199 | + timeout: 1000 |
| 200 | +
|
| 201 | +hooks: |
| 202 | + - kind: web |
| 203 | + jobSuccess: http://some-host.in.your.net/success&job=$URL.encode($job)&force=$force |
| 204 | +
|
| 205 | +connections: |
| 206 | + flowman_state: |
| 207 | + driver: $System.getenv('FLOWMAN_HISTORY_DRIVER', 'org.apache.derby.jdbc.EmbeddedDriver') |
| 208 | + url: $System.getenv('FLOWMAN_HISTORY_URL', $String.concat('jdbc:derby:', $System.getenv('FLOWMAN_HOME'), '/logdb;create=true')) |
| 209 | + username: $System.getenv('FLOWMAN_HISTORY_USER', '') |
| 210 | + password: $System.getenv('FLOWMAN_HISTORY_PASSWORD', '') |
| 211 | +
|
| 212 | +config: |
| 213 | + - spark.sql.warehouse.dir=$System.getenv('FLOWMAN_HOME')/hive/warehouse |
| 214 | + - hive.metastore.uris= |
| 215 | + - javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=$System.getenv('FLOWMAN_HOME')/hive/db;create=true |
| 216 | + - datanucleus.rdbms.datastoreAdapterClassName=org.datanucleus.store.rdbms.adapter.DerbyAdapter |
| 217 | +
|
| 218 | +plugins: |
| 219 | + - flowman-example |
| 220 | + - flowman-hbase |
| 221 | + - flowman-aws |
| 222 | + - flowman-azure |
| 223 | + - flowman-kafka |
| 224 | + - flowman-mariadb |
| 225 | +
|
| 226 | +store: |
| 227 | + kind: file |
| 228 | + location: $System.getenv('FLOWMAN_HOME')/examples |
| 229 | +``` |
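
Since the `flowman_state` connection above reads its settings from environment variables, the execution history can
be redirected to an external database without editing the file. As an example (assuming the `flowman-mariadb` plugin
provides the JDBC driver, and using the usual MariaDB driver class and URL format), one could set:
```shell script
# Point the execution history at an external MariaDB instance instead of the embedded Derby default.
# Host, database name and credentials are placeholders.
export FLOWMAN_HISTORY_DRIVER=org.mariadb.jdbc.Driver
export FLOWMAN_HISTORY_URL=jdbc:mariadb://mariadb-host:3306/flowman_history
export FLOWMAN_HISTORY_USER=flowman
export FLOWMAN_HISTORY_PASSWORD=secret
```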


## Running in a Kerberized Environment
Please have a look at [Kerberos](cookbook/kerberos.md) for detailed information.

## Deploying with Docker
It is also possible to run Flowman inside Docker. This simply requires a Docker image with a working Spark and
Hadoop installation, such that Flowman can be installed inside the image just as it is installed locally.
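
A minimal sketch of such a setup is shown below; the image name is only a placeholder for an image that already
provides Spark and Hadoop and has the Flowman archive extracted, for example under `/opt/flowman`:
```shell script
# Start an interactive container with a local project directory mounted into it.
# "my-registry/flowman:latest" is a placeholder image name, not an official image.
docker run --rm -it \
    -v "$(pwd)/my-project:/home/flowman/project" \
    my-registry/flowman:latest \
    bash
# Inside the container, Flowman is then used exactly as in a local installation,
# e.g. via the executables in /opt/flowman/bin.
```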