Commit f0dab8d

Merge branch 'develop'
2 parents 0ee3476 + a449490 commit f0dab8d

73 files changed: +1206 additions, -238 deletions


.gitignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,4 +6,4 @@ derby.log
 dependency-reduced-pom.xml
 metastore_db/
 /target/
-
+release/
```

.travis.yml

Lines changed: 3 additions & 3 deletions
```diff
@@ -53,15 +53,15 @@ jobs:
 
     - name: Hadoop 2.9 with Spark 3.0
       jdk: openjdk8
-      script: mvn clean install -Phadoop-2.9 -Pspark-3.0 -Ddockerfile.skip
+      script: mvn clean install -Phadoop-2.9 -Pspark-3.0
 
     - name: Hadoop 3.1 with Spark 3.0
       jdk: openjdk8
-      script: mvn clean install -Phadoop-3.1 -Pspark-3.0 -Ddockerfile.skip
+      script: mvn clean install -Phadoop-3.1 -Pspark-3.0
 
     - name: Hadoop 3.2 with Spark 3.0
       jdk: openjdk8
-      script: mvn clean install -Phadoop-3.2 -Pspark-3.0 -Ddockerfile.skip
+      script: mvn clean install -Phadoop-3.2 -Pspark-3.0
 
     - name: CDH 5.15
       jdk: openjdk8
```

BUILDING.md

Lines changed: 5 additions & 2 deletions
```diff
@@ -32,15 +32,18 @@ value "core.autocrlf" to "input"
 
 You might also want to skip unittests (the HBase plugin is currently failing under windows)
 
-    mvn clean install -DskipTests
+    mvn clean install -DskipTests
+
+It may well be the case that some unittests fail on Windows - don't panic, we focus on Linux systems and ensure that
+the `master` branch really builds clean with all unittests passing on Linux.
 
 
 ## Build for Custom Spark / Hadoop Version
 
 Per default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
 But of course you can also build for a different version by either using a profile
 
-    mvn install -Pspark2.2 -Phadoop2.7 -DskipTests
+    mvn install -Pspark2.3 -Phadoop2.7 -DskipTests
 
 This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
 as follows:
```

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -1,3 +1,11 @@
+# Version 0.14.2
+
+* Upgrade to Spark 2.4.7 and Spark 3.0.1
+* Clean up dependencies
+* Disable build of Docker image
+* Update examples
+
+
 # Version 0.14.1 - 2020-09-28
 
 * Fix dropping of partitions which could cause issues on CDH6
```

INSTALLING.md

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@

# Installation Guide

## Requirements

Flowman brings many dependencies with the installation archive, but everything related to Hadoop or Spark needs to
be provided by your platform. This approach ensures that the existing Spark and Hadoop installation is used together
with all patches and extensions available on your platform. Specifically this means that Flowman requires the following
components present on your system:

* Java 1.8
* Apache Spark with a matching minor version
* Apache Hadoop with a matching minor version

Note that Flowman can be built for different Hadoop and Spark versions, and the major and minor version of the build
needs to match the versions of your platform.


## Downloading Flowman

Since version 0.14.1, prebuilt releases are provided on [GitHub](https://github.com/dimajix/flowman/releases).
This is probably the simplest way to grab a working Flowman package. Note that for each release, different
packages are provided for different Spark and Hadoop versions. The naming is very simple:
```
flowman-dist-<version>-oss-spark<spark-version>-hadoop<hadoop-version>-bin.tar.gz
```
You simply have to use the package which fits the Spark and Hadoop versions of your environment. For example, the
package of Flowman 0.14.1 for Spark 3.0 and Hadoop 3.2 would be
```
flowman-dist-0.14.1-oss-spark30-hadoop32-bin.tar.gz
```
and the full URL then would be
```
https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
```
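As a rough sketch, downloading and unpacking that particular package could look like this (`wget` is merely an assumption; any downloader works, and the file name follows the naming scheme above):
```shell script
# Download the package matching Spark 3.0 / Hadoop 3.2 (URL as shown above)
wget https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz

# Unpack it at a location of your choice
tar xvzf flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
```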
## Building Flowman

As an alternative to downloading a pre-built distribution of Flowman, you might also want to
[build Flowman](building.md) yourself in order to match your environment. This is not difficult for anyone with
basic experience with Maven.


## Local Installation

Flowman is distributed as a `tar.gz` file, which simply needs to be extracted at some location on your computer or
server. This can be done via
```shell script
tar xvzf flowman-dist-X.Y.Z-bin.tar.gz
```

### Directory Layout

```bash
├── bin
├── conf
├── examples
│   ├── plugin-example
│   │   └── job
│   ├── sftp-upload
│   │   ├── config
│   │   ├── data
│   │   └── job
│   └── weather
│       ├── config
│       ├── job
│       ├── mapping
│       ├── model
│       └── target
├── lib
├── libexec
└── plugins
    ├── flowman-aws
    ├── flowman-azure
    ├── flowman-example
    ├── flowman-impala
    ├── flowman-kafka
    └── flowman-mariadb
```

* The `bin` directory contains the Flowman executables
* The `conf` directory contains global configuration files
* The `lib` directory contains all Java jars
* The `libexec` directory contains some internal helper scripts
* The `plugins` directory contains more subdirectories, each containing a single plugin
* The `examples` directory contains some example projects


## Configuration

You probably need to perform some basic global configuration of Flowman. The relevant files are stored in the `conf`
directory.

### `flowman-env.sh`

The `flowman-env.sh` script sets up the execution environment on a system level. Some very fundamental Spark
and Hadoop properties can be configured here, for example
* `SPARK_HOME`, `HADOOP_HOME` and related environment variables
* `KRB_PRINCIPAL` and `KRB_KEYTAB` for using Kerberos
* Generic Java options like http proxy and more

#### Example
```shell script
#!/usr/bin/env bash

# Specify Java home (just in case)
#
#export JAVA_HOME

# Explicitly override Flowman's home. These settings are detected automatically,
# but can be overridden
#
#export FLOWMAN_HOME
#export FLOWMAN_CONF_DIR

# Specify any environment settings and paths
#
#export SPARK_HOME
#export HADOOP_HOME
#export HADOOP_CONF_DIR=${HADOOP_CONF_DIR=$HADOOP_HOME/conf}
#export YARN_HOME
#export HDFS_HOME
#export MAPRED_HOME
#export HIVE_HOME
#export HIVE_CONF_DIR=${HIVE_CONF_DIR=$HIVE_HOME/conf}

# Set the Kerberos principal in YARN cluster
#
#KRB_PRINCIPAL=
#KRB_KEYTAB=

# Specify the YARN queue to use
#
#YARN_QUEUE=

# Use a different spark-submit (for example spark2-submit in Cloudera)
#
#SPARK_SUBMIT=


# Apply any proxy settings from the system environment
#
if [[ "$PROXY_HOST" != "" ]]; then
    SPARK_DRIVER_JAVA_OPTS="
        -Dhttp.proxyHost=${PROXY_HOST}
        -Dhttp.proxyPort=${PROXY_PORT}
        -Dhttps.proxyHost=${PROXY_HOST}
        -Dhttps.proxyPort=${PROXY_PORT}
        $SPARK_DRIVER_JAVA_OPTS"

    SPARK_EXECUTOR_JAVA_OPTS="
        -Dhttp.proxyHost=${PROXY_HOST}
        -Dhttp.proxyPort=${PROXY_PORT}
        -Dhttps.proxyHost=${PROXY_HOST}
        -Dhttps.proxyPort=${PROXY_PORT}
        $SPARK_EXECUTOR_JAVA_OPTS"

    SPARK_OPTS="
        --conf spark.hadoop.fs.s3a.proxy.host=${PROXY_HOST}
        --conf spark.hadoop.fs.s3a.proxy.port=${PROXY_PORT}
        $SPARK_OPTS"
fi

# Set AWS credentials if required. You can also specify these in project config
#
if [[ "$AWS_ACCESS_KEY_ID" != "" ]]; then
    SPARK_OPTS="
        --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}
        --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}
        $SPARK_OPTS"
fi
```
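In a kerberized YARN environment, the two `KRB_*` variables above are the ones to fill in. The values below are purely hypothetical placeholders, not defaults shipped with Flowman:
```shell script
# Hypothetical example values -- use the principal and keytab provisioned for your cluster
KRB_PRINCIPAL=flowman@MY-REALM.COM
KRB_KEYTAB=/etc/security/keytabs/flowman.keytab
```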
### `system.yml`
After the execution environment has been set up, `system.yml` is the first configuration file processed by the Java
application. Its main purpose is to load some fundamental plugins which are already required by the next level of
configuration.

#### Example
```yaml
plugins:
  - flowman-impala
```

### `default-namespace.yml`
On top of the very global settings, Flowman also supports so-called *namespaces*. Each project is executed within the
context of one namespace; if nothing else is specified, this is the *default namespace*. Each namespace contains some
configuration, such that different namespaces might represent different tenants or different staging environments.

#### Example
```yaml
name: "default"

history:
  kind: jdbc
  connection: flowman_state
  retries: 3
  timeout: 1000

hooks:
  - kind: web
    jobSuccess: http://some-host.in.your.net/success&job=$URL.encode($job)&force=$force

connections:
  flowman_state:
    driver: $System.getenv('FLOWMAN_HISTORY_DRIVER', 'org.apache.derby.jdbc.EmbeddedDriver')
    url: $System.getenv('FLOWMAN_HISTORY_URL', $String.concat('jdbc:derby:', $System.getenv('FLOWMAN_HOME'), '/logdb;create=true'))
    username: $System.getenv('FLOWMAN_HISTORY_USER', '')
    password: $System.getenv('FLOWMAN_HISTORY_PASSWORD', '')

config:
  - spark.sql.warehouse.dir=$System.getenv('FLOWMAN_HOME')/hive/warehouse
  - hive.metastore.uris=
  - javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=$System.getenv('FLOWMAN_HOME')/hive/db;create=true
  - datanucleus.rdbms.datastoreAdapterClassName=org.datanucleus.store.rdbms.adapter.DerbyAdapter

plugins:
  - flowman-example
  - flowman-hbase
  - flowman-aws
  - flowman-azure
  - flowman-kafka
  - flowman-mariadb

store:
  kind: file
  location: $System.getenv('FLOWMAN_HOME')/examples
```
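Since both files live in the `conf` directory, one possible way to maintain several such configurations (for example per staging environment) is to keep separate conf directories and select one via the `FLOWMAN_CONF_DIR` variable shown in `flowman-env.sh` above. The layout below is purely hypothetical:
```shell script
# Hypothetical layout: one configuration directory per environment
export FLOWMAN_CONF_DIR=/opt/flowman/conf-production

# The command line tools should then pick up the selected configuration (assumption)
$FLOWMAN_HOME/bin/flowexec -f /path/to/project/folder <cmd>
```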
## Running in a Kerberized Environment

Please have a look at [Kerberos](cookbook/kerberos.md) for detailed information.


## Deploying with Docker

It is also possible to run Flowman inside Docker. This simply requires a Docker image with a working Spark and
Hadoop installation, such that Flowman can be installed inside the image just as it is installed locally.
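A minimal, purely hypothetical sketch of that idea (the image name and paths below are placeholders, not something provided by the Flowman project) mounts the unpacked distribution into a container that already ships Spark and Hadoop:
```shell script
# Hypothetical sketch -- replace the image name and paths with whatever your platform provides
docker run --rm -it \
    -v /path/to/unpacked/flowman-dist:/opt/flowman \
    -e FLOWMAN_HOME=/opt/flowman \
    your-spark-hadoop-image:latest \
    /bin/bash
```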

Jenkinsfile

Lines changed: 0 additions & 24 deletions
This file was deleted.

README.md

Lines changed: 44 additions & 9 deletions
```diff
@@ -1,12 +1,44 @@
 # Flowman
 
-Flowman is a Spark based ETL tool.
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Build Status](https://travis-ci.org/dimajix/flowman.svg?branch=develop)](https://travis-ci.org/dimajix/flowman)
+[![Documentation](https://readthedocs.org/projects/flowman/badge/?version=latest)](https://flowman.readthedocs.io/en/latest/)
+
+Flowman is a Spark based ETL program that simplifies the act of writing data transformations.
+The main idea is that users write so called *specifications* in purely declarative YAML files
+instead of writing Spark jobs in Scala or Python. The main advantage of this approach is that
+many technical details of a correct and robust implementation are encapsulated and the user
+can concentrate on the data transformations themselves.
+
+In addition to writing and executing data transformations, Flowman can also be used for
+managing physical data models, i.e. Hive tables. Flowman can create such tables from a
+specification with the correct schema. This helps to keep all aspects (like transformations
+and schema information) in a single place managed by a single program.
+
+### Notable Features
+
+* Declarative syntax in YAML files
+* Data model management (Create and Destroy Hive tables or file based storage)
+* Flexible expression language
+* Jobs for managing build targets (like copying files or uploading data via sftp)
+* Powerful yet simple command line tool
+* Extendable via Plugins
+
+
+## Documentation
+
+You can find comprehensive documentation at [Read the Docs](https://flowman.readthedocs.io/en/latest/).
+
 
 # Installation
 
-The Maven build will create both a packed distribution file and a Docker image.
+You can either grab an appropriate pre-built package at https://github.com/dimajix/flowman/releases or you
+can build your own version via Maven with
+
+    mvn clean install
+
+Please also read [BUILDING.md](BUILDING.md) for detailed instructions, specifically on build profiles.
 
-    mvn clean install -PCDH-5.15
 
 ## Installing the Packed Distribution
 
@@ -15,7 +47,8 @@ location using
 
     tar xvzf flowman-{version}-bin.tar.gz
 
-# Command Line Util
+
+# Command Line Utils
 
 The primary tool provided by Flowman is called `flowexec` and is located in the `bin` folder of the
 installation directory.
@@ -32,9 +65,11 @@ project with a file `project.yml` or you need to specify the path to a valid pro
 
     flowexec -f /path/to/project/folder <cmd>
 
-
-# Debugging
+## Interactive Shell
+
+With version 0.14.0, Flowman also introduced a new interactive shell for executing data flows. The shell can be
+started via
 
-When you want to run the application inside your IDE (for example for debugging purpose), the best way to do that is
-to actually install (via `tar xvzf ...`) the application into some directory and then set the environment variable
-`FLOWMAN_HOME` accordingly. This will ensure that all plugins can be found and loaded.
+    flowshell -f <project>
+
+Within the shell, you can interactively build targets and inspect intermediate mappings.
```
