Commit f0dab8d

Merge branch 'develop'
2 parents 0ee3476 + a449490 commit f0dab8d

73 files changed: +1206 additions, -238 deletions


.gitignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,4 +6,4 @@ derby.log
 dependency-reduced-pom.xml
 metastore_db/
 /target/
-
+release/
```

.travis.yml

Lines changed: 3 additions & 3 deletions
```diff
@@ -53,15 +53,15 @@ jobs:
 
     - name: Hadoop 2.9 with Spark 3.0
       jdk: openjdk8
-      script: mvn clean install -Phadoop-2.9 -Pspark-3.0 -Ddockerfile.skip
+      script: mvn clean install -Phadoop-2.9 -Pspark-3.0
 
     - name: Hadoop 3.1 with Spark 3.0
       jdk: openjdk8
-      script: mvn clean install -Phadoop-3.1 -Pspark-3.0 -Ddockerfile.skip
+      script: mvn clean install -Phadoop-3.1 -Pspark-3.0
 
     - name: Hadoop 3.2 with Spark 3.0
       jdk: openjdk8
-      script: mvn clean install -Phadoop-3.2 -Pspark-3.0 -Ddockerfile.skip
+      script: mvn clean install -Phadoop-3.2 -Pspark-3.0
 
     - name: CDH 5.15
       jdk: openjdk8
```

BUILDING.md

Lines changed: 5 additions & 2 deletions
```diff
@@ -32,15 +32,18 @@ value "core.autocrlf" to "input"
 
 You might also want to skip unittests (the HBase plugin is currently failing under windows)
 
-    mvn clean install -DskipTests
+    mvn clean install -DskipTests
+
+It may well be the case that some unittests fail on Windows - don't panic, we focus on Linux systems and ensure that
+the `master` branch really builds clean with all unittests passing on Linux.
 
 
 ## Build for Custom Spark / Hadoop Version
 
 Per default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
 But of course you can also build for a different version by either using a profile
 
-    mvn install -Pspark2.2 -Phadoop2.7 -DskipTests
+    mvn install -Pspark2.3 -Phadoop2.7 -DskipTests
 
 This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
 as follows:
```

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -1,3 +1,11 @@
+# Version 0.14.2
+
+* Upgrade to Spark 2.4.7 and Spark 3.0.1
+* Clean up dependencies
+* Disable build of Docker image
+* Update examples
+
+
 # Version 0.14.1 - 2020-09-28
 
 * Fix dropping of partitions which could cause issues on CDH6
```

INSTALLING.md

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@

# Installation Guide

## Requirements

Flowman brings many dependencies with the installation archive, but everything related to Hadoop or Spark needs to
be provided by your platform. This approach ensures that the existing Spark and Hadoop installation is used together
with all patches and extensions available on your platform. Specifically this means that Flowman requires the following
components present on your system:

* Java 1.8
* Apache Spark with a matching minor version
* Apache Hadoop with a matching minor version

Note that Flowman can be built for different Hadoop and Spark versions, and the major and minor version of the build
needs to match the versions of your platform.


## Downloading Flowman

Since version 0.14.1, prebuilt releases are provided on [GitHub](https://github.com/dimajix/flowman/releases).
This is probably the simplest way to grab a working Flowman package. Note that for each release, different
packages are provided for different Spark and Hadoop versions. The naming is very simple:
```
flowman-dist-<version>-oss-spark<spark-version>-hadoop<hadoop-version>-bin.tar.gz
```
You simply have to use the package which fits the Spark and Hadoop versions of your environment. For example, the
package of Flowman 0.14.1 for Spark 3.0 and Hadoop 3.2 would be
```
flowman-dist-0.14.1-oss-spark30-hadoop32-bin.tar.gz
```
and the full URL then would be
```
https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
```
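As a rough sketch, downloading and unpacking that particular package could look like this (`wget` is merely an assumption; any downloader works, and the file name follows the naming scheme above):
```shell script
# Download the package matching Spark 3.0 / Hadoop 3.2 (URL as shown above)
wget https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz

# Unpack it at a location of your choice
tar xvzf flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
```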
## Building Flowman

As an alternative to downloading a pre-built distribution of Flowman, you might also want to
[build Flowman](building.md) yourself in order to match your environment. This is not difficult for anyone with
basic experience with Maven.


## Local Installation

Flowman is distributed as a `tar.gz` file, which simply needs to be extracted at some location on your computer or
server. This can be done via
```shell script
tar xvzf flowman-dist-X.Y.Z-bin.tar.gz
```

### Directory Layout

```bash
├── bin
├── conf
├── examples
│   ├── plugin-example
│   │   └── job
│   ├── sftp-upload
│   │   ├── config
│   │   ├── data
│   │   └── job
│   └── weather
│       ├── config
│       ├── job
│       ├── mapping
│       ├── model
│       └── target
├── lib
├── libexec
└── plugins
    ├── flowman-aws
    ├── flowman-azure
    ├── flowman-example
    ├── flowman-impala
    ├── flowman-kafka
    └── flowman-mariadb
```

* The `bin` directory contains the Flowman executables
* The `conf` directory contains global configuration files
* The `lib` directory contains all Java jars
* The `libexec` directory contains some internal helper scripts
* The `plugins` directory contains more subdirectories, each containing a single plugin
* The `examples` directory contains some example projects


## Configuration

You probably need to perform some basic global configuration of Flowman. The relevant files are stored in the `conf`
directory.

### `flowman-env.sh`

The `flowman-env.sh` script sets up the execution environment on a system level. Some very fundamental Spark
and Hadoop properties can be configured here, for example
* `SPARK_HOME`, `HADOOP_HOME` and related environment variables
* `KRB_PRINCIPAL` and `KRB_KEYTAB` for using Kerberos
* Generic Java options like http proxy and more

#### Example
```shell script
#!/usr/bin/env bash

# Specify Java home (just in case)
#
#export JAVA_HOME

# Explicitly override Flowman's home. These settings are detected automatically,
# but can be overridden
#
#export FLOWMAN_HOME
#export FLOWMAN_CONF_DIR

# Specify any environment settings and paths
#
#export SPARK_HOME
#export HADOOP_HOME
#export HADOOP_CONF_DIR=${HADOOP_CONF_DIR=$HADOOP_HOME/conf}
#export YARN_HOME
#export HDFS_HOME
#export MAPRED_HOME
#export HIVE_HOME
#export HIVE_CONF_DIR=${HIVE_CONF_DIR=$HIVE_HOME/conf}

# Set the Kerberos principal in YARN cluster
#
#KRB_PRINCIPAL=
#KRB_KEYTAB=

# Specify the YARN queue to use
#
#YARN_QUEUE=

# Use a different spark-submit (for example spark2-submit in Cloudera)
#
#SPARK_SUBMIT=


# Apply any proxy settings from the system environment
#
if [[ "$PROXY_HOST" != "" ]]; then
    SPARK_DRIVER_JAVA_OPTS="
        -Dhttp.proxyHost=${PROXY_HOST}
        -Dhttp.proxyPort=${PROXY_PORT}
        -Dhttps.proxyHost=${PROXY_HOST}
        -Dhttps.proxyPort=${PROXY_PORT}
        $SPARK_DRIVER_JAVA_OPTS"

    SPARK_EXECUTOR_JAVA_OPTS="
        -Dhttp.proxyHost=${PROXY_HOST}
        -Dhttp.proxyPort=${PROXY_PORT}
        -Dhttps.proxyHost=${PROXY_HOST}
        -Dhttps.proxyPort=${PROXY_PORT}
        $SPARK_EXECUTOR_JAVA_OPTS"

    SPARK_OPTS="
        --conf spark.hadoop.fs.s3a.proxy.host=${PROXY_HOST}
        --conf spark.hadoop.fs.s3a.proxy.port=${PROXY_PORT}
        $SPARK_OPTS"
fi

# Set AWS credentials if required. You can also specify these in project config
#
if [[ "$AWS_ACCESS_KEY_ID" != "" ]]; then
    SPARK_OPTS="
        --conf spark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}
        --conf spark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}
        $SPARK_OPTS"
fi
```
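In a kerberized YARN environment, the two `KRB_*` variables above are the ones to fill in. The values below are purely hypothetical placeholders, not defaults shipped with Flowman:
```shell script
# Hypothetical example values -- use the principal and keytab provisioned for your cluster
KRB_PRINCIPAL=flowman@MY-REALM.COM
KRB_KEYTAB=/etc/security/keytabs/flowman.keytab
```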
### `system.yml`
After the execution environment has been set up, `system.yml` is the first configuration file processed by the Java
application. Its main purpose is to load some fundamental plugins which are already required by the next level of
configuration.

#### Example
```yaml
plugins:
  - flowman-impala
```

### `default-namespace.yml`
On top of the very global settings, Flowman also supports so-called *namespaces*. Each project is executed within the
context of one namespace; if nothing else is specified, this is the *default namespace*. Each namespace contains some
configuration, such that different namespaces might represent different tenants or different staging environments.

#### Example
```yaml
name: "default"

history:
  kind: jdbc
  connection: flowman_state
  retries: 3
  timeout: 1000

hooks:
  - kind: web
    jobSuccess: http://some-host.in.your.net/success&job=$URL.encode($job)&force=$force

connections:
  flowman_state:
    driver: $System.getenv('FLOWMAN_HISTORY_DRIVER', 'org.apache.derby.jdbc.EmbeddedDriver')
    url: $System.getenv('FLOWMAN_HISTORY_URL', $String.concat('jdbc:derby:', $System.getenv('FLOWMAN_HOME'), '/logdb;create=true'))
    username: $System.getenv('FLOWMAN_HISTORY_USER', '')
    password: $System.getenv('FLOWMAN_HISTORY_PASSWORD', '')

config:
  - spark.sql.warehouse.dir=$System.getenv('FLOWMAN_HOME')/hive/warehouse
  - hive.metastore.uris=
  - javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=$System.getenv('FLOWMAN_HOME')/hive/db;create=true
  - datanucleus.rdbms.datastoreAdapterClassName=org.datanucleus.store.rdbms.adapter.DerbyAdapter

plugins:
  - flowman-example
  - flowman-hbase
  - flowman-aws
  - flowman-azure
  - flowman-kafka
  - flowman-mariadb

store:
  kind: file
  location: $System.getenv('FLOWMAN_HOME')/examples
```
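Since both files live in the `conf` directory, one possible way to maintain several such configurations (for example per staging environment) is to keep separate conf directories and select one via the `FLOWMAN_CONF_DIR` variable shown in `flowman-env.sh` above. The layout below is purely hypothetical:
```shell script
# Hypothetical layout: one configuration directory per environment
export FLOWMAN_CONF_DIR=/opt/flowman/conf-production

# The command line tools should then pick up the selected configuration (assumption)
$FLOWMAN_HOME/bin/flowexec -f /path/to/project/folder <cmd>
```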
## Running in a Kerberized Environment

Please have a look at [Kerberos](cookbook/kerberos.md) for detailed information.


## Deploying with Docker

It is also possible to run Flowman inside Docker. This simply requires a Docker image with a working Spark and
Hadoop installation, such that Flowman can be installed inside the image just as it is installed locally.
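A minimal, purely hypothetical sketch of that idea (the image name and paths below are placeholders, not something provided by the Flowman project) mounts the unpacked distribution into a container that already ships Spark and Hadoop:
```shell script
# Hypothetical sketch -- replace the image name and paths with whatever your platform provides
docker run --rm -it \
    -v /path/to/unpacked/flowman-dist:/opt/flowman \
    -e FLOWMAN_HOME=/opt/flowman \
    your-spark-hadoop-image:latest \
    /bin/bash
```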

Jenkinsfile

Lines changed: 0 additions & 24 deletions
This file was deleted.

README.md

Lines changed: 44 additions & 9 deletions
```diff
@@ -1,12 +1,44 @@
 # Flowman
 
-Flowman is a Spark based ETL tool.
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Build Status](https://travis-ci.org/dimajix/flowman.svg?branch=develop)](https://travis-ci.org/dimajix/flowman)
+[![Documentation](https://readthedocs.org/projects/flowman/badge/?version=latest)](https://flowman.readthedocs.io/en/latest/)
+
+Flowman is a Spark based ETL program that simplifies the act of writing data transformations.
+The main idea is that users write so called *specifications* in purely declarative YAML files
+instead of writing Spark jobs in Scala or Python. The main advantage of this approach is that
+many technical details of a correct and robust implementation are encapsulated and the user
+can concentrate on the data transformations themselves.
+
+In addition to writing and executing data transformations, Flowman can also be used for
+managing physical data models, i.e. Hive tables. Flowman can create such tables from a
+specification with the correct schema. This helps to keep all aspects (like transformations
+and schema information) in a single place managed by a single program.
+
+### Notable Features
+
+* Declarative syntax in YAML files
+* Data model management (Create and Destroy Hive tables or file based storage)
+* Flexible expression language
+* Jobs for managing build targets (like copying files or uploading data via sftp)
+* Powerful yet simple command line tool
+* Extendable via Plugins
+
+
+## Documentation
+
+You can find comprehensive documentation at [Read the Docs](https://flowman.readthedocs.io/en/latest/).
+
 
 # Installation
 
-The Maven build will create both a packed distribution file and a Docker image.
+You can either grab an appropriate pre-built package at https://github.com/dimajix/flowman/releases or you
+can build your own version via Maven with
+
+    mvn clean install
+
+Please also read [BUILDING.md](BUILDING.md) for detailed instructions, specifically on build profiles.
 
-    mvn clean install -PCDH-5.15
 
 ## Installing the Packed Distribution
 
@@ -15,7 +47,8 @@ location using
 
     tar xvzf flowman-{version}-bin.tar.gz
 
-# Command Line Util
+
+# Command Line Utils
 
 The primary tool provided by Flowman is called `flowexec` and is located in the `bin` folder of the
 installation directory.
@@ -32,9 +65,11 @@ project with a file `project.yml` or you need to specify the path to a valid pro
 
     flowexec -f /path/to/project/folder <cmd>
 
-
-# Debugging
+## Interactive Shell
+
+With version 0.14.0, Flowman also introduced a new interactive shell for executing data flows. The shell can be
+started via
 
-When you want to run the application inside your IDE (for example for debugging purpose), the best way to do that is
-to actually install (via `tar xvzf ...`) the application into some directory and then set the environment variable
-`FLOWMAN_HOME` accordingly. This will ensure that all plugins can be found and loaded.
+    flowshell -f <project>
+
+Within the shell, you can interactively build targets and inspect intermediate mappings.
```
