KBC Docker app for extracting data from an Apache Hive database.
See Extractors for SQL Databases for more documentation.
The configuration `config.json` contains the following properties in the `parameters` key (note: either `query` or `table` must be specified):
- `db` - object (required): Connection settings
    - `host` - string (required): IP address or hostname of the Apache Hive DB server
    - `port` - integer (required): Server port (default port is `10000`)
    - `database` - string (required): Database to connect to.
    - `authType` - enum (optional): Type of the authentication.
        - `password` (default) - user/password LDAP authentication.
        - `kerberos` - Kerberos authentication.
    - `user` - string (required if `authType = password`): User with correct access rights
    - `#password` - string (required if `authType = password`): Password for the given `user`
    - `kerberos` - object (required if `authType = kerberos`)
        - `kinitPrincipal` - string (required): Name of the principal for `kinit`, e.g. `init/[email protected]`.
            - It may be shortened if it is supported by your `krb5.conf`, e.g. `init/localhost`.
        - `servicePrincipal` - string (required): Name of the principal for the ODBC connection, e.g. `hive/[email protected]`.
            - A fully qualified name must be used: `[service]/[host]@[realm]`.
        - `config` - string (required): Content of the `krb5.conf` file.
        - `#keytab` - string (required): `base64` encoded content of the `*.keytab` file.
    - `connectThrough` - bool (optional, default `false`)
        - If enabled:
            - The value of the `KBC_REALUSER` environment variable is used as the `DelegationUID` in the connection string.
            - If `KBC_REALUSER` is not set, a UserException is thrown.
        - To use this feature:
            - The SAML login to the Keboola Connection must be used.
            - The SAML token must contain the required data and the stack must be set correctly.
    - `thriftTransport` - int (optional, default Binary (`0`) if you are connecting to Hive Server 1, or SASL (`1`) if you are connecting to Hive Server 2)
        - Binary (`0`)
        - SASL (`1`)
        - HTTP (`2`)
    - `httpPath` - string (optional, default `/hive2` if using Windows Azure HDInsight Service (6), or `/` if using non-Windows Azure HDInsight Service with Thrift Transport set to HTTP (`2`))
    - `batchSize` - positive 32-bit int (optional, default `10000`)
        - It sets the `RowsFetchedPerBlock` parameter.
    - `verboseLogging` - bool (optional, default `false`)
        - When enabled, it sets `LogLevel` to `6` (logs all driver activity).
        - The artifacts feature in KBC has to be enabled, as the logs are uploaded there.
    - `ssl` - object (optional): SSL settings, see the example below
        - `enabled` bool (optional): Default `false`.
        - `ca` or `#ca` string (optional):
            - Bundle of the trusted certificates in PEM/JKS format, see `caFileType`.
            - A special format `internal:my-certs.jks` may be used.
                - Then the certificate is read from the component, from `/var/bundled-files/my-certs.jks`.
                - Currently, the `/var/bundled-files/` directory is empty.
                - The component can be forked and bundled with a larger client certificate.
        - `caFileType` enum (optional, default `pem`):
            - `pem` - `ca` value is in CRT/PEM format.
            - `jks` - `ca` value is in base64 encoded JKS format.
        - `verifyServerCert` bool (optional): Default `true`.
        - `ignoreCertificateCn` bool (optional): Default `false`.
    - `ssh` - object (optional): Settings for the SSH tunnel, see the example below
        - `enabled` - bool (required): Enables SSH tunnel
        - `sshHost` - string (required): IP address or hostname of the SSH server
        - `sshPort` - integer (optional): SSH server port (default port is `22`)
        - `localPort` - integer (optional): SSH tunnel local port in the Docker container (default `33006`)
        - `user` - string (optional): SSH user (default same as `db.user`)
        - `compression` - bool (optional): Enables SSH tunnel compression (default `false`)
        - `keys` - object (optional): SSH keys
            - `public` - string (optional): Public SSH key
            - `#private` - string (optional): Private SSH key
- `query` - string (optional): SQL query whose output will be extracted
- `table` - object (optional): Table whose data will be extracted
    - `tableName` - string (required)
    - `schema` - string (required)
- `columns` - array (optional): List of columns to export (default all columns)
- `outputTable` - string (required): Name of the output table
- `incremental` - bool (optional): Enables Incremental Loading
- `incrementalFetchingColumn` - string (optional): Name of the column for Incremental Fetching
- `incrementalFetchingLimit` - integer (optional): Maximum number of rows fetched per run
- `primaryKey` - string (optional): Sets the primary key to the specified column in the output table
- `retries` - integer (optional): Number of retries if an error occurs
Image or Stack Parameters can be used to set a global configuration for the extractor. This can be used if, for example, all configurations on the stack use the same Kerberos authentication.
The global configuration is stored under the `image_parameters.global_config` key and has LOWER priority than the values in `parameters`.
Example of the configuration that the extractor gets:
```json
{
  "action": "testConnection",
  "image_parameters": {
    "global_config": {
      "db": {
        "authType": "IS OVERWRITTEN BY USER CONFIG",
        "kerberos": "IS OVERWRITTEN BY USER CONFIG",
        "ssl": {
          "enabled": true,
          "ca": "...",
          "caFileType": "jks"
        }
      }
    }
  },
  "parameters": {
    "db": {
      "host": "...",
      "port": "...",
      "database": "...",
      "authType": "kerberos",
      "kerberos": {
        "principal": "...",
        "config": "...",
        "#keytab": "..."
      }
    }
  }
}
```
Simple query:
```json
{
  "parameters": {
    "db": {
      "host": "hive-server",
      "port": 10000,
      "database": "default",
      "user": "admin",
      "#password": "******"
    },
    "query": "SELECT * FROM sales LIMIT 100",
    "outputTable": "in.c-main.sales"
  }
}
```
Export only specified columns:
```json
{
  "parameters": {
    "db": { "host": "..." },
    "table": {
      "tableName": "internal",
      "schema": "default"
    },
    "columns": ["price", "product_name"],
    "outputTable": "in.c-main.internal"
  }
}
```
Incremental fetching using timestamp column:
```json
{
  "parameters": {
    "db": { "host": "..." },
    "outputTable": "in.c-main.incremental",
    "primaryKey": ["id"],
    "incremental": true,
    "incrementalFetchingColumn": "timestamp_col",
    "incrementalFetchingLimit": 100
  }
}
```
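SSL connection (a sketch based on the `ssl` options documented above; the host, credentials, and certificate values are placeholders, not values from the original examples):
```json
{
  "parameters": {
    "db": {
      "host": "hive-server",
      "port": 10000,
      "database": "default",
      "user": "admin",
      "#password": "******",
      "ssl": {
        "enabled": true,
        "#ca": "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n",
        "caFileType": "pem",
        "verifyServerCert": true
      }
    },
    "query": "SELECT * FROM sales LIMIT 100",
    "outputTable": "in.c-main.sales"
  }
}
```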
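SSH tunnel (a sketch based on the `ssh` options documented above; all host names and keys are placeholders):
```json
{
  "parameters": {
    "db": {
      "host": "hive-server",
      "port": 10000,
      "database": "default",
      "user": "admin",
      "#password": "******",
      "ssh": {
        "enabled": true,
        "sshHost": "bastion.example.com",
        "sshPort": 22,
        "user": "tunnel",
        "keys": {
          "public": "ssh-rsa AAAA...",
          "#private": "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----\n"
        }
      }
    },
    "query": "SELECT * FROM sales LIMIT 100",
    "outputTable": "in.c-main.sales"
  }
}
```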
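ODBC driver options (a sketch showing `thriftTransport`, `httpPath`, `batchSize`, and `verboseLogging` together; the values are illustrative, not recommended defaults, and `verboseLogging` additionally requires the artifacts feature to be enabled):
```json
{
  "parameters": {
    "db": {
      "host": "hive-server",
      "port": 10000,
      "database": "default",
      "user": "admin",
      "#password": "******",
      "thriftTransport": 2,
      "httpPath": "/hive2",
      "batchSize": 5000,
      "verboseLogging": true
    },
    "query": "SELECT * FROM sales LIMIT 100",
    "outputTable": "in.c-main.sales"
  }
}
```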
Clone this repository and init the workspace with the following commands.
AWS secrets are required to download the ODBC driver.
```sh
git clone https://github.com/keboola/db-extractor-hive
cd db-extractor-hive
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
docker compose build
docker compose run --rm wait
docker compose run --rm dev composer install --no-scripts
```
Run the test suite using this command:
```sh
docker compose run --rm dev composer tests
```
For information about deployment and integration with KBC, please refer to the deployment section of the developers documentation.
MIT licensed, see LICENSE file.