
Apache Hive DB extractor

KBC Docker app for extracting data from an Apache Hive database.

See Extractors for SQL Databases for more documentation.

Usage

Configuration

The configuration config.json contains the following properties in the parameters key:

Note: either query or table must be specified.

  • db - object (required): Connection settings
    • host - string (required): IP address or hostname of Apache Hive DB server
    • port - integer (required): Server port (default port is 10000)
    • database - string (required): Database to connect to.
    • authType - enum (optional): Type of authentication.
      • password (default) - user/password LDAP authentication.
      • kerberos - Kerberos authentication.
    • user - string (required if authType = password): User with correct access rights
    • #password - string (required if authType = password): Password for given user
    • kerberos - object (required if authType = kerberos); see the sketch after this list
      • kinitPrincipal - string (required):
        • Name of the principal for the kinit, e.g. init/[host]@[realm].
        • It may be shortened if your krb5.conf supports it, e.g. init/localhost.
      • servicePrincipal - string (required):
        • Name of the principal for the ODBC connection, e.g. hive/[host]@[realm].
        • A fully qualified name must be used: [service]/[host]@[realm].
      • config - string (required): Content of the krb5.conf file.
      • #keytab - string (required): Base64-encoded content of the *.keytab file.
    • connectThrough - bool (optional, default false)
      • If enabled:
        • Value from the KBC_REALUSER environment variable is used as the DelegationUID in the connection string.
        • If KBC_REALUSER is not set, a UserException is thrown.
      • To use this feature:
        • SAML login to Keboola Connection must be used.
        • SAML token must contain the required data and the stack must be set correctly.
    • thriftTransport - int (optional): Transport protocol. Defaults to Binary (0) when connecting to Hive Server 1 and to SASL (1) when connecting to Hive Server 2.
      • Binary (0)
      • SASL (1)
      • HTTP (2)
    • httpPath - string (optional): Defaults to /hive2 when using Windows Azure HDInsight Service, or to / when using non-Windows Azure HDInsight Service with thriftTransport set to HTTP (2).
    • batchSize - positive 32-bit int (optional, default 10000)
      • Sets the RowsFetchedPerBlock parameter of the ODBC driver.
    • verboseLogging - bool (optional, default false)
      • When enabled, it sets LogLevel to 6 (logs all driver activity).
      • The artifacts feature in KBC must be enabled, as the logs are uploaded there.
    • ssl - object (optional, see the sketch after this list):
      • enabled bool (optional): Default false.
      • ca or #ca string (optional):
        • Bundle of the trusted certificates in PEM/JKS format, see caFileType.
        • A special format internal:my-certs.jks may be used.
          • The certificate is then read from the component, from /var/bundled-files/my-certs.jks.
          • Currently, the /var/bundled-files/ directory is empty.
          • The component can be forked and bundled with a client certificate.
      • caFileType enum (optional, default pem): pem or jks.
      • verifyServerCert bool (optional): Default true.
      • ignoreCertificateCn bool (optional): Default false.
    • ssh - object (optional): Settings for SSH tunnel
      • enabled - bool (required): Enables SSH tunnel
      • sshHost - string (required): IP address or hostname of SSH server
      • sshPort - integer (optional): SSH server port (default port is 22)
      • localPort - integer (optional): SSH tunnel local port in Docker container (default 33006)
      • user - string (optional): SSH user (default same as db.user)
      • compression - bool (optional): Enables SSH tunnel compression (default false)
      • keys - object (optional): SSH keys
        • public - string (optional): Public SSH key
        • #private - string (optional): Private SSH key
  • query - string (optional): SQL query whose output will be extracted
  • table - object (optional): Table to be extracted
    • tableName - string (required)
    • schema - string (required)
  • columns - array (optional): List of columns to export (default all columns)
  • outputTable - string (required): Name of the output table
  • incremental - bool (optional): Enables Incremental Loading
  • incrementalFetchingColumn - string (optional): Name of column for Incremental Fetching
  • incrementalFetchingLimit - integer (optional): Max number of rows fetched per one run
  • primaryKey - array (optional): Sets the primary key of the output table to the specified columns
  • retries - integer (optional): Number of retries if an error occurs
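
For illustration, a minimal db sketch for Kerberos authentication. The host, both principals, the realm EXAMPLE.COM, and the config/keytab values are placeholders, not working credentials. connectThrough is shown enabled, which additionally requires the KBC_REALUSER variable provided by SAML login:

{
  "db": {
    "host": "hive-server",
    "port": 10000,
    "database": "default",
    "authType": "kerberos",
    "connectThrough": true,
    "kerberos": {
      "kinitPrincipal": "init/hive-server@EXAMPLE.COM",
      "servicePrincipal": "hive/hive-server@EXAMPLE.COM",
      "config": "... content of the krb5.conf file ...",
      "#keytab": "... base64 encoded keytab ..."
    }
  }
}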
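
Similarly, a sketch combining an explicit transport, batch size, SSL, and an SSH tunnel; all hostnames, certificates, and keys are placeholders:

{
  "db": {
    "host": "hive-server",
    "port": 10000,
    "database": "default",
    "user": "admin",
    "#password": "******",
    "thriftTransport": 1,
    "batchSize": 10000,
    "ssl": {
      "enabled": true,
      "#ca": "-----BEGIN CERTIFICATE----- ...",
      "caFileType": "pem",
      "verifyServerCert": true
    },
    "ssh": {
      "enabled": true,
      "sshHost": "bastion.example.com",
      "sshPort": 22,
      "user": "tunnel",
      "keys": {
        "public": "ssh-rsa ...",
        "#private": "-----BEGIN RSA PRIVATE KEY----- ..."
      }
    }
  }
}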

Global config

Image or Stack Parameters can be used to set a global configuration for the extractor, e.g. when all configurations on a stack use the same Kerberos authentication.

The global configuration is stored under the image_parameters.global_config key and has LOWER priority than the values in parameters.

Example of the configuration that the extractor gets:

{
  "action": "testConnection",
  "image_parameters": {
    "global_config": {
      "db": {
        "authType": "IS OVERWRITTEN BY USER CONFIG",
        "kerberos": "IS OVERWRITTEN BY USER CONFIG"
        "ssl": {
          "enabled": true,
          "ca": "...",
          "caFileType": "jks"
        }
      }
    }
  },
  "parameters": {
    "db": {
      "host": "...",
      "port":  "...",
      "database": "...",
      "authType": "kerberos",
      "kerberos": {
        "principal": "...",
        "config": "...",
        "#keytab": "..."
      },
    }
  }
}
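
Assuming the merge overlays parameters on top of global_config key by key, keeping keys present only in global_config (here ssl), the effective configuration for the example above would look roughly like this sketch:

{
  "parameters": {
    "db": {
      "host": "...",
      "port": "...",
      "database": "...",
      "authType": "kerberos",
      "kerberos": {
        "kinitPrincipal": "...",
        "servicePrincipal": "...",
        "config": "...",
        "#keytab": "..."
      },
      "ssl": {
        "enabled": true,
        "ca": "...",
        "caFileType": "jks"
      }
    }
  }
}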

Examples

Simple query:

{
  "parameters": {
    "db": {
      "host": "hive-server",
      "port": 10000,
      "database": "default",
      "user": "admin",
      "#password": "******"
    },
    "query": "SELECT * FROM sales LIMIT 100",
    "outputTable": "in.c-main.sales"
  }
}

Export only specified columns:

{
  "parameters": {
    "db": { "host": "..." },
    "table": {
      "tableName": "internal",
      "schema": "default"
    },
    "columns": ["price", "product_name"],
    "outputTable": "in.c-main.internal"
  }
}

Incremental fetching using timestamp column:

{
  "parameters": {
    "db": { "host": "..." },
    "table": {
      "tableName": "incremental",
      "schema": "default"
    },
    "outputTable": "in.c-main.incremental",
    "primaryKey": ["id"],
    "incremental": true,
    "incrementalFetchingColumn": "timestamp_col",
    "incrementalFetchingLimit": 100
  }
}

Development

Clone this repository and initialize the workspace with the following commands.

AWS secrets are required to download the ODBC driver.

git clone https://github.com/keboola/db-extractor-hive
cd db-extractor-hive
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
docker compose build
docker compose run --rm wait
docker compose run --rm dev composer install --no-scripts

Run the test suite using this command:

docker compose run --rm dev composer tests
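
A single test case can typically be run by invoking PHPUnit directly in the dev container. This assumes the standard vendor/bin/phpunit binary installed by composer install; --filter is a standard PHPUnit option, and MyTest is a hypothetical test name:

# hypothetical example: run only tests whose name matches MyTest
docker compose run --rm dev vendor/bin/phpunit --filter MyTest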

Integration

For information about deployment and integration with KBC, please refer to the deployment section of the developer documentation.

License

MIT licensed, see LICENSE file.
