This integration collects telemetry from Databricks (including Spark on Databricks) and/or Spark telemetry from any Spark deployment. See the Features section for supported telemetry types.
- All references in this document to Databricks documentation point to the Databricks on AWS documentation. Use the cloud switcher menu in the upper-right corner of the documentation to select the corresponding documentation for a different cloud provider.
To get started with the Databricks Integration, deploy the integration using a supported deployment type, configure the integration using supported configuration mechanisms, and optionally import the sample dashboards included in the examples directory.
The Databricks Integration can be deployed inside a Databricks cluster on the cluster's driver node (recommended) or outside a Databricks cluster on a supported host platform.
NOTE:
- Apache Spark data for a Databricks cluster cannot be collected remotely. In order to collect Apache Spark data from a Databricks cluster, the Databricks Integration must be deployed inside the cluster.
- The Databricks Integration can be used separately from Databricks to solely collect Apache Spark data from any accessible Spark deployment.
The Databricks Integration can be deployed on the driver node of a Databricks all-purpose, job, or pipeline cluster using a cluster-scoped init script. The init script uses custom environment variables to specify configuration parameters necessary for the integration configuration.
To install the init script on an all-purpose cluster, perform the following steps.
1. Log in to your Databricks account and navigate to the desired workspace.
2. Follow the recommendations for init scripts to store the `cluster_init_integration.sh` script within your workspace in the recommended manner. For example, if your workspace is enabled for Unity Catalog, you should store the init script in a Unity Catalog volume.
3. Navigate to the `Compute` tab and select the desired all-purpose compute to open the compute details UI.
4. Click the button labeled `Edit` to edit the compute's configuration.
5. Follow the steps to use the UI to configure a cluster-scoped init script and point to the location where you stored the init script in step 2 above.
6. If your cluster is not running, click the button labeled `Confirm` to save your changes, then restart the cluster. If your cluster is already running, click the button labeled `Confirm and restart` to save your changes and restart the cluster.
Additionally, follow the steps to set environment variables and add the following environment variables.

- `NEW_RELIC_API_KEY` - Your New Relic User API key
- `NEW_RELIC_LICENSE_KEY` - Your New Relic license key
- `NEW_RELIC_ACCOUNT_ID` - Your New Relic account ID
- `NEW_RELIC_REGION` - The region of your New Relic account; one of `US` or `EU`
- `NEW_RELIC_DATABRICKS_WORKSPACE_HOST` - The instance name of the target Databricks instance
- `NEW_RELIC_DATABRICKS_ACCESS_TOKEN` - To authenticate with a personal access token, your personal access token
- `NEW_RELIC_DATABRICKS_OAUTH_CLIENT_ID` - To use a service principal to authenticate with Databricks (OAuth M2M), the OAuth client ID for the service principal
- `NEW_RELIC_DATABRICKS_OAUTH_CLIENT_SECRET` - To use a service principal to authenticate with Databricks (OAuth M2M), an OAuth client secret associated with the service principal
- `NEW_RELIC_DATABRICKS_USAGE_ENABLED` - Set to `true` to enable collection of consumption and cost data from this cluster node or `false` to disable collection. Defaults to `true`.
- `NEW_RELIC_DATABRICKS_SQL_WAREHOUSE` - The ID of a SQL warehouse where usage queries should be run
- `NEW_RELIC_DATABRICKS_JOB_RUNS_ENABLED` - Set to `true` to enable collection of job run data from this cluster node or `false` to disable collection. Defaults to `true`.
- `NEW_RELIC_DATABRICKS_PIPELINE_EVENT_LOGS_ENABLED` - Set to `true` to enable collection of Databricks Delta Live Tables pipeline event logs from this cluster node or `false` to disable collection. Defaults to `true`.
- `NEW_RELIC_DATABRICKS_QUERY_METRICS_ENABLED` - Set to `true` to enable collection of Databricks query metrics for this workspace or `false` to disable collection. Defaults to `true`.
- `NEW_RELIC_INFRASTRUCTURE_ENABLED` - Set to `true` to install the New Relic Infrastructure agent on the driver and worker nodes of the cluster. Defaults to `false`.
- `NEW_RELIC_INFRASTRUCTURE_LOGS_ENABLED` - Set to `true` to enable collection of the Spark driver and executor logs, the Spark driver event log, and the driver and worker init script logs. Logs will be forwarded to New Relic Logs via the New Relic Infrastructure agent; therefore, setting this environment variable to `true` requires that the `NEW_RELIC_INFRASTRUCTURE_ENABLED` environment variable also be set to `true`. Defaults to `false`.
NOTE:
- The `NEW_RELIC_API_KEY` and `NEW_RELIC_ACCOUNT_ID` environment variables are currently unused but are required by the `new-relic-client-go` module used by the integration.
- Only the personal access token or the OAuth credentials need to be specified, but not both. If both are specified, the OAuth credentials take precedence.
- Sensitive data like credentials and API keys should never be specified directly in custom environment variables. Instead, it is recommended to create a secret using the Databricks CLI and reference the secret in the environment variable. See the appendix for an example of creating a secret and referencing it in a custom environment variable.
- To install the init script on a job cluster, see the section configure compute for jobs in the Databricks documentation.
- To install the init script on a pipeline cluster, see the section configure compute for a DLT pipeline in the Databricks documentation.
- Make sure to restart the cluster following the configuration of the environment variables.
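For example, a New Relic license key might be stored as a Databricks secret and then referenced from the corresponding custom environment variable. The scope and key names below are arbitrary examples, and the command forms follow the current Databricks CLI (older CLI versions use `databricks secrets put --scope ... --key ...` instead):

```shell
# Create a secret scope and store the license key in it
databricks secrets create-scope newrelic
databricks secrets put-secret newrelic licenseKey
```

The cluster environment variable can then reference the secret using the `{{secrets/<scope>/<key>}}` syntax, e.g. `NEW_RELIC_LICENSE_KEY={{secrets/newrelic/licenseKey}}`.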
The Databricks Integration provides binaries for the following host platforms.
- Linux amd64
- Windows amd64
To run the Databricks integration on a host outside a Databricks cluster, perform the following steps.
- Download the appropriate archive for your platform from the latest release.
- Extract the archive to a new or existing directory.
- Create a directory named `configs` in the same directory.
- Create a file named `config.yml` in the `configs` directory and copy the contents of the file `configs/config.template.yml` in this repository into it.
- Edit the `config.yml` file to configure the integration appropriately for your environment.
- From the directory where the archive was extracted, execute the integration binary using the command `./newrelic-databricks-integration` (or `.\newrelic-databricks-integration.exe` on Windows) with the appropriate command line options.
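The steps above can be sketched for Linux as follows (the archive asset name and download location are assumptions; check the release page for the exact names):

```shell
mkdir newrelic-databricks && cd newrelic-databricks
tar xzf ~/Downloads/newrelic-databricks-integration_Linux_amd64.tar.gz   # asset name assumed
mkdir configs
# Create configs/config.yml from the repository's configs/config.template.yml,
# edit it for your environment, then run the integration:
./newrelic-databricks-integration
```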
The Databricks Integration supports the following capabilities.
- **Collect Spark telemetry**

  When deployed inside a Databricks cluster, the Databricks Integration can collect telemetry from all Spark applications running in the cluster.

  When deployed outside a Databricks cluster, the Databricks Integration can collect Spark telemetry from all Spark applications on any accessible Spark deployment.
- **Collect Databricks consumption and cost data**

  The Databricks Integration can collect consumption and cost related data from the Databricks system tables. This data can be used to show Databricks DBU consumption metrics and estimated Databricks costs directly within New Relic.
- **Collect Databricks job run telemetry**

  The Databricks Integration can collect telemetry about Databricks job runs, such as job run durations, task run durations, the current state of job and task runs, whether a job or a task is a retry, and the number of times a task was retried.
- **Collect Databricks Delta Live Tables pipeline event logs**

  The Databricks Integration can collect Databricks Delta Live Tables pipeline event logs for all Databricks Delta Live Tables pipelines defined in a workspace. Event log entries for every pipeline update are collected and sent to New Relic Logs.
| Option | Description | Default |
|---|---|---|
| --config_path | path to the config.yml to use | configs/config.yml |
| --dry_run | flag to enable "dry run" mode | false |
| --env_prefix | prefix to use for environment variable lookup | '' |
| --verbose | flag to enable "verbose" mode | false |
| --version | display version information only | N/a |
The Databricks integration is configured using the config.yml
and/or environment variables. For Databricks, authentication related configuration
parameters may also be set in a Databricks configuration profile.
In all cases, where applicable, environment variables always take precedence.
All configuration parameters for the Databricks integration can be set using a
YAML file named config.yml. The default location for this file
is configs/config.yml relative to the current working directory when the
integration binary is executed. The supported configuration parameters are
listed below. See config.template.yml
for a full configuration example.
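For orientation, here is a minimal sketch of a `config.yml`. Key names that do not appear verbatim in this document (for example `licenseKey`, `region`, `workspaceHost`, and `accessToken`) are assumptions; verify them against `configs/config.template.yml`.

```yaml
# Minimal sketch -- see configs/config.template.yml for the authoritative example.
licenseKey: <your-new-relic-ingest-license-key>   # key name assumed
region: US                                        # key name assumed
interval: 60
runAsService: true

# The presence of the databricks node enables the Databricks collector.
databricks:
  workspaceHost: my-workspace.cloud.databricks.com   # key name assumed
  accessToken: <personal-access-token>               # key name assumed

# The presence of the spark node enables the Spark collector.
spark: {}
```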
The parameters in this section are configured at the top level of the
config.yml.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| New Relic license key | string | Y | N/a |
This parameter specifies the New Relic License Key (INGEST) that should be used to send generated metrics.
The license key can also be specified using the NEW_RELIC_LICENSE_KEY
environment variable.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| New Relic region identifier | US / EU | N | US |
This parameter specifies the New Relic region to which generated metrics should be sent.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Polling interval (in seconds) | numeric | N | 60 |
This parameter specifies the interval (in seconds) at which the integration should poll for data.
This parameter is only used when runAsService is set to
true.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable running the integration as a "service" | true / false | N | false |
The integration can run either as a "service" or as a simple command line utility which runs once and exits when it is complete.
When set to true, the integration process will run continuously and poll for
data at the recurring interval specified by the interval
parameter. The process will only exit if it is explicitly stopped or a fatal
error or panic occurs.
When set to false, the integration will run once and exit. This is intended for
use with an external scheduling mechanism like cron.
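With runAsService set to false, an external scheduler drives collection. For example, a crontab entry might look like the following (the install path is an illustrative assumption):

```
# Run the integration every 15 minutes
*/15 * * * * cd /opt/newrelic-databricks-integration && ./newrelic-databricks-integration
```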
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of pipeline configuration parameters | YAML Mapping | N | N/a |
The integration retrieves, processes, and exports data to New Relic using a data pipeline consisting of one or more receivers, a processing chain, and a New Relic exporter. Various aspects of the pipeline are configurable. This element groups together the configuration parameters related to pipeline configuration.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of log configuration parameters | YAML Mapping | N | N/a |
The integration uses the logrus package for application logging. This element groups together the configuration parameters related to log configuration.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The integration execution mode | databricks | N | databricks |
The integration execution mode. Currently, the only supported execution mode is
databricks.
Deprecated: As of v2.3.0, this configuration parameter is no longer used.
The presence (or not) of the databricks top-level node will be
used to enable (or disable) the Databricks collector. Likewise, the presence
(or not) of the spark top-level node will be used to enable (or
disable) the Spark collector separate from Databricks.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector. If this element is not specified, the Databricks collector will not be run.
Note that this node is not required. It can be used with or without the
spark top-level node.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Spark configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Spark collector. If this element is not specified, the Spark collector will not be run.
Note that this node is not required. It can be used with or without the
databricks top-level node.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for a set of custom tags to add to all telemetry sent to New Relic | YAML Mapping | N | N/a |
This element specifies a group of custom tags that will be added to all telemetry sent to New Relic. The tags are specified as a set of key-value pairs.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Size of the buffer that holds items before processing | number | N | 500 |
This parameter specifies the size of the buffer that holds received items before being flushed through the processing chain and on to the exporters. When this size is reached, the items in the buffer will be flushed automatically.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Harvest interval (in seconds) | number | N | 60 |
This parameter specifies the interval (in seconds) at which the pipeline
should automatically flush received items through the processing chain and on
to the exporters. Each time this interval is reached, the pipeline will flush
items even if the item buffer has not reached the size specified by the
receiveBufferSize parameter.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Number of concurrent pipeline instances to run | number | N | 3 |
The integration retrieves, processes, and exports metrics to New Relic using
a data pipeline consisting of one or more receivers, a processing chain, and a
New Relic exporter. When runAsService is true, the
integration can launch one or more "instances" of this pipeline to receive,
process, and export data concurrently. Each "instance" will be configured with
the same processing chain and exporters and the receivers will be spread across
the available instances in a round-robin fashion.
This parameter specifies the number of pipeline instances to launch.
NOTE: When runAsService is false, only a single
pipeline instance is used.
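Assuming these parameters nest under the top-level pipeline node described earlier (verify the exact nesting and the key name for the instance count against `configs/config.template.yml`), the buffer, flush, and concurrency settings might be sketched as:

```yaml
pipeline:
  receiveBufferSize: 500   # flush once 500 items are buffered...
  harvestInterval: 60      # ...or every 60 seconds, whichever comes first
  instances: 3             # concurrent pipeline instances; key name assumed
```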
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Log level | panic / fatal / error / warn / info / debug / trace | N | warn |
This parameter specifies the minimum severity of log messages to output, with
trace being the least severe and panic being the most severe. For example,
at the default log level (warn), all log messages with severities warn,
error, fatal, and panic will be output but info, debug, and trace
will not.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Path to a file where log output will be written | string | N | stderr |
This parameter designates a file path where log output should be written. When
no path is specified, log output will be written to the standard error stream
(stderr).
The Databricks configuration parameters are used to configure the Databricks collector.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Databricks workspace instance name | string | conditional | N/a |
This parameter specifies the instance name
of the target Databricks instance for which data should be collected. This is
used by the integration when constructing the URLs for API calls. Note that the
value of this parameter must not include the https:// prefix; for example, use
my-databricks-instance-name.cloud.databricks.com, not
https://my-databricks-instance-name.cloud.databricks.com.
The workspace host can also be specified using the DATABRICKS_HOST
environment variable.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Databricks personal access token | string | N | N/a |
When set, the integration will use Databricks personal access token authentication to authenticate Databricks API calls with the value of this parameter as the Databricks personal access token.
The personal access token can also be specified using the DATABRICKS_TOKEN
environment variable or any other SDK-supported mechanism (e.g. the token
field in a Databricks configuration profile).
See the authentication section for more details.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Databricks OAuth M2M client ID | string | N | N/a |
When set, the integration will use a service principal to authenticate with Databricks (OAuth M2M) when making Databricks API calls. The value of this parameter will be used as the OAuth client ID.
The OAuth client ID can also be specified using the DATABRICKS_CLIENT_ID
environment variable or any other SDK-supported mechanism (e.g. the client_id
field in a Databricks configuration profile).
See the authentication section for more details.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Databricks OAuth M2M client secret | string | N | N/a |
When the oauthClientId is set, this parameter can be set to
specify the OAuth secret
associated with the service principal.
The OAuth client secret can also be specified using the
DATABRICKS_CLIENT_SECRET environment variable or any other SDK-supported
mechanism (e.g. the client_secret field in a Databricks
configuration profile).
See the authentication section for more details.
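As noted above, the authentication parameters can also come from a Databricks configuration profile. The field names `token`, `client_id`, and `client_secret` are mentioned above; the profile file location and layout below follow the usual Databricks SDK conventions and should be verified against the SDK documentation:

```
# ~/.databrickscfg
[DEFAULT]
host = https://my-databricks-instance-name.cloud.databricks.com

# Either a personal access token...
token = <personal-access-token>

# ...or OAuth M2M service principal credentials (commented out here;
# OAuth credentials take precedence when both are configured):
# client_id     = <oauth-client-id>
# client_secret = <oauth-client-secret>
```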
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Timeout (in seconds) to use when executing SQL statements on a SQL warehouse | number | N | 30 |
Certain telemetry and data collected by the Databricks collector requires the collector to run Databricks SQL statements on a SQL warehouse. This configuration parameter specifies the number of seconds to wait before timing out a pending or running SQL query.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks Usage configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of consumption and cost data.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks Job configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of job data.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks Pipeline configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of Databricks Delta Live Tables Pipelines telemetry.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks Query configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of Databricks query telemetry.
The Databricks usage configuration parameters are used to configure Databricks collector settings related to the collection of Databricks consumption and cost data.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable automatic collection of consumption and cost data | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect consumption and cost data.
This flag can be used to disable the collection of consumption and cost data by the Databricks collector. This may be useful when running multiple instances of the Databricks Integration. In this scenario, Databricks consumption and cost data collection should only be enabled on a single instance. Otherwise, this data will be recorded more than once in New Relic, affecting consumption and cost calculations.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| ID of a SQL warehouse on which to run usage-related SQL statements | string | Y | N/a |
The ID of a SQL warehouse on which to run the SQL statements used to collect Databricks consumption and cost data.
This parameter is required when the collection of Databricks consumption and cost data is enabled.
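Under the stated assumption that these settings live in a usage node beneath the databricks node (the key names here are guesses to be checked against `configs/config.template.yml`), the configuration might look like:

```yaml
databricks:
  usage:
    enabled: true                     # defaults to true
    warehouseId: <sql-warehouse-id>   # required when usage collection is enabled; key name assumed
    runTime: "02:00:00"               # UTC; key name assumed
```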
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable inclusion of identity related metadata in consumption and cost data | true / false | N | false |
When the collection of Databricks consumption and cost data is enabled, the Databricks collector can include several pieces of identifying information along with the consumption and cost data.
By default, when the collection of Databricks consumption and cost data is enabled, the Databricks collector will not collect such data as it may be personally identifiable. This flag can be used to enable the inclusion of the identifying information.
When enabled, the following values are included.
- The identity of the user a serverless billing record is attributed to. This value is included in the identity metadata returned from usage records in the billable usage system table.
- The identity of the cluster creator for each usage record for billable usage attributed to all-purpose and job compute.
- The single user name for each usage record for billable usage attributed to all-purpose and job compute configured for single-user access mode.
- The identity of the warehouse creator for each usage record for billable usage attributed to SQL warehouse compute.
- The identity of the user or service principal used to run jobs for each query result collected by job cost data queries.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Time of day (as HH:mm:ss) at which to run usage data collection | string with format HH:mm:ss | N | 02:00:00 |
This parameter specifies the time of day at which the collection of
consumption and cost data occurs. The value must be of
the form HH:mm:ss where HH is the 0-padded 24-hour clock hour
(00 - 23), mm is the 0-padded minute (00 - 59) and ss is the
0-padded second (00 - 59). For example, 09:00:00 is the time 9:00 AM and
23:30:00 is the time 11:30 PM.
The time will always be interpreted according to the UTC time zone. The time
zone cannot be configured. For example, to specify that the integration should
run at 2:00 AM EST (-0500), the value 07:00:00 should be specified.
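The UTC conversion above can be checked with a few lines of standard-library Python:

```python
from datetime import datetime, timedelta, timezone

# EST is a fixed UTC-5 offset (no DST), so the conversion is simple arithmetic.
est = timezone(timedelta(hours=-5), name="EST")
local = datetime(2024, 1, 15, 2, 0, 0, tzinfo=est)  # 2:00 AM EST
run_time = local.astimezone(timezone.utc).strftime("%H:%M:%S")
print(run_time)  # 07:00:00
```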
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of flags used to selectively enable or disable optional usage queries | YAML Mapping | N | N/a |
When the collection of Databricks consumption and cost data
is enabled, the Databricks collector will always
collect billable usage data and list pricing data
on every run. In addition, by default, the Databricks collector will also run
all job cost queries on every run. However, the latter behavior
can be configured using a set of flags specified with this configuration
property to selectively enable or disable the job cost queries.
Each flag is specified using a property with the query ID as the name of the
property and true or false as the value of the property. The following flags
are supported.

- `jobs_cost_list_cost_per_job_run`
- `jobs_cost_list_cost_per_job`
- `jobs_cost_frequent_failures`
- `jobs_cost_most_retries`
For example, to enable the list cost per job run query and the list cost per job query but disable the list cost of failed job runs for jobs with frequent failures query and the list cost of repaired job runs for jobs with frequent repairs query, the following configuration would be specified.
```yaml
optionalQueries:
  jobs_cost_list_cost_per_job_run: true
  jobs_cost_list_cost_per_job: true
  jobs_cost_frequent_failures: false
  jobs_cost_most_retries: false
```

| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks Job Run configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of job run data.
The Databricks job run configuration parameters are used to configure Databricks collector settings related to the collection of Databricks job run data.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable automatic collection of job run data | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect job run data.
This flag can be used to disable the collection of job run data by the Databricks collector. This may be useful when running multiple instances of the Databricks Integration against the same Databricks workspace. In this scenario, Databricks job run data collection should only be enabled on a single instance of the integration. Otherwise, this data will be recorded more than once in New Relic, affecting product features that use job run metrics (e.g. dashboards and alerts).
| Description | Valid Values | Required | Default |
|---|---|---|---|
| A prefix to prepend to Databricks job run metric names | string | N | N/a |
This parameter specifies a prefix that will be prepended to each Databricks job run metric name when the metric is exported to New Relic.
For example, if this parameter is set to databricks., then the full name of
the metric representing the duration of a job run (job.run.duration) will be
databricks.job.run.duration.
Note that it is not recommended to leave this value empty as the metric names without a prefix may be ambiguous.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable inclusion of the job run ID in the databricksJobRunId attribute on all job run metrics | true / false | N | false |
By default, the Databricks collector will not include job run IDs on any of the job run metrics, in order to avoid possible violations of metric cardinality limits: job run IDs are unique across all jobs and job runs and therefore have high cardinality.
This flag can be used to enable the inclusion of the job run ID in the
databricksJobRunId attribute on all job metrics.
When enabled, use the Limits UI and/or create a dashboard to monitor your limit status, and set alerts on resource metrics to be notified of changes to your limits.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Offset (in seconds) from the current time to use for calculating the earliest job run start time to match when listing job runs | number | N | 86400 (1 day) |
This parameter specifies an offset, in seconds, that can be used to tune the collector's performance by limiting the number of job runs returned, constraining the start time of matched job runs to be greater than a particular time in the past, calculated as an offset from the current time.
See the section startOffset Configuration for
more details.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks Pipeline Update Metrics configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of Databricks Delta Live Tables pipeline update metrics.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks Pipeline Event Logs configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of Databricks Delta Live Tables pipeline event logs.
The Databricks pipeline event logs configuration parameters are used to configure Databricks collector settings related to the collection of Databricks Delta Live Tables pipeline event logs.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable automatic collection of Databricks Delta Live Tables pipeline event logs | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect Databricks Delta Live Tables pipeline event logs.
This flag can be used to disable the collection of Databricks Delta Live Tables pipeline event logs by the Databricks collector. This may be useful when running multiple instances of the Databricks Integration against the same Databricks workspace. In this scenario, the collection of Databricks Delta Live Tables pipeline event logs should only be enabled on a single instance of the integration. Otherwise, duplicate New Relic log entries will be created for each Databricks Delta Live Tables pipeline event log entry, making troubleshooting challenging and affecting product features that rely on signal integrity such as anomaly detection.
The Databricks pipeline update metrics configuration parameters are used to configure Databricks collector settings related to the collection of Databricks Delta Live Tables pipeline update metrics.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable automatic collection of Databricks Delta Live Tables pipeline update metrics | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect Databricks Delta Live Tables pipeline update metrics.
This flag can be used to disable the collection of Databricks Delta Live Tables pipeline update metrics by the Databricks collector. This may be useful when running multiple instances of the Databricks Integration against the same Databricks workspace. In this scenario, the collection of Databricks Delta Live Tables pipeline update metrics should only be enabled on a single instance of the integration. Otherwise, Databricks Delta Live Tables pipeline update metrics will be recorded more than once, making troubleshooting challenging and affecting product features that use these metrics (e.g. dashboards and alerts).
| Description | Valid Values | Required | Default |
|---|---|---|---|
| A prefix to prepend to Databricks pipeline update metric names | string | N | N/a |
This parameter specifies a prefix that will be prepended to each Databricks pipeline update metric name when the metric is exported to New Relic.
For example, if this parameter is set to databricks., then the full name of
the metric representing the duration of a pipeline update
(pipeline.update.duration) will be
databricks.pipeline.update.duration.
Note that it is not recommended to leave this value empty as the metric names without a prefix may be ambiguous.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable inclusion of the pipeline update ID in the databricksPipelineUpdateId attribute on all pipeline update metrics | true / false | N | false |
By default, the Databricks collector will not include pipeline update IDs on any of the pipeline update metrics, in order to avoid possible violations of metric cardinality limits: update IDs are unique across all pipeline updates and therefore have high cardinality.
This flag can be used to enable the inclusion of the pipeline update ID in the
databricksPipelineUpdateId attribute on all pipeline update metrics.
When enabled, use the Limits UI and/or create a dashboard to monitor your limit status, and set alerts on resource metrics to be notified of changes to your limits.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Offset (in seconds) from the current time to use for calculating the earliest pipeline event timestamp to match when listing pipeline events | number | N | 86400 (1 day) |
This parameter specifies an offset, in seconds, that can be used to tune the collector's performance by limiting the number of pipeline events returned, constraining the timestamp of matched pipeline events to be greater than a particular time in the past, calculated as an offset from the current time.
See the section startOffset Configuration for
more details.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Offset (in seconds) to use for delaying the collection of pipeline events to account for lag in the list pipeline events API | number | N | 5 |
This parameter specifies an offset, in seconds, that the collector will use to delay collection of pipeline events in order to account for potential lag in the list pipeline events endpoint of the Databricks ReST API.
See the section intervalOffset Configuration
for more details.
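Taken together, the pipeline update metric parameters above might be expressed in `config.yml` as in the following sketch. The key names (`metricPrefix`, `includeUpdateId`, `startOffset`, `intervalOffset`) and the nesting under a pipeline update metrics node are assumptions based on the parameter descriptions in this section, not the authoritative schema; consult the full configuration reference for the exact names.

```yaml
# Hypothetical sketch of the Databricks pipeline update metrics configuration.
# Key names and nesting are assumed from the descriptions above.
databricks:
  pipelineUpdateMetrics:
    metricPrefix: databricks.   # pipeline.update.duration -> databricks.pipeline.update.duration
    includeUpdateId: false      # leave false to avoid high-cardinality update IDs
    startOffset: 86400          # only match pipeline events newer than 1 day ago
    intervalOffset: 5           # delay collection 5s to absorb list-events API lag
```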
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Databricks Query Metrics configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of Databricks query metrics.
The Databricks query metrics configuration parameters are used to configure Databricks collector settings related to the collection of Databricks query metrics.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable automatic collection of Databricks query metrics | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect Databricks query metrics.
This flag can be used to disable the collection of query metrics by the Databricks collector. This may be useful when running multiple instances of the Databricks Integration against the same Databricks workspace. In this scenario, the collection of Databricks query metrics should only be enabled on a single instance of the integration. Otherwise, query metrics will be recorded more than once, making troubleshooting challenging and affecting product features that use these metrics (e.g. dashboards and alerts).
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable inclusion of identity related metadata in Databricks query metric data | true / false | N | false |
When the collection of Databricks query metrics is enabled, the Databricks collector can include several pieces of identifying information along with the query metric data.
By default, when the collection of Databricks query metrics is enabled, the Databricks collector will not collect such data as it may be personally identifiable. This flag can be used to enable the inclusion of the identifying information.
When enabled, the following values are included.
- The ID of the user that executed the query
- The email address or username of the user that executed the query
- The ID of the user whose credentials were used to execute the query
- The email address or username of the user whose credentials were used to execute the query
- The identity of the creator of the SQL warehouse where the query was executed
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Offset (in seconds) to use for the start of the time range used on the list query history API call | number | N | 600 |
This parameter specifies an offset, in seconds, that the collector will use to calculate the start time of the time range used to call the list query history API.
See the section Query Metrics startOffset Configuration
for more details.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Offset (in seconds) to use for the time window used to check for terminated queries to account for any lag setting the query end time in the list query history API | number | N | 5 |
This parameter specifies an offset, in seconds, that the collector will use to offset the current time window when checking for terminated queries, allowing the query end time additional time to be reflected in the list query history API.
See the section Query Metrics intervalOffset Configuration
for more details.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The maximum number of query results to return per page when calling the list query history API | number | N | 100 |
This parameter specifies the maximum number of query results to return per page when listing queries via the list query history API.
NOTE:
For performance reasons, since the integration retrieves query metrics for each
query using the list query history API,
it is recommended to keep the maxResults parameter at its default value in
most cases.
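Together, the query metrics parameters above might look like the following sketch in `config.yml`. The key names and nesting (other than `maxResults`, which is named above) are assumptions based on the parameter descriptions in this section; consult the full configuration reference for the exact schema.

```yaml
# Hypothetical sketch of the Databricks query metrics configuration.
# Key names and nesting are assumed from the descriptions above.
databricks:
  queryMetrics:
    enabled: true                   # disable on all but one instance per workspace
    includeIdentityMetadata: false  # identifying info is omitted by default
    startOffset: 600                # start of the list query history time range
    intervalOffset: 5               # allow for lag in recording query end times
    maxResults: 100                 # page size; keep at the default for performance
```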
The Spark configuration parameters are used to configure the Spark collector.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The Web UI URL of an application on the Spark deployment to monitor | string | N | N/a |
This parameter can be used to monitor any Spark deployment. It specifies the URL
of the Web UI of an application running on the Spark deployment to monitor. The
value should be of the form http[s]://<hostname>:<port>, where <hostname> is the
hostname of the Spark deployment to monitor and <port> is the port number of the
Spark application's Web UI (typically 4040, or 4041, 4042, and so on if more than
one application is running on the same host).
Note that the value must not contain a path. The path of the Spark ReST API
endpoints (mounted at /api/v1) will automatically be appended.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| A prefix to prepend to Spark metric names | string | N | N/a |
This parameter specifies a prefix that will be prepended to each Spark metric name when the metric is exported to New Relic.
For example, if this parameter is set to spark., then the full name of the
metric representing the value of the memory used on application executors
(app.executor.memoryUsed) will be spark.app.executor.memoryUsed.
Note that it is not recommended to leave this value empty as the metric names without a prefix may be ambiguous.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The cluster manager on which the Spark engine is running. | string | N | standalone |
This parameter specifies an identifier that indicates the cluster manager on which the Spark engine is running.
The valid values for this parameter are standalone and databricks. When
this parameter is set to databricks, metrics are decorated with additional
attributes (such as the workspace ID, name, and URL). Additional parameters that
apply to Databricks can be specified using the Spark databricks
configuration. When this parameter is not set, or is set to standalone, no
additional telemetry or attributes are collected. The default value for this
parameter is standalone.
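A minimal Spark collector configuration might look like the following sketch. The `webUiUrl` key is documented above and the top-level `spark` node is referenced later in this document; the `metricPrefix` and `clusterManager` key names are assumptions based on the parameter descriptions in this section.

```yaml
# Sketch of a Spark collector configuration; metricPrefix and clusterManager
# key names are assumed from the parameter descriptions above.
spark:
  webUiUrl: http://localhost:4040  # application Web UI; must not include a path
  metricPrefix: spark.             # app.executor.memoryUsed -> spark.app.executor.memoryUsed
  clusterManager: standalone       # set to "databricks" for Spark on Databricks
```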
| Description | Valid Values | Required | Default |
|---|---|---|---|
| The root node for the set of Spark Databricks configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks-specific settings used when the Spark collector is collecting telemetry from Spark running on Databricks.
The Spark Databricks configuration parameters are used to configure the Databricks-specific settings used when the Spark collector is collecting telemetry from Spark running on Databricks.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable inclusion of the task run ID in the databricksJobRunTaskRunId attribute for Spark job, stage, and task metrics associated with Databricks job runs | true / false | N | false |
By default, the Spark collector does not include task run IDs on the Spark job, stage, and task metrics associated with Databricks job runs. Because task run IDs are unique across all job task runs, they have high cardinality, and including them could cause violations of metric cardinality limits.
This flag can be used to enable the inclusion of the task run ID in the
databricksJobRunTaskRunId attribute on all Spark job, stage, and task metrics
associated with Databricks job runs.
When enabled, use the Limits UI and/or create a dashboard to monitor your limit status, and set alerts on limit metrics so you are notified when your limit status changes.
See the section "Mapping Spark metrics to Databricks Jobs and Pipelines" for more details.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable inclusion of the pipeline update ID in the databricksPipelineUpdateId attribute for Spark job, stage, and task metrics associated with Databricks pipeline updates | true / false | N | false |
By default, the Spark collector does not include pipeline update IDs on the Spark job, stage, and task metrics associated with Databricks pipeline updates. Because pipeline update IDs are unique across all pipeline updates, they have high cardinality, and including them could cause violations of metric cardinality limits.
This flag can be used to enable the inclusion of the pipeline update ID in the
databricksPipelineUpdateId attribute on all Spark job, stage, and task metrics
associated with Databricks pipeline updates.
When enabled, use the Limits UI and/or create a dashboard to monitor your limit status, and set alerts on limit metrics so you are notified when your limit status changes.
See the section "Mapping Spark metrics to Databricks Jobs and Pipelines" for more details.
| Description | Valid Values | Required | Default |
|---|---|---|---|
| Flag to enable inclusion of the pipeline flow ID in the databricksPipelineFlowId attribute for Spark job, stage, and task metrics associated with Databricks pipeline update flows | true / false | N | false |
By default, the Spark collector does not include pipeline update flow IDs on the Spark job, stage, and task metrics associated with Databricks pipeline update flows. Because flow IDs are unique across all pipeline update flows, they have high cardinality, and including them could cause violations of metric cardinality limits.
This flag can be used to enable the inclusion of the pipeline update flow ID in
the databricksPipelineFlowId attribute on all Spark job, stage, and task
metrics associated with Databricks pipeline update flows.
When enabled, use the Limits UI and/or create a dashboard to monitor your limit status, and set alerts on limit metrics so you are notified when your limit status changes.
See the section "Mapping Spark metrics to Databricks Jobs and Pipelines" for more details.
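The three flags above might appear together in `config.yml` as in the following sketch. The flag key names are assumptions based on the descriptions in this section (only the resulting attribute names are documented above); all three default to false because the IDs they expose are high-cardinality.

```yaml
# Hypothetical sketch of the Spark Databricks configuration.
# Flag key names are assumed from the descriptions above.
spark:
  clusterManager: databricks
  databricks:
    includeJobRunTaskRunId: false    # adds the databricksJobRunTaskRunId attribute
    includePipelineUpdateId: false   # adds the databricksPipelineUpdateId attribute
    includePipelineFlowId: false     # adds the databricksPipelineFlowId attribute
```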
The Databricks integration uses the Databricks SDK for Go to access the Databricks ReST API. The SDK performs authentication on behalf of the integration and provides many options for configuring the authentication type and credentials to be used. See the SDK documentation and the Databricks client unified authentication documentation for details.
For convenience, the following parameters can be used in the Databricks configuration section of the `config.yml` file.
- `accessToken` - When set, the integration will instruct the SDK to explicitly use Databricks personal access token authentication. The SDK will not attempt to try other authentication mechanisms and instead will fail immediately if personal access token authentication fails.
- `oauthClientId` - When set, the integration will instruct the SDK to explicitly use a service principal to authenticate with Databricks (OAuth M2M). The SDK will not attempt to try other authentication mechanisms and instead will fail immediately if OAuth M2M authentication fails. The OAuth client secret can be set using the `oauthClientSecret` configuration parameter or any of the other mechanisms supported by the SDK (e.g. the `client_secret` field in a Databricks configuration profile or the `DATABRICKS_CLIENT_SECRET` environment variable).
- `oauthClientSecret` - The OAuth client secret to use for OAuth M2M authentication. This value is only used when `oauthClientId` is set in the `config.yml`. The OAuth client secret can also be set using any of the other mechanisms supported by the SDK (e.g. the `client_secret` field in a Databricks configuration profile or the `DATABRICKS_CLIENT_SECRET` environment variable).
NOTE: Authentication applies only to calls made to collect Databricks telemetry via the Databricks SDK for Go. The Spark collector does not support authentication when accessing the Spark ReST API.
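For example, the convenience parameters above might be used in the Databricks section of `config.yml` as follows (placeholder values shown; the top-level `databricks` node name is assumed from this document's description of the configuration layout):

```yaml
databricks:
  # Option 1: personal access token authentication. The SDK will use only
  # this mechanism and fail immediately if it does not succeed.
  accessToken: <your-personal-access-token>

  # Option 2: OAuth M2M with a service principal (use one option, not both).
  # The secret can instead come from a configuration profile's client_secret
  # field or the DATABRICKS_CLIENT_SECRET environment variable.
  #oauthClientId: <service-principal-client-id>
  #oauthClientSecret: <service-principal-client-secret>
```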
The Databricks Integration can collect Apache Spark
application
metrics for all running applications in a given Spark cluster.
Metrics are collected by accessing the monitoring ReST API
through the Web UI of a
given SparkContext.
This feature can be used for any Spark cluster, not just for Spark running on Databricks.
When the Spark collector is enabled (the top-level spark node is specified),
Spark application metrics are collected from the Web UI URL specified in the
webUiUrl parameter, or from the Web UI URL https://localhost:4040 by default.
Spark application telemetry is sent to New Relic as dimensional metrics. The provided metrics and attributes (dimensions) are listed in the sections below.
NOTE: Many of the descriptions below are sourced from the Apache Spark monitoring ReST API documentation.
The Apache Spark monitoring ReST API returns two types of metrics: monotonically increasing counters (referred to below simply as "counters") and gauges, with the majority of metrics being counters.
While all gauge metrics returned from the ReST API
are created as gauge metrics within New Relic, note that all counter metrics
returned from the ReST API
are also created as gauge metrics. While the latest(), min(), max(),
sum(), count(), and average() NRQL aggregator functions
are all therefore available for use with these metrics, only latest()
will provide meaningful results (e.g. taking the average of a set of data points
that represent a monotonically increasing counter makes no sense).
Therefore, in general, only the latest()
aggregator function should be used when visualizing or alerting on metrics
listed in the sections below with the metric type counter.
Further to this, even when using the latest()
aggregator function, the metrics have no meaning without an identifying FACET.
As an example, to show the total number of completed tasks by executor ID, use
the query SELECT latest(app.executor.completedTasks) FROM Metric FACET sparkAppName, sparkAppExecutorId
and not SELECT count(app.executor.completedTasks) FROM Metric FACET sparkAppName, sparkAppExecutorId.
Using count() on a gauge metric accesses the count field of the metric which
in this case is always 1 even though the metric being represented is a
counter. The counter value is actually in the latest field of the metric.
Further, using the metric without faceting by the executor ID will only return
the latest app.executor.completedTasks metric in the selected time window and
ignore any other instances of that metric in the same time window.
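To make the guidance above concrete, here are the two NRQL queries side by side; the metric name assumes no metric prefix has been configured:

```sql
-- Correct: latest() with identifying facets returns per-executor counter values
SELECT latest(app.executor.completedTasks)
FROM Metric
FACET sparkAppName, sparkAppExecutorId

-- Incorrect: count() reads the metric's count field, which is always 1 per
-- data point, so the result reflects data point volume, not completed tasks
-- SELECT count(app.executor.completedTasks) FROM Metric FACET sparkAppName, sparkAppExecutorId
```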
The following attributes are included on all Spark application metrics.
| Attribute Name | Data Type | Description |
|---|---|---|
| sparkAppId | string | Spark application ID |
| sparkAppName | string | Spark application name |
| databricksClusterId | string | Databricks only. Databricks cluster ID |
| databricksClusterName | string | Databricks only. Databricks cluster name |
The following metrics are included for each Spark application.
| Metric Name | Metric Type | Description |
|---|---|---|
| app.jobs | gauge | Spark job counts by job status |
| app.stages | gauge | Spark stage counts by stage status |
The following metrics are included for each Spark application executor. Metrics
are scoped to a given executor using the sparkAppExecutorId attribute.
| Metric Name | Metric Type | Description |
|---|---|---|
| app.executor.rddBlocks | gauge | RDD blocks in the block manager of this executor |
| app.executor.memoryUsed | gauge | Storage memory used by this executor |
| app.executor.diskUsed | gauge | Disk space used for RDD storage by this executor |
| app.executor.totalCores | counter | Number of cores available in this executor |
| app.executor.maxTasks | counter | Maximum number of tasks that can run concurrently in this executor |
| app.executor.activeTasks | gauge | Number of tasks currently executing |
| app.executor.failedTasks | counter | Number of tasks that have failed in this executor |
| app.executor.completedTasks | counter | Number of tasks that have completed in this executor |
| app.executor.totalTasks | counter | Total number of tasks (running, failed and completed) in this executor |
| app.executor.totalDuration | counter | Elapsed time the JVM spent executing tasks in this executor. The value is expressed in milliseconds. |
| app.executor.totalGCTime | counter | Elapsed time the JVM spent in garbage collection summed in this executor. The value is expressed in milliseconds. |
| app.executor.totalInputBytes | counter | Total input bytes summed in this executor |
| app.executor.totalShuffleRead | counter | Total shuffle read bytes summed in this executor |
| app.executor.totalShuffleWrite | counter | Total shuffle write bytes summed in this executor |
| app.executor.maxMemory | gauge | Total amount of memory available for storage, in bytes |
| app.executor.memory.usedOnHeapStorage | gauge | Used on heap memory currently for storage, in bytes |
| app.executor.memory.usedOffHeapStorage | gauge | Used off heap memory currently for storage, in bytes |
| app.executor.memory.totalOnHeapStorage | gauge | Total available on heap memory for storage, in bytes. This amount can vary over time, depending on the MemoryManager implementation. |
| app.executor.memory.totalOffHeapStorage | gauge | Total available off heap memory for storage, in bytes. This amount can vary over time, depending on the MemoryManager implementation. |
| app.executor.memory.peak.jvmHeap | counter | Peak memory usage of the heap that is used for object allocation by the Java virtual machine |
| app.executor.memory.peak.jvmOffHeap | counter | Peak memory usage of non-heap memory that is used by the Java virtual machine |
| app.executor.memory.peak.onHeapExecution | counter | Peak on heap execution memory usage, in bytes |
| app.executor.memory.peak.offHeapExecution | counter | Peak off heap execution memory usage, in bytes |
| app.executor.memory.peak.onHeapStorage | counter | Peak on heap storage memory usage, in bytes |
| app.executor.memory.peak.offHeapStorage | counter | Peak off heap storage memory usage, in bytes |
| app.executor.memory.peak.onHeapUnified | counter | Peak on heap memory usage (execution and storage) |
| app.executor.memory.peak.offHeapUnified | counter | Peak off heap memory usage (execution and storage) |
| app.executor.memory.peak.directPool | counter | Peak JVM memory usage for direct buffer pool (java.lang.management.BufferPoolMXBean) |
| app.executor.memory.peak.mappedPool | counter | Peak JVM memory usage for mapped buffer pool (java.lang.management.BufferPoolMXBean) |
| app.executor.memory.peak.nettyDirect | counter | Databricks only |
| app.executor.memory.peak.jvmDirect | counter | Databricks only |
| app.executor.memory.peak.sparkDirectMemoryOverLimit | counter | Databricks only |
| app.executor.memory.peak.totalOffHeap | counter | Databricks only |
| app.executor.memory.peak.processTreeJvmVirtual | counter | Peak virtual memory size, in bytes |
| app.executor.memory.peak.processTreeJvmRSS | counter | Peak Resident Set Size (number of pages the process has in real memory) |
| app.executor.memory.peak.processTreePythonVirtual | counter | Peak virtual memory size for Python, in bytes |
| app.executor.memory.peak.processTreePythonRSS | counter | Peak Resident Set Size for Python |
| app.executor.memory.peak.processTreeOtherVirtual | counter | Peak virtual memory size for other kinds of processes, in bytes |
| app.executor.memory.peak.processTreeOtherRSS | counter | Peak Resident Set Size for other kinds of processes |
| app.executor.memory.peak.minorGCCount | counter | Total number of minor GCs that have occurred |
| app.executor.memory.peak.minorGCTime | counter | Total elapsed time spent doing minor GCs. The value is expressed in milliseconds. |
| app.executor.memory.peak.majorGCCount | counter | Total number of major GCs that have occurred |
| app.executor.memory.peak.majorGCTime | counter | Total elapsed time spent doing major GCs. The value is expressed in milliseconds. |
| app.executor.memory.peak.totalGCTime | counter | Total elapsed time spent doing GC (major + minor). The value is expressed in milliseconds. |
The following attributes are included on all Spark application executor metrics.
| Attribute Name | Data Type | Description |
|---|---|---|
| sparkAppExecutorId | string | Spark executor ID |
The following metrics are included for each Spark job in an application. Metrics
are scoped to a given job and status using the sparkAppJobId and
sparkAppJobStatus attributes.
| Metric Name | Metric Type | Description |
|---|---|---|
| app.job.duration | counter | Duration of the job. The value is expressed in milliseconds. |
| app.job.stages | gauge (for active and complete status) / counter (for skipped and failed status) | Spark stage counts by stage status |
| app.job.tasks | gauge (for active status) / counter (for complete, skipped, failed, and killed status) | Spark task counts by task status |
| app.job.indices.completed | counter | This metric is not documented in the Apache Spark monitoring ReST API documentation |
The following attributes are included on all Spark application job metrics.
| Attribute Name | Data Type | Description |
|---|---|---|
| sparkAppJobId | number | Spark job ID |
| sparkAppJobStatus | string | Spark job status |
| sparkAppStageStatus | string | Spark stage status. Only on the app.job.stages metric. |
| sparkAppTaskStatus | string | Spark task status. Only on the app.job.tasks metric. |
The following metrics are included for each Spark stage in an application.
Metrics are scoped to a given stage and status using the sparkAppStageId,
sparkAppStageAttemptId, sparkAppStageName, and sparkAppStageStatus
attributes.
NOTE: The metrics in this section are not documented in the Apache Spark monitoring ReST API documentation. The descriptions provided below were deduced via source code analysis and may not be authoritative.
| Metric Name | Metric Type | Description |
|---|---|---|
| app.stage.tasks | gauge (for active, pending, and complete status) / counter (for skipped and failed status) | Spark task counts by task status |
| app.stage.tasks.total | counter | Total number of tasks for the named stage |
| app.stage.duration | counter | Duration of the stage. The value is expressed in milliseconds. |
| app.stage.indices.completed | counter | This metric is not documented in the Apache Spark monitoring ReST API documentation |
| app.stage.peakNettyDirectMemory | counter | Databricks only |
| app.stage.peakJvmDirectMemory | counter | Databricks only |
| app.stage.peakSparkDirectMemoryOverLimit | counter | Databricks only |
| app.stage.peakTotalOffHeapMemory | counter | Databricks only |
| app.stage.executor.deserializeTime | counter | Total elapsed time spent by executors deserializing tasks for the named stage. The value is expressed in milliseconds. |
| app.stage.executor.deserializeCpuTime | counter | Total CPU time spent by executors to deserialize tasks for the named stage. The value is expressed in nanoseconds. |
| app.stage.executor.runTime | counter | Total elapsed time spent running tasks on executors for the named stage. The value is expressed in milliseconds. |
| app.stage.executor.cpuTime | counter | Total CPU time spent running tasks on executors for the named stage. This includes time fetching shuffle data. The value is expressed in nanoseconds. |
| app.stage.resultSize | counter | The total number of bytes transmitted back to the driver by all tasks for the named stage |
| app.stage.jvmGcTime | counter | Total elapsed time the JVM spent in garbage collection while executing tasks for the named stage. The value is expressed in milliseconds. |
| app.stage.resultSerializationTime | gauge | Total elapsed time spent serializing task results for the named stage. The value is expressed in milliseconds. |
| app.stage.memoryBytesSpilled | counter | Sum of the in-memory bytes spilled by all tasks for the named stage |
| app.stage.diskBytesSpilled | counter | Sum of the number of on-disk bytes spilled by all tasks for the named stage |
| app.stage.peakExecutionMemory | counter | Sum of the peak memory used by internal data structures created during shuffles, aggregations and joins by all tasks for the named stage |
| app.stage.inputBytes | counter | Sum of the number of bytes read from org.apache.spark.rdd.HadoopRDD or from persisted data by all tasks for the named stage |
| app.stage.inputRecords | counter | Sum of the number of records read from org.apache.spark.rdd.HadoopRDD or from persisted data by all tasks for the named stage |
| app.stage.outputBytes | counter | Sum of the number of bytes written externally (e.g. to a distributed filesystem) by all tasks with output for the named stage |
| app.stage.outputRecords | counter | Sum of the number of records written externally (e.g. to a distributed filesystem) by all tasks with output for the named stage |
| app.stage.shuffle.remoteBlocksFetched | counter | Sum of the number of remote blocks fetched in shuffle operations by all tasks for the named stage |
| app.stage.shuffle.localBlocksFetched | counter | Sum of the number of local (as opposed to read from a remote executor) blocks fetched in shuffle operations by all tasks for the named stage |
| app.stage.shuffle.fetchWaitTime | counter | Total time tasks spent waiting for remote shuffle blocks for the named stage |
| app.stage.shuffle.remoteBytesRead | counter | Sum of the number of remote bytes read in shuffle operations by all tasks for the named stage |
| app.stage.shuffle.remoteBytesReadToDisk | counter | Sum of the number of remote bytes read to disk in shuffle operations by all tasks for the named stage |
| app.stage.shuffle.localBytesRead | counter | Sum of the number of bytes read in shuffle operations from local disk (as opposed to read from a remote executor) by all tasks for the named stage |
| app.stage.shuffle.readBytes | counter | This metric is not documented in the Apache Spark monitoring ReST API documentation |
| app.stage.shuffle.readRecords | counter | Sum of the number of records read in shuffle operations by all tasks for the named stage |
| app.stage.shuffle.corruptMergedBlockChunks | counter | Sum of the number of corrupt merged shuffle block chunks encountered by all tasks (remote or local) for the named stage |
| app.stage.shuffle.mergedFetchFallbackCount | counter | Sum of the number of times tasks had to fallback to fetch original shuffle blocks for a merged shuffle block chunk (remote or local) for the named stage |
| app.stage.shuffle.mergedRemoteBlocksFetched | counter | Sum of the number of remote merged blocks fetched by all tasks for the named stage |
| app.stage.shuffle.mergedLocalBlocksFetched | counter | Sum of the number of local merged blocks fetched by all tasks for the named stage |
| app.stage.shuffle.mergedRemoteChunksFetched | counter | Sum of the number of remote merged chunks fetched by all tasks for the named stage |
| app.stage.shuffle.mergedLocalChunksFetched | counter | Sum of the number of local merged chunks fetched by all tasks for the named stage |
| app.stage.shuffle.mergedRemoteBytesRead | counter | Sum of the number of remote merged bytes read by all tasks for the named stage |
| app.stage.shuffle.mergedLocalBytesRead | counter | Sum of the number of local merged bytes read by all tasks for the named stage |
| app.stage.shuffle.remoteReqsDuration | counter | Total time tasks took executing remote requests for the named stage |
| app.stage.shuffle.mergedRemoteReqsDuration | counter | Total time tasks took executing remote merged requests for the named stage |
| app.stage.shuffle.writeBytes | counter | Sum of the number of bytes written in shuffle operations by all tasks for the named stage |
| app.stage.shuffle.writeTime | counter | Total time tasks spent blocking on writes to disk or buffer cache for the named stage. The value is expressed in nanoseconds. |
| app.stage.shuffle.writeRecords | counter | Sum of the number of records written in shuffle operations by all tasks for the named stage |
| app.stage.executor.memory.peak.jvmHeap | counter | Peak memory usage of the heap that is used for object allocation by the Java virtual machine while executing tasks for the named stage |
| app.stage.executor.memory.peak.jvmOffHeap | counter | Peak memory usage of non-heap memory by the Java virtual machine while executing tasks for the named stage |
| app.stage.executor.memory.peak.onHeapExecution | counter | Peak on heap execution memory usage while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.offHeapExecution | counter | Peak off heap execution memory usage while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.onHeapStorage | counter | Peak on heap storage memory usage while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.offHeapStorage | counter | Peak off heap storage memory usage while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.onHeapUnified | counter | Peak on heap memory usage (execution and storage) while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.offHeapUnified | counter | Peak off heap memory usage (execution and storage) while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.directPool | counter | Peak JVM memory usage for the direct buffer pool (java.lang.management.BufferPoolMXBean) while executing tasks for the named stage |
| app.stage.executor.memory.peak.mappedPool | counter | Peak JVM memory usage for the mapped buffer pool (java.lang.management.BufferPoolMXBean) while executing tasks for the named stage |
| app.stage.executor.memory.peak.processTreeJvmVirtual | counter | Peak virtual memory size while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.processTreeJvmRSS | counter | Peak Resident Set Size (number of pages the process has in real memory) while executing tasks for the named stage |
| app.stage.executor.memory.peak.processTreePythonVirtual | counter | Peak virtual memory size for Python while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.processTreePythonRSS | counter | Peak Resident Set Size for Python while executing tasks for the named stage |
| app.stage.executor.memory.peak.processTreeOtherVirtual | counter | Peak virtual memory size for other kinds of processes while executing tasks for the named stage, in bytes |
| app.stage.executor.memory.peak.processTreeOtherRSS | counter | Peak Resident Set Size for other kinds of processes while executing tasks for the named stage |
| app.stage.executor.memory.peak.minorGCCount | counter | Total number of minor GCs that occurred while executing tasks for the named stage |
| app.stage.executor.memory.peak.minorGCTime | counter | Total elapsed time spent doing minor GCs while executing tasks for the named stage. The value is expressed in milliseconds. |
| app.stage.executor.memory.peak.majorGCCount | counter | Total number of major GCs that occurred while executing tasks for the named stage |
| app.stage.executor.memory.peak.majorGCTime | counter | Total elapsed time spent doing major GCs while executing tasks for the named stage. The value is expressed in milliseconds. |
| app.stage.executor.memory.peak.totalGCTime | counter | Total elapsed time spent doing GC while executing tasks for the named stage. The value is expressed in milliseconds. |
The following attributes are included on all Spark application stage metrics.
| Attribute Name | Data Type | Description |
|---|---|---|
| sparkAppStageId | string | Spark stage ID |
| sparkAppStageAttemptId | number | Spark stage attempt ID |
| sparkAppStageName | string | Spark stage name |
| sparkAppStageStatus | string | Spark stage status |
| sparkAppTaskStatus | string | Spark task status. Only on the app.stage.tasks metric. |
The following metrics are included for each Spark task in an application.
Metrics are scoped to a given stage and status using the sparkAppStageId,
sparkAppStageAttemptId, sparkAppStageName, and sparkAppStageStatus
attributes. They are also scoped to a
given task and status using the sparkAppTaskId, sparkAppTaskAttempt, and
sparkAppTaskStatus attributes and
scoped to the executor of the task using the sparkAppTaskExecutorId attribute.
NOTE: Some of the shuffle read metric descriptions below are sourced from the file ShuffleReadMetrics.scala.
| Metric Name | Metric Type | Description |
|---|---|---|
app.stage.task.duration |
counter | Duration of the task. The value is expressed in milliseconds. |
app.stage.task.executorDeserializeTime |
counter | Elapsed time spent to deserialize this task. The value is expressed in milliseconds. |
app.stage.task.executorDeserializeCpuTime |
counter | CPU time taken on the executor to deserialize this task. The value is expressed in nanoseconds. |
app.stage.task.executorRunTime |
counter | Elapsed time the executor spent running this task. This includes time fetching shuffle data. The value is expressed in milliseconds. |
app.stage.task.executorCpuTime |
counter | CPU time the executor spent running this task. This includes time fetching shuffle data. The value is expressed in nanoseconds. |
app.stage.task.resultSize |
counter | The number of bytes this task transmitted back to the driver as the TaskResult |
app.stage.task.jvmGcTime |
counter | Elapsed time the JVM spent in garbage collection while executing this task. The value is expressed in milliseconds. |
app.stage.task.resultSerializationTime |
counter | Elapsed time spent serializing the task result. The value is expressed in milliseconds. |
app.stage.task.memoryBytesSpilled |
counter | The number of in-memory bytes spilled by this task |
app.stage.task.diskBytesSpilled |
counter | The number of on-disk bytes spilled by this task |
app.stage.task.peakExecutionMemory |
counter | Peak memory used by internal data structures created during shuffles, aggregations and joins. The value of this accumulator should be approximately the sum of the peak sizes across all such data structures created in this task. For SQL jobs, this only tracks all unsafe operators and ExternalSort. |
app.stage.task.input.bytesRead |
counter | Total number of bytes read from org.apache.spark.rdd.HadoopRDD or from persisted data |
app.stage.task.input.recordsRead |
counter | Total number of records read from org.apache.spark.rdd.HadoopRDD or from persisted data |
app.stage.task.output.bytesWritten |
counter | Total number of bytes written externally (e.g. to a distributed filesystem). Defined only in tasks with output. |
app.stage.task.output.recordsWritten |
counter | Total number of records written externally (e.g. to a distributed filesystem). Defined only in tasks with output. |
app.stage.task.shuffle.read.remoteBlocksFetched |
counter | Number of remote blocks fetched in shuffle operations |
app.stage.task.shuffle.read.localBlocksFetched |
counter | Number of local (as opposed to read from a remote executor) blocks fetched in shuffle operations |
app.stage.task.shuffle.read.totalBlocksFetched |
gauge | TODO: Not yet implemented |
app.stage.task.shuffle.read.fetchWaitTime |
counter | Time the task spent waiting for remote shuffle blocks. This only includes the time blocking on shuffle input data. For instance if block B is being fetched while the task is still not finished processing block A, it is not considered to be blocking on block B. The value is expressed in milliseconds. |
app.stage.task.shuffle.read.remoteBytesRead |
counter | Number of remote bytes read in shuffle operations |
app.stage.task.shuffle.read.remoteBytesReadToDisk |
counter | Number of remote bytes read to disk in shuffle operations. Large blocks are fetched to disk in shuffle read operations, as opposed to being read into memory, which is the default behavior |
app.stage.task.shuffle.read.localBytesRead |
counter | Number of bytes read in shuffle operations from local disk (as opposed to read from a remote executor) |
app.stage.task.shuffle.read.totalBytesRead |
gauge | TODO: Not yet implemented |
app.stage.task.shuffle.read.recordsRead |
counter | Number of records read in shuffle operations |
app.stage.task.shuffle.read.remoteReqsDuration |
counter | Total time taken for remote requests to complete by this task. This doesn't include duration of remote merged requests. |
app.stage.task.shuffle.read.push.corruptMergedBlockChunks |
counter | Number of corrupt merged shuffle block chunks encountered by this task (remote or local) |
app.stage.task.shuffle.read.push.mergedFetchFallbackCount |
counter | Number of times the task had to fallback to fetch original shuffle blocks for a merged shuffle block chunk (remote or local) |
app.stage.task.shuffle.read.push.remoteMergedBlocksFetched |
counter | Number of remote merged blocks fetched |
app.stage.task.shuffle.read.push.localMergedBlocksFetched |
counter | Number of local merged blocks fetched |
app.stage.task.shuffle.read.push.remoteMergedChunksFetched |
counter | Number of remote merged chunks fetched |
app.stage.task.shuffle.read.push.localMergedChunksFetched |
counter | Number of local merged chunks fetched |
app.stage.task.shuffle.read.push.remoteMergedBytesRead |
counter | Total number of remote merged bytes read |
app.stage.task.shuffle.read.push.localMergedBytesRead |
counter | Total number of local merged bytes read |
app.stage.task.shuffle.read.push.remoteMergedReqsDuration |
counter | Total time taken for remote merged requests |
app.stage.task.shuffle.write.bytesWritten |
counter | Number of bytes written in shuffle operations |
app.stage.task.shuffle.write.writeTime |
counter | Time spent blocking on writes to disk or buffer cache. The value is expressed in nanoseconds. |
app.stage.task.shuffle.write.recordsWritten |
counter | Number of records written in shuffle operations |
app.stage.task.photon.offHeapMinMemorySize |
Databricks only | |
app.stage.task.photon.offHeapMaxMemorySize |
Databricks only | |
app.stage.task.photon.photonBufferPoolMinMemorySize |
Databricks only | |
app.stage.task.photon.photonBufferPoolMaxMemorySize |
Databricks only | |
app.stage.task.photon.photonizedTaskTimeNs |
Databricks only |
The following attributes are included on all Spark application stage task metrics.
| Attribute Name | Data Type | Description |
|---|---|---|
sparkAppStageId |
string | Spark stage ID |
sparkAppStageAttemptId |
number | Spark stage attempt ID |
sparkAppStageName |
string | Spark stage name |
sparkAppStageStatus |
string | Spark stage status |
sparkAppTaskId |
string | Spark task ID |
sparkAppTaskAttempt |
number | Spark task attempt number |
sparkAppTaskStatus |
string | Spark task status |
sparkAppTaskLocality |
string | Locality of the data this task operates on |
sparkAppTaskSpeculative |
boolean | true if this is a speculative task execution, otherwise false |
sparkAppTaskExecutorId |
string | Spark executor ID |
The following metrics are included for each Spark RDD in an application. Metrics
are scoped to a given RDD using the sparkAppRDDId and sparkAppRDDName attributes.
NOTE: The metrics in this section are not documented in the Apache Spark monitoring REST API documentation. The descriptions provided below were deduced via source code analysis and may not be definitive.
| Metric Name | Metric Type | Description |
|---|---|---|
app.storage.rdd.partitions |
gauge | The total number of partitions for this RDD |
app.storage.rdd.cachedPartitions |
gauge | The total number of partitions that have been persisted (cached) in memory and/or on disk |
app.storage.rdd.memory.used |
gauge | The total amount of memory used by this RDD across all partitions |
app.storage.rdd.disk.used |
gauge | The total amount of disk space used by this RDD across all partitions |
app.storage.rdd.distribution.memory.used |
gauge | Unknown |
app.storage.rdd.distribution.memory.remaining |
gauge | Unknown |
app.storage.rdd.distribution.disk.used |
gauge | Unknown |
app.storage.rdd.distribution.memory.usedOnHeap |
gauge | Unknown |
app.storage.rdd.distribution.memory.usedOffHeap |
gauge | Unknown |
app.storage.rdd.distribution.memory.remainingOnHeap |
gauge | Unknown |
app.storage.rdd.distribution.memory.remainingOffHeap |
gauge | Unknown |
app.storage.rdd.partition.memory.used |
gauge | The total amount of memory used by this RDD partition |
app.storage.rdd.partition.disk.used |
gauge | The total amount of disk space used by this RDD partition |
The following attributes are included on all Spark application RDD metrics.
| Attribute Name | Data Type | Description |
|---|---|---|
sparkAppRDDId |
number | Spark RDD ID |
sparkAppRDDName |
string | Spark RDD name |
sparkAppRddDistributionIndex |
number | Numerical index of the RDD distribution in the list of distributions returned for this RDD by the Spark application RDD endpoint of the REST API. Only on app.storage.rdd.distribution.* metrics. |
sparkAppRddPartitionBlockName |
string | Name of the block where the RDD partition is stored. Only on app.storage.rdd.partition.* metrics. |
Most of the Spark metrics recorded include at least one of the following attributes that indicate the status of the job, stage, or task for the metric.
- `sparkAppJobStatus`

  Spark job status. One of the following.

  - `running` - job is executing
  - `lost` - job status is unknown
  - `succeeded` - job completed successfully
  - `failed` - job failed

- `sparkAppStageStatus`

  Spark stage status. One of the following.

  - `active` - stage is executing
  - `complete` - stage completed successfully
  - `skipped` - stage was skipped because it did not need to be recomputed
  - `failed` - stage failed
  - `pending` - stage is waiting to be executed

- `sparkAppTaskStatus`

  Spark task status. One of the following.

  - `active` - task is executing
  - `complete` - task completed successfully
  - `skipped` - task was skipped because the stage was skipped
  - `failed` - task failed
  - `killed` - task was explicitly killed
On each run, for each Spark application, the integration records the following counter metrics.
- `app.jobs`

  The number of Spark jobs for an application by job status. This metric will always include the `sparkAppJobStatus` attribute, which indicates which job status the metric applies to (for instance, if the metric value is 5 and the `sparkAppJobStatus` attribute is `running`, it means that there are 5 running jobs).

  Use the `sparkAppId` or `sparkAppName` attributes to target a specific application or to group the values by application.

- `app.stages`

  The number of Spark stages for an application by stage status. This metric will always include the `sparkAppStageStatus` attribute, which indicates which stage status the metric applies to (for instance, if the metric value is 5 and the `sparkAppStageStatus` attribute is `complete`, it means that there are 5 completed stages).

  Use the `sparkAppId` or `sparkAppName` attributes to target a specific application or to group the values by application.

- `app.job.stages`

  The number of Spark stages for a Spark job by stage status. This metric will always include the `sparkAppStageStatus` and `sparkAppJobStatus` attributes, which indicate which stage status and job status the metric applies to, respectively (for instance, if the metric value is 2, the `sparkAppStageStatus` attribute is `complete`, and the `sparkAppJobStatus` is `running`, it means that the job is running and 2 stages have completed).

  Use the `sparkAppId` or `sparkAppName` attributes to target a specific application or to group the values by application, and the `sparkAppJobId` attribute to target a specific job or to group the values by job.

- `app.job.tasks`

  The number of Spark tasks for a Spark job by task status. This metric will always include the `sparkAppTaskStatus` and `sparkAppJobStatus` attributes, which indicate which task status and job status the metric applies to, respectively (for instance, if the metric value is 4, the `sparkAppTaskStatus` attribute is `complete`, and the `sparkAppJobStatus` is `running`, it means that the job is running and 4 tasks have completed).

  Use the `sparkAppId` or `sparkAppName` attributes to target a specific application or to group the values by application, and the `sparkAppJobId` attribute to target a specific job or to group the values by job.

- `app.stage.tasks`

  The number of Spark tasks for a Spark stage by task status. This metric will always include the `sparkAppTaskStatus` and `sparkAppStageStatus` attributes, which indicate which task status and stage status the metric applies to, respectively (for instance, if the metric value is 4, the `sparkAppTaskStatus` attribute is `complete`, and the `sparkAppStageStatus` is `active`, it means that the stage is running and 4 tasks have completed).

  Use the `sparkAppId` or `sparkAppName` attributes to target a specific application or to group the values by application, and the `sparkAppStageId` or `sparkAppStageName` attribute to target a specific stage or to group the values by stage.
In general, only the latest()
aggregator function should be used when visualizing or alerting using these
counters.
When the Databricks Integration is deployed on the driver node of a Databricks cluster, all Spark metrics collected from Spark running on Databricks compute will have the following attributes.
| Name | Data Type | Data Description |
|---|---|---|
databricksWorkspaceId |
string | The Databricks workspace ID |
databricksWorkspaceName |
string | The Databricks workspace instance name |
databricksWorkspaceUrl |
string | The Databricks workspace URL |
Additionally, the integration will attempt to map Spark job, stage, and task
metrics to either a Databricks Job task run or a Databricks Pipeline (and in
some cases a specific update flow). This is done by examining the jobGroup
field that is returned for Spark jobs by the Spark REST API.
By default, the value of the jobGroup field is a string with a specific
format that includes the Databricks job ID and task run ID for Spark job, stage,
and task metrics associated with a Databricks job run, and includes the
Databricks pipeline ID (and in some cases the pipeline update ID and pipeline
update flow ID) for Spark job, stage, and task metrics associated with a
Databricks pipeline update.
When collecting Spark job, stage, and task metrics, the Databricks integration
will examine the value of the jobGroup field. If the value follows one of the
recognized formats, the integration will parse the IDs out of the field and
include the IDs as attributes on each Spark job, stage, and task metric as
follows.
- For Spark job, stage, and task metrics associated with a Databricks job task run, the following attributes are added.

  | Name | Data Type | Description |
  |---|---|---|
  | `databricksJobId` | number | The job ID for the associated Databricks job run |
  | `databricksJobRunTaskRunId` | number | The task run ID for the associated Databricks job run (only included if `includeJobRunTaskRunId` is set to `true`) |

- For Spark job, stage, and task metrics associated with a Databricks pipeline update, the following attributes are added.

  | Name | Data Type | Description |
  |---|---|---|
  | `databricksPipelineId` | string | The pipeline ID for the associated Databricks pipeline |
  | `databricksPipelineUpdateId` | string | The pipeline update ID for the associated Databricks pipeline update (only included if the pipeline update ID is present in the `jobGroup` field and `includePipelineUpdateId` is set to `true`) |
  | `databricksPipelineFlowId` | string | The pipeline update flow ID for the flow in the associated Databricks pipeline update (only included if the pipeline update flow ID is present in the `jobGroup` field and `includePipelineFlowId` is set to `true`) |
If the value of the jobGroup field does not follow one of the recognized
formats (for example, if the jobGroup has been customized using the
setJobGroup
function), the field value will not be parsed and none of the additional
attributes will be available on the associated Spark job, stage and task
metrics. Only the workspace related attributes will be available.
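As an illustrative sketch (not a query shipped with the integration), the correlation attributes above can be used to break Spark stage metrics down by Databricks job. The query below assumes the spark `metricPrefix` is `spark.` and that the `jobGroup` values followed a recognized format so that `databricksJobId` was populated:

```sql
FROM Metric
SELECT latest(spark.app.stage.task.executorRunTime) / 1000 AS Duration
WHERE databricksJobId IS NOT NULL
FACET databricksJobId, sparkAppStageName
LIMIT MAX
```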
All examples below assume the spark `metricPrefix` is `spark.`.

Current number of jobs by application name and job status

```sql
FROM Metric
SELECT latest(spark.app.jobs)
FACET sparkAppName, sparkAppJobStatus
```

Current number of stages by application name and stage status

```sql
FROM Metric
SELECT latest(spark.app.stages)
FACET sparkAppName, sparkAppStageStatus
```

Current number of stages by application name, job ID, and stage status

```sql
FROM Metric
SELECT latest(spark.app.job.stages)
FACET sparkAppName, sparkAppJobId, sparkAppStageStatus
```

Current number of tasks by application name, stage ID, and task status

```sql
FROM Metric
SELECT latest(spark.app.stage.tasks)
FACET sparkAppName, sparkAppStageId, sparkAppTaskStatus
LIMIT MAX
```

Current number of running jobs by application name

```sql
FROM Metric
SELECT latest(spark.app.jobs)
WHERE sparkAppJobStatus = 'running'
FACET sparkAppName
```

Number of completed tasks by application name and job ID

```sql
FROM Metric
SELECT latest(spark.app.job.tasks)
WHERE sparkAppTaskStatus = 'complete'
FACET sparkAppName, sparkAppJobId
LIMIT MAX
```

Number of failed tasks by application name and job ID

```sql
FROM Metric
SELECT latest(spark.app.job.tasks)
WHERE sparkAppTaskStatus = 'failed'
FACET sparkAppName, sparkAppJobId
LIMIT MAX
```

Job duration by application name, job ID, and job status

```sql
FROM Metric
SELECT latest(spark.app.job.duration) / 1000 AS Duration
FACET sparkAppName, sparkAppJobId, sparkAppJobStatus
LIMIT MAX
```

Stage duration by application name, stage ID, and stage name

```sql
FROM Metric
SELECT latest(spark.app.stage.duration) / 1000 AS Duration
FACET sparkAppName, sparkAppStageId, sparkAppStageName
LIMIT MAX
```

Task duration by application name, stage ID, stage name, and task ID

```sql
FROM Metric
SELECT latest(spark.app.stage.task.executorRunTime) / 1000 AS Duration
FACET sparkAppName, sparkAppStageId, sparkAppStageName, sparkAppTaskId
LIMIT MAX
```

Total elapsed stage executor run time (in seconds) by application name, stage ID, and stage name

```sql
FROM Metric
SELECT latest(spark.app.stage.executor.runTime) / 1000 AS Duration
FACET sparkAppName, sparkAppStageId, sparkAppStageName
LIMIT MAX
```

Total elapsed stage JVM GC time (in seconds) by application name, stage ID, and stage name

```sql
FROM Metric
SELECT latest(spark.app.stage.jvmGcTime) / 1000 AS Duration
FACET sparkAppName, sparkAppStageId, sparkAppStageName
LIMIT MAX
```

Average memory used by application name and executor ID over time

```sql
FROM Metric
SELECT average(spark.app.executor.memoryUsed)
WHERE spark.app.executor.memoryUsed IS NOT NULL
FACET sparkAppName, sparkAppExecutorId
TIMESERIES
```

Number of executors (active and dead) by application name

```sql
FROM Metric
SELECT uniqueCount(sparkAppExecutorId)
WHERE metricName = 'spark.app.executor.maxMemory'
FACET sparkAppName
```

Total number of partitions by application name and RDD name

```sql
FROM Metric
SELECT latest(spark.app.storage.rdd.partitions)
FACET sparkAppName, sparkAppRDDName
```

Average RDD memory used by application name and RDD name

```sql
FROM Metric
SELECT average(spark.app.storage.rdd.memory.used)
WHERE spark.app.storage.rdd.memory.used IS NOT NULL
FACET sparkAppName, sparkAppRDDId
```

Average RDD partition memory used by application name, RDD ID, and RDD block name

```sql
FROM Metric
SELECT average(spark.app.storage.rdd.partition.memory.used)
WHERE spark.app.storage.rdd.partition.memory.used IS NOT NULL
FACET sparkAppName, sparkAppRDDId, sparkAppRddPartitionBlockName
```

A sample dashboard is included that shows examples of the types of Apache Spark information that can be displayed and the NRQL statements to use to visualize the data.
The Databricks Integration can collect Databricks consumption and cost related data from the Databricks system tables. This data can be used to show Databricks DBU consumption metrics and estimated Databricks costs directly within New Relic.
This feature is enabled by setting the Databricks usage enabled
flag to true in the integration configuration. When enabled,
the Databricks collector will collect consumption and cost related data once a
day at the time specified in the runTime configuration parameter.
The following information is collected on each run.
- Billable usage records from the `system.billing.usage` table
- List pricing records from the `system.billing.list_prices` table
- List cost per job run of jobs run on jobs compute and serverless compute
- List cost per job of jobs run on jobs compute and serverless compute
- List cost of failed job runs for jobs with frequent failures run on jobs compute and serverless compute
- List cost of repaired job runs for jobs with frequent repairs run on jobs compute and serverless compute
NOTE: Job cost data from jobs run in workspaces outside the region of the
workspace containing the SQL warehouse used to collect the consumption data
(specified in the workspaceHost configuration parameter)
will not be returned by the queries used by the Databricks integration to
collect job cost data.
In order for the Databricks Integration to collect consumption and cost related data from Databricks, there are several requirements.
- The SQL warehouse ID of a SQL warehouse within the workspace associated with the configured workspace host must be specified. The Databricks SQL queries used to collect consumption and cost related data from the Databricks system tables will be run against the specified SQL warehouse.
- As of v2.6.0, the Databricks integration leverages data that is stored in the `system.lakeflow.jobs` table. As of 10/24/2024, this table is in public preview. To access this table, the `lakeflow` schema must be enabled in your `system` catalog. To enable the `lakeflow` schema, follow the instructions in the Databricks documentation to enable a system schema using the metastore ID of the Unity Catalog metastore attached to the workspace associated with the configured workspace host and the schema name `lakeflow`.
On each run, billable usage data is collected from the system.billing.usage table
for the previous day. For each billable usage record, a corresponding record
is created within New Relic as a New Relic event
with the event type DatabricksUsage and the following attributes.
NOTE: Not every attribute is included in every event. For example, the
cluster_* attributes are only included in events for usage records relevant to
all-purpose or job related compute. Similarly, the warehouse_* attributes are
only included in events for usage records relevant to SQL warehouse related
compute.
NOTE: Descriptions below are sourced from the billable usage system table reference.
| Name | Description |
|---|---|
account_id |
ID of the account this usage record was generated for |
workspace_id |
ID of the Workspace this usage record was associated with |
workspace_url |
URL of the Workspace this usage record was associated with |
workspace_instance_name |
Instance name of the Workspace this usage record was associated with |
record_id |
Unique ID for this usage record |
sku_name |
Name of the SKU associated with this usage record |
cloud |
Name of the Cloud this usage record is relevant for |
usage_start_time |
The start time relevant to this usage record |
usage_end_time |
The end time relevant to this usage record |
usage_date |
Date of this usage record |
usage_unit |
Unit this usage record is measured in |
usage_quantity |
Number of units consumed for this usage record |
record_type |
Whether the usage record is original, a retraction, or a restatement. See the section "Analyze Correction Records" in the Databricks documentation for more details. |
ingestion_date |
Date the usage record was ingested into the usage table |
billing_origin_product |
The product that originated this usage record |
usage_type |
The type of usage attributed to the product or workload for billing purposes |
cluster_id |
ID of the cluster associated with this usage record |
cluster_creator |
Creator of the cluster associated with this usage record (only included if includeIdentityMetadata is true) |
cluster_single_user_name |
Single user name of the cluster associated with this usage record if the access mode of the cluster is single-user access mode (only included if includeIdentityMetadata is true) |
cluster_source |
Cluster source of the cluster associated with this usage record |
cluster_instance_pool_id |
Instance pool ID of the cluster associated with this usage record |
warehouse_id |
ID of the SQL warehouse associated with this usage record |
warehouse_name |
Name of the SQL warehouse associated with this usage record |
warehouse_creator |
Creator of the SQL warehouse associated with this usage record (only included if includeIdentityMetadata is true) |
instance_pool_id |
ID of the instance pool associated with this usage record |
node_type |
The instance type of the compute resource associated with this usage record |
job_id |
ID of the job associated with this usage record for serverless compute or jobs compute usage |
job_run_id |
ID of the job run associated with this usage record for serverless compute or jobs compute usage |
job_name |
Name of the job associated with this usage record for serverless compute or jobs compute usage. NOTE: This field will only contain a value for jobs run within a workspace in the same cloud region as the workspace containing the SQL warehouse used to collect the consumption and cost data (specified in workspaceHost). |
serverless_job_name |
User-given name of the job associated with this usage record for jobs run on serverless compute only |
notebook_id |
ID of the notebook associated with this usage record for serverless compute for notebook usage |
notebook_path |
Workspace storage path of the notebook associated with this usage for serverless compute for notebook usage |
dlt_pipeline_id |
ID of the Delta Live Tables pipeline associated with this usage record |
dlt_update_id |
ID of the Delta Live Tables pipeline update associated with this usage record |
dlt_maintenance_id |
ID of the Delta Live Tables pipeline maintenance tasks associated with this usage record |
run_name |
Unique user-facing identifier of the Mosaic AI Model Training fine-tuning run associated with this usage record |
endpoint_name |
Name of the model serving endpoint or vector search endpoint associated with this usage record |
endpoint_id |
ID of the model serving endpoint or vector search endpoint associated with this usage record |
central_clean_room_id |
ID of the central clean room associated with this usage record |
run_as |
See the section "Analyze Identity Metadata" in the Databricks documentation for more details (only included if includeIdentityMetadata is true) |
jobs_tier |
Jobs tier product features for this usage record: values include LIGHT, CLASSIC, or null |
sql_tier |
SQL tier product features for this usage record: values include CLASSIC, PRO, or null |
dlt_tier |
DLT tier product features for this usage record: values include CORE, PRO, ADVANCED, or null |
is_serverless |
Flag indicating if this usage record is associated with serverless usage: values include true or false, or null |
is_photon |
Flag indicating if this usage record is associated with Photon usage: values include true or false, or null |
serving_type |
Serving type associated with this usage record: values include MODEL, GPU_MODEL, FOUNDATION_MODEL, FEATURE, or null |
In addition, all compute resource tags, jobs tags, and budget policy tags
applied to this usage are added as event attributes. For jobs referenced in the
job_id attribute of JOBS usage records, custom job tags are also included if
the job was run within a workspace in the same cloud region as the workspace
containing the SQL warehouse used to collect the consumption data (specified in
the workspaceHost configuration parameter).
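As an illustrative NRQL sketch (not a query shipped with the integration), the `DatabricksUsage` events can be charted to show daily consumption by SKU. This assumes DBU-denominated records carry the value `DBU` in the `usage_unit` attribute:

```sql
FROM DatabricksUsage
SELECT sum(usage_quantity) AS 'DBUs'
WHERE usage_unit = 'DBU'
FACET sku_name
TIMESERIES 1 day
```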
On each run, list pricing data is collected from the system.billing.list_prices table
and used to populate a New Relic lookup table
named DatabricksListPrices. The entire lookup table is updated on each run.
For each pricing record, this table will contain a corresponding row with the
following columns.
NOTE: Descriptions below are sourced from the pricing system table reference.
| Name | Description |
|---|---|
account_id |
ID of the account this pricing record was generated for |
price_start_time |
The time the price in this pricing record became effective in UTC |
price_end_time |
The time the price in this pricing record stopped being effective in UTC |
sku_name |
Name of the SKU associated with this pricing record |
cloud |
Name of the Cloud this pricing record is relevant for |
currency_code |
The currency the price in this pricing record is expressed in |
usage_unit |
The unit of measurement that is monetized in this pricing record |
list_price |
A single price that can be used for simple long-term estimates |
Note that the list_price field contains a single price suitable for simple
long-term estimates. It does not reflect any promotional pricing or custom price
plans.
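Lookup tables can be referenced directly in NRQL using the `lookup()` function. A minimal sketch, assuming the `DatabricksListPrices` lookup table has been populated by the integration:

```sql
FROM lookup(DatabricksListPrices)
SELECT latest(list_price)
FACET sku_name, currency_code
```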
On each run, the Databricks collector runs a set of queries that leverage data
in the system.billing.usage table,
the system.billing.list_prices table,
and the system.lakeflow.jobs table
to collect job cost related data for the previous day. The set of queries that
are run collect the following data.
- List cost per job run of jobs run on jobs compute and serverless compute
- List cost per job of jobs run on jobs compute and serverless compute
- List cost of failed job runs for jobs with frequent failures run on jobs compute and serverless compute
- List cost of repaired job runs for jobs with frequent repairs run on jobs compute and serverless compute
By default, the Databricks integration runs each query on every run. This
behavior can be configured using the optionalQueries
configuration parameter to selectively enable or disable queries by query ID.
For each query that is run, a New Relic event
with the event type DatabricksJobCost is created for each result row with one
attribute for each column of data. In addition, each event includes a query_id
attribute that contains the unique ID of the query that produced the data stored
in the event. This ID can be used in NRQL
queries to scope the query to the appropriate data set.
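For example, to scope a NRQL query to a single data set, match the `query_id` attribute against the relevant query ID. A sketch, using the list cost per job run query ID:

```sql
FROM DatabricksJobCost
SELECT sum(list_cost)
WHERE query_id = 'jobs_cost_list_cost_per_job_run'
FACET workspace_name
```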
See the following sub-sections for more details on the data sets collected by each query.
NOTE: Please see the note at the end of the Consumption & Cost Data section regarding job cost data and the last requirement in the Consumption and Cost Requirements section for important considerations for collecting job cost data.
Job cost data on the list cost per job run is collected using the query with the
query id jobs_cost_list_cost_per_job_run. This query produces
DatabricksJobCost events with the following attributes.
| Name | Description |
|---|---|
workspace_id |
ID of the Workspace where the job run referenced in the run_id attribute occurred |
workspace_name |
Name of the Workspace where the job run referenced in the run_id attribute occurred |
job_id |
The unique job ID of the job associated with this job run |
job_name |
The user-given name of the job referenced in the job_id attribute |
run_id |
The unique run ID of this job run |
run_as |
The ID of the user or service principal used for the job run (only included if includeIdentityMetadata is true) |
list_cost |
The estimated list cost of this job run |
last_seen_date |
The last time a billable usage record was seen referencing the job ID referenced in the job_id attribute and the run ID referenced in the run_id attribute, in UTC |
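As an illustrative sketch (not a query shipped with the integration), the attributes above can be used to surface the most expensive job runs from this data set:

```sql
FROM DatabricksJobCost
SELECT latest(list_cost)
WHERE query_id = 'jobs_cost_list_cost_per_job_run'
FACET job_name, run_id
LIMIT 10
```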
Job cost data on the list cost per job is collected using the query with the
query ID `jobs_cost_list_cost_per_job`. This query produces `DatabricksJobCost`
events with the following attributes.

| Name | Description |
|---|---|
| `workspace_id` | ID of the Workspace containing the job referenced in the `job_id` attribute |
| `workspace_name` | Name of the Workspace containing the job referenced in the `job_id` attribute |
| `job_id` | The unique job ID of the job |
| `job_name` | The user-given name of the job referenced in the `job_id` attribute |
| `runs` | The number of job runs seen for the day this result was collected for the job referenced in the `job_id` attribute |
| `run_as` | The ID of the user or service principal used to run the job (only included if `includeIdentityMetadata` is `true`) |
| `list_cost` | The estimated list cost of all runs for the job referenced in the `job_id` attribute for the day this result was collected |
| `last_seen_date` | The last time a billable usage record was seen referencing the job ID referenced in the `job_id` attribute, in UTC |
Job cost data on the list cost of failed job runs for jobs with frequent
failures is collected using the query with the query ID
`jobs_cost_frequent_failures`. This query produces `DatabricksJobCost` events
with the following attributes.

| Name | Description |
|---|---|
| `workspace_id` | ID of the Workspace containing the job referenced in the `job_id` attribute |
| `workspace_name` | Name of the Workspace containing the job referenced in the `job_id` attribute |
| `job_id` | The unique job ID of the job |
| `job_name` | The user-given name of the job referenced in the `job_id` attribute |
| `runs` | The number of job runs seen for the day this result was collected for the job referenced in the `job_id` attribute |
| `run_as` | The ID of the user or service principal used to run the job (only included if `includeIdentityMetadata` is `true`) |
| `failures` | The number of job runs seen with the result state `ERROR`, `FAILED`, or `TIMED_OUT` for the day this result was collected for the job referenced in the `job_id` attribute |
| `list_cost` | The estimated list cost of all failed job runs seen for the day this result was collected for the job referenced in the `job_id` attribute |
| `last_seen_date` | The last time a billable usage record was seen referencing the job ID referenced in the `job_id` attribute, in UTC |
Job cost data on the list cost of repaired job runs for jobs with frequent
repairs is collected using the query with the query ID `jobs_cost_most_retries`.
This query produces `DatabricksJobCost` events with the following attributes.

| Name | Description |
|---|---|
| `workspace_id` | ID of the Workspace where the job run referenced in the `run_id` attribute occurred |
| `workspace_name` | Name of the Workspace where the job run referenced in the `run_id` attribute occurred |
| `job_id` | The unique job ID of the job associated with this job run |
| `job_name` | The user-given name of the job referenced in the `job_id` attribute |
| `run_id` | The unique run ID of this job run |
| `run_as` | The ID of the user or service principal used to run the job (only included if `includeIdentityMetadata` is `true`) |
| `repairs` | The number of repair runs seen for the day this result was collected for the job run referenced in the `run_id` attribute where the result of the final repair run was `SUCCEEDED` |
| `list_cost` | The estimated list cost of the repair runs seen for the day this result was collected for the job run referenced in the `run_id` attribute |
| `repair_time_seconds` | The cumulative duration of the repair runs seen for the day this result was collected for the job run referenced in the `run_id` attribute |
The Databricks Integration can collect telemetry about Databricks Job
runs, such as job run durations, task run durations, the current state of job
and task runs, if a job or a task is a retry, and the number of times a task was
retried. This feature is enabled by default and can be enabled or disabled using
the Databricks jobs enabled flag in the
integration configuration.
NOTE: Some of the text below is sourced from the Databricks SDK Go module documentation.
On each run, the integration uses the Databricks ReST API
to retrieve job run data. By default, the Databricks ReST API
endpoint for listing job runs
returns a paginated list of all historical job runs sorted in descending order
by start time. On systems with many jobs, there may be a large number of job
runs to retrieve and process, impacting the performance of the collector. To
account for this, the integration provides the `startOffset`
configuration parameter. This parameter is used to tune the performance of the
integration by constraining the start time of job runs to be greater than a
particular time in the past, calculated as an offset from the current time,
thus limiting the number of job runs returned.
The effect of this behavior is that only job runs which have a start time at
or after the calculated start time will be returned on the API call. For
example, using the default startOffset (86400 seconds or 24
hours), only job runs which started within the last 24 hours will be returned.
This means that jobs that started more than 24 hours ago will not be returned
even if some of those jobs are not yet in a TERMINATED state.
Therefore, it is important to carefully select a value for the startOffset
parameter that will account for long-running job runs without degrading the
performance of the integration.
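The cutoff described above is simple offset arithmetic. The sketch below is an illustration only, not the integration's actual (Go) implementation; the function name is hypothetical, while `startOffset` and its 86400-second default come from the configuration described above.

```python
from datetime import datetime, timedelta, timezone

def job_run_cutoff(start_offset_seconds: int = 86400) -> datetime:
    """Earliest job run start time that will be requested.

    Job runs whose start_time falls before this cutoff are not returned,
    even if they are not yet in a TERMINATED state.
    """
    return datetime.now(timezone.utc) - timedelta(seconds=start_offset_seconds)

# With the default offset, only runs started in the last 24 hours are listed.
cutoff = job_run_cutoff()
```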
Job run data is sent to New Relic as dimensional metrics. The following metrics and attributes (dimensions) are provided.
| Metric Name | Metric Type | Description |
|---|---|---|
| `job.runs` | gauge | Job run counts per state |
| `job.tasks` | gauge | Task run counts per state |
| `job.run.duration` | gauge | Duration (in milliseconds) of the job run |
| `job.run.duration.queue` | gauge | Duration (in milliseconds) the job run spent in the queue |
| `job.run.duration.execution` | gauge | Duration (in milliseconds) the job run was actually executing commands (only available for single-task job runs) |
| `job.run.duration.setup` | gauge | Duration (in milliseconds) it took to set up the cluster (only available for single-task job runs) |
| `job.run.duration.cleanup` | gauge | Duration (in milliseconds) it took to terminate the cluster and clean up associated artifacts (only available for single-task job runs) |
| `job.run.task.duration` | gauge | Duration (in milliseconds) of the task run |
| `job.run.task.duration.queue` | gauge | Duration (in milliseconds) the task run spent in the queue |
| `job.run.task.duration.execution` | gauge | Duration (in milliseconds) the task run was actually executing commands |
| `job.run.task.duration.setup` | gauge | Duration (in milliseconds) it took to set up the cluster |
| `job.run.task.duration.cleanup` | gauge | Duration (in milliseconds) it took to terminate the cluster and clean up associated artifacts |
| Attribute Name | Data Type | Description |
|---|---|---|
| `databricksJobId` | number | The unique numeric ID of the job being run |
| `databricksJobRunId` | number | The unique numeric ID of the job run (only included if `includeRunId` is set to `true`) |
| `databricksJobRunName` | string | The optional name of the job run |
| `databricksJobRunAttemptNumber` | number | The sequence number of the run attempt for this job run (0 for the original attempt or if the job has no retry policy, greater than 0 for subsequent attempts for jobs with a retry policy) |
| `databricksJobRunState` | string | The state of the job run |
| `databricksJobRunIsRetry` | boolean | `true` if the job run is a retry of an earlier failed attempt, otherwise `false` |
| `databricksJobRunTerminationCode` | string | For terminated jobs, the job run termination code |
| `databricksJobRunTerminationType` | string | For terminated jobs, the job run termination type |
| `databricksJobRunTaskName` | string | The unique name of the task within its parent job |
| `databricksJobRunTaskState` | string | The state of the task run |
| `databricksJobRunTaskAttemptNumber` | number | The sequence number of the run attempt for this task run (0 for the original attempt or if the task has no retry policy, greater than 0 for subsequent attempts for tasks with a retry policy) |
| `databricksJobRunTaskIsRetry` | boolean | `true` if the task run is a retry of an earlier failed attempt, otherwise `false` |
| `databricksJobRunTaskTerminationCode` | string | For terminated tasks, the task run termination code |
| `databricksJobRunTaskTerminationType` | string | For terminated tasks, the task run termination type |
Databricks job and task runs can be in one of six states.

- `BLOCKED` - run is blocked on an upstream dependency
- `PENDING` - run is waiting to be executed while the cluster and execution context are being prepared
- `QUEUED` - run is queued due to concurrency limits
- `RUNNING` - run is executing
- `TERMINATING` - run has completed, and the cluster and execution context are being cleaned up
- `TERMINATED` - run has completed
The job and task run states are recorded in the `databricksJobRunState` and
`databricksJobRunTaskState` attributes, respectively. The
`databricksJobRunState` attribute is on every job run metric, including the
task-related metrics. The `databricksJobRunTaskState` attribute is only on job
run metrics related to tasks, e.g. `job.run.task.duration`.
On each run, the integration records the number of job and task runs in each
state, except runs in the `TERMINATED` state (see below), in the metrics
`job.runs` and `job.tasks`, respectively, using the attributes
`databricksJobRunState` and `databricksJobRunTaskState` to indicate the state.
In general, only the latest()
aggregator function should be used when visualizing or alerting on counts
including these states. For example, to display the
number of jobs by state, use the following NRQL statement.
```sql
FROM Metric
SELECT latest(databricks.job.runs)
FACET databricksJobRunState
```

The count of job and task runs in the `TERMINATED` state is also recorded but
only includes job and task runs that have terminated since the last run of the
integration. This is done to avoid counting terminated job and task runs more
than once, making it straightforward to use aggregator functions
to visualize or alert on values such as "number of job runs completed per time
period", "average duration of job runs", and "average duration job runs spent in
the queue". For example, to show the average time job runs spend in the queue,
grouped by job name, use the following NRQL statement.
```sql
FROM Metric
SELECT average(databricks.job.run.duration.queue / 1000)
WHERE databricksJobRunState = 'TERMINATED'
FACET databricksJobRunName
LIMIT MAX
```

NOTE: Make sure to use the condition `databricksJobRunState = 'TERMINATED'`
or `databricksJobRunTaskState = 'TERMINATED'` when visualizing or alerting on
values using these aggregator functions.
The Databricks integration stores job and task run durations in the metrics
job.run.duration and job.run.task.duration, respectively. The values stored
in these metrics are calculated differently, depending on the state of the job
or task, as follows.
- While a job or task run is running (not `BLOCKED` or `TERMINATED`), the run
  duration stored in the respective metrics is the "wall clock" time of the
  job or task run, calculated as the current time when the metric is collected
  by the integration minus the `start_time` of the job or task run as returned
  from the Databricks ReST API. This duration is inclusive of any time spent
  in the queue and any setup time, execution time, and cleanup time that has
  been spent up to the time the duration is calculated, but does not include
  any time the job or task run was blocked.

  This duration is also cumulative, meaning that while the job run is running,
  the value calculated each time the integration runs will include the entire
  "wall clock" time since the reported `start_time`. It is essentially a
  `cumulativeCount` metric but stored as a `gauge`. Within NRQL statements,
  the `latest()` function can be used to get the most recent duration. For
  example, to show the most recent duration of job runs that are running
  grouped by job name, use the following NRQL (assuming the job run
  `metricPrefix` is `databricks.`).

  ```sql
  FROM Metric
  SELECT latest(databricks.job.run.duration)
  WHERE databricksJobRunState != 'TERMINATED'
  FACET databricksJobRunName
  ```
- Once a job or task run has been terminated, the run durations stored in the
  respective metrics are not calculated but instead come directly from the
  values returned in the `list runs` API (also see the SDK documentation for
  `BaseRun` and `RunTask`). For both job and task runs, the run durations are
  determined as follows.

  - For job runs of multi-task jobs, the `job.run.duration` for a job run is
    set to the `run_duration` field of the corresponding job run item in the
    `runs` field returned from the `list runs` API.
  - For job runs of single-task jobs, the `job.run.duration` for a job run is
    set to the sum of the `setup_duration`, the `execution_duration`, and the
    `cleanup_duration` fields of the corresponding job run item in the `runs`
    field returned from the `list runs` API.
  - For task runs, the `job.run.task.duration` for a task run is set to the
    sum of the `setup_duration`, the `execution_duration`, and the
    `cleanup_duration` fields of the corresponding task run item in the
    `tasks` field of the corresponding job run item in the `runs` field
    returned from the `list runs` API.
In addition to the `job.run.duration` and `job.run.task.duration` metrics, once
a job or task has terminated, the `job.run.duration.queue` and
`job.run.task.duration.queue` metrics are set from the `queue_duration` field:
for job runs, from the corresponding job run item in the `runs` field returned
from the `list runs` API; for task runs, from the corresponding task run item
in the `tasks` field of the corresponding job run item in the `runs` field.
Finally, for job runs and task runs of single-task jobs, the following metrics are set.
- The `job.run.duration.setup` and `job.run.task.duration.setup` metrics are
  set to the `setup_duration` field of the corresponding job run item in the
  `runs` field returned from the `list runs` API for job runs, and the
  `setup_duration` field of the corresponding task run item in the `tasks`
  field of the corresponding job run item in the `runs` field for task runs.
- The `job.run.duration.execution` and `job.run.task.duration.execution`
  metrics are set to the `execution_duration` field of the corresponding job
  run item in the `runs` field returned from the `list runs` API for job runs,
  and the `execution_duration` field of the corresponding task run item in the
  `tasks` field of the corresponding job run item in the `runs` field for
  task runs.
- The `job.run.duration.cleanup` and `job.run.task.duration.cleanup` metrics
  are set to the `cleanup_duration` field of the corresponding job run item in
  the `runs` field returned from the `list runs` API for job runs, and the
  `cleanup_duration` field of the corresponding task run item in the `tasks`
  field of the corresponding job run item in the `runs` field for task runs.
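Taken together, the rules above amount to a simple field selection over the run item returned by the `list runs` API. The following is an illustrative Python sketch (the integration itself is written in Go, and the function name is hypothetical); the dictionary keys mirror the API field names described above.

```python
def terminated_job_run_duration(run: dict) -> int:
    """Select the duration (in milliseconds) for a terminated job run.

    Multi-task job runs report run_duration directly; single-task job
    runs report per-phase durations that are summed.
    """
    if "run_duration" in run:  # multi-task job run
        return run["run_duration"]
    # single-task job run: setup + execution + cleanup
    return (run.get("setup_duration", 0)
            + run.get("execution_duration", 0)
            + run.get("cleanup_duration", 0))

# A single-task run with 1s setup, 3s execution, and 0.5s cleanup
print(terminated_job_run_duration(
    {"setup_duration": 1000, "execution_duration": 3000,
     "cleanup_duration": 500}))  # 4500
```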
All examples below assume the job run `metricPrefix` is `databricks.`.
Current job run counts by state
```sql
FROM Metric
SELECT latest(databricks.job.runs)
FACET databricksJobRunState
```

Current task run counts by state
```sql
FROM Metric
SELECT latest(databricks.job.tasks)
FACET databricksJobRunTaskState
```

Job run durations by job name over time
```sql
FROM Metric
SELECT latest(databricks.job.run.duration)
FACET databricksJobRunName
TIMESERIES
```

Average job run duration by job name
```sql
FROM Metric
SELECT average(databricks.job.run.duration)
WHERE databricksJobRunState = 'TERMINATED'
FACET databricksJobRunName
```

NOTE: Make sure to use the condition `databricksJobRunState = 'TERMINATED'`
or `databricksJobRunTaskState = 'TERMINATED'` when visualizing or alerting on
values using aggregator functions other than `latest`.
Average task run queued duration by task name over time
```sql
FROM Metric
SELECT average(databricks.job.run.task.duration.queue)
WHERE databricksJobRunTaskState = 'TERMINATED'
FACET databricksJobRunTaskName
TIMESERIES
```

NOTE: Make sure to use the condition `databricksJobRunState = 'TERMINATED'`
or `databricksJobRunTaskState = 'TERMINATED'` when visualizing or alerting on
values using aggregator functions other than `latest`.
A sample dashboard is included that shows examples of the types of job run information that can be displayed and the NRQL statements to use to visualize the data.
The Databricks Integration can collect telemetry about
Databricks Delta Live Tables Pipeline
updates, such
as flow durations,
flow data quality
metrics, and update
durations. This feature is enabled by default and can be enabled or disabled
using the Databricks pipeline update metrics enabled
flag in the integration configuration.
NOTE: Some of the text below is sourced from the Databricks Delta Live Tables pipeline event log schema documentation and the Databricks SDK Go module documentation.
On each run, the integration uses the Databricks ReST API
to retrieve pipeline events for all Delta Live Tables Pipelines.
By default, the Databricks ReST API
endpoint for listing pipeline events
returns a paginated list of all historical pipeline events sorted in
descending order by timestamp. On systems with many pipelines or complex
pipelines with many flows, there may be a large number of pipeline events to
retrieve and process, impacting the performance of the collector. To account for
this, the integration provides the `startOffset`
configuration parameter. This parameter is used to tune the performance of the
integration by constraining the timestamp of pipeline events to be greater
than a particular time in the past, calculated as an offset from the current
time, thus limiting the number of pipeline events returned.
The effect of this behavior is that only pipeline events which have a timestamp
at or after the calculated start time will be returned on the API call. For
example, using the default startOffset
(86400 seconds or 24 hours), only pipeline events which occurred within the last
24 hours will be returned. This means that pipeline events that occurred more
than 24 hours ago will not be returned even if the associated pipeline update
is not yet complete. This can lead to missing flow or update metrics. Therefore,
it is important to carefully select a value for the startOffset
parameter that will account for long-running pipeline updates without degrading
the performance of the integration.
The integration uses the list pipeline events endpoint
of the Databricks ReST API
to retrieve pipeline log events. There can be a slight lag time between when a
pipeline log event is emitted and when it is available in the API.
The intervalOffset
configuration parameter can be used to delay the processing of pipeline log
events in order to account for this lag. The default value for the
intervalOffset
configuration parameter is 5 seconds.
To help illustrate, here is the processing behavior for the default settings (60 second collection interval and 5 second interval offset):
- on the first interval, the integration will process all pipeline log events from startup (0:00) until 0:55
- on the second interval, the integration will process all pipeline log events from 0:55 until 1:55
- on the third interval, the integration will process all pipeline log events from 1:55 until 2:55
- and so on
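The windowing above is easy to express concretely. The sketch below is an illustration of the arithmetic only (not the integration's actual Go code); the `interval` and `offset` parameters correspond to the `interval` and `intervalOffset` configuration parameters described above.

```python
def processing_windows(runs: int, interval: int = 60, offset: int = 5):
    """Return (start, end) offsets in seconds, relative to integration
    startup, of the pipeline log event window processed on each run."""
    windows = []
    prev_end = 0
    for i in range(1, runs + 1):
        end = i * interval - offset  # stop `offset` seconds short of "now"
        windows.append((prev_end, end))
        prev_end = end
    return windows

# Defaults (60s interval, 5s offset): 0:00-0:55, 0:55-1:55, 1:55-2:55, ...
print(processing_windows(3))  # [(0, 55), (55, 115), (115, 175)]
```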
Pipeline update metric data is sent to New Relic as dimensional metrics. The following metrics and attributes (dimensions) are provided.
| Metric Name | Metric Type | Description |
|---|---|---|
| `pipeline.pipelines` | gauge | Pipeline counts per state |
| `pipeline.updates` | gauge | Pipeline update counts per status |
| `pipeline.flows` | gauge | Pipeline flow counts per status |
| `pipeline.update.duration` | gauge | Total duration (in milliseconds) of the pipeline update identified by the ID in the `databricksPipelineUpdateId` attribute |
| `pipeline.update.duration.wait` | gauge | Duration (in milliseconds) the pipeline update identified by the ID in the `databricksPipelineUpdateId` attribute spent in the `WAITING_FOR_RESOURCES` state |
| `pipeline.update.duration.runTime` | gauge | Duration (in milliseconds) the pipeline update identified by the ID in the `databricksPipelineUpdateId` attribute spent actually running |
| `pipeline.flow.duration` | gauge | Total duration (in milliseconds) of the pipeline flow named in the `databricksPipelineFlowName` attribute |
| `pipeline.flow.duration.queue` | gauge | Duration (in milliseconds) the pipeline flow named in the `databricksPipelineFlowName` attribute spent in the `QUEUED` state |
| `pipeline.flow.duration.plan` | gauge | Duration (in milliseconds) the pipeline flow named in the `databricksPipelineFlowName` attribute spent in the `PLANNING` state |
| `pipeline.flow.backlogBytes` | gauge | Number of bytes present in the backlog of the flow named in the `databricksPipelineFlowName` attribute |
| `pipeline.flow.backlogFiles` | gauge | Number of files that exist in the backlog of the flow named in the `databricksPipelineFlowName` attribute |
| `pipeline.flow.rowsWritten` | gauge | Number of rows output by the flow named in the `databricksPipelineFlowName` attribute |
| `pipeline.flow.recordsDropped` | gauge | Number of records dropped by the flow named in the `databricksPipelineFlowName` attribute |
| `pipeline.flow.expectation.recordsPassed` | gauge | Number of records that passed the expectation named in the `databricksPipelineFlowExpectationName` attribute for the dataset named in the `databricksPipelineDatasetName` attribute |
| `pipeline.flow.expectation.recordsFailed` | gauge | Number of records that failed the expectation named in the `databricksPipelineFlowExpectationName` attribute for the dataset named in the `databricksPipelineDatasetName` attribute |
| Attribute Name | Data Type | Description |
|---|---|---|
| `databricksPipelineId` | string | The unique UUID of the pipeline associated with the update. Included on all pipeline update metrics. |
| `databricksPipelineName` | string | The name of the pipeline associated with the update. Included on all pipeline update metrics. |
| `databricksPipelineState` | string | The state of the pipeline associated with the update. Only included on the `pipeline.pipelines` metric. |
| `databricksClusterId` | string | The ID of the cluster where the event associated with the metric originated. Included on all pipeline update metrics, when available. |
| `databricksPipelineUpdateId` | string | The unique UUID of the pipeline update. Included on all pipeline update metrics if `includeUpdateId` is set to `true`. |
| `databricksPipelineUpdateStatus` | string | The status of the pipeline update. Included on the `pipeline.updates` and `pipeline.update.duration` metrics. |
| `databricksPipelineFlowId` | string | The unique UUID of the pipeline flow associated with the metric. Included on all pipeline flow metrics, when available. |
| `databricksPipelineFlowName` | string | The unique name of the pipeline flow associated with the metric. Included on all pipeline flow metrics. |
| `databricksPipelineFlowStatus` | string | The status of the pipeline flow. Included on all pipeline flow metrics. |
| `databricksPipelineFlowExpectationName` | string | The name of the data quality expectation associated with the metric. Included on all pipeline flow expectation metrics. |
| `databricksPipelineDatasetName` | string | The name of the dataset associated with the metric. Included on all pipeline flow expectation metrics. |
Databricks pipelines can be in one of nine states.

- `DELETED`
- `DEPLOYING`
- `FAILED`
- `IDLE`
- `RECOVERING`
- `RESETTING`
- `RUNNING`
- `STARTING`
- `STOPPING`
The pipeline state is recorded in the databricksPipelineState attribute on the
pipeline.pipelines metric only.
Databricks pipeline updates can have one of eleven statuses.

- `CANCELED`
- `COMPLETED`
- `CREATED`
- `FAILED`
- `INITIALIZING`
- `QUEUED`
- `RESETTING`
- `RUNNING`
- `SETTING_UP_TABLES`
- `STOPPING`
- `WAITING_FOR_RESOURCES`
The pipeline update status is recorded in the databricksPipelineUpdateStatus
attribute on the pipeline.updates metric and the pipeline.update.* metrics.
The statuses CANCELED, COMPLETED, and FAILED are considered update
termination statuses.
Databricks pipeline flows can have one of ten statuses.

- `COMPLETED`
- `EXCLUDED`
- `FAILED`
- `IDLE`
- `PLANNING`
- `QUEUED`
- `RUNNING`
- `SKIPPED`
- `STARTING`
- `STOPPED`
The pipeline flow status is recorded in the databricksPipelineFlowStatus
attribute on the pipeline.flows metric and the pipeline.flow.*
metrics.
The statuses COMPLETED, EXCLUDED, FAILED, SKIPPED, and STOPPED are
considered flow termination statuses.
On each run, the integration records the following counts.
- The count of all pipelines by pipeline state is recorded in the metric
  `pipeline.pipelines` using the attribute `databricksPipelineState` to
  indicate the pipeline state.

- The count of all pipeline updates by update status is recorded in the metric
  `pipeline.updates` using the attribute `databricksPipelineUpdateStatus` to
  indicate the update status.

  Pipeline updates with a termination update status (`CANCELED`, `COMPLETED`,
  `FAILED`) are only counted if the corresponding event occurred since the
  last run of the integration. Otherwise, these updates are ignored. This is
  done to avoid counting terminated updates more than once. Pipeline updates
  with a non-termination update status are always counted.

- The count of all pipeline flows by flow status is recorded in the metric
  `pipeline.flows` using the attribute `databricksPipelineFlowStatus` to
  indicate the flow status.

  Pipeline flows with a termination flow status (`COMPLETED`, `EXCLUDED`,
  `FAILED`, `SKIPPED`, `STOPPED`) are only counted if the corresponding event
  occurred since the last run of the integration. Otherwise, these flows are
  ignored. This is done to avoid counting terminated flows more than once.
  Pipeline flows with a non-termination flow status are always counted.
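The counting rule above can be sketched as follows. This is illustrative Python rather than the integration's actual Go code, and the event shape (`status`, `timestamp`) is a hypothetical simplification of the pipeline event log records.

```python
TERMINATION_STATUSES = {"CANCELED", "COMPLETED", "FAILED"}

def count_update_statuses(events, last_run_time):
    """Count pipeline updates by status; terminated updates are counted
    only if they terminated since the last integration run, so each
    terminated update is counted exactly once across runs."""
    counts = {}
    for event in events:
        status = event["status"]
        if status in TERMINATION_STATUSES and event["timestamp"] <= last_run_time:
            continue  # already counted on an earlier run
        counts[status] = counts.get(status, 0) + 1
    return counts

events = [
    {"status": "RUNNING", "timestamp": 90},    # non-terminated: always counted
    {"status": "COMPLETED", "timestamp": 100}, # terminated after last run
    {"status": "COMPLETED", "timestamp": 10},  # counted on a previous run
]
print(count_update_statuses(events, last_run_time=60))
# {'RUNNING': 1, 'COMPLETED': 1}
```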
The Databricks integration records several update and flow durations. The different durations and the ways in which they are calculated are as follows.
- `pipeline.update.duration`

  The `pipeline.update.duration` metric represents the total duration of the
  update from the time it is created until the time it terminates. After the
  update is created but before it terminates, the duration recorded in this
  metric is the "wall clock" time, calculated as the current time when the
  metric is recorded by the integration minus the timestamp of the pipeline
  log event where the update was created. Once the update terminates, the
  duration recorded in this metric is calculated by subtracting the timestamp
  of the pipeline log event where the update was created from the timestamp of
  the pipeline log event where the update reached the termination status.
pipeline.update.duration.waitThe
pipeline.update.duration.waitmetric represents the amount of time the update spent waiting for resources (that is, while it has theWAITING_FOR_RESOURCESupdate status). While the update is waiting for resources, the duration recorded in this metric is the "wall clock" time, calculated as the current time when the metric is recorded by the integration minus the timestamp of the pipeline log event where the update reached theWAITING_FOR_RESOURCESstatus. Once the update reaches theINITIALIZINGstatus or terminates, the duration recorded in this metric is calculated by subtracting the timestamp of the pipeline log event where the update reached theWAITING_FOR_RESOURCESstatus from the timestamp of the pipeline log event where the update reached theINITIALIZINGstatus or terminated. -
pipeline.update.duration.runTimeThe
pipeline.update.duration.runTimemetric represents the amount of time the update spent actually running, defined as the time the update reached theINITIALIZINGupdate status until the time the update terminates. After the update has reached theINITIALIZINGstatus but before it terminates, the duration recorded in this metric is the "wall clock" time, calculated as the current time when the metric is recorded by the integration minus the timestamp of the pipeline log event where the update reached theINITIALIZINGstatus. Once the update terminates, the duration recorded in this metric is calculated by subtracting the timestamp of the pipeline log event where the update reached theINITIALIZINGstatus from the timestamp of the pipeline log event where the update terminated. -
pipeline.flow.durationThe
pipeline.flow.durationmetric represents the total duration of the flow from the time it starts until the time it terminates. After the flow starts but before it terminates, the duration recorded in this metric is the "wall clock" time, calculated as the current time when the metric is recorded by the integration minus the timestamp of the pipeline log event where the flow terminated. Once the flow terminates, the duration recorded in this metric is calculated by subtracting the timestamp of the pipeline log event where the flow started from the timestamp of the pipeline log event where the flow terminated. -
pipeline.flow.duration.queueThe
pipeline.flow.duration.queuemetric represents the time the flow spent in the queue (that is, while it has theQUEUEDflow status). While the update in the queue and before it terminates or reaches thePLANNINGorSTARTINGstatus, the duration recorded in this metric is the "wall clock" time, calculated as the current time when the metric is recorded by the integration minus the timestamp of the pipeline log event where the flow reached theQUEUEDstatus. Once the flow terminates or reaches either thePLANNINGstatus or theSTARTINGstatus, the duration recorded in this metric is calculated by subtracting the timestamp of the pipeline log event where the flow reached theQUEUEDstatus from the timestamp of the log event where the flow terminated or reached either thePLANNINGstatus or theSTARTINGstatus. -
pipeline.flow.duration.planThe
pipeline.flow.duration.planmetric represents the time spent to plan the flow (that is, while it has thePLANNINGflow status). While the update has thePLANNINGstatus and before it terminates or reaches theSTARTINGstatus, the duration recorded in this metric is the "wall clock" time, calculated as the current time when the metric is recorded by the integration minus the timestamp of the pipeline log event where the flow terminated or reached thePLANNINGstatus. Once the terminates or reaches theSTARTINGstatus, the duration recorded in this metric is calculated by subtracting the timestamp of the pipeline log event where the flow reached thePLANNINGstatus from the timestamp of the log event where the flow terminated or reached theSTARTINGstatus.
NOTE: Wall clock durations are cumulative, meaning that until the update
or flow reaches the status at which the true duration is calculated, the value
of each successive metric increases monotonically. It is essentially a
`cumulativeCount` metric but stored as a `gauge`. Within NRQL statements, the
`latest()` function can be used to get the most recent duration. For example,
to show the most recent duration of flows that are running grouped by flow
name, use the following NRQL (assuming the pipeline update `metricPrefix` is
`databricks.`).
```sql
FROM Metric
SELECT latest(databricks.pipeline.flow.duration)
WHERE databricksPipelineFlowStatus = 'RUNNING'
FACET databricksPipelineFlowName
```

All examples below assume the pipeline update `metricPrefix` is `databricks.`.
Current pipeline counts by state

```sql
FROM Metric
SELECT latest(databricks.pipeline.pipelines) AS 'Pipelines'
FACET databricksPipelineState
```

Current update counts by status
```sql
FROM Metric
SELECT latest(databricks.pipeline.updates) AS 'Updates'
FACET databricksPipelineUpdateStatus
```

Update durations by pipeline name and update ID over time
```sql
FROM Metric
SELECT latest(databricks.pipeline.update.duration)
FACET databricksPipelineName, substring(databricksPipelineUpdateId, 0, 6)
TIMESERIES
```

Recent updates
```sql
FROM Metric
SELECT databricksPipelineName AS Pipeline, substring(databricksPipelineUpdateId, 0, 6) AS Update, databricksPipelineUpdateStatus AS Status, getField(databricks.pipeline.update.duration, latest) / 1000 AS Duration
WHERE databricks.pipeline.update.duration IS NOT NULL
AND databricksPipelineUpdateStatus IN ('COMPLETED', 'CANCELED', 'FAILED')
LIMIT MAX
```

Average update duration by pipeline name and update status
```sql
FROM Metric
SELECT average(databricks.pipeline.update.duration) / 1000 AS 'seconds'
WHERE databricksPipelineUpdateStatus IN ('COMPLETED', 'CANCELED', 'FAILED')
FACET databricksPipelineName, databricksPipelineUpdateStatus
TIMESERIES
```

NOTE: Make sure to use the condition
`databricksPipelineUpdateStatus IN ('COMPLETED', 'CANCELED', 'FAILED')` when
visualizing or alerting on update duration using aggregator functions other
than `latest`.
Recent flows
```sql
FROM Metric
SELECT databricksPipelineName AS Pipeline, substring(databricksPipelineUpdateId, 0, 6) AS Update, databricksPipelineFlowName AS Flow, databricksPipelineFlowStatus AS Status, getField(databricks.pipeline.flow.duration, latest) / 1000 AS Duration
WHERE databricks.pipeline.flow.duration IS NOT NULL
AND databricksPipelineFlowStatus IN ('COMPLETED', 'STOPPED', 'SKIPPED', 'FAILED', 'EXCLUDED')
LIMIT MAX
```

Average flow backlog bytes
```sql
FROM Metric
SELECT average(databricks.pipeline.flow.backlogBytes)
FACET databricksPipelineName, databricksPipelineFlowName
```

Flow expectation records passed
FROM Metric
SELECT databricksPipelineName AS Pipeline, substring(databricksPipelineUpdateId, 0, 6) AS Update, databricksPipelineFlowName AS Flow, databricksPipelineFlowExpectationName AS Expectation, getField(databricks.pipeline.flow.expectation.recordsPassed, latest) AS Records
WHERE databricks.pipeline.flow.expectation.recordsPassed IS NOT NULL
LIMIT MAX

A sample dashboard is included that shows examples of the types of pipeline update information that can be displayed and the NRQL statements to use to visualize the data.
The Databricks Integration can collect Databricks Delta Live Tables Pipeline event logs for all Databricks Delta Live Tables Pipelines defined in a workspace. Databricks Delta Live Tables Pipeline event log entries for every pipeline update are collected and sent to New Relic Logs.
NOTE: Some of the text below is sourced from the Databricks Delta Live Tables pipeline event log schema documentation and the Databricks SDK Go module documentation.
Databricks Delta Live Tables Pipeline event logs are sent to New Relic as New Relic Log data. For each Databricks Delta Live Tables Pipeline event log entry, the fields from the event log entry are mapped to attributes on the corresponding New Relic log entry as follows.
NOTE: Due to the way in which the Databricks ReST API returns Databricks Delta Live Tables Pipeline event log entries (in descending order by timestamp), newer event log entries may sometimes be visible in the New Relic Logs UI before older event log entries are visible. This does not affect the temporal ordering of the log entries. Once the older log entries become visible (usually 30-60 seconds longer than the value of the interval configuration parameter), they are properly ordered by timestamp in the New Relic Logs UI.
| Pipeline Event Log Field Name | New Relic Log Entry Attribute Name | Data Type in New Relic Log Entry | Description |
|---|---|---|---|
| `message` | `message` | string | A human-readable message describing the event |
| `timestamp` | `timestamp` | integer | The time (in milliseconds since the epoch) the event was recorded |
| `id` | `databricksPipelineEventId` | string | A unique identifier for the event log record |
| `event_type` | `databricksPipelineEventType` | string | The event type |
| `level` | `level`, `databricksPipelineEventLevel` | string | The severity level of the event, for example, INFO, WARN, ERROR, or METRICS |
| `maturity_level` | `databricksPipelineEventMaturityLevel` | string | The stability of the event schema, one of STABLE, NULL, EVOLVING, or DEPRECATED |
| n/a | `databricksPipelineEventError` | boolean | true if an error was captured by the event (see below for additional details), otherwise false |
| `origin.batch_id` | `databricksPipelineFlowBatchId` | integer | The id of a batch. Unique within a flow. |
| `origin.cloud` | `databricksCloud` | string | The cloud provider, e.g., AWS or Azure |
| `origin.cluster_id` | `databricksClusterId` | string | The id of the cluster where an execution happens. Unique within a region. |
| `origin.dataset_name` | `databricksPipelineDatasetName` | string | The name of a dataset. Unique within a pipeline. |
| `origin.flow_id` | `databricksPipelineFlowId` | string | The id of the flow. Globally unique. |
| `origin.flow_name` | `databricksPipelineFlowName` | string | The name of the flow. Not unique. |
| `origin.host` | `databricksPipelineEventHost` | string | The optional host name where the event was triggered |
| `origin.maintenance_id` | `databricksPipelineMaintenanceId` | string | The id of a maintenance run. Globally unique. |
| `origin.materialization_name` | `databricksPipelineMaterializationName` | string | Materialization name |
| `origin.org_id` | `databricksOrgId` | integer | The org id of the user. Unique within a cloud. |
| `origin.pipeline_id` | `databricksPipelineId` | string | The id of the pipeline. Globally unique. |
| `origin.pipeline_name` | `databricksPipelineName` | string | The name of the pipeline. Not unique. |
| `origin.region` | `databricksCloudRegion` | string | The cloud region |
| `origin.request_id` | `databricksPipelineRequestId` | string | The id of the request that caused an update |
| `origin.table_id` | `databricksDeltaTableId` | string | The id of a (delta) table. Globally unique. |
| `origin.uc_resource_id` | `databricksUcResourceId` | string | The Unity Catalog id of the MV or ST being updated |
| `origin.update_id` | `databricksPipelineUpdateId` | string | The id of an execution. Globally unique. |
Additionally, if the error field is set on the pipeline event log entry,
indicating an error was captured by the event, the following attributes are
added to the New Relic log entry.
| Pipeline Event Log Field Name | New Relic Log Entry Attribute Name | Data Type in New Relic Log Entry | Description |
|---|---|---|---|
| `error.fatal` | `databricksPipelineEventErrorFatal` | boolean | Whether the error is considered fatal, that is, unrecoverable |
| `error.exceptions[N].class_name` | `databricksPipelineEventErrorExceptionNClassName` | string | Runtime class of the Nth exception |
| `error.exceptions[N].message` | `databricksPipelineEventErrorExceptionNMessage` | string | Exception message of the Nth exception |
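The indexed exception fields are flattened into per-index attribute names by substituting the exception's position for N. A minimal Go sketch of that mapping (the type and function names here are assumptions for illustration, not the integration's actual code):

```go
package main

import "fmt"

// pipelineException mirrors one entry of the error.exceptions array in a
// pipeline event log entry. This type is illustrative, not the actual
// structure used by the integration.
type pipelineException struct {
	ClassName string
	Message   string
}

// exceptionAttributes builds the per-index New Relic log entry attributes,
// substituting each exception's position for the N in the attribute name.
func exceptionAttributes(exceptions []pipelineException) map[string]string {
	attrs := make(map[string]string)
	for n, ex := range exceptions {
		attrs[fmt.Sprintf("databricksPipelineEventErrorException%dClassName", n)] = ex.ClassName
		attrs[fmt.Sprintf("databricksPipelineEventErrorException%dMessage", n)] = ex.Message
	}
	return attrs
}

func main() {
	attrs := exceptionAttributes([]pipelineException{
		{ClassName: "java.lang.RuntimeException", Message: "stream failed"},
	})
	// The first exception's class appears under the index-0 attribute name.
	fmt.Println(attrs["databricksPipelineEventErrorException0ClassName"])
}
```

For example, the class name of the first captured exception would appear on the log entry under the attribute name with index 0.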
The New Relic Infrastructure agent can be used to collect infrastructure data like CPU and memory usage from the nodes in a Databricks cluster as well as the Spark driver and executor logs, the Spark driver event log, and the logs from all init scripts on the driver and worker nodes.
The New Relic Infrastructure agent
will be automatically installed on the driver and worker nodes of the cluster
as part of the process of deploying the integration on the driver node of a cluster
if the value of the NEW_RELIC_INFRASTRUCTURE_ENABLED environment variable is
set to true. In addition, if the value of the environment variable
NEW_RELIC_INFRASTRUCTURE_LOGS_ENABLED is set to true, the Spark driver and
executor logs, the Spark driver event log, and the logs from all init scripts
on the driver and worker nodes will be collected and forwarded
to New Relic Logs.
When installed, the New Relic Infrastructure agent collects all default infrastructure monitoring data for the driver and worker nodes. Additionally, a host entity will show up for each node in the cluster.
When log forwarding is configured, the following logs are collected.
- Spark driver standard out and standard error logs
- Spark driver log4j log
- Spark event log
- Standard out and standard error logs for all init scripts on the driver node
- Spark application executor standard out and standard error logs
- Standard out and standard error logs for all init scripts on the worker nodes
The following attributes are included on all infrastructure data,
all Log events collected by the infrastructure agent,
and on the host entities for all nodes in the cluster.
| Attribute Name | Data Type | Description |
|---|---|---|
| `databricksWorkspaceHost` | string | The instance name of the target Databricks instance |
| `databricksClusterId` | string | The ID of the Databricks cluster where the metric, event, or Log originated from. On a host entity, the ID of the cluster the host is a part of. |
| `databricksClusterName` | string | The name of the Databricks cluster where the metric, event, or Log originated from. On a host entity, the name of the cluster the host is a part of. |
| `databricksIsDriverNode` | string | true if the metric, event, or Log originated from the driver node. false if the metric, event, or Log originated from a worker node. On a host entity, true if the host is the driver node or false if it is a worker node. |
| `databricksIsJobCluster` | string | true if the Databricks cluster where the metric, event, or Log originated from is a job cluster, otherwise false. On a host entity, true if the cluster the host is a part of is a job cluster, otherwise false. |
| `databricksLogType` | string | The type of log that generated the Log event. Only included on cluster health Log events. |
The databricksLogType attribute can have one of the following values. These
values can be used to determine the log which generated the Log event.
- `driver-stdout`: Indicates the `Log` event originated from the driver standard out log.
- `driver-stderr`: Indicates the `Log` event originated from the driver standard error log.
- `driver-log4j`: Indicates the `Log` event originated from the driver log4j log.
- `spark-eventlog`: Indicates the `Log` event originated from the Spark event log.
- `driver-init-script-stdout`: Indicates the `Log` event originated from the standard out log of an init script on the driver node. The specific init script can be determined by examining the `filePath` attribute.
- `driver-init-script-stderr`: Indicates the `Log` event originated from the standard error log of an init script on the driver node. The specific init script can be determined by examining the `filePath` attribute.
- `executor-stdout`: Indicates the `Log` event originated from the standard out log of a Spark application executor. The specific Spark application can be determined by examining the `filePath` attribute.
- `executor-stderr`: Indicates the `Log` event originated from the standard error log of a Spark application executor. The specific Spark application can be determined by examining the `filePath` attribute.
- `worker-init-script-stdout`: Indicates the `Log` event originated from the standard out log of an init script on a worker node. The specific init script can be determined by examining the `filePath` attribute. The specific worker node can be determined by examining the `hostname` or `fullHostname` attributes.
- `worker-init-script-stderr`: Indicates the `Log` event originated from the standard error log of an init script on a worker node. The specific init script can be determined by examining the `filePath` attribute. The specific worker node can be determined by examining the `hostname` or `fullHostname` attributes.
Driver node average CPU by cluster and entity name
FROM SystemSample
SELECT average(cpuPercent)
WHERE databricksIsDriverNode = 'true'
FACET databricksClusterName, entityName
TIMESERIES

Worker node average CPU by cluster and entity name
FROM SystemSample
SELECT average(cpuPercent)
WHERE databricksIsDriverNode = 'false'
FACET databricksClusterName, entityName
TIMESERIES

Spark driver log4j logs
WITH position(databricksWorkspaceHost, '.') AS firstDot
FROM Log
SELECT substring(databricksWorkspaceHost, 0, firstDot) AS 'Workspace', databricksClusterName AS 'Cluster', hostname AS Hostname, level AS Level, message AS Message
WHERE databricksClusterId IS NOT NULL
AND databricksLogType = 'driver-log4j'
LIMIT 1000

Spark event logs
WITH position(databricksWorkspaceHost, '.') AS firstDot,
capture(filePath, r'/databricks/driver/eventlogs/(?P<sparkContextId>[^/]+)/.*') AS sparkContextId
FROM Log
SELECT substring(databricksWorkspaceHost, 0, firstDot) AS 'Workspace', databricksClusterName AS 'Cluster', sparkContextId AS 'Spark Context ID', Event, `App Name`, `Job ID`, `Stage Info.Stage ID` AS 'Stage ID', `Task Info.Task ID` AS 'Task ID'
WHERE databricksClusterId IS NOT NULL
AND databricksLogType = 'spark-eventlog'
LIMIT 1000

Spark application executor standard out logs
WITH position(databricksWorkspaceHost, '.') AS firstDot,
capture(filePath, r'/databricks/spark/work/(?P<appName>[^/]+)/.*') AS appName
FROM Log
SELECT substring(databricksWorkspaceHost, 0, firstDot) AS 'Workspace', databricksClusterName AS 'Cluster', hostname AS Hostname, appName AS Application, message AS Message
WHERE databricksClusterId IS NOT NULL
AND databricksLogType = 'executor-stdout'
LIMIT 1000

A sample dashboard is included that shows examples of visualizing cluster health metrics and logs and the NRQL statements to use to visualize the data.
The Databricks Integration can collect telemetry about queries executed in
Databricks SQL warehouses and serverless compute, including execution times,
query statuses, and query I/O metrics such as the number of bytes and rows read
and the number of bytes spilled to disk and sent over the network. This feature
is enabled by default and can be enabled or disabled using the Databricks query metrics enabled
flag in the integration configuration.
NOTE:
- Unlike other metrics collected by the integration, query metrics are only recorded once a query has completed execution.
- When serverless compute is enabled, metrics for all SQL and Python queries executed on serverless compute for notebooks and jobs are returned by the list query history API used to collect query metrics.
- Some of the text below is sourced from the documentation on Databricks queries and the Databricks SDK Go module documentation.
Query metrics are collected only once, when the query terminates: when it completes successfully, fails, or is canceled. Furthermore, query metrics are only collected for queries that terminated since the last execution of the integration. Together, these conditions ensure that the metrics for each query are recorded exactly once.
Query metrics are collected using the query history endpoint of the Databricks ReST API. By default, this endpoint returns queries sorted in descending order by start time. This means that unless the time range used on the list query history API call includes the time when a query started, the query will not be returned. If the integration were to request only the query history since its last execution, telemetry for queries that started prior to the last execution but completed since the last execution would never be recorded, because those queries would not be returned.
To account for this, the integration provides the startOffset
configuration parameter. This parameter is used to offset the start time of the
time range used on the list query history API
call. The effect of this behavior is that metrics for queries that completed
since the last execution of the integration but started prior to the last
execution of the integration (up to startOffset
seconds ago) will be recorded.
For example, consider a query that started 3 minutes prior to the current execution of the integration but completed 10 seconds prior to the current execution. Using the default value of the interval configuration parameter, without a mechanism to adjust the start time used to query the query history, the time range used on the list query history API call would be from 60 seconds ago until 0 seconds ago. Because the list query history API returns queries according to the start time of the query, the example query would not be returned even though it completed since the last execution of the integration. Since the query is not returned, the query end time cannot be checked against the time window and the query metrics would never be recorded.
In contrast, with the default startOffset
(600 seconds or 10 minutes), the example query would be returned even though it
started before the last execution of the integration. Since the query is
returned, the integration will see that the query completed since the last
execution of the integration, and will record the query metrics for the query.
NOTE:
It is important to carefully select a value for the startOffset
configuration parameter. Collecting query metrics with the list query history API
is an expensive operation. A value should be chosen that accounts for as many
long running queries as possible without negatively affecting the performance
of the integration or the cost of executing the list query history API
call.
Since query metrics are only collected if the query terminated since the last execution of the integration, any lag in setting the query end time can cause some queries to be missed. For example, consider a query that ends one second prior to the current run time of the integration. If the end time of the query takes one second or more to be reflected in the API, the end time will not be set during the current run of the integration and cannot be checked. On the subsequent run of the integration, the query will be returned and its end time will be set, but the end time will fall just outside the last run time of the integration. As such, the metrics for the query would never be recorded.
To account for this, the integration provides the intervalOffset
configuration parameter. This parameter is used to slide the time window that
the query end time is checked against by some number of seconds, allowing the
query end time additional time to be reflected in the API. The default value for
the intervalOffset configuration
parameter is 5 seconds.
Query metric data is sent to New Relic as events. The following event attributes are provided.
NOTE:
Much of the text below is sourced from the file model.go.
| Attribute Name | Data Type | Description |
|---|---|---|
| `id` | string | The unique ID of the query |
| `type` | string | The query statement type (e.g., SELECT, INSERT) |
| `status` | string | The status of the query (e.g., FINISHED, FAILED, CANCELED) |
| `warehouseId` | string | The ID of the SQL warehouse where the query was executed |
| `warehouseName` | string | The name of the SQL warehouse where the query was executed |
| `warehouseCreator` | string | The creator of the SQL warehouse where the query was executed (only included if includeIdentityMetadata is true) |
| `workspaceId` | int | The ID of the workspace containing the SQL warehouse where the query was executed |
| `workspaceName` | string | The name of the workspace containing the SQL warehouse where the query was executed |
| `workspaceUrl` | string | The URL of the workspace containing the SQL warehouse where the query was executed |
| `duration` | int | Total execution time of the statement in milliseconds (excluding result fetch time) |
| `query` | string | The query text, truncated to 4096 characters |
| `plansState` | string | Whether plans exist for the execution, or the reason why they are missing |
| `error` | bool | Indicates whether the query failed |
| `errorMessage` | string | The error message if the query failed |
| `userId` | int | The ID of the user who executed the query (only included if includeIdentityMetadata is true) |
| `userName` | string | The email address or username of the user who executed the query (only included if includeIdentityMetadata is true) |
| `runAsUserId` | int | The ID of the user whose credentials were used to run the query (only included if includeIdentityMetadata is true) |
| `runAsUserName` | string | The email address or username of the user whose credentials were used to run the query (only included if includeIdentityMetadata is true) |
| `executionEndTime` | int | The timestamp when the execution of the query ended, in milliseconds |
| `startTime` | int | The timestamp when the query started, in milliseconds |
| `endTime` | int | The timestamp when the query ended, in milliseconds |
| `final` | bool | Indicates whether more updates for the query are expected |
| `compilationTime` | int | The time spent loading metadata and optimizing the query, in milliseconds |
| `executionTime` | int | The time spent executing the query, in milliseconds |
| `fetchTime` | int | The time spent fetching the query results after the execution finished, in milliseconds |
| `totalTime` | int | The total execution time of the query from the client's point of view, in milliseconds |
| `totalTaskTime` | int | The sum of the execution time for all of the query's tasks, in milliseconds |
| `totalPhotonTime` | int | The total execution time for all individual Photon query engine tasks in the query, in milliseconds |
| `overloadingQueueStartTime` | int | The timestamp when the query was enqueued waiting while the warehouse was at max load, in milliseconds. This field will be 0 if the query skipped the overloading queue. |
| `provisioningQueueStartTime` | int | The timestamp when the query was enqueued waiting for a cluster to be provisioned for the warehouse, in milliseconds. This field will be 0 if the query skipped the provisioning queue. |
| `compilationStartTime` | int | The timestamp when the underlying compute started compilation of the query, in milliseconds |
| `fromCache` | bool | Indicates whether the query result was fetched from the cache |
| `bytesRead` | int | The total size of data read by the query, in bytes |
| `cacheBytesRead` | int | The total size of persistent data read from the cache, in bytes |
| `filesRead` | int | The total number of files read after pruning |
| `partitionsRead` | int | The total number of partitions read after pruning |
| `remoteBytesRead` | int | The size of persistent data read from cloud object storage on your cloud tenant, in bytes |
| `rowsRead` | int | The total number of rows read by the query |
| `rowsReturned` | int | The total number of rows returned by the query |
| `bytesPruned` | int | The total number of bytes in all tables not read due to pruning |
| `filesPruned` | int | The total number of files from all tables not read due to pruning |
| `remoteBytesWritten` | int | The size of persistent data written to cloud object storage in your cloud tenant, in bytes |
| `diskBytesSpilled` | int | The size of data temporarily written to disk while executing the query, in bytes |
| `networkBytesSent` | int | The total amount of data sent over the network between executor nodes during shuffle, in bytes |
Total completed queries
FROM DatabricksQuery
SELECT count(*) AS Queries
WHERE status = 'FINISHED'

Total failed queries
FROM DatabricksQuery
SELECT count(*) AS Queries
WHERE status = 'FAILED'

Rate of successful queries per second
FROM DatabricksQuery
SELECT rate(count(*), 1 second) AS Queries
WHERE status = 'FINISHED'
TIMESERIES

Number of failed queries by query text
FROM DatabricksQuery
SELECT count(*)
WHERE status = 'FAILED'
FACET query

Average bytes spilled to disk by query text
FROM DatabricksQuery
SELECT average(diskBytesSpilled)
WHERE diskBytesSpilled > 0
ORDER BY diskBytesSpilled DESC
FACET query

Total duration of all queries run by warehouse
FROM DatabricksQuery
SELECT sum(duration) / 1000 AS Duration
FACET warehouseName
TIMESERIES

Query history
WITH
concat(workspaceUrl, '/sql/warehouses/', warehouseId, '/monitoring?queryId=', id) AS queryLink
FROM DatabricksQuery
SELECT
status as Status,
substring(query, 0, 100) as Query,
startTime as Started,
duration as Duration,
userName as User,
queryLink

Queries with data spilling
WITH
concat(workspaceUrl, '/sql/warehouses/', warehouseId, '/monitoring?queryId=', id) AS queryLink
FROM DatabricksQuery
SELECT
diskBytesSpilled,
query,
queryLink
WHERE diskBytesSpilled > 0
ORDER BY diskBytesSpilled DESC
LIMIT 100

A sample dashboard is included that shows examples of the types of query metric information that can be displayed and the NRQL statements to use to visualize the data.
While not strictly enforced, the basic preferred editor settings are set in the .editorconfig. Other than this, no style guidelines are currently imposed.
This project uses both go vet and
staticcheck to perform static code analysis. These
checks are run via precommit on all commits. Though
this can be bypassed on local commit, both tasks are also run during
the validate workflow and must have no
errors in order to be merged.
Commit messages must follow the conventional commit format.
Again, while this can be bypassed on local commit, it is strictly enforced in
the validate workflow.
The basic commit message structure is as follows.
<type>[optional scope][!]: <description>
[optional body]
[optional footer(s)]
In addition to providing consistency, the commit message is used by
svu during
the release workflow. The presence and values
of certain elements within the commit message affect auto-versioning. For
example, the feat type will bump the minor version. Therefore, it is important
to use the guidelines below and carefully consider the content of the commit
message.
Please use one of the types below.
- `feat` (bumps minor version)
- `fix` (bumps patch version)
- `chore`
- `build`
- `docs`
- `test`
Any type can be followed by the ! character to indicate a breaking change.
Additionally, any commit that has the text BREAKING CHANGE: in the footer will
indicate a breaking change.
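For example, a commit introducing a backward-incompatible feature might look like the following (the scope and description here are purely illustrative):

```
feat(config)!: rename a configuration parameter

BREAKING CHANGE: existing configurations that use the old parameter
name must be updated.
```

Because of the `feat` type and the `!`/`BREAKING CHANGE:` markers, svu would treat this commit as a breaking feature when calculating the next version.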
For local development, simply use go build and go run. For example,

go build cmd/databricks/databricks.go

or

go run cmd/databricks/databricks.go

If you prefer, you can also use goreleaser with the --single-target option to build the binary for the local GOOS and GOARCH only.
goreleaser build --single-target

Releases are built and packaged using goreleaser.
By default, a new release will be built automatically on any push to the main
branch. For more details, review the .goreleaser.yaml
and the goreleaser documentation.
The svu utility is used to generate the next tag value based on commit messages.
This project utilizes GitHub workflows to perform actions in response to certain GitHub events.
| Workflow | Events | Description |
|---|---|---|
| validate | push, pull_request to main branch | Runs precommit to perform static analysis and runs commitlint to validate the last commit message |
| build | push, pull_request | Builds and tests code |
| release | push to main branch | Generates a new tag using svu and runs goreleaser |
| repolinter | pull_request | Enforces repository content guidelines |
Sensitive data like credentials and API keys should never be specified directly in custom environment variables. Instead, it is recommended to create a secret using the Databricks CLI and reference the secret in the environment variable.
For example, the init script used to
install the Databricks Integration when deploying the integration on the driver node of a cluster
requires the New Relic license key to be specified in the environment variable
named NEW_RELIC_LICENSE_KEY. The steps below show an example of how to create
a secret for the New Relic license key and reference it in a custom environment variable.
- Follow the steps to install or update the Databricks CLI.
- Use the Databricks CLI to create a Databricks-backed secret scope with the name `newrelic`. For example,

  databricks secrets create-scope newrelic

  NOTE: Be sure to take note of the information in the referenced URL about the `MANAGE` scope permission and use the correct version of the command.
- Use the Databricks CLI to create a secret for the license key in the new scope with the name `licenseKey`. For example,

  databricks secrets put-secret --json '{ "scope": "newrelic", "key": "licenseKey", "string_value": "[YOUR_LICENSE_KEY]" }'
To set the custom environment variable named NEW_RELIC_LICENSE_KEY and
reference the value from the secret, follow the steps to
configure custom environment variables
and add the following line after the last entry in the Environment variables
field.
NEW_RELIC_LICENSE_KEY={{secrets/newrelic/licenseKey}}
New Relic has open-sourced this project. This project is provided AS-IS WITHOUT WARRANTY OR DEDICATED SUPPORT. Issues and contributions should be reported to the project here on GitHub.
We encourage you to bring your experiences and questions to the Explorers Hub where our community members collaborate on solutions and new ideas.
At New Relic we take your privacy and the security of your information seriously, and are committed to protecting your information. We must emphasize the importance of not sharing personal data in public forums, and ask all users to scrub logs and diagnostic information for sensitive information, whether personal, proprietary, or otherwise.
We define “Personal Data” as any information relating to an identified or identifiable individual, including, for example, your name, phone number, post code or zip code, Device ID, IP address, and email address.
For more information, review New Relic’s General Data Privacy Notice.
We encourage your contributions to improve this project! Keep in mind that when you submit your pull request, you'll need to sign the CLA via the click-through using CLA-Assistant. You only have to sign the CLA one time per project.
If you have any questions, or to execute our corporate CLA (which is required if your contribution is on behalf of a company), drop us an email at [email protected].
A note about vulnerabilities
As noted in our security policy, New Relic is committed to the privacy and security of our customers and their data. We believe that providing coordinated disclosure by security researchers and engaging with the security community are important means to achieve our security goals.
If you believe you have found a security vulnerability in this project or any of New Relic's products or websites, we welcome and greatly appreciate you reporting it to New Relic through HackerOne.
If you would like to contribute to this project, review these guidelines.
To all contributors, we thank you! Without your contribution, this project would not be what it is today.
The Databricks Integration project is licensed under the Apache 2.0 License.