Databricks Question 1668314325
Answers
Databricks Data Science and Engineering workspace, also known as the Workspace.
Databricks SQL provides a simple experience for users who want to run quick, ad hoc queries on the
data lake, visualise query results, and create and share dashboards.
Databricks Machine Learning is an integrated, end-to-end machine learning environment useful for
tracking experiments, training models, managing feature development, and serving features and
models.
Databricks runs one executor per worker node. Therefore the terms executor and worker are used
interchangeably in the context of the Databricks architecture. People often think of cluster size in terms of
the number of workers, but there are other important factors to consider:
Total executor cores (compute): The total number of cores across all executors. This determines
the maximum parallelism of a cluster.
Total executor memory: The total amount of RAM across all executors. This determines how much
data can be stored in memory before spilling it to disk.
Executor local storage: The type and amount of local disk storage. Local disk is primarily used in
the case of spills during shuffles and caching.
There’s a balancing act between the number of workers and the size of worker instance types. A cluster
with two workers, each with 40 cores and 100 GB of RAM, has the same compute and memory as an
eight worker cluster with 10 cores and 25 GB of RAM.
If you expect many re-reads of the same data, then your workloads may benefit from caching. Consider a
storage optimized configuration with Delta Cache.
Jobs clusters run automated jobs in an expeditious and robust way. The Databricks Job scheduler
creates job clusters when you run Jobs and terminates them when the associated Job is complete. You
cannot restart a job cluster. These properties ensure an isolated execution environment for each and
every Job.
Standard clusters are ideal for processing large amounts of data with Apache Spark.
Single Node clusters are intended for jobs that use small amounts of data or non-distributed
workloads such as single-node machine learning libraries.
High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc
jobs. Administrators usually create High Concurrency clusters. Databricks recommends enabling
autoscaling for High Concurrency clusters.
None: No isolation. Does not enforce workspace-local table access control or credential passthrough.
Cannot access Unity Catalog data.
Single User: Can be used only by a single user (by default, the user who created the cluster). Other
users cannot attach to the cluster. When accessing a view from a cluster with Single User security
mode, the view is executed with the user’s permissions. Single-user clusters support workloads using
Python, Scala, and R. Init scripts, library installation, and DBFS FUSE mounts are supported on single-
user clusters. Automated jobs should use single-user clusters.
User Isolation: Can be shared by multiple users. Only SQL workloads are supported. Library
installation, init scripts, and DBFS FUSE mounts are disabled to enforce strict isolation among the
cluster users.
Table ACL only (Legacy): Enforces workspace-local table access control, but cannot access Unity Catalog
data.
Passthrough only (Legacy): Enforces workspace-local credential passthrough, but cannot access Unity
Catalog data.
The only security modes supported for Unity Catalog workloads are Single User and User Isolation. For
more information, see Cluster security mode.
When a cluster is terminated, all cloud resources currently in use are deleted. This means that associated VMs
and operational memory are purged, attached volume storage is deleted, and network connections between
nodes are removed. In short, all resources previously associated with the compute environment
are completely removed.
Any results that need to be persisted should be saved to a permanent location. You won't lose
your code or the data files that you saved appropriately.
Clusters will also terminate automatically due to inactivity if automatic termination is configured.
Cluster configuration settings are maintained, and you can then use the restart button to deploy a
new set of cloud resources using the same configuration.
The Restart button allows us to manually restart our cluster. This can be useful if we need to
completely clear out the cache on the cluster or wish to completely reset our compute environment.
The Delete button will stop our cluster and remove the cluster configuration.
Changing most settings by clicking on the Edit button will require running clusters to be restarted.
display(dbutils.fs.ls("/databricks-datasets"))
2.3. What function should you use when you have tabular data
returned by a Python cell?
display()
Databricks Repos have user-level folders and non-user top level folders. User-level folders are automatically
created when users first clone a remote repository. You can think of Databricks Repos in user folders as “local
checkouts” that are individual for each user and where users make changes to their code.
Admins can create non-user top level folders. The most common use case for these top level folders is to
create Dev, Staging, and Production folders that contain Databricks Repos for the appropriate versions or
branches for development, staging, and production. For example, if your company uses the Main branch for
production, the Production folder would contain Repos configured to be at the Main branch.
Typically, permissions on these top-level folders are read-only for all non-admin users within the workspace.
To ensure that Databricks Repos are always at the latest version, you can set up Git automation to call the
Repos API 2.0. In your Git provider, set up automation that, after every successful merge of a PR into the
main branch, calls the Repos API endpoint on the appropriate repo in the Production folder to bring that repo
to the latest version. For example, on GitHub this can be achieved with GitHub Actions.
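A minimal sketch of such automation, assuming the Repos API 2.0 update endpoint and placeholder values for the workspace URL, repo ID, and access token (store the token as a secret in your CI system):
import requests

DATABRICKS_HOST = "https://<your-workspace-url>"    # placeholder
REPO_ID = "<production-repo-id>"                    # ID of the Repo in the Production folder (placeholder)
TOKEN = "<databricks-personal-access-token>"        # placeholder

# Pull the Production repo to the latest commit of the main branch
resp = requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "main"},
)
resp.raise_for_status()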
Developer workflow
In your user folder in Databricks Repos, clone your remote repository. A best practice is to create a new
feature branch or select a previously created branch for your work, instead of directly committing and pushing
changes to the main branch. You can make changes, commit, and push changes in that branch. When you
are ready to merge your code, create a pull request and follow the review and merge processes in Git.
Here is an example workflow. Note that this workflow requires that you have already set up your Git
integration.
1. Clone your remote repository into your user folder in Databricks Repos.
2. Use the Repos UI to create a feature branch from the main branch. This example uses a single feature
branch feature-b for simplicity. You can create and use multiple feature branches to do your work.
3. Make your modifications to Databricks notebooks and files in the Repo.
4. Working on a separate branch, a coworker makes changes to the notebooks and files in the Repo.
5. The coworker commits and pushes their changes to the Git provider.
6. To merge changes from other branches or rebase the feature branch, you must use the Git command
line or an IDE on your local system. Then, in the Repos UI, use the Git dialog to pull changes into the
feature-b branch in the Databricks Repo.
7. When you are ready to merge your work to the main branch, use your Git provider to create a PR to
merge the changes from feature-b.
8. In the Repos UI, pull changes to the main branch.
You can point a job directly to a notebook in a Databricks Repo. When a job kicks off a run, it uses the
current version of the code in the repo.
If the automation is set up as described in the Admin workflow, every successful merge calls the Repos API to
update the repo. As a result, jobs that are configured to run code from a repo always use the latest version
available when the job run was created.
4.2. How does Delta Lake address the data lake pain points to
ensure reliable, ready-to-go data?
ACID Transactions – Delta Lake adds ACID transactions to data lakes. ACID stands for atomicity,
consistency, isolation, and durability, which are a standard set of guarantees most databases are
designed around. Since most data lakes have multiple data pipelines that read and write data at the
same time, data engineers often spend a significant amount of time to make sure that data remains
reliable during these transactions. With ACID transactions, each transaction is handled as having a
distinct beginning and end. This means that data in a table is not updated until a transaction
successfully completes, and each transaction will either succeed or fail fully.
These transactional guarantees eliminate many of the motivations for having both a data lake and a data
warehouse in an architecture. Appending data is easy, as each new write will create a new version of a
data table, and new data won’t be read until the transaction completes. This means that data jobs that fail
midway can be disregarded entirely. It also simplifies the process of deleting and updating records - many
changes can be applied to the data in a single transaction, eliminating the possibility of incomplete deletes
or updates.
Schema Management – Delta Lake gives you the ability to specify and enforce your data schema. It
automatically validates that the schema of the data being written is compatible with the schema of the
table it is being written into. Columns that are present in the table but not in the data are set to null. If
there are extra columns in the data that are not present in the table, this operation throws an exception.
This ensures that bad data that could corrupt your system is not written into it. Delta Lake also enables
you to make changes to a table’s schema that can be applied automatically.
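A minimal sketch of both behaviors, assuming a Databricks notebook where spark is available (table and column names are illustrative):
%python
# Create a small Delta table (hypothetical name)
spark.range(3).selectExpr("id", "id * 2 AS value").write.format("delta").saveAsTable("schema_demo")

# A new batch that carries an extra column not present in the table
new_batch = spark.range(3).selectExpr("id", "id * 2 AS value", "'web' AS source")

# By default this append raises an exception because of the extra column:
# new_batch.write.format("delta").mode("append").saveAsTable("schema_demo")

# Opting in to schema evolution lets Delta Lake add the new column automatically
(new_batch.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .saveAsTable("schema_demo"))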
Scalable Metadata Handling – Big data is very large in size, and its metadata (the information about
the files containing the data and the nature of the data) can be very large as well. With Delta Lake,
metadata is processed just like regular data - with distributed processing.
Unified Batch and Streaming Data – Delta Lake is designed from the ground up to allow a single system
to support both batch and stream processing of data. The transactional guarantees of Delta Lake mean
that each micro-batch transaction creates a new version of a table that is instantly available for insights.
Many Databricks users use Delta Lake to transform the update frequency of their dashboards and reports
from days to minutes while eliminating the need for multiple systems.
Data Versioning and Time Travel – With Delta Lake, the transaction logs used to ensure ACID
compliance create an auditable history of every version of the table, indicating which files have changed
between versions. This log makes it easy to retain historical versions of the data to fulfill compliance
requirements such as GDPR and CCPA. The transaction logs also include metadata
like extended statistics about the files in the table and the data in the files. Databricks uses Spark to scan
the transaction logs, applying the same efficient processing to the large amount of metadata associated
with millions of files in a data lake.
4.4. Is Delta Lake the default for all tables created in Databricks?
Yes, Delta Lake is the default format for all tables created in Databricks.
catalog_name.database_name.table_name
This syntax is also valid but not good practice as each statement is processed as a separate transaction with
its own ACID guarantees:
UPDATE students
SET value = value + 1
WHERE name LIKE "T%"
5.6. What is the syntax for merge and what are the benefits of
using merge?
Databricks uses the MERGE keyword to perform upserts, which allows updates, inserts, and other data
manipulations to be run as a single command.
If you write 3 statements, one each to insert, update, and delete records, this would result in 3 separate
transactions; if any of these transactions were to fail, it might leave our data in an invalid state. Instead, we
combine these actions into a single atomic transaction, applying all 3 types of changes together.
MERGE statements must have at least one field to match on, and each WHEN MATCHED or WHEN NOT
MATCHED optional clause can have any number of additional conditional statements.
MERGE INTO table_a a
USING table_b b
ON a.col_name=b.col_name
WHEN MATCHED AND b.col = X
THEN UPDATE SET *
WHEN MATCHED AND a.col = Y
THEN DELETE
WHEN NOT MATCHED AND b.col = Z
THEN INSERT *
UPDATE SET * and INSERT * are used to update/insert all the columns in the target table with matching columns from the
source data set. The equivalent Delta Lake APIs are updateAll() and insertAll() .
Let's say you want to backfill a loan_delta table with historical data on past loans. But some of the historical
data may already have been inserted in the table, and you don't want to update those records because they
may contain more up-to-date information. You can deduplicate by the loan_id while inserting by running the
following merge operation with only the insert action (since the update action is optional):
# In Python (a sketch; assumes the target Delta table loan_delta and the historicalUpdates DataFrame already exist)
from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "loan_delta")

(deltaTable
  .alias("t")
  .merge(historicalUpdates.alias("s"), "t.loan_id = s.loan_id")
  .whenNotMatchedInsertAll()
  .execute())
Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and
process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work quickly
on petabytes of data. What makes Hive unique is the ability to query large datasets, leveraging Apache Tez
or MapReduce, with a SQL-like interface.
6.2. What are the two commands to see metadata about a table?
Using DESCRIBE EXTENDED allows us to see important metadata about our table.
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students"))
DESCRIBE DETAIL allows us to see some other details about our Delta table, including the number of files.
6.4. Describe the Delta Lake files, their format and directory
structure
Our directory contains a number of Parquet data files and a directory named _delta_log .
We can peek inside the _delta_log to see the transactions.
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students/_delta_log"))
6.5. What does the query engine do using the transaction logs
when we query a Delta Lake table?
Rather than overwriting or immediately deleting files containing changed data, Delta Lake uses the
transaction log to indicate whether or not files are valid in a current version of the table. When we query a
Delta Lake table, the query engine uses the transaction logs to resolve all the files that are valid in the current
version, and ignores all other data files.
You can look at a particular transaction log and see if records were inserted / updated / deleted.
%python
display(spark.sql(f"SELECT * FROM json.`{DA.paths.user_db}/students/_delta_log/00000000000000000007.json`"))
In the output, the add column contains a list of all the new files written to our table; the remove column
indicates those files that no longer should be included in our table.
6.6. What commands do you use to compact small files and index
tables?
Having a lot of small files is not very efficient, as you need to open them before reading them. Small files can
occur for a variety of reasons; e.g. performing a number of operations where only one or several records are
inserted.
Using the OPTIMIZE command allows you to combine files toward an optimal size (scaled based on the size of
the table). It will replace existing data files by combining records and rewriting the results.
When executing OPTIMIZE , users can optionally specify one or several fields for ZORDER indexing. It speeds
up data retrieval when filtering on provided fields by colocating data with similar values within data files. Co-
locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to
be read.
For example, if we know that the data analysts in our team query a lot of files by id , we can make the
process more efficient by looking only at the ids they're interested in. It indexes on id and clusters the ids
into separate files, so that we don't have to read every file when querying the data.
OPTIMIZE events
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
For more information about the OPTIMIZE command, see Optimize performance with file management.
The operationsParameters column will let you review predicates used for updates, deletes, and merges.
The operationMetrics column indicates how many rows and files are added in each operation.
The version column designates the state of a table once a given transaction completes.
The readVersion column indicates the version of the table an operation executed against.
6.8. How do you query and roll back to previous table version?
Query:
The transaction log provides us with the ability to query previous versions of our table. These time travel
queries can be performed by specifying either the integer version or a timestamp.
SELECT *
FROM students VERSION AS OF 3
Note that we're not recreating a previous state of the table by undoing transactions against our current
version; rather, we're just querying all those data files that were indicated as valid as of the specified
version.
Rollback:
Suppose that you accidentally delete all of your data (e.g. by typing DELETE FROM students , where we delete
all the records in our table). Luckily, we can simply roll back this commit. Note that a RESTORE command is
recorded as a transaction; you won't be able to completely hide the fact that you accidentally deleted all the
records in the table, but you will be able to undo the operation and bring your table back to a desired state.
RESTORE TABLE students TO VERSION AS OF 8
6.9. What command do you use to clean up stale data files and
what are the consequences of using this command?
Imagine that you optimize your data. You know that while all your data has been compacted, the data files
from previous versions of your table are still being stored. You can remove these files and remove access to
previous versions of the table by running VACUUM on the table.
While Delta Lake versioning and time travel are great for querying recent versions and rolling back queries,
keeping the data files for all versions of large production tables around indefinitely is very expensive (and can
lead to compliance issues if PII is present).
Databricks will automatically clean up stale files in Delta Lake tables. If you wish to manually purge old data
files, this can be performed with the VACUUM operation.
By default, VACUUM will prevent you from deleting files less than 7 days old. This is because if you run
VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data
retention period.
To run VACUUM with a retention shorter than the default threshold, you need to turn off the check that prevents
premature deletion of data files, and make sure that logging of VACUUM commands is enabled. Finally, you can
use the DRY RUN option of VACUUM to print out all records to be deleted (useful to review files manually before deleting them).
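The configuration cell referred to below is not reproduced in these notes; a minimal sketch of it, using the students table from the earlier examples, might look like this:
%python
# Override the retention-duration check (demonstration only; see the warning below)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Record VACUUM operations in the transaction log
spark.conf.set("spark.databricks.delta.vacuum.logging.enabled", "true")

# DRY RUN prints the files that would be deleted without actually removing them
display(spark.sql("VACUUM students RETAIN 0 HOURS DRY RUN"))

# Permanently remove data files no longer referenced by the current table version
spark.sql("VACUUM students RETAIN 0 HOURS")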
The cell above modifies some Spark configurations. The first command overrides the retention threshold
check to allow us to demonstrate permanent removal of data. The second command sets
spark.databricks.delta.vacuum.logging.enabled to true to ensure that the VACUUM operation is
recorded in the transaction log.
Note that vacuuming a production table with a short retention can lead to data corruption and/or failure of long-
running queries. This is for demonstration purposes only and extreme caution should be used when disabling
this setting.
When running VACUUM and deleting files, we permanently remove access to versions of the table that
require these files to materialize.
Because VACUUM can be such a destructive act for important datasets, it's always a good idea to turn the
retention duration check back on.
Note that because Delta Cache stores copies of files queried in the current session on storage volumes
deployed to your currently active cluster, you may still be able to temporarily access previous table versions
(though systems should not be designed to expect this behavior). Restarting the cluster will ensure that these
cached data files are permanently purged.
Default location is under dbfs:/user/hive/warehouse/ and the database directory is the name of the
database with the .db extension, so it is dbfs:/user/hive/warehouse/db_name.db . This is a directory
that our database is tied to.
Whereas the location of the database with custom location is in the directory specified after the LOCATION
keyword.
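A minimal sketch of the two declarations; the database names match those used below, and the custom path reuses the working-directory variable that appears elsewhere in these notes:
%python
# Default location: ends up under dbfs:/user/hive/warehouse/db_name_default_location.db
spark.sql("CREATE DATABASE IF NOT EXISTS db_name_default_location")

# Custom location: the database directory is created at the path given after LOCATION
spark.sql(f"""
    CREATE DATABASE IF NOT EXISTS db_name_custom_location
    LOCATION '{DA.paths.working_dir}/_custom_location.db'
""")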
Note that in Databricks, the terms “schema” and “database” are used interchangeably (whereas in many
relational systems, a database is a collection of schemas).
USE db_name_default_location;
Same syntax when creating a table in the database with custom location and inserting data. The
schema must be provided because there is no data from which to infer the schema.
USE db_name_custom_location;
Python syntax:
df.write.saveAsTable("table_name")
You can use this command to find a table location within the database, both for default and custom
location.
Default location:
By default, managed tables in a database without the location specified will be created in the
dbfs:/user/hive/warehouse/<database_name>.db/ directory.
%python
hive_root = f"dbfs:/user/hive/warehouse"
db_name = f"db_name_default_location.db"
table_name = f"managed_table_in_db_with_default_location"
tbl_location = f"{hive_root}/{db_name}/{table_name}"
print(tbl_location)
files = dbutils.fs.ls(tbl_location)
display(files)
Custom location:
The managed table is created in the path specified with the LOCATION keyword during database creation.
As such, the data and metadata for the table are persisted in a directory there.
%python
table_name = f"managed_table_in_db_with_custom_location"
tbl_location = f"{DA.paths.working_dir}/_custom_location.db/{table_name}"
print(tbl_location)
files = dbutils.fs.ls(tbl_location)
display(files)
Because data and metadata are managed independently, you can rename a table or register it to a new
database without needing to move any data. Data engineers often prefer unmanaged tables and the
flexibility they provide for production data.
USE db_name_default_location;
Python syntax:
df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")
7.8. What happens when you drop tables (difference between a
managed and an unmanaged table)?
Managed tables:
Databricks manages both the metadata and the data for a managed table; when you drop a table, you also
delete the underlying data. Data analysts and other users that mostly work in SQL may prefer this behavior.
Managed tables are the default when creating a table.
For managed tables, when dropping the table, the table's directory and its log and data files will be deleted,
only the database directory remains.
Databricks only manages the metadata for unmanaged (external) tables; when you drop a table, you do not
affect the underlying data. Because data and metadata are managed independently, you can rename a table or
register it to a new database without needing to move any data.
For production workloads, we will often want to define those tables as external. This will avoid the potential
issue of dropping a production table, and avoid having to do an internal migration if we need to associate these
data files with a different database or change a table name at a later point as we're working with this particular
table. Data engineers often prefer unmanaged tables and the flexibility they provide for production data.
After executing the above, the table definition no longer exists in the metastore, but the underlying data
remain intact.
Even though this table no longer exists in our database, we still have access to the underlying data files,
meaning that we can still interact with these files directly, or we can define a new table using these files.
%python
tbl_path = f"{DA.paths.working_dir}/external_table"
files = dbutils.fs.ls(tbl_path)
display(files)
7.9. What is the command to drop the database and its
underlying tables and views?
Using cascade , we will delete all the tables and views associated with a database.
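For example, a sketch using one of the databases created earlier:
%python
spark.sql("DROP DATABASE db_name_default_location CASCADE")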
SHOW TABLES;
8.2. What is the difference between Views, Temp Views & Global
Temp Views?
View: Persisted as an object in a database and, like a table, available across multiple sessions.
You can query views from any part of the Databricks product (permissions allowing).
Creating a view does not process or write any data; only the query text (i.e. the logic) is
registered to the metastore in the associated database against the source.
Global Temporary View: Scoped to the cluster level and can be shared between notebooks or jobs that share
computing resources. Global temp views are registered to a separate database, the global temp database
(rather than our declared database), so they won't show up in our list of tables associated with our declared
database. This database lives as part of the cluster; as long as the cluster is on, the global temp view will be
available from any Spark session that connects to that cluster, or any notebook attached to that cluster.
Global temp views are lost when the cluster is restarted. Databricks recommends using
views with appropriate table ACLs instead of global temporary views.
Temp View: Scoped to the current Spark session only. A temp view is not registered to a database in the
metastore, so it is not persisted and cannot be queried from other sessions; it disappears when the session
ends (for example, when the notebook is detached and reattached or the cluster is restarted).
SHOW TABLES;
WITH cte_table AS (
SELECT
col1,
col2,
col3
FROM
external_table
WHERE
col1 = X
)
SELECT
*
FROM
cte_table
WHERE
col1 > X
AND col2 = Y;
WITH lax_bos AS (
WITH origin_destination (origin_airport, destination_airport) AS (
SELECT
origin,
destination
FROM
external_table
)
SELECT
*
FROM
origin_destination
WHERE
origin_airport = 'LAX'
AND destination_airport = 'BOS'
)
SELECT
count(origin_airport) AS `Total Flights from LAX to BOS`
FROM
lax_bos;
SELECT
max(total_delay) AS `Longest Delay (in minutes)`
FROM
(
WITH delayed_flights(total_delay) AS (
SELECT
delay
FROM
external_table
)
SELECT
*
FROM
delayed_flights
);
9.5. How do you extract the raw bytes and metadata of a file?
What is a typical use case for this?
Some workflows may require working with entire files, such as when dealing with images or unstructured data.
Using binaryFile to query a directory will provide file metadata alongside the binary representation of the file
contents. Specifically, the fields created will indicate the path , modificationTime , length , and
content .
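A minimal sketch of such a query (the directory path is a placeholder):
%python
files_df = (spark.read
    .format("binaryFile")
    .load("/path/to/raw/files"))

display(files_df.select("path", "modificationTime", "length", "content"))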
The cell below demonstrates using Spark SQL DDL to create a table against an external CSV source.
All the metadata and options passed during table declaration will be persisted to the metastore, ensuring that
data in the location will always be read with these options.
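The DDL cell itself is not reproduced here; a sketch of the pattern, using the sales_csv table and pipe-delimited source referenced later in this section (the column list is an assumption):
%python
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS sales_csv
      (order_id LONG, email STRING, transactions_timestamp LONG,
       total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE,
       unique_items INTEGER, items STRING)
    USING CSV
    OPTIONS (
      header = "true",
      delimiter = "|"
    )
    LOCATION "{DA.paths.working_dir}/sales-csv"
""")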
10.4. Does the column order matter if additional csv data files are
added to the source directory at a later stage?
When working with CSVs as a data source, it's important to ensure that column order does not change if
additional data files will be added to the source directory. Because the data format does not have strong
schema enforcement, Spark will load columns and apply column names and data types in the order
specified during table declaration.
Options passed during table declaration are included as Storage Properties , (e.g. specifying the pipe
delimiter and presence of a header).
10.6. What are the limits of tables with external data sources?
Whenever we're defining tables or queries against external data sources, we cannot expect the performance
guarantees associated with Delta Lake and Lakehouse. For example, while Delta Lake tables will guarantee
that you always query the most recent version of your source data, tables registered against other data
sources may represent older cached versions.
The cell below executes some logic that we can think of as just representing an external system directly
updating the files underlying our table.
%python
(spark.table("sales_csv")
.write.mode("append")
.format("csv")
.save(f"{DA.paths.working_dir}/sales-csv"))
10.7. How can you manually refresh the cache of your data?
At the time you query the data source, Spark automatically caches the underlying data in local storage. This
ensures that on subsequent queries, Spark will provide the optimal performance by just querying this local
cache.
Our external data source is not configured to tell Spark that it should refresh this data. We can manually
refresh the cache of our data by running the REFRESH TABLE command. Note that refreshing our table will
invalidate our cache, meaning that we'll need to rescan our original data source and pull all data back into
memory. For very large datasets, this may take a significant amount of time.
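For example, to invalidate and rebuild the cache for the CSV-backed table used above:
%python
spark.sql("REFRESH TABLE sales_csv")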
CREATE TABLE jdbc_table_name  -- placeholder name for the table being registered
USING JDBC
OPTIONS (
  url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
  dbtable = "{jdbcDatabase}.table",
  user = "{jdbcUsername}",
  password = "{jdbcPassword}"
)
In the code sample below, we'll connect with SQLite. Note that the backend configuration of the JDBC server
assumes you are running this notebook on a single-node cluster. If you are running on a cluster with multiple
workers, the client running in the executors will not be able to connect to the driver.
While the table is listed as MANAGED , listing the contents of the specified location confirms that no data is
being persisted locally. Note that some SQL systems such as data warehouses will have custom drivers.
You can move the entire source table(s) to Databricks and then execute the logic on the currently active
cluster. However, this can incur significant overhead because of the network transfer latency associated with
moving all of the data over the public internet.
Alternatively, you can push down the query to the external SQL database and only transfer the results back to
Databricks. However, this can also incur significant overhead because the query logic is executed in source
systems that are not optimized for big data queries.
11.4. How do you filter and rename columns from existing tables
during table creation?
Simple transformations like changing column names or omitting columns from target tables can be easily
accomplished during table creation.
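A sketch of doing this with a CTAS statement (the target table name is illustrative; the columns mirror the view below):
%python
spark.sql("""
    CREATE OR REPLACE TABLE purchases AS
    SELECT order_id AS id, transaction_timestamp, purchase_revenue_in_usd AS price
    FROM sales
""")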
Note that we could have accomplished the same goal with a view, as shown below.
CREATE OR REPLACE VIEW purchases_vw AS
SELECT order_id AS id, transaction_timestamp, purchase_revenue_in_usd AS price
FROM sales;
As noted previously, CTAS statements do not support schema declaration. For example, a timestamp column
can be some variant of a Unix timestamp, which may not be the most useful for analysts to derive insights. This
is a situation where generated columns would be beneficial. You can also provide a descriptive column
comment for the generated column.
In the example below, the date column is generated by converting the existing transaction_timestamp
column to a timestamp, and then a date.
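The table definition itself is not reproduced in these notes; a sketch of the pattern, where the column types are assumptions and the generation expression follows the description above:
%python
spark.sql("""
    CREATE OR REPLACE TABLE purchase_dates (
      id STRING,
      transaction_timestamp STRING,
      price STRING,
      date DATE GENERATED ALWAYS AS (
        CAST(CAST(transaction_timestamp / 1e6 AS TIMESTAMP) AS DATE))
        COMMENT "generated based on the transaction_timestamp column"
    )
""")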
Because date is a generated column, if we write to purchase_dates without providing values for the
date column, Delta Lake automatically computes them.
The cell below configures a setting to allow for generating columns when using a Delta Lake MERGE
statement.
SET spark.databricks.delta.schema.autoMerge.enabled=true;
All dates will be computed correctly as data is inserted, although neither our source data nor our insert query
specifies the values in this field. As with any Delta Lake source, the query automatically reads the most recent
snapshot of the table for any query; you never need to run REFRESH TABLE .
It's important to note that if a field that would otherwise be generated is included in an insert to a table, this
insert will fail if the value provided does not exactly match the value that would be derived by the logic used to
define the generated column.
11.6. What are the two types of table constraints and how do you
display them?
Because Delta Lake enforces schema on write, Databricks can support standard SQL constraint
management clauses to ensure the quality and integrity of data added to a table.
Databricks currently support two types of constraints: NOT NULL constraints, and CHECK constraints
(Generated columns are a special implementation of check constraints).
In both cases, you must ensure that no data violating the constraint is already in the table prior to defining the
constraint. Once a constraint has been added to a table, data violating the constraint will result in write failure.
Below, we'll add a CHECK constraint to the date column of our table. Note that CHECK constraints look like
standard WHERE clauses you might use to filter a dataset. We can alter our purchase_dates table and add the
constraint valid_date that checks whether date is greater than the string '2020-01-01' .
ALTER TABLE purchase_dates ADD CONSTRAINT valid_date CHECK (date > '2020-01-01');
Table constraints are shown in the Table Properties field (you'll need to scroll down to see it).
11.7. Which built-in Spark SQL commands are useful for file
ingestion (for the select clause)?
Our SELECT clause leverages two built-in Spark SQL commands useful for file ingestion:
current_timestamp() records the timestamp when the logic is executed, and input_file_name() records the
source data file for each record.
The metadata fields added to the table provide useful information to understand when records were inserted
and from where. This can be especially helpful if troubleshooting problems in the source data becomes
necessary. All of the comments and properties for a given table can be reviewed using DESCRIBE
TABLE EXTENDED .
The benefits observed in Hive or HDFS do not translate to Delta Lake, and you should consult with an
experienced Delta Lake architect before partitioning tables. As a best practice, you should default to non-
partitioned tables for most use cases when working with Delta Lake.
11.10. What are the two options to copy Delta Lake tables and
what are the use cases?
DEEP CLONE fully copies data and metadata from a source table to a target. This copy occurs incrementally,
so executing this command again can sync changes from the source to the target location. Because all the
data files must be copied over, this can take quite a while for large datasets.
If you wish to create a copy of a table quickly to test out applying changes without the risk of modifying the
current table, SHALLOW CLONE can be a good option. SHALLOW CLONE just copies the Delta
transaction logs, meaning that the data doesn't move.
In either case, data modifications applied to the cloned version of the table will be tracked and stored
separately from the source. Cloning is a great way to set up tables for testing SQL code while still in
development.
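A sketch of both options (source and target table names are illustrative):
%python
# Full, incremental copy of data and metadata
spark.sql("CREATE OR REPLACE TABLE purchases_deep_clone DEEP CLONE purchases")

# Copies only the Delta transaction logs; no data files are moved
spark.sql("CREATE OR REPLACE TABLE purchases_shallow_clone SHALLOW CLONE purchases")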
1. CREATE OR REPLACE TABLE (CRAS) statements fully replace the contents of a table each time they execute.
Overwriting a table this way has several benefits over deleting and recreating it:
Overwriting a table is much faster because it doesn't need to list the directory recursively or delete any files;
The old version of the table still exists and can be easily retrieved using Time Travel;
It's an atomic operation, and concurrent queries can still read the table while you are overwriting it;
Due to ACID transaction guarantees, if overwriting the table fails, the table will be in its previous state.
2. INSERT OVERWRITE provides a nearly identical outcome as above. This cell overwrites data in the sales table
using the results of an input query executed directly on parquet files in the sales-historical dataset (data
in the target table will be replaced by data from the query).
A primary difference has to do with how Delta Lake enforces schema on write. Whereas a CRAS statement
will allow us to completely redefine the contents of our target table, INSERT OVERWRITE will fail if we try to
change our schema (unless we provide optional settings).
The table history also records these two operations ( CRAS statements and INSERT OVERWRITE ) differently.
Append new sale records to the sales table using INSERT INTO :
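A sketch of such an append (the parquet source path is a placeholder):
%python
spark.sql("""
    INSERT INTO sales
    SELECT * FROM parquet.`/path/to/new-sales-parquet`
""")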
INSERT INTO does not have any built-in guarantees to prevent inserting the same records multiple times.
Re-executing the above cell would write the same records to the target table, resulting in duplicate records.
12.5. What is the syntax for the the MERGE SQL operation and the
benefits of using merge?
You can upsert data from a source table, view, or DataFrame into a target Delta table using the MERGE SQL
operation. Delta Lake supports inserts, updates and deletes in MERGE , and supports extended syntax beyond
the SQL standards to facilitate advanced use cases.
The main benefits of MERGE are: 1. updates, inserts, and deletes are completed as a single transaction; 2.
multiple conditions can be added in addition to matching fields; 3. it provides extensive options for
implementing custom logic
Below, we'll only update records if the current row has a NULL email and the new row does not. All
unmatched records from the new batch will be inserted:
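A sketch of that merge, following the description in the next paragraph (the exact list of updated columns is an assumption):
%python
spark.sql("""
    MERGE INTO users a
    USING users_update b
    ON a.user_id = b.user_id
    WHEN MATCHED AND a.email IS NULL AND b.email IS NOT NULL THEN
      UPDATE SET email = b.email, updated = b.updated
    WHEN NOT MATCHED THEN INSERT *
""")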
We're merging records from the users update view into the users table, matching records by user id. For each
new record in the users update view, we check for rows in the users table with the same user id. If there's a
match, and the email field in the users update view is not null, then we'll update the row in the user's dataset
using the row in users update. If a new record in users update does not have the same user id as any of the
existing records in the users table, this record will be inserted as a new row in the user's table.
This optimized command uses the same MERGE syntax but only provides a WHEN NOT MATCHED clause. Below,
we use this to confirm that records with the same user_id and event_timestamp aren't already in the
events table. This prevents adding records that already exist in the events table.
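A sketch of such an insert-only merge (the name of the source view, events_update, is illustrative):
%python
spark.sql("""
    MERGE INTO events a
    USING events_update b
    ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
    WHEN NOT MATCHED THEN INSERT *
""")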
Note that this operation does have some expectations: the data schema should be consistent, and duplicate
records should either be excluded or handled downstream.
While we're showing simple execution on a static directory below, the real value is in multiple executions
over time picking up new files in the source automatically.
This incrementally loads from the directory of the sales-30-m dataset into the sales table specifying
parquet as the file format. This can be part of our ingestion capabilities, or to get your Delta Tables out in
another format for someone else.
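A sketch of that COPY INTO statement (the source path is a placeholder):
%python
spark.sql("""
    COPY INTO sales
    FROM '/path/to/sales-30m'
    FILEFORMAT = PARQUET
""")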
Recent feature: a validate keyword now allows you to check that the data format of your source data is
still in line with the data in your target table before you incrementally load files.
NULL is the absence of value, or the lack of value, therefore it is not something we can count.
13.2. What is the syntax to count null values?
You can count NULL values in a column either with count_if(col IS NULL) or with count(*) combined with a
WHERE col IS NULL filter.
Example:
SELECT
count(user_id) AS total_ids,
count(DISTINCT user_id) AS unique_ids,
count(email) AS total_emails,
count(DISTINCT email) AS unique_emails,
count(updated) AS total_updates,
count(DISTINCT(updated)) AS unique_updates,
count(*) AS total_rows,
count(DISTINCT(*)) AS unique_non_null_rows
FROM users_dirty
13.3. What is the syntax to count for distinct values in a table for a
specific column?
SELECT COUNT(DISTINCT(col_1, col_2)) FROM table_name WHERE col_1 IS NOT NULL
Example:
SELECT *,
date_format(first_touch, "MMM d, yyyy") AS first_touch_date,
date_format(first_touch, "HH:mm:ss") AS first_touch_time,
regexp_extract(email, "(?<=@).+", 0) AS email_domain
FROM (
SELECT *,
CAST(user_first_touch_timestamp / 1e6 AS timestamp) AS first_touch
FROM deduped_users
)
14.3. What are struct types? What is the syntax to parse JSON
objects into struct types with Spark SQL?
Spark SQL also has the ability to parse JSON objects into struct types (a native Spark type with nested
attributes) by using a from_json function. However, this from_json function requires a schema. To
derive the schema of the data, you can take a row example with no null fields, and use Spark SQL's
schema_of_json function.
In the example below, we copy and paste an example JSON row to the function and chain it into the
from_json function to cast our value field to a struct type.
Syntax:
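A sketch of the pattern; the JSON literal stands in for a representative row with no null fields, and events_strings stands in for the view holding the raw JSON strings:
%python
spark.sql("""
    CREATE OR REPLACE TEMP VIEW parsed_events AS
    SELECT from_json(value, schema_of_json('{"user_id": "UA000000107384208", "device": "macOS"}')) AS json
    FROM events_strings
""")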
This is now a struct field. We have a temporary view parsed_events with a column we named json . The
values in each record were correctly parsed and stored in a struct with the nested values.
14.4. Once a JSON string is unpacked to a struct type, what is the
syntax to flatten the fields into columns? What is the syntax to
interact with the subfields in a struct type?
Once a JSON string is unpacked to a struct type, Spark supports * (star) unpacking to flatten fields into
columns.
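For example, using the parsed_events view with its json struct column from the previous question:
%python
display(spark.sql("SELECT json.* FROM parsed_events"))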
DESCRIBE events
We can interact with the subfields in this field using standard . syntax similar to how we might traverse
nested data in JSON.
Let's select a subfield purchase_revenue_in_usd of the ecommerce column. This returns a new column
with the values for the subfield extracted from the ecommerce column.
SELECT ecommerce.purchase_revenue_in_usd
FROM events
WHERE ecommerce.purchase_revenue_in_usd IS NOT NULL
We combine these queries to create a simple table that shows the unique collection of actions and the items in
a user's cart.
SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id
14.8. What is the syntax for an INNER JOIN ?
By default, the join type is INNER . That means the results will contain the intersection of the two sets, and
any rows that are not in both sets will not appear.
The SQL JOIN clause is used to combine records from two or more tables in a database. A JOIN is a means
for combining fields from two tables by using values common to each.
Here we chain a join with a lookup table to an explode operation to grab the standard printed item name.
Conceptually, a full outer join combines the effect of applying both left and right outer joins. Where rows in the
FULL OUTER JOINed tables do not match, the result set will have NULL values for every column of the table
that lacks a matching row. For those rows that do match, a single row will be produced in the result set
(containing columns populated from both tables).
For example, this allows us to see each employee who is in a department and each department that has an
employee, but also see each employee who is not part of a department and each department which doesn't
have an employee.
Example of a full outer join (the OUTER keyword is optional):
SELECT *
FROM employee FULL OUTER JOIN department
ON employee.DepartmentID = department.DepartmentID;
SELECT *
FROM employee
LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;
A right outer join (or right join) closely resembles a left outer join, except with the treatment of the tables
reversed. Every row from the "right" table (B) will appear in the joined table at least once. If no matching row
from the "left" table (A) exists, NULL will appear in columns from A for those rows that have no match in B.
A right outer join returns all the values from the right table and matched values from the left table (NULL in the
case of no matching join predicate). For example, this allows us to find each employee and his or her
department, but still show departments that have no employees.
SELECT *
FROM employee RIGHT OUTER JOIN department
ON employee.DepartmentID = department.DepartmentID;
Below is an example of a left anti-join. It is exactly the same as a left join except for the WHERE clause, which
is what differentiates it from a typical left join.
The query below finds all customers that did not have a matching cse_id in the
customer_success_engineer table. The WHERE b.cse_id IS NULL filter keeps only the rows of the left
table that had no matching record (a null value) in the table on the right.
SELECT
*
FROM customers a
LEFT JOIN customer_success_engineer b
ON a.assigned_cse_id = b.cse_id
WHERE TRUE
AND b.cse_id IS NULL
Suppose we are having tea and we want to have a list of all combinations of available tea and cake.
tea: Green tea, Peppermint tea, English Breakfast
cake: Carrot cake, Brownie, Tarte tatin
A CROSS JOIN will create all paired combinations of the rows of the tables that will be joined; here, that is all
nine tea-and-cake pairs.
SELECT ColumnName_1,
ColumnName_2,
ColumnName_N
FROM [Table_1]
CROSS JOIN [Table_2]
Below is an alternative syntax for cross-join that does not include the CROSS JOIN keyword; we will place
the tables that will be joined after the FROM clause and separated with a comma.
SELECT ColumnName_1,
ColumnName_2,
ColumnName_N
FROM [Table_1],[Table_2]
A semi join returns rows that match an EXISTS subquery without duplicating rows from the left side of the
predicate when multiple rows on the right side satisfy the criteria of the subquery.
SELECT columns
FROM table_1
WHERE EXISTS (
SELECT values
FROM table_2
WHERE table_2.column = table_1.column);
Example:
employee table (id, name, age):
1 Inès 28
2 Ghassan 26
3 Camille 25
4 Cécile 35
client table (id, name, age):
10 Leïla 20
11 Claire 28
12 Thomas 25
13 Rébecca 30
Output:
1 Inès 28
After the join, the selected fields of the rows of the employee table satisfying the equality condition are
displayed as a result; the equality condition holds only for the rows in the client table that have a
client_age of 28.
14.14. What is the syntax for the Spark SQL UNION , MINUS ,
and INTERSECT set operators?
UNION returns the combined result of two queries. A UNION of the events table with new_events_final
returns the same results as if we had inserted new_events_final into the events table.
The SQL UNION operator is different from join as it combines the result of two or more SELECT statements.
Each SELECT statement within the UNION must have the same number of columns. The columns must also
have similar data types. Also, the columns in each SELECT statement must be in the same order.
MINUS returns all the rows found in one dataset but not the other; INTERSECT returns only the rows found in both datasets.
SELECT * FROM () : The SELECT statement inside the parentheses is the input for this table.
PIVOT : The first argument in the clause is an aggregate function and the column to be aggregated. Then, we
specify the pivot column in the FOR subclause. Finally, the IN operator contains the pivot column values.
Here we use PIVOT to create a new transactions table that flattens out the information contained in the
sales table. This flattened data format can be useful for dashboarding, but also useful for applying machine
learning algorithms for inference or prediction.
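A generic sketch of the pieces described above (the source view, its columns, and the pivot values are all illustrative):
%python
spark.sql("""
    CREATE OR REPLACE TABLE transactions AS
    SELECT * FROM (
      SELECT customer_id, item_id, quantity
      FROM sales_items)
    PIVOT (
      sum(quantity) FOR item_id IN ('ITEM_A', 'ITEM_B', 'ITEM_C'))
""")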
The REDUCE (aggregate) higher-order function reduces the elements of an array to a
single value by merging the elements into a buffer, and then applies a finishing function on the final buffer.
You may write a filter that produces a lot of empty arrays in the created column. When that happens, it can be
useful to use a WHERE clause to show only non-empty array values in the returned column.
In this example, we accomplish that by using a subquery, which is useful for performing an operation in
multiple steps. In this case, we use it to create the named column that we will use with the WHERE clause.
-- keep only orders that contain king-size items (the exact filter condition is illustrative)
CREATE OR REPLACE TEMP VIEW king_size_sales AS
SELECT order_id, king_items
FROM (
  SELECT order_id, FILTER (items, i -> i.item_id LIKE "%K") AS king_items
  FROM sales)
WHERE size(king_items) > 0;
In the TRANSFORM statement below, for each value in the input array, we extract the item's revenue value, multiply it by
100, and cast the result to integer. Note that we're using the same kind of references as in the previous
command, but we name the iterator with a new variable, k .
-- get total revenue from king items per order
CREATE OR REPLACE TEMP VIEW king_item_revenues AS
SELECT
order_id,
king_items,
TRANSFORM (
king_items,
k -> CAST(k.item_revenue_in_usd * 100 AS INT)
) AS item_revenues
FROM king_size_sales;
Another example:
We will use the reduce function to find an average value, by day, for CO2 readings. Take a closer look at the
individual pieces of the REDUCE function by reviewing the list below.
REDUCE(co2_level, 0, (c, acc) -> c + acc, acc ->(acc div size(co2_level)))
co2_level is the array we operate on; 0 is the starting value for the buffer to which each
value from the array is added; we start at zero in this case to get an accurate sum of the values in the list.
(c, acc) is the list of arguments we'll use for this function. It may be helpful to think of acc as the buffer
value and c as the value that gets added to the buffer.
c + acc is the buffer function. As the function iterates over the list, it holds the total ( acc ) and adds the
next value in the list ( c ).
acc div size(co2_level) is the finishing function. Once we have the sum of all numbers in the array,
we divide by the number of elements to find the average.
15.1. What is the syntax to define and register SQL UDFs? How do
you then apply that function to the data?
A SQL UDF registers a named piece of SQL logic that can be applied to column values and returns a result.
Let's apply a function to a temp view called foods that has a column called food and values corresponding to
various types of food:
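A minimal sketch of defining and applying such a function; the yelling function is the one referenced in the next question, and the foods view is assumed to already exist:
%python
spark.sql("""
    CREATE OR REPLACE FUNCTION yelling(text STRING)
    RETURNS STRING
    RETURN concat(upper(text), "!!!")
""")

display(spark.sql("SELECT yelling(food) FROM foods"))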
15.2. How can you see where the function was registered and
basic information about expected inputs and what is returned?
DESCRIBE FUNCTION EXTENDED yelling
SQL UDFs will persist between execution environments (which can include notebooks, DBSQL queries, and
jobs).
16.1. What is the syntax to turn SQL queries into Python strings?
print("""
SELECT *
FROM table_name
""")
16.2. What is the syntax to execute SQL from a Python cell?
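In a Python cell, you pass the query string to spark.sql(), which returns a DataFrame; wrapping the result in display() renders it as a table. A minimal example (the table name is a placeholder):
query_result = spark.sql("SELECT * FROM table_name")
display(query_result)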
def return_new_string(string_arg):
    return "The string passed to this function was " + string_arg
table_name = "users"
filter_clause = "WHERE state = 'CA'"
query = f"""
SELECT *
FROM {table_name}
{filter_clause}
"""
print(query)
16.7. What is the syntax for if / else clauses wrapped in a
function?
def foods_i_like(food):
    if food == "beans":
        print(f"I love {food}")
    elif food == "potatoes":
        print(f"My favorite vegetable is {food}")
    elif food != "beef":
        print(f"Do you have any good recipes for {food}?")
    else:
        print(f"I don't eat {food}")
16.8. What are the two methods for casting values to numeric
types (int and float)?
The two methods to cast values to numeric types are int() and float() , e.g. int("2") .
print(result)
As implemented, this logic is only useful for interactive execution. The message isn't
currently being logged anywhere, and the code will not return the data in the desired format; human
intervention would be required to act upon the printed message.
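The simple_query_function used below is not reproduced in these notes; a minimal sketch of what it might look like, with preview controlling whether results are displayed:
def simple_query_function(query, preview=True):
    query_result = spark.sql(query)
    if preview:
        display(query_result)
    return query_result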
Below, we execute a different query and set preview to False , as the purpose of the query is to create a
temp view rather than return a preview of data.
new_query = "CREATE OR REPLACE TEMP VIEW id_name_tmp_vw AS SELECT id, name FROM
demo_tmp_vw"
simple_query_function(new_query, preview=False)
Suppose we want to protect our company from malicious SQL, like the query below.
We can use the find() method to test for multiple SQL statements by looking for a semicolon. If it's not
found, it will return -1 . With that knowledge, we can define a simple search for a semicolon in the query
string and raise a custom error message if it was found (not -1 ).
def injection_check(query):
    semicolon_index = query.find(";")
    if semicolon_index >= 0:
        raise ValueError(f"Query contains semi-colon at index {semicolon_index}\nBlocking execution to avoid SQL injection attack")
Always be wary of allowing untrusted users to pass text that will be passed to SQL queries. Note that only
one query can be executed using spark.sql() , so text with a semi-colon will always throw an error. If we
add this method to our earlier query function, we now have a more robust function that will assess each
query for potential threats before execution. We will see normal performance with a safe query, and prevent
execution when bad logic is run.
data_source: The directory of the source data. Auto Loader will detect new files as they arrive in this
location and queue them for ingestion; passed to the .load() method.
source_format: The format of the source data. While the format for all Auto Loader queries will be
cloudFiles, the format of the source data should always be specified for the cloudFiles.format option.
table_name: The name of the target table. Spark Structured Streaming supports writing directly to Delta
Lake tables by passing a table name as a string to the .table() method. Note that you can either append
to an existing table or create a new table.
checkpoint_directory: The location for storing metadata about the stream. This argument is passed to the
checkpointLocation and cloudFiles.schemaLocation options. Checkpoints keep track of streaming
progress, while the schema location tracks updates to the fields in the source dataset.
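The ingestion function these parameters belong to is not reproduced in these notes; a minimal sketch built from the standard Auto Loader pattern (the function name autoload_to_table is illustrative):
def autoload_to_table(data_source, source_format, table_name, checkpoint_directory):
    query = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", source_format)
        .option("cloudFiles.schemaLocation", checkpoint_directory)
        .load(data_source)
        .writeStream
        .option("checkpointLocation", checkpoint_directory)
        .option("mergeSchema", "true")
        .table(table_name))
    return query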
Because Auto Loader uses Spark Structured Streaming to load data incrementally, the code above
doesn't appear to finish executing. We can think of this as a continuously active query. This means that as
soon as new data arrives in our data source, it will be processed through our logic and loaded into
our target table. The great thing about Auto Loader is that when new data comes in, it will pick it up
and process it automatically. You don't need to have a tool like Airflow checking in every hour or so.
17.7. What can you do once data has been ingested to Delta Lake
with Auto Loader?
Once data has been ingested to Delta Lake with Auto Loader, users can interact with it the same way
they would any table.
%sql
SELECT * FROM target_table
Rather than failing the job or dropping records, Auto Loader automatically quarantines malformed data in a
separate column (the _rescued_data column), which allows you to do a programmatic or manual review of that
data to see if you can fix those records and insert them back into your base dataset.
17.9. What is the data type encoded by Auto Loader for fields in a
text-based file format?
Because JSON is a text-based format, Auto Loader will encode all fields as STRING type. This is the safest
and most permissive type, ensuring that the least amount of data is dropped or ignored at ingestion due to
type mismatch.
An Auto Loader query automatically detects and processes records from the source directory into the target
table.
%sql
DESCRIBE HISTORY target_table
Each streaming update corresponds to a new batch of files being added to that source directory and
ingested. We can see the number of rows being ingested with each batch.
From a lakehouse perspective, it makes data ingestion very easy. We don't have to use Airflow to
orchestrate or use any additional code to process what has or has not already been processed. It's all
handled automatically by Auto Loader.
18. Reasoning about Incremental Data with
Spark Structured Streaming
The magic behind Spark Structured Streaming is that it allows users to interact with ever-growing data
sources as if they were just a static table of records, by treating infinite data as a table. New data in the
data stream translates into new rows appended to an unbounded table. Structured Streaming lets us
define a query against the data source and automatically detect new records and propagate them through
previously defined logic. Spark Structured Streaming is optimised on Databricks to integrate closely with
Delta Lake and Auto Loader.
Example situations where you have an ever-growing dataset include clickstream events, IoT sensor readings, application logs, and transaction records.
New rows are appended to the input table for each trigger interval (i.e., how frequently you're looking for
new data, data up to trigger 1 (t=1), t=2 or t=3). These new rows are essentially analogous to micro-batch
transactions and will be automatically propagated through the results table to the sink.
18.4. Explain how Structured Streaming ensures end-to-end
exactly-once fault-tolerance.
Structured Streaming ensures end-to-end (from source, execution engine, sink), exactly-once (every
record will appear just once, there won't be duplicates and it will always arrive to your sink) semantics
under any failure condition (fault tolerance).
Structured Streaming sources, sinks, and the underlying execution engine work together to track the
progress of stream processing. If a failure occurs, the streaming engine attempts to restart and/or
reprocess the data.
The two conditions for the underlying streaming mechanism to work are:
Replayable approach: Structured Streaming uses checkpointing and write ahead logs to record
the offset range of data being processed during each trigger interval (a unique ID allows you to pick
up where you left off). This means that in order for this to work, we need to define a checkpoint. A
checkpoint provides Spark Structured Streaming with a location to store the progress of previous
runs of your stream. This approach only works if the streaming source is replayable; replayable sources
include cloud-based object storage and pub/sub messaging services.
Idempotent sinks: The streaming sinks are designed to be idempotent - that is, multiple writes of the
same data (as identified by the offset) do not result in duplicates being written to the sink.
(spark.readStream
.table("bronze")
.createOrReplaceTempView("streaming_tmp_vw"))
When we execute a query on a streaming temporary view, the results of the query will continuously be
updated as new data arrives in the source. Think of a query executed against a streaming temp view as
an always-on incremental query. It's important to shut down those streams before moving on. A
continuously running stream will keep an interactive cluster alive.
%sql
SELECT * FROM streaming_tmp_vw
18.6. How can you transform streaming data?
We can execute most transformations against streaming temp views the same way we would with static
data. Because we are querying a streaming temp view, this becomes a streaming query that executes
indefinitely, rather than completing after retrieving a single set of results.
For streaming queries like this, Databricks Notebooks include interactive dashboards that allow users to
monitor streaming performance. Note that none of these records are being persisted anywhere at this
point. This is just in memory.
%sql
SELECT device_id, count(device_id) AS total_recordings
FROM streaming_tmp_vw
GROUP BY device_id
Defining a temp view from a streaming read, and then defining another temp view against that
streaming temp view to apply your logic, is a pattern you can use to leverage SQL for incremental
data processing with Databricks.
You'll need to use the PySpark Structured Streaming APIs for the data stream writer in the next step, but
the logic in between can be completed with SQL, meaning that it's easy to take logic that has been written
by SQL-only analysts or engineers and inject that into your streaming or incremental workloads without
needing to do a full refactor of that code base.
%sql
CREATE OR REPLACE TEMP VIEW device_counts_tmp_vw AS (
SELECT device_id, COUNT(device_id) AS total_recordings
FROM streaming_tmp_vw
GROUP BY device_id
)
Databricks creates checkpoints by storing the current state of your streaming job to cloud storage (a
checkpoint location is simply a place in cloud object storage where that stream's progress is stored).
Checkpointing combines with write-ahead logs to allow a terminated stream to be restarted and
continue from where it left off. Spark takes care of the bookkeeping for us (which files are new, what
has changed since the last time we ran our job, etc.).
Checkpoints cannot be shared between separate streams. A checkpoint is required for every
streaming write to ensure these processing guarantees, so each stream that we write needs its own
unique checkpoint tied to that stream.
Output Modes, similar to static/batch workloads.
.outputMode("append") : This is the default. Only newly appended rows are incrementally
appended to the target table with each batch
.outputMode("complete") : The Results Table is recalculated each time a write is triggered; the
target table is overwritten with each batch
Trigger Intervals, specifying when the system should process the next set of data.
(spark.table("device_counts_tmp_vw")
.writeStream
.option("checkpointLocation", f"{DA.paths.checkpoints}/silver")
.outputMode("complete")
.trigger(availableNow=True)
.table("device_counts")
.awaitTermination() # This optional method blocks execution of the next cell until
the incremental batch write has succeeded
)
When we execute this, we don't have it continuously running; it executes as if it were a batch operation.
We can change the trigger method to turn this query from a triggered incremental batch into an always-on
query triggered every 4 seconds. We can reuse the same checkpoint to make it an always-on query;
the logic will start from the point where the previous query left off.
query = (spark.table("device_counts_tmp_vw")
.writeStream
.option("checkpointLocation", f"{DA.paths.checkpoints}/silver")
.outputMode("complete")
.trigger(processingTime='4 seconds')
.table("device_counts"))
When we query the device_counts table that we wrote out to, we treat that as a static table. It is
being updated by an incremental or streaming query but the table itself will give us static results.
Because we are now querying a table (not a streaming DataFrame), the following will not be a
streaming query.
%sql
SELECT *
FROM device_counts
Gold: highly refined and aggregated data; data that has been transformed into knowledge. Updates to
these tables will be completed as part of regularly scheduled production workloads, which helps control
costs and allows SLAs for data freshness to be established. Gold tables provide business-level
aggregates often used for reporting and dashboarding. This would include aggregations such as daily
active website users, weekly sales per store, or gross revenue per quarter by department. The end
outputs are actionable insights, dashboards, and reports of business metrics. Gold tables will often be
stored in a separate storage container to help avoid cloud limits on data requests. In general, because
aggregations, joins, and filtering are handled before data is written to the gold layer, query
performance on data in the gold tables should be exceptional.
Because all data and metadata lives in object storage in the cloud, multiple users and applications can
access data in near-real time, allowing analysts to access the freshest data as it's being processed.
Each stage can be configured as a batch or streaming job, and ACID transactions ensure that we succeed or
fail completely.
19.4. Describe how you can configure a read on a raw JSON source
using Auto Loader with schema inference. What is the
cloudFiles.schemaHints option?
We configure a read on a raw JSON source using Auto Loader with schema inference. For a JSON data
source, Auto Loader will default to inferring each column as a string. You can specify the data type for a
column using the cloudFiles.schemaHints option. Specifying improper types for a field will result in null
values.
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaHints", "time DOUBLE")
.option("cloudFiles.schemaLocation", f"{DA.paths.checkpoints}/bronze")
.load(DA.paths.data_landing_location)
.createOrReplaceTempView("recordings_raw_temp"))
We can enrich our raw data with additional metadata describing the source file and the time it was
ingested. This additional metadata can be ignored during downstream processing while providing
useful information for troubleshooting errors if corrupt data is encountered.
%sql
CREATE OR REPLACE TEMPORARY VIEW recordings_bronze_temp AS (
SELECT *, current_timestamp() receipt_time, input_file_name() source_file
FROM recordings_raw_temp
)
The code below passes our enriched raw data back to PySpark API to process an incremental write to a
Delta Lake table. When new data arrives, the changes are immediately detected by this streaming query.
(spark.table("recordings_bronze_temp")
.writeStream
.format("delta")
.option("checkpointLocation", f"{DA.paths.checkpoints}/bronze")
.outputMode("append")
.table("bronze"))
We are then loading a static CSV file to add patient data to our recordings. In production, we could use
Databricks' Auto Loader feature to keep an up-to-date view of this data in our Delta Lake.
(spark.read
.format("csv")
.schema("mrn STRING, name STRING")
.option("header", True)
.load(f"{DA.paths.data_source}/patient/patient_info.csv")
.createOrReplaceTempView("pii"))
%sql
SELECT * FROM pii
19.5. What happens with the ACID guarantees that Delta Lake
brings to your data when you choose to merge this data with
other data sources?
The ACID guarantees that Delta Lake brings to your data are managed at the table level, ensuring that only
fully successful commits are reflected in your tables. If you choose to merge this data with other data
sources, be aware of how those sources version data and what sort of consistency guarantees they have.
(spark.readStream
.table("bronze")
.createOrReplaceTempView("bronze_tmp"))
%sql
CREATE OR REPLACE TEMPORARY VIEW recordings_w_pii AS (
SELECT device_id, a.mrn, b.name,
  cast(from_unixtime(time, 'yyyy-MM-dd HH:mm:ss') AS timestamp) time,
  heartrate
FROM bronze_tmp a
INNER JOIN pii b
ON a.mrn = b.mrn
WHERE heartrate > 0)
(spark.table("recordings_w_pii")
.writeStream
.format("delta")
.option("checkpointLocation", f"{DA.paths.checkpoints}/recordings_enriched")
.outputMode("append")
.table("recordings_enriched"))
%sql
SELECT COUNT(*) FROM recordings_enriched
(spark.readStream
.table("recordings_enriched")
.createOrReplaceTempView("recordings_enriched_temp"))
%sql
CREATE OR REPLACE TEMP VIEW patient_avg AS (
SELECT mrn, name, mean(heartrate) avg_heartrate, date_trunc("DD", time) date
FROM recordings_enriched_temp
GROUP BY mrn, name, date_trunc("DD", time))
Use cases for complete mode: when you're writing from your silver layer to your gold layer and want to
aggregate over all the data that's available, or when building a dashboard where you're interested in
aggregations over a period of time; it's like a point-in-time snapshot. (Append mode won't let you do
aggregations, because of this concept of infinite data, e.g. what's the average of infinity?)
There is also update mode. Update is similar to a merge in Delta Lake, but not quite as powerful. If you
combine Delta Lake merges with Structured Streaming, you have to set the streaming output mode to
update and then perform your merge; that's how it knows not to rewrite the whole dataset, but rather to
update specific records.
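The write that registers the daily_patient_avg gold table is not reproduced in these notes. A plausible sketch, following the same pattern as the earlier writes (complete output mode for the aggregate, an availableNow trigger for a triggered batch, and an illustrative checkpoint path under DA.paths.checkpoints):
(spark.table("patient_avg")
.writeStream
.format("delta")
.option("checkpointLocation", f"{DA.paths.checkpoints}/daily_avg")
.outputMode("complete")          # recompute the aggregate with each batch
.trigger(availableNow=True)      # run as a triggered incremental batch
.table("daily_patient_avg")
.awaitTermination())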
The gold Delta table we have just registered will perform a static read of the current state of the data
each time we run the following query.
%sql
SELECT * FROM daily_patient_avg
The above table includes all days for all users. If the predicates for our ad hoc queries match the data
encoded here, we can push down our predicates to files at the source and very quickly generate more
limited aggregate views.
%sql
SELECT *
FROM daily_patient_avg
WHERE date BETWEEN "2020-01-17" AND "2020-01-31"
19.10. Describe the two options to incrementally process data,
either with a triggered option or a continuous option.
When landing additional files in our source directory, we'll be able to see them processed through the first 3
tables in our Delta Lake, but we will need to re-run our final query to update our daily_patient_avg table,
since that query uses the availableNow (trigger-once style) syntax. The availableNow logic defined against
the silver table is only executed as a batch when we choose to run it, meaning that it needs to be
manually triggered.
We have the ability to incrementally process data either with a triggered option where we're doing a batch
incremental operation, or a continuous option where we have an always on incremental stream.
20.1. Describe how Delta Live Tables makes the ETL lifecycle
easier.
Delta Live Tables (DLT) makes it easy to build and manage reliable data pipelines that deliver high-
quality data on Delta Lake. DLT helps data engineering teams simplify ETL development and
management with declarative pipeline development, automatic data testing, and deep visibility for
monitoring and recovery.
By just adding LIVE to your SQL queries, DLT will begin to automatically take care of all of your
operational, governance and quality challenges. With the ability to mix Python with SQL, users get
powerful extensions to SQL to implement advanced transformations and embed AI models as part of the
pipelines.
DLT provides deep visibility into pipeline operations with detailed logging and tools to visually track
operational stats and quality metrics. With this capability, data teams can understand the performance
and status of each table in the pipeline. Data engineers can see which pipelines have run successfully or
failed, and can reduce downtime with automatic error handling and easy refresh.
DLT takes the queries that you write to transform your data and instead of just executing them against a
database, DLT deeply understands those queries and analyzes them to understand the data flow between
them. Once DLT understands the data flow, lineage information is captured and can be used to keep data
fresh and pipelines operating smoothly.
Because DLT understands the data flow and lineage, and because this lineage is expressed in an
environment-independent way, different copies of data (i.e. development, production, staging) are
isolated and can be updated using a single code base. The same set of query definitions can be run on
any of those datasets.
The ability to track data lineage is hugely beneficial for improving change management and reducing
development errors, but most importantly, it provides users the visibility into the sources used for
analytics – increasing trust and confidence in the insights derived from the data.
20.2. Beyond transformations, how can you define your data in
your code?
Your data should be a single source of truth for what is going on inside your business. Beyond just the
transformations, there are three things that should be included in the code that defines your data (a
combined sketch follows the list):
Quality Expectations: With declarative quality expectations, DLT allows users to specify what makes
bad data bad and how bad data should be addressed with tunable severity.
Documentation with Transformation: DLT enables users to document where the data comes from,
what it’s used for and how it was transformed. This documentation is stored along with the
transformations, guaranteeing that this information is always fresh and up to date.
Table Attributes: attributes of a table (e.g. "contains PII"), along with quality and operational information
about table execution, are automatically captured in the Event Log. This information can be used to
understand how data flows through an organization and to meet regulatory requirements.
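As a sketch of how all three can live alongside the transformation itself in DLT SQL (the table names, expectation, comment, and property below are illustrative, not taken from the course materials):
CREATE OR REFRESH LIVE TABLE customers_clean (
  -- quality expectation: drop rows without a customer id
  CONSTRAINT valid_id EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Customer records cleaned from the raw feed"   -- documentation stored with the transformation
TBLPROPERTIES ("contains_pii" = "true")                -- table attribute captured in the Event Log
AS SELECT * FROM LIVE.customers_raw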
20.3. Describe why large scale ETL is complex when not using DLT.
With declarative pipeline development, improved data reliability and cloud-scale production operations,
DLT makes the ETL lifecycle easier and enables data teams to build and leverage their own data
pipelines to get to insights faster, ultimately reducing the load on data engineers.
Large scale ETL is complex when not using DLT:
Complex pipeline development: hard to build and maintain table dependencies; difficult to
switch between batch and stream processing
Data quality and governance: difficult to monitor and enforce data quality; impossible to trace
data lineage
Difficult pipeline operations: poor observability at granular, data level; error handling and
recovery is laborious
20.4. How do you create and run a DLT pipeline in the DLT UI?
To create and configure a pipeline, click the Jobs button on the sidebar and select the Delta Live
Tables tab.
Triggered pipelines run once and then shut down until the next manual or scheduled update (this
corresponds to trigger-once / availableNow behaviour). Continuous pipelines run continuously, ingesting
new data as it arrives. Choose the mode based on latency and cost requirements.
If you specify a value for Target , tables are published to the specified database. Without a Target
specification, we would need to query the table based on its underlying location in DBFS (relative to the
Storage Location).
Enable autoscaling, Min Workers, and Max Workers control the worker configuration for the
underlying cluster processing the pipeline. Notice the DBU estimate provided, similar to the one provided
when configuring interactive clusters.
With a pipeline created, you will now run the pipeline.
You can run the pipeline in development mode. Development mode accelerates the development
lifecycle by reusing the cluster (as opposed to creating a new cluster for each run) and disabling retries so
that you can readily identify and fix errors. Refer to the documentation for more information on this feature.
21.1. What is the syntax to do streaming with SQL for Delta Live
tables? What's the keyword that shows you're using Delta Live
Tables?
You can use SQL to declare Delta Live Tables implementing a simple multi-hop architecture. At its
simplest, you can think of DLT SQL as a slight modification to traditional CTAS statements. DLT tables
and views will always be preceded by the LIVE keyword.
For each query, the LIVE keyword automatically captures the dependencies between the datasets defined in
the pipeline and uses this information to determine the execution order. A pipeline is a graph that links
together the datasets that have been defined in SQL or Python.
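A minimal illustration of that "CTAS plus LIVE" shape (the table and source names are placeholders):
-- A DLT table is declared rather than executed directly; the pipeline resolves
-- the LIVE.-prefixed reference and derives the execution order from it.
CREATE OR REFRESH LIVE TABLE orders_cleaned
AS SELECT order_id, customer_id, order_date
FROM LIVE.orders_raw
WHERE order_id IS NOT NULL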
21.2. What is the syntax for declaring a bronze layer table using
Auto Loader and DLT?
sales_orders_raw ingests JSON data incrementally from the example dataset found in /databricks-
datasets/retail-org/sales_orders/ .
Incremental processing via Auto Loader (which uses the same processing model as Structured
Streaming) requires the addition of the STREAMING keyword in the declaration. The cloud_files()
method enables Auto Loader to be used natively with SQL. This method takes the following positional
parameters: the source location, the source data format, and an arbitrarily sized array of optional reader
options. In this case, we set cloudFiles.inferColumnTypes to true. The comment provides
additional metadata that would be visible to anyone exploring the data catalog.
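The declaration itself is not reproduced in these notes; based on the description above, it would look roughly like this:
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "The raw sales orders, ingested from /databricks-datasets/retail-org/sales_orders/."
AS SELECT * FROM cloud_files(
  "/databricks-datasets/retail-org/sales_orders/",    -- source location
  "json",                                             -- source data format
  map("cloudFiles.inferColumnTypes", "true")          -- optional reader options
)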
21.3. What keyword can you use for quality control? How do you
reference DLT Tables/Views and streaming tables?
Now we declare tables implementing the silver layer. At this level we apply operations like data cleansing
and enrichment. Our first silver table enriches the sales transaction data with customer information, in
addition to implementing quality control by rejecting records with a null order number. The CONSTRAINT
keyword introduces quality control. Similar in function to a traditional WHERE clause, CONSTRAINT
integrates with DLT, enabling it to collect metrics on constraint violations. Constraints provide an optional ON
VIOLATION clause specifying an action to take on records that violate the constraint. The three options
currently supported by DLT are: FAIL UPDATE (the pipeline fails when the constraint is violated), DROP ROW
(records that violate the constraint are discarded), or omitting the ON VIOLATION clause entirely (records
violating the constraint are kept, but the violations are reported in metrics). If you have a live table with a
constraint, the DLT UI will show you a pie chart of how many records are in violation and how many aren't.
References to other DLT tables and views will always include the LIVE. prefix. A target database name
will automatically be substituted at runtime, allowing for easy migration of pipelines between
DEV/QA/PROD environments.
References to streaming DLT tables use the STREAM() method, supplying the table name as an argument.
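Putting those pieces together, a sketch of the silver declaration described above (the column names and the customers table are assumptions made for illustration, not the exact course code):
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_cleaned (
  CONSTRAINT valid_order_number EXPECT (order_number IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Sales orders with a valid order_number, enriched with customer information."
AS SELECT f.order_number, f.order_datetime, f.customer_id,
          c.customer_name                    -- enrichment from the customer table
FROM STREAM(LIVE.sales_orders_raw) f         -- streaming reference to a DLT table
JOIN LIVE.customers c                        -- non-streaming reference to a DLT table
  ON f.customer_id = c.customer_id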
To define a schedule for the job, you can set the Schedule Type to Scheduled , specifying the period,
starting time, and time zone. You can optionally select the Show Cron Syntax checkbox.
Note that Databricks enforces a minimum interval of 10 seconds between subsequent runs triggered by the
schedule of a job regardless of the seconds configuration in the cron expression. You can choose a time zone
that observes daylight saving time or UTC.
The job scheduler is not intended for low latency jobs. Due to network or cloud issues, job runs may
occasionally be delayed up to several minutes. In these situations, scheduled jobs will run immediately upon
service availability.
You can change job or task settings before repairing the job run. Unsuccessful tasks are re-run with the current
job and task settings. For example, if you change the path to a notebook or a cluster setting, the task is re-run
with the updated notebook or cluster settings. You can view the history of all task runs on the Task run details
page.
22.5. How can you view Jobs?
You can filter jobs in the Jobs list by using keywords, selecting only the jobs you own, selecting all jobs you
have permissions to access (access to this filter requires that Jobs access control is enabled), or by using
tags. To search for a tag created with only a key, type the key into the search box. To search for a tag created
with a key and value, you can search by the key, the value, or both the key and value. For example, for a tag
with the key department and the value finance , you can search for department or finance to find
matching jobs. To search by both the key and value, enter the key and value separated by a colon; for
example, department:finance .
22.6. How can you view runs for a Job and the details of the runs?
When clicking a job name, the Runs tab appears with a table of active runs and completed runs. To switch to a
matrix view, click Matrix. The matrix view shows a history of runs for the job, including each job task.
The Job Runs row of the matrix displays the total duration of the run and the state of the run. To view
details of the run, including the start time, duration, and status, hover over the bar in the Job Runs row.
Each cell in the Tasks row represents a task and the corresponding status of the task. To view details of each
task, including the start time, duration, cluster, and status, hover over the cell for that task.
The job run and task run bars are color-coded to indicate the status of the run. Successful runs are green,
unsuccessful runs are red, and skipped runs are pink. The height of the individual job run and task run bars
provides a visual indication of the run duration.
Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs,
Databricks recommends that you export results before they expire.
The job run details page contains job output and links to logs, including information about the success or
failure of each task in the job run.
You can also export the logs for your job run. You can set up your job to automatically deliver logs to DBFS or
S3 through the Job API.
Configuring task dependencies creates a Directed Acyclic Graph (DAG) of task execution, a common way of
representing execution order in job schedulers. Databricks runs upstream tasks before running downstream
tasks, running as many of them in parallel as possible.
To configure the cluster where a task runs, click the Cluster drop-down. You can edit a shared job cluster,
but you cannot delete a shared cluster if it is still used by other tasks.
Dependent libraries will be installed on the cluster before the task runs. You must set all task dependencies to
ensure they are installed before the run starts. Follow the recommendations in Library dependencies for
specifying dependencies.
You can pass templated variables into a job task as part of the task’s parameters. These variables are replaced
with the appropriate values when the job task runs. You can use task parameter values to pass the context
about a job run, such as the run ID or the job’s start time.
When a job runs, the task parameter variable surrounded by double curly braces is replaced and appended to
an optional string value included as part of the value. For example, to pass a parameter named MyJobId with
a value of my-job-6 for any run of job ID 6, add the following task parameter:
{
  "MyJobId": "my-job-{{job_id}}"
}
Timeout corresponds to the maximum completion time for a job. If the job does not complete in this time,
Databricks sets its status to “Timed Out”.
Retries is a policy that determines when and how many times failed runs are retried. To set the retries for the
task, click Advanced options and select Edit Retry Policy.
New Job Clusters are dedicated clusters for a job or task run. A shared job cluster is created and
started when the first task using the cluster starts, and terminates after the last task using the cluster
completes.
The cluster is not terminated when idle but terminates only after all tasks using it have completed.
If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created.
A cluster scoped to a single task is created and started when the task starts, and terminates when
the task completes. In production, Databricks recommends using new shared or task scoped
clusters so that each job or task runs in a fully isolated environment.
When you run a task on a new cluster, the task is treated as a data engineering (task) workload,
subject to the task workload pricing. When you run a task on an existing all-purpose cluster, the task
is treated as a data analytics (all-purpose) workload, subject to all-purpose workload pricing. When
selecting your all-purpose cluster, you will get a warning about how this will be billed as all-purpose
compute. Production jobs should always be scheduled against new job clusters appropriately sized
for the workload, as this is billed at a much lower rate.
If you select a terminated existing cluster and the job owner has Can Restart permission,
Databricks starts the cluster when the job is scheduled to run.
Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals.
When defining a task, customers will have the option to either configure a new cluster or choose an
existing one. With cluster reuse, your list of existing clusters will now contain clusters defined in other
tasks in the job.
When multiple tasks share a job cluster, the cluster will be initialized when the first relevant task is
starting. This cluster will stay on until the last task using this cluster is finished. This way there is no
additional startup time after the cluster initialization, leading to a time/cost reduction while using the job
clusters which are still isolated from other workloads.
To decrease new job cluster start time, create a pool and configure the job’s cluster to use the pool.
Note: Databricks have released a naming change for Databricks SQL that replaces the term "endpoint" with
"warehouse". No functional change is intended; this is just a naming change.
You can discover insights from your query results with a wide variety of rich visualizations.
Databricks SQL allows you to organize visualizations into dashboards with an intuitive drag-and-drop
interface.
You can then share your dashboards with others, both within and outside your organization, without the
need to grant viewers direct access to the underlying data.
You can configure dashboards to automatically refresh, as well as to alert viewers to meaningful changes in
the data.
You can then review the SQL code used to populate this plot. Note that 3 tier namespacing is used to
identify the source table; this is a preview of new functionality to be supported by Unity Catalog. You can
click Run in the top right of the screen to preview the results of the query.
You can review and edit the visualisation. You can also click on the Add Visualization button to the right of
the visualization name, and configure the visualisation as you see fit. You can then select Add to Dashboard
from the menu.
You can always change the organization of visualizations, e.g. by dragging and resizing visualizations.
Note: At present, in this demo no other users should have any permissions to run your dashboard, as they
have not been granted permissions to the underlying databases and tables using Table ACLs. If you wish other
users to be able to trigger updates to your dashboard, you will either need to grant them permissions to Run
as owner or add permissions for the tables referenced in your queries.
24.1. List the four key functional areas for data governance.
Data Access Control: who has access to what?
Data Access Audit: understanding who accessed what, when, and what they did with it (the compliance aspect).
Data Lineage: which data objects feed which downstream data objects; if you make a change to an upstream
table, how does that affect downstream objects, and vice versa.
Data Discovery: being able to find your data and see what actually exists.
24.2. Explain how Unity Catalog simplifies this with one tool to
cover all of these areas.
With Unity Catalog, we're seeking to cover all of these areas with one tool.
Traditionally, governance has been a challenge on data lakes. This is primarily due to the file formats
that exist on object stores. Governance is traditionally tied to the specific cloud provider's data
governance tooling, and tied to files. This introduces a lot of complexity. For example, you can lock down
access at the file level, but that doesn't allow you to do anything more granular, e.g. row-based or
column-based access controls.
Also, if you need to update your file structures, e.g. for performance reasons, you'll need to update your
data governance model as well. Conversely, if you update your data governance model, you'll need to
update your file format as well, which can involve rewriting files, or changing the structure of the
underlying files.
If you have a multi-cloud infrastructure, you're going to need to set permissions on the various data
sources for each of those clouds.
We seek to simplify this with Unity Catalog, by giving secure access to our various personas in a simple
manner.
With UC, we now have an additional catalog qualifier (also note that schema and database are
synonymous terms in Databricks). A catalog is a collection of databases.
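For example, a table is now addressed by a three-level name; the catalog, schema, and table names below are placeholders:
-- catalog.schema(database).table
SELECT * FROM main.sales_db.transactions;

-- or set the context first and use shorter names
USE CATALOG main;
USE SCHEMA sales_db;
SELECT * FROM transactions;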
24.3. Walk through a traditional query lifecycle, and how it
changes when using Unity Catalog. Highlight the differences and
why this makes a query lifecycle much simpler for data
consumers.
Traditionally, a query would be submitted to a cluster / SQL warehouse. The cluster/SQL warehouse
would check the table ACLs in the Hive metastore to ensure that the query has proper access. If it
did, it would query the Hive metastore again to find the location of the files being queried. Those
location paths would then be returned to the cluster/SQL warehouse. Then the cluster, using a pre-existing
IAM role (or a cloud-specific alternative), would go out to the cloud storage directly, retrieve the
data, and ultimately return the query results back to the user. Notice what's taking place within this one
single, specific workspace (the grey area of the accompanying diagram). The problem is that replicating this
control model across different workspaces is simply not possible.
With Unity Catalog, the query goes to the cluster/SQL warehouse. The cluster will go out and check the
Unity Catalog namespace. UC will write to the log that this query was submitted, which helps with
auditability and compliance (a trace for every query that is submitted). UC will then check the grants to make
sure that the access is valid. At that point, UC (with its own pre-configured IAM role) will go out to the cloud
storage and generate short-lived URLs and associated tokens, which it returns to the cluster or SQL
warehouse. The cluster will then use those URLs and tokens to access the cloud storage with the proper
permissions, retrieve the data, and return the full query results back to the user. Notice what's now
happening in the grey area representing the workspace, by contrast to the old way of doing this.
Unity Catalog introduces a change. The amount of workspace specific infrastructure is reduced. Unity
Catalog exists outside of the workspace at the account level. This means that multiple, different
workspaces can rely on that same Unity Catalog. All of the grants are going to be managed in Unity
Catalog. Our clusters or warehouses will be able to leverage Unity Catalog regardless of the workspace
that they're in, confirm the credentials for the individual user that is submitting a query, and then grant
those permissions to cloud object storage to return the data so that the query can be materialized.
25.1. What is the data explorer, how do you access it and what
does it allow you to do?
The data explorer allows users and admins to navigate databases, tables, and views; explore data schema,
metadata, and history; set and modify permissions of relational entities.
25.2. What are the default permissions for users and admins in
DBSQL?
By default, admins will have the ability to view all objects registered to the metastore and will be able to
control permissions for other users in the workspace.
Users will default to having no permissions on anything registered to the metastore, other than objects that
they create in DBSQL; note that before users can create any databases, tables, or views, they must have
create and usage privileges specifically granted to them.
Generally, permissions will be set using Groups that have been configured by an administrator, often by
importing organizational structures from SCIM integration with a different identity provider.
Object: ANY FILE
Scope: controls access to the underlying filesystem. Users granted access to ANY FILE can bypass the
restrictions put on the catalog, databases, tables, and views by reading from the filesystem directly.
25.4. For each object owner, describe what they can grant
privileges for.
Databricks admins and object owners can grant privileges according to the following rules:
Databricks administrator: all objects in the catalog and the underlying filesystem.
Table owner: only the table (similar options apply for views and functions).
25.5. Describe all the privileges that can be configured in Data
Explorer.
The following privileges can be configured in Data Explorer:
ALL PRIVILEGES: gives all privileges (is translated into all of the privileges below).
MODIFY: gives the ability to add, delete, and modify data to or from an object.
USAGE: does not give any abilities, but is an additional requirement to perform any action on a database object.
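The statement being confirmed in the next line is not shown in these notes; a hypothetical example of the kind of grant being discussed (the database and group names are made up):
-- Grant a group the minimum privileges needed to query tables in a database
GRANT USAGE ON DATABASE ds_demo_db TO `data-analysts`;
GRANT SELECT ON DATABASE ds_demo_db TO `data-analysts`;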
To confirm this has run successfully, they can execute the following query: