Building Modern Data Applications Using Databricks Lakehouse: Develop, optimize, and monitor data pipelines on Databricks
By Will Girten
Building Modern Data Applications Using Databricks Lakehouse
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Managers: Arindam Majumder and Nilesh Kowadkar
Book Project Manager: Shambhavi Mishra
Senior Content Development Editor: Shreya Moharir
Technical Editor: Seemanjay Ameriya
Copy Editor: Safis Editing
Proofreader: Shreya Moharir
Indexer: Manju Arasan
Production Designer: Prashant Ghare
Senior DevRel Marketing Coordinator: Nivedita Singh
First published: October 2024
Production reference: 1181024
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-80107-323-3
www.packtpub.com
To my beautiful and caring wife, Ashley, and our smiley son, Silvio Apollo, thank you for your unwavering support and encouragement.
– Will Girten
Contributors
About the author
Will Girten is a lead specialist solutions architect who joined Databricks in early 2019. With over a decade of experience in data and AI, Will has worked in various business verticals, from healthcare to government and financial services. Will’s primary focus has been helping enterprises implement data warehousing strategies for the lakehouse and performance-tuning BI dashboards, reports, and queries. Will is a certified Databricks Data Engineering Professional and Databricks Machine Learning Professional. He holds a Bachelor of Science in computer engineering from the University of Delaware.
I want to give a special thank you to one of my greatest supporters, mentors, and friends, YongSheng Huang.
About the reviewer
Oleksandra Bovkun is a senior solutions architect at Databricks. She helps customers adopt the Databricks Platform to implement a variety of use cases, follow best practices for building data products, and extract the maximum value from their data. During her career, she has participated in multiple data engineering and MLOps projects, including data platform setups, large-scale performance optimizations, and on-premises-to-cloud migrations. She previously held data engineering and software development roles at consultancies and product companies. She aims to support companies in their data and AI journey, helping them maximize business value from data and AI solutions. She is a regular presenter at conferences, meetups, and user groups in Benelux and across Europe.
Table of Contents
Preface
Part 1: Near-Real-Time Data Pipelines for the Lakehouse
1
An Introduction to Delta Live Tables
Technical requirements
The emergence of the lakehouse
The Lambda architectural pattern
Introducing the medallion architecture
The Databricks lakehouse
The maintenance predicament of a streaming application
What is the DLT framework?
How is DLT related to Delta Lake?
Introducing DLT concepts
Streaming tables
Materialized views
Views
Pipeline
Pipeline triggers
Workflow
Types of Databricks compute
Databricks Runtime
Unity Catalog
A quick Delta Lake primer
The architecture of a Delta table
The contents of a transaction commit
Supporting concurrent table reads and writes
Tombstoned data files
Calculating Delta table state
Time travel
Tracking table changes using change data feed
A hands-on example – creating your first Delta Live Tables pipeline
Summary
2
Applying Data Transformations Using Delta Live Tables
Technical requirements
Ingesting data from input sources
Ingesting data using Databricks Auto Loader
Scalability challenge in structured streaming
Using Auto Loader with DLT
Applying changes to downstream tables
APPLY CHANGES command
The DLT reconciliation process
Publishing datasets to Unity Catalog
Why store datasets in Unity Catalog?
Creating a new catalog
Assigning catalog permissions
Data pipeline settings
The DLT product edition
Pipeline execution mode
Databricks runtime
Pipeline cluster types
Serverless compute versus traditional compute
Loading external dependencies
Data pipeline processing modes
Hands-on exercise – applying SCD Type 2 changes
Summary
3
Managing Data Quality Using Delta Live Tables
Technical requirements
Defining data constraints in Delta Lake
Using temporary datasets to validate data processing
An introduction to expectations
Expectation composition
Hands-on exercise – writing your first data quality expectation
Acting on failed expectations
Hands-on example – failing a pipeline run due to poor data quality
Applying multiple data quality expectations
Decoupling expectations from a DLT pipeline
Hands-on exercise – quarantining bad data for correction
Summary
4
Scaling DLT Pipelines
Technical requirements
Scaling compute to handle demand
Hands-on example – setting autoscaling properties using the Databricks REST API
Automated table maintenance tasks
Why auto compaction is important
Vacuuming obsolete table files
Moving compute closer to the data
Optimizing table layouts for faster table updates
Rewriting table files during updates
Data skipping using table partitioning
Delta Lake Z-ordering on MERGE columns
Improving write performance using deletion vectors
Serverless DLT pipelines
Introducing Enzyme, a performance optimization layer
Summary
Part 2: Securing the Lakehouse Using the Unity Catalog
5
Mastering Data Governance in the Lakehouse with Unity Catalog
Technical requirements
Understanding data governance in a lakehouse
Introducing the Databricks Unity Catalog
A problem worth solving
An overview of the Unity Catalog architecture
Unity Catalog-enabled cluster types
Unity Catalog object model
Enabling Unity Catalog on an existing Databricks workspace
Identity federation in Unity Catalog
Data discovery and cataloging
Tracking dataset relationships using lineage
Observability with system tables
Tracing the lineage of other assets
Fine-grained data access
Hands-on example – data masking healthcare datasets
Summary
6
Managing Data Locations in Unity Catalog
Technical requirements
Creating and managing data catalogs in Unity Catalog
Managed data versus external data
Saving data to storage volumes in Unity Catalog
Setting default locations for data within Unity Catalog
Isolating catalogs to specific workspaces
Creating and managing external storage locations in Unity Catalog
Storing cloud service authentication using storage credentials
Querying external systems using Lakehouse Federation
Hands-on lab – extracting document text for a generative AI pipeline
Generating mock documents
Defining helper functions
Choosing a file format randomly
Creating/assembling the DLT pipeline
Summary
7
Viewing Data Lineage Using Unity Catalog
Technical requirements
Introducing data lineage in Unity Catalog
Tracing data origins using the Data Lineage REST API
Visualizing upstream and downstream transformations
Identifying dependencies and impacts
Hands-on lab – documenting data lineage across an organization
Summary
Part 3: Continuous Integration, Continuous Deployment, and Continuous Monitoring
8
Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform
Technical requirements
Introducing the Databricks provider for Terraform
Setting up a local Terraform environment
Importing the Databricks Terraform provider
Configuring workspace authentication
Defining a DLT pipeline source notebook
Applying workspace changes
Configuring DLT pipelines using Terraform
name
notification
channel
development
continuous
edition
photon
configuration
library
cluster
catalog
target
storage
Automating DLT pipeline deployment
Hands-on exercise – deploying a DLT pipeline using VS Code
Setting up VS Code
Creating a new Terraform project
Defining the Terraform resources
Deploying the Terraform project
Summary
9
Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment
Technical requirements
Introduction to Databricks Asset Bundles
Elements of a DAB configuration file
Specifying a deployment mode
Databricks Asset Bundles in action
User-to-machine authentication
Machine-to-machine authentication
Initializing an asset bundle using templates
Hands-on exercise – deploying your first DAB
Hands-on exercise – simplifying cross-team collaboration with GitHub Actions
Setting up the environment
Configuring the GitHub Action
Testing the workflow
Versioning and maintenance
Summary
10
Monitoring Data Pipelines in Production
Technical requirements
Introduction to data pipeline monitoring
Exploring ways to monitor data pipelines
Using DBSQL Alerts to notify data validity
Pipeline health and performance monitoring
Hands-on exercise – querying data quality events for a dataset
Data quality monitoring
Introducing Lakehouse Monitoring
Hands-on exercise – creating a lakehouse monitor
Best practices for production failure resolution
Handling pipeline update failures
Recovering from table transaction failure
Hands-on exercise – setting up a webhook alert when a job runs longer than expected
Summary
Index
Other Books You May Enjoy
Preface
As datasets have exploded in size with the introduction of cheap cloud storage, and processing data in near real time has become an industry standard, many organizations have turned to the lakehouse architecture, which combines the fast BI speeds of a traditional data warehouse with the scalable ETL processing of big data in the cloud. The Databricks Data Intelligence Platform – built upon several open source technologies, including Apache Spark, Delta Lake, MLflow, and Unity Catalog – eliminates friction points and accelerates the design and deployment of modern data applications built for the lakehouse.
In this book, you’ll start with an overview of the Delta Lake format, cover the core concepts of the Databricks Data Intelligence Platform, and master building data pipelines using the Delta Live Tables framework. We’ll dive into applying data transformations, implementing the Databricks medallion architecture, and continuously monitoring the quality of data landing in your lakehouse. You’ll learn how to react to incoming data using the Databricks Auto Loader feature and automate real-time data processing using Databricks Workflows. You’ll also learn how to use CI/CD tools such as Terraform and Databricks Asset Bundles (DABs) to deploy data pipeline changes automatically across deployment environments, as well as monitor, control, and optimize cloud costs along the way. By the end of this book, you will have mastered building a production-ready, modern data application on the Databricks Data Intelligence Platform.
With Databricks recently named a Leader in the 2024 Gartner Magic Quadrant for Data Science and Machine Learning Platforms, demand for skills in the Databricks Data Intelligence Platform is only expected to grow in the coming years.
Who this book is for
This book is for data engineers, data scientists, and data stewards tasked with enterprise data processing for their organizations. It simplifies learning advanced data engineering techniques on Databricks, making the implementation of a cutting-edge lakehouse accessible to individuals with varying levels of technical expertise. However, beginner-level knowledge of Apache Spark and Python is needed to make the most of the code examples in this book.
What this book covers
Chapter 1, An Introduction to Delta Live Tables, discusses building near-real-time data pipelines using the Delta Live Tables framework. It covers the fundamentals of pipeline design as well as the core concepts of the Delta Lake format. The chapter concludes with a simple example of building a Delta Live Tables pipeline from start to finish.
Chapter 2, Applying Data Transformations Using Delta Live Tables, explores data transformations using Delta Live Tables, guiding you through the process of cleaning, refining, and enriching data to meet specific business requirements. You will learn how to use Delta Live Tables to ingest data from a variety of input sources, register datasets in Unity Catalog, and effectively apply changes to downstream tables.
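As a brief preview of what that ingestion looks like in practice, here is a minimal, illustrative sketch of reading files with Auto Loader inside a DLT pipeline. It is not the book’s exact code: the table name and landing-zone path are hypothetical, and spark refers to the SparkSession that Databricks notebooks provide automatically.

import dlt

@dlt.table(
    name="raw_orders",  # hypothetical bronze table name
    comment="Orders ingested incrementally from a cloud storage landing zone."
)
def raw_orders():
    # Auto Loader (the cloudFiles source) discovers and ingests new files as they arrive.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/orders")  # hypothetical landing-zone path
    )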
Chapter 3, Managing Data Quality Using Delta Live Tables, introduces several techniques for enforcing data quality requirements on newly arriving data. You will learn how to define data quality constraints using Expectations in the Delta Live Tables framework, as well as monitor the data quality of a pipeline in near real time.
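To give a flavor of the syntax ahead of time, a data quality expectation can be attached to a DLT table definition roughly as follows. This is an illustrative sketch only; the dataset and rule names are made up.

import dlt

@dlt.table(
    name="cleaned_trips",
    comment="Trips that pass basic data quality checks."
)
@dlt.expect_or_drop("valid_trip_distance", "trip_distance > 0")
def cleaned_trips():
    # Rows failing the expectation are dropped, and the violation counts are
    # recorded in the pipeline's event log for monitoring.
    return dlt.read_stream("raw_trips")  # hypothetical upstream dataset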
Chapter 4, Scaling DLT Pipelines, explains how to scale a Delta Live Tables (DLT) pipeline to handle the unpredictable demands of a typical production environment. You will take a deep dive into configuring pipeline settings using the DLT UI and Databricks Pipeline REST API. You will also gain a better understanding of the daily DLT maintenance tasks that are run in the background and how to optimize table layouts to improve performance.
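As a rough preview of the REST API portion, the sketch below updates a pipeline’s autoscaling range using Python’s requests library. The host, token, and pipeline ID are placeholders, and the exact payload fields may differ between API versions, so treat this as an outline rather than a definitive recipe.

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder token
PIPELINE_ID = "<pipeline-id>"                           # placeholder pipeline ID
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Read the current pipeline specification.
spec = requests.get(
    f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=HEADERS
).json()["spec"]

# Adjust the default cluster's autoscaling range (enhanced autoscaling assumed here).
spec["clusters"] = [{
    "label": "default",
    "autoscale": {"min_workers": 1, "max_workers": 5, "mode": "ENHANCED"},
}]

# Push the modified specification back to the pipeline.
requests.put(
    f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=HEADERS, json=spec
).raise_for_status()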
Chapter 5, Mastering Data Governance in the Lakehouse with Unity Catalog, provides a comprehensive guide to enhancing data governance and compliance of your lakehouse using Unity Catalog. You will learn how to enable Unity Catalog on a Databricks workspace, enable data discovery using metadata tags, and implement fine-grained row- and column-level access control of datasets.
Chapter 6, Managing Data Locations in Unity Catalog, explores how to effectively manage storage locations using Unity Catalog. You will learn how to govern data access across various roles and departments within an organization while ensuring security and auditability with the Databricks Data Intelligence Platform.
Chapter 7, Viewing Data Lineage Using Unity Catalog, discusses tracing data origins, visualizing data transformations, and identifying upstream and downstream dependencies using data lineage in Unity Catalog. By the end of the chapter, you will be equipped with the skills needed to validate that data is coming from trusted sources.
Chapter 8, Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform, covers deploying DLT pipelines using the Databricks Terraform provider. You will learn how to set up a local development environment and automate a continuous build and deployment pipeline, along with best practices and future considerations.
Chapter 9, Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment, explores how DABs can be used to streamline the deployment of data analytics projects and improve cross-team collaboration. You will gain an understanding of the practical use of DABs through several hands-on examples.
Chapter 10, Monitoring Data Pipelines in Production, delves into the crucial task of monitoring data pipelines in Databricks. You will learn various mechanisms for tracking pipeline health, performance, and data quality within the Databricks Data Intelligence Platform.
To get the most out of this book
To get the most out of this book, it’s recommended (though not mandatory) that you have beginner-level knowledge of Python and Apache Spark, as well as some familiarity with navigating the Databricks Data Intelligence Platform. It’s also recommended that you have the following dependencies installed locally in order to follow along with the hands-on exercises and code examples throughout the book:
Furthermore, it’s recommended that you have a Databricks account and workspace so that you can log in, import notebooks, create clusters, and create new data pipelines. If you do not have a Databricks account, you can sign up for a free trial on the Databricks website at https://www.databricks.com/try-databricks.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Building-Modern-Data-Applications-Using-Databricks-Lakehouse. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The result of the data generator notebook should be three tables in total: youtube_channels, youtube_channel_artists, and combined_table.
A block of code is set as follows:
@dlt.table(
    name="random_trip_data_raw",
    comment="The raw taxi trip data ingested from a landing zone.",
    table_properties={
        "quality": "bronze"
    }
)
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
@dlt.table(
    name="random_trip_data_raw",
    comment="The raw taxi trip data ingested from a landing zone.",
    table_properties={
        "quality": "bronze",
        "pipelines.autoOptimize.managed": "false"
    }
)
Any command-line input or output is written as follows:
$ databricks bundle validate
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Click the Run all button at the top right of the Databricks workspace to execute all the notebook cells, verifying that all cells execute successfully.
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you’ve read Building Modern Data Applications Using Databricks Lakehouse, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/978-1-80107-323-3
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1:Near-Real-Time Data Pipelines for the Lakehouse
In this first part of the book, we’ll introduce the core concepts of the Delta Live Tables (DLT) framework. We’ll cover how to ingest data from a variety of input sources and apply the latest changes to downstream tables. We’ll also explore how to enforce requirements on incoming data so that your data teams can be alerted to potential data quality issues that might contaminate your lakehouse.
This part contains the following chapters:
Chapter 1, An Introduction to Delta Live Tables
Chapter