Building Modern Data Applications Using Databricks Lakehouse: Develop, optimize, and monitor data pipelines on Databricks

Ebook · 515 pages · 4 hours

Language: English
Publisher: Packt Publishing
Release date: October 31, 2024
ISBN: 9781804612873

    Book preview

    Building Modern Data Applications Using Databricks Lakehouse - Will Girten


    Building Modern Data Applications Using Databricks Lakehouse

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Apeksha Shetty

    Publishing Product Managers: Arindam Majumder and Nilesh Kowadkar

    Book Project Manager: Shambhavi Mishra

    Senior Content Development Editor: Shreya Moharir

    Technical Editor: Seemanjay Ameriya

    Copy Editor: Safis Editing

    Proofreader: Shreya Moharir

    Indexer: Manju Arasan

    Production Designer: Prashant Ghare

    Senior DevRel Marketing Coordinator: Nivedita Singh

    First published: October 2024

    Production reference: 1181024

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK

    ISBN 978-1-80107-323-3

    www.packtpub.com

    To my beautiful and caring wife, Ashley, and our smiley son, Silvio Apollo, thank you for your unwavering support and encouragement.

    – Will Girten

    Contributors

    About the author

    Will Girten is a lead specialist solutions architect who joined Databricks in early 2019. With over a decade of experience in data and AI, Will has worked in various business verticals, from healthcare to government and financial services. Will’s primary focus has been helping enterprises implement data warehousing strategies for the lakehouse and performance-tuning BI dashboards, reports, and queries. Will is a certified Databricks Data Engineering Professional and Databricks Machine Learning Professional. He holds a Bachelor of Science in computer engineering from the University of Delaware.

    I want to give a special thank you to one of my greatest supporters, mentors, and friends, YongSheng Huang.

    About the reviewer

    Oleksandra Bovkun is a senior solutions architect at Databricks. She helps customers adopt the Databricks Platform to implement a variety of use cases, follow best practices for implementing data products, and extract the maximum value from their data. During her career, she has participated in multiple data engineering and MLOps projects, including data platform setups, large-scale performance optimizations, and on-prem to cloud migrations. She previously held data engineering and software development roles at consultancies and product companies. She aims to support companies in their data and AI journey to maximize business value using data and AI solutions. She is a regular presenter at conferences, meetups, and user groups in Benelux and across Europe.

    Table of Contents

    Preface

    Part 1: Near-Real-Time Data Pipelines for the Lakehouse

    1

    An Introduction to Delta Live Tables

    Technical requirements

    The emergence of the lakehouse

    The Lambda architectural pattern

    Introducing the medallion architecture

    The Databricks lakehouse

    The maintenance predicament of a streaming application

    What is the DLT framework?

    How is DLT related to Delta Lake?

    Introducing DLT concepts

    Streaming tables

    Materialized views

    Views

    Pipeline

    Pipeline triggers

    Workflow

    Types of Databricks compute

    Databricks Runtime

    Unity Catalog

    A quick Delta Lake primer

    The architecture of a Delta table

    The contents of a transaction commit

    Supporting concurrent table reads and writes

    Tombstoned data files

    Calculating Delta table state

    Time travel

    Tracking table changes using change data feed

    A hands-on example – creating your first Delta Live Tables pipeline

    Summary

    2

    Applying Data Transformations Using Delta Live Tables

    Technical requirements

    Ingesting data from input sources

    Ingesting data using Databricks Auto Loader

    Scalability challenge in structured streaming

    Using Auto Loader with DLT

    Applying changes to downstream tables

    APPLY CHANGES command

    The DLT reconciliation process

    Publishing datasets to Unity Catalog

    Why store datasets in Unity Catalog?

    Creating a new catalog

    Assigning catalog permissions

    Data pipeline settings

    The DLT product edition

    Pipeline execution mode

    Databricks runtime

    Pipeline cluster types

    Serverless compute versus traditional compute

    Loading external dependencies

    Data pipeline processing modes

    Hands-on exercise – applying SCD Type 2 changes

    Summary

    3

    Managing Data Quality Using Delta Live Tables

    Technical requirements

    Defining data constraints in Delta Lake

    Using temporary datasets to validate data processing

    An introduction to expectations

    Expectation composition

    Hands-on exercise – writing your first data quality expectation

    Acting on failed expectations

    Hands-on example – failing a pipeline run due to poor data quality

    Applying multiple data quality expectations

    Decoupling expectations from a DLT pipeline

    Hands-on exercise – quarantining bad data for correction

    Summary

    4

    Scaling DLT Pipelines

    Technical requirements

    Scaling compute to handle demand

    Hands-on example – setting autoscaling properties using the Databricks REST API

    Automated table maintenance tasks

    Why auto compaction is important

    Vacuuming obsolete table files

    Moving compute closer to the data

    Optimizing table layouts for faster table updates

    Rewriting table files during updates

    Data skipping using table partitioning

    Delta Lake Z-ordering on MERGE columns

    Improving write performance using deletion vectors

    Serverless DLT pipelines

    Introducing Enzyme, a performance optimization layer

    Summary

    Part 2: Securing the Lakehouse Using the Unity Catalog

    5

    Mastering Data Governance in the Lakehouse with Unity Catalog

    Technical requirements

    Understanding data governance in a lakehouse

    Introducing the Databricks Unity Catalog

    A problem worth solving

    An overview of the Unity Catalog architecture

    Unity Catalog-enabled cluster types

    Unity Catalog object model

    Enabling Unity Catalog on an existing Databricks workspace

    Identity federation in Unity Catalog

    Data discovery and cataloging

    Tracking dataset relationships using lineage

    Observability with system tables

    Tracing the lineage of other assets

    Fine-grained data access

    Hands-on example – data masking healthcare datasets

    Summary

    6

    Managing Data Locations in Unity Catalog

    Technical requirements

    Creating and managing data catalogs in Unity Catalog

    Managed data versus external data

    Saving data to storage volumes in Unity Catalog

    Setting default locations for data within Unity Catalog

    Isolating catalogs to specific workspaces

    Creating and managing external storage locations in Unity Catalog

    Storing cloud service authentication using storage credentials

    Querying external systems using Lakehouse Federation

    Hands-on lab – extracting document text for a generative AI pipeline

    Generating mock documents

    Defining helper functions

    Choosing a file format randomly

    Creating/assembling the DLT pipeline

    Summary

    7

    Viewing Data Lineage Using Unity Catalog

    Technical requirements

    Introducing data lineage in Unity Catalog

    Tracing data origins using the Data Lineage REST API

    Visualizing upstream and downstream transformations

    Identifying dependencies and impacts

    Hands-on lab – documenting data lineage across an organization

    Summary

    Part 3: Continuous Integration, Continuous Deployment, and Continuous Monitoring

    8

    Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform

    Technical requirements

    Introducing the Databricks provider for Terraform

    Setting up a local Terraform environment

    Importing the Databricks Terraform provider

    Configuring workspace authentication

    Defining a DLT pipeline source notebook

    Applying workspace changes

    Configuring DLT pipelines using Terraform

    name

    notification

    channel

    development

    continuous

    edition

    photon

    configuration

    library

    cluster

    catalog

    target

    storage

    Automating DLT pipeline deployment

    Hands-on exercise – deploying a DLT pipeline using VS Code

    Setting up VS Code

    Creating a new Terraform project

    Defining the Terraform resources

    Deploying the Terraform project

    Summary

    9

    Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment

    Technical requirements

    Introduction to Databricks Asset Bundles

    Elements of a DAB configuration file

    Specifying a deployment mode

    Databricks Asset Bundles in action

    User-to-machine authentication

    Machine-to-machine authentication

    Initializing an asset bundle using templates

    Hands-on exercise – deploying your first DAB

    Hands-on exercise – simplifying cross-team collaboration with GitHub Actions

    Setting up the environment

    Configuring the GitHub Action

    Testing the workflow

    Versioning and maintenance

    Summary

    10

    Monitoring Data Pipelines in Production

    Technical requirements

    Introduction to data pipeline monitoring

    Exploring ways to monitor data pipelines

    Using DBSQL Alerts to notify data validity

    Pipeline health and performance monitoring

    Hands-on exercise – querying data quality events for a dataset

    Data quality monitoring

    Introducing Lakehouse Monitoring

    Hands-on exercise – creating a lakehouse monitor

    Best practices for production failure resolution

    Handling pipeline update failures

    Recovering from table transaction failure

    Hands-on exercise – setting up a webhook alert when a job runs longer than expected

    Summary

    Index

    Other Books You May Enjoy

    Preface

    As datasets have exploded in size with the introduction of cheap cloud storage, and processing data in near real time has become an industry standard, many organizations have turned to the lakehouse architecture, which combines the fast BI speeds of a traditional data warehouse with the scalable ETL processing of big data in the cloud. The Databricks Data Intelligence Platform – built upon several open source technologies, including Apache Spark, Delta Lake, MLflow, and Unity Catalog – eliminates friction points and accelerates the design and deployment of modern data applications built for the lakehouse.

    In this book, you’ll start with an overview of the Delta Lake format, cover core concepts of the Databricks Data Intelligence Platform, and master building data pipelines using the Delta Live Tables framework. We’ll dive into applying data transformations, implementing the Databricks medallion architecture, and continuously monitoring the quality of data landing in your lakehouse. You’ll learn how to react to incoming data using the Databricks Auto Loader feature and automate real-time data processing using Databricks Workflows. You’ll also learn how to use CI/CD tools such as Terraform and Databricks Asset Bundles (DABs) to deploy data pipeline changes automatically across deployment environments, as well as monitor, control, and optimize cloud costs along the way. By the end of this book, you will have mastered building a production-ready, modern data application using the Databricks Data Intelligence Platform.

    With Databricks recently named a Leader in the 2024 Gartner Magic Quadrant for Data Science and Machine Learning Platforms, demand for skills in the Databricks Data Intelligence Platform is only expected to grow in the coming years.

    Who this book is for

    This book is for data engineers, data scientists, and data stewards tasked with enterprise data processing for their organizations. It simplifies learning advanced data engineering techniques on Databricks, making the implementation of a cutting-edge lakehouse accessible to individuals with varying levels of technical expertise. However, beginner-level knowledge of Apache Spark and Python is needed to make the most of the code examples in this book.

    What this book covers

    Chapter 1, An Introduction to Delta Live Tables, discusses building near-real-time data pipelines using the Delta Live Tables framework. It covers the fundamentals of pipeline design as well as the core concepts of the Delta Lake format. The chapter concludes with a simple example of building a Delta Live Tables pipeline from start to finish.
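
    To give a flavor of what such a pipeline looks like in code, here is a minimal sketch of two DLT datasets declared in a Databricks notebook. It assumes the dlt module and spark session provided by the DLT runtime, and the dataset and source table names (including the samples.nyctaxi.trips table) are illustrative assumptions rather than the chapter’s own example:

    import dlt

    @dlt.table(
        name="trips_bronze",
        comment="Raw trip records loaded from an illustrative source table."
    )
    def trips_bronze():
        # `spark` is injected by the DLT runtime; the samples table is an assumption.
        return spark.read.table("samples.nyctaxi.trips")

    @dlt.table(
        name="trips_silver",
        comment="Trips with an elementary cleanup rule applied."
    )
    def trips_silver():
        # dlt.read() declares a dependency on another dataset in the same pipeline,
        # which lets DLT infer the execution graph for you.
        return dlt.read("trips_bronze").where("trip_distance > 0")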

    Chapter 2, Applying Data Transformations Using Delta Live Tables, explores data transformations using Delta Live Tables, guiding you through the process of cleaning, refining, and enriching data to meet specific business requirements. You will learn how to use Delta Live Tables to ingest data from a variety of input sources, register datasets in Unity Catalog, and effectively apply changes to downstream tables.
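
    As a rough sketch of these two ideas together (the landing path, column names, and table names below are hypothetical), Auto Loader can feed a raw view that APPLY CHANGES then merges into a downstream streaming table:

    import dlt

    @dlt.view(name="customer_updates_raw")
    def customer_updates_raw():
        # Auto Loader ("cloudFiles") incrementally discovers new files in a
        # hypothetical landing path.
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/landing/customer_updates")
        )

    # Declare the target streaming table, then apply change records into it.
    dlt.create_streaming_table("customers")

    dlt.apply_changes(
        target="customers",
        source="customer_updates_raw",
        keys=["customer_id"],       # key used to match incoming records
        sequence_by="updated_at",   # ordering column for out-of-order events
        stored_as_scd_type=1        # keep only the latest version of each row
    )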

    Chapter 3, Managing Data Quality Using Delta Live Tables, introduces several techniques for enforcing data quality requirements on newly arriving data. You will learn how to define data quality constraints using Expectations in the Delta Live Tables framework, as well as monitor the data quality of a pipeline in near real time.
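
    For reference, expectations are declared as decorators on a DLT dataset; the three variants below log, drop, or fail on violations. The dataset and column names carry over from the earlier sketch and remain assumptions:

    import dlt

    @dlt.table(comment="Trip records that satisfy basic data quality rules.")
    @dlt.expect("non_negative_fare", "fare_amount >= 0")            # record violations, keep rows
    @dlt.expect_or_drop("positive_distance", "trip_distance > 0")   # drop violating rows
    @dlt.expect_or_fail("pickup_time_present", "tpep_pickup_datetime IS NOT NULL")  # fail the update
    def trips_validated():
        return dlt.read("trips_bronze")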

    Chapter 4, Scaling DLT Pipelines, explains how to scale a Delta Live Tables (DLT) pipeline to handle the unpredictable demands of a typical production environment. You will take a deep dive into configuring pipeline settings using the DLT UI and Databricks Pipeline REST API. You will also gain a better understanding of the daily DLT maintenance tasks that are run in the background and how to optimize table layouts to improve performance.
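
    As a hedged sketch of the REST API approach, the snippet below fetches a pipeline’s current specification and resubmits it with an enhanced autoscaling range; the workspace URL, access token, and pipeline ID are placeholders you would supply yourself:

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder
    PIPELINE_ID = "<pipeline-id>"                            # placeholder
    headers = {"Authorization": f"Bearer {TOKEN}"}

    # The PUT endpoint replaces the whole pipeline spec, so start from the current one.
    spec = requests.get(
        f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=headers
    ).json()["spec"]

    # Set enhanced autoscaling on the default cluster and submit the spec back.
    spec["clusters"] = [{
        "label": "default",
        "autoscale": {"min_workers": 1, "max_workers": 8, "mode": "ENHANCED"}
    }]
    requests.put(
        f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=headers, json=spec
    ).raise_for_status()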

    Chapter 5, Mastering Data Governance in the Lakehouse with Unity Catalog, provides a comprehensive guide to enhancing data governance and compliance of your lakehouse using Unity Catalog. You will learn how to enable Unity Catalog on a Databricks workspace, enable data discovery using metadata tags, and implement fine-grained row and column-level access control of datasets.
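
    As one small, hedged illustration of column-level control (the catalog, schema, table, column, and group names are invented for this sketch), a Unity Catalog column mask can be built from a SQL function and attached to a sensitive column from a notebook:

    # Define a masking function, then attach it to a column as a Unity Catalog column mask.
    spark.sql("""
        CREATE OR REPLACE FUNCTION main.healthcare.mask_ssn(ssn STRING)
        RETURN CASE
            WHEN is_account_group_member('compliance_admins') THEN ssn
            ELSE '***-**-****'
        END
    """)

    spark.sql("""
        ALTER TABLE main.healthcare.patients
        ALTER COLUMN ssn SET MASK main.healthcare.mask_ssn
    """)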

    Chapter 6, Managing Data Locations in Unity Catalog, explores how to effectively manage storage locations using Unity Catalog. You will learn how to govern data access across various roles and departments within an organization while ensuring security and auditability with the Databricks Data Intelligence Platform.
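
    The sketch below shows the general shape of registering an external location and a volume from a notebook; the cloud path, storage credential, and object names are placeholders, and the credential is assumed to exist already:

    # Register an external location backed by an existing storage credential,
    # then expose part of it as a Unity Catalog volume.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS landing_zone
        URL 's3://example-company-landing/raw'
        WITH (STORAGE CREDENTIAL landing_zone_credential)
    """)

    spark.sql("""
        CREATE EXTERNAL VOLUME IF NOT EXISTS main.default.landing_files
        LOCATION 's3://example-company-landing/raw/files'
    """)

    # Files in the volume become addressable through the /Volumes path.
    display(dbutils.fs.ls("/Volumes/main/default/landing_files"))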

    Chapter 7, Viewing Data Lineage Using Unity Catalog, discusses tracing data origins, visualizing data transformations, and identifying upstream and downstream dependencies using data lineage in Unity Catalog. By the end of the chapter, you will be equipped with the skills needed to validate that data is coming from trusted sources.
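
    The sketch below shows one way to call the lineage REST API; the endpoint shape reflects the lineage-tracking API at the time of writing, and the workspace URL, token, and table name are placeholders:

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder

    # Retrieve upstream and downstream table lineage for a fully qualified table.
    resp = requests.get(
        f"{HOST}/api/2.0/lineage-tracking/table-lineage",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"table_name": "main.default.trips_silver", "include_entity_lineage": "true"},
    )
    resp.raise_for_status()
    lineage = resp.json()
    print(lineage.get("upstreams", []))
    print(lineage.get("downstreams", []))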

    Chapter 8, Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform, covers deploying DLT pipelines using the Databricks Terraform provider. You will learn how to set up a local development environment and automate a continuous build and deployment pipeline, along with best practices and future considerations.

    Chapter 9, Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment, explores how DABs can be used to streamline the deployment of data analytics projects and improve cross-team collaboration. You will gain an understanding of the practical use of DABs through several hands-on examples.

    Chapter 10, Monitoring Data Pipelines in Production, delves into the crucial task of monitoring data pipelines in Databricks. You will learn various mechanisms for tracking pipeline health, performance, and data quality within the Databricks Data Intelligence Platform.
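
    One such mechanism is the DLT event log, which can be queried from a notebook like any other dataset; the sketch below pulls the expectation (data quality) metrics emitted during pipeline updates, with the pipeline ID left as a placeholder:

    # Query a DLT pipeline's event log for the data quality metrics recorded
    # during each update.
    quality_events = spark.sql("""
        SELECT
            timestamp,
            details:flow_progress.data_quality.expectations AS expectations
        FROM event_log('<pipeline-id>')
        WHERE event_type = 'flow_progress'
          AND details:flow_progress.data_quality.expectations IS NOT NULL
        ORDER BY timestamp DESC
    """)
    display(quality_events)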

    To get the most out of this book

    While not mandatory, it’s recommended that you have beginner-level knowledge of Python and Apache Spark, as well as some familiarity with navigating the Databricks Data Intelligence Platform, to get the most out of this book. It’s also recommended that you have the following dependencies installed locally in order to follow along with the hands-on exercises and code examples throughout the book:

    Furthermore, it’s recommended that you have a Databricks account and workspace to log in, import notebooks, create clusters, and create new data pipelines. If you do not have a Databricks account, you can sign up for a free trial on the Databricks website at https://www.databricks.com/try-databricks.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Building-Modern-Data-Applications-Using-Databricks-Lakehouse. If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The result of the data generator notebook should be three tables in total: youtube_channels, youtube_channel_artists, and combined_table.

    A block of code is set as follows:

    @dlt.table(
        name="random_trip_data_raw",
        comment="The raw taxi trip data ingested from a landing zone.",
        table_properties={
            "quality": "bronze"
        }
    )

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    @dlt.table(
        name="random_trip_data_raw",
        comment="The raw taxi trip data ingested from a landing zone.",
        table_properties={
            "quality": "bronze",
            "pipelines.autoOptimize.managed": "false"
        }
    )

    Any command-line input or output is written as follows:

    $ databricks bundle validate

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Click the Run all button at the top right of the Databricks workspace to execute all the notebook cells, verifying that all cells execute successfully.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share Your Thoughts

    Once you’ve read Building Modern Data Applications Using Databricks Lakehouse, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    https://packt.link/free-ebook/978-1-80107-323-3

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly.

    Part 1: Near-Real-Time Data Pipelines for the Lakehouse

    In this first part of the book, we’ll introduce the core concepts of the Delta Live Tables (DLT) framework. We’ll cover how to ingest data from a variety of input sources and apply the latest changes to downstream tables. We’ll also explore how to enforce requirements on incoming data so that your data teams can be alerted to potential data quality issues that might contaminate your lakehouse.

    This part contains the following chapters:

    Chapter 1, An Introduction to Delta Live Tables

    Chapter 2, Applying Data Transformations Using Delta Live Tables

    Chapter 3, Managing Data Quality Using Delta Live Tables

    Chapter 4, Scaling DLT Pipelines
