Building Modern Data Applications Using Databricks Lakehouse: Develop, optimize, and monitor data pipelines on Databricks

Ebook · 515 pages · 4 hours

Language: English
Publisher: Packt Publishing
Release date: October 31, 2024
ISBN: 9781804612873

    Book preview

    Building Modern Data Applications Using Databricks Lakehouse - Will Girten


    Building Modern Data Applications Using Databricks Lakehouse

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Apeksha Shetty

    Publishing Product Managers: Arindam Majumder and Nilesh Kowadkar

    Book Project Manager: Shambhavi Mishra

    Senior Content Development Editor: Shreya Moharir

    Technical Editor: Seemanjay Ameriya

    Copy Editor: Safis Editing

    Proofreader: Shreya Moharir

    Indexer: Manju Arasan

    Production Designer: Prashant Ghare

    Senior DevRel Marketing Coordinator: Nivedita Singh

    First published: October 2024

    Production reference: 1181024

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK

    ISBN 978-1-80107-323-3

    www.packtpub.com

    To my beautiful and caring wife, Ashley, and our smiley son, Silvio Apollo, thank you for your unwavering support and encouragement.

    – Will Girten

    Contributors

    About the author

    Will Girten is a lead specialist solutions architect who joined Databricks in early 2019. With over a decade of experience in data and AI, Will has worked in various business verticals, from healthcare to government and financial services. Will’s primary focus has been helping enterprises implement data warehousing strategies for the lakehouse and performance-tuning BI dashboards, reports, and queries. Will is a certified Databricks Data Engineering Professional and Databricks Machine Learning Professional. He holds a Bachelor of Science in computer engineering from the University of Delaware.

    I want to give a special thank you to one of my greatest supporters, mentors, and friends, YongSheng Huang.

    About the reviewer

    Oleksandra Bovkun is a senior solutions architect at Databricks. She helps customers adopt the Databricks Platform to implement a variety of use cases, follow best practices for implementing data products, and extract the maximum value from their data. During her career, she has participated in multiple data engineering and MLOps projects, including data platform setups, large-scale performance optimizations, and on-prem to cloud migrations. She previously held data engineering and software development roles at consultancies and product companies. She aims to support companies in their data and AI journey to maximize business value using data and AI solutions. She is a regular presenter at conferences, meetups, and user groups in Benelux and across Europe.

    Table of Contents

    Preface

    Part 1: Near-Real-Time Data Pipelines for the Lakehouse

    1

    An Introduction to Delta Live Tables

    Technical requirements

    The emergence of the lakehouse

    The Lambda architectural pattern

    Introducing the medallion architecture

    The Databricks lakehouse

    The maintenance predicament of a streaming application

    What is the DLT framework?

    How is DLT related to Delta Lake?

    Introducing DLT concepts

    Streaming tables

    Materialized views

    Views

    Pipeline

    Pipeline triggers

    Workflow

    Types of Databricks compute

    Databricks Runtime

    Unity Catalog

    A quick Delta Lake primer

    The architecture of a Delta table

    The contents of a transaction commit

    Supporting concurrent table reads and writes

    Tombstoned data files

    Calculating Delta table state

    Time travel

    Tracking table changes using change data feed

    A hands-on example – creating your first Delta Live Tables pipeline

    Summary

    2

    Applying Data Transformations Using Delta Live Tables

    Technical requirements

    Ingesting data from input sources

    Ingesting data using Databricks Auto Loader

    Scalability challenge in structured streaming

    Using Auto Loader with DLT

    Applying changes to downstream tables

    APPLY CHANGES command

    The DLT reconciliation process

    Publishing datasets to Unity Catalog

    Why store datasets in Unity Catalog?

    Creating a new catalog

    Assigning catalog permissions

    Data pipeline settings

    The DLT product edition

    Pipeline execution mode

    Databricks runtime

    Pipeline cluster types

    Serverless compute versus traditional compute

    Loading external dependencies

    Data pipeline processing modes

    Hands-on exercise – applying SCD Type 2 changes

    Summary

    3

    Managing Data Quality Using Delta Live Tables

    Technical requirements

    Defining data constraints in Delta Lake

    Using temporary datasets to validate data processing

    An introduction to expectations

    Expectation composition

    Hands-on exercise – writing your first data quality expectation

    Acting on failed expectations

    Hands-on example – failing a pipeline run due to poor data quality

    Applying multiple data quality expectations

    Decoupling expectations from a DLT pipeline

    Hands-on exercise – quarantining bad data for correction

    Summary

    4

    Scaling DLT Pipelines

    Technical requirements

    Scaling compute to handle demand

    Hands-on example – setting autoscaling properties using the Databricks REST API

    Automated table maintenance tasks

    Why auto compaction is important

    Vacuuming obsolete table files

    Moving compute closer to the data

    Optimizing table layouts for faster table updates

    Rewriting table files during updates

    Data skipping using table partitioning

    Delta Lake Z-ordering on MERGE columns

    Improving write performance using deletion vectors

    Serverless DLT pipelines

    Introducing Enzyme, a performance optimization layer

    Summary

    Part 2: Securing the Lakehouse Using the Unity Catalog

    5

    Mastering Data Governance in the Lakehouse with Unity Catalog

    Technical requirements

    Understanding data governance in a lakehouse

    Introducing the Databricks Unity Catalog

    A problem worth solving

    An overview of the Unity Catalog architecture

    Unity Catalog-enabled cluster types

    Unity Catalog object model

    Enabling Unity Catalog on an existing Databricks workspace

    Identity federation in Unity Catalog

    Data discovery and cataloging

    Tracking dataset relationships using lineage

    Observability with system tables

    Tracing the lineage of other assets

    Fine-grained data access

    Hands-on example – data masking healthcare datasets

    Summary

    6

    Managing Data Locations in Unity Catalog

    Technical requirements

    Creating and managing data catalogs in Unity Catalog

    Managed data versus external data

    Saving data to storage volumes in Unity Catalog

    Setting default locations for data within Unity Catalog

    Isolating catalogs to specific workspaces

    Creating and managing external storage locations in Unity Catalog

    Storing cloud service authentication using storage credentials

    Querying external systems using Lakehouse Federation

    Hands-on lab – extracting document text for a generative AI pipeline

    Generating mock documents

    Defining helper functions

    Choosing a file format randomly

    Creating/assembling the DLT pipeline

    Summary

    7

    Viewing Data Lineage Using Unity Catalog

    Technical requirements

    Introducing data lineage in Unity Catalog

    Tracing data origins using the Data Lineage REST API

    Visualizing upstream and downstream transformations

    Identifying dependencies and impacts

    Hands-on lab – documenting data lineage across an organization

    Summary

    Part 3: Continuous Integration, Continuous Deployment, and Continuous Monitoring

    8

    Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform

    Technical requirements

    Introducing the Databricks provider for Terraform

    Setting up a local Terraform environment

    Importing the Databricks Terraform provider

    Configuring workspace authentication

    Defining a DLT pipeline source notebook

    Applying workspace changes

    Configuring DLT pipelines using Terraform

    name

    notification

    channel

    development

    continuous

    edition

    photon

    configuration

    library

    cluster

    catalog

    target

    storage

    Automating DLT pipeline deployment

    Hands-on exercise – deploying a DLT pipeline using VS Code

    Setting up VS Code

    Creating a new Terraform project

    Defining the Terraform resources

    Deploying the Terraform project

    Summary

    9

    Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment

    Technical requirements

    Introduction to Databricks Asset Bundles

    Elements of a DAB configuration file

    Specifying a deployment mode

    Databricks Asset Bundles in action

    User-to-machine authentication

    Machine-to-machine authentication

    Initializing an asset bundle using templates

    Hands-on exercise – deploying your first DAB

    Hands-on exercise – simplifying cross-team collaboration with GitHub Actions

    Setting up the environment

    Configuring the GitHub Action

    Testing the workflow

    Versioning and maintenance

    Summary

    10

    Monitoring Data Pipelines in Production

    Technical requirements

    Introduction to data pipeline monitoring

    Exploring ways to monitor data pipelines

    Using DBSQL Alerts to notify data validity

    Pipeline health and performance monitoring

    Hands-on exercise – querying data quality events for a dataset

    Data quality monitoring

    Introducing Lakehouse Monitoring

    Hands-on exercise – creating a lakehouse monitor

    Best practices for production failure resolution

    Handling pipeline update failures

    Recovering from table transaction failure

    Hands-on exercise – setting up a webhook alert when a job runs longer than expected

    Summary

    Index

    Other Books You May Enjoy

    Preface

    As datasets have exploded in size with the introduction of cheap cloud storage, and processing data in near real time has become an industry standard, many organizations have turned to the lakehouse architecture, which combines the fast BI speeds of a traditional data warehouse with the scalable ETL processing of big data in the cloud. The Databricks Data Intelligence Platform – built upon several open source technologies, including Apache Spark, Delta Lake, MLflow, and Unity Catalog – eliminates friction points and accelerates the design and deployment of modern data applications built for the lakehouse.

    In this book, you’ll start with an overview of the Delta Lake format, cover core concepts of the Databricks Data Intelligence Platform, and master building data pipelines using the Delta Live Tables framework. We’ll dive into applying data transformations, implementing the Databricks medallion architecture, and continuously monitoring the quality of data landing in your lakehouse. You’ll learn how to react to incoming data using the Databricks Auto Loader feature and automate real-time data processing using Databricks Workflows. You’ll also learn how to use CI/CD tools such as Terraform and Databricks Asset Bundles (DABs) to deploy data pipeline changes automatically across deployment environments, as well as monitor, control, and optimize cloud costs along the way. By the end of this book, you will have mastered building a production-ready, modern data application using the Databricks Data Intelligence Platform.

    With Databricks recently named a Leader in the 2024 Gartner Magic Quadrant for Data Science and Machine Learning Platforms, demand for skills in the Databricks Data Intelligence Platform is only expected to grow in the coming years.

    Who this book is for

    This book is for data engineers, data scientists, and data stewards tasked with enterprise data processing for their organizations. It simplifies learning advanced data engineering techniques on Databricks, making the implementation of a cutting-edge lakehouse accessible to individuals with varying levels of technical expertise. However, beginner-level knowledge of Apache Spark and Python is needed to make the most of the code examples in this book.

    What this book covers

    Chapter 1, An Introduction to Delta Live Tables, discusses building near-real-time data pipelines using the Delta Live Tables framework. It covers the fundamentals of pipeline design as well as the core concepts of the Delta Lake format. The chapter concludes with a simple example of building a Delta Live Tables pipeline from start to finish.
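
    To give a flavor of what such a pipeline looks like in code, here is a minimal sketch of two DLT datasets declared in a Databricks notebook. It assumes the dlt module and spark session provided by the DLT runtime, and the dataset and source table names (including the samples.nyctaxi.trips table) are illustrative assumptions rather than the chapter’s own example:

    import dlt

    @dlt.table(
        name="trips_bronze",
        comment="Raw trip records loaded from an illustrative source table."
    )
    def trips_bronze():
        # `spark` is injected by the DLT runtime; the samples table is an assumption.
        return spark.read.table("samples.nyctaxi.trips")

    @dlt.table(
        name="trips_silver",
        comment="Trips with an elementary cleanup rule applied."
    )
    def trips_silver():
        # dlt.read() declares a dependency on another dataset in the same pipeline,
        # which lets DLT infer the execution graph for you.
        return dlt.read("trips_bronze").where("trip_distance > 0")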

    Chapter 2, Applying Data Transformations Using Delta Live Tables, explores data transformations using Delta Live Tables, guiding you through the process of cleaning, refining, and enriching data to meet specific business requirements. You will learn how to use Delta Live Tables to ingest data from a variety of input sources, register datasets in Unity Catalog, and effectively apply changes to downstream tables.
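
    As a rough sketch of these two ideas together (the landing path, column names, and table names below are hypothetical), Auto Loader can feed a raw view that APPLY CHANGES then merges into a downstream streaming table:

    import dlt

    @dlt.view(name="customer_updates_raw")
    def customer_updates_raw():
        # Auto Loader ("cloudFiles") incrementally discovers new files in a
        # hypothetical landing path.
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/landing/customer_updates")
        )

    # Declare the target streaming table, then apply change records into it.
    dlt.create_streaming_table("customers")

    dlt.apply_changes(
        target="customers",
        source="customer_updates_raw",
        keys=["customer_id"],       # key used to match incoming records
        sequence_by="updated_at",   # ordering column for out-of-order events
        stored_as_scd_type=1        # keep only the latest version of each row
    )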

    Chapter 3, Managing Data Quality Using Delta Live Tables, introduces several techniques for enforcing data quality requirements on newly arriving data. You will learn how to define data quality constraints using Expectations in the Delta Live Tables framework, as well as monitor the data quality of a pipeline in near real time.
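
    For reference, expectations are declared as decorators on a DLT dataset; the three variants below log, drop, or fail on violations. The dataset and column names carry over from the earlier sketch and remain assumptions:

    import dlt

    @dlt.table(comment="Trip records that satisfy basic data quality rules.")
    @dlt.expect("non_negative_fare", "fare_amount >= 0")            # record violations, keep rows
    @dlt.expect_or_drop("positive_distance", "trip_distance > 0")   # drop violating rows
    @dlt.expect_or_fail("pickup_time_present", "tpep_pickup_datetime IS NOT NULL")  # fail the update
    def trips_validated():
        return dlt.read("trips_bronze")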

    Chapter 4, Scaling DLT Pipelines, explains how to scale a Delta Live Tables (DLT) pipeline to handle the unpredictable demands of a typical production environment. You will take a deep dive into configuring pipeline settings using the DLT UI and Databricks Pipeline REST API. You will also gain a better understanding of the daily DLT maintenance tasks that are run in the background and how to optimize table layouts to improve performance.
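
    As a hedged sketch of the REST API approach, the snippet below fetches a pipeline’s current specification and resubmits it with an enhanced autoscaling range; the workspace URL, access token, and pipeline ID are placeholders you would supply yourself:

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder
    PIPELINE_ID = "<pipeline-id>"                            # placeholder
    headers = {"Authorization": f"Bearer {TOKEN}"}

    # The PUT endpoint replaces the whole pipeline spec, so start from the current one.
    spec = requests.get(
        f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=headers
    ).json()["spec"]

    # Set enhanced autoscaling on the default cluster and submit the spec back.
    spec["clusters"] = [{
        "label": "default",
        "autoscale": {"min_workers": 1, "max_workers": 8, "mode": "ENHANCED"}
    }]
    requests.put(
        f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=headers, json=spec
    ).raise_for_status()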

    Chapter 5, Mastering Data Governance in the Lakehouse with Unity Catalog, provides a comprehensive guide to enhancing data governance and compliance of your lakehouse using Unity Catalog. You will learn how to enable Unity Catalog on a Databricks workspace, enable data discovery using metadata tags, and implement fine-grained row and column-level access control of datasets.
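
    As one small, hedged illustration of column-level control (the catalog, schema, table, column, and group names are invented for this sketch), a Unity Catalog column mask can be built from a SQL function and attached to a sensitive column from a notebook:

    # Define a masking function, then attach it to a column as a Unity Catalog column mask.
    spark.sql("""
        CREATE OR REPLACE FUNCTION main.healthcare.mask_ssn(ssn STRING)
        RETURN CASE
            WHEN is_account_group_member('compliance_admins') THEN ssn
            ELSE '***-**-****'
        END
    """)

    spark.sql("""
        ALTER TABLE main.healthcare.patients
        ALTER COLUMN ssn SET MASK main.healthcare.mask_ssn
    """)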

    Chapter 6, Managing Data Locations in Unity Catalog, explores how to effectively manage storage locations using Unity Catalog. You will learn how to govern data access across various roles and departments within an organization while ensuring security and auditability with the Databricks Data Intelligence Platform.
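
    The sketch below shows the general shape of registering an external location and a volume from a notebook; the cloud path, storage credential, and object names are placeholders, and the credential is assumed to exist already:

    # Register an external location backed by an existing storage credential,
    # then expose part of it as a Unity Catalog volume.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS landing_zone
        URL 's3://example-company-landing/raw'
        WITH (STORAGE CREDENTIAL landing_zone_credential)
    """)

    spark.sql("""
        CREATE EXTERNAL VOLUME IF NOT EXISTS main.default.landing_files
        LOCATION 's3://example-company-landing/raw/files'
    """)

    # Files in the volume become addressable through the /Volumes path.
    display(dbutils.fs.ls("/Volumes/main/default/landing_files"))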

    Chapter 7, Viewing Data Lineage Using Unity Catalog, discusses tracing data origins, visualizing data transformations, and identifying upstream and downstream dependencies using data lineage in Unity Catalog. By the end of the chapter, you will be equipped with the skills needed to validate that data is coming from trusted sources.
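
    The sketch below shows one way to call the lineage REST API; the endpoint shape reflects the lineage-tracking API at the time of writing, and the workspace URL, token, and table name are placeholders:

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                        # placeholder

    # Retrieve upstream and downstream table lineage for a fully qualified table.
    resp = requests.get(
        f"{HOST}/api/2.0/lineage-tracking/table-lineage",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"table_name": "main.default.trips_silver", "include_entity_lineage": "true"},
    )
    resp.raise_for_status()
    lineage = resp.json()
    print(lineage.get("upstreams", []))
    print(lineage.get("downstreams", []))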

    Chapter 8, Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform, covers deploying DLT pipelines using the Databricks Terraform provider. You will learn how to set up a local development environment and automate a continuous build and deployment pipeline, along with best practices and future considerations.

    Chapter 9, Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment, explores how DABs can be used to streamline the deployment of data analytics projects and improve cross-team collaboration. You will gain an understanding of the practical use of DABs through several hands-on examples.

    Chapter 10, Monitoring Data Pipelines in Production, delves into the crucial task of monitoring data pipelines in Databricks. You will learn various mechanisms for tracking pipeline health, performance, and data quality within the Databricks Data Intelligence Platform.
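
    One such mechanism is the DLT event log, which can be queried from a notebook like any other dataset; the sketch below pulls the expectation (data quality) metrics emitted during pipeline updates, with the pipeline ID left as a placeholder:

    # Query a DLT pipeline's event log for the data quality metrics recorded
    # during each update.
    quality_events = spark.sql("""
        SELECT
            timestamp,
            details:flow_progress.data_quality.expectations AS expectations
        FROM event_log('<pipeline-id>')
        WHERE event_type = 'flow_progress'
          AND details:flow_progress.data_quality.expectations IS NOT NULL
        ORDER BY timestamp DESC
    """)
    display(quality_events)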

    To get the most out of this book

    While not mandatory, it’s recommended that you have beginner-level knowledge of Python and Apache Spark, as well as some familiarity with navigating the Databricks Data Intelligence Platform, to get the most out of this book. It’s also recommended that you have the following dependencies installed locally in order to follow along with the hands-on exercises and code examples throughout the book:

    Furthermore, it’s recommended that you have a Databricks account and workspace to log in, import notebooks, create clusters, and create new data pipelines. If you do not have a Databricks account, you can sign up for a free trial on the Databricks website at https://www.databricks.com/try-databricks.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Building-Modern-Data-Applications-Using-Databricks-Lakehouse. If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The result of the data generator notebook should be three tables in total: youtube_channels, youtube_channel_artists, and combined_table.

    A block of code is set as follows:

    @dlt.table(
        name="random_trip_data_raw",
        comment="The raw taxi trip data ingested from a landing zone.",
        table_properties={
            "quality": "bronze"
        }
    )

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    @dlt.table(
        name="random_trip_data_raw",
        comment="The raw taxi trip data ingested from a landing zone.",
        table_properties={
            "quality": "bronze",
            "pipelines.autoOptimize.managed": "false"
        }
    )

    Any command-line input or output is written as follows:

    $ databricks bundle validate

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: Click the Run all button at the top right of the Databricks workspace to execute all the notebook cells, verifying that all cells execute successfully.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Share Your Thoughts

    Once you’ve read Building Modern Data Applications Using Databricks Lakehouse, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    https://packt.link/free-ebook/978-1-80107-323-3

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly.

    Part 1: Near-Real-Time Data Pipelines for the Lakehouse

    In this first part of the book, we’ll introduce the core concepts of the Delta Live Tables (DLT) framework. We’ll cover how to ingest data from a variety of input sources and apply the latest changes to downstream tables. We’ll also explore how to enforce requirements on incoming data so that your data teams can be alerted to potential data quality issues that might contaminate your lakehouse.

    This part contains the following chapters:

    Chapter 1, An Introduction to Delta Live Tables

    Chapter 2, Applying Data Transformations Using Delta Live Tables

    Chapter 3, Managing Data Quality Using Delta Live Tables

    Chapter 4, Scaling DLT Pipelines
