You're reading from Modern Data Architectures with Python: A practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python

Product type Paperback

Published in Sep 2023

Publisher Packt

ISBN-13 9781801070492

Length 318 pages

Edition 1st Edition

Languages

Python

Tools

MLflow

Concepts

Data Science

Author (1):

Brian Lipp

Table of Contents (19) Chapters

Preface

1. Part 1: Fundamental Data Knowledge

2. Chapter 1: Modern Data Processing Architecture

3. Chapter 2: Understanding Data Analytics

4. Part 2: Data Engineering Toolset

5. Chapter 3: Apache Spark Deep Dive

6. Chapter 4: Batch and Stream Data Processing Using PySpark

7. Chapter 5: Streaming Data with Kafka

8. Part 3: Modernizing the Data Platform

9. Chapter 6: MLOps

10. Chapter 7: Data and Information Visualization

11. Chapter 8: Integrating Continuous Integration into Your Workflow

12. Chapter 9: Orchestrating Your Data Workflows

13. Part 4: Hands-on Project

14. Chapter 10: Data Governance

15. Chapter 11: Building out the Groundwork

16. Chapter 12: Completing Our Project

17. Index

18. Other Books You May Enjoy

Practical lab

A third-party tool loads a bronze table into our data lake. We have been asked to clean up the data and resolve its known issues. Your task is to write the Python code needed to address each of the following issues.

The known issues are:

Wrong column name: The name of the date column is misspelled
Nulls not correctly identified: The sales_id column stores null values as the string NA
Data with missing values is unwanted: Any row with a null sales_id should be dropped
Duplicate sales_id: Keep only the first row for each duplicated sales_id
Date column not DateType: The date column is stored as a string and must be converted to DateType

Loading the problem data

The following code will create our bronze table:

bronze_sales = spark.createDataFrame(data = [
    ("1", "LA", "2000-01-01",5, 1400),
    ("2", "LA", "1998-2-01",4, 1500),
   ...
