Practical lab
We have a bronze table that a third-party tool loads into our data lake. We have been asked to clean up the data and resolve its known issues. Your task is to write the Python code that addresses each one.
The known issues are:
- Wrong column name: The date column is spelled wrong
- Nulls not correctly identified: The sales_id column has null values as NA strings
- Data with missing values is unwanted: Any data with a null in sales_id should be dropped
- Duplicate sales_id: Take the first value of any duplicate rows
- Date column not DateType: The date column is not a DateType
Loading the problem data
The following code will create our bronze table:
bronze_sales = spark.createDataFrame(data = [
("1", "LA", "2000-01-01",5, 1400),
("2", "LA", "1998-2-01",4, 1500),
...