Data Cleaning in SQL
This process ensures that the data is clean, consistent, and ready for analysis.
01. Handle Missing Values:
The SQL functions
• COALESCE() (standard SQL, available in most databases)
• IFNULL() (MySQL and SQLite)
• ISNULL() (SQL Server)
replace missing (NULL) values with a fallback value. Let's explore how these functions work with a
common example.
1.1 - COALESCE():
Returns the first non-NULL value from the list of arguments.
Example:
Suppose you have a table called employees with columns for base_salary, bonus, and
total_compensation, where some values in bonus and total_compensation might be NULL.
Case 01:
SELECT name,
       base_salary,
       COALESCE(bonus, 0) AS bonus,
       COALESCE(total_compensation, base_salary + COALESCE(bonus, 0)) AS total_compensation
FROM employees;
Explanation:
• If bonus is NULL, it is replaced with 0.
• If total_compensation is NULL, it is replaced with base_salary + bonus, where a NULL bonus counts as 0.
Case 02:
SELECT name,
       base_salary,
       COALESCE(bonus, base_salary, 0) AS bonus,
       COALESCE(total_compensation, base_salary + COALESCE(bonus, 0)) AS total_compensation
FROM employees;
Explanation:
Here, COALESCE() will give you the value from bonus if it's available, or it will take base_salary if
bonus is missing. If both are missing, it will return 0.
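Both cases can be checked on a small in-memory table. This is a minimal sketch that assumes SQLite (through Python's sqlite3 module) as the engine; the two sample rows are hypothetical, and only the column names come from the example above.

```python
import sqlite3

# Hypothetical sample data; only the column names come from the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(name TEXT, base_salary REAL, bonus REAL, total_compensation REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Ana", 5000, 500, 5500),   # nothing missing
        ("Ben", 4000, None, None),  # bonus and total_compensation are NULL
    ],
)

# Case 01: a NULL bonus becomes 0, and a NULL total is recomputed.
case_01 = conn.execute(
    """
    SELECT name,
           COALESCE(bonus, 0) AS bonus,
           COALESCE(total_compensation,
                    base_salary + COALESCE(bonus, 0)) AS total_compensation
    FROM employees
    ORDER BY name
    """
).fetchall()

# Case 02: a NULL bonus falls back to base_salary, then to 0.
case_02 = conn.execute(
    "SELECT name, COALESCE(bonus, base_salary, 0) AS bonus "
    "FROM employees ORDER BY name"
).fetchall()

print(case_01)
print(case_02)
```

For Ben, Case 01 reports a bonus of 0 while Case 02 reports his base salary as the bonus, so the order of the arguments decides which fallback wins.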
1.2 - IFNULL():
IFNULL() checks whether a value is NULL (missing) and, if it is, replaces it with a value you choose.
If the value isn't NULL, it returns the original value. It is available in MySQL and SQLite.
Example:
Suppose you have a table called employees with columns for base_salary, bonus, and
total_compensation, where some values in bonus and total_compensation might be NULL.
SELECT name,
       base_salary,
       IFNULL(bonus, 0) AS bonus,
       IFNULL(total_compensation, base_salary + IFNULL(bonus, 0)) AS total_compensation
FROM employees;
Explanation:
• COALESCE(): Can take multiple arguments and returns the first non-NULL value.
• IFNULL(): Only takes two arguments — if the first one is NULL, it returns the second.
1.3 - ISNULL():
In SQL Server, ISNULL(value, replacement) returns replacement when value is NULL and value
otherwise, so it plays the same role as IFNULL() does in MySQL. MySQL also has a function named
ISNULL(), but it takes a single argument and returns 1 or 0 instead of a replacement value.
IFNULL() and ISNULL() are simpler and more focused than COALESCE(), but support only two arguments.
Example:
Suppose you have a table called employees with columns for base_salary, bonus, and
total_compensation, where some values in bonus and total_compensation might be NULL.
SELECT name,
       base_salary,
       ISNULL(bonus, 0) AS bonus,
       ISNULL(total_compensation, base_salary + ISNULL(bonus, 0)) AS total_compensation
FROM employees;
Here, ISNULL() checks if the bonus is NULL. If it is, it returns 0, otherwise it returns the original
bonus.
02. Remove Duplicates:
In SQL, both DISTINCT and ROW_NUMBER() can be used to remove duplicates, but they work in
different ways and serve different purposes. Let's go through each method with an example and
explain the difference between them.
2.1 - DISTINCT:
The DISTINCT keyword removes duplicate rows from the result set. It checks all the columns you
specify and returns only unique rows.
Example:
Suppose you have the following table of employee records with some duplicate entries:
You want to remove the duplicate rows (same name, department, and salary).
Query:
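The table and query for this example are not shown here, so the following is only a sketch of what such a query might look like. It assumes SQLite through Python's sqlite3 module, a table named employees, and four hypothetical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed table name and hypothetical rows, for illustration only.
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Alice", "HR", 50000),
        ("Bob", "IT", 60000),
        ("Alice", "HR", 50000),  # exact duplicate of the first row
        ("Bob", "IT", 65000),    # same name and department, different salary
    ],
)

# DISTINCT compares every selected column, so only exact copies collapse.
unique_rows = conn.execute(
    "SELECT DISTINCT name, department, salary "
    "FROM employees ORDER BY name, salary"
).fetchall()

print(unique_rows)
```

The repeated Alice row disappears, but both Bob rows stay because their salaries differ. DISTINCT only removes rows that match on every selected column.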
2.2 - ROW_NUMBER():
The ROW_NUMBER() function assigns a unique number to each row within a partition of data,
based on an ORDER BY clause. You can use this to identify and remove duplicates by keeping only
the first occurrence of each set of duplicates.
Example:
If you want to keep only one row for each employee based on name and department but remove
duplicates based on the combination of those columns, you can use ROW_NUMBER().
WITH RankedEmployees AS (
    SELECT employee_id, name, department, salary,
           ROW_NUMBER() OVER (PARTITION BY name, department ORDER BY employee_id) AS row_num
    FROM employees
)
SELECT employee_id, name, department, salary
FROM RankedEmployees
WHERE row_num = 1;
Here, ROW_NUMBER() numbers the rows within each (name, department) group, starting at 1 and
following employee_id order. We keep only the rows where row_num = 1, so each group contributes
the row with its lowest employee_id and the other duplicates are dropped.
• DISTINCT:
  • Removes duplicates by considering all specified columns.
  • Does not allow you to control which duplicate to keep.
  • Simpler to use if you just want to remove duplicates based on exact matches.
• ROW_NUMBER():
  • Provides more flexibility by assigning a unique number to each row.
  • Allows you to remove duplicates based on more complex logic, such as deciding which duplicate to keep (based on ordering).
  • Useful when you need to retain more control over how duplicates are handled (e.g., keeping the latest or the earliest record based on another column).
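To illustrate the point about choosing which duplicate survives, the sketch below runs the ROW_NUMBER() query twice, once ordered ascending and once descending. It assumes SQLite 3.25 or newer (window functions) through Python's sqlite3 module; the rows and the helper keep_one_per_group are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, name TEXT, department TEXT, salary INTEGER)"
)
# Hypothetical rows: each person appears twice.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "HR", 50000),
        (2, "Bob", "IT", 60000),
        (3, "Alice", "HR", 52000),
        (4, "Bob", "IT", 60000),
    ],
)


def keep_one_per_group(direction):
    """Keep the first row of each (name, department) group in the given order."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    query = f"""
        WITH RankedEmployees AS (
            SELECT employee_id, name, department, salary,
                   ROW_NUMBER() OVER (
                       PARTITION BY name, department
                       ORDER BY employee_id {direction}
                   ) AS row_num
            FROM employees
        )
        SELECT employee_id, name, department, salary
        FROM RankedEmployees
        WHERE row_num = 1
        ORDER BY employee_id
    """
    return conn.execute(query).fetchall()


earliest = keep_one_per_group("ASC")   # lowest employee_id per group
latest = keep_one_per_group("DESC")    # highest employee_id per group

print(earliest)
print(latest)
```

Only the ORDER BY direction changes between the two calls, yet Alice's surviving salary differs (50000 versus 52000). DISTINCT cannot express that kind of choice.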
03. Standardize Text:
3.1 - LOWER():
Converts all characters in a string to lowercase.
Input Table:
name
John Doe
SARAH SMITH
MiKe JOHN
Output Table:
standardized_name
john doe
sarah smith
mike john
3.2 - UPPER():
Converts all characters in a string to uppercase.
Input Table:
name
John Doe
SARAH SMITH
MiKe JOHN
Output Table:
standardized_name
JOHN DOE
SARAH SMITH
MIKE JOHN
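The queries behind the two output tables are not shown, so here is a minimal sketch of how they might be produced. It assumes SQLite through Python's sqlite3 module and a hypothetical table named customers; the three names are the ones from the input table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")  # hypothetical table name
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("John Doe",), ("SARAH SMITH",), ("MiKe JOHN",)],
)

# LOWER() and UPPER() applied to the same column, in insertion order.
standardized = conn.execute(
    "SELECT name, LOWER(name) AS lower_name, UPPER(name) AS upper_name "
    "FROM customers ORDER BY rowid"
).fetchall()

for row in standardized:
    print(row)
```

Standardizing case this way is useful before comparing or grouping names, since 'John Doe' and 'JOHN DOE' would otherwise count as different values.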
3.3 - TRIM():
Removes leading and trailing spaces from a string.
Example:
Suppose you have a table named customer_feedback, which contains customer reviews. Some of
these reviews have leading and trailing spaces, which can affect data analysis and reporting.
feedback_id review
1 Great product!
2 Excellent service!
3 Average quality.
4 Not satisfied with the service.
5 Would buy again!
SELECT
feedback_id,
TRIM(review) AS cleaned_review
FROM
customer_feedback;
feedback_id cleaned_review
1 Great product!
2 Excellent service!
3 Average quality.
4 Not satisfied with the service.
5 Would buy again!
Explanation
• Feedback IDs: Remain unchanged.
• Cleaned Reviews: The TRIM() function removes any leading and trailing spaces from each
review. This ensures that the feedback is clean and ready for further analysis, such as
sentiment analysis or reporting.
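Because padding is invisible in a printed table, it can help to measure it. The sketch below assumes SQLite through Python's sqlite3 module and uses three reviews with hypothetical amounts of padding; it reports how many characters TRIM() removed from each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_feedback (feedback_id INTEGER, review TEXT)")
# The amount of padding on each review is hypothetical.
conn.executemany(
    "INSERT INTO customer_feedback VALUES (?, ?)",
    [
        (1, "  Great product!  "),     # 2 leading + 2 trailing spaces
        (2, "Excellent service!   "),  # 3 trailing spaces
        (3, "   Average quality."),    # 3 leading spaces
    ],
)

# Compare the length before and after trimming to make the padding visible.
cleaned = conn.execute(
    """
    SELECT feedback_id,
           TRIM(review) AS cleaned_review,
           LENGTH(review) - LENGTH(TRIM(review)) AS spaces_removed
    FROM customer_feedback
    ORDER BY feedback_id
    """
).fetchall()

for row in cleaned:
    print(row)
```

A non-zero spaces_removed value flags the rows that actually needed cleaning.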
04. CORRECT INCONSISTENT DATA:
Correcting inconsistent data in SQL can often involve string manipulation functions such as
SUBSTR() and CONCAT(). Let’s create a scenario where we specifically need to use SUBSTR() and
CONCAT() together to correct inconsistent data.
Example:
Imagine you have a table named products, which stores product codes in inconsistent formats.
Some product codes may have leading or trailing spaces, and some may have additional
characters that need to be standardized.
product_id product_code
1 ABC-123
2 def456
3 ghi-789
4 JKL-0-001
5 MNO_234
Objective
1. Remove any leading or trailing spaces from the product codes.
2. Ensure all product codes follow a standard format: the code should start with "PROD-",
followed by a numeric part extracted from the existing product code.
UPDATE products
SET product_code = CONCAT('PROD-',
SUBSTR(TRIM(product_code),
INSTR(TRIM(product_code), '-') + 1));
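product_id product_code
1 PROD-123
2 PROD-def456
3 PROD-789
4 PROD-0-001
5 PROD-MNO_234
Only rows 1 and 3 reach the target format. The query keeps everything after the first -, so a code
without a hyphen (rows 2 and 5) is kept whole, and a code with two hyphens (row 4) keeps its second
hyphen. Those rows need an additional rule before every code looks like PROD-123.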
Explanation of the Query
1. TRIM(product_code): This removes leading and trailing spaces from each product code.
2. INSTR(TRIM(product_code), '-') + 1: This finds the position of the first - in the trimmed
product code and adds 1 to get the position of the character after it. If the code contains no -,
INSTR() returns 0, so the result is 1, the start of the string.
3. SUBSTR(..., INSTR(...) + 1): This extracts everything after the first -. For codes with exactly
one hyphen, such as ABC-123, that is the numeric part of the product code.
4. CONCAT('PROD-', ...): This concatenates "PROD-" with the extracted numeric part to
create the standardized product code.
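The expression can be tried on the five sample codes before running the UPDATE. This sketch assumes SQLite through Python's sqlite3 module, where string concatenation is written with || because CONCAT() is missing from older SQLite versions. The padding on the first code is hypothetical. The second expression is an alternative, not the method above: it assumes the numeric part is always the last three characters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, product_code TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [
        (1, " ABC-123 "),  # hypothetical leading/trailing spaces
        (2, "def456"),
        (3, "ghi-789"),
        (4, "JKL-0-001"),
        (5, "MNO_234"),
    ],
)

comparison = conn.execute(
    """
    SELECT product_id,
           -- Everything after the first hyphen (same logic as the UPDATE).
           'PROD-' || SUBSTR(TRIM(product_code),
                             INSTR(TRIM(product_code), '-') + 1) AS after_hyphen,
           -- Assumption: the numeric part is the last three characters.
           'PROD-' || SUBSTR(TRIM(product_code), -3) AS last_three
    FROM products
    ORDER BY product_id
    """
).fetchall()

for row in comparison:
    print(row)
```

On this data the hyphen-based expression gives the PROD-123 style only for codes with exactly one hyphen, while the last-three-characters variant handles all five. That variant would break on codes whose numeric part has a different length, so previewing with a SELECT like this before updating is the safer habit.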
05. CHANGE DATA TYPES:
You can use CAST() and CONVERT() in SQL to change data types of columns or values, and they
are often used for converting between string, numeric, and date formats. Below is an example
that demonstrates both CAST() and CONVERT() functions.
Example Scenario
We have a table sales with columns for sale_id, sale_amount, and sale_date. You want to:
1. Convert sale_id (which is an integer) into a string for a report.
2. Convert sale_amount into a DECIMAL with two decimal places.
3. Convert sale_date (which is a DATETIME) into a VARCHAR in a specific format: dd/mm/yyyy.
SELECT
CAST(sale_id AS VARCHAR(10)) AS sale_id_string,
CAST(sale_amount AS DECIMAL(10, 2)) AS sale_amount_decimal,
CONVERT(VARCHAR(10), sale_date, 103) AS sale_date_formatted
FROM sales;
2. CONVERT():
• Versatile: Allows additional formatting, particularly with DATETIME types.
• Specific to SQL Server: Offers flexibility for converting and formatting dates,
numbers, etc.
Example: Converting DATETIME to string with a specific format:
SELECT CONVERT(VARCHAR(10), sale_date, 103) FROM sales;
Summary:
• Use CAST() when you need simple, straightforward data type conversion that is portable
across different database systems.
• Use CONVERT() in SQL Server when you need to apply specific formatting, especially for
DATETIME values or when you need more control over the output format.
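CONVERT() with a style code is specific to SQL Server, so it cannot be run everywhere. As a portable illustration, the sketch below assumes SQLite through Python's sqlite3 module: CAST() works as described, and SQLite's strftime() stands in for CONVERT(..., 103) to produce dd/mm/yyyy. The sample row, and storing sale_amount as text, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER, sale_amount TEXT, sale_date TEXT)")
# Hypothetical row; sale_amount is stored as text to show a string-to-number cast.
conn.execute("INSERT INTO sales VALUES (1, '199.50', '2024-09-26 14:30:00')")

converted = conn.execute(
    """
    SELECT CAST(sale_id AS TEXT)             AS sale_id_string,
           typeof(CAST(sale_id AS TEXT))     AS sale_id_type,
           CAST(sale_amount AS REAL)         AS sale_amount_number,
           strftime('%d/%m/%Y', sale_date)   AS sale_date_formatted
    FROM sales
    """
).fetchone()

print(converted)
```

The CAST() calls carry over to other engines with only the type names changing, while the date formatting call is the part that differs per database.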
06. HANDLE DATE FORMAT ISSUES:
When handling date format issues in SQL, particularly in MySQL, we use functions like
STR_TO_DATE(), EXTRACT(), NOW(), and DATE_FORMAT() to manipulate and extract dates from
various formats.
Let’s explore these functions with a practical example using a table named orders.
Example Scenario:
You have an orders table where:
• order_id stores the order identification numbers.
• order_date stores the date as a string in inconsistent formats (e.g., DD/MM/YYYY, MM-DD-YYYY).
• You need to:
1. Convert these string-formatted dates into actual DATE types.
2. Extract specific parts of the date (like year or month).
3. Format the date into a more user-friendly format for reporting purposes.
4. Get the current date for comparison purposes.
SELECT
order_id,
EXTRACT(YEAR FROM STR_TO_DATE(order_date, '%d/%m/%Y')) AS order_year,
EXTRACT(MONTH FROM STR_TO_DATE(order_date, '%d/%m/%Y')) AS order_month
FROM orders
WHERE order_id = 1;
For order_id = 1, the stored date '26/09/2024' yields year 2024 and month 9.
SELECT
NOW() AS current_datetime
FROM orders
LIMIT 1;
current_datetime
2024-10-01 12:30:00
This would return the current system date and time at the time of query execution. In this case, it
is assumed to be 2024-10-01 12:30:00.
SELECT
    order_id,
    DATE_FORMAT(STR_TO_DATE(order_date, '%d/%m/%Y'), '%M %d, %Y') AS formatted_order_date
FROM orders
WHERE order_id = 1;
order_id formatted_order_date
1 September 26, 2024
SELECT
    order_id,
    STR_TO_DATE(order_date, '%d/%m/%Y') AS formatted_date1,
    STR_TO_DATE(order_date, '%m-%d-%Y') AS formatted_date2,
    EXTRACT(YEAR FROM STR_TO_DATE(order_date, '%d/%m/%Y')) AS order_year,
    EXTRACT(MONTH FROM STR_TO_DATE(order_date, '%d/%m/%Y')) AS order_month,
    DATE_FORMAT(STR_TO_DATE(order_date, '%d/%m/%Y'), '%M %d, %Y') AS formatted_date3,
    NOW() AS current_datetime
FROM orders;
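Each order_date matches only one of the two formats, and in MySQL STR_TO_DATE() returns NULL for a string that does not match the pattern it is given. A common remedy is to try the known formats in turn. The sketch below shows that idea in plain Python with datetime.strptime instead of MySQL; the helper parse_order_date and the sample strings other than '26/09/2024' are hypothetical.

```python
from datetime import datetime

# The two formats named in the scenario: DD/MM/YYYY and MM-DD-YYYY.
KNOWN_FORMATS = ("%d/%m/%Y", "%m-%d-%Y")


def parse_order_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # wrong format, try the next one
    return None


first = parse_order_date("26/09/2024")     # DD/MM/YYYY
second = parse_order_date("09-26-2024")    # MM-DD-YYYY
unparsed = parse_order_date("2024.09.26")  # neither format

# Extract parts and build a report label, like EXTRACT() and DATE_FORMAT().
order_year, order_month = first.year, first.month
report_label = first.strftime("%B %d, %Y")

print(first, second, unparsed, order_year, order_month, report_label)
```

Returning None for unmatched strings mirrors the NULL that MySQL produces, which makes rows with unexpected formats easy to find and review.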
04. Attempt to Insert Invalid Data (Will Fail Due to CHECK Constraint)
Error Message:
06. Attempt to Insert Invalid Order (Will Fail Due to FOREIGN KEY Constraint)
Error Message:
Summary
• The CHECK constraint on the Customers table ensures that only customers with valid
ages (18-100) are inserted.
• The FOREIGN KEY constraint on the Orders table ensures that orders can only reference
valid customers from the Customers table.
By enforcing these constraints, the database maintains integrity and prevents invalid or
inconsistent data from being entered.
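The table definitions and error messages for this section are not shown, so the following is only a sketch of the two constraints described in the summary. It assumes SQLite through Python's sqlite3 module; all column names and sample rows are hypothetical, and the error text comes from SQLite and will differ in other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schemas; only the table names and the 18-100 rule come from the text.
conn.execute(
    """
    CREATE TABLE Customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        age         INTEGER CHECK (age BETWEEN 18 AND 100)
    )
    """
)
conn.execute(
    """
    CREATE TABLE Orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customers(customer_id)
    )
    """
)


def try_insert(sql, params):
    """Run one INSERT and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    try_insert("INSERT INTO Customers VALUES (?, ?, ?)", (1, "Alice", 30)),
    try_insert("INSERT INTO Customers VALUES (?, ?, ?)", (2, "Tim", 15)),  # age below 18
    try_insert("INSERT INTO Orders VALUES (?, ?)", (100, 1)),
    try_insert("INSERT INTO Orders VALUES (?, ?)", (101, 99)),  # no customer 99
]

for outcome in results:
    print(outcome)
```

The second and fourth inserts are rejected by the database itself, so the invalid rows never reach the tables regardless of which application tried to write them.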
08. HANDLE NUMERIC VALUES:
You can handle numeric values using the functions ROUND(), CEIL(), FLOOR(), and ABS() in SQL.
Here is a single dataset with examples of how each function works.
Consider the following table, sales_data:
sale_id sale_amount
1 234.567
2 -78.423
3 456.789
4 123.001
5 -65.999
I. ROUND() Function
The ROUND() function is used to round a number to a specified number of decimal places.
SELECT
sale_id,
sale_amount,
ROUND(sale_amount, 2) AS rounded_amount_2_decimals,
ROUND(sale_amount, 0) AS rounded_to_nearest_integer
FROM sales_data;
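The four functions can be compared side by side on the sample data. This sketch assumes SQLite through Python's sqlite3 module. Not every SQLite build includes CEIL() and FLOOR(), so the sketch registers Python's math.ceil and math.floor under those names; that registration is a workaround for the sketch, not something MySQL or SQL Server needs.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Workaround: make CEIL/FLOOR available regardless of how SQLite was built.
conn.create_function("CEIL", 1, math.ceil)
conn.create_function("FLOOR", 1, math.floor)

conn.execute("CREATE TABLE sales_data (sale_id INTEGER, sale_amount REAL)")
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?)",
    [(1, 234.567), (2, -78.423), (3, 456.789), (4, 123.001), (5, -65.999)],
)

numeric_results = conn.execute(
    """
    SELECT sale_id,
           ROUND(sale_amount, 2) AS rounded_amount_2_decimals,
           ROUND(sale_amount, 0) AS rounded_to_nearest_integer,
           CEIL(sale_amount)     AS ceiling_amount,
           FLOOR(sale_amount)    AS floor_amount,
           ABS(sale_amount)      AS absolute_amount
    FROM sales_data
    ORDER BY sale_id
    """
).fetchall()

for row in numeric_results:
    print(row)
```

The negative rows show the difference most clearly: for -78.423, CEIL() moves up toward zero (-78) while FLOOR() moves down away from it (-79).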