top of page
Screenshot 2568-08-16 at 9.58.31 PM.png

[Ep 4] PHASE 1: Cleaning Data in SQL

  • Writer: Puii Duangtip
    Puii Duangtip
  • Aug 18
  • 8 min read

Updated: Sep 1

Where insight begins with mess.


The Cleaning Principles

Before any dashboard or DAX, it started here: the untangling.

Before diving into queries, I followed a few core principles:

  • Understand the "why" behind the mess: (In this case, the chaos was intentional — built to reflect real-world data challenges.)

  • Clean with purpose: Look for missing values, text inconsistencies (Mon vs. Monday), logical outliers (like age 150 or loyalty points = 99999), and data type issues.

  • Document decisions: Every assumption. Every correction. Keep a clear audit trail.

  • Preserve the raw: Never overwrite the original — clean in layers or views.

  • Use the right tools: SQL for structure. Power BI for story.



Focus — Why I’m Doing This

Not just cleaning for cleaning’s sake.

My goal: simulate a real marketing data analyst workflow, with all the mess and ambiguity that comes with it — and strengthen two core areas:

Technical Skills
Analytical Thinking

Level up my practical toolkit.


In SQL:

  • Write clean, scalable queries that can stand up to larger datasets.



In Power BI:

  • Learn to calculate advanced metrics by using precise DAX logic — RFM analysis, retention rate, LTV, promo ROI.

Sharpen how I interpret and question data.


In SQL:

  • Ask deeper questions.

  • Why is this field null? Should it be zero?

  • What logic might explain this outlier?


In Power BI:

  • Build clarity into every chart.

  • What metrics matter to stakeholders?

  • How can I visualize complexity without clutter?



Cleaning SQL — Dataset by Dataset 🗃️

Filename: .csv — and what I did with each.

Each file came with its own mess — here’s a breakdown of how I handled them.


1. customer_profiles.csv

SQL Cleaning Query with Explanations

Step 1: Check for duplicate customer_id

SELECT 
	customer_id,
	count(*) 
FROM customer_profiles
GROUP BY customer_id
HAVING count(*) > 1;

This identifies any customer_id that appears more than once — which should never happen.

Step 2: Clean email, phone_number, and create contact_completion column

ree
ree
-- email_cleaned
	CASE 
		WHEN TRIM(email) = '' THEN NULL
      	ELSE email
	END AS email_cl,
    
-- phone_number_cleaned
	CASE 
		WHEN LENGTH(REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')) < 10 THEN NULL
		WHEN phone_number IS NOT NULL THEN REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')		
		ELSE phone_number
	END AS phone_number_cl,
    
-- contact_completion 
	CASE
		WHEN email IS NULL OR TRIM(email) = ''OR phone_number IS NULL OR LENGTH(REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')) < 10 THEN 0
		ELSE 1
	END AS contact_completed

Step 3: Calculate age and create age_group for segmentation

ree

-- age (corrected to handle invalid birth_year)
    CASE
        WHEN birth_year IS NULL OR birth_year <= 1900 OR birth_year > YEAR(CURDATE()) THEN NULL -- Handle invalid or future birth years
        ELSE YEAR(CURDATE()) - birth_year
    END AS age,
    
    
-- age_group (corrected to handle invalid birth_year)
    CASE
        WHEN birth_year IS NULL OR birth_year <= 1900 OR birth_year > YEAR(CURDATE()) THEN NULL -- Handle invalid or future birth years
        WHEN YEAR(CURDATE()) - birth_year < 18 THEN 'Under 18'
        WHEN YEAR(CURDATE()) - birth_year < 25 THEN '18-24'
        WHEN YEAR(CURDATE()) - birth_year < 35 THEN '25-34'
        WHEN YEAR(CURDATE()) - birth_year < 45 THEN '35-44'
        WHEN YEAR(CURDATE()) - birth_year < 55 THEN '45-54'
        WHEN YEAR(CURDATE()) - birth_year < 65 THEN '55-64'
        WHEN YEAR(CURDATE()) - birth_year >= 65 THEN '65+' -- Changed from > 64 to >= 65 for clarity
        ELSE NULL
    END AS age_group,

Step 4: Reclassify loyalty_points and create loyalty_tier

Code snippet showing SQL CASE statements for cleaning loyalty points and assigning tiers based on points: Bronze, Silver, Gold, and Platinum.
-- loyalty_points_cleaned
	CASE 
		WHEN loyalty_points = 99999 THEN 999
		ELSE loyalty_points
	END AS loyalty_points,

-- create loyalty_tier
	CASE
		WHEN loyalty_points = 0   THEN 'Bronze'
        	WHEN loyalty_points < 300 THEN 'Silver'
        	WHEN loyalty_points < 750 THEN 'Gold'
        	WHEN loyalty_points >= 750 THEN 'Platinum'
        	ELSE NULL
	END AS loyalty_tier

Step 5: Create the cleaned table

SQL code snippet creates a customer profiles table. Highlights email, phone, and birth year cleaning. Dark background with colored text.
SQL code snippet handling age and age group categorization with error checks for invalid birth years, shown on a dark background.
SQL code snippet for customer profiles classifies loyalty tiers and contact completion. Uses "CASE" statements; keywords in yellow.
CREATE TABLE customer_profiles_cl AS
SELECT
	DISTINCT customer_id, -- ## Select unique customer_id
	first_name,
	last_name,
    
-- email_cleaned
	CASE 
		WHEN TRIM(email) = '' THEN NULL
		ELSE email
	END AS email_cl,
    
-- phone_number_cleaned
	CASE 
		WHEN LENGTH(REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')) < 10 THEN NULL
		WHEN phone_number IS NOT NULL THEN REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')		
		ELSE phone_number
	END AS phone_number_cl,
    
	gender,
    
-- birth_year_cleaned
	CASE 
		WHEN birth_year > 2025 THEN NULL
		ELSE birth_year
	END AS birth_year_cl,
     
	birth_month,
       
-- age (corrected to handle invalid birth_year)
	CASE
		WHEN birth_year IS NULL OR birth_year <= 1900 OR birth_year > YEAR(CURDATE()) THEN NULL -- Handle invalid or future birth years
		ELSE YEAR(CURDATE()) - birth_year
	END AS age,
     
-- age_group (corrected to handle invalid birth_year)
	CASE
        	WHEN birth_year IS NULL OR birth_year <= 1900 OR birth_year > YEAR(CURDATE()) THEN NULL -- Handle invalid or future birth years
        	WHEN YEAR(CURDATE()) - birth_year < 18 THEN 'Under 18'
        	WHEN YEAR(CURDATE()) - birth_year < 25 THEN '18-24'
        	WHEN YEAR(CURDATE()) - birth_year < 35 THEN '25-34'
        	WHEN YEAR(CURDATE()) - birth_year < 45 THEN '35-44'
       	WHEN YEAR(CURDATE()) - birth_year < 55 THEN '45-54'
        	WHEN YEAR(CURDATE()) - birth_year < 65 THEN '55-64'
        	WHEN YEAR(CURDATE()) - birth_year >= 65 THEN '65+' -- Changed from > 64 to >= 65 for clarity
        	ELSE NULL
    	END AS age_group,
    
	date(member_since) AS member_since, -- ## Change to date format 
	date(last_visit_date) AS last_visit_date,
    
-- loyalty_points_cleaned
	CASE 
		WHEN loyalty_points = 99999 THEN 999
		ELSE loyalty_points
	END AS loyalty_points_cl,
	
-- loyalty_tier
	CASE
		WHEN loyalty_points = 0   THEN 'Bronze'
		WHEN loyalty_points < 300 THEN 'Silver'
        	WHEN loyalty_points < 750 THEN 'Gold'
        	WHEN loyalty_points >= 751 THEN 'Platinum'
        	ELSE NULL
END AS loyalty_tier,
		
	spending_propensity,
    
-- contact_completion 
	CASE
		WHEN email IS NULL OR TRIM(email) = ''
		OR phone_number IS NULL OR LENGTH(REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')) < 10
		THEN 0
        	ELSE 1
	END AS contact_completed
    
FROM dim_customer_profiles 
WHERE customer_id IS NOT NULL AND TRIM(customer_id) <> ''; -- ## Remove when customer_id is blank

2. fiscal_calendar.csv

SQL Cleaning Query + Create the cleaned table

CREATE TABLE fiscal_calendar_cl AS
SELECT
	date(date) AS date,
    
-- day_of_week_cleaned
	CASE
		WHEN LOWER(TRIM(day_of_week)) IN ('monday', 'mon')    THEN 'Mon'
        	WHEN LOWER(TRIM(day_of_week)) IN ('tuesday', 'tue')   THEN 'Tue'
        	WHEN LOWER(TRIM(day_of_week)) IN ('wednesday', 'wed') THEN 'Wed'
        	WHEN LOWER(TRIM(day_of_week)) IN ('thursday', 'thu')  THEN 'Thu'
        	WHEN LOWER(TRIM(day_of_week)) IN ('friday', 'fri')    THEN 'Fri'
        	WHEN LOWER(TRIM(day_of_week)) IN ('saturday', 'sat')  THEN 'Sat'
        	WHEN LOWER(TRIM(day_of_week)) IN ('sunday', 'sun')    THEN 'Sun'
        	ELSE 'Unknown' -- Handle any unexpected values
    	END AS day_of_week_cl,
    
    	day_of_month,
    	week_of_year,
    
-- month_name_cleaned
    	CASE
		WHEN LOWER(TRIM(month_name)) = 'jan' THEN 'January'
		WHEN LOWER(TRIM(month_name)) = 'jul' THEN 'July'
		ELSE month_name
	END AS month_name_cl,
    
    	quarter_name,
    	calendar_year,
    	fiscal_week,
    	fiscal_month_num,
    	fiscal_month_name,
    	fiscal_quarter,
    	fiscal_year,
    	is_holiday,
    	holiday_name,
    	is_easter_turkey_ad_period,
    	is_easter_turkey_sales_period,
    
    	CONCAT(SUBSTRING(month_name,1,3)," '",SUBSTRING(fiscal_year,3,2)) AS fiscal_month_year
        
FROM dim_fiscal_calendar
WHERE date IS NOT NULL and trim(date) <> '';

3. product_catalog.csv

SQL Cleaning Query with Explanations

Step 1: Check for duplicate product_id

ree
SELECT
	DISTINCT -- ## select unique product_id
	CASE
		WHEN LENGTH(product_id) < 5 THEN CONCAT(SUBSTRING(product_id,1,1),0,SUBSTRING(product_id,-3,3))
		ELSE product_id
	END AS product_id_cl

Step 2: Review product_name for typos or missing values

ree
CASE
	WHEN LOWER(TRIM(product_name)) = "Nat\'S Hearty Breakfast" THEN "Nat's Hearty Breakfast"
	WHEN LOWER(TRIM(product_name)) = "Bc Chicken Burger" THEN "BC Chicken Burger"
	ELSE TRIM(product_name)
END AS product_name_cl

Step 3: Validate category alignment

Ensure product_name matches the assigned category.

For example:

  • "Turkey Dinner" → Seasonal

  • "Classic Burger" → Burgers

ree
CASE
	WHEN lower(product_name) LIKE "%cheesecake%" THEN "Dessert"
	WHEN lower(product_name) LIKE "%milkshake%" THEN "Drink"
	WHEN lower(product_name) LIKE "%zoo sticks%" THEN "Appetizer"
	WHEN lower(product_name) LIKE "%pirate%" THEN "Kids Meal"
	WHEN lower(product_name) LIKE "%salad%" THEN "Salad"
	WHEN lower(product_name) LIKE "%breakfast%" THEN "Breakfast"
	WHEN lower(product_name) LIKE "%lobster%" THEN "Seafood"
	WHEN lower(product_name) LIKE "%fish%" THEN "Seafood"
	WHEN lower(product_name) LIKE "%cod%" THEN "Seafood"
	WHEN lower(product_name) LIKE "%spaghetti%" THEN "Pasta"
	WHEN lower(product_name) LIKE "%burger%" THEN "Burger"
	WHEN lower(trim(product_name)) = "unknown product" THEN category
	ELSE NULL
END AS category_cl

Step 4: Create the cleaned table

CREATE TABLE product_catalog_cl AS
SELECT
	DISTINCT -- ## select unique product_id
	CASE
		WHEN LENGTH(product_id) < 5 THEN CONCAT(
							SUBSTRING(product_id,1,1),
	            				0,
	            				SUBSTRING(product_id,-3,3))
		ELSE product_id
	END AS product_id_cl,
    
-- product_name_cleaned
	CASE
		WHEN LOWER(TRIM(product_name)) = "Nat\'S Hearty Breakfast" 
			THEN "Nat's Hearty Breakfast"
		WHEN LOWER(TRIM(product_name)) = "Bc Chicken Burger" 
			THEN "BC Chicken Burger"
        	WHEN LOWER(TRIM(product_name)) = "Kids\' Mac & Cheese Kids" 
			THEN "Kid's Mac & Cheese"
        	WHEN LOWER(TRIM(product_name)) = "Pirate Pak - Grilled 
			Cheese Kids" THEN "Kid's Pirate Pak - Grilled Cheese"
        	WHEN LOWER(TRIM(product_name)) = "Hamburger Kids" 
			THEN "Kid's Hamburger"
		WHEN LOWER(TRIM(product_name)) = "Kids' Chicken Strips 
			Kids" THEN "Kid's Chicken Strips"
        	WHEN LOWER(TRIM(product_name)) = "Chicken Strips Kids" THEN 
			"Kid's Chicken Strips"
        	WHEN LOWER(TRIM(product_name)) = "Kids' Grilled Cheese 
			Kids" THEN "Kid's Grilled Cheese"
		WHEN LOWER(TRIM(product_name)) = "Kids' Mini Pizza Kids" 
			THEN "Kid's Mini Pizza"
        	WHEN LOWER(TRIM(product_name)) = "Kids\' Combo Meal Kids" 
			THEN "Kid's Combo Meal"
        	WHEN LOWER(TRIM(product_name)) = "Kids' Cheeseburger Kids" 
			THEN "Kid's Cheese Burger"
		ELSE TRIM(product_name)
	END AS product_name_cl,
    
-- category_cleaned
    	CASE
		WHEN LOWER(product_name) LIKE "%kid%" THEN "Kid's Meal"
        	WHEN LOWER(product_name) LIKE "%vegan%" THEN "Vegan"
		WHEN LOWER(product_name) LIKE "plant-based%" THEN "Vegan"
        	WHEN LOWER(product_name) LIKE "%turkey%" THEN "Promotion"
        	WHEN LOWER(product_name) LIKE "%sauce%" THEN "Side"
        	WHEN LOWER(product_name) LIKE "%brownie%" THEN "Dessert"
        	WHEN LOWER(product_name) LIKE "%apple pie%" THEN "Dessert"
		WHEN LOWER(product_name) LIKE "%cheesecake%" THEN "Dessert"
        	WHEN LOWER(product_name) LIKE "%cookie%" THEN "Dessert"
        	WHEN LOWER(product_name) LIKE "%milkshake%" THEN "Drink"
        	WHEN LOWER(product_name) LIKE "%smoothie%" THEN "Drink"
        	WHEN LOWER(product_name) LIKE "%fruit juice%" THEN "Drink"
        	WHEN LOWER(product_name) LIKE "%zoo sticks%" THEN "Appetizer"
        	WHEN LOWER(product_name) LIKE "%pirate%" THEN "Kids Meal"
        	WHEN LOWER(product_name) LIKE "%sandwich%" THEN "Sandwich"
		WHEN LOWER(product_name) LIKE "%salad%" THEN "Salad"
        	WHEN LOWER(product_name) LIKE "%breakfast%" THEN "Breakfast"
        	WHEN LOWER(product_name) LIKE "%prawn%" THEN "Seafood"
        	WHEN LOWER(product_name) LIKE "%lobster%" THEN "Seafood"
        	WHEN LOWER(product_name) LIKE "%fish%" THEN "Seafood"
        	WHEN LOWER(product_name) LIKE "%cod%" THEN "Seafood"
        	WHEN LOWER(product_name) LIKE "%spaghetti%" THEN "Pasta"
        	WHEN LOWER(product_name) LIKE "%burger%" THEN "Burger"
		WHEN LOWER(product_name) LIKE "%wrap%" THEN "Wrap"
        	WHEN LOWER(TRIM(product_name)) = "unknown product" THEN category
        	ELSE TRIM(category)
	END AS category_cl,
    
    	base_price,
    	is_active,
    	campaign_affinity
    
FROM dim_product_catalog
WHERE product_id IS NOT NULL AND trim(product_id) <> '';

4. sales_detail.csv

SQL Cleaning Query with Explanations

Step 1: Check for duplicate transaction_id

-- check duplicate rows 
SELECT * FROM sales_detail 
ORDER BY transaction_id 

-- select distinct transaction_id 
SELECT DISTINCT transaction_id FROM sales_detail

Step 2: Validate positive values

Code snippet showing SQL statements with ABS function on quantity, unit_price, and total_amount. Background is dark, text is light.
ABS(quantity) AS quantity,
ABS(unit_price) AS unit_price,        
ABS(total_amount) AS total_amount_cl,

-- ABS = absolute number

Step 3: Check for product_id

SQL code snippet showing a CASE statement for cleaning product IDs. Highlights include "TRIM," "LENGTH," and "CONCAT" functions.
CASE 
	WHEN LENGTH(TRIM(product_id)) < 5 THEN
	CONCAT('P','0', SUBSTRING(product_id,-3,3))
	ELSE product_id
END AS product_id_cl,

Step 4: Create the cleaned table

Code snippet showing SQL query for creating `sales_detail_cl` table. Includes columns for transaction data, product ID formatting, and conditions.
SQL script in dark theme with syntax highlighting. Conversion cases for days and months. Notable colors include orange and blue.
CREATE TABLE sales_detail_cl AS

SELECT
	DISTINCT  transaction_id,
	customer_id,
	store_id,

-- product_id_cleaned
	CASE 
		WHEN LENGTH(TRIM(product_id)) < 5 THEN
		CONCAT('P','0', SUBSTRING(product_id,-3,3))
		ELSE product_id
	END AS product_id_cl,

DATE(sales_date) AS sales_date, -- ##format sale_date

ABS(quantity) AS quantity,
ABS(unit_price) AS unit_price,  
ABS(total_amount) AS total_amount_cl,

payment_method,
transaction_channel,  
is_returned,
returned_date,
promo_type_applied,
temperature_celsius,
weather_condition, 

CASE
	WHEN LOWER(TRIM(day_of_week)) IN ('monday', 'mon')    THEN 'Mon'
	WHEN LOWER(TRIM(day_of_week)) IN ('tuesday', 'tue')   THEN 'Tue'
	WHEN LOWER(TRIM(day_of_week)) IN ('wednesday', 'wed') THEN 'Wed'
	WHEN LOWER(TRIM(day_of_week)) IN ('thursday', 'thu')  THEN 'Thu'
	WHEN LOWER(TRIM(day_of_week)) IN ('friday', 'fri')    THEN 'Fri'
	WHEN LOWER(TRIM(day_of_week)) IN ('saturday', 'sat')  THEN 'Sat'
	WHEN LOWER(TRIM(day_of_week)) IN ('sunday', 'sun')    THEN 'Sun'
	ELSE 'Unknown' -- Handle any unexpected values
	END AS day_of_week_cl,  

CASE
	WHEN LOWER(TRIM(month_name)) = 'jan' THEN 'January'
	WHEN LOWER(TRIM(month_name)) = 'jul' THEN 'July'
	ELSE month_name
	END AS month_name_cl, 

fiscal_week,
fiscal_month_num,
fiscal_month_name,
fiscal_quarter,
fiscal_year

FROM fact_sales_detail

WHERE transaction_id IS NOT NULL AND trim(transaction_id) <> '';

5. store_locations.csv

SQL Cleaning Query + Create the cleaned table

CREATE TABLE store_locations_cl AS
SELECT
	DISTINCT store_id,
	city_province,
	CASE 
		WHEN trim(city_province) = 'Kelowna' THEN 'Vancouver Island'
        	WHEN trim(city_province) = 'Victoria' THEN 'Vancouver Island'
        	ELSE region
	END AS region_cl,
    
    	guest_capacity,
    	DATE(opening_date) AS opening_date,
    	store_price_multiplier
			    
FROM dim_store_locations
WHERE store_id IS NOT NULL AND TRIM(store_id) <> ''; 

From above syntax, Kelowna’s a city — my bad. You can fix it or just leave it if you want to compare the numbers in Power BI. Either way, now you see how clean data shapes the dashboards.

6. marketing_campaigns.csv

SQL Cleaning Query + Create the cleaned table

CREATE TABLE marketing_campaigns_cl AS
SELECT 
	campaign_name,
	promo_type_used,
	fiscal_year,
	ABS(estimated_ad_spend) AS estimated_ad_spend_cl,
	ABS(estimated_operational_cost) AS estimated_operational_cost_cl,
	marketing_channel_cl
FROM fact_marketing_campaigns

The data was scrubbed clean. The noise was gone.


Next: [Ep.5] PHASE 2 — Modeling in Power BI: I move the rows into their new home — the place where numbers tell a story, and patterns quietly emerge.



💡 Part of my Data × Design Series — follow the journey from raw interview challenge to polished dashboards.


Table of Contents

Comments


bottom of page