[Ep 4] PHASE 1: Cleaning Data in SQL
- Puii Duangtip
- Aug 18
- 8 min read
Updated: Sep 1
Where insight begins with mess.
The Cleaning Principles
Before any dashboard or DAX, it started here: the untangling.
Before diving into queries, I followed a few core principles:
Understand the "why" behind the mess: (In this case, the chaos was intentional — built to reflect real-world data challenges.)
Clean with purpose: Look for missing values, text inconsistencies (Mon vs. Monday), logical outliers (like age 150 or loyalty points = 99999), and data type issues.
Document decisions: Every assumption. Every correction. Keep a clear audit trail.
Preserve the raw: Never overwrite the original — clean in layers or views.
Use the right tools: SQL for structure. Power BI for story.
Focus — Why I’m Doing This
Not just cleaning for cleaning’s sake.
My goal: simulate a real marketing data analyst workflow, with all the mess and ambiguity that comes with it — and strengthen two core areas:
Technical Skills | Analytical Thinking |
Level up my practical toolkit. In SQL:
In Power BI:
| Sharpen how I interpret and question data. In SQL:
In Power BI:
|
Cleaning SQL — Dataset by Dataset 🗃️
Filename: .csv — and what I did with each.
Each file came with its own mess — here’s a breakdown of how I handled them.
1. customer_profiles.csv
SQL Cleaning Query with Explanations
Step 1: Check for duplicate customer_id
SELECT
customer_id,
count(*)
FROM customer_profiles
GROUP BY customer_id
HAVING count(*) > 1;This identifies any customer_id that appears more than once — which should never happen.
Step 2: Clean email, phone_number, and create contact_completion column


-- email_cleaned
CASE
WHEN TRIM(email) = '' THEN NULL
ELSE email
END AS email_cl,
-- phone_number_cleaned
CASE
WHEN LENGTH(REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')) < 10 THEN NULL
WHEN phone_number IS NOT NULL THEN REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')
ELSE phone_number
END AS phone_number_cl,
-- contact_completion
CASE
WHEN email IS NULL OR TRIM(email) = ''OR phone_number IS NULL OR LENGTH(REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')) < 10 THEN 0
ELSE 1
END AS contact_completedStep 3: Calculate age and create age_group for segmentation

-- age (corrected to handle invalid birth_year)
CASE
WHEN birth_year IS NULL OR birth_year <= 1900 OR birth_year > YEAR(CURDATE()) THEN NULL -- Handle invalid or future birth years
ELSE YEAR(CURDATE()) - birth_year
END AS age,
-- age_group (corrected to handle invalid birth_year)
CASE
WHEN birth_year IS NULL OR birth_year <= 1900 OR birth_year > YEAR(CURDATE()) THEN NULL -- Handle invalid or future birth years
WHEN YEAR(CURDATE()) - birth_year < 18 THEN 'Under 18'
WHEN YEAR(CURDATE()) - birth_year < 25 THEN '18-24'
WHEN YEAR(CURDATE()) - birth_year < 35 THEN '25-34'
WHEN YEAR(CURDATE()) - birth_year < 45 THEN '35-44'
WHEN YEAR(CURDATE()) - birth_year < 55 THEN '45-54'
WHEN YEAR(CURDATE()) - birth_year < 65 THEN '55-64'
WHEN YEAR(CURDATE()) - birth_year >= 65 THEN '65+' -- Changed from > 64 to >= 65 for clarity
ELSE NULL
END AS age_group,Step 4: Reclassify loyalty_points and create loyalty_tier

-- loyalty_points_cleaned
CASE
WHEN loyalty_points = 99999 THEN 999
ELSE loyalty_points
END AS loyalty_points,
-- create loyalty_tier
CASE
WHEN loyalty_points = 0 THEN 'Bronze'
WHEN loyalty_points < 300 THEN 'Silver'
WHEN loyalty_points < 750 THEN 'Gold'
WHEN loyalty_points >= 750 THEN 'Platinum'
ELSE NULL
END AS loyalty_tierStep 5: Create the cleaned table



CREATE TABLE customer_profiles_cl AS
SELECT
DISTINCT customer_id, -- ## Select unique customer_id
first_name,
last_name,
-- email_cleaned
CASE
WHEN TRIM(email) = '' THEN NULL
ELSE email
END AS email_cl,
-- phone_number_cleaned
CASE
WHEN LENGTH(REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')) < 10 THEN NULL
WHEN phone_number IS NOT NULL THEN REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')
ELSE phone_number
END AS phone_number_cl,
gender,
-- birth_year_cleaned
CASE
WHEN birth_year > 2025 THEN NULL
ELSE birth_year
END AS birth_year_cl,
birth_month,
-- age (corrected to handle invalid birth_year)
CASE
WHEN birth_year IS NULL OR birth_year <= 1900 OR birth_year > YEAR(CURDATE()) THEN NULL -- Handle invalid or future birth years
ELSE YEAR(CURDATE()) - birth_year
END AS age,
-- age_group (corrected to handle invalid birth_year)
CASE
WHEN birth_year IS NULL OR birth_year <= 1900 OR birth_year > YEAR(CURDATE()) THEN NULL -- Handle invalid or future birth years
WHEN YEAR(CURDATE()) - birth_year < 18 THEN 'Under 18'
WHEN YEAR(CURDATE()) - birth_year < 25 THEN '18-24'
WHEN YEAR(CURDATE()) - birth_year < 35 THEN '25-34'
WHEN YEAR(CURDATE()) - birth_year < 45 THEN '35-44'
WHEN YEAR(CURDATE()) - birth_year < 55 THEN '45-54'
WHEN YEAR(CURDATE()) - birth_year < 65 THEN '55-64'
WHEN YEAR(CURDATE()) - birth_year >= 65 THEN '65+' -- Changed from > 64 to >= 65 for clarity
ELSE NULL
END AS age_group,
date(member_since) AS member_since, -- ## Change to date format
date(last_visit_date) AS last_visit_date,
-- loyalty_points_cleaned
CASE
WHEN loyalty_points = 99999 THEN 999
ELSE loyalty_points
END AS loyalty_points_cl,
-- loyalty_tier
CASE
WHEN loyalty_points = 0 THEN 'Bronze'
WHEN loyalty_points < 300 THEN 'Silver'
WHEN loyalty_points < 750 THEN 'Gold'
WHEN loyalty_points >= 751 THEN 'Platinum'
ELSE NULL
END AS loyalty_tier,
spending_propensity,
-- contact_completion
CASE
WHEN email IS NULL OR TRIM(email) = ''
OR phone_number IS NULL OR LENGTH(REPLACE(REPLACE(REPLACE(CAST(phone_number AS CHAR(50)),'a',''),'b',''),'c','')) < 10
THEN 0
ELSE 1
END AS contact_completed
FROM dim_customer_profiles
WHERE customer_id IS NOT NULL AND TRIM(customer_id) <> ''; -- ## Remove when customer_id is blank2. fiscal_calendar.csv
SQL Cleaning Query + Create the cleaned table
CREATE TABLE fiscal_calendar_cl AS
SELECT
date(date) AS date,
-- day_of_week_cleaned
CASE
WHEN LOWER(TRIM(day_of_week)) IN ('monday', 'mon') THEN 'Mon'
WHEN LOWER(TRIM(day_of_week)) IN ('tuesday', 'tue') THEN 'Tue'
WHEN LOWER(TRIM(day_of_week)) IN ('wednesday', 'wed') THEN 'Wed'
WHEN LOWER(TRIM(day_of_week)) IN ('thursday', 'thu') THEN 'Thu'
WHEN LOWER(TRIM(day_of_week)) IN ('friday', 'fri') THEN 'Fri'
WHEN LOWER(TRIM(day_of_week)) IN ('saturday', 'sat') THEN 'Sat'
WHEN LOWER(TRIM(day_of_week)) IN ('sunday', 'sun') THEN 'Sun'
ELSE 'Unknown' -- Handle any unexpected values
END AS day_of_week_cl,
day_of_month,
week_of_year,
-- month_name_cleaned
CASE
WHEN LOWER(TRIM(month_name)) = 'jan' THEN 'January'
WHEN LOWER(TRIM(month_name)) = 'jul' THEN 'July'
ELSE month_name
END AS month_name_cl,
quarter_name,
calendar_year,
fiscal_week,
fiscal_month_num,
fiscal_month_name,
fiscal_quarter,
fiscal_year,
is_holiday,
holiday_name,
is_easter_turkey_ad_period,
is_easter_turkey_sales_period,
CONCAT(SUBSTRING(month_name,1,3)," '",SUBSTRING(fiscal_year,3,2)) AS fiscal_month_year
FROM dim_fiscal_calendar
WHERE date IS NOT NULL and trim(date) <> '';3. product_catalog.csv
SQL Cleaning Query with Explanations
Step 1: Check for duplicate product_id

SELECT
DISTINCT -- ## select unique product_id
CASE
WHEN LENGTH(product_id) < 5 THEN CONCAT(SUBSTRING(product_id,1,1),0,SUBSTRING(product_id,-3,3))
ELSE product_id
END AS product_id_clStep 2: Review product_name for typos or missing values

CASE
WHEN LOWER(TRIM(product_name)) = "Nat\'S Hearty Breakfast" THEN "Nat's Hearty Breakfast"
WHEN LOWER(TRIM(product_name)) = "Bc Chicken Burger" THEN "BC Chicken Burger"
ELSE TRIM(product_name)
END AS product_name_clStep 3: Validate category alignment
Ensure product_name matches the assigned category.
For example:
"Turkey Dinner" → Seasonal
"Classic Burger" → Burgers

CASE
WHEN lower(product_name) LIKE "%cheesecake%" THEN "Dessert"
WHEN lower(product_name) LIKE "%milkshake%" THEN "Drink"
WHEN lower(product_name) LIKE "%zoo sticks%" THEN "Appetizer"
WHEN lower(product_name) LIKE "%pirate%" THEN "Kids Meal"
WHEN lower(product_name) LIKE "%salad%" THEN "Salad"
WHEN lower(product_name) LIKE "%breakfast%" THEN "Breakfast"
WHEN lower(product_name) LIKE "%lobster%" THEN "Seafood"
WHEN lower(product_name) LIKE "%fish%" THEN "Seafood"
WHEN lower(product_name) LIKE "%cod%" THEN "Seafood"
WHEN lower(product_name) LIKE "%spaghetti%" THEN "Pasta"
WHEN lower(product_name) LIKE "%burger%" THEN "Burger"
WHEN lower(trim(product_name)) = "unknown product" THEN category
ELSE NULL
END AS category_clStep 4: Create the cleaned table
CREATE TABLE product_catalog_cl AS
SELECT
DISTINCT -- ## select unique product_id
CASE
WHEN LENGTH(product_id) < 5 THEN CONCAT(
SUBSTRING(product_id,1,1),
0,
SUBSTRING(product_id,-3,3))
ELSE product_id
END AS product_id_cl,
-- product_name_cleaned
CASE
WHEN LOWER(TRIM(product_name)) = "Nat\'S Hearty Breakfast"
THEN "Nat's Hearty Breakfast"
WHEN LOWER(TRIM(product_name)) = "Bc Chicken Burger"
THEN "BC Chicken Burger"
WHEN LOWER(TRIM(product_name)) = "Kids\' Mac & Cheese Kids"
THEN "Kid's Mac & Cheese"
WHEN LOWER(TRIM(product_name)) = "Pirate Pak - Grilled
Cheese Kids" THEN "Kid's Pirate Pak - Grilled Cheese"
WHEN LOWER(TRIM(product_name)) = "Hamburger Kids"
THEN "Kid's Hamburger"
WHEN LOWER(TRIM(product_name)) = "Kids' Chicken Strips
Kids" THEN "Kid's Chicken Strips"
WHEN LOWER(TRIM(product_name)) = "Chicken Strips Kids" THEN
"Kid's Chicken Strips"
WHEN LOWER(TRIM(product_name)) = "Kids' Grilled Cheese
Kids" THEN "Kid's Grilled Cheese"
WHEN LOWER(TRIM(product_name)) = "Kids' Mini Pizza Kids"
THEN "Kid's Mini Pizza"
WHEN LOWER(TRIM(product_name)) = "Kids\' Combo Meal Kids"
THEN "Kid's Combo Meal"
WHEN LOWER(TRIM(product_name)) = "Kids' Cheeseburger Kids"
THEN "Kid's Cheese Burger"
ELSE TRIM(product_name)
END AS product_name_cl,
-- category_cleaned
CASE
WHEN LOWER(product_name) LIKE "%kid%" THEN "Kid's Meal"
WHEN LOWER(product_name) LIKE "%vegan%" THEN "Vegan"
WHEN LOWER(product_name) LIKE "plant-based%" THEN "Vegan"
WHEN LOWER(product_name) LIKE "%turkey%" THEN "Promotion"
WHEN LOWER(product_name) LIKE "%sauce%" THEN "Side"
WHEN LOWER(product_name) LIKE "%brownie%" THEN "Dessert"
WHEN LOWER(product_name) LIKE "%apple pie%" THEN "Dessert"
WHEN LOWER(product_name) LIKE "%cheesecake%" THEN "Dessert"
WHEN LOWER(product_name) LIKE "%cookie%" THEN "Dessert"
WHEN LOWER(product_name) LIKE "%milkshake%" THEN "Drink"
WHEN LOWER(product_name) LIKE "%smoothie%" THEN "Drink"
WHEN LOWER(product_name) LIKE "%fruit juice%" THEN "Drink"
WHEN LOWER(product_name) LIKE "%zoo sticks%" THEN "Appetizer"
WHEN LOWER(product_name) LIKE "%pirate%" THEN "Kids Meal"
WHEN LOWER(product_name) LIKE "%sandwich%" THEN "Sandwich"
WHEN LOWER(product_name) LIKE "%salad%" THEN "Salad"
WHEN LOWER(product_name) LIKE "%breakfast%" THEN "Breakfast"
WHEN LOWER(product_name) LIKE "%prawn%" THEN "Seafood"
WHEN LOWER(product_name) LIKE "%lobster%" THEN "Seafood"
WHEN LOWER(product_name) LIKE "%fish%" THEN "Seafood"
WHEN LOWER(product_name) LIKE "%cod%" THEN "Seafood"
WHEN LOWER(product_name) LIKE "%spaghetti%" THEN "Pasta"
WHEN LOWER(product_name) LIKE "%burger%" THEN "Burger"
WHEN LOWER(product_name) LIKE "%wrap%" THEN "Wrap"
WHEN LOWER(TRIM(product_name)) = "unknown product" THEN category
ELSE TRIM(category)
END AS category_cl,
base_price,
is_active,
campaign_affinity
FROM dim_product_catalog
WHERE product_id IS NOT NULL AND trim(product_id) <> '';4. sales_detail.csv
SQL Cleaning Query with Explanations
Step 1: Check for duplicate transaction_id
-- check duplicate rows
SELECT * FROM sales_detail
ORDER BY transaction_id
-- select distinct transaction_id
SELECT DISTINCT transaction_id FROM sales_detailStep 2: Validate positive values

ABS(quantity) AS quantity,
ABS(unit_price) AS unit_price,
ABS(total_amount) AS total_amount_cl,
-- ABS = absolute numberStep 3: Check for product_id

CASE
WHEN LENGTH(TRIM(product_id)) < 5 THEN
CONCAT('P','0', SUBSTRING(product_id,-3,3))
ELSE product_id
END AS product_id_cl,Step 4: Create the cleaned table


CREATE TABLE sales_detail_cl AS
SELECT
DISTINCT transaction_id,
customer_id,
store_id,
-- product_id_cleaned
CASE
WHEN LENGTH(TRIM(product_id)) < 5 THEN
CONCAT('P','0', SUBSTRING(product_id,-3,3))
ELSE product_id
END AS product_id_cl,
DATE(sales_date) AS sales_date, -- ##format sale_date
ABS(quantity) AS quantity,
ABS(unit_price) AS unit_price,
ABS(total_amount) AS total_amount_cl,
payment_method,
transaction_channel,
is_returned,
returned_date,
promo_type_applied,
temperature_celsius,
weather_condition,
CASE
WHEN LOWER(TRIM(day_of_week)) IN ('monday', 'mon') THEN 'Mon'
WHEN LOWER(TRIM(day_of_week)) IN ('tuesday', 'tue') THEN 'Tue'
WHEN LOWER(TRIM(day_of_week)) IN ('wednesday', 'wed') THEN 'Wed'
WHEN LOWER(TRIM(day_of_week)) IN ('thursday', 'thu') THEN 'Thu'
WHEN LOWER(TRIM(day_of_week)) IN ('friday', 'fri') THEN 'Fri'
WHEN LOWER(TRIM(day_of_week)) IN ('saturday', 'sat') THEN 'Sat'
WHEN LOWER(TRIM(day_of_week)) IN ('sunday', 'sun') THEN 'Sun'
ELSE 'Unknown' -- Handle any unexpected values
END AS day_of_week_cl,
CASE
WHEN LOWER(TRIM(month_name)) = 'jan' THEN 'January'
WHEN LOWER(TRIM(month_name)) = 'jul' THEN 'July'
ELSE month_name
END AS month_name_cl,
fiscal_week,
fiscal_month_num,
fiscal_month_name,
fiscal_quarter,
fiscal_year
FROM fact_sales_detail
WHERE transaction_id IS NOT NULL AND trim(transaction_id) <> '';5. store_locations.csv
SQL Cleaning Query + Create the cleaned table
CREATE TABLE store_locations_cl AS
SELECT
DISTINCT store_id,
city_province,
CASE
WHEN trim(city_province) = 'Kelowna' THEN 'Vancouver Island'
WHEN trim(city_province) = 'Victoria' THEN 'Vancouver Island'
ELSE region
END AS region_cl,
guest_capacity,
DATE(opening_date) AS opening_date,
store_price_multiplier
FROM dim_store_locations
WHERE store_id IS NOT NULL AND TRIM(store_id) <> ''; From above syntax, Kelowna’s a city — my bad. You can fix it or just leave it if you want to compare the numbers in Power BI. Either way, now you see how clean data shapes the dashboards.
6. marketing_campaigns.csv
SQL Cleaning Query + Create the cleaned table
CREATE TABLE marketing_campaigns_cl AS
SELECT
campaign_name,
promo_type_used,
fiscal_year,
ABS(estimated_ad_spend) AS estimated_ad_spend_cl,
ABS(estimated_operational_cost) AS estimated_operational_cost_cl,
marketing_channel_cl
FROM fact_marketing_campaignsThe data was scrubbed clean. The noise was gone.
Next: [Ep.5] PHASE 2 — Modeling in Power BI: I move the rows into their new home — the place where numbers tell a story, and patterns quietly emerge.
💡 Part of my Data × Design Series — follow the journey from raw interview challenge to polished dashboards.

![[Ep.5] PHASE 2: Modeling in Power BI](https://static.wixstatic.com/media/11293b29c7ba41eabc197a8afc7e6851.jpg/v1/fill/w_980,h_980,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/11293b29c7ba41eabc197a8afc7e6851.jpg)
![[Ep 3] Designing My Own Dataset (Gemini + Python)](https://static.wixstatic.com/media/11062b_a8cb8f8acd304aaf8c7c6e281c16e533~mv2.jpg/v1/fill/w_980,h_842,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/11062b_a8cb8f8acd304aaf8c7c6e281c16e533~mv2.jpg)
![[Ep.1-2] The Technical Interview That Turned Into a Data Analytics Masterclass](https://static.wixstatic.com/media/ead171_dd3b799463944bc49518a4345d770e4b~mv2.jpg/v1/fill/w_980,h_668,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/ead171_dd3b799463944bc49518a4345d770e4b~mv2.jpg)
Comments