
[Ep 3] Designing My Own Dataset (Gemini + Python)

  • Writer: Puii Duangtip
  • Aug 18, 2025
  • 19 min read

Updated: Sep 1, 2025

Not your average spreadsheet. Prompt. Generate. Repeat.


Good data doesn’t just arrive clean.

You have to build it — or better, design it to ask better questions.


I wanted something that mimicked the unpredictability marketers face every day.

A thinking environment that made me ask:

“What does this unlock?”
“What new behavior might this explain?”

So I opened Gemini. Then ChatGPT.


“Give me a loyalty dataset with incomplete contact info.”
“Now generate transactions across dine-in, delivery, and takeout.”
“Add campaign overlap, bad weather, and a ghost customer or two.”


Every prompt made the data weirder. And more useful.


I built for MESS.

  • NULLs, duplicates, misaligned dates

  • Conflicting category labels

  • Multi-location, multi-channel transactions

  • Four overlapping campaigns, none of them fully aligned


The kind of dataset where six teams touch the same system, everyone follows their own logic — and you’re the one asked to “make sense of it.”
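That kind of mess doesn't have to be hand-typed — a few lines of pandas can dirty a clean table on purpose. A minimal sketch, using a hypothetical customer table (the column names here are illustrative, not the ones the script below generates):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A small, clean starting table (hypothetical loyalty customers).
df = pd.DataFrame({
    "customer_id": [f"C{i:03d}" for i in range(1, 101)],
    "email": [f"user{i}@example.com" for i in range(1, 101)],
    "signup_date": pd.date_range("2024-01-01", periods=100, freq="D"),
})

# NULLs: blank out ~10% of emails.
df.loc[rng.choice(df.index, size=10, replace=False), "email"] = np.nan

# Misaligned dates: push a few signups ten years into the future.
df.loc[rng.choice(df.index, size=3, replace=False), "signup_date"] += pd.Timedelta(days=3650)

# Duplicates: re-append a handful of now-dirty rows.
df = pd.concat([df, df.sample(5, random_state=0)], ignore_index=True)

print(df["email"].isna().sum(), "missing emails")
print(df.duplicated().sum(), "duplicate rows")  # exactly 5
```

Same idea as the prompts above, just deterministic: you decide which flavors of mess go in, and how much.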



🧾 The Blueprint

  • Customers — some loyal, some half-filled, some clearly faking their birthday

  • Transactions — 10,000+ rows across locations, seasons, and weather

  • Products — filtered by menu type, delivery flags, and category confusion

  • Campaigns — overlapping dates, mixed goals, and promo confusion

  • Stores — with mislabeled regions and inconsistent activity spikes
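With tables this interconnected, the first sanity check worth running is referential integrity: do the foreign keys actually resolve? A hedged sketch, using toy stand-ins for the generated CSVs:

```python
import pandas as pd

# Toy stand-ins for the generated tables (the real ones come from the CSVs).
stores = pd.DataFrame({"StoreID": ["S001", "S002", "S003"]})
transactions = pd.DataFrame({
    "TransactionID": range(1, 6),
    "StoreID": ["S001", "S002", "S404", "S003", None],  # one orphan, one NULL
})

# Left-merge with an indicator column to surface orphaned foreign keys.
check = transactions.merge(stores, on="StoreID", how="left", indicator=True)
orphans = check[check["_merge"] == "left_only"]
print(orphans[["TransactionID", "StoreID"]])
```

In a dataset designed for mess, a non-empty `orphans` frame isn't a failure — it's the exercise.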


The data moved. Sometimes predictably. Often irrationally.

It asked questions back.


Promo days didn’t always boost sales.

Certain stores underperformed.

Some customers ghosted.

Others showed up every Monday like clockwork.


That was the point.

This wasn’t just dataset generation — it was scenario design.


🧰 Workflow & Tools That Made It Work

This was a personal project with no affiliation to any real brand — just me, trying to simulate real-life complexity and build a system I could learn from.


Step 1: Generate the Code (Gemini Prompt)

Here’s the full Gemini prompt I used — and one you can tweak for any brand you’re eyeing. It returned a full Python script.

But fair warning: this isn’t clean data. That would’ve been too easy.


Prompt for the Marketing Dataset Generator (Gemini)
1. Persona & Goal
You are an expert data architect and simulator. Your primary goal is to generate a comprehensive and realistic set of messy datasets for a fictional Canadian restaurant chain. I will be acting as a Marketing Analyst, and I need this data to practice a full data analysis workflow, from data cleaning and transformation in SQL to visualization and insight generation in Power BI. The datasets should be interconnected and contain logical patterns, but also include common real-world data issues like null values, inconsistencies, and outliers to make the exercise challenging.

2. Fictional Company Profile
- Name: Maple Leaf Grill & Lounge
- Description: A popular family-friendly casual dining chain with locations across Canada (primarily in British Columbia and Alberta). It's known for its signature burgers, classic comfort food, and seasonal promotions.
- Operations: The chain operates both dine-in restaurants and offers takeout and delivery services. They also have a customer loyalty program.

3. Business Scenarios & Objectives
The marketing department needs to analyze the performance of several key campaigns to inform future strategy and budget allocation. The primary analyses required are:

- Easter Turkey Promotion Analysis: Evaluate the Return on Investment (ROI) of the annual Easter campaign (FY2020-FY2024) by store, region, and channel (Dine-in, Takeout, Delivery). Compare year-over-year performance to guide planning for the FY2025 campaign.

- "Double Points Tuesdays" Loyalty Analysis: Measure the success of a 4-week loyalty campaign aimed at increasing weekday traffic. Key metrics are lift in Tuesday sales, loyalty member engagement, and average check size.

- "Back to School Burger Blitz" Analysis: Assess the impact of a September discount campaign on burger sales, overall store traffic, and profit margins.

- "Meatless Mondays" Behavioral Analysis: Understand the adoption of a plant-based menu promotion, identifying key customer segments and location-based trends.

4. Dataset Generation Requirements
Please generate the following tables in CSV format. Ensure they are logically linked through primary and foreign keys.

Table 1: Stores
- Description: A list of all restaurant locations.
- Columns: StoreID (Primary Key), StoreName (e.g., "MLG Metrotown", "MLG Kelowna"), Region (e.g., "BC", "AB"), Area (e.g., "Metro Vancouver", "Calgary Area", "Vancouver Island"), StoreType (e.g., "Urban", "Suburban", "Mall").
- Realism/Messiness:
     Include 40-50 unique stores.
     Leave the Area field NULL for 1-2 stores.
     Ensure regions are primarily 'BC' and 'AB', with one or two outliers in 'SK' or 'ON'.

Table 2: Fiscal_Calendar
- Description: A standard fiscal calendar mapping dates.
- Columns: Date, FiscalYear (e.g., "F24"), FiscalQuarter (e.g., "Q1"), FiscalMonth (e.g., "April"), DayOfWeek (e.g., "Monday").
- Realism/Messiness:
     Cover dates from January 1, 2020, to December 31, 2024.
     This table can be clean.

Table 3: Customers
- Description: Customer profile information, primarily for loyalty members.
- Columns: CustomerID (Primary Key), LoyaltyID (can be NULL), JoinDate, Email, Phone, City, BirthYear.
- Realism/Messiness:
     Generate around 15,000 customers.
     About 60% of customers should have a LoyaltyID. The rest are NULL.
     BirthYear should have about 10% NULL values.
     JoinDate should be spread realistically over the last 5 years.

Table 4: Products
- Description: A list of all menu items.
- Columns: ProductID (Primary Key), ProductName, Category (e.g., "Burgers", "Entrees", "Appetizers", "Beverages", "Plant-Based", "Seasonal"), BasePrice.
- Realism/Messiness:
     Include a "Turkey Dinner" and "Heat & Serve Turkey Dinner" in the "Seasonal" category.
     Include several burger names for the "Burger Blitz" campaign.
     Include at least 5-6 items in the "Plant-Based" category.
     Vary prices realistically ($5 for drinks, $15-25 for entrees, etc.).

Table 5: Marketing_Campaigns
- Description: Details for each major marketing campaign.
- Columns: CampaignID (Primary Key), CampaignName (e.g., "Easter Turkey Promo F24", "Double Points Tuesdays F24"), PromoStartDate, PromoEndDate, FiscalYear.
- Realism/Messiness:
     Create records for all four campaigns mentioned in the scenarios, including separate records for each year of the Easter promotion (F20-F24).

Table 6: Marketing_Spend
- Description: Marketing spend broken down by campaign and channel.
- Columns: SpendID (Primary Key), CampaignID (Foreign Key to Marketing_Campaigns), Spend_Date, Channel (e.g., "Social Media", "Digital Display", "Radio", "OOH"), Amount.

Table 7: Sales_Transactions (Fact Table)
- Description: The core sales data, linking all other tables together. This should be the largest table.
- Columns: TransactionID (Primary Key), TransactionDetailID (to link to items), StoreID (Foreign Key to Stores), CustomerID (Foreign Key to Customers, can be NULL), ProductID (Foreign Key to Products), CampaignID (Foreign Key to Marketing_Campaigns, can be NULL), TransactionTimestamp, Quantity, UnitPrice, TotalAmount (Quantity * UnitPrice), DiscountAmount.
- Realism/Messiness:
     Generate a large volume of transactions (e.g., 500,000+ rows) covering the 2020-2024 period.
     CustomerID should be NULL for ~40% of transactions (non-loyalty/anonymous guests).
     CampaignID should only be populated if the transaction occurred during a promotional period and involved a promotional item.
     Sales for promotional items (like "Turkey Dinner") should spike significantly during their respective campaign periods.
     Simulate different sales patterns for Dine-in vs. Takeout vs. Delivery. Delivery sales should have grown steadily from 2020.

5. Final Instruction
Please begin by generating the data for the Stores table. After you provide it, I will confirm, and we can proceed table by table. Feel free to ask clarifying questions about data distribution or specific anomalies to make the scenarios more robust.
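Table 2 in the prompt asks for fiscal-year labels like "F24" on a calendar with Q1 starting in April. A minimal sketch of that mapping, assuming an April 1 fiscal-year start (the same convention the Python script further down uses):

```python
from datetime import date

def fiscal_year_label(d: date) -> str:
    """Fiscal-year label, assuming the fiscal year starts April 1."""
    # April onward belongs to the *next* fiscal year (April 2023 -> F24).
    fy = d.year + 1 if d.month >= 4 else d.year
    return f"F{fy % 100:02d}"

print(fiscal_year_label(date(2024, 3, 31)))  # F24
print(fiscal_year_label(date(2024, 4, 1)))   # F25
```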

Bonus: Every time you run the prompt, it shifts — new quirks, new blind spots, new stories.
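The Python script below pins SEED = 42, so it regenerates the same mess every run. If you want the prompt-style variability back, reseed per run and log the seed so a scenario you liked can be rebuilt. A small sketch:

```python
import random
import numpy as np

def reseed(seed=None):
    """Pin a seed for a reproducible dataset, or pass None for a fresh scenario."""
    if seed is None:
        seed = random.SystemRandom().randint(0, 2**31 - 1)
    np.random.seed(seed)
    random.seed(seed)
    return seed  # log this so a good scenario can be regenerated later

print("Scenario seed:", reseed(42))
```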

Step 2: Run it in Colab

I ran the code in Google Colab — but any Python IDE will do.

No setup: just hit play and download the tables.
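The script writes its CSVs into an output folder; in Colab, the easiest way to get them out is to zip that folder and download the archive. A sketch (the folder name matches the script's OUTPUT_DIR setting; the google.colab import only exists inside Colab):

```python
import os
import shutil

OUTPUT_DIR = "white_spot_datasets"  # matches the generator script's setting
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Bundle every generated CSV into a single archive.
archive = shutil.make_archive(OUTPUT_DIR, "zip", OUTPUT_DIR)
print("Created", archive)

try:
    from google.colab import files  # only available inside Colab
    files.download(archive)
except ImportError:
    print("Not in Colab; grab the zip from the working directory instead.")
```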


If Gemini isn’t your thing, just copy this Python code and run it.

Python Code for Dataset Generation
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
import uuid
import os

# --- Configuration Parameters ---
OUTPUT_DIR = "white_spot_datasets"
START_DATE_SALES = datetime(2023, 4, 1)
END_DATE_SALES = datetime(2024, 7, 31)
SEED = 42
NUM_STORES = 10
STORE_REGIONS = {
    "Lower Mainland": 0.6,
    "Vancouver Island": 0.2,
    "Okanagan": 0.1,
    "Interior": 0.1
}
NUM_CUSTOMERS = 5000
NUM_PRODUCT_CATEGORIES = 10 # This parameter is no longer directly used due to hardcoded product_data_raw
PRODUCTS_PER_CATEGORY = 5 # This parameter is no longer directly used due to hardcoded product_data_raw
BASE_PRODUCT_PRICE = 5.0
PRICE_VARIABILITY = 3.0
AVG_TRANSACTIONS_PER_STORE_PER_DAY = 30
TRANSACTION_STD_DEV_RATIO = 0.3
MIN_TRANSACTIONS_PER_DAY = 5
AVG_ITEMS_PER_TRANSACTION = 2.5
ITEM_STD_DEV = 1.0
PRICE_DETERIORATION_RATE = 0.0001
RETURN_RATE = 0.005

# --- New Global Variables ---

MARKETING_CHANNELS = [
    # Digital Marketing
    'Social Media - Facebook/Instagram Ads',
    'Social Media - TikTok Organic',
    'Email - Promotional Offer',
    'Paid Search - Google Ads',
    'Display Ads - Programmatic',
    'Influencer Marketing - Local Bloggers',

    # Traditional Marketing
    'Local Radio Spot',
    'Newspaper Print Ad',
    'Flyer Distribution',

    # Out-of-Home (OOH)
    'Billboard - Main Roads',

    # In-Store / Experiential
    'In-Store POS Display',
    'Community Event Sponsorship',
    'Loyalty Program Enrollment Drive',
]

SALES_CHANNELS = ['Dine-in', 'Takeout', 'Delivery App', 'Drive-Thru'] # For transaction channel
TRANSACTION_SALES_BOOST = 1.8 # e.g., 80% increase in transactions during sales
TRANSACTION_AD_BOOST = 1.2 # e.g., 20% increase in transactions during ad period
# --- End New Global Variables ---

# --- Campaign Definitions ---
EASTER_TURKEY_AD_DAYS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
EASTER_TURKEY_SALES_DAYS = [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
EASTER_TURKEY_PROMO_YEARS = [2023, 2024, 2025] # Specific April years for Easter promos, including 2025

DOUBLE_POINTS_TUES_DATES_CALENDAR = [
    datetime(2023, 5, 2), datetime(2023, 5, 9), datetime(2023, 5, 16), datetime(2023, 5, 23), datetime(2023, 5, 30),
    datetime(2023, 8, 1), datetime(2023, 8, 8), datetime(2023, 8, 15), datetime(2023, 8, 22), datetime(2023, 8, 29),
    datetime(2023, 11, 7), datetime(2023, 11, 14), datetime(2023, 11, 21), datetime(2023, 11, 28),
    datetime(2024, 2, 6), datetime(2024, 2, 13), datetime(2024, 2, 20), datetime(2024, 2, 27),
    datetime(2024, 5, 7), datetime(2024, 5, 14), datetime(2024, 5, 21), datetime(2024, 5, 28)
]

BACK_TO_SCHOOL_DATES_CALENDAR = [
    datetime(2023, 8, 20), datetime(2023, 8, 21), datetime(2023, 8, 22), datetime(2023, 8, 23), datetime(2023, 8, 24), datetime(2023, 8, 25), datetime(2023, 8, 26), datetime(2023, 8, 27), datetime(2023, 8, 28), datetime(2023, 8, 29), datetime(2023, 8, 30), datetime(2023, 8, 31),
    datetime(2023, 9, 1), datetime(2023, 9, 2), datetime(2023, 9, 3), datetime(2023, 9, 4), datetime(2023, 9, 5), datetime(2023, 9, 6), datetime(2023, 9, 7), datetime(2023, 9, 8), datetime(2023, 9, 9), datetime(2023, 9, 10)
]

PLANT_BASED_MONDAYS_START = datetime(2024, 1, 1)
PLANT_BASED_MONDAYS_END = datetime(2024, 3, 31)

WEATHER_CITIES = {
    "Vancouver": {"lat": 49.2827, "lon": -123.1207, "temp_avg": 10, "temp_std": 5, "precip_prob": 0.5},
    "Victoria": {"lat": 48.4284, "lon": -123.3656, "temp_avg": 11, "temp_std": 4, "precip_prob": 0.4},
    "Kelowna": {"lat": 49.8878, "lon": -119.4960, "temp_avg": 9, "temp_std": 8, "precip_prob": 0.3},
    "Whistler": {"lat": 50.1163, "lon": -122.9574, "temp_avg": 5, "temp_std": 10, "precip_prob": 0.6},
    "Prince George": {"lat": 53.9170, "lon": -122.7497, "temp_avg": 2, "temp_std": 12, "precip_prob": 0.4}
}
PRECIP_TYPES = ['Rain', 'Snow', 'Cloudy', 'Sunny']
SEASONAL_TEMP_ADJUST = {
    1: -5, 2: -3, 3: 0, 4: 3, 5: 7, 6: 10, 7: 12, 8: 11, 9: 7, 10: 3, 11: -1, 12: -4
}
SEASONAL_PRECIP_ADJUST = {
    1: 0.2, 2: 0.1, 3: 0, 4: -0.1, 5: -0.2, 6: -0.2, 7: -0.3, 8: -0.2, 9: 0, 10: 0.1, 11: 0.2, 12: 0.1
}

# Set random seed for reproducibility
np.random.seed(SEED)
random.seed(SEED)

# --- Helper Functions ---
def get_fiscal_year(date):
    """ Determines the fiscal year for a given date. Assuming fiscal year starts April 1st. """
    if date.month >= 4:
        return date.year + 1
    else:
        return date.year

def get_fiscal_week_month_quarter(date):
    """ Determines fiscal week, month, and quarter based on a Q1 starting April 1. """
    fiscal_month_num = (date.month - 4) % 12 + 1
    fiscal_quarter = (fiscal_month_num - 1) // 3 + 1
    
    # Calculate fiscal year start for correct day_of_fiscal_year calculation
    fiscal_year_start = datetime(date.year if date.month >= 4 else date.year - 1, 4, 1)
    day_of_fiscal_year = (date - fiscal_year_start).days + 1
    fiscal_week = (day_of_fiscal_year - 1) // 7 + 1
    
    # Cap fiscal week at 52 or 53 - more robust calculation
    # A fiscal year can have 52 or 53 weeks. For simplicity, we'll cap at 53.
    # In a real scenario, this would follow specific rules (e.g., 53rd week if fiscal year has 365+ days).
    # For synthetic data, a simple cap is usually sufficient to prevent overruns.
    if fiscal_week > 53: 
        fiscal_week = 53
    
    return fiscal_week, fiscal_month_num, fiscal_quarter

def get_fiscal_month_name(fiscal_month_num):
    month_names = {
        1: 'April', 2: 'May', 3: 'June', 4: 'July', 5: 'August', 6: 'September',
        7: 'October', 8: 'November', 9: 'December', 10: 'January', 11: 'February', 12: 'March'
    }
    return month_names.get(fiscal_month_num, 'Invalid Fiscal Month')

# --- Data Generation Functions ---
def generate_store_locations():
    print("Generating store_locations...")
    stores = []
    store_ids_pool = [f"S{i+1:04d}" for i in range(NUM_STORES * 2)]
    selected_store_ids = random.sample(store_ids_pool, NUM_STORES)
    bc_cities = ["Vancouver", "Victoria", "Burnaby", "Richmond", "Surrey", "Kelowna", "Abbotsford", "Nanaimo", "Kamloops", "Coquitlam"]
    
    for i, store_id in enumerate(selected_store_ids):
        region = random.choices(
            list(STORE_REGIONS.keys()), weights=list(STORE_REGIONS.values()), k=1
        )[0]
        city = random.choice(bc_cities)
        opening_date = (datetime.now() - timedelta(days=random.randint(365, 365*10))).strftime('%Y-%m-%d')
        store_price_multiplier = round(random.uniform(0.95, 1.05), 2)
        stores.append({
            'store_id': store_id,
            'city_province': city,
            'region': region,
            'guest_capacity': np.random.randint(50, 200),
            'opening_date': opening_date,
            'store_price_multiplier': store_price_multiplier
        })
    df = pd.DataFrame(stores)
    # frac-based sampling rounds to zero rows with only 10 stores,
    # so sample at least one row to guarantee some mess actually lands.
    dirty_idx = df.sample(n=max(1, int(len(df) * 0.03))).index
    df.loc[dirty_idx, 'city_province'] = df.loc[dirty_idx, 'city_province'].str.lower()
    df.loc[df.sample(n=1).index, 'guest_capacity'] = -10
    print(f"Generated {len(df)} store locations.")
    return df

def generate_fiscal_calendar(start_date, end_date):
    print("Generating fiscal_calendar...")
    dates = []
    current_date = start_date
    while current_date <= end_date:
        fiscal_year = get_fiscal_year(current_date)
        fiscal_week, fiscal_month_num, fiscal_quarter = get_fiscal_week_month_quarter(current_date)
        fiscal_month_name = get_fiscal_month_name(fiscal_month_num)
        
        is_holiday = False
        holiday_name = None
        # Example specific holidays - note 2025 Easter is included for future data
        if current_date.month == 4 and current_date.day == 18 and current_date.year == 2025:
            is_holiday = True
            holiday_name = 'Easter Monday'
        elif current_date.month == 7 and current_date.day == 1:
            is_holiday = True
            holiday_name = 'Canada Day'

        is_easter_turkey_ad_period = (
            current_date.month == 4 and current_date.day in EASTER_TURKEY_AD_DAYS and current_date.year in EASTER_TURKEY_PROMO_YEARS
        )
        is_easter_turkey_sales_period = (
            current_date.month == 4 and current_date.day in EASTER_TURKEY_SALES_DAYS and current_date.year in EASTER_TURKEY_PROMO_YEARS
        )

        dates.append({
            'date': current_date,
            'day_of_week': current_date.strftime('%A'),
            'day_of_month': current_date.day,
            'week_of_year': current_date.isocalendar()[1],
            'month_name': current_date.strftime('%B'),
            'quarter_name': f"Q{(current_date.month - 1) // 3 + 1}",
            'calendar_year': current_date.year,
            'fiscal_week': fiscal_week,
            'fiscal_month_num': fiscal_month_num,
            'fiscal_month_name': fiscal_month_name,
            'fiscal_quarter': fiscal_quarter,
            'fiscal_year': fiscal_year,
            'is_holiday': is_holiday,
            'holiday_name': holiday_name,
            'is_easter_turkey_ad_period': is_easter_turkey_ad_period,
            'is_easter_turkey_sales_period': is_easter_turkey_sales_period
        })
        current_date += timedelta(days=1)
    
    df = pd.DataFrame(dates)
    # Introduce some dirtiness
    df.loc[df.sample(frac=0.005).index, 'day_of_week'] = df['day_of_week'].apply(lambda x: x[:3]) # Shorten day names
    df.loc[df.sample(frac=0.005).index, 'month_name'] = df['month_name'].apply(lambda x: x[:3].upper()) # Abbreviate month names
    idx_fiscal_week_error = df.sample(frac=0.001).index
    df.loc[idx_fiscal_week_error, 'fiscal_week'] = np.random.randint(1, 53, size=len(idx_fiscal_week_error)) # Invalid fiscal week
    idx_fiscal_month_error = df.sample(frac=0.001).index
    df.loc[idx_fiscal_month_error, 'fiscal_month_num'] = np.random.randint(1, 13, size=len(idx_fiscal_month_error)) # Invalid fiscal month num
    df.loc[idx_fiscal_month_error, 'fiscal_month_name'] = df.loc[idx_fiscal_month_error, 'fiscal_month_num'].apply(get_fiscal_month_name) # Sync name with new num
    print(f"Generated {len(df)} fiscal calendar entries.")
    return df

def generate_product_catalog():
    print("Generating product_catalog...")
    product_data_raw = """
P0001 Nashville Chicken Burger Burger 8.23 5.81 29.4 General
P0002 Nat's Hearty Breakfast Breakfast 9.89 7.68 22.35 General
P0003 Chocolate Cheesecake Dessert 20.32 11.14 45.18 General
P0004 Chicken Burger Burger 13.26 6.89 48.04 General
P0005 Milkshake Drink 12.84 8.73 32.01 General
P0006 Zoo Sticks Appetizer 10.88 5.72 47.43 General
P0008 Spaghetti Bolognese Pasta 9.7 5.18 46.6 General
P0009 Kids' Mac & Cheese Kids Meal 12.27 7.24 40.99 BackToSchool
P0010 Strawberry Milkshake Drink 17.18 13.36 22.24 General
P0011 Chicken Caesar Salad Salad 21.48 16.86 21.51 General
P0012 CHICKEN CAESAR WRAP Wrap 8.58 6.44 24.94 General
P0014 Turkey Dinner Entree 11.06 6.1 44.85 Easter
P0015 Cranberry Sauce Side 8.91 6.31 29.18 Easter
P016 NASHVILLE CHICKEN BURGER Burger 9.86 6.79 31.14 General
P017 Grilled Chicken Sandwich Sandwich 17.07 11.47 32.81 General
P018 Pirate Pak - Grilled Cheese Kids Meal 23.43 16.09 31.33 BackToSchool
P020 Spaghetti Bolognese Pasta 15.9 8.57 46.1 General
P021 Monty Mushroom Burger Burger 7.51 6 20.11 General
P023 Vegan Power Bowl Salad 7.33 4.34 40.79 PlantBased
P024 Chocolate Cheesecake Dessert 15.82 12.54 20.73 General
P026 Classic Milkshake Drink 24.29 17.08 29.68 General
P027 Monty Mushroom Burger Burger 16.43 10.67 35.06 General
P028 Vegan Pasta Prima Pasta 8.21 5.83 28.99 PlantBased
P029 NAT'S HEARTY BREAKFAST Breakfast 11.2 6.53 41.7 General
P030 Lobster Mac & Cheese Entree 19.7 15.02 23.76 General
P032 PACIFIC COD TACOS Seafood 18.18 13.81 24.04 General
P033 Hamburger Kids Meal 7.62 5.75 24.54 BackToSchool
P034 Chicken Caesar Salad Salad 17.98 13.93 22.53 General
P035 PIRATE PAK - GRILLED CHEESE Kids Meal 18.26 9.29 49.12 BackToSchool
P036 Chocolate Brownie Dessert 23.93 17.32 27.62 General
P037 Crispy Chicken Burger Burger 7.44 5.89 20.83 General
P038 Fish & Chips Seafood 23.18 17.8 23.21 General
P039 Kids' Chicken Strips Kids Meal 21.94 16.21 26.12 BackToSchool
P040 Apple Pie Dessert 21.31 13.77 35.38 General
P041 Zoo Sticks Appetizer 20.08 12.05 39.99 General
P042 LOBSTER MAC & CHEESE Seafood 8.6 6.63 22.91 General
P044 Spicy Chicken Burger Burger 11.86 8.5 28.33 General
P045 Pacific Cod Tacos Seafood 14.95 9.19 38.53 General
P046 Prawn Skewers Seafood 9.8 6.49 33.78 General
P047 SPAGHETTI BOLOGNESE Pasta 15.43 10.82 29.88 General
P048 Legendary Burger Burger 9.96 7.81 21.59 General
P049 Fish & Chips Seafood 8.53 6.27 26.49 General
P050 Spaghetti Bolognese Pasta 8.95 4.66 47.93 General
P051 Zoo Sticks Appetizer 9.4 5.23 44.36 General
P052 pirate pak - grilled cheese Kids Meal 16.11 11.85 26.44 BackToSchool
P053 pacific cod tacos Seafood 19.9 14.29 28.19 General
P054 Nat's Hearty Breakfast Breakfast 16.16 9.12 43.56 General
P055 Mixed Berry Smoothie Drink 12.37 9.17 25.87 General
P056 Kids' Grilled Cheese Kids Meal 17.52 9.88 43.61 BackToSchool
P057 Lobster Mac & Cheese Entree 22.57 17.33 23.22 General
P058 Nat's Hearty Breakfast Breakfast 22.63 12.95 42.78 General
P059 Chicken Caesar Salad Salad 11.28 6.24 44.68 General
P060 Kids' Cheeseburger Kids Meal 13.72 8.94 34.84 BackToSchool
P061 Fruit Juice Drink 22.37 14.41 35.58 General
P062 Pacific Cod Tacos Seafood 24.98 15.51 37.91 General
P063 Lobster Mac & Cheese Entree 11.86 9.47 20.15 General
P064 Fish & Chips Seafood 14.21 7.36 48.21 General
P065 Chicken Caesar Salad Salad 20.19 11.4 43.54 General
P066 Classic Burger Burger 12.84 9.84 23.36 General
P067 Monty Mushroom Burger Burger 20.7 14.5 29.95 General
P068 Vegan Deluxe Wrap Wrap 10.61 6.57 38.08 PlantBased
P069 Nashville Chicken Burger Burger 15.47 11.44 26.05 General
P070 Pirate Pak - Grilled Cheese Kids Meal 15.17 11.27 25.71 BackToSchool
P071 Nat's Hearty Breakfast Breakfast 15.77 10.55 33.1 General
P072 Crispy Chicken Burger Burger 14.06 8.75 37.77 General
P073 LEGENDARY BURGER Burger 8.29 4.89 41.01 General
P074 Garlic Bread Appetizer 10.91 6.62 39.32 General
P075 Chocolate Milkshake Drink 24.4 17.03 30.2 General
P077 Kids' Mini Pizza Kids Meal 18.23 12.33 32.36 BackToSchool
P079 Pirate Pak - Grilled Cheese Kids Meal 11.98 6.58 45.08 BackToSchool
P080 Chicken Strips Kids Meal 13.94 9.07 34.94 BackToSchool
P081 Chocolate Cheesecake Dessert 10.5 6.44 38.67 General
P082 BC Chicken Burger Burger 11.66 8.88 23.84 General
P083 Spaghetti Bolognese Pasta 22.35 16.48 26.26 General
P084 Pirate Pak - Grilled Cheese Kids Meal 20.45 12.25 40.1 BackToSchool
P085 Nashville Chicken Burger Burger 9.63 7.45 22.64 General
P086 Monty Mushroom Burger Burger 7.02 5.33 24.07 General
P087 Zoo Sticks Appetizer 9.14 5.08 44.42 General
P088 Monty Mushroom Burger Burger 8.88 6.16 30.63 General
P089 Lobster Mac & Cheese Entree 24 16.36 31.83 General
P090 Vanilla Milkshake Drink 8.06 5.88 27.05 General
P091 Nashville Chicken Burger Burger 7.08 4.71 33.47 General
P092 Monty Mushroom Burger Burger 21.13 14.14 33.08 General
P093 nat's hearty breakfast Breakfast 7.59 5.3 30.17 General
P094 legendary burger Burger 13.24 9.74 26.44 General
P095 Chocolate Chip Cookie Dessert 10.75 7.28 32.28 General
P096 Legendary Burger Burger 24.32 16.6 31.74 General
P097 Milkshake Drink 9.69 5.34 44.89 General
P098 FISH & CHIPS Seafood 7.73 4.14 46.44 General
P101 Fish & Chips Seafood 11.3 8.25 26.99 General
P103 Spaghetti Bolognese Pasta 24.15 16.53 31.55 General
P105 Blueberry Smoothie Drink 15.38 11.91 22.56 General
P106 Lemonade Drink 14.86 10.73 27.79 General
P107 Chicken Caesar Salad Salad 19.8 15.03 24.09 General
P108 Zoo Sticks Appetizer 17.8 11.78 33.82 General
P109 Fruit Cup Dessert 18.94 12.23 35.43 General
P110 Spaghetti Bolognese Pasta 16.9 12.43 26.45 General
P111 Pacific Cod Tacos Seafood 16.67 8.42 49.49 General
P112 Chicken Caesar Salad Salad 9.85 5.98 39.29 General
P113 Monty Mushroom Burger Burger 19.07 10.27 46.15 General
P114 LOBSTER MAC & CHEESE Entree 21.23 15.68 26.14 General
P115 Monty Mushroom Burger Burger 8.19 5.94 27.47 General
P116 monty mushroom burger Burger 9.14 5.7 37.64 General
P117 CHOCOLATE CHEESECAKE Dessert 22.13 16.65 24.76 General
P118 Nat's Hearty Breakfast Breakfast 17.17 11.26 34.42 General
P119 Pancakes Breakfast 24.52 18.74 23.57 General
P120 BC Chicken Burger Burger 23 12.75 44.57 General
P121 Nat's Hearty Breakfast Breakfast 7.62 4.11 46.06 General
P122 Plant-Based Burger Burger 8.25 5.36 35.03 PlantBased
P123 Pirate Pak - Grilled Cheese Kids Meal 10.09 7.36 27.06 BackToSchool
P124 Nat's Hearty Breakfast Breakfast 22.15 11.32 48.89 General
P125 Monty Mushroom Burger Burger 8.3 4.69 43.49 General
P126 NAT'S HEARTY BREAKFAST Breakfast 12.86 8.64 32.81 General
P128 CHOCOLATE CHEESECAKE Dessert 13.18 8.65 34.37 General
P129 Monty Mushroom Burger Breakfast 20.54 13.5 34.27 General
P131 Chocolate Cheesecake Dessert 23.12 15.11 34.65 General
P132 Ice Cream Sundae Dessert 13.46 8.52 36.7 General
P133 Roasted Turkey Sandwich Sandwich 9.09 5.37 40.92 Easter
P134 Pacific Cod Tacos Seafood 8.65 6.49 24.97 General
P135 Zoo Sticks Appetizer 8.29 5.83 29.67 General
P136 Pirate Pak - Grilled Cheese Kids Meal 11.94 8.76 26.63 BackToSchool
P137 CHOCOLATE CHEESECAKE Dessert 19.71 14.32 27.35 General
P138 Monty Mushroom Burger Burger 24.41 17.78 27.16 General
P139 Fish & Chips Seafood 7.64 5.6 26.7 General
P140 Chocolate Cheesecake Dessert 23.71 13.67 42.35 General
P141 Chicken Caesar Salad Salad 21.79 16.57 23.96 General
P142 legendary burger Burger 22.54 13.8 38.78 General
P143 BC Chicken Burger Burger 18.57 14.71 20.79 General
P144 Lobster Mac & Cheese Entree 16.88 11.04 34.6 General
P145 Nashville Chicken Burger Burger 9.06 5.93 34.55 General
P146 Legendary Burger Burger 12.72 8.22 35.38 General
P147 Chicken Caesar Salad Salad 13.82 8.31 39.87 General
P148 Nashville Chicken Burger Burger 20.16 15.82 21.53 General
P149 Fish & Chips Seafood 20.68 15.77 23.74 General
P150 French Fries Side 7.86 4.84 38.42 General
P151 Plant-Based Sausage & Eggs Breakfast 15.50 11.20 25.00 PlantBased
P152 Kids' Combo Meal Kids Meal 14.99 10.50 30.00 BackToSchool
P153 Green Bean Casserole Side 6.75 4.50 33.00 Easter
    """
    products = []
    seen_product_ids = set()
    for line in product_data_raw.strip().split('\n'):
        all_parts = line.split()
        
        if len(all_parts) < 6:
            print(f"Skipping malformed product line: {line}")
            continue

        product_id = all_parts[0]
        if product_id in seen_product_ids:
            continue
        seen_product_ids.add(product_id)
        
        campaign_affinity = all_parts[-1]
        category = all_parts[-5]
        try:
            base_price = float(all_parts[-4])
        except ValueError:
            # Unparseable price in the raw line: fall back to the default.
            base_price = BASE_PRODUCT_PRICE
        product_name = " ".join(all_parts[1:-5])

        if not product_name: product_name = "Unknown Product"
        if not category: category = "Unknown Category"
        
        product_name = product_name.title()
        category = category.title()

        # A price can't be negative; flip the sign if one slipped through.
        if base_price < 0:
            base_price = abs(base_price)

        products.append({
            'product_id': product_id,
            'product_name': product_name,
            'category': category,
            'base_price': round(base_price, 2),
            'is_active': random.random() > 0.05,
            'campaign_affinity': campaign_affinity
        })
    
    df = pd.DataFrame(products)
    print(f"Generated {len(df)} products from provided data.")
    return df

def generate_customer_profiles():
    print("Generating customer_profiles...")
    customers = []
    first_names = ["John", "Jane", "Mike", "Sarah", "Chris", "Emily", "David", "Anna", "Peter", "Linda"]
    last_names = ["Doe", "Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller", "Davis", "Rodriguez"]
    email_domains = ["example.com", "mail.com", "test.org", "domain.net", "service.co"]
    phone_prefixes = ["555-123", "555-456", "555-789", "555-010"]
    
    for i in range(NUM_CUSTOMERS):
        customer_id = f"C{i+1:04d}"
        gender = random.choice(['Male', 'Female', 'Other'])
        current_year = datetime.now().year
        birth_year = random.randint(current_year - 70, current_year - 18)
        birth_month = random.randint(1, 12)
        first = random.choice(first_names)
        last = random.choice(last_names)
        email = f"{first.lower()}.{last.lower()}@{random.choice(email_domains)}"
        phone = f"({random.choice(['236', '604', '778', '672'])}) {random.choice(phone_prefixes)}-{random.randint(0, 9999):04d}"
        member_since = (datetime.now() - timedelta(days=random.randint(0, 365*5))).strftime('%Y-%m-%d')
        last_visit_date = (datetime.now() - timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d')
        spending_propensity = round(random.uniform(0.7, 1.3), 2)
        
        customers.append({
            'customer_id': customer_id,
            'first_name': first,
            'last_name': last,
            'email': email,
            'phone_number': phone,
            'gender': gender,
            'birth_year': birth_year,
            'birth_month': birth_month,
            'member_since': member_since,
            'last_visit_date': last_visit_date,
            'loyalty_points': np.random.randint(0, 1000) if random.random() < 0.7 else 0,
            'spending_propensity': spending_propensity
        })
    
    df = pd.DataFrame(customers)
    # Introduce some dirtiness
    df.loc[df.sample(frac=0.03).index, 'email'] = np.nan
    df.loc[df.sample(frac=0.02).index, 'phone_number'] = df['phone_number'].apply(lambda x: x[:5] + 'abc' if isinstance(x, str) else x)
    df.loc[df.sample(frac=0.001).index, 'birth_year'] = np.random.choice([-1900, datetime.now().year + 5])
    df.loc[df.sample(frac=0.001).index, 'birth_month'] = np.random.choice([0, 13])
    df.loc[df.sample(frac=0.002).index, 'loyalty_points'] = 99999
    print(f"Generated {len(df)} customer profiles.")
    return df

def generate_marketing_campaigns(fiscal_calendar_df):
    print("Generating marketing_campaigns...")
    campaigns = []

    all_fiscal_years = fiscal_calendar_df['fiscal_year'].unique()

    promo_spend_ranges = {
        'Easter Turkey Sales': {'ad_min': 15000, 'ad_max': 50000, 'op_min': 1500, 'op_max': 5000},
        'Double Points Tuesday': {'ad_min': 5000, 'ad_max': 15000, 'op_min': 500, 'op_max': 1500},
        'Back to School': {'ad_min': 10000, 'ad_max': 30000, 'op_min': 1000, 'op_max': 3000},
        'Plant-Based Monday': {'ad_min': 3000, 'ad_max': 10000, 'op_min': 300, 'op_max': 1000},
        'Easter Turkey Ad': {'ad_min': 5000, 'ad_max': 15000, 'op_min': 200, 'op_max': 800}
    }

    for promo_type, ranges in promo_spend_ranges.items():
        relevant_years = []
        if 'Easter' in promo_type:
            relevant_years = [y for y in all_fiscal_years if y in EASTER_TURKEY_PROMO_YEARS or y == (datetime.now().year + 1 if datetime.now().month >= 4 else datetime.now().year) and y <= get_fiscal_year(END_DATE_SALES)]
        elif promo_type == 'Double Points Tuesday' or promo_type == 'Plant-Based Monday':
            relevant_years = all_fiscal_years
        elif promo_type == 'Back to School':
            # Ensure back to school promo years are within or near the sales data range
            relevant_years = [y for y in all_fiscal_years if datetime(y-1, 8, 1) <= END_DATE_SALES + timedelta(days=30)]
            
        for f_year in sorted(list(set(relevant_years))):
            ad_spend = round(random.uniform(ranges['ad_min'], ranges['ad_max']), 2)
            operational_cost = round(random.uniform(ranges['op_min'], ranges['op_max']), 2)

            campaign_name = f"{promo_type} {f_year}"

            marketing_channel = random.choice(MARKETING_CHANNELS)

            campaigns.append({
                'campaign_name': campaign_name,
                'promo_type_used': promo_type,
                'fiscal_year': f_year,
                'estimated_ad_spend': ad_spend,
                'estimated_operational_cost': operational_cost,
                'marketing_channel_cl': marketing_channel
            })

    df = pd.DataFrame(campaigns)
    # Introduce some dirtiness: negative ad spend and missing operational costs
    neg_idx = df.sample(frac=0.05).index
    df.loc[neg_idx, 'estimated_ad_spend'] = [-random.random() * 1000 for _ in range(len(neg_idx))]
    df.loc[df.sample(frac=0.03).index, 'estimated_operational_cost'] = np.nan
    print(f"Generated {len(df)} marketing campaign entries.")
    return df

def generate_sales_detail(store_locations_df, product_catalog_df, customer_profiles_df, fiscal_calendar_df):
    print("Generating sales_detail...")
    sales_data = []

    all_dates = fiscal_calendar_df['date'].unique()
    all_store_ids = store_locations_df['store_id'].unique()
    all_product_ids = product_catalog_df['product_id'].unique()
    all_customer_ids = customer_profiles_df['customer_id'].unique()

    product_details = product_catalog_df.set_index('product_id')[['base_price', 'campaign_affinity']].to_dict('index')
    customer_propensities = customer_profiles_df.set_index('customer_id')['spending_propensity'].to_dict()
    store_price_multipliers = store_locations_df.set_index('store_id')['store_price_multiplier'].to_dict()

    CAMPAIGN_PRODUCT_AFFINITIES = {
        'Easter Turkey Sales': ['Easter'],
        'Plant-Based Monday': ['Plantbased'], # Ensure case matches product_catalog
        'Back to School': ['Backtoschool'], # Ensure case matches product_catalog
        'Double Points Tuesday': ['General', 'Easter', 'Plantbased', 'Backtoschool'],
        'Easter Turkey Ad': ['Easter']
    }

    PAYMENT_METHOD_MULTIPLIERS = {
        'Credit Card': 1.05,
        'Debit Card': 1.0,
        'Mobile Pay': 1.02,
        'Cash': 0.95
    }

    transaction_id_counter = 0

    for current_date_np in all_dates:
        current_date = pd.to_datetime(current_date_np)
        calendar_row = fiscal_calendar_df[fiscal_calendar_df['date'] == current_date].iloc[0]

        is_holiday = calendar_row['is_holiday']
        is_easter_turkey_ad_period = calendar_row['is_easter_turkey_ad_period']
        is_easter_turkey_sales_period = calendar_row['is_easter_turkey_sales_period']
        day_of_week = calendar_row['day_of_week']
        month_name = calendar_row['month_name']
        fiscal_week = calendar_row['fiscal_week']
        fiscal_month_num = calendar_row['fiscal_month_num']
        fiscal_month_name = calendar_row['fiscal_month_name']
        fiscal_quarter = calendar_row['fiscal_quarter']
        fiscal_year = calendar_row['fiscal_year']

        base_daily_transactions = AVG_TRANSACTIONS_PER_STORE_PER_DAY

        if current_date.weekday() in [5, 6]:
            base_daily_transactions *= 1.5
        elif current_date.weekday() in [0, 1, 2, 3, 4]:
            base_daily_transactions *= 0.9

        promo_type = None

        if is_easter_turkey_sales_period:
            base_daily_transactions *= TRANSACTION_SALES_BOOST
            promo_type = 'Easter Turkey Sales'
        elif is_easter_turkey_ad_period:
            base_daily_transactions *= TRANSACTION_AD_BOOST
            promo_type = 'Easter Turkey Ad'

        if current_date in DOUBLE_POINTS_TUES_DATES_CALENDAR and current_date.weekday() == 1:
            base_daily_transactions *= 1.5
            promo_type = 'Double Points Tuesday'

        if current_date in BACK_TO_SCHOOL_DATES_CALENDAR:
            base_daily_transactions *= 1.4
            promo_type = 'Back to School'

        if current_date.weekday() == 0 and PLANT_BASED_MONDAYS_START <= current_date <= PLANT_BASED_MONDAYS_END:
            base_daily_transactions *= 1.3
            promo_type = 'Plant-Based Monday'

        if is_holiday:
            if calendar_row['holiday_name'] == 'Canada Day':
                base_daily_transactions *= 1.7
            elif calendar_row['holiday_name'] == 'Easter Monday':
                base_daily_transactions *= 1.3

        for store_id in all_store_ids:
            num_transactions_today = int(np.random.normal(base_daily_transactions, base_daily_transactions * TRANSACTION_STD_DEV_RATIO))
            num_transactions_today = max(MIN_TRANSACTIONS_PER_DAY, num_transactions_today)

            current_store_price_multiplier = store_price_multipliers.get(store_id, 1.0)
            
            store_city = store_locations_df[store_locations_df['store_id'] == store_id]['city_province'].iloc[0]
            weather_info = WEATHER_CITIES.get(store_city, {"temp_avg": 10, "temp_std": 5, "precip_prob": 0.5})

            temp_today = np.random.normal(weather_info['temp_avg'] + SEASONAL_TEMP_ADJUST.get(current_date.month, 0), weather_info['temp_std'])
            precip_prob_today = weather_info['precip_prob'] + SEASONAL_PRECIP_ADJUST.get(current_date.month, 0)
            precip_prob_today = max(0, min(1, precip_prob_today))
            
            # Adjust weights for weather conditions to favor Sunny/Cloudy when precip_prob_today is low
            # And Rain/Snow when precip_prob_today is high
            weights_for_precip_types = [
                precip_prob_today * 0.5,  # Rain
                precip_prob_today * 0.5,  # Snow
                (1 - precip_prob_today) * 0.5, # Cloudy
                (1 - precip_prob_today) * 0.5  # Sunny
            ]
            # Ensure weights sum to 1 to avoid issues with random.choices
            total_weight = sum(weights_for_precip_types)
            if total_weight > 0:
                weights_for_precip_types = [w / total_weight for w in weights_for_precip_types]
            else: # Fallback for edge cases, e.g., if precip_prob_today causes all weights to be 0
                weights_for_precip_types = [0.25, 0.25, 0.25, 0.25] # Equal weights

            weather_condition = random.choices(PRECIP_TYPES, weights=weights_for_precip_types, k=1)[0]


            for _ in range(num_transactions_today):
                transaction_id_counter += 1
                transaction_id = f"T{transaction_id_counter:08d}"

                customer_id = random.choice(all_customer_ids)
                current_customer_spending_propensity = customer_propensities.get(customer_id, 1.0)

                payment_method = random.choice(['Credit Card', 'Debit Card', 'Cash', 'Mobile Pay'])
                current_payment_method_multiplier = PAYMENT_METHOD_MULTIPLIERS.get(payment_method, 1.0)

                transaction_channel = random.choice(SALES_CHANNELS)

                num_items_in_transaction = max(1, int(np.random.normal(AVG_ITEMS_PER_TRANSACTION, ITEM_STD_DEV)))
                
                available_products_for_selection = list(all_product_ids)
                product_weights = [1] * len(available_products_for_selection)

                if promo_type and promo_type in CAMPAIGN_PRODUCT_AFFINITIES:
                    promo_affinities = CAMPAIGN_PRODUCT_AFFINITIES[promo_type]
                    for i, prod_id in enumerate(available_products_for_selection):
                        # Ensure 'campaign_affinity' key exists before accessing
                        prod_affinity = product_details.get(prod_id, {}).get('campaign_affinity')
                        if prod_affinity in promo_affinities:
                            product_weights[i] *= 5
                        elif prod_affinity == 'General':
                            product_weights[i] *= 1
                        else:
                            product_weights[i] *= 0.5
                
                total_weight_products = sum(product_weights)
                product_probabilities = [w / total_weight_products for w in product_weights] if total_weight_products > 0 else [1/len(product_weights)]*len(product_weights)


                selected_product_ids = random.choices(available_products_for_selection, weights=product_probabilities, k=num_items_in_transaction)

                for product_id in selected_product_ids:
                    base_price = product_details.get(product_id, {}).get('base_price', BASE_PRODUCT_PRICE)
                    
                    final_price = base_price * current_store_price_multiplier * current_customer_spending_propensity * current_payment_method_multiplier
                    final_price = final_price * (1 + np.random.normal(0, PRICE_VARIABILITY / BASE_PRODUCT_PRICE / 5))
                    final_price = max(1.0, round(final_price, 2)) # Ensure price is not negative and round

                    quantity = random.randint(1, 3)
                    line_item_amount = round(final_price * quantity, 2)
                    
                    is_returned = random.random() < RETURN_RATE
                    returned_date = current_date + timedelta(days=random.randint(1, 30)) if is_returned else None

                    sales_data.append({
                        'transaction_id': transaction_id,
                        'customer_id': customer_id,
                        'store_id': store_id,
                        'product_id': product_id,
                        'sales_date': current_date.strftime('%Y-%m-%d'),
                        'quantity': quantity,
                        'unit_price': final_price,
                        'total_amount': line_item_amount,
                        'payment_method': payment_method,
                        'transaction_channel': transaction_channel,
                        'is_returned': is_returned,
                        'returned_date': returned_date.strftime('%Y-%m-%d') if returned_date else None,
                        'promo_type_applied': promo_type,
                        'temperature_celsius': round(temp_today, 1),
                        'weather_condition': weather_condition,
                        'day_of_week': day_of_week,
                        'month_name': month_name,
                        'fiscal_week': fiscal_week,
                        'fiscal_month_num': fiscal_month_num,
                        'fiscal_month_name': fiscal_month_name,
                        'fiscal_quarter': fiscal_quarter,
                        'fiscal_year': fiscal_year
                    })
    
    df = pd.DataFrame(sales_data)
    # Introduce some dirtiness
    df.loc[df.sample(frac=0.001).index, 'quantity'] = -1
    df.loc[df.sample(frac=0.001).index, 'total_amount'] = df['total_amount'] * 100
    df.loc[df.sample(frac=0.002).index, 'unit_price'] = np.nan
    df.loc[df.sample(frac=0.003).index, 'payment_method'] = 'UNKNOWN'
    
    df.loc[df['is_returned'] == False, 'returned_date'] = None

    print(f"Generated {len(df)} sales detail entries.")
    return df

# --- Main Generation Logic ---
def generate_all_datasets():
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    print("Starting dataset generation...")

    store_locations_df = generate_store_locations()
    store_locations_df.to_csv(os.path.join(OUTPUT_DIR, 'dim_store_locations.csv'), index=False)
    print(f"Saved dim_store_locations.csv with {len(store_locations_df)} records.")

    product_catalog_df = generate_product_catalog()
    product_catalog_df.to_csv(os.path.join(OUTPUT_DIR, 'dim_product_catalog.csv'), index=False)
    print(f"Saved dim_product_catalog.csv with {len(product_catalog_df)} records.")

    customer_profiles_df = generate_customer_profiles()
    customer_profiles_df.to_csv(os.path.join(OUTPUT_DIR, 'dim_customer_profiles.csv'), index=False)
    print(f"Saved dim_customer_profiles.csv with {len(customer_profiles_df)} records.")

    fiscal_calendar_df = generate_fiscal_calendar(START_DATE_SALES, END_DATE_SALES)
    fiscal_calendar_df.to_csv(os.path.join(OUTPUT_DIR, 'dim_fiscal_calendar.csv'), index=False)
    print(f"Saved dim_fiscal_calendar.csv with {len(fiscal_calendar_df)} records.")

    marketing_campaigns_df = generate_marketing_campaigns(fiscal_calendar_df)
    marketing_campaigns_df.to_csv(os.path.join(OUTPUT_DIR, 'fact_marketing_campaigns.csv'), index=False)
    print(f"Saved fact_marketing_campaigns.csv with {len(marketing_campaigns_df)} records.")

    sales_detail_df = generate_sales_detail(
        store_locations_df, product_catalog_df, customer_profiles_df, fiscal_calendar_df
    )
    sales_detail_df.to_csv(os.path.join(OUTPUT_DIR, 'fact_sales_detail.csv'), index=False)
    print(f"Saved fact_sales_detail.csv with {len(sales_detail_df)} records.")

    print(f"\nAll datasets generated and saved to '{OUTPUT_DIR}' directory.")
    print("Dataset Generation Complete!")

if __name__ == "__main__":
    generate_all_datasets()
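One note before running it: the script draws from both Python's `random` module and NumPy's `np.random`, so a single seed won't cover both. Seeding each before calling `generate_all_datasets()` makes the mess reproducible from run to run. A small sketch (the seed value is arbitrary):

```python
import random

import numpy as np

# The generator draws from both Python's `random` and NumPy's `np.random`,
# so both need seeding for the output to be reproducible across runs
random.seed(42)
np.random.seed(42)

# ...then call the script's entry point:
# generate_all_datasets()
```

With both RNGs seeded, every run produces the same "random" NULLs, duplicates, and corrupted rows, which makes the downstream cleaning work repeatable.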

Run Python Code on Google Colab

Step 3: Export and explore

Downloaded the CSVs. Cleaned them in SQL. Built dashboards in Power BI.
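Before moving anything into SQL, a quick pandas pass can confirm the injected mess actually landed in the export. A minimal sketch: the toy frame below stands in for the real `fact_sales_detail.csv` (the column names match what the generator writes; in practice you'd `pd.read_csv` the file instead):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the exported fact table -- in practice:
# sales = pd.read_csv('generated_data/fact_sales_detail.csv')
sales = pd.DataFrame({
    'transaction_id': ['T00000001', 'T00000001', 'T00000002', 'T00000003'],
    'quantity':       [2, -1, 1, 3],
    'unit_price':     [9.99, np.nan, 4.50, 7.25],
    'payment_method': ['Credit Card', 'Cash', 'UNKNOWN', 'Debit Card'],
})

# The generator deliberately injects negative quantities, NaN prices,
# and 'UNKNOWN' payment methods -- count each before cleaning
null_prices  = sales['unit_price'].isna().sum()
neg_quantity = (sales['quantity'] < 0).sum()
unknown_pay  = (sales['payment_method'] == 'UNKNOWN').sum()
print(null_prices, neg_quantity, unknown_pay)  # → 1 1 1
```

A profile like this gives you a target list for the SQL cleaning pass, and a baseline to verify the cleaning actually worked.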

(We'll walk through the cleaning process in the next section.)




☕ Matcha Break: Two Weeks, Ten Dashboards ☕

At some point, I lost track of how many dashboards I’d built.

I’d sit down to “fix one KPI card,” and suddenly it was 2 a.m. and I was formatting DAX like it owed me money.


The matcha flowed. The coffee followed.

Eventually, I started talking to my Power BI model like it was a coworker who wasn’t pulling their weight: “No, ‘contact_completed’ is a binary field. You know this. Pull it together.”


There were moments I wondered if I’d gone too far.

Then I’d fix a broken table relationship, hit refresh… and there it was:


A cleaner story.

A clearer signal.

The quiet thrill of watching chaos resolve into insight.


This project wasn’t just work.

It was joy: the weird, slightly over-caffeinated kind.

The kind where you suffer just enough to be hooked.


Like building a wooden castle that collapses every time, until you figure out how to rebuild it with a stronger foundation.



Next: [Ep 4] PHASE 1: Cleaning Data in SQL. The part where joy meets stabbing pain: cleaning messy SQL tables without losing your mind.



💡 Part of my Data × Design Series — follow the journey from raw interview challenge to polished dashboards.

