E-Commerce Data Analysis & Customer Segmentation Project
Notebook Link
📖 Project Overview
This project conducts a deep-dive analysis of a transnational e-commerce dataset to uncover actionable business insights. The goal was to move beyond simple reporting and build a data-driven framework for understanding customer behavior, optimizing marketing strategies, and driving revenue growth. The final deliverable is an interactive Power BI dashboard and a strategic executive summary.
Key Questions Answered:
- Who are our most valuable customers and how do we retain them?
- What are the sales trends and how can we optimize operations?
- Which products and regions present the biggest opportunities?
- Why do customers churn and how can we win them back?
📊 Dashboard Preview

🧰 Tools & Technologies
- Programming Language: Python (Pandas, NumPy, Matplotlib, Seaborn)
- Data Visualization: Power BI, Tableau
- Version Control: Git, GitHub
- Methodologies: RFM Analysis, Cohort Analysis, Time Series Analysis, Data Cleaning & Wrangling
📈 Dataset
- Source: [Kaggle - E-Commerce Dataset]
- Description: A transnational data set which contains all the transactions occurring between [Date Range, e.g., 01/12/2010 and 09/12/2011].
- Size: 541910 of transactions, 4260 Unique customer, 3600 Unique products.
- Original Columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country.
🔧 Data Cleaning & Preprocessing
The raw data required extensive cleaning to ensure analytical integrity. A nuanced, two-tiered approach was used to handle different business contexts.
1. Data Type Conversion
- Converted
InvoiceDateto a datetime object for time-series analysis. - Converted
CustomerIDto a string (object) type to represent it as a unique identifier.
2. Handling Duplicate Records
- Issue: 5,268 records were exact duplicates across all columns.
- Root Cause: Determined to be system errors in data logging.
- Action: Removed all duplicate rows, keeping only the first occurrence.
- Impact: Prevented artificial inflation of sales metrics.
3. Handling Missing Values
- Issue: 25% of
CustomerIDvalues were missing. - Investigation: These records were primarily for operational items (e.g., ‘DOTCOM POSTAGE’, ‘JUMBO BAG’) and system adjustments, not representative of customer purchases.
- Action: Implemented a segmented approach:
business_df: Kept all records for accurate financial reporting.analysis_df: Removed records with nullCustomerIDfor clean customer behavior analysis.
4. Handling Invalid and Extreme Values
- Investigation: Negative values in
QuantityandUnitPricerepresented valid business actions like returns and accounting adjustments. Extremely high quantities indicated wholesale orders, and high prices were operational fees. - Action: Created a refined
analysis_dffor customer behavior by applying filters:Quantity > 0andUnitPrice > 0- Capped quantities and prices at reasonable thresholds for retail.
- Filtered out transactions with non-product descriptions (e.g., ‘POSTAGE’, ‘FEE’).
Final Datasets:
business_df: The complete dataset for accurate financial calculations.customer_df: The cleaned subset for analyzing customer and product behavior.
🧮 Analysis & Methodology
1. Exploratory Data Analysis (EDA)
Conducted a thorough univariate and multivariate analysis to understand distributions, correlations, and patterns.
- Temporal Analysis: Sales trends over time, daily/weekly/hourly patterns.
- Geographic Analysis: Revenue and order distribution by country.
- Product Analysis: Identification of best/worst-selling products and categories.
2. RFM Analysis (Recency, Frequency, Monetary)
Segmented the customer base into distinct groups for targeted marketing.
- Recency (R): Days since last purchase.
- Frequency (F): Number of transactions.
- Monetary (M): Total revenue from the customer.
- Segments Identified:
- Champions: Best customers, drive the majority of revenue.
- Loyal Customers: Reliable, consistent purchasers.
- Potential Loyalists: Recent customers with high potential.
- At-Risk Customers: High-value customers who are churning.
- Needs Attention: Dormant, low-engagement customers.
3. Cohort Analysis
Analyzed customer retention rates over time.
- Method: Grouped customers by their acquisition month and tracked the percentage who made purchases in subsequent months.
- Key Insight: Identified a concerning trend of declining retention rates throughout 2011, highlighting a need for improved customer onboarding and engagement.
📊 Key Insights & Findings
Customer Behavior
- The 80/20 Rule is Evident: The top segment (“Champions”), comprising only 643 customers, generates a massively disproportionate amount of revenue.
- Retention is a Key Opportunity: Cohort analysis reveals a significant drop-off in customers after 3 months.
- Clear Shopping Patterns: Sales peak on Thursdays and during mid-day hours (10 AM - 3 PM), which generates nearly 80% of daily revenue.
Product & Sales Analysis
- Top Product: “REGENCY CAKESTAND 3 TIER” is the top revenue-generating product ($116).
- Market Concentration: The UK drives over 80% of all revenue. The Netherlands shows a higher value per order than Germany, suggesting an untapped opportunity.
(For a full list of insights, refer to the Executive Summary)
🚀 Executive Summary & Business Recommendations
The analysis was synthesized into a strategic document for decision-makers. Key recommendations include:
- Launch a VIP Program: Protect and reward “Champion” customers to secure the company’s largest revenue stream.
- Execute Targeted Win-Back Campaigns: Re-engage high-value “At-Risk” customers with urgent, personalized offers.
- Optimize Marketing Spend: Reallocate ad budget to target the Netherlands based on its higher value per order and focus campaigns on high-conversion days (Thu) and times (10 AM - 3 PM).
- Improve New Customer Onboarding: Address the declining retention trend with a structured nurture program for new signups.
🔗 Download the Full Executive Summary PDF
👨💻 Author
Ahmed Elatwy
- LinkedIn: linkedin.com/in/ahmed-elatwy
- GitHub: github.com/AhmedElatwy
- Portfolio: ahmedelatwy.github.io