EDA in Public (Part 2): Product Deep Dive & Time-Series Analysis in Pandas

Welcome back to the “EDA in Public” series! This is Part 2; if you haven’t seen Part 1 yet, read it here. Here’s a recap of what we conquered.

In Part 1, we took a messy, half-million-row sales file and got it into shape. Specifically, we:

- Sampled the data for faster processing.
- Fixed missing values and standardised text entries (like ‘EIRE’ to ‘Ireland’).
- Filtered out all the noise: returns, cancellations, and transactions with zero price.
- Engineered our most critical feature: the Revenue column.
- Achieved our first business insight: the Top 10 Revenue-Generating Countries.

We now have a clean, revenue-ready Pandas DataFrame, primed to be mined for insights! By the end of this series, I want to have mastered exploratory data analysis with Pandas. Feel free to follow along if you’re a data enthusiast.

Now we shift our focus to delivering truly impactful business intelligence for NovaShop. Our overall goal for Part 2 is to answer the fundamental questions: What products are the best performers, and when is the best time to sell them?

This part is all about analysis using powerful Pandas aggregation techniques (.groupby()) and feature engineering using the datetime column (.dt accessor). We’ll break down our analysis into two key areas:

- Product Deep Dive: Identifying the products that move the most units versus those that bring in the most cash.
- Sales Timing (Time-Series): Uncovering the seasonality of sales (monthly peaks) and the operational trends (busiest hour of the day).

Without further ado, let’s get to it.

Loading the clean data

To improve performance, let’s export our cleaned dataset and re-import it for analysis:

df.to_csv('online_retail_clean.csv', index=False)

Now, I can create a new project and start afresh.

When loading CSVs that contain date columns, I try to use the parse_dates argument in pd.read_csv(). This saves me the step of manually converting the column type later, ensuring the date features are correct from the start.

import pandas as pd

df = pd.read_csv('online_retail_clean.csv', parse_dates=['InvoiceDate'])
print(f"Successfully loaded clean data. Total rows: {len(df)}")

Output:

Successfully loaded clean data. Total rows: 52933

Product Deep Dive: Who Are the True Money Makers?

With the clean data loaded, let’s jump straight into product performance. Products can be considered “best-sellers” in two different ways: by volume (units sold) or by value (revenue generated).

Insight 2: Top 10 Bestsellers (by Quantity)

The quantity list tells us which products are popular and in high demand, even if their price is low. This is important for warehouse stocking and inventory management.

We’ll group by the product Description, sum the Quantity, and sort the result in descending order.

# Top 10 Bestsellers (by Quantity)
df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)

Output:

Description
MEDIUM CERAMIC TOP STORAGE JAR       74698
WORLD WAR 2 GLIDERS ASSTD DESIGNS     9419
GIN + TONIC DIET METAL SIGN           5016
ASSORTED COLOURS SILK FAN             4684
JUMBO BAG RED RETROSPOT               4671
PACK OF 72 RETROSPOT CAKE CASES       3561
MINI PAINT SET VINTAGE                3285
ASSORTED COLOUR BIRD ORNAMENT         2783
BROCADE RING PURSE                    2525
POPCORN HOLDER                        2508
Name: Quantity, dtype: int64

Insight 3: Top 10 Revenue Drivers (The True Money-Makers)

This is what NovaShop really wants to know. These products drive profitability. They might not sell as many units as the items in the previous list, but their higher price or larger bulk orders make them far more valuable.

We follow the same process, but this time we aggregate on the Revenue column we engineered in Part 1.

# Top 10 Revenue Drivers (The True Money-Makers)
df.groupby('Description')['Revenue'].sum().sort_values(ascending=False).head(10)

Output:

Description
MEDIUM CERAMIC TOP STORAGE JAR    77785.74
DOTCOM POSTAGE                    18456.13
REGENCY CAKESTAND 3 TIER          15093.38
AMAZON FEE                        13541.33
PARTY BUNTING                      9491.60
GIN + TONIC DIET METAL SIGN        9480.51
JUMBO BAG RED RETROSPOT            8996.65
Manual                             8966.96
CHILLI LIGHTS                      8395.20
POSTAGE                            7194.51
Name: Revenue, dtype: float64

Discussion: What These Lists Tell NovaShop

This is where the power of EDA truly shines, because these two lists tell a very different story:

- The All-Star Product: The “MEDIUM CERAMIC TOP STORAGE JAR” is clearly the winner, dominating both lists by a massive margin. It’s high volume and high value. This item should be NovaShop’s primary focus for inventory, marketing, and expansion.
- A Surprising Revenue Source: While exploring the data, I noticed items like “DOTCOM POSTAGE,” “AMAZON FEE,” “Manual,” and “POSTAGE.” These aren’t actual products; they’re service fees and manual adjustments. Yet together, they bring in almost £48,000 in revenue (see the sketch after this list). NovaShop should double-check whether these fees are being counted in their gross profit. It’s an easy thing to miss, but it could be a quiet and meaningful source of income hiding in plain sight.
- The High-Value Physical Goods: Items like the “REGENCY CAKESTAND 3 TIER” and “CHILLI LIGHTS” make the revenue list but not the quantity list. This tells us they are high-price items that contribute significantly to total cash flow, and they are worth targeted marketing efforts.
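
If NovaShop wants to put a number on those fee lines, one quick way is to flag them with .isin(). Here’s a minimal sketch, assuming the four labels spotted above are the only service entries (a real audit would scan the full Description list first):

# Flag the service/fee entries spotted above (assumed to be the complete set)
service_labels = ['DOTCOM POSTAGE', 'AMAZON FEE', 'Manual', 'POSTAGE']
is_service = df['Description'].isin(service_labels)

print(f"Revenue from fees/adjustments: £{df.loc[is_service, 'Revenue'].sum():,.2f}")
print(f"Revenue from physical products: £{df.loc[~is_service, 'Revenue'].sum():,.2f}")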

We now have a complete picture of what NovaShop sells best. Time to figure out when!

Preparing for Time-Series Analysis: Feature Engineering

We know what sold well. Now, let’s figure out when sales peak. Our goal is to break down the InvoiceDate column into components that allow us to group and analyze sales by year, month, day, and even hour.

This is a classic feature engineering move. Since we already made sure InvoiceDate is a proper Pandas datetime object (thanks to parse_dates in our loading step!), we can use the extremely useful .dt accessor.

Extracting Time Components

To understand seasonality (monthly) and operational efficiency (hourly), we’ll create four new columns.

- Year: for long-term comparisons.
- Month: for identifying seasonal peaks (e.g., the Q4 rush).
- DayName: to find the busiest day of the week.
- Hour: to determine peak operational hours.

Let’s generate these columns:

print("\n--- Time Series Feature Extraction ---")

# Extract Year, Month, DayName, and Hour from the InvoiceDate column
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['DayName'] = df['InvoiceDate'].dt.day_name()
df['Hour'] = df['InvoiceDate'].dt.hour

print("New time features added: Year, Month, DayName, Hour.")
print(df[['InvoiceDate', 'Year', 'Month', 'DayName', 'Hour']].head())

Output:

--- Time Series Feature Extraction ---
New time features added: Year, Month, DayName, Hour.
          InvoiceDate  Year  Month    DayName  Hour
0 2011-06-01 12:05:00  2011      6  Wednesday    12
1 2011-05-27 17:14:00  2011      5     Friday    17
2 2011-04-21 17:05:00  2011      4   Thursday    17
3 2011-11-16 10:39:00  2011     11  Wednesday    10
4 2011-03-10 08:40:00  2011      3   Thursday     8

Perfect! We can move right into mapping the sales patterns in the next section.

Mapping Sales Patterns (Time-Series Insights)

With our time features ready, we can now ask and answer our questions about when sales occur. We’ll start big (monthly trends) and drill down small (hourly trends).

Insight 4: Sales Trends by Month (Seasonality)

Monthly aggregation is crucial for spotting seasonality — the predictable pattern of sales repeating every year. This data is used for financial forecasting and planning inventory levels for the peak season.

We’ll group by the Month column and calculate the total revenue for each.

print("\n--- Insight 4: Sales Trends by Month (Seasonality) ---")

# Group by month and sum the revenue
monthly_revenue = df.groupby('Month')['Revenue'].sum().sort_values(ascending=False)

# Print the results, which should show the seasonal peaks
print(monthly_revenue)

Output:

--- Insight 4: Sales Trends by Month (Seasonality) ---
Month
11    143576.360
1     142013.420
12    133178.980
10    119533.540
9     103929.420
3      72968.270
8      71302.890
5      70424.510
6      68722.520
7      68713.831
4      51882.010
2      50178.730
Name: Revenue, dtype: float64

Interpretation: The Massive Q4 and January Surprise

The data clearly shows that NovaShop’s sales are heavily skewed toward the end of the year, confirming the typical retail Q4 (October, November, December) rush.

- Peak Season: November is the peak month by a slight margin, with January and December close behind (see the sketch after this list for each month’s revenue share). This is the time to stock up on the Medium Ceramic Top Storage Jars and run targeted ads.
- The January Surge: The second-highest month is January! This is an excellent insight for NovaShop. It suggests customers may be using Christmas gift money or taking advantage of post-holiday sales. This period should not be treated as a slump but as a secondary high-volume sales opportunity.
- Low Season: February and April are the lowest performers, which helps NovaShop plan for lower staffing and marketing budgets during those periods.
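
One optional extra step before moving on: expressing each month as a share of total revenue makes the skew explicit. A minimal sketch reusing the monthly_revenue Series from above:

# Each month's contribution as a percentage of total revenue
monthly_share = (monthly_revenue / monthly_revenue.sum() * 100).round(1)
print(monthly_share)

On these figures, November, January, and December together account for roughly 38% of the year’s revenue.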

Insight 5: Sales Trends by Day of Week and Hour (Operational Planning)

While seasonality is about finance and inventory, daily and hourly trends are about operations. NovaShop can use this to schedule warehouse staff, optimize ad spend, and time email campaigns.
We’ll run two separate aggregations here: one for the day of the week and one for the hour of the day.

Revenue by Day of Week

# 1. Sales by Day of Week
daily_revenue = df.groupby('DayName')['Revenue'].sum()
print("\n--- Revenue by Day of Week ---")
print(daily_revenue)

Output:

--- Revenue by Day of Week ---
DayName
Friday       163861.320
Monday       171026.230
Sunday        83125.890
Thursday     219342.980
Tuesday      282796.741
Wednesday    176271.320
Name: Revenue, dtype: float64

Interpretation: Tuesday is the Power Day

The data reveals a clear and actionable pattern for the work week:

- Peak Sales: Tuesday absolutely dominates the week, generating almost 30% more revenue than the next busiest day, Thursday. This is a prime day for launching new products or running high-impact flash sales.
- Slump Days: Unsurprisingly, sales drop dramatically on Sunday. NovaShop might consider this a low-priority day for email marketing or customer service staffing.
- Missing Day: The dataset contains no transactions for Saturday in our sample (or perhaps in the full set, which is common in B2B datasets); the sketch after this list makes the gap explicit.
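
A small readability tweak worth knowing: .groupby() returns the day names in alphabetical order. Reindexing with an explicit day order prints the week in calendar order and surfaces the missing Saturday as NaN; a quick sketch:

# Reindex so the days print in calendar order; Saturday appears as NaN (no transactions)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
print(daily_revenue.reindex(day_order))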

Revenue by Hour of Day

# 2. Sales by Hour of Day
hourly_revenue = df.groupby('Hour')['Revenue'].sum().sort_index()
print("\n--- Revenue by Hour of Day ---")
print(hourly_revenue)

Output:

--- Revenue by Hour of Day ---
Hour
7       2830.910
8      26597.800
9      76768.750
10    209809.971
11    115233.600
12    142474.460
13    130348.290
14    119241.520
15    145178.980
16     70620.550
17     36148.030
18     14022.790
19      4397.130
20      2751.700
Name: Revenue, dtype: float64

Interpretation: The Late Morning Boom

The hourly data points directly to operational needs:

- Massive Peak: Sales explode at 10 AM (Hour 10), which brings in roughly 45% more revenue than the next-best hour. This is the optimal time for NovaShop to launch email marketing and social media campaigns (the sketch after this list crosses day and hour to pin down the exact window).
- Sustained Activity: Sales remain high from 11 AM through 3 PM, suggesting a strong midday and post-lunch shopping window.
- End of Day: Activity falls off sharply after 4 PM, with very little revenue generated in the evening (7 PM onward). This indicates NovaShop’s customer base is primarily desk-based, shopping during traditional working hours.
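
As one last exploratory idea (not part of the original walkthrough), we could cross DayName and Hour to pin down the single busiest window of the week; a sketch using pivot_table:

# Total revenue for every (day, hour) combination
day_hour = df.pivot_table(index='DayName', columns='Hour', values='Revenue', aggfunc='sum')

# Stack back into a Series and locate the highest-revenue window
print(day_hour.stack().idxmax())  # prints a (DayName, Hour) tuple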

Conclusion: Setting the Stage for Customer Segmentation

We have officially finished the heavy-duty EDA! We know what sells best (the Ceramic Jar and fees), where sales are strongest (the UK), and when the sales peaks occur (Q4, Tuesdays, and 10 AM).

However, we are still missing the most valuable piece of the puzzle: The Who. The best companies don’t just know what sells; they know who their most valuable customers are and how to keep them.

In Part 3, we will dive into the most advanced analysis technique in this series: RFM (Recency, Frequency, Monetary) Customer Segmentation. This will allow us to classify customers into tiers like “Champions,” “Loyal Customers,” and “At-Risk,” giving NovaShop a true roadmap for personalised marketing.

I hope you found this article helpful. Feel free to say hi to me on any of these platforms. I truly appreciate your feedback.

Medium

LinkedIn

Twitter

YouTube
