Welcome to the beginning of a serious learning journey that I'm calling "EDA in Public." For those who know me, I believe the best way to learn something is to tackle a real-world problem and share the whole messy process, including mistakes, wins, and everything in between. If you've been looking to level up your Pandas and data analysis skills, this is the series for you.
We're going to act as data analysts for a fictional, mid-sized e-commerce company I'll call NovaShop. They handed us a raw, messy sales CSV and asked a simple question: "How are we doing?"
The goal of Part 1 is foundational: we'll clean this tricky e-commerce dataset, explore its basic structure, and master the core EDA skills in Pandas that every data scientist uses daily. This series is structured to take you from beginner (Part 1) to advanced data analyst (Part 3), so feel free to jump in wherever you're at.
Before we jump into the code, let's nail down our core motivation. For NovaShop, we need to answer some simple yet powerful questions: Which products drive the most revenue? Which countries generate the most sales? Let's find out.
Dataset Overview: Unpacking the Sales Data
To start our analysis for NovaShop, we'll be using the UCI Online Retail Dataset. It's a great, highly practical, unpolished dataset that captures all transactions for a UK-based, non-store online retailer between late 2010 and late 2011.
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which allows sharing and adaptation of the data for any purpose, provided appropriate credit is given.
The dataset contains over half a million rows, and it's full of the kinds of anomalies you encounter in the real world: missing values, negative numbers, and inconsistent text formatting. That's exactly what we want!
Here are the eight key columns we'll be working with, and what they tell us from a business perspective:
- InvoiceNo: The invoice number, a unique 6-digit identifier assigned to each transaction. If the code starts with 'C', it indicates a cancellation (a return).
- StockCode: The product code, a unique 5-digit code assigned to each distinct product.
- Description: The product name. This often needs cleaning (extra spaces, inconsistent casing).
- Quantity: The number of units sold per transaction. Can be negative for returns.
- InvoiceDate: The date and time of the transaction. Essential for time-series analysis later on.
- UnitPrice: The product price per unit in Sterling (GBP). It can sometimes be 0 or negative due to errors or free items.
- CustomerID: A unique 5-digit number assigned to each registered customer. Crucially, this is often missing, meaning we have many guest transactions.
- Country: The name of the country where the customer resides. This will be useful for segmenting international sales.
Let's import the dataset into Pandas, take a quick peek at the first few rows with df.head(), and see how many rows we're dealing with.
import pandas as pd
import numpy as np

# This file contains non-ASCII characters, so we pass an explicit encoding
df = pd.read_csv('Online Retail.csv', encoding='unicode_escape')
df.shape
Output:
(541909, 8)
That's a lot of rows. I might want to slice them down a bit.
Data Loading and Slicing: Dealing with Volume
For Part 1, we're going to take a 10% random sample of the full dataset. This gives us a much more manageable size (around 54,000 rows) while maintaining the characteristics of the original data. This is a common and practical approach when dealing with large volumes of data or when prototyping.
# Data loading and sampling
FILE_PATH = 'Online Retail.csv'
SAMPLE_FRACTION = 0.1  # we will sample 10% of the data

full_df = pd.read_csv(FILE_PATH, encoding='unicode_escape')

# Take a random 10% sample for faster processing in Part 1
df = full_df.sample(frac=SAMPLE_FRACTION, random_state=42).reset_index(drop=True)
Now, let's check the shape again using df.shape:
Output:
(54191, 8)
Perfect! Now we can begin.
Initial Data Inspection: Getting Our Hands Dirty
My first step is to discover what we're cleaning. I need to answer these questions:
- What does the dataset contain?
- What are the data types and non-null counts?
- What do the numbers look like?
Visual Check: df.head() and df.tail()
Looking at the first and last few rows, I could confirm a few things:
- Cancellations & returns: I noticed some InvoiceNo values starting with 'C', and the corresponding Quantity is negative. This confirms returns are included, and for revenue analysis I'll need to separate or exclude them.
- Missingness: I could visually see NaN values in the CustomerID column, confirming the mass of guest transactions.
- Inconsistent text: I checked the Description column for whitespace. Based on the small sample I saw, I couldn't really tell, but it doesn't hurt to handle it when I get to data cleaning, and I'll keep the capitalisation consistent too. It's always best practice to strip leading/trailing whitespace from all string columns to prevent subtle errors when grouping.
- I also noticed inconsistencies in the Country column: I spotted a country named EIRE, which most likely means Ireland. I might need to change that.
What Are the Data Types and Non-Null Counts? (.info())
The next essential step is to check the structure using the .info() method. This tells me the data type of each column and, most importantly, how many non-null (not missing) values we have.
Key Findings
- Missing values: After taking our 10% sample, I could see a large gap in CustomerID: roughly 25% of the customer IDs are missing. I also noticed a few hundred missing values in Description that I'll need to address.
- Data types: The InvoiceDate column is still listed as an object (string) type; I'll have to convert it to a proper Pandas datetime object. CustomerID is also currently a float, likely because it contains those NaN values! That's a little detail I'll need to remember if I ever want to use it as a true integer ID.
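As a quick sanity check of that last point, here's a minimal sketch (using made-up rows, not the actual dataset) of how a single NaN forces an integer ID column to float and why dates initially load as plain strings:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mirroring the retail schema (values invented)
demo = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379"],
    "Quantity": [6, -1],
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 9:41"],  # loads as strings
    "CustomerID": [17850, np.nan],  # one missing ID
})

# NaN cannot live in an int64 column, so pandas promotes it to float64
print(demo["CustomerID"].dtype)   # float64
# Dates stay as plain object dtype until we convert them explicitly
print(demo["InvoiceDate"].dtype)  # object
# .info() reports this per-column non-null count
print(demo["CustomerID"].count())
```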
What Do the Numbers Look Like? (.describe())
Next, I used .describe() to get a quick statistical summary of the numerical columns (Quantity and UnitPrice).
- Quantity: The minimum quantity is -2472. This confirms that returns exist, but the scale of that minimum suggests an extreme outlier transaction. For basic sales analysis, I'll need to filter out the negative values and potentially the extreme positive and negative outliers.
- UnitPrice: The minimum price is 0. This means some items were given away for free or are placeholder entries. Since a product should have a positive price in a normal sale, filtering out any rows where UnitPrice is zero or less is best practice for calculating revenue accurately.
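To make that concrete, here's a tiny sketch with invented numbers showing how .describe() surfaces both anomalies at once:

```python
import pandas as pd

# Invented quantities/prices that mimic the anomalies in the real file
demo = pd.DataFrame({
    "Quantity": [6, 12, -2472, 3],
    "UnitPrice": [2.55, 0.0, 1.65, 4.25],
})

stats = demo.describe()
# The "min" row is where returns and free items show up
print(stats.loc["min", "Quantity"])   # -2472.0 (a return / extreme outlier)
print(stats.loc["min", "UnitPrice"])  # 0.0 (a free item or placeholder)
```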
Based on these quick inspections, this data is far from clean. We have massive missingness, incorrect data types, highly problematic negative/zero values in our core numerical columns, and text inconsistencies to deal with.
Handling Missing Values
Now that we know where our data is missing, we can discuss strategies for handling null values. I'm mostly concerned with Description and CustomerID.
Missing Descriptions
The Description tells us what product was sold, so a missing one renders the transaction meaningless. In our sample, less than 1% of the rows have missing descriptions. Dropping these rows entirely makes sense, as they're useless for analysis at the product level.
# Drop rows where the Description is missing
df.dropna(subset=['Description'], inplace=True)

# Check for remaining null counts
df['Description'].isnull().sum()
Output:
np.int64(0)
Alright, perfect! All null values are gone.
Missing Customer IDs
The missing CustomerIDs (around 25% of our sample) are a much bigger deal. If I drop them all, I'll lose almost a quarter of our sales data, which would give NovaShop a dangerously skewed view of its total revenue.
For simple revenue and product analysis (the goal of Part 1), I don't really need the CustomerID, so we can proceed with the analysis on all rows even without an ID. We only need the ID if we plan on doing customer segmentation (like RFM analysis), which I'll save for Part 2!
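If you want to quantify the trade-off before deciding, a one-liner gives the exact share of rows a dropna() would cost. Here's a sketch on an invented mini-frame:

```python
import numpy as np
import pandas as pd

# Invented IDs: two registered customers, two guest checkouts
demo = pd.DataFrame({"CustomerID": [17850.0, np.nan, 13047.0, np.nan]})

# isna() gives booleans; their mean is the fraction of missing rows
missing_share = demo["CustomerID"].isna().mean()
print(f"{missing_share:.0%} of rows have no CustomerID")
```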
Data Type Conversion and Duplicate Removal
Now that we've handled the missing descriptions, the next two immediate issues are dealing with fully duplicated rows and fixing our critical InvoiceDate column.
Fixing the Invoice Date Type
Remember how .info() showed InvoiceDate as a string (object)? We need to fix this right away so Pandas knows how to sort and aggregate our data chronologically.
# Convert InvoiceDate from object (string) to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
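A slightly more defensive variant, if you know the layout of the timestamps: passing an explicit format (this file typically uses month/day/year stamps) is faster on large frames, and errors='coerce' turns malformed stamps into NaT instead of raising. A small sketch with an invented bad row:

```python
import pandas as pd

raw = pd.Series(["12/1/2010 8:26", "not a date"])  # second entry is invented noise

# Explicit format avoids guessing; coerce maps unparseable values to NaT
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

print(parsed.dtype)         # datetime64[ns]
print(parsed.isna().sum())  # 1 row coerced to NaT
```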
Removing Duplicates
If two rows are identical across all columns, they're duplicates and will artificially inflate our sales numbers. Let's check for them and remove them.
# Check for fully duplicated rows
num_duplicates = df.duplicated().sum()
print(f"Found {num_duplicates} fully duplicated rows.")
Output:
Found 62 fully duplicated rows.
Let's drop them:
# Remove duplicates, keeping the first occurrence
df.drop_duplicates(inplace=True)
print(f"Dataframe size after removing duplicates: {len(df)} rows.")
Output:
Dataframe size after removing duplicates: 53980 rows.
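Under the hood, duplicated() flags any row that is an exact repeat of an earlier one across every column, and drop_duplicates() keeps the first occurrence. A toy sketch with invented rows:

```python
import pandas as pd

demo = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "71053"],
})

# Row 1 repeats row 0 across every column, so it is flagged
print(demo.duplicated().sum())      # 1
print(len(demo.drop_duplicates()))  # 2
```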
Filtering Out Returns and Errors
Our .describe() summary showed us some serious red flags: negative quantities (returns) and zero/negative unit prices (errors/free items). For Part 1, we want to calculate net revenue from sales, so we must filter this noise out.
Handling Negative Quantities and Prices
We will keep only the transactions where:
- Quantity is strictly greater than zero (filtering out all returns and cancellations).
- UnitPrice is strictly greater than zero (filtering out errors and free items).
# Filter: keep only transactions where Quantity is positive (i.e., sales, not returns)
df = df[df['Quantity'] > 0]

# Filter: keep only transactions where UnitPrice is positive (i.e., not free or an error)
df = df[df['UnitPrice'] > 0]

print(f"Dataframe size after filtering: {len(df)} rows.")
Output:
Dataframe size after filtering: 52933 rows.
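The two filters can also be applied in one step with a combined boolean mask, which avoids an intermediate copy. A sketch on invented rows:

```python
import pandas as pd

demo = pd.DataFrame({
    "Quantity": [6, -1, 3, 2],             # -1 is a return
    "UnitPrice": [2.55, 2.55, 0.0, 1.65],  # 0.0 is a free item
})

# Keep rows where BOTH conditions hold (note the parentheses around each mask)
clean = demo[(demo["Quantity"] > 0) & (demo["UnitPrice"] > 0)]
print(len(clean))  # 2 rows survive
```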
Cleaning Text Columns
Finally, let's handle the text standardization issues we spotted visually, like the EIRE country name and any potential hidden whitespace in Description.
- Strip whitespace: We use .str.strip() to remove leading/trailing spaces from both Description and Country.
- Standardize country names: We manually map inconsistencies like 'EIRE' to 'Ireland'.
# Clean the Description and Country text columns
df['Description'] = df['Description'].str.strip()
df['Country'] = df['Country'].str.strip()

# Handle specific country name inconsistencies:
# EIRE is a common inconsistency in this dataset for Ireland
df['Country'] = df['Country'].replace('EIRE', 'Ireland')

print("Text columns cleaned and standardized.")
Feature Engineering & First Insight
The data is now clean, so we can finally start asking the fun questions. The single most important metric for any sales dataset is revenue. Since our original data only has Quantity and UnitPrice, we need to engineer the Revenue column ourselves.
Feature Engineering: Creating the Revenue Column
Revenue for a transaction is simply the number of items sold multiplied by the price of each item.
df['Revenue'] = df['Quantity'] * df['UnitPrice']
First Insight: Which Countries Drive Revenue?
Let's use our clean data to answer NovaShop's question: "Which countries are driving our sales?"
We'll use a powerful three-step combination:
- Group by the Country column.
- Aggregate (sum) the Revenue within each group.
- Sort the results from highest revenue to lowest.
# Group by Country, sum the Revenue, and sort for the top 10
top_countries = df.groupby('Country')['Revenue'].sum().sort_values(ascending=False).head(10)

print("\n--- Top 10 Countries by Revenue (GBP) ---")
print(top_countries)
Output:
--- Top 10 Countries by Revenue (GBP) ---
Country
United Kingdom    941268.661
Netherlands        27435.830
Ireland            26066.000
France             23645.330
Germany            22389.510
Australia          12429.990
Spain               5600.900
Switzerland         5483.890
Hong Kong           3597.850
Belgium             3593.510
Name: Revenue, dtype: float64
The output shows a massive dominance by the United Kingdom, which is expected since the company is UK-based. The next several countries give us a quick roadmap of where NovaShop's international focus should be.
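As an aside, the sort-then-head idiom has a shorthand: calling .nlargest(n) on the aggregated Series does both steps at once. A toy sketch with invented revenue figures:

```python
import pandas as pd

demo = pd.DataFrame({
    "Country": ["United Kingdom", "France", "United Kingdom", "Germany"],
    "Revenue": [100.0, 40.0, 60.0, 25.0],
})

# Equivalent to .sort_values(ascending=False).head(2)
top = demo.groupby("Country")["Revenue"].sum().nlargest(2)
print(top.index[0], top.iloc[0])  # United Kingdom 160.0
```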
Conclusion
We did it. We took a raw, half-a-million-row dataset, sampled it for manageability, battled missing values, fixed data types, filtered out errors, and calculated our first key business metric. The hard, foundational work is done.
Here's a quick recap of what we conquered in Part 1:
- Data vetting: We used .head(), .info(), and .describe() to identify critical issues like the negative prices/quantities, missing customer IDs, and the incorrect datetime format.
- Data cleaning: We systematically removed nulls and duplicates, converted InvoiceDate to a proper datetime object, and filtered out non-sale transactions (returns and free items).
- Feature engineering: We created the vital Revenue column.
- First insight: We generated the Top 10 Countries by Revenue for NovaShop, giving them their first actionable data point from the clean set.
The clean dataset is now primed for more sophisticated analysis. In Part 2, we'll dive deep into product analysis and time-series decomposition: we'll figure out which items are the true money-makers and analyse how sales volume changes hour by hour and month by month.
I'm excited to keep going! If you enjoyed this article, feel free to let me know on any of these channels. Your feedback means a lot to me.

