Unlocking Data Insights: A Step-by-Step Guide to Real-World Data Cleaning

Pooja Pendharkar
6 min read · Oct 22, 2023

In today’s data-driven world, accurate and meaningful insights are crucial for making informed decisions. However, raw data is often messy, with errors and inconsistencies that can mislead analysis. In this blog post, we’ll dive into the nitty-gritty of data cleaning using a real-world case study: the Quantium dataset on Forage. We’ll explore the step-by-step process of transforming raw data into a clean, reliable dataset ready for analysis.

Why clean data?

Our objective in data cleaning is threefold:

  1. Addressing Incomplete or Missing Data: We focus on filling in the gaps, ensuring that no crucial information is left out due to missing values.
  2. Handling Invalid or Erroneous Data: We identify and rectify data entries that do not align with the expected formats or standards. This involves correcting errors and ensuring consistency.
  3. Eliminating Inaccurate Data: We pinpoint and remove inaccurate data points, ensuring the dataset is reliable and precise for analysis. This step is crucial for generating trustworthy insights.

Setting the Stage

Before delving into data cleaning, it’s crucial to define clear project goals. Understand the purpose of the wrangled data and what it will be used for.

For our example case, the goal is to analyze the data to understand current purchasing trends and customer behaviors, focusing on chip purchases: identify customer segments and their chip-buying patterns, and define key metrics that capture purchasing behavior effectively.

With this goal in mind, we begin the data-cleaning process by looking at two datasets: customer data and transaction data.
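
If you want to follow along, here is a minimal sketch for loading both files with pandas (the file names match the ones used later in this post; adjust the paths to wherever your copies live).

import pandas as pd

# Transaction data ships as an Excel workbook, customer data as a CSV
df = pd.read_excel("QVI_transaction_data.xlsx")
df1 = pd.read_csv("QVI_purchase_behaviour.csv")

# Peek at the first few rows of each to get oriented
print(df.head())
print(df1.head())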

Step 1: Data Profiling — Understanding the Dataset

The first step in any data-cleaning process is to assess data quality. Our goal here is to truly understand our dataset, not just skim the surface. We want to grasp the nuances, get to know the different attributes, examine the data, and ultimately chart our course for the next steps.

Diving directly into raw CSV or Excel files, especially for large datasets, can be overwhelming. That’s where a bit of Python comes to the rescue, making our lives much simpler.

To perform data profiling, you can either use a dedicated profiling library or write custom Python code to gain insights into the dataset.

Let’s start with the custom code first.

# Get an overview of the dataset: column names, non-null counts, and dtypes
df.info()

# Summary statistics for numerical columns; remember, categorical columns are not shown here
df.describe()

# Count the missing or null values in each column
df.isnull().sum()

# Data types of columns
df.dtypes

# Correlation matrix to spot closely correlated numerical columns
# (on recent pandas versions you may need df.corr(numeric_only=True) to skip text columns)
df.corr()

While the custom code is a great way to start profiling, I personally like using the existing Pandas Profiling library to make my life a tad easier. (The package has since been renamed to ydata-profiling, but the pandas_profiling import used below still works in many environments.)

import pandas as pd
import pandas_profiling  # importing this registers the .profile_report() accessor on DataFrames

df = pd.read_excel("QVI_transaction_data.xlsx")
profile = df.profile_report(title='Data Preprocessing Report')
profile.to_file("TransactionEADReport.html")

This code simplifies the process significantly. The resulting HTML file contains all the crucial details you require. All it takes is a thorough read-through to comprehend the next steps. Pay attention to missing, inaccurate, or invalid data.

In the Quantium case, several key discoveries were made based on the profiling report:

  • No missing or null values were found
  • Incorrect data formats were identified (the DATE column is stored as an Excel serial number)
  • The product name column didn’t look right as per the directions mentioned in the case
  • Outliers were detected in the total sales (TOT_SALES) column

Based on these insights we outline the next steps to be completed for data cleaning.

Step 2: Standardizing Date Formats

One common issue is inconsistent date formats. In this dataset the ‘DATE’ column is stored as an Excel serial number (days counted from an 1899-12-30 origin), so we convert it to a standard date format, ensuring uniformity in the dataset.

df['DATE'] = pd.to_datetime(df['DATE'], origin='1899-12-30', unit='D').dt.date
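
As a quick sanity check, you can convert a single Excel serial number by hand and glance at the range of converted dates. A minimal sketch (the serial value 43282 is just an illustrative example; the exact date range depends on your copy of the data):

# Excel serial 43282 corresponds to 2018-07-01 when counted from the 1899-12-30 origin
print(pd.to_datetime(43282, origin='1899-12-30', unit='D'))

# Inspect the range of converted dates to confirm they look plausible
print(df['DATE'].min(), df['DATE'].max())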

Step 3: Extracting Product Information

Cleaning product names is essential for effective analysis. We extract product size and brand information from the ‘PROD_NAME’ column, allowing us to categorize products accurately.

# Pack size: first run of digits immediately followed by 'g'/'G'
df['PACK_SIZE'] = df['PROD_NAME'].str.extract(r'(\d+)([gG])')[0].astype(int)
# Name without the pack size: drop the last whitespace-separated token (e.g. '175g')
df['PROD_BRAND'] = df['PROD_NAME'].apply(lambda x: ' '.join(x.split()[:-1]))
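
To see what these two lines do, here is a tiny example on a made-up product name (hypothetical, for illustration only; real names in the dataset follow the same name-plus-pack-size pattern):

# Hypothetical product name used only to illustrate the extraction logic
sample = pd.DataFrame({'PROD_NAME': ['Smiths Crinkle Cut Original 330g']})
sample['PACK_SIZE'] = sample['PROD_NAME'].str.extract(r'(\d+)([gG])')[0].astype(int)
sample['PROD_BRAND'] = sample['PROD_NAME'].apply(lambda x: ' '.join(x.split()[:-1]))
print(sample)  # PACK_SIZE -> 330, PROD_BRAND -> 'Smiths Crinkle Cut Original'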

To ensure accurate analysis, we focus on specific product categories by filtering out non-chip products. To do this, we flag the listings that don’t contain words like chip, crisp, etc. Products such as balls and rings are also counted as part of the chip category.

keywords = ['chip', 'pringle', 'crisp', 'kettle', 'dorito', 'crinkle', 'cracker', 'ring'] 

contains_chip = df['PROD_NAME'].apply(lambda x: any(keyword in x.lower() for keyword in keywords))

# Get count with 'chip' mention and without 'chip' mention
count_with_chip = contains_chip.sum()
count_without_chip = len(contains_chip) - count_with_chip

# Print the counts
print(f'Count with "chip" mention: {count_with_chip}')
print(f'Count without "chip" mention: {count_without_chip}')

# Create a new dataset where 'chip' mentions are not found
no_chip_df = df[~contains_chip]

repeat_counts = no_chip_df['PROD_BRAND'].value_counts()

For the listings that didn’t initially seem to belong to the chip category, such as Twisties, Tostitos, and Infuzions, a bit of research confirmed that they do belong to the chip category. It wasn’t always clear-cut, though. Listings like Old El Paso Salsa Dip and Woolworths Mild Salsa, which are clearly salsa products, were removed. However, we couldn’t simply exclude “salsa” as a keyword, because some chips have salsa flavours, like Red Rock Deli SR Salsa.
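
One simple way to eyeball these borderline listings is to print the unique product names (and the brand counts computed above) in the non-chip subset. A rough sketch:

# Review the listings that did not match any chip-related keyword
print(no_chip_df['PROD_NAME'].unique())

# See which brands dominate the non-chip subset
print(repeat_counts.head(20))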


# List of keywords to exclude
keywords_to_exclude = ['old el paso salsa', 'woolworths salsa']

# Filter out items containing excluded keywords and variations
filtered_df = df[~df['PROD_NAME'].str.lower().str.contains('|'.join(keywords_to_exclude), case=False, na=False)]

Verify and repeat.

Initially, the dataset contained 264,839 rows. After removal, 255,512 rows were left, meaning 9,327 rows were eliminated. To cross-verify, we examined the repeat count, which ideally should have been 12,245. An inconsistency arose: entries like Woolworths Mild Salsa were not excluded, because they don’t match the ‘woolworths salsa’ keyword exactly. I tweaked the code to match the pattern instead and refined the dataset again.

# List of keywords to exclude
keywords_to_exclude = ['old el paso salsa', 'woolworths salsa']

# Regular expression pattern to match 'Woolworths' followed by any text and then 'Salsa'
exclude_pattern = r'Woolworths.*Salsa|' + '|'.join(keywords_to_exclude)

# Filter out items containing excluded keywords and variations
filtered_df = df[~df['PROD_NAME'].str.contains(exclude_pattern, case=False, na=False, regex=True)]

Now, after dropping 12,245 rows, we have thoroughly cleaned and verified the data in the product name column.
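
A quick, generic sanity check after any filtering step is to compare row counts before and after. A minimal sketch using the dataframes from above:

rows_before = len(df)
rows_after = len(filtered_df)
print(f"Rows before: {rows_before}, after: {rows_after}, removed: {rows_before - rows_after}")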

Step 5: Handling Outliers

Outliers can skew analysis results. From the profiling report, we noticed that the total sales (TOT_SALES) column contains an extreme value.

(Figure: TOT_SALES statistics extracted from the profiling report, showing the extreme maximum value)

Alternatively, you can use the Interquartile Range (IQR) method to identify and remove outliers from a specific column.

def find_outliers_IQR(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    IQR = q3 - q1
    outliers = df[(df < q1 - 1.5 * IQR) | (df > q3 + 1.5 * IQR)]
    return outliers

df_outliers = find_outliers_IQR(filtered_df["TOT_SALES"])
print("number of outliers: " + str(len(df_outliers)))
print("Max outlier value: " + str(df_outliers.max()))
print("Min outlier value: " + str(df_outliers.min()))

Outlier relevancy

It’s essential to recognize that extreme values aren’t always outliers; they can provide valuable insights. In this instance, however, the total sales value of 650, coupled with an order quantity of 200, seemed unusual. It appeared to be a rare bulk order amid regular transactions, so we remove this specific outlier.
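
Before dropping anything, it is worth inspecting the suspicious transactions themselves. A small sketch (PROD_QTY is assumed here to be the quantity column in the transaction file):

# Look at the full rows behind the extreme value before deciding to drop them
print(filtered_df[filtered_df['TOT_SALES'] == 650])

# Or inspect the largest transactions directly
print(filtered_df.sort_values('TOT_SALES', ascending=False).head(5))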

# Recompute outliers on the filtered data, then drop the single extreme transaction
df_outliers = find_outliers_IQR(filtered_df["TOT_SALES"])
filtered_df = filtered_df.drop(filtered_df[filtered_df['TOT_SALES'] == 650].index)

Step 6: Repeat steps for another dataset and merge

Repeat the data-cleaning steps for the customer purchase behaviour dataset. Since no inconsistencies were found, we move ahead and merge the datasets.

# Purchase behaviour dataset
df1 = pd.read_csv("QVI_purchase_behaviour.csv")
df1.head()

# Step 2: profile the data
profile = df1.profile_report(title='Data Preprocessing Report')
profile.to_file("BehaviourEADReport.html")

# Step 4: check for missing and null data values
df1.info()
# no null or missing values in this dataset

df1.describe()

# We now merge the datasets on the loyalty card number
merged_df = pd.merge(filtered_df, df1, on='LYLTY_CARD_NBR', how='inner')

# Convert to CSV
merged_df.to_csv('Merged_Quantium.csv', index=False)
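
As a final check (not part of the original steps, just a small sketch), it is worth confirming that every transaction found a matching customer and that the merge introduced no unexpected nulls:

# With an inner join, transactions without a matching loyalty card are silently dropped;
# comparing row counts before and after the merge makes that visible
print(len(filtered_df), len(merged_df))

# Check for nulls introduced by the merge
print(merged_df.isnull().sum())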

That’s it: our data is cleaned and ready for analysis.

Conclusion

Data cleaning is a critical process in any analysis, ensuring that the insights derived are accurate and reliable. By following these steps you can transform messy, raw data into a valuable asset for decision-making. Remember, understanding your data, handling inconsistencies, and being vigilant for outliers are crucial to unlocking meaningful insights.

Quantium Case: https://www.theforage.com/virtual-internships/prototype/NkaC7knWtjSbi6aYv/Data-Analytics
