Instacart

During one of my courses at CareerFoundry, I worked on a project that introduced me to the world of Python. This project was key because it let me apply my analytical skills to exploratory analysis of a very large dataset.

Objective

Analyze sales patterns, perform exploratory analysis, and derive insights to improve customer segmentation for Instacart, an online grocery store. The company wants to understand its customers' purchase behavior so that it can deliver better-targeted marketing campaigns. My role as Data Analyst is to provide a recommendation.

Goal

Deliver a final report including tables and visualizations that profile Instacart customers based on their purchase behaviors.

Tools, Skills, Methodologies

First Step - Cleaning Data

The first step was to clean the datasets. As with any project, skipping this step can lead to odd or misleading results in the analysis. Python handles this job well and does it quickly. Cleaning the three datasets (orders, customers, and products) was straightforward and followed the classic methodology: ensuring the data is in a consistent format, that there are no unexplained missing values, and that there are no duplicated records.
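The three checks described above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the tiny `orders` table, its column names, and the `clean` helper are all hypothetical stand-ins.

```python
import pandas as pd

# Tiny stand-in for one of the three tables; column names are hypothetical.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "user_id": [10, 11, 11, 12],
    "days_since_prior_order": [None, 7.0, 7.0, 30.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Report mixed-type columns and missing values, then drop exact duplicates."""
    for col in df.columns:
        # A column has an inconsistent format if its values have more than one type
        if df[col].map(type).nunique() > 1:
            print(f"mixed types in column: {col}")

    # Missing values per column; each gap should be explainable
    # (for example, a customer's first order has no prior order)
    print(df.isna().sum())

    # Remove fully duplicated rows and renumber the index
    return df.drop_duplicates().reset_index(drop=True)


orders_clean = clean(orders)
```

The helper only reports type and missing-value issues rather than fixing them, since the right fix (dropping, imputing, or keeping a gap) depends on what the missing value means.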

Second Step - Merging Data

This is where things started to get interesting. Once I merged the datasets, I was left with a unified table of almost 30M rows and 30 columns! That is quite a lot when it is your first time handling this much data. To give you a brief idea, the dataset contained information about the orders each user had placed, the time and date of purchase, the product's name, the department the product belongs to, and demographic information about customers (age, income, state, family status, number of dependants).
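A merge like the one described can be sketched with pandas' `merge`. The toy tables and key columns (`product_id`, `user_id`) below are assumptions for illustration; the real datasets are far larger and may use different names.

```python
import pandas as pd

# Hypothetical miniature versions of the three cleaned tables
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 11, 10],
    "product_id": [100, 101, 100],
})
products = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["Bananas", "Milk"],
    "department": ["produce", "dairy eggs"],
})
customers = pd.DataFrame({
    "user_id": [10, 11],
    "age": [34, 61],
    "income": [52000, 120000],
})

# Attach product details to each order line; indicator=True adds a "_merge"
# column showing whether each row found a match in the right-hand table
merged = orders.merge(
    products, on="product_id", how="left",
    validate="many_to_one", indicator=True,
)
assert (merged["_merge"] == "both").all()  # every order line matched a product

# Drop the helper column, then attach customer demographics
merged = merged.drop(columns="_merge").merge(
    customers, on="user_id", how="left", validate="many_to_one",
)
```

The `validate="many_to_one"` argument makes pandas raise an error if a key is duplicated in the lookup table, which guards against accidentally multiplying rows during the merge.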

Third Step - Segmenting Customers

This step is where I understood why pandas (the Python library I used for this project) is such a good option when dealing with big datasets. Normally, to categorize something in a programming language, you evaluate logical conditions and iterate over every row in the dataset, checking those conditions one row at a time. On a dataset this size, that is like applying a formula to a huge number of cells in Excel and watching your screen freeze. Thanks to the magical loc indexer, which applies a condition to a whole column at once, I was able to segment Instacart customers into 15 different profiles based on family status, age, income, and number of dependants.
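To make the idea concrete, here is a small sketch of condition-based labeling with `loc`. The thresholds, column names, and profile labels are invented for illustration and cover only three of the profiles; they are not the 15 profiles from the project.

```python
import pandas as pd

# Hypothetical customer table; the real one has many more rows and columns
customers = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [28, 45, 67, 35],
    "income": [40000, 90000, 150000, 60000],
    "family_status": ["single", "married", "married", "single"],
    "n_dependants": [0, 2, 1, 1],
})

# Start with a default label, then overwrite it wherever a condition holds
customers["profile"] = "other"

# Each condition is a boolean Series evaluated over the whole column at once,
# so no explicit Python loop over rows is needed
young_single = (
    (customers["age"] < 40)
    & (customers["family_status"] == "single")
    & (customers["n_dependants"] == 0)
)
customers.loc[young_single, "profile"] = "young single, no dependants"

married_parent = (
    (customers["family_status"] == "married")
    & (customers["n_dependants"] > 0)
)
customers.loc[married_parent & (customers["income"] < 100000), "profile"] = (
    "middle-income married parent"
)
customers.loc[married_parent & (customers["income"] >= 100000), "profile"] = (
    "upper-income married parent"
)
```

Rows that match none of the conditions keep the default "other" label, which makes gaps in the segmentation rules easy to spot.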

Final Step - Analyzing and Visualizing

This final stage is where I picked up most of the Python practices I expect to keep using. Analyzing the data is all about grouping variables and applying a statistical criterion: for instance, finding the average spending or the count of orders made by each of the 15 profiles defined in the previous step.
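Grouping and aggregating in this way can be sketched with `groupby` and named aggregations. The column names (`profile`, `order_id`, `price`) and the values are made up for the example.

```python
import pandas as pd

# Hypothetical order lines, one row per product purchased
df = pd.DataFrame({
    "profile": ["A", "A", "B", "B", "B"],
    "order_id": [1, 2, 3, 3, 4],
    "price": [4.0, 6.0, 1.0, 2.0, 9.0],
})

# One row per profile: mean item price and number of distinct orders
summary = df.groupby("profile").agg(
    avg_price=("price", "mean"),
    n_orders=("order_id", "nunique"),
)
print(summary)
```

Counting with `nunique` rather than `count` matters here because an order with several products appears on several rows.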

When are orders placed?

The analysis shows that weekends are the busiest days for orders, while Tuesdays and Wednesdays are the slowest.
Most orders are placed during work hours, between 10 am and 4 pm. After 4 pm, order volume decreases.
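Counts like these can be obtained by tallying orders per day of the week and per hour. The sketch below assumes two hypothetical columns, `order_dow` (day of week as an integer) and `order_hour_of_day`; how the integers map to weekday names depends on the dataset and is not assumed here.

```python
import pandas as pd

# Hypothetical orders table with one row per order
orders = pd.DataFrame({
    "order_dow": [0, 0, 1, 3, 6, 6, 6],
    "order_hour_of_day": [10, 11, 15, 20, 10, 14, 9],
})

# Number of orders per day of week and per hour, sorted by day/hour
orders_per_day = orders["order_dow"].value_counts().sort_index()
orders_per_hour = orders["order_hour_of_day"].value_counts().sort_index()

# Day code with the most orders in this toy data
busiest_day = orders_per_day.idxmax()
```

Either Series can be passed straight to `.plot.bar()` to produce the kind of bar chart typically used for this question.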

Which are the top-selling categories?

The five most popular departments are produce, dairy/eggs, snacks, beverages, and frozen items.

The produce department has twice as many orders as dairy/eggs.

Product preferences reflect the distribution of customer profiles.

What are the ordering habits of different customer profiles?

The middle-income, married, older-parent profile places far more orders than any other profile group.

The upper-income, married, older-parent group has the second-highest order frequency.

The profile with the fewest orders is upper-income single parents.

Across all customer profiles, the majority of customers are frequent customers.

Only a small number of customers are labeled as non-frequent.

Instacart Grocery Online Store

Purpose and Context