14  Grouping and Pivoting

Let’s study how to create pivot table -style reports.

In this sales dataset, you will see we have a row per date per product sold on that date.

from pandas import read_csv, to_datetime

sales_df = read_csv(f"https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/monthly-sales.csv")
sales_df["date"] = to_datetime(sales_df["date"])
sales_df.head()
date product unit price units sold sales price
0 2024-03-01 Button-Down Shirt 65.05 2 130.10
1 2024-03-01 Vintage Logo Tee 15.95 1 15.95
2 2024-03-01 Sticker Pack 4.50 1 4.50
3 2024-03-02 Super Soft Hoodie 75.00 2 150.00
4 2024-03-02 Button-Down Shirt 65.05 7 455.35
len(sales_df)
117
products = sales_df["product"].unique()
print(products)
['Button-Down Shirt' 'Vintage Logo Tee' 'Sticker Pack' 'Super Soft Hoodie'
 'Baseball Cap' 'Khaki Pants' 'Brown Boots']
days = sales_df["date"].unique()
print(len(days))
print(days.min().date(), "...", days.max().date())
31
2024-03-01 ... 2024-03-31

We know we can calculate total sales using a Series aggregation:

sales_df["sales price"].sum().round(2)
np.float64(12000.71)

But how can we calculate the total sales for each product?

14.1 Grouping

Enter the groupby method for group-based aggregations.

Calculating total sales per product, to identify the top selling products:

sales_by_product = sales_df.groupby("product")["sales price"].sum()
sales_by_product.sort_values(ascending=False)
product
Button-Down Shirt    6960.35
Super Soft Hoodie    1875.00
Khaki Pants          1602.00
Vintage Logo Tee      941.05
Brown Boots           250.00
Sticker Pack          216.00
Baseball Cap          156.31
Name: sales price, dtype: float64

Calculating total sales per day:

sales_by_date = sales_df.groupby("date")["sales price"].sum()
sales_by_date.sort_values(ascending=False).head()
date
2024-03-11    1067.48
2024-03-03     902.63
2024-03-10     765.20
2024-03-18     739.25
2024-03-02     675.53
Name: sales price, dtype: float64

Calculating total sales per day of week:

sales_df["weekday"] = sales_df["date"].dt.strftime("%A")

sales_by_weekday = sales_df.groupby("weekday")["sales price"].sum()
sales_by_weekday.sort_values(ascending=False)
weekday
Sunday       3428.91
Saturday     2312.84
Monday       2275.41
Thursday     1244.70
Wednesday     991.50
Tuesday       988.10
Friday        759.25
Name: sales price, dtype: float64

14.2 Pivot Tables

We can alternatively use the pivot_table function to perform more fine-grained grouping and aggregation.

Parameters of the pivot_table function:

  • The index parameter specifies the rows we want to wind up with (i.e. “row per what?”).
  • The values parameter specifies what columns we would like to aggregate.
  • The aggfunc parameter specifies how to aggregate those columns. We can pass our own aggregation function(s) or get them from the numpy package. We can aggregate different columns differently.

Pivoting by date:

from pandas import pivot_table
#import numpy as np

dates_pivot = pivot_table(sales_df,
    index=["date"],
    values=["sales price", "units sold"],
    aggfunc={
        "sales price": "sum", #np.sum,
        "units sold": "sum", #np.sum,
    } # designate the agg function to be used for each original column. can use our own custom functions here as well
)

dates_pivot.rename(columns={"sales price": "sales_total", "units sold": "units_sold"}, inplace=True)
dates_pivot.sort_values(by=["sales_total"], ascending=False, inplace=True)
dates_pivot.head()
sales_total units_sold
date
2024-03-11 1067.48 20
2024-03-03 902.63 15
2024-03-10 765.20 15
2024-03-18 739.25 18
2024-03-02 675.53 13

Pivoting by product:

products_pivot = pivot_table(sales_df,
    index=["product"],
    values=["sales price", "units sold"],
    aggfunc={
        "sales price": "sum", #np.sum,
        "units sold": "sum", #np.sum,
    }
)

products_pivot.rename(columns={"sales price": "sales_total", "units sold": "units_sold"}, inplace=True)
products_pivot.sort_values(by=["sales_total"], ascending=False, inplace=True)
products_pivot.head()
sales_total units_sold
product
Button-Down Shirt 6960.35 107
Super Soft Hoodie 1875.00 25
Khaki Pants 1602.00 18
Vintage Logo Tee 941.05 59
Brown Boots 250.00 2

Pivot by weekday:

weekdays_pivot = pivot_table(sales_df,
    index=["weekday"],
    values=["sales price", "units sold"],
    aggfunc={
        "sales price": ["sum", "mean"], #np.mean,
        #"units sold": ["sum", "mean"] #np.mean
    }
)

weekdays_pivot.columns = ["sales_avg", "sales_total"]
weekdays_pivot.sort_values(by=["sales_avg"], ascending=False, inplace=True)
weekdays_pivot
sales_avg sales_total
weekday
Sunday 155.859545 3428.91
Monday 126.411667 2275.41
Saturday 115.642000 2312.84
Thursday 77.793750 1244.70
Tuesday 76.007692 988.10
Wednesday 70.821429 991.50
Friday 54.232143 759.25

These pivot tables are now suitable for charting as well:

import plotly.express as px

chart_df = dates_pivot.copy()
#chart_df["date"] = chart_df.index
chart_df.sort_values(by=["sales_total"], ascending=True, inplace=True)

px.bar(chart_df,  #x="date",
       y="sales_total",
    title="Sales by Day (March 2024)", height=350
)
chart_df = products_pivot.copy()
chart_df["product"] = chart_df.index
chart_df.sort_values(by=["sales_total"], ascending=True, inplace=True)

px.bar(chart_df,  y="product", x="sales_total", orientation="h",
    title="Top Selling Products (March 2024)", height=350
)
chart_df = weekdays_pivot.copy()
chart_df["weekday"] = chart_df.index
chart_df.sort_values(by=["sales_avg"], ascending=True, inplace=True)

px.bar(chart_df,  y="weekday", x="sales_avg", orientation="h",
    title="Average Sales per Weekday (March 2024)", height=350
)