Grouping and Pivoting
Let’s study how to create pivot table -style reports.
In this sales dataset, you will see we have a row per date per product sold on that date.
from pandas import read_csv, to_datetime
sales_df = read_csv(f"https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/monthly-sales.csv" )
sales_df["date" ] = to_datetime(sales_df["date" ])
sales_df.head()
0
2024-03-01
Button-Down Shirt
65.05
2
130.10
1
2024-03-01
Vintage Logo Tee
15.95
1
15.95
2
2024-03-01
Sticker Pack
4.50
1
4.50
3
2024-03-02
Super Soft Hoodie
75.00
2
150.00
4
2024-03-02
Button-Down Shirt
65.05
7
455.35
products = sales_df["product" ].unique()
print (products)
['Button-Down Shirt' 'Vintage Logo Tee' 'Sticker Pack' 'Super Soft Hoodie'
'Baseball Cap' 'Khaki Pants' 'Brown Boots']
days = sales_df["date" ].unique()
print (len (days))
print (days.min ().date(), "..." , days.max ().date())
31
2024-03-01 ... 2024-03-31
We know we can calculate total sales using a Series
aggregation:
sales_df["sales price" ].sum ().round (2 )
But how can we calculate the total sales for each product?
Grouping
Enter the groupby
method for group-based aggregations.
Calculating total sales per product, to identify the top selling products:
sales_by_product = sales_df.groupby("product" )["sales price" ].sum ()
sales_by_product.sort_values(ascending= False )
product
Button-Down Shirt 6960.35
Super Soft Hoodie 1875.00
Khaki Pants 1602.00
Vintage Logo Tee 941.05
Brown Boots 250.00
Sticker Pack 216.00
Baseball Cap 156.31
Name: sales price, dtype: float64
Calculating total sales per day:
sales_by_date = sales_df.groupby("date" )["sales price" ].sum ()
sales_by_date.sort_values(ascending= False ).head()
date
2024-03-11 1067.48
2024-03-03 902.63
2024-03-10 765.20
2024-03-18 739.25
2024-03-02 675.53
Name: sales price, dtype: float64
Calculating total sales per day of week:
sales_df["weekday" ] = sales_df["date" ].dt.strftime("%A" )
sales_by_weekday = sales_df.groupby("weekday" )["sales price" ].sum ()
sales_by_weekday.sort_values(ascending= False )
weekday
Sunday 3428.91
Saturday 2312.84
Monday 2275.41
Thursday 1244.70
Wednesday 991.50
Tuesday 988.10
Friday 759.25
Name: sales price, dtype: float64
Pivot Tables
We can alternatively use the pivot_table
function to perform more fine-grained grouping and aggregation.
Parameters of the pivot_table
function:
The index
parameter specifies the rows we want to wind up with (i.e. “row per what?”).
The values
parameter specifies what columns we would like to aggregate.
The aggfunc
parameter specifies how to aggregate those columns. We can pass our own aggregation function(s) or get them from the numpy
package. We can aggregate different columns differently.
Pivoting by date:
from pandas import pivot_table
#import numpy as np
dates_pivot = pivot_table(sales_df,
index= ["date" ],
values= ["sales price" , "units sold" ],
aggfunc= {
"sales price" : "sum" , #np.sum,
"units sold" : "sum" , #np.sum,
} # designate the agg function to be used for each original column. can use our own custom functions here as well
)
dates_pivot.rename(columns= {"sales price" : "sales_total" , "units sold" : "units_sold" }, inplace= True )
dates_pivot.sort_values(by= ["sales_total" ], ascending= False , inplace= True )
dates_pivot.head()
date
2024-03-11
1067.48
20
2024-03-03
902.63
15
2024-03-10
765.20
15
2024-03-18
739.25
18
2024-03-02
675.53
13
Pivoting by product:
products_pivot = pivot_table(sales_df,
index= ["product" ],
values= ["sales price" , "units sold" ],
aggfunc= {
"sales price" : "sum" , #np.sum,
"units sold" : "sum" , #np.sum,
}
)
products_pivot.rename(columns= {"sales price" : "sales_total" , "units sold" : "units_sold" }, inplace= True )
products_pivot.sort_values(by= ["sales_total" ], ascending= False , inplace= True )
products_pivot.head()
product
Button-Down Shirt
6960.35
107
Super Soft Hoodie
1875.00
25
Khaki Pants
1602.00
18
Vintage Logo Tee
941.05
59
Brown Boots
250.00
2
Pivot by weekday:
weekdays_pivot = pivot_table(sales_df,
index= ["weekday" ],
values= ["sales price" , "units sold" ],
aggfunc= {
"sales price" : ["sum" , "mean" ], #np.mean,
#"units sold": ["sum", "mean"] #np.mean
}
)
weekdays_pivot.columns = ["sales_avg" , "sales_total" ]
weekdays_pivot.sort_values(by= ["sales_avg" ], ascending= False , inplace= True )
weekdays_pivot
weekday
Sunday
155.859545
3428.91
Monday
126.411667
2275.41
Saturday
115.642000
2312.84
Thursday
77.793750
1244.70
Tuesday
76.007692
988.10
Wednesday
70.821429
991.50
Friday
54.232143
759.25
These pivot tables are now suitable for charting as well:
import plotly.express as px
chart_df = dates_pivot.copy()
#chart_df["date"] = chart_df.index
chart_df.sort_values(by= ["sales_total" ], ascending= True , inplace= True )
px.bar(chart_df, #x="date",
y= "sales_total" ,
title= "Sales by Day (March 2024)" , height= 350
)
chart_df = products_pivot.copy()
chart_df["product" ] = chart_df.index
chart_df.sort_values(by= ["sales_total" ], ascending= True , inplace= True )
px.bar(chart_df, y= "product" , x= "sales_total" , orientation= "h" ,
title= "Top Selling Products (March 2024)" , height= 350
)
chart_df = weekdays_pivot.copy()
chart_df["weekday" ] = chart_df.index
chart_df.sort_values(by= ["sales_avg" ], ascending= True , inplace= True )
px.bar(chart_df, y= "weekday" , x= "sales_avg" , orientation= "h" ,
title= "Average Sales per Weekday (March 2024)" , height= 350
)