Summary Statistics

The pandas package makes it easy to calculate basic summary statistics.

Let’s consider this example dataset of monthly financial and economic indicators:

from pandas import read_csv

repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/monthly-indicators.csv"

df = read_csv(request_url)
df.head()

	timestamp	cpi	fed	spy	gld
0	2024-05-01	314.069	5.33	525.6718	215.30
1	2024-04-01	313.548	5.33	500.3636	211.87
2	2024-03-01	312.332	5.33	521.3857	205.72
3	2024-02-01	310.326	5.33	504.8645	189.31
4	2024-01-01	308.417	5.33	479.8240	188.45

It contains 234 monthly observations, from December 2004 through May 2024:

print(len(df))
print(df["timestamp"].min(), "...", df["timestamp"].max())

234
2004-12-01 ... 2024-05-01

We can use the dataframe’s describe method to quickly see the basic summary statistics for each numeric column in the dataset:

df.describe()

	cpi	fed	spy	gld
count	234.000000	234.000000	234.000000	234.000000
mean	239.904709	1.588761	199.248881	124.362344
std	30.299220	1.879657	122.947398	40.170234
min	190.300000	0.050000	55.148800	41.650000
25%	216.963500	0.120000	98.790475	101.575000
50%	236.409000	0.390000	164.878300	123.360000
75%	256.327500	2.480000	270.226075	159.787500
max	314.069000	5.330000	525.671800	215.300000

This shows the count of non-missing values, the mean and standard deviation, the min and max, and the quartiles (25%, 50%, and 75%) for each numeric column.
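As a minimal sketch of how the count row and the percentile rows behave, the example below uses a small made-up dataframe (the `toy_df` name and its values are assumptions for illustration, not part of the indicators dataset). It passes the optional `percentiles` parameter to `describe`, and includes one missing value to show that the count reflects non-missing values rather than the number of rows:

```python
import numpy as np
from pandas import DataFrame

# hypothetical toy data (not the indicators dataset), with one missing value in "b"
toy_df = DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# request the 10th, 50th, and 90th percentiles instead of the default quartiles
summary = toy_df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

print("ROWS:", len(toy_df))       # counts every row, including the one with a missing value
print(summary.loc["count"])       # counts only the non-missing values in each column
```

With data like this, the count for column "b" should be one less than the number of rows.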

We can also calculate these metrics ourselves, using Series aggregation methods:

# https://pandas.pydata.org/docs/reference/api/pandas.Series.html
# https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html

series = df["fed"]

print("COUNT:", series.count()) # non-missing values, matching describe()
print("MEAN:", series.mean().round(6))
print("STD:", series.std().round(6))
print("-------------")
print("MIN:", series.min())
print("25TH:", series.quantile(.25))
print("MED:", series.median())
print("75TH:", series.quantile(.75))
print("MAX:", series.max())

COUNT: 234
MEAN: 1.588761
STD: 1.879657
-------------
MIN: 0.05
25TH: 0.12
MED: 0.39
75TH: 2.48
MAX: 5.33

series.describe() # for comparison

count    234.000000
mean       1.588761
std        1.879657
min        0.050000
25%        0.120000
50%        0.390000
75%        2.480000
max        5.330000
Name: fed, dtype: float64
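If we want several of these aggregations in a single call, one option is the Series `agg` method, which accepts a list of aggregation names. The sketch below uses a small made-up series (the `toy` values are assumptions for illustration, standing in for a column like `df["fed"]`), and also shows that `quantile` accepts a list of quantiles:

```python
from pandas import Series

# hypothetical toy values standing in for a numeric column
toy = Series([0.05, 0.12, 0.39, 2.48, 5.33], name="toy")

# compute several aggregations at once; the result is a Series indexed by name
stats = toy.agg(["count", "mean", "std", "min", "median", "max"])
print(stats)

# compute several quantiles at once; the result is a Series indexed by quantile
quartiles = toy.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

Individual values can then be looked up by label, for example `stats["median"]` or `quartiles.loc[0.25]`.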

Distribution Plots

To learn more about how this data is distributed, we can create distribution plots, which give visual context to the summary statistics.

A box plot:

import plotly.express as px

# https://plotly.com/python-api-reference/generated/plotly.express.box.html
px.box(df, x="fed", orientation="h", points="all",
        title="Distribution of Federal Funds Rate (Monthly)",
        hover_data=["timestamp"], height=400
)
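The box in a box plot spans the first to third quartile, and a common convention (Tukey's rule) flags points lying more than 1.5 times the interquartile range beyond the box as potential outliers. The sketch below computes these values with pandas for a small made-up series (the `toy` values are assumptions for illustration). Plotly computes its own quartiles, so the values it draws may differ slightly from the pandas defaults used here:

```python
from pandas import Series

# hypothetical toy values, with one unusually large observation
toy = Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1 = toy.quantile(0.25)   # first quartile (left edge of the box)
q3 = toy.quantile(0.75)   # third quartile (right edge of the box)
iqr = q3 - q1             # interquartile range (width of the box)

# Tukey fences: points beyond these are commonly treated as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = toy[(toy < lower_fence) | (toy > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("FENCES:", lower_fence, upper_fence)
print("OUTLIERS:", outliers.tolist())
```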

A violin plot:

# https://plotly.com/python-api-reference/generated/plotly.express.violin.html
px.violin(df, x="fed", orientation="h", points="all", box=True,
        title="Distribution of Federal Funds Rate (Monthly)",
        hover_data=["timestamp"], height=400
)

A histogram:

# https://plotly.com/python-api-reference/generated/plotly.express.histogram.html
px.histogram(df, x="fed", #nbins=12,
        title="Distribution of Federal Funds Rate (Monthly)", height=400)

When we make a histogram, we can control the number of bins using the nbins parameter. It is commented out in the example above, so Plotly chooses the binning automatically.
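To get a feel for what binning does numerically, the sketch below groups a handful of made-up values into five equal-width bins using numpy's `histogram` function (the `toy_values` array and the chosen range are assumptions for illustration). This is a stand-in for the idea rather than a reproduction of Plotly's own binning, which picks its own bin edges:

```python
import numpy as np

# hypothetical toy values, mostly clustered near the low end
toy_values = np.array([0.1, 0.2, 0.3, 0.4, 1.5, 2.5, 4.9, 5.0])

# split the range 0 to 5 into five equal-width bins and count the values in each
counts, edges = np.histogram(toy_values, bins=5, range=(0.0, 5.0))

# print each bin's boundaries alongside its count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

Using fewer bins gives a coarser picture of the distribution, while using more bins shows finer detail at the cost of noisier counts.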

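These charts help us see the shape of the distribution. Here, for example, at least half of the monthly federal funds rate values sit near the low end of the range, which is consistent with the mean (1.59) being well above the median (0.39).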