3 Column Operations

In this chapter we will focus on column operations, including dropping, renaming, and mapping columns.

For these examples, we will use a familiar dataset of products:

Code

from pandas import read_csv

request_url = "https://raw.githubusercontent.com/prof-rossetti/intro-to-python/main/data/products.csv"
df = read_csv(request_url)
df.head()

	id	name	aisle	department	price
0	1	Chocolate Sandwich Cookies	cookies cakes	snacks	3.50
1	2	All-Seasons Salt	spices seasonings	pantry	4.99
2	3	Robust Golden Unsweetened Oolong Tea	tea	beverages	2.49
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	frozen meals	frozen	6.99
4	5	Green Chile Anytime Sauce	marinades meat preparation	pantry	7.99

3.1 `Series` as a Column

Remember, when we access a given column of values, we are working with a Series object:

prices = df["price"]
print(type(prices))
prices

<class 'pandas.core.series.Series'>

0      3.50
1      4.99
2      2.49
3      6.99
4      7.99
5     21.99
6      3.50
7      4.25
8      6.50
9      2.99
10     1.99
11    18.50
12    16.00
13     4.99
14    25.50
15     4.50
16     9.99
17    12.50
18     3.99
19     4.25
Name: price, dtype: float64

With a Series object as a column, we can use list-like accessors to access one of the values based on its row index value:

prices[0]

np.float64(3.5)

And we can convert the Series object to a normal list as desired:

print(prices.tolist())

[3.5, 4.99, 2.49, 6.99, 7.99, 21.99, 3.5, 4.25, 6.5, 2.99, 1.99, 18.5, 16.0, 4.99, 25.5, 4.5, 9.99, 12.5, 3.99, 4.25]

3.1.1 Series Aggregations

One of the cool things about Series objects is they know how to calculate their own aggregations.

For example, minimum and maximum values:

print("MIN PRICE:", df["price"].min())
print("MAX PRICE:", df["price"].max())

MIN PRICE: 1.99
MAX PRICE: 25.5

Mean and median values:

print("MEAN PRICE:", df["price"].mean().round(2))
print("MEDIAN PRICE:", df["price"].median().round(2))

MEAN PRICE: 8.37
MEDIAN PRICE: 4.99

Percentiles, etc:

print("25TH PCT PRICE:", df["price"].quantile(.25).round(2))
print("75TH PCT PRICE:", df["price"].quantile(.75).round(2))

25TH PCT PRICE: 3.87
75TH PCT PRICE: 10.62

3.1.2 Value Counts

The value_counts method shows us a count of the number of rows for each value in a given column:

df["department"].value_counts()

department
beverages          5
frozen             4
pantry             2
snacks             2
personal care      2
dairy eggs         1
household          1
babies             1
meat seafood       1
dry goods pasta    1
Name: count, dtype: int64

We can use the normalize parameter to display our counts as percentage of the total number of rows:

df["department"].value_counts(normalize=True)

department
beverages          0.25
frozen             0.20
pantry             0.10
snacks             0.10
personal care      0.10
dairy eggs         0.05
household          0.05
babies             0.05
meat seafood       0.05
dry goods pasta    0.05
Name: proportion, dtype: float64

3.2 Renaming Columns

We can use the rename method to rename columns. This is not mutating unless we use the inplace parameter.

df.rename(columns={"department": "dept"}, inplace=True)
df.head()

	id	name	aisle	dept	price
0	1	Chocolate Sandwich Cookies	cookies cakes	snacks	3.50
1	2	All-Seasons Salt	spices seasonings	pantry	4.99
2	3	Robust Golden Unsweetened Oolong Tea	tea	beverages	2.49
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	frozen meals	frozen	6.99
4	5	Green Chile Anytime Sauce	marinades meat preparation	pantry	7.99

3.3 Dropping Columns

We can use the drop method to remove columns from the dataset. This is not mutating unless we use the inplace parameter.

df.drop(columns=["aisle"], inplace=True)
df.head()

	id	name	dept	price
0	1	Chocolate Sandwich Cookies	snacks	3.50
1	2	All-Seasons Salt	pantry	4.99
2	3	Robust Golden Unsweetened Oolong Tea	beverages	2.49
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	frozen	6.99
4	5	Green Chile Anytime Sauce	pantry	7.99

3.4 Creating New Columns

We can create a new column by specifying the name of that column and assigning a new value. For example, a column of constants:

df["inflation_factor"] = 1.5

df.head()

	id	name	dept	price	inflation_factor
0	1	Chocolate Sandwich Cookies	snacks	3.50	1.5
1	2	All-Seasons Salt	pantry	4.99	1.5
2	3	Robust Golden Unsweetened Oolong Tea	beverages	2.49	1.5
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	frozen	6.99	1.5
4	5	Green Chile Anytime Sauce	pantry	7.99	1.5

In practice, instead of creating a new column of constants or placeholder values, we will generally create a new column of transformed values (see “Mapping Columns” section below).

3.5 Mapping Columns

We can transform the values in one given column, and store the results in another column

When mapping using a single scalar value, for example multiplying all values in a column by 1.5:

# PRICE COL * 1.5
df["inflated_price"] = (df["price"] * 1.5).round(2)

df[["id", "name", "price", "inflated_price"]].head()

	id	name	price	inflated_price
0	1	Chocolate Sandwich Cookies	3.50	5.25
1	2	All-Seasons Salt	4.99	7.48
2	3	Robust Golden Unsweetened Oolong Tea	2.49	3.74
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	6.99	10.48
4	5	Green Chile Anytime Sauce	7.99	11.98

When mapping using two columns, this performs an element-wise operation where the first values in each Series are compared, then the second values, etc.

# PRICE COL * INFLATION FACTOR COL
df["inflated_price"] = (df["price"] * df["inflation_factor"]).round(2)

df[["id", "name", "price", "inflation_factor", "inflated_price"]].head()

	id	name	price	inflation_factor	inflated_price
0	1	Chocolate Sandwich Cookies	3.50	1.5	5.25
1	2	All-Seasons Salt	4.99	1.5	7.48
2	3	Robust Golden Unsweetened Oolong Tea	2.49	1.5	3.74
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	6.99	1.5	10.48
4	5	Green Chile Anytime Sauce	7.99	1.5	11.98

This essentially allows us to compare multiple values from the same row.

3.5.1 Applying a Mapping Dictionary

If we want to change certain values, we can use the map method in conjunction with a dictionary to provide the mapping of old values to new values:

department_abbrevs = {
  "beverages": "BV",
  "frozen": "FR",
  "pantry": "PN",
  "snacks": "SN",
  "personal care": "PC",
  "dairy eggs": "DE",
  "household": "HH",
  "babies": "BB",
  "meat seafood": "MS",
  "dry goods pasta": "DG",
} # DEFINE VALUE MAPPING

df["dept_abbrev"] = df["dept"].map(department_abbrevs) # APPLY VALUE MAPPING

df[["id", "name", "dept", "dept_abbrev"]].head()

	id	name	dept	dept_abbrev
0	1	Chocolate Sandwich Cookies	snacks	SN
1	2	All-Seasons Salt	pantry	PN
2	3	Robust Golden Unsweetened Oolong Tea	beverages	BV
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	frozen	FR
4	5	Green Chile Anytime Sauce	pantry	PN

In this example, we are using the department_abbrevs dictionary to look up the corresponding abbreviation for each department name in the “dept” column. As a result we obtain a new column of transformed values (the abbreviations), which we are storing back in the original DataFrame in a new column called “dept_abbrev”.

3.5.2 Applying a Transformation Function

For more complex mapping logic, we can use the apply method to apply a transformation function.

When defining our own transformation function, the function should accept a parameter representing one of the values in the column over which we are applying the function. It should operate on that input value and return the transformed value.

# TRANSFORMATION FUNCTION:

def inflate_price(original_price):
    # only inflate if the price is greater than $5
    if original_price < 5.00:
        return original_price
    else:
        return original_price * 1.5

assert inflate_price(3.00) == 3.00
assert inflate_price(10.00) == 15.00

After defining the transformation function, we pass that function as a parameter to the apply method:

# APPLY TRANSFORMATION FUNCTION TO PRICE COL:
df["inflated_price"] = df["price"].apply(inflate_price).round(2)

df[["id", "name", "price", "inflated_price"]].head()

	id	name	price	inflated_price
0	1	Chocolate Sandwich Cookies	3.50	3.50
1	2	All-Seasons Salt	4.99	4.99
2	3	Robust Golden Unsweetened Oolong Tea	2.49	2.49
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	6.99	10.48
4	5	Green Chile Anytime Sauce	7.99	11.98

In this example, we are applying the transformation function inflate_price to each value in the “price” column. As a result we obtain a new column of transformed values, which we are storing back in the original DataFrame in a new column called “inflated_price”.

3.1 Series as a Column