2 Working with DataFrames

Now that we know how to obtain a DataFrame object, let’s start working with one.

For example, this dataset of grocery products, which is hosted online:

Code

from pandas import read_csv

request_url = "https://raw.githubusercontent.com/prof-rossetti/intro-to-python/main/data/products.csv"
df = read_csv(request_url)
print(type(df))
df.head()

<class 'pandas.core.frame.DataFrame'>

	id	name	aisle	department	price
0	1	Chocolate Sandwich Cookies	cookies cakes	snacks	3.50
1	2	All-Seasons Salt	spices seasonings	pantry	4.99
2	3	Robust Golden Unsweetened Oolong Tea	tea	beverages	2.49
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	frozen meals	frozen	6.99
4	5	Green Chile Anytime Sauce	marinades meat preparation	pantry	7.99

Data Source

This dataset was adapted from open source data released by Instacart in 2017 (which seems to have since been deleted).

“The Instacart Online Grocery Shopping Dataset 2017”, accessed from https://www.instacart.com/datasets/grocery-shopping-2017 in 2017.

2.1 Previewing the Data

We can preview the first few rows or last few rows using the head or tail methods, respectively.

Previewing the first few rows:

df.head()

	id	name	aisle	department	price
0	1	Chocolate Sandwich Cookies	cookies cakes	snacks	3.50
1	2	All-Seasons Salt	spices seasonings	pantry	4.99
2	3	Robust Golden Unsweetened Oolong Tea	tea	beverages	2.49
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	frozen meals	frozen	6.99
4	5	Green Chile Anytime Sauce	marinades meat preparation	pantry	7.99

Previewing the last few rows:

df.tail()

	id	name	aisle	department	price
15	16	Mint Chocolate Flavored Syrup	ice cream toppings	snacks	4.50
16	17	Rendered Duck Fat	poultry counter	meat seafood	9.99
17	18	Pizza for One Suprema Frozen Pizza	frozen pizza	frozen	12.50
18	19	Gluten Free Quinoa Three Cheese & Mushroom Blend	grains rice dried goods	dry goods pasta	3.99
19	20	Pomegranate Cranberry & Aloe Vera Enrich Drink	juice nectars	beverages	4.25

By default, we see five rows, but we can customize the number of rows by passing an integer parameter to these methods, like head(3) or tail(3).

2.2 Dataset Properties

2.2.1 Size and Shape

It’s easy to count the number of rows, using the familiar len function:

len(df)

Alternatively, we can access the shape property, which tells us the dataset size in terms of number of rows and columns:

df.shape

(20, 5)

Note

The shape is a tuple formatted as (n_rows, n_cols), where the first value represents the number of rows, and the second represents the number of columns.

2.2.2 Column Names

Every DataFrame object has a set of column names, which uniquely identify the columns in the dataset.

Accessing the column names, using the columns property:

df.columns

Index(['id', 'name', 'aisle', 'department', 'price'], dtype='object')

This produces a list-like Index object, which we can convert to a list using the tolist method, as desired:

df.columns.tolist()

['id', 'name', 'aisle', 'department', 'price']

FYI

The tolist method is used across the numpy and pandas packages to convert specialized data structures to simple Python lists.

2.2.3 Row Index Values

Every DataFrame object has a set of row index values, which uniquely identify the rows in the dataset.

Accessing the row index values, using the index property:

df.index

RangeIndex(start=0, stop=20, step=1)

Just like the column index, the row index can be converted to a list:

df.index.tolist()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

As we see, the default row index is a set of auto-incrementing numbers starting at 0 (similar to the index values of a simple list).

However, it is possible to update the index values. For more information about working with the index, see Index Operations.

2.3 Accessing Data

2.3.1 Accessing Columns

We can access one or more columns worth of values, using a dictionary-like accessor.

To access a single column, we pass the string column name, and we get a pandas Series object back:

names = df["name"] # SINGLE COLUMN NAME
print(type(names))
names.head()

<class 'pandas.core.series.Series'>

0                           Chocolate Sandwich Cookies
1                                     All-Seasons Salt
2                 Robust Golden Unsweetened Oolong Tea
3    Smart Ones Classic Favorites Mini Rigatoni Wit...
4                            Green Chile Anytime Sauce
Name: name, dtype: object

To access multiple columns, we pass a list of string column names, and we get a DataFrame object back:

names_and_prices = df[["name", "price"]] # LIST OF COLUMN NAMES
print(type(names_and_prices))
names_and_prices.head()

<class 'pandas.core.frame.DataFrame'>

	name	price
0	Chocolate Sandwich Cookies	3.50
1	All-Seasons Salt	4.99
2	Robust Golden Unsweetened Oolong Tea	2.49
3	Smart Ones Classic Favorites Mini Rigatoni Wit...	6.99
4	Green Chile Anytime Sauce	7.99

For more information about working with columns, see Column Operations.

2.3.2 Accessing Rows

To access a given row, we can use the iloc method in conjunction with a list-like accessor, referencing the index value of that row:

first_row = df.iloc[0] # ACCESSING A ROW BY ITS INDEX VALUE
print(type(first_row))
first_row

<class 'pandas.core.series.Series'>

id                                     1
name          Chocolate Sandwich Cookies
aisle                      cookies cakes
department                        snacks
price                                3.5
Name: 0, dtype: object

When we access a single row, we get a Series object back.

Note

When we use index references like this with the iloc method, we are actually referencing the index value itself, not the position of the row in the dataset. In the event the index values change, you would need to use the new values instead of the default integer values.

We can access multiple rows in succession (for example the first three rows), using a list-slicing approach:

first_rows = df.iloc[0:3] # MULTIPLE ROWS, LIST-SLICING APPROACH
print(type(first_rows))
first_rows

<class 'pandas.core.frame.DataFrame'>

	id	name	aisle	department	price
0	1	Chocolate Sandwich Cookies	cookies cakes	snacks	3.50
1	2	All-Seasons Salt	spices seasonings	pantry	4.99
2	3	Robust Golden Unsweetened Oolong Tea	tea	beverages	2.49

When we access multiple rows, we get a DataFrame object back.

For more information about working with rows, see Row Operations.

2.3.3 Accessing Cell Values

It is possible to access individual values using the loc method, or a row-based approach, or a column-based approach.

The loc method, specifying the given row index value, and the column name:

df.loc[0, "name"] # CELL-ORIENTED APPROACH [ROW INDEX, COL NAME]

'Chocolate Sandwich Cookies'

Row-based approach, using the aforementioned iloc method gets us the first row, then we access one of its values using a dictionary-like accessor:

df.iloc[0]["name"] # ROW-ORIENTED APPROACH

'Chocolate Sandwich Cookies'

Column-based approach, referencing the column name to get us the column, then we access one of its values using a list-like accessor:

df["name"][0] # COLUMN-ORIENTED APPROACH

'Chocolate Sandwich Cookies'