Accessing the Filesystem in Colab

Let’s take a few moments to explore the "Files" menu in the Google Colab left sidebar.

We see there are some example files in the "sample_data" directory.

Example files in the Colab filesystem

Observe, it is possible to download files like these from the Colab filesystem to your local machine, and upload files from your local machine to the Colab filesystem as well.

Once we have the files in the Colab filesystem, we can write Python code to access and manipulate them. One way of interacting with the filesystem in Python is by using the capabilities of the os module. We can read and write text (".txt") files using the open function, but we rarely need to do so. More commonly, we will be reading and writing ".csv" files in Python using the pandas package.

The examples below are meant to demonstrate capabilities related to the Colab filesystem. For now, the main take-away is understanding there are ways for us to write Python code to interact with the surrounding environment, specifically accessing and manipulating the filesystem. The Python syntax may be unfamiliar at this time. Feel free to return to these examples later, once you have practical need to read and write files.

Accessing the Filesystem

Using the os module to access the filesystem:

import os

os.listdir("sample_data")
['anscombe.json',
 'README.md',
 'california_housing_train.csv',
 'mnist_test.csv',
 'california_housing_test.csv',
 'mnist_train_small.csv']
csv_filepath = "sample_data/california_housing_test.csv"
print(csv_filepath)

os.path.isfile(csv_filepath)
sample_data/california_housing_test.csv
True

Reading and Writing Text Files

Using the open function in “r” (reader) mode to read text files:

md_filepath = "sample_data/README.md"
print(md_filepath)

with open(md_filepath, mode="r") as f:
    print(f.read())
sample_data/README.md
This directory includes a few sample datasets to get you started.

*   `california_housing_data*.csv` is California housing data from the 1990 US
    Census; more information is available at:
    https://developers.google.com/machine-learning/crash-course/california-housing-data-description

*   `mnist_*.csv` is a small sample of the
    [MNIST database](https://en.wikipedia.org/wiki/MNIST_database), which is
    described at: http://yann.lecun.com/exdb/mnist/

*   `anscombe.json` contains a copy of
    [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet); it
    was originally described in

    Anscombe, F. J. (1973). 'Graphs in Statistical Analysis'. American
    Statistician. 27 (1): 17-21. JSTOR 2682899.

    and our copy was prepared by the
    [vega_datasets library](https://github.com/altair-viz/vega_datasets/blob/4f67bdaad10f45e3549984e17e1b3088c731503d/vega_datasets/_data/anscombe.json).

Writing to text file, using the open function in “w” (writer) mode:

new_md_filepath = "my_message.txt"
print("WRITING TO:", new_md_filepath)

with open(new_md_filepath, "w") as f:
    f.write("Hello World!")

print("DONE!")
WRITING TO: my_message.txt
DONE!
# verifying:
print("READING FROM:", new_md_filepath)

with open(new_md_filepath, "r") as f:
    contents = f.read()
    print(contents)

print("DONE!")
READING FROM: my_message.txt
Hello World!
DONE!

Reading and Writing CSV Files

Using the read_csv function from the pandas package to read “.csv” file contents:

from pandas import read_csv

print("READING FROM:", csv_filepath)

df = read_csv(csv_filepath)
df.head()
READING FROM: sample_data/california_housing_test.csv
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0

Using the to_csv method of the resulting pandas.DataFrame object, to write data back to a CSV file:

new_csv_filepath = "my_copy.csv"
print("WRITING TO:", new_csv_filepath)

df.to_csv(new_csv_filepath, index=False)
WRITING TO: my_copy.csv
# verifying:
print("READING BACK FROM:", new_csv_filepath)
new_df = read_csv(new_csv_filepath)
new_df.head()
READING BACK FROM: my_copy.csv
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -122.05 37.37 27.0 3885.0 661.0 1537.0 606.0 6.6085 344700.0
1 -118.30 34.26 43.0 1510.0 310.0 809.0 277.0 3.5990 176500.0
2 -117.81 33.78 27.0 3589.0 507.0 1484.0 495.0 5.7934 270500.0
3 -118.36 33.82 28.0 67.0 15.0 49.0 11.0 6.1359 330000.0
4 -119.67 36.33 19.0 1241.0 244.0 850.0 237.0 2.9375 81700.0

For now, the main take-away is understanding there are ways for us to write Python code to interact with the surrounding environment, specifically accessing and manipulating the filesystem. We will return to cover data processing with pandas in much more detail in a later unit.