import os
"sample_data") os.listdir(
['anscombe.json',
'README.md',
'california_housing_train.csv',
'mnist_test.csv',
'california_housing_test.csv',
'mnist_train_small.csv']
Let’s take a few moments to explore the "Files" menu in the Google Colab left sidebar.
We see there are some example files in the "sample_data" directory.
Observe, it is possible to download files like these from the Colab filesystem to your local machine, and upload files from your local machine to the Colab filesystem as well.
Once we have the files in the Colab filesystem, we can write Python code to access and manipulate them. One way of interacting with the filesystem in Python is by using the capabilities of the os
module. We can read and write text (".txt") files using the open
function, but we rarely need to do so. More commonly, we will be reading and writing ".csv" files in Python using the pandas
package.
The examples below are meant to demonstrate capabilities related to the Colab filesystem. For now, the main take-away is understanding there are ways for us to write Python code to interact with the surrounding environment, specifically accessing and manipulating the filesystem. The Python syntax may be unfamiliar at this time. Feel free to return to these examples later, once you have practical need to read and write files.
Using the os
module to access the filesystem:
import os
"sample_data") os.listdir(
['anscombe.json',
'README.md',
'california_housing_train.csv',
'mnist_test.csv',
'california_housing_test.csv',
'mnist_train_small.csv']
= "sample_data/california_housing_test.csv"
csv_filepath print(csv_filepath)
os.path.isfile(csv_filepath)
sample_data/california_housing_test.csv
True
Using the open
function in “r” (reader) mode to read text files:
= "sample_data/README.md"
md_filepath print(md_filepath)
with open(md_filepath, mode="r") as f:
print(f.read())
sample_data/README.md
This directory includes a few sample datasets to get you started.
* `california_housing_data*.csv` is California housing data from the 1990 US
Census; more information is available at:
https://developers.google.com/machine-learning/crash-course/california-housing-data-description
* `mnist_*.csv` is a small sample of the
[MNIST database](https://en.wikipedia.org/wiki/MNIST_database), which is
described at: http://yann.lecun.com/exdb/mnist/
* `anscombe.json` contains a copy of
[Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet); it
was originally described in
Anscombe, F. J. (1973). 'Graphs in Statistical Analysis'. American
Statistician. 27 (1): 17-21. JSTOR 2682899.
and our copy was prepared by the
[vega_datasets library](https://github.com/altair-viz/vega_datasets/blob/4f67bdaad10f45e3549984e17e1b3088c731503d/vega_datasets/_data/anscombe.json).
Writing to text file, using the open
function in “w” (writer) mode:
= "my_message.txt"
new_md_filepath print("WRITING TO:", new_md_filepath)
with open(new_md_filepath, "w") as f:
"Hello World!")
f.write(
print("DONE!")
WRITING TO: my_message.txt
DONE!
# verifying:
print("READING FROM:", new_md_filepath)
with open(new_md_filepath, "r") as f:
= f.read()
contents print(contents)
print("DONE!")
READING FROM: my_message.txt
Hello World!
DONE!
Using the read_csv
function from the pandas
package to read “.csv” file contents:
from pandas import read_csv
print("READING FROM:", csv_filepath)
= read_csv(csv_filepath)
df df.head()
READING FROM: sample_data/california_housing_test.csv
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
0 | -122.05 | 37.37 | 27.0 | 3885.0 | 661.0 | 1537.0 | 606.0 | 6.6085 | 344700.0 |
1 | -118.30 | 34.26 | 43.0 | 1510.0 | 310.0 | 809.0 | 277.0 | 3.5990 | 176500.0 |
2 | -117.81 | 33.78 | 27.0 | 3589.0 | 507.0 | 1484.0 | 495.0 | 5.7934 | 270500.0 |
3 | -118.36 | 33.82 | 28.0 | 67.0 | 15.0 | 49.0 | 11.0 | 6.1359 | 330000.0 |
4 | -119.67 | 36.33 | 19.0 | 1241.0 | 244.0 | 850.0 | 237.0 | 2.9375 | 81700.0 |
Using the to_csv
method of the resulting pandas.DataFrame
object, to write data back to a CSV file:
= "my_copy.csv"
new_csv_filepath print("WRITING TO:", new_csv_filepath)
=False) df.to_csv(new_csv_filepath, index
WRITING TO: my_copy.csv
# verifying:
print("READING BACK FROM:", new_csv_filepath)
= read_csv(new_csv_filepath)
new_df new_df.head()
READING BACK FROM: my_copy.csv
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
0 | -122.05 | 37.37 | 27.0 | 3885.0 | 661.0 | 1537.0 | 606.0 | 6.6085 | 344700.0 |
1 | -118.30 | 34.26 | 43.0 | 1510.0 | 310.0 | 809.0 | 277.0 | 3.5990 | 176500.0 |
2 | -117.81 | 33.78 | 27.0 | 3589.0 | 507.0 | 1484.0 | 495.0 | 5.7934 | 270500.0 |
3 | -118.36 | 33.82 | 28.0 | 67.0 | 15.0 | 49.0 | 11.0 | 6.1359 | 330000.0 |
4 | -119.67 | 36.33 | 19.0 | 1241.0 | 244.0 | 850.0 | 237.0 | 2.9375 | 81700.0 |
For now, the main take-away is understanding there are ways for us to write Python code to interact with the surrounding environment, specifically accessing and manipulating the filesystem. We will return to cover data processing with pandas
in much more detail in a later unit.