In this notebook, we build on the foundation provided by previous notebooks to introduce how to read and write data from and to a file. This is an important skill since we often want to share our results or will need to rerun an analysis, both of which are made considerably easier when the data can be reused.
First, you will learn how to open a file and write data to it. Next, you will learn how to read and write data in delimiter-separated files (e.g., comma-separated value, or CSV, files). Finally, you will learn more about working with Python modules (or packages), which can be used to add new functionality to your Python program. As a specific example, you will learn to use the csv module to read and write CSV data in Python programs.
First, we run the following Bash shell script to create a new directory for our data files, assuming it hasn't already been created. Don't worry about how this works, although if you are curious you can learn more from the Linux Command website.
%%bash
# An absolute file path for the Bash script
# DIR=/home/data_scientist/data
# A relative file path for the Bash script
DIR=data
if [ ! -d "$DIR" ] ; then
    mkdir "$DIR"
fi
When working with files, or any other system object, we must be careful to properly manage the underlying resource. In this particular case, that means a file and the associated file descriptor that the host operating system uses to reference the actual file. While modern operating systems can typically manage a very large number of file descriptors, when we use virtualization, as with our course Jupyter server, we want to minimize our server footprint. Thus, we need to carefully husband resources like file descriptors to avoid exhausting the server's capacity.
But a more important aspect is that whenever we open a file, we want to be sure that the file is properly closed and that any data the program wrote to the file has been written to permanent storage. Thus, we need to ensure that every file that was opened has been properly closed. To open a file, Python has an open method that opens the named file and returns a file object that you either read from or write to, depending on the mode used to open the file. Conversely, Python also has a close method that closes the file object.
To explicitly state why a file is being opened, the open method accepts a mode argument, whose default value is 'rt', or open for reading text data. The allowed modes are detailed in the following table.
Mode | Description |
---|---|
'r' | reading (default) |
'w' | writing, truncate file first |
'x' | create and open file for writing; fails if the file already exists |
'a' | writing, append to file if exists |
'b' | binary mode |
't' | text mode (default) |
'+' | open for reading and writing |
Historically, you would only read from a text file or write to a text file by using traditional Python file input and output; with the advent of powerful data science modules in Python, however, we now often read and write data directly from advanced data structures like a Pandas DataFrame. Thus, to open a text file named test.txt for writing without truncating the existing file contents (i.e., append), you would use f = open('test.txt', 'a'), and after all operations on the file are complete, you would use f.close() to close the file and release all associated resources. One last item: when opening a file for reading and writing, the + mode follows either a w, to open the file but truncate the file contents, or an r, to open the file without truncation.
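To make these modes concrete, the following short sketch (a minimal illustration, using a hypothetical file named test.txt) opens a file for appending, writes a line, explicitly closes the file, and then reopens it with the r+ mode to read the contents back without truncation.

# A minimal sketch of explicit open/close and the mode argument;
# the file name 'test.txt' is illustrative only.
f = open('test.txt', 'a')        # append mode: existing contents are kept
f.write('A new line of text\n')
f.close()                        # release the file descriptor

f = open('test.txt', 'r+')       # read and write, without truncation
print(f.read())
f.close()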
In Python, file input/output employs a runtime context, which is a way to enforce what should happen when a code block is entered and exited. The context is created by using the with statement in Python, where the rest of the line following the with keyword creates the actual context, which manages the entry into and exit from the enclosed code block. For our purposes, the standard application of a Python context is opening and closing files. As demonstrated in the following code block, we can now open a file, perform operations on the file, and no longer worry about closing the file, which is taken care of automatically by the context.
# Assuming the variable 'data' holds the text to be written
with open('temp.txt', 'a') as fout:
    fout.write(data)
As previously described, to write text (or data that can easily be converted to text) to a file, we need to open the file by using a context (which is created via the with statement in Python). We need to define a variable to refer to the newly opened file so that we can write to the correct file. We write text to the file by using the write method on the file that we just opened.
The following code snippet demonstrates this technique. First, we define a variable that holds the name of the file that will hold our text data. Next, we open the file as fout, after which we write two lines before exiting the context, which closes the file and ensures the text data are correctly written to the file. The second code block uses the Unix cat command to display the contents of this new file. We employ one trick in this code block by reusing the new Python variable out_file, which we can do by prefixing the variable name with a dollar sign ($). This shorthand expands the variable to the full file name when the cat command is executed, and is a good practice since it can minimize typos and syntax errors (we only need to change the file name once to ensure any changes propagate correctly through the rest of our analytics notebook).
# File writing demonstration
# An example of an absolute file path
#out_file = '/home/data_scientist/data/temp.txt'
# A relative file path
out_file = 'data/temp.txt'
# We explicitly place a newline at the end of each string
with open(out_file, 'w') as fout:
fout.write("Hello World!\n")
fout.write("Goodbye World!\n")
# Note, we can access Python variable names in our
# Unix script by prefixing them with a dollar sign ($)
!cat $out_file
# alternatively, you could provide the full file name
# to the Unix command, like
# cat /home/data_scientist/data/temp.txt
To read data with Python, we simply open the file (in a context). By default, for a text file, we iterate through the file object, which returns each line of the text file as a Python string.
with open(out_file, 'r') as fin:
    for line in fin:
        print(line)
The open method also takes an encoding attribute that can be used to specify the character encoding used in the file. Originally, the only character encoding used by computers was the ASCII encoding, which required only seven bits to represent each character. This encoding represented only the standard American typewriter characters, and thus failed to work for non-English languages or words. To support character encodings for any language, the Unicode Consortium was formed, and standardized character encodings were subsequently developed. One of the most popular current character encodings is utf-8, which is a Unicode standard.
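As a brief sketch of this idea (the file name and sample text below are illustrative only), we can pass the encoding argument to open when writing and reading text that contains non-ASCII characters.

# A minimal sketch of the encoding argument; the file name is illustrative.
demo_file = 'data/unicode_demo.txt'

# Write a string containing non-ASCII characters using UTF-8
with open(demo_file, 'w', encoding='utf-8') as fout:
    fout.write('Café, naïve, Σ\n')

# Read the data back with the same encoding
with open(demo_file, 'r', encoding='utf-8') as fin:
    print(fin.read())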
Student Exercise
In the empty Code cell below, write a simple Python script to emulate the Unix cat command. Explicitly, your script should open the file data/temp.txt that was created in an earlier notebook cell, and read and display each line from the file.
In general, you will want to work with data files that have been created by others, perhaps exclusively or in conjunction with data files you may have created. To demonstrate working with externally created data files, we will obtain a list of airports from a remote website. Even if you skip mastering this section of the notebook, be sure to execute the cells so that the data is correctly acquired for the rest of this notebook.
The first code cell provides the name of the file where we will store the data locally. The second code cell is a special Unix script (technically a Bash shell script) that first tests if the file exists locally on your Jupyter server and, if not, uses the wget command to pull the file from a remote webserver to your server. Finally, we use the Unix head command to display the first five lines of the file, both to verify the file has been retrieved successfully and to see the format of each row. In this case, the file employs commas to separate values from each other within a single row. This format is known as comma-separated value, or CSV, and is a popular text format.
# We first name the file that contains our data of interest
data_file='data/airports.csv'
%%bash -s "$data_file"
# Note, we passed in a Python variable above to the Bash script,
# which is then accessed via a positional parameter, or $1 in this case.
# First test if the file of interest does not exist
if [ ! -f "$1" ] ; then
    # If it does not exist, we grab the file from the Internet and
    # store it locally in the data directory
    wget -O "$1" http://stat-computing.org/dataexpo/2009/airports.csv
else
    echo "File already exists locally."
fi
# Now display the first five lines
!head -5 $data_file
We can interpret the airport data acquired previously by using a spreadsheet as a mental model. This file has one airport location written in each row of the file, and the columns or fields in each row are separated by commas. This file format is known as a comma separated value or CSV file, and many spreadsheets will export data to this format. In this case, a comma is used as a delimiter for the different fields or columns in the file, but other delimiters can also be used.
Now that we have the airport CSV file, we can read the data and process it accordingly. In the following Code cell, the file is opened, and a list of airports in the state of Delaware (abbreviation DE) is displayed in a special format. To accomplish this task, we open the file using a context and assign the file to the variable fin, which is short for file input. Python treats this file input as an iterator, allowing us to access one line of text data at a time (where the end of a line is traditionally marked by the newline character, \n). This line of text is returned as a Python string, which is held in the line variable. We split the line string on our delimiter, which is a comma, into columns that are held in the cols list.
In this particular example, we check if the string abbreviation for the state of Delaware, which is DE, is in the fourth column (i.e., cols[3]), and if so we print a nicely formatted string naming each airport, its city location, and its International Air Transport Association, or IATA, code. The format string is specified at the top of the code cell and enables variable substitution by using curly braces to indicate where the new text should be inserted for each row (i.e., {0} indicates where the first variable should be inserted into the string, etc.). The output shown after the code cell displays the data generated by running this script.
# Now we can use Python to read in the file
# Here is our formatted print string
fString = "An airport named {0} is located in {1}, {2} and has IATA CODE = {3}"
print("Displaying Delaware State Airports")
print(80*'-')
# Now loop through the file, and display any airport in the state of Delaware (DE).
# Notice how each line is read in from the file as a Python string, which we tokenize
# (or split) on commas into a list of columns. We can extract individual columns to
# get the data of interest.
with open(data_file, 'r') as fin:
    for line in fin:
        cols = line.split(',')
        if 'DE' in cols[3]:
            print(fString.format(cols[1], cols[2], cols[3], cols[0]))
Student Exercise
In the empty Code cell below, write a Python script (by copying and modifying the previous script) to display only the first five airports in a different state such as California (abbreviation CA).
As the Python language has become more popular, individuals and organizations have invested considerable time, energy, and effort in developing Python applications. Fortunately, the Python language supports encapsulating code into modules, which are essentially files containing Python definitions, for example, functions, classes, or variables. A module can be imported into another Python file, allowing the definitions to be reused.
When one or more modules are widely used, they can be bundled together into a Python package, which can provide enhanced functionality. To import a package (or module) into another Python program, you use the import statement, which has the following forms:
import numpy
import numpy as np
from numpy import arange
from numpy import *
The first form brings the entire contents of the numpy package into the current program but leaves all items in the numpy namespace. Thus, to refer to a particular definition, like arange, one must use the numpy prefix, as in numpy.arange(). The second form is similar to the first, but the prefix has been shortened to np. The third form imports only the single, listed definition, which is brought into the current namespace and thus does not require any prefix. The last form brings the entire contents of the numpy package into the current file and namespace. As a result, the chances for name collisions increase, and thus the last form is strongly discouraged. Note that in this course we generally use only the second form, and the appropriate import format will be demonstrated in future lessons for each module.
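As a quick illustration of the recommended second form, the following snippet imports numpy under the np prefix and calls the arange function mentioned above.

# Import numpy using the conventional shortened prefix
import numpy as np

# arange is referenced through the np prefix
print(np.arange(5))    # displays: [0 1 2 3 4]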
Many popular packages have been included with the standard Python distributions and are known collectively as the Standard Library. Other packages are available from third parties, yet can be very useful in specific circumstances. The following table lists some of the more popular Python packages that are relevant for this course:
Name | Description |
---|---|
numpy | Fast numerical arrays and matrices |
scipy | Comprehensive set of scientific and engineering functions |
matplotlib | Comprehensive plotting library |
seaborn | Statistical data visualization |
pandas | Data structures and simplifies data analysis tasks |
csv | Easily read and write CSV files |
scikit-learn | Provides machine learning tools |
In addition to these listed packages, many other packages exist. The official repository for public Python packages is PyPI, the Python Package Index. These libraries can generally be installed with pip, the Python package management tool; however, the details of doing this are beyond the scope of this course.
A caveat, however, to blindly using libraries from PyPI or any other distribution mechanism: while a particular library may simplify the development of a Python program, this same library may conversely complicate the distribution and maintenance of a Python program by introducing extra dependencies that are possibly out of the control of the developer. Thus, the benefits and risks of using any Python package should be evaluated judiciously before its adoption. The Python packages listed previously, as well as other community-standard Python packages, are generally safe to adopt, as they are well supported and widely available.
The maintenance problem is usually not the result of the Python package itself, but of its dependencies. As an example, the popular SciPy package requires external C and Fortran libraries that provide the actual implementation of basic linear algebra and special mathematical functions. Acquiring these libraries for any given operating system and hardware platform can be difficult and might require compiling the original sources, further increasing any dependency issues that are not handled by pip.
While ongoing efforts exist in the community to provide a solution to these dependency issues, the current recommended approach is to use the Anaconda Python distribution from Continuum Analytics. Anaconda is freely available, and provides a complete Python installation along with a number of the more popular Python packages, available for most operating systems.
We discussed working with text data earlier in this notebook, where you learned how to write and read data from files that used a comma as a delimiter. One of the beautiful aspects of Python is the rich ecosystem of modules that have been developed to simplify mundane tasks. To demonstrate this, we will now read the same data by using the csv module.
The next code block creates a list of lists, called airports, where each inner list contains separate strings for each column in the row (i.e., each airport). This demonstrates how reading and parsing a CSV file can be simplified by using the csv module. The second code block processes the airports list to once again extract and display the airports in the state of Delaware.
# Now read in the entire data file
import csv
# We store the rows in a list
airports = []
# Open the file for reading, and extract the rows
with open(data_file, 'r') as csvfile:
    for row in csv.reader(csvfile, delimiter=','):
        airports.append(row)
# Display first five rows (remember this is a list of lists)
print(airports[0:5])
# Here is our formatted print string
fString = "An airport named {0} is located in {1}, {2} and has IATA CODE = {3}"
print("Displaying Delaware State Airports")
print(80*'-')
# Now iterate through the list of airports, and extract the ones of interest.
for row in airports:
    if 'DE' in row[3]:
        print(fString.format(row[1], row[2], row[3], row[0]))
Traditionally, the delimiter most frequently used is the comma, leading to the comma-separated value (CSV) format described earlier. However, other delimiters can also be used, including whitespace characters like the space or tab characters, or specific, infrequently used characters like the vertical bar |.
We can easily read and write delimiter-separated value formats by using the reader and writer functions in the csv module. These functions include an optional delimiter parameter that can be used to specify the actual value used to distinguish between consecutive values in a row. Other parameters can also be used to control how to escape the delimiter character and how to indicate the end of a line.
To demonstrate this concept, the following code cell writes the airport data to a new file, but this time by using the vertical bar as the delimiter. Notice the code looks very familiar; we simply change the output filename and specify the | character as our delimiter.
# We will write a CSV file using the | character as a delimiter
import csv
# Output Filename, delimiter separated format file
ds_file = 'data/vbar.txt'
with open(ds_file, 'w') as csvfile:
    # We need our csv writer stream
    fout = csv.writer(csvfile, delimiter='|')
    # Now write each airport out using the delimiter
    for airport in airports:
        fout.writerow(airport)
This simple code block demonstrated how to write out a vertical-bar separated value file. We can either view the file contents by using Unix command line tools, as demonstrated in the next cell, or by using the Jupyter Notebook, which is demonstrated later in this notebook.
!head -5 $ds_file
Reading the data into a Python program is straightforward; simply use the csv.reader function to iterate through the rows in the file. We demonstrate this in the following code cell, where we convert the data to a fixed-width format to improve the readability of the resulting output. To do this, we first need to construct appropriate string formatting codes.
In the following code cell, we construct two format code strings: the first one is for the header row that contains the column labels, while the second one is for the data rows. These format codes are fairly easy to understand if you take them one step at a time. We first enclose each string substitution in curly braces { }, and use numbers to indicate the order of substitution; that is, a 0 indicates the first variable, a 1 indicates the second variable, and so on. Next, we provide a colon (:) character to indicate the presence of a formatting code, which consists of numbers and a letter code. The numbers following the colon indicate the field width (in characters) that the column will span; for floating-point data, the numbers after the period specify the precision (or number of digits after the decimal point) of the value. The character code indicates the type of data to encode: 's' for string, and 'f' for floating-point. Thus, for example, {1:29s} means the second variable is substituted as a string into a field 29 characters wide.
# We can read the data and display it by using our previous string format codes.
hfmt = "{0:5s}{1:29s}{2:27s}{3:6s}{4:10s}{5:12s}{6:10s}"
fmt = "{0:5s}{1:29s}{2:30s}{3:3s}{4:4s}{5:14.8f}{6:14.8f}"
# First line is header row
rCount = 0
# Now Read in file data.
with open(ds_file, 'r') as csvfile:
    for row in csv.reader(csvfile, delimiter='|'):
        # We output the first line specially since it is a header row.
        if rCount == 0:
            print(hfmt.format(row[0], row[1], row[2], row[3], row[4],
                              row[5], row[6]))
        # Else we simply print the row
        else:
            print(fmt.format(row[0], row[1], row[2], row[3], row[4],
                             float(row[5]), float(row[6])))
        # We only want to print out the first five rows.
        rCount += 1
        if rCount > 5:
            break
We have already discussed the simplest persistence technique, basic file input/output, in this notebook. By using the Python programming language, you can open a file for reading and writing, and even use binary mode to save storage space (or even directly use a compression technique via the appropriate Python library, such as bz2 for bzip2 compression).
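As a small sketch of that last point (the file name here is illustrative), the standard library bz2 module provides an open function with the same text-mode interface as the built-in open, so data written through it is transparently compressed on disk.

# A minimal sketch of compressed text I/O via the standard bz2 module;
# the file name is illustrative only.
import bz2

with bz2.open('data/demo.txt.bz2', 'wt', encoding='utf-8') as fout:
    fout.write('Hello, compressed world!\n')

with bz2.open('data/demo.txt.bz2', 'rt', encoding='utf-8') as fin:
    print(fin.read())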
While basic file input/output works, it is not optimal for several reasons:
All data is written and read as Python strings. Complex arrangements of heterogeneous data thus require potentially complex (and costly in execution time) transformations.
All concurrency is provided by the file system; thus, we are not guaranteed consistent results if multiple writers work at the same time.
Without extra effort, for example, to write to a binary file or to employ compression, this approach is costly in terms of storage space.
We rely completely on the underlying file system for consistency and durability. Thus, persisted application state may have unintentional dependencies on the underlying file system.
Fortunately, Python provides a simple technique, called pickling, that we can use to easily save data to a file and to later reconstitute the data in a Python program. Pickling writes the class information for any data being written to the file along with the data. When you unpickle data, this class information is used to properly reconstitute the data in the pickled file. Pickling is easy to use and can often suffice for simple data persistence tasks. To pickle data to a file, you must import the pickle module and open a file in binary write mode. After this, simply call the pickle.dump() function with the data to write and the file stream.
import pickle
p_file = 'data/test.p'
with open(p_file, 'wb') as fout:
    pickle.dump(airports, fout)
Unpickling data is also easy; simply open the appropriate file in binary read mode and call the pickle.load() function to retrieve the data from the file and assign the result to a variable.
with open(p_file, 'rb') as fin:
    new_airports = pickle.load(fin)
print(new_airports[0:5])
While easier than custom read/write routines, pickling still requires the file system to provide support for concurrency, consistency, and durability. To go any further with data persistence, we will need to work with database systems, which is the topic of a future lesson.
The last code cell in this notebook removes our temporary files.
%%bash
# We now clean up the temporary files created earlier in this notebook.
# Note: Python variables are not visible inside a %%bash cell, so we
# spell out the file names held in out_file, ds_file, and p_file above.
rm -f data/temp.txt
rm -f data/vbar.txt
rm -f data/test.p
The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.
© 2017: Robert J. Brunner at the University of Illinois.
This notebook is released under the Creative Commons license CC BY-NC-SA 4.0. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.