One of the early criticisms of many in the data science arena of the Python language was the lack of useful data structures for performing data analysis tasks. This criticism stemmed in part from comparisons between the R language and Python since R has a built-in DataFrame object that greatly simplified many data analysis tasks. This deficiency was addressed in 2008 by Wes McKinney with the creation of Pandas (the name was originally an abbreviation of Panel data), and this module continues to be improved. To quote the Pandas documentation:
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.
The Pandas module introduces several new data structures like the Series
, DataFrame
, and Panel
that build on top of existing tools like NumPy
to speed up data analysis tasks. Note that the NumPy
module provides support for numerical operations, including the generation of random data, which we will use in this notebook. You will learn more about NumPy
in a future lesson. The Pandas module also provides efficient mechanisms for moving data between in-memory representations and different data formats, including CSV and text files, JSON files, SQL databases, HDF5 format files, and even Excel spreadsheets. Finally, the Pandas module also provides support for dealing with missing or incomplete data and aggregating or grouping data.
Before using Pandas, we must first import the Pandas module (note that this is the standard import form used for Pandas):
import pandas as pd
Second, we simply create and use the appropriate Pandas data structure. The two most important data structures for typical data science tasks are the Series
and the DataFrame
:
Series
: a one-dimensional labeled array that can hold any data type such as integers, floating-point numbers, strings, or Python objects. A Series
has both a data column and a label column called the index.
DataFrame
: a two-dimensional labeled data structure with columns that can hold different data types, similar to a spreadsheet or relational database table.
The Pandas module also provides a date/time data structure, sometimes referred to as a TimeSeries
, and a three-dimensional data structure known as a Panel
. But of all of these, the DataFrame
is the data structure we will use the most.
Series
¶A Series
is useful to hold data that should be accessible by using a specific label. For example, you might use time or dates to reference a specific financial transaction. You create a Series
by passing in an appropriate data set along with an optional index:
values = pd.Series(data, index=idx)
The following Code cells demonstrate different options for creating a Series
. First, we create a series from a list of numbers with an explicit index assignment.
# First we import the module we will use in this notebook.
# Note the standard import form for the Pandas module
import pandas as pd
# We label the values
s1 = pd.Series(range(6), index=['q', 'w', 'e', 'r', 't', 'y'])
s1
Second, we create a Series
from a dictionary, but in this case, we include an index that suppresses one dictionary entry (the value corresponding to the e
key) and includes one unidentified entry (corresponding to the y
key). The resulting series will, therefore, have four entries, one of which does not contain a valid entry, which is represented by a NaN
value.
d = {'q': 11, 'w': 21, 'e': 31, 'r': 41}
# We pick out the q, w, and r keys, but have an undefined y key.
s2 = pd.Series(d, index = ['q', 'w', 'r', 'y'])
s2
The final technique creates a Series
that contains the same value throughout, the number 42
. The number of rows in the series is determined in this case by the number of entries in the index (6).
# We create a Series from an integer constant with explicit labels
s3 = pd.Series(42, index = ['q', 'w', 'e', 'r', 't', 'y'])
s3
DataFrame
¶The second major data structure that Pandas provides is the DataFrame
, which is a two-dimensional array, where each column is effectively a Series
with a shared index. A DataFrame is a very powerful data structure and provides a nice mapping for a number of different data formats (and storage mechanisms). A DataFrame
can be created in many ways; for a full listing see the official documentation.
Once you have been exposed to more advanced concepts, like the NumPy
module, you will learn other techniques to create and manipulate a DataFrame
. In a subsequent lesson, you will learn to create a DataFrame
by reading data from a file or a website. For this notebook, we will create a DataFrame
by using a dictionary to map column labels to the values in the column, which we store in a list.
The following Code cell creates a DataFrame
by using a dictionary with five entries, where each entry contains a five element list. This creates a DataFrame
that has 25 entries, some of which are missing values (we encode these using the Python None
keyword, and Pandas translates this into NaN
, which is shorthand for Not a Number). The last step is to display the resulting DataFrame
, which has increasing integers as row labels (that were automatically assigned since we did not supply them) and the dictionary key values as column labels.
We can display the first few or last rows (the exact number can be specified via an attribute of the respective functions) of a DataFrame
by using the head
or tail
function.
# Define dictionary used to create our DataFrames
dt = {'A' : [0, 1, 2, 3, None],
'B' : [5, 6, None, 8, 9],
'C' : [10, None, 12, 13, 14],
'D' : [15, 16, 17, 18, 19],
'E' : [20, 21, None, None, 24]}
# Create default DataFrame
df = pd.DataFrame(dt)
# Display the last two lines of the DataFrame
df.tail(2)
# Create DataFrame with named columns
idf = pd.DataFrame(dt, index=list('qwert'))
idf.tail(2)
Programmatically, a Series
(and a DataFrame
) provides powerful access methods, some similar to a Python list, others very different, since a Series
is a more powerful data structure. In the next few Code cells, different techniques for interacting with a Pandas Series
are demonstrated. First, element access is used, including the traditional list slicing, as well as the more powerful index-based slicing.
# We can slice like a list
print('s1[-2] = {}'.format(s1[-2]))
print('s2["q"] = {}'.format(s2['q']))
print('s3["q":"e"] =')
print(s3["q":"e"])
Pandas also provides a number of functions that can be applied directly on a Series
, including basic statistical measures, such as the average, minimum, maximum, and finding the rows with the largest or smallest entries. In addition, the describe
function can be used to display a statistical description of a Series
.
# Compute functions on a Series
print(40*'-')
print('Sum of s1:\n')
print(s1.sum())
print()
print(40*'-')
print('Largest and smallest entries in s2:\n')
print(s2.nlargest(2))
print()
print(s2.nsmallest(2))
print()
print(40*'-')
print('Cumulative Sum of s3:\n')
print(s3.cumsum())
# Display description of s1
s1.describe()
The Pandas module enables Series
objects to be combined mathematically in an intuitive manner. For example, basic math operations, such as addition, scalar multiplication, and subtraction, are performed on an element-by-element manner. This approach will make sense to anyone familiar with the mathematical concept of vectors and can greatly simplify implementing algorithms. In the following Code cell, several of these functions are demonstrated, first by adding two Series
together before combining scalar multiplication with Series
subtraction.
# We can also perform vectorized operations
print('\nSum Series:')
print(s1 + s3)
print(s1 + s2)
print('\nSeries operations:')
st = s2 * 5 - 1.2
print(st)
Student Exercise
In the empty Code cell below, first create a Pandas Series
with 10 elements ranging from 11
to 20
with an index that consists of the first 10 characters in the English alphabet (i.e., a
, b
, c
, d
, e
, f
, g
, h
, i
, and j
). Next, compute and display the cumulative product (cumprod
) and the summary statistical description (describe
).
Since a DataFrame
is essentially a two-dimensional analog of the one-dimensional Series
, many of the same features extend from a Series
to a DataFrame
, including vectorized functions and mathematical operations (some of which are demonstrated later in this notebook). However, the differences between a Series
and a DataFrame
introduce several new issues due to the presence of multiple columns. Specifically, with a DataFrame
one needs to be aware that each column might store different types of data; and, in addition, one now must be able to index to a particular row, a particular column, and to a particular field.
The following Code cells demonstrate the differences between a Series
and a DataFrame
, particularly when storing different types of data, and in directly accessing specific rows, columns, or subsections of a DataFrame
. First, we access the dtypes
attribute of our DataFrame
to display the data types for all the columns. Next, we use the describe
function to display summary statistics of the columns. Note that the describe
function processes numerical columns differently than non-numerical columns. Finally, the columns
attribute is used to display the columns names for the DataFrame
, which can later be used to access a specific column or columns.
# Display data type of a Pandas DataFrame
df.dtypes
df.describe()
df.columns
Given a DataFrame, the selection of a subset of the DataFrame
can be accomplished in a number of different ways, which can lead to some confusion. You can select one or more rows, one or more columns, or a combination of a set of rows and a set of columns. Formally, when selecting rows, we are slicing, and when selecting columns, we are indexing. Thus, with one exception, a new DataFrame
can be selected from an existing DataFrame
by the techniques listed in the following table.
Selection | Interpretation |
---|---|
df[1:3] |
Select the second and third rows from df |
idf[0:2] |
Select first two rows from idf |
idf['q':'w'] |
Select first two rows from idf |
df['A'] |
Select first column from df , return a Series , known as dictionary-style index |
df[['A']] |
Select first column from df , return a DataFrame , known as explicit list-style index |
df[['B','D']] |
Select second and fourth column from df |
df[0:2][['A','C']] |
Select first and third column only from the first two rows from df |
As shown in the table, elements, one or more rows, or one or more columns can be accessed by using either the implicit label of a row and/or column, or by the explicit Python numerical index. Thus, a single column can be directly accessed (as a Series
) by using a dictionary style access with the column label or (as a DataFrame
) by using a list of column labels. If the list includes only one column label, a single column DataFrame
is returned; otherwise a multiple column DataFrame
is returned. Likewise, one or more rows can be accessed by using the row indices in a traditional slice notation or by using the explicit row labels as a list of strings (when they exist).
These techniques are demonstrated in the following Code cells; change them and see what results.
# Selection of column 'A' by using a dictionary style access via column label
# Returns a Series
df['A']
type(df['A'])
# Selection of column 'A' by using a list style access via explicit column label
# Returns a single column DataFrame
df[['A']]
type(df[['A']])
# Selection of first two rows by using implicit row index
# in this example, we do not have explicit row labels
df[0:2]
# Select first two rows using explicit slice
idf['q':'w']
# Select first two rows using implicit slice
idf[0:2]
# Selection of 'B and 'D' columns by using explicit column index
df[['B','D']]
# Select first and third columns only from the first two rows
df[0:2][['A','C']]
loc
and iloc
¶The different selection techniques presented before might seem confusing, especially when used in a complex line of code. To simplify this, Pandas provides several utility functions that clearly indicate the type of selection being performed via the loc
and iloc
.
loc
: use explicit labels to select a subset of the entire DataFrame
.iloc
: use implicit labels (i.e., the positional index) to select a subset of the entire DataFrame
.These functions are demonstrated in the following code blocks, where subsets of the original DataFrame
are selected in different manners.
# Use label based selection
df.loc[[0, 3], ['B', 'D']]
# Use positional based selection
df.iloc[2:5, 1:4]
In some cases, a data set being analyzed might have missing data (reasons for missing data will be explored in a subsequent lesson). Pandas can handle these cases in an elegant manner by assigning the NaN
value to these cells. Furthermore, functions operating on a DataFrame
(or Series
) can handle these missing data in different manners, such as dropping any row or column with missing data, or, alternatively, imputing (or replacing) missing values with a substitute value. The substitute value can either be a constant value, such as zero, or a value computed from the DataFrame
itself, such as the mean value of the column containing the missing entry. These techniques are demonstrated below by using the dropna
and fillna
functions, which drop rows containing a missing value or replace missing values, respectively.
By default, rows will be dropped and computed values (like an average) will be computed down a column. However, these defaults can be changed by explicitly specifying the axis
parameter for these two functions. In this case, axis=0
is used to specify rows, while axis=1
is used to specify columns.
# Drop all rows with missing values
df.dropna()
# Drop all columns (axis = 1) with missing values
df.dropna(axis=1)
# Fill missing values with a constant value of zero
df.fillna(0)
# Fill missing values with the mean value for the respective column
df.fillna(df.mean())
By default, many operations, such as fillna
on a DataFrame
, produce a new copy of the DataFrame. To operate in place on a DataFrame
, you must set the inplace
attribute to True
. The following code block first makes a copy of our demonstration DataFrame
before imputing missing values in place.
#Make a copy of our DataFrame
df2 = df.copy(deep=True)
# Now impute in place
df2.fillna(df2.mean(), inplace=True)
DataFrame
¶As with a Pandas Series
, many mathematical operations on a Pandas DataFrame
are vectorized. Thus, scalars can be combined with a DataFrame
on an element-by-element basis, and multiple DataFrame
objets can be combined (as long as they have the same number of rows and columns) in an intuitive manner. In addition, NaN
values are carried through appropriately, as demonstrated below. Finally, we can flip the rows and columns of a DataFrame by taking the transpose.
# We can operate with basic scalar values
df + 2.5
# And perform more complex scalar operations
-1.0 * df2 + 3.5
# DataFrames can be combined
df - df2
# We can transpose the dataFrame
df.T
Student Exercise
In the empty Code cell below, first create a Pandas DataFrame
with 25 elements ranging from 1
to 25
, with row indices assigned automatically and column labels equal to the characters: b
, d
, f
, g
, and j
. Next, select the subset of this DataFrame
corresponding to the second-fourth rows and columns (so nine total entries). Multiply this new DataFrame
by 10
, and subtract 5
. Finally, apply the describe
function to the resultant DataFrame
.
The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.
© 2017: Robert J. Brunner at the University of Illinois.
This notebook is released under the Creative Commons license CC BY-NC-SA 4.0. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.