Introduction to Pandas


One of the early criticisms of many in the data science arena of the Python language was the lack of useful data structures for performing data analysis tasks. This criticism stemmed in part from comparisons between the R language and Python since R has a built-in DataFrame object that greatly simplified many data analysis tasks. This deficiency was addressed in 2008 by Wes McKinney with the creation of Pandas (the name was originally an abbreviation of Panel data), and this module continues to be improved. To quote the Pandas documentation:

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

The Pandas module introduces several new data structures like the Series, DataFrame, and Panel that build on top of existing tools like NumPy to speed up data analysis tasks. Note that the NumPy module provides support for numerical operations, including the generation of random data, which we will use in this notebook. You will learn more about NumPy in a future lesson. The Pandas module also provides efficient mechanisms for moving data between in-memory representations and different data formats, including CSV and text files, JSON files, SQL databases, HDF5 format files, and even Excel spreadsheets. Finally, the Pandas module also provides support for dealing with missing or incomplete data and aggregating or grouping data.


Brief Introduction to Pandas

Before using Pandas, we must first import the Pandas module (note that this is the standard import form used for Pandas):

import pandas as pd

Second, we simply create and use the appropriate Pandas data structure. The two most important data structures for typical data science tasks are the Series and the DataFrame:

  1. Series: a one-dimensional labeled array that can hold any data type such as integers, floating-point numbers, strings, or Python objects. A Series has both a data column and a label column called the index.

  2. DataFrame: a two-dimensional labeled data structure with columns that can hold different data types, similar to a spreadsheet or relational database table.

The Pandas module also provides a date/time data structure, sometimes referred to as a TimeSeries, and a three-dimensional data structure known as a Panel. But of all of these, the DataFrame is the data structure we will use the most.

Series

A Series is useful to hold data that should be accessible by using a specific label. For example, you might use time or dates to reference a specific financial transaction. You create a Series by passing in an appropriate data set along with an optional index:

values = pd.Series(data, index=idx)

The following Code cells demonstrate different options for creating a Series. First, we create a series from a list of numbers with an explicit index assignment.


In [1]:
# First we import the module we will use in this notebook. 
# Note the standard import form for the Pandas module

import pandas as pd

# We label the values
s1 = pd.Series(range(6), index=['q', 'w', 'e', 'r', 't', 'y'])

s1
Out[1]:
q    0
w    1
e    2
r    3
t    4
y    5
dtype: int64

Second, we create a Series from a dictionary, but in this case, we include an index that suppresses one dictionary entry (the value corresponding to the e key) and includes one unidentified entry (corresponding to the y key). The resulting series will, therefore, have four entries, one of which does not contain a valid entry, which is represented by a NaN value.


In [2]:
d = {'q': 11, 'w': 21, 'e': 31, 'r': 41}

# We pick out the q, w, and r keys, but have an undefined y key.
s2 = pd.Series(d, index = ['q', 'w', 'r', 'y'])

s2
Out[2]:
q    11.0
w    21.0
r    41.0
y     NaN
dtype: float64

The final technique creates a Series that contains the same value throughout, the number 42. The number of rows in the series is determined in this case by the number of entries in the index (6).


In [3]:
# We create a Series from an integer constant with explicit labels
s3 = pd.Series(42, index = ['q', 'w', 'e', 'r', 't', 'y'])

s3
Out[3]:
q    42
w    42
e    42
r    42
t    42
y    42
dtype: int64

DataFrame

The second major data structure that Pandas provides is the DataFrame, which is a two-dimensional array, where each column is effectively a Series with a shared index. A DataFrame is a very powerful data structure and provides a nice mapping for a number of different data formats (and storage mechanisms). A DataFrame can be created in many ways; for a full listing see the official documentation.

Once you have been exposed to more advanced concepts, like the NumPy module, you will learn other techniques to create and manipulate a DataFrame. In a subsequent lesson, you will learn to create a DataFrame by reading data from a file or a website. For this notebook, we will create a DataFrame by using a dictionary to map column labels to the values in the column, which we store in a list.

The following Code cell creates a DataFrame by using a dictionary with five entries, where each entry contains a five element list. This creates a DataFrame that has 25 entries, some of which are missing values (we encode these using the Python None keyword, and Pandas translates this into NaN, which is shorthand for Not a Number). The last step is to display the resulting DataFrame, which has increasing integers as row labels (that were automatically assigned since we did not supply them) and the dictionary key values as column labels.

We can display the first few or last rows (the exact number can be specified via an attribute of the respective functions) of a DataFrame by using the head or tail function.


In [4]:
# Define dictionary used to create our DataFrames

dt = {'A' : [0, 1, 2, 3, None], 
      'B' : [5, 6, None, 8, 9], 
      'C' : [10, None, 12, 13, 14], 
      'D' : [15, 16, 17, 18, 19],
      'E' : [20, 21, None, None, 24]}

# Create default DataFrame
df = pd.DataFrame(dt)

# Display the last two lines of the DataFrame
df.tail(2)
Out[4]:
A B C D E
3 3.0 8.0 13.0 18 NaN
4 NaN 9.0 14.0 19 24.0
In [5]:
# Create DataFrame with named columns

idf = pd.DataFrame(dt, index=list('qwert'))

idf.tail(2)
Out[5]:
A B C D E
r 3.0 8.0 13.0 18 NaN
t NaN 9.0 14.0 19 24.0

Series

Programmatically, a Series (and a DataFrame) provides powerful access methods, some similar to a Python list, others very different, since a Series is a more powerful data structure. In the next few Code cells, different techniques for interacting with a Pandas Series are demonstrated. First, element access is used, including the traditional list slicing, as well as the more powerful index-based slicing.


In [6]:
# We can slice like a list

print('s1[-2] = {}'.format(s1[-2]))

print('s2["q"] = {}'.format(s2['q']))

print('s3["q":"e"] =')
print(s3["q":"e"])
s1[-2] = 4
s2["q"] = 11.0
s3["q":"e"] =
q    42
w    42
e    42
dtype: int64

Pandas also provides a number of functions that can be applied directly on a Series, including basic statistical measures, such as the average, minimum, maximum, and finding the rows with the largest or smallest entries. In addition, the describe function can be used to display a statistical description of a Series.


In [7]:
# Compute functions on a Series

print(40*'-')
print('Sum of s1:\n')
print(s1.sum())
print()

print(40*'-')
print('Largest and smallest entries in s2:\n')
print(s2.nlargest(2))
print()
print(s2.nsmallest(2))
print()

print(40*'-')
print('Cumulative Sum of s3:\n')
print(s3.cumsum())
----------------------------------------
Sum of s1:

15

----------------------------------------
Largest and smallest entries in s2:

r    41.0
w    21.0
dtype: float64

q    11.0
w    21.0
dtype: float64

----------------------------------------
Cumulative Sum of s3:

q     42
w     84
e    126
r    168
t    210
y    252
dtype: int64
In [8]:
# Display description of s1

s1.describe()
Out[8]:
count    6.000000
mean     2.500000
std      1.870829
min      0.000000
25%      1.250000
50%      2.500000
75%      3.750000
max      5.000000
dtype: float64

Vectorized Operations

The Pandas module enables Series objects to be combined mathematically in an intuitive manner. For example, basic math operations, such as addition, scalar multiplication, and subtraction, are performed on an element-by-element manner. This approach will make sense to anyone familiar with the mathematical concept of vectors and can greatly simplify implementing algorithms. In the following Code cell, several of these functions are demonstrated, first by adding two Series together before combining scalar multiplication with Series subtraction.


In [9]:
# We can also perform vectorized operations
print('\nSum Series:')
print(s1 + s3)

print(s1 + s2)

print('\nSeries operations:')

st = s2 * 5 - 1.2

print(st)
Sum Series:
q    42
w    43
e    44
r    45
t    46
y    47
dtype: int64
e     NaN
q    11.0
r    44.0
t     NaN
w    22.0
y     NaN
dtype: float64

Series operations:
q     53.8
w    103.8
r    203.8
y      NaN
dtype: float64

Student Exercise

In the empty Code cell below, first create a Pandas Series with 10 elements ranging from 11 to 20 with an index that consists of the first 10 characters in the English alphabet (i.e., a, b, c, d, e, f, g, h, i, and j). Next, compute and display the cumulative product (cumprod) and the summary statistical description (describe).


In [ ]:
 

DataFrame

Since a DataFrame is essentially a two-dimensional analog of the one-dimensional Series, many of the same features extend from a Series to a DataFrame, including vectorized functions and mathematical operations (some of which are demonstrated later in this notebook). However, the differences between a Series and a DataFrame introduce several new issues due to the presence of multiple columns. Specifically, with a DataFrame one needs to be aware that each column might store different types of data; and, in addition, one now must be able to index to a particular row, a particular column, and to a particular field.

The following Code cells demonstrate the differences between a Series and a DataFrame, particularly when storing different types of data, and in directly accessing specific rows, columns, or subsections of a DataFrame. First, we access the dtypes attribute of our DataFrame to display the data types for all the columns. Next, we use the describe function to display summary statistics of the columns. Note that the describe function processes numerical columns differently than non-numerical columns. Finally, the columns attribute is used to display the columns names for the DataFrame, which can later be used to access a specific column or columns.


In [10]:
# Display data type of a Pandas DataFrame

df.dtypes
Out[10]:
A    float64
B    float64
C    float64
D      int64
E    float64
dtype: object
In [11]:
df.describe()
Out[11]:
A B C D E
count 4.000000 4.000000 4.000000 5.000000 3.000000
mean 1.500000 7.000000 12.250000 17.000000 21.666667
std 1.290994 1.825742 1.707825 1.581139 2.081666
min 0.000000 5.000000 10.000000 15.000000 20.000000
25% 0.750000 5.750000 11.500000 16.000000 20.500000
50% 1.500000 7.000000 12.500000 17.000000 21.000000
75% 2.250000 8.250000 13.250000 18.000000 22.500000
max 3.000000 9.000000 14.000000 19.000000 24.000000
In [12]:
df.columns
Out[12]:
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

Data Selection

Given a DataFrame, the selection of a subset of the DataFrame can be accomplished in a number of different ways, which can lead to some confusion. You can select one or more rows, one or more columns, or a combination of a set of rows and a set of columns. Formally, when selecting rows, we are slicing, and when selecting columns, we are indexing. Thus, with one exception, a new DataFrame can be selected from an existing DataFrame by the techniques listed in the following table.

Selection Interpretation
df[1:3] Select the second and third rows from df
idf[0:2] Select first two rows from idf
idf['q':'w'] Select first two rows from idf
df['A'] Select first column from df, return a Series, known as dictionary-style index
df[['A']] Select first column from df, return a DataFrame, known as explicit list-style index
df[['B','D']] Select second and fourth column from df
df[0:2][['A','C']] Select first and third column only from the first two rows from df

As shown in the table, elements, one or more rows, or one or more columns can be accessed by using either the implicit label of a row and/or column, or by the explicit Python numerical index. Thus, a single column can be directly accessed (as a Series) by using a dictionary style access with the column label or (as a DataFrame) by using a list of column labels. If the list includes only one column label, a single column DataFrame is returned; otherwise a multiple column DataFrame is returned. Likewise, one or more rows can be accessed by using the row indices in a traditional slice notation or by using the explicit row labels as a list of strings (when they exist).

These techniques are demonstrated in the following Code cells; change them and see what results.


In [13]:
# Selection of column 'A' by using a dictionary style access via column label
# Returns a Series

df['A']
Out[13]:
0    0.0
1    1.0
2    2.0
3    3.0
4    NaN
Name: A, dtype: float64
In [14]:
type(df['A'])
Out[14]:
pandas.core.series.Series
In [15]:
# Selection of column 'A' by using a list style access via explicit column label
# Returns a single column DataFrame

df[['A']]
Out[15]:
A
0 0.0
1 1.0
2 2.0
3 3.0
4 NaN
In [16]:
type(df[['A']])
Out[16]:
pandas.core.frame.DataFrame
In [17]:
# Selection of first two rows by using implicit row index
# in this example, we do not have explicit row labels
df[0:2]
Out[17]:
A B C D E
0 0.0 5.0 10.0 15 20.0
1 1.0 6.0 NaN 16 21.0
In [18]:
# Select first two rows using explicit slice

idf['q':'w']
Out[18]:
A B C D E
q 0.0 5.0 10.0 15 20.0
w 1.0 6.0 NaN 16 21.0
In [19]:
# Select first two rows using implicit slice

idf[0:2]
Out[19]:
A B C D E
q 0.0 5.0 10.0 15 20.0
w 1.0 6.0 NaN 16 21.0
In [20]:
# Selection of 'B and 'D' columns by using explicit column index
df[['B','D']]
Out[20]:
B D
0 5.0 15
1 6.0 16
2 NaN 17
3 8.0 18
4 9.0 19
In [21]:
# Select first and third columns only from the first two rows

df[0:2][['A','C']]
Out[21]:
A C
0 0.0 10.0
1 1.0 NaN

loc and iloc

The different selection techniques presented before might seem confusing, especially when used in a complex line of code. To simplify this, Pandas provides several utility functions that clearly indicate the type of selection being performed via the loc and iloc.

  • loc: use explicit labels to select a subset of the entire DataFrame.
  • iloc: use implicit labels (i.e., the positional index) to select a subset of the entire DataFrame.

These functions are demonstrated in the following code blocks, where subsets of the original DataFrame are selected in different manners.


In [22]:
# Use label based selection
df.loc[[0, 3], ['B', 'D']]
Out[22]:
B D
0 5.0 15
3 8.0 18
In [23]:
# Use positional based selection
df.iloc[2:5, 1:4]
Out[23]:
B C D
2 NaN 12.0 17
3 8.0 13.0 18
4 9.0 14.0 19

NaN

In some cases, a data set being analyzed might have missing data (reasons for missing data will be explored in a subsequent lesson). Pandas can handle these cases in an elegant manner by assigning the NaN value to these cells. Furthermore, functions operating on a DataFrame (or Series) can handle these missing data in different manners, such as dropping any row or column with missing data, or, alternatively, imputing (or replacing) missing values with a substitute value. The substitute value can either be a constant value, such as zero, or a value computed from the DataFrame itself, such as the mean value of the column containing the missing entry. These techniques are demonstrated below by using the dropna and fillna functions, which drop rows containing a missing value or replace missing values, respectively.

By default, rows will be dropped and computed values (like an average) will be computed down a column. However, these defaults can be changed by explicitly specifying the axis parameter for these two functions. In this case, axis=0 is used to specify rows, while axis=1 is used to specify columns.


In [24]:
# Drop all rows with missing values
df.dropna()
Out[24]:
A B C D E
0 0.0 5.0 10.0 15 20.0
In [25]:
# Drop all columns (axis = 1) with missing values
df.dropna(axis=1)
Out[25]:
D
0 15
1 16
2 17
3 18
4 19
In [26]:
# Fill missing values with a constant value of zero
df.fillna(0)
Out[26]:
A B C D E
0 0.0 5.0 10.0 15 20.0
1 1.0 6.0 0.0 16 21.0
2 2.0 0.0 12.0 17 0.0
3 3.0 8.0 13.0 18 0.0
4 0.0 9.0 14.0 19 24.0
In [27]:
# Fill missing values with the mean value for the respective column
df.fillna(df.mean())
Out[27]:
A B C D E
0 0.0 5.0 10.00 15 20.000000
1 1.0 6.0 12.25 16 21.000000
2 2.0 7.0 12.00 17 21.666667
3 3.0 8.0 13.00 18 21.666667
4 1.5 9.0 14.00 19 24.000000

By default, many operations, such as fillna on a DataFrame, produce a new copy of the DataFrame. To operate in place on a DataFrame, you must set the inplace attribute to True. The following code block first makes a copy of our demonstration DataFrame before imputing missing values in place.


In [28]:
#Make a copy of our DataFrame
df2 = df.copy(deep=True)

# Now impute in place
df2.fillna(df2.mean(), inplace=True)

Vectorized Operations on a DataFrame

As with a Pandas Series, many mathematical operations on a Pandas DataFrame are vectorized. Thus, scalars can be combined with a DataFrame on an element-by-element basis, and multiple DataFrame objets can be combined (as long as they have the same number of rows and columns) in an intuitive manner. In addition, NaN values are carried through appropriately, as demonstrated below. Finally, we can flip the rows and columns of a DataFrame by taking the transpose.


In [29]:
# We can operate with basic scalar values

df + 2.5
Out[29]:
A B C D E
0 2.5 7.5 12.5 17.5 22.5
1 3.5 8.5 NaN 18.5 23.5
2 4.5 NaN 14.5 19.5 NaN
3 5.5 10.5 15.5 20.5 NaN
4 NaN 11.5 16.5 21.5 26.5
In [30]:
# And perform more complex scalar operations

-1.0 * df2 + 3.5
Out[30]:
A B C D E
0 3.5 -1.5 -6.50 -11.5 -16.500000
1 2.5 -2.5 -8.75 -12.5 -17.500000
2 1.5 -3.5 -8.50 -13.5 -18.166667
3 0.5 -4.5 -9.50 -14.5 -18.166667
4 2.0 -5.5 -10.50 -15.5 -20.500000
In [31]:
# DataFrames can be combined
df - df2
Out[31]:
A B C D E
0 0.0 0.0 0.0 0 0.0
1 0.0 0.0 NaN 0 0.0
2 0.0 NaN 0.0 0 NaN
3 0.0 0.0 0.0 0 NaN
4 NaN 0.0 0.0 0 0.0
In [32]:
# We can transpose the dataFrame

df.T
Out[32]:
0 1 2 3 4
A 0.0 1.0 2.0 3.0 NaN
B 5.0 6.0 NaN 8.0 9.0
C 10.0 NaN 12.0 13.0 14.0
D 15.0 16.0 17.0 18.0 19.0
E 20.0 21.0 NaN NaN 24.0

Student Exercise

In the empty Code cell below, first create a Pandas DataFrame with 25 elements ranging from 1 to 25, with row indices assigned automatically and column labels equal to the characters: b, d, f, g, and j. Next, select the subset of this DataFrame corresponding to the second-fourth rows and columns (so nine total entries). Multiply this new DataFrame by 10, and subtract 5. Finally, apply the describe function to the resultant DataFrame.


In [ ]:
 

Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

  1. Pandas documentation
  2. A complete Pandas tutorial
  3. The Pandas chapter from the book Python Data Science Handbook by Jake VanderPlas

© 2017: Robert J. Brunner at the University of Illinois.

This notebook is released under the Creative Commons license CC BY-NC-SA 4.0. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.