Advanced Pandas


A previous lesson introduced the Pandas DataFrame, a powerful data structure that mimics a traditional spreadsheet. Data can easily be read into a DataFrame from a CSV file, a fixed-width text file, a JSON file, an HTML file (i.e., a webpage), or a relational database. In this notebook, this capability is demonstrated by first reading a file from an Internet website. The resulting DataFrame is subsequently used to demonstrate basic functionality, including descriptive analysis, basic indexing, and fancy indexing.
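As a quick, self-contained illustration of this flexibility, the read functions accept in-memory buffers as well as file paths and URLs. The sketch below uses a few made-up airport-style rows, not the actual data file:

```python
from io import StringIO

import pandas as pd

# A small, self-contained sample in CSV form (made-up rows for illustration)
csv_text = """iata,airport,city,state
00M,Thigpen,Bay Springs,MS
00R,Livingston Municipal,Livingston,TX
"""

# read_csv accepts any file-like object, not just a path or URL
df = pd.read_csv(StringIO(csv_text), delimiter=',', index_col='iata')
print(df.loc['00M', 'city'])     # Bay Springs

# The same idea works for JSON via read_json
json_text = '{"airport": {"00M": "Thigpen"}, "city": {"00M": "Bay Springs"}}'
dj = pd.read_json(StringIO(json_text))
print(dj.loc['00M', 'airport'])  # Thigpen
```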

When reading the airport data from this website, we specify the comma as the delimiter and explicitly set the index column to 'iata', the airport code. Using a data-specific column as the row index can often simplify data processing. The following two code cells display the first and last few rows, respectively, by using column labels to select several columns.


In [1]:
import pandas as pd

# Read data from CSV file, and display subset

dfa = pd.read_csv('http://stat-computing.org/dataexpo/2009/airports.csv', delimiter=',', index_col='iata')
In [2]:
# We can grab the first five rows, and only extract three columns
dfa[['airport', 'city', 'state']].head(5)
Out[2]:
airport city state
iata
00M Thigpen Bay Springs MS
00R Livingston Municipal Livingston TX
00V Meadow Lake Colorado Springs CO
01G Perry-Warsaw Perry NY
01J Hilliard Airpark Hilliard FL
In [3]:
# Extract the last four rows for three different columns
dfa[['airport', 'lat', 'long']].tail(4)
Out[3]:
airport lat long
iata
ZER Schuylkill Cty/Joe Zerbey 40.706449 -76.373147
ZPH Zephyrhills Municipal 28.228065 -82.155916
ZUN Black Rock 35.083227 -108.791777
ZZV Zanesville Municipal 39.944458 -81.892105

This data set contains mixed data, both numeric and text, and thus requires different column storage types. The first code cell below displays the storage type of each of the six columns in this DataFrame: the lat and long columns are numeric (float64), while the other four columns hold text data, represented by the data type object. The column types also affect the default behavior of certain Pandas functions, such as the describe function, since summary statistics can only be calculated for numeric data.


In [4]:
# Display the data types of each column
dfa.dtypes
Out[4]:
airport     object
city        object
state       object
country     object
lat        float64
long       float64
dtype: object
In [5]:
# Display a summary of the numerical information in the DataFrame

dfa.describe()
Out[5]:
lat long
count 3376.000000 3376.000000
mean 40.036524 -98.621205
std 8.329559 22.869458
min 7.367222 -176.646031
25% 34.688427 -108.761121
50% 39.434449 -93.599425
75% 43.372612 -84.137519
max 71.285448 145.621384

Since this new DataFrame was created with a labelled row index, we can use row labels to slice rows from the DataFrame. The following code cells demonstrate basic slicing and indexing of this DataFrame by using both explicit indices (row and column labels) and implicit indices (row and column index values).


In [6]:
# Slice rows by using the indicated label from the index column

dfa.loc[['11J', '11R', '12C']]
Out[6]:
airport city state country lat long
iata
11J Early County Blakely GA USA 31.396986 -84.895257
11R Brenham Municipal Brenham TX USA 30.219000 -96.374278
12C Rochelle Municipal Rochelle IL USA 41.893001 -89.078290
In [7]:
# Slice rows by using the row implicit index

dfa[99:103]
Out[7]:
airport city state country lat long
iata
11J Early County Blakely GA USA 31.396986 -84.895257
11R Brenham Municipal Brenham TX USA 30.219000 -96.374278
12C Rochelle Municipal Rochelle IL USA 41.893001 -89.078290
12D Tower Municipal Tower MN USA 47.818333 -92.291667
In [8]:
# Slice rows and columns by using explicit row and column labels

dfa.loc[['11R', '12C', '12D'], ['airport', 'city', 'state', 'country']]
Out[8]:
airport city state country
iata
11R Brenham Municipal Brenham TX USA
12C Rochelle Municipal Rochelle IL USA
12D Tower Municipal Tower MN USA
In [9]:
# Slice rows and columns by using implicit row and column indices

dfa.iloc[100:103, 0:4]
Out[9]:
airport city state country
iata
11R Brenham Municipal Brenham TX USA
12C Rochelle Municipal Rochelle IL USA
12D Tower Municipal Tower MN USA

Masking

Pandas also supports selecting rows based on column values, a process known as masking. This is performed by specifying tests on columns that evaluate to True or False; only the rows for which the test is True are returned. The tests form a row mask: masked (False) rows are hidden, and unmasked (True) rows are selected. These tests must follow the rules of Boolean logic, but can combine multiple column comparisons into one final result.

For example, the first code cell below selects all airports in the state of Delaware by specifying the test dfa.state == 'DE'. This test effectively selects those rows that have DE in the state column of the dfa DataFrame. The second code cell involves a more complicated expression that selects those airports whose latitude is greater than 48 and longitude is less than -170. In this case, the two expressions are enclosed in parentheses and combined with the Boolean AND operator (&) to generate the final test result.
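The masking mechanics can be sketched on a tiny, made-up DataFrame (the values below are illustrative, not taken from the airport data):

```python
import pandas as pd

# Tiny illustrative DataFrame (made-up values)
df = pd.DataFrame({'state': ['DE', 'AK', 'AK', 'TX'],
                   'lat':   [39.2, 51.9, 63.8, 30.2]})

# A comparison on a column yields a Boolean Series (the mask)
mask = df.state == 'AK'
print(mask.tolist())        # [False, True, True, False]

# Indexing with the mask keeps only the True rows
print(df[mask])

# Multiple tests combine with & (and), | (or), ~ (not);
# each individual test must be wrapped in parentheses
print(df[(df.state == 'AK') & (df.lat > 60)])
```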


In [10]:
dfa[dfa.state =='DE']
Out[10]:
airport city state country lat long
iata
33N Delaware Airpark Dover DE USA 39.218376 -75.596427
DOV Dover Air Force Base Dover DE USA 39.130113 -75.466310
EVY Summit Airpark Middletown DE USA 39.520389 -75.720444
GED Sussex Cty Arpt Georgetown DE USA 38.689194 -75.358889
ILG New Castle County Wilmington DE USA 39.678722 -75.606528
In [11]:
# We can also select rows based on boolean tests on columns
dfa[(dfa.lat > 48) & (dfa.long < -170)]
Out[11]:
airport city state country lat long
iata
ADK Adak Adak AK USA 51.877964 -176.646031
AKA Atka Atka AK USA 52.220348 -174.206350
GAM Gambell Gambell AK USA 63.766766 -171.732824
SNP St. Paul St. Paul AK USA 57.167333 -170.220444
SVA Savoonga Savoonga AK USA 63.686394 -170.492636

Other Functionality

Pandas provides additional functions that can simplify data processing tasks of the sort often performed in spreadsheets. The following code cells demonstrate three of these functions:

  1. sample: randomly selects n rows, where n is specified as an argument to the sample function
  2. sort_index: sorts the DataFrame based on the values in the index
  3. sort_values: sorts the DataFrame by the column specified in the by parameter

Note that the sort functions return a new DataFrame; to sort a DataFrame in place, you must set the inplace parameter to True. In addition, the sort functions take an ascending parameter that specifies whether the sort should be in ascending or descending order.
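The distinction between returning a new DataFrame and sorting in place can be sketched as follows (with made-up values standing in for the airport data):

```python
import pandas as pd

# Small illustrative DataFrame with a labelled index (made-up values)
df = pd.DataFrame({'lat': [42.7, 30.7, 38.9]}, index=['01G', '01J', '00V'])

# sort_values returns a new, sorted DataFrame; df itself is unchanged
sorted_df = df.sort_values(by='lat', ascending=False)
print(sorted_df.index.tolist())  # ['01G', '00V', '01J']
print(df.index.tolist())         # ['01G', '01J', '00V'] (unchanged)

# With inplace=True the DataFrame is modified directly and None is returned
df.sort_index(inplace=True)
print(df.index.tolist())         # ['00V', '01G', '01J']
```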


In [12]:
dfa.sample(5)
Out[12]:
airport city state country lat long
iata
LGC LaGrange-Callaway Lagrange GA USA 33.008847 -85.072606
BWG Bowling Green-Warren County Bowling Green KY USA 36.964517 -86.419679
3U7 Benchmark Benchmark MT USA 47.481332 -112.869768
PCZ Waupaca Municipal Waupaca WI USA 44.333688 -89.015499
54J Defuniak Springs Defuniak Springs FL USA 30.731300 -86.151608
In [13]:
dfa.sort_index().head()
Out[13]:
airport city state country lat long
iata
00M Thigpen Bay Springs MS USA 31.953765 -89.234505
00R Livingston Municipal Livingston TX USA 30.685861 -95.017928
00V Meadow Lake Colorado Springs CO USA 38.945749 -104.569893
01G Perry-Warsaw Perry NY USA 42.741347 -78.052081
01J Hilliard Airpark Hilliard FL USA 30.688012 -81.905944
In [14]:
dfa.sort_values(by='lat', ascending=False).head()
Out[14]:
airport city state country lat long
iata
BRW Wiley Post Will Rogers Memorial Barrow AK USA 71.285448 -156.766002
AWI Wainwright Wainwright AK USA 70.638000 -159.994750
ATK Atqasuk Atqasuk AK USA 70.467276 -157.435736
AQT Nuiqsut Nuiqsut AK USA 70.209953 -151.005561
SCC Deadhorse Deadhorse AK USA 70.194756 -148.465161

Student Exercise

In the empty Code cell below, first extract all airports in the state of California. Second, apply a mask to select only those rows with a latitude between 38 and 40. Finally, compute and display the average and standard deviation of the longitude for these masked rows.


In [ ]:
 

DataFrame Computations

Some of the most important uses of a Pandas DataFrame involve grouping related data together and operating on the grouped subsets independently. For example, data may be grouped by a categorical variable, such as state or county, and sales totals accumulated for each grouped region. To demonstrate this functionality, we turn to a second data set, on restaurant tips, that is provided along with the seaborn Python module. Seaborn is presented in a separate lesson and provides support for advanced visualizations. Right now, however, we simply want to process this data easily, so we load it into the dft DataFrame and display several randomly selected rows.


In [15]:
# Load the 'tips' data set into a DataFrame

import seaborn as sns

dft = sns.load_dataset('tips')
dft.sample(5)
Out[15]:
total_bill tip sex smoker day time size
12 15.42 1.57 Male No Sun Dinner 2
169 10.63 2.00 Female Yes Sat Dinner 2
145 8.35 1.50 Female No Thur Lunch 2
140 17.47 3.50 Female No Thur Lunch 2
138 16.00 2.00 Male Yes Thur Lunch 2

To aggregate rows together, we employ the groupby method to create groups of rows that should be aggregated into a subset. The column (or columns) used to separate the rows is specified as a parameter to the groupby function, as shown in the following code cell, where the tips data set is grouped on the time column into a new DataFrameGroupBy object called dg. This new dg object can be operated on much like a normal DataFrame, with the exception that it contains subsets that are treated independently. This is shown in the next two code cells, where the head and tail functions display the first and last few rows of each group. Notice how the same number of rows is shown for each group.


In [16]:
# Group the DataFrame by the time column
dg = dft.groupby('time')
type(dg)
Out[16]:
pandas.core.groupby.groupby.DataFrameGroupBy
In [17]:
# Display first two rows from each group
dg.head(2)
Out[17]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
77 27.20 4.00 Male No Thur Lunch 4
78 22.76 3.00 Male No Thur Lunch 2
In [18]:
# Display last three rows from each group
dg.tail(3)
Out[18]:
total_bill tip sex smoker day time size
224 13.42 1.58 Male Yes Fri Lunch 2
225 16.27 2.50 Female Yes Fri Lunch 2
226 10.09 2.00 Female Yes Fri Lunch 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

Operating on Groups

The DataFrame groups can be operated on by using techniques similar to those for a normal DataFrame. For example, statistical quantities such as the median or standard deviation can be computed for each group, as shown in the next few code cells. Multiple functions can be computed at once by using the aggregate method, which takes a list of the statistical functions to apply to each group. Finally, the describe function can be applied to the groups, which generates a statistical summary for each group of data.
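A minimal, self-contained sketch of these group operations (with made-up numbers standing in for the tips data):

```python
import pandas as pd

# Small stand-in for the tips data (made-up values)
df = pd.DataFrame({'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner'],
                   'total_bill': [10.0, 14.0, 20.0, 30.0]})

dg = df.groupby('time')

# A single statistic, computed independently for each group
print(dg['total_bill'].median())

# Several statistics at once via aggregate; the result has one
# row per group and one column per statistic
result = dg['total_bill'].aggregate(['mean', 'std'])
print(result)
```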


In [19]:
# Compute median for each column in each group
dg.median()
Out[19]:
total_bill tip size
time
Lunch 15.965 2.25 2
Dinner 18.390 3.00 2
In [20]:
# Compute the standard deviation for each column in each group
dg.std()
Out[20]:
total_bill tip size
time
Lunch 7.713882 1.205345 1.040024
Dinner 9.142029 1.436243 0.910241
In [21]:
# Compute the mean and the standard deviation for each column in each group
dg.aggregate(['mean', 'std'])
Out[21]:
total_bill tip size
mean std mean std mean std
time
Lunch 17.168676 7.713882 2.728088 1.205345 2.411765 1.040024
Dinner 20.797159 9.142029 3.102670 1.436243 2.630682 0.910241
In [22]:
# Compute and display all summary statistics for all groups
dg.describe()
Out[22]:
size tip total_bill
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
time
Lunch 68.0 2.411765 1.040024 1.0 2.0 2.0 2.0 6.0 68.0 2.728088 ... 3.2875 6.7 68.0 17.168676 7.713882 7.51 12.2350 15.965 19.5325 43.11
Dinner 176.0 2.630682 0.910241 1.0 2.0 2.0 3.0 6.0 176.0 3.102670 ... 3.6875 10.0 176.0 20.797159 9.142029 3.07 14.4375 18.390 25.2825 50.81

2 rows × 24 columns

In [23]:
dg['size'].count()[0]
Out[23]:
68

Alternatively, one can determine the index of the rows that contain the minimum or maximum value on a group basis. For example, the two code cells below display the row index for the maximum (via the idxmax function) and the minimum (via the idxmin function) for each group. These functions return a DataFrame; thus, the result for an individual column in a group can be extracted by using standard DataFrame indexing techniques (as shown in the third code cell).
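A small sketch of idxmax and idxmin on grouped data (made-up values, not the tips data):

```python
import pandas as pd

# Two groups, 'a' and 'b', with a numeric column (made-up values)
df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 5.0, 2.0, 0.5]})

# Per-group row label of the maximum and minimum value
mx = df.groupby('grp')['val'].idxmax()
mn = df.groupby('grp')['val'].idxmin()
print(mx.tolist())  # [1, 2]
print(mn.tolist())  # [0, 3]
```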


In [24]:
# Compute row indices for column maximum values
dg.idxmax()
Out[24]:
total_bill tip size
time
Lunch 197.0 141.0 125
Dinner 170.0 170.0 156
In [25]:
# Compute row indices for column minimum values
dg.idxmin()
Out[25]:
total_bill tip size
time
Lunch 149.0 135.0 82
Dinner 67.0 67.0 67
In [26]:
# Select and display the row index for minimum total bill at lunch

print('Row index for minimum total bill = {0}'.format(dg.idxmin().iloc[0,0]))
Row index for minimum total bill = 149.0

Student Exercise

In the empty Code cell below, first group the dft DataFrame by the sex column. Next, compute and display the minimum, maximum, and median values for the new grouped DataFrame by using the aggregate function correctly. Finally, compute the row index for the maximum total_bill for the Female group.


In [ ]:
 

Stacking

Given two or more DataFrame objects, a common task is joining them together. When working with a relational database, this is formally a join operation, and Pandas supports joins across two DataFrame objects. But for two DataFrame objects that have the same structure, the process can be simplified by employing either vertical stacking (where rows are combined) or horizontal stacking (where columns are combined). Both operations use the Pandas concat function, which by default assumes axis=0, implying vertical stacking. Specifying axis=1 implies horizontal stacking, where the columns of each subsequent DataFrame are appended to the previous columns. Note that this operation generates a new DataFrame containing the concatenated data.
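The two stacking modes can be sketched with tiny, made-up DataFrames:

```python
import pandas as pd

# Two small DataFrames with the same columns (made-up values)
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'x': [5], 'y': [6]}, index=[2])

# axis=0 (the default): vertical stacking, rows are appended
v = pd.concat([a, b])
print(v.shape)     # (3, 2)

# axis=1: horizontal stacking, columns are placed side by side,
# aligned on the row index
c = pd.DataFrame({'z': [7, 8, 9]})
h = pd.concat([v, c], axis=1)
print(h.shape)     # (3, 3)
```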

Both of these operations are demonstrated in the following code cells. First, we split the original tips data into two DataFrame objects, tr1 and tr2, based on the implicit row index (notice how the end of the first new DataFrame aligns with the start of the second via the index). Next, the concat function is called to vertically stack (or combine) these two objects into a new DataFrame. Afterwards, the first few rows are displayed with the head method, and the complete summary statistics are shown for the new DataFrame to facilitate comparison with the original data (and with the result of the horizontal stacking example).


In [27]:
# Chop the 'tips' DataFrame into two sets based on rows
tr1 = dft.iloc[:200]
tr2 = dft.iloc[200:]
In [28]:
# End of first new DataFrame
tr1.tail(2)
Out[28]:
total_bill tip sex smoker day time size
198 13.00 2.0 Female Yes Thur Lunch 2
199 13.51 2.0 Male Yes Thur Lunch 2
In [29]:
# Start of second new DataFrame 
tr2.head(2)
Out[29]:
total_bill tip sex smoker day time size
200 18.71 4.00 Male Yes Thur Lunch 3
201 12.74 2.01 Female Yes Thur Lunch 2
In [30]:
# Vertical stacking
tc = pd.concat([tr1, tr2])
In [31]:
# Display the first few rows of the stacked data
tc.head(4)
Out[31]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
In [32]:
# Compute and display the summary statistics of the stacked data
tc.describe()
Out[32]:
total_bill tip size
count 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672
std 8.902412 1.383638 0.951100
min 3.070000 1.000000 1.000000
25% 13.347500 2.000000 2.000000
50% 17.795000 2.900000 2.000000
75% 24.127500 3.562500 3.000000
max 50.810000 10.000000 6.000000

The second example splits the tips data into two DataFrame objects, tc1 and tc2, based on the implicit column index. Next, the concat function is called to horizontally stack (or combine) these two objects into a new DataFrame (notice how the columns in the two new DataFrame objects align for the same row index). Afterwards, the same two functions are called to show that the new DataFrame is the same as the original (and as the vertically stacked) DataFrame.


In [33]:
# Chop the 'tips' DataFrame into two sets based on columns

tc1 = dft.iloc[:,:2]
tc2 = dft.iloc[:,2:]
In [34]:
# Display columns in first new DataFrame
tc1.head(4)
Out[34]:
total_bill tip
0 16.99 1.01
1 10.34 1.66
2 21.01 3.50
3 23.68 3.31
In [35]:
# Display columns in second new DataFrame
tc2.head(4)
Out[35]:
sex smoker day time size
0 Female No Sun Dinner 2
1 Male No Sun Dinner 3
2 Male No Sun Dinner 3
3 Male No Sun Dinner 2
In [36]:
# Horizontal stacking
tm = pd.concat([tc1, tc2], axis=1)
In [37]:
# Display the first few rows of the stacked data
tm.head(4)
Out[37]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
In [38]:
# Compute and display the summary statistics of the stacked data
tm.describe()
Out[38]:
total_bill tip size
count 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672
std 8.902412 1.383638 0.951100
min 3.070000 1.000000 1.000000
25% 13.347500 2.000000 2.000000
50% 17.795000 2.900000 2.000000
75% 24.127500 3.562500 3.000000
max 50.810000 10.000000 6.000000

Pivot Tables

A pivot table is a powerful technique for summarizing a large data set by grouping data together over multiple columns. A pivot table can also display summary information, such as the average or standard deviation, for the new representation. Pivot tables are a popular technique in spreadsheets, as they provide a quick, updatable summary of a potentially very large table of data.

In Pandas, a pivot table can quickly be created by using the pivot_table function. The first argument specifies the column that should be aggregated. The index argument specifies the row index, which determines how the DataFrame rows are grouped. The columns argument specifies the columns that are displayed in the pivot table, and the aggfunc argument specifies which functions should be computed and displayed for the grouped data; the default is simply the mean.
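A minimal sketch of these arguments on a made-up stand-in for the tips data:

```python
import pandas as pd

# Small stand-in for the tips data (made-up values)
df = pd.DataFrame({'sex': ['Male', 'Male', 'Female', 'Female'],
                   'day': ['Sat', 'Sun', 'Sat', 'Sun'],
                   'total_bill': [20.0, 30.0, 10.0, 40.0]})

# Mean total_bill (the default aggfunc), grouped by sex (rows) and day (columns)
pt = df.pivot_table('total_bill', index='sex', columns='day')
print(pt)

# margins=True appends an 'All' row and column with overall aggregates
pm = df.pivot_table('total_bill', index='sex', columns='day', margins=True)
print(pm.loc['All', 'All'])   # 25.0
```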

The rest of this notebook provides several demonstrations of the Pandas pivot table functionality. First, a pivot table is created on the tips data set by grouping the data by sex and computing the mean total_bill for each day in the DataFrame. The next pivot table flips this by grouping on day and computing the mean total_bill for each sex in the DataFrame. Next, the same pivot table is created but is now displayed with the aggregate function computed for all data in the appropriate row or column (via the margins argument). The next two pivot tables create hierarchical vertical (or horizontal) indices by specifying a list for the index (or columns) argument, respectively. Finally, a pivot table with multiple summary statistics is displayed by specifying the min, max, and median functions via the aggfunc argument during the creation of the pivot table.


In [39]:
# Create a Pivot Table using the sex column as the vertical index 
# and the day column as the horizontal index

dft.pivot_table('total_bill', index='sex', columns='day')
Out[39]:
day Thur Fri Sat Sun
sex
Male 18.714667 19.857000 20.802542 21.887241
Female 16.715312 14.145556 19.680357 19.872222
In [40]:
# Create a Pivot Table using the day column as the vertical index 
# and the sex column as the horizontal index

dft.pivot_table('total_bill', index='day', columns='sex')
Out[40]:
sex Male Female
day
Thur 18.714667 16.715312
Fri 19.857000 14.145556
Sat 20.802542 19.680357
Sun 21.887241 19.872222
In [41]:
# Create a Pivot Table using the day column as the vertical index 
# and the sex column as the horizontal index. This time include margin
# data for each row and column.

dft.pivot_table('total_bill', index='day', columns='sex', margins=True)
Out[41]:
sex Male Female All
day
Thur 18.714667 16.715312 17.682742
Fri 19.857000 14.145556 17.151579
Sat 20.802542 19.680357 20.441379
Sun 21.887241 19.872222 21.410000
All 20.744076 18.056897 19.785943
In [42]:
# Create a Pivot Table using the day and smoker columns as the hierarchical 
# vertical index and the sex column as the horizontal index

dft.pivot_table('total_bill', index=['day', 'smoker'], columns='sex')
Out[42]:
sex Male Female
day smoker
Thur Yes 19.171000 19.218571
No 18.486500 16.014400
Fri Yes 20.452500 12.654286
No 17.475000 19.365000
Sat Yes 21.837778 20.266667
No 19.929063 19.003846
Sun Yes 26.141333 16.540000
No 20.403256 20.824286
In [43]:
# Create a Pivot Table using the day column as the vertical index 
# and the sex and smoker columns as the hierarchical horizontal index

dft.pivot_table('total_bill', index='day', columns=['sex', 'smoker'])
Out[43]:
sex Male Female
smoker Yes No Yes No
day
Thur 19.171000 18.486500 19.218571 16.014400
Fri 20.452500 17.475000 12.654286 19.365000
Sat 21.837778 19.929063 20.266667 19.003846
Sun 26.141333 20.403256 16.540000 20.824286
In [44]:
# Create a Pivot Table using the day column as the vertical index 
# and the sex column as the horizontal index. Display aggregate max, 
# median, and min for the total_bill column.

dft.pivot_table(index='day', columns='sex', aggfunc={'total_bill': ['max', 'median', 'min']})
Out[44]:
total_bill
max median min
sex Male Female Male Female Male Female
day
Thur 41.19 43.11 16.975 13.785 7.51 8.35
Fri 40.17 22.75 17.215 15.380 8.58 5.75
Sat 50.81 44.30 18.240 18.360 7.74 3.07
Sun 48.17 35.26 20.725 17.410 7.25 9.60

Student Exercise

In the empty Code cell below, first split the tips DataFrame into three new DataFrame objects with roughly equal numbers of rows. Second, vertically stack only two of these three DataFrame objects into a new DataFrame. Third, create a pivot table on this new DataFrame using the time column as the horizontal index and the sex column as the vertical index. Compute and display the mean and standard deviation for the tip column in this pivot table.


In [ ]:
 

Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

  1. Pandas documentation
  2. A complete Pandas tutorial
  3. The Pandas chapter from the book Python Data Science Handbook by Jake VanderPlas

© 2017: Robert J. Brunner at the University of Illinois.

This notebook is released under the Creative Commons license CC BY-NC-SA 4.0. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.