A previous lesson introduced the Pandas DataFrame
, which is a powerful data structure that mimics a traditional spreadsheet. Data can easily be read from a CSV file, a fixed width format text file, a JSON file, an HTML file (i.e., a webpage), and a relational database into a Pandas DataFrame
. In this notebook, this capability is demonstrated by first reading a file from an Internet website. This new DataFrame
is subsequently used to demonstrate basic functionality, including descriptive analysis, basic indexing, and fancy indexing.
When reading the airport data from this website, we specify the comma as the delimiter and explicitly indicate that the index column is 'iata', which is the airport code. Using a data-specific column as a row-index can often simplify data processing. The following two code cells display the first and last few rows, respectively, by selecting several columns by using column labels.
import pandas as pd
# Read data from CSV file, and display subset
dfa = pd.read_csv('http://stat-computing.org/dataexpo/2009/airports.csv', delimiter=',', index_col='iata')
# We can grab the first five rows, and only extract three columns
dfa[['airport', 'city', 'state']].head(5)
# Extract the last four rows for three different columns
dfa[['airport', 'lat', 'long']].tail(4)
This data set contains mixed data, both numeric and text, and thus requires different column storage. The first code cell below indicates the type of storage for all six columns in this DataFrame
, which shows the numeric columns lat
and long
, while the other four columns are text data, represented by data type object
. This also impacts the default behavior of specific Pandas functions, such as the describe
function, since summary statistics can only be calculated for numeric data.
# Display the data types of each column
dfa.dtypes
# Display a summary of the numerical information in the DataFrame
dfa.describe()
Since this new DataFrame
was created with a labelled row index, we can use row labels to slice rows from the DataFrame
. The following code cells demonstrate basic slicing and indexing of this DataFrame
by using both explicit indices (row and column labels) and implicit indices (row and column index values).
# Slice rows by using the indicated label from the index column
dfa.loc[['11J', '11R', '12C']]
# Slice rows by using the row implicit index
dfa[99:103]
# Slice rows and columns by using explicit row and column labels
dfa.loc[['11R', '12C', '12D'], ['airport', 'city', 'state', 'country']]
# Slice rows and columns by using implicit row and column indices
dfa.iloc[100: 103, 0:4]
Pandas also support selecting rows based on column values, which is known as masking. This is performed by specifying tests on columns that result in True
or False
, and only the True
results are returned. Thus, a row mask is formed, and masked rows are hidden and unmasked rows are selected. These tests must follow the rules of Boolean logic, but can involve multiple column comparisons that are combined into one final result.
For example, the first code cell below selects all airports in the state of Delaware by specifying the test dfa.state == 'DE'
. This test effectively selects those rows that have DE
in the state
column of the dfa
DataFrame
. The second code cell involves a more complicated expression that selects those airports whose latitude is greater than 48
and longitude is less than -170
. In this case, the two expressions are enclosed in parenthesis and combined with a Boolean and to generate the final test result.
dfa[dfa.state =='DE']
# We can also select rows based on boolean tests on columns
dfa[(dfa.lat > 48) & (dfa.long < -170)]
Pandas provides additional functions that can simplify data processing tasks, which often are used in spreadsheets. Of these other functions, the following code cells demonstrate three specific functions:
sample
: randomly selects n
rows, where n
is specified as an argument to the sample
functionsort_index
: sorts the DataFrame
based on the values in the indexsort_values
: sorts the DataFrame
by the column specified in the by
attributeNote that the sort functions return a new DataFrame
; to sort a DataFrame
in place you must set the inplace
attribute to True
. In addition, the sort functions take an ascending
parameter that specifies if the sort should be in ascending or descending order.
dfa.sample(5)
dfa.sort_index().head()
dfa.sort_values(by='lat', ascending=False).head()
Student Exercise
In the empty Code cell below, first extract all airports in the state of California. Second, apply a mask to select only those rows with a latitude between 38
and 40
. Finally, compute and display the average and standard deviation of the longitude for these masked rows.
DataFrame
Computations¶Some of the most important uses of a Pandas DataFrame
involve grouping related data together and operating on the grouped subsets independently. For example, data may be grouped by a categorical variable, such as state or county, and sales totals accumulated by the grouped region. To demonstrate this functionality, we turn to a second data set on restaurant data that is provided along with the seaborn Python module. The seaborn is presented in a separate lesson and provides support for advanced visualizations. Right now, however, we simply want to easily process this data, so we load the data into the dft
DataFrame
and display several randomly selected rows.
# Load the 'tips' data set into a DataFrame
import seaborn as sns
dft = sns.load_dataset('tips')
dft.sample(5)
To aggregate rows together, we employ the groupby
method to create groups of rows that should be aggregated into a subset. The column (or columns) used to separate the rows are specified as a parameter to the groupby
functions, as shown in the following code cell where the tips data set is grouped on the time
column into a new DataFrameGroupBy
object called dg
. This new dg
object can be operated on as a normal DataFrame
with the exception that it contains subsets that are treated independently. This is shown in the next two code blocks where the head
and tail
functions are used to show the first and last few rows of the group data set. Notice how the same number of rows are shown for each grouped data set.
# Group the DataFrame by the time column
dg = dft.groupby('time')
type(dg)
# Display first two rows from each group
dg.head(2)
# Display last three rows from each group
dg.tail(3)
The DataFrame
groups can be operated by using similar techniques to the normal DataFrame
. For example, statistical quantities such as the median or standard deviation can be computed for each group, as shown in the next few code cells. Multiple functions can be computed at once by using the aggregate
method, which takes a list of the statistical functions to apply to each group. Finally, the describe
function can be applied to the groups, which generates a statistical summary for each group of data.
# Compute median for each column in each group
dg.median()
# Compute the standard deviation for each column in each group
dg.std()
# Compute the mean and the standard deviation for each column in each group
dg.aggregate(['mean', 'std'])
# Compute and display all summary statistics for all groups
dg.describe()
dg['size'].count()[0]
Alternatively, one can determine the index of the rows that contain the minimum or maximum value on a group basis. For example, the two code cells below display the row index for the maximum (via the idxmax
function) and the minimum (via the idxmin
function) for each group. These functions return a DataFrame
, thus the results can be extracted for an individual column in a group by using the standard DataFrame
indexing techniques (as shown in the third code cell).
# Compute row indices for column maximum values
dg.idxmax()
# Compute row indices for column minimum values
dg.idxmin()
# Select and display the row index for maximum total bill at lunch
print('Row index for minimum total bill = {0}'.format(dg.idxmin().iloc[0,0]))
Student Exercise
In the empty Code cell below, first group the dft
DataFrame
by the sex
column. Next, compute and display the minimum, maximum, and median values for the new grouped DataFrame
by using the aggregate
function correctly. Finally, compute the row index for the maximum total_bill
for the Female
group.
Given two or more DataFrame
objects, a common task is joining them together. When working with a relational database, this is formally a join operation, and Pandas supports joins across two DataFrame
objects. But for two DataFrame
objects that have the same structure, the process can be simplified by employing either horizontal stacking (where columns are combined) or vertical stacking (where rows are combined). These operations both use the Pandas concat
function, which by default assumes axis=0
, which implies vertical stacking. Specifying axis=1
implies horizontal stacking, where columns from each subsequent DataFrame
are added to the previous columns. Note that this operation generates a new DataFrame
with the concatenated data.
Both of these operations are demonstrated in the following code cells. First, we split the original tips data into two DataFrame
objects: t1
and t2
, based on the implicit row index (notice how the end of the first new DataFrame
aligns with the start of the second new DataFrame
via the index). Next, the concat
method is called to vertically stack (or combine) these two objects into a new DataFrame
. Afterwards, the first few rows are displayed with the head
method, and the complete summary statistics are shown for the new DataFrame
to facilitate comparison with the original data (and the result of the horizontal stacking example).
# Chop the 'tips' DataFrame into two sets based on rows
tr1 = dft.iloc[:200]
tr2 = dft.iloc[200:]
# End of first new DataFrame
tr1.tail(2)
# Start of second new DataFrame
tr2.head(2)
# Vertical stacking
tc = pd.concat([tr1, tr2])
# Display the first few rows of the stacked data
tc.head(4)
# Compute and display the summary statistics of the stacked data
tc.describe()
The second example splits the tips data into two DataFrame
objects based on the implicit column index. Next the concat
method is called to horizontally stack (or combine) these two objects into a new DataFrame
(notice how the columns in the two new DataFrame
objects align for the same row index). Afterwards the same two functions are called to show the new DataFrame
is the same as the original (and vertically stacked) DataFrame
.
# Chop the 'tips' DataFrame into two sets based on columns
tc1 = dft.iloc[:,:2]
tc2 = dft.iloc[:,2:]
# Display columns in first new DataFrame
tc1.head(4)
# Display columns in first new DataFrame
tc2.head(4)
# Horizontal stacking
tm = pd.concat([tc1, tc2], axis=1)
# Display the first few rows of the stacked data
tm.head(4)
# Compute and display the summary statistics of the stacked data
tm.describe()
A pivot table is a powerful technique for summarizing a large data set by grouping data together over multiple columns. A pivot table can also display summary information for the new representation such as average or standard deviation. Pivot tables are popular techniques when using spreadsheets, as they provide a quick and updatable summary of a potentially very large table of data.
In Pandas, a pivot table can quickly be created by using the pivot_table
function. This function takes one argument that specifies the column that should be aggregated. The index
argument is used to specify the row indices, which specifies how the DataFrame
rows are grouped. The columns
argument specifies the columns that are displayed in the pivot table, and the aggfunc
argument specifies which functions should be computed and displayed for the group data; default value is simply the average.
The rest of this notebook provides several demonstrations of the Pandas pivot table functionality. First, a pivot table is created on the tips data set by first grouping the data by sex
, and computing the mean total_bill
for each day
in the DataFrame
. The next pivot table flips this by first grouping on day
and computing the mean total_bill
for each sex
in the DataFrame
. Next, the same pivot table is created but is now displayed with the aggregate function computed for all data in the appropriate row or column (via the margins
argument). The next two pivot tables create hierarchical horizontal (or vertical) indices by specifying a list for the index
(or columns
) attribute, respectively. Finally, a pivot table with multiple summary statistics is displayed by specifying the min
, max
, and median
functions via the aggfunc
attribute during the creation of the pivot table.
# Create a Pivot Table using the sex column as the vertical index
# and the day column as the horizontal index
dft.pivot_table('total_bill', index='sex', columns='day')
# Create a Pivot Table using the day column as the vertical index
# and the sex column as the horizontal index
dft.pivot_table('total_bill', index='day', columns='sex')
# Create a Pivot Table using the sex column as the vertical index
# and the day column as the horizontal index. This time include margin
# data for each row and column.
dft.pivot_table('total_bill', index='day', columns='sex', margins=all)
# Create a Pivot Table using the sex column as the vertical index
# and the day and smoker columns as the hierarchical horizontal index
dft.pivot_table('total_bill', index=['day', 'smoker'], columns='sex')
# Create a Pivot Table using the day column as the horizontal index
# and the sex and smoker columns as the hierarchical vertical index
dft.pivot_table('total_bill', index='day', columns=['sex', 'smoker'])
# Create a Pivot Table using the sex column as the horizontal index
# and the day column as the horizontal index. Display aggregate max,
# median, and min for the total_bill column.
dft.pivot_table(index='day', columns='sex', aggfunc={'total_bill': ['max', 'median', 'min']})
Student Exercise
In the empty Code cell below, first split the tips DataFrame
into three new DataFrame
objects with roughly equal numbers of rows. Second, vertically stack only two of these three DataFrame
objects into a new DataFrame
. Third, create a pivot table on this new DataFrame
using the time
column as the horizontal index and the sex
column as the vertical index. Compute and display the mean and standard deviation for the tips
column in this pivot table.
The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.
© 2017: Robert J. Brunner at the University of Illinois.
This notebook is released under the Creative Commons license CC BY-NC-SA 4.0. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.