Advanced NumPy


As described in detail in a previous lesson, NumPy is built around a new, n-dimensional array (ndarray) data structure that provides fast support for numerical computations. In this notebook, we introduce multi-dimensional NumPy arrays, and focus on extending concepts introduced to deal with one-dimensional arrays to higher dimensional arrays. Note that for simplicity we will rarely discuss anything other than two- or three-dimensional arrays.

Creating Multi-Dimensional Arrays

Higher dimensional arrays can be created in same way that a one-dimensional array is created (including all of the previously listed convenience functions, where a one-dimensional array with the correct number of elements is created, and subsequently reshaped into the correct dimensionality. For example, you can create a NumPy array with one hundred elements and reshape this new array into a ten by ten matrix.


In [1]:
import numpy as np

# Make a one hundred element one-dimensional array
data = np.arange(100)
print(data)
print()

# Reshape to a 10 x 10 array
mat = data.reshape(10, 10)
print(mat)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]

Special convenience functions also exist to create special multidimensional arrays. For example, you can create an identity matrix, where the diagonal elements are all one and all other elements are zero by using the eye method (since the Identity matrix is often denoted by the capital letter 'I'). You can also specify the diagonal (or certain off-diagonal) elements by using the diag function, where the input array is assigned to a set of diagonal elements in the new array. If the k parameter to the np.diag method is zero, the diagonal elements will be initialized. If the k parameter is a positive (negative) integer, the diagonal corresponding to the integer value of k is initialized. The size of the resulting array will be the minimum possible size to allow the input array to be properly initialized.


In [2]:
# Create special two-dimensional arrays

print("Matrix will be 4 x 4.\n", np.eye(4))
print("\nMatrix will be 4 x 4.\n", np.diag(np.arange(4), 0))
print("\nMatrix will be 5 x 5.\n", np.diag(np.arange(4), 1))
print("\nMatrix will be 5 x 5.\n", np.diag(np.arange(4), -1))
Matrix will be 4 x 4.
 [[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

Matrix will be 4 x 4.
 [[0 0 0 0]
 [0 1 0 0]
 [0 0 2 0]
 [0 0 0 3]]

Matrix will be 5 x 5.
 [[0 0 0 0 0]
 [0 0 1 0 0]
 [0 0 0 2 0]
 [0 0 0 0 3]
 [0 0 0 0 0]]

Matrix will be 5 x 5.
 [[0 0 0 0 0]
 [0 0 0 0 0]
 [0 1 0 0 0]
 [0 0 2 0 0]
 [0 0 0 3 0]]

Slicing Multidimensional Arrays

Multi-dimensional arrays can be sliced (or indexed); the only trick is to remember the proper ordering for the elements. Each dimension is differentiated by a comma in the slicing operation, so a two-dimensional array is sliced with [start1:end1, start2:end2], while a three-dimensional array is sliced with [start1:end1, start2:end2. start3:end3], continuing on with higher dimensions. If only one dimension is specified, it will default to the first dimension. These concepts are demonstrated in the following two code cells, first for a two-dimensional array, followed by a three-dimensional array.

Note that each of these slicing operations (i.e., start:end) can also include an optional stride value as well.


In [3]:
# Two-dimensional slicing
b = np.arange(9).reshape((3,3))

print("3 x 3 array = \n", b)

print("\nSlice in first dimension (row 1): ", b[0])
print("\nSlice in first dimension (row 3): ", b[2])

print("\nSlice in second dimension (col 1): ", b[:,0])
print("\nSlice in second dimension (col 3): ", b[:,2])

print("\nSlice in first and second dimension: ", b[0:1, 1:2])


print("\nDirect Element access: ", b[0,1])
3 x 3 array = 
 [[0 1 2]
 [3 4 5]
 [6 7 8]]

Slice in first dimension (row 1):  [0 1 2]

Slice in first dimension (row 3):  [6 7 8]

Slice in second dimension (col 1):  [0 3 6]

Slice in second dimension (col 3):  [2 5 8]

Slice in first and second dimension:  [[1]]

Direct Element access:  1
In [4]:
# Three-dimensional slicing
c = np.arange(27).reshape((3,3, 3))

print("3 x 3 x 3 array = \n", c)
print("\nSlice in first dimension (first x axis slice):\n", c[0])

print("\nSlice in first and second dimension: ", c[0, 1])

print("\nSlice in first dimension (third x axis slice):\n", c[2])

print("\nSlice in second dimension (first y axis slice):\n", c[:,0])
print("\nSlice in second dimension (third y axis slice):\n", c[:,2])

print("\nSlice in first and second dimension: ", c[0:1, 1:2])

print("\nSlice in first and second dimension:\n", c[0,1])
print("\nSlice in first and third dimension: ", c[0,:,1])
print("\nSlice in first, second, and third dimension: ", c[0:1,1:2,2:])

print("\nDirect element access: ", c[0,1, 2])
3 x 3 x 3 array = 
 [[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]

Slice in first dimension (first x axis slice):
 [[0 1 2]
 [3 4 5]
 [6 7 8]]

Slice in first and second dimension:  [3 4 5]

Slice in first dimension (third x axis slice):
 [[18 19 20]
 [21 22 23]
 [24 25 26]]

Slice in second dimension (first y axis slice):
 [[ 0  1  2]
 [ 9 10 11]
 [18 19 20]]

Slice in second dimension (third y axis slice):
 [[ 6  7  8]
 [15 16 17]
 [24 25 26]]

Slice in first and second dimension:  [[[3 4 5]]]

Slice in first and second dimension:
 [3 4 5]

Slice in first and third dimension:  [1 4 7]

Slice in first, second, and third dimension:  [[[5]]]

Direct element access:  5

Student Exercise

In the empty Code cell below, write and execute code to make a four by four matrix that contains the integers 0 through 15. Slice and display the third row, the third column, and the element at the position (4, 4).


In [ ]:
 

Special Indexing

NumPy also provides several other special indexing techniques. The first such technique is the use of an index array, where you use an array to specify the elements to be selected. The second technique is a Boolean mask array. In this case, the Boolean array is the same size as the primary NumPy array, and if the element in the mask array is True, the corresponding element in the primary array is selected, and vice-versa for a False mask array element. These two special indexing techniques are demonstrated in the following two code cells.


In [5]:
# Demonstration of a multi-dimensional index array

# Two-dimensional array
c = np.arange(10).reshape((2, 5))

print("\nStarting array:\n", c)
print("\nIndex Array access: \n", c[np.array([0, 1]) , np.array([3, 4])])
Starting array:
 [[0 1 2 3 4]
 [5 6 7 8 9]]

Index Array access: 
 [3 9]
In [6]:
# Demonstrate Boolean mask access

# Two-dimensional example.

print("\n--------------------")
c = np.arange(25).reshape((5, 5))
print("\n Starting Array: \n", c)

# Build a mask that is True for all even elements with value greater than four
mask1 = (c > 4)
mask2 = (c % 2 == 0)

print("\nMask 1:\n", mask1)
print("\nMask 2:\n", mask2)

# We use the logical_and ufunc here, but it is described later
mask = np.logical_and(mask1, mask2)

print("\nMask :\n", mask)

print("\nMasked Array :\n", c[mask])

c[mask] *= -2

print("\nNew Array :\n", c)
--------------------

 Starting Array: 
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

Mask 1:
 [[False False False False False]
 [ True  True  True  True  True]
 [ True  True  True  True  True]
 [ True  True  True  True  True]
 [ True  True  True  True  True]]

Mask 2:
 [[ True False  True False  True]
 [False  True False  True False]
 [ True False  True False  True]
 [False  True False  True False]
 [ True False  True False  True]]

Mask :
 [[False False False False False]
 [False  True False  True False]
 [ True False  True False  True]
 [False  True False  True False]
 [ True False  True False  True]]

Masked Array :
 [ 6  8 10 12 14 16 18 20 22 24]

New Array :
 [[  0   1   2   3   4]
 [  5 -12   7 -16   9]
 [-20  11 -24  13 -28]
 [ 15 -32  17 -36  19]
 [-40  21 -44  23 -48]]

Basic Operations

NumPy arrays naturally support basic mathematical operations, including addition, subtraction, multiplication, division, integer division, and remainder, allowing you to easily combine a scalar (a single number) with a vector (a one-dimensional array), or a matrix (a multi-dimensional array). In the next Code cell, we create a two-dimensional array and use that array to demonstrate how to operate on a scalar and a matrix.


In [7]:
# Create a two-dimensional array

b = np.arange(9).reshape((3,3))

print("Matrix = \n", b)

print("\nMatrix + 10.5 =\n", (b + 10.5))

print("\nMatrix * 0.25 =\n", (b * 0.25))

print("\nMatrix % 2 =\n", (b % 2))

print("\nMatrix / 3.0 =\n", ((b - 4.0) / 3.))
Matrix = 
 [[0 1 2]
 [3 4 5]
 [6 7 8]]

Matrix + 10.5 =
 [[10.5 11.5 12.5]
 [13.5 14.5 15.5]
 [16.5 17.5 18.5]]

Matrix * 0.25 =
 [[0.   0.25 0.5 ]
 [0.75 1.   1.25]
 [1.5  1.75 2.  ]]

Matrix % 2 =
 [[0 1 0]
 [1 0 1]
 [0 1 0]]

Matrix / 3.0 =
 [[-1.33333333 -1.         -0.66666667]
 [-0.33333333  0.          0.33333333]
 [ 0.66666667  1.          1.33333333]]

We also can combine arrays as long as they have the same dimensionality. In the next Code cell, we create a one-dimensional and a two-dimensional array and demonstrate how these two arrays can be combined.


In [8]:
# Create two arrays
a = np.arange(1, 10)
b = (10. - a).reshape((3, 3))

print(f"Array = \n {a}")
print(f"\nMatrix = \n {b}")

print("\nArray[0:3] + Matrix Row 1 = ",a[:3] + b[0,:,])

print("\nArray[0:3] + Matrix[:0] = ", a[:3] + b[:,0])

print("\nArray[3:6] + Matrix[0:] = ", a[3:6] + b[0, :])

# Now combine scalar operations

print("\n3.0 * Array[3:6] + (10.5 + Matrix[0:]) = ", 3.0 * a[3:6] + (10.5 + b[0, :]))
Array = 
 [1 2 3 4 5 6 7 8 9]

Matrix = 
 [[9. 8. 7.]
 [6. 5. 4.]
 [3. 2. 1.]]

Array[0:3] + Matrix Row 1 =  [10. 10. 10.]

Array[0:3] + Matrix[:0] =  [10.  8.  6.]

Array[3:6] + Matrix[0:] =  [13. 13. 13.]

3.0 * Array[3:6] + (10.5 + Matrix[0:]) =  [31.5 33.5 35.5]

Summary Functions

NumPy provides convenience functions that can quickly summarize the values of an array, or the values along a specific axis of a multi-dimensional array. These functions include basic statistical measures (mean, median, var, std, min, and max), the total sum or product of all elements in the array (sum, prod), as well as running sums or products for all elements in the array (cumsum, cumprod). The last two functions actually produce arrays that are of the same size as the input array, where each element is replaced by the respective running sum/product up to and including the current element. Another function, trace, calculates the trace of an array, which simply sums up the diagonal elements in the multi-dimensional array.


In [9]:
# Demonstrate data processing convenience functions

# Make an array = [1, 2, 3, 4, 5]
a = np.arange(1, 10).reshape(3,3)
print(f'Array = \n{a}\n')

print("Mean value (all) = {}".format(np.mean(a)))
print("Mean value (columns) = {}".format(np.mean(a, axis = 0)))
print("Mean value (rows) = {}\n".format(np.mean(a, axis = 1)))

print("Std. Deviation (columns) = {}".format(np.std(a, axis = 0)))
print("Std. Deviation (rows) = {}\n".format(np.std(a, axis = 1)))


print("Minimum value (columns) = {}".format(np.min(a, axis = 0)))
print("Maximum value (columns) = {}\n".format(np.max(a, axis = 0)))

# Now compute trace of 5 x 5 diagonal matrix (= 5)
print('Trace of 5 x 5 identity matrix = ', np.trace(np.eye(5)))
Array = 
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Mean value (all) = 5.0
Mean value (columns) = [4. 5. 6.]
Mean value (rows) = [2. 5. 8.]

Std. Deviation (columns) = [2.44948974 2.44948974 2.44948974]
Std. Deviation (rows) = [0.81649658 0.81649658 0.81649658]

Minimum value (columns) = [1 2 3]
Maximum value (columns) = [7 8 9]

Trace of 5 x 5 identity matrix =  5.0

Universal Functions

NumPy also includes methods that are universal functions, or ufuncs that are vectorized and thus operate on each element in the array, without the need for a loop. This topic was discussed previously in considerable detail in the Introduction to NumPy lesson; here we simply demonstrate the basic ideas with multi-dimensional arrays.

You have already seen examples of some of these functions at the start of this IPython Notebook when we compared the speed and simplicity of NumPy versus normal Python for numerical operations. These functions almost all include an optional out parameter that allows a pre-defined NumPy array to be used to hold the results of the calculation, which can often speed up the processing by eliminating the need for the creation and destruction of temporary arrays. These functions will all still return the final array, even if the out parameter is used.

For complete details, look at the official NumPy ufunc reference guide for more information on any of these functions, for example, the isnan function, since the user guide has a full breakdown of each function and sample code demonstrating how to use the function.

In the following Code cells, we demonstrate several of these ufuncs on multi-dimensional arrays.


In [10]:
# Make a 3 x 3 two-dimensional array
b = np.arange(1, 10).reshape(3, 3)

print('original array:\n', b)

# Peeform ufunc math

c = np.sin(b)

print('\nnp.sin : \n', c)
print('\nnp.log and np.abs : \n', np.log10(np.abs(c)))
print('\nnp.mod : \n', np.mod(b, 2))
print('\nnp.logical_and : \n', np.logical_and(np.mod(b, 2), True))
original array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

np.sin : 
 [[ 0.84147098  0.90929743  0.14112001]
 [-0.7568025  -0.95892427 -0.2794155 ]
 [ 0.6569866   0.98935825  0.41211849]]

np.log and np.abs : 
 [[-0.07496085 -0.04129404 -0.85041141]
 [-0.12101744 -0.01821569 -0.55374951]
 [-0.18244349 -0.00464642 -0.38497791]]

np.mod : 
 [[1 0 1]
 [0 1 0]
 [1 0 1]]

np.logical_and : 
 [[ True False  True]
 [False  True False]
 [ True False  True]]
In [11]:
# Demonstrate Boolean tests with operators

d = np.arange(9).reshape(3, 3)

print("Greater Than or Equal Test: \n", d >= 5)

# Now combine to form Boolean Matrix

np.logical_and(d > 3, d % 2)
Greater Than or Equal Test: 
 [[False False False]
 [False False  True]
 [ True  True  True]]
Out[11]:
array([[False, False, False],
       [False, False,  True],
       [False,  True, False]])

Student Exercise

In the empty Code cell below, write and execute code to make a four by four matrix that contains the integers 0 through 15. Add five to the first row, ten to the second row, fifteen to the third row, twenty to the fourth row, and display the resulting matrix. Next apply the cos function to the odd rows, and the sin function to the even rows (recall that Python indexing starts at zero), and display the resulting matrix. Finally, compute and display the cumulative sum for each row, and the cumulative product for each column.


In [ ]:
 

Masked Arrays

As discussed in a previous lesson, NumPy provides support for masked arrays, where certain elements can be masked based on some criterion and ignored during subsequent calculations (i.e., these elements are masked). Masked arrays are in the numpy.ma package, and simply require a masked array to be created from a given array and a condition that indicates which elements should be masked. This new masked array can be used as a normal NumPy (multi-dimensional) array, except the masked elements are ignored. NumPy provides operations for masked arrays, allowing them to be used in a similar manner as normal NumPy arrays.


Correlation Measurements

In the Scatter Plots notebook, you learned to visually interpret relationships between data. Given two NumPy arrays, we can calculate analytic measures of the correlation between the two arrays to quantify any relationship. The two main techniques for computing a correlation are the Pearson and the Spearman correlation coefficients.

Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear relationship between two arrays. The computed value ranges between -1 and +1, with the former implying an exact negative linear relationship, and the latter an exact positive linear relationship. A value of zero indicates no correlation. The following figure, from Wikipedia, provides a visual illustration of two dimensional distributions of pints for various values of $r$.

Wikipedia PCC Figure

In the spirit of this figure, there are several online games where you can view a two-dimensional distribution of points and guess the correlation. The first is an old-style arcade game, Guess the Correlation, while the second is a teaching game called Correlation Guessing Game. These games provide an excellent chance to improve your understanding of the relationship between an analytic correlation measure (introduced herein) and the visual appearance of a distribution of data (as covered in the Python Two-Dimensional Plotting notebook). This skill will be valuable as you learn to perform linear regression and more generally when working with multi-dimensional data in the future.

We will use the pearsonr function provided by the SciPy module, which computes and returns the $r$ value we are after as well as a two-tailed p-value, which we can safely ignore for now (this will be discussed in a subsequent course).

Spearman Correlation Coefficient

The Spearman correlation coefficient measures whether or not two arrays have a monotonic relationship, which basically means that as $x$ increases, $y$ continues to move in the same manner without changing direction (thus if $y$ is increasing, it never starts decreasing, and vice-versa). The computed value ranges between -1 and +1, with the former implying an exact negative monotonic relationship, and the latter an exact positive monotonic relationship. A value of zero indicates no correlation. We will use the spearmanr function provided by the SciPy module, which computes and returns the $rho$ value we are after as well as a two-tailed p-value for a hypothesis test, which we can safely ignore for now (this will be discussed in a subsequent course).

Covariance

In some cases, we do not want a single value indicating the relationship between two sets of data, and instead we want to know how each variable in an array (or matrix) varies in relationship to the other variables. This computation is known as the covariance, which produces a matrix whose diagonal elements represent the variance between the same variables and the off-diagonal elements are the covariances for a given pair of different variables. This is encoded in the following formula:

\begin{equation} \textrm{Cov(}x_i\textrm{, }y_i\textrm{)} = (x_i - \mu_x) (y_j - \mu_y) \end{equation}

where $\mu_x$ and $\mu_y$ are the mean values for the $x$-array and $y$-array, respectively.

Given two equal-sized arrays, we can compute the covariance between them by using the NumPy cov function. To aid in the interpretation of the covariance, this matrix can be normalized by the variances to create a new matrix called the correlation matrix. The entries in this new matrix are Pearson product-moment correlation coefficients. We can calculate this correlation matrix, whose values are constrained to lie between -1 (perfect negative correlation) and +1 (perfect positive correlation) by using the NumPy corrcoef function.

The following code cell demonstrates these functions on correlated and random data, respectively.


In [12]:
import scipy.stats as st

# Create two correlated arrays
x = np.arange(10)
y = 9 - x

# Compute auto-correlations
pr1 = st.pearsonr(x, x)
sr1 = st.spearmanr(x, x)

# Compute cross-correlations
pr2 = st.pearsonr(x, y)
sr2 = st.spearmanr(x, y)

print('Pearsonr correlation for x,x = {0:6.4f}'.format(pr1[0]))
print('Spearmanr correlation for x,x = {0:6.4f}'.format(sr1[0]))
print('Pearsonr correlation for x,y = {0:6.4f}'.format(pr2[0]))
print('Spearmanr correlation for x,y = {0:6.4f}\n'.format(sr2[0]))

print('Covariance Matrix = \n', np.cov(x, y))
print()

print('Correlation Matrix = \n', np.corrcoef(x, y))
print()

# Generate random arrays
x = np.random.uniform(0, 1, 10)
y = np.random.uniform(0, 1, 10)

# Compute cross-correlations, should be close to null
prr = st.pearsonr(x, y)
srr = st.spearmanr(x, y)

print('Pearsonr correlation for random x,y = {0:6.4f}'.format(prr[0]))
print('Spearmanr correlation for random x,y = {0:6.4f}\n'.format(srr[0]))
Pearsonr correlation for x,x = 1.0000
Spearmanr correlation for x,x = 1.0000
Pearsonr correlation for x,y = -1.0000
Spearmanr correlation for x,y = -1.0000

Covariance Matrix = 
 [[ 9.16666667 -9.16666667]
 [-9.16666667  9.16666667]]

Correlation Matrix = 
 [[ 1. -1.]
 [-1.  1.]]

Pearsonr correlation for random x,y = 0.0368
Spearmanr correlation for random x,y = 0.1152


Student Exercise

In the empty Code cell below, write and execute code to compute the Pearson and Spearman correlations for the columns in the Iris data set. You can load this data from the Seaborn module, and extract the columns as NumPy matrices to simplify the calculation. Do these values make sense in light of the scatter plots you generated previously?


In [ ]:
 

Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

  1. NumPy tutorial
  2. NumPy cheatsheet
  3. NumPy lecture notes
  4. NumPy notebook demo
  5. The NumPy chapter from the book Python Data Science Handbook by Jake VanderPlas

© 2017: Robert J. Brunner at the University of Illinois.

This notebook is released under the Creative Commons license CC BY-NC-SA 4.0. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.