As we discussed previously, the Python programming language provides a rich set of data structures such as the list, tuple, dictionary, and string, which can greatly simplify common programming tasks. Of these structures, all but the string are heterogeneous, which means they can hold data of different types. This flexibility comes at a cost, however, as it is more expensive in both computational time and storage to maintain an arbitrary collection of data than to hold a pre-defined set of data.
(Advanced: Optional) For example, the Python list is implemented (in the Cython implementation) as a variable length array that contains pointers to the objects held in the array. While flexible, it takes time to create, resize, and iterate over, even if the data contained in the list is homogenous. Also, even though you can create multiple-dimensional lists, creating and working with them is neither simple or intuitive. Yet, many applications require multi-dimensional representation; for example, location on the surface of the Earth or pixel properties in an image.
Thus, these data structures clearly are not designed or optimized for data intensive computing. Scientific and engineering computing applications have a long history of using optimized data structures and libraries, including codes written in C, Fortran, or MatLab. These applications expect to have vector and matrix data structures and optimized algorithms that can leverage these structures seamlessly. Fortunately, since many data science applications, including statistical and machine learning, are built on this academic legacy, a community of open source developers have built the Numerical Python (NumPy) library, a fundamental numerical package to facilitate scientific computing in Python.
NumPy is built around a new, n-dimensional array (ndarray
) data structure that provides fast support for numerical computations. This data type for objects stored in the array can be specified at creation time, but the array is homogenous. This array can be used to represent a vector (one-dimensional set of numerical values) or matrix (multiple-dimensional set of vectors). Furthermore, NumPy provides additional benefits built on top of the array
object, including masked arrays, universal functions, sampling from random distributions, and support for user-defined, arbitrary data-types that allow the array
to become an efficient, multi-dimensional generic data container.
Despite the discussion in the previous section, you might be curious if the benefits of learning NumPy are worth the effort of learning to effectively use a new Python data structure, especially one that is not part of the standard Python distribution. In the end, you will have to make this decision, but there are several definite benefits to using NumPy:
list
.list
.NumPy is a very powerful library, with a rich and detailed user guide. Every time I look through the documentation, I learn new tricks. Furthermore, NumPy is the foundation upon which other popular Python modules, such as Pandas and Seaborn, are based. The time you spend learning to use this library properly will be amply rewarded. In the rest of this IPython notebook, we will introduce many of the basic NumPy features, but to fully master this library you will need to spend time reading the documentation and trying out the full capabilities of the library.
To demonstrate the first two points, consider the programming task of computing basic mathematical functions on a large number of data points. In the first code block, we first import both the math
library and the numpy
library. Second, we define two constants: size, which is the number of data points to process, and delta, which is a floating point offset we add to the array elements. You can change these two parameters to see how the performance of the different approaches varies. Finally, we create the list
and the NumPy array
that we will use in the next few codes blocks:
import math
import numpy as np
size = 100000
delta = 1.0E-2
aList = [(x + delta) for x in range(size)]
anArray = np.arange(size) + delta
print(aList[2:6])
print(anArray[2:6])
At this point, we have created and populated both data structures, and you have seen that they are both indexed in the same manner, meaning it is probably easier to learn and use NumPy arrays than you might have thought. Next, we can apply several standard mathematical functions to our list
, creating new list
s in the process. To determine how long these operations take, we use the IPython %timeit
magic function, which will, by default, run the code contained on the rest of the line multiple times and report the average best time.
%timeit [math.sin(x) for x in aList]
%timeit [math.cos(x) for x in aList]
%timeit [math.log(x) for x in aList]
As you can see, to create these new list
objects, we apply the mathematical function to every angle in the original list
. These operations are fairly fast and all roughly constant in time, demonstrating the overall speed of list comprehensions in Python. Now we will try doing the same thing by using the NumPy library.
%timeit np.sin(anArray)
%timeit np.cos(anArray)
%timeit np.log10(anArray)
First, the creation of each of these new arrays was much faster, nearly a factor of ten in each case! (Actual results will depend on the host computer and Python version.) Second, the operations themselves were arguably simpler both to write and to read, as the function is applied to the data structure itself and not each individual element. But perhaps we should compare the results to ensure they are the same.
l = [math.sin(x) for x in aList]
a = np.sin(anArray)
print("Python List: ", l[2:10:3])
print("NumPY array:", a[2:10:3])
print("Difference = ", a[5:7] - np.array(l[5:7]))
# Now create a NumPy array from a Python list
%timeit np.sin(aList)
As the previous code block demonstrates, the NumPy results agree with the standard Python results, although the NumPy results are more conveniently displayed. As a last test, we create a new NumPy array
from the original Python list
by applying the np.sin
function, which, while not as fast as the pure NumPy version, is faster than the Python version and easier to read.
Now we will change gears and introduce the NumPy library.
NumPy arrays, which are instances of the ndarray
class, are statically-typed, homogenous data structures that can be created in a number of different ways. You can create an array from an existing Python list
or tuple
, or use one of the many built-in NumPy convenience methods:
empty
: Creates a new array whose elements are uninitialized.zeros
: Creates a new array whose elements are initialized to zero.ones
: Creates a new array whose elements are initialized to one.empty_like
: Creates a new array whose size matches the input array and whose values are uninitialized.zero_like
: Creates a new array whose size matches the input array and whose values are initialized to zero.ones_like
: Creates a new array whose size matches the input array and whose values are initialized to unity.# Make and print out simple NumPy arrays
print(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
print("\n", np.empty(10))
print("\n", np.zeros(10))
print("\n", np.ones(10))
print("\n", np.ones_like(np.arange(10)))
We can also create NumPy arrays that have specific initialization patterns. For example, the arange(start, end, stride)
method works like the normal Python range
method, creating an array whose elements begin with the start
parameter. Subsequent elements are incremented successively by the stride
parameter, stopping when the end
parameter would either be reached or surpassed. As was the case with the range
method, the start
and stride
parameters are optional, defaulting to zero and one, respectively, and the end
value is not included in the array.
# Demonstrate the np.arange method
print(np.arange(10))
print(np.arange(3, 10, 2))
NumPy also provides two convenience methods that take a similar form but assign values to the elements that are evenly spaced. The first method is the linspace(start, end, num)
method, which creates num
elements and assigns values that are linearly spaced between start
and end
.
The second method, logspace(start, end, num)
, creates num
elements and assigns values that are logarithmically spaced between start
and end
. The num
parameter is optional and defaults to fifty. Unlike the arange
method, these two methods are inclusive, which means both the start
and end
parameters are included as elements in the new array. There is an optional parameter called base
that you can use to specify the base of the logarithm used to create the intervals. By default, this value is 10, making the intervals log10
spaced.
# Demonstrate linear and logarthmic array creation.
print("Linear space bins [0, 10] = {}\n".format(np.linspace(0, 10, 4)))
print("Default linspace bins = {}\n".format(len(np.linspace(0,10))))
print("Log space bins [0, 1] = {}\n".format(np.logspace(0, 1, 4)))
print("Default linspace bins = {}\n".format(len(np.logspace(0,10))))
Each NumPy array has several attributes that describe the general features of the array. These attributes include the following:
ndim
: Number of dimensions of the array (previous examples were all unity)shape
: The dimensions of the array, so a matrix with n
rows and m
columns has shape
equal to (n, m)
size
: The total number of elements in the array. For a matrix with n rows and m columns, the size
is equal to the product of $n \times m$dtype
: The data type of each element in the arrayitemsize
: The size in bytes of each element in the arraydata
: The buffer that holds the array dataNumPy arrays are statically-typed, thus their data type is specified when they are created. The default data type is float
, but the type can be specified in several ways. First, if you use an existing list
(as we did in the previous code block) or array
to initialize the new array
, the data type of the previous data structure will be used. If a heterogeneous Python list
is used, the greatest data type will be used in order, guaranteeing that all values will be safely contained in the new array
. If using a NumPy function to create the new array
, the data type can be specified explicitly by using the dtype
argument, which can either be one of the predefined built-in data types or a user defined custom data type.
The full list of built-in data types can be obtained from the np.sctypeDict.keys()
method; but for brevity, we list some of the more commonly used built-in data types below, along with their maximum size in bits, which constrains the maximum allowed value that may be stored in the new array
:
int8
, int16
, int32
, and int64
uint8
, uint16
, uint32
, and uint64
float16
, float32
, float64
, and float128
Other data types include complex numbers, byte arrays, character arrays, and date/time arrays.
To check the type of an array, you can simply access the array's dtype
attribute. A ValueError
exception will be thrown if you try to assign an incompatible value to an element in an array
.
# Access our previously created array's data type
a.dtype
As previously declared, trying to assign the wrong data type to a NumPy array will throw a ValueException
, as shown in the following figure (note: you can repeat this on your own in a new code cell).
NumPy supports multidimensional arrays, although for simplicity we will rarely discuss anything other than two- or three-dimensional arrays. We will defer a discussion of working with multi-dimensional arrays by using NumPy to a subsequent lesson in this course.
NumPy supports many different ways to access elements in an array. Elements can be indexed or sliced in the same way a Python list or tuple can be indexed or sliced, as demonstrated in the following code cell.
a = np.arange(9)
print("Original Array = ", a)
a[1] = 3
a[3:5] = 4
a[0:6:2] *= -1
print("\nNew Array = ", a)
NumPy also provides several other special indexing techniques. The first such technique is the use of an index array, where you use an array to specify the elements to be selected. The second technique is a Boolean mask array. In this case, the Boolean array is the same size as the primary NumPy array, and if the element in the mask array is True
, the corresponding element in the primary array is selected, and vice-versa for a False
mask array element. These two special indexing techniques are demonstrated in the following two code cells.
# Demonstration of an index array
a = np.arange(10)
print("\nStarting array:\n", a)
print("\nIndex Access: ", a[np.array([1, 3, 5, 7])])
# Demonstrate Boolean mask access
# Simple case
a = np.arange(10)
print("Original Array:", a)
print("\nMask Array: ", a > 4)
# Now change the values by using the mask
a[a > 4] = -1.0
print("\nNew Array: ", a)
NumPy has a rich support for random number generation, which can be used to create and populate an array of a given shape and size. NumPy provides support for sampling random values from over 30 different distributions including the binomial
, normal
, and poisson
(we will discuss distribution functions in subsequent lessons). There are also special convenience functions to simplify the sampling of random data over the uniform or normal distributions. These techniques are demonstrated in the following code cell.
# Create arrays of random data from different distributions
print("Uniform sampling, integers [0, 1): {}".format(np.random.randint(0, 10, 5)))
print("Uniform sampling [0, 1): {}".format(np.random.rand(5)))
print("Normal sampling (0, 1) : {0}".format(np.random.randn(5)))
NumPy arrays naturally support basic mathematical operations, including addition, subtraction, multiplication, division, integer division, and remainder, allowing you to easily combine a scalar (a single number) with a vector (a one-dimensional array), or a matrix (a multi-dimensional array). In the next code block, we first create a one-dimensional array, and subsequently operate on this array to demonstrate how to combine a scalar with a vector.
# Create and use a vector
a = np.arange(10)
print(a)
print("\n", (2.0 * a + 1)/3)
print("\n", a%2)
print("\n", a//2)
Student Exercise
In the empty Code cell below, first create a one-dimensional NumPy array by using the arange
function of the integers zero to ten, inclusive. Next, perform the following basic math operations on this new array and display only the final result. First, add two to every element. Second, multiply the result by five. Finally, subtract seven. Next, create a second array of eleven, randomly selected integers between zero and ten, and multiply this new array by the original array, displaying the result.
NumPy provides convenience functions that can quickly summarize the values of an array, which can be very useful for specific data processing tasks (note that we will cover descriptive statistics in a subsequent lesson). These functions include basic statistical measures (mean
, median
, var
, std
, min
, and max
), the total sum or product of all elements in the array (sum
, prod
), as well as running sums or products for all elements in the array (cumsum
, cumprod
). The last two functions actually produce arrays that are of the same size as the input array, where each element is replaced by the respective running sum/product up to and including the current element. Another function, trace
, calculates the trace of an array, which simply sums up the diagonal elements in the multi-dimensional array.
# Demonstrate data processing convenience functions
# Make an array = [1, 2, 3, 4, 5]
a = np.arange(1, 6)
print("Mean value = {}".format(np.mean(a)))
print("Median value = {}".format(np.median(a)))
print("Variance = {}".format(np.var(a)))
print("Std. Deviation = {}\n".format(np.std(a)))
print("Minimum value = {}".format(np.min(a)))
print("Maximum value = {}\n".format(np.max(a)))
print("Sum of all values = {}".format(np.sum(a)))
print("Running cumulative sum of all values = {}\n".format(np.cumsum(a)))
print("Product of all values = {}".format(np.prod(a)))
print("Running product of all values = {}\n".format(np.cumprod(a)))
NumPy also includes methods that are universal functions or ufuncs that are vectorized and thus operate on each element in the array, without the need for a loop. You have already seen examples of some of these functions at the start of this IPython Notebook when we compared the speed and simplicity of NumPy versus normal Python for numerical operations. These functions almost all include an optional out
parameter that allows a pre-defined NumPy array to be used to hold the results of the calculation, which can often speed-up the processing by eliminating the need for the creation and destruction of temporary arrays. These functions will all still return the final array, even if the out
parameter is used.
NumPy includes over sixty ufuncs that come in several different categories:
add
, divide
, power
, sqrt
, log
, and exp
.sin
, cos
, arctan
, sinh
, and deg2rad
functions.bitwise_and
, bitwise_or
, invert
, and right_shift
.greater
, equal
, logical_and
, and maximum
.isreal
, isnan
, signbit
, and fmod
.Look at the official NumPy ufunc reference guide for more information on any of these functions, for example, the isnan function, since the user guide has a full breakdown of each function and sample code demonstrating how to use the function.
In the following code blocks, we demonstrate several of these ufuncs.
b = np.arange(1, 10)
print('original array:\n', b)
c = np.sin(b)
print('\nnp.sin : \n', c)
print('\nnp.log and np.abs : \n', np.log10(np.abs(c)))
print('\nnp.mod : \n', np.mod(b, 2))
print('\nnp.logical_and : \n', np.logical_and(np.mod(b, 2), True))
# Demonstrate Boolean tests with operators
d = np.arange(9)
print("Greater Than or Equal Test: \n", d >= 5)
# Now combine to form Boolean Matrix
np.logical_and(d > 3, d % 2)
NumPy provides support for masked arrays, where certain elements can be masked based on some criterion and ignored during subsequent calculations (i.e., these elements are masked). Masked arrays are in the numpy.ma
package, and simply require a masked array to be created from a given array and a condition that indicates which elements should be masked. This new masked array can be used as a normal NumPy array, except the masked elements are ignored. NumPy provides operations for masked arrays, allowing them to be used in a similar manner as normal NumPy arrays.
You can also impute missing (or bad) values by using a masked array, and replacing masked elements with a different value, such as the mean value. Masked arrays can also be used to mask out bad values in a calculation such as divide-by-zero or logarithm of zero, and a masked array will ignore error conditions during standard operations, making them very useful since they operate in a graceful manner.
# Create and demonstrate a masked array
import numpy.ma as ma
x = [0.,1.,-9999.,3.,4.]
print("Original array = :", x)
mx = ma.masked_values (x, -9999.)
print("\nMasked array = :", mx)
print("\nMean value of masked elements:", mx.mean())
print("\nOperate on unmasked elements: ", mx - mx.mean())
print("\n Impute missing values (using mean): ", mx.filled(mx.mean())) # Imputation
# Create two arrays with masks
x = ma.array([1., -1., 3., 4., 5., 6.], mask=[0,0,0,0,1,0])
y = ma.array([1., 2., 0., 4., 5., 6.], mask=[0,0,0,0,0,1])
# Now take square root, ignores div by zero and masked elements.
print(np.sqrt(x/y))
# Now try some random data
d = np.random.rand(1000)
# Now mask for values within some specified range (0.1 to 0.9)
print("Masked array mean value: ", ma.masked_outside(d, 0.1, 0.9).mean())
NumPy has support for reading or writing data to files. Of these methods, two of the most useful are the loadtxt
method and the genfromext
method, each of which allow you to easily read data from a text file into a NumPy array. The primary difference is that the genfromtxt
method can handle missing data, while the loadtxt
cannot. Both methods allow you to specify the column delimiter, easily skip header or footer rows, specify which columns should be extracted for each row, and allow you to unpack the row so that each column goes into a separate array.
For example, the following code snippet demonstrates how to use the loadtxt
method to pull out the second and fourth columns from the fin
file handle, where the file is assumed to be in CSV format. The data is persisted into the a
and b
NumPy arrays.
a, b = np.loadtxt(fin, delimeter = ',', usecols=(1, 3), unpack=True)
We demonstrate the genfromtxt
method in the following code block, where we first create the test data file before reading that data back into a NumPy array.
# First write data to a file using Unix commands.
info = "1, 2, 3, 4, 5 \n 6, 7, 8, 9, 10"
with open("test.csv", 'w') as fout:
print(info, file=fout)
# Now we can read it back into a NumPy array.
d = np.genfromtxt("test.csv", delimiter=",")
print("New Array = \n", d)
In addition to the normal Python help
function, NumPy provides a special lookup
function that will search the NumPy library for classes, types, or methods that match the search string passed to the function. This can be useful for finding specific information given a general concept, or to learn more about related topics by performing a search. For example, the following function call would display all relevant NumPy information for 'masked array'.
np.lookfor('masked array')
Student Exercise
In the empty Code cell below, first create a one-dimensional NumPy array of the even integers from 20 to 42, inclusive. Next, apply a ufunc to compute the fourth power of each element in this array. Divide each element in this array by 11, and print out the resulting elements, one element per line.
The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.
© 2017: Robert J. Brunner at the University of Illinois.
This notebook is released under the Creative Commons license CC BY-NC-SA 4.0. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.