import numpy as np
import csv
import math
import requests
Import the data used in the notebook. The data will be saved to the notebook's working directory.
def download_data(git_loc, dest_name):
    '''
    Downloads data from GitHub and saves it to the notebook's working directory.
    '''
    req = requests.get(git_loc)
    with open(dest_name, "w") as file:
        file.write(req.text)

download_data('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/lecture_07/gapminder.csv',
              "gapminder.csv")
As we've seen, we can represent data in Python as nested lists. For example, consider again the Gapminder data.
gap_data = []
with open("gapminder.csv", mode='rt') as file:
    data = csv.reader(file)
    for row in data:
        gap_data.append(row)
# Look at the "head" of the data (first five entries)
gap_data[:5]
One challenge of working with data in this format is how it's indexed.
We can easily index each row entry by accessing the row's position.
gap_data[3]
We can then access a specific ("column") value by referencing the index position within the inner list.
gap_data[3][1]
To do this systematically, say by referencing a single variable in the data, we need to use information about both these indices and loop through the values.
column_names = gap_data[0] # Get the names of the columns
column_names
# Get the index position of the column
var_ind = column_names.index("gdpPercap")
var_ind
for row in gap_data:
    print(row[var_ind])
Constantly indexing and looping like this can be inefficient.
numpy arrays are optimized to handle these kinds of data matrix computations.
# Convert our gap_data to a numpy array
gap_data_np = np.array(gap_data)
gap_data_np
gap_data_np[:,2]
Let's time those two operations: as we can see, the boost in performance (and the conceptual ease of execution) is huge.
%%timeit
for row in gap_data:
    row[var_ind]
%%timeit
gap_data_np[:,2]
# vector as a list
v = np.array([1,2,3,4])
v
# A matrix is a nested list
M = np.array([[1,2,3,4],
              [2,3,4,1],
              [-1,1,2,1]])
print(M)
M.shape
# An n-dimensional array is a deeper-nested list
A = np.array([
    [
        [1,2,3,4],
        [2,3,4,1],
        [-1,1,2,1]
    ],
    [
        [1,2,3,4],
        [2,3,4,1],
        [-1,1,2,1]
    ]
])
print(A)
A.shape
np.arange
Generate evenly spaced values over the interval [start, stop) given a step size: np.arange(start, stop, step).
np.arange(1, 10, .5)
np.arange(0,1+.01,.01)
np.linspace
Generate a vector of length N where the entries are evenly spaced over the requested interval: np.linspace(start, end, length).
np.linspace(1,5,10)
np.linspace(0,1,3)
np.random
We'll use the np.random sub-library in numpy to generate numerical numpy arrays of draws from known random distributions.
Whenever we randomly generate numbers, we usually want to be able to replicate our results. To do so, we set a seed, which ensures we'll generate the same random numbers again.
np.random.seed(123)
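As a quick check (a minimal sketch, not from the original notes): reseeding with the same value reproduces exactly the same draws.
np.random.seed(123)
draw1 = np.random.randn(3)
np.random.seed(123)      # reseed with the same value
draw2 = np.random.randn(3)
draw1 == draw2           # array([ True,  True,  True]) -- identical draws
np.random.seed(123)      # reset the seed before continuing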
Generate random numbers from a standard normal distribution.
np.random.randn(10)
Generate an array of random integers within a range: np.random.randint(low, high, size). Note that the upper bound is exclusive.
np.random.randint(1,10,10)
Also, we can generate random values from known distributions, e.g.
np.random.binomial(1,.5,10) # coin flip distribution
np.random.normal(5,1,10) # normal (continuous) distribution
np.random.poisson(1,10) # count distribution
np.random.uniform(1,5,10) # uniform distribution
We'll delve more into random number generation later in the semester.
# Matrix full of zeros
np.zeros((3,4))
# Matrix full of ones
np.ones((3,4))
# Identity Matrix
np.eye(4)
# empty container (np.empty allocates memory without initializing it, so the values are arbitrary)
empty_array = np.empty((2,3))
empty_array
Generate a matrix with the same shape as one you already have:
X = np.zeros((4,4))
np.ones_like(X)
np.zeros_like(X)
# Generate a random vector to reshape
v = np.random.randint(1,100,30)
v
v.reshape(5,6)
v.reshape(10,3)
We can only reshape to dimensions that are compatible with the total number of elements; the following raises an error (30 values don't fit a 10 x 2 array).
v.reshape(10,2)
# use -1 to have numpy infer the remaining dimension
P = np.random.randn(20).reshape(10,-1)
P.shape
P
# Alternative way to change the shape
P.shape = 2,10
P.shape
P
We index arrays using the M[row, column] convention.
X = np.linspace(1,25,25).reshape(5,5)
X
X[0] # index first row
X[:,0] # index first column
X[0,0] # index a specific cell
# Can use : or ... to select an entire dimension
print(X[:,1])
print(X[...,1])
# slice rows and columns
X[0:3,0:3]
X[-1,:] # last row
X[:,-1] # last column
We can also request specific indices in whatever order we like.
X[[3,0,2],:]
X[:,[3,0,2]]
We can use vectorization (see below) to great effect with boolean (logical) evaluations. This offers a way to quickly and easily subset data by particular conditions.
D = np.random.randint(1,100,50).reshape(10,5)
D
D >= 50
D[D >= 50]
D[D >= 50] = -999
D
X
X[:3,:3] = 0
X
X[1,2] = -999
X[4,4] = -999
X
D = np.random.randint(1,100,50).reshape(10,5)
D[D <= 50] = 1
D[D > 50] = 0
D
np.where()
Similar to R's ifelse(): np.where(condition, value_if_true, value_if_false).
D = np.random.randint(1,100,50).reshape(10,5)
D
np.where(D>50,0,1)
b = np.random.randint(1,10,100)
b
b[(b < 8) & (b > 4)]
b[(b < 8) & (b > 4)] = -999
b
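The same logic extends to the other logical operators: | for "or" and ~ for "not" (a small illustrative sketch):
b2 = np.random.randint(1,10,20)
b2[(b2 < 3) | (b2 > 7)]   # values outside the middle of the range
b2[~(b2 < 3)]             # values that are NOT less than 3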
We can easily stack and grow numpy arrays.
m1 = np.random.randn(10).reshape(5,-1).round(1)
m2 = np.random.poisson(1,10).reshape(5,-1)
m1
m2
rbind: binding rows
# stack the two arrays row-wise using concatenate
np.concatenate([m1,m2],axis=0)
# or use vertical stack
np.vstack([m1,m2])
cbind: binding columns
np.concatenate([m1,m2],axis=1)
np.hstack([m1,m2])
Note that when we slice an array we do not copy the array, rather we get a "view" of the array.
# Recall the behavior of double assignment with lists
x = [1,2,3]
y = x
y[2] = 100
x
# We can get around this behavior by making copies.
# One way to make a copy is to slice
y = x[:]
y[2] = -999
x
When we slice an array, we get a sub-"view" of the data that still affects the original data object.
P = np.ones((5,5))
P
g = P[:2,:2]
g
g += 100
g
P
"This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer."
To get around this behavior, we again just need to make a .copy().
g2 = P[:2,:2].copy()
g2 -= 1000
g2
P
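We can confirm which objects share memory with np.may_share_memory (a quick check):
np.may_share_memory(g, P)    # True: g is a view into P
np.may_share_memory(g2, P)   # False: g2 is an independent copy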
Broadcasting makes it possible for operations to be performed on arrays of mismatched shapes.
Broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.
For example, say we have a numpy array of dimensions (5,1)
$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} $$
Now say we wanted to add 5 to the values in this array
$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + 5 $$
Broadcasting "pads" the array of 5 (which is shape = 1,1), and extends it so that it has similar dimension to the larger array in which the computation is being performed.
$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + \begin{bmatrix} 5\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\end{bmatrix} $$
$$ \begin{bmatrix} 1 + 5\\2 + 5\\3 + 5\\4 + 5\\5 + 5\end{bmatrix} $$
$$ \begin{bmatrix} 6\\7\\8\\9\\10\end{bmatrix} $$
A = np.array([1,2,3,4,5])
A + 5
By 'broadcast', we mean that the smaller array is made to match the size of the larger array in order to allow for element-wise manipulations.
A general rule of thumb: each pair of corresponding dimensions must either be equal, or one of the two must be 1.
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays (from the reading):

1. If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
2. If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
3. If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
np.arange(3) + 5
$$ \texttt{np.arange(3)} = \begin{bmatrix} 0&1&2\end{bmatrix} $$
$$ \texttt{5} = \begin{bmatrix} 5 \end{bmatrix} $$
$$ \begin{bmatrix} 0&1&2\end{bmatrix} + \begin{bmatrix} 5 & \color{lightgrey}{5} & \color{lightgrey}{5}\end{bmatrix} = \begin{bmatrix} 5 & 6 & 7\end{bmatrix} $$
np.ones((3,3)) + np.arange(3)
$$ \texttt{np.ones((3,3)) = }\begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$
$$ \texttt{np.arange(3)} = \begin{bmatrix} 0 & 1 & 2\end{bmatrix} $$
$$ \begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} + \begin{bmatrix} 0 & 1 & 2\\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2} \\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2}\end{bmatrix} = \begin{bmatrix} 1 & 2 & 3\\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix} $$
np.arange(3).reshape(3,1) + np.arange(3)
$$ \texttt{np.arange(3).reshape(3,1)} = \begin{bmatrix} 0 \\ 1 \\ 2\end{bmatrix} $$
$$ \texttt{np.arange(3)} = \begin{bmatrix} 0 & 1 & 2\end{bmatrix} $$
$$ \begin{bmatrix} 0 & \color{lightgrey}{0} & \color{lightgrey}{0} \\ 1 & \color{lightgrey}{1} & \color{lightgrey}{1} \\ 2 & \color{lightgrey}{2} & \color{lightgrey}{2}\end{bmatrix} + \begin{bmatrix} 0 & 1 & 2\\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2} \\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2}\end{bmatrix} = \begin{bmatrix} 0 & 1 & 2\\ 1 &2&3 \\ 2& 3 & 4\end{bmatrix} $$
Example of dimensional disagreement: a (4,7) array plus a (5,9) array raises an error, whereas a (1,7) array broadcasts cleanly.
np.ones((4,7))
np.ones((4,7)) + np.zeros( (5,9) )
np.ones((4,7)) + np.zeros( (1,7) )
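If you're unsure whether two shapes are compatible, np.broadcast reports the broadcasted shape without performing any arithmetic (a small sketch):
np.broadcast(np.ones((4,7)), np.zeros((1,7))).shape   # -> (4, 7)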
Similar to broadcasting, vectorization allows for simultaneous computation along all values in the array.
X = np.random.randint(1,10,50).reshape(10,5)
X
np.log(X)
The computations are performed on each element in the array simultaneously.
Again, let's consider what this same operation would need to look like if we were dealing with a nested list. We'd need to perform each operation element-by-element in the nested list structure.
X2 = X.tolist()
n_rows = len(X2)
n_cols = len(X2[0])
for i in range(n_rows):
    for j in range(n_cols):
        X2[i][j] = math.log(X2[i][j])
X2
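As with the indexing example earlier, we can time the two approaches; the vectorized version is dramatically faster (a quick sketch mirroring the %%timeit comparison above):
%%timeit
np.log(X)
%%timeit
[[math.log(val) for val in row] for row in X.tolist()]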
Vectorization frees us from this tedium. Moreover, it's extremely efficient so we can perform computations quickly.
For example:
# Locate the absolute value for an array
np.abs([1,2,-6,7,8])
# Round Values to the k-th decimal point
np.round(np.log(X),1)
# Count the number of non-zero entries
np.count_nonzero(np.array([1,0,8,0,1]))
Numpy comes baked in with a large number of ufuncs
(or "universal functions") that are all vectorized. See the numpy documentation for a detailed list of these operations.
These universal functions come with an axis
argument that specifies the dimension along which the function is applied.
A = np.random.randint(1,10,100).reshape(20,5)
A
Consider calculating the average across some data set. By default, the ufunc .mean()
will calculate the average for the entire data matrix.
A.mean()
If we wanted to calculate the mean for each observation (row) or variable (column), we'll need to use the axis
argument to specify which.
axis=0 == collapse the rows, yielding one value per column (variable)
axis=1 == collapse the columns, yielding one value per row (observation)
A.mean(axis=0)
A.mean(axis=1)
Consider the following function that yields a different string depending on whether input a is larger or smaller than input b.
def bigsmall(a,b):
    if a > b:
        return "A is larger"
    else:
        return "B is larger"
bigsmall(5,6)
bigsmall(6,5)
We can implement this function in a vectorized fashion using the np.vectorize() function.
# Create a vectorized version of the function
vec_bigsmall = np.vectorize(bigsmall)
vec_bigsmall
# And now implement on arrays of numbers!
vec_bigsmall([0,2,5,7,0],[4,3,10,2,6])
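Note that np.vectorize is primarily a convenience: under the hood it still loops in Python, so it doesn't deliver ufunc-level performance. For a simple conditional like this, np.where achieves the same result with true vectorization (a small sketch):
a = np.array([0,2,5,7,0])
b = np.array([4,3,10,2,6])
np.where(a > b, "A is larger", "B is larger")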
Out of the box, numpy arrays can only handle one data type at a time...
x = np.array([1,2,3,4])
x.dtype # examine the data type contained within
And we can't necessarily change the data type on the fly by overwriting entries with values of a different type.
x[1] = .04   # the float is silently truncated to an integer (0)
x
x[1] = "this"   # assigning a string raises a ValueError
x
To do this, we need to alter the data type of the data contained within the array with .astype()
x.astype('f')
x.astype('U')
List of all data types and their conversions (table drawn from the reading):

Character | Description | Example
---|---|---
b | Byte | np.dtype('b')
i | Signed integer | np.dtype('i4') == np.int32
u | Unsigned integer | np.dtype('u1') == np.uint8
f | Floating point | np.dtype('f8') == np.float64
c | Complex floating point | np.dtype('c16') == np.complex128
S, a | String | np.dtype('S5')
U | Unicode string | np.dtype('U') == np.str_
V | Raw data (void) | np.dtype('V') == np.void
This limitation extends to heterogeneous data:
nested_list = [['a','b','c'],[1,2,3],[.3,.55,1.2]]
nested_list
data = np.array(nested_list).T
data
All the data in the matrix is treated as a string!
To get around this, we need to again be explicit about the data type of each column. Here we pre-specify a data table and its inputs.
data = np.zeros((3), dtype={'names':('v1', 'v2', 'v3'),
                            'formats':('U5', 'i', 'f')})
data
We then load the data to the specified columns.
data['v1'] = ['a','b','c']
data['v2'] = [1,2,3]
data['v3'] = [.3,.55,1.2]
data
We can then index, but will do so differently than we observed above.
data['v1']
data[1][['v1','v2']]
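Fields can also be combined with boolean masks to filter rows (a small sketch using the structured array above):
data[data['v2'] > 1]   # rows of the structured array where v2 exceeds 1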
Though possible to deal with heterogeneous data frames using numpy, there is a lot of overhead to constructing a data object. As such, we'll use Pandas series and DataFrames to deal with heterogeneous data.
numpy automatically truncates the data when printing. Handy when you have a lot of data.
print(np.arange(10000).reshape(100,100))
# We can adjust these settings, e.g. by raising the threshold at which numpy truncates
np.set_printoptions(threshold=10000)
print(np.arange(100).reshape(10,10))
Numpy provides a data class for missing values (nan == "Not a Number").
Y = np.random.randint(1,10,25).reshape(5,5) + .0   # add .0 to coerce to float so we can insert np.nan
Y
Y[Y > 5] = np.nan
Y
type(np.nan)
# scan for missing values
np.isnan(Y)
~np.isnan(Y) # are not NAs
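The negated mask gives a quick way to extract only the observed values (note that this flattens the array):
Y[~np.isnan(Y)]   # a 1-D array of the non-missing values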
When we have missing values, we'll run into issues when computing across the data matrix.
np.mean(Y)
To get around this, we need to use special versions of the methods that compensate for the existence of nan.
np.nanmean(Y)
np.nanmean(Y,axis=0)
# Mean impute the missing values
Y[np.where(np.isnan(Y))] = np.nanmean(Y)
Y