PPOL564 - Data Science I: Foundations

Lecture 7

Using Numpy

Plan for Today

  • Beyond nested lists
  • Basics of using Numpy
  • Broadcasting and Vectorization
  • From Matrix to Data Frame: Dealing with multiple data types
In [2]:
import numpy as np
import csv
import math
import requests 

Import the data used in the notebook. The data will be saved to the notebook's working directory.

In [4]:
def download_data(git_loc,dest_name):
    '''
    Downloads data from Github and saves it to the notebook's working directory.
    '''
    req = requests.get(git_loc)
    req.raise_for_status() # stop early if the download failed
    with open(dest_name,"w") as file:
        file.write(req.text) # write the full response text in one call
download_data('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/lecture_07/gapminder.csv',
         "gapminder.csv")

Limitations of Nested Lists in DS

As we've seen, we can represent data in Python as nested lists. For example, consider again the Gapminder data.

In [2]:
gap_data = []
with open("gapminder.csv",mode='rt') as file:
    data = csv.reader(file)
    for row in data:
        gap_data.append(row)

# Look at the "head" of the data (first five entries)        
gap_data[:5]
Out[2]:
[['country', 'lifeExp', 'gdpPercap'],
 ['Guinea_Bissau', '39.21', '652.157'],
 ['Bolivia', '52.505', '2961.229'],
 ['Austria', '73.103', '20411.916'],
 ['Malawi', '43.352', '575.447']]

One challenge of working with data in this format is how it's indexed.

We can easily index each row by its position.

In [3]:
gap_data[3]
Out[3]:
['Austria', '73.103', '20411.916']

We can then access a specific ("column") value by referencing its index position within that row's list.

In [4]:
gap_data[3][1] 
Out[4]:
'73.103'

To do this systematically, say by referencing a single variable in the data, we need to use information about both these indices and loop through the values.

In [5]:
column_names = gap_data[0] # Get the names of the columns
column_names
Out[5]:
['country', 'lifeExp', 'gdpPercap']
In [6]:
# Draw out the index for the column 
var_ind = column_names.index("gdpPercap") 
var_ind
Out[6]:
2
In [7]:
for row in gap_data:
    print(row[var_ind])
gdpPercap
652.157
2961.229
20411.916
575.447
17473.723
2591.853
5406.038
10888.176
3312.788
2447.909
20556.684
5733.625
65332.91
17262.623
1356.671
810.384
2469.167
9331.712
1741.365
22410.746
1314.38
7208.065
14074.582
7866.872
8416.554
780.553
16245.209
3477.21
1200.416
680.133
3484.779
12013.579
13969.037
1044.582
5613.844
4469.453
4898.398
1854.731
675.368
6384.055
7269.216
1153.82
1569.275
6197.645
3163.352
6703.289
14160.936
4426.026
13920.011
2697.833
17425.382
1488.309
817.559
648.343
6283.259
3675.582
1835.01
3009.288
675.669
10863.164
3255.367
1017.713
542.278
673.093
20261.744
604.814
1335.595
1165.454
11529.865
4768.942
1358.199
7300.17
2844.856
3074.031
1533.122
12138.562
635.858
5031.504
1912.825
802.675
7724.113
1382.782
439.333
27074.334
19380.473
17750.87
4431.847
1057.296
3045.966
18077.664
19980.596
1692.805
782.729
7376.583
2834.413
776.067
10088.516
20531.422
1140.793
471.663
5754.827
5448.611
2174.771
21671.825
1155.395
541.003
19900.758
3759.997
8217.318
509.115
4015.403
4195.343
1774.634
26261.151
1439.271
1488.308
1072.819
10415.531
849.281
3239.607
8955.554
14029.826
21748.852
18833.57
781.077
958.785
9305.049
7811.809
7100.133
3607.101
19943.126
3424.656
7247.431
843.991
1620.739
26747.307
10224.807
11354.092
3128.121
15758.606
5829.317

Constantly indexing and looping like this can be inefficient.

numpy arrays are optimized to handle these types of data matrix computations, by

  1. offering a way to efficiently index and slice data
  2. vectorizing operations
  3. efficiently storing data in memory
In [8]:
# Convert our gap_data to a numpy array
gap_data_np = np.array(gap_data)
gap_data_np
Out[8]:
array([['country', 'lifeExp', 'gdpPercap'],
       ['Guinea_Bissau', '39.21', '652.157'],
       ['Bolivia', '52.505', '2961.229'],
       ['Austria', '73.103', '20411.916'],
       ['Malawi', '43.352', '575.447'],
       ['Finland', '72.992', '17473.723'],
       ['North_Korea', '63.607', '2591.853'],
       ['Malaysia', '64.28', '5406.038'],
       ['Hungary', '69.393', '10888.176'],
       ['Congo', '52.502', '3312.788'],
       ['Morocco', '57.609', '2447.909'],
       ['Germany', '73.444', '20556.684'],
       ['Ecuador', '62.817', '5733.625'],
       ['Kuwait', '68.922', '65332.91'],
       ['New_Zealand', '73.989', '17262.623'],
       ['Mauritania', '52.302', '1356.671'],
       ['Uganda', '47.619', '810.384'],
       ['Equatorial Guinea', '42.96', '2469.167'],
       ['Croatia', '70.056', '9331.712'],
       ['Indonesia', '54.336', '1741.365'],
       ['Canada', '74.903', '22410.746'],
       ['Comoros', '52.382', '1314.38'],
       ['Montenegro', '70.299', '7208.065'],
       ['Slovenia', '71.601', '14074.582'],
       ['Trinidad and Tobago', '66.828', '7866.872'],
       ['Poland', '70.177', '8416.554'],
       ['Lesotho', '50.007', '780.553'],
       ['Italy', '74.014', '16245.209'],
       ['Tunisia', '60.721', '3477.21'],
       ['Kenya', '52.681', '1200.416'],
       ['Gambia', '44.401', '680.133'],
       ['Bosnia and Herzegovina', '67.708', '3484.779'],
       ['Libya', '59.304', '12013.579'],
       ['Greece', '73.733', '13969.037'],
       ['Ghana', '52.341', '1044.582'],
       ['Peru', '58.859', '5613.844'],
       ['Turkey', '59.696', '4469.453'],
       ['Reunion', '66.644', '4898.398'],
       ['Sri_Lanka', '66.526', '1854.731'],
       ['Cambodia', '47.903', '675.368'],
       ['Bulgaria', '69.744', '6384.055'],
       ['Lebanon', '65.866', '7269.216'],
       ['Togo', '51.499', '1153.82'],
       ['Yemen', '46.78', '1569.275'],
       ['Jamaica', '68.749', '6197.645'],
       ['Swaziland', '49.002', '3163.352'],
       ['Chile', '67.431', '6703.289'],
       ['Israel', '73.646', '14160.936'],
       ['Algeria', '59.03', '4426.026'],
       ['Czech_Republic', '71.511', '13920.011'],
       ['Djibouti', '46.381', '2697.833'],
       ['Singapore', '71.22', '17425.382'],
       ['Nigeria', '43.581', '1488.309'],
       ['Bangladesh', '49.834', '817.559'],
       ['DRC', '44.544', '648.343'],
       ['Cuba', '71.045', '6283.259'],
       ['Namibia', '53.491', '3675.582'],
       ['Sudan', '48.401', '1835.01'],
       ['Syria', '61.346', '3009.288'],
       ['Rwanda', '41.482', '675.669'],
       ['Puerto Rico', '72.739', '10863.164'],
       ['Albania', '68.433', '3255.367'],
       ['Vietnam', '57.48', '1017.713'],
       ['Mozambique', '40.38', '542.278'],
       ['Mali', '43.413', '673.093'],
       ['Saudi Arabia', '58.679', '20261.744'],
       ['Liberia', '42.476', '604.814'],
       ['Madagascar', '47.771', '1335.595'],
       ['Chad', '46.774', '1165.454'],
       ['Gabon', '51.221', '11529.865'],
       ['Mauritius', '64.953', '4768.942'],
       ['Zambia', '45.996', '1358.199'],
       ['Romania', '68.291', '7300.17'],
       ['Dominican Republic', '61.554', '2844.856'],
       ['Egypt', '56.243', '3074.031'],
       ['Senegal', '50.626', '1533.122'],
       ['Oman', '58.443', '12138.562'],
       ['Zimbabwe', '52.663', '635.858'],
       ['Botswana', '54.598', '5031.504'],
       ["Cote d'Ivoire", '48.436', '1912.825'],
       ['Afghanistan', '37.479', '802.675'],
       ['Mexico', '65.409', '7724.113'],
       ['Sao Tome and Principe', '57.896', '1382.782'],
       ['Myanmar', '53.322', '439.333'],
       ['Switzerland', '75.565', '27074.334'],
       ['United Kingdom', '73.923', '19380.473'],
       ['Japan', '74.827', '17750.87'],
       ['El Salvador', '59.633', '4431.847'],
       ['India', '53.166', '1057.296'],
       ['Thailand', '62.2', '3045.966'],
       ['Bahrain', '65.606', '18077.664'],
       ['Australia', '74.663', '19980.596'],
       ['Mongolia', '55.89', '1692.805'],
       ['Nepal', '48.986', '782.729'],
       ['Iran', '58.637', '7376.583'],
       ['Honduras', '57.921', '2834.413'],
       ['Guinea', '43.24', '776.067'],
       ['Venezuela', '66.581', '10088.516'],
       ['Iceland', '76.511', '20531.422'],
       ['Somalia', '40.989', '1140.793'],
       ['Burundi', '44.817', '471.663'],
       ['Panama', '67.802', '5754.827'],
       ['Costa Rica', '70.181', '5448.611'],
       ['Philippines', '60.967', '2174.771'],
       ['Denmark', '74.37', '21671.825'],
       ['Benin', '48.78', '1155.395'],
       ['Eritrea', '45.999', '541.003'],
       ['Belgium', '73.642', '19900.758'],
       ['West Bank and Gaza', '60.329', '3759.997'],
       ['South_Korea', '65.001', '8217.318'],
       ['Ethiopia', '44.476', '509.115'],
       ['Guatemala', '56.729', '4015.403'],
       ['Colombia', '63.898', '4195.343'],
       ['Cameroon', '48.129', '1774.634'],
       ['United States', '73.478', '26261.151'],
       ['Pakistan', '54.882', '1439.271'],
       ['China', '61.785', '1488.308'],
       ['Sierra Leone', '36.769', '1072.819'],
       ['Slovak Republic', '70.696', '10415.531'],
       ['Tanzania', '47.912', '849.281'],
       ['Paraguay', '66.809', '3239.607'],
       ['Argentina', '69.06', '8955.554'],
       ['Spain', '74.203', '14029.826'],
       ['Netherlands', '75.648', '21748.852'],
       ['France', '74.349', '18833.57'],
       ['Niger', '44.559', '781.077'],
       ['Central African Republic', '43.867', '958.785'],
       ['Serbia', '68.551', '9305.049'],
       ['Iraq', '56.582', '7811.809'],
       ['Uruguay', '70.782', '7100.133'],
       ['Angola', '37.883', '3607.101'],
       ['Sweden', '76.177', '19943.126'],
       ['Nicaragua', '58.349', '3424.656'],
       ['South Africa', '53.993', '7247.431'],
       ['Burkina Faso', '44.694', '843.991'],
       ['Haiti', '50.165', '1620.739'],
       ['Norway', '75.843', '26747.307'],
       ['Taiwan', '70.337', '10224.807'],
       ['Portugal', '70.42', '11354.092'],
       ['Jordan', '59.786', '3128.121'],
       ['Ireland', '73.017', '15758.606'],
       ['Brazil', '62.239', '5829.317']], dtype='<U24')
In [9]:
gap_data_np[:,2]
Out[9]:
array(['gdpPercap', '652.157', '2961.229', '20411.916', '575.447',
       '17473.723', '2591.853', '5406.038', '10888.176', '3312.788',
       '2447.909', '20556.684', '5733.625', '65332.91', '17262.623',
       '1356.671', '810.384', '2469.167', '9331.712', '1741.365',
       '22410.746', '1314.38', '7208.065', '14074.582', '7866.872',
       '8416.554', '780.553', '16245.209', '3477.21', '1200.416',
       '680.133', '3484.779', '12013.579', '13969.037', '1044.582',
       '5613.844', '4469.453', '4898.398', '1854.731', '675.368',
       '6384.055', '7269.216', '1153.82', '1569.275', '6197.645',
       '3163.352', '6703.289', '14160.936', '4426.026', '13920.011',
       '2697.833', '17425.382', '1488.309', '817.559', '648.343',
       '6283.259', '3675.582', '1835.01', '3009.288', '675.669',
       '10863.164', '3255.367', '1017.713', '542.278', '673.093',
       '20261.744', '604.814', '1335.595', '1165.454', '11529.865',
       '4768.942', '1358.199', '7300.17', '2844.856', '3074.031',
       '1533.122', '12138.562', '635.858', '5031.504', '1912.825',
       '802.675', '7724.113', '1382.782', '439.333', '27074.334',
       '19380.473', '17750.87', '4431.847', '1057.296', '3045.966',
       '18077.664', '19980.596', '1692.805', '782.729', '7376.583',
       '2834.413', '776.067', '10088.516', '20531.422', '1140.793',
       '471.663', '5754.827', '5448.611', '2174.771', '21671.825',
       '1155.395', '541.003', '19900.758', '3759.997', '8217.318',
       '509.115', '4015.403', '4195.343', '1774.634', '26261.151',
       '1439.271', '1488.308', '1072.819', '10415.531', '849.281',
       '3239.607', '8955.554', '14029.826', '21748.852', '18833.57',
       '781.077', '958.785', '9305.049', '7811.809', '7100.133',
       '3607.101', '19943.126', '3424.656', '7247.431', '843.991',
       '1620.739', '26747.307', '10224.807', '11354.092', '3128.121',
       '15758.606', '5829.317'], dtype='<U24')

Let's time those two operations: as we can see, the boost in performance (and the conceptual ease of execution) is huge.

In [10]:
%%timeit
for row in gap_data:
    row[var_ind]
6.94 µs ± 47.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]:
%%timeit
gap_data_np[:,2]
309 ns ± 2.92 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
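Note that the sliced column is still an array of strings (dtype '<U24'), so one more step is needed before doing arithmetic on it. A minimal sketch, assuming a small hand-typed sample in place of the full file (the names `sample` and `gdp` are hypothetical, not part of the lecture code):

```python
import numpy as np

# Hypothetical three-row sample mirroring the structure of gap_data
sample = np.array([['country', 'lifeExp', 'gdpPercap'],
                   ['Bolivia', '52.505', '2961.229'],
                   ['Austria', '73.103', '20411.916']])

# Skip the header row, take the third column, and convert strings to floats
gdp = sample[1:, 2].astype(float)

print(gdp.dtype)   # float64
print(gdp.mean())  # average of the two values
```

Because every element of a numpy array shares one data type, mixing country names and numbers forces everything to be stored as strings; converting one column at a time is one way around that.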

Basics of Using Numpy

Vectors, Matrices, and N-Dimensional Arrays

In [12]:
# vector as a list
v = np.array([1,2,3,4])
v
Out[12]:
array([1, 2, 3, 4])
In [13]:
# A matrix is a nested list
M = np.array([[1,2,3,4],
             [2,3,4,1],
             [-1,1,2,1]])
print(M)
M.shape
[[ 1  2  3  4]
 [ 2  3  4  1]
 [-1  1  2  1]]
Out[13]:
(3, 4)
In [14]:
# An n-dimensional array is a more deeply nested list
A = np.array([
              [
                [1,2,3,4],
                [2,3,4,1],
                [-1,1,2,1]
              ],
             [
                 [1,2,3,4],
                 [2,3,4,1],
                 [-1,1,2,1]]
              ])
print(A)
A.shape
[[[ 1  2  3  4]
  [ 2  3  4  1]
  [-1  1  2  1]]

 [[ 1  2  3  4]
  [ 2  3  4  1]
  [-1  1  2  1]]]
Out[14]:
(2, 3, 4)
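Beyond .shape, a few other attributes describe an array's structure. The sketch below uses a stand-in array of zeros with the same shape as A above; the values themselves do not matter here.

```python
import numpy as np

A = np.zeros((2, 3, 4))  # stand-in with the same shape as the array A above

print(A.ndim)   # number of dimensions: 3
print(A.shape)  # length of each dimension: (2, 3, 4)
print(A.size)   # total number of elements: 24
print(A.dtype)  # data type shared by every element: float64
```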

Generating Arrays

np.arange

Generate a range of numbers spaced by a fixed step. The stop value itself is not included.

np.arange(start,stop,by)
In [15]:
np.arange(1, 10, .5 )
Out[15]:
array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. ,
       7.5, 8. , 8.5, 9. , 9.5])
In [3]:
np.arange(0,1+.01,.01)
Out[3]:
array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
       0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
       0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
       0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
       0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54,
       0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65,
       0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76,
       0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87,
       0.88, 0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98,
       0.99, 1.  ])

np.linspace

Generate a vector of length N whose entries are evenly spaced over the requested interval (both endpoints are included).

np.linspace(start,end,length)
In [17]:
np.linspace(1,5,10) 
Out[17]:
array([1.        , 1.44444444, 1.88888889, 2.33333333, 2.77777778,
       3.22222222, 3.66666667, 4.11111111, 4.55555556, 5.        ])
In [18]:
np.linspace(0,1,3) 
Out[18]:
array([0. , 0.5, 1. ])
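The two generators differ mainly in what you specify (a step versus a number of points) and in how they treat the end of the interval. A small side-by-side sketch, with illustrative values chosen so the results are easy to read:

```python
import numpy as np

a = np.arange(0, 1, 0.25)   # step of 0.25, stop value excluded
l = np.linspace(0, 1, 5)    # 5 points, both endpoints included

print(a)  # [0.   0.25 0.5  0.75]
print(l)  # [0.   0.25 0.5  0.75 1.  ]
```

This is why the earlier cell used 1+.01 as the stop value: without the extra step, np.arange would end at 0.99.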

Random Number Generation np.random.

We'll use the random sub-module in numpy (np.random) to generate numeric arrays from known random distributions.

Whenever we randomly generate numbers, we normally want to replicate our results. To do so, we need to set a seed that ensures we'll generate the same random numbers again.

In [19]:
np.random.seed(123)

Generate random numbers from a standard normal distribution.

In [20]:
np.random.randn(10) 
Out[20]:
array([-1.0856306 ,  0.99734545,  0.2829785 , -1.50629471, -0.57860025,
        1.65143654, -2.42667924, -0.42891263,  1.26593626, -0.8667404 ])
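To see what setting the seed buys us, here is a minimal sketch: resetting the seed replays the same draws, while continuing without a reset moves further along the random stream. The variable names are only illustrative.

```python
import numpy as np

np.random.seed(123)
first = np.random.randn(3)

np.random.seed(123)          # reset to the same seed
second = np.random.randn(3)  # same draws as `first`

third = np.random.randn(3)   # no reset: the stream continues with new draws

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```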

Generate an array of random integers within a range: np.random.randint(start,end,n). The start value is included and the end value is excluded.

In [21]:
np.random.randint(1,10,10)
Out[21]:
array([4, 5, 1, 1, 5, 2, 8, 4, 3, 5])

Also, we can generate random values from known distributions, e.g.

  • Normal (Gaussian)
  • Binomial (or Bernoulli, a.k.a. "the coin flip distribution", when there is only a single trial)
  • Poisson
In [22]:
np.random.binomial(1,.5,10) # coin flip distribution
Out[22]:
array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
In [23]:
np.random.normal(5,1,10) # normal (continuous) distribution
Out[23]:
array([5.9071052 , 3.5713193 , 4.85993128, 4.1382451 , 4.74438063,
       2.20141089, 3.2284669 , 4.30012277, 5.92746243, 4.82636432])
In [24]:
np.random.poisson(1,10) # count distribution
Out[24]:
array([2, 0, 0, 1, 0, 2, 1, 2, 0, 0])
In [25]:
np.random.uniform(1,5,10) # uniform distribution
Out[25]:
array([2.66808884, 3.72520306, 4.50182737, 3.04168935, 3.67725513,
       3.34374621, 3.49961401, 3.6987562 , 4.36936975, 1.33277995])

We'll delve more into random number generation later in the semester.

Generating Matrices (2-dimensional Arrays)

In [26]:
# Matrix full of zeros
np.zeros((3,4))
Out[26]:
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
In [27]:
# Matrix full of ones
np.ones((3,4))
Out[27]:
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
In [28]:
# Identity Matrix
np.eye(4)
Out[28]:
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
In [29]:
# empty container: values are uninitialized (whatever was in memory), not guaranteed zeros
empty_array = np.empty((2,3))
empty_array
Out[29]:
array([[0., 0., 0.],
       [0., 0., 0.]])
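The zeros shown above are incidental: np.empty only allocates memory and makes no promise about its contents. A short sketch of the safer pattern, assuming you either fill the container before reading it or ask for a specific fill value with np.full (names are illustrative):

```python
import numpy as np

scratch = np.empty((2, 3))      # allocated but not initialized; contents are arbitrary
filled = np.full((2, 3), -1.0)  # every cell explicitly set to -1.0

scratch[:] = 0                  # fill in values before using the array
print(scratch.shape, filled.shape)
```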

Generating a matrix similar to the one you already have

In [5]:
X = np.zeros((4,4))
np.ones_like(X)
Out[5]:
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
In [ ]:
np.zeros_like(X)

Reshaping Arrays

In [6]:
# Generate a vector of 30 random integers to reshape
v = np.random.randint(1,100,30)
v
Out[6]:
array([41, 28,  3, 55, 35, 43,  1, 77,  5, 95, 12, 69, 60,  1, 94, 76, 58,
       70, 57, 73, 82,  6, 27, 48, 66, 73, 32, 17, 72, 23])
In [7]:
v.reshape(5,6)
Out[7]:
array([[41, 28,  3, 55, 35, 43],
       [ 1, 77,  5, 95, 12, 69],
       [60,  1, 94, 76, 58, 70],
       [57, 73, 82,  6, 27, 48],
       [66, 73, 32, 17, 72, 23]])
In [8]:
v.reshape(10,3)
Out[8]:
array([[41, 28,  3],
       [55, 35, 43],
       [ 1, 77,  5],
       [95, 12, 69],
       [60,  1, 94],
       [76, 58, 70],
       [57, 73, 82],
       [ 6, 27, 48],
       [66, 73, 32],
       [17, 72, 23]])

We can only reshape when the new dimensions multiply out to the array's size (here, 30).

In [9]:
v.reshape(10,2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-44ecc721f6a9> in <module>()
----> 1 v.reshape(10,2)

ValueError: cannot reshape array of size 30 into shape (10,2)
In [10]:
# use -1 to have numpy infer the remaining dimension
P = np.random.randn(20).reshape(10,-1)
P.shape
Out[10]:
(10, 2)
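The -1 placeholder works because the remaining dimension is fully determined by the array's size. A brief sketch on a stand-in vector of 30 elements (the values are not random here, only the size matters):

```python
import numpy as np

v = np.arange(30)

# The product of the new dimensions must equal v.size (30)
print(v.reshape(5, -1).shape)   # (5, 6): -1 is inferred as 30 / 5
print(v.reshape(-1, 3).shape)   # (10, 3)
print(v.reshape(-1).shape)      # (30,): flatten back to a vector

try:
    v.reshape(4, -1)            # 30 is not divisible by 4
except ValueError as err:
    print("ValueError:", err)
```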
In [11]:
P
Out[11]:
array([[-1.48635895e+00, -9.58685682e-01],
       [-1.24432742e+00,  3.14553808e-01],
       [-4.43642384e-01, -4.53963597e-01],
       [ 1.43936170e-01,  6.36996117e-01],
       [-1.47723150e+00,  4.55764472e-01],
       [-4.59038465e-01, -1.99879197e-01],
       [ 4.29367339e-01,  2.05851218e+00],
       [-2.13301257e-01,  9.14560780e-04],
       [ 6.10234542e-01, -9.69664937e-02],
       [-1.23060054e+00, -3.15366792e-01]])
In [12]:
# Alternative way to change the shape
P.shape = 2,10
P.shape
Out[12]:
(2, 10)
In [13]:
P
Out[13]:
array([[-1.48635895e+00, -9.58685682e-01, -1.24432742e+00,
         3.14553808e-01, -4.43642384e-01, -4.53963597e-01,
         1.43936170e-01,  6.36996117e-01, -1.47723150e+00,
         4.55764472e-01],
       [-4.59038465e-01, -1.99879197e-01,  4.29367339e-01,
         2.05851218e+00, -2.13301257e-01,  9.14560780e-04,
         6.10234542e-01, -9.69664937e-02, -1.23060054e+00,
        -3.15366792e-01]])

Indexing and Slicing

M[row,column]
In [14]:
X = np.linspace(1,25,25).reshape(5,5)
X
Out[14]:
array([[ 1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10.],
       [11., 12., 13., 14., 15.],
       [16., 17., 18., 19., 20.],
       [21., 22., 23., 24., 25.]])
In [15]:
X[0] # index first row 
Out[15]:
array([1., 2., 3., 4., 5.])
In [16]:
X[:,0] # index first column
Out[16]:
array([ 1.,  6., 11., 16., 21.])
In [17]:
X[0,0] # index a specific cell 
Out[17]:
1.0
In [ ]:
# Can use : or ... for a complete subsection
print(X[:,1])
print(X[...,1])
In [18]:
# slice rows and columns
X[0:3,0:3] 
Out[18]:
array([[ 1.,  2.,  3.],
       [ 6.,  7.,  8.],
       [11., 12., 13.]])
In [19]:
X[-1,:] # last row
Out[19]:
array([21., 22., 23., 24., 25.])
In [20]:
X[:,-1] # last column
Out[20]:
array([ 5., 10., 15., 20., 25.])

Request specific indices in a given order.

In [21]:
X[[3,0,2],:]
Out[21]:
array([[16., 17., 18., 19., 20.],
       [ 1.,  2.,  3.,  4.,  5.],
       [11., 12., 13., 14., 15.]])
In [22]:
X[:,[3,0,2]]
Out[22]:
array([[ 4.,  1.,  3.],
       [ 9.,  6.,  8.],
       [14., 11., 13.],
       [19., 16., 18.],
       [24., 21., 23.]])
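One subtlety when passing index lists for both rows and columns: numpy pairs them up element by element rather than taking every combination. The sketch below contrasts the two behaviors, using np.ix_ to request the full sub-block; the names `pairs` and `block` are illustrative.

```python
import numpy as np

X = np.linspace(1, 25, 25).reshape(5, 5)

# Two index lists are paired up element by element: cells (3,1) and (0,2)
pairs = X[[3, 0], [1, 2]]

# np.ix_ builds the cross-product: rows 3 and 0, columns 1 and 2
block = X[np.ix_([3, 0], [1, 2])]

print(pairs)  # [17.  3.]
print(block)  # [[17. 18.]
              #  [ 2.  3.]]
```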

Boolean Arrays

We can use vectorization (see below) to great effect with boolean (logical) evaluations. This offers a quick and easy way to subset data by particular conditions.

In [23]:
D = np.random.randint(1,100,50).reshape(10,5)
D
Out[23]:
array([[19, 88, 70,  1, 55],
       [24, 54, 92, 52, 56],
       [97, 35,  5, 70, 40],
       [66,  5, 46, 81, 79],
       [35, 13, 12, 37, 42],
       [18, 85, 59, 51, 25],
       [35, 75, 99, 51, 32],
       [14, 39, 78, 19, 43],
       [88, 17, 99, 47, 77],
       [59,  3, 57, 83, 45]])
In [24]:
D >= 50
Out[24]:
array([[False,  True,  True, False,  True],
       [False,  True,  True,  True,  True],
       [ True, False, False,  True, False],
       [ True, False, False,  True,  True],
       [False, False, False, False, False],
       [False,  True,  True,  True, False],
       [False,  True,  True,  True, False],
       [False, False,  True, False, False],
       [ True, False,  True, False,  True],
       [ True, False,  True,  True, False]])
In [25]:
D[D >= 50]
Out[25]:
array([88, 70, 55, 54, 92, 52, 56, 97, 70, 66, 81, 79, 85, 59, 51, 75, 99,
       51, 78, 88, 99, 77, 59, 57, 83])
In [26]:
D[D >= 50] = -999
D
Out[26]:
array([[  19, -999, -999,    1, -999],
       [  24, -999, -999, -999, -999],
       [-999,   35,    5, -999,   40],
       [-999,    5,   46, -999, -999],
       [  35,   13,   12,   37,   42],
       [  18, -999, -999, -999,   25],
       [  35, -999, -999, -999,   32],
       [  14,   39, -999,   19,   43],
       [-999,   17, -999,   47, -999],
       [-999,    3, -999, -999,   45]])

Reassignment

In [27]:
X
Out[27]:
array([[ 1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10.],
       [11., 12., 13., 14., 15.],
       [16., 17., 18., 19., 20.],
       [21., 22., 23., 24., 25.]])
In [28]:
X[:3,:3] = 0
X
Out[28]:
array([[ 0.,  0.,  0.,  4.,  5.],
       [ 0.,  0.,  0.,  9., 10.],
       [ 0.,  0.,  0., 14., 15.],
       [16., 17., 18., 19., 20.],
       [21., 22., 23., 24., 25.]])
In [29]:
X[1,2] = -999
X[4,4] = -999
In [30]:
X
Out[30]:
array([[   0.,    0.,    0.,    4.,    5.],
       [   0.,    0., -999.,    9.,   10.],
       [   0.,    0.,    0.,   14.,   15.],
       [  16.,   17.,   18.,   19.,   20.],
       [  21.,   22.,   23.,   24., -999.]])
In [31]:
D = np.random.randint(1,100,50).reshape(10,5)
D[D <= 50] = 1
D[D > 50] = 0
D
Out[31]:
array([[1, 1, 0, 0, 1],
       [0, 1, 1, 0, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1],
       [1, 1, 0, 1, 1],
       [0, 0, 1, 1, 1],
       [0, 0, 0, 1, 0],
       [1, 0, 0, 1, 1],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1]])
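The two-step recoding above works because the first replacement value (1) does not satisfy the second condition (> 50). Reversing the order would silently go wrong. A small sketch of the pitfall and one way to avoid it, on a hypothetical four-element array:

```python
import numpy as np

D = np.array([10, 60, 50, 99])

# Order matters with in-place recoding. Recoding the large values to 0 first
# makes them satisfy the second condition (0 <= 50), so everything becomes 1.
wrong = D.copy()
wrong[wrong > 50] = 0
wrong[wrong <= 50] = 1

# Computing the mask once, before any reassignment, avoids the problem.
right = D.copy()
low = right <= 50
right[low] = 1
right[~low] = 0

print(wrong)  # [1 1 1 1]
print(right)  # [1 0 1 0]
```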

If-else with np.where()

Similar to R's ifelse()

In [32]:
D = np.random.randint(1,100,50).reshape(10,5)
D
Out[32]:
array([[49,  7, 46, 35, 85],
       [ 4, 30, 16, 69, 77],
       [19, 60, 52,  2, 74],
       [61, 95, 25, 28,  2],
       [69, 26,  5, 63, 94],
       [15, 29, 78, 31, 76],
       [ 5, 74, 88, 27, 46],
       [ 6, 19, 26, 87, 44],
       [79, 65, 78, 24, 48],
       [83, 86, 40, 22, 36]])
In [33]:
np.where(D>50,0,1)
Out[33]:
array([[1, 1, 1, 1, 0],
       [1, 1, 1, 0, 0],
       [1, 0, 0, 1, 0],
       [0, 0, 1, 1, 1],
       [0, 1, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [1, 0, 0, 1, 1],
       [1, 1, 1, 0, 1],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 1]])
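np.where is a bit more flexible than the example suggests. As a rough sketch on a small hand-typed array: the replacement values can be arrays themselves, and calling np.where with only a condition returns the positions where it holds.

```python
import numpy as np

D = np.array([[49, 7, 85],
              [4, 69, 77]])

# The "yes" and "no" values can themselves be arrays:
# keep values above 50, set everything else to 0
kept = np.where(D > 50, D, 0)

# With only a condition, np.where returns the indices where it holds
rows, cols = np.where(D > 50)

print(kept)        # [[ 0  0 85]
                   #  [ 0 69 77]]
print(rows, cols)  # [0 1 1] [2 1 2]
```

Unlike the in-place reassignment shown earlier, np.where returns a new array and leaves D unchanged.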

Reassignment with Compound Boolean Statements

In [34]:
b = np.random.randint(1,10,100)
b
Out[34]:
array([3, 8, 2, 3, 1, 7, 8, 8, 3, 4, 8, 3, 2, 7, 2, 2, 1, 5, 2, 2, 3, 3,
       5, 7, 3, 4, 9, 4, 6, 6, 2, 6, 1, 7, 2, 6, 3, 8, 4, 2, 8, 7, 3, 3,
       6, 2, 7, 5, 7, 3, 8, 2, 2, 7, 3, 4, 9, 7, 8, 2, 6, 3, 6, 5, 3, 8,
       7, 8, 3, 7, 1, 7, 9, 6, 3, 8, 2, 4, 4, 6, 9, 6, 2, 7, 1, 1, 6, 2,
       5, 9, 9, 6, 7, 9, 6, 7, 3, 1, 9, 7])
In [35]:
b[(b < 8) & (b > 4)]
Out[35]:
array([7, 7, 5, 5, 7, 6, 6, 6, 7, 6, 7, 6, 7, 5, 7, 7, 7, 6, 6, 5, 7, 7,
       7, 6, 6, 6, 7, 6, 5, 6, 7, 6, 7, 7])
In [ ]:
b[(b < 8) & (b > 4)] = -999
b
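A few details of compound conditions are easy to trip over. Element-wise logic uses & (and), | (or), and ~ (not) rather than Python's and/or/not, and each comparison needs its own parentheses because & binds more tightly than the comparison operators. A minimal sketch on a short illustrative vector:

```python
import numpy as np

b = np.array([3, 8, 2, 7, 5, 9, 1])

inside = (b > 4) & (b < 8)      # and: both conditions hold
outside = (b <= 4) | (b >= 8)   # or: at least one condition holds

print(b[inside])    # [7 5]
print(b[outside])   # [3 8 2 9 1]
print(np.array_equal(outside, ~inside))  # True: ~ negates a mask
```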

Stacking

We can easily stack and grow numpy arrays

In [36]:
m1 = np.random.randn(10).reshape(5,-1).round(1)
m2 = np.random.poisson(1,10).reshape(5,-1)
In [37]:
m1
Out[37]:
array([[-0.8,  1.5],
       [-0.5,  0.7],
       [ 1.7, -0.5],
       [ 1.7, -2.2],
       [ 0.5, -0.8]])
In [38]:
m2
Out[38]:
array([[1, 0],
       [1, 1],
       [0, 0],
       [1, 1],
       [2, 2]])

rbind: binding rows

In [39]:
# stack the two matrices row-wise using concatenate
np.concatenate([m1,m2],axis=0)
Out[39]:
array([[-0.8,  1.5],
       [-0.5,  0.7],
       [ 1.7, -0.5],
       [ 1.7, -2.2],
       [ 0.5, -0.8],
       [ 1. ,  0. ],
       [ 1. ,  1. ],
       [ 0. ,  0. ],
       [ 1. ,  1. ],
       [ 2. ,  2. ]])
In [40]:
# or use vertical stack
np.vstack([m1,m2])
Out[40]:
array([[-0.8,  1.5],
       [-0.5,  0.7],
       [ 1.7, -0.5],
       [ 1.7, -2.2],
       [ 0.5, -0.8],
       [ 1. ,  0. ],
       [ 1. ,  1. ],
       [ 0. ,  0. ],
       [ 1. ,  1. ],
       [ 2. ,  2. ]])

cbind: binding columns

In [41]:
np.concatenate([m1,m2],axis=1)
Out[41]:
array([[-0.8,  1.5,  1. ,  0. ],
       [-0.5,  0.7,  1. ,  1. ],
       [ 1.7, -0.5,  0. ,  0. ],
       [ 1.7, -2.2,  1. ,  1. ],
       [ 0.5, -0.8,  2. ,  2. ]])
In [42]:
np.hstack([m1,m2])
Out[42]:
array([[-0.8,  1.5,  1. ,  0. ],
       [-0.5,  0.7,  1. ,  1. ],
       [ 1.7, -0.5,  0. ,  0. ],
       [ 1.7, -2.2,  1. ,  1. ],
       [ 0.5, -0.8,  2. ,  2. ]])
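Two things are worth noticing in the outputs above. The integer matrix was converted to floats, since an array holds a single data type, and stacking only works when the shapes line up along the axis being joined. A short sketch with small stand-in matrices (m3 is hypothetical and exists only to trigger the error):

```python
import numpy as np

m1 = np.array([[-0.8, 1.5],
               [-0.5, 0.7]])     # float64, shape (2, 2)
m2 = np.array([[1, 0],
               [1, 1]])          # integers, shape (2, 2)

stacked = np.vstack([m1, m2])
print(stacked.dtype)             # float64: the integers were upcast

m3 = np.ones((3, 3))
try:
    np.hstack([m1, m3])          # 2 rows vs. 3 rows: cannot bind columns
except ValueError as err:
    print("ValueError:", err)
```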

Views vs. Copies

Note that when we slice an array we do not copy the array, rather we get a "view" of the array.

In [33]:
# Recall the behavior of double assignment with lists
x = [1,2,3]
y = x
y[2] = 100
x
Out[33]:
[1, 2, 100]
In [34]:
# We can get around this behavior by making copies. 
# One way to make a copy is to slice
y = x[:]
y[2] = -999
x
Out[34]:
[1, 2, 100]

When we slice an array, we get a sub-"view" of the data that still affects the original data object.

In [35]:
P = np.ones((5,5))
P
Out[35]:
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])
In [36]:
g = P[:2,:2] 
g
Out[36]:
array([[1., 1.],
       [1., 1.]])
In [37]:
g += 100
g
Out[37]:
array([[101., 101.],
       [101., 101.]])
In [38]:
P
Out[38]:
array([[101., 101.,   1.,   1.,   1.],
       [101., 101.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.]])

As noted in the reading:

"This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer."

To get around this behavior, we again just need to make a copy, here with the .copy() method.

In [39]:
g2 = P[:2,:2].copy()
g2 -= 1000
g2
Out[39]:
array([[-899., -899.],
       [-899., -899.]])
In [40]:
P
Out[40]:
array([[101., 101.,   1.,   1.,   1.],
       [101., 101.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   1.]])
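When it is unclear whether an object is a view or a copy, numpy can tell us. The sketch below uses np.shares_memory and the .base attribute; it also shows that indexing with a list of positions (as in the earlier X[[3,0,2],:] example) returns a copy rather than a view. Variable names are illustrative.

```python
import numpy as np

P = np.ones((5, 5))

view = P[:2, :2]             # basic slice: a view
copy = P[:2, :2].copy()      # explicit copy
fancy = P[[0, 1], :]         # indexing with a list of positions: a copy

print(np.shares_memory(P, view))   # True
print(np.shares_memory(P, copy))   # False
print(np.shares_memory(P, fancy))  # False
print(view.base is P)              # True: the view points back to P
```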

Broadcasting

Broadcasting makes it possible for operations to be performed on arrays of mismatched shapes.

Broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.

For example, say we have a numpy array of dimensions (5,1)

$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} $$

Now say we wanted to add 5 to each value in this array

$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + 5 $$

Broadcasting "pads" the scalar 5 (treated as an array of shape (1,1)) and extends it so that it has the same dimensions as the larger array with which the computation is being performed.

$$ \begin{bmatrix} 1\\2\\3\\4\\5\end{bmatrix} + \begin{bmatrix} 5\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\\\color{lightgrey}{5}\end{bmatrix} $$

$$ \begin{bmatrix} 1 + 5\\2 + 5\\3 + 5\\4 + 5\\5 + 5\end{bmatrix} $$

$$ \begin{bmatrix} 6\\7\\8\\9\\10\end{bmatrix} $$

In [41]:
A = np.array([1,2,3,4,5])
A + 5
Out[41]:
array([ 6,  7,  8,  9, 10])

By 'broadcast', we mean that the smaller array is made to match the size of the larger array in order to allow for element-wise manipulations.

How it works:

  • Shapes of the two arrays are compared element-wise.
  • Dimensions are considered in reverse order, starting with the trailing dimensions, and working forward
  • Conceptually, the smaller array is stretched by copying its elements. However, and this is key, no actual copies are made, which makes the method computationally and memory efficient.
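The "no actual copies" point can be checked directly. As a rough sketch, np.broadcast_to exposes the stretched array that broadcasting works with behind the scenes: it looks like several stacked rows, but every row reads the same memory.

```python
import numpy as np

row = np.arange(3)                         # shape (3,)
stretched = np.broadcast_to(row, (4, 3))   # behaves like 4 stacked copies of row

print(stretched)
print(stretched.strides[0])              # 0: every "row" reads the same memory
print(np.shares_memory(row, stretched))  # True: no data was copied
print(stretched.flags.writeable)         # False: the broadcast view is read-only
```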

A general rule of thumb: each pair of corresponding dimensions must either be equal, or one of the two must be 1.

Rules of Broadcasting

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays (from the reading):

Rule 1

If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.

Rule 2

If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.

Rule 3

If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
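The three rules can be written out as a small function that predicts the shape of a broadcast result. This is only an illustrative sketch: `broadcast_shape` is a hypothetical helper written for these notes, not a numpy function, and numpy's own implementation is not shown here.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the three rules to two shape tuples."""
    # Rule 1: pad the shorter shape with ones on its left side
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)       # sizes agree, or b is stretched (Rule 2)
        elif dim_a == 1:
            result.append(dim_b)       # a is stretched (Rule 2)
        else:
            # Rule 3: sizes disagree and neither is 1
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
    return tuple(result)

print(broadcast_shape((3, 3), (3,)))    # (3, 3)
print(broadcast_shape((3, 1), (3,)))    # (3, 3)
print(broadcast_shape((5, 1), ()))      # (5, 1)
```

The predictions can be compared against numpy itself, for example with (np.ones((3,1)) + np.ones(3)).shape.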

Example 1

In [42]:
np.arange(3) + 5
Out[42]:
array([5, 6, 7])

$$ \texttt{np.arange(3)} = \begin{bmatrix} 0&1&2\end{bmatrix} $$


$$ \texttt{5} = \begin{bmatrix} 5 \end{bmatrix} $$


$$ \begin{bmatrix} 0&1&2\end{bmatrix} + \begin{bmatrix} 5 & \color{lightgrey}{5} & \color{lightgrey}{5}\end{bmatrix} = \begin{bmatrix} 5 & 6 & 7\end{bmatrix} $$

Example 2

In [43]:
np.ones((3,3)) + np.arange(3)
Out[43]:
array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

$$ \texttt{np.ones((3,3)) = }\begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$


$$ \texttt{np.arange(3)} = \begin{bmatrix} 0 & 1 & 2\end{bmatrix} $$


$$ \begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} + \begin{bmatrix} 0 & 1 & 2\\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2} \\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2}\end{bmatrix} = \begin{bmatrix} 1 & 2 & 3\\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix} $$

Example 3

In [44]:
np.arange(3).reshape(3,1) + np.arange(3)
Out[44]:
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

$$ \texttt{np.arange(3).reshape(3,1)} = \begin{bmatrix} 0 \\ 1 \\ 2\end{bmatrix} $$


$$ \texttt{np.arange(3)} = \begin{bmatrix} 0 & 1 & 2\end{bmatrix} $$


$$ \begin{bmatrix} 0 & \color{lightgrey}{0} & \color{lightgrey}{0} \\ 1 & \color{lightgrey}{1} & \color{lightgrey}{1} \\ 2 & \color{lightgrey}{2} & \color{lightgrey}{2}\end{bmatrix} + \begin{bmatrix} 0 & 1 & 2\\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2} \\ \color{lightgrey}{0} & \color{lightgrey}{1} & \color{lightgrey}{2}\end{bmatrix} = \begin{bmatrix} 0 & 1 & 2\\ 1 &2&3 \\ 2& 3 & 4\end{bmatrix} $$

Example 4

Example of dimensional disagreement.

In [45]:
np.ones((4,7)) 
Out[45]:
array([[1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1.]])
In [46]:
np.ones((4,7))  + np.zeros( (5,9) )
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-46dcb0444846> in <module>()
----> 1 np.ones((4,7))  + np.zeros( (5,9) )

ValueError: operands could not be broadcast together with shapes (4,7) (5,9) 
In [ ]:
np.ones((4,7))  + np.zeros( (1,7) )
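A common stumbling block follows from Rule 1: a one-dimensional array is padded on the left, so it lines up with columns, not rows. To apply one value per row, the vector first needs an explicit trailing dimension. A minimal sketch with illustrative names:

```python
import numpy as np

M = np.ones((4, 7))
col_vals = np.arange(7)   # one value per column, shape (7,)
row_vals = np.arange(4)   # one value per row, shape (4,)

print((M + col_vals).shape)                 # (4, 7): (7,) is padded to (1, 7)

try:
    M + row_vals                            # (4,) is padded to (1, 4): 4 vs. 7 fails
except ValueError as err:
    print("ValueError:", err)

print((M + row_vals[:, np.newaxis]).shape)  # (4, 7): (4, 1) stretches across columns
```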

Vectorization

Similar to broadcasting, vectorization allows for simultaneous computation along all values in the array.

In [47]:
X = np.random.randint(1,10,50).reshape(10,5)
X
Out[47]:
array([[7, 4, 7, 7, 7],
       [2, 4, 5, 4, 2],
       [1, 6, 9, 7, 9],
       [2, 1, 4, 2, 4],
       [5, 8, 7, 2, 5],
       [4, 4, 8, 7, 9],
       [7, 5, 5, 8, 1],
       [1, 9, 9, 5, 9],
       [7, 2, 7, 9, 8],
       [2, 8, 2, 8, 9]])
In [48]:
np.log(X)
Out[48]:
array([[1.94591015, 1.38629436, 1.94591015, 1.94591015, 1.94591015],
       [0.69314718, 1.38629436, 1.60943791, 1.38629436, 0.69314718],
       [0.        , 1.79175947, 2.19722458, 1.94591015, 2.19722458],
       [0.69314718, 0.        , 1.38629436, 0.69314718, 1.38629436],
       [1.60943791, 2.07944154, 1.94591015, 0.69314718, 1.60943791],
       [1.38629436, 1.38629436, 2.07944154, 1.94591015, 2.19722458],
       [1.94591015, 1.60943791, 1.60943791, 2.07944154, 0.        ],
       [0.        , 2.19722458, 2.19722458, 1.60943791, 2.19722458],
       [1.94591015, 0.69314718, 1.94591015, 2.19722458, 2.07944154],
       [0.69314718, 2.07944154, 0.69314718, 2.07944154, 2.19722458]])

The computation is applied to every element of the array in a single call, with no explicit loop on our side.

Again, let's consider what this same operation would need to look like if we were dealing with a nested list. We'd need to perform each operation element-by-element in the nested list structure.

In [49]:
X2 = X.tolist()
n_rows = len(X2)
n_cols = len(X2[0])
for i in range(n_rows):
    for j in range(n_cols):
        X2[i][j] = math.log(X2[i][j])
X2
Out[49]:
[[1.9459101490553132,
  1.3862943611198906,
  1.9459101490553132,
  1.9459101490553132,
  1.9459101490553132],
 [0.6931471805599453,
  1.3862943611198906,
  1.6094379124341003,
  1.3862943611198906,
  0.6931471805599453],
 [0.0,
  1.791759469228055,
  2.1972245773362196,
  1.9459101490553132,
  2.1972245773362196],
 [0.6931471805599453,
  0.0,
  1.3862943611198906,
  0.6931471805599453,
  1.3862943611198906],
 [1.6094379124341003,
  2.0794415416798357,
  1.9459101490553132,
  0.6931471805599453,
  1.6094379124341003],
 [1.3862943611198906,
  1.3862943611198906,
  2.0794415416798357,
  1.9459101490553132,
  2.1972245773362196],
 [1.9459101490553132,
  1.6094379124341003,
  1.6094379124341003,
  2.0794415416798357,
  0.0],
 [0.0,
  2.1972245773362196,
  2.1972245773362196,
  1.6094379124341003,
  2.1972245773362196],
 [1.9459101490553132,
  0.6931471805599453,
  1.9459101490553132,
  2.1972245773362196,
  2.0794415416798357],
 [0.6931471805599453,
  2.0794415416798357,
  0.6931471805599453,
  2.0794415416798357,
  2.1972245773362196]]

Vectorization frees us from this tedium. Moreover, because the looping happens inside Numpy's compiled code rather than in Python, it is typically much faster than the equivalent nested loop.
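
To make the comparison concrete, here is a minimal sketch that checks the two approaches agree and then times them. The array size, the seed, and the helper name `loop_log` are arbitrary choices for illustration, and the timings will vary by machine, so none are shown.

```python
import math
import timeit
import numpy as np

rng = np.random.default_rng(0)                 # seeded so the sketch is reproducible
big = rng.integers(1, 10, size=(1000, 50))     # larger than X so timing is visible

def loop_log(arr):
    """Element-by-element log on a nested list, mirroring the loop above."""
    out = arr.tolist()
    for i in range(len(out)):
        for j in range(len(out[0])):
            out[i][j] = math.log(out[i][j])
    return out

# Both approaches produce the same numbers
same = np.allclose(np.log(big), np.array(loop_log(big)))
print(same)

# Time five runs of each approach
print("loop      :", timeit.timeit(lambda: loop_log(big), number=5))
print("vectorized:", timeit.timeit(lambda: np.log(big), number=5))
```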

For example:

In [50]:
# Compute the absolute value of each element in an array
np.abs([1,2,-6,7,8])
Out[50]:
array([1, 2, 6, 7, 8])
In [51]:
# Round Values to the k-th decimal point
np.round(np.log(X),1)
Out[51]:
array([[1.9, 1.4, 1.9, 1.9, 1.9],
       [0.7, 1.4, 1.6, 1.4, 0.7],
       [0. , 1.8, 2.2, 1.9, 2.2],
       [0.7, 0. , 1.4, 0.7, 1.4],
       [1.6, 2.1, 1.9, 0.7, 1.6],
       [1.4, 1.4, 2.1, 1.9, 2.2],
       [1.9, 1.6, 1.6, 2.1, 0. ],
       [0. , 2.2, 2.2, 1.6, 2.2],
       [1.9, 0.7, 1.9, 2.2, 2.1],
       [0.7, 2.1, 0.7, 2.1, 2.2]])
In [52]:
# Count the number of non-zero values
np.count_nonzero(np.array([1,0,8,0,1]))
Out[52]:
3

Numpy comes baked in with a large number of ufuncs (or "universal functions") that are all vectorized. See here for a detailed list of these operations.
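
As a small illustration (the array `v` is made up for this sketch), a ufunc can be recognized by its type, and the arithmetic operators are shorthand for ufuncs. Not every Numpy function is a ufunc: `np.count_nonzero`, used above, is an ordinary function that still operates on whole arrays.

```python
import numpy as np

v = np.array([1, 4, 9, 16])

print(isinstance(np.sqrt, np.ufunc))            # True
print(isinstance(np.count_nonzero, np.ufunc))   # False

print(np.sqrt(v))         # element-wise square root
print(np.add(v, 10))      # same result as v + 10
print(np.maximum(v, 5))   # element-wise maximum against a scalar
```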

Vectorization across array dimensions

Many of Numpy's aggregation functions and methods (e.g. .mean(), .sum()) come with an axis argument that specifies the dimension along which the function should be applied.

In [53]:
A = np.random.randint(1,10,100).reshape(20,5)
A
Out[53]:
array([[8, 2, 4, 2, 9],
       [8, 6, 2, 3, 6],
       [3, 3, 4, 3, 7],
       [8, 2, 4, 9, 4],
       [8, 4, 4, 6, 7],
       [1, 9, 8, 8, 5],
       [5, 6, 1, 9, 3],
       [6, 2, 6, 3, 5],
       [4, 1, 4, 8, 8],
       [3, 6, 2, 8, 6],
       [2, 3, 9, 6, 1],
       [4, 4, 4, 2, 8],
       [7, 4, 2, 8, 3],
       [7, 6, 9, 5, 8],
       [9, 6, 3, 5, 4],
       [4, 2, 8, 9, 6],
       [8, 8, 8, 6, 1],
       [1, 9, 2, 1, 5],
       [6, 5, 6, 1, 8],
       [2, 3, 3, 8, 8]])

Consider calculating the average across some data set. By default, the .mean() method will calculate the average for the entire data matrix.

In [54]:
A.mean()
Out[54]:
5.1

If we want to calculate the mean for each observation (row) or variable (column), we need to use the axis argument to specify which.

  • axis = 0 == collapse the rows (move down each column), returning one value per column
  • axis = 1 == collapse the columns (move across each row), returning one value per row
In [55]:
A.mean(axis=0)
Out[55]:
array([5.2 , 4.55, 4.65, 5.5 , 5.6 ])
In [56]:
A.mean(axis=1)
Out[56]:
array([5. , 5. , 4. , 5.4, 5.8, 6.2, 4.8, 4.4, 5. , 5. , 4.2, 4.4, 4.8,
       7. , 5.4, 5.8, 6.2, 3.6, 5.2, 4.8])
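
The axis convention is easier to verify on an array small enough to compute by hand. The 2x3 array `small` below is made up for this sketch.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])     # 2 rows, 3 columns

print(small.mean(axis=0))   # [2.5 3.5 4.5] -> one value per column
print(small.mean(axis=1))   # [2. 5.]       -> one value per row

# The axis that is named is the one that disappears from the shape
print(small.sum(axis=0).shape, small.sum(axis=1).shape)   # (3,) (2,)
```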

Building vectorized functions

Consider the following function that yields a different string when input a is larger/smaller than input b.

In [57]:
def bigsmall(a,b):
    if a > b:
        return "A is larger"
    else:
        return "B is larger"
In [58]:
bigsmall(5,6)
Out[58]:
'B is larger'
In [59]:
bigsmall(6,5)
Out[59]:
'A is larger'

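We can apply this function element-wise across arrays using np.vectorize(). Note that np.vectorize() is a convenience wrapper: it still loops over the elements in Python, so it does not bring the speed of a built-in ufunc.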

In [60]:
# Create a vectorized version of the function
vec_bigsmall = np.vectorize(bigsmall)
vec_bigsmall 
Out[60]:
<numpy.lib.function_base.vectorize at 0x7faf1810c9b0>
In [61]:
# And now implement on arrays of numbers!
vec_bigsmall([0,2,5,7,0],[4,3,10,2,6])
Out[61]:
array(['B is larger', 'B is larger', 'B is larger', 'A is larger',
       'B is larger'], dtype='<U11')
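
For a simple two-way condition like this one, `np.where` is one possible alternative that keeps the comparison inside Numpy. This is a sketch of that option, not a replacement for the approach above; the arrays `a` and `b` reuse the same values.

```python
import numpy as np

a = np.array([0, 2, 5, 7, 0])
b = np.array([4, 3, 10, 2, 6])

# a > b is evaluated element-wise; np.where picks a label for each position
labels = np.where(a > b, "A is larger", "B is larger")
print(labels)
```

As with bigsmall(), ties fall into the "B is larger" branch.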

Road to the DataFrame → Handling Multiple Data Types

Out of the box, numpy arrays can only hold one data type at a time...

In [62]:
x = np.array([1,2,3,4])
x.dtype                # examine the data type contained within 
Out[62]:
dtype('int64')

And we can't change the data type on the fly by assigning a value of a different type: the new value is coerced to the array's existing type (below, .04 is truncated to 0), and a value that can't be coerced raises an error.

In [63]:
x[1] = .04
x   
Out[63]:
array([1, 0, 3, 4])
In [64]:
x[1] = "this"
x
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-64-b16414e738a9> in <module>()
----> 1 x[1] = "this"
      2 x

ValueError: invalid literal for int() with base 10: 'this'

To store values of a different type, we need to convert the array with .astype(), which returns a new array of the requested type (x itself is left unchanged).

In [65]:
x.astype('f')
Out[65]:
array([1., 0., 3., 4.], dtype=float32)
In [66]:
x.astype('U')
Out[66]:
array(['1', '0', '3', '4'], dtype='<U21')
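
Since .astype() returns a new array by default, the converted result has to be assigned to a name before it can be used. A minimal sketch, where `x_float` is a name introduced only for this example:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

x_float = x.astype(float)   # new array; x keeps its integer type
x_float[1] = .04            # the float array stores the value as given

print(x.dtype.kind)         # 'i' -> x is still an integer array
print(x)
print(x_float)
```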

List of all data types and their conversions (table drawn from reading)

Character   Description              Example
b           Byte                     np.dtype('b')
i           Signed integer           np.dtype('i4') == np.int32
u           Unsigned integer         np.dtype('u1') == np.uint8
f           Floating point           np.dtype('f8') == np.float64
c           Complex floating point   np.dtype('c16') == np.complex128
S, a        String                   np.dtype('S5')
U           Unicode string           np.dtype('U') == np.str_
V           Raw data (void)          np.dtype('V') == np.void

This limitation also applies when we build an array from heterogeneous data: every value is coerced to a single common type.

In [67]:
nested_list = [['a','b','c'],[1,2,3],[.3,.55,1.2]]
nested_list
Out[67]:
[['a', 'b', 'c'], [1, 2, 3], [0.3, 0.55, 1.2]]
In [68]:
data = np.array(nested_list).T
data
Out[68]:
array([['a', '1', '0.3'],
       ['b', '2', '0.55'],
       ['c', '3', '1.2']], dtype='<U4')

All the data in the matrix is treated as a string!

Structured Arrays

To get around this, we need to again be explicit about the data type of each column. Here we pre-specify an empty data table with a name and a data type for each column.

In [69]:
data = np.zeros((3), dtype={'names':('v1', 'v2', 'v3'),
                            'formats':('U5', 'i', 'f')})
data
Out[69]:
array([('', 0, 0.), ('', 0, 0.), ('', 0, 0.)],
      dtype=[('v1', '<U5'), ('v2', '<i4'), ('v3', '<f4')])

We then load the data to the specified columns.

In [70]:
data['v1'] = ['a','b','c']
data['v2'] = [1,2,3]
data['v3'] = [.3,.55,1.2]
data
Out[70]:
array([('a', 1, 0.3 ), ('b', 2, 0.55), ('c', 3, 1.2 )],
      dtype=[('v1', '<U5'), ('v2', '<i4'), ('v3', '<f4')])

We can then index the data, but we do so by field (column) name rather than by column position as we did above.

In [71]:
data['v1']
Out[71]:
array(['a', 'b', 'c'], dtype='<U5')
In [72]:
data[1][['v1','v2']]
Out[72]:
('b', 2)
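
Field access combines with boolean masks in the usual way. The sketch below rebuilds the same structured array so that it runs on its own, then filters rows on one field and reads another.

```python
import numpy as np

data = np.zeros(3, dtype={'names': ('v1', 'v2', 'v3'),
                          'formats': ('U5', 'i', 'f')})
data['v1'] = ['a', 'b', 'c']
data['v2'] = [1, 2, 3]
data['v3'] = [.3, .55, 1.2]

# Boolean mask built from one field, applied to the rows
print(data[data['v2'] > 1]['v1'])    # ['b' 'c']

# The field names are stored on the dtype
print(data.dtype.names)              # ('v1', 'v2', 'v3')
```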

Though it is possible to handle heterogeneous data using numpy, constructing the data object involves a lot of overhead. As such, we'll use Pandas Series and DataFrames to deal with heterogeneous data.

Miscellaneous

Printing Numpy Arrays

Numpy automatically truncates the output when printing large arrays, which is handy when you have a lot of data.

In [73]:
print(np.arange(10000).reshape(100,100))
[[   0    1    2 ...   97   98   99]
 [ 100  101  102 ...  197  198  199]
 [ 200  201  202 ...  297  298  299]
 ...
 [9700 9701 9702 ... 9797 9798 9799]
 [9800 9801 9802 ... 9897 9898 9899]
 [9900 9901 9902 ... 9997 9998 9999]]
In [74]:
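# We can adjust these settings. Passing threshold=None leaves the setting
# unchanged; sys.maxsize turns off the "..." summarization entirely.
import sys
np.set_printoptions(threshold=sys.maxsize)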
print(np.arange(100).reshape(10,10))
[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]
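
A scoped alternative to changing the global settings is the `np.printoptions` context manager (available in Numpy 1.15 and later), which restores the previous options on exit. The threshold and edgeitems values below are arbitrary choices for this sketch.

```python
import numpy as np

arr = np.arange(20)

# Inside the block, arrays with more than 5 elements are summarized,
# showing 2 items at each edge
with np.printoptions(threshold=5, edgeitems=2):
    text = str(arr)

print(text)
```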

Missing Values

Numpy represents missing values with nan ("Not a Number"), a special floating-point value (see here).

In [75]:
Y = np.random.randint(1,10,25).reshape(5,5) + .0
Y
Out[75]:
array([[4., 7., 2., 9., 5.],
       [2., 7., 6., 1., 4.],
       [2., 4., 8., 8., 8.],
       [6., 4., 7., 4., 6.],
       [2., 6., 2., 4., 2.]])
In [76]:
Y[Y > 5] = np.nan
Y
Out[76]:
array([[ 4., nan,  2., nan,  5.],
       [ 2., nan, nan,  1.,  4.],
       [ 2.,  4., nan, nan, nan],
       [nan,  4., nan,  4., nan],
       [ 2., nan,  2.,  4.,  2.]])
In [77]:
type(np.nan)
Out[77]:
float
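
Because nan is a float with special comparison rules, it is never equal to anything, including itself. An equality test therefore cannot locate missing values, and `np.isnan` is needed instead. The array `vals` is made up for this sketch.

```python
import numpy as np

print(np.nan == np.nan)          # False

vals = np.array([1.0, np.nan, 3.0])
print(vals == np.nan)            # [False False False]
print(np.isnan(vals))            # [False  True False]
```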
In [78]:
# scan for missing values
np.isnan(Y)
Out[78]:
array([[False,  True, False,  True, False],
       [False,  True,  True, False, False],
       [False, False,  True,  True,  True],
       [ True, False,  True, False,  True],
       [False,  True, False, False, False]])
In [79]:
~np.isnan(Y) # are not NAs
Out[79]:
array([[ True, False,  True, False,  True],
       [ True, False, False,  True,  True],
       [ True,  True, False, False, False],
       [False,  True, False,  True, False],
       [ True, False,  True,  True,  True]])

When we have missing values, we'll run into issues when computing across the data matrix.

In [80]:
np.mean(Y)
Out[80]:
nan

To get around this, we need to use special versions of these functions that skip over nan values.

In [81]:
np.nanmean(Y)
Out[81]:
3.0
In [82]:
np.nanmean(Y,axis=0)
Out[82]:
array([2.5       , 4.        , 2.        , 3.        , 3.66666667])
In [83]:
# Mean impute the missing values (a boolean mask can index the array directly)
Y[np.isnan(Y)] = np.nanmean(Y)
Y
Out[83]:
array([[4., 3., 2., 3., 5.],
       [2., 3., 3., 1., 4.],
       [2., 4., 3., 3., 3.],
       [3., 4., 3., 4., 3.],
       [2., 3., 2., 4., 2.]])
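
The cell above fills every gap with the overall mean. A common variant is to impute each column with its own mean, which combines `np.nanmean(axis=0)` with broadcasting. This is a sketch on a small made-up array `Z`, not part of the original notebook.

```python
import numpy as np

Z = np.array([[4.,     np.nan, 2.],
              [2.,     4.,     np.nan],
              [np.nan, 6.,     5.]])

col_means = np.nanmean(Z, axis=0)    # one mean per column, ignoring nan

# col_means has shape (3,) and is broadcast across the rows of Z
Z_filled = np.where(np.isnan(Z), col_means, Z)
print(Z_filled)
```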