Learning Objectives

In the Asynchronous Lecture

Understand comprehensions and generators in Python.
Think through file management and reading data into Python.
Discuss navigating data stored as a nested list.
Introduce the numpy array data structure.

In the Synchronous Lecture

Delve further into numpy arrays.
Think about managing multiple data types in a relational array with pandas.
Explore indices in pandas Series and DataFrame objects.

If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.

Synchronous Materials

Lecture Notebooks
- More on numpy
- Introduction to Data in pandas

Asynchronous Materials

The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.

Total time: Approx. 1 hour and 14 minutes


Comprehensions

Code from the video

Download Jupyter Notebook used in the video.
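
The notebook itself is not reproduced on this page. As a minimal sketch of the idea (an illustration with made-up values, not the code from the video), a comprehension collapses a loop that builds a container into a single expression:

```python
# Illustrative values only; not taken from the course notebook.
words = ["data", "frame", "list", "array"]

# Loop version: build a list of word lengths.
lengths_loop = []
for w in words:
    lengths_loop.append(len(w))

# List comprehension: same result in one expression.
lengths = [len(w) for w in words]

# A condition filters items as they are generated.
long_words = [w for w in words if len(w) > 4]

# Dictionary and set comprehensions follow the same pattern.
length_by_word = {w: len(w) for w in words}
unique_lengths = {len(w) for w in words}

print(lengths)         # [4, 5, 4, 5]
print(long_words)      # ['frame', 'array']
print(length_by_word)  # {'data': 4, 'frame': 5, 'list': 4, 'array': 5}
```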

Generators

Code from the video

Download Jupyter Notebook used in the video. (NOTE: This is the same notebook as the one downloaded in the “Comprehensions” tab.)
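
As a rough sketch of how a generator differs from a list (the function `squares` and the values are hypothetical, not the code from the video), a generator produces its values one at a time and only when asked:

```python
import sys

# Generator function: yields one value at a time instead of building a list.
def squares(n):
    for i in range(n):
        yield i ** 2

gen = squares(5)
print(next(gen))   # 0
print(next(gen))   # 1
print(list(gen))   # [4, 9, 16] (only the values not yet consumed)

# Generator expression: comprehension syntax with parentheses.
total = sum(i ** 2 for i in range(5))
print(total)       # 30

# A generator object stays small no matter how many values it will produce.
as_list = [i for i in range(100_000)]
as_gen = (i for i in range(100_000))
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))  # True
```

Once a generator is exhausted it yields nothing further, so it has to be recreated before it can be looped over again.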

File Management

Code from the video

Download Jupyter Notebook used in the video.
Download the data used in the Notebook:
- news-story.txt
- student_data.csv
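
The pattern for writing and reading files can be sketched without the downloads above. The file names and contents below are made up for illustration, and a temporary directory is assumed so the code runs on any machine:

```python
import csv
import os
import tempfile

# Work in a temporary directory so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = os.path.join(tmp_dir, "example.txt")   # hypothetical file
    csv_path = os.path.join(tmp_dir, "example.csv")   # hypothetical file

    # Write a text file; the `with` block closes the file automatically.
    with open(txt_path, mode="wt", encoding="utf-8") as file:
        file.write("first line\nsecond line\n")

    # Read it back line by line.
    with open(txt_path, mode="rt", encoding="utf-8") as file:
        lines = [line.strip() for line in file]

    # Write and read a small .csv with the csv module.
    rows = [["name", "score"], ["a", "1"], ["b", "2"]]
    with open(csv_path, mode="wt", encoding="utf-8", newline="") as file:
        csv.writer(file).writerows(rows)
    with open(csv_path, mode="rt", encoding="utf-8", newline="") as file:
        rows_back = [row for row in csv.reader(file)]

print(lines)      # ['first line', 'second line']
print(rows_back)  # [['name', 'score'], ['a', '1'], ['b', '2']]
```

Note that `csv.reader` returns every value as a string, so numeric columns need an explicit conversion such as `float()`.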

Data as Nested Lists

Code from the video

Download the aggregated version of the gapminder.csv data used in the video.

# "Batteries included" modules from the standard library
import csv # read a .csv into a nested list
import os  # interact with the operating system (paths, working directory)


# Where am I on my computer?
os.getcwd()


# Say I needed to change my working directory
# os.chdir("file/path/here")

# THIS IS WHERE MY DATA IS LOCATED ON MY MACHINE; THIS WILL NOT RUN ON YOUR
# COMPUTER. Change this path to point to where the gapminder data is. Above this
# code chunk you'll see a link to download the data.
loc_file = "lectures/week_05/supplementary-materials/gapminder.csv"

# Read in the gapminder data (newline="" is what the csv module recommends)
with open(loc_file,mode="rt",newline="") as file:
    data = [row for row in csv.reader(file)]


# %% -----------------------------------------
# Indexing Rows

# Row 0 holds the column names, so the data rows start at index 1.
data[100]


# %% -----------------------------------------
# Indexing Columns

# Referencing a column data value
d = data[100] # First select the row
d[1] # Then reference the column

# doing the above all in one step
data[100][1]

# The key is to keep track of the column names. Removing the header row
# shifts every data row up by one position (the old data[100] is now data[99]).
cnames = data.pop(0)

# We can now reference this column name list to pull out the columns we're interested in.
ind = cnames.index("lifeExp") # .index() "looks up" the position of a value in a list.
data[99][ind]


# %% -----------------------------------------
# Drawing out a specific COLUMN of data

# identify the position
ind = cnames.index("lifeExp")

# Looping through each row pulling out the relevant data value
life_exp = []
for row in data:
    life_exp.append(float(row[ind]))

# Same idea, but as a list comprehension 
life_exp = [float(row[ind]) for row in data]


# Make this code more flexible 
var_name = "gdpPercap"
out = [row[cnames.index(var_name)] for row in data]
out


# %% -----------------------------------------
# Numpy offers an efficiency boost, especially when indexing
import numpy as np


# Convert to a numpy array
data_np = np.array(data)


# Accessing a column variable is easy with slicing.
data_np[:,2]


# Let's compare runtimes!
# Position of the column extracted in the timing cells below
var_ind = cnames.index(var_name)

# %% -----------------------------------------
%%timeit
out1 = []
for row in data:
    out1.append(row[var_ind])


# %% -----------------------------------------
%%timeit
out2 = [row[var_ind] for row in data]

    
# %% -----------------------------------------
%%timeit
out3 = data_np[:,var_ind]
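
`%%timeit` is an IPython cell magic, so the three cells above only run inside Jupyter or IPython. As a hedged sketch of the same comparison in a plain Python script, the standard library's `timeit` module can be used. The data here is a synthetic stand-in (made-up values, not gapminder), and the function names are hypothetical:

```python
import timeit

import numpy as np

# Synthetic stand-in for the gapminder rows (values are made up).
n_rows, n_cols = 1000, 6
data = [[str(r * n_cols + c) for c in range(n_cols)] for r in range(n_rows)]
data_np = np.array(data)
var_ind = 2  # position of the column to extract

def with_loop():
    out = []
    for row in data:
        out.append(row[var_ind])
    return out

def with_comprehension():
    return [row[var_ind] for row in data]

def with_numpy():
    return data_np[:, var_ind]

# All three approaches return the same column.
assert with_loop() == with_comprehension() == with_numpy().tolist()

# Time each approach; the numbers depend on the machine.
for func in (with_loop, with_comprehension, with_numpy):
    seconds = timeit.timeit(func, number=1000)
    print(f"{func.__name__}: {seconds:.4f} s for 1000 runs")
```

The numpy slice returns a view of the existing array rather than building a new list, which is one reason it tends to be faster here.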

`numpy`

Code from the video

import numpy as np

#### Vectors, Matrices, and N-Dimensional Arrays ####

# %% vectors (1 Dimension) ----------------------------------------- 
v = np.array([1,2,3,4])
v


# %% Matrix (2 Dimensions) -----------------------------------------
NL = [[1,2,3,4],[2,3,4,1],[-1,1,2,1]]
M = np.array(NL)
M

M.shape


# %% N-dimensional Array -----------------------------------------

# An N-dimensional array can be built from a nested list
A = np.array([
    [
        [1,2,3,4],
        [2,3,4,1],
        [-1,1,2,1]
    ],
    [
        [1,2,3,4],
        [2,3,4,1],
        [-1,1,2,1]
    ]
])
A
A.shape

# %%  -----------------------------------------

###### Generating Arrays #####



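# .arange (start, stop, step; the stop value is excluded)
np.arange(1, 10, 1 )


# .linspace (10 evenly spaced values from 1 to 5, endpoints included)
np.linspace(1,5,10) 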


# Zeros
np.zeros(10)

# Ones
np.ones(10)

# Random number generation 
np.random.randn(10) # Standard normal draws (mean 0, sd 1)
np.random.randint(1,10,10) # Random integers from 1 to 9 (upper bound excluded)
np.random.uniform(1,5,10) # Uniform distribution on [1, 5)
np.random.binomial(1,.5,10) # Binomial (1 trial, p = .5)
np.random.normal(5,1,10) # Normal (mean 5, sd 1)
np.random.poisson(5,5) # Poisson (rate 5, 5 draws)

# %% Indexing -----------------------------------------

M 

M.shape

# [ROW, COLUMN]
# ":" == "all" (every position along that dimension)

# A cell
M[0,0] 

# A row
M[1,:]


# A column
M[:,1]


# Slicing the data structure works as it did with other python data types

M[0:2,0:2]

# Last Column
M[:,-1] 


# Last Row
M[-1, : ] 


# Change the order by the requested positions
M
M[[2,0,1],:]


# %% Indexing Using Conditionals -----------------------------------------

X = np.random.randn(10)

X

# Vector of Boolean values
X > 0

# Can index using this vector 
X
X[X>0]


# Logic extends to any N-dimensional Array 
X = np.random.randn(50).reshape(10,5)
X
X[X > 0]


# %% Reshaping -----------------------------------------

# Call the shape of an array
v = np.random.randint(1,100,30)
v 
v.shape

# Reshape
v.reshape(10,3)


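# Reshape has to be plausible: the new dimensions must account for all 30
# elements, so this call raises a ValueError (10 * 2 != 30).
try:
    v.reshape(10,2)
except ValueError as err:
    print(err)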


# %% Reassignment -----------------------------------------

X = np.zeros(50).reshape(10,5)
X



# Reassign data values by referencing positions
X[0,0] = 999
X



# Reassign whole ranges of values
X[0,:] = 999
X

X[:,0] = 999
X

# Reassignment using boolean values. 
D = np.random.randn(50).reshape(10,5).round(1)
D

D > 0

D[D > 0] = 1
D[D <= 0] = 0
D


# Using np.where(), an "ifelse()-like" function
D = np.random.randn(50).reshape(10,5).round(1) # Generate some random numbers again
D # Before 
np.where(D>0,1,0) # After
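
Because the arrays above are random, their output changes on every run. The following sketch repeats the boolean-indexing and `np.where` steps on a small fixed array (made-up values) so the results can be checked by hand:

```python
import numpy as np

# Fixed values (made up) so the results are reproducible.
D = np.array([[ 0.5, -1.2,  0.0],
              [-0.3,  2.1, -0.7]])

# Boolean indexing returns a flattened 1-D array of the matching values.
positives = D[D > 0]
print(positives)        # [0.5 2.1]

# np.where(condition, a, b) keeps the original shape and leaves D unchanged.
flags = np.where(D > 0, 1, 0)
print(flags)
# [[1 0 0]
#  [0 1 0]]

# In-place reassignment with a mask modifies the array itself.
E = D.copy()
E[E > 0] = 1
E[E <= 0] = 0
print(np.array_equal(E, flags))  # True
```

The design difference is that mask assignment overwrites the array in place, while `np.where` returns a new array and leaves the original available for later use.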

Practice

These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.


Question 1

Convert the following loop into a list comprehension.

bind = []
for i in range(10):
  for j in "georgetown":
      if j != "g":
        bind.append((i,j))
print(bind)


Answer

bind = [(i,j) for i in range(10) for j in "georgetown" if j != "g"]
print(bind)
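
One way to check an answer like this is to build the list both ways and compare. This is a small verification sketch; the names `bind_loop` and `bind_comp` are introduced here only to keep the two versions apart:

```python
# Loop version from the question.
bind_loop = []
for i in range(10):
    for j in "georgetown":
        if j != "g":
            bind_loop.append((i, j))

# Comprehension version: the `for` clauses appear in the same order as the nested loops.
bind_comp = [(i, j) for i in range(10) for j in "georgetown" if j != "g"]

print(bind_loop == bind_comp)  # True
print(len(bind_comp))          # 80
print(bind_comp[:3])           # [(0, 'e'), (0, 'o'), (0, 'r')]
```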

Question 2

Save the following lines of text to your Desktop as a .txt file named zen_of_python.txt.

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.


Answer

# Store the text as a string
txt = """
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
"""

# Define the relevant file path to your Desktop
file_path = ""

# Open a connection and write the text; the with block closes the file.
with open(file_path + "zen_of_python.txt",mode="wt",encoding="utf-8") as file:
  file.write(txt)
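
To confirm that a file was written as intended, it can be read back and compared with the original string. The sketch below assumes a temporary directory in place of the Desktop and uses only the first two lines of the text to stay short:

```python
import os
import tempfile

txt = "Beautiful is better than ugly.\nExplicit is better than implicit.\n"

# A temporary directory stands in for the Desktop path (assumption).
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "zen_of_python.txt")

    with open(file_path, mode="wt", encoding="utf-8") as file:
        file.write(txt)

    with open(file_path, mode="rt", encoding="utf-8") as file:
        txt_back = file.read()

print(txt_back == txt)             # True
print(len(txt_back.splitlines()))  # 2
```

Building the path with `os.path.join` avoids having to remember a trailing slash when combining a folder and a file name.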

Question 3

Using the following data, write a function called select() that takes the nested list data and a variable name as input and returns the requested variable as a single list. Make sure the function can deal with cases when a variable that is not in the data is requested (e.g. the variable name is misspelled). Make sure you include a docstring with your function.

data = [
  ["Var1","Var2","Var3"],
  [1,"Apples",True],
  [4,"Horses",None],
  [-1,"Small Birds",False],
]


Answer

def select(data,variable):
    """Select a column by name from data organized as a nested list.

    Args:
        data (list): nested list whose first row holds the column names.
        variable (str): name of the variable being selected.

    Returns:
        list: the requested data column, or None if the variable
        is not in the data.
    """
    cnames = data[0]  # Look up the header without modifying data
    if variable not in cnames:
        print(f"Variable '{variable}' is not in the data.")
        return None
    ind = cnames.index(variable)
    return [row[ind] for row in data[1:]]

# Test
print(select(data,"Var2"))

## ['Apples', 'Horses', 'Small Birds']
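
A different technique, shown here only as a sketch and not as the expected answer, is to transpose the rows once with `zip` and store every column in a dictionary keyed by name. The helper `to_columns` is hypothetical:

```python
data = [
    ["Var1", "Var2", "Var3"],
    [1, "Apples", True],
    [4, "Horses", None],
    [-1, "Small Birds", False],
]

def to_columns(data):
    """Return a dict mapping each column name to its list of values.

    Hypothetical helper, not part of the exercise answer.
    """
    header, *rows = data            # split without modifying `data`
    columns = zip(*rows)            # transpose rows into columns
    return {name: list(col) for name, col in zip(header, columns)}

columns = to_columns(data)
print(columns["Var2"])              # ['Apples', 'Horses', 'Small Birds']
print(columns.get("Var9"))          # None (misspelled name)
print(len(data))                    # 4 (input left intact)
```

This trades a little memory for repeated lookups: every column is extracted once, and `dict.get` handles a misspelled name without raising an error.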

The following materials were generated for students enrolled in PPOL564. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com


Long Live the Data Frame
From Nested Lists to Data Frames

PPOL 564 | Data Science I | Foundations

Lecture Materials for Week 5

Professor Eric Dunford (ed769@georgetown.edu)
McCourt School of Public Policy, Georgetown University
