PPOL564 - Data Science I: Foundations

Lecture 4

Data Types in Python

Plan for Today

  • Objects
  • Cover the standard data types in python
    • scalar and collection types
    • methods
  • Mutable vs. Immutable data types
  • Copies
  • Modules
  • Manipulating Mutable Data Structures (see other notebook)
  • See supplement notebook for a more detailed look at the functionality of strings and dates.

Being Pythonic

Whitespace is significant
Everything is an object
Aim for readability

Follow PEPs

  • Changes to Python are proposed and documented in Python Enhancement Proposals (or PEPs)
    • When there is a change to the language, it is recorded here
    • The Python "philosophy" also lives here in the form of style suggestions (e.g. PEP 8 on spacing and layout)
In [1]:
# Python has a sort of philosophy to it. 
import this 
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Objects

  • = is the assignment operator. Assignment only binds a name to an object; it never copies an object by value (more on this below).
    • NOTE: variables aren't values that we are "placing into a box" called x. Rather, variables are just labels we are putting on values.
  • An object's "type" is determined at runtime (this differs from languages where the type must be declared explicitly). The type cannot be changed once established, but the object's state can change if it is mutable (more on this below).
  • A name holds a reference to an object (e.g. below, 'x' references the object '4').
  • There can be multiple references to the same object (more on this below).
  • Objects are assigned a unique object id when they are created.
In [2]:
x = 4
x
Out[2]:
4
In [3]:
type(x)
Out[3]:
int
In [4]:
id(x) # Identity of the object
Out[4]:
4419394656
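
The "labels, not boxes" idea can be sketched in a few lines. This is only an illustration; the names `label` and `other` are made up for the example and do not appear elsewhere in the lecture.

```python
# A name can be re-bound to objects of different types at runtime.
label = 4
first_type = type(label)       # int
label = "four"
second_type = type(label)      # str: the name moved, the int object 4 was not changed

# Two names can refer to the very same object.
other = label
same_object = other is label   # True
print(first_type, second_type, same_object)
```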

Scalar Data Types

type      Description                                       Example
int       integer types                                     4
float     64-bit floating point numbers                     4.567
bool      boolean logical values                            True
None      null object (serves as a valuable placeholder)    None

All scalar data types are immutable.

int

In [5]:
x = 1
type(x)
Out[5]:
int
In [6]:
int(3.4) # Constructor
Out[6]:
3

float

In [7]:
x = 1.3
type(x)
Out[7]:
float
In [8]:
float(1) # Constructor
Out[8]:
1.0

float + int = float

In [9]:
3.0 + 3 
Out[9]:
6.0

Calculations with integers and floats

In [10]:
# addition
4 + 4
Out[10]:
8
In [11]:
# Subtraction 
50 - 25
Out[11]:
25
In [12]:
# Multiplication
5 * 5
Out[12]:
25

Float Division Operator (/) vs. Integer Division Operator (//)

In [13]:
1000/50 # float division
Out[13]:
20.0
In [14]:
1000//50 # integer division
Out[14]:
20
In [15]:
# Exponentiation
5**4
Out[15]:
625
In [16]:
# Remainders ('modulo')
10%6
Out[16]:
4
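
A few edge cases of the division operators are worth seeing once. The sketch below is illustrative (the variable names are made up) and relies only on built-in behavior: `//` rounds toward negative infinity, and `%` is defined to be consistent with it.

```python
# Floor division rounds toward negative infinity, not toward zero.
q_pos = 7 // 2      # 3
q_neg = -7 // 2     # -4

# The remainder is defined so that a == (a // b) * b + (a % b) always holds.
r_neg = -7 % 2      # 1

# divmod() returns the quotient and the remainder together as a tuple.
pair = divmod(17, 5)    # (3, 2)

# If either operand is a float, floor division returns a float.
q_float = 7.0 // 2      # 3.0
print(q_pos, q_neg, r_neg, pair, q_float)
```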

NoneType

In [17]:
y = None
type(y)
Out[17]:
NoneType
In [18]:
None == 6
Out[18]:
False
In [19]:
6 is None
Out[19]:
False
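
A common use of None is as a "nothing found" return value. The function below is a hypothetical example (not part of the lecture code) that shows the usual idiom of testing with `is None`.

```python
def find_first_negative(values):
    """Return the first negative number in values, or None if there is none."""
    for v in values:
        if v < 0:
            return v
    return None

result = find_first_negative([3, 0, 8])

# Prefer `is None` over `== None`: there is only one None object,
# and an identity test cannot be altered by a custom __eq__ method.
if result is None:
    message = "no negative value found"
else:
    message = f"found {result}"
print(message)
```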

bool

In [20]:
b = True
type(b)
Out[20]:
bool
In [21]:
b == False
Out[21]:
False
In [22]:
# Any non-zero value is truthy
print(bool(1))
print(bool(55))
print(bool(-5500))
True
True
True
In [23]:
# Zero is false
bool(0)
Out[23]:
False
In [24]:
# NoneTypes are false
bool(None)
Out[24]:
False
In [25]:
# Empty containers are false
print(bool([]))
print(bool({}))
print(bool(""))
False
False
False
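
These truthiness rules are what make short conditionals such as `if container:` work. A minimal sketch, with `describe` as a made-up helper name:

```python
def describe(container):
    # An empty container is falsy, so no explicit len() check is needed.
    if container:
        return "has items"
    return "empty"

print(describe([]))     # empty
print(describe([0]))    # has items: the list is non-empty even though 0 itself is falsy

# bool is a subclass of int, so True and False behave like 1 and 0 in arithmetic.
total = sum([True, False, True])
print(total)            # 2
```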

Object types determine behavior

Python knows how to behave given the methods attached to an object's type when we create an instance. These methods dictate how different data types handle similar operations (such as addition, multiplication, comparisons, etc.)

Note that special attributes in python are delimited by double underscores (or "dunder")

In [26]:
x = 4
int(4)
Out[26]:
4
In [27]:
dir(x) 
Out[27]:
['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']
In [28]:
x.__add__(4) # addition
Out[28]:
8
In [29]:
x.__mod__(3) # modulo
Out[29]:
1
In [30]:
x.__mul__(6) # multiplication, etc. 
Out[30]:
24
In [31]:
x.__eq__(5) # 4 == 5
Out[31]:
False
In [32]:
x.__float__()
Out[32]:
4.0
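
The same mechanism is available to types we define ourselves. The class below is a hypothetical illustration (classes are not otherwise covered in this lecture): by defining `__add__` and `__eq__`, we tell Python what `+` and `==` should mean for its instances.

```python
class Money:
    """Hypothetical class used only to illustrate dunder methods."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called for `self + other`
        return Money(self.cents + other.cents)

    def __eq__(self, other):
        # Called for `self == other`
        return isinstance(other, Money) and self.cents == other.cents

    def __repr__(self):
        # Controls how the object is displayed
        return f"Money({self.cents})"

total = Money(150) + Money(275)
print(total)                  # Money(425)
print(total == Money(425))    # True
```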

Collection/Container Data Types

Type     Description                                 Example      Mutable
list     heterogeneous sequences of objects          [1,2,3]      ✓
str      sequences of characters                     "A word"     ✘
dict     associative array of key/value mappings     {"a": 1}     keys ✘ values ✓
set      unordered collection of distinct objects    {1,2,3}      ✓
tuple    heterogeneous sequence                      (1,2)        ✘

We can access the elements of Python's sequence types (list, str, tuple) using a 0-based index. Dictionaries are indexed by key instead, and sets do not support indexing.

list

  • Allow for heterogeneous membership in the various object types
  • Mutable (can change items contained within the object after creating the instance)

Construction:

  • literal: []
  • constructor: list()
In [33]:
x = [1, 2.2, "str", True, None] 
x
Out[33]:
[1, 2.2, 'str', True, None]

list constructor (needs another iterable object for this to work... more on this later)

In [34]:
list("This") 
Out[34]:
['T', 'h', 'i', 's']

Can grab and replace elements using a 0-based index

In [35]:
x[0]
Out[35]:
1
In [36]:
x[4] = "a"
x
Out[36]:
[1, 2.2, 'str', True, 'a']

We can reverse the index using negative values

 0,  1 ,   2  ,   3 ,  4
[1, 2.2, "str", True, "a"]
-5, -4 ,  -3  ,  -2 ,  -1
In [37]:
x[-1]
Out[37]:
'a'
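
The relationship between positive and negative indices can be checked directly. A small sketch, using a fresh list named `items` so that it does not disturb `x`:

```python
items = [1, 2.2, "str", True, None]
n = len(items)

# For every valid negative index i, items[i] is the same element as items[n + i].
checks = [items[i] is items[n + i] for i in range(-n, 0)]
print(all(checks))     # True

last = items[-1]       # None, same as items[n - 1]
first = items[-n]      # 1, same as items[0]
```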

Adding to lists.

In [38]:
x + 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-eaf7b6991020> in <module>()
----> 1 x + 1

TypeError: can only concatenate list (not "int") to list
In [2]:
x = [4,3,5,6]
x + [1]
print(x)
x = x + [1]
[4, 3, 5, 6]
In [ ]:
x.append("r")
x
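
The difference between `+` and `.append()` is that the former builds a new list while the latter changes the existing one. A minimal sketch (the names `nums` and `alias` are illustrative):

```python
nums = [4, 3, 5, 6]
alias = nums                  # a second name for the same list

nums.append(1)                # mutates in place (and returns None)
nums.extend([7, 8])           # adds each element of the iterable, in place
still_same = nums is alias    # True: alias sees the changes too

nums = nums + [9]             # builds a NEW list and re-binds the name `nums`
now_same = nums is alias      # False

print(alias)   # [4, 3, 5, 6, 1, 7, 8]
print(nums)    # [4, 3, 5, 6, 1, 7, 8, 9]
```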

Multiplying lists repeats them.

In [39]:
[1,2,3] * 6
Out[39]:
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]

str

Construction:

  • literal: "" or ''
  • constructor: str()
In [40]:
# Generate a string using quotes
s = "this is a string"
print(s)
print(type(s))
this is a string
<class 'str'>
In [41]:
# or using the string constructor
ss = str(3456)
ss
Out[41]:
'3456'
In [42]:
print(type(ss))
<class 'str'>
In [43]:
# layer quotation (when need be)
s = "this is a 'string'" # good
print(s)

s = 'this is a "string"' # good
print(s)
this is a 'string'
this is a "string"

Strings are containers too

String elements can be accessed using a 0-based index, much like objects in a list

In [44]:
my_str = "Georgetown"
print(my_str[0])
print(my_str[5])
G
e

Appending ('adding') two strings pastes their values together

In [45]:
"Cat" + "Cat" + "Dog"
Out[45]:
'CatCatDog'

Multiplying strings repeats the value.

In [46]:
"class"*5
Out[46]:
'classclassclassclassclass'

tuple

  • Heterogeneous: can take any type of object
  • immutable: once created, elements cannot be added or removed.
  • 0-based index to access values (though the values cannot be changed).
  • Tuple unpacking

Construction:

  • literal: ()
    • though parentheses are not required when one has a comma-separated sequence of values
  • constructor: tuple() (where the input is an iterable object, like a list)
In [47]:
tt = ("apple",2.4,6)
print(type(tt))
print(tt)
<class 'tuple'>
('apple', 2.4, 6)
In [48]:
# constructor 
tuple([4,5,6])
Out[48]:
(4, 5, 6)
In [49]:
tt2 = 1,2,3,4,5,6
print(tt2)
(1, 2, 3, 4, 5, 6)

Indexing a tuple (0-based index)

In [50]:
tt[0]
Out[50]:
'apple'

adding tuples combines/appends them

In [51]:
(1,2,3) + (99,-100)
Out[51]:
(1, 2, 3, 99, -100)

multiplying tuples repeats them

In [52]:
(1,2,3) * 3
Out[52]:
(1, 2, 3, 1, 2, 3, 1, 2, 3)

Tuple Unpacking: allows one to deconstruct the tuple object into named references. This allows for flexibility regarding which objects we want when performing sequential operations, like iterating.

Note that when holding nested collections, copies are shallow (see below)

In [53]:
print(tt)

a, b, c = tt

print(a)
print(b)
print(c)
('apple', 2.4, 6)
apple
2.4
6
In [54]:
tt3 = ((1,2,3),(4,5),(1,3,4))
print(tt3)

a2, b2, c2 = tt3
print(a2)
print(b2)
print(c2)
((1, 2, 3), (4, 5), (1, 3, 4))
(1, 2, 3)
(4, 5)
(1, 3, 4)

use _ for placeholder assignments

In [55]:
a,_,c,_,d = (1,2,3,4,5)
print(a)
print(c)
print(d)
1
3
5
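
Unpacking shows up in a few other everyday patterns. The sketch below is illustrative only: swapping names, collecting "the rest" with a starred name, and unpacking pairs while iterating.

```python
# Swap two names without a temporary variable.
a, b = 1, 2
a, b = b, a                       # a == 2, b == 1

# A starred name collects the remaining values into a list.
first, *rest = (10, 20, 30, 40)   # first == 10, rest == [20, 30, 40]

# Unpack each pair while iterating.
pairs = [("x", 1), ("y", 2)]
labels = []
for name, value in pairs:
    labels.append(f"{name}={value}")
print(labels)   # ['x=1', 'y=2']
```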

set

  • unordered collection of unique elements (i.e. duplicates are removed)
  • properties of "set algebra"
  • mutable in that elements can be added and removed

Construction:

  • literal: {} braces containing elements (but with no key value pairs -- see below). Note that an empty {} creates a dict, not a set.
  • set() constructor
In [56]:
my_set = {1,2,3,3,3,4,4,4,5,1}
print(type(my_set))
my_set
<class 'set'>
Out[56]:
{1, 2, 3, 4, 5}
In [57]:
set("caaaaaaaaaaaaaaaaaat")
Out[57]:
{'a', 'c', 't'}

Add elements to a set using the .add() method or .update().

In [58]:
my_set.add(6)
my_set
Out[58]:
{1, 2, 3, 4, 5, 6}
In [59]:
my_set.update({8})
my_set
Out[59]:
{1, 2, 3, 4, 5, 6, 8}

Adding values that are already members doesn't change anything. So sets are efficient ways to keep track of unique values in a series.

In [60]:
my_set.add(1)
my_set
Out[60]:
{1, 2, 3, 4, 5, 6, 8}

set operations:

  • test for membership
  • union
  • intersection
  • difference
  • subsets/supersets
In [61]:
s1 = {"a","b","c"}
s2 = {"z","b","c","g","x"}
s3 = {"z","g","x"}
In [62]:
# Join set 1 and set 2
print(s1.union(s2))
print(s1.union(s2) == s2.union(s1)) # commutative
{'b', 'x', 'g', 'a', 'z', 'c'}
True
In [63]:
# in set 1 AND set 2
s1.intersection(s2)
Out[63]:
{'b', 'c'}
In [64]:
# in set 1 but not in set 2
s1.difference(s2)
Out[64]:
{'a'}
In [65]:
# in set 1 or set 2 but not both
s1.symmetric_difference(s2)
Out[65]:
{'a', 'g', 'x', 'z'}
In [66]:
# check if one set is a subset of another set
s3.issubset(s2)
Out[66]:
True
In [67]:
# check if one set is a super set of another set
s2.issuperset(s3)
Out[67]:
True
In [68]:
# test if two sets have no members in common
s3.isdisjoint(s1)
Out[68]:
True
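
Membership testing uses the `in` keyword, and most of the set methods above also have operator forms. A short sketch that redefines `s1`, `s2`, and `s3` so it can run on its own:

```python
s1 = {"a", "b", "c"}
s2 = {"z", "b", "c", "g", "x"}
s3 = {"z", "g", "x"}

has_a = "a" in s1        # membership test -> True
union = s1 | s2          # same as s1.union(s2)
common = s1 & s2         # same as s1.intersection(s2)
only_s1 = s1 - s2        # same as s1.difference(s2)
either = s1 ^ s2         # same as s1.symmetric_difference(s2)
is_sub = s3 <= s2        # same as s3.issubset(s2)
is_super = s2 >= s3      # same as s2.issuperset(s3)
```

One difference worth knowing: the methods accept any iterable (e.g. `s1.union(["q"])`), whereas the operators require both sides to be sets.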

dict

  • Associative array of key-value pairs
  • indexed by the keys
  • keys keep their insertion order (guaranteed since Python 3.7), but entries are looked up by key, not by position
  • keys must be immutable objects and can't be modified in place, but the values can be changed
  • keys cannot be duplicated (but values can be).

Construction:

  • literal: {<key>:<value>}
  • constructor: dict()
In [69]:
my_dict = {'a': 4, 'b': 7, 'c': 9.2}
print(type(my_dict))
print(my_dict)
<class 'dict'>
{'a': 4, 'b': 7, 'c': 9.2}

Dictionary constructor

In [70]:
my_dict = dict(a = 4.23, b = 10, c = 6.6)
my_dict
Out[70]:
{'a': 4.23, 'b': 10, 'c': 6.6}

Accessing the dictionary's 'keys'

In [71]:
my_dict.keys()
Out[71]:
dict_keys(['a', 'b', 'c'])

Accessing the dictionary's 'values'

In [72]:
my_dict.values()
Out[72]:
dict_values([4.23, 10, 6.6])

Access key value pairs as tuples (useful for iteration)

In [73]:
my_dict.items()
Out[73]:
dict_items([('a', 4.23), ('b', 10), ('c', 6.6)])

We can construct dictionaries from scratch by stringing together tuple pairs and converting to a dictionary type using the dict() constructor.

In [74]:
# recall that a list can hold any type of object, including tuples
xx = [("a",4),("b",10),("c",8)]
print(xx)
dict_xx = dict(xx)
print(dict_xx)
[('a', 4), ('b', 10), ('c', 8)]
{'a': 4, 'b': 10, 'c': 8}

We can index a dictionary using the key.

In [75]:
print(dict_xx['a'])
print(my_dict['c'])
4
6.6

We get a KeyError when a key doesn't exist

In [76]:
dict_xx['d']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-76-b1a256eb3555> in <module>()
----> 1 dict_xx['d']

KeyError: 'd'

Or we can use the .get() method to the same end -- but without the error if the key does not exist.

In [77]:
print(dict_xx.get('a'))
print(dict_xx.get('d'))
4
None
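
The .get() method also accepts a second argument, the value to return when the key is missing. A minimal sketch of a common counting pattern built on it (the dictionary `counts` is made up for the example):

```python
counts = {"a": 4, "b": 10}

missing = counts.get("d", 0)    # 0 instead of None
has_d = "d" in counts           # False: .get() does not add the key

# Counting pattern: start from 0 when the key is not there yet.
for letter in "abca":
    counts[letter] = counts.get(letter, 0) + 1
print(counts)   # {'a': 6, 'b': 11, 'c': 1}
```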

Adding additional values to a dictionary.

In [78]:
dict_xx + {'d':4}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-78-44104faeca4b> in <module>()
----> 1 dict_xx + {'d':4}

TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Addition doesn't work because dictionaries define no addition method (i.e. there is no dict.__add__()). Rather, we need to .update() the dictionary (or assign to the new key directly, e.g. dict_xx['d'] = 4).

In [79]:
dict_xx.update({'d':4})
dict_xx
Out[79]:
{'a': 4, 'b': 10, 'c': 8, 'd': 4}

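Note that the keys must be immutable (e.g. str, int, float, bool, tuple) but the values can be mutable (e.g. lists, dicts, sets + all immutable types). More precisely, keys must be hashable, so a tuple only works as a key if everything inside it is hashable too. (we'll go into the specifics of this below)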

In [80]:
print({'apple': "a"})
print({2: "a"})
print({2.5: "a"})
print({True: "a"})
print({None: "a"})
print({(4,5): "a"})
{'apple': 'a'}
{2: 'a'}
{2.5: 'a'}
{True: 'a'}
{None: 'a'}
{(4, 5): 'a'}
In [81]:
print({[1,2,3]: "a"})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-81-69687cd4447d> in <module>()
----> 1 print({[1,2,3]: "a"})

TypeError: unhashable type: 'list'
In [ ]:
print({{1,2,3}: "a"})
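
Whether an object can serve as a key can be tested with the built-in hash() function. The helper `is_hashable` below is a made-up name for illustration; `frozenset` is the built-in immutable counterpart of set.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key or set member."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable((1, 2)))               # True
print(is_hashable((1, [2, 3])))          # False: the tuple contains a list
print(is_hashable({1, 2}))               # False: sets are mutable
print(is_hashable(frozenset({1, 2})))    # True

# A frozenset can stand in where a set is wanted as a key.
lookup = {frozenset({1, 2, 3}): "a"}
print(lookup[frozenset({3, 2, 1})])      # a
```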

We'll delve more into manipulating container objects like strings in the next lecture.

Mutable vs. Immutable Objects

  • mutable → "object can be changed after it is created"
    • list
    • dict
    • set
  • immutable → "object cannot be changed after it is created"
    • int
    • float
    • bool
    • str
    • tuple
  • Mutable objects are useful when you need to add or edit values. Immutable objects are useful when you need values to remain consistent.

For further reading, see the following Medium post

In [82]:
# Mutability with lists
gg = [1,2,3,4]
gg[1] = 9
gg
Out[82]:
[1, 9, 3, 4]
In [83]:
# Immutability with tuples
tt = (1,2,3,4)
tt[1] = 9
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-83-b27359dfb0d8> in <module>()
      1 # Immutability with tuples
      2 tt = (1,2,3,4)
----> 3 tt[1] = 9

TypeError: 'tuple' object does not support item assignment
In [ ]:
# A mixture of worlds with dictionaries
dd =  {"a":[1,2,3,4],"b":[33,44]}
print(dd.keys()) # can't change the keys (immutable)

# But we can add to the dict
dd.update({'c':9})
print(dd)

# and change values
dd['a'][2]= "SSSS"
print(dd)
In [ ]:
my_set = {3,4,5,5,5,6}
print(my_set)
my_set.pop() # remove and return an arbitrary element (sets are unordered)
print(my_set)
In [ ]:
my_str = "This is a string"
my_str[0] = "X" # raises a TypeError: strings are immutable

Object references and copies

As noted, each object obtains a unique object id when instantiated. Immutable objects cannot be altered in place: when we "change" one (e.g. x += 1), Python creates a new object with a new id and re-binds the name to it. Mutable objects, by contrast, can be changed in place and keep the same id. Because we can assign multiple references to the same object, a change made through one reference is visible through all the others. This can result in undesirable behavior. To get around this, we can make copies of an object.

In [84]:
# Instantiate an object with the integer value 4.
x = 4
id(x)
Out[84]:
4419394656
In [85]:
# Increase the value by one
x += 1
id(x) # x now references a different object with a different id!
Out[85]:
4419394688
In [86]:
# These ids belong to the integer objects 4 and 5 themselves (CPython caches small integers), not to the name x.
print(id(4))
print(id(5))
4419394656
4419394688

Note that when dealing with mutable objects, a very different story emerges.

In [87]:
# Create a list object called my_list
my_list = ["a", 2, 3.3]

# Assign my_list to a second name, your_list
your_list = my_list

Note that the contents (the values contained within) are the same

In [88]:
print(my_list)
print(your_list)
['a', 2, 3.3]
['a', 2, 3.3]

And the object ids are the same

In [89]:
print(id(my_list))
print(id(your_list))
140596830001288
140596830001288

Note we can test for two forms of equivalence when comparing objects.

  • identity equivalence: are these two object ids the same?
  • value equivalence: are the values contained within the object the same?
In [90]:
print(my_list is your_list) # `is` tests for equality of identity
True
In [91]:
print(my_list == your_list) # `==` tests for equality of value.
True

Now let's update one of the list objects.

In [92]:
your_list.append(10)
your_list
Out[92]:
['a', 2, 3.3, 10]

When we look at the other list object, however, we note something surprising: it changed as well!

In [93]:
my_list # ????
Out[93]:
['a', 2, 3.3, 10]
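
The same experiment can be run side by side for an immutable and a mutable type. This is a minimal sketch with made-up names; it relies on the fact that `+=` builds a new object for a str but modifies a list in place.

```python
# Immutable: += creates a new string and re-binds the name s.
s = "abc"
t = s            # second name for the same str object
s += "d"
print(s, t)      # abcd abc  (t is unaffected)

# Mutable: += modifies the existing list, so every name bound to it sees the change.
lst = [1, 2]
alias = lst
lst += [3]
print(lst, alias)      # [1, 2, 3] [1, 2, 3]
print(lst is alias)    # True
```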

To understand what's going on, we need to think closely about what's happening under the hood when we assign mutable objects to new references. See the following interactive code diagram.

Copying mutable objects

To get around the issue above, we need to make a copy of the mutable data object.

There are three ways we can make a copy.

  • use constructor, e.g. list()
  • use the copy method, e.g. .copy()
  • slice the entire series, e.g. my_list[:]
In [94]:
a = my_list
a is my_list
Out[94]:
True
In [95]:
a = list(my_list)
b = my_list.copy()
c = my_list[:]
In [96]:
# Values are equivalent
print(a == my_list)
print(b == my_list)
print(c == my_list)
True
True
True
In [97]:
# But their identities are not
print(a is my_list)
print(b is my_list)
print(c is my_list)
False
False
False

Shallow copies

Recall that a list can hold heterogeneous types of objects, including other lists. We call this a "nested list" (or a nested data structure).

Nested data containing other mutable data types can generate similar types of problems.

In [98]:
nested_list = [[1,2,3],[4,7,88],[69,21,9.1]]
nested_list
Out[98]:
[[1, 2, 3], [4, 7, 88], [69, 21, 9.1]]
In [99]:
print(nested_list[1])
[4, 7, 88]
In [100]:
print(nested_list[1][1])
7
In [101]:
# Copy the list 
new_nested_list = list(nested_list) 
In [102]:
# Equivalent values
print(new_nested_list == nested_list) 
True
In [103]:
# Not identified as the same object... so the copying did work!
print(new_nested_list is nested_list) 
False

Let's now edit the nested list by appending new values. When we do this, we see the values for the one list changed, but not the other. That's what we want.

In [104]:
# Let's augment this list...
nested_list.append([1,2,3,4,5])

print(nested_list)
print(new_nested_list)
[[1, 2, 3], [4, 7, 88], [69, 21, 9.1], [1, 2, 3, 4, 5]]
[[1, 2, 3], [4, 7, 88], [69, 21, 9.1]]

Let's now edit a specific value within one of the nested lists.

In [105]:
# Let's augment one of the lists within the lists...
print(nested_list[1])
nested_list[1][1] = "AAA"
print(nested_list)
[4, 7, 88]
[[1, 2, 3], [4, 'AAA', 88], [69, 21, 9.1], [1, 2, 3, 4, 5]]

Oh no! The value was altered in the other list as well!

In [106]:
print(new_nested_list) 
[[1, 2, 3], [4, 'AAA', 88], [69, 21, 9.1]]

To understand what is going on, let's again refer to the interactive diagram and try to reproduce the circumstances.

The take-away: Copies are shallow → this means that copying a list will still maintain the references to the nested lists.
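
The shared references can be confirmed with `is`. A small sketch using fresh names (`outer`, `shallow`) rather than the lists above:

```python
outer = [[1, 2, 3], [4, 7, 88]]
shallow = list(outer)

outer_is_shared = shallow is outer          # False: the outer list is new
inner_is_shared = shallow[0] is outer[0]    # True: the inner lists are the same objects

outer[0][0] = "changed"     # edits a shared inner list: visible through both names
outer[1] = ["replaced"]     # re-binds a slot in `outer` only

print(outer)     # [['changed', 2, 3], ['replaced']]
print(shallow)   # [['changed', 2, 3], [4, 7, 88]]
```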

Deep Copies

Deep copies allow us to ensure that a copy is made recursively for all mutable data types in the nested data structure.

In [107]:
import copy # from the standard library
my_list = [[1,2],[3,4]]
my_list2 = copy.deepcopy(my_list)
In [108]:
my_list2[1][1]=55
In [109]:
print(my_list)
print(my_list2)
[[1, 2], [3, 4]]
[[1, 2], [3, 55]]
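
The copy module also provides copy.copy() for shallow copies, and both functions work on containers other than lists. A brief sketch comparing the two on a made-up dictionary:

```python
import copy

original = {"scores": [1, 2], "tags": {"x"}}
shallow = copy.copy(original)        # new dict, same inner objects
deep = copy.deepcopy(original)       # new dict, new inner objects

original["scores"].append(3)

print(shallow["scores"])   # [1, 2, 3]  (shares the inner list)
print(deep["scores"])      # [1, 2]     (fully independent)
```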

Modules: Importing Functionality

Standard Library

Python comes with an extensive standard library and built-in functions.

Some examples of these modules (to name a few...)

math for mathematical computations

In [110]:
import math
math.log(100)
Out[110]:
4.605170185988092

re for regular expressions (pattern-based string operations)

In [111]:
import re
my_string = "this is a dog"
re.sub("this","That",my_string)
Out[111]:
'That is a dog'

random for random number generation

In [112]:
import random
random.randint(1, 10)
Out[112]:
5

datetime for dealing with dates

In [113]:
import datetime
date1 = datetime.date(year=2009,month=1,day=13)
date2 = datetime.date(year=2010,month=1,day=13)
date2 - date1
Out[113]:
datetime.timedelta(days=365)

Importing Modules

Excerpt from Real Python post

Modular programming refers to the process of breaking a large, unwieldy programming task into separate, smaller, more manageable subtasks or modules. Individual modules can then be cobbled together like building blocks to create a larger application.

There are several advantages to modularizing code in a large application:

  • Simplicity: Rather than focusing on the entire problem at hand, a module typically focuses on one relatively small portion of the problem. If you’re working on a single module, you’ll have a smaller problem domain to wrap your head around. This makes development easier and less error-prone.

  • Maintainability: Modules are typically designed so that they enforce logical boundaries between different problem domains. If modules are written in a way that minimizes interdependency, there is decreased likelihood that modifications to a single module will have an impact on other parts of the program. (You may even be able to make changes to a module without having any knowledge of the application outside that module.) This makes it more viable for a team of many programmers to work collaboratively on a large application.

  • Reusability: Functionality defined in a single module can be easily reused (through an appropriately defined interface) by other parts of the application. This eliminates the need to recreate duplicate code.

  • Scoping: Modules typically define a separate namespace, which helps avoid collisions between identifiers in different areas of a program. (One of the tenets in the Zen of Python is Namespaces are one honking great idea—let’s do more of those!)

Functions, modules and packages are all constructs in Python that promote code modularization.

In [114]:
import sys
In [115]:
import numpy as np
In [116]:
from sklearn import metrics
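
The three import forms above differ only in which names end up in the current namespace. A sketch using the standard library's math module, so it runs without installing anything:

```python
import math                    # access functions as math.sqrt(...)
import math as m               # same module under a shorter alias
from math import sqrt, pi      # bring specific names into the current namespace

a = math.sqrt(16)
b = m.sqrt(16)
c = sqrt(16)
print(a, b, c)                 # 4.0 4.0 4.0

# All three routes lead to the same function object.
same_function = math.sqrt is sqrt
print(same_function)           # True
```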

Installing Modules

Using pip (PyPI)

In [117]:
!pip install numpy
Requirement already satisfied: numpy in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (1.15.1)
You are using pip version 18.0, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Using Anaconda

In [118]:
!conda install numpy
/bin/sh: conda: command not found