PPOL564 - Data Science I: Foundations

Lecture 4

Data Types in Python

Plan for Today¶

Objects
Cover the standard data types in python
- scalar and collection types
- methods
Mutable vs. Immutable data types
Copies
Modules
Manipulating Mutable Data Structures (see other notebook)
See supplement notebook for a more detailed look at the functionality of strings and dates.

Being Pythonic¶

Whitespace is significant
Everything is an object
Aim for readability

Follow PEPs¶

Proposed changes to Python are recorded in Python Enhancement Proposals (or PEPs)
- When a change to the language is proposed, it is documented in a PEP
- The Python "philosophy" also lives here in the form of recommendations (e.g. PEP 8 on code style, including spacing)

# Python has a sort of philosophy to it. 
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Objects¶

= is the assignment operator. Assignment only binds a name to an object; it never copies an object by value (more on this below).
- NOTE: a variable is not a "box" called x that we place a value into. Rather, a variable is just a label that we attach to a value.
An object's "type" is determined at runtime (this differs from languages where the type must be declared explicitly); an object's type cannot be changed once established (but its state can change if the object is mutable; more on this below).
A reference is assigned to an object (e.g. below, 'x' references the object '4').
There can be multiple references to the same object (more on this below).
Objects are assigned a unique object id when instantiated.

x = 4
x

4

type(x)

int

id(x) # Identity of the object

4419394656
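As a small sketch of the "labels, not boxes" idea (the names `original` and `x` are only illustrative), rebinding a name attaches the label to a different object, while the object it used to point at keeps its own type:

```python
x = 4
original = x              # a second label on the same int object
print(x is original)      # True

x = "four"                # the label x now points at a str object
print(type(original))     # <class 'int'>
print(type(x))            # <class 'str'>
print(x is original)      # False
```

The integer object never changed type; only the name `x` moved.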

Scalar Data Types¶

type	Description	Example
`int`	integer types	`4`
`float`	64-bit floating point numbers	`4.567`
`bool`	Boolean logical values	`True`
`NoneType`	null object (serves as a useful placeholder)	`None`

All scalar data types are immutable.

`int`¶

x = 1
type(x)

int

int(3.4) # Constructor

3

`float`¶

x = 1.3
type(x)

float

float(1) # Constructor

1.0

float + int = float

3.0 + 3

6.0

Calculations with integers and floats¶

# addition
4 + 4

8

# Subtraction 
50 - 25

25

# Multiplication
5 * 5

25

Float Division Operator (/) vs. Integer Division Operator (//)

1000/50 # float division

20.0

1000//50 # integer division

20
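To make the difference between the two division operators concrete, here is a short sketch (the variable names are just for illustration). Note that `//` floors the result, which matters for negative numbers, and that it returns a float if either operand is a float:

```python
true_div = 7 / 2          # float division always returns a float
floor_div = 7 // 2        # integer (floor) division drops the remainder
negative_floor = -7 // 2  # floors toward negative infinity, not toward zero
float_floor = 7.0 // 2    # floor division with a float operand gives a float

print(true_div)           # 3.5
print(floor_div)          # 3
print(negative_floor)     # -4
print(float_floor)        # 3.0

# divmod() returns the floor quotient and the remainder together
quotient, remainder = divmod(7, 2)
print(quotient, remainder)  # 3 1
```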

# Exponentiation
5**4

625

# Remainders ('modulo')
10%6

4

`NoneType`¶

y = None
type(y)

NoneType

None == 6

False

6 is None

False
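A common use of `None` is as a placeholder for "no value supplied". The sketch below uses a hypothetical function, `describe`, to show why we test with `is None` rather than relying on truthiness: a legitimate value like `0` would otherwise be mistaken for a missing one.

```python
def describe(value=None):
    """Hypothetical helper: None signals that no value was supplied."""
    if value is None:
        return "missing"
    return f"got {value}"

print(describe())      # missing
print(describe(0))     # got 0
print(describe(None))  # missing
```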

`bool`¶

b = True
type(b)

bool

b == False

False

# Any non-zero value is truthy
print(bool(1))
print(bool(55))
print(bool(-5500))

True
True
True

# Zero is false
bool(0)

False

# NoneTypes are false
bool(None)

False

# Empty containers are false
print(bool([]))
print(bool({}))
print(bool(""))

False
False
False
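These truthiness rules are what let us write `if some_container:` as a check for "is there anything in here?". A minimal sketch, using a hypothetical `summarize` function:

```python
def summarize(items):
    """Hypothetical helper: relies on empty containers being falsy."""
    if not items:
        return "nothing to report"
    return f"{len(items)} item(s)"

print(summarize([]))          # nothing to report
print(summarize({}))          # nothing to report
print(summarize([4, 5, 6]))   # 3 item(s)
```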

Object types determine behavior¶

Python knows how to behave because of the methods that an object's type defines. These methods dictate how different data types handle similar operations (such as addition, multiplication, comparisons, etc.).

Note that special attributes in python are delimited by double underscores (or "dunder")

x = 4
int(4)

4

dir(x)

['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']

x.__add__(4) # addition

8

x.__mod__(3) # modulo

1

x.__mul__(6) # multiplication, etc.

24

x.__eq__(5) # 4 == 5

False

x.__float__()

4.0
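To see that operators really are dispatched to these dunder methods, we can define them ourselves. The class below, `Vec2`, is a made-up example for illustration only (we have not covered classes yet, so treat it as a preview): defining `__add__` and `__eq__` is enough to make `+` and `==` work on its instances.

```python
class Vec2:
    """Hypothetical 2-D vector, used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Called when Python evaluates: self + other
        return Vec2(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        # Called when Python evaluates: self == other
        return isinstance(other, Vec2) and (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        # Controls how the object is displayed
        return f"Vec2({self.x}, {self.y})"

total = Vec2(1, 2) + Vec2(3, 4)
print(total)                 # Vec2(4, 6)
print(total == Vec2(4, 6))   # True
```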

Collection/Container Data Types¶

Type	Description	Example	Mutable
`list`	heterogeneous sequences of objects	`[1,2,3]`	✓
`str`	sequences of characters	`"A word"`	✘
`dict`	associative array of key/value mappings	`{"a": 1}`	keys ✘ values ✓
`set`	unordered collection of distinct objects	`{1,2,3}`	✓
`tuple`	heterogeneous sequence	`(1,2)`	✘

We can access the information contained within python collection types using a 0-based index.

`list`¶

Allows heterogeneous membership (a single list can hold objects of different types)
Mutable (items contained within the object can be changed after creating the instance)

Construction:

literal: []
constructor: list()

x = [1, 2.2, "str", True, None] 
x

[1, 2.2, 'str', True, None]

list constructor (needs another iterable object for this to work... more on this later)

list("This")

['T', 'h', 'i', 's']

Can grab and replace elements using a 0-based index

x[0]

1

x[4] = "a"
x

[1, 2.2, 'str', True, 'a']

We can reverse the index using negative values

 0,  1 ,   2  ,   3 ,  4
[1, 2.2, "str", True, None]
-5, -4 ,  -3  ,  -2 ,  -1

x[-1]

'a'
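As a quick check of how negative indices line up with positive ones, here is a small sketch using the same list as above (the names `last` and `same` are illustrative). For a list of length `n`, index `-k` refers to the same element as index `n - k`:

```python
x = [1, 2.2, "str", True, "a"]

last = x[-1]              # last element
same = x[len(x) - 1]      # the equivalent positive index
print(last, same)         # a a

print(x[-5])              # 1  (same element as x[0])
```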

Adding to lists.

x + 1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-eaf7b6991020> in <module>()
----> 1 x + 1

TypeError: can only concatenate list (not "int") to list

x = [4,3,5,6]
x + [1]
print(x)
x = x + [1]

[4, 3, 5, 6]

x.append("r")
x
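The two approaches above behave differently, which the sketch below tries to make explicit (the names `alias` and `y` are illustrative): `+` builds a brand-new list and leaves the original alone, whereas `.append()` changes the existing list in place.

```python
x = [4, 3, 5, 6]
alias = x                 # second reference to the same list

y = x + [1]               # builds a NEW list; x is untouched
print(x)                  # [4, 3, 5, 6]
print(y)                  # [4, 3, 5, 6, 1]
print(y is x)             # False

x.append("r")             # mutates the existing list in place
print(alias)              # [4, 3, 5, 6, 'r']
```

This distinction comes back in the discussion of references and copies later in the lecture.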

Multiplying lists repeats them.

[1,2,3] * 6

[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]

`str`¶

Construction:

literal: "" or ''
constructor: str()

# Generate a string using quotes
s = "this is a string"
print(s)
print(type(s))

this is a string
<class 'str'>

# or using the string constructor
ss = str(3456)
ss

'3456'

print(type(ss))

<class 'str'>

# layer quotation (when need be)
s = "this is a 'string'" # good
print(s)

s = 'this is a "string"' # good
print(s)

this is a 'string'
this is a "string"

Strings are containers too¶

String elements can be accessed using a 0-based index, much like objects in a list

my_str = "Georgetown"
print(my_str[0])
print(my_str[5])

G
e

Appending ('adding') two strings pastes their values together

"Cat" + "Cat" + "Dog"

'CatCatDog'

Multiplying strings repeats the value.

"class"*5

'classclassclassclassclass'

`tuple`¶

Heterogeneous: can take any type of object
immutable: once created, elements cannot be added or removed.
0-based index to access values (though the values cannot be changed).
Tuple unpacking

Construction:

literal: ()
- though parentheses are not required when one has a comma-separated sequence of values
constructor: tuple() (where the input is an iterable object, like a list)

tt = ("apple",2.4,6)
print(type(tt))
print(tt)

<class 'tuple'>
('apple', 2.4, 6)

# constructor 
tuple([4,5,6])

(4, 5, 6)

tt2 = 1,2,3,4,5,6
print(tt2)

(1, 2, 3, 4, 5, 6)

Indexing a tuple (0-based index)

tt[0]

'apple'

adding tuples combines/appends them

(1,2,3) + (99,-100)

(1, 2, 3, 99, -100)

multiplying tuples repeats them

(1,2,3) * 3

(1, 2, 3, 1, 2, 3, 1, 2, 3)

Tuple Unpacking: allows one to deconstruct the tuple object into named references. This allows for flexibility regarding which objects we want when performing sequential operations, like iterating.

Note that when holding nested collections, copies are shallow (see below)

print(tt)

a, b, c = tt

print(a)
print(b)
print(c)

('apple', 2.4, 6)
apple
2.4
6

tt3 = ((1,2,3),(4,5),(1,3,4))
print(tt3)

a2, b2, c2 = tt3
print(a2)
print(b2)
print(c2)

((1, 2, 3), (4, 5), (1, 3, 4))
(1, 2, 3)
(4, 5)
(1, 3, 4)

use _ for placeholder assignments

a,_,c,_,d = (1,2,3,4,5)
print(a)
print(c)
print(d)

1
3
5
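A few more unpacking patterns that come up often, sketched with made-up data: unpacking inside a `for` loop, collecting "the rest" with a starred name, and swapping two names without a temporary variable.

```python
# Unpack each (key, value) tuple as we iterate
pairs = [("a", 4), ("b", 10), ("c", 8)]
total = 0
for key, value in pairs:
    total += value
print(total)              # 22

# A starred name collects the remaining items into a list
first, *rest = (1, 2, 3, 4, 5)
print(first)              # 1
print(rest)               # [2, 3, 4, 5]

# Swapping relies on tuple packing and unpacking
a, b = 1, 2
a, b = b, a
print(a, b)               # 2 1
```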

`set`¶

unordered collection of unique elements (i.e. duplicates are removed)
properties of "set algebra"
mutable in that elements can be added and removed

Construction:

literal: {} braces containing values rather than key-value pairs (note that an empty {} creates a dict, not a set -- see below)
set() constructor

my_set = {1,2,3,3,3,4,4,4,5,1}
print(type(my_set))
my_set

<class 'set'>

{1, 2, 3, 4, 5}

set("caaaaaaaaaaaaaaaaaat")

{'a', 'c', 't'}

Add elements to a set using the .add() method or .update().

my_set.add(6)
my_set

{1, 2, 3, 4, 5, 6}

my_set.update({8})
my_set

{1, 2, 3, 4, 5, 6, 8}

Adding values that are already members doesn't change anything. So sets are efficient ways to keep track of unique values in a series.

my_set.add(1)
my_set

{1, 2, 3, 4, 5, 6, 8}
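For example, converting a list with repeats into a set is a quick way to find the distinct values. The data below is made up for illustration:

```python
votes = ["yes", "no", "yes", "abstain", "no", "yes"]

unique_votes = set(votes)        # duplicates collapse automatically
print(len(unique_votes))         # 3
print("yes" in unique_votes)     # True

# Sets are unordered, so sort when a stable display order is needed
print(sorted(unique_votes))      # ['abstain', 'no', 'yes']
```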

set operations:

test for membership
union
intersection
difference
subsets/supersets

s1 = {"a","b","c"}
s2 = {"z","b","c","g","x"}
s3 = {"z","g","x"}

# Join set 1 and set 2
print(s1.union(s2))
print(s1.union(s2) == s2.union(s1)) # commutative

{'b', 'x', 'g', 'a', 'z', 'c'}
True

# in set 1 AND set 2
s1.intersection(s2)

{'b', 'c'}

# Set 1 not in set 2
s1.difference(s2)

{'a'}

# in set 1 or set 2 but not both
s1.symmetric_difference(s2)

{'a', 'g', 'x', 'z'}

# check if one set is a subset of another set
s3.issubset(s2)

True

# check if one set is a super set of another set
s2.issuperset(s3)

True

# test if two sets have no members in common
s3.isdisjoint(s1)

True

`dict`¶

Associative array of key-value pairs
indexed by the keys
keys keep their insertion order (guaranteed since Python 3.7), but items are looked up by key, not by position
keys can't be changed once created (immutable), but the values can be changed
keys cannot be duplicated (but values can be).

Construction:

literal: {<key>:<value>}
constructor: dict()

my_dict = {'a': 4, 'b': 7, 'c': 9.2}
print(type(my_dict))
print(my_dict)

<class 'dict'>
{'a': 4, 'b': 7, 'c': 9.2}

Dictionary constructor

my_dict = dict(a = 4.23, b = 10, c = 6.6)
my_dict

{'a': 4.23, 'b': 10, 'c': 6.6}

Accessing the dictionary's 'keys'

my_dict.keys()

dict_keys(['a', 'b', 'c'])

Accessing the dictionary's 'values'

my_dict.values()

dict_values([4.23, 10, 6.6])

Access key value pairs as tuples (useful for iteration)

my_dict.items()

dict_items([('a', 4.23), ('b', 10), ('c', 6.6)])
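Here is a minimal sketch of that iteration pattern, combining `.items()` with tuple unpacking (the list `lines` is just there to collect the results):

```python
my_dict = dict(a=4.23, b=10, c=6.6)

lines = []
for key, value in my_dict.items():   # each item is a (key, value) tuple
    lines.append(f"{key} -> {value}")

print(lines)   # ['a -> 4.23', 'b -> 10', 'c -> 6.6']
```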

We can construct dictionaries from scratch by stringing together tuple pairs and converting to a dictionary type using the dict() constructor.

# recall that a list can hold any type of object, including tuples
xx = [("a",4),("b",10),("c",8)]
print(xx)
dict_xx = dict(xx)
print(dict_xx)

[('a', 4), ('b', 10), ('c', 8)]
{'a': 4, 'b': 10, 'c': 8}

We can index a dictionary using the key.

print(dict_xx['a'])
print(my_dict['c'])

4
6.6

We get a KeyError when a key doesn't exist

dict_xx['d']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-76-b1a256eb3555> in <module>()
----> 1 dict_xx['d']

KeyError: 'd'

Or we can use the .get() method to the same end -- but without the error if the key does not exist.

print(dict_xx.get('a'))
print(dict_xx.get('d'))

4
None
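`.get()` also accepts a second argument: a default to return when the key is missing. One place this is handy is counting, sketched here with made-up data:

```python
words = ["cat", "dog", "cat", "bird", "cat"]

counts = {}
for word in words:
    # If the word has not been seen yet, start from 0
    counts[word] = counts.get(word, 0) + 1

print(counts)                  # {'cat': 3, 'dog': 1, 'bird': 1}
print(counts.get("fish", 0))   # 0
```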

Adding additional values to a dictionary.

dict_xx + {'d':4}

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-78-44104faeca4b> in <module>()
----> 1 dict_xx + {'d':4}

TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Addition doesn't work because dictionaries do not define an addition method (there is no dict.__add__()). Rather, we need to .update() the dictionary.

dict_xx.update({'d':4})
dict_xx

{'a': 4, 'b': 10, 'c': 8, 'd': 4}
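Two related points, shown as a short sketch: a single key can also be added by assigning to it directly, and `.update()` overwrites the value of any key that already exists.

```python
dict_xx = {"a": 4, "b": 10, "c": 8}

dict_xx["d"] = 4                      # add one key by assignment
dict_xx.update({"a": 100, "e": 5})    # 'a' is overwritten, 'e' is added

print(dict_xx)   # {'a': 100, 'b': 10, 'c': 8, 'd': 4, 'e': 5}
```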

Note that the keys must be hashable, which in practice means immutable (e.g. str, int, float, bool, tuple), but the values can be anything, including mutable objects (e.g. lists, dicts, sets). (We'll go into the specifics of this below.)

print({'apple': "a"})
print({2: "a"})
print({2.5: "a"})
print({True: "a"})
print({None: "a"})
print({(4,5): "a"})

{'apple': 'a'}
{2: 'a'}
{2.5: 'a'}
{True: 'a'}
{None: 'a'}
{(4, 5): 'a'}

print({[1,2,3]: "a"})

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-81-69687cd4447d> in <module>()
----> 1 print({[1,2,3]: "a"})

TypeError: unhashable type: 'list'

print({{1,2,3}: "a"})
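Using a set as a key fails for the same reason as the list: sets are mutable and therefore unhashable. As a hedged sketch (variable names are illustrative), a `frozenset` is an immutable alternative that can serve as a key, and a tuple only works as a key when everything inside it is hashable too. The exact wording of the error message varies across Python versions, so the code only checks for the word "unhashable".

```python
# A set cannot be a dictionary key
try:
    bad = {{1, 2, 3}: "a"}
except TypeError as err:
    set_error = str(err)
print("unhashable" in set_error)       # True

# frozenset is the immutable counterpart of set, so it can be a key
ok = {frozenset({1, 2, 3}): "a"}
print(ok[frozenset({3, 2, 1})])        # a

# A tuple is only hashable if all of its elements are hashable
try:
    bad = {(1, [2, 3]): "a"}
except TypeError as err:
    tuple_error = str(err)
print("unhashable" in tuple_error)     # True
```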

We'll delve more into manipulating container objects like strings in the next lecture.

Mutable vs. Immutable Objects¶

mutable → "object can be changed after it is created"
- lists
- dict
- set

immutable → "object cannot be changed after it is created"
- int
- float
- bool
- str
- tuple
Mutable objects are useful when you need to add or edit values. Immutable objects are useful when you need values to remain consistent.

For further reading, see the following Medium post

# Mutability with lists
gg = [1,2,3,4]
gg[1] = 9
gg

[1, 9, 3, 4]

# Immutability with tuples
tt = (1,2,3,4)
tt[1] = 9

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-83-b27359dfb0d8> in <module>()
      1 # Immutability with tuples
      2 tt = (1,2,3,4)
----> 3 tt[1] = 9

TypeError: 'tuple' object does not support item assignment

# A mixture of worlds with dictionaries
dd =  {"a":[1,2,3,4],"b":[33,44]}
print(dd.keys()) # can't change the keys (immutable)

# But we can add to the dict
dd.update({'c':9})
print(dd)

# and change values
dd['a'][2]= "SSSS"
print(dd)

my_set = {3,4,5,5,5,6}
print(my_set)
my_set.pop() # remove and return an arbitrary element (sets are unordered)
print(my_set)

my_str = "This is a string"
my_str[0] = "X"
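The item assignment above raises a `TypeError` because strings are immutable. A minimal sketch of the usual workaround: build a new string from pieces of the old one (here using a slice, `my_str[1:]`, which means "everything from index 1 onward"). The name `new_str` is illustrative.

```python
my_str = "This is a string"

try:
    my_str[0] = "X"            # not allowed: str is immutable
except TypeError as err:
    print("TypeError:", err)

new_str = "X" + my_str[1:]     # build a NEW string instead
print(new_str)                 # Xhis is a string
print(my_str)                  # This is a string  (unchanged)
```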

Object references and copies¶

As noted, each object obtains a unique object id when instantiated. Immutable objects cannot be altered in place, so when we "change" one, the name is rebound to a different object with a different id. Mutable objects, however, can be changed in place while keeping the same id. Because we can assign multiple references to the same object, a change made through one reference shows up through all of them, which can result in undesirable behavior. To get around this, we can make copies of an object.

# Instantiate an object with the integer value 4.
x = 4
id(x)

4419394656

# Add 1 to the value
x += 1
id(x) # x now refers to a different object, so the id is different!

4419394688

# The ids belong to the value objects, not to the name x (CPython caches small integers).
print(id(4))
print(id(5))

4419394656
4419394688

Note that when dealing with mutable objects, a very different story emerges.

# Create a list object called my_list
my_list = ["a", 2, 3.3]

# Bind a second name to the same list (note: this does NOT make a copy)
your_list = my_list

Note that the contents (the values contained within) of both are the same

print(my_list)
print(your_list)

['a', 2, 3.3]
['a', 2, 3.3]

And the object ids are the same

print(id(my_list))
print(id(your_list))

140596830001288
140596830001288

Note we can test for two forms of equivalence when comparing objects.

identity equivalence: are these two object ids the same?
value equivalence: are the values contained within the object the same?

print(my_list is your_list) # `is` tests for equality of identity

True

print(my_list == your_list) # `==` tests for equality of value.

True

Now let's update one of the list objects.

your_list.append(10)
your_list

['a', 2, 3.3, 10]

When we look at the other list object, however, we note something surprising: it changed as well!

my_list # ????

['a', 2, 3.3, 10]
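The same thing happens when a mutable object is passed to a function: the parameter is just another reference to the caller's object. The sketch below uses a hypothetical function, `add_total`, to illustrate.

```python
def add_total(values):
    """Hypothetical helper: appends the sum to the list it receives."""
    values.append(sum(values))   # mutates the caller's list in place

scores = [1, 2, 3]
add_total(scores)
print(scores)                    # [1, 2, 3, 6]
```

Nothing was returned, yet `scores` changed, because `values` and `scores` referred to the same list.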

To understand what's going on, we need to think closely about what's happening under the hood when we assign mutable objects to new references. See the following interactive code diagram.

Copying mutable objects¶

To get around the preceding issue, we need to make a copy of a mutable data object.

There are three ways we can make a copy.

use constructor, e.g. list()
use the copy method, e.g. .copy()
slice the entire series, e.g. my_list[:]

a = my_list
a is my_list

True

a = list(my_list)
b = my_list.copy()
c = my_list[:]

# Values are equivalent
print(a == my_list)
print(b == my_list)
print(c == my_list)

True
True
True

# But their identities are not
print(a is my_list)
print(b is my_list)
print(c is my_list)

False
False
False

Shallow copies¶

Recall that a list can hold heterogeneous types of objects, including other lists. We call this a "nested list" (or a nested data structure).

Nested data containing other mutable data types can generate similar types of problems.

nested_list = [[1,2,3],[4,7,88],[69,21,9.1]]
nested_list

[[1, 2, 3], [4, 7, 88], [69, 21, 9.1]]

print(nested_list[1])

[4, 7, 88]

print(nested_list[1][1])

7

# Copy the list 
new_nested_list = list(nested_list)

# Equivalent values
print(new_nested_list == nested_list)

True

# Not identified as the same object... so the copying did work!
print(new_nested_list is nested_list)

False

Let's now edit the nested list by appending new values. When we do this, we see the values for the one list changed, but not the other. That's what we want.

# Let's augment this list...
nested_list.append([1,2,3,4,5])

print(nested_list)
print(new_nested_list)

[[1, 2, 3], [4, 7, 88], [69, 21, 9.1], [1, 2, 3, 4, 5]]
[[1, 2, 3], [4, 7, 88], [69, 21, 9.1]]

Let's now edit a specific value within one of the nested lists.

# Let's augment one of the lists within the lists...
print(nested_list[1])
nested_list[1][1] = "AAA"
print(nested_list)

[4, 7, 88]
[[1, 2, 3], [4, 'AAA', 88], [69, 21, 9.1], [1, 2, 3, 4, 5]]

Oh no! The value was also altered in the other list as well!

print(new_nested_list)

[[1, 2, 3], [4, 'AAA', 88], [69, 21, 9.1]]

To understand what is going on, let's again refer to the interactive diagram and try to reproduce the circumstances.

The take-away: Copies are shallow → this means that copying a list will still maintain the references to the nested lists.
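We can confirm this directly with `is`. In the sketch below (the names `nested` and `shallow` are illustrative), the outer lists are different objects, but the inner lists are shared between them:

```python
nested = [[1, 2, 3], [4, 7, 88]]
shallow = list(nested)            # shallow copy of the outer list

print(shallow is nested)          # False: the outer lists are distinct
print(shallow[0] is nested[0])    # True: the inner lists are shared

nested[0][0] = "AAA"              # edit an inner list through one name...
print(shallow[0])                 # ['AAA', 2, 3]  ...and it shows up in the other
```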

Deep Copies¶

Deep copies allow us to ensure that a copy is made recursively for all mutable data types in the nested data structure.

import copy # from the standard library
my_list = [[1,2],[3,4]]
my_list2 = copy.deepcopy(my_list)

my_list2[1][1]=55

print(my_list)
print(my_list2)

[[1, 2], [3, 4]]
[[1, 2], [3, 55]]

Modules: Importing Functionality¶

Standard Library¶

Python comes with an extensive standard library and built-in functions.

Some examples of these modules (to name a few...)

`math` for mathematical computations¶

import math
math.log(100)

4.605170185988092

`re` for string computations¶

import re
my_string = "this is a dog"
re.sub("this","That",my_string)

'That is a dog'
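`re` works with patterns (regular expressions), not just fixed text. A brief sketch with a made-up string: the `count` argument of `re.sub` limits how many matches are replaced, and `re.findall` returns every match of a pattern (here, words of exactly three characters).

```python
import re

my_string = "this is a dog, this is a cat"

# Replace only the first occurrence
print(re.sub("this", "That", my_string, count=1))
# That is a dog, this is a cat

# \b marks a word boundary; \w{3} matches exactly three word characters
three_letter_words = re.findall(r"\b\w{3}\b", my_string)
print(three_letter_words)   # ['dog', 'cat']
```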

`random` for random number generation¶

import random
random.randint(1, 10)

5
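Because the output is random, your number will likely differ from the one shown above. If you need reproducible results, you can set a seed first. A minimal sketch (the seed value 564 is arbitrary):

```python
import random

random.seed(564)
first = [random.randint(1, 10) for _ in range(5)]

random.seed(564)                  # reset to the same seed
second = [random.randint(1, 10) for _ in range(5)]

print(first == second)            # True: same seed, same sequence
```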

`datetime` for dealing with dates¶

import datetime
date1 = datetime.date(year=2009,month=1,day=13)
date2 = datetime.date(year=2010,month=1,day=13)
date2 - date1

datetime.timedelta(days=365)
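The `timedelta` object that comes back can be inspected and used in further date arithmetic. A short sketch building on the dates above (the name `gap` is illustrative):

```python
import datetime

date1 = datetime.date(year=2009, month=1, day=13)
date2 = datetime.date(year=2010, month=1, day=13)

gap = date2 - date1
print(gap.days)                                # 365

# Adding a timedelta to a date gives a new date
print(date1 + datetime.timedelta(days=30))     # 2009-02-12
```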

Importing Modules¶

Excerpt from Real Python post

Modular programming refers to the process of breaking a large, unwieldy programming task into separate, smaller, more manageable subtasks or modules. Individual modules can then be cobbled together like building blocks to create a larger application.

There are several advantages to modularizing code in a large application:

Simplicity: Rather than focusing on the entire problem at hand, a module typically focuses on one relatively small portion of the problem. If you’re working on a single module, you’ll have a smaller problem domain to wrap your head around. This makes development easier and less error-prone.
Maintainability: Modules are typically designed so that they enforce logical boundaries between different problem domains. If modules are written in a way that minimizes interdependency, there is decreased likelihood that modifications to a single module will have an impact on other parts of the program. (You may even be able to make changes to a module without having any knowledge of the application outside that module.) This makes it more viable for a team of many programmers to work collaboratively on a large application.
Reusability: Functionality defined in a single module can be easily reused (through an appropriately defined interface) by other parts of the application. This eliminates the need to recreate duplicate code.
Scoping: Modules typically define a separate namespace, which helps avoid collisions between identifiers in different areas of a program. (One of the tenets in the Zen of Python is Namespaces are one honking great idea—let’s do more of those!)

Functions, modules and packages are all constructs in Python that promote code modularization.

import sys

import numpy as np

from sklearn import metrics
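The three import forms above differ only in which name ends up in our namespace. Here is a sketch of the same three forms using the standard library's `math` module, so that it runs without installing anything (the alias `m` is arbitrary):

```python
import math                # access as math.<name>
import math as m           # same module, shorter alias
from math import sqrt      # pull one name directly into our namespace

print(math.sqrt(16))       # 4.0
print(m.sqrt(16))          # 4.0
print(sqrt(16))            # 4.0

print(m is math)           # True: both names refer to the same module object
```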

Installing Modules¶

Using `PyPI`¶

!pip install numpy

Requirement already satisfied: numpy in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (1.15.1)
You are using pip version 18.0, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Using `Anaconda`¶

!conda install numpy

/bin/sh: conda: command not found
