PPOL564 - Data Science I: Foundations

Lecture 4

Manipulating Data Structures

Plan for Today

  • Manipulating Mutable Data Structures.
  • See other notebook for discussion on data types.
  • See supplement notebook for a more detailed look at the functionality of strings and dates.

Manipulating Mutable Objects

Lists

In [1]:
country_list = ["Russia","Latvia","United States","Nigeria","Mexico","India","Costa Rica"]
country_list
Out[1]:
['Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica']

len()

len() provides us with the length of a list (or, for a string, its number of characters).

In [2]:
print(len(country_list))
print(len(country_list[1]))
7
6

.index()

Isolating the index location of a specific value.

In [3]:
country_list.index('Nigeria')
Out[3]:
3
In [4]:
country_list[country_list.index('Nigeria')]
Out[4]:
'Nigeria'

Membership in a list using the in operator.

In [5]:
'Russia' in country_list
Out[5]:
True
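Because .index() raises a ValueError when the value is absent, a membership test can guard the lookup. The following is a minimal sketch; the helper name find_position is hypothetical and only for illustration.

```python
country_list = ["Russia", "Latvia", "United States", "Nigeria"]

def find_position(items, value):
    """Return the index of value in items, or None if it is absent."""
    # The membership test runs first, so .index() cannot raise ValueError here
    if value in items:
        return items.index(value)
    return None

found = find_position(country_list, "Nigeria")
missing = find_position(country_list, "Canada")
print(found)    # 3
print(missing)  # None
```

Note that this scans the list twice (once for in, once for .index()), which is fine for short lists.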

Appending and altering values

To add values to a collection, we have seen methods such as __add__, .append(), .extend(), and .update(), depending on the collection type.

Recall that not all of these methods modify the object in place.

In [6]:
print(id(country_list)) # print object id
print(country_list + ['Canada']) # concatenation returns a new list that includes Canada
140195653389960
['Russia', 'Latvia', 'United States', 'Nigeria', 'Mexico', 'India', 'Costa Rica', 'Canada']
In [7]:
print(id(country_list)) # object id remains consistent
print(country_list) # list wasn't updated
140195653389960
['Russia', 'Latvia', 'United States', 'Nigeria', 'Mexico', 'India', 'Costa Rica']

To change the list itself, we need in-place addition, which the __iadd__ method provides through the += operator.

In [54]:
country_list += ['Canada']
country_list
Out[54]:
['Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica',
 'Canada']

There is also an in-place repetition operation (__imul__), available through the *= operator.

In [55]:
country_list *= 3
country_list
Out[55]:
['Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica',
 'Canada',
 'Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica',
 'Canada',
 'Russia',
 'Latvia',
 'United States',
 'Nigeria',
 'Mexico',
 'India',
 'Costa Rica',
 'Canada']
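One caveat with repetition is that it copies references, not objects. With strings or numbers this is harmless, but with nested lists every repeated slot points at the same inner list. A short sketch (the variable names are made up for illustration, and the last example uses a list comprehension as one possible workaround):

```python
flat = [0] * 3
flat[0] = 99                 # integers are immutable, so only slot 0 changes
print(flat)                  # [99, 0, 0]

nested = [[0]] * 3           # three references to ONE inner list
nested[0].append(1)          # the change is visible through every slot
print(nested)                # [[0, 1], [0, 1], [0, 1]]

independent = [[0] for _ in range(3)]   # builds three separate inner lists
independent[0].append(1)
print(independent)           # [[0, 1], [0], [0]]
```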

The point is that in-place operations make for more efficient code, because no second list has to be built. Also, when we concatenate with + we create a new object, whereas an in-place extension retains the original object id.

In [9]:
x = [1,2,3]
print(id(x))
140193773473096
In [10]:
x1 = x + [4]
print(id(x1))
140193773472904
In [11]:
x += [4]
print(id(x))
140193773473096
In [12]:
print(x1)
print(x)
[1, 2, 3, 4]
[1, 2, 3, 4]
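The difference matters most when a second name refers to the same list. A small sketch, where alias is just an illustrative variable name:

```python
x = [1, 2, 3]
alias = x                    # a second name for the same list object

x += [4]                     # in place: the shared object grows
print(alias)                 # [1, 2, 3, 4]
same_after_iadd = alias is x
print(same_after_iadd)       # True

x = x + [5]                  # builds a new list and rebinds the name x
print(x)                     # [1, 2, 3, 4, 5]
print(alias)                 # [1, 2, 3, 4]
same_after_add = alias is x
print(same_after_add)        # False
```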

Slicing

Often we want a range of values from a container. We can accomplish this by slicing.

Rule of thumb:

  • :
  • <start here>:<to the value before here>
x = [1, 2, 3, 4, 5, 6]
x[1:4]

is

 0  1  2  3  4  5
[1, 2, 3, 4, 5, 6]
    ^  ^  ^
In [57]:
country_list = ["Russia","Latvia","United States","Nigeria","Mexico","India","Costa Rica"]
country_list[1:5]
Out[57]:
['Latvia', 'United States', 'Nigeria', 'Mexico']

When we leave one side of the colon empty, we are saying "take me all the way to the beginning" or "all the way to the end".

In [58]:
country_list[:4]
Out[58]:
['Russia', 'Latvia', 'United States', 'Nigeria']
In [59]:
country_list[5:]
Out[59]:
['India', 'Costa Rica']

The slicing operator by itself makes a (shallow) copy of the object.

In [60]:
cc = country_list[:]
cc is country_list
Out[60]:
False

And every slice creates a new object with its own id.

In [61]:
print(id(country_list))
print(id(country_list[:3]))
print(id(country_list[3:]))
4336964808
4336503368
4337294856
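A slice copy is shallow: the outer list is new, but any nested objects are shared with the original. The sketch below uses a small made-up nested list to show which changes stay local to the copy and which leak back.

```python
matrix = [[1, 2], [3, 4]]
sliced = matrix[:]           # new outer list, same inner lists

sliced.append([5, 6])        # changes only the copy's outer list
sliced[0].append(99)         # mutates an inner list that both share

print(matrix)                # [[1, 2, 99], [3, 4]]
print(sliced)                # [[1, 2, 99], [3, 4], [5, 6]]
shares_inner = sliced[0] is matrix[0]
print(shares_inner)          # True
```

When fully independent nested data is needed, copy.deepcopy() from the standard library is one option.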

Deleting Values

  • del keyword
  • .remove() method
In [62]:
del country_list[1]
country_list
Out[62]:
['Russia', 'United States', 'Nigeria', 'Mexico', 'India', 'Costa Rica']
In [63]:
country_list.remove("Nigeria")
country_list
Out[63]:
['Russia', 'United States', 'Mexico', 'India', 'Costa Rica']
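The two approaches differ in what they target: del works on positions (including slices), while .remove() works on values and only drops the first match. A brief sketch with an illustrative list:

```python
countries = ["Russia", "Latvia", "Russia", "Mexico", "India"]

countries.remove("Russia")   # drops only the FIRST "Russia"
print(countries)             # ['Latvia', 'Russia', 'Mexico', 'India']

del countries[1:3]           # drops positions 1 and 2 in one step
print(countries)             # ['Latvia', 'India']

try:
    countries.remove("Canada")   # value is not present
    removed = True
except ValueError:
    removed = False
print(removed)               # False
```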

Popping elements out of a container

Elements can be retrieved and removed from a collection in a single step with .pop(). This is useful when you have a fixed list of items and want to perform the same operation on each one.

In [64]:
country_list.pop()
Out[64]:
'Costa Rica'
In [65]:
country_list
Out[65]:
['Russia', 'United States', 'Mexico', 'India']

We can also pop an item out by giving its index location.

In [66]:
country_list.pop(2)
Out[66]:
'Mexico'
In [67]:
country_list
Out[67]:
['Russia', 'United States', 'India']
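One common pattern is to treat the list as a queue of work and pop until it is empty. A minimal sketch, assuming the "work" is simply upper-casing each name:

```python
to_process = ["Russia", "United States", "India"]
processed = []

# An empty list is falsy, so the loop stops once everything is popped
while to_process:
    country = to_process.pop()          # takes the last element each time
    processed.append(country.upper())

print(processed)             # ['INDIA', 'UNITED STATES', 'RUSSIA']
print(to_process)            # []
```

Since .pop() takes from the end by default, items are handled in reverse order.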

Counting Values

In [68]:
country_list = ["Russia","Latvia","United States","Russia","Mexico",
                "India","Papua New Guinea","Latvia","Russia"]
print(country_list.count("Russia"))
print(country_list.count("Latvia"))
3
2
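When counts are needed for every value rather than one at a time, collections.Counter from the standard library tallies them in a single pass. This is an alternative to calling .count() repeatedly, not something the list type itself provides.

```python
from collections import Counter

country_list = ["Russia", "Latvia", "United States", "Russia", "Mexico",
                "India", "Papua New Guinea", "Latvia", "Russia"]

counts = Counter(country_list)          # maps each value to its frequency
print(counts["Russia"])                 # 3
print(counts["Latvia"])                 # 2
print(counts.most_common(1))            # [('Russia', 3)]
```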

Sorting Values

In [69]:
country_list.sort()
country_list
Out[69]:
['India',
 'Latvia',
 'Latvia',
 'Mexico',
 'Papua New Guinea',
 'Russia',
 'Russia',
 'Russia',
 'United States']
In [70]:
country_list.reverse()
country_list
Out[70]:
['United States',
 'Russia',
 'Russia',
 'Russia',
 'Papua New Guinea',
 'Mexico',
 'Latvia',
 'Latvia',
 'India']

There is also the built-in function sorted(), which returns a new sorted list and leaves the original unchanged.

In [71]:
sorted(country_list)
Out[71]:
['India',
 'Latvia',
 'Latvia',
 'Mexico',
 'Papua New Guinea',
 'Russia',
 'Russia',
 'Russia',
 'United States']
In [72]:
# Can sort by the result of an existing function (here, the built-in len)
sorted(country_list,key=len,reverse=True)
Out[72]:
['Papua New Guinea',
 'United States',
 'Russia',
 'Russia',
 'Russia',
 'Mexico',
 'Latvia',
 'Latvia',
 'India']
In [73]:
# Can sort by a function that we define (more on lambda functions next time)
sorted(country_list,key=lambda x: x[0] == "R" or x[0] == "L",reverse=True)
Out[73]:
['Russia',
 'Russia',
 'Russia',
 'Latvia',
 'Latvia',
 'United States',
 'Papua New Guinea',
 'Mexico',
 'India']
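A frequent slip is assigning the result of .sort(), which sorts in place and returns None. The sketch below contrasts it with sorted() and also shows a tuple key, which sorts by length first and alphabetically within ties (the lists are illustrative).

```python
countries = ["Mexico", "India", "Latvia"]
result = countries.sort()    # sorts in place; the return value is None
print(result)                # None
print(countries)             # ['India', 'Latvia', 'Mexico']

original = ["Mexico", "India", "Latvia", "Russia"]
# Tuples compare element by element: length first, then the name itself
by_length_then_name = sorted(original, key=lambda name: (len(name), name))
print(by_length_then_name)   # ['India', 'Latvia', 'Mexico', 'Russia']
print(original)              # ['Mexico', 'India', 'Latvia', 'Russia']
```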

Accessing a method's documentation with help()

In [74]:
help([].sort)
Help on built-in function sort:

sort(*, key=None, reverse=False) method of builtins.list instance
    Stable sort *IN PLACE*.

Recall, also, that there is Jupyter notebook magic for requesting a function's or method's documentation.

In [75]:
?list

list Methods to Keep in Mind

Methods in object type `list`

Method      Description
.append()   L.append(object) -> None -- append object to end
.clear()    L.clear() -> None -- remove all items from L
.copy()     L.copy() -> list -- a shallow copy of L
.count()    L.count(value) -> integer -- return number of occurrences of value
.extend()   L.extend(iterable) -> None -- extend list by appending elements from the iterable
.index()    L.index(value, [start, [stop]]) -> integer -- return first index of value. Raises ValueError if the value is not present.
.insert()   L.insert(index, object) -- insert object before index
.pop()      L.pop([index]) -> item -- remove and return item at index (default last). Raises IndexError if list is empty or index is out of range.
.remove()   L.remove(value) -> None -- remove first occurrence of value. Raises ValueError if the value is not present.
.reverse()  L.reverse() -- reverse IN PLACE
.sort()     L.sort(key=None, reverse=False) -> None -- stable sort IN PLACE

Dictionaries

Recall that dictionaries are associative arrays of key-value pairs, indexed by their keys. Dictionaries preserve the insertion order of their keys (guaranteed from Python 3.7 onward). A key cannot be changed once created, but the value stored under it can. Dictionary keys provide an efficient way to look up information contained within the data structure.

We can combine a dictionary with other data types (such as a list) to make an efficient and effective data structure.

In [76]:
grades = {"John": [90,88,95,86],"Susan":[87,91,92,89],"Chad":[56,None,72,77]}

We can use the keys for efficient look up.

In [77]:
grades["John"]
Out[77]:
[90, 88, 95, 86]

We can also use the .get() method to get the values that correspond to a specific key.

In [78]:
grades.get("Susan")
Out[78]:
[87, 91, 92, 89]
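The two lookups behave differently when the key is missing: square brackets raise a KeyError, while .get() returns None or a default that we supply. A short sketch using a trimmed copy of the grades data:

```python
grades = {"John": [90, 88, 95, 86], "Susan": [87, 91, 92, 89]}

missing = grades.get("Wendy")          # no such key, so None comes back
fallback = grades.get("Wendy", [])     # second argument is the default
print(missing)               # None
print(fallback)              # []

try:
    grades["Wendy"]                    # bracket lookup raises instead
    raised = False
except KeyError:
    raised = True
print(raised)                # True
```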

Accessing key-value pairs

To print a listing of all available keys, use the .keys() method

In [79]:
grades.keys()
Out[79]:
dict_keys(['John', 'Susan', 'Chad'])

Likewise, we can print all values using the .values() method.

In [80]:
grades.values()
Out[80]:
dict_values([[90, 88, 95, 86], [87, 91, 92, 89], [56, None, 72, 77]])

Finally, we can collect all key-value pairs (each as a tuple) using the .items() method.

In [81]:
grades.items()
Out[81]:
dict_items([('John', [90, 88, 95, 86]), ('Susan', [87, 91, 92, 89]), ('Chad', [56, None, 72, 77])])
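The .items() view is most useful in a loop, where each tuple can be unpacked into a key and a value. As a sketch, the code below computes each student's average while skipping missing (None) scores; the averages dictionary is an illustrative name, and skipping None is one possible way to treat missing data.

```python
grades = {"John": [90, 88, 95, 86], "Susan": [87, 91, 92, 89], "Chad": [56, None, 72, 77]}

averages = {}
for name, scores in grades.items():            # unpack each (key, value) tuple
    valid = [s for s in scores if s is not None]   # drop missing scores
    averages[name] = sum(valid) / len(valid)

print(averages["John"])              # 89.75
print(round(averages["Chad"], 2))    # 68.33
```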

Updating dictionaries

We can add new dictionary data entries using the .update() method.

In [82]:
new_entry = {"Wendy":[99,98,97,94]} # Another student dictionary entry with grades
grades.update(new_entry) # Update the current dictionary
grades
Out[82]:
{'John': [90, 88, 95, 86],
 'Susan': [87, 91, 92, 89],
 'Chad': [56, None, 72, 77],
 'Wendy': [99, 98, 97, 94]}

In a similar fashion, we can update the dictionary directly by providing a new key entry and storing the data.

In [83]:
grades["Seth"] = [66,72,79,81]
grades
Out[83]:
{'John': [90, 88, 95, 86],
 'Susan': [87, 91, 92, 89],
 'Chad': [56, None, 72, 77],
 'Wendy': [99, 98, 97, 94],
 'Seth': [66, 72, 79, 81]}

Remember: values can be changed, keys cannot.

Dropping Keys

(1) You can .pop() a dictionary value out.

In [84]:
grades.pop("Seth")
Out[84]:
[66, 72, 79, 81]
In [85]:
grades
Out[85]:
{'John': [90, 88, 95, 86],
 'Susan': [87, 91, 92, 89],
 'Chad': [56, None, 72, 77],
 'Wendy': [99, 98, 97, 94]}

(2) You can delete the key.

In [86]:
del grades['Wendy']
In [87]:
grades
Out[87]:
{'John': [90, 88, 95, 86],
 'Susan': [87, 91, 92, 89],
 'Chad': [56, None, 72, 77]}

Dropping Values

To drop values, either

  1. overwrite the original data
  2. drop the key
  3. clear the dictionary
In [88]:
grades['John'] = 7
In [89]:
grades
Out[89]:
{'John': 7, 'Susan': [87, 91, 92, 89], 'Chad': [56, None, 72, 77]}

Clear the contents of the dictionary.

In [90]:
grades.clear()
grades
Out[90]:
{}
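Two details about these removal operations are worth a quick sketch. First, .pop() accepts a default, which avoids a KeyError when the key is absent. Second, .clear() empties the existing object, whereas assigning {} only rebinds the name. The names alias and dropped are illustrative.

```python
grades = {"John": [90, 88], "Susan": [87, 91]}
alias = grades                         # second name for the same dictionary

dropped = grades.pop("Wendy", None)    # default is returned for a missing key
print(dropped)               # None

grades.clear()                         # empties the shared object in place
print(alias)                 # {}

grades = {"John": [90, 88]}
alias = grades
grades = {}                            # rebinding: alias still holds the data
print(alias)                 # {'John': [90, 88]}
```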

Values don't have to be of the same type

Note the following:

  • for key "a", we stored an integer.
  • for key "b", we stored another dictionary that has two keys "i" and "ii" that stored a string and a float, respectively.
  • for key "c", we stored a tuple.
In [91]:
new_dict = {"a":6,"b":{"i":"hello","ii":2.3},"c":(4,5,6,7)}
new_dict
Out[91]:
{'a': 6, 'b': {'i': 'hello', 'ii': 2.3}, 'c': (4, 5, 6, 7)}
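Nested values are reached by chaining lookups, one level at a time. The sketch below reuses new_dict; the key "iii" and the flag tuple_is_immutable are added only for illustration.

```python
new_dict = {"a": 6, "b": {"i": "hello", "ii": 2.3}, "c": (4, 5, 6, 7)}

greeting = new_dict["b"]["i"]          # key lookup, then key lookup
third = new_dict["c"][2]               # key lookup, then positional index
print(greeting)              # hello
print(third)                 # 6

new_dict["b"]["iii"] = [1, 2]          # the inner dictionary is mutable

try:
    new_dict["c"][0] = 0               # the tuple stored under "c" is not
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True
print(tuple_is_immutable)    # True
```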

dict methods to keep in mind

Methods in object type `dict`

Method         Description
.clear()       D.clear() -> None. Remove all items from D.
.copy()        D.copy() -> a shallow copy of D
.fromkeys()    Returns a new dict with keys from iterable and values equal to value.
.get()         D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.
.items()       D.items() -> a set-like object providing a view on D's items
.keys()        D.keys() -> a set-like object providing a view on D's keys
.pop()         D.pop(k[,d]) -> v, remove specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised
.popitem()     D.popitem() -> (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.
.setdefault()  D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
.update()      D.update([E, ]**F) -> None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
.values()      D.values() -> an object providing a view on D's values
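Of these, .setdefault() is handy for building a dictionary of lists without checking whether each key already exists. A minimal sketch, assuming the scores arrive as made-up (name, score) pairs:

```python
pairs = [("John", 90), ("Susan", 87), ("John", 88)]

grades = {}
for name, score in pairs:
    # Returns the existing list for name, or stores and returns a new one
    grades.setdefault(name, []).append(score)

print(grades)                # {'John': [90, 88], 'Susan': [87]}
```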

Sets

Sets differ from lists and dictionaries in that we can perform set operations (such as union, intersection, and difference) on them. In addition, no duplicate values are retained in a set, so it provides an efficient way to isolate the unique values in a list of inputs.

In [92]:
my_set = {1,2,3,8,4,4,6}
my_set
Out[92]:
{1, 2, 3, 4, 6, 8}
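Both uses can be sketched briefly: passing a list to set() isolates its unique values, and the operators |, &, -, and ^ are shorthand for the union, intersection, difference, and symmetric difference methods. The example values are made up, and sorted() is used for printing only because sets have no guaranteed display order.

```python
country_list = ["Russia", "Latvia", "Russia", "Mexico", "Latvia"]
unique_countries = set(country_list)   # duplicates collapse
print(sorted(unique_countries))        # ['Latvia', 'Mexico', 'Russia']

a = {1, 2, 3, 4}
b = {3, 4, 5}
union = a | b                # same as a.union(b)
intersection = a & b         # same as a.intersection(b)
difference = a - b           # same as a.difference(b)
symmetric = a ^ b            # same as a.symmetric_difference(b)

print(sorted(union))         # [1, 2, 3, 4, 5]
print(sorted(intersection))  # [3, 4]
print(sorted(difference))    # [1, 2]
print(sorted(symmetric))     # [1, 2, 5]
```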

Note that values in a set cannot be accessed using an index.

In [93]:
my_set[0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-93-158c424478a1> in <module>()
----> 1 my_set[0]

TypeError: 'set' object does not support indexing

Instead, we can .pop() values out of the set (but we cannot provide an index location, so an arbitrary element is returned).

In [94]:
my_set.pop()
Out[94]:
1
In [95]:
my_set
Out[95]:
{2, 3, 4, 6, 8}

Or we can .remove() specific values from the set.

In [96]:
my_set.remove(3)
my_set
Out[96]:
{2, 4, 6, 8}
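Note that .remove() raises a KeyError when the value is not in the set, while .discard() silently does nothing. A short sketch:

```python
my_set = {2, 4, 6, 8}

my_set.discard(3)            # 3 is not a member: no error, no change

try:
    my_set.remove(3)         # same value, but .remove() insists it exists
    raised = False
except KeyError:
    raised = True

print(raised)                # True
print(sorted(my_set))        # [2, 4, 6, 8]
```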

Note that sets can contain heterogeneous scalar types, but they cannot contain mutable (unhashable) container data types.

In [97]:
set_a = {.5,6,"a",None}
set_a
Out[97]:
{0.5, 6, None, 'a'}

Can't hold a mutable list.

In [98]:
set_b = {.5,6,"a",None,[8,5,6]}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-98-7feb046d8695> in <module>()
----> 1 set_b = {.5,6,"a",None,[8,5,6]}

TypeError: unhashable type: 'list'

Can hold an immutable tuple.

In [99]:
set_c = {.5,6,"a",None,(1,2,3)}
set_c
Out[99]:
{(1, 2, 3), 0.5, 6, None, 'a'}

Finally, note that the order changed. Unlike lists, sets do not retain any intrinsic ordering.
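The underlying rule is that set elements must be hashable. Two consequences, sketched below with illustrative names: a frozenset (the immutable counterpart of set) can be stored inside a set, and a tuple stops being usable as soon as it contains a list.

```python
inner = frozenset({1, 2, 3})           # immutable, therefore hashable
set_d = {0.5, "a", inner}
has_inner = inner in set_d
print(has_inner)             # True

try:
    set_e = {(1, [2, 3])}              # the tuple contains an unhashable list
    tuple_with_list_ok = True
except TypeError:
    tuple_with_list_ok = False
print(tuple_with_list_ok)    # False
```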

set methods to keep in mind

Methods in object type `set`

Method                          Description
.add()                          Add an element to a set.
.clear()                        Remove all elements from this set.
.copy()                         Return a shallow copy of a set.
.difference()                   Return the difference of two or more sets as a new set.
.difference_update()            Remove all elements of another set from this set.
.discard()                      Remove an element from a set if it is a member.
.intersection()                 Return the intersection of two sets as a new set.
.intersection_update()          Update a set with the intersection of itself and another.
.isdisjoint()                   Return True if two sets have a null intersection.
.issubset()                     Report whether another set contains this set.
.issuperset()                   Report whether this set contains another set.
.pop()                          Remove and return an arbitrary set element. Raises KeyError if the set is empty.
.remove()                       Remove an element from a set; it must be a member.
.symmetric_difference()         Return the symmetric difference of two sets as a new set.
.symmetric_difference_update()  Update a set with the symmetric difference of itself and another.
.union()                        Return the union of sets as a new set.
.update()                       Update a set with the union of itself and others.