PPOL564 - Data Science I: Foundations

Lecture 6

File Management

Contents

  • Connection management: open(), close()
  • Reading/writing files
  • Using with to manage connections
  • Reading .csv

Note that two files are read in this notebook. These files need to be in the same working directory as your notebook; otherwise Python will not know where to find them on your computer. Please download the accompanying files (news-story.txt and student_data.csv) from the class website.

Reading Files

open()

Now, let's open news-story.txt in Python.

The built-in open() function opens files on our system. The function takes the following arguments:

  • a file path
  • a mode describing how to treat the file (e.g. read the file, write to the file, append to the file, etc.). Default is read mode ("r").
  • an encoding. When omitted, the default depends on your platform (commonly "UTF-8"), so it is safest to pass it explicitly.
In [1]:
file = open("news-story.txt",mode='rt',encoding='UTF-8')

open() returns a file object of type _io.TextIOWrapper. Note that a "file-like object" is loosely defined in Python. Again, we see duck typing in action: if it looks like a file and behaves like a file then, heck, it's probably a file.

In [2]:
type(file)
Out[2]:
_io.TextIOWrapper
In [3]:
print(file.read())
Russian planes have reportedly bombed rebel-held targets in the Syrian province of Idlib, as government troops mass before an expected offensive.

If confirmed, they would be the first such air strikes there in three weeks.

Earlier, US President Donald Trump warned Syria's Bashar al-Assad against launching a "reckless attack" on Idlib.

But Kremlin spokesman Dmitry Peskov rejected the warning and said the Syrian army was "getting ready" to clear a "cradle of terrorism" there.

Five reasons why the battle for Idlib matters
Why is there a war in Syria?
Mr Peskov said the al Qaeda-linked jihadists dominating in the north-western province of Idlib were threatening Russian military bases in Syria and blocking a political solution to the civil war.

The UN has warned of a humanitarian catastrophe if an all-out assault takes place.

The UN envoy to Syria, Staffan de Mistura, called on Russia and Turkey to act urgently to avert "a bloodbath" in Idlib.

He said telephone talks between Russian President Vladimir Putin and his Turkish counterpart Regep Tayyip Erdogan "would make a big difference".

Mr de Mistura also welcomed Mr Trump's comments on the issue, saying it was sending "the right message".



In [4]:
print(file.read()) # The read position is now at the end of the file, so a second read() returns an empty string
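The empty result above comes from the read position sitting at the end of the file, not from the file losing its contents. As a minimal sketch (the scratch file demo.txt is a hypothetical stand-in so the example runs without news-story.txt), seek(0) moves the position back to the start so the contents can be read again:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
f = open(path, mode="wt", encoding="utf-8")
f.write("first line\nsecond line\n")
f.close()

file = open(path, mode="rt", encoding="utf-8")
first_read = file.read()    # reads everything; the position is now at the end
second_read = file.read()   # nothing left to read
file.seek(0)                # move the position back to the start of the file
third_read = file.read()    # the full contents again
file.close()

print(repr(second_read))          # ''
print(third_read == first_read)   # True
```

Reopening the file would have the same effect; seek(0) simply avoids closing and opening it again.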


close()

Once we are done with a file, we need to close it.

In [5]:
file.close()

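Opening files and forgetting to close them can lead to a bunch of issues, mainly the mismanagement of computational resources on your machine (every open file holds on to an operating-system file handle).

Moreover, writes are buffered, so close() (or flush()) is what guarantees that everything we wrote actually reaches the file on our computer.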
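One common way to make sure close() always runs is to put it in a finally clause. The sketch below assumes a hypothetical scratch file (demo.txt) and also shows what happens if we try to use a file object after closing it:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, mode="wt", encoding="utf-8")
try:
    f.write("some text\n")
finally:
    f.close()  # runs even if the write above raises an error

# A closed file object can no longer be read from or written to
try:
    f.write("more text\n")
    error_message = None
except ValueError as err:
    error_message = str(err)

print(f.closed)        # True
print(error_message)   # the message attached to the ValueError
```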


Methods available when reading in files

Attributes and methods of object type `TextIOWrapper`

Name               Description
._CHUNK_SIZE       int; internal chunk size (private, not meant for everyday use)
._finalizing       bool; internal flag (private, not meant for everyday use)
.buffer            the underlying binary buffer that the text wrapper reads from
.closed            bool; True once the file has been closed
.encoding          str; the name of the encoding used for the file
.errors            str; the error-handling setting for encoding/decoding problems
.line_buffering    bool; whether line buffering is enabled
.mode              str; the mode the file was opened with
.name              str; the name (path) of the file
.readlines()       method; return a list of lines from the stream
.reconfigure()     method; reconfigure the text stream with new parameters
.write_through     bool; whether writes are passed straight through to the underlying binary buffer
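Most of the names listed here are plain attributes, so they are read without parentheses. A small sketch, using a hypothetical empty scratch file (demo.txt) rather than the course files:

```python
import os
import tempfile

# Hypothetical scratch file, created empty only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
open(path, mode="wt", encoding="utf-8").close()

f = open(path, mode="rt", encoding="utf-8")
info = {
    "name": f.name,          # the path we passed to open()
    "mode": f.mode,          # the mode string we passed to open()
    "encoding": f.encoding,  # the encoding in use
    "closed": f.closed,      # False while the file is still open
}
f.close()

for key, value in info.items():
    print(key, "->", value)
print("closed after close():", f.closed)
```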
In [6]:
file = open("news-story.txt",mode='rt',encoding='UTF-8')
file.readlines() # convert all items to a list
Out[6]:
['Russian planes have reportedly bombed rebel-held targets in the Syrian province of Idlib, as government troops mass before an expected offensive.\n',
 '\n',
 'If confirmed, they would be the first such air strikes there in three weeks.\n',
 '\n',
 'Earlier, US President Donald Trump warned Syria\'s Bashar al-Assad against launching a "reckless attack" on Idlib.\n',
 '\n',
 'But Kremlin spokesman Dmitry Peskov rejected the warning and said the Syrian army was "getting ready" to clear a "cradle of terrorism" there.\n',
 '\n',
 'Five reasons why the battle for Idlib matters\n',
 'Why is there a war in Syria?\n',
 'Mr Peskov said the al Qaeda-linked jihadists dominating in the north-western province of Idlib were threatening Russian military bases in Syria and blocking a political solution to the civil war.\n',
 '\n',
 'The UN has warned of a humanitarian catastrophe if an all-out assault takes place.\n',
 '\n',
 'The UN envoy to Syria, Staffan de Mistura, called on Russia and Turkey to act urgently to avert "a bloodbath" in Idlib.\n',
 '\n',
 'He said telephone talks between Russian President Vladimir Putin and his Turkish counterpart Regep Tayyip Erdogan "would make a big difference".\n',
 '\n',
 'Mr de Mistura also welcomed Mr Trump\'s comments on the issue, saying it was sending "the right message".\n',
 '\n',
 '\n']
In [7]:
# Is the file closed?
file.closed
Out[7]:
False

File modes

Mode   Description
r      open for reading (default)
w      open for writing, truncating (emptying) the file first
x      open for exclusive creation, failing if the file already exists
a      open for writing, appending to the end of the file if it exists
b      binary mode
t      text mode (default)

Modes can be combined. Examples:

  • mode = 'rb' → "read binary"
  • mode = 'wt' → "write text"
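To make the mode combinations concrete, here is a minimal sketch with a hypothetical scratch file (modes.txt). It shows exclusive creation failing the second time, and the difference between what binary and text modes return:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# 'xt': exclusive creation in text mode; works because the file does not exist yet
f = open(path, mode="xt", encoding="utf-8")
f.write("hello\n")
f.close()

# Trying exclusive creation again fails because the file now exists
try:
    open(path, mode="xt", encoding="utf-8")
    created_twice = True
except FileExistsError:
    created_twice = False

# 'rb' returns raw bytes, 'rt' returns decoded text
f = open(path, mode="rb")
raw = f.read()
f.close()

f = open(path, mode="rt", encoding="utf-8")
text = f.read()
f.close()

print(created_twice)          # False
print(type(raw), type(text))  # <class 'bytes'> <class 'str'>
```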
In [8]:
f = open('news-story.txt',mode="rt",encoding='utf-8')

# Print the mode
print(f.mode)

f.close()
rt

Writing files

In [9]:
f = open('text_file.txt',mode="wt",encoding='utf-8')
f.write('This is an example\n') 
f.write('Of writing a file...\n')
f.write('Neat!\n')
f.close()

NOTE that your lines are only guaranteed to be written to the file once you close() it.

Now, read the file back in using "read mode".

In [10]:
f = open('text_file.txt',mode="rt",encoding='utf-8')
print(f.read())
f.close()
This is an example
Of writing a file...
Neat!

We can also write several strings at once by passing a container to writelines().

In [11]:
sent = "This is a sentence.".split()
print(sent)
['This', 'is', 'a', 'sentence.']
In [12]:
# Note here I'm opening the file in "append mode"
f = open('text_file.txt',mode="at",encoding='utf-8')
f.writelines(sent)
f.close()
In [13]:
f = open('text_file.txt',mode="rt",encoding='utf-8')
print(f.read())
f.close()
This is an example
Of writing a file...
Neat!
Thisisasentence.

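Note that writelines() does not add line breaks for us, which is why the words ran together above. \n is the line-break character, so we need to append it to each item ourselves.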

In [14]:
sent2 = []
for word in sent:
    new_word = word + "\n"
    sent2.append(new_word)
print(sent2)
['This\n', 'is\n', 'a\n', 'sentence.\n']
In [15]:
# Open the file, and write our new sentence list object
f = open('text_file.txt',mode="at",encoding='utf-8')
f.writelines(sent2)
f.close()

f = open('text_file.txt',mode="rt",encoding='utf-8')
print(f.read())
f.close()
This is an example
Of writing a file...
Neat!
Thisisasentence.This
is
a
sentence.
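The step of appending "\n" to every word can also be written more compactly by handing writelines() a generator expression. This is only a sketch of an alternative, written to a hypothetical scratch file (words.txt) so it does not touch text_file.txt:

```python
import os
import tempfile

sent = "This is a sentence.".split()

# Hypothetical scratch file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "words.txt")

f = open(path, mode="wt", encoding="utf-8")
f.writelines(word + "\n" for word in sent)  # add the line break to each word as it is written
f.close()

f = open(path, mode="rt", encoding="utf-8")
contents = f.read()
f.close()

print(contents)
```

Each word lands on its own line, the same result as building the intermediate list first.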


Iterating over files

Looking at the file object's attributes, we'll see that it has __iter__() and __next__() methods, meaning we can iterate over an open file object one line at a time.
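To see what those two methods do before using a for loop, we can call iter() and next() by hand. A minimal sketch on a hypothetical two-line scratch file (demo.txt):

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
f = open(path, mode="wt", encoding="utf-8")
f.write("line one\nline two\n")
f.close()

f = open(path, mode="rt", encoding="utf-8")
same_object = iter(f) is f            # a file object is its own iterator
first = next(f)                       # next() returns one line at a time
remaining = [line for line in f]      # the loop picks up where next() left off
f.close()

print(same_object)   # True
print(repr(first))   # 'line one\n'
print(remaining)     # ['line two\n']
```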

In [16]:
file = open("news-story.txt",mode='rt',encoding='UTF-8')
for line in file:
    print(line)
file.close()
Russian planes have reportedly bombed rebel-held targets in the Syrian province of Idlib, as government troops mass before an expected offensive.



If confirmed, they would be the first such air strikes there in three weeks.



Earlier, US President Donald Trump warned Syria's Bashar al-Assad against launching a "reckless attack" on Idlib.



But Kremlin spokesman Dmitry Peskov rejected the warning and said the Syrian army was "getting ready" to clear a "cradle of terrorism" there.



Five reasons why the battle for Idlib matters

Why is there a war in Syria?

Mr Peskov said the al Qaeda-linked jihadists dominating in the north-western province of Idlib were threatening Russian military bases in Syria and blocking a political solution to the civil war.



The UN has warned of a humanitarian catastrophe if an all-out assault takes place.



The UN envoy to Syria, Staffan de Mistura, called on Russia and Turkey to act urgently to avert "a bloodbath" in Idlib.



He said telephone talks between Russian President Vladimir Putin and his Turkish counterpart Regep Tayyip Erdogan "would make a big difference".



Mr de Mistura also welcomed Mr Trump's comments on the issue, saying it was sending "the right message".





In [17]:
file = open("news-story.txt",mode='rt',encoding='UTF-8')
for line in file:
    if line == '\n':
        continue
    print(line)        
file.close()
Russian planes have reportedly bombed rebel-held targets in the Syrian province of Idlib, as government troops mass before an expected offensive.

If confirmed, they would be the first such air strikes there in three weeks.

Earlier, US President Donald Trump warned Syria's Bashar al-Assad against launching a "reckless attack" on Idlib.

But Kremlin spokesman Dmitry Peskov rejected the warning and said the Syrian army was "getting ready" to clear a "cradle of terrorism" there.

Five reasons why the battle for Idlib matters

Why is there a war in Syria?

Mr Peskov said the al Qaeda-linked jihadists dominating in the north-western province of Idlib were threatening Russian military bases in Syria and blocking a political solution to the civil war.

The UN has warned of a humanitarian catastrophe if an all-out assault takes place.

The UN envoy to Syria, Staffan de Mistura, called on Russia and Turkey to act urgently to avert "a bloodbath" in Idlib.

He said telephone talks between Russian President Vladimir Putin and his Turkish counterpart Regep Tayyip Erdogan "would make a big difference".

Mr de Mistura also welcomed Mr Trump's comments on the issue, saying it was sending "the right message".
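The output is still double spaced because every line already ends in "\n" and print() adds a line break of its own. One possible way around this, sketched on a hypothetical scratch file (demo.txt), is to strip the trailing line break before printing:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
f = open(path, mode="wt", encoding="utf-8")
f.write("first paragraph\n\nsecond paragraph\n")
f.close()

lines = []
f = open(path, mode="rt", encoding="utf-8")
for line in f:
    stripped = line.rstrip("\n")  # drop the line break that came from the file
    if not stripped:              # skip lines that were only a line break
        continue
    lines.append(stripped)
    print(stripped)
f.close()
```

Passing end="" to print() would be another option; rstrip() is used here because it also leaves us with clean strings to work with later.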

In [18]:
# Example: How many words are in each line?

file = open("news-story.txt",mode='rt',encoding='UTF-8')

for line in file:
    if line == '\n':
        continue
    n_words_per_line = len(line.split())
    print(n_words_per_line)
    
file.close()
21
14
16
23
8
7
30
14
22
21
18

with: beyond opening and closing with context managers

As you'll note, the need to open() and close() files can get a bit redundant after a while. This issue of closing after opening to deal with resource cleanup is common enough that Python has a special protocol for it: the with code block.

In [19]:
with open("news-story.txt",mode='rt',encoding='UTF-8') as file:
    for line in file:
        if line == '\n':
            continue
        n_words_per_line = len(line.split())
        print(n_words_per_line)
21
14
16
23
8
7
30
14
22
21
18
In [20]:
file.closed
Out[20]:
True
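The with block closes the file even when something goes wrong inside it. The sketch below raises an error on purpose partway through reading a hypothetical scratch file (demo.txt) and then checks that the file was still closed:

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, mode="wt", encoding="utf-8") as f:
    f.write("a line\n")

try:
    with open(path, mode="rt", encoding="utf-8") as file:
        first_line = file.readline()
        raise RuntimeError("something went wrong mid-read")  # deliberate error
except RuntimeError:
    pass

print(file.closed)   # True
```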

Reading Comma Separated Values (CSV)

See the Python documentation for more on the csv module, which is part of the standard library.

In [21]:
import csv

Reading in .csv data

In [23]:
with open("student_data.csv",mode='rt') as file:
    data = csv.reader(file)
    for row in data:
        print(row)
['Student', 'Grade']
['Susan', 'A']
['Sean', 'B-']
['Cody', 'A-']
['Karen', 'B+']
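Since the reader is an iterator, the header row can be pulled off with next() before looping over the data rows. A minimal sketch, assuming a hypothetical scratch copy of a few rows (students.csv) so the example runs on its own:

```python
import csv
import os
import tempfile

# Hypothetical scratch file holding a few of the rows shown above
path = os.path.join(tempfile.mkdtemp(), "students.csv")
with open(path, mode="w", newline="", encoding="utf-8") as file:
    file.write("Student,Grade\nSusan,A\nSean,B-\n")

with open(path, mode="r", newline="", encoding="utf-8") as file:
    reader = csv.reader(file)
    header = next(reader)   # the first row holds the column names
    rows = list(reader)     # everything that is left is data

print(header)   # ['Student', 'Grade']
print(rows)     # [['Susan', 'A'], ['Sean', 'B-']]
```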

Writing csv data

In [24]:
# Student data as a nested list.
student_data = [["Student","Grade"],
                ["Susan","A"],
                ["Sean","B-"],
                ["Cody","A-"],
                ["Karen",'B+']]

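# Write the rows with the .writerows() method
# (newline='' lets the csv module control line endings, as its documentation recommends)
with open("student_data.csv",mode='w',newline='') as file: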
    csv_file = csv.writer(file)
    csv_file.writerows(student_data)
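One reason to use csv.writer instead of joining strings with commas ourselves is that it quotes values that contain a comma. The sketch below writes to an in-memory io.StringIO buffer instead of a file, and the "Comment" column is a made-up example, not part of the course data:

```python
import csv
import io

# Hypothetical rows; the second value contains a comma
rows = [["Student", "Comment"],
        ["Susan", "Great work, keep it up"]]

buffer = io.StringIO()          # an in-memory, file-like object
writer = csv.writer(buffer)
writer.writerows(rows)
text = buffer.getvalue()

for line in text.splitlines():
    print(line)

# Reading the text back recovers the original values
round_trip = list(csv.reader(io.StringIO(text, newline="")))
print(round_trip == rows)   # True
```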

Reading csv files as dictionaries

csv.DictReader() reads each row as a dictionary, using the header row for the keys (csv.DictWriter() does the reverse when writing). Here our variable names operate as keys that we can easily reference.

In [25]:
with open("student_data.csv", 'r') as file:
    csv_file = csv.DictReader(file)
    for row in csv_file:
        print(row)
OrderedDict([('Student', 'Susan'), ('Grade', 'A')])
OrderedDict([('Student', 'Sean'), ('Grade', 'B-')])
OrderedDict([('Student', 'Cody'), ('Grade', 'A-')])
OrderedDict([('Student', 'Karen'), ('Grade', 'B+')])
In [26]:
with open("student_data.csv", 'r') as file:
    csv_file = csv.DictReader(file)
    for row in csv_file:
        print(f"{row['Student']} received a {row['Grade']} in the course")
Susan received a A in the course
Sean received a B- in the course
Cody received a A- in the course
Karen received a B+ in the course
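Because each row is keyed by column name, it is easy to reshape the data. As a sketch, the rows shown above are placed in an in-memory io.StringIO buffer (an assumption made so the example runs without student_data.csv) and turned into a name-to-grade lookup:

```python
import csv
import io

# The same rows as student_data.csv, held in memory for this illustration
csv_text = "Student,Grade\nSusan,A\nSean,B-\nCody,A-\nKaren,B+\n"

reader = csv.DictReader(io.StringIO(csv_text, newline=""))
fieldnames = reader.fieldnames   # column names taken from the header row
grades = {row["Student"]: row["Grade"] for row in reader}

print(fieldnames)        # ['Student', 'Grade']
print(grades["Cody"])    # A-
```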

Writing csv files from dictionaries

In [27]:
with open("student_data.csv", 'w', newline='') as file:
    variable_names = ["Student","Grade"]
    csv_file = csv.DictWriter(file, fieldnames=variable_names)

    csv_file.writeheader()
    for student in student_data[1:]:
        csv_file.writerow({'Student':student[0],'Grade':student[1]})
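If the records are already dictionaries, DictWriter can write them all at once with writerows(). A minimal sketch, using an in-memory io.StringIO buffer and a hypothetical records list so nothing on disk is overwritten:

```python
import csv
import io

# Hypothetical records, already stored as dictionaries
records = [{"Student": "Susan", "Grade": "A"},
           {"Student": "Sean", "Grade": "B-"}]

buffer = io.StringIO()   # an in-memory, file-like object
writer = csv.DictWriter(buffer, fieldnames=["Student", "Grade"])
writer.writeheader()         # writes the fieldnames as the first row
writer.writerows(records)    # one row per dictionary, in fieldnames order

lines = buffer.getvalue().splitlines()
print(lines)   # ['Student,Grade', 'Susan,A', 'Sean,B-']
```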

Dealing with different delimiters

In a csv, commas are used to separate values, but we could just as easily use something else to separate values.

In [28]:
with open("student_data.csv", 'r') as file:

    csv_file = csv.reader(file, delimiter = ",") # comma separated values

    with open("only_student_data.csv", 'w', newline='') as new_file:

        new_csv_file = csv.writer(new_file, delimiter = "\t") # tab separated values

        for row in csv_file:

            new_csv_file.writerow(row) # write the full row, now separated by tabs
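When reading the data back, the delimiter has to match the one used for writing. The sketch below uses in-memory io.StringIO buffers (an assumption made so the example runs on its own) and compares reading tab separated text with the right and the wrong delimiter:

```python
import csv
import io

rows = [["Student", "Grade"], ["Susan", "A"]]

# Write the rows as tab separated values into an in-memory buffer
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerows(rows)
tab_text = buffer.getvalue()

# Matching delimiter: the columns are recovered
read_back = list(csv.reader(io.StringIO(tab_text, newline=""), delimiter="\t"))

# Default comma delimiter: each line comes back as a single value
wrong = list(csv.reader(io.StringIO(tab_text, newline=""), delimiter=","))

print(read_back)   # [['Student', 'Grade'], ['Susan', 'A']]
print(wrong)       # [['Student\tGrade'], ['Susan\tA']]
```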