Linguist 278: Programming for Linguists (Stanford Linguistics, Fall 2021)

Class 6: Basic file and CSV reading and writing; os.path basics

Basic file reading and writing

  1. To open a text file with name filename and read its contents into a string called contents:

    with open(filename) as f:
       contents = f.read()
    
  2. To write a string s to a file filename:

    with open(filename, 'wt') as f:
       f.write(s)
    
  3. To efficiently open a (potentially very large) file and read it line-by-line without first reading all of it into memory:

    with open(filename) as f:
       for line in f:
           line  # This is a string, including the final newline character.
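
The three patterns above can be combined into a small round trip. This is only a minimal sketch: the file name example.txt, the temporary directory, and the sample string are assumptions made for illustration, chosen so that no real file is overwritten.

```python
import os
import tempfile

# Hypothetical file inside a fresh temporary directory.
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "example.txt")

s = "first line\nsecond line\n"

# Pattern 2: write the string to the file.
with open(filename, "wt") as f:
    f.write(s)

# Pattern 1: read the whole file into one string.
with open(filename) as f:
    contents = f.read()

# Pattern 3: read line by line; each line still ends in "\n",
# so we strip it before storing.
lines = []
with open(filename) as f:
    for line in f:
        lines.append(line.rstrip("\n"))
```

After this runs, contents should equal s, and lines should hold the two lines without their newline characters.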
    

Common csv library operations

Yield lists

import csv

with open(filename) as f:
    for row in csv.reader(f):
        # do something with `row`, a list of str
        print(row)

Yield dicts

with open(filename) as f:
    for d in csv.DictReader(f):
        # do something with `d`, a dict mapping
        # the column names from line 0 (the header)
        # to the values in the current row.
        print(d)

Abstracting out the delimiter

with open(filename) as f:
    for d in csv.DictReader(f, delimiter=delimiter):
        # e.g., delimiter='\t' for a tab-separated file
        print(d)
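
As one possible way to package this, the sketch below wraps the pattern in a small function. The function name read_dicts, the file name bands.tsv, and the sample data are hypothetical, introduced only for this example.

```python
import csv
import os
import tempfile

# Hypothetical tab-separated file, created in a temporary directory.
filename = os.path.join(tempfile.mkdtemp(), "bands.tsv")
with open(filename, "wt", newline="") as f:
    f.write("name\tyear\nGuns N' Roses\t1985\n")


def read_dicts(filename, delimiter=","):
    """Return the rows of a delimited file as a list of dicts."""
    with open(filename, newline="") as f:
        return list(csv.DictReader(f, delimiter=delimiter))


records = read_dicts(filename, delimiter="\t")
```

Note that every value comes back as a str, so the year here is "1985" rather than 1985; convert it yourself if you need a number.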

Writing a CSV file

with open(filename, 'wt', newline='') as f:  # newline='' lets csv control the line endings
    writer = csv.writer(f)
    writer.writerows(rows)
    # Alternatively, write the rows one at a time:
    # for row in rows:
    #    writer.writerow(row)

Why use csv at all?

It seems at first like csv is unnecessary overhead. After all, you can create a CSV row with ','.join(row) and parse one with line.strip().split(','). But consider these rows:

rows = [
    ["Earth, Wind, and Fire", 1969],
    ["Peter, Paul and Mary", 1961],
    ["Guns N' Roses", 1985],
    ['"Weird Al" Yankovic', 1976]]

If you write these with the simple method (after first converting the years to str!), you get a file with contents

Earth, Wind, and Fire,1969
Peter, Paul and Mary,1961
Guns N' Roses,1985
"Weird Al" Yankovic,1976

which is now ambiguous about how it should be parsed. If you instead use csv.writer, you get

"Earth, Wind, and Fire",1969
"Peter, Paul and Mary",1961
Guns N' Roses,1985
"""Weird Al"" Yankovic",1976

which cannot be read with the simple method line.strip().split(','), but which csv.reader will parse accurately.
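
To see the difference concretely, here is a small sketch comparing the two approaches on two of the rows above. It uses an in-memory io.StringIO buffer in place of a real file, which is an assumption made to keep the example self-contained; the variable names are likewise only illustrative.

```python
import csv
import io

rows = [
    ["Earth, Wind, and Fire", 1969],
    ['"Weird Al" Yankovic', 1976],
]

# Simple method: join on commas, then split on commas.
naive_lines = [",".join(str(x) for x in row) for row in rows]
naive_parsed = [line.strip().split(",") for line in naive_lines]

# csv method: write to an in-memory buffer, then read it back.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)
csv_parsed = list(csv.reader(buffer))
```

With the simple method, the first row comes back as four fields instead of two, because the commas inside the band name are treated as separators. With csv, both rows come back with exactly two fields and the original names intact (the years are now str).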

The os library

The Python os library has lots of functionality for dealing with the operating system. Its submodule os.path contains functions for working with filenames.

I suggest getting in the habit of using os.path.join to specify filenames. It creates a string from the string arguments you give it, but the delimiter between them is the appropriate one for the current operating system.

# Goes up one level with ".." and then gives the base filename:
os.path.join("..", "foo.txt")

# Goes up one level and then down into the directory "bar", and then gives the base filename:
os.path.join("..", "bar", "foo.txt")

# Goes down two levels and then gives the base filename:
os.path.join("bar1", "bar2", "foo.txt")

The fact that the separation character differs across operating systems motivates using os.path functions for any operation on filenames that depends on this character. Common cases: os.path.basename, os.path.split, os.path.splitext.
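
A brief sketch of these three functions, applied to a path built with os.path.join. The path components are the same placeholder names used above, and the variable names are chosen only for illustration.

```python
import os

path = os.path.join("..", "bar", "foo.txt")

# The final component of the path.
base = os.path.basename(path)

# Everything before the final component, and the final component.
head, tail = os.path.split(path)

# The filename without its extension, and the extension (with the dot).
root, ext = os.path.splitext(tail)
```

Here base and tail are both "foo.txt", head is the same string that os.path.join("..", "bar") produces on the current operating system, root is "foo", and ext is ".txt".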