Linguist 278: Programming for Linguists (Stanford Linguistics, Fall 2021)

Class 8: Iterators and generators

Iterators

File objects as iterators

Custom generators

We often want to read in files line-by-line and do something with those lines. For example, with a CSV, you might want to process the column values in complex ways. Custom generators provide a scalable, intuitive way to do this: they produce one item at a time, so the whole file never has to be held in memory. Here's a simple example:

def uppercase_reader(filename):
    with open(filename) as f:
        for line in f:
            line = line.upper()  ## Presumably you would do something more interesting!
            yield line

The crucial piece is using yield rather than return. With the above, calling next on the generator returned by uppercase_reader(filename) produces one uppercased line at a time, and a for-loop iterates over all of the uppercased lines:

for line in uppercase_reader(filename):
    # Do something with `line`
    pass
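
To make the behavior of next concrete, here is a minimal sketch. It restates uppercase_reader in compact form and runs it on a small throwaway file created with tempfile; the file contents and the variable names (reader, first, second, rest, done) are assumptions made up for this illustration.

```python
import os
import tempfile

def uppercase_reader(filename):
    with open(filename) as f:
        for line in f:
            yield line.upper()

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

reader = uppercase_reader(path)  # Nothing is read yet; this only creates the generator.
first = next(reader)             # Runs the body up to the first yield.
second = next(reader)            # Resumes where it left off.
rest = list(reader)              # Consumes whatever remains.

# An exhausted generator raises StopIteration, unless next is given a default.
done = next(reader, None)

os.remove(path)
```

Note that calling uppercase_reader does not open the file; the body only starts running on the first call to next (or the first pass through a for-loop).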

Here's a more realistic illustration, a Google Books file reader:

import gzip

def googlebooks_reader(filename, gz=True):
    if gz:
        f = gzip.open(filename, mode='rt', encoding='utf8')
    else:
        f = open(filename, encoding='utf8')  # Same encoding as the gzip branch
    with f:
        for line in f:
            w, yr, mc, vc = line.split("\t")
            yr = int(yr)
            mc = int(mc)
            vc = int(vc)
            yield {'word': w, 'year': yr, 'match_count': mc, 'volume_count': vc}

This version turns each line into a dict with intuitive key names and the correct types for the values. This seems like a good generic format for doing analytic work with the file.
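
As one possible illustration of the analytic work this format supports, the sketch below sums match counts per year for a single word. The function total_by_year and the sample records are hypothetical: the records only mimic the shape of the dicts yielded by googlebooks_reader, and their counts are made up for the example.

```python
from collections import defaultdict

def total_by_year(records, word):
    """Sum match_count per year for `word` over an iterable of record dicts."""
    totals = defaultdict(int)
    for rec in records:
        if rec['word'] == word:
            totals[rec['year']] += rec['match_count']
    return dict(totals)

# Made-up records with the same keys as the dicts from googlebooks_reader.
sample = [
    {'word': 'corpus', 'year': 1999, 'match_count': 10, 'volume_count': 4},
    {'word': 'corpus', 'year': 2000, 'match_count': 12, 'volume_count': 5},
    {'word': 'corpora', 'year': 2000, 'match_count': 3, 'volume_count': 2},
    {'word': 'corpus', 'year': 2000, 'match_count': 1, 'volume_count': 1},
]

# Any iterable of records works, including a generator.
totals = total_by_year(iter(sample), 'corpus')
```

Because total_by_year only iterates over its argument, the same call should work with total_by_year(googlebooks_reader(filename), 'corpus') on a real file, processing one line at a time.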