Linguist 278: Programming for Linguists (Stanford Linguistics, Fall 2021)

Class 8: Iterators and generators

Iterators

You can turn str, list, tuple, dict, and other iterables into iterators with the iter built-in.

The next built-in will return the next item from the iterator:

vals = [1,2,3]

i = iter(vals)

next(i)
1

next(i)
2

next(i)
3

next(i)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)

As the StopIteration error shows, the iterator is "used up" when we've moved through all its members.
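A for-loop handles this bookkeeping for you. The following is only a rough sketch of what a for-loop does with iter and next, not the interpreter's actual implementation; the helper name manual_loop is made up for illustration.

```python
def manual_loop(iterable):
    """Collect the items of `iterable` by calling iter/next by hand.

    `manual_loop` is a hypothetical helper, written only to illustrate
    roughly what a for-loop does for us.
    """
    items = []
    it = iter(iterable)          # get an iterator from the iterable
    while True:
        try:
            item = next(it)      # ask for the next member
        except StopIteration:    # the iterator is used up
            break
        items.append(item)
    return items

vals = [1, 2, 3]
print(manual_loop(vals))         # [1, 2, 3]

# next also accepts a default, returned instead of raising StopIteration:
i = iter(vals)
for _ in range(3):
    next(i)
print(next(i, None))             # None
```

Passing a default to next is handy when you want to peek at an iterator that might be empty without wrapping the call in try/except.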

File objects as iterators

You can call next on file objects:

f = open("alice.txt")

next(f)
'\ufeffThe Project Gutenberg EBook of Alice in Wonderland, by Lewis Carroll\n'

next(f)
'\n'

When you use our standard idiom for opening a file and moving through its lines, you are relying on the fact that file objects are iterators:
```
with open(filename) as f:
    for line in f:
        # Do something with `line`
        pass
```
This is a memory-efficient way to read through the file, because it never reads the whole file into memory. This is therefore preferred in contexts where line-by-line processing meets your needs.
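As a small illustration of the line-by-line idiom, here is a sketch that counts lines and tokens while holding only one line in memory at a time. The function name count_lines_and_tokens is hypothetical, and splitting on whitespace is a simplifying assumption about what counts as a token.

```python
def count_lines_and_tokens(filename, encoding="utf8"):
    """Return (number of lines, number of whitespace-separated tokens).

    Hypothetical helper for illustration: the file is read one line at
    a time, so memory use does not grow with the size of the file.
    """
    n_lines = 0
    n_tokens = 0
    with open(filename, encoding=encoding) as f:
        for line in f:
            n_lines += 1
            n_tokens += len(line.split())  # crude tokenization on whitespace
    return n_lines, n_tokens
```

You could call it as count_lines_and_tokens("alice.txt"), assuming that file is in your working directory.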

Custom generators

We often want to read in files line-by-line and do something with those lines. For example, with a CSV, you might want to process the column values in complex ways. Custom generators provide a scalable, intuitive way to do this. Here's a simple example:

def uppercase_reader(filename):
    with open(filename) as f:
        for line in f:
            line = line.upper()  # Presumably you would do something more interesting!
            yield line

The crucial piece is using yield rather than return. Calling uppercase_reader returns a generator: each call to next on it yields the next uppercased line, and a for-loop will loop over the uppercased lines:

for line in uppercase_reader(filename):
    # Do something with `line`
    pass
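To see that a generator function does no work until it is asked for a value, here is a sketch that uses an in-memory list of strings instead of a file. The name uppercase_lines and the sample data are made up for illustration.

```python
def uppercase_lines(lines):
    """Yield an uppercased copy of each string in `lines`, one at a time."""
    for line in lines:
        yield line.upper()

# Calling the function does not run its body yet; it returns a generator.
gen = uppercase_lines(["down the\n", "rabbit-hole\n"])

first = next(gen)     # runs the body up to the first yield
second = next(gen)    # resumes after the yield and runs to the next one
rest = list(gen)      # nothing is left, so this is empty

print(repr(first))    # 'DOWN THE\n'
print(repr(second))   # 'RABBIT-HOLE\n'
print(rest)           # []
```

Like the iterator over vals earlier, the generator is "used up" once it has yielded all its values; call the function again to get a fresh one.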

Here's a more realistic illustration, a Google Books file reader:

import gzip

def googlebooks_reader(filename, gz=True):
    if gz:
        f = gzip.open(filename, mode='rt', encoding='utf8')
    else:
        f = open(filename, encoding='utf8')  # same encoding as the gzip branch
    with f:
        for line in f:
            w, yr, mc, vc = line.split("\t")
            yr = int(yr)
            mc = int(mc)
            vc = int(vc)
            yield {'word': w, 'year': yr, 'match_count': mc, 'volume_count': vc}

This version turns each line into a dict with intuitive key names and the correct types for the values. This seems like a good generic format for doing analytic work with the file.
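One way to use such a reader is to aggregate over the dicts it yields. The sketch below totals match counts by year. To stay self-contained, it parses a few made-up tab-separated lines from memory rather than a real Google Books file; parse_lines and total_matches_by_year are hypothetical names, and the numbers are invented.

```python
from collections import defaultdict

def parse_lines(lines):
    """Yield one dict per tab-separated line: word, year, match count, volume count."""
    for line in lines:
        w, yr, mc, vc = line.rstrip("\n").split("\t")
        yield {'word': w, 'year': int(yr), 'match_count': int(mc), 'volume_count': int(vc)}

def total_matches_by_year(records):
    """Sum `match_count` per year over any iterable of record dicts."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec['year']] += rec['match_count']
    return dict(totals)

# Invented sample data, for illustration only.
sample = [
    "rabbit\t1900\t10\t3\n",
    "hole\t1900\t5\t2\n",
    "rabbit\t1901\t7\t4\n",
]

print(total_matches_by_year(parse_lines(sample)))  # {1900: 15, 1901: 7}
```

Because total_matches_by_year accepts any iterable of dicts, it should work the same way on the output of a file-based generator, still processing one record at a time.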