Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2021
Christopher Potts

Document formats

Set-up

In [1]:
import glob
import os
import pandas as pd

Download this package of example files, unzip it, and place the resulting folder in this directory:

https://web.stanford.edu/class/linguist278/data/sampledocs.zip

In [2]:
sampledoc_dirname = "sampledocs"

Relying on Pandas

In [3]:
csv_example = os.path.join(sampledoc_dirname, "movie-data.csv")

Reading CSVs with Pandas

For most CSV work, you can use pd.read_csv and not worry about the details:

In [4]:
movie_df = pd.read_csv(csv_example)
In [5]:
movie_df
Out[5]:
   Movie                 Release Date         Worldwide Gross in Dollars
0  There Will Be Blood   2008-01-25 00:00:00  77208711
1  Lost in Translation   2003-10-03 00:00:00  119723856
2  The Trip              June 10 2011         2030962
3  The Royal Tenenbaums  January 04 2002      19077240
4  Arrival               2016-11-11 00:00:00  100546139
5  Office Space          1999-05-09 00:00:00  12800000

It also works with compressed files:

In [6]:
csv_gz_example = os.path.join(sampledoc_dirname, "movie-data.csv.gz")
In [7]:
movie_df = pd.read_csv(csv_gz_example)

And the delimiter argument will let you handle TSV and other formats, as with the built-in csv library.
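
For example, here is a minimal sketch for a hypothetical tab-separated file (the .tsv filename is just illustrative and not part of the sample package):

tsv_df = pd.read_csv("movie-data.tsv", sep="\t")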

Writing CSVs with Pandas

To write a pd.DataFrame df to CSV with pandas, use df.to_csv:

In [8]:
movie_df.to_csv(csv_example, index=None)
In [9]:
movie_df.to_csv(csv_gz_example, compression="gzip", index=None)

Excel files

For Excel files, use pd.read_excel for reading:

In [10]:
xlsx_example = os.path.join(sampledoc_dirname, "movie-data.xlsx")
In [11]:
movie_df = pd.read_excel(xlsx_example)

And df.to_excel for writing:

In [12]:
movie_df.to_excel(xlsx_example, index=None)

Using the CSV library

In [13]:
import csv

def csv_reader(src_filename, delimiter=","):
    # Return a list of rows; returning the reader itself would fail
    # once the `with` block has closed the file.
    with open(src_filename) as f:
        return list(csv.reader(f, delimiter=delimiter))
In [14]:
def csv_reader_dicts(src_filename, delimiter=","):
    with open(src_filename) as f:
        return list(csv.DictReader(f, delimiter=delimiter))
In [15]:
def csv_writer(rows, output_filename, header=None):
    # newline="" is recommended by the csv docs to avoid extra blank lines.
    with open(output_filename, 'wt', newline="") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)

DOCX

In [16]:
import textract
In [17]:
docx_example = os.path.join(sampledoc_dirname, "ling278_stanfordtools.docx")
In [18]:
def docx_reader(src_filename):
    return textract.process(src_filename).decode()
In [19]:
docx_text = docx_reader(docx_example)
In [20]:
print(docx_text[: 200])
Notes on using Stanford’s computing resources
Chris Potts, Ling 278: Programming for Linguists, Fall 2020
Sep 14

Computer clusters

Stanford provides all of us with access to a cluster of Linux machi

textract will also do its best with a wide variety of other file formats!
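
For example, the same handout ships in the sample package as a PDF, and textract can process it with the same call (exact whitespace in the extracted text may differ depending on which backend textract uses):

pdf_text_via_textract = textract.process(
    os.path.join(sampledoc_dirname, "ling278_stanfordtools.pdf")).decode()
print(pdf_text_via_textract[: 200])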

GZIP

GZIP is a common compression format. Python's gzip library is part of the core language:

In [21]:
import gzip
In [22]:
def gzip_reader(src_filename):
    with gzip.open(src_filename, mode='rt', encoding='utf8') as f:
        for line in f:
            yield line
In [23]:
def gzip_writer(s, output_filename):
    with gzip.open(output_filename, mode='wb') as f:
        f.write(s.encode(encoding="utf8"))
In [24]:
list(gzip_reader(csv_gz_example))
Out[24]:
['Movie,Release Date,Worldwide Gross in Dollars\n',
 'There Will Be Blood,2008-01-25 00:00:00,77208711\n',
 'Lost in Translation,2003-10-03 00:00:00,119723856\n',
 'The Trip,June 10 2011,2030962\n',
 'The Royal Tenenbaums,January 04 2002,19077240\n',
 'Arrival,2016-11-11 00:00:00,100546139\n',
 'Office Space,1999-05-09 00:00:00,12800000\n']

HTML and XML
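
For well-formed XML, Python's standard-library xml.etree.ElementTree is enough, and third-party parsers like Beautiful Soup (bs4) are the usual choice for messy real-world HTML. A minimal ElementTree sketch (the XML string is just illustrative):

import xml.etree.ElementTree as ET

# For a file on disk, use ET.parse(filename) and tree.getroot() instead.
root = ET.fromstring("<doc><title>Example</title><p>Some text.</p></doc>")
root.find("title").text   # Returns 'Example'.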

JSON

JSON is a common data format these days. It is more flexible than CSV because it allows nesting of objects and has built-in typing. It can handle str, int, float, bool, None, list, and dict.

The limitations of JSON are also what make it very portable: it stores everything as plain text and uses only data types that all modern programming languages have.

For other Python objects, you need to resort to pickle, which is not portable outside of Python.

In [25]:
import json
In [26]:
json_example = os.path.join(sampledoc_dirname, "toy-json.json")
In [27]:
j = [
    {"a": 1, "b": 2.45, "c": [1,2,3], "d": {"dd": True}},
    {"f": 7},
    True
]

Write JSON

In [28]:
def json_writer(d, output_filename):
    with open(output_filename, "wt") as f:
        json.dump(d, f, indent=4, sort_keys=True)
In [29]:
json_writer(j, json_example)

Read JSON

In [30]:
def json_reader(src_filename):
    with open(src_filename, "rt") as f:
        return json.load(f)
In [31]:
j = json_reader(json_example)

JSON Lines (JSONL) format

In [32]:
def read_jsonl(src_filename):
    data = []
    with open(src_filename) as f:
        for line in f:
            d = json.loads(line)
            data.append(d)
    return data
In [33]:
def write_jsonl(data, output_filename):
    lines = ""
    for d in data:
        s = json.dumps(d)
        lines += s + "\n"
    with open(output_filename, "wt") as f:
        f.write(lines)
In [34]:
jsonl_example = os.path.join(sampledoc_dirname, "movie-data.jsonl")
In [35]:
movies = read_jsonl(jsonl_example)
In [36]:
movies[0]
Out[36]:
{'Movie': 'There Will Be Blood',
 'Release Date': '25 January 2008',
 'Worldwide Gross in Dollars': '77,208,711'}
In [37]:
write_jsonl(movies, jsonl_example)

To write JSONL from a pd.DataFrame: `df.to_json(output_filename, orient="records", lines=True)`
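
And to read JSONL back into a pd.DataFrame, pd.read_json accepts the same lines option; a minimal sketch using the example file from above:

movies_df = pd.read_json(jsonl_example, orient="records", lines=True)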

PDF

In [38]:
!pip install pymupdf
Requirement already satisfied: pymupdf in /Applications/anaconda3/lib/python3.7/site-packages (1.18.3)
In [39]:
import fitz
from fitz.utils import getColor

pdf2text

In [40]:
def pdf2text(src_filename):
    """Open a PDF file and extract its page contents, returning a list of str."""
    data = []
    doc = fitz.open(src_filename)
    for page in doc:
        contents = page.getText('text')
        data.append(contents)
    return data
In [41]:
pdf_example = os.path.join(sampledoc_dirname, "ling278_stanfordtools.pdf")
In [42]:
pdf_text = pdf2text(pdf_example)
In [43]:
print(pdf_text[0][: 200])
Notes on using Stanford’s computing resources
Chris Potts, Ling 278: Programming for Linguists, Fall 2020
Sep 14
1
Computer clusters
Stanford provides all of us with access to a cluster of Linux machi

Highlighting words

In [44]:
def pdf_highlighter(span, src_filename, output_filename, color="tomato"):
    doc = fitz.open(src_filename)
    rgb = fitz.utils.getColor(color)
    for page in doc:
        for inst in page.searchFor(span):
            ann = page.addHighlightAnnot(inst)
            ann.setColors({"stroke": rgb})
            info = ann.info
            info["title"] = "Interesting! Tell me more!"
            ann.setInfo(info)
            ann.update()
    doc.save(output_filename, garbage=4, deflate=True, clean=True, expand=0)
In [45]:
pdf_output_filename = os.path.join(sampledoc_dirname, "ling278_stanfordtools-highlighting.pdf")
In [46]:
pdf_highlighter("Stanford", pdf_example, pdf_output_filename)

Optical Character Recognition (OCR)

For scanned PDFs and other image files that don't have embedded text, you have to do more advanced processing to recover the text. I recommend the open-source tesseract library for this.
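
If you go that route, the pytesseract package is a common Python wrapper around the tesseract engine (which has to be installed separately); here is a minimal sketch with a hypothetical image filename:

import pytesseract
from PIL import Image

# Hypothetical scanned page; substitute a real image file.
scanned_page = Image.open("scanned-page.png")
ocr_text = pytesseract.image_to_string(scanned_page)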

pickle

The pickle library has an interface that is very similar to JSON. However, whereas JSON is a plain-text format that is highly limited in what it can store, you can pickle just about any Python data structure. This makes it really useful for quickly storing large data structures, and you can store them with associated code. The only downside is that pickle is not a portable format. It can be read only by Python, and there can even be problems reading and writing pickle files across Python versions. So you might think of pickle as your own private, temporary storage format.

In [47]:
import pickle
In [48]:
pickle_example = os.path.join(sampledoc_dirname, "toy-pickle.pickle")

Write pickle

In [49]:
# Imagine this is a huge dictionary that took your computer
# all night to build, and you just want to stash it so that you
# don't have to keep rebuilding it.

d = {"a": 1, "b": 2.45, "c": [1, 2, 3], "d": {"dd": True}}
In [50]:
def write_pickle(pyobj, output_filename):
    with open(output_filename, "wb") as f:
        pickle.dump(pyobj, f)
In [51]:
write_pickle(d, pickle_example)

Read pickle

In [52]:
def read_pickle(src_filename):
    with open(src_filename, "rb") as f:
        return pickle.load(f)
In [53]:
d = read_pickle(pickle_example)

Pickle a function

In [54]:
def exponent(x, pow=2):
    return x**pow
In [55]:
pickled_exponent = os.path.join(sampledoc_dirname, "exponent.pickle")
In [56]:
write_pickle(exponent, pickled_exponent)
In [57]:
exponent2 = read_pickle(pickled_exponent)
In [58]:
exponent2(4)   # Returns 16.
Out[58]:
16

ZIP

ZIP is a common compression format, used especially for packaging up multiple files into a single file. Python's zipfile library provides tools for working with these files.

In [59]:
import zipfile

Unpacking ZIP files

In [60]:
zip_example = os.path.join(sampledoc_dirname, "gutenberg.zip")
In [61]:
def open_zipfile(src_filename, output_dirname, file_to_open=None):
    with zipfile.ZipFile(src_filename) as f:
        if file_to_open is None:
            f.extractall(path=output_dirname)
        else:
            f.extract(file_to_open, path=output_dirname)
In [62]:
open_zipfile(
    zip_example,
    sampledoc_dirname,
    file_to_open=os.path.join("gutenberg", "austen-emma.txt"))
In [63]:
open_zipfile(
    zip_example,
    sampledoc_dirname)

Archiving files

In [64]:
def write_zipfile(src_filenames, output_filename):
    with zipfile.ZipFile(output_filename, "w") as f:
        for filename in src_filenames:
            f.write(filename, os.path.basename(filename))
In [65]:
movie_filenames = glob.glob(os.path.join(sampledoc_dirname, "movie-data.*"))
In [66]:
write_zipfile(movie_filenames, "movie-data.zip")