Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2021
Christopher Potts
Distributed 2020-11-09
Due 2020-11-16
Our in-class hackathon on November 9 and 11 is Assignment 7. My hope is that you can complete most of the hackathon/homework by the end of class on November 11.
You are encouraged to work in groups (max size: 3 people).
The requirements are much more open-ended than for other assignments. Part of the task is to think up an original question using the materials in this notebook. I've given some general ideas below to get you started.
You should think about scoping the problem so that you can complete it in the time available.
Designate one person to submit the notebook and make absolutely sure all group members' names are given at the top of the notebook.
Submit a modified version of this notebook, with your new code included, extraneous code removed, and prose added so that I can follow along. That is, try to make this notebook look like a polished piece of literate programming.
It is okay to do the work collaboratively using Google Colab. In that case, please submit the address of the notebook on Canvas, and make sure I can access the file.
I am guessing most notebooks will have about the scope of three regular two-pointer assignment questions, with some prose explaining what they do and why. However, there is no strict requirement on code quantity or anything.
You needn't confine yourself to the data and other resources in this notebook. (There is a lot to work with here, though.)
Examples of things you might do (not meant to be restrictive!):
Write code that identifies interesting relationships in the concreteness, sentiment, and age of acquisition datasets. You can use Pandas to merge them on their Index values and then reason across them! (A small join sketch appears below, after the sentiment data is loaded.)
Write code that identifies differences between Project Gutenberg authors as revealed by the concreteness, sentiment, age of acquisition, and/or beautiful words datasets.
Write code to find the most something – the most sentiment-laden sentence or passage, the most challenging passage, the most abstract passage, etc. I've included code below for heuristic paragraph and sentence parsing, and a small paragraph-scoring sketch after the tokenizing helpers.
import glob
import os
import pandas as pd
import re
import string
Download the hackathon data distribution:
http://web.stanford.edu/class/linguist278/data/hackathon.zip
and unzip it in the same directory as this notebook. (If you want to put it somewhere else, just be sure to change data_home in the next cell.)
data_home = "hackathon"
From Age-of-acquisition ratings for 30 thousand English words (Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert, Behavior Research Methods, 2012):
Word: The word (str)
OccurTotal: token count in their data
OccurNum: Participants who gave an age of acquisition, rather than saying "Unknown"
Rating.Mean: mean age of acquisition in years of age
Rating.SD: standard deviation of the distribution of ages of acquisition
Frequency: token count of Word in the SUBTLEX-US corpus

age_df = pd.read_csv(
    os.path.join(data_home, "Kuperman-BRM-data-2012.csv"),
    index_col='Word')
age_df.shape
(30121, 5)
age_df.head(2)
| Word | OccurTotal | OccurNum | Rating.Mean | Rating.SD | Frequency |
|---|---|---|---|---|---|
| have | 18 | 18 | 3.72 | 1.96 | 314232.0 |
| do | 20 | 20 | 3.60 | 1.60 | 312915.0 |
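As a quick way to get oriented (not part of the assignment proper), sorting on Rating.Mean shows which words were reported as learned earliest and latest; this is only an exploratory sketch:

# Words with the lowest mean age of acquisition (learned earliest):
age_df.sort_values('Rating.Mean').head(10)

# Words with the highest mean age of acquisition (learned latest):
age_df.sort_values('Rating.Mean', ascending=False).head(10)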
We've worked with this dataset before. It's presented in Concreteness ratings for 40 thousand generally known English word lemmas (Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman, Behavior Research Methods, 2014). Overview:
Word: The word (str)
Bigram: Whether it is a single word or a two-word expression
Conc.M: The mean concreteness rating
Conc.SD: The standard deviation of the concreteness ratings (float)
Unknown: The number of persons indicating they did not know the word
Total: The total number of persons who rated the word
Percent_known: Percentage of participants who knew the word
SUBTLEX: The SUBTLEX-US frequency count
Dom_Pos: The part-of-speech where known

concreteness_df = pd.read_csv(
    os.path.join(data_home, "Concreteness_ratings_Brysbaert_et_al_BRM.csv"),
    index_col='Word')
concreteness_df.shape
(39954, 8)
concreteness_df.head()
| Word | Bigram | Conc.M | Conc.SD | Unknown | Total | Percent_known | SUBTLEX | Dom_Pos |
|---|---|---|---|---|---|---|---|---|
| roadsweeper | 0 | 4.85 | 0.37 | 1 | 27 | 0.96 | 0 | 0 |
| traindriver | 0 | 4.54 | 0.71 | 3 | 29 | 0.90 | 0 | 0 |
| tush | 0 | 4.45 | 1.01 | 3 | 25 | 0.88 | 66 | 0 |
| hairdress | 0 | 3.93 | 1.28 | 0 | 29 | 1.00 | 1 | 0 |
| pharmaceutics | 0 | 3.77 | 1.41 | 4 | 26 | 0.85 | 0 | 0 |
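If you want a feel for the scale before diving in, sorting on Conc.M surfaces the most and least concrete words; again, just an exploratory sketch:

# Words with the highest mean concreteness ratings:
concreteness_df.sort_values('Conc.M', ascending=False).head(10)

# Words with the lowest mean concreteness ratings (the most abstract):
concreteness_df.sort_values('Conc.M').head(10)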
The dataset Norms of valence, arousal, and dominance for 13,915 English lemmas (Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert, Behavior Research Methods, 2013) contains a lot of sentiment information about more than 13K words. The following code reads in the full dataset and then restricts to just the mean ratings for the three core semantic dimensions:
Word: The word (str)
Valence (positive/negative)
Arousal (intensity)
Dominance

sentiment_df = pd.read_csv(
    os.path.join(data_home, "Warriner_et_al emot ratings.csv"),
    index_col='Word')
sentiment_df.shape
(13915, 64)
sentiment_df = sentiment_df[['V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
sentiment_df = sentiment_df.rename(
columns={'V.Mean.Sum': 'Valence',
'A.Mean.Sum': 'Arousal',
'D.Mean.Sum': 'Dominance'})
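Since age_df, concreteness_df, and sentiment_df are all indexed by Word, the merging idea from the list above can start with a plain index join. The inner join and the example correlation below are just one possible setup, not a required recipe:

# Inner-join the three rating datasets on their shared Word index;
# only words present in all three survive.
merged_df = age_df.join(concreteness_df, how='inner')
merged_df = merged_df.join(sentiment_df, how='inner')
merged_df.shape

# Example cross-dataset question: how does valence relate to age of acquisition?
merged_df[['Rating.Mean', 'Valence']].corr()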
I took the 100 Most Beautiful Words (of which there are 107) and enriched them:
Word: The word (str).
Pronunciation: CMU Pronouncing Dictionary representation.
Morphology: Celex morphological representations.
Frequency: frequency according to the Google N-gram Corpus.
Category: 'most-beautiful' or 'regular'.

The 'regular' examples are 107 randomly selected non-proper-names.
Maybe there's something interesting here?
beauty_df = pd.read_csv(
os.path.join(data_home, "wordbeauty.csv"),
index_col="Word")
beauty_df.shape
(214, 4)
beauty_df.head(2)
| Word | Pronunciation | Morphology | Frequency | Category |
|---|---|---|---|---|
| lithe | L AY1 DH | (lithe)[A] | 136457 | most-beautiful |
| vestige | V EH1 S T IH0 JH | (vestige)[N] | 135247 | most-beautiful |
beauty_df['Category'].value_counts()
regular           107
most-beautiful    107
Name: Category, dtype: int64
beauty_df.head()
| Word | Pronunciation | Morphology | Frequency | Category |
|---|---|---|---|---|
| lithe | L AY1 DH | (lithe)[A] | 136457 | most-beautiful |
| vestige | V EH1 S T IH0 JH | (vestige)[N] | 135247 | most-beautiful |
| nemesis | N EH1 M AH0 S IH0 S | (nemesis)[N] | 1338430 | most-beautiful |
| inure | IH0 N Y UH1 R | (inure)[V] | 123230 | most-beautiful |
| imbue | IH0 M B Y UW1 | (imbue)[V] | 105790 | most-beautiful |
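One way into this dataset (a sketch, not a requirement) is to join it with one of the rating datasets and compare the two categories. Words that lack a sentiment rating simply drop out of the inner join:

# Attach sentiment ratings to the beauty words that have them.
beauty_sent_df = beauty_df.join(sentiment_df, how='inner')

# Compare the two categories on the three mean sentiment dimensions.
beauty_sent_df.groupby('Category')[['Valence', 'Arousal', 'Dominance']].mean()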
The Gutenberg metadata has been removed from these files, and the first line gives the title, author, and publication year in a systematic pattern.
gutenberg_home = os.path.join(data_home, "gutenberg")
gutenberg_filenames = glob.glob(os.path.join(gutenberg_home, "*.txt"))
len(gutenberg_filenames)
26
gutenberg_filenames[: 5]
['hackathon/gutenberg/blake-poems.txt', 'hackathon/gutenberg/carroll-alice.txt', 'hackathon/gutenberg/shakespeare-caesar.txt', 'hackathon/gutenberg/christie-secad10.txt', 'hackathon/gutenberg/dickens-ncklb10.txt']
You might want to modify this, depending on how you want to process these texts (by word? sentence? chapter?).
def gutenberg_iterator(filename):
"""Yields paragraphs (as defined simply by multiple
newlines in a row).
Parameters
----------
filename : str
Full path to the file.
Yields
------
multiline str
"""
with open(filename) as f:
contents = f.read()
for para in re.split(r"[\n\s*]{2,}", contents):
yield para
blake_iterator = gutenberg_iterator(gutenberg_filenames[0])
for _ in range(5):
    print("="*50)
    print(next(blake_iterator))
==================================================
[Poems by William Blake 1789]
==================================================
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL
==================================================
SONGS OF INNOCENCE
==================================================
INTRODUCTION
==================================================
Piping down the valleys wild,
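The first lines above suggest a bracketed '[Title by Author YEAR]' pattern. Here is a sketch for pulling those fields out; the regex is an assumption based on the Blake example, and files that deviate from the pattern will return None:

def parse_first_line(filename):
    """Heuristically extract (title, author, year) from the first line,
    assuming it looks like '[Poems by William Blake 1789]'. Returns
    None if the line doesn't match that pattern."""
    with open(filename) as f:
        first_line = f.readline().strip()
    match = re.search(r"\[(.+) by (.+?) (\d{4})\]", first_line)
    if match:
        return match.group(1), match.group(2), int(match.group(3))
    return None

parse_first_line(gutenberg_filenames[0])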
from nltk.tokenize import sent_tokenize
sent_tokenize("Hello? This is Dr. Potts! How are you?")
['Hello?', 'This is Dr. Potts!', 'How are you?']
From assignment 2.
def simple_tokenize(s):
"""Break str `s` into a list of str.
1. `s` has all of its peripheral whitespace removed.
2. `s` is downcased with `lower`.
3. `s` is split on whitespace.
4. For each token, any peripheral punctuation on it is stripped
off. Punctuation is here defined by `string.punctuation`.
Parameters
----------
s : str
The string to tokenize.
Returns
-------
list of str
"""
punct = string.punctuation
final_toks = []
toks = s.lower().strip().split()
for w in toks:
final_toks.append(w.strip(punct))
return final_toks
def word_counts(s, tokenizing_func=simple_tokenize):
"""Count distribution for the words in `s` according to `tokenizer`.
Parameters
----------
s : str
String to tokenize and get word counts for.
tokenizing_func : function
Any function that can be called as `tokenizing_func(s)` where
`s` is a string. The default is `simple_tokenize`.
Returns
-------
dict mapping str to int
"""
wc = {}
toks = tokenizing_func(s)
for w in toks:
wc[w] = wc.get(w, 0) + 1
return wc
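As a sketch of the "find the most something" idea, the pieces above can be combined to score each paragraph by its mean valence. The scoring scheme here (average the Valence ratings of the tokens that appear in sentiment_df, skip unrated paragraphs) is just one assumption among many you could make:

def mean_valence(text, valence=sentiment_df['Valence']):
    """Average valence of the tokens in `text` that have a rating;
    returns None if no token is rated."""
    scores = [valence[w] for w in simple_tokenize(text) if w in valence.index]
    return sum(scores) / len(scores) if scores else None

# Paragraph with the highest mean valence in the first Gutenberg file:
scored = [(mean_valence(para), para)
          for para in gutenberg_iterator(gutenberg_filenames[0])]
scored = [(score, para) for score, para in scored if score is not None]
max(scored, key=lambda pair: pair[0])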
From assignment 5.
def egrep(regex, filename):
"""Python version of egrep. The function iterates through the
user's file `filename`, line-by-line, stripping off the final
newline character, and yielding only the lines that match the
user's regular expression `regex`.
Note: like basic egrep, a line that contains multiple matches for
`regex` is yielded only once.
Parameters
----------
regex : Compiled regular expression
The pattern to use for matching
filename : str
Full path to the file to open and iterate through
Yields
------
str
Lines from the file, with newline characters removed
"""
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if regex.search(line):
                yield line
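A quick usage sketch (the pattern and file are just examples):

# Example: lines mentioning "Caesar" in the Shakespeare file.
caesar_regex = re.compile(r"\bCaesar\b")
caesar_lines = list(egrep(caesar_regex, os.path.join(gutenberg_home, "shakespeare-caesar.txt")))
len(caesar_lines)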