Google Ngrams: From Relative Frequencies to Absolute Counts

Google's Ngram Viewer is a powerful tool for visualizing the history of a term or expression over time.

Whether or not you should trust the results, however, is another question. There are many problems with Google's Ngram Viewer: Scientific literature is overrepresented, important and unimportant publications have the same weight, and there are OCR errors. For a quick read on some of these problems, have a look at this Wired article.

Relative frequencies present another problem. Does a frequency of 0.000002% in 1840 indicate 1, 10, or 100 occurrences of a term? And how many times should a term appear in order for a pattern to be robust?

This tutorial describes how to calculate the absolute number of appearances of a term using Google's source data. You can download this IPython notebook here.

How Addictive Was Smoking in the 19th Century? Or: Problems With Relative Frequencies

As part of my research, I argue that smoking only became widely seen as an addiction in the late 1980s, when addiction was inscribed into the brain. The tobacco industry contests this. Its lawyers argue that smokers were well aware that smoking is addictive as far back as the 19th century. Their argument goes like this (I'm not joking--this ngram has appeared in court):

In [1]:
from IPython.display import IFrame
IFrame("https://books.google.com/ngrams/interactive_chart?content=addicted+to+smoking&year_start=1800&" \
       "year_end=2000&corpus=15&smoothing=3",
       width=900, height=400)

Out[1]:

Smothering Differences Through Smoothing

Do they have a case? Clearly "addicted to smoking" appeared far more frequently in the 19th than the 20th century. But then again, 0.0000020% does not seem like all that much.

In cases where you're uncertain how reliable findings are, you can reduce the smoothing parameter as a first step. The default smoothing is 3, which means that the value for a given year is the average of that year itself as well as the 3 preceding and 3 following years. Hence, the relative frequency shown for the year 1900 is the average of 1897, 1898, 1899, 1900, 1901, 1902, and 1903. The problem with this setting is that it makes rare terms--which may appear 100 times in one year and only 5 times in the next--look stable.
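To make the effect concrete, here is a minimal sketch of such a centered moving average. The function name is mine, and how the real viewer handles the edges of the series is an assumption (this version simply truncates the window):

```python
def smooth(values, smoothing=3):
    # Centered moving average: each year is averaged with up to
    # `smoothing` preceding and `smoothing` following years.
    # Edge handling is a guess: the window is truncated at the
    # start and end of the series.
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - smoothing):i + smoothing + 1]
        smoothed.append(float(sum(window)) / len(window))
    return smoothed

# A rare term: 100 hits in one year, about 5 in the surrounding years.
raw = [5, 5, 100, 5, 5, 5, 5]
print(smooth(raw, smoothing=3))
```

With smoothing=3, the single spike of 100 is spread across the whole window and every year looks moderately and evenly frequent; with smoothing=0, the spike stands out as the outlier it is.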

The easiest way to check if a trend is sustained is to change smoothing to 0:

In [2]:
IFrame("https://books.google.com/ngrams/interactive_chart?content=addicted+to+smoking&year_start=1800&" \
       "year_end=2000&corpus=15&smoothing=0",
       width=900, height=400)

Out[2]:

Absolute Counts, Manually

Suddenly, the sustained pattern seems a lot less stable--the peaks are higher but the troughs are lower.
But what if we wanted to know the absolute number of times that "addicted to smoking" appeared?

Google makes the raw data for all ngrams corpora available here. In particular, we're interested in the total_counts for the English corpus. This file contains the total number of tokens per year.

For example, it contains the following entry:
1815,156318674,940919,1281
This means that for 1815, there are 156,318,674 total words on 940,919 pages in 1,281 books.
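An entry like this can be parsed as follows (a minimal sketch; the variable names are mine):

```python
# One entry of the total_counts file, in the format:
# year,match_count,page_count,volume_count
line = '1815,156318674,940919,1281'
year, word_count, page_count, book_count = [int(field) for field in line.split(',')]
print('{}: {} words, {} pages, {} books'.format(year, word_count, page_count, book_count))
```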

With this, we can calculate how often "addicted to smoking" appeared when it spiked to its highest relative frequency in 1815:
$$0.0000038 \times 0.01 \times 156318674 \approx 5.94 \approx 6$$
In other words, the big 1815 spike is caused by just 6 appearances of "addicted to smoking."
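The same arithmetic in code (the viewer's axis is given in percent, hence the division by 100):

```python
relative_frequency = 0.0000038  # percent, read off the 1815 peak in the chart
total_words_1815 = 156318674    # total word count for 1815 from the total_counts file

# Convert the percentage to a fraction, then scale by the corpus size.
absolute = relative_frequency / 100.0 * total_words_1815
print(round(absolute))
```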

There's Gotta Be a Better Way! Absolute Counts With Python

It would be possible to calculate absolute counts for every year by hand. But who has the time? Below, you can find a Python script that automates the process and plots the result.
There are two main functions: plot_absolute_counts and print_absolute_counts. The plotting function uses matplotlib to plot the absolute count over time. The printing function prints the (rounded) absolute counts for every year.

Both functions take the following parameters:

token : The term/expression/ngram. Needs to be entered as a string, i.e. with quotation marks, e.g. 'addicted'
corpus : Which corpus to use. Valid corpora names are: 'english', 'american english', 'british english', 'english fiction', 'chinese', 'french', 'german', 'hebrew', 'italian', 'russian', 'spanish'. Default: 'english'.
smoothing : Smoothing parameter to use. Default: 0.
start_year : Default: 1800
end_year : Default: 2000
log_scale : (plotting only) Whether or not to use a log scale for plotting. Default: True.

In [3]:
import urllib2
import csv
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
def plot_absolute_counts(token, corpus='english', smoothing=0, start_year=1800, end_year=2000, log_scale=True):
    '''
    Plots the absolute counts of a token over time.

    Valid corpora names are:
    'english', 'american english', 'british english', 'english fiction',
    'chinese', 'french', 'german', 'hebrew', 'italian', 'russian', 'spanish'
    '''
    # Load the absolute counts of the token
    absolute_counts = retrieve_absolute_counts(token, corpus, smoothing, start_year, end_year)

    years = range(start_year, start_year + len(absolute_counts))

    plt.rcParams['figure.figsize'] = (15, 8)
    plt.rcParams['font.size'] = 10
    ax = plt.axes()
    if log_scale:
        ax.set_yscale('log')
    plt.plot(years, absolute_counts, label='{}'.format(token))
    title = 'Absolute Counts of "{}" in the "{}" corpus with smoothing={}.'.format(token, corpus, smoothing)
    if log_scale:
        title += ' Log Scale.'
    plt.title(title)

    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles, labels)

    legend_title = ax.get_legend().get_title()
    legend_title.set_fontsize(15)

    plt.show()

def print_absolute_counts(token, corpus='english', smoothing=0, start_year=1800, end_year=2000):
    '''
    Prints the absolute counts (instead of plotting them).
    Useful to get the exact count for a given year.
    '''
    absolute_counts = retrieve_absolute_counts(token, corpus, smoothing, start_year, end_year)
    print 'Absolute Counts for: {}'.format(token)
    for i in range(len(absolute_counts)):
        print '{}: {}'.format(start_year + i, int(absolute_counts[i]))

def retrieve_absolute_counts(token, corpus, smoothing, start_year, end_year):
    '''
    Retrieves the absolute counts for a given token.
    It first loads the relative frequencies from the ngram viewer and the absolute counts
    for the corpus from Google's source data.
    Then, it multiplies the absolute number of terms in the corpus for any given year with the
    relative frequency of the search token.
    '''
    # dictionary maps from corpus name to corpus id
    corpora = {
        'english': 15,
        'american english': 17,
        'british english': 18,
        'english fiction': 16,

        'chinese': 23,
        'french': 19,
        'german': 20,
        'hebrew': 24,
        'italian': 22,
        'russian': 25,
        'spanish': 21,
    }
    corpus_id = corpora[corpus]

    # Step 1: Load the frequency data from the ngram viewer

    token = token.replace(' ', '+')
    # construct the url, i.e. place the token and other parameters where they belong
    url = 'https://books.google.com/ngrams/graph?content={}&year_start={}&year_end={}' \
          '&corpus={}&smoothing={}'.format(token, start_year, end_year, corpus_id, smoothing)

    # Load the data from the page.
    page = urllib2.urlopen(urllib2.Request(url)).read()

    # Find the places in the html where the data starts and ends
    start = page.find('var data = ')
    end = page.find('];\n', start)

    # Extract the data dictionary
    data = eval(page[start+12:end])
    frequencies = data['timeseries']

    # Step 2: load the total number of tokens per year from Google's source data
    total_counts = load_total_counts(corpus_id, start_year, end_year)

    # Step 3: calculate the absolute number of appearances by multiplying the frequencies with the total
    #         number of tokens
    absolute_counts = [round(frequencies[i] * total_counts[i]) for i in range(len(frequencies))]

    return absolute_counts

def load_total_counts(corpus_id, start_year, end_year):
    '''
    Loads the total counts for a given corpus from Google's source data.
    '''
    # map from corpus id to the total_counts file of that corpus
    # (filenames follow the naming pattern of Google's 2012 source data)
    base = 'http://storage.googleapis.com/books/ngrams/books/googlebooks-{}-totalcounts-20120701.txt'
    id_to_url = {
        15: base.format('eng-all'),
        16: base.format('eng-fiction-all'),
        17: base.format('eng-us-all'),
        18: base.format('eng-gb-all'),
        19: base.format('fre-all'),
        20: base.format('ger-all'),
        21: base.format('spa-all'),
        22: base.format('ita-all'),
        23: base.format('chi-sim-all'),
        24: base.format('heb-all'),
        25: base.format('rus-all'),
    }

    response = urllib2.urlopen(urllib2.Request(id_to_url[corpus_id]))
    # the entries in the file are separated by tabs
    data = response.read().split('\t')

    total_counts = []
    for row in data:
        # first and last rows are empty, so a try...except is needed
        try:
            year, word_count, _, _ = row.split(',')
            if int(year) >= start_year and int(year) <= end_year:
                total_counts.append(int(word_count))
        except ValueError:
            pass

    return total_counts

In [5]:
'''
A note on Python syntax

The following three commands achieve exactly the same result:

plot_absolute_counts('addicted to smoking')
plot_absolute_counts('addicted to smoking', 'english', 0, 1800, 2000, True)
plot_absolute_counts(token='addicted to smoking', corpus='english', smoothing=0, start_year=1800,
                     end_year=2000, log_scale=True)

The first one passes only the one required parameter (token) and uses the defaults for everything else.
The second one passes all parameters positionally (token, corpus, smoothing, start/end year, log_scale).
For this to work, they need to be passed in the same order as they are declared in plot_absolute_counts.
The third command names each parameter explicitly. It is the least ambiguous but also the most verbose version.
'''
plot_absolute_counts('addicted to smoking', 'english', smoothing=0, start_year=1800, end_year=1900)

In [6]:
print_absolute_counts('addicted to smoking', 'english', smoothing=0, start_year=1800, end_year=1900)

Absolute Counts for: addicted to smoking
1800: 2
1801: 1
1802: 1
1803: 0
1804: 0
1805: 0
1806: 4
1807: 0
1808: 0
1809: 1
1810: 1
1811: 2
1812: 0
1813: 1
1814: 1
1815: 6
1816: 1
1817: 1
1818: 1
1819: 1
1820: 0
1821: 0
1822: 1
1823: 3
1824: 3
1825: 8
1826: 4
1827: 0
1828: 1
1829: 2
1830: 3
1831: 3
1832: 6
1833: 0
1834: 5
1835: 13
1836: 4
1837: 10
1838: 1
1839: 15
1840: 4
1841: 5
1842: 5
1843: 1
1844: 7
1845: 6
1846: 11
1847: 9
1848: 17
1849: 1
1850: 7
1851: 4
1852: 4
1853: 4
1854: 10
1855: 16
1856: 14
1857: 13
1858: 8
1859: 12
1860: 2
1861: 4
1862: 1
1863: 4
1864: 7
1865: 14
1866: 2
1867: 3
1868: 5
1869: 5
1870: 2
1871: 1
1872: 9
1873: 12
1874: 3
1875: 6
1876: 9
1877: 6
1878: 11
1879: 8
1880: 17
1881: 5
1882: 15
1883: 13
1884: 16
1885: 14
1886: 2
1887: 19
1888: 7
1889: 24
1890: 0
1891: 6
1892: 3
1893: 4
1894: 1
1895: 6
1896: 8
1897: 1
1898: 22
1899: 17
1900: 12


Looking at the absolute counts of 'addicted to smoking,' it becomes clear that this token rarely appeared more than 10 times per year--much ado about nothing.
In general, I'd be hesitant to put too much stock in ngram results before the 1950s, because the sample sizes tend to be one to two orders of magnitude smaller. Have a look at the following two plots, one on a linear and one on a logarithmic scale.

In [7]:
plot_absolute_counts('and', log_scale=False)
plot_absolute_counts('and')


Have fun running your own experiments!