Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2021
Christopher Potts
Pandas is an amazing, powerful, flexible library for doing reproducible research efficiently and effectively.
We will be able to cover only a tiny percentage of the things it can do.
Thus, to create this notebook, I reviewed the work in Pandas that I've done over the years, identified the most common operations, and assembled them here as a bunch of code snippets with notes interspersed.
My hope is that this gives you enough of a feel for Pandas that you'll be able to explore and make use of its other functionality by reviewing its documentation and Googling for Stack Exchange discussions.
%matplotlib inline
import os
import pandas as pd
import random
The pd.read_csv function is a powerful and flexible tool for reading in CSV files. It can handle multiple input formats, it infers column types on import, and it has many other options.
toy_df = pd.read_csv(
"http://web.stanford.edu/class/linguist278/data/toy-csv.csv")
Here, toy_df is a pd.DataFrame. These are like spreadsheets, and like data.frame objects in the programming language R. They display nicely in notebooks:
toy_df.head()
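To see the type inference without depending on the network, here is a minimal sketch that reads a small CSV from an in-memory string. The CSV text and the name demo_df are made up for illustration; they are not the contents of the course file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk or at a URL.
csv_text = """Subject,Height,Occupation
1,70.5,Linguist
2,64.0,Psychologist
3,68.2,Linguist
"""

# pd.read_csv accepts file-like objects as well as paths and URLs.
demo_df = pd.read_csv(io.StringIO(csv_text))

# Inspect the type that was inferred for each column.
print(demo_df.dtypes)
```

The Subject column should come back as an integer type and Height as a float type, with no type annotations supplied by us.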
We can specify that we want the Subject column to be the pd.Index for this DataFrame:
toy_df = pd.read_csv(
"http://web.stanford.edu/class/linguist278/data/toy-csv.csv",
index_col=0)
toy_df.shape
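The index_col argument can also be given as a column name rather than a position. The following sketch uses a made-up two-row CSV (not the course file) to show that the two spellings give the same result, and that the index column is not counted in shape.

```python
import io

import pandas as pd

# Hypothetical CSV text; the first column is named "Subject".
csv_text = "Subject,Height,Occupation\n1,70.5,Linguist\n2,64.0,Psychologist\n"

# Select the index column by position ...
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ... or by name.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Subject")

print(by_name.index.name)
print(by_name.shape)  # (rows, columns), excluding the index
```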
pd.DataFrame(
{
"lower": ["a", "b", "c"],
"upper": ["A", "B", "C"]
})
pd.DataFrame(
[
["a", "A"],
["b", "B"],
["c", "C"]
],
columns=["lower", "upper"])
The columns in DataFrames are pd.Series objects. You can pull them out by indexing directly into the pd.DataFrame with the column name (as a str):
toy_df['Height']
pd.Series(["a", "b", "c"], name="lower")
toy_df['Height'].mean()
toy_df['Height'].max()
toy_df['Height'].min()
toy_df['Height'].median()
toy_df['Height'] - toy_df['Height'].min()
toy_df['Height'] * 2.54
Series are pretty much like dictionaries where the keys are the pd.Index values:
height = toy_df['Height']
height[1]
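To make the dictionary analogy concrete, here is a small sketch with a hand-built Series. The values and labels are invented for illustration.

```python
import pandas as pd

# A hypothetical Series whose Index labels are 1, 2, 3.
height = pd.Series([70.5, 64.0, 68.2], index=[1, 2, 3], name="Height")

# Square brackets look up by Index label, like a dictionary key.
print(height[1])  # 70.5

# Membership tests check the Index, not the values.
print(2 in height)     # True
print(70.5 in height)  # False

# keys() returns the Index, and items() yields (label, value) pairs.
print(list(height.keys()))
print(dict(height.items()))

# For lookup by integer position instead, use iloc.
print(height.iloc[0])  # 70.5
```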
If the Pandas object proves stubborn, the values attribute will return a NumPy array, which will behave very much like a list. (If even the array is vexing, call list on it.)
toy_df.values
toy_df['Height'].values
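As a quick sketch of these escape hatches, the following uses a small made-up DataFrame (the name df and its contents are illustrative only). The tolist method is a convenient alternative to calling list on the array.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Height": [70.5, 64.0], "Age": [30, 41]}, index=[1, 2])

# A column's values come back as a 1-dimensional NumPy array.
arr = df["Height"].values
print(type(arr).__name__, arr.shape)

# The whole DataFrame gives a 2-dimensional array (rows by columns).
matrix = df.values
print(matrix.shape)

# tolist() converts the column to a plain Python list.
as_list = df["Height"].tolist()
print(as_list)
```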
# Some random ages for a dummy column:
ages = []
for i in range(toy_df.shape[0]):
    ages.append(random.randint(18, 100))
toy_df['Age'] = ages
toy_df.head()
If you index into a DataFrame with a list of column names, you get the subframe containing those columns, in the order you specified:
toy_df[['Occupation']].head()
You can use this to change the order of the columns:
toy_df[['Occupation', 'Height']].head()
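One detail worth seeing directly: a single column name in brackets gives a Series, while a list of names gives a DataFrame, even when the list has only one element. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"Height": [70.5, 64.0], "Occupation": ["Linguist", "Psychologist"]},
    index=[1, 2])

col = df["Occupation"]    # a str selects a Series
sub = df[["Occupation"]]  # a list selects a (one-column) DataFrame

print(type(col).__name__, type(sub).__name__)

# A list of names also fixes the column order of the result.
reordered = df[["Occupation", "Height"]]
print(list(reordered.columns))
```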
toy_df.head(2)
toy_df.tail(2)
To get a specific row based on its label in the Index:
toy_df.loc[1]
And based on its integer position (counting from 0):
toy_df.iloc[0]
List indexing to get multiple rows:
toy_df.loc[[1,5]]
Position-based version:
toy_df.iloc[[1,5]]
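The difference between loc (labels) and iloc (positions) is easiest to see when the labels are not 0, 1, 2, and so on. Here is a sketch with an invented DataFrame whose Index labels are 10, 20, and 30:

```python
import pandas as pd

# Hypothetical data; the Index labels deliberately differ from the positions.
df = pd.DataFrame({"Height": [70.5, 64.0, 68.2]}, index=[10, 20, 30])
df.index.name = "Subject"

by_label = df.loc[10]     # the row whose Index label is 10
by_position = df.iloc[0]  # the first row, whatever its label

print(by_label["Height"], by_position["Height"])

# The same rows, selected two ways:
print(df.loc[[10, 30]])
print(df.iloc[[0, 2]])

# df.loc[0] would raise a KeyError here, because no row is labeled 0.
```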
toy_df['Occupation'] == 'Linguist'
toy_df[ toy_df['Occupation'] == 'Linguist' ]
target_professions = {'Linguist'}
toy_df[ toy_df['Occupation'].isin(target_professions) ]
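The comparison above produces a Series of booleans, and indexing with that Series keeps only the rows where it is True. Such masks can be combined with & (and), | (or), and ~ (not). A sketch with invented data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"Height": [70.5, 64.0, 68.2, 72.0],
     "Occupation": ["Linguist", "Psychologist", "Linguist", "Philosopher"]},
    index=[1, 2, 3, 4])

# Each comparison yields a boolean Series aligned with the rows.
is_linguist = df["Occupation"] == "Linguist"
is_tall = df["Height"] > 69

# Combine masks elementwise; Python's `and`/`or` do not work here.
tall_linguists = df[is_linguist & is_tall]
non_linguists = df[~is_linguist]

print(tall_linguists)
print(non_linguists)
```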
toy_df.sort_index()
toy_df.sort_values('Occupation')
toy_df.sort_values('Occupation', ascending=False)
toy_df.sort_values(['Occupation', 'Height'])
toy_df.sort_values(['Height', 'Occupation'])
toy_df['Occupation'].value_counts()
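Two points about the calls above are easy to miss: the sorting methods return a new DataFrame and leave the original untouched, and ascending can be a list with one entry per sort column. The sketch below uses made-up data and also shows value_counts with normalize=True, which gives proportions instead of counts.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"Height": [70.5, 64.0, 68.2, 72.0],
     "Occupation": ["Linguist", "Psychologist", "Linguist", "Philosopher"]},
    index=[1, 2, 3, 4])

# Occupation A-Z, and within each occupation, tallest first.
sorted_df = df.sort_values(["Occupation", "Height"], ascending=[True, False])
print(list(sorted_df.index))  # [1, 3, 4, 2]

# The original row order is unchanged.
print(list(df.index))  # [1, 2, 3, 4]

counts = df["Occupation"].value_counts()
proportions = df["Occupation"].value_counts(normalize=True)
print(counts["Linguist"], proportions["Linguist"])
```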
The apply method on Series takes as its argument a single function, which is applied to every element in the Series, and returns a new Series of the results.
def convert_inches_to_cm(x):
    return x * 2.54
toy_df['Height'].apply(convert_inches_to_cm)
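For simple arithmetic like this, apply and direct Series arithmetic give the same answer. Where apply is more useful is with functions that have no arithmetic shortcut. A sketch with an invented Series (the tall/short threshold is arbitrary):

```python
import pandas as pd

def convert_inches_to_cm(x):
    return x * 2.54

# Hypothetical heights in inches.
height = pd.Series([70.0, 64.0], index=[1, 2], name="Height")

via_apply = height.apply(convert_inches_to_cm)
via_arithmetic = height * 2.54
print(via_apply)

# apply also handles arbitrary element-wise logic.
labels = height.apply(lambda x: "tall" if x > 66 else "short")
print(list(labels))  # ['tall', 'short']
```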
When you have the entire DataFrame at your disposal, you have to decide whether you want to call your apply function on the rows (axis=1) or on the columns (axis=0, the default).
def create_summary(row):
    return "Subject {} is a {} with height {} inches".format(
        row.name, row['Occupation'], row['Height'])
toy_df.apply(create_summary, axis=1)
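A quick way to keep the two axes straight is to look at the Index of the result: with axis=0 the function sees one column at a time, so the result is labeled by column names; with axis=1 it sees one row at a time, so the result is labeled by the row Index. A sketch with made-up numeric data:

```python
import pandas as pd

# Hypothetical all-numeric data for illustration.
df = pd.DataFrame({"Height": [70.0, 64.0], "Age": [30, 41]}, index=[1, 2])

# axis=0 (the default): the function receives each column as a Series.
col_max = df.apply(lambda col: col.max(), axis=0)
print(list(col_max.index))  # ['Height', 'Age']

# axis=1: the function receives each row as a Series.
row_sum = df.apply(lambda row: row.sum(), axis=1)
print(list(row_sum.index))  # [1, 2]
```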
toy_df.groupby('Occupation')
name, group = next(iter(toy_df.groupby('Occupation')))
name
group
def series_mean(s):
    return s.mean()
# Select the numeric Height column so that each group is passed as a Series:
toy_df.groupby('Occupation')['Height'].apply(series_mean)
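The groupby object can be looped over directly, yielding (name, group) pairs, and common summaries like the mean have built-in shortcuts that avoid writing a helper function. A sketch with invented data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"Height": [70.0, 64.0, 68.0, 72.0],
     "Occupation": ["Linguist", "Psychologist", "Linguist", "Philosopher"]},
    index=[1, 2, 3, 4])

# Each group is the sub-DataFrame of rows sharing an Occupation value.
for name, group in df.groupby("Occupation"):
    print(name, list(group.index))

# Built-in aggregation on a single column.
mean_height = df.groupby("Occupation")["Height"].mean()
print(mean_height["Linguist"])  # 69.0

# agg computes several summaries at once, one column per summary.
summary = df.groupby("Occupation")["Height"].agg(["mean", "count"])
print(summary)
```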
Remember to put %matplotlib inline above all your import statements at the top of the notebook.
For visualization, Pandas is largely a wrapper around Matplotlib, an incredibly powerful and complicated Python library for creating plots and other visualizations. The good news is that, if you want to do it, Matplotlib can do it. The bad news is that this power makes Matplotlib very complicated.
toy_df.plot()
toy_df['Occupation'].value_counts().plot.barh()
toy_df.boxplot(column="Height", by="Occupation")