Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2021
Christopher Potts

Class 12: Introduction to Pandas¶

Pandas is an amazing, powerful, flexible library for doing reproducible research efficiently and effectively.
We will be able to cover only a tiny percentage of the things it can do.
Thus, to create this notebook, I reviewed the work in Pandas that I've done over the years, identified the most common operations, and assembled them here as a bunch of code snippets with notes interspersed.
My hope is that this gives you enough of a feel for Pandas that you'll be able to explore and make use of its other functionality by reviewing its documentation and Googling for Stack Exchange discussions.

%matplotlib inline
import os
import pandas as pd
import random

Contents¶

Reading CSVs into DataFrame objects
DataFrame shapes
Creating DataFrames in other ways
Columns as Series
Creating a Series directly
Common series methods
Indexing into Series
Getting the values from a DataFrame or Series
Adding columns to a DataFrame
Subframes based on columns
DataFrame heads and tails
Getting specific DataFrame rows
Subframes based on specific row values
Sorting
value_counts
apply on Series
apply on DataFrame
groupby
Basic plotting

Reading CSVs into DataFrame objects¶

The pd.read_csv function is an incredibly powerful and flexible tool for reading in CSV files. It can handle multiple input formats, it infers column types on import, and it offers many other options.

toy_df = pd.read_csv(
    "http://web.stanford.edu/class/linguist278/data/toy-csv.csv")

Here, toy_df is a pd.DataFrame. These are like spreadsheets, and like data.frame objects in the programming language R. They display nicely in notebooks:

toy_df.head()

We can specify that we want the Subject column to be the pd.Index for this DataFrame by passing index_col=0 (Subject is the first column):

toy_df = pd.read_csv(
    "http://web.stanford.edu/class/linguist278/data/toy-csv.csv",
    index_col=0)
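
As a sketch of what type inference and index_col do, here is a version that runs offline. The three-row CSV text is a shortened stand-in for the course file (not the file itself), and io.StringIO takes the place of the URL:

```python
import io
import pandas as pd

# Shortened stand-in for the course CSV (three rows, rounded heights).
csv_text = """Subject,Height,Occupation
1,74.37,Psychologist
2,67.50,Psychologist
5,67.77,Linguist
"""

# read_csv accepts file-like objects as well as paths and URLs.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.dtypes)       # Height is inferred as a float column
print(df.index.name)   # the first column has become the Index
```

Because index_col=0 was passed, Subject no longer counts as a column, which is why toy_df.shape reports two columns rather than three.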

DataFrame shapes¶

toy_df.shape

(10, 2)

Creating DataFrames in other ways¶

pd.DataFrame(
    {
        "lower": ["a", "b", "c"], 
        "upper": ["A", "B", "C"]
    })

pd.DataFrame(
    [
        ["a", "A"], 
        ["b", "B"], 
        ["c", "C"]
    ], 
    columns=["lower", "upper"])

Columns as Series¶

The columns in DataFrames are pd.Series objects. You can pull them out by indexing directly into the pd.DataFrame with the column name (as a str):

toy_df['Height']

Subject
1     74.370003
2     67.496862
3     74.923564
4     64.623722
5     67.767879
6     61.503977
7     62.736810
8     68.608040
9     70.160905
10    76.811444
Name: Height, dtype: float64
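
To make the relationship between a column and its DataFrame concrete, here is a small sketch. The DataFrame df is a three-row stand-in for toy_df built from values shown above; it is an assumption for illustration, not the course data file:

```python
import pandas as pd

# Three-row stand-in for toy_df.
df = pd.DataFrame(
    {"Height": [74.370003, 67.496862, 67.767879],
     "Occupation": ["Psychologist", "Psychologist", "Linguist"]},
    index=pd.Index([1, 2, 5], name="Subject"))

height = df["Height"]

print(type(height).__name__)           # the class of a single column
print(height.name)                     # the column name travels with the Series
print(height.index.equals(df.index))   # the Series shares the DataFrame's Index
```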

Creating a Series directly¶

pd.Series(["a", "b", "c"], name="lower")

0    a
1    b
2    c
Name: lower, dtype: object

Common series methods¶

toy_df['Height'].mean()

68.9003206602722

toy_df['Height'].max()

76.81144438287173

toy_df['Height'].min()

61.50397707923559

toy_df['Height'].median()

68.18795942394993

toy_df['Height'] - toy_df['Height'].min()

Subject
1     12.866026
2      5.992885
3     13.419587
4      3.119745
5      6.263902
6      0.000000
7      1.232833
8      7.104063
9      8.656928
10    15.307467
Name: Height, dtype: float64

toy_df['Height'] * 2.54

Subject
1     188.899808
2     171.442030
3     190.305853
4     164.144254
5     172.130413
6     156.220102
7     159.351496
8     174.264421
9     178.208699
10    195.101069
Name: Height, dtype: float64

Indexing into Series¶

Series are pretty much like dictionaries where the keys are the pd.Index values:

height = toy_df['Height']

height[1]

74.37000326528937
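
The dictionary analogy extends to a few other operations. This is a minimal sketch on a three-element stand-in for the Height column (the variable names are mine, chosen for illustration):

```python
import pandas as pd

height = pd.Series(
    [74.370003, 67.496862, 67.767879],
    index=pd.Index([1, 2, 5], name="Subject"),
    name="Height")

# Membership tests look at the Index labels (the "keys"), not the values:
has_subject_5 = 5 in height
has_subject_3 = 3 in height

# get returns a default instead of raising KeyError for a missing label:
missing = height.get(3, default=None)

# items yields (label, value) pairs, like dict.items:
pairs = list(height.items())
print(pairs)
```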

Getting the values from a DataFrame or Series¶

If the Pandas object proves stubborn, the values attribute will return a NumPy array, which will behave very much like a list. (If even the array is vexing, call list on it.)

toy_df.values

array([[74.37000326528937, 'Psychologist'],
       [67.4968620693749, 'Psychologist'],
       [74.92356434760966, 'Psychologist'],
       [64.62372198999978, 'Psychologist'],
       [67.76787900026083, 'Linguist'],
       [61.50397707923559, 'Psychologist'],
       [62.736809619085655, 'Psychologist'],
       [68.60803984763902, 'Linguist'],
       [70.16090500135535, 'Psychologist'],
       [76.81144438287173, 'Linguist']], dtype=object)

toy_df['Height'].values

array([74.37000327, 67.49686207, 74.92356435, 64.62372199, 67.767879  ,
       61.50397708, 62.73680962, 68.60803985, 70.160905  , 76.81144438])
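
Here is a short sketch of the escape hatches mentioned above, again on a three-row stand-in for toy_df. The to_numpy and tolist methods are alternatives to the values attribute and to calling list, respectively:

```python
import pandas as pd

df = pd.DataFrame(
    {"Height": [74.370003, 67.496862, 67.767879],
     "Occupation": ["Psychologist", "Psychologist", "Linguist"]})

height_array = df["Height"].to_numpy()   # NumPy array, same data as .values
height_list = df["Height"].tolist()      # plain Python list of floats
rows = df.values.tolist()                # list of rows, each row itself a list

print(rows)
```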

Adding columns to a DataFrame¶

# Some random ages for a dummy column:

ages = []
for i in range(toy_df.shape[0]):
    ages.append(random.randint(18, 100))

toy_df['Age'] = ages

toy_df.head()
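
New columns can also be computed from existing ones. In this sketch, the column names Height_cm and Tall and the rounded heights are hypothetical, introduced only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"Height": [74.0, 67.5, 68.0]},
    index=pd.Index([1, 2, 5], name="Subject"))

# A new column computed from an existing one (inches to centimeters):
df["Height_cm"] = df["Height"] * 2.54

# A new column from a plain list; its length must match the number of rows:
df["Tall"] = [h > 70 for h in df["Height"]]

print(df)
```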

Subframes based on columns¶

If you index into a DataFrame with a list of column names, you get the subframe containing those columns, in the order you specified:

toy_df[['Occupation']].head()

You can use this to change the order of the columns:

toy_df[['Occupation', 'Height']].head()

DataFrame heads and tails¶

toy_df.head(2)

toy_df.tail(2)

Getting specific DataFrame rows¶

To get specific rows based on their label in the Index:

toy_df.loc[1]

Height               74.37
Occupation    Psychologist
Age                     80
Name: 1, dtype: object

And based on integer position (counting from 0):

toy_df.iloc[0]

Height               74.37
Occupation    Psychologist
Age                     80
Name: 1, dtype: object

List indexing to get multiple rows:

toy_df.loc[[1,5]]

Position-based version (with this Index, it returns Subjects 2 and 6):

toy_df.iloc[[1,5]]
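
The difference between loc and iloc is easiest to see when the Index labels are not consecutive. Here is a sketch on a three-row stand-in for toy_df, keeping only Subjects 1, 2, and 5:

```python
import pandas as pd

df = pd.DataFrame(
    {"Height": [74.370003, 67.496862, 67.767879]},
    index=pd.Index([1, 2, 5], name="Subject"))

by_label = df.loc[5]        # the row whose Index label is 5
by_position = df.iloc[2]    # the third row, counting from 0

# Here both pick out the same row, because label 5 sits at position 2:
same_row = by_label.equals(by_position)

# Label 2 and position 2 are different rows:
label_2 = df.loc[2, "Height"]
position_2 = df.iloc[2]["Height"]

print(same_row, label_2, position_2)
```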

Subframes based on specific row values¶

Creating Boolean Series¶

toy_df['Occupation'] == 'Linguist'

Subject
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8      True
9     False
10     True
Name: Occupation, dtype: bool

Boolean series as filter¶

toy_df[ toy_df['Occupation'] == 'Linguist' ]

Set-based example¶

target_professions = {'Linguist'}

toy_df[ toy_df['Occupation'].isin(target_professions) ]
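
Boolean Series can be combined before filtering. This sketch uses a four-row stand-in for toy_df and an arbitrary height cutoff of 70 inches. Note that the element-wise operators are &, |, and ~ (Python's and, or, and not do not work on Series):

```python
import pandas as pd

df = pd.DataFrame(
    {"Height": [74.370003, 67.496862, 67.767879, 76.811444],
     "Occupation": ["Psychologist", "Psychologist", "Linguist", "Linguist"]},
    index=pd.Index([1, 2, 5, 10], name="Subject"))

is_linguist = df["Occupation"] == "Linguist"
is_tall = df["Height"] > 70

tall_linguists = df[is_linguist & is_tall]     # both conditions hold
linguist_or_tall = df[is_linguist | is_tall]   # at least one condition holds
non_linguists = df[~is_linguist]               # negation

print(tall_linguists)
```

If you write the conditions inline rather than naming them first, wrap each comparison in parentheses, as in df[(df["Height"] > 70) & (df["Occupation"] == "Linguist")].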

Sorting¶

Based on the index¶

toy_df.sort_index()

Based on a single column¶

toy_df.sort_values('Occupation')

toy_df.sort_values('Occupation', ascending=False)

Multiple columns at once¶

toy_df.sort_values(['Occupation', 'Height'])

toy_df.sort_values(['Height', 'Occupation'])
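
When sorting on several columns, ascending can be a list with one entry per column. Here is a sketch on a four-row stand-in for toy_df:

```python
import pandas as pd

df = pd.DataFrame(
    {"Height": [74.370003, 67.496862, 67.767879, 76.811444],
     "Occupation": ["Psychologist", "Psychologist", "Linguist", "Linguist"]},
    index=pd.Index([1, 2, 5, 10], name="Subject"))

# Occupation A-Z, then tallest first within each occupation:
sorted_df = df.sort_values(["Occupation", "Height"], ascending=[True, False])

# sort_values returns a new DataFrame; df keeps its original row order.
print(sorted_df)
```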

value_counts¶

toy_df['Occupation'].value_counts()

Psychologist    7
Linguist        3
Name: Occupation, dtype: int64
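
If you want proportions rather than raw counts, value_counts takes a normalize argument. Here is a small sketch on a made-up four-element Series:

```python
import pandas as pd

occupation = pd.Series(
    ["Psychologist", "Psychologist", "Linguist", "Psychologist"],
    name="Occupation")

counts = occupation.value_counts()                      # most frequent first
proportions = occupation.value_counts(normalize=True)   # fractions summing to 1

print(counts)
print(proportions)
```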

apply on Series¶

This is a method on Series. It takes as its argument a single function, which is applied to every element in the Series.

def convert_inches_to_cm(x):
    return x * 2.54

toy_df['Height'].apply(convert_inches_to_cm)

Subject
1     188.899808
2     171.442030
3     190.305853
4     164.144254
5     172.130413
6     156.220102
7     159.351496
8     174.264421
9     178.208699
10    195.101069
Name: Height, dtype: float64
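
apply earns its keep when the function involves ordinary Python logic. For plain arithmetic, operating on the whole Series (as in the Common series methods section) gives the same result. In this sketch, the function describe_height and its cutoff of 70 are hypothetical:

```python
import pandas as pd

height = pd.Series([74.0, 67.5, 68.0], name="Height")

def describe_height(x):
    # Arbitrary Python logic applied to one element at a time.
    return "tall" if x > 70 else "not tall"

labels = height.apply(describe_height)

# Two routes to the same unit conversion:
via_apply = height.apply(lambda x: x * 2.54)
via_arithmetic = height * 2.54

print(labels.tolist())
```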

apply on DataFrame¶

When you have the entire DataFrame at your disposal, you have to decide whether you want to call your apply function on the rows (axis=1) or on the columns (axis=0, the default).

def create_summary(row):
    return "Subject {} is a {} with height {} inches".format(
        row.name, row['Occupation'], row['Height'])

toy_df.apply(create_summary, axis=1)

Subject
1     Subject 1 is a Psychologist with height 74.370...
2     Subject 2 is a Psychologist with height 67.496...
3     Subject 3 is a Psychologist with height 74.923...
4     Subject 4 is a Psychologist with height 64.623...
5     Subject 5 is a Linguist with height 67.7678790...
6     Subject 6 is a Psychologist with height 61.503...
7     Subject 7 is a Psychologist with height 62.736...
8     Subject 8 is a Linguist with height 68.6080398...
9     Subject 9 is a Psychologist with height 70.160...
10    Subject 10 is a Linguist with height 76.811444...
dtype: object
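
For contrast with the row-wise example above, here is a sketch that runs apply both ways on a small numeric stand-in for toy_df. The functions value_range and height_per_year are hypothetical, written only to show what each axis setting passes in:

```python
import pandas as pd

df = pd.DataFrame(
    {"Height": [74.0, 67.5, 68.0],
     "Age": [80, 20, 47]},
    index=pd.Index([1, 2, 5], name="Subject"))

def value_range(column):
    # With axis=0, the function receives one whole column as a Series.
    return column.max() - column.min()

column_ranges = df.apply(value_range, axis=0)

def height_per_year(row):
    # With axis=1, the function receives one row as a Series
    # whose labels are the column names.
    return row["Height"] / row["Age"]

row_ratios = df.apply(height_per_year, axis=1)

print(column_ranges)
print(row_ratios)
```

The result of axis=0 is indexed by column name, and the result of axis=1 is indexed by the DataFrame's own Index.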

groupby¶

toy_df.groupby('Occupation')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x115fbe5f8>

name, group = next(iter(toy_df.groupby('Occupation')))

name

'Linguist'

group

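def series_mean(s):
    # Works on a Series or on a DataFrame of numeric columns.
    return s.mean()

# Select the numeric columns explicitly; recent versions of Pandas
# raise an error when asked to average the string-valued columns.
toy_df.groupby('Occupation')[['Height', 'Age']].apply(series_mean)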
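
Here is a sketch of a few common groupby patterns on a four-row stand-in for toy_df. The heights are rounded to keep the arithmetic easy to check, so they are not the course values:

```python
import pandas as pd

df = pd.DataFrame(
    {"Height": [74.0, 67.0, 68.0, 76.0],
     "Occupation": ["Psychologist", "Psychologist", "Linguist", "Linguist"]},
    index=pd.Index([1, 2, 5, 10], name="Subject"))

# Iterating over a groupby yields (group name, sub-DataFrame) pairs:
group_sizes = {}
for name, group in df.groupby("Occupation"):
    group_sizes[name] = group.shape[0]

# Selecting a column first gives one Series per group to summarize:
mean_height = df.groupby("Occupation")["Height"].mean()

# agg computes several summaries at once:
summary = df.groupby("Occupation")["Height"].agg(["mean", "min", "max"])

print(group_sizes)
print(summary)
```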

Basic plotting¶

Remember to put

%matplotlib inline

above all your import statements at the top of the notebook.

For visualization, Pandas is largely a wrapper around Matplotlib, an incredibly powerful and complicated Python library for creating plots and other visualizations. The good news is that, if you want to do it, Matplotlib can do it. The bad news is that this power makes Matplotlib very complicated.

Basic plot – just try it!¶

toy_df.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x115fbe860>

Barplot¶

toy_df['Occupation'].value_counts().plot.barh()

<matplotlib.axes._subplots.AxesSubplot at 0x11731f518>

Boxplot¶

toy_df.boxplot(column="Height", by="Occupation")

<matplotlib.axes._subplots.AxesSubplot at 0x117410898>

Output of the first toy_df.head(), before Subject is set as the index:

   Subject     Height    Occupation
0        1  74.370003  Psychologist
1        2  67.496862  Psychologist
2        3  74.923564  Psychologist
3        4  64.623722  Psychologist
4        5  67.767879      Linguist

Output of toy_df.head() after the Age column is added:

            Height    Occupation  Age
Subject
1        74.370003  Psychologist   80
2        67.496862  Psychologist   20
3        74.923564  Psychologist   81
4        64.623722  Psychologist   29
5        67.767879      Linguist   47

Output of toy_df.sort_values(['Height', 'Occupation']):

            Height    Occupation  Age
Subject
6        61.503977  Psychologist   26
7        62.736810  Psychologist  100
4        64.623722  Psychologist   29
2        67.496862  Psychologist   20
5        67.767879      Linguist   47
8        68.608040      Linguist  100
9        70.160905  Psychologist   58
1        74.370003  Psychologist   80
3        74.923564  Psychologist   81
10       76.811444      Linguist   65

Output of the groupby mean computation (mean Height and Age per Occupation):

                 Height        Age
Occupation
Linguist      71.062454  70.666667
Psychologist  67.973692  56.285714

Output of both pd.DataFrame constructions in "Creating DataFrames in other ways" (the two give the same result):

  lower upper
0     a     A
1     b     B
2     c     C