Based on problems by Colin Kincaid, Monica Anuforo, Jennie Yang, Nick Bowman, Juliette Woodrow, Chris Piech, Mehran Sahami, and Kathleen Creel.
To get started, download the zip file for HW8 here.
We have done all of the file reading and data processing for you, storing the data in a nested dictionary whose structure is described later in the handout. Your job is to write code that plots the data in an interesting way. Plotting and visualizing across professor gender and review quality reveals interesting trends in human language usage. We hope this exercise in data visualization will also prompt you to think critically about the underlying biases that exist in online datasets! The end product is a complete application that will help you dig deep into our provided dataset while answering important social and ethical questions along the way.
In this assignment, you will analyze a historical dataset consisting of nearly 20 years of reviews of college and university professors posted on RateMyProfessors.com, a popular review aggregation website. Teacher ratings are a common and impactful facet of life in university - here at Stanford, we fill out course reviews at the end of every quarter. However, teaching evaluations are not an objective source of truth about the quality of a professor's teaching. Recent research has shown that teaching evaluations often demonstrate harmful biases, including gender bias. The bias in teaching evaluations is a problem because the scores are often used in decisions about who to hire, fire, tenure, and promote. Your goal is to build a piece of software that helps you investigate how humans use language in gendered (and potentially biased) ways.
Here are two sample outputs of the BiasBars program, the first showing how frequently the word "class" appears in professor reviews, and the second showing how frequently the word "smart" appears in professor reviews. Notice how the values are about the same for a more descriptive term ("class"), while the frequencies are skewed by gender for a more normative term ("smart").
Before we get started coding, we first want to provide you with some background about why being able to investigate and identify biases in datasets is such an important problem to solve. Much of today’s work in artificial intelligence involves natural language processing, a field which studies the way language is used today and has been used in the past. The datasets we use to train artificially intelligent systems are usually collections of text that humans have written at some point in the past. If there are imbalances in how different groups of people tend to be described or represented in these datasets, then our machines will pick up on and potentially amplify those imbalances. Extreme manifestations of these biases like Tay, Microsoft’s 2016 chatbot infamous for tweeting racist and anti-Semitic statements after just a day of learning from anonymous posts on the Internet, magnify the importance of understanding the ways we use language. More recent examples include Amazon's AI tool for expediting hiring and recruiting, which was shut down after it was found to be heavily biased against candidates based on their gender.
Even when people do not mean to be malicious, their language can still exhibit biases that influence how our machines learn. For example, when history of science professor Londa Schiebinger attempted to Google Translate a Spanish article written about her, all of the pronouns became “he” and “him” rather than “she” and “her” simply because masculine pronouns were more common than feminine pronouns in the available data. In a later study, Schiebinger found more insidious translation errors that assumed genders for people of certain professions based on the frequency of word usage in gendered languages such as German. The software engineers who made Google Translate probably did not mean for this to occur; they probably did not even account for that possibility as they were designing their translation algorithm. The moral of the story? To prevent these kinds of slip-ups, computer scientists need to consider the social impacts of their work at the beginning of their design process.
Identifying issues of bias and representation in datasets is a natural extension of many of the interesting ethical topics that we have talked about in CS106A so far this quarter. As we've mentioned before, our hope is that by introducing these sorts of topics early in computer science education, we can help the next generation of software developers and computer science researchers—which could include you!—be more mindful of the potential social implications of their work.
In this milestone, you are going to understand the structure of the dictionary created by the provided file reading code.
This assignment uses real world data from RateMyProfessors.com, an online platform that enables students to leave anonymous, public reviews about their college/university professors. A typical review on RateMyProfessors.com consists of an overall numerical rating of quality (from 1-5), a number of qualitative tags (like "amazing lectures" or "difficult exams"), and a free-response comment section where students can write a short paragraph describing their experience with the professor. An example review for our very own Nick Parlante is shown below:
The Internet makes this platform for reviews accessible to the global community of students, empowering them to make decisions about which classes to take or which universities to attend based on the quality of instruction. The indirectness and anonymity of being behind a computer or phone screen also gives people a sense of security to say whatever they want, which can range from the supportive and constructive to the downright offensive and harmful. In analyzing this dataset, you will be working to answer the following question: does a professor's gender influence the language people use to describe them?
To examine this question, we have collected and compiled a dataset of 20,000 reviews from RateMyProfessors.com posted over a 17-year span from 2001 to 2018. We have cleaned and organized the data into one large text file that will act as the source of information for the program you will write. There are three important components of every review that we have used to build the dataset: the gender of the professor being reviewed, the text content of the free-response comment, and the overall quality rating (a numerical score from 1-5).
A note on gender vs sex: In this dataset, gender is the only piece of information we have about these people’s social identities; the dataset does not include other salient identities such as race and ability. Furthermore, gender is only classified into the categories of woman and man, which means non-binary people are unfortunately not represented. We choose to describe the two genders included in this dataset as “woman” and “man” rather than “female” and “male,” as the former terms refer to gender and social role whereas the latter typically refer to sex assigned at birth. Professors do not have the opportunity to describe their own gender identity; this data represents the guesses of students. We will reflect further on this point in the ethics questions at the end of the assignment.
We have written a few functions, namely convert_rating_to_index(rating), add_data_for_word(word_data, word, gender, rating), and read_file(filename), in biasbarsgraph.py. These functions are complete, and you do not need to add code to them. Feel free to peruse them.
To begin with, we need to consider the issue of being able to organize the data by the numerical rating associated with the review, since we want to be able to identify trends in how a given word is used in positive reviews vs. negative reviews. Since numerical rating is a float (real value) that can take on many different values between 1.0 and 5.0, we are going to make our data processing task simpler by representing review quality using only three "buckets": low, medium, and high quality reviews.
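As a rough illustration of the bucketing, the provided convert_rating_to_index might look something like the sketch below. The cutoff values of 2.5 and 3.5 are assumptions for illustration only; check the starter code for the actual function.

def convert_rating_to_index(rating):
    # Map a numerical rating (1.0-5.0) to a bucket index:
    # 0 = low reviews, 1 = medium reviews, 2 = high reviews.
    # The cutoffs 2.5 and 3.5 are assumed for illustration.
    if rating <= 2.5:
        return 0
    if rating <= 3.5:
        return 1
    return 2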
The data structure for this program (which we will refer to as word_data) is a dictionary that has a key for every word that we come across in the dataset. The value associated with each word is a nested dictionary, which maps gender to a list of counts, broken down by rating bucket (the ordering is counts for low reviews, then medium reviews, then high reviews). A mini example of the word_data data structure would look something like this:
{
    'great': {
        'W': [30, 100, 800],
        'M': [100, 200, 1500]
    },
    'teacher': {
        'W': [330, 170, 852],
        'M': [402, 250, 1194]
    }
}
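To get a feel for how this structure is built up as reviews are read, here is a minimal sketch of what add_data_for_word could do. This is an illustration of the idea, not the actual starter code:

def add_data_for_word(word_data, word, gender, rating):
    # Sketch only: record one occurrence of `word` in a review
    # written about a professor of the given gender ('W' or 'M').
    if word not in word_data:
        # Start a fresh [low, medium, high] counts list for each gender.
        word_data[word] = {'W': [0, 0, 0], 'M': [0, 0, 0]}
    index = convert_rating_to_index(rating)  # 0, 1, or 2
    word_data[word][gender][index] += 1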
Let's break down this data structure with an example. Let's say we wanted to access the number of times the word "great" appears in reviews for women (which we can see is the list [30, 100, 800] in the above example). How can we access this list of counts? The keys in the word_data dictionary are words, so to get the data associated with the word "great," we can write word_data['great']. Then, we'd get the following dictionary:
{
    'W': [30, 100, 800],
    'M': [100, 200, 1500]
}
The keys of this inner dictionary are the genders 'W' and 'M', so to get the counts for women, we index in again with the key 'W', writing word_data['great']['W']. This expression gives us the list [30, 100, 800].
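Putting the two indexing steps together in code (using the example values above):

counts_women = word_data['great']['W']    # [30, 100, 800]
low, medium, high = counts_women          # one count per rating bucket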
Later in this assignment, you'll need to access the counts lists for men and women for a given word, so keep this in mind! Now, we can move on to the coding that you will have to do for the assignment: building a cool visualization for this data! We have already completed all of the file reading for you in the starter project.
In this milestone you are going to make sure you have everything installed properly in order to complete the assignment.
Unzip and open the assignment folder in PyCharm as usual. Make sure you have the matplotlib package installed, which allows you to quickly and easily draw graphs. If you haven't done this already, open the terminal in PyCharm and run the following command:
python3 -m pip install matplotlib
If you have a Windows computer, substitute py for python3.
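To double-check that the installation worked, you can run this quick sanity check in the same terminal (again substituting py on Windows); it simply prints the installed version:

python3 -c "import matplotlib; print(matplotlib.__version__)"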
In this section, you will write code in the function plot_word(word_data, word, total_words_women, total_words_men) using your new matplotlib skills to create a bar chart for a specific word. Our first parameter word_data is the nested structure we explored above, and word is the specific word we want to plot. The parameters total_words_women and total_words_men are the total number of words we've seen in reviews for women and men respectively, which will be handy later on.
Recall from Lecture 26 that in order to make a plot, you need a list of x-values and a list of y-values. We handle the code for the x-values. You'll just need to worry about creating the lists of y-values for women and men for a given word. Below is example code to create a bar chart:
x_vals = ...  # some list of x-values, we've provided
y_vals = ...  # some list of y-values, YOU DO THIS
plt.bar(x_vals, y_vals, color="orange")  # call the plt.bar function to make the graph, we've provided
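For reference, here is a complete toy example, with made-up values unrelated to the dataset, showing how plt.bar draws a chart:

import matplotlib.pyplot as plt

x_vals = [0, 1, 2]         # one bar position per rating bucket
y_vals = [0.5, 1.25, 3.0]  # made-up heights, just for illustration
plt.bar(x_vals, y_vals, color="orange")
plt.show()                 # opens a window displaying the chart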
In this assignment, the x-values for both the women and men bars have been provided for you; these are the variables x_vals_women and x_vals_men.
We're asking you to create two y-values lists: one for women and one for men. To create your y_vals_women list, use word_data to access the list of counts for this word for women. This is like the example we walked through in Milestone 1, in which we accessed the counts list for "great" in reviews of female professors. Then, write a list comprehension to divide each of the values in that list by total_words_women. This allows us to normalize our data. Overall, our dataset has more words in reviews of men than reviews of women, so without accounting for this, it might appear that a word appears less frequently for women, when really, there were just fewer reviews of women in general. We'll do the same to produce the y_vals_men list: divide each of the raw counts in our list of counts for this word for men by total_words_men to produce a normalized y-values list for men.
As a small example, if the counts list for our given word in reviews of men was [100, 200, 1500] and total_words_men was 10,000, we'd want to use a list comprehension to produce y_vals_men = [0.01, 0.02, 0.15].
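Putting these pieces together, the y-value computation inside plot_word might look like the sketch below (the local variable names are our own; only the parameters come from the assignment):

counts_women = word_data[word]['W']   # raw counts: [low, medium, high]
counts_men = word_data[word]['M']
# Normalize each bucket's count by the total words seen for that gender.
y_vals_women = [count / total_words_women for count in counts_women]
y_vals_men = [count / total_words_men for count in counts_men]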
To test your implementation, you can run the following in the terminal:
python3 biasbarsgraph.py smart
If you have a Windows computer, substitute py for python3. You can replace "smart" with any word that you want
to see plotted.
Below is the expected output for the word "smart." Once you've implemented plot_word, you should see this pop up if you run the command from above.
If you're wondering why our y-values aren't tiny fractions even though we've divided by the total number of words: in another step of the program, we've multiplied all of the counts by 1,000,000 (see scale_up_counts(word_data) if you're curious).
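For the curious, the scaling described above could be implemented along these lines (a sketch of the idea, not necessarily the provided code):

def scale_up_counts(word_data):
    # Multiply every raw count by 1,000,000 so the normalized
    # y-values are readable numbers rather than tiny fractions.
    for gender_counts in word_data.values():
        for gender, counts in gender_counts.items():
            gender_counts[gender] = [count * 1000000 for count in counts]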
In this milestone, you are going to think critically about the RateMyProfessors.com dataset and reflect on the
relevant ethical issues that computer scientists should consider when working on data science problems. You
will be answering the following questions in ethics.txt.
Veronica Rivera gave an awesome lecture on the ethics of data. Many critical issues can arise when working with datasets that contain real data describing real people. In the last part of the assignment, we are going to ask you a few questions. We hope that these questions will provide a foundation for you to think critically about the social impacts of your work, should you continue on into the wonderful field of computer science! Please take the time to seriously think about each of the questions presented below and answer each in at least 2-3 well-thought-out sentences.
Once you've finished thinking through and answering all these questions, make sure to stop and really admire the magnitude of what you've accomplished! You've wrangled a complex real-world dataset and used it to flex your ethical and critical thinking skills. Congrats on making it to the end!
Once you've gotten all the parts of this assignment working, you're ready to submit!
Make sure to submit only the files you modified for this assignment on Paperless. You should make sure to submit the following files:
- biasbarsgraph.py
- ethics.txt