This week's section will give you practice with the dictionary data structure. In particular, you'll explore the nested dictionary structure and how it lends itself to representing complex information. You'll also explore how to use decomposition-by-variable to more easily and readably update a nested dictionary structure.
Counting by Consonants
Implement the following function:
def count_by_consonants(filename):
"""
Reads in the file whose name is filename and returns a dictionary
that contains counts of words that share the same consonants in order.
"""
For example, if the file consonants.txt
contains this text:
great
grate
teeny
greet
tiny
bump
calling count_by_consonants('consonants.txt')
will return the
dictionary {'grt': 3, 'tny': 2, 'bmp': 1}
.
Wannabe Astronauts
The inspiration for this problem came from this video by Veritasium on YouTube
How lucky do you have to be to be an astronaut?
In 2020 more than 12,000 people applied to join NASA's newest class of astronauts. In a few years, 10-12 of those people will be selected to join the Artemis program and become astronauts. Will those 10-12 truly be the best of the bunch, or could there be some luck involved? Source
Programming a Simulation
Let's explore this question by making a simulation using code. Complete
function make_applicants(num_applicants)
that creates num_applicants
applicants who want to be astronauts, where each applicant is created
with a random float of skill and luck. Before writing this function,
take a minute to consider what data structures would be helpful here.
What information needs to be stored for each applicant? How can we store
the information on all of the applicants? This function should return
the data structure containing information on all of the applicants.
Then run these applicants through a skill test: each applicant must pass
a minimum skill score (like 0.90) to make the first cut. To do this,
write a function make_first_cut_applicants(applicants)
which takes as
parameter a data structure holding all applicants and their scores and
returns a list of all of the skilled astronauts.
After this first cut, all remaining applicants are highly skillful, but
it's time to select the best of the best. Find the best applicants by
combining their luck and skill scores (made from some proportion such as
95% skill and 5% luck) and make the final list of astronauts from the
applicants with the ten highest combined scores. To do so, first write a
function calc_combined_score(applicant)
which takes as parameter an
applicant and returns their combined luck and skill score. Then use this
function to create a list of the top ten applicants.
When you find the top ten astronauts, compare them to the group of highly skilled applicants that made the first cut. Find the average skill and luck scores of each group,then use the percent difference function to see how important each score was in the final decision process.
So is Success All Hard Work? - Discussion
If you were to ask one of the selected astronauts why they were chosen, what might they say?
Stanford accepted 2,349 out of 45,227 applicants for the class of 2024. What happens when you use those numbers instead? Where do you think this difference comes from? Check out the numbers for other classes at the bottom of this page.
What do the results of this simulation say about the possible effects of luck and skill in highly selective programs? In what ways can this simulation be misleading?
If you like, try altering some of the constants in your code, such as increasing the percentage of luck in the combined score, lowering the skill score needed to pass the first cut, or selecting a greater proportion of applicants.
Drawing Friend Graphs
Python is commonly used to help analyze data sets by creating visualizations of said data. One common example of data visualization is to look at relationships between users social networks. Programmers can use python to help draw the network. The visual can be used to measure statistics of the network or to find clusters, which are groups of individuals who have many of the same friends. There are many reasons to search for close-knit groups in social networks - for instance, to create targeted marketing schemes or to suggest groups they'd like to be a part of. Cluster analysis has been also used to identify terrorist cells.
In this problem, you are going to use your python programming skills to create a visual representation of a social network where users can follow each other. More specifically, you will use information from two text files and draw lines representing the relationships in the network. If a user follows another user, your output should have a line connecting the two people.
You will be given two files through command-line arguments. The first
file is a list of each person in the network, each on a separate line,
where their name is followed by a colon and then a list of all the
people that they follow. The second file is a list of coordinates. Each
line has the name of a person in the network, followed by a colon and
then a comma separated list of two integers representing the x and y
coordinates you will use to place the node representing that person on
the canvas. For example, two matching files, users.txt
and
coords.txt
, might look like this (note that the users needn't
necessarily be in the same order in both files):
users.txt
Juliette: Brahm, Nick, Julie, Cynthia
Brahm: Juliette, Nick, Cynthia, Mehran, Chris, Cynthia
Mehran: Oliver, Chris
Chris: Mehran, Oliver
Nick: Juliette, Julie, Keith
Julie: Juliette, Nick, Cynthia
Oliver: Mehran, Chris
Cynthia: Juliette, Julie, Keith
Keith: Nick, Cynthia
coords.txt
Brahm: 141, 343
Chris: 390, 65
Cynthia: 100, 250
Julie: 185, 670
Juliette: 238, 409
Keith: 30, 400
Mehran: 550, 145
Nick: 14, 585
Oliver: 699, 18
Your job is to implement the following function:
def draw_friend_graph(canvas, friends_file, coordinates_file):
"""
Draws a graph representing the friend network. For each user,
draw a circle at their respective coordinates and a label with their name.
Next, draw lines connecting that circle to the circle of each person they
follow.
"""
As you consider how best to approach the problem and store your data, keep in mind that relationships aren't necessarily symmetric. For example, note that Brahm follows several users in the example above, but despite his best efforts to use in-vogue hashtags, has but a single humble follower.
You may assume that the two files provided are valid and represent the same users, but an interesting extension to this problem might be to verify that.
First Letter Index
Implement the following function:
def first_list(strs):
"""
Given a list of strings, create and return a dictionary whose
keys are the unique first characters of the strings and whose
values are lists of words beginning with those characters, in
the same order that they appear in strs.
>>> first_list(['banter', 'brahm', 'aardvark', 'python', 'antiquated'])
{'b': ['banter', 'brahm'], 'a': ['aardvark', 'antiquated'], 'p': ['python']}
"""
Cryptography
Cryptography is the study of techniques for communicating messages secretly. Imagine that Alice and Bob want to send messages to each other, but Eve can snoop on the messages they're sending and read them. Alice wants to figure out way to "encrypt" her messages so that if Eve reads the message, she won't be able to understand it, but Bob will be able to "decrypt" the message. We're going to write a program to help Alice and Bob do this.
Alice decides that she'll replace every letter in her original message
with a different letter. She's defined a variable in the program called
ENCRYPTION_DICT
that keeps track of these associations:
ENCRYPTION_DICT = {
'A': 'T',
'B': 'H',
'C': 'E',
'D': 'Q',
'E': 'U',
'F': 'I',
'G': 'C',
'H': 'K',
...
}
Alice and Bob exchanged this dictionary before we went into quarantine, so they both know that this is the strategy, but Eve doesn't know that! Note that in order to avoid ambiguity, the values in this dictionary are unique (that is, each letter is a value in the dictionary exactly once).
Encryption
To start us off, implement the following function:
def encrypt(plaintext):
"""
Takes in plaintext as an input and returns 'ciphertext': the result
of substituting each letter in the plaintext by its corresponding
encrypted character in ENCRYPTION_DICT.
The plaintext comprises entirely of uppercase letters and non-alphabetic
characters like punctuation. Non-alphabetic characters needn't be encrypted,
but rather should appear in the plaintext in their original form.
>>> encrypt("HEY, HOW'S IT GOING?")
"KUD, KXZ'S BV CXBFC?"
>>> encrypt("I LOVE CS 106A!")
'B WXLU ES 106T!'
>>> encrypt("UNICORNS ARE THE MOST BEAUTIFUL ANIMALS IN EXISTENCE")
'AFBEXPFS TPU VKU NXSV HUTAVBIAW TFBNTWS BF UYBSVUFEU'
"""
Decryption
Now that Bob has received Alice's encrypted message, he needs to "decrypt" it, or convert it back to the original message. To help him do so, implement the following function:
def decrypt(ciphertext):
"""
Uses ENCRYPTION_DICT to decrypt each of the alphabetic characters of
ciphertext.
>>> decrypt("KUD, KXZ'S BV CXBFC?")
"HEY, HOW'S IT GOING?"
>>> decrypt('B WXLU ES 106T!')
'I LOVE CS 106A!'
>>> decrypt('AFBEXPFS TPU VKU NXSV HUTAVBIAW TFBNTWS BF UYBSVUFEU')
'UNICORNS ARE THE MOST BEAUTIFUL ANIMALS IN EXISTENCE'
"""
Hint
Note that in order to successfully decrypt a message, you need the
'reverse' of ENCRYPTION_DICT
: rather than associating plaintext
characters with their encrypted counterparts, we need to go the other
way.
We suggest writing a function reverse_encryption_dict
to decompose out
this problem. In class, we saw an example of this wherein a reversed
dictionary associated keys with lists of values, because multiple keys
can share the same value. Note that this doesn't apply to the case of
ENCRYPTION_DICT
, because each character is guaranteed to have a unique
character. How does this affect how you implement
reverse_encryption_dict
?
This is an interesting encryption scheme (known formally as a substitution cipher), but unfortunately isn't very secure. What issues do you see with it?
Recipes
Inspired by an unhealthy amount of Netflix and quarantine cooking, Parth and Peter are opening up a bakery. To try and keep costs low, they want to automate as much of their inventory tracking as possible.
Thanks to 106A, they think they can use dictionaries to help.
All of the ingredients they have available will be stored in a pantry
dictionary, whose keys are ingredients and values are weights (Of course
they're using the metric system), like so:
pantry = {
'flour': 400,
'sugar': 300,
'salt': 10,
'chocolate': 150
}
Each recipe
is also stored in a dictionary, like the following
uninspiring concoction:
recipe = {
'flour': 200,
'salt': 2.5
}
(As you can tell from the recipe above, it doesn't look like their bakery is going to do very well, but that's not important right now.)
Recipes and the pantry are first stored as files, where each key value
pair is on a different line, separated by ':: '
. The pantry list and
recipe above would look like:
pantry.txt
flour:: 400
sugar:: 300
salt:: 10
chocolate:: 150
recipe.txt
flour:: 200
salt:: 2.5
Our goal is to write a few functions to help Parth and Peter run their bakery. Begin by implementing the following function:
def read_dict_from_file(filename):
"""
Takes in the name of a file containing a recipe or
pantry list and reads it into a dictionary.
An example doctest using the file above:
>>> read_dict_from_file('recipe.txt')
{'flour': 200, 'salt': 2.5}
"""
Once you've developed the infrastructure to construct recipe and pantry list dictionaries, implement the following functions:
def can_make(recipe, pantry):
"""
Given the contents of the pantry, returns a boolean indicating
whether or not it is possible to follow the recipe. Note that
the parameters to this function are dictionaries, and not
filenames. The pantry should not be modified in this function
"""
pass
def make_recipe(recipe, pantry):
"""
Given a recipe and a pantry with enough ingredients to make the recipe,
modify the contents of the pantry to remove as many quantities as the
recipe requires. You may modify the pantry in place, but return the modified
pantry in order to test the output using doctests.
# using the recipe and pantry defined above
>>> make_recipe(recipe, pantry)
{'flour': 200, 'sugar': 300, 'salt': 7.5, 'chocolate': 150}
"""
pass
Note that make_recipe
assumes the pantry is sufficient for the recipe
but this is not necessarily always the case; thus, every call to
make_recipe
will need to be guarded by a call to can_make_recipe
.
Whilst implementing can_make
function, you might find the
dictionary's .get
function helpful, which accepts a parameter
suggesting what to return if the value is not in the dictionary. For
example, calling recipe.get('yeast', 42)
will return 42
if 'yeast'
is not a key in the recipe
dictionary.
Finally, implement a main
function, which first reads in a pantry file
from a user and then continuously asks for recipe filenames from the
user and either prints an error message if the pantry does not have
sufficient ingredients, or removes the ingredients from the pantry if it
does exist, printing the pantry afterwards.
A sample run of the program is below, which repeatedly tries to make the recipe above in the vain hope that someone will actually want to eat it. Note that your program should be able to accept any valid recipe filename, though. User input is bolded and italicized.
$ python3 pantry_manager.py
Enter pantry filename: pantry.txt
What recipe should we bake next (Press enter to quit.)? recipe.txt
You can make that recipe! Your pantry now looks like this:
{'flour': 200, 'sugar': 300, 'salt': 7.5, 'chocolate': 150}
What recipe should we bake next (Press enter to quit.)? recipe.txt
You can make that recipe! Your pantry now looks like this:
{'flour': 0, 'sugar': 300, 'salt': 5, 'chocolate': 150}
What recipe should we bake next (Press enter to quit.)? recipe.txt
You can't make that recipe.
What recipe should we bake next (Press enter to quit.)? User presses enter immediately
Anagrams
Two words are anagrams if they consist of the same letters in a different order. Your job in this problem is to write a program that allows users to see what words are anagrams of words that they type. Here's some sample output (user input is bolded and italicized):
$ python3 anagrams.py
Word: listen
['enlist', 'inlets', 'listen', 'silent', 'slinte', 'tinsel']
Word: race
['acer', 'acre', 'care', 'cera', 'crea', 'race']
Word: python
['phyton', 'python', 'typhon']
Word: programming
['programming']
Word: piech
piech is not in the dictionary
Word: microphone
['microphone', 'neomorphic']
Word: User presses enter immediately
The design of this program is largely up to you, although we have a few suggestions for you:
-
We've provided a constant
LEXICON
whose value is the name of a file containing all of the words in the English language. It may be helpful to find some way of associating words with their anagrams. -
A useful observation is that two anagrams are identical when their characters are sorted. For example,
'LISTEN'
and'SILENT'
, when sorted are both'eilnst'
. Python has an inbuiltsorted
function, which returns a list of the characters in a string in alphabetical order. This list can then be converted back to a string like so:
>>> string = 'banter'
>>> sorted_characters = sorted('banter')
>>> sorted_characters
['a', 'b', 'e', 'n', 'r', 't']
>>> sorted_string = ''.join(sorted_characters)
'abenrt'
Big Tweet Data
In this program, you'll write a program that reads through a large collection of tweets and store the data to keep track of how hashtags occur in tweets. This is a great example of how Python can be used in data analysis tasks.
Our Dataset
For the purposes of this problem, each tweet is represented as a single
line of text in a file. Each line consists of the poster's username
(prefixed by a '@' symbol), followed by a colon and then the text of
the tweet. Each character in this file can be a character from any
language, or an emoji, although you don't need to do anything special
to deal with these characters. One such file in the PyCharm project we
provide is small-tweets.txt
, which is reproduced here:
@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters:
https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate:
https://t.co/DXxU8c7zOe https://t.co/diT4MJWQβ¦
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: ππππ₯π₯π₯ https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly
https://t.co/WX9ad5efbN
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education #vt #realestate
https://t.co/iqxXtruqrt
We provide 3 such files for you in the PyCharm Project:
small-tweets.txt
, big-tweets.txt
and huge-tweets.txt
.
Building a user_tags
Dictionary
Central to this program is a user_tags
dictionary, in which each key
is a Twitter user's name like '@BarackObama'
. The value for each key
in this dictionary is a second, nested dictionary which counts how
frequently that particular user has used particular hashtags. For
example, a very simple user_tags
dictionary might be:
{'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}
We'll explore this dictionary in some more detail as we go through this problem, but as a matter of nomenclature, we'll call the inner dictionary the 'counts' dictionary. Our high-level strategy is to change the above dict for each tweet we read, so it accumulates all the counts as we go through the tweets.
1. Warmup questions
Given the dictionary above, what updates we would make to it in each of the following cases?
- We encounter a new tweet that reads
'@BarackObama: #Obamacare signups now!'
. - We encounter a new tweet that reads
'@kanyewest: ππππ₯π₯π₯ https://t.co/KmvxIwKkU6'
.
2. Implementing add_tweet
The add_tweet
function is the core of this whole program, and is
responsible for performing the update to a user_tags
dictionary
described above. The tests shown below represent a sequence of tweets,
expressed as a series of Doctests. For each call, you can see the
dictionary that is passed in, and the dictionary that is returned on the
next line. The first test passes in the empty dictionary ({}
) and gets
back a dictionary with 1 user and 2 tags. The 2nd test then takes that
returned dictionary as its input, and so on. Each call adds more data to
the user_tags
dictionary.
We've provided you with two functions entitled parse_tags
and
parse_user
, both of which take as a parameter the tweet in question
and return a list of tags in the tweet and the username that posted the
tweet, respectively.
def add_tweet(user_tags, tweet):
"""
Given a user_tags dict and a tweet, parse out the user and tags,
and add those counts to the user_tags dict which is returned.
If no user exists in the tweet, return the user_tags dict unchanged.
Note: call the parse_tags(tweet) and parse_user(tweet) functions to pull
the parts out of the tweet.
>>> add_tweet({}, '@alice: #apple #banana')
{'@alice': {'#apple': 1, '#banana': 1}}
>>> add_tweet({'@alice': {'#apple': 1, '#banana': 1}}, '@alice: #banana')
{'@alice': {'#apple': 1, '#banana': 2}}
>>> add_tweet({'@alice': {'#apple': 1, '#banana': 2}}, '@bob: #apple')
{'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}
"""
3. Implementing parse_tweets
Use add_tweet
in a loop to build up and return a user_tags
dict.
This should look mostly like other file-reading functions you've
written, and your job is to make sure you understand how to follow the
pattern of creating and updating a dictionary suggested by the
add_tweet
function. Restated, the responsibility of add_tweet
is to
update a dictionary, and parse_tweets
must create and maintain that
dictionary as it is updated.
Running your program
We provide a main
function that calls the parse_tweets
function you
implemented in a variety of ways. To use it, run the program from the
terminal. Run with just 1 argument (a data filename), it reads in all
the data from that file and prints out a summary of each user and all
their tweets and counts:
$ python3 tweets.py small-tweets.txt
@BarackObama
#BigBlockOfCheeseDay -> 1
#SOTU -> 3
@GonzalezSarahA
#education -> 1
#vt -> 1
#realestate -> 1
When run with the '-users'
argument, main
prints out all the
usernames:
$ python3 tweets.py -users small-tweets.txt
users
@BarackObama
@kanyewest
@dog_rates
@GonzalezSarahA
When run with the '-user'
argument followed by a username, the program
prints out the data for just that user.
$ python3 tweets.py -user @BarackObama small-tweets.txt
user: @BarackObama
#BigBlockOfCheeseDay -> 1
#SOTU -> 3
Extension
You probably won't get to this extension in section, but if you have time, implement this additional function which you can then leverage to answer some interesting questions about Hashtag use.
Implementing flat_counts
It's natural to be curious about how often tags are used across users.
This function takes in a user_tags
dictionary and computes a new
"flat" count dictionary:
def flat_counts(user_tags):
"""
Given a user_tags dicts, sum up the tag counts across all users,
return a "flat" counts dict with a key for each tag,
and its value is the sum of that tag's count across users.
>>> flat_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}})
{'#apple': 2, '#banana': 2}
"""
main
will call that function with the -flat
argument, like so:
$ python3 tweets.py -flat small-tweets.txt
flat
#BigBlockOfCheeseDay -> 1
#MAGA -> 2
#SOTU -> 3
.
.
.