Slide 1
Today: code for dict-count algorithm, when does python make a copy? More sophisticated nested-dict example, other ways to read a file
Slide 2
Dict Count Code Examples
> dict2 Demos and Exercises
Slide 3
str-count2() - Unified/Invariant Version, no else
- For each item
- 1. If item not seen before - set to 0
if s not in counts: counts[s] = 0
- 2. Then the line below works for 100% of cases
counts[s] += 1
- This is unified/invariant version
We'll tend to do it this way
+= line works for all cases
Alternately can use "else"
Slide 4
Standard Dict-Count Code - Unified Version
def str_count2(strs):
counts = {}
for s in strs:
if s not in counts: # fix counts/s if not seen before
counts[s] = 0
# Unified: now s is in counts one way or
# another, so can do next step unconditionally
counts[s] += 1
return counts
Slide 5
Char Count - Exercise
Apply the dict-count algorithm to chars in a string. Build a counts dict
of how many times each char appears in a string so 'Coffee'
returns
{'c': 1, 'o': 1, 'f': 2, 'e': 2}
.
Recall loop form: for ch in s:
Slide 6
Char Count - Solution
def char_count(s):
counts = {}
for ch in s:
# Do all computation in lowercase
low = ch.lower()
if low not in counts:
counts[low] = 0
counts[low] += 1
return counts
Slide 7
Python and Copying
For more detail see guide: Python Not Copying
When Python uses an assignment = with a data structure like a list or a dict, Python does not make a copy of the structure. Instead, there is just the one list or dict, and multiple references pointing to it.
1. One List and One Dict
Here is code that creates one list and one dict, each with a variable pointing to it.
>>> lst = [1, 2, 3]
>>> d = {}
>>> d['a'] = 1
Memory looks like:
2. Store Reference To List in Dict
>>> d['b'] = lst
What does this do? Key: the =
does not make a copy of the list.
Instead, it stores an additional reference to the one list inside the
dict.
Memory looks like:
3. lst.append() - What Happens?
There is just one list, and there are two references to it. This is fine. What does the following code do?
>>> lst.append(4)
What does memory look like now? First, what does the list look like? Who is pointing to it?
Memory looks like:
What do these lines of code print now?
>>> lst
???
>>> d['b']
???
Answer
Both lst
and d['b']
refer to the same list, which is now
[1, 2, 3, 4]
Slide 8
Summary - References Proliferate
Python does not copy a list or dict when used with, say, =. Instead, Python just spreads around more references to the one list. This is a normal way for Python programs to work - a few important lists or dicts, and references to those structures spread around in the code. This does not require any action on your part, just realize that that there are no copies.
Slide 9
Dict-Count Chapter 1 Summary - Init/Incr
- Our first use of dict was counting
- Super important dict code pattern
- Stereotypical not-in/in logic per data item
- Unified strategy:
- 1. "not in" - fix dict so key is present
Call this "init", e.g. value = 0 - 2. Then can assume key is in there, update value
Call this "increment", e.g. value += 1
Suppose "x" holds the key we're counting...
if x not in counts: # Fix so x is in there
counts[x] = 0 # -Init
counts[x] += 1 # -Increment
Slide 10
Dict - Chapter 2 - Nested
- More sophisticated dict algorithms
- Old: value is int counter 0, 1, 2, 3
- Now: value is "nested" structure
a list, dict stored as a value inside a dict (like d/lst above) - email_hosts example below - nice, realistic example of this
Slide 11
Email Hosts Data
- Have email strings
'abby@foo.com'
'bob@bar.com'
has one @
'user'
is left of@
->'abby'
'host'
is right of@
->'foo.com'
\
Slide 12
Email Hosts Challenge
High level: we have a big list of email addresses. We want to organize the data by host. For each host, build up a list of all the users for that host.
Given a list of email address strings. Each email address has one '@' in it, e.g. 'abby\@foo.com', where 'abby' is the user, and 'foo.com' is the host.
Create a nested dict with a key for each host, and the value for that key is a list of all the users for that host, in the order they appear in the original list (repetition allowed).
emails:
['abby@foo.com', 'bob@bar.com', 'abe@foo.com']
returns hosts dict:
{
'foo.com': ['abby', 'abe'],
'bar.com': ['bob']
}
Slide 13
Nested - Key/Value Types
When working a nested dict problem, it's good to keep in mind the type
of each key and its value. This info guides code that reads or writes in
the dict - when do you do +=
and when do you do .append()
. What we
have for this problem - will refer to this when writing a key line of
code:
1. Each key is a host string, e.g. 'foo.com'
2. The value for each key is a list of users for that host, e.g.
['abby', 'abe']
Slide 14
Email Hosts Example
- nested dict problem
Here is the code to start with. The "not in" structure still applies.
def email_hosts(emails):
hosts = {}
for email in emails:
at = email.find('@')
user = email[:at]
host = email[at + 1:]
# your code here
pass
return hosts
1. Think about the "increment" line first. What is the append line? Look above at the definition for each key and value.
2. Need to init for the not-in case. For counting the init was: 0. Now the init is: [].
Slide 15
Issue: hosts[host]
This line is very hard to read, like what on earth is it?
Recall that the hosts dict looks like:
{
'foo.com': ['abby', 'abe'],
'bar.com': ['bob']
}
In hosts dict, each key is a host, and each value is a list of user names.
Slide 16
Style Technique: Decomp By Var
Instead of using hosts[host]
as is, put its value into a well named
var, spelling out what sort of data it holds. This is a big help and is
how our solution is written. Note how the names in this line of code
confirm that the logic is correct: users.append(user)
This depends on
the "shallow" feature of Python data (above), e.g. hosts[host]
returns a reference to the embedded list to us.
No:
hosts[host].append(user)
Yes:
users = hosts[host]
users.append(user)
Slide 17
Email Hosts Solution Code
def email_hosts(emails):
hosts = {}
for email in emails:
at = email.find('@')
user = email[:at]
host = email[at + 1:]
# key algorithm: init/increment
if host not in hosts:
hosts[host] = []
users = hosts[host] # decomp by var
users.append(user)
return hosts
Slide 18
Drawing of the Email Hosts Sequence
Slide 19
Exercise: food_ratings() (optional)
- nested dict problem
food_ratings(ratings): Given a list of food survey rating strings like this 'donut:10'. Create and return a "foods" dict that has a key for each food, and its value is a list of all its rating numbers in int form. Use split() to divide each food rating into its parts. There's a lot of Python packed into this question!
Build dict with structure:
Key = one food string
Value = list of rating ints
Nice Parsing Technique split(':')
Nice technique: rating = 'donut:10'
Use split(): rating.split(':') -> ['donut', '10']
Slide 20
File Reading 2.0
See guide for more details: File Reading and Writing
- We've done this way:
for line in f:
Advantage: low memory use even with huge file - Look at a few different reading forms
- "with" form - handles closing automatically
- 'r' for reading (the default)
- 'w' for file-writing
Standard "with" to open a text file for reading:
with open(filename) as f:
# use f in here
The form below is equivalent to above since 'r' is the default, meaning read the file. 'w' means write the file from RAM to the file system. See the guide above for sample writing code.
with open(filename, 'r') as f:
# use f in here
Can specify encoding (default depends on your machine / locale). Encoding 'utf-8' is what many files use. Try this if you get a UnciodeDecodeError
with open(filename, encoding='utf-8') as f:
# use f
Older way to open() a file (use in interpreter)
f = open(filename)
# use f
# f.close() when done
# "with" does the .close() automatically
Slide 21
File Loop
Most common, process 1 line at a time. Uses the least memory.
for line in f:
# process each line
Slide 22
Alternative: r.readlines()
f.readlines() - return list of line strings, can do slices etc. to access lines in a custom order. Each line has the '\n' at its end. Use str.strip() to strip off whitespace from the ends of a line.
>>> f = open('poem.txt') # alternative to with, use in interpreter
>>> lines = f.readlines()
>>> lines
['Roses are red\n', 'Violets are blue\n', '"RED" BLUE.\n']
>>>
>>> line = lines[0] # first line
>>> line
'Roses are red\n'
>>> line.strip() # strip() - remove whitespace from ends
'Roses are red'
>>>
>>> lines[1:] # slice to grab subset of lines
['Violets are blue\n', '"RED" BLUE.\n']
Slide 23
Alternative: f.read()
read() - whole file into one string. Handy if you can process the whole thing at once, not needing to go line by line. Reading from a file "consumes" the data. Doing a second read returns the empty string.
>>> f = open('poem.txt')
>>> s = f.read() # whole file in string
>>> s
'Roses are red\nViolets are blue\n"RED" BLUE.\n'
>>>
>>> >>> f.read() # reading again gets nothing
''
Slide 24
Data Sets
-
If you go to https://www.kaggle.com, you can find lots and lots of data sets. Your assignment for next week is all about data, so I thought we would look at some of the data sets available. Part of the "parsing" aspect of this lecture is all about extracting data from data sets – it isn't always trivial because data can be "dirty" (messy), or can be in a form that just isn't easy to extract.
-
There is a python module,
csv
(import csv
) that allows you to more easily parse "comma separated data" (indeed, many different types of data), and you should check it out if you want to look at public data sets.