Slide 1

Today: code for dict-count algorithm, when does python make a copy? More sophisticated nested-dict example, other ways to read a file

Slide 2

Dict Count Code Examples

> dict2 Demos and Exercises

Slide 3

str-count2() - Unified/Invariant Version, no else

> str_count2()

For each item
1. If item not seen before - set to 0
if s not in counts: counts[s] = 0
2. Then the line below works for 100% of cases
counts[s] += 1
This is unified/invariant version
We'll tend to do it this way
+= line works for all cases
Alternately can use "else"

Slide 4

Standard Dict-Count Code - Unified Version

def str_count2(strs):
    counts = {}
    for s in strs:
        if s not in counts:  # fix counts/s if not seen before
            counts[s] = 0
        # Unified: now s is in counts one way or
        # another, so can do next step unconditionally
        counts[s] += 1
    return counts

Slide 5

Char Count - Exercise

> char_count()

Apply the dict-count algorithm to chars in a string. Build a counts dict of how many times each char appears in a string so 'Coffee' returns {'c': 1, 'o': 1, 'f': 2, 'e': 2}.

Recall loop form: for ch in s:

Slide 6

Char Count - Solution

def char_count(s):
    counts = {}
    for ch in s:
        # Do all computation in lowercase
        low = ch.lower()
        if low not in counts:
            counts[low] = 0
        counts[low] += 1
    return counts

Slide 7

Python and Copying

For more detail see guide: Python Not Copying

When Python uses an assignment = with a data structure like a list or a dict, Python does not make a copy of the structure. Instead, there is just the one list or dict, and multiple references pointing to it.

1. One List and One Dict

Here is code that creates one list and one dict, each with a variable pointing to it.

>>> lst = [1, 2, 3]
>>> d = {}
>>> d['a'] = 1

Memory looks like:
one lst points to list, d points to dict

2. Store Reference To List in Dict

>>> d['b'] = lst

What does this do? Key: the = does not make a copy of the list. Instead, it stores an additional reference to the one list inside the dict.

Memory looks like:
reference to list store inside of dict

3. lst.append() - What Happens?

There is just one list, and there are two references to it. This is fine. What does the following code do?

>>> lst.append(4)

What does memory look like now? First, what does the list look like? Who is pointing to it?

Memory looks like:
list is modified

What do these lines of code print now?

>>> lst
???
>>> d['b']
???

Answer

Both lst and d['b'] refer to the same list, which is now [1, 2, 3, 4]

Slide 8

Summary - References Proliferate

Python does not copy a list or dict when used with, say, =. Instead, Python just spreads around more references to the one list. This is a normal way for Python programs to work - a few important lists or dicts, and references to those structures spread around in the code. This does not require any action on your part, just realize that that there are no copies.

Slide 9

Dict-Count Chapter 1 Summary - Init/Incr

Our first use of dict was counting
Super important dict code pattern
Stereotypical not-in/in logic per data item
Unified strategy:
1. "not in" - fix dict so key is present
Call this "init", e.g. value = 0
2. Then can assume key is in there, update value
Call this "increment", e.g. value += 1

Suppose "x" holds the key we're counting...

    if x not in counts:   # Fix so x is in there
       counts[x] = 0      # -Init
    counts[x] += 1        # -Increment

Slide 10

Dict - Chapter 2 - Nested

More sophisticated dict algorithms
Old: value is int counter 0, 1, 2, 3
Now: value is "nested" structure
a list, dict stored as a value inside a dict (like d/lst above)
email_hosts example below - nice, realistic example of this

Slide 11

Email Hosts Data

Have email strings
'abby@foo.com'
'bob@bar.com'
has one @
'user' is left of @ -> 'abby'
'host' is right of @ -> 'foo.com'\

Slide 12

Email Hosts Challenge

High level: we have a big list of email addresses. We want to organize the data by host. For each host, build up a list of all the users for that host.

Given a list of email address strings. Each email address has one '@' in it, e.g. 'abby\@foo.com', where 'abby' is the user, and 'foo.com' is the host.

Create a nested dict with a key for each host, and the value for that key is a list of all the users for that host, in the order they appear in the original list (repetition allowed).

emails:
  ['abby@foo.com', 'bob@bar.com', 'abe@foo.com']

returns hosts dict:
  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

Slide 13

Nested - Key/Value Types

When working a nested dict problem, it's good to keep in mind the type of each key and its value. This info guides code that reads or writes in the dict - when do you do += and when do you do .append(). What we have for this problem - will refer to this when writing a key line of code:

1. Each key is a host string, e.g. 'foo.com'

2. The value for each key is a list of users for that host, e.g. ['abby', 'abe']

Slide 14

Email Hosts Example

> email_hosts()

nested dict problem

Here is the code to start with. The "not in" structure still applies.

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]
        # your code here
        pass
    return hosts

1. Think about the "increment" line first. What is the append line? Look above at the definition for each key and value.

2. Need to init for the not-in case. For counting the init was: 0. Now the init is: [].

Slide 15

Issue: `hosts[host]`

This line is very hard to read, like what on earth is it?

Recall that the hosts dict looks like:

  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

In hosts dict, each key is a host, and each value is a list of user names.

Slide 16

Style Technique: Decomp By Var

Instead of using hosts[host] as is, put its value into a well named var, spelling out what sort of data it holds. This is a big help and is how our solution is written. Note how the names in this line of code confirm that the logic is correct: users.append(user) This depends on the "shallow" feature of Python data (above), e.g. hosts[host] returns a reference to the embedded list to us.

No:

    hosts[host].append(user)

Yes:

    users = hosts[host]
    users.append(user)

Slide 17

Email Hosts Solution Code

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]

        # key algorithm: init/increment
        if host not in hosts:
            hosts[host] = []
        users = hosts[host]  # decomp by var
        users.append(user)
    return hosts

Slide 18

Drawing of the Email Hosts Sequence

alt:what hosts memory looks like, adding one
user

Slide 19

Exercise: food_ratings() (optional)

> food_ratings()

nested dict problem

food_ratings(ratings): Given a list of food survey rating strings like this 'donut:10'. Create and return a "foods" dict that has a key for each food, and its value is a list of all its rating numbers in int form. Use split() to divide each food rating into its parts. There's a lot of Python packed into this question!

Build dict with structure:

Key = one food string

Value = list of rating ints

Nice Parsing Technique `split(':')`

Nice technique: rating = 'donut:10'

Use split(): rating.split(':') -> ['donut', '10']

Slide 20

File Reading 2.0

See guide for more details: File Reading and Writing

We've done this way: for line in f:
Advantage: low memory use even with huge file
Look at a few different reading forms
"with" form - handles closing automatically
'r' for reading (the default)
'w' for file-writing

Standard "with" to open a text file for reading:

with open(filename) as f:
    # use f in here

The form below is equivalent to above since 'r' is the default, meaning read the file. 'w' means write the file from RAM to the file system. See the guide above for sample writing code.

with open(filename, 'r') as f:
    # use f in here

Can specify encoding (default depends on your machine / locale). Encoding 'utf-8' is what many files use. Try this if you get a UnciodeDecodeError

with open(filename, encoding='utf-8') as f:
    # use f

Older way to open() a file (use in interpreter)

f = open(filename)
# use f
# f.close() when done
# "with" does the .close() automatically

Slide 21

File Loop

Most common, process 1 line at a time. Uses the least memory.

for line in f:
    # process each line

Slide 22

Alternative: r.readlines()

f.readlines() - return list of line strings, can do slices etc. to access lines in a custom order. Each line has the '\n' at its end. Use str.strip() to strip off whitespace from the ends of a line.

>>> f = open('poem.txt')  # alternative to with, use in interpreter
>>> lines = f.readlines()
>>> lines
['Roses are red\n', 'Violets are blue\n', '"RED" BLUE.\n']
>>> 
>>> line = lines[0]    # first line
>>> line
'Roses are red\n'
>>> line.strip()       # strip() - remove whitespace from ends
'Roses are red'
>>>
>>> lines[1:]   # slice to grab subset of lines
['Violets are blue\n', '"RED" BLUE.\n']

Slide 23

Alternative: f.read()

read() - whole file into one string. Handy if you can process the whole thing at once, not needing to go line by line. Reading from a file "consumes" the data. Doing a second read returns the empty string.

>>> f = open('poem.txt')
>>> s = f.read()          # whole file in string
>>> s
'Roses are red\nViolets are blue\n"RED" BLUE.\n'
>>> 
>>> >>> f.read()          # reading again gets nothing
''

Slide 24

Data Sets

If you go to https://www.kaggle.com, you can find lots and lots of data sets. Your assignment for next week is all about data, so I thought we would look at some of the data sets available. Part of the "parsing" aspect of this lecture is all about extracting data from data sets – it isn't always trivial because data can be "dirty" (messy), or can be in a form that just isn't easy to extract.
There is a python module, csv (import csv) that allows you to more easily parse "comma separated data" (indeed, many different types of data), and you should check it out if you want to look at public data sets.