L15

Today: parsing, complex while loops, parse words out of string patterns

Preface - The CS106A Story Arc

Have some problem IRL
Think about what we want - a drawing or scheme about a data structure
Think about steps on the drawing
Translate into code
Debug the code:
Look at the error message
Look at the drawing
Add fixes to the code until it works
Profit!
Today's examples show this whole story arc

Data and Parsing

Here's some fun looking data...

$GPGGA,005328.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*70
$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38
$GPRMC,005328.000,A,3726.1389,N,12210.2515,W,0.00,256.18,221217,,,D*78
$GPGGA,005329.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*71
$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38
$GPRMC,005329.000,A,3726.1389,N,12210.2515,W,0.00,256.18,221217,,,D*79
$GPGGA,005330.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,3.0,0000*78
$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38
...

Or a less dry example, try to extract the hashtag from the text.

So I'm like, no way! #yolo, and they're like yazzo!

The first example is what a GPS chip outputs
buried deep in your phone, this is going on
It's a standard: NMEA 0183
Notice: it's text - lines and characters
"Parsing"
Have raw text like this example
Find and pull out the data you want
On the surface, we are doing parsing examples today
Really you want experience with loops and complex logic
Parsing is just a problem space for us to practice loops and logic
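As a small taste of "find and pull out the data you want", here is a minimal sketch that splits the first GPS line above on commas. The function name gps_fields is made up for illustration, and reading fields 2 and 3 as latitude and hemisphere is an assumption about the layout of a $GPGGA line, not something these notes spell out.

```python
def gps_fields(line):
    """Split one comma-separated GPS line into a list of text fields."""
    return line.split(',')


line = '$GPGGA,005328.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*70'
fields = gps_fields(line)
print(fields[0])   # the line type, '$GPGGA'
print(fields[2])   # assumed to be the latitude text, '3726.1389'
print(fields[3])   # assumed to be the hemisphere, 'N'
```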

1. `for/i/range`

The for/i/range form is great for going through numbers which you know ahead of time - a common pattern in real programs. If you need to go through 0..n-1 - use for/i/range, that's exactly what it's for.

for i in range(n):
    # i is 0, 1, 2, .. n-1

2. `while` - More Flexible

But we also have the while loop. The "for" is suited for the case where you know the numbers ahead of time. The while is more flexible. The while tests on each iteration, stopping at the right spot. Ultimately you need both forms, but here we will switch to using while.

`while` Equivalent of `for/i/range`

It's possible to write the equivalent of for/i/range as a while loop instead. This is not the easiest way to go through 0..n-1, but it shows a useful while structure.

Aside: down in the CPU chip, there is a facility that resembles the while. Through layers of abstraction, your for/i/range in Python is constructed using the while down in the chip. You don't need to worry about this as a Python programmer. The for/i/range loop is an abstraction that Python reliably constructs for you, using what the chip has.

for i in range(n) - go-to solution for that sequence
Can write this as a while .. do steps manually
Three parts: init, test, increment
Pattern init/test/increment useful in other loops
Use range() for common 0..n-1 case
Use while where need fine control of i (examples to follow)
Beware: easy to forget update step, result is infinite loop
for/range is so common .. we don't have muscle-memory for the update line

Here is the while-equivalent to for i in range(n)

i = 0         # 1. init
while i < n:  # 2. test
    # ...
    # use i in loop
    # ...
    i += 1    # 3. increment loop-bottom
              # (easy to forget this line)
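To check that the two forms really visit the same numbers, here is a small sketch that collects the values of i from each loop into a list. The function names count_with_for and count_with_while are hypothetical, chosen just for this comparison.

```python
def count_with_for(n):
    """Collect 0..n-1 using for/i/range."""
    nums = []
    for i in range(n):
        nums.append(i)
    return nums


def count_with_while(n):
    """Collect 0..n-1 using the init/test/increment while form."""
    nums = []
    i = 0            # 1. init
    while i < n:     # 2. test
        nums.append(i)
        i += 1       # 3. increment at loop bottom
    return nums


print(count_with_for(4))    # [0, 1, 2, 3]
print(count_with_while(4))  # [0, 1, 2, 3]
```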

(optional) Example while_double()

> while_double()

double_char() written as a while. The for-loop is the correct approach here, so we are just showing how a "for" can be written with "while".

1. Init i = 0

2. Test i < n

3. Increment i += 1 (loop bottom)

while_double() Solution

def while_double(s):
    result = ''
    i = 0
    while i < len(s):
        result += s[i] + s[i]
        i += 1
    return result

Foreshadow: Advance With `var += 1`

Framing for today's examples
Imagine a text string drawn on paper
A finger is pointing at 'a' to start
Move the finger to the right, until pointing at a space char
Python equivalent:
Start with end = 4
e.g. end is 4, pointing at the 'a'
end += 1 .. like moving one to the right
Continue end += 1 until get to a space

Start with end = 4. Advance to the space char with end += 1 in a loop.

Figure: advance end to space char
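The finger-moving idea can be sketched in a few lines. The function name advance_to_space and the sample string are made up for illustration, and the sketch assumes there is a space somewhere at or to the right of the starting index, so no bounds check is included.

```python
def advance_to_space(s, end):
    """Move end rightwards until s[end] is a space; return the new end.
    Assumes there is a space at or to the right of end."""
    while s[end] != ' ':
        end += 1
    return end


s = 'xyz abc def'
end = advance_to_space(s, 4)   # start on the 'a' at index 4
print(end)                     # 7, the index of the space after 'abc'
print(s[4:end])                # 'abc'
```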

Useful Question: When is `i` In Bounds?

Suppose the int i is indexing into a string, and I am changing it with i += 1 or i -= 1. What are the bounds for i remaining a valid index into the string?

In-bounds for increasing i:

i < length

In-bounds for decreasing i:

i >= 0

Python detail: surprisingly s[-1] does not give an error in Python, it accesses the last char, although this may not be what you want. For our algorithms, we treat i >= 0 as the boundary.
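A quick sketch of these bounds, using a sample string. The helper name in_bounds is hypothetical, written only to spell out the two tests together.

```python
s = 'Python'

print(s[0])              # 'P' - index 0 is the leftmost valid index
print(s[len(s) - 1])     # 'n' - index len(s) - 1 is the rightmost valid index
print(s[-1])             # 'n' - no error, a negative index counts from the end

try:
    s[len(s)]            # index 6 is one past the end
except IndexError:
    print('index 6 is out of bounds')


def in_bounds(s, i):
    """Return True if i is a valid index for our algorithms: 0 <= i < len(s)."""
    return 0 <= i < len(s)
```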

Example: at_word()

> at_word() (in parse1 section)

'xx @abcd xyz' -> 'abcd'
'x@ab^xyz' -> 'ab'

at_word(s): We'll say an at-word is an '@' followed by zero or more alphabetic chars. Find and return the alphabetic part of the first at-word in s, or the empty string if there is none. So 'xx @abc xyz' returns 'abc'.

Realistic parsing problem, extracting the wanted part of a string
Demonstrate several patterns on this one
We'll re-use these patterns
We'll work through this one carefully
Points about this code:
Use str.find() to locate each @
Use while to skip over alpha chars to find end
Use var < len(s) to protect use of s[var]
Var names: search, at, end - try to keep things straight

at_word() Strategy 1

First use s.find() to locate the '@'. Then start end pointing to the right of the '@'.

at_word() Start Picture

Code to set this up:

    at = s.find('@')
    if at == -1:
        return ''
    
    end = at + 1

at_word() Goal Picture

at_word() While Test

Use a while loop to advance end over the alphabetic chars. What is the test for this loop? Work it out on the drawing.

    while ???? 
        end += 1

Think about what while test is True while end is pointing at an alphabetic char. Draw T/F under each char for the test we want.

Figure: T/F drawn under chars for while test

AKA skip over the alpha chars
Loop test: this is true = advance end by one
Test: s[end].isalpha()
Reminisce - Bit "true = go" pattern for moving forward
This code will 90% work, with one case to fix later
Loop leaves end pointing to the first non-alpha char

This loop is 90% correct to advance end:

    # Advance end over alpha chars
    while s[end].isalpha():
        end += 1

at_word() Slice with end

Once we have at/end computed, pulling out the result word is just a slice.

    word = s[at + 1:end]
    return word

at_word() V1

> at_word()

Put those phrases together and it's an excellent first try, and it 90% works. Run it.

def at_word(s):
    at = s.find('@')
    if at == -1:
        return ''
    
    end = at + 1
    # Advance end over alpha chars
    while s[end].isalpha():
        end += 1

    word = s[at + 1:end]
    return word

at_word: 'woot' Bug

That code is pretty good, but there is actually a bug in the while-loop. It has to do with the particular form of input shown below, where the alphabetic chars go right up to the end of the string. Think about how the loop works when advancing "end" for the case below.

    at = s.find('@')
    end = at + 1
    while s[end].isalpha():
        end += 1



'xx@woot'
 01234567

Problem: keep advancing "end" .. past the end of the string, eventually end is 7. Then the while-test s[end].isalpha() throws an error since index 7 is past the end of the string.

The loop above translates to: "advance end so long as s[end] is alphabetic"

To fix the bug, we modify the test to: "advance end so long as end is valid and s[end] alphabetic".

In other words, stop advancing if end reaches the end of the string.
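To see the bug happen, here is the first-try code under the hypothetical name at_word_v1, run on one input that works and on the 'xx@woot' input. The try/except is only there so the sketch can print the error instead of halting.

```python
def at_word_v1(s):
    """First-try at_word with no guard on end; buggy by design."""
    at = s.find('@')
    if at == -1:
        return ''

    end = at + 1
    while s[end].isalpha():   # fails when end reaches len(s)
        end += 1

    return s[at + 1:end]


print(at_word_v1('xx @abcd xyz'))   # 'abcd' - a non-alpha char stops the loop

try:
    at_word_v1('xx@woot')
except IndexError as err:
    print('IndexError:', err)       # end ran past the last char
```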

Figure: loop end bug - end goes off the end of the string

Solution: `end < len(s)` Guard Test

This "guard" pattern will be a standard part of looping over something. We cannot access s[end] when end is too big. Add a "guard" test end < len(s) before the s[end]. This stops the loop when end gets to 7. The slice then works as before. This code is correct.

def at_word(s):
    at = s.find('@')
    if at == -1:
        return ''

    # Advance end over alpha chars
    end = at + 1
    while end < len(s) and s[end].isalpha():
        end += 1
    
    word = s[at + 1:end]
    return word

Guard / Short Circuit Pattern

The "and" evaluates left to right. As soon as it sees a False it stops, known as "short circuiting". In this way the < len(s) guard checks that end is a valid number, before s[end] tries to use it. This is a standard pattern: the index-is-valid guard is first, then "and", then s[end] that uses the index. The guard stops the loop from running off the end of the string. We'll see more examples of this guard pattern.

while i < len(s) and .... s[i] ...:
    i += 1
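The order of the two tests matters. Here is a small sketch, with a made-up string and index, showing that the guard only protects s[i] when it comes first.

```python
s = 'abc'
i = 3   # one past the last valid index of s

# Guard first: i < len(s) is False, so "and" stops and s[i] never runs.
print(i < len(s) and s[i].isalpha())   # False

# Guard second: s[i] runs first and fails before the guard is checked.
try:
    s[i].isalpha() and i < len(s)
except IndexError:
    print('IndexError - the guard came too late')
```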

End Bug Summary

Bug: run end off the end of s, testing non-existent s[end]
e.g. this happens if input is s = 'xx @woot'
Think through how the loop works for that case
Solution:
Add guard end < len(s)
This is the fixed loop:
while end < len(s) and s[end].isalpha():
Q: How to test if index i is valid in s?
A: i < len(s)
Only look at s[end] char after checking that end is valid
Boolean Short Circuit
Python evaluates expression left-right
As soon as boolean value determined, stops trying
A False in the midst of an and stops
So the < guards the s[end].isalpha()

`'woot'` Bug - Very Specific Input

We say that having a few reasonable test cases will find the great majority of bugs, and this is true. This is how modern software is built and tested. However, we see a slightly unsettling pattern with the 'woot' bug: the bug is triggered only by a very specific pattern in the input data. After the '@', there must be a run of alphabetic chars going right up to the end of the string. If we did not have a test case with that pattern, the tests would pass the code, even though the code has this bug in it.

Software is put out for use in the world, and then, out of millions of users, a user happens to make an input case which exposes a bug in the software. The user may file a bug report, or a crash-reporter system in the software may prompt the user to please submit it. The bug report makes its way back to the engineers, and they may well react like "oh wow, look at this, we never thought of this case." There are so many users out there doing so many things, they will come up with cases you never thought of. With the user-submitted bug in hand, a new test case can be added, a so-called "regression" test, that tests against this bug the software had one time, and should avoid having in the future.
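A regression test for the 'woot' bug might look like the sketch below. The function name test_at_word and the use of plain assert are assumptions for illustration; the course's own test setup may look different.

```python
def at_word(s):
    """Return the alphabetic part of the first at-word in s, or ''."""
    at = s.find('@')
    if at == -1:
        return ''

    end = at + 1
    while end < len(s) and s[end].isalpha():
        end += 1
    return s[at + 1:end]


def test_at_word():
    # Ordinary cases
    assert at_word('xx @abcd xyz') == 'abcd'
    assert at_word('x@ab^xyz') == 'ab'
    assert at_word('no at sign') == ''
    # Regression test: alphabetic chars run right up to the end of the string
    assert at_word('xx@woot') == 'woot'


test_at_word()
print('all at_word tests pass')
```

If the guard were ever removed by mistake, the last assert would fail with an IndexError, flagging the old bug right away.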

Note This Works ok: `s[at + 1:end]`

The slice s[at + 1:end] works fine, even though end is not a valid index, going 1 past the last char. How does this work?

Reason 1 - UBNI

Why does s[at + 1:end] work fine?
Up To But Not Including - UBNI
The char at the second slice index is not included in the slice
So the fact that it's one past the end is fine - it is not included
This is the best reason, so focus on this one

Reason 2 - Slice Tolerates Garbage

The other reason it works is a little sketchy
It turns out, slices never raise an error about bad out of bounds index numbers
They will work with any old garbage numbers
If a number is too big, it is interpreted as "the end of the string"
This does not mean you can stop caring about index numbers
It just means the slice is not checking for you
Our above solution is fine - the end index is managed accurately
Going exactly one past the chars we want in all cases

>>> s = 'Python'
>>> len(s)
6
>>> s[2:5]
'tho'
>>> s[2:6]
'thon'
>>> s[2:46789]
'thon'

(optional) at_word() - Zero Char Case - Works?

What about
     'xx @ xx'
at = 3 --^
end = 4 --^
s[at + 1:end] -> s[4:4]

Consider slice
s[at + 1:end]
Slice is s[4:4]
Which is the empty string '', which is correct
There are zero alphabetic chars after the @
So the code works perfectly for this edge case

Exercise: exclamation()

> exclamation()

exclamation(s): We'll say an exclamation is zero or more alphabetic chars ending with a '!'. Find and return the first exclamation in s, or the empty string if there is none. So 'xx hi! xx' returns 'hi!'. (Like at_word, but right-to-left).

Suggestions:

    'xx hi! xx' -> 'hi!'
exclaim---^
start ---^
start --^
start -^ (loop end)

1. Set variable exclaim to point to the exclamation mark. (in starter code)

2. Set a variable start to the left of the exclamation mark. Write a loop to move start towards the start of the string, over the alphabetic chars. Slice out the answer - off-by-one details to think about here. Run this version, it 90% works.

3. Then add a guard to prevent start from running past the beginning of the string. The loop goes right-to-left, and the leftmost valid index is 0, so that will figure in the guard test.

Starter code

def exclamation(s):
    exclaim = s.find('!')
    if exclaim == -1:
        return ''

    # Your code here
    start = ???

exclamation() Solution

def exclamation(s):
    exclaim = s.find('!')
    if exclaim == -1:
        return ''
        
    # Your code here
    # Move start left over alpha chars
    # guard: start >= 0
    start = exclaim - 1
    while start >= 0 and s[start].isalpha():
        start -= 1
    
    # start is now one to the left of the word (possibly -1)
    word = s[start + 1:exclaim + 1]
    return word

Boolean Expressions

See the guide for details: Boolean Expression

Boolean operators: and or not
Mixture of these, can add parentheses to force order of operation
"precedence" in CS parlance
Say we have three variables: an int age and two booleans
age - say age is good if less than 30
is_raining - True if raining
is_weekend - True if it's the weekend
Define: to be a good day, need two things:
1. it must not be raining
2. then either age is under 30 or it's the weekend

The code below looks reasonable, but doesn't quite work right

def good_day(age, is_weekend, is_raining):
    if not is_raining and age < 30 or is_weekend:
        print('good day')

Boolean Precedence:

"Precedence" is the order of operations
not = highest, (like - in -7)
and = next highest (like *)
or = lowest (like +)

What The Above Does

Because and has higher precedence than or, the code above acts like the following (the and evaluates before the or):

   if (not is_raining and age < 30) or is_weekend:

You can tell the above does not work right, because any time is_weekend is True, the whole thing is True, regardless of age or rain. This does not match the good-day definition above, which requires that it not be raining.

Boolean Precedence Solution

The solution we will spell out is not difficult.

Many programmers do not have boolean precedence memorized .. fine
Do remember that "not" is the highest precedence
Solution: note when you have a mixture of and + or
When there is a mixture, the precedence will matter
Put in parentheses to set the order you want
We will never complain about extra parentheses, so add them to spell out the order you want
In this case, put parens to group the or part, separating from not-raining
BTW similar logic applies to math - if there's a mixture of * and +, add parentheses

Solution

def good_day(age, is_weekend, is_raining):
    if not is_raining and (age < 30 or is_weekend):
        print('good day')
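To compare the two versions side by side, here is a sketch where each returns its boolean instead of printing. The names good_day_buggy and good_day_fixed are hypothetical, and the test case is a rainy weekend with age 40.

```python
def good_day_buggy(age, is_weekend, is_raining):
    # Parsed as: (not is_raining and age < 30) or is_weekend
    return not is_raining and age < 30 or is_weekend


def good_day_fixed(age, is_weekend, is_raining):
    # Parens force the "or" part to be grouped before the "and"
    return not is_raining and (age < 30 or is_weekend)


# Raining on a weekend: should not be a good day.
print(good_day_buggy(40, True, True))   # True  - wrong
print(good_day_fixed(40, True, True))   # False - matches the definition
```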

(optional) Exercise Boolean oh_no()

> oh_no()

Parse "or" Example - at_word99()

> at_word99()

'xx @ab12 xyz' -> 'ab12'

at_word99(): Like at-word, but with digits added. We'll say an at-word is an '@' followed by zero or more alphabetic or digit chars. Find and return the alpha-digit part of the first at-word in s, or the empty string if there is none. So 'xx @ab12 xyz' returns 'ab12'.

We've reached a very realistic level of complexity for solving real problems.

"end" Loop For at_word99()

Like before, but now a word is made of alpha or digit - many real problems will need this sort of code. This may be our most complicated line of code thus far in the quarter! Fortunately, it's a re-usable pattern for any of these "find end of xxx chars" problems.

The most difficult part is the "end" loop to locate where the word ends. What is the while test here? (Bring up at_word99() in other window to work it out). We want to use "or" to allow alpha or digit.

at = s.find('@')
end = at + 1
while ??????????:
    end += 1

at_word99() While Test

> at_word99()

 # 1. Still have the < guard
 # 2. Use "or" to allow isalpha() or isdigit()
 # 3. Need to add parens, since this has and+or
 #    combination
 while end < len(s) and (s[end].isalpha() or s[end].isdigit()):
     end += 1

at_word99() Solution

def at_word99(s):
    at = s.find('@')
    if at == -1:
        return ''

    # Advance end over alpha or digit chars
    # use "or" + parens
    end = at + 1
    while end < len(s) and (s[end].isalpha() or s[end].isdigit()):
        end += 1
    
    word = s[at + 1:end]
    return word

Style: Long Lines

Normally each Python line of code is unbroken. But if you add parentheses, Python allows the code to span multiple lines until the closing parenthesis. Indent the later lines an extra 4 spaces - in this way, they have a different indentation than the body of the while. There's also a preference to end each line with an operator like or .. to suggest that there's more on the later lines.

    while (end < len(s) and 
            (s[end].isalpha() or
            s[end].isdigit())):
        end += 1
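Another way to keep the while line short, offered here as an alternative sketch rather than the approach these notes take, is to move the "or" test into a small helper function. The helper name is_word_char is made up for illustration.

```python
def is_word_char(ch):
    """Return True if ch is an alphabetic or digit char."""
    return ch.isalpha() or ch.isdigit()


def at_word99(s):
    at = s.find('@')
    if at == -1:
        return ''

    end = at + 1
    # Guard first, then the helper decides if s[end] is part of the word
    while end < len(s) and is_word_char(s[end]):
        end += 1
    return s[at + 1:end]


print(at_word99('xx @ab12 xyz'))   # 'ab12'
```

With the helper there is no and+or mixture left on the while line, so no extra parens are needed there.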

More practice.

(optional): dotcom2()

> dotcom2()

 'xx www.foo.com xx' -> 'www.foo.com'

dotcom2(s): We are looking for the name of an internet host within a string. Find the '.com' in s. Find the series of alphabetic chars or periods before the '.com' with a while loop and return the whole hostname, so 'xx www.foo.com xx' returns 'www.foo.com'. Return the empty string if there is no '.com'. This version has the added complexity of the periods.

Ideas: find the '.com', loop right-to-left to find the chars before it. Loop over both alphabetic and '.'

dotcom2() Solution

def dotcom2(s):
    com = s.find('.com')
    if com == -1:
        return ''
    
    # "or" logic - move leftwards over
    # alphabetic or '.'
    start = com - 1
    while start >= 0 and (s[start].isalpha() or s[start] == '.'):
        start -= 1
    
    return s[start + 1:com + 4]
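A few sample calls help check the edge cases, in particular the one where the hostname starts at index 0 and the start >= 0 guard is what stops the loop. The inputs other than 'xx www.foo.com xx' are made up for this sketch.

```python
def dotcom2(s):
    com = s.find('.com')
    if com == -1:
        return ''

    # Move start leftwards over alphabetic or '.' chars
    start = com - 1
    while start >= 0 and (s[start].isalpha() or s[start] == '.'):
        start -= 1

    return s[start + 1:com + 4]


print(dotcom2('xx www.foo.com xx'))   # 'www.foo.com'
print(dotcom2('www.foo.com'))         # 'www.foo.com' - the guard stops start at -1
print(dotcom2('xx .com'))             # '.com' - zero chars before the '.com'
print(dotcom2('nothing here'))        # ''
```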
