Linguist 278: Programming for Linguists (Stanford Linguistics, Fall 2021)

Class 9: Introduction to regular expressions

import re

Examples

See the cheat sheet for even more!

Quantifiers

# Match a
re.search(r"a", "a")  ## Matches a

re.search(r"a", "b")  ## No match


# Match 0 or 1 a:
re.search(r"a?", "aa")  ## Matches a

re.search(r"a?", "")   ## Matches ''


# Match 0 or more a
re.search(r"a*", "aaa")  ## Matches aaa

re.search(r"a*", "")   ## Matches ''


# Match 1 or more a
re.search(r"a+", "a")  ## Matches a

re.search(r"a+", "")   ## No match


# Match 2 to 5 a in a row; either bound can be left off (a{2,} or a{,5})
re.search(r"a{2,5}", "aaa")  ## Matches aaa

re.search(r"a{2,5}", "a")  ## No match

re.search(r"a{2,5}", "aaaaaaa")  ## Matches the first 5 a
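
The "## Matches ..." comments above describe the text that a match covers; re.search itself returns a match object, or None when nothing matches. A minimal sketch for checking those comments, using a hypothetical helper called matched_text that is not part of the course materials:

```python
import re

def matched_text(pattern, string):
    """Return the text matched by re.search, or None if there is no match."""
    m = re.search(pattern, string)
    return m.group(0) if m else None

# Quantifiers are greedy: they take as many characters as they can.
print(repr(matched_text(r"a?", "aa")))           # 'a'
print(repr(matched_text(r"a*", "")))             # ''
print(repr(matched_text(r"a+", "")))             # None
print(repr(matched_text(r"a{2,5}", "aaaaaaa")))  # 'aaaaa'
print(repr(matched_text(r"a{2,}", "aaaaaaa")))   # 'aaaaaaa'
print(repr(matched_text(r"a{,2}", "aaaaaaa")))   # 'aa'
```

Note that an empty match ('') is still a match, which is different from None.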

Case-insensitive search

re.search(r"a+", "A", re.I)  ## Matches A


re.search(r"a+", "A")  ## No match

Sets of characters

# Match any of the characters inside []:
re.search(r"[abc]", "b")  ## Matches b


# Match 1 or more characters inside [] in a row, any order:
re.search(r"[abc]+", "bac")  ## Matches bac


# Compare with exact string match:
re.search(r"abc", "bac")  ## No match


# Match 1 or more characters that *aren't* in the set consisting of a, b, c:
re.search(r"[^abc]+", "bac")  ## No match

Grouping with parentheses

# Matches ab and then 1 or more c:
re.search(r"abc+", "abcccccc")  ## Matches abcccccc


# Matches 1 or more of the exact string abc:
re.search(r"(abc)+", "abcabcc")  ## Matches abcabc


# Matches abc or xyz
re.search(r"(abc|xyz)", "xyzabc")  ## Matches xyz

Character groups

# Period matches any single character (except a newline)
re.search(r".", "abc")  ## Matches a

# Matches 1 or more word characters (letters, digits, underscore) in a row:
re.search(r"\w+", "abc")  ## Matches abc

# This is roughly equivalent to the above, but \w is more permissive
# with Unicode:
re.search(r"[a-zA-Z0-9_]+", "abc")  ## Matches abc


# Matches 1 or more digits:
re.search(r"\d+", "a000b")  ## Matches 000


# Matches any kind of whitespace
re.search(r"\s", "a b")  ## Matches the space

Some special characters

^ matches the start of the string. (Inside [], it is negation!)
$ matches the end of the string.
To match a special character literally (quantifiers, brackets, braces, etc.), put a backslash in front of it.
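
The three points above can be tried out directly. This is a small sketch with made-up sample strings; re.escape is a standard-library function that adds the backslashes automatically:

```python
import re

# ^ anchors the match to the start of the string.
print(bool(re.search(r"^ab", "abc")))   # True
print(bool(re.search(r"^ab", "cab")))   # False

# $ anchors the match to the end of the string.
print(bool(re.search(r"ab$", "cab")))   # True
print(bool(re.search(r"ab$", "abc")))   # False

# Together they force the pattern to cover the whole string.
print(bool(re.search(r"^a+$", "aaa")))  # True
print(bool(re.search(r"^a+$", "aab")))  # False

# A backslash makes a special character literal.
print(bool(re.search(r"a\.b", "a.b")))  # True
print(bool(re.search(r"a\.b", "axb")))  # False
print(bool(re.search(r"a.b", "axb")))   # True: unescaped . matches any character

# re.escape adds the backslashes for you.
print(re.escape("a.b"))                 # a\.b
```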

Match objects are truthy in conditionals

Conditionals are a very common environment for regexes, since one is often testing strings against a pattern. re.search returns a match object (truthy) on success and None (falsy) on failure:

if re.search(r"L", "aLa"):
    print("Match!")
else:
    print("No match!")

Capturing groups and match objects

texts = ["Regex"]  ## Assumed sample data; texts is not defined in these notes
m = re.search(r"(R)", texts[0], re.I)

m.start() ## Index of the first character of the match

m.end()  ## Index just past the last character of the match

m.groups() ## Returns ('R',)
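
A slightly longer sketch of the same methods, using an assumed sample string (for illustration only) and two capturing groups so that the difference between group(0), group(1), and groups() is visible:

```python
import re

# Assumed sample string, for illustration only.
text = "linguistics rocks"

m = re.search(r"(r)(o)", text, re.I)

print(m.group(0))   # ro  (the whole match)
print(m.group(1))   # r   (first capturing group)
print(m.group(2))   # o   (second capturing group)
print(m.groups())   # ('r', 'o')
print(m.start())    # 12
print(m.end())      # 14

# start() and end() work as slice bounds on the original string.
print(text[m.start():m.end()])  # ro
```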

Compiling regular expressions

regex = re.compile(r"a+", re.I)

regex.search("bAab")  ## Matches Aa
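
Compiling is mostly useful when the same pattern is applied many times. A minimal sketch, with an assumed list of sample words, that reuses one compiled pattern in a loop:

```python
import re

regex = re.compile(r"a+", re.I)

# Assumed sample words, for illustration only.
words = ["bAab", "xyz", "Banana"]

# Reuse the same compiled pattern for every word.
for word in words:
    m = regex.search(word)
    print(word, m.group(0) if m else None)

# findall returns every non-overlapping match as a list of strings.
print(regex.findall("Banana"))  # ['a', 'a', 'a']
```

The flags (here re.I) are given once at compile time rather than on every search call.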

Regexercises

All of these should be run on 'unix-words-en.txt' using egrep:

unix_filename = 'unix-words-en.txt'
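
These notes do not show how egrep is defined; it may refer to the command-line tool or to a helper from the course. A minimal Python sketch of such a helper, under the assumption that the file has one word per line (the function name, its signature, and the sample data below are all hypothetical):

```python
import re

def egrep(pattern, lines, flags=0):
    """Return the lines (whitespace-stripped) in which the pattern matches somewhere."""
    regex = re.compile(pattern, flags)
    return [line.strip() for line in lines if regex.search(line.strip())]

# Assumed sample data standing in for the word list.
sample = ["Apple\n", "banana\n", "McDonald\n", "well-being\n"]
print(egrep(r"^[A-Z]", sample))  # ['Apple', 'McDonald']
print(egrep(r"-", sample))       # ['well-being']

# With the real file (assuming one word per line):
# with open(unix_filename) as f:
#     matches = egrep(r"^[A-Z]", f)
```

Passing flags=re.I would cover the exercises marked case-insensitive.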

Words that begin with a capital letter.

r"^[A-Z]"

Words that contain a medial capital letter.

r".[A-Z]." or r".+[A-Z].+" if you need to match the entire string.

Words that contain a hyphen.

r"-" or r".*-.*" if you need to match the entire string.

Words that begin with exactly two consonants, end with exactly two consonants, and have exactly three vowels ('a','e','i','o','u') in between. Case-insensitive.
Words that end either in 'ous' or in 'ary'. Case-insensitive. (How many of these are adjectives derived from nouns?)
Words that contain the vowels 'a', 'e', 'i', 'o', and 'u', in that order, perhaps with other characters in between. Case-insensitive.
Same as above, but make sure the other characters aren't vowels.
Words with an even number of a's in them (0, 2, 4, ...). Case-insensitive.
Words that begin and end with the same letter. (For this, you need to use capturing groups: re.compile(r"(.)x\1") is a regex for any string containing a character, 'x', and then the first character again.)
Words with the longest sequence of the same letter in a row in the data. Case-insensitive.
Words that have two or more dotted letters in a row. Case-insensitive.
Same as the above, but allow hyphens between the dotted letters, and don't count hyphens towards the length.
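
The backreference hint in the exercise on words that begin and end with the same letter can be tried out on its own. This sketch uses only the pattern from the hint, with made-up sample strings:

```python
import re

# \1 must match exactly the text that group 1 captured.
regex = re.compile(r"(.)x\1")

print(bool(regex.search("axa")))  # True: 'a', 'x', then 'a' again
print(bool(regex.search("axb")))  # False: 'b' is not the captured 'a'

m = regex.search("boxobo")
print(m.group(0))  # oxo
print(m.group(1))  # o
```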