Linguist 278: Programming for Linguists (Stanford Linguistics, Fall 2021)

Class 9: Introduction to regular expressions

import re

Examples

See the cheat sheet for even more!

Quantifiers

# Match a
re.search(r"a", "a")  ## Matches a

re.search(r"a", "b")  ## No Match


# Match 0 or 1 a:
re.search(r"a?", "aa")  ## Matches a

re.search(r"a?", "")   ## Matches ''


# Match 0 or more a
re.search(r"a*", "aaa")  ## Matches aaa

re.search(r"a*", "")   ## Matches ''


# Match 1 or more a
re.search(r"a+", "a")  ## Matches a

re.search(r"a+", "")   ## No match


# Match 2 to 5 a in a row; first and/or second bounds can be left off
re.search(r"a{2,5}", "aaa")  ## Matches aaa

re.search(r"a{2,5}", "a")  ## No match

re.search(r"a{2,5}", "aaaaaaa")  ## Matches the first 5 a


# Match case-insensitively with the re.I flag:
re.search(r"a+", "A", re.I)  ## Matches A

re.search(r"a+", "A")  ## No match
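re.search returns a match object rather than the matched text, so the results in the comments above are easier to check with a small helper. The following is a minimal sketch; matched_text is a hypothetical name introduced here, not something from the re module.

```python
import re

def matched_text(pattern, string, flags=0):
    """Return the text matched by re.search, or None when nothing matches."""
    m = re.search(pattern, string, flags)
    return m.group() if m else None

# (pattern, string, flags) triples taken from the quantifier examples
examples = [
    (r"a?", "aa", 0),
    (r"a*", "aaa", 0),
    (r"a+", "", 0),
    (r"a{2,5}", "aaaaaaa", 0),
    (r"a+", "A", re.I),
]
for pattern, string, flags in examples:
    print(repr(pattern), repr(string), "->", repr(matched_text(pattern, string, flags)))

# re.search returns the leftmost match, and a* can match the empty string,
# so here it succeeds at position 0 without consuming any a.
leftmost = matched_text(r"a*", "baa")
print(repr(leftmost))
```

The last case is worth remembering: a pattern that can match nothing will happily do so at the start of the string, even when a longer match exists further to the right.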

Sets of characters

# Match any of the characters inside []:
re.search(r"[abc]", "b")  ## Matches b


# Match 1 or more characters inside [] in a row, any order:
re.search(r"[abc]+", "bac")  ## Matches bac


# Compare with exact string match:
re.search(r"abc", "bac")  ## No match


# Match 1 or more characters in a row that *aren't* in the set consisting of a, b, c:
re.search(r"[^abc]+", "bac")  ## No match
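Sets also accept ranges such as a-c, and they combine naturally with re.findall, which returns every non-overlapping match instead of just the first. A short sketch, using the word "linguistics" as made-up sample data:

```python
import re

word = "linguistics"  # sample string chosen for illustration

# A range inside [] is shorthand for listing every character in it.
range_match = re.search(r"[a-c]+", "bac").group()

# Every single vowel, in order of appearance.
vowels = re.findall(r"[aeiou]", word)

# Every maximal run of characters that are not vowels.
consonant_runs = re.findall(r"[^aeiou]+", word)

# A negated set matches as soon as some character falls outside the set.
outside = re.search(r"[^abc]+", "bacd").group()

print(range_match, vowels, consonant_runs, outside)
```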

Grouping with parentheses

# Matches ab and then 1 or more c:
re.search(r"abc+", "abcccccc")  ## Matches abcccccc


# Matches 1 or more of the exact string abc:
re.search(r"(abc)+", "abcabcc")  ## Matches abcabc


# Matches abc or xyz
re.search(r"(abc|xyz)", "xyzabc")  ## Matches xyz
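Parentheses do two jobs: they set the scope of quantifiers and of |, and they capture the text they matched. The sketch below illustrates both; the gray/grey strings are illustrative and not part of the original examples.

```python
import re

# Without parentheses, | splits the whole pattern: "gra" OR "ey".
unscoped = re.search(r"gra|ey", "grey").group()

# With parentheses, | only chooses between "a" and "e".
scoped = re.search(r"gr(a|e)y", "grey").group()

# group(0) is the whole match; group(1) is what the first () captured.
m = re.search(r"(abc|xyz)", "xyzabc")
whole, first_group = m.group(0), m.group(1)

# When a group repeats, it keeps only its last repetition.
m2 = re.search(r"(abc)+", "abcabcc")
repeated_whole, repeated_group = m2.group(0), m2.group(1)

print(unscoped, scoped, whole, first_group, repeated_whole, repeated_group)
```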

Character groups

# Period matches any single character (except a newline, by default)
re.search(r".", "abc")  ## Matches a

# Matches 1 or more word characters (letters, digits, underscore) in a row:
re.search(r"\w+", "abc")  ## Matches abc

# This is roughly equivalent to the above, but \w is more permissive
# with Unicode:
re.search(r"[a-zA-Z0-9_]+", "abc")  ## Matches abc


# Matches 1 or more digits:
re.search(r"\d+", "a000b")  ## Matches 000


# Matches any kind of whitespace
re.search(r"\s", "a b")  ## Matches ' '
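The difference between \w and the explicit ASCII set matters as soon as the data contains accented letters, which is common in linguistic data. A small sketch, with "café" as an illustrative string (written with an escape so the é is a single precomposed character):

```python
import re

word = "caf\u00e9"  # "café" with a precomposed é

# For str patterns, \w covers Unicode letters by default.
unicode_match = re.search(r"\w+", word).group()

# The explicit ASCII set stops at the first non-ASCII letter.
ascii_set_match = re.search(r"[a-zA-Z0-9_]+", word).group()

# The re.ASCII flag makes \w behave like the explicit set.
ascii_flag_match = re.search(r"\w+", word, re.ASCII).group()

# \s covers spaces, tabs, and newlines alike, which makes it handy for splitting.
tokens = re.split(r"\s+", "a  b\tc\nd")

print(unicode_match, ascii_set_match, ascii_flag_match, tokens)
```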

Some special characters
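As a sketch, assuming the special characters meant here include the anchors ^ and $, the word boundary \b, and the backslash escape (the exercises below rely on ^ at least):

```python
import re

# ^ anchors the match to the start of the string, $ to the end.
starts_with_a = re.search(r"^a", "ab") is not None
not_at_start = re.search(r"^a", "ba") is None
ends_with_a = re.search(r"a$", "ba") is not None

# A bare . matches any character; \. matches only a literal period.
any_char = re.search(r"a.c", "abc") is not None
literal_dot_fails = re.search(r"a\.c", "abc") is None
literal_dot = re.search(r"a\.c", "a.c").group()

# \b matches at a word boundary, so "cat" inside "concat" is skipped.
boundary_start = re.search(r"\bcat\b", "concat cat").start()

print(starts_with_a, not_at_start, ends_with_a, any_char,
      literal_dot_fails, literal_dot, boundary_start)
```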

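Match objects are truthy in conditionals

Conditionals are a very common environment for regexes, since one is very often testing strings against a pattern. re.search returns a match object, which counts as true, when the pattern matches, and None, which counts as false, when it does not: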

if re.search(r"L", "aLa"):
    print("Match!")
else:
    print("No match!")
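The same test works inside a list comprehension, which is a common way to filter a word list by a pattern. A minimal sketch with a made-up word list:

```python
import re

words = ["Lab", "lab", "aLa", "ball"]  # illustrative data

# Keep only the words in which the pattern matches somewhere.
with_capital_l = [w for w in words if re.search(r"L", w)]

# Negating the test keeps the words without a match.
without_capital_l = [w for w in words if not re.search(r"L", w)]

print(with_capital_l)
print(without_capital_l)
```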

Capturing groups and match objects

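texts = ["Regular expressions"]  # hypothetical sample data; the original texts list is not shown here
m = re.search(r"(R)", texts[0], re.I)

m.start() ## Index where the match begins

m.end()  ## Index just past the last character of the match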

m.groups() ## Returns ('R',)
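With more than one group, the match object keeps each captured piece separately, and start() and end() can be used directly as slice indices. A sketch on an illustrative sentence:

```python
import re

text = "a well-known fact"  # illustrative string
m = re.search(r"(\w+)-(\w+)", text)

whole = m.group(0)    # the entire match
first = m.group(1)    # text captured by the first ()
second = m.group(2)   # text captured by the second ()
all_groups = m.groups()

# end() is one past the last matched character, so slicing recovers the match.
span = m.span()
sliced = text[m.start():m.end()]

print(whole, first, second, all_groups, span, sliced)
```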

Compiling regular expressions

# Compile once, then reuse the pattern object; flags are given at compile time
regex = re.compile(r"a+", re.I)

regex.search("bAab")  ## Matches Aa
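Compiling pays off when the same pattern is applied to many strings, since the pattern and its flags are written only once. A minimal sketch over a made-up list of strings:

```python
import re

regex = re.compile(r"a+", re.I)
strings = ["bAab", "xyz", "AAA"]  # illustrative data

# Apply the same compiled pattern to every string.
hits = []
for s in strings:
    m = regex.search(s)
    hits.append(m.group() if m else None)

# Compiled patterns offer the same methods as the module-level functions.
all_runs = regex.findall("bAab aa")

print(hits)
print(all_runs)
```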

Regexercises

All of these should be run on 'unix-words-en.txt' using egrep:

unix_filename = 'unix-words-en.txt'

  1. Words that begin with a capital letter.

    r"^[A-Z]"

  2. Words that contain a medial capital letter.

    .[A-Z]. or .+[A-Z].+ if you need to match the entire string.

  3. Words that contain a hyphen.

    - or .*-.* if you need to match the entire string.

  4. Words that begin with exactly two consonants, end with exactly two consonants, and have exactly three vowels ('a','e','i','o','u') in between. Case-insensitive.

  5. Words that end either in 'ous' or in 'ary'. Case-insensitive. (How many of these are adjectives derived from nouns?)

  6. Words that contain the vowels 'a', 'e', 'i', 'o', and 'u', in that order, perhaps with other characters in between. Case-insensitive.

  7. Same as above, but make sure the other characters aren't vowels.

  8. Words with an even number of a's in them (0, 2, 4, ...). Case-insensitive.

  9. Words that begin and end with the same letter. (For this, you need to use capturing groups: re.compile(r"(.)x\1") is a regex for any string containing a character, 'x', and then the first character again.)

  10. Words with the longest sequence of the same letter in a row in the data. Case-insensitive.

  11. Words that have two or more dotted letters in a row. Case-insensitive.

  12. Same as the above, but allow hyphens between the dotted letters, and don't count hyphens towards the length.
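To try patterns from Python rather than egrep, something like the following could be used. This is a sketch under assumptions: matching_words and read_words are hypothetical helper names, the word file is assumed to hold one word per line in UTF-8, and the sample list stands in for the file so the code runs without it. It checks the three answers given above and the (.)x\1 hint from exercise 9.

```python
import re

def matching_words(pattern, words, flags=0):
    """Return the words in which the pattern matches somewhere."""
    regex = re.compile(pattern, flags)
    return [w for w in words if regex.search(w)]

def read_words(filename):
    """Read one word per line (assumed file layout), skipping blank lines."""
    with open(filename, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Stand-in data; with the real file: words = read_words('unix-words-en.txt')
sample = ["Aaron", "apple", "McDonald", "well-known"]

initial_capital = matching_words(r"^[A-Z]", sample)
medial_capital = matching_words(r".[A-Z].", sample)
hyphenated = matching_words(r"-", sample)

# \1 must repeat whatever the first group captured, unlike a second "."
backref = re.compile(r"(.)x\1")
same_char = backref.search("--bxb--")
different_char = backref.search("axb")

print(initial_capital, medial_capital, hyphenated)
print(same_char.group(), same_char.group(1), different_char)
```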