Linguist 278: Programming for Linguists (Stanford Linguistics, Fall 2021)

Class 10: More on regular expressions

import re

Negated character groups

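For each of the character group abbreviations (\d, \w, \s), capitalizing it creates its negated version: \D, \W, and \S match any single character that is not a digit, not a word character, and not a whitespace character, respectively.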
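As a small illustration of the negated abbreviations, the sketch below runs each abbreviation and its capitalized counterpart over the same string. The sample string and variable names are made up for this example.

```python
import re

text = "Room 101, floor 3"  # hypothetical sample string

digits = re.findall(r"\d+", text)      # runs of digits: ['101', '3']
non_digits = re.findall(r"\D+", text)  # runs of non-digits: ['Room ', ', floor ']

words = re.findall(r"\w+", text)       # runs of word characters
non_words = re.findall(r"\W+", text)   # runs of everything else: [' ', ', ', ' ']

non_space = re.findall(r"\S+", text)   # whitespace-delimited chunks
```

Note that each negated abbreviation picks up exactly the stretches that its lowercase counterpart skips.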

Greedy and non-greedy matching

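By default, the quantifiers (*, +, ?, {m,n}) match greedily: they consume as much of the string as they can while still allowing the overall pattern to match. To make a quantifier non-greedy, add a question mark after it: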

re.search(r".+,", "aa,bb,")  ## Matches 'aa,bb,'

re.search(r".+?,", "aa,bb,")  ## Matches 'aa,'
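The same contrast shows up with other quantifiers. The following is a minimal sketch on a made-up, tag-like string; the variable names are just for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # hypothetical sample string

# Greedy: .+ runs to the last '>' in the string.
greedy = re.search(r"<.+>", text).group(0)

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.search(r"<.+?>", text).group(0)  # '<b>'

# The question mark works after counted quantifiers as well.
longest = re.search(r"a{2,4}", "aaaa").group(0)    # 'aaaa'
shortest = re.search(r"a{2,4}?", "aaaa").group(0)  # 'aa'
```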

Turning off capturing groups

Parentheses are used both for grouping and for capturing. If you want to use them just for grouping – no capturing – then start the group with ?::

re.search(r"(a)b", "ab").group(1)  ## Returns 'a'

re.search(r"(?:a)b", "ab").group(1)  ## Raises IndexError, because there is no group 1.
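To see the difference without triggering the error, one option is to inspect the match object's groups() tuple. This is only a sketch; the variable names are illustrative.

```python
import re

capturing = re.search(r"(a)b", "ab")
non_capturing = re.search(r"(?:a)b", "ab")

captured = capturing.groups()        # ('a',)
not_captured = non_capturing.groups()  # (): nothing was captured

# Group 0 (the whole match) is always available.
whole = non_capturing.group(0)       # 'ab'

# Asking for group 1 of the non-capturing match raises IndexError.
try:
    non_capturing.group(1)
    raised = False
except IndexError:
    raised = True

# A non-capturing group still groups, e.g. as the scope of a quantifier.
repeated = re.search(r"(?:ab)+c", "ababc").group(0)  # 'ababc'
```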

Using findall

The function findall is a simple variant of search that returns all the non-overlapping matches as a list:

re.findall(r"abc", "abcabc")  ## ['abc', 'abc']

A common gotcha is that, if the pattern contains capturing groups, then findall returns only the captured text rather than the full matches:

re.findall(r"(ab|AB)c", "abcABc")  ## ['ab', 'AB']

This is a situation in which we often want to turn off the capturing while keeping the grouping:

re.findall(r"(?:ab|AB)c", "abcABc")  ## ['abc', 'ABc']
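Two related behaviors are worth sketching here. With more than one capturing group, findall returns a list of tuples, and if you want the groups and the full matches at the same time, re.finditer (which yields match objects) is one alternative. The patterns below are variations on the example above, chosen only for illustration.

```python
import re

# Two capturing groups: each match becomes a tuple of the captured pieces.
pairs = re.findall(r"(ab|AB)(c)", "abcABc")  # [('ab', 'c'), ('AB', 'c')]

# finditer yields match objects, so both the full match (group 0)
# and the captured group (group 1) remain accessible.
matches = list(re.finditer(r"(ab|AB)c", "abcABc"))
full = [m.group(0) for m in matches]   # ['abc', 'ABc']
inner = [m.group(1) for m in matches]  # ['ab', 'AB']
```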

String substitutions

re.sub will allow you to replace matches with something else:

re.sub(r"abc", r"XYZ", "abc")  ## Returns 'XYZ'

You can refer to capturing groups in the replacement using \1, \2, etc.:

re.sub(r"(a)(b)", r"\2\1", "abab")  ## Returns 'baba'

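Notice the use of a raw string for the replacement. This is primarily due to the backslash: in an ordinary string, Python would interpret \1 and \2 as escape sequences before re.sub ever sees them. I try always to use a raw string for this argument.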
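To make the raw-string point concrete, here is a small sketch comparing three ways of writing the same replacement. The variable names are illustrative.

```python
import re

# Raw string: re.sub receives a backslash followed by a digit,
# which it treats as a reference to a capturing group.
raw = re.sub(r"(a)(b)", r"\2\1", "abab")  # 'baba'

# Ordinary string: Python turns "\2\1" into two control characters
# (octal escapes) first, so re.sub inserts those characters literally.
not_raw = re.sub(r"(a)(b)", "\2\1", "abab")

# Doubling the backslashes in an ordinary string is equivalent to the raw string.
escaped = re.sub(r"(a)(b)", "\\2\\1", "abab")  # 'baba'
```

The non-raw version runs without an error, which is what makes this mistake easy to miss.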

For much more

We've only scratched the surface of what can be done with regular expressions in Python. For more: https://docs.python.org/3/library/re.html