import re
See the cheat sheet for even more!
# Match a
re.search(r"a", "a") ## Matches a
re.search(r"a", "b") ## No Match
# Match 0 or 1 a:
re.search(r"a?", "aa") ## Matches a
re.search(r"a?", "") ## Matches ''
# Match 0 or more a
re.search(r"a*", "aaa") ## Matches aaa
re.search(r"a*", "") ## Matches ''
# Match 1 or more a
re.search(r"a+", "a") ## Matches a
re.search(r"a+", "") ## No match
# Match 2 to 5 a in a row; first and/or second bounds can be left off
re.search(r"a{2,5}", "aaa") ## Matches aaa
re.search(r"a{2,5}", "a") ## No match
re.search(r"a{2,5}", "aaaaaaa") ## Matches the first 5 a
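The comment above notes that either bound can be left off. As a brief sketch (the test strings are made up for illustration): omitting the upper bound means "at least n", and omitting the lower bound means "at most m", with an implied lower bound of 0.

```python
import re

# {2,} means "2 or more": the upper bound is left off
at_least_two = re.search(r"a{2,}", "aaaaaaa")
print(at_least_two.group())

# {,3} means "0 to 3": the lower bound is left off and defaults to 0
at_most_three = re.search(r"a{,3}", "aaaaaaa")
print(at_most_three.group())

# A single number with no comma means "exactly that many"
exactly_three = re.search(r"a{3}", "aaaaaaa")
print(exactly_three.group())
```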
# Pass the re.I flag to ignore case:
re.search(r"a+", "A", re.I) ## Matches A
re.search(r"a+", "A") ## No match
# Match any of the characters inside []:
re.search(r"[abc]", "b") ## Matches b
# Match 1 or more characters inside [] in a row, any order:
re.search(r"[abc]+", "bac") ## Matches bac
# Compare with exact string match:
re.search(r"abc", "bac") ## No match
# Match 1 or more characters that *aren't* in the set consisting of a, b, c:
re.search(r"[^abc]+", "bac") ## No match
# Matches ab and then 1 or more c:
re.search(r"abc+", "abcccccc") ## Matches abcccccc
# Matches 1 or more of the exact string abc:
re.search(r"(abc)+", "abcabcc") ## Matches abcabc
# Matches abc or xyz
re.search(r"(abc|xyz)", "xyzabc") ## Matches xyz
# Period matches any single character (except a newline, by default):
re.search(r".", "abc") ## Matches a
# Matches 1 or more alphanumeric characters (or underscores) in a row:
re.search(r"\w+", "abc") ## Matches abc.
# This is roughly equivalent to the above, but \w is more permissive
# with Unicode:
re.search(r"[a-zA-Z0-9_]+", "abc") ## Matches abc
# Matches 1 or more digits:
re.search(r"\d+", "a000b") ## Matches 000
# Matches any kind of whitespace
re.search(r"\s", "a b") ## Matches the space
^ matches the start of the string. (As the first character inside [], it means negation instead!)
$ matches the end of the string.
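A minimal sketch of the two anchors in use; the test strings and variable names are made up for illustration:

```python
import re

# ^ anchors the pattern to the start of the string
starts_with_a = re.search(r"^a", "abc")   # matches 'a'
no_start = re.search(r"^b", "abc")        # None: 'b' is not at the start

# $ anchors the pattern to the end of the string
ends_with_c = re.search(r"c$", "abc")     # matches 'c'

# Using both forces the pattern to cover the whole string
whole = re.search(r"^a+$", "aaa")         # matches 'aaa'
not_whole = re.search(r"^a+$", "aab")     # None

# As the first character inside [], ^ negates the set instead
not_abc = re.search(r"[^abc]", "abcd")    # matches 'd'
```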
To match a special character literally (a quantifier, brace, period, etc.), put a backslash in front of it.
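A short sketch of escaping, with made-up test strings. The last two lines use `re.escape` from the standard library, which adds the backslashes to an arbitrary string for you:

```python
import re

# Unescaped, . matches any character, so the first hit is 'a'
any_char = re.search(r".", "a.b")
# Escaped, it matches only a literal period
literal_dot = re.search(r"\.", "a.b")

# Escaping a quantifier and braces
literal_plus = re.search(r"1\+1", "1+1=2")     # matches '1+1'
literal_braces = re.search(r"\{x\}", "f{x}")   # matches '{x}'

# re.escape builds a pattern that matches the given text literally
escaped = re.escape("a.b*c")
found = re.search(escaped, "xx a.b*c yy")
```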
Regexes are very often used in conditionals, since one frequently tests whether a string matches a pattern. This works because re.search returns None (which is falsy) when there is no match:
if re.search(r"L", "aLa"):
    print("Match!")
else:
    print("No match!")
m = re.search(r"(R)", texts[0], re.I)
m.start() ## First index of the match
m.end() ## Index just past the last character of the match
m.groups() ## Returns ('R',)
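The match-object methods above can be tried on a self-contained string. This is only a sketch: `text` is a made-up example, not the `texts` list used above.

```python
import re

text = "Regular expressions"  # made-up example string
m = re.search(r"(e)(x)", text)

start = m.start()              # index of the first matched character
end = m.end()                  # index just past the last matched character
matched = text[start:end]      # slicing with start/end recovers the match
whole = m.group()              # the full matched text
parts = m.groups()             # one entry per capturing group

# With no match, re.search returns None, so check before calling methods
missing = re.search(r"z", text)
if missing is None:
    print("No match, so there is no match object to query")
```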
# Compile a pattern (with its flags) once so it can be reused:
regex = re.compile(r"a+", re.I)
regex.search("bAab") ## Matches Aa
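A compiled pattern offers the same operations as the module-level functions. A brief sketch with made-up input strings:

```python
import re

regex = re.compile(r"a+", re.I)

first = regex.search("bAab")                # first run of a's, ignoring case
all_runs = regex.findall("bAab aaa A")      # every non-overlapping run
replaced = regex.sub("_", "bAab")           # replace each run with '_'
```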
All of these should be run on 'unix-words-en.txt' using egrep:
unix_filename = 'unix-words-en.txt'
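If no `egrep` helper is at hand, something along these lines could stand in for it. This is a hypothetical sketch, not the definition used here: the name `egrep_sketch`, its signature, and the sample words are all assumptions made for illustration.

```python
import re

def egrep_sketch(pattern, lines, flags=0):
    """Return the lines in which `pattern` matches anywhere (hypothetical helper)."""
    regex = re.compile(pattern, flags)
    return [line for line in lines if regex.search(line)]

# Tiny made-up sample standing in for the lines of the word list
sample_words = ["Aaron", "apple", "McDonald", "well-known", "zebra"]

capitalized = egrep_sketch(r"^[A-Z]", sample_words)
medial_capital = egrep_sketch(r".[A-Z].", sample_words)
hyphenated = egrep_sketch(r"-", sample_words)
print(capitalized, medial_capital, hyphenated)
```

To run it on the real data, one option is to read the file into a list first, for example with `open(unix_filename).read().splitlines()`, and pass that list as `lines`.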
Words that begin with a capital letter.
r"^[A-Z]"
Words that contain a medial capital letter.
r".[A-Z]." or r".+[A-Z].+" if you need to match the entire string.
Words that contain a hyphen.
r"-" or r".*-.*" if you need to match the entire string.
Words that begin with exactly two consonants, end with exactly two consonants, and have exactly three vowels ('a','e','i','o','u') in between. Case-insensitive.
Words that end either in 'ous' or in 'ary'. Case-insensitive. (How many of these are adjectives derived from nouns?)
Words that contain the vowels 'a', 'e', 'i', 'o', and 'u', in that order, perhaps with other characters in between. Case-insensitive.
Same as above, but make sure the other characters aren't vowels.
Words with an even number of a's in them (0, 2, 4, ...). Case-insensitive.
Words that begin and end with the same letter. (For this, you need to use capturing groups: re.compile(r"(.)x\1") is a regex for any string containing a character, 'x', and then the first character again.)
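To see the backreference from the hint in action, here is a small sketch with made-up test strings. It only demonstrates how `\1` behaves and is not a solution to the exercise.

```python
import re

# (.) captures one character; \1 must then match that same character again
pattern = re.compile(r"(.)x\1")

hit = pattern.search("axa")               # 'a', then 'x', then 'a' again
miss = pattern.search("axb")              # None: 'b' differs from the captured 'a'
embedded = pattern.search("bookxkeeper")  # the match can sit inside a longer string

# With nothing between the group and \1, the pattern finds a doubled character
doubled = re.search(r"(.)\1", "book")
```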
Words with the longest sequence of the same letter in a row in the data. Case-insensitive.
Words that have two or more dotted letters in a row. Case-insensitive.
Same as the above, but allow hyphens between the dotted letters, and don't count hyphens towards the length.