Problem Set 7

Due Friday, November 15 at 1:00 pm Pacific

What can you do with regular expressions? What are the limits of regular languages? And what applications do these concepts have outside of the regular languages? In this seventh problem set, you'll explore the answers to these questions along with their practical consequences.

As always, please feel free to drop by office hours, ask on EdStem, or send us emails if you have any questions. We'd be happy to help out.

Good luck, and have fun!

Starter Files

Here are the starter files for the problems you'll submit electronically:

📦 PS7 Starter Files

Unpack the files somewhere convenient and open up the bundled project.

If you would like to type your solutions in $\LaTeX$, you may want to download this template file where you can place your answers:

🖋 PS7 $\LaTeX$ Template

Problem One: Designing Regular Expressions

Below are a list of alphabets and languages over those alphabets. For each language, write a regular expression for that language. Provide your answers by downloading the starter files for Problem Set Seven from Canvas and editing the file res/RegularExpressions.regexes. (Unlike the DFA/NFA editor from last time, you’ll need to manually edit these files by opening them in Qt Creator.) Feel free to test your answers locally, and submit your work on GradeScope.

Before starting this problem, read our handy Guide to Regular Expressions for some useful insights about regex design.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}, \mathtt{e}}$. Write a regular expression for the language $L$ = { $w \in \Sigma^\star \suchthat$ the letters in $w$ are sorted alphabetically }. For example, $\mathtt{abcde} \in L\text,$ $\mathtt{bee} \in L\text,$ $\mathtt{a} \in L\text,$ and $\varepsilon \in L\text,$ but $\mathtt{decade} \notin L\text.$

Write a regular expression for the complement of the language from part (i) of this problem.

Like NFAs and unlike DFAs, there isn't a simple construction that starts with a regex for $L$ and turns it into a regex for $\overline{L}$.

On Unix-style operating systems like macOS or Linux, files are organized into directories. You can reference a file by giving a path to the file, a series of directory names separated by slashes. For example, the path /home/username/ might represent a user’s home directory, and a path like /home/username/Documents/PS7.tex might represent that person’s solution to this problem set. Paths that start with a slash character are called absolute paths and say exactly where the file is on disk. Paths that don’t start with a slash are called relative paths and say where, relative to the current folder, a file can be found. For example, if I’m logged into my computer and am in my home folder, I could look up the file Documents/PS7.tex to find my solution to this problem set.

The general pattern here is that a file path consists of a series of directory or file names separated by slashes. That path might optionally start with a slash, but isn’t required to, and it might optionally end with a slash, but isn’t required to. However, you can’t have two consecutive slashes.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{/}}$. Write a regular expression for $L$ = { $w \in \Sigma^\star \suchthat w$ represents the name of a file path on a Unix-style system }. For example, $\mathtt{/aaa/a/aa} \in L$, $\mathtt{/} \in L$, $\mathtt{a} \in L$, $\mathtt{/a/a/a/} \in L$, and $\mathtt{aaa/} \in L$, but $\mathtt{//a//} \notin L$, $\mathtt{a//a} \notin L$, and $\varepsilon \notin L$.

Fun fact: this problem comes from former CS103 instructor Amy Liu, who fixed a bug in industrial code that arose when someone wrote the wrong regex for this language. Oops.

Suppose you are taking a walk with your dog on a leash of length two. Let $\Sigma = \Set{\mathtt{y}, \mathtt{d}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ represents a walk with your dog on a leash where you and your dog both end up at the same location }. For example, we have $\mathtt{yyddddyy} \in L$ because you and your dog are never more than two steps apart and both of you end up four steps ahead of where you started; similarly, $\mathtt{ddydyy} \in L$. However, $\mathtt{yyyydddd} \notin L$, since halfway through your walk you’re three steps ahead of your dog; $\mathtt{ddyd} \notin L$, because your dog ends up two steps ahead of you; and $\mathtt{ddyddyyy} \notin L$, because at one point your dog is three steps ahead of you.

Write a regular expression for $L$.

Note that, unlike Problem Set Six, you and your dog must end at the same position.

Let $\Sigma = \Set{\mathtt{M}, \mathtt{D}, \mathtt{C}, \mathtt{L}, \mathtt{X}, \mathtt{V}, \mathtt{I}}$ and let $L$ = { $w \in \Sigma^\star \suchthat w$ is number less than $2,000$ represented in Roman numerals }. For example, $\mathtt{CMXCIX} \in L$, since it represents the number $999$, as are the strings L (50), VIII (8), DCLXVI (666), CXXXVII (137), CDXII (412), and MDCXVIII (1,618). However, we have that $\mathtt{VIIII} \notin L$ (you'll never have four I's in a row; use IX or IV instead), that $\mathtt{MM} \notin L$ (it's a Roman numeral, but it's for $2,000$, which is too large), that $\mathtt{VX} \notin L$ (this isn't a valid Roman numeral), and that $\mathtt{IM} \notin L$ (the notation of using a smaller digit to subtract from a larger one only lets you use I to prefix V and X, or X to prefix L and C, or C to prefix D and M). The Romans didn't have a way of expressing $0$, so to make your life easier we'll say that $\varepsilon \in L$ and that the empty string represents $0$. (Oh, those silly Romans.)

Write a regular expression for $L$.

As a note, we’re using the “standard form” of Roman numerals. You can see a sample of numbers written out this way via this link.)

Problem Two: Finite Languages

A language $L$ is called finite if $L$ contains finitely many strings (that is, $\abs{L}$ is a natural number). Given a finite language $L$, explain how to write a regular expression for $L$. Briefly justify your answer; no formal proof is necessary. This shows that all finite languages are regular.

Watch for edge cases!

Problem Three: Distinguishability

The concepts of distinguishability and distinguishing sets play a key role in the Myhill-Nerode theorem. This problem explores them in more depth. Before starting this problem, read our Guide to the Myhill-Nerode Theorem, which reviews the major concepts and provides some good intuitions.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ w \in \Sigma^\star \suchthat \text{the number of }\mathtt{a}\text{'s in }w\text{ is a multiple of three }}\text.$ Which of the following pairs of strings are distinguishable relative to $L$? For each pair that is distinguishable relative to $L$, give us the shortest string $w \in \Sigma^\star$ such that exactly one of $xw$ and $yw$ is in $L$. (If multiple strings are tied for having the shortest length, you can pick any one of them.) No justification is necessary.
1. $x = \mathtt{a}, y = \mathtt{aa}\text.$
2. $x = \mathtt{a}, y = \mathtt{aaa}\text.$
3. $x = \mathtt{aa}, y = \mathtt{aaa}\text.$
4. $x = \mathtt{a}, y = \mathtt{aaaa}\text.$

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L = \Set{ w \in \Sigma^\star \suchthat w \text{ contains } \mathtt{aa} \text{ as a substring }}\text.$ Which of the following are distinguishing sets for $L$? For each set that isn't a distinguishing set, give us a pair of distinct strings from the set that aren't distinguishable relative to $L$. No justification is necessary.
1. $\Set{ \mathtt{a}^n \suchthat n \in \naturals }\text.$
2. $\emptyset\text.$
3. $\Sigma^\star\text.$
4. $\Set{\mathtt{abba}}\text.$
5. $\Set{\mathtt{abba}, \mathtt{abbababba}}\text.$
6. $\Set{\varepsilon, \mathtt{a}, \mathtt{aa}}\text.$
7. $\Set{\varepsilon, \mathtt{a}, \mathtt{b}}\text.$

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and consider this language $L\text:$
\[L = \Set{ w \in \Sigma^\star \suchthat w \text{ has even length and its first half does not equal its second half } }\text.\]
Give a distinguishing set $S$ for $L$ containing four strings. Then, for each pair of distinct strings $x, y \in S$, show that they're distinguishable by giving a string $w$ where exactly one of $xw$ and $yw$ belongs to $L$.

As above, let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$ and let $L$ be this language:
\[L = \Set{ w \in \Sigma^\star \suchthat w \text{ has even length and its first half does not equal its second half} }\text.\]
Prove that $L$ is not regular.

See if you can find a way of generalizing the distinguishing set you found in part (iii) to have four strings, then five, and eventually infinitely many.

Problem Four: That's Odd…

Let $\Sigma = \set{\mathtt{a}, \mathtt{b}}$ and let $L = \set{ w \in \Sigma^\star \suchthat \abs{w} \text{ is odd } }$ be a language over $L\text.$ This language is regular; one way to see this is that it's given by the regex $\Sigma\left(\Sigma\Sigma\right)^\star\text.$

However, below is an (incorrect!) proof that $L$ is not regular.

(Incorrect!) Theorem: $L$ is not regular.

(Incorrect!) Proof: Let $S = \set{ \mathtt{a}^n \in \Sigma^\star \suchthat n \in \naturals }\text.$ We claim that $S$ is an infinite distinguishing set for $L\text.$

To see that $S$ is infinite, note that it contains one string for each natural number.

To see that $S$ is a distinguishing set for $L\text,$ consider any two strings $\mathtt{a}^n, \mathtt{a}^{n+1} \in S\text.$ We will prove that $\mathtt{a}^n \not\equiv_L \mathtt{a}^{n+1}\text.$ To do so, we will show there exists a string $w \in \Sigma^\star$ such that exactly one of $\mathtt{a}^n w$ and $\mathtt{a}^{n+1} w$ is in $L\text.$

Specifically, pick $w = \mathtt{a}^n\text.$ First, notice that $\mathtt{a}^n w = \mathtt{a}^n \mathtt{a}^n = \mathtt{a}^{2n}\text.$ This means that $\abs{\mathtt{a}^{2n}} = 2n\text,$ which is even, and so $\mathtt{a}^n w = \mathtt{a}^{2n} \notin L\text.$ Second, observe that $\mathtt{a}^{n+1}w = \mathtt{a}^{n+1}\mathtt{a}^n = \mathtt{a}^{2n+1}\text.$ Because $\abs{\mathtt{a}^{2n+1}} = 2n+1\text,$ which is odd, we see that $\mathtt{a}^{n+1}w = \mathtt{a}^{2n+1} \in L\text.$ Thus exactly one of $\mathtt{a}^n w$ and $\mathtt{a}^{n+1}w$ is in $L\text,$ as needed.

Because $S$ is an infinite distinguishing set for $L\text,$ by the Myhill-Nerode theorem, we see that $L$ is not regular. $\qed$

This proof has to be wrong, because $L$ is indeed regular.

What's wrong with this proof? Be as specific as possible. For full credit, you should identify the first sentence in the proof that contains a logical error and explain what that error is. (We say "first sentence" because once the proof makes a logic error, everything after that point will be incorrect.)

Problem Five: SHA-256

For the purposes of this problem, let $\Sigma = \set{\mathtt{0}, \mathtt{1}}\text.$

In computer security, there's a widely-used class of functions called cryptographic hash functions. Generally speaking, these functions take in arbitrary sequences of bits, then output a fixed-length sequence of bits that serves as a "fingerprint" for the original bitstring.

One of the most-commonly-used cryptographic hash functions is called SHA-256. We can imagine SHA-256 as a function $\texttt{SHA-256} : \Sigma^\star \to \Sigma^{256}$ that maps arbitrary strings of bits to sequences of 256 bits. Because all data on a computer can be represented as a string of bits, you can think of SHA-256 as a function that takes as input an arbitrary piece of data, then outputs a sequence of 256 bits derived from that data. Because it's unwieldy to write out strings of 256 bits, the output of SHA-256 is typically represented in hexadecimal notation as a series of 64 characters drawn from $\set{\mathtt{0}, \mathtt{1}, \dots\, \mathtt{9}, \mathtt{a}, \mathtt{b}, \dots, \mathtt{f}}\text.$

For example, applying SHA-256 to the string avocet produces the output

6fb09913d815a12f8fe18fe6da5c9579a9f750ef9617d4fc2d78f698db4c1057

and applying SHA-256 to the string curlew produces this output:

0a28e796fca2c33a0107989411f2de207e53aced562ea7e728be15ce6590f9ec

SHA-256 has several important properties that give it a wide range of applications, and in this problem we'll explore how it works and why it's so useful.

What is the result of applying SHA-256 to the strings shoveler, Shoveler, and SHOVELER?

You can find a wide variety of websites that will compute SHA-256 hashes for you; feel free to use any of them. Just confirm that they're working correctly by seeing whether their hashes for avocet and curlew match the results above.

As you saw in part (i), tiny changes to the input to SHA-256 can produce wildly different outputs, and those outputs at first glance seem totally random. More generally, it's believed that it is hard (for some definition of "hard") to find an input to SHA-256 that produces a given output or to distinguish the output of SHA-256 from a truly-random string. As a result, in practice, many programmers treat SHA-256 hash of a file as a "fingerprint" that can be used to identify that file.

Here's one simple application. When a computer downloads a software update, it needs to make sure that the software it downloaded hasn't been tampered with in transit. Otherwise, the computer might replace perfectly good software with malicious software crafted by an attacker. To address this, the developers typically publish the cryptographic hash of the new software update in a public and trustworthy place. Then, when a computer downloads the update, it can compute the hash of the file it downloaded and compare it against the posted SHA-256 hash. If the computed hash matches the expected hash, then the file was almost certainly not tampered with. If the hashes do not agree, it indicates that the file that was downloaded isn't the same as the one that the developers initially sent out.

In a previous offering of CS103, we computed the SHA-256 hash of the .zip file containing the programming starter files for this assignment. Those starter files yielded the following hash:
```
f8c8fb1a8db708ce2b9a072233e6c9d3a067633f1727bd3fa6dd581ba254265c
```
Has that .zip file been updated since we computed this hash? (A simple yes/no answer suffices here.)

Here's another application. When you set a password on a website, that website needs to store the password for later so that it can confirm you are who you claim to be. However, if the website were to simply store your password in a file somewhere on disk and hackers were to obtain that file, they'd immediately have access to your password and could log in as you without anyone knowing. To combat this, websites typically do not store raw passwords on disk. Instead, they hash the password and store the password hash on disk. Then, when you log in to the website and send your password over, the web server can hash your password and compare it against the stored hash. If they match, then you have the right password and you're let in. If not, your password is wrong and you won't be granted acess. (Warning! This approach by itself is vulnerable to more sophisticated attacks. Do not implement your own password management system using this approach. Take CS255 to learn about how to do this properly.)

Suppose the password file on disk looks like this:
```
htiek   77f65685a223013b6ffb2a166f6749246232401947289a0b0e625a4a9008fd4b
szum    1e30adb19df164cea9015f9d4819cb67e1e87ea88ec63b26c7dfba48f30bb434
cbl     1e915b5523dcc54c7b48f65521f42537e0bbe8e67899c6ea8a78167c8fefaaa8
zhengl  c88cf9468f1701ff059abe2c172a0da5edd6af83ae80ca1a7579e291c38df6d9
liuamyj 45bd7212b0ff0015da9759ed3b699510145b5a6d285cb037e0130d6350f31a50
```
Here, each line consists of a username and a SHA-256 hash of the password.

Three users try to log in and send over these passwords:
- htiek tries logging in using the password kudu.
- szum tries logging in using the password quokka.
- cbl tries logging in using the password gerenuk.
Which users should be let in?

We will give you a shout-out in class if you can figure out the password for zhengl. We will send a glowing letter of introduction to the cryptography professors in the CS department if you can figure out the password for liuamyj.

The remainder of this problem discusses the internal workings of SHA-256. While we won't go into the full details, we'll introduce enough of the key pieces that you will have a good feeling for why SHA-256 works well as a cryptographic hash function.

Let's introduce some new concepts. A compression function is a function $f : \Sigma^m \to \Sigma^n$ where $m \gt n\text.$ In other words, compression function is simply a function that takes as input a bitstring of some fixed length $m$ and outputs a shorter bitstring of length $n\text.$

In one sentence, explain why no compression function is injective.

Because compression functions are not injective, for any compression function $f : \Sigma^m \to\Sigma^n\text,$ there must be two distinct strings $w, x \in \Sigma^m$ such that $f(w) = f(x)\text.$ Such a pair of elements is called a collision. However, while such strings are guaranteed to exist, they may not be easy to find. For example, a compression function from $\Sigma^{512} \to \Sigma^{511}$ must have a collision, but since there are $2^{512}$ strings in $\Sigma^{512}\text,$ it's computationally infeasible to try compressing every string and waiting until we coincidentally discover a collision.

That gives rise to a new definition: a compression function $f : \Sigma^m \to \Sigma^n$ is called collision-resistant if there are no known strings $w, x \in \Sigma^m$ where $w \ne x$ and $f(w) = f(x)\text.$ Notice that this definition hinges on the word "known," which is a bit hand-wavy from a mathematical perspective. It turns out that it is possible to formalize the definition of "known" using some techniques more advanced than what we cover in CS103 (take CS255, intro to cryptography, for details). But for the purposes of this problem, we'll interpret "known" to mean "some person has in their possession two concrete strings $w$ and $x$ with the necessary properties."

Consider the compression function $f : \Sigma^5 \to \Sigma^3$ defined as follows:
\[\forall a, b, c, d, e \in \Sigma. f(abcde) = bed\text.\]
Tell us what $f(\mathtt{01010})$ is, then, in one sentence, demonstrate that $f$ is not collision-resistant.

Compression functions work great if you want to compress a string of some fixed, known length. However, hash functions like SHA-256 need to work for strings of arbitrary length. To accomplish this, we can "stretch" a collision-resistant compression function so that it works on larger and larger inputs using a strategy called the Merkle-Damgård construction.

Here's how it works. Suppose we have a collision-resistant compression function $f : \Sigma^{l + b} \to \Sigma^l$ for some nonzero $l, b \in \naturals\text.$ (In the previous section we talked about compression functions from $\Sigma^m$ to $\Sigma^n$ with $m \gt n\text,$ and here we're introducing $l$ and $b$ to quantify how much bigger the exponent is on the domain than on the codomain.) We can use $f$ to define a collection of collision-resistant compression functions $g_n : \Sigma^{bn} \to \Sigma^l$ for all positive natural numbers $n\text.$ Specifically, we begin by selecting some string $v \in \Sigma^l$ called the initialization vector that is assumed to be known to everyone. We then define the functions $g_n$ inductively as follows:

$g_1 : \Sigma^b \to \Sigma^l$ is defined as $g_1(w) = f(vw)\text.$
$g_{n+1} : \Sigma^{b(n+1)} \to \Sigma^l$ is defined as $g_{n+1}(w) = f(g_n(x)y)\text,$ where $w = xy$ for $x \in \Sigma^{bn}$ and $y \in \Sigma^b\text.$

Before diving into the properties of the Merkle-Damgård construction, let's take a minute to work through an example.

Let $l = 3$ and $b = 2$ and let $f : \Sigma^{l + b} \to \Sigma^{l}$ be the compression function from part (v) of this problem. Compute $g_4(\mathtt{11011100})$ using $v = \mathtt{000}$ as the initialization vector. No justification is necessary.

The reason the Merkle-Damgård construction is useful is that if $f$ is collision-resistant, then all the functions it produces are also collision-resistant. Or, equivalently, if any of the functions it produces is not collision-resistant, then neither is $f\text.$

Let $f : \Sigma^{l + b} \to \Sigma^l$ be a compression function (not necessarily the one from part (vi) of this problem) and $v \in \Sigma^l$ be an initialization vector. Prove for all natural numbers $n \ge 1$ that if $g_n$ is not collision-resistant, then neither is $f\text.$

Feel free to use the following fact: if $x_1, x_2 \in \Sigma^r$ and $y_1, y_2 \in \Sigma^s\text,$ then $x_1 y_1 = x_2 y_2$ if and only if $x_1 = x_2$ and $y_1 = y_2\text.$

To bring it all together, here's how SHA-256 works. SHA-256 is based on a specific compression function $f : \Sigma^{256 + 512} \to \Sigma^{256}$ (with $l = 256$ and $b = 512\text)$ that is widely believed to be collision-resistant. (That is, no one has come forward with a collision in this compression function, though in principle a nation-state might have found one already but kept it secret.) SHA-256 then uses the Merkle-Damgård construction to stretch $f$ into a family of collision-resistant compression functions. When computing the SHA-256 hash of a string $w\text,$ a new string $x$ is chosen so that $\abs{wx} = 512n$ for some natural number $n\text.$ Then, SHA-256 returns $g_n(wx)\text.$ The details of how $x$ is chosen are essential for the security of the construction.

To learn more about how SHA-256 is constructed, as well as a bunch of other applications for cryptographic hash functions, take CS255, our amazing intro to cryptography class.

Optional Fun Problem: Languages with Backspaces

Imagine that you’re typing a string of a’s and b’s into your computer. You might hit the backspace key to delete the most recent character that you entered. Let’s represent pressing the backspace key with the X character. So, if you typed in abXa, you’d end up with the string aa. If you typed in abbaXXXXaab, you’d end up with the string aab. If you typed in aXXXXb, you’d end up with the string b (pressing backspace when there’s no text entered has no effect). And if you typed in XXaabbbXX, you’d end up with the string aab.

Let's formalize this idea in the context of formal language theory. Let $\Sigma_\mathtt{X} = \Set{\mathtt{a}, \mathtt{b}, \mathtt{X}}$ and let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}\text.$ We define a function $\nabla : \Sigma_\mathtt{X} \to \Sigma$ such that $\nabla(w)$ is the string that results from applying all the backspace presses in $w\text.$ For example:

$\nabla(\varepsilon) = \varepsilon\text.$
$\nabla(\mathtt{XXX}) = \varepsilon\text.$
$\nabla(\mathtt{aXbXXbXa}) = \mathtt{a}\text.$
$\nabla(\mathtt{aaaXXXbbbXX}) = \mathtt{b}\text.$

Finally, let $\nabla^{-1}[L] = \set{ w \suchthat w \in \Sigma_\mathtt{X}^\star \land \nabla(w) \in L}\text.$ There are exactly two languages $L$ over $\Sigma$ for which $\nabla^{-1}[L]$ is regular. Tell us what those languages are, then prove that for every other language $L$ over $\Sigma\text,$ the language $\nabla^{-1}[L]$ is nonregular.