Problem Set 6

Due Friday, February 21 at 1:00 PM Pacific

This sixth problem set explores formal language theory, finite automata, the regular languages, and their properties. This will be your first foray into computability theory, and I hope you find it fun and exciting!

As always, please feel free to drop by office hours, ask on Ed, or email the staff list if you have any questions. We'd be happy to help out.

Good luck, and have fun!

Starter Files

Here are the starter files for the problems you'll submit electronically:

📦 PS6 Starter Files

Unpack the files somewhere convenient and open up the bundled project.

If you would like to type your solutions in $\LaTeX$, you may want to download this template file where you can place your answers:

🖋 PS6 $\LaTeX$ Template

General Advice

First, read through the index of problems below to get an idea of time management for this assignment.
Next, glance through all the questions before you start working. This not only ensures you won't be caught off guard as you reach later problems, but also gives your brain a chance to start working on some of these as a background process while you do other things.
If you get stuck, consider moving on to another problem for a while! Changing things up is a great way to get out of a rut, make new connections between concepts, and spark creative insights!
If you realize you don't have the requisite knowledge to complete a problem, take a step back from it and review the relevant lecture materials. Skipping lecture and trying to extract the relevant tidbits for each problem will lead to a more frustrating experience than engaging fully with a lecture and approaching the problem sets afterward.

Timeline

This pset has six required problems (two of which are autograded, the rest of which are written / manually graded). Because all the questions are of similar flavors (working with languages and automata!), we are listing them all on a single page.

We recommend the following timeline for completing questions:

Q1 and Q2 by Sunday evening
Q3 and Q4 by Tuesday evening
Q5 by Wednesday evening
Q6 by Thursday evening

Of course, we recommend reading through the entire problem set ASAP so you have a strong idea of how much time the problems will take and so your brain can start working on them as a background process. :)

Problem One: Constructing DFAs

Download the starter files for Problem Set Six and extract them somewhere convenient on your system. Run the program to access the automaton editor, which you’ll use throughout the problem set. The provided program offers this functionality:

Automaton Editor: Use this tool to design automata. Double-click to add a state. Add transitions by hovering over the source state, clicking the blue region in the border, and dragging to the destination state.
Automaton Tester: A tool you can use to test your automata on various input strings. Enter strings in the text box on the left, and the tool will show you which of those strings your automata accepts and which of those strings it rejects. You can follow each string with what you expect the automaton to do, and the tool will tell you whenever the expected output doesn’t match the actual output. For example, you might start with these test strings for the automaton in part (i):
- BBIIB yes
- BBB no
Automaton Debugger: A tool for single-stepping through the operation of an automaton on a string. You can use this to see how your automata process individual strings.

There’s also an automated test suite you can use to check your work.

For each of the following languages over the indicated alphabets, construct a DFA that accepts precisely the strings that are in the indicated language. Your DFA does not have to have the fewest number of states possible, though for your own edification we’d recommend trying to construct the smallest DFAs possible. Use the Automaton Editor to compose your answers, and submit your finished automata to Gradescope.

There are many ways to tile a $2 \times 8$ checkerboard with dominoes, two of which are shown here:

Notice that the horizontal dominoes must appear as stacked pairs (do you see why?) We can read each tiling from left to right as a string made from the characters I and B, where I denotes “a vertical domino” and B denotes “two horizontal dominoes.” The top tiling here would be represented as BIBIII and the bottom tiling as IBBBI.

Let $\Sigma$ be the alphabet $\Set{\mathtt{B}, \mathtt{I}}$. Construct a DFA for the language { $w \in \Sigma^* \suchthat w$ represents a domino tiling of a $2 \times 8$ checkerboard }.

You’re taking a walk with your dog along a straight-line path. Your dog is on a leash of length two, so the distance between you and your dog can be at most two units. You and your dog start at the same position. Consider the alphabet $\Sigma = \Set{\mathtt{y}, \mathtt{d}}$. A string in $\Sigma^*$ can be thought of as a series of events in which either you or your dog moves forward one unit. For example, the string yydd means you take two steps forward, then your dog takes two steps forward.

Let $L$ be the language { $w \in \Sigma^* \suchthat w$ describes a series of steps where you and your dog are never more than two units apart }. Construct a DFA for $L$.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$. Construct a DFA for the language $L$ = { $w \in \Sigma^* \suchthat w$ contains the same number of instances of the substring ab and the substring ba }. Note that substrings are allowed to overlap, so we have $\mathtt{aba} \in L$ (one copy of each substring) and $\mathtt{babab} \in L$ (two copies of each substring).

Let $\Sigma = \Set{\mathtt{a}, \mathtt{c}, \mathtt{m}, \mathtt{o}}$. Construct a DFA for the language $L$ = { $w \in \Sigma^* \suchthat w$ contains the word cocoa as a substring }. For example, $\mathtt{mm\underline{cocoa}mm} \in L$ and $\mathtt{\underline{cocoa}} \in L$, but $\mathtt{c} \notin L$, $\mathtt{cmomcmoma} \notin L$ (though cocoa is a subsequence of cmomcmoma, it’s not a substring), and $\varepsilon \notin L$.

Some trickier cases to watch for: $\mathtt{ccc\underline{cocoa}} \in L$, and $\mathtt{co\underline{cocoa}} \in L$.

Problem Two: Constructing NFAs

For each of the following languages over the indicated alphabets, use the Automaton Editor to design an NFA that accepts precisely the strings that are in the indicated language. As before, while you don’t have to design the smallest NFAs possible, we recommend that you try to keep your NFAs small both to make testing easier and for your own edification.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}, \mathtt{c}}$. Construct an NFA for { $w \in \Sigma^* \suchthat w$ ends in a, bb, or ccc }.

While it’s possible to do this completely deterministically, it’s a bit easier if you use the “guess-and-check” framework we talked about in class.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}, \mathtt{e}}$. Construct an NFA for the language $L$ = { $w \in \Sigma^* \suchthat $ the letters in $w$ are sorted alphabetically }. For example, $\mathtt{abcde} \in L$, $\mathtt{bee} \in L$, $\mathtt{a} \in L$ and $\varepsilon \in L$, but $\mathtt{decade} \notin L$.

You could do this deterministically, but that will require a lot of transitions. You can dramatically reduce that number by using ε-transitions strategically.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}, \mathtt{e}}$. Construct an NFA for the language { $w \in \Sigma^* \suchthat$ the last character of $w$ appears nowhere else in $w$, and $\abs{w} \ge 1$ }.

This problem is all about embracing nondeterminism. Use the “guess-and-check” framework. What information would you want to guess? How would you check it? Please don’t try to solve this problem by building a DFA for this language; you’ll need at least 50 states if you tried to approach things this way.

Stuck? Try reducing the alphabet to two or three letters and see if you can solve that version.

Let $\Sigma = \Set{\mathtt{a}, \mathtt{b}}$. Construct an NFA for the language $L$ = { $w \in \Sigma^* \suchthat w$ contains at least two b's with exactly five characters between them }. For example, $\mathtt{\underline{b}aaaaa\underline{b}} \in L$, $\mathtt{aabaa\underline{b}aaabb\underline{b}} \in L$, and $\mathtt{a\underline{b}bbbba\underline{b}aaaaaaab} \in L$, but $\mathtt{bbbbb} \notin L$, $\mathtt{bbbab} \notin L$, and $\mathtt{aaabab} \notin L$.

The smallest DFA for this language is huge, so please don’t solve this one deterministically. Embrace the nondeterminism! What would you like to guess? How would you check that your guess is right?

Problem Three: $\powerset{\Sigma^*}$

Let $\Sigma$ be an alphabet. Give a short English description of the set $\powerset{\Sigma^*}$.

We think that there is a single “best answer” that makes appropriate use of the terminology from our lectures on formal languages. You should be able to describe the set in at most ten words.

Problem Four: Much Ado About Nothing, Part II

We’ve covered a bunch of terminology in the past week. This problem is designed to help you make sense of what all these terms mean.

Are there any languages $L$ where $\varepsilon \in L$? If so, give an example of one. If not, explain why not.

Are there any languages $L$ where $\varepsilon \notin L$? If so, give an example of one. If not, explain why not.

Are there any languages $L$ where $\varepsilon \subseteq L$? If so, give an example of one. If not, explain why not.

Are there any languages $L$ where $\varepsilon \not\subseteq L$? If so, give an example of one. If not, explain why not.

Does $\emptyset = \varepsilon$? Briefly explain your answer.

Does $\emptyset = \Set{\varepsilon}$? Briefly explain your answer.

Problem Five: Error-Correcting Codes

You're probably familiar with the idea that, inside the computer, all data are stored as strings of zeroes and ones. Data sent from computer to computer over a network similarly encodes everything with 0s and 1s.

Suppose you want to transmit text from your computer to another. Your computer needs to convert the text into a series of 0s and 1s, send those 0s and 1s across a network, and have the receiving computer convert the 0s and 1s back to text. The most common way to do this is the following. First, each computer agrees, in advance, on a fixed "coding table" that associates each letter with a codeword made of 0s and 1s. Your computer then converts your text string into 0s and 1s by replacing each letter with its codeword, then sends the resulting binary string across the network. The receiving computer then converts the codewords back to the letters they correspond to.

Let's see an example. To keep things simple, suppose all messages you send are made of just the first 15 letters of the alphabet, plus a space character. Your computer and the receiving computer might agree on the following coding table in advance:

\[\begin{array}{c|c} \text{Letter} & \text{Binary Code} \\ \hline \mathtt{A} & \mathtt{0000} \\ \mathtt{B} & \mathtt{0001} \\ \mathtt{C} & \mathtt{0010} \\ \mathtt{D} & \mathtt{0011} \\ \mathtt{E} & \mathtt{0100} \\ \mathtt{F} & \mathtt{0101} \\ \mathtt{G} & \mathtt{0110} \\ \mathtt{H} & \mathtt{0111} \\ \mathtt{I} & \mathtt{1000} \\ \mathtt{J} & \mathtt{1001} \\ \mathtt{K} & \mathtt{1010} \\ \mathtt{L} & \mathtt{1011} \\ \mathtt{M} & \mathtt{1100} \\ \mathtt{N} & \mathtt{1101} \\ \mathtt{O} & \mathtt{1110} \\ \mathtt{\_} & \mathtt{1111} \\ \end{array}\]

Suppose you want to send the message HELLO across the network. Your computer would then send this series of 0s and 1s:

\[\underbrace{\mathtt{0111}}_{\mathtt{H}} \underbrace{\mathtt{0100}}_{\mathtt{E}} \underbrace{\mathtt{1011}}_{\mathtt{L}} \underbrace{\mathtt{1011}}_{\mathtt{L}} \underbrace{\mathtt{1110}}_{\mathtt{O}}\]

The receiving computer could then decode the message by breaking the string apart into blocks of four bits each (each codeword is four bits long), decoding each according to the table.

You're the receiving computer and get this message:

0010111011010101100000110100

What text message were you sent? No justification is necessary.

In practice, however, things get a bit more complicated. Specifically, when messages are sent from one computer to another, sometimes individual bits will flip, with 0s getting turned into 1s and vice-versa. This can change the content of a message. For example, suppose you want to send the message HELLO. As above, this looks like this:

However, during transmission, the tenth bit gets flipped. That causes the message to arrive like this:

\[\underbrace{\mathtt{0111}}_{\mathtt{H}} \underbrace{\mathtt{0100}}_{\mathtt{E}} \underbrace{\mathtt{1{\color{red}1}11}}_{\color{red}\mathtt{\_}} \underbrace{\mathtt{1011}}_{\mathtt{L}} \underbrace{\mathtt{1110}}_{\mathtt{O}}\]

As a result, the wrong message (HE LO) arrives. From context we can figure out that something is probably wrong here, but in general the computer won't have a way of knowing this. How can we address this?

One option is to use an error-correcting code. An error-correcting code is a collection of codewords such that if any single bit is flipped within a codeword, the receiving computer can

determine that a bit has been flipped in the codeword, and then
determine what the original codeword was.

Here's an example. Suppose we use these new codewords for our letters:

\[\begin{array}{c|c} \text{Letter} & \text{Binary Code} \\ \hline \mathtt{A} & \mathtt{0000000} \\ \mathtt{B} & \mathtt{1110000} \\ \mathtt{C} & \mathtt{1001100} \\ \mathtt{D} & \mathtt{0111100} \\ \mathtt{E} & \mathtt{0101010} \\ \mathtt{F} & \mathtt{1011010} \\ \mathtt{G} & \mathtt{1100110} \\ \mathtt{H} & \mathtt{0010110} \\ \mathtt{I} & \mathtt{1101001} \\ \mathtt{J} & \mathtt{0011001} \\ \mathtt{K} & \mathtt{0100101} \\ \mathtt{L} & \mathtt{1010101} \\ \mathtt{M} & \mathtt{1000011} \\ \mathtt{N} & \mathtt{0110011} \\ \mathtt{O} & \mathtt{0001111} \\ \mathtt{\_} & \mathtt{1111111} \\ \end{array}\]

Now, our message HELLO looks like this:

\[\underbrace{\mathtt{0010110}}_{\mathtt{H}} \underbrace{\mathtt{0101010}}_{\mathtt{E}} \underbrace{\mathtt{1010101}}_{\mathtt{L}} \underbrace{\mathtt{1010101}}_{\mathtt{L}} \underbrace{\mathtt{0001111}}_{\mathtt{O}}\]

Suppose that, in transit, one bit gets flipped in the first codeword, as shown here:

\[\underbrace{\mathtt{0010{\color{red}0}10}}_{\mathtt{?}} \underbrace{\mathtt{0101010}}_{\mathtt{E}} \underbrace{\mathtt{1010101}}_{\mathtt{L}} \underbrace{\mathtt{1010101}}_{\mathtt{L}} \underbrace{\mathtt{0001111}}_{\mathtt{O}}\]

When the receiving computer gets this message, it can tell that the first seven bits, 0010010, aren't a legal codeword. (You can see this because it's not anywhere in the table.) It can then ask - since a single bit got flipped, it must be the case that these seven bits are one bit different from something in the table given above. After some exploration, the computer can find that it's 0010110, which is the code for H. Moreover, no other codeword in the table differs from 0010010 by one bit. (Check this - isn't that neat?) The receiving computer can then pretend that it received the codeword for H instead of the garbled version it received, and successfully decode the message.

You are the receiving computer and get this message:

0001111011011100011000100010

What text message were you sent? No justification is necessary. Note that some of the bits may have flipped in transit, though we promise you that at most one bit has flipped per codeword received.

How did we choose the codewords in the second table so that any single bit flip can be detected and corrected? If you try coming up with something like this on your own, you'll figure out pretty quickly that it's a lot harder than it looks! Amazingly, it turns out that the concepts we've explored thus far this quarter are all we need to approach the task of finding good codewords.

The first major insight we need is that we can frame this as a graph problem. Let $\Sigma = \Set{\mathtt{0}, \mathtt{1}}\text.$ For any natural number $n \ge 0\text,$ we can define a graph $Q_n$ as follows:

\[Q_n = (\Sigma^n, \Set{ \Set{w, x} \suchthat w, x \in \Sigma^n \text{ and } w \text{ and } x \text{ differ in exactly one position }})\]

Draw the graphs $Q_0\text,$ $Q_1\text,$ $Q_2\text,$ and $Q_3\text.$

If you're very astute, you might recognize the shapes of these graphs from somewhere.

Here's why these graphs are useful. If we want to use codewords that are $n$ bits each, we can look at the graph $Q_n\text.$ If you start with any $n\text{-bit}$ string and flip any single bit, it corresponds from moving from one node in the graph to one of its neighbors. Our goal is to choose a collection of codewords (nodes) so that if you flip any bit in the codeword (move from a node to its neighbor), we can always uniquely identify which codeword we started from (which neighbor to move back to).

Here's how we can formalize this in the language of graph theory. Let $G = (V, E)$ be a connected graph and let $u, v \in V$ be nodes in $G\text.$ The distance between $u$ and $v\text,$ denoted $d_G(u, v)\text,$ is defined to be the length of the shortest walk from $u$ to $v$ in graph $G\text.$ (There can be many equally-short walks between $u$ and $v\text,$ in which case we take the length of any of them.)

What is $d_{Q_5}(\mathtt{11011}, \mathtt{01010})\text?$ No justification is necessary.

Don't draw the full graph $Q_5$ here; it has 32 nodes and 80 edges and most of it isn't useful to you. You may want to instead look at $Q_3$ and think about what paths look like in that graph. Once you've built up an intuition, see if you can figure out the length of the shortest walk between the two nodes here without drawing the graph.

We can then define a $k\text{-independent set}$ to be a set $I \subseteq V$ with the following property:

\[\forall u \in I. \forall v \in I. (u \ne v \to d_G(u, v) \ge k)\]

This definition generalizes the definition of an independent set that we covered in lecture.

Is $\Set{\mathtt{000000}, \mathtt{010101}, \mathtt{101010}}$ a 3-independent set of $Q_6\text?$ No justification is necessary.

Let $G = (V, E)$ be a connected graph and $I$ a 3-independent set of $G\text.$ Prove that for all $u \in V\text,$ there is at most one $v \in I$ where $\Set{u, v} \in E\text.$

Your result from part (vi) explains one way to come up with error-correcting codes. Start with the graph $Q_n$ for your favorite choice of $n\text.$ Then, somehow, select a large 3-independent set $I$ of $Q_n\text.$ Now, if you use the elements of $I$ as your codewords and any one bit in them gets flipped, the receiving computer can always determine what the original codeword was - after all, there's at most one choice of what it could be!

What is the largest 3-independent set of $Q_3\text?$ No justification is needed. There are several choices you could make here that are all equally large; we just need one of them.

Your result from part (vii) gives the simplest possible error-correcting code. Amazing!

This is just the tip of the iceberg when it comes to error-correcting codes. It's possible to build codes that are resilient even if two or more bits get flipped in a single codeword, for example. Beyond this, error-correcting codes have applications in networking, data center design, and more. To learn more, take CS250!

Problem Six: Monoids and Kleene Stars

There’s a special type of language called a monoid that pops up when exploring formal language theory. Specifically, we define monoids as follows: if $M$ is a language over $\Sigma$, then

\[M \text{ is a monoid} \qquad \text{if} \qquad \varepsilon \in M \text{ and } MM \subseteq M\text.\]

Let’s take a minute to get a handle on what monoids look like.

Give us two examples of monoids, one of which contains finitely many strings and one of which contains infinitely many strings. No justification is required.

Your next task is to prove an important result about powers of languages.

Let $L$ be a language over some alphabet $\Sigma$. Prove, by induction, that $L^nL^m = L^{m+n}$ for all natural numbers $m$ and $n\text.$ To do so, prove that the following predicate $P(n)$ is true for all natural numbers $n$:

$P(n)$ is the statement "for all $m \in \naturals$, we have $L^nL^m = L^{m+n}$."

Your proof should leverage the formal definition of language powers, which is reprinted below:
\[L^0 = \Set{\varepsilon} \qquad L^{n+1} = LL^n\]
Feel free to use the following facts, which you can state without proof:
- Language concatenation is associative: $A(BC) = (AB)C\text.$
- The language $\Set{\varepsilon}$ is the identity language for concatenation: $\Set{\varepsilon}L = L\Set{\varepsilon} = L\text.$
While intuitively we think of $L^k$ as “all strings made of $k$ strings from $L$,” the formal definition of $L^k$ is the inductive one given here. Use that formal definition rather than the higher-level intuition.

There is a deep, fundamental connection between Kleene stars and monoids.

Let $L$ be an arbitrary language over $\Sigma$. Prove that $L^\star$ is a monoid over $\Sigma$. In the course of writing this proof, please call back to the formal definition of language concatenation and the Kleene star:
\[L^\star = \Set{w \suchthat \exists n \in \naturals. w \in L^n }\] \[L_1L_2 = \Set{ x \suchthat \exists w_1 \in L_1. \exists w_2 \in L_2. x = w_1w_2 }\]

This is a great spot to write out two columns, one for “what I’m assuming” and one for “what I need to show.” Again, don’t assume anything about language powers or concatenation without first proving it.

Remember that to prove $S \subseteq T$, you should pick an arbitrary $x \in S$ and then prove $x \in T$.

Let $L$ be a language over $\Sigma$ and let $M$ be a monoid over $\Sigma$. Prove that if $L \subseteq M$, then $L^\star \subseteq M$.

This problem is all about finding the right way to formalize things. Break the task down into smaller pieces and make sure you organize everything in a way that makes the logical flow easy to read and rigorously covers all cases. Once you have the setup put together, dive in and fill out each section.

Parts (iii) and (iv) of this problem collectively say “$L^\star$ is the language formed by adding the fewest strings to $L$ to make it a monoid.” To see why, note that $L^\star$ is a monoid, and we can’t take any strings out of it and be left with a monoid, since $L^\star$ is a subset of all monoids containing $L$. That’s a very different perspective on the Kleene star than what we started with!

Monoids, in a slightly more general sense than what you've seen here, have applications in parallel computing, functional programming, and theoretical CS. Take CS149 and CS242 for details!

Optional Fun Problem: Dual-Use Automata

Let $L_1$ and $L_2$ be regular languages. Prove there is a DFA $D$ with the following property: $D$ recognizes $L_1\text,$ but by changing which states of $D$ are accepting, $D$ can be made to recognize $L_2\text.$ (You cannot change which state of $D$ is the start state, or where the transitions are, or add/remove states. All you can do is change which states are accepting.)