Assignment 2. Fun with Collections


Due Friday, January 24 at 1:00 pm

Each student begins with four late days that may be used throughout the quarter. You may submit this assignment 24 hours late by using one late day or 48 hours late using two late days. No submissions will be accepted more than 48 hours after the due date without prior approval by the head TA. See the syllabus for more information about our late policies.

All due dates and submission times are expressed in Pacific time.

This assignment must be completed individually. Working in groups is not permitted.


This assignment is all about the amazing things you can do with collections. It’s a two-parter. The first part is a program that can guess the language a piece of text is written in. The second part models rising sea levels on a sampler of realistic terrains. By the time you've completed this assignment, you'll have a much better handle on the container classes and how to use different types like queues, maps, vectors, and sets to model and solve problems. Plus, you'll have built some things we think you'll love to share with your friends and family.

This assignment has two parts. It will be quite a lot to do if you start the night before it’s due, but if you make slow and steady progress each day you should be in great shape. Here’s our recommended timetable:

  • Aim to complete Rosetta Stone within four days of the assignment going out.
  • Aim to complete Rising Tides within seven days of this assignment going out.

As always, feel free to reach out to us if you have questions. Feel free to contact us on EdStem, to email your section leader, or to stop by the LaIR.

Assignment Logistics

Starter Files

We provide a ZIP of the starter project. Download the zip, extract the files, and double-click the .pro file to open the project in Qt Creator.

📦 Starter code

Resources

Feel free to check out our Python-to-C++ guide if you're moving from Python to C++. Also, check out our style guide, guide to testing, and debugging guide.

Getting Help

Keep an eye on the Ed forum for an announcement of the Assignment 2 YEAH (YEAH = Your Early Assignment Help) group session where our veteran section leaders will answer your questions and share pro tips. We know it can be daunting to sit down and break the barrier of starting on a substantial programming assignment – come to YEAH for advice and confidence to get you on your way!

We're also here to help if you run into issues along the way! The Ed forum is open 24/7 for general discussion about the assignment, lecture topics, the C++ language, using Qt, and more. Always start by searching first to see if your question has already been asked and answered before making a new post.

To troubleshoot a problem with your specific code, your best bet is to bring it to the LaIR helper hours or office hours.

Part One: Rosetta Stone

Rosetta Stone is a joint project with Katie Creel, Diana Navas, and Wanheng Hu, Embedded Ethics Extraordinaires. Thanks to Richard Lin, Patricia Wei, Jin-Hee Lee, Neha Chetry, Nuhu Osman Attah, and Rose Novick for providing sample texts.

Have you ever navigated your web browser to a site not written in your native language? If so, you might have seen something like this in the corner:

A message from Google Chrome that indicates the page being visited is written in Haitian Creole and offering to translate it to English.

The browser has automatically detected that the website I visited is written in Haitian Creole, while the browser itself is set to English.

How does the browser know what language a webpage is written in? This is the language identification problem, which has been studied extensively for at least the last thirty years. In this part of the assignment, you’ll implement a simple but effective language identification algorithm, gaining experience working with the Map container type in the process.

Background: Trigrams

The approach we’ll be using for language identification is based on trigram profiles. To understand what a trigram profile is, let’s do an example. Suppose we have this (admittedly silly) piece of text:

A BANANA BANDANA

The trigram profile for this string is a collection of all its length-three substrings, along with the number of times that length-three substring appears. Specifically, it’s this group:

  • "ANA": 3
  • " BA": 2
  • "A B": 2
  • "BAN": 2
  • "AND": 1
  • "DAN": 1
  • "NA ": 1
  • "NAN": 1
  • "NDA": 1

That is, the pattern ANA appears three times, the pattern BAN appears twice, and the pattern NDA appears exactly once. These substrings are called trigrams. (More generally, the k-grams of a text are all of the substrings of that text that have length exactly k.)

Trigram profiles are useful for language identification because the patterns of trigrams that appear in a long piece of text often heavily reflect the language that text was written in. For example, here are the top trigrams from a profile of an English text (specifically, the Wikipedia article “Human”):

  • " th": 667
  • "the": 616
  • "he ": 533
  • " an": 497
  • "nd ": 492
  • "and": 470
  • "ion": 423
  • " of": 376
  • " in": 375
  • "of ": 363
  • "tio": 333
  • "ed ": 320
  • "ing": 318
  • "man": 289
  • "ng ": 288

That top entry, " th", indicates that the text has a lot of words that start with th. You can see that the word "the" is extremely common, as are the suffixes "ing" and "ion".

Contrast this with a set of trigrams taken from a piece written in Malay:

  • "an ": 184
  • "ang": 82
  • " da": 73
  • " me": 71
  • "ng ": 71
  • "dan": 52
  • "kan": 52
  • "sia": 52
  • "ia ": 47
  • "men": 42
  • " be": 40
  • " pe": 39
  • " ya": 39
  • "yan": 39
  • " ma": 38

Or this set of trigrams from a piece written in Spanish:

  • " de": 531
  • "os ": 396
  • "de ": 374
  • "ent": 298
  • " la": 293
  • "es ": 277
  • "la ": 239
  • "el ": 232
  • " co": 217
  • " es": 208
  • "en ": 198
  • "ien": 198
  • "nte": 196
  • "as ": 193
  • " en": 185

Or this piece written in Slovene:

  • "je ": 82
  • " po": 65
  • " pr": 45
  • "ih ": 45
  • "anj": 43
  • "nos": 43
  • " na": 39
  • "ost": 39
  • "ove": 37
  • "lov": 36
  • " je": 34
  • " ra": 33
  • "raz": 33
  • "in ": 32
  • "sti": 32

Or this one in Hausa:

  • "ar ": 93
  • "an ": 85
  • " da": 72
  • "in ": 69
  • "da ": 58
  • "iya": 42
  • " a ": 39
  • " ka": 38
  • " ta": 38
  • "ara": 38
  • " sh": 35
  • " wa": 35
  • "a s": 35
  • "ana": 33
  • "na ": 33

As you can see, the relative trigram frequencies differ markedly from one language to the next. This makes it possible to guess a text’s language by computing its trigram profile and then finding which language’s profile it most resembles.

Background: UTF-8

Before we dive into the actual coding part, we have to address an issue you’ll encounter almost immediately when working on this assignment.

Here’s what happens if we extract trigrams from a piece of text written in Vietnamese:

  • "ng ": 591
  • "h\341\273": 442
  • " \304\221": 371
  • "\260\341\273": 311
  • "\306\260\341": 311
  • " th": 307
  • " nh": 276
  • " ng": 251
  • "\303\240 ": 231
  • " tr": 229
  • "\341\273\235": 219
  • "\304\221\341": 211
  • "nh ": 205
  • "i\341\273": 200

And here’s some trigrams from Bulgarian text:

  • "\320\275\320": 835
  • "\321\202\320": 701
  • "\320\276\320": 607
  • "\320\260 ": 551
  • "\321\200\320": 517
  • "\320\265\320": 491
  • "\320\260\320": 450
  • "\320\270\320": 438
  • "\320\262\320": 433
  • "\320\270\321": 395
  • "\260 \320": 356
  • "\320\270 ": 349
  • "\320\276\321": 348
  • "\320\260\321": 321

And here's some from Farsi:

  • "\330\247\331": 562
  • "\330\247\330": 529
  • "\247\331\206": 391
  • "\331\206\330": 361
  • "\330\263\330": 320
  • " \330\247": 306
  • "\333\214 ": 287
  • "\331\210\330": 281
  • "\333\214\330": 256
  • " \330\250": 241
  • "\331\207 ": 240
  • "\331\206 ": 235
  • "\330\257\330": 231
  • "\330\261\330": 230

What’s going on here? Why are we seeing lots of numbers and slashes rather than actual characters?

The reason has to do with two historical artifacts of C++ colliding with contemporary realities. C++ is an older language: it dates back to the 1980s, a time when computer networks were small and memory was scarce. To save space, most C++ implementations only allowed for 256 different values for the char type. And that was fine, since computers were often used in contexts where the English alphabet (plus a few added characters for other languages) was all that was needed. You could easily get away with having only 256 possible different characters, and the space savings were key at a time when even a 1GB hard drive would be considered a luxury.

But now that memory is much cheaper and people all over the world are connecting and sharing text in their native languages, this restriction on the char type is more of a liability than an asset. Restricting the char type to only have 256 possible values means that there are more Chinese characters than there are possible char values – not to mention more emojis than chars.

To address this, a compromise was reached – while (for backwards compatibility) English letters would take up just one char value, glyphs from other languages would be encoded as a sequence of multiple consecutive chars representing a single glyph. And, as a strange side-effect, the individual chars making up those glyphs can’t easily be displayed in isolation. They are often displayed as \NNN. For example, the sequence \331\206 corresponds to the Arabic character ن, while \320\270 corresponds to the Cyrillic letter и. This system is called UTF-8.
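
Here’s a tiny illustration of this (assuming your source file is saved as UTF-8, which is the typical setup):

string accented = "é";              // looks like a single character...
cout << accented.length() << endl;  // ...but likely prints 2: é takes two char values in UTF-8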

In the context of this assignment, this means that a “trigram” formed from a piece of text might actually consist of a fractional character. For example, the ancient Phoenician character 𐤄 is encoded as a sequence of four char values, so trigrams from Phoenician text would often contain quarters or halves of individual Phoenician letters. While in general it is not okay to subdivide characters this way (many systems will display error characters if you try printing out a partial glyph), for the purposes of this assignment you should always form trigrams from a group of three char values, regardless of what fraction of a glyph those three chars represent.

To summarize:

  1. Don’t be surprised if you see sequences like \302 or \225 appearing in your trigrams. That’s a consequence of using characters that 1980s Americans wouldn’t have expected to see on their computers. It’s a complete historical artifact, but one we have to live with today.

  2. For the purposes of this assignment, it is perfectly okay – and in fact, expected – to always form trigrams from groups of three char values, rather than trying to segment text at the actual boundaries between glyphs. Surprisingly, this still works well in practice.

  3. Outside of the context of CS106B, be aware that char does not mean “a character in any language.” Some glyphs require multiple characters to encode. The “correct” way to handle these sorts of strings is to use a library that properly breaks text apart into its individual units.

Milestone One: Form k-Grams

Your first task in this problem is to implement a function

Map<string, double> kGramsIn(const string& text,
                             int kGramLength);

that takes as input a string, then returns a Map<string, double> containing the frequencies of all the k-grams of length kGramLength. Each key should be a k-gram (a substring of length kGramLength), and each value should be the number of times that this k-gram appears in the string. For convenience, any k-gram that doesn’t appear in the text should not be present in the map, not even with value zero.

You cannot assume that the kGramLength parameter is positive (for example, it might be 0 or -137). If the kGramLength parameter is not positive (that is, it's zero or negative), you should call the special function error, defined in "error.h", to report an error. You can call this function as follows:

  error("Some error message");

The error function acts like an emergency abort and will immediately exit the function with an error.

The input string might be shorter than kGramLength. If that happens, there are no k-grams, and you should return an empty Map.
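
If you'd like to check your understanding of this spec before writing any code, here's a sketch of a custom test, assuming the STUDENT_TEST, EXPECT_EQUAL, and EXPECT_ERROR macros from the SimpleTest framework you used in Assignment 1:

STUDENT_TEST("kGramsIn handles a small example and rejects bad lengths") {
    /* The 2-grams of "ABAB" are "AB" (twice) and "BA" (once). */
    Map<string, double> expected;
    expected["AB"] = 2;
    expected["BA"] = 1;
    EXPECT_EQUAL(kGramsIn("ABAB", 2), expected);

    /* A string shorter than the k-gram length has no k-grams at all. */
    Map<string, double> empty;
    EXPECT_EQUAL(kGramsIn("AB", 3), empty);

    /* Nonpositive k-gram lengths should trigger a call to error(). */
    EXPECT_ERROR(kGramsIn("ABAB", 0));
}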

Milestone One Requirements

  1. Implement the kGramsIn function in RosettaStone.cpp.
  2. Use the provided tests to confirm that your code works correctly.
  3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!

You might be wondering why we’re having you return a Map<string, double> here rather than a Map<string, int>. After all, the frequencies here are all integers, and int seems like a more appropriate type. There are two reasons why we’re having you do this:

  1. In a later step, you will be dividing all the entries of this map to make them between 0 and 1, inclusive. Working with a Map<string, double> here makes that step a bit easier.
  2. Later on, you’ll be doing a calculation involving multiplying each of the terms in the map by themselves. The int type is (usually) limited to a range from roughly negative two billion to roughly positive two billion, and some of the entries in the map, when squared, exceed that range. On the other hand, doubles can comfortably handle values outside this range.

Some notes on this problem:

  • There's an Endearing C++ Quirk you may encounter in this part of the assignment. The .length() and .size() functions on a string return what's called an unsigned integer. These are integer values that are not permitted to be negative, which means that they can have unusual interactions with subtraction. For example, if you have a string of length 2 and compute str.length() - 3, instead of getting -1, you'll get a gigantic positive number (this is called integer overflow - take CS107 for details!). As a result, avoid subtraction when working with .length() and .size(). If you're tempted to subtract something from the length of a string, instead see if you can add something to another quantity. (There's a short example of this after these notes.)

  • Just to make sure the notion of a k-gram is clear, the 3-grams of the string "ABCDEF" are "ABC", "BCD", "CDE", and "DEF"; the 4-grams of the string "ABCDEF" are "ABCD", "BCDE", and "CDEF"; the only 6-gram of "ABCDEF" is "ABCDEF"; the six 1-grams of "ABCDEF" are "A", "B", "C", "D", "E", and "F"; and there are no 137-grams of the string "ABCDEF".

  • To make sense of the syntax here, the name of the function you're writing is kGramsIn, and the function return type is Map<string, double>. This does not mean that there is an actual Map<string, double> named kGramsIn. If you want to create a Map<string, double> as a local variable, feel free to do so, but pick a more descriptive name for it.

  • You should call the error function if the k-gram length is zero or negative, because there's no meaningful way to define a 0-gram or a -137-gram. However, if the k-gram length is greater than the length of the input string, you should return an empty Map, since in general large k-grams are a meaningful thing to talk about and there just don't happen to be any for a shorter string.

  • If you're looking for information about the syntax of different operations on our container types, check out the Stanford C++ Library Documentation page.
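
Here is the example promised above: the .length() quirk and the usual workaround (a generic illustration, not assignment code):

string str = "Hi";                 // length 2

/* .length() is unsigned, so 2 - 3 wraps around to a gigantic positive number. */
cout << str.length() - 3 << endl;  // does NOT print -1

/* Safer: move the subtraction across the comparison as an addition.
 * Here 0 + 3 <= 2 is false, so the loop body never runs and nothing overflows. */
for (int i = 0; i + 3 <= str.length(); i++) {
    cout << str.substr(i, 3) << endl;
}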

Milestone Two: Normalize Frequencies

We now have a Map<string, double> representing the trigrams in the original text. Once we have these frequencies, we can guess the text’s language by seeing which language’s trigram distribution it’s most similar to. There are many ways to define what “most similar to” means, but one of the most common ways to do this uses something called the cosine similarity, which is described below.

First, an observation. Here are the trigram distributions for three pieces of text, each of which are written in English:

Distribution 1

  • " th": 219
  • "the": 171
  • "he ": 113
  • "ing": 79
  • " of": 76
  • "ng ": 76
  • "of ": 69
  • " to": 62
  • "at ": 60
  • "nd ": 60
  • "acc": 58
  • "cin": 58
  • "cci": 57
  • "to ": 57
  • "and": 54
  • "re ": 54
  • " be": 53
  • " in": 53
  • "e t": 53

Distribution 2

  • " th": 495
  • "the": 388
  • "he ": 351
  • "ing": 231
  • "nd ": 194
  • "ng ": 193
  • " a ": 159
  • "ed ": 158
  • " to": 157
  • " I ": 154
  • " an": 153
  • "and": 152
  • "to ": 146
  • "her": 134
  • "er ": 128
  • "e t": 126
  • "re ": 124
  • "is ": 121
  • " of": 117

Distribution 3

  • " th": 510
  • "he ": 371
  • "the": 339
  • "ed ": 269
  • " to": 257
  • " an": 222
  • "to ": 217
  • "ing": 208
  • "nd ": 205
  • "ng ": 196
  • "and": 195
  • " in": 187
  • "at ": 175
  • "hat": 174
  • " a ": 171
  • " of": 159
  • "is ": 158
  • "as ": 154
  • "tha": 154

In each case, we can see that the very most common trigrams are pretty much the same across the three texts (though the ordering is a bit different). However, notice that the numbers are generally increasing as we move from the first document to the third. The reason for that is simple – these were computed from progressively longer texts.

To make the trigram frequencies more consistent across texts of different lengths, you will need to normalize the scores. Borrowing a technique from linear algebra, we’ll normalize scores as follows:

  • Add up the squares of all the frequencies in the map.
  • Divide each frequency by the square root of this number.

As a very simplified example, suppose we have these trigrams:

  • "aaa": 3
  • "baa": 1
  • "aab": 1

We’d compute the number $3^2 + 1^2 + 1^2 = 11$, then divide each number by the square root of eleven to get these new frequencies:

  • "aaa": 0.904534
  • "baa": 0.301511
  • "aab": 0.301511

Applying this same approach to the three distributions from above (including the frequencies of the less frequent k-grams that we didn’t show) gives the following:

Distribution 1

  • " th": 0.40398
  • "the": 0.31503
  • "he ": 0.20755
  • "ing": 0.14639
  • " of": 0.14083
  • "ng ": 0.14083
  • "of ": 0.12786
  • " to": 0.11489
  • "at ": 0.11118
  • "nd ": 0.11118
  • "acc": 0.10562
  • "cin": 0.10562
  • "to ": 0.10562
  • "cci": 0.10377
  • "and": 0.10007
  • "re ": 0.10007
  • " be": 0.09821
  • "e t": 0.09821
  • "ed ": 0.09821

Distribution 2

  • " th": 0.37585
  • "the": 0.29460
  • "he ": 0.26651
  • "ing": 0.17539
  • "nd ": 0.14730
  • "ng ": 0.14654
  • " a ": 0.12072
  • "ed ": 0.11997
  • " to": 0.11921
  • " I ": 0.11693
  • " an": 0.11617
  • "and": 0.11541
  • "to ": 0.11085
  • "her": 0.10174
  • "er ": 0.09719
  • "e t": 0.09567
  • "re ": 0.09415
  • "is ": 0.09187
  • " of": 0.08883

Distribution 3

  • " th": 0.33214
  • "he ": 0.24161
  • "the": 0.22077
  • "ed ": 0.17518
  • " to": 0.16737
  • " an": 0.14458
  • "to ": 0.14132
  • "ing": 0.13546
  • "nd ": 0.13350
  • "ng ": 0.12764
  • "and": 0.12699
  • " in": 0.12178
  • "at ": 0.11397
  • "hat": 0.11331
  • " a ": 0.11136
  • " of": 0.10355
  • "is ": 0.10289
  • "as ": 0.10029
  • "tha": 0.10029

With the scores normalized, these numbers are much closer to one another, indicating that they’re using fairly similar distributions of trigrams.

Your next task is to implement a function

Map<string, double> normalize(const Map<string, double>& profile);

that takes as input a trigram profile, then returns a normalized version of the profile.

There is an important edge case you need to handle. If the input Map is empty or does not contain any nonzero values, then the calculation we’ve instructed you to do above would divide by zero. (Do you see why?) To address this, if the map doesn’t contain at least one nonzero value, you should use the error() function to report an error.

Milestone Two Requirements

  1. Implement the normalize function in RosettaStone.cpp.

  2. Use the provided tests to confirm that your code works correctly.

  3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!

Some notes on this problem:

  • There are no restrictions on what the keys in the map can be. Although you’ll ultimately be using this function in the context of trigrams, you should not assume that the keys will necessarily be strings of length three, nor that the keys could be a set of trigrams for a piece of text. Similarly, while it’s never possible to have a negative number of copies of a given trigram, your code should be able to handle negative values. In each case, sum up the squares of all the values, then divide each of the values by the square root of that sum.

  • The keys in the resulting map should be the same as the keys in the input map. In other words, you should not add or remove keys.

  • For those of you coming from Python: C++ doesn’t have a built-in exponentiation operator like **. Instead, to compute powers or square roots, include the header <cmath> and use the pow and sqrt functions. (There's a short sketch of this after these notes.)

  • If you're looking for information about the syntax of different operations on our container types, check out the Stanford C++ Library Documentation page.
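
Here is the sketch promised above: a small, generic illustration (not assignment code) of looping over a Stanford Map and of using sqrt from <cmath>. The helper name sumOfValues is made up for this example.

#include <cmath>   // for sqrt and pow

/* Adds up all of the values stored in a map. */
double sumOfValues(const Map<string, double>& values) {
    double total = 0;
    for (string key : values) {     // a range-based for loop visits each key
        total += values.get(key);   // .get() looks up the value for that key
    }
    return total;                   // e.g., sqrt(total) would then come from <cmath>
}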

Milestone Three: Filter Out Uncommon Trigrams

While all texts written in a particular language will show a strong degree of similarity between their most frequent trigrams, the infrequent trigrams in two pieces of text written in the same language often differ greatly. For example, these infrequent trigrams often reflect uncommon words that appear a few times in a text.

To address this, we’re going to ask you to write a function

Map<string, double> topKGramsIn(const Map<string, double>& profile,
                                int numToKeep);

that takes as input a k-gram profile and a maximum number of k-grams to keep, then returns a Map containing only the numToKeep most frequent k-grams in the profile. If numToKeep is bigger than the number of k-grams in the profile, you should return all the original k-grams. And if numToKeep is negative, you should report an error(), since that’s not feasible. (If numToKeep is zero, you should just return an empty map.)
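
For example, here's a sketch of a test you could write for this behavior, assuming the SimpleTest macros from Assignment 1:

STUDENT_TEST("topKGramsIn keeps only the most frequent k-grams") {
    Map<string, double> profile;
    profile["ANA"] = 3;
    profile["BAN"] = 2;
    profile["NDA"] = 1;

    /* Keeping the top two k-grams should drop "NDA". */
    Map<string, double> expected;
    expected["ANA"] = 3;
    expected["BAN"] = 2;

    EXPECT_EQUAL(topKGramsIn(profile, 2), expected);
}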

To do this, you’ll need some way to work with k-grams sorted by frequency. For that, we’ll introduce a new, useful container type that you’ll see later on in the quarter: a priority queue. If you include the header file "priorityqueue.h", you’ll get access to the PriorityQueue container type.

Like a regular queue, the PriorityQueue supports the enqueue, dequeue, peek, size, and isEmpty operations. The difference between a priority queue and a regular queue is the order in which the elements are dequeued. In a regular Queue, elements are lined up in sequential order, and calling dequeue() or peek() gives back the element added the longest time ago. In the PriorityQueue, each time you enqueue an element, you associate it with a number called its weight. The element that’s returned by dequeue() or peek() is the element that has the lowest weight. For example, let’s imagine we set up a PriorityQueue like this, holding a collection of past CS TAs:

PriorityQueue<string> pq; 
pq.enqueue("Amy",    103);
pq.enqueue("Ashley", 101);
pq.enqueue("Anna",   110);
pq.enqueue("Luna",   161);

If we write

string elem = pq.dequeue();
cout << elem << endl;      

then the string printed out will be "Ashley", since of the four elements enqueued, the weight associated with Ashley (101) was the lowest. Calling pq.dequeue() again will return "Amy", since her associated weight (103) was the lowest of what remains. Calling pq.dequeue() a third time would return "Anna".

Let’s insert more values. If we call pq.enqueue("Chioma", 103) and then pq.dequeue(), the return value would be the newly-added "Chioma", because her associated weight is lower than all others. And note that Chioma was the most-recently-added TA here; just goes to show you that this is quite different from a regular Queue!
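
Dequeuing everything from a priority queue hands back its elements from lowest weight to highest. For instance (a generic illustration with made-up entries, not assignment code):

PriorityQueue<string> example;
example.enqueue("highest", 30);
example.enqueue("lowest",  10);
example.enqueue("middle",  20);

/* Prints "lowest", then "middle", then "highest". */
while (!example.isEmpty()) {
    cout << example.dequeue() << endl;
}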

We’re going to leave it to you to determine how to use the PriorityQueue type to find the most frequent k‑grams. There are many strategies you can use to do this. Something to think about: PriorityQueues are great at removing items with low weight, and you want to find items of high weight.

Milestone Three Requirements

  1. Implement the topKGramsIn function in RosettaStone.cpp.
  2. Use the provided tests to confirm that your code works correctly.
  3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!

Some notes on this problem:

  • Your solution needs to make use of the PriorityQueue type. While there are other ways to sort things in C++, for the purposes of this problem we'd prefer if you avoid them.

  • If multiple k-grams are tied for having the same weight, you can break ties however you’d like.

  • The input frequencies may or may not be normalized.

  • Although this wouldn’t be possible in a k-grams setting, this function should work just fine even if some of the values in the map are zero or negative.

Milestone Four: Implement Cosine Similarity

Back in Milestone Two, we saw this trio of normalized trigram profiles for English texts:

Distribution 1

  • " th": 0.40398
  • "the": 0.31503
  • "he ": 0.20755
  • "ing": 0.14639
  • " of": 0.14083
  • "ng ": 0.14083
  • "of ": 0.12786
  • " to": 0.11489
  • "at ": 0.11118
  • "nd ": 0.11118
  • "acc": 0.10562
  • "cin": 0.10562
  • "to ": 0.10562
  • "cci": 0.10377
  • "and": 0.10007
  • "re ": 0.10007
  • " be": 0.09821
  • "e t": 0.09821
  • "ed ": 0.09821


Distribution 2

  • " th": 0.37585
  • "the": 0.29460
  • "he ": 0.26651
  • "ing": 0.17539
  • "nd ": 0.14730
  • "ng ": 0.14654
  • " a ": 0.12072
  • "ed ": 0.11997
  • " to": 0.11921
  • " I ": 0.11693
  • " an": 0.11617
  • "and": 0.11541
  • "to ": 0.11085
  • "her": 0.10174
  • "er ": 0.09719
  • "e t": 0.09567
  • "re ": 0.09415
  • "is ": 0.09187
  • " of": 0.08883


Distribution 3

  • " th": 0.33214
  • "he ": 0.24161
  • "the": 0.22077
  • "ed ": 0.17518
  • " to": 0.16737
  • " an": 0.14458
  • "to ": 0.14132
  • "ing": 0.13546
  • "nd ": 0.13350
  • "ng ": 0.12764
  • "and": 0.12699
  • " in": 0.12178
  • "at ": 0.11397
  • "hat": 0.11331
  • " a ": 0.11136
  • " of": 0.10355
  • "is ": 0.10289
  • "as ": 0.10029
  • "tha": 0.10029


We as humans can look at these numbers and say “yep, they’re pretty close,” but how could a computer do this? Suppose we have two trigram profiles P₁ and P₂ that have already been normalized. One measure we can use to determine how similar they are is their cosine similarity, which can be computed as follows: for each trigram that appears in both P₁ and P₂, multiply that trigram’s frequency in P₁ by its frequency in P₂, then add up all of the products you found this way.
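
In symbols, if we write P[t] for the normalized frequency of trigram t in profile P, the cosine similarity of two normalized profiles is

\[ \text{similarity}(P_1, P_2) = \sum_{t \,\in\, P_1 \cap P_2} P_1[t] \cdot P_2[t] \]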

As an example, consider these two (highly contrived, not actually real) trigram profiles:

Profile 1

  • "aaa": 0.333
  • "bbb": 0.667
  • "ccc": 0.667

Profile 2

  • "bbb": 0.333
  • "ccc": 0.667
  • "ddd": 0.667

These two trigram sets have the trigrams "bbb" and "ccc" in common. We’d therefore find their cosine similarity by computing

\[\begin{aligned} &\phantom{= \text{ }} \text{(product of }\texttt{"bbb"}\text{ frequencies)} + \text{(product of }\texttt{"ccc"}\text{ frequencies)} \\ &= (0.667 \times 0.333) + (0.667 \times 0.667) \\ &= 0.667 \end{aligned}\]

The frequencies of "aaa" and "ddd" aren’t relevant here, since those trigrams appear in only one of the two trigram sets.

In the context of trigrams, cosine similarities range from 0 (the two documents have no trigrams in common) to 1 (the two documents have identical relative trigram frequencies).

Your next task is to write a function

double cosineSimilarityOf(const Map<string, double>& lhs, 
                          const Map<string, double>& rhs);

that takes as input two sets of normalized scores, then returns their cosine similarity using the formula given above.
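
As a sanity check, here's a sketch of a test based on the worked example above, assuming the STUDENT_TEST and EXPECT macros from the SimpleTest framework:

STUDENT_TEST("cosineSimilarityOf matches the worked example") {
    Map<string, double> one, two;
    one["aaa"] = 0.333;  one["bbb"] = 0.667;  one["ccc"] = 0.667;
    two["bbb"] = 0.333;  two["ccc"] = 0.667;  two["ddd"] = 0.667;

    /* (0.667 * 0.333) + (0.667 * 0.667) is roughly 0.667. */
    double similarity = cosineSimilarityOf(one, two);
    EXPECT(similarity > 0.66);
    EXPECT(similarity < 0.68);
}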

Milestone Four Requirements

  1. Implement the cosineSimilarityOf function in RosettaStone.cpp.
  2. Use the provided tests to confirm that your code works correctly.
  3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!

Some notes:

  • Although you’ll ultimately be applying this function to trigram profiles, your function should work in the case where the keys in the two maps are arbitrary strings. In other words, you shouldn’t assume that you’re just working with maps whose keys are all length-three strings.

  • You can assume that the two sets of input scores are normalized and don’t need to handle the case where this isn’t true.

  • If you're looking for information about the syntax of different operations on our container types, check out the Stanford C++ Library Documentation page.

Milestone Five: Guess a Text’s Language

At this point, you have the ability to take a text, extract its trigrams, and normalize its profile. You have the ability to take two normalized profiles and compare their scores against one another. All that’s left to do now is put everything together to guess the language of a text.

Here’s how. Before releasing this assignment to you, we went online and got lots and lots of texts written in different languages. The collection of texts associated with a given language is called a corpus (plural corpora) and is designed to be large and diverse enough to give a representative sample of what that language looks like in practice. We then computed trigram profiles for each of those corpora, giving a trigram profile for the language as a whole.

To make an educated guess about what language a piece of text is written in, you can simply compute the similarity of that text’s profile against the profiles of all the different languages. Whichever language has the highest similarity to your text then gives the best guess of what language the text was written in.

Your final task is to write a function

string guessLanguageOf(const Map<string, double>& textProfile,
                       const Set<Corpus>& corpora);

that takes as input a text’s k-gram profile (which has been normalized and then trimmed to store only the most frequent k-grams) and a set of corpora for different languages. This function then returns the name of the language that has the highest cosine similarity to the text. As an edge case, if the set of corpora is empty, you should call error() to report an error, since in that case there aren’t any languages to pick from.

You may have noticed that the last parameter to this function is a Set<Corpus> representing the text corpora. What exactly is a Corpus? In the RosettaStone.h header file (look at Headers/src/RosettaStone.h), you'll see that it's defined like this:

struct Corpus {                                           
    string name;                 // Name of the language  
    Map<string, double> profile; // Normalized and trimmed
};

This is a structure, a type representing a bunch of different objects all packaged together as one. This structure type groups a string named name (the name of the language) and a Map<string, double> named profile (its normalized trigram profile). The name Corpus refers to a type, just like int or string. You can create variables of type Corpus just as you can variables of any other type. For example, you could declare a variable of type Corpus named english like this:

Corpus english;

Once you have a variable of type Corpus, you can access the constituent elements of the struct by using the dot operator. For example, you can write code like this:

Corpus english;
english.name = "American English";
english.profile["the"] = 0.13742;
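
You can also loop over a Set<Corpus> with a range-based for loop, using the dot operator on each element. Here's a generic sketch (not assignment code), where corpora is assumed to be a Set<Corpus>:

for (const Corpus& corpus : corpora) {
    cout << corpus.name << " has "
         << corpus.profile.size() << " trigrams in its profile." << endl;
}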

Milestone Five Requirements

  1. Implement the guessLanguageOf function in RosettaStone.cpp.
  2. Use the provided tests to confirm that your code works correctly.
  3. Optionally, but not required: use STUDENT_TEST to add a test case or two of your own!

Some notes on this part of the problem:

  • You can assume that whoever is calling this function will already have normalized the frequencies of all the k-grams in the profile and each corpus and then filtered them down to just the most common k-grams. You do not need to - and should not - repeat these steps, since doing so will slow down your function.

  • If there is a tie between two or more languages being the most similar, you can break the tie in whatever way you’d like.

  • If you're looking for information about the syntax of different operations on our container types, check out the Stanford C++ Library Documentation page.

Milestone Six: Explore and Evaluate

You’ve just implemented a system to guess the languages of various pieces of texts. It’s now time to ask: how well does it work? When does it give good guesses? When does it do poorly? Questions like these are important, especially given the cultural and political significance of languages.

Let’s begin with some cases where your program does a great job identifying languages. The following texts will all have their languages correctly identified.

Run the “Rosetta Stone” demo from the top-level menu and use it to answer each of the following questions. Write your answers in the file LanguageIdentification.txt. As a note, assuming you've coded everything up correctly, the answers your program produces should be correct.

  1. What language is this written in?

    ᎤᎵᏍᏔᏴᏗ ᎦᎵᏙᏗ ᏭᎷᏤᎢ, ᎤᏗᏔᎮ ᎦᏁᎲ, ᎤᏲᏏᏍᎩ ᎤᏗᏔᎮ ᎤᏅᏗ ᏃᎴ ᎤᎩᏍᏙᎡ
    ᏑᎾᎴ ᎠᎩᏍᏗ ᎤᏂᏑᎸᏓ. ᎣᏍᏓ ᏄᎵᏍᏔᏁᎮ ᎤᏩᏌ ᎤᏪᏅᏒ ᎡᏙᎲ. ᎦᏅᏆᎶᏍᏗ ᏭᏴᎮ
    ᏣᏄᏏ ᏃᎴ ᏣᏁᎳ ᎠᏂᏎᏂᏏ ᏴᎩ, ᎣᏍᏓ ᏄᏩᏁᎴ ᎠᏦᏴ.
    
  2. What language is this written in?

    ᐁᐘᑯ ᐅᒪ ᐊᓯᓂᐏ ᒣᓂᑲᐣ ᑭᒥᓇᐘᐠ
    ᐃᓂᓂᐘᐠ ᒪᓂᑐᐸ ᑳᔭᒋᐠ᙮ ᐁᐘᑿᓂᐠ ᐅᑭ
    ᑲᓄᒋᐦᑕᒋᐠ ᐊᓯᓂᐏᐊᑐᐢᑮᓂᓂᐤ ᑲᑭᒥᓂᐦᒋᐠ
    ᐅᒣᓂᐤ᙮ ᐊᑿᓂ ᒥᑕᐦᑐᒥᑕᓇᐤ ᐊᐢᑭᐩ ᐊᓴᐩ
    ᐁᐊᑐᐢᑫᒋᐠ ᐅᑕ ᒪᓂᑐᐸ᙮
    
  3. What language is this written in?

    Penei ka manao ana o kekahi poe. I ke kaua ana o kekahi aina
    me kekahi aina, o ka poe i hee, makau lakou o lukuia mai a holo
    wale lakou ma ka moana, a loaa kahi aina, a noho iho la malaila,
    a pela hou no, a hiki no na kanaka ma na moku a pau. I ka manao
    o kekahi poe, ua pupuhi wale ia na waa i holo ma ka moana a pae
    wale aku i kekahi aina. Ua manao wale ia keia mau mea, aole i
    maopopo.
    
  4. What language is this written in?

    Tsídiiłtsooí binákʼee halzhinígíí éí tsídii tsídígíí dah yikahjí
    atah yisdzoh, áádóó éí Náhookǫsjí kéyah dah siʼánígíí bikááʼgóó hólǫ́.
    
  5. What language is this written in?

    Negema wangu binti, mchachefu wa sanati upulike wasiati asa
    ukanzingatia. Maradhi yamenshika hatta yametimu mwaka sikupata
    kutamka neno lema kukwambia. Ndoo mbee ujilisi na wino na
    qaratasi moyoni nina hadithi nimependa kukwambia.
    
  6. What language is this written in?

    Acaam ccaa tiquij, ziic ziix zo quihit áa z imhaa
    ha. Ziix quih isiihit taa x, hehe quih istj quih
    cacat quih imiihit. Hehe is quih cacat quih mos
    imiihit. Háx quih isiisi taa x, hasahcapjö cop iif
    com cöitnip x, ix cop imiisi.
    
  7. What language is this written in?

    Намрын бороо зөөлөн шиврэхэд аав минь дуртай
    Насыг нь зөөж буцсаар л байгаа шувуудад хайртай
    Өөрөө өтлөвч, орчлонд үлдэнэ гэж надад хайртай
    Өдрөөс өдөрт холдсоор л байгаа аавдаа би хайртай
    Шар наран улам алслаад л
    
  8. What language is this written in?

    Illoqarfiiit ilaanni illorsuit timersortarfiit,
    eqaarsaartarfiillu, tamarmik pilersaarusiukkamik
    nutarterneqarnissaat aallartisarneqassasoq,
    ullumikkut piumasarisaasunut naapertuuttun-ngorsarlugit.
    
  9. What language is this written in?

    Akyinkyinakyinkyin ama mahu nneɛma,
    Asante Bonwire Kente nwene deɛ,
    Manhu bi da o,
    Kwame nim adeɛ yɔ
    Ne kente nwono na abɔ me gye
    Ne nsa; ne nan, n’asadua saa nie.
    
  10. What language is this written in?

    Bo:se: nikya:w do'n sehłdiday xo'ji-nikya:w łiqay.
    Q'osta:n ye: na'ułch'a:t ch'inehwa:n haya:ł xongkin'-q'it
    me:ne:q'-ch'in' tse:-q'i-ya:ng'ay ye: dahya:ng'ay -nehwa:n.
    Haya:ł tin ch'ingkya:w xoke' 'e'n łiwhing-xw 'unt'e.
    

It would be great if your program always correctly identified the language a text is written in. However, your tool isn’t perfect, and it will sometimes get languages wrong. This can happen for a variety of reasons. Your next task is to explore where this can happen.

Run the “Rosetta Stone” demo from the top-level menu. Answer each of the following questions in the file LanguageIdentification.txt:

11. This text by Robert Burns is written in Scots. What language does your program say it’s written in? (Sometimes, the trigrams from a text in one language will closely match the trigrams from a related language.)

And there's a hand, my trusty fiere! and gie's a hand o' thine! And we'll
tak' a right gude-willie waught, for auld lang syne.

12. This text is written in English. What language does your program say it’s written in? (Short texts can have trigrams that do not match the general trigram distribution for texts in that language.)

San Jose is in Santa Clara County, California.

13. This text is written in English. What language does your program say it’s written in? (Trigrams are case-sensitive. We derived the English trigrams from texts that were written in formal English, in which it’s very unusual to see long strings of capitalized letters. As a result, this text will be a low match against English.)

IT WAS THE BEST OF TIMES, IT WAS THE WORST OF TIMES, IT WAS THE AGE OF
WISDOM, IT WAS THE AGE OF FOOLISHNESS, IT WAS THE EPOCH OF BELIEF, IT
WAS THE EPOCH OF INCREDULITY, IT WAS THE SEASON OF LIGHT, IT WAS THE
SEASON OF DARKNESS, IT WAS THE SPRING OF HOPE, IT WAS THE WINTER OF
DESPAIR, WE HAD EVERYTHING BEFORE US, WE HAD NOTHING BEFORE US, WE WERE
ALL GOING DIRECT TO HEAVEN, WE WERE ALL GOING DIRECT THE OTHER WAY – IN
SHORT, THE PERIOD WAS SO FAR LIKE THE PRESENT PERIOD, THAT SOME OF ITS
NOISIEST AUTHORITIES INSISTED ON ITS BEING RECEIVED, FOR GOOD OR FOR
EVIL, IN THE SUPERLATIVE DEGREE OF COMPARISON ONLY.

14. This song is written in Nepali. What language does your program say it’s written in? (Nepali is traditionally written with Devanagari script, and we used those texts to get our trigrams. In this example, it’s transliterated using a Latin-derived alphabet. Therefore, this will have a low match with Nepali.)

Rātō ṭikā nidhārmā ṭalakkai ṭalkiyō
ḍhākā ṭōpī śiraimā ḍhalakkai ḍhalkiyō
rātō ṭikā nidhārmā ṭalakka ṭalkiyō
chāti bhitra māyākō āgō salkiyō
ē bijulikō tār tār tār tār
phursad bha'ē ē hajur bhēṭauṅ śanivār
phursad bha'ē ē hajur bhēṭauṅ śanivār
ḍhākā ṭōpī śiraimā ḍhalakka ḍhalkiyō
māyā khēlnē khēlāḍī jhulukkai jhulkiyō
ē bijulikō tār tār tār tār
phursad chaina ē hajur ā'uṅdō śanivār
phursad chaina ē hajur ā'uṅdō śanivār

Milestone Seven: Ethics: Biased Training Data

In this section of the assignment, we'd like you to reflect on the code you've just written and dive deeper into some important questions about how it works.

In retrospect, it might seem striking that the code you've written is able to determine the language of a piece of text. Nothing in your code looks for particular types of characters, or certain key words, etc. Rather, it works purely by computing some statistics about the text and comparing it against some reference corpora. You can see this most clearly in how the guessLanguageOf function is defined:

string guessLanguageOf(const Map<string, double>& textProfile,
                       const Set<Corpus>& corpora);

This function has no preconception of what languages are. It relies entirely on corpora for this information. This means that if the trigrams provided for a language were formed from a biased sample of the language - perhaps some texts that are not fully representative of how the language is used - the guessLanguageOf function will reflect those biases in its output.

To illustrate this, let's look at an example. The trigrams for English were derived from a large, publicly available text in English: a translation of the bible. Because this is the sole source for English trigrams, your guessLanguageOf function, when evaluating whether a text is written in English, is comparing the text's trigrams against the bible's trigrams. If you provide your tool a piece of text that is written in English, but not the style of English found in a bible translation, your tool may fail to recognize the language as English.

Answer the following question in the file ShortAnswers.txt:

Q1. Give two examples of texts that a native English speaker would readily identify as being written in English, but which use a writing style that is very dissimilar to that of a bible translation. The texts should each be at most 150 words long, and may be excerpted from longer texts. Please include a proper citation of where the text came from. (Here's a link to information on how to cite sources provided by the ACM, computing's main professional organization.) Briefly explain why your texts meet those criteria. Your answer should be 3 - 6 sentences in length.

Many software systems that work with languages, whether that's a translation app or a large language model like ChatGPT, work in a similar way. Like your guessLanguageOf function, those systems often work by building up statistical models of language based on provided data sets. Those data sets are called the training data for those systems, as they're used to "train" the system on what languages look like. Like your guessLanguageOf function, if those systems are not given representative samples of languages, they may produce incorrect answers.

When the training data given to a program are not fully representative of the sorts of inputs those programs will need to handle, we say that the training data are biased. (The term "bias" here does not mean that there will be a particular political viewpoint encoded in the data; it just means that the training data don't account for the full spectrum of possible inputs.)

Training data sets can be biased for many reasons: logistical constraints (for example, ease and cost of gathering the data), legal restrictions (for example, copyright restrictions or protections for sensitive medical data), faulty assumptions (for example, that all people in a particular group have similar backgrounds or experiences), or even simple ignorance on the part of the developers. But regardless of the reason, biases in training data, when amplified by automated systems, can lead to real harms.

Answer the following question in the file ShortAnswers.txt:

Q2. Imagine a hypothetical translation app whose training data consisted of the full contents of Wikipedia. This means the algorithms in the translation system assume the style of writing found on Wikipedia is fully representative of all writing styles. Give an example of a piece of text where, in light of the bias in the training data, you would expect that

  1. the app will do a poor job translating the text, and, simultaneously,
  2. there is a realistic situation in which this translation could lead to a person being harmed.

Explain your reasoning, addressing both of the above points. Your answer should be 3 - 6 sentences in length.

There are roughly 8,000 spoken human languages. Many of these languages are underrepresented online relative to their number of speakers; this can happen either due to differences in internet access, or because there is no standard orthography, or for political reasons, etc.

Consider the following scenario. You are working on a software system that works with language (say, a program like ChatGPT that can converse with users in their own language) and want to expand the number of languages your program can work with. You come across a language that has a large number of speakers, but where it is difficult to find good training data. Specifically, the only available texts online in that language are either

  • not fully representative of how that language is spoken, or
  • protected by copyright restrictions.

For example, perhaps your sources consist solely of Twitter/X posts in that language, a bible translation, and blog posts whose copyright status is legally ambiguous.

You now have a choice. If you use those texts as training data, you know that the resulting system could reflect the biases of the small sample of text available, or even potentially expose you to legal liability. On the other hand, if you don't use those texts as training data, then your program's performance on that language might be seriously compromised or entirely nonexistent. You need to make a judgment call about how to proceed.

Answer the following questions in the file ShortAnswers.txt:

Q3: Outline some of the potential benefits of including the limited texts as training data for your program. As you do, think through who specifically would receive those benefits. Your answer should be 4 - 8 sentences in length.

Q4: Outline some of the potential downsides of including the limited texts as training data for your program. As you do, think through who specifically would be impacted by those downsides. Your answer should be 4 - 8 sentences in length.

Q5: Based on your answers to Q3 and Q4, how (if at all) would you use the limited texts as training data? Explain your answer in light of the benefits and harms you outlined in Q3 and Q4. Your answer should be 4 - 8 sentences in length.

The above examples illustrate that the sources of data we use as input to programs that process language can have profound effects on how those systems behave. If you find yourself working on such a system later on, take some time to reflect on what biases may be present in your training data and what steps you could take to mitigate those biases.

Stanford has an initiative called SILICON ("Stanford Initiative on Language Inclusion and Conservation in Old and New Media") that aims to provide better digital support for languages that currently lack it. If you enjoyed this assignment, consider reaching out to the SILICON team.

Part Two: Rising Tides

Background: Our Model

Global sea levels have been rising, and the most recent data suggest that the rate at which sea levels are rising is increasing. This means that city planners in coastal areas need to start designing developments so that an extra meter of water doesn’t flood people out of their homes.

Your task in this part of the assignment is to build a tool that models flooding due to sea level rise. To do so, we’re going to model terrains as grids of doubles, where each double represents the altitude of a particular square region on Earth. Higher values indicate higher elevations, while lower values indicate lower elevations. For example, take a look at the three grids described below. Before moving on, take a minute to think over the following questions, which you don’t need to submit. Which picture represents a small hill? Which one represents a long, sloping incline? Which one represents a lowland area surrounded by levees?

Three sample grids. Each is a 7x7 grid of numbers. The first one reads as follows, from top-to-bottom, left-to-right, with slashes denoting moving from one line to the next: 0,1,2,3,4,2,1/1,2,3,4,5,4,2/3,4,5,6,6,5,4/2,4,5,7,5,3,2/1,2,4,5,3,2,1/0,1,2,3,1,1,1/0,0,1,2,1,1,1. The second reads: -1,0,0,4,0,0,1/0,0,4,0,-1,-1,0/0,4,0,0,0,0,0/4,0,-1,-1,0,0,3/0,-1,-2,-1,0,3,0/0,0,-1,0,3,0,0/0,0,0,3,0,0,-1. The third one reads: 0,1,2,3,4,5,6/0,1,2,3,4,5,5/0,1,2,3,3,4,4/0,0,1,2,3,3,3/0,0,1,1,2,2,2/-1,-1,0,0,1,1,1/-2,-1,0,0,0,0,0

We can model the flow of water as follows. We’ll imagine that there’s a water source somewhere in the world and that we have a known height for the water. Water will then flow anywhere it can reach by moving in the four cardinal directions (up/down/left/right) without moving to a location at a higher elevation than the initial water height. For example, suppose that the upper-left corner of each of the three above worlds is the water source. Here’s what would be underwater given several different water heights:

Those three sample grids with different levels of water flooding them, in each case with the water emanating from the top-left corner of the world. To denote what is and is not under water, we will write numbers under water with parentheses around them. So, for example, (-1), (0), 2 / 2, (0), 2 / 2, 2, 2 would denote a 3x3 grid where the top-left, top-center, and center squares are all under water. The first grid is the one from above. With no water it looks like this: 0,1,2,3,4,2,1/1,2,3,4,5,4,2/3,4,5,6,6,5,4/2,4,5,7,5,3,2/1,2,4,5,3,2,1/0,1,2,3,1,1,1/0,0,1,2,1,1,1. With the water height at 0m it looks like this: (0),1,2,3,4,2,1/1,2,3,4,5,4,2/3,4,5,6,6,5,4/2,4,5,7,5,3,2/1,2,4,5,3,2,1/0,1,2,3,1,1,1/0,0,1,2,1,1,1. With the water height at 1m it looks like this: (0),(1),2,3,4,2,1/(1),2,3,4,5,4,2/3,4,5,6,6,5,4/2,4,5,7,5,3,2/1,2,4,5,3,2,1/0,1,2,3,1,1,1/0,0,1,2,1,1,1. With the water height at 2m it looks like this: (0),(1),(2),3,4,2,1/(1),(2),3,4,5,4,2/3,4,5,6,6,5,4/2,4,5,7,5,3,2/1,2,4,5,3,2,1/0,1,2,3,1,1,1/0,0,1,2,1,1,1. The second grid is shown here: -1,0,0,4,0,0,1/0,0,4,0,-1,-1,0/0,4,0,0,0,0,0/4,0,-1,-1,0,0,3/0,-1,-2,-1,0,3,0/0,0,-1,0,3,0,0/0,0,0,3,0,0,-1. With the water at 0m, 1m, or 2m it looks like this: (-1),(0),(0),4,0,0,1/(0),(0),4,0,-1,-1,0/(0),4,0,0,0,0,0/4,0,-1,-1,0,0,3/0,-1,-2,-1,0,3,0/0,0,-1,0,3,0,0/0,0,0,3,0,0,-1. The third grid initially looks like this: 0,1,2,3,4,5,6/0,1,2,3,4,5,5/0,1,2,3,3,4,4/0,0,1,2,3,3,3/0,0,1,1,2,2,2/-1,-1,0,0,1,1,1/-2,-1,0,0,0,0,0. With the water at 0m it looks like this: (0),1,2,3,4,5,6/(0),1,2,3,4,5,5/(0),1,2,3,3,4,4/(0),(0),1,2,3,3,3/(0),(0),1,1,2,2,2/(-1),(-1),(0),(0),1,1,1/(-2),(-1),(0),(0),(0),(0),(0). With the water at 1m it looks like this: (0),(1),2,3,4,5,6/(0),(1),2,3,4,5,5/(0),(1),2,3,3,4,4/(0),(0),(1),2,3,3,3/(0),(0),(1),(1),2,2,2/(-1),(-1),(0),(0),(1),(1),(1)/(-2),(-1),(0),(0),(0),(0),(0). With the water at 2m it looks like this: (0),(1),(2),3,4,5,6/(0),(1),(2),3,4,5,5/(0),(1),(2),3,3,4,4/(0),(0),(1),(2),3,3,3/(0),(0),(1),(1),(2),(2),(2)/(-1),(-1),(0),(0),(1),(1),(1)/(-2),(-1),(0),(0),(0),(0),(0)

A few things to notice here. First, notice that the water height is independent of the height of the terrain at its starting point. For example, in the bottom row, the water height is always two meters, even though the terrain height of the upper-left corner is either 0m or -1m, depending on the world. Second, in the terrain used in the middle column, notice that the water stays above the upper diagonal line of 4’s, since we assume water can only move up, down, left, and right and therefore can’t move diagonally through the gaps. Although there’s a lot of terrain below the water height, it doesn’t end up under water until the height reaches that of the barrier.

It’s possible for a grid to have multiple water sources. This might happen, for example, if we were looking at a zoomed-in region of the San Francisco Peninsula: there would be water both to the east and to the west of the land in the middle, and so we’d need to account for the possibility that the water level is rising on both sides. Here’s another set of images, this time showing where the water would be in the sample worlds above, assuming that both the top-left and bottom-right corners are water sources. (We’ll assume each water source has the same height.)

Those three sample grids with different levels of water flooding them, in each case with the water emanating from the top-left corner and bottom-right corner of the world. To denote what is and is not under water, we will write numbers under water with parentheses around them. So, for example, (-1), (0), 2 / 2, (0), 2 / 2, 2, 2 would denote a 3x3 grid where the top-left, top-center, and center squares are all under water. The first grid is the one from above. With no water it looks like this: 0,1,2,3,4,2,1/1,2,3,4,5,4,2/3,4,5,6,6,5,4/2,4,5,7,5,3,2/1,2,4,5,3,2,1/0,1,2,3,1,1,1/0,0,1,2,1,1,1. With the water height at 0m it looks like this: (0),1,2,3,4,2,1/1,2,3,4,5,4,2/3,4,5,6,6,5,4/2,4,5,7,5,3,2/1,2,4,5,3,2,1/0,1,2,3,1,1,1/0,0,1,2,1,1,1. With the water height at 1m it looks like this: (0),(1),2,3,4,2,1/(1),2,3,4,5,4,2/3,4,5,6,6,5,4/2,4,5,7,5,3,2/1,2,4,5,3,2,1/0,1,2,3,(1),(1),(1)/0,0,1,2,(1),(1),(1). With the water height at 2m it looks like this: (0),(1),(2),3,4,2,1/(1),(2),3,4,5,4,2/3,4,5,6,6,5,4/(2),4,5,7,5,3,(2)/(1),(2),4,5,3,(2),(1)/(0),(1),(2),3,(1),(1),(1)/(0),(0),(1),(2),(1),(1),(1). With the water height at 3m it looks like this: (0),(1),(2),(3),4,2,1/(1),(2),(3),4,5,4,2/(3),4,5,6,6,5,4/(2),4,5,7,5,3,(2)/(1),(2),4,5,(3),(2),(1)/(0),(1),(2),(3),(1),(1),(1)/(0),(0),(1),(2),(1),(1),(1). The second grid is shown here: -1,0,0,4,0,0,1/0,0,4,0,-1,-1,0/0,4,0,0,0,0,0/4,0,-1,-1,0,0,3/0,-1,-2,-1,0,3,0/0,0,-1,0,3,0,0/0,0,0,3,0,0,-1. With the water at 0m, 1m, or 2m it looks like this: (-1),(0),(0),4,0,0,1/(0),(0),4,0,-1,-1,0/(0),4,0,0,0,0,0/4,0,-1,-1,0,0,3/0,-1,-2,-1,0,3,(0)/0,0,-1,0,3,(0),(0)/0,0,0,3,(0),(0),(-1). With the water level at 3m it looks like this: (-1),(0),(0),4,(0),(0),(1)/(0),(0),4,(0),(-1),(-1),(0)/(0),4,(0),(0),(0),(0),(0)/4,(0),(-1),(-1),(0),(0),(3)/(0),(-1),(-2),(-1),(0),(3),(0)/(0),(0),(-1),(0),(3),(0),(0)/(0),(0),(0),(3),(0),(0),(-1). The third grid initially looks like this: 0,1,2,3,4,5,6/0,1,2,3,4,5,5/0,1,2,3,3,4,4/0,0,1,2,3,3,3/0,0,1,1,2,2,2/-1,-1,0,0,1,1,1/-2,-1,0,0,0,0,0. With the water at 0m it looks like this: (0),1,2,3,4,5,6/(0),1,2,3,4,5,5/(0),1,2,3,3,4,4/(0),(0),1,2,3,3,3/(0),(0),1,1,2,2,2/(-1),(-1),(0),(0),1,1,1/(-2),(-1),(0),(0),(0),(0),(0). With the water at 1m it looks like this: (0),(1),2,3,4,5,6/(0),(1),2,3,4,5,5/(0),(1),2,3,3,4,4/(0),(0),(1),2,3,3,3/(0),(0),(1),(1),2,2,2/(-1),(-1),(0),(0),(1),(1),(1)/(-2),(-1),(0),(0),(0),(0),(0). With the water at 2m it looks like this: (0),(1),(2),3,4,5,6/(0),(1),(2),3,4,5,5/(0),(1),(2),3,3,4,4/(0),(0),(1),(2),3,3,3/(0),(0),(1),(1),(2),(2),(2)/(-1),(-1),(0),(0),(1),(1),(1)/(-2),(-1),(0),(0),(0),(0),(0). With the water at 4m it looks like this: (0),(1),(2),(3),(4),5,6/(0),(1),(2),(3),(4),5,5/(0),(1),(2),(3),(3),(4),(4)/(0),(0),(1),(2),(3),(3),(3)/(0),(0),(1),(1),(2),(2),(2)/(-1),(-1),(0),(0),(1),(1),(1)/(-2),(-1),(0),(0),(0),(0),(0)

Notice that the water overtops the levees in the central world, completely flooding the area, as soon as the water height reaches three meters. The water line itself never changes, regardless of the terrain elevation. As such, water will never flood across cells at a higher elevation than the water line, but it will flood across cells at or below the water line.

Your task is to implement a function

Grid<bool> floodedRegionsIn(const Grid<double>& terrain,        
                            const Vector<GridLocation>& sources,
                            double height);                     

that takes as input a terrain (given as a Grid<double>), a list of locations of water sources (represented as a Vector<GridLocation>; more on GridLocation later), and the height of the water level, then returns a Grid<bool> indicating, for each spot in the terrain, whether it’s under water (true) or above the water (false).

You may have noticed that we’re making use of the GridLocation type. This is a type representing a position in a Grid. You can create a GridLocation and access its row and column using this syntax:

GridLocation location;
location.row = 137;
location.col = 42;

GridLocation otherLocation = { 106, 103 }; // Row 106, column 103
otherLocation.row++; // Increment the row.
cout << otherLocation.col << endl; // Prints 103
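
When you explore a location's neighbors, you'll also want to avoid stepping outside the grid. Here's a generic sketch (not assignment code) using the Grid type's inBounds function, where location is a GridLocation like the one above and terrain stands for any Grid<double>, such as the one passed to floodedRegionsIn:

GridLocation east = { location.row, location.col + 1 };  // one step to the right
if (terrain.inBounds(east.row, east.col)) {              // is that cell inside the grid?
    cout << terrain[east.row][east.col] << endl;         // safe to look at its elevation
}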

Now that we’ve talked about the types involved here, let’s address how to solve this problem. How, exactly, do you determine what’s going to be underwater? Doing so requires you to determine which grid locations are both (1) at or below the water level and (2) places water can flow to from one of the sources.

Fortunately, there’s a beautiful algorithm you can use to solve this problem called breadth-first search. The idea is to simulate having the water flow out from each of the sources at greater and greater distances. First, you consider the water sources themselves. Then, you visit each location one step away from the water sources. Then, you visit each location two steps away from the water sources, then three steps, four steps, etc. In that way, the algorithm ends up eventually finding all places the water can flow to, and it does so fairly quickly!

Breadth-first search is typically implemented by using a queue that will process every flooded location. The idea is the following: we begin by enqueuing each water source, or at least the sources that aren’t above the water line. That means that the queue ends up holding all the flooded locations zero steps away from the sources. We’ll then enqueue the next group of elements, corresponding to all the flooded locations one step away from the sources. Then we’ll enqueue all flooded locations two steps away from the sources, then three steps away, etc. Eventually, this process will end up visiting every flooded location in the map.

Let’s make this a bit more concrete. To get the process started, you add to the queue each of the individual water sources whose grid cell sits at or below the water level. From then on, the algorithm operates by dequeuing a location from the front of the queue. Once you have that location, you look at each of the location’s neighbors in the four cardinal directions. For each of those neighbors, if that neighbor is already flooded, you don’t need to do anything because the search has already considered that square. Similarly, if that neighbor is above the water line, there’s nothing to do because that square shouldn’t end up under water. However, if neither of those conditions holds, you mark the neighbor as flooded and add it to the queue, meaning “I’m going to process this one later on.” By delaying processing that neighbor, you get to incrementally explore at greater and greater distances from the water sources. The process stops once the queue is empty; when that happens, every location that needs to be considered will have been visited.

At each step in this process, you’re removing the location from the queue that’s been sitting there the longest. Take a minute to convince yourself that this means that you’ll first visit everything zero steps from a water source, then one step from a water source, then two steps, etc.
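
If you haven’t used the Stanford Queue type before, the handful of operations you’ll need look roughly like this (an illustrative snippet, not part of the solution):

Queue<GridLocation> worklist;            // holds flooded locations waiting to be processed
GridLocation source = { 0, 0 };
worklist.enqueue(source);                // add a location to the back of the queue
while (!worklist.isEmpty()) {
    GridLocation next = worklist.dequeue();  // remove the location that's waited the longest
    /* ... look at next's neighbors here ... */
}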

To spell out the individual steps of the algorithm, here’s some pseudocode for breadth-first search:

create an empty queue;
for (each water source at or below the water level) {
    flood that square;
    add that square to the queue;
}

while (the queue is not empty) {
    dequeue a position from the front of the queue;

    for (each square adjacent to the position in a cardinal direction) {
        if (that square is at or below the water level and isn't yet flooded) {
            flood that square;
            add that square to the queue;
        }
    }
}

As is generally the case with pseudocode, several of the operations that are expressed in English need to be fleshed out a bit. For example, the loop that reads

for (each square adjacent to the position in a cardinal direction)

is a conceptual description of the code that belongs there. It’s up to you to determine how to code this up; it might be a single loop, several loops, or something else entirely. The basic idea, however, should still make sense: what you need to do is iterate over all the locations that are one position away from the current position. How you do that is up to you.
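
For example, one common pattern (offered here purely as an illustration, with names of our own choosing) keeps parallel arrays of row and column offsets for the four cardinal directions; here, loc stands for the location you just dequeued:

const int ROW_DELTAS[] = { -1, +1,  0,  0 };   // north, south, west, east
const int COL_DELTAS[] = {  0,  0, -1, +1 };

for (int i = 0; i < 4; i++) {
    GridLocation neighbor = { loc.row + ROW_DELTAS[i], loc.col + COL_DELTAS[i] };
    /* ... check that neighbor is in bounds, at or below the water level, and
     * not yet flooded before flooding it and adding it to the queue ... */
}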

It’s important to ensure that the code you write works correctly here, and to help you with that we’ve provided a suite of test cases with the starter files. These test cases look at a few examples, but they aren’t exhaustive. You’ll need to add at least one custom test case – and, preferably, more than one – using the handy STUDENT_TEST macro that you saw in the first programming assignment. We don’t recommend copying one of the example worlds from above into a test case, since (1) that’s really tedious and time-consuming and (2) it’s better to directly test particular tricky behaviors with smaller, more focused examples.
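
To make that concrete, here’s a hedged sketch of what one such focused test might look like. It assumes the usual Stanford library headers included by the starter RisingTides.cpp, and the specific terrain and expected values below are ours (chosen to exercise the “water can’t climb above the water level” behavior), so double-check them against your own understanding before relying on them:

STUDENT_TEST("Water from one corner can't cross cells above the water level") {
    Grid<double> terrain = {
        { 1, 2, 1 },
        { 1, 3, 1 },
        { 1, 1, 1 }
    };
    GridLocation topLeft = { 0, 0 };
    Vector<GridLocation> sources = { topLeft };

    /* With the water at height 1, the cells of elevation 2 and 3 stay dry, but
     * the water wraps around the bottom row to reach the right-hand column. */
    Grid<bool> expected = {
        { true, false, true },
        { true, false, true },
        { true, true,  true }
    };
    EXPECT_EQUAL(floodedRegionsIn(terrain, sources, 1.0), expected);
}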

Your Task

To summarize, here’s what you need to do for this assignment:

  1. Implement the floodedRegionsIn function in RisingTides.cpp. Your code should use the breadth-first search algorithm outlined above in pseudocode to determine which regions are under water. Water flows up, down, left, and right and will submerge any region whose height is less than or equal to the global water level passed in as a parameter. Test as you go.
  2. Add at least one custom test case to RisingTides.cpp – preferably not one of the examples we listed above – and, if necessary, revisit your code to correct any errors you find.

Some notes on this problem:

  • Need a refresher on the Grid type? Check the Stanford C++ Library Documentation.

  • The initial height of the water at a source may be below the level of the terrain there. If that happens, that water source doesn’t flood anything, including the cell it sits in. (This is an edge case where both the “flood it” and “don’t flood it” options end up being weird, and for consistency we decided to tiebreak in the “don’t flood it” direction.) For one way to test this edge case, see the sketch after this list.

  • The heights of adjacent cells in the grid may have no relation to one another. For example, if you have the topography of Preikestolen in Norway, you might have a cell of altitude 0m immediately adjacent to a cell of altitude 604m. If you have a low-resolution scan of Mount Whitney and Death Valley, you might have a cell of altitude 4,421m next to a cell of altitude -85m.

  • Curious about an edge case? Check the examples above.

  • Make sure, when passing large objects around, to do so by reference (or const reference, if appropriate) rather than by value. Copying large grids can take a very long time and make your code slow down appreciably.
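
As promised in the note about water sources above, here’s one hedged sketch of a test for that edge case; as before, the specific grid values are our own, not part of the provided tests:

STUDENT_TEST("A source sitting above the water level floods nothing") {
    Grid<double> terrain = {
        { 5, 1 },
        { 1, 1 }
    };
    GridLocation source = { 0, 0 };      // elevation 5, above a water level of 3
    Vector<GridLocation> sources = { source };

    /* The source cell itself is above the water level, so the water never gets
     * started, even though the neighboring cells are low enough to flood. */
    Grid<bool> expected = {
        { false, false },
        { false, false }
    };
    EXPECT_EQUAL(floodedRegionsIn(terrain, sources, 3.0), expected);
}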

Once you’re passing all the tests, click the “Rising Tides” option to see your code in action! This demo uses your code to flood-fill the sample world you’ve chosen. Some of these sample worlds are large, and it might take a while for that flood-fill to finish. However, it should probably take no more than 30 seconds for the program to finish running your code.

The demo app we’ve included will let you simulate changes to the sea levels for several geographical regions of the United States. The data here is mostly taken from the National Oceanic and Atmospheric Administration (NOAA), which provides detailed topographic maps of the US. Play around with the data sets – what do you find?

We also have some maps from Mars, supplied by a former CS106B student as an extension to their assignment! The Martian terrain includes many regions way above and way below 0 elevation, so try out some large positive and large negative altitudes to get the full experience here.

Part Three: (Optional) Extensions

If you'd like to run wild with these assignments, go for it! Here are some suggestions for additional features to add.

  • Rosetta Stone: The starter files for this assignment include trigrams for over 200 languages. We made a reasonable effort to make sure those trigram files were reflective of how those languages are actually written, but there may be some trigram files derived from corpora that don’t adequately represent the language. Investigate the quality of the trigram files for other languages and, if they seem off, find a new corpus and create a better trigram file. Let us know how you did it so we can update our attributions file.

    We know for a fact that our list of languages is incomplete. Do you speak any languages that aren’t represented here? Build a corpus for it and send it our way, along with a description of how you came up with it.

    As you saw, the tool sometimes guesses the wrong language. When it does, is that because it's completely and totally fooled, or because it thought two or more languages were a good match? Try changing the function that guesses languages to return a set of the top n matches for some value of n (say, the top five matches) along with its confidence in each. Tell us if you find anything interesting when you do!

  • Rising Tides: The sample terrains here have their data taken from NOAA, which is specific to the United States, but we’d love to see what other data sets you can find. Get some topographical data from other sources, modify it to fit our file format, and load it into the program. (You can find information about the file format in the res/CreatingYourOwnTerrains.txt file.) What does the resulting forecast show – for good or for ill?

    Or, consider the following question: suppose you have a terrain map of your area and know your home’s position within that area. Can you determine how much higher the water can rise without flooding your home?

Submission Instructions

Before you call it done, run through our submit checklist to be sure all your t’s are crossed and i’s are dotted. Make sure your code follows our style guide. Then upload your completed files to Paperless for grading.

Please submit only the files you edited; for this assignment, these files will be:

  • RosettaStone.cpp
  • LanguageIdentifications.txt (don't forget this one!)
  • ShortAnswers.txt (don't forget this one!)
  • RisingTides.cpp

You don't need to submit any of the other files in the project folder.

🏁 Submit to Paperless

If you modified any other files in the course of coding up your solutions, submit those as well. And that’s it! You’re done!

Good luck, and have fun!