\documentclass[12pt,letterpaper]{article}
\usepackage{graphicx,textcomp}
\usepackage{natbib}
\usepackage{setspace}
\usepackage{fullpage}
\usepackage{color}
\usepackage[reqno]{amsmath}
\usepackage{amsthm}
\usepackage{amssymb,enumerate}
\usepackage[all]{xy}
\usepackage{endnotes}
\usepackage{lscape}
\newtheorem{com}{Comment}
\newtheorem{lem} {Lemma}
\newtheorem{prop}{Proposition}
\newtheorem{thm}{Theorem}
\newtheorem{defn}{Definition}
\newtheorem{cor}{Corollary}
\newtheorem{obs}{Observation}
\usepackage[compact]{titlesec}
\usepackage{dcolumn}
\usepackage{tikz}
\usetikzlibrary{arrows}
\usepackage{multirow}
\usepackage{xcolor}
\newcolumntype{.}{D{.}{.}{-1}}
\newcolumntype{d}[1]{D{.}{.}{#1}}
\definecolor{light-gray}{gray}{0.65}
\usepackage{url}
\newcommand{\Sref}[1]{Section~\ref{#1}}
\newtheorem{hyp}{Hypothesis}
\title{Text as Data: Homework 4}
\date{Due: 11/7 at 5pm}
\begin{document}
\maketitle
In this homework assignment we will analyze Machiavelli's \emph{The Prince}. Download {\tt Mach.tar} from the course website and expand the compressed folder. (This is relevant {\tt http://xkcd.com/1168/}). \\
Each file represents a subset of the manuscript. We will analyze its contents using principal components, multidimensional scaling, and clustering methods.
\subsection*{Create a Document-Term Matrix}
Using the sections from the Machiavelli text, create a document term matrix.
\begin{itemize}
\item[-] Discard punctuation, capitalization
\item[-] Apply the porter stemmer to the documents
\item[-] Remove stop words from the following list (remembering to stem the stop words):
{\tt `http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop'}
\item[-] Identify the 500 most common unigrams
\item[-] Create a $N \times 500$ document term matrix $\boldsymbol{X}$, where the columns count the unigrams and the rows are the documents
\end{itemize}
We will work with a normalized version of the term document matrix. That is we will divide each row by the total number of words in the top 500 unigrams used:
\begin{eqnarray}
\boldsymbol{x}_{i}^{*} & = & \frac{\boldsymbol{x}_{i}}{\sum_{j=1}^{500} x_{ij}} \nonumber\\
\boldsymbol{X}^{*} & = & \begin{pmatrix} \boldsymbol{x}_{1}^{*} \\
\boldsymbol{x}_{2}^{*} \\
\vdots \\
\boldsymbol{x}_{N}^{*} \\
\end{pmatrix} \nonumber
\end{eqnarray}
\subsection*{Low Dimensional Embeddings with Principal Components}
\begin{itemize}
\item[1)] Wise Will (WW), your friend with a weird name, notices you looking at the slides about principal component analysis (PCA). WW casually remarks that the variance of the eigenvalues of the variance-covariance matrix is a useful heuristic for knowing if PCA can be fruitfully applied to some document-term matrix. WW, completely unsolicited, explains that as the variance of the eigenvalues goes up, the more useful PCA will be. He then laughs and leaves your office. WW is kind of a jerk. \\
Let's formalize WW's suggestion. Suppose document-term matrix $\boldsymbol{X}$ has variance-covariance matrix $\boldsymbol{\Sigma} = \frac{\boldsymbol{X}^{'}\boldsymbol{X}}{N}$. And suppose that $\boldsymbol{\Sigma}$ has eigenvalues $\lambda_{1}>\lambda_{2}>\hdots > \lambda_{d}>0$. Then we calculate the variance of the eigenvalues as
\begin{eqnarray}
\sigma^{2} & = & \frac{1}{d} \sum_{j=1}^{d}(\lambda_{j} - \bar{\lambda})^{2} \nonumber
\end{eqnarray}
where $\bar{\lambda}$ is $\frac{1}{d} \sum_{i=1}^{d} \lambda_{i}$. WW is saying that as $\sigma^{2}$ gets bigger, a low-dimensional embedding via PCA will provide a better summary of our data. \\
Does WW have a good point? Why or why not? (Hint: what do the eigenvalues represent?)
\item[2)] Apply the function {\tt prcomp} to $\boldsymbol{X}^{*}$. Be sure to set use a scaled version of the data, by setting {\tt scale = T}, which will ensure that each column has unit variance.
\begin{itemize}
\item[a)] Create a plot of variance explained by each additional principal component. What does this plot suggest about the number of components to include?
\item[b)] Plot the two-dimensional embedding of the text documents. Label the texts with their number. (Each file is {\tt Mach\_XX.txt}, where {\tt XX} is the chunk number)
\item[c)] Label the two largest principal components. What does this embedding suggest about the primary variation this representation of the Prince? (Hint: if your {\tt embed} is your object with principal components, examine {\tt embed\$rotation})
\end{itemize}
\item[3)]An alternative method---discussed at the end of the seventh lecture---is multidimensional scaling (MDS). Classic MDS attempts to preserve distances between objects in a low dimensional scaling.
\begin{itemize}
\item[a)] Calculate the Euclidean distance between each document using $\boldsymbol{X}^{*}$. Call this matrix $\boldsymbol{D}(\boldsymbol{X}^{*})$ (Hint: use {\tt R}'s built in function {\tt dist})
\item[b)] Apply the classic MDS to $\boldsymbol{D}(\boldsymbol{X}^{*})$ using the {\tt R} function {\tt cmdscale}. That is, execute the code\\
{\tt mds\_scale<- cmdscale(DISTANCE\_MATRIX, k = 2)}
\item[c)] Apply PCA to $\boldsymbol{X}^{*}$, but this time do not use {\tt prcomp}'s scaling option. That is, use {\tt prcomp} with {\tt scale = F}.
\item[d)] Compare the first dimension of the output from classic MDS to the first dimension of the embedding from principal components. What is the correlation between the embeddings?
\item[d)] Now use {\tt dist} to create a distance matrix using the {\tt manhattan} metric, apply Classic multidimensional scaling to the distance matrix based on manhattan distance, and compare the first dimension of this embedding to the first dimension from PCA. What is the correlation?
\item[e)] What do you conclude about the relationship between PCA and MDS?
\end{itemize}
\end{itemize}
\subsection*{Clustering Methods}
\begin{itemize}
\item[1)] Using the {\tt kmeans} function, create a plot of the {\tt kmeans} objective function as the number of clusters varies from 2 to $N - 1$.
\item[2)] Apply K-Means with 6 clusters, being sure to use {\tt set.seed} to ensure you can replicate your analysis
\item[3)] Label each cluster using computer and hand methods:
\begin{itemize}
\item[i)] Suppose $\boldsymbol{\theta}_{k}$ is the cluster center for cluster $k$ and define $\bar{\boldsymbol{\theta}}_{-k} = \frac{\sum_{j \neq k} \boldsymbol{\theta_{j}} }{K-1 }$ or the average of the centers not $k$. Define
\begin{eqnarray}
\text{Diff}_{k} & = & \boldsymbol{\theta}_{k} - \bar{\boldsymbol{\theta}}_{-k}\nonumber
\end{eqnarray}
Use the top ten words from $\text{Diff}_{k}$ to label the clusters
\item[ii)] Sample and read texts assigned to each cluster and produce a hand label
\end{itemize}
\item[4)] Measure the cohesiveness and exclusivity of each cluster. What is the most exclusive and most cohesive topics? What are the least? Does this align with your reading of the texts?
\end{itemize}
\end{document}