\documentclass[12pt]{article}
\usepackage{graphicx}
\usepackage{amsmath,amssymb,url}
\usepackage{bbm}
\newfont{\bssdoz}{cmssbx10 scaled 1200}
\newfont{\bssten}{cmssbx10}
\oddsidemargin 0in \evensidemargin 0in \topmargin -0.5in
\headheight 0.25in \headsep 0.25in
\textwidth 6.5in \textheight 9in %\marginparsep 0pt \marginparwidth 0pt
\parskip 1.5ex \parindent 0ex \footskip 20pt
\begin{document}
\parindent 0ex
\thispagestyle{empty} \vspace*{-0.75in}
{\bssdoz CME 323: Distributed Algorithms and Optimization}
{\\ \bssten Instructor: Reza Zadeh (rezab@stanford.edu) }
{\\ \bssten HW\#1}
\vspace{5pt}
\begin{enumerate}
\item Download the following materials:
\begin{itemize}
\item Slides: \url{http://stanford.edu/~rezab/dao/slides/itas_workshop.pdf}
\item Spark and Data: \url{http://training.databricks.com/workshop/usb.zip}
\end{itemize}
Now, answer the following questions:
\begin{enumerate}
\item Checkpoint on slide 11
\item Checkpoint on slide 55
\item Checkpoint on slide 60.
\textbf{Note:} Slide 59 references the file \verb|CONTRIBUTING.md|
which is not in the provided zip file. Instead, use the file \verb|website/getting-started.md|.
\end{enumerate}
Submit your code and answers.
\item Warmup question. Assume you are given a typical MapReduce implementation where you only have to write the Map and Reduce functions. The Map function you will write takes as input a (key, value) record and returns either a (key, value) record or nothing. The Reduce function you will write takes as input (key, list of all values for that key) and returns either a record or nothing. The framework already takes care of iterating the Map function over all the records in the input file, key-based intermediate data transfer between Map and Reduce, and storing the returned value of Reduce Ð you do not have to worry about these.
You are now given an input file which contains comprehensive information about a social network that has asymmetrical (directed) links, i.e., a network where users ÔfollowÕ other users but not necessarily vice-versa (e.g., Twitter). Each record in this input file is (userid-a, userid-b), where userid-a ÔfollowsÕ userid-b (i.e., points to it). Note that this record tells you nothing about whether or not userid-b follows userid-a.
Write a MapReduce program (i.e., Map function and Reduce function) that outputs all pairs of userids who follow each other. Pseudocode is OK.
%solution in png file
\item Warmup question. Consider counting the number of occurrences of words in a collection of documents,
where there are only $k$ possible words. Write a MapReduce
to achieve this, and analyze the shuffle size with and without combiners being used (assuming $B$ mappers are used).
\item The \textit{prefix-sum} operator takes an array $a_1, \ldots, a_n$ and returns an array $s_1, \ldots, s_n$, where
$s_i = \sum_{j \leq i} a_j$. For example starting with an array 17 0 5 32 it returns 17 17 22 54.
Describe how to implement \textit{prefix-sum} in MapReduce, where the input is stored as $\langle i, a_i \rangle$.
That is, the key is the position in the array, and the value is the value at that position. Analyze the shuffle size, and
the reduce-key space and time complexity.
\item For a given undirected graph $G = (V, E)$ with $n$ vertices and $m$ edges ($m \geq n$), we say that $G$ is shallow if for every pair of vertices $u,v \in V$, there is a path from $u$ to $v$ of length at most 2 (i.e., using at most two edges).
\begin{enumerate}
\item Give an algorithm that can decide whether G is shallow in $O(n^{2.376})$ time.
\item Given an $n\times r$ matrix $A$ an $r \times n$ matrix $B$ where $r\leq n$, show that we can multiply $A$ and $B$ in
$O((n/r)^2 á r^{2.376})$ time. (Use the fact that we can multiply two $r \times r$ matrices in $O(r^{2.376})$ time.)
\item Give an algorithm that can decide whether $G$ is shallow in $O(m^{0.55}n^{1.45})$ time. [Hint: consider length-2 paths that go through low-degree vertices and length-2 paths that go through high-degree vertices separately. Use (b)]
\end{enumerate}
\item In a group of 2 or 3 students, decide upon a project and send a maximum 300-word proposal.
The project must have to do with distributed computing and be theoretical or practical. Examples include
implementing distributed versions of algorithms from your research and/or analyzing their communication costs.
Talk to Reza if you need inspiration or ideas.
\end{enumerate}
\end{document}