MS&E 317: Algorithms for Modern Data Models

Spring 2014, Stanford University
Mon, Wed 2:15 PM - 3:30 PM at Meyer 143

Instructors:
Ashish Goel | Reza Zadeh

We traditionally think of algorithms as running on data available in a single location, typically main memory. In many modern applications including web analytics, search and data mining, computational biology, finance, and scientific computing, the data is often too large to reside in a single location, is arriving incrementally over time, is noisy/uncertain, or all of the above.

Paradigms such as map-reduce, streaming, sketching, Distributed Hash Tables, Resilient Distributed Datasets, and random walks have proved useful for these applications. This course will provide an introduction to the design and analysis of algorithms for these modern data models. The class will be largely theoretical, but also involve at least two hands-on programming exercises.

Pre-requisites: Targeting Doctorate students having taken Algorithms at the level of CS 261. Being able to competently program in any main-stream high level language (C, C++, Ruby, Java, Scala, Python, Perl).

There will be 3 homeworks, one scribed lecture, and a take-home final exam. Students can substitute the take-home final with a project. Students taking the class for credit/no credit instead of letter grade can skip the final.

Students are strongly encouraged to attend the Spark Workshop.

Optional textbooks:
Algorithm Design by Kleinberg and Tardos [KT]
Randomized Algorithms by Rajeev Motwani and Prabakhar Raghavan [MR]

Homework

Homework 1 [pdf]
Homework 2 [pdf]
Homework 3 [pdf]

Scribe template

Final Exam Due June 9th by midnight

Lectures and References

Lecture 1: Introduction to Modern Data Models [Scribe Notes], MapReduce Paper
Lecture 2: Triangle Counting: [Scribe Notes], Curse of the Last Reducer, Counting Triangles with Spark
Lecture 3: Squaring Matrices on MapReduce [Scribe Notes], DISCO paper, DIMSUM paper, Tall and Skinny SVD, Combiners Description
Lecture 4: Power iterations and PageRank [Scribe Notes] [slides], PageRank implementation
Lecture 5: Sketches [Scribe Notes]
Lecture 6: Applications Sketches, MinHash [Scribe Notes]
Lecture 7: P-Stability [Scribe Notes]
Lecture 8: The Secretary Problem [Scribe Notes], Optimizing expected value for the Secretary problem
Lecture 9: The Perceptron, Intro to Stochastic Gradient Descent [Scribe Notes], Parallel Stochastic Gradient Descent, SGD updates for many ML losses, Sorting/Shuffling on MapReduce
Lecture 10: Sampling, Locality Sensitive Hashing [Scribe Notes]
Lecture 11: Locality Sensitive Hashing [Scribe Notes]
Lecture 12: Parallel SGD, Alternating Direction Method of Multipliers [Scribe Notes], Many Resouces from Stephen Boyd, ADMM on Spark
Lecture 13: Overview of Large scale optimization. PLANET: Distributed Decision Tree Learning [Scribe Notes] PLANET paper
Lecture 14: Approximating Cuts. Clustering. [Scribe Notes]
Lecture 15: Incremental Facility Location. Implementation of LSH. [Scribe Notes]
Lecture 16: Social Search [Scribe Notes] Sketch Paper

Contact

Ashish: ashishg at stanford.edu

Reza: rezab at stanford.edu