CME 323: Distributed Algorithms and Optimization

Spring 2015, Stanford University
Mon, Wed 12:35 PM - 1:50 PM at 530-127

Instructor: Reza Zadeh

The emergence of large distributed clusters of commodity machines has brought with it a slew of new algorithms and tools. Many fields such as Machine Learning and Optimization have adapted their algorithms to handle such clusters. The class will cover distributed algorithms that are widely used in academia and industry.

We will cover distributed algorithms for:

Convex Optimization
Matrix Factorization
Machine Learning
Neural Networks
The Bootstrap
Numerical Linear Algebra
Large Graph Analysis
Streaming and online algorithms

A shorter version of this class was given at Spark Summit 2015: [video] [slides]

Class Format

Throughout the class, topics will be illustrated with hands-on exercises using the high-speed cluster programming framework Spark, with computing resources provided by the instructor. Designing distributed algorithms differs from designing traditional algorithms primarily in the need to account for communication cost, so the algorithms covered will be analyzed for their communication cost.
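As a rough illustration of what counting communication cost can look like, the sketch below simulates a word count over three partitions in plain Python (not Spark) and compares how many records would cross the network with and without local pre-aggregation. The function names, the toy data, and the choice of "number of records shuffled" as the cost measure are assumptions made for this example and are not taken from the course materials.

```python
from collections import Counter


def shuffle_cost_without_combiner(partitions):
    """Records sent if every (word, 1) pair is shipped as-is:
    one record per word occurrence on each machine."""
    return sum(len(part) for part in partitions)


def shuffle_cost_with_combiner(partitions):
    """Records sent if each machine first sums its own counts:
    one record per distinct word on each machine."""
    return sum(len(Counter(part)) for part in partitions)


def word_count(partitions):
    """Final counts, which are the same under either strategy."""
    total = Counter()
    for part in partitions:
        total.update(part)
    return dict(total)


# Hypothetical toy data: each inner list is the data held by one machine.
partitions = [
    ["a", "b", "a", "a"],
    ["b", "b", "c"],
    ["a", "c", "c", "c", "c"],
]

naive_cost = shuffle_cost_without_combiner(partitions)
combined_cost = shuffle_cost_with_combiner(partitions)
counts = word_count(partitions)
```

Both strategies produce the same counts, but the pre-aggregating one sends fewer records whenever words repeat within a partition. In Spark, this is roughly the difference between aggregating with reduceByKey, which combines values on each partition before the shuffle, and grouping with groupByKey, which does not.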

Prerequisites: The class targets graduate students who have taken an algorithms course at the level of CME 305 or CS 261 and who can program competently in any mainstream high-level language.

There will be three homework assignments, one scribed lecture, and a project. Students taking the class for credit/no credit instead of a letter grade can skip the project.

Optional textbooks:
Convex Optimization by Boyd and Vandenberghe [BV]
Randomized Algorithms by Rajeev Motwani and Prabhakar Raghavan [MR]
Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman [HTF]

Homework

Homework 1 [pdf] [tex] [solutions], Collected Monday April 20th in class
Homework 2 [pdf] [tex] [solutions], Collected Monday May 4th in class
Homework 3 [pdf] [tex] [solutions], Collected Monday May 18th in class

Scribe template

Lectures and References

Lecture 1: Distributed Computing with Spark
[slides] [typical cluster]
Lecture 2: Distributed Optimization Overview
[slides] [SQL guide] [RDD guide] [PairRDD API]
Lecture 3: Complexity measures for clusters
[notes] [last reducer]
Lecture 4: Shuffling data to do Join, Groupby, and other all-to-all communication patterns
[slides] [notes] [timsort]
Lecture 5: Hands-on with a real cluster in class, part 1
[slides] [streaming demo]
Lecture 6: Hands-on with a real cluster in class, part 2
[slides] [parquet] [guide]
Lecture 7: Partitioning for repeated joins and Pagerank
[slides] [notes]
Lecture 8: The Pregel data flow paradigm
[slides] [notes] [pregel] [graphx]
Lecture 9: Matrix Computations: Multiplication, Singular Value Decomposition (Tall and Skinny, Square), PCA
[slides] [notes]
Lecture 10: Covariance Matrix and All-pairs Similarity
[slides] [notes]
Lecture 11: Streaming Stochastic Gradient Descent for Generalized Linear Models, Streaming K-Means
[notes] [hogwild] [parallel SGD] [SGD convergence]
Lecture 12: Streaming items through a cluster with Spark Streaming, Perceptron Introduction
[slides] [notes] [streaming examples]
Lecture 13: Streaming Proof, Alternating Direction Method of Multipliers (ADMM), Theory/Practice interface, AllReduce
[notes] [ADMM resources] [ADMM on Spark] [Slide 37]
Lecture 14: Matrix Completion, Alternating Least Squares, Generalized Low Rank Models
[notes] [slides] [paper summary] [GLRM] [FastALS]
Lecture 15: Neural Networks
[notes] [DistBelief]
Lecture 16: Distributed Decision Trees, Bag of Little Bootstraps
[notes] [PLANET] [BLB]

Projects

Swaroop Indra Ramaswamy and Rohit Patki: Distributed minimum spanning trees. [slides] [report]

Carlos Riquelme, Lan Nguyen and Sven Schmit: Cascading vector machines. [slides] [report] [Github]

Benoit Dancoisne, Emilien Dupont and William Zhang: Distributed Max-Flow in Spark. [slides] [report] [Github]

Kevin Chavez, Hao Yi Ong and Augustus Hong: Distributed Deep Q-Learning. [slides] [report] [Github]

Zi Yin and Zhiang Hu (Harvy): Parallelized Union Find Set, with an Application in Finding Connected Components in a Graph. [slides] [report]

Charles Y. Zheng, Jingshu Wang and Arzav Jain: All-Pairs Shortest Paths in Spark. [slides] [report] [Github]

Haoming Li and Bangzheng He: A Distributed Solver for Kernelized SVM. [slides] [report]

Yilong Geng and Mingyu Gao: Distributed Stable Marriage with Incomplete List and Ties using Spark. [slides] [report] [Github]

David Daniels, Eric Liu and Charles Zhang: Distributed Structural Estimation of Graph Edge-Type Weights from Noisy PageRank Orders. [slides] [report] [Github]

Yifan Jin and Shaun Benjamin: Monte Carlo Tree Search. [slides] [report] [Code]

Orren Karniol-Tambour: Data Parallel EM for estimating the Genome Relative Abundance (GRA) in Metagenomic Samples. [slides] [report] [Github]

Supplementary Materials

Advanced Data Science on Spark: [slides]

Spark Intro Tutorial: [slides] [code and data - 1 GB]

Spark Devops Slides: Spark Summit slides

Tutorial: Stanford Spark Workshop Exercises

Tutorial: Movie Recommendation with MLlib

Tutorial: Graph Analytics with GraphX

Contact

Reza: rezab at stanford.edu
Office hours: by appointment

TAs

Dieterich Lawson: jdlawson at stanford.edu
Office hours: Tuesdays 4-6pm

Simon Anastasiadis: simonsa at stanford.edu
Office hours: Wednesdays 2:15-4:15pm

TA office hours will be held in the Huang Engineering Center basement (in front of the ICME office).