Talks Abstracts. Workshop on Algorithms for Modern Massive Data Sets

Wednesday, June 25, 2008. Theme: Data analysis and data applications


9:45 - 10:00 Opening: Organizers

10:00 - 11:00 Tutorial:
Christos Faloutsos
Carnegie Mellon University


What do graphs look like? How do they evolve over time? How can we generate
realistic-looking graphs? We review some static and temporal 'laws', and
we describe the ``Kronecker'' graph generator, which naturally matches all
of the known properties of real graphs. Moreover, we present tools for
discovering anomalies and patterns in two types of graphs, static and
time-evolving. For the former, we present the 'CenterPiece' subgraphs
(CePS), which expects $q$ query nodes (e.g., suspicious people) and finds
the node that is best connected to all $q$ of them (e.g., the mastermind
of a criminal group). We also show how to compute CenterPiece subgraphs
efficiently. For time-evolving graphs, we present tensor-based
methods, and apply them on real data, like the DBLP author-paper dataset,
where they are able to find natural research communities and track their
evolution over time.

Finally, we also briefly mention some results on influence and virus
propagation on real graphs, as well as on the emerging map/reduce approach
and its impact on large graph mining.
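The Kronecker construction itself is compact enough to sketch: repeated
Kronecker powers of a small initiator matrix yield an edge-probability
matrix from which a graph is sampled. A minimal illustration (the 2x2
initiator values here are invented for the example):

```python
import numpy as np

def kronecker_graph(initiator, k, seed=0):
    """Sample a stochastic Kronecker graph: the k-th Kronecker power of a
    small initiator matrix gives per-edge probabilities."""
    rng = np.random.default_rng(seed)
    P = initiator.copy()
    for _ in range(k - 1):
        P = np.kron(P, initiator)           # node count grows as (size)^k
    return (rng.random(P.shape) < P).astype(int)

# illustrative 2x2 initiator; 3 Kronecker powers give an 8-node graph
init = np.array([[0.9, 0.5],
                 [0.5, 0.1]])
A = kronecker_graph(init, 3)
print(A.shape)   # -> (8, 8)
```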

11:00 - 11:30
Deepak Agarwal
Yahoo! Research, Silicon Valley


We propose a novel statistical model to predict large-scale dyadic
response variables in the presence of covariate information. Our approach
simultaneously incorporates the effect of covariates and estimates local
structure that is induced by interactions among the dyads through a
discrete latent factor model.  The discovered latent factors provide a
predictive model that is both accurate and interpretable. To further
combat sparsity that is often present in large scale dyadic data, we
enhance the discrete latent factors by coupling them with a factorization
model on the cluster effects. In fact, by regularizing the factors through
$L_{2}$ norm constraints, we are able to allow a larger number of factors
that capture dyadic interactions with greater precision compared to a pure
factorization model, especially for sparse data.  The methods are
illustrated in a generalized linear model framework and include linear
regression, logistic regression, and Poisson regression as special cases.
We also provide scalable generalized EM-based algorithms for model fitting
using both "hard" and "soft" cluster assignments. We demonstrate the
generality and efficacy of our approach through large scale simulation
studies and analysis of datasets obtained from certain real-world movie
recommendation and internet advertising applications.
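As a much-simplified stand-in for the talk's covariate-plus-latent-factor
model, plain L2-regularized matrix factorization on the observed entries
of a dyadic matrix already shows the prediction mechanic (all values and
the toy matrix below are invented for illustration):

```python
import numpy as np

def factorize(R, mask, k=2, lam=0.01, lr=0.05, iters=1000, seed=0):
    """L2-regularized factorization R ~ U V^T, fit by gradient descent on
    the observed entries only (mask == 1)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.5 * rng.standard_normal((m, k))
    V = 0.5 * rng.standard_normal((n, k))
    for _ in range(iters):
        E = mask * (U @ V.T - R)            # residuals on observed cells
        U, V = (U - lr * (E @ V + lam * U),
                V - lr * (E.T @ U + lam * V))
    return U, V

# rank-1 toy dyadic matrix (say, user x item ratings) with one held-out cell
R = np.outer([1.0, 2.0, 3.0], [1.0, 0.5, 2.0])
mask = np.ones_like(R); mask[0, 2] = 0
U, V = factorize(R, mask)
pred = (U @ V.T)[0, 2]   # should land near the true value R[0, 2] = 2.0
```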

11:30 - 12:00
Chandrika Kamath
Lawrence Livermore National Laboratory


Scientific data sets, whether from computer simulations, observations, or
experiments, are not only massive, but also very complex. This makes it
challenging to find useful information in the data, a fact scientists find
disconcerting as they wonder about the science still undiscovered in the
data. In this talk, I will describe my experiences in analyzing data in a
variety of scientific domains.  Using example problems from astronomy,
fluid mixing, remote sensing, and experimental physics, I will describe
our solution approaches and discuss some of the challenges we have
encountered in mining these datasets.

12:00 - 2:00 LUNCH (on your own)

2:00 - 3:00 Tutorial
Edward Chang
Google Research, Mountain View


Social networking sites such as Orkut, MySpace, Hi5, and Facebook attract
billions of visits a day, surpassing the page views of Web Search.  These
social networking sites provide applications for individuals to establish
communities, to upload and share documents/photos/videos, and to interact
with other users.  Take Orkut as an example. Orkut hosts millions of
communities, with hundreds of communities created and tens of thousands of
blogs/photos uploaded each hour.  To help users find relevant
information, it is essential to provide effective collaborative filtering
tools for recommendations such as friends, communities, and ads.

In this talk, I will first describe both computational and storage
challenges to traditional collaborative filtering algorithms brought by
the aforementioned information explosion.  To deal with huge social graphs
that expand continuously, an effective algorithm should be designed to 1)
run on thousands of parallel machines for sharing storage and speeding up
computation, 2) perform incremental retraining and updates for attaining
online performance, and 3) fuse information of multiple sources for
alleviating information sparseness.  In the second part of the talk, I
will present algorithms we recently developed including parallel PF-Growth
[1], parallel combinational collaborative filtering [2], parallel LDA,
parallel spectral clustering, and parallel Support Vector Machines [3].

[1,2,3] Papers and some code are available at

3:00 - 3:30
Sharad Goel
Yahoo! Research, New York


We tackle the computational problem of query-conditioned search. Given a
machine-learned scoring rule and a query distribution, we build a
predictive index by pre-computing lists of potential results sorted based
on an expected score of the result over future queries. The predictive
index data structure supports an anytime algorithm for approximate
retrieval of the top elements. The general approach is applicable to
webpage ranking, internet advertisement, and approximate nearest neighbor
search. It is particularly effective in settings where standard techniques
(e.g., inverted indices) are intractable. We experimentally find
substantial improvement over existing methods for internet advertisement
and approximate nearest neighbors. This work is joint with John Langford
and Alex Strehl.
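A toy sketch of the predictive-index idea (the scoring rule, features,
and "historical queries" standing in for the query distribution are all
invented): rank documents per query feature by expected score over past
queries containing that feature, then at query time score only a
budget-limited prefix of the relevant lists.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, feats = 1000, ['a', 'b', 'c']
W = {f: rng.random(n_docs) for f in feats}            # per-feature doc weights
score = lambda q, d: sum(W[f][d] for f in q)          # learned scoring rule

history = [['a'], ['a', 'b'], ['b', 'c'], ['a', 'c']] # sample of query distribution

# Predictive index: for each feature, docs sorted by *expected* score over
# historical queries containing that feature.
index = {}
for f in feats:
    qs = [q for q in history if f in q]
    exp_score = np.mean([[score(q, d) for d in range(n_docs)] for q in qs], axis=0)
    index[f] = np.argsort(-exp_score)

def anytime_top1(q, budget):
    """Anytime retrieval: score only the first `budget` docs of each list."""
    cands = {d for f in q for d in index[f][:budget]}
    return max(cands, key=lambda d: score(q, d))

q = ['a', 'b']
approx = anytime_top1(q, 50)                          # scores ~100 docs, not 1000
exact = max(range(n_docs), key=lambda d: score(q, d))
```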

3:30 - 4:00
James Demmel
University of California, Berkeley


We survey recent results on designing numerical algorithms to minimize the
largest cost component: communication.  This could be bandwidth and
latency costs between processors over a network, or between levels of a
memory hierarchy; both costs are increasing exponentially relative to the
cost of floating-point arithmetic. We describe novel algorithms in sparse
and dense linear
algebra, for both direct methods (like QR and LU) and iterative methods
that can minimize communication.

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Jun Liu
Harvard University


In a regression or classification problem, one often has many potential
predictors (independent variables), and these predictors may interact with
each other to exert non-additive effects. I will present a Bayesian
approach to search for these interactions. We were motivated by the
epistasis detection problem in population-based genetic association
studies, i.e., to detect interactions among multiple genetic defects
(mutations) that may be causal to a specific complex disease. Aided with
MCMC sampling techniques, our Bayesian method can efficiently detect
interactions among many thousands of genetic markers.

A related problem is to find patterns or associations in text documents,
such as the "market basket" problem, in which one attempts to mine
association rules among the items in a supermarket based on customers'
transaction records.  For this problem, we formulate our goal as to find
subsets of words that tend to co-occur in a sentence.  We call the set of
co-occurring words (not necessarily orderly) a "theme" or a "module". I
will illustrate a simple generative model and the EM algorithm for
inferring the themes. I will demonstrate its applications in a few
examples including an analysis of Chinese medicine prescriptions and an
analysis of a Chinese novel.

5:00 - 5:30
Fan Chung
University of California, San Diego


We will discuss four partitioning algorithms using eigenvectors, random
walks, PageRank and their variations. In particular, we will examine local
partitioning algorithms, which find a cut near a specified starting
vertex, with a running time that depends on the size of the small side of
the cut, rather than on the size of the input graph (which can be
prohibitively large). Three of the four partitioning algorithms are local
algorithms and are particularly appropriate for applications for modern
massive data sets.
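A minimal sketch of the local flavor of these methods: run personalized
PageRank from a seed vertex, then sweep over vertices in score order and
keep the prefix of minimum conductance. (Dense power iteration here, for
clarity; the local algorithms in the talk avoid touching the whole graph.)

```python
import numpy as np

def personalized_pagerank(A, seed, alpha=0.85, iters=200):
    """Power iteration for PageRank with restarts at `seed`."""
    n = A.shape[0]
    P = A / A.sum(axis=1)[:, None]          # row-stochastic walk matrix
    s = np.zeros(n); s[seed] = 1.0
    r = s.copy()
    for _ in range(iters):
        r = (1 - alpha) * s + alpha * (P.T @ r)
    return r

def sweep_cut(A, score):
    """Sort vertices by score; return the prefix set of minimum conductance."""
    n, vol = A.shape[0], A.sum()
    order = np.argsort(-score)
    best, best_phi, S = None, np.inf, set()
    for i in order[:-1]:
        S.add(i)
        cut = sum(A[u, v] for u in S for v in range(n) if v not in S)
        volS = sum(A[u].sum() for u in S)
        phi = cut / min(volS, vol - volS)
        if phi < best_phi:
            best, best_phi = set(S), phi
    return best, best_phi

# two triangles joined by the single edge (2, 3); seed the walk at vertex 0
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
S, phi = sweep_cut(A, personalized_pagerank(A, seed=0))
print(sorted(int(i) for i in S), round(phi, 4))   # -> [0, 1, 2] 0.1429
```

The sweep recovers the seed's triangle, whose conductance is 1/7.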

5:30 - 6:00
Ronald Coifman
Yale University


We discuss the emergence of self organization of data, either through
eigenvectors of affinity operators, or equivalently through a multiscale
ontology. We illustrate these ideas on images and audio files as well as
molecular dynamics.


Thursday, June 26, 2008. Theme: Networked data and algorithmic tools

9:00 - 10:00 Tutorial:
Milena Mihail
Georgia Institute of Technology


We will discuss new trends in the study of models and algorithms for
complex networks. The common theme is to enhance network elements, say
nodes, with a small list of attributes describing the profile of the node
relative to other nodes of the network. Indeed, this is common practice in
real complex networks.

Standard models for complex networks have captured sparse graphs with
heavy tailed statistics and/or the small world phenomenon. We will discuss
random dot product graphs, which is a model generating networks with a
much wider range of properties. These properties include varying degree
distributions, varying graph densities, and varying spectral
decompositions. The flexibility of this model follows from representing
nodes as d-dimensional vectors (one dimension for each relevant attribute)
sampled from natural general distributions, with links added with
probabilities proportional to a similarity function over vectors in high
dimensions.
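A minimal generator under these assumptions (the attribute distribution
below is an arbitrary choice, and the similarity function is the plain
inner product):

```python
import numpy as np

def random_dot_product_graph(X, seed=0):
    """Link i~j independently with probability <X_i, X_j>, clipped to [0,1]."""
    rng = np.random.default_rng(seed)
    P = np.clip(X @ X.T, 0.0, 1.0)
    np.fill_diagonal(P, 0.0)                 # no self-loops
    U = rng.random(P.shape)
    A = np.triu(U, 1) < np.triu(P, 1)        # sample the upper triangle...
    return (A | A.T).astype(int)             # ...and symmetrize

# n = 50 nodes with d = 2 latent attributes on a scaled simplex
rng = np.random.default_rng(0)
X = rng.dirichlet([1.0, 1.0], size=50) * 0.8
A = random_dot_product_graph(X)
```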

Concerning algorithms for complex networks, we argue that endowing local
network elements with (a computationally reasonable amount of) global
information can facilitate fundamental algorithmic primitives. For
example, we will discuss how spectral representations (in particular, a
network node being aware of its value according to a principal
eigenvector) can facilitate searching, information propagation and
information retrieval. However, computing such eigenvector components
locally in a distributed, asynchronous network is a challenging open
problem.

10:00 - 10:30
Reid Andersen
Microsoft Research, Redmond


The minimum quotient cut problem is a fundamental and well-studied problem
in graph partitioning.  We present an algorithm called Improve that
improves a proposed partition of a graph, taking as input a subset of
vertices and returning a new subset of vertices with a smaller quotient
cut score.  The most powerful previously known method for improving
quotient cuts, which is based on parametric flow, returns a partition
whose quotient cut score is at least as small as any set contained within
the proposed set.  For our algorithm, we can prove a stronger guarantee:
the quotient score of the set returned is nearly as small as any set in
the graph with which the proposed set has a larger-than-expected
intersection.  The algorithm finds such a set by solving a sequence of
polynomially many S-T minimum cut problems, a sequence that cannot be cast
as a single parametric flow problem.  We demonstrate empirically that
applying Improve to the output of various graph partitioning algorithms
greatly improves the quality of cuts produced without significantly
impacting the running time.

Joint work with Kevin Lang.

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30
Michael W. Mahoney
Yahoo! Research, Silicon Valley


The concept of a community is central to social network analysis, and thus
a large body of work has been devoted to identifying community structure.
For example, a community may be thought of as a set of web pages on
related topics, a set of people who share common interests, or more
generally as a set of nodes in a network more similar amongst themselves
than with the remainder of the network.  Motivated by difficulties we
experienced at actually finding meaningful communities in large real-world
networks, we have performed a large scale analysis of a wide range of
social and information networks.  Our main methodology uses local spectral
methods and involves computing isoperimetric properties of the networks at
various size scales -- a novel application of ideas from scientific
computation to internet data analysis. Our empirical results suggest a
significantly more refined picture of community structure than has been
appreciated previously. Our most striking finding is that in nearly every
network dataset we examined, we observe tight but almost trivial
communities at very small size scales, and at larger size scales, the best
possible communities gradually ``blend in'' with the rest of the network
and thus become less ``community-like.'' This behavior is not explained,
even at a qualitative level, by any of the commonly-used network
generation models. Moreover, this behavior is exactly the opposite of what
one would expect based on experience with and intuition from expander
graphs, from graphs that are well-embeddable in a low-dimensional
structure, and from small social networks that have served as testbeds of
community detection algorithms. Possible mechanisms for reproducing our
empirical observations will be discussed, as will implications of these
findings for clustering, classification, and more general data analysis in
modern large social and information networks.

11:30 - 12:00
Nikhil Srivastava
Yale University


A sparsifier of a graph G is a sparse subgraph H that approximates it in
some natural or useful way. Benczur and Karger gave the first sparsifiers
in 1996, in which the weight of every cut in H was within a factor of
(1+/-\epsilon) of its weight in G. In this work, we consider a stronger
notion of approximation (introduced by Spielman and Teng in 2004) that
requires the Laplacian quadratic forms of H and G to be close --
specifically, that x^TL'x = (1+/-\epsilon) x^TLx for all vectors x in R^n,
where L and L' are the Laplacians of G and H respectively. This subsumes
the Benczur-Karger notion, which corresponds to the special case of x in
{0,1}^n. It also implies that G and H are similar spectrally, and that L'
is a good preconditioner for L.

We show that every graph contains a sparsifier with O(n log n) edges, and
that one can be found in nearly-linear time by randomly sampling each edge
of the graph with probability proportional to its effective resistance. A
key ingredient in our algorithm is a subroutine of independent interest: a
nearly-linear time algorithm that builds a data structure from which we
can query the approximate effective resistance between any two vertices in
a graph in O(log n) time.
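The sampling probabilities themselves are easy to illustrate via the
dense-pseudoinverse formula R_e = (e_u - e_v)^T L^+ (e_u - e_v); the
point of the talk is computing them approximately in nearly-linear time
rather than by the O(n^3) route sketched below:

```python
import numpy as np

def effective_resistances(edges, n):
    """R_e for each edge, via the Laplacian pseudoinverse (direct O(n^3)
    computation; the talk achieves this approximately in nearly-linear time)."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1; L[v, v] += 1
        L[u, v] -= 1; L[v, u] -= 1
    Lp = np.linalg.pinv(L)
    return [Lp[u, u] + Lp[v, v] - 2 * Lp[u, v] for u, v in edges]

# path graph 0-1-2-3: every edge is a bridge, so each R_e = 1
edges = [(0, 1), (1, 2), (2, 3)]
rs = effective_resistances(edges, 4)
print([round(float(r), 6) for r in rs])   # -> [1.0, 1.0, 1.0]
```

Sampling each edge with probability proportional to w_e R_e (and
reweighting) then gives the sparsifier.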

We conjecture the existence of sparsifiers with O(n) edges, noting that
these would generalize the notion of expander graphs, which are
constant-degree sparsifiers for the complete graph.

Joint work with Dan Spielman.

12:00 - 12:30
Amin Saberi
Stanford University


Random graph generation has been studied extensively as an interesting
theoretical problem. It has also become an important tool in a variety of
real world applications including detecting motifs in biological networks
and simulating networking protocols on the Internet topology. One of the
central problems in this context is to generate a uniform sample from the
set of simple graphs with a given degree sequence.

The classic method for approaching this problem is the Markov chain Monte
Carlo (MCMC) method. However, current MCMC-based algorithms have large
running times, which make them unusable for real-world networks that have
tens of thousands of nodes. This has constrained practitioners to use
simple heuristics that are non-rigorous and have often led to wrong
conclusions.

I will present a technique that leads to almost linear time algorithms for
generating random graphs for a range of degree sequences. I will also show
how this method can be extended for generating random graphs that exclude
certain subgraphs, e.g., short cycles.

Our approach is based on the sequential importance sampling (SIS)
technique, which has recently been successful for counting and sampling
graphs.

12:30 - 2:30 LUNCH (on your own)

2:30 - 3:00
Pankaj K. Agarwal
Duke University


With recent advances in terrain-mapping technologies such as Laser
altimetry (LIDAR) and ground based laser scanning, millions of
georeferenced points can be acquired within short periods of time.
However, while acquiring and georeferencing the data has become extremely
efficient, transforming the resulting massive amounts of heterogeneous
data to useful information for different types of users and applications
is lagging behind, in large part because of the scarcity of robust,
efficient algorithms for terrain modeling and analysis that can handle
massive data sets acquired by different technologies and that can rapidly
detect and predict changes in the model as the new data is acquired.

This talk will review our on-going work on developing efficient algorithms
for terrain modeling and analysis that work with massive data sets. It
will focus on an I/O-efficient algorithm for computing contour maps of a
terrain. A few open questions will also be discussed.

3:00 - 3:30
Leonidas Guibas
Stanford University


Digital models of physical shapes are becoming ubiquitous in our economy
and life. Such models are sometimes designed ab initio using CAD tools,
but more and more often they are based on existing real objects whose
shape is acquired using various 3D scanning technologies. In most
instances, the original scanner data is just a set, but a very large set,
of points sampled from the surface of the object. We are interested in
tools for understanding the local and global structure of such large-scale
scanned geometry for a variety of tasks, including model completion,
reverse engineering, shape comparison and retrieval, shape editing,
inclusion in virtual worlds and simulations, etc. This talk will present a
number of point-based techniques for discovering global structure in 3D
data sets, including partial and approximate symmetries, shared parts,
repeated patterns, etc. It is also of interest to perform such structure
discovery across multiple data sets distributed in a network, without
ever actually bringing them all to the same host.

3:30 - 4:00
Yuan Yao
Stanford University


We develop a computational approach to explore the relatively low
populated transition or intermediate states in biomolecular folding
pathways, based on a topological data analysis tool, Mapper, with
simulation data from large-scale distributed computing. Characterization
of these transient intermediate states is crucial for the description of
biomolecular folding pathways, which is however difficult in both
experiments and computer simulations. We are motivated by recent debates
over the existence of multiple intermediates even in simple systems like
RNA hairpins. With a proper design of conditional density filter functions
and clustering on their level sets, we are able to provide structural
evidence on the multiple intermediate states.  The method is effective in
dealing with a high degree of heterogeneity in distribution, is less
sensitive to the distance metric than geometric embedding methods, and
captures structural features along multiple pathways. It is reminiscent of
classical Morse theory from mathematics, efficiently capturing topological
features of high-dimensional data sets.

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Piotr Indyk
Massachusetts Institute of Technology


In recent years, a new *linear* method for compressing
high-dimensional data (e.g., images) has been discovered. For any
high-dimensional vector x, its *sketch* is equal to Ax, where A is an m x
n matrix (possibly chosen at random). Although typically the sketch length
m is much smaller than the number of dimensions n, the sketch contains
enough information to recover an *approximation* to x. At the same time,
the linearity of the sketching method is very convenient for many
applications, such as data stream computing and compressed sensing.
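As a concrete instance of a combinatorial (sparse-matrix) sketch, here is
a Count-Min-style construction: each column of A carries one 1 per hash
row, the sketch is just Ax, and a heavy coordinate can be read back as a
minimum over its counters (an overestimate when all entries are
nonnegative). Sizes and values below are illustrative.

```python
import numpy as np

def countmin_matrix(n, width, depth, seed=0):
    """Sparse sketching matrix: one 1 per hash row in each column, so the
    sketch Ax can be maintained under streaming updates to x."""
    rng = np.random.default_rng(seed)
    A = np.zeros((depth * width, n))
    for j in range(n):
        for d in range(depth):
            A[d * width + rng.integers(width), j] = 1.0
    return A

def cm_estimate(sketch, A, j):
    """Point query: minimum over the counters coordinate j hashes into."""
    return min(sketch[r] for r in np.flatnonzero(A[:, j]))

n, width, depth = 1000, 20, 5
A = countmin_matrix(n, width, depth)       # sketch length 100 << n = 1000
x = np.zeros(n); x[7] = 50.0; x[3] = 3.0   # one heavy coordinate plus noise
y = A @ x                                  # the linear sketch
est = cm_estimate(y, A, 7)                 # >= 50, inflated only by collisions
```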

The major sketching approaches can be classified as either combinatorial
(using sparse sketching matrices) or geometric (using dense sketching
matrices). They achieve different trade-offs, notably between the
compression rate and the running time. Thus, it is desirable to understand
the connections between them, with the ultimate goal of obtaining the
"best of both worlds" solution.

In this talk we show that, in a sense, the combinatorial and geometric
approaches are based on different manifestations of the same phenomenon.
This connection will enable us to obtain several novel algorithms and
constructions, which inherit advantages of sparse matrices, such as lower
sketching and recovery times.

Joint work with: Radu Berinde, Anna Gilbert, Howard Karloff, Milan Ruzic
and Martin Strauss.

5:00 - 5:30
Ping Li
Cornell University


The method of stable random projections has become a popular tool for
dimension reduction, in particular, for efficiently computing pairwise
distances in massive high-dimensional data (including dynamic streaming
data) matrices, with many applications in data mining and machine learning
such as clustering, nearest neighbors, and kernel methods. Closely
related to stable random projections, Compressed Counting (CC) is recently
developed to efficiently compute Lp frequency moments of a single dynamic
data stream. CC exhibits a dramatic improvement over stable random
projections when p is about 1. Applications of CC include estimating
entropy moments of data streams and statistical parameter estimation in
dynamic data using low memory.
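For p = 1 the stable-projection idea is easy to demonstrate: project with
a Cauchy (1-stable) matrix and take the median of absolute coordinates,
which estimates the l1 distance. (A sketch only; CC's improvement near
p = 1 uses a different design.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 400
x, y = rng.standard_normal(n), rng.standard_normal(n)

A = rng.standard_cauchy((k, n))        # 1-stable projection matrix
est = np.median(np.abs(A @ (x - y)))   # median(|Cauchy|) = 1, so this ~ ||x-y||_1
true = np.abs(x - y).sum()
print(est / true)                      # close to 1
```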

5:30 - 6:00
Joel Tropp
California Institute of Technology


A deep result of Bourgain and Tzafriri states that every matrix with
unit-norm columns contains a large collection of columns that is
exceptionally well conditioned. This talk describes a randomized,
polynomial-time algorithm for producing the desired submatrix.

Friday, June 27, 2008. Theme: Statistical, geometric, and topological methods

9:00 - 10:00 Tutorial
Jerome H. Friedman
Stanford University


Regularized regression and classification methods fit a linear model to
data, based on some loss criterion, subject to a constraint on the
coefficient values. As special cases, ridge-regression, the lasso, and
subset selection all use squared-error loss with different particular
constraint choices. For large problems the general choice of
loss/constraint combinations is usually limited by the computation
required to obtain the corresponding solution estimates, especially when
non convex constraints are used to induce very sparse solutions. A fast
algorithm is presented that produces solutions that closely approximate
those for any convex loss and a wide variety of convex and non convex
constraints, permitting application to very large problems. The benefits
of this generality are illustrated by examples.

10:00 - 10:30
Tong Zhang
Rutgers University


Consider linear least squares regression where the target function is a
sparse linear combination of a set of basis functions. We are interested
in the problem of identifying those basis functions with non-zero
coefficients and reconstructing the target function from noisy
observations. This problem is NP-hard. Two heuristics that are widely used
in practice are forward and backward greedy algorithms. First, we show
that neither idea is adequate. Second, we propose a novel combination that
is based on the forward greedy algorithm but takes backward steps
adaptively whenever necessary. We prove strong theoretical results showing
that this procedure is effective in learning sparse representations.
Experimental results support our theory.
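The forward half of the procedure is easy to sketch on synthetic data
(the talk's contribution, the adaptive backward deletion steps, is
omitted here; the data-generating model is invented):

```python
import numpy as np

def forward_greedy(X, y, k):
    """Forward selection: repeatedly add the feature that most reduces the
    residual sum of squares of the least-squares fit."""
    n, p = X.shape
    S = []
    for _ in range(k):
        best, best_rss = None, np.inf
        for j in range(p):
            if j in S:
                continue
            Xs = X[:, S + [j]]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        S.append(best)
    return sorted(S)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 2] - 2 * X[:, 5] + 0.1 * rng.standard_normal(200)
print(forward_greedy(X, y, 2))   # -> [2, 5]
```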

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30
Jitendra Malik
University of California, Berkeley


Straightforward classification using (non-linear) kernelized SVMs requires
evaluating the kernel for a test vector and each of the support vectors.
This can be prohibitively expensive, particularly in image recognition
applications, where one might be trying to search for an object at any of
a number of possible scales and locations. We show that for a large class
of kernels, namely those based on histogram intersection and its variants,
one can build classifiers with run time complexity logarithmic in the
number of support vectors (as opposed to linear in the standard case). By
precomputing auxiliary tables, we can construct approximations with
constant runtime and space requirements, independent of the number of
SVs.  These improvements lead to up to 3 orders of magnitude speedup in
classification time and up to 2 orders of magnitude memory savings on
various classification tasks. We also obtain state-of-the-art results
on pedestrian classification datasets.
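The core trick can be sketched directly: per dimension, sort the
support-vector values and store prefix sums, so the per-dimension sum of
min(x_i, z_i) terms becomes a binary search plus two lookups. (Toy data;
the paper's constant-time variant replaces the search with a quantized
lookup table.)

```python
import numpy as np

def precompute(Z, alpha):
    """Per-dimension sorted SV values with prefix sums of alpha*z and alpha."""
    tables = []
    for i in range(Z.shape[1]):
        order = np.argsort(Z[:, i])
        z, a = Z[order, i], alpha[order]
        tables.append((z,
                       np.concatenate(([0.0], np.cumsum(a * z))),
                       np.concatenate(([0.0], np.cumsum(a)))))
    return tables

def fast_decision(x, tables, total_alpha):
    """sum_i sum_l alpha_l * min(x_i, z_li), via binary search per dimension."""
    h = 0.0
    for s, (z, cum_az, cum_a) in zip(x, tables):
        k = np.searchsorted(z, s, side='right')   # SVs with z_li <= x_i
        h += cum_az[k] + s * (total_alpha - cum_a[k])
    return float(h)

def naive_decision(x, Z, alpha):
    return float(alpha @ np.minimum(Z, x).sum(axis=1))

rng = np.random.default_rng(0)
Z = rng.random((50, 8))           # 50 "support vectors" over 8 histogram bins
alpha = rng.standard_normal(50)   # signed dual coefficients
x = rng.random(8)
tables = precompute(Z, alpha)
```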

This is joint work with Subhransu Maji and Alexander Berg. A paper as well
as source code is available at

11:30 - 12:00
Elad Hazan
IBM Almaden Research Center



12:00 - 12:30
T.S. Jayram
IBM Almaden Research Center


Let A be a matrix whose rows are given by A_1, A_2, ..., A_m. For two
aggregate operators P and Q, the cascaded aggregate (P \circ Q)(A) is
defined to be P(Q(A_1), Q(A_2), ..., Q(A_m)). The problem of computing
such aggregates over data streams has received considerable interest
recently. In this talk, I will present some recent results on computing
cascaded frequency moments and norms over data streams.
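A two-line illustration of the definition, cascading the L2 norm over
per-row L1 norms:

```python
import numpy as np

def cascaded(P, Q, A):
    """(P o Q)(A) = P(Q(A_1), ..., Q(A_m)), applied row by row."""
    return P(np.array([Q(row) for row in A]))

A = np.array([[1, -2, 3],
              [0,  4, 0]])

l1 = lambda v: float(np.abs(v).sum())
l2 = lambda v: float(np.sqrt((np.asarray(v) ** 2).sum()))

out = cascaded(l2, l1, A)   # row L1 norms are (6, 4); sqrt(36 + 16) ~= 7.2111
print(round(out, 4))        # -> 7.2111
```

The streaming question is computing such quantities in small space
without materializing the intermediate vector of row aggregates.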

Joint work with David Woodruff (IBM Almaden).

12:30 - 2:30 LUNCH (on your own)

2:30 - 3:30 Tutorial
Gunnar Carlsson
Stanford University


Topological methods have in recent years shown themselves to be capable of
identifying a number of qualitative geometric properties of high
dimensional data sets.  In this talk, we will describe philosophical
reasons why topological methods should play a role in the study of data,
as well as give several examples of how the ideas play out in practice.

3:30 - 4:00
Partha Niyogi
University of Chicago


Increasingly, we face machine learning problems in very high dimensional
spaces. We proceed with the intuition that although natural data lives in
very high dimensions, they have relatively few degrees of freedom. One way
to formalize this intuition is to model the data as lying on or near a low
dimensional manifold embedded in the high dimensional space. This point of
view leads to a new class of algorithms that are "manifold motivated" and
a new set of theoretical questions that surround their analysis. A central
construction in these algorithms is a graph or simplicial complex that is
data-derived and we will relate the geometry of these to the geometry of
the underlying manifold. We will develop the idea of manifold
regularization, its applications to semi-supervised learning, and the
theoretical guarantees that may be provided in some settings.
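The data-derived graph at the heart of these constructions is simple to
build; a minimal sketch (a symmetrized kNN graph on points sampled from a
one-dimensional manifold, with its unnormalized graph Laplacian):

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Unnormalized graph Laplacian of the symmetrized kNN graph, the
    data-derived object whose geometry tracks the underlying manifold."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(D2[i])[1:k + 1]:   # skip self at distance 0
            W[i, j] = W[j, i] = 1.0
    return np.diag(W.sum(1)) - W

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 80)
X = np.c_[np.cos(t), np.sin(t)]                # points on a circle in R^2
L = knn_graph_laplacian(X, k=4)
```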

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Sanjoy Dasgupta
University of California, San Diego


The curse of dimensionality has traditionally been the bane of
nonparametric statistics (histograms, kernel density estimation, nearest
neighbor search, and so on), as reflected in running times and convergence
rates that are exponentially bad in the dimension. This problem is all the
more pressing as data sets get increasingly high dimensional. Recently the
field has been rejuvenated substantially, in part by the realization that
a lot of real-world data which appears high-dimensional in fact has low
"intrinsic" dimension in the sense of lying close to a low-dimensional
manifold. In the past few years, there has been a huge interest in
learning such manifolds from data, and then using the learned structure to
transform the data into a lower-dimensional space where standard
statistical methods generically work better.

I'll exhibit a way to benefit from intrinsic low dimensionality without
having to go to the trouble of explicitly learning its fine structure.
Specifically, I'll present a simple variant of the ubiquitous k-d tree (a
spatial data structure widely used in machine learning and statistics)
that is provably adaptive to low dimensional structure. We call this a
"random projection tree" (RP tree).

Along the way, I'll discuss different notions of intrinsic dimension --
motivated by manifolds, by local statistics, and by analysis on metric
spaces -- and relate them. I'll then prove that RP trees require resources
that depend only on these dimensions rather than the dimension of the
space in which the data happens to be situated.

This is work with Yoav Freund (UC San Diego).

5:00 - 5:30
Kenneth Clarkson
IBM Almaden Research Center


The Johnson-Lindenstrauss random projection lemma gives a simple way to
reduce the dimensionality of a set of points while approximately
preserving their pairwise Euclidean distances.  The most direct
application of the lemma applies to a finite set of points, but recent
work has extended the technique to affine subspaces, smooth manifolds, and
sets of bounded doubling dimension; in each case, a projection to a
sufficiently large dimension k implies that all pairwise distances are
approximately preserved with high probability. Here the case of random
projection of a smooth manifold (submanifold of R^m) is considered, and a
previous analysis is sharpened, giving an upper bound for k that depends
on the surface area, total absolute curvature, and a few other quantities
associated with the manifold, and in particular does not depend on the
ambient dimension m or the reach of the manifold.
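For the finite-point-set case the lemma is a one-liner to demonstrate
(dimensions and tolerances below are illustrative):

```python
import numpy as np

def jl_project(X, k, seed=0):
    """Multiply by a scaled Gaussian matrix; pairwise Euclidean distances
    are preserved up to (1 +/- eps) with high probability."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ A

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 1000))    # 20 points in ambient dimension m = 1000
Y = jl_project(X, k=400)

d_before = np.linalg.norm(X[0] - X[1])
d_after = np.linalg.norm(Y[0] - Y[1])  # ratio d_after/d_before is close to 1
```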

5:30 - 6:00
Yoram Singer
Google Research, Mountain View


Many machine learning tasks can be cast as constrained optimization
problems. The talk focuses on efficient algorithms for learning tasks
which are cast as optimization problems subject to L1 and hyper-box
constraints. The end results are typically sparse and accurate models. We
start with an overview of existing projection algorithms onto the simplex.
We then describe a linear time projection for dense input spaces. Last, we
describe a new efficient projection algorithm for very high dimensional
spaces. We demonstrate the merits of the algorithm in experiments with
large scale image and text classification.
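The simplex projection the overview starts from has a classic O(n log n)
sort-based form (a sketch only; the talk's later algorithms improve on
this for dense and very high-dimensional inputs):

```python
import numpy as np

def project_simplex(v, z=1.0):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = z}, via the
    classic sort-and-threshold rule."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - z) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - z) / (rho + 1)         # shared shift
    return np.maximum(v - theta, 0.0)

w = project_simplex(np.array([0.5, 1.2, -0.3]))
print(np.round(w, 4))   # nonnegative and sums to 1
```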

6:00 - 6:30
Arindam Banerjee
University of Minnesota, Twin Cities


In recent years, co-clustering has emerged as a powerful data mining tool
that can analyze dyadic data connecting two entities. However, almost all
existing co-clustering techniques are partitional, and allow individual
rows and columns of a data matrix to belong to only one cluster. Several
current applications, such as recommendation systems, market basket
analysis, and gene expression analysis, can substantially benefit from a
mixed cluster assignment of rows and columns. In this talk, we present
Bayesian co-clustering (BCC) models that allow mixed probabilistic
assignments of rows and columns to all clusters. BCC maintains separate
Dirichlet priors over the mixed assignments and assumes each observation
to be generated by an exponential family distribution corresponding to its
row and column clusters. The model is designed to naturally handle sparse
matrices as only the non-zero/non-missing entries are assumed to be
generated by the model. We propose a fast variational algorithm for
inference and parameter estimation in the model. In addition to finding
co-cluster structure in observations, the model outputs a low dimensional
co-embedding of rows and columns, and predicts missing values in the
original matrix. We demonstrate the efficacy of the model through
experiments on both simulated and real data.


Saturday, June 28, 2008. Theme: Machine learning and dimensionality reduction

9:00 - 10:00 Tutorial
Michael I. Jordan
University of California, Berkeley


Consider a regression or classification problem in which the data consist
of pairs (X,Y), where X is a high-dimensional vector. If we wish to find a
low-dimensional subspace to represent X, one option is to ignore Y and
avail ourselves of unsupervised methods such as PCA and factor analysis.  In
this talk, I will discuss a supervised alternative which aims to take Y
into account in finding a low-dimensional representation for X, while
avoiding making strong assumptions about the relationship of X to Y.
Specifically, the problem of "sufficient dimension reduction" (SDR) is
that of finding a subspace S such that the projection of the covariate
vector X onto S captures the statistical dependency of the response Y on X
(Li, 1991; Cook, 1998).  I will present a general overview of the SDR
problem, focusing on the formulation of SDR in terms of conditional
independence.  I will also discuss some of the popular algorithmic
approaches to SDR, particularly those based on inverse regression.
Finally, I will describe a new methodology for SDR which is based on the
characterization of conditional independence in terms of conditional
covariance operators on reproducing kernel Hilbert spaces (a
characterization of conditional independence that is of independent
interest). I show how this characterization leads to an M-estimator for S
and I show that the estimator is consistent under weak conditions; in
particular, we do not have to impose linearity or ellipticity conditions
of the kinds that are generally invoked for SDR methods based on inverse
regression.

Joint work with Kenji Fukumizu and Francis Bach.
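
As an illustration of the inverse-regression approaches mentioned above, here is a minimal sketch of sliced inverse regression (SIR): the SDR subspace is estimated from the principal directions of the slice means of the whitened covariates (illustrative only; this is not the kernel-based estimator of the talk, and the function name is ours):

```python
import numpy as np

def sir_directions(X, y, n_slices=5, n_dirs=1):
    # Sliced Inverse Regression: principal directions of Cov(E[X|Y]),
    # estimated by slicing the sorted responses.
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # whitening transform
    Z = Xc @ W
    order = np.argsort(y)
    M = np.zeros((p, p))
    for s in np.array_split(order, n_slices):
        m = Z[s].mean(axis=0)                      # slice mean in whitened space
        M += len(s) / n * np.outer(m, m)
    _, vecs = np.linalg.eigh(M)
    return W @ vecs[:, -n_dirs:]                   # back to original coordinates
```

For a linear model y = X b + noise, the single estimated direction should align with b.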

10:00 - 10:30
Nathan Srebro
University of Chicago


Traditional runtime analysis of training Support Vector Machines, and
indeed most learning methods, shows how the training runtime increases as
more training examples are available.  Considering the true objective of
training, which is to obtain a good predictor, I will argue that training
time should be studied as a decreasing, or at least non-increasing,
function of training set size.  I will then present both theoretical and
empirical results demonstrating how a simple stochastic subgradient
descent approach for training SVMs indeed displays such monotonic
decreasing behavior.
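
The stochastic subgradient approach referred to above can be sketched as follows (a Pegasos-style update on the regularized hinge loss; labels are assumed to be +/-1, and the step-size schedule and iteration count are illustrative):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, n_iter=2000, seed=0):
    # Stochastic subgradient descent on
    # lam/2 ||w||^2 + (1/n) sum_i max(0, 1 - y_i <w, x_i>).
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        i = rng.integers(len(y))          # one random example per step
        eta = 1.0 / (lam * t)             # decreasing step size
        w = (1 - eta * lam) * w           # shrink from the regularizer
        if y[i] * (X[i] @ w) < 1:         # hinge subgradient is active
            w += eta * y[i] * X[i]
    return w
```

Note that the cost per step is independent of the training set size, which is what makes the "more data, less work" phenomenon possible.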

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30
Inderjit S. Dhillon
University of Texas, Austin


Minimum rank problems arise frequently in machine learning applications
and are notoriously difficult to solve due to the non-convex nature of the
rank objective. In this talk, we will present a novel online learning
approach for the problem of rank minimization of matrices over polyhedral
sets. In particular, we present two online learning algorithms for rank
minimization - our first algorithm is a multiplicative update method based
on a generalized experts framework, while our second algorithm is a novel
application of the online convex programming framework (Zinkevich, 2003).
A salient feature of our online learning approach is that it allows us to
give provable approximation guarantees for the rank minimization problem
over polyhedral sets. We demonstrate that our methods are effective in
practice through experiments on synthetic examples, and on the real-life
application of low-rank kernel learning.

This is joint work with Constantine Caramanis, Prateek Jain and Raghu
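
The online convex programming framework of Zinkevich (2003) that the second algorithm builds on can be sketched as projected online gradient descent (a generic sketch of the framework, not the authors' rank-minimization algorithm; all names are ours):

```python
import numpy as np

def ogd(grad, project, x0, T, eta0=0.1):
    # Zinkevich-style projected online gradient descent: step along the
    # negative (sub)gradient, project back onto the feasible set, and
    # return the average iterate.
    x = x0.copy()
    avg = np.zeros_like(x0)
    for t in range(1, T + 1):
        x = project(x - eta0 / np.sqrt(t) * grad(x))
        avg += x
    return avg / T
```

For a fixed convex loss, the average iterate converges to the constrained minimizer.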

11:30 - 12:00
Nir Ailon
Google Research, New York


The Johnson-Lindenstrauss dimension reduction idea using random
projections was discovered in the early 80's.  Since then many "computer
science friendly" versions were published, offering constant factor but no
big-Oh improvements in the runtime.  Two years ago Ailon and Chazelle
showed a nontrivial algorithm with the first asymptotic improvement, and
suggested the question:  What is the exact complexity of J-L computation
from d dimensions to k dimensions?  An O(d log d) upper bound is implied
by A-C for k up to d^{1/3} (in the L2->L2 case).  In this talk I will show
how to achieve this bound for k up to d^{1/2} by combining techniques from
the theories of error correcting codes and probability in Banach spaces.
This is based on joint work with Edo Liberty.
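
A rough sketch of the "random signs, fast Hadamard mixing, then subsampling" idea behind fast JL transforms (illustrative only, not the exact A-C projection; it assumes the dimension is a power of two, and omits the sparse Gaussian stage):

```python
import numpy as np

def fast_jl(x, k, seed=0):
    # Randomized-Hadamard sketch: flip signs, apply a fast
    # Walsh-Hadamard transform in O(d log d), then sample k coordinates.
    rng = np.random.default_rng(seed)
    d = len(x)                                   # must be a power of two
    z = x * rng.choice([-1.0, 1.0], size=d)      # random sign flips
    h = 1
    while h < d:                                 # in-place fast WHT
        for i in range(0, d, 2 * h):
            a = z[i:i + h].copy()
            b = z[i + h:i + 2 * h].copy()
            z[i:i + h] = a + b
            z[i + h:i + 2 * h] = a - b
        h *= 2
    z = z / np.sqrt(d)                           # orthonormal scaling
    idx = rng.choice(d, size=k, replace=False)   # subsample coordinates
    return z[idx] * np.sqrt(d / k)               # unbiased norm estimate
```

After the randomized Hadamard step the mass of x is spread over all coordinates, so uniform sampling preserves the norm well.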

12:00 - 12:30
Satyen Kale
Microsoft Research, Redmond


Algorithms based on convex optimization, especially linear and
semidefinite programming, are ubiquitous in Computer Science. While there
are polynomial time algorithms known to solve such problems, quite often
the running time of these algorithms is very high. Designing simpler and
more efficient algorithms is important for practical impact.

In my talk, I will describe applications of a Lagrangian relaxation
technique, the Multiplicative Weights Update method in the design of
efficient algorithms for various optimization problems. We generalize the
method to the setting of symmetric matrices rather than real numbers. The
new algorithm yields the first truly general, combinatorial, primal-dual
method for designing efficient algorithms for semidefinite programming.  
Using these techniques, we obtain significantly faster algorithms for
approximating the Sparsest Cut and Balanced Separator in both directed and
undirected weighted graphs to factors of O(log n) and O(sqrt{log n}), and
also for the Min UnCut and Min 2CNF Deletion problems. The algorithm also
has applications in quantum computing and derandomization.

This is joint work with Sanjeev Arora.
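
The basic Multiplicative Weights Update over n experts, the scalar setting that the talk generalizes to symmetric matrices, can be sketched as follows (losses assumed in [0, 1]; the learning rate is illustrative):

```python
import numpy as np

def mwu(loss_rows, eta=0.1):
    # Multiplicative Weights over n experts: each round, play the
    # normalized weight vector, then shrink each expert's weight
    # exponentially in its observed loss.
    n = loss_rows.shape[1]
    w = np.ones(n)
    total = 0.0
    for losses in loss_rows:
        p = w / w.sum()
        total += p @ losses            # expected loss this round
        w *= (1 - eta) ** losses       # multiplicative update
    return total, w / w.sum()
```

The weights concentrate on the best expert, and the cumulative loss stays within an additive O(log(n)/eta + eta*T) of the best expert's loss.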


2:30 - 3:00
Ravi Kannan
Microsoft Research, India


While spectral algorithms enjoy considerable success in practice, there are
not many provable results about them. In general, results are only known
for the ``average case'' assuming some probabilistic model of the data. We
discuss some models and results. Two sets of models have been analyzed.
Gaussian mixture models yield provable results on spectral algorithms. The
geometry of Gaussians and ``isoperimetry'' play a crucial role here. A
second set of models inspired by Random Graphs and Random Matrix theory
assume total mutual independence of all entries of the input matrix. While
this allows the use of the classical bounds on eigenvalues of random
matrices, it is too restrictive. A Partial Independence model, in which the
rows are assumed to be independent vector-valued random variables, is more
realistic, and results akin to those for totally independent matrices have
recently been proved in this setting.
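
A toy instance of the random-graph setting above: in a two-block planted-partition model, the second eigenvector of the adjacency matrix recovers the planted communities (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
labels = np.array([0] * (n // 2) + [1] * (n // 2))
# edge probability 0.6 within a block, 0.1 across blocks
P = np.where(labels[:, None] == labels[None, :], 0.6, 0.1)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                       # symmetric adjacency, zero diagonal
vals, vecs = np.linalg.eigh(A)
v2 = vecs[:, -2]                  # second-largest eigenvector splits the blocks
pred = (v2 > 0).astype(int)
acc = max((pred == labels).mean(), (pred != labels).mean())
```

Here the rows of A are not mutually independent entries of a fixed matrix, but each row is an independent random vector, which is exactly the partial-independence regime.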

3:00 - 3:30
Chris Wiggins
Columbia University


Connections among disparate approaches to graph partitioning may be made
by reinterpreting the problem as a special case of one of either of two
more general and well-studied problems in machine learning: inferring
latent variables in a generative model or estimating an
(information-theoretic) optimal encoding in rate distortion theory. In
either approach, placing the problem in a more general context shows how to
unite and generalize a number of existing approaches. As an encoding
problem, we see how
heuristic measures such as the normalized cut may be derived from
information theoretic quantities. As an inference problem, we see how
variational Bayesian methods lead naturally to an efficient and
interpretable approach to identifying ``communities" in graphs as well as
revealing the natural scale (or number of communities) via Bayesian
approaches to complexity control.

3:30 - 4:00
Anna Gilbert
University of Michigan, Ann Arbor


Traditionally, group testing is a design problem. The goal is to construct
an optimally efficient set of tests of items such that the test results
contain enough information to determine a small subset of items of
interest. It has its roots in the statistics community and was originally
designed for the Selective Service to find and to remove men with syphilis
from the draft. It appears in many forms, including coin-weighing
problems, experimental designs, and public health. We are interested in
both the design of tests and the design of an efficient algorithm that
works with the tests to determine the group of interest. Many of the same
techniques that are useful for designing tests are also used to solve
algorithmic problems in analyzing and in recovering statistical quantities
from streaming data. I will discuss some of these methods and briefly
discuss several recent applications in high throughput drug screening.
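
A minimal sketch of nonadaptive group testing with a random pooling design and the simple "eliminate any item that appears in a negative test" decoder (parameters are illustrative; this decoder never misses a defective item but may retain false positives when tests are scarce):

```python
import numpy as np

rng = np.random.default_rng(2)
n, t = 60, 25                     # items and pooled tests
defective = {3, 41}               # the small subset we want to find
# random pooling design: each test includes each item with prob 0.1
M = rng.random((t, n)) < 0.1
# a test is positive iff it contains at least one defective item
results = np.array([any(j in defective for j in np.nonzero(row)[0])
                    for row in M])
# decoding: clear every item that appears in some negative test
candidate = np.ones(n, dtype=bool)
for row, res in zip(M, results):
    if not res:
        candidate[row] = False
recovered = set(int(j) for j in np.nonzero(candidate)[0])
```

With enough tests the surviving candidates are exactly the defective items; the same test/recover structure underlies sparse-recovery algorithms for streaming data.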

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Lars Kai Hansen
Technical University of Denmark


While the generalization performance of high-dimensional principal
component analysis is quite well understood, matrix factorizations like
independent component analysis, non-negative matrix factorization, and
clustering are less investigated for generalizability. I will review
theoretical results for PCA and heuristics used to improve PCA test
performance, and discuss extensions to high-dimensional ICA, NMF, and
clustering.

5:00 - 5:30
Elizabeth Purdom
University of California, Berkeley


Graphs or networks are common ways of depicting information. In biology,
for example, many different biological processes are represented by
graphs, such as signalling networks or metabolic pathways. This kind of a
priori information is a useful supplement to the standard numerical data
coming from an experiment. Incorporating the information from these graphs
into an analysis of the numerical data is a non-trivial task that is
generating increasing interest.

There are many results from graph theory regarding the properties of an
adjacency matrix and other closely related matrices. These results suggest
jointly analyzing numerical data and a graph by using the spectral
decomposition of the adjacency matrix (or certain transformations of it)
to find interesting features of the numerical data that also reflect the
structure of the graph. The adjacency matrix representation of graphs also
underlies similar kernel methods that have been used to jointly analyze
graphs and data.

We will briefly discuss these problems and how these ideas can be used
for prediction of novel graph structure as well as graph-based analysis of
numerical data.

5:30 - 6:00
Holly Jin


We explore the use of basis pursuit denoising (BPDN) for sparse
nonnegative matrix factorization (sparse NMF).  A matrix A is approximated
by low-rank factors UDV', where U and V are sparse with unit-norm columns,
and D is diagonal.  We use an active-set BPDN solver with warm starts to
compute the rows of U and V in turn. (Friedlander and Hatz have developed
a similar formulation for both matrices and tensors.)  We present
computational results and discuss the benefits of sparse NMF for some real
matrix applications.

This is joint work with Michael Saunders.
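
The per-row subproblems above can be sketched generically with a proximal-gradient (ISTA-style) solver for nonnegative l1-regularized least squares (an illustrative solver of the same BPDN-type subproblem, not the active-set method of the talk; the function name is ours):

```python
import numpy as np

def nn_bpdn(A, b, lam=0.1, n_iter=500):
    # Proximal gradient (ISTA) for min ||Ax - b||^2 / 2 + lam * ||x||_1
    # subject to x >= 0; the prox step is a shifted projection.
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ x - b)            # gradient of the quadratic term
        x = np.maximum(x - (g + lam) / L, 0.0)
    return x
```

For A = I this reduces to soft-thresholding of b at level lam, which shows where the sparsity in U and V comes from.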

6:00 - 6:30
Lek-Heng Lim
University of California, Berkeley


Modern ranking data is often incomplete, unbalanced, and arises from a
complex network. We will propose a method to analyze such data using
combinatorial Hodge theory, which may be regarded either as an additive
decomposition of a skew-symmetric matrix into three matrices with special
properties or as a decomposition of a weighted digraph into three orthogonal
components. In this framework, ranking data is represented as a
skew-symmetric matrix and Hodge-decomposed into three mutually orthogonal
components corresponding to globally consistent, locally consistent, and
locally inconsistent parts of the ranking data. Rank aggregation then
naturally emerges as projections onto a suitable subspace and an
inconsistency measure of the ranking data arises as the triangular trace

This is joint work with Yuan Yao of Stanford University.
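
A minimal sketch of the globally consistent (gradient) component on a complete comparison graph: the least-squares potential is simply the row mean of the skew-symmetric data matrix (illustrative only; real ranking data is incomplete and requires a graph Laplacian solve):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
s_true = np.arange(n, dtype=float)        # hidden global scores
Y = s_true[:, None] - s_true[None, :]     # skew-symmetric pairwise comparisons
E = rng.normal(size=(n, n))
Y = Y + 0.1 * (E - E.T)                   # skew-symmetric noise
# Projecting onto gradient flows: minimizing sum_{i,j} (s_i - s_j - Y_ij)^2
# with sum_i s_i = 0 gives the closed form s_i = mean_j Y_ij.
s_hat = Y.mean(axis=1)
```

The residual Y_ij - (s_hat_i - s_hat_j) then splits further into the locally consistent and locally inconsistent parts of the Hodge decomposition.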