Talks Abstracts. Workshop on Algorithms for Modern Massive Datasets

Wednesday, June 21, 2006. Theme: Linear Algebraic Basics


10:00 - 11:00 Tutorial:
Ravi Kannan
Department of Computer Science
Yale University


In many application areas, the input data matrix is too large to be
handled by traditional Linear Algebra algorithms.  A number of recent
results address this.  One proves that a small random sample of columns
and a small random sample of rows are sufficient to approximate any matrix
provided the sampling probabilities are judiciously chosen. Also, from a
good low-rank approximation (LRA) to the sampled sub-matrices, one can
derive a good LRA to the whole matrix. These approximations are suitable
for a host of numerical applications which go under the name of Principal
Component Analysis.

We also discuss applications of these methods to a broad class of
combinatorial problems of which a typical one is to maximize a low-degree
n-variable polynomial over the corners of the unit n-cube. We describe an
efficient algorithm for finding a rough analog of LRA to a tensor which
then helps us estimate the maximum.

11:00 - 11:30
Santosh Vempala
Department of Mathematics
Massachusetts Institute of Technology


It is well-known that the sum of the first k terms of the Singular Value
Decomposition of a matrix gives its optimal rank-k approximation (under
the 2-norm or the Frobenius norm). Is computing the SVD essential for
low-rank approximation?

It was shown in 1998 (FKV) that a small sample of rows chosen according to
their squared lengths can be used to compute a low-rank approximation
whose error is at most the best possible plus a term that depends on the
sample size and the norm of the original matrix. This leads to a
randomized algorithm to compute such an approximation in time linear in
the number of non-zero entries of the matrix.

In this talk, we discuss this approach and present two ways of
generalizing it, called adaptive sampling and volume sampling. Together,
they show that a sample of O(k/c + klogk) rows of any matrix can generate
a rank-k matrix whose error is at most (1+c) times that of the optimum
rank-k matirx. The algorithm based on this computes such a multiplicative
low-rank approximation, also in linear time.

11:30 - 12:00
Petros Drineas
Department of Computer Science
Rensselaer Polytechnic Institute


In this talk we will focus on low-rank matrix decompositions that are
explicitly expressed in terms of a small number of actual columns and/or
rows of a matrix, as opposed to, e.g., linear combinations of up to all
the columns and/or rows of the matrix, such as provided by truncating the
singular value decomposition (SVD). Motivations for studying such matrix
decompositions include (i) the efficient decomposition of large low-rank
matrices that posses additional structure such as sparsity or
non-negativity, (ii) expressing Gram matrices in statistical learning
theory in terms of a small number of actual data points, (iii) the
improved data interpretability that such decompositions provide for
datasets in the Internet domain, the social sciences, biology, chemistry,
medicine, etc., and (iv) the efficient computation of low-rank matrix
approximations in space-constrained settings.

We shall discuss two such decompositions. Given an m-by-n matrix A, the
first one approximates A by the product CX, where C consists of a few
columns of A, and X is a coefficient matrix. The second one is of the form
CUR, where C consists of a few columns of A, R consists of a few rows of
A, and U is a carefully constructed, constant-sized matrix. Previous such
matrix decompositions only guaranteed additive error approximations. Our
algorithms for constructing C, X, U, and R take low polynomial time and
return approximations that are almost the best possible in a relative
error sense. They employ the subspace sampling technique; in particular,
rows and columns of A are sampled with probabilities that depend on the
lengths of the rows of the top few singular vectors of A.

12:00 - 1:30 LUNCH (on your own)

1:30 - 2:30 Tutorial
Dianne O'Leary
Department of Computer Science
  and Institute for Advanced Computer Studies
University of Maryland, College Park


This tutorial will survey a wide variety of matrix decompositions that
have been proposed for use in information retrieval. Comparisons among the
methods and their applicability to massive data sets will be emphasized.

2:30 - 3:00
G.W. Stewart
Department of Computer Science
University of Maryland, College Park


The purpose of this talk is to describe an algorithm for producing a
reduced rank approximation to a sparse mxn matrix A in which the factors
are either small or sparse.  When the sparseness of the factorization is
not an issue, there are two decompositions that give reduced-rank
decompositions: the singular value decomposition and the pivoted QR
decomposition.  This talk focuses on the latter.

We will show how a variant of the pivoted Gram--Schmidt algorithm can
produce approximations to A of the form XS, where where X is an mxp matrix
consisting of columns of X and S is a pxn dense matrix. Thus if p is
sufficiently small, this factorization is suitable for the case where m is
much larger than n.  The same algorithm applied to the transpose of A an
approximation of the form TY, where Y is an pxn matrix matrix consisting
of rows of A and T is an dense mxp matrix. If both approximations have
been computed, they may be combined into a approximation of the form XUY,
where U is a pxp matrix--an approximation suitable for the case where both
m and n are large.

The approximations can be computed a column (or row) at a time.  At each
stage the error in the approximation can be computed essentially for free,
thus giving a rigorous criterion for stopping the process.

3:00 - 3:30
Haesun Park
College of Computing
Georgia Institute of Technology


Fisher's discriminant analysis for binary class dimension reduction and
the binary classifier design based on the minimum squared error
formulation (MSE) have been widely utilized for handling high-dimensional
clustered data sets. As the data sets get modified from incorporating new
data points and deleting obsolete data points, there is a need to develop
efficient updating and downdating algorithms for these methods to avoid
expensive recomputation of the solution from scratch.

In this talk, an efficient algorithm for adaptive linear and nonlinear
kernel discriminant analysis based on regularized MSE, called KDA/RMSE, is
introduced. In adaptive KDA/RMSE, updating and downdating of the
computationally expensive eigenvalue decomposition (EVD) or singular value
decomposition (SVD) is approximated by updating and downdating of the QR
decomposition achieving an order of magnitude speed up. This fast
algorithm for adaptive kernelized discriminant analysis is designed by
utilizing regularization and some relationship between linear and
nonlinear discriminant analysis for dimension reduction and the MSE for
classifier design. An efficient algorithm for computing leave-one-out
cross validation is also introduced by utilizing downdating of KDA/RMSE.

Based on a joint work with Hyunsoo Kim and Barry Drake.

3:30 - 4:00 COFFEE BREAK

4:00 - 4:30
Michael Mahoney
Yahoo! Research


Much recent work in theoretical computer science, numerical linear
algebra, machine learning, and data analysis has considered low-rank
matrix decompositions of the following form: given an m-by-n matrix A,
decompose it as a product of three matrices, C, U, and R, where C consists
of a few columns of A, R consists of a few rows of A, and U is a small
carefully constructed matrix that guarantees that the product CUR is
"close," either provably or empirically, to A. Applications of such
decompositions include the computation of compressed "sketches" for large
matrices in a pass-efficient manner, matrix reconstruction, speeding up
kernel-based statistical learning computations, sparsity-preservation in
low-rank approximations, and improved interpretability of data analysis
methods. We shall review the motivation for these decompositions, discuss
various choices for the matrices C, U, and R that are appropriate in
different application domains, and then discuss the application of CUR
decompositions to three diverse application domains: human genetics,
medical imaging, and internet recommendation systems. In each of these
applications, the columns and rows have a natural interpretation in terms
of the processes generating the data, and CUR decompositions can be used
to solve reconstruction, clustering, classification, or prediction
problems in each of these areas.

4:30 - 5:00
Daniel Spielman
Department of Computer Science
Yale University


Motivated by the problem of solving systems of linear equations, we
develop nearly-linear time algorithms for partitioning graphs and for
building strong sparsifiers. The graph partitioning algorithm is the first
provably fast graph partitioning algorithm that finds sparse cuts of
nearly optimal balance. It is based on an algorithm for finding small
clusters in massive graphs that runs in time proportional to the size of
the cluster it outputs. Using the graph partitioning algorithm and random
sampling, we show how to spectrally spproximate any graph by a sparse

We then apply these algorithms, along with constructions of low-strech
spanning trees, to optimize a preconditioner contruction introduced by
Vaiday. We thereby obtain a nearly-linear time algorithm for approximately
solving systems of linear equations of the form Ax=b, where A is symmetric
and diagonally dominant.

Joint work with Shang-Hua Teng.

5:00 - 5:30
Anna Gilbert/Martin Strauss
Department of Mathematics
University of Michigan, Ann Arbor


Coding theory has played a central role in the development of computer
science.  One critical point of interaction is decoding error-correcting
codes.  First- and second-order Reed-Muller (RM(1) and RM(2),
respectively) codes are two fundamental error-correcting codes which arise
in communication as well as in probabilistically checkable proofs and
learning.  (The theory of first-order Reed-Muller codes is also known as
Fourier analysis on the Boolean cube.)

In this paper, we take the first steps toward extending the decoding tools
of RM(1) into the realm of quadratic binary codes.  We show how to recover
a substantial subcode of RM(2) called a Kerdock code, in the presence of
significant noise.  The Kerdock codes are a well-studied family of codes
for coding theory, radar signaling, and spread spectrum communication.  
Our result is a list-decoding result for Kerdock codes which is roughly
analogous to that of RM(1).  In addition, we present a new algorithmic
characterization of Kerdock codes that we hope will be more useful to the
algorithmic (and coding theory) community than the classic descriptions.

Joint work with Robert Calderbank.

5:30 - 6:00
Bob Plemmons
Department of Mathematics and Computer Science
Wake Forest University


Nonnegative matrix factorization (NMF) is a technique with numerous
applications that has been used extensively in recent years for
approximating high dimensional data where the data are composed of
nonnegative entries. The process involves low-rank nonnegative approximate
matrix factorizations to allow the user to work with reduced dimensional
models and to facilitate more efficient statistical classification of
data. In spectral imaging, we will describe our use of NMF to obtain
spectral signatures for space object identification using a point source,
imaged at different spectral frequencies.  In other applications, we are
also concerned with multispectral data to be processed for extended object
identification and analysis, and for processing stacks of correlated 2D
images. We wish to find a nonnegative factorization of a multiway array,
i.e., a tensor. The non-negative tensor factorization (NTF) problem has a
unique set of difficulties for our applications that we will discuss.

Joint work with Christos Boutsidis, Misha Kilmer, and Paul Pauca.

6:00 - 6:30
Art Owen
Department of Statistics
Stanford University
A hybrid of multivariate regression and factor analysis

Multivariate regression is a useful tool for identifying age related genes
from microarray data. Introducing latent variables into the regression
provides a diagnostic tool for spotting observations that are outliers and
for identifying missing variables. An example of the former is a patient
whose kidney appears, from gene expression, to be that of a much younger
patient. The latent variables alone yield a factor analysis model while
the measured variables alone are a multivariate regression. The criterion
for the combined model is a Frobenious norm on the residual. The set of
minimizers can be characterized in terms of a singular value
decomposition. A power iteration appears to be faster the SVD in some real

Based on a collaboration with Stuart Kim and Jacob Zahn.


Thursday, June 22, 2006. Theme: Industrial Applications and Sampling Methods

9:00 - 10:00 Tutorial:
Prabhakar Raghavan
Yahoo! Research


Web search has come to dominate our consciousness as a convenience we take
for granted, as a medium for connecting advertisers and buyers, and as a
fast-growing revenue source for the companies that provide this service.
Following a brief overview of the state of the art and how we got there,
this talk covers a spectrum of technical challenges arising in web search
- ranging from social search to auction design and incentive mechanisms.

10:00 - 10:30
Tong Zhang
Yahoo! Research


Ranking has important applications in web-search, advertising, user
recommender systems, etc, and is essential for internet companies such as

In this talk, I will first go over some formulations and methods of
ranking in the statistical literature. Then I will focus on a formulation
suitable for web search, and talk about training relevance models based on
DCG (discounted cumulated gain) optimization. Under this metric, the
system output quality is naturally determined by the performance near the
top of its rank-list. I will mainly discuss various theoretical issues in
this learning problem. They reflect real issues encountered at Yahoo when
building a machine learning based web-search ranking system.

Joint work with David Cossock at Yahoo.

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30
Michael Berry
Department of Computer Science
University of Tennessee, Knoxville


Automated approaches for the identification and clustering of semantic
features or topics are highly desired for text mining applications.  
Using a low rank non-negative matrix factorization algorithm to retain
natural data non-negativity, we eliminate the need to use subtractive
basis vector and encoding calculations present in techniques such as
principal component analysis for semantic feature abstraction.  Existing
techniques for non-negative matrix factorization are briefly reviewed and
a new approach for non-negative matrix factorization is presented.  A
demonstration of the use of this approach for topic (or discussion)
detection and tracking is presented using the Enron Email Collection.

Joint work with Murray Browne.

11:30 - 12:00
Hongyuan Zha
Department of Computer Science and Engineering
Pennsylvania State University


In this talk we discuss information retrieval methods that aim at serving
a diverse stream of user queries. We propose methods that emphasize the
importance of taking into consideration of query difference in learning
effective retrieval functions. We formulate the problem as a multi-task
learning problem using a risk minimization framework. In particular, we
show how to calibrate the empirical risk to incorporate query difference
in terms of introducing nuisance parameters in the statistical models, and
we also propose an alternating optimization method to simultaneously learn
the retrieval function and the nuisance parameters. We work out the
details for both L1 and L2 regularization cases. We illustrate the
effectiveness of the proposed methods using modeling data extracted from a
commercial search engine.

Joint work with Zhaohui Zheng, Haoying Fu and Gordon Sun.

12:00 - 12:30
Trevor Hastie/Ping Li
Department of Statistics
Stanford University


Sampling and sketching (including random projections) are two basic
strategies for dimension reduction. For dimension reduction in L2 using
random projections, we show that the accuracy can be improved by taking
advantage of the marginal information; and the efficiency can be improved
dramatically using a very sparse random projection matrix. For dimension
reduction in L1, we prove an analog of the JL lemma using nonlinear
estimators. Previous known results have proved that no analog of the JL
lemma exists for L1 if restricted to linear estimators. As a competitor to
random projections, a sketch-based sampling algorithm is very efficient
and highly flexible in sparse data. This sampling algorithm generates
random samples online from sketches. These various methods are useful for
mining huge databases (e.g., association rules), distance-based
clustering, nearest neighbor searching, as well as efficiently computing
the SVM kernels.

Joint work with Kenneth Church.

12:30 - 2:00 LUNCH (on your own)

2:00 - 3:00 Tutorial
Muthu Muthukrishnan
Google Inc.


Recently Donoho introduced the problem of Compressed Sensing and proved an
interesting result that there exists a set of roughly klog n vectors of
dimension n each such that any n dimensional vector can be recovered to
nearly best accuracy possible using k terms, from only the inner products
of the vector with the klog n vectors.  This result has partial precursors
in algorithms research and many postcursors. I will provide an overview of
Compressed Sensing, its pre- and postcursors, from an algorithmicist's
point of view.

3:00 - 3:30
Inderjit Dhillon
Department of Computer Sciences
University of Texas, Austin


Bregman divergences offer a rich variety of distortion measures that are
applicable to varied applications in data mining and machine learning.
Popular Bregman vector divergences are the squared Euclidean distance,
relative entropy and the Itakura-Saito distance. These divergence measures
can be extended to Hermitian matrices, yielding as special cases, the
squared Frobenius distance, von Neumann divergence and the Burg
divergence. In this talk, I will talk about the kernel learning problem
that uses these Bregman matrix divergences in a natural way. A key
advantage is in obtaining efficient update algorithms for learning
low-rank kernel matrices.

Joint work with Brian Kulis, Matyas Sustik and Joel Tropp.

3:30 - 4:00
Bruce Hendrickson
Sandia National Laboratories


Latent semantic analysis (LSA) is a method for information retrieval and
processing which is based upon the singular value decomposition. It has a
geometric interpretation in which objects (e.g. documents and keywords)
are placed in a low-dimensional geometric space.  In this talk, we derive
a new algebraic/geometric method for placing objects in space to
facilitate information analysis.  Our approach uses an alternative
motivation, and reduces to the computation of eigenvectors of Laplacian
matrices.  We show that our method, which we call Fiedler retrieval, is
closely related to LSA, and essentially equivalent for particular choices
of scaling parameters. We then show that Fiedler retrieval supports a
number of generalizations and extensions that existing LSA approaches
cannot handle, including unified text and link analysis.

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
Piotr Indyk
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology


The nearest neighbor problem is defined as follows: given a set of n data
points in a d-dimensional space, construct a data structure which, given a
query point q, returns the data point closest to the query.

The problem is of importance in multiple areas of computer science.
Nearest neighbor algorithms can be used for: classification (in machine
learning), similarity search (in biological, text or visual databases),
data quantization (in signal processing and compression), etc. Most of
these applications involve high-dimensional data. Unfortunately, classic
algorithms for geometric search problems, such as kd-trees, do not scale
well as the dimension increases.

In recent years, there has been extensive research done on *approximate*
variants of the nearest neighbor problem. One of the algorithms, based on
the idea of Locality-Sensitive Hashing (LSH), has been successfully used
in several applied scenarios.

In this talk I will review that work, as well as describe recent
developments in this area. In particular, I will show a new LSH-based
algorithm, whose running time significantly improves over the earlier
bounds. The running time of the new algorithm is provably close to the
best possible in the class of hashing-based algorithms.

5:00 - 5:30
Moses Charikar
Department of Computer Science
Princeton University


Several algorithmic techniques have been devised recently to deal with
large volumes of data. At the heart of many of these techniques are
ingenious schemes to represent data compactly. This talk will present
several examples of such compact representation schemes. Some of these are
inspired by techniques devised in the field of approximation algorithms
for "rounding" solutions of LP and SDP relaxations. We will also see how
such compact representation schemes lead to efficient, one-pass algorithms
for processing large volumes of data (streaming algorithms).

5:30 - 6:00
Sudipto Guha
Department of Computer and Information Science
University of Pennsylvania


In this talk we will explore several new directions in data streams,
particularly at the intersection of streams, information and signal
processing. One of the most important feature of a data stream is the
meaning or the semantics of the elements streaming by. They typically
either define a distribution over a domain or represent a signal. We will
investigate a few problems corresponding to these two scenarios. The first
set of problems will consider modeling a distribution as a stream of
updates, and investigate space efficient algorithms for computing various
information theoretic properties, divergences etc. The second set of
problems would consider wavelet processing on data streams.

6:00 - 6:30
Frank McSherry
Microsoft Research


We discuss privacy issues associated with the analysis of sensitive data
sets, motivated by the general inaccessibility of sensitive data due to
privacy concerns. We present a definition of "differential privacy", in
which each user is assured that no conclusion reached by the data analyst
is much more likely than if than user had not participated in the study.
The unequivocal nature of the statement, quantified over all possible
conclusions and all prior knowledge of the analyst, gives very strong
privacy guarantees. Additionally, we show detail how many common analyses
can be faithfully implemented in a framework that yields differential
privacy, and discuss the opportunities for incorporating new analyses.

Joint work with Cynthia Dwork.

Friday, June 23, 2006. Theme: Kernel and Learning Applications

9:00 - 10:00 Tutorial
Dimitris Achlioptas
Department of Computer Science
University of California, Santa Cruz


We will see how carefully crafted random matrices can be used to enhance a
wide variety of computational tasks, including:  dimensionality reduction,
spectral computations, and kernel methods in machine learning. Several
examples will be considered, including the following two.

Imagine that we want to compute the top few eigenvectors of a matrix A,
hoping to extract the "important" features in A. We will prove that either
this is not worth doing, or that we can begin by randomly throwing away a
large fraction of A's entries.

A famous result of Johnson and Lindenstrauss asserts that any set of n
points in d-dimensional Euclidean space can be embedded in k- dimensional
Euclidean space, where k=O(log n), such that all pairwise distances are
preserved with arbitrarily good accuracy. We prove that to construct such
an embedding it suffices to multiply the n x d matrix of points with a
random d x k matrix, whose entries are set to 1 independently, with equal

The emphasis of the talk will be on Euclidean intuition. In particular, no
prior knowledge on random matrices and/or linear algebra will be assumed.

10:00 - 10:30
Tomaso Poggio
Center for Biological and Computational Learning,
  Computer Science and Artificial Intelligence Laboratory,
  and McGovern Institute for Brain Research
Massachusetts Institute of Technology


The problem of learning is one of the main gateways to making intelligent
machines and to understanding how the brain works. In this talk I will
give a brief overview of recent work on learning theory and on general
conditions for predictivity of an algorithm and ultimately for science
itself. I will then show a few examples of our efforts in developing
machines that learn for applications such as visual recognition and
graphics and speech synthesis. Finally, I will sketch a new theory of the
ventral stream of the visual cortex, describing how the brain may learn to
recognize objects, and show that the resulting model is capable of
performing recognition on datasets of complex images at the level of human
performance in rapid categorization tasks. The model performs as well or
better than most state-of-art computer vision systems in recognition of
complex images.

Relevant papers can be downloaded from

10:30 - 11:00 COFFEE BREAK

11:00 - 11:30
Stephen Smale
Department of Mathematics
University of California, Berkeley
Toyota Technological Institute
University of Chicago


The following questions are discussed. Suppose points are drawn at random
from a submanifold M of Euclidean space. How may one reconstruct the
topology and the geometry of M? and with what confidence?

11:30 - 12:00
Gunnar Carlsson
Department of Mathematics
  and Institute for Computational and Mathematical Engineering
Stanford University


In recent years techniques have been developed for evaluating Betti
numbers of spaces given only a finite but large set of points (a "point
cloud") sampled from the space.  Such techniques can be applied in
statistical settings where the data being considered is high dimensional
(and hence cannot be visualized) and non-linear in nature.  This talk will
describe the techniques required to obtain the Betti numbers, as well as
discuss some actual examples where they have been applied.

12:00 - 12:30
Vin de Silva
Department of Mathematics and Computer Science
Pomona College


Point-cloud data sets are discrete objects which vary continuously,
whereas topological spaces are continuous objects which vary discretely.
It follows that the discrete invariants of classical algebraic topology
(such as homology) are not immediately useful for point-cloud topology. It
is always necessary to make those invariants continuous in some way. One
very successful paradigm for this is the theory of persistent homology. I
will discuss an alternative paradigm which uses harmonic forms defined by
discrete Laplacian operators.

12:30 - 2:00 LUNCH (on your own)

2:00 - 2:30
Daniel Boley
Department of Computer Science
University of Minnesota, Twin Cities


We discuss how a fast deterministic clustering algorithm enables a variety
of applications, including the training of support vector machines and the
computation of a low memory factored representation which admits the
application of many standard algorithms on massive datasets. Some
theoretical and experimental properties of the clustering methods and of
the resulting support vector machines will be provided. We will show that
the clustering method compares favorably with k-means and that the
resulting SVM compares favorably with that computed in the usual way
directly from the data.

2:30 - 3:00
Chris Ding
Lawrence Berkeley National Laboratory


We show that the objective of NMF X=FG' is equivalent to K-means
clustering: G is cluster indicator and F contains cluster centroids. This
can be generalized to semi-NMF where X, F contain mixed-sign data, while G
being nonnegative. We further propose convex-NMF by restricting F be
convex combinations of data points, ensuring F to be meaningful cluster
centroids. We also show that the symmetric NMF W=HH' is equivalent to
Kernel \Km clustering and the Laplacian-based spectral clustering. All
these follow by rewriting K-means objective as a trace of quadratic
function whose continuous relaxation solution are given by PCA components.
We emphasize orthogonality and nonnegativity in matrix based clustering.
We derive the updating algorithms for semi-NMF, convex-NMF, symmetric-NMF
and prove their convergence. We present experiments on face images,
newgroups, web log and text data to show the effectiveness of these NMF
based clustering.

Based on joint work with Xiaofeng He, Horst Simon, Tao Li and Michael

3:00 - 3:30
Al Inselberg
School of Mathematical Sciences
Tel Aviv University


A dataset with M items has 2M subsets anyone of which may be the one we
really want. With a good data display our fantastic pattern-recognition
ability can not only cut great swaths searching through this combinatorial
explosion but also extract insights from the visual patterns. These are
the core reasons for data visualization. With Parallel Coordinates (abbr.
kcoords) the search for multivariate relations in high dimensional
datasets is transformed into a 2-D pattern recognition problem. After a
short overview of k-coords, guidelines and strategies for knowledge
discovery are illustrated on different real datasets, one with 400
variables from a manufacturing process. A geometric classification
algorithm based on k-coords is presented and applied to complex datasets.
It has low computational complexity providing the classification rule
explicitly and visually. The minimal set of variables required to state
the rule is found and ordered by their predictive value. A visual economic
model of a real country is constructed and analyzed to illustrate how
multivariate relations can be modeled by means of hypersurfaces,
understanding trade-offs, sensitivities and interelations. Parallel
coordinates is also used in collision avoidance algorithms for air traffic

P.S. Do not be intimidated by this formalistic description. The speaker is
also well known for his numerological anecdotes and palindromic

3:30 - 4:00
Joel Tropp
Department of Mathematics
University of Michigan


The heavy hitters problem elicits a list of the m largest-magnitude
components in a signal of length d. Although this problem is easy when the
signal is presented explicitly, it becomes much more challenging in the
setting of streaming data, where the signal is presented implicitly as a
sequence of additive updates. One approach maintains a small sketch of the
data that can be used to approximate the heavy hitters quickly. In
previous work, this sketch is essentially a random linear projection of
the data that fails with small probability for each signal. It is often
desirable that the sketch succeed simultaneously for ALL signals from a
given class, a requirement that may be called uniform heavy hitters. It
arises, for example, when the signal is queried a large number of times or
when the signal updates are stochastically dependent.

This talk describes a random linear sketch for uniform heavy hitters that
succeeds with high probability. The recovery algorithm produces a list of
heavy hitters that approximates the input signal with an ell_2 error that
is optimal, except for an additive term that depends on the optimal ell_1
error and a controllable parameter eps. The recovery algorithm requires
space m*poly(log(d)/eps) and time m^2*poly(log(d)/eps) to produce the list
of heavy hitters. In the turnstile model, the time per update is
m*poly(log(d)/eps). Up to logarithmic factors, the performance of this
algorithm is the best possible with respect to several resources.

Joint work with Anna Gilbert, Martin Strauss, and Roman Vershynin.

4:00 - 4:30 COFFEE BREAK

4:30 - 5:00
David Donoho
Department of Statistics
  and Institute for Computational and Mathematical Engineering
Stanford University


Many modern datasets have more variables than observations. Among those
variables it is thought that only a small number are really useful for
linear prediction, but we don't know which.  We could proceed by
combinatorial search to try all possible subsets and build prediction
models with those, but that's computationally very heavy.

In my talk I will describe some fun results showing that often one can get
just as good results without all-subsets search. For example, often one
can penalize models by the l1-norm on the coefficients and find the same
model that combinatorial optimization would have found.  There are maybe
100 papers related to this over the last 18 months, and I'll try to give a
flavor of what's being done and why.

5:00 - 5:30
Robert Tibshirani
Department of Statistics
Stanford University


In regression problems where the number of predictors greatly exceeds the
number of observations, conventional regression techniques may produce
unsatisfactory results.  We describe a technique called supervised
principal components that can be applied to this type of problem.
Supervised principal components is similar to conventional principal
components analysis except that it uses a subset of the predictors that
are selected based on their association with the outcome. Supervised
principal components can be applied to regression and generalized
regression problems such as survival analysis. It compares favorably to
other techniques for this type of problem, and can also account for the
effects of other covariates and help identify which predictor variables
are most important. We also provide asymptotic consistency results to help
support our empirical findings. These methods could become important tools
for DNA microarray data, where they may be used to more accurately
diagnose and treat cancer.

Joint work with Eric Bair, Trevor Hastie, and Debashis Paul.

5:30 - 6:00
Tao Yang and
Department of Computer Science
University of California, Santa Barbara

PAGE RANKING FOR LARGE-SCALE INTERNET SEARCH: ASK.COM'S EXPERIENCES has been developing a comprehensive suite of search and
question-answering technology and differentiated products to help users to
find what they are looking for. This talk gives an overview of's
search engine and describes many of challenges faced in seeking relevant
answers from billions of documents for tens of millions of users everyday.
Particularly, this talk will discuss our ExpertRank algorithm which
provides relevant search results by identifying topics and authoritative
sites on the Web through query-specific clustering and expert analysis.

Joint work with Apostolos Gerasoulis, Wei Wang, and other


Saturday, June 24, 2006. Theme: Tensor-Based Data Applications

9:00 - 10:00 BREAKFAST

10:00 - 11:00 Tutorial
Lek-Heng Lim
Institute for Computational and Mathematical Engineering
Stanford University


A simple recipe for creating multilinear models is to take a bilinear
model, say, a term-by-document matrix, and include one or more additional
"modes" to get, say, a term-by-document-by-URL tensor (which may come
from, for example, a collection of term-by-document matrices across
different URLs). Now, in analogy with matrix methods, one finds a low-rank
approximation to the tensor in question and hope that it reveals
interesting features about the data. Here, rank may mean the outer-product
rank or the multilinear rank (the respective models are sometimes known as
CANDECOMP/PARAFAC and the Tucker model) or indeed any other notions of
tensor rank.

An immediate question is: why should this work? Many problems arise once
we go from order-2 (matrices) to higher orders (tensors). Not least among
which is the fact that the problem of finding a best rank-r approximation
for tensors often do not even have a solution. Furthermore, there seems to
be no proper statistical foundation for such models --- what is one really
doing when decomposing a tensor of data values into a sum of outer product
of vectors? why should these vectors tell us anything about the data?

In this talk, we will discuss these and other related questions. We will
also address similar issues for two classes of tensors that are important
in data-analytic applications: symmetric tensors (Independent Component
Analysis) and nonnegative tensors (Nonnegative Tensor Factorization). We
will show how these topics have recently been enriched by ideas from the
ancient (eg. Cayley's formula for the hyperdeterminant of a 2x2x2 tensor)
to the present (eg. Donoho's work in \ell^0-approximations). Furthermore,
we shall see how some existing computational, mathematical, and
statistical tools could be employed to study multilinear models. Examples
include: using Matroid Theory to prove uniqueness of tensor
decompositions; using properties of higher secant varities of
Segre/Veronese variety to shed light on the aforementioned non-existence
of best rank-r approximations; using SDP to solve relaxed convex versions
of the best rank-r tensor approximation problem; using the Log Linear
Model, or more generally, Graphical Models and Bayesian Networks to
provide a statistical foundation.

Time permitting, we will also discuss Multilinear Spectral Theory and show
how the eigenvalues of symmetric tensors may be used to obtain basic
results in Spectral Hypergraph Theory and how one could prove a
Perron-Frobenius Theorem for irreducible non-negative tensors.

11:00 - 11:30
Eugene Tyrtyshnikov
Institute of Numerical Mathematics
Russian Academy of Sciences


In some applications we are interested to work with huge amounts of data
for which standard methods of numerical analysis cannot be applied. For
instance, consider a 3-dimensional array of size of several thousand in
every dimension. Is it possible to work with this size array any
efficiently? In the talk, we are going to show that the answer is "yes" -
however, under certain assumptions on the data.

The main assumption is that the data admits a sufficiently accurate
low-tensor-rank approximation. Then, we present a special
cross-approximation technique using a relatively small amount of the
entries of the given array and providing the so-called Tucker
decomposition with the complexity linear in one-dimensional size of the
array. The technique generalizes the low-rank approximation technique for
matrices to 3-dimensional arrays. The Tucker core can be further
approximated by a trilinear (canonical) decomposition. Development of good
algorithms solving the latter problem is still topical. To this end, we
present a new algorithm that seems to be competitive with the best known

Joint work with I. Oseledets and D. Savostianov

11:30 - 12:00
Lieven De Lathauwer
Ecole Nationale Superieure d'Electronique et de ses Applications,
and University of Cergy-Pontoise


The Parallel Factor decomposition (PARAFAC) in multilinear algebra
decomposes a higher-order tensor in a sum of rank-1 terms. The
decomposition can be essentially unique without imposing orthogonality
constraints on the rank-1 terms. Tucker's decomposition is a second way to
generalize the Singular Value Decomposition (SVD) of matrices. Here, the
transformation matrices are orthogonal. Their columns correspond to the
directions of extremal oriented energy in column space, row space, etc.
However, the decomposition does in general not reduce the tensor to
diagonal form. In this talk we introduce a new tensor decomposition, of
which Tucker's decomposition and PARAFAC are special cases. The
decomposition thus generalizes the SVD of matrices in two ways. The result
sheds new light on very fundamental aspects of tensor algebra, including
tensor rank.

12:00 - 1:30 LUNCH

1:30 - 2:00
Orly Alter
Department of Biomedical Engineering and
Institute for Cellular and Molecular Biology
University of Texas, Austin


DNA microarrays make it possible to record, for the first time, the
complete genomic signals, such as mRNA expression and DNA-bound proteins'
occupancy levels, that are generated and sensed by cellular systems. The
underlying genome-scale networks of relations among all genes of the
cellular systems can be computed from these signals. These relations among
the activities of genes, not only the activities of the genes alone, are
known to be pathway-dependent, i.e., conditioned by the biological and
experimental settings in which they are observed.

I will describe the use of the matrix eigenvalue decomposition (EVD) and
pseudoinverse projection and a tensor higher-order EVD (HOEVD) in
reconstructing the pathways, or genome-scale pathway-dependent relations
among the genes of a cellular system, from nondirectional networks of
correlations, which are computed from measured genomic signals and
tabulated in symmetric matrices [Alter & Golub, PNAS 2005].

EVD formulates a genes $\times$ genes network, which is computed from a
``data'' signal, as a linear superposition of genes $\times$ genes
decorrelated and decoupled rank-1 subnetworks. Significant EVD subnetworks
might represent functionally independent pathways that are manifest in the
data signal.

The integrative pseudoinverse projection of a network, computed from a
data signal, onto a designated ``basis'' signal approximates the network
as a linear superposition of only the subnetworks that are common to both
signals, i.e., pseudoinverse projection filters off the network the
subnetworks that are exclusive to the data signal. The
pseudoinverse-projected network simulates observation of only the pathways
that are manifest under both sets of the biological and experimental
conditions where the data and basis signals are measured.

I will define a comparative HOEVD, that formulates a series of networks
computed from a series of signals as linear superpositions of decorrelated
rank-1 subnetworks and the rank-2 couplings among these subnetworks.
Significant HOEVD subnetworks and couplings might represent independent
pathways or transitions among them common to all or exclusive to a subset
of the signals.

I will illustrate the EVD, pseudoinverse projection and HOEVD of genome-
scale networks with analyses of mRNA expression data from the yeast {\it
Saccharomyces cerevisiae} during its cell cycle and DNA-binding data of
yeast transcription factors that are involved in cell cycle, development
and biosynthesis programs. Boolean functions of the discretized
subnetworks and couplings highlight known and novel differential, i.e.,
pathway- dependent relations between genes.

2:00 - 2:30
Shmuel Friedland
Department of Mathematics, Statistics and Computer Science
University of Illinois, Chicago


The theory of $k(\ge 3)$ tensors became of great interest in recent
applications. In particular, the rank of a tensor and an approximation of
a given tensor by low rank tensors, emerged as the most outstanding
problems in data processing with many parameters.

In this talk I will point out some directions of study of these problems
using analytical methods: as algebraic geometry combined with matrix
theory, and numerical methods: as SVD for matrices combined with Newton

Some very preliminary results can be found in Chapter 5 of my notes:

2:30 - 3:00
Tammy Kolda
Sandia National Laboratories


Link analysis typically focuses on a single type of connection, e.g., two
journal papers are linked because they are written by the same author.
However, often we want to analyze data that has multiple linkages between
objects, e.g., two papers may have the same keywords and one may cite the
other. The goal of this paper is to show that multilinear algebra provides
a tool for multi-link analysis. We analyze five years of publication data
from journals published by the Society for Industrial and Applied
Mathematics. We explore how papers can be grouped in the context of
multiple link types using a tensor to represent all the links between
them. A PARAFAC decomposition on the resulting tensor yields information
similar to the SVD decomposition of a standard adjacency matrix. We show
how the PARAFAC decomposition can be used to understand the structure of
the document space and define paper-paper similarities based on multiple
linkages. Examples are presented where the decomposed tensor data is used
to find papers similar to a body of work (e.g., related by topic or
similar to a particular author^Rs papers), find related authors using
linkages other than explicit co-authorship or citations, distinguish
between papers written by different authors with the same name, and
predict the journal in which a paper was published.

Joint work with Daniel M. Dunlavy and W. Philip Kegelmeyer.

3:00 - 3:30
Lars Eld\'en
Department of Mathematics
Link\"oping University


We investigate various properties of the best rank-($R_1,R_2,R_3$)
approximation of a tensor, and their implications in the development of
algorithms for computing the approximation. In particular we discuss the
Grassmann-Newton method for solving the problem, and its relation to the
Grassmann-Rayleigh quotient iteration of de Lathauwer et al.

Joint work with Berkant Savas.

3:30 - 4:00 COFFEE BREAK

4:00 - 4:30
Liqun Qi
Department of Mathematics
City University of Hong Kong


Motivated by the positive definiteness issue in automatic control, I
defined eigenvalues, E-eigenvalues, H-eigenvalues and Z-eigenvalues for a
real completely symmetric tensor. An mth order n-dimensional real
completely symmetric tensor has n(m 1) n 1 eigenvalues and at most P n 1
k=0(m 1) k E-eigenvalues. They are roots of two one-dimensional
polynomials, which are called the characteristic polynomial and the
E-characteristic polynomial respectively. The sum of the eigenvalues is
equal to the trace of the tensor, multi- plied with (m 1) n 1 . The
eigenvalues obey some Gerschgorin-type theorems, while the E-eigenvalues
are invariant under orthogonal transformation. The latter property
indicates that E-eigenvalues are invariants of practical tensors in
physics and mechanics. Real eigenvalues (E-eigenvalues) with real
eigenvectors (E-eigenvectors) are called H- eigenvalues (Z-eigenvalues).
Z-eigenvalues always exist. When m is even, H-eigenvalues always exist. A
real completely symmetric tensor is positive definite if and only if all
of its H-eigenvalues (Z-eigenvalues) are positive. Based on the resultant
theory, H-eigenvalue and Z-eigenvalue methods are proposed for judging
positive definiteness of an even degree multivariate form when m and n are
small. Independently, Lek-Heng Lim has also explored the definitions and
applications of H- eigenvalues and Z-eigenvalues. In particular, he
proposed a multilinear generalization of the Perron-Frobenius theorem
based upon H-eigenvalues.

4:30 - 5:00
Brett Bader
Sandia National Laboratories


This presentation introduces the DEDICOM (DEcomposition into DIrectional
COMponents) family of models from the psychometrics literature for
analyzing asymmetric relationships in data and applies it to new
applications in data mining, in particular social network analysis.  In
this work we present an improved algorithm for computing the 3-way DEDICOM
model with modifications that make it possible to handle large, sparse
data sets.  We demonstrate the capabilities of DEDICOM as a new tool for
reporting latent relationships in large semantic graphs, which we
represent by an adjacency tensor.  For an application we consider the
email communications of former Enron employees that were made public, and
posted to the web, by the U.S. Federal Energy Regulatory Commission during
its investigation of Enron.  We represent the Enron email network as a
directed graph with edges labeled by time.  Using the three-way DEDICOM
model on this data, we show unique latent relationships that exist between
types of employees and study their communication patterns over time.

Joint work with Richard Harshman and Tamara Kolda.

5:00 - 5:30
Alex Vasilescu
Media Laboratory
Massachusetts Institute of Technology


Principal components analysis (PCA) is one of the most valuable results
from applied linear algebra. It is used ubiquitously in all forms of data
analysis -- in data mining, biometrics, psychometrics, chemometrics,
bioinformatics, computer vision, computer graphics, etc. This is because
it is a simple, non-parametric method for extracting relevant information
through the demensionality reduction of high-dimensional datasets in order
to reveal hidden underlying variables. PCA is a linear method, however,
and as such it has severe limitations when applied to real world data. We
are addressing this shortcoming via multilinear algebra, the algebra of
higher order tensors.

In the context of computer vision and graphics, we deal with natural
images which are the consequence of multiple factors related to scene
structure, illumination, and imaging. Multilinear algebra offers a potent
mathematical framework for explicitly dealing with multifactor image
datasets. I will present two multilinear models that learn (nonlinear)
manifold representations of image ensembles in which the multiple
constituent factors (or modes) are disentangled and analyzed explicitly.
Our nonlinear models are computed via a tensor decomposition, known as the
M-mode SVD, which is an extension to tensors of the conventional matrix
singular value decomposition (SVD), or through a generalization of
conventional (linear) ICA called Multilinear Independent Components
Analysis (MICA).

I will demonstrate the potency of our novel statistical learning approach
in the context of facial image biometrics, where the relevant factors
include different facial geometries, expressions, lighting conditions, and
viewpoints. When applied to the difficult problem of automated face
recognition, our multilinear representations, called TensorFaces (M-mode
PCA) and Independent TensorFaces (MICA), yields significantly improved
recognition rates relative to the standard PCA and ICA approaches.
Recognition is achieved with a novel Multilinear Projection Operator.

In computer graphics, our image-based rendering technique, called
TensorTextures, is a multilinear generative model that, from a sparse set
of example images of a surface, learns the interaction between viewpoint,
illumination and geometry, which determines surface appearance, including
complex details such as self-occlusion and self shadowing. Our tensor
algebraic framework is also applicable to human motion data in order to
extract "human motion signatures" that are useful in graphical animation
synthesis and motion recognition.

5:30 - 6:00
Rasmus Bro
Chemometrics Group, Department of Dairy and Food Science
The Royal Veterinary and Agricultural University


In many metabonomic (bioinformatic) investigations, the data suffer from
specific problems such as baseline issues, shifts in time etc.
Additionally, such data are often characterized by being overwhelmingly
information rich which is a problem in mathematical and statistical terms
but even more so from a scientific point of view because the desire is
often to obtain valid causal explanations for complex biological problems.
We will show how tailored multi-way analysis tools can help in analysis of
such complex data and show several examples of different origin.

6:00 - 6:30 (cancelled with apologies)
Pierre Comon
Laboratoire I3S
CNRS and University of Nice, Sophia-Antipolis


The problem of identifying linear mixtures of independent random variables
only from outputs can be traced back to 1953 with the works of Darmois or
Skitovich. They pointed out that when data are non Gaussian, a lot more
can be said about the mixture. In practice, Blind Identification of linear
mixtures is useful especially in Factor Analysis, in addition to many
other application areas (including Signal & Image Processing, Digital
Communications, Biomedical, or Complexity Theory).

Harshman and Carroll provided independently in the seventies numerical
algorithms to decompose a data record stored in a 3-way array into
elementary arrays, each representing the contribution of a single
underlying factor. The main difference with the well known Principal
Component Analysis is that the mixture is not imposed to be a unitary
matrix. This is very relevant because the actual mixture often has no
reason to have orthogonal columns. The Parafac ALS algorithm, widely used
since that time, theoretically does not converge for topological reasons,
but yields very usable results after a finite number of iterations, under
rank conditions however.

Independently, the problem of Blind Source Separation (BSS) arose around
1985 and was solved -explicitly or implicitly- with the help of High-Order
Statistics (HOS), which are actually tensors. It gave rapidly birth to the
more general problem of Independent Component Anlalysis (ICA) in 1991. ICA
is a tool that can be used to extract Factors when the physical diversity
does not allow to store directly and efficiently the data in tensor
format, in other words when the data cannot be uniquely decomposed
directly. The problem then consists of decomposing a data cumulant tensor,
of arbitrarily chosen order, into a linear combination of rank-one
symmetric tensors. For this purpose, several algorithms can be thought of.