Posters Abstracts. Workshop on Algorithms for Modern Massive Data Sets

If you find the TeX codes below hard to read, the typeset version may be found in pp. 16–19 of this PDF file.

Christos Boutsidis
Rensselaer Polytechnic Institute

Given an $m \times n$ matrix $A$ and an integer $k$, with $k \ll n$, the
Column Subset Selection Problem (CSSP) is to select $k$ columns of $A$ to
form an $m \times k$ matrix $C$ that minimizes the residual $\lVert A -
CC^{+} A \rVert_{\xi}$, for $\xi=2$ or $F$. Here $C^{+}$ denotes the
pseudoinverse of the matrix $C$. This combinatorial optimization problem
has been exhaustively studied in the numerical linear algebra and the
theoretical computer science communities. Current state-of-the-art
approximation algorithms guarantee

\lVert A-CC^{+} A \rVert_2 &\le O(k^{1/2}(n-k)^{1/2}) \lVert A-A_k \rVert_2,\\
\lVert A-CC^{+} A \rVert_F &\le \sqrt{(k+1)!} \lVert A-A_k \rVert_F.

Here, we present a new approximation algorithm which guarantees that with
high probability

\lVert A-CC^{+} A \rVert_2 &\le O(k^{3/4}\log^{1/2}(k)(n-k)^{1/4}) \lVert A-A_k \rVert_2,\\
\lVert A-CC^{+} A \rVert_F &\le O(k \sqrt{\log k}) \lVert A-A_k \rVert_F,

thus \emph{asymptotically} improves upon the previous results. $A_k$ is
the best-rank-k approximation to $A$, computed with the SVD. Also note
that if $A=U \Sigma V^T$ is the SVD of a matrix $A$, then $A^{+}=V
\Sigma^{-1}U^T$. We also present an example on how this algorithm can be
utilized for a low rank approximation of a data matrix.

This is joint work with Michael Mahoney and Petros Drineas

Pavel Dmitriev
Yahoo! Inc.

There has been a lot of recent interest in mining patterns from graphs.
Often, the exact structure of the patterns of interest is not known. This
happens, for example, when molecular structures are mined to discover
fragments useful as features in chemical compound classification task, or
when web sites are mined to discover sets of web pages representing
logical documents. Such patterns are often generated from a few small
subgraphs (cores), according to certain generalization rules (GRs). We
call such patterns ``generalized patterns''(GPs). While being structurally
slightly different, GPs often perform the same function in the network.

Previously proposed approaches to mining GPs either assumed that the cores
and the GRs are given, or that all interesting GPs are frequent. These are
strong assumptions, which often do not hold in practical applications. In
this paper, we propose an approach to mining GPs that is free from the
above assumptions. Given a small number of GPs selected by the user, our
algorithm discovers all GPs similar to the user examples. First, a machine
learning-style approach is used to find the cores. Second, generalizations
of the cores in the graph are computed to identify GPs. Evaluation on
synthetic and real-world datasets demonstrates the promise as well as
highlights some problems with our approach.

Charles Elkan
University of California, San Diego

Many data mining and knowledge discovery applications that use matrices
focus on low-rank approximations, including for example latent semantic
indexing (LSI).  The singular value decomposition (SVD) is one of the most
popular low-rank matrix approximation methods.  However, it is often
considered intractable for applications involving massive data.  Recent
research has addressed this problem, with several randomized methods
proposed to compute the SVD.  We review these techniques and present the
first empirical study of some of them.  The projection-based approach of
Sarlos appears best in terms of accuracy and runtime, although a sampling
approach of Drineas, Kannan, and Mahoney also performs favorably.  No
method offers major savings when the input matrix is sparse.  Since most
massive matrices used in knowledge discovery, for example word-document
matrices, are sparse, this finding reveals an important direction for
future research.

This is joint work with Aditya Krishna Menon, University of California,
San Diego.

Sreenivas Gollapudi
Microsoft Research, Silicon Valley

This study focuses on computations on large graphs (e.g., the web-graph)
where the edges of the graph are presented as a stream. The objective in
the streaming model is to use small amount of memory (preferably
sub-linear in the number of nodes $n$) and a few passes.

In the streaming model, we show how to perform several graph computations
including estimating the probability distribution after a random walk of
length $l$, mixing time, and the conductance. We estimate the mixing time
$M$ of a random walk in


space and $\widetilde{O}(\sqrt{M/\alpha})$ passes. Furthermore, the
relation between mixing time and conductance gives us an estimate for the
conductance of the graph. By applying our algorithm for computing
probability distribution on the web-graph, we can estimate the {\em
PageRank} $p$ of any node up to an additive error of $\sqrt{\epsilon p}$
in $\widetilde{O}(\sqrt{M/\alpha})$ passes and$ \widetilde{O}\Bigl(
M\alpha, \alpha n\sqrt{M\alpha} + \frac{1}{\epsilon}\sqrt{M/\alpha}
\Bigr\}\Bigr)$ space, for any $\alpha \in (0,1]$. In particular, for
$\epsilon = M/n$, by setting $\alpha = M^{-1/2}$, we can compute the
approximate PageRank values in $\tilde{O}(nM^{-1/4})$ space and
$\widetilde{O}(M^{3/4})$ passes.  In comparison, a standard implementation
of the PageRank algorithm will take $O(n)$ space and $O(M)$ passes.

Mohammad Al Hasan
Rensselaer Polytechnic Institute

Frequent Pattern Mining (FPM) is a mature topic in data mining with many
efficient algorithms to mine patterns with variable complexities, such as
set, sequence, tree, graph, and etc. But, the usage of FPM in real-life
knowledge discovery systems is considerably low in comparison to their
potentital. The prime reason is the lack of interpretability caused from
the enormity of output-set size. A modest size graph database with
thousands graphs can produce millions of frequent graph patterns with a
reasonable support value. So, the recent research direction in FPM has
shifted from obtaining the total set to generate a representative
(summary) set. Most of the summmarization approaches that is proposed
recently are post-processor; they adopt some form of clustering algorithm
to generate a smaller representative pattern-set from the collection of
total set of frequent patterns (FP). But, post-processing requires to
obtain all the members of FP; which, unfortunately, is not practically
feasible for many real-world data set.

In this research, we consider an alternative viewpoint of
representativeness. It is based on uniform and identical sampling, the
most fundamental theory in data analysis and statistics.  Our
representative patterns are obtained with uniform probability from the
pool of all frequent maximal patterns. Along this idea, we first discuss a
hypothetic algorithm for uniform generation that uses a \#P-oracle.
However, such an oracle is difficult to obtain as it is related to the
counting of maximal frequent patterns, which was shown to be \#P-complete.
Hence, we propose a random-walk based approach that traverse the FP
partial order randomly with prescribed transition matrix. On convergence,
the random walk visits each maximal pattern with a uniform probability.
Since, the number of maximal patterns, generally, are very low compared to
the search space, the idea of importance sampling is used to guide the
random walk to visit the maximal patterns more frequently.

Ling Huang
Intel Research

The computational and/or communication constraints associated with
processing large-scale data sets using support vector machines (SVM) in
contexts such as distributed networking systems are often prohibitively
high, resulting in practitioners of SVM learning algorithms having to
apply the algorithm on approximate versions of the kernel matrix induced
by a certain degree of data reduction. In this research project, we study
the tradeoffs between data reduction and the loss in an algorithm's
classification performance. We introduce and analyze a consistent
estimator of the SVM's achieved classification error, and then derive
approximate upper bounds on the perturbation on our estimator. The bound
is shown to be empirically tight in a wide range of domains, making it
practical for the practitioner to determine the amount of data reduction
given a permissible loss in the classification performance.

Xuhui Huang
Stanford University

The Replica Exchange Method (REM) and Simulated Tempering (ST) enhanced
sampling algorithms have become widely used in sampling conformation space
of bimolecular systems. We have implemented a serial version of REM (SREM)
and ST in the heterogeneous Folding@home distributed computing environment
in order to calculate folding free energy landscapes. We have applied both
algorithms to the 21 residue Fs peptide, and SREM to a 23 residue BBA5
protein. For each system, we find converged folding free energy landscapes
for simulations started from near.native and fully unfolded states. We
give a detailed comparison of SREM and ST and find that they give
equivalent results in reasonable agreement with experimental data. Such
accuracy is made possible by the massive parallelism provided by
Folding@home, which allowed us to run approximately five thousand 100ns
simulations for each system with each algorithm, and generates more than a
million configurations. Our extensive sampling shows that the AMBER03
force field gives better agreement with experimental results than previous
versions of the force field.

Prateek Jain
University of Texas, Austin

Distance and kernel learning are increasingly important for several
machine learning applications.  However, most existing distance metric
algorithms are limited to learning metrics over low-dimensional data,
while existing kernel learning algorithms are limited to the transductive
setting and thus do not generalize to new data points. In this work, we
focus on the Log Determinant loss due to its computational and modelling
properties, and describe an algorithm for distance and kernel learning
that is simple to implement, scales to very large data sets, and
outperforms existing techniques.  Unlike most previous methods, our
techniques can generalize to new points, and unlike most previous metric
learning methods, they can work with high-dimensional data. We demonstrate
our learning approach applied to large-scale problems in computer vision,
specifically image search.

Pentti Kanerva
Stanford University

We have applied a simple form of random projections, called Random
Indexing, to English text, to encode the contexts of words into
high-dimensional vectors.  The context vectors for words can be used in
natural-language research and text search, for example.  Random Indexing
is particularly suited for massive data that keeps on accumulating---it is
incremental.  It compresses a large sparse $M \times N$ matrix, with $M$
and $N$ growing into the billions, into a much smaller matrix of fixed
size with little information loss.  Possible applications include network
connectivity analysis: who is talking to whom?

Simon Lacoste-Julien
University of California, Berkeley

Probabilistic topic models (and their extensions) have become popular as
models of latent structures in collections of text documents or images.
These models are usually treated as generative models and trained using
maximum likelihood estimation, an approach which may be suboptimal in the
context of an overall classification problem. In this work, we describe
DiscLDA, a discriminative learning framework for such models as Latent
Dirichlet Allocation (LDA) in the setting of dimensionality reduction with
supervised side information. In DiscLDA, a class-dependent linear
transformation is introduced on the topic mixture proportions. This
parameter is estimated by maximizing the conditional likelihood using
Monte Carlo EM. By using the transformed topic mixture proportions as a
new representation of documents, we obtain a supervised dimensionality
reduction algorithm that uncovers the latent structure in a document
collection while preserving predictive power for the task of
classification. We compare the predictive power of the latent structure of
DiscLDA with unsupervised LDA on the 20 Newsgroup document classification

This is joint work with Fei Sha and Michael Jordan.

Taesup Moon
Stanford University

We introduce S-DUDE, a new algorithm for denoising DMC-corruputed data.
The algorithm extends the recently introduced DUDE (Discrete Universal
DEnoiser) of Weissman et. al., and aims to compete with a genie that has
access to both noisy and clean data, and can choose to switch, up to $m$
times, between sliding window denoisers in a way that minimizes the
overall loss. The key issue is to learn the best segmentation of data and
the associated denoisers of the genie, only based on the noisy data.
Unlike the DUDE, we treat cases of one- and two-dimensional data
separately due to the fact that the segmentation of two-dimensional data
is significantly more challenging than that of one-dimensional data. We
use dynamic programming and quadtree decomposition in devising our scheme
for one- and two-dimensional data, respectively, and show that, regardless
of the underlying clean data, our scheme achieves the performance of the
genie provided that $m$ is sub-linear in the sequence length $n$ for
one-dimensional data and $m\ln m$ is sub-linear in $n$ for two-dimensional
data. We also prove the universal optimality of our scheme for the case
where the underlying data are (spatially) stationary. The resulting
complexity of our scheme is linear in both $n$ and $m$ for one-dimensional
data, and linear in $n$ and exponential in $m$ for two-dimensional data.
We also provide a sub-optimal scheme for two-dimensional data that has
linear complexity in both $n$ and $m$, and present some promising
experimental results involving this scheme.

Joint work with Professor Tsachy Weissman (Stanford).

Ramesh Nallapati
Stanford University

In this work, we address the problem of joint modeling of text and
citations in the topic modeling framework. We present two different models
called the Pairwise-Link-LDA and the Link-PLSA-LDA models.

The Pairwise-Link-LDA model combines the ideas of LDA and Mixed Membership
Block Stochastic Models and allows modeling arbitrary link structure.
However, the model is computationally expensive, since it involves
modeling the presence or absence of a citation (link) between every pair
of documents. The second model solves this problem by assuming that the
link structure is a bipartite graph. As the name indicates, Link-PLSA-LDA
model combines the LDA and PLSA models into a single graphical model.

Our experiments on a subset of Citeseer data show that both these models
are able to predict unseen data better than the base-line model of
Erosheva and Lafferty, by capturing the notion of topical similarity
between the contents of the cited and citing documents.

Our experiments on two different data sets on the link prediction task
show that the Link-PLSA-LDA model performs the best on the citation
prediction task, while also remaining highly scalable. In addition, we
also present some interesting visualizations generated by each of the

Hariharan Narayanan
University of Chicago

In many applications, the data that requires classification is very high
dimensional. In the absence of additional structure, this is a hard
problem. We show that the task of classification can tractable in two
special cases of interest, namely \textbf{1.}~when data lies on a low
dimensional submanifold on which it has a bounded density, and
\textbf{2.}~when the data has a certain heavy tailed character.

More precisely, we consider the following questions: \textbf{1.}~What is
the sample complexity of classifying data on a manifold when the class
boundary is smooth? \textbf{2.}~What is the sample complexity of
classifying data with Heavy-Tailed characteristics in a Hilbert Space $H$
using a linear separator?

In both cases, the VC dimension of the target function class is infinite,
and so distribution free learning is impossible. In the first case, we
provide sample complexity bounds that depend on the maximum density,
curvature parameters and intrinsic dimension but are independent of the
ambient dimension. In the second case, we obtain dimension independent
bounds if there exists a basis ${e_1, e_2, \ldots}$ of $H$ such that the
probability that a random data point does not lie in the span of ${e_1,
\dots, e_k}$ is bounded above by $C k^{-a}$ for some positive constants
$C$ and $a$.

Cases \textbf{1} and \textbf{2} are based on joint work with Partha Niyogi
and Michael Mahoney, respectively.

Stefan Pickl
Naval Postgraduate School, Monterey
   and Bundeswehr University, Munich

A research area of central importance in computational biology,
biotechnology --- and life sciences at all --- is devoted to modeling,
prediction and dynamics of complex expression patterns. However, as
clearly understood in these days, this enterprise cannot be investigated
in a satisfying way without taking into account the role of the
environment in its widest sense: Complex Data Analysis, Statistical
Behaviour and Algorithmic Approaches are in the Main Center of Interest:
To a representation of past, present and most likely future states, this
contribution also acknowledge the presence of measurement errors and
further uncertainties.

This poster contribution surveys and improves recent advances in
understanding the mathematical foundations and interdisciplinary
implications of the newly introduced gene-environment networks. Moreover,
it integrates the important theme carbon dioxide emission reduction into
the context of our networks and their dynamics. Given the data from DNA
microarray experiments and environmental records, we extract nonlinear
ordinary differential equations which contain parameters that have to be
determined. This is done by some modern kinds of (so-called generalized
Chebychev) approximation and (so-called generalized semi-infinite)
optimization.  After this is provided, time-discretized dynamical systems
are studied. Here, a combinatorial algorithm constructing and following
polyhedra sequences, allows to detect the region of parametric
(in)stability. Finally, we analyze the topological landscape of
gene-environment networks in its structural stability. With the example of
$CO_2$-emission control and some further statistical perspectives we

Jon Shlens
Miller Institute, University of California, Berkeley
   and Salk Institute for Biological Studies

All visual signals from the eye to the brain originate in the electrical
activity of a population of cells in the retina, termed retinal ganglion
cells (RGCs). Standard models implicitly assume that RGCs signal visual
information independently of one another. However, several studies have
demonstrated that significant correlated activity amongst RGCs may
fundamentally alter visual signals. We recorded the electrical activity of
several hundred RGCs in retina to explore how network interactions effect
the encoding of visual information.

To test whether concerted firing can be explained by pairwise
interactions, we used a maximum entropy approach borrowed from statistical
mechanics to predict concerted activity. The nearest neighbor Ising model
accurately reproduced the data, revealing that higher-order correlations
have minimal impact on network activity within a neural population.

The nearest-neighbor structure was explored further by employing a
generalized linear model to additionally capture the dynamics and stimulus
dependencies of the population of neurons. We employed MCMC techniques to
construct a Bayesian estimate of the visual stimulus. We found that
network activity contributed a valuable visual signal: ignoring network
activity sacrifices roughly 20\% of the visual signal.

Future investigations will examine how these results scale to a complete
recording of several hundred neurons in the presence to stochastic stimuli
matched to the statistics of the natural environment.

Joint work with Jonathan Pillow, Liam Paninski, Greg Field, Jeff Gauthier,
Martin Greschner, Alexander Sher, Alan Litke, Eero Simoncelli and

Jian Sun
Stanford University

Within the general framework of analyzing the properties of shapes which
are independent of the shape~Rs pose, or embedding, we have developed a
novel method for efficiently computing global symmetries of a shape which
are invariant up to isometry preserving transformations. Our approach is
based on the observation that the intrinsic symmetries of a shape are
transformed into the Euclidean symmetries in the signature space defined
by the eigenfunctions of the Laplace-Beltrami operator. We devise an
algorithm which detects and computes the isometric mappings from the shape
onto itself.

The aforementioned method works very well when the whole object is
intrinsically symmetric. However, in practice objects are often partially
symmetric. To tackle this problem, we have developed an algorithm that
detects partial intrinsic self-similarities of shapes. Our method models
the heat flow on the object and uses the heat distribution to characterize
points and regions on the shape. Using this method we are able to
mathematically quantify the similarity between points at certain scale,
and determine, in particular, at which scale the point or region becomes
unique. This method can be used in data analysis and processing,
visualization and compression.

John Wu
Lawrence Berkeley National Lab

FastBit is a software tool for searching large read-only datasets.  It
organizes user data in a column-oriented structure which is efficient for
on-line analytical processing (OLAP), and utilizes compressed bitmap
indexes to further speed up query processing.  We have proven the
compressed bitmap index used in FastBit to be theoretically optimal for
one-dimensional queries.  Compared with other optimal indexing methods,
bitmap indexes are superior because they can be efficiently combined to
answer multi-dimensional queries whereas other optimal methods can not. In
this poster, we show that FastBit answers queries 10 times faster than a
commercial database system, and highlight an application where FastBit
sped up the molecular docking software hundreds of times.

Zuobing Xu
University of California, Santa Cruz

We describe new models, including those based on the Dirichlet Compound
Multinomial (DCM) distribution and for information retrieval, relevance
feedback, and active learning. We first indicate how desirable properties
such as search engine ranking being concave in word repetition can be
obtained through the use of the DCM model.  We also describe several
effective estimation methods. Reduction of the state space is achieved
through the use of relevance and non-relevance sets. Effective capture of
negative feedback and modeling of overlap terms between negative and
positive feedback (relevant) documents is thus enabled.  We conclude with
a discussion of the computation efficiency and demonstration of the very
good performance of these new methods on TREC data sets. This work also
enables the efficient use of the original probabilistic ranking approach,
while incorporating estimation methods developed in the language model

This is joint work with Ramakrishna Akella, University of California,
Santa Cruz.