
Data Programming: Creating Large Training Sets, Quickly
Alex Ratner, Chris De Sa, Sen Wu, Daniel Selsam, and Christopher Ré
In NIPS: Proceedings of the 29th Neural Information Processing Systems Conference, December 2016.
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most timeconsuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users provide a set of labeling functions, which are programs that heuristically label large subsets of data points, albeit noisily. By viewing these labeling functions as implicitly describing a generative model for this noise, we show that we can recover the parameters of this model to "denoise" the training set. Then, we show how to modify a discriminative loss function to make it noiseaware. We demonstrate our method over a range of discriminative models including logistic regression and LSTMs. We establish theoretically that we can recover the parameters of these generative models in a handful of settings. Experimentally, on the 2014 TACKBP relation extraction challenge, we show that data programming would have obtained a winning score, and also show that applying data programming to an LSTM model leads to a TACKBP score almost 6 F1 points over a supervised LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way to create machine learning models for nonexperts.

Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much
Bryan He, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré
In NIPS: Proceedings of the 29th Neural Information Processing Systems Conference, December 2016.
Gibbs sampling is a Markov Chain Monte Carlo sampling technique that iteratively samples variables from their conditional distributions. There are two common scan orders for the variables: random scan and systematic scan. Due to the benefits of locality in hardware, systematic scan is commonly used, even though most statistical guarantees are only for random scan. While it has been conjectured that the mixing times of random scan and systematic scan do not differ by more than a logarithmic factor, we show by counterexample that this is not the case, and we prove that that the mixing times do not differ by more than a polynomial factor under mild conditions. To prove these relative bounds, we introduce a method of augmenting the state space to study systematic scan using conductance.

Socratic Learning: Empowering the Generative Model
Paroma Varma, Rose Yu, Dan Iter, Christopher De Sa, Christopher Ré
In FiLMNIPS: Future of Interactive Learning Machines at NIPS, December 2016.
Modern machine learning techniques, such as deep learning, often use discriminative models that require large amounts of labeled data. An alternative approach is to use a generative model, which leverages heuristics from domain experts to train on unlabeled data. Domain experts often prefer to use generative models because they "tell a story" about their data. Unfortunately, generative models are typically less accurate than discriminative models. Several recent approaches combine both types of model to exploit their strengths. In this setting, a misspecified generative model can hurt the performance of subsequent discriminative training. To address this issue, we propose a framework called Socratic learning that automatically uses information from the discriminative model to correct generative model misspecification. Furthermore, this process provides users with interpretable feedback about how to improve their generative model. We evaluate Socratic learning on realworld relation extraction tasks and observe an immediate improvement in classification accuracy that could otherwise require several weeks of effort by domain experts.

Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Best Paper Award
Christopher De Sa, Kunle Olukotun, and Christopher Ré
In ICML: Proceedings of the 33rd International Conference on
Machine Learning, June 2016.
Gibbs sampling is a Markov chain Monte Carlo technique commonly used
for estimating marginal distributions.
To speed up Gibbs sampling, there has recently been interest in parallelizing
it by executing asynchronously. While empirical results suggest
that many models can be efficiently sampled asynchronously, traditional
Markov chain analysis does not apply to the asynchronous case, and thus
asynchronous Gibbs sampling is poorly understood.
In this paper, we derive a better understanding of the two main challenges
of asynchronous Gibbs: bias and mixing time.
We show experimentally that our theoretical results match practical outcomes.

Parallel SGD: When does Averaging Help?
Jian Zhang, Christopher De Sa, Ioannis Mitiliagkas, and Christopher Ré
In OptML: Optimization Methods for the Next Generation of Machine Learning
, workshop at ICML, June 2016.
Consider a number of workers running SGD independently on the
same pool of data and averaging the models every once in a while â€” a
common but not well understood practice. We study model averaging
as a variancereducing mechanism and describe two ways in which the
frequency of averaging affects convergence. For convex objectives, we show
the benefit of frequent averaging depends on the gradient variance envelope.
For nonconvex objectives, we illustrate that this benefit depends on the
presence of multiple globally optimal points. We complement our findings
with multicore experiments on both synthetic and real data.

DeepDive: Declarative Knowledge Base Construction
Christopher De Sa, Alex Ratner, Christopher Ré, Jaeho Shin,
Feiran Wang, Sen Wu, and Ce Zhang
Research highlight in SIGMOD Record, April 2016.
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a longstanding problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.

Generating Configurable Hardware from Parallel Patterns
Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun
In ASPLOS: 21st Int'l Conference on Architectural Support
for Programming Languages and Operating Systems, April 2016.
In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising as they offer significant performance and energy improvements compared to CPUs for a wide class of applications and are far more flexible than fixedfunction ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic use lowlevel hardware description languages like Verilog and VHDL, which have none of the productivity features of modern software development languages but produce very efficient designs, and lowlevel software languages like C and OpenCL coupled with highlevel synthesis (HLS) tools that typically produce designs that are far less efficient. Functional languages with parallel patterns are a better fit for hardware generation because they both provide highlevel abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two optimizations that are important when using parallel patterns to generate hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups up to 40x on a set of benchmarks from the data analytics domain.

Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns
Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Arvind K. Sujeeth,
Christopher De Sa, Christopher Aberger, and Kunle Olukotun
In CGO: International Symposium on Code Generation and Optimization, March 2016.
High performance in modern computing platforms requires
programs to be parallel, distributed, and run on heterogeneous
hardware. However programming such architectures
is extremely difficult due to the need to implement the application
using multiple programming models and combine
them together in adhoc ways. To optimize distributed applications
both for modern hardware and for modern programmers
we need a programming model that is sufficiently
expressive to support a variety of parallel applications, sufficiently
performant to surpass handoptimized sequential implementations,
and sufficiently portable to support a variety
of heterogeneous hardware. Unfortunately existing systems
tend to fall short of these requirements.
In this paper we introduce the Distributed Multiloop Language
(DMLL), a new intermediate language based on common
parallel patterns that captures the necessary semantic
knowledge to efficiently target distributed heterogeneous architectures.
We show straightforward analyses that determine
what data to distribute based on its usage as well as
powerful transformations of nested patterns that restructure
computation to enable distribution and optimize for heterogeneous
devices. We present experimental results for a range
of applications spanning multiple domains and demonstrate
highly efficient execution compared to manuallyoptimized
counterparts in multiple distributed programming models.

Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width
Spotlight
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
In NIPS: Proceedings of the 28th Neural Information Processing Systems Conference, December 2015.
Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width—regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers.

Taming the Wild: A Unified Analysis of Hogwild!Style Algorithms
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
In NIPS: Proceedings of the 28th Neural Information Processing Systems Conference, December 2015.
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingalebased analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for nonconvex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lowerprecision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

Incremental Knowledge Base Construction Using DeepDive
Best of Issue
Jaeho Shin, Sen Wu, Feiran Wang, Ce Zhang, Christopher De Sa, Christopher Ré
In VLDB: Proceedings of the 41st International Conference on Very Large Data Bases, September 2015.
Populating a database with unstructured information is a
longstanding problem in industry and research that encompasses
problems of extraction, cleaning, and integration. Recent
names used for this problem include dealing with dark
data and knowledge base construction (KBC). In this work,
we describe DeepDive, a system that combines database and
machine learning ideas to help develop KBC systems, and we
present techniques to make the KBC process more efficient.
We observe that the KBC process is iterative, and we develop
techniques to incrementally produce inference results
for KBC systems. We propose two methods for incremental
inference, based respectively on sampling and variational
techniques. We also study the tradeoff space of these methods
and develop a simple rulebased optimizer. DeepDive
includes all of these contributions, and we evaluate DeepDive
on five KBC systems, showing that it can speed up
KBC inference tasks by up to two orders of magnitude with
negligible impact on quality.

Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems
Christopher De Sa, Kunle Olukotun, and Christopher Ré
In ICML: Proceedings of the 32nd International Conference on
Machine Learning, July 2015.
Stochastic gradient descent (SGD) on a lowrank factorization is commonly employed to speed up matrix problems including matrix completion, subspace tracking, and SDP relaxation. In this paper, we exhibit a step size scheme for SGD on a lowrank leastsquares problem, and we prove that, under broad sampling conditions, our method converges globally from a random starting point within \(O(\epsilon^{1} n \log n)\) steps with constant probability for constantrank problems. Our modification of SGD relates it to stochastic power iteration. We also show experiments to illustrate the runtime and convergence of the algorithm.