Benjamin Van Roy: Research

Reinforcement learning

My research interests center on the design and analysis of reinforcement learning algorithms. This requires advances in methods for knowledge representation, exploration, adaptation, generalization, and decision. I work with collaborators on theory, algorithms, and software that drive these advances. To understand fundamental limits of data efficiency and what it takes to achieve them, we have developed analytic approaches that build on information theory. Data-efficient reinforcement learning requires an agent to represent what it knows about its environment, as well as what it does not know. To serve this need, we have developed epistemic neural networks. Implementation invariably brings computational considerations to the fore. A substantial portion of my work concerns scalable incremental algorithms – such as temporal-difference learning – for adapting an agent's epistemic state to new information.

I have also worked on operations research, finance, economics, recommender systems, and location technology. These subjects continue to interest me as domains for application of reinforcement learning.

Since 2012, my students have organized the Stanford Reinforcement Learning Forum, which hosts excellent talks on the subject.

Recommended reading

Papers listed below introduce facets of my research on reinforcement learning.

Information-theoretic foundations

Information theory offers elegant tools for analysis of machine learning. These tools accommodate great generality, handling models that are parametric or nonparametric, variables that are discrete or continuous and noisy or noiseless, as well as problems of sequential decision and learning. Foundations for analysis of error and sample complexity can be found in this paper, along with a result that illustrates how these tools can generate novel insights:

More generally, information theory can inform the design and analysis of data-efficient reinforcement learning agents:
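
As one illustration of the kind of guarantee these tools yield, the information-ratio analysis of Thompson sampling developed in this line of work bounds expected regret in a bandit problem via the entropy of the optimal action A*. Schematically, if the per-period information ratio – squared expected regret divided by the information gained about A* from that period's action and observation – is bounded by a constant, then

    \[
      \Gamma_t \le \overline{\Gamma} \;\text{ for all } t
      \quad\Longrightarrow\quad
      \mathbb{E}\bigl[\mathrm{Regret}(T)\bigr] \;\le\; \sqrt{\overline{\Gamma}\, H(A^*)\, T}.
    \]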

Epistemic neural networks

A conventional neural network produces an output given an input and parameters (weights and biases). This interface does not allow an agent to distinguish epistemic from aleatoric uncertainty. The output of an epistemic neural network depends additionally on an epistemic index, allowing an agent to express predictions that distinguish these kinds of uncertainty. This interface and the epinet – an epistemic neural network that operates with modest computation beyond what is required by a conventional neural network – are discussed in this paper:

Beyond its use in general reinforcement learning agents, the epinet is broadly useful in applications where it is valuable to distinguish epistemic from aleatoric uncertainty. One example arises in fine-tuning of large language models, where estimates of epistemic uncertainty can guide the gathering of informative data.
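
As a concrete, if simplified, sketch of this interface: the Python snippet below produces an output from an input together with a sampled epistemic index, so that disagreement across indices reflects epistemic uncertainty. The class name and architecture here are illustrative only, and details of the published epinet, such as its fixed prior network, are omitted.

    # Minimal sketch of the epistemic-neural-network interface (illustrative;
    # not the published epinet architecture in full).
    import numpy as np

    rng = np.random.default_rng(0)

    def init_mlp(sizes):
        """Random weights for a small MLP with the given layer sizes."""
        return [(rng.normal(0, 1 / np.sqrt(m), size=(m, n)), np.zeros(n))
                for m, n in zip(sizes[:-1], sizes[1:])]

    def mlp(params, x):
        """Forward pass; ReLU hidden layers, linear output."""
        for i, (W, b) in enumerate(params):
            x = x @ W + b
            if i < len(params) - 1:
                x = np.maximum(x, 0.0)
        return x

    class EpistemicNet:
        """Output depends on the input x AND an epistemic index z.

        Varying z while holding x fixed exposes epistemic uncertainty:
        large disagreement across indices signals that the network has
        not yet resolved what the correct prediction is.
        """
        def __init__(self, dim_in, dim_out, dim_index=8, hidden=32):
            self.dim_index = dim_index
            self.base = init_mlp([dim_in, hidden, dim_out])                # conventional network
            self.epi = init_mlp([dim_in + dim_index, hidden, dim_out])     # small additive head

        def __call__(self, x, z):
            return mlp(self.base, x) + mlp(self.epi, np.concatenate([x, z]))

        def sample_index(self):
            return rng.normal(size=self.dim_index)

    # Usage: draw several indices to probe uncertainty about a single input.
    enn = EpistemicNet(dim_in=4, dim_out=1)
    x = rng.normal(size=4)
    preds = [enn(x, enn.sample_index()) for _ in range(5)]
    print(np.std(preds))  # spread across indices reflects epistemic uncertainty

An ensemble of networks is one instance of this interface, with the index selecting a member; the epinet aims for similar behavior at far lower computational cost.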

Exploration

Data efficiency requires, among other things, that an agent intelligently explore its environment to uncover useful data. The following paper explains the role of exploration, then develops and analyzes an approach based on randomized value functions:

This randomized approach is motivated by Thompson sampling, a popular exploration heuristic that we have studied in depth:

Information-directed sampling grew out of this thread of research and offers an alternative approach that, in the face of complex information structures, can greatly outperform Thompson sampling:
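
For readers unfamiliar with Thompson sampling, the minimal Python sketch below illustrates the heuristic on a Bernoulli bandit: sample one plausible environment from the posterior, act greedily with respect to the sample, then update the posterior. The bandit setup is purely illustrative.

    # Thompson sampling for a Bernoulli bandit -- a minimal illustration of the
    # exploration heuristic discussed above.
    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.3, 0.5, 0.7])   # unknown to the agent
    K, T = len(true_means), 2000

    # Beta(1, 1) prior over each arm's success probability.
    alpha = np.ones(K)
    beta = np.ones(K)

    rewards = 0.0
    for t in range(T):
        theta = rng.beta(alpha, beta)        # sample one plausible environment
        a = int(np.argmax(theta))            # act greedily with respect to the sample
        r = float(rng.random() < true_means[a])
        alpha[a] += r                        # posterior update
        beta[a] += 1.0 - r
        rewards += r

    print(f"average reward: {rewards / T:.3f}  (best arm pays {true_means.max()})")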

Adaptation

An effective reinforcement learning agent must efficiently adapt to information garnered from new observations. This need is commonly served by algorithms that incrementally update an agent's epistemic state, which is often structured as a function that predicts future return and, possibly, other cumulants. Update algorithms aim to maintain knowledge that guides effective behavior while economizing on computation. Temporal-difference methods – and the particular case of Q-learning – are probably the best-known among these. The following paper offers an analysis of temporal-difference methods:

When temporal-difference learning converges, it seems to arrive at value functions that result in effective decisions. Some understanding is developed here:

We have also developed and analyzed a version of Q-learning that benefits from the aforementioned asymptotic property and learns in a provably efficient sense:

That being said, the version of Q-learning presented in this paper is not a practical algorithm. Rather, it serves as a constructive proof of attainable performance as a function of several salient statistics of the environment.
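
For contrast, the textbook tabular Q-learning update below shows the kind of incremental, temporal-difference-style computation these methods perform in practice. It is a generic sketch – the env interface with reset and step is assumed for illustration – and not the provably efficient variant discussed above.

    # Textbook tabular Q-learning with epsilon-greedy action selection -- shown
    # only to make the incremental, temporal-difference style of update concrete.
    import numpy as np

    def q_learning(env, num_states, num_actions, episodes=500,
                   alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
        """env is assumed to expose reset() -> state and
        step(action) -> (next_state, reward, done); this interface is illustrative."""
        rng = np.random.default_rng(seed)
        Q = np.zeros((num_states, num_actions))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # Epsilon-greedy: mostly exploit, occasionally explore at random.
                if rng.random() < epsilon:
                    a = int(rng.integers(num_actions))
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)
                # Temporal-difference error: observed target minus current estimate.
                target = r + (0.0 if done else gamma * np.max(Q[s_next]))
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q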