In addition to the annotated bibliography which I generated for the 2010 course, I’ve created a supplement that includes my notes and references for the “controversial hypotheses” paper. In the supplement, you’ll find many relevant papers, a good number of them directly linked to PDF versions so you can quickly check out a paper and decide whether it is relevant to your interests. I compile this document using a Python script and will update it during the quarter. It is offered in the spirit of one of the central tenets of this class, namely, that building tools, and in particular information-mining tools, is a critical ingredient in conducting science today.
There is still a good deal of mystery shrouding exactly how shape is represented in the later stages of the ventral pathway and, in particular, debate about the role of the inferotemporal (IT) area of the ventral pathway. Shimon Ullman has interesting computational theories concerning both the image-coding function of neurons in IT and the role of overlap in constructing compositional features from simpler ones. Stu Geman’s paper [8] addresses a fundamental tradeoff between invariance and selectivity which is hinted at in the Ullman and Soloviev paper [51]. For interesting insight into the Gestalt psychology / psychophysics perspective on how primates perceive spatial and temporal structure, check out the papers by Gepshtein and Kubovy [9, 18] or the work by Jitendra Malik on segmentation, which draws inspiration from the Gestalt spatial and temporal grouping principles. Papers coming out of Manabu Tanifuji’s lab investigate candidate optimal stimuli for IT neurons as well as hypotheses and evidence for how the associated features might be combined to explain responses to more complex stimuli:
A neural code for three-dimensional object shape in macaque inferotemporal cortex [56] (PDF)
Invariance and Selectivity in the Ventral Visual Pathway [8] (PDF)
Representation of the spatial relationship among object parts by neurons in macaque inferotemporal cortex [55] (PDF)
Visual Features of Intermediate Complexity and their use in Classification [52] (PDF)
Complex objects are represented in macaque inferotemporal cortex by the combination of feature columns [50] (PDF)
Perceptual grouping in space and in space-time: An exercise in phenomenological psychophysics [18] (PDF)
Computation of pattern invariance in brain-like structures [51] (PDF)
We still do not have an adequate explanation for the extensive feedback connections that originate in extrastriate regions and terminate in V1, nor do we completely understand the role of lateral connections within V1. The following papers explore some of the related issues and posit functional roles for these less-well-studied striate circuits:
Surround Suppression of V1 Neurons Mediates Orientation-Based Representation of High-Order Visual Features [48]
Contour and boundary detection improved by surround suppression [11]
The contribution of feedforward, lateral and feedback connections to the classical receptive field center and extra-classical receptive field surround of primate V1 neurons [26]
It has become common in the practice of machine learning involving visual data either to whiten the training data or to perform an operation called local contrast normalization. There is some evidence that the latter operation is carried out by a nonlinear transformation called divisive normalization, which is also implicated in surround suppression. Here are two classic papers investigating divisive normalization, followed by a short sketch of the computation itself:
Natural image statistics and divisive normalization: Modeling nonlinearity and adaptation in cortical neurons [53]
Local contrast in natural images: normalisation and coding efficiency [3]
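To make the operation concrete, here is a minimal NumPy sketch of divisive normalization applied to an image patch. The Gaussian neighborhood width and the semi-saturation constant are illustrative choices of mine, not parameters taken from either paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def divisive_normalization(image, sigma=0.1, neighborhood=2.0):
    """Divide each response by a measure of local contrast energy.

    The local energy is a Gaussian-weighted sum of squared responses;
    sigma is a semi-saturation constant that keeps the denominator
    away from zero in low-contrast regions.
    """
    energy = gaussian_filter(image ** 2, neighborhood)
    return image / np.sqrt(sigma ** 2 + energy)

# Example: normalize a random "image"; response magnitudes are
# roughly equalized across high- and low-contrast regions.
rng = np.random.default_rng(0)
patch = rng.normal(size=(64, 64))
normalized = divisive_normalization(patch)
```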
Here are a few influential papers that don’t fit easily into a single category but deserve to be singled out for their impact on the field. The paper by Peter Földiák influenced Wiskott and Sejnowski’s slow feature analysis [54] (sketched in code after these references) and played a key role in a number of practical applications, including the high-throughput method of Pinto et al [39], which we note elsewhere in this bibliography. The work by Riesenhuber and Poggio gave rise to what is called, at least by the authors, the standard model. A variant of this model was implemented by Thomas Serre and compared with state-of-the-art computer vision algorithms:
Hierarchical Bayesian Inference in the Visual Cortex [24]
Hierarchical models of object recognition in cortex [42] (PDF)
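For a concrete sense of what slow feature analysis [54] computes, here is a minimal linear SFA sketch: whiten the input signals, then find the directions in which the whitened signal changes most slowly over time. The toy signal is purely illustrative:

```python
import numpy as np

def linear_sfa(x, n_features=1):
    """Linear slow feature analysis on a (time, dims) signal.

    Whiten the centered signal, then take the eigenvectors of the
    covariance of its time derivative with the smallest eigenvalues:
    these are the directions of slowest variation.
    """
    x = x - x.mean(axis=0)
    # Whitening via eigendecomposition of the covariance matrix.
    vals, vecs = np.linalg.eigh(np.cov(x, rowvar=False))
    whitener = vecs / np.sqrt(vals)
    z = x @ whitener
    # Covariance of the temporal derivative of the whitened signal.
    dz = np.diff(z, axis=0)
    dvals, dvecs = np.linalg.eigh(np.cov(dz, rowvar=False))
    # Smallest eigenvalues correspond to the slowest features.
    return whitener @ dvecs[:, :n_features]

# Toy example: a slow sinusoid hidden in a fast, noisy mixture.
t = np.linspace(0, 4 * np.pi, 2000)
sources = np.column_stack([np.sin(t), np.sin(37 * t)])
mixed = sources @ np.random.default_rng(1).normal(size=(2, 2))
w = linear_sfa(mixed)
slow = mixed @ w  # recovers the slow sinusoid up to scale and sign
```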
In the second lecture, I mentioned a number of projects and startups interested in building large-scale models of the cortex that operate at the individual cell level. The Blue Brain project at EPFL is directed by Henry Markram and is aimed at modeling all of the cells in a single cortical column. Eugene Izhikevich and Gerald Edelman at the Neurosciences Institute have developed several models that use the leaky-integrate-and-fire model of neural activation and the spike-timing-dependent-plasticity model of learning; a minimal sketch of both appears after these references. Eugene’s NSI web page has a number of interesting papers which include the details of his large-scale simulations. A presentation that covers both the leaky-integrate-and-fire and the spike-timing-dependent-plasticity models and summarizes one or two of Eugene Izhikevich’s papers would be interesting and is encouraged. There is less recently published work from Paul Rhodes since he started Evolved Machines, but you can find some of his early work on thalamo-cortical relays; David Mumford’s papers on the interaction between the cortex and thalamus might provide additional insight into this interesting topic:
Large-scale model of mammalian thalamo-cortical systems [13]
The blue brain project [27]
A model of thalamocortical relay cells [41]
On the computational architecture of the neocortex II: The role of cortico-cortical loops [30]
On the computational architecture of the neocortex I: The role of the thalamo-cortical loop [29]
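Here is a minimal sketch of the two building blocks mentioned above: a leaky-integrate-and-fire neuron and a pair-based spike-timing-dependent-plasticity rule. The membrane parameters and learning constants are illustrative defaults, not values taken from any of the cited models:

```python
import numpy as np

def simulate_lif(current, dt=0.1, tau=10.0, v_rest=-65.0,
                 v_reset=-70.0, v_threshold=-50.0, resistance=10.0):
    """Leaky integrate-and-fire: integrate input current with a leak
    toward the resting potential; emit a spike and reset on threshold.

    dv/dt = (-(v - v_rest) + R * I(t)) / tau
    """
    v = v_rest
    voltages, spikes = [], []
    for i in current:
        v += dt * (-(v - v_rest) + resistance * i) / tau
        if v >= v_threshold:
            spikes.append(len(voltages))  # record spike time (step)
            v = v_reset
        voltages.append(v)
    return np.array(voltages), spikes

def stdp_update(delta_t, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP: potentiate when the presynaptic spike precedes
    the postsynaptic spike (delta_t = t_post - t_pre > 0, in ms),
    depress otherwise; magnitude decays with the timing difference."""
    if delta_t > 0:
        return a_plus * np.exp(-delta_t / tau)
    return -a_minus * np.exp(delta_t / tau)

# Constant 2 nA input for 100 ms produces a regular spike train.
voltages, spike_times = simulate_lif(np.full(1000, 2.0))
```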
Mammals, and primates in particular, are the product of millions of years of natural selection in which functions of the peripheral nervous system were incorporated into the cortex, a process called encephalization. There is still a great deal of processing that happens in the eye and ear, and the following paper is a classic on the processing carried out in the frog’s eye and a lesson to us all in the degree to which visual representations are dictated by the demands of the organism and not by any abstract principle that can be applied across sensory modalities. Of course, there’s always the possibility that the brain only hit upon the right abstract principle relatively late in the vast time scale of natural selection:
What the Frog’s Eye Tells the Frog’s Brain [25]
Here is a small sample of work on analyzing fMRI data as a means of inferring what an experimental subject is looking at, or thinking about in the case of a word that evokes a visual memory. Some of this work has been exaggerated in the popular press, with newspaper articles using phrases like “mind reading” to dramatize the experimental studies, but it is very interesting for what it says about patterns of neural activity that are common across different subjects and the same subjects at different times. A sketch of the basic decoding idea follows these references:
Predicting Human Brain Activity Associated with the Meanings of Nouns [28]
Identifying natural images from human brain activity [16]
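At its core, this kind of study trains a classifier to map voxel activity patterns to stimulus labels and tests it on held-out trials. Here is a minimal sketch of that idea using scikit-learn; the data are synthetic placeholders, not anything resembling the actual recordings in these papers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for fMRI data: 200 trials x 500 voxels, with a
# small subset of voxels weakly informative about a binary stimulus.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
voxels = rng.normal(size=(200, 500))
voxels[:, :20] += 0.5 * labels[:, None]  # signal in the first 20 voxels

# Cross-validated accuracy well above 50% indicates the activity
# pattern carries information about the stimulus.
decoder = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(decoder, voxels, labels, cv=5).mean()
print(f"decoding accuracy: {accuracy:.2f}")
```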
Pawan Sinha and his colleagues wrote the first article below to inspire computer vision researchers interested in face recognition to look at what is known in cognitive neuroscience about human face recognition. The next two articles by Michael Tarr and Heinrich Bülthoff are similar in their intent, and the final article by Marty Banks and P. Bennett is a classic that attempts to explain the developmental mechanism that limits visual acuity in infants. I include this last paper to pique your curiosity about why neonates might have such a deficit and whether it might confer some advantage, say, in learning a primitive shape vocabulary:
Face Recognition by Humans: Nineteen Results All Computer Vision Researchers Should Know About [45] (PDF)
Visual Object Recognition: Can a Single Mechanism Suffice? [Tarr03] (PDF)
Image-based object recognition in man, monkey and machine [49]
Optical and photoreceptor immaturities limit the spatial and chromatic vision of human neonates [1]
Here is a sample of papers by Bruno Olshausen, his colleagues (David Field and David Van Essen are of special note) and his students. In addition to his seminal work with Field on sparse coding as a model of simple cells in V1 [33], he has also applied sparse coding to representing video [36, 4], and, building on earlier work of Geoff Hinton, he has looked at the possibility of neural circuits that transform retinal patches into a standard scale and pose [32]. If you become interested in sparse coding, you might seriously consider looking into the work of Horace Barlow [2], which inspired the efficient coding hypothesis. A minimal sketch of sparse coding follows these references:
Learning Transformational Invariants from Time-Varying Natural Images [4]
Learning Horizontal Connections in a Sparse Coding Model of Natural Images [7]
Learning invariant and variant components of time-varying natural images [36]
The recognition of partially visible natural objects in the presence and absence of their occluders [15]
Learning Sparse, Overcomplete Representations of Time-varying Natural Images [35]
Processing Shape, Motion and Three-dimensional Shape-from-motion in the Human Cortex [31] (HTML)
Sparse coding with an overcomplete basis set: A strategy employed by V1? [33]
A neurobiological model of visual attention and pattern recognition based on dynamic routing of information [32]
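To make the sparse coding objective concrete, here is a minimal sketch that infers sparse coefficients for a signal given a fixed dictionary, using iterative shrinkage-thresholding (ISTA). The full Olshausen-Field model also learns the dictionary by gradient descent on the same objective; this sketch keeps the dictionary fixed for brevity, and the toy problem is illustrative:

```python
import numpy as np

def ista(signal, dictionary, sparsity=0.1, n_iters=200):
    """Infer sparse coefficients a minimizing
        (1/2) * ||signal - dictionary @ a||**2 + sparsity * ||a||_1
    via iterative shrinkage-thresholding (ISTA)."""
    step = 1.0 / np.linalg.norm(dictionary, 2) ** 2
    a = np.zeros(dictionary.shape[1])
    for _ in range(n_iters):
        # Gradient step on the reconstruction error.
        residual = dictionary @ a - signal
        a = a - step * dictionary.T @ residual
        # Soft threshold: most coefficients are driven exactly to zero.
        a = np.sign(a) * np.maximum(np.abs(a) - step * sparsity, 0.0)
    return a

# Toy example: recover a 3-sparse code from a random dictionary.
rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)          # unit-norm dictionary atoms
truth = np.zeros(256)
truth[rng.choice(256, 3, replace=False)] = 1.0
code = ista(D @ truth, D)               # should be close to truth
```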
Geoff Hinton is generally credited with the development of so-called deep belief networks, and his most recent work on restricted Boltzmann machines is cited whenever deep networks are invoked. However, there have been many hierarchical models developed over the years, and they employ diverse learning and inference algorithms. One recurring theme is the idea of using a great deal of unlabeled data to perform unsupervised learning of the layers, one layer at a time; a minimal sketch of this greedy layer-wise strategy follows these references. Yann LeCun’s work on convolutional networks has been particularly influential for its parsimonious use of parameters. I’ve also included work that was described by Jim DiCarlo in his presentation at the Clark Center back in March. Note that the paper by Jarrett et al [14] looks more deeply at the architectural components in shallow (one or two layer) convolutional networks, with some interesting conclusions in light of the Pinto et al [39] results. Here is a sampling of papers on deep networks, including Geoff’s recent Science article and an early paper by LeCun on convolutional networks:
A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation [39] (HTML)
What is the Best Multi-Stage Architecture for Object Recognition? [14] (PDF)
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations [23] (PDF)
Why is Real-World Visual Object Recognition Hard? [38]
Learning a non-linear embedding by preserving class neighbourhood structure [43]
Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition [40]
Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting [21]
Handwritten Zip Code Recognition with Multilayer Networks [20]
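Here is a minimal sketch of the greedy layer-wise idea using tied-weight autoencoders trained on reconstruction error; the deep belief networks in Hinton’s work instead stack restricted Boltzmann machines trained by contrastive divergence, but the layer-by-layer recipe is the same. All sizes and hyperparameters here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder_layer(data, n_hidden, lr=0.1, n_epochs=50):
    """Train one tied-weight autoencoder layer by gradient descent on
    squared reconstruction error; return the encoder weights."""
    rng = np.random.default_rng(0)
    w = 0.01 * rng.normal(size=(data.shape[1], n_hidden))
    for _ in range(n_epochs):
        h = sigmoid(data @ w)          # encode
        recon = h @ w.T                # decode with tied weights
        err = recon - data
        # Gradient of the squared error w.r.t. the shared weights,
        # summing the encoder and decoder paths.
        dh = (err @ w) * h * (1 - h)
        w -= lr * (data.T @ dh + err.T @ h) / len(data)
    return w

def greedy_pretrain(data, layer_sizes):
    """Stack layers, training each on the previous layer's codes."""
    weights = []
    for n_hidden in layer_sizes:
        w = train_autoencoder_layer(data, n_hidden)
        weights.append(w)
        data = sigmoid(data @ w)  # feed codes forward, unsupervised
    return weights

# Toy example: pretrain a 64-32-16 stack on random "patches".
patches = np.random.default_rng(1).random((500, 64))
stack = greedy_pretrain(patches, [32, 16])
```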
I mentioned in the introductory lecture that Eugene Charniak wrote a now-classic and still frequently cited and read introduction to Bayesian networks, which constitute an important class of probabilistic graphical models. I’ve included both the reference and a link to the document on Kevin Murphy’s web site at the University of British Columbia; Kevin has written a somewhat longer and more comprehensive survey paper that also covers what are now called dynamic Bayesian networks. The original work by Judea Pearl that introduced the basic idea of Bayesian networks as well as their mathematical and algorithmic foundations is still an excellent introduction to the subject. By far the most comprehensive, mathematically detailed and up-to-date treatment of the field is the book by Koller and Friedman. I have also included two papers by Erik Sudderth which I believe represent some of the most interesting graphical models of visual representation and which would be interesting to explore in terms of their utility for modeling primate vision. The models by Sudderth are particularly interesting for their use of stochastic processes which are capable of adapting the complexity of the graphical models to accommodate the data. If you want a deeper understanding of the energy-based graphical models discussed in the work of Lee et al [22, 23], take a look at the LeCun et al tutorial [19]. You’ll also find a couple of recent papers on inferring scene layout which address some basic forms of visual inference that are critical in recognizing objects and establishing a context from which to draw further inference. A small worked example of a Bayesian network appears after these references:
Probabilistic Graphical Models: Principles and Techniques [17]
Decomposing a Scene into Geometric and Semantically Consistent Regions [10]
Describing Visual Scenes Using Transformed Objects and Parts [47] (PDF)
Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes [46] (PDF)
A Brief Introduction to Graphical Models and Bayesian Networks (PDF)
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference [37]
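To ground the idea, here is a small worked example of a Bayesian network and inference by enumeration, using the textbook rain/sprinkler/wet-grass network; the conditional probability tables are illustrative numbers, and real applications would use the message-passing algorithms developed by Pearl [37] rather than brute-force enumeration:

```python
import itertools

# A three-node Bayesian network: Rain -> Sprinkler, Rain -> WetGrass,
# Sprinkler -> WetGrass. CPT numbers are illustrative.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: {True: 0.01, False: 0.99},   # given rain
               False: {True: 0.4, False: 0.6}}    # given no rain
p_wet = {(True, True): 0.99, (True, False): 0.9,  # keys: (rain, sprinkler)
         (False, True): 0.9, (False, False): 0.01}

def joint(rain, sprinkler, wet):
    """The joint probability factorizes along the network's edges."""
    p = p_rain[rain] * p_sprinkler[rain][sprinkler]
    return p * (p_wet[(rain, sprinkler)] if wet
                else 1 - p_wet[(rain, sprinkler)])

# Inference by enumeration: P(Rain = true | WetGrass = true) is the
# ratio of two sums over the unobserved variable settings.
numer = sum(joint(True, s, True) for s in (True, False))
denom = sum(joint(r, s, True)
            for r, s in itertools.product((True, False), repeat=2))
print(f"P(rain | wet grass) = {numer / denom:.3f}")
```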