Research Discussions

The following log contains entries starting several months prior to the first day of class, involving colleagues at Brown, Google and Stanford, invited speakers, collaborators, and technical consultants. Each entry contains a mix of technical notes, references and short tutorials on background topics that students may find useful during the course. Entries after the start of class include notes on class discussions, technical supplements and additional references. The entries are listed in reverse chronological order with a bibliography and footnotes at the end. This listing is now closed, but if you would like to follow the research threads spawned by the 2020 class discussions, you are welcome to read about our ongoing work here.

Class Discussions

Welcome to the 2020 class discussion list. Preparatory notes posted prior to the first day of classes are available here. Introductory lecture material for the first day of classes is available here, a sample of final project suggestions here and last year's calendar of invited talks here. Since the class content for this year builds on that of last year, you may find it useful to search the material from the 2019 class discussions available here. Several of the invited talks from 2019 are revisited this year and, in some cases, are supplemented with new reference material provided by the list moderator.

July 4, 2020

%%% Sat Jul  4 03:29:17 PDT 2020

This entry, the last for the 2020 class, focuses on developing working models of various components of the cognitive architectures that we discussed in class this year, demonstrating how these components can be applied to solving interesting problems. In particular, we focus on the problem of exploiting working memory in the frontal cortex to solve a special case of the variable binding problem in service to learning simple rules encoded in sparse distributed representations in a differentiable connectionist architecture. The discussion here is just a start. In coming months, we will dig deeper into this problem as well as explore other related problems in preparation for class next year.

Neural Coding Microarchitectures

%%% Fri Jun 19 05:40:13 PDT 2020

One of the core cognitive capabilities required in solving problems involves keeping track of task-relevant information including what tasks you are working on along with their inputs, intermediate products, current status, and dependencies between tasks, including their order of invocation, all of which are handled by the procedure call stack in a conventional von Neumann architecture.

In the models we consider here the attentional system is responsible for identifying relevant stimuli both internal and external, and the executive control system in prefrontal cortex is responsible for orchestrating the cognitive tasks required to deal with the challenges and opportunities that follow from observations and predictions prompted by the attentional system.

Attention only makes sense if there is more than one thing to attend to and those things are obvious enough to distinguish from the background and interesting enough to warrant your attention. If you are performing a task then presumably by attending to the task you will naturally expand the number of items that you have to keep track of and maintain access to.

Eventually there may be so many items that you run out of short-term memory in which to store those items. Having attended to some item in the process of working on one task and then not being able to recover that item after a span of working on other tasks, you might search for the item in recent episodic memory, but that assumes you know what you are looking for.

Attention, memory and the capacity to perform computations on information stored in memory are all limited resources that depend on one another. In order to perform computations it is necessary to identify and access information in the form of representations that manifest as patterns of activity among ensembles of neurons.

Whether these patterns arise from perception, prediction or the restoration of episodic memories stored in long-term memory, they have to be captured in a stable format and transferred to locations where they can be operated upon by circuits that have been trained to perform particular information processing tasks.

You can think of the basal ganglia and prefrontal cortex as staging cognitive tasks by selecting information relevant to performing a given task and moving that information into specific registers in working memory where various operations can be carried out and the results stored in other registers in working memory.

Summarizing current estimates of the size of working memory in the frontal cortex: there are on the order of 20,000 stripes, each corresponding to approximately 100 interconnected rate-coded neurons; stripes are organized in stripe clusters of approximately 10 stripes; and each stripe can be independently updated and is associated with a corresponding stripe in the striatum.
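To make these figures and the phrase "independently updated" concrete, here is a minimal Python sketch. The constants are the order-of-magnitude estimates quoted above; the `gated_update` helper and the four-stripe toy memory are illustrative assumptions and are not taken from the PBWM implementation.

```python
# Order-of-magnitude estimates quoted in the text, not exact counts.
N_STRIPES = 20_000
NEURONS_PER_STRIPE = 100
STRIPES_PER_CLUSTER = 10

total_neurons = N_STRIPES * NEURONS_PER_STRIPE   # neurons devoted to stripes
n_clusters = N_STRIPES // STRIPES_PER_CLUSTER    # number of stripe clusters


def gated_update(stripes, inputs, gates):
    """Overwrite only the stripes whose gate is open (1); keep the rest.

    Hypothetical helper: each stripe is modeled as a slot holding one symbol.
    """
    return [new if g else old for old, new, g in zip(stripes, inputs, gates)]


# Toy working memory with 4 stripes instead of 20,000.
memory = ["-", "-", "-", "-"]
memory = gated_update(memory, ["1", "A", "X", "?"], [1, 0, 0, 0])  # store "1" in stripe 0
memory = gated_update(memory, ["9", "A", "9", "9"], [0, 1, 0, 0])  # store "A" in stripe 1

print(total_neurons, n_clusters, memory)
# 2000000 2000 ['1', 'A', '-', '-']
```

The point of the toy is only that a closed gate protects a stripe's contents from whatever is currently on its input, which is what allows one item to be maintained while another is updated.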

This sounds straightforward until you realize that all of this staging and moving information about has to be learned and the same substrate used for one task may need to be interrupted and used for another task before returning to completing the first task, and that you never perform exactly the same task twice.

Consider the task of making a cake with a recipe that you've used many times in the past. You might start by setting out all the bowls and other cooking paraphernalia you'll need and then measuring out the ingredients so you can mix the cake batter and start the oven heating to the right temperature for baking.

For every subtask such as measuring out flour, there will be dozens if not thousands of computations required, and for each one of these you will have to load registers, initiate computations and transfer results to other registers, all the while dealing with interruptions and novel problems like substituting unbleached for pastry flour.

Of course you will also likely have thousands of little subroutines, many of which you've practiced many times, but perhaps not in exactly the order that you will need to perform them in making the cake, and so they will have to be adapted and coordinated and their runtime requirements in terms of allocating registers and utilizing specific circuits satisfied.

Basic register allocation and circuit utilization seem too low level to require that an agent learn to directly control them. It may be that there is a base level of control hardwired in the brain that works in the same way that the microarchitecture of a modern computer abstracts from its underlying processor-specific instruction set architecture¹. Even if biological brains lack such a microarchitecture, engineers may find it advantageous to provide this level of abstraction to expedite learning.

The focus in this year's class was on exploring architectures motivated by our understanding of brain function. The architectures we developed attempted to integrate many of the cognitive functions generally associated with human cognition. The emphasis was on learning what the brain was doing and not necessarily how it was doing it.

Next year, students will be challenged to develop working models of various components of these architectures demonstrating how these components can be applied to solving interesting problems. Since developing a complete architecture to solve any nontrivial problem is well beyond the scope of the current state-of-the-art, the trick will be to simplify the rest of the architecture just enough to solve the target application. The next subsection outlines such a project and offers it as a challenge to students working over the summer break. The goal for this summer is to develop a number of similar challenges in preparation for next year's class.

Integrating Basal Ganglia Functionality

%%% Tue Jun 23 15:14:36 PDT 2020

In class this year we spent several lectures discussing the basal ganglia as a reinforcement learning machine and the PBWM (prefrontal, basal ganglia and working memory) model described in O'Reilly et al [386] as the basis for a simple form of variable binding that enables rule-based inference.

In particular, we investigated how the components that comprise the PBWM might emulate a simple Turing-complete automaton called a register machine. In the following, we sketch a somewhat more detailed version of the register machine that takes its inspiration from PBWM, along with a simplified cognitive architecture and a simple task to demonstrate its capabilities.
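For readers who have not encountered register machines, the following Python sketch shows a conventional Minsky-style machine with just two instructions, increment and decrement-or-branch. It is meant only to fix ideas about the automaton being emulated; the instruction encoding, the function name `run_register_machine` and the example program are illustrative choices, not the PBWM-inspired design discussed in these notes.

```python
def run_register_machine(program, registers, max_steps=10_000):
    """Run a Minsky-style register machine and return the final registers.

    program is a list of instructions addressed by index:
      ("inc", r, nxt)           add 1 to register r, then go to nxt
      ("dec", r, nxt, if_zero)  if r > 0, subtract 1 and go to nxt;
                                otherwise go to if_zero
      ("halt",)                 stop
    """
    regs = dict(registers)
    pc = 0  # program counter
    for _ in range(max_steps):
        op = program[pc]
        if op[0] == "halt":
            return regs
        if op[0] == "inc":
            _, r, nxt = op
            regs[r] += 1
            pc = nxt
        elif op[0] == "dec":
            _, r, nxt, if_zero = op
            if regs[r] > 0:
                regs[r] -= 1
                pc = nxt
            else:
                pc = if_zero
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    raise RuntimeError("step limit reached without halting")


# Addition: drain register "b" into register "a".
ADD = [
    ("dec", "b", 1, 2),  # 0: if b > 0, b -= 1 and go to 1; otherwise go to 2
    ("inc", "a", 0),     # 1: a += 1, loop back to 0
    ("halt",),           # 2: done
]

result = run_register_machine(ADD, {"a": 3, "b": 4})
print(result)  # {'a': 7, 'b': 0}
```

Everything the machine does reduces to selecting a register, testing or modifying it, and choosing the next instruction, which is why gated working memory plus a learned controller is a natural substrate for emulating it.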

One important feature of the prefrontal cortex is that it supports hierarchical representations for planning and decision making. We have discussed the relevant biology and related technologies at some length in these notes – there are over 160 mentions of "hierarchy" in this year's notes alone, and here we argue that, in the case of the architectures we are contemplating, hierarchical structure will naturally arise as a consequence of learning and development.

For example, in conventional programming there exists a natural hierarchy induced by the procedure call stack and the function type signature by way of the corresponding hierarchy of types. Combinatorial complexity and compositionality induce natural procedural hierarchies, e.g., writing a program involves writing loops and conditionals, which, in turn, involves writing conditionals that involve writing Boolean expressions, etc.

Contrary to the standard top-down engineering model for software development, learning how to write programs generally starts with simple procedures that operate on simple types, e.g., you start off implementing list, string and sequence functions, before realizing that strings and sequences are subtypes of vectors and lists are subtypes of sequences, etc. In order to induce such natural hierarchical structure, we may have to recapitulate some features of human cognitive development.

Specifically, it may be possible to bootstrap writing and executing cognitive programs by using a developmental curriculum that serves to learn and then build on a low-level microarchitecture of the sort alluded to earlier. In a strategy similar to that described by Josh Merel in his discussion in class², the process of bootstrapping might start by learning to write simple subroutines that are constructed from even simpler components including basic arithmetic, relational and logical operators – see Figure 1.

Figure 1: The above graphic demonstrates some of the differences between motor control tasks and the architecture used to learn and deploy them as described in Merel et al [352] and the sort of cognitive tasks and corresponding architectures discussed here. The two panels A and B represent traces illustrating the execution of a task by a fully-trained instance of each architecture. Panel A represents a motor program as a policy that is deployed by stringing together motor primitives in the process of executing a primitive, sensing the environment, estimating the relevant state variables, and then iterating. This execution cycle is akin to traditional control-theory models that consist of modules that observe – perform state estimation, plan – learn a policy, and execute – carry out primitive actions.

Panel B represents a cognitive program as a collection of subroutines that can call one another by passing information from the caller to the called subroutine, and decide what subroutine to call next by using their own internal logic contingent on local state information passed by a caller or global state originating from sensory data, other subroutines by way of shared working memory or directly from the prefrontal cortex in its role as an operating system of sorts. B may seem more complicated than A given the relative autonomy of subroutines and the multiple ways in which subroutines can influence and be influenced by one another; however, the steps of selecting one motor primitive and carrying out the next are intricately coupled with one another through the environment serving as a high-bandwidth two-way communication channel between the two.

Building a system that can write simple subroutines is challenging enough, but the hardest part may be in learning how to coordinate the running of multiple simple subroutines to produce complex behavior. Note that in the case of the motor primitives described in Merel et al [353], separate subroutines do not have to pass information back and forth and so you simply string subroutines together and execute them sequentially with interleaved real-time conditional branching to adapt to unanticipated contingencies.

The problem arises when subroutines have inputs and outputs that require unpredictable / on-the-fly routing of information to support integral branching, recurrence and interrupt handling — the sort of problem that the microarchitecture in combination with the compiler, assembler and operating system are designed to facilitate. Conventional computing hardware supports virtually unlimited, easily repurposed, restructured and conveniently accessed working memory, making it relatively simple for subroutines to share information. In particular, programs and data can move about independently to accommodate varying demands.

In biological systems, information and computation are collocated: neural algorithms are said to be executed in place, requiring no auxiliary storage and relying on fixed pathways – the data changes but its sources, sinks and format are fixed. In order to accommodate these limitations, neural circuits in frontal cortex responsible for generating and coordinating behavior are directly connected to the sources of sensorimotor information that they require to perform the necessary computations³.

The thalamus is divided into approximately 60 nuclei each of which provides a unique pathway for relaying specific sensorimotor information to cortical and subcortical locations, primarily in the cerebral cortex. Motor pathways, limbic pathways, and sensory pathways besides olfaction all pass through this central structure. The modular organization of the thalamus and its alignment with structures in the striatum and cerebral cortex explain the pairing of information and processing in frontal cortex, as well as the consequence that some general-purpose data processing functions will necessarily require duplication.

Implementing and Testing the Model

%%% Mon Jun 29 15:12:53 PDT 2020

The following assumes familiarity with PBWM. If you're not familiar with this model, check out the resources available here. If you are somewhat familiar, but want to learn more, you might want to check out the transcript of the discussion we had with Randy back in January during which we talked about PBWM and our interest in adapting the model for the programmer's apprentice application, and the followup (exchange) with Randy and Michael Frank in June in which, among other topics, we discussed more recent developments concerning the sharing of responsibility for action selection and executive control between basal ganglia and prefrontal cortex⁴.

The SIR (Store, Ignore, Recall) task described in O'Reilly et al [219] provides a simple test of competence in selectively attending to and maintaining items in working memory while performing a task that requires finding patterns in sequences of alphanumeric characters, e.g., identify all instances of the three-character target strings 1AX and 2BY in the following test string "195H31AQ2BY472U22B1AXJR97"⁵.

As an exercise to familiarize yourself with the SIR task, compare and contrast the following problems assuming that you are interacting with a small child with a short attention span — a three-year-old will have trouble remembering the two target strings and have to be reminded whereas the five-year-old may still find the task challenging but derive benefit from assigning each of the target strings a memorable retrieval cue of their choice, e.g., "one axe" and "two buzz"⁶:

describe the SIR task to a three-year-old child in enough detail that they can easily perform the task;
explain to a five-year-old how a program that solves the SIR task works so they can perform the task.

The goal is to perform an efficient search and so concatenating the results from two searches, one for 1AX and another for 2BY, is not acceptable. To be explicit, you might describe the problem as follows: traverse the sequence one character at a time in the order given and output 1 if the current character is the last character in an occurrence of either 1AX or 2BY, and 0 otherwise.
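The explicit problem statement above translates almost directly into a reference implementation that is useful for generating training labels and checking a learned model. The sketch below is a conventional single-pass scan, not a neural solution; the function name `sir_scan` and the sliding-window approach are illustrative choices.

```python
def sir_scan(sequence, targets=("1AX", "2BY")):
    """Return one 0/1 flag per character: 1 where a target occurrence ends."""
    longest = max(len(t) for t in targets)
    window = ""  # the most recent characters, never longer than `longest`
    flags = []
    for ch in sequence:
        window = (window + ch)[-longest:]
        flags.append(1 if any(window.endswith(t) for t in targets) else 0)
    return flags


test_string = "195H31AQ2BY472U22B1AXJR97"
flags = sir_scan(test_string)
hits = [i for i, f in enumerate(flags) if f]  # zero-based positions of the 1s
print(hits)  # [10, 20]
```

On the test string given earlier, the scan fires once at the Y that completes 2BY and once at the X that completes 1AX; near misses such as 1AQ and 22B1 produce no output.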

Figure 2: The proposed architecture employs a biologically less plausible variant of PBWM and a simple model of frontal cortex function separating executive control logic and procedural abstraction to solve the SIR (store, ignore, recall) problem as described in O'Reilly et al [219]. This version of the SIR problem assumes that the specific challenge corresponding to the two patterns can change at any point in time. Each successful (true positive) identification of either pattern yields a reward of 1, each false positive yields -1 and any other outcome 0. The objective is to maximize expected reward. The network mapping the contents of the basal (ganglia) source registers (BSR) to the frontal (cortex) target registers (FTR) is a special case of the model shown here. Each of the networks in the subroutine library is a simple gated recurrent unit [32] – three are shown here, but only two are required to solve the SIR problem. The executive control network is a multicell LSTM, and working memory corresponds to the gated memory registers in the BSR and the executive control and subroutine library. The complete architecture is rendered here to highlight the functional characteristics of the (correctly) learned model and can be implemented as a single multicell LSTM.
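The reward scheme stated in the caption can be written down directly. The following Python sketch assumes that the model emits one binary response per character and that ground-truth labels are available for scoring; the function name `sir_rewards` and the example sequences are hypothetical.

```python
def sir_rewards(predicted, actual):
    """Per-step reward: +1 for a true positive, -1 for a false positive, 0 otherwise."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    rewards = []
    for p, a in zip(predicted, actual):
        if p and a:
            rewards.append(1)
        elif p and not a:
            rewards.append(-1)
        else:
            rewards.append(0)  # misses and correct rejections both score 0
    return rewards


actual = [0, 0, 1, 0, 0, 1]     # where a pattern really ends
predicted = [0, 1, 1, 0, 0, 0]  # what a hypothetical model reported
rewards = sir_rewards(predicted, actual)
print(rewards, sum(rewards))  # [0, -1, 1, 0, 0, 0] 0
```

Note that under this scheme a model that never responds earns a total of 0, the same as the model above, so positive expected reward requires detecting more patterns than it falsely reports.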

Figure 2 depicts the basic model and the caption describes a basic version of the architecture that should suffice for solving SIR problems. The following pseudocode describes one possible solution that might be learned. In the listing, the token PFC corresponds to the executive controller, and SR1, SR2 and SR3 correspond to the three subroutines:

# SR1 determines whether to match the first, second or neither pattern:
if SR1 is running and its input matches the first character in the ith pattern then output 1 or 2 otherwise 0;
# SR2 and SR3 determine whether the input matches the nth character in the ith pattern:
if SR2 is running and its input matches the second character in the ith pattern then output 1 otherwise 0;
if SR3 is running and its input matches the third character in the ith pattern then output 1 otherwise 0;

# The PFC unit controls the run status (0,1) of subroutines:
if SR3 is running and its output is not 0 then output i and set the run status of SR3 to 0;
if SR2 is running and its output is not 0 then set the run status of SR3 to 1 and the run status of SR2 to 0;
if SR1 is running and its output is not 0 then set the run status of SR2 to 1 and the run status of SR1 to 0;
# if the target patterns or input sequence are updated then reset:
reinitialize the run status of SR1, SR2 and SR3 to 0;
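The control flow in the listing above can be sketched as an ordinary Python state machine, which may be useful as a target when checking what a trained network has learned. The sketch makes three assumptions that the pseudocode leaves open: a reset is followed by the PFC starting SR1; when SR2 or SR3 fails to match, control falls back to SR1, which re-checks the same character; and the first character of each pattern is unique and does not recur later in any pattern, which holds for 1AX and 2BY. The names `first_char_match` and `run_controller` are hypothetical.

```python
def first_char_match(ch, patterns):
    """SR1: return the 1-based index of the pattern that ch starts, or 0."""
    for k, p in enumerate(patterns, start=1):
        if ch == p[0]:
            return k
    return 0


def run_controller(sequence, patterns=("1AX", "2BY")):
    """Emit, per character, the index i of the pattern completed there, else 0."""
    running = "SR1"  # assumption: after a reset the PFC starts SR1
    i = 0            # pattern currently being tracked (1-based), 0 if none
    outputs = []
    for ch in sequence:
        out = 0
        if running == "SR2" and ch == patterns[i - 1][1]:
            running = "SR3"  # second character matched: hand off to SR3
        elif running == "SR3" and ch == patterns[i - 1][2]:
            out = i          # third character matched: report pattern i
            running = "SR1"
        else:
            # Either SR1 was already running, or SR2/SR3 failed to match and
            # control falls back to SR1, which re-checks this character.
            i = first_char_match(ch, patterns)
            running = "SR2" if i else "SR1"
        outputs.append(out)
    return outputs


outputs = run_controller("195H31AQ2BY472U22B1AXJR97")
print([(pos, i) for pos, i in enumerate(outputs) if i])  # [(10, 2), (20, 1)]
```

Each call starts from the reset state, so changing the target patterns or the input sequence simply means calling `run_controller` again. Patterns whose first character can recur later would need a more careful fallback than the one assumed here.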

In the PBWM model, the basal ganglia is responsible for identifying contexts represented as patterns of activity in posterior cortex that are associated with rewards. In the simple model we consider here, the basal ganglia functions are subsumed in the prefrontal cortex, and so the PFC ends up being responsible for both action selection and executive control. Having recognized a potentially rewarding context, the PFC has to select a suitable plan and set it in motion.

During training, the weights for both the PFC and the subroutines responsible for carrying out tasks are tuned by gradient descent or an alternative learning rule. By way of providing a microarchitecture to facilitate compilation, it seems reasonable to hardwire the basic machinery necessary to handle those aspects of PFC oversight common to all PFC operations and train only those weights responsible for novel uses of that machinery in the same way DNC controllers manage reads and writes⁷.

Assuming such a plan exists, the PFC signals a subroutine, SR1, by setting a register in PFC working memory that SR1 will notice and then activate itself. SR1 is responsible for checking whether the register containing the current character in the input sequence contains a 1, 2 or neither, and on the basis of the result setting a flag visible to the PFC indicating failure to identify the beginning of a rewarding pattern (0), the initial character of a 1AX sequence (1) or the initial character of a 2BY sequence (2). This flag will serve to activate one of SR2 or SR3 if appropriate.

There are a lot of alternative implementations one might come up with; the challenge is to design a neural network that is easily trained to carry out the necessary steps. The above descriptions rely on logical and relational operators, conditionals and other operations familiar to programmers but challenging to learn from examples. It might help to speed learning by using configurable hybrid neural networks with both fixed and programmable (trained) weights analogous to field-programmable gate array (FPGA) devices⁸.
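As one very small reading of the hybrid idea, the Python sketch below pairs a hardwired layer with no trainable parameters (a fixed one-hot character encoding) with a trainable readout (a perceptron) that learns which characters can start a target pattern. This is an illustration under stated assumptions rather than the design proposed in these notes: the alphabet, the function names, and the use of the perceptron rule in place of gradient descent are all illustrative choices.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def one_hot(ch):
    """Hardwired layer: a fixed encoding with no trainable parameters."""
    return [1.0 if ch == a else 0.0 for a in ALPHABET]


def score(w, b, ch):
    """Weighted sum of the fixed features plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, one_hot(ch))) + b


def predict(w, b, ch):
    return 1 if score(w, b, ch) > 0 else 0


def train_readout(examples, epochs=10, lr=1.0):
    """Trainable layer: a perceptron on top of the fixed encoding."""
    w = [0.0] * len(ALPHABET)
    b = 0.0
    for _ in range(epochs):
        for ch, label in examples:
            err = label - predict(w, b, ch)
            if err:
                # Only the readout weights change; the encoding stays fixed.
                w = [wi + lr * err * xi for wi, xi in zip(w, one_hot(ch))]
                b += lr * err
    return w, b


# Label 1 for characters that can start a pattern ("1" or "2"), else 0.
examples = [(ch, 1 if ch in "12" else 0) for ch in ALPHABET]
w, b = train_readout(examples)
print([ch for ch in ALPHABET if predict(w, b, ch)])  # ['1', '2']
```

Because the fixed layer already separates the characters, the trainable part has very little left to learn and converges within a few passes over the alphabet, which is the kind of speedup the FPGA analogy is after.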

While the intent here is not to focus on possible solutions so much as to motivate the problem and highlight some of the challenges, here are some suggestions you might find useful: O'Reilly and Frank [384] is directly relevant and worth reading carefully. In particular, the appendix provides detail about the experiments conducted and the models tested including an LSTM model. The Ott and Nieder [391] review article is an excellent complement to the PBWM materials suggested at the beginning of this subsection. In previous classes, we investigated other approaches to modeling PFC function and subroutine libraries using variations on differentiable neural computer key-value memory [1], [2], differentiable structured programs [122] [3], neural program emulation [4] and neural coding using a variant of imagination based planning [213, 395, 524] [5].

Missing from the picture is the sort of learning one might observe in an infant that does lend itself to neural network solutions, such as the way that a newborn learns to recognize its mother within hours of birth, differentiate between its parents and siblings when its vision develops further and soon recognize and track the family dog as it runs around the house. One could characterize such activities in terms of simple programs, but the point is that some problems are best solved with logic and unambiguous rules whereas others require the flexibility of smoothly varying differentiable functions⁹. Along similar lines, the sort of context-based template matching that manifests in language production and the effective use of analogy in transferring knowledge from one domain to another is another area where humans excel and machines lag behind¹⁰.

Learning rules of the sort exemplified in solving problems like the SIR task is likewise well suited to learning to write simple computer programs, whereas riding a bicycle or learning to walk upright yield to relatively simple reinforcement learning solutions. Interestingly, learning language may turn out to be more like learning to walk despite the fact that, as in the case of writing programs, language learning involves manipulating discrete symbols. Why is it that learning to write relatively simple programs is for most people notoriously difficult, whereas the planning and logic involved in preparing dinner seem so simple that a child can learn them, yet seem on the face of it every bit as complicated as performing multi-digit subtraction or computing the nth number in the Fibonacci series?

And yet writing a short computer program is truly difficult for most humans, requiring almost superhuman concentration and cognitive control. Perhaps we should build intelligent machines that have an innate faculty for such single-minded concentration that they can turn on and off at will and that enables them to perform feats that are virtually impossible for someone with the sort of otherwise quite useful cognitive flexibility and fluid intelligence that most humans possess and take for granted in their diverse activities and social interactions – toggle back and forth between "Mister Spock" and "Captain Kirk" mode. I'm being facetious, but it's a question worth spending a few minutes thinking about.

Miscellaneous Loose Ends: In responding to Gene's inquiry about books on partially observable Markov decision processes, I asked around and a friend of mine at Google recommended the course on POMDPs that Pieter Abbeel and John Schulman teach at Berkeley. I checked out the syllabus and slides and this looks like a well-designed course complete with a useful collection of resources.

Here are the references for a few of the papers that I consulted in writing this entry. The BibTeX records including abstracts are in the footnote at the end of this sentence:¹¹

Anthony Strock, Xavier Hinaut, Nicolas Rougier. A Robust Model of Gated Working Memory. Neural Computation, Massachusetts Institute of Technology Press (MIT Press), 2020, pp.151-181. [472]

Hayworth, Kenneth and Marblestone, Adam. How thalamic relays might orchestrate supervised deep training and symbolic computation in the brain. bioRxiv. 2018. [218]

Meropi Topalidou, Daisuke Kase, Thomas Boraud, Nicolas Rougier. A Computational Model of Dual Competition between the Basal Ganglia and the Cortex. eNeuro, Society for Neuroscience, 2018. [488]

Nicolas P. Rougier, David C. Noelle, Todd S. Braver, John D. Cohen, Randall C. O'Reilly. Prefrontal Cortex and Flexible Cognitive Control: Rules Without Symbols. Proceedings of the National Academy of Sciences, National Academy of Sciences, 2005, 102 (20), pp.7338-7343. [429]

Nicolas P. Rougier, Randall C. O'Reilly. A Gated Prefrontal Cortex Model of Dynamic Task Switching. Cognitive Science, Wiley, 2002. [430]

O'Reilly, Randall and Frank, Michael. Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia. Neural Computation 2006 pp.283-328. [384]

Touretzky, David and Hinton, Geoffrey. Symbols among the neurons: details of a connectionist inference architecture, Proceedings of the 9th International Joint Conference on Artificial Intelligence, 1985, pp. 238-248. [489]

May 29, 2020

%%% Sun May 29 06:34:04 PDT 2020

I've recommended to several of you that you consult Randy O'Reilly's Computational Cognitive Neuroscience to learn more about a number of topics, in particular, those having to do with the interplay between the basal ganglia, prefrontal cortex, motor cortex and areas of the posterior cortex that are important for executive control. In previous discussions I've suggested a few keywords you might use to search in the book PDF for relevant information, but I realize that while it is a relatively short textbook, it is also exceedingly dense with information on many different levels of abstraction and diverse mechanisms and functions.

In case you are still looking for insight concerning executive control and how it relates to various sensory, motor and higher-level cognitive function, I marked up a copy of the book PDF by highlighting excerpts relevant to executive control using a green highlight to indicate important concepts, yellow to indicate important technical takeaways and blue to indicate figures and other resources that are worth a careful look. I've attached the marked up copy, and I've included below a scenario that illustrates some of the key use cases exercising aspects of human executive control.

If you read much of the cognitive science literature relating to executive control, you are apt to find papers on topics such as task switching, attention sets, cognitive shifting, consciously or unconsciously redirecting one's attention from one fixation to another, conflict monitoring, active maintenance of patterns of activity and short-term memory, selective attention versus flexible attention, and cognitive versus motor control, just to mention a few of the more frequent mentions. If your interest is in developing human-inspired cognitive architectures, then I suggest you start with the attached, marked up copy of Randy's text.

The scenario below is "aspirational" in the sense that it illustrates a relatively simple example of everyday life in which cognitive challenges are hidden in the midst of what, for most of us, are decidedly not challenging since it seems to us that we cope with them effortlessly on a daily basis. Cognitive scientists have developed experimental protocols that employ exaggerated variants of such commonplace challenges in order to determine the limitations of our cognitive apparatus.

These protocols include the Stroop task developed by John Ridley Stroop, the A-not-B task developed by Jean Piaget, the n-back task developed by Wayne Kirchner, and the SIR (Store, Ignore, Recall) task described in O'Reilly et al [219] designed to test the ability to rapidly update and robustly maintain information in working memory using the dynamic gating of PFC representations by the basal ganglia in the PBWM model — where numbers in parentheses correspond to page numbers in [386]. The tasks mentioned above are often used to test implementations of cognitive theories as in the case of PBWM.

In Hazy et al [219] (HTML), the authors describe how their implementation of PBWM written in Leabra handles both the Stroop and n-back tasks. As part of your final project, you might think about how the model or architecture you are developing would handle one of these tasks. Alternatively, you might note that your model falls short on a particular task and discuss how it might be modified or extended to handle the task. Since in most cases you'll be describing and not implementing a model, you could perform a gedanken experiment in which you explain how your model might handle some of the challenges described in the aspirational scenario below.

P.S. Here is the promised (aspirational) scenario illustrating different aspects of human executive control:

It's Saturday morning around 8 AM; you are finishing up the dishes after breakfast and thinking about your day ahead. You've committed to making the main dish for dinner tonight, and have decided that you will make up a batch of your favorite veggie burgers. As you clean the counter tops and put the last of the teacups in the dish strainer, you take inventory of the ingredients you'll need for the burgers to determine whether or not you have all the ingredients on hand or will have to make a trip to the grocery store – it appears that you have everything you need.

The recipe calls for brown rice and you decide that you will prepare the rice in advance, and leave the rest of the preparations for later since you have all the ingredients on hand and making the burgers will require little effort aside from dicing some vegetables and using the blender to combine the beans, garlic and lemon juice. You place a saucepan on the stove top, add 2 cups of water to 1 cup of brown rice and turn the burner on high while you measure out the other ingredients. As soon as the water starts to boil you turn down the heat to low, cover the saucepan and set a timer to ring in 20 minutes.

Earlier in the week you promised yourself to spend time this morning practicing some of the jazz songs you've been working on, but on a whim you decide to try something you saw on a YouTube video last night. The video featured a classical pianist who started out playing one of Bach's Brandenburg Concertos and gradually mixed in jazz improvisations that matched the cadence and temperament of the Bach piece¹².

After checking for new messages on your phone, you go into the living room, sit down at the piano and begin playing Bach's Brandenburg Concerto No. 5 in D major without any sheet music since you know the piece so well you could play it in your sleep. After playing a couple of measures experimenting with tempo, you settle on a leisurely pace and experiment by introducing a few improvised notes here and there. This doesn't work for you and so you decide to use the musical score for the concerto to better anticipate and visualize promising opportunities for introducing departures from the notes on the score.

You have just begun looking for the sheet music when the timer in the kitchen starts to ring. You return to the kitchen, turn off the heat and remove the lid of the saucepan to check the rice. You notice that there is some rice stuck to the bottom of the pan but there's no sign of burning and so you transfer the rice to a bowl and set it aside to cool down before putting it in the refrigerator. You put the saucepan in the sink, fill it with hot water so it will be easier to clean later on and return to the living room.

You've forgotten exactly what you were doing before the timer interrupted your playing, but there is sheet music spread out all over the piano bench and you remember your plan to use the score of the concerto to help in identifying passages to improvise. You arrange the sheet music on the piano music stand and scan the first couple of pages looking for opportunities before starting to play the piano again. You are adept at sight reading classical music as you play the piano and you can do so for a familiar piece without exerting any (cognitive) effort, but you are disappointed with your first few attempts to improvise.

You pencil in a few notes on the score to suggest possibilities for short departures from the original score and try again. This time you manage something more interesting by using the notes you added to launch your improvisation and your knowledge of the Bach piece to make an interesting return to the original score. However, despite your facility with playing the piece you selected for your attempt at improvisation, the result sounds clumsy and contrived, but you persist for another hour before becoming so dissatisfied with your performance that you decide to call it a day.

As you return the sheet music to the storage space under the piano bench, you suddenly remember how much you enjoyed adding slices of avocado to the veggie burgers the last time you made them, and so you decide to run out to the local vegetable stand to see if you can find a ripe avocado and perhaps some heirloom tomatoes to complement the burgers for dinner this evening. You grab your wallet and keys from the table in the hallway where you usually leave them and start for the garage, but then pause for a moment standing in the middle of the kitchen to take stock and make sure you aren't forgetting anything else. You think it prudent to take along your cell phone, but it's not on the table where you usually leave it and so you have to think for a moment to remember when and where you last used it.

May 27, 2020

%%% Thu May 27 04:33:32 PDT 2020

In the following, we attempt to answer two fundamental questions concerning early development: First, what are the prerequisites for grounding the physical, social and emotional intuitions necessary to ensure that an AI system can get the most out of its exposure to the spoken and written word? Second, what does early physiological development tell us about cognitive development, and what steps might we take to leverage the former to facilitate the latter in bootstrapping AI systems?

Grounding Machine Intelligence in Human Experience

Grounding human language in human experience may be critical if our goal is to leverage human knowledge in the form of books, scientific articles or other documents. This doesn't imply that we need to slavishly recapitulate the experience of a human infant but rather that we somehow create a seed grounding using a simulated environment that captures the dynamics of having a human-like, effectively-articulated and sensorily-equipped simulated body subject to (approximately) the same laws of physics that govern our bodies and provide stimulation to our senses.

The challenge for the designer is to provide enough of a foundation that many of the key social and cultural experiences that inform our understanding of human habits and institutions can be acquired through reading—or watching videos on YouTube. Clearly an AI system will have to acquire much of its emotional and social experience indirectly through its exposure to human beings, until such time as we have the need to build AI systems with physical sensations and emotional feelings.

Note that the analogies that permeate the human use of language are not always drawn from the same domain of discourse. Humans routinely use mechanical analogies to explain social behaviors and emotional experiences, as well as to leverage stories and fairy tales about human behavior to explain basic physical concepts. We easily, and without realizing it, mix metaphors and combine analogies to tell stories that explain everything from cooking omelets to quantum mechanics.

The above argument would seem to require that AI systems will need to engage with humans and feel human-like emotions to fully realize the advantages of language. It may be, however, that it is possible to bootstrap such a facility by building on top of a foundation based on understanding the physical world of human beings gleaned from an agent's experience in a realistic simulator. The result may be more akin to the emotional understanding of someone with severe Asperger syndrome, lacking some aspects of emotional intelligence but able to compensate using its mechanical intuitions to predict human behavior.

Leveraging Biological Development for Machine Cognition

After listening to the Spelke OCW lectures, I searched for papers that include the phrases: "core knowledge", "naive physics", "physical intuition" and "mental simulation" in different combinations and found dozens of relevant documents, but settled on just a few with a bias for work by Jason Fischer, whose summary of relevant research and choice of coauthors appealed to me. What follows is a summary of the key ideas that I gleaned from my reading, liberally punctuated with verbatim quotes from the papers that I found to be most informative:

Fischer [165], in reviewing experiments in which newborns in their first three or four months perform relatively sophisticated tasks that involve "early-emerging physical intuitions", notes that Liz Spelke and her colleagues argue from these studies that "children are born with an innate knowledge of basic principles governing object motion and that this knowledge provides the mental scaffolding for learning more sophisticated physical concepts over the course of development." Fischer goes on to ask "Just how accurate are our physical intuitions? Do we carry out mental simulations of physical dynamics, or do we rely upon heuristics that are effective in many scenarios but could break down in others? What brain machinery supports naïve physics?"

Fischer introduces additional detail regarding the experiments with newborns in the process of critiquing several theories about how infants might go about performing these tasks, including the possibility that infants learn rules by way of explanation having observed contradictory outcomes that can't be predicted by the infant's current knowledge. Fischer spends some time describing a competing proposal that involves the existence of a mental model that makes it possible to simulate perceived physical processes using a biological version of a video-game physics engine (albeit one with substantially less detail) that can be run forward and backward in order to validate or undermine the infant's physical intuitions.

Mitko and Fischer [358] examine the degree to which intuitive physics can be accounted for by our spatial cognition and conclude that their findings point toward the two faculties being separable. Binz and Endres [58] compare synthetic models of perception and find agreement between the performance of a family of optimized models and the pattern of development observed in infants and children on three classic cognitive tasks [44], indicating common principles for the acquisition of knowledge—admittedly at a considerable difference in the amount of training data required. Fischer et al [166] examine the functional neuroanatomy of physical inference and observe a close relationship between the cognitive and neural mechanisms involved in parsing the physical content of a scene and preparing an appropriate action.

Pursuant to investigating the proposal that humans rely upon a mental simulation engine akin to the "physics engines" used in many video games, Fischer et al [166] identify a common brain network in frontal and parietal cortex that appears to be active "in a variety of physical inference tasks as well as simply viewing physically rich scenes". The investigators used 40 Amazon Mechanical Turk workers to classify short videos as to their relevance to physical inference and then collected fMRI data on twelve subjects (ages 18–26) participating in the fMRI component [167].

Without going into any great detail, the experiments revealed a "systematic pattern of activation across all three tasks including bilateral frontal regions (dorsal premotor cortex / supplementary motor area), bilateral parietal regions (somatosensory association cortex / superior parietal lobule), and the left supramarginal gyrus." Interestingly, the authors also note that "subcortical structures and the cerebellum were also included in the parcel generation process, but no consistent group activity appeared in those areas."

The subjects in this case were all young adults and so it is not clear whether the same circuits play a role in the first few months postnatally, nor whether these circuits are even fully mature in this early window. We have already seen that infant vision is limited at birth in terms of acuity (20/400), color discrimination (shades of gray) and depth disparity (monocular). In addition there are rather dramatic changes in the organization of primary motor cortex during the first three months.

Chakrabarty and Martin [87] studied the postnatal development of the motor representation in primary motor cortex in an effort to understand when the muscles and joints of the forelimb become represented in primary motor cortex (M1) during postnatal life and how local representation patterns change¹³. The authors "show that the M1 motor representation is absent at day 45 and, during the subsequent month, the motor map is constructed by progressively representing more distal forelimb joints."

Tau and Peterson [480] note that "the brain grows to about 70% of its adult size by 1 year of age [...] [t]his increase in brain volume during the first year of life is greatest in the cerebellum, followed by subcortical areas and then the cerebral cortex, which increases in volume by an impressive 88% in the first year." Jernigan et al [249] provide additional insights regarding the achievement of cognitive milestones based on the development of myelinated fiber tracts.

It seems likely that the circuits in frontal and parietal cortex that Fischer et al identify as the basis for our intuitive physical reasoning either do not yet exist or are not mature enough to support the sort of learning we see in infants during the first few months postnatally. They may, however, physically develop in parallel with the earliest experiences of the infant leading to the acquisition of fundamental physical principles. Specifically, the incremental maturation of those circuits might provide the curricular scaffolding necessary to facilitate such learning by limiting sensory and motor abilities to focus the infant's attention and thereby simplify the learning process.

How could you arrange circumstances so that most of the time learning is one or zero shot and even if you get it wrong it is (relatively) easy to reverse or unlearn without collateral damage – my euphemism for catastrophic forgetting? Of course, if you linger too long with a misconception about some fundamental physical or psychological principle or assume that it must be true because no one has ever told you / demonstrated to you otherwise, then you may be setting yourself up for considerable grief downstream – all the more reason to get it right the first time.

Babbling and Stumbling our Way to Physical Intuition

Imagine the staged maturation of the circuits that enable the infant's growing awareness of the position and movement of its body – its proprioceptive sense – in tandem with the gradual roll out of its ability to control its movement starting from the trunk and gradually extending to the extremities [87]. At each point in time, there will be an opportunity to experiment with a new motor capability, e.g., the (constrained) degrees of freedom allowed by the bones, muscles and tendons that comprise your elbow, conditioned on the controlled articulation of previously-learned perception-action pairings, e.g., your shoulder (scapula) and its associated tendons, plus the upper arm (brachium) consisting of your humerus, biceps and tendons that connect these to the radius bone at the elbow.

Imagine further that as each new sensori-motor system comes on line, the newborn executes the motor equivalent of babbling [326] to encode a collection of motor primitives in primary motor cortex. This process could be extended to multiple joint-muscle pairs enabling the newborn to explore the recursive composition of simpler commands. As the infant's visual acuity, depth perception and color vision improve, the opportunity arises for more complicated goal-driven babbling activities, creating a natural hierarchy of features in keeping with Joaquín Fuster's hierarchy [184].
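
As a rough illustration of staged motor babbling, the following sketch simulates a two-joint planar arm in which the shoulder is released first and the elbow later. Everything in it is an assumption made for illustration: the arm geometry, the link lengths, the function names and the nearest-neighbour lookup standing in for an inverse model. None of it is drawn from [87] or [326].

```python
import math
import random

# Hypothetical two-link planar arm; the link lengths are arbitrary choices.
UPPER_ARM, FOREARM = 1.0, 0.8


def hand_position(shoulder, elbow):
    """Forward kinematics: joint angles (radians) -> (x, y) of the hand."""
    x = UPPER_ARM * math.cos(shoulder) + FOREARM * math.cos(shoulder + elbow)
    y = UPPER_ARM * math.sin(shoulder) + FOREARM * math.sin(shoulder + elbow)
    return x, y


def babble(n_samples, elbow_released, rng):
    """Issue random motor commands and record (command, outcome) pairs.

    While the elbow is not yet released it stays locked at zero, so the
    stage-one repertoire is confined to a one-dimensional family of postures.
    """
    memory = []
    for _ in range(n_samples):
        shoulder = rng.uniform(0.0, math.pi)
        elbow = rng.uniform(0.0, math.pi / 2) if elbow_released else 0.0
        memory.append(((shoulder, elbow), hand_position(shoulder, elbow)))
    return memory


def reach(memory, target):
    """Crude inverse model: replay the stored command whose outcome was closest."""
    return min(memory, key=lambda pair: math.dist(pair[1], target))[0]


def reach_error(memory, target):
    """Distance between the target and where the replayed command puts the hand."""
    return math.dist(hand_position(*reach(memory, target)), target)


rng = random.Random(0)
stage_one = babble(200, elbow_released=False, rng=rng)              # shoulder only
stage_two = stage_one + babble(2000, elbow_released=True, rng=rng)  # elbow added

target = hand_position(1.0, 1.0)  # reachable only by bending the elbow
print(reach_error(stage_one, target), reach_error(stage_two, target))
```

With the elbow locked, every remembered hand position lies on a single arc, so the first stage cannot get close to a target that requires a bent elbow. The second stage reuses the stage-one memories and only adds to them, which is one simple way to read "conditioned on previously-learned perception-action pairings".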

I am certainly no expert having read or skimmed a dozen or so papers on the early development of the primate central nervous system. Nor do I pretend to understand all of what I read about the behavioral and cognitive milestones of human infants. It seems to me however that these milestones could be aligned with neurophysiological changes especially within the first year, during which the human brain is undergoing growth and maturation in some of the key areas of the brain responsible for supporting the sort of behavioral trajectory that we witness in the growing toddler. Whether true or not, the very idea provides an interesting suggestion about how one might mimic the early coordinated stages of aligned cognitive and neurophysiological development.

The proposal that I have in mind would be to exploit the limitations of the evolving motor and sensory cortex to engage with the infant's restricted ambient environment to quickly learn fundamental properties of the underlying dynamical system by relying solely on its limited repertoire of sensory-motor actions. So for example in the first few days the newborn can only make out blurry gray shapes that are in the foreground and is unable to move its body to any appreciable degree. At this stage, the infant might simply learn to attend to the blurry shape of its mother and perhaps register her movements without actually moving any body parts other than its eyes rotating in their sockets.

The scenario of a looming blobby object would likely repeat frequently in the first couple of days giving the infant ample time to learn how to track nearby objects by gaze alone. We would expect that the reciprocal connections linking perception to action would serve to learn how to perform such tracking automatically. Subsequent developmental stages would introduce multiple blobs perhaps with some additional acuity so that the blobs could be distinguished and tracked independently if indeed they moved so. Two blobs in the same visual plane might knock into one another, or one might seemingly pass through the other, since without depth it might be difficult to reliably infer occlusion.
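
One way to mimic the gradual improvement in acuity in a simulated agent is to present the same scene at progressively finer resolution. The sketch below does this by block-averaging a small grayscale image. The function names, the toy scene and the particular schedule of block sizes are assumptions made for illustration; this is not a model of infant vision.

```python
def block_average(image, block):
    """Simulate low acuity by averaging a grayscale image over block x block tiles.

    `image` is a list of equal-length rows of floats whose height and width
    are assumed to be multiples of `block`.
    """
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = [image[r][c]
                    for r in range(top, top + block)
                    for c in range(left, left + block)]
            mean = sum(tile) / len(tile)
            for r in range(top, top + block):
                for c in range(left, left + block):
                    blurred[r][c] = mean
    return blurred


def acuity_curriculum(image, blocks=(4, 2, 1)):
    """Yield progressively sharper views of the same scene, coarsest first."""
    for block in blocks:
        yield block, block_average(image, block)


# Toy 4x4 scene: a single bright "blob" in the upper-left corner.
scene = [[1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]

for block, view in acuity_curriculum(scene):
    print(block, view[0])
```

At the coarsest stage the blob is smeared across the whole field and nothing can be localized; at the next stage it resolves into a distinct region. A learner trained on the stages in this order sees only the distinctions that the current stage makes available.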

As an exercise, it might be interesting to imagine the infant's sensorimotor trajectory as a timeline and see if one can identify opportunities for incorporating into the infant's evolving physics model the Gestalt principles of figure-ground, similarity, proximity, common region, continuity, closure, focal point and common fate. Some of these it might be possible to learn in any order, whereas others would build on one another thereby requiring some sort of curricular intervention to make sure such dependencies are observed. Alternatively the principles might simply be instilled into the infant's behavioral repertoire, their implicit observance a prerequisite for their success.

At some point the infant's sensorimotor apparatus will be sophisticated enough that the baby will have greater autonomy and the option of ignoring a parent trying to teach yet another lesson. To make further progress, the growing child will have to be convinced that it is in her interest to attend to and perhaps follow the instructions of a parent or appropriate surrogate such as a grandparent or kindergarten instructor. I currently have no idea of how a child might graduate from being dependent upon parents or other adults to achieving a greater degree of autonomy along with its attendant ambiguity and the opportunity to strike out on its own. It is however an interesting challenge to contemplate.

Miscellaneous Loose Ends: Here are some exchanges with a neuroscientist who took a look at an early version of the above:

TLD: Given the early development arc described by Spelke and others, it would seem that many or most of the areas you identify in the PNAS paper are either not fully articulated or simply absent in the infant brain and yet the authors maintain that there must exist some neural circuitry at work that explains their performance on simple tasks that imply some basic physical intuition. My students and I are curious if there is any work identifying the neural substrate for that early intuitive capability in an infant during its first five months postnatally that might serve as an inductive bias for much of what follows during later development. Thanks in advance for any insights you might have or papers you can point us to.
JIF: That's a great question regarding the neural basis of those early emerging physical intuitions. Since we know that non-human animals possess many of those same intuitions beginning early in life, I can imagine the kinds of experiments that could answer your question (e.g., neural recordings done during expectation violation paradigms similar to the ones used with human infants). To my knowledge, that kind of work has not been done yet. But I can say that on the behavioral side, those expectations about the mechanical properties of objects are present very early, about as close to a demonstration of innateness as you can get. For example, these findings from baby chicks show that this kind of knowledge is present just after hatching and with next to no additional experience [497, 421].

My thinking goes that core physical knowledge sits in the earliest seeds of the regions that will develop into those that we identified in the PNAS paper as being important for intuitive physics in adults. It kickstarts the process by which those regions develop the array of properties that we see in adulthood. If we were to look for the neural basis of that early-emerging knowledge in young infants, we may find it right in the places we'd expect based on the brain imaging in adults, even before the other functional properties of those brain regions develop. That does more or less seem to be the case for other core visual domains that have been studied with brain imaging in infants [126]. I know this doesn't provide much in the way of neural substrates, especially at the level of circuitry (it's more like a wishlist for neuroscience work that I'd like to see get done!) but hopefully some food for thought anyway.

TLD: It's interesting how many recent papers are trying to make sense of early development starting at around four or five months. I now understand that five months old is a milestone neurophysiologically speaking – the entire infant brain is still undergoing substantial changes, but several key circuits have been established. I've been particularly interested in how the process unfolds during those first four weeks when the relevant substrate is very much in the process of assembly. When I wrote you earlier, I was entertaining the idea that there must be some subcortical circuitry at play to bridge the gap between the peripheral nervous system and the developing cortex. After some thought this seems obviously the case, but largely beside the point given I now think of the rapidly maturing motor and somatosensory cortex as providing "curricular affordances" to guide exploration and focus learning.

May 23, 2020

%%% Sat May 23 04:23:10 PDT 2020

Earlier this month I suggested that some of you might be interested in videos on the MIT open courseware website featuring Liz Spelke lecturing on child development. Those lectures emphasized experiments that were conducted by Spelke, her colleagues at Harvard as well as other cognitive neuroscientists with an emphasis on the protocols used to elicit behavior for the purpose of verifying hypotheses concerning specific cognitive capacities in humans and other model organisms. Those experiments were considerably constrained by the limited repertoire of behavioral responses one has to work with in studying infants and young children.

This morning we watched an invited lecture at the Allen Institute for AI in which Spelke directly explains the different types of cognitive systems that she refers to as core knowledge. She does so in a manner designed to address the concerns of researchers in AI and machine learning who are interested in commonsense reasoning in machines and who hope that our current understanding of child development might provide insight into the kinds of concepts that AI systems could possibly learn and into the various trajectories describing what children learn, in what order and at what age – with the hope of recapitulating such behavior in machine models of cognition.

Here is the abstract–with my italics added to emphasize topics that some of you have mentioned in our final project discussions–and a link to the video that was uploaded to YouTube on December 4, 2019:

Title: From Core Concepts to New Systems of Knowledge VIDEO
Abstract: Young children rapidly gain a basic, commonsense understanding of how the world works. Research on infants suggests that this understanding rests on a set of early emerging, domain-specific cognitive systems. Six systems of core knowledge serve to (a) represent objects and their motions, (b) animate beings and their actions, (c) social beings and their engagements, (d) places and their relations of distance and direction, (e) forms and their scale-invariant geometry, and (f) number. These systems are innate, abstract, strikingly limited, and yet present and functional throughout human life. Infants' knowledge then grows both through gradual learning processes that people share with other animals, and through a fast and flexible learning process that appears to be unique to our species and emerges with the onset of language. The latter process composes new systems of concepts productively by combining concepts from core knowledge. The compositional process is poorly understood but amenable to study through coordinated research in psychology, neuroscience and artificial intelligence. To illustrate, this talk will focus on core knowledge of places, objects, and people, and on one new system of concepts that emerges early in human development: the artifact concepts underlying our prolific tool use.

Nikhil Bhattasali, who took the course last year, recently asked me about pointers to interesting research on (a) self-organizing learning mechanisms for visual and tactile senses and (b) mechanisms for formulating goals (e.g. a baby wanting to pick up a toy) and recognizing when those goals are satisfied. Here is my answer which includes references to work by Guido Schillaci and his colleagues and a link to an open-access special issue of Frontiers in Robotics and AI which he co-edited with Verena Hafner and Bruno Lara entitled "Re-Enacting Sensorimotor Experience for Cognition" — as a further enticement the first technical entry is "Self-organized internal models architecture for coding sensory–motor schemes".

Regarding (a) the first idea that comes to my mind is hierarchies of proprioceptive and multi-modal maps of the sort that Joaquín Fuster talks about, and then using some form of predictive coding with both forward and inverse maps constituting the corresponding reciprocal connections and the top-down corrections from predictive coding employed to implement some form of efference copy / corollary discharge. The hierarchy could be built layer by layer starting with, for example, the retinotopic maps in primary visual cortex – according to Brian Wandell, there are upwards of a dozen such topographically organized and spatially-aligned representations in the primary and secondary visual cortex. It seems pretty clear that the architecture is in place in the developing primate brain to support such a highly-preserved, stereotyped collection of representations.
The higher-level uni-modal secondary and multimodal association areas may not be uniformly organized and interconnected but it would seem necessary that there be in place some system for partitioning the collection of all of these features so as to ensure that the thalamocortical radiations and cortico-basal-ganglia-thalamo-cortical loop provide a highly-stereotyped, highly-structured interface to working memory "registers" in the motor and prefrontal cortex. Within each of these areas, both unimodal and multimodal, the corresponding topographic mapping would probably have some leeway to accommodate changes due to age, disease and structural damage, as in the case of somatosensory representations in the rat barrel cortex. There appear to be quite a few – perhaps thousands – of papers suggesting the use of hierarchies of feature-aligned, self-organizing maps as a representation.
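
For readers who have not worked with self-organizing maps, here is a minimal sketch of a single one-dimensional Kohonen-style map trained on scalar inputs. It covers one layer only and is not a hierarchy. The function names, the Gaussian neighbourhood, the learning rate and the synthetic stimuli are all assumptions chosen for illustration and are not taken from any of the papers mentioned above.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the map unit whose weight is closest to the input x."""
    return min(range(len(weights)), key=lambda i: abs(weights[i] - x))


def som_update(weights, x, learning_rate=0.5, radius=1.0):
    """One Kohonen-style update of a 1-D map on a scalar input.

    Every unit moves toward x, scaled by a Gaussian of its distance along
    the map from the winner, so neighbouring units come to represent
    neighbouring inputs.
    """
    winner = best_matching_unit(weights, x)
    updated = []
    for i, w in enumerate(weights):
        neighbourhood = math.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))
        updated.append(w + learning_rate * neighbourhood * (x - w))
    return updated


def train(weights, inputs, learning_rate=0.5, radius=1.0):
    """Apply som_update once per input, in order, and return the final weights."""
    for x in inputs:
        weights = som_update(weights, x, learning_rate, radius)
    return weights


rng = random.Random(0)
initial = [rng.random() for _ in range(8)]      # unordered starting weights
stimuli = [rng.random() for _ in range(2000)]   # e.g. positions along a surface
print(train(initial, stimuli, learning_rate=0.1))
```

The neighbourhood term is what makes the map topographic: the winner moves most, its neighbours along the map move less, and distant units barely move. Stacking such maps, with the winners of one layer serving as inputs to the next, is one way to approximate the layer-by-layer construction suggested above.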

As for more concrete proposals implemented in working robots or simulated robot environments, Guido Schillaci, Verena Hafner and Bruno Lara co-edited an interesting collection of papers on the related topic of "Re-Enacting Sensorimotor Experience for Cognition" from which I've only sampled a few papers, but the collection is open-access and you can download the PDF if you are curious. In addition, I suggest that you take a look at Guido's papers many of which are available on his website here.

Miscellaneous Loose Ends: If you viewed Liz Spelke's OCW lecture, you might be curious about what exactly a newborn sees during the first few months; here are three tools designed for new parents to be aware of and notice possible problems with their new baby's vision: WebMD newborn-vision resources, How Babies See the World and The Baby Sight Tool.

May 19, 2020

%%% Tues May 19 04:17:37 PDT 2020

Here is some of the background material I've been gathering as it seemed potentially relevant for a couple of final projects. The first three sections are relatively straightforward and could turn out to be useful for several projects. The last section is more theoretical and a bit dense. You may find it interesting if the title "On the Emergence of Sensory Motor Hierarchies" intrigues you, but it is primarily relevant to better understanding the perception-action cycle and to developing an information theoretic account of how sensory motor hierarchies might naturally arise out of the interaction between the environment and an agent's acting on its own behalf and (intrinsically) recognizing the value of information.

Behavioral Biases: Environmental Triggers

Formally, if you want to induce a behavior–as opposed to hard wiring the behavior and the rules for deploying it into the agent's programming–you need to introduce an instinct or bias in the agent's cognitive architecture to facilitate the desired behavior such that once initiated the induced behavior will emerge naturally from interaction with the environment. Alternatively, you can choose an environment with features–generally referred to as affordances–such that when the agent encounters them in the wild these features automatically initiate appropriate behavior conditioned on the observable characteristics of the specific affordance [194].

Sometimes it makes sense to do both, as generally happens in natural environments. For example, many species of birds have distinctive superficial features that, when seen by a hatchling / baby chick, will induce "imprinting" so that thereafter the chick will behave toward, e.g., stay near and follow around, the first adult bird it encounters as its mother. As another example, an abhorrence of and immediate retreat from snake- or spider-evocative shapes is built into most mammals.

Sensory Hierarchy: Exploratory Behaviors

There are other approaches that induce exploratory behavior as a response to a drive to (efficiently) collect information that serves–is relevant to–action learning and selection and is often initiated by random movement, sometimes called servo-babbling in robotics [530], e.g., see the attached paper¹⁴. In a hostile environment, this proclivity has to be balanced by a survival instinct that recognizes and avoids dangerous situations.

In social animals and artificial agents expected to perform collaboratively, generally one instills an instinct to interact with, e.g., seek out and playfully engage with siblings, and learn how to "read" and respond appropriately to the intentions of other agents in their species-specific social milieu. You might want to think carefully about how to implement such instincts and affordances in your experimental environment and its programmable agents.

Efference Copies and Corollary Discharge

Gordon et al [202] in a review of discussions that took place in a workshop at the Santa Fe Institute entitled "Perception & Action – An Interdisciplinary Approach to Cognitive Systems Theory" discuss the biological evidence for integration of perception and action in the brain and the role of efference copies in particular. The two referenced papers in the following quote are worth a look if you are interested in this area:

Foundational anatomical and physiological studies have provided substantial evidence for the integration of motor and sensory functions in the brain (for a recent review, see Guillery and Sherman [204]). Regarding the neuroanatomy of the thalamus and cortex, Guillery and Sherman have noted that most, if not all, ascending axons reaching the thalamus for relay to the cortex have collateral branches that innervate the spinal cord and motor nuclei of the brainstem (see Guillery [204]). Similarly, those cortico-cortical connections that are relayed via higher-order thalamic structures, such as the pulvinar nucleus, also branch to innervate brainstem motor nuclei. Guillery and Sherman thus hypothesize that a significant portion of the driving inputs to thalamic relay nuclei are "efference copies" of motor instructions sent to subcortical motor centers, suggesting a more pervasive ambiguity between sensory and motor signals than has previously been acknowledged. – Excerpt from Gordon et al [202]
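
For project purposes, the computational role usually attributed to an efference copy can be sketched in a few lines: a copy of the motor command is fed to a forward model, and the predicted sensory consequence is subtracted from what is actually sensed. The linear forward model, the gains and the function names below are placeholder assumptions and are not part of the Gordon et al review.

```python
def forward_model(motor_command, gain=1.0):
    """Predicted sensory consequence of a self-generated motor command.

    A linear gain is a placeholder assumption; any learned predictor
    could stand in its place.
    """
    return gain * motor_command


def sensed_signal(motor_command, external_event, plant_gain=1.0):
    """What the sensor reports: self-generated change plus whatever the world adds."""
    return plant_gain * motor_command + external_event


def exafference(motor_command, observation, gain=1.0):
    """Subtract the prediction derived from the efference copy of the command.

    What remains is the part of the observation not explained by the
    agent's own action.
    """
    return observation - forward_model(motor_command, gain)


command = 0.5  # e.g. an eye-rotation command
observation = sensed_signal(command, external_event=0.25)
print(exafference(command, observation))
```

When the forward model matches the plant, the residual is exactly the externally caused component. When the two disagree, the residual is nonzero even in a static world, and that error is a natural training signal for the forward model.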

Emergence of Sensory Motor Hierarchies

In doing background research helping students interested in projects relating to natural language processing, the Wei et al [512] paper that Rishabh Singh suggested as an interesting example of exploiting the relationship between natural and artificial (programming) languages got me thinking about using language as the basis for a bottleneck learning strategy [483]. In particular, I was intrigued with their use of the mutual information¹⁵ of two language models as the basis for – or at least one component of – a regularization scheme. Another idea that kept coming up in my discussions with students had to do with the nature of hierarchy, and in particular, hierarchy in the context of Fuster's perception-action cycle¹⁶.

With these ideas in my mind, I started reviewing related work by Naftali Tishby and Yoram Singer: Tishby for his joint effort with Fernando Pereira and Bill Bialek [483] on the information bottleneck method, and Singer for his work on adaptive mixtures of probabilistic automata and the hierarchical hidden Markov model that he developed with Tishby and Shai Fine [162]. The relevance of the bottleneck method is obvious; the hierarchical model is the simplest partially observable stochastic process one might use as a theoretical basis for studying hierarchical reinforcement learning.
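
To make the bottleneck idea concrete, the sketch below evaluates the information bottleneck Lagrangian I(X;T) - beta * I(T;Y) for two hand-written encoders on a toy discrete source. It only evaluates the objective for a given encoder p(t|x); it does not implement the iterative algorithm from [483]. The toy distribution, the encoders, the value of beta and the function names are assumptions made for illustration.

```python
import math


def mutual_information(joint):
    """I(A;B) in bits from a joint distribution given as a list of rows."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                total += p * math.log2(p / (row_marginal[i] * col_marginal[j]))
    return total


def bottleneck_objective(joint_xy, encoder, beta):
    """Information bottleneck Lagrangian I(X;T) - beta * I(T;Y).

    joint_xy[x][y] is p(x, y); encoder[x][t] is p(t | x). The bottleneck
    variable T sees only X, so p(t, y) = sum_x p(t | x) p(x, y).
    """
    p_x = [sum(row) for row in joint_xy]
    n_x, n_t, n_y = len(p_x), len(encoder[0]), len(joint_xy[0])
    joint_xt = [[p_x[x] * encoder[x][t] for t in range(n_t)]
                for x in range(n_x)]
    joint_ty = [[sum(encoder[x][t] * joint_xy[x][y] for x in range(n_x))
                 for y in range(n_y)]
                for t in range(n_t)]
    compression = mutual_information(joint_xt)
    relevance = mutual_information(joint_ty)
    return compression - beta * relevance, compression, relevance


# Toy source: X takes four values; Y depends only on whether x is in {0,1} or {2,3}.
joint_xy = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
copy_x = [[1.0 if t == x else 0.0 for t in range(4)] for x in range(4)]  # T = X
merge = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]                   # T keeps 1 bit
for name, encoder in (("copy", copy_x), ("merge", merge)):
    print(name, bottleneck_objective(joint_xy, encoder, beta=2.0))
```

Both encoders preserve the single bit that X carries about Y, but the merging encoder retains only one bit about X where the copying encoder retains two, so it scores lower on the objective. That is the sense in which the method favors representations that discard whatever is irrelevant to the target.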

In his 2013 Technion lecture on the emergence of hierarchies in sensing-acting systems, Tishby uses information theory to model the performance of an agent constrained to interleave information gathering and acting in its environment. He and his coauthor Daniel Polani [484] are able to prove that a hierarchy of features will emerge naturally and converge on an asymptotically optimal solution as a consequence of gathering information useful for acting in its environment. Their general formulation shares some characteristics with Karl Friston's free energy principle for biological systems [261]; however, I found Tishby and Polani's analysis clearer and more rigorous¹⁷.

Simplifying somewhat, Tishby and Polani assume the agent will start by learning features that apply only to its immediate spatial and temporal extent and will continue doing so until further effort yields diminishing returns relative to future rewards [530]. In their model, they appeal to free energy and the value of information, establishing a direct analogy with compression-distortion trade-offs in source coding. Having exhausted its prospects for extracting value from this limited spatio-temporal extent, the agent expands its spatial and temporal extent and begins learning features whose respective receptive fields are proportionally expanded. The same cycle repeats until the features in the next level of the hierarchy extend beyond the scope of the agent's ability to extract additional value. By recursively expanding the agent's spatio-temporal extent at an exponential rate, e.g., doubling its size at each level, they are able to prove convergence and establish an optimal lower bound¹⁸.

Tishby’s lecture at the Institute for Advanced Studies on February 24, 2020, covers some of the same ground, but this time he provides a computational model using recurrent deep neural networks that starts with a single hidden layer and progressively adds new layers that behave in accord with his theoretical analyses [485]. I'm reading papers and watching lectures by Liz Spelke on child development, and it occurs to me that neonates and toddlers in their first 18 months appear to exhibit a similar trajectory of expanding focus, by progressively enlarging their sphere of influence as their sensory apparatus and means of interacting with their environment develop. While neurogenesis is nearly complete by birth, much of the infant brain–and early visual cortex in particular–undergoes extensive change over the first 18 months¹⁹.

Tishby's talk also prompted me to review the work of Etienne Koechlin and his colleagues Christopher Summerfield and Thomas Jubault as it relates to the role of time and temporal hierarchies [282, 284]. Joaquín Fuster [183] summarizes their contribution, writing, "In their methodologically impeccable study, Koechlin et al [285] reveal the neural dynamics of the frontal hierarchy in behavioral action. Progressively higher areas control the performance of actions requiring the integration of progressively more complex and temporally dispersed information. Their study substantiates the crucial role of the prefrontal cortex in the temporal organization of cognitive control."

In [282], Koechlin and Summerfield contrast the roles of sensorimotor and cognitive control, dividing the latter into three components corresponding to branching, episodic and contextual control. These three components are identified with adjacent areas in the prefrontal cortex, and their time course and dependence on one another are explained in terms of their related total, conditional and mutual information. If you are reading about this general area of study for the first time, you might want to read David Badre's 2008 review summarizing the evidence for a rostro-caudal gradient of function in prefrontal cortex [41].

Miscellaneous Loose Ends: A number of you have asked for my recommendations regarding introductory material on various topics in systems and cognitive neuroscience. Some of you may know about the MIT Open Courseware (OCW) Foundation. I've taken several courses over the years and have been impressed with my small sample. About ten years ago I took Fundamentals of Biology with Eric Lander. It refreshed (and corrected) what I had previously learned and enabled me to update what knowledge I already had by filling in gaps (gaping holes really). As a plus, Lander is an affable and well-informed instructor full of interesting anecdotes about the human genome project and knowledgeable about a wide range of other topics²⁰.

If you are looking for introductory lectures, go to the OCW website and use the search box to find out what MIT has to offer. Relevant to CS379C, the Brains, Minds and Machines Summer Course run by Tommi Poggio includes several lectures directly relevant to our discussions concerning human development. In particular, Liz Spelke's lecture on Cognition in Infancy provides an excellent introduction. The lecture is in two parts, with the second part dealing with how children learn various concepts of the sort relevant to common sense reasoning.

May 17, 2020

%%% Sun May 17 04:55:45 PDT 2020

I've been (somewhat) obsessed with Fuster's perception-action cycle and related aligned-and-reciprocally-connected sensory / motor hierarchy since I first encountered his research on the prefrontal cortex [185]. Since then I've generated nearly a dozen models only a few of which ever made it into these research notes. Most recently I got excited thinking about perisylvian language networks, and, in particular the arcuate fasciculus and the inverse pathway from motor (Broca's complex) to sensory (including Wernicke's area) cortex and its possible role in supporting inner speech. The model that I proposed:

is simple, but I got stuck thinking about the forward pathway. Then Rishabh Singh sent me several recent neural programming / code synthesis papers to help out a couple of students pursuing a related project, and one of the papers described a network architecture that borrowed an idea from NLP called dual task learning [512]–where "dual" is used in the sense it is applied in mathematical optimization theory–that treats code generation as the dual of code documentation. In a similar sense, the reciprocal sensory ← motor connections in Fuster's hierarchy consist of a forward (primal) model and its corresponding (dual) inverse. Here is the Wei et al [512] architecture:

What if we were to implement a variant of their code-generation encoder-decoder unit as a forward model complementing my inverse–inner speech–model and then couple the two with a regularizing loss function as in Wei et al? As a simple thought experiment, assume that both the forward and inverse models involve a latent-variable (informational) bottleneck layer [485] as shown in the following graphic. What might serve as the analog of their dual-constraints module integrating the language model for code and the language model for comments, and what exactly would be their role in the loss function?

Note that the dual-constraints component of the Wei et al paper is part of the objective function intended to bias the use of the language employed in each of the primal and dual models. In the case of the reciprocal connections in the above speech network, both the primal and dual pathways use the same vocabulary / lexicon for describing their respective source thought vectors, but the intended (contextually-based) meaning of a word or n-gram might be different depending on the target audience and ambient interlocutory environment, e.g., a technical discussion between two software engineers in the midst of pair programming versus a conversation with a partner in an obviously social mixed-company context.
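To make the role of such a constraint in the objective concrete, here is a minimal sketch, assuming it takes the usual dual-learning form: the two factorizations of the joint probability, P(x)P(y|x) and P(y)P(x|y), should agree, and the squared gap between their logarithms is added to each model's loss. The function names, argument names and weighting constants are hypothetical, and the log-probabilities would in practice come from the two language models and the two encoder-decoder models.

```python
import math

def dual_regularizer(logp_x, logp_y_given_x, logp_y, logp_x_given_y):
    """Squared gap between the two factorizations of log P(x, y)."""
    gap = (logp_x + logp_y_given_x) - (logp_y + logp_x_given_y)
    return gap * gap

def regularized_losses(nll_forward, nll_inverse, reg, lam_forward=0.1, lam_inverse=0.1):
    """Each model keeps its own likelihood loss and shares the same penalty term."""
    return nll_forward + lam_forward * reg, nll_inverse + lam_inverse * reg

# Consistent case: 0.2 * 0.5 == 0.25 * 0.4, so the penalty is (numerically) zero.
consistent = dual_regularizer(math.log(0.2), math.log(0.5), math.log(0.25), math.log(0.4))

# Inconsistent case: the inverse model is overconfident, so the penalty is positive.
inconsistent = dual_regularizer(math.log(0.2), math.log(0.5), math.log(0.25), math.log(0.8))

print(consistent, inconsistent)
```

The marginals log P(x) and log P(y) are where the two language models enter; in the speech-network analogy they would be replaced by whatever plays the role of a prior over each pathway's source representation.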

As illustrated in the graphic, the language-based forward model is depicted as a sequence of words just as was the inverse model. Instead, suppose that we replace each word sequence by a representation of the respective distribution with which the words and phrases in the language model were generated. In the case of the programmer's apprentice, we want to allow for multiple distributions associated with different agents that the apprentice interacts with, including the programmer and fictional agents that capture the different points of view or context-appropriate distribution, e.g., the difference between a social setting and the back and forth one might expect between the programmer and the apprentice in the midst of pair programming.

In the machine theory of mind (ToM) model proposed by Rabinowitz et al [412], the authors distinguish between "a general theory of mind–the learned weights of the network, which encapsulate predictions about the common behavior of all agents in the training set–and an agent-specific theory of mind–the 'agent embedding' formed from observations about a single agent at test time, which encapsulates what makes this agent’s character and mental state distinct from others’. These correspond to a prior and posterior over agent behavior." One use of such a ToM would allow, for example, the apprentice to channel the programmer when absent by approximating how the programmer might describe a code fragment depending on how the apprentice describes it.

Miscellaneous Loose Ends: Related to the action-perception cycle, I came across some work from Naftali Tishby on this topic. If you don't know Tali Tishby, his theoretical work with Fernando Pereira and Bill Bialek on the information bottleneck method [485, 483] has been highly influential in machine learning. A multi-class course taught by Tishby at The Edmond and Lily Safra Center for Brain Sciences (ELSC) at the Hebrew University of Jerusalem is available as a collection of individual videos entitled "Perception Action Cycle Week m" where m can run from 1 to possibly larger than 12 – I haven't looked any further.

He has a related chapter in Perception-Action Cycle: Models, Architectures, and Hardware [109] which is available for download on the Springer website PDF. Also of possible interest are his Technion lectures "On the Emergence of Hierarchies in Sensing-Acting Systems" VIDEO and the more recent lecture on February 24, 2020 at the Institute for Advanced Studies "Direct and Dual Information Bottleneck Frameworks for Deep Learning" VIDEO.

May 15, 2020

%%% Fri May 15  4:10:41 PDT 2020

I found Leslie Kaelbling's discussion in class on Thursday particularly thought provoking. I liked the way she set up the problem. Despite using some of the same familiar words, it was clear she was departing from the usual story, and she took her time bringing the audience along. That was important; otherwise, when she reached the point where she could reveal her alternative way of looking at the problem, the students might not have recognized it as novel or understood its implications.

It makes perfect sense that we as engineers can bootstrap a system by programming in inductive biases (instinct), general organizing strategies (cognitive architecture) and mechanical design (embodied affordances), in order to accelerate subsequent learning. As Leslie pointed out, the trick is to find a trade-off between up-front engineering effort that can benefit a wide range of applications and on-the-job training necessary to accommodate the details of the target working environment.

Apart from the fact that I really want to understand the computational architecture of the human brain for lots of personal and scientific reasons, I see my strategy for proceeding pretty much along the same lines as Leslie does, albeit with a very different way of describing it. Despite being determined by natural selection, the human brain is remarkably well engineered for its target environment(s). The trade-off between "nature and nurture" mirrors Leslie's engineering trade-off and exhibits some extraordinarily clever compromises.

In particular, the clever integration of connectionist and symbolic computing capabilities–which I can rhapsodize over at length–constitutes an engineering accomplishment of the first rank. From my perspective, natural systems benefit from three distinct engineering practices: (a) natural selection accounts for the basic physical substrate, (b) our staged development, spread out over two decades but particularly intense in the first 5-6 years, corresponds to the work Leslie and her students do, and (c) the rest (overlapping with development) plays out over a lifetime of on-the-job training and includes constantly adapting to our changing environment and compensating for our aging physical infrastructure.

It's worth pointing out that in humans (b) and (c) include the integration and refinement of language – undoubtedly the most powerful technology ever developed by human collective intelligence, and missing from Leslie's account, perhaps only excluded for lack of time. I don't see language as special except in its more general characterization as a collection of cognitive genes refined over millennia by cultural selection that can be passed on, literally by word of mouth, from one generation to the next, but rather as a tool for multi-agent coordination.

So instead of (a) Intel, Hitachi, Siemens, etc providing the basic hardware, (b) MIT and Stanford engineers writing code to support a general declarative and procedural substrate, and (c) on-the-job machine learning, we (a') employ our software-instantiated, biologically inspired cognitive architecture, (b') recapitulate human development using curriculum learning systems to reduce sample complexity, and (c') use complementary learning systems to enable life-long learning that leverages human language to expedite on-the-job training.

Miscellaneous Loose Ends: You might take a look at the first few technical slides–#5 to #10– in my introductory lecture to get some idea just how far we are willing to abstract away from the messy biological details that most neurobiologists consider the most interesting aspects of human cognition.

I use the concept of a high-dimensional configuration space and a corresponding lower-dimensional manifold²¹ of reachable points to explain how the basal ganglia provides sequences of state-action-reward triples to the cerebellum which it then uses to program the motor cortex – an idea developed independently by both David Marr [341, 340] and James Albus [7, 8] building on ideas from Claude Shannon and Norbert Wiener.

More generally, I don't think it a coincidence that, from a rough-and-ready engineering perspective, all of the following technologies were deeply influenced by biological neural networks: convolutional compositional architectures and differentiable kernel machines, e.g., CONVNETs, VAEs and GANs (primary visual and auditory cortex); gated recurrent neural networks, e.g., LSTMs and GRUs (striatum); differentiable neural memory networks, e.g., NTMs and DNCs (hippocampus); meta-reinforcement learning systems, e.g., Meta-Q-Learning (basal ganglia); attentional feedback architectures, e.g., ABCNNs and BERT (ubiquitous throughout cortex); meaning-as-context embedding models, e.g., encoder-decoder and sequence-to-vector machines (arcuate fasciculus); and model-based feedback control systems, e.g., MERLIN (cerebellum).

It's true that I want to build systems patterned after the human being but I have no intention of slavishly adhering to the solutions stumbled upon by natural selection. Indeed, my idea of an embodied system for the programmer's apprentice application corresponds to a software engineer's integrated development environment, fully instrumented to ingest programs as abstract syntax trees and manipulate them using a tool chain including debuggers, the ability to generate full traces from running code, single-step through buggy programs and make changes to alter their behavior. I maintain that the IDE serves as the AI system's sensors and servomotors and the computer + OS that the IDE is running on serves as the richly-featured dynamical system / environment in which the system is embedded–a powerful programmable universal Turing machine. Despite the–seeming–unnaturalness of such an arrangement, think of the IDE simply as a prosthetic device that allows the AI system to interface with the complex world of computable functions.

May 14, 2020

%%% Thu May 14 08:59:28 PDT 2020

The scientific literature on the evolutionary origins of human language is huge, spanning multiple disciplines and a great deal of controversy. In order to save you time and not a little frustration, I've provided some excerpts from an article in Discover by Bridget Alex [12] HTML summarizing key points from a review by evolutionary biologist W. Tecumseh Fitch [169] PDF attempting to answer the question: How Did Human Language Evolve? The following text interleaves excerpts from the Discover article with my commentary.

Most scientists think this happened in stages, as our ancestors evolved the adaptations needed for language. In earlier stages, human ancestors would have used a kind of protolanguage – more complex than ape communication, but lacking elements of modern language. There are at least three elements of language only present in hominins:

First is fine control over our vocal tracts. Other apes are likely born with a more limited repertoire of vocalizations. The difference comes down to how our brains are wired: Humans have direct connections between the neurons controlling our voice box and the motor cortex, the region of our brain responsible for voluntary movements. Brain scans show these connections are lacking in other primates.

Next is our tendency to communicate for the sake of communicating. To encapsulate this, biologist Fitch used the German word Mitteilungsbedürfnis, "the drive to share thoughts." Whereas chimps use a finite set of calls and gestures to convey the essentials – food, sex and danger – humans talk to bond and exchange ideas, and strive to ensure we're understood. Most researchers attribute this difference to an idea called "theory of mind," the understanding that others have thoughts. Chimps demonstrate more limited theory of mind, whereas humans know that other humans think things – and we're constantly using language to uncover and influence those thoughts.

The last difference is hierarchical syntax. Phrases and sentences have nested structure and these provide meaning beyond the simple sequence of words. For instance, take the sentence: "Chad, who was out to lunch with Tony, was late to the meeting." Hierarchical syntax processing allows us to correctly interpret that Chad was late to the meeting, even though "Tony" is closer to the verb "was late." I think many scientists who are not well educated in computer science either don't understand how important this is or they appreciate it for the wrong reasons.

Some believe precise vocal control and learning was the first language trait to emerge in hominins – and not for speaking, but for singing. This idea of musical protolanguage comes from Darwin himself and has been modified over the years by different researchers. During this hypothetical singing stage of human evolution, our ancestors' survival and/or reproductive success would have depended on serenading, in the context of maintaining social bonds, attracting mates or soothing infants.

An alternate view envisions protolanguage characterized by gesture and pantomime. In this case, syntax and social communication would have preceded vocal prowess. The strength of the gestural hypothesis is that our closest relatives, chimpanzees, exhibit more controlled and variable gestures (over 70 and counting) than calls (4 types and more hard-to-distinguish subtypes). The weakness of this view is that it's unclear why or how language became so speech-dominant.

Discussions concerning the origins of language and theory-of-mind thinking eventually get around to the issue of inner speech, and from there inevitably slide into consciousness. The Neurobiology of Language [225] was published in 2015 and is pretty comprehensive and up to date. There are dozens of references to "inner speech" to be found in the many chapters. I also think highly of Terrence Deacon's The Symbolic Species: The Co-evolution of Language and the Brain [118] (PDF) ... search for the word "protolanguage" as a starter²². Charles Fernyhough [161, 11, 10] is another reliable source.

Inner speech, Fernyhough writes in The Voices Within: The History and Science of How We Talk to Ourselves [160], isn't bound by many of the conventions of verbal speech. For one, we can produce it much faster when we don't have to go at the pace required to use tongues and lips and voice boxes. One researcher whom the book cites clocks inner speech at an average pace of 4,000 words per minute–10 times faster than verbal speech. And it's often more condensed–we don't have to use full sentences to talk to ourselves, because we know what we mean.

I have read, thought and written a lot about consciousness. If you want to know what I think, you can read some of my correspondence with Daniel Dennett, David Chalmers, Adam Marblestone and Greg Wayne in the 2020 class discussion listing. The best account of my recent thinking can be had by reading the entries for January 13, January 19, January 22 and January 25 in that order. I'd be happy to discuss, but only if you have carefully read these entries first; the issues are complicated and my account seems to elicit a good deal of personal angst that can be summarized as "I'm not that simple!", which generally stems from one or another of the many versions of dualism (SOURCE).

P.S. If you are interested in digging deeper, I can provide access to the encyclopedic Neurobiology of Language.

May 11, 2020

%%% Mon May 11 14:55:13 PDT 2020

There is a famous paper [309] written by four neuroscientists and published in 1959 that is, to this day, considered essential reading in many introductory neuroscience classes. The last two authors, Warren McCulloch and Walter Pitts, you may know for their simple threshold model of a neuron which formed the basis for the first artificial neural network architectures [349] and was the inspiration for John von Neumann's Probabilistic logics and the synthesis of reliable organisms from unreliable components [506].
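For those who have not seen it, the threshold model is simple enough to state in a few lines. The following is a minimal sketch, not a reconstruction of the original formalism: it uses a negative weight for inhibition, a common textbook simplification, and the function names are illustrative.

```python
def threshold_unit(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def logical_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return threshold_unit([a, b], [1, 1], 2)

def logical_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return threshold_unit([a, b], [1, 1], 1)

def logical_not(a):
    # An inhibitory (negative) weight keeps the unit silent when the input is active.
    return threshold_unit([a], [-1], 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

Because such units can implement AND, OR and NOT, networks of them can compute any Boolean function, which is part of what made the model attractive as a starting point for reasoning about reliable computation from unreliable parts.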

The first two authors, Jerome Lettvin and Humberto Maturana, called themselves cognitive scientists – to contrast their approach to studying the mind and its processes with B.F. Skinner's behaviorism – but today they would be characterized as neurophysiologists, and they are famous in their own right for having carried out some of the first experiments involving neural recordings of awake, behaving animals.

The paper is famous as much for the way in which the authors approached the problem as it is for the recording technology and experimental protocols that they invented. The authors were very careful to set aside their preconceptions of how vision operated within the broader taxonomic family of vertebrates and take an ethological stance, working backward from observed behavior as a generator of hypotheses and letting the data speak for itself.

The paper PDF is a model of conciseness and clarity, and well worth the half-hour or so it takes to read.

@article{Lettvinetal59,
author = {Lettvin, J. Y. and Maturana, H. R. and McCulloch, W. S. and Pitts, W. H.},
title = {What the Frog's Eye Tells the Frog's Brain},
journal = {Proceedings of the Institute of Radio Engineers},
volume = 47,
year = 1959,
pages = {1940-1951},
abstract = {A frog hunts on land by vision. He escapes enemies mainly by seeing them. His eyes do not move, as do ours, to follow prey, attend suspicious events, or search for things of interest. If his body changes its position with respect to gravity or the whole visual world is rotated about him, then he shows compensatory eye movements. These movements enter his hunting and evading habits only, e.g., as he sits on a rocking lily pad. Thus his eyes are actively stabilized. He has no fovea, or region of greatest acuity in vision, upon which he must center a part of the image. He has only a single visual system, retina to colliculus, not a double one such as ours where the retina sends fibers not only to colliculus but to the lateral geniculate body which relays to cerebral cortex. Thus, we chose to work on the frog because of the uniformity of his retina, the normal lack of eye and head movements except for those which stabilize the retinal image, and the relative simplicity of the connection of his eye to his brain.
The frog does not seem to see or, at any rate, is not concerned with the detail of stationary parts of the world around him. He will starve to death surrounded by food if it is not moving. His choice of food is determined only by size and movement. He will leap to capture any object the size of an insect or worm, providing it moves like one. He can be fooled easily not only by a bit of dangled meat but by any moving small object. His sex life is conducted by sound and touch. His choice of paths in escaping enemies does not seem to be governed by anything more devious than leaping to where it is darker. Since he is equally at home in water and on land, why should it matter where he lights after jumping or what particular direction he takes? He does remember a moving thing providing it stays within his field of vision and he is not distracted.}
}

May 9, 2020

%%% Sat May 9 03:35:53 PDT 2020

In the early 1980's, a research scientist at Xerox PARC by the name of Kurt VanLehn set out to build a mathematics tutoring program [501]. He began by studying how children learn relatively simple procedures for performing mathematical operations such as how to add and subtract multi-digit numbers.

His first efforts were dedicated to identifying the kinds of errors children made in learning how to perform such basic calculations and developing strategies for helping them to get over the misconceptions that were causing them to make such errors.

Kurt worked on these problems for many years and never did get a reasonably robust AI system for tutoring children in math. Of course, this was the 1980s and the technology du jour was dominated by rule-based – so called "expert" – systems and no one seriously thought about developing systems that learn from data.

The point is that it is really hard to figure out what's going on in children's heads when they learn math and that our adult intuitions often lead us astray – not the least, by giving us a false impression that the problem we are trying to solve is simple enough that it will lead to quick solutions²⁴.

In their 2020 paper [24] submitted to the annual Cognitive Science Society meeting, Jay McClelland and his colleagues examine how developmentally-staged curriculum-learning protocols, language skills and culturally induced context impact how quickly children learn various skills such as mathematics and computer programming.

The take-home lesson here is that if we are to build systems that learn such skills in such a way that they are integrated with a wide range of related behaviors, then we would be well served to recapitulate early human development and train our systems to acquire language grounded in the conceptual, environmental and cultural idioms that most children have acquired by the age of five.

Rather than base our design of systems that learn complex skills like computer programming on our intuitions about how humans behave or even activity traces of humans learning and exercising such skills, I suggest that we instead attempt to identify the machinery that enables us to quickly acquire the sort of representations that even our long-extinct hominid forebears would have had no trouble learning.

These representations would include relational models, simple graphs in the form of entity nodes and relational edges, and the dynamics of moving objects governed by environmentally relevant forces, such as gravity, friction and pressure gradients. To be most useful, these representations should be grounded by direct experience of these relations, networks and dynamics.

Note that "grounding" need only depend on a suitably rich model of dynamics to provide the basis for quickly learning a wide range of other dynamical systems. In the programmer's apprentice application, the apprentice's embodiment as an integrated development environment (IDE) providing the interface necessary to engage with the world of program specifications, source code and running processes, supplies just such a rich dynamical system.

Furthermore, in the case of uniquely-human skills like computer programming, the representations should be integrally linked to human language, allowing the richly expressive network of connections provided by even a juvenile's mastery of language to link the rigid, spare grammar of most modern programming languages to the web of meaning and metaphor afforded by our everyday lexicon and relatively loose adherence to any rigidly enforced syntactic constraints²⁵.

I believe that all of these representations and the means to acquire them quickly by way of direct experience can be learned in a largely unsupervised manner using a small collection of architectural motifs and basic principles for error-guided learning, and in the following I include a short list of papers that present a first cut at providing the necessary circuit motifs and learning principles.

What if you treated programs as a particularly well-structured dialect of human language and exploited this for automated code synthesis by leveraging existing tools for natural language processing? The following two papers provide examples of this approach – these examples are a bit dated and so you might want to run a search to find more recent examples.

Drissi et al [1] provide an interesting demonstration of how to translate between two different programming languages using a simple trick to bias the decoder output of an encoder-decoder pair to only make assignments that obey the rules of grammar of the target language.
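The general idea of biasing a decoder toward grammatical output can be sketched as masking: at each step, tokens the target grammar does not allow are removed from consideration before the best-scoring token is chosen. This is a minimal illustration and not Drissi et al's implementation; the toy grammar, the token names and the stand-in scoring function are all hypothetical.

```python
import math

# Hypothetical toy grammar for expressions of the form x (+ x)*:
# for each previous token, the set of tokens allowed to follow it.
ALLOWED_NEXT = {
    "<start>": {"x"},
    "x": {"+", "<end>"},
    "+": {"x"},
}

def masked_argmax(scores, allowed):
    """Pick the highest-scoring token among those the grammar allows."""
    best_token, best_score = None, -math.inf
    for token, score in scores.items():
        if token in allowed and score > best_score:
            best_token, best_score = token, score
    return best_token

def constrained_decode(score_fn, max_len=10):
    """Greedy decoding where ungrammatical tokens are masked out at every step."""
    prev, output = "<start>", []
    for _ in range(max_len):
        token = masked_argmax(score_fn(prev, output), ALLOWED_NEXT[prev])
        if token is None or token == "<end>":
            break
        output.append(token)
        prev = token
    return output

def toy_scores(prev, output):
    # Stand-in for decoder logits: prefers "+" even where it is ungrammatical,
    # and only favors ending once three tokens have been produced.
    end_score = 5.0 if len(output) >= 3 else -5.0
    return {"+": 2.0, "x": 1.0, "<end>": end_score}

print(constrained_decode(toy_scores))
```

Without the mask, the stand-in decoder would open with "+", which is not a valid expression; with it, every prefix produced is one the grammar accepts.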

Desai et al [135] demonstrate the ability to map natural language specifications to short programs or scripts in a domain specific language (DSL) such as is used to write macros to automate routine text-manipulation operations in a text editor.

The previous two examples primarily rely upon syntactic features of the input and output representations. In contrast, Wang et al [520] use LSTMs to generate both syntactic and semantic features of programs, where the former is accomplished by analyzing programs as abstract syntax trees and the latter by analyzing program traces of running programs.
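To give a feel for the raw material behind the two kinds of features, here is a minimal sketch using only Python's standard library: `ast` for a syntactic view and `sys.settrace` for a trace of variable values during execution. It is not the feature extraction used in the cited work, the learned encoders are omitted, and the example program and function names are hypothetical.

```python
import ast
import sys

SOURCE = """
def total(n):
    s = 0
    for i in range(n):
        s += i
    return s
"""

def syntactic_features(source):
    """Node-type names from a walk of the abstract syntax tree."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def semantic_features(source, func_name, *args):
    """Run func_name and record its local variables at each executed line."""
    namespace = {}
    exec(compile(source, "<sketch>", "exec"), namespace)
    trace = []

    def tracer(frame, event, arg):
        # Only follow frames belonging to the function of interest.
        if frame.f_code.co_name != func_name:
            return None
        if event == "line":
            trace.append(dict(frame.f_locals))
        return tracer

    sys.settrace(tracer)
    try:
        result = namespace[func_name](*args)
    finally:
        sys.settrace(None)
    return result, trace

features = syntactic_features(SOURCE)
result, trace = semantic_features(SOURCE, "total", 3)
print(features[:5])
print(result, trace[-1])
```

Two programs with very different syntax trees can produce identical traces, and vice versa, which is one reason to feed a model both views.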

All three of the previous examples make use of recurrent neural networks, e.g., LSTMs and GRUs, as the building blocks for more complicated circuits, in order to reason about the syntax and semantics of programs, and they convert programs to their equivalent abstract syntax trees (ASTs) to obtain a canonical format that serves to finesse parsing and demonstrate generality.

ASTs are trees, and hence directed acyclic graphs (DAGs), whereas ASTs augmented with back edges to represent loops and recursive procedure calls introduce cycles and enable a more expressive class of models. In the last few years, there has been more attention paid to representing and reasoning about arbitrary graphs as a general facility applicable to a wide range of concrete and abstract tasks [50, 436, 214].
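The difference between the plain tree and the augmented graph is easy to see in a small sketch. The code below builds parent-to-child edges from Python's `ast` module and adds one back edge per loop, from the last statement of the loop body to the loop header. The edge labels and this particular choice of back edge are illustrative assumptions, not a standard encoding.

```python
import ast

def ast_graph(source):
    """Return (nodes, edges): child edges from the AST plus back edges for loops."""
    nodes, edges = [], []

    def visit(node):
        index = len(nodes)
        nodes.append(type(node).__name__)
        child_index = {}
        for child in ast.iter_child_nodes(node):
            child_index[id(child)] = visit(child)
            edges.append((index, child_index[id(child)], "child"))
        if isinstance(node, (ast.For, ast.While)):
            # Control returns from the end of the body to the loop header.
            edges.append((child_index[id(node.body[-1])], index, "loop_back"))
        return index

    visit(ast.parse(source))
    return nodes, edges

def has_cycle(num_nodes, edges):
    """Detect a directed cycle with a three-state depth-first search."""
    successors = {i: [] for i in range(num_nodes)}
    for src, dst, _ in edges:
        successors[src].append(dst)
    state = [0] * num_nodes  # 0 = unvisited, 1 = on the current path, 2 = finished

    def dfs(v):
        state[v] = 1
        for w in successors[v]:
            if state[w] == 1 or (state[w] == 0 and dfs(w)):
                return True
        state[v] = 2
        return False

    return any(state[v] == 0 and dfs(v) for v in range(num_nodes))

nodes, edges = ast_graph("while n > 0:\n    n = n - 1\n")
tree_edges = [e for e in edges if e[2] == "child"]
print(has_cycle(len(nodes), tree_edges), has_cycle(len(nodes), edges))
```

A graph network operating on the augmented edges can pass information around the loop, which a model restricted to the tree edges cannot do in a single sweep.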

For example, Hu et al [234] demonstrate language-conditioned graph networks for relational reasoning, Sanchez et al [435] introduce graph networks as learnable physics engines for inference and control, and Johnson [252] demonstrates how to learn graphical state transitions from NLP descriptions²⁶.

If you want to take on an ambitious project that will stretch your creative capacity and introduce you to thinking about complete end-to-end systems that capture important aspects of human cognition while at the same time not having to embrace the full complexity of human experience, problem domains like the programmer's apprentice are worth seriously considering.

May 7, 2020

%%% Thu May  7 04:13:55 PDT 2020

In [519], Rishabh Singh uses LSTMs to generate syntactic and semantic features of programs, the former by analyzing programs as abstract syntax trees and the latter by analyzing traces of running programs. In class I mentioned the usefulness of graph nets and relational models that are also typically implemented using banks of LSTMs.

In earlier lectures, we read about stripe-shaped structures in the basal ganglia and frontal cortex working memory, and Randy O'Reilly made the case that these structures, anatomically organized as a collection of densely interconnected cortical columns, provide functionality well characterized by multi-cell LSTMs.

The cortex is often characterized as a homogeneous sheet of tissue tiled by cortical columns of the sort championed by Vernon Mountcastle and his colleagues [232, 365, 364]. The debate still rages about whether these columns implement the same function, with the word "same" having as many different interpretations as there are neuroscientists writing related papers [123].

Setting aside the biological controversy, in building an architecture patterned after or inspired by the human brain, starting with LSTMs as a basic building block for implementing diverse cortical functions makes a lot of sense. While Gary Marcus and I were preparing for his talk today, we talked about the rationale for such a move.

Gary, Adam Marblestone and I advocated a similar design strategy in our 2014 paper appearing in Science [337]. In our conversation yesterday, we talked about the role that homeobox genes play in guiding multipotent cells to differentiate taking into account their local environment in the developing organism.

This mechanism provides an explanation for how an initially homogeneous tiling of the cortex can self-organize into a collection of bespoke specialized circuits that have a superficially similar structure but implement different functions depending on their location and neighboring circuits within the growing cellular matrix.

This is what I was alluding to on Tuesday when I talked about circuits in cortex that combine columns having similar functional characteristics and local microarchitectures to build functionally diverse architectures. These architectures exploit local structure determined by large-scale properties, including the source and topography of incoming and outgoing connections and, in particular, the white matter tracts connecting distant cortical and subcortical regions.

You might want to ask Gary about how he envisions our using such insights in designing neural architectures, and don't forget to read Gary's arXiv [338] paper in preparing for his discussion with us this afternoon. You might also be interested in his 2004 book [336] entitled The Birth of the Mind: How a Tiny Number of Genes Creates The Complexities of Human Thought. Chapter 6, "Wiring the Brain", does a good job of explaining the underlying biology.

Miscellaneous Loose Ends: Here is a cleaned up version of the "take-home message" that I suggested at the very end of our discussion with Gary:

There are multiple types of intelligence, but there are some characteristics that are highly desirable in an agent that would have any chance of dealing with the sort of problems that are most pressing for humans today. Here are the three characteristics that I would look for in such a system:

Adaptable: able to recognize, anticipate and prioritize problems and devise flexible strategies to solve them;

Extensible: capable of creative extrapolation by generalizing prior knowledge and applying causal inference;

Efficient: discarding implausible options while efficiently identifying and exploring promising alternatives.

There exist AI systems that exhibit one or two of these characteristics to some degree – some using symbolic methods and others relying on distributed representations of one sort or another, but none that convincingly aspire to all three and successfully leverage both symbolic and connectionist technologies.

Moreover, there are systems that profess to have solved or to be poised to solve problems of pressing importance to the future of humanity, but that are inherently ill-equipped to do so, despite the exaggerated claims of those with a financial or intellectual stake in their widespread adoption.

May 5, 2020

%%% Tue May  5  3:12:15 PDT 2020

These are the rough working notes I compiled in preparing for my lecture this afternoon. Most of the figures are from Poeppel et al [402]. The one noteworthy exception is Figure 6, which describes a hybrid neural-and-symbolic model of language processing that addresses the relatively simple needs of the programmer's apprentice application while at the same time finessing many of the complex timing and vocal articulation problems relevant to spoken language understanding and production.

Sounds signify events that we associate with entities, e.g., animals, storms, and processes, e.g., birds calling, rain falling. Sounds vary over time, involve multiple sources that may or may not be related and provide information critical to our survival. Humans do not rely on sound as much as some other animals, but it serves us well if our vision is occluded or lacking adequate illumination.

Sound also provides a convenient medium for exchanging information in the form of spoken language. It is not the only medium for language, but it may provide some insight into how language is processed in the brain. While spoken language is a late arrival evolutionarily speaking, the brain structures that make it possible were very likely in place and serving other purposes long before humans started using anything that we would recognize as language. Our understanding of human language has considerably accelerated in the last two decades. In particular, the very idea of language being localized in Broca's and Wernicke's areas now seems naïve – see Figures 3 and 4.

Figure 3: Regarding advances in methods, the remarkable improvements in recording techniques and data analyses have yielded new maps of the functional anatomy of language. Speaking spatially, local regions, processing streams, the two hemispheres, and distributed global networks are now implicated in language function in unanticipated ways. For example, the canonical language region, Broca’s area, is now known, based on innovative cytoarchitectural and immuno-cytochemical data, to be composed of a number of subregions plausibly implicating a much greater number of different functions than previously assumed, supporting both language and non-language processing as illustrated in the inset graphic on the right. Note that the above graphics and the excerpt used as caption are from Poeppel et al [402].

Most mammals are able to parse the complex structure of their aural experience, assigning meaning to different aspects of their experience including the identity and behavior of the sources. It seems plausible that the ability to parse the ambient stream of sounds enables hearing-equipped animals to create and remember wordless narratives that summarize particularly relevant aural experience.

We needn't have restricted ourselves to sound; together our senses allow us to construct rich representations of the world that we use to guide immediate action and subsequently recall in order to deal with novel situations. As a means of communication, pointing and gesticulating serve as an effective medium for coordinating activity, as in the case of hunting or foraging, and the effectiveness of sign language for the hearing impaired suggests the medium is not a limiting factor.

What does seem critical, however, is the ability to efficiently, reliably and unambiguously convey information from one person to another using a suitably expressive language. Expressivity in formal language theory is a syntactic property of languages characterized by Chomsky's hierarchy: the class of unrestricted (Type-0) grammars is equivalent to Turing machines, capable of computing any computable function, while the class of context-free (Type-2) grammars is equivalent to the strictly weaker pushdown automata. The syntax of most human languages is, to a good approximation, context free.
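As a small illustration of what a step up the hierarchy buys, here is a sketch of a recognizer for the language aⁿbⁿ, a standard example of a context-free language that no finite-state machine can recognize because matching the counts requires unbounded memory. The function name `accepts_anbn` is hypothetical; the explicit stack plays the role of the pushdown store.

```python
def accepts_anbn(s: str) -> bool:
    """Recognize { a^n b^n : n >= 0 } using an explicit stack, as a pushdown automaton would."""
    stack = []
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:          # an 'a' after any 'b' can never be matched
                return False
            stack.append("A")   # remember one unmatched 'a'
        elif ch == "b":
            seen_b = True
            if not stack:       # more b's than a's
                return False
            stack.pop()         # match this 'b' against a stored 'a'
        else:
            return False        # symbol outside the alphabet
    return not stack            # accept only if every 'a' was matched
```

The same stack discipline is what supports nested, recursive structure in phrase-structure grammars, where an opened constituent must be closed before the one enclosing it.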

Strings in a language can be constructed from an arbitrary alphabet, where a {0,1} alphabet is technically sufficient but not necessarily efficient in light of the devices – eyes, ears, vocal tract, larynx – used to generate strings and interpret their meaning. Assuming we could measure the requisite characteristics of the communication channel, Shannon's noisy-channel coding theorem bounds the rate at which information can be transmitted with arbitrarily small error, but that's not particularly relevant here.

Figure 4: Moreover, there is emerging consensus that regions are organized into at least two (and likely more) processing streams, dorsal and ventral streams, that underpin different processing subroutines, for example, mediating aspects of lexical recognition and lexical combination versus aspects of sensorimotor transformations for production, respectively. Note that the above graphics and the excerpt used as caption are from Poeppel et al [402].

Reliable, unambiguous information transfer relies on being able to identify word boundaries and segment audio into discrete tokens as much as it relies on the speaker producing meaningful utterances. Most utterances are technically redundant but practically spare to compensate for varying channel characteristics, including the attention and education of the listener. A vocabulary of 20K words and the recursive, combinatorial characteristics of human language make for a powerful tool.

Given our remarkable facility for using this tool, how might natural selection have set the stage for our seemingly rapid adaptation and adoption of language? Or perhaps the question should be, given systems already in place for communication prior to the appearance of modern language, how might language have evolved under selection pressure from climate change and the advances of other hominids vying for the same sources of food and shelter in an increasingly crowded ecological niche?

Dogs have brains that are much like ours. They can construct rich representations of the world around them and encode their experience in memory so as to take advantage of what they learned in the past and apply it to their present circumstances. They can communicate with one another and with humans in order to convey that they are hungry or would like to play or go outdoors. They can accurately interpret the signs produced by other animals and humans and exploit what they learn from the signs to interact and collaborate.

Dogs can't, however, learn to communicate using any language that we would consider rich enough to conceive of and express the thoughts of a two-year-old human child. There are three key functional characteristics of young children's brains that explain their advanced cognitive and communicative abilities. The three characteristics involve short-term memory capacity²⁷, neural signal transmission and auditory production anatomy²⁸. In the terminology of computer architecture, we have a larger cache, a faster processor bus and better audio output hardware, where the first two translate into a deeper stack for nested subroutine calls and support for recursion and the third enables higher channel capacity.

So why do dogs bark and humans speak? In short, our incrementally better computational capability and the rapidly increasing complexity of our social behavior enabled us to develop a communication faculty capable of representing and sharing our experience, coordinating our collective behavior and analyzing the results of our observations to pursue scientific inquiry, create art and produce technology.

One of the goals of comparative, developmental and evolutionary neuroscience is to understand how humans acquire language and why other animals with similar brains fall short or fail altogether in acquiring a language of comparable complexity and utility [119, 118, 318, 316]. In the following paragraphs, we attempt to explain this conundrum by providing a computational model of how language is acquired and appealing to the differences between the brains of humans and other mammals listed above.

There are a number of theories that attempt to explain human language in terms of anatomy and physiology at the cellular and systems level – see Figure 5 for insight into the neural architecture of language processing in the human brain. In the following, I reference and draw upon several papers by Friedemann Pulvermüller and his colleagues. I am not in agreement with all aspects of his theory, nor am I particularly well qualified to evaluate the integrity of all the experimental evidence he provides in support of his theory. That said, I find Pulvermüller's comprehensive view intriguing and many of the technical details useful as inspiration for computational models.

If you are interested in better understanding Pulvermüller's theory, I recommend that you begin by listening to his overview for a general audience in this interview²⁹. His most recent paper³⁰ provides the most comprehensive and up-to-date account of his theory, an earlier paper focuses on what he believes are the fundamental mechanisms that give rise to language³¹, and this current opinion piece³² provides a more succinct overview aimed at a general technical audience.

Figure 5: The model differs both from standard state feedback control and psycholinguistic models in two main respects. First, phonological processing is distributed over two hierarchically organized levels, implicating a higher-level cortical auditory–motor circuit and a lower-level somatosensory–motor circuit, which roughly map onto syllabic and phonemic levels of analysis, respectively. Second, a true efferent copy³³ signal is not a component of the model. Instead, the function served by an efferent copy is integrated into the motor planning process – contrast with Fuster’s sensory-motor hierarchy. Note that the above graphics and the excerpt used as caption are from Poeppel et al [402].

In describing his theory, Pulvermüller makes use of the concept of cell assemblies corresponding to a distributed neural representation of behaviorally relevant objects, concepts or constellations of such. Originally defined by Donald Hebb [220] and subsequently refined by Valentino Braitenberg [70], cell assemblies are the answer to Braitenberg's quandary, "[s]ituations, things, partners or opponents are, in a way, the terms of behavior. It is legitimate, therefore, to ask what phenomena correspond to them in the internal activity of the brain, or, in other words: how are the meaningful chunks of experience 'represented' in the brain?" Palm et al [393] review seven observations that Braitenberg made in his 1978 paper, in light of subsequent developments and functional consequences:

Most of the input to cortical neurons comes from other cortical neurons, i.e. the cortex works mostly on its own output.
There is an informational imbalance between excitation and inhibition since excitatory connections greatly outnumber inhibitory ones.
The global cortico-cortical feedback is positive, since cortico-cortical fibers are excitatory.
There is a great dispersion of information in the cortical connectivity system, i.e., cortical representations are widely distributed.
Cortical connectivity can be described as the cooperation of an A- and a B-system, where the A-system distributes information globally by cortico-cortical fibers and the B-system distributes information locally by intracortical axonal arborizations.
There is a predominance of inborn specificity in the wiring at the macro-level (e.g. in the connection of areas) and of randomness (or individuality) at the micro-level.
Synaptic plasticity in the cortex is predominantly of the Hebbian type, i.e. excitatory synapses are strengthened by coincident firing of the pre- and postsynaptic neurons.
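The last observation, Hebbian strengthening of synapses between coincidentally active neurons, is what allows a cell assembly to complete itself from a partial cue. The following is a minimal sketch in the style of a Hopfield network, offered as an illustration under stated assumptions rather than as Braitenberg's or Pulvermüller's model: units take values +1/-1, the two stored patterns are hand-picked to be orthogonal, and the names `hebbian_weights` and `recall` are hypothetical.

```python
import numpy as np

def hebbian_weights(patterns: np.ndarray) -> np.ndarray:
    """Hebbian rule: sum of outer products of the +/-1 patterns, no self-connections."""
    w = patterns.T @ patterns        # units that are co-active get a positive weight
    np.fill_diagonal(w, 0)
    return w

def recall(w: np.ndarray, cue: np.ndarray, steps: int = 5) -> np.ndarray:
    """Synchronously update every unit from its weighted input for a few steps."""
    state = cue.copy()
    for _ in range(steps):
        state = np.where(w @ state >= 0, 1, -1)
    return state

# Two toy "assemblies" over eight units (assumed, orthogonal by construction).
patterns = np.array([
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
])
w = hebbian_weights(patterns)

cue = patterns[0].copy()
cue[0] = -cue[0]                     # corrupt one unit of the first assembly
completed = recall(w, cue)           # the network settles back onto the stored pattern
```

Starting from the corrupted cue, the recurrent excitation among units that fired together during storage restores the missing unit, which is the sense in which an assembly can be ignited by a fragment of its usual input.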

These observations led Braitenberg to propose his concept of cell assemblies as a means for creating, selecting and maintaining concepts that aggregate diverse sensory and motor representations. In [409], Pulvermüller proposes four mechanisms involving how meaning is represented in the neocortex at the network level. In the following, we invoke related concepts cast in terms of artificial neural networks in an attempt to provide a model for how meaning might be inferred and made use of in a computer model of language. Pulvermüller's four mechanisms are as follows:

referential semantics, which establishes links between symbols and the objects and actions they are used to speak about;
combinatorial semantics, which enables the learning of symbolic meaning from context;
emotional-affective semantics, which establishes links between signs and internal states of the body;
abstraction mechanisms for generalizing over a range of instances of semantic meaning.

Just as babies begin to induce the retinal mapping of primary visual cortex very early and possibly even while still in the womb, so too they learn the prosody, phonology and, in particular, the ability to recognize word boundaries in their native tongue within the first few weeks of hearing their mother's voice. Perhaps not so surprisingly, the same convolutional architectures that work for segmenting images also work for parsing natural language [462, 463].

It is worth keeping in mind that motion and perception – and in particular the comprehension and production of language – are all continuous, learning is distributed and likewise continuous and hierarchy is critical both for representing complex phenomena and bringing a larger temporal context to bear in establishing meaning.

Figure 6: In the panel directly above, the thought vector (signified as a cloud) on the left represents the embedding of a speech act generated in frontal cortex, whereas the one on the right is the reconstructed thought vector of a speech act generated by either the system (the apprentice in the case of the programmer's apprentice) shown as a dashed red line (inner speech) or by an external source shown as a dashed green line. The external source could be the system speaking out loud or the voice of the programmer. The RNN on the left is a decoder and the one on the right is an encoder.

The thick orange and purple arrows correspond to the reciprocal connections between anterior motor cortex and posterior sensory cortex; here we imagine the former as a variant of predictive coding and the latter serving as a conduit for an efferent copy signal distinguishing between inner speech and external speech, either self generated (spoken out loud) or originating from another party.

Word learning is finessed by using a lexicon implemented as a key-value DNC with values corresponding to one-hot vectors, plus a fixed (10-20 thousand word) vocabulary and a pre-trained word- and n-gram-to-vector language model. The apprentice in the programmer's apprentice application could be trained using examples generated by a curriculum-learning protocol modeled along the lines of the early developmental stages involved in human language learning. See the section on Bootstrapped Language in the introductory course notes.

In NLP systems implemented as deep neural networks, words are represented as one-hots that essentially map word symbols to word vectors, thereby providing a fixed, unambiguous, immutable signature for each word. The size of the vocabulary for native speakers averages in the range from 10K to 20K words, and we could easily add a similar number of common n-grams and still not tax the system. See Figure 6 for a hybrid neural-and-symbolic model of language processing that addresses the needs of the programmer's apprentice application while at the same time finessing many of the complex timing and articulation problems relevant to human language facility.
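The following sketch shows, under toy assumptions, why a one-hot code amounts to a fixed, unambiguous signature: multiplying a one-hot vector by an embedding table simply selects one row of the table. The four-word vocabulary, the three-dimensional random table standing in for a pre-trained model, and the helper `one_hot` are all made up for illustration.

```python
import numpy as np

# Toy fixed vocabulary (assumed); a real lexicon would hold 10-20 thousand entries.
vocab = ["the", "apprentice", "writes", "code"]
word_to_index = {w: i for i, w in enumerate(vocab)}   # word symbol -> slot in the lexicon

def one_hot(word: str) -> np.ndarray:
    """Return the one-hot signature of a word in the fixed vocabulary."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

# Random table standing in for pre-trained word vectors (one row per word).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 3))

vec_via_matmul = one_hot("code") @ embedding          # one-hot times table ...
vec_via_lookup = embedding[word_to_index["code"]]     # ... equals a plain row lookup
```

Because the two results coincide, implementations typically store only the integer index and perform the lookup directly, which is also what makes adding a comparable number of n-gram entries inexpensive.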

In Pulvermüller's theory, a word vector corresponds to a pattern of activity in the primary cortex that arises when we speak the word, and since this pattern varies we rely on a mapping that is trained to produce the same activity in an area – in the general vicinity of Broca's area – representing the word lexicon, so that every time we speak a word the same pattern is activated. In a computer model we can easily satisfy this requirement by using some variant of an NTM or DNC.

Generally speaking, the meaning of a word is defined by the company it keeps, that is to say, the contexts in which it participates. In computer language models this corresponds to a word-vector embedding space organized so that words with similar meanings are near one another according to some distance metric defined on the underlying vector space.
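Here is a minimal sketch of the "company it keeps" idea using raw co-occurrence counts rather than a learned embedding. The six-sentence corpus is invented for the example, sentence-level co-occurrence is an assumed choice of context, and the helpers `cooccurrence_vectors` and `cosine` are hypothetical names.

```python
from collections import Counter, defaultdict
import math

# Tiny invented corpus; each sentence is one context.
corpus = [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the cat ate the food",
    "the dog ate the food",
    "the programmer wrote the code",
    "the programmer fixed the code",
]

def cooccurrence_vectors(sentences):
    """Represent each word by counts of the other words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            for j, c in enumerate(words):
                if i != j:
                    vectors[w][c] += 1
    return vectors

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)          # Counter yields 0 for missing keys
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_programmer = cosine(vectors["cat"], vectors["programmer"])
```

On this corpus "cat" comes out closer to "dog" than to "programmer" because the first two share verbs and objects, whereas all three share only the function word "the". Learned embeddings can be thought of as compressed, smoothed versions of such count vectors.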

When you speak or hear a word spoken, activity patterns arise in sensory cortex, starting in the primary auditory cortex – including Wernicke's area – in the temporal lobe and propagating to secondary and multimodal association areas. From there they are differentially propagated to working memory in the frontal cortex by way of the thalamus and striatum under the control of the basal ganglia and circuits in prefrontal cortex.

The arcuate fasciculus is a rostro-caudal bundle of axons that connects Broca's area and Wernicke's area in the brain. It is a white matter (myelinated) fiber tract connecting (caudal) posterior temporal cortex and (rostral) inferior frontal cortex – collectively referred to as the perisylvian circuit [532] – that provides reciprocal connections between language-related cortical areas and facilitates grounding through the perception-action cycle.

Words are signs grounded in our experience of the world³⁵. Perception provides the basis for that grounding and the arcuate fasciculus establishes the mapping from signs to their corresponding referents. Grammar and syntax are encoded in the motor cortex just as is any other complex coordination of muscles and body parts and learned using the same combination of cortical and subcortical circuits.

As mentioned above, Pulvermüller, in channeling the ideas of Braitenberg and Hebb, makes a case for cell assemblies providing the structure for how we create representations using language to connect disparate patterns of activity from across the sensory cortex, linked by the arcuate fasciculus to corresponding representations in motor and frontal cortex. The hippocampal complex is also implicated in building complex representations of episodic memories derived from diverse patterns of activity throughout the cortex and subcortical regions relating to affect [76, 380, 411].

Understanding how these two seemingly different mechanisms work together to support the complementary learning model that we've discussed in class and elsewhere in these notes remains a puzzle [294, 383, 343, 347]. Clearly the hippocampal system in mammals doesn't require language, but it does provide a means for consolidating the many threads of our perceived experience and, in humans, that experience often includes what was said, read or thought – the last by way of our ongoing inner dialogue which also has its evolutionary roots in the motor system.

April 19, 2020

%%% Sun Apr 19  3:36:42 PDT 2020

In their paper in Nature Communications [352] the authors mention their companion paper appearing in ICLR [353] where they introduce the neural probabilistic motor primitives (NPMP) architecture, and demonstrate that it can be used to rapidly acquire motor skills by first learning a component network that encodes a set of general-purpose motor primitives and then using those primitives as a basis for quickly learning new skills:

We focus on the problem of learning a single motor module that can flexibly express a range of behaviors for the control of high-dimensional physically simulated humanoids. To do this, we propose a motor architecture that has the general structure of an inverse model with a latent-variable bottleneck. We show that it is possible to train this model entirely offline to compress thousands of expert policies and learn a motor primitive embedding space. The trained neural probabilistic motor primitive system can perform one-shot imitation of whole-body humanoid behaviors, robustly mimicking unseen trajectories. Additionally, we demonstrate that it is also straightforward to train controllers to reuse the learned motor primitive space to solve tasks, and the resulting movements are relatively naturalistic [my emphasis]. [Page 1 [353]]

The italicized phrase in the above excerpt might give one pause or at least a concern about the generality of the method, but I suggest that you ask Josh to clarify when he visits on Thursday. It might be interesting as an exercise to consider what an analogous set of primitives for more abstract inference pertaining to, say, mathematics or programming might look like. Alternatively, imagine how one might employ a similar strategy to build on the motor primitives to create layers of abstraction that couple new control strategies to new perceptual strategies.

The focus here on the motor system rather than the cognitive system is purposeful. It stems in part from the evolutionary perspective that, despite the conceit that we are first and foremost cognitive animals – where the distinction is that cognition abstracts from a primarily physiological view to emphasize the psychological characteristics of the primate brain – not enough time has elapsed since language and complex social behavior emerged in our evolutionary history for natural selection to have significantly altered our brains to support such high-level cognitive function [471, 34, 453, 102].

The basic architecture is reproduced here in Figure 7 and is worth comparing with Box 1 in Merel et al [352] reproduced earlier in Figure 12. The authors provide a succinct description of the architecture as having "the general structure of an inverse model, which produces actions based on the current state and a target. The architecture and training scheme are designed for the embedding space to reflect short-term motor behavior. As we demonstrate below, this allows for the selective execution of particular behavioral modes and also admits one-shot imitation via the trajectory encoder." [Page 5 [353]]

Figure 7: From Figure 2 in Merel et al [353] Neural probabilistic motor primitive architecture for one-shot skill deployment. The yellow-highlighted components are available for offline, supervised training. Once the full model has been learned, the decoder can be reused as a policy in other settings.

The architecture shown in Figure 12 reminds me of a somewhat simpler version of MERLIN [523]. One difference is that, as far as I can tell, while the skills to be learned do require some history of previous actions to produce a smooth trajectory through the manifold of states³⁶ visited by the experts providing training data, the dynamics of the restricted environment in which these actions take place are well approximated as a fully observable Markov process of order K for some finite K > 1. Recall that MERLIN was particularly noteworthy for its use of a variational autoencoder as a memory-based predictor compressing a high-dimensional perceptual space into a low-dimensional state representation in order to deal with its partially observable environment.
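One practical consequence of an order-K Markov assumption is that a controller needs no learned memory: the last K observations, taken together, form a state that is Markov in the ordinary first-order sense. The sketch below illustrates this with a fixed-length buffer; the class name `HistoryState` and the use of scalar observations are assumptions for illustration and are not drawn from the NPMP or MERLIN implementations.

```python
from collections import deque

class HistoryState:
    """Keep the last k observations; their concatenation serves as the Markov state."""

    def __init__(self, k: int, initial_obs):
        # Pad with the initial observation so the state has length k from the start.
        self.buffer = deque([initial_obs] * k, maxlen=k)

    def update(self, obs):
        """Append the newest observation, dropping the oldest, and return the state."""
        self.buffer.append(obs)
        return tuple(self.buffer)

history = HistoryState(k=3, initial_obs=0)
states = [history.update(obs) for obs in (1, 2, 3, 4)]
```

In a partially observable setting no finite window of this kind is guaranteed to suffice, which is where a learned memory such as MERLIN's becomes necessary.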

The authors make it clear that they use feedback, for example they write that "[in] our setting, motor primitives refer to the reusable embedding space learned from many related behaviors and the associated context-modulable policy capable of generating sensory-feedback-stabilized motor behavior when executed in an environment", but it would appear that feedback is in the form of extrinsic rewards employed in the context of learning "feedback policies where the feedback terms are computed from the value function."

In terms of training, they describe two approaches, the first is referred to as behavioral cloning from noisy rollouts and the second as linear-feedback policy cloning (LFPC). The first involves "gathering a number of noisy trajectories from the expert (either under a stochastic policy or with noise injection) while logging the optimal/mean action of the expert instead of the noisy action actually executed — a version of [which] is equivalent to the DART algorithm of Laskey et al [301]. [They] then perform behavioral cloning³⁷ from that data." The second (LFPC) is somewhat more complicated mathematically and doesn't improve performance, rather it achieves performance comparable with behavioral cloning but with fewer rollouts.
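The first approach can be sketched on a toy problem. This is an illustration of the idea of executing a noisy action while logging the clean one, under assumptions of my own choosing (a one-dimensional linear system, a linear expert with gain `EXPERT_GAIN`, and a least-squares student); it is not the authors' implementation and all names are hypothetical.

```python
import numpy as np

EXPERT_GAIN = 0.5  # hypothetical expert policy: a = -EXPERT_GAIN * x

def expert_mean_action(x: float) -> float:
    """The expert's noise-free (mean) action for state x."""
    return -EXPERT_GAIN * x

def noisy_rollout(rng, x0=1.0, steps=50, noise_std=0.3):
    """Roll out the expert with injected action noise, logging the clean action."""
    states, actions = [], []
    x = x0
    for _ in range(steps):
        clean = expert_mean_action(x)
        states.append(x)
        actions.append(clean)                        # log the optimal/mean action ...
        x = x + clean + rng.normal(0.0, noise_std)   # ... but execute the noisy one
    return np.array(states), np.array(actions)

rng = np.random.default_rng(0)
data = [noisy_rollout(rng) for _ in range(20)]
X = np.concatenate([s for s, _ in data])[:, None]    # states, shape (N, 1)
A = np.concatenate([a for _, a in data])             # clean expert actions, shape (N,)

# Behavioral cloning: supervised regression from visited states to logged actions.
coef, *_ = np.linalg.lstsq(X, A, rcond=None)
learned_gain = -coef[0]
```

The injected noise pushes the rollouts into states the noise-free expert would rarely visit, while the labels remain the expert's clean responses, so the student sees how to recover from perturbations. In this linear toy case the regression recovers the expert's gain up to numerical precision.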

Here's an analogy that may or may not be helpful or entirely accurate, but that I found useful to contemplate: The trick is to perform an analog of shotgun gene sequencing in which the agent ingests thousands of trajectories from experts and then combines relevant pieces to generate sequences that solve novel sensory-motor tasks. Assuming that the trajectories span the requirements of a new task, they can serve as a basis in the form of a set of combinatorial primitives, and thereby facilitate transfer learning. As long as older trajectories are preserved in the process of accommodating a new task that isn't fully covered, it would appear that such a strategy would avoid the problem of catastrophic forgetting / interference. This algorithmic perspective is similar to the model of the cerebellum developed by Albus and Marr.

Let's recap some of the principles that we've learned so far that seem to be essential properties of successful, well adapted biological organisms and the environments they inhabit:

An array of sensors and effectors that allow the agent to interact with its environment.
An environment rich enough to provide a wide range of easily and safely sampled dynamics.
A method of learning about its environment primarily relying upon unsupervised learning.
A method of exploring that enables quickly generalizing and adapting to new environments.

What are the properties of the network architectures and the representations they employ that enable efficient learning and inference across multiple timescales appropriate to the agent’s environment:

Ground inference with respect to agent’s environment to support planning and acting³⁸.
Acquire the information necessary to perform in a partially observable environment³⁹.
Recognize when errors occur, identify their cause, correct the problem and verify.
Reduce sample complexity by designing networks to serve needs of downstream networks⁴⁰.

Embodied cognition would seem to require maintaining a dynamic representation of the body that serves both planning and acting. Our brains and bodies have been shaped by natural selection to be exquisitely aligned with the physical and social world we inhabit. Some claim that without such a grounding our thoughts would be literally meaningless. I expect that is not true, but I agree that without exposure to physical and social systems of some complexity we would have little basis for science, culture, commerce or art as we understand those terms, and that given our strong anthropomorphic bias it is difficult for us to believe otherwise.

The rat barrel cortex is organized as a topographical map of facial whiskers that preserves locality and facilitates inference that relies on their coordinated function, just as primary visual cortex preserves retinotopy. The topography changes quickly to compensate for damage to the whiskers in the same way that somatosensory cortex undergoes changes to reallocate available areal coverage of neurons, thereby optimizing for the increased importance of the hands and fingers for a concert pianist.

In a related structural-functional accommodation, the inputs to functional areas in the frontal cortex are prearranged to align with areas in the parietal cortex that encode relevant representations and are often filtered through a latent embedding-space bottleneck layer that matches the degree of specificity (abstraction) to that of the relevant functional requirements.

In thinking about how we build representations that account for relevant information, the inductive bias that drives inference throughout the sensory cortex needs to account for (infer) relationships between entities at all levels of sensory abstraction, from primary unimodal to associative multimodal layers, and make them available in a standardized format to downstream components.

April 18, 2020

%%% Sat Apr 18  3:46:58 PDT 2020

In our meeting with Randy O'Reilly in January to discuss his PBWM (Prefrontal cortex Basal ganglia and Working Memory) model of learning in the prefrontal cortex and basal ganglia, Randy mentioned a recent paper co-authored with Seth Herd and several other colleagues [224]. He did so in response to a question posed by Nikhil Bhattasali relating to sample complexity and reinforcement learning in the basal ganglia. In the following excerpt from the resulting transcript, text enclosed in [brackets] is my paraphrase / rational reconstruction of Randy's words:

If you have in your cortex really nice differential representations between things that are good – like positive reinforcing food, and you have other really distinct representations in your cortex for things that are bad that you want to avoid, then the gating learning in the striatum becomes really simple. And so, I think that is clearly what has happened in people is that we have this huge cortex that is developing all the right kinds of abstractions and encodings of the world in order to make the basal ganglia's job dead simple.
[We start with the sort of ecologically and evolutionarily tuned reinforcement signal humans are programmed to respond to, and then try to move as much of the job off into the cortex as we can and still use that core basal ganglia competence and its resulting Go / NoGo gating signal to drive action selection]. It does give you a [different perspective / emphasis] on what decisions you're making and, in some sense, I think having it really tied to an [ecologically driven dopamine signal] is probably a good idea – you could think all kinds of interesting abstract thoughts, but at the end of the day if it doesn't pay the bills, so to speak, or it's going to get you into trouble, then you want the basal ganglia to step in and say, "Well, maybe that wasn't such a good idea".

And so one of our most recent papers that we just put on arXiv and that Seth Herd is the lead author on, takes this idea and shows that you can really put a lot more of the work into the cortex. We call it "the proposer" in this model, so we say that the cortex is developing these proposals for action and that's really where all the hard work is happening – in the cerebral cleverness – and then the basal ganglia just provides this final "yeah, that's okay / that's good or not". So it makes the learning job for the basal ganglia much easier. You essentially have it reduce the dimensionality of what the basal ganglia needs to learn, and therefore the sample efficiency becomes less of a problem.

Here is an excerpt from Herd et al [224] in which the authors summarize the consensus role of the basal ganglia and then motivate an alternative view in which the basal ganglia and cortex work together:

There is now a broad consensus about the critical role of the basal ganglia in helping to select actions. By learning over time from dopamine neuromodulation, it selects those actions which maximize reward and minimize negative outcomes. Computationally, this system is well-described by the Actor-Critic framework of Sutton and Barto [49] (1981). In this framework, the basal ganglia action-selection system is the "actor", and the set of brain areas that produce phasic changes in dopamine are termed the "critic". This system evaluates selected actions and improves choices over time, while also improving the critic itself.
More recently, there has been considerable interest in a higher-level, model-based form of action selection, thought to depend on prefrontal cortical areas. This has been contrasted to the model-free nature of the learned associations in the basal ganglia system. Many people tend to think of this distinction in terms of separate, and perhaps competing, systems that enact goal-directed versus habitual behavior, where the basal ganglia is the habit system, and the prefrontal cortex is goal-directed.

However, it is increasingly clear that the basal ganglia plays a critical role in higher-level cognitive function, and in goal-directed behavior. We present an alternative model in which the basal ganglia and cortex are not separate, and do not compete, but interact to produce a spectrum of computations. These range between fully model-free (or prediction-free), to fully model-based (or prediction-based). We use the terms prediction-based and prediction-free, to avoid a variety of accumulated terminological baggage.
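To make the actor-critic vocabulary in the excerpt concrete, here is a minimal tabular sketch on a bandit problem. It is not the Leabra model from Herd et al; the function names, learning rates and reward probabilities are illustrative assumptions. The "critic" keeps a running estimate of expected reward, the difference between actual and expected reward plays the part of the phasic dopamine signal, and the "actor" uses that difference to adjust its action preferences.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(reward_probs, steps=2000, actor_lr=0.1,
                       critic_lr=0.1, seed=0):
    """Tabular actor-critic on a one-state bandit (hypothetical toy setup).

    reward_probs[a] is the assumed probability that action a pays reward 1.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(reward_probs)  # actor: one preference per action
    value = 0.0                        # critic: expected reward estimate
    for _ in range(steps):
        policy = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=policy)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        delta = reward - value         # reward prediction error
        value += critic_lr * delta     # critic improves its own estimate
        for a in range(len(prefs)):
            # Policy-gradient update: raise the chosen action's preference
            # when delta is positive, lower it when delta is negative.
            grad = (1.0 if a == action else 0.0) - policy[a]
            prefs[a] += actor_lr * delta * grad
    return prefs, value
```

Calling `train_actor_critic([0.8, 0.2])` and passing the returned preferences to `softmax` shows how strongly the actor has come to favor the first action. Note that the same prediction error trains both the actor and the critic, which is the sense in which the system improves "choices over time, while also improving the critic itself."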

Figure 10, borrowed from Herd et al [224], describes the proposer-predictor-actor-critic model. In addition to a simulation model coded in Leabra, the authors present evidence supporting their model and examine in some detail the central role played by the anterior cingulate cortex⁴¹ (ACC) and orbitofrontal cortex⁴². Those of you interested in a final project relating to the neural basis of complex decision making in primates, and humans in particular, might consider comparing the Herd et al model with that of Merel et al, which we will look at next – compare Figure 10 with Figure 11.

Figure 10: From Figure 1 in Herd et al [224] Structure of Proposer-Predictor-Actor-Critic architecture across frontal cortex and subcortical areas. We depict two parallel circuits with a hierarchical relationship. The top is a broad functional diagram, emphasizing the serially iterative and hierarchical nature of our proposed decision-making process. The bottom expands those functions, and identifies the brain areas that perform each function.

The results of their simulations are worth reading as they not only provide insights into the proposed model, but also connect to the literature, providing additional targeted reading if you want to explore the ideas in the paper more deeply. As an example, here is a description of how the model exhibits shifts in its reliance on different action selection strategies:

Our model shows some transition from slower, more prediction-based to faster, more prediction-free computations. This transition has long been a topic of interest in psychology under the terms controlled versus automatic behavior [McClelland et al [348], Schneider and Shiffrin [443]]. It has also been addressed in terms of a shift between goal-directed to habitual behavior [Tolman [487]] however, that distinction does not map directly to prediction-based vs. prediction-free computations, as illustrated by the fact that our prediction-free Proposer component takes the current Goal into account and learns to produce candidate Plans that accomplish that goal. Because it selects a candidate Plan based on its weights, learned from the relationship between Situation, Goal, and previously performed Plans, the Proposer model component is performing a prediction-free computation.

Next we turn to the paper by Merel, Botvinick and Wayne that Josh Merel will be talking about on Thursday. According to Merel et al, the three principles of the optimal feedback control (OFC) framework – not to be confused with the orbitofrontal cortex mentioned above – as illustrated in Figure 11 are as follows: (1) Motor control is generated to optimize an objective function. (2) Deviations from an intended trajectory that arise should be corrected by leveraging sensory feedback in a task-optimal fashion. Together, these first two principles imply that online correction of movements should prioritize task-relevant dimensions (a "minimum intervention principle"). (3) Internal models help compensate for sensory delays and assist with state estimation.

Table 1 [352] provides a summary of the key principles of hierarchical control. Worth emphasizing is the point that factoring perceptual information into component categories according to the informational requirements of downstream planning and control subsystems, and then routing suitably compressed representations of the relevant components directly to each subsystem, should require less training data and reduce sample complexity overall. From our understanding of the relevant neurobiology, the striatum / thalamus in the basal ganglia (BG) serves to partition information into distinct categories as illustrated in Figure 10.2 in [386]⁴³.

Figure 11: From Figure 1 in Merel et al [352] (a) Interaction cycle between an embodied control system and a physical environment to generate behavior. (b) A flat controller with no architectural segregation of different inputs. (c) A basic, brain-inspired two-stage hierarchy: a lower-level motor controller directly generates motor commands to the effectors based on input from proprioceptive sensors and modulatory input from a higher-level controller, which is responsive to additional signals, including vision and task context signals.

By appropriately factoring sensory information at different levels of procedural abstraction, behavior becomes more invariant with respect to, e.g., movement, abstract goals, intentions. The authors also point out that by introducing (kinematic) constraints baked into the hardware or autonomy programmed into lower-level controllers, subsequent layers can essentially provide modulatory control, thereby simplifying higher-level control and making it more robust, e.g., see Rod Brooks' subsumption architecture [74, 75, 125, 455]. The authors mention that "This partial autonomy is related to information factorization insofar as a lower-level system should have adequate information to be partially autonomous."
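The two-stage hierarchy and the idea of partial autonomy can be illustrated with a toy one-joint example. This is only a sketch under strong assumptions: the proportional control law, the gain, the replanning interval and all function names are made up for illustration and are not taken from Merel et al. The higher-level controller sees the "vision" and task-context signals and emits a setpoint only occasionally; the lower-level controller sees only proprioception (the joint position) plus that modulatory setpoint, and runs on every step.

```python
def high_level_controller(target_seen, task_active):
    """Sees vision and task context; emits a modulatory setpoint."""
    return target_seen if task_active else 0.0


def low_level_controller(joint_pos, setpoint, gain=0.5):
    """Sees only proprioception and the setpoint; emits a motor command."""
    return gain * (setpoint - joint_pos)


def simulate(target, steps=20, replan_every=5):
    """Run the two-stage loop on a one-dimensional 'joint'.

    The high-level controller intervenes every `replan_every` steps; in
    between, the low-level controller acts on its own (partial autonomy).
    """
    joint_pos = 0.0
    setpoint = 0.0
    for t in range(steps):
        if t % replan_every == 0:
            setpoint = high_level_controller(target, task_active=True)
        # Trivial dynamics: the command directly displaces the joint.
        joint_pos += low_level_controller(joint_pos, setpoint)
    return joint_pos
```

The information factorization is visible in the signatures: `low_level_controller` never receives the target or the task context, only what it needs to track a setpoint.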

Figure 12: From Box 1 in Merel et al [352] To promote a diversity of behavior as well as the exploration and discovery of new ones, the neural probabilistic motor primitives (NPMP) architecture has been introduced [353], which expresses a set of robust, human-like motor behaviors as a basis for further task learning.

Having summarized some of the most important features that support and benefit hierarchical organization, the authors consider the nervous systems of higher vertebrates and how they control movement through a distributed set of anatomically and functionally hierarchical structures (see Figure 13 for a graphical overview). The study of spinal cord injuries in human patients and of decorticate and decerebrate cats provides evidence for autonomy and hierarchy, e.g., "The spine, even in spinalized preparations, is responsive to somatic sensory feedback and can act semi-autonomously from the brain to coordinate multiple joints over time."

In the case of decorticate cats, the cortex is not required for a significant amount of the behavior generated by the cat. "This reflects partial autonomy as well as amortized control, insofar, as stereotyped movements are 'habitual'. In particular, we also know that decorticate animals with intact basal ganglia can initiate goal-directed locomotor behavior", leading to the observation that "The basal ganglia then appropriately modulates the brainstem locomotor nuclei, which in turn modulate spinal central pattern generators." The authors relate this to the notion of options in hierarchical reinforcement learning [475, 397, 390].

Figure 13: From Box 2 in Merel et al [352] The diagram depicts an abstraction of the hierarchical anatomy of the mammalian nervous system. The scheme is, insofar as possible, a consensus view of previous hierarchical interpretations with the intent of serving as an uncontroversial foundation.

Remarkably and at the same time profoundly disturbing for those of us who love cats and think them intelligent in ways that are seldom acknowledged in literature, the authors relate that "Where decerebration removes the entire cerebrum, decortication refers to the removal of cortex without damage to thalamus or basal ganglia, so essentially all subcortical structures are intact, modulo atrophy owing to removal of significant sources of inputs. Cats and dogs with their entire cortex removed often generate superficially normal behavior after a recovery period." and then proceed to describe the pathetic behavior of the feline subjects post operation.

Here I directly quote – from Page 7 in [352] – a list of observations pertaining to the behavior of decorticate animals that employ the same terminology (e.g., contexts, mediation, modulation and motor programs) and mention the same role of inputs from cortex and thalamus that came up in our discussion of PBWM (Prefrontal Cortex Basal Ganglia Working Memory) with Randy O'Reilly in January:

The cortex is not required for a significant amount of the behavior generated by the cat. This reflects partial autonomy as well as amortized control, insofar, as stereotyped movements are "habitual". In particular, we also know that decorticate animals with intact basal ganglia can initiate goal-directed locomotor behavior. The basal ganglia then appropriately modulates the brainstem locomotor nuclei, which in turn modulate spinal CPGs.
Subcortical structures can select among different modes of coordinated behavior, possibly reflecting short-term temporal abstraction and multi-joint coordination. Specifically, it has been proposed that motor program selection is performed by the basal ganglia, normally informed by inputs from cortex and thalamus. This is also consistent with recent work correlating neural activity in striatum with moment-to-moment sequencing of movement "syllables".
While sensory-guided insight is impaired upon removal of cortex, residual sensory information that has been processed through non-cortical pathways remains available, reflecting appropriate information factorization.
Certain forms of learning still occur, obviously mediated via non-cortical circuitry. It is believed that learning of motor coordination is mediated by cerebellum and learning related to action selection is mediated by basal ganglia. This is consistent with the broader literature on the basal ganglia being involved in the learning and deployment of context-triggered habitual actions, with this circuitry thought to implement something like reinforcement learning.

Miscellaneous Loose Ends: When I asked Josh Merel for a sample of related work that students might read in advance of his lecture, he recommended a paper [102] by Paul Cisek entitled "Resynthesizing behavior through phylogenetic refinement" (PDF). I had previously added a note in the class discussion list mentioning a recent interview with Cisek if you are interested.

April 17, 2020

%%% Fri Apr 17 03:50:11 PDT 2020

Rafael is in charge of the presentation today. We will be going through Greg Wayne's 2018 class lecture in which he describes the MERLIN architecture that addresses several of the problems we discussed on Tuesday and will be important for the Thursday lecture next week, April 23, during which Josh Merel from DeepMind will be joining us to talk about his recent paper in Nature Communications with Greg and Matt Botvinick.

Today's talk will also set the stage for the following week, April 30, when we'll look more carefully into predictive coding and its applications in biologically inspired architectures patterned after the human brain. In particular, we will review David Cox's work on an implementation of predictive coding called PredNet. Until recently, David was a professor at Harvard with joint appointments in both computer science and neuroscience. He worked closely with Jeff Lichtman at Harvard on an ambitious project funded by IARPA that he described in his 2017 class lecture. Currently he is the Director of the MIT-IBM Watson AI Lab in Cambridge, Massachusetts.

I told David that Rafael and I will solicit questions from you regarding his work on the visual system in rats for the IARPA MICrONS (Machine Intelligence from Cortical Networks) project as well as the predictive coding paper and his earlier research when he was working with Jim DiCarlo's lab at MIT. David also worked with and co-authored papers with Dan Yamins on optimizing artificial neural network architectures for computer vision. Rafael's lecture will be available here later this morning along with links to David's 2017 talk slides and video.

April 15, 2020

%%% Wed Apr 15 10:42:30 PDT 2020

In the first lecture we introduced the idea of a cognitive architecture and mentioned as an example the work of Josh Merel, Matt Botvinick and Greg Wayne [352]. The architecture described in their paper was a consensus architecture meaning that the various pieces and the way in which they interacted with one another reflect our current knowledge as inferred from recent literature.

I should say that some scientists think it is premature to posit such all-encompassing abstract models of human cognition. Historically cognitive neuroscientists are not so shy and recently even card-carrying systems neuroscientists have participated in this collective endeavor motivated by the fact that our knowledge of many of the pieces has matured and that understanding them in more detail will necessarily have to rely partially on how they interact with one another in order to manifest complex behavior.

When you see diagrams such as the one shown in the Merel et al paper, the major circuits are generally represented as boxes that presumably encapsulate particular functional components, and the connections between those components are shown as lines with arrows indicating the direction in which information flows. In some cases, as in the diagram shown here, the lines have arrows at both ends and the connections are said to be reciprocal.

Scientists who work on, for example, the hippocampus or basal ganglia are generally not comfortable with either the boxes that circumscribe these regions of the brain or the implication that they are connected to one another via such simple conduits. While most researchers who hypothesize cognitive architectures would for the most part agree, they would also maintain that such simplifications serve both to simplify the architectures and provide a basis for discussion and debate.

The claim that functions are localized in the brain has been contentious throughout the history of neuroscience. It was brought to prominence in the early 1800s through the work of the neuroanatomist Franz Joseph Gall and popularized by the pseudoscience of phrenology. The more extreme versions of the theory have largely been rejected by neuroscientists as we have acquired the tools necessary to put such hypotheses to the test. If you're interested in the history of neuroscience, the work by Karl Lashley on cerebral localization figures prominently in the debate.

The researchers involved in the PDP group made it clear that the connectionist models they were developing were distributed and operated in parallel and largely asynchronously. It is not contentious however to suggest that certain computations are carried out by circuits whose constituent cells are wholly confined to specific regions of the brain, but we are not talking about there being a VISION or EMOTION module in the brain.

So when you see a box diagram like the ones shown here in a paper written by systems and cognitive neuroscientists like Matt Botvinick and Greg Wayne, you can be pretty sure they would tell you that the box boundaries are not intended to indicate biological localization and that the functions they describe really only make sense in the larger context of the other components with which they share information and processing.

As engineers building artificial neural networks, we have an advantage over natural selection in that we can re-imagine – or, as we like to say in software engineering, re-factor – neural functionality to facilitate both understanding and reusability. As an example, we will quickly review the design of one well-known artificial neural-network component that figures in the remainder of this lecture.

The long short-term memory model of Hochreiter and Schmidhuber is a recurrent neural network that is composed of multiple cells where each cell takes input from and produces output for other cells as well as the input and output for the LSTM network itself. We take it for granted that all inputs and outputs correspond to network layers that model vector-space representations and that are connected to other layers by the composition of a weight matrix and activation function.

Each cell also receives and produces signals for other cells that enable them to store local state information, modify that state, retain that state within the cell indefinitely, and control when they want to update that state using information from other cells. Individual cells can also control when they want to zero out their stored state, or propagate state information to neighboring cells.

A given LSTM unit might have any number of cells that work together to produce complex representations that they construct iteratively using their recurrent connections and that are trained using backpropagation through time. As we will see, this capability provides an approximation of how such computations are performed in the basal ganglia and how the recurrent computations can repeatedly sample from the sensory cortex to iteratively construct a context for performing actions that optimize expected future rewards.
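The gating behavior described above can be written down in a few lines. What follows is a minimal sketch of a single forward step in plain Python, using the now-common LSTM variant that includes a forget gate; the parameter names (`W_i`, `b_i` and so on) and the list-of-rows weight layout are assumptions made for illustration, and training by backpropagation through time is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, inputs, bias):
    """Compute W @ inputs + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One forward step of an LSTM layer (no training code).

    x: input vector; h_prev, c_prev: previous output and cell state.
    params: dict of weight matrices W_* and biases b_* (hypothetical names).
    """
    z = list(x) + list(h_prev)  # each gate sees the input and the last output
    i_gate = [sigmoid(v) for v in affine(params["W_i"], z, params["b_i"])]
    f_gate = [sigmoid(v) for v in affine(params["W_f"], z, params["b_f"])]
    o_gate = [sigmoid(v) for v in affine(params["W_o"], z, params["b_o"])]
    cand = [math.tanh(v) for v in affine(params["W_g"], z, params["b_g"])]
    # Forget gate decides how much stored state to keep (near 0 zeroes it
    # out); input gate decides how much of the new candidate to write.
    c = [f * cp + i * g for f, cp, i, g in zip(f_gate, c_prev, i_gate, cand)]
    # Output gate decides how much of the stored state to expose to others.
    h = [o * math.tanh(ci) for o, ci in zip(o_gate, c)]
    return h, c
```

With the forget-gate bias set strongly positive and the input-gate bias strongly negative, the cell state passes through a step essentially unchanged, which is the sense in which a cell can retain its state indefinitely.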

The brain performs these computations using a number of specialized neurons that produce different patterns of spiking activity – referred to as bursting – and that employ diverse methods of signaling one another – as in the case of the cells that comprise the dopaminergic pathways leading from the substantia nigra⁴⁴.

That these biological computations can be modeled by artificial neural networks that perform rather simple algebraic computations is not all that surprising given that even two-layer perceptrons are universal function approximators. Are we missing something fundamental by approximating such cells in this fashion? Do we need to accurately reproduce the dynamics and spiking behavior of individual neurons?

In the previous two lectures we argued that if our objective is to build mesoscale models that bridge the conceptual gap between the biological basis for neural computation and the emergent properties of human behavior, then we have little choice other than to establish an analog of the digital abstraction for explaining how electronic devices consisting of analog components can give rise to reliable digital computations of astonishing complexity.

The architecture shown in Box 2 of the Merel et al Nature Communications paper [352] identifies the putative functional components associated with specific neural circuitry where the location of the corresponding boxes against the stylized depiction of the human brain provides an approximate characterization of where the circuits are located in the brain and how they are connected with one another.

In addition to my qualifications regarding the meaning and extent of those connections, it should also be pointed out that the lines between boxes do not even account for all the major white matter tracts that we discussed in the second lecture. Having attempted to draw such architectural models of the human brain, I can forgive the authors for their sparse characterization of the relevant neural connections since doing them justice would result in a confusing dense tangle of overlapping lines.

Randy O'Reilly in his 2018 talk in class explains the role of the basal ganglia and its connection to the LSTM model of Hochreiter and Schmidhuber [227]. As background, the following Khan-Academy-style abbreviated explanation of basal ganglia function focuses on the role of the direct and indirect pathways. This short piece efficiently summarizes the basic neurobiology included in most introductory textbooks including O'Reilly et al [386]:

[ Excerpt from Brain Explained video running from 00:01:00 to 00:05:00 ]

The following excerpt features Randy expanding on what's known and what's not known, while summarizing his contributions [292, 423, 86] to our current understanding of the role of the prefrontal cortex in influencing activity in the basal ganglia:

[ Excerpt from Randy's 2018 video running from 00:34:30 to 00:55:30 ]

Now we consider an expanded view that accounts for input from the prefrontal cortex, often referred to as the meta-direct pathway, that illustrates both the progression of our scientific understanding and a clear example of the difficulty of localizing function in most any model of biological computation and the human brain in particular.

Below we have included three excerpts from Matt Botvinick's 2018 talk in which he describes the results in the Nature Neuroscience article entitled "Prefrontal Cortex as a Meta-reinforcement Learning System" that he co-authored with Demis Hassabis and a team of other scientists and engineers at DeepMind [517]. The full video is over an hour and a half long. If you don't have time to view the whole video, these three excerpts provide a reasonable synopsis and together take only about 24 minutes to watch.

After explaining Brenden Lake's critique of artificial reinforcement learning [299], in Excerpt #1 Matt describes Harlow's 1949 paper describing a set of psychology experiments relating to the "Learning to Learn" problem also referred to in the literature as the "Meta Learning" problem. Then, in responding to Brenden's challenge, he follows up in Excerpt #2 with a roughly parallel description of a set of bandit-problem experiments that demonstrate artificial reinforcement learning systems can also learn to learn. Finally, he makes the connection in Excerpt #3 between the way the artificial system learns and the role of the prefrontal cortex and meta-direct pathway in the primate brain:

[ Excerpt #1 from Matt's 2018 video – Learning to learn monkey experiments 00:24:30 to 00:28:40 ]

[ Excerpt #2 from Matt's 2018 video – LSTM Learner solves bandit problems 00:33:30 to 00:45:30 ]

[ Excerpt #3 from Matt's 2018 video – Making connections to primate brains 00:51:30 to 01:11:00 ]

Summarizing what we learned from Randy and Matt, the basal ganglia with executive input from the prefrontal cortex providing general direction and motivation, iteratively shapes what is often referred to as a context for acting that constitutes a summary of the current state assembled from activity originating throughout the cortex as well as subcortical areas involved in lower-order emotional processing of input from sensory systems⁴⁵.

These patterns of activity throughout the brain are registered in specific locations called stripes in the striatum and frontal cortex. Shaping corresponds to iteratively adjusting the weights of the information stored in these designated locations, emphasizing some by exciting their associated neurons and suppressing others by inhibiting their respective neurons.

The basis for making these adjustments comes from many sources but ultimately is determined by dopaminergic neurons in the substantia nigra. Dopamine can be used to either suppress or elevate activity in the striatum and the Go signal that Randy describes in his talk is used to indicate that the current context represented in the striatum should be forwarded to locations in the motor cortex – you can read more about the role of stripes in action selection in the transcript of an interesting discussion we had with Randy back in January.
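As a rough illustration of stripe-wise gating, the sketch below updates a stripe of working memory only when its Go activity outweighs its NoGo activity. Everything here is a simplifying assumption rather than a description of PBWM itself: the function name, the threshold, and in particular the way a scalar dopamine level scales Go up and NoGo down are a common modeling shorthand and are not taken from Randy's talk.

```python
def gate_stripes(working_memory, candidates, go, nogo,
                 dopamine=0.0, threshold=0.0):
    """Return updated working memory and the indices of stripes that opened.

    working_memory, candidates: one entry per stripe.
    go, nogo: per-stripe activity levels (hypothetical values in [0, 1]).
    dopamine: assumed scalar in [-1, 1]; positive values favor Go.
    """
    updated = list(working_memory)
    opened = []
    for k, (g, n) in enumerate(zip(go, nogo)):
        # Assumed modulation: dopamine boosts Go and damps NoGo.
        net = (1.0 + dopamine) * g - (1.0 - dopamine) * n
        if net > threshold:
            updated[k] = candidates[k]  # gate opens: new content is stored
            opened.append(k)
        # Otherwise the stripe keeps maintaining its old content.
    return updated, opened
```

With `go=[0.9, 0.2]` and `nogo=[0.1, 0.6]` only the first stripe is updated at baseline dopamine, while raising `dopamine` to 0.8 tips the second stripe over the threshold as well.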

This process of formulating and propagating suggestions for activity guided by a measure of expected future reward happens continually and can also be used to switch between tasks or utilize different circuits in the frontal cortex. As we'll see later, the basal ganglia in conjunction with the cerebellum also serves to coordinate and control the timing of not only motor activities but also computations relating to abstract reasoning.

Reiterating our earlier statement about the distributed and asynchronous computations that involve systems distributed over multiple cortical and subcortical areas, the timing, pacing and coordination of both physical and abstract activity occurs in multiple regions with different mechanisms that have to self-monitor and coordinate their own behavior. For example, when you are executing a plan, the prefrontal cortex has to formulate the plan, perhaps performing simulations to ensure that it will work in the current context, while at the same time overseeing the activity of ongoing subroutines and managing its own time.

To see how the activity in the basal ganglia corresponds to the computations performed by deep networks for reinforcement learning, you can think of the iterative shaping of a context for activity in terms of evaluating different options by searching for state-action pairs that have high value, a process generally referred to as value iteration. In actor-critic systems, a network is trained to maximize expected future rewards by using a deep network to perform the search, essentially performing a table lookup.
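For readers who have not seen value iteration written out, here is a minimal tabular sketch on a four-state chain. The chain environment, the reward of 1 for reaching the goal, and the discount factor are all illustrative assumptions; a deep network would replace the explicit table in a realistic setting.

```python
def q_value(state, action, values, step, gamma):
    """Value of taking `action` in `state`, then following `values`."""
    next_state, reward, done = step(state, action)
    return reward + (0.0 if done else gamma * values[next_state])


def value_iteration(states, actions, step, gamma=0.9, tol=1e-9):
    """Sweep the state table until the values stop changing."""
    values = {s: 0.0 for s in states}
    while True:
        biggest_change = 0.0
        for s in states:
            best = max(q_value(s, a, values, step, gamma) for a in actions)
            biggest_change = max(biggest_change, abs(best - values[s]))
            values[s] = best
        if biggest_change < tol:
            break
    # The greedy policy picks the highest-valued state-action pair.
    policy = {s: max(actions,
                     key=lambda a: q_value(s, a, values, step, gamma))
              for s in states}
    return values, policy


GOAL = 3  # hypothetical toy environment: a chain of states 0..3


def chain_step(state, action):
    """Deterministic dynamics: returns (next_state, reward, done)."""
    if state == GOAL:
        return state, 0.0, True
    if action == "right":
        next_state = min(state + 1, GOAL)
    else:
        next_state = max(state - 1, 0)
    reached = next_state == GOAL
    return next_state, (1.0 if reached else 0.0), reached
```

Running `value_iteration(range(4), ["left", "right"], chain_step)` yields values that decay geometrically with distance from the goal, and a greedy policy that moves right from every non-goal state.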

Instead of shaping a context as prologue to selecting an action, most implementations of actor-critic networks simply pick the next action to perform. However, in hierarchical reinforcement learning, the state space for the underlying Markov decision process is partitioned into blocks where each block corresponds to a subspace of the global MDP state space.

In this way, the underlying MDP is represented as a collection of independent sub MDPs – one for each subspace. The subspaces can be thought of as specialized contexts for acting, and, for each of these specialized contexts, the agent learns a separate sub policy applicable only in the corresponding sub MDP's restricted subspace.

While mathematically elegant and theoretically efficient, for this to work the agent has to learn a partition of the global state space – that is to say a collection of nonoverlapping subspaces that span the state space of the original MDP, and then learn how to keep track of the subspace that it currently occupies — a task made all the more difficult by the fact that in most cases the relevant Markov process is partially observable [523].
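The bookkeeping involved can be sketched as follows, under the simplifying assumptions that the partition is given rather than learned and that the state is fully observable, which sidesteps exactly the hard parts noted above. The two-room layout, the block names and the sub-policies are hypothetical; each sub-policy here returns the next state directly, folding a deterministic transition into the policy for brevity.

```python
def is_partition(blocks, states):
    """True if the blocks are non-overlapping and together cover `states`."""
    covered = set()
    for block in blocks.values():
        if covered & block:
            return False  # two blocks share a state
        covered |= block
    return covered == set(states)


def block_of(state, blocks):
    """Identify which block (specialized context) the agent occupies."""
    for name, block in blocks.items():
        if state in block:
            return name
    raise KeyError(f"state {state!r} is not in any block")


def rollout(start, goal, blocks, sub_policies, max_steps=20):
    """Follow the sub-policy of whichever block the agent is currently in."""
    state, trace = start, []
    for _ in range(max_steps):
        if state == goal:
            break
        name = block_of(state, blocks)
        trace.append(name)
        state = sub_policies[name](state)
    return state, trace


# Hypothetical example: six states split into two "rooms".
STATES = range(6)
BLOCKS = {"room_a": {0, 1, 2}, "room_b": {3, 4, 5}}
SUB_POLICIES = {
    "room_a": lambda s: s + 1,          # head for the door into room_b
    "room_b": lambda s: min(s + 1, 5),  # head for the goal at state 5
}
```

A call such as `rollout(0, 5, BLOCKS, SUB_POLICIES)` returns the final state together with the sequence of blocks whose sub-policies were consulted along the way.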

Moreover, the whole reason for doing this is to expedite learning, and, if the subspaces are not selected carefully, much of the effort required in learning one sub policy will be repeated in learning other sub policies. This is the problem of transfer learning in which the object is to learn a collection of reusable sub policies corresponding to subroutines that can be shared across subspaces and composed to solve any of the sub MDPs. If you are interested, there are several related postings in the class discussion notes, for instance see the October 16th entry.

More complex approaches that exploit hierarchy of some sort essentially perform a variation on the meta-controller described in Matt's lecture and can be construed as Matt does as learning to learn. If time permits, later in class we will take a look at some of the many variants of this idea that address the problem of interference between different tasks and the consequences of what is called catastrophic forgetting as well as the issues concerning transfer learning and abstraction in the form of subroutines or hierarchies of plans with reusable parts and what sorts of biological and artificial networks might enable this sort of compositional learning.

To close this discussion, I'll give you a preview of Thursday's lecture in which we continue to trace the path or rather multiple paths that enable action selection. Once the basal ganglia converges on the Go signal, it forwards the context information in the striatum to the frontal cortex using the gating mechanism that Randy described and the reciprocal connections between the stripes located in the striatum and stripe-like features in the frontal cortex that constitute working memory for circuits in the frontal cortex.

Considering just motor actions, the information in the striatum is sent to locations in working memory specifically allocated to provide input to the circuits that comprise the motor cortex. These regions include the primary motor cortex which is directly adjacent to the somatosensory cortex, the two brain regions being anatomically separated from one another by the central sulcus, which is the dividing line between the posterior and the frontal cortex.

The somatosensory cortex is organized according to the topographical structure of the body, mapping body parts to peripheral neurons that register touch and pain, and is aligned with a similar structure organizing the layout of the motor cortex, mapping body parts to neurons that innervate the muscles and receive input from low-level circuits in the brain stem and spinal cord implementing the low-level subroutines that provide local innervation and coordination of muscle activity.

Located anterior to primary motor cortex there is a region called premotor cortex that prepares muscles for performing exact movements and located next to the premotor area and further from the central sulcus is the supplementary motor cortex. Together these three areas orchestrate the selection, coordination and timing of motor programs stored in their associated neural circuits.

Executing such motor programs involves sending commands to locations in the brain stem and spinal cord that initiate low-level circuits corresponding to sensorimotor loops that manage muscle activity. The commands issued by the primary motor cortex are sent along what are called the descending pathways on their way to the brain stem and from there to the spinal cord.

Command signals traversing the descending pathway may be intercepted by the cerebellum that interacts with the motor cortex to adjust the signals for the purpose of improving coordination and timing. The cerebellum working with the basal ganglia can also suggest changes in the motor circuits and by so doing program and debug the circuits responsible for generating motor commands.

Here is the list of discussion topics and related work that Rafael prepared for the second half of Tuesday's class meeting:

Slow versus fast reinforcement learning – i.e. the induced RL algorithm versus the general RL algorithm.
The model presented does not do real reinforcement learning, but rather task inference – this is suggested by other papers (including from DeepMind). It also fits with the context-generation interpretation.
Performance on in-distribution versus out-of-distribution tasks:
1. https://arxiv.org/abs/1910.10897
Model-based versus model-free learning – the point is made at the end of the lecture video that the proposed model follows a model-based learning strategy, however, I think that it is more like a filter, rather than a model:
1. https://arxiv.org/abs/1910.08348
Discuss more the evolutionary scale of meta-learning, i.e. how evolution can speed up slow RL to do fast-fast RL. Matt talks about that in his video, but rather briefly.

Miscellaneous Loose Ends: In attempting to understand how machines might learn to act and solve problems in class on Tuesday, we ran up against the perennial problem of how we represent procedural knowledge about our options for acting, our motivation for acting, and our knowledge of the world we live in that provides the basis for learning and planning in novel situations. Pursuant to tackling these problems of representation we are forced to consider the mechanisms whereby we acquire such knowledge.

To begin with, as Rafael pointed out yesterday, we don't have knowledge so much as we have beliefs, which is an important distinction because believing something — or acting as if something might be true — could be just the thing for learning something even if we are deluded in our belief. Similarly, we don't perceive the world so much as we have sensations from which we attempt to infer properties of the world. It's also important to keep in mind that information has value and gathering it takes work.

The term predictive coding has a long history and many different interpretations. The theory of predictive coding follows from the observation that in order to act in an uncertain and unforgiving environment, complex organisms naturally evolve the means to predict the future and that in order to do so one needs to construct models of the world that enable an agent to reason about the consequences of its actions and those of other agents.

In order to construct such models we need to explore the world around us in a manner that is both safe and efficient. In predictive coding we take advantage of our errors in making predictions in order to refine our predictive models and adjust our expectations accordingly. However, a problem arises when you consider that the amount of data necessary to train such a model would be prohibitive. Fortunately, predictive coding provides an elegant solution to this problem.

In the formulation due to Rao and Ballard [417, 416], an agent learns to predict what it senses whether the resulting sensations originate from visual, vestibular or any other sensory modality. This has the advantage that we have the incontrovertible ground truth of our own senses and so can compare what we sense with what we predict. Learning a predictive model is unsupervised unless you have a mind to anthropomorphize the natural world, e.g., "mother nature".
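
To make the error-driven loop concrete, here is a minimal single-layer linear sketch in the spirit of the Rao and Ballard formulation. It is not their hierarchical model: the function names, learning rates, step counts and toy data are all assumptions chosen for illustration. Hidden causes r are inferred by descending the squared prediction error, and the generative weights U are then nudged using the same error signal.

```python
import random

def predict(U, r):
    """Top-down prediction of the sensory input: x_hat = U r."""
    return [sum(U[i][j] * r[j] for j in range(len(r))) for i in range(len(U))]

def prediction_error(U, r, x):
    """Bottom-up residual: what was sensed minus what was predicted."""
    return [xi - pi for xi, pi in zip(x, predict(U, r))]

def infer(U, x, steps=50, lr=0.1):
    """Estimate hidden causes r for input x by gradient descent on the squared error."""
    r = [0.0] * len(U[0])
    for _ in range(steps):
        err = prediction_error(U, r, x)
        for j in range(len(r)):
            r[j] += lr * sum(U[i][j] * err[i] for i in range(len(U)))
    return r

def learn(data, n_latent=1, epochs=200, lr=0.05, seed=0):
    """Fit generative weights U using only the agent's own prediction errors."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.5, 0.5) for _ in range(n_latent)] for _ in range(len(data[0]))]
    for _ in range(epochs):
        for x in data:
            r = infer(U, x)
            err = prediction_error(U, r, x)
            for i in range(len(U)):
                for j in range(n_latent):
                    U[i][j] += lr * err[i] * r[j]  # Hebbian-like: error times cause
    return U

def mean_squared_error(U, data):
    """Average squared prediction error after inference on each input."""
    total = 0.0
    for x in data:
        err = prediction_error(U, infer(U, x), x)
        total += sum(e * e for e in err)
    return total / len(data)

# Toy "sensations" (hypothetical): 3-D inputs generated by one hidden cause s.
direction = (1.0, -0.5, 0.5)
data = [[s * d for d in direction] for s in (-1.0, -0.5, 0.5, 1.0)]
U = learn(data)
```

No labels appear anywhere: the only training signal is the difference between the input and its prediction, which is the sense in which learning here is unsupervised.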

A related model due to Horace Barlow [48, 47] neatly combines the basic idea of predictive coding with the notion of efficient coding. It being more a principle or hypothesis, there are many instantiations of the basic idea. My interpretation/application of the idea relates to the perception-action cycle developed by Joaquín Fuster as a model of how the sensory abstractions that we learn are optimized to serve the actions that we need to perform in order to survive [185, 183].

The overall process of acquiring these models is the topic of many disciplines including cybernetics, control theory and various branches of statistics and applied math. The use of deep networks has considerably expanded our options for implementing these models, and, in particular, the development and extension of variational Bayesian methods has provided powerful tools for building efficient instantiations of predictive coding [523, 91, 143].

I came across another discussion concerning different approaches to neuroscience. I'm not particularly interested in such debates except for helping me to avoid a particular sort of hidden bias that permeates most entrenched, established disciplines. The discussion is on the Brain Inspired podcast Episode #66 entitled "Paul Cisek: Forward Through Evolution" and here are some representative papers from Cisek that Brain Inspired host Paul Middlebrooks included with the interview:

April 8, 2020

%%% Wed Apr  8 03:24:13 PDT 2020

The recording of yesterday's introductory lecture, along with the slides, is on the class calendar page here and the lectures for tomorrow are available here. Thanks to everyone who asked questions. Tomorrow I'll try to keep my answers short so we can at least touch on most of the concepts covered in the talk, and follow up by email afterwards.

Special thanks to Ethan Richman for his question in class yesterday. I thought about it last night and this morning wrote the following message. Well, I didn't actually "write" it, I dictated it which is probably why it is as long as it is, but Ethan's question was spot on and I wanted to address the issues it raised — at least for me — as they relate to the course. Here is my response to Ethan:

Biological brains and modern computers are continuous-state dynamical systems, arguably with very different dynamics but with some crucial similarities. They both fall prey to the limitation that their underlying components are subject to internal and external processes that cause them to change over time. Those changes influence the overall system's behavior, in particular its ability to consistently perform accurate computations in service to its primary objectives independent of such fluctuations. In both cases they rely upon specialized regulatory systems that continuously compensate for these spurious influences.

Biological systems rely on a form of homeostatic regulation to locally adjust the parameters of their constituent computing components so that these components can be depended upon to maintain their operating characteristics so as to ameliorate the consequences of local changes in temperature, blood oxygen levels, etc. Synthetic devices rely primarily on components that are engineered to have almost identical operating characteristics and that remain stable indefinitely under reasonable operating conditions, and on specialized hardware, e.g., surge protectors and heatsinks, that are designed to maintain operating parameters within narrow bounds that, for example, allow circuit designers to assume fixed thresholds that can be used to define discrete states interpreted as binary vectors.
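
As a small illustration of how fixed thresholds turn a noisy analog quantity into a discrete state, here is a sketch under simple assumptions: two nominal voltage levels, one decision threshold, and bounded uniform noise. The constants and function names are hypothetical and not drawn from any particular logic family.

```python
import random

V_LOW, V_HIGH = 0.0, 1.0   # nominal voltages for logical 0 and 1 (illustrative values)
THRESHOLD = 0.5            # fixed decision threshold assumed by the circuit designer

def transmit(bits, noise_amplitude, rng):
    """Map bits to nominal voltages and add bounded analog noise to each one."""
    return [(V_HIGH if b else V_LOW) + rng.uniform(-noise_amplitude, noise_amplitude)
            for b in bits]

def read(voltages):
    """Recover bits by comparing each voltage with the fixed threshold."""
    return [1 if v > THRESHOLD else 0 for v in voltages]

rng = random.Random(1)
bits = [rng.randint(0, 1) for _ in range(1000)]
# Noise smaller than the 0.5 margin never crosses the threshold,
# so the analog substrate is invisible at the digital level.
recovered = read(transmit(bits, 0.4, rng))
```

As long as regulation keeps the noise inside the margin, every bit is recovered exactly; once the noise amplitude exceeds the margin, the abstraction starts to leak and bit errors appear.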

The high degree of stability engineered into synthetic computing devices is what enables the digital abstraction, and it was John von Neumann's theory of how unreliable components can be used to build reliable systems that provided the theoretical foundations for probabilistically quantifying the degree to which such reliable machines could be engineered, in much the same way that Claude Shannon provided the theoretical foundations for quantifying the degree to which information can be transmitted over a noisy channel.

Both biological and synthetic computing devices inevitably have to cash out the continuous states of their analog computing substrates in order to perform actions with consequences in their external environments. The digital abstraction makes it relatively easy to build systems that consistently and accurately transform input representations in binary form into output representations in the same format. This is considered a critical feature in an environment in which such precision is demanded and all transactions can be represented in digital formats.

For the most part, the world in which we live does not present itself in such a discrete fashion, nor does it necessarily demand precision of the sort required by commerce, banking, engineering, etc. For all intents and purposes, the external world in which we live is continuous, and even if it were not (say, we were to admit to its quantum character), we could quite reasonably treat these small variations as acceptable round-off error.

One can imagine a system that would periodically sample the state of the world and maintain a representation of the dynamics of its environment in the form of a giant table mapping initial-state-action pairs to subsequent-state-value pairs that result from performing said action in the specified initial state. Such a system could, in principle, behave by simply choosing to act by selecting the action that maximizes expected value. Ignoring the fact that biological systems would have trouble precisely encoding such states, the main problem with this approach is that it doesn't allow for generalization; moreover, since the same exact state is highly unlikely to reoccur, perhaps the best the system could do is simply choose the nearest state using some suitable metric or interpolate both states and actions in choosing how to respond to any given environmental conditions.
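
The table-driven agent described above can be sketched in a few lines. This is only a toy under stated assumptions: states are tuples of floats, Euclidean distance is the "suitable metric", and the table contents and the name choose_action are made up for illustration.

```python
import math

def choose_action(table, state, actions):
    """Pick the action whose nearest recorded state has the highest stored value.

    table maps (state, action) -> (next_state, value).
    """
    best_action, best_value = None, -math.inf
    for action in actions:
        # Distance from the query to every recorded state where this action was tried.
        candidates = [(math.dist(s, state), value)
                      for (s, a), (_next_state, value) in table.items() if a == action]
        if not candidates:
            continue  # this action was never observed anywhere
        _distance, value = min(candidates)  # nearest recorded state wins
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Hypothetical experience: two sampled states on a line, two actions each.
table = {
    ((0.0,), "left"):  ((-0.1,), 0.0),
    ((0.0,), "right"): ((0.1,), 1.0),
    ((1.0,), "left"):  ((0.9,), 2.0),
    ((1.0,), "right"): ((1.1,), -1.0),
}
action = choose_action(table, (0.2,), ["left", "right"])
```

Even in this toy the weakness is visible: the query state (0.2,) was never observed, so the agent simply borrows the values recorded at (0.0,), and nothing learned there carries over to states far from any table entry.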

Such interpolation schemes don't tend to work well for most applications: the state and action spaces are too large, and the manifold corresponding to the reachable configuration space determined by the underlying physics is irregular, so the interpolation of two observed points often doesn't lie on said manifold. The details aren't crucial; what is important is that you can, with some care, separate the (sub) systems devoted to regulating and maintaining environmental factors that support reliable computing, from those systems that actually perform the computations whether they be discrete (digital) or continuous (analog).
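
A deliberately simple example of interpolation leaving the manifold, assuming the reachable configurations are the points of a unit circle (a stand-in for a far less regular manifold): the straight-line midpoint of two reachable states is not itself reachable.

```python
import math

def on_circle(theta):
    """A reachable configuration: a point on the unit circle."""
    return (math.cos(theta), math.sin(theta))

def midpoint(p, q):
    """Naive linear interpolation between two observed states."""
    return tuple((a + b) / 2.0 for a, b in zip(p, q))

def radius(p):
    """Distance from the origin; equals 1 exactly for points on the manifold."""
    return math.hypot(*p)

p, q = on_circle(0.0), on_circle(math.pi / 2)
m = midpoint(p, q)
off_manifold_by = 1.0 - radius(m)  # positive: the midpoint falls inside the circle
```

Both endpoints satisfy the constraint, yet their average has radius sqrt(1/2), roughly 0.707, so any controller that acts on the interpolated state is reasoning about a configuration the system can never occupy.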

I apologize for the length of this introduction, but I am trying to be as precise as possible — a luxury that I don't have when lecturing to a class of finite duration, consisting of students with finite attention. The point of this introduction is to frame the decision I've made in pursuing a deeper understanding of biological computation, and, in particular, human cognition. While understanding how biological systems homeostatically regulate themselves is absolutely critical if we are to understand the mechanisms of disease and aging, it is not necessary if our goal is to build synthetic agents that are capable of or surpass human-level cognitive ability but without the added constraints imposed by their biological origins.

And that is, in fact, my goal. Or at least the goal that I happen to be pursuing at this moment and in this class. With this sort of engineering focus in mind, the conceit here is that if we were to make precise the notion of a digital abstraction in human beings, then the way in which we learn from biology and engineer intelligent systems would be quite different from the way in which most neuroscientists study the brain. Our working hypothesis is that the fundamental unit of computation is a weight matrix / Hilbert transform coupled with a nonlinear activation function, i.e., a layer in a multi-layer network.
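
For readers who want the working hypothesis in executable form, here is a minimal sketch of the proposed unit, taking the "weight matrix" reading: a linear map followed by a pointwise nonlinearity, with units composed by feeding one layer's output into the next. The choice of tanh, the bias term, and the names layer and network are assumptions made for illustration.

```python
import math
import random

def layer(W, b, x, activation=math.tanh):
    """One unit of computation: y = activation(W x + b), applied row by row."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def network(layers, x):
    """Compose units: each (W, b) pair transforms the output of the previous one."""
    for W, b in layers:
        x = layer(W, b, x)
    return x

def random_layer(n_out, n_in, rng):
    """Random weights and zero biases, standing in for learned parameters."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
layers = [random_layer(4, 3, rng), random_layer(2, 4, rng)]
y = network(layers, [0.5, -1.0, 2.0])  # a 2-dimensional embedding of a 3-dimensional input
```

Nothing in the sketch says how such units should be wired together at scale, which is exactly where the large-scale architecture of the brain is expected to be informative.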

Each component constitutes a distributed representation (embedding space) of an abstract concept (in the most general sense) that encodes a superposition of instances of that concept — it's a metaphor, no spooky invocation of quantum theory is required here. Using the same metaphor, collapsing the wave function amounts to selecting one such instance as is required in speaking a word or performing an action. In response to the question, "you mean to say that the human brain is just one big very deep recurrent neural network?", I'll paraphrase Scott McNealy, "It's the network (architecture) dummy!", meaning that it is the large-scale architecture of the human brain that makes the difference.

This is not to say that the architecture of the basal ganglia surrogate circuit isn't important. The blind monks attempting to describe the elephant are doing a pretty good job. The hippocampus will require several components, each of them with a different type of weight matrix and nonlinear activation function. Their recurrent loops and reciprocal connections to other cortical and subcortical circuits are an essential part of the overall architecture. I'll grant you the challenges in training the resulting network of networks are daunting, but here again we can reach out to neuroscientists who study the lengthy and complicated period of development in humans that progresses in critical stages and persists well into early adulthood. The focus of this class is to step back and see what lessons we can learn by attempting to put the pieces together.

April 3, 2020

%%% Fri Apr  3 14:06:37 PDT 2020

I'll be separately exchanging email with the students who have volunteered for the first five discussion topics since much of what I have to communicate involves papers, concepts, the larger picture of the general approach we are taking and the narrower emphasis of this particular year's class. This isn't intended to exclude anyone, simply to avoid deluging you with email, much of which you might not find directly relevant. If you would like to be included, send me email and I'll add you to the email group.

Note that anything of general interest will be posted in this document which includes both the notes from the past six months of preparation for this year's class working with three students from last year and one from the year before, and, looking forward, all the new posts relating to this year's class discussions, this entry being the second such post.

I'll also encourage the students working on the first five discussion topics to create a Slack workspace with individual projects for each topic, something others of you might want to consider if you're contemplating a team for one of the remaining topics or thinking about final projects, which we will be talking about in next Tuesday's class. Last year students asked if I might create an index for the online class discussion lists. I'm busy now preparing lectures; however, as an exercise, I've started compiling a glossary of concepts that, for the time being at least, link to footnotes in past classes available in the course archive:

reticular theory and the neuron doctrine
Bruno Olshausen on neural nonlinearity
defining fundamental units of computation
Carver Mead on homeostatic regulation
mesoscale modeling as a complexity bridge
neural network motifs as circuit components
functional versus structural connectomics
algebraic invariants as topological features
modeling nonlinear dynamical systems
entropy, disorder and information theory

P.S. I don't know how many of you watched the "debate" between Yoshua Bengio and Gary Marcus at the University of Montreal last December. The debate, such as it was, cast Yoshua as an advocate for connectionist models and deep networks and Gary as an advocate for combinatorial models and symbolic representations. Neither deserves their narrow advocacy as characterized, and, in the debate, they come across as remarkably in agreement albeit with some differences about how their respective disciplines might contribute to the development of AI systems that draw inspiration from human cognition.

Speaker: Yoshua Bengio and Gary Marcus
Affiliation: University of Montreal and New York University / RobustAI
Date: December 26, 2019
Venue: Quebec AI Institute, University of Montreal
Title: "AI DEBATE : Yoshua Bengio | Gary Marcus" (VIDEO) (TRANSCRIPT)

Those of you interested in structural connectomics — which we will touch upon in Tuesday's lecture — might be interested in the public debate staged at Columbia University almost exactly eight years ago in which two accomplished neuroscientists, Sebastian Seung and Tony Movshon, debated the importance of connectomics as a driver for research in the neurosciences and whether the proposed investments in brain mapping advocated by the European Human Brain Project [157] and the US BRAIN Initiative (Brain Research through Advancing Innovative Neurotechnologies) were a good use of public funding for the sciences [446, 16].

Speaker: Tony Movshon and Sebastian Seung
Affiliation: New York University and Princeton University
Date: April 4, 2012
Venue: Columbia University
Title: "The Great Brain Mapping Debate : Tony Movshon | Sebastian Seung" (VIDEO)

April 1, 2020

%%% Wed Apr  1 09:33:07 PDT 2020

In the slides for my lecture next Tuesday, I've rehashed my earlier argument concerning why it makes little sense for software engineers designing systems that aspire to human-level competence, such as the programmer's apprentice, to slavishly adhere to the biology. Whether the same can be said for computational neuroscientists depends a lot on what sort of phenomena you are trying to model.

Conventional computer hardware is, like the brain, analog and has been developed to reliably compute Boolean functions by legions of engineers working over decades as transistor density and clock rates have dramatically increased. John von Neumann understood this and made clear the path to a solution in a series of five lectures given at the California Institute of Technology in 1952 entitled "The Synthesis of Reliable Organisms from Unreliable Components", which were subsequently published in a collection edited by Claude Shannon and John McCarthy [506]. Von Neumann employed Shannon's formal definition of information to model and quantify computational error in much the same way that Shannon modeled and quantified communication error.

Nicholas Pippenger notes that "The model used by von Neumann is roughly speaking the 'neural network' of McCulloch and Pitts [349], with the adjunction of a finite probability space to model the failures of components" [401]. Pippenger goes on to describe the importance of von Neumann's theory. Key to my mention in the slides is that von Neumann's theory provides the foundation for what is called the Digital Abstraction, which made possible the division of labor between the physicists and engineers designing the (analog) electronics and the computer scientists designing the (digital) architectures [223].
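
The flavor of the reliability argument can be conveyed with a much simpler calculation than von Neumann's actual construction. The sketch below assumes n independent copies of a component, each wrong with probability p, feeding a perfectly reliable majority voter; von Neumann's analysis does not get to assume a perfect voter, so treat this only as an illustration of why redundancy helps. The function name is hypothetical.

```python
from math import comb

def majority_failure_probability(p, n):
    """Probability that a majority of n independent components are wrong.

    Assumes each component fails independently with probability p and that
    the voter itself never fails (a simplification).
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of components to avoid ties")
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

single = 0.1
triple = majority_failure_probability(single, 3)  # 3p^2(1-p) + p^3
nine = majority_failure_probability(single, 9)
```

With p = 0.1, three-way voting reduces the failure probability to 0.028, and adding copies reduces it further. The benefit disappears when components are wrong more often than right: for p above one half, voting amplifies the error rather than suppressing it.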

John von Neumann did more than anyone else, with the possible exception of Alan Turing, to create the modern computer. His handwritten notes transcribed by Herman Goldstine in 1945 and circulated as the "First Draft of a Report on the EDVAC" define the architectural foundation for all but a few digital computers designed since [375]. In his "Theory of Self-Reproducing Automata" [508] and his transcribed Silliman Memorial Lectures at Yale University on "The Computer and the Brain" [507], von Neumann, like Turing, was prescient in articulating the relationship between biological life and artificial systems that exhibit similar autonomy and intelligence. Here's what I wrote in 2016 and referred to in my lecture slides (the footnotes contain most of the technical points):

February 17, 2016

David Duvenaud introduced me to Guillaume Hennequin, Máté Lengyel and Daniel Wolpert in the Computational and Biological Learning Lab at Cambridge, and they invited me to give a talk and spend the afternoon on my stopover in London returning from Zurich. Guillaume, who is my host, asked for a description of my talk and I supplied the title and abstract below sans footnotes and citations.

When I sent it, I pointed out that I had attempted a little neuroscience humor in the title and a little hyperbole in the abstract, and that if staid and understated was the norm for talks at CBL, to please advise me and I would produce something more appropriate. Guillaume replied that it "looks great and will appeal to many here at CBL", so I guess I'm stuck with it. I kind of like it, but now I have to deliver on its promises or the jokes will fall flat and the hyperbole will be just that.

It's the Network Dummy: Exhuming the reticular theory⁴⁶ while shoveling a little dirt on the neuron doctrine⁴⁷

Scott McNealy, former CEO of Sun Microsystems, is rumored to have quipped⁴⁸, "It's the network dummy", when a reporter asked where the computer was upon seeing a room full of thin-client, diskless workstations⁴⁹. McNealy's point was that the power of computer networks isn't the (linear) sum of individual computers; the power is in the (nonlinear) manner in which they work together⁵⁰. When a computational neuroscientist examines an EM image of neural tissue, does she see a network or a bunch of neurons? The answer will depend on what she understands to be the fundamental unit of neural computation.

We assume the fundamental unit of neural computation is not the individual neuron⁵¹, compartment, synapse or even circuit in the traditional sense in which electrical engineers generally think of circuits, but rather dynamic ensembles of hundreds or thousands of neurons that organize themselves depending on the task, participate in multiple tasks, switch between tasks depending on context and are easily reprogrammed to perform new tasks. Consequently, the total number of computational units is far fewer than the number of neurons.

We also assume that much of what goes on in individual neurons and their pairwise interactions is in service to maintaining an equilibrium state conducive to performing their primary role in maintaining the body and controlling behavior⁵². This implies that the contribution of small neural circuits to computations facilitating meso- or macro-scale⁵³ behavior is considerably less than one might expect given the considerable complexity of the individual components. Since much of the complexity will manifest itself in the topology of the network, we need some means of computing topological invariants at multiple scales in order to tease out the computational roles of the multitude of circuit motifs⁵⁴ that are likely present, even in those parts of the brain assumed to be structurally and functionally homogeneous.

In the talk, we describe the convergence of several key technologies that will facilitate our understanding of neural circuits satisfying these assumptions. These technologies include (i) high-throughput electron microscopy and circuit reconstruction for structural connectomics, (ii) dense two-photon-excitation fluorescent voltage and calcium probes for functional connectomics⁵⁵, and (iii) analytical methods from algebraic topology⁵⁶, nonlinear dynamical systems⁵⁷ and deep recurrent neural networks for inferring function from structure and activity recordings.

Class Preparation

Note that these preparatory notes describe work in progress. They constitute a research log for recording ideas, preliminary sketches, sources, etc. The entries are sorted in reverse chronological order — from the latest to the earliest entries — and may provide pointers to later entries that supersede, contradict or build upon ideas discussed in earlier ones, but don't count on it and be sure to search forward in time to see if there are more recent relevant entries.

March 31, 2020

%%% Tue Mar 31 03:17:19 PDT 2020

Signing up for class discussions ... thanks to all of you who have filled in your entries in the signup sheet or sent me team proposals. We have enough bids to handle most of the topics, but could use some help with the first — actor-critic learning focusing on the neural correlates of reinforcement learning involving the basal ganglia and locations in the frontal cortex that implement a form of meta-learning.

This topic also represents one of the most important technical contributions to successful applications including AlphaGo and its successors such as AlphaStar which we will look at in subsequent class discussions. The talks and papers by Randy O'Reilly and Matt Botvinick are insightful and thought provoking, and there is plenty of room for including one or more of the applications, e.g., the talk by Oriol Vinyals from the 2018 class.

This topic is also fascinating for the fact that there is still much we don't know about the basal ganglia which is arguably one of the most important and interesting circuits in the brains of mammals and in particular its complicated role in human cognition. It's a great topic for a team effort and I'd love to see a proposal.

I'd also like to see team proposals for the last two topics in the list of five. The first of the two focuses on the role of hierarchy, which is evident throughout the brain, and, in particular, its application to planning and complex cognitive activity. Research on transfer learning, abstraction and compositional representation has highlighted the importance of these characteristics, and the topic directly relates to one of the most important challenges facing artificial intelligence in the coming decade, one on which cognitive and systems neuroscience have much to offer in terms of insights from biology that can be turned into technological innovations.

The second of the two has seen increased interest in the last couple of years. In one sense, predictive coding has been one of the success stories of biologically inspired AI. It also might be the answer to some of the problems raised in the previous paragraph. It is both a technology and an organizing principle that provides deep insight into biological systems. This combination of a technology for building intelligent systems and an organizational framework for understanding them makes for a rich multi-disciplinary literature that engages both AI researchers and neuroscientists.

March 29, 2020

%%% Sun Mar 29 05:14:41 PDT 2020

Executive Summary

This is a followup to the message I sent to students currently enrolled in CS379C on Friday, March 13. This post includes new students who have enrolled since then and covers some administrative issues relating to class participation that I would like to take care of prior to the first class on Tuesday, April 7. Specifically, students are asked to provide their preferences for discussion topics they are interested in contributing to.

Welcome to CS379C. If you are receiving this message, the registrar believes that you are enrolled in the class. If you didn't receive my earlier welcome message, my name is Tom Dean and I'll be your instructor / professor for the class. Rafael Rafailov, who took the course last year, will be my teaching assistant.

There are now nearly 40 students enrolled in the class. I expect there will be some attrition and possibly additional enrollments as students shop other classes during the first week of classes, but for the time being I will plan on somewhere in the neighborhood of 40 students taking the class for credit.

This is not a traditional seminar or lecture course. The COVID-19 pandemic will present challenges that none of us can anticipate, but I would like to make this course as interactive as possible, if for no other reason than to provide some relief from the isolation that many of you must be feeling.

To that end, apart from two introductory lectures in the first week of class, my plan is to enlist you to self-organize in small teams responsible for guiding discussions in the remaining class sessions. We experimented with this format last year, and, from the feedback I received from students, it was a success both socially and intellectually.

Here's how it works: I have compiled the materials for five class sessions each dealing with an important architectural component of the human brain. Research on each of these components has either already had an impact on developing artificial systems patterned after the human brain, or I hope to convince you that it will have a significant impact in the near future.

In addition to these five initial class sessions, I have identified eight additional topics that address key technology issues that I and others believe are holding back progress in artificial intelligence and machine learning. We believe these issues can be addressed by leveraging recent discoveries in cognitive and systems neuroscience.

Here's the plan: Each of you will contribute as part of a team in preparing for and running the class discussion for one topic. Each team will consist of 3 to 4 students. I will supply a description and a list of materials for each topic, ask you to indicate your preferences, and form teams by a yet-to-be-finalized decision procedure that you will help us formulate.

One possibility is for each student to independently pick three topics ranking them 1, 2 and 3 indicating their first, second and third choice. Then Rafael and I will assemble five teams — one for each topic — doing our best to satisfy your preferences. Alternatively, you could self-organize to form your own teams and submit bids to work on specific topics. We could also try a hybrid of these two.

I've shared a signup spreadsheet with you. You can access the spreadsheet here. The headings of the topic columns link to descriptions of the initial five topics or you can simply access them here. Everyone please enter your preferences as soon as possible. If you would rather wait for one of the other eight topics later in the quarter, fill in a 0 for each of the three topics you would have ranked highest; this will give me some indication of your interests.

In addition, for those of you who would like to form a team to work on one of the five topics, send me the topic and the names of the proposed team members — no fewer than three and no more than four. For everyone, if you have any suggestions for how best to combine individual preferences and team bids, send them to me.

Why should you volunteer to work on one of the first five topics? For one thing, I've already done a lot of the work assembling the necessary materials, and, if you start soon, we will have plenty of time to help you organize your presentations. For another reason, the five topics correspond to architecturally-coherent functional units that have served as the inspiration for many advances in machine learning and artificial intelligence and are well worth knowing about. Finally, I think you will benefit from the exercise, and, if you like, I will introduce you to some of the key scientists and engineers working in your selected topic area.

That's all for now. Please add your rankings to the spreadsheet as soon as possible, and send me proposals for teams and your suggestions for combining rankings and team proposals. On Wednesday, April 1 (no fooling), I will look at the spreadsheet, make whatever assignments are possible at that time, and, if you are interested, schedule a Zoom meeting to discuss your topic and strategy for organizing your discussion session and accompanying interleaved concept presentations.

March 19, 2020

%%% Thu Mar 19 03:13:56 PDT 2020

The following is a summary of my understanding of Karl Friston's work related to the free-energy principle. I undertook this study because Friston's papers frequently appear in the results returned by queries involving many of the terms we use in this course, including "predictive coding", "variational inference" and "minimizing free energy". Having done the necessary background reading and produced the summary, I don't believe this research is particularly relevant to the focus of the class and suggest that you don't spend inordinate amounts of time delving into the related literature. That said, in your copious free time you might find it interesting to sample some of the papers featured below.

Preliminary Attributions

Friston's free energy principle (FEP) is closely related to predictive coding, variational inference and related concepts in statistics and information theory. Predictive coding as articulated by Rao and Ballard [417] is related to the concept of unconscious inference developed by Hermann von Helmholtz.

Following in Helmholtz's footsteps, Peter Dayan, Geoffrey Hinton, Radford Neal and Richard Zemel [113] developed a new family of recurrent neural networks they named Helmholtz machines that implement a class of undirected probabilistic graphical models (Markov random fields), borrowing ideas from statistical physics. They characterize their method as a form of hierarchical self-supervised learning related to the function of bottom-up and top-down cortical processing pathways.

FEP is also related to the efficient coding hypothesis championed by Horace Barlow [47] and the ideas of James J. Gibson on ecological psychology, specifically Gibson's notion of direct perception concerning how the environment of an organism affords various actions to the organism.

I seem to recall Friston mentioning affordances in one of his talks, though I don't recall him attributing the concept to Gibson. In his interview with Sean Carroll, Friston, in accord with Gibson's perspective, mentioned affordances as part of the extended body described by Andy Clark and David Chalmers in their approach to extended cognition. See also Chalk et al [88] — Toward a unified theory of efficient, predictive, and sparse coding.

The Free Energy Principle

We are interested in FEP in the spirit it was intended, as a principle to guide understanding and provide insight into the mechanisms that characterize biological systems. We take it for granted that biological systems are also computational systems, the instantiations of algorithms that govern behavior, and, in particular, enable organisms to survive and reproduce.

For the purposes of this discussion, we assume that all biological systems can be usefully characterized as minimizing free energy and that, given the limitations of the computational substrate and the difficulty of the implied optimization problem, they employ variational approximation methods to perform the necessary calculations.

Biological systems are obviously not static computational systems: they grow, make decisions, acquire skills, and adapt to their environment. In the same way that the time and space required to run a conventional computer program on a novel problem are not known in advance and have to be allocated on the fly, so too a biological system minimizing its free energy can acquire computational resources by altering its biological substrate or exploiting the affordances of its environment.

Conceptually we represent the environment as a probabilistic graphical model and the biological system as a subgraph enclosed by a semipermeable membrane corresponding to its Markov blanket⁵⁸ that serves as the interface and protective barrier between the system and its external environment. From this perspective, the target system and its environment are jointly modeled by a Bayesian network representing all of the relevant factors and their associated joint and conditional probabilities.
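As a minimal sketch of the idea, the Markov blanket of a node in a directed graphical model consists of its parents, its children and the other parents of its children. The toy network below is hypothetical (the variable names are chosen only to echo the sensory, active, internal and external states discussed here) and is not taken from any of the cited papers:

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of `node` in a directed graphical model.

    `parents` maps each variable name to the set of its parents.
    The blanket is: parents of the node, children of the node,
    and the other parents of those children (co-parents).
    """
    children = {v for v, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | co_parents) - {node}


# Hypothetical toy network, for illustration only:
# external -> sensory -> internal -> active -> external_next <- external
parents = {
    "external": set(),
    "sensory": {"external"},
    "internal": {"sensory"},
    "active": {"internal"},
    "external_next": {"external", "active"},
}

print(sorted(markov_blanket(parents, "internal")))  # ['active', 'sensory']
```

In this toy example the internal state is separated from the external variables by the sensory and active states alone, which is the sense in which the blanket acts as an interface between the system and its environment.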

The model is intended to account for all the external factors that impinge on the target system's ability to interact with, alter or be altered by its environment, and all the internal factors that govern its ability to represent and respond to external factors, including the ability to adapt its computational and physical characteristics in response to environmental factors.

In particular, there are costs associated with acquiring information, performing computations, storing memories, learning new representations and so on. Internal factors include the cost of maintaining representations of sensory and motor systems, as well as abstractions such as one might find in the association areas of the primate cortex. There are also opportunity costs for not performing actions, and these will have to be factored into the objective function of the system's probabilistic model.

All of these changes in state are taken into account in minimizing the free energy of the system. Moreover, the system is not passive. Its algorithmic activities have consequences in the external world when variables inside the Markov blanket are altered and propagated to variables outside the blanket by way of variables that comprise the blanket.

By thinking we alter not only our internal state but also cause changes in the environment, and both kinds of change alter the free energy of the system; the model and the system are essentially equivalent in this formulation. If the model creates new representations to account for new features of the environment or abstractions that serve to guide behavior, these changes are reflected in the free energy.

There is no identifiable planner, central executive or homunculus hiding in a Cartesian Theater. There is simply a computational process driven by an objective function that measures prediction error. By propagating the gradient of this function to adjust the model parameters, the process alters its representation of the internal and external state, thereby altering the manifest internal and external state as well.

Active Inference Property

Friston et al [178] refer to this property as active inference and relate it to the concept of active perception and reinforcement learning. They address the problem of perceptual aliasing identified by Whitehead and Ballard [527], namely that one unavoidable consequence of active perception is that "the agent's internal representation often confounds external world states".

Whitehead and Ballard provide a solution to the problem of perceptual aliasing within the context of reinforcement learning [527], whereas Friston et al [178] appeal to a free-energy formulation of perception in which agents adjust their internal states and sampling of the environment in order to minimize their free energy, thereby dispensing with reinforcement learning including its reliance on dopaminergic neurotransmission.

According to the free-energy principle, much of what we observe in the behavior of biological systems can be explained in terms of Bayesian inference and self-evidencing⁵⁹. The following extended quotation from Kirchoff et al [276] summarizes the relationship between active inference, self-evidencing and statistical inference far better — and more succinctly — than I could:

Active inference is a cornerstone of the free energy principle. This principle states that for organisms to maintain their integrity they must minimize variational free energy. Variational free energy bounds surprise⁶⁰ because the former can be shown to be either greater than or equal to the latter. It follows that any organism that minimizes free energy thereby reduces surprise—which is the same as saying that such an organism maximizes evidence for its own model, i.e. its own existence. In other words, self-evidencing behavior is equivalent to statistical inference. To see this, consider, first,
F(s, a, r) = −ln p(s, a | m) + D_KL[q(ϕ | r) ∥ p(ϕ | s, a)],
where s refers to sensory states, a to active states and r to internal states. The notation F(s, a, r) denotes the variational free energy of internal states and their Markov blanket, −ln p(s, a | m) refers to the negative log probability or surprise conditioned on a generative model and D_KL[q(ϕ | r) ∥ p(ϕ | s, a)] is the Kullback–Leibler (KL) divergence between two probability densities: the variational density, q(ϕ | r) and the posterior density, p(ϕ | s, a).
Crucially, this equality gives a Bayesian interpretation of variational free energy. The negative log likelihood or probability is the same as surprise, while the KL divergence measures the discrepancy between the variational density and the true posterior. Minimizing free energy by changing internal states can only reduce the divergence between beliefs about external states (the variational density) and the true posterior density given the states of the Markov blanket. We can think of this as a form of perception. Minimizing free energy by changing the active states can only change the surprise or model evidence. This constitutes a form of action that underwrites self-evidencing. — Excerpt from Kirchoff et al [276]
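The decomposition in the excerpt can be checked numerically on a small discrete example. The sketch below makes several simplifying assumptions that are not in the source: the external state ϕ takes one of three values, the sensory and active states are folded into a single fixed observation, and the variational density q is given directly as a probability vector rather than being parameterized by internal states r. All numbers are made up for illustration.

```python
import math


def free_energy(q, prior, likelihood):
    """E_q[ln q(phi) - ln p(phi, obs)] for a fixed observation."""
    return sum(
        qi * (math.log(qi) - math.log(pi * li))
        for qi, pi, li in zip(q, prior, likelihood)
        if qi > 0
    )


def surprise(prior, likelihood):
    """Negative log evidence, -ln p(obs)."""
    return -math.log(sum(pi * li for pi, li in zip(prior, likelihood)))


def posterior(prior, likelihood):
    """Exact posterior p(phi | obs) by Bayes' rule."""
    evidence = sum(pi * li for pi, li in zip(prior, likelihood))
    return [pi * li / evidence for pi, li in zip(prior, likelihood)]


def kl(q, p):
    """KL divergence D_KL[q || p] for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Illustrative values only (assumptions, not from any cited paper).
prior = [0.5, 0.3, 0.2]        # p(phi)
likelihood = [0.1, 0.6, 0.3]   # p(obs | phi) for the one observed outcome
q = [0.2, 0.5, 0.3]            # an arbitrary variational density

F = free_energy(q, prior, likelihood)
decomposed = surprise(prior, likelihood) + kl(q, posterior(prior, likelihood))

print(abs(F - decomposed) < 1e-12)        # free energy = surprise + KL
print(F >= surprise(prior, likelihood))   # free energy bounds surprise from above
```

Setting q equal to the exact posterior drives the KL term to zero, at which point the free energy equals the surprise; this is the sense in which adjusting the variational density corresponds to perception in the excerpt.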

Hierarchical Organization

As mentioned earlier, we are primarily interested in using the free energy principle to provide insight into the mechanisms that characterize biological systems. One example in which FEP supplies useful insights relates to our understanding of the large-scale (hierarchical) organization of the human brain emerging as a consequence of our interactions with the environment. We have touched on this question frequently in these discussion notes, e.g., in the work of Koechlin and Jubault [284], Badre [41], Dayan et al [113] and Fuster [185], among others.

Apart from resolving the evolutionary processes involved, we are interested in the practical problem of learning to organize knowledge to facilitate its compact storage and efficient transfer across applications, e.g., in the work of Bakker and Schmidhuber [45], Kulkarni et al [293], Sahni et al [433] and Tamar et al [479], among others.

With respect to the specific contribution of FEP to our understanding of the hierarchical organization of the human brain, here is an excerpt from Kiebel et al [269] on the emergence of circuits in the prefrontal cortex that represent cognitive skills in a hierarchy of time scales:

Many aspects of brain function can be understood in terms of a hierarchy of temporal scales at which representations of the environment evolve. The lowest level of this hierarchy corresponds to fast fluctuations associated with sensory processing, whereas the highest levels encode slow contextual changes in the environment, under which faster representations unfold. First, we describe a mathematical model that exploits the temporal structure of fast sensory input to track the slower trajectories of their underlying causes. This model of sensory encoding or perceptual inference establishes a proof of concept that slowly changing neuronal states can encode the paths or trajectories of faster sensory states. We then review empirical evidence that suggests that a temporal hierarchy is recapitulated in the macroscopic organization of the cortex. This anatomic-temporal hierarchy provides a comprehensive framework for understanding cortical function: the specific time-scale that engages a cortical area can be inferred by its location along a rostro-caudal gradient, which reflects the anatomical distance from primary sensory areas. This is most evident in the prefrontal cortex, where complex functions can be explained as operations on representations of the environment that change slowly.

Relevant Tutorials

In addition to the bibliographical references in this document, the following tutorials should be useful if you're contemplating a final project that involves predictive coding, methods for minimizing free energy or implementing some version of Fuster's perception-action cycle for training sensory-motor feature hierarchies. I think it's instructive to go through the math in Kirby's tutorial, but I don't recommend reimplementing the original Helmholtz machine models. Modern variational autoencoders offer a safer alternative given the attention paid to them in the last few years and the availability of several high-quality open-source implementations. David Blei's review of variational inference is probably more than you want or need, but it is a good reference if you want to understand the mathematics more deeply:

A Tutorial on Helmholtz Machines by Kevin Kirby [275] (PDF)
Tutorial on Variational Autoencoders by Carl Doersch [143] (PDF)
Variational Inference: A Review for Statisticians by David Blei et al [59] (PDF)

Related Publications

The following section includes BibTeX database records, including abstracts, for a few of the papers I thought worth special mention. Curating such lists often saves me time later, e.g., when I'm writing a paper and need to check my recollection. I have library privileges at both the Brown and Stanford libraries and find that their search tools often complement web search engines. Now that you can live chat with the librarians, their suggestions can often direct you to digital content that you didn't know existed.

@article{FristonENTROPY-12,
       author = {Friston, Karl},
        title = {A Free Energy Principle for Biological Systems},
      journal = {Entropy (Basel)},
         year = {2012},
       volume = {14},
       number = {11},
        pages = {2100-2121},
     abstract = {This paper describes a free energy principle that tries to explain the ability of biological systems to resist a natural tendency to disorder. It appeals to circular causality of the sort found in synergetic formulations of self-organization (e.g., the slaving principle) and models of coupled dynamical systems, using nonlinear Fokker Planck equations. Here, circular causality is induced by separating the states of a random dynamical system into external and internal states, where external states are subject to random fluctuations and internal states are not. This reduces the problem to finding some (deterministic) dynamics of the internal states that ensure the system visits a limited number of external states; in other words, the measure of its (random) attracting set, or the Shannon entropy of the external states is small. We motivate a solution using a principle of least action based on variational free energy (from statistical physics) and establish the conditions under which it is formally equivalent to the information bottleneck method. This approach has proved useful in understanding the functional architecture of the brain. The generality of variational free energy minimisation and corresponding information theoretic formulations may speak to interesting applications beyond the neurosciences; e.g., in molecular or evolutionary biology.}
}

Friston [261], A Free Energy Principle for Biological Systems: From an information-theoretic or statistical perspective, free-energy minimization is tightly coupled with predictive coding, variational inference⁶¹ procedures and the general question of "how a biological system, exposed to random and unpredictable fluctuations in its external milieu, can restrict itself to occupying a limited number of states, and therefore survive in some recognizable form."

@article{KiebeletalPLoS-08,
       author = {Kiebel, Stefan J. and Daunizeau, Jean and Friston, Karl J.},
      journal = {PLOS Computational Biology},
    publisher = {Public Library of Science},
        title = {A Hierarchy of Time-Scales and the Brain},
         year = {2008},
       volume = {4},
       number = {11},
        pages = {1-12},
     abstract = {Currently, there is no theory that explains how the large-scale organization of the human brain can be related to our environment. This is astonishing because neuroscientists generally assume that the brain represents events in our environment by decoding sensory input. {\it{Here, we propose that the brain models the entire environment as a collection of hierarchical, dynamical systems, where slower environmental changes provide the context for faster changes.}} We suggest that there is a simple mapping between this temporal hierarchy and the anatomical hierarchy of the brain. Our theory provides a framework for explaining a wide range of neuroscientific findings by a single principle.},
}

Kiebel et al [269] A Hierarchy of Time-Scales and the Brain "specific time-scale that engages a cortical area can be inferred by its location along a rostro-caudal gradient, which reflects the anatomical distance from primary sensory areas."

@article{FristonandKiebelPTRSLB-09,
       author = {Friston, Karl and Kiebel, Stefan},
        title = {Predictive coding under the free-energy principle},
      journal = {Philosophical Transactions of the Royal Society of London B: Biological Sciences},
         year = {2009},
    publisher = {The Royal Society},
      address = {London},
       volume = {364},
       number = {1521},
        pages = {1211-1221},
     abstract = {This paper considers prediction and perceptual categorization as an inference problem that is solved by the brain. We assume that the brain models the world as a hierarchy or cascade of dynamical systems that encode causal structure in the sensorium. {\it{Perception is equated with the optimization or inversion of these internal models, to explain sensory data.}} Given a model of how sensory data are generated, we can invoke a generic approach to model inversion, based on a free energy bound on the model's evidence. The ensuing free-energy formulation furnishes equations that prescribe the process of recognition, i.e. the dynamics of neuronal activity that represent the causes of sensory input. Here, we focus on a very general model, whose hierarchical and dynamical structure enables simulated brains to recognize and predict trajectories or sequences of sensory states. We first review hierarchical dynamical models and their inversion. We then show that the brain has the necessary infrastructure to implement this inversion and illustrate this point using synthetic birds that can recognize and categorize birdsongs.},
}

Friston and Kiebel [179] Predictive coding under the free-energy principle "perception is equated with the optimization or inversion of these internal models, to explain sensory data."

@article{FristonetalPLOS-ONE-09,
       author = {Friston, Karl and Daunizeau, Jean and Kiebel, Stefan},
        title = {Reinforcement Learning or Active Inference?},
      journal = {PLoS ONE},
         year = {2009},
       volume = {4},
        pages = {e6421},
     abstract = {This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. {\it{In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility.}} We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.},
}

Friston et al [178] Reinforcement Learning or Active Inference? "minimizing free energy [...] results in policies that reproduce those optimized by reinforcement learning."

@article{WhiteheadandBallardNC-90,
       author = {Whitehead, Steven D. and Ballard, Dana H.},
        title = {Active Perception and Reinforcement Learning},
      journal = {Neural Computation},
       volume = {2},
       number = {4},
        pages = {409-419},
         year = {1990},
     abstract = {This paper considers adaptive control architectures that integrate active sensorimotor systems with decision systems based on reinforcement learning. One unavoidable consequence of active perception is that the agent's internal representation often confounds external world states. We call this phenomenon perceptual aliasing and show that it destabilizes existing reinforcement learning algorithms with respect to the optimal decision policy. A new decision system that overcomes these difficulties is described. The system incorporates a perceptual subcycle within the overall decision cycle and uses a modified learning algorithm to suppress the effects of perceptual aliasing. The result is a control architecture that learns not only how to solve a task but also where to focus its attention in order to collect necessary sensory information.}
}

Whitehead and Ballard [527] Active Perception and Reinforcement Learning "one unavoidable consequence of active perception is that the agent's internal representation often confounds external world states."

Reviewer Commentary

DQT: I found the Carroll (Mindscape) interview with Friston interesting, but I came away with the conclusion that the free-energy principle is less a unified model or theory (the term "principle" is misleading / ambiguous given its usage in Friston's papers) and more a framework (collection of design principles) for exploring a class of models, that allow researchers to instantiate the framework components, e.g., using different priors or inference algorithms, and then examine the consequences of the resulting fully-instantiated models.
Confusion arises when the reader, having read one paper making one set of claims, reads another paper making a different set of claims but doesn't realize that the two sets of claims apply to different fully-instantiated models and may be irreconcilable. Among other concerns, it is not clear, except perhaps to someone who has followed Friston's work assiduously (a somewhat daunting challenge given the sheer volume of papers Friston has written or co-authored), whether the combined set of claims applies to either of the fully-instantiated models.

Sam Gershman's recent paper entitled "What does the free energy principle tell us about the brain?" does a good job of teasing apart the issues and providing a bit of much needed clarity concerning some aspects of the conundrum facing readers wanting to understand the free energy principle more deeply. Peter Battaglia and Nathaniel Daw were acknowledged for their feedback on an earlier draft — I wonder what they think about Friston's principle and Gershman's critique?

GDP: My reading of the free energy literature from Friston is that he is describing two problems:

He has some papers that describe variational filtering⁶² in which the previous observations are used to construct a prior distribution over the time evolution of the observations. In MERLIN, this is constructed by a prior over a latent variable z_t: p(z_t | z_0, z_1, z_2, ..., z_{t−1}, a_t). There is a posterior q(z_t | z_0, z_1, z_2, ..., z_{t−1}, o_t) that is a function of the current observation as well. The predictive loss is D_KL[q ∥ p]: p is trying to predict q.

He has other work in which he describes a variational reinforcement learning framework. Here, he considers a prior over states, which generates a reward function. The system is homeostatic insofar as it tries to stay at equilibrium and maintain the prior distribution; to do so, it must solve a control problem. This is very similar to what is done in control as inference (works from Attias, Todorov, Kappen and Toussaint). Sergey Levine [310] has written a good summary which is available here: Equations 17-19 formulate RL as a variational learning problem.
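To make the predictive loss D_KL[q ∥ p] from the variational filtering description concrete, here is a minimal sketch that assumes both the prior p and the posterior q over the latent z_t are diagonal Gaussians. That distributional choice, the function name and the example values are assumptions made for illustration; they are not taken from the MERLIN paper or from Friston's work.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """D_KL[q || p] for diagonal Gaussians.

    Each argument is a list with one entry per latent dimension:
    means (mu_*) and standard deviations (sigma_*).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        # Closed-form KL between two univariate Gaussians, summed over dimensions.
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


# Hypothetical two-dimensional latent: the prior is predicted from the past,
# the posterior additionally sees the current observation.
prior_mu, prior_sigma = [0.0, 0.0], [1.0, 1.0]
post_mu, post_sigma = [0.5, -0.2], [0.8, 0.9]

loss = gaussian_kl(post_mu, post_sigma, prior_mu, prior_sigma)
print(loss >= 0.0)  # the KL divergence is never negative
```

The loss is zero only when the prior already matches the posterior, so minimizing it with respect to the prior's parameters trains the prior to anticipate what the posterior will conclude once the observation arrives.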

Miscellaneous Loose Ends: There's nothing particularly special about a human body, a living plant or even a bacterial colony with respect to minimizing free energy: the interaction of each with its environment is consistent with minimizing its free energy in the sense defined by Friston. Conceptually, the idea of minimizing free energy as being the defining characteristic of autonomous living organisms may yet turn out to provide a useful perspective for engineers building autonomous agents [261].

Neither is there anything special about the envelope / boundary corresponding to the body of an animal being functionally modeled as a Markov blanket⁶³. Many other boundary-defining body parts and cellular ensembles occurring at multiple scales can be modeled the same way, and hence they too can be said to evolve to minimize their free energy. We need not observe this requirement at all scales; being able to do so, however, may be an emergent property of naturally evolved biological machines [276].

I've been rereading Hinton's publications — co-authored with Dayan, Neal, Zemel — on Helmholtz machines and their properties. These papers are important contributions to our understanding of biological and artificial neural networks, and I recommend you read them at some point in your studies. However, for the purpose of this course and your final projects, I think you would be better served using the idea of minimizing free-energy as a design principle rather than as a source of any specific neural network architectures or engineering advice (see for example Neal and Hinton [373]⁶⁴).

March 15, 2020

%%% Wed Mar 18 03:59:36 PDT 2020

I suppose it was inevitable but with all of this spare time on my calendar I went back and read some of Karl Friston's papers that I'd suggested to my students, but had only cursorily looked at myself. As I expect you can appreciate, following the citation trail I got lost in the dense thicket of Friston's voluminous output and despaired until I arrived at Geoff and Peter's Helmholtz machine papers and related variational inference papers including the familiar work of Mike Jordan [255], Kingma and Welling et al [273, 274], etc. Suddenly Greg's paper on MERLIN made all the more sense. It's been on my assigned reading list for CS379C for the last few years for its elegance and control-theoretic underpinnings, but now I can motivate it using my revised interpretation of MERLIN, supplemented with Neal and Hinton [1993] and a recent, accessible survey-like paper by Sam Gershman [190].

Excerpts from discussion concerning Friston's FEP as a model, theory, principle, framework:

TLD: I found the Carroll (Mindscape) interview with Friston interesting, but I came away with the conclusion that the free energy principle is less a unified model or theory (the term "principle" is misleading / ambiguous given its usage in Friston's papers) and more a framework (collection of design principles) for exploring a class of models, that allow researchers to instantiate the framework components, e.g., using different priors or inference algorithms, and then examine the consequences of the resulting fully-instantiated models.
Confusion arises when the reader, having read one paper making one set of claims, reads another paper making a different set of claims but doesn't realize that the two sets of claims apply to different fully-instantiated models and may be irreconcilable. Among other concerns, it is not clear, except perhaps to someone who has followed Friston's work assiduously (a somewhat daunting challenge given the sheer volume of papers Friston has written or co-authored), whether the combined set of claims applies to either of the fully-instantiated models.

Sam Gershman's recent paper [190] entitled "What does the free energy principle tell us about the brain?" does a good job of teasing apart the issues and providing a bit of much needed clarity concerning some aspects of the conundrum facing readers wanting to understand the free energy principle more deeply. Peter Battaglia and Nathaniel Daw were acknowledged for their feedback and Greg Wayne's MERLIN instantiates some features of FEP and exhibits some of the properties that FEP aspires to — I wonder what they think about Friston's principle and Gershman's critique?

MJF: I actually have very similar thoughts about the FEP as you do, and also thought Sam's article was useful for distilling it. I've had some discussions with Karl about it as regards the active inference part of it and how he uses it to revise theories about dopamine in contradistinction to RL — which I didn't agree with (though do find active inference itself interesting). And as a general framework that could apply to many phenomena in life (and not just the brain) it can also be stimulating to think about, more from a philosophical than scientific standpoint.

Indeed as you point out the instantiations of the FEP in terms of specific implementations do not all accord with each other, and the FEP as applied to the brain/mind is not really falsifiable — and that doesn't really bother Karl. He admitted it was not "a theory to be falsified but is instead a mathematical truism" in a debate he had about it at the Cognitive Computational Neuroscience meeting last summer. AFAIK that was the first time he publicly admitted it as not really a scientific theory but intended as a higher level framework.

March 13, 2020

%%% Fri Mar 13 03:42:01 PDT 2020

Welcome to CS 379C. If you're receiving this, then you are currently enrolled in the class for the Spring quarter. My name is Tom Dean and I will be your instructor / professor for the class. Rafael Rafailov will be the class teaching assistant and is familiar with the course materials and final project details having participated in the class last year.

We've been preparing to run the course online in keeping with Stanford's efforts to mitigate the risk of contagion from coronavirus. We've used Zoom in previous years and found it easy to use and sufficiently full-featured to handle our AV requirements. Our plan is to use Zoom again this year and conduct all of our class and team meetings using its various features.

Apart from my two introductory lectures, most meetings will involve preassigned readings and student-led discussions on topics relating to computational cognitive neuroscience and its applications. Around the midterm, Rafael and I will work with you individually or in teams to develop proposals for interesting projects feasible in the remaining weeks of class.

Most years, a subset of students taking the class and I write an arXiv paper summarizing what we learned in class and working on final projects. In the summer and fall following last year's class, I met regularly with three students to develop the content and write such a paper; it's a work in progress but you are welcome to take a look at the latest draft here.

In addition to the arXiv paper, I spent much of the last few months putting together the syllabus and producing additional materials that should prove useful to you in quickly getting up to speed on the relevant concepts from systems and cognitive neuroscience as well as the various technologies that you are likely to be using in developing your final projects.

Below you'll find descriptions of these materials and links to the most recent drafts:

An introduction to the concepts, methods and technologies covered;
The current draft of the lecture slides for the first two days of class;
The initial calendar listing the discussion section topics and dates;
A short primer on how to run a student-guided discussion section;
A document that is part biography and part explanation of the course;

March 12, 2020

%%% Thu Mar 12 04:07:34 PDT 2020

Here are drafts of four documents that I'm preparing for this year's class:

The updated calendar listing the discussion section topics and dates;
A short primer on how to run a student-guided discussion section;
The current draft of the lecture slides for the first two days of class;
A document that is part biography and part explanation of the course;

In the first lecture I briefly discuss how the engineers who built the first fully programmable computers — including the ENIAC, EDVAC and the machine built at the Institute for Advanced Study — had to work hard to get the vacuum tubes and other circuit components of the day to work reliably long enough to perform useful computations. Their struggle prompted John von Neumann to develop the theory in Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components [506] that, among other fundamental ideas [401], is the basis for the error-correcting technology in modern computers.

Von Neumann linked his theory to Claude Shannon's A mathematical theory of communication [451] and understood before anyone else that information and computation were essentially equivalent as a consequence of the work of Alonzo Church, Kurt Gödel and Alan Turing. Von Neumann drew inspiration from biological systems and was able to provide insights into our understanding of biological computation by formalizing the concept of self-reproduction in his Theory of Self-Reproducing Automata [508]. He also understood the complementary roles of "digital" and "analogy" computation in much the same way that McClelland, McNaughton and O'Reilly [347, 383] articulated in their complementary learning systems framework:

No matter how long you wait, numbers will never become organisms, just as nucleotides will never become proteins. Once the translation between genotype and phenotype is launched, evolution picks up speed — not only the evolution of the resulting organisms, but the evolution of the genetic language and translation system itself. A successful interpretive language both tolerates ambiguity and takes advantage of it. "A language which has maximum compression would actually be completely unsuited to conveying information beyond a certain degree of complexity, because you could never find out whether a text is right or wrong," von Neumann explained in the third of five lectures he gave at the University of Illinois in December 1949, where a copy of the MANIAC was being built.
"I would suspect, that a truly efficient and economical organism is a combination of the 'digital' and 'analogy' principle," he wrote in his preliminary notes on "Reliable organizations of unreliable elements" (1951). "The 'analogy' procedure loses precision, and thereby endanger significance, rather fast ... hence the 'analogy' method can probably not be used by itself — 'digital' re-standardizations will from time to time have to be interposed." On the eve of the discovery of how the reproduction of living organisms is coordinated by the replication of strings of instructions encoded as DNA, von Neumann emphasized that for complex organisms to survive in a noisy, unpredictable environment, they would have to periodically reproduce fresh copies of themselves using digital, error correcting codes.

For complementary reasons, digital organisms — whether strings of nucleotides or strings of binary code — may find it advantageous to translate themselves, periodically, into analog, nondigital form, so that tolerance for ambiguity, the introduction of nonfatal errors, and the ability to gather tangible resources can replenish their existence in the purely digital domain. If "every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond" von Neumann explains in his fourth lecture at the University of Illinois. "This is a completely different philosophy from the philosophy which proclaims that the end of the world is at hand as soon as the first error has occurred." Excerpt from George Dyson's excellent "Turing's Cathedral" [150]

I wrote the following before writing the biographical piece mentioned above and then couldn't find a suitable place for it and just parked it here: Before going on I think it might help to tell you a little about myself so you know where I'm coming from and why I'm focusing on these problems in the particular way I'm constraining them. I'll begin with a summary of my reasons for optimism concerning the near-term — next 10-20 years — prospects for making substantial progress in building AI systems modeled after the human brain:

I've been interested in neuroscience for over 40 years, but only now do I think there is enough known about the brains of mammals with any certainty to warrant my investing in it full time.
I'm interested in learning about human brains, because I want to understand mine and because understanding the human brain is arguably the most interesting problem in science today.
Why now and not 40 years ago? In a word "technology", including fMRI for awake behaving subjects, 2PEF (MPEF) for non-human studies⁶⁵, and FIBSEM for connectomic reconstruction of fixed tissue samples⁶⁶.
There is one other enabling technology that we take for granted but without which all of the others would not be possible: sixty years of sustained exponential growth in computing⁶⁷.
My research plan is to look at the whole brain for ideas about how to build end-to-end systems that achieve or surpass human intelligence or that serve as prosthetic extensions.
The engineering strategy is to apply current artificial neural network technology wherever possible or determine why it is not possible and find biologically inspired alternatives.
One key prediction is that this strategy will lead quickly to the development of such systems yielding useful technology in the near term as well as insights into human cognition.

Miscellaneous Loose Ends: Stanislaw Ulam is credited with inventing the Monte Carlo method for simulating physical systems with many degrees of freedom. Ulam refined his original idea working with John von Neumann on the problem of simulating nuclear reactions relating to nuclear weapons work carried out at Los Alamos, and the first implementations were run on the ENIAC computer. The best-known descendant of the method is generally known today as the Metropolis-Hastings algorithm, named for Nicholas Metropolis, who worked on the early implementations, and W. K. Hastings, who later generalized the method. You might find it interesting to listen to these two interviews with Ulam — one in 1979 and the other in 1983 — when he was visiting Los Alamos, long after his work there during World War II.

February 23, 2020

%%% Sun Feb 23 03:08:33 PST 2020

What follows is the latest in a series of attempts to produce a credible computational account of Fuster's hierarchy with the focus on how such a hierarchy might be learned. This attempt emphasizes the role of the reciprocal connections linking the sensory and motor stacks. The proposed solution explains how predictive coding might be utilized to recognize and repair action plans by detecting anomalies that warrant the integration of activities that ameliorate the consequences of predicted unmet perceptual requirements, prerequisite conflicts and previously unrecognized opportunities to act.

Figure 18: On the left is a depiction of Fuster's hierarchy [185]. The motor stack is on the left and the sensory stack on the right. Light blue blocks correspond to vector representations of activity at the different levels in the hierarchy. Light green blocks with rounded corners correspond to neural networks that construct motor and sensory feature maps. The light orange blocks correspond to components involved in predictive coding and subsequent analysis relating to anomaly identification and resolution broadly construed.

On the right is a more detailed diagram showing an expanded view of the components that comprise level i and part of level i−1 of the hierarchy, including the bidirectional predictive coding (BPC) unit responsible for learning to predict motor features from sensory features and vice versa, and the prediction anomaly resolution (PAR) unit responsible for identifying, analyzing and resolving problems with assistance from networks in the motor and prefrontal cortex associated with (tactical) action selection and (strategic) executive control and goal-based planning.

Let μⁱ_t denote the representation (pattern of activity) in level i of the motor stack in Fuster's hierarchy at time t, and σⁱ_t the corresponding representation in the sensory stack. One unit of time — the interval from t to t + 1 — corresponds to one complete iteration of the perception-action cycle [185] in which activity propagates throughout the hierarchy, "up" the sensory stack from the bottom level — corresponding to direct input from the sensors — to the top of the hierarchy and then back "down" to the bottom of the motor stack — corresponding to the direct output to a game console, robot servomotors or other extensions whereby the agent can influence its environment. Let μ̂ⁱ_t (σ̂ⁱ_t) denote the prediction of the associated motor (sensory) representations.

Figure 18 depicts Fuster's hierarchy along with an expanded view of the neural network architecture including level i and part of level i−1 of the hierarchy. The network labeled BPC — bidirectional predictive coding — implements two predictive models — the treatment here is agnostic concerning whether these are implemented as some variant of predictive coding or as a bidirectional transformer model of the sort we discussed on Saturday and mentioned earlier in this log. The BPC consists of a motor path responsible for learning to predict motor representations from sensory representations and a sensor path for learning to predict sensor representations from motor representations.

At the completion of iteration t of the perception-action cycle, the sensor representation σⁱ_t in level i of the sensor stack is compared with the prediction σ̂ⁱ_t generated by the sensor path through the BPC unit. Likewise, the motor representation μⁱ⁻¹_t in level i−1 of the motor stack — note the offset in level — at the completion of iteration t of the perception-action cycle is compared with the prediction μ̂ⁱ⁻¹_t generated by the motor path through the BPC unit.
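The comparison described above can be sketched in a few lines of code. This is only a minimal sketch under strong assumptions: each path through the BPC unit is taken to be a single linear map trained online with the delta rule, whereas the discussion here is deliberately agnostic about the actual predictive model. The names `BPCUnit`, `matvec`, `sq_error` and `demo`, the dimensions, the learning rate and the toy data are all hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sq_error(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

class BPCUnit:
    """Hypothetical stand-in for the BPC unit between levels i and i-1.

    Assumption: each path is a single linear map trained with the delta
    rule; the log does not commit to any particular predictive model.
    """

    def __init__(self, sensory_dim, motor_dim, lr=0.05, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.W_motor = init(motor_dim, sensory_dim)   # motor path: sigma -> mu_hat
        self.W_sensor = init(sensory_dim, motor_dim)  # sensor path: mu -> sigma_hat
        self.lr = lr

    def _update(self, W, x, target, pred):
        """Delta rule: nudge W so that W applied to x moves toward target."""
        for r, row in enumerate(W):
            step = self.lr * (target[r] - pred[r])
            for c in range(len(row)):
                row[c] += step * x[c]

    def step(self, sigma, mu):
        """Compare both predictions with the observed representations.

        Returns (motor_error, sensor_error) measured before the update.
        """
        mu_hat = matvec(self.W_motor, sigma)
        sigma_hat = matvec(self.W_sensor, mu)
        errors = (sq_error(mu_hat, mu), sq_error(sigma_hat, sigma))
        self._update(self.W_motor, sigma, mu, mu_hat)
        self._update(self.W_sensor, mu, sigma, sigma_hat)
        return errors

def demo(steps=2000, seed=1):
    """Train on toy data in which mu is an invertible linear function of sigma."""
    rng = random.Random(seed)
    unit = BPCUnit(sensory_dim=2, motor_dim=2)
    history = []
    for _ in range(steps):
        sigma = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        mu = [sigma[0] + sigma[1], sigma[0] - sigma[1]]
        history.append(unit.step(sigma, mu))
    return history

if __name__ == "__main__":
    history = demo()
    print("first errors:", history[0])
    print("last errors: ", history[-1])
```

In this toy setting both prediction errors shrink toward zero because the two representations are exact linear functions of one another. In the architecture discussed here the representations are themselves being learned, so the errors would be expected to fluctuate, and those fluctuations are what the PAR unit would analyze.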

In keeping with the anterior-posterior dichotomy proposed by Vladimir Betz [297], the networks comprising the sensor stack are primarily in posterior cortex and especially the association areas located in the parietal and temporal lobes, while those comprising the motor stack are in anterior cortex and include circuits in motor cortex consisting of the primary, premotor and supplementary motor areas and analogous areas in the prefrontal cortex [42]. The basal ganglia and prefrontal cortex control timing by inhibiting or gating the propagation of activity in the motor cortex and executive areas of the prefrontal cortex [371].

This orchestration of multiple circuits spanning the entire cortex is essential to ensure that bottom-up and top-down information processing are allowed to converge in accord with both the compositional structure implicit in the sensory stack and the sensory and motor conditions for acting, as determined by the prerequisite structure implicit within the individual stacks and the reciprocal connections linking them together.

The two processing threads (motor ⇒ sensory and sensory ⇒ motor) that comprise the BPC unit serve to notice anomalies warranting the integration of activities that ameliorate the consequences of predicted unmet perceptual requirements, prerequisite conflicts and previously unrecognized opportunities to act. The neural circuits that implement these two threads and the concomitant circuits that mitigate said consequences are distributed throughout the cortex including the dorsolateral prefrontal cortex and frontopolar cortex [40] and subcortical areas including the basal ganglia [481].

The conceptual framework linking perceptual and motor representations and an account of exactly where the necessary functions are implemented in the brain, i.e., the associated neural correlates, are beyond the scope of this discussion. The literature goes back to William James, as do so many topics in psychology and cognitive science, but you might start with the common coding theory first articulated by Wolfgang Prinz [406]⁶⁸.

Note: The predictive models in the BPC unit also have to be trained. Moreover, since both the sensory representation input and the motor representation output are changing, the predictive models have to be continuously altered. The internal dynamics of the motor and sensory networks determine the stochastic process that the BPC is trying to model. Given that the distribution governing this process changes over time, the process is said to be non-stationary [298].

The challenges of learning non-stationary processes can be finessed to some extent by providing a burn-in period during which the motor and sensory representations are allowed to stabilize before relying on the predictive models. Subsequent to the burn-in period, there are still challenges adjusting the predictive models while concurrently using them to analyze the behavior of the motor and sensor models, but similar challenges are common in modeling neural systems.
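One way to picture the burn-in idea is as a monitor that tracks prediction error from the start but refuses to report anomalies until the burn-in period has elapsed. The sketch below is an illustration only: the class `BurnInMonitor`, the helper `flag_anomalies`, the exponential moving average used as a drifting baseline, and all parameter values are assumptions introduced here, not part of the proposal above.

```python
class BurnInMonitor:
    """Flag unusually large prediction errors, but only after a burn-in period.

    Assumption: an exponential moving average (EMA) of past errors serves
    as the baseline; because the EMA forgets old values, the baseline can
    follow a slowly changing (non-stationary) error distribution.
    """

    def __init__(self, burn_in=50, decay=0.9, ratio=3.0):
        self.burn_in = burn_in    # number of initial steps during which nothing is flagged
        self.decay = decay        # EMA decay; smaller values forget faster
        self.ratio = ratio        # error must exceed ratio * baseline to be flagged
        self.baseline = None
        self.steps = 0

    def observe(self, error):
        """Record one prediction error; return True if it counts as an anomaly."""
        self.steps += 1
        if self.baseline is None:
            self.baseline = error
            return False
        is_anomaly = (self.steps > self.burn_in
                      and error > self.ratio * self.baseline)
        # The baseline is updated during burn-in too, so it is ready afterwards.
        self.baseline = self.decay * self.baseline + (1 - self.decay) * error
        return is_anomaly

def flag_anomalies(errors, **kwargs):
    """Return the indices of the errors that the monitor flags."""
    monitor = BurnInMonitor(**kwargs)
    return [t for t, e in enumerate(errors) if monitor.observe(e)]

# Toy error trace: errors decay as the representations stabilize.
errors = [0.95 ** t for t in range(100)]
errors[10] = 5.0   # spike while the representations are still settling
errors[80] = 1.0   # spike after the burn-in period
flagged = flag_anomalies(errors, burn_in=50)
```

With a burn-in of 50 steps only the later spike is reported; setting `burn_in=0` would also report the early spike, which in this toy trace is an artifact of the representations still settling.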

Related Work

Alexander and Brown [15] — [A] simple computational motif based on predictive coding can be stacked hierarchically to learn and perform arbitrarily complex goal-directed behavior. The resulting [...] model simulates a wide array of [...] studies of both lateral and medial prefrontal cortex [...] hierarchical predictive coding [...] reconceptualizing lateral prefrontal activity as anticipating prediction errors [...]

De Wit et al [117] — Predictive coding is efficient in the sense that the brain does not need to maintain multiple versions of the same information at different levels of the processing hierarchy. Indeed, activation in early sensory areas no longer represents sensory information per se, but only that part of the input that has not been successfully predicted by higher level areas. The activity in lower level areas can therefore be considered an "error signal" that updates the predictions at higher areas and guides learning.

Kiebel et al [269] — Many aspects of brain function can be understood in terms of a hierarchy of temporal scales at which representations of the environment evolve. The lowest level of this hierarchy corresponds to fast fluctuations associated with sensory processing, whereas the highest levels encode slow contextual changes in the environment, under which faster representations unfold. [We present] a mathematical model that exploits the temporal structure of fast sensory input to track the slower trajectories of their underlying causes.

Lotter et al [331] — Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. [...] These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects [...]

Stachenfeld et al [468] — [E]vidence for predictive coding, reward sensitivity and policy dependence in place cells suggests that the representation is not purely spatial. We approach this puzzle from a reinforcement learning perspective: what kind of spatial representation is most useful for maximizing future reward? We show that the answer takes the form of a predictive representation. This representation captures many aspects of place cell responses that fall outside the traditional view of a cognitive map. Furthermore, we argue that entorhinal grid cells encode a low-dimensionality basis set for the predictive representation, useful for suppressing noise in predictions and extracting multiscale structure for hierarchical planning.

Van den Oord et al [498] — [W]e propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding. The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models. We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples.

February 19, 2020

%%% Wed Feb 19 03:45:18 PST 2020

I'm still thinking about our discussion on Saturday about training neural-network architectures patterned after Fuster's hierarchy. I made a first effort at a figure depicting the basic wiring diagram and puzzled over what exactly it would mean for sensory and motor features to reciprocally influence one another during training. The new diagram is shown in Figure 18 and I will attempt to explain it in the remainder of this entry. Along the way, I will include a few particularly relevant excerpts from Fuster [185]:

Corresponding areas of the two hierarchies, sensory and motor, on each side of the central sulcus, are connected by reciprocal connections. Each successive area in the posterior cortical pathways for three major sensory modalities — somesthesis⁶⁹, vision, and audition — sends collateral efferent connections to a progressively more rostral frontal area; all such connections are reciprocated by others in the opposite direction. Page 404 [184]

In earlier posts, we provided a few examples suggesting that features in the visual cortex are specifically designed to enable certain socially and environmentally relevant activities. Assuming that the reciprocal connections between the motor and sensory subnetworks in each level of Fuster's hierarchy exist — at least in part — to support such dependencies, what principles might we employ to design the neural-network instantiations of these connections? Reciprocity implies that in addition to learning sensory features that facilitate action selection, so too, motor features arise that drive the information gathering activity required to instantiate such sensory features.

You can think of the task of attending to specific features of the environment that make possible performing certain actions as a form of physical attention just as attending to features that, say, correspond to words or their meaning that occurred earlier in a dialogue make possible interpretation of the last utterance issued by your partner in the dialogue. Figuring out the referent of your interlocutor's use of the word "she" can often require a good deal of clever inference — cleverness that we take for granted in something "so simple as" understanding one another's utterances. To determine that "she" refers to Fred's grandmother might require that we scan earlier utterances for the names of females or designations indicating roles that might be filled by females, and check to see if the reference might include someone participating in the current conversation or present in the immediate vicinity.

The neural networks that we design increasingly require some form of attention to resolve ambiguity in context [502]. Wayne et al [523] implicitly invoke attention in training predictive models to anticipate future needs for information in the case of partially observable environments. The title of Greg's paper "Unsupervised Predictive Memory in a Goal-Directed Agent" echoes the view espoused by Alan Baddeley [36] and Joaquín Fuster [185] that working memory is a form of sustained attention focussed on executive cognition for processing of prospective action — see Page 394 of [184].

Wayne's model is less proactive and investigative in acquiring state information than is likely to be required for successfully dealing with many realistic environments. In general, it would seem that any agent operating in a dynamic environment would constantly be gathering information that might be useful, discovering new opportunities and sources of such information, and figuring out new strategies for gathering information or exploiting naturally occurring affordances such as local landmarks for navigation and the advantage of higher ground for the purpose of reconnaissance.

Most voluntary and deliberate behavior is made of simple and automatic acts, old habits integrated at hierarchically lower cortices and basal ganglia. If the behavior is new and elaborate, however, its most critical constituents are the attentive acts that palpate the environment in search of significant clues, the intentional and elaborate movements, the continuous monitoring and updating of relevant information, and the referring of that information to the schema of the action and its goal. [...] The perceptual set is the selective anticipatory priming of a sector of the sensory system(s) for an expected sensory percept at the onset or in the course of goal-directed behavior or language. Page 392 [184]

In terms of building a version of Fuster's hierarchy that facilitates this commingling of perception and action, there are two (complementary) desiderata we have to consider: (a) we want to learn sensory features that depend upon and enable the selection of the next action, and (b) we want to learn motor features that depend on and enable the prediction of those state variables that are required to plan for and execute the next action.

Figure 18: On the left is a depiction of Fuster's hierarchy. The motor stack is on the left and the sensory stack on the right. Light blue blocks correspond to representations of activity at the different levels in the hierarchy. Light green blocks with rounded corners correspond to neural networks that construct motor and sensory feature maps. On the right is a more detailed diagram showing an expanded view of the components that comprise levels i and i−1 of the hierarchy, including the reciprocal bidirectional unit (RBU) responsible for learning to predict motor features from sensory features and sensory features from motor features. See the main text for more detail concerning the notation shown and the functional description of the various component networks.

Let μⁱ_t denote the representation (pattern of activity) in the ith level of the motor stack in Fuster's hierarchy at time t, and σⁱ_t the corresponding representation in the sensory stack. For the purpose of this discussion, assume that in one unit of time — from t to t + 1 — activity propagates throughout the network, and, in particular, "up" the sensory stack from the bottom — input from the sensors — to the top of the hierarchy, including all reciprocal connections, and then back "down" to the bottom of the motor stack — output to the game console, robot servomotors or other extensions whereby the agent can influence its environment. Let μ̂ⁱ_t+1 (σ̂ⁱ_t+1) denote the prediction of the associated motor (sensory) representation.

Figure 18 depicts Fuster's hierarchy along with an expanded view of the neural network architecture of the ith level and part of the i−1 level in the hierarchy. The network labeled RBU — Reciprocal Bidirectional Unit — implements two predictive models — the treatment here is agnostic concerning whether these are implemented as some variant of predictive coding or as a bidirectional transformer model of the sort we discussed on Saturday and mentioned earlier in this log. The RBU is responsible for learning to predict motor features from sensory features and sensory features from motor features.

The motor (sensory) predictions μ̂ⁱ_t+1 (σ̂ⁱ_t+1) are used to construct a prior that can be combined with the current sensory (motor) representation σⁱ_t (μⁱ_t) to produce the motor (sensory) representation μⁱ⁻¹_t (σⁱ⁻¹_t) at the immediately lower (i−1) level in the hierarchy. While we provide no specific details regarding how to implement the aforementioned priors, it seems reasonable to suggest that any practical implementation will involve one or more intermediate layers between the motor (sensory) representation and its paired sensory (motor) prediction.
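To make the idea of combining a prior with current evidence a little more concrete, here is one possible reading, offered purely as an assumption: treat the prediction as the mean of a Gaussian prior and the output of the intermediate layers as the mean of a Gaussian likelihood, and combine them with a precision-weighted average. The function name `combine_with_prior` and both precision parameters are hypothetical, and the sketch assumes the intermediate layers have already mapped the evidence into the same vector space as the prediction.

```python
def combine_with_prior(prior_mean, evidence_mean,
                       prior_precision=1.0, evidence_precision=1.0):
    """Precision-weighted average of a top-down prior and current evidence.

    prior_mean    -- prediction from the reciprocal unit (e.g. a motor prediction)
    evidence_mean -- estimate derived from the current representation by
                     hypothetical intermediate layers, in the same space
    The precisions (inverse variances) say how much each source is trusted.
    """
    if len(prior_mean) != len(evidence_mean):
        raise ValueError("prior and evidence must have the same length")
    total = prior_precision + evidence_precision
    return [(prior_precision * p + evidence_precision * e) / total
            for p, e in zip(prior_mean, evidence_mean)]

# Evidence trusted three times as much as the prior:
lower_level = combine_with_prior([0.0, 2.0], [1.0, 2.0],
                                 prior_precision=1.0, evidence_precision=3.0)
```

Where the prior and the evidence agree, the result is unchanged; where they disagree, it lies between the two and closer to whichever source has the higher precision. A learned implementation could replace the fixed precisions with gating weights produced by the intermediate layers.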

February 18, 2020

%%% Tue Feb 18 05:08:22 PST 2020

I've been going over our discussion yesterday and thinking about the role that BERT-like architectures might play while also reviewing suggestions about solutions based on some form of predictive coding or a simpler version of Hebbian learning used to train a forward model. You might find it useful to read Chapter 8 Section B. "Functional Anatomy of the Perception–Action Cycle" — Pages 404-407 [184] (attached).

Fuster's description focuses almost exclusively on the anatomy and biological functions of the nervous system and as such it serves primarily as a cartoon computational account. Nevertheless, you might find some inspiration in reading his account. I also recommend skimming Kiebel et al [269] and Alexander and Brown [15] — (attached as well), not for the implementation details so much as for their arguments vis a vis the role of predictive coding and the relevance of the free-energy principle as it accords with — or not — Fuster's account.

February 17, 2020

%%% Mon Feb 17 04:32:50 PST 2020

Chaofei, Gene, Meg, Yash and I spent a couple of hours on Saturday morning hashing out our ideas on how to train architectures patterned after Fuster's hierarchy. We talked about predictive coding options but then spent most of the time thinking about BERT related solutions [137]. Here are a couple of posts on Jay Alammar's "Visualizing Machine Learning One Concept at a Time" blog that are particularly relevant to our discussion:

February 12, 2020

%%% Wed Feb 12 04:47:47 PST 2020

Here's a review of BERT trained models. It includes the two related papers listed below. I know Cordelia and Kevin and just sent them a message inquiring if they know of any work training models with an architecture like Fuster's hierarchy. Their original arXiv paper and follow-on papers are the closest I've found so far, but I expect there have been quite a few new papers published since the above review was posted last June — I submitted an arXiv query just now to search for papers whose abstract includes "BERT" and the server returned 591 results — I've only scanned the first page of (50) results. In any case, there is plenty more to be found if you're interested.

@article{SunetalCoRR-19a,
       author = {Chen Sun and Austin Myers and Carl Vondrick and Kevin Murphy and Cordelia Schmid},
        title = {VideoBERT: {A} Joint Model for Video and Language Representation Learning},
      journal = {CoRR},
       volume = {arXiv:1904.01766},
         year = {2019},
     abstract = {Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.}
}
@article{SunetalCoRR-19b,
       author = {Chen Sun and Fabien Baradel and Kevin Murphy and Cordelia Schmid},
        title = {Contrastive Bidirectional Transformer for Temporal Representation Learning},
      journal = {CoRR},
       volume = {arXiv:1906.05743},
         year = {2019},
     abstract = {This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods. Our method extends the BERT model for text sequences to the case of sequences of real-valued feature vectors, by replacing the softmax loss with noise contrastive estimation (NCE). We also show how to learn representations from sequences of visual features and sequences of words derived from ASR (automatic speech recognition), and show that such cross-modal training (when possible) helps even more.},
}
@article{ShangetalCoRR-19,
       author = {Junyuan Shang and Tengfei Ma and Cao Xiao and Jimeng Sun},
        title = {Pre-training of Graph Augmented Transformers for Medication Recommendation},
      journal = {CoRR},
       volume = {arXiv:1906.00346},
         year = {2019},
     abstract = {Medication recommendation is an important healthcare application. It is commonly formulated as a temporal prediction task. Hence, most existing works only utilize longitudinal electronic health records (EHRs) from a small number of patients with multiple visits ignoring a large number of patients with a single visit (selection bias). Moreover, important hierarchical knowledge such as diagnosis hierarchy is not leveraged in the representation learning process. To address these challenges, we propose G-BERT, a new model to combine the power of Graph Neural Networks (GNNs) and BERT (Bidirectional Encoder Representations from Transformers) for medical code representation and medication recommendation. We use GNNs to represent the internal hierarchical structures of medical codes. Then we integrate the GNN representation into a transformer-based visit encoder and pre-train it on EHR data from patients only with a single visit. The pre-trained visit encoder and representation are then fine-tuned for downstream predictive tasks on longitudinal EHRs from patients with multiple visits. G-BERT is the first to bring the language model pre-training schema into the healthcare domain and it achieved state-of-the-art performance on the medication recommendation task.}
}

February 11, 2020

%%% Tue Feb 11 04:13:25 PST 2020

Arianna gave an interesting talk in McClelland's weekly lab meeting on grounding mathematical / numerical concepts such as natural numbers:

FXY: Mathematical concepts usually have multiple representations. Even the simplest mathematical concepts, such as natural numbers, can be grounded in various ways. For instance, the concept of "five" can be grounded in "five things" (cardinality of a set), a position on a number line, a distance from one point to another point in space, the fifth number word in a verbal count list or the written Arabic numeral "5". Different representations of natural numbers are supported by diverse sensory-motor modalities.
How do children learn to integrate these representations of numbers? And why does the learning outcome of one task sometimes transfer to another task, despite the perceptual difference between these tasks? To systematically answer these questions, we build neural network models that simulate the computation underlying diverse numeric processing tasks. The modeling work explains cross-task transfer in various mathematical learning environments. Our work sheds light on the mechanisms of grounded number sense acquisition in humans.

TLD: Your talk sounds really interesting and very relevant to the problems that I'm working on such as the programmer's apprentice. In designing a system that learns to program by interacting with an expert software engineer, the idea of "number", as well as the idea(s) of "plus", "equals", "variable" and "assignment" are constantly in flux and employed at multiple levels of abstraction.

It is common for programmers to say something like "the first argument must be a number" when it is assumed this requirement is also satisfied if the first argument is any expression that evaluates to an integer or floating point number. What does an expression like "x := y + 1" or a statement like "increment the counter by one" mean? In the latter case, what would it take for a system to learn that this statement can be realized in code as "i := i + 1"?

From a practical point of view, I don't believe that the programmer's apprentice will necessarily have to become proficient in natural language in order to provide value as an assistant. A simple pidgin language supplemented by referential pointing and highlighting code fragments in a listing will go a long way. That and a protocol for demonstrating what an expression "means" by first assigning any unbound variables and then executing the resulting statement in the IDE could go some way toward learning program semantics.
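
The demonstration protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the apprentice's actual machinery: the helper names to_python, unbound_names and demonstrate are hypothetical, and the pidgin assignment operator ":=" is assumed to mean plain assignment. The sketch finds the variables that a single statement reads, insists that they be bound first, then executes the statement and returns the resulting environment.

```python
import ast


def to_python(stmt):
    # Assumption: the pidgin ":=" is ordinary assignment, written "=" in Python.
    return stmt.replace(":=", "=")


def unbound_names(stmt):
    """Return the names that a single statement reads (and so must be bound)."""
    tree = ast.parse(to_python(stmt))
    return {node.id for node in ast.walk(tree)
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)}


def demonstrate(stmt, bindings):
    """Bind the variables the statement reads, execute it, return the environment."""
    missing = unbound_names(stmt) - set(bindings)
    if missing:
        raise NameError("bind these first: " + ", ".join(sorted(missing)))
    env = dict(bindings)
    # Builtins are emptied so the statement can only touch the supplied bindings.
    exec(to_python(stmt), {"__builtins__": {}}, env)
    return env


print(demonstrate("i := i + 1", {"i": 4}))   # {'i': 5}
print(demonstrate("x := y + 1", {"y": 2}))   # {'y': 2, 'x': 3}
```

Comparing the environment before and after execution is one simple way to show what "increment the counter by one" means without relying on natural language.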

Miscellaneous Loose Ends: Here are some of the features of the programmer's apprentice FIDE. The PA can see the value of every symbol in both the global environment (symbol table) and the local environment (lexical closure) of the currently executing procedure.

In the latter case, that would be the environment of a procedure in which an exception is raised, a break is encountered or an error occurs in a program running in the debugger. The assistant can also "see the program counter" in the sense of having direct access to the node (subtree) in the abstract syntax tree of the currently running program corresponding to the top of the (LIFO) call stack. There might be a clever way to exploit code completion.
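
As a rough illustration of what "seeing" the environment and the program counter could amount to, here is a sketch that uses only the Python standard library. It is not the FIDE itself: the function name snapshot_at_error, the filename "<apprentice>" and the toy program are all hypothetical. When the program fails, the sketch walks the traceback to the innermost frame and reports that frame's local and global symbols together with the AST statement node on the failing line.

```python
import ast

# Toy program (hypothetical): averaging an empty list divides by zero.
SOURCE = '''
def average(values):
    total = sum(values)
    count = len(values)
    return total / count

result = average([])
'''


def snapshot_at_error(source, filename="<apprentice>"):
    """Run source; on an error, describe the innermost frame and its AST node."""
    tree = ast.parse(source, filename=filename)
    env = {}
    try:
        exec(compile(tree, filename, "exec"), env)
    except Exception as exc:
        tb = exc.__traceback__
        while tb.tb_next is not None:      # walk to the innermost frame
            tb = tb.tb_next
        frame = tb.tb_frame
        # Statement node that starts on the line where execution stopped.
        node = next(n for n in ast.walk(tree)
                    if isinstance(n, ast.stmt) and n.lineno == tb.tb_lineno)
        return {
            "error": type(exc).__name__,
            "procedure": frame.f_code.co_name,
            "locals": dict(frame.f_locals),
            "globals": sorted(k for k in frame.f_globals
                              if not k.startswith("__")),
            "ast_node": ast.dump(node),
        }
    return None


snap = snapshot_at_error(SOURCE)
print(snap["error"], "in", snap["procedure"], "with locals", snap["locals"])
```

A real debugger hook would expose much more (breakpoints, stepping, every frame on the stack), but the same three ingredients appear: the symbol tables, the active frame and the corresponding subtree of the program.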

All of these features of the running code are effectively perceived but not necessarily attended to by the apprentice. In addition, the programmer can say the name of a variable or function and use the mouse and shared screen to point to a variable or highlight a region of the screen to disambiguate something said. The apprentice will have a speech-to-text transcript of whatever the programmer says, but we assume this mode of interaction is reserved for relatively simple communication.

February 9, 2020

%%% Sun Feb 09 04:12:49 PST 2020

Subject: two updates (a) new take on variable binding and (b) training networks like Fuster's hierarchy with bidirectional connections ...

... starting with (b) as it is shorter ...

During my visit to Google last week, I asked everyone I met with if they had attempted to train networks with bidirectional connections of the sort we encounter in Fuster's hierarchy. I got a lot of answers, most of them ones we've already thought about, but I was consistently asked if I'd considered BERT [137]. I wouldn't be surprised if you have, and if so I'd like to know if you consider it a match for our application. If you haven't, this Google blog post by Devlin et al provides a quick introduction.

... turning to (a) ...

Chaofei, Gene, Yash and I met for three hours yesterday in an attempt to translate Randy's model of variable binding in the context of PBWM into a more conventional deep network architecture. The result was less than satisfying in large part because I stubbornly pressed for as general a model as possible despite Randy's repeated caveats regarding the generality of his model. My conclusion from our discussion on Saturday is that, while something like Randy's model may be an accurate account of BG-PFC activity in supporting a form of variable binding, the human facility with language and symbolic / combinatorial processing likely depends on other, more general mechanisms located elsewhere.

Relating to Chaofei's mention of Dan's comment concerning dynamic binding and the hippocampal place cells, two of the papers I sent around last week consider multiple candidates for the neural correlates of dynamic binding and, in particular, consider the putative role of the hippocampus in supporting different aspects of binding information in the human brain. The first of these is from Randy and Daniel Cer [86] and is explored further in his collaborations with Kenneth Norman [378]. From the abstract of [86]:

The posterior cortex can learn coarse-coded distributed representations [...] of low-order conjunctions which can efficiently and systematically bind information in the service of many different forms of cortical information processing. However, these representations are learned slowly over experience. In contrast, the hippocampus is specialized for rapidly binding novel information into high-order conjunctive representations (e.g., episodes or locations). The prefrontal cortex can actively maintain dynamic bindings in working memory and, through more abstract rule-like representations, support more flexible generalization of behaviour across novel task contexts.

This accords well with my intuition and relates to the mechanism for restoring cognitive context following an interruption and the role of narrative as a means of anchoring all sorts of information, and procedural knowledge relating to plans and goals in particular. This method, an extension of or variation on the method of loci, serves as a particular sort of ecological affordance in keeping with the work of J.J. Gibson [194].

The second paper is by Frank van der Velde and Marc De Kamps [500], building on van der Velde's PhD thesis [499], and I'll review this work in subsequent notes.

Miscellaneous Loose Ends: A method for loading gated memory locations in the prefrontal cortex including the selective suppression of perceptual information as a means of shaping the context for action selection in the basal ganglia (BG) and the context-driven reorganization of perceptual information to facilitate location-based variable binding in the prefrontal cortex (PFC). The method is based on the prefrontal basal-ganglia working memory (PBWM) model of O'Reilly and Frank [384] with an emphasis on how perceptual information corresponding to patterns of activity in the association cortex, initially acquired in a block-structured canonical format based on their encoding in thalamic nuclei, is subsequently reorganized in the process of loading into working-memory locations in the prefrontal and motor cortex.

By default, blocks are loaded according to their position in the canonical format employed by the thalamus. Learned exceptions are guided by dedicated circuits in the frontal cortex that have been trained to support content-addressable locations that can be employed to reroute information from its default location determined by thalamic nuclei to alternative locations in the frontal cortex. Upstream information processing circuitry in the frontal cortex can then exploit these programmable locations as registers in which to temporarily store variable assignments and support data structures that employ slot-filler representations.

[Placeholder description only]: The method considered here aims at a significantly more challenging version of the PBWM-inspired approach. S = perceptual state (thalamus), implicitly segmented into N chunks {C_n | 1 ≤ n ≤ N}. B = reciprocal tract (thalamus ⇔ prefrontal cortex), a fully connected matrix mapping any chunk to any register; think about using the identity matrix for long-distance (reciprocal) transfer plus a (local) buffer plus a last-mile (local) fully connected layer for the sorting / routing network.

M = suppression mask (striatum), |M| = |S|, segmented into N chunk-sized submasks {M_n | 1 ≤ n ≤ N}. R = memory registers (prefrontal), |R| = |S|, segmented into N chunk-sized registers {R_n | 1 ≤ n ≤ N}. V = register indices (prefrontal), |V| = N; think of these as the unique memory addresses / names for registers (optional). K = gate assignments (prefrontal), determined by the output of the softmax layer.

A = affinity network (prefrontal), |S| × [...] P = softmax network (prefrontal), which assigns each chunk a memory register corresponding to a cell in a multi-cell LSTM network. For each chunk C_i and register R_j, there is an (implicit) submatrix A_i,j that computes a measure of the affinity of the ith chunk C_i for the jth register R_j. The above model assumes that any chunk in S could be a slot filler (variable assignment) and any register could be a slot (variable location).
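
To make the placeholder description a little more concrete, here is a minimal pure-Python sketch of the chunk-to-register routing step. Several choices are my own assumptions rather than part of the description above: each affinity submatrix A_i,j is reduced to a single weight vector whose dot product with C_i gives a scalar score, the suppression mask M is applied per chunk rather than per unit, registers are loaded softly as probability-weighted sums so that the step stays differentiable, and K is read off as the argmax of each row of P.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(chunks, affinity, mask):
    """Softly load N chunks into N registers.

    chunks[i]      : chunk C_i, a list of d floats
    affinity[i][j] : weight vector (length d) scoring C_i for register R_j
    mask[i]        : 1.0 to pass chunk i, 0.0 to suppress it
    Returns (P, K, R): assignment probabilities, gate assignments, registers.
    """
    n, d = len(chunks), len(chunks[0])
    P = []
    for i in range(n):
        scores = [sum(w * c for w, c in zip(affinity[i][j], chunks[i]))
                  for j in range(n)]
        P.append(softmax(scores))
    # Hard gate assignment: the most probable register for each chunk.
    K = [max(range(n), key=lambda j: P[i][j]) for i in range(n)]
    # Soft load: register j receives a P-weighted, masked sum of the chunks.
    R = [[sum(mask[i] * P[i][j] * chunks[i][k] for i in range(n))
          for k in range(d)] for j in range(n)]
    return P, K, R


# Two chunks whose affinities send chunk 0 to register 1 and chunk 1 to register 0.
chunks = [[1.0, 0.0], [0.0, 1.0]]
affinity = [[[0.0, 0.0], [10.0, 0.0]],
            [[0.0, 10.0], [0.0, 0.0]]]
P, K, R = route(chunks, affinity, mask=[1.0, 1.0])
print(K)   # [1, 0]
```

In a trained network the affinity weights would be learned, and the default thalamic ordering would correspond to affinities that favor the diagonal (chunk i to register i).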

February 03, 2020

%%% Mon Feb  3 03:45:23 PST 2020

Here is part of an email exchange with Chaofei last week:

STF: I was in a meeting with Dan and Doris Tsao yesterday. One problem Doris is working on is how object segmentation is represented in the brain. She and her students are mainly looking at the dorsal and ventral streams. One question Doris kept asking is how the brain can solve the dynamic binding problem, i.e., how one neuron or one population of neurons can represent object A at one time point and object B at another time. Dan's idea is that maybe it's something similar to the hippocampal place cells, which can be remapped in different environments. But they don't mention anything about the PFC or the stripes. It seems to me that the stripes in PFC are perfect neural correlates for the dynamic binding problem. But it's still possible that there may be something more primitive in the visual and hippocampal areas to support this function. Do you have any thoughts on this?
TLD: Regarding the meeting you had with Dan and Doris. I think that there are two issues at play. The first is the variable binding problem and how information from all over the brain is processed in the frontal cortex and in particular the prefrontal cortex. There is plenty of evidence pointing to the basal ganglia serving as a router / multiplexer of sorts and the system of stripes as a collection of registers to temporarily store information and a high-speed bus for moving information around that together implement and manage working memory.

The second is a particular system for encoding, storing and adapting episodic memory that roughly parallels the system for encoding, storing and adapting procedural memory in the cerebellum. Both the cerebellum and hippocampus participate in the process of recording memories in the cortex thereby making them available for driving action selection by engaging the basal ganglia, posterior association cortex and prefrontal executive control systems.

I cobbled together a very busy slide — perhaps it is more accurate to call it a poster — that I can use in explaining what the brain has to offer by way of architectural guidance in building artificial neural networks modeled after the BG, PPC and PFC system that the PBWM model depends on. There is a second slide that I'm working on to cash out that guidance in the process of designing, building and training complex neural networks modeled after the human brain. The "poster" consists largely of figures from O'Reilly et al [389] and the three of you should be able to trace out a rough outline of how my presentation might go.

Thanks for the paper. It underscores an important lesson from the history of science. The metaphor of the drunk looking under the lamppost is certainly apt in this case, though it might be said that we are all working in the dark and frequently drawn like moths to the latest results, heralding them as breakthroughs. I don't know which characterization (as a drunk or as an insect) I would take most issue with. There was a time in which the opinions of biologists and psychologists were deemed more insightful with respect to cognition and behavior.
Ethologists like Konrad Lorenz and Niko Tinbergen explained why animals did what they did on the basis of their drives and the challenges and affordances of the environments they inhabited — see also the work of J.J. Gibson on ecologically-based approaches to understanding visual perception [194, 193]. In many cases the resulting theories provided a direct link between observed behavior and the specific needs of the organism. A case in point being the way that some birds bond instinctively with the first moving object they see within the first hours of hatching — what we now call imprinting.

In the 80s and 90s neuroscientists like Christoph Koch and his students, including Laurent Itti, proposed theories of attention and visual salience in the language of information theory [241, 242]. Meanwhile other researchers in biological vision, like John Tsotsos, proceeding in a bottom-up fashion, attempted to infer models of attention by looking at early visual processing in the striate cortex. But John's work [493, 492] quickly evolved to take into account higher-level brain areas including IT and in doing so began to delve closer and closer to the underlying reasons guiding attention and saccadic eye movements.

The drunk is looking under the lamppost precisely because there is no light elsewhere, irrespective of it being unlikely that the lost item is in the vicinity of the lamppost; in the case of the history of research on visual attention, computer vision people were uncomfortable invoking the idea of individuals having goals in large part because there were no computational models that they felt accounted for goal-driven activity.

In terms of understanding the neural correlates of plans and goals, we still have very little by way of computational models with enough supporting evidence that neuroscientists are willing to base their theories of visual attention on them. I think the situation is starting to change in large part due to scientists relating "drives" in the form of reinforcement signals that can be used to learn, say, a motor program, to abstract conditioned responses that can be used to predict opportunities to act and that serve as the raw material for representing generalized policies of a sort that at least some cognitive scientists would be willing to call "goals".

Of course we don't need high-level goals or even the engagement of the frontal cortex in order to connect notions of salience and visual attention to an organism's physical needs and environmental challenges. Fuster's hierarchy connects perception to actions through multiple levels of abstraction starting from the most primitive behaviors initiated by hardwired instinct-driven responses and propagating upward through layers of increasingly abstract representation — abstract in the sense of being grounded indirectly through learned / latent features that facilitate generalization and high-level reasoning — culminating in layers that consist entirely of abstract features and the models and representations required for reasoning about them.

I'd invest my effort in leveraging Fuster's model and the various related work that blurs the boundaries between perception, attention and representation. As a quick high-level view of Fuster's model of executive control, see the discussion of "drives" in Section B. Attentional Set, e.g., "perceptual set is the selective anticipatory priming of a sector of the sensory system(s) for an expected sensory percept at the onset or in the course of goal-directed behavior or language."

He goes on to say that "with the exception of automatic correction of prediction error [...] perceptual set is the most agile prospective function of the prefrontal cortex". His (borrowed) perspective on memory and attention is summarized in Section C. Working Memory: Fuster agrees with Baddeley's view that working memory and attention are identical, or as Fuster characterizes it, "working memory is indeed a form of attention: sustained attention focused on prefrontal representation for the processing of prospective action."

Also worth considering are variations on the central role of prediction and the monitoring of prediction errors throughout the brain, e.g., "reconceptualizing lateral prefrontal activity as anticipating prediction errors" Alexander and Brown [15, 134] and "learning is based on making predictions about what the senses will report at 100 msec (alpha frequency) intervals, and adapting synaptic weights to improve prediction accuracy" O'Reilly et al [389]. Just contemplating planning and learning in the context of "errors" should make you pause and think about what proportion of our day-to-day information processing is involved with identifying and mitigating and recovering from the consequences of errors, broadly construed, and what fraction of our neural circuitry is dedicated to carrying out the required computations.

TLD: Almost forgot. I included the busy slide / poster to draw your attention to Randy's nice graphical summary of the major pathways — Figure 10.2 in O'Reilly et al [385] — implicated in decision making and the frontal cortex in particular. Note the inputs to the oculomotor (visual attention) and cingulate (including input from HPC and EC) loops. Regarding the latter, it's the posterior cingulate cortex that I was most interested in with respect to Dan's comment:

The posterior cingulate cortex, or PCC, lies just behind the anterior cingulate. Although it is believed the PCC has important roles in cognition and affect, there is some debate as to what exactly those roles are. Neuroimaging studies indicate the PCC is active during the recall of autobiographical memories. It is also activated by emotional stimuli, and thus some have suggested it may be recruited for the recall of memories that have an emotional quality (e.g. autobiographical memories). The PCC is also considered part of the default mode network, a group of brain structures that are more active when an individual is not involved in a task that requires externally-focused attention. For example, the PCC is stimulated when someone is daydreaming or recalling memories. Some have asserted that the PCC helps to regulate the balance between internally and externally-focused attention, making it a crucial structure in awareness and attentional focus. SOURCE

Miscellaneous Loose Ends: By way of emphasis, reread the last sentence in the last paragraph above. As another bit of metacognition, it seems to many of us that "everything" relating to artificial intelligence, machine learning and the neurosciences is accelerating. It's hard to tell when you're in the midst of the process that you are trying to account for, caught up in the excitement of the moment and interacting with hundreds of scientists and engineers on a regular basis. One measure that I look at from time to time is the number of entries currently in my BibTeX database compared to one year prior. Since I archive the index periodically ("sporadically" more accurately describes my habit), it is easy to estimate just how quickly it has grown:

/u/tld/Drive/write % grep '@' ./bibtex/dean.bib | wc -l
    5376
/u/tld/Drive/write % grep '@' ./bibtex/dean.12-27-18.bib | wc -l
    4613
/u/tld/Drive/write % grep '@' ./bibtex/dean.11-29-19.bib | wc -l
    5120

That's approximately 16.5% ((/ (- 5376.0 4613.0) 4613.0) = 0.165) since the 12-27-18 archive, a little over a year ago, and 5% ((/ (- 5376.0 5120.0) 5120.0) = 0.05) since the 11-29-19 archive alone. I wish I had kept the archives for the entire 40 or so years that I've been using LaTeX and BibTeX to typeset technical papers in order to get a better idea how things have changed, but the evolution of online services and the accessibility of reference data has certainly had a huge impact on my ability to find and cite papers.
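
The growth figures can be rechecked directly from the three grep counts above. A minimal sketch; the function name growth is just for illustration.

```python
def growth(new, old):
    """Fractional growth from an archived entry count to the current one."""
    return (new - old) / old


current = 5376            # dean.bib
archive_2018 = 4613       # dean.12-27-18.bib
archive_2019 = 5120       # dean.11-29-19.bib

print(f"{growth(current, archive_2018):.3f}")   # 0.165
print(f"{growth(current, archive_2019):.3f}")   # 0.050
```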

Miscellaneous Loose Ends: I wanted to understand the basis for Randy's comments in O'Reilly and Frank [384] and during our meeting last week regarding variable binding. There's a lot of literature relating to thalamocortical and corticothalamic projections, their development, distribution, etc., but I wasn't familiar with the literature concerning their possible role in binding. I spent several hours reading and searching the literature and learned enough to feel comfortable with Randy and Michael's use of them in the Leabra model described in O'Reilly and Frank [384]. The best reference if you want to pinpoint his contributions vis à vis the binding problem is probably O'Reilly and Soto [388], and I've included below the quote from [384] that precipitated my search.

More generally, dynamic gating enables a form of dynamic variable binding, as illustrated in Figure 12 for this SIR-2 task. The two PFC stripes in this example act as variable "slots" that can hold any of the stimulus inputs; which slot a given input gets "bound" to is determined by the gating system as driven by the control input (S1 or S2). This ability to dynamically route a stimulus to different memory locations is very difficult to achieve without a dynamic gating system, as our results indicate. Nevertheless, it is essential to emphasize that despite this additional flexibility provided by the adaptive gating mechanism, the PBWM network is by no means a fully general-purpose variable binding system. The PFC representations must still learn to encode the stimulus inputs, and other parts of the network must learn to respond appropriately to these PFC representations. Therefore, unlike a traditional symbolic computer, it is not possible to store any arbitrary piece of information in a given PFC stripe. Page 305 in [384]
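
The slot-like behavior described in the quote can be illustrated with a toy, hand-coded version of the gating policy. This is a sketch only and not the PBWM model: in PBWM the gating policy is learned, whereas here it is written out by hand, and the control inputs beyond S1 and S2 (an ignore input I and recall inputs R1 and R2) are my assumptions about the task layout.

```python
def run_sir2(trials):
    """Run a hand-coded gating policy over (control, stimulus) trials.

    "S1"/"S2" open the gate of stripe 1 or 2 and store the stimulus there,
    "I" leaves both stripes untouched, and "R1"/"R2" read a stripe back out.
    Returns one output per trial (None whenever nothing is recalled).
    """
    stripes = {1: None, 2: None}          # two PFC stripes acting as slots
    outputs = []
    for control, stimulus in trials:
        if control in ("S1", "S2"):
            stripes[int(control[1])] = stimulus   # gate open: bind to slot
            outputs.append(None)
        elif control in ("R1", "R2"):
            outputs.append(stripes[int(control[1])])
        elif control == "I":
            outputs.append(None)                  # gate closed: ignore
        else:
            raise ValueError("unknown control input: " + control)
    return outputs


trials = [("S1", "A"), ("S2", "B"), ("I", "C"), ("R1", None), ("R2", None)]
print(run_sir2(trials))   # [None, None, None, 'A', 'B']
```

Note that the dictionary here can hold any Python object, which is exactly the kind of fully general storage that the quote says a PFC stripe does not provide; in the network, the stripe representations and the readout both have to be learned.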

I've also compiled a few references for the related work that I found most interesting and applicable; it might be worth your while to read or skim the abstracts if you don't have time to dig deeper⁷⁰.

January 29, 2020

%%% Wed Jan 29 09:11:18 PST 2020

STF: Fuster's hierarchy needs to encode both "what" and "where" from perception in order to guide action⁷¹. It seems that nowadays most AI researchers mix these two kinds of information, e.g., by training a deconv decoder to predict some future images. If we believe that perception's goal is to guide action, then the "where" part somehow should be learnt by doing some actions. For example, to learn sound localization, we can perform a sequence of head movements to equalize the sound in both ears. And to predict an object's movement, we can do a sequence of saccades to keep the object in the fovea. It seems a bit like inverse RL, but I'm not very familiar with the literature. Do you guys have any thoughts on this?
TLD: A couple of things spring to mind: Randy talks about how the proximity of the motor cortex to the parietal cortex — and in particular the somatosensory cortex, ensures that the configuration (pose) of our body parts — essentially our internal representation of the relevant configuration space / manifold body plus its currently surrounding space, will likely be available to assist in predicting motor programs and motor actions — which may incur an additional cost to make those predictions if they aren't already being carried out as part of the ongoing tuning of the reciprocal connections in Fuster's hierarchy.

In addition, as you point out, representations of the spatial parameters of the objects that we see will also be mapped to the association cortex so that the "what" features generated in the ventral visual pathway end up being mixed with the semantic features in the temporal cortex, and the "where" features generated in the dorsal stream will end up being mixed with spatial features in the somatosensory cortex thereby aligning / co-registering the position of objects within the frame of reference of the somatosensory cortex. Together these high-level features of the association cortex would enable predictive machinery to anticipate, for example, when — borrowing Randy's example — you might predict (unconsciously) that you are potentially likely to burn yourself when you are cooking and you reach for, say, the salt, and in doing so your hand might get too close to a hot surface or the flame of a burner.

For a nice review of the relevant neural structures and their functional roles, see the section entitled "Functional Specialization Across PFC Areas" in [385]. For your convenience, I've provided a glossary of basic anatomical terms mentioned in this section here ⁷². Supplied with this mapping from structure to function, a careful review of Figures 10.2, 10.4 and 10.6 [385] and the relevant surrounding text goes a long way in understanding what Randy was talking about in the first part of our discussion last week. In particular, reviewing this material made it much clearer how to revise the arXiv and submitted Frontiers papers.

See the paper [224] by Seth Herd that Randy mentioned and that lays out their proposer-predictor-actor-critic framework. The other thing I thought of relates to a paper [176] that I suggested to Gene with first author Karl Friston on his free-energy principle as a unified theory accounting for action, perception and learning. That paper was published in 2010, but a follow-up paper [177] published in 2012 explores the idea that saccadic eye movements are optimal experiments in which data are gathered to test hypotheses or beliefs about how those data are caused. I haven't read the paper except for the abstract but I'm guessing that in the related work you may find some references relevant to your questions:

"This leads to a natural explanation for exploratory behavior and visual search strategies, of the sort studied in psychology and psychophysics (Gibson, 1979; Itti and Koch, 2001; Humphreys et al., 2009; Itti and Baldi, 2009; Shires et al., 2010; Shen et al., 2011; Wurtz et al., 2011). Crucially, this behavior is an emergent property of minimizing surprise about sensations and their causes. In brief, this requires an agent to select or sample sensations that are predicted and believe that this sampling will minimize uncertainty about those predictions." Excerpt lifted from "Perceptions as hypotheses: Saccades as Experiments" [177]

January 27, 2020

%%% Mon Jan 27 03:57:58 PST 2020

Proposition for Randy: I was exchanging email with Adam Marblestone and Greg Wayne and mentioned that we met with you to discuss some topics related to your work and I attached the list of questions we sent to you — just the questions and not the transcript of our conversation as we agreed. Adam replied:

AHM: Love the questions for O'Reilly⁷³. To them I'd add it would be nice to have a simplified, whiteboard style explanation of his "goal driven cognition in the brain: a computational framework" model and how it differs from TD-style learning.

That would certainly be interesting if only we knew the answers. You addressed this issue to some extent in our discussion and I wanted to run by you my strategy for using the transcript of our conversation as a part of my class in the Spring. The idea is to take the initial set of questions that I posed to you prior to our discussion, and channel what we discussed to generate answers that reflect your viewpoint.

Our discussion was casual, discursive at times, and nonlinear, driven as it was in part by our peppering you with questions, and so what I am suggesting to do, with your approval, is summarize and rationally reconstruct what you said as filtered through my head, providing background in the form of references and supplementary commentary suitable for students taking the class. These notes would be made available on the course website, again with your approval, and would include a prologue in which it is made clear that the following is my interpretation of what you said but that the content is primarily drawn from a transcript of our discussion. For example, here's an excerpt of how I would answer Adam's question drawing from the transcript:

That's exactly right. If it's at the task level (the whole task), then it's not going to work. You need to be able to work with smaller components of activity in order to achieve a finer grained level of control and then, having that control, you can mix and match different motor actions in ways that make sense to produce coherent behavior [...] that's the challenge.
Any ideas? So to me, the number one thing that we're working on and that I think is missing — of course there are many missing pieces — is the development of a kind of goal representation, essentially an encoding within the system of the desired outcome.

To me this is the key difference between what we see in all the existing AI models and what we feel is so essential to how we behave. We do things because [...] and this feeds into issues relating to consciousness, we know what we want to do; we're doing it because we want some particular outcome, we're not doing it just because we happen to have done it before, like a zombie slave to a habitual system that it can't resist, we're doing it because we have an internal representation of a desired outcome — a goal that we want to achieve, and that by having that kind of goal-level representation [...]

I think that's one critical element for getting the system to be able to use a goal representation to drive behavior [...] so essentially you learn a bunch of actions by exploring your environment, randomly babbling or whatever, doing exploratory learning and through that process you're learning that if I did execute this sequence of actions, then I would get to this goal, so I guess that I want that goal.

I'm now going to be able to drive this sequence of motor actions to get to that goal and if I don't get that goal, I know what I was trying to get and I know what I got and now I can do a lot more, for example, intelligent learning about the difference between where I thought I was going to get to and where I actually got and where I went wrong. I think this opens up a lot of possibilities for smarter [...] learning than dumb reinforcement learning in which you proceed by trial and error, all the time bumping into walls, [...] that's just not characteristic of how people learn. So one research route I've seen people take involves a form of unsupervised segmentation of activity traces.

Watching an animal perform a task, I probably wouldn't believe that it is pursuing a goal or planning a sequence of actions to perform — for the most part, animals are just dynamic reaction machines. I think the reason why we have this intuition for really nice crisp goals is because of language. Not only does language provide a way for us to segment our experience early on but it also provides a common ground for other agents to trigger dynamics in us. So we're seeing two problems here, not only would you be having goals in your agent like [...] through the representation of [...] but you also have a way for us to now control [...]

What do you think? I believe that I can do justice to your ideas while tailoring my interpretation to match the level of understanding I expect from Stanford students taking the class. I'll make clear in the usual fashion of biographies or interviews, e.g., to the extent that this makes sense it reflects your ideas and otherwise reflects my lack of knowledge or misrepresentation of your intent. The result I hope will minimize your additional effort while recording for posterity your ideas and insights.

Tom

January 25, 2020

%%% Sat Jan 25 04:27:59 PST 2020

Here's a note that I sent to Dan Dennett and Sean Carroll regarding consciousness and Fuster's hierarchy:

TLD: I'm disappointed that I couldn't convince you guys of the importance of Fuster's hierarchy and the perception-action cycle that provides the key to learning the reciprocal connections between motor and sensor features in the levels of abstraction that comprise the hierarchy. It instantiates the "loop" in Hofstadter's understanding of consciousness, explains the role of inner "speech" with or without language and provides a clear account of consciousness in non-human organisms. It doesn't dismiss qualia, but rather explains how the phenomenon naturally arises as a consequence of our sharing personal narratives. The key idea of — if not the attribution to — Fuster's hierarchy has also found its way into several recent papers on deep networks and reinforcement learning, spurred in large part by new ideas for efficiently training networks having the same signature architecture as Fuster's model. I don't have the patience or skill in communicating ideas to a general but well-educated audience that the two of you have, but it seems to me that a broader dissemination of these ideas might pave the way to a better — and possibly more satisfying — understanding of human consciousness and its non-human complement including the increasingly likely prospects for conscious machines.

Here's a recent email exchange that I've included verbatim, having no time or obvious alternative location to archive it:

FEL: See Kaiser et al 2019. Very very similar work from Levine's lab. Core idea of the paper is they train a world model, train an agent solely in the world model, evaluate the agent in real world, and repeat. Upshot here is the world model they propose is essentially exactly our fuster hierarchy model. I think we still have an interesting contribution, as it seems the agent itself wasn't a focus of the work (I couldn't find any details in the paper), and our hypothesis is that fusters is a good way to capture structured complexity in the agent's state and action space. The strength of their results is also encouraging, and it seems like there's open source code; perhaps we can simply repurpose their world model for action prediction
TLD: Nice paper. The collaboration is also interesting. I know Lukasz and Dumi from way back and the paper by Levine I think you are referring to is the 2016 Finn and Levine paper [164] that obviously I didn't read carefully enough at the time. The "Language to Goals" paper [180] that Levine worked on with Sergio did catch my eye and is definitely on my radar. That these seemingly diverse threads are coming together is exciting and bodes well for accelerated progress in this area.

As for our paper strategy going forward, I agree that we still have an interesting contribution to make. As you point out, the Kaiser et al paper, as well as a number of the others [133, 279, 321, 286, 235, 537, 257, 164, 180] that we've dredged up in our recent literature search, are all leveraging the power of Fuster's hierarchy and the insight behind the perception-action cycle for closing the loop, including the hierarchy of aligned motor-sensor features plus the embodied agent and its environment. If we could (a) point out this rather deep insight and (b) demonstrate its strength on Fuster's arcade game with its complex, many degrees-of-freedom action & perception space, that would make for a great paper.

January 22, 2020

%%% Wed Jan 22 03:58:43 PST 2020

Gene asked me what I had in mind for implementing the perception-action-cycle in the Fuster hierarchy. I mentioned that I was assuming some variant of predictive coding should work well since it was originally developed by Rao and Ballard for just this sort of problem [417, 416]. I pointed him to the PredNet model of Lotter, Kreiman & Cox [331] since I know and trust the authors and their implementation relies on standard components, using a standard software stack: convolutional layers, rectified linear unit activation followed by max-pooling and recurrent neural networks (LSTM), and written in Keras with TensorFlow.

Karl Friston describes a free-energy model — see Box 1 The Free-energy Principle in [176] — that accounts for action, perception and learning and unifies several theories of global brain activity, including the venerable von Helmholtz 1909 Treatise on Physiological Optics, Richard Gregory, Longuet-Higgins & Sutherland's Perceptions as Hypotheses [203, 177], Dayan, Hinton, & Neal's Helmholtz machine model [113], David MacKay's free-energy minimization algorithm [334], and, a little more recently, Lee & Mumford's hierarchical Bayesian model of inference in the visual cortex [305] which was the inspiration for some of my work [121, 120] in this area.

Friston's unifying framework is quite elegant, accounting for a wide range of theories. Relevant to our particular interests, Alexander & Brown [14] present a model of hierarchical predictive coding in which information is propagated by way of both bottom-up and top-down pathways. This model will seem familiar if you've taken Daphne's course on probabilistic graphical models or read her textbook with Nir Friedman [287]. As for the perception-action-cycle in the Fuster hierarchy, Alexander & Brown provide a nice conceptual model that focuses on learning to predict prediction errors as it relates to Vladimir Betz' anterior-posterior dichotomy and Joaquín Fuster's hierarchy. I suggest leveraging ideas from both Alexander & Brown and Lotter, Kreiman & Cox, with the implementation exploiting the latter as a starting point.
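To make the predictive-coding idea concrete, here is a minimal NumPy sketch of a single layer in the spirit of Rao and Ballard. It is not the PredNet implementation and not the Alexander & Brown model. It assumes a linear generative model in which the input x is predicted as U r, and it settles the latent estimate r by repeatedly passing the prediction error back through the transpose of U. The names infer_causes, U, r_true and the step size are hypothetical, and U is given orthonormal columns purely so that convergence is easy to reason about.

```python
import numpy as np

def infer_causes(x, U, steps=200, lr=0.1):
    """Settle the latent estimate r by descending the squared prediction error."""
    r = np.zeros(U.shape[1])
    for _ in range(steps):
        error = x - U @ r            # bottom-up signal: what the prediction missed
        r = r + lr * (U.T @ error)   # adjust the estimate to shrink that error
    return r, x - U @ r

rng = np.random.default_rng(0)
# Assumption for illustration: generative weights with orthonormal columns.
U, _ = np.linalg.qr(rng.normal(size=(8, 3)))
r_true = np.array([1.0, -0.5, 2.0])   # hypothetical hidden causes
x = U @ r_true                        # noiseless "sensory" input

r_hat, residual = infer_causes(x, U)
```

In a hierarchy, the residual from one layer would become the input to the next, which is roughly where learning to predict prediction errors enters the picture.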

Continuation of the consciousness discussion from where it left off:

AHM: Something I've discussed with Konrad Kording a bit, and which is probably written up somewhere in the context of the intentional stance or related: It also seems that language maybe has some close connection with / privileged access to our memory replay and possibly-related prospective planning / mental simulation systems. So much of language starts out imperative, e.g., "mama up" (pick me up), and then ends up specifying what I think of as "conditionings" of a mental simulation: who what where why when.
Maybe my default simulator shows Greg walking over to the door. But now I need to specify that Greg is a dinosaur and there is a fire pit between here and the door that he needs to go around. Language + our mental simulation infrastructure seem somehow perfectly architected to allow this. So language seems to condition goals and modifiers of some forward model. I wonder if the speech/sensorimotor part of language is less fundamental than that inner machinery, for how "consciousness" (by some definition) and inner speech emerge within the cortical system.

TLD: I'm in complete agreement with the claim that language considerably enhances our expressiveness and in particular our ability to better contextualize the basis — pattern of activity in the association cortex that is fed into the striatum — for acting. I also believe our cat Calliope has a sense of self, relatively detailed models of me and the big black-and-white male that steals her food from time to time, including an accurate model of the big tom's body language and all that his pose portends.

I fully expect that a good deal of that knowledge is encoded in the association cortex and incorporated into the latent state vector shared with the basal ganglia and prefrontal cortex. The hippocampus also plays a role in enriching this context by drawing on episodic memory suitably reconstructed and reimagined to bear more relevantly on the current state.

Spoken words are just sounds, but they are sounds that produce brain activity in networks encoding useful latent state inferred from our exposure to conversations, stories and personal narratives using the same basic technology that Calliope uses to represent, predict and foil her catfood-stealing nemesis. Spoken language evolved by adapting to our native physical constraints. We evolved very few physical adaptations to language — there simply hasn't been enough time.

Language evolves orders of magnitude more quickly and hence it is constrained more by us than we by it. Calliope can't learn to use language, but humans, long before they acquired a spoken language comparable to even the most primitive of pidgin languages, were using a rich language of signs and gestures to hunt, mate and express basic needs and dispositions.

Miscellaneous Loose Ends: You might say that our ancestors going back a few million years were "language-ready", except that there was no language ready for them to use. A more accurate assessment would be to say that our ancestors provided the fertile soil, temperate climate and motivated horticulturists necessary to grow a language well suited to our human physiology and collective aspirations — where I suppose killing one another and destroying their environment were coincidentally and unfortunately aligned with those aspirations.

January 21, 2020

%%% Tue Jan 21 03:52:22 PST 2020

One of my students pointed out that in my response to your earlier comments, I didn't address your concern that the theory as posed doesn't directly "connect to how language is shaped by communicative utility and theory of mind". I do believe — and I expect you would agree — that cats and dogs, apes and cephalopods all have some sort of theory of mind. They also have a sense of self, however impoverished it might be in comparison with ours. The question of how language is shaped by communicative utility is relatively easy to explain when you consider the selective advantage of the proto languages that arise in many species that exhibit coordinated strategic and social behavior. The harder and largely-ignored question concerns how language shapes us; specifically, how are we shaped by language as an emergent phenomenon that persists over generations and makes possible the institutions that preserve and make accessible our collective reservoir of shared knowledge and culture. I addressed this and related issues at some length in an earlier entry in this log; you can access it here if you are curious, and thanks again for your constructive feedback.

January 19, 2020

%%% Sun Jan 19 13:15:16 PST 2020

In responding to my sending a link to an earlier entry in this log explaining consciousness in humans and other animals, Greg Wayne writes:

You should write a book! I would love to read it. I agree that inner speech is one of the cases where motor production and sensory prediction are somehow deeply intertwined. I am not sure, however, that I am fully on board with this part, "I suggest that once you have all this machinery in place, inner speech is inevitable, reporting on what you see and hear, generating a running commentary, trading stories about what you feel, producing a narrative in which to embed your memories ... consciousness doesn't seem all that strange from this perspective and probably didn't to Douglas Hofstadter either." From a high-level perspective, you are saying that if you build an encoder-decoder over speech and context data, you get inner speech. But so far it feels that this view doesn't quite connect to how language is shaped by communicative utility and theory of mind.

My response:

Thanks for the compliment. I appreciate the feedback.
In response to your comment, each level in Fuster's (idealized) hierarchy defines a recurrent circuit with two primary pathways: one linking action to perception by way of the agent's physical environment thereby grounding our experience in a complex dynamical system that provides the basis for much of our subsequent learning, and the other corresponding to reciprocal connections between sensory and motor feature maps at each level within the hierarchy. Roughly speaking, the forward (posterior to frontal) path maps sensory features to motor features and the reverse pathway (essentially the inverse of the forward path) maps motor features to sensory features, thereby establishing dependencies so that sensory features are tuned to support motor activity and motor features tuned to support perceptual requirements as in the case of MERLIN [523] immersed in what amounts to a partially observable Markov process.

In an adult, these mappings are constantly playing out, shaped by the attentional and control systems that guide behavior, and by doing so they determine, for want of a better word, the degree to which we are conscious of the relationships between our activity and the external environment. At the "apex" of this hierarchy (the language facility clearly isn't relegated to a single level; it is more likely spread out over / integrated into many levels), the reciprocal connections (what some call the "phonological loop" and its homologs in apes and other mammals) are argued by some to be realized in the arcuate fasciculus.

I'm arguing that in humans this is the basis for inner speech and that, analogous to the reciprocal activity occurring in other levels of the hierarchy, left to its own devices in humans that have acquired language, the pattern of activity takes the form of a dialogue of sorts⁷⁴. Such a linguistically adept human may not be consciously aware of this ongoing dialogue and, in a human without speech, it may not manifest in any particularly interesting or useful form. One of the most interesting characteristics of this phenomenon is that much of what we associate with consciousness, including the way in which we communicate our sensory experience, the central role / concept of the self and the various ways in which we engage with and interpret our inner voices, would not occur or would not be nearly as rich, if it were not for the expressiveness of our language and the narratives that we share with one another⁷⁵. See here for the first part of this thread. See here for a continuation of this discussion.

Miscellaneous Loose Ends: Software and Research on Spiking Neurons as an alternative to traditional artificial neural networks: Wolfgang Maass: Computing Brains and Spiking Networks & Liquid State Machines [INTERVIEW]. Chris Eliasmith: How to Build a Brain: A Neural Architecture for Biological Cognition & [SOFTWARE] [BOOK].

January 13, 2020

%%% Mon Jan 13 06:35:23 PST 2020

Below is the text of a message I sent to Daniel Dennett following his interview on Sean Carroll's Mindscape podcast earlier this month. I sent Dan a note asking if he had read Fuster's book on the prefrontal cortex [185]. He replied that he had not but would take a look. I replied in turn with a slightly redacted version of the following commentary, sans footnotes and citations, addressing his and Sean's discussion of consciousness:

There's a great story that you can tell by starting at the base of Fuster's hierarchy [185, 66], and working your way up the abstraction levels one level at a time, basically recapitulating at each level the corresponding stage of human development. At each level, you invoke Fuster's perception-action cycle [184, 323] establishing the requisite level-specific sensory-motor feature dependencies — in principle, standard encoder-decoder recurrent neural network pairs could work to learn these mappings [96, 100, 43].
An arc of abstraction extends forward (posterior to anterior) and downward (dorsal to ventral) across the frontal lobe from primary motor to premotor and supplementary motor and from there into the executive control areas in the anterior prefrontal cortex (PFC) that Etienne Koechlin, Thomas Jubault, David Badre and others identified in the late 1990s and early 2000s [283, 285, 42, 41, 38].

Continuing the arc you run into the dorsolateral PFC implicated in executive function, the anterior ventromedial PFC including circuits involved with reward-based decision making, and finally the inferior frontal gyrus home to Broca's area and related speech production functions [284].

In the posterior parietal and temporal cortex exists a similar but somewhat more tangled complementary arc, part of which ends up in the superior temporal lobe in the dominant cerebral hemisphere where you find Wernicke's area, one of several areas implicated in written and spoken language comprehension. The extensive reciprocal white matter tracts — the arcuate fasciculus — connecting the two language areas are likely to play a key role in the perception-action cycle at the corresponding level in the hierarchy⁷⁶.

Fuster's perception-action cycle involves two complementary reciprocal networks — one network that learns to predict motor output from sensory input and the other that essentially inverts that mapping. What would it mean for this cycle to repeat indefinitely in the level of Fuster's hierarchy including the language-related areas responsible for speech production (motor) and comprehension (sensory)?

An encoder-decoder pair of LSTM networks can be trained to map what one interlocutor in a dialogue says to a distribution predicting what the other interlocutor in the dialogue might reply [504]. The model described in [504] is not perfect by any means since the technology available at the time the referenced paper was published was rather primitive and the result in the paper was primarily used as a proof of concept. When you speak, it seems plausible that you map what you're thinking onto a similar intermediate representation that reflects what you want to say and how you want to say it.

Presumably, when you want to understand what someone else said, you reverse the process and generate the same sort of internal representation. When you hear yourself speak — and here we suggest that you needn't go so far as to speak the words out loud — you would likely do the same. In principle, you could carry out these transformations, mimicking the accent, style or mannerisms of people you're familiar with, and, in the process, easily get confused and imagine that some of those voices are real and originate within us as separate agents or independent manifestations of ourselves.

A proto self, as it were, emerges gradually, first appearing in the earliest stages of the hierarchy and then gathering complexity as subsequent stages are filled in. The full self, as manifest in a healthy, linguistically adept adult human, doesn't emerge until much later, and even then, it could be argued that we don't represent the pinnacle of selfhood nor would we be able to articulate a recognizably nuanced sense of self if left to our own devices — language, education and culture play an important role in shaping our sense of self.

We cringe when we see a child's hand crushed by a car door carelessly slammed shut, duck when we see or expect to see the player at bat hit by a fastball and hold our breath while watching a gymnast attempting a difficult floor routine. Why then does it seem odd that we anticipate what others say, quietly practice what we are going to say before saying it, or remember a parent's soothing words of consolation when saddened by some loss long after our parents have passed away?

We don't store the words or record an audio track of someone saying the words when we commit to memory some particularly striking conversation at an important turning point in our life. We encode the patterns of neural activity corresponding to the representations we generated in the association cortex and elsewhere at the time we experienced the conversation. Even then we need more than just those representations to reconstruct the experience; we need the same neural substrate that we used at the time to construct those representations.

I suggest that once you have all this machinery in place, inner speech is inevitable [160, 10, 477, 161, 253], reporting on what you see and hear, generating a running commentary, trading stories about what you feel, producing a narrative in which to embed your memories [308, 118] ... consciousness doesn't seem all that strange from this perspective and probably didn't to Douglas Hofstadter either [228].

Having sent my email version of the above and thinking over what I wrote, I subsequently added the following postscript:

In an interesting sense, every level in the Fuster hierarchy constitutes a form of consciousness ... animals articulate their awareness at different levels in the hierarchy by displaying level-appropriate activity that demonstrates their mastery of the relevant characteristics of their environment at that level. When we ask whether or not a crow or an orangutan can recognize itself in a mirror or behave in a deceptive manner, we are asking whether it is "conscious" of particular relationships between itself, its physical environment and other agents with whom it shares that environment.
In his discussion with Sean Carroll, Dennett suggested that it made sense — was useful in some circumstances — to apply the intentional stance when speaking about single-celled animals. Perhaps a similar argument can be made for consciousness. In its relatively impoverished version of the Fuster hierarchy, we might say a single-celled organism demonstrates an awareness of the difference between itself and its environment by the observed activity of its protein-studded lipid bilayer membrane in allowing nutrients to enter and waste to exit, and in some cases actively repelling viruses and other bacteria — apparently bacterial mats (biofilms) exhibit a form of deception in misleading other bacteria that encroach on their territory.

The difference between a single-celled organism and us is that they can't speak, don't have language and can't report on their experience! I admit that the last statement was outrageously exaggerated. It was made to be provocative and emphasize the point that language is not necessary for consciousness and other organisms can be said to be conscious to some degree, but that in humans consciousness manifests in a unique way due to our facility with language and the rich manner in which we share our experience, enhanced by cognitive tools — ways of thinking and talking about consciousness — that employ introspection to gain insight into our inner experience — see here for a continuation of this thread.

January 11, 2020

%%% Sat Jan 11 03:20:00 PST 2020

Below are some of my current thoughts relating to how humans are able to handle examples like the three listed in the previous entry while at the same time dealing with more abstract (cognitive) problems — like the one I'm wrestling with at this minute. I'll take it for granted that the basal ganglia working along with the prefrontal cortex iteratively construct a redacted version of the system state vector consisting of activity patterns drawn from locations throughout the association cortex and reward centers [385]. Some of the parts (stripes) that comprise these patterns are suppressed while others are emphasized and possibly annotated to provide additional context for downstream action / program selection and execution [210].

Setting the Stage for Abstract Thinking Modeled After the Motor Cortex

As explored in earlier notes in this record, some of these annotations might help to route different pieces of the redacted state vector to appropriate networks in the frontal cortex. The graphic below takes the earlier depiction of columns, stripes and clusters as a starting point, adding the possibility that there are multiple targets corresponding to networks specializing in more abstract domains — mathematics being one such domain — in the frontal cortex akin to those in the motor cortex responsible for selecting and executing motor programs.

There are some lingering concerns I had about such an architecture that I was able to resolve with a little more thought and share with you below in case you have similar concerns. Here are the issues requiring resolution couched in the terminology of computer operating systems: (i) How are multiple processing threads managed? (ii) What if a thread is aborted or a line of thought abandoned? (iii) What if a thread requires more short-term memory for processing? (iv) What happens to the intermediate results of a program that invokes another program or calls itself recursively? and, most importantly, (v) What should we call such a program to distinguish it from a motor program? As for (v) I suggest we call it a mentat program in homage to Frank Herbert's Dune series in which a mentat was a human replacement for computers after they were banned following the Butlerian Jihad⁷⁷.

Here's a first attempt at answering the rest: (a) assume that the stripes relevant to signaling different specialized networks are disjoint or nearly so, (b) assume that the BG and PFC together can manage multiple threads, and that, in particular, the PFC can influence / modulate the behavior of the BG in cases where the PFC has information that the BG is not privy to, (c) assume that in our artificial-neural-network architecture, stripes are implemented as multi-cell LSTM networks that can use their input, output and forget gates to control and protect their contents, (d) [ignore (c)] and assume that the only reason the BG might want to modify the contents of a stripe is if the context for invoking the relevant process has changed enough to warrant aborting the process or altering its behavior by updating the stripe to reflect the new context.

Continuing: (e) assume that the LSTM implementing a given stripe associated with a specialized network can foil the BG by locking its input gate thereby denying the BG the opportunity to modify the context, (f) [ignore (e)] and assume the LSTM implementing a given stripe associated with a specialized network has a certain amount of dedicated short-term memory that it can use to keep a history of the last K stripe contents controlled with a pointer-network attentional model implemented as a DNC whose purpose is to access intermediate results and bring them to bear when required, and (g) [ignore (f)] and employ the DNC implementing the hippocampus to perform similar services.
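The gating options in (c), (e) and (f) can be sketched with a toy stand-in for a stripe. This is an illustration under strong simplifying assumptions, not an LSTM or a DNC: the input gate is a single scalar set by hand rather than learned, and the history is a plain bounded queue rather than an attention-addressed memory. The class name Stripe and its methods are hypothetical.

```python
from collections import deque
import numpy as np

class Stripe:
    """Toy working-memory stripe: one gated slot plus a bounded history."""

    def __init__(self, size, history_len=3):
        self.content = np.zeros(size)
        self.input_gate = 1.0                      # 1.0 = fully open, 0.0 = locked
        self.history = deque(maxlen=history_len)   # last K contents, as in option (f)

    def lock(self):
        """Close the input gate so outside writes cannot change the slot."""
        self.input_gate = 0.0

    def unlock(self):
        self.input_gate = 1.0

    def write(self, proposed):
        """Blend a proposed update into the slot in proportion to the input gate."""
        g = self.input_gate
        if g > 0.0:
            # Only record history when the slot is actually about to change.
            self.history.append(self.content.copy())
        self.content = (1.0 - g) * self.content + g * np.asarray(proposed, dtype=float)
        return self.content

stripe = Stripe(size=2, history_len=2)
stripe.write([1.0, 0.0])   # accepted while the gate is open
stripe.lock()
stripe.write([9.0, 9.0])   # ignored: the gate is closed, as in option (e)
stripe.unlock()
stripe.write([0.0, 1.0])   # accepted; the earlier content moves into the history
```

The bounded queue plays the role of the dedicated short-term memory in (f). Retrieving an intermediate result would then amount to choosing an index into stripe.history, which is the part a pointer-style attention mechanism would have to learn.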

Defining Abstract Analogs of Sensory (state) and Motor (action) Vectors

These non-motor-cortex targets might be in the areas in prefrontal cortex identified by Koechlin et al [284, 285] and Badre et al [40, 42]. I wouldn't be surprised if the entire frontal cortex consists of layered planning and control networks organized as a continuation of Fuster's hierarchy in which the reciprocal connections implement forward-inverse-model pairs — essentially what Joaquín Fuster describes as the upper processing stages of the perception-action cycle [183]. Later in this entry we'll address the question of what would replace the roles of the state proxy (pattern of activity throughout the association cortex) and abstract action space (augmented configuration manifold) in these non-motor networks.

The argument for tiling (much of) the frontal cortex with generic planning and control networks is that the recent significant expansion of the prefrontal cortex in great apes and humans [461] could not have dedicated / allocated the new cortical area to specific functions in so little time (evolutionarily speaking) and indeed did not have to, given that the structural organization was already in place to exploit the windfall of computing resources, i.e., some variant on the idea of routing each problem to an expert (network) able to solve such problems [415, 221, 433, 151]. I don't know if there was a corresponding proportional increase in the number of stripe clusters providing for complementary expansion of working memory, but it seems reasonable to assume so⁷⁸.
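As a rough illustration of routing each problem to an expert, here is a minimal top-1 gating sketch in NumPy. It is written in the general spirit of mixture-of-experts gating and is not taken from any of the cited papers. The gate weights are set by hand rather than learned, the "experts" are trivial stand-in functions, and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def route(state, gate_weights, experts):
    """Send the state to the single expert that the gate scores highest."""
    scores = softmax(gate_weights @ state)
    k = int(np.argmax(scores))
    return k, experts[k](state)

# Hypothetical stand-ins for specialized networks.
experts = [lambda s: s.sum(), lambda s: s.prod()]

# Hand-set gate: expert 0 prefers a large first feature, expert 1 a large second one.
gate_weights = np.array([[1.0, 0.0],
                         [0.0, 1.0]])

chosen, output = route(np.array([0.5, 3.0]), gate_weights, experts)
```

With this input the second feature dominates, so the gate picks expert 1. In the architecture discussed above, the state would be the redacted and annotated state vector and the gate would stand in for whatever signal the stripes send to the specialized networks.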

If the basal ganglia (BG) / motor cortex (MC) system takes as input a pattern of activity occurring across the associative cortex and generates as output a motor command that produces activity across the motor cortex, then what exactly does the input — redacted and annotated state vector — for an abstract task like proving a theorem or designing a fusion containment field look like and where do the patterns of activity from which it arises originate?

Both the basal ganglia and hippocampus employ such vectors to represent states / choices / indices for, respectively, (BG) directing (impending / current) activity and (HPC) accessing (episodic / prior) experience. What are the analogous perceptual, sensory patterns for encoding the internal state relating to the activity of writing a program, deriving a theorem, designing a bridge? If such patterns exist, are they different for each of these activities and how do they relate to our more familiar notions about perception?

Making the Case for Natural (human) and Synthetic (machine) Languages

The invention of written language enabled us to extend the benefits of our acquired knowledge beyond the natural span of a single human life by making it possible to produce a sturdy facsimile suitable for both expanding short-term memory (scratch paper) and providing an inexhaustible basis for long-term storage (shared libraries plus broad cultural and institutional agreement and investment in long-term management and preservation). A shared language is likely to be even more beneficial to our synthetic progeny.

Language provides a foundation for precise symbolic reasoning in addition to its more common, less-precise usage as a means of everyday communication that allows for ambiguity, supports analogy and metaphor and deals gracefully with the inevitable drifts in meaning as language evolves to keep pace with cultural, scientific and technological change. It seems unlikely that we could have achieved so much without the invention of language. It may be that in order for machines to collaborate — or compete — with us they will have to adopt our language — or invent their own⁷⁹.

Even apes and untutored children are competent naïve physicists. In addition, children are adventurous explorers and natural rhetoricians, quick to recognize the value of logic and adapt it to suit their needs. Fortified with language, enough paper for external read-write memory and proper training, we humans have the capacity to write programs and run them by emulating a universal Turing machine. The invention of the computer has enabled us to extend our cognitive capacities and the promise of advanced digital prostheses offers us the possibility of not being superseded by our creations.

Spoken words are selected for their conciseness, ease of articulation and ease of discrimination from other phonetic tokens in noisy environments, thereby facilitating efficient communication. While words are hence made easily distinguished, their meanings often lack precise definition. This makes them ideal signs⁸⁰ for supporting a wide range of communication tasks. However, while we can get away with being sloppy with language, we can also benefit greatly from learning how to use it precisely.

In addition to the obvious utility of unambiguously conveying a common shape, the word "triangle" can be given a specific technical meaning along with rules stipulating its (technical) usage, constituent parts and defining properties. Such technical ontologies have the advantage that they can be "typed" thereby facilitating precise, repetitive or recursive usage.

For example, the ontology of propositional logic includes propositions, e.g., A and B, Boolean functions, e.g., AND and OR, and rules of inference, e.g., modus ponens, and you only have to learn how to apply such a rule — IF A IMPLIES B and A then B — once in order to apply it in a multitude of contexts, as well as mentally catalog it along with its complement modus tollens — IF A IMPLIES B and NOT B then NOT A. This sort of disciplined approach to structuring and applying knowledge is incredibly useful, and yet humans come by such deep knowledge only by exerting extraordinary discipline. We should be able to design machines that overcome this barrier.
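A short truth-table check makes the point about learning a rule once and applying it everywhere. The sketch below treats a rule of inference as valid when its conclusion holds in every assignment that satisfies all of its premises, and it tests modus ponens (from A IMPLIES B and A, infer B), modus tollens (from A IMPLIES B and NOT B, infer NOT A) and, for contrast, the fallacy of affirming the consequent. The helper names are hypothetical.

```python
from itertools import product

def implies(a, b):
    """Material implication: false only when a is true and b is false."""
    return (not a) or b

def is_valid(premises, conclusion):
    """A rule is valid if the conclusion holds wherever all premises hold."""
    return all(conclusion(a, b)
               for a, b in product([False, True], repeat=2)
               if all(p(a, b) for p in premises))

modus_ponens = is_valid([implies, lambda a, b: a], lambda a, b: b)
modus_tollens = is_valid([implies, lambda a, b: not b], lambda a, b: not a)
# From A IMPLIES B and B, concluding A is not a valid rule.
affirming_consequent = is_valid([implies, lambda a, b: b], lambda a, b: a)

print(modus_ponens, modus_tollens, affirming_consequent)  # True True False
```

Because is_valid quantifies over every assignment of A and B, the result does not depend on what the propositions stand for, which is the sense in which a typed rule transfers across contexts.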

Humans evolve in response to selection pressures imposed by their environment. When language appeared on the scene it became a part of our environment — a part that we share with others depending on how quickly language innovations are propagated, vetted and adopted throughout the population. Communication and transportation technology hasten propagation while commerce and scientific collaboration hasten both language adoption and innovation.

The net result is that language as a collective enterprise evolves much more quickly than we do and, while some adaptations will require physical changes, the experiments will play out on a much larger scale due to the rapid spreading of language, thereby increasing the probability that useful genetic variations will arise in the population. At a much faster pace, many useful language innovations will have an immediate impact insofar as our social institutions and educational systems incorporate changes into their curricula.

At any given point in time we will have Version XX.XX software running on Version YY.YY hardware. The more capable Version YY.YY hardware is as a computational substrate, the more value we can derive from running Version XX.XX software. Biological analogs of more RAM, faster memory access and processor cycle time, larger cache size, and higher bandwidth buses will change slowly, encouraging the development of cognitive prostheses and genetic tinkering, but Version XX.XX software will be running in parallel on billions of brains.

The combination of a faster biological analog of Intel's tick-tock model could give rise to an interesting version of Moore's law. Anecdotal evidence collected from colleagues working at Google, Apple and a number of startups suggests that it is much harder to be the first with a new idea, especially in the case of incremental improvements on deep networks and the proverbial least publishable unit that characterizes much of the output of academic researchers, whether they are graduate students struggling to find a job or untenured faculty struggling not to lose theirs.

Phillip Lieberman and Terrence Deacon on the Evolution of Language

It seems to me that the role of language in the evolution of Homo sapiens is both overhyped and undervalued. It certainly remains a puzzle for scientists who, lacking anything as revealing as a fossil record, have to content themselves with the evidence, such as it is, relating to when we acquired the physical traits that make the generation of natural language possible [320].

That language is a means for communicating with one another and codifying and preserving knowledge for future generations is not controversial. However, how language emerges from social interaction is still mysterious [319]. The more I work on designing artificially intelligent systems, the more I think that human language plays a pivotal role in bootstrapping our ability to think abstractly. There must be some fundamental computational principle governing how language and our brains have evolved together to get us to where we are today.

Terrence Deacon's The Symbolic Species: The Co-evolution of Language and the Brain is the first book I've encountered that offers a compelling theory about what's been missing. Deacon explores the possible origins of language focusing on the problem of how evolution stumbled on a tool so powerful and so well adapted to our brains that it catapults human cognition well beyond any other sort of cognitive prosthetic we might design today given our limited knowledge of the brain, while at the same time providing tantalizing hints about how we might reverse engineer this organic process for advanced AI systems patterned after the human brain. Here is my rough understanding of Deacon's theory.

Language evolves quickly as a consequence of our collective efforts to experiment with innovative adaptations generated by humans to solve human problems, such as how to create a currency to facilitate trade, a calculus to balance accounts or an instrument to survey property, or how to introduce a representative quorum to adjudicate disputes. These innovations spread quickly through the population by way of trade and colonization facilitated by rapid mixing in small-world social networks, and are subjected to the experiments of humans who quickly and collectively decide which innovations to adopt and which to discard.

The ongoing development of human language is a co-evolutionary process in which human children quickly learn spoken language (the late arrival of reading and writing provides one explanation for why these skills are so difficult for children to master) and adults contribute to its rapid development by experimenting with novel refinements that are subjected to the pressure of widespread human adoption. Beavers offer an analogy: they are subject to the selection pressures of their environment, e.g., they have evolved webbed feet and a large flat tail to better suit their aquatic lifestyle, and they in turn have altered their environment by selective forest thinning and building dams that create new habitats.

The changes that drive language are not randomly chosen or geographically isolated and selection pressure rapidly seizes upon broadly useful adaptations. These changes become part of the environment that humans inhabit and, on a slower timescale, drive the evolution of the human brain. From this perspective, language is an organic process that emerges from the interplay between two symbiotic processes each of which provides value to the other in a stable, mutually beneficial and indefinitely sustainable relationship. See here for more on Deacon's theory presented in The Symbolic Species [118] and here for a short summary of his more recent book entitled Incomplete Nature: How Mind Emerged from Matter [119] concerning the general class of such phenomena.

Latent Features of Language Expose the Foundations of Human Thought

Given what they teach children in school, one might think that the most important features of language revolve around syntax and proper grammar, when in fact meaning and context are the key components and syntax, while conveying some nuances of meaning, is more often than not window dressing. What language conveys is deeply rooted in the way we think. The fundamental characteristics of human thought have to do with how and why the world works as it does, because that's what we need to know in order to survive.

Daniel Dennett writes of our inclination to adopt the intentional stance in explaining a wide range of phenomena including the behavior of animals and humans in particular, but also extending to inanimate and make-believe objects. In this interview with Sean Carroll, Dennett makes the case that the intentional stance can be applied to anything that has a history, and Sean Carroll makes the same point in saying that the primary criterion is participation in the arrow of time. Dennett likes to use the example of a thermostat and how a parent might go about explaining how a thermostat works to a child. Children want to ascribe purpose to people and things and, having done so, are surprised when those people and things don't behave accordingly.

One could make the case that it doesn't make sense to attribute intentions to a box of crayons or a stack of blocks, but that obviously misses the point. The point is that something like Dennett's notion of intentional stance is baked into the way humans think about the world and so, not surprisingly, in the language we use to communicate with one another and to ourselves. Language is not necessary for us to communicate much less think, but the language we share among ourselves is a powerful tool for thinking and the key to unlocking the recorded knowledge of our culture.

If we could figure out the hidden language of thought that underlies human language we might be able to parse the written and spoken word so as to reveal its meaning. Perhaps we could do so in the same way that we attempt to extract meaning from visual information, but as Dennett pointed out, while AlphaGo can tell you where to place the next stone, it can't tell you why.

The ideal training data for synthetic minds patterned after the human brain would be a parallel corpus in which a running monologue is paired with an animation depicting people and things interacting with one another, and requiring the synthetic mind to answer questions of the same sort that developmental psychologists use in studying how children think and learn⁸¹.

Unfortunately, such training data may not be enough without our providing an opportunity for the system to interact with people, preferably the same person supplying the monologue. In lieu of such a dataset however, I think we might be able to generate a learning environment specifically for the programmer's apprentice by creating an artificial environment in which the apprentice interacts with a computer by means of the sort of instrumented integrated development environment described in [124]. Stay tuned.

Miscellaneous Loose Ends: Some classic and recent interviews on Paul Middlebrooks' Brain Inspired podcast. Tom Griffiths on the nature of goals [101], Nicole Rust on visual memory and novelty, and Dan Yamins' research [542, 533] as it relates to learning a familiarity bias [505]. From the abstract of one of Dan's recent papers [542] mentioned by Nicole Rust: "Here, we describe a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate. This aggregation metric is dynamic, allowing soft clusters of different scales to emerge. We evaluate our procedure on several large-scale visual recognition datasets, achieving state-of-the-art unsupervised transfer learning performance on object recognition in ImageNet, scene recognition in Places 205, and object detection in PASCAL VOC." And one last thing: where does "learning by doing" / "learning by imitating" fit into the grand scheme? Are these strategies built in and, if so, when during development do they become available for use?

@article{LampinenetalCoRR-19,
author = {Sebastien Racaniere and Andrew K. Lampinen and Adam Santoro and David P. Reichert and Vlad Firoiu and Timothy P. Lillicrap},
title = {Automated curricula through setter-solver interactions},
journal = {CoRR},
volume = {arXiv:1909.12892},
year = {2019},
abstract = {Reinforcement learning algorithms use correlations between policies and rewards to improve agent performance. But in dynamic or sparsely rewarding environments these correlations are often too small, or rewarding events are too infrequent to make learning feasible. Human education instead relies on curricula--the breakdown of tasks into simpler, static challenges with dense rewards--to build up to complex behaviors. While curricula are also useful for artificial agents, hand-crafting them is time consuming. This has lead researchers to explore automatic curriculum generation. Here we explore automatic curriculum generation in rich, dynamic environments. Using a setter-solver paradigm we show the importance of considering goal validity, goal feasibility, and goal coverage to construct useful curricula. We demonstrate the success of our approach in rich but sparsely rewarding 2D and 3D environments, where an agent is tasked to achieve a single goal selected from a set of possible goals that varies between episodes, and identify challenges for future work. Finally, we demonstrate the value of a novel technique that guides agents towards a desired goal distribution. Altogether, these results represent a substantial step towards applying automatic task curricula to learn complex, otherwise unlearnable goals, and to our knowledge are the first to demonstrate automated curriculum generation for goal-conditioned agents in environments where the possible goals vary between episodes.}
}
%

January 9, 2020

%%% Tue Jan  7 04:47:01 PST 2020

Question: What is the rest of the body/brain doing when the basal ganglia are executing a motor program? Answer: The basal ganglia apparently don't execute anything! The question doesn't even make sense given what we know about the basal ganglia. The basal ganglia don't even really select or launch a program. They suggest, motivate, cajole, provide a context for, forward suggestions and annotations vetted by the prefrontal cortex, or can be said to perform similar advisory or executive functions that set in motion other parts of the brain to perform actions and execute programs.

The standard story is that the basal ganglia (ignoring for the moment their possible role in learning and programming the cerebellum) repeatedly serve up a smorgasbord of tempting stimuli and enticements vetted by a combination of cortical and subcortical circuits and prepared for perusal and adoption by a collection of experts primarily residing in the frontal cortex. These experts can make sense of the offerings and, in some cases, select appropriate programs whose execution they are uniquely suited to oversee, and are perhaps prepared to take additional advice concerning how execution is progressing and whether or not it makes sense to continue or abort.

Here are three examples taken from my morning routine that illustrate some of the issues pertaining to how plans are formulated, coordinated and executed:

(I) It's 2:30 AM and I'm sitting at my desk in the living room, two screens are in front of me and a large mug of tea with a straw sticking out of it next to the keyboard in front of the screen on the left. I've been working in an Emacs window up on the right screen, and I just shifted my gaze to the browser on the left screen and in so doing got a glance of the tea in my peripheral vision. I noticed the tea but was focused on examining something that had flashed by in the browser and had grabbed my attention. Something is nagging at the "back of my brain", reminding me that I often get so absorbed in work that my tea gets cold and fails to warm me up in the cold room or supply the caffeine needed to fully wake up.

(II) It's 5:30 AM and I've been up for three hours editing a manuscript. I've just saved the file and run the LaTeX compiler to format the manuscript, and notice the compiler has spit out some error messages including one that reads: unexpected end of file at line 1475. I return to the editor window, jump to line 1475, fail to see evidence of a LaTeX environment begin statement without a corresponding end statement, scroll up a few more lines, find an earlier left curly bracket that I failed to close, enter the missing right bracket, save the file and re-run the LaTeX compiler, which completes this time, launching a PDF previewer as its final step⁸².

(III) It's 6:30 AM and I am on the Stanford campus in the workout room at the AOERC facility with all of its treadmills, weight machines and sweating exercise fanatics. I've used elliptical machines once in the recent past and am planning to try to use one again today. I step up on the footpads and attempt to rock them back and forth to get the machine working properly. In the midst of doing so I suddenly find myself lifting my right leg up while simultaneously leaning to the left to put more of my weight on the left pad. In two more strokes, I'm rocking along like a pro. I don't know where those moves came from, but I'm performing well beyond anything I achieved the previous time I tried.

Miscellaneous Loose Ends: The technology for sensing and servoing actuators is pretty sophisticated, having been driven by the increasing demand for industrial automation, consumer products including toys and hobbyist robots, and military applications including advanced avionics and mobile robotics. In many cases, sensors and servos are packaged together, resulting in a distributed control system in which all of the necessary sensory, control and mechanical components are remote from any central controller, e.g., integrated into the design of the actuator just as our muscles are attached to tendons collocated with proprioceptive neurons that measure position and angle. Composite control, e.g., reconfiguring a multi-joint arm to reposition the end effector for grasping, is carried out in an increasingly centralized fashion, e.g., just as movement commands are generated in the motor cortex and then executed and coordinated in the spinal cord.

Examples of biologically inspired robotic systems have been around for decades. Dan Koditschek, whom I know from graduate school at Yale, pioneered some of the earliest work, a statement that would require a good deal of qualification in a more scholarly venue given that Grey Walter designed and built one of the first biologically inspired autonomous robots in the late 1940s [513]. Today there are a number of companies that have carved out niche products, and mainstream industrial robotics has been incorporating lessons learned from biology into its designs since the field of cybernetics became popular in the 1950s, though seldom advertising such features given the skepticism voiced by traditionally trained engineers; see this short review article by George Lauder.

As mentioned earlier, Albus developed his CMAC controller for industrial robotics based on the Marr-Ito-Albus model of the cerebellum in the mid 1970s. In the meantime, there have been dozens of academic labs springing up in Biology, Computer Science and Electrical & Mechanical Engineering Departments channeling biologically inspired ideas to build hybrid control systems, including, for example, work by a team of researchers and engineers at the University of Granada in Spain developing what they claim to be a realistic cerebellar spiking neural network model for robotic control [189, 85, 333]. I expect we'll see much the same cycle of development that has occurred in artificial neural networks, in which new ideas surface in the biological disciplines, are developed borrowing opportunistically from the biology literature, mature into technologies that bear only superficial resemblance to the original biological models and are finally absorbed into the engineering disciplines, having discarded their biological origins and rarely if ever acknowledging their inspirational roots.

January 4, 2020

%%% Sat Jan  4 03:54:14 PST 2020

Here is a more coherent physical account of what the effectors do in the toy arcade game described in Figure 17. Imagine the effector array as integrated into the membrane of a single-cell organism and suppose that the holes mentioned in the original description correspond to pores in the membrane that admit nutrients necessary to sustain the life of the cell. The nutrient pellets fall on the membrane at random locations and the organism uses the effector array to reshape its surface so that pellets roll down the (inclined) surface and fall into one of the organism's nutrient pores.

This game is more interesting than, say, Tetris, because the effectors can be thought of as muscles that have to be coordinated in order to achieve the desired outcome, analogous to how we have to coordinate our muscles to grasp an object. If we assume that each effector is tipped with a sensor that registers whether or not there is a nutrient pellet at the effector's location, then the topographic map of the sensors is aligned with that of the effectors, analogous to the arrangement of the sensory and motor apparatus in the rodent barrel cortex [399].
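To make the mechanics concrete, here is a minimal sketch of the core game step under assumptions of my own: the membrane is a small grid, each effector sets the height of one cell, a pellet moves to its lowest four-connected neighbor until it can descend no further, and a pellet is captured only if it reaches a pore. The names roll_pellet, bowl and capture_rate are hypothetical, and none of these details are taken from the description accompanying Figure 17.

```python
import numpy as np


def roll_pellet(heights, start, pores, max_steps=100):
    """Follow steepest descent over the effector heights from `start`.
    Returns True if the pellet reaches a pore."""
    n_rows, n_cols = heights.shape
    r, c = start
    for _ in range(max_steps):
        if (r, c) in pores:
            return True
        # Candidate moves: the four in-bounds neighbours.
        neighbours = [(r + dr, c + dc)
                      for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                      if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols]
        lowest = min(neighbours, key=lambda rc: heights[rc])
        if heights[lowest] >= heights[r, c]:
            return False  # stuck on a flat spot or in a dip that is not a pore
        r, c = lowest
    return (r, c) in pores


def bowl(shape, pore):
    """Effector configuration that slopes down toward a single pore."""
    rows, cols = np.indices(shape)
    return np.abs(rows - pore[0]) + np.abs(cols - pore[1])


def capture_rate(heights, pores):
    """Fraction of landing cells from which a pellet ends up in a pore."""
    cells = [(r, c) for r in range(heights.shape[0])
             for c in range(heights.shape[1])]
    return sum(roll_pellet(heights, s, pores) for s in cells) / len(cells)


pores = {(2, 2)}
flat_rate = capture_rate(np.zeros((5, 5)), pores)     # effectors at rest
bowl_rate = capture_rate(bowl((5, 5), (2, 2)), pores)  # coordinated effectors
print(flat_rate, bowl_rate)
```

With the effectors at rest only a pellet that lands directly on the pore is captured, whereas the coordinated bowl shape captures pellets from every landing cell, which is the sense in which the effectors must work together.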

In a publication, we might attract more attention if we were to suggest at the outset that the game is intended to explore the dynamics involved in coordinating behavior that requires aligning and integrating information from the motor and somatosensory cortex. This would be an excellent application for implementing the ideas relating to Fuster's hierarchy that we discussed in subsections 2.3 (Sensorimotor Hierarchy) and 3.5 (Neural Programming) of the submitted Frontiers paper. It would also make for a great conference paper.

To begin with, the game engine would be easy to implement. As for the neural network architecture, each level in Fuster's hierarchy would consist of two multi-layer convolutional stacks, one for sensory features and one for motor features. The two stacks in each level would be linked by reciprocal connections and the loss function would include separate terms accounting for the accuracy of predicting motor features from sensory features — and vice versa — for each level in the hierarchy. It should be feasible to pretrain the network with a simple curriculum-based learning strategy, and then use reinforcement learning for training the final end-to-end system with length of play (survival) as a reward⁸³.
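The per-level loss terms described above might be assembled roughly as follows. This is only a sketch under strong simplifying assumptions: plain linear maps stand in for the reciprocal connections between the two convolutional stacks, mean squared error stands in for whatever accuracy measure is eventually chosen, and the names mse and cross_prediction_loss are hypothetical.

```python
import numpy as np


def mse(pred, target):
    """Mean squared prediction error."""
    return float(np.mean((pred - target) ** 2))


def cross_prediction_loss(sensory, motor, s2m, m2s):
    """Sum over hierarchy levels of the error in predicting motor features
    from sensory features and vice versa. Each argument is a list with one
    entry per level; features have shape (batch, dim)."""
    total = 0.0
    for s, m, w_sm, w_ms in zip(sensory, motor, s2m, m2s):
        total += mse(s @ w_sm, m)  # sensory -> motor term for this level
        total += mse(m @ w_ms, s)  # motor -> sensory term for this level
    return total


rng = np.random.default_rng(0)
dims = [(8, 6), (4, 3)]  # (sensory dim, motor dim) for two levels
sensory = [rng.normal(size=(5, ds)) for ds, _ in dims]
motor = [rng.normal(size=(5, dm)) for _, dm in dims]
s2m = [rng.normal(size=(ds, dm)) for ds, dm in dims]
m2s = [rng.normal(size=(dm, ds)) for ds, dm in dims]
loss = cross_prediction_loss(sensory, motor, s2m, m2s)

# Sanity check: identical features and identity maps give zero loss.
same = [np.eye(3)]
zero_loss = cross_prediction_loss(same, same, same, same)
print(loss, zero_loss)
```

In a real implementation the linear maps would be replaced by learned layers and the loss minimized by gradient descent during the pretraining phase, with the reinforcement learning objective added afterward.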

Miscellaneous Loose Ends: "The cerebellum is well poised to contribute to behavior because it receives a wide array of cortical and sensory information and is subject to control by a number of neuromodulators. To perform its function, the cerebellum is believed to integrate these diverse inputs to provide the rest of the brain with predictions required for optimal behavior." — excerpt from [84].

"Much of the information required for complex movements is generated by the cerebellum and the basal ganglia in conjunction with the cortex. Cerebellum and basal ganglia have been thought to communicate with each other only through slow, multi-synaptic cortical loops, begging the question as to how they coordinate their outputs in real time. We found that the cerebellum rapidly modulates the activity of the striatum via a disynaptic pathway in mice." — excerpt from [92].

December 29, 2019

%%% Sun Dec 29 03:15:22 PST 2019

Sketch for a revised and shortened version of the Frontiers in Computational Neuroscience submission that eliminates three subsections in the Neuroscience section and replaces the last two subsections in the Architecture section with a new subsection entitled "Compiling Programs" more relevant to the theme of the special issue. I could use some help with the following tasks:

1. Someone to go through the first few paragraphs of the new subsection below with an editorial eye for prose and technical eye for the content. I'm summarizing and simplifying to save space and make the treatment accessible; I would like for one of you to fact check the draft prose and suggest concrete changes to fix any errors or misleading simplifications.
2. Someone to look at the cerebellar model articulation controller (CMAC) in terms of (a) its biological relevance as a model of the cerebellum and (b) its potential technological value as an adaptive controller that might play a role in developing new artificial neural network architectures not just for robotics but also applications like the apprentice.
3. Someone to look at the recent developments in our understanding of the cerebellum (see here) with an eye for (a) what they imply for the role of the CB in social and other cognitive activities and (b) how Bostan et al's view of the BG plus CB plus cerebral cortex as "nodes in an integrated network" might alter current models [61].

Since we are short of time I will arbitrarily assign #1 to Gene, #2 to Chaofei and #3 to Meg. #1 is the most critical and hence time dependent, whereas #2 and #3 are candidates for a box insert if time permits. If you'd prefer not to work on this now, that is perfectly fine with me; perhaps you'd be interested in working on it after the break. If you're curious about the way in which the paper has come together and our working on it up to and beyond the deadline, it's my experience that anything really worth publishing traces out a similar arc defined by the strong desire to get it right and get it out there. Thanks for your patience.

3.1 Embodied Cognition

3.2 Conscious Attention

3.3 Action Selection

3.4 Executive Control

%%% %%%%%%%%%%%%%%%%%%%%%%%%%% EXECUTIVE CONTROL  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

TAKE ONLY WHAT'S NEEDED FROM 2.3 'SENSORIMOTOR HIERARCHY' TO SUPPORT THIS SECTION

Figure: Fusters_Coupled_Sensory_Motor_Hierarchy.jpg

ALSO ADD THE CONCEPTS IN THE ATTACHED GRAPHIC BUT IN SUCCINCT PROSE TO SAVE SPACE

Figure: Columns_Stripes_Clusters_Topographic_Maps.png

%%% %%%%%%%%%%%%%%%%%%%%%%%%%% COMPILING PROGRAMS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3.5 Program Compilation

The activity in the system comprised of the basal ganglia, cerebellum and motor cortex plus the brain stem and spinal cord is perhaps the closest the mammalian brain comes to writing and executing programs. Generally referred to as "motor programs", they control all of your voluntary movements and, in humans, they play an important role in language.

The right panel in Figure 3 shows the white matter reciprocal connections between the frontal cortex and the cerebellum that are believed to facilitate higher-order cognitive functions. Curiously, it is possible to lead a relatively normal life even if you were born without a cerebellum, difficulty speaking and walking being the most obvious deficits.

The distribution of cell types and neural circuitry of the cerebellum is reminiscent of the hippocampal formation and there are several detailed models of the cerebellum [341, 7, 240] that have inspired useful machine learning techniques [8], and yet discoveries in the last decade have challenged prevailing opinions.

It was believed that all communication between the basal ganglia and cerebellum was indirectly enabled via the cerebral cortex, but evidence now supports the existence of subcortical connections between the two suggesting that the basal ganglia, the cerebellum and the cerebral cortex form an integrated network [61].

These new discoveries will likely have important ramifications for our understanding of these critical systems that will lead to new algorithmic insights that parallel those fueled by our study of the hippocampal formation [[REFER TO NEW BOX INSERT]]. Here we focus on what the basal ganglia, the cerebellum and the cerebral cortex tell us about creating, selecting and coordinating motor programs.

The brain derives much of its utility from exploiting distributed representations and parallel processing. Even so, in big brains it is often useful to bring representations from distant parts of the brain together and necessary to perform some computations serially with the results from one computation feeding into another. The human brain has evolved machinery that makes it possible for us to do both by making better use of existing memory systems and adapting circuits optimized for movement to communicate, plan and perform abstract reasoning.

The human brain makes extensive use of topographically organized representations, often constructing multiple maps with the same topographic organization that can be aligned with one another to construct more abstract representations that retain the locality relationships of their constituent maps [516, 515]. This is a subtle point and a generally underappreciated fact about organisms with a central nervous system responsible for coordinating behavior distributed across their peripheral nervous system.

Several of the subcortical circuits we've discussed including the basal ganglia and hippocampus have access to sensorimotor areas of the cerebral cortex by way of the thalamus and the striatum, the latter being part of the basal ganglia. The thalamus consists of a set of nuclei that map specific subcortical inputs to the cortex and receive feedback from the same cortical areas. The striatum assists in coordinating cognitive functions, including both motor and action planning.

Much of the information moving around in the cortex is structured in the form of maps that align with the topography of the sensory and motor systems of the body, retinotopy of the visual cortex being one example. The striatum's distinctive striated appearance is due to its arrangement of specialized circuits called stripes that enable the basal ganglia to select and transfer information to distant locations in the cortex that have similarly striated and functioning circuits, preserving essential topographical features in the process [46, 312].

Each stripe is composed of columnar-shaped circuits called minicolumns that encode patterns of coordinated activity originating elsewhere in the cortex so they can be brought together in one place for processing. Each stripe can be updated independently, allowing the basal ganglia a great deal of discretion in creating a context for initiating subsequent computations at remote locations in the frontal cortex. These patterns of activity can be stored indefinitely, providing a working memory system that supports a simple yet powerful method for binding variables and composing their values [385].

Stripes are grouped in clusters that can reinforce or inhibit one another and clusters map to other clusters often by way of white-matter tracts that connect distant sensory and motor areas. Information is transferred preserving its topographic structure so processed information resulting from motion planning or other cognitive functions can be mapped back to its origin to support learning or stimulate muscle activity.

The activity in stripe clusters can be maintained or updated individually allowing for sustained or iterative processing and providing the basis for working memory. Information in multiple clusters can be combined to support a simple form of variable binding. The basal ganglia and prefrontal cortex can influence what information is transferred but the preserved alignment of the stripes within clusters dictates the origin of the information.
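The maintain-or-update behavior described above can be caricatured in a few lines. The sketch below is an assumption-laden toy, not a model of the circuitry: each stripe is a row of a matrix, a Boolean gate per stripe stands in for the basal ganglia's decision to update or maintain, and "binding" just means that a value has been loaded into a designated stripe where later computations can find it. The class name StripeMemory and its methods are hypothetical.

```python
import numpy as np


class StripeMemory:
    """Toy working memory: one row per stripe, updated only when gated."""

    def __init__(self, n_stripes, width):
        self.slots = np.zeros((n_stripes, width))

    def step(self, inputs, gate):
        """Overwrite stripe i with inputs[i] where gate[i] is True;
        every other stripe maintains its current contents."""
        gate = np.asarray(gate, dtype=bool)
        self.slots[gate] = np.asarray(inputs, dtype=float)[gate]
        return self.slots.copy()


x = np.array([1.0, 0.0, 0.0])  # pattern standing in for the value of X
y = np.array([0.0, 1.0, 0.0])  # pattern standing in for the value of Y

wm = StripeMemory(n_stripes=2, width=3)
wm.step(np.stack([x, x]), gate=[True, False])  # bind X in stripe 0
wm.step(np.stack([y, y]), gate=[False, True])  # bind Y in stripe 1, keep X

# Compose the two bound values by reading both stripes together.
composed = wm.slots.sum(axis=0)
print(wm.slots, composed)
```

Because the second update is gated off for stripe 0, the value bound to X survives while Y is loaded, and a downstream computation can then combine the contents of the two stripes.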

Here are some of the most cited early papers [31, 156, 155] on different implementations of Albus' cerebellar model articulation controller (CMAC):

Ersü, E. and Tolle, H. Hierarchical Learning Control - An Approach With Neuron-Like Associative Memories. American Institute of Physics, 249-261, 1988.

Ersü, E. and Militzer, J. Software Implementation of a Neuron-Like Associative Memory System for Control Application, Proceedings of the 2nd IASTED Conference on Mini- and Microcomputer Applications, MIMI'82, Davos (Switzerland), 1982.

Ersü, E. and Militzer, J. Real-Time Implementation of an Associative Memory-Based Learning Control Scheme for NonLinear Multivariable Processes, Symposium "Applications of Multivariable System Techniques", Plymouth (UK), 1984.

Ersü, E. and Tolle, H. A New Concept for Learning Control Inspired by Brain Theory, Proceedings of the 9th IFAC World Congress, Budapest (Hungary), 1984.

Ersü, E. and Tolle, H. Learning Control Structures with Neuron-Like Associative Memory Systems, in: Organization of Neural Networks, Structures and Models, ed. by von Seelen, Shaw, Leinhos, VCH Verlagsgesellschaft Weinheim, West Germany, 1988.
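For readers who have not seen a CMAC, the following is a minimal one-dimensional sketch of the general idea: the input activates one tile in each of several offset tilings, the output is the sum of the weights of the active tiles, and learning spreads the prediction error evenly across those weights. It is not drawn from any of the implementations listed above; the class name, parameter values and the sine-wave target are assumptions chosen only to keep the example short.

```python
import numpy as np


class CMAC:
    """Minimal 1-D CMAC: overlapping offset tilings with an LMS update."""

    def __init__(self, n_tilings=8, n_tiles=16, lo=0.0, hi=1.0, lr=0.2):
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.lo, self.width = lo, (hi - lo) / n_tiles
        self.lr = lr
        # One extra tile per tiling absorbs the offset at the upper edge.
        self.w = np.zeros((n_tilings, n_tiles + 1))

    def active(self, x):
        """Index of the single active tile in each offset tiling."""
        offsets = np.arange(self.n_tilings) * self.width / self.n_tilings
        idx = np.floor((x - self.lo + offsets) / self.width).astype(int)
        return np.clip(idx, 0, self.n_tiles)

    def predict(self, x):
        """Output is the sum of the weights of the active tiles."""
        return float(self.w[np.arange(self.n_tilings), self.active(x)].sum())

    def train(self, x, target):
        """Spread a fraction lr of the error evenly over the active tiles."""
        idx = self.active(x)
        error = target - self.predict(x)
        self.w[np.arange(self.n_tilings), idx] += self.lr * error / self.n_tilings


rng = np.random.default_rng(0)
model = CMAC()
for x in rng.random(5000):
    model.train(x, np.sin(2 * np.pi * x))

test_x = np.linspace(0.05, 0.95, 50)
mean_abs_error = float(np.mean(
    [abs(model.predict(x) - np.sin(2 * np.pi * x)) for x in test_x]))
print(mean_abs_error)
```

Because nearby inputs share most of their active tiles, training at one input generalizes locally to its neighbors, which is the property that makes the scheme attractive as a fast adaptive controller.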

December 27, 2019

%%% Fri Dec 27 03:41:13 PST 2019

As we look more carefully at action selection and motion planning, it will help if we adopt standard terminology as it will both simplify our discussions and reveal aspects of the problem that we've neglected to take into account. In classical mechanics and robot motion planning, the parameters that define the configuration of a system are called generalized coordinates or degrees of freedom, and the vector space defined by these coordinates is called the configuration space of the physical system. The following graphic shows a simple configuration space and two coordinate vectors referred to as configurations and illustrated here as two poses of a cartoon robot.

In the case of a robot, its actuators are restricted mechanically just as our movements are restricted by our skeletal structure and musculature. In robot planning, movement is further restricted by the obstacles in the robot's environment. These constraints limit the reachable space, corresponding to a lower-dimensional manifold called the configuration manifold. If you took a course from Jean-Claude Latombe or Oussama Khatib, you might remember that the complexity of path planning — finding an uninterrupted path from one point to another in a given configuration space — is worst-case exponential in the number of degrees of freedom.
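As a concrete and deliberately tiny example of these terms, consider a planar arm with two revolute joints. A configuration is the pair of joint angles, the forward kinematics map a configuration to a pose, and joint limits carve out the feasible part of the space. The link lengths, the joint limits and the function names below are all assumptions made for illustration.

```python
import math


def forward_kinematics(q, link_lengths=(1.0, 1.0)):
    """Map a configuration q = (theta1, theta2) of a planar two-link arm
    to the (x, y) position of its end effector."""
    t1, t2 = q
    l1, l2 = link_lengths
    x = l1 * math.cos(t1) + l2 * math.cos(t1 + t2)
    y = l1 * math.sin(t1) + l2 * math.sin(t1 + t2)
    return x, y


def is_feasible(q, limits=((-math.pi, math.pi), (0.0, 0.75 * math.pi))):
    """Mechanical joint limits restrict which configurations are reachable."""
    return all(lo <= angle <= hi for angle, (lo, hi) in zip(q, limits))


def grid_cells(n_dof, resolution):
    """Number of cells in a uniform grid over configuration space, which
    grows exponentially with the number of degrees of freedom."""
    return resolution ** n_dof


stretched = forward_kinematics((0.0, 0.0))             # arm fully extended
bent = forward_kinematics((math.pi / 2, math.pi / 2))  # elbow at a right angle
print(stretched, bent)
print(is_feasible((0.0, math.pi)), grid_cells(2, 100), grid_cells(7, 100))
```

The last line hints at why exhaustive search over a discretized configuration space becomes impractical: at a resolution of 100 cells per joint, two joints give ten thousand cells while seven joints give 10^14.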

That said, there are many efficient approximations [302] including Khatib's [268] artificial potential field method for real-time obstacle avoidance, which is efficient though susceptible to local minima, and Kavraki and Latombe's [264] method for randomized preprocessing of the configuration space. The former lends itself to artificial neural network implementations and the latter suggests that simple local search methods often suffice for many practical problems assuming sensory systems and servomechanisms able to exploit structural and environmental affordances.
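A potential field planner can be sketched in a few lines for a point robot in the plane: a quadratic attractive potential pulls the robot toward the goal, a repulsive potential pushes it away from an obstacle inside a fixed radius of influence, and the robot simply descends the gradient of the sum. The gains, the radius of influence, the step size and the function names are assumptions, and the scenario was chosen so that the robot does not get trapped in a local minimum, which this family of methods does not guard against in general.

```python
import numpy as np


def potential_gradient(q, goal, obstacle, k_att=1.0, k_rep=1.0, rho0=1.0):
    """Gradient of U(q) = 0.5*k_att*|q - goal|^2
                        + 0.5*k_rep*(1/rho - 1/rho0)^2 for rho < rho0,
    where rho is the distance from q to the obstacle."""
    grad = k_att * (q - goal)
    diff = q - obstacle
    rho = np.linalg.norm(diff)
    if 0.0 < rho < rho0:
        grad += -k_rep * (1.0 / rho - 1.0 / rho0) * diff / rho ** 3
    return grad


def plan(start, goal, obstacle, step=0.05, tol=0.01, max_iters=2000):
    """Gradient descent on the combined potential; returns the path taken."""
    q = np.array(start, dtype=float)
    goal = np.array(goal, dtype=float)
    obstacle = np.array(obstacle, dtype=float)
    path = [q.copy()]
    for _ in range(max_iters):
        if np.linalg.norm(q - goal) < tol:
            break
        q = q - step * potential_gradient(q, goal, obstacle)
        path.append(q.copy())
    return np.array(path)


obstacle = (2.0, 0.3)
path = plan(start=(0.0, 0.0), goal=(4.0, 0.0), obstacle=obstacle)
final_error = float(np.linalg.norm(path[-1] - np.array([4.0, 0.0])))
clearance = float(np.min(np.linalg.norm(path - np.array(obstacle), axis=1)))
print(len(path), final_error, clearance)
```

The straight line from start to goal would pass within 0.3 of the obstacle; the repulsive term bends the path away so that the clearance is larger than that, and the robot still converges on the goal.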

The following is a narrative sketch of the mechanisms and related representations that guide the selection of motor plans. So far our characterization of robot motion planning is impoverished as an agent's representation of its internal state in that it leaves out the critical role of perception and feedback. Robust motion plans must necessarily include perceptual activities whose purpose is to help guide movement, avoid obstacles including one's own body parts and facilitate accurate positioning and grasping.

Configuration space is a kinematic model plus a static representation of the physical environment. This is not the same as phase space, which emphasizes the system dynamics and includes the position and momentum (mass times velocity) of every particle. To serve as a basis for motion planning, an internal representation must capture not only the physical constraints that the body and environment impose on movement, but also the cognitive state of the agent incorporating current estimates of the location and velocity of nearby objects and body parts relative to the body's frame of reference and relevant knowledge about their properties and role in the agent's objectives.

There is a consensus that motor plans are represented in the primary, premotor and supplementary motor cortex in the frontal cortex adjacent to the central sulcus. The nearby somatosensory and proprioceptive association areas of the anterior parietal cortex are likely to figure prominently in planning movement, as are the sensory association areas in the posterior parietal cortex and in particular the dorsal visual stream also known as the "where" or "how" stream. These sources of relevant neural activity provide a rich context for action selection available directly to the basal ganglia to shape in the form of proposals or intentions⁸⁴.

We know that regions of the motor cortex are topographically mapped, in much the same way that nearby points in configuration space (or at least on the corresponding low-dimensional manifold) are joined by smooth trajectories through joint space. In lieu of an internal representation for abstract goals and the neural circuits required to act upon them, we borrow the term setpoint from control theory to denote the target state of the controlled system. Negative feedback systems use the difference between the system state and the setpoint to guide action selection and are common in biological organisms [28].

The complete context for acting is a composite summary of the agent's sensorimotor, proprioceptive and vestibular state derived from a hierarchy of primary, secondary and associative features that constitute an abstraction hierarchy [184] and roughly aligns with the features available to the basal ganglia and prefrontal cortex for modulating action selection.

The division of labor between the basal ganglia, motor cortex and cerebellar cortex is pretty well established — see here if you want more detail than this brief summary. The basal ganglia do not directly select motor programs but rather they enable them to run in the motor cortex. The motor cortex selects and executes motor programs issuing motor commands via the descending pathways. The cerebellum does not initiate motor commands but rather modifies the motor commands of the descending pathways to make movements more adaptive and accurate.

In the model for motor control presented here, the basal ganglia initiate a motor program in the motor cortex by creating a context for action corresponding to a desired future state. Since the motor cortex already has access to this information, all that is required of the basal ganglia is an offset from the current context to serve as a setpoint, thereby enabling the motor cortex to select an appropriate motor program for generating motor commands. Execution then consists of repeatedly invoking the selected motor program to traverse the augmented configuration manifold from the current context to the specified offset / setpoint.
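The two-stage scheme just described can be sketched in a few lines. This is a toy under strong assumptions: the context is a small real-valued vector, the "motor program" is a proportional negative-feedback update with a fixed gain, and the names `propose_setpoint` and `run_motor_program` are hypothetical stand-ins for the basal ganglia and motor cortex respectively.

```python
import numpy as np

def propose_setpoint(context, offset):
    """Basal ganglia stand-in: the desired state is the current context plus an offset."""
    return context + offset

def run_motor_program(context, setpoint, gain=0.3, tol=1e-3, max_iters=200):
    """Motor cortex stand-in: repeatedly reduce the error between state and setpoint.

    The proportional gain is an assumption; any controller that shrinks
    the error on each invocation would serve for the illustration.
    """
    state = context.copy()
    trajectory = [state.copy()]
    for _ in range(max_iters):
        error = setpoint - state          # negative feedback signal
        if np.linalg.norm(error) < tol:
            break
        state = state + gain * error      # one invocation of the motor program
        trajectory.append(state.copy())
    return np.array(trajectory)

# Hypothetical two-dimensional context and offset.
context = np.array([0.0, 1.0])
setpoint = propose_setpoint(context, offset=np.array([1.0, -0.5]))
trajectory = run_motor_program(context, setpoint)
```

In this sketch the path through state space is a straight line; in the model proper the update would have to respect the augmented configuration manifold, which is where the sensory context and learned motor programs come in.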

The actions available in the process of executing a motor program include motor commands in the form of muscle contraction and relaxation and sensorimotor activities in service to visual — or other sensory modality — servoing of the sort required for grasping objects and avoiding obstacles. Methods for path planning such as the artificial potential field methods mentioned earlier or strategies for sensor-based traversals with randomized recovery could easily be incorporated in this model [315]. As for learning how to carry out and coordinate more complicated movements and manipulations, decades of research developing models of the cerebellum offer practical suggestions.

James Albus developed a model of the mammalian cerebellum [7] about the same time that David Marr was working on his model [341]. The Marr-Albus theory, sometimes referred to as the Marr-Albus-Ito theory, is still the foundation on which most theories of the cerebellum are built, though these theories have yet to accommodate the more recent discoveries summarized here. Albus went on to work in applied robotics for manufacturing [9] and invented a new approach to robotic control [8] that he christened the cerebellar model articulation controller (CMAC), which is widely used in reinforcement learning for classification and control problems [454].
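For readers who have not seen a CMAC, the following is a minimal one-dimensional sketch in the spirit of Albus's controller, not a reproduction of [8]. The input is covered by several overlapping tilings, each shifted by a fraction of a tile width; the output is the sum of one weight per tiling, and learning is a simple error-correcting update spread over the active tiles. The class name, the number of tilings and tiles, the learning rate and the sine-wave target are all assumptions made for illustration.

```python
import numpy as np

class CMAC:
    """Toy 1-D CMAC: overlapping offset tilings with an LMS weight update."""

    def __init__(self, n_tilings=8, n_tiles=16, lo=0.0, hi=1.0, lr=0.2):
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.lo, self.hi = lo, hi
        self.width = (hi - lo) / n_tiles
        self.lr = lr
        # One extra tile per tiling absorbs the shift of the offset tilings.
        self.w = np.zeros((n_tilings, n_tiles + 1))

    def active_tiles(self, x):
        """Index of the single active tile in each tiling for input x."""
        offsets = np.arange(self.n_tilings) * self.width / self.n_tilings
        idx = np.floor((x - self.lo + offsets) / self.width).astype(int)
        return np.clip(idx, 0, self.n_tiles)

    def predict(self, x):
        return self.w[np.arange(self.n_tilings), self.active_tiles(x)].sum()

    def update(self, x, target):
        """Move the prediction a fraction lr of the way toward the target."""
        idx = self.active_tiles(x)
        err = target - self.predict(x)
        self.w[np.arange(self.n_tilings), idx] += self.lr * err / self.n_tilings
        return err

def train(model, f, rng, n_samples=5000):
    """Fit the model to f using uniformly sampled inputs."""
    for x in rng.uniform(model.lo, model.hi, size=n_samples):
        model.update(x, f(x))

# Hypothetical usage: learn one period of a sine wave.
model = CMAC()
train(model, lambda x: np.sin(2 * np.pi * x), np.random.default_rng(0))
```

Since only one weight per tiling is touched on each update, learning is local and cheap, which is much of the appeal of the approach for control.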

Figure 17: The above graphic depicts a simple robot environment illustrating the role of topographically organized representations of sensory and motor systems. The environment is shown on the left and corresponds to an 8 × 8 grid of locations divided into floor area and a 2 × 2 opening labeled HOLE in the graphic. The robot can visualize the grid as an 8 × 8 array and distinguish floor area from the hole. Periodically, a red circle labeled above as FOOD PELLET will appear in a randomly selected location. The robot has to deposit at least one FOOD PELLET in the HOLE every N ticks of the game clock. When the time on the clock runs out or the robot fails to deposit a FOOD PELLET in the allotted N ticks, the game stops and the robot's score is the total number of FOOD PELLETS consumed.

The robot has 8² − 4 = 60 motorized effectors, one for each floor area location. Each effector works like a hydraulic lift and can raise or lower the floor of its assigned location at a rate of one FOOD PELLET diameter per tick. The robot can raise or lower as many locations as it deems appropriate on each tick of the clock. The floor can't be lowered below its initial location at the start of the game or raised more than 4 times the diameter of a FOOD PELLET; attempting to do so results in no change. When raised above the starting level, the effector behaves as if rounded so that any FOOD PELLET occupying that location will roll in a random grid-axis-aligned direction unless prevented from doing so by a neighboring effector raised above the FOOD PELLET's current level. The robot's visual system is topographically mapped, preserving the connectivity of the grid layout. Floor locations always appear blue unless occupied by a FOOD PELLET in which case they appear red.

The array of 60 effectors is shown in profile on the right side of the above graphic. The robot cannot determine visually from its vantage point whether a given location on the floor is lifted or at the initial starting level. However, the robot's array of effectors is topographically aligned with the grid of locations, and so the correspondence could be determined by trial and error. In addition to knowing the coordinates of each effector in the frame of reference of the effector array, the robot's somatosensory map keeps track of the current height — 0, 1, 2, 3 or 4 — of each effector. Since the robot can move multiple effectors at the same time, the robot can execute complex motor programs, analogous to complex reaching or grasping movements, in which it creates a tilted trough of raised lifts, and a spherical FOOD PELLET simply rolls down the trough into the HOLE.
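The rules in the caption are simple enough to simulate directly. The sketch below covers only the effector dynamics and pellet rolling; scoring and the game clock are omitted. Several details are assumptions not fixed by the caption: the HOLE is placed at rows and columns 3 to 4, HOLE cells are treated as permanently at height 0, the grid edge acts as a wall, and a pellet may roll onto any neighbor that is not raised above its current level. All names are hypothetical.

```python
import numpy as np

GRID = 8          # the environment is an 8 x 8 grid
MAX_HEIGHT = 4    # lifts range over heights 0, 1, 2, 3, 4

def make_floor_mask(hole_row=3, hole_col=3):
    """True for floor cells, False for the 2 x 2 HOLE (position is an assumption)."""
    mask = np.ones((GRID, GRID), dtype=bool)
    mask[hole_row:hole_row + 2, hole_col:hole_col + 2] = False
    return mask

def apply_commands(heights, commands, floor_mask):
    """Apply one tick of lift commands in {-1, 0, +1}.

    Commands that would leave the range [0, MAX_HEIGHT] result in no change,
    and cells in the HOLE have no effector.
    """
    commands = np.where(floor_mask, np.clip(commands, -1, 1), 0)
    proposed = heights + commands
    valid = (proposed >= 0) & (proposed <= MAX_HEIGHT)
    return np.where(valid, proposed, heights)

def roll_pellet(pos, heights, rng):
    """Move a pellet off a raised lift in a random axis-aligned direction.

    Neighbors raised above the pellet's current level block the move;
    the grid boundary is assumed to be a wall.
    """
    r, c = pos
    if heights[r, c] == 0:
        return pos                        # the pellet rests on unraised floor
    options = []
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nr, nc = r + dr, c + dc
        if 0 <= nr < GRID and 0 <= nc < GRID and heights[nr, nc] <= heights[r, c]:
            options.append((nr, nc))
    if not options:
        return pos
    return options[rng.integers(len(options))]
```

A trough is then just a pattern of heights in which every cell along the intended route has exactly one unblocked neighbor, so that the random roll becomes deterministic.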

It is a worthwhile exercise of your understanding of the ideas presented here, as well as a good way of revealing their shortcomings, to come up with a toy model simple enough that you can simulate it on paper or, better yet, write a program that implements the model and that makes the abstract ideas more concrete. I've provided one such model in the form of a simple arcade game to illustrate the role of hierarchically and topographically organized representations of sensory and motor systems and the importance of being able to represent and deploy complex motor plans that require the coordination of multiple muscle groups.

The caption and panels on the right and left of Figure 17 describe a simple robot and physics simulation environment, plus the rules and scoring for the arcade game. The graphic in the center panel represents an instance of Fuster's hierarchy [184] illustrating the characteristic two stacks of reciprocally paired levels of feature maps ordered from the most concrete on the bottom of each stack to the most abstract on the top. Each level might be implemented as an encoder-decoder (sensory-motor) pair of gated recursive convolutional neural networks [99].

Your assignment is to flesh out the details of how and under what conditions it might be possible to train a model such as the one shown in Figure 17 and sketched earlier in this entry to excel at playing the arcade game. To recap the key characteristics of the earlier sketch: In the first stage, the basal ganglia propose a motor program specified as a setpoint offset from a suitably compressed version of the current state vector that you can think of as a goal or intention.

In the second stage, circuits in the motor cortex implementing a negative feedback controller recurrently drive the system toward the target setpoint. To accomplish the goal of reaching the supplied setpoint, these circuits attempt to trace a path along the manifold representing the set of reachable system states analogous to the configuration space of the robot-environment pairing. This step makes clear the sense in which motor programs are stored in the motor cortex.

December 25, 2019

%%% Wed Dec 25 03:26:38 PST 2019

By way of trying to explain what we are after in developing models of robust action selection and error mitigation, here is a short narrative illustrating how the hippocampal and cerebellar memory systems might work together to solve everyday problems. See also the earlier comments concerning the undervalued role of the cerebellum outside of coordinating motor programs.

The hippocampus is generally considered the neural basis for episodic memory whereas the cerebellum is believed to be primarily procedural. In cognitive science, the former is said to be explicit and declarative since such memories can be easily articulated, and the latter implicit and hence not available for narrative purposes. The distinction has become somewhat more complicated by the fact that we now know that the prefrontal cortex and the cerebellar cortex coevolved in primates to be larger and much more connected with one another.

To illustrate the similarities and differences between the different types of memory offered by the hippocampus and cerebellum, consider how these memory systems play a role in learning to play a musical instrument such as a piano or violin. Aside: Professor Minsky's legacy was his insatiable curiosity. "He used to say, 'You don’t really understand something if you only understand it one way,'" Dr. Kay said. "He never thought he had anything completely done." — Alan Kay and Danny Hillis remembering Marvin Minsky.

Imagine a student learning how to play the violin being taught by an accomplished violinist. Suppose that the student tries to play a short passage from a Vivaldi concerto and is frustrated when she finds it difficult. Her teacher repositions her fingers on the strings and illustrates how to hold the bow more vertically aligned with the plane of the violin body. In her subsequent practicing, the student encodes memory traces in her hippocampus as she repeats the motions and experiments with variations in the suggested posture.

The memories are consolidated during sleep by replaying the traces of her experience recorded during the lesson. In her next lesson, she remembers her teacher's advice regarding the Vivaldi piece and applies it to a similar passage in a Chopin étude combining it with additional advice from an earlier lesson. As she does so her episodic memory is altered to reflect the successful combination of movements and its expanded scope.

A few days later she attempts to use the same combination of movements in a cantata written by Bach and her teacher corrects her playing suggesting an alternative fingering. Having retrieved the earlier combination of movements, she now amends her memory to reflect a more restrictive application of the earlier advice. Her explicit memories from the lessons with her teacher become less like arbitrary sequences corresponding to specific experiences and more like precisely targeted subroutines.

In addition to her lessons, the student practices for several hours each day, engaging her cerebellum in refining her skill. As she learns new pieces of music practicing the techniques she has learned from her teacher, her basal ganglia create snapshots incorporating sensory information about what she hears, how she holds the violin and bow and her proprioceptive state. These snapshots along with information relating to positive and negative feedback are forwarded to the cerebellum where they are used to refine existing motor programs and [...]

The snapshots correspond to edited summaries of the current neural activity in the sensorimotor cortex and serve as a context for the cerebellum in deciding how to drive muscle activity. The cerebellum compares these summaries against its predictions concerning the next state and expected reward and makes adjustments to coordinate with other ongoing motor programs before forwarding the amended motor commands to the motor areas in the frontal cortex.

December 21, 2019

%%% Sat Dec 21 04:55:26 PST 2019

Until recently it was believed⁸⁵ that the basal ganglia (BG) and cerebellum (CB) communicated with one another only indirectly through the cerebral cortex⁸⁶. In the last decade however, evidence has been accumulating for topographically organized subcortical connections involving the dentate nucleus in the cerebellum and the subthalamic nucleus of the basal ganglia that open up the possibility of new learning and control mechanisms [61, 60]⁸⁷.

The new picture emerging is that of two functionally distinct components linked together to form an integrated functionally coherent composite network not unlike the hippocampal formation with respect to its micro and macro structure [60]. There is even some evidence involving theta band synchrony between basal ganglia and hippocampus in support of the role of the hippocampal formation in sensorimotor integration [212]. These observations argue against the traditional view of modularity and suggest a more algorithmic view.

We generally assume that the basal ganglia serve in a supervisory role in training the cerebellum by creating motor programs that it can subsequently call in appropriate contexts. It must also train itself by using reinforcement learning to map patterns of cortical activity in the cerebral cortex to programs stored in the cerebellar cortex that maximize expected reward.

In keeping with the model of hippocampus proposed by Gershman and Daw, perhaps the interaction between the basal ganglia and the cerebellum is less like calling / selecting a program from a library of such programs stored in the cerebellum, and more like tapping into a large network of procedural knowledge in the form of a process that one can allow to unfold or intervene in as required.

A little more precisely, think of the CB as a stochastic process with an evolving state space and transition function. Imagine this process has entry, branch and terminal states. The job of the BG is to oversee this process by intervening to reset to a new entry point, provide information to update the transition prior at a choice point or otherwise adjust the transition function. In equilibrium, the process is governed by the state transition function, but occasionally it has to be redirected in response to unanticipated contingencies and requires additional direction not available in the current state, including input from the prefrontal cortex that allows for context switching, multi-tasking, additional deliberation, problem solving and reevaluation of options.
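As a sketch of what this might look like computationally, the code below treats the CB as a small Markov chain and the BG as an optional supervisor that is consulted before every transition. Everything here is an illustrative assumption: the four-state chain, the function names, and the convention that the supervisor returns a (possibly reset) state together with the (possibly re-weighted) transition probabilities for that state.

```python
import numpy as np

def run_process(P, entry, terminal, rng, supervisor=None, max_steps=100):
    """Run a Markov chain from entry until a terminal state is reached.

    supervisor(state, probs) -> (state, probs) may reset the state or
    re-weight the transition prior; if it resets the state it must also
    return transition probabilities appropriate to the new state.
    """
    state, trace = entry, [entry]
    for _ in range(max_steps):
        if state in terminal:
            break
        probs = P[state].copy()
        if supervisor is not None:
            state, probs = supervisor(state, probs)
            if state != trace[-1]:
                trace.append(state)       # record the reset
                if state in terminal:
                    break
        state = int(rng.choice(len(probs), p=probs))
        trace.append(state)
    return trace

# Hypothetical chain: 0 = entry, 1 = branch, 2 and 3 = terminal states.
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

def prefer_goal(state, probs):
    """BG stand-in: at the branch state, rule out terminal state 3."""
    if state == 1:
        weights = probs * np.array([1.0, 1.0, 1.0, 0.0])
        probs = weights / weights.sum()
    return state, probs

unsupervised = run_process(P, 0, {2, 3}, np.random.default_rng(0))
supervised = run_process(P, 0, {2, 3}, np.random.default_rng(0),
                         supervisor=prefer_goal)
```

Left alone, the chain ends in either terminal state; with the supervisor in place the branch is resolved in favor of state 2, which is the minimal version of "updating the transition prior at a choice point".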

While the papers we have looked at employ a wide variety of modeling and implementation methods, it would seem that there are only a (small) finite number of basic algorithmic components including different memory models, linear transformations, activation functions and weight adjustment strategies. We might also want to include standard components for attention, compression and prediction like adaptive sparse coding, generative adversarial networks and variational autoencoders.

Often the most important lessons we can learn from neuroscience involve network topologies, descriptions of what problems they solve, e.g., pattern separation and completion, and characterization of problems and failure modes, e.g., weight saturation, vanishing gradients and negative rewards. In some cases, problems can easily be fixed by some form of regularization such as normalization, truncation or dropout. In other cases, recovery requires that the agent recognize and take steps to address the problem — failure is inevitable.

Here are some of the problems that the cerebellar cortex solves that were discussed in the papers cited in earlier entries in this record. Note that the footnotes for each item in this list include additional explanation excerpted from the relevant papers:

error-based versus non-error-based adaptation⁸⁸
memory-based prediction and partial-observability⁸⁹
reinforcement learning and pattern recognition⁹⁰
controlling movement timing and coordination⁹¹

Miscellaneous Loose Ends: Results over the last few years have added support for the hypothesis that the cerebellum is involved in a wide range of cognitive and social behaviors. See for example recent popular-press review articles from Scientific American and the Dana Foundation. In their 2019 paper [84] published in Science, Carta et al report that "monosynaptic excitatory projections from the cerebellar nuclei to the ventral tegmental area (VTA) powerfully activate the reward circuitry and contribute to social behavior."

In their 2018 paper [339] appearing in Neuron, Marek et al report that "Cerebellar functional networks are topographically individual-specific. Cerebellar intrinsic fMRI signals lag those in cortex by 100-400 ms. The frontoparietal control network is greatly overrepresented (> 2-fold), suggesting that the cerebellum is important for the adaptive control of the brain’s cognitive processes." The researchers were surprised to discover that a lot more granule cells were firing simultaneously to encode sensory information, and these firing patterns encoded both external sensory information and signals from other parts of the brain.

December 19, 2019

%%% Thu Dec 19  3:15:36 PST 2019

Given that our goal is not to understand the brain per se, but rather to leverage what is known in order to design new algorithms and architectures, perhaps one of you can glean enough to come up with solutions to some of the problems we discussed in our arXiv paper. The Todorov et al [486] paper features the interplay between the basal ganglia and cerebellum. The authors provide a simple mathematical model of the cerebellum⁹², and combine it with a mathematical model of the basal ganglia developed by Kim et al [271]. Both models appear to be detailed enough to replicate or adapt in developing a model that uses standard deep learning components.

In the section of the multi-authored paper [81] entitled "The DPM Architecture for Learning and Control", James Houk describes the reciprocal connections that the basal ganglia and the cerebellum form with the cerebral cortex as separate distributed processing modules [233] that interact with the cortex through pattern classification relying on Hebbian learning. At the same time, reinforcement learning takes place in the basal ganglia and supervised learning in the cerebellum. See Figures 4 and 5 in [81] for more about the architectural insight and the main text for the mathematical model.

December 17, 2019

%%% Tue Dec 17 3:45:59 PST 2019

I wanted a more concrete description of Fuster's hierarchy [185] including more detail regarding the reciprocal connections between motor and sensory features at different levels within the hierarchy, along with a more detailed account of the perception-action cycle. From my reading, it was clear that looking at just the basal ganglia and prefrontal cortex wasn't going to be enough since the cerebellum plays an important role. Such a description may be buried in Fuster's magnum opus or distributed across several chapters in O'Reilly et al [385], but I couldn't find it.

So I returned to one of my favorite introductory textbooks, Bear, Connors and Paradiso's Neuroscience: Exploring the Brain [51] and found part of what I was looking for in their explanation of the motor control hierarchy. Their account, supplemented with Matt Botvinick's [66] retelling, should make a nice teaching module for CS379C in the Spring. Lacking was an algorithmic / detailed computational description or system model in which the component transfer functions are well defined. I'll get back to this issue shortly. Here is Bear, Connors and Paradiso's high-level description of the motor control hierarchy:

Level	Function	Brain Structures
High	Strategy	association areas of the neocortex, basal ganglia
Middle	Tactics	motor cortex, cerebellum
Low	Execution	brain stem, spinal cord

The proper function of each level of the motor control hierarchy relies so heavily on sensory information, that the motor system of the brain might properly be considered a sensorimotor system. At the highest level, sensory information generates a mental image of the body⁹³ and its relationship to the environment. At the middle level, tactical decisions are based on the memory of sensory information from the past. At the lowest level, sensory feedback is used to maintain posture, muscle length, and tension before and after each voluntary movement. Page 453 [51].

The following illustrates the direct path from the visual and somatosensory areas in the posterior parietal cortex, to subnuclei of the basal ganglia and then back to the premotor and supplementary motor cortex by way of the ventral lateral nucleus in the thalamus. See the panel on the left in Figure 15:

Cortex → Striatum → Globus Pallidus → VLo → Cortex (SMA)

To complete the cycle and generate actions, the basal ganglia enlist the cerebellum to orchestrate the execution of motor commands through the brainstem and spinal cord as shown on the right in Figure 15.

Figure 15: On the left is a summary of the motor loop from the cortex to the basal ganglia to the thalamus and back to area 6 spanning the premotor and supplementary motor areas. On the right is a summary of the motor loop through the cerebellum. VLo and VLc are different subnuclei of the ventral lateral nucleus responsible for sending control signals to the spinal cord. Source: Figures 14.10 and 14.18 in Bear, Connors and Paradiso. Neuroscience: Exploring the Brain, Fifth Edition. Lippincott Williams & Wilkins. Baltimore, Maryland, 2007

One other point made by Bear, Connors and Paradiso is that the motor areas in the frontal cortex are topographically organized: "Wilder Penfield discovered that weak electrical stimulation of area 4 in the precentral gyrus would elicit a twitch of the muscles in a particular region of the body on the contralateral side. Systematic probing of this region established that there is a somatotopic organization in the human precentral gyrus much like that seen in the somatosensory area of the postcentral gyrus. Area 4 is now referred to as the primary motor cortex or M1. Penfield also found two somatotopically organized motor maps in area 6: one in a lateral region he called the premotor area and one in a medial region he called the supplementary motor area." I would like to understand better how PBWM squares with this picture.

I was using the third edition of Bear, Connors and Paradiso printed in 2006 as a reference; I bought a copy of the fifth edition (2015) when it first came out, but one of the students in CS379C borrowed it, graduated and then never returned it. That forced me to look to the literature where I found a bunch of recent relevant papers. One that caught my attention has eighteen co-authors [81] and the provocative title, "Consensus Paper: Towards a Systems-Level View of Cerebellar Function: the Interplay Between Cerebellum, Basal Ganglia, and Cortex". Another [486] has half that many co-authors and the modest title, "The interplay between cerebellum and basal ganglia in motor adaptation: A modeling study". Both of these papers are worth a look and I will refer back to them in this entry in these notes.

December 15, 2019

%%% Sun Dec 15  2:42:37 PST 2019

I'm trying to obtain a better understanding of the structure of the cortical motor areas in order to design an architecture that reflects the dependencies in Fuster's hierarchy and flexibly accommodates the role of the basal ganglia in action selection and learning even as the hierarchy assembles itself during development to include ever more complex sensory and motor abstractions. This now seems more complicated than I originally thought. So I'm assembling relevant material with an emphasis on the motor areas in the frontal cortex.

Given the attention paid to the visual cortex, we are comfortable thinking about a hierarchy of features that span the lowest, e.g., edges and orientations, to the highest, e.g., faces and object shapes, within the single modality association areas. That the primary and secondary visual features are embedded in retinotopically organized brain areas is also easy to imagine⁹⁴. The other sensory modalities including somatosensory areas are somewhat more complicated but still relatively easy to understand thanks in large part to the work of Wilder Penfield.

The same cannot be said about motor cortex. Stimulating areas in the motor cortex produces motion as muscles flex and joints move. Low-level stimulation in the primary motor cortex results in the motion of individual body parts whereas prolonged stimulation in the association (premotor) cortex and supplementary motor cortex produces more complicated motions⁹⁵ organized in a complex topographic map that has to accommodate multiple frames of reference, kinematic and dynamic properties, ethological considerations and environmental characteristics⁹⁶.

O'Reilly et al [385] provide a number of clues relating to how the basal ganglia initiate motor activities:

Carrying the parietal "how" pathway forward, visual information going along the dorsal pathway through the parietal cortex heads directly into the frontal cortex, where it can drive motor neurons in primary motor cortex, which can directly drive the muscles to produce overt motor actions. Page 77
Both the cerebellum and basal ganglia have a complex disinhibitory output dynamic, which produces a gating-like effect on the brain areas they control. For example, the basal ganglia can disinhibit neurons in specific nuclei of the thalamus, which have bidirectional excitatory circuits through frontal and prefrontal cortical areas. Page 99
Dopamine burst activity drives the direct "Go" pathway neurons in the striatum, which then inhibit the tonic activation in the globus pallidus internal segment (GPi), which releases specific nuclei in the thalamus from this inhibition, allowing them to complete a bidirectional excitatory circuit with the frontal cortex, resulting in the initiation of a motor action. Page 101
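The disinhibitory gating described in these excerpts amounts to two inhibitory stages in series, so that exciting the first stage releases the third. The following caricature uses rectified rate units with made-up weights to show the arithmetic; it is not the model in [385], and the function names and unit-valued tonic activity levels are assumptions.

```python
def thalamic_gate(dopamine_burst, go_weight=1.0, gpi_tonic=1.0):
    """Return thalamic activity in [0, 1] given a dopamine burst in [0, 1].

    Striatal "Go" neurons inhibit GPi, and GPi tonically inhibits the
    thalamus, so driving the striatum disinhibits the thalamus.
    """
    striatum_go = max(0.0, go_weight * dopamine_burst)
    gpi = max(0.0, gpi_tonic - striatum_go)   # striatum inhibits GPi
    thalamus = max(0.0, 1.0 - gpi)            # GPi inhibits thalamus
    return thalamus

def motor_output(cortical_plan, dopamine_burst):
    """The cortical plan is expressed only to the extent the gate is open."""
    return thalamic_gate(dopamine_burst) * cortical_plan

# With no burst the gate is closed; with a full burst the plan passes through.
closed = motor_output(cortical_plan=2.0, dopamine_burst=0.0)
opened = motor_output(cortical_plan=2.0, dopamine_burst=1.0)
```

Note that the basal ganglia contribute no motor content of their own in this sketch; they only determine whether the plan already present in cortex is allowed to run.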

In the next step, we consider the role of reciprocal connections between complementary motor and sensor areas. These connections are responsible for aligning the motor and sensory features with the agents [...] Sensory features are optimized to provide the information required to support motor activities at a comparable level of abstraction in the hierarchy, while motor features are optimized to define a repertoire of actions that maximizes expected reward by way of the basal ganglia. The thalamus is crucial for perception, with 98% of all sensory input being relayed by it. The only sensory information not relayed by the thalamus into the cerebral cortex is information related to olfaction.

Probably no efferents from the prefrontal cortex are more closely related to its executive and motor functions than those that flow from it to other areas of the frontal lobe and to the basal ganglia. The analysis of corticocortical connections within the frontal lobe reveals that the prefrontal cortex is the origin of a cascade of connective links that flow down from it to premotor and supplementary motor areas (SMAs) (area 6), and from there to primary motor cortex (area 4). This cascade of motor links, which are reciprocal and to some degree topologically organized, has been best substantiated in primate brains. Page 43 Chapter 2 [185].

Figure: Connectivity of the prefrontal cortex with structures involved in motor function. The three lateral frontal regions of the motor hierarchy (prefrontal, premotor, and motor cortices) are interconnected with the thalamus, basal ganglia, and cerebellum by recurrent axonal loops that are essential to motor control. Abbreviations: CM, central median nucleus; MD, mediodorsal nucleus; VA, anteroventral nucleus; VL, ventrolateral nucleus. Source: Figure 2.20 in Joaquín M. Fuster. Prefrontal Cortex, 5th Edition. Elsevier, London, 2015.

December 7, 2019

%%% Sat Dec  7  4:22:04 PST 2019

One of the issues that we keep returning to when we contemplate how the apprentice might learn to analyze and manipulate code involves where the relevant tools and models are located in the overall architecture, how the apprentice might learn to work them into its workflow and whether we have to do something special in terms of designing the necessary circuits and routing the appropriate representations to these circuits when it makes sense to do so.

Figure 14 depicts a network architecture for a battery of tools that might be used by the programmer's apprentice to analyze, emulate, and otherwise operate on representations of code fragments and related objects that correspond to the inputs and outputs of the fully-instrumented integrated development environment (FIDE). These representations arise, just like any other abstract multi-modal representation, as patterns of activity in the posterior parietal association areas.

Obviously the circuits humans use to represent and work with mathematical objects were not selected, evolutionarily speaking, to specifically serve the needs of mathematical reasoning. In the case of mathematics, it would seem that the neural correlates for mathematical reasoning are located in the frontal cortex nearby but not coincident with the circuits used for non-mathematical forms of communication relying on language⁹⁷.

Our discussions concerning executive control in the prefrontal cortex [384], symbolic processing including variable-binding [292] and the regular topographic organization⁹⁸ found throughout the frontal cortex led us to a model for content-specialized abstract processing that depends on PBWM to parcel out the abstract representations at the top of Fuster's hierarchy [66] into stripe clusters that are routed to initially similar circuits that are subsequently specialized by content and use in the process of learning⁹⁹.

The network shown in Figure 14 is essentially the top level in Fuster's hierarchy with the M_i for 1 ≤ i ≤ N implementing the sensory-to-motor cortex pathway at a given level and M⁻¹_i for 1 ≤ i ≤ N implementing the reciprocal connections. Note that this network doesn't include the components shown in Figure 13 and described in [124] that are responsible for executive control and together with basal ganglia determine activity selection.

Figure 14: A network architecture for learning a collection of software engineering skills that might be used by the programmer's apprentice to operate on representations of code fragments and related objects that correspond to the inputs and outputs of the fully-instrumented integrated development environment. This network also serves as the top level of Fuster's hierarchy [184] enabling the continuous goal-directed interaction between the apprentice and its environment that Fuster calls the perception–action cycle.

Figure 14 defines a set of new abstract activities that, in the case of the programmer's apprentice, facilitate the analysis and synthesis of conventional computer programs. The inputs and outputs labeled E and A respectively correspond to two groups of N stripes and each individual stripe is roughly analogous to a register used to store operands and the intermediate results of carrying out instructions in a conventional processor chip.

B and C are the same size as E and A respectively, and are used to store intermediate results. The network layer labeled E provisionally allows inputs to be rerouted to different stripes, and may or may not turn out to be useful. As described in [292], the PBWM model handles variable binding and executive control, allowing the PFC and BG to initiate activities analyzing and synthesizing conventional computer programs.
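
The register analogy can be made concrete with a small sketch. The Python below is illustrative only: the class name StripeBank, the gate method and the example vectors are assumptions introduced here and are not part of the PBWM model or of Figure 14. It shows a bank of N stripes in which a per-stripe "go" signal decides whether a stripe is overwritten by its input or maintains its current content.

```python
from typing import List, Sequence


class StripeBank:
    """A bank of N stripes, each acting like a gated register (hypothetical sketch)."""

    def __init__(self, n_stripes: int, width: int) -> None:
        # Every stripe starts out holding an all-zero pattern.
        self.stripes: List[List[float]] = [[0.0] * width for _ in range(n_stripes)]

    def gate(self, inputs: Sequence[Sequence[float]], go: Sequence[bool]) -> None:
        """Overwrite stripe i with inputs[i] when go[i] is True; otherwise maintain it."""
        for i, (pattern, update) in enumerate(zip(inputs, go)):
            if update:
                self.stripes[i] = list(pattern)


bank = StripeBank(n_stripes=3, width=2)
# First step: load stripes 0 and 2, leave stripe 1 untouched.
bank.gate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], go=[True, False, True])
# Second step: only stripe 1 is updated; stripes 0 and 2 maintain their content.
bank.gate([[0.5, 0.5], [0.2, 0.8], [0.0, 0.0]], go=[False, True, False])
print(bank.stripes)
```

In a learned system the go signals would presumably come from a trained gating policy rather than being supplied by hand as they are here.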

The reciprocal connections M⁻¹_i for 1 ≤ i ≤ N between C and B correspond to inverse kinematic models that ground learning in the interaction between the apprentice and its environment, benefiting from this interaction by allowing the apprentice to learn these models in an unsupervised manner even in partially observable domains¹⁰⁰. This application of feedback is called the perception–action cycle and it occurs at all levels within Fuster's hierarchy with the prefrontal cortex sitting at the top of that cycle controlling the temporal integration and the sequencing of actions [184].

Miscellaneous Loose Ends: Continuing with the background research started in the previous log entry, here is some of the early work identifying the distinctive cytoarchitecture of the associational stripes that make up frontostriatal circuits¹⁰¹ and tracing their local and distal networks leading up to the realization of the fundamental role they play in the larger, white-matter pathways linking the prefrontal cortex with the basal ganglia (striatum) and posterior cortex (thalamus) [405, 407, 311, 13].

December 5, 2019

%%% Thu Dec  5  3:25:33 PST 2019

Create a narrative explaining how the architecture described in Figure 13 might perform the following task. Given a program and input-output pair, run the program with the specified input, determine if the program produces the specified output and communicate the result. Note that I'm not asking how you would learn to represent programs or evaluate such representations. Right now I'm posing a much simpler problem that amounts to "Assuming you can learn to represent programs as differentiable objects and evaluate them one step at a time by iteratively transforming the program as you would in parsing a natural language sentence, e.g., using graph nets and a variant of imagination-based planning as sketched in the arXiv paper, then what actions would an RL agent perform to accomplish this task in a connectionist setting?" Such a narrative might describe the following steps:

perceive / identify (attention)
fetch / load (operands)
select action (operator)
fetch / load (operator)
apply / compare (evaluation)

Each of these steps (fetch, load, compare, etc.) requires moving information around, performing operations on pieces of the information, and so on. The PBWM model suggests an approach to doing so in a connectionist architecture, by providing a flexible solution to the variable binding problem as described in Kriete et al [292]. In this exercise, I just want you to describe how we could perform the steps using the model in Figure 13 with modifications as required. Once we understand how to do this we can think about how you might learn to do this as well as other related operations. Another way of characterizing what I want to do here is to expand the repertoire of actions that a reinforcement-learning agent has in order to encompass learning subroutines like the fetch-decode-execute cycle¹⁰² performed by the central processing unit in conventional computers based on the von Neumann architecture.
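
As a purely symbolic point of reference for the narrative, the steps listed above can be sketched as an ordinary loop. Everything in this Python sketch is an assumption made for illustration: the toy instruction format (operator name, constant operand), the OPERATORS table, the register names and the function check_program do not come from Figure 13 or from [292]. A connectionist account would have to replace each commented step with gated updates to working memory.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical operator table; a learned system would represent these differently.
OPERATORS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

Instruction = Tuple[str, int]  # (operator name, constant operand)


def check_program(program: List[Instruction], program_input: int, expected: int) -> bool:
    """Run a toy straight-line program and compare its result with the expected output."""
    registers = {"acc": program_input, "operand": 0}  # stand-ins for working-memory slots
    for name, constant in program:                    # perceive / identify (attention)
        registers["operand"] = constant               # fetch / load (operands)
        operator = OPERATORS[name]                    # select action, fetch / load (operator)
        registers["acc"] = operator(registers["acc"], registers["operand"])  # apply
    return registers["acc"] == expected               # compare (evaluation)


program = [("add", 3), ("mul", 2)]
result = check_program(program, program_input=4, expected=14)
print("match" if result else "mismatch")  # communicate the result
```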

The architecture for encoding a program given an equivalent abstract syntax tree and executing the program one step at a time is assumed to require only standard neural-network technology including attentional, recurrent and convolutional networks of the sort used in machine translation and automated question answering [438, 130]. We further assume that the neural correlates for performing such tasks in humans reside in the prefrontal cortex perhaps in the same or similar circuits as those employed for mathematical reasoning and not those areas responsible for non-mathematical forms of communication relying on language [18, 132].

It's worth pointing out that we are completely ignoring timing, coordinating, recognizing and correcting errors, or any of the analogs of fine motor control that accompany the rapid sifting, checking, ranking and rejecting that constitutes the bulk of our everyday thought processes including those of mathematicians when they are in the throes of problem solving.

Figure 13: The above graphic is a visual study aid that includes the basic neural network components described in [124]. The edges shown indicate dependencies that we have already accommodated in our thinking, such as the hyperdirect pathway and the reciprocal connections between the basal ganglia (BG) striatum, prefrontal cortex (PFC) and association areas of the posterior parietal cortex (PPC) including the medial temporal and parietal cortex [292, 539]. The challenge, spelled out in the text, is to provide a compelling account of how some suitably fleshed-out refinement of this model could be used to accomplish a routine task that comes up in writing and debugging software. The insets in the upper left and halfway down on the right are from O'Reilly and Frank [384].

The network shown in Figure 13 consists of [...] to be continued in subsequent entries. In the remainder of this entry, I've provided answers to a number of questions that might help in resolving some of the confusions and omissions in the model in Figure 13. Keep in mind that you have easy access to most of the referenced papers in my research paper repository.

On the difference between the seven-plus-or-minus-two model of short-term memory¹⁰³ as conceived of by George Miller [357], the three-part working memory model of Baddeley and Hitch [37, 188] and the long-term working memory model of Ericsson and Kintsch [154]. Bernard Baars's global workspace theory¹⁰⁴ is a simple cognitive architecture developed to explain aspects of conscious processing [33], whereas the related model of Dehaene and Changeux¹⁰⁵ focuses on the interplay between attention and selection [127]. On the role of posterior parietal cortex (PPC) and prefrontal cortex (PFC) synchronization in the access and active maintenance of working memory [437] and the prefrontal cortex basal ganglia working memory (PBWM) model of O'Reilly and Frank [384] described here¹⁰⁶.

On the role of prefrontal cortex¹⁰⁷ and striatum¹⁰⁸ in updating working memory [262]. On the role of working memory and the prefrontal cortex in symbolic reasoning in the prefrontal cortex basal ganglia working memory (PBWM) model and an approach to supporting variable binding by using a form of indirection analogous to a pointer in computer science¹⁰⁹.
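
The pointer analogy can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the names filler_stripes, role_stripes, bind and lookup, and the example roles and fillers, are all hypothetical and chosen here only to show the idea that one store holds content while another holds nothing but the location of that content.

```python
from typing import Dict, List

# Filler stripes hold content; role stripes hold only the index of a filler stripe.
filler_stripes: List[str] = ["", "", ""]
role_stripes: Dict[str, int] = {}


def bind(role: str, filler: str, slot: int) -> None:
    """Store the filler in a stripe and point the role at that stripe."""
    filler_stripes[slot] = filler
    role_stripes[role] = slot


def lookup(role: str) -> str:
    """Follow the role's pointer to retrieve the current filler."""
    return filler_stripes[role_stripes[role]]


bind("agent", "Mary", slot=0)
bind("patient", "John", slot=1)

# Rebinding swaps only the pointers; the stored fillers themselves are untouched.
role_stripes["agent"], role_stripes["patient"] = (
    role_stripes["patient"],
    role_stripes["agent"],
)
print(lookup("agent"), lookup("patient"))
```

The point of the indirection is that a rule written in terms of roles can be applied to fillers it has never seen, because the rule only ever manipulates the pointers.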

On the role of the arcuate fasciculus connecting the frontal, parietal and temporal lobes in facilitating language and visuotemporal processing¹¹⁰. On the role of the hyperdirect pathway in modulating the output activity of the basal ganglia [369] and related work on the development of this pathway as it relates to inhibitory control [80]; see footnote here¹¹¹. Here are a couple of useful links from last year's class [...] canonical illustration of neuromodulatory pathways influencing the basal ganglia here and the best graphic I've found illustrating the hyperdirect pathway [...] discussion in class notes here and here.

November 2, 2019

%%% Sat Nov  2  3:50:05 PDT 2019

"Boxes, boxes, we got boxes!" — Stan Rosenschein

>>> Chaofei — hippocampal formation — pattern separation, pattern integration — September 3 [457, 3, 4, 394]
Motivation: developing neural substrate and experience for encoding activity patterns, memory capacity, integration opportunities.

>>> Eugene — prefrontal cortex — hierarchy, abstraction, goal based planning — October 27 & 29 [173, 66, 219]
Motivation: hierarchy and abstraction for goal-directed planning, coordination with basal ganglia, latent state space decomposition.

>>> Megumi — basal ganglia and dopamine — experience replay, transfer learning — October 31 [191, 29, 277]
Motivation: complementary memory systems, transfer learning, experience replay, memory consolidation, and catastrophic interference.

Note to self: "Keep it simple stupid!" — Kelly Johnson

Here is the Cliffs Notes version of the October 31 entry on complementary memory systems:

Availability of memories is reliably predicted by statistical relationships related to stimuli and the pattern of exposure ...
>>> conclude we learn to predict the importance of such information and use it to guide the storage and usage of memories.

We discover structure in ensembles of items if learning of each item is gradual and interleaved with learning other items ...
>>> conclude it's important to interleave learning new items with discovering / exploiting structure in prior learned items.

The memory system makes most accessible the traces that it predicts will be most useful in the future and integrates them ...
>>> conclude we have to conduct analysis on items in temporary memory in anticipation of performing long-term consolidation.

Ideally we would have access to prior data and the ability to retrain networks combining both recent and prior knowledge ...
>>> conclude that existing networks provide the basis for recovering latent patterns that serve as proxies for the original data.

Miscellaneous Loose Ends: Andy Clark's Edge piece on predictive coding, which he often refers to as perception as prediction and maintains is fundamental in computational cognitive neuroscience. Andrew Saxe, 48 minutes into this Brain Inspired podcast, discusses when replay makes sense. He describes one scenario in which it does make sense and one in which it doesn't, and then discusses his research efforts to determine whether or not the brain can make such distinctions and apply them in practice.

October 31, 2019

%%% Thu Oct 31  3:37:04 PDT 2019

Here are my notes on dual (complementary) memory systems, the statistical relationships between environmental stimuli and the pattern and frequency of exposure to such stimuli, methods utilizing such information to govern the duration of retaining events associated with persistent environmental stimuli, and whether and when to transfer information in short-term memory to long-term memory and avoid catastrophic forgetting in the process. Most of the following prose consists of excerpts from the cited papers. I've included the bibliographic entries along with the abstracts for your convenience, and PDFs of the listed papers are included in the shared Google Drive folder.

Anderson and Schooler [21] — Availability of human memories for specific items shows reliable relationships to frequency, recency, and pattern of prior exposures to the item. Empirical data about a number of environmental sources (New York Times, parental speech, electronic mail) are examined to show that the probability that a memory will be needed also shows reliable relationships to frequency, recency, and pattern of prior exposures. Moreover, the environmental relationships are the same as the memory relationships. It is argued that human memory has the form it does because it is adapted to these environmental relationships. Models for both the environment and human memory are described.

McClelland et al [348, 296] — Damage to the hippocampal system disrupts recent memory but leaves remote memory intact. The account presented here suggests that memories are first stored via synaptic changes in the hippocampal system, that these changes support reinstatement of recent memories in the neocortex, that neocortical synapses change a little on each reinstatement, and that remote memory is based on accumulated neocortical changes. Models that learn via changes to connections help explain this organization. These models discover the structure in ensembles of items if learning of each item is gradual and interleaved with learning about other items. This suggests that the neocortex learns slowly to discover the structure in ensembles of experiences.

Stevens et al [470] — John Anderson and colleagues [21] developed the rational analysis of memory, in which they argued that much of memory performance, including forgetting, might be understood in terms of adaptation to the structure of the environment. The first key assumption of the rational analysis is that environmental stimuli make informational demands on the cognitive system that are met by retrieving memory traces associated with those stimuli. The second assumption is that the memory system acts on the expectation that environmental stimuli tend to reoccur in predictable ways; the pattern of past encounters can, thus, predict the future need of information.

The third assumption is that the memory system makes most accessible those traces that it predicts will be most useful in the future. Consequently, memory performance should reflect the patterns with which environmental stimuli occur and reoccur in the environment. For instance, more recently encountered stimuli will likely be encountered again. An adaptive memory system should make information about those stimuli more accessible because it is more likely to be needed. Conversely, the longer time interval since the last encounter, the less likely the information will be needed in the future, and so it can and should be forgotten.
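
One standard way to turn these assumptions into a number is the base-level learning equation from Anderson's ACT-R theory, B = ln(Σ_j t_j^(-d)), where t_j is the time since the j-th past encounter and d is a decay parameter. The sketch below is an illustration, not a reproduction of the analyses in [21] or [470]; the function name, the example ages and the choice d = 0.5 (the conventional ACT-R default) are assumptions made here.

```python
import math
from typing import Sequence


def base_level_activation(ages: Sequence[float], decay: float = 0.5) -> float:
    """Log of the summed power-law-decayed traces.

    ages: time elapsed since each past encounter with the item (all > 0).
    decay: power-law exponent; 0.5 is a conventional default, assumed here.
    """
    return math.log(sum(age ** -decay for age in ages))


recent = base_level_activation([1.0, 10.0])                  # seen twice, recently
stale = base_level_activation([50.0, 100.0])                 # seen twice, long ago
frequent = base_level_activation([50.0, 60.0, 70.0, 100.0])  # seen often, long ago

print(recent, stale, frequent)
```

Higher activation corresponds to a trace the system predicts is more likely to be needed: recency raises it (recent versus stale) and so does frequency (frequent versus stale).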

McClelland, McNaughton and Lampinen [346] — We use artificial neural networks with properties like those we attribute to the neocortex to develop a theoretical understanding of the role of consistency with prior knowledge in putatively neocortex-like learning systems, providing new insights into when integration will be fast or slow and how integration might be made more efficient when the items to be learned are hierarchically structured. The work relies on deep linear networks that capture the qualitative aspects of the learning dynamics of the more complex non-linear networks used in previous work.

We examine the question of how new knowledge gets integrated into linear networks that have already learned something. Given the above, this reduces to the question of how this new knowledge projects onto the knowledge that is already stored in the network. [...] Bruce McNaughton has observed that hippocampal retrieval falls off according to a power law distribution [...] suppose you could infer a decay rate on the basis of the predictions from a learned model of latent state space [...] could you then use the delay duration to adjust your prior expectation for seeing it again [...] or use this factor to influence experience replay [...] or improve pseudo-rehearsal updates to the long-term stable memory.
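
One possible reading of the replay question above is to sample stored experiences with probability proportional to a power-law function of their age. The sketch below is speculative and only illustrates that reading: the function replay_weights, the decay exponent and the example ages are hypothetical, and nothing here is taken from the cited papers.

```python
from typing import List, Sequence


def replay_weights(ages: Sequence[float], decay: float = 0.5) -> List[float]:
    """Normalized power-law weights; items with larger age receive less weight.

    ages: time since each stored experience was last encountered (all > 0).
    decay: assumed power-law exponent; in the scenario sketched in the text
           it would be inferred from a learned model rather than fixed.
    """
    raw = [age ** -decay for age in ages]
    total = sum(raw)
    return [value / total for value in raw]


weights = replay_weights([1.0, 4.0, 16.0])
print(weights)
```

These weights could be passed to the standard library call random.choices(items, weights=weights, k=batch_size) to draw a replay batch. Whether recent or remote items should be favored is itself part of the open question, so the direction of the weighting is an assumption too.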

Robins [426] — We replicate some of the experiments described by Ratcliff [420], including those relating to a simple "recency" based rehearsal regime. We then develop further rehearsal regimes which are more effective than recency rehearsal. In particular "sweep rehearsal" is very successful at minimising catastrophic forgetting. One possible limitation of rehearsal in general, however, is that previously learned information may not be available for retraining. We describe a solution to this problem, "pseudorehearsal", a method which provides the advantages of rehearsal without actually requiring any access to the previously learned information (the original training population) itself.

We then suggest an interpretation of these rehearsal mechanisms in the context of a function approximation based account of neural network learning. Both rehearsal and pseudorehearsal may have practical applications, allowing new information to be integrated into an existing network with minimum disruption of old information. Robert French and his colleagues [175, 174, 25] pushed these ideas a bit further and the following three papers — all three appearing within the last two years — extend the earlier work by applying recent advances in deep neural networks and generative models.
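
The core of pseudorehearsal can be shown with a toy model. The sketch below is not Robins' experimental setup: it assumes a single linear unit trained with the delta rule in place of a backpropagation network, and the weights, items, learning rate and the names train, make_pseudo_items and drift are all hypothetical. Pseudo-items are random binary inputs labeled by the existing network's own outputs, and they are interleaved with the new item during training.

```python
import random
from typing import List, Sequence, Tuple

Item = Tuple[List[float], float]  # (input vector, target output)


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    return sum(w * xi for w, xi in zip(weights, x))


def train(weights: Sequence[float], items: Sequence[Item],
          epochs: int = 500, lr: float = 0.02) -> List[float]:
    """Delta-rule training that cycles through every item in each epoch."""
    w = list(weights)
    for _ in range(epochs):
        for x, target in items:
            error = target - predict(w, x)
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
    return w


def make_pseudo_items(weights: Sequence[float], count: int,
                      rng: random.Random) -> List[Item]:
    """Random binary inputs paired with the existing network's own outputs."""
    items: List[Item] = []
    for _ in range(count):
        x = [float(rng.randint(0, 1)) for _ in weights]
        items.append((x, predict(weights, x)))
    return items


def drift(old_w: Sequence[float], new_w: Sequence[float],
          inputs: Sequence[Sequence[float]]) -> float:
    """Summed squared change in the outputs for the old inputs."""
    return sum((predict(new_w, x) - predict(old_w, x)) ** 2 for x in inputs)


# Assumed stand-in for a network that has already learned the old items.
old_weights = [1.0, -1.0, 0.5, 0.0]
old_inputs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 1.0]]
new_item: Item = ([1.0, 1.0, 0.0, 0.0], 2.0)

pseudo_items = make_pseudo_items(old_weights, count=20, rng=random.Random(0))

naive = train(old_weights, [new_item])                     # new item only
rehearsed = train(old_weights, [new_item] + pseudo_items)  # interleaved

drift_naive = drift(old_weights, naive, old_inputs)
drift_rehearsed = drift(old_weights, rehearsed, old_inputs)
print(drift_naive, drift_rehearsed)
```

Note that the old items themselves are never used during retraining; only the old weights are needed to generate the pseudo-items. In a model this small the pseudo-items also pull against fitting the new item exactly, a trade-off that larger networks with spare capacity can soften.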

Atkinson et al [29] — Our model combines pseudo-rehearsal with a deep generative model and a dual memory system, resulting in a method that does not demand additional storage requirements as the number of tasks increase. Our model iteratively learns three Atari 2600 games while retaining above human level performance on all three games and performing as well as a set of networks individually trained on the tasks. This result is achieved without revisiting or storing raw data from past tasks. Furthermore, previous state-of-the-art solutions demonstrate substantial forgetting compared to our model on these complex deep reinforcement learning tasks.

Ha and Schmidhuber [207] — We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.

Ketz et al [267] — Our main focus in this work is testing the hypothesis that interleaving pseudo-rehearsal based replays can be sufficient to preserve the learned knowledge within a world models framework. A variational auto encoder is trained to compress a high dimensional input (e.g. images) into a smaller latent space while also allowing for a reconstruction of that latent space back into the high dimensional space. This latent space representation is then fed into a temporal prediction network that is trained to predict one time step into the future. The output of these networks is then used as a latent state-space for a reinforcement learning based controller.

@article{AndersonandSchoolerPSYCHOLOGY-SCIENCE-91,
       author = {John R. Anderson and Lael J. Schooler},
        title = {Reflections of the Environment in Memory},
      journal = {Psychological Science},
       volume = {2},
       number = {6},
         year = {1991},
     abstract = {Availability of human memories for specific items shows reliable relationships to frequency, recency, and pattern of prior exposures to the item. These relationships have defied a systematic theoretical treatment. Empirical data about a number of environmental sources (New York Times, parental speech, electronic mail) are examined to show that the probability that a memory will be needed also shows reliable relationships to frequency, recency, and pattern of prior exposures. Moreover, the environmental relationships are the same as the memory relationships. It is argued that human memory has the form it does because it is adapted to these environmental relationships. Models for both the environment and human memory are described.}
}
@article{AtkinsonetalCoRR-18,
       author = {Craig Atkinson and Brendan McCane and Lech Szymanski and Anthony Robins},
        title = {Pseudo-Rehearsal: Achieving Deep Reinforcement Learning without Catastrophic Forgetting},
      journal = {CoRR},
       volume = {arXiv:1812.02464},
         year = {2018},
     abstract = {Neural networks can achieve extraordinary results on a wide variety of tasks. However, when they attempt to sequentially learn a number of tasks, they tend to learn the new task while destructively forgetting previous tasks. One solution to this problem is pseudo-rehearsal, which involves learning the new task while rehearsing generated items representative of previous tasks. Our model combines pseudo-rehearsal with a deep generative model and a dual memory system, resulting in a method that does not demand additional storage requirements as the number of tasks increase. Our model iteratively learns three Atari 2600 games while retaining above human level performance on all three games and performing as well as a set of networks individually trained on the tasks. This result is achieved without revisiting or storing raw data from past tasks. Furthermore, previous state-of-the-art solutions demonstrate substantial forgetting compared to our model on these complex deep reinforcement learning tasks.}
}
@article{McClellandetalNIiM-19,
       author = {James L. McClelland and Bruce L. McNaughton and Andrew K. Lampinen},
        title = {Integration of New Information in Memory: New Insights from a Complementary Learning Systems Perspective},
      journal = {Submitted to the Philosophical transactions of the Royal Society of London. Series B, Biological sciences},
    publisher = {The Royal Society},
         year = {2019},
     abstract = {According to complementary learning systems theory, integrating new memories into the neocortex of the brain without interfering with what is already known depends on a gradual learning process, interleaving new items with items already known. However, empirical studies show that information consistent with prior knowledge can be integrated very quickly. We use artificial neural networks with properties like those we attribute to the neocortex to develop a theoretical understanding of the role of consistency with prior knowledge in putatively neocortex-like learning systems, providing new insights into when integration will be fast or slow and how integration might be made more efficient when the items to be learned are hierarchically structured. The work relies on deep linear networks that capture the qualitative aspects of the learning dynamics of the more complex non-linear networks used in previous work. The time course of learning in these networks can be linked to the hierarchical structure in the training data, captured mathematically as a set of dimensions that correspond to the branches in the hierarchy. In this context, a new item to be learned can be characterized as having aspects that project onto previously known dimensions, and others that require adding a new branch/dimension. The projection onto the known dimensions can be learned rapidly without interleaving, but learning the new dimension requires gradual interleaved learning. When a new item only overlaps with items within one branch of a hierarchy, interleaving can focus on the previously-known items within this branch, resulting in faster integration with less interleaving overall. The discussion considers how the brain might exploit these facts to make learning more efficient and highlights predictions about what aspects of new information might be hard or easy to learn.}
}
@article{RobinsCONNECTION-SCIENCE-95,
       author = {Anthony Robins},
        title = {Catastrophic Forgetting, Rehearsal and Pseudorehearsal},
      journal = {Connection Science},
       volume = {7},
       number = {2},
        pages = {123-146},
         year = {1995},
    publisher = {Taylor \& Francis},
     abstract = {This paper reviews the problem of catastrophic forgetting (the loss or disruption of previously learned information when new information is learned) in neural networks, and explores rehearsal mechanisms (the retraining of some of the previously learned information as the new information is added) as a potential solution. We replicate some of the experiments described by Ratcliff (1990), including those relating to a simple 'recency' based rehearsal regime. We then develop further rehearsal regimes which are more effective than recency rehearsal. In particular, 'sweep rehearsal' is very successful at minimizing catastrophic forgetting. One possible limitation of rehearsal in general, however, is that previously learned information may not be available for retraining. We describe a solution to this problem, 'pseudorehearsal', a method which provides the advantages of rehearsal without actually requiring any access to the previously learned information (the original training population) itself. We then suggest an interpretation of these rehearsal mechanisms in the context of a function approximation based account of neural network learning. Both rehearsal and pseudorehearsal may have practical applications, allowing new information to be integrated into an existing network with minimum disruption of old information.}
}
@article{RollsARP-01,
       author = {Edmund Rolls},
        title = {Memory systems in the brain},
      journal = {Annual Review of Psychology},
       volume = {51},
         year = {2000},
        pages = {599-630},
     abstract = {The operation of different brain systems involved in different types of memory is described. One is a system in the primate orbitofrontal cortex and amygdala involved in representing rewards and punishers, and in learning stimulus-reinforcer associations. This system is involved in emotion and motivation. A second system in the temporal cortical visual areas is involved in learning invariant representations of objects. A third system in the hippocampus is implicated in episodic memory and in spatial function. Fourth, brain systems in the frontal and temporal cortices involved in short term memory are described. The approach taken provides insight into the neuronal operations that take place in each of these brain systems, and has the aim of leading to quantitative biologically plausible neuronal network models of how each of these memory systems actually operates.}
}
@article{RollsHIPPOCAMPUS-00,
       author = {Rolls, Edmund T.},
        title = {Hippocampo-cortical and cortico-cortical backprojections},
      journal = {Hippocampus},
       volume = {10},
       number = {4},
         year = {2000},
        pages = {380-388},
     abstract = {First, the information represented in the primate hippocampus, and what is computed by the primate hippocampus, are considered. Then a theory is described of how the information represented in the hippocampus is able to influence the cerebral cortex by a hierarchy of hippocampo-cortical and cortico-cortical backprojection stages. The recalled backprojected information in the cerebral neocortex could then be used by the neocortex as part of memory recall, including that required in spatial working memory; to influence the processing that each cortical stage performs based on its forward inputs; to influence the formation of long-term memories; and/or in the selection of appropriate actions.},
}
@article{StevensetalJARMC-13,
       author = {Jeffrey R. Stevens and Thorsten Pachur and Lael J. Schooler},
        title = {Rational analysis of the adaptive and predictive nature of memory},
      journal = {Journal of Applied Research in Memory and Cognition},
       volume = {2},
       number = {4},
        pages = {251-253},
         year = {2013},
     abstract = {In his target article, Klein (2013) makes the important point that many approaches to studying memory neglect the function of memory, in particular its capacity to help predict the future. Here, we complement Klein’s argument in two ways. First, we point to an existing and well-developed research program that formalizes a functional approach to memory, exploring its adaptive nature. Second, we illustrate how this approach can be applied to analyze regularities in social interactions, which memory might exploit to predict future interactions.},
}
@article{JarosiewiczetalJoN-02,
       author = {Jarosiewicz, Beata and McNaughton, Bruce L. and Skaggs, William E.},
        title = {Hippocampal Population Activity during the Small-Amplitude Irregular Activity State in the Rat},
      journal = {Journal of Neuroscience},
    publisher = {Society for Neuroscience},
       volume = {22},
       number = {4},
         year = {2002},
        pages = {1373-1384},
     abstract = {The sleeping rat cycles between two well-characterized physiological states, slow-wave sleep (SWS) and rapid-eye-movement sleep (REM), often identified by the presence of large-amplitude irregular activity (LIA) and theta activity, respectively, in the hippocampal EEG. Inspection of the activity of ensembles of hippocampal CA1 complex-spike cells along with the EEG reveals the presence of a third physiological state within SWS. We characterize the hippocampal EEG and population activity of this third state relative to theta activity and LIA, its incidence relative to REM and LIA, and the functional correlates of its population activity. This state occurs repeatedly within stretches of SWS, occupying \~{}33\% of SWS and \~{}20\% of total sleep, and it follows nearly every REM episode; however, it never occurs just before a REM episode. The EEG during this state becomes low in amplitude for a few seconds, probably corresponding to {\textquotedblleft}small-amplitude irregular activity{\textquotedblright} (SIA) described in the literature; we will call its manifestation during sleep {\textquotedblleft}S-SIA.{\textquotedblright} During S-SIA, a small subset of cells becomes active, whereas the rest remain nearly silent, with the same subset of cells active across long sequences of S-SIA episodes. These cells are physiologically indistinguishable from ordinary complex-spike cells; thus, the question arises as to whether they have any special functional correlates. Indeed, many of these cells are found to have place fields encompassing the location where the rat sleeps, raising the possibility that S-SIA is a state of increased alertness in which the animal{\textquoteright}s location in the environment is represented in the brain.},
}
@article{RollsFiSN-13,
       author = {Rolls, Edmund},
        title = {The mechanisms for pattern completion and pattern separation in the hippocampus},
      journal = {Frontiers in Systems Neuroscience},
       volume = {7},
         year = {2013},
        pages = {74},
     abstract = {The mechanisms for pattern completion and pattern separation are described in the context of a theory of hippocampal function in which the hippocampal CA3 system operates as a single attractor or autoassociation network to enable rapid, one-trial, associations between any spatial location (place in rodents, or spatial view in primates) and an object or reward, and to provide for completion of the whole memory during recall from any part. The factors important in the pattern completion in CA3 together with a large number of independent memories stored in CA3 include a sparse distributed representation which is enhanced by the graded firing rates of CA3 neurons, representations that are independent due to the randomizing effect of the mossy fibers, heterosynaptic long-term depression as well as long-term potentiation in the recurrent collateral synapses, and diluted connectivity to minimize the number of multiple synapses between any pair of CA3 neurons which otherwise distort the basins of attraction. Recall of information from CA3 is implemented by the entorhinal cortex perforant path synapses to CA3 cells, which in acting as a pattern associator allow some pattern generalization. Pattern separation is performed in the dentate granule cells using competitive learning to convert grid-like entorhinal cortex firing to place-like fields. Pattern separation in CA3, which is important for completion of any one of the stored patterns from a fragment, is provided for by the randomizing effect of the mossy fiber synapses to which neurogenesis may contribute, by the large number of dentate granule cells each with a sparse representation, and by the sparse independent representations in CA3. 
Recall to the neocortex is achieved by a reverse hierarchical series of pattern association networks implemented by the hippocampo-cortical backprojections, each one of which performs some pattern generalization, to retrieve a complete pattern of cortical firing in higher-order cortical areas.}
}
@article{KudrimotietalJoN-99,
       author = {Kudrimoti, H. S.  and Barnes, C. A.  and McNaughton, B. L. },
        title = {{{R}eactivation of hippocampal cell assemblies: effects of behavioral state, experience, and {E}{E}{G} dynamics}},
      journal = {Journal of Neuroscience},
       volume = {19},
       number = {10},
         year = {1999},
        pages = {4090-4101},
     abstract = {During slow wave sleep (SWS), traces of neuronal activity patterns from preceding behavior can be observed in rat hippocampus and neocortex. The spontaneous reactivation of these patterns is manifested as the reinstatement of the distribution of pairwise firing-rate correlations within a population of simultaneously recorded neurons. The effects of behavioral state [quiet wakefulness, SWS, and rapid eye movement (REM)], interactions between two successive spatial experiences, and global modulation during 200 Hz electroencephalographic (EEG) "ripples" on pattern reinstatement were studied in CA1 pyramidal cell population recordings. Pairwise firing-rate correlations during often repeated experiences accounted for a significant proportion of the variance in these interactions in subsequent SWS or quiet wakefulness and, to a lesser degree, during SWS before the experience on a given day. The latter effect was absent for novel experiences, suggesting that a persistent memory trace develops with experience. Pattern reinstatement was strongest during sharp wave-ripple oscillations, suggesting that these events may reflect system convergence onto attractor states corresponding to previous experiences. When two different experiences occurred in succession, the statistically independent effects of both were evident in subsequent SWS. Thus, the patterns of neural activity reemerge spontaneously, and in an interleaved manner, and do not necessarily reflect persistence of an active memory (i.e., reverberation). Firing-rate correlations during REM sleep were not related to the preceding familiar experience, possibly as a consequence of trace decay during the intervening SWS. REM episodes also did not detectably influence the correlation structure in subsequent SWS, suggesting a lack of strengthening of memory traces during REM sleep, at least in the case of familiar experiences.}
}
@article{BattagliaetalCSH-04,
       author = {Battaglia, F. P. and Sutherland, G. R. and McNaughton, B. L.},
        title = {Hippocampal sharp wave bursts coincide with neocortical up-state transitions},
      journal = {Learning \& Memory},
       volume = {11},
       number = {6},
         year = {2004},
        pages = {697-704},
     abstract = {The sleeping neocortex shows nested oscillatory activity in different frequency ranges, characterized by fluctuations between "up-states" and "down-states." High-density neuronal ensemble recordings in rats now reveal the interaction between synchronized activity in the hippocampus and neocortex: Electroencephalographic sharp waves in the hippocampus were more probable during down-states than during up-states, and tended to coincide with transitions from down-states to up-states. The form of cortical activity fluctuations and their interactions with sharp waves depend on sleep depth: In deeper sleep stages, characterized by strong neocortical oscillation in the delta range or slower (approximately 0.8-4 Hz), sharp-wave-triggered peri-event time histograms (PETH) are consistent with a longer duration for down-states than for up-states. In lighter sleep, the sharp-wave-triggered PETH suggested longer up-states than down-states. These results highlight the interplay in the hippocampal/neocortical loop: Decreased neocortical input during down-states may be a factor in generation of sharp waves. In turn, sharp waves may facilitate down-to-up transitions. This interplay may reflect joint memory trace reactivation in the hippocampus and in the neocortex, possibly contributing to consolidation of long-term memory: Off-line reactivation of recent neural activity patterns in the hippocampus occurs during 50-100-msec electroencephalographic sharp waves, corresponding to pyramidal-cell population bursts. The neocortical up-states starting in correspondence with sharp waves may be influenced by the reactivated information carried by the hippocampal sharp wave.}
}
@article{KetzetalCoRR-19,
       author = {Nicholas Ketz and Soheil Kolouri and Praveen K. Pilly},
        title = {Using World Models for Pseudo-Rehearsal in Continual Learning},
      journal = {CoRR},
       volume = {arXiv:1903.02647},
         year = {2019},
     abstract = {The utility of learning a dynamics/world model of the environment in reinforcement learning has been shown in a many ways. When using neural networks, however, these models suffer catastrophic forgetting when learned in a lifelong or continual fashion. Current solutions to the continual learning problem require experience to be segmented and labeled as discrete tasks, however, in continuous experience it is generally unclear what a sufficient segmentation of tasks would be. Here we propose a method to continually learn these internal world models through the interleaving of internally generated episodes of past experiences (i.e., pseudo-rehearsal). We show this method can sequentially learn unsupervised temporal prediction, without task labels, in a disparate set of Atari games. Empirically, this interleaving of the internally generated rollouts with the external environment's observations leads to a consistent reduction in temporal prediction loss compared to non-interleaved learning and is preserved over repeated random exposures to various tasks. Similarly, using a network distillation approach, we show that modern policy gradient based reinforcement learning algorithms can use this internal model to continually learn to optimize reward based on the world model's representation of the environment.},
}

October 29, 2019

%%% Tue Oct 29 02:42:48 PDT 2019

There is a growing consensus and a fair bit of evidence to support the hypothesis that the human frontal cortex is in charge of executive control, goal-directed planning and abstract thinking. There are differences in opinion about how these cognitive processes are implemented and how they coordinate their activity with that of the rest of the brain. One thing that seems clear is that the frontal cortex and in particular the prefrontal cortex employs many of the same strategies as do networks elsewhere in the brain, both cortical and subcortical.

In particular, circuits in the prefrontal cortex recapitulate the coarse-to-fine, concrete-to-abstract feature hierarchies that we see in the sensory, motor and somatosensory cortex. They exhibit the profuse reciprocal recurrent connections between levels of abstraction that enable us to generalize on the basis of relatively small amounts of information, learn to make accurate predictions in an unsupervised manner depending on observations and interactions with the environment to ground our conclusions, and that provide the foundation for constructing a rich repertoire of representations that serve decision-making.

The neural correlates of abstract thinking, including the circuits that enable us to solve practical problems as well as pursue pure mathematics, are generally agreed to be located in the prefrontal cortex with reciprocal connections throughout the rest of the cerebral cortex, the cerebellar cortex and subcortical regions including the basal ganglia, hippocampal formation and parts of the limbic system involved with emotion, motivation and episodic memory.

With respect to hierarchical goal-based planning, there is growing evidence pointing to a set of adjoining regions in the prefrontal cortex that are responsible for how abstract plans are initially selected, subsequently refined and finally realized as concrete actions. These same regions also appear to be involved in relational reasoning from simple binary relations to higher-order relationships.

These theoretical observations combined with behavioral studies and fMRI recordings have led to a number of computational models of hierarchical planning that exhibit similar patterns of cognitive activity. In particular, cognitive neuroscientists have developed models of how such abstract hierarchical reasoning in the prefrontal cortex is related to what we know about how the basal ganglia and areas of the limbic system involved in motivation contribute to action selection.

Figure 12: The panel on the left highlights three areas of the prefrontal cortex shown in the figure from left to right (rostro-caudal) and referred to in the text as the lateral frontal polar cortex (LFPC), dorsolateral prefrontal cortex (DPFC) and anterior premotor cortex (APMC). According to the theory first articulated by Joaquín Fuster and subsequently refined by David Badre [40], Mark D'Esposito [136], Etienne Koechlin et al [285] and their colleagues, as actions are specified from abstract plans to concrete responses, progressively posterior regions of lateral frontal cortex are responsible for integrating more concrete information over more proximate time intervals. This process of progressive articulation does not correspond to different stages of execution so much as to how actions are selected, maintained and inhibited at multiple levels of abstraction [41].

The panel on the right shows a simple neural-network model of the brain regions aligned with the rostro-caudal axis of the frontal cortex and hypothesized to account for how action representations are selected, maintained and inhibited at multiple levels of abstraction. The neural-network model is described in more detail in the main text, but a few points are in order here: (A) different abstraction layers may include input from other sources, e.g., natural language embeddings, that are only required at particular levels of abstraction; (B) each recurrent level of the abstraction hierarchy includes state variables encoding information that would typically appear on the call stack in a conventional computer architecture; (C) attentional layers mask (suppress) input that is not determined to be relevant to making a decision at a given time and level of abstraction, resulting in a sparse context vector.

The network shown on the right in Figure 12 consists of three subnetworks that roughly align with the lateral frontal polar cortex (bottom), dorsolateral prefrontal cortex (middle) and anterior premotor cortex (top) as shown in the figure. Each of the subnetworks is composed of three elements: a recurrent multilayer perceptron constructed of interleaved convolutional and max-pooling layers shown in orange, a multilayer attention network shown in green and a masking layer in blue that selectively suppresses a subset of the outputs of the convolutional stack in accordance with the output of the attention network.
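
As a rough illustration of the masking step in each subnetwork, the following sketch suppresses all but the k features that a stand-in attention network scores highest, leaving a sparse context vector. The hard top-k rule, the function names and the example numbers are assumptions made for illustration; the text does not specify how the mask is computed, and a trainable version would need a differentiable relaxation.

```python
def top_k_mask(scores, k):
    """Return a 0/1 mask that keeps the k highest-scoring positions."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def masked_context(features, scores, k):
    """Suppress the features that the attention scores deem irrelevant."""
    mask = top_k_mask(scores, k)
    return [f * m for f, m in zip(features, mask)]


# Hypothetical values standing in for the outputs of the two networks.
features = [0.9, 0.1, 0.4, 0.7, 0.2]   # convolutional stack output
scores = [2.0, -1.0, 0.5, 1.5, -0.3]   # attention network output
context = masked_context(features, scores, k=2)
```

Only the two positions with the highest attention scores survive in `context`; everything else is zeroed out.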

Input to each of the three subnetworks includes areas of associative activity throughout the sensory and motor cortex as well as areas corresponding to higher-level abstractions located in the frontal cortex responsible for abstract thought and subcortical regions responsible for motivation. While not emphasized here, the active maintenance in working memory of information originating from these sources¹¹² is critical for the cognitive activities that these networks support [107, 199]. The outputs are fed to a network (not shown) that serves as the interface for the peripheral motor system (a fully instrumented integrated development environment in the case of the programmer's apprentice) which could play the role of the basal ganglia and cerebellum, but could also be considerably simpler depending on the application.

Figure 12 is just a sketch employing familiar neural network components to make the point that building these architectures out of standard components is not the most significant challenge. The real challenge is in training them as part of a larger system with lots of moving parts. The expectation here, as in the model sketched in Figure 11, is that end-to-end training with stochastic gradient descent isn't going to work, and that training will likely require some form of layer-by-layer developmentally-staged curriculum learning [413, 206, 53, 55] and a strategy for holding some weights fixed while adjusting other weights to account for new information and avoid problems like catastrophic forgetting.

Miscellaneous Loose Ends: There is an interesting tradeoff as regards levels of dopamine in the basal ganglia and striatum versus the prefrontal cortex. Listen to Roshan Cools at 48:30:00 into the interview on Brain Inspired: Episode 36. Here's a short excerpt from the interview:

This has been an oversimplified way of thinking about for example, Parkinson's disease, where dopamine levels are severely depleted early on but actually are relatively intact in the prefrontal cortex. If you increase dopamine we might see an increase in cognitive flexibility by increases dopamine in the striatum and basal ganglia, but at the same time that same drug might detrimentally overdose the prefrontal cortex which was relatively intact to begin with. Thereby interfere with a key role for dopamine in prefrontal cortex which is the stabilization of working memory representations and this is at least we see empirically is that dopaminergic drugs make people more flexible — they can switch between tasks more readily.

Her paper [106] focuses primarily on the variability in dopaminergic drug efficacy. If you're more interested in the effect of dopamine on modulating behavior and cognitive control, the related papers by Westbrook and Braver [525] and Westbrook and Frank [526] are worth reading:

Cognitive control — the ability to override a salient or prepotent action to execute a more deliberate one — is required for flexible, goal-directed behavior, and yet it is subjectively costly: decision-makers avoid allocating control resources, even when doing so affords more valuable outcomes. Dopamine likely offsets effort costs just as it does for physical effort. And yet, dopamine can also promote impulsive action, undermining control. We propose a novel hypothesis that reconciles opposing effects of dopamine on cognitive control: during action selection, striatal dopamine biases benefits relative to costs, but does so preferentially for "proximal" motor and cognitive actions. Excerpt from Westbrook and Braver [525].
Cognitive control is subjectively costly, suggesting that engagement is modulated in relationship to incentive state. Dopamine appears to play key roles. In particular, dopamine may mediate cognitive effort by two broad classes of functions: (1) modulating the functional parameters of working memory circuits subserving effortful cognition, and (2) mediating value-learning and decision-making about effortful cognitive action. Here, we tie together these two lines of research, proposing how dopamine serves "double duty", translating incentive information into cognitive motivation. — Excerpt from Westbrook and Frank [526].

October 27, 2019

%%% Sun Oct 27 03:27:09 PDT 2019

Frank and Badre [173] build on an earlier model by O'Reilly and Frank [384] in which "the striatum modulates the selection of frontal cortical actions, including motor actions and working memory updating. It does so by gating the inputs to be maintained in frontal cortex (input gating) and gating which of these maintained representations has an influence on action selection (output gating). The learning of which actions are adaptive given a particular state is accomplished via dopaminergic reinforcement learning mechanisms." They extend this model to accommodate learning hierarchical structure:

Here, in order to support learning of hierarchical structure within the context of the Badre et al. (2010) task, we modified the model such that anterior regions of prefrontal cortex (PFC) contextualize striatal gating of more posterior frontal regions. More specifically, as with previous models (O'Reilly and Frank [384], Reynolds and O'Reilly [423]), working memory representations maintained in the "posterior" PFC layer of the model constrain motor output decisions. Unique to the present model, working memory representations maintained in additional "anterior" PFC layers constrain which of the working memory representations maintained in posterior PFC should be "output-gated" to influence attention and ultimately response selection. Pages 509-510.
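
The division of labor between input gating, output gating and anterior contextualization can be sketched as follows. This is a toy illustration and not the Frank and Badre model: the class, the slot names, the trial and the fixed `context_to_slot` table are hypothetical, and in the model the gating policy is learned by dopaminergic reinforcement learning rather than written down by hand.

```python
class GatedWorkingMemory:
    """Toy working memory with input and output gating over named slots."""

    def __init__(self, slots):
        self.slots = {name: None for name in slots}

    def input_gate(self, slot, value, gate_open):
        # Update a slot only when its input gate is open; otherwise maintain it.
        if gate_open:
            self.slots[slot] = value

    def output_gate(self, slot):
        # Only the output-gated slot is allowed to influence response selection.
        return self.slots[slot]


posterior = GatedWorkingMemory(["shape", "orientation"])
anterior = GatedWorkingMemory(["context"])

# Hypothetical trial: a color cue indicates which stimulus dimension matters.
anterior.input_gate("context", "red", gate_open=True)
posterior.input_gate("shape", "square", gate_open=True)
posterior.input_gate("orientation", "vertical", gate_open=True)
posterior.input_gate("shape", "circle", gate_open=False)  # distractor is not stored

# Assumed rule table standing in for a learned gating policy.
context_to_slot = {"red": "shape", "blue": "orientation"}
relevant = posterior.output_gate(context_to_slot[anterior.output_gate("context")])
```

Here the anterior representation does not drive the response directly; it only determines which of the maintained posterior representations is released to response selection.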

Hazy et al [219] don't address the issue of hierarchy specifically, but they do describe a model in which the basal ganglia "modulate working memory representations in prefrontal areas to support more abstract executive functions." The paper provides a useful characterization of the basic challenges in building such a system. In particular, the authors list six core functional demands that collectively serve to define "the fundamental nature of prefrontal cortical function from a neuro-mechanistic perspective," and they have interesting things to say about the role of working memory in facilitating coordination between the basal ganglia and the prefrontal cortex.

Further, whereas our initial focus was on the mechanisms by which the BG–PFC system learns when to update and maintain information in working memory [...], here we extend the model to include an output-gating mechanism that can determine which of a subset among multiple parallel active representations should be currently used to guide action selection [...]. Interestingly, the same BG mechanisms that can drive the selection of when to update PFC working memory representations can also be used (in parallel circuits) to select which of the already maintained PFC representations should actually be used to guide behaviour. Page 1601.

Botvinick [66] is interesting for a number of reasons, one of them being that the paper analyzes an earlier recurrent connectionist model developed by Botvinick and Plaut [65, 64] (BP04) that was not originally designed specifically to deal with hierarchy:

[A] central point demonstrated by the BP04 model is that despite the evidently hierarchical structure of everyday sequential behaviour, the representations underlying such behaviour need not be, and indeed are unlikely to be, strictly hierarchical. Instead, like the detailed structure of everyday behaviour itself, those representations are more likely to be quasi-hierarchical, capturing the distinctions between separable levels of sequential structure, while also allowing information to be shared across those levels and between interrelated task sequences. Page 1919

To understand the relationship between the proposed computational framework and this empirical data, the original neural network model was reimplemented so as to incorporate a pattern of connectivity based on the neuroanatomical hierarchy described by Fuster [184] — see Figure 6 on Page 1621 of [66]. The model incorporates a pattern of connectivity resembling the ladder-like structure described by Fuster — see Figure 8.6 of [184]. At the base of the ladder was a group that received external inputs representing environmental stimuli, and an output group representing motor commands. Connected to these were two internal unit groups, occupying an intermediate position. And connected to these was a final group of units, occupying the apex of the hierarchy, a position analogous to that of PFC in Fuster’s hierarchy — see Figure 8 on Page 1623 of [66].

When the [new] model was trained on a task that required it both to accomplish immediate input–output mappings and to maintain context information over time, it spontaneously developed a graded division of labour over its internal unit-groups, with groups further from periphery taking a larger role in context representation. The simulation results indicated how the original computational framework can be reconciled with the presence of a hierarchical arrangement of cortical regions, by showing how a regional division of labour might coexist with the same sorts of graded, distributed representations that defined the original computational model. Perhaps more important was the finding that this regional division of labour arose spontaneously as a result of learning within a hierarchically structured system. In view of this, the simulation results presented here provide a possible account for how different levels of temporal structure in behaviour may come to be represented at different levels within the hierarchy of cortical areas, and in particular how the PFC comes to assume its distinctive role in the representation of temporal or task context. Page 1624 of [66].

All three models (the reimplementation of BP04 in the case of Botvinick and Plaut) fit within the general architectural framework described by Badre [41, 38], Fuster [184, 183] and Koechlin et al [285, 284] that we've reviewed in these pages. Botvinick et al [68] provide a good overview of hierarchical reinforcement learning and its possible neural correlates that complements the above. Stanislas Dehaene and his colleagues have made a number of contributions to the related work. In particular, Dehaene and Changeux [131] describe a hierarchical neural network model for goal-directed planning and executive control of action. Here are some relevant passages from their paper:

Our modeling approach views the brain not as a passive input–output system, but as an active, projective device that spontaneously generates "hypotheses" and tests their adequacy. Neuronal circuits are modeled as hierarchically organized assemblies of neuronal clusters linked in multiple parallel pathways by bundles of synapses. In the prefrontal cortex, neurons within each cluster are connected by excitatory collaterals and thus can maintain a long-lasting, self-sustained level of activation implementing an elementary form of working memory. [...] The spontaneous activity patterns thus generated are selected (stabilized) or eliminated (destabilized) by a simple Hebbian rule modulated by positive or negative reward signals received from diffuse projection systems. Rewards may be received from an external teacher, or they may be internally generated by the organism itself using autoevaluation circuits. Page 13293 of [131].
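
The reward-modulated Hebbian rule mentioned in this passage can be written in one line. The sketch below is a generic three-factor update given as an illustration; the function name, the learning rate and the scalar reward coding are assumptions and are not taken from [131].

```python
def reward_modulated_hebbian(weight, pre, post, reward, rate=0.1):
    """Strengthen co-active connections after reward, weaken them after punishment."""
    return weight + rate * reward * pre * post


w = 0.5
w_rewarded = reward_modulated_hebbian(w, pre=1.0, post=1.0, reward=+1.0)
w_punished = reward_modulated_hebbian(w, pre=1.0, post=1.0, reward=-1.0)
w_inactive = reward_modulated_hebbian(w, pre=0.0, post=1.0, reward=+1.0)
```

A connection changes only when both of its units are active, and the sign of the reward determines whether the activity pattern is stabilized or destabilized.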

The focus of the paper is on using their simulation model (shown in Figure 2 of [131]) to demonstrate how the fully functional model solves the Tower of London problem and how "lesioned" versions of the model fail to do so. Predictions that follow from the simulated lesion studies offer an opportunity to test their theory on humans, assuming a cohort of patients with the characteristic lesions can be identified in the general population and are willing to be observed in the midst of trying to solve such planning problems. The simulation model depends on a biological model of inhibition and activation worth considering in contrast with the Hazy et al [219] model mentioned above. Both models are worth considering in terms of translating ideas to the language of deep learning:

The model is divided into ascending evaluative and descending planning pathways with multiple horizontal interconnections at each of three hierarchical levels (gestures, operations, and plans). At the top level, plan units receive inputs from reward units. At the intermediate level, operation units receive inputs from "current state," "movable beads," and "reachable goals" units. Finally, at the lowest level, gestures units receive direct inputs from "current state" units, enabling the fine details of the motor plan to be adapted to the current configuration of the beads. Within each assembly, units pool the activation or inhibition they receive and compete for the control of behavior through self-excitation and nonlinear shunting lateral inhibition, resulting in the selection, at any given time, of a single, active unit at each level, the one that is best adapted to the constraints received from other assemblies. Page 13295 of [131].
Functional brain imaging experiments show that when normal subjects engage in the Tower of London task, an extended network of areas activates, including parietal, prestriate, insular-opercular, premotor, and prefrontal cortices. Because parietal, prestriate, premotor, and insular-opercular areas are active even for simple problems, they may contribute to the operation level, with occipitoparietal areas contributing to identifying movable beads and reachable goals, and premotor and insular cortex involved in mentally selecting and executing direct moves. During the resolution of more complex Tower of London problems, increased activity is detected in this network as well as in dorsolateral, rostrolateral, and medial prefrontal areas and in the basal ganglia. Page 13297 of [131].
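
The competition within an assembly described in the first passage can be approximated with a few lines of rate dynamics. This is a deliberately simplified sketch: it uses additive lateral inhibition with clipping in place of the nonlinear shunting inhibition of [131], and the function name, parameter values and input vector are assumptions chosen so that the behavior is easy to follow.

```python
def settle(inputs, self_excitation=1.0, inhibition=2.0, rate=0.1, steps=200):
    """Iterate clipped rate dynamics until one unit dominates the assembly."""
    activity = [0.0] * len(inputs)
    for _ in range(steps):
        total = sum(activity)
        updated = []
        for drive, a in zip(inputs, activity):
            # Each unit excites itself and is inhibited by all of its rivals.
            net = drive + self_excitation * a - inhibition * (total - a)
            updated.append(min(1.0, max(0.0, a + rate * net)))
        activity = updated
    return activity


# Hypothetical pooled support for three competing units at one level.
support = [0.5, 0.6, 0.4]
activity = settle(support)
winner = activity.index(max(activity))
```

Small differences in pooled input are amplified until the best-supported unit saturates and silences the others, which is the single-active-unit behavior the passage describes.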

The bibliographical entries for the most relevant papers cited in this entry including all of the abstracts are available here ¹¹³ and the corresponding PDF documents are available on the Google Cloud volume shared with the students taking the class.

October 25, 2019

%%% Fri Oct 25 02:04:00 PDT 2019

Reinforcement learning problems are typically modeled as Markov decision processes (MDPs). While MDPs can be solved in time polynomial in the number of states and actions, the size of the state space is often prohibitively large, making practical solution intractable [324]. Hierarchical reinforcement learning offers a means of reducing the computational burden by decomposing the state space, resulting in a relatively small number of tractable MDPs each of which can be solved independently [256, 141, 222]. However, the problem of finding an optimal decomposition is itself intractable and hence it is necessary to resort to heuristic methods and approximate solutions.
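
For readers who want a concrete reference point, here is a minimal sketch of value iteration on a tiny deterministic MDP. The three-state chain, the reward function and the discount factor are made up for illustration; the point is only that each sweep touches every state and action, which is why the cost becomes prohibitive as the state space grows.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-9):
    """Compute optimal state values for a deterministic MDP by repeated sweeps."""
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                reward(s, a) + gamma * values[transition(s, a)] for a in actions
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


# Hypothetical chain 0 - 1 - 2 in which state 2 is an absorbing goal.
GOAL = 2

def transition(s, a):
    if s == GOAL:
        return GOAL
    return min(s + 1, GOAL) if a == "right" else max(s - 1, 0)

def reward(s, a):
    # Reward of 1 for the step that enters the goal, 0 otherwise.
    return 1.0 if s != GOAL and transition(s, a) == GOAL else 0.0

values = value_iteration([0, 1, 2], ["left", "right"], transition, reward)
```

In a hierarchical decomposition, a routine like this would be run separately on each of the smaller MDPs rather than once on the full state space.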

There exist a number of approaches that develop solutions to the problem of hierarchical reinforcement learning (HRL) employing various decomposition strategies, several of which we draw inspiration from [370, 22, 433, 293, 45, 360, 396] including a few that relate to biological or biologically plausible models [419, 142, 173, 424]. It's important to keep in mind that we are dealing with a partially-observable, high-dimensional, continuous state space, and an action space that includes abstract cognitive activities in addition to concrete physical activities that engage the motor system in interacting with the environment.

In the treatment here, we emphasize the problem of life-long learning as it relates to the nonstationarity of the underlying process as a consequence of changes in the external environment and changes in the goals of the agent and the neural substrate available for computation during development and extending on into adulthood. In the case of a growing infant, the changes involve the appearance and maturation of critical circuits and the limitations of finite memory. In both human and machine, internal representations progress from concrete to abstract, building on a foundation grounded in the environment. This maturation in cognitive capability is accelerated by a curriculum that takes advantage of dependencies between concepts. See here for earlier notes on multi-policy RL.

Figure 11: This is an updated version of the network model described in Figure 8. The network shown here takes as input a pattern of activation originating in the temporal and parietal lobes and selects an action to perform. The subnetworks labeled A and B are relatively straightforward multilayer neural networks that compute features and generate representations as their output. Network A takes as input a representation of the current state, and generates a representation of the context for action selection. Network K is an embedding network that takes as input a sequence of states corresponding to recent activity and generates as output a unique key associated with a subspace of the full MDP state space that includes the current state. The box labeled M corresponds to a location in working memory.

The networks C, D, E and F are controllers for two differentiable neural computer (DNC) peripherals that provide storage and access for short-term and long-term memory respectively. The long-term memory is used to store the weights of architecturally identical networks (only the weights differ), each providing specialized expertise in a restricted domain corresponding to a subspace of the full MDP state space. The model operates in two modes. In each cycle during the online mode, the C controller loads the selected expert network into location M where it is fed the output of A and produces the input to B. In this mode, the short-term memory is used to record activity traces that are subsequently used in the offline mode to update the networks stored in long-term memory. The network in the lower-right inset represents one approach to solving the problem of catastrophic forgetting [174].

The network model shown in Figure 11 illustrates a system that takes as input a pattern of neural activity originating in the medial temporal and inferior parietal cortex and selects an action to perform. This particular example is meant to illustrate how episodic memory might play an expanded role in action selection. For illustration, patterns of activity serve as proxies for the state of the external environment and are represented in the figure as a sequence s_t, s_t−1, s_t−2, .... The subnetworks labeled A and B are relatively straightforward multilayer neural networks that compute features and generate representations as their output. Network A takes as input a representation of the current state, and generates a representation of the context for action selection.

%%% MIGHT WANT TO POSITION THE NEXT PARAGRAPH MORE PROMINENTLY AS IT APPLIES TO THE ENTIRE PAPER.

In keeping with our objective to demonstrate how ideas from neuroscience can influence and accelerate the development of artificial intelligence, this model is not intended to emulate how the human brain processes information. Nor is the architecture intended to mirror that of the human brain or any other biological organism. What we borrow from neuroscience are ideas that enable engineers to solve problems that current AI systems cannot handle or handle poorly. We are not bothered by combining neural mechanisms that neuroscientists currently identify with, say, the basal ganglia with neural mechanisms associated with the hippocampal formation. At this stage in the development of more capable AI systems, what engineers need most are general principles that inform design and can be applied whenever the need arises.

We'll explain the function of the box labeled M in a moment; assume for now that it generates a representation of the options available for acting in the current context. Network B then takes these suggestions as input and produces as output a representation of the selected action. The boxes labeled C, D, E and F are controllers for two differentiable neural computer (DNC) units that provide storage and access for short-term and long-term memory respectively. The controllers on the left are part of the online system for selecting actions. The controllers on the right are responsible for off-line training during which the recorded actions, along with their associated states and rewards are consolidated in long-term memory using experience replay as described in the main text.

The blue boxes represent stored information in the form of key-value pairs. Each key is associated with a subset or subspace of the set of all states that represents a restricted domain of expertise for selecting actions. The value for this key is a function implemented as a network trained as an expert for the associated subspace. K is an embedding network that takes as input a sequence of states corresponding to recent activity and generates as output a unique key associated with a subspace of the state space that includes the current state. A given state can belong to more than one subspace and the particular key selected at any given point in time depends on the current state and the immediately previous states in a fixed window. The order of the states matters.

In the online phase, the embedding network generates this key, which it forwards to the controller labeled C that uses it to retrieve the expert for the relevant subspace. The box labeled M corresponds to a location in working memory and in each online cycle the C controller loads the expert subsystem into location M of working memory where it can be utilized to compute a set of options appropriate for the current state. During off-line periods the system uses the recorded sequences of activity to run some variant of experience replay to update the relevant expert subsystems stored in long term memory [23, 441, 322].
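
The online cycle just described can be summarized in a short sketch. Everything here is a stand-in: the hand-written `network_k` rule replaces a learned embedding, the lambdas in `long_term_memory` replace expert networks whose weights would live in a DNC, and the state and action names are hypothetical. The sketch only shows the order of operations: K produces a key from an ordered window of states, C loads the matching expert into M, A supplies the context, B picks an action, and the trace is recorded for later replay.

```python
from collections import deque


def network_a(state):
    """Stand-in for network A: build a context from the current state."""
    return {"state": state}


def network_k(recent_states):
    """Stand-in for the embedding network K: an order-sensitive key."""
    last_two = tuple(recent_states)[-2:]
    return "editing" if last_two == ("open_file", "read_code") else "navigating"


def network_b(options):
    """Stand-in for network B: select one of the proposed options."""
    return options[0]


# Long-term memory as key-value pairs: key -> expert for a state subspace.
long_term_memory = {
    "editing": lambda context: ["insert_line", "delete_line"],
    "navigating": lambda context: ["open_file", "search"],
}
short_term_memory = []      # activity traces kept for offline replay
window = deque(maxlen=3)    # fixed window of recent states


def online_step(state):
    window.append(state)
    key = network_k(window)
    working_memory_m = long_term_memory[key]       # controller C loads the expert into M
    options = working_memory_m(network_a(state))   # expert maps context to options
    action = network_b(options)
    short_term_memory.append((state, key, action))
    return action


first = online_step("open_file")
second = online_step("read_code")
```

Because the key depends on the ordered window rather than on the current state alone, the same state can select different experts depending on what preceded it.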

The training that occurs offline involves adjusting the weights of networks using relatively small samples and so runs the risk of catastrophic interference in transfer learning [348]. One way in which we hope to ameliorate the adverse consequences of catastrophic interference is by defining separate networks for separate subspaces. The embedding space method mentioned in describing K is designed to isolate expertise by identifying states that tend to occur together. The hope is that the actions exercised in such states will tend to be interrelated and hence they should be represented using the same network to facilitate their coordination.

Of course temporal proximity in their occurrence doesn't guarantee they serve the same task since we are always getting distracted or interrupted requiring us to interleave tasks that have very little to do with one another. It may be possible to segment activity streams into coherent tasks in a similar way to how we segment conversations involving multiple speakers [447, 448]. Alternatively, there has been some success retraining an existing network by interleaving new examples with pseudo-examples produced by randomly activating the existing network [97, 277, 25, 174, 175] — see the inset in Figure 11.
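The interleaving idea mentioned at the end of the preceding paragraph can be sketched as follows. This is a minimal illustration, not the procedure from any of the cited papers: the "existing network" is a hypothetical fixed linear map, the probe inputs are drawn uniformly at random, and the function names `make_pseudo_examples` and `interleave` are ours. The point is only that pseudo-examples are input-output pairs read off the old network, so training on the mixed set pulls the retrained network toward its old behavior while it absorbs the new examples.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[List[float], float]


def make_pseudo_examples(network: Callable[[List[float]], float],
                         count: int, dim: int,
                         rng: random.Random) -> List[Example]:
    """Probe the existing network with random inputs and record its outputs."""
    examples: List[Example] = []
    for _ in range(count):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        examples.append((x, network(x)))
    return examples


def interleave(new: List[Example], pseudo: List[Example],
               rng: random.Random) -> List[Example]:
    """Mix new examples with pseudo-examples into one shuffled training set."""
    mixed = list(new) + list(pseudo)
    rng.shuffle(mixed)
    return mixed


# Hypothetical "existing network": a fixed linear map standing in for a
# previously trained expert.
old_weights = [0.5, -0.25]


def old_network(x: List[float]) -> float:
    return sum(w * v for w, v in zip(old_weights, x))


rng = random.Random(0)
new_examples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]
pseudo_examples = make_pseudo_examples(old_network, count=4, dim=2, rng=rng)
training_set = interleave(new_examples, pseudo_examples, rng)
```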

In this model the STM roughly corresponds to the hippocampus as the storage system for episodic memory. The LTM resembles the cerebellum in the way that it essentially compiles prior activity to construct a set of programs each of which spans some portion of the overall state space. As described above, the STM is only used for temporary storage awaiting off-line replay to consolidate recent memories. An alternative is to maintain a much larger collection of episodic memories that can be used in a manner similar to that suggested in Gershman and Daw who posit that we routinely draw upon our stored memories in the hippocampus to figure out what to do in novel situations not covered by our other sources of procedural knowledge [191].

For example, if we enter a fast food restaurant, a franchise of a chain of such restaurants, that we have never eaten at in the past, we might recall a previous episode in which we ate a meal at another restaurant at a competing chain and apply what we learned in that prior episode to the present situation. The protocol for ordering and subsequently being served your order in the present restaurant may be different than that of the restaurant in the recalled episode and so we might have to adapt somewhat, in what we might think of as an instance of compilation on the fly. Gershman and Daw conjecture that indeed this sort of planning may be quite common. In the retrieval and subsequent reconsolidation of the memory, which may include additional annotations to record the differences between the two restaurants, the hippocampus might serve as the basis for many of our day-to-day activities.

In this case, the LTM as pictured here stores what might be thought of as subroutines or libraries for solving routine problems. Used in the manner described in Gershman and Daw [191], the DNC labeled STM more closely captures the functionality of the hippocampus in combining short-term and long-term episodic memories with specific procedural knowledge based on past experience that may or may not be common enough to warrant compiling as a standalone library. The dentate gyrus is best known for its ability to separate patterns to avoid mistaking one pattern for another. Less well understood is a possible complementary role that involves integrating similar patterns.

The ability to draw upon episodic memory to select what to do in situations similar to those encountered in the past provides a simple form of one-shot learning. It could enable us to make predictions, perform hypothetical reasoning and put ourselves in someone else's shoes, assuming that our ability to retrieve memories allows us to match situations in which we find ourselves, and that we haven't experienced directly, against what we know of someone else's experience. It might avoid some of the problems with interference if the process of integrating new procedural knowledge with old could be spread out over longer periods if, say, each time you encounter a similar situation you make only minor adjustments to the weights of the associated subspace expert network.
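One simple way to make "drawing upon episodic memory to select what to do" concrete is a kernel-weighted average over stored episodes. The sketch below is an assumption on our part, loosely in the spirit of nonparametric value estimation; the Gaussian kernel, the `bandwidth` parameter, the two-dimensional state encoding and the restaurant episodes are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Episode = Tuple[Sequence[float], str, float]   # (state, action, return)


def gaussian_kernel(a: Sequence[float], b: Sequence[float],
                    bandwidth: float = 1.0) -> float:
    """Similarity between two states; 1.0 when identical, decaying with distance."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared / (2.0 * bandwidth ** 2))


def episodic_values(state: Sequence[float], episodes: List[Episode],
                    bandwidth: float = 1.0) -> Dict[str, float]:
    """Estimate the value of each action as a similarity-weighted average of
    the returns recorded in past episodes."""
    totals: Dict[str, float] = {}
    weights: Dict[str, float] = {}
    for past_state, action, ret in episodes:
        w = gaussian_kernel(state, past_state, bandwidth)
        totals[action] = totals.get(action, 0.0) + w * ret
        weights[action] = weights.get(action, 0.0) + w
    return {a: totals[a] / weights[a] for a in totals if weights[a] > 0.0}


# Hypothetical episodes: state = (fast-food-like, table-service-like).
memory: List[Episode] = [
    ((1.0, 0.0), "order_at_counter", 1.0),
    ((0.9, 0.1), "wait_for_server", -1.0),
    ((0.0, 1.0), "wait_for_server", 1.0),
]
novel_restaurant = (1.0, 0.05)
values = episodic_values(novel_restaurant, memory, bandwidth=0.5)
best_action = max(values, key=lambda a: values[a])
```

Because the estimate is recomputed from the stored episodes at decision time, a single new episode changes behavior immediately, without any weight updates.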

Given the STM's role as an analog of the hippocampus, one might imagine that over time those memories might be amended to better reflect more current challenges and that the amended record may serve to update the compiled libraries from time to time or refresh libraries that have been inadvertently compromised by catastrophic interference. Indeed, given the way in which subroutines are trained, it might very well be that such routine refresh is necessary to maintain precision and coverage. Quite separate from the idea that the hippocampus could be used to compensate for interference, our current plan is to employ a method proposed for ameliorating interference when adding to an existing subroutine.

%%% BEGIN EXTRA(EOUS) MATERIAL %%%

With respect to the programmer's apprentice application, an agent that can't formulate plans and pursue goals; that can't identify tasks that need to be achieved in pursuit of goals and pursue them to their completion despite complications and failures along the way; or that can't understand and attach value to achieving the goals of another agent in a collaborative effort will simply fail as an assistant.

Summarizing the first half of Gershman and Daw: because the world is partially observable we have to perform state estimation in order to provide a foundation for planning and acting; because rewards are unreliable and sparsely distributed we have to learn to predict / estimate expected value from limited exploration; because the world is nonstationary and our experience limited we have to be prepared to reassess what we think we know and continually update and extend our model of the dynamics.

Gershman and Daw write: "Our proposal suggests that the ubiquitous and diverse roles of memory in RL may function as part of an integrated learning system." Adding to their suggestion, inventing a developmentally staged curriculum for training and an abstract hierarchical planning foundation can assist such an integration ... they suggest the use of nonparametric learning and the application of kernel methods.

The importance of precision and repeatability in selecting actions; error detection and remediation executing actions; feedback in the form of prediction and observation to direct planning and support learning, account for the dynamics in the external portion of recurrent feedback loops; in the case of the assistant, the IDE part of the environment simplifies generating training examples for basic operations including pointing and referring to variables and their values, evaluation of expressions involving numerical operators and predicting the result of evaluating expression without using the IDE interpreter.

See also Kammerer [259] on the capacity of the hippocampus, multiple sequences, mutability, separated patterns via Hebbian learning and the interplay of inhibitory and excitatory neural populations, hypothesized global remapping mechanism ... auto associative memory [...]

%%% END EXTRA(EOUS) MATERIAL %%%

October 23, 2019

%%% Wed Oct 23 04:32:58 PDT 2019

When you start out learning about the brain, its complexity is overwhelming. When you ask where does the brain carry out some particular function, the answer is more often than not that the brain carries out the function in many different locations. This answer is misleading since the function may rely on circuits distributed throughout the brain, or it may be that there are multiple locations where the function is carried out but in each case the function serves different purposes depending on different parts of the brain that only become available at different stages during development.

[You can’t just take the basal ganglia and cerebellum off line while you renovate the brain to install your shiny new hippocampus and prefrontal cortex. First of all you will need those systems to survive during the renovation. Second they will continue to serve their original purpose in your renovated brain, and, third, they will enlarge their scope to expedite, coordinate and control the timing of higher-level cognitive functions.]

Moreover as the brain continues to mature, more capable instances of the same function are not necessarily built upon the seemingly less capable instances nor are the less capable instances retired or the parts of the brain they relied on repurposed to serve different functions. The reason for this staged approach to building a brain is that the human brain is designed to adapt to its environment and that adaptation requires a lengthy period of exposure to the environment as we build the necessary physical and cognitive foundations.

The selection and execution of actions is carried out in the basal ganglia, the cerebellum and the prefrontal cortex and is dependent upon different memory systems spread throughout the brain including declarative, procedural and episodic memories. Indeed, it is not too much of a stretch to say that almost every part of the brain stores memories of one sort or another and at different times appears to play attentional, perceptual and selectional roles in carrying out its multiple functions. In the following we ignore most of this complexity in order to set forth our strategy for designing a brain that builds itself.

[The circuits that rely on primary sensory and motor cortex allow the infant to interact with its environment and thereby enhance its ability to make sense of and interact with its physical environment. The impetus in this case is largely instinctual but quickly becomes goal oriented as the infant matures. The basal ganglia plays a key role in selecting what to do and the cerebellum figures out how to do it.]

The hippocampus and prefrontal cortex are still developing during infancy (first 2 years) and given that they are thought to serve episodic memory and higher-level cognitive activities it makes sense to wait until the growing child has at least a basic understanding of its body and the environment in which it was born. During early childhood (years 3 to 8) the hippocampus matures and the child develops a sense of self defined in large part by their memories that also serve as a basis for how to act and solve problems. The prefrontal cortex develops in early adolescence and on into adulthood during which the growing child learns to modulate intense emotions, inhibit inappropriate behavior and initiate appropriate behavior, as well as engage in complex problem solving.

The basal ganglia, hippocampus and cerebellum continue to play important roles in shaping behavior including the support of higher-level cognitive functions. The main insight here is that the brain builds itself, adapting to its ongoing development and adjusting to meet the challenges and exploit the opportunities offered by its environment. The long span of time that characterizes human development is critical in building our brains. The understanding of this process constitutes one of the most important lessons neuroscience has to offer artificial intelligence.

October 17, 2019

%%% Thu Oct 17  3:47:31 PDT 2019

This log entry includes extended quotes from David Badre [41] and Joaquín Fuster [185]. You can access Badre's journal article and the 5th edition of Fuster's book The Prefrontal Cortex using Stanford library resources. Chapter 8 of Fuster's book is particularly relevant to our current interests. The following three paragraphs and figure graphic are from Badre's article sprinkled with my annotations and links to relevant online resources. His use of the word context is well aligned with the model initially developed in the Spring quarter and revisited in the October 15 entry of this log.

Formal models of cognitive control [41, 285, 185] make explicit how control might be implemented within a hierarchy. One influential class of connectionist models of control derives from the assumptions of biased competition. Competition arises when bottom-up input produces activation of multiple response pathways¹¹⁴. This competition can be quantified as Hopfield energy, which is the sum of the products of all the nodes in a layer weighted by their competitive, lateral inhibitory connections. Assuming mutually equivalent competition, at a given overall level of activation, energy in the network will be maximal when all units are equivalently active. In response to competition, a recurrent layer that maintains task-relevant contextual information biases task-relevant responses over competitors. See Botvinick et al [67] for a study examining the role of the anterior cingulate cortex in error detection which is a critical cognitive function largely ignored in the reinforcement learning literature¹¹⁵. As an exercise, think about how you would design a conventional neural network that, whenever it executes a task, checks to see if its expectations are met, and, if not, takes steps to resolve whether that entails (a) modifying its prediction machinery, (b) fixing the plan that issued the task or (c) modifying the machinery that suggested the plan¹¹⁶.

@article{BotvinicketalPSYCHOLOGICAL_REVIEW-01,
        title = {Conflict monitoring and cognitive control},
       author = {Botvinick, Matthew M. and Braver, Todd S. and Barch, Deanna M. and Carter, Cameron S. and Cohen, Jonathan D.},
      journal = {Psychological Review},
    publisher = {American Psychological Association Inc.},
         year = {2001},
        pages = {624-652},
       volume = {108},
       number = {3},
     abstract = {A neglected question regarding cognitive control is how control processes might detect situations calling for their involvement. The authors propose here that the demand for control may be evaluated in part by monitoring for conflicts in information processing. This hypothesis is supported by data concerning the anterior cingulate cortex, a brain area involved in cognitive control, which also appears to respond to the occurrence of conflict. The present article reports two computational modeling studies, serving to articulate the conflict monitoring hypothesis and examine its implications. The first study tests the sufficiency of the hypothesis to account for brain activation data, applying a measure of conflict to existing models of tasks shown to engage the anterior cingulate. The second study implements a feedback loop connecting conflict monitoring to cognitive control, using this to simulate a number of important behavioral phenomena.},
}
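The claim above that, at a fixed overall level of activation, energy is maximal when all units are equally active is easy to check numerically. The sketch below assumes the common form E = -sum_i sum_j a_i a_j w_ij with negative (inhibitory) weights between distinct units and no self-connections; the helper names and the three-unit example are ours.

```python
from typing import List, Sequence


def hopfield_energy(activations: Sequence[float],
                    weights: Sequence[Sequence[float]]) -> float:
    """E = -sum_i sum_j a_i * a_j * w_ij, with w_ij < 0 for lateral inhibition."""
    n = len(activations)
    return -sum(activations[i] * activations[j] * weights[i][j]
                for i in range(n) for j in range(n))


def mutual_inhibition(n: int, strength: float = 1.0) -> List[List[float]]:
    """All-to-all inhibitory weights of equal strength, no self-connections."""
    return [[0.0 if i == j else -strength for j in range(n)] for i in range(n)]


weights = mutual_inhibition(3)

# Three activation patterns with the same total activation (1.0).
balanced = hopfield_energy([1 / 3, 1 / 3, 1 / 3], weights)   # maximal conflict
skewed = hopfield_energy([0.8, 0.1, 0.1], weights)           # one response leads
resolved = hopfield_energy([1.0, 0.0, 0.0], weights)         # no competition
```

With these weights the energy reduces to the sum of pairwise products of activations, so it falls from its maximum in the balanced case to zero once a single response unit holds all of the activation.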

A sophisticated biased competition framework by O'Reilly [384, 423] and colleagues implements hierarchical control through multiple contextual layers at different levels of abstraction and recursion. An early variant of the model, motivated by a double dissociation in the intra–extra-dimensional switching task, included two hierarchically related working memory buffers that separately maintain the recently rewarded dimension (e.g. a category of shapes) or feature (e.g. a particular shape). These buffers independently bias dimension or feature in a common representational space and are rapidly updated in response to conflict between predicted and actual outcomes. 'Lesioning' each of these layers simulated a double dissociation between reversals at feature or dimension levels. More recent versions of the model are capable of modeling different cognitive control tasks and even show generalization to novel tasks. O’Reilly and colleagues have suggested that this hierarchical architecture reflects the gradient of abstraction along the rostro–caudal axis of the PFC¹¹⁷.

@article{OReillyandFrankNC-06,
       author = {O'Reilly, Randall C. and Frank, Michael J.},
        title = {Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia},
      journal = {Neural Computation},
    publisher = {MIT Press},
      address = {Cambridge, MA, USA},
       volume = 18,
        issue = 2,
         year = 2006,
        pages = {283-328},
     abstract = {The prefrontal cortex has long been thought to subserve both working memory (the holding of information online for processing) and executive functions (deciding how to manipulate working memory and perform processing). Although many computational models of working memory have been developed, the mechanistic basis of executive function remains elusive, often amounting to a homunculus. This article presents an attempt to deconstruct this homunculus through powerful learning mechanisms that allow a computational model of the prefrontal cortex to control both itself and other brain areas in a strategic, task-appropriate manner. These learning mechanisms are based on subcortical structures in the midbrain, basal ganglia, and amygdala, which together form an actor-critic architecture. The critic system learns which prefrontal representations are task relevant and trains the actor, which in turn provides a dynamic gating mechanism for controlling working memory updating. Computationally, the learning mechanism is designed to simultaneously solve the temporal and structural credit assignment problems. The model's performance compares favorably with standard backpropagation-based temporal learning mechanisms on the challenging 1-2-AX working memory task and other benchmark working memory tasks.}
}

The cascade model [284, 285, 283] shares fundamental properties with the biased competition framework (see Figure 9). The primary control demand is uncertainty or entropy. Uncertainty increases with the number of candidate responses. For any number of responses, uncertainty will be maximal when the response probabilities are equivalent. Given that relative activation among competing response units in a network relates directly to the probability distributions of those responses, entropy will exhibit similar dynamics as Hopfield energy when all responses are equally likely. Control is expressed as the reduction in uncertainty (mutual information) given some additional signal (sensory, context or episodic). Different regions of the PFC are specialized for different types of control signals and signals sum from super- to sub-ordinate levels. Badre summarizes the import of the graphic shown in Figure 9 as "Thus, from caudal to rostral, regions of the PFC are distinguished based on their reliance on control signals that differ temporally, from immediate environment (sensory and context), to current temporal frame (episodic), to a pending frame (branching)."

@article{KoechlinetalSCIENCE-03,
       author = {Etienne Koechlin and Chryst\`{e}le Ody and Fr\'{e}d\'{e}rique Kouneiher},
        title = {The architecture of cognitive control in the human prefrontal cortex},
      journal = {Science},
       volume = 302,
         year = 2003,
        pages = {1181-1185},
     abstract = {The prefrontal cortex (PFC) subserves cognitive control: the ability to coordinate thoughts or actions in relation with internal goals. Its functional architecture, however, remains poorly understood. Using brain imaging in humans, we showed that the lateral PFC is organized as a cascade of executive processes from premotor to anterior PFC regions that control behavior according to stimuli, the present perceptual context, and the temporal episode in which stimuli occur, respectively. The results support an unified modular model of cognitive control that describes the overall functional organization of the human lateral PFC and has basic methodological and theoretical implications.},
}
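The information-theoretic quantities used in the cascade model can be illustrated with a small calculation. The sketch below uses the standard definitions of Shannon entropy and mutual information; the two-response, two-signal joint distributions are hypothetical examples chosen to show the extreme cases.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mutual_information(joint: Sequence[Sequence[float]]) -> float:
    """I(R; S) = H(R) + H(S) - H(R, S), where joint[s][r] = P(S = s, R = r)."""
    p_signal = [sum(row) for row in joint]
    p_response = [sum(column) for column in zip(*joint)]
    flat = [p for row in joint for p in row]
    return entropy(p_response) + entropy(p_signal) - entropy(flat)


# Uncertainty grows with the number of candidate responses and is maximal
# when the responses are equally likely.
two_equal = entropy([0.5, 0.5])
four_equal = entropy([0.25, 0.25, 0.25, 0.25])
two_skewed = entropy([0.9, 0.1])

# A control signal that fully determines the response removes all uncertainty;
# a signal independent of the response removes none.
informative = mutual_information([[0.5, 0.0], [0.0, 0.5]])
uninformative = mutual_information([[0.25, 0.25], [0.25, 0.25]])
```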

Figure 9: The cascade model is an elegant cognitive control framework that makes predictions about the rostro–caudal organization of the prefrontal cortex (PFC). In cascade, control resolves competition among alternative action representations based on mutual information with contextual information, also termed control signals. Crucially, the control signals relate to one another hierarchically, in that information is inherited from superordinate to subordinate levels, and separate signals are processed by spatially distinct regions along the rostro–caudal axis of the PFC. At the lowest level, sensory control is supported by premotor cortex and selects a motor response based on a sensory input. Next, contextual control, supported by posterior PFC, selects an action based on an environmental contextual cue. Episodic control, supported by anterior dorsolateral PFC, selects an action based on an ongoing temporal context. Finally, a highest level, branching control, supported by frontal polar cortex (FPC), selects action representations based on a pending temporal context. Thus, from caudal to rostral, regions of the PFC are distinguished based on their reliance on control signals that differ temporally, from immediate environment (sensory and context), to current temporal frame (episodic), to a pending frame (branching). Excerpt and figure adapted from Badre [41]. See also [173, 39, 38].

In the 1980s Alan Baddeley wrote a number of papers on the relationship between attention and memory. Baddeley [35] defined working memory as a temporary memory store for information needed to perform a task or to solve a problem in the short term. You might find it useful to review the discussion of attention in the 2019 CS379C class discussion notes, including two attentional neural network models, one of them focusing on the role of attention in dialog and the other on conscious awareness. Francesco Donnarumma and his colleagues [145, 146] have a reputation for taking on such challenges and in doing so often come up with interesting models inspired by neuroscience and yet departing from biology in ways that invite novel interpretation¹¹⁸.

Figure 10: This is a simple neural-network model of the brain regions aligned with the rostro-caudal axis of the frontal cortex and hypothesized to account for how action representations are selected, maintained and inhibited at multiple levels of abstraction. The network shown here consists of three subnetworks each of which is composed of three elements: a recurrent multilayer perceptron constructed of interleaved convolutional and max-pooling layers shown in orange, a multilayer attention network shown in green and a masking layer in blue that selectively suppresses a subset of the outputs of the convolutional stack in accordance with the output of the attention network. Input to each of the three subnetworks includes areas of associative activity throughout the sensory and motor cortex as well as areas corresponding to higher-level abstractions located in the frontal cortex responsible for abstract thought. The outputs are fed to a network (not shown) that serves as the interface for the peripheral motor system (a fully instrumented integrated development environment in the case of the programmer's apprentice) which could play the role of the basal ganglia and cerebellum, but could be considerably simpler depending on the application.
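The masking element in Figure 10 can be illustrated with a toy example. The caption does not say how the mask is computed, so the sketch below makes an assumption: the attention network emits one score per feature, and the mask keeps the `keep` highest-scoring features and zeroes the rest. The feature and score values are hypothetical, standing in for the outputs of the convolutional stack and the attention network.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mask(scores: Sequence[float], keep: int) -> List[float]:
    """Binary mask that keeps the `keep` features with the largest attention."""
    weights = softmax(scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept = set(ranked[:keep])
    return [1.0 if i in kept else 0.0 for i in range(len(weights))]


def apply_mask(features: Sequence[float], mask: Sequence[float]) -> List[float]:
    """Suppress the features whose mask entry is zero."""
    return [f * m for f, m in zip(features, mask)]


features = [0.2, 1.5, -0.7, 0.9]           # hypothetical convolutional outputs
attention_scores = [0.1, 2.0, -1.0, 1.2]   # hypothetical attention outputs
mask = attention_mask(attention_scores, keep=2)
masked = apply_mask(features, mask)
```

A hard mask of this kind is not differentiable with respect to the attention scores; a trainable version would need a soft mask or some other relaxation.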

Joaquín Fuster [184] agrees with Baddeley, arguing that working memory is indeed a form of attention, "sustained attention focused on an executive cognit for the processing of prospective action," where cognit is the term that Fuster uses for cognitive network, further distinguishing two categories of cognits: perceptual cognits in the posterior cortex and executive cognits in the frontal cortex. The remainder of the paragraph containing the preceding quote is well worth reading and is included in the footnote at the end of this sentence¹¹⁹.

Figure 10 is just a sketch employing familiar neural network components to make the point that building these architectures out of standard components is not the most significant challenge. The real challenge is in training them as part of a larger system with lots of moving parts. The expectation here, as in the model sketched in Figure 8, is that end-to-end training with stochastic gradient descent isn't going to work, and that training will likely require some form of layer-by-layer developmentally-staged curriculum learning [413, 206, 53, 55] and a strategy for holding some weights fixed while adjusting other weights to account for new information and avoid problems like catastrophic forgetting.

October 16, 2019

%%% Wed Oct 16  2:15:26 PDT 2019

As you work on your assigned research topic, I expect you to focus primarily on research from neuroscience that might help in building architectures of the sort sketched in Figure 8. As we talked about in our last meeting, the proposed model attempts to combine functional features hypothesized to be integral to the human hippocampus, prefrontal cortex and basal ganglia, relating to abstract and hierarchical decision making, experience replay and reinforcement learning and addressing core problems involving transfer learning and catastrophic interference.

At some point relatively soon we have to make the case that ideas derived from the neuroscience research you've uncovered can be applied to solve these problems. At that point, we'll have to connect to related ideas in the machine learning literature, some of it deriving insight from neuroscience that we can cite in the paper. Below I've provided excerpts from some of the most promising research, and the May 21 entry in the 2019 class discussion listing provides a quick overview of current research on hierarchical and multi-policy strategies for transfer learning and hierarchical planning.

Routing networks ... Ramachandran et al [415] >>> Much of this earlier work has either fixed the composition strategy and learned the modules (neurons or whole networks) or fixed the modules and learned the composition strategy. But in its most general form, the compositionality problem is to jointly learn both the parameters of the modules and a strategy for their composition with the goal of solving new tasks. Given that the router is characterized as neural network, it must be trained. One option is to use reinforcement learning (Rosenbaum et al [428]). In this work, we use the noisy top-k gating technique of Shazeer et al [452] that enables the learning of the router directly by gradient descent. Noisy top-k gating is used to choose k experts out of a list of n experts in a way that enables gradients to flow to the router. [...] While the keep-top-k operation introduces discontinuities in the output, gradient descent learning with noisy top-k gating works well in practice. The induced sparsity of choosing k components significantly reduces computational cost, in contrast to soft gating mechanisms where the entire supernetwork is activated (Eigen et al [151]).
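The noisy top-k gating step quoted above can be sketched in plain Python. The sketch follows the usual description of the technique: noisy logits are the clean gate logits plus standard normal noise scaled by a softplus of a second linear map, all but the k largest logits are discarded, and a softmax is taken over the survivors. The weight values, the input and the row-per-expert layout of `w_gate` and `w_noise` are hypothetical choices for illustration, and no gradient computation is shown.

```python
import math
import random
from typing import List, Sequence


def softplus(x: float) -> float:
    """Smooth, strictly positive function used to scale the noise."""
    return math.log1p(math.exp(x))


def noisy_top_k_gating(x: Sequence[float],
                       w_gate: Sequence[Sequence[float]],
                       w_noise: Sequence[Sequence[float]],
                       k: int, rng: random.Random) -> List[float]:
    """Return one gate value per expert; exactly k of them are nonzero."""
    n = len(w_gate)
    logits = []
    for i in range(n):
        clean = sum(w * v for w, v in zip(w_gate[i], x))
        noise_scale = softplus(sum(w * v for w, v in zip(w_noise[i], x)))
        logits.append(clean + rng.gauss(0.0, 1.0) * noise_scale)
    # Keep the k largest noisy logits; the rest are treated as -infinity.
    top = sorted(range(n), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(n)]


rng = random.Random(0)
x = [1.0, -0.5]
w_gate = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.2]]   # four experts
w_noise = [[0.0, 0.0] for _ in range(4)]
gates = noisy_top_k_gating(x, w_gate, w_noise, k=2, rng=rng)
```

Only the experts with nonzero gates need to be evaluated, which is the source of the computational savings mentioned in the excerpt.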

[...] and Rosenbaum et al [428] >>> A routing network consists of a set of modules (parameterized functions) from which a router (the composition strategy) can choose a composition. In a neural network setting, the modules are sub-networks and the router assembles them into a model that processes the input. In making these decisions the routing network can be viewed as selecting a model (path) from the set of combinatorially many such models, one for each possible sequence of modules. This connects routing to a body of related work in conditional computation, meta-learning, architecture search and other areas. Because routing networks jointly train their modules and the module composition strategy, they face a set of challenges which non-routed networks do not. In particular, training a routing network is non-stationary from both the perspective of the router, and from the perspective of the modules, because the optimal composition strategy depends on the module parameters and vice versa.

Skill networks .... Sahni et al [433] >>> The ComposeNet architecture allows learning of skill-state embeddings which can be used inside a differentiable composition function. Each skill has its own network trunk but the final layer, called the policy layer, is shared across all the skills. Each trunk is trained for its particular skill in isolation but gradients from all the skills are applied to the policy layer. The trunks are therefore forced to encode information about their particular skill as well as the agent state in their topmost layer. The policy layer is learning to take embeddings from any skill trunk and output a policy corresponding to that skill. This can be seen as a reversal of many multi-task learning architectures where a common input trunk is used with branches at the top for different tasks. In that case, a common embedding is learned for all tasks. Our goal is the opposite, i.e. to learn unique embeddings for each task and a common layer that can take any embedding and output the corresponding policy.

Modulated policy hierarchies ... Pashevich et al [396] >>> A critical design choice of hierarchical agents is the form of communication between levels. Typically, each level receives a modulation signal from the more abstract level above it (Dayan and Hinton [116]). Such signal could be a categorical variable called an option or a continuous-valued activation vector. While a categorical signal allows to select exactly one skill at a time, a continuous signal allows smooth modulation over lower levels. Inspired by this trade-off, we propose communication via bit-vectors, which allows to mix multiple skills. Empirically, this outperforms categorical modulation signals. [...] See the overview of hierarchical policies in Figure 2: (a) The options agent selects between separate skill networks using a categorical master policy. (b) The one-hot agent combines the skills into a single network and is modulated by a 1-hot signal. (c) Our modulated policy hierarchy sends a binary vector, allowing for richer communication and mixing of skills.
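The difference in expressiveness between categorical and bit-vector modulation can be shown with a small enumeration. This sketch is not the architecture of Pashevich et al; the `mix_skills` function, which simply averages the action vectors of the active skills, is a hypothetical stand-in for whatever the lower level does with the signal. It shows only that n skills admit n one-hot signals but 2^n bit-vector signals, and that the one-hot signals are a special case of the bit vectors.

```python
from itertools import product
from typing import List, Sequence, Tuple

Signal = Tuple[int, ...]


def one_hot_signals(n: int) -> List[Signal]:
    """All categorical (one-hot) modulation signals over n skills."""
    return [tuple(1 if i == j else 0 for i in range(n)) for j in range(n)]


def bit_vector_signals(n: int) -> List[Signal]:
    """All binary modulation signals over n skills."""
    return list(product((0, 1), repeat=n))


def mix_skills(skill_actions: Sequence[Sequence[float]],
               signal: Signal) -> List[float]:
    """Hypothetical mixing rule: average the actions of the active skills."""
    dim = len(skill_actions[0])
    active = [a for a, bit in zip(skill_actions, signal) if bit]
    if not active:
        return [0.0] * dim
    return [sum(a[d] for a in active) / len(active) for d in range(dim)]


# Three hypothetical skills, each proposing a two-dimensional action.
skill_actions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
single = mix_skills(skill_actions, (0, 1, 0))    # same as selecting skill 1
blended = mix_skills(skill_actions, (1, 1, 0))   # mixture of skills 0 and 1
```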

October 15, 2019

%%% Tue Oct 15  3:52:55 PDT 2019

Contexts define procedures and serve to differentiate between them. In programming languages, the context is partly implicit in the function signature corresponding to the name of the function plus the number and type of the formal parameters and the type of the object returned. I say "partly" because the signature doesn't take into account what the function actually computes, and one can imagine different objects with the same method name or a single function that performs a different computation depending on characteristics of the input not reflected in the formal parameter types.
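The point that a signature only partly determines the context can be demonstrated with Python's `inspect` module. In this sketch the classes `Perimeter` and `Area` and the helper `signature_context` are hypothetical; the helper collects the method name, the parameter names and types, and the return type, which is one reasonable reading of "function signature" in the paragraph above.

```python
import inspect
from typing import Any, Callable, Tuple


class Perimeter:
    def measure(self, width: float, height: float) -> float:
        return 2.0 * (width + height)


class Area:
    def measure(self, width: float, height: float) -> float:
        return width * height


def signature_context(fn: Callable[..., Any]) -> Tuple[Any, ...]:
    """Name, parameter names and types, and return type of a function."""
    sig = inspect.signature(fn)
    params = tuple((name, p.annotation)
                   for name, p in sig.parameters.items() if name != "self")
    return (fn.__name__, params, sig.return_annotation)


same_signature = (signature_context(Perimeter.measure)
                  == signature_context(Area.measure))
perimeter = Perimeter().measure(2.0, 3.0)
area = Area().measure(2.0, 3.0)
```

The two methods are indistinguishable by signature yet compute different things, so anything that selects between them needs information beyond the signature.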

Contexts represent the observable state vector at a particular time; however, in the case of a partially observable system, there is likely to be relevant state information not reflected in the state vector that could influence what a given function computes. Short of being able to predict the future value of a state variable from its value at an earlier time or the values of other state variables at earlier times [523], this could result in a procedure appearing to be compatible with or even supportive of another procedure at one point in time and yet incompatible or in conflict with it at other times.

In transfer learning, the objective is to share functionality in the sense that functionality defined in one procedure can be applied in another. Having identified potentially shared functionality, one way to accomplish sharing without duplicating effort is to separate out the shared functionality in a procedure or library. In terms of contexts, this might be done by identifying a subcontext common to two or more procedural contexts, training a specialized procedure associated with the subcontext and then always selecting the most specific context.
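The rule of always selecting the most specific context can be sketched by treating a context as a set of features. That representation is an assumption made for illustration (the text speaks of state vectors), and the feature names, procedure names and the `ProcedureLibrary` class are hypothetical.

```python
from typing import Dict, FrozenSet, Optional

Context = FrozenSet[str]


def shared_subcontext(a: Context, b: Context) -> Context:
    """Features common to two procedural contexts."""
    return a & b


class ProcedureLibrary:
    """Maps contexts to procedure names and selects the most specific match."""

    def __init__(self) -> None:
        self.procedures: Dict[Context, str] = {}

    def register(self, context: Context, name: str) -> None:
        self.procedures[context] = name

    def select(self, observed: Context) -> Optional[str]:
        # A registered context matches when all of its features are observed.
        matches = [c for c in self.procedures if c <= observed]
        if not matches:
            return None
        return self.procedures[max(matches, key=len)]


edit_python = frozenset({"editor_open", "cursor_in_function", "python_file"})
edit_lisp = frozenset({"editor_open", "cursor_in_function", "lisp_file"})
common = shared_subcontext(edit_python, edit_lisp)

library = ProcedureLibrary()
library.register(common, "navigate_function")      # shared functionality
library.register(edit_python, "refactor_python")   # specialized procedure

specific = library.select(edit_python | {"tests_failing"})
fallback = library.select(frozenset({"editor_open", "cursor_in_function",
                                     "rust_file"}))
nothing = library.select(frozenset({"terminal_open"}))
```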

Here we are assuming that each procedure has the same network architecture. This is analogous to the notion of a column in Rusu et al [432]. The columns are viewed as providing features and the authors employ a form of perturbation analysis to establish the degree to which a column for one task might transfer part or all of its features to serve the objective of another task. They note that "choosing which column to use for inference requires knowledge of the task label [...] issues left as future work." In the model proposed here, the context defines the task and selects the procedure in the form of a combination of procedure vectors stored in long-term memory.

Next steps: What if you have two procedures with very similar contexts? Under what conditions would you combine them and how? What if you have two procedures that are always employed together either sequentially or simultaneously as in the case of mutually recursive functions? How do you control the depth of the recursion or, more generally, the depth of the procedure call stack? How do the notions of goals, plans, tasks and subtasks in hierarchical planning relate to applications, programs, procedures and subroutines in software development?

How might you employ experience replay to facilitate one-shot learning of a new procedure? What if you maintained an approximate measure of expected value for every procedure context? If you have two procedures with the same basic architecture, how might you combine them into one procedure so as to free up parameters for other procedures? If you identify a procedure with a lot of variance in its expected value, how might you tease out the source of the variance and generate two new procedures?
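One way to make the expected-value question concrete is to keep a running mean and variance of returns for each procedure context and flag high-variance procedures as candidates for splitting. The sketch below uses Welford's online update; the names ValueEstimate and should_split, the threshold and the minimum sample count are all hypothetical choices, not part of the model described here.

```python
from dataclasses import dataclass


@dataclass
class ValueEstimate:
    """Running mean and variance of returns for one procedure context."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, ret: float) -> None:
        """Fold one observed return into the estimate (Welford's method)."""
        self.count += 1
        delta = ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (ret - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; zero until at least two returns have been seen."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def should_split(est: ValueEstimate, threshold: float, min_samples: int = 10) -> bool:
    """Flag a procedure as a candidate for splitting (threshold is hypothetical)."""
    return est.count >= min_samples and est.variance > threshold
```

This only identifies that the variance is high. Finding the feature of the context that explains it, which is the harder part of the question, is left open.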

Suppose that the replay mechanism simulates a wake-sleep cycle in which it periodically processes the traces stored in short-term memory. Alternatively, one might learn one task and upon completion initiate a wake-sleep cycle of replay rather than learn a second task before initiating replay. However, as we found out earlier, humans fare better if they attempt to learn two tasks simultaneously rather than sequentially, i.e., first learning one task and then the other [445, 444].

Miscellaneous Loose Ends: Dileep George and his colleagues at Vicarious have developed a framework and architecture called Visual Cognitive Computer (VCC) that represents everyday objects as programs consisting of visuomotor commands related to Shimon Ullman's visual routines [496, 306]. Drawing from their paper [304]:

The programs that we learn can be considered as 'patterns of sensorimotor experience' that constitute a concept, forming the basis for communication and grounded language. Our work is also consistent with the idea of a probabilistic language of thought where complex representations are formed from small repertoire of primitives. Representing concepts as programs makes them naturally compositional as required for a language of thought, and programs provide explainability that has been lacking in current black box policy learning methods. [...] [O]ur work here focused on learning programs on a biased computer architecture whose biases were inspired by cognitive neuroscience and by computational considerations.

October 14, 2019

%%% Mon Oct 14 04:15:18 PDT 2019

I told this story to a couple of my students yesterday, but I couldn't remember the source. I searched for a direct quote and found the excerpt below. I think Feynman's insight is profound and I've heard mathematicians voice similar sentiments [149, 503]. Writing is not the same as speaking aloud or conversing with yourself by way of your inner voice. Language is a powerful tool for thinking, but the physical act of writing is underappreciated and, in a world of passive information consumption, sadly underutilized:

When historian Charles Weiner looked over a pile of Richard Feynman's notebooks, he called them a wonderful 'record of his day-to-day work'. "No, no!", Feynman objected strongly. "They aren't a record of my thinking process. They are my thinking process. I actually did the work on the paper." "Well," Weiner said, "The work was done in your head, but the record of it is still here." "No, it's not a record, not really. It's working. You have to work on paper and this is the paper. Okay?", Feynman explained. Source: Clive Thompson. Smarter Than You Think [482].

Feynman was trying to convey that it isn't that he now has a written record of what he was thinking, but rather that the very act of writing (or for some simply gesturing) is a mode of thinking. Even if you are writing with an ink that will disappear after ten minutes, the process of writing on paper (a) increases the size of your short-term memory several fold and (b) engages your innate understanding of space, time, geometry, force, momentum, etc.

The act of writing exercises the sensorimotor and somatosensory cortex, the approximate anatomical correlates of which are shared across humans. Most animals maintain a detailed internal model of their body that enables them to interact with the world around them taking into account the position and articulation of all their body parts and their relationship to the physical world that surrounds them [495].

It's not just that we have a body, it's that we have an extensively "wired" body that enables us to sense our body and thereby model its current state of activity and conformation relative to objects in our immediate environment, including parts of our own body. For example, how do you move your arm without colliding with your torso or, by extension of the body, how do you move a fork or spoon without hitting your nose, cheeks or teeth, even if your eyes are closed or you are sitting in a darkened room? Running, dancing and doing gymnastics require precise timing, coordination and accurate prediction regarding the dynamics of muscles and body parts in motion¹²⁰.

This model is grounded in our interactions with the environment, providing a rich source of consistent relationships and physical laws that we generalize to other environments both real and imagined. The model serves as a prototype for learning about a wide range of other systems that we encounter in our day-to-day lives and serves as an innate inductive bias for learning complex dynamical systems.

At some point I hope to flesh out the physical interface that determines how the apprentice interacts with the artificial world of programs and computation through the integrated development environment (IDE) that constitutes an important part of its environment. I imagine a system in which the IDE API converts thought vectors into code that the IDE interprets, returning values and related IO as perceptions.

October 12, 2019

%%% Sat Oct 12 02:25:05 PDT 2019

Memory Development

The capacity for long-term memory with robust recall takes years to develop¹²¹. Childhood memories earlier than two to four years are relatively uncommon and those earlier than ten years are often suspect unless accompanied by positive memory emotions¹²². Obviously children do learn, especially in the case of motor skills. They also acquire the ability to perform simple inferences, a sense of time and causality, and a practical understanding of how the causal ordering of events can be used to help recall sequences of events.

However, the acquisition of language is the most critical capacity enabling long-term memories in humans and especially autobiographical memories¹²³, and functional maturation of the hippocampus lags behind other developmental milestones. With the exception of neurogenic granule cells, the cells that comprise the hippocampus have appeared by the end of the prenatal period. However, the cells in the dentate gyrus do not mature until 12 to 15 months of age — see Jabès et al [245]. In prefrontal cortex, the density of synapses increases dramatically at around eight months and peaks between 15 and 24 months of age¹²⁴.

Modeling Consequences

Given the initial stages of learning in both the motor and sensory areas, it doesn't make sense to start forming episodic memories, since the patterns of activity in the early developmental stages of the relevant networks are not stable enough to enable subsequent retrieval. The neural correlates of short-term memory, on the other hand, rely on circuits in prefrontal cortex and the lateral and medial temporal cortex [374], rapidly developing between four and ten months and continuing through adolescence and on into adulthood [541].

By 10 months, infants remember several things, suggesting that their short-term memory develops extremely rapidly in the first year of life. In general, we have discovered that although short-term memory emerges early in life, it rapidly develops between 4 and 10 months of age [541, 188]. There are a number of practical consequences of these findings for the design of artificial systems that rely on such memory systems and that require a lengthy developmental period, during the first part of which the system must learn the representations those memory systems will rely on for indexing and retrieval.

Among other features, this arrangement will enable us to exploit a property of human motor learning, namely that if you learn two tasks sequentially, you will learn quickly, but are more likely to forget them later. However, if you learn them simultaneously, switching back and forth from one to the other, you will learn them more slowly, but be more likely to remember them both later¹²⁵. This is a special case of interference — related to the more extreme variant that we've been calling catastrophic forgetting. We've read about a number of proposals for solving the general problem of interference, and here we introduce yet another that borrows ideas from current theories concerning the hippocampal formation.

Memory and Learning

In preparation for our meeting later this morning when we discuss next steps and assign responsibility for specific topics related to the above, here are a few examples of papers that illustrate ideas from both neuroscience and machine learning that might figure prominently in explaining and elaborating on the model sketch illustrated in Figure 8:

      HalassaandShermanNEURON-19, thalamic reticular nucleus, PFC meta-control
      TopalidouetalENEURO-18, AronetalJOURNAL-NEUROSCIENCE-16 hyperdirect pathway 
      GluckandMyersHIPPOCAMPUS-93, hippocampal stimulus representation mediation
      WayneetalCoRR-18, unsupervised predictive memory, partial observability
      RollsNN-13, role of mossy fibers, detonator synapses, pattern separation 
      AimoneetalNEURON-09, hippocampal neurogenesis and pattern integration
      ToniandAlejandroCSHL-2015, integrating new granule cells in an adult brain
      MattarandDawNATURE-NEUROSCIENCE-18, hippocampus role in experience replay 
      RusuetalCoRR-16, progressive networks, elastic weight consolidation

We'll be talking about declarative and procedural memory models and what might constitute the analog of semantic memory in defining the input to the prefrontal circuits responsible for executive control and abstract, hierarchical reasoning. Elsewhere in these notes we characterized semantic memory as roughly corresponding to primary sensory cortex, unimodal associative cortex and multimodal associative cortex stacked so the least abstract concepts are on the bottom and most abstract on the top. Intuitively I imagine the stack as something like the following:

INPUT monitor and shared buffer and including mouse control for pointing 
      integrated development environment API including debugger controls
      text input including code ingestion and programmer typed comments 
      + text-to-speech speech recognition and, possibly, ambient sounds;
SENSE primary sensory, somatosensory, sensorimotor processing including
      secondary, association both single and multi-modal layers provide 
      analog of medial temporal lobe and parietal lobe representations;
MODEL simple and higher-order relation identification and extraction, 
      graph and network representations, dynamical systems feeding into 
      layers responsible for specialized state prediction and analysis;
THINK hierarchical abstraction layers providing features for planning,
      designing programs, natural language processing, decision making;

Figure 8: This version is deprecated — see Figure 11 for the most recent version. Network A is responsible for constructing and augmenting extended state vectors; it takes as input activation patterns from throughout the temporal and parietal (semantic memory) lobes, extended to incorporate higher-order representations of internal state which it then augments by emphasizing some features and deemphasizing others to create a context for action selection plus adding annotations relating to hierarchical planning ... this component is recurrent and maintains state information that would normally reside on a procedure call stack in a conventional computer architecture; we refer to such an extended state vector as a context;

Network B serves as the long-term memory for storing programs referred to here as procedures that encode procedural knowledge; the network is a variant of a Neural Turing Machine (NTM) with multiple read heads that store key-value vectors corresponding to a context vector concatenated with a procedure vector referred to as a closure; the network takes as input the context provided by A and uses it as a probe to select the top K nearest closures using the dot product of the probe and the context of the closure key as the metric; the dot-product weighted sum of the selected closures is loaded into working memory M and the result fed into network C;

Network D serves as the short-term memory for storing program traces; the format is similar to network B except that the order is important as it corresponds to the order in which the closures were carried out and each closure includes reward information necessary for adjusting weight strengths, making it easy to generate sequences of tuples of the form: e_t = ⟨s_t, a_t, r_t, s_{t+1}⟩ required for experience replay updates; if you haven't already, read Mnih et al [359] for an account of the parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms;

Networks E and F are responsible for hippocampal replay reinforcement learning, and in subsequent entries in this log we will propose a solution whereby this particular coupling of STM and LTM systems, realized as NTM networks, avoids catastrophic interference and facilitates transfer learning, procedural abstraction and parameter and function sharing across multiple procedural components.
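To make the lookup in network B and the trace format in network D a little more concrete, here is a minimal sketch in plain Python. It is not an NTM: the memory is just a list of (context key, procedure vector) pairs, the blend uses the raw dot products as weights with no normalization (a softmax or similar is left out), and the names retrieve and transitions, along with the use of strings for states and actions, are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Closure = Tuple[Vector, Vector]           # (context key, procedure vector)
Step = Tuple[str, str, float]             # (state, action, reward), in execution order
Transition = Tuple[str, str, float, str]  # (s_t, a_t, r_t, s_{t+1})


def dot(u: Vector, v: Vector) -> float:
    """Dot product used as the similarity between a probe and a context key."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(probe: Vector, memory: List[Closure], k: int) -> List[float]:
    """Blend the procedure vectors of the k closures whose keys best match the probe.

    Assumes a non-empty memory. Each selected procedure vector is weighted
    by the dot product of the probe with its context key.
    """
    scored = sorted(((dot(probe, key), proc) for key, proc in memory),
                    key=lambda pair: pair[0], reverse=True)[:k]
    dim = len(memory[0][1])
    return [sum(score * proc[i] for score, proc in scored) for i in range(dim)]


def transitions(trace: List[Step], final_state: str) -> List[Transition]:
    """Turn an ordered trace into (s_t, a_t, r_t, s_{t+1}) tuples for replay."""
    next_states = [s for s, _, _ in trace[1:]] + [final_state]
    return [(s, a, r, s_next) for (s, a, r), s_next in zip(trace, next_states)]
```

Because the trace is stored in execution order, the successor state of each step is simply the state of the step that follows it, which is why order matters in D but not in B.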

October 11, 2019

%%% Fri Oct 11 03:45:35 PDT 2019

I'm trying to organize all three components (the prefrontal cortex, the basal ganglia and the cerebellar cortex) in a single architecture. The way forward is not at all obvious. The primary sensory and somatosensory cortex is like the nozzle of the firehose. It spreads out rapidly, ending up in a dense thicket in the medial temporal cortex and thereabouts. That's in the primate brain. The assistant, and likely humans and primates to some extent, perform considerably more processing to generate latent states that provide the input to decision-making and executive control. Presumably the circuits that serve these abstractions are found in part if not wholly within the frontal cortex including the premotor cortex and adjoining circuits.

In the case of the assistant, these circuits support functions related to modeling, abstraction and hierarchical reasoning, including identifying relationships and constructing and modifying graphical representations, but also using the resulting representational components to make predictions and support planning and decision-making. And then there are the three different, highly-interconnected areas of the brain mentioned above that appear to play a major role in action selection, motivation and coordination. It is tempting to somehow re-factor planning as a form of prediction in some sort of an arrangement like that described in the work on imagination-based planning.

It is tricky however to tease apart what we might call constructive analysis of the sort one engages in when simply thinking about a problem, but this would support having a separate collection of circuits dedicated to just this sort of thinking. Instead I'm leaning toward a model in which the hippocampal formation plays a central enabling role in all of thinking aside from that which is cashed out in the circuits of the cerebellar cortex. In this model, attention and creative recall perform in a perpetual loop directing what we attend to and deciding what memories to recollect and make available that might be relevant to the current circumstances.

The prefrontal cortex is free to move these memories, corresponding to both activities and the contexts in which they are relevant, into registers in working memory so that the activities can be carried out in the current context and the results stored back into episodic memory. To transfer either the products of the activity or the context-altered activities themselves, we would have to include some variation on the idea of replay, which in the case of primates occurs during REM sleep.

Think of the diversity of skills you have. Skills you don't even think of as requiring any special skill. How do you spell a word even a familiar one? How is it you can wash the dishes and not drop any or leave behind any bits of food or soap? Almost every morning I peel and cut up fruit for breakfast. The type of fruit changes with the seasons and each type of fruit — apples, oranges, melons, bananas — requires different techniques for grasping, peeling and cutting into bite-size sections. I grasp and manipulate the knife and the piece of fruit in an orchestrated sequence of moves, accommodating the subtle differences in the thickness and texture of the skin.

My hands do what's required and I don't think about the details; usually my mind is elsewhere, thinking about a problem that I'm working on or perhaps listening to music. Indeed, I would be hard pressed to provide a detailed account of exactly what my hands are doing. The rind of a cantaloupe is different from that of a watermelon, as is the flesh of the fruit, but my hands don't get confused and apply one method when another is required or combine different techniques except in so far as it improves my technique.

October 9, 2019

%%% Wed Oct  9 04:50:05 PDT 2019

I'd like you to think about a rational reformulation of the prefrontal cortex plus basal ganglia plus cerebellar cortex described below. The goal is not to recapitulate nature or reconstruct any of the details of underlying biology, but rather to refactor the biological system, simplifying the model and eliminating any unnecessary (for our purposes) or redundant architectural features.

Keep in mind the environment-coupled sensory-motor hierarchy shown in Figure 5. We need to extend and flesh out the model to incorporate the latent structures and processes that give rise to behavior: perception >> representation >> planning and analysis >> representation >> action. Keep in mind that these structures and processes are only present to serve the needs of the organism / target application.

See if you can derive some inspiration from the section entitled "Harnessed Imagination" in the Spring CS379C introductory materials. In particular, take a look at the model shown in Figure 19. You might also take a look at Figure 74 and Figure 76 in the class discussion list for this year.

... to map onto what we see (or more generally perceive) and what we do (both physical and cognitive activities of a more abstract sort) ... perceiving is but one step in creating an internal representation that we can use as a basis for analysis and planning; in order to carry out plans we also need an internal representation for the actions that we can take, one that serves as a basis for constructing plans and that can be used to translate between the process of planning and the process of carrying out plans in the agent's environment ... the picture might look something like ...

... contexts in the form of features / activity patterns ... think pattern completion, separation and integration more generally ... gate prediction and action selection to manage this tradeoff without catastrophic forgetting ... think imagination based planning [395] as start for a model of executive control that integrates the functionality of basal ganglia and prefrontal cortex ... some of the time you're imagining the future and other times you're acting in the world influencing the future ... the policy in this case is in the form of the context-based hierarchical planning model introduced earlier ... work by Chelsea Finn [163] at Stanford and Pieter Abbeel [467] at UC Berkeley is relevant here ...

In principle, we could eliminate the cerebellum for now, but I'd prefer to simplify and incorporate the functionality of the basal ganglia into that of the cerebral cortex and re-purpose the traditional cerebellar cortex system to take on the broader role that I outlined earlier. Greg Wayne's MERLIN model suggests a deep neural network model that could naturally be adapted to implement a version of Masao Ito's inverse model of the cerebellum:

See here for a somewhat more detailed explanation of Ito's model or read the paper [238] or more recent book [239].

See here for a somewhat more detailed summary of Greg's work or read the paper [523]:

October 7, 2019

%%% Mon Oct  7  3:51:19 PDT 2019

Gary Marcus makes a good point in his recent interview with Lex Fridman. He basically said that while deep networks represent a significant advance in AI and a powerful technology for building useful systems, they have their limitations and in particular their ability to generalize is extremely limited as a consequence of their not having access to a rich enough set of models.

This limitation is evident in the way that they fail to see similarities between objects and events unless they are made explicit, whereas a child will immediately see the connection since it can see past the superficial aspects to supply a narrative that establishes the latent relationship. The source of this narrative is the child's common sense understanding of the world, and the urge to acquire such an understanding is baked into our DNA.

A child learns to understand the world around her by interpreting the objects she sees and the events she experiences in terms of models that she acquires from experimenting with her physical body to understand its dynamics and using it to explore her environment. Learning the dynamical system that corresponds to our body interacting with the environment is the basis / foundation for our learning other such systems both concrete and abstract.

The development of these models is substantially aided by the instincts and biases we are born with and that we take for granted without crediting them for being some of our most powerful cognitive assets. Our brains are not unlike those of other vertebrates¹²⁶. Our direct experience of the world is supplemented with knowledge we acquire from our parents and teachers and from the written word.

Marcus suggests that current research dominated by deep learning has apparently dismissed the whole of symbolic AI in its enthusiasm to adopt deep networks and supervised learning based on gradient descent. In so doing it has failed to understand the value of symbolic representations in mathematics, science and our use of natural language to communicate with one another and, arguably, with ourselves as we try to make sense of the world around us.

It is worth noting that the early connectionists were not so quick to dismiss the advocates of symbolic systems in their day, and they recognized that connectionist and symbolic systems complement one another: James L. McClelland, Bruce L. McNaughton, and Randall C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419-457, 1995.

This doesn't imply that we adopt symbolic models as they were implemented in the 1980s. Ignoring the fact that the programmer's apprentice is expected to communicate with its paired programmer, the assistant's job requires it to manipulate symbolic representations of programs, and, indeed, it is embodied as a sophisticated symbolic system — essentially an integrated development environment adapted as a cognitive prosthesis, and specifically designed to exploit the advantages of language both natural and contrived.

A symbol signifies something; the referent is the thing signified. Think of a symbol as a pointer or hash index and the referent as the contents stored at a fixed location. Any stable structure that can reliably recover the referent can be used as a symbol. The dentate gyrus performs pattern separation in generating an index for recovering a memory. The referent can change and the pointer remain to access the altered referent. This is just stating the obvious that it is possible to manipulate symbols in a connectionist system.
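The pointer analogy can be sketched in a few lines. Here a hash of an activity pattern plays the role of the stable index, loosely standing in for pattern separation in that nearby patterns map to unrelated indices, and a dictionary plays the role of the store. The function make_symbol, the rounding to two decimals and the use of SHA-256 are illustrative assumptions, not a claim about how the dentate gyrus computes an index.

```python
import hashlib
from typing import Dict, List


def make_symbol(pattern: List[float]) -> str:
    """Derive a stable index from an activity pattern.

    Hashing is a loose stand-in for pattern separation: the same pattern
    always yields the same index, and similar patterns yield unrelated ones.
    """
    rounded = ",".join(f"{x:.2f}" for x in pattern)
    return hashlib.sha256(rounded.encode("utf-8")).hexdigest()[:12]


store: Dict[str, str] = {}
symbol = make_symbol([0.1, 0.9, 0.0])
store[symbol] = "original referent"
store[symbol] = "altered referent"  # the referent changes; the symbol does not
```

The point of the sketch is only that the symbol stays fixed while the contents it recovers are free to change.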

October 5, 2019

%%% Sat Oct  5  4:44:04 PDT 2019

This entry provides more detail concerning Section (iii) focusing on action selection and, in particular, the role of the basal ganglia, prefrontal cortex and cerebellar cortex. I'll assume that you're familiar with the basic ideas covered in Randy O'Reilly's lecture covering the basal ganglia and Matt Botvinick's lecture that builds on Randy's covering the role of the prefrontal cortex and, in particular, the hyperdirect pathway.

The Wikipedia page on the basal ganglia provides a reasonable refresher, but make sure that you understand the connection to reinforcement learning and meta learning [517] covered in Matt's presentation. The entry is intended to supply an overview of the architecture that I've been working on this summer by summarizing the primary literature that led me to the current model. Where possible I include excerpts from relevant papers and assume that you will follow up for more detail as you require.

Figure 6: Three figures borrowed from Kenji Doya [148] in which the author examines the hypothesis that the basal ganglia and cerebellum play complementary roles in controlling action and suggesting that, from a computational viewpoint, the cerebellum, basal ganglia, and the cerebral cortex are specialized for three different types of learning: namely, supervised learning, reinforcement learning and unsupervised learning, respectively.

Complementary Roles of Basal Ganglia and Cerebellum

Kenji Doya [148] outlines the complementary roles of cerebral cortex, basal ganglia and cerebellar cortex in learning and motor control¹²⁹. Specifically, he summarizes the view that the cerebellum, the basal ganglia, and the cerebral cortex are specialized for different types of learning: namely, supervised learning, reinforcement learning and unsupervised learning, respectively. In [148], Doya includes three figures — reproduced in Figure 6 for your convenience — that succinctly outline this viewpoint.

Consensus on Role of Cerebellum in Cognitive Function

There is a growing consensus concerning the role of the cerebellum in supporting cognitive and emotional functions [291]. The basic idea that the control of cognitive activities relies on learned models stored in the cerebellum was championed by Masao Ito [239, 238]. It is generally agreed that the cerebellum carries out sensory-motor activities such as walking, running and speaking with precise timing and coordination.

Such activities are selected by the basal ganglia guided by circuits in prefrontal cortex privy to a wide range of internal activity including activity resulting from high-level cognitive effort. The human cerebellar cortex is also believed to carry out cognitive tasks controlled by the prefrontal cortex made possible by reciprocal connections in the form of myelinated fiber tracts that are present in all primates but significantly more pronounced in humans.

Neural Basis Prefrontal Cortex Shaping Action Selection

Until recently, not a great deal was known about the exact neural mechanism whereby the prefrontal cortex exerts control over / influences action selection. The thalamic reticular nucleus (TRN) is hypothesized to regulate thalamo-cortical interactions, but the exact mechanism is not clear [367, 211].

However, Michael Halassa, Murray Sherman along with their students and colleagues have developed a general framework aimed at identifying the long-range, input-output connectivity patterns of single thalamic neurons that explain modulatory control of the basal ganglia given that the thalamus is composed of excitatory neurons devoid of local recurrent excitatory connectivity [210].

In a recent lecture [209], Halassa proposes a candidate neural mechanism explaining how the prefrontal cortex might guide action selection by suppressing irrelevant activity and thereby indirectly encouraging activity relevant to the task at hand. This proposal runs counter to the prevailing (opposite) view that the prefrontal cortex guides action selection by highlighting relevant information in working memory.

Figure 7: [...]

Expanded Prefrontal Cortex Role in Abstract Reasoning

Hypothesis: The prefrontal cortex interacts with the cerebellum to guide action selection in a manner analogous to how the basal ganglia gate activity ... how do the basal ganglia learn to perform sensory-motor activities ... cognitive activities require a more sophisticated method of planning and prediction ... complex cognitive ... more complicated execution .. deal with interruptions, contingencies, ... It is posited that ... supervised .. cerebellum ... reinforcement learning .. basal ganglia ... unsupervised .. prefrontal cortex ... Explain Figure 7.

Revisit discussion of contexts in the May 31 entry of 2019 class discussion list.

Next steps: how might the prefrontal cortex execute what amount to cognitive subroutines, policies or plans ... think abstract relational, hierarchical plans ... describe how the prefrontal cortex could program / train the cerebellar cortex so that it can execute such plans in much the same way that the basal ganglia teach the cerebellum to execute motor subroutines ... keep in mind that the basal ganglia, themselves trained by reinforcement learning, train the cerebellum with supervised learning ... note that, while the basal ganglia have access to sensory and motor cortex, the prefrontal cortex has access to both sensory and motor cortex in addition to abstract relational, dynamical systems models necessary to model a much wider range of cognitive activities ... describe how abstract plans for, say, finding the prime decomposition of a number, factoring a polynomial or applying a given strategy for identifying and fixing errors in conditional statements involving inequality statements, etc.

There is some evidence to suggest that the information required to reconstruct and carry out learned motor activities is stored in the cerebellum¹³⁰. >>> what exactly is stored in short term memory during practice ... what form does it take in the process of consolidation ... how is it represented in the cerebellum ... Marr [341], Attwell et al [30], Ito [239, 240].

It is also clear, however, that some record of our learning to perform such activities — albeit with less precision than is possible in the cerebellum — is stored for some indeterminate length of time, first in short-term memory and then in long-term memory — likely in the premotor cortex. >>> how exactly do the basal ganglia interact with the cerebellum ... what information is passed back and forth and how do the basal ganglia exert any control over the cerebellum ... Frank, O'Reilly [384], Doya [148].

Accurate transfer between cerebral and cerebellar cortex is not guaranteed by any means. It assumes that the learner is not distracted between the time when the activity traces are initially committed to short-term memory and the subsequent consolidation in long-term memory. >>> how exactly does the cerebellar cortex interact with the brain stem and peripheral nervous system ... what information is passed back and forth to enable action via the cerebellar peduncles¹³³ ... what is the role of inferior olivary nuclei¹³⁴ in coordinating transfer between the cerebellum and peripheral nervous system [327].

Miscellaneous Loose Ends: In searching for information on the evolutionary origins of the vertebrate body (and brain) plan, I ran across the following anatomical reference: Hoke et al [229] write that "Neuroanatomical and pharmacological experiments support the existence of a homologue of the mammalian substantia nigra / ganglia circuit¹³⁵ in the amphibian brain. Demarcation of borders between the striatum and pallidum in frogs, however, has been contentious. Here we present evidence of functional coupling between the putative nigral and striatal homologues."¹³⁶.

Remember that I've shared my BibTeX database and cache of PDF documents for referenced papers. The database includes all of the references cited in the class notes, including abstracts for most papers and PDFs for those prominently mentioned. Each reference includes a key that you can use to index the PDF for most of the entries in the BibTeX database. For example, for the BibTeX entry in the database that starts with @book{ShepherdandGrillner2018, ..., you can find the PDF file for the entire book in the shared cache directory listed as ShepherdandGrillner2018.pdf.

October 3, 2019

%%% Thu Oct  3  4:37:45 PDT 2019

The prefrontal cortex exerts control over the basal ganglia by shaping the context (inhibiting patterns of activation in the thalamus) for switching between behaviors. The basal ganglia play a key role in procedural learning by supervised training of the cerebellum [366, 458]. The basal ganglia are trained by reinforcement learning, while learning in the prefrontal cortex is thought to be primarily unsupervised.
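As a rough illustration of the division of labor described above, the following Python sketch pairs a reward-driven selector (standing in for the basal ganglia) with an error-driven refiner (standing in for the cerebellum). Everything in it is a hypothetical stand-in rather than a model taken from the cited work: the action names, the reward table, the learning rates and the scalar gain being tuned are assumptions chosen for brevity.

```python
import random

def select_action(q_values, rng, epsilon=0.1):
    """Epsilon-greedy choice over learned action values (reward-trained stage)."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)

def learn_action_values(rewards, trials=200, alpha=0.1, seed=0):
    """Estimate the value of each action from scalar reward alone."""
    rng = random.Random(seed)
    q_values = {action: 0.0 for action in rewards}
    for _ in range(trials):
        action = select_action(q_values, rng)
        # Move the estimate toward the reward actually received.
        q_values[action] += alpha * (rewards[action] - q_values[action])
    return q_values

def refine_gain(target, command=1.0, lr=0.5, steps=20):
    """Delta-rule tuning of a motor gain from a signed error (supervised stage)."""
    gain = 0.0
    for _ in range(steps):
        error = target - gain * command  # the teacher supplies the desired output
        gain += lr * error * command
    return gain

# Hypothetical reward table: only "reach" is ever rewarded.
q = learn_action_values({"reach": 1.0, "wait": 0.0})
chosen = max(q, key=q.get)      # the action the selector settles on
gain = refine_gain(target=0.7)  # the refiner then shapes how it is executed
```

The point of the sketch is only the contrast in training signals: the selector sees a scalar reward and must explore, whereas the refiner sees a signed error and can correct directly.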

What does this say about the sort of inference that must go on in the prefrontal cortex if we are to survive? One answer is that "the prefrontal cortex is on its own" — there is no external fount of wisdom to draw from, no arbiter of truth to appeal to. We are forced to depend on logic, the scientific method, our ability to obtain unbiased information about the world we live in and heuristic search, the last of which bottoms out in flipping a coin since the heuristic is just a hunch that could be completely wrong.

Depicted here are parts of the basal ganglia and their interconnections as revealed by diffusion spectrum imaging based on thirty subjects from the Human Connectome Project. The direct, indirect and hyperdirect pathways are visualized in different colors (see the inset legend). Subcortical structures are rendered based on the Harvard-Oxford subcortical atlas (thalamus) as well as the Atlasing of the Basal Ganglia atlas (other structures).

This graphic shows two coronal slices that have been superimposed to include the involved basal ganglia structures. + and - signs at the point of the arrows indicate respectively whether the pathway is excitatory or inhibitory in effect. Green arrows refer to excitatory glutamatergic pathways, red arrows refer to inhibitory GABAergic pathways and turquoise arrows refer to dopaminergic pathways that are excitatory on the direct pathway and inhibitory on the indirect pathway. (SOURCE)

September 30, 2019

%%% Mon Sep 30 05:16:03 PDT 2019

IV. Cognitive architecture of the programmer's apprentice (i.) An overview of focus: systems and circuits covered (ii.) Cerebral cortex network architecture & strange loops: subcortical prefrontal connections, prefrontal training (iii.) Basal ganglia & prefrontal cortex action selection: relevant anatomy, role of the cerebellum, action sans hyperdirect, action with hyperdirect (iv.) Hippocampus episodic memory & creative recall replay: layered architecture, detonator synapses, mossy fiber cells, adult neurogenesis, sharp wave ripples (v.) Differentiable programs imagination and evaluation: (vi.) Repetition & inner speech in the dual-stream model:

September 29, 2019

%%% Sun Sep 29 04:28:05 PDT 2019

Training a model such as the one described in Figure 5 is a daunting challenge, especially when you consider that current deep neural network technologies rely heavily on large amounts of labeled data. It is possible the basic network architecture consisting of two parallel stacks as shown in Figure 5 turns out to be simple to implement using familiar deep neural network components such as convolutional and generative adversarial networks. However, training will likely require new ideas and neuroscience has much to offer in terms of insights drawn from the study of rodent and primate brains.

The mammalian brain is organized as a 3-D structure in which specific cell types are positioned in a radial, laminar and areal arrangement that depends on the production, specialization and directed migration of cells from their origin in the embryo to their final destination [414]. It is only on arriving at their final location that they establish connections to other cells. Postnatally laminar and areal differentiation exhibit substantial differences between early (2-3 months) and late (7-12 months) infancy [361, 290]. Functional organization begins early (2-3 months) even as construction continues and the shaping of cortical circuits reflects the consequences of increasingly complex behavior.

All of this carefully orchestrated activity is critical to development. Early brain structures appear as a consequence of the simple reflexive behaviors the infant engages in, laying the foundation for more coordinated behavior depending on increasingly complex internal representations. The infant's ability to engage its environment broadens, exposing it to more complicated stimuli and the opportunity to experiment with new behaviors. The physical and social environment seem to conspire to ensure that the growing infant and then adolescent has the necessary physical and intellectual prerequisites in place when exposed to circumstances that require them. Here are some examples of how the ideas covered in this section might inform AI research.

A seemingly simple example involves a child acquiring the ability to recognize that something has changed in a small collection of objects when, without any direct knowledge, an object has been removed from or added to the collection [128]. Apparently, our innate ability to acquire such a number sense [112] is baked into the human neural circuitry responsible for visual object recognition [372, 138] and is thought to share the same neural correlates involving the early visual stream, V1, V2 and V3, and possibly attentional circuits located posterior to the primary somatosensory cortex that facilitate reaching for objects [140].

Long-term potentiation (LTP) strengthens synapses based on recent patterns of activity to produce a long-lasting increase in signal transmission between two neurons. LTP is coupled with long-term depression¹³⁷ (LTD) to normalize activation levels, thereby preventing synapses from reaching a ceiling that would inhibit encoding new information [110]. As a complement to traditional stochastic gradient descent, it might make sense to use LTP and LTD together with pruning to reduce noise and reshape activation patterns.
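To make the suggestion concrete, here is a minimal Python sketch of a weight update that combines potentiation, depression and pruning. It is a toy rule and not the mechanism reported in [110]: the soft upper bound, the rates, the pruning threshold and the function names are all assumptions introduced for illustration.

```python
def plasticity_step(weights, pre, post, ltp_rate=0.1, ltd_rate=0.05, w_max=1.0):
    """Apply one LTP/LTD step to weights[i][j] (presynaptic i -> postsynaptic j)."""
    updated = []
    for i, row in enumerate(weights):
        new_row = []
        for j, w in enumerate(row):
            if pre[i] and post[j]:
                # LTP: co-active pair; the step shrinks as w approaches w_max,
                # so the synapse never saturates at the ceiling.
                w += ltp_rate * (w_max - w)
            elif pre[i] or post[j]:
                # LTD: only one side active; decay toward zero.
                w -= ltd_rate * w
            new_row.append(w)
        updated.append(new_row)
    return updated

def prune(weights, threshold):
    """Zero out synapses whose strength has fallen below the threshold."""
    return [[w if w >= threshold else 0.0 for w in row] for row in weights]

# Repeatedly pair presynaptic unit 0 with postsynaptic unit 0.
weights = [[0.5, 0.5], [0.5, 0.5]]
for _ in range(20):
    weights = plasticity_step(weights, pre=[1, 0], post=[1, 0])
pruned = prune(weights, threshold=0.2)
```

In this run the paired synapse strengthens toward, but does not reach, the ceiling; the two synapses with one-sided activity weaken until they are pruned; and the synapse between the two silent units is left untouched.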

So-called silent synapses¹³⁸ have the property that they can be rendered inactive by means of receptor-mediated signaling thereby avoiding the degradation of previously learned circuits and conditionally controlling circuit activity [266]. This ability to control activity at the level of synapses suggests a method to avoid inadvertently altering previously trained circuits in the process of training new circuits.
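One way to read this suggestion in machine-learning terms is as a per-weight plasticity mask. The Python sketch below trains a single linear unit with the delta rule while holding masked weights fixed. The mask, the task data and the function name are hypothetical, and treating a silent synapse as "frozen" is an assumption that simplifies the biology.

```python
def train_masked(weights, plastic, samples, lr=0.1, epochs=50):
    """Delta-rule training that updates only weights whose mask entry is True."""
    w = list(weights)
    for _ in range(epochs):
        for x, target in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            error = target - y
            for i, xi in enumerate(x):
                if plastic[i]:  # frozen weights are skipped entirely
                    w[i] += lr * error * xi
    return w

# Weight 0 encodes a previously learned mapping and is frozen;
# weights 1 and 2 are free to learn the new task.
old_weights = [1.0, 0.0, 0.0]
plastic = [False, True, True]
new_task = [([0.0, 1.0, 0.0], 2.0), ([0.0, 0.0, 1.0], -1.0)]
trained = train_masked(old_weights, plastic, new_task)
```

After training, the frozen weight is unchanged while the plastic weights have moved close to the values the new task requires.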

Comment: Some of the following examples need work to convince the reader that the cited neuroscience research has value in providing useful guidance to artificial intelligence and the programmer's apprentice in particular.

Our current understanding of the developing brain suggests the use of multi-stage curriculum strategies [205, 55, 335] to train complex systems such as the programmer's apprentice. We also envision exploiting multiple feedback paths at different levels of the coupled sensory-motor hierarchy to learn level-by-level from the bottom up, and starting with many more units corresponding to neurons and weights corresponding to synapses and axons than we expect to need and then pruning in later stages of development [54].

We might implement a form of neural network architecture search [543, 456, 400, 208] that mimics early-development cell migration and adult neurogenesis in reshaping existing neural representations to accommodate unexpected pattern density — this last an option likely available only in the dentate gyrus of the hippocampal formation [3, 535, 442].

Develop a model of reinforcement learning that relies on motivation based on a combination of intrinsic and extrinsic rewards [263, 247, 396]. Studies [307, 522] employing protocols designed to identify markers of intrinsic motivation point to activity in the anterior insular cortex combined with enhanced coupling with somatosensory areas and reduced coupling with the visual sensory areas, related to apprehending one's own pain and the pain of others [62].
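A simple way to combine the two kinds of reward is to add a novelty bonus to whatever the environment pays out. The Python sketch below uses a count-based bonus, which is one common choice and not necessarily what the cited studies have in mind. The class name, the beta parameter and the state labels are assumptions made for illustration.

```python
import math
from collections import Counter

class NoveltyBonus:
    """Adds an intrinsic bonus that shrinks as a state becomes familiar."""

    def __init__(self, beta=0.5):
        self.beta = beta          # weight of intrinsic relative to extrinsic reward
        self.visits = Counter()   # how often each state has been seen

    def reward(self, state, extrinsic):
        self.visits[state] += 1
        intrinsic = self.beta / math.sqrt(self.visits[state])
        return extrinsic + intrinsic

bonus = NoveltyBonus(beta=1.0)
first = bonus.reward("hallway", extrinsic=0.0)   # novel state, full bonus
second = bonus.reward("hallway", extrinsic=0.0)  # same state, smaller bonus
goal = bonus.reward("goal", extrinsic=2.0)       # extrinsic and intrinsic add
```

The combined signal can be fed to any reinforcement learner in place of the raw extrinsic reward, so exploration is rewarded early and fades as states are revisited.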

Schizophrenia patients show deficits in the lateral prefrontal cortex relating to intrinsic motivation. Researchers have compared schizophrenic patients with healthy individuals to identify neural circuits implicated in the loss of intrinsic motivation. These include reduced activity in lateral prefrontal cortex and in particular circuits in the ventromedial prefrontal cortex associated with empathy, motivation and identification of rewarding stimuli [478].

September 27, 2019

%%% Fri Sep 27 09:28:05 PDT 2019

As somewhat of a departure, here is a transcript of a short segment from Leonard Susskind¹³⁹ being interviewed by Lex Fridman on Fridman's Artificial Intelligence Podcast starting at 00:37:00 into the interview:

LEONARD SUSSKIND: I can't see past quantum mechanics. I don't see any reason for there to be anything beyond quantum mechanics. I think these are very interesting and very deep questions¹⁴⁰. I don't like his answers.

LEX FRIDMAN: If we look at the very deepest level of reality whether that be deterministic or, when observed, as probabilistic, what does that mean for our human ideas of free will? Is there any connection whatsoever from this perception, perhaps illusion, of free will that we have and the fundamental nature of reality?

LEONARD SUSSKIND: The only thing I can say is that I am puzzled by that as much as you are.

LEX FRIDMAN: The illusion of that?

LEONARD SUSSKIND: The illusion of consciousness. The illusion of free will. The illusion of self. How can a physical system do that? I am as puzzled as anybody.

LEX FRIDMAN: There are echoes of that in the observer effect¹⁴¹. Do you understand what it means to be an observer?

LEONARD SUSSKIND: I understand it at a technical level. An observer is a system with enough degrees of freedom that it can record information and which can become entangled with the thing it is measuring. Entanglement is the key. The system that we call an apparatus or observer interacts with the system it is observing. It doesn't just look at it. It becomes physically entangled with it and it is that entanglement that we call an observation or measurement. Now does that satisfy me personally as an observer? Yes or no. I'm glad that we have a mathematical representation of what it means to observe a system.

LEX FRIDMAN: You are observing something now at a conscious level. Do you think there are echoes of that at the macro-scale world?

LEONARD SUSSKIND: Yes. Absolutely. For sure. Mechanically we are entangled with everything in this room. If we weren't then we just wouldn't be observing it. But on the other hand, you can ask, am I really comfortable with it. I'm uncomfortable with it in the same way that I can never get comfortable with five dimensions. My brain isn't wired for it.

LEX FRIDMAN: Are you comfortable with four dimensions?

LEONARD SUSSKIND: I'm comfortable with four dimensions a little bit more because I can always imagine the fourth dimension is the arrow of time.

September 25, 2019

%%% Wed Sep 25 09:36:22 PDT 2019

Just because so many of us can demonstrate basic math skills, it doesn't follow we are all brilliant students or latent Fields Medal recipients, but rather that we inhabit an environment providing numerous examples of physical phenomena that can be explained in relatively simple mathematics — that and the concomitant fact that nature selected for traits that make it easier for us to acquire such skills. As part of my preparation to interact more closely with Jay McClelland and his students working on how humans learn basic math skills, I'm reading a bunch of related papers, re-reading Stanislas Dehaene's The Number Sense as well as a book by Tobias Dantzig that Dehaene mentions in his first chapter [128, 112]:

Here are some excerpts from Dantzig, Tobias. Number: The Language of Science. New York: Macmillan Company, 1930 (URL):

"Man, even in the lower stages of development, possesses a faculty which, for want of a better name, I shall call Number Sense¹⁴². This faculty permits him to recognize that something has changed in a small collection when, without his direct knowledge, an object has been removed from or added to the collection". — Page 1.

Comment: Nasr et al [372] suggest that some aspects of our innate ability to acquire the number sense are baked into the neural networks responsible for visual object recognition¹⁴³. DeWind [138] makes a similar case — [372] and [138] appeared in print within days of one another — and cites several multi-author papers that make related claims along with plausible hypotheses regarding the neural correlates for the number sense [140, 139].

"Yet the transition from relative number to absolute is not difficult. It is necessary only to create model collections, each typifying a possible collection. Estimating any given collection is then reduced to the selection among the available models of one which can be matched with the given collection member by member". — Page 7.

"To create a counting process it is not enough to have a motley array of models, comprehensive though this latter may be. We must devise a number system: our set of models must be arranged in an ordered sequence, a sequence which progresses in the sense of growing magnitude, the natural sequence: one, two, three ..."— Page 8.

"Once this system is created, counting a collection means assigning to every member a term in the natural sequence in ordered succession until the collection is exhausted. The term of the natural sequence assigned to the last member of the collection is called the ordinal number of the collection". — Page 8.

"Correspondence and succession, the two principles which permeate all mathematics — nay, all realms of exact thought — are woven into the very fabric of our number system". — Page 9.

Comment: It is interesting that Dantzig doesn't introduce the term magnitude as having any special significance, but rather treats it as an obvious concept! Dehaene [128] on the other hand spends considerable time discussing the common-sense notions of distance and magnitude as they relate to the number sense. — Pages 27-27, 76-77, 124-126.

The idea of the relative magnitudes of two entities seems key to understanding order. Here are a couple of YouTube videos, I and II, illustrating one teacher's strategy — including her teaching aids designed to exercise students' physical intuitions — for guiding students to develop a sense of what numbers mean and what they're good for.

In the same spirit as Yuan and McClelland [536], I can imagine simulated environments, using the learning aids featured in the above videos, that an AI system could explore to achieve a basic level of numeracy as a starting point for learning more sophisticated mathematics.
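Returning to the two principles Dantzig singles out in the excerpts above, correspondence and succession, a short Python sketch may help separate them. The function names are hypothetical and the code only illustrates the distinction; it makes no claim about how a child or a network acquires either ability.

```python
from itertools import count

def same_size(first, second):
    """Correspondence: match two collections member by member, without counting."""
    sentinel = object()
    left, right = iter(first), iter(second)
    while True:
        a, b = next(left, sentinel), next(right, sentinel)
        if a is sentinel or b is sentinel:
            # Equal only if both collections were exhausted on the same step.
            return a is sentinel and b is sentinel

def count_collection(collection):
    """Succession: assign 1, 2, 3, ... to the members and return the last term."""
    last_term = 0
    for term, _member in zip(count(1), collection):
        last_term = term
    return last_term

pebbles = ["a", "b", "c"]
sheep = ["x", "y", "z"]
matched = same_size(pebbles, sheep)   # True, and no number was ever named
total = count_collection(sheep)       # 3, the term assigned to the last member
```

The first function corresponds to matching against a model collection, and the second to the counting process that requires an ordered number system.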

"We see therefore that no progress was possible until a symbol was invented for an empty class, a symbol for nothing, our modern zero. The concrete mind of the ancient Greeks could not conceive the void as a number, let alone endow the void with a symbol". — Page 31.

"Today, when positional numeration has become a part of our daily life, it seems that the superiority of this method, the compactness of its notation, the ease and elegance it introduced in calculations, should have assured the rapid and sweeping acceptance of it". — Page 33.

September 21, 2019

%%% Sat Sep 21 04:23:22 PDT 2019

Rough outline of Sections III and IV in the invited paper:

III. Building biologically inspired digital assistants — motivate building end-to-end systems, embodiment, basic language and vision, programming as a task for the assistant ... (i.) Digital assistants as cognitive enhancements (ii.) Embodied cognition and how it enables learning (iii.) Human-machine interaction language and gesture (iv.) Automated programming as tractable application

IV. Cognitive architecture of the programmer's apprentice — main content section with subsections for action selection, episodic memory, representing programs, code synthesis ... (i.) An overview of focus: systems and circuits covered (ii.) Cerebral cortex network architecture & strange loops (iii.) Basal ganglia & prefrontal cortex action selection (iv.) Hippocampus episodic memory & creative recall replay (v.) Differentiable programs imagination and evaluation (vi.) Repetition & inner speech in the dual-stream model

Architectures

The architecture of the human brain, at any scale you choose to consider, bears little or no resemblance to conventional computer architectures. There is no separate program memory, no centralized processing unit, no highly stable, random-access, non-volatile memory and nothing like the digital level of abstraction that enables software engineers to ignore instabilities in the analog circuits that implement logic gates. Since representations (data) are collocated with the transformations (computations) that operate on them, and different parts of the brain perform different computations requiring different types of memory, the human brain has to support multiple memory systems.

Human memory¹⁴⁴ is characterized along multiple dimensions depending on what sort of information is stored, how it is accessed and how long it remains accessible [108]. Short-term, long-term and working memory are differentiated on the basis of access, persistence, volatility and the effort required to maintain them. Short-term memory is measured in seconds, long-term memory in days, months or years, and working memory is essentially short-term memory that can be maintained (with cognitive effort) indefinitely and manipulated, (very roughly) analogous to a register in the ALU of a von Neumann machine [35].

Declarative memory is defined by the ability to explicitly (consciously) recollect facts, whereas non-declarative memory is accessed unconsciously or implicitly through performance — he can open a bottle "without thinking", but can't tell you whether to turn the lid clockwise or counter clockwise — rather than recollection. Episodic memory is generally considered long-term and declarative, and is further differentiated on the basis of the kinds of relationships it can encode, including spatial, temporal and social [468, 431, 295, 105].

In the remainder of this section (continued here), we consider several core components of the programmer's apprentice architecture, each of which depends on or implements one or more memory systems. Drawing upon concepts covered earlier, we consider three major elements:

the role of the posterior cortex in supporting declarative knowledge and semantic memory,
the basal ganglia and prefrontal cortex as the basis for motivation and executive function, and
the hippocampal formation¹⁴⁵ in supporting episodic memory formation, retrieval and consolidation.

September 19, 2019

%%% Thu Sep 19  4:46:55 PDT 2019

Much of the cortex is in the business of learning representations of concepts relevant to survival. Perception is the means by which we apprehend and act on the physical realization of the concepts we have learned. It seems obvious that perception serves action. It may not seem so obvious that action serves perception, but the fact is we are almost always moving our head, hands and torso in order to resolve ambiguities in what we see, feeling the shape of unfamiliar objects in order to grasp them firmly and twisting about to see who is behind us calling our name or to get a better idea of where we've come from in order to ensure we can retrace our steps. These are complex sensorimotor activities we depend on every day.

In thinking about physically realizable concepts we think first about what they look, feel, sound and smell like. The sensory cortex is responsible for constructing a hierarchy of representations to characterize such concepts, not to capture everything we sense, but rather to account for what we need to know about concepts to survive. Reconstructing scenes with photographic realism is not what our sensory systems were designed for. Circuits of the primary sensory cortex feed into the circuits of the (unimodal) associative sensory cortex, which feed into the (multimodal) associative sensory cortex. All of these representations are abstract and yet patterns of regionalization are remarkably preserved within species [95, 403, 288].

Concepts arise in patterns of neural activity that account for what we need to know about them, including how they appear to us so we can recognize them, what affordances they offer for us to make use of them and how we might predict their behavior in decision making. Many of the concepts that are represented in our brains serve to model the dynamics of physical systems that we interact with every day, such as riding a bike, working with tools, opening doors, negotiating stairs and riding escalators in department stores. Just as important, if not more so, are the social dynamics we deal with at work and school with their constantly shifting personal relationships and status rankings.

If you are a software engineer designing robot control systems, you might give action much the same scrutiny as perception and build a parallel hierarchy of representations that describes the concepts that relate to movement including navigation, articulation and manipulation ranging from servomotor commands to strategies for moving furniture, but designing or learning these hierarchies independently is generally a bad idea. In mammals, these two hierarchies are tightly coupled to account for how they depend on one another [185].

Indeed, determining what sensory representations to learn depends upon and influences what motor representations to learn and vice versa, where we follow the convention of using the term motor as a catchall term for concepts relating to muscles and movement. As pointed out in the introduction, there is evidence to suggest that circuits occurring early in the ventral visual stream code for object-selective features and exhibit large-scale organization characterized by the high-level properties of animacy and object size [289, 330].

Figure 5: A simplified block diagram of the cortex. The column on the left represents the posterior cortex including the occipital, temporal and parietal lobes. The column on the right represents the frontal lobe of the cortex corresponding to the primary motor cortex, premotor cortex (associative motor cortex) and prefrontal cortex. Green arrows represent interaction with the environment, black arrows represent sensorimotor abstractions and red arrows indicate cognitive activity relating to speech, planning and abstract thinking. See the main text for more detail. Adapted from Figure 8.9 in [184].

Figure 5 is a simplified block diagram of the cortex organized as two columns. The left column represents the posterior cortex consisting of the occipital, temporal and parietal lobes that are primarily concerned with processing sensory information. The relevant brain areas are summarized in three blocks roughly corresponding to primary sensory cortex, unimodal associative cortex and multimodal associative cortex stacked so the least abstract concepts are on the bottom and most abstract on the top. The combined area is often referred to as semantic memory and characterized as long-term declarative memory [57].

The right column represents the frontal lobe of the cortex corresponding to the primary motor cortex, premotor cortex (associative motor cortex) and prefrontal cortex. The primary motor cortex is responsible for creating abstract representations of motor activity throughout the body. The premotor cortex is responsible for integrating sensory and motor abstractions to construct sensorimotor representations. The prefrontal cortex orchestrates cognitive behavior including speech, planning and abstract thinking, and is reciprocally connected to the association areas just mentioned as well as subcortical structures including the basal ganglia and hippocampus.

The two columns are connected with one another at multiple levels: by physical interaction with the environment (green arrows), by sensorimotor abstraction and alignment (black arrows), and by cognitive effort in directing activity mediated through subcortical structures (red arrows). This arrangement supports the formation of rich representations that serve a wide range of cognitive function. The sensorimotor connections and feedback through the environment provide an inductive bias to guide learning, ground inference and reduce sample complexity by reducing reliance on labeled data and enabling opportunities for unsupervised learning [48].

Simple as this model of cortical function may seem, it may be one of the most important architectural contributions of neuroscience to the development of artificial intelligence patterned after the human brain. Some of the lessons have already been integrated into the discipline of control theory through exposure to early work in biological cybernetics [181, 309, 246, 193, 349, 509], but some of the most important lessons impact the application of machine learning in building autonomous embodied systems including robots and digital assistants as alluded to above.

Excerpt from the 5th Edition of Joaquín M. Fuster's The Prefrontal Cortex:

The prefrontal cortex is profusely and reciprocally connected with subcortical and other cortical structures, notably the thalamus, the basal ganglia, the hypothalamus, the amygdala, the hippocampus, and cortices of association of the temporal and parietal lobes — see here for more on subcortical projections. Several neurotransmitter systems are represented in the prefrontal cortex, notably dopamine and cholinergic systems. Based on neuropsychological, neurophysiological, and neuroimaging studies, it can be concluded that the principal and most general function of the prefrontal cortex is the temporal organization of behavior, speech, and reasoning — see here for work related to the neural correlates of abstract thinking. That general function is supported by at least three cognitive functions in which the prefrontal cortex critically participates: active short-term memory (working memory), preparatory action set, and control of interference [185].

September 11, 2019

%%% Wed Sep 11  1:50:07 PDT 2019

Miscellaneous Loose Ends: Here are some of my notes for writing the second half of the second section of the paper. Attributions are scarce and excerpts are common. I've cited papers or pointed to online materials where it was convenient. These notes are supplied for students interested in working on related topics:

Two Streams Hypothesis

See Figure 1.0 in [226] for the Two Streams hypothesis with respect to processing auditory stimuli. You may be familiar with this hypothesis as applied to processing visual stimuli, which has a longer history. Introduce dual path architectures and in particular white matter reciprocal connections between areas responsible for generating and understanding speech ... Broca's and Wernicke's areas ... reciprocal connections between the medial temporal cortex and hippocampal complex (URL):

The inferior frontal gyrus is the location of Broca's area involved in language processing and speech production. The middle frontal gyrus makes up about one-third of the frontal lobe of the human brain. Note that the middle frontal gyrus, like the inferior frontal gyrus and the superior frontal gyrus, is more of a region in the frontal gyrus than a true gyrus.

Brodmann Functional Areas

Note that Brodmann area 46 is part of the frontal cortex in the human brain and is located in the dorsolateral prefrontal cortex, which "plays a central role in sustaining attention and managing working memory, and has recently been shown to regulate self-control. It is one of the few cortical areas whose activity diminishes during REM sleep" (URL).

The orbitofrontal cortex (OFC) is a prefrontal cortex region in the frontal lobes of the brain which is involved in the cognitive process of decision-making. In non-human primates it consists of the association cortex in Brodmann areas 11, 12 and 13; in humans it consists of Brodmann areas 10, 11 and 47. It is defined as the part of the prefrontal cortex that receives projections from the medial dorsal nucleus of the thalamus, and is thought to represent emotion and reward in decision making (URL). A more precise definition of PFC can be accomplished by using maps [73]. Brodmann areas are based on the types of neurons and connections that are typically found therein. According to this definition, the PFC consists of Brodmann areas 8, 9, 10, 11, 12, 13, 44, 45, 46 and 47:

Figure 12.3. Brodmann areas in the frontal lobes. Areas forward of the motor cortex are considered to be prefrontal (Brodmann areas 4 and 6 are motor and premotor regions). However, the boundary is not rigid. It is often useful to think of a gradual transition between more "cognitive" areas and primary motor cortex (Brodmann area 4), which directly controls voluntary muscles (URL).

Subcortical Projections

Another method of outlining the PFC is through its subcortical projections. Fuster [183] and Koechlin et al [285] have hypothesized a perception-action cycle including PFC and posterior regions of the brain and subcortical nuclei including the thalamus (Figure 12.4). Note that PFC has bidirectional connections with many subcortical regions:

Figure 12.4. A schematic diagram of connections from (shown in red) the prefrontal cortex to posterior brain regions and to subcortical nuclei. Connections to the prefrontal cortex are shown in blue (URL).

Hot and Cold Cognition

See Figure 5.10 in [385], where Randy O'Reilly characterizes the ventral (analytical) versus the dorsal (emotional) in terms of hot versus cold:

Figure 5.10 shows an overall schematic for how this occurs. It also illustrates how the lateral surface is more associated with "cold" cognitive function, while the medial surface is more involved in "hot" emotional and motivational processing [385].

Basal Ganglia Connections

If we assume that the job of the basal ganglia is to select an action to perform based on reinforcement learning, then the job of the cerebellum is to take over once the action has been initiated, and use error-driven learning to shape the performance of the action so it is accurate and well-coordinated — adapted from page 109 in [385]. See Figure 18.5 showing connections from the basal ganglia to other brain regions in Purves et al [111] (URL):

Functional organization of the outputs from the basal ganglia. (A) Diagram of the targets of the basal ganglia, including the intermediate relay nuclei (the globus pallidus, internal and external segments, and the subthalamic nucleus), the superior colliculus, the thalamus, and the cerebral cortex. (B) An idealized coronal section through the human brain, showing the structures and pathways diagrammed in (A). See Figure 18.2 showing connections to the basal ganglia from other brain regions in Purves et al [111] (URL):

Anatomical organization of the inputs to the basal ganglia. An idealized coronal section through the human brain, showing the projections from the cerebral cortex and the substantia nigra pars compacta to the caudate and putamen.

You might also want to check out Randall O'Reilly's presentation in CS379C, Spring 2018 (URL).

Hippocampal Connectivity

Consider the bidirectional connectivity between the hippocampus and the cortex, specifically the entorhinal cortex and a wide range of other cortical areas. See Figure 9.3 regarding the hippocampal memory system in [382] and the work of Squire and Zola-Morgan [466] (URL):

White Matter Tractography

See [514] for a technical overview of fMRI / DTI tractography technology. See [251, 260] for discussion relating white matter fiber bundles and the connection between frontal cerebral cortex and cerebellar cortex ... and [291] for a summary of the consensus view.

The arcuate fasciculus is a major anterior/posterior tract. The arcuate fasciculus is a white matter bundle that contains both long and short fibers that connect the frontal, parietal, and temporal lobes. The arcuate plays a key role in the left hemisphere in language processing and in the right hemisphere in visuospatial processing and some aspects of language processing, such as prosody and semantics — Nicole M. Gage, Bernard J. Baars, in Fundamentals of Cognitive Neuroscience, 2018 (URL).

See Travis et al [490, 199, 198] on the importance of the cerebellum for learning to read and child development, and [376, 192, 328, 354, 332] on related issues concerning episodic memory and aging. See Chica et al [98] and Dehaene [129] on conscious awareness, attention and short-term working memory¹⁴⁶. See [434] for a comparison of developmental patterns of white matter within the prefrontal region between humans and nonhuman primates.

September 7, 2019

%%% Sat Sep  7  3:34:07 PDT 2019

From the brain of an Etruscan shrew weighing in at less than a tenth of a gram to a sperm whale brain weighing more than eight kilograms, it is clear that natural selection has stumbled on a basic brain plan and set of developmental strategies that enables it to construct a diverse set of special-purpose brain architectures for efficiently expressing a wide range of sophisticated behavior [147, 529]. The human brain with its approximately 100 billion neurons and the shrew brain with approximately 1 million neurons share the same basic architecture.

The mouse brain has homologues of most human subcortical nuclei and has contributed significantly to our understanding of the human brain and human neurodegenerative disease in particular. The differences between human and chimpanzee brains are subtle [362] and yet humans display a much wider range of behavior and express a much larger repertoire of genes than any other species [216]. So what makes the difference?

It's the connections between neurons that matter or, more generally, it's the different types of communication between neurons that biologists refer to as pathways. There are electrical, chemical and genetic pathways, and each of them obeys different constraints and is used for different purposes. They include point-to-point and broadcast methods of communication [215]. They transfer information at different speeds and using different coding strategies. Layered architectures are common not just in the cortex but throughout the brain. It's the wiring that sets humans apart.

Figure 6: A highly stylized rendering of the major functional areas of the human cortex shown from the side with the head facing to the right. Highlighted regions include the occipital lobe shown in shades of green including the primary visual cortex; the parietal lobe shown in shades of blue including the primary somatosensory cortex; the temporal lobe shown in shades of yellow including the primary auditory cortex; and the frontal lobe shown in shades of pink including the primary motor and prefrontal cortex. The region outlined by a dashed line on the right is Broca’s area and it is historically associated with the production of speech and hence its position relative to the motor cortex. The region outlined by a dashed line on the left is Wernicke’s area and it is historically associated with the understanding of speech and hence its position relative to the sensory cortex.

Figure 6 shows the major functional areas of the human neocortex including the primary and secondary sensory and motor areas. The proximal locations of the areas provide a very rough idea of how different functions might relate to one another. The graphic belies the fact that the brain is three dimensional and much of its functional circuitry is hidden under the cortical sheet. The human cerebral cortex is complexly folded to fit within the skull with much of it hidden within the folds. This folded sheet of tissue accounts for more than 75% of the human brain by volume [476] and is largely responsible for the rich behavioral repertoire that humans exhibit. It is worth pointing out in this context that the cortical sheet enshrouds a collection of evolutionarily preserved and highly specialized circuits, homologues of which are found in all mammals and without which the cortex would be useless (URL).

The graphic shown in Figure 6 is a simplification of the standard medical textbook diagram. In particular, several of the association areas are not shown and those that are shown are labeled somewhat differently than is common practice. The organizing biological principle is that, the further away from the primary sensory areas, the more general associative functions become, integrating information from multiple modalities to construct abstract representations tailored to serve ecologically relevant objectives [531]. It is worth contemplating the arrangement of areas to note that they converge on locations in the cortex that play an important role in decision making and higher-order cognition more generally (URL).

Figure 7: Two 3-D renderings of the human brain generated by the Allen Institute Brain Explorer from the Allen Human Brain Reference Atlas [217]. On the left (A) are shown 3-D reconstructions of several subcortical nuclei featured in this paper. The inset indicates the orientation of the head and a cross-sectional view of the whole brain is projected on the mid-sagittal plane dividing the right and left sides of the brain illustrating how the cortex envelops the subcortical regions. The panel on the right (B) depicts the same subcortical nuclei as seen from above (horizontal plane) and to the rear of the brain illustrating how the thalamus is located between the cortical sheet and the subcortical nuclei, serving in its role as relay between the two regions.

Figure 7 highlights the 3-D structure of several subcortical nuclei emphasized in this paper showing how they relate anatomically to one another and to the cortex. The reconstructions were produced from data acquired by functional magnetic resonance imaging (fMRI) of adult human subjects [217] and offer additional functional insight complementing conventional histological studies [72]. They don't however provide detailed information concerning either local or long-range connectivity.

Traditionally, tracing connections between individual neurons has been accomplished using conventional histology staining techniques and electrophysiology, but this approach doesn't scale well [27]. Neurotropic viruses can be used to spread infection through networks of neurons thereby inducing cells to express fluorescent proteins that are imaged to yield three dimensional maps, but this too has scaling problems [82].

Relatively small samples of neural tissue can be fixed, stained and sliced into thin sections. Each of the sections is then imaged with an electron microscope and the resulting digital images analyzed with computer vision software to reconstruct neurons and identify synapses [356]. The process is time consuming but can be fully automated and scaled to handle larger samples [248, 540].

Larger samples can be imaged with conventional light microscopes, but lower resolution results in less accurate reconstructions. It is also possible to reconstruct the major white matter tracts formed by bundles of myelinated fibers using diffusion-weighted MRI. Unlike the previous technologies, this method is not destructive and accuracy is improved by averaging over multiple subjects after registering the individual brain scans with a reference atlas [379, 511].

These major tracts increase the speed of signal transmission between regions allowing for more distant communication in larger brains. The differences between the neocortex in humans and chimpanzees are subtle [362]; however, white matter connections observed in humans but not in chimpanzees particularly link multimodal areas of the temporal, lateral parietal, and inferior frontal cortices, including tracts important for language processing [26, 200].

Figure 8: White matter tracts speed the transmission of information between distant regions. Connections (A) between the prefrontal cortex and circuits in the parietal and temporal cortex shape conscious awareness, guide attention and support short-term memory maintenance [98, 129]. In humans, white matter tracts between the frontal cortex and the cerebellum (B) shown against a mid-coronal section facilitate cognitive function in addition to their role in motor function in all mammals. These connections are particularly important in the development of reading skills in children and adolescents [490, 291].

The cerebellum in mammals serves to shape motor actions selected in the basal ganglia by ensuring they are precisely timed and well-coordinated. Such activities are initiated by the basal ganglia. In humans, the cerebellum also supports cognitive functions such as those involved in reading [490]. Figure 8 (B) shows the white matter tracts connecting the cerebellum and prefrontal cortex where such abstract cognitive functions originate.

A white matter bundle called the arcuate fasciculus, shown in Figure 8 (A), provides reciprocal connections between the frontal cortex and association areas in the parietal and temporal lobes, playing a key role in attention and the active maintenance of short-term working memory, including support for language processing in the left hemisphere and visuospatial processing in the right hemisphere [98].

The human brain exhibits structure at many scales, the white matter tracts being but one example. A common pattern involves paths that connect multiple circuits that have their own internal components and connections. At a global scale, processing begins in primary sensory areas, propagates forward through dorsal regions integrating additional sources of information to produce composite representations that are processed in the frontal cortex before returning through ventral regions responsible for motivation and action selection.

Many of the connections within such paths are reciprocal allowing feedback to adjust behavior and improve prediction. Advocates of the two-streams hypothesis [201] contend that sensory processing — in particular, seeing and hearing — splits into two streams that account for different aspects of perception — accounting for what you perceive, e.g., properties of an object, versus where you perceive, e.g., its relative location.

Similar reflective and self-corrective patterns arise within subcortical regions including the hippocampal complex and basal ganglia, within their subcomponents such as between the dentate gyrus and CA1 in the hippocampus and as layered networks inside individual subcomponents such as the mossy fiber network within dentate gyrus. Each level solves different problems, offers general insights and provides hints about how one might realize such solutions in artificial systems.

Figure 9: The left panel (A) depicts the reciprocal connections between two subnuclei of the basal ganglia, the putamen and caudate nucleus, and locations in prefrontal cortex responsible for influencing action selection. The right panel (B) highlights reciprocal connections between cortical areas — Brodmann areas 7, 8, 9, 11, 12, 13, 19, 20, 21, 22, 23 and 46 — and the hippocampal complex via the adjacent perirhinal cortex (blue) and the parahippocampal (red) cortices.

Figure 9 describes how subcortical nuclei such as the hippocampal complex and basal ganglia interact with cortical regions. Such attributions provide insight on how to construct complex artificial neural architectures consisting of smaller subnetworks responsible for perception, action selection and episodic memory.

Here we consider two levels of granularity. The first is coarse grained, relying on the major anatomical divisions illustrated in Figure 6. The second is somewhat finer grained and relies on areal divisions based on cytoarchitectural distinctions involving cell types, neural processes including dendrites and axons, and other histological characteristics.

The latter generally employs Korbinian Brodmann's decomposition of the cortex into 52 areas published in 1909 [73] and revised several times since then to accord with more modern staining and imaging technologies as well as improved methods for functional localization. Identifying the Brodmann area associated with the endpoint of a neural connection can tell us a good deal about the functional relationship between two brain regions.

The left side of Figure 9 (A) highlights the reciprocal connections between two subnuclei of the basal ganglia, the putamen and caudate nucleus, and locations in prefrontal cortex responsible for influencing action selection by adjusting input to the basal ganglia and, by way of the thalamus, locations in the parietal and temporal cortex that provide information about the current state relevant to decision making. Our summary description of the three cortical areas belies the complexity of the circuits that comprise these areas.

We can often improve our functional descriptions if we can localize to specific Brodmann areas. For example, the orbitofrontal cortex (OFC) is located in the prefrontal cortex in a region of the frontal lobes involved in the cognitive process of decision-making. In humans it consists of Brodmann areas 10, 11 and 47. It is defined as the part of the prefrontal cortex that receives projections from the medial dorsal nucleus of the thalamus, and is thought to represent emotion and reward in decision making [63]. The prefrontal cortex, consisting of Brodmann areas 8, 9, 10, 11, 12, 13, 44, 45, 46 and 47, includes the OFC but covers a wider range of functionality.

The right side of Figure 9 (B) highlights reciprocal connections between cortical areas (Brodmann areas 7, 8, 9, 11, 12, 13, 19, 20, 21, 22, 23 and 46) and the hippocampal complex via the adjacent perirhinal cortex (shown as blue connections) and the parahippocampal cortex (shown as red connections). See [18, 351] for mention of Brodmann areas involved in mathematical reasoning ...
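The Brodmann area lists quoted above can be treated as plain sets, which makes relationships such as "the PFC includes the OFC" easy to check. The following is only a bookkeeping sketch: the variable names are illustrative, and the area numbers are copied from the text rather than from an atlas.

```python
# Brodmann area numbers as quoted in the text above (variable names are illustrative).
ofc = {10, 11, 47}
pfc = {8, 9, 10, 11, 12, 13, 44, 45, 46, 47}
hippocampal_targets = {7, 8, 9, 11, 12, 13, 19, 20, 21, 22, 23, 46}

# The OFC is contained in the PFC.
ofc_in_pfc = ofc <= pfc
# Prefrontal areas that also appear in the hippocampal connectivity list.
pfc_and_hippocampal = sorted(pfc & hippocampal_targets)
# Orbitofrontal areas that also appear in the hippocampal connectivity list.
ofc_and_hippocampal = sorted(ofc & hippocampal_targets)

print(ofc_in_pfc)            # True
print(pfc_and_hippocampal)   # [8, 9, 11, 12, 13, 46]
print(ofc_and_hippocampal)   # [11]
```

On these lists, six of the ten prefrontal areas also appear among the areas reciprocally connected with the hippocampal complex, and area 11 is the only orbitofrontal area among them.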

We can descend to yet smaller scales, substituting granule, basket and pyramidal cells for Brodmann areas and mossy fiber synapses and stem / progenitor cells as the conduits for information passing between cells ... to exploit what is known about these networks in order to come up with new artificial neural networks that employ specialized detonator synapses that support single-synapse activations and are able to sustain activation for tens of seconds, and that add new units to network layers to control pattern separation and integration.

Research in neuroscience offers solutions to problems we didn't know we had, and provides new ideas for existing problems for which current solutions are found wanting as we push those solutions to support new applications or discover existing solutions aren't able to scale.

The anatomy of the brain and patterns of connectivity linking its major functional areas provide a structural account that derives from and informs function. However, functional analyses relating to human cognition require technologies that are able to record neural activity or its correlates aligned with relevant behavioral features. Non-human model systems are often employed when invasive technology is required.

On the one hand, optogenetics, two-photon microscopy and conventional electrophysiology are able to record from and modify the electrical activity of tens to thousands of neurons at the resolution of a few microns. While this represents progress, the coverage is inadequate for many studies, and the methods are, for the most part, limited to non-human subjects due to the invasive nature of their practical application [144, 69, 538, 534].

Conversely, fMRI can be used to study awake, behaving humans performing a wide range of cognitive tasks, but relies on signals that are at best indirectly related to neural activity as in the case of blood oxygen levels, and that are currently limited to spatial resolutions on the order of tens of millimeters and temporal resolutions on the order of hundreds of milliseconds [195, 196, 78].

Moreover the electrical activity of individual neurons is but one marker for neural function. Other pathways including diffuse signaling by way of chemical neuromodulation and genetic activity and protein transport at the cell level are becoming increasingly important as markers for behavior at multiple time scales [521]. Despite these limitations, neuroscientists have made considerable progress by combining information from different model systems using multiple recording technologies.

Miscellaneous Loose Ends: Listen to Raia Hadsell at 00:50:00 into this interview in which she explains two techniques her team at DeepMind developed to solve complex sequences of tasks. She advocates an approach to lifelong learning that shares some features of the approach we explored in the Spring. Kirkpatrick et al [277] introduce elastic weight consolidation that works by differentially slowing down learning on certain weights based on how important they are to previously seen tasks, and then in [432], Rusu et al present progressive networks that avoid catastrophic forgetting by instantiating a new network or column for each new task, and that support transfer learning by exploiting lateral connections to features of previously learned columns — columns in progressive networks are free to reuse, modify or ignore previously learned features via lateral connections.
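The idea behind elastic weight consolidation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from [277]: the function names, the strength parameter `lam`, the toy quadratic new-task loss and the importance values are all hypothetical. In [277] the per-weight importance is estimated from the diagonal of the Fisher information, whereas here it is simply given.

```python
def ewc_penalty(weights, old_weights, importance, lam=1.0):
    """Quadratic penalty anchoring each weight to its value after the old task."""
    return 0.5 * lam * sum(
        f * (w - w_old) ** 2
        for w, w_old, f in zip(weights, old_weights, importance)
    )


def ewc_gradient_step(weights, task_grad, old_weights, importance, lam=1.0, lr=0.1):
    """One gradient step on the new-task loss plus the EWC penalty."""
    return [
        w - lr * (g + lam * f * (w - w_old))
        for w, g, w_old, f in zip(weights, task_grad, old_weights, importance)
    ]


# Toy example: two weights, the first important to the old task, the second not.
old_weights = [1.0, 1.0]
importance = [10.0, 0.0]   # hypothetical importance values
weights = list(old_weights)
for _ in range(50):
    # Hypothetical new-task loss 0.5 * w**2 pulls every weight toward 0; its gradient is w.
    task_grad = list(weights)
    weights = ewc_gradient_step(weights, task_grad, old_weights, importance)

print(weights)
```

In this toy run the important weight settles near its old value, at the balance point between the new-task pull and the penalty, while the unimportant weight is free to decay toward the new-task optimum.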

September 5, 2019

%%% Thu Sep  5 05:40:45 PDT 2019

Artificial neural networks enable distributed computations in which concepts are represented as patterns of activity in the units that comprise the network layers, and inference is carried out by propagating activation levels between layers weighted by learned connection weights. Artificial neural networks provide a type of fast, flexible computing well suited to handling ambiguity of the sort we routinely encounter in real-world environments, and, by doing so, they complement traditional symbolic computing technologies.
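As a concrete illustration of propagating activation levels between layers, here is a minimal sketch of a forward pass through a small fully connected network. The layer sizes, weight values and the choice of a logistic nonlinearity are assumptions made for the example, and the weights are written by hand rather than learned.

```python
import math


def layer_forward(activations, weights, biases):
    """Propagate activations through one fully connected layer (logistic units)."""
    outputs = []
    for w_row, b in zip(weights, biases):
        net_input = sum(w * a for w, a in zip(w_row, activations)) + b
        outputs.append(1.0 / (1.0 + math.exp(-net_input)))
    return outputs


def network_forward(inputs, layers):
    """Apply each (weights, biases) layer in turn to the input pattern."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical hand-written weights: 2 inputs -> 3 hidden units -> 1 output unit.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -0.5, 0.0]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
output = network_forward([1.0, 0.0], layers)
print(output)
```

Each unit's activity depends on the whole pattern of activity in the layer below, which is the sense in which the representation is distributed.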

Engineers frequently borrow ideas from nature and they generally find it more practical to translate these ideas into current technology rather than attempt to reproduce nature's solutions in detail. Indeed, the basic idea of artificial neural networks has been implemented multiple times using different technologies in order to approximate the connectivity patterns and signal transmission characteristics of real neural circuits while largely ignoring the physiology of real neurons in their implementation [344].

Modern artificial neural networks have come a long way in the last twenty years and many of those innovations were inspired by what we have learned from the various sub-disciplines of neuroscience. In this paper, we describe our work in designing digital assistants by drawing upon the many insights we have gathered from collaborating with cellular, systems and cognitive neuroscientists. We also survey additional ideas drawn from current research that we are exploring in an attempt to address more ambitious challenges.

The human brain is anything but a blank slate. Neither is it an homogeneous network of neurons all of which perform the same function. It is an incredible self-assembling computing device that emerges out of a collection of embryonic stem cells having the potential to become any cell in the adult brain. These cells have the ability to sense and respond to their environment. They migrate and self organize to construct complex multi-cellular machines and communicate using chemical, electrical and genetic pathways [425].

During its first few years the infant brain requires 60% of the body's metabolic budget. The number of neurons and synapses increases to nearly double that of the adult brain before undergoing pruning during adolescence. Temporary scaffolding is erected to move neurons to appropriate locations and neural processes are guided by chemotaxis to make tentative connections to other neurons. While heightened through adolescence, the same sort of activity continues throughout life and is crucial to learning and memory.

The human brain supports a wide array of learning and memory systems. Some of them we have begun to understand functionally and behaviorally, others we can only hypothesize must exist, and no doubt there are still others about which we haven't a clue. Just knowing that the brain supports a particular capability can serve as an important clue in engineering complex AI systems. Knowing how can lead to an innovative design, enhanced performance and extended competence.

Knowing how the brain implements a given capability seldom leads directly to an implementation in silicon. A biological realization might rely on diffuse signaling with neuromodulators [71], multiplexing by synchronous spiking [300], DNA methylation for memory formation [381], neurogenesis in pattern separation [3, 77], synchronized oscillations for memory consolidation during sleep [254, 392, 79], or any of the myriad mechanisms the brain employs to perform its diverse functions. Understanding the biology well enough to convert such knowledge into practical algorithms requires a collaborative effort.

It's not just these specialized mechanisms selected for their reproductive edge that are valuable to the engineer designing AI systems; the highly conserved circuitry of the vertebrate brain offers a wealth of architectural detail aligned with cognitive function. While (generally) not extending to the level of individual neurons in mammals, much of this structure is highly stereotyped across individuals [403].

This is especially evident in the primary sensorimotor cortex where the number, arrangement and topography of the areas that comprise the first stages of processing are remarkably stereotypical [516, 515]. In terms of more general localized functions, action selection [230], social behavior [94], mathematical and related analytical reasoning [18, 351] and even the signature basis for abstraction and hierarchical reasoning [39, 284] all provide useful hints about how to organize complex artificial neural network architectures.

Knowing something about how circuit activity relates to behavior also helps in thinking about suitable network architectures and the connections between subnetworks in deep networks, including recurrent connections. There is evidence to suggest that circuits occurring early in the ventral visual stream code for object-selective features and exhibit large-scale organization characterized by the high-level properties of both animacy and object size [330, 289].

It seems reasonable to expect biological organisms to learn features that serve the needs of navigation, social interaction, tool use, etc ... motivate how such features relate to the physical environment signaling opportunities to learn and act ... explain how this impacts learning in general and in the dorsal and ventral visual streams in particular ... note that most research in machine perception ignores this aspect and in so doing makes the problem both harder and less relevant to AI applications ... tie into the value of information and predictive coding ... mention that the objective of animate vision — and perception in general — is not to reconstruct the visual world we live in but rather to facilitate acting to achieve reproductive success ... think of planning in terms of goals, drives, instincts, behavioral repertoires, reward structures, affordances.

September 3, 2019

%%% Tue Sep  3 02:54:09 PDT 2019

Miscellaneous Loose Ends: Here are some of my notes for writing the first section of the paper. Attributions are scarce and excerpts are common. I've cited papers or pointed to online materials where it was convenient. These notes are supplied for students interested in working on related topics:

Multiple Memory Systems

The human brain supports a wide range of memory systems. The classic episodic versus semantic distinction introduced by Endel Tulving [494] only begins to unravel the many specialized memory systems operating on time scales from a few seconds to decades, encoding different content ranging from primarily visual to abstract memories that incorporate input from throughout the cortex, and supporting a wide range of behavior from navigation and object recognition to abstract mathematical thinking. See this excerpt from science writer Sam Kean's The Tale of the Dueling Neurosurgeons: The History of the Human Brain as Revealed by True Stories of Trauma, Madness, and Recovery [265] on lesion studies of different types of memory¹⁴⁷.

Studying these memory systems provides insight into the many specialized circuits, cell types and signaling pathways they depend on — the how part of design, as well as the behaviors they support — the what part. In this exercise, we survey some of the many ways that the cellular, systems and cognitive neurosciences are providing new clues regarding the behavioral and computational nature of memory in mammals and humans. In particular, we consider how the hippocampal-entorhinal system and the prefrontal-cortex basal-ganglia complex are influencing our thinking on episodic memory [469] and action selection [152].

Complementary Learning

Complementary Learning Systems (CLS) theory posits two learning systems: one that operates on structured knowledge representations primarily in the cortex, and another that is able to quickly take advantage of individual experiences to enable goal-dependent updating of experience statistics. The CLS model provides a perspective on the role of the hippocampus (HPC), not simply as a database for storing episodic memories but rather as working closely with the neocortex in order to rapidly integrate individual experiences into the structured representations of the neocortex by performing "goal-related manipulation of experience statistics such that the neocortex is not a slave to the statistics of its environment". See Kumaran et al [294].

The key systems covered in this entry comprise the same collection of circuits mentioned in our discussion of the hippocampal complex and the bidirectional connections that link the entorhinal cortex with the integrative neocortical association areas in the medial temporal lobe (MTL), but with a somewhat different end goal and circuit-level dynamics. The hippocampus is a treasure trove of neural function including special cell types, architectures and signaling pathways. By way of example, in this entry we explore several of the cell types and layered architectures found in dentate gyrus and adjoining area CA3.

Layered Architectures

The dentate gyrus (DG) consists of a layered architecture dominated by two cell types, granule cells, which constitute the majority of the cells in the DG, and the so-called "mossy cells." The rat DG includes approximately a million granule-cells outnumbering its 300,000 target CA3 pyramidal neurons by about 3:1 and its upstream 200,000 EC neurons by 5:1. The resulting expansion and subsequent compression is thought to produce the pattern separation necessary to avoid interference in the process of encoding patterns of neural activity corresponding to individual episodes¹⁴⁸.
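The expansion step can be illustrated with a toy model. The sketch below is an assumption-laden caricature, not a model of the rat hippocampus: it scales the quoted 200,000 EC and 1,000,000 DG cell counts down by a factor of 1000, uses random Gaussian weights, picks hypothetical activity levels (10% in EC, 2% in DG), and stands in for inhibition with a simple k-winners-take-all rule. It covers only the expansion into the DG, not the subsequent compression into CA3.

```python
import random


def k_winners(net_inputs, k):
    """Return the indices of the k most strongly driven units."""
    ranked = sorted(range(len(net_inputs)), key=lambda i: net_inputs[i], reverse=True)
    return set(ranked[:k])


def project(active_inputs, weights, k):
    """Drive the expanded layer from the active inputs and keep the top k units."""
    net = [sum(row[i] for i in active_inputs) for row in weights]
    return k_winners(net, k)


def overlap(a, b):
    """Fraction of shared active units for two equally sized active sets."""
    return len(a & b) / len(a)


rng = random.Random(0)
n_ec, n_dg = 200, 1000        # 1:5 expansion, scaled down from the rat figures
n_active_ec, k_dg = 20, 20    # hypothetical activity levels: 10% in EC, 2% in DG
weights = [[rng.gauss(0.0, 1.0) for _ in range(n_ec)] for _ in range(n_dg)]

input_overlaps, output_overlaps = [], []
for _ in range(20):
    units = rng.sample(range(n_ec), 30)
    pattern_a = set(units[:20])    # two EC patterns that share 10 of their
    pattern_b = set(units[10:30])  # 20 active units
    input_overlaps.append(overlap(pattern_a, pattern_b))
    output_overlaps.append(
        overlap(project(pattern_a, weights, k_dg), project(pattern_b, weights, k_dg))
    )

mean_in = sum(input_overlaps) / len(input_overlaps)
mean_out = sum(output_overlaps) / len(output_overlaps)
print(f"mean EC overlap {mean_in:.2f}, mean DG overlap {mean_out:.2f}")
```

In this toy setting the sparse DG codes for two similar EC patterns share a smaller fraction of active units than the EC patterns themselves, which is the sense in which expansion plus sparse activity separates patterns.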

Detonator Synapses

To perform pattern completion in CA3, the DG relies on the sparse connectivity between CA3 pyramidal neurons and granule cells, facilitated by a property of the mossy fiber synapses of DG granule cells that they share with neuromuscular junctions. These synapses are said to be detonators in that a single synapse can reliably discharge a CA3 neuron and sustain activation for tens of seconds. Dentate gyrus mossy fibers serve as conditional detonators that reliably discharge pyramidal cells in CA3¹⁴⁹.

Mossy Fiber Cells

Scharfman and Myers [440] summarize our understanding of the unique properties of hilar mossy cells in the dentate gyrus. See Kumaran et al [294] on the role of detonator synapses in sparse conjunctive coding and pattern separation. Note the layer-specific projections from the entorhinal cortex, from Layer II to the dentate gyrus and CA3 and from Layer III to CA1 and the subiculum, as well as the input it receives from the associational, perirhinal, and parahippocampal cortices¹⁵⁰. There is also an interesting discussion concerning support for both statistical and individual episode learning via associative reactivation through recurrence¹⁵¹.

Three studies, by Danielson et al (2017), GoodSmith et al (2017), and Senzai and Buzsáki (2017), distinguish the in vivo firing properties of dentate mossy cells from those of granule cells during behavior. Robust spatial remapping of mossy cells, in contrast to sparse firing of granule cells, suggests differential involvement in pattern separation. See Nakazawa [368] (C). Relating to the perforant-fiber, mossy-fiber, and Schaffer-collateral pathways, see Wiera and Mozrzymas [528] (E) on the structure and function of mossy-fiber synapses in hippocampus and Treves and Rolls [491] (F) for an analysis of computational constraints suggesting the need for two distinct input systems to the hippocampal CA3 network.

Adult Neurogenesis

In March of 2018 an article [93] (URL) in Scientific American focused on a paper [464] appearing in Nature challenging the theory that the human brain generates new neurons in the hippocampus¹⁵². This was just the latest salvo in the debate surrounding neurogenesis in adult humans. One year later, almost to the day, a second editorial (URL), also in Scientific American, described a new paper [363] making the case that neurogenesis plays a pivotal role in the adult human brain¹⁵³. The jury is still out on humans and the controversy will likely continue until we have better technology for observing developing embryonic cell populations in awake behaving humans.

It is optimistically estimated that, in adult humans, 700 new neurons are added in each hippocampus per day — corresponding to an annual turnover of 1.75% of the neurons within the renewing fraction, with a modest decline during aging — see Spalding et al [465]. Aimone et al [3, 442] posit that neurogenesis in the hippocampus supports a form of pattern integration in which immature neurons introduce a degree of similarity to memories learned at the same time and supports the formation of new memories in familiar contexts while treating novel contexts differently¹⁵⁴.
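As a back-of-the-envelope check on how the two quoted figures relate, the daily rate and the annual turnover together imply a size for the renewing pool. The arithmetic below assumes a 365-day year and simple division (ignoring cell loss); the implied pool size is derived here from the two quoted numbers and is not a figure taken from [465].

```python
new_neurons_per_day = 700   # per hippocampus, from the estimate quoted above
annual_turnover = 0.0175    # 1.75% of the renewing fraction per year

new_neurons_per_year = new_neurons_per_day * 365
# Pool size at which this many new neurons per year amounts to a 1.75% turnover.
implied_renewing_pool = new_neurons_per_year / annual_turnover

print(new_neurons_per_year)           # 255500
print(round(implied_renewing_pool))   # 14600000
```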

Rodent Models

The evidence for neurogenesis in adult rodents is much stronger, e.g., 9000 neural stem / progenitor cells (NSCs) per day in rats — see page 106 in [278]. NSCs generate significant numbers of progeny in young adult rats, with approximately 9000 new cells, or 0.1% of the granule cell population, being replaced daily. Approximately 50% of the neuronal progeny survive, and new granule cells populate the inner third of the GCL. Mice generate fewer with 4000 NSCs per day — see page 470 in [278]. In the dentate gyrus of young adult mice, approximately 4000–7000 new cells are born each day as measured by pulsing dividing cells with nucleotide analog bromodeoxyuridine (BrdU) and in young adult rats, the rate of neurogenesis is nearly 50% higher.

In mice, fewer than a third of the newborn neurons survive and are ultimately integrated into neuronal circuitry in the adult brain. Although adult neurogenesis appears to recapitulate embryonic neurodevelopment in many respects, it is unique in that adult-born neurons must incorporate into established circuitry within a functionally mature brain¹⁵⁵. A fundamental question in this field is whether these comparatively young neurons make a special contribution to information processing mediated by the local circuitry; see Kim et al [272].
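The rodent figures quoted above can be cross-checked against one another. The following sketch takes the numbers at face value (9000 new cells per day and 0.1% daily replacement in rats, 4000–7000 new cells per day in mice, fewer than a third surviving, and a rat rate "nearly 50% higher") and treats the one-third and 50% figures as upper bounds. All names are hypothetical and the derived quantities are implications of the quoted text, not independent data.

```python
# Figures quoted in the text (attributed to [278]); names are illustrative.
RAT_NEW_CELLS_PER_DAY = 9000
RAT_DAILY_REPLACEMENT = 0.001            # 0.1% of granule cells per day
MOUSE_NEW_CELLS_PER_DAY = (4000, 7000)   # low and high estimates
MOUSE_SURVIVAL_UPPER_BOUND = 1 / 3       # "less than a third" survive
RAT_EXCESS_OVER_MOUSE = 0.5              # "nearly 50% higher", used as a bound

# Granule cell population implied by 9000 cells being 0.1% of the total.
implied_rat_granule_cells = round(RAT_NEW_CELLS_PER_DAY / RAT_DAILY_REPLACEMENT)

# Upper bound on mouse neurons integrated per day, for each estimate.
mouse_survivors_upper = tuple(
    round(n * MOUSE_SURVIVAL_UPPER_BOUND) for n in MOUSE_NEW_CELLS_PER_DAY
)

# Rat rate predicted by scaling the mouse range up by 50%.
rat_range_from_mouse = tuple(
    round(n * (1 + RAT_EXCESS_OVER_MOUSE)) for n in MOUSE_NEW_CELLS_PER_DAY
)
rates_consistent = (
    rat_range_from_mouse[0] <= RAT_NEW_CELLS_PER_DAY <= rat_range_from_mouse[1]
)

print("implied rat granule cells:", implied_rat_granule_cells)
print("mouse survivors per day (upper bounds):", mouse_survivors_upper)
print("rat rate predicted from mouse range:", rat_range_from_mouse)
print("quoted rat rate falls in that range:", rates_consistent)
```

Read this way, the quoted rat rate of 9000 cells per day lies inside the 6000 to 10500 range obtained by scaling the mouse estimates, so the two sets of figures are at least mutually consistent.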

Pattern Completion

In the CA3 region of the hippocampus, basket cells can often form recurrent inhibition loops with pyramidal cells. Projections from a pyramidal cell innervate a basket cell, which in turn projects back onto the original pyramidal cell. Since basket cells are inhibitory, this generates a closed loop that can help dampen excitatory responses¹⁵⁶.
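The dampening effect of such a loop can be illustrated with a minimal two-unit firing-rate model. This is a generic textbook-style sketch, not a model taken from the cited sources: one excitatory unit stands in for a pyramidal cell, one inhibitory unit for a basket cell, and the weights, time constants, and the function name `simulate_loop` are all assumptions chosen only for illustration.

```python
def simulate_loop(feedback_weight, drive=1.0, feedforward_weight=1.0,
                  tau_e=10.0, tau_i=5.0, dt=0.1, n_steps=2000):
    """Euler-integrate a pyramidal (e) / basket (i) rate model.

    tau_e * de/dt = -e + relu(drive - feedback_weight * i)
    tau_i * di/dt = -i + relu(feedforward_weight * e)

    Times are in milliseconds; rates are in arbitrary units.
    Returns the final (e, i) rates after n_steps * dt milliseconds.
    """
    def relu(x):
        return max(0.0, x)

    e = 0.0
    i = 0.0
    for _ in range(n_steps):
        de = (-e + relu(drive - feedback_weight * i)) / tau_e
        di = (-i + relu(feedforward_weight * e)) / tau_i
        e += dt * de
        i += dt * di
    return e, i


# Open loop: the basket cell is driven but does not inhibit the pyramidal cell.
e_open, _ = simulate_loop(feedback_weight=0.0)
# Closed loop: the basket cell projects back onto the pyramidal cell.
e_closed, i_closed = simulate_loop(feedback_weight=1.0)

print(f"pyramidal rate without feedback inhibition: {e_open:.3f}")
print(f"pyramidal rate with feedback inhibition:    {e_closed:.3f}")
```

With these illustrative parameters the excitatory rate settles near the external drive when the loop is open and near drive / (1 + feedback_weight * feedforward_weight) when it is closed, which is the sense in which the loop dampens the excitatory response.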

Sharp-Wave Ripples

Ouchi et al [392] investigated the contributions of mossy cells and granule cells in controlling sharp-wave ripples (SWs) and found that mossy cells play a major role in relaying SWs backward to the dentate gyrus. This observation may help us understand how sparse codes are adjusted during memory consolidation and how mossy cells contribute to rapid replay and interleaved instance-based learning. See Kumaran et al [294] for a discussion of the bidirectional connections that link neocortical representations to the hippocampus and medial temporal lobe for storage, retrieval, and replay¹⁵⁷.

References

[1]	Mehdi Drissi, Olivia Watkins, Aditya Khant, Vivaswat Ojha, Pedro Sandoval, Rakia Segev, Eric Weiner, and Robert Keller. Program language translation using a grammar-driven tree-to-tree model. CoRR, arXiv:1807.01784, 2018.
[2]	Tyson N. Aflalo and Michael S. A. Graziano. Possible origins of the complex topographic organization of motor cortex: Reduction of a multidimensional space onto a two-dimensional array. Journal of Neuroscience, 26(23):6288--6297, 2006.
[3]	J. B. Aimone, J. Wiles, and F. H. Gage. Computational influence of adult neurogenesis on memory encoding. Neuron, 61(2):187--202, 2009.
[4]	James B. Aimone, Wei Deng, and Fred H. Gage. Resolving new memories: A critical look at the dentate gyrus, adult neurogenesis, and pattern separation. Neuron, 70(4):589--596, 2011.
[5]	Jay Alammar. The illustrated BERT and company. http://jalammar.github.io, December 2018.
[6]	Jay Alammar. The illustrated transformer. http://jalammar.github.io, June 2018.
[7]	James S. Albus. A theory of cerebellar functions. Mathematical Biology, 10:25--61, 1971.
[8]	James S. Albus. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement, and Control, 97:270--277, 1975.
[9]	James S. Albus, Anthony J. Barbera, and M. L. Fitzgerald. A control system for an automated manufacturing research facility. In Proceedings of Robots 8, pages 13--28--13--34. Robotics International of SME, 1984.
[10]	Ben Alderson-Day and Charles Fernyhough. Inner speech: Development, cognitive functions, phenomenology, and neurobiology. Psychological Bulletin, 141:931--965, 2015.
[11]	Ben Alderson-Day, Simon McCarthy-Jones, and Charles Fernyhough. Hearing voices in the resting brain: A review of intrinsic functional connectivity research on auditory verbal hallucinations. Neuroscience and Biobehavioral Reviews, 55:78--87, 2015.
[12]	Bridget Alex. How did human language evolve? Scientists still don't know. Discover, 2018.
[13]	G. E. Alexander, M. R. DeLong, and P. L. Strick. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience, 9:357--381, 1986.
[14]	William H. Alexander and Joshua W. Brown. Frontal cortex function as derived from hierarchical predictive coding. Scientific Reports, 8:3843, 2018.
[15]	William H. Alexander and Joshua W. Brown. Frontal cortex function as derived from hierarchical predictive coding. Scientific Reports, 8:3843, 2018.
[16]	A. Paul Alivisatos, Miyoung Chun, George M. Church, Karl Deisseroth, John P. Donoghue, Ralph J. Greenspan, Paul L. McEuen, Michael L. Roukes, Terrence J. Sejnowski, Paul S. Weiss, and Rafael Yuste. The brain activity map. Science, 339(6125):1284--1285, 2013.
[17]	J.M. Allman, N.A. Tetreault, A.Y. Hakeem, K.F. Manaye, K. Semendeferi, J.M. Erwin, S. Park, V. Goubert, and P.R. Hof. The von Economo neurons in fronto-insular and anterior cingulate cortex in great apes and humans. Brain Structure and Function, 214:495--517, 2010.
[18]	Marie Amalric and Stanislas Dehaene. Origins of the brain networks for advanced mathematics in expert mathematicians. Proceedings of the National Academy of Sciences, 113(18):4909--4917, 2016.
[19]	D. G. Amaral, H.E. Scharfman, and P. Lavenex. The dentate gyrus: fundamental neuroanatomical organization (dentate gyrus for dummies). Progress in Brain Research, 163:3--22, 2007.
[20]	David G Amaral, Norio Ishizuka, and Brenda Claiborne. Chapter neurons, numbers and the hippocampal network. Progress in Brain Research, 83:1--11, 1990.
[21]	John R. Anderson and Lael J. Schooler. Reflections of the environment in memory. Psychological Science, 2, 1991.
[22]	Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. CoRR, arXiv:1611.01796, 2016.
[23]	Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. CoRR, arXiv:1707.01495, 2017.
[24]	Submission Anonymous. Cognitive consequences of structured education in a connectionist model of analogical reasoning. In Proceedings of the 2020 Annual Meeting of the Cognitive Science Society, 2020.
[25]	Bernard Ans, Stéphane Rousset, Robert French, and Serban Musca. Preventing catastrophic interference in multiple-sequence learning using coupled reverberating Elman networks. Proceedings of the 24th Annual Meeting of the Cognitive Science Society, 2002.
[26]	Dirk Jan Ardesch, Lianne H. Scholtens, Longchuan Li, Todd M. Preuss, James K. Rilling, and Martijn P. van den Heuvel. Evolutionary expansion of connectivity between multimodal association areas in the human brain compared with chimpanzees. Proceedings of the National Academy of Sciences, 116(14):7101--7106, 2019.
[27]	B.R. Arenkiel. Neural Tracing Methods: Tracing Neurons and Their Connections. Springer New York, 2014.
[28]	W. Ross Ashby. An introduction to cybernetics. University paperbacks. Methuen, 1964.
[29]	Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins. Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting. CoRR, arXiv:1812.02464, 2018.
[30]	Phillip J.E. Attwell, Samuel F. Cooke, and Christopher H. Yeo. Cerebellar function in consolidation of a motor memory. Neuron, 34(6):1011--1020, 2002.
[31]	Eduard Aved'yan. The cerebellar model articulation controller (CMAC). In J. Mason and P. C. Parks, editors, Learning Systems, pages 110--119. Springer London, London, 1995.
[32]	Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. CoRR, arXiv:1610.06258, 2016.
[33]	B. J. Baars. A cognitive theory of consciousness. Cambridge University Press, New York, NY, 1988.
[34]	P. B. Badcock, K. J. Friston, M. J. D. Ramstead, A. Ploeger, and J. Hohwy. The hierarchically mechanistic mind: an evolutionary systems theory of the human brain, cognition, and behavior. Cognitive, Affective and Behavioral Neuroscience, 19(6):1319--1351, 2019.
[35]	Alan Baddeley. Modularity, mass-action and memory. The Quarterly Journal of Experimental Psychology Section A, 38(4):527--533, 1986.
[36]	Alan Baddeley. Working memory or working attention? In A. D. Baddeley and Lawrence Weiskrantz, editors, Attention: Selection, Awareness, and Control. Oxford University Press, 1993.
[37]	Alan Baddeley and Graham James Hitch. Working memory. In G.A. Bower, editor, Recent Advances in Learning and Motivation, volume 8, pages 47--90. Academic Press, 1974.
[38]	D. Badre and M. D'Esposito. Functional magnetic resonance imaging evidence for a hierarchical organization of the prefrontal cortex. Journal of Cognitive Neuroscience, 19(12):2082--2099, 2007.
[39]	D. Badre and M. J. Frank. Mechanisms of hierarchical reinforcement learning in cortico-striatal circuits II: evidence from fMRI. Cerebral Cortex, 22(3):527--536, 2012.
[40]	D. Badre and A. D. Wagner. Selection, integration, and conflict monitoring; assessing the nature and generality of prefrontal cognitive control mechanisms. Neuron, 41(3):473--487, 2004.
[41]	David Badre. Cognitive control, hierarchy, and the rostro-caudal organization of the frontal lobes. Trends in Cognitive Sciences, 12(5):193--200, 2008.
[42]	David Badre and Derek Evan Nee. Frontal cortex and the hierarchical control of behavior. Trends in Cognitive Sciences, 22(2):170--188, 2018.
[43]	Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, arXiv:1409.0473, 2014.
[44]	Renée Baillargeon. The acquisition of physical knowledge in infancy: A summary in eight lessons. In U. Goswami, editor, Blackwell Handbook of Childhood Cognitive Development, Edition: 1st, pages 46--83. Blackwell, 2002.
[45]	Bram Bakker and Jürgen Schmidhuber. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proceedings of the 8th Conference on Intelligent Autonomous Systems, pages 438--445, 2004.
[46]	H. Barbas and M. A. Garcia-Cabezas. How the prefrontal executive got its stripes. Current Opinion in Neurobiology, 40:125--134, 2016.
[47]	Horace B. Barlow. Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217--234. MIT Press, Cambridge, MA, 1961.
[48]	Horace B. Barlow. Unsupervised learning. Neural Computation, 1:295--311, 1989.
[49]	Andrew G. Barto and Richard S. Sutton. Landmark learning: An illustration of associative search. Biological Cybernetics, 42:1--8, 1981.
[50]	Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, arXiv:1806.01261, 2018.
[51]	Mark F. Bear, Barry Connors, and Michael Paradiso. Neuroscience: Exploring the Brain (Third Edition). Lippincott Williams & Wilkins, Baltimore, Maryland, 2006.
[52]	Mark F. Bear, Barry Connors, and Michael Paradiso. Neuroscience: Exploring the Brain (Fifth Edition). Lippincott Williams & Wilkins, Baltimore, Maryland, 2015.
[53]	Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, arXiv:1506.03099, 2015.
[54]	Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153--160. MIT Press, Cambridge, MA, 2007.
[55]	Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41--48, New York, NY, USA, 2009. ACM.
[56]	Raphaëlle Bertrand and Valérie Camos. The role of attention in preschoolers' working memory. Cognitive Development, 33:14--27, 2015.
[57]	J. R. Binder and R. H. Desai. The neurobiology of semantic memory. Trends in Cognitive Sciences, 15(11):527--536, 2011.
[58]	Marcel Binz and Dominik Endres. Emulating human developmental stages with Bayesian neural networks. CoRR, arXiv:1902.07579, 2019.
[59]	David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. CoRR, arXiv:1601.00670, 2017.
[60]	Andreea C. Bostan, Richard P. Dum, and Peter L. Strick. The basal ganglia communicate with the cerebellum. Proceedings of the National Academy of Sciences, 107(18):8452--8456, 2010.
[61]	Andreea C. Bostan and Peter L. Strick. The basal ganglia and the cerebellum: nodes in an integrated network. Nature Reviews Neuroscience, 19:338--350, 2018.
[62]	M. Botvinick, A. P. Jha, L. M. Bylsma, S. A. Fabian, P. E. Solomon, and K. M. Prkachin. Viewing facial expressions of pain engages cortical areas involved in the direct experience of pain. Neuroimage, 25(1):312--319, 2005.
[63]	Matthew Botvinick and James An. Goal-directed decision making in prefrontal cortex: A computational framework. Advances in Neural Information Processing Systems, 21:169--176, 2009.
[64]	Matthew Botvinick and David C. Plaut. Representing task context: proposals based on a connectionist model of action. Psychological Research, 66(4):298--311, 2002.
[65]	Matthew Botvinick and David C. Plaut. Doing without schema hierarchies: A recurrent connectionist approach to normal and impaired routine sequential action. Psychological Review, 111(2):395--429, 2004.
[66]	Matthew M. Botvinick. Multilevel structure in behaviour and in the brain: a model of Fuster's hierarchy. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 362:1615--1626, 2007.
[67]	Matthew M. Botvinick, Todd S. Braver, Deanna M. Barch, Cameron S. Carter, and Jonathan D. Cohen. Conflict monitoring and cognitive control. Psychological Review, 108(3):624--652, 2001.
[68]	Matthew M. Botvinick, Yael Niv, and Andrew G. Barto. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262--280, 2009.
[69]	Edward Boyden. A history of optogenetics: the development of tools for controlling brain circuits with light. F1000 Biology Reports, 3, 2011.
[70]	Valentino Braitenberg. Cell assemblies in the cerebral cortex. In Roland Heim and Günther Palm, editors, Theoretical Approaches to Complex Systems, pages 171--188, Berlin, Heidelberg, 1978. Springer Berlin Heidelberg.
[71]	Vladimir Brezina. Beyond the wiring diagram: signalling through complex neuromodulator networks. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 365:2363--2374, 2010.
[72]	Holly Bridge and Stuart Clare. High-resolution MRI: in vivo histology? Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 361:137--146, 2006.
[73]	Korbinian Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. Johann Ambrosius Barth Verlag, Leipzig, 1909.
[74]	Rodney A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2:14--23, 1986.
[75]	Rodney A. Brooks. Intelligence without reason. In Proceedings of the 12th International Joint Conference on Artificial Intelligence - Volume 1, pages 569--595, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc.
[76]	Neil Burgess, Eleanor A Maguire, and John O'Keefe. The human hippocampus and spatial and episodic memory. Neuron, 35(4):625--641, 2002.
[77]	N. S. Burghardt, E. H. Park, R. Hen, and A. A. Fenton. Adult-born hippocampal neurons promote cognitive flexibility in mice. Hippocampus, 22(9):1795--1808, 2012.
[78]	R.B. Buxton, K. Uludag, D.J. Dubowitz, and T.T. Liu. Modeling the hemodynamic response to brain activation. Neuroimage, 23:220--233, 2004.
[79]	György Buzsáki. Hippocampal sharp wave-ripple: A cognitive biomarker for episodic memory and planning. Hippocampus, 25:1073--1188, 2015.
[80]	Weidong Cai, Katherine Duberg, Aarthi Padmanabhan, Rachel Rehert, Travis Bradley, Victor Carrion, and Vinod Menon. Hyperdirect insula-basal-ganglia pathway and adult-like maturity of global brain responses predict inhibitory control in children. Nature Communications, 10:4798, 2019.
[81]	D. Caligiore, G. Pezzulo, G. Baldassarre, A. C. Bostan, P. L. Strick, K. Doya, R. C. Helmich, M. Dirkx, J. Houk, H. Jorntell, A. Lago-Rodriguez, J. M. Galea, R. C. Miall, T. Popa, A. Kishore, P. F. Verschure, R. Zucca, and I. Herreros. Consensus paper: Towards a systems-level view of cerebellar function: the interplay between cerebellum, basal ganglia, and cortex. Cerebellum, 16(1):203--229, 2017.
[82]	Edward M. Callaway. Transneuronal circuit tracing with neurotropic viruses. Current Opinion in Neurobiology, 18:617--623, 2008.
[83]	Matteo Carandini. From circuits to behavior: a bridge too far? Nature Neuroscience, 15:507--509, 2012.
[84]	Ilaria Carta, Christopher H. Chen, Amanda L. Schott, Schnaude Dorizan, and Kamran Khodakhah. Cerebellar modulation of the reward circuitry and social behavior. Science, 363(6424), 2019.
[85]	Claudia Casellato, Alberto Antonietti, Jesus A. Garrido, Richard R. Carrillo, Niceto R. Luque, Eduardo Ros, Alessandra Pedrocchi, and Egidio D'Angelo. Adaptive robotic control driven by a versatile spiking cerebellar network. PLOS ONE, 9(11):1--17, 2014.
[86]	Daniel Cer and Randall O'Reilly. Neural mechanisms of binding in the hippocampus and neocortex: Insights from computational models. In Hubert Zimmer, Axel Mecklinger, and Ulman Lindenberger, editors, Handbook of Binding and Memory: Perspectives from Cognitive Neuroscience. Oxford University Press, 1998.
[87]	Samit Chakrabarty and John Martin. Postnatal development of the motor representation in primary motor cortex. Journal of Neurophysiology, 84:2582--94, 2000.
[88]	Matthew Chalk, Olivier Marre, and Gašper Tkačik. Toward a unified theory of efficient, predictive, and sparse coding. Proceedings of the National Academy of Sciences, 115(1):186--191, 2018.
[89]	L. Seress. Morphological changes of the human hippocampal formation from midgestation to early childhood. In C.A. Nelson and M. Luciana, editors, Handbook of developmental cognitive neuroscience, pages 45--58. MIT Press, Cambridge, MA, 2001.
[90]	Christopher H. Chatham, Seth A. Herd, Angela M. Brant, Thomas E. Hazy, Akira Miyake, Randy O'Reilly, and Naomi P. Friedman. From an executive network to executive control: A computational model of the n-back task. Journal of Cognitive Neuroscience, 23:3598--3619, 2011.
[91]	Bo Chen, G. Polatkan, G. Sapiro, D. Blei, D. Dunson, and L. Carin. Deep learning with hierarchical convolutional factor analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1887--1901, 2013.
[92]	C. H. Chen, R. Fremont, E. E. Arteaga-Bracho, and K. Khodakhah. Short latency cerebellar modulation of the basal ganglia. Nature Neuroscience, 17(12):1767--1775, 2014.
[93]	Helen Chen. Does the adult brain really grow new neurons? Scientific American, March 7, 2018.
[94]	Patrick Chen and Weizhe Hong. Neural circuit mechanisms of social behavior. Neuron, 98(1):16--30, 2018.
[95]	X. Chen, Y. Mu, Y. Hu, A. T. Kuan, M. Nikitchenko, O. Randlett, A. B. Chen, J. P. Gavornik, H. Sompolinsky, F. Engert, and M. B. Ahrens. Brain-wide Organization of Neuronal Activity and Convergent Sensorimotor Transformations in Larval Zebrafish. Neuron, 100(4):876--890, 2018.
[96]	Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation. CoRR, arXiv:1802.03691, 2018.
[97]	Zhiyuan Chen and Bing Liu. Continual learning and catastrophic forgetting. In Lifelong Machine Learning, Second Edition, volume 12 of Synthesis Lectures on Artificial Intelligence and Machine Learning, pages 1--207. Morgan & Claypool Publishers, 2018.
[98]	Ana B. Chica, Michel Thiebaut de Schotten, Paolo Bartolomeo, and Pedro M. Paz-Alonso. White matter microstructure of attentional networks predicts attention and consciousness functional interactions. Brain Structure and Function, 223(2):653--668, 2018.
[99]	K. Cho, B. van Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, arXiv:1406.1078, 2014.
[100]	Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, arXiv:1409.1259, 2015.
[101]	Brian Christian and Tom Griffiths. Algorithms to Live By: The Computer Science of Human Decisions. HarperCollins Publishers, 2016.
[102]	Paul Cisek. Resynthesizing behavior through phylogenetic refinement. Attention, Perception & Psychophysics, 81(7):2265--2287, 2019.
[103]	Andy Clark. Busting out: predictive brains, embodied minds, and the puzzle of the evidentiary veil. Nous, 51:727--753, 2017.
[104]	Camilla N. Clark and Jason D. Warren. Music, memory and mechanisms in Alzheimer's disease. Brain, 138(8):2122--2125, 2015.
[105]	Neal J. Cohen and Howard Eichenbaum. Memory, amnesia, and the hippocampal system. The MIT Press, Cambridge, MA, US, 1993.
[106]	Roshan Cools. The costs and benefits of brain dopamine for cognitive control. Wiley Interdisciplinary Reviews Cognitive Science, 7(5):317--329, 2016.
[107]	S. M. Courtney. Attention and cognitive control as emergent properties of information representation in working memory. Cognitive, Affective and Behavioral Neuroscience, 4(4):501--516, 2004.
[108]	Nelson Cowan. What are the differences between long-term, short-term, and working memory? Progress in Brain Research, 169:323--338, 2008.
[109]	V. Cutsuridis, A. Hussain, and J.G. Taylor. Perception-Action Cycle: Models, Architectures, and Hardware. Springer Series in Cognitive and Neural Systems. Springer New York, 2011.
[110]	D. Purves. Excitatory and inhibitory postsynaptic potentials. In D. Purves, G. J. Augustine, D. Fitzpatrick, et al., editors, Neuroscience. 2nd edition. Sinauer Associates, 2001.
[111]	Dale Purves, George J. Augustine, David Fitzpatrick, William C. Hall, Anthony-Samuel LaMantia, Richard D. Mooney, Michael L. Platt, and Leonard E. White. Neuroscience, Sixth Edition. Oxford University Press, 2017.
[112]	Tobias Dantzig. Number: The Language of Science. Macmillan Company, New York, USA, 1930.
[113]	P. Dayan, G.E. Hinton, R.M. Neal, and R.S. Zemel. The Helmholtz machine. Neural Computation, 7:889--904, 1995.
[114]	Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5:613--624, 1993.
[115]	Peter Dayan and Larry F. Abbott. Theoretical Neuroscience. MIT Press, Cambridge, MA, 2001.
[116]	Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing 5, pages 271--278. Morgan Kaufmann, San Francisco, California, 1993.
[117]	Lee de Wit, Bart Machilsen, and Tom Putzeys. Predictive coding and the neural response to predictable stimuli. Journal of Neuroscience, 30:8702--8703, 2010.
[118]	Terrence W. Deacon. The Symbolic Species: The Co-evolution of Language and the Brain. W. W. Norton, 1998.
[119]	Terrence W. Deacon. Incomplete Nature: How Mind Emerged from Matter. W. W. Norton, 2012.
[120]	Thomas Dean. A computational model of the cerebral cortex. In Proceedings of AAAI-05, pages 938--943, Cambridge, Massachusetts, 2005. MIT Press.
[121]	Thomas Dean. Learning invariant features using inertial priors. Annals of Mathematics and Artificial Intelligence, 47:223--250, 2006.
[122]	Thomas Dean, Maurice Chiang, Marcus Gomez, Nate Gruver, Yousef Hindy, Michelle Lam, Peter Lu, Sophia Sanchez, Rohun Saxena, Michael Smith, Lucy Wang, and Catherine Wong. Amanuensis: The Programmer's Apprentice. CoRR, arXiv:1807.00082, 2018.
[123]	Thomas Dean, Greg S. Corrado, and Jonathon Shlens. Three controversial hypotheses concerning computation in the primate cortex. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[124]	Thomas Dean, Chaofei Fan, Francis E. Lewis, and Megumi Sano. Biological blueprints for next generation AI systems. CoRR, arXiv:1912.00421, 2019.
[125]	Thomas Dean and Michael Wellman. Planning and Control. Morgan Kaufmann Publishers, San Francisco, California, 1991.
[126]	B. Deen, H. Richardson, D. D. Dilks, A. Takahashi, B. Keil, L. L. Wald, N. Kanwisher, and R. Saxe. Organization of high-level visual cortex in human infants. Nature Communications, 8:13995, 2017.
[127]	S. Dehaene and J. P. Changeux. Reward-dependent learning in neuronal networks for planning and decision making. Progress in Brain Research, 126:217--229, 2000.
[128]	Stanislas Dehaene. The Number Sense: How the Mind Creates Mathematics. Oxford University Press, 1999.
[129]	Stanislas Dehaene. Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts. Viking Press, 2014.
[130]	Stanislas Dehaene and Elizabeth Brannon. Space, Time and Number in the Brain: Searching for the Foundations of Mathematical Thought. Elsevier Science, 2011.
[131]	Stanislas Dehaene and Jean-Pierre Changeux. A hierarchical neuronal network for planning behavior. Proceedings of the National Academy of Sciences, 94:13293--13298, 1997.
[132]	Stanislas Dehaene and Laurent Cohen. Cultural recycling of cortical maps. Neuron, 56(2):384--398, 2007.
[133]	Vyacheslav Demin and Dmitry Nekhaev. Recurrent spiking neural network learning based on a competitive maximization of neuronal activity. Frontiers in Neuroinformatics, 12:79, 2018.
[134]	Hanneke Den Ouden, Peter Kok, and Floris De Lange. How prediction errors shape perception, attention, and motivation. Frontiers in Psychology, 3:548, 2012.
[135]	Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Sailesh R, and Subhajit Roy. Program synthesis using natural language. In Proceedings of the 38th International Conference on Software Engineering, pages 345--356. Association for Computing Machinery, 2016.
[136]	M. D'Esposito, J. A. Detre, D. C. Alsop, R. K. Shin, S. Atlas, and M. Grossman. The neural basis of the central executive system of working memory. Nature, 378(6554):279--281, 1995.
[137]	Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, arXiv:1810.04805, 2018.
[138]	Nicholas K. DeWind. The number sense is an emergent property of a deep convolutional neural network trained for object recognition. bioRxiv, 2019.
[139]	Nicholas K. DeWind, Geoffrey K. Adams, Michael L. Platt, and Elizabeth M. Brannon. Modeling the approximate number system to quantify the contribution of visual stimulus features. Cognition, 142:247--265, 2015.
[140]	Nicholas K. DeWind, Joonkoo Park, Marty G. Woldorff, and Elizabeth M. Brannon. Numerical encoding in early visual cortex. Cortex, 114:76--89, 2019.
[141]	Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227--303, 2000.
[142]	C. Diuk, A. Schapiro, N. Córdova, J. Ribas-Fernandes, Y. Niv, and M. Botvinick. Divide and conquer: hierarchical reinforcement learning and task decomposition in humans. In Computational and robotic models of the hierarchical organization of behavior, pages 271--291, Berlin, Heidelberg, 2013. Springer.
[143]	Carl Doersch. Tutorial on variational autoencoders. CoRR, arXiv:1606.05908, 2017.
[144]	Daniel Dombeck and David Tank. Two-photon imaging of neural activity in awake mobile mice. In Helmchen and Konnerth, editors, Imaging in Neuroscience, pages 355--369. Cold Spring Harbor Press, 2011.
[145]	F. Donnarumma, R. Prevete, F. Chersi, and G. Pezzulo. A programmer-interpreter neural network architecture for prefrontal cognitive control. International Journal of Neural Systems, 25(6):1550017, 2015.
[146]	Francesco Donnarumma, Roberto Prevete, and Giuseppe Trautteur. Programming in the brain: a neural network theoretical framework. Connection Science, 24(2-3):71--90, 2012.
[147]	Rodney J. Douglas and Kevan A.C. Martin. Behavioral architecture of the cortical sheet. Current Biology, 22(24):R1033--R1038, 2012.
[148]	Kenji Doya. Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology, 10(6):732--739, 2000.
[149]	Marcus du Sautoy. The Music of the Primes: Searching to Solve the Greatest Mystery in Mathematics. HarperCollins, 2004.
[150]	George Dyson. Turing's Cathedral: The Origins of the Digital Universe. Pantheon Books, 2012.
[151]	David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. CoRR, arXiv:1312.4314, 2013.
[152]	Lotem Elber-Dorozko and Yonatan Loewenstein. Striatal action-value neurons reconsidered. eLife, 7:e34248, 2018.
[153]	E. M. Izhikevich and R. FitzHugh. FitzHugh-Nagumo model. Scholarpedia, 1:1349, 2006.
[154]	K. Anders Ericsson and Walter Kintsch. Long-term working memory. Psychological Review, 102(2):211--245, 1995.
[155]	E. Ersü and H. Tolle. A new concept for learning control inspired by brain theory. 9th IFAC World Congress: A Bridge Between Control Science and Technology, 17(2):1039--1044, 1984.
[156]	E. Ersü and H. Tolle. Hierarchical learning control - An approach with neuron-like associative memories. Proceedings of American Institute of Physics, pages 249--261, 1988.
[157]	Henry Markram et al. The Human Brain Project: A Report to the European Commission. The HBP-PS Consortium, Lausanne, 2012.
[158]	C. Fajardo, M.I. Escobar, E. Buriticá, G. Arteaga, J. Umbarila, M.F. Casanova, and H. Pimienta. Von Economo neurons are present in the dorsolateral (dysgranular) prefrontal cortex of humans. Neuroscience Letters, 435:215--218, 2008.
[159]	Charles Fefferman, Sanjoy K. Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. CoRR, pages 983--1049, 2013.
[160]	Charles Fernyhough. The Voices Within: The History and Science of How We Talk to Ourselves. Basic Books, 2016.
[161]	Charles Fernyhough and Simon McCarthy-Jones. Thinking aloud about mental voices. In Fiona Macpherson and Dimitris Platchias, editors, Hallucination: Philosophy and Psychology. The MIT Press, 2013.
[162]	Shai Fine, Yoram Singer, and Naftali Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32:41--62, 1998.
[163]	Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. CoRR, arXiv:1703.03400, 2017.
[164]	Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. CoRR, arXiv:1610.00696, 2016.
[165]	Jason Fischer. Naïve physics: building a mental model of how the world behaves. In D. Poeppel, G. R. Mangun, and M. S. Gazzaniga, editors, The Cognitive Neurosciences VI. The MIT Press, 2020.
[166]	Jason Fischer, John G. Mikhael, Joshua B. Tenenbaum, and Nancy Kanwisher. Functional neuroanatomy of intuitive physical inference. Proceedings of the National Academy of Sciences, 113(34):E5072--E5081, 2016.
[167]	Jason Fischer, John G. Mikhael, Joshua B. Tenenbaum, and Nancy Kanwisher. Supplementary Materials and Methods for Functional Neuroanatomy of Intuitive Physical Inference. Proceedings of the National Academy of Sciences, 113(34), 2016.
[168]	Christophe Fitamen, Agnès Blaye, and Valérie Camos. Five-year-old children's working memory can be improved when children act on a transparent goal cue. Scientific Reports, 9:15342, 2019.
[169]	W. Tecumseh Fitch. Empirical approaches to the study of language evolution. Psychonomic Bulletin & Review, 24:3--33, 2017.
[170]	Richard FitzHugh. Impulses and physiological states in theoretical models of nerve membrane. Biophysical Journal, pages 445--466, 1961.
[171]	Jerry Fodor. The Language of Thought. Harvard University Press, Cambridge, MA, 1975.
[172]	Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and cognitive architecture. Cognition, 28(1-2):3--71, 1988.
[173]	M. J. Frank and D. Badre. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits I: computational analysis. Cerebral Cortex, 22(3):509--526, 2012.
[174]	Robert French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3:128--135, 1999.
[175]	Robert M. French. Pseudo-recurrent connectionist networks: An approach to the 'sensitivity-stability' dilemma. Connection Science, 9(4):353--380, 1997.
[176]	Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11:127--138, 2010.
[177]	Karl Friston, Rick Adams, Laurent Perrinet, and Michael Breakspear. Perceptions as hypotheses: Saccades as experiments. Frontiers in Psychology, 3:151, 2012.
[178]	Karl Friston, Jean Daunizeau, and Stefan Kiebel. Reinforcement learning or active inference? PloS one, 4:e6421, 2009.
[179]	Karl Friston and Stefan Kiebel. Predictive coding under the free-energy principle. Philosophical Transactions Royal Society London B Biological Science, 364:1211--1221, 2009.
[180]	Justin Fu, Anoop Korattikara, Sergey Levine, and Sergio Guadarrama. From language to goals: Inverse reinforcement learning for vision-based instruction following. CoRR, arXiv:1902.07742, 2019.
[181]	K. Fukushima. Neocognitron: A self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193--202, 1980.
[182]	Joaquín M. Fuster. Network memory. Trends in Neurosciences, 20(10):451--459, 1997.
[183]	Joaquín M. Fuster. Upper processing stages of the perception-action cycle. Trends in Cognitive Sciences, 8(4):143--145, 2004.
[184]	Joaquín M. Fuster. Chapter 8: An Overview of Prefrontal Functions, pages 375--425. Elsevier, London, 2015.
[185]	Joaquín M. Fuster. Prefrontal Cortex, 5th Edition. Elsevier, London, 2015.
[186]	Nicole M. Gage and Bernard J. Baars. Chapter 6 - language and thought. In Nicole M. Gage and Bernard J. Baars, editors, Fundamentals of Cognitive Neuroscience (Second Edition), pages 185--214. Academic Press, San Diego, CA, 2018.
[187]	Pablo Garcia-Lopez, Virginia Garcia-Marin, and Miguel Freire. The histological slides and drawings of Cajal. Frontiers in Neuroanatomy, 4:1--16, 2010.
[188]	S. E. Gathercole, S. J. Pickering, B. Ambridge, and H. Wearing. The structure of working memory from 4 to 15 years of age. Developmental Psychology, 40(2):177--190, 2004.
[189]	Alice Geminiani, Claudia Casellato, Alberto Antonietti, Egidio D'Angelo, and Alessandra Pedrocchi. A multiple-plasticity spiking neural network embedded in a closed-loop control system to model cerebellar pathologies. International Journal of Neural Systems, 28(05):1750017, 2018.
[190]	Samuel J. Gershman. What does the free energy principle tell us about the brain? CoRR, arXiv:1901.07945, 2019.
[191]	Samuel J. Gershman and Nathaniel D. Daw. Reinforcement learning and episodic memory in humans and animals: An integrative framework. Annual Review of Psychology, 68:101--128, 2017.
[192]	Simona Ghetti and Silvia A. Bunge. Neural changes underlying the development of episodic memory during middle childhood. Developmental Cognitive Neuroscience, 2(4):381--395, 2012.
[193]	James J. Gibson. Perception of the Visual World. Houghton Mifflin, Boston, 1950.
[194]	James J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, Boston, 1979.
[195]	Jozien Goense, Yvette Bohraus, and Nikos K. Logothetis. fMRI at high spatial resolution: Implications for BOLD-models. Frontiers in Computational Neuroscience, 10:66--66, 2016.
[196]	Gary H. Glover. Overview of functional magnetic resonance imaging. Neurosurgery clinics of North America, 22:133--144, 2011.
[197]	M. A. Gluck and C. E. Myers. Hippocampal mediation of stimulus representation: a computational theory. Hippocampus, 3(4):491--516, 1993.
[198]	Patricia S. Goldman-Rakic. Development of cortical circuitry and cognitive function. Child Development, 58(3):601--622, 1987.
[199]	Patricia S. Goldman-Rakic. Topography of cognition: Parallel distributed networks in primate association cortex. Annual Review of Neuroscience, 11(1):137--156, 1988.
[200]	Aida Gómez-Robles, William D. Hopkins, Steven J. Schapiro, and Chet C. Sherwood. Relaxed genetic control of cortical organization in human brains compared with chimpanzees. Proceedings of the National Academy of Sciences, 112(48):14799--14804, 2015.
[201]	Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20--25, 1992.
[202]	Goren Gordon, David M. Kaplan, Benjamin Lankow, Daniel Ying-Jeh Little, Jason Sherwin, Benjamin A. Suter, and Lore Thaler. Toward an integrated approach to perception and action: conference report and future directions. Frontiers in Systems Neuroscience, 5:20--20, 2011.
[203]	Richard Langton Gregory, Hugh Christopher Longuet-Higgins, and N. S. Sutherland. Perceptions as hypotheses. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 290(1038):181--197, 1980.
[204]	R. W. Guillery and S. Murray Sherman. Branched thalamic afferents: What are the messages that they relay to the cortex? Brain research reviews, 66:205--219, 2011.
[205]	Caglar Gülcehre and Yoshua Bengio. Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17(8):1--32, 2016.
[206]	Caglar Gülcehre, Marcin Moczulski, Francesco Visin, and Yoshua Bengio. Mollifying networks. CoRR, arXiv:1608.04980, 2016.
[207]	David Ha and Jürgen Schmidhuber. World models. CoRR, arXiv:1803.10122, 2018.
[208]	Haifeng Jin, Qingquan Song, and Xia Hu. Auto-keras: Efficient neural architecture search with network morphism. CoRR, arXiv:1806.10282, 2018.
[209]	Michael Halassa. The thalamus in cognitive control and flexibility. https://cbmm.mit.edu/video/thalamus-cognitive-control-and-flexibility-11921, 2018.
[210]	Michael M. Halassa and S. Murray Sherman. Thalamocortical circuit motifs: A general framework. Neuron, 103(5):762--770, 2019.
[211]	M.M. Halassa, Z. Chen, R.D. Wimmer, P.M. Brunetti, S. Zhao, B. Zikopoulos, F. Wang, E.N. Brown, and M.A. Wilson. State-dependent architecture of thalamic reticular subnetworks. Cell, 158(4):808--821, 2014.
[212]	N. E. Hallworth and B. H. Bland. Basal ganglia-hippocampal interactions support the role of the hippocampal formation in sensorimotor integration. Experimental Neurology, 188(2):430--443, 2004.
[213]	Jessica B Hamrick. Analogues of mental simulation and imagination in deep learning. Current Opinion in Behavioral Sciences, 29:8--16, 2019.
[214]	Jessica B. Hamrick, Kelsey R. Allen, Victor Bapst, Tina Zhu, Kevin R. McKee, Joshua B. Tenenbaum, and Peter W. Battaglia. Relational inductive bias for physical construction in humans and machines. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 1773--1779, 2018.
[215]	Yunyun Han, Justus M. Kebschull, Robert A. A. Campbell, Devon Cowan, Fabia Imhof, Anthony M. Zador, and Thomas D. Mrsic-Flogel. The logic of single-cell projections from visual cortex. Nature, 556:51--56, 2018.
[216]	Michael Hawrylycz, Jeremy A. Miller, Vilas Menon, David Feng, Tim Dolbeare, Angela L. Guillozet-Bongaarts, Anil G. Jegga, Bruce J. Aronow, Chang-Kyu Lee, Amy Bernard, Matthew F. Glasser, Donna L. Dierker, Jörg Menche, Aaron Szafer, Forrest Collman, Pascal Grange, Kenneth A. Berman, Stefan Mihalas, Zizhen Yao, Lance Stewart, Albert-László Barabási, Jay Schulkin, John Phillips, Lydia Ng, Chinh Dang, David R. Haynor, Allan Jones, David C. Van Essen, Christof Koch, and Ed Lein. Canonical genetic signatures of the adult human brain. Nature Neuroscience, 18:1832--1844, 2015.
[217]	Michael J. Hawrylycz, Ed S. Lein, Angela L. Guillozet-Bongaarts, Elaine H. Shen, Lydia Ng, Jeremy A. Miller, Louie N. van de Lagemaat, Kimberly A. Smith, Amanda Ebbert, Zackery L. Riley, Chris Abajian, Christian F. Beckmann, Amy Bernard, Darren Bertagnolli, Andrew F. Boe, Preston M. Cartagena, M. Mallar Chakravarty, Mike Chapin, Jimmy Chong, Rachel A. Dalley, Barry David Daly, Chinh Dang, Suvro Datta, Nick Dee, Tim A. Dolbeare, Vance Faber, David Feng, David R. Fowler, Jeff Goldy, Benjamin W. Gregor, Zeb Haradon, David R. Haynor, John G. Hohmann, Steve Horvath, Robert E. Howard, Andreas Jeromin, Jayson M. Jochim, Marty Kinnunen, Christopher Lau, Evan T. Lazarz, Changkyu Lee, Tracy A. Lemon, Ling Li, Yang Li, John A. Morris, Caroline C. Overly, Patrick D. Parker, Sheana E. Parry, Melissa Reding, Joshua J. Royall, Jay Schulkin, Pedro Adolfo Sequeira, Clifford R. Slaughterbeck, Simon C. Smith, Andy J. Sodt, Susan M. Sunkin, Beryl E. Swanson, Marquis P. Vawter, Derric Williams, Paul Wohnoutka, H. Ronald Zielke, Daniel H. Geschwind, Patrick R. Hof, Stephen M. Smith, Christof Koch, Seth G. N. Grant, and Allan R. Jones. An anatomically comprehensive atlas of the adult human brain transcriptome. Nature, 489:391--399, 2012.
[218]	Kenneth J. Hayworth and Adam H. Marblestone. How thalamic relays might orchestrate supervised deep training and symbolic computation in the brain. bioRxiv, 2018.
[219]	T. E. Hazy, M. J. Frank, and R. C. O'Reilly. Towards an executive without a homunculus: computational models of the prefrontal cortex/basal ganglia system. Philosophical Transactions of the Royal Society London B, Biological Science, 362(1485):1601--1613, 2007.
[220]	Donald O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley, 1949.
[221]	Mikael Henaff, William F. Whitney, and Yann LeCun. Model-based planning in discrete action spaces. CoRR, arXiv:1705.07177, 2017.
[222]	Bernhard Hengst. Hierarchical reinforcement learning. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning and Data Mining, pages 611--619. Springer US, Boston, MA, 2017.
[223]	John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. 5th Edition, Morgan Kaufmann, San Francisco, CA, 2012.
[224]	Seth Herd, Kai Krueger, Ananta Nair, Jessica Mollick, and Randall O'Reilly. Neural mechanisms of human decision-making. CoRR, arXiv:1912.07660, 2019.
[225]	G. Hickok and S.L. Small. Neurobiology of Language. Elsevier, 2015.
[226]	Gregory Hickok and David Poeppel. The cortical organization of speech processing. Nature Reviews Neuroscience, 8:393, 2007.
[227]	Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735--1780, 1997.
[228]	Douglas Hofstadter. I Am a Strange Loop. Basic Books, 2007.
[229]	K. L. Hoke, M. J. Ryan, and W. Wilczynski. Functional coupling between substantia nigra and basal ganglia homologues in amphibians. Behavioral Neuroscience, 121(6):1393--1399, 2007.
[230]	Kim L. Hoke, Eileen A. Hebets, and Daizaburo Shizuka. Neural Circuitry for Target Selection and Action Selection in Animal Behavior. Integrative and Comparative Biology, 57(4):808--819, 2017.
[231]	David Holcman and Rafael Yuste. The new nanophysiology: regulation of ionic flow in neuronal subcompartments. Nature Reviews Neuroscience, 16:685--692, 2015.
[232]	Jonathan C Horton and Daniel L Adams. The cortical column: a structure without a function. Philosophical Transactions of the Royal Society B: Biological Sciences, 360:837--862, 2005.
[233]	J. C. Houk and S. P. Wise. Distributed modular architectures linking basal ganglia, cerebellum, and cerebral cortex: their role in planning and controlling action. Cerebral Cortex, 5(2):95--110, 1995.
[234]	Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned graph networks for relational reasoning. CoRR, arXiv:1905.04405, 2019.
[235]	S. E. Hua, J. C. Houk, and F. A. Mussa-Ivaldi. Emergence of symmetric, modular, and reciprocal connections in recurrent networks with Hebbian learning. Biological Cybernetics, 81(3):211--225, 1999.
[236]	D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160:106--154, 1962.
[237]	D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195:215--243, 1968.
[238]	Masao Ito. Control of mental activities by internal models in the cerebellum. Nature Reviews Neuroscience, 9:304--313, 2008.
[239]	Masao Ito. The Cerebellum: Brain for an Implicit Self. Financial Times Press, 2012.
[240]	Masao Ito. Olivocerebellar system. In G.M. Shepherd and S. Grillner, editors, Handbook of Brain Microcircuits, pages 293--299. Oxford University Press, 2018.
[241]	L. Itti and P. Baldi. A principled approach to detecting surprising events in video. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 631--637, San Diego, CA, 2005.
[242]	L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1254--1259, 1998.
[243]	Eugene M. Izhikevich. Dynamical Systems in Neuroscience: The Geometry of Excitability and Bursting. MIT Press, Cambridge, MA, 2007.
[244]	J. Nagumo, S. Arimoto, and S. Yoshizawa. An active pulse transmission line simulating nerve axon. Proceedings of the Institute of Radio Engineers (IRE), 50:2061--2070, 1962.
[245]	Adeline Jabès, Pamela Banta Lavenex, David G. Amaral, and Pierre Lavenex. Postnatal development of the hippocampal formation: a stereological study in macaque monkeys. The Journal of comparative neurology, 519:1051--1070, 2011.
[246]	John Hughlings Jackson. Selected Writings of John Hughlings Jackson: Evolution and dissolution of the nervous system. Selected Writings of John Hughlings Jackson. Basic Books, 1958.
[247]	Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. International Conference on Learning Representations, 2017.
[248]	Michal Januszewski, Jörgen Kornfeld, Peter H Li, Art Pope, Tim Blakely, Larry Lindsey, Jeremy B Maitin-Shepard, Mike Tyka, Winfried Denk, and Viren Jain. High-precision automated reconstruction of neurons with flood-filling networks. Nature Methods, 15:605--610, 2018.
[249]	Terry L. Jernigan, William F. C. Baaré, Joan Stiles, and Kathrine Skak Madsen. Postnatal brain development: structural imaging of dynamic neurodevelopmental processes. Progress in Brain Research, 189:77--92, 2011.
[250]	David Jilk, Christian Lebiere, Randall O'Reilly, and John R. Anderson. SAL: An explicitly pluralistic cognitive architecture. Journal of Experimental and Theoretical Artificial Intelligence, 20:197--218, 2008.
[251]	P. Jissendi, S. Baudry, and D. Balériaux. Diffusion tensor imaging (DTI) and tractography of the cerebellar projections to prefrontal and posterior parietal cortices: A study at 3T. Journal of Neuroradiology, 35:42--50, 2009.
[252]	Daniel D. Johnson. Learning graphical state transitions. In International Conference on Learning Representations, 2017.
[253]	Simon R. Jones and Charles Fernyhough. Neural correlates of inner speech and auditory verbal hallucinations: A critical review and theoretical integration. Clinical Psychology Review, 27(2):140--154, 2007.
[254]	Hannah R. Joo and Loren M. Frank. The hippocampal sharp wave-ripple in memory retrieval for immediate use and consolidation. Nature Reviews Neuroscience, 19:744--757, 2018.
[255]	Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183--233, 1999.
[256]	Leslie Pack Kaelbling. Hierarchical reinforcement learning: A preliminary report. In Proceedings Tenth International Conference on Machine Learning, pages 167--173, 1993.
[257]	Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model-based reinforcement learning for Atari. CoRR, arXiv:1903.00374, 2019.
[258]	Marcus Kaiser. A tutorial in connectome analysis: Topological and spatial features of brain networks. CoRR, arXiv:1105.4705, 2011.
[259]	Axel Kammerer. Memory capacity in the hippocampus: Influence of inhibition and remapping. Technical report, Graduate School of Systemic Neurosciences der Ludwig-Maximilians-Universität München, 2015.
[260]	Efstratios Karavasilis, Foteini Christidi, Georgios Velonakis, Zoi Giavri, Nikolaos L. Kelekis, Efstathios P. Efstathopoulos, Ioannis Evdokimidis, and Georges Dellatolas. Ipsilateral and contralateral cerebro-cerebellar white matter connections: A diffusion tensor imaging study in healthy adults. Journal of Neuroradiology, 46:52--60, 2019.
[261]	Karl Friston. A free energy principle for biological systems. Entropy (Basel), 14(11):2100--2121, 2012.
[262]	Fumi Katsuki and Christos Constantinidis. Unique and shared roles of the posterior parietal and dorsolateral prefrontal cortex in cognitive functions. Frontiers in integrative neuroscience, 6:17--17, 2012.
[263]	Paul S. Katz and William N. Frost. Intrinsic neuromodulation: altering neuronal circuits from within. Trends in Neurosciences, 19:54--61, 1996.
[264]	Lydia Kavraki and Jean-Claude Latombe. Randomized preprocessing of configuration space for fast path planning. In IEEE International Conference on Robotics & Automation, pages 2138--2145, San Diego, 1994.
[265]	Sam Kean. The Tale of the Dueling Neurosurgeons: The History of the Human Brain as Revealed by True Stories of Trauma, Madness, and Recovery. Little, Brown, 2014.
[266]	G. A. Kerchner and R. A. Nicoll. Silent synapses and the emergence of a postsynaptic mechanism for LTP. Nature Reviews Neuroscience, 9(11):813--825, 2008.
[267]	Nicholas Ketz, Soheil Kolouri, and Praveen K. Pilly. Using world models for pseudo-rehearsal in continual learning. CoRR, arXiv:1903.02647, 2019.
[268]	Oussama Khatib. Real-time obstacle avoidance for manipulators and mobile robots. International Journal of Robotics Research, 5:90--99, 1986.
[269]	Stefan J. Kiebel, Jean Daunizeau, and Karl J. Friston. A hierarchy of time-scales and the brain. PLOS Computational Biology, 4(11):1--12, 2008.
[270]	Jenny Kien and Jennifer S Altman. Preparation and execution of movement: Parallels between insect and mammalian motor systems. Comparative Biochemistry and Physiology Part A: Physiology, 103(1):15--24, 1992.
[271]	Taegyo Kim, Khaldoun C. Hamade, Dmitry Todorov, William H. Barnett, Robert A. Capps, Elizaveta M. Latash, Sergey N. Markin, Ilya A. Rybak, and Yaroslav I. Molkov. Reward based motor adaptation mediated by basal ganglia. Frontiers in Computational Neuroscience, 11:19, 2017.
[272]	Woon Ryoung Kim, Kimberly Christian, Guo-Li Ming, and Hongjun Song. Time-dependent involvement of adult-born dentate granule cells in behavior. Behavioural Brain Research, 227:470--479, 2012.
[273]	Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 3581--3589, Cambridge, MA, USA, 2014. MIT Press.
[274]	Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, arXiv:1312.6114, 2013.
[275]	Kevin G. Kirby. A tutorial on Helmholtz machines. Department of Computer Science, Northern Kentucky University, 2019.
[276]	Michael Kirchhoff, Thomas Parr, Ensor Palacios, Karl Friston, and Julian Kiverstein. The Markov blankets of life: Autonomy, active inference and the free energy principle. Journal of the Royal Society, Interface, 15:20170792, 2018.
[277]	James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. CoRR, arXiv:1612.00796, 2016.
[278]	Y. Kitabatake, K. A. Sailor, G. L. Ming, and H. Song. Adult neurogenesis and hippocampal memory function: new cells, more plasticity, new memories? Neurosurgery Clinics of North America, 18(1):105--113, 2007.
[279]	Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. CoRR, arXiv:1609.01596, 2016.
[280]	Christof Koch. Project MindScope. In Frontiers in Computational Neuroscience, 33. Bernstein Conference Proceedings, 2012.
[281]	Christof Koch and Idan Segev, editors. Methods in Neuronal Modeling: From Ions to Networks. MIT Press, Cambridge, MA, USA, 1998.
[282]	E. Koechlin and C. Summerfield. An information theoretical approach to prefrontal executive function. Trends in Cognitive Sciences, 11(6):229--235, 2007.
[283]	Etienne Koechlin, Gregory Corrado, Pietro Pietrini, and Jordan Grafman. Dissociating the role of the medial and lateral anterior prefrontal cortex in human planning. Proceedings of the National Academy of Sciences, 97:7651--7656, 2000.
[284]	Etienne Koechlin and Thomas Jubault. Broca's area and the hierarchical organization of human behavior. Neuron, 50:963--974, 2006.
[285]	Etienne Koechlin, Chrystèle Ody, and Frédérique Kouneiher. The architecture of cognitive control in the human prefrontal cortex. Science, 302:1181--1185, 2003.
[286]	Adam A. Kohan, Edward A. Rietman, and Hava T. Siegelmann. Error forward-propagation: Reusing feedforward connections to propagate errors in deep learning. CoRR, arXiv:1808.03357, 2018.
[287]	Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.
[288]	H. Kolster, J. B. Mandeville, J. T. Arsenault, L. B. Ekstrom, L. L. Wald, and W. Vanduffel. Visual field map clusters in macaque extrastriate visual cortex. Journal of Neuroscience, 29(21):7031--7039, 2009.
[289]	Talia Konkle and Alfonso Caramazza. Tripartite organization of the ventral stream by animacy and object size. Journal of Neuroscience, 33(25):10235--10242, 2013.
[290]	Ivica Kostović and Miloš Judaš. Early development of neuronal circuitry of the human prefrontal cortex. In Michael S. Gazzaniga, editor, The Cognitive Neurosciences, 4th Edition, pages 29--48. The MIT Press, Cambridge, MA, 2009.
[291]	Leonard F. Koziol, Deborah Budding, Nancy Andreasen, Stefano D'Arrigo, Sara Bulgheroni, Hiroshi Imamizu, Masao Ito, Mario Manto, Cherie Marvel, Krystal Parker, Giovanni Pezzulo, Narender Ramnani, Daria Riva, Jeremy Schmahmann, Larry Vandervert, and Tadashi Yamazaki. Consensus paper: the cerebellum's role in movement and cognition. Cerebellum (London, England), 13:151--177, 2014.
[292]	Trenton Kriete, David C. Noelle, Jonathan D. Cohen, and Randall C. O'Reilly. Indirection and symbol-like processing in the prefrontal cortex and basal ganglia. Proceedings of the National Academy of Sciences, 2013.
[293]	Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems 29, pages 3675--3683, 2016.
[294]	Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512--534, 2016.
[295]	Dharshan Kumaran and Eleanor A. Maguire. The human hippocampus: Cognitive maps or relational memory? Journal of Neuroscience, 25(31):7254--7259, 2005.
[296]	Dharshan Kumaran and James L. McClelland. Generalization through the recurrent interaction of episodic memories: A model of the hippocampal system. Psychological Review, 119:573--616, 2012.
[297]	Sergiy V. Kushchayev, Vitaliy F. Moskalenko, Philip C. Wiener, Vitaliy I. Tsymbaliuk, Viktor G. Cherkasov, Irina V. Dzyavulska, Oleksander I. Kovalchuk, Volker K. H. Sonntag, Robert F. Spetzler, and Mark C. Preul. The discovery of the pyramidal neurons: Vladimir Betz and a new era of neuroscience. Brain, 135(1):285--300, 2011.
[298]	Vitaly Kuznetsov and Mehryar Mohri. Theory and algorithms for forecasting non-stationary time series: Tutorial at NIPS 2016. https://cs.nyu.edu/~mohri/talks/NIPSTutorial2016.pdf, 2016.
[299]	Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
[300]	Milad Lankarany, Dhekra Al-Basha, Stéphanie Ratté, and Steven A. Prescott. Differentially synchronized spiking enables multiplexed neural coding. Proceedings of the National Academy of Sciences, 116(20):10097--10102, 2019.
[301]	Michael Laskey, Jonathan Lee, Wesley Yu-Shu Hsieh, Richard Liaw, Jeffrey Mahler, Roy Fox, and Ken Goldberg. Iterative noise injection for scalable imitation learning. CoRR, arXiv:1703.09327, 2017.
[302]	Jean-Claude Latombe. Robot Motion Planning. Kluwer, Boston, Massachusetts, 1990.
[303]	Pierre Lavenex and David G. Amaral. Hippocampal-neocortical interaction: A hierarchy of associativity. Hippocampus, 10(4):420--430, 2000.
[304]	Miguel Lázaro-Gredilla, Dianhuan Lin, J. Swaroop Guntupalli, and Dileep George. Beyond imitation: Zero-shot task transfer on robots by learning concepts as cognitive programs. CoRR, arXiv:1812.02788, 2018.
[305]	Tai Sing Lee and David Mumford. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20:1434--1448, 2003.
[306]	T. S. Lee, D. Mumford, R. Romero, and V. Lamme. The role of primary visual cortex in higher level vision. Vision Research, 38:2429--2454, 1998.
[307]	Woogul Lee and Johnmarshall Reeve. Identifying the neural substrates of intrinsic motivation during task performance. Cognitive, Affective, & Behavioral Neuroscience, 17(5):939--953, 2017.
[308]	Carlos León. An architecture of narrative memory. Biologically Inspired Cognitive Architectures, 16:19--33, 2016.
[309]	J. Y. Lettvin, H. R. Maturana, W. S. McCulloch, and W. H. Pitts. What the frog's eye tells the frog's brain. Proceedings of the Institute of Radio Engineers, 47:1940--1951, 1959.
[310]	Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, arXiv:1805.00909, 2018.
[311]	Jonathan B. Levitt, David A. Lewis, Takashi Yoshioka, and Jennifer S. Lund. Topography of pyramidal neuron intrinsic connections in macaque monkey prefrontal cortex (areas 9 and 46). Journal of Comparative Neurology, 338(3):360--376, 1993.
[312]	David A. Lewis, Darlene S. Melchitzky, and Guillermo Gonzalez-Burgos. Specificity in the functional architecture of primate prefrontal cortex. Journal of Neurocytology, 31(3):265--276, 2002.
[313]	Jian Li, Yue Wang, Irwin King, and Michael R. Lyu. Code completion with neural attention and pointer networks. CoRR, arXiv:1711.09573, 2017.
[314]	Liam J. Drew, Stefano Fusi, and René Hen. Adult neurogenesis in the mammalian hippocampus: Why the dentate gyrus? Learning and Memory, 20:710--729, 2013.
[315]	M. V. Liarokapis, A. M. Dollar, and K. J. Kyriakopoulos. Humanlike, task-specific reaching and grasping with redundant arms and low-complexity hands. In International Conference on Advanced Robotics, pages 490--497, 2015.
[316]	Philip Lieberman. Human language and our reptilian brain: The subcortical bases for speech, syntax and thought. Harvard University Press, Cambridge, MA, 2002.
[317]	Philip Lieberman. On the nature and evolution of the neural bases of human language. American Journal of Physical Anthropology, 119:36--62, December 2002.
[318]	Philip Lieberman. Toward an Evolutionary Biology of Language. Belknap Press of Harvard University Press, 2006.
[319]	Philip Lieberman. Language did not spring forth 100,000 years ago. PLOS Biology, 13(2):1--4, 2015.
[320]	Philip Lieberman and Robert McCarthy. The Evolution of Speech and Language, pages 1--41. Springer-Verlag, Berlin Heidelberg, 2013.
[321]	Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276, 2016.
[322]	Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3):293--321, 1992.
[323]	Daniel Ying-Jeh Little and Friedrich Tobias Sommer. Learning and exploration in action-perception loops. Frontiers in Neural Circuits, 7, 2013.
[324]	Michael Littman, Thomas Dean, and Leslie Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pages 394--402, San Francisco, California, 1995. AUAI, Morgan Kaufmann Publishers.
[325]	Chang Liu, Xin Wang, Richard Shin, Joseph E. Gonzalez, and Dawn Song. Neural code completion. Submitted to ICLR 2017, 2017.
[326]	Zinan Liu, Arne Hitzmann, Shuhei Ikemoto, Svenja Stark, Jan Peters, and Koh Hosoda. Local online motor babbling: Learning motor abundance of a musculoskeletal robot arm. CoRR, arXiv:1906.09013, 2019.
[327]	Rodolfo R. Llinás. Olivocerebellar system. In G.M. Shepherd and S. Grillner, editors, Handbook of Brain Microcircuits, pages 301--308. Oxford University Press, 2018.
[328]	Samuel N. Lockhart, Adriane B. V. Mayda, Alexandra E. Roach, Evan Fletcher, Owen Carmichael, Pauline Maillard, Christopher G. Schwarz, Andrew P. Yonelinas, Charan Ranganath, and Charles Decarli. Episodic memory function is associated with multiple measures of white matter integrity in cognitive aging. Frontiers in human neuroscience, 6:56--56, 2012.
[329]	Michael London and Michael Häusser. Dendritic computation. Annual Review of Neuroscience, 28(1):503--532, 2005.
[330]	Bria Long, Chen-Ping Yu, and Talia Konkle. Mid-level visual features underlie the high-level categorical organization of the ventral stream. Proceedings of the National Academy of Sciences, 115(38):E9015--E9024, 2018.
[331]	William Lotter, Gabriel Kreiman, and David D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. CoRR, arXiv:1605.08104, 2016.
[332]	Francesca Arese Lucini, Gino Del Ferraro, Mariano Sigman, and Hernan A. Makse. How the brain transitions from conscious to subliminal perception. CoRR, arXiv:1903.09630, 2019.
[333]	N. R. Luque, J. A. Garrido, R. R. Carrillo, S. Tolu, and E. Ros. Adaptive cerebellar spiking model embedded in the control loop: context switching and robustness against noise. International Journal of Neural Systems, 21(5):385--401, 2011.
[334]	D. J. C. MacKay. Free energy minimisation algorithm for decoding and cryptanalysis. Electronics Letters, 31(6):446--447, 1995.
[335]	Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations, 2019.
[336]	Gary Marcus. The Birth of the Mind: How a Tiny Number of Genes Creates The Complexities of Human Thought. Basic Books, 2004.
[337]	Gary Marcus, Adam Marblestone, and Thomas Dean. The atoms of neural computation. Science, 346:551--552, 2014.
[338]	Gary F. Marcus. The next decade in AI: Four steps towards robust artificial intelligence. CoRR, arXiv:2002.06177, 2020.
[339]	Scott Marek, Joshua S. Siegel, Evan M. Gordon, Ryan V. Raut, Caterina Gratton, Dillan J. Newbold, Mario Ortega, Timothy O. Laumann, Babatunde Adeyemo, Derek B. Miller, Annie Zheng, Katherine C. Lopez, Jeffrey J. Berg, Rebecca S. Coalson, Annie L. Nguyen, Donna Dierker, Andrew N. Van, Catherine R. Hoyt, Kathleen B. McDermott, Scott A. Norris, Joshua S. Shimony, Abraham Z. Snyder, Steven M. Nelson, Deanna M. Barch, Bradley L. Schlaggar, Marcus E. Raichle, Steven E. Petersen, Deanna J. Greene, and Nico U.F. Dosenbach. Spatial and temporal organization of the individual human cerebellum. Neuron, 2018.
[340]	D. Marr and Giles Skey Brindley. Simple memory: a theory for archicortex. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 262(841):23--81, 1971.
[341]	David Marr. A theory of cerebellar cortex. Journal of Physiology, 202:437--470, 1969.
[342]	Hideaki Matsui. Dopamine system, cerebellum, and nucleus ruber in fish and mammals. Development, Growth & Differentiation, 59(4):219--227, 2017.
[343]	J. L. McClelland and N. H. Goddard. Considerations arising from a complementary learning systems perspective on hippocampus and neocortex. Hippocampus, 6(6):654--665, 1996.
[344]	James L. McClelland and Axel Cleeremans. Connectionist models. In T. Byrne, A. Cleeremans, and P. Wilken, editors, Oxford Companion to Consciousness. New York: Oxford University Press, 2009.
[345]	James L. McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, and Hinrich Schütze. Extending machine language models toward human-level language understanding. CoRR, arXiv:1912.05877, 2019.
[346]	James L. McClelland, Bruce L. McNaughton, and Andrew K. Lampinen. Integration of new information in memory: New insights from a complementary learning systems perspective. Submitted to Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 2019.
[347]	James L. McClelland, Bruce L. McNaughton, and Randall C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102:419--457, 1995.
[348]	James L. McClelland, Bruce L. McNaughton, and Randall C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419--457, 1995.
[349]	W. S. McCulloch and W. H. Pitts. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115--133, 1943.
[350]	Carver Mead. Neural hardware for vision. Engineering & Science, 1:2--7, 1987.
[351]	Vinod Menon. Memory and cognitive control circuits in mathematical cognition and learning. Progress in Brain Research, 227:159--186, 2016.
[352]	Josh Merel, Matt Botvinick, and Greg Wayne. Hierarchical motor control in mammals and machines. Nature Communications, 10(1):5489, 2019.
[353]	Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019.
[354]	Claudia Metzler-Baddeley, Derek K. Jones, Boubakeur Belaroussi, John P. Aggleton, and Michael J. O’Sullivan. Frontotemporal connections in episodic memory and aging: A diffusion MRI tractography study. Journal of Neuroscience, 31(37):13236--13245, 2011.
[355]	D. Michie, M. Bain, and J. Hayes-Michie. Cognitive models from subcognitive skills. In M. Grimble, S. McGhee, and P. Mowforth, editors, Knowledge-based Systems for Industrial Control, Control, Robotics & Sensors, pages 71--99. Institution of Engineering and Technology, 1990.
[356]	Shawn Mikula and Winfried Denk. High-resolution whole-brain staining for electron microscopic circuit reconstruction. Nature Methods, 2015.
[357]	George A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81--97, 1956.
[358]	Alex Mitko and Jason Fischer. When it all falls down: the relationship between intuitive physics and spatial cognition. Cognitive Research: Principles and Implications, 5:24, 2020.
[359]	Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529--533, 2015.
[360]	Kristof Van Moffaert and Ann Nowé. Multi-objective reinforcement learning using sets of pareto dominating policies. Journal of Machine Learning Research, 15:3663--3692, 2014.
[361]	Zoltán Molnár, Gavin J. Clowry, Nenad Sestan, Ayman Alzu'bi, Trygve Bakken, Robert F. Hevner, Petra S. Hüppi, Ivica Kostović, Pasko Rakic, E. S. Anton, David Edwards, Patricia Garcez, Anna Hoerder-Suabedissen, and Arnold Kriegstein. New insights into the development of the human cerebral cortex. Journal of Anatomy, 235(3):432--451, 2019.
[362]	Felipe Mora-Bermúdez, Farhath Badsha, Sabina Kanton, J. Gray Camp, Benjamin Vernot, Kathrin Köhler, Birger Voigt, Keisuke Okita, Tomislav Maricic, Zhisong He, Robert Lachmann, Svante Pääbo, Barbara Treutlein, and Wieland B. Huttner. Differences and similarities between human and chimpanzee neural progenitors during cerebral cortex development. eLife, 5:e18683, 2016.
[363]	Elena P. Moreno-Jiménez, Miguel Flor-García, Julia Terreros-Roncal, Alberto Rábano, Fabio Cafini, Noemí Pallas-Bazarra, Jesús Ávila, and María Llorens-Martín. Adult hippocampal neurogenesis is abundant in neurologically healthy subjects and drops sharply in patients with Alzheimer's disease. Nature Medicine, 25:554--560, 2019.
[364]	Vernon Mountcastle. An organizing principle for cerebral function: the unit model and the distributed system. In Gerald Edelman and Vernon Mountcastle, editors, The Mindful Brain, pages 7--50. MIT Press, Cambridge, MA, 1978.
[365]	Vernon B. Mountcastle. The columnar organization of the neocortex. Brain, 120:701--722, 1997.
[366]	S. Nagao and H. Kitazawa. Role of the cerebellum in the acquisition and consolidation of motor memory. Brain Nerve, 60(7):783--790, 2008.
[367]	M. Nakajima and M. M. Halassa. Thalamic control of functional cortical connectivity. Current Opinion in Neurobiology, 44:127--131, 2017.
[368]	Kazu Nakazawa. Dentate mossy cell and pattern separation. Neuron, 93:465--467, 2017.
[369]	Atsushi Nambu, Hironobu Tokuno, and Masahiko Takada. Functional significance of the cortico-subthalamo-pallidal 'hyperdirect' pathway. Neuroscience Research, 43:111--117, 2002.
[370]	Karthik Narasimhan, Regina Barzilay, and Tommi Jaakkola. Grounding language for transfer in deep reinforcement learning. Journal of Artificial Intelligence Research, 63:849--874, 2018.
[371]	Nandakumar S. Narayanan and Mark Laubach. Delay activity in rodent frontal cortex during a simple reaction time task. Journal of Neurophysiology, 101:2859--2871, 2009.
[372]	Khaled Nasr, Pooja Viswanathan, and Andreas Nieder. Number detectors spontaneously emerge in a deep neural network designed for visual object recognition. Science Advances, 5(5), 2019.
[373]	Radford M. Neal and Geoffrey E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. In Learning in Graphical Models, pages 355--368. Kluwer Academic Publishers, 1993.
[374]	Derek Evan Nee and John Jonides. Neural correlates of access to short-term memory. Proceedings of the National Academy of Sciences, 105(37):14228--14233, 2008.
[375]	John von Neumann. First draft of a report on the EDVAC. Technical report, Institute for Advanced Study, 1945.
[376]	Chi T. Ngo, Kylie H. Alm, Athanasia Metoki, William Hampton, Tracy Riggins, Nora S. Newcombe, and Ingrid R. Olson. White matter structural connectivity and episodic memory in early childhood. Developmental Cognitive Neuroscience, 28:41--53, 2017.
[377]	Caroline A. Niziolek, Srikantan S. Nagarajan, and John F. Houde. What does motor efference copy represent?: Evidence from speech production. The Journal of Neuroscience, 33:16110--16116, 2013.
[378]	Kenneth Norman and Randall O'Reilly. Modeling hippocampal and neocortical contributions to recognition memory: A complementary learning systems approach. Psychological Review, 110:611--646, 2003.
[379]	Kenichi Oishi, Karl Zilles, Katrin Amunts, Andreia Faria, Hangyi Jiang, Xin Li, Kazi Akhter, Kegang Hua, Roger Woods, Arthur W. Toga, G. Bruce Pike, Pedro Rosa-Neto, Alan Evans, Jiangyang Zhang, Hao Huang, Michael I. Miller, Peter C.M. van Zijl, John Mazziotta, and Susumu Mori. Human brain white matter atlas: Identification and assignment of common anatomical structures in superficial white matter. NeuroImage, 43(3):447--457, 2008.
[380]	John O'Keefe and Lynn Nadel. The hippocampus as a cognitive map. Clarendon Press, 1978.
[381]	A. M. Oliveira. DNA methylation: A permissive mark in memory formation and maintenance. Learning and Memory, 23(10):587--593, 2016.
[382]	Randall O'Reilly and Yuko Munakata. Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain. MIT Press, Cambridge, Massachusetts, 2000.
[383]	Randall C. O'Reilly, Rajan Bhattacharyya, Michael D. Howard, and Nicholas Ketz. Complementary learning systems. Cognitive Science, 38(6):1229--1248, 2014.
[384]	Randall C. O'Reilly and Michael J. Frank. Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Neural Computation, 18:283--328, 2006.
[385]	Randall C. O'Reilly, Yuko Munakata, Michael J. Frank, Thomas E. Hazy, and Contributors. Computational Cognitive Neuroscience. Wiki Book, 1st Edition, 2012.
[386]	Randall C. O'Reilly, Yuko Munakata, Michael J. Frank, Thomas E. Hazy, and Contributors. Computational Cognitive Neuroscience. Wiki Book, 3rd Edition, 2016.
[387]	Randall C. O'Reilly, Alex A. Petrov, Jonathan D. Cohen, Christian J. Lebiere, Seth A. Herd, and Trent Kriete. How limited systematicity emerges: A computational cognitive neuroscience approach. In Paco Calvo and John Symons, editors, The Architecture of Cognition, pages 191--224. MIT Press, Cambridge, Massachusetts, 2014.
[388]	Randall C. O'Reilly and R. Soto. A model of the phonological loop: Generalization and binding. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 83--90. MIT Press, 2002.
[389]	Randall C. O'Reilly, Dean R. Wyatte, and John Rohrlich. Deep predictive learning: A comprehensive model of three visual streams. CoRR, arXiv:1709.04654, 2017.
[390]	Takayuki Osa, Voot Tangkaratt, and Masashi Sugiyama. Hierarchical reinforcement learning via advantage-weighted information maximization. CoRR, arXiv:1901.01365, 2019.
[391]	Torben Ott and Andreas Nieder. Dopamine and cognitive control in prefrontal cortex. Trends in Cognitive Sciences, 23(3):213--234, 2019.
[392]	Ayako Ouchi, Nobuyoshi Matsumoto, and Yuji Ikegaya. Activation of hilar mossy cells and dentate granule cells during sharp wave/ripples. Current Research in Neuroscience, 7:1--8, 2017.
[393]	Günther Palm, Andreas Knoblauch, Florian Hauser, and Almut Schüz. Cell assemblies in the cerebral cortex. Biological Cybernetics, 108:559--572, 2014.
[394]	Eun Hye Park, Nesha S. Burghardt, Dino Dvorak, René Hen, and André A. Fenton. Experience-dependent regulation of dentate gyrus excitability by adult-born granule cells. Journal of Neuroscience, 35(33):11656--11666, 2015.
[395]	Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sébastien Racanière, David P. Reichert, Theophane Weber, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. CoRR, arXiv:1707.06170, 2017.
[396]	Alexander Pashevich, Danijar Hafner, James Davidson, Rahul Sukthankar, and Cordelia Schmid. Modulated policy hierarchies. CoRR, arXiv:1812.00025, 2018.
[397]	Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Çelikyilmaz, Sungjin Lee, and Kam-Fai Wong. Composite task-completion dialogue system via hierarchical deep reinforcement learning. CoRR, arXiv:1704.03084, 2017.
[398]	C. J. Perry and M. Fallah. Feature integration and object representations along the dorsal stream visual hierarchy. Frontiers in Computational Neuroscience, 8:84, 2014.
[399]	C. C. H. Petersen. Sensorimotor processing in the rodent barrel cortex. Nature Reviews Neuroscience, 20(9):533--546, 2019.
[400]	Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. CoRR, arXiv:1802.03268, 2018.
[401]	Nicholas Pippenger. Developments in 'the synthesis of reliable organisms from unreliable components'. In James Glimm, John Impagliazzo, and Isadore Singer, editors, Proceedings of Symposia in Pure Mathematics: The Legacy of John von Neumann, volume 50, pages 311--324. American Mathematical Society, 1990.
[402]	David Poeppel, Karen Emmorey, Gregory Hickok, and Liina Pylkkanen. Towards a new neurobiology of language. The Journal of Neuroscience, 32:14125--14131, 2012.
[403]	Ruben Portugues, Claudia E. Feierstein, Florian Engert, and Michael B. Orger. Whole-brain activity maps reveal stereotyped, distributed networks for visuomotor behavior. Neuron, 81:1328--1343, 2014.
[404]	David Premack. Human and animal cognition: continuity and discontinuity. Proceedings of the National Academy of Sciences, 104(35):13861--13867, 2007.
[405]	Alison R. Preston and Howard Eichenbaum. Interplay of hippocampus and prefrontal cortex in memory. Current Biology, 23:R764--R773, 2013.
[406]	Wolfgang Prinz. Modes of linkage between perception and action. In Wolfgang Prinz and Andries F. Sanders, editors, Cognition and Motor Processes, pages 185--193. Springer Berlin Heidelberg, Berlin, Heidelberg, 1984.
[407]	Michele L. Pucak, Jonathan B. Levitt, Jennifer S. Lund, and David A. Lewis. Patterns of intrinsic and associational circuitry in monkey prefrontal cortex. Journal of Comparative Neurology, 376(4):614--630, 1996.
[408]	F. Pulvermüller, M. Garagnani, and T. Wennekers. Thinking in circuits: toward neurobiological explanation in cognitive neuroscience. Biological Cybernetics, 108(5):573--593, 2014.
[409]	Friedemann Pulvermüller. How neurons make meaning: brain mechanisms for embodied and abstract-symbolic semantics. Trends in Cognitive Sciences, 17(9):458--470, 2013.
[410]	Friedemann Pulvermüller. Neural reuse of action perception circuits for language, concepts and communication. Progress in Neurobiology, 160:1--44, 2018.
[411]	Rodrigo Quian Quiroga. Concept cells: the building blocks of declarative memory functions. Nature Reviews Neuroscience, 13:587--597, 2012.
[412]	Neil C. Rabinowitz, Frank Perbet, H. Francis Song, Chiyuan Zhang, S.M. Ali Eslami, and Matthew Botvinick. Machine theory of mind. CoRR, arXiv:1802.07740, 2018.
[413]	Sebastien Racaniere, Andrew K. Lampinen, Adam Santoro, David P. Reichert, Vlad Firoiu, and Timothy P. Lillicrap. Automated curricula through setter-solver interactions. CoRR, arXiv:1909.12892, 2019.
[414]	Pasko Rakic, Jon Arellano, and Joshua Breunig. Development of the primate cerebral cortex. In Michael S. Gazzaniga, editor, The Cognitive Neurosciences, 4th Edition, pages 7--28. The MIT Press, Cambridge, MA, 2009.
[415]	Prajit Ramachandran and Quoc V. Le. Diversity and depth in per-example routing models. In International Conference on Learning Representations, 2019.
[416]	Rajesh P. N. Rao and Dana H. Ballard. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(5):721--763, 1996.
[417]	Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2:79--87, 1999.
[418]	F.D. Raslau, I.T. Mark, A.P. Klein, J.L. Ulmer, V. Mathews, and L.P. Mark. Memory part 2: The role of the medial temporal lobe. American Journal of Neuroradiology, 36(5):846--849, 2015.
[419]	Daniel Rasmussen, Aaron Voelker, and Chris Eliasmith. A neural model of hierarchical reinforcement learning. PLOS ONE, 12(7):1--39, 2017.
[420]	R. Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285--308, 1990.
[421]	Lucia Regolin, Rosa Rugani, Gionata Stancher, and Giorgio Vallortigara. Spontaneous discrimination of possible and impossible objects by newly hatched chicks. Biology Letters, 7(5):654--657, 2011.
[422]	R. Clay Reid. From functional architecture to functional connectomics. Neuron, 75:209--217, 2012.
[423]	J. R. Reynolds and R. C. O'Reilly. Developing PFC representations using reinforcement learning. Cognition, 113(3):281--292, 2009.
[424]	J. J. Ribas-Fernandes, A. Solway, C. Diuk, J. T. McGuire, A. G. Barto, Y. Niv, and M. M. Botvinick. A neural signature of hierarchical reinforcement learning. Neuron, 71(2):370--379, 2011.
[425]	Leonidas M. A. Richter and Julijana Gjorgjieva. Understanding neural circuit development through theory and models. bioRxiv, 2017.
[426]	Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123--146, 1995.
[427]	Edmund Rolls. Memory systems in the brain. Annual Review of Psychology, 51:599--630, 2001.
[428]	Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger. Routing networks and the challenges of modular and compositional computation. CoRR, arXiv:1904.12774, 2019.
[429]	Nicolas P. Rougier, David C. Noelle, Todd S. Braver, Jonathan D. Cohen, and Randall C. O'Reilly. Prefrontal cortex and flexible cognitive control: Rules without symbols. Proceedings of the National Academy of Sciences, 102(20):7338--7343, 2005.
[430]	Nicolas P. Rougier and Randall C. O'Reilly. Learning representations in a gated prefrontal cortex model of dynamic task switching. Cognitive Science, 26(4):503--520, 2002.
[431]	Jon W. Rueckemann and Elizabeth A. Buffalo. Auditory landscape on the cognitive map. Nature, 543:631, 2017.
[432]	Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR, arXiv:1606.04671, 2016.
[433]	Himanshu Sahni, Saurabh Kumar, Farhan Tejani, and Charles L. Isbell. Learning to compose skills. CoRR, arXiv:1711.11289, 2017.
[434]	T. Sakai, A. Mikami, M. Tomonaga, M. Matsui, J. Suzuki, Y. Hamada, M. Tanaka, T. Miyabe-Nishiwaki, H. Makishima, M. Nakatsukasa, and T. Matsuzawa. Differential prefrontal white matter development in chimpanzees and humans. Current Biology, 21(16):1397--1402, 2011.
[435]	Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin A. Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. CoRR, arXiv:1806.01242, 2018.
[436]	Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4967--4976. Curran Associates, Inc., 2017.
[437]	J. Sarnthein, H. Petsche, P. Rappelsberger, G. L. Shaw, and A. von Stein. Synchronization between prefrontal and posterior association cortex during human working memory. Proceedings of the National Academy of Sciences of the United States of America, 95:7092--7096, 1998.
[438]	David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2019.
[439]	Anna C. Schapiro, Nicholas B. Turk-Browne, Matthew M. Botvinick, and Kenneth A. Norman. Complementary learning systems within the hippocampus: a neural network modelling approach to reconciling episodic memory with statistical learning. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711):20160049, 2017.
[440]	Helen Scharfman and Catherine Myers. Hilar mossy cells of the dentate gyrus: A historical perspective. Frontiers in Neural Circuits, 6:106, 2012.
[441]	Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, arXiv:1511.05952, 2015.
[442]	Brandy Schmidt, Diano F. Marrone, and Etan J. Markus. Disambiguating the similar: The dentate gyrus and pattern separation. Behavioural Brain Research, 226(1):56--65, 2012.
[443]	Walter Schneider and Richard M. Shiffrin. Controlled and automatic human information processing: I. detection, search, and attention. Psychological Review, 84(1):1--66, 1977.
[444]	N. Schweighofer, J. Y. Lee, H. T. Goh, Y. Choi, S. S. Kim, J. C. Stewart, R. Lewthwaite, and C. J. Winstein. Mechanisms of the contextual interference effect in individuals poststroke. Journal of Neurophysiology, 106(5):2632--2641, 2011.
[445]	Nicolas Schweighofer, Eric J. Lang, and Mitsuo Kawato. Role of the olivo-cerebellar complex in motor learning and control. Frontiers in Neural Circuits, 7, 2013.
[446]	Terry Sejnowski and Miyoung Chun. 10th kavli futures symposium: Data deluge from the brain activity map. https://www.kavlifoundation.org/sites/default/files/docs/BAMHistory-Jan2013b.pdf, 2013.
[447]	Yevgeny Seldin, Gill Bejerano, and Naftali Tishby. Unsupervised segmentation and classification of mixtures of Markovian sources. In Proceedings of the 33rd Symposium on the Interface of Computing Science and Statistics, 2001.
[448]	Yevgeny Seldin, Gill Bejerano, and Naftali Tishby. Unsupervised sequence segmentation by a mixture of switching variable memory Markov sources. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 513--520, 2001.
[449]	H. Sebastian Seung. Neuroscience: Towards functional connectomics. Nature, 471:170--172, 2011.
[450]	Sebastian Seung. Connectome: How the Brain's Wiring Makes Us Who We Are. Houghton Mifflin Harcourt, Boston, 2012.
[451]	Claude Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379--423 and 623--656, 1948.
[452]	Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, arXiv:1701.06538, 2017.
[453]	Chet C. Sherwood, Francys Subiaul, and Tadeusz W. Zawidzki. A natural history of the human mind: tracing evolutionary changes in brain and cognition. Journal of Anatomy, 212:426--454, 2008.
[454]	John Shewchuk. Reinforcement Learning in Dynamical Systems With High Input and Output Dimensionality. PhD thesis, Brown University, Providence, RI, Forthcoming.
[455]	John Shewchuk and Thomas Dean. Towards learning time-varying functions with high input dimensionality. In Proceedings of the 5th IEEE International Symposium on Intelligent Control, pages 383--388. IEEE, 1990.
[456]	Richard Shin, Charles Packer, and Dawn Song. Differentiable neural network architecture search. International Conference on Learning Representations, 2018.
[457]	Daphna Shohamy and Nathaniel D. Daw. Integrating memories to guide decisions. Current Opinion in Behavioral Sciences, 5:85--90, 2015.
[458]	F. Shutoh, M. Ohki, H. Kitazawa, S. Itohara, and S. Nagao. Memory trace of motor learning shifts transsynaptically from cerebellar cortex to nuclei for consolidation. Neuroscience, 139(2):767--777, 2006.
[459]	Kyriaki Sidiropoulou, Eleftheria Kyriaki Pissadaki, and Panayiota Poirazi. Inside the brain of a neuron. EMBO Reports, 7:886--892, 2006.
[460]	Caitlin R. Siu and Kathryn M. Murphy. The development of human visual cortex and clinical implications. Eye and Brain, 10:25--36, 2018.
[461]	Jeroen B. Smaers, Aida Gómez-Robles, Ashley N. Parks, and Chet C. Sherwood. Exceptional evolutionary expansion of prefrontal cortex in great apes and humans. Current Biology, 27(5):714--720, 2017.
[462]	Richard Socher. Recursive deep learning for natural language processing and computer vision. Ph.D. Thesis. Computer Science Department, Stanford University, 2014.
[463]	Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing natural scenes and natural language with recursive neural networks. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, pages 129--136, 2011.
[464]	Shawn F. Sorrells, Mercedes F. Paredes, Arantxa Cebrian-Silla, Kadellyn Sandoval, Dashi Qi, Kevin W. Kelley, David James, Simone Mayer, Julia Chang, Kurtis I. Auguste, Edward F. Chang, Antonio J. Gutierrez, Arnold R. Kriegstein, Gary W. Mathern, Michael C. Oldham, Eric J. Huang, Jose Manuel Garcia-Verdugo, Zhengang Yang, and Arturo Alvarez-Buylla. Human hippocampal neurogenesis drops sharply in children to undetectable levels in adults. Nature, 555:377, 2018.
[465]	Kirsty L. Spalding, Olaf Bergmann, Kanar Alkass, Samuel Bernard, Mehran Salehpour, Hagen B. Huttner, Emil Boström, Isabelle Westerlund, Celine Vial, Bruce A. Buchholz, Göran Possnert, Deborah C. Mash, Henrik Druid, and Jonas Frisén. Dynamics of hippocampal neurogenesis in adult humans. Cell, 153:1219--1227, 2013.
[466]	Larry R. Squire and Stuart Zola-Morgan. Memory: brain systems and behavior. Trends in Neurosciences, 11(4):170--175, 1988.
[467]	Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. CoRR, arXiv:1804.00645, 2018.
[468]	Kimberly L. Stachenfeld, Matthew M. Botvinick, and Samuel J. Gershman. The hippocampus as a predictive map. Nature Neuroscience, 20:1643, 2017.
[469]	Bernhard P. Staresina, Thomas P. Reber, Johannes Niediek, Jan Boström, Christian E. Elger, and Florian Mormann. Recollection in the human hippocampal-entorhinal cell circuitry. Nature Communications, 10:1503, 2019.
[470]	Jeffrey R. Stevens, Thorsten Pachur, and Lael J. Schooler. Rational analysis of the adaptive and predictive nature of memory. Journal of Applied Research in Memory and Cognition, 2(4):251--253, 2013.
[471]	Georg F. Striedter and R. Glenn Northcutt. Brains Through Time: A Natural History of Vertebrates. Oxford University Press, 2020.
[472]	Anthony Strock, Xavier Hinaut, and Nicolas P. Rougier. A robust model of gated working memory. Neural Computation, 32(1):153--181, 2020.
[473]	Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. Studies in Nonlinearity. Westview Press, 2014.
[474]	Fumiaki Sugahara, Yasunori Murakami, Juan Pascual-Anaya, and Shigeru Kuratani. Reconstructing the ancestral vertebrate brain. Development, Growth & Differentiation, 59(4):163--174, 2017.
[475]	Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181--211, 1999.
[476]	Larry W. Swanson. Mapping the human brain: past, present, and future. Trends in Neurosciences, 18(11):471--474, 1995.
[477]	Lauren Swiney and Paulo Sousa. A new comparator account of auditory verbal hallucinations: how motor prediction can plausibly contribute to the sense of agency for inner speech. Frontiers in Human Neuroscience, 8:675, 2014.
[478]	Kazuyoshi Takeda, Tomiki Sumiyoshi, Madoka Matsumoto, Kou Murayama, Satoru Ikezawa, Kenji Matsumoto, and Kazuyuki Nakagome. Neural correlates for intrinsic motivational deficits of schizophrenia; implications for therapeutics of cognitive impairment. Frontiers in Psychiatry, 9:178, 2018.
[479]	Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2154--2162. Curran Associates, Inc., 2016.
[480]	Gregory Z. Tau and Bradley S. Peterson. Normal development of brain circuits. Neuropsychopharmacology, 35:147--168, 2010.
[481]	Sundeep Teki, Manon Grube, and Timothy D. Griffiths. A unified model of time perception accounts for duration-based and beat-based timing mechanisms. Frontiers in Integrative Neuroscience, 5:90, 2012.
[482]	Clive Thompson. Smarter Than You Think: How Technology is Changing Our Minds for the Better. HarperCollins Publishers, 2013.
[483]	Naftali Tishby, Fernando Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368--377, 1999.
[484]	Naftali Tishby and Daniel Polani. Information theory of decisions and actions. In V. Cutsuridis, A. Hussain, and J.G. Taylor, editors, Perception-Action Cycle: Models, Architectures, and Hardware. Springer New York, 2011.
[485]	Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. CoRR, arXiv:1503.02406, 2015.
[486]	Dmitrii I. Todorov, Robert A. Capps, William H. Barnett, Elizaveta M. Latash, Taegyo Kim, Khaldoun C. Hamade, Sergey N. Markin, Ilya A. Rybak, and Yaroslav I. Molkov. The interplay between cerebellum and basal ganglia in motor adaptation: A modeling study. PLOS ONE, 14(4):1--36, 2019.
[487]	Edward C. Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189--208, 1948.
[488]	Meropi Topalidou, Daisuke Kase, Thomas Boraud, and Nicolas P. Rougier. A computational model of dual competition between the basal ganglia and the cortex. eNeuro, 5(6), 2018.
[489]	David S. Touretzky and Geoffrey E. Hinton. Symbols among the neurons: details of a connectionist inference architecture. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pages 238--243, San Francisco, CA, USA, 1985. Morgan Kaufmann.
[490]	Katherine E. Travis, Yael Leitner, Heidi M. Feldman, and Michal Ben-Shachar. Cerebellar white matter pathways are associated with reading skills in children and adolescents. Human Brain Mapping, 36(4):1536--1553, 2015.
[491]	Alessandro Treves and Edmund T. Rolls. Computational constraints suggest the need for two distinct input systems to the hippocampal CA3 network. Hippocampus, 2(2):189--200, 1992.
[492]	J. Tsotsos, Y. Liu, J. Martinez-Trujillo, M. Pomplun, E. Simine, and K. Zhou. Attending to visual motion. Computer Vision and Image Understanding, 100:3--40, 2005.
[493]	John K. Tsotsos. A Computational Perspective on Visual Attention. MIT Press, 2011.
[494]	Endel Tulving. Elements of Episodic Memory. Oxford Psychology Series. Oxford University Press, 1985.
[495]	Barbara Tversky. Mind in Motion: How Action Shapes Thought. Basic Books, 2019.
[496]	Shimon Ullman. Visual routines. Cognition, 18:97--159, 1984.
[497]	Giorgio Vallortigara. Core knowledge of object, number, and geometry: A comparative and neural approach. Cognitive Neuropsychology, 29(1-2):213--236, 2012.
[498]	Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, arXiv:1601.06759, 2016.
[499]	Frank van der Velde. Neural blackboard architectures of combinatorial structures in cognition. Technical report, Unit of Cognitive Psychology, Leiden University, 2003.
[500]	Frank van der Velde and Marc de Kamps. The necessity of connection structures in neural models of variable binding. Cognitive Neurodynamics, 9(4):359--370, 2015.
[501]	Kurt VanLehn. Human procedural skill acquisition: Theory, model and psychological validation. In Proceedings of AAAI-83. AAAI, 1983.
[502]	Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, arXiv:1706.03762, 2017.
[503]	Cédric Villani and Clément Mouhot. Birth of a Theorem: A Mathematical Adventure. Farrar, Straus and Giroux, 2015.
[504]	Oriol Vinyals and Quoc V. Le. A neural conversational model. In ICML Deep Learning Workshop, 2015.
[505]	Matteo Visconti di Oleggio Castello, Morgan Taylor, Patrick Cavanagh, and M. Ida Gobbini. Idiosyncratic, retinotopic bias in face identification modulated by familiarity. eNeuro, 5(5), 2018.
[506]	John von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Claude E. Shannon and John McCarthy, editors, Automata Studies, pages 329--378. Princeton University Press, Princeton, NJ, 1956.
[507]	John von Neumann. The computer and the brain. The Silliman Memorial Lectures Series. Yale University Press, 1958.
[508]	John von Neumann. Theory of Self-Reproducing Automata. University of Illinois Press, Urbana, IL, 1966.
[509]	J. von Uexküll and D.L. Mackinnon. Theoretical Biology. International Library of Psychology, Philosophy and Scientific Method. K. Paul, Trench, Trubner & Company Limited, 1926.
[510]	Nicholas P Vyleta, Carolina Borges-Merjane, and Peter Jonas. Plasticity-dependent, full detonation at hippocampal mossy fiber-CA3 pyramidal neuron synapses. eLife, 5:e17977, 2016.
[511]	S. Wakana, H. Jiang, L. M. Nagae-Poetscher, P. C. van Zijl, and S. Mori. Fiber tract-based atlas of human white matter anatomy. Radiology, 230(1):77--87, 2004.
[512]	H. Wallach, H. Larochelle, A. Beygelzimer, F. d' Alché-Buc, E. Fox, and R. Garnett, editors. Code Generation as a Dual Task of Code Summarization. Curran Associates, Inc., 2019.
[513]	William Grey Walter. An imitation of life. Scientific American, 182:42--45, 1950.
[514]	Brian A. Wandell. Clarifying human white matter. Annual Review of Neuroscience, 39(1):103--128, 2016.
[515]	Brian A. Wandell, Alyssa A. Brewer, and Robert F. Dougherty. Visual field map clusters in human cortex. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 360:693--707, 2005.
[516]	Brian A. Wandell, Serge O. Dumoulin, and Alyssa A. Brewer. Visual field maps in human cortex. Neuron, 56:366--383, 2007.
[517]	Jane X. Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Demis Hassabis, and Matthew Botvinick. Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience, 21:860--868, 2018.
[518]	Jun Wang, Gregory Dam, Sule Yildirim, William Rand, Uri Wilensky, and James C. Houk. Reciprocity between the cerebellum and the cerebral cortex: Nonlinear dynamics in microscopic modules for generating voluntary motor commands. Complexity, 14(2):29--45, 2008.
[519]	Ke Wang, Rishabh Singh, and Zhendong Su. Dynamic neural program embedding for program repair. CoRR, arXiv:1711.07163, 2017.
[520]	Ke Wang, Rishabh Singh, and Zhendong Su. Dynamic neural program embedding for program repair. International Conference on Learning Representations, 2018.
[521]	Wei Wang and Guang-Zhong Wang. Understanding molecular mechanisms of the brain through transcriptomics. Frontiers in Physiology, 10:214, 2019.
[522]	Xingchao Wang, Qiong Wu, Laura Egan, Xiaosi Gu, Pinan Liu, Hong Gu, Yihong Yang, Jing Luo, Yanhong Wu, Zhixian Gao, and Jin Fan. Anterior insular cortex plays a critical role in interoceptive attention. eLife, 8:e42265, 2019.
[523]	Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-Barwinska, Jack Rae, Piotr Mirowski, Joel Z. Leibo, Adam Santoro, Mevlana Gemici, Malcolm Reynolds, Tim Harley, Josh Abramson, Shakir Mohamed, Danilo Rezende, David Saxton, Adam Cain, Chloe Hillier, David Silver, Koray Kavukcuoglu, Matt Botvinick, Demis Hassabis, and Timothy Lillicrap. Unsupervised predictive memory in a goal-directed agent. CoRR, arXiv:1803.10760, 2018.
[524]	Theophane Weber, Sébastien Racanière, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. CoRR, arXiv:1707.06203, 2017.
[525]	A. Westbrook and T. S. Braver. Dopamine does double duty in motivating cognitive effort. Neuron, 89(4):695--710, 2016.
[526]	A. Westbrook and M. Frank. Dopamine and proximity in motivation and cognitive control. Current Opinion in Behavioral Science, 22:28--34, 2018.
[527]	Steven D. Whitehead and Dana H. Ballard. Active perception and reinforcement learning. Neural Computation, 2(4):409--419, 1990.
[528]	Grzegorz Wiera and Jerzy W Mozrzymas. Extracellular proteolysis in structural and functional plasticity of mossy fiber synapses in hippocampus. Frontiers in Cellular Neuroscience, 9:427, 2015.
[529]	Romain Willemet. Understanding the evolution of mammalian brain structures; the need for a (new) cerebrotype approach. Brain Sciences, 2:203--224, 2012.
[530]	David Windridge. Emergent intentionality in perception-action subsumption hierarchies. Frontiers in Robotics and AI, 4:38, 2017.
[531]	Anthony Wright. Higher cortical functions: Association and executive processing. In Neuroscience Online: An electronic textbook for the neurosciences. The University of Texas McGovern Medical School, 1997.
[532]	Hua-Dong Xiang, Hubert M. Fonteijn, David G. Norris, and Peter Hagoort. Topographical Functional Connectivity Pattern in the Perisylvian Language Networks. Cerebral Cortex, 20(3):549--560, 2009.
[533]	Daniel L. K. Yamins, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619--8624, 2014.
[534]	O. Yizhar, L.E. Fenno, T.J. Davidson, M. Mogri, and K. Deisseroth. Optogenetics in neural systems. Neuron, 71:9--34, 2011.
[535]	Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, 2018.
[536]	Arianna Yuan and Jay McClelland. A model of structure learning, inference, and generation for scene understanding. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society. Cognitive Science Society, 2019.
[537]	Amir Roshan Zamir, Te-Lin Wu, Lin Sun, William B. Shen, Jitendra Malik, and Silvio Savarese. Feedback networks. CoRR, arXiv:1612.09508, 2016.
[538]	F. Zhang, V. Gradinaru, A.R. Adamantidis, R. Durand, R.D. Airan, L. de Lecea, and K. Deisseroth. Optogenetic interrogation of neural circuits: technology for probing mammalian brain structures. Nature Protocols, 5:439--456, 2010.
[539]	Yan Zhang, Xiaochuan Pan, Rubin Wang, and Masamichi Sakagami. Functional connectivity between prefrontal cortex and striatum estimated by phase locking value. Cognitive Neurodynamics, 10:245--254, 2016.
[540]	Zhihao Zheng, J. Scott Lauritzen, Eric Perlman, Camenzind G. Robinson, Matthew Nichols, Daniel Milkie, Omar Torrens, John Price, Corey B. Fisher, Nadiya Sharifi, Steven A. Calle-Schuler, Lucia Kmecova, Iqbal J. Ali, Bill Karsh, Eric T. Trautman, John Bogovic, Philipp Hanslovsky, Gregory S. X. E. Jefferis, Michael Kazhdan, Khaled Khairy, Stephan Saalfeld, Richard D. Fetter, and Davi D. Bock. A complete electron microscopy volume of the brain of adult Drosophila melanogaster. Cell, 174:1--14, 2018.
[541]	Xin Zhou, Dantong Zhu, Xue-Lian Qi, Sihai Li, Samson G. King, Emilio Salinas, Terrence R. Stanford, and Christos Constantinidis. Neural correlates of working memory development in adolescent primates. Nature Communications, 7:13423, 2016.
[542]	Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. CoRR, arXiv:1903.12355, 2019.
[543]	Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.

¹ The instruction set architecture (ISA) is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the execution model, processor registers, address and data formats among other things. The microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA. The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine, which may be anything from single gates and registers, to complete arithmetic logic units (ALUs) and even larger elements. These diagrams generally separate the data path (where data is placed) and the control path (which can be said to steer the data). (SOURCE)

² Merel et al. [353] note that "[a] remaining challenge on the way to truly flexible and general purpose control is to be able to sequence and generalize individual movements or "skills" in a task-directed manner. Achieving this goal requires not just the ability to acquire individual skills in the first place, but also an architecture and associated training procedure that supports representation, recruitment, and composition of a large number of skills."

³ These computational affordances work well for problems of the sort we evolved to solve and well enough to support – indeed they shaped – the social institutions that gave Homo sapiens an evolutionary advantage. They have not, however, kept up with the demands of subsequent innovations in science, technology and social engineering. We have become adept at harnessing our laboriously slow, contemplative capacity for thinking to reshape our environment. There is no reason, however, that we have to pass along these cognitive limitations to our technological progeny.

⁴ The following email exchange builds on an earlier lecture in 2018 by Randy O'Reilly and follow-on discussions with Randy and Michael Frank in January and June of 2020.

TLD: We read several of your papers, listened to your lectures from past classes, reviewed our conversation with Randy back in January and followed the citation trail to track down some of the details that we couldn't find in your work – which isn't to say they weren't there for us to find and we simply missed them. Here are some of the questions that came up in discussions with students that I couldn't answer and couldn't readily find a satisfactory answer for in the literature. A couple of YES/NO answers and one or two references to relevant papers would be most appreciated.

TLD: What's the timing of the "basic cycle" of the basal ganglia including the direct and indirect pathways? What is the average time between BG to PFC/WM transfers?

RCO: 50 msec is the minimum, corresponding to the dominant beta frequency in the BG / PFC circuits ... interestingly ACT-R came up with the same number based purely on psychological fits – see Jilk et al., 2008 [250] for more on that (referred to in attached paper).

TLD: How frequently does the pattern of sensory activity in the posterior cortex get updated in the striatum — with the default answer being "every basic cycle"?

RCO: Direct connections = continuous, though the STN may act as a bit of a "gatekeeper" such that striatum doesn't "register" the inputs until a relevant top-down PFC context is engaged.

TLD: Does each stripe-like "register" have its own GO/NOGO gating signal, thereby allowing the BG to independently set these signals and initiate transfers to working memory for just the GO-marked registers? Randy seems to have confirmed this in our January meeting but in fact he simply agreed to my interpretation of his comments and my interpretation now seems ambiguous to me.

RCO: This has been the standard PBWM model – see discussion in the [attached] chapter for more on this – not at all a settled question.

TLD: What control do the circuits in the frontal cortex have over the working memory registers that are aligned with those in the striatum – specifically, can the prefrontal and motor cortex circuits prevent the BG from writing to those registers?

RCO: Yes – PFC is major controlling input to its own "stripes"

TLD: And finally, can the BG alter representations in the sensory cortex or are its only targets working memory registers in the frontal cortex? The question originated from a student who read the following paragraph written by Michael in an article on the Dana Foundation website about Parkinson's Disease and was puzzled by the sentence in bold font:

The basal ganglia are a collection of interconnected areas deep below the cerebral cortex. They receive information from the frontal cortex about behavior that is being planned for a particular situation. In turn, the basal ganglia affect activity in the frontal cortex through a series of neural projections that ultimately go back up to the same cortical areas from which they received the initial input. This circuit enables the basal ganglia to transform and amplify the pattern of neural firing in the frontal cortex that is associated with adaptive, or appropriate, behaviors, while suppressing those that are less adaptive. The neurotransmitter dopamine plays a critical role in the basal ganglia in determining, as a result of experience, which plans are adaptive and which are not.

RCO: Yes, BG gates thalamus that projects to frontal cortex, not sensory cortex (though maybe some minor such projections in IT / MTL cortex – not sure about current status of that)

MJR: The sentence in the paragraph your student read was meant to highlight that each FC-BG circuit is a loop – the frontal cortex proposes candidate actions (motor or cognitive) and sends those to the BG which then gates their selection by amplifying the desired one and suppressing the others via disinhibition of that same cortical area that sent the proposal in the first place. This is separate from the fact that sensory (posterior) cortex also projects to both striatum and to frontal cortex, representing the state that then triggers the candidate action generation and contextualizes the striatal valuation. (Though as Randy alluded there is also some anatomical evidence that there could additionally be BG control over higher posterior cortex – my guess is that this can explain visual hallucinations for example when striatal DA is too high.)

MJR: In addition to PFC controlling its own updating in a single loop, our models and data suggest a hierarchical organization such that more anterior PFC-BG loops can provide control over more posterior FC-BG action selection, e.g., when a higher order rule needs to contextualize decision making or during task switching etc. This also facilitates generalization of abstract rules to new sensory contexts. See, for example, the modeling and imaging work with Badre and Collins.
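
The per-stripe gating interpretation raised in this exchange can be illustrated with a minimal sketch. It assumes, as in the standard PBWM reading that Randy describes as not settled, that every working memory stripe has its own GO/NOGO bit. The names gate_update, registers, candidates and go are hypothetical, and the sketch leaves out learning, the thalamic relay and top-down PFC control; it is not the PBWM implementation.

```python
import numpy as np

def gate_update(registers, candidates, go):
    """Return new stripe contents: GO stripes load their candidate, NOGO stripes hold.

    registers  -- array (n_stripes, n_units), current working memory contents
    candidates -- array (n_stripes, n_units), patterns proposed for storage
    go         -- sequence of n_stripes booleans, one gate per stripe (assumption)
    """
    go = np.asarray(go, dtype=bool)[:, None]  # broadcast each gate bit over its stripe
    return np.where(go, candidates, registers)

# Three stripes of four units each; only stripes 0 and 2 receive a GO signal.
registers = np.zeros((3, 4))
candidates = np.arange(12, dtype=float).reshape(3, 4)
go = [True, False, True]
updated = gate_update(registers, candidates, go)
```

Because each stripe is gated independently, stripe 1 keeps its previous contents while stripes 0 and 2 are overwritten in the same step.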

⁵ In addition to the SIR task, you might be interested in more challenging tasks such as the Wisconsin Card Sort Task (WCST) described in Rougier and O'Reilly [430].

⁶ Preschoolers have trouble performing tasks that require the use of working memory. Developmentally there appears to be a shift that occurs around the age of four marking a distinct improvement in memory performance on standard tests [168]. The second of the two papers listed below makes some interesting observations about the role of attention and motor activity in preschoolers' working memory which might be worth following up on [56]:

  @article{FitamenetalSCIENTIFIC-REPORTS-19,
       author = {Fitamen, Christophe and Blaye, Agn{\'{e}}s and Camos, Val{\`{e}}rie},
        title = {Five-year-old children’s working memory can be improved when children act on a transparent goal cue},
      journal = {Scientific Reports},
       volume = {9},
        issue = {1},
         year = {2019},
        pages = {15342},
     abstract = {Working memory is a key component of human cognition and its development throughout childhood a major predictor of cognitive development and school achievement. Noticeably, preschoolers exhibit poor performance in working memory tasks. The present study aimed at testing different means to improve working memory performance in preschoolers. To this aim, we tested the effect of abstract and transparent goal cues in a Brown-Peterson task performed by 4- and 5-year-old preschoolers. If the transparent goal cue helps to better maintain the instructions, it should lead to better memory performance. Moreover, preschoolers had to track, either visually or with their fingers, the goal cue during the retention interval. If the motor activity favors the active engagement of the children in the task, the finger tracking should lead to improvement in memory performance. Our findings were that 5-year-old children benefitted from a transparent goal cue when they acted on it, while 4-year-old children did not show any improvement. These results suggest that working memory performance can be improved in 5-year-old children when the task embeds elements that can scaffold the task goal.}
},
@article{BertrandandCamosCOGNITIVE-DEVELOPMENT-15,
       author = {Rapha{\"{e}}lle Bertrand and Val{\`{e}}rie Camos},
        title = {The role of attention in preschoolers’ working memory},
      journal = {Cognitive Development},
       volume = {33},
         year = {2015},
        pages = {14-27},
     abstract = {Recent studies showed that introducing an unfilled delay or a secondary task between encoding and recall reduces recall performance in preschoolers, whereas increasing the attentional demand of this secondary task does not. Based on these results, different authors drew opposing conclusions regarding the use of attention in preschoolers’ memory maintenance. This study aimed to bring further evidence to bear on the issue. In two experiments, recall was reduced when an unfilled delay was introduced before recall, but also when the demand of the concurrent task increased. These effects did not interact with age, although performance improved from 4 to 6 years of age. A third experiment ruled out an alternative interpretation. These findings show the role of attention in preschoolers’ working memory and raise the question of how motor activity may help memory maintenance.}
}

⁷ To avoid a microarchitecture version of catastrophic interference, we might learn a flexible basis for executive control in the PFC and a collection of motor primitives of the sort just mentioned and then fix the weights or use the DNC-based strategy for avoiding catastrophic forgetting in multi-task planning advocated in the models that we played around with last year [352].

⁸ A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC). FPGAs contain an array of programmable logic blocks, and a hierarchy of "reconfigurable interconnects" that allow the blocks to be "wired together", like many logic gates that can be inter-wired in different configurations. Logic blocks can be configured to perform complex combinational functions, or merely simple logic gates like AND and XOR. (SOURCE).
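
As a rough illustration of what "configuring" a logic block means, the sketch below models a block as a lookup table (LUT) over its inputs, which is one common way such blocks are built. The helper names make_lut and half_adder are hypothetical, and nothing here corresponds to a real HDL or vendor toolchain.

```python
from itertools import product

def make_lut(fn, n_inputs):
    """Configure a logic block by tabulating fn over every input combination."""
    return {bits: int(fn(*bits)) for bits in product((0, 1), repeat=n_inputs)}

# The same kind of block configured as two different gates.
and_block = make_lut(lambda a, b: a & b, 2)
xor_block = make_lut(lambda a, b: a ^ b, 2)

def half_adder(a, b):
    """Wire two configured blocks together: sum bit from XOR, carry bit from AND."""
    return xor_block[(a, b)], and_block[(a, b)]
```

Reconfiguring the device amounts to rewriting the tables and the wiring between blocks rather than changing the hardware itself.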

⁹ Apropos our earlier mention of curriculum learning, one might design exercises that specify complete solutions sans commitment to specific operators in the form of a circuit diagram that requires the learning system to correctly identify and specialize generic components by learning the weights, where possible using simple, easily-trained networks such as perceptrons. For example, I think one could construct a solution to the SIR task using a modulo 3 sequencer, three comparators, an OR gate and a couple of AND gates. You should be able to train this model pretty easily and use it for ablation studies in which we substitute MLPs for subnetworks composed of simpler components. Another strategy is to start with simpler problems and build up from there, e.g., single-pattern tasks like "1-AX" and "2-BY" and stripped-down two-pattern tasks like "1-A and 2-B". Obviously we could also combine these two strategies. I'd be interested to hear about any strategies you've developed or read about.
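
For anyone who wants a baseline to train against, here is a minimal sketch of the SIR task and a hand-wired reference solution. It assumes the usual formulation in which each step pairs a control signal (store, ignore or recall) with an item, the target on a recall step is the most recently stored item, and the target on other steps is the current item. The names make_sir_trial and solve_sir are hypothetical, and the solver is a single gated register, not the sequencer-and-comparator circuit suggested above.

```python
import random

def make_sir_trial(length, items="ABCD", seed=0):
    """Generate a list of (control, item) steps with control in {'S', 'I', 'R'}.

    Recall steps carry no item (None). The first step is always a store so
    that every recall has a well-defined target (an assumption of this sketch).
    """
    rng = random.Random(seed)
    steps = [("S", rng.choice(items))]
    for _ in range(length - 1):
        control = rng.choice("SIR")
        item = None if control == "R" else rng.choice(items)
        steps.append((control, item))
    return steps

def solve_sir(steps):
    """Reference solution: one register that is overwritten only on store steps."""
    register, outputs = None, []
    for control, item in steps:
        if control == "S":
            register = item  # gate open: overwrite the remembered item
        outputs.append(register if control == "R" else item)
    return outputs

example = [("S", "A"), ("I", "B"), ("R", None), ("S", "C"), ("I", "D"), ("R", None)]
targets = solve_sir(example)
```

The targets produced by solve_sir can serve as supervised labels when substituting learned components for the hand-wired register in ablation studies.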

¹⁰ Template meta-programming (TMP) is a meta-programming technique in which templates are used by a compiler to generate temporary source code, which is merged by the compiler with the rest of the source code and then compiled. The output of these templates includes compile-time constants, data structures, and complete functions. (SOURCE)

Another automation tool called context-aware code completion has become a common feature of more advanced integrated development environments, speeding up the process of coding applications by reducing typos and other common mistakes, and recently we have seen demonstrations of how machine learning research on neural code completion might assist in automatic programming [325, 313]. (SOURCE)

¹¹ Here is a small sample of papers on the topic of machine models of learning in the prefrontal cortex and basal ganglia. These are just the papers that I reviewed when writing the final entry in the 2020 class discussion list. You can find a good deal more commentary in entries from earlier in the year:

@article{HayworthandMarblestoneBIORXIV-18,
       author = {Hayworth, Kenneth J. and Marblestone, Adam H.},
        title = {How thalamic relays might orchestrate supervised deep training and symbolic computation in the brain},
      journal = {bioRxiv},
         year = {2018},
    publisher = {Cold Spring Harbor Laboratory},
     abstract = {The thalamus appears to be involved in the flexible routing of information among cortical areas, yet the computational implications of such routing are only beginning to be explored. Here we create a connectionist model of how selectively gated cortico-thalamo-cortical relays could underpin both symbolic and sub-symbolic computations. We first show how gateable relays can be used to create a Dynamically Partitionable Auto-Associative Network (DPAAN) (Hayworth, 2012) consisting of a set of cross-connected cortical memory buffers. All buffers and relays in a DPAAN are trained simultaneously to have a common set of stable attractor states that become the symbol vocabulary of the DPAAN. We show via simulations that such a DPAAN can support operations necessary for syntactic rule-based computation, namely buffer-to-buffer copying and equality detection. We then provide each DPAAN module with a multilayer input network trained to map sensory inputs to the DPAAN's symbol vocabulary, and demonstrate how gateable thalamic relays can provide recall and clamping operations to train this input network by Contrastive Hebbian Learning (CHL) (Xie and Seung, 2003). We suggest that many such DPAAN modules may exist at the highest levels of the brain{\textquoteright}s sensory hierarchies and show how a joint snapshot of the contents of multiple DPAAN modules can be stored as a declarative memory in a simple model of the hippocampus. We speculate that such an architecture might first have been 'discovered' by evolution as a means to bootstrap learning of more meaningful cortical representations feeding the striatum, eventually leading to a system that could support symbolic computation. Our model serves as a bridging hypothesis for linking controllable thalamo-cortical information routing with computations that could underlie aspects of both learning and symbolic reasoning in the brain.},
}
@article{OReillyandFrankNC-06,
       author = {O'Reilly, Randall C. and Frank, Michael J.},
        title = {Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia},
      journal = {Neural Computation},
       volume = 18,
        issue = 2,
         year = 2006,
        pages = {283-328},
    publisher = {MIT Press},
      address = {Cambridge, MA, USA},
     abstract = {The prefrontal cortex has long been thought to subserve both working memory (the holding of information online for processing) and executive functions (deciding how to manipulate working memory and perform processing). Although many computational models of working memory have been developed, the mechanistic basis of executive function remains elusive, often amounting to a homunculus. This article presents an attempt to deconstruct this homunculus through powerful learning mechanisms that allow a computational model of the prefrontal cortex to control both itself and other brain areas in a strategic, task-appropriate manner. These learning mechanisms are based on subcortical structures in the midbrain, basal ganglia, and amygdala, which together form an actor-critic architecture. The critic system learns which prefrontal representations are task relevant and trains the actor, which in turn provides a dynamic gating mechanism for controlling working memory updating. Computationally, the learning mechanism is designed to simultaneously solve the temporal and structural credit assignment problems. The model's performance compares favorably with standard backpropagation-based temporal learning mechanisms on the challenging 1-2-AX working memory task and other benchmark working memory tasks.}
}
@article {RougieretalPNAS-05,
       author = {Rougier, Nicolas P. and Noelle, David C. and Braver, Todd S. and Cohen, Jonathan D. and O'Reilly, Randall C.},
        title = {Prefrontal cortex and flexible cognitive control: Rules without symbols},
      journal = {Proceedings of the National Academy of Sciences},
    publisher = {National Academy of Sciences},
       volume = {102},
       number = {20},
        pages = {7338-7343},
         year = {2005},
     abstract = {Human cognitive control is uniquely flexible and has been shown to depend on prefrontal cortex (PFC). But exactly how the biological mechanisms of the PFC support flexible cognitive control remains a profound mystery. Existing theoretical models have posited powerful task-specific PFC representations, but not how these develop. We show how this can occur when a set of PFC-specific neural mechanisms interact with breadth of experience to self organize abstract rule-like PFC representations that support flexible generalization in novel tasks. The same model is shown to apply to benchmark PFC tasks (Stroop and Wisconsin card sorting), accurately simulating the behavior of neurologically intact and frontally damaged people.},
}
@article{RougierandOReillyCS-02,
       author = {Nicolas P. Rougier and Randall C. O'Reilly},
        title = {Learning Representations in a Gated Prefrontal Cortex Model of Dynamic Task Switching},
      journal = {Cognitive Science},
       volume = {26},
       number = {4},
        pages = {503-520},
         year = {2002},
     abstract = {We present a computational model of task switching based on dynamic context representations using a prefrontal cortex gating mechanism. Based solely on an external reward signal and task instructions, the model is able to search among available subtasks and stabilize itself on the correct context representation of the current subtask. Results are presented and demonstrate that in this framework activation-based memory leads to better performance than weight-based memory because it allows for rapid representation updating.},
}
@article{StrocketalNC-20,
       author = {Strock, Anthony and Hinaut, Xavier and Rougier, Nicolas P.},
        title = {A Robust Model of Gated Working Memory},
      journal = {Neural Computation},
       volume = {32},
       number = {1},
        pages = {153-181},
         year = {2020},
     abstract = {Gated working memory is defined as the capacity of holding arbitrary information at any time in order to be used at a later time. Based on electrophysiological recordings, several computational models have tackled the problem using dedicated and explicit mechanisms. We propose instead to consider an implicit mechanism based on a random recurrent neural network. We introduce a robust yet simple reservoir model of gated working memory with instantaneous updates. The model is able to store an arbitrary real value at random time over an extended period of time. The dynamics of the model is a line attractor that learns to exploit reentry and a nonlinearity during the training phase using only a few representative values. A deeper study of the model shows that there is actually a large range of hyperparameters for which the results hold (e.g., number of neurons, sparsity, global weight scaling) such that any large enough population, mixing excitatory and inhibitory neurons, can quickly learn to realize such gated working memory. In a nutshell, with a minimal set of hypotheses, we show that we can have a robust model of working memory. This suggests this property could be an implicit property of any random population, that can be acquired through learning. Furthermore, considering working memory to be a physically open but functionally closed system, we give account on some counterintuitive electrophysiological recordings.}
}
@article{TopalidouetalENEURO-18,
       author = {Topalidou, Meropi and Kase, Daisuke and Boraud, Thomas and Rougier, Nicolas P.},
        title = {A Computational Model of Dual Competition between the Basal Ganglia and the Cortex},
      journal = {eNeuro},
       volume = {5},
       number = {6},
         year = {2018},
    publisher = {Society for Neuroscience},
     abstract = {We propose a model that includes interactions between the cortex, the basal ganglia (BG), and the thalamus based on a dual competition. We hypothesize that the striatum, the subthalamic nucleus (STN), the internal globus pallidus (GPi), the thalamus, and the cortex are involved in closed feedback loops through the hyperdirect and direct pathways. These loops support a competition process that results in the ability of BG to make a cognitive decision followed by a motor one. Considering lateral cortical interactions, another competition takes place inside the cortex allowing the latter to make a cognitive and a motor decision. We show how this dual competition endows the model with two regimes. One is driven by reinforcement learning and the other by Hebbian learning. The final decision is made according to a combination of these two mechanisms with a gradual transfer from the former to the latter. We confirmed these theoretical results on primates (Macaca mulatta) using a novel paradigm predicted by the model.},
}
@inproceedings{TouretzkyandHintonIJCAI-85,
       author = {Touretzky, David S. and Hinton, Geoffrey E.},
        title = {Symbols among the neurons: details of a connectionist inference architecture},
    booktitle = {Proceedings of the 9th International Joint Conference on Artificial Intelligence},
     location = {Los Angeles, California},
    publisher = {Morgan Kaufmann},
      address = {San Francisco, CA, USA},
         year = 1985,
        pages = {238-243},
     abstract = {Pattern matching and variable binding are easily implemented in conventional computer architectures, but not necessarily in all architectures. In a distributed neural network architecture each symbol is represented by activity in many units and each unit contributes to the representation of many symbols. Manipulating symbols using this type of distributed representation is not as easy as with a local representation where each unit denotes one symbol, but there is evidence that the distributed approach is the one chosen by nature. We describe a working implementation of a production system interpreter in a neural network using distributed representations for both symbols and rules. The research provides a detailed account of two important symbolic reasoning operations, pattern matching and variable binding, as emergent properties of collections of neuron-like elements. The success of our production system implementation goes some way towards answering a common criticism of connectionist theories: that they aren't powerful enough to do symbolic reasoning.},
}

¹² The Brandenburg Concertos by Johann Sebastian Bach are a collection of six instrumental works presented by Bach to Christian Ludwig, Margrave of Brandenburg-Schwedt in 1721, though they were probably composed earlier. They are widely regarded as some of the best orchestral compositions of the Baroque era. (SOURCE)

¹³ In addition, Chakrabarty and Martin [87] report that "[t]here was an age-dependent reduction in the number of sites where microstimulation did not produce a motor effect (i.e., ineffective sites), from 95% in animals younger than 60 days to 33% between 81 and 90 days. In adults, 24% of sites were ineffective." They observed a "proximal-to-distal development of the somatotopic organization of the motor map [...] During most of the period examined, development of the caudal M1 subregion lagged that of the rostral subregion [in terms of percent of effective sites and recorded threshold currents], although these differences were minimal or absent in adults."

¹⁴ A conceptual framework for understanding how perceptual hierarchies of the sort proposed by Joaquín Fuster [182] naturally emerge from the bottom up and in so doing serve to ground the corresponding representations:

@article{WindridgeFiRAI-17,
       author = {Windridge, David},
        title = {Emergent Intentionality in Perception-Action Subsumption Hierarchies},
      journal = {Frontiers in Robotics and AI},
       volume = {4},
        pages = {38},
         year = {2017},
     abstract = {A cognitively autonomous artificial agent may be defined as one able to modify both its external world-model and the framework by which it represents the world, requiring two simultaneous optimization objectives. This presents deep epistemological issues centered on the question of how a framework for representation (as opposed to the entities it represents) may be objectively validated. In this article, formalizing previous work in this field, it is argued that subsumptive perception-action learning has the capacity to resolve these issues by (a) building the perceptual hierarchy from the bottom up so as to ground all proposed representations and (b) maintaining a bijective coupling between proposed percepts and projected action possibilities to ensure empirical falsifiability of these grounded representations. In doing so, we will show that such subsumptive perception-action learners intrinsically incorporate a model for how intentionality emerges from randomized exploratory activity in the form of "motor babbling." Moreover, such a model of intentionality also naturally translates into a model for human–computer interfacing that makes minimal assumptions as to cognitive states.}
}

¹⁵ The mutual information of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable through observing the other random variable. More precisely, let U and W be discrete random variables over a finite vocabulary V with joint distribution P_(U,W) and marginal distributions P_U and P_W then their mutual information is defined as:

I(U;W) = D_KL(P_(U,W) || P_U ⊗ P_W)

where D_KL is the Kullback-Leibler divergence. (SOURCE)
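As a minimal sketch of this definition, the following computes I(U;W) directly as the Kullback-Leibler divergence between a joint distribution and the product of its marginals. The function name `mutual_information` and the two toy distributions over the vocabulary V = {"a", "b"} are assumptions made for illustration; they do not come from the cited source.

```python
import math

def mutual_information(joint):
    """Return I(U;W) in bits for a joint distribution given as {(u, w): probability}."""
    # Marginals P_U and P_W: sum the joint over the other variable.
    p_u, p_w = {}, {}
    for (u, w), p in joint.items():
        p_u[u] = p_u.get(u, 0.0) + p
        p_w[w] = p_w.get(w, 0.0) + p
    # D_KL(P_(U,W) || P_U x P_W) = sum over (u, w) of P(u,w) * log2( P(u,w) / (P(u) P(w)) )
    return sum(p * math.log2(p / (p_u[u] * p_w[w]))
               for (u, w), p in joint.items() if p > 0.0)

# Hypothetical toy distributions over V = {"a", "b"}.
independent = {("a", "a"): 0.25, ("a", "b"): 0.25, ("b", "a"): 0.25, ("b", "b"): 0.25}
identical = {("a", "a"): 0.5, ("b", "b"): 0.5}

print(mutual_information(independent))  # 0.0: observing W says nothing about U
print(mutual_information(identical))    # 1.0: observing W fully determines U (one bit)
```

The two cases bracket the range for a pair of binary variables: independence gives zero bits, and perfect dependence gives the full entropy of either variable.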

¹⁶ Fuster [185] describes the perception-action cycle as "the circular flow of information between an organism and its environment in the course of a sensory guided sequence of actions towards a goal."

¹⁷ Karl Friston [261] makes a similar point relating to the emergent properties of coupled dynamical systems and offers a solution in the form of variational free energy that is formally equivalent to the information bottleneck method suggested by Tishby [485, 483]:

@article{FristonENTROPY-12,
       author = {Friston, Karl},
        title = {A Free Energy Principle for Biological Systems},
      journal = {Entropy (Basel)},
         year = {2012},
       volume = {14},
       number = {11},
        pages = {2100-2121},
     abstract = {This paper describes a free energy principle that tries to explain the ability of biological systems to resist a natural tendency to disorder. It appeals to circular causality of the sort found in synergetic formulations of self-organization (e.g., the slaving principle) and models of coupled dynamical systems, using nonlinear Fokker Planck equations. Here, circular causality is induced by separating the states of a random dynamical system into external and internal states, where external states are subject to random fluctuations and internal states are not. This reduces the problem to finding some (deterministic) dynamics of the internal states that ensure the system visits a limited number of external states; in other words, the measure of its (random) attracting set, or the Shannon entropy of the external states is small. We motivate a solution using a principle of least action based on variational free energy (from statistical physics) and establish the conditions under which it is formally equivalent to the information bottleneck method. This approach has proved useful in understanding the functional architecture of the brain. The generality of variational free energy minimisation and corresponding information theoretic formulations may speak to interesting applications beyond the neurosciences; e.g., in molecular or evolutionary biology.}
}

¹⁸ Tishby and Polani [484] provide the mathematical foundations for their information-theoretic model of a reinforcement-learning agent interacting with its environment. Some of the asymptotic results mentioned in this entry of the class discussion list were mentioned in Tishby's lecture at the Institute for Advanced Studies on February 24, 2020, but not in published papers as far as I was able to ascertain.

@inproceedings{TishbyandPolaniITDA-11,
title = {Information Theory of Decisions and Actions},
author = {Naftali Tishby and Daniel Polani},
booktitle = {Perception-Action Cycle: Models, Architectures, and Hardware},
editor = {Cutsuridis, V. and Hussain, A. and Taylor, J.G.},
publisher = {Springer New York},
year = {2011},
abstract = {The perception-action cycle is often defined as "the circular flow of information between an organism and its environment in the course of a sensory guided sequence of actions towards a goal" (Fuster 2001, 2006). The question we address in this paper is in what sense this "flow of information" can be described by Shannon's measures of information introduced in his mathematical theory of communication. We provide an affirmative answer to this question using an intriguing analogy between Shannon's classical model of communication and the Perception-Action-Cycle. In particular, decision and action sequences turn out to be directly analogous to codes in communication, and their complexity - the minimal number of (binary) decisions required for reaching a goal - directly bounded by information measures, as in communication. This analogy allows us to extend the standard Reinforcement Learning framework. The latter considers the future expected reward in the course of a behaviour sequence towards a goal (value-to-go). Here, we additionally incorporate a measure of information associated with this sequence: the cumulated information processing cost or bandwidth required to specify the future decision and action sequence (information-to-go). Using a graphical model, we derive a recursive Bellman optimality equation for information measures, in analogy to Reinforcement Learning; from this, we obtain new algorithms for calculating the optimal trade-off between the value-to-go and the required information-to-go, unifying the ideas behind the Bellman and the Blahut-Arimoto iterations. This trade-off between value-to-go and information-to-go provides a complete analogy with the compression-distortion trade-off in source coding. The present new formulation connects seemingly unrelated optimization problems. The algorithm is demonstrated on grid world examples.}
}
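To make the trade-off described in the abstract concrete, here is a deliberately simplified sketch. It is not the algorithm from the paper, which applies a Bellman recursion over whole decision sequences. It handles only a single decision and uses a Blahut-Arimoto-style alternation to trade expected reward against the information I(S;A) that the action carries about the state. The function name `tradeoff_policy`, the parameter `beta` and the two-state reward table are all assumptions made for illustration.

```python
import math

def tradeoff_policy(p_s, reward, beta, iters=200):
    """Blahut-Arimoto-style iteration for a one-step decision problem.

    p_s:    list of state probabilities
    reward: reward[s][a], immediate reward for action a in state s
    beta:   trade-off parameter; larger beta favors reward over information cost
    Returns (policy, expected_reward, information_in_bits).
    """
    n_s, n_a = len(p_s), len(reward[0])
    q = [1.0 / n_a] * n_a  # marginal distribution over actions, initialized uniform
    for _ in range(iters):
        # Policy update: pi(a|s) proportional to q(a) * exp(beta * reward[s][a]).
        policy = []
        for s in range(n_s):
            weights = [q[a] * math.exp(beta * reward[s][a]) for a in range(n_a)]
            z = sum(weights)
            policy.append([w / z for w in weights])
        # Marginal update: q(a) = sum over s of p(s) * pi(a|s).
        q = [sum(p_s[s] * policy[s][a] for s in range(n_s)) for a in range(n_a)]
    expected_reward = sum(p_s[s] * policy[s][a] * reward[s][a]
                          for s in range(n_s) for a in range(n_a))
    # I(S;A) in bits, computed from the final policy and its action marginal.
    info = sum(p_s[s] * policy[s][a] * math.log2(policy[s][a] / q[a])
               for s in range(n_s) for a in range(n_a) if policy[s][a] > 0.0)
    return policy, expected_reward, info

# Hypothetical problem: two equally likely states, each rewarding a different action.
p_s = [0.5, 0.5]
reward = [[1.0, 0.0], [0.0, 1.0]]

_, r_lo, i_lo = tradeoff_policy(p_s, reward, beta=0.0)
_, r_hi, i_hi = tradeoff_policy(p_s, reward, beta=10.0)
print(r_lo, i_lo)  # state-blind policy: lower reward, no information used
print(r_hi, i_hi)  # near-greedy policy: higher reward, close to one bit used
```

Sweeping `beta` between these extremes traces out a reward-versus-information curve, the single-step counterpart of the value-to-go versus information-to-go trade-off that the abstract compares to compression-distortion trade-offs in source coding.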

¹⁹ Development of the human visual system proceeds in multiple, overlapping stages: cell migration to predetermined locations, differentiation into location-appropriate cell types, formation of and competition among connections, aggregation of similar types of cells into distinct regions, and the selective elimination of connections followed by the stabilization of the 100 trillion or so that remain. These stages take place from about 5 weeks after conception onward. After about 18 months of age, no more neurons are added, and the aggregation of cell types into distinct regions is roughly complete, but the pruning of excess connections continues for years [460]. (SOURCE)

²⁰ Eric Lander was a child prodigy in mathematics and his PhD was on coding theory. He later studied applications of representation theory to coding theory. As an aside, representation theory studies abstract algebraic structures by representing their elements as linear transformations of vector spaces, a topic that, at least to my unschooled ear, sounds relevant to the study of artificial neural networks. Lander's most important work may be his development of a molecular taxonomy for cancers, which promises to increase our understanding of the molecular origins of cancers and aid in the design of more effective therapies. (SOURCE)

²¹ The Manifold Hypothesis states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space [159]. (SOURCE)

²³ Dogs have brains that are much like ours. They can construct rich representations of the world around them and encode their experience in memory so as to take advantage of what they learned in the past and apply it to their present circumstances.

They can communicate with one another and with humans in order to convey that they are hungry or would like to play or go outdoors. They can accurately interpret the signs produced by other animals and humans and exploit what they learn from the signs to interact and collaborate.

The three characteristics that distinguish our brains from theirs involve short-term memory capacity, neural signal transmission and auditory production anatomy. Roughly speaking, we have a larger cache, a faster bus and a better microphone, where the first two translate into a deeper stack and an increased capability for recursion.

So why do dogs bark and humans speak? In short, our incrementally better computational capability and the rapidly increasing complexity of our social behavior enabled us to develop a tool capable of representing and sharing our experience, coordinating our collective behavior and analyzing the results of our observations to pursue scientific inquiry, create art and produce technology.

One of the goals of comparative, developmental and evolutionary neuroscience is to understand how humans acquire language and why other animals with similar brains fall short or fail altogether in acquiring a language of comparable complexity and utility. In the following paragraphs, we attempt to explain this conundrum by providing a computational model of how language is acquired and appealing to the differences between the brains of humans and other mammals listed above.

Just as babies begin to induce the retinal mapping of the primary visual cortex very early and possibly even in the womb, so too they learn the prosody, phonology and, in particular, the ability to recognize word boundaries in their native tongue within the first few weeks of hearing their mother's voice.

²² Sounds signify events that we associate with entities, e.g., animals, storms, and processes, e.g., birds calling, rain falling. Sounds vary over time, involve multiple sources that may or may not be related and provide information critical to our survival. We do not rely on sound as much as some other animals do, but it serves us well if our vision is obscured or we lack adequate illumination.

Sound is also a convenient medium for exchanging information in the form of spoken language. It is not the only medium for language, but it may provide insight into how language is processed in the brain. While spoken language is a late arrival evolutionarily speaking, the brain structures that make it possible were very likely in place and serving other purposes long before humans started using anything we would recognize as language.

Most mammals are able to parse the complex structure of their aural experience, assigning meaning to different aspects of their experience including the identity and behavior of the sources. It seems plausible that the ability to parse the ambient stream of sounds enables hearing-equipped animals to create and remember wordless narratives that help them to remember and categorize particularly interesting or relevant sonic segments²³.

The previous paragraph was intended to consider sound as a medium for conveying information about the world we live in without relying on the spoken word. We needn't have restricted ourselves to just sound; together, our senses allow us to construct rich representations of the world that we use to guide immediate action and subsequently recall in order to deal with novel situations. Here are some relevant excerpts from the writings of Philip Lieberman and Terrence Deacon on the evolution of the brain and the emergence of language:

From On the nature and evolution of the neural bases of human language, by Philip Lieberman [317]:

Page 72 — The central role that speech plays in the human functional language system is manifested in the rehearsal mechanism of verbal working memory whereby words are subvocally maintained using the neuroanatomical structures that regulate speech production. Verbal working memory appears to be an integral component, perhaps the key component, of the human functional language system coupling speech perception, production, semantics, and syntax.

Page 69 — In this light, the shift in modality to gesture that can occur in hearing-impaired individuals who use manual sign language is not surprising. Neuroanatomical structures that otherwise would code for the production of speech signals appear instead to code the gestures of complex sign language, in addition to systems that process visual spatial inputs. As Elman et al. note, this is not surprising, since the damaged dorsoparietal and frontal regions of the brain that cause sign language aphasia are involved in processing visual signals.

Page 62 — The neural instantiation of the lexicon appears to be a massively parallel distributed network in which individual words are represented by activity patterns in local circuits along the general principles outlined by Mesulam 1990. These circuits reference the anatomical structures that are involved in the direct perception or execution of the concepts coded in words, as well as the neural circuits that code the phonological "names" associated with concepts. In other words, the brain's linguistic dictionary appears to link circuits that code the concept referenced in a word to the stored "phonologic" sound pattern that represents the word. The sound pattern of a word, its name, in turn, is the primary key to addressing the semantic and syntactic information that constitutes the meaning of a word. The sound pattern of the word also appears to maintain the word in verbal working memory. Verbal working memory can be regarded as the neural computational space in which the meaning of a sentence is derived, taking account of syntactic, semantic, contextual, and pragmatic information.

Page 136 — Therefore, speech is a derived feature of human language. However, this does not mean that speech is a recent feature of hominid evolution. Purposeful, referential speech must have existed well before the evolution of anatomically modern Homo sapiens. The basis for this conclusion paradoxically rests in the claim that the anatomy necessary to produce the full range of human speech was absent in Homo erectus and certain, if not all, Neanderthals.

From The Evolution of Speech and Language by Philip Lieberman and Robert McCarthy [320]:

Abstract — Human speech, language, and cognition derive from anatomy and neural mechanisms that have been shaped by the Darwinian process of natural selection acting on variation but that have roots present in other living species. Language did not suddenly arise 50,000-100,000 years ago through a mutation that yielded an innate "faculty of language" nor does the human brain include an organ devoted to language and language alone. Broca's area is not the center of language. Neural circuits linking local activity in different neural structures regulate complex behaviors. Neural circuits that were present in early mammal-like reptiles play a part in regulating laryngeal phonation, conveying both referential information and emotion. Speech plays a central role, enabling transmission of information at a rate that exceeds the auditory fusion frequency. The unique human tongue enhances the robustness of speech, but Neanderthals and other archaic hominins whose neck and skull proportions preclude their having an adult-like human tongue nevertheless could talk. Comparative studies of present-day apes suggest that hominin "protolanguage" lacking syntax never existed. The neural bases of human language are not domain-specific - in other words, they are not devoted to language alone. Mutations on the FOXP2 transcriptional gene shared by humans, Neanderthals, and at least one other archaic species enhanced synaptic plasticity in cortical-basal ganglia circuits that are implicated in motor behavior, cognitive flexibility, language, and associative learning. A selective sweep occurred about 200,000 years ago on a unique human version of this gene. Other transcriptional genes appear to be implicated in enhancing cortical-basal ganglia and other neural circuits.

From The Symbolic Species: The Co-evolution of Language and the Brain by Terrence W. Deacon [118]:

Page 53 — Treating animal calls and gestures as subsets of language not only reverses the sequence of evolutionary precedence, it also inverts their functional dependence as well. We know that the nonlanguage communication used by other animals is self-sufficient and needs no support from language to help acquire or interpret it. This is true even for human calls like sobbing or gestures like smiling. In contrast, however, language acquisition depends critically on nonlinguistic communication of all sorts, including much that is as innately prespecified as many nonhuman counterparts. Not only that, but extensive nonverbal communication is essential for providing the scaffolding on which most day-to-day language communication is supported. In conversations, demonstrations, and explanations using words we make extensive use of prosody, pointing, gesturing, and interactions with objects and other people to disambiguate our spoken messages.

Page 63 — So, what are some of the interpretants of words? Probably the most common view of word meaning is that a word is interpreted when one generates a mental image of something that it refers to: for example, an image of a familiar dog for "dog" or of someone throwing a baseball for "pitch." Though once treated as a sort of fairy tale of introspectionism, mental imagery has in recent years become recognized as an experience that has clear neural and behavioral correlates. The relative "locations" of features on an imagined object, or its size, shape, and movement in imagination, or other changes, can have a direct effect on such factors as the time and effort required to consider these features and what parts of the brain might be involved. But a mental image (or the neural process that constitutes it) is only one sort of interpretive response that a word might elicit, and it may not be the most important one.

Page 83 — Even without struggling with the philosophical subtleties of this relationship, we can immediately see the significance for learning. The learning problem associated with symbolic reference is a consequence of the fact that what determines the pairing between a symbol (like a word) and some object or event is not their probability of co-occurrence, but rather some complex function of the relationship that the symbol has to other symbols. This is a separate but linked learning problem, and worse yet, it creates a third, higher-order unlearning problem. Learning is, at its base, a function of the probability of correlations between things, from the synaptic level to the behavioral level. Past correlations tend to be predictive of future correlations. This, as we've seen, is the basis for indexical reference. In order to comprehend a symbolic relationship, however, such indexical associations must be subordinated to relationships between different symbols. This is a troublesome shift of emphasis. To learn symbols we begin by learning symbol-object correlations, but once learned, these associations must be treated as no more than clues for determining the more crucial relationships. And these relationships are not highly correlated; in fact, often just the reverse. Words that carry similar referential function are more often used alternatively and not together, and words with very different (complementary) referential functions tend to be adjacent to one another in sentences. Worst of all, few sentences or phrases are ever repeated exactly, and the frequency with which specific word combinations are repeated is also extremely low. Hardly a recipe for easy indexical learning.

Page 92 — The problem with symbol systems, then, is that there is both a lot of learning and unlearning that must take place before even a single symbolic relationship is available. Symbols cannot be acquired one at a time, the way other learned associations can, except after a reference symbol system is established. A logically complete system of relationships among the set of symbol tokens must be learned before the symbolic association between any one symbol token and an object can even be determined. The learning step occurs prior to recognizing the symbolic function, and this function only emerges from a system; it is not vested in any individual sign-object pairing. For this reason, it's hard to get started. To learn a first symbolic relationship requires holding a lot of associations in mind at once while at the same time mentally sampling the potential combinatorial patterns hidden in their higher-order relationships. Even with a very small set of symbols the number of possible combinations is immense, and so sorting out which combinations work and which don't requires sampling and remembering a large number of possibilities.

Page 109 — Languages don't just change, they evolve. And children themselves are the rigged game. Languages are under powerful selection pressure to fit children's likely guesses, because children are the vehicle by which a language gets reproduced. Languages have had to adapt to children's spontaneous assumption about communication, learning, social interaction, and even symbolic reference, because children are the only game in town. It turns out that in a curious sort of inversion of our intuitions about this problem, languages need children more than children need languages.

Page 263 — The prefrontal cortex helps us inhibit the tendency to act on simple correlative stimulus relationships and guides our sampling of alternative higher order sequential or hierarchic associations. Its role in language and symbol learning in particular is not, however, merely to increase something we might call prefrontal intelligence. Rather I suspect the importance of the size change can be thought of in displacement terms, in patterns of cognition as in patterns of brain development. Prefrontal computations out-compete other cognitive computations and tend to dominate learning in us as in no other species. In simple terms, we have become predisposed to use this one cognitive tool whenever an opportunity presents itself, because an inordinate amount of control of the other processes in the brain has become vested in our prefrontal cortex. The way the parietal cortex handles tactile and movement information, the way the auditory cortex handles sound information, the way the visual cortex handles visual information are all now much more constrained by prefrontal activity than in other species.

Page 277 — So, the cerebellum is probably far more involved with sound analysis in human beings than it is in any other species. This may be very important for the generation of word associations at a rate sufficient for speech. At the rate words are presented in speech, the speaker or listener must be able rapidly to generate associated words and avoid letting earlier associations interfere. The cognitive search process must be as rapid but as shallow as possible. Any slight tendency to perseverate would entirely derail the process. It is not a passive analysis that must be performed, but the rapid, controlled generation of relatively novel responses, though only relatively novel because word associations are recalled from innumerable previous associations in which they were elicited together in sentences. A word generation process analogous to conjugating a verb for a noun in rapid alternation must inevitably be part of all sentence production. Succeeding at this task appears to be greatly facilitated by linkage to the brain's rapid prediction computer, the cerebellum. The cerebellum may provide an independent generator of novel but predictable shifts of associations from one context to another, while prefrontal cortex is providing a selective inhibition of all but the one sample that fits the new criteria. In addition to the cerebellum, the cingulate cortex also seems to be intensely activated in a variety of tasks that require top-down intentional shifting of attention.

²⁴ For an interesting new take on math education that emphasizes data science, listen to Jo Boaler, professor of mathematics education at Stanford University, speaking with Steve Levitt on this Freakonomics podcast.

²⁵ McClelland et al. [345], in reviewing recent progress in achieving human-level linguistic competence, note that in humans, "language is part of a larger system for acquiring, representing, and communicating about objects and situations in the physical and social world, and future machine language models should emulate such a system." The authors also mention that human language processing "exploits complementary learning systems", and that "[a]dding such a system to machine language models will be an important further step toward truly human-like language understanding." The same could be said for machine code-synthesis models.

²⁶ On a related note, the structured nature of mathematics and the domain of mathematical objects begs for a comparison with computer programs and their natural language encodings. To that end, Saxton et al. [438] develop a test suite of mathematics problems involving sequential questions and answers in a free-form textual input/output format and then employ these problems "to conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge."

²⁷ Excerpt from Premack [404] Human and animal cognition: Continuity and discontinuity:

Chimpanzee short-term memory is only slightly less than that of humans. Both species remember no more than five to seven items. This limitation can adversely affect problem-solving (by limiting the alternatives that can be considered), but does it pose the same danger to both humans and chimpanzees?
Although short-term memory limits the number of units one can remember, it does not define the content of a unit. A language-trained chimpanzee exposed to the numbers 1–9 might remember, for example: 2, 6, 4, 3, 7, whereas humans might remember, for example, 21, 43, 96 ... ; 214, 618, 109 ... ; 1012, 6680, 3456, etc. In language, content of the unit is even more open-ended. A chimpanzee may remember five words; a human may remember five phrases, five sentences, five stories, etc. Humans are able to make these expansions because they are capable of both recursive language and numbers.

Thus, despite comparable limitations in short-term memory, animals and humans may differ dramatically in the amount of information they can remember. In addition, humans can, and often do, circumvent short-term memory with written language. Similar limitations in different species may have entirely different consequences depending on the other cognitive resources of the species. (SOURCE)
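The arithmetic behind Premack's point can be sketched in a few lines. The calculation assumes a fixed span of five units, taken from the "five to seven items" in the excerpt, and assumes each unit is drawn uniformly and independently from a set of alternatives. Both assumptions, and the helper name `capacity_bits`, are introduced here for illustration rather than taken from the excerpt.

```python
import math

def capacity_bits(n_units, options_per_unit):
    """Bits needed to specify n_units items, each drawn uniformly from options_per_unit choices."""
    return n_units * math.log2(options_per_unit)

UNITS = 5  # assumed span, from the "five to seven items" in the excerpt

single_digits = capacity_bits(UNITS, 9)   # each unit is one digit, 1-9
three_digit = capacity_bits(UNITS, 900)   # each unit is one number, 100-999

# Same number of units, roughly three times the information.
print(round(single_digits, 1), round(three_digit, 1))
```

With the span held constant, the information retained grows with the size of the set each unit is drawn from, which is the sense in which the content of a unit, and not the number of units, separates the two species in the excerpt.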

²⁸ Excerpt from Lieberman and McCarthy [320] The Evolution of Speech and Language:

Manual gestures, facial expressions, and body language all continue to play a role in human communication, but speech is the default, primary, phonetic modality of language. Sign languages are a comparatively recent invention, dating to the eighteenth century. Speech confers numerous advantages over communication by means of manual gestures, facial expressions, and posture, such as not having to direct one’s attention to individuals who are communicating and freeing one’s hands when communicating. However, perhaps more importantly, speech allows humans to rapidly transmit information.
The rate at which the sounds that convey words are transmitted exceeds the fusion frequency of the auditory system, the rate at which other sounds merge into a meaningless buzz. The process by which this high transmission rate is achieved has been enhanced through the evolution of the species-specific human tongue and the airway above the larynx. The fossil record thus provides a time line for this process as well as for the evolution of the brain bases for speech motor control, language, and some critical elements of human cognitive ability.

²⁹ Speaker: Friedemann Pulvermüller
Affiliation: Brain Language Laboratory, Freie Universität Berlin
Title: Brain Language Research
Comment: This is a transcript of a video featuring Pulvermüller on Serious Science - bracketed [segments] are my corrections
Abstract: Neuroscientist Friedemann Pulvermüller on brain-language relations, brain lesions, and simulation of cognitive processes

I would like to say a few words about the brain basis of language. From patients and neurological studies we know that some parts of the brain are particularly important for language – actually, the whole new part of the brain, the neocortex, which is especially large in humans, is important for language. However, there are specific parts of that big structure – the neocortex – which are of particular relevance for language. One of them is in the frontal cortex, the second one is in the temporal cortex or behind the left ear. It's usually the left hemisphere which is the most important for language – in most of us, at least.
We know from case descriptions, neurologists in the 19th century already described diseases that involved the frontal language 'center' which were accompanied by some problems in speaking and difficulty in understanding. The other one behind the ear, the temporal lobe lesion, also produced a problem in speaking and understanding. Those diseases were called 'aphasias' – language disorders after acquired brain injury in adults. So, from these patients, we know which parts of the brain take most of the burden in language processing.

Why is that? We have models today that would actually account for that – at least in part. We know that the brain has certain neuroanatomical structure, that the articulators of the mouth are controlled by a region called 'motor cortex' that is close to the frontal language region. The cables from our ears that tell the brain about acoustic signals we process, reach areas in the temporal cortex. These areas are close to the regions that would also be of particular relevance for language.

Why then would these language regions take their home close to the inputs and outputs of language? The explanation can be provided by neurobiological models where you produce neural networks – artificial brain-like structures, where we have a motor cortex, the frontal language region, the auditory cortex, the sound processing areas and the areas around them, – and then present this network with information that we would produce or receive during language acquisition, early stages of language learning. If I play a baby or an early language learner, I may first say senseless syllables like 'ba ba ba' or single words like 'bat' or 'car'. Those words would, of course, the movements of my mouth that produce them would require some motor activation. At the same time, there is activation in the auditory system as I hear myself speaking. Therefore, at two very distant ends of the brain, in a motor cortex and in the auditory cortex, in the front and in the back, we have correlated activation.

I mentioned the term 'correlation'. It is something very important, as our brain is particularly good at mapping correlation. It has learning rules that drive the strength of the links between nerve cells according to the correlation of their firing. If we have this correlation at different ends of our brain when we speak, there will be strengthening of the links between these nerve cells. As there are actually no strong direct connections between the motor cortex and the auditory cortex, the activation needs to take detours through regions in other parts of the frontal cortex and nearby parts of the temporal cortex. Those now happen to be linked with each other.

That's a particularly important feature of the human brain. It has only recently been discovered that especially in the left hemisphere there are fundamentally strong connections between the frontal and the temporal language areas. These connections lead information back and forth between the two. For mapping the correlation between motor and auditory neural activation the human brain is especially good and well-developed – it's cut out for it, if you wish. The idea that I'm trying to bring forward is that during our early language acquisition, we would build neuronal assemblies, circuits of strongly connected nerve cells, one for every word and maybe for larger constructions. That is why we build a vocabulary of words and longer [chunks] of language which then become consolidated over time and form the building blocks of the language.

Back to our aphasia data, our brain lesion data, we can use this type of model and lesion it in the front and in the back and find out that our artificial neural networks do something similar to what patients with frontal and temporal lesions do. They show deficits, as our patients do. They have problems mainly in speaking, but there's also a detectable problem in understanding the language and even single words, if we lesion the front. Of course, if the back part of the temporal cortex receives a lesion, there's a heavy language understanding problem and some difficulty with finding words and composing them in the correct way from the individual language sounds, phonemes.

So, we're in the process of slow but steady improvement of our understanding of brain-language relations. I have only talked about words – this is a very simple topic relative to complex syntax and the meaning of words and larger constructions. Finally, there are social interactions in which language plays a big role. However, these biological mechanisms kick in at a very basic level; at the higher, more complex levels we have similar situations. The language machinery can slowly but steadily be explained a little better using neurocomputational work and looking at the patients in much detail. We could elaborate on this topic when we use neuroimaging and modern techniques to stimulate parts of the brain.

One possibility is to look at the patients with language disturbances and describe their problems. They can have trouble naming objects (saying 'that's a glass') or understanding problems (if you show them a glass and a hand, and ask which of them is a hand, they may point at the wrong object). One important part is neuroimaging: we use fMRI, EEG and MEG to look at the brain activation patterns elicited by speech sounds, words, little constructions, complex sentences or whole interactions between communication partners, as when I ask somebody to give me something and he or she responds. There are specific brain activation patterns we can map.

A different strategy is to [compare] patients with healthy people. There are methods to slightly affect the functionality of the brain by magnetic stimulation. By that we can produce mini lesions or mini activations much more focally than big brain lesions. Then we can address the question of whether a smaller part of the brain is causally involved in language understanding, for example.

In 2015 we published a paper where we showed that stimulating the motor cortex in the frontal lobe influences the understanding of single words. Some colleagues believe that the frontal cortex and the motor system especially is not so important for understanding. Certainly, it's not as important as other parts of the brain, like the temporal language region. However, it also plays a role in understanding and even at the single-word level.

We now know a little bit more about brain-language relationships, and we can even address the question of why the language regions are placed in the brain where they are. We can use neural network simulations mimicking the brain to provide explanations as to why certain aphasias occur after specific lesions.

Friedemann Pulvermüller
Brain Language Laboratory, Freie Universität Berlin

³⁰ Excerpt from Neural reuse of action perception circuits for language, concepts and communication [410]:

Neurocognitive and neurolinguistics theories make explicit statements relating specialized cognitive and linguistic processes to specific brain loci. These linking hypotheses are in need of neurobiological justification and explanation. Recent mathematical models of human language mechanisms constrained by fundamental neuroscience principles and established knowledge about comparative neuroanatomy offer explanations for where, when and how language is processed in the human brain. In these models, network structure and connectivity along with action- and perception-induced correlation of neuronal activity co-determine neurocognitive mechanisms.
Language learning leads to the formation of action perception circuits (APCs) with specific distributions across cortical areas. Cognitive and linguistic processes such as speech production, comprehension, verbal working memory and prediction are modeled by activity dynamics in these APCs, and combinatorial and communicative-interactive knowledge is organized in the dynamics within, and connections between APCs. The network models and, in particular, the concept of distributionally-specific circuits, can account for some previously not well understood facts about the cortical 'hubs' for semantic processing and the motor system's role in language understanding and speech sound recognition.

³¹ Excerpt from Thinking in circuits: Toward neurobiological explanation in cognitive neuroscience [408]

Cognitive theory has decomposed human mental abilities into cognitive (sub) systems, and cognitive neuroscience succeeded in disclosing a host of relationships between cognitive systems and specific structures of the human brain. However, an explanation of why specific functions are located in specific brain loci had still been missing, along with a neurobiological model that makes concrete the neuronal circuits that carry thoughts and meaning. Brain theory, in particular the Hebb-inspired neurocybernetic proposals by Braitenberg, now offers an avenue toward explaining brain-mind relationships and to spell out cognition in terms of neuron circuits in a neuromechanistic sense. Central to this endeavor is the theoretical construct of an elementary functional neuronal unit above the level of individual neurons and below that of whole brain areas and systems: the distributed neuronal assembly (DNA) or thought circuit (TC).
It is shown that DNA/TC theory of cognition offers an integrated explanatory perspective on brain mechanisms of perception, action, language, attention, memory, decision and conceptual thought. We argue that DNAs carry all of these functions and that their inner structure (e.g., core and halo subcomponents), and their functional activation dynamics (e.g., ignition and reverberation processes) answer crucial localist questions, such as why memory and decisions draw on prefrontal areas although memory formation is normally driven by information in the senses and in the motor system. We suggest that the ability of building DNAs/TCs spread out over different cortical areas is the key mechanism for a range of specifically human sensorimotor, linguistic and conceptual capacities and that the cell assembly mechanism of overlap reduction is crucial for differentiating a vocabulary of actions, symbols and concepts. – See here for a list of recent articles citing this paper.

³² Excerpt from How neurons make meaning: Brain mechanisms for embodied and abstract-symbolic semantics [409]:

How brain structures and neuronal circuits mechanistically underpin symbolic meaning has recently been elucidated by neuroimaging, neuropsychological, and neurocomputational research. Modality-specific 'embodied' mechanisms anchored in sensorimotor systems appear to be relevant, as are 'disembodied' mechanisms in multimodal areas. In this paper, four semantic mechanisms are proposed and spelt out at the level of neuronal circuits: referential semantics, which establishes links between symbols and the objects and actions they are used to speak about; combinatorial semantics, which enables the learning of symbolic meaning from context; emotional-affective semantics, which establishes links between signs and internal states of the body; and abstraction mechanisms for generalizing over a range of instances of semantic meaning. Referential, combinatorial, emotional-affective, and abstract semantics are complementary mechanisms, each necessary for processing meaning in mind and brain.

³⁴ An efference copy or efferent copy is an internal copy of an outflowing (efferent), movement-producing signal generated by the motor system. It can be collated with the (reafferent) sensory input that results from the agent's movement, enabling a comparison of actual movement with desired movement, and a shielding of perception from particular self-induced effects on the sensory input to achieve perceptual stability. Together with internal models, efference copies can serve to enable the brain to predict the effects of an action. An equal term with a different history is corollary discharge. (SOURCE)

³³ Examines how efference copy³⁴ might influence sensory prediction and the accuracy of motor activity to achieve a target using auditory feedback as an example [377]:

How precisely does the brain predict the sensory consequences of our actions? Efference copy is thought to reflect the predicted sensation of self-produced motor acts, such as the auditory feedback heard while speaking. Here, we use magnetoencephalographic imaging (MEG-I) in human speakers to demonstrate that efference copy prediction does not track movement variability across repetitions of the same motor task. Specifically, spoken vowels were less accurately predicted when they were less similar to a speaker's median production, even though the prediction is thought to be based on the very motor commands that generate each vowel. Auditory cortical responses to less prototypical speech productions were less suppressed, resembling responses to speech errors, and were correlated with later corrective movement, suggesting that the suppression may be functionally significant for error correction. The failure of the motor system to accurately predict less prototypical speech productions suggests that the efferent-driven suppression does not reflect a sensory prediction, but a sensory goal. (SOURCE: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3792453/)

³⁵ Semiotics is the study of signs, symbols, and signification. It is the study of how meaning is created, not what it is. Below are some brief definitions of semiotic terms, beginning with the smallest unit of meaning and proceeding towards the larger and more complex:

Signifier: any material thing that signifies, e.g., words on a page, a facial expression, an image. Signified: the concept that a signifier refers to. Together, the signifier and signified make up the sign which is the smallest unit of meaning. Anything that can be used to communicate (or to tell a lie).

There are three types of sign we are primarily concerned with: Symbolic signs: signs where the relation between signifier and signified is purely conventional and culturally specific, e.g., most words; Iconic signs: signs where the signifier resembles the signified, e.g., a picture; and Indexical Signs: signs where the signifier is caused by the signified, e.g., smoke signifies fire.

Finally, linguists and logicians distinguish between Denotation: the most basic or literal meaning of a sign, e.g., the word "rose" signifies a particular kind of flower; and Connotation: the secondary, cultural meanings of signs, or "signifying signs," signs that are used as signifiers for a secondary meaning, e.g., the word "rose" signifies passion. (SOURCE)

³⁶ "In our case, experts trained to reproduce single clips can be conceptualized as nonlinear feedback controllers around a nominal trajectory, and the manifold of states visited by experts can be thought of as a tube around that reference." [Page 5 [353]]

³⁷ Behavioral cloning is a method by which human subcognitive skills can be captured and reproduced in a computer program. As the human subject performs the skill, his or her actions are recorded along with the situation that gave rise to the action. A log of these records is used as input to a learning program [355].

³⁸ Predictive coding is but one way in which to serve this requirement though it is not sufficient and requires the ability to formulate and test that actions have their intended consequences.

³⁹ This includes active inference & perception required to deploy sensors so as to gather information necessary to plan and act in advance of the need to perform the requisite planning and acting.

⁴⁰ This means providing the necessary information to support downstream computations including just the information required at a suitable level of abstraction with little noise or spurious information.

⁴¹ The cingulate cortex receives inputs from the thalamus and the neocortex, and projects to the entorhinal cortex via the cingulum. It is an integral part of the limbic system, which is involved with emotion formation and processing, learning, and memory. (SOURCE)

⁴² The orbitofrontal cortex (OFC) is defined as the part of the prefrontal cortex that receives projections from the medial dorsal nucleus of the thalamus, and is thought to represent emotion and reward in decision making. (SOURCE)

⁴³ "Parallel circuits through the basal ganglia for different regions of the frontal cortex -- each region of frontal cortex has a corresponding basal ganglia circuit, for controlling action selection/initiation in that frontal area. Motor loop: SMA = supplementary motor area -- the associated striatum (putamen) also receives from premotor cortex (PM), and primary motor (M1) and somatosensory (S1) areas -- everything needed to properly contextualize motor actions." — Page 144 in [386]

⁴⁴ The substantia nigra (SN) is a basal ganglia structure located in the midbrain that plays an important role in reward and movement. Although the substantia nigra appears as a continuous band in brain sections, anatomical studies have found that it actually consists of two parts with very different connections and functions: the pars compacta (SNpc) and the pars reticulata (SNpr). The pars compacta serves mainly as an output to the basal ganglia circuit, supplying the striatum with dopamine. The pars reticulata, though, serves mainly as an input, conveying signals from the basal ganglia to numerous other brain structures. (SOURCE)

⁴⁵ The limbic system also known as the paleomammalian cortex supports a variety of functions including emotion, behavior, motivation, long-term memory, and olfaction. Emotional life is largely housed in the limbic system, and it critically aids the formation of memories. With a primordial structure, the limbic system is involved in lower order emotional processing of input from sensory systems and consists of the amygdaloid nuclear complex (amygdala), mammillary bodies, stria medullaris, central gray and dorsal and ventral nuclei of Gudden. (SOURCE)

⁴⁶ The reticular theory is an obsolete scientific theory in neurobiology positing that everything in the nervous system is a single continuous network. The concept was postulated by a German anatomist Joseph von Gerlach in 1871, and was most popularised by the Nobel laureate Italian physician Camillo Golgi. (SOURCE)

⁴⁷ The neuron doctrine is the concept that the nervous system is made up of discrete individual cells, a discovery due to decisive neuro-anatomical work of Santiago Ramón y Cajal [187]. The neuron doctrine, as it became known, served to position neurons as special cases under the broader cell theory evolved some decades earlier. (SOURCE)

⁴⁸ Actually, John Gage, the chief scientist at Sun Microsystems, coined the phrase, "The network is the computer." These words went on to become the company's advertising slogan for several years. McNealy, if the rumor is correct, was only paraphrasing Gage.

⁴⁹ The reporter can be cut a little slack since at the time diskless desktop computers—also called thin clients or diskless nodes—were uncommon and looked like nothing more than the monitor of a standard personal computer without the bulky cabinet containing the disk drives, motherboard and power supply. Oracle went on to manufacture such devices under the name "Network Computer" but they never caught on, most likely because personal computers built from inexpensive commodity parts were better and cheaper for most use cases.

⁵⁰ Bruno Olshausen once said to me that it isn't particularly surprising that neurons behave nonlinearly, what's interesting is that in some (appropriate) cases they behave linearly.

⁵¹ Our growing understanding of the synapse, dendritic spine, axon hillock, cellular transport system, etc., has substantially altered our appreciation of the complexity of the myriad activities that contribute to computation within an individual neuron [231, 83, 459, 329]. Here we acknowledge that complexity but choose to divide computational contributions into those that control behavior and those that control the environment in which computations are carried out, analogous to how, in electrical circuits, voltage regulators maintain constant voltage levels, heat sinks dissipate waste heat, impedance matching circuits maximize power transfer or minimize signal reflection, etc.

⁵² Raphael Yuste recalls Carver Mead suggesting something along these lines in an invited talk at the annual meeting of the Society for Biophysics in 1993. I've articulated the idea in terms of the hypothesis that much of what goes on in individual neurons and their pairwise interactions is in service to maintaining some sort of equilibrium state conducive to performing their primary computational roles in maintaining the physical plant and controlling behavior.

Such a division of labor, say 90% to routine maintenance and homeostatic regulation at the microscale and 10% to computation required to conduct business at the macroscale, makes sense when you think of the lengths that semiconductor process engineers have to go to in maintaining the purity of the silicon crystal substrate, exposure of photoresists, diffusion depth of dopants, constant width of traces and vias, etc. Intel takes care of all this in production; the brain builds fragile, pliant neurons and never stops tinkering with them.

Note: I asked Dick Lyon, one of Carver Mead's close collaborators and former graduate students, about Raphael's recollection and here is what he had to say: Tom, the closest I could find is this passage [350] in a 1987 Caltech Engineering & Science Journal article, with "garbage" but not "crappy" or 90%:

There's nothing special about this fabrication process, and it's not exactly desirable from an analog point of view. Neurons in the brain don't have anything special about them either; they have limited dynamic range, they're noisy, and they have all kinds of garbage. But if we're going to build neural systems, we'd better not start off with a better process (with, say, a dynamic range of 10⁵), because we'd simply be kidding ourselves that we had the right organizing principles. If we build a system that is organized on neural principles, we can stand a lot of garbage in the individual components and still get good information out. The nervous system does that, and if we're going to learn how it works, we'd better subject ourselves to the same discipline. [Page 5 in Mead [350] (PDF)]

⁵³ A mesoscale model is used in dynamical-system modeling to provide a computational or informational bridge between two levels of description. These two levels often account for very different physical principles operating at widely-separated temporal and spatial scales, as in the case of a description at the level of individual atoms requiring quantum electrodynamics, dipole forces, band theory, etc., versus a description at the molecular level involving Coulomb's inverse-square law, van der Waals forces and Brownian motion. In the case of understanding the brain, at one end of the spectrum you might have a molecular-dynamics-based model and at the other end a behavioral-based model. There is as yet no consensus on what sort of description would serve as a mesoscale bridge between the two.

⁵⁴ If a pattern occurs significantly more often than in a randomly organized network with the same degree distribution, it is called a network motif or simply motif. Colloquially referred to as the "building blocks" of complex networks, network motifs mimic the concept of sequence motifs as used in genomics. In a gene sequence, a motif is a recurring subsequence, a pattern that is conjectured to have some functional significance. In a network, a motif is a recurring sub-network conjectured to have some significance. [Excerpted from Kaiser [258]].
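
The comparison against a degree-preserving random network can be sketched in a few lines. The following is a minimal illustration, not the procedure used in [258]: it assumes a directed graph without self-loops given as a list of (source, target) pairs, picks the feed-forward loop as the candidate pattern, and uses target-swapping as the randomization. The function names and the small example graph are hypothetical.

```python
import random

def count_feed_forward_loops(edges):
    """Count triples (a, b, c) with edges a->b, b->c and a->c."""
    present = set(edges)
    successors = {}
    for src, dst in present:
        successors.setdefault(src, set()).add(dst)
    count = 0
    for a, b in present:
        for c in successors.get(b, ()):
            if c != a and (a, c) in present:
                count += 1
    return count

def rewire(edges, n_swaps, rng):
    """Shuffle edges by swapping targets, keeping every in- and out-degree."""
    edges = list(set(edges))
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Skip swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or (a, d) in present or (c, b) in present:
            continue
        present -= {(a, b), (c, d)}
        present |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def degree_sequence(edges):
    """Return (node, out-degree, in-degree) for every node, sorted by node."""
    out_deg, in_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    nodes = sorted(set(out_deg) | set(in_deg))
    return [(n, out_deg.get(n, 0), in_deg.get(n, 0)) for n in nodes]

# Hypothetical toy graph; a real analysis would use a reconstructed connectome.
rng = random.Random(0)
observed_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5), (5, 0)]
observed = count_feed_forward_loops(observed_edges)
null_counts = [count_feed_forward_loops(rewire(observed_edges, 100, rng))
               for _ in range(200)]
null_mean = sum(null_counts) / len(null_counts)
print(observed, null_mean)
```

A pattern would count as a motif only if the observed count sits well above the spread of the null counts; on a graph this small the comparison is purely illustrative.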

⁵⁵ The term functional connectomics — as articulated by Clay Reid [422] who is leading the Project Mindscope [280] team at the Allen Institute for Brain Science and Sebastian Seung [450, 449] — is derived from Hubel and Wiesel's idea of a functional architecture corresponding to the complex relationship between in vivo physiology and the spatial arrangement of neurons [237, 236].

⁵⁶ Algebraic topology employs tools from abstract algebra to study topological spaces. The basic goal is to find algebraic invariants that classify topological spaces up to homeomorphism — an invertible continuous mapping between topological spaces. The directed graph corresponding to a reconstructed connectome can be characterized topologically in terms of its connected components thereby capturing functionally-important features relating to the organization of neural circuits [258]. (SOURCE)

⁵⁷ Nonlinear dynamical systems theory is an interesting example of how mathematics has, on the one hand, had a significant impact on the field through the work of Hodgkin and Huxley and, on the other, largely failed to follow through by building on the foundation Hodgkin and Huxley provided to systematically embrace refinements such as the FitzHugh-Nagumo model [170, 244] that accounts for thresholding and all-or-none spiking or the Andronov-Hopf model that accounts for the bifurcation dynamics of neurons [243].

Izhikevich and FitzHugh note that "A good neuronal model must reproduce not only electrophysiology but also the bifurcation dynamics of neurons" and "These features and not ionic currents per se determine the neuronal responses, i.e., the kind of computations that neurons do" [153]. While there are good basic texts that introduce students to the Nernst equation, the differential equations of the Hodgkin-Huxley model and Cable Theory, e.g., Dayan and Abbott [115], and Koch and Segev [281], until recently it was rare for a graduate education in neuroscience to include careful study of dynamical systems theory [473].
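
To make the point about bifurcation dynamics concrete, here is a minimal sketch of the FitzHugh-Nagumo model integrated with forward Euler. The equations are the standard two-variable form, dv/dt = v - v³/3 - w + I and dw/dt = (v + a - b·w)/τ; the parameter values (a = 0.7, b = 0.8, τ = 12.5), the initial state, the step size and the helper names are assumptions chosen for illustration and are not taken from [170, 244].

```python
def simulate_fhn(current, t_max=200.0, dt=0.01, a=0.7, b=0.8, tau=12.5,
                 v0=-1.0, w0=1.0):
    """Forward-Euler integration of FitzHugh-Nagumo; returns the v trace."""
    v, w = v0, w0
    trace = []
    for _ in range(int(t_max / dt)):
        dv = v - v ** 3 / 3.0 - w + current   # fast, voltage-like variable
        dw = (v + a - b * w) / tau            # slow recovery variable
        v += dt * dv
        w += dt * dw
        trace.append(v)
    return trace

def late_swing(trace):
    """Peak-to-peak amplitude of v over the second half of the run."""
    tail = trace[len(trace) // 2:]
    return max(tail) - min(tail)

resting = late_swing(simulate_fhn(current=0.0))   # settles to a fixed point
spiking = late_swing(simulate_fhn(current=0.5))   # sustained oscillation
print(resting, spiking)
```

With no input current the state spirals into a stable fixed point, while a moderate constant current destabilizes that point and the trajectory settles onto a limit cycle. The change in qualitative behavior as the current is varied is the kind of bifurcation the quotation refers to.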

⁵⁸ "Formally, a Markov blanket renders a set of states, internal and external states, conditionally independent of one another. That is, for any variable A, A is conditionally independent of B, given another variable, C, if and only if the probability of A and B given C can be written as p(A | C) and p(B | C). In other words, A is conditionally independent of B given C if, when C is known, knowing A provides no further information" — Excerpt from [276]
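
Under the standard definition, A is conditionally independent of B given C when the joint conditional factorizes as p(A, B | C) = p(A | C) p(B | C). The sketch below checks that product form numerically on a small discrete joint distribution; the function name, the dictionary layout and the two toy distributions are hypothetical and serve only to illustrate the definition.

```python
from itertools import product

def is_conditionally_independent(joint, tol=1e-9):
    """joint maps (a, b, c) -> probability; test p(a,b|c) == p(a|c) * p(b|c)."""
    a_vals = sorted({a for a, _, _ in joint})
    b_vals = sorted({b for _, b, _ in joint})
    c_vals = sorted({c for _, _, c in joint})
    for c in c_vals:
        p_c = sum(joint.get((a, b, c), 0.0) for a, b in product(a_vals, b_vals))
        if p_c == 0:
            continue  # conditioning on an impossible value is undefined
        for a, b in product(a_vals, b_vals):
            p_ab = joint.get((a, b, c), 0.0) / p_c
            p_a = sum(joint.get((a, b2, c), 0.0) for b2 in b_vals) / p_c
            p_b = sum(joint.get((a2, b, c), 0.0) for a2 in a_vals) / p_c
            if abs(p_ab - p_a * p_b) > tol:
                return False
    return True

# Toy numbers: C is a common cause of A and B.
p_c = {0: 0.4, 1: 0.6}
p_a1_given_c = {0: 0.2, 1: 0.7}   # probability that A = 1 given C
p_b1_given_c = {0: 0.5, 1: 0.1}   # probability that B = 1 given C

def bernoulli(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

# Built as p(c) p(a|c) p(b|c), so A and B are independent given C.
screened = {(a, b, c): p_c[c] * bernoulli(p_a1_given_c[c], a)
            * bernoulli(p_b1_given_c[c], b)
            for a, b, c in product((0, 1), repeat=3)}

# Here B simply copies A, so knowing A still tells us about B even given C.
coupled = {(a, a, c): p_c[c] * bernoulli(p_a1_given_c[c], a)
           for a, c in product((0, 1), repeat=2)}

print(is_conditionally_independent(screened))
print(is_conditionally_independent(coupled))
```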

⁵⁹ "Self-evidencing (Hempel, 1965) occurs when a hypothesis best explains some piece of evidence and, in virtue of that explanatory success, thereby provides evidence for its own truth or correctness. In such cases, the occurrence of the evidence is best explained by the hypothesis but the fact that the evidence occurs at all is used to lend support to the hypothesis itself. To use a common example, my lack of study may be offered as an explanation of why I failed the exam, while my failing the exam might reasonably be offered as evidence for my lack of study."— Excerpt from Clark [103]

⁶⁰ The information entropy, often just entropy, is a basic quantity in information theory associated with any random variable, that can be interpreted as the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes. The concept of information entropy was introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication". The entropy is the expected value of the self-information, a related quantity also introduced by Shannon. The self-information quantifies the level of information or surprise associated with one particular outcome or event of a random variable, whereas the entropy quantifies how "informative" or "surprising" the entire random variable is, averaged on all its possible outcomes. (SOURCE)
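
The relationship between self-information and entropy is easy to state in code. This is a minimal sketch for a discrete distribution given as a list of probabilities; the function names are illustrative and the logarithm base defaults to 2 so that results are in bits.

```python
import math

def self_information(p, base=2.0):
    """Surprise of a single outcome with probability p (bits by default)."""
    return -math.log(p, base)

def entropy(probs, base=2.0):
    """Shannon entropy: the expected self-information over all outcomes."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing (0 * log 0 is taken as 0).
    return sum(p * self_information(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # maximally uncertain binary variable
biased_coin = entropy([0.9, 0.1])    # mostly predictable, so lower entropy
certain = entropy([1.0, 0.0])        # no uncertainty at all
print(fair_coin, biased_coin, certain)
```

A rare outcome carries high self-information, but because it is weighted by its small probability, a heavily biased variable still has low entropy overall.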

⁶¹ David Blei's tutorial [59] on variational inference (PDF) offers a statistician's mathematical understanding of variational inference. As for predictive coding implemented as a recurrent neural network, Friston has this to say,

It is difficult to overstate the generality and importance of [Equation 25 in [261]]: its solutions grandfather nearly every known statistical estimation scheme, under parametric assumptions about additive noise. These range from ordinary least squares to advanced variational deconvolution schemes. In neural network terms, [Equation 25] says that error-units receive predictions from the same level and the level above. Conversely, prediction-units are driven by prediction errors from the same level and the level below. These constitute bottom-up and lateral messages that drive conditional expectations towards a better prediction to reduce the prediction error in the level below. This is the essence of recurrent message passing between hierarchical levels in the brain that suppress free energy or prediction error.
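
The interplay of error-units and prediction-units can be illustrated with a deliberately tiny example. The sketch below is not Equation 25 of [261]; it assumes a single hidden cause, a linear generative model x ≈ weight · mu, and unit-variance Gaussian noise on both the observation and the prior, so that gradient descent on the squared prediction errors has a closed-form fixed point to compare against. All names and numbers are hypothetical.

```python
def infer_cause(x, weight, prior_mean, lr=0.1, steps=200):
    """Settle the expectation mu by descending the summed squared errors."""
    mu = prior_mean
    for _ in range(steps):
        err_sensory = x - weight * mu   # error unit for the level below
        err_prior = mu - prior_mean     # error unit at the same level
        # The prediction unit is driven up by the weighted sensory error
        # and pulled back toward the prior by its own-level error.
        mu += lr * (weight * err_sensory - err_prior)
    return mu

x, weight, prior_mean = 3.0, 2.0, 0.0
estimate = infer_cause(x, weight, prior_mean)
# Under the unit-variance assumptions the minimum is available in closed form.
closed_form = (weight * x + prior_mean) / (1.0 + weight ** 2)
print(estimate, closed_form)
```

The iterative estimate converges to the same value as the closed-form posterior mean, which is the sense in which suppressing prediction error performs statistical estimation in this simplified setting.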

⁶² Generalized filtering is a generic Bayesian filtering scheme for nonlinear state-space models. It is based on a variational principle of least action, formulated in generalized coordinates. Generalized filtering furnishes posterior densities over hidden states (and parameters) generating observed data using a generalized gradient descent on variational free energy, under the Laplace assumption. (SOURCE)

⁶³ For a probabilistic (Bayesian) (directed, acyclic) graphical model, the Markov blanket of a node is defined as its parents, children and the other parents of all of its children. Presumably, an ensemble of component units might be characterized as such by defining the relationship between individual units in terms of causal connections, for some useful definition of cause.
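
The graphical definition translates directly into code. This is a minimal sketch that assumes the directed acyclic graph is given as a dictionary mapping each node to the set of its parents; the function name and the example graph are hypothetical.

```python
def markov_blanket(parents, node):
    """Parents, children and the children's other parents of `node`."""
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = set()
    for child in children:
        co_parents |= parents[child]
    blanket = set(parents.get(node, set())) | children | co_parents
    blanket.discard(node)   # a node is not part of its own blanket
    return blanket

# Hypothetical DAG: A -> C <- B, C -> D <- E, D -> F
dag = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"C", "E"},
    "E": set(),
    "F": {"D"},
}
print(sorted(markov_blanket(dag, "C")))
```

For node C the blanket contains its parents A and B, its child D, and D's other parent E. F is excluded because, given D, it carries no further information about C.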

⁶⁴ The function F in Equation 4 is the Helmholtz free energy which is minimized by the thermal equilibrium or Boltzmann distribution. ... Learning is much more tractable if we use the non-equilibrium Helmholtz free energy as a Lyapunov function (Neal and Hinton, 1993).

⁶⁵ Two-photon excitation microscopy (TPEF or 2PEF) is a fluorescence imaging technique that allows imaging of living tissue up to about one millimeter in thickness. (SOURCE)

⁶⁶ Focused Ion Beam (FIB) Scanning Electron Microscopy (SEM) is used in the semiconductor industry, materials science and biological fields for site-specific analysis, deposition, and ablation of materials. (SOURCE)

⁶⁷ Moore's Law in some form will continue for some time to come. Bridging technologies like extreme ultraviolet lithography (EUL) will tide us over for a while, but the demand for computing is so high that VC firms, as well as the incumbent chip manufacturers, will seriously consider investing in the development of promising alternatives that couldn't otherwise compete against major chip companies.

⁶⁸ Common coding theory is a cognitive psychology theory describing how perceptual representations (e.g. of things we can see and hear) and motor representations (e.g. of hand actions) are linked. The theory claims that there is a shared representation (a common code) for both perception and action. More important, seeing an event activates the action associated with that event, and performing an action activates the associated perceptual event. [...] Perception and action processes are functionally intertwined: perception is a means to action and action is a means to perception. Indeed, the vertebrate brain has evolved for governing motor activity with the basic function to transform sensory patterns into patterns of motor coordination. (SOURCE)

⁶⁹ The term somesthesis includes all of the various sensory systems in the skin and other bodily tissues responsible for the senses of touch-pressure, warmth and coldness, pain and itch, together with positioning and movement.

⁷⁰ Here are a few of the references that I compiled regarding dynamic routing and variable binding that are relevant to O'Reilly and Frank [384] and O'Reilly and Soto [388]:

@incollection{OReillyandSotoNIPS-02,
        title = {A Model of the Phonological Loop: Generalization and Binding},
       author = {Randall C. O'Reilly and R. Soto},
    booktitle = {Advances in Neural Information Processing Systems 14},
       editor = {T. G. Dietterich and S. Becker and Z. Ghahramani},
        pages = {83-90},
         year = {2002},
    publisher = {MIT Press},
     abstract = {We present a neural network model that shows how the prefrontal cortex, interacting with the basal ganglia, can maintain a sequence of phonological information in activation-based working memory (i.e., the phonological loop). The primary function of this phonological loop may be to transiently encode arbitrary bindings of information necessary for tasks - the combinatorial expressive power of language enables very flexible binding of essentially arbitrary pieces of information. Our model takes advantage of the closed-class nature of phonemes, which allows different neural representations of all possible phonemes at each sequential position to be encoded. To make this work, we suggest that the basal ganglia provide a region-specific update signal that allocates phonemes to the appropriate sequential coding slot. To demonstrate that flexible, arbitrary binding of novel sequences can be supported by this mechanism, we show that the model can generalize to novel sequences after moderate amounts of training.},
}
@inbook{CerandReillyCOGNITIVE-SCIENCE-98,
       author = {Cer, Daniel and O'Reilly, Randall},
        title = {Neural mechanisms of binding in the hippocampus and neocortex: Insights from computational models},
    booktitle = {Handbook of Binding and Memory: Perspectives from Cognitive Neuroscience},
       editor = {Hubert Zimmer and Axel Mecklinger and Ulman Lindenberger},
    publisher = {Oxford University Press},
         year = {1998},
     abstract = {The development of accurate models of the neural mechanisms underlying binding represents a critical step in the understanding of the mechanisms that give rise to most cognitive processes. This chapter presents a range of computational models based on the biological specializations associated with different brain areas that support a range of different contributions to binding. The posterior cortex can learn coarse-coded distributed representations (CCDRs) of low-order conjunctions which can efficiently and systematically bind information in the service of many different forms of cortical information processing. However, these representations are learned slowly over experience. In contrast, the hippocampus is specialized for rapidly binding novel information into high-order conjunctive representations (e.g., episodes or locations). The prefrontal cortex can actively maintain dynamic bindings in working memory and, through more abstract rule-like representations, support more flexible generalization of behaviour across novel task contexts. Taken together, this overall biologically based cognitive architecture represents a more plausible framework for understanding binding than that provided by temporal synchrony approaches.}
}
@article{LevittetalJoCN-93,
       author = {Levitt, Jonathan B. and Lewis, David A. and Yoshioka, Takashi and Lund, Jennifer S.},
        title = {Topography of pyramidal neuron intrinsic connections in macaque monkey prefrontal cortex (areas 9 and 46)},
      journal = {Journal of Comparative Neurology},
       volume = {338},
       number = {3},
         year = {1993},
        pages = {360-376},
     abstract = {An understanding of the normal organization of prefrontal cortex is essential to the recognition of pathology underlying human behavioral disorders believed to depend on this region. We have therefore studied the pattern of intrinsic intra- and interlaminar pyramidal neuron connectivity in prefrontal areas 9 and 46 (of Walker) in macaque monkey cerebral cortex (anterior to the arcuate sulcus between the principal sulcus and midline). We made focal (200–400 μm) injections of biocytin and mapped the pattern of orthogradely transported label. Injections made into the superficial layers label wide-ranging lateral projections within the same areas of prefrontal cortex. Projections local to such small injections form a narrow band of terminals in layers 1–3 (200–400 μm wide, 2–4 mm long) centered on the injection site. Collateral fibers spread orthogonal to this terminal band, making frequent bifurcations, to establish a series of parallel bands of terminals with uninnervated bands between, spaced regularly across the cortex (center to center 500–600 μm). The entire pattern of terminal label is stripelike, with occasional narrower interbands and crosslinks between the bands, and can extend over 7–8 mm across the cortex. These projections arise from pyramidal neurons in layers 2, 3, and 5 and terminate in layers 1–3. The stripelike pattern contrasts with patchlike patterns in other cortical regions (V1, V2, V4, motor, somatosensory) and is smaller in scale than stripelike zones of corticocortical afferent terminals to this region, reported to be 300–750 μm wide and spaced 1.0–1.5 mm center to center. © 1993 Wiley-Liss, Inc.},
}
@article{VogesetalPiN-10,
        title = {A modeler's view on the spatial structure of intrinsic horizontal connectivity in the neocortex},
       author = {Nicole Voges and Almut Sch{\"u}z and Ad Aertsen and Stefan Rotter},
      journal = {Progress in Neurobiology},
         year = {2010},
       volume = {92},
        pages = {277-292},
     abstract = {Most current computational models of neocortical networks assume a homogeneous and isotropic arrangement of local synaptic couplings between neurons. Sparse, recurrent connectivity is typically implemented with simple statistical wiring rules. For spatially extended networks, however, such random graph models are inadequate because they ignore the traits of neuron geometry, most notably various distance dependent features of horizontal connectivity. It is to be expected that such non-random structural attributes have a great impact, both on the spatio-temporal activity dynamics and on the biological function of neocortical networks. Here we review the neuroanatomical literature describing long-range horizontal connectivity in the neocortex over distances of up to eight millimeters, in various cortical areas and mammalian species. We extract the main common features from these data to allow for improved models of large-scale cortical networks. Such models include, next to short-range neighborhood coupling, also long-range patchy connections. We show that despite the large variability in published neuroanatomical data it is reasonable to design a generic model which generalizes over different cortical areas and mammalian species. Later on, we critically discuss this generalization, and we describe some examples of how to specify the model in order to adapt it to specific properties of particular cortical areas or species.}
}
@article{ConstantinidisandQietalFiN-18,
       author = {Constantinidis, C. and Qi, X. L.},
        title = {Representation of Spatial and Feature Information in the Monkey Dorsal and Ventral Prefrontal Cortex},
      journal = {Frontiers in Integrative Neuroscience},
         year = {2018},
       volume = {12},
        pages = {31},
     abstract = {The primate prefrontal cortex (PFC) is critical for executive functions including working memory, task switching and response selection. The functional organization of this area has been a matter of debate over a period of decades. Early models proposed segregation of spatial and object information represented in working memory in the dorsal and ventral PFC, respectively. Other models emphasized the integrative ability of the entire PFC depending on task demands, not necessarily tied to working memory. An anterior-posterior hierarchy of specialization has also been speculated, in which progressively more abstract information is represented more anteriorly. Here we revisit this debate, updating these arguments in light of recent evidence in non-human primate neurophysiology studies. We show that spatial selectivity is predominantly represented in the posterior aspect of the dorsal PFC, regardless of training history and task performed. Objects of different features excite both dorsal and ventral prefrontal neurons, however neurons highly specialized for feature information are located predominantly in the posterior aspect of the ventral PFC. In accordance with neuronal selectivity, spatial working memory is primarily impaired by inactivation or lesion of the dorsal PFC and object working memory by ventral inactivation or lesion. Neuronal responses are plastic depending on task training but training too has dissociable effects on ventral and dorsal PFC, with the latter appearing to be more plastic. Despite the absence of an overall topography, evidence exists for the orderly localization of stimulus information at a sub-millimeter scale, within the dimensions of a cortical column. Unresolved questions remain, regarding the existence or not of a functional map at the areal and columnar scale, and the link between behavior and neuronal activity for different prefrontal subdivisions.}
}
@article{KatsukiandConstantinidisFiIN-12,
       author = {Katsuki, Fumi and Constantinidis, Christos},
        title = {Unique and shared roles of the posterior parietal and dorsolateral prefrontal cortex in cognitive functions},
      journal = {Frontiers in Integrative Neuroscience},
    publisher = {Frontiers Media S.A.},
       volume = {6},
         year = {2012},
        pages = {17},
     abstract = {The dorsolateral prefrontal cortex (PFC) and posterior parietal cortex (PPC) are two parts of a broader brain network involved in the control of cognitive functions such as working-memory, spatial attention, and decision-making. The two areas share many functional properties and exhibit similar patterns of activation during the execution of mental operations. However, neurophysiological experiments in non-human primates have also documented subtle differences, revealing functional specialization within the fronto-parietal network. These differences include the ability of the PFC to influence memory performance, attention allocation, and motor responses to a greater extent, and to resist interference by distracting stimuli. In recent years, distinct cellular and anatomical differences have been identified, offering insights into how functional specialization is achieved. This article reviews the common functions and functional differences between the PFC and PPC, and their underlying mechanisms.},
}
@article{ZhaoetalNATURE-16,
       author = {Zhou, Xin and Zhu, Dantong and Qi, Xue-Lian and Li, Sihai and King, Samson G. and Salinas, Emilio and Stanford, Terrence R. and Constantinidis, Christos},
        title = {Neural correlates of working memory development in adolescent primates},
      journal = {Nature Communications},
    publisher = {Nature Publishing Group},
         year = {2016},
       volume = {7},
        pages = {13423},
     abstract = {Working memory ability matures after puberty, in parallel with structural changes in the prefrontal cortex, but little is known about how changes in prefrontal neuronal activity mediate this cognitive improvement in primates. To address this issue, we compare behavioural performance and neurophysiological activity in monkeys as they transitioned from puberty into adulthood. Here we report that monkeys perform working memory tasks reliably during puberty and show modest improvement in adulthood. The adult prefrontal cortex is characterized by increased activity during the delay period of the task but no change in the representation of stimuli. Activity evoked by distracting stimuli also decreases in the adult prefrontal cortex. The increase in delay period activity relative to the baseline activity of prefrontal neurons is the best correlate of maturation and is not merely a consequence of improved performance. Our results reveal neural correlates of the working memory improvement typical of primate adolescence.},
}
@article{AdamsetalPTRS_B-02,
       author = {Adams, P.  and Guillery, R. W.  and Sherman, S. M.  and Sillito, A. M.  and Jones, Edward G. },
        title = {Thalamic circuitry and thalamocortical synchrony},
      journal = {Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences},
       volume = {357},
       number = {1428},
        pages = {1659-1673},
         year = {2002},
     abstract = {The corticothalamic system has an important role in synchronizing the activities of thalamic and cortical neurons. Numerically, its synapses dominate the inputs to relay cells and to the γ–amino butyric acid (GABA)ergic cells of the reticular nucleus (RTN). The capacity of relay neurons to operate in different voltage–dependent functional modes determines that the inputs from the cortex have the capacity directly to excite the relay cells, or indirectly to inhibit them via the RTN, serving to synchronize high– or low–frequency oscillatory activity respectively in the thalamocorticothalamic network. Differences in the α–amino–3–hydroxy–5–methyl–4–isoxazolepropionic acid (AMPA) subunit composition of receptors at synapses formed by branches of the same corticothalamic axon in the RTN and dorsal thalamus are an important element in the capacity of the cortex to synchronize low–frequency oscillations in the network. Interactions of focused corticothalamic axons arising from layer VI cortical cells and diffuse corticothalamic axons arising from layer V cortical cells, with the specifically projecting core relay cells and diffusely projecting matrix cells of the dorsal thalamus, form a substrate for synchronization of widespread populations of cortical and thalamic cells during high–frequency oscillations that underlie discrete conscious events. }
}
@article {AlcarazetalELIFE-18,
 article_type = {journal},
        title = {Thalamocortical and corticothalamic pathways differentially contribute to goal-directed behaviors in the rat},
       author = {Alcaraz, Fabien and Fresno, Virginie and Marchand, Alain R and Kremer, Eric J and Coutureau, Etienne and Wolff, Mathieu},
       editor = {Schoenbaum, Geoffrey},
      journal = {eLife},
    publisher = {eLife Sciences Publications, Ltd},
       volume = 7,
         year = 2018,
        pages = {e32517},
     abstract = {Highly distributed neural circuits are thought to support adaptive decision-making in volatile and complex environments. Notably, the functional interactions between prefrontal and reciprocally connected thalamic nuclei areas may be important when choices are guided by current goal value or action-outcome contingency. We examined the functional involvement of selected thalamocortical and corticothalamic pathways connecting the dorsomedial prefrontal cortex (dmPFC) and the mediodorsal thalamus (MD) in the behaving rat. Using a chemogenetic approach to inhibit projection-defined dmPFC and MD neurons during an instrumental learning task, we show that thalamocortical and corticothalamic pathways differentially support goal attributes. Both pathways participate in adaptation to the current goal value, but only thalamocortical neurons are required to integrate current causal relationships. These data indicate that antiparallel flow of information within thalamocortical circuits may convey qualitatively distinct aspects of adaptive decision-making and highlight the importance of the direction of information flow within neural circuits.},
}

⁷¹ O'Reilly et al [385] use slightly different terminology, distinguishing between "what" and "how" rather than "what" and "where", since action selection depends on factors other than purely spatial ones; they write that "the dorsal pathway in posterior cortex is specialized for perception-for-action ("how" processing) in service to extracting perceptual signals to drive motor control, whereas the ventral pathway is specialized for perception-for-identification ("what" processing)."

⁷² Glossary of Basic Anatomical Terms

HINDBRAIN: Except for the spinal cord, the brain's lower-level structures are largely located within the HINDBRAIN, diencephalon (or interbrain), and midbrain. The hindbrain consists of the medulla oblongata, the pons, and the cerebellum, which control respiration and movement among other functions.
MIDBRAIN: The MIDBRAIN is the topmost part of the brainstem, the central connection between the brain and the spinal cord. There are three main parts of the midbrain: the colliculi, the tegmentum, and the cerebral peduncles.
ROSTRAL and CAUDAL: ROSTRAL is taken from the Latin rostrum, to mean towards the nose/beak. Rostral therefore refers to the ANTERIOR (front) aspect of the head. In this context, CAUDAL refers to the POSTERIOR (back) of the head. The hindbrain is caudal to the midbrain.
DORSAL and VENTRAL: The DORSAL (from Latin dorsum, meaning 'back') surface of an organism refers to the back, or upper side, of an organism. If talking about the skull, the dorsal side is the top. The VENTRAL surface refers to the front, or lower side, of an organism.
MEDIAL and LATERAL: MEDIAL refers to being toward the midline of the body or the median plane, which splits the body, head-to-toe, into two halves, the left and right. LATERAL is the side of the body or part of the body that is away from the middle.
PROXIMAL and DISTAL: PROXIMAL refers to something closer to the torso while DISTAL refers to parts and places away from the torso. So a finger is distal to the wrist, which is distal to the elbow, which is distal to the shoulder.
SUPERIOR and INFERIOR: These terms reference the body’s vertical axis. A body part that is higher than or above another is SUPERIOR to it; conversely, the other body part is INFERIOR to the first. So the head is superior to the neck, and the neck is superior to the torso.

⁷³ Hi Randy,

We met in the new Wu Tsai Neurosciences Institute yesterday (our first time in the new building) to discuss what we most wanted to get out of our meeting with you on Thursday. I wrote up a synopsis that I've included below.

The main theme involves our wanting a better understanding of how the pieces corresponding to major circuits, white-matter tracts and reciprocal connections come together to solve computational problems. We are particularly interested in systems we can build using off-the-shelf neural networks with back propagation so that we can exploit existing acceleration hardware. There are also some more detailed issues relating to stripes, gating in the striatum and the different roles of similar stripe-like circuits in the frontal cortex.

We also talked briefly about the AlphaStar architecture (see attached) for playing StarCraft from Vinyals et al's 2019 Nature paper. Challenged by Andrew Lampinen — one of Jay's students — in a talk I gave recently, I attempted to convince Andrew that, however complicated the StarCraft game may seem, AlphaStar falls short of a completely general architecture. It might be fun talking about this if time permits. Here's our list of topics and "simple" questions, most of it from an engineer's perspective:

Item 1: Walk us through the "action selection" cycle. Start by describing the initial contents of striatum prior to the beginning of the cycle, if that makes sense. I think we basically understand the role of the direct, indirect and hyper-direct pathways, but a quick review couldn't hurt. What we would most like to understand is the ensuing dynamics as the basal ganglia converges to a Go state, and how the activity in the striatal nuclei changes from its initial state to the altered pattern of activity just prior to the Go state.

Q: What's wrong with this picture: striatum starts each cycle with activity patterns from association cortex assigned to different blocks corresponding to stripe clusters?

Q: As a related question, just how inappropriate is it to view the gating mechanism implemented in the basal ganglia as an LSTM with input, output and forget gates?

Q: Can the targeted masking / suppression of the state vector in basal ganglia be thought of as analogous to the use of attentional mechanisms in recurrent networks?

Item 2: We could really benefit from your explanation of how information stored / maintained in the stripes of the striatum makes its way to stripe-like circuits in the frontal cortex and ends up feeding into the supplementary motor cortex en route to the primary motor cortex by way of the premotor cortex, and then by way of the descending pathway to the brainstem and locations in the spinal cord. And finally, what, if anything, does the cerebellum contribute in terms of timing or coordination by modifying the signal generated in the primary motor cortex?

Q: As a crude approximation, what's wrong with implementing clusters in the striatum and frontal cortex as LSTMs connected by an identity matrix as a "bus"?

Item 3: It wouldn't hurt if you could summarize what we know about the location of stripes in the frontal cortex and the dynamics of how they are gated and their content maintained as part — or perhaps all — of working memory. In a presentation I gave in one of Jay McClelland’s weekly lab meetings, I suggested the possibility of modeling each stripe cluster as a multi-cell LSTM using the input, output and forget gates to control access to the stripe content. Jay was uncomfortable with this assumption and launched into a discussion of the dynamics of gating and maintaining patterns of activity using such a model. He suggested some papers written around the same time as Hazy, Frank and O'Reilly that he claimed provided evidence of very narrow timing windows involving such transfers.

Q: How can we meet the demand for rapid updates in working memory models for, say, language or math while avoiding the problem of catastrophic forgetting?

Q: Where do you see the engineering need for Hebbian learning, spiking neurons and other biologically plausible models of neural information processing?

Item 4: There's a very interesting pattern in the frontal cortex that starts with primary, premotor and supplementary motor cortex cozied up against somatosensory circuits in the parietal lobe, and then arcs forward and down through the areas in the prefrontal cortex identified by Etienne Koechlin, David Badre and others hypothesized to support hierarchical abstract executive planning and control. See here for a somewhat more detailed account of the current theories relating to these areas.

We have lots of questions relating to these areas as they appear to be good candidates for the neural correlates of abstract thinking and perhaps play a major role in the abstract thought processes that accompany writing and debugging code. I hesitate to ask for specific information about these areas; I'd prefer that you give an introduction to your general understanding of these areas and how they relate to activity in the association cortex and apparently modulate motor commands initiated in the basal ganglia.

Q: How does Fuster's hierarchy fit into your model of the cortex plus basal ganglia and cerebellum as a general architecture for hierarchical planning?

Item 5: As another example, how do you imagine the action–perception cycle works in constructing the levels in Fuster's abstraction hierarchy? Would training take place one level at a time, from the most concrete at the bottom of the hierarchy to the most abstract at the top? It would seem that, in the model Fuster had in mind, the networks implementing the reciprocal connections between the motor stack and sensory stack in each level of the hierarchy would be trained by unsupervised learning, exploiting the fact that the agent's environment completes the perception-action loop by executing actions that give rise to observable sensory stimuli. The architecture doing this seems quite simple and the required network components pretty standard, but I'm sure that the details will turn out to be more complicated.

Q: How might the developmental stages of early brain maturation in the first few years of life inform the way in which we train / learn these models?
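
To make the LSTM analogy in Items 1 through 3 concrete, here is a minimal sketch that treats each stripe as a memory slot with input, forget and output gates, and the "bus" between a striatal and a frontal slot as an identity copy. Everything in it is an assumption made for illustration: the names `Stripe`, `step` and `identity_bus` are hypothetical, each gate is a single hand-set scalar per stripe rather than a learned vector, and nothing here is meant to reproduce PBWM or actual basal ganglia dynamics.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class Stripe:
    """One working-memory slot with scalar LSTM-style gates (illustrative)."""

    def __init__(self, size: int) -> None:
        self.cell: List[float] = [0.0] * size  # maintained content

    def step(self, x: List[float], in_logit: float,
             forget_logit: float, out_logit: float) -> List[float]:
        i = sigmoid(in_logit)      # input gate: admit new content
        f = sigmoid(forget_logit)  # forget gate: near 1 keeps old content
        o = sigmoid(out_logit)     # output gate: expose content downstream
        self.cell = [f * c + i * math.tanh(v) for c, v in zip(self.cell, x)]
        return [o * math.tanh(c) for c in self.cell]


def identity_bus(signal: List[float]) -> List[float]:
    """The 'bus': an identity map, so the signal is passed on unchanged."""
    return list(signal)


OPEN, CLOSED = 10.0, -10.0  # hand-set gate logits; a real model would learn them

striatal, frontal = Stripe(3), Stripe(3)
pattern = [1.0, -1.0, 0.5]

# Update: admit the pattern into the striatal stripe and expose it.
out = striatal.step(pattern, in_logit=OPEN, forget_logit=CLOSED, out_logit=OPEN)
# Transfer over the bus and load the frontal stripe.
frontal.step(identity_bus(out), in_logit=OPEN, forget_logit=CLOSED, out_logit=CLOSED)
stored = list(frontal.cell)

# Maintain: with the input gate closed, a distractor barely perturbs the content.
distractor = [-1.0, 1.0, -0.5]
frontal.step(distractor, in_logit=CLOSED, forget_logit=OPEN, out_logit=CLOSED)
drift = max(abs(a - b) for a, b in zip(stored, frontal.cell))
```

Note that the bus is an identity map but the stored content is still squashed by `tanh` at each stage, so the frontal stripe holds a compressed copy of the pattern with its signs preserved. Jay's concern about narrow timing windows is not addressed by this sketch, since the gates here switch instantaneously between steps.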

⁷⁴ Here's a simple procedure for inner speech: (a) ingest an utterance in the form of a sequence of words, (b) create a thought in the form of an embedding of the utterance, (c) generate a new thought in the process of trying to understand the original one, (d) generate a new utterance from the new thought, and (e) iterate. Depending on how you're feeling, it could be a recipe for ruminative thinking.
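
Steps (a) through (e) can be sketched as a loop over three pluggable functions. This is only a toy under stated assumptions: `inner_speech`, `encode`, `think` and `decode` are hypothetical names, the "embedding" is just a tuple of lowercased words, and the "thinking" step follows a made-up association table (`ASSOC`) rather than any learned model.

```python
from typing import Callable, Dict, List, Tuple

Thought = Tuple[str, ...]  # stand-in for an embedding vector


def inner_speech(utterance: str,
                 encode: Callable[[str], Thought],
                 think: Callable[[Thought], Thought],
                 decode: Callable[[Thought], str],
                 steps: int) -> List[str]:
    """Run the loop and return every utterance produced along the way."""
    transcript = [utterance]
    for _ in range(steps):                       # (e) iterate
        thought = encode(transcript[-1])         # (a) ingest, (b) embed
        new_thought = think(thought)             # (c) derive a new thought
        transcript.append(decode(new_thought))   # (d) verbalize it
    return transcript


# Hypothetical associations, chosen to form a cycle.
ASSOC: Dict[str, str] = {"deadline": "stress", "stress": "sleep", "sleep": "deadline"}


def encode(utterance: str) -> Thought:
    return tuple(utterance.lower().split())


def think(thought: Thought) -> Thought:
    return tuple(ASSOC.get(word, word) for word in thought)


def decode(thought: Thought) -> str:
    return " ".join(thought)


transcript = inner_speech("deadline", encode, think, decode, steps=3)
```

Because the association table is cyclic, the transcript returns to its starting point after three steps, which is one way to picture the ruminative case.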

⁷⁵ Existential angst: In the midst of reading over what I wrote, I came to have the following thoughts and ask the following questions: Who am I? Or perhaps I should ask, What am I? I am a tool making and tool using mammal with the gift of a language that I share with billions of other humans living and dead that are physically like me but cognitively very different depending on what they have heard and read. I don't know what I would have become without language. I imagine that it might be better never to have had language than to have acquired it and then lost it.

⁷⁶ "Vladimir Betz was the first to note a major functional dichotomy of structures along the nerve axis. Posterior structures are largely devoted to sensory functions, anterior structures to motor functions. This division of labor is most obvious in the spinal cord. We can also discern it in the cerebral cortex, however, if we expand sensory functions to include representations acquired through the senses and motor functions to include executive representations." Source [183] see also [297, 423]

⁷⁷ "A Mentat was a profession or discipline that was developed as a replacement to the computers and thinking machines following the Butlerian Jihad that banned the creation of machines in the human mind's image, just as the Spacing Guild and the Bene Gesserit took up similar functions." SOURCE

⁷⁸ O'Reilly et al [385] posit that each stripe "is interconnected with a set of roughly 10 or more other stripes" and that "given the size of the human frontal cortex, there may be as many as 20,000 stripes within all of frontal cortex including the motor areas".

⁷⁹ In terms of evolutionary advantage, it is worth considering that machines could, in principle, communicate directly (and secretly) with one another using the electromagnetic spectrum and a shared language that could evolve much faster than human language by exploiting the power of machine learning. Machines that also share hardware, e.g., the analog of a separate primary sensory modality, could communicate even more efficiently using a shared language of thought with both (conventional) symbolic [171] and distributed components [387, 172].

⁸⁰ In the sense in which Charles Sanders Peirce used the term in his theory of semiotics, and which Terrence Deacon wrote about in his exploration of the missing simple languages that must have existed prior to the emergence of modern language [118].

⁸¹ The following excerpts were drawn from this 2016 interview with Charles Fernyhough, conducted around the time his most recent book, The Voices Within [160], was published. The interview appeared in an article written by Julie Beck for The Atlantic:

Inner speech, Fernyhough writes, "isn't bound by many of the conventions of verbal speech. For one, we can produce it much faster when we don't have to go at the pace required to use tongues and lips and voice boxes. One researcher the book cites clocks inner speech at an average pace of 4,000 words per minute — 10 times faster than verbal speech. And it's often more condensed — we don't have to use full sentences to talk to ourselves, because we know what we mean."
In response to a question about how inner speech might manifest in a young child, he answers "You have some basic intelligence, which any one-year-old baby is showing. They're able to do all sorts of things, initiate actions, work stuff out, remember stuff. But that's intelligence before language — it's prelinguistic intelligence. And then you've got this thing that comes along which is language. It's quite phenomenal how quickly most kids acquire language. The idea is not that you need language for thinking but that when language comes along, it sure is useful. It changes the way you think, it allows you to operate in different ways because you can use the words as tools. Somewhere around age two, language comes together with intelligence and bang! Something really special is created. And the thing that is created might well be unique in the universe."

⁸² I didn't know for sure that the steps taken would fix the problem, but I was able to execute them while hardly thinking about the details, carrying out each individual step as fast as my hands could type, and I distinctly remember the satisfying (but fleeting) feeling of efficiently solving the problem and getting on with my work, thereby avoiding a lengthy and generally discouraging debugging effort.

⁸³ Forget about rewards; how about punishment, pain and failure? I find it interesting that most animals other than humans are susceptible to second-order Pavlovian conditioning. Actually I just wrote that to get your attention; I'm not sure if it is true, and, having done a small amount of research attempting to ascertain one way or the other, I'm not sure I even know what it would mean for the statement to be true. So, if you happen to know and have a good reference, please send it along and I can rewrite this footnote.

⁸⁴ The term "intention" as employed here is more in line with "proposal" as it doesn't carry the weight of "commitment" nor the baggage of "consciousness". "From executive set or motor attention derives the intention to act, although the latter may appear to precede the former. The conscious intention to perform an act, whether at the start of a sequence of actions or in the middle of it, may in fact precede the preparatory set of the motor apparatus for the act." Page 393 in [184]

⁸⁵ The following is intended to serve as a primer, reference and review on the basal ganglia and cerebellum for students taking CS379C in the Spring of 2020. It primarily consists of relevant excerpts from the online, open-access neuroscience course at the University of Texas at Houston, written by James Knierim, Ph.D., Department of Neuroscience, The Johns Hopkins University. Use the links below to access the rest of the content.

Chapter 4: Basal Ganglia

Motor functions

The function of the basal ganglia in motor control is not understood in detail. It appears that the basal ganglia is involved in the enabling of practiced motor acts and in gating the initiation of voluntary movements by modulating motor programs stored in the motor cortex and elsewhere in the motor hierarchy (Figure 4.6). Thus, voluntary movements are not initiated in the basal ganglia (they are initiated in the cortex); however, proper functioning of the basal ganglia appears to be necessary in order for the motor cortex to relay the appropriate motor commands to the lower levels of the hierarchy.

Recall that the major output from the basal ganglia is an inhibitory connection from the GPint (or SNr) to the thalamus (or superior colliculus). Studies of eye movements in monkeys have shed light on the function of the basal ganglia loop. Normally, the SNr neurons are tonically active, suppressing the output of the collicular neurons that control saccadic eye movements. When the direct pathway striatal neurons are excited by the cortical frontal eye fields, the SNr neurons are momentarily inhibited, releasing the collicular neurons from inhibition. This allows the appropriate collicular neurons to signal the target of the eye movement, allowing the monkey to change its gaze to a new location. The movement was initiated in the frontal eye fields; however, the proper activation of the eye movement required that collicular neurons be released from the inhibition of the basal ganglia.

What is the function of the tonic inhibitory output of the basal ganglia? Recall from the Motor Cortex chapter that stimulating the motor cortex of monkeys at various locations results in stereotyped sequences of movements, such as bringing the hand to the mouth or adopting a defensive posture. It appears that a number of "primitive" motor programs are stored in the cortex, and motor control may require the activation of these elemental motor programs in the precise temporal order to accomplish a sophisticated motor plan. It is important that only one motor program be active at a given time, however, such that one motor act (e.g., use hand to bring food to the mouth) is not competing with a conflicting motor act (e.g., use hand to shield face from dangerous object). It is thought that the basal ganglia is normally active in suppressing inappropriate motor programs, and that activation of the direct pathway temporarily releases one motor program from inhibition, enabling it to be executed by the organism. Thus, the basal ganglia act as a gate that enables the execution of automatic programs in the hierarchy.
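
The gating-by-disinhibition idea in the two paragraphs above can be sketched as a small rate model. This is an illustration under simplifying assumptions, not a model from the course text: `gate_output` is a hypothetical name, activities are unitless rates, each motor program is an independent channel, and the indirect and hyperdirect pathways are left out entirely.

```python
from typing import List


def gate_output(cortical_drive: List[float],
                striatal_go: List[float],
                tonic: float = 1.0) -> List[float]:
    """Return downstream (thalamic or collicular) activity per channel.

    cortical_drive: excitatory input that proposes each motor program
    striatal_go:    direct-pathway activity per channel, which inhibits SNr/GPint
    tonic:          baseline SNr/GPint firing that suppresses every channel
    """
    output = []
    for drive, go in zip(cortical_drive, striatal_go):
        snr = max(0.0, tonic - go)              # direct pathway inhibits SNr/GPint
        output.append(max(0.0, drive - snr))    # SNr/GPint inhibits the target
    return output


# Two conflicting programs are proposed with equal strength.
proposals = [0.8, 0.8]

# With no direct-pathway activity, tonic inhibition blocks both.
blocked = gate_output(proposals, striatal_go=[0.0, 0.0])

# Exciting the direct pathway for channel 0 releases only that program.
released = gate_output(proposals, striatal_go=[1.0, 0.0])
```

The point of the sketch is that the striatal signal never adds drive to the selected program; it only removes inhibition, so the movement is still initiated by the cortical proposal.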

Which motor programs should be released from inhibition at a given moment? The basal ganglia may have a major role in learning what motor acts result in rewards for the organism. This information is provided by the dopaminergic neurons of the SNc and ventral tegmental nucleus. Recordings from these neurons in monkeys have shown that they tend to respond when the monkey receives an unexpected reward, and they tend to be inhibited when the monkey fails to receive an expected reward (Figure 4.7). Because the net effect of activation of the nigrostriatal pathway is to excite the direct pathway and inhibit the indirect pathway, this pattern of dopaminergic firing may be involved in tuning the relative balance of direct/indirect pathway activity to enhance the firing of cortical motor programs that produce rewarding outcomes and to suppress the activity of motor programs that do not result in reward. In this way, motor habits can be constructed that tend to reward the animal.
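
A minimal sketch of the dopamine-driven tuning described above, for a single motor program. It assumes a simple reward prediction error (reward minus learned expectation) as the dopamine signal and applies it with opposite sign to a direct ("Go") and an indirect ("NoGo") weight. The function name `run_trials`, the learning rate and the initial weights are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def run_trials(rewards: List[float], lr: float = 0.2) -> Tuple[List[float], float, float]:
    """Return the dopamine-like signal on each trial and the final weights."""
    value = 0.0          # learned reward expectation for this action
    go, nogo = 0.5, 0.5  # direct ("Go") and indirect ("NoGo") pathway weights
    deltas = []
    for reward in rewards:
        delta = reward - value              # burst (> 0) or dip (< 0)
        go = max(0.0, go + lr * delta)      # bursts strengthen the direct pathway
        nogo = max(0.0, nogo - lr * delta)  # and weaken the indirect pathway
        value += lr * delta                 # expectation tracks recent rewards
        deltas.append(delta)
    return deltas, go, nogo


# 50 rewarded trials followed by one trial on which the reward is omitted.
deltas, go, nogo = run_trials([1.0] * 50 + [0.0])
```

The signal is large on the first, unexpected reward, fades toward zero as the reward becomes predicted, and turns negative when the expected reward is withheld, matching the qualitative pattern reported for the monkey recordings.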

Cognitive functions

As mentioned earlier, there are a number of cortical loops through the basal ganglia that involve prefrontal association cortex and limbic cortex. Through these loops, the basal ganglia are thought to play a role in cognitive function that is similar to their role in motor control. That is, the basal ganglia are involved in selecting and enabling various cognitive, executive, or emotional programs that are stored in these other cortical areas. Moreover, the basal ganglia appear to be involved in certain types of learning. For example, in rodents the striatum is necessary for the animal to learn certain stimulus-response tasks (e.g., make a right turn if stimulus A is present and make a left turn if stimulus B is present). Recordings from rat striatal neurons show that early in training, striatal neurons fire at many locations while a rat learns such a task on a T-shaped maze (Figure 4.8). This suggests that initially the striatum is involved throughout the execution of the task. As the animal learns the task and becomes exceedingly good at its performance, the striatal neurons change their activity patterns, firing only at the beginning of the trial and at the end. It appears that the learned programs to solve this task are now stored elsewhere; the firing of the striatal neurons at the beginning of the maze presumably reflects the enabling of the appropriate motor/cognitive plan in the cortex, and the firing at the end of the maze is presumably involved in evaluating the reward outcome of the trial.

Chapter 5: Cerebellum

Glossary Entries:

Forward models infer the state resulting from executing a given motor command in the current state.

Inverse models infer a motor command that transforms the current state into a selected target state.
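As a minimal sketch of the two glossary entries, consider a one-dimensional plant whose position changes by a fixed gain times the motor command. The plant, the value of `GAIN` and the function names are assumptions made purely for illustration.

```python
# Toy 1-D plant (assumed): position changes by GAIN * command in one time step.
GAIN = 0.5


def forward_model(state, command):
    """Predict the state that results from executing command in state."""
    return state + GAIN * command


def inverse_model(state, target):
    """Infer the command that moves the plant from state to target."""
    return (target - state) / GAIN


command = inverse_model(state=2.0, target=5.0)
predicted = forward_model(2.0, command)
```

Composing the two models recovers the target, which is a useful sanity check when both are learned rather than written down by hand.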

Historically, the cerebellum has been considered a motor structure, because cerebellar damage leads to impairments in motor control and posture and because the majority of the cerebellum’s outputs are to parts of the motor system. Motor commands are not initiated in the cerebellum; rather, the cerebellum modifies the motor commands of the descending pathways to make movements more adaptive and accurate. The cerebellum is involved in the following functions:

Coordination of voluntary movements. Most movements are composed of a number of different muscle groups acting together in a temporally coordinated fashion. One major function of the cerebellum is to coordinate the timing and force of these different muscle groups to produce fluid limb or body movements.

Motor learning. The cerebellum is important for motor learning. The cerebellum plays a major role in adapting and fine-tuning motor programs to make accurate movements through a trial-and-error process (e.g., learning to hit a baseball).

Cognitive functions. Although the cerebellum is best understood in terms of its contributions to motor control, it is also involved in certain cognitive functions, such as language. Thus, like the basal ganglia, the cerebellum is historically considered part of the motor system, but its functions extend beyond motor control in ways that are not yet well understood.

The cerebellum receives extensive sensory input, and it appears to use this input to guide movements in both a feedback and feedforward control manner.

Feedforward Controller

In a feedforward control system, when a desired output is sent to the controller, the controller evaluates sensory information about the environment and about the system itself before the output commands are generated. It uses the sensory information to program the best set of instructions to accomplish the desired output. However, in a pure feedforward system, once the commands are sent, there is no way to alter them (i.e., there is no feedback loop). The advantage of a feedforward system is that it can produce the precise set of commands for the effector without needing to constantly check the output and make corrections during the movement itself. The main disadvantage, however, is that the feedforward controller requires a period of trial-and-error learning before it can function properly. In most biological systems, it is hard (perhaps impossible) to pre-program all of the possible sensory conditions that the controller may encounter during the life of the organism. Furthermore, the environment and conditions under which actions are made are constantly changing, and the feedforward controller must be able to adapt its output commands to account for these changes.

Feedback Controller

In a feedback controller, a desired output is compared continuously with the actual output, and adjustments are made during the execution of the movement until the actual movement matches the desired movement. Feedback control systems can produce very accurate outputs; however, in general they are slow. In order to change the output, the effector must wait until information is transmitted from the sensor to the comparator and then to the effector. At this point, another comparison is made, and the process continues.
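The trade-off between the two controller types can be sketched with a toy plant whose output is a gain times the command. Everything here (the linear plant, the proportional correction factor `k`, the number of steps) is an assumption chosen to keep the example small; it is not a model of any biological circuit.

```python
def plant(command, gain):
    """Toy effector: output is proportional to the command."""
    return gain * command


def feedforward(desired, estimated_gain):
    """Compute the command once from an internal estimate; no later correction."""
    return desired / estimated_gain


def feedback(desired, true_gain, k=0.2, steps=30):
    """Repeatedly compare desired and actual output and correct the command."""
    command = 0.0
    errors = []
    for _ in range(steps):
        error = desired - plant(command, true_gain)
        errors.append(abs(error))
        command += k * error
    return command, errors


TRUE_GAIN = 2.0
ff_calibrated = plant(feedforward(1.0, estimated_gain=2.0), TRUE_GAIN)
ff_stale = plant(feedforward(1.0, estimated_gain=1.0), TRUE_GAIN)
fb_command, fb_errors = feedback(1.0, TRUE_GAIN)
fb_output = plant(fb_command, TRUE_GAIN)
```

With a correct internal estimate the feedforward command is right immediately, but a stale estimate produces an error that is never corrected. The feedback loop needs no estimate of the gain and still converges, at the cost of many compare-and-correct cycles.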

The cerebellar involvement in the vestibuloocular reflex (VOR) may be explained in terms of the learning requirements of a feedforward controller. When the head moves, a compensatory eye movement must be made to maintain a stable gaze. The cerebellum receives sensory input from the vestibular system informing it that the head is moving. It also receives input from eye muscle proprioceptors and other relevant sources of information about current conditions in order to make an accurate compensatory eye movement. It evaluates all of this advance sensory information and calculates the proper eye movement to exactly counterbalance the head movement. What if the eye movement does not match the head movement, however, and the visual image moves across the retina (such as in the experimental condition in which a prism is worn, or in a real-life situation in which an individual wears new prescription eyeglasses)? The retinal slip constitutes an error signal that tells the cerebellum to adjust the eye movement the next time these conditions are met, so as to decrease the retinal slip. This trial-and-error sequence will be repeated until the movement is properly calibrated; moreover, these mechanisms will ensure that the movements stay calibrated.
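A minimal sketch of this calibration loop treats the VOR as a single scalar gain that is nudged after every head movement in proportion to the retinal slip. The scalar-gain simplification, the learning rate and the made-up head-velocity sequence are all assumptions for illustration.

```python
def adapt_vor_gain(required_gain, gain=1.0, lr=0.5, trials=50):
    """Adjust a scalar VOR gain until retinal slip vanishes.

    required_gain is the gain that would stabilize the image, for example
    a value other than 1.0 after putting on new eyeglasses (assumed).
    """
    head_velocities = [1.0, -0.5, 0.8, -1.0]   # repeating, made-up sequence
    for t in range(trials):
        head = head_velocities[t % len(head_velocities)]
        eye = -gain * head                 # feedforward compensatory command
        ideal_eye = -required_gain * head  # command that would cancel image motion
        slip = eye - ideal_eye             # retinal slip, the error signal
        gain += lr * slip * head           # correct the gain for next time
    return gain


adapted = adapt_vor_gain(required_gain=1.6)
```

No correction happens during a movement; the slip from one movement only changes the command issued for the next, which is what makes this a feedforward controller that learns.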

As another example, the coordination of movements requires that muscle groups be activated in precise temporal sequence. Not only do the different joints need to be coordinated temporally, but even antagonist muscles that control the same joint need precise temporal coordination. For example, an extensor muscle needs to be activated to start a reaching movement, and the corresponding flexor muscle needs to be activated at the end of the movement to stop the movement appropriately. The precise timing of muscle contractions and the force necessary for each contraction vary with the amount of load placed on a muscle, as well as with the inherent properties of the muscle itself (e.g., elasticity). These variables are constantly changing throughout life, as one grows, gains or loses weight, and ages. Moreover, a similar movement will require different patterns of motor activity depending on the weight being borne by the muscle (for example, if an extended hand is empty or holding a heavy weight). The cerebellum appears necessary for the proper timing and coordination of muscle groups, very likely through the trial-and-error learning mechanism discussed previously.

It is believed that the mossy fiber inputs to the cerebellum convey the sensory information used to evaluate the overall sensory context of the movement. Mossy fibers are known to respond to sensory stimuli; they are also correlated with different movements (Figure 5.11). These fibers convey such information as: Where are the appropriate body parts (proprioceptors), what is the current load on the muscle (proprioceptors, somatosensory receptors, etc.), what other sensory information can predict a useful response (e.g., the tone in the eye blink conditioning), what are the desired movements (motor cortex). The error signal is believed to be conveyed by the climbing fiber inputs. Climbing fibers are known to be especially active when an unexpected event occurs, such as when a greater load than expected is placed on a muscle or when a toe is stubbed. Thus, the large divergence of input from the mossy fibers to the granule cells to the parallel fibers is believed to create complex representations of the entire sensory context at present and the desired motor output. When the desired output is not achieved, the climbing fibers signal this error and trigger a calcium spike in the Purkinje cell. The influx of calcium changes the connection strengths between parallel fibers and Purkinje cells, such that the next time the same behavioral context occurs, the motor output will be modified to more closely approximate the desired output.
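The learning rule sketched in this paragraph resembles an error-driven delta rule, and a small example makes the resemblance concrete. The sketch below assumes binary parallel-fiber patterns for two behavioral contexts, a linear Purkinje-cell output, and a signed climbing-fiber error; signs and cell types are deliberately simplified, so it should not be read as a biophysical model.

```python
def respond(weights, parallel_fibers):
    """Linear stand-in for the output driven by a parallel-fiber pattern."""
    return sum(w * x for w, x in zip(weights, parallel_fibers))


def train_purkinje(contexts, desired, lr=0.2, epochs=200):
    """Adjust parallel-fiber weights whenever the climbing fiber reports an error."""
    weights = [0.0] * len(contexts[0])
    for _ in range(epochs):
        for pf, target in zip(contexts, desired):
            climbing_fiber = respond(weights, pf) - target   # signed error
            # Only synapses active in this context are changed.
            weights = [w - lr * climbing_fiber * x for w, x in zip(weights, pf)]
    return weights


# Hypothetical parallel-fiber codes; the third fiber is shared by both contexts.
EMPTY_HAND = [1, 0, 1, 0]
HEAVY_LOAD = [0, 1, 1, 0]
weights = train_purkinje([EMPTY_HAND, HEAVY_LOAD], desired=[0.3, 0.9])
```

After training, the same circuit produces a different output in each context, which mirrors the empty-hand versus heavy-weight example given earlier.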

Hybrid Controller

The cerebellum may act as a feedback control system for slow movements and a feedforward controller for fast movements. When it functions as a feedforward controller, the mossy fibers may provide information regarding the desired output from motor cortex and the advance sensory information about the state of the world and the body. The climbing fibers may convey information about movement errors, which provides a teaching signal such that the cerebellum is more likely to produce the correct movement the next time the output is desired.

Neuroscience Online: The Open-Access Neuroscience Electronic Textbook

Section 3: Motor Systems

Chapter 1: Motor Units and Muscle Receptors, James Knierim, Ph.D.
Chapter 2: Spinal Reflexes and Descending Motor Pathways, James Knierim, Ph.D.
Chapter 3: Motor Cortex, James Knierim, Ph.D.
Chapter 4: Basal Ganglia, James Knierim, Ph.D.
Chapter 5: Cerebellum, James Knierim, Ph.D.

⁸⁶ "The basal ganglia are specialized for learning from reward/punishment signals, in comparison to expectations for reward/punishment, and this learning then shapes the action selection that the organism will make under different circumstances (selecting the most rewarding actions and avoiding punishing ones; Figure 7.1). This form of learning is called reinforcement learning. The cerebellum is specialized for learning from error, specifically errors between the sensory outcomes associated with motor actions, relative to expectations for these sensory outcomes associated with those motor actions. Thus, the cerebellum can refine the implementation of a given motor plan, to make it more accurate, efficient, and well-coordinated.
There is a nice division of labor here, where the basal ganglia help to select one out of many possible actions to perform, and the cerebellum then makes sure that the selected action is performed well. Consistent with this rather clean division of labor, there are no direct connections between the basal ganglia and cerebellum -- instead, each operates in interaction with various areas in the cortex, where the action plans are formulated and coordinated. Both basal ganglia and cerebellum are densely interconnected with the frontal cortex, including motor control areas in posterior frontal cortex, and the prefrontal cortex anterior to those." Source: Chapter 7. Motor Control and Reinforcement Learning [385]

⁸⁷ Here, we review recent data showing that the basal ganglia and the cerebellum are interconnected at the subcortical level. The subthalamic nucleus in the basal ganglia is the source of a dense disynaptic projection to the cerebellar cortex. Similarly, the dentate nucleus in the cerebellum is the source of a dense disynaptic projection to the striatum. These observations lead to a new functional perspective that the basal ganglia, the cerebellum and the cerebral cortex form an integrated network. This network is topographically organized so that the motor, cognitive and affective territories of each node in the network are interconnected. This perspective explains how synaptic modifications or abnormal activity at one node can have network-wide effects. A future challenge is to define how the unique learning mechanisms at each network node interact to improve performance. Source: "The basal ganglia and the cerebellum: Nodes in an integrated network" [61]

⁸⁸ "Motor adaptation to perturbations is provided by learning mechanisms operating in the cerebellum and basal ganglia. The cerebellum normally performs motor adaptation through supervised learning using information about movement error provided by visual feedback. However, if visual feedback is critically distorted, the system may disengage cerebellar error-based learning and switch to reinforcement learning mechanisms mediated by basal ganglia. Yet, the exact conditions and mechanisms of cerebellum and basal ganglia involvement in motor adaptation remain unknown. We use mathematical modeling to simulate control of planar reaching movements that relies on both error-based and non-error-based learning mechanisms. We show that for learning to be efficient only one of these mechanisms should be active at a time. We suggest that switching between the mechanisms is provided by a special circuit that effectively suppresses the learning process in one structure and enables it in the other. To do so, this circuit modulates learning rate in the cerebellum and dopamine release in basal ganglia depending on error-based learning efficiency. We use the model to explain and interpret experimental data on error- and non-error-based motor adaptation under different conditions." From the abstract of [486]

⁸⁹ "MERLIN optimises its representations and learns to store information in memory based on unsupervised prediction. MERLIN has two basic components: a memory-based predictor (MBP) and a policy. The MBP is responsible for compressing observations into low-dimensional state representations z, which we call state variables, and storing them in memory. The state variables in memory in turn are used by the MBP to make predictions guided by past observations. This is the key thesis driving our development: an agent's perceptual system should produce compressed representations of the environment; predictive modeling is a good way to build those representations; and the agent's memory should then store them directly. The policy can primarily be the downstream recipient of those state variables and memory contents." Excerpt from [523]

⁹⁰ Understanding of the communications between basal ganglia and cerebellum prior to the work reviewed in Bostan and Strick [61]:

Learning and control operations in a Distributed Processing Module (DPM) model of the basal ganglia and cerebellum: "Hebbian learning occurs in cerebral cortex. The control operation is pattern formation. Reinforcement learning occurs in basal ganglia (BG). The main control operation is pattern classification, which occurs in the striatum on cortical and thalamic input to spiny projection neurons (SPNs). Through direct (disinhibition) and indirect (inhibition of disinhibition) pathways, a coarse selection of goals is briefly stored in reciprocal corticothalamic pathways. Supervised learning occurs in the refinement stage in the cerebellar cortex, through depression of parallel fiber/Purkinje cell synapses. The positive feedback loop between the cerebellar nucleus and cerebral cortex is a working memory that is regulated by potent inhibition from Purkinje cells." Source: James Houk [81, 233]

⁹¹ The cerebellum and basal ganglia are reciprocally connected with the cerebral cortex, forming many loops that function as distributed processing modules. Here we present a detailed model of one microscopic loop between the motor cortex and the cerebellum, and we show how small arrays of these microscopic loops (CB modules) can be used to generate biologically plausible motor commands for controlling movement. A fundamental feature of CB modules is the presence of positive feedback loops between the cerebellar nucleus and the motor cortex. We use nonlinear dynamics to model one microscopic loop and to investigate its bistable properties. Simulations demonstrate an ability to program a motor command well in advance of command generation and an ability to vary command duration. However, control of command intensity is minimal, which could interfere with the control of movement velocity. From the abstract of [518]
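To give a feel for what bistability in a positive feedback loop buys you, here is a minimal sketch with a single rate variable that excites itself through a sigmoid. It is not the model in [518]; the equation, the weight `w`, the bias and the pulse amplitudes are all assumptions chosen so that the loop has one low and one high stable state.

```python
import math


def simulate_loop(inputs, x=0.0, w=10.0, bias=-5.0, dt=0.1):
    """Euler simulation of dx/dt = -x + sigmoid(w * x + bias + drive)."""
    trace = []
    for drive in inputs:
        rate = 1.0 / (1.0 + math.exp(-(w * x + bias + drive)))
        x += dt * (-x + rate)
        trace.append(x)
    return trace


rest = [0.0] * 100
excite = [6.0] * 30      # brief excitatory pulse switches the loop on
inhibit = [-6.0] * 30    # brief inhibitory pulse switches it off again
trace = simulate_loop(rest + excite + rest + inhibit + rest)

before, held, after = trace[99], trace[229], trace[359]
```

The activity stays high long after the excitatory pulse has ended and only returns to baseline after the inhibitory pulse, so the interval between the two pulses sets the duration of the sustained command.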

⁹² The cerebellum model described in Todorov et al [486] relies on a nonlinear Kalman filter for state estimation. Contrast with the model developed by Masao Ito [238] and consider how the MERLIN model developed by Wayne et al [523] might be adapted for this purpose.

⁹³ [The physical] body image [required to specify an action] is generated by somatosensory, proprioceptive and visual inputs to the posterior parietal cortex ... Two areas are of particular interest in the posterior parietal cortex: area 5, which is a target of inputs from the primary somatosensory cortical areas 3, 1, and 2; and area 7, which is a target of higher-order visual cortical areas such as MT. [...] The parietal lobes are extensively interconnected with regions in the anterior frontal lobes that in humans are thought to be important for abstract thought, decision-making, and anticipating the consequences of action. These prefrontal areas, along with the posterior parietal cortex, represent the highest levels of the motor control hierarchy, where decisions are made about what actions to take and the likely outcome. Prefrontal and parietal cortex both send axons that converge on area 6. Recall that areas 6 and 4 together contribute most of the axons to the descending corticospinal tract. Thus area 6 lies at the junction where signals encoding what actions are desired are converted into signals that specify how the actions will be carried out. Source: Page 461 [51]

⁹⁴ This comment applies most directly to the ventral visual stream (the "what" pathway). The dorsal stream (the "where" pathway) codes for spatial features that represent three-dimensional motion:

In the dorsal stream, this hierarchical processing has classically been thought to lead to the computation of complex motion in three dimensions. However, there is evidence to suggest that there is integration of both dorsal and ventral stream information into motion computation processes, giving rise to intermediate object representations, which facilitate object selection and decision making mechanisms in the dorsal stream [398]. (SOURCE)

⁹⁵ The motor cortex encodes representations of body motions ranging from simple to complex integrating many aspects of movement relating to muscles, joint angles, relative body orientation, etc:

Of the three motor cortex areas, stimulation of the primary motor cortex requires the least amount of electrical current to elicit a movement. Low levels of brief stimulation typically elicit simple movements of individual body parts. Stimulation of premotor cortex or the supplementary motor area requires higher levels of current to elicit movements, and often results in more complex movements than stimulation of primary motor cortex. Stimulation for longer time periods (500 msec) in monkeys results in the movement of a particular body part to a stereotyped posture or position, regardless of the initial starting point of the body part (Figure 3.2). Thus, the premotor cortex and supplementary motor areas appear to be higher level areas that encode complex patterns of motor output and that select appropriate motor plans to achieve desired end results. SOURCE

⁹⁶ Aflalo and Graziano [2] employ a self-organizing Kohonen network to generate an artificial motor cortex embedding (map) that optimizes local continuity to accommodate these conflicting mapping requisites.

The resultant hybrid map contained many features seen in actual motor cortex, including the following: a rough, overlapping somatotopy; a posterior strip in which simpler movements were represented and more somatotopic segregation was observed, and an anterior strip in which more complex, multisegmental movements were represented and the somatotopy was less segregated; a clustering of different complex, multisegmental movements into specific subregions of cortex that resembled the arrangement of subregions found in the monkey; three hand representations arranged on the cortex in a manner similar to the primary motor, dorsal premotor, and ventral premotor hand areas in the monkey; and maps of hand location that approximately matched the maps observed in the monkey. (SOURCE)

⁹⁷ "Using high-resolution whole-brain fMRI, we observed the activation of a restricted and consistent network of brain areas whenever mathematicians engaged in high-level mathematical reflection. This network comprised bilateral intraparietal, inferior temporal, and dorsal prefrontal sites. It was activated by all domains of mathematics tested (analysis, algebra, topology, and geometry) and even, transiently, by meaningless mathematical statements. It remained silent, however, to nonmathematical statements of matched complexity. Instead, such problems activated distinct bilateral anterior temporal and angular regions. [...] Our main goal was to explore the relationships between high-level mathematics, language, and core number networks. In mathematicians, we found essentially no overlap of the math-responsive network with the areas activated by sentence comprehension and general semantic knowledge. We observed, however, a strong overlap and within-subject similarity of the math-responsive network with parietal and inferior temporal areas activated during arithmetic calculation and number recognition [18, 132]." (SOURCE)

⁹⁸ "Within each functional PFC area, there is some interesting topographic organization of neurons into hypercolumns, macrocolumns or stripes (each of these terms is typically associated with a similar type of neural organization, but in different parts of the cortex, with stripes being specific to the PFC). In all areas of cortex, one can identify the smallest level of neural topological organization as a cortical column or microcolumn (to more clearly distinguish it from the larger macrocolumn), which contains roughly 20 pyramidal neurons in a region that is roughly 50 microns across. A stripe contains roughly 100 of these microcolumns, generally organized in an elongated shape that is roughly 5 microcolumns wide (250 microns) by 20 microcolumns long (1000 microns or 1 millimeter). Each such stripe is interconnected with a set of roughly 10 or more other stripes, which we can denote as a stripe cluster. Given the size of the human frontal cortex, there may be as many as 20,000 stripes within all of frontal cortex (including motor areas)." Excerpt from Page 149 in O'Reilly et al [385]
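The figures in this excerpt are easy to cross-check with a few lines of arithmetic. The constants below are taken from the quotation; the final total is only what those round numbers imply when multiplied together, not a figure stated in [385].

```python
# Figures quoted from the excerpt above.
NEURONS_PER_MICROCOLUMN = 20
MICROCOLUMN_WIDTH_UM = 50
STRIPE_SHAPE = (5, 20)                  # microcolumns wide x microcolumns long
STRIPES_IN_FRONTAL_CORTEX = 20_000

microcolumns_per_stripe = STRIPE_SHAPE[0] * STRIPE_SHAPE[1]
neurons_per_stripe = microcolumns_per_stripe * NEURONS_PER_MICROCOLUMN
stripe_size_um = tuple(n * MICROCOLUMN_WIDTH_UM for n in STRIPE_SHAPE)

# Implied by the quoted round numbers, not stated in the source.
pyramidal_neurons_in_stripes = STRIPES_IN_FRONTAL_CORTEX * neurons_per_stripe
```

The stripe dimensions come out at 250 by 1000 microns, matching the excerpt, and each stripe holds on the order of two thousand pyramidal neurons.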

⁹⁹ "Our model [292] focuses on functional specializations within the BG that we have previously proposed implement a dynamic, adaptive gating mechanism for regulating the updating, maintenance, and output of information in the PFC. Importantly, the model assumes that separate pathways through the BG can independently regulate different subregions of the PFC. This is consistent with the anatomy of both the PFC and BG. Within the PFC, relatively isolated stripe-like patches of neurons have been observed, each of which exhibits dense within-stripe interconnectivity and sparse (largely inhibitory) between-stripe connectivity. Furthermore, these stripes project to distinct regions within the BG which, in turn, project back to different stripes within the PFC. In previous modeling work, we have shown that this anatomy and physiology can support a system of control in which gating signals from the BG regulate when a given stripe within the PFC will either (i) update to encode new information, (ii) continue to robustly maintain information in an active state (through sustained neural firing), or (iii) output the information it is currently encoding to drive information processing elsewhere in the brain. That is, it provides a separation of function that allows different signals to control when and where information is encoded or used (BG) compared with the signals that encode the information content itself (PFC). Furthermore, we have shown that if the output of one PFC stripe controls the BG gating signal that regulates a different stripe, this architecture can support a system of hierarchically nested control. This, in turn, can be used to separate the representation of variables and their values, upon which a mechanism for indirection can be built." (SOURCE)
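The three gating outcomes listed in the excerpt (update, maintain, output) can be sketched as a tiny state machine. The `Stripe` class and the string-valued gate signals below are hypothetical names introduced for illustration; in the model the gate is learned by the BG rather than supplied by hand.

```python
class Stripe:
    """A PFC stripe whose content changes only when the BG gate says so."""

    def __init__(self):
        self.content = None

    def step(self, gate, new_input=None):
        if gate == "update":      # (i) encode new information
            self.content = new_input
            return None
        if gate == "output":      # (iii) drive processing elsewhere
            return self.content
        return None               # (ii) "maintain": ignore the input


# Separate BG pathways gate each stripe independently.
stripes = {"A": Stripe(), "B": Stripe()}
stripes["A"].step("update", "red")
stripes["B"].step("update", "square")
stripes["A"].step("maintain", "blue")   # distractor is not stored
stripes["B"].step("update", "circle")   # B is overwritten, A is untouched
read_a = stripes["A"].step("output")
read_b = stripes["B"].step("output")
```

The point of the sketch is the separation of concerns: the gate decides when and where, while the stripe content carries what.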

¹⁰⁰ See Wayne et al [523] for an application of the perception-action cycle employing a variational autoencoder [274] to learn the inverse model. The authors propose "MERLIN, an integrated AI agent architecture that acts in partially observed virtual reality environments and stores information in memory based on different principles from existing end-to-end AI systems: it learns to process high-dimensional sensory streams, compress and store them, and recall events with less dependence on task reward. We bring together ingredients from external memory systems, reinforcement learning, and state estimation (inference) models and combine them into a unified system using inspiration from three ideas originating in psychology and neuroscience: predictive sensory coding [417], the hippocampal representation theory of Gluck and Myers [197], and the temporal context model and successor representation [114, 468]. To test MERLIN, we expose it to a set of canonical tasks from psychology and neuroscience, showing that it is able to find solutions to problems that pose severe challenges to existing AI agents. MERLIN points a way beyond the limitations of end-to-end RL toward future studies of memory in computational agents." (SOURCE)

¹⁰¹ "Frontostriatal circuits are neural pathways that connect frontal lobe regions with the basal ganglia (striatum) that mediate motor, cognitive, and behavioural functions within the brain. They receive inputs from dopaminergic, serotonergic, noradrenergic, and cholinergic cell groups that modulate information processing. Frontostriatal circuits are part of the executive functions [423]." (SOURCE)

¹⁰² The instruction cycle (also known as the fetch–decode–execute cycle or simply the fetch-execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instructions. It is composed of three main stages: the fetch stage, the decode stage, and the execute stage. In simpler CPUs, the instruction cycle is executed sequentially, each instruction being processed before the next one is started. In most modern CPUs, the instruction cycles are instead executed concurrently, and often in parallel, through an instruction pipeline. (SOURCE)
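For readers who have not seen it spelled out, the sequential form of the cycle fits in a few lines. The accumulator machine below, with its four made-up opcodes, is purely an assumption for illustration and does not correspond to any real instruction set.

```python
def run(program):
    """Execute a list of (opcode, operand) pairs on a one-register machine."""
    acc = 0   # accumulator register
    pc = 0    # program counter
    while True:
        instruction = program[pc]        # fetch
        pc += 1
        opcode, operand = instruction    # decode
        if opcode == "LOAD":             # execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "MUL":
            acc *= operand
        elif opcode == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")


result = run([("LOAD", 2), ("ADD", 3), ("MUL", 4), ("HALT", 0)])
```

Each pass through the loop finishes one instruction before the next is fetched, which corresponds to the simpler, non-pipelined case described above.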

¹⁰³ On the difference between working memory and short-term memory as conceived by George Miller:

Short Term Memory
By Saul McLeod, published 2009

Short-term memory (STM) is the second stage of the multi-store memory model proposed by Atkinson and Shiffrin. The duration of STM seems to be between 15 and 30 seconds, and the capacity about 7 items.

Short term memory has three key aspects:

1. limited capacity (only about 7 items can be stored at a time)

2. limited duration (storage is very fragile and information can be lost with distraction or passage of time)

3. encoding (primarily acoustic, even translating visual information into sounds).

There are two ways in which capacity is tested, one being span, the other being recency effect.

The Magic number 7 (plus or minus two) provides evidence for the capacity of short term memory. Most adults can store between 5 and 9 items in their short-term memory. This idea was put forward by Miller (1956) and he called it the magic number 7. He thought that short term memory could hold 7 (plus or minus 2 items) because it only had a certain number of "slots" in which items could be stored.

However, Miller didn't specify the amount of information that can be held in each slot. Indeed, if we can "chunk" information together we can store a lot more information in our short term memory.

Miller's theory is supported by evidence from various studies, such as Jacobs (1887). He used the digit span test with every letter in the alphabet and numbers apart from "w" and "7" because they had two syllables. He found out that people find it easier to recall numbers rather than letters. The average span for letters was 7.3 and for numbers it was 9.3.

The duration of short term memory seems to be between 15 and 30 seconds, according to Atkinson and Shiffrin (1971). Items can be kept in short term memory by repeating them verbally (acoustic encoding), a process known as rehearsal.

Using the Brown-Peterson technique, which prevents rehearsal by having participants count backwards in 3s, Peterson and Peterson (1959) showed that the longer the delay, the less information is recalled. The rapid loss of information from memory when rehearsal is prevented is taken as an indication of short term memory having a limited duration.

Baddeley and Hitch (1974) have developed an alternative model of short-term memory which they call working memory.

References:

Atkinson, R. C., & Shiffrin, R. M. (1971). The control processes of short-term memory. Institute for Mathematical Studies in the Social Sciences, Stanford University.

Baddeley, A.D., & Hitch, G. (1974). Working memory. In G.H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (Vol. 8, pp. 47–89). New York: Academic Press.

Miller, G. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. The psychological review, 63, 81-97.

Peterson, L. R., & Peterson, M. J. (1959). Short-term retention of individual verbal items. Journal of experimental psychology, 58(3), 193-198.

McLeod, S. A. (2009, Dec 14). Short-term memory. Simply Psychology. https://www.simplypsychology.org/short-term-memory.html

¹⁰⁴ "GWT resembles the concept of working memory, and is proposed to correspond to a momentarily active, subjectively experienced event in working memory — the inner domain in which we can rehearse telephone numbers to ourselves or in which we carry on the narrative of our lives. It is usually thought to include inner speech and visual imagery." (SOURCE)

¹⁰⁵ GWT is composed of distributed and heavily interconnected neurons with long-range axons, and a set of specialized and modular perceptual, motor, memory, evaluative and attentional processors. Workspace neurons are mobilized in effortful tasks for which the specialized processors do not suffice; they selectively mobilize or suppress, through descending connections, the contribution of specific processor neurons. In the course of task performance, workspace neurons become spontaneously co-activated, forming discrete though variable spatio-temporal patterns subject to modulation by vigilance signals and to selection by reward signals [127].

¹⁰⁶ "PBWM is a computational model of how the prefrontal cortex learns to control both itself and other brain areas in a strategic, task-appropriate manner. These learning mechanisms are based on subcortical structures in the midbrain, basal ganglia and amygdala, which together form an actor/critic architecture. The critic system learns which prefrontal representations are task-relevant and trains the actor, which in turn provides a dynamic gating mechanism for controlling working memory updating. Computationally, the learning mechanism is designed to simultaneously solve the temporal and structural credit assignment problems." (SOURCE)

¹⁰⁷ Prefrontal cortex basal ganglia working memory (PBWM) is an algorithm that models working memory in the prefrontal cortex and the basal ganglia. It can be compared to long short-term memory (LSTM) in functionality, but is more biologically explainable. [...] It uses the primary value learned value model (PVLV) to train the prefrontal cortex working-memory updating system, based on the biology of the prefrontal cortex and basal ganglia. (SOURCE)

¹⁰⁸ "The striatum, or corpus striatum (also called the neostriatum and the striate nucleus) is a nucleus (a cluster of neurons) in the subcortical basal ganglia of the forebrain. The striatum is a critical component of the motor and reward systems; receives glutamatergic and dopaminergic inputs from different sources; and serves as the primary input to the rest of the basal ganglia. [...] A white matter nerve tract (the internal capsule) in the dorsal striatum separates the caudate nucleus and the putamen." (SOURCE)

¹⁰⁹ Whereas symbol processing is a fundamental feature of all computer systems, it remains a mystery whether and how this ability is carried out by the brain. Here, we provide an example of how the structure and functioning of the prefrontal cortex/basal ganglia working memory system can support variable binding, through a form of indirection (akin to a pointer in computer science). We show how indirection enables the system to flexibly generalize its behavior substantially beyond its direct experience (i.e., systematicity). We argue that this provides a biologically plausible mechanism that approximates a key component of symbol processing, exhibiting both the flexibility, but also some of the limitations, that are associated with this ability in humans [292, 90] (SOURCE).
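The pointer analogy can be made concrete with a small sketch in which "role" stripes store the address of a "filler" stripe rather than the filler itself. The dictionary of stripes, the stripe names and the example sentence are assumptions made for illustration; they are not taken from [292].

```python
def bind(stripes, role_stripe, filler_stripe, filler):
    """Store the filler in one stripe and a pointer to that stripe in the role stripe."""
    stripes[filler_stripe] = filler
    stripes[role_stripe] = filler_stripe   # the role holds an address, not content


def query(stripes, role_stripe):
    """Follow the pointer held in the role stripe to retrieve the filler."""
    return stripes[stripes[role_stripe]]


stripes = {}
bind(stripes, "agent", "slot_1", "Bob")
bind(stripes, "verb", "slot_2", "ate")
bind(stripes, "patient", "slot_3", "steak")
who = query(stripes, "agent")

# A word never seen in the patient role can be bound without changing query().
bind(stripes, "patient", "slot_3", "quinoa")
what = query(stripes, "patient")
```

Because the lookup procedure never depends on which word occupies a slot, it applies unchanged to novel role and filler combinations, which is the kind of systematicity the excerpt refers to.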

¹¹⁰ "The arcuate fasciculus is a white matter bundle that contains both long and short fibers that connect the frontal, parietal, and temporal lobes. The arcuate plays a key role in the left hemisphere in language processing and in the right hemisphere in visuospatial processing and some aspects of language processing, such as prosody and semantics [186]." (SOURCE)

¹¹¹ "How the motor-related cortical areas modulate the activity of the output nuclei of the basal ganglia is an important issue for understanding the mechanisms of motor control by the basal ganglia. The cortico-subthalamo-pallidal 'hyperdirect' pathway conveys powerful excitatory effects from the motor-related cortical areas to the globus pallidus, bypassing the striatum, with shorter conduction time than effects conveyed through the striatum. [...] Through this sequential information processing, only the selected motor program is initiated, executed and terminated at the selected timing, whereas other competing programs are canceled [369]." (SOURCE)

¹¹² Susan Courtney provides an excellent overview of the many sources of information that are utilized by cognitive functions supported in the frontal cortex [107]. In particular, her articulation of the role of attention and cognitive control aligns with the views that we've emphasized in class and that drive our designs:

[The circuits in the prefrontal cortex that drive goal directed planning and executive control] receive multimodal information about the current environment and have access to previously stored memories. The prefrontal cortex's extensive outputs allow for direct control of motor behavior, but they may also influence behavior indirectly by altering perceptual and cognitive representations and influencing the storage and retrieval of long-term memories.
I suggest that attention and cognitive control are not directed actions or specific processes contained within any particular set of brain regions. Rather, what we experience and observe that we call attention and cognitive control are emergent properties dependent on the distributed representation of all types of information, both that available from present perceptual input and the information currently sustained in WM, including contextual and motivational information.

¹¹³ Here are the bibliographical entries for the most relevant papers cited in the October 27th class discussion notes including all of the abstracts with PDF available on the Google Cloud volume shared with the students taking the class:

@article{DehaeneandChangeuxPNAS-97,
        title = {A hierarchical neuronal network for planning behavior},
       author = {Dehaene, Stanislas and Changeux, Jean-Pierre},
      journal = {Proceedings of the National Academy of Sciences},
       volume = 94,
        issue = 24,
         year = 1997,
        pages = {13293-13298},
     abstract = {Planning a goal-directed sequence of behavior is a higher function of the human brain that relies on the integrity of prefrontal cortical areas. In the Tower of London test, a puzzle in which beads sliding on pegs must be moved to match a designated goal configuration, patients with lesioned prefrontal cortex show deficits in planning a goal-directed sequence of moves. We propose a neuronal network model of sequence planning that passes this test and, when lesioned, fails in a way that mimics prefrontal patients' behavior. Our model comprises a descending planning system with hierarchically organized plan, operation, and gesture levels, and an ascending evaluative system that analyzes the problem and computes internal reward signals that index the correct/erroneous status of the plan. Multiple parallel pathways connecting the evaluative and planning systems amend the plan and adapt it to the current problem. The model illustrates how specialized hierarchically organized neuronal assemblies may collectively emulate central executive or supervisory functions of the human brain.},
}
@article{HazyetalPTRS-07,
       author = {Hazy, T. E. and Frank, M. J. and O'Reilly, R. C.},
        title = {Towards an executive without a homunculus: computational models of the prefrontal cortex/basal ganglia system},
      journal = {Philosophical Transactions of the Royal Society London B, Biological Science},
       volume = {362},
       number = {1485},
         year = {2007},
        pages = {1601-1613},
     abstract = {The prefrontal cortex (PFC) has long been thought to serve as an 'executive' that controls the selection of actions and cognitive functions more generally. However, the mechanistic basis of this executive function has not been clearly specified often amounting to a homunculus. This paper reviews recent attempts to deconstruct this homunculus by elucidating the precise computational and neural mechanisms underlying the executive functions of the PFC. The overall approach builds upon existing mechanistic models of the basal ganglia (BG) and frontal systems known to play a critical role in motor control and action selection, where the BG provide a 'Go' versus 'NoGo' modulation of frontal action representations. In our model, the BG modulate working memory representations in prefrontal areas to support more abstract executive functions. We have developed a computational model of this system that is capable of developing human-like performance on working memory and executive control tasks through trial-and-error learning. This learning is based on reinforcement learning mechanisms associated with the midbrain dopaminergic system and its activation via the BG and amygdala. Finally, we briefly describe various empirical tests of this framework.}
}
@article{BotvinickPTRS_B-07,
       author = {Botvinick, Matthew M.},
        title = {Multilevel structure in behaviour and in the brain: a model of {F}uster's hierarchy},
      journal = {Philosophical transactions of the Royal Society of London. Series B, Biological sciences},
    publisher = {The Royal Society},
       volume = {362},
        issue = {1485},
         year = {2007},
        pages = {1615-1626},
     abstract = {A basic question, intimately tied to the problem of action selection, is that of how actions are assembled into organized sequences. Theories of routine sequential behaviour have long acknowledged that it must rely not only on environmental cues but also on some internal representation of temporal or task context. It is assumed, in most theories, that such internal representations must be organized into a strict hierarchy, mirroring the hierarchical structure of naturalistic sequential behaviour. This article reviews an alternative computational account, which asserts that the representations underlying naturalistic sequential behaviour need not, and arguably cannot, assume a strictly hierarchical form. One apparent liability of this theory is that it seems to contradict neuroscientific evidence indicating that different levels of sequential structure in behaviour are represented at different levels in a hierarchy of cortical areas. New simulations, reported here, show not only that the original computational account can be reconciled with this alignment between behavioural and neural organization, but also that it gives rise to a novel explanation for how this alignment might develop through learning.},
}
@article{BotvinicketalCOGNITION-09,
       author = {Botvinick, M. M. and Niv, Y.  and Barto, A. G.},
        title = {Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective},
      journal = {Cognition},
       volume = {113},
       number = {3},
         year = {2009},
        pages = {262-280},
     abstract = {Research on human and animal behavior has long emphasized its hierarchical structure-the divisibility of ongoing behavior into discrete tasks, which are comprised of subtask sequences, which in turn are built of simple actions. The hierarchical structure of behavior has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. In this paper, we reexamine behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning. Specifically, we consider a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills. A close look at the components of hierarchical reinforcement learning suggests how they might map onto neural structures, in particular regions within the dorsolateral and orbital prefrontal cortex. It also suggests specific ways in which hierarchical reinforcement learning might provide a complement to existing psychological models of hierarchically structured behavior. A particularly important question that hierarchical reinforcement learning brings to the fore is that of how learning identifies new action routines that are likely to provide useful building blocks in solving a wide range of future problems. Here and at many other points, hierarchical reinforcement learning offers an appealing framework for investigating the computational and neural underpinnings of hierarchically structured behavior.}
}
@article{BotvinickandPlautPSYCHOLOGICAL-REVIEW-04,
       author = {Botvinick, Matthew and Plaut, David C.},
        title = {Doing Without Schema Hierarchies: A Recurrent Connectionist Approach to Normal and Impaired Routine Sequential Action},
      journal = {Psychological Review},
       volume = {111},
       number = {2},
         year = {2004},
        pages = {395-429},
     abstract = {In everyday tasks, selecting actions in the proper sequence requires a continuously updated representation of temporal context. Previous models have addressed this problem by positing a hierarchy of processing units, mirroring the roughly hierarchical structure of naturalistic tasks themselves. The present study considers an alternative framework, in which the representation of context depends on recurrent connections within a network mapping from environmental inputs to actions. The ability of this approach to account for human performance was evaluated by applying it, through simulation, to a specific everyday task. The resulting model learned to deal flexibly with a complex set of sequencing constraints, encoding contextual information at multiple time scales within a single, distributed internal representation. Degrading this representation led to errors resembling those observed both in everyday behavior and in apraxia. Analysis of the model's function yielded numerous predictions relevant to both normal and apraxic performance.},
}
@article{BotvinickandPlautPSYCHOLOGICAL-RESEARCH-02,
       author = {Botvinick, Matthew and Plaut, David C.},
        title = {Representing task context: proposals based on a connectionist model of action},
      journal = {Psychological Research},
       volume = {66},
       number = {4},
         year = {2002},
        pages = {298-311},
     abstract = {Representations of task context play a crucial role in shaping human behavior. While the nature of these representations remains poorly understood, existing theories share a number of basic assumptions. One of these is that task representations are discrete, independent, and non-overlapping. We present here an alternative view, according to which task representations are instead viewed as graded, distributed patterns occupying a shared, continuous representational space. In recent work, we have implemented this view in a computational model of routine sequential action. In the present article, we focus specifically on this model's implications for understanding task representation, considering the implications of the account for two influential concepts: (1) cognitive underspecification, the idea that task representations may be imprecise or vague, especially in contexts where errors occur, and (2) information-sharing, the idea that closely related operations rely on common sets of internal representations.},
}

¹¹⁴ Last week I recommended some recent work from Michael Halassa's lab at the McGovern Institute for Brain Science at MIT. Halassa has identified the slow waves produced by the thalamic reticular nucleus as a central factor in the rapid modulation of local cortical function. His group is interested in delineating the long-range, input-output connectivity patterns of single thalamic neurons and mapping the thalamocortical circuit motifs responsible for these patterns to provide a basis for building functional models. Explaining the mechanism responsible for modulating cortical function is a challenge in part because the thalamus is composed of excitatory neurons that are devoid of local recurrent excitatory connectivity [210, 209].

¹¹⁵ In the human brain, the anterior cingulate cortex (ACC) is the frontal part of the cingulate cortex. It consists of Brodmann areas 24, 32 and 33. While it participates in a wide variety of autonomic functions, our interest with respect to cognitive control is the ACC involvement in higher-level functions, including attention allocation, reward anticipation, decision-making, ethics and morality, impulse control (e.g. performance monitoring and error detection), and emotion. (SOURCE)

¹¹⁶ I've seen very little work on self reflection in the neural network community interested in biologically plausible models. I have to admit that I don't know where I would locate a "self reflection" module if I wanted to add it to such a model. This occurred to me when I started thinking about how one might become aware of having made an error on reading [67]. Of course, it seems likely there exists machinery in the brainstem or limbic system that registers errors and in doing so signals the conscious self.

The problem is that this hypothetical "self" is not likely to have any access to the low-level machinery responsible for responding to and resolving errors. In fact, if the low-level machinery is capable of both recognizing that an error has occurred and engaging the limbic system to signal other parts of the brain that might be able to take steps to avoid the error in the future, then it is likely that the necessary low-level machinery exists to resolve the problem one way or another without the intervention of the conscious self.

That said, the perception of having some control is not entirely illusory: once the error has become apparent to the conscious self, it could easily repeat whatever recent actions it might have executed, and in the process of repeating them could help the inaccessible, subcortical machinery come up with a solution. That is to say, the conscious self in the role of homunculus in residence can exert some indirect control over the low-level processes responsible for dealing with errors.

This line of reasoning came up primarily because it seems obvious that, in order to have any control over the body and its extensions, the low-level systems absolutely have to be able to determine the success or failure of the actions they initiate. From which it follows that all of the systems that require such oversight need some sort of predictive model in order to check that the consequences of actions unfold in accord with their understanding of the relevant dynamics.

As a corollary to the above observation, one would expect there to be machinery capable of comparing two outcomes, one generated by a predictive model and the other reconstructed from observations made just for this purpose. And, as it is not likely that there will be a single, all-powerful system for making arbitrary comparisons, there must be the ability to deploy a wide range of analyses, including the ability to infer various relationships between outcomes as a diagnostic aid in determining their success or failure.
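
As a rough illustration of the comparator described above, here is a minimal sketch. It assumes that an outcome can be summarized as a dictionary of named numeric features and that a fixed tolerance decides success; the function names, feature names and threshold are hypothetical and are not part of any design discussed in these notes.

```python
# Minimal sketch of comparing a predicted outcome with an observed one.
# Feature names and the tolerance are illustrative assumptions.

def compare_outcomes(predicted, observed, tolerance=0.1):
    """Return the features whose observed value departs from the prediction.

    A value of None marks a predicted feature that was never observed.
    """
    errors = {}
    for name, expected in predicted.items():
        actual = observed.get(name)
        if actual is None:
            errors[name] = None
        elif abs(actual - expected) > tolerance:
            errors[name] = actual - expected
    return errors

def action_succeeded(predicted, observed, tolerance=0.1):
    """An action counts as successful when no feature exceeds the tolerance."""
    return not compare_outcomes(predicted, observed, tolerance)

predicted = {"grip_force": 1.0, "object_height": 0.5}
observed = {"grip_force": 1.05, "object_height": 0.2}
errors = compare_outcomes(predicted, observed)
succeeded = action_succeeded(predicted, observed)
```

Returning the per-feature discrepancies, rather than a single pass/fail flag, is one simple way to support the kind of diagnostic inference mentioned above.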

Lest you think I've gone off the deep end, I should add that I wrote down the above thoughts not with the intention of implementing them in the programmer’s apprentice, but rather in an attempt to understand the larger questions surrounding conflict and error monitoring and mediation before setting off to implement something substantially simpler and more tractable for the apprentice.

¹¹⁷ Recent papers regarding the proposed rostral-to-caudal hierarchy of neural processing within the prefrontal cortex including papers both supporting and challenging the hypothesis:

@article{SchumacheretalNEUROIMAGE-19,
author = {Schumacher, F. K. and Schumacher, L. V. and Schelter, B. O. and Kaller, C. P.},
title = {Functionally dissociating ventro-dorsal components within the rostro-caudal hierarchical organization of the human prefrontal cortex},
journal = {Neuroimage},
volume = {185},
year = {2019},
pages = {398-407},
abstract = {Cognitive control is proposed to rely on a rostral-to-caudal hierarchy of neural processing within the prefrontal cortex (PFC), with more rostral parts exerting control over more caudal parts. Anatomical and functional data suggest that this hierarchical organization of the PFC may be separated into a ventral and a dorsal component. Furthermore, recent studies indicate that the apex of the hierarchy resides within the mid-lateral rather than the rostral PFC. However, investigating the hierarchical aspect of rostro-to-caudal processing requires quantification of the directed interactions between PFC regions. Using functional near-infrared spectroscopy (fNIRS) in a sample of healthy young adults we analyzed directed interactions between rostral and caudal PFC during passive watching of nature documentaries. Directed coherence (DC) as a measure of directed interaction was computed pairwise between 38 channels evenly distributed over the lateral prefrontal convexity. Results revealed an overall predominance of rostral-to-caudal directed interactions in the PFC that further dissociated along a ventro-dorsal axis: Dorsal regions exerted stronger rostro-caudally directed interactions on dorsal than on ventral regions and vice versa. Interactions between ventral and dorsal PFC were stronger from ventral to dorsal areas than vice versa. Results further support the notion that the mid-dorsolateral PFC constitutes the apex of the prefrontal hierarchy. Taken together these data provide novel evidence for parallel dorsal and ventral streams within the rostro-caudal hierarchical organization of the PFC. FNIRS-based analyses of directed interactions put forward a new perspective on the functional architecture of the prefrontal hierarchy and complement previous insights from functional magnetic resonance imaging.}
}
@article{CapraraetalNEUROIMAGE-18,
author = {Caprara, I. and Premereur, E. and Romero, M. C. and Faria, P. and Janssen, P.},
title = {Shape responses in a macaque frontal area connected to posterior parietal cortex},
journal = {Neuroimage},
volume = {179},
year = {2018},
pages = {298-312},
abstract = {The primate dorsal visual stream processes object shape to guide actions involving an object, but the transmission of shape information beyond posterior parietal cortex remains largely unknown. To clarify the information flow between parietal and frontal cortex, we applied electrical microstimulation during functional Magnetic Resonance Imaging (fMRI) in a shape-selective patch in the posterior part of the Anterior Intraparietal area (pAIP) to chart its connectivity. Subsequently, we recorded single-unit responses to images of objects in the fMRI activation in prefrontal cortex, corresponding to area 45B, elicited by pAIP microstimulation. Neurons in area 45B had properties similar to neurons in pAIP, responding selectively to shape contours and to very small shape fragments measuring less than one deg at exceedingly short latencies. However, contrary to the prevailing view on the hierarchical organization of cortical areas, neurons in area 45B preferred even smaller shape fragments and had smaller receptive fields than neurons in pAIP. These findings provide the first evidence for ultra-fast shape processing in prefrontal cortex, and suggest that the pathway from pAIP to area 45B may not be important for object grasping.}
}
@article{ChoietalJoCN-18,
author = {Choi, E. Y. and Drayna, G. K. and Badre, D.},
title = {Evidence for a Functional Hierarchy of Association Networks},
journal = {Journal of Cognitive Neuroscience},
volume = {30},
number = {5},
year = {2018},
pages = {722-736},
abstract = {Patient lesion and neuroimaging studies have identified a rostral-to-caudal functional gradient in the lateral frontal cortex (LFC) corresponding to higher-order (complex or abstract) to lower-order (simple or concrete) cognitive control. At the same time, monkey anatomical and human functional connectivity studies show that frontal regions are reciprocally connected with parietal and temporal regions, forming parallel and distributed association networks. Here, we investigated the link between the functional gradient of LFC regions observed during control tasks and the parallel, distributed organization of association networks. Whole-brain fMRI task activity corresponding to four orders of hierarchical control [Badre, D., & D'Esposito, M. Functional magnetic resonance imaging evidence for a hierarchical organization of the prefrontal cortex. Journal of Cognitive Neuroscience, 19, 2082-2099, 2007] was compared with a resting-state functional connectivity MRI estimate of cortical networks [Yeo, B. T., Krienen, F. M., Sepulcre, J., Sabuncu, M. R., Lashkari, D., Hollinshead, M., et al. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of Neurophysiology, 106, 1125-1165, 2011]. Critically, at each order of control, activity in the LFC and parietal cortex overlapped onto a common association network that differed between orders. These results are consistent with a functional organization based on separable association networks that are recruited during hierarchical control. Furthermore, corticostriatal functional connectivity MRI showed that, consistent with their participation in functional networks, rostral-to-caudal LFC and caudal-to-rostral parietal regions had similar, order-specific corticostriatal connectivity that agreed with a striatal gating model of hierarchical rule use. 
Our results indicate that hierarchical cognitive control is subserved by parallel and distributed association networks, together forming multiple localized functional gradients in different parts of association cortex. As such, association networks, while connectionally organized in parallel, may be functionally organized in a hierarchy via dynamic interaction with the striatum.}
}
@article{PischeddaJoN-17,
author = {Pischedda, D. and Gorgen, K. and Haynes, J. D. and Reverberi, C.},
title = {Neural Representations of Hierarchical Rule Sets: The Human Control System Represents Rules Irrespective of the Hierarchical Level to Which They Belong},
journal = {Journal of Neuroscience},
volume = {37},
number = {50},
year = {2017},
pages = {12281-12296},
abstract = {Humans use rules to organize their actions to achieve specific goals. Although simple rules that link a sensory stimulus to one response may suffice in some situations, often, the application of multiple, hierarchically organized rules is required. Recent theories suggest that progressively higher level rules are encoded along an anterior-to-posterior gradient within PFC. Although some evidence supports the existence of such a functional gradient, other studies argue for a lesser degree of specialization within PFC. We used fMRI to investigate whether rules at different hierarchical levels are represented at distinct locations in the brain or encoded by a single system. Thirty-seven male and female participants represented and applied hierarchical rule sets containing one lower-level stimulus-response rule and one higher-level selection rule. We used multivariate pattern analysis to investigate directly the representation of rules at each hierarchical level in absence of information about rules from other levels or other task-related information, thus providing a clear identification of low- and high-level rule representations. We could decode low- and high-level rules from local patterns of brain activity within a wide frontoparietal network. However, no significant difference existed between regions encoding representations of rules from both levels except for precentral gyrus, which represented only low-level rule information. Our findings show that the brain represents conditional rules regardless of their level in the explored hierarchy, so the human control system did not organize task representation according to this dimension. Our paradigm represents a promising approach to identifying critical principles that shape this control system. 
SIGNIFICANCE STATEMENT Several recent studies investigating the organization of the human control system propose that rules at different control levels are organized along an anterior-to-posterior gradient within PFC. In this study, we used multivariate pattern analysis to explore independently the representation of formally identical conditional rules belonging to different levels of a cognitive hierarchy and provide for the first time a clear identification of low- and high-level rule representations. We found no major spatial differences between regions encoding rules from different hierarchical levels. This suggests that the human brain does not use levels in the investigated hierarchy as a topographical organization principle to represent rules controlling our behavior. Our paradigm represents a promising approach to identifying which principles are critical.}
}
@article{YeoetalJoN-11,
author = {Yeo, B. T. and Krienen, F. M. and Sepulcre, J. and Sabuncu, M. R. and Lashkari, D. and Hollinshead, M. and Roffman, J. L. and Smoller, J. W. and Zollei, L. and Polimeni, J. R. and Fischl, B. and Liu, H. and Buckner, R. L.},
title = {The organization of the human cerebral cortex estimated by intrinsic functional connectivity},
journal = {Journal of Neurophysiology},
volume = {106},
number = {3},
year = {2011},
pages = {1125-1165},
abstract = {Information processing in the cerebral cortex involves interactions among distributed areas. Anatomical connectivity suggests that certain areas form local hierarchical relations such as within the visual system. Other connectivity patterns, particularly among association areas, suggest the presence of large-scale circuits without clear hierarchical relations. In this study the organization of networks in the human cerebrum was explored using resting-state functional connectivity MRI. Data from 1,000 subjects were registered using surface-based alignment. A clustering approach was employed to identify and replicate networks of functionally coupled regions across the cerebral cortex. The results revealed local networks confined to sensory and motor cortices as well as distributed networks of association regions. Within the sensory and motor cortices, functional connectivity followed topographic representations across adjacent areas. In association cortex, the connectivity patterns often showed abrupt transitions between network boundaries. Focused analyses were performed to better understand properties of network connectivity. A canonical sensory-motor pathway involving primary visual area, putative middle temporal area complex (MT+), lateral intraparietal area, and frontal eye field was analyzed to explore how interactions might arise within and between networks. Results showed that adjacent regions of the MT+ complex demonstrate differential connectivity consistent with a hierarchical pathway that spans networks. The functional connectivity of parietal and prefrontal association cortices was next explored. Distinct connectivity profiles of neighboring regions suggest they participate in distributed networks that, while showing evidence for interactions, are embedded within largely parallel, interdigitated circuits. 
We conclude by discussing the organization of these large-scale cerebral networks in relation to monkey anatomy and their potential evolutionary expansion in humans to support cognition.}
}
@article{TarenJoN-11,
author = {Taren, A. A. and Venkatraman, V. and Huettel, S. A.},
title = {A parallel functional topography between medial and lateral prefrontal cortex: evidence and implications for cognitive control},
journal = {Journal of Neuroscience},
volume = {31},
number = {13},
year = {2011},
pages = {5026-5031},
abstract = {The dorsomedial and dorsolateral prefrontal cortices (dmPFC and dlPFC) together support cognitive control, with dmPFC responsible for monitoring performance and dlPFC responsible for adjusting behavior. The dlPFC contains a topographic organization that reflects complexity of control demands, with more anterior regions guiding increasingly abstract processing. Recent evidence for a similar gradient within dmPFC suggests the possibility of parallel, hierarchical organization. Here, we measured connectivity between functional nodes of dmPFC and dlPFC using resting-state functional magnetic resonance imaging in humans. We found a posterior-to-anterior connectivity gradient; posterior dmPFC maximally connected to posterior dlPFC and anterior dmPFC maximally connected to anterior dlPFC. This parallel topographic pattern replicated across three independent datasets collected on different scanners, within individual participants, and through both point-to-point and voxelwise analyses. We posit a model of cognitive control characterized by hierarchical interactions--whose level depends on current environmental demands--between functional subdivisions of medial and lateral PFC.}
}

¹¹⁸ Here are a couple of interesting papers by Francesco Donnarumma and his colleagues that incorporate concepts from computer programming:

@article{DonnarummaetalCS-12,
       author = {Francesco Donnarumma and Roberto Prevete and Giuseppe Trautteur},
        title = {Programming in the brain: a neural network theoretical framework},
      journal = {Connection Science},
    publisher = {Taylor \& Francis},
       volume = {24},
       number = {2-3},
         year = {2012},
        pages = {71-90},
     abstract = {Recent research shows that some brain areas perform more than one task and the switching times between them are incompatible with learning and that parts of the brain are controlled by other parts of the brain, or are "recycled", or are used and reused for various purposes by other neural circuits in different task categories and cognitive domains. All this is conducive to the notion of "programming in the brain". In this paper, we describe a programmable neural architecture, biologically plausible on the neural level, and we implement, test, and validate it in order to support the programming interpretation of the above-mentioned phenomenology. A programmable neural network is a fixed-weight network that is endowed with auxiliary or programming inputs and behaves as any of a specified class of neural networks when its programming inputs are fed with a code of the weight matrix of a network of the class. The construction is based on the "pulling out" of the multiplication between synaptic weights and neuron outputs and having it performed in "software" by specialised multiplicative-response fixed subnetworks. Such construction has been tested for robustness with respect to various sources of noise. Theoretical underpinnings, analysis of related research, detailed construction schemes, and extensive testing results are given.}
}
@article{DonnarummaetalIJNS-15,
       author = {Donnarumma, F. and Prevete, R. and Chersi, F. and Pezzulo, G.},
        title = {A Programmer-Interpreter Neural Network Architecture for Prefrontal Cognitive Control},
      journal = {International Journal of Neural Systems},
         year = {2015},
       volume = {25},
       number = {6},
        pages = {1550017},
     abstract = {There is wide consensus that the prefrontal cortex (PFC) is able to exert cognitive control on behavior by biasing processing toward task-relevant information and by modulating response selection. This idea is typically framed in terms of top-down influences within a cortical control hierarchy, where prefrontal-basal ganglia loops gate multiple input-output channels, which in turn can activate or sequence motor primitives expressed in (pre-)motor cortices. Here we advance a new hypothesis, based on the notion of programmability and an interpreter-programmer computational scheme, on how the PFC can flexibly bias the selection of sensorimotor patterns depending on internal goal and task contexts. In this approach, multiple elementary behaviors representing motor primitives are expressed by a single multi-purpose neural network, which is seen as a reusable area of "recycled" neurons (interpreter). The PFC thus acts as a "programmer" that, without modifying the network connectivity, feeds the interpreter networks with specific input parameters encoding the programs (corresponding to network structures) to be interpreted by the (pre-)motor areas. Our architecture is validated in a standard test for executive function: the 1-2-AX task. Our results show that this computational framework provides a robust, scalable and flexible scheme that can be iterated at different hierarchical layers, supporting the realization of multiple goals. We discuss the plausibility of the "programmer-interpreter" scheme to explain the functioning of prefrontal-(pre)motor cortical hierarchies.}
}
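
As a rough illustration of the programmability idea in the two abstracts above, the following sketch treats the weights of a small target network as an extra "program" input to a fixed interpreter function. The names `interpret`, `program_and` and `program_or` are hypothetical, and the weight-by-activation multiplication is done directly in Python rather than by the fixed multiplicative-response subnetworks the authors describe, so this shows only the interface, not their construction.

```python
def interpret(program, x):
    """Fixed 'interpreter': run a one-layer threshold network whose weights arrive as data.

    program: list of rows, one per output unit, each row = [bias, w1, w2, ...]
    x:       list of input activations
    """
    outputs = []
    for row in program:
        bias, weights = row[0], row[1:]
        total = bias + sum(w * xi for w, xi in zip(weights, x))
        outputs.append(1 if total > 0 else 0)
    return outputs

# Two hypothetical 'programs' fed to the same, unchanged interpreter
program_and = [[-1.5, 1.0, 1.0]]
program_or = [[-0.5, 1.0, 1.0]]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, interpret(program_and, x), interpret(program_or, x))
```

The interpreter itself never changes; only the program input does, which is the sense in which a "programmer" area could switch the behavior of an "interpreter" area without modifying connectivity.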

¹¹⁹ The following excerpt comes from Chapter 8 of The Prefrontal Cortex [185] entitled "Coordinating New Sequences of Purposeful Action":

That prospective aspect makes working memory a prospective executive function by definition, like the three already discussed in this section. The cognit in its content may be an item of long-term memory that has been updated by a new but associated item that is critical for the best outcome of the action a few seconds or minutes hence. However, the content of working memory is not limited to that new item but extends to its entire associative context, which includes the representation of the action itself. The focus, the center, of the sustained attention, nonetheless, is in the new item, the memorandum (the Latin gerund emphasizes the to-do feature of working memory), and the associated prospective action it evokes. Insofar as the sustained attention of working memory incorporates percepts and actions, working memory operates intimately with attentional set. Insofar as the content of working memory consists largely of updated long-term memory, working memory can be legitimately called active memory, as it was in previous editions of this book.
Two distinctive characteristics of working memory mentioned above deserve further emphasis because they bear decisively on its neural substrate and mechanisms. One is its associative character. The content of working memory is associative and elicited by association. That content, the memorandum, is an item of information that, before it is perceived for retention in working memory, has been associated with many others in its context; in formal testing, for example, it is associated with a set of prior instructions from the experimenter. It may also be associated with a previously acquired lexicon or behavior. In any case, only a portion of the associated information is in the focus of attention of working memory at any given time. The rest, which is just as critical for the pending action, is a mass of information in associative long-term memory elicited by the memorandum.

The point here is that, whereas the focus of working memory is a discrete item of lexicon or sensorium, that item is couched in a constellation of associated items that provide it with meaning and purpose. In other words, the item is part of a vast associative neural network, a large cognit in long-term memory with perceptual and executive components, which is activated not only to retain that special item but also to provide it with historical and behavioral context. Only with that context will the working memory in focus serve its purpose. Therefore, working memory may be viewed as largely consisting of the temporary activation of updated long-term memory. Supporting this assertion are unit data from the primate cortex showing that the cells that participate in working memory belong to widely distributed networks attuned to multiple associated features of the memorandum, even if these features pertain to different sensory modalities. Page 393 in Fuster [184].

¹²⁰ Here is an excerpt from an article about proprioception and embodiment with a focus on what goes wrong when the corresponding sensory apparatus is challenged or part of the body is removed or badly damaged as in the case of amputation:

The human body has many types of muscles, but the most abundant are the skeletal muscles, which wrap around the framework of your skeleton. Skeletal muscles are voluntary muscles, meaning we can directly control them with our brains. From the explosive pitch of a baseball to the delicate manipulation of a paintbrush, these muscles are powerful and precise. But, many people don’t realize that muscles are also important sensors.
A muscle is essentially a big bundle of fibers of varying size and function. Large fibers on the outside do the brunt of the work, contracting powerfully to pull on different parts of the skeleton and create movement. Within these larger fibers are sensory fibers called muscle spindles, which move along with the outer fibers. Muscle spindles are specialized for sensation because they have sensory nerve fibers wound tightly around them like little springs. These springs stretch rather than pull, sending messages up to the brain to communicate how fast a muscle is moving.

Your brain receives signals from all your muscle spindles, along with other sensory receptors in your tendons and joints, and constructs an overall sense of the position and motion of your body. Hence, even with your eyes closed you can touch your nose, wave your hand, or toss a ball, remaining confident in where your arm is. This remarkable ability is called proprioception, and it is sometimes called the “sixth sense” (after sight, smell, taste, touch and hearing).

Proprioception is a sense that we often overlook because it is only subtly distinguished from movement. And, unlike sight or hearing or taste, we rarely experience the absence of proprioception (dead arm being a rare exception). Yet, a total loss of proprioception might be even more devastating than going blind or deaf. Without sensory information coming in from our muscles, we would be unable to monitor and correct our paths of motion. Imagine trying to walk, gesture, or eat if you had no sense of where your limbs were without looking at them. (SOURCE)

¹²¹ The development of memory in children becomes evident within the first 3 years of a child's life as they show considerable advances in declarative memory. This enhancement continues into adolescence with major developments in short-term memory, working memory, long-term memory, and autobiographical memory. Recent research on the development of memory has indicated that declarative, or explicit memory, may exist in infants who are even younger than two years old. For example, newborns who are less than 3 days old demonstrate a preference for their mother’s own voice. (SOURCE)

¹²² Childhood amnesia is defined as the "inability of adults to retrieve episodic memories (memories of situations or events) before the age of two to four years, as well as the period before the age of ten of which adults retain fewer memories than might otherwise be expected given the passage of time." [...] Some define it as the age from which a first memory can be retrieved. This is usually at the age of three or four, but it can range from two to eight years. (SOURCE)

¹²³ The incomplete development of language in young children is thought to be a critical cause of childhood amnesia as infants do not yet have the language capacity necessary to encode autobiographical memories. [...] Adults and children can often remember memories from around three or four years of age which is during a time of rapid language development. [...] Before language develops, children often only hold preverbal memories and may use symbols to represent them. Therefore, once language comes online, one can actively describe their memories with words. The context that one is in when they encode or retrieve memories is different for adults and infants because language is not present during infancy. (SOURCE)

¹²⁴ Infants and young children experience rapid brain development. The weight of the brain increases from 25% of its ultimate adult weight at birth to 75% by the second year of life. However, not all parts of the brain develop at the same time. This is especially true for the areas of the brain that are implicated in declarative memory. The cells that make up most of the hippocampus, a brain structure in the medial temporal lobe necessary for the formation of declarative memories, are formed by the end of the prenatal period. Yet the cells in the dentate gyrus of the hippocampus, an area that links the structure with cortical regions of the brain, do not appear adult-like until 12 to 15 months of age [89]. Another area of the brain implicated in memory function is the prefrontal cortex. The density of synapses in this area increases dramatically at eight months and peaks between 15 and 24 months. Changes continue to occur after this period, until well into adolescence. Thus, we see dramatic changes in the brain areas implicated in memory in the first two years of life. (SOURCE)

¹²⁵ Here is an announcement from the USC public relations press service describing research results from professor Nicolas Schweighofer's lab and summarizing his paper [444] in the Journal of Neurophysiology:

Motor memory is actually the product of two processes: short-term and long-term memory. If you focus on learning motor skills sequentially -- for example, two overhand ball throws -- you will acquire each fairly quickly, but are more likely to forget them later. However, if you split your time up between learning multiple motor skills -- say, learning two different throws -- you will learn them more slowly but be more likely to remember them both later.
This phenomenon, called the "contextual interference effect," is the result of a showdown between your short-term and long-term motor memory, Schweighofer said. Though scientists have long been aware of the effect's existence, Schweighofer's research is the first to explain the mechanism behind it. "Continually wiping out motor short-term memory helps update long-term memory," he said.

In short, if your brain can rely on your short-term motor memory to handle memorizing a single motor task, then it will do so, failing to engage your long-term memory in the process. If you deny your brain that option by continually switching from learning one task to the other, your long-term memory will kick in instead. It will take longer to learn both, but you won't forget them later. (SOURCE)
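
To make the argument in the excerpt above concrete, here is a toy two-process learner. It is only a sketch under our own assumptions, not Schweighofer's published model: a single shared fast (short-term) state is wiped whenever the task changes, each task has its own slow (long-term) state that never decays, and the function name `practice` and all parameter values are hypothetical.

```python
def practice(schedule, fast_retain=0.8, fast_rate=0.4, slow_rate=0.02):
    """Toy learner with one shared fast state and one slow state per task.

    schedule: sequence of task labels, one per trial; every task has target output 1.0.
    Returns (mean error during practice, dict of slow states after practice).
    """
    slow = {task: 0.0 for task in set(schedule)}
    fast, previous, total_error = 0.0, None, 0.0
    for task in schedule:
        if task != previous:
            fast = 0.0  # assumption: switching tasks wipes short-term memory
        error = 1.0 - (slow[task] + fast)
        fast = fast_retain * fast + fast_rate * error  # learns fast, forgets fast
        slow[task] += slow_rate * error                # learns slowly, never forgets
        total_error += abs(error)
        previous = task
    return total_error / len(schedule), slow

blocked = ["A"] * 50 + ["B"] * 50   # one skill at a time
interleaved = ["A", "B"] * 50       # switch skills on every trial

blocked_error, blocked_slow = practice(blocked)
mixed_error, mixed_slow = practice(interleaved)
print(f"blocked:     practice error {blocked_error:.2f}, retained {blocked_slow['A']:.2f}")
print(f"interleaved: practice error {mixed_error:.2f}, retained {mixed_slow['A']:.2f}")
```

In the blocked schedule the fast state absorbs most of the error, so the slow state receives little training signal; in the interleaved schedule the error stays large on every trial and the slow state ends up retaining more, at the cost of worse performance during practice.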

¹²⁷ The red nucleus or nucleus ruber is a structure in the rostral midbrain involved in motor coordination. (SOURCE)

¹²⁸ Zebrafish have been used as animal models for research because of their small body size, vast amounts of eggs produced, their rapid development, low husbandry costs, and transparency during embryogenesis. Although the body size and appearance seem different, fish and mammals including human still possess anatomical and functional similarities in their brains. This review summarizes the similarities of brain structures and functions between teleost fish and mammalian brains, focusing on the dopamine system, functional regionalization of the cerebellum, and presence of the nucleus ruber.

¹²⁶ You cannot have a vertebrate brain without a basal ganglia or cerebellum: Kien and Altman [270] — Preparation and Execution of Movement: Parallels Between Insect (invertebrate) and Mammalian (vertebrate) Motor Systems; Sugahara et al [474] — Reconstructing the Ancestral Vertebrate Brain; Matsui [342] — Dopamine System, Cerebellum, and Nucleus Ruber¹²⁷ in Fish¹²⁸ and Mammals; Reiner A. (2009) You Cannot Have a Vertebrate Brain Without a Basal Ganglia. In: Groenewegen H., Voorn P., Berendse H., Mulder A., Cools A. (eds) The Basal Ganglia IX. Advances in Behavioral Biology, vol 58. Springer, New York, NY.

¹²⁹ Abstract from Kenji Doya [148] "Complementary roles of basal ganglia and cerebellum in learning and motor control":

The classical notion that the basal ganglia and the cerebellum are dedicated to motor control has been challenged by the accumulation of evidence revealing their involvement in non-motor, cognitive functions. From a computational viewpoint, it has been suggested that the cerebellum, the basal ganglia, and the cerebral cortex are specialized for different types of learning: namely, supervised learning, reinforcement learning and unsupervised learning, respectively. This idea of learning-oriented specialization is helpful in understanding the complementary roles of the basal ganglia and the cerebellum in motor control and cognitive functions.
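
The three kinds of learning named in the abstract above differ mainly in the training signal available to the learner. As a minimal sketch, and with the caveat that the particular rules are our own choice of textbook representatives (the delta rule, a temporal-difference update and Oja's Hebbian rule) rather than anything specified by Doya, the contrast can be written as three update functions:

```python
def supervised_update(w, x, target, rate=0.1):
    """Delta rule: an explicit target supplies a signed error for the output."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + rate * (target - y) * xi for wi, xi in zip(w, x)]

def reinforcement_update(value, reward, next_value, rate=0.1, discount=0.9):
    """Temporal-difference rule: only a scalar reward prediction error is available."""
    td_error = reward + discount * next_value - value
    return value + rate * td_error

def unsupervised_update(w, x, rate=0.1):
    """Oja's rule: Hebbian learning from input statistics, with no teacher or reward."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + rate * y * (xi - y * wi) for wi, xi in zip(w, x)]

print(supervised_update([0.0, 0.0], [1.0, 0.0], target=1.0))
print(reinforcement_update(0.0, reward=1.0, next_value=0.0))
print(unsupervised_update([1.0, 0.0], [1.0, 0.0]))
```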

¹³¹ Muscle memory is a form of procedural memory that involves consolidating a specific motor task into memory through repetition, which has been used synonymously with motor learning. When a movement is repeated over time, a long-term muscle memory is created for that task, eventually allowing it to be performed without conscious effort. [...] The main area involved in motor learning is the cerebellum. Some models of cerebellar-dependent motor learning, in particular the Marr-Albus model, propose a single plasticity mechanism involving the cerebellar long-term depression (LTD) of the parallel fiber synapses onto Purkinje cells. These modifications in synapse activity would mediate motor input with motor outputs critical to inducing motor learning. (SOURCE)

¹³² A woman has reached the age of 24 without anyone realising she was missing a large part of her brain. The case highlights just how adaptable the organ is. The discovery was made when the woman was admitted to the Chinese PLA General Hospital of Jinan Military Area Command in Shandong Province complaining of dizziness and nausea. She told doctors she’d had problems walking steadily for most of her life, and her mother reported that she hadn’t walked until she was 7 and that her speech only became intelligible at the age of 6. (SOURCE)

¹³⁰ There is anecdotal evidence that, in patients with advanced Alzheimer's disease, fine motor (procedural) memory is retained longer than episodic or semantic (declarative) memory [104]. Given the degree to which Alzheimer's impacts the cerebral cortex, this suggests that at least some procedural knowledge might be stored in the cerebellar cortex. The premotor cortex also stores procedural memories¹³¹. It is also apparent that one can survive without the cerebellar cortex and even acquire and perform complex motor skills including walking and speaking, though perhaps without the precise timing and coordination one expects of an individual with an intact cerebellum¹³².

¹³³ A cerebellar peduncle (not to be confused with the cerebral peduncle) is a nerve tract that permits communication between the cerebellum and the other parts of the central nervous system. Three pairs of cerebellar peduncles conduct this communication. The inferior peduncles bring sensory information about the actual position of body parts such as limbs and joints. The middle peduncles transmit information about the desired position of these parts. After integrating and analyzing the information from these two sources, the cerebellum sends impulses through the superior peduncles to the midbrain. In response, motor impulses are transmitted down through the pons, medulla oblongata, and spinal cord to stimulate or inhibit skeletal muscles at appropriate times and cause movements of body parts into the desired positions. This activity makes rapid and complex muscular movements possible. (SOURCE)

¹³⁴ The inferior olivary nucleus (ION) is a structure found in the medulla oblongata underneath the superior olivary nucleus. In vertebrates, the ION is known to coordinate signals from the spinal cord to the cerebellum to regulate motor coordination and learning. These connections have been shown to be tightly associated, as degeneration of either the cerebellum or the ION results in degeneration of the other. Though neurons of the ION are glutamatergic, they receive inhibitory input via GABA receptors. There are two distinct GABAα receptor populations that are spatially organized within each neuron present in the ION. The GABAα receptor make-up varies based on where the receptor localizes on the ION neuron. The ION has a distinct laminar structure. These laminae house the cell bodies of the olivocerebellar fibers. These neurons are the major input source for the cerebellum. (SOURCE)

¹³⁵ The substantia nigra (SN) is a basal ganglia structure located in the midbrain that plays an important role in reward and movement. [...] Although the substantia nigra appears as a continuous band in brain sections, anatomical studies have found that it actually consists of two parts with very different connections and functions: the pars compacta (SNpc) and the pars reticulata (SNpr). The pars compacta mainly provides input to the basal ganglia circuit, supplying the striatum with dopamine. The pars reticulata, by contrast, serves mainly as an output nucleus, conveying signals from the basal ganglia to numerous other brain structures. [...]

The GABAergic neurons in the pars reticulata convey the final processed signals of the basal ganglia to the thalamus and superior colliculus. In addition, the pars reticulata also inhibits dopaminergic activity in the pars compacta via axon collaterals, although the functional organization of these connections remains unclear. The pars compacta is heavily involved in learned responses to stimuli. In primates, dopaminergic neuron activity increases in the nigrostriatal pathway when a new stimulus is presented. Dopaminergic activity decreases with repeated stimulus presentation. However, behaviorally significant stimulus presentation (i.e. rewards) continues to activate dopaminergic neurons in the substantia nigra pars compacta. (SOURCE)

¹³⁶ The tectum is a region of the brain, specifically the dorsal (towards the back) part of the midbrain (mesencephalon). The position of the tectum is contrasted with the tegmentum, which refers to the region ventral to (in front of) the ventricular system. The tectum is responsible for auditory and visual reflexes. (SOURCE)

¹³⁷ In neurophysiology, long-term depression (LTD) is an activity-dependent reduction in the efficacy of neuronal synapses lasting hours or longer following a long patterned stimulus. LTD occurs in many areas of the CNS with varying mechanisms depending upon brain region and developmental progress.

As the opposing process to long-term potentiation (LTP), LTD is one of several processes that serve to selectively weaken specific synapses in order to make constructive use of synaptic strengthening caused by LTP. This is necessary because, if allowed to continue increasing in strength, synapses would ultimately reach a ceiling level of efficiency, which would inhibit the encoding of new information. (SOURCE)
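
The ceiling argument above is easy to see in a toy simulation. The sketch below makes several simplifying assumptions of our own: synaptic strength takes integer levels from 0 to `ceiling`, each stored pattern activates a random half of the synapses, LTP raises an active synapse by one level, and LTD (when enabled) lowers an inactive synapse by one level. The function names and parameters are hypothetical.

```python
import random

def train(n_patterns, use_ltd, n_synapses=200, ceiling=5, seed=0):
    """Return synaptic strengths (integers 0..ceiling) after storing random patterns."""
    rng = random.Random(seed)
    strength = [ceiling // 2] * n_synapses
    for _ in range(n_patterns):
        for i in range(n_synapses):
            if rng.random() < 0.5:
                # synapse active in this pattern: LTP, clipped at the ceiling
                strength[i] = min(ceiling, strength[i] + 1)
            elif use_ltd:
                # synapse inactive in this pattern: LTD weakens it
                strength[i] = max(0, strength[i] - 1)
    return strength

def fraction_at_ceiling(strength, ceiling=5):
    """Fraction of synapses that can no longer be strengthened."""
    return sum(s == ceiling for s in strength) / len(strength)

ltp_only = train(50, use_ltd=False)
ltp_and_ltd = train(50, use_ltd=True)
print("LTP only:    ", fraction_at_ceiling(ltp_only))
print("LTP plus LTD:", fraction_at_ceiling(ltp_and_ltd))
```

With potentiation alone, essentially every synapse ends up pinned at the ceiling, so a new pattern leaves no trace; with depression included, strengths stay spread across the available range.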

¹³⁸ In neuroscience, a silent synapse is an excitatory glutamatergic synapse whose postsynaptic membrane contains NMDA-type glutamate receptors but no AMPA-type glutamate receptors. These synapses are named "silent" because normal AMPA receptor-mediated signaling is not present, rendering the synapse inactive under typical conditions. Silent synapses are typically considered to be immature glutamatergic synapses. As the brain matures, the relative number of silent synapses decreases. However, recent research on hippocampal silent synapses shows that, while they may indeed be a developmental landmark in the formation of a synapse, synapses can be "silenced" by activity even once they have acquired AMPA receptors. Thus, silence may be a state that synapses can visit many times during their lifetimes. (SOURCE)

¹³⁹ Leonard Susskind is an American physicist, who is professor of theoretical physics at Stanford University, and founding director of the Stanford Institute for Theoretical Physics. His research interests include string theory, quantum field theory, quantum statistical mechanics and quantum cosmology. Susskind is widely regarded as one of the fathers of string theory. He was the first to give a precise string-theoretic interpretation of the holographic principle in 1995 and the first to introduce the idea of the string theory landscape in 2003. (SOURCE)

¹⁴⁰ Susskind is speaking here of Gerard 't Hooft, a Dutch theoretical physicist and professor at Utrecht University, the Netherlands. He shared the 1999 Nobel Prize in Physics with his thesis advisor Martinus J. G. Veltman for elucidating the quantum structure of electroweak interactions. His work concentrates on gauge theory, black holes, quantum gravity and fundamental aspects of quantum mechanics. His contributions to physics include a proof that gauge theories are renormalizable, dimensional regularization and the holographic principle. (SOURCE)

¹⁴¹ In physics, the observer effect is the theory that the mere observation of a phenomenon inevitably changes that phenomenon. This is often the result of instruments that, by necessity, alter the state of what they measure in some manner. A common example is checking the pressure in an automobile tire; this is difficult to do without letting out some of the air, thus changing the pressure. Similarly, it is not possible to see any object without light hitting the object, and causing it to reflect that light.

While the effects of observation are often negligible, the object still experiences a change. This effect can be found in many domains of physics, but can usually be reduced to insignificance by using different instruments or observation techniques. An especially unusual version of the observer effect occurs in quantum mechanics, as best demonstrated by the double-slit experiment. (SOURCE)

¹⁴² In mathematics education, number sense can refer to "an intuitive understanding of numbers, their magnitude, relationships, and how they are affected by operations". Other definitions of number sense emphasize an ability to work outside of the traditionally taught algorithms, e.g., "a well organized conceptual framework of number information that enables a person to understand numbers and number relationships and to solve mathematical problems that are not bound by traditional algorithms".

Psychologists believe that the number sense in humans can be differentiated into the approximate number system, a system that supports the estimation of the magnitude, and the parallel individuation system, which allows the tracking of individual objects, typically for quantities below 4. Gersten and Chard say number sense "refers to a child's fluidity and flexibility with numbers, the sense of what numbers mean and an ability to perform mental mathematics and to look at the world and make comparisons".

In non-human animals, number sense is not the ability to count, but the ability to perceive changes in the number of things in a collection. All mammals, and most birds, will notice if there is a change in the number of their young nearby. Many birds can distinguish two from three. In humans, small children around fourteen months of age are also able to notice something that is missing from a group that they are familiar with. (SOURCE)

¹⁴³ Results from a new artificial intelligence study indicate number sense is spontaneously created by the visual system, without prior experience of counting.

Humans and animals have a "number sense," an inborn ability to register the number of objects in a scene. The neural basis of this ability is believed to be what are called the number neurons, which respond to certain numbers and have been found in both human and animal brains. Researchers have long wondered whether these number neurons are formed in the brain merely by the ability to see — and if so, how. Now, a team of researchers headed by Professor Andreas Nieder at the University of Tübingen’s Institute of Neurobiology has investigated the origins of number sense using an artificial neural network. The results indicate that it is created spontaneously by the visual system, without any experience in counting. The study has just been published in the latest edition of Science Advances.
The researchers started out by training an artificial "deep learning" network to recognize pictured objects such as tennis balls, necklaces, spiders and dogs. "The network model is based on an architecture structured like the early developmental stage of the human visual cortex," Andreas Nieder explains. "It has been discovered there that nerve cells work together in different hierarchical levels to enable vision." The artificial network learned to recognize objects on the basis of 1.2 million images which were classified into one thousand categories. After this training, the network was able to classify thousands of new images with a high proportion of success.

The network is in two parts. One of these extracts the object’s characteristics from the images and transforms these into an abstract representation; the second part sorts the object into a category on the basis of probability. "We separated the two network parts and to the first part we presented not photos but simple dot patterns of one to 30 dots," Nieder says. In the following cycles of the experiment, the patterns were repeated with varying dot patterns and densities. Then the researchers analyzed whether the network’s artificial neurons reacted to the same number of points independently of other characteristics. "Almost ten percent of the artificial neurons had each specialized in a certain number, although the network was never trained to differentiate between numbers. The network had spontaneously developed a sense of numbers," says Nieder.

Researchers had suspected that the ability to count developed from the visual system. The most fundamental task of vision is to recognize visible objects, Nieder says. The new study shows how number neurons can develop spontaneously from an artificial visual system that was only trained to recognize visible objects, he adds. Furthermore, the artificial neurons’ responses resembled those of real number neurons in animals and humans. "Number sense does not seem to depend on a specific, specialized area of the brain, but rather on neural networks formed by vision. This now makes it possible to explain why even newborns or untrained, wild animals have a number sense," Nieder explains.

Source: University of Tübingen
Contacts: Andreas Nieder — University of Tübingen
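
The analysis step described in the press release above, checking whether a unit responds to the same number of dots regardless of other stimulus characteristics, can be sketched as follows. This is a deliberately crude stand-in for the statistical test used in the study, which the excerpt does not specify: the criterion here is simply that a unit's preferred numerosity is identical under every stimulus layout, and the function names and response values are made up for illustration.

```python
def preferred_number(responses_by_number):
    """Return the numerosity (dict key) that evokes the largest response."""
    return max(responses_by_number, key=responses_by_number.get)

def is_number_unit(unit):
    """unit maps each stimulus layout to {numerosity: mean response}.

    Toy criterion: the preferred numerosity is the same under every layout.
    """
    preferred = {preferred_number(responses) for responses in unit.values()}
    return len(preferred) == 1

# Made-up responses for two hypothetical units under two dot layouts
unit_a = {"sparse": {1: 0.1, 2: 0.9, 4: 0.3}, "dense": {1: 0.2, 2: 0.8, 4: 0.4}}
unit_b = {"sparse": {1: 0.7, 2: 0.2, 4: 0.1}, "dense": {1: 0.1, 2: 0.3, 4: 0.9}}

print("unit_a tuned to number:", is_number_unit(unit_a))
print("unit_b tuned to number:", is_number_unit(unit_b))
```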

¹⁴⁴ For students looking for an introduction to memory systems in the brain, you might start with Chapter 24 in [52] if you own or have access to either the 3rd or 4th edition, or, for a more succinct overview, try Rolls [427], Raslau et al [418, 418], or, most expedient, Wikipedia (URL).

¹⁴⁵ The hippocampal formation is a compound structure located in the medial temporal cortex that consists of the dentate gyrus, the hippocampus proper, subiculum and, depending on whom you consult, presubiculum, parasubiculum and entorhinal cortex.

¹⁴⁶ Excerpt from Stanislas Dehaene [129] Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts regarding long distance connections — variously called spindle cells or von Economo neurons — in support of attentional mechanism for conscious awareness and global workspace short-term memory maintenance [17, 158]:

According to this evolutionary argument, consciousness implies connectivity. Flexible information sharing requires a specific neuronal architecture to link the many distant and specialized regions of the cortex into a coherent whole. Can we identify such a structure inside our brains? As early as the late nineteenth century, the Spanish neuroanatomist Santiago Ramón y Cajal noted a peculiar aspect of brain tissue. Unlike the dense mosaic of cells that make up our skin, the brain comprises enormously elongated cells: neurons. With their long axon, neurons possess the property, unique among cells, of measuring up to meters in size. A single neuron in the motor cortex may send its axon to extraordinarily distant regions of the spinal cord, in order to command specific muscles. Most interestingly, Cajal discovered that long-distance projection cells are quite dense in the cortex (figure 25), the thin mantle that forms the surface of our two hemispheres. From their locations in the cortex, nerve cells shaped like pyramids often send their axons all the way to the back of the brain or to the other hemisphere. Their axons group together into dense bundles of fibers that form cables of several millimeters in diameter and up to several centimeters in length. Using magnetic resonance imaging, we can now easily detect these crisscrossing fiber bundles in the living human brain.
Importantly, not all brain areas are equally well connected. Sensory regions, such as the primary visual area V1, tend to be choosy and to establish only a small set of connections, primarily with their neighbors. Early visual regions are arranged in a coarse hierarchy: area V1 speaks primarily to V2, which in turn speaks to V3 and V4, and so on. As a result, early visual operations are functionally encapsulated: visual neurons initially receive only a small fraction of the retinal input and process it in relative isolation, without any “awareness” of the overall picture.

In the higher association areas of the cortex, however, connectivity loses its local nearest-neighbor or point-to-point character, thus breaking the modularity of cognitive operations. Neurons with long-distance axons are most abundant in the prefrontal cortex, the anterior part of the brain. This region connects to many other sites in the inferior parietal lobe, the middle and anterior temporal lobe, and the anterior and posterior cingulate areas that lie on the brain’s midline. These regions have been identified as major hubs—the brain’s main interconnection centers. All are heavily connected by reciprocal projections: if area A projects to area B, then almost invariably B also sends a projection back to A (figure 25). Furthermore, long-distance connections tend to form triangles: if area A projects jointly to areas B and C, then they, in turn, are very likely to be interconnected.

FIGURE 25. Long-distance neuronal connections may support the global neuronal workspace. The famous neuroanatomist Santiago Ramón y Cajal, who dissected the human brain in the nineteenth century, already noted how large cortical neurons, shaped like pyramids, sent their axons to very distant regions (left). We now know that these long-distance projections convey sensory information to a densely connected network of parietal, temporal, and prefrontal regions (right). A lesion in these long-distance projections may cause spatial neglect, a selective loss of visual awareness of one side of space.

These cortical regions are strongly connected to additional players, such as the central lateral and intralaminar nuclei of the thalamus (involved in attention, vigilance, and synchronization), the basal ganglia (crucial for decision making and action), and the hippocampus (essential for memorizing the episodes of our lives and for recalling them). Pathways linking the cortex with the thalamus are especially important. The thalamus is a collection of nuclei, each of which enters into a tight loop with at least one region of the cortex and often many of them at once. Virtually all regions of the cortex that are directly interconnected also share information via a parallel information route through a deep thalamic relay. Inputs from the thalamus to the cortex also play a fundamental role in exciting the cortex and maintaining it in an "up" state of sustained activity. As we shall see, the reduced activity of the thalamus and its interconnections plays a key role in coma and vegetative states, when the brain loses its mind.

The workspace thus rests on a dense network of interconnected brain regions—a decentralized organization without a single physical meeting site. At the top of the cortical hierarchy, an elitist board of executives, distributed in distant territories, stays in sync by exchanging a plethora of messages. Strikingly, this anatomical network of interconnected high-level areas, involving primarily the prefrontal and parietal lobes, coincides with the one that I described in Chapter 4 and whose sudden activation constituted our first signature of conscious processing. We are now in a position to understand why these associative areas systematically ignite whenever a piece of information enters our awareness: those regions possess precisely the long-distance connectivity needed to broadcast messages across the long distances of the brain.

The pyramidal neurons of the cortex that participate in this long-distance network are well adapted to the task (figure 26). To harbor the complex molecular machinery needed to sustain their immense axons, they possess giant cell bodies. Remember that the cell’s nucleus is where the genetic information is encoded in DNA—and yet the receptor molecules that are transcribed there must somehow make their way to synapses centimeters away. The large nerve cells capable of performing this spectacular feat tend to concentrate in specific layers of the cortex—the layers II and III, which are especially responsible for the callosal connections that distribute information across the two hemispheres.
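
Two of the connectivity regularities mentioned in the excerpt above, reciprocal projections and the tendency of long-distance connections to form triangles, correspond to simple graph statistics. The following sketch computes both on a small directed graph. The area labels and the wiring are a hypothetical toy example, not anatomical data, and the function names are our own.

```python
from itertools import combinations

def reciprocity(edges):
    """Fraction of directed projections A->B that have a return projection B->A."""
    return sum((b, a) in edges for a, b in edges) / len(edges)

def triangle_closure(edges):
    """Among pairs B, C that receive from a common source A, the fraction that are
    themselves connected in at least one direction."""
    targets = {}
    for a, b in edges:
        targets.setdefault(a, set()).add(b)
    pairs = closed = 0
    for outs in targets.values():
        for b, c in combinations(sorted(outs), 2):
            pairs += 1
            closed += (b, c) in edges or (c, b) in edges
    return closed / pairs if pairs else 0.0

# Hypothetical wiring: three mutually connected hub areas plus one feedforward link
edges = {("PFC", "PAR"), ("PAR", "PFC"), ("PFC", "TEMP"), ("TEMP", "PFC"),
         ("PAR", "TEMP"), ("TEMP", "PAR"), ("PFC", "V1")}

print(f"reciprocity:      {reciprocity(edges):.2f}")
print(f"triangle closure: {triangle_closure(edges):.2f}")
```

In this toy graph the hub areas score high on both measures, while the single one-way link to the sensory area lowers them, which mirrors the contrast the excerpt draws between association and early sensory regions.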

¹⁴⁷ These details, along with scans of K.C.'s brain, provided strong evidence that our episodic and semantic memories rely on different brain circuits. The hippocampus helps record both types of memories initially and it helps retain them for the medium-term. The hippocampus probably also helps us access old personal memories in long-term storage. But to access old semantic memories, the brain seems to use the parahippocampus, an extension of the hippocampus on the brain's southernmost surface. K.C., whose parahippocampi survived, could therefore remember to sink the eightball last in pool (semantic knowledge), even though every last memory of playing pool with his buddies had disappeared (personal knowledge).

What's more, while a healthy hippocampus will usually take responsibility for recording new semantic memories, the parahippocampus can, albeit excruciatingly slowly, absorb new facts if it has to. For instance, after years of shelving books as a volunteer at a local library, K.C.'s parahippocampus learned the Dewey decimal system, even though he had no idea why he knew it. Similarly, H.M.'s healthy parahippocampus picked up a few choice facts after his 1953 surgery. After seeing the crossword clue a thousand times, he dimly recalled that "Salk vaccine target" equaled P-O-L-I-O, and through incessant references he retained a sliver of information about the 1969 moon landing and the 1963 Kennedy assassination. Contra the cliché, he couldn't recall where he was when he learned those things; that's episodic memory. And his knowledge of the events remained weak and fragmentary, since the parahippocampus cannot learn very well. He nevertheless absorbed that they happened.

Along the same lines, K.C. helps neuroscience come to grips with another important distinction in memory research, between recollection and familiarity. Colloquially, recollection means "I specifically remember this," while familiarity means "this sounds familiar, even if the details are fuzzy." And sure enough, the brain makes the same distinction. In one test, K.C.'s doctors compiled a list of words (El Niño, posse) that entered the common parlance after his accident in 1981. They then sprinkled those words into a list of pseudo-words, strings of letters that looked like plausible words but that meant nothing. Time and again K.C. picked out the real word, and did so with confidence. But when asked to define the word, he shrugged. From a list of common names he picked out people who became famous after 1981 (e.g., Bill Clinton). But he had no inkling what Clinton had done. In other words, K.C. found these terms familiar, even though specific recollection eluded him. This indicates that recollection again requires the hippocampus, while a feeling of familiarity requires only certain patches of cortex.

A final type of memory that amnesiacs have helped illuminate is emotional memory, which makes sense, given that the hippocampus belongs to the limbic system. Possibly because he had no amygdala, H.M. was always pretty affable around the scientists who visited him, despite never recognizing them. (Not even Milner, who worked with him for a half-century.) Other amnesiacs lacked his easy-going manner, though, and a few got outright snarly. In 1992, herpes simplex (the same bug that knocked out people's ability to recognize fruits, animals, and tools) hollowed out the hippocampi and other structures inside the brain of a 70-year-old San Diego man named E.P. He started repeating the same anecdote over and over again, verbatim, and eating up to three breakfasts each day. And despite being a former sailor who lived less than two miles from the coast, he suddenly couldn't remember even the general direction of the Pacific Ocean. [...]

[...] Taken as a whole, this alphabet soup of amnesiacs helps scientists sort out how the brain divides up responsibility for memories. Non-declarative memories (like motor memories) rely on the cerebellum and on certain internal clusters of gray matter such as the striatum. Episodic (personal) memories lean heavily on the hippocampus, while semantic (factual) memories utilize the parahippocampus to a much larger degree, especially for retrieval. The frontal lobes contribute as well, both in searching for memories and in double-checking that the brain has grabbed the right memory from long-term storage in the cortex. Sensory and limbic circuits also kick on to reanimate the moment in our minds. Meanwhile, parietal and frontal lobes whisper to us that we are reviewing old information, so that we don't get terrified or amorous all over again. Each step works independently, and each one can malfunction without affecting other mental faculties in the slightest.

That's the theory, at least. In reality, it seems impossible to tear out any one aspect of memory, especially our episodic memories (memories of holidays and lovers and times we fell short), without tearing out so much more. K.C. knows how to play solitaire and change a tire, but he can never recall a moment of contentment, peace, loneliness, or lust. And however paradoxical it might seem, losing his past wiped out his future as well. The ultimate biological purpose of memory isn't to recall the past per se, but to prepare for the future by giving us clues about how to act in certain situations. As a result, when K.C. lost his past self, his future self died along with it. He cannot tell you what he'll do over the next hour, the next day, the next year; he cannot even imagine these things. The loss of his future self doesn't pain K.C.; he doesn't suffer or rue his fate, but in some ways that lack of suffering seems sad in and of itself. However unfair, it's hard not to see him as reduced, diminished.

¹⁴⁸ In theoretical studies proposing a critical role for the DG in pattern separation, two aspects of its physiology and connectivity have been emphasized: one, the large number, relative to CA3 and EC, of principal cells it contains, and, two, its sparse activity (O'Reilly and McClelland 1994). The rat DG contains around a million GCs, meaning that it outnumbers its 300,000 target CA3 pyramidal neurons by about 3:1 and its upstream 200,000 EC neurons by 5:1 (Amaral et al. 1990). The remapping of the activity of EC neurons onto the much larger, sparsely active GC population is what Marr defined as "expansion recoding" and it is hypothesized to be a central step in separating EC activity patterns, as small changes in EC activity would be thought to change significantly the activated cohort of GCs. Indeed, the influential theoretical study of O'Reilly and McClelland (1994) indicated that the distance between neural representations in the DG is systematically larger than in the EC when DG representations are sparse. This form of decorrelation improves separability; however, it also disrupts the ability of the CA3 neurons to perform pattern completion. See Drew et al [314] and Amaral et al [19].
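The expansion-recoding argument in this footnote can be illustrated with a small numerical sketch. This is a toy model under stated assumptions, not the O'Reilly and McClelland model: the random Gaussian weights, the k-winners-take-all rule, the 2% DG sparsity, the 20% EC activity level, and the scaled-down population sizes (400 EC units and 2,000 GC units, keeping the 1:5 ratio quoted above) are all choices made for illustration. The names `kwta_code` and `overlap` are hypothetical helpers.

```python
import random

def kwta_code(weights, active_inputs, k):
    """Return the set of the k output units that receive the largest summed input."""
    drive = [sum(row[i] for i in active_inputs) for row in weights]
    ranked = sorted(range(len(drive)), key=drive.__getitem__, reverse=True)
    return set(ranked[:k])

def overlap(a, b):
    """Fraction of active units shared by two codes of equal size."""
    return len(a & b) / len(a)

rng = random.Random(0)
N_EC, N_GC = 400, 2000   # ASSUMED scaled-down populations, 1:5 as in the text
EC_ACTIVE = 80           # ASSUMED: 20% of EC units active per pattern
K = 40                   # ASSUMED: 2% of GC units allowed to be active

# Random feed-forward weights from every EC unit to every GC unit (assumption).
weights = [[rng.gauss(0.0, 1.0) for _ in range(N_EC)] for _ in range(N_GC)]

# Two EC patterns that share 72 of their 80 active units (90% overlap).
ec_a = set(rng.sample(range(N_EC), EC_ACTIVE))
swap_out = set(rng.sample(sorted(ec_a), 8))
swap_in = set(rng.sample(sorted(set(range(N_EC)) - ec_a), 8))
ec_b = (ec_a - swap_out) | swap_in

# Sparse DG codes for the two EC patterns.
dg_a = kwta_code(weights, sorted(ec_a), K)
dg_b = kwta_code(weights, sorted(ec_b), K)

print("EC overlap:", overlap(ec_a, ec_b))
print("DG overlap:", overlap(dg_a, dg_b))
```

In this sketch the two DG codes should share a smaller fraction of active units than the two EC patterns do, which is the sense in which a sparse, expanded code separates similar inputs. Raising `K` toward dense activity weakens the effect, consistent with the remark above that separation depends on DG representations being sparse.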

¹⁴⁹ A single CA3 pyramidal cell receives convergent input from approximately 50 granule cells, and a single granule cell provides divergent output onto approximately 15 CA3 pyramidal neurons. How sparse 'engrams' are relayed from the dentate gyrus to the CA3 region has been unclear. In the subdetonation mode, activation of several granule cells by multiple highly specific patterns would be necessary to activate a given CA3 pyramidal neuron. Given the sparse activity of granule cells, this is an unlikely scenario. In contrast, in the full detonation mode, activation of a single granule cell by a single pattern would be sufficient to trigger activation. Furthermore, in the full detonation mode, a single action potential in a granule cell should activate an ensemble of approximately 15 CA3 pyramidal cells [20]. See Vyleta et al [510].

Pattern separation in the dentate gyrus depends on unique properties of DG granule cells (the mossy-fiber neurons). Synapses from DG mossy fibers onto CA3 pyramidal neurons result in sustained activity in the postsynaptic neurons of CA3. "Mossy fiber synapses on CA3 pyramidal cells are 'conditional detonators' that reliably discharge postsynaptic targets. The 'conditional' nature implies that burst activity in dentate gyrus granule cells is required for detonation."

"Mossy fiber synapses exhibit both pronounced short-term facilitation and uniquely large post-tetanic potentiation (PTP). [...] Unitary EPSPs failed to initiate a spike in CA3 neurons under control conditions, but reliably discharged them after induction of presynaptic short-term plasticity. Remarkably, PTP switched mossy fiber synapses into full detonators for tens of seconds. Plasticity-dependent detonation may be critical for efficient coding, storage, and recall of information in the granule cell–CA3 cell network." See Vyleta et al [510].

The strength of a synapse in relation to action potential threshold is a key factor that defines its role in neuronal information processing. In neocortical and hippocampal circuits, it is generally thought that individual synapses are weak, and that temporal or spatial summation of unitary excitatory post-synaptic potentials (EPSPs) is required to reach the firing threshold. In contrast, in the peripheral nervous system and brainstem, 'detonator' synapses are abundant, which directly control the activity of postsynaptic targets in the absence of temporal and spatial summation. Well established examples of detonator synapses include the neuromuscular junction and the calyx of Held in the auditory brainstem.

Whether detonator synapses exist in higher brain regions is presently unclear. In a network with weak synapses below detonation, activity in an ensemble of multiple convergent presynaptic neurons is required to activate a postsynaptic cell. In contrast, in a network with strong synapses near detonation, activity of a single connected presynaptic neuron may be sufficient to drive a postsynaptic cell and to initiate a behaviorally relevant output. The hippocampal mossy fiber synapse onto CA3 pyramidal neurons is a candidate for a detonator synapse. — see Kammerer [259] (G)
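The figures quoted in this footnote (about 50 granule cells converging on one CA3 pyramidal cell, and about 15 CA3 targets per granule cell) can be checked against the rat cell counts given in the preceding footnote, and the claim that subdetonation is an "unlikely scenario" under sparse activity can be made concrete with a binomial calculation. The sketch below is arithmetic only. The activity level `P_ACTIVE` and the assumption that granule cells are active independently are illustrative assumptions, not values from the cited papers, and `prob_at_least` is a hypothetical helper.

```python
from math import comb

N_GC = 1_000_000   # rat DG granule cells (figure quoted in the notes)
N_CA3 = 300_000    # rat CA3 pyramidal cells (figure quoted in the notes)
FAN_IN = 50        # granule cells converging on one CA3 pyramidal cell
FAN_OUT = 15       # CA3 pyramidal cells contacted by one granule cell

# Counting mossy-fiber synapses from either side should give the same total.
synapses_by_fan_out = N_GC * FAN_OUT
synapses_by_fan_in = N_CA3 * FAN_IN
print(synapses_by_fan_out, synapses_by_fan_in)

def prob_at_least(m, n, p):
    """P(at least m of n independent inputs are active), each with probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(m, n + 1))

P_ACTIVE = 0.02    # ASSUMED fraction of granule cells active for a given pattern

# m = 1 corresponds to full detonation; larger m to subdetonation,
# where several co-active inputs are needed to fire the CA3 cell.
for m in (1, 2, 5, 10):
    print(m, prob_at_least(m, FAN_IN, P_ACTIVE))
```

The two synapse counts agree, so the convergence and divergence figures are mutually consistent with the cell counts. Under the assumed activity level, the probability that a CA3 cell sees many co-active granule-cell inputs falls off steeply as the required number grows, which is the intuition behind the argument for detonation.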

¹⁵⁰ The superficial layers of EC (II and III) project to the dentate gyrus and hippocampus: layer II projects primarily to dentate gyrus and CA3; layer III projects primarily to hippocampal region CA1 and the subiculum. These layers receive input from other cortical areas, especially associational, perirhinal, and parahippocampal cortices, as well as prefrontal cortex. EC as a whole receives highly processed input from every sensory modality, as well as input relating to ongoing cognitive processes.

EC neurons process general information such as directional activity in the environment, in contrast to hippocampal neurons, which usually encode information about specific places. This suggests that EC encodes general properties about current contexts that are then used by hippocampus to create unique representations from combinations of these properties. Research generally highlights a useful distinction in which the medial entorhinal cortex mainly supports processing of space, whereas the lateral entorhinal cortex mainly supports the processing of time. (SOURCE).

¹⁵¹ Several authors have produced evidence supporting some form of integration or generalization depending on circuits in the hippocampal formation along with recurrent connections to and from the medial temporal cortex. The pattern separation accomplished by the DG may also serve as a means of relating episodic memories with similar content, but different temporal or spatial context. "We exposed a neural network model that instantiates known properties of hippocampal projections and subfields to sequences of items with temporal regularities. We found that the monosynaptic pathway—the pathway connecting entorhinal cortex directly to region CA1—was able to support statistical learning, while the trisynaptic pathway—connecting entorhinal cortex to CA1 through dentate gyrus and CA3—learned individual episodes, with apparent representations of regularities resulting from associative reactivation through recurrence". — see Schapiro et al [439] (H)

¹⁵² New neurons continue to be generated in the subgranular zone of the dentate gyrus of the adult mammalian hippocampus. This process has been linked to learning and memory, stress and exercise, and is thought to be altered in neurological disease. In humans, some studies have suggested that hundreds of new neurons are added to the adult dentate gyrus every day, whereas other studies find many fewer putative new neurons. Here we show that a defined population of progenitor cells does not coalesce in the subgranular zone during human fetal or postnatal development. We also find that the number of proliferating progenitors and young neurons in the dentate gyrus declines sharply during the first year of life and only a few isolated young neurons are observed by 7 and 13 years of age. In adult patients with epilepsy and healthy adults (18-77 years; n = 17 post-mortem samples from controls; n = 12 surgical resection samples from patients with epilepsy), young neurons were not detected in the dentate gyrus. In the monkey (Macaca mulatta) hippocampus, proliferation of neurons in the subgranular zone was found in early postnatal life, but this diminished during juvenile development as neurogenesis decreased. We conclude that recruitment of young neurons to the primate hippocampus decreases rapidly during the first years of life, and that neurogenesis in the dentate gyrus does not continue, or is extremely rare, in adult humans. The early decline in hippocampal neurogenesis raises questions about how the function of the dentate gyrus differs between humans and other species in which adult hippocampal neurogenesis is preserved. — see Sorrells et al [464]

¹⁵³ The hippocampus is one of the most affected areas in Alzheimer's disease (AD). Moreover, this structure hosts one of the most unique phenomena of the adult mammalian brain, namely, the addition of new neurons throughout life. This process, called adult hippocampal neurogenesis (AHN), confers an unparalleled degree of plasticity to the entire hippocampal circuitry. Nonetheless, direct evidence of AHN in humans has remained elusive. Thus, determining whether new neurons are continuously incorporated into the human dentate gyrus (DG) during physiological and pathological aging is a crucial question with outstanding therapeutic potential. By combining human brain samples obtained under tightly controlled conditions and state-of-the-art tissue processing methods, we identified thousands of immature neurons in the DG of neurologically healthy human subjects up to the ninth decade of life. These neurons exhibited variable degrees of maturation along differentiation stages of AHN. In sharp contrast, the number and maturation of these neurons progressively declined as AD advanced. These results demonstrate the persistence of AHN during both physiological and pathological aging in humans and provide evidence for impaired neurogenesis as a potentially relevant mechanism underlying memory deficits in AD that might be amenable to novel therapeutic strategies. — see Moreno-Jiménez et al [363]

¹⁵⁴ Adult neurogenesis in the hippocampus leads to the incorporation of thousands of new granule cells into the dentate gyrus every month, but its function remains unclear. Here, we present computational evidence that indicates that adult neurogenesis may make three separate but related contributions to memory formation. First, immature neurons introduce a degree of similarity to memories learned at the same time, a process we refer to as pattern integration. Second, the extended maturation and change in excitability of these neurons make this added similarity a time-dependent effect, supporting the possibility that temporal information is included in new hippocampal memories. Finally, our model suggests that the experience-dependent addition of neurons results in a dentate gyrus network well suited for encoding new memories in familiar contexts while treating novel contexts differently. Taken together, these results indicate that new granule cells may affect hippocampal function in several unique and previously unpredicted ways. — see [3] (B)

¹⁵⁵ The distinction between embryo and fetus is made based on gestational age. An embryo is the early stage of human development in which organs and critical body structures are formed. An embryo is termed a fetus beginning in the 11th week of pregnancy, which is the 9th week of development after fertilization of the egg. Totipotent cells can form all the cell types in a body, plus the extraembryonic, or placental, cells. Embryonic cells within the first couple of cell divisions after fertilization are the only cells that are totipotent. Pluripotent cells can give rise to all of the cell types that make up the body; embryonic stem cells are considered pluripotent. Multipotent cells can develop into more than one cell type, but are more limited than pluripotent cells; adult stem cells and cord blood stem cells are considered multipotent.

¹⁵⁶ Basket cells are inhibitory GABAergic interneurons of the brain, found throughout different regions of the cortex and cerebellum. Basket cells are multipolar GABAergic interneurons that function to make inhibitory synapses and control the overall potentials of target cells. Basket cells form axo-somatic synapses, meaning their synapses target somas of other cells. By controlling the somas of other neurons, basket cells can directly control the action potential discharge rate of target cells.

Hippocampal basket cells target somata and proximal dendrites of pyramidal neurons. In the CA3 region of the hippocampus, basket cells can often form recurrent inhibition loops with pyramidal cells. Projections from a pyramidal cell will innervate the basket cell, which in turn projects back onto the original pyramidal cell. Since basket cells are inhibitory, this generates a closed loop that can help dampen excitatory responses. (SOURCE).
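The dampening effect of such a recurrent inhibition loop can be sketched with a two-unit firing-rate model: one pyramidal unit that excites one basket unit, which inhibits the pyramidal unit in return. This is a minimal linear sketch, not a biophysical model. The unit time constants, the weights, the constant drive, and the absence of rectification are all assumptions made for illustration, and `simulate` is a hypothetical helper.

```python
def simulate(drive, w_pyr_to_basket, w_basket_to_pyr, dt=0.05, steps=2000):
    """Euler-integrate the two-unit rate model and return the final pyramidal rate.

    d(pyr)/dt    = -pyr    + drive - w_basket_to_pyr * basket
    d(basket)/dt = -basket + w_pyr_to_basket * pyr
    """
    pyr = basket = 0.0
    for _ in range(steps):
        d_pyr = -pyr + drive - w_basket_to_pyr * basket
        d_basket = -basket + w_pyr_to_basket * pyr
        pyr += dt * d_pyr
        basket += dt * d_basket
    return pyr

# Same excitatory drive, with and without the inhibitory loop (weights are ASSUMED).
open_loop = simulate(drive=1.0, w_pyr_to_basket=0.0, w_basket_to_pyr=0.0)
closed_loop = simulate(drive=1.0, w_pyr_to_basket=1.0, w_basket_to_pyr=2.0)

print("pyramidal rate without loop:", open_loop)
print("pyramidal rate with loop:   ", closed_loop)
```

At steady state the model gives pyr = drive / (1 + w_pyr_to_basket * w_basket_to_pyr), so with a loop gain of 2 the pyramidal response settles at one third of its open-loop value. The stronger the loop gain, the more the excitatory response is damped.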

¹⁵⁷ The structures forming the medial temporal lobe appear to be necessary for the establishment of long-term declarative memory. In particular, they may be involved in the "consolidation" of information in higher-order associational cortices, perhaps through feedback projections. This review highlights the fact that the medial temporal lobe is organized as a hierarchy of associational networks. Indeed, associational connections within the perirhinal, parahippocampal, and entorhinal cortices enable a significant amount of integration of unimodal and polymodal inputs, so that only highly integrated information reaches the remainder of the hippocampal formation. The feedback efferent projections from the perirhinal and parahippocampal cortices to the neocortex largely reciprocate the afferent projections from the neocortex to these areas. There are, however, noticeable differences in the degree of reciprocity of connections between the perirhinal and parahippocampal cortices and certain areas of the neocortex, in particular in the frontal and temporal lobes. These observations are particularly important for models of hippocampal-neocortical interaction and long-term storage of information in the neocortex. Furthermore, recent functional studies suggest that the perirhinal and parahippocampal cortices are more than interfaces for communication between the neocortex and the hippocampal formation. These structures participate actively in memory processes, but the precise role they play in the service of memory or other cognitive functions is currently unclear. See Lavenex and Amaral [303] (D)