How the (Tensor-) Brain uses Embeddings and Embodiment to Encode Senses and Decode Symbols

Volker Tresp (corresponding author), Ludwig-Maximilians-Universität München, Oettingenstr. 67, D-80538 Muenchen, Volker.Tresp@lmu.de

Hang Li, LMU Munich

Abstract

The tensor brain has been introduced as a computational model for perception and memory. We provide an overview of the tensor brain model, including recent developments. The tensor brain has two major layers: the representation layer and the index layer. The representation layer is a model for the subsymbolic global workspace from consciousness research. The state of the representation layer is the cognitive brain state. The index layer contains symbols for concepts, time instances, and predicates. In a bottom-up operation, the cognitive brain state is encoded by the index layer as symbolic labels. In a top-down operation, symbols are decoded and written to the representation layer. This feeds to earlier processing layers as embodiment. The top-down operation became the basis for semantic memory. The embedding vector of a concept forms the connection weights between its index and the representation layer. The embedding is the signature or “DNA” of a concept, which is decoded by the brain when its index is activated. It integrates all that is known about a concept from different experiences, modalities, and symbolic decodings. Although being computational, it has been suggested that the tensor brain might be related to the actual operation of the brain. The sequential nature of symbol generation might have been a prerequisite to the generation of natural language. We describe an attention mechanism and discuss multitasking by multiplexing. We emphasize the inherent multimodality of the tensor brain. Finally, we discuss embedded and symbolic reasoning.

keywords:

Symbolic Representation, Embeddings, Index Layer, Representation Layer, Perception, Episodic Memory, Semantic Memory, Reasoning

1 Introduction

There has been a long and ongoing debate about the roles of symbolic versus subsymbolic processing to achieve intelligent systems. The viewpoint of the work presented here is that symbols are the outcomes of measurement devices. A measurement device can be of a technical nature or, as considered here, it can be the brain. Via the sensory system, the real world forms the input to the measurement device, and the symbols are the outputs. The outcome of a measurement can feed back to and influence the measurement device. If the measurement device is also an actor like a robot or a person, it can also affect the real world.

The work on the tensor brain, which is reviewed in this paper, follows these ideas. It suggests that much of the brain’s processing is subsymbolic, but at the highest level, the brain uses symbols.

A major component of the tensor brain is the subsymbolic representation layer, which is the main communication hub. It is related to the global workspace in consciousness research and is sometimes called the “mental canvas” or the “theater of the brain”. It can be viewed as the blackboard in multiagent systems. Modules in the brain can write to and read from the representation layer. It is a cognitive hub, and its activation is the cognitive brain state. Whatever reaches conscious awareness propagates through the representation layer.

A second layer is the symbolic index layer, where each symbol or index has a local representation. If the brain looks at a scene and classifies the entity in the scene as a dog, then the brain acts as a bottom-up measurement device, and the symbol Dog is the outcome of the measurement. We call this symbolic encoding. Symbols and subsymbolic representations describe and analyze views or projections of the world by interacting with one another. There is no activation of a symbol without an associated activation on the subsymbolic level, and subsymbolic representations are constantly annotated with symbols.

The connection weights between a symbolic index and the representation layer form the embedding vector of that symbol. The embedding of a symbol is its signature or “DNA”. If the outcome of a measurement is the symbol Dog, or the brain is focussing on the symbol Dog for any other reason, then the whole brain should be informed about this fact. The index activates the representation layer with its embedding, and this activation is then propagated to other parts of the brain via grounding and embodiment. Symbols can activate other symbols and thus can also be understood in the context of their relationships to other symbols. The embedding vector of an index is optimized for its role in this network, considering perceptual inputs and embodiment. Figure 1 shows the overall architecture of the tensor brain (TB).

The work on the tensor brain considers that symbols can be concepts like entities, classes, and attributes. In addition, symbols can be predicates or episodic instances. The latter permits the brain to form episodic memories. The embedding similarity of two scenes is not only determined by the scene input but also by their similarity in symbolic encodings. Thus symbolic similarity is important for memory recall.

The tensor brain approach argues that perception is greatly supported by memory, or as Goethe puts it: “You only see what you know!” Figure 2 illustrates how episodic and semantic memory support perception. Symbolic decoding for perception and memory is a serial process and an “inner” language. Humans can talk about perception and memories without much effort. It seems plausible that the “outer” language that is spoken is based on an evolutionarily earlier “inner” language. We communicate with symbols, we argue with symbols, and sometimes we might even reason with symbols. But it is not just symbols: although the cognitive brain state is not easily communicated directly, it is reflected in embodiment and grounding, e.g., as intonation, body language, gestures, and facial expressions.

Maybe the brain is a prediction machine, as many have stated. There is a lot of evidence that this is true on the level of implicit memories, including perceptual and motor skills. To predict at longer time scales (e.g., evening plans), the mind needs to understand the present and relate it to the past: it needs explicit understanding and it needs an explicit memory. It needs a future memory, it needs imagination.

The paper is organized as follows. In the next section, we summarize the literature on the tensor brain. In Section 3, we describe the representation layer, i.e., the tensor brain’s model for the global workspace. We introduce the cognitive brain state and show how it progresses in time as a recurrent neural network. In Section 4, we introduce a probabilistic interpretation of the cognitive brain state. Section 5 describes symbolic encoding as bottom-up inference. We discuss symbolic indices for concepts, predicates, and episodic time instances. In Section 6, we describe symbolic decoding as top-down inference. We introduce an attention mechanism and describe embodiment as an autoencoder. In Section 7, we describe the different operational modes, in particular, perception, episodic memory, and semantic memory. We explain the generation of triple statements and discuss language generation and understanding. In Section 8 we describe how self-supervised learning adapts embedding vectors. In Section 9, we discuss multitasking by multiplexing and cognitive control. Section 10 discusses multimodality and different forms of reasoning, in particular, embedded reasoning, symbolic reasoning, and embedded symbolic reasoning. Section 11 contains a summary.

Figure 1: The TB architecture. Scene input (shown at the bottom) is mapped by the layers of a deep convolutional network to the representation layer. The representation layer is a mathematical model for the global workspace. On the right, we see the evolution neural network with one hidden layer providing recurrence. In the TB the latter is called the dynamic context layer. The representation layer maps to the index layer, which feeds back to the representation layer. The columns of matrix $\mathbf{A}$ contain the embedding vectors. The bottom-up and top-down processing can produce several labels for a scene or ROI.

Glossary and notation:

• The agent is an individual and is the actor. Its mind executes cognitive functions using its neurobiological basis, i.e., its brain.

• The representation layer is a model for the brain’s global workspace. Its activation is the cognitive brain state (CBS). The probabilistic interpretation of the CBS is the probabilistic cognitive brain state (pCBS).

• The index layer represents indices which can be concepts, predicates, and episodic indices. Each index is realized by an ensemble of neurons. An ensemble is referred to as an index, a symbol, or a label.

• The synaptic weights linking an index with the representation layer form the embedding vector of that index. An embedding vector is the signature or “DNA” of the associated index.

• $\vec{\gamma}$ is the vector of post-activations of the ensembles in the representation layer, and $\mathbf{q}$ is the vector of pre-activations. $\mathbf{a}_{k}$ is the embedding vector of index $k$. Components are $\gamma_{i}$, ${q}_{i}$, and ${a}_{i,k}$, with $i=1,\ldots,n$.

2 The Development of the Tensor Brain

We review the development of the tensor brain model (TB). The first paper on the topic was published in 2015 [41]. It was the first paper to discuss the interaction between symbolic representations, subsymbolic representations, and concept embeddings in the operation of the brain. It used the Tucker tensor decomposition as a mathematical generative model for perception and memory. The paper introduced the representation layer and the index layer. Semantic memory was related to knowledge graph factorization, and episodic memory was related to the factorization of temporal knowledge graphs. The paper pioneered the field of temporal knowledge graph embedding models and suggested the analysis of visual scenes by factorized graph models, now called scene graph embedding models. The paper renewed the interest in index-based approaches to memory, as pioneered by Teyler and co-workers [40].

The vision part of the model was extended in [2] by including bounding boxes as regions of interest (ROIs) for entities in the scene and presented experimental results. It used the experimental framework and the annotated image data provided by [17]. Temporal knowledge graphs were further explored, e.g., in [43] and [26].

[44] introduced the main components of the TB architecture consisting of the representation layer, the index layer, and a recurrent neural network. [45] introduced a probabilistic generative model for the TB. [37] considered an attention mechanism and emphasized the role of prior knowledge and grounding. [46] extended the approach further and included extensive experiments. It provides a consistent mathematical framework and contains a detailed discussion relating the TB to cognition and neuroscience. [42] derived a probabilistic TB model using a Heisenberg measurement process that can be motivated by decoherent quantum theory.

3 The Subsymbolic Representation Layer

3.1 The Representation Layer and the Cognitive Brain State (CBS)

The brain might consist of thousands or even millions of modules that need to coordinate and exchange information. Some information needs to be conveyed to higher processing layers, maybe even reaching conscious attention. In the TB, an integrative layer is introduced in the form of the representation layer, which is a high-dimensional basis to which modules can write and from which they can read.

The TB assumes that the activation (firing rate) of an ensemble of neurons implements a population code. The activation pattern of all ensembles in the representation layer reflects the cognitive brain state (CBS). Some ensembles might be specific and represent, e.g., the color “red”. Others might be rather unspecific latent factors. Experimental results indicate that the CBS exhibits a meaningful clustering structure [46].

Let $n$ be the number of ensembles. Then for $i=1,\ldots,n$

\gamma_{i}\leftarrow\mathrm{sig}(q_{i}).

(1)

Here, $\gamma_{i}\in(0,1)$ is the (post-) activation of ensemble $i$, and $q_{i}\in\mathbb{R}$ is its pre-activation. Thus, $\gamma_{i}\approx 1$ means that ensemble $i$ is firing with maximal intensity, and $\gamma_{i}\approx 0$ means that ensemble $i$ is firing at some basal rate or not at all. As discussed below, $q_{i}$ receives input signals, and $\gamma_{i}$ is the output activation. The CBS is the activation vector $\vec{\gamma}$ of the representation layer. In the TB, one assumes that $\mathrm{sig}(q_{i})=(1+\exp(-q_{i}))^{-1}$ is the logistic function.

3.2 The Representation Layer, the Global Workspace Theory, and the Blackboard

Baars and his coworkers have introduced the concept of a global workspace [1], which is the foundation of some theories on consciousness [1, 8]. The global workspace is a functional hub of broadcast and integration that allows information to be disseminated across modules. The global workspace is sometimes referred to as the “theater of the brain” or its “mental canvas”. As noted by [1], the global workspace could be related to the blackboard in a multi-agent system, where computational modules share information. The global workspace provides a communication and interaction hub, which is suitable for modules dealing with higher levels of abstraction, where direct module-to-module interactions might not be appropriate.

Let’s consider vision. Visual inputs contribute to the pre-activations of the ensembles in the global workspace through the visual paths, so visual input writes to the representation layer. In reverse, the CBS might feed back to the visual path in the form of embodiment, so for the visual backward path, the global workspace acts as input, and early visual processing layers get activated.

Abstracting from vision, the modeling assumption is that the global workspace receives signals from many brain modules, and brain modules, in turn, might be affected by the state of the global workspace. In particular, the global workspace receives input from all sensory modalities the agent can be aware of, like vision, hearing, taste, smell, touch, proprioception, and pain. It has been proposed that all brain states that can reach consciousness, including inner feelings and emotional states, contribute to the global workspace and, in reverse, can be affected by its state. However, not all modules in the brain directly affect the global workspace and, thus, might not directly reach conscious attention. Naturally, perceptual input plays a major role, but so do, e.g., the pain after an injury and the emotions associated with a memory.

In the TB, it is assumed that the global workspace is implemented by the representation layer. The representation layer is related to the hidden layer in a recurrent neural network (RNN), with some important modifications, as discussed in the following.

3.3 Evolution Neural Network and Recurrency

The representation layer in the TB exhibits feedback connections, which introduce dependencies between the activations of the ensembles and an internal memory. The model assumes, with $i=1,\ldots,n$

q_{i}^{(\tau)}\leftarrow q_{i}^{(\tau-1)}+g_{i}(\mathbf{v}^{(\tau)})+f_{i}^{\textit{NN}}(\vec{\gamma}^{(\tau-1)}).

(2)

The index $\tau$ is a counter of a discrete-time operation of the brain. All operations that occur for a fixed $\tau$ are related to the same concept or episodic index. Here, $f_{i}^{\textit{NN}}(\vec{\gamma}^{(\tau-1)})$ is the $i$ -th output of a neural network with linear output units that describes the evolution neural network, and $\mathbf{v}^{(\tau)}$ represents some input vector, originating, e.g., from a visual scene. $\mathbf{g}(\mathbf{v}^{(\tau)})$ is a deep convolutional neural network with linear output units that maps visual input to the representation layer.

In general, a particular $q_{i}^{(\tau)}$ might only depend on a subset of ensembles, permitting modularity. The direct dependency on $q_{i}^{(\tau-1)}$ models a skip connection, as commonly used in generative AI [15], and realizes a self-memory. This implements a recurrency that encourages stable activation patterns in the CBS.

The update is equivalent to an RNN update with skip connections. The main difference is that $f_{i}^{\textit{NN}}(\cdot)$ is a neural network that contains at least one hidden layer, which provides universal modeling capability. This hidden layer would not be part of a standard RNN.

The evolution neural network enables a prediction of a future CBS. In addition, it enables relationship modeling in semantic memory and episodic memory.
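The update in Equation 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `make_evolution_nn` and `evolution_step`, the tanh hidden layer, the random weights, and the vector `g_v` standing in for the output of $\mathbf{g}(\mathbf{v}^{(\tau)})$ are all hypothetical and not taken from the TB papers.

```python
import math
import random


def sig(x: float) -> float:
    """Logistic function from Equation 1."""
    return 1.0 / (1.0 + math.exp(-x))


def make_evolution_nn(n: int, hidden: int, seed: int = 0):
    """Toy f^NN: one tanh hidden layer and linear outputs (weights are random, not learned)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(n)]

    def f_nn(gamma: list[float]) -> list[float]:
        h = [math.tanh(sum(w * g for w, g in zip(row, gamma))) for row in w1]
        return [sum(w * hj for w, hj in zip(row, h)) for row in w2]

    return f_nn


def evolution_step(q_prev: list[float], g_v: list[float], f_nn) -> list[float]:
    """Equation 2: q^(tau) = q^(tau-1) + g(v^(tau)) + f^NN(gamma^(tau-1))."""
    gamma_prev = [sig(x) for x in q_prev]
    f_out = f_nn(gamma_prev)
    return [qp + gv + fo for qp, gv, fo in zip(q_prev, g_v, f_out)]


n = 4
f_nn = make_evolution_nn(n, hidden=3)
q0 = [0.0] * n                 # neutral pre-activations
g_v = [1.0, 0.0, -1.0, 0.5]    # stand-in for the CNN output g(v)
q1 = evolution_step(q0, g_v, f_nn)
```

The term `qp` is the skip connection: if both the input and the evolution network contribute nothing, the pre-activations are simply carried over to the next step.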

4 The Probabilistic Cognitive Brain State (pCBS)

The post-activation $\gamma_{i}$ can be interpreted as a Bernoulli parameter, i.e.,

P(X_{i}=1)=\gamma_{i}.

(3)

Here, $X_{i}=1$ means that the ensemble $i$ is “on”, and $X_{i}=0$ means that the ensemble is “off”. Assuming mutual independence, we can define the probabilistic cognitive brain state (pCBS) as a probability distribution with

P(X_{1}=i_{1},\ldots,X_{n}=i_{n})=\prod_{j=1}^{n}(\gamma_{j})^{i_{j}}(1-\gamma_{j})^{1-i_{j}}

with $i_{j}\in\{0,1\}$. The pCBS is extensively discussed in [42]. The probabilistic interpretation extends the model in interesting ways and justifies top-down inference. (Footnote 1: A distribution with maximum entropy would imply $\gamma_{j}=1/2,\ \forall j$. This explains the surprising fact that with no input activation all neurons fire with half intensity, which might not be physiologically plausible. Neurons with a ReLU activation function are inactive with no input activation.)

Equation 2 now has two interpretations. First, one can assume that the dynamics of the neural activations is the basic deterministic dynamic equation, and Equation 3 describes a noisy measurement. Alternatively, one can consider the states of the random variables $X_{i}$ to be the fundamental quantity and we need to consider the evolution of a multivariate probability distribution. In [42], it was shown that the evolution neural network can be derived from this latter view by enforcing independencies and under some approximations. In particular, at each instance, the pCBS is approximated by $n$ independent Bernoulli distributions. The physical reality in the brain is the firing rate $\gamma_{i}$ . The random variable $X_{i}$ is a mathematical construct and is never measured directly.
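As a small worked illustration of the pCBS, the sketch below evaluates the product of independent Bernoulli terms for a given on/off pattern and checks that the probabilities of all $2^{n}$ patterns sum to one. The function name `pcbs_probability` and the example values of $\vec{\gamma}$ are assumptions made for illustration.

```python
from itertools import product


def pcbs_probability(gamma: list[float], x: tuple[int, ...]) -> float:
    """P(X_1=x_1, ..., X_n=x_n) under independent Bernoulli ensembles."""
    p = 1.0
    for g, xi in zip(gamma, x):
        p *= g if xi == 1 else (1.0 - g)
    return p


gamma = [0.9, 0.2, 0.5]                       # example CBS (post-activations)
p_101 = pcbs_probability(gamma, (1, 0, 1))    # 0.9 * (1 - 0.2) * 0.5
total = sum(pcbs_probability(gamma, x) for x in product([0, 1], repeat=len(gamma)))
```

Because of the independence assumption, the whole distribution over $2^{n}$ patterns is described by the $n$ firing rates alone.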

5 The Symbolic Index Layer and Symbolic Encoding

5.1 The Index Layer

The symbolic index layer is a module of high cognitive relevance that interacts with the representation layer. It receives input from the representation layer but also feeds back to it. In the brain, an index (or pointer) might be realized by an ensemble of neurons, as before. A difference is that the index layer can enforce a sample-take-all behavior, a variant of the well-known winner-take-all concept.

Each index represents a symbol. We discuss three types of indices: concept indices, predicate indices, and episodic indices. Thus, we propose that an episodic index is also a symbol. In perception, we also use the term label for a symbol.

5.2 Concept Indices

The indices that belong to concepts are symbolic in nature. Concepts can, e.g., be entities, classes, attributes, locations, inner emotional states, actions, and decisions. Concepts permit the recognition of stable patterns in the CBS. Thus, a cluster analysis would have the brain detect a repeated pattern for concepts like Dog, a particular dog Sparky, and attributes such as Black and Happy. In perception, concept indices might label a scene or a visual area of interest, which we will refer to as the region of interest (ROI).

Assume the CBS is $\vec{\gamma}$ . An index $k$ receives inputs from the representation layer with

P(Y=k|\vec{\gamma})\approx\mathrm{softmax}_{\textit{dom}}(a_{0,k}+\sum_{i=1}^{n}a_{i,k}\gamma_{i})

(4)

where $Y=k$ means that the index layer is in state $k$ , i.e., only ensemble $k$ is active, and all other indices are inactive. The vector $\mathbf{a}_{k}=(a_{1,k},\ldots,a_{n,k})^{\top}$ is the typically sparse embedding vector of concept $k$ , and $a_{0,k}$ is a bias term. In the architecture, $a_{i,k}$ is the weight on the link from ensemble $i$ in the representation layer to ensemble $k$ in the index layer. The softmax is defined as

\mathrm{softmax}_{\textit{dom}}(a_{0,k}+\sum_{i=1}^{n}a_{i,k}\gamma_{i})=\frac{\exp(a_{0,k}+\sum_{i=1}^{n}a_{i,k}\gamma_{i})}{\sum_{k^{\prime}\in\textit{dom}}\exp(a_{0,k^{\prime}}+\sum_{l=1}^{n}a_{l,k^{\prime}}\gamma_{l})}.

The domain dom could be the set of all entities in case one only normalizes over entities. Similarly, it could be the set of all predicates, the set of all classes, or the set of all colors. Each domain defines a random variable. The set members should be mutually exclusive and complete labels, although this is not enforced, e.g., if two colors are assigned to the same entity. See the discussion in Section 7.4.3. (Footnote 2: For the pCBS, Equation 4 approximates $P(Y=k|\textit{context})=\mathbb{E}_{P(X_{1},\ldots,X_{n}|\textit{context})}P(Y=k|X_{1},\ldots,X_{n})$ [42].)
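The domain-restricted softmax of Equation 4 can be illustrated as follows. The sketch normalizes only over the indices in `dom`, so the same CBS yields one distribution over classes and a separate one over colors. The function name `index_probabilities`, the toy embeddings, and the zero biases are assumptions for illustration; the maximum logit is subtracted purely for numerical stability and does not change the result.

```python
import math


def index_probabilities(gamma, embeddings, biases, dom):
    """P(Y=k | gamma) for k in dom: softmax of a_{0,k} + sum_i a_{i,k} * gamma_i."""
    logits = {
        k: biases[k] + sum(a * g for a, g in zip(embeddings[k], gamma))
        for k in dom
    }
    top = max(logits.values())
    weights = {k: math.exp(v - top) for k, v in logits.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}


# Hypothetical embedding vectors a_k with n = 3 ensembles.
embeddings = {
    "Dog": [2.0, 0.0, 1.0],
    "Cat": [0.0, 2.0, 1.0],
    "Black": [0.0, 0.0, 3.0],
    "White": [0.0, 0.0, -3.0],
}
biases = {k: 0.0 for k in embeddings}
gamma = [0.9, 0.1, 0.8]

p_class = index_probabilities(gamma, embeddings, biases, dom=["Dog", "Cat"])
p_color = index_probabilities(gamma, embeddings, biases, dom=["Black", "White"])
```

Each call defines its own random variable, so `p_class` and `p_color` each sum to one independently of the other.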

Indices reflect repeated patterns in the CBS and are identified by their embeddings and maybe their location in the brain. In animals, they do not have associated names like Dog, Sparky, Black and Happy. In humans, a subset might be identified with concepts in natural language.

5.3 Predicate Indices

The second type of indices refers to predicates. Consider that the agent analyses a scene. The CBS first focuses on an ROI that, e.g., contains Sparky, a Dog in the scene, then on a second ROI, that, e.g., contains Jack, a Person, in the scene. Then, an ROI is formed that encompasses both previous ROIs. In the context of the two previous ROIs, the third one might be labeled by their relationship, e.g., looksAt. Thus, the brain does not only consider labels of ROIs, but also labels their relationships. Here, $\mathbf{a}_{k}$ is the embedding vector of predicate $k$ .

5.4 Episodic Indices Refer to Time Instances

The third type of indices concerns episodic indices and refers to time instances. In the TB, an episodic index is introduced for each (relevant) time instance. In its simplest form, the embedding vector of index $t$ is the vector of pre-activations of the representation layer at that instance, i.e., $\mathbf{a}_{t}\leftarrow\mathbf{q}^{(t)}$ . If the representation layer is dominated by visual input from a scene, it will store the resulting pre-activations of the CBS, including, e.g., the emotional state at that instance. Later, we will discuss how $\mathbf{a}_{t}$ is actually optimized.

6 Symbolic Decoding

The indices or symbols in the index layer can be decoded into subsymbolic representations. This is a form of top-down inference. The basic purpose of top-down inference is that the representation layer, and via embodiment the whole brain, should be informed about which concept has been detected in vision or, more generally, which concept the brain is focussing on. Thus, if the concept of Sparky was detected, this information should be communicated to the representation layer and thus the whole brain. This is particularly important for integrating memory in perception: with the current hypothesis that Sparky is in the image, the brain can add background on Sparky, i.e., that Sparky is Friendly and is ownedBy Jack and lovedBy Mary. See Figure 3. We discuss four variants. First, we discuss the deterministic model without top-down inference. Second, we discuss how a sampled concept label feeds back on the representation layer. Third, we discuss the generation of several sampled concepts. Fourth, we discuss the attention approximation.

6.1 A Deterministic Model without Top-down Inference

In a deterministic CBS interpretation, Equation 4 describes the output of the system. Assume that the labels for Sparky, Dog, Black, and Happy have substantial activation. The agent might select the label with the highest probability, e.g., Sparky. The agent might also consider other likely labels, such as Dog, Black, and Happy. There is no feedback to the representation layer in either case. The remaining part of the brain is not informed about likely labels. This is the standard RNN setting. Also, given the CBS, all labels are independent.

6.2 A Single Sampled Label

Assume that the labels for Sparky, Dog, and Black exhibit substantial activation. In the sampling mode, the TB samples from the distribution in Equation 4, and might generate the label Sparky. In general, if index $k$ is sampled, then it is assumed that concept $k$ is represented in the representation layer.

The index layer works in a sample-take-all mode: only one ensemble is firing at an instance, suppressing the firing of all other indices; in the example the index Sparky would be firing. Technically, after sampling, the index layer’s state is a one-hot-vector.

If index $Y=k$ is firing, it activates the global workspace in a top-down manner with

\mathbf{q}\leftarrow\alpha\mathbf{q}+\beta\mathbf{a}_{k}.

(5)

One can interpret $a_{i,k}$ as the synaptic weight connecting ensemble $k$ in the index layer with ensemble $i$ in the representation layer. In the pCBS interpretation, we again obtain independent Bernoulli distributions. This update equation was derived in [42]. The factor $\alpha$, with $0\leq\alpha\leq 1$, controls how much of $\mathbf{q}$ is preserved in the update. With $\beta=1$ and $\alpha=0$, it is the Heisenberg approximation, and with $\beta=1$ and $\alpha=1$, it is the Heisenberg approximation with an “added prior”. With $\alpha=1$ and $\beta=0$ we get an RNN update.

Thus, as reflected in the last equation and in Equation 4, connections are bi-partite and symmetrical. Strict symmetry is implied by some theoretical considerations [42], but it is not a strict requirement and is likely to be violated biologically. Note that with $\alpha=1$ the previous pre-activation is not eliminated but added to the embedding vector as a skip connection.
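The three settings of $\alpha$ and $\beta$ discussed for Equation 5 can be compared directly in a short sketch. The function name `top_down_update` and the numerical values are assumptions for illustration.

```python
def top_down_update(q, a_k, alpha, beta):
    """Equation 5: q <- alpha * q + beta * a_k for the sampled index k."""
    return [alpha * qi + beta * ai for qi, ai in zip(q, a_k)]


q = [0.5, -1.0, 2.0]          # pre-activations before decoding
a_sparky = [1.0, 0.0, -0.5]   # hypothetical embedding of the sampled index

heisenberg = top_down_update(q, a_sparky, alpha=0.0, beta=1.0)   # q is replaced by a_k
with_prior = top_down_update(q, a_sparky, alpha=1.0, beta=1.0)   # a_k is added to q
rnn_like = top_down_update(q, a_sparky, alpha=1.0, beta=0.0)     # no feedback from the index
```

With `alpha=0.0` the previous state is discarded, whereas with `alpha=1.0` it acts as a prior onto which the embedding is added.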

6.3 Multiple Sampled Labels

The process can be repeated several times by repeatedly sampling from Equation 4 and updating with Equation 5. With $\alpha>0$, each sample is conditioned on the previous ones. Each sample is a particular interpretation of the CBS, in the context of what has been sampled before. For example, the first label might represent the entity Sparky, the next one Dog, then Black, Happy, and so on. Finally, with $\alpha=1$ and $\beta=1$,

\mathbf{q}\leftarrow\mathbf{q}+\sum_{k\in\textit{Sample}}\mathbf{a}_{k}.

(6)

Thus, the index layer is part of a labeling engine. We also clearly see the difference from a standard RNN, where the last equation would not contain the sum, i.e., there is no feedback from the samples to the state. In the TB, the representation layer is explicitly informed about all generated samples. Also, subsequently generated samples will consider all the labels generated before. The activation of an index is an internal brain measurement, which is feeding back to the brain! All modules in the brain can learn about what was detected. Equation 6 shows the additive combination of embedding vectors found in word2vec [28], and also in TransE [4].
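The repeated sampling can be sketched as a loop in which each sampled label is added to the pre-activations before the next label is drawn. The names `label_probs` and `sample_labels`, the toy embeddings, and the omission of the bias terms $a_{0,k}$ are simplifying assumptions; the same label can be drawn more than once in this sketch.

```python
import math
import random


def sig(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def label_probs(q, emb):
    """Softmax over all labels given the current pre-activations (biases omitted)."""
    gamma = [sig(x) for x in q]
    logits = {k: sum(a * g for a, g in zip(vec, gamma)) for k, vec in emb.items()}
    top = max(logits.values())
    weights = {k: math.exp(v - top) for k, v in logits.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}


def sample_labels(q, emb, rounds, rng):
    """Sample a label, feed it back with alpha = beta = 1, and repeat."""
    q = list(q)
    labels = []
    for _ in range(rounds):
        probs = label_probs(q, emb)
        k = rng.choices(list(probs), weights=list(probs.values()))[0]
        labels.append(k)
        q = [qi + ai for qi, ai in zip(q, emb[k])]   # top-down feedback
    return q, labels


emb = {
    "Sparky": [1.0, 0.5, 0.0],
    "Dog": [0.8, 0.6, 0.1],
    "Black": [0.0, 0.2, 1.0],
    "Happy": [0.1, 0.9, 0.3],
}
q0 = [0.2, -0.1, 0.4]
q_final, labels = sample_labels(q0, emb, rounds=3, rng=random.Random(0))
```

After the loop, `q_final` equals `q0` plus the sum of the embeddings of the sampled labels, which is the additive form of Equation 6.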

6.4 Probabilistic Averaging: Attention

Consider that there is no sampling, and the CBS is updated to consider all possible labels, weighted by their probabilities. In [46, 42] the equation

\mathbf{q}=\mathbf{q}+\sum_{k}\mathbf{a}_{k}\mathrm{softmax}(a_{0,k}+\sum_{l=1}^{n}a_{l,k}\gamma_{l})

(7)

has been introduced. Thus, one obtains an update of the CBS, even without sampling but by rather considering label probabilities. Note that the equation is a form of an attention mechanism with a skip connection from generative AI. The CBS $\vec{\gamma}$ is the query vector, and the embedding vectors $\mathbf{a}_{k},\forall k$ are the key and the value vectors.
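A minimal sketch of the attention update in Equation 7 is given below. Instead of committing to one sampled index, every embedding is added to the pre-activations in proportion to its softmax probability. The function name `attention_update`, the toy embeddings, and the biases are assumptions for illustration.

```python
import math


def sig(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def attention_update(q, emb, bias):
    """Equation 7: q <- q + sum_k a_k * softmax_k(a_{0,k} + sum_l a_{l,k} * gamma_l)."""
    gamma = [sig(x) for x in q]   # the CBS acts as the query
    logits = {k: bias[k] + sum(a * g for a, g in zip(vec, gamma)) for k, vec in emb.items()}
    top = max(logits.values())
    weights = {k: math.exp(v - top) for k, v in logits.items()}
    z = sum(weights.values())
    probs = {k: w / z for k, w in weights.items()}
    # The embeddings serve as keys (in the logits) and as values (in the weighted sum).
    return [qi + sum(probs[k] * emb[k][i] for k in emb) for i, qi in enumerate(q)]


emb = {"Sparky": [1.0, 0.5, 0.0], "Puma": [0.9, -0.5, 0.2], "Black": [0.0, 0.2, 1.0]}
bias = {k: 0.0 for k in emb}
q_new = attention_update([0.2, -0.1, 0.4], emb, bias)
```

No index is active in isolation here, so the result blends all interpretations rather than selecting one.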

6.5 Attention versus Sampling

Attention does not result in a symbolic interpretation since no index is active in isolation; an advantage is that the attention mechanism is fast and parallel and holistically considers all interpretations. The attention equation relates indices to related indices leading to a sharing of statistical strength. In [42] it is shown that the attention mechanism can be derived from a Heisenberg scenario and not from a Bayes scenario: in the latter, only actual samples would influence the CBS.

Sampling commits the brain to a specific interpretation. For example, if the black thing in the scene is sampled to be Sparky, then the brain can add a lot of background information about Sparky. But if it is sampled to be a Puma, other background information is added and the agent’s reaction might be quite different. In sampling, both options are explored in different sampling rounds. Sampling explores the joint dependencies between labels. Sampling and top-down inference are the basis for semantic memory. See Figure 4.

The sampling of the indices is an inherent feature of the TB. What initiates the sampling? Sampling appears to have many things in common with the collapse of the wave function in quantum mechanics. Is the brain a decoherent quantum computer? These questions are addressed in [42].

6.6 Attention, Autoencoding and Embodiment

Consider the last equation but without the skip connection, i.e., $\alpha=0$ . One can define an autoencoder with an encoder, which maps the CBS to the index layer activation, i.e., using Equation 4. The decoder then maps the index layer softmax activation back to the representation layer, using attention Equation 7 (again without the skip connection). Thus, the index layer is the bottleneck layer in an autoencoder where one considers all symbolic encodings weighted with their probabilities.

In perception, one can also start with the visual input. In the encoder, the visual input $\mathbf{v}$ is first mapped to the representation layer via the visual processing function $\mathbf{g}(\mathbf{v})$. Then the CBS is mapped to the index distribution, as before. In the decoder, the activation of the index layer is mapped back to the representation layer and then back to the visual input via an approximation to the inverse of the visual processing layer, $\mathbf{\hat{v}}=\mathbf{g}^{\text{inv}}(\vec{\gamma})$. This top-down processing is a form of grounding or embodiment: not only the representation layer but all visual processing layers are informed, e.g., about what it means to consider Sparky in the index layer.

Brain circuits are generally bi-directional in the sense that if a module connects to another module, there are also connections in the reverse directions. In the TB, the connections between the representation layer and the index layer are bi-directional. Now, also the perceptual processing pipeline is assumed bi-directional. The inverse perceptual processing pipeline might not be as precise: as we all experience, a visual recall of the concept of Sparky is not as strong and clear as actually observing Sparky in a scene.

7 Operational Modes

One strength of the TB approach is that different operational modes of the brain can be modeled by one architecture. We discuss perception, episodic memory, and semantic memory.

7.1 Pseudocode

The pseudocode is given in three algorithms: Evolution Neural Network, Input and Attention, and Encoding (bottom-up) and Decoding (top-down). In the encoding and decoding algorithm, $\alpha=1$ and $\beta=1$, which is the Heisenberg approximation with an added prior. With $\alpha=0$, it is the Heisenberg approximation (without an added prior). With $\beta=0$ we get an RNN.

Algorithm: Evolution Neural Network

1: Input: $\mathbf{q}^{\textit{in}}$ $\triangleright$ The CBS is the input

2: $q_{i}\leftarrow q_{i}^{\textit{in}}+f_{i}^{\textit{NN}}(\mathrm{sig}(\mathbf{q}^{\textit{in}})),\ \forall i$ $\triangleright$ Evolution NN with skip connection

3: Output: $\mathbf{q}$

Algorithm 2 Input and Attention

1: Input: $\mathbf{q},\mathbf{v}$ ▷ The CBS is the input

2: $\mathbf{q}\leftarrow\mathbf{q}+\mathbf{g}(\mathbf{v})$ ▷ $\mathbf{v}$ from scene or bounding box

3: $\mathbf{q}\leftarrow\mathbf{q}+\sum_{k}\mathbf{a}_{k}\mathrm{softmax}(a_{0,k}+\sum_{l=1}^{n}a_{l,k}\mathrm{sig}(q_{l}))$ ▷ Attention

4: Output: $\mathbf{q}$

Algorithm 3 Encoding (bottom-up) and Decoding (top-down)

1: Input: $\mathbf{q},\alpha,\beta$ ▷ The CBS is the input

2: Sample $k\sim\mathrm{softmax}_{\textit{dom}}(a_{0,k}+\sum_{l=1}^{n}a_{l,k}\mathrm{sig}(q_{l}))$ ▷ Bottom-up: $k$ is sampled

3: $\mathbf{q}\leftarrow\alpha\mathbf{q}+\beta\mathbf{a}_{k}$ ▷ Top-down inference

4: Output: $\mathbf{q},k$
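The three algorithms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the placeholder evolution network `f_nn`, the precomputed visual feature vector `g_v`, and the toy embeddings are all hypothetical, and the attention step here simply sums over every index in `A`, whereas the text restricts it to past episodic memories or established entities.

```python
import math
import random


def sig(x):
    """Logistic function for a single pre-activation."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def index_probs(q, A, a0):
    """softmax over k of a0[k] + sum_l A[k][l] * sig(q[l]); A[k] is the embedding a_k."""
    s = [sig(ql) for ql in q]
    scores = [a0[k] + sum(w * sl for w, sl in zip(A[k], s)) for k in range(len(A))]
    return softmax(scores)


def evolution(q_in, f_nn):
    """Algorithm 1: q_i = q_i^in + f_i^NN(sig(q^in)), a skip connection around f_nn."""
    delta = f_nn([sig(x) for x in q_in])
    return [qi + di for qi, di in zip(q_in, delta)]


def input_and_attention(q, g_v, A, a0):
    """Algorithm 2: add the processed visual input g(v), then the attention term."""
    q = [ql + gl for ql, gl in zip(q, g_v)]
    p = index_probs(q, A, a0)
    attn = [sum(p[k] * A[k][l] for k in range(len(A))) for l in range(len(q))]
    return [ql + al for ql, al in zip(q, attn)]


def encode_decode(q, A, a0, alpha=1.0, beta=1.0, k=None, rng=random):
    """Algorithm 3: sample index k bottom-up (unless k is given), then write a_k top-down."""
    if k is None:
        p = index_probs(q, A, a0)
        k = rng.choices(range(len(A)), weights=p)[0]
    q = [alpha * ql + beta * al for ql, al in zip(q, A[k])]
    return q, k


# Toy run with two indices and a two-dimensional representation layer.
A = [[2.0, 0.0], [0.0, 2.0]]   # rows are the (made-up) embeddings a_0 and a_1
a0 = [0.0, 0.0]
q = input_and_attention([0.0, 0.0], [3.0, -3.0], A, a0)
q, k = encode_decode(q, A, a0, rng=random.Random(0))
```

Passing `k` explicitly to `encode_decode` skips the sampling step, which is how the memory modes described later set an index instead of sampling it.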

7.2 Visual Perception

7.2.1 Visual Scene

Assume that the agent analyzes a visual scene. Algorithm 2 (Input and Attention) starts with some neutral input, e.g., $\mathbf{q}=0$. The scene $\mathbf{v}\leftarrow\mathbf{v}^{\textit{scene}}$ is the visual input (Line 2). Next, the attention step is applied to obtain contributions from related past episodic memories (Line 3). Algorithm 3 (Encoding and Decoding) describes the encoding and decoding: a label is sampled (bottom-up) and the sampled label feeds back to the representation layer (top-down). Algorithm 3 might be called several times, and each time a label $k$ is generated. These labels might indicate, e.g., the location of the agent (e.g., EnglishGarden) and the weather condition (e.g., Sunny). However, a generated label might also be a past time index, indicating a similarity to a past scene.

7.2.2 A First ROI in a Visual Scene

Then, the agent might be interested in the properties of an entity, defined, e.g., by a particular ROI in the scene. The TB applies the evolution neural network of Algorithm 1. Its input $\mathbf{q}^{\textit{in}}$ is the output from Algorithm 3 (Encoding and Decoding), as applied before. Then Algorithm 2 (Input and Attention) is called. The ROI $\mathbf{v}\leftarrow\mathbf{v}^{\textit{ROI}}$ is the visual input (Line 2). Attention is calculated with respect to entities that are already established. Algorithm 3 describes the encoding and decoding. It might be called several times, and each time a label $k$ is generated. Labels might be, e.g., Sparky, Dog, Black, and Happy. Due to top-down inference, labels will be dependent, e.g., the probability for labeling Dog will take into account that Sparky was labeled before. For a given scene, many labels are generated for a given ROI, and the sampling is repeated several times in different sampling rounds.

7.2.3 A Second ROI in a Visual Scene

Then, a second ROI might be analyzed in the context of the scene and the first bounding box. As for the first ROI, Algorithm 1 (Evolution Neural Network), Algorithm 2 (Input and Attention), and Algorithm 3 (Encoding and Decoding) are applied. The second ROI might, e.g., describe the entity Jack. Further labels might be Person, Tall, and ContentLooking.

7.2.4 A Third ROI in a Visual Scene

Then, a third ROI encloses both previous bounding boxes. It might label the relationship between both entities with the predicate label looksAt, again using Algorithm 1 (Evolution Neural Network), Algorithm 2 (Input and Attention), and Algorithm 3 (Encoding and Decoding).

7.2.5 Forming Episodic Indices

For each perceptual event, an episodic index $t$ is introduced (see Section 5.4), and it can be indexed as an episodic event. Its embedding $\mathbf{a}_{t}$ is optimized to reconstruct the labels for the scene but also for the ROIs even when the visual inputs are missing. So, the goal is the symbolic reconstruction, as much as the subsymbolic reconstruction. See Section 8.

7.2.6 From Labels to Triples

An ROI and an episodic time index are unique identifiers (keys) for the other labels of an ROI. Assuming that entities are mutually exclusive, one can replace the ROI with the entity represented in the ROI, e.g., Sparky, and one can describe the generated labels at time $t$ as: (Sparky, type, ClassDog), (Sparky, type, ColorBlack), (Sparky, type, MoodHappy), or shorter (Sparky, type, Dog), (Sparky, type, Black), and (Sparky, type, Happy). (Footnote 3: The relations would be dog(Sparky), black(Sparky), happy(Sparky).) These are triple statements, and the first entry is the subject, the second is the predicate, and the third is the object. Thus, a logical triple statement describes the relationship between concept labels. In the context of perception, triples describe currently generated labels, and in the context of episodic memory, past generated labels. If an entity does not yet exist, it is introduced with a new index.

One can think of an entity as a thing on which a measurement is executed, the predicate describing the type of a measurement and the object as the outcome of the measurement. In addition, and this is important for reasoning, one can also generate triples with subjects that are not entities such as (Dog, type, Happy), (Black, type, Sparky), and so on. The TB calls them generalized statements and they are useful for embedded symbolic reasoning (see Section 10.2).

By involving two ROIs (for the subject and object) and an enclosing ROI (for the predicate) one obtains triples of the form (Sparky, looksAt, Jack).
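The step from labels to triples can be illustrated with a small sketch. The helper names `labels_to_triples` and `generalized_statements` are hypothetical and only serve this example; representing a triple as a plain tuple is likewise an assumption.

```python
def labels_to_triples(entity, labels, predicate="type"):
    """Turn the labels sampled for one ROI into (subject, predicate, object) triples."""
    return [(entity, predicate, label) for label in labels]


def generalized_statements(labels, predicate="type"):
    """All ordered pairs of distinct labels from one ROI, so subjects need not be entities."""
    return [(s, predicate, o) for s in labels for o in labels if s != o]


# Labels sampled for the first ROI, with the entity label first.
roi_labels = ["Sparky", "Dog", "Black", "Happy"]

entity_triples = labels_to_triples(roi_labels[0], roi_labels[1:])
generalized = generalized_statements(roi_labels)

# A relational triple from two ROIs and an enclosing ROI.
relation = ("Sparky", "looksAt", "Jack")
```

Here `entity_triples` holds the three statements about Sparky, while `generalized` additionally contains statements such as (Dog, type, Happy) and (Black, type, Sparky).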

7.3 Episodic Memory

According to Tulving, who introduced the term [47]: Episodic memory stores information about general and personal events and concerns information we “remember”. It is about past observations. In the context of vision, it would be the reconstruction and the decoding of a past scene.

7.3.1 Episodic Engram

In the TB, an episodic memory engram consists of the episodic index and its embedding. If an episodic index is activated, its embedding vector will define the CBS and this propagates to earlier visual processing layers, using embodiment. In this way, the brain gains a subsymbolic understanding of the past event. In the context of vision, it would be an approximate reconstruction of a past scene. The TB proposes that episodic indices are symbolic in nature, similar to concept indices.

7.3.2 Decoding of Episodic Memory

In the TB approach, an episodic memory is realized by the activation of the episodic index that belongs to that past memory. Thus, an episodic memory approximately restores the CBS of a past instance. We start with Algorithm 3 (Encoding and Decoding) with a neutral input $\mathbf{q}=0$, where we set $k\leftarrow t$ and do not sample. Then Algorithm 3 is reapplied to generate labels for the past scene, e.g., EnglishGarden and Sunny. Note that if the CBS of the original scene is perfectly restored in the episodic recall, the label probabilities are exactly the same as at the time of perception.

Next, we consider an ROI from the past scene. The computational evolution neural network of Algorithm 1 is applied. Then Algorithm 3 (Encoding and Decoding) is applied, which generates labels by sampling. The first label might correspond to the episodic index, which belonged to an ROI in the scene. Then Sparky, Dog, Black, and Happy might be generated. By reapplying Algorithm 1 and Algorithm 3 we can also generate labels for objects and predicates from the past scene.

Essentially, the labeling and relationship prediction is as in perception, except that visual inputs are replaced by stored episodic embedding vectors and without applying attention. Past scenes can be embodied and thus partially reconstructed! Thus, one can recover not just the past memories but also, e.g., emotions attached to the memories.
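The claim that a perfectly restored CBS yields the same label probabilities as at perception can be checked with a toy example. This is a sketch under assumptions: the episodic embedding is taken to be an exact copy of the CBS (the simple storage variant, not the learned embedding), the weather domain with labels Sunny and Rainy and all numbers are made up, and `label_probs` is a hypothetical helper.

```python
import math


def sig(x):
    """Logistic function for a single pre-activation."""
    return 1.0 / (1.0 + math.exp(-x))


def label_probs(q, A, a0):
    """Bottom-up label distribution: softmax over k of a0[k] + a_k . sig(q)."""
    s = [sig(ql) for ql in q]
    scores = [a0[k] + sum(w * sl for w, sl in zip(A[k], s)) for k in range(len(A))]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]


labels = ["Sunny", "Rainy"]                  # hypothetical weather domain
A = [[1.5, -0.5, 0.2], [-0.3, 1.0, 0.8]]     # made-up label embeddings
a0 = [0.0, 0.0]

# CBS at perception time (made-up numbers).
q_perceived = [1.2, -0.4, 0.7]
p_perceived = label_probs(q_perceived, A, a0)

# Assumption: the episodic embedding a_t stores the CBS exactly.
episodic = {"t": list(q_perceived)}

# Recall: neutral input q = 0, set k <- t without sampling, alpha = beta = 1.
q_recalled = [0.0 + a for a in episodic["t"]]
p_recalled = label_probs(q_recalled, A, a0)
```

With a learned rather than copied embedding, `q_recalled` would only approximate `q_perceived`, and the two distributions would differ accordingly.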

As perception, episodic memory is a process that explores the different interpretations of a memory. The initiation of an episodic memory and its decoding into symbols and the involved embodiment might activate substantial parts of the brain. Based on the labels, triple statements can be generated.

Episodic memories are not restricted only to past visual events. The CBS at any instance can be stored as an episodic memory with its own episodic index. This, e.g., permits the restoration of a previous thought. We will continue this discussion in Section 9.

7.3.3 Recent Episodic Memory

Recent episodic memory enriches perception through the retrieval of perceptual experiences, which provide the agent with a sense of the here and now, i.e., the current context. To understand its own state and the world’s state in general, the agent needs to know what happened recently, in recent scenes, and on recently perceived entities. This state information cannot be directly derived from perceptual input alone.

For instance, the agent needs to remember that, even though perception does not give a clue, it is still in the hide-out because the bear had been chasing it and might still be lurking outside. Thus, recent episodic memory guides behavior and provides decision support.

Incidentally, patients who are unable to form new episodic memories show great deficits in personal orientation and context understanding. These deficits are often associated with severe bilateral damage to the medial temporal lobe (MTL), which includes the hippocampus [12]. Old age is also a factor.

7.3.4 Remote Episodic Memory

Remote episodic memory concerns events that are memorable but might have occurred further in the past. They can provide decision support and inform the agent about good and bad outcomes of previous episodic events that are similar to the current perceptual experience. It aids the agent in decision-making. Continuing the previous example, the agent might remember previous personal bear encounters and subsequent dangerous situations. Recall in remote episodic memory is triggered by the closeness between episodic representation and scene representation.

Remote episodic memories might occupy the brain for quite some time and are associated with retraining the brain during consciousness or during sleep [27].

Consider all the movies, movie scenes, TV shows, and comic strips you can memorize or at least remember if they are novel or already familiar. This gives you an impression of the many episodic indices which are present in the brain, at least if you follow the indexing approach of the TB.

7.3.5 Future Episodic Memories and Imagination

A future episodic memory is a forecasted event in the future, which at some point in time, is predicted to become a regular episodic memory. Consider an example. The agent might know that there is a football game in town in the evening and that the weather will be bad. By creating a future episodic index with labels FootballGame and RainyWeather, the agent might predict a bad traffic condition. This is a prediction associated with a future event, and it is a form of imagination. Note that the imagination is also grounded and embodied.

Thus, memory guides behavior for the future (future episodic memory) [36]. [9] describe this process as integration across relational events by imagining possible rewards in the future. The value associated with a memory (e.g., reward, threat) might be an integral aspect of episodic memory. The article also states that there is now extensive empirical data supporting the prevalent use of episodic memory across various decision-making tasks in humans.

Consider that the brain has one or several modules evaluating the CBS, e.g., into the states PersonallyRewarding, ImprovesSocialStatus, or Undesirable. Potentially, each such state obtains an associated index. The imagining of different future scenarios, under different decisions of the agent (drive to a friend at night and get into bad traffic and weather or stay home), can lead to different scenarios with different expected rewards, and the agent has the option to decide for the action that might lead to a situation with maximum reward.

Imagining future scenarios lets the agent contemplate facts very likely to be true or false under some assumed conditions. In spirit, this is similar to some form of logical reasoning.

Maybe the brain is a prediction machine, as many have stated [7, 33, 19, 20, 39, 14, 11]. This predictive machinery might become part of the fast reactive system of the brain, i.e., its implicit memory. This is relevant when your tennis playing improves. But to predict at the longer time scales considered here, the mind needs to understand the present and relate it to the past: it needs explicit understanding and it needs an explicit memory. It needs a future memory, it needs imagination.

7.3.6 Temperature Scaling, Simulation, and Bayes Probabilities

To decide on the scenario that gives the highest expected reward, the mind needs to take into account the plausibilities or likelihoods of the different scenarios. In the tensor brain, only samples are visible and not the probabilities. So how can the brain deal with probabilities?

Consider an example. “From what I know and also from what my gut feeling tells me, I bet that Sparky will win the race.” (Footnote 4: Looks like we just found out that Sparky is a RacingDog.) Betting has been used as evidence that individuals actually behave Bayes optimally, requiring the careful combination and comparison of probabilities. Consider that the domain is about racing dogs. We can introduce temperature scaling in the $\mathrm{softmax}_{\textit{dom}}$ function ($\mathrm{softmax}_{\textit{dom}}(x)$ becomes $\mathrm{softmax}_{\textit{dom}}(x/T)$ with temperature $T$). If we turn to low temperatures, we can turn a sample-take-all into a winner-take-all, i.e., only the expected winner is sampled, in this case, Sparky. So even when it might be difficult to get accurate probabilities in the brain, it might still be possible to get optimal decisions!
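The effect of temperature scaling can be illustrated with a short sketch. The dog names other than Sparky, the pre-activation values, and the function name `softmax_T` are made up for this example.

```python
import math
import random


def softmax_T(scores, T=1.0):
    """Softmax of scores / T, computed in a numerically stable way."""
    m = max(scores)
    exps = [math.exp((s - m) / T) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical pre-activations for a domain of racing dogs.
names = ["Sparky", "Rex", "Bello"]
scores = [2.0, 1.5, 1.0]

p_warm = softmax_T(scores, T=1.0)    # sample-take-all: every dog can be sampled
p_cold = softmax_T(scores, T=0.01)   # close to winner-take-all

rng = random.Random(0)
warm_samples = rng.choices(names, weights=p_warm, k=20)
cold_samples = rng.choices(names, weights=p_cold, k=20)
```

At $T=1$ Sparky is sampled only about half of the time, whereas at $T=0.01$ the other dogs have a probability so small that every sample is Sparky.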

Alternatively, [35] suggested that the brain actually is a Bayesian sampler. The paper states that “Only with infinite samples does a Bayesian sampler conform to the laws of probability; with finite samples, it systematically generates classic probabilistic reasoning errors, including the unpacking effect, base-rate neglect, and the conjunction fallacy.” Statements in semantic memory might also be associated with probabilities (see Section 7.4.3). Sampling might be the strategy to get estimates of probability values there as well.

7.4 Semantic Memory

In the cognitive literature, semantic memory is defined as representing facts that we know but don’t remember where we learned them. It is about statements that are true or false or probabilistic in nature. The logical or probabilistic statements are either time-invariant or are only slowly or rarely changing in time.

Consider that a friend describes a dog that you don’t know. Your episodic memory might provide you with information about previous encounters with dogs, but we seem to have an understanding of the concept Dog in semantic memory, even without the recall of specific episodic memories.

7.4.1 Semantic Engram

In the TB, the semantic memory engram includes the concept index $s$ and its embedding vector $\mathbf{a}_{s}$ . If a concept’s embedding is activated in the representation layer, and this propagates to earlier layers, using embodiment, the brain has a subsymbolic understanding of the concept. The embedding vector $\mathbf{a}_{s}$ can be thought of as a prototype vector, but it is a prototype vector in the abstract embedding space. In Section 10.7 we argue that $\mathbf{a}_{s}$ is the signature or “DNA” of the symbolic index $s$ .

7.4.2 Embedded Semantic Memory

Consider a semantic memory recall for a symbol $s$, e.g., Sparky. We start with Algorithm 3 (Encoding and Decoding) with a neutral input $\mathbf{q}=0$, where we set $k\leftarrow\bar{t}$ and do not sample. $\bar{t}$ is the index for a time-independent constant embedding vector $\mathbf{\bar{a}}$. The computational evolution neural network of Algorithm 1 is applied. Then Algorithm 3 is applied, where we set $k\leftarrow s$ and do not sample. As discussed, $s$ might be Sparky. Algorithm 3 might be applied several times, generating Dog, Black, and Happy. By reapplying Algorithm 1 and Algorithm 3 we can also generate labels for objects and predicates and recall, e.g., that (Sparky, ownedBy, Jack).

Thus semantic memory recall is quite similar to an episodic recall, only that $t\leftarrow\bar{t}$ and that $s$ is set and not sampled. Note that we can work with the same (brain-) architecture as for perception and episodic memory. (Footnote 5: Consider the pCBS. For semantic memory, we need to average over potential visual inputs, i.e., define a prior distribution for the pCBS. We again assume that the prior distribution can be described by independent Bernoulli distributions with pre-activation vector $\mathbf{\bar{a}}$ with index $\bar{t}$. This approximates the distribution $P^{\textit{prior}}(X_{1},\ldots,X_{n})=\int P(X_{1},\ldots,X_{n}|\mathbf{v})P(\mathbf{v})\,d\mathbf{v}$.)

By substituting the expression in Equation 6 into Equations 1 and 4, we see that a sample probability depends on all previous samples. The brain could implement an AND operation with several arguments. However, the dependencies are not fully general. To achieve that, one needs to apply the computational evolution neural network, as discussed in Section 10.6.

7.4.3 Triple-Facts and Conditional Probabilities

Perception and episodic memory are about observations. Semantic memory is about facts. So what are facts? Consider the perceptual analysis in Section 7.2.6. Assume that each time that an ROI was identified as Sparky and, in case the domain color was determined, it was Black. Then, (Sparky, type, Black) is not only an observation but also a fact, independent of a particular episodic instance.

The TB can also attach probabilities to facts. Assume that in case Sparky was in the scene and Sparky’s mood was determined, Sparky was Happy 60% of all times, one attaches (Sparky, type, Happy) with the conditional probability: $P(\textit{Happy}|\textit{Sparky})=0.6$ . Thus, facts and fact probabilities can be derived from observational statistics of the generated labels.

(Sparky, type, Black) can be interpreted as a statement about the color of the dog Sparky. It can also be interpreted as a rule: If an entity is identical to Sparky, then that entity is Black. This allows the TB to also define the conditional probability for generalized statements like $P(\textit{Black}|\textit{Dog})$. The associated conditional probability is estimated as the relative frequency with which the color was sampled to be Black for an ROI labeled as being a Dog. More explicitly, e.g., $P(\textit{Black}|\textit{Sparky})$ can be written as $P(\textit{Color}=\textit{Black}|\textit{Entity}=\textit{Sparky})$ where Color and Entity are the random variables.
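Deriving fact probabilities from observational statistics can be sketched as a relative-frequency count over generated labels. The record layout (one dictionary per ROI observation, mapping a domain to the sampled label), the function name `conditional_prob`, and the extra labels Sad, Rex, and Brown are assumptions made for this illustration.

```python
def conditional_prob(observations, target, given):
    """Estimate P(target_domain = target_label | given_domain = given_label).

    Only observations where the given label holds and the target domain was
    actually determined are counted. Returns None if there are none.
    """
    t_dom, t_label = target
    g_dom, g_label = given
    relevant = [o for o in observations if o.get(g_dom) == g_label and t_dom in o]
    if not relevant:
        return None
    hits = sum(1 for o in relevant if o[t_dom] == t_label)
    return hits / len(relevant)


# Made-up label records; each dictionary is one ROI at one time instance.
observations = [
    {"Entity": "Sparky", "Class": "Dog", "Color": "Black", "Mood": "Happy"},
    {"Entity": "Sparky", "Class": "Dog", "Color": "Black", "Mood": "Happy"},
    {"Entity": "Sparky", "Class": "Dog", "Color": "Black", "Mood": "Happy"},
    {"Entity": "Sparky", "Class": "Dog", "Color": "Black", "Mood": "Sad"},
    {"Entity": "Sparky", "Class": "Dog", "Color": "Black", "Mood": "Sad"},
    {"Entity": "Sparky", "Class": "Dog", "Color": "Black"},  # mood not determined
    {"Entity": "Rex", "Class": "Dog", "Color": "Brown"},
    {"Entity": "Rex", "Class": "Dog", "Color": "Brown"},
]

p_happy_sparky = conditional_prob(observations, ("Mood", "Happy"), ("Entity", "Sparky"))
p_black_sparky = conditional_prob(observations, ("Color", "Black"), ("Entity", "Sparky"))
p_black_dog = conditional_prob(observations, ("Color", "Black"), ("Class", "Dog"))
```

In this toy data, Sparky is Happy in three of the five observations where his mood was determined, he is always Black, and the generalized statement (Dog, type, Black) holds in six of eight Dog observations.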

A triple does not need to relate to vision. The observation (Sparky, type, Purebred) might have originated from another modality, like language. $P(\textit{Purebred}|\textit{Sparky})=1$ would mean that at any time someone made a statement about the pedigree of Sparky, it was stated that he was purebred.

Probabilities are tricky, also for the brain. The brain might have ways to deal with non-IID labels, limited sample size, and the authority/uncertainty of the source.

Also, there is an issue of mutual exclusiveness implied by the softmax normalization over a domain. Consider that Sparky is black and white (checkered). One solution would be to introduce the color BlackWhite, and $P(\textit{BlackWhite}|\textit{Sparky})=1$ . A second option would be that perception labels Sparky’s color to be Black in 50% of all cases and to be White, also in 50% of all cases. Then, $P(\textit{Black}|\textit{Sparky})=P(\textit{White}|\textit{Sparky})=0.5$ . A third option is that we define a domain black with labels Black and NotBlack and a domain white with labels White and NotWhite, in which case we get $P(\textit{Black}|\textit{Sparky})=P(\textit{White}|\textit{Sparky})=1$ .

The normalization issue might not be as pressing for the brain. First, it is concerned about generated samples and not primarily probabilities, and, second, the brain might consult the subsymbolic aspect of semantic or episodic memory recall to conclude that yes, Sparky is black and white! Recall that in Sections 7.3.5 and 7.3.6, we discussed how the brain might deal with probabilities in decision making.

7.4.4 Symbolic Semantic Memory

We can also model directly,

P(Y=k|s)=\mathrm{softmax}_{dom}(b_{0,k}+b_{s,k}).

(8)

Here, $b_{s,k}$ is a link from concept $s$ to concept $k$ , and $b_{0,k}$ is an offset. Note that the embeddings are not involved here, and we call this symbolic decoding of semantic memory. Symbolic decoding might be useful for modeling both a successor relation between time indices and a partOf relationship, e.g., between a time index for a scene and an associated ROI.
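Equation 8 can be illustrated with a small sketch of symbolic decoding for a successor relation between time indices. The function name `symbolic_decode`, the dictionary layout for the links $b_{s,k}$, and the link strengths are assumptions chosen for this example; missing links are treated as zero.

```python
import math


def symbolic_decode(s, b0, B, domain):
    """P(Y = k | s) as a softmax over k in the domain of b0[k] + B[s][k]."""
    scores = [b0.get(k, 0.0) + B.get(s, {}).get(k, 0.0) for k in domain]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return {k: e / z for k, e in zip(domain, exps)}


# Hypothetical time indices with strong links from each index to its successor.
domain = ["t1", "t2", "t3"]
b0 = {"t1": 0.0, "t2": 0.0, "t3": 0.0}
B = {"t1": {"t2": 5.0}, "t2": {"t3": 5.0}}

p_after_t1 = symbolic_decode("t1", b0, B, domain)
p_after_t2 = symbolic_decode("t2", b0, B, domain)
```

No embedding vector appears in this computation, which is what distinguishes symbolic decoding from the embedded decoding of Section 7.4.2.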

7.5 Knowledge Graphs

Triples form a knowledge graph (KG), where concepts are nodes and predicates are link labels. Some triples would be certain (in the probabilistic sense described before), and some would be labeled with conditional probabilities. The episodic memory can be related to a temporal knowledge graph (tKG). Generative models for embedded knowledge graphs were also studied by [23].

7.6 Serial Processing as a Prerequisite for Language

As discussed, symbolic and subsymbolic representations interact tightly. One cannot think, e.g., of Sparky without activating a subsymbolic embedding of Sparky and an embodiment process that might activate earlier visual and other processing layers in a top-down process. Similarly, anything presented in the representation layer is constantly decoded into labels, e.g., Sparky, in a bottom-up process.

The processing of different brain modules can occur in parallel, particularly in the processing of perceptual paths. However, the symbolic decoding itself is a serial process. This is discussed in Dehaene [8] as the serial bottleneck in consciousness. The importance of sampling is also recognized in [8] in the context of conscious perception. For example, the author states that “… consciousness is a slow sampler”.

The sequential nature of symbol generation in the TB might have been a prerequisite to the generation of natural language. The natural “outer” language might be based on the triple-oriented “inner” language. Individuals, to a large degree, share common concepts (of the “inner” language), even when they do not share a common language (“outer language”)! Embeddings might be specific to a “personal brain”, although recent research indicates that they as well might be shared to some degree [13, 38]. How eloquent is natural language? Part of natural language is quite fact oriented: Some managers might request, “Just give me the facts”. It has been speculated that even more sophisticated language consists of facts, glued together by fillers. This might be true for some texts generated by large language models.

Natural (outer) language can also be an input. It feeds to the representation layer and a sentence or a paragraph is like an episode to be encoded as an inner language. As with any episodic memory that has an embedding vector, we can recall a past sentence and its meaning. An input sentence thus becomes part of the memory systems.

8 Self-supervised Learning

8.1 Associative Learning?

Consider simple associations between indices and embeddings: An episodic memory embedding can be formed by simply storing the CBS at perception. Similarly, the embedding vector for a concept could be the simple average of all CBSs, when the concept was present. Indeed, this might have been the starting point in biological evolution. The TB actually uses self-supervised learning, which leads to better performance.
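The simple associative baseline described here can be written down directly. This is only a sketch of the averaging idea, not of the self-supervised learning the TB actually uses; the function name `mean_embedding` and the numbers are hypothetical.

```python
def mean_embedding(cbs_list):
    """Concept embedding as the component-wise average of all CBSs where the concept was present."""
    n = len(cbs_list)
    return [sum(column) / n for column in zip(*cbs_list)]


# Made-up CBS vectors recorded at two moments when Sparky was present.
cbs_with_sparky = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]

# Episodic embedding: a plain copy of the CBS at perception.
a_t = list(cbs_with_sparky[0])

# Concept embedding: the average over all occurrences.
a_sparky = mean_embedding(cbs_with_sparky)
```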

8.2 Self-generated Labels

No external agent provides training data, i.e., symbolic labels. In the TB, perception itself produces labels by stochastic sampling and the brain then treats the generated labels as actual labels and trains on them. This is a type of bootstrap learning where model predictions are used as targets for learning; see the bootstrap Widrow-Hoff rule [16] and learning with pseudo labels [22]. The self-generated labels occur in different cost functions. [3] discusses different principled forms of self-supervised adaptation in deep learning.

8.3 Embedding Learning with Self-generated Labels

Assume a scene $\mathbf{v}^{\textit{scene}}$ has been observed and then the focus was on an ROI with $\mathbf{v}^{\textit{ROI}}$ . To simplify we do not include attention. The labels $s=\textit{Sparky}$ and $k=\textit{Dog}$ are generated.

For perception, the contribution to the log-likelihood (negative cost function) is $\log P(s|\mathbf{v}^{\textit{ROI}},\mathbf{v}^{\textit{scene}})+\log P(k|s,\mathbf{v}^{\textit{ROI}},\mathbf{v}^{\textit{scene}})$ which becomes

\log\mathrm{softmax}_{\textit{dom}}(\mathbf{a}^{\top}_{s}\mathrm{sig}(\mathbf{h}_{1}))+\log\mathrm{softmax}_{\textit{dom}}(\mathbf{a}^{\top}_{k}\mathrm{sig}(\mathbf{a}_{s}+\mathbf{h}_{1})),

with $\mathbf{h}_{1}=\mathbf{g}(\mathbf{v}^{\textit{ROI}})+\mathbf{g}(\mathbf{v}^{\textit{scene}})+\mathbf{f}^{\textit{NN}}(\mathbf{g}(\mathbf{v}^{\textit{scene}}))$. Let’s assume that it is the first time the first label $s=\textit{Sparky}$ has been observed. The first term in the cost function will attempt to align the embedding $\mathbf{a}_{s}$ with the visual input. Then, the second cost term will attempt to align $\mathbf{a}_{s}$ with the embedding of Dog. Thus the “dogginess” will be integrated in Sparky’s embedding.

For the episodic memory for time $t$ , the contribution to the log-likelihood (negative cost function) is $\log P(s|t)+\log P(k|s,t)$ which becomes

\log\mathrm{softmax}_{\textit{dom}}(\mathbf{a}^{\top}_{s}\mathrm{sig}(\mathbf{h}_{2}))+\log\mathrm{softmax}_{\textit{dom}}(\mathbf{a}^{\top}_{k}\mathrm{sig}(\mathbf{a}_{s}+\mathbf{h}_{2})),

with $\mathbf{h}_{2}=\mathbf{a}_{t}+\mathbf{f}^{\textit{NN}}(\mathbf{a}_{t})$. (Footnote 6: If the embedding vector of the ROI, $\mathbf{a}_{t_{roi}}$, is available, it can be included: $\mathbf{h}_{2}=\mathbf{a}_{t_{roi}}+\mathbf{a}_{t}+\mathbf{f}^{\textit{NN}}(\mathbf{a}_{t})$.)

For the semantic memory for entity $s$ , the contribution to the log-likelihood (negative cost function) is

\log P(k|s)=\log\mathrm{softmax}_{\textit{dom}}(\mathbf{a}^{\top}_{k}\mathrm{sig}(\mathbf{a}_{s}+\mathbf{h}_{3}))

with $\mathbf{h}_{3}=\mathbf{\bar{a}}+\mathbf{f}^{\textit{NN}}(\mathbf{\bar{a}})$ . This last cost term will also try to align the embedding of Sparky with the embedding of the concept Dog. When semantic memory recalls Sparky, it is likely that Dog will be sampled next. The embedding of the index Sparky will integrate all labels that were generated in context with the index Sparky.
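The log-likelihood terms of this section share one structure, which the following sketch evaluates for made-up numbers. It follows the equations as written (no offset term), takes the context vector $\mathbf{h}$ as given instead of computing it from $\mathbf{g}$ and $\mathbf{f}^{\textit{NN}}$, and uses hypothetical names (`log_softmax_dom`, `two_label_loglik`) and hypothetical embeddings and domains.

```python
import math


def sig(x):
    """Logistic function for a single pre-activation."""
    return 1.0 / (1.0 + math.exp(-x))


def log_softmax_dom(target, h, emb, domain):
    """log softmax over `domain` of a_j . sig(h), evaluated at j = target."""
    s = [sig(x) for x in h]
    scores = {j: sum(w * sl for w, sl in zip(emb[j], s)) for j in domain}
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in scores.values()))
    return scores[target] - log_z


def two_label_loglik(s, k, h, emb, dom_s, dom_k):
    """log P(s | h) + log P(k | s, h), with a_s added to the CBS before decoding k."""
    first = log_softmax_dom(s, h, emb, dom_s)
    h_after_s = [a + x for a, x in zip(emb[s], h)]
    second = log_softmax_dom(k, h_after_s, emb, dom_k)
    return first + second


emb = {                       # made-up three-dimensional embeddings
    "Sparky": [1.0, 0.5, -0.2],
    "Jack": [-0.8, 0.1, 0.9],
    "Dog": [0.9, 0.7, -0.1],
    "Person": [-0.6, 0.0, 1.0],
}
entities, classes = ["Sparky", "Jack"], ["Dog", "Person"]

h1 = [0.4, 0.3, -0.5]         # stands in for the perceptual context h_1
ll_perception = two_label_loglik("Sparky", "Dog", h1, emb, entities, classes)

h3 = [0.0, 0.0, 0.0]          # stands in for the semantic-memory context h_3
cbs_semantic = [a + x for a, x in zip(emb["Sparky"], h3)]
ll_semantic = log_softmax_dom("Dog", cbs_semantic, emb, classes)
```

Maximizing these terms with respect to the embeddings pulls the embedding of Sparky toward the context in which it was sampled and toward the embedding of Dog; the episodic term has the same form with $\mathbf{h}_{2}$ in place of $\mathbf{h}_{1}$.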

We can also derive cost terms for triples with predicates other than type, like (Sparky, looksAt, Jack). (Footnote 7: We can see that the equations are not commutative: first sampling $s$ and then $k$ leads to a different cost than first sampling $k$ and then $s$.)

A consolidated episodic memory recall is more consistent with the agent’s worldview: Past memories are adjusted slightly to fit the agent’s expectations. Memories that fit an agent’s understanding are much easier to recall than, e.g., scenes that occurred in a foreign or alien place, where situations might not fit the agent’s expectations. We can also only verbally report about past scenes effectively if they fit the agent’s world understanding [6, 18].

8.4 Autoencoder Learning

In Section 6.6 we discussed how scene input can be used to train an autoencoder, which is instrumental to realize embodiment. Cost functions derived from a perceptual autoencoder might regularize and stabilize learning.

8.5 Memory Recall Changes Memories

As discussed, the recall of an episodic or semantic memory involves the complex decoding of the embedding vectors. In this process samples are generated. The brain might decide to train on these samples. This provides an explanation for the well-known phenomenon that a memory trace can change when retrieved; in fact, it can even be manipulated, as shown by the work of Loftus [24, 25].

8.6 New Indices

For existing indices, the predicted labels can be used for adapting the embeddings of those indices. But the brain also needs to form new indices. A new perceptual experience is represented by a newly formed episodic index and its learned embedding. At a slower rate, new entity indices are introduced, and at an even slower rate, new indices for classes and attributes are required. The neurobiological aspects of forming new indices are extensively discussed in [46].

9 Multitasking, Cognitive Control, and Working Memory

9.1 Multitasking by Multiplexing

Assume that the agent is occupied with planning a friend’s visit for the evening while driving a car. Suddenly, and maybe just for a few seconds, the traffic situation requires full attention. Afterward, the agent continues the evening planning. How can the brain do that?

A leading hypothesis suggests that it only appears that the mind can think of several things at the same time. At the conscious level, we rapidly switch between mental states. One might consider that cognitive attention is another serial bottleneck in the brain! An extreme case is multitasking, where we seem to be able to do two things at a time, like driving a car and participating in a conversation. Cognitive multitasking is an illusion, and the brain does multiplexing, rather than multitasking [29].

The TB would support the concept of multitasking by multiplexing. Returning to the example: The TB approach suggests that the evening-planning CBS is stored as the embedding vector of an episodic index. This CBS is then restored by activating this novel episodic index after the traffic situation is resolved.

Multitasking by multiplexing might have a capacity limit. The capacity of working memory has been estimated to be seven or less [30, 10, 5], which would mean that the mind can switch between seven or fewer mental states. In the context of the TB, cognitive control might consider seven or fewer episodic or concept indices and their embeddings in problem-solving.

A specific example, of course, is the usage of episodic and semantic memory during perception. If time permits, the mind can activate episodic or semantic memory to supplement the current perceptual activity with background information, e.g., to provide state information (recent episodic memory) and for decision guidance (remote episodic memory). See Figure 3. The multitasking between perception and memory can easily be achieved by the TB approach but might be quite difficult to achieve with other technical models. The samples generated by semantic memory can be considered as samples generated by the prior; samples generated from perception or episodic memory can be considered as data. In [46] this was related to Dirichlet fusion.

9.2 Cognitive Control and Working Memory

There seem to be a number of mental tasks that require a form of cognitive control. Multitasking is an obvious example: How does the brain decide which mental or physical task to focus on?

Another example is semantic memory: assume the agent’s thought is with Sparky. Should the mind stay with Sparky or digress to Jack (his owner), who just got married to Mary, who lives in Munich? Munich has the EnglishGarden with several nice BeerGardens. That reminds the agent to buy some beer on the way home. (Footnote 8: This type of digression is infamous for the character J.D. in the sitcom “Scrubs”.) A third example is the serial selection of ROIs in an image [34].

Cognitive control seems to be one of the central faculties of consciousness and conscious decision making. Who is in charge? Technically, the module responsible for cognitive control might be optimized by a form of reinforcement learning.

10 Multimodality, Generalization, Reasoning, and Semantics

So is the TB just a big database, as implied by the focus on past observations and simple statistics? Of course not! Labels generated in the past are not stored explicitly. They are used to adapt perception and to provide generalization in memory by learning and fine-tuning embedding vectors, as discussed in this section. Certainly, the TB generalizes to new perceptual inputs, such as new scenes. There is no training data to which the model can overfit: the labels are self-generated. So, we clearly see generalization. We now discuss reasoning. Reasoning can best be understood in terms of multimodality.

10.1 Multimodality: More Dimensions

By design, the TB is multimodal: We proposed that many modules communicate with the representation layer, in particular, different sensor modalities. A given modality might only contribute to a subset of the dimensions of the representation layer. Figure 4 illustrates that some dimensions might be mainly visual, others auditory, tactile, or abstract.

Embedding vectors might be sparse, as well. For example, the index Black will mostly interact with the visual dimensions of the representation layer, whereas the index Loud or Barking with auditory dimensions. However, one might also talk about a loud image, so concepts sometimes generalize across dimensions.

Consider the entity Sparky. The visual dimensions in the representation layer provide the mind with information about how he looks (e.g., his color, his appearance), the auditory dimensions about how his bark sounds and other sounds he makes, the tactile dimensions about how it feels to touch him, and maybe some abstract dimensions provide further information. Thus the top-down inference is quite rich. See Figure 4.

10.2 Embedded Symbolic Reasoning and Chaining

Consider the example in Figure 4. The sensory input might be visual and auditory. So only the parts of the representation layer that respond to vision and sound are active. This is reflected in the bottom-up inference. If Sparky is in the scene, his index is sampled. Top-down inference is global in the sense that all regions in the representation layer which link to Sparky are activated, in the example the auditory, visual, tactile, and abstract dimensions. The firing of Sparky supports the firing of Dog. Mammal fires only after Dog has fired, since Mammal is not visual and shares no abstract dimensions with Sparky directly, only with Dog. This activation of Mammal via the activation of Dog demonstrates chaining in symbolic embedded reasoning. Thus, the brain can conclude that Sparky is a Mammal, even without an adaptation of Sparky’s embedding.

10.3 Embedded Reasoning and Materialization

Consider the same situation as before but assume that the embedding of Sparky has been adapted. This means that the embedding of the label Mammal will be integrated into the embedding of Sparky. Then the activation of the index for Sparky might already lead to a sampling of Mammal directly, without an intermediate sampling of Dog. This is the type of reasoning applied in recommendation systems [21] and is related to the process of materialization in logic and database theory.
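A minimal Python sketch of this effect is given below. The additive update in `integrate`, its rate, and the toy embedding values are assumptions standing in for whatever learning rule actually adapts the embeddings; the point is only that, after integration, Sparky's embedding overlaps with Mammal's directly.

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def integrate(target, label, rate=0.7):
    """Return a copy of `target` with `rate * label` added (hypothetical rule)."""
    return [t + rate * l for t, l in zip(target, label)]


# Toy embeddings; the last dimension is the only one used by Mammal.
sparky = [1.0, 0.8, 0.9, 0.0, 0.7, 0.0, 0.6, 0.0]
mammal = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

overlap_before = dot(sparky, mammal)
sparky_adapted = integrate(sparky, mammal)
overlap_after = dot(sparky_adapted, mammal)
```

Before the adaptation the overlap is zero, so Mammal can only be reached via Dog; afterwards the overlap is positive, so activating Sparky can support Mammal without the intermediate step.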

10.4 Embedded Reasoning Creates Episodic Similarities

Assume the agent has stored a scene from EnglishGarden at time $t_{1}$ , with visual labels, e.g., Sunny, and a scene from Marienplatz at time $t_{2}$ , with visual labels, e.g., Rainy. The episodic embeddings for both and also the generated visual labels might show no similarity.

A location module of the brain might tell the mind at the time of perception that both scenes happened in Munich. Vision and location-generated labels can now be shared. When the episodic embeddings are adapted they will integrate the embedding of Munich. For episodic index $t_{1}$ , the embedding vector is adapted to also generate the label Munich; the same for episodic index $t_{2}$ .

The two engrams become similar in terms of their embeddings. An episodic recall of one of them might activate the other one. Similarly, a semantic recall of the concept Munich might now activate both episodic memories.
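The following Python sketch illustrates how a shared label can make two episodic embeddings similar. The four-dimensional toy vectors, the additive `integrate` rule, and the use of cosine similarity as the similarity measure are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def integrate(target, label, rate=0.5):
    """Return a copy of `target` with `rate * label` added (hypothetical rule)."""
    return [t + rate * l for t, l in zip(target, label)]


# Toy embeddings: t1 (Sunny, EnglishGarden) and t2 (Rainy, Marienplatz)
# initially share no dimensions; Munich lives on the last two dimensions.
t1 = [1.0, 0.0, 0.0, 0.0]
t2 = [0.0, 1.0, 0.0, 0.0]
munich = [0.0, 0.0, 1.0, 1.0]

similarity_before = cosine(t1, t2)
t1_adapted = integrate(t1, munich)
t2_adapted = integrate(t2, munich)
similarity_after = cosine(t1_adapted, t2_adapted)
```

After both episodic embeddings have integrated the Munich embedding, their cosine similarity is positive, and each of them also overlaps with Munich, which is what allows a recall of one to support the other.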

10.5 Symbolic Reasoning

We define symbolic reasoning as reasoning that does not explicitly exploit the embeddings of the symbolic indices. In Section 7.4.4, we discussed the symbolic decoding of semantic memory. It is solely based on the symbolic indices and does not exploit the embeddings. Symbolic decoding can represent binary statistics (e.g., between Dog and Black) and can do chaining: Assuming that Sparky was sampled in the scene, and then Dog, and this triggered the sampling of Mammal, then the mind can conclude that Sparky is a Mammal. This chaining operation can also be materialized by introducing a link from Sparky to Mammal directly. It is conceivable that more complex reasoning can be performed in the indices, e.g., based on (relational) Bayes nets.
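Chaining and materialization on the symbolic level can be sketched without any embeddings. In the Python example below, the dictionary `LINKS` and the function names are hypothetical; the sketch treats symbolic links as a directed graph, chaining as reachability, and materialization as storing all reachable labels directly.

```python
# Hypothetical symbolic links between indices (no embeddings involved).
LINKS = {
    "Sparky": {"Dog"},
    "Dog": {"Mammal"},
    "Mammal": set(),
}


def reachable(links, start):
    """Chaining: collect every index reachable from `start` via links."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for successor in links.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def materialize(links):
    """Materialization: add a direct link for every chained conclusion."""
    return {node: reachable(links, node) for node in links}


conclusions = reachable(LINKS, "Sparky")
materialized = materialize(LINKS)
```

After materialization, the link from Sparky to Mammal is stored directly, so the conclusion no longer requires the intermediate step via Dog.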

10.6 Universal Approximation Property

In Section 7.4.2 we discussed that a sample can nonlinearly depend on previous samples, but that the dependency is not fully general. For this we need the evolution neural network. Assume that the object entity is identical to the subject entity, e.g., Sparky. Then, we can perform two samples in the entity’s role as a subject, e.g., Dog and Black, and one as an object, e.g., Cute. The latter one, in general, will have a nonlinear dependency on the outcomes of the first two, since in the transition from subject to object, the evolution neural network is applied. Thus, the brain can exploit the evolution operation to perform general mappings and achieve universal approximation ability.

10.7 The Embedding is the “DNA”

What is the semantics of a concept like Sparky? The embedding of Sparky is its signature, or “DNA”. The decoding of the “DNA” involves the activation of the representation layer with Sparky’s embedding. The embodiment informs substantial parts of the brain about Sparky’s sensory and other properties: what Sparky looks like, how he sounds, how it feels to pat him. Then semantic memory tells the brain about attributes of Sparky, e.g., that he is Black, and, by decoding Sparky’s relationships, about his social network (e.g., owned by Jack, loved by Mary). All this is embedded and embodied as well. Then, past episodic memories involving Sparky might be recalled. Thus, substantial parts of the brain’s sensory processing modules, as well as many memories, “decode” Sparky’s “DNA”. It is an adaptable signature that can only be decoded in the context of an individual’s brain. Thus, thinking intensely of Sparky triggers a firework of activities in the brain. A similar firework, of course, also happens during perception and episodic memory recall.

10.8 Working Memory and Cognitive Control

Intelligence is often associated with working memory and cognitive control. In Section 9, we indicated how working memory might control a few symbolic indices to contemplate a decision. Although perception and memory might be the basis, model-based reasoning might also play a significant role. We support the idea that human intelligence has many facets and relies on several functional modules.

11 Summary and Conclusions

Perception and memory are not simple mappings but are processes that occupy substantial parts of the brain. The TB proposes that symbolic indices are activated in a bottom-up process. In a top-down process, activated symbolic indices communicate to the brain. Several indices are sequentially activated in perception and memory. This can be described as a dance in which symbolic indices activate the subsymbolic representation layer and vice versa.

The TB proposed learning with self-generated labels, which gave promising first results, although it can be dangerous: as we discussed, the retrieval of an episodic instance also generates labels, which the brain might use for adaptation, possibly falsifying a memory.

It might be that the brain is a prediction machine, as many have stated. On a shorter time scale, this predictive machine might become part of the fast reactive system of the brain, its implicit memory. But to predict, at the longer time scales considered here, the mind needs to understand the present and relate it to the past and future: it needs explicit understanding and it needs an explicit memory. It needs a future memory, it needs imagination.

The TB demonstrated embedded and symbolic reasoning and discussed their relationships to semantic memory. We also elaborated on the role of working memory and cognitive control. Although these mechanisms are quite powerful and general, it appears that only humans can perform some forms of complex reasoning. It has been speculated that language modules might play an important role since both language and reasoning rely on similar cognitive resources. Memory-based imagination might need to be supplemented with model-based imagination. In any case, perception and memory might be the basis. We should always keep in mind that the brain does many different things concurrently, with only a few of those activities reaching conscious awareness, and that it addresses a particular problem with several strategies.

The debate about localized representations in the brain is ongoing. Specific concept cells have been found in the medial temporal lobe (MTL) region of the brain. Researchers have reported on a remarkable subset of MTL neurons that are selectively activated by strikingly different pictures of given individuals, landmarks, or objects and, in some cases, even by letter strings with their names [32, 31]. Thus index-specific “symbolic” representations might indeed be biologically plausible.

The sampling of the indices is an inherent feature of the TB. What initiates the sampling? Sampling appears to have many things in common with the collapse of the wave function in quantum mechanics after a measurement has been initiated by the experimenter. These questions are addressed in [42].

References

[1] Bernard J Baars. In the theater of consciousness: The workspace of the mind. Oxford University Press, USA, 1997.
[2] Stephan Baier, Yunpu Ma, and Volker Tresp. Improving visual relationship detection using semantic modeling. In ISWC. Springer, 2017.
[3] Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, and Zhouhan Lin. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.
[4] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, 2013.
[5] Nelson Cowan. The magical mystery four: How is working memory capacity limited, and why? Current directions in psychological science, 19(1):51–57, 2010.
[6] Fergus IM Craik and Endel Tulving. Depth of processing and the retention of words in episodic memory. Journal of experimental Psychology: general, 104(3):268, 1975.
[7] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural computation, 7(5):889–904, 1995.
[8] Stanislas Dehaene. Consciousness and the brain: Deciphering how the brain codes our thoughts. Penguin, 2014.
[9] Katherine D Duncan and Daphna Shohamy. Memory states influence value-based decisions. Journal of Experimental Psychology: General, 145(11):1420, 2016.
[10] Randall W Engle. What is working memory capacity? 2001.
[11] Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
[12] Mark A Gluck, Eduardo Mercado, and Catherine E Myers. Learning and memory: From brain to behavior. Palgrave, 2013.
[13] Ariel Goldstein, Zaid Zada, Eliav Buchnik, Mariano Schain, Amy Price, Bobbi Aubrey, Samuel A Nastase, Amir Feder, Dotan Emanuel, Alon Cohen, et al. Shared computational principles for language processing in humans and deep language models. Nature neuroscience, 25(3):369–380, 2022.
[14] Thomas L Griffiths, Charles Kemp, and Joshua B Tenenbaum. Bayesian models of cognition. In The Cambridge Handbook of Computational Psychology. Cambridge University Press, 2008.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[16] Geoffrey E Hinton and Steven J Nowlan. The bootstrap Widrow-Hoff rule as a cluster-formation algorithm. Neural Computation, 2(3):355–362, 1990.
[17] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3668–3678, 2015.
[18] Alex Kafkas and Daniela Montaldi. Expectation affects learning and modulates memory experience at retrieval. Cognition, 180:123–134, 2018.
[19] David C Knill and Alexandre Pouget. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12):712–719, 2004.
[20] Konrad P Körding, Shih-pi Ku, and Daniel M Wolpert. Bayesian integration in force estimation. Journal of Neurophysiology, 92(5):3161–3165, 2004.
[21] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[22] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, 2013.
[23] Lorenzo Loconte, Nicola Di Mauro, Robert Peharz, and Antonio Vergari. How to turn your knowledge graph embeddings into generative models. Advances in Neural Information Processing Systems, 36, 2024.
[24] Elizabeth F Loftus and Katherine Ketcham. Witness for the defense: The accused, the eyewitness, and the expert who puts memory on trial. Macmillan, 1991.
[25] Geoffrey R Loftus and Elizabeth F Loftus. Human memory: The processing of information. Psychology Press, 2019.
[26] Yunpu Ma, Volker Tresp, and Erik A Daxberger. Embedding models for episodic knowledge graphs. Journal of Web Semantics, 2018.
[27] James L McClelland, Bruce L McNaughton, and Randall C O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.
[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013.
[29] Earl K Miller and Timothy J Buschman. Working memory capacity: Limits on the bandwidth of cognition. Daedalus, 144(1):112–122, 2015.
[30] George A Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review, 63(2):81, 1956.
[31] R Quian Quiroga, Leila Reddy, Gabriel Kreiman, Christof Koch, and Itzhak Fried. Invariant visual representation by single neurons in the human brain. Nature, 435(7045):1102–1107, 2005.
[32] Rodrigo Quian Quiroga. Concept cells: the building blocks of declarative memory functions. Nat Rev Neurosci, 13(8), 2012.
[33] Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79–87, 1999.
[34] D Sabatinelli, DW Frank, TJ Wanger, M Dhamala, BM Adhikari, and X Li. The timing and directional connectivity of human frontoparietal and ventral visual attention networks in emotional scene perception. Neuroscience, 277:229–238, 2014.
[35] Adam N Sanborn and Nick Chater. Bayesian brains without probabilities. Trends in cognitive sciences, 20(12):883–893, 2016.
[36] Daniel L Schacter, Donna Rose Addis, Demis Hassabis, Victoria C Martin, R Nathan Spreng, and Karl K Szpunar. The future of memory: remembering, imagining, and the brain. Neuron, 76(4):677–694, 2012.
[37] Sahand Sharifzadeh, Sina Moayed Baharlou, and Volker Tresp. Classification by attention: Scene graph classification with prior knowledge, 2020.
[38] Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Jascha Achterberg, Joshua B Tenenbaum, et al. Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018, 2023.
[39] Joshua B Tenenbaum, Thomas L Griffiths, and Charles Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in cognitive sciences, 10(7):309–318, 2006.
[40] Timothy J Teyler and Pascal DiScenna. The hippocampal memory indexing theory. Behavioral neuroscience, 1986.
[41] Volker Tresp, Cristóbal Esteban, Yinchong Yang, Stephan Baier, and Denis Krompaß. Learning with memory embeddings. arXiv preprint arXiv:1511.07972, 2015.
[42] Volker Tresp and Hang Li. Bayes or Heisenberg: Who(se) rules? arXiv preprint, 2024.
[43] Volker Tresp, Yunpu Ma, Stephan Baier, and Yinchong Yang. Embedding learning for declarative memories. In European Semantic Web Conference, pages 202–216. Springer, 2017.
[44] Volker Tresp, Sahand Sharifzadeh, and Dario Konopatzki. A model for perception and memory. In Conference on Cognitive Computational Neuroscience, 2019.
[45] Volker Tresp, Sahand Sharifzadeh, Dario Konopatzki, and Yunpu Ma. The tensor brain: Semantic decoding for perception and memory, 2020.
[46] Volker Tresp, Sahand Sharifzadeh, Hang Li, Dario Konopatzki, and Yunpu Ma. The tensor brain: A unified theory of perception, memory, and semantic decoding. Neural Computation, 35(2):156–227, 2023.
[47] Endel Tulving. Elements of episodic memory. Oxford University Press, 1985.