## Abstract

Is thought computation over ideas? Turing, and many cognitive scientists since, have assumed so, and formulated computational systems in which meaningful concepts are encoded by symbols which are the objects of computation. Cognition has been carved into parts, each a function defined over such symbols. This paper reports on a research program aimed at computing these symbolic functions without computing over the symbols. Symbols are encoded as patterns of numerical activation over multiple abstract neurons, each neuron simultaneously contributing to the encoding of multiple symbols. Computation is carried out over the numerical activation values of such neurons, which individually have no conceptual meaning. This is massively parallel numerical computation operating within a continuous computational medium. The paper presents an axiomatic framework for such a computational account of cognition, including a number of formal results. Within the framework, a class of recursive symbolic functions can be computed. Formal languages defined by symbolic rewrite rules can also be specified, the subsymbolic computations producing symbolic outputs that simultaneously display central properties of both facets of human language: universal symbolic grammatical competence and statistical, imperfect performance.

## 1. Introduction

The classical theory of computation initiated by Turing [1] has provided the foundation for mainstream theories of cognition and intelligence since the inception of cognitive science and artificial intelligence in the 1950s [2]. This paper presents another type of computation that has developed more recently within cognitive science, driven primarily by two goals: improved formal characterization of human mental processing—including mental grammars—and improved reduction to neural computation. These newer methods develop a distinction, absent in earlier computational theory, between three levels of description, which I will here call *symbolic*, *vectorial* and *neural*. These distinguish the formal characterization of the mind, using recursive functions over discrete symbols, from the characterization of the brain, using continuous, parallel, numerical computation. The vectorial level provides an interlingua. The descriptions at all three levels characterize a single computer, the mind/brain [3].

A key departure from earlier theory (including Turing's work on network machines [4]) is that machine computations operate only at a level lower than that of meaning. A conceptually meaningful entity is realized as an activation pattern—an activation vector—over many abstract neurons, each neuron simultaneously participating in the realization of multiple meaningful entities. Knowledge is realized in connections between elements (neurons) that are individually meaningless but collectively meaningful. The objects of computational manipulation are not meaningful (the neural-level description), and the meaningful elements are not objects of computational manipulation (the symbolic-level description) [5–7]. Nonetheless, we will see that the proposed vectorial computation provides the technical bridge for computing meaningful recursive symbolic functions using subsymbolic, neural computation.

## 2. Overview

The theory of human cognition owes much of its progress to the principle (formalized in §3) that cognition within some domain (e.g. arithmetic) consists in a system of functions that take inputs, and produce outputs, which are structures built of symbols that typically refer to elements in the domain (e.g. numbers).^{1} This raises two basic questions—and a meta-question: How should these functions be specified by cognitive scientists? How are they computed within the brain? How are these two questions connected? The answers developed here are roughly these.

*How should cognitive functions be specified?* In §6, we adopt the conventional assumption that cognitive functions can usefully be specified via recursive equations; but in §7, we conclude that—at least in one key domain, the theory of human language—an alternative, specification by constraints on (input, output) pairs, has important advantages.

*How are cognitive functions computed?* In §6, we challenge the assumption that cognitive functions specified via recursive equations should be computed in the conventional manner, by sequentially computing the primitive functions in terms of which those equations are ultimately defined. Instead, we show how cognitively relevant functions in a certain class can be computed in one, massively parallel step, in which neural activations encoding the input symbol structure are transformed by a simple linear transformation to output activations that encode the output symbol structure specified by the target function. In this approach, symbolic functions are computed, but not by symbolic computation: there is no algorithm over symbols that describes the internal processes by which input becomes output. There *is* a formal specification of the process, however, using the primitives of neural computation. If this account of cognition is on the right track, then the kinds of symbol-manipulating algorithms sought by traditional artificial intelligence and cognitive theories do not exist, despite the existence of symbolic descriptions of cognitive functions.

*How are these questions connected?* If symbolic cognitive functions are computed by neural, not symbolic, computation, then natural descriptions of the functions computed by neural networks provide a promising candidate framework for specifying cognitive functions. This leads, in §7, to the description of these functions in terms of optimization over constraints, at all three levels.

Addressing these questions in §§6 and 7 requires laying considerable groundwork, the foundation of which is a mapping from a space of symbol structures to a vector space. This embedding is presented in §4, and the relation between the similarity of symbol structures and the similarity of their embeddings in the vector space is then analysed in §5.

The paper provides a concise synopsis of the formal core of *The harmonic mind* [9], in which many of the results presented here are derived. Integrated into this synopsis are new results (§§5 and 7) and current research (§8). The paper initiates a research programme developing axiomatic theory in cognitive science, motivated by the need for more precise characterizations of the central notions of cognitive theory, which are highly abstract and therefore in need of formal support to escape crippling vagueness.

### (a) Summary of results

In this paper, we take up the following topics: the decomposition of cognition into component functions (§3); the representation of the inputs and outputs of these cognitive functions (§4); the similarity structure of these representations (§5); a recursive-function-theoretic approach to specifying cognitive functions, and neural computation of those functions (§6); a formal-language-theoretic approach to specifying cognitive functions, and neural computation of those functions (§7); and the challenge of producing symbolically interpretable outputs from neural computation (§8).

Each of these topics (except the last) is treated at three levels of description, in separate subsections: (*a*) the abstract symbolic level; (*c*) the neural level; and, mediating between them, (*b*) the vectorial level.^{2}

The main results, simplified, are approximately these (numbers identify the corresponding theorems):

**Theorem 4.8.** *The fundamental symbolic data structure we assume, binary trees, can be embedded in an infinite-dimensional vector space in such a way that the basic data construction and data access operations are linear transformations.*

**Theorem 5.7.** *Dissimilarity between two symbolic representations—as measured by differences in their constituents and the structural relations between their constituents—is given by the distance in the vector space between the two vectors ‘embedding’ those representations.*

**Theorem 6.5.** *Each function in the class defined as the closure under composition of the basic binary tree and symbol-mapping operations can be computed by a linear transformation over the vectors embedding the function's input and output.*

**Theorem 6.6.** *The infinite matrices implementing these linear transformations have a particularly simple form, the tensor product of the infinite identity matrix and a finite matrix that specifies the particular recursive function.*

**Theorem 7.2.** *A formal language—defined standardly by sequentially applied symbol-rewriting rules—can be specified as the structures that optimize (maximize) a numerical measure of symbolic well-formedness called (symbolic) Harmony.*

**Theorem 7.5.** *A natural language specified in a new grammatical framework called Optimality Theory can also be characterized as the optima of symbolic Harmony.*

**Theorem 7.7.** *The symbolic Harmony of a structure can be computed as the value, for the vector that embeds that structure, of a quadratic form called (neural) markedness Harmony.*

**Theorem 7.11.** *In evaluating errors, the dissimilarity between an output symbolic structure and an input symbolic structure considered as the desired output can be computed as the value, at the vectors embedding the input and output structures, of a bilinear form called neural faithfulness Harmony.*

**Theorem 7.12.** *The local optima of total network Harmony—the sum of markedness and faithfulness network Harmony—can be computed by a deterministic neural network.*

**Theorem 7.13.** *The global optima of total network Harmony can be computed by a stochastic neural network.*

**Theorem 8.1.** *The requirement that the output of a neural computation be a vector that is the embedding of a symbol structure can be met via a network dynamics that creates an attractor at each such vector.*

To obtain (or merely state) these results as theorems, a certain degree of formalization is necessary.

Through the course of the paper, it may prove useful to refer back to the following synopsis, in which organization of results is orthogonal to that of the paper: it gives the total picture at each level separately. It also identifies all the principal elements of the theory, providing an informal glossary of their notation.

At this point, the reader may skip directly to §3 without loss of continuity.

### (b) Synopsis by level

At the most abstract level, the different types of information encoded mentally (and their mutual relations) are located within a cognitive macro-architecture graph (§3*a*); these mental representations are characterized as systems of structures built of symbols filling specified structural roles (§4*a*); dissimilarity metrics, motivated by empirical patterns of cognitive performance, are defined with respect to (w.r.t.) a set of abstract relations among constituents of mental representations (§5*a*); the symbolic functions computed over these representations are characterized, including a set of recursive functions (§6*a*) and the set of formal languages generated through sequential derivation by rewrite-rule grammars; these languages and dissimilarity metrics are cast in the form of numerical Harmony functions (theorem 7.2 and definition 5.4), the correct mental representations being those that maximize Harmony; Harmony functions *H*_{G} can be used to characterize the grammars of natural languages, including Optimality-Theoretic grammars, within a system in terms of which the typology of possible human languages can be formally calculated (§7*a*).

At a level intermediate between the symbolic and the neural, linear and quasi-linear classes of dynamical systems in vector spaces are defined (§3*b*); a model of the symbol system of mental states is specified abstractly by a realization mapping *Ψ* of symbol structures into a vector space *S* (§4*b*); symbolic structural-relation dissimilarity is realized as vector distance within *S* (§5*b*); recursive functions and Harmony functions *H*_{G} are reduced to linear transformations on *S*, W_{g} (§6*b*) and W_{G} (theorem 7.7).

At the neural level, *S* is modelled with ℝ^{n}, the states of a neural network, by positing a distinguished neural basis (§3*c*) in terms of which: mental representations **s** and the realization mapping *Ψ* are explicitly specified numerically (§4*c*); Euclidean distance is explicitly computable, and is reduced to neural faithfulness Harmony *H*_{F} (theorem 7.11); the linear transformations W_{g} and W_{G} are instantiated as numerical matrices **W**_{g} (§6*c*) and **W**_{G} (§7*c*); symbolic Harmony is reduced to neural markedness Harmony *H*_{M} (definition 7.6); deterministic spreading activation can be used to compute local maxima of the network Harmony (theorem 7.12); and stochastic spreading activation dynamics within a diffusion network can be used to compute global maxima of network Harmony (theorem 7.13); but a (deterministic) quantization dynamics is needed to output an activation pattern `s` that is interpretable as a discrete symbolic state *s*, with `s`=*Ψ*(*s*) (theorem 8.1); a total dynamics combining the stochastic optimization and quantization dynamics yields the *λ*-diffusion dynamics (definition 8.2) of a network that computes symbolic mental representations, enabling models simultaneously capturing central features of both the idealized computational capacity of human linguistic competence and the errorful real-time computation of human linguistic performance.

## 3. Cognitive macro-architecture

At a coarse level of description, a cognitive architecture is an interconnected collection of components, operating in parallel, each of which processes information of a certain type (e.g. for language, these types include orthographic, phonological, syntactic and semantic information: see ○1 in figure 1). Knowledge of the structure of a type of information is contained within the corresponding component, and knowledge of the relations between the types is contained in the links connecting components. A component computes a function: the input comes along inward-directed links from neighbouring components, and the output is, in turn, processed by knowledge in the outward-directed links to generate input to other components. The flowcharts (‘box and arrow’ diagrams) conventionally used to depict such an architecture instantiate a kind of graph structure.

### (a) Symbolic graph structure

### Definition 3.1

A cognitive macro-architecture is a directed graph 𝒢 together with:

For each node *γ* (‘cognitive component’ or ‘module’) in 𝒢,

(i) a set 𝒮^{γ}, the ‘state space’ of *γ*; an *s*^{γ}∈𝒮^{γ} is called a ‘mental representation’;

(ii) a set ℐ^{γ}, the ‘input space’ of *γ*;

(iii) a function *σ*^{γ} for aggregating ‘partial inputs’ to *γ* (see *b*(ii)).

For each edge (‘connection’ or ‘pathway’) *γα*, from *α* to *γ*, in 𝒢,

(i) a transformation *τ*^{γα}:𝒮^{α}→ℐ^{γ} (the ‘inter-component transformation’);

(ii) *ι*^{γα}≡*τ*^{γα}(*s*^{α}) is called the ‘partial input’ from *α* to *γ* when *α* is in state *s*^{α};

(iii) *ι*^{γ}≡*σ*^{γ}({*ι*^{γα}}) is the ‘external input’ to *γ*.

Henceforth, we assume given some specific architecture 𝒜 in which, for all *γ*, the spaces 𝒮^{γ} and ℐ^{γ}, while conceptually distinct, are formally identical; the external input *ι*^{γ} functions as a ‘target’ towards which the components external to *γ* drive its state *s*^{γ}. Knowledge within *γ* will generally partially resist this pressure.

### (b) Vectorial model

The vectorial level models each 𝒮^{γ} as a vector space *S*^{γ}=*I*^{γ} (figure 1 ○2). The vector space *S*^{γ} constitutes an abstract information-encoding medium—a space of cognitive states, of mental representations—that resides in continuous mathematics, rather than the discrete mathematics of symbolic encodings. We study two example types of vectorial models here.

In quasi-linear models, the transformations between components are linear, combining by addition; a potentially nonlinear function **f** and a linear transformation of *S*^{γ} to itself determine how the state of *S*^{γ} at one time, and the external input to *S*^{γ} from other components at that time, move the state forward.

### Example 3.2

In a *quasi-linear dynamics* [10] for the architecture 𝒜, for all cognitive components *γ* and *α*:

(i) the operation *τ*^{γα} on the link to *γ* from *α* is a linear transformation T^{γα}:*S*^{α}→*I*^{γ}, *ι*^{γα}=*T*^{γα}**s**^{α};

(ii) *σ*^{γ} is summation: the external input to *γ*, *ι*^{γ}, is the sum of the partial inputs *ι*^{γα} from neighbouring components *α*;

(iii) a linear transformation *W*^{γ}:*S*^{γ}→*S*^{γ} (the ‘intra-component input transformation’) and a function *f*^{γ}:*S*^{γ}→*S*^{γ} (the ‘activation function’) define the internal dynamics of *γ*, by d**s**^{γ}/d*t*=*f*^{γ}(**i**^{γ})−**s**^{γ}, where **i**^{γ}≡*W*^{γ}**s**^{γ}+*ι*^{γ}; *W*^{γ}**s**^{γ} is the internally generated input to *γ*; it combines linearly with the external input *ι*^{γ}.

Note that, at equilibrium, we must have **s**^{γ}(*t*)=*f*^{γ}[**i**^{γ}(*t*)]; *f*^{γ} gives the equilibrium relation between input **i**^{γ} and activation **s**^{γ}.
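The quasi-linear settling process can be sketched numerically. Here is a minimal simulation for a single component, assuming a tanh activation function, small random weights, fixed external input and Euler integration of d**s**/d*t*=*f*(**Ws**+*ι*)−**s** (all illustrative choices, not a model from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                    # number of abstract neurons in the component
W = 0.1 * rng.standard_normal((n, n))    # intra-component weight matrix W
iota = rng.standard_normal(n)            # fixed external input from other components
f = np.tanh                              # a monotonic, local activation function

# Euler-integrate ds/dt = f(W s + iota) - s until the state settles.
s = np.zeros(n)
dt = 0.1
for _ in range(2000):
    i_total = W @ s + iota               # total input (cf. definition 3.4)
    s = s + dt * (f(i_total) - s)
```

After settling, the state satisfies the equilibrium relation **s**=*f*(**i**) noted above.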

A simpler type of model we will use (ultimately, as exemplified in figure 2) is as follows.

### Example 3.3

A *linear associator* [11] is a simple sub-architecture consisting of two components, *α* and *γ*, with a single (uni-directional) pathway, from *α* to *γ*, in which

(i) the operation *τ*^{γα} is a linear transformation *T*^{γα}:*S*^{α}→*I*^{γ}, producing external input *ι*^{γ}=*T*^{γα}**s**^{α}; and

(ii) the state **s**^{γ} equals the external input *ι*^{γ}: **s**^{γ}=*T*^{γα}**s**^{α}.

Both types of model are spelled out more explicitly and intuitively at the neural level.
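A concrete sketch of example 3.3, under the additional (illustrative) assumption that the weights *T*^{γα} are built by the classic Hebbian outer-product rule, which recalls each stored output pattern exactly when the input patterns are orthonormal:

```python
import numpy as np

# Two input/output pattern pairs; inputs chosen orthonormal so that the
# outer-product (Hebbian) construction recalls each output exactly.
s_alpha = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
s_gamma = [np.array([0.5, -1.0]),     np.array([2.0, 0.3])]

# T = sum_k  (output_k)(input_k)^T
T = sum(np.outer(t, s) for t, s in zip(s_gamma, s_alpha))

# The component's state simply equals its external input: s^gamma = T s^alpha.
out = T @ s_alpha[0]
```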

### (c) Neural computation: cognitive micro-architecture

The neural level (figure 1 ○3) invokes a coordinate system (with axes defined by a ‘neural basis’ [12]) for *S*^{γ} in which the list of coordinates for any vector **s**^{γ}∈*S*^{γ}, (*s*_{1},*s*_{2},…,*s*_{n})∈ℝ^{n}, is the list of activation values of an enumerated set of *n* ‘abstract neurons’ in a network; in this model of the vectorial theory, the mental state *s*^{γ} is realized by an ‘activation pattern’ described by a vector in ℝ^{n}. We assume given a neural basis for every *S*^{γ}.

Generally, neural computation is a type of analogue computation: a dynamical system in ℝ^{n}, fundamentally continuous in time, defined by differential equations that describe how the *n* neurons function as parallel processors (how they ‘spread activation’; figure 1 ○4) [13]. A particular computation is specified by values for the parameters in these equations; the parameter values are interpreted in parallel by the machine as its state evolves in accordance with them. In simpler cases, similar to the linear associator (example 3.3), there may be no dynamics, just static relations between component states.

### Definition 3.4

Given a quasi-linear dynamics for *S*^{γ}, we define the following, relative to the neural basis.

(i) The inter-component linear transformation *T*^{γα} is realized as the matrix **T**^{γα}, the element [**T**^{γα}]_{μν} being the ‘connection strength’, or ‘weight’, of a ‘connection’ from neuron *ν* in component *α* to neuron *μ* in component *γ*. (This applies to a linear associator as well.)

(ii) The linear intra-component input transformation *W*^{γ} is realized as the weight matrix **W**^{γ}, the element [**W**^{γ}]_{μν} being the weight of a connection from neuron *ν* in *γ* to neuron *μ* in *γ*.

(iii) The *total input* to the neurons in *γ* is **i**^{γ}≡**W**^{γ}**s**^{γ}+*ι*^{γ}, where *ι*^{γ} is the external input from other components, while **W**^{γ}**s**^{γ} (the matrix–vector product of **W**^{γ} and **s**^{γ}) is the input generated internally within *γ*. At neuron *μ* of *γ*, the total input is *i*_{μ}=∑_{ν}[**W**^{γ}]_{μν}*s*_{ν}+*ι*_{μ}, i.e. the weighted sum of the activation values of *μ*'s neighbours *ν* within *γ* plus the external input to *μ*.

### Definition 3.5

To the requirements defining a quasi-linear dynamics at the vectorial level (example 3.2) we now add:

(i) the *locality* requirement: *f*^{γ}:**i**^{γ}=(*i*_{1},*i*_{2},…,*i*_{n})↦(*f*^{γ}(*i*_{1}),*f*^{γ}(*i*_{2}),…,*f*^{γ}(*i*_{n})) for some *f*^{γ}:ℝ→ℝ called the ‘neuron activation function’; and

(ii) the *monotonicity* requirement that the activation function *f*^{γ} be non-decreasing.

If, in addition, *f*^{γ} is the identity function, *f*^{γ}(*i*)=*i*, then the dynamics is *linear*.

Owing to locality, the differential equation for neuron *μ* depends only on the activation levels of units connected to *μ*, the weights on connections to neuron *μ* and the external input to unit *μ*: d*s*_{μ}/d*t*=*f*^{γ}(∑_{ν}[**W**^{γ}]_{μν}*s*_{ν}+*ι*_{μ})−*s*_{μ}. Monotonicity implies that, at equilibrium, the higher the input to neuron *μ*, the greater its activation value *s*_{μ}.

Henceforth, we focus primarily on a single component , and generally drop *γ* from the notation.

## 4. Symbolic functions

A component of an architecture computes a function, producing a mental representation as output.

### (a) Mental representations as symbol structures

At the most abstract level, mental representations in a component are symbol structures [14] (figure 1 ○5); e.g. in an orthographic component, a string of mental letters such as `CABS`; in a phonological component, a string of mental phonemes such as `kæbz` (or less simplistically, the binary tree [`k` [`æ` [`b z`]]]); in a syntactic component, a phrase-structure tree such as [_{S} [_{NP} Frodo] [_{VP} [_{V} lives]]] (simplifying greatly). The key idea now is to regard a symbol structure as a set of internal roles, each filled by a constituent symbol structure (see ○5).

### Definition 4.1

A *filler-role decomposition* consists of three sets—the ‘structures’ 𝒮, the ‘fillers’ 𝓕 and the ‘roles’ ℛ—and a one-to-one function *β* mapping each structure to a finite set of filler/role pairs; the *N* pairs constituting *β*(*s*), written as {*f*_{k}/*r*_{k}}_{k=1,…,N}, are the ‘filler/role bindings’, or ‘constituents’, of *s* [15].

A symbolic data structure that is widely applicable for mental representations is the binary tree; the artificial-intelligence language Lisp deploys it exclusively [16]. Two filler-role decompositions are as follows [17].

### Example 4.2

The following defines the *canonical filler-role decomposition of binary trees*. Let the structure set 𝒮 be the set of binary trees labelled with ‘atomic symbols’ in a set A, e.g. *s*_{cab}≡[`k` [`æ` `b`]]. Define the role set ℛ_{t}≡{*r*_{x}|*x*∈{0,1}*}; we think of, e.g., *r*_{01} as the binary-tree position ‘left child [0] of the right child [1] of the root’, and analogously for any string *x* of 0s and 1s. In *s*_{cab}, the position *r*_{01} is labelled with the symbol `æ`. Let the filler set 𝓕_{t} be A. Then, `æ`/*r*_{01} is a filler-role binding of *s*_{cab}; the bindings function *β*_{t} pairs roles with their labels: *β*_{t}(*s*_{cab})={`æ`/*r*_{01},`k`/*r*_{0},`b`/*r*_{11}}. The root position is *r*_{ε}, where *ε* is the empty string; for any atomic symbol `X`∈A, *β*_{t}(`X`)={`X`/*r*_{ε}}.
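The canonical decomposition can be sketched directly, assuming trees encoded as nested Python tuples and role strings built child-bit first as we descend (the helper name `bindings_t` is illustrative, not from the paper):

```python
# Trees as nested tuples: ('k', ('ae', 'b')) stands for [k [ae b]].
def bindings_t(tree, pos=''):
    """Canonical filler-role bindings as a dict {role-string x: atom},
    so that '01' names the left child of the right child of the root."""
    if not isinstance(tree, tuple):
        return {pos: tree}          # atom bound to role r_pos (r_eps at the root)
    left, right = tree
    b = {}
    b.update(bindings_t(left, '0' + pos))
    b.update(bindings_t(right, '1' + pos))
    return b

s_cab = ('k', ('ae', 'b'))
# bindings_t(s_cab) == {'0': 'k', '01': 'ae', '11': 'b'}, matching beta_t(s_cab)
```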

### Example 4.3

Continuing the example of binary trees labelled with symbols in A, we can also define the *recursive* filler-role decomposition. The roles are simply ℛ_{r}≡{*r*_{0},*r*_{1},*r*_{ε}}, whereas the fillers 𝓕_{r} are the trees themselves. Now, for *s*_{cab} above, *β*_{r}(*s*_{cab})={`k`/*r*_{0},[`æ` `b`]/*r*_{1}}. Here, the filler of *r*_{1}, *f*_{1}≡[`æ` `b`], is a non-atomic element of 𝓕_{r}, and itself has bindings *β*_{r}(*f*_{1})={`æ`/*r*_{0},`b`/*r*_{1}}; the atomic filler `k` has binding set *β*_{r}(`k`)={`k`/*r*_{ε}}, as in *β*_{t}. As in Lisp, we denote by `cons` the binary-tree constructor function: *s*_{cab}=`cons`(`k`,[`æ` `b`])=`cons`(`k`,`cons`(`æ`,`b`)); thus *β*_{r}(`cons`(*s*_{0},*s*_{1}))={*s*_{0}/*r*_{0},*s*_{1}/*r*_{1}} for all *s*_{0},*s*_{1}∈𝒮. The access function that extracts the left (right) child of a tree will be denoted `ex`_{0} (`ex`_{1}); both return a special ‘null filler’ Ø when applied to an atom; thus, for non-atomic *s*, *β*_{r}(*s*)={`ex`_{0}(*s*)/*r*_{0},`ex`_{1}(*s*)/*r*_{1}}. `ex`_{ε}(*s*) is the symbol bound to the root of the tree *s* (or Ø if there is no such symbol).
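The constructor and access functions of example 4.3 have a direct symbolic-level sketch (names `cons`, `ex0`, `ex1`, `ex_eps` follow the text; Python's `None` plays the null filler Ø):

```python
def cons(p, q):            # binary-tree constructor, as in Lisp
    return (p, q)

def ex0(s):                # extract left child (None = null filler for atoms)
    return s[0] if isinstance(s, tuple) else None

def ex1(s):                # extract right child
    return s[1] if isinstance(s, tuple) else None

def ex_eps(s):             # symbol bound to the root, if s is an atom
    return None if isinstance(s, tuple) else s

s_cab = cons('k', cons('ae', 'b'))
# beta_r(s_cab) = {ex0(s_cab)/r_0, ex1(s_cab)/r_1} = {'k'/r_0, ('ae','b')/r_1}
```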

Henceforth, unless stated otherwise, we assume that 𝒮 is a given set of binary trees under the canonical filler-role decomposition. We also assume throughout that, in general, the sets 𝓕 and ℛ are denumerable (not necessarily finite); henceforth, let {*f*_{j}} and {*r*_{k}} be given enumerations of them.

### (b) Symbol structures as vectors

The vectorial level of description springs from the following fundamental definition [15]: it asserts that the vector realizing (or modelling, or instantiating, or embedding) a symbol structure is the sum of vectors that realize each constituent filler-role binding of the structure, and that the vector realizing a filler-role binding is the tensor (generalized outer) product of vectors that realize the filler and the role.

### Definition 4.4

A *tensor-product realization* of a filler-role decomposition consists of two real vector spaces *F* and *R*—the ‘filler vector space’ and the ‘role vector space’, with dimensions dim(*F*) and dim(*R*), respectively—and two ‘realization’ functions *ψ*_{F}:𝓕→*F* and *ψ*_{R}:ℛ→*R*. Here, we require that each of the ranges of *ψ*_{F} and *ψ*_{R} be linearly independent sets.

The associated *realization mapping* *Ψ* is defined by

*Ψ*(*s*)≡∑_{k} *ψ*_{F}(*f*_{k})⊗*ψ*_{R}(*r*_{k}), where {*f*_{k}/*r*_{k}}=*β*(*s*).
The vector space *S*, containing the range of *Ψ*, is the tensor product of spaces *F* and *R*: *S*≡*F*⊗*R*. [Given respective bases {**f**_{φ}} and {**r**_{ϱ}} for *F* and *R*, {**f**_{φ}⊗**r**_{ϱ}} is a basis for *S*, and the mapping (**f**,**r**)↦**f**⊗**r** from *F*×*R* to *S* is bilinear: linear in each of **f** and **r** independently; dim(*S*)=dim(*F*) dim(*R*).]

Corresponding to our given enumerations {*f*_{j}} and {*r*_{k}} of 𝓕 and ℛ, we have their vectorial realizations **f**_{j}≡*ψ*_{F}(*f*_{j}) and **r**_{k}≡*ψ*_{R}(*r*_{k}).

Given a vector **v**∈*S* realizing some symbol structure in 𝒮, we can determine that structure from **v**.

### Proposition 4.5

The linear independence of the ranges of *ψ*_{F} and *ψ*_{R} entails that *ψ*_{F} and *ψ*_{R} are invertible. Furthermore, *Ψ* is invertible: given **v**=*Ψ*(*s*), *s* can be recovered as the unique element of 𝒮 with bindings {*f*_{k}/*r*_{k}}, where *f*_{k}≡*ψ*_{F}^{−1}(**f**_{k}), and {**f**_{k}} is the unique sequence in *F* such that **v**=∑_{k}**f**_{k}⊗**r**_{k} [15].
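A small numerical sketch of definition 4.4 and proposition 4.5, assuming one-hot (hence orthonormal) filler and role vectors so that unbinding reduces to a contraction; with merely linearly independent vectors one would contract with the dual basis instead (e.g. via `np.linalg.pinv`):

```python
import numpy as np

# Orthonormal filler vectors (atoms) and role vectors; any linearly
# independent choices would do (definition 4.4).
fillers = {'k': 0, 'ae': 1, 'b': 2}
roles   = {'r0': 0, 'r01': 1, 'r11': 2}
psi_F = {f: np.eye(3)[i] for f, i in fillers.items()}
psi_R = {r: np.eye(3)[i] for r, i in roles.items()}

def Psi(bindings):
    """Realization mapping: sum over constituents of filler (x) role."""
    return sum(np.kron(psi_F[f], psi_R[r]) for f, r in bindings)

s_cab = [('k', 'r0'), ('ae', 'r01'), ('b', 'r11')]
v = Psi(s_cab)          # a vector in S = F (x) R, of dimension 9

# Recover each filler from v (proposition 4.5): with orthonormal roles,
# solving v = sum_k f_k (x) r_k is a contraction against each role vector.
M = v.reshape(3, 3)     # rows indexed by filler coords, columns by role coords
f_at_r0 = M @ psi_R['r0']          # recovers psi_F['k']
```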

Within the continuous space of vectorial mental representations *S*, distance can be used to model cognitive dissimilarity, once *S* is endowed with a metric structure.

### Definition 4.6

A *metric vectorial realization* is a tensor-product realization in which each vector space *V* ∈{*F*,*R*} has an *inner product*; the inner product of two vectors u, **v**∈*V* is written as **u**⋅**v** [(**u**,**v**)↦**u**⋅**v** is bilinear, symmetric and positive-definite: **u**≠**0**⇒**u**⋅**u**>0]. The *length* of u is ∥**u**∥≡(**u**⋅**u**)^{1/2}; **u** is *normalized* iff ∥**u**∥=1. The *distance* between u and v is ∥**u**−**v**∥. u and v are *orthogonal* iff **u**⋅**v**=0. Inner products on *F*, *R* induce an inner product on *F*⊗*R* satisfying (**f**_{1}⊗**r**_{1})⋅(**f**_{2}⊗**r**_{2})=(**f**_{1}⋅**f**_{2})(**r**_{1}⋅**r**_{2}).
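The induced inner product on *F*⊗*R* can be checked numerically for arbitrary vectors (the values here are illustrative):

```python
import numpy as np

f1, f2 = np.array([1.0, 2.0]), np.array([0.0, 1.0])
r1, r2 = np.array([3.0, 1.0, 0.0]), np.array([1.0, 1.0, 1.0])

# (f1 (x) r1) . (f2 (x) r2) == (f1 . f2)(r1 . r2)
lhs = np.dot(np.kron(f1, r1), np.kron(f2, r2))
rhs = np.dot(f1, f2) * np.dot(r1, r2)
# lhs == rhs == 8.0
```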

Returning to the case of binary trees, we can now relate the two decompositions in example 4.2 and example 4.3:

### Definition 4.7

A *recursive realization of binary trees* is a metric vectorial realization of 𝒮, built from a vector space *R*_{(1)}, in which:

(*a*) *ψ*_{R}(*r*_{0})≡**r**_{0}∈*R*_{(1)} and *ψ*_{R}(*r*_{1})≡**r**_{1}∈*R*_{(1)};

(*b*) **r**_{x0}=**r**_{x}⊗**r**_{0} and **r**_{x1}=**r**_{x}⊗**r**_{1} for all *x*∈{0,1}* (where *x*0 is the concatenation of string *x* and 0);

(*c*) the roles for tree positions at depth *d* lie in the vector space *R*_{(d)}≡*R*_{(1)}⊗*R*_{(1)}⊗⋯⊗*R*_{(1)} (*d* factors); letting *R*_{(0)}≡ℝ, the total role space *R* is the direct sum of all the vector spaces *R*_{(d)}: *R*≡*R*_{(0)}⊕*R*_{(1)}⊕*R*_{(2)}⊕*R*_{(3)}⊕⋯ [an **r**∈*R* can be represented as **r**=(**r**_{(0)},**r**_{(1)},…), each **r**_{(d)}∈*R*_{(d)}].^{3}

### Theorem 4.8

*There are four linear transformations on S=F⊗R (namely W*_{cons0}*, W*_{cons1}*, W*_{ex0}*, W*_{ex1}*) such that if s=*`cons`*(p,q), and we define* **s***≡Ψ(s),* **p***≡Ψ(p),* **q***≡Ψ(q), then [*19*]:*

***s**=W*_{cons0}***p**+W*_{cons1}***q**, **p**=W*_{ex0}***s** and **q**=W*_{ex1}***s**.*

This recursive realization of the canonical filler-role decomposition of example 4.2 makes it possible to write **s** either in the form corresponding to the canonical decomposition, **s**=∑_{x}**f**_{x}⊗**r**_{x}, where **f**_{x} is the realization of the atomic symbol at tree position *x*, or as **s**=**p**⊗**r**_{0}+**q**⊗**r**_{1}, just as it would be expressed using the recursive decomposition, **r**_{0} and **r**_{1} realizing its two roles (left/right child), and **p** (**q**) realizing the entire left (right) sub-tree. In this sense, definition 4.7*b* renders the canonical representation recursive.

Later, the linear transformations of theorem 4.8 will enable computation of an interesting class of functions.
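A depth-truncated numerical sketch of this construction, assuming orthonormal role vectors in ℝ² and one-hot fillers; the functions `cons_vec` and `ex_vec` below play the roles of the W_{cons} and W_{ex} transformations on finite-dimensional truncations of *S* (an illustrative finite stand-in, not the infinite-dimensional construction itself):

```python
import numpy as np

# Orthonormal role vectors for left (r0) and right (r1) child.
r0 = np.array([1.0, 0.0])
r1 = np.array([0.0, 1.0])

# One-hot filler vectors for the atoms (an illustrative choice).
atom = {'k': np.eye(3)[0], 'ae': np.eye(3)[1], 'b': np.eye(3)[2]}

def cons_vec(p, q):
    """s = p (x) r0 + q (x) r1; shorter vectors are zero-padded so both
    subtree realizations share one truncated space."""
    n = max(p.size, q.size)
    p = np.pad(p, (0, n - p.size))
    q = np.pad(q, (0, n - q.size))
    return np.kron(p, r0) + np.kron(q, r1)

def ex_vec(s, r):
    """Unbinding: contract the last role factor of s with r; with
    orthonormal r0, r1 this is a linear map (the W_ex transformations)."""
    return s.reshape(-1, 2) @ r

v = cons_vec(atom['k'], cons_vec(atom['ae'], atom['b']))   # realizes [k [ae b]]
left  = ex_vec(v, r0)        # realizes 'k' (zero-padded)
right = ex_vec(v, r1)        # realizes [ae b]
```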

Henceforth, we assume given metric vectorial realizations *Ψ*^{S} and *Ψ*^{I} of 𝒮 and ℐ which are defined over the same filler and role vector spaces *F* and *R*, and which obey the following *input scaling condition*.

### Condition 4.9

There exists a function *ρ* such that, for all *ι*∈ℐ, *Ψ*^{I}(*ι*)=*ρ*(*ι*)*Ψ*^{S}(*ι*).

Section 5*b* uses *ρ* to control the length of each input ***ι*** in computing its distance to a state **s**.

### (c) Symbolic structures as neural activation patterns

Each vector space *S*^{γ} has a given neural basis; the coordinates of **s**=*Ψ*_{S}(*s*) w.r.t. this basis are the neural activation values realizing state *s* (§3*c*). If the individual vectors realizing the individual symbolic fillers and roles each lie along a neural basis vector, then the presence of a given symbolic constituent corresponds to the activation of a single neuron: this is ‘local representation’. We will assume the general case, *distributed representation*, in which an individual constituent is realized by a vector that is a linear combination of multiple neural basis vectors: the presence of that constituent is encoded by an activation pattern distributed over multiple neurons, and each neuron participates in the encoding of multiple constituents. (Distributed representations: allow many more representations per *n* neurons; allow encoding of similarity among constituents; are what results from neural network learning; and are ubiquitous in the brain [20–22].) The [*φ*×*ϱ*]th activation value in the tensor-product realization of a symbol structure is the product [*φ*th activation in the realization of constituent *k*'s filler] × [*ϱ*th activation in the realization of constituent *k*'s role], summed over all the constituents *k*.

### Definition 4.10

A ‘neural basis’ {**f**_{φ}}, {**r**_{ϱ}} for the spaces *F*, *R* of a metric vectorial realization is a distinguished pair of orthonormal bases. The neural coordinates, or activation pattern, realizing a symbol structure *s* is the point **s** with elements *s*_{φϱ} such that *Ψ*(*s*)=∑_{φ,ϱ}*s*_{φϱ}**f**_{φ}⊗**r**_{ϱ}.

### Proposition 4.11

For each binding of *s*, *β*(*s*)={*f*_{k}/*r*_{k}}, let [**f**_{k}]_{φ} be the *φ*th neural coordinate of **f**_{k}≡*ψ*_{F}(*f*_{k}) (i.e. **f**_{k}=∑_{φ}[**f**_{k}]_{φ}**f**_{φ}), and similarly let [**r**_{k}]_{ϱ} be the *ϱ*th neural coordinate of **r**_{k}≡*ψ*_{R}(*r*_{k}). Then *s*_{φϱ}=∑_{k}[**f**_{k}]_{φ}[**r**_{k}]_{ϱ} [15].

## 5. Dissimilarity as representational distance: relational faithfulness

The total input *ι* to a component *γ* is the target towards which the state of *γ* is driven under the combined influence of those components that send partial input to *γ*. As we see in §7, *all else equal*, the greater the distance *d*(*s*,*ι*) of a state *s* to the total input *ι*—i.e. the more dissimilar or ‘unfaithful’ *s* is to *ι*—the less likely *γ* is to be in state *s*. Experimental data provide evidence for uncovering the (component-specific) dissimilarity metric *d* in *γ*; cognitive scientists formulate hypotheses concerning *d* in terms of the format of representations in *γ*: the format determines the dimensions of unfaithfulness relevant for *d*. We propose to formalize the problematic notion of representational format by making explicit those relations between constituents that are understood to be defined in that format (figure 1 ○7).

### (a) Representational format as relational faithfulness

### Definition 5.1

A *representational format* is a filler-role decomposition in which the roles are a set of *relations* {R_{i}}. Each R_{i} has an ‘arity’ *n*_{i}, and a collection of ‘argument domains’ A_{i1},…,A_{in_{i}}. Then for any particular symbol structure *s*, R_{i}(*s*)⊂A_{i1}×⋯×A_{in_{i}} is the set of elements which in *s* stand in the relation R_{i}. The filler/role bindings are defined as *β*(*s*)≡{**a**/R_{i}|**a**∈R_{i}(*s*)}.

### Example 5.2

Let 𝒮 be the set of strings of symbols from the alphabet A such that no symbol appears more than once in any string. For any *s*∈𝒮, define the arity-1 relation R_{A}(*s*)≡{`X`|`X` occurs in *s*}, i.e. R_{A}(*s*) is the set of symbols that occur (once) in the string *s*. Define the arity-2 relation R_{L}(*s*)≡{(*l*,`X`)|`X` is the symbol of *s* in position *l* (relative to the left edge)}; e.g. R_{L}(`CAB`)={(2,`A`),(1,`C`),(3,`B`)}. {R_{A},R_{L}} is a representational format for 𝒮, yielding a bindings function *β*_{s}; e.g. *β*_{s}(`BA`)={`A`/R_{A},`B`/R_{A},(1,`B`)/R_{L},(2,`A`)/R_{L}}.
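The two relations of this example transcribe directly into code (the helper names `R_A` and `R_L` mirror the text's notation):

```python
def R_A(s):
    """Arity-1 relation: the set of symbols occurring in the string."""
    return set(s)

def R_L(s):
    """Arity-2 relation: (position-from-left-edge, symbol) pairs, 1-indexed."""
    return {(i + 1, x) for i, x in enumerate(s)}

# R_L('CAB') == {(1, 'C'), (2, 'A'), (3, 'B')}
# The bindings of 'BA' pair each tuple with its relation:
# {'A'/R_A, 'B'/R_A, (1,'B')/R_L, (2,'A')/R_L}
```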

The distance *d*(*ι*,*s*) is measured in terms of the degree of unfaithfulness of *s* to *ι* w.r.t. the relations of 𝓡 (figure 1 ○8).

### Definition 5.3

Given a representational format including some relation R, the *relational faithfulness constraints* for R are the following functions (|*X*| denotes the number of members of set *X*):

*F*^{I}_{R}(*ι*,*s*) ≡ |R(*ι*)∖R(*s*)|,

*F*^{O}_{R}(*ι*,*s*) ≡ |R(*s*)∖R(*ι*)|.

*F*^{I}_{R}(*ι*,*s*) is the number of elements for which R is true of the input but not of the output; *F*^{O}_{R}(*ι*,*s*) is the reverse. So for the string example mentioned earlier, for R=R_{A}, *F*^{I}_{R_{A}}(*ι*,*s*) is the number of symbols that have been deleted, and *F*^{O}_{R_{A}}(*ι*,*s*) is the number of symbols that have been inserted (irrespective of position in the string, which R_{A} ignores). So, for example, *F*^{I}_{R_{A}}(`CAB`,`AB`)=1. For R=R_{L}, *F*^{I}_{R_{L}}(*ι*,*s*) is the number of symbols in the input that are not in the same position (relative to the left edge) in the output; e.g. *F*^{I}_{R_{L}}(`CAB`,`AB`)=3, although *F*^{I}_{R_{A}}(`CAB`,`AB`)=1.
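The deletion and insertion counts reduce to set differences. A sketch, re-implementing the string relations of example 5.2 (`F_I` and `F_O` are my labels for the two constraint types):

```python
# Faithfulness constraint counts as set differences between R(input) and
# R(output). F_I / F_O are illustrative names for the two constraint types.

def R_A(s): return set(s)                                  # symbols present
def R_L(s): return {(i + 1, x) for i, x in enumerate(s)}   # left-edge positions

def F_I(R, iota, s):
    """Elements for which R holds of the input but not the output."""
    return len(R(iota) - R(s))

def F_O(R, iota, s):
    """Elements for which R holds of the output but not the input."""
    return len(R(s) - R(iota))

assert F_I(R_A, "CAB", "AB") == 1    # one symbol (C) deleted
assert F_O(R_A, "CAB", "CABD") == 1  # one symbol (D) inserted
assert F_I(R_L, "CAB", "AB") == 3    # no symbol keeps its left-edge position
```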

It is relations such as these that are implicit in arguments (e.g. [23]) that the mental representation for letter strings uses a format in which position is reckoned from the right as well as the left edge (i.e. that, in addition to R_{L}, the mental format involves an analogous relation R_{R} in which positions are counted from the right edge). Under neurological damage, a patient spelling a list of words may erroneously intrude letters from previous words into a given word, more likely errors being those more faithful to the previous-word target. Because these errors tend to preserve the position of letters relative to right as well as left word edges, the relevant faithfulness constraints must penalize discrepancies w.r.t. R_{R} as well as R_{L}.

An overall measure of the discrepancy between an input (target) *ι* and an output *s*—which plays the role here of a distance metric *d*(*ι*,*s*)—is given by a weighted sum of the relational faithfulness constraints for the relations R∈𝓡 (definition 5.3). Faithfulness is one facet of well-formedness assessed by a measure called 'Harmony'.

### Definition 5.4

A pair of positive *weights* (*w*^{I}_{R}, *w*^{O}_{R}) defines an R-*faithfulness Harmony function* *H*_{R}:

*H*_{R}(*ι*,*s*) ≡ −*w*^{I}_{R} *F*^{I}_{R}(*ι*,*s*) − *w*^{O}_{R} *F*^{O}_{R}(*ι*,*s*).

The total relational-faithfulness Harmony of a format with relation set 𝓡 is *H*_{𝓡}(*ι*,*s*) ≡ ∑_{R∈𝓡} *H*_{R}(*ι*,*s*).

The faithfulness Harmony is increasingly negative as *ι* and *s* become more disparate along the dimensions encoded in the relations of 𝓡; the maximum faithfulness Harmony value is 0, attained when, for every R∈𝓡, R(*s*)=R(*ι*) (e.g. when *s*=*ι*).

### (b) Relational unfaithfulness as vector distance

Given a representational format with relation set 𝓡, the dissimilarity or relational-faithfulness Harmony between an input *ι* and a state *s*, *H*_{𝓡}(*ι*,*s*), is directly related to the distance, in the metric vector space *S*, between their vectorial realizations **ι** and **s**: *H*_{𝓡} is the negative of the distance squared (see ○9). This result assumes the following type of tensor-product realization of the representations in 𝒮.

The tensor-product realization of a representational format including an arity-*n* relation R involves a mapping *ψ*_{F} from a filler set to a vector space *F*. Treating such a filler—an element such as (`a`,`b`,`c`)—as a structure in its own right, and then recursively applying tensor-product realization to it (using contextual roles [15]), leads to the following type of mapping *ψ*_{F}: (`a`,`b`,`c`)↦**a**⊗**b**⊗**c**.

### Definition 5.5

Given a representational format (assuming all the notation of definition 5.1), a role realization mapping *ψ*_{R}: 𝓡→*R*, and a set of filler realization mappings *ψ*_{F_{i}}: A_{i1}×⋯×A_{i*n*_{i}}→*F*, the *induced* tensor-product realization *Ψ*^{S} is defined by

*Ψ*^{S}(*s*) ≡ ∑_{R_{i}∈𝓡} ∑_{*a*∈R_{i}(*s*)} *ψ*_{F_{i}}(*a*)⊗*ψ*_{R}(R_{i}).

### Definition 5.6

A tensor-product realization is called *orthogonal* iff the ranges of *ψ*_{F} and *ψ*_{R} are each orthogonal sets: for all fillers *f*, *f*′, *f*≠*f*′ ⇒ *ψ*_{F}(*f*)⋅*ψ*_{F}(*f*′)=0, and similarly for *ψ*_{R}. The realization is *orthonormal* iff, in addition, ∀*f*, ∀*r*, ∥*ψ*_{F}(*f*)∥=∥*ψ*_{R}(*r*)∥=1.

Note that orthogonal realizations are in general distributed: they need not be local (§4*c*). The result is as follows.

### Theorem 5.7

*Let Ψ*^{S} *be a tensor-product realization of a representational format induced by orthonormal filler and role realizations ψ*_{F}*, ψ*_{R}*. Assume given faithfulness weights w*^{I}_{R}*, w*^{O}_{R} *for each* R∈𝓡*; these define the faithfulness Harmony function H*_{R} *(definition 5.4). For each* R*, replace the unit-length role vector ψ*_{R}(R) *by a rescaled vector ρ*(R)*ψ*_{R}(R)*, with the scaling factors ρ*(R)*, and the scaling of Ψ*^{I} *(condition 4.9), determined by the weights. Given any ι, s*∈𝒮*, let* **ι**≡*Ψ*^{I}*(ι) and* **s**≡*Ψ*^{S}*(s). Then, up to a constant term depending on ι, the total relational-faithfulness Harmony of s to ι decreases as the square of the distance between* **s** *and* **ι**:

*H*_{𝓡}(*ι*,*s*) = −∥**s**−**ι**∥² + ∑_{R∈𝓡} *κ*_{R}(*ι*),

*where each constant κ*_{R}*(ι) depends only on the weights and on n*^{I}_{R}*≡|*R*(ι)|.*

Note that the complexities arise only when there is an asymmetry *w*^{I}_{R}≠*w*^{O}_{R}; otherwise, *ρ*(R)=1, *κ*_{R}(*ι*)=0. (For a proof, see the electronic supplementary material.)
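For the symmetric-weight case (*w*^{I}_{R}=*w*^{O}_{R}=1) the content of the theorem can be verified directly: with an orthonormal realization, each binding present in exactly one of *ι* and *s* contributes 1 to the squared distance. A numerical sketch, with one-hot binding vectors standing in for an orthonormal tensor-product realization (all names are mine):

```python
import numpy as np

# One-hot binding vectors stand in for an orthonormal tensor-product
# realization; with w_I = w_O = 1 for both relations, the faithfulness
# Harmony equals minus the squared vector distance.

def R_A(s): return {("R_A", x) for x in set(s)}
def R_L(s): return {("R_L", (i + 1, x)) for i, x in enumerate(s)}
def bindings(s): return R_A(s) | R_L(s)

iota, out = "CAB", "AB"
basis = {b: i for i, b in enumerate(sorted(bindings(iota) | bindings(out), key=str))}

def realize(s):
    v = np.zeros(len(basis))
    for b in bindings(s):
        v[basis[b]] = 1.0          # each binding: an orthonormal basis vector
    return v

# H = -(F_I + F_O) summed over both relations ...
H = -sum(len(R(iota) - R(out)) + len(R(out) - R(iota)) for R in (R_A, R_L))
# ... equals minus the squared distance between the realizations:
assert H == -int(np.sum((realize(out) - realize(iota)) ** 2))
```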

### (c) Relational faithfulness as network weights

We will see in theorem 7.11 how relational faithfulness Harmony is naturally realized at the network level.

## 6. Recursive functions as linear transformations

Having considered the representational states of the components of the cognitive architecture, we turn to the functions computed over these representations.

### (a) Primitives'-closure functions

Mental processes compute recursive functions; for concreteness (without compromising generality), we take the inputs and outputs of these functions to be binary trees. We now see that an interesting class of such recursive functions, denoted here 𝓕_{PC}, can be computed at the neural level in an extremely simple architecture: the linear associator (example 3.3). 𝓕_{PC} is the closure under composition of the primitive tree-manipulating functions defined in §4*a* (figure 2 gives an example). It is convenient to work directly with the filler-role bindings of binary trees, assuming the canonical filler/role decomposition (example 4.2).

### Definition 6.1

A *pseudo-tree* t is a set of binary-tree filler/role bindings with at most one filler bound to each role; Ø is the null pseudo-tree with no bindings; 𝕋 is the set of pseudo-trees. The *unification* t⊔t′ of two pseudo-trees is the union of their bindings, provided this yields a pseudo-tree (at most one filler bound to each role); otherwise, t⊔t′≡Ø. Any t∈𝕋 can be represented as t=⊔_{k} `X`_{k}/*r*_{k} for some sequence of fillers {`X`_{k}} drawn from *F*, the alphabet of filler symbols, and roles {*r*_{k}}⊂*R*.

A function *g*: 𝕋→𝕋 is *first order* iff *g*(Ø)=Ø and *g*(t⊔t′)=*g*(t)⊔*g*(t′).

First-order functions process each binding separately, with no interactions among bindings.

### Proposition 6.2

The primitive binary-tree functions ex_{0}, ex_{1}, ex_{ε}, cons_{0}, cons_{1} (example 4.3) are first order [19, p. 320].

As for manipulation of fillers, the basic functions of the class of recursive functions we consider are those in the set 𝓕, defined next.

### Definition 6.3

𝓕 is the set of functions *h*: 𝕋→𝕋 satisfying the following conditions:

- *h* is first order;
- for all t∈𝕋, *h*(t)=`cons`(*h*(`ex`_{0}(t)), *h*(`ex`_{1}(t)))⊔*h*(`ex`_{ε}(t))/*r*_{ε};
- there is a partial function h̄: *F*→*F* such that, for every `X` in the domain of h̄, *h*(`X`/*r*_{ε})=h̄(`X`)/*r*_{ε}; if `X` is not in the domain of h̄, then *h*(`X`/*r*_{ε})=Ø.

A function *h*∈𝓕 is indeed very simple: it replaces every filler `X` in the domain of h̄ by the filler h̄(`X`), and deletes all other fillers. Finally, we join the filler-manipulating functions in 𝓕 with role manipulation in the form of the closure, under composition, of the primitive binary-tree functions.
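A minimal executable sketch of pseudo-trees and 𝓕-style filler replacement, using dicts from role addresses to fillers (the bit-string address encoding and all names are my illustrative choices):

```python
# Pseudo-trees as dicts: role = tree address (bit-string, "" = root),
# value = filler symbol. The address encoding and names are illustrative.

def unify(t1, t2):
    """Union of bindings; conflicting bindings yield the null pseudo-tree."""
    for r in t1.keys() & t2.keys():
        if t1[r] != t2[r]:
            return {}
    return {**t1, **t2}

def lift(h_bar):
    """The F-style function determined by a partial filler map h_bar:
    replace each filler in h_bar's domain, delete all other fillers."""
    def h(t):
        return {r: h_bar[x] for r, x in t.items() if x in h_bar}
    return h

h = lift({"A": "a", "B": "b"})            # "C" is outside h_bar's domain
assert h({"0": "A", "1": "B", "10": "C"}) == {"0": "a", "1": "b"}
# First-order behaviour: h(t1 ⊔ t2) = h(t1) ⊔ h(t2) on compatible trees
t1, t2 = {"0": "A"}, {"1": "C"}
assert h(unify(t1, t2)) == unify(h(t1), h(t2))
```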

### Definition 6.4

The class 𝓕_{PC} of *PC functions* (primitives'-closure) has the following recursive definition:

- base case: each primitive binary-tree function ex_{0}, ex_{1}, ex_{ε}, cons_{0}, cons_{1} (example 4.3), and each function in 𝓕 (definition 6.3), is in 𝓕_{PC};
- recursion: if *g*, *g*′∈𝓕_{PC}, then the composition *g*∘*g*′ and the unification t↦*g*(t)⊔*g*′(t) are in 𝓕_{PC} (hence so is t↦`cons`(*g*(t),*g*′(t)) = cons_{0}(*g*(t))⊔cons_{1}(*g*′(t)));
- no other functions are in 𝓕_{PC}.

While further development of the theory to address a larger class of recursive functions is in progress, the class already includes many of the functions hypothesized to be computed in cognition, for example, recursive functions like that which maps a syntactic representation such as [[`Few` [`world leaders`]] [[`are admired`] [`by` [`George Bush`]]]] to a semantic representation like `admire`(`George Bush`, `few world leaders`): see the caption of figure 2 for the definition of this function *g*.

### (b) Primitives'-closure functions as linear transformations

Because the primitive binary-tree functions can be realized as linear transformations (theorem 4.8), and because the set of linear transformations is closed under composition and addition, we get straightforwardly the following result.

### Theorem 6.5

*Any PC function g is realized by a linear transformation W*_{g} *on S; that is, for every binary tree s, Ψ(g(s)) = W*_{g}*Ψ(s).*

### (c) Primitives'-closure functions as neural networks

It follows immediately that a PC function can be computed by a simple type of network, a linear associator (example 3.3); the weight matrix turns out to have a form that enables finite specification of an infinite matrix.

### Theorem 6.6

*For any PC function g, the linear transformation W*_{g} *is realized by a linear associator whose (infinite) weight matrix is finitely specified, being constructed from the identity matrix I on the (infinite-dimensional) role space R (definition 4.7) and a finite matrix* W̄_{g}*.*

The map *g*↦**W**_{g} is compositional: it can be defined constructively [19, p. 324] (see figure 2 for an example).
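The realization of role-manipulating primitives as matrices, and of function composition as matrix product, can be sketched concretely in a depth-truncated space (the truncation, one-hot vectors, and bit-string addresses are illustrative simplifications of the infinite-dimensional construction):

```python
import numpy as np

# Depth-truncated sketch: fillers are one-hot over a small alphabet; roles are
# one-hot over tree addresses (bit-strings: "" = root, "0" = left child, ...)
# of length <= DEPTH. A structure is the sum of kron(filler, role) bindings.

DEPTH = 3
alphabet = ["A", "B"]
addresses = [""] + [format(i, "b").zfill(l)
                    for l in range(1, DEPTH + 1) for i in range(2 ** l)]
idxF = {f: i for i, f in enumerate(alphabet)}
idxR = {r: i for i, r in enumerate(addresses)}
nF, nR = len(alphabet), len(addresses)

def realize(tree):                      # tree: dict address -> symbol
    s = np.zeros(nF * nR)
    for addr, sym in tree.items():
        s += np.kron(np.eye(nF)[idxF[sym]], np.eye(nR)[idxR[addr]])
    return s

# ex0 extracts the left subtree: each binding at address "0"+x moves to x
# (other bindings are deleted). As a matrix: identity on fillers,
# an address-shift on roles.
E0 = np.zeros((nR, nR))
for x in addresses:
    if "0" + x in idxR:
        E0[idxR[x], idxR["0" + x]] = 1.0
W_ex0 = np.kron(np.eye(nF), E0)

t = {"00": "A", "01": "B"}              # left subtree is [A B]
assert np.allclose(W_ex0 @ realize(t), realize({"0": "A", "1": "B"}))
# Composition of functions = product of matrices:
assert np.allclose(W_ex0 @ W_ex0 @ realize(t), realize({"": "A"}))
```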

## 7. Grammar as optimality

Having seen that a significant class of recursive functions can be computed in one massively parallel step (fully specified at the neural level in terms of multiplication and addition operations), in order to analyse linguistic cognition, we turn from recursive function theory to another general approach for specifying functions over symbols: formal language theory. This approach conceives of the function to be computed as a kind of language, and deploys rewrite-rule grammars, interpreted sequentially. These grammars can be classified in various ways, e.g. the Chomsky hierarchy, with its corresponding hierarchy of formal sequential automata, culminating in Turing machines. It turns out that a different approach to specifying languages, more directly reducible to neural computation, deploys Harmony. In §5, we encountered one facet of Harmony, faithfulness; specifying languages introduces the other facet, markedness. Markedness reduces the Harmony of symbol structures that violate grammatical constraints, and may reward structures that meet grammatical desiderata. The structures of the formal language are those with maximal Harmony.

Because of the extended chain of inference in this section, a roadmap is provided in figure 3.

To see the basic intuition, consider a context-free rewrite-rule grammar 𝒢 in Chomsky normal form; a legal derivation *D* produced by such a grammar can be represented as a binary tree t_{D}. For example, use of the rule `A`→`B` `C` in a derivation *D* contributes to t_{D} a local tree [_{A} `B` `C`]: a pair of sister nodes labelled `B` and `C` immediately dominated by a mother node labelled `A`. We can take the language generated by 𝒢, L(𝒢), to be the set of all binary trees representing legal derivations. To evaluate whether a given tree t∈L(𝒢), we must check that every local tree [_{X} `Y` `Z`] in t is sanctioned by a rule `X`→`Y` `Z` in 𝒢. We can do this numerically, computing a Harmony value *H*(t) for t, as follows. We assess a markedness Harmony penalty of −3 (lower *H*(t) by 3) for every symbol in t. Then for every rule `X`→`Y` `Z` in 𝒢, we add +2 to *H*(t) for every pair in t consisting of a mother node labelled `X` with a left (right) child node labelled `Y` (`Z`); we conceptually split this into a reward of +1 for each node in the pair. Now, consider a node *x* labelled by some symbol `V`: *x* will lower *H*(t) by 3. Then if *x* is dominated by a legal parent node for `V`, *x* increases *H*(t) by 1; if *x* has two legal daughter nodes for `V`, *x* increases *H*(t) by 1 for each. Thus if *x* is fully sanctioned by the grammar (both above and below), it will contribute 0 to *H*(t); if it is not legal, its penalty −3 will not be fully offset and its net contribution to *H*(t) will be negative. The result is that if t is legal, *H*(t)=0; otherwise, *H*(t)<0. The maximal-Harmony trees are those with *H*=0: exactly the trees of L(𝒢).^{4}
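The scoring scheme just described is directly executable. A toy sketch (it omits the root/terminal bookkeeping mentioned in footnote 4, so even fully legal trees score below 0 here; the point is only that unsanctioned local trees score strictly lower):

```python
# Toy Harmony scoring for derivation trees of a Chomsky-normal-form grammar.
# Trees are tuples: ("X", left, right) for interior nodes, ("X",) for leaves.
# Each node costs -3; each rule-sanctioned mother/child pair earns +2
# (so a sanctioned local tree earns +4). Root/terminal bookkeeping
# (footnote 4) is omitted.

def harmony(tree, rules):
    label, *kids = tree
    h = -3
    if kids:
        left, right = kids
        if (label, left[0], right[0]) in rules:
            h += 4                    # +2 per sanctioned mother/child pair
        h += harmony(left, rules) + harmony(right, rules)
    return h

rules = {("S", "A", "B")}             # the single rule S -> A B
legal, illegal = ("S", ("A",), ("B",)), ("S", ("B",), ("A",))
assert harmony(legal, rules) == -5    # 3 nodes at -3, one sanctioned local tree at +4
assert harmony(legal, rules) > harmony(illegal, rules)
```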

### (a) Rewrite-rule and Optimality-Theoretic grammars as Harmonic grammars

### Definition 7.1

A *Harmonic grammar* over 𝒮 [24] is a function *H*_{G}: 𝒮→ℝ; *H*_{G}(*s*) is the '(markedness) Harmony' of *s*∈𝒮. The *language specified by* *H*_{G}, L(*H*_{G}), is

L(*H*_{G}) ≡ argmax_{*s*∈𝒮} *H*_{G}(*s*).

An *s*∈L(*H*_{G}) has maximum Harmony, and is said to be *optimal*.

A Harmonic grammar *H*_{G} is *second-order* iff there exists a function *H*_{2}: *B*×*B*→ℝ (where *B*≡*F*×*R*) such that if *β*(*s*)={*f*_{k}/*r*_{k}}≡{*b*_{k}}, then *H*_{G}(*s*)=∑_{j,k} *H*_{2}(*b*_{j},*b*_{k}) (*H*_{G} depends only on *pairs* of constituents).

### Theorem 7.2

*Given a formal language* *generated by a grammar* 𝒢 *in the Chomsky hierarchy, there is a second-order Harmonic grammar* *H*_{G} *that specifies the same language:* L(*H*_{G})=L(𝒢)*.*

As sketched intuitively earlier, *H*_{G} can be constructed compositionally from the rewrite rules of 𝒢 [25, p. 399]. This shows that formal languages can be *specified*; we consider later how languages can be *computed*.

We turn now to the grammars relevant to cognition, those of human natural languages. It turns out that for generative linguistics—the formal theory of human language—analysing the set of grammatical expressions as a set of optimal structures is quite fruitful, thanks primarily to Optimality Theory [26,27].

### Definition 7.3

An *Optimality-Theoretic system* 𝒪 consists of the following:

- a set ℐ of symbol structures called 'inputs' (e.g. for syntax, a meaning to be expressed);
- a set 𝒮 of symbol structures called 'outputs' (e.g. for syntax, a parse tree of a sentence);
- a function *Gen*: ℐ→2^{𝒮}; *Gen*(*ι*) is the set of 'candidate outputs' for *ι* (e.g. all possible parses);
- a finite set *Con* of functions C_{k}: ℐ×𝒮→ℕ called 'constraints'; C_{k}(*ι*,*s*) is the number of 'violations of C_{k} by (*ι*,*s*)', and if C_{k}(*ι*,*s*_{1})<C_{k}(*ι*,*s*_{2}) we say 'C_{k} prefers *s*_{1} to *s*_{2} given *ι*';
- *Con* includes *faithfulness* constraints; each such constraint evaluates a dimension of structure, being violated by each deviation of *s* from *ι* w.r.t. that dimension;
- *Con* includes *markedness* constraints; each such constraint evaluates, independently of *ι*, the inherent well-formedness of *s* with respect to some structural dimension.

An 𝒪-*grammar* is a total order ≫ on *Con*: a 'constraint ranking'. An input–output pair (*ι*,*s*), *s*∈*Gen*(*ι*), is *optimal* w.r.t. the 𝒪-grammar iff the following holds:

there is no *o*′∈*Gen*(*ι*) s.t. (*ι*,*o*′) is preferred to (*ι*,*s*) by the ≫-maximal constraint that prefers one of them.

The *language specified by* an 𝒪-grammar is the set of all optimal input–output pairs.

The *typology specified by* 𝒪 is the set of all languages specified by some 𝒪-grammar.

The empirical hypothesis is that the space of possible human grammars is the typology of an Optimality-Theoretic system called 'universal grammar'—all languages possess the same grammatical constraints or preferences: differences arise only in how conflicts among those preferences are resolved; i.e. differences are limited to how the grammar of each language ranks the universal constraints.

### Example 7.4

The Italian counterpart of English *it rains* is *piove*. An Optimality-Theoretic analysis goes as follows [28]. Our meaning is *ι*_{0}=[rain(tense=present)]; we assume here a faithfulness constraint that requires the verb form to be *piove* in Italian, or *rains* in English. This constraint outranks all others, ensuring that only an expression with the correct verb can be optimal. A markedness constraint SUBJECT requires that every sentence have a subject; the Italian candidate output sentence *piove* (or English *rains*) violates SUBJECT; *esso piove* (or *it rains*) satisfies it. Another markedness constraint FULLINT requires that every word in an expression contribute to the meaning: *esso piove* (or *it rains*) violates FULLINT (the subject is meaningless) while *piove* (or *rains*) satisfies it. In the Italian grammar, FULLINT≫SUBJECT: avoiding meaningless words is more important than providing a subject; so (*ι*_{0}, *piove*) is optimal, hence grammatical. In English, the ranking is reversed, so (*ι*_{0}, *it rains*) is optimal.

Despite being violated in these optimal expressions, SUBJECT is active in Italian and FULLINT is active in English: in English, FULLINT can only be violated when overruled by a higher-ranked constraint such as SUBJECT. For example, (*ι*_{0}, *it rains it*) is not grammatical—it has lower Harmony than (*ι*_{0}, *it rains*)—because of the second violation of FULLINT incurred by the meaningless object *it*: its presence (unlike the subject *it*) is not required to satisfy any higher-ranked constraint. The difference between English and Italian is not that the former prefers and the latter disprefers meaningless items: *all* languages (according to Optimality Theory) have the same grammatical preferences—differences arise only in how to resolve conflicts among those preferences, i.e. in the ranking of constraints.

Optimality Theory arguably provides the first general formal theory of linguistic typology. This has enabled formal analysis of typologies, as well as individual languages, at all linguistic levels: phonology, syntax, semantics and pragmatics [29–32] (see also the archive http://roa.rutgers.edu/). The empirical successes of Optimality Theory are cases when what is universally shared by human languages are *preferences* that outputs of the grammar should respect, as opposed to *processes* that generate those outputs. Rewrite rules characterize grammatical knowledge as procedures, while Optimality Theory characterizes grammatical knowledge as preferences: this is the force of the idea, ‘grammars as optimization’.

Given the potential value of Optimality Theory for understanding linguistic cognition, we turn to how neural computation might be used to compute languages specified by Optimality-Theoretic grammars, rather than languages specified by serial rewrite rules or functions specified by recursive equations.

For reduction of an 𝒪-grammar to the neural level, we translate it into a Harmonic grammar *H*_{G}.

### Theorem 7.5

*Let ≫ be an 𝒪-grammar s.t. every constraint* C_{k}∈*Con* *is bounded above: ∃M s.t. ∀ι, ∀s∈Gen(ι), ∀k,* C_{k}(*ι*,*s*)<*M. Then, there is a Harmonic grammar H*_{G} *over* ℐ×𝒮 *that specifies the same language:*

*H*_{G}(*ι*,*s*) ≡ −∑_{k} *w*_{k} C_{k}(*ι*,*s*)

*for a set of weights {w*_{k}*} with w*_{k}>0*. (In this context, (ι,s***) is optimal for an H*_{G} *iff s***∈argmax*_{s}*{H*_{G}*(ι,s)|s∈Gen(ι)}.) [*26*, §10.2.2; *33*; *34*, p. 463].*

The basic idea of the construction is that if *Con* is ordered by ≫ so that C_{m}≫⋯≫C_{1}≫C_{0}, then the constraint weights can be *w*_{k}≡*M*^{k} (*M*−1 being the largest possible number of violations of any constraint). The Harmony cost incurred by a single violation of C_{k} (*w*_{k}=*M*^{k}) exceeds the total maximal cost that can be incurred by violations of all lower-ranked constraints ((*M*−1)∑_{j<k}*M*^{j}=*M*^{k}−1<*M*^{k}). This means that *H*_{G}(*ι*,*o*)>*H*_{G}(*ι*,*o*′) if and only if the highest-ranking constraint that has a preference between (*ι*,*o*) and (*ι*,*o*′) prefers the former: optimality under ≫ and under *H*_{G} are equivalent.
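The exponential-weighting construction can be checked exhaustively on small violation profiles. A sketch (constraints listed from lowest- to highest-ranked; *M* strictly exceeds every violation count; all names are mine):

```python
# Ranked (lexicographic) constraint evaluation vs. a weighted-sum Harmonic
# grammar with weights w_k = M**k. Violation vectors list constraints from
# lowest- to highest-ranked; M is strictly greater than any violation count.

def ot_prefers(v1, v2):
    """True iff the highest-ranked constraint differing on v1, v2 favours v1."""
    for a, b in zip(reversed(v1), reversed(v2)):
        if a != b:
            return a < b
    return False

def harmony(v, M):
    """H = -sum_k M**k * (violations of constraint k)."""
    return -sum(c * M ** k for k, c in enumerate(v))

M = 10
candidates = [(0, 3, 1), (2, 0, 1), (1, 1, 0), (0, 0, 2), (9, 9, 0)]
for v1 in candidates:
    for v2 in candidates:
        assert ot_prefers(v1, v2) == (harmony(v1, M) > harmony(v2, M))
```

A single violation of the top constraint (cost 100 here) outweighs even maximal violations (9 each) of both lower ones (99 total), as the construction requires.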

### (b) Harmonic grammars as linear input transformations

The motivation for the following definition is given shortly (theorem 7.12).

### Definition 7.6

With respect to a linear intra-component input transformation W: *S*→*S* on the vector space *S* realizing 𝒮, the *markedness Harmony* of a vector **s**∈*S* is

*H*_{M,W}(**s**) ≡ ½ **s**^{⊤}**W** **s**.

A linear intra-component input transformation **W** realizes a Harmonic grammar *H*_{G} iff, for every *s*∈𝒮, *H*_{M,W}(*Ψ*(*s*))=*H*_{G}(*s*).

### Theorem 7.7

*A second-order Harmonic grammar H*_{G} *can be realized by a linear input transformation W*_{G}*.*

Because a formal language specified by a rewrite-rule grammar can also be specified by a corresponding second-order Harmonic grammar (theorem 7.2), we immediately get the following corollary.

### Corollary 7.8

Given a rewrite-rule grammar 𝒢, there is a linear input transformation **W**_{G} that realizes the Harmonic grammar counterpart *H*_{G} of 𝒢 [19, p. 333].

Moving from rewrite-rule grammars to Optimality-Theoretic grammars realized in Harmonic grammars requires not only markedness Harmony but also faithfulness Harmony, introduced next.

### (c) Harmony optimization as neural computation

Given a Harmonic grammar *H*_{G} realized as a linear intra-component input transformation **W**_{G} on *S* (definition 7.6), and a neural realization of *S* in a network 𝒩, the elements of **W**_{G} w.r.t. the neural basis constitute the internal weight matrix of 𝒩. The utility of this is that 𝒩 can perform Harmony optimization, computing maximum-Harmony representations (theorem 7.12): these should be the optimal states constituting the language L(*H*_{G}). (In order to achieve this, two obstacles will need to be overcome.)

### Definition 7.9

Given a quasi-linear network (example 3.2, definition 3.4) with weight matrix **W**, external input **ι** and unit activation function *f*, the *network Harmony* is the sum of the markedness Harmony *H*_{M,W} (definition 7.6) and the *faithfulness Harmony* *H*_{F}:

*H*(**s**) ≡ *H*_{M,W}(**s**) + *H*_{F}(**s**), *H*_{F}(**s**) ≡ **ι**^{⊤}**s** + ∑_{μ} *H*_{1}(*s*_{μ}),

where the *unit Harmony* *H*_{1} is

*H*_{1}(*a*) ≡ −∫^{a} *f*^{−1}(*a*′) d*a*′.

### Example 7.10

Let the network be linear (definition 3.5), i.e. have units with activation function *f*(*z*)=*z*. Then, the unit Harmony is *H*_{1}(*a*)=−*a*²/2, so that

*H*_{F}(**s**) = **ι**^{⊤}**s** − ½∥**s**∥² = −½∥**s**−**ι**∥² + *c*(**ι**),

where *c*(**ι**) ≡ ½∥**ι**∥² is a constant independent of **s**.
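Completing the square makes the 'faithfulness as distance' reading explicit; a quick numerical check (random vectors; the formula follows the linear-unit case just described):

```python
import numpy as np

# For linear units, H_F(s) = iota.s - ||s||^2/2; completing the square gives
# H_F(s) = -||s - iota||^2/2 + ||iota||^2/2, so maximizing faithfulness
# Harmony is minimizing squared distance to the external input.
rng = np.random.default_rng(1)
iota, s = rng.normal(size=5), rng.normal(size=5)

H_F = iota @ s - 0.5 * (s @ s)
assert np.isclose(H_F, -0.5 * np.sum((s - iota) ** 2) + 0.5 * (iota @ iota))
```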

What is the relation between the neural-level faithfulness Harmony *H*_{F} of the network and the symbolic-level relational faithfulness *H*_{R} that evaluates the dissimilarity of symbol structures (definition 5.4)?

### Theorem 7.11

*Let Ψ*^{S} *be a tensor-product realization in a vector space S of a representational format satisfying the conditions of theorem 5.7. Suppose given, for each* R∈𝓡*, a pair of weights* (*w*^{I}_{R}, *w*^{O}_{R}) *that define the* R*-faithfulness Harmony function H*_{R}*, and hence the total relational faithfulness H*_{𝓡} *(definition 5.4). Let S be realized in a linear network, and let ι*∈ℐ*, s*∈𝒮*; write* **ι**≡*Ψ*^{I}*(ι)∈I,* **s**≡*Ψ*^{S}*(s)∈S. Then*

*H*_{𝓡}(*ι*,*s*) = 2*H*_{F}(**s**) + *c*(*ι*),

*where c(ι) is a constant depending only on ι.*

(For a proof, see the electronic supplementary material.) Thus, the relational Harmony between the symbol structures (*ι*,*s*) can be computed as the network faithfulness Harmony between the vectors (**ι**,**s**) realizing (*ι*,*s*), up to a constant depending on *ι*; this constant has no effect on determining the optimal representation *s* for a given input *ι*. By theorem 7.7, the well-formedness of a symbolic state *s* as assessed by a Harmonic grammar *H*_{G}, *H*_{G}(*s*), can be computed as the network markedness Harmony. The network Harmony *H* combines both these evaluations. The computational utility of *H* derives from theorem 7.12.

### Theorem 7.12

*Given a network with quasi-linear dynamics (example 3.2) and external input* **ι***, suppose the weight matrix is symmetric (∀μ,ν, W*_{μν}*=W*_{νμ}*) and scaled so that the network Harmony H is bounded above.*^{5} *Then, as activation spreads, the Harmony of the network state* **s**, *H*(**s**)*, is non-decreasing, and* **s** *converges to a local maximum of H (a state* **s** *such that H(***s***)≥H(***s***+***ε***) for all small* **ε***) [*35*–*37*].*

We are now close to showing that neural computation can compute the grammatical expressions of a symbolic rewrite-rule grammar or an Optimality-Theoretic grammar: these expressions are the maxima of a symbolic second-order Harmonic grammar *H*_{G} (theorem 7.2 or theorem 7.5), and *H*_{G} can be realized in network weights **W**_{G} (theorem 7.7), and spreading activation can compute the maxima of the resulting network Harmony (theorem 7.12). But two obstacles remain.

The first is that the network computes *local* Harmony maxima, but the Harmonic grammar demands *global* maxima (indeed, the Harmonic grammar recognizes no such notion as ‘local maximum’). The quasi-linear dynamics drives the network state constantly uphill in Harmony, so the network ends up at the peak of whatever Harmony mountain it happens to start on: the peak is higher than all neighbouring points, but need not be the highest peak of the mountain range, which is what the Harmonic grammar requires us to find. To compute the global Harmony maximum, the network needs some probability of moving downhill, providing a chance to pass through valleys in order to arrive at the highest mountain. Global optimization requires some randomness in activation spreading [38–40].

### Theorem 7.13

*The neural network with dynamics defined by the stochastic differential equation*^{6}

d**s** = ∇*H*(**s**) d*t* + √(2*T*) d**B**

*(or the corresponding stochastic difference equation) converges to the distribution p*_{T}(**s**) ∝ *e*^{H(**s**)/T} [*41*,*42*]*. Thus, as T→0, p*_{T}(**s**)*→0 except for globally optimal* **s***.*

In this dynamics, the *computational temperature* *T* is a function of computational time *t*: the initial temperature *T*(0) is high, and *T*(*t*) decreases to 0 during the course of computation. In principle, if this is carried out sufficiently slowly, the network's probability of being in state **s** at time *t* will remain approximately proportional to *e*^{H(**s**)/T(t)} [43]; then at time *t*, the probability of a non-globally optimal state **s**′, with Harmony *H*′, relative to the probability of the globally optimal state **s***, with Harmony *H**>*H*′, is

*p*(**s**′)/*p*(**s***) ≈ *e*^{H′/T(t)}/*e*^{H*/T(t)} = *e*^{−(H*−H′)/T(t)} → 0 as *T*(*t*)→0,

because *H**−*H*′>0. Thus the probability of any non-globally optimal state **s**′ goes to zero as *t*→∞.
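Annealed stochastic Harmony ascent can be sketched in one dimension. In this toy run the Harmony surface, temperature schedule and constants are all illustrative choices; with a sufficiently slow schedule the globally optimal peak dominates, but the check below only asserts that the state ends on a Harmony peak:

```python
import numpy as np

# 1-D annealed Langevin sketch: a Harmony surface with two peaks (near
# s = -1 and s = +1; the +0.3*s term makes the right peak globally optimal).
H  = lambda s: -(s**2 - 1)**2 + 0.3 * s
dH = lambda s: -4 * s * (s**2 - 1) + 0.3

rng = np.random.default_rng(0)
s, dt = -1.0, 0.01                        # start on the non-global peak
for t in range(20000):
    T = 0.5 * 0.9995 ** t                 # temperature anneals toward 0
    s = np.clip(s + dH(s) * dt + np.sqrt(2 * T * dt) * rng.normal(), -3.0, 3.0)

# With T ~ 0 at the end, the final phase is pure gradient ascent, so the
# state ends on (or very near) one of the two Harmony peaks.
assert abs(abs(s) - 1.0) < 0.35
```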

## 8. Discreteness

The stochastic neural network (theorem 7.13) computes, in the limit, a state **s**∈*S* with globally maximal Harmony *H*. The Harmonic grammar requires a *symbolically interpretable* state with globally maximal Harmony: a state realizing a symbol structure *s* that maximizes *H*_{G} over the set of all symbol structures 𝒮. Because *H*_{M} realizes *H*_{G} (definition 7.6), we know that *H*_{M}(**s**)=*H*_{G}(*s*) for every **s**=*Ψ*(*s*) that realizes a symbol structure *s*∈𝒮; these vectors comprise a discrete subset of the continuous vector space *S*. Our second obstacle is that the global maximum of *H* in *S* is generally *not* in this discrete subset: conflicts between the constraints encoded in *H* entail that optima constitute compromises that interpolate between the alternative discrete states favoured by the conflicting constraints. Harmony optimization at the neural level does not by itself yield realizations of symbolic states. To achieve that, in addition to the optimization dynamics, another dynamics is required: *quantization*.

### Theorem 8.1

*There is a deterministic dynamics on S,* d**s**/d*t* = **Q**(**s**(*t*))*, which has an attractor at every vector* **s**=*Ψ*(*s*) *that realizes a symbol structure s*∈𝒮 [*44*, §2.6]*.*
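A quantization dynamics of this kind is easy to illustrate: any vector field with attractors exactly at the 'discrete' points will do. A sketch (the cubic **Q** below is my illustrative choice, not the paper's; it has stable fixed points at 0 and 1 in each coordinate, with an unstable point at ½):

```python
import numpy as np

# Quantization dynamics ds/dt = Q(s): Q(s) = -s*(s-1)*(2s-1) has stable
# fixed points at 0 and 1 (and an unstable one at 1/2) in each coordinate,
# so every "symbolic" 0/1 vector is an attractor. An illustrative choice.

def Q(s):
    return -s * (s - 1) * (2 * s - 1)

s = np.array([0.2, 0.8, 0.55])      # a non-symbolic "blend" state
for _ in range(2000):
    s = s + 0.05 * Q(s)             # forward-Euler integration of ds/dt = Q(s)

assert np.allclose(s, [0.0, 1.0, 1.0], atol=1e-3)
```

Each coordinate is driven to whichever discrete value's basin it starts in; the λ-diffusion dynamics of definition 8.2 blends this attraction with Harmony optimization.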

### Definition 8.2

Given a Harmony function *H*, a *λ*-*diffusion network* is defined by the dynamics

d**s** = [*λ*(*t*) ∇*H*(**s**) + (1−*λ*(*t*)) **Q**(**s**)] d*t* + √(2*T*(*t*)) d**B**,

where *λ* and *T* are functions of computational time *t* such that *λ*(*t*) and *T*(*t*) decrease to 0 as *t*→∞.

The *λ*-diffusion dynamics (definition 8.2) linearly combines the Harmony-optimization dynamics (theorem 7.13), which ignores discreteness, and the quantization dynamics (theorem 8.1), which ignores Harmony [45]. Although formal results have not yet been achieved, in a range of (simple) applications to modelling human language processing, *λ*-diffusion networks have proved capable of reliably computing the vectorial realizations of globally optimal symbolic states, when the speed at which *T*(*t*) and *λ*(*t*) go to zero is sufficiently slow. When too fast, errors are made; the outputs are symbolic states, but not necessarily globally optimal ones. In fact, in a number of cases, the probability of outputting a symbolic state *s* is approximately proportional to *e*^{H(s)/T} (for some *T*): the relative Harmonies of error states *s* govern their relative probabilities. In these applications, the distribution of errors captures the key properties of human performance. For example, because faithfulness Harmony formalizes relational similarity of an error to a correct response (theorem 7.11), the probability of an error type decreases appropriately as the similarity between the error and the correct response decreases; and because markedness and faithfulness Harmony together formalize grammatical knowledge (theorems 7.5 and 7.7), the probability of ungrammatical errors is appropriately small.

## 9. Conclusion

A detailed summary was provided in §2*b*; only a few concluding remarks are given here.

This work seeks not to go beyond classical computability, but rather to explore what restrictions may be placed on cognitive functions assuming that they are computed in accord with a certain conception of neural computation. In this conception, cognition can be characterized by functions over meaningful symbols, but not as computation over such symbols. Vector spaces (of neural network states) are seen to provide a powerful representational medium for the computation of symbolic cognitive functions, even restricting to linear processing. In such vector spaces, a cognitively significant class of recursive functions can be computed in a single massively parallel step. A natural characterization of the functions computed by certain neural networks is through optimization, and as a result neural computation has led to powerful insights into one of the deepest realms of symbolic cognitive science, the theory of universal grammar. Formal languages can be specified in such neural computational terms, as well as natural languages; and while effective computability has not been proved, simulations suggest that these models simultaneously capture central features of both the idealized computational capacity of human linguistic competence and the errorful real-time computation of human linguistic performance.

## Acknowledgements

I warmly thank my collaborators Alan Prince, Géraldine Legendre, Matthew Goldrick, Donald Mathis, Bruce Tesar, John Hale and Yoshiro Miyata in this work; special thanks to Don and Matt for comments on an earlier draft of this paper (I am solely responsible for any errors, of course). For helpful discussion of the work I thank Colin Wilson, James McClelland, Robert Frank, Robert Cummins and most of all the late David E. Rumelhart. I gratefully acknowledge partial financial support for this work from the National Science Foundation, Johns Hopkins University, les Chaires internationales de recherche Blaise Pascal, le Laboratoire de Sciences Cognitives et Psycholinguistique/CNRS and the Robert J. Glushko and Pamela Samuelson Foundation. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation or other sponsors.

## Footnotes

One contribution of 18 to a Theme Issue ‘The foundations of computation, physics and mentality: the Turing legacy’.

↵1 In accepting this principle, we differ from most research programmes on neural network (or ‘connectionist’) theories of cognition, which deny the validity of our symbolic level [8].

↵2 A notational challenge arises because most elements of the theory need three symbols, one for each level. The choice here is to use essentially the same symbol for all three, but to distinguish them by font and/or style (italics, bold, etc.).

↵3 Physicists will recognize this construction of an infinite-dimensional Hilbert space as isomorphic to Fock space, where *R*_{(d)} corresponds to the subspace of *d* particles, and **f**⊗**r** corresponds to the tensor product binding, for example, of the spin of an electron (**f**) to its orbital (**r**) in an atom [18].

↵4 This glosses over details of the full analysis, which requires special treatment of the tree root and the start symbol, as well as terminal nodes, and conversion of 𝒢 to 'Harmonic normal form', which enables a second-order Harmony function (definition 7.1) [24].

↵5 In order that *H* have a maximum, as **s** gets large it is necessary that the *H*_{1} term get small faster than the *H*_{M,W} term gets large (if it does). For the linear case, of interest here, *H*_{1} goes to −∞ quadratically; *H*_{M,W} might go to +∞ quadratically. But if we scale **W** appropriately, by multiplying it by a sufficiently small constant, we can ensure that *H*_{1} dominates, so that *H* has a maximum. Rescaling **W** does not affect the location of the maxima of *H*_{M,W}. We can ensure that the vectors realizing all symbol structures have the same length, so adding *H*_{1} does not affect the relative Harmonies of the symbolic states.

↵6 The derivative ∂*H*/∂*s*_{μ} is simply [**Ws**+**ι**−*f*^{−1}(**s**)]_{μ}, which is just [**Ws**+**ι**−**s**]_{μ} in the linear case.

- This journal is © 2012 The Royal Society