## Abstract

We present an unsupervised learning algorithm (`GenESeSS`) to infer the causal structure of quantized stochastic processes, defined as stochastic dynamical systems evolving over discrete time and producing quantized observations. Assuming ergodicity and stationarity, `GenESeSS` infers probabilistic finite state automata models from a sufficiently long observed trace. Our approach is abductive: we attempt to infer a simple hypothesis, consistent with the observations and a modelling framework that essentially fixes the hypothesis class. The probabilistic automata we infer have no initial or terminal states, are subject to no structural restrictions, and are shown to be probably approximately correct-learnable. Additionally, we establish rigorous performance guarantees and data requirements, and show that `GenESeSS` correctly infers long-range dependencies. Modelling and prediction examples on simulated and real data establish relevance to automated inference of causal stochastic structures underlying complex physical phenomena.

## 1. Introduction and motivation

Automated model inference is a topic of wide interest, and reported techniques range from query-based concept learning in artificial intelligence [1,2], to classical system identification [3], to more recent symbolic regression based reverse-engineering of nonlinear dynamics [4,5]. In this paper, we wish to infer the dynamical structure of quantized stochastic processes (QSPs): stochastic dynamical systems evolving over discrete time, and producing quantized time series (figure 1). Such systems naturally arise via quantization of successive increments of discrete time continuous-valued stochastic processes. A simple example is the standard random walk on the line (figure 2). Constant positive and negative increments are represented symbolically as *σ*_{R} and *σ*_{L}, respectively, leading to a stochastic dynamical system that generates strings over the alphabet {*σ*_{L},*σ*_{R}}, with each new symbol being generated independently with a probability of 0.5. The dynamical characteristics of this process are captured by the probabilistic automaton shown in figure 2*b*, consisting of a single state *q*_{1}, and self-loops labelled with symbols from *Σ*, which represent probabilistic transitions. We wish to infer such causal structures, formally known as probabilistic finite state automata (PFSA), from a sufficiently long sequence of observed symbols, with no *a priori* knowledge of the number, connectivity and transition rules of the hidden states.
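A minimal sketch (in Python; the dictionary encoding and function names are ours, not from the paper) of sampling a symbol stream from the single-state machine of figure 2*b*:

```python
import random

# A PFSA encoded as state -> {symbol: (next_state, probability)}.
# The single-state machine of figure 2b emits sigma_L / sigma_R,
# written 'L' / 'R', each with probability 0.5.
PFSA = {"q1": {"L": ("q1", 0.5), "R": ("q1", 0.5)}}

def generate(pfsa, state, n):
    """Generate n symbols by sampling the current state's outgoing
    symbol distribution and following the chosen arc."""
    out = []
    for _ in range(n):
        symbols = list(pfsa[state])
        weights = [pfsa[state][sym][1] for sym in symbols]
        sym = random.choices(symbols, weights=weights)[0]
        state = pfsa[state][sym][0]
        out.append(sym)
    return "".join(out)

print(generate(PFSA, "q1", 20))  # e.g. 'RLLRRLRLRRLLRLRRLLRR'
```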

### (a) Relevance to signal processing and inference in physical systems

We use symbolic representation of data, and our inferred models are abstract logical machines (PFSA). We quantize relative changes between our successive physical observations to map the gradient of continuous time series to symbolic sequences (with each symbol representing a specific increment range). Additionally, we want our inferred models to be generative, i.e. predict the distribution of the future symbols from knowledge of the recent past. What do we gain from such a symbolic approach, particularly since we incur some information loss in quantization? The main advantages are unsupervised inference, computational efficiency, deeper insight into the hidden causal structures, ability to deduce rigorous performance guarantees and the ability to better predict stochastic evolution dictated by many hidden variables interacting in a structured fashion. Systems with such stochastic dynamics and complex causal structures are of emerging interest, e.g. flow of wealth in the stock market, geophysical phenomena, ecology of evolving ecosystems, genetic regulatory circuits and even social interaction dynamics.
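As an illustration of the quantization step, the sketch below maps successive increments of a continuous series to symbols; the bin edges and the four-symbol alphabet are assumptions for illustration, not the scheme used for any particular dataset in this paper.

```python
import numpy as np

def symbolize(series, bin_edges, alphabet="0123"):
    """Map successive increments of a continuous series to symbols;
    each symbol stands for one increment range (bin)."""
    increments = np.diff(series)
    idx = np.digitize(increments, bin_edges)  # bin index per increment
    return "".join(alphabet[i] for i in idx)

# toy series; three edges give four increment ranges, hence four symbols
series = np.cumsum(np.random.randn(100))
print(symbolize(series, bin_edges=[-1.0, 0.0, 1.0]))
```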

Our scheme is illustrated in figure 1, where we model global annual temperature changes from 1880 to 1986 [6]. `GenESeSS` infers a 4-state PFSA; each state specifies the probabilities of a positive (symbol 1) and negative (symbol 0) rate of change in the coming year. Here, we use a binary alphabet; finer quantizations are illustrated in §4*b*. Global temperature changes result from the complex interaction of many variables, and a first-principles model is not deducible from these data alone; but we can infer causal states in the data as equivalence classes of history fragments yielding statistically similar futures.

We use the following intuition: if the recent pattern of changes was encountered in the past, at least approximately, and then often led to a positive temperature change in the immediate future, then we expect to see a positive change in the coming year as well. Causal states formalize the different classes of patterns that need to be considered; which patterns are equivalent is inferred from the data.

This allows us to make quantitative predictions. For example, state *q*_{0} is the equivalence class of histories whose last two consecutive years show an increasing temperature change (two 1s), and our inferred model indicates that this induces a 42 per cent probability of a positive rate of change next year (that such a pattern leads uniquely to a causal state is inferred from data, not manually imposed). Additionally, the maximum deviation between the inferred and the true hidden model is limited by a user-selectable bound. For comparison, we also learn a standard auto-regressive moving-average (ARMAX) model [7], which predicts the mean quite well. However, we capture stochastic changes better, and gain deeper insight into the causal structure in terms of the number of equivalence classes of histories, their inter-relationships and the model memory.
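The prediction mechanism can be sketched as follows: we walk the transition function along the observed history and read off the next-symbol distribution of the state reached. Apart from the 42 per cent figure quoted above, the transition structure and probabilities below are placeholders (a hypothetical machine that tracks the last two symbols), since figure 1 is not reproduced here.

```python
# Hypothetical 4-state binary-alphabet PFSA tracking the last two
# symbols: q0 ~ '11', q2 ~ '01', q1 ~ '10', q3 ~ '00'.
delta = {("q0", "1"): "q0", ("q0", "0"): "q1",
         ("q2", "1"): "q0", ("q2", "0"): "q1",
         ("q1", "1"): "q2", ("q1", "0"): "q3",
         ("q3", "1"): "q2", ("q3", "0"): "q3"}
pi = {"q0": {"1": 0.42, "0": 0.58},   # 42% chance of a positive change (figure 1)
      "q1": {"1": 0.50, "0": 0.50},   # placeholder probabilities
      "q2": {"1": 0.55, "0": 0.45},
      "q3": {"1": 0.48, "0": 0.52}}

def predict(history, state="q0"):
    """Walk delta along the observed history; after two symbols the
    start state is irrelevant for this order-2 machine."""
    for sym in history:
        state = delta[(state, sym)]
    return pi[state]

print(predict("0011"))  # next-year distribution: {'1': 0.42, '0': 0.58}
```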

### (b) Abduction versus induction

Our modelling framework assumes that, at any given instant, a QSP is in a *unique hidden causal state*, which has a well-defined probability of generating the next quantized observation. This fixes the *rule* (i.e. the hypothesis class) by which sequential observations are generated, and we seek the correct hypothesis, i.e. the automaton that explains the observed trace. Inferring the hypothesis, given the observations and the generating rule, is *abduction* (as opposed to induction, which infers the generating rule from knowledge of the hypothesis and the observation set) [8].

### (c) Contribution and contrast with reported literature

Learning from symbolic data is essentially grammatical inference: learning an unknown formal language from a presentation of a finite set of example sentences [9,10,11]. In our case, every subsequence of the observed trace is an example sentence [12]. Inference of probabilistic automata is well studied in the context of pattern recognition; however, learning dynamical systems with PFSA has received less attention, and we point out the following issues.

— *Initial and terminal conditions.* Dynamical systems evolve over time, and are often termination-free. Thus, inferred machines should lack terminal states. Reported algorithms [13,14,15] often learn probabilistic automata with initial and final states, and with special termination symbols.

— *Structural restrictions.* Reported techniques make restrictive structural assumptions, and pre-specify upper bounds on inferred model size. For example, [16,13,15] infer only probabilistic suffix automata, those in [12,17] require synchronizability, i.e. the property by which a bounded history is sufficient to uniquely determine the current state, and [13] restricts models to be acyclic and aperiodic.

— *Restriction to short memory processes.* Often memory bounds are pre-specified as bounds on the order of the underlying Markov chain [16], or synchronizability [12]; thus only learning processes with *short-range dependencies*. A time series possesses *long-range dependencies* (LRDs) if it has correlations persisting over all time scales [18]. In such processes, the auto-correlation function follows a power law, as opposed to an exponential decay. Such processes are of emerging interest, e.g. internet traffic [19], financial systems [20,21] and biological systems [22,23]. We can learn LRD processes (figure 3).

— *Inconsistent definition of probability space.* Reported approaches often attempt to define a probability distribution on *Σ*^{⋆}, i.e. the set of all finite but unbounded words over an alphabet *Σ*, and then additionally *define* the probability of the null-word *λ* to be 1. This is inconsistent, since the latter condition would demand that the probability of every other finite string in *Σ*^{⋆} be 0. Some authors [14] use *λ* for string termination, which is in line with the grammatical picture where the empty word is used to erase a non-terminal [26]. However, this is inconsistent with a dynamical systems model. We address this via rigorous construction of a *σ*-algebra on strictly infinite strings.

Our algorithmic procedure (see §3*a*) has some commonalities with [27,28]. However, Gavaldà *et al.* use initial states (superfluous for stationary ergodic processes) and special termination symbols, and require a pre-specified bound on the model size. Additionally, key technical differences (discussed in §3*a*) make our models significantly more compact.

To summarize our contribution: (i) we formalize PFSAs in the context of QSPs, (ii) remove the inconsistency in the definition of the probability of the null-word via a *σ*-algebra on strictly infinite strings, and (iii) show that PFSAs arise naturally via an equivalence relation on infinite strings. Also, (iv) we characterize the class of QSPs with finite probabilistic generators and establish probably approximately correct (PAC)-learnability [29], and (v) show that algorithm `GenESeSS` learns PFSA models with no *a priori* restrictions on structure, size or memory.

The rest of the paper is organized as follows: §2 develops new theory, which is used in §3 to design `GenESeSS` and establish complexity results within the PAC criterion. Section 4 validates our development using synthetic and experimental data. The paper is summarized and concluded in §5.

## 2. Formalization of probabilistic generators for stochastic dynamical systems

Throughout this paper, *Σ* denotes a finite alphabet of symbols. The set of all finite but possibly unbounded strings on *Σ* is denoted by *Σ*^{⋆}, where ⋆ is the Kleene star operation [26]. The set of finite strings over *Σ* forms a concatenative monoid, with the empty word *λ* as identity. Concatenation of two strings *x*,*y*∈*Σ*^{⋆} is written as *xy*; thus, *xy*=*xλy*=*xyλ*=*λxy*. The set of strictly infinite strings on *Σ* is denoted as *Σ*^{ω}, where *ω* denotes the first transfinite cardinal. For a string *x*∈*Σ*^{⋆}, |*x*| denotes the length of *x*, and for a set *A*, |*A*| denotes its cardinality.

### (a) Quantized stochastic processes

### Definition 2.1 (QSP)

A QSP is a discrete time *Σ*-valued strictly stationary, ergodic stochastic process. A stochastic process is ergodic if moments can be calculated from a single, sufficiently long realization, and strictly stationary if the moments are not functions of time.

We next formalize the connection of QSPs to PFSA generators.

### Definition 2.2 (σ-algebra on infinite strings)

For the set of infinite strings on *Σ*, we define 𝔅 to be the smallest *σ*-algebra generated by {*xΣ*^{ω}:*x*∈*Σ*^{⋆}}.

### Lemma 2.3

Any QSP induces a probability space (*Σ*^{ω},𝔅,*μ*).

### Proof.

Using stationarity of the QSP, we can construct a probability measure *μ* by defining, for any sequence *x*∈*Σ*^{⋆}∖{*λ*} and a sufficiently large number of realizations *N*_{R} of the process with fixed initial conditions,

*μ*(*xΣ*^{ω}) ≜ lim_{N_R→∞} (number of realizations with initial segment *x*)/*N*_{R},

and extending the measure to elements of 𝔅 via at most countable sums. Note that Σ_{σ∈Σ}*μ*(*xσΣ*^{ω})=*μ*(*xΣ*^{ω}), and for the null word *μ*(*λΣ*^{ω})=*μ*(*Σ*^{ω})=1.

For notational brevity, we denote *μ*(*xΣ*^{ω}) as Pr(*x*).

Classically, states are induced via the Nerode equivalence, which defines two strings to be equivalent if and only if any finite extension of the strings is either both accepted or both rejected [26] by the language under consideration. We use a probabilistic extension [30].

### Definition 2.4 (probabilistic Nerode relation)

The measure *μ* induces an equivalence relation ∼_{N} on the set of finite strings *Σ*^{⋆} as
2.1 *x*∼_{N}*y* ⟺ ∀*z*∈*Σ*^{⋆}, Pr(*xz*)/Pr(*x*)=Pr(*yz*)/Pr(*y*), with the convention that the ratio is taken to be 0 if the denominator vanishes.
For *x*∈*Σ*^{⋆}, the equivalence class of *x* is denoted as [*x*].

It follows that ∼_{N} is right invariant, i.e.
2.2 *x*∼_{N}*y* ⟹ ∀*z*∈*Σ*^{⋆}, *xz*∼_{N}*yz*.
A right-invariant equivalence on *Σ*^{⋆} always induces an automaton structure.

### Definition 2.5 (initial-marked PFSA)

An initial-marked PFSA is a 5-tuple (*Q*,*Σ*,*δ*,π̃,*q*_{0}), where *Q* is a finite state set, *Σ* is the alphabet, *δ*:*Q*×*Σ*→*Q* is the state transition function, and π̃:*Q*×*Σ*→[0,1] specifies the conditional symbol-generation probabilities, with Σ_{σ∈Σ}π̃(*q*,*σ*)=1 for each *q*∈*Q*. *δ* and π̃ are recursively extended to arbitrary *y*=*σx*∈*Σ*^{⋆} as *δ*(*q*,*σx*)=*δ*(*δ*(*q*,*σ*),*x*) and π̃(*q*,*σx*)=π̃(*q*,*σ*)π̃(*δ*(*q*,*σ*),*x*). *q*_{0}∈*Q* is the initial state. If the next symbol is specified, the resultant state is fixed, as in probabilistic deterministic automata [27]; unlike the latter, however, we lack final states. Additionally, we assume our graphs to be strongly connected.

Definition 2.5 has no notion of a final state, and later we will remove initial state dependence using ergodicity. First, we formalize how the notion of a PFSA arises from a QSP.

### Lemma 2.6 (from QSP to a PFSA)

If the probabilistic Nerode relation has a finite index, then there exists an initial-marked PFSA generator.

### Proof.

Every QSP represented as a probability space (*Σ*^{ω},𝔅,*μ*) induces a probabilistic automaton (*Q*,*Σ*,*δ*,π̃,*q*_{0}), where *Q* is the set of equivalence classes of the probabilistic Nerode relation (definition 2.4), *Σ* is the alphabet, and
2.3 *δ*([*x*],*σ*)=[*xσ*]
and
2.4 π̃([*x*],*σ*)=Pr(*xσ*)/Pr(*x*).
*q*_{0} is identified with [*λ*], and the finite index of ∼_{N} implies |*Q*|<∞.

The above construction yields a *minimal realization* unique up to state renaming.

### Corollary 2.7 (to lemma 2.6: null-word probability)

For the PFSA (*Q*,*Σ*,*δ*,π̃,*q*_{0}) induced from a QSP,
2.5 ∀*q*∈*Q*, π̃(*q*,*λ*)=1.

### Proof.

For *q*∈*Q*, let *x*∈*Σ*^{⋆} be such that [*x*]=*q*. From equation (2.4), we have
2.6 π̃(*q*,*λ*)=Pr(*xλ*)/Pr(*x*)=Pr(*x*)/Pr(*x*)=1.

Next, we define canonical representations to remove the initial-state dependence. We use Π̃ to denote the matrix representation of π̃, i.e. Π̃_{ij}=π̃(*q*_{i},*σ*_{j}), *q*_{i}∈*Q*, *σ*_{j}∈*Σ*. We also need the notion of transformation matrices *Γ*_{σ}.

### Definition 2.8 (transformation matrices)

For an initial-marked PFSA (*Q*,*Σ*,*δ*,π̃,*q*_{0}), the symbol-specific transformation matrices *Γ*_{σ}∈{0,1}^{|Q|×|Q|} are given by
2.7 (*Γ*_{σ})_{ij}=1 if *δ*(*q*_{i},*σ*)=*q*_{j}, and 0 otherwise.
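A minimal sketch of this construction, for an assumed toy transition function *δ* given as a lookup table:

```python
import numpy as np

def gamma(delta, n_states, sigma):
    """Equation (2.7): entry (i, j) of Gamma_sigma is 1 exactly when
    delta(q_i, sigma) = q_j, and 0 otherwise."""
    G = np.zeros((n_states, n_states))
    for i in range(n_states):
        G[i, delta[(i, sigma)]] = 1.0
    return G

# assumed toy machine: two states, binary alphabet
delta = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 0, (1, "1"): 1}
print(gamma(delta, 2, "1"))   # [[0. 1.] [0. 1.]]
```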

States in the canonical representation (denoted as ℘_{x}) are identified with probability distributions over states of the initial-marked PFSA. Here, *x* denotes the string in *Σ*^{⋆} realizing this distribution, beginning from the stationary distribution on the states of the initial-marked representation. ℘_{x} is an equivalence class, and hence *x* is not unique.

### Definition 2.9 (canonical representations)

An initial-marked PFSA uniquely induces a canonical representation (*Q*^{C},*Σ*,*δ*^{C},π̃^{C}), with *Q*^{C} being a set of probability distributions over *Q*, and with *δ*^{C}, π̃^{C} defined as follows.

— Construct the stationary distribution ℘_{λ} on *Q* using the transition probabilities of the Markov chain induced by *G*, and include this as the first element of *Q*^{C}. The transition matrix for this induced chain is the row-stochastic matrix *M*∈[0,1]^{|Q|×|Q|}, with *M*_{ij}=Σ_{σ:δ(q_i,σ)=q_j}π̃(*q*_{i},*σ*).

— Define *δ*^{C} and π̃^{C} recursively:
2.8 *δ*^{C}(℘_{x},*σ*)=℘_{xσ}=℘_{x}*Γ*_{σ}/‖℘_{x}*Γ*_{σ}‖_{1}
and
2.9 π̃^{C}(℘_{x},*σ*)=Σ_{i}(℘_{x})_{i}π̃(*q*_{i},*σ*).

For a QSP, we refer to the induced canonical representation simply as its canonical PFSA.
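The construction of definition 2.9 can be sketched as follows; the two-state machine (the Π̃ entries and *Γ* matrices) is an assumed toy example. The stationary distribution ℘_{λ} is obtained by power iteration, and equation (2.8) is a normalized vector-matrix product.

```python
import numpy as np

# Assumed toy machine: Pi[i, j] = prob. of symbol j from state i.
Pi = np.array([[0.8, 0.2],
               [0.3, 0.7]])
Gamma = {"0": np.array([[1., 0.], [1., 0.]]),
         "1": np.array([[0., 1.], [0., 1.]])}

# Row-stochastic chain M_ij = sum of Pi[i, j'] over symbols taking q_i to q_j
M = sum(Pi[:, j:j+1] * Gamma[sym] for j, sym in enumerate(Gamma))

# stationary distribution (the canonical state for the empty word lambda)
p_lambda = np.ones(len(M)) / len(M)
for _ in range(1000):
    p_lambda = p_lambda @ M

def step(p_x, sym):
    """Equation (2.8): the canonical state for x.sym is the normalized
    product of the state for x with the transformation matrix."""
    v = p_x @ Gamma[sym]
    return v / v.sum()

print(p_lambda, step(p_lambda, "1"))
```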

Ergodicity of QSPs, which makes ℘_{λ} independent of the initial state in the initial-marked PFSA, implies that the canonical representation is initial-state independent (figure 4), and subsumes the initial-marked representation in the following sense: let 𝒬 denote the set of distributions satisfying
2.10 ℘∈𝒬 ⟺ ∃*q*_{i}∈*Q* such that ℘_{i}=1.
Then we note the following.

— If we execute the canonical construction with an initial distribution from 𝒬, then we obtain the initial-marked PFSA (with the initial marking missing).

— If, during the construction, we encounter ℘_{x}∈𝒬 for some *x*, then we stay within the graph of the initial-marked PFSA for all right extensions of *x*. This eliminates the need to know the initial state explicitly (figure 5), provided we can find a string *x* that takes us within, or close to, 𝒬.

Consequently, we denote the initial-marked PFSA induced by a QSP, with the initial marking removed, simply as a ‘PFSA’ (dropping the qualifier ‘initial-marked’). States of this PFSA are representable as states of the canonical representation, namely as elements of 𝒬. Next, we establish a key result: in the canonical construction starting from the stationary distribution ℘_{λ}, we always encounter a state arbitrarily close to some element of 𝒬.

### Theorem 2.10 (ϵ-synchronization of probabilistic automata)

*For any QSP over Σ, the PFSA satisfies*
2.11 ∀*ϵ*>0 ∃*x*∈*Σ*^{⋆} such that min_{℘∈𝒬}‖℘_{x}−℘‖≤*ϵ*,
*where the norm used is unimportant.*

### Proof.

See appendix A.

Theorem 2.10 induces the notion of *ϵ*-synchronizing strings, and guarantees their existence for arbitrary PFSA.

### Definition 2.11 (ϵ-synchronizing strings)

An *ϵ*-synchronizing string *x*∈*Σ*^{⋆} for a PFSA is one that satisfies
2.12 min_{℘∈𝒬}‖℘_{x}−℘‖≤*ϵ*.
The norm used is unimportant.

Theorem 2.10 does not yield an algorithm for computing synchronizing strings (that is provided by theorem 2.18); it simply shows that one always exists. As a corollary, we estimate an asymptotic upper bound on the effort required to find one.

### Corollary 2.12 (to theorem 2.10)

*At most O(1/ϵ) strings need to be analysed to find an ϵ-synchronizing string.*

### Proof.

See appendix A.

Next, we describe the basic principle of our inference algorithm. PFSA states are not observable; we observe symbols generated from hidden states. This leads us to the notion of *symbolic derivatives*, which are computable from observations.

We denote the set of probability distributions over a set of cardinality *k* as 𝒫_{k}. First, we specify a count function.

### Definition 2.13 (symbolic count function)

For a string *s* over *Σ*, the count function #^{s}:*Σ*^{⋆}→ℕ∪{0} counts the number of times a particular substring occurs in *s*. The count is overlapping, i.e. in the string *s*=0001, the substring 00 occurs at the first and second positions (the two occurrences overlap), implying #^{s}00=2.

### Definition 2.14 (symbolic derivative)

For a string *s* generated by a QSP over *Σ*, the symbolic derivative is a function *ϕ*^{s}:*Σ*^{⋆}→𝒫_{|Σ|} defined as
2.13 *ϕ*^{s}(*x*)(*σ*) ≜ #^{s}*xσ*/Σ_{σ′∈Σ}#^{s}*xσ*′.
Thus, ∀*x*∈*Σ*^{⋆}, *ϕ*^{s}(*x*) is a probability distribution over *Σ*, referred to as the symbolic derivative at *x*.
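Both the count function and the symbolic derivative are directly computable from an observed trace; a minimal sketch (function names are ours):

```python
def count(s, x):
    """Overlapping count #^s(x): occurrences of substring x in s,
    overlaps included, so count('0001', '00') == 2."""
    return sum(1 for i in range(len(s) - len(x) + 1) if s[i:i+len(x)] == x)

def symbolic_derivative(s, x, alphabet="01"):
    """Equation (2.13): the empirical next-symbol distribution
    observed immediately after occurrences of x in s."""
    counts = {a: count(s, x + a) for a in alphabet}
    total = sum(counts.values())
    if total == 0:
        return None          # x (extended by one symbol) never occurs in s
    return {a: c / total for a, c in counts.items()}

s = "0001000100110001" * 10
print(count(s, "00"), symbolic_derivative(s, "00"))
```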

For *q*_{i}∈*Q*, π̃ induces a distribution over *Σ* as the *i*th row of Π̃, namely (π̃(*q*_{i},*σ*_{1}),…,π̃(*q*_{i},*σ*_{|Σ|})); we denote this as π̃(*q*_{i},·). We show that the symbolic derivative at *x* can be used to estimate this distribution for *q*_{i}=[*x*], provided *x* is *ϵ*-synchronizing.

### Theorem 2.15 (ϵ-convergence)

*If x∈Σ*^{⋆} *is ϵ-synchronizing, then*
2.14 ‖*ϕ*^{s}(*x*)−π̃([*x*],·)‖ ≤ *ϵ* almost surely as |*s*|→∞.

### Proof.

The proof follows from the Glivenko–Cantelli theorem [31] on uniform convergence of empirical distributions. See appendix A for details.

Next, we describe the identification of *ϵ*-synchronizing strings from a sufficiently long observed string *s*. Theorem 2.10 guarantees existence, and corollary 2.12 establishes that *O*(1/*ϵ*) substrings need to be analysed before we encounter an *ϵ*-synchronizing string. These results do not by themselves provide an executable algorithm; one arises from inspecting the geometric structure of the set of probability vectors over *Σ* obtained by computing *ϕ*^{s}(*x*) for different choices of the candidate string *x*.

### Definition 2.16 (derivative heap)

Given a string *s* generated by a QSP, a derivative heap is the set of symbolic derivatives calculated for a given subset of strings *L*⊂*Σ*^{⋆} as follows:
2.15 𝒟^{s}(*L*) ≜ {*ϕ*^{s}(*x*):*x*∈*L*}.
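A sketch of the heap construction, reusing `symbolic_derivative` from above and taking *L* to be all strings up to a chosen depth:

```python
from itertools import product

def derivative_heap(s, depth, alphabet="01"):
    """Equation (2.15) with L = all strings of length 1..depth:
    map each string x to its symbolic derivative phi^s(x)."""
    heap = {}
    for d in range(1, depth + 1):
        for tup in product(alphabet, repeat=d):
            x = "".join(tup)
            phi = symbolic_derivative(s, x, alphabet)
            if phi is not None:
                heap[x] = phi
    return heap
```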

### Lemma 2.17 (limiting geometry)

Let 𝒟^{s}(*Σ*^{⋆}) be the derivative heap computed over all of *Σ*^{⋆}, and let conv(𝒟^{s}(*Σ*^{⋆})) be its convex hull. If *u* is a vertex of conv(𝒟^{s}(*Σ*^{⋆})), then
2.16 ∃*q*∈*Q* such that lim_{|s|→∞}*u*=π̃(*q*,·) almost surely.

### Proof.

Recalling theorem 2.15, the result follows from noting that any element of 𝒟^{s}(*Σ*^{⋆}) is, in the limit, a convex combination of the distributions in the set {π̃(*q*,·):*q*∈*Q*}.

Lemma 2.17 does not claim that the number of vertices of the convex hull of 𝒟^{s}(*Σ*^{⋆}) equals the number of states, but only that every vertex corresponds to a state. We cannot generate 𝒟^{s}(*Σ*^{⋆}) in practice, since we have a finite observed string *s* and can only calculate *ϕ*^{s}(*x*) for a finite number of *x*. Instead, we show that choosing a string corresponding to a vertex of the convex hull of a heap constructed from *O*(1/*ϵ*) strings yields an *ϵ*-synchronizing string with high probability.
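One simple way to locate hull vertices in practice is to use the fact that every vertex maximizes some linear functional: sampling random directions and keeping the maximizers returns vertices of the heap, and the strings mapping to them are candidates for *x*_{0} (cf. theorem 2.18 below). The sketch below takes this route purely for illustration; any standard convex hull algorithm may be substituted.

```python
import numpy as np

def hull_vertex_strings(heap, alphabet="01", trials=200, seed=0):
    """Return strings whose symbolic derivatives are vertices of the
    convex hull of the heap: each random direction u picks out the
    point maximizing <u, phi>, which is always a hull vertex."""
    rng = np.random.default_rng(seed)
    strings = list(heap)
    pts = np.array([[heap[x][a] for a in alphabet] for x in strings])
    vertices = set()
    for _ in range(trials):
        u = rng.normal(size=pts.shape[1])
        vertices.add(strings[int(np.argmax(pts @ u))])
    return vertices
```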

### Theorem 2.18 (derivative heap approximation)

*For s generated by a QSP, let* 𝒟^{s}(*L*) *be computed with L=Σ*^{O(log(1/ϵ))}*. If, for some x*_{0}*∈Σ*^{O(log(1/ϵ))}*, ϕ*^{s}*(x*_{0}*) is a vertex of the convex hull of* 𝒟^{s}(*L*)*, then*
2.17 Pr(*x*_{0} is *ϵ*-synchronizing) ≥ 1−e^{−O(|s|)}.

### Proof.

The result follows from Sanov’s theorem [32] for convex sets of probability distributions. See appendix A for details.

Figure 6 illustrates PFSAs, derivative heaps, and *ϵ*-synchronizing strings. Next, we present our inference algorithm.

## 3. Algorithm `GenESeSS`

### (a) Implementation steps

We call our algorithm ‘Generator Extraction Using Self-similar Semantics’, or `GenESeSS`, which, for an observed sequence *s*, consists of three steps.

— *Identification of an ϵ-synchronizing string x*_{0}*.* Construct a derivative heap from the observed trace *s* (definition 2.16), with the set *L* consisting of all strings up to a sufficiently large, but finite, depth; corollary 2.12 suggests depth *O*(log(1/*ϵ*)) as an initial choice. If *L* is sufficiently large, then the inferred model structure will not change for larger values. We then identify a vertex of the convex hull of 𝒟^{s}(*L*), via any standard algorithm for computing the hull [33], and choose *x*_{0} as the string mapping to this vertex.

— *Identification of the structure of the model, i.e. the transition function δ.* We generate *δ* as follows. For each state *q*, we associate a string identifier *x*^{id}_{q}∈*x*_{0}*Σ*^{⋆}, and a probability distribution *h*_{q} on *Σ* (an approximation of the row of Π̃ corresponding to state *q*). We extend the structure recursively.

— Initialize the set *Q* as *Q*={*q*_{0}}, and set *x*^{id}_{q0}=*x*_{0}, *h*_{q0}=*ϕ*^{s}(*x*_{0}).

— For each state *q*∈*Q* and each symbol *σ*∈*Σ*, find the symbolic derivative *ϕ*^{s}(*x*^{id}_{q}*σ*). If, for some *q*′∈*Q*, ‖*ϕ*^{s}(*x*^{id}_{q}*σ*)−*h*_{q′}‖≤*ϵ*, then define *δ*(*q*,*σ*)=*q*′. If, on the other hand, no such *q*′ can be found in *Q*, then add a new state *q*′ to *Q*, and define *x*^{id}_{q′}=*x*^{id}_{q}*σ*, *h*_{q′}=*ϕ*^{s}(*x*^{id}_{q}*σ*).

— The process terminates when every *q*∈*Q* has a target state for each *σ*∈*Σ*. Then, if necessary, we ensure strong connectivity using [34].

— *Identification of arc probabilities, i.e. the function* π̃*.*

— Choose an arbitrary initial state *q*∈*Q*.

— Run the sequence *s* through the identified graph, as directed by *δ*, i.e. if the current state is *q* and the next symbol read from *s* is *σ*, then move to *δ*(*q*,*σ*). Count arc traversals, i.e. generate numbers *N*(*q*,*σ*) recording how often the arc labelled *σ* is traversed from state *q*.

— Generate π̃ by row normalization, i.e. π̃(*q*,*σ*)=*N*(*q*,*σ*)/Σ_{σ′∈Σ}*N*(*q*,*σ*′). A condensed sketch of these steps is given below.

Gavaldà *et al*. [27] and Castro & Gavaldà [28] use a similar recursive structure extension. However, with no notion of *ϵ*-synchronization, they are restricted to inferring only synchronizable or short-memory models, or to large approximations for long-memory ones (figure 7).

### (b) Complexity analysis and probably approximately correct learnability

While *h*_{q} (in step 2) approximates the rows of Π̃, we find the arc probabilities via normalization of traversal counts. *h*_{q} uses only sequences in *x*_{0}*Σ*^{⋆}, whereas traversal counting uses the entire sequence *s*, and is therefore more accurate.

`GenESeSS` places no upper bound on the number of states; the inferred number is a function of the complexity of the process itself.

### Theorem 3.1 (time complexity)

*Asymptotic time complexity of* `GenESeSS` *is*
3.1 *O*((1/*ϵ*+|*Q*|)×|*s*|×|*Σ*|).

### Proof.

See appendix A for details.

Theorem 3.1 shows that `GenESeSS` is polynomial in 1/*ϵ*, the size of the input *s*, the model size |*Q*| and the alphabet size |*Σ*|. In practice, |*Q*|≪1/*ϵ*, implying that the runtime is
3.2 *O*(1/*ϵ*×|*s*|×|*Σ*|).

An identification method is said to identify a target language *L*_{⋆} in the PAC sense [29,35,36] if it always halts and outputs *L* such that
3.3 Pr(*d*(*L*,*L*_{⋆})≤*ϵ*) ≥ 1−*δ*,
where *d*(⋅,⋅) is a metric on the space of target languages. A class of languages is efficiently PAC-learnable if there exists an algorithm that PAC-identifies every language in the class, and runs in time polynomial in 1/*ϵ*, 1/*δ*, the length of the sample input and the inferred model size. We prove PAC-learnability of QSPs by first establishing a metric on the space of probabilistic automata over *Σ*.

### (c) Probably approximately correct identifiability of quantized stochastic processes

### Lemma 3.2 (metric for probabilistic automata)

*For two strongly connected PFSAs G*_{1} *and G*_{2}*, denote the symbolic derivatives at x∈Σ*^{⋆}*, computed from their respective (sufficiently long) traces s*_{1} *and s*_{2}*, as ϕ*^{s_1}(*x*) *and ϕ*^{s_2}(*x*)*, respectively. Then*
*d*(*G*_{1},*G*_{2}) ≜ sup_{x∈Σ^{⋆}} lim_{|s_1|,|s_2|→∞}‖*ϕ*^{s_1}(*x*)−*ϕ*^{s_2}(*x*)‖_{∞}
*defines a metric on the space of probabilistic automata on Σ.*

### Proof.

Non-negativity and symmetry follow immediately. The triangular inequality follows from noting that ‖*ϕ*^{s_1}(*x*)−*ϕ*^{s_2}(*x*)‖_{∞} is upper bounded by 1, and that, for any chosen order of the strings in *Σ*^{⋆}, we obtain sequences that satisfy the triangular inequality under the sup norm. The metric is well defined, since for any sufficiently long *s*_{1} and *s*_{2}, the symbolic derivatives at an arbitrary *x* converge uniformly to some convex combination of the rows of the corresponding Π̃ matrices.
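A depth-truncated version of this metric is straightforward to estimate from finite traces; the sketch below (reusing `symbolic_derivative` from §2) replaces the supremum over *Σ*^{⋆} by a maximum over strings up to a fixed depth, and is only an approximation.

```python
from itertools import product

def pfsa_distance(s1, s2, depth=4, alphabet="01"):
    """Depth-truncated estimate of the lemma 3.2 metric: the largest
    discrepancy between the symbolic derivatives of the two traces."""
    d = 0.0
    for k in range(depth + 1):
        for tup in product(alphabet, repeat=k):
            x = "".join(tup)
            p = symbolic_derivative(s1, x, alphabet)
            q = symbolic_derivative(s2, x, alphabet)
            if p is not None and q is not None:
                d = max(d, max(abs(p[a] - q[a]) for a in alphabet))
    return d
```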

### Theorem 3.3 (PAC-learnability)

*QSPs for which the probabilistic Nerode equivalence has a finite index are PAC-learnable using PFSAs, i.e. for ϵ,η>0 and for every sufficiently long sequence s generated by a QSP with generator G, we can compute an estimate Ĝ such that*
3.4 Pr(*d*(*G*,Ĝ)≥*ϵ*) ≤ *η*.
*The algorithm runs in time polynomial in 1/ϵ,1/η, input length |s| and model size.*

### Proof.

The `GenESeSS` construction implies that, once the initial *ϵ*-synchronizing string *x*_{0} is identified, the model error cannot exceed *ϵ*. Hence, the probability of failing the bound in equation (3.4) is at most the probability that *x*_{0} fails to be *ϵ*-synchronizing, which, by theorem 2.18, decays exponentially with |*s*|. Thus, for any *η*>0, a sufficiently long *s* meets the required condition of equation (3.4). The polynomial runtimes are established in theorem 3.1.

## 4. Discussion, validation and application

### (a) Application of `GenESeSS` to simulated data

For the scenario in figure 3, `GenESeSS` is applied to 10^{3} symbols generated from the model M2 (denoting *σ*_{0} as 0, and *σ*_{1} as 1) with parameters *γ*=0.75 and *γ*′=0.66. We generate a derivative heap with all strings of length up to 4, and find 000 mapping to an estimated vertex of the convex hull of the heap, thus qualifying as our *ϵ*-synchronizing string *x*_{0}. Figure 8*a*,*b* illustrates the effect of choosing a ‘wrong’ *x*_{0}; our theoretical development guarantees that if *x*_{0} is close to a vertex of the convex hull, then the derivatives *ϕ*^{s}(*x*_{0}*x*) will be close to some row of Π̃, and thus form clusters around those values. If the chosen *x*_{0} is far from all vertices, this is not true; thus, in figure 8*a*, the incorrect choice of the string 11 leads to a spread of symbolic derivatives on the line *x*+*y*=1. Figure 8*b* illustrates the clustering of values around the two hidden states with *x*_{0}=000. A few symbolic derivatives are shown in figure 8*c*, with highlighted rows corresponding to strings with *x*_{0} as prefix; these are approximate repetitions of two distinct distributions. Switching between these repeating distributions on right concatenation of symbols induces the structure shown in figure 8*d* for *ϵ*=0.05 (see §3*a*). The inferred model is already strongly connected, and we run the input data through it (step 3 of `GenESeSS`) to estimate the Π̃ matrix (see the inset of figure 8*d*). The model is recovered with the correct structure, and with deviations in the entries smaller than 0.01.

This example infers a non-synchronizable machine (recall that M2 exhibits LRDs), which often proves too difficult for reported unsupervised algorithms (see the comparison with [27] in figure 7: the latter can match `GenESeSS` performance only at the expense of a large number of states, and still fails to infer true LRD models). Hidden Markov model learning with latent transitions could possibly be applied; however, such algorithms do not come with any obvious guarantees on performance.

### (b) Application of `GenESeSS` to model and predict experimental data

We apply `GenESeSS` to two experimental datasets: the photometric intensity (brightness) of a variable star over a succession of 600 days [38], and the twice-daily count of a blow-fly population in a jar [39]. While the photometric data are quasi-periodic, the population time series has non-trivial stochastic structure. The quantization schemes, each with four symbols, are shown in table 1.

The mean magnitude of the successive change represented by each symbol is also shown; it is computed as the average of the continuous values that are mapped to the symbol. For example, for the photometric data, symbol 0 represents a change of brightness in the range [−4,−2], with an average change of −2.67. Inferred PFSAs generate symbolic sequences, which we map back to the continuous domain using these average magnitudes, e.g. a generated 0 is interpreted as a change of −2.67 for photometric predictions.
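A sketch of this back-conversion; only the −2.67 magnitude for symbol 0 of the photometric data comes from table 1, the remaining magnitudes being placeholders:

```python
import numpy as np

def to_continuous(symbols, mean_magnitude, last_value):
    """Replace each generated symbol by its mean increment and
    cumulatively sum from the last observed continuous value."""
    increments = [mean_magnitude[s] for s in symbols]
    return last_value + np.cumsum(increments)

mean_magnitude = {"0": -2.67, "1": -0.9, "2": 0.9, "3": 2.67}  # 1,2,3 assumed
print(to_continuous("2301", mean_magnitude, last_value=10.0))
```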

The inferred PFSAs are shown in figures 9*b* and 10*b*. The causal structure is simple for the photometric data, and significantly more complex for the population dynamics. For the photometric dataset, the states with only outgoing symbols 2,3 (indicating a brightness increase the next day), the states with outgoing symbols 1,2 (indicating the possibility of either an increase or a decrease) and the states with outgoing symbols 0,1 (indicating a decrease in intensity the next day) are arranged in a simple succession (figure 9*b*). For the population dynamics, the group of states indicating a predicted population increase in the next 12 h, those indicating a decrease and those indicating either possibility have significantly more complex interconnections (figure 10*c*).

In both cases, we learn ARMAX [7] models for comparison. ARMAX models include auto-regressive and moving-average components, and assume additive Gaussian noise and linear dependence on history. The ARMAX model order (*p*,*q*) is chosen as the smallest value-pair providing an acceptable fit to the data (table 1), such that increasing the model order does not significantly improve prediction accuracy. We learn the respective models on a portion of each time series (figures 9*a* and 10*a*), and compare the approaches at different prediction horizons (‘prediction horizon = 1 day’ means ‘predict values 1 day ahead’) against the remaining part of the series. ARMAX does well in the first case, even for longer prediction horizons (figure 9*c*,*d*), but is unacceptably poor on the more involved population dynamics (figure 10*d*,*e*). Additionally, the ARMAX models do not offer clear insight into the causal structure of the processes, as discussed above.

Figure 11*a*,*c* plots the mean absolute prediction error percentage as a function of increasing prediction horizon for both approaches. For the photometric data, the error grows slowly, and is more or less comparable for ARMAX and `GenESeSS`. For the population data, `GenESeSS` does significantly better. Since we generate continuous prediction series from symbolic sequences using mean magnitudes (as described above; see also table 1, column 4), `GenESeSS` is expected to better predict the sign of the gradient of the data rather than the exact magnitudes. We plot the percentage of wrongly predicted trends (i.e. the percentage of sign mismatches) in figure 11*b*,*d* for the two cases. ARMAX is slightly better for the photometric data, while for the population data `GenESeSS` is significantly superior (75% incorrect trends at a horizon of 8 days for ARMAX versus about 25% for `GenESeSS`). Thus, while for quasi-periodic data `GenESeSS` performance may be comparable to existing techniques, it is significantly superior in modelling and predicting complex stochastic dynamics.

## 5. Summary and conclusion

We establish the notion of causal generators for stationary ergodic QSPs on a rigorous measure-theoretic foundation, and propose a new inference algorithm `GenESeSS` for identification of probabilistic automata models from sufficiently long traces. We show that our approach can learn processes with LRDs, which yield non-synchronizable automata. Additionally, we establish that `GenESeSS` is computationally efficient in the PAC sense. The results are validated on synthetic and real datasets. Future research will investigate the prediction problem in greater detail in the context of new applications.

## Acknowledgements

This work was supported in part by NSF CDI grant ECCS 0941561 and DTRA grant HDTRA 1-09-1-0013. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the sponsoring organizations.

## Appendix A. Detailed proofs of mathematical results

We present the proofs of theorems 2.10, 2.15, 2.18, 3.1, and for corollary 2.12 to theorem 2.10.

— (*Statement of theorem* 2.10) *For any QSP over Σ, the PFSA satisfies* ∀*ϵ*>0 ∃*x*∈*Σ*^{⋆} such that min_{℘∈𝒬}‖℘_{x}−℘‖≤*ϵ* (equation (2.11)), *where the norm used is unimportant*.

### Proof.

We show that all PFSA are at least approximately synchronizable [40,41], which is not true for deterministic automata. If the graph of the PFSA (i.e. the deterministic automaton obtained by removing the arc probabilities) is synchronizable, then equation (2.11) trivially holds with *ϵ*′=0 for any synchronizing string *x*. Thus, we assume the graph to be non-synchronizable. From the definition of non-synchronizability, it follows that
A 1 ∃*q*_{i},*q*_{j}∈*Q* such that ∀*x*∈*Σ*^{⋆}, *δ*(*q*_{i},*x*)≠*δ*(*q*_{j},*x*).
If the PFSA has a single state, then every string satisfies the condition in equation (2.11). Hence, we assume that the PFSA has more than one state. Now, if we have
A 2 ∀*x*∈*Σ*^{⋆}, π̃(*q*_{i},*x*)=π̃(*q*_{j},*x*),
then, by definition 2.4, we have the contradiction *q*_{i}=*q*_{j}. Hence, ∃*x*_{0}∈*Σ*^{⋆} such that
A 3 π̃(*q*_{i},*x*_{0})≠π̃(*q*_{j},*x*_{0}).
Since
A 4 Σ_{x∈Σ^{|x_0|}}π̃(*q*_{i},*x*)=Σ_{x∈Σ^{|x_0|}}π̃(*q*_{j},*x*)=1,
we conclude, without loss of generality, that ∀*q*_{i},*q*_{j}∈*Q* with *q*_{i}≠*q*_{j}, there exists a string *x*_{0} with π̃(*q*_{i},*x*_{0})>π̃(*q*_{j},*x*_{0}). It follows by induction that, if we start with a distribution ℘ on *Q* such that ℘_{i}=℘_{j}=0.5, then, for any *ϵ*′>0, we can construct a finite string whose execution yields a new distribution ℘′ satisfying ℘′_{s}>1−*ϵ*′ for some state *q*_{s}. Recalling that the graph is strongly connected, for any *q*_{t}∈*Q* there exists a string *y*∈*Σ*^{⋆} such that *δ*(*q*_{s},*y*)=*q*_{t}; appending *y*, we can ensure that the distribution ℘′′ obtained after execution satisfies ℘′′_{t}>1−*ϵ*′ for any *q*_{t} of our choice. For arbitrary initial distributions ℘^{A} on *Q*, we must consider contributions arising from simultaneously executing this string from states other than just *q*_{i} and *q*_{j}. Nevertheless, it is easy to see that executing it strictly concentrates mass in the new distribution ℘^{A′}. It immediately follows that executing the string given by equation (A 5),
A 5
would result in a final distribution ℘^{A′′} satisfying ℘^{A′′}_{t}>1−*ϵ*′. Appropriate scaling of *ϵ*′ then completes the proof.

— (*Statement of corollary 2.12 to theorem 2.10*) *At most* *O*(1/*ϵ*) *strings need to be analysed to find an* *ϵ*-*synchronizing string*.

### Proof.

Theorem 2.10 works by multiplying entries from the matrix, which cannot be all identical (otherwise the states would collapse). Let the minimum difference between two unequal entries be *η*. Then, following the construction in theorem 2.10, the length ℓ of the synchronizing string, up to linear scaling, satisfies *η*^{ℓ}=*O*(*ϵ*), implying ℓ=*O*(log(1/*ϵ*)). The number of strings to be analysed therefore is at most all strings of length ℓ, which is given by
A 6
This completes the proof.

— (*Statement of theorem 2.15 on* *ϵ*-*convergence*) *If* *x*∈*Σ*^{⋆} *is* *ϵ*-*synchronizing*, *then*
A 7 ‖*ϕ*^{s}(*x*)−π̃([*x*],·)‖ ≤ *ϵ* almost surely as |*s*|→∞.

### Proof.

The proof follows from the Glivenko–Cantelli theorem [31] on uniform convergence of empirical distributions. Since *x* is *ϵ*-synchronizing, we have
A 8 min_{℘∈𝒬}‖℘_{x}−℘‖ ≤ *ϵ*.
Let *x* *ϵ*-synchronize to *q*. Thus, every time we encounter *x* while reading *s*, we are guaranteed to be distributed over *Q* at a distance of at most *ϵ* from the element of 𝒬 corresponding to *q*. Assuming that we encounter *n*(|*s*|) occurrences of *x* within *s*, we note that *ϕ*^{s}(*x*) is an approximate empirical distribution for π̃(*q*,·). Denoting by *F*_{n(|s|)} the perfect empirical distributions (i.e. the ones that would be obtained for *ϵ*=0), the Glivenko–Cantelli theorem gives ‖*F*_{n(|s|)}−π̃(*q*,·)‖→0 almost surely, whence ‖*ϕ*^{s}(*x*)−π̃(*q*,·)‖≤*ϵ* in the limit. We use *n*(|*s*|)→∞ as |*s*|→∞, implied by the strong connectivity of our PFSA. This completes the proof.

— (*Statement of theorem 2.18 on derivative heap approximation*) *For* *s* *generated by a QSP*, *let* 𝒟^{s}(*L*) *be computed with* *L*=*Σ*^{O(log(1/ϵ))}. *If, for some* *x*_{0}∈*Σ*^{O(log(1/ϵ))}, *ϕ*^{s}(*x*_{0}) *is a vertex of the convex hull of* 𝒟^{s}(*L*), *then*
A 9 Pr(*x*_{0} is *ϵ*-synchronizing) ≥ 1−e^{−O(|s|)}.

### Proof.

The result follows from Sanov’s theorem [32] for convex sets of probability distributions. Note that if *ϕ*^{s}(*x*_{0}) is a vertex of the convex hull in the limit of infinite data, then *x*_{0} is guaranteed to be *ϵ*-synchronizing (theorem 2.10 and corollary 2.12). Denoting the number of times we encounter *x*_{0} in *s* as *n*(|*s*|), and since the relevant set *E* of distributions is convex (allowing us to drop the polynomial factor in Sanov’s bound), we apply Sanov’s theorem to the case of finite *s*:
A 10 Pr(*ϕ*^{s}(*x*_{0})∈*E*) ≤ e^{−n(|s|) min_{p∈E} KL(p∥π̃(q,·))},
where KL(⋅∥⋅) denotes the Kullback–Leibler divergence [42]. From the bound [43]
A 11 KL(*p*∥*q*) ≥ (1/2)‖*p*−*q*‖²_{1}
and *n*(|*s*|)≥*α*|*s*|, where *α*>0 is the stationary probability of encountering *x*_{0} in *s*, we conclude
A 12 Pr(‖*ϕ*^{s}(*x*_{0})−π̃(*q*,·)‖_{1}≥*ϵ*) ≤ e^{−α|s|ϵ²/2},
which completes the proof.

— (*Statement of theorem 3.1 on time complexity*) *Asymptotic time complexity of* `GenESeSS` *is*
A 13 *O*((1/*ϵ*+|*Q*|)×|*s*|×|*Σ*|).

### Proof.

`GenESeSS` performs the following computations.

(C1) Computation of a derivative heap by computing *ϕ*^{s}(*x*) for *O*(1/*ϵ*) strings (corollary 2.12), each of which involves reading the input *s* and normalizing to distributions over *Σ*, thus contributing *O*(1/*ϵ*×|*s*|×|*Σ*|).

(C2) Finding a vertex of the convex hull of the heap, which, at worst, involves inspecting *O*(1/*ϵ*) points (encoded by the strings generating the heap), each inspection taking *O*(|*Σ*|) time, contributing *O*(1/*ϵ*×|*Σ*|).

(C3) Finding *δ*, involving computing derivatives at string identifiers (step 2), thus contributing *O*(|*Q*|×|*s*|×|*Σ*|). Ensuring strong connectivity [34] requires *O*(|*Q*|+|*Σ*|), and is hence unimportant.

(C4) Identification of arc probabilities using traversal counts and normalization, done in time linear in the number of arcs, i.e. *O*(|*Q*|×|*Σ*|).

Summing the contributions, and using |*s*|>|*Σ*|, we obtain *O*((1/*ϵ*+|*Q*|)×|*s*|×|*Σ*|), which completes the proof.

## Footnotes

One contribution of 17 to a Discussion Meeting Issue ‘Signal processing and inference for the physical sciences’.

- © 2012 The Author(s) Published by the Royal Society. All rights reserved.