Royal Society Publishing

Abductive learning of quantized stochastic processes with probabilistic finite automata

Ishanu Chattopadhyay, Hod Lipson


We present an unsupervised learning algorithm (GenESeSS) to infer the causal structure of quantized stochastic processes, defined as stochastic dynamical systems evolving over discrete time, and producing quantized observations. Assuming ergodicity and stationarity, GenESeSS infers probabilistic finite state automata models from a sufficiently long observed trace. Our approach is abductive; attempting to infer a simple hypothesis, consistent with observations and modelling framework that essentially fixes the hypothesis class. The probabilistic automata we infer have no initial and terminal states, have no structural restrictions and are shown to be probably approximately correct-learnable. Additionally, we establish rigorous performance guarantees and data requirements, and show that GenESeSS correctly infers long-range dependencies. Modelling and prediction examples on simulated and real data establish relevance to automated inference of causal stochastic structures underlying complex physical phenomena.

1. Introduction and motivation

Automated model inference is a topic of wide interest, and reported techniques range from query-based concept learning in artificial intelligence [1,2], to classical system identification [3], to more recent symbolic regression based reverse-engineering of nonlinear dynamics [4,5]. In this paper, we wish to infer the dynamical structure of quantized stochastic processes (QSPs): stochastic dynamical systems evolving over discrete time, and producing quantized time series (figure 1). Such systems naturally arise via quantization of successive increments of discrete time continuous-valued stochastic processes. A simple example is the standard random walk on the line (figure 2). Constant positive and negative increments are represented symbolically as σR and σL, respectively, leading to a stochastic dynamical system that generates strings over the alphabet {σL,σR}, with each new symbol being generated independently with a probability of 0.5. The dynamical characteristics of this process are captured by the probabilistic automaton shown in figure 2b, consisting a single state q1, and self-loops labelled with symbols from Σ, which represent probabilistic transitions. We wish to infer such causal structures, formally known as probabilistic finite state automaton (PFSA), from a sufficiently long sequence of observed symbols, with no a priori knowledge of the number, connectivities and the transition rules of the hidden states.

Figure 1.

Inference in physical systems: learning causal structure as a probabilistic finite state automaton (PFSA) from quantized time series of annual global temperature changes from 1880 to 1986. Inferred PFSA, and comparative predictions against an auto-regressive moving-average (ARMAX) model are shown (predicting for 1 year in the future) overlaid with the true observations. (Online version in colour.)

Figure 2.

Simple example of a QSP: a random walk on the line (a). The 1-state probabilistic automaton shown in (b) captures the statistical characteristics of this process. Our objective is to learn such a model from a single long observed trace, like any one of the three shown in (c). (Online version in colour.)

(a) Relevance to signal processing and inference in physical systems

We use symbolic representation of data, and our inferred models are abstract logical machines (PFSA). We quantize relative changes between our successive physical observations to map the gradient of continuous time series to symbolic sequences (with each symbol representing a specific increment range). Additionally, we want our inferred models to be generative, i.e. predict the distribution of the future symbols from knowledge of the recent past. What do we gain from such a symbolic approach, particularly since we incur some information loss in quantization? The main advantages are unsupervised inference, computational efficiency, deeper insight into the hidden causal structures, ability to deduce rigorous performance guarantees and the ability to better predict stochastic evolution dictated by many hidden variables interacting in a structured fashion. Systems with such stochastic dynamics and complex causal structures are of emerging interest, e.g. flow of wealth in the stock market, geophysical phenomena, ecology of evolving ecosystems, genetic regulatory circuits and even social interaction dynamics.

Our scheme is illustrated in figure 1, where we model global annual temperature changes from 1880 to 1986 [6]. GenESeSS infers a 4-state PFSA; each state specifies the probabilities of positive (symbol 1) and negative (symbol 0) rate of change in the coming year. Here, we use a binary alphabet, but finer quantizations are illustrated in §4b. Global temperature changes result from the complex interaction of many variables, and a first principle model is not deducible from this data alone; but we can infer causal states in the data as equivalence classes of history fragments yielding statistically similar futures.

We use the following intuition: if the recent pattern of changes was encountered in the past, at least approximately, and then often led to a positive temperature change in the immediate future, then we expect to see a positive change in the coming year as well. Causal states simply formalize the different classes of patterns that need to be considered, and are inferred to specify which patterns are equivalent.

This allows us to make quantitative predictions. For example, state q0 is the equivalence class of histories with last two consecutive years of increasing temperature change (two 1 s), and our inferred model indicates that this induces a 42 per cent probability of a positive rate of change next year (that such a pattern indeed leads uniquely to a causal state is inferred from data, and not manually imposed). Additionally, the maximum deviation between the inferred and the true hidden model is limited by a user-selectable bound. For comparison, we also learn a standard auto-regressive moving-average (ARMAX) model [7], which actually predicts the mean quite well. However, we capture stochastic changes better, and gain deeper insight into the causal structure in terms of the number of equivalence classes of histories, their inter-relationships and model memory.

(b) Abduction versus induction

Our modelling framework assumes that, at any given instant, a QSP is in a unique hidden causal state, which has a well-defined probability of generating the next quantized observation. This fixes the rule (i.e. the hypothesis class) by which sequential observations are generated, and we seek the correct hypothesis, i.e. the automaton that explains the observed trace. Inferring the hypothesis, given the observations and the generating rule is abduction (as opposed to induction, which infers generating rules from a knowledge of the hypothesis, and the observation set) [8].

(c) Contribution and contrast with reported literature

Learning from symbolic data is essentially grammatical inference; learning an unknown formal language from a presentation of a finite set of example sentences [9,10,11]. In our case, every subsequence of the observed trace is an example sentence [12]. Inference of probabilistic automata is well studied in the context of pattern recognition. However, learning dynamical systems with PFSA has received less attention, and we point out the following issues.

  • Initial and terminal conditions. Dynamical systems evolve over time, and are often termination-free. Thus, inferred machines should lack terminal states. Reported algorithms [13,14,15] often learn probabilistic automata with initial and final states, and with special termination symbols.

  • Structural restrictions. Reported techniques make restrictive structural assumptions, and pre-specify upper bounds on inferred model size. For example, [16,13,15] infer only probabilistic suffix automata, those in [12,17] require synchronizability, i.e. the property by which a bounded history is sufficient to uniquely determine the current state, and [13] restrict models to be acyclic and aperiodic.

  • Restriction to short memory processes. Often memory bounds are pre-specified as bounds on the order of the underlying Markov chain [16], or synchronizability [12]; thus only learning processes with short range dependencies. A time series possesses long-range dependencies (LRDs), if it has correlations persisting over all time scales [18]. In such processes, the auto-correlation function follows a power law, as opposed to an exponential decay. Such processes are of emerging interest, e.g. internet traffic [19], financial systems [20,21] and biological systems [22,23]. We can learn LRD processes (figure 3).

  • Inconsistent definition of probability space. Reported approaches often attempt to define a probability distribution on Σ, i.e. the set of all finite but unbounded words over an alphabet Σ, and then additionally define the probability of the null-word λ to be 1. This is inconsistent, since the latter condition would demand that the probability of every other finite string in Σ be 0. Some authors [14] use λ for string termination, which is in line with the grammatical picture where the empty word is used to erase a non-terminal [26]. However, this is inconsistent with a dynamical systems model. We address this via rigorous construction of a σ-algebra on strictly infinite strings.

Figure 3.

Process with LRDs. (a) Consider the dynamics of lane switching for a car on the highway, which may shift lanes (symbol σ0) or continue in the current lane (symbol σ1), and these symbol probabilities depend on the lane that the car is in. The correct model for this rule is M2 (top), which is non-synchronizable, and no length of history will determine exactly the current state. The similar, but incorrect, model M1 (with transitions switched) is a PFSA with 1-step memory. The difference in dynamics is illustrated by the computation of the Hurst exponent for symbol streams generated by the two models (b), with σ0 and σ1 represented as 1 and −1. M2 has a Hurst exponent greater than 0.5 [24,25], indicating LRD behaviour. GenESeSS correctly identifies M2 from a sequence of driver decisions; while most reported algorithms identify models similar to M1. (Online version in colour.)

Our algorithmic procedure (see §3a) has some commonalities with [27,28]. However, Gavaldà et al. uses initial states (superfluous for stationary ergodic processes), special termination symbols and requires a pre-specified bound on the model size. Additionally, key technical differences (discussed in §3a) make our models significantly more compact.

To summarize our contribution, (i) we formalize PFSAs in the context of QSPs, (ii) remove inconsistencies with the definition of the probability of the null-word via a σ-algebra on strictly infinite strings, and (iii) show that PFSAs arise naturally via an equivalence relation on infinite strings. Also, (iv) we characterize the class of QSPs with finite probabilistic generators, establish probably approximately correct (PAC)-learnability [29] and show that (v) algorithm GenESeSS learns PFSA models with no a priori restrictions on the structure, size and memory.

The rest of the paper is organized as follows: §2 develops new theory, which is used in §3 to design GenESeSS and establish complexity results within the PAC criterion. Section 4 validates our development using synthetic and experimental data. The paper is summarized and concluded in §5.

2. Formalization of probabilistic generators for stochastic dynamical systems

Throughout this paper, Σ denotes a finite alphabet of symbols. The set of all finite but possibly unbounded strings on Σ is denoted by Σ, the Kleene ⋆ operation [26]. The set of finite strings over Σ form a concatenative monoid, with the empty word λ as identity. Concatenation of two strings x,yΣ is written as xy. Thus, xy=xλy=xyλ=λxy. The set of strictly infinite strings on Σ is denoted as Σω, where ω denotes the first transfinite cardinal. For a string xΣ, |x| denotes the length of x and for a set A, |A| denotes its cardinality.

(a) Quantized stochastic processes

Definition 2.1 (QSP)

A QSP Embedded Image is a discrete time Σ-valued strictly stationary, ergodic stochastic process, i.e. Embedded Image A stochastic process is ergodic if moments can be calculated from a single, sufficiently long realization, and strictly stationary if moments are not functions of time.

We next formalize the connection of QSPs to PFSA generators.

Definition 2.2 (σ-algebra on infinite strings)

For the set of infinite strings on Σ, we define Embedded Image to be the smallest σ-algebra generated by {ω:xΣ}.

Lemma 2.3

Any QSP induces a probability space Embedded Image.


Using stationarity of QSP Embedded Image we can construct a probability measure Embedded Image by defining for any sequence xΣ∖{λ}, and a sufficiently large number of realizations NR of Embedded Image, with fixed initial conditions: Embedded Image and extending the measure to elements of Embedded Image via at most countable sums. Note that Embedded Image, and for the null word μ(λΣω)=μ(Σω)=1.

For notational brevity, we denote μ(ω) as Pr(x).

Classically, states are induced via the Nerode equivalence, which defines two strings to be equivalent if and only if any finite extension of the strings is either both accepted or both rejected [26] by the language under consideration. We use a probabilistic extension [30].

Definition 2.4 (probabilistic Nerode relation)

Embedded Image induces an equivalence relation ∼N on the set of finite strings Σ as Embedded Image 2.1 For xΣ, the equivalence class of x is denoted as [x].

It follows that ∼N is right invariant, i.e. Embedded Image 2.2 A right-invariant equivalence on Σ always induces an automaton structure.

Definition 2.5 (initial-marked PFSA)

An initial-marked PFSA is a 5-tuple Embedded Image where Q is a finite state set, Σ is the alphabet, Embedded Image is the state transition function, and Embedded Image specifies the conditional symbol-generation probabilities. δ and Embedded Image are recursively extended to arbitrary y=σxΣ as δ(q,σx)=δ(δ(q,σ),x) and Embedded Image. q0Q is the initial state. If the next symbol is specified, our resultant state is fixed; similar to probabilistic deterministic automata [27]. However, unlike the latter, we lack final states. Additionally, we assume our graphs to be strongly connected.

Definition 2.5 has no notion of a final state, and later we will remove initial state dependence using ergodicity. First, we formalize how the notion of a PFSA arises from a QSP.

Lemma 2.6 (from QSP to a PFSA)

If the probabilistic Nerode relation has a finite index, then there exists an initial-marked PFSA generator.


Every QSP represented as a probability space Embedded Image induces a probabilistic automaton Embedded Image, where Q is the set of equivalence classes of the probabilistic Nerode relation (definition 2.4), Σ is the alphabet, and Embedded Image 2.3 and Embedded Image 2.4 q0 is identified with [λ], and finite index of ∼N implies Embedded Image .

The above construction yields a minimal realization unique up to state renaming.

Corollary 2.7 (to lemma 2.6: null-word probability)

For the PFSA Embedded Image induced from a QSP Embedded Image Embedded Image 2.5


For qQ, let xΣ such that [x]=q. From equation (2.4), we have Embedded Image 2.6

Next, we define canonical representations to remove initial-state dependence. We use Embedded Image to denote the matrix representation of Embedded Image, i.e. Embedded Image, qiQ,σjΣ. We need the notion of transformation matrices Γσ.

Definition 2.8 (transformation matrices)

For an initial-marked PFSA Embedded Image the symbol-specific transformation matrices Γσ∈{0,1}|Q|×|Q| are Embedded Image 2.7

States in the canonical representation (denoted as ℘x) are identified with probability distributions over states of the initial-marked PFSA. Here, x denotes the string in Σ realizing this distribution, beginning from the stationary distribution on the states of the initial-marked representation. ℘x is an equivalence class, and hence x is not unique.

Definition 2.9 (canonical representations)

An initial-marked PFSA Embedded Image uniquely induces a canonical representation Embedded Image with QC being the set of probability distributions over Q, Embedded Image and Embedded Image as follows.

  • — Construct the stationary distribution on Q using the transition probabilities of the Markov chain induced by G, and include this as the first element ℘λ of QC. The transition matrix for this induced chain is the row-stochastic matrix M∈[0,1]|Q|×|Q|, with Embedded Image.

  • Define δC and Embedded Image recursively: Embedded Image 2.8 and Embedded Image 2.9

For a QSP Embedded Image, the canonical representation is denoted as Embedded Image.

Ergodicity of QSPs, which makes ℘λ independent of the initial state in the initial-marked PFSA, implies that the canonical representation is initial state independent (figure 4), and subsumes the initial-marked representation in the following sense: let Embedded Image denote the set of distributions satisfying Embedded Image 2.10 Then we note the following.

  • — If we execute the canonical construction with an initial distribution from Embedded Image, then we get the initial-marked PFSA (with the initial marking missing).

  • — If during the construction we encounter Embedded Image for some x, then we stay within the graph of the initial-marked PFSA for all right extensions of x. This thus eliminates the need of knowing the initial state explicitly (figure 5), provided we find string x, which takes us within or close to Embedded Image.

Figure 4.

Illustration of canonical representations. (a) An initial-marked PFSA, and (b) The canonical representation obtained starting with the stationary distribution ℘λ on the states q1,q2,q3 and q4. We obtain a finite graph here, which is not guaranteed in general. (c) The strongly connected component, which is isomorphic to the original PFSA. States of the machine in (c) are distributions over the states of the machine in (a). (Online version in colour.)

Figure 5.

Why initial states are unnecessary. A sequence 10 is executed in the initial-marked PFSA in (a) from q3 (transitions shown in bold), resulting in the new state q1 (b). For this model, the same final state would be obtained by executing 10 from any other initial state, and also from any initial distribution over the states (as shown in (c) and (d)). This is because 10 is a synchronizing string for this model. Thus, if the initial state is unknown, and we assume that the initial distribution is stationary, then we actually start from the state ℘λ in the canonical representation (e), and we would end up with a distribution that has support on a unique state in the initial-marked machine (see (f), where the resulting state in the pruned representation corresponds to q1 in the initial-marked machine). Knowing a synchronizing string (of sufficient length) therefore makes the initial condition irrelevant. This machine is perfectly synchronizable, which is not always true, e.g. model M2 in figure 3. However, approximate synchronization is always achievable (theorem 2.10), which therefore eliminates the need of knowing the initial state explicitly, but requires us to compute such an ϵ-synchronizing sequence. (Online version in colour.)

Consequently, we denote the initial-marked PFSA induced by a QSP Embedded Image, with the initial marking removed, as Embedded Image, and refer to it simply as a ‘PFSA’ (dropping the qualifier ‘initial-marked’). States in Embedded Image are representable as states in Embedded Image as elements of Embedded Image. Next, we establish a key result: we always encounter a state arbitrarily close to some element in Embedded Image in the canonical construction starting from the stationary distribution ℘λ.

Theorem 2.10 (ϵ-synchronization of probabilistic automata)

For any QSP Embedded Image over Σ, the PFSA Embedded Image satisfies Embedded Image 2.11 where the norm used is unimportant.


See appendix A.

Theorem 2.10 induces the notion of ϵ-synchronizing strings, and guarantees their existence for arbitrary PFSA.

Definition 2.11 (ϵ-synchronizing strings)

An ϵ-synchronizing string xΣ for a PFSA is one that satisfies Embedded Image 2.12 The norm used is unimportant.

Theorem 2.10 does not yield an algorithm for computing synchronizing strings (theorem 2.18). It simply shows that one always exists. As a corollary, we estimate an asymptotic upper bound on the effort required to find it.

Corollary 2.12 (to theorem 2.10)

At most O(1/ϵ) strings need to be analysed to find an ϵ-synchronizing string.


See appendix A.

Next, we describe the basic principle of our inference algorithm. PFSA states are not observable; we observe symbols generated from hidden states. This leads us to the notion of symbolic derivatives, which are computable from observations.

We denote the set of probability distributions over a set of cardinality k as Embedded Image. First, we specify a count function.

Definition 2.13 (symbolic count function)

For a string s over Σ, the count function Embedded Image counts the number of times a particular substring occurs in s. The count is overlapping, i.e. in a string s=0001, we count the number of occurrences of 00s as Embedded Image and Embedded Image implying #s00=2.

Definition 2.14 (symbolic derivative)

For a string s generated by a QSP over Σ, the symbolic derivative is a function Embedded Image as Embedded Image 2.13 Thus, ∀xΣ,ϕs(x) is a probability distribution over Σ. ϕs(x) is referred to as the symbolic derivative at x.

For qiQ, Embedded Image induces a distribution over Σ as Embedded Image. We denote this as Embedded Image. We show that the symbolic derivative at x can be used to estimate this distribution for qi=[x], provided x is ϵ-synchronizing.

Theorem 2.15 (ϵ-convergence)

If x∈Σ is ϵ-synchronizing then Embedded Image 2.14


The proof follows from the Glivenko–Cantelli theorem 31 on uniform convergence of empirical distributions. See appendix A for details.

Next, we describe identification of ϵ-synchronizing strings given a sufficiently long observed string s. Theorem 2.10 guarantees existence, and corollary 2.12 establishes that O(1/ϵ) substrings need to be analysed until we encounter an ϵ-synchronizing string. These do not provide an executable algorithm, which arises from an inspection of the geometric structure of the set of probability vectors over Σ, obtained by constructing ϕs(x) for different choices of the candidate string x.

Definition 2.16 (derivative heap)

Given a string s generated by a QSP, a derivative heap Embedded Image is the set of probability distributions over Σ calculated for a given subset of strings LΣ as follows: Embedded Image 2.15

Lemma 2.17 (limiting geometry)

Let Embedded Image and Embedded Image be the convex hull of Embedded Image. If u is a vertex of Embedded Image then Embedded Image 2.16


Recalling theorem 2.15, the result follows from noting that any element of Embedded Image is a convex combination of elements from the set Embedded Image.

Lemma 2.17 does not claim that the number of vertices of the convex hull of Embedded Image equals the number of states, but that every vertex corresponds to a state. We cannot generate Embedded Image in practice since we have a finite observed string s, and we can only calculate ϕs(x) for a finite number of x. Instead, we show that choosing a string corresponding to the vertex of the convex hull of the heap, constructed by considering O(1/ϵ) strings, gives us an ϵ-synchronizing string with high probability.

Theorem 2.18 (derivative heap approximation)

For s generated by a QSP, let Embedded Image be computed with L=ΣO(log(1/ϵ)). If for some x0∈ΣO(log(1/ϵ)), ϕs(x0) is a vertex of the convex hull of Embedded Image then Embedded Image 2.17


The result follows from Sanov’s theorem [32] for convex set of probability distributions. See appendix A for details.

Figure 6 illustrates PFSAs, derivative heaps, and ϵ-synchronizing strings. Next, we present our inference algorithm.

Figure 6.

Illustration of derivative heaps. (a, c) Synchronizable PFSAs, and hence the derivative heaps have few distinct images (two for (a), and about five for (c)), while the PFSAs shown in (b) and (d) are non-synchronizable, and their heaps consist of a spread of points. Note since the alphabet size is 2 in these cases, the heaps are part of a line segment. (e,f) PFSAs and derivative heaps for 3-letter alphabets. The derivative heap is a polytope as expected. Red square marks the derivative at the chosen ϵ-synchronizing string x0 in all cases. ϕi is the ith coordinate of the symbolic derivative. (Online version in colour.)

3. Algorithm GenESeSS

(a) Implementation steps

We call our algorithm ‘Generator Extraction Using Self-similar Semantics’, or GenESeSS which for an observed sequence s, consists of three steps.

  • Identification of ϵ-synchronizing string x0. Construct a derivative heap Embedded Image using the observed trace s (definition 2.16), and set L consisting of all strings up to a sufficiently large, but finite, depth. We suggest as initial choice of L as Embedded Image. If L is sufficiently large, then the inferred model structure will not change for larger values. We then identify a vertex of the convex hull for Embedded Image, via any standard algorithm for computing the hull [33]. Choose x0 as the string mapping to this vertex.

  • Identification of the structure of Embedded Image, i.e. transition function δ. We generate δ as follows. For each state q, we associate a string identifier xidqx0Σ, and a probability distribution hq on Σ (which is an approximation of the Embedded Image-row corresponding to state q). We extend the structure recursively:

    1. Initialize the set Q as Q={q0}, and set xidq0=x0, hq=ϕs(x0).

    2. For each state qQ, compute for each symbol σΣ, find symbolic derivative ϕs(xidqσ). If Embedded Image for some q′∈Q, then define δ(q,σ)=q′. If, on the other hand, no such q′ can be found in Q, then add a new state q′ to Q, and define Embedded Image, hq=ϕs(xidqσ).

    The process terminates when every qQ has a target state, for each σΣ. Then, if necessary, we ensure strong connectivity using [34].

  • Identification of arc probabilities, i.e. function Embedded Image:

    1. Choose an arbitrary initial state qQ.

    2. Run sequence s through the identified graph, as directed by δ, i.e. if current state is q, and the next symbol read from s is σ, then move to δ(q,σ). Count arc traversals, i.e. generate numbers Embedded Image where Embedded Image.

    3. Generate Embedded Image by row normalization, i.e. Embedded Image

Gavaldà et al. [27] and Castro & Gavaldà [28] use similar recursive structure extension. However, with no notion of ϵ-synchronization, they are restricted to inferring only synchronizable or short-memory models, or large approximations for long-memory ones (figure 7).

Figure 7.

Application of the competing algorithm of [27,28] on simulated data from model M2 (figure 3) with ϵ=0.05. (a) The modelling error in the metric defined in lemma 3.2 with increasing model size, requiring 91 states to match GenESeSS performance (which finds the correct 2 state model M2 (b)). Gavaldà et al. requires a maximum model size, and (c)–(e) show models obtained with 2, 4 and 8 states. All models found are synchronizable, and miss the key point found by GenESeSS that M2 is non-synchronizable with LRDs. Algorithm CSSR [12] also finds only similar synchronizable models. (Online version in colour.)

(b) Complexity analysis and probably approximately correct learnability

While hq (in step 2) approximates Embedded Image rows, we find the arc probabilities via normalization of traversal count. hq only uses sequences in x0Σ, while traversal counting uses the entire sequence s, and is more accurate.

GenESeSS has no upper bound on the number of states; which is a function of the complexity of the process itself.

Theorem 3.1 (time complexity)

Asymptotic time complexity of GenESeSS is Embedded Image 3.1


See appendix A for details.

Theorem 3.1 shows that GenESeSS is polynomial in O(1/ϵ), size of input s, model size |Q| and alphabet size |Σ|. In practice, |Q|≪1/ϵ, implying that Embedded Image 3.2

An identification method is said to identify a target language L in the PAC sense [29,35,36] if it always halts and outputs L such that Embedded Image 3.3 where d(⋅,⋅) is a metric on the space of target languages. A class of languages is efficiently PAC-learnable if there exists an algorithm that PAC-identifies every language in the class, and runs in time polynomial in 1/ϵ, 1/δ, the length of sample input, and inferred model size. We prove PAC-learnability of QSPs by first establishing a metric on the space of probabilistic automata over Σ.

(c) Probably approximately correct identifiability of quantized stochastic processes

Lemma 3.2 (metric for probabilistic automata)

For two strongly connected PFSAs G1 and G2, denote the symbolic derivative at xΣ as Embedded Image and Embedded Image respectively. Then, Embedded Image defines a metric on the space of probabilistic automata on Σ.


Non-negativity and symmetry follow immediately. Triangular inequality follows from noting that Embedded Image is upper bounded by 1, and therefore for any chosen order of the strings in Σ, we have two Embedded Image sequences, which would satisfy the triangular inequality under the Embedded Image norm. The metric is well defined since for any sufficiently long s1 and s2, the symbolic derivatives at arbitrary x are uniformly convergent to some linear combination of the rows of the corresponding Embedded Image matrices.

Theorem 3.3 (PAC-learnability)

QSPs for which the probabilistic Nerode equivalence has a finite index are PAC-learnable using PFSAs, i.e. for ϵ,η>0, and for every sufficiently long sequence s generated by QSP Embedded Image we can compute Embedded Image as an estimate for Embedded Image such that Embedded Image 3.4 The algorithm runs in time polynomial in 1/ϵ,1/η, input length |s| and model size.


GenESeSS construction implies that, once the initial ϵ-synchronizing string x0 is identified, there is no scope of the model error to be more than ϵ. Hence Embedded Image Thus, for any η>0, if we have Embedded Image, then the required condition of equation (3.4) is met. The polynomial runtimes are established in theorem 3.1.

(i) Remark on Kearns’ hardness result

We are immune to Kearns’ hardness result [37], since ϵ>0 enforces state distinguishability [16].

4. Discussion, validation and application

(a) Application of GenESeSS to simulated data

For the scenario in figure 3, GenESeSS is applied to 103 symbols generated from the model M2 (denoting σ0 as 0, and σ1 as 1) with parameters γ=0.75 and γ′=0.66. We generate a derivative heap with all strings of length up to 4, and find 000 mapping to an estimated vertex of the convex hull of the heap, thus qualifying as our ϵ-synchronizing string x0. Figure 8a,b illustrate the choice of a ‘wrong’ x0; our theoretical development guarantees that if x0 is close to a vertex of the convex hull, then the derivatives ϕs(x0x) will be close to some row of Embedded Image, and thus form clusters around those values. If the chosen x0 is far from all vertices, this is not true; thus, in figure 8a, the incorrect choice of string 11 leads to a spread of symbolic derivatives on the line x+y=1. Figure 8b illustrates clustering of the values around the two hidden states with x0=000. A few symbolic derivatives are shown in figure 8c, with highlighted rows corresponding to strings with x0 as prefix; which are approximate repetitions of two distinct distributions. Switching between these repeating distributions on right concatenation of symbols induces the structure shown in figure 8d for ϵ=0.05 (see §3a). The inferred model is already strongly connected, and we run the input data through it (step 3 of GenESeSS) to estimate the Embedded Image matrix (see the inset of figure 8d). The model is recovered with correct structure, and with deviation in Embedded Image entries smaller than 0.01.

Figure 8.

Application of GenESeSS to simulated data for highway traffic scenario (model M2, see figure 3). (a,b) The correct and wrong choices of ϵ-synchronizing x0. The derivative heap for right extensions of correctly chosen x0 has few distinct images (b). (c) Some symbolic derivatives, and the chosen x0 is marked with a red arrow. Right extensions of x0=000 fall into two classes (shown in yellow and green), and switching between them induces the structure in (d). Finally, Embedded Image (inset of (d)) is estimated by step 3 of GenESeSS. (Online version in colour.)

This example infers a non-synchronizable machine (recall that M2 exhibits LRDs), which often proves to be too difficult for reported unsupervised algorithms (see comparison with [27] in figure 7, which can match GenESeSS performance only at the expense of a large number of states, and yet still fails to infer true LRD models). Hidden Markov model learning with latent transitions can possibly be applied; however, such algorithms do not come with any obvious guarantees on performance.

(b) Application of GenESeSS to model and predict experimental data

We apply GenESeSS to two experimental datasets: photometric intensity (brightness) of a variable star for a succession of 600 days [38], and twice-daily count of bow-fly population in a jar [39]. While the photometric data are quasi-periodic, the population time series has non-trivial stochastic structure. Quantization schemes, with four symbols, are shown in table 1.

View this table:
Table 1.

Quantization scheme and inferred model parameters.

The mean magnitude of successive change represented by each symbol is also shown, which is computed as the average of the continuous values that get mapped to each symbol, e.g. for the photometric data, symbol 0 represents a change of brightness in the range [−4,−2], and an average change of −2.67. Inferred PFSAs generate symbolic sequences, which we map back to the continuous domain using these average magnitudes, e.g. a generated 0 is interpreted as a change of −2.67 for photometric predictions.

The inferred PFSAs are shown in figures 9b and 10b. The causal structure is simple for the photometric data, and significantly more complex for the population dynamics. For the photometric dataset, the states with only outgoing symbols 2,3 (indicating a brightness increase next day), the states with outgoing symbols 1,2 (indicating the possibility of either an increase or a decrease), and the states with outgoing symbols 2,3 (indicating a decrease in intensity next day) are arranged in a simple succession (figure 9b). For the population dynamics, the group of states which indicate a predicted population increase in the next 12 h, the ones which indicate a decrease, and the ones which indicate either possibility have significantly more complex interconnections (figure 10c). In both cases, we learn ARMAX [7] models for comparison. ARMAX models include auto-regressive and moving average components, assume additive Gaussian noise, and linear dependence on history. ARMAX model order (p,q) is chosen to be the smallest value-pair providing an acceptable fit to the data (table 1), such that increasing the model order does not significantly improve prediction accuracy. We learn the respective models with a portion of the time series (figures 9a and 10a), and compare the approaches with different prediction horizons (‘prediction horizon = 1 day’ means ‘predict values 1 day ahead’) against the remaining part of the time series. ARMAX does well in the first case, even for longer prediction horizons (see figure 9c,d), but is unacceptably poor with the more involved population dynamics (figure 10d,e). Additionally, the ARMAX models do not offer clear insight into the causal structure of the processes as discussed above. Figure 11a,c plots the mean absolute prediction error percentage as a function of increasing prediction horizon for both approaches. For the photometric data, the error grows slowly, and is more or less comparable for ARMAX and GenESeSS. For the population data, GenESeSS does significantly better. Since we generate continuous prediction series from symbolic sequences using mean magnitudes (as described above; see also table 1, column 4), GenESeSS is expected to better predict the sign of the gradient of the predicted data rather than the exact magnitudes. We plot the percentage of wrongly predicted trends (i.e. percentage of sign mismatches) in figure 11b,d for the two cases. We note that ARMAX is slightly better for the photometric data, while for the population data, GenESeSS is significantly superior (75% incorrect trends at a horizon of 8 days for ARMAX versus about 25% for GenESeSS). Thus, while for quasi-periodic data GenESeSS performance may be comparable to existing techniques, it is significantly superior in modelling and predicting complex stochastic dynamics.

Figure 9.

Modelling photometric intensity of a variable star (source: Newton [38]). (a) Time-series data, (b) inferred PFSA. (c,d) Predicted series for GenESeSS and ARMAX model. (Online version in colour.)

Figure 10.

Modelling bow-fly population in a jar (source: Nicholson [39]). (a) Time-series data, (b) inferred PFSA. (c) The interconnection between groups of states that indicate population increase (only symbols 2 and 3 defined), decrease (only symbols 0 and 1 defined), or both in the next observed measurement. (d,e) Predicted series for GenESeSS and ARMAX model. (Online version in colour.)

Figure 11.

Comparison of GenESeSS against ARMAX model (a,b) photometric and (c,d) population time series. (a,c) Variation of the mean absolute prediction error as a function of the prediction horizon. (b,d) Variation of the percentage of wrongly predicted trends, i.e. the number of times the sign of the predicted gradient failed to match up with the ground truth. (Online version in colour.)

5. Summary and conclusion

We establish the notion of causal generators for stationary ergodic QSPs on a rigorous measure-theoretic foundation, and propose a new inference algorithm GenESeSS for identification of probabilistic automata models from sufficiently long traces. We show that our approach can learn processes with LRDs, which yield non-synchronizable automata. Additionally, we establish that GenESeSS is computationally efficient in the PAC sense. The results are validated on synthetic and real datasets. Future research will investigate the prediction problem in greater detail in the context of new applications.


This work was supported in part by NSF CDI grant ECCS 0941561 and DTRA grant HDTRA 1-09-1-0013. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the sponsoring organizations.

Appendix A. Detailed proofs of mathematical results

We present the proofs of theorems 2.10, 2.15, 2.18, 3.1, and for corollary 2.12 to theorem 2.10.

— (Statement of theorem 2.10) For any QSP Embedded Image over Σ, the PFSA Embedded Image satisfies Embedded Image where the norm used is unimportant.


We show that all PFSA are at least approximately synchronizable [40,41], which is not true for deterministic automata. If the graph of Embedded Image (i.e. the deterministic automaton obtained by removing the arc probabilities) is synchronizable, then equation (5) trivially holds true for ϵ′=0 for any synchronizing string x. Thus, we assume the graph of Embedded Image to be non-synchronizable. From the definition of non-synchronizability, it follows: Embedded Image A 1 If the PFSA has a single state, then every string satisfies the condition in equation (5). Hence, we assume that the PFSA has more than one state. Now if we have Embedded Image A 2 then, by definition 2.4, we have a contradiction qi=qj. Hence, ∃x0 such that Embedded Image A 3 Since Embedded Image A 4 we conclude without loss of generality that ∀qi,qjQ, with qiqj: Embedded Image It follows from induction that if we start with a distribution ℘ on Q such that ℘i=℘j=0.5, then for any ϵ′>0 we can construct a finite string Embedded Image such that if Embedded Image, then for the new distribution ℘′ after execution of Embedded Image will satisfy ℘s′>1−ϵ′. Recalling that Embedded Image is strongly connected, for any qtQ, there exists a string yΣ, such that δ(qs,y)=qt. Setting Embedded Image, we can ensure that the distribution ℘′′ obtained after execution of Embedded Image satisfies ℘t′′>1−ϵ′ for any qt of our choice. For arbitrary initial distributions ℘A on Q, we consider contributions arising from simultaneously executing Embedded Image from states other than just qi and qj. Nevertheless, it is easy to see that executing Embedded Image implies that in the new distribution ℘A, we have Embedded Image. It immediately follows that executing of the string Embedded Image, where Embedded Image A 5 would result in a final distribution ℘A′′ which satisfies Embedded Image. Appropriate scaling of ϵ′ then completes the proof.

— (Statement of corollary 2.12 to theorem 2.10) At most O(1/ϵ) strings need to be analysed to find an ϵ-synchronizing string.


Theorem 2.10 works by multiplying entries from the Embedded Image matrix, which cannot be all identical (otherwise the states would collapse). Let the minimum difference between two unequal entries be η. Then, following the construction in theorem 2.10, the length ℓ of the synchronizing string, up to linear scaling, satisfies η=O(ϵ), implying ℓ=O(log(1/ϵ)). The number of strings to be analysed therefore is at most all strings of length ℓ, which is given by Embedded Image A 6 This completes the proof.

— (Statement of theorem 2.15 on ϵ-convergence) If xΣ is ϵ-synchronizing, then Embedded Image A 7


The proof follows from the Glivenko–Cantelli theorem 31 on uniform convergence of empirical distributions. Since x is ϵ-synchronizing, we have Embedded Image A 8 Let x ϵ-synchronize to q. Thus, every time we encounter x while reading s, we are guaranteed to be distributed over Q, and at most ϵ distance from the element of Embedded Image corresponding to q. Assuming that we encounter n(|s|) occurrences of x within s, we note that ϕs(x) is an approximate empirical distribution for Embedded Image. Denoting Fn(|s|) as the perfect empirical distributions (i.e. ones that would be obtained for ϵ=0), we have Embedded Image We use Embedded Image as Embedded Image, implied by the strong connectivity of our PFSA. This completes the proof.

— (Statement of theorem 2.18 on derivative heap approximation) For s generated by a QSP, let Embedded Image be computed with Embedded Image. If for some Embedded Image, ϕs(x0) is a vertex of the convex hull of Embedded Image, then Embedded Image A 9


The result follows from Sanov’s theorem [32] for convex set of probability distributions. Note that if Embedded Image, then x0 is guaranteed to be ϵ-synchronizing (theorem 2.10 and corollary 2.12). Denoting the number of times we encounter x0 in s as n(|s|), and since Embedded Image is a convex set of distributions (allowing us to drop the polynomial factor in Sanov’s bound), we apply Sanov’s theorem to the case of finite s: Embedded Image A 10 where KL(⋅∥⋅) denotes the Kullback–Leibler divergence [42]. From the bound [43]: Embedded Image A 11 and Embedded Image, where α>0 is the stationary probability of encountering x0 in s, we conclude: Embedded Image A 12 which completes the proof.

— (Statement of theorem 3.1 on time complexity) Asymptotic time complexity of GenESeSS is Embedded Image A 13


GenESeSS performs the following computations.

  • Computation of a derivative heap by computing ϕs(x) for O(1/ϵ) strings (corollary 2.12), each of which involves reading the input s and normalization to distributions over Σ, thus contributing O(1/ϵ×|s|×|Σ|).

  • (C2) Finding a vertex of the convex hull of the heap, which, at worst, involves inspecting O(ϵ) points (encoded by strings generating the heap), contributing O(1/ϵ×|Σ|), where each inspection is done in O(|Σ|) time.

  • (C3) Finding δ, involving computing derivatives at string-identifiers (step 2), thus contributing O(|Q|×|s|×|Σ|). Strong connectivity [34] requires O(|Q|+|Σ|), and hence is unimportant.

  • (C4) Identification of arc probabilities using traversal counts and normalization, done in time linear in the number of arcs, i.e. O(|Q|×|Σ|).

Summing the contributions, and using |s|>|Σ|, Embedded Image which completes the proof.



View Abstract