## Abstract

We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as abrupt breakdown of linear model selection, robust data fitting or compressed sensing reconstructions, when the complexity of the model or the number of outliers increases beyond a threshold. In combinatorial geometry, these transitions appear as abrupt changes in the properties of face counts of convex polytopes when the dimensions are varied. The thresholds in these very different problems appear in the same critical locations after appropriate calibration of variables. These thresholds are important in each subject area: for linear modelling, they place hard limits on the degree to which the now ubiquitous high-throughput data analysis can be successful; for robustness, they place hard limits on the degree to which standard robust fitting methods can tolerate outliers before breaking down; for compressed sensing, they define the sharp boundary of the undersampling/sparsity trade-off curve in undersampling theorems. Existing derivations of phase transitions in combinatorial geometry assume that the underlying matrices have independent and identically distributed Gaussian elements. In applications, however, it often seems that Gaussianity is not required. We conducted an extensive computational experiment and formal inferential analysis to test the hypothesis that these phase transitions are *universal* across a range of underlying matrix ensembles. We ran millions of linear programs using random matrices spanning several matrix ensembles and problem sizes; visually, the empirical phase transitions do not depend on the ensemble, and they agree extremely well with the asymptotic theory assuming Gaussianity. Careful statistical analysis reveals discrepancies that can be explained as transient terms, decaying with problem size. The experimental results are thus consistent with an asymptotic large-*n* universality across matrix ensembles; finite-sample universality can be rejected.

## 1. Introduction

Recent work has exposed a phenomenon of abrupt *phase transitions* in high-dimensional geometry. The phase transitions amount to rapid shifts in the likelihood of a property’s occurrence when a dimension parameter crosses a critical level (a *threshold*). We start with a concrete example and then identify surprising parallels in data analysis and signal processing.

### (a) Convex hulls of Gaussian point clouds

Suppose we have a sample *X*_{1},…,*X*_{n} of independent standard normal random variables in dimension *d*, forming a point cloud of *n* points in *R*^{d}. Our intuition suggests that a few of the points will lie on the ‘surface’ of the dataset, that is, the boundary of the convex hull; the rest will lie ‘inside’, i.e. interior to the hull. However, if *d* is a fixed fraction of *n* and both are large, our intuition is completely violated. Instead, *all* of the points are on the boundary of the convex hull—*none* is interior. Moreover, the line segment connecting the typical pair of points *does not* intersect the interior; in complete defiance of expectation, it stays on the boundary. Even more, for *k* in some appreciable range, *the typical k-tuple spans a convex hull which does not intersect the interior*! For humans stuck all their lives in three-dimensional space, such a situation is hard to visualize.

The phenomenon of phase transition appears as follows: such seemingly strange behaviour continues for quite large *k*, up to a *predictable* threshold given by a formula *k** = *d*×*ρ*(*d*/*n*;*T*), where *ρ*() is defined in §2. Below this threshold (i.e. *k* a bit smaller than *k**), the strange behaviour is observed; but suddenly, above this threshold (i.e. for *k* a bit larger), our normal low-dimension intuition works again—convex hulls of *k*-tuples of points indeed intersect the interior.

This curious phenomenon in high-dimensional geometric probability is one of a small number of fundamental such phase transitions. There are consequences

(i) in selecting models for statistical data analysis of large datasets,

(ii) in coping with outlying measurements in designed experiments, and

(iii) in determining how many samples we need to take in designing imaging devices.

The consequences can be both profound and important. They range from negative-philosophical—if your database has too many ‘junk’ variables in it nothing can be learned from it—to positive-practical—it is not really necessary to sit cooped up for an hour in a medical MRI scanner: with the right inference procedures, the necessary data could be collected in a fraction of the time commonly used today.

Our paper aims to define these phase transitions more precisely and to indicate applications; it will then discuss our main contribution.

*Main contribution.* We have observed a *universality* of threshold locations across a range of underlying probability distributions. Changing the underlying distribution from Gaussian to any of several non-Gaussian choices, we observe phase transitions at the same locations.

We compiled evidence based on millions of random trials and observed the same phase transitions even for several highly non-independent and identically distributed (i.i.d.) ensembles. We here formally state and test the universality hypothesis.

We pose a challenge in the field of high-dimensional geometric probability.

*Open problem.* Characterize the *universality class* containing the standard Gaussian, i.e. the class of ensembles leading to phase transitions matching those for the Gaussian i.i.d. case.

The hypothesized universality class is expected to be fairly broad. In view of the significance of these phase transitions in applications, this is quite an attractive challenge. We begin by illustrating three surprising appearances of these phase transitions.

### (b) Surprise 1: model selection with large databases

A characteristic feature of today’s *data deluge* is the tendency in each field to collect ever more and more measurements on each observed entity, whether it be a pixel of sky, a sample of blood or a sick patient. Technology continually puts in our hands *high-throughput* measurement equipment making ever more varied and ever more detailed numerical measurements on the spectrum of light, the protein expression in whole blood or fluctuations in neural or muscular activity.

As a result, observed entities are represented by ever higher dimensional feature vectors. In fact, the transition between the twentieth and twenty-first centuries marked a sudden increase in the dimensionality of typical datasets that scientists studied, so that it became unremarkable for each observational unit to be represented as a datum point in a *p*-dimensional space with *p* very large—in the hundreds, thousands or millions.

The modern trend to high-throughput measurement devices often does not address the fundamental difficulty of obtaining good observational units. Scientists face the same troubles they always have faced when searching for subjects affected by a rare disease or observing rare events in distant galaxies. Hence, in many fields, the number of observational units stays small, perhaps in the hundreds (or even dozens), but each of those few units can now be routinely subjected to unprecedented density of numerical description.

Orthodox statistical methods assumed a quite different set of conditions: an abundance of observational units and a very limited set of measured characteristics on each unit. Modern statistical research is intensively developing new tools and theory to address the new unorthodox setting; such research constituted much of the activity in the recent six-month Newton Institute programme *Statistical Theory and Methods for Complex, High-Dimensional Data*.

Consider a linear modelling scenario going back to Legendre, Gauss and perhaps even before. We have a response variable *Y* , which we intend to model as a linear function of up to *p* numerical predictor variables *X*_{1},…,*X*_{p}. We contemplate an utterly standard multi-variate linear model, , where the *β*_{j} are regression coefficients and *Z* is a standard normal measurement error. In words, the expected value of *Y* given {*X*_{j}} is a linear combination with coefficients *β*_{j}.

Suppose we have a collection of measurements (*Y*_{i},*X*_{i,j},*j* = 1,…,*p*), one for each observational unit *i*. We will use these data to estimate the *β*_{j}’s, allowing future predictions of *Y* given the *X*_{j}’s.

In the ‘twenty-first-century setting’ described above, we have more predictor variables than observations, meaning *p*>*n*. While Legendre and Gauss may have understood the *p*<*n* case, they would have been very troubled by the *p*>*n* case: there are more unknowns than equations, and there is noise to boot.

A key feature of high-throughput analysis is that batches of potential predictors are automatically measured but one does not know in advance which, if any, may be useful in a particular project. Researchers in applied sciences where high-throughput studies are popular (e.g. genomics, proteomics, metabolonomics) believe that some small fraction of the measured features are useful, among many useless ones. Unfortunately, high-throughput techniques give us everything, useful and useless, all mixed together in one batch.

In this setting, a reasonable response is forward stepwise linear regression. We proceed in stages, starting with the simple model *Y* = *α* + *Z* (i.e. no dependence on *X*’s) and at each stage expand the model by adding the single variable *X*_{j} offering the strongest improvement in prediction.

Stodden (2006) reported a simulation experiment using forward stepwise regression with false-discovery rate control stopping rule. Their experiment chose *p* = 200 and explored a range of *n*<*p* cases. Letting *k* denote the number of useful predictors among the *p* potential predictors, they set up true underlying regression models reflecting the choice (*k*,*n*,*p*), ran the stepwise regression routine and recorded the mean-squared prediction error of the resulting estimate. Figure 1 displays the results: the coloured attribute gives the relative mean-squared error of the estimate; the axes present the ratio *δ* = *n*/*p* of observations to variables, with 0<*δ*<1 in this brave new world, and *ρ* = *k*/*n* of useful variables to observational units. Evidently, there is an abrupt change in performance: one can suddenly ‘fall off a cliff’ by slightly increasing the number of useless variables per useful variable. The *surprise* is that the ‘cliff’ is roughly at the same position as the overlaid curve. That curve, denoted by *ρ*(*δ*;*C*) and defined fully below, derives from combinatorial geometry1 notions similar to those in §1*a*.

*Interpretation*:

(i) A standard scientific data analysis approach in the ‘twenty-first-century setting’ falls off a cliff, failing abruptly when the model becomes too complex.

(ii) The location of this failure (ratio of model variables to observations) matches a curve derived from the field of geometric combinatorics.

### (c) Surprise 2: robustness in designed experiments

We now consider a problem in robust statistics. Suppose that a response variable *Y* is thought to depend on *p* independent variables *X*_{1},…,*X*_{p}. Unlike the data-drenched high-throughput observational studies of §1*b*, we are in a classical designed experiment, with *p*<*n*. The dependence is linear, so we again have . The error *Z* is again normal, but *W* is a ‘wild’ variable containing occasional very large outliers.

Most scientists realize that such outliers could upset the usual least-squares procedure for estimating *β*, and many know that *ℓ*_{1} minimization,
1.1is purported to be ‘robust’, particularly in designed experiments where wild *X*’s do not occur. Let us focus on a specific designed experiment, where *X* is an *n* by *p* partial Hadamard matrix, i.e. *p* columns chosen at random from an *n*×*n* Hadamard matrix. This design chooses *X*_{i,j} either +1 or −1 in a very specific way; there are no wild *X*’s. We let the outlier generators *W* have most entries 0, but, in our study, a small fraction *ε* = *k*/*n*—at randomly chosen sites—will have very large values; in any one realization, they can be either all large positive or all large negative.

To clarify better the relationships we are trying to make in this paper, we set the standard error *Z* to zero, and consider only the wild outliers in *W*. We quantify breakdown properties of the *ℓ*_{1} estimator by a large computational experiment. We vary parameters (*k*,*n*,*p*) creating a range of situations. At each one, we solve the *ℓ*_{1} fitting problem (1.1) and measure to how many digits agrees with *β*; we record as *breaking down* due to outliers when fewer than six digits agree.2Figure 2*a* shows the results of this experiment, depicting the breakdown fraction.

Evidently, there is an abrupt change in behaviour at a certain critical fraction *ε**; this depends on *γ* = *p*/*n*. Now 0<*γ*<1, so there are more observational units than predictors. When there are many observation units per predictor, i.e. *ε**≈1, *ℓ*_{1} fitting can resist a large fraction of outliers. When the model is almost saturated, i.e. nearly one predictor per observation, *ε**≈0, it takes very little contamination to break down the *ℓ*_{1} estimator. In figure 2*a*, we overlay a theoretical curve *ε**(*γ*;*C*) = (1−*γ*)*ρ*((*n*−*p*)/*n*;*C*), where *ρ*(·;*C*) is derived from geometric combinatorics. Evidently, the curve coincides with the observed breakdown point of the *ℓ*_{1} estimator in a designed experiment. Figure 2*b* depicts a transformation of figure 2*a*, into new axes, with variables *ρ* = *k*/(*n*−*p*) and *δ* = (*n*−*p*)/*n*. The display now looks similar to Stodden’s fig. 1.

*Interpretations*:

(i) Standard

*ℓ*_{1}fitting in a standard designed experiment (but with large*n*and*p*, this time with*p*<*n*) turns out to be robust below a certain critical fraction of outliers, at which point it breaks down.(ii) The simulation results, properly calibrated, closely match seemingly unrelated phenomena in sparse linear modelling in the

*p*>*n*case; they also match a known phase transition in geometric combinatorics.

### (d) Surprise 3: compressed sensing

We now leave the field of data analysis for the field of signal processing.

Since the days of Shannon, Nyquist, Whittaker and Kotelnikov, the ‘sampling theorem’ has helped engineers decide how much data need to be acquired in the design of measurement equipment. Consider the following imaging problem. We wish to acquire a signal *x*_{0} having *N* entries. Now suppose that only *k*≪*N* of those pixels are actually non-zero—we do not know which pixels are non-zero, or even that this is true. There are *N* degrees of freedom here, as any of the *N* pixel values vary.

Consider making *n*≪*N* measurements of a special kind. We simply observe *n* random Fourier coefficents of *x*. Here *n*≪*N* so that, although the image has *N* degrees of freedom, we make far fewer measurements. Let *A* be the linear operator that delivers the selected *n* Fourier coefficients and let *y* be the resulting measured coefficients. We attempt to reconstruct by solving for the object *x*_{1} with smallest *ℓ*_{1} norm subject to agreeing with the measurements *y*1.2

Figure 3 shows the results of computational experiments conducted by the authors. In those experiments, we chose *N* = 200 and varying levels of *k* and *n*. The horizontal axis *δ* = *n*/*N* measures the undersampling ratio—how many fewer measurements we are making than the customary *N*. The vertical axis *ρ* = *k*/*n* measures to what extent the effective number of degrees of freedom *k* is smaller than the number of measurements. The contours indicate the success rate. The superimposed curve *ρ*(*δ*;*C*) roughly coincides with the empirical 50 per cent success rate curve.

*Interpretation*: We can violate the usual ‘sampling theorem’ (*n*≥*N*) with impunity. The true limit is , where *ρ*(*δ*;*C*) is the curve decorating the display.3 We have seen this curve twice before already; it arises in a superficially unrelated problem in high-dimensional geometric combinatorics.

### (e) The connection to high-dimensional geometric combinatorics

Recall the problem in geometric probability we discussed in §1*a*. Draw a sequence of *n* samples *X*_{1},…,*X*_{n} from a standard *d*-dimensional normal distribution. Let *P* = Conv(*X*_{1},…,*X*_{n}) denote the convex hull of these *n* points; this is a random convex polytope.

Suppose that *d* and *n* are both large, and let *δ* = *d*/*n*. Figure 4 presents a black curve to be called *ρ*(*δ*;*T*) and formally defined4 in §2. It has the following interpretation.

Let *ρ* = *k*/*d*. Suppose that *ρ*<*ρ*(*d*/*n*;*T*) and that *n* and *d* are both large, with *d*<*n*. For the typical *k*+1 tuple (*X*_{i1},…,*X*_{ik+1}),

(i) every

*X*_{ij}is a vertex of*P*, 1≤*j*≤*k*+1,(ii) every line segment [

*X*_{ij},*X*_{ij′}] is an edge of*P*, 1≤*j*,*j*′≤*k*+1,⋮

(

*n*) the convex polytope Conv(*X*_{i1},…,*X*_{ik+1}) is a*k*-dimensional face of*P*.

Figure 4 also presents a second, lower, black curve, which is actually the one we have seen in our three surprises. This curve, denoted by *ρ*(*δ*;*C*) and defined in §2, has the following interpretation. Draw the same *n* samples *X*_{1},…,*X*_{n} from a standard *d*-dimensional normal distribution. Now let *Q* = Conv(*X*_{1},…,*X*_{n},−*X*_{1},…,−*X*_{n}) denote the convex hull of the 2*n* points including the original *n* points and their reflections through the origin. This is a random *centrosymmetric* convex polytope.

Let *ρ* = *k*/*d* and *ε*>0. Suppose that *ρ*<*ρ*(*d*/*n*;*C*)(1−*ε*) and that *n* and *d* are both large. For the typical *k*+1 tuple (*X*_{i1},…,*X*_{ik+1}),

every ±

*X*_{ij}is a vertex of*Q*, 1≤*j*≤*k*+1,every line segment [±

*X*_{ij},±*X*_{ij′}] is an edge of*Q*, 1≤*j*,*j*′≤*k*+1,⋮

(

*n*) the typical convex polytope Conv(±*X*_{i1},…,±*X*_{ik+1}) is a*k*-dimensional face of*Q*.

In short, the curve arising each time in Surprises 1–3 involves convex hulls of symmetrized Gaussian point clouds. The curve involving *ρ*(*δ*;*T*) would arise if we had instead positivity constraints on the objects to be recovered (Surprises 1 and 3) or on the outliers (Surprise 2).

### (f) Main results

The curve *ρ*(*δ*;*C*) describes the properties of high-dimensional polytopes deriving from the Gaussian distribution. What *now* seems surprising about Surprises 1–3 is the lack of formal connection to those polytopes in the applications. For example,

(i) in Surprise 1, stepwise regression as practised by statisticians seems unrelated to convex polytopes.

In Surprises 2 and 3, the appearance of the *ℓ*_{1} norm establishes a connection with convex polytopes (as the unit ball of the *ℓ*_{1} norm is in fact a regular polytope). Yet

(ii) in Surprise 2, neither a Hadamard design nor outliers have any formal connection to any Gaussian distribution, and

(iii) in Surprise 3, observing random frequencies of the Fourier transform of a two-dimensional signal has no visible connection to any Gaussian distribution.

The curves *ρ*(*δ*;*C*) and *ρ*(*δ*;*T*) accurately describe the thresholds in many situations where the Gaussian distribution is not present; in fact, we have witnessed it in cases where the points of a point cloud were chosen deterministically. We believe this signals a new kind of *limit theorem* in probability theory that, when formalized, will make precise a new kind of universality phenomenon in high-dimensional geometry.

## 2. Combinatorial geometry and phase transitions

### (a) Polytope terminology

Let *P* be a convex polytope in *R*^{N}, i.e. the convex hull of points *p*_{1},…,*p*_{m}. Let *A* be an *n*×*N* matrix. The image *Q* = *AP* lives in *R*^{n}; it is a convex set, in fact a polytope, the convex hull of points *Ap*_{1},…,*Ap*_{m}. *Q* is the result of ‘projecting’ *P* from *R*^{N} down to *R*^{n} and will be called the *projected polytope*.

The polytopes *P* and *Q* have vertices, edges, two-dimensional faces and so on. Let *f*_{k}(*P*) and *f*_{k}(*Q*) denote the the number of such *k*-dimensional faces; thus, *f*_{0}(*P*) is the number of vertices of *P* and *f*_{N}(*P*) the number of facets, while *f*_{0}(*Q*) is the number of vertices and *f*_{n}(*Q*) the number of facets. Projection can only reduce the number of faces, so

Three very special families of polytopes *P* are available in every dimension *N*>2, the so-called regular polytopes. Here we consider two of the three:

(i) the

*simplex*(an (*N*−1)-dimensional analogue of the equilateral triangle) 2.1and the*(ii) crosspolytope*(an*N*-dimensional analogue of the octahedron) 2.2

Statements concerning the hypercube similar to those made here for the simplex and crosspolytope are available to an interested reader in 8, Donoho & Tanner (2008*a*).

### (b) Connection to underdetermined systems of equations

The regular polytopes are simple and beautiful objects, but are not commonly thought to be *useful* objects. However, their face counts reveal solution properties of underdetermined systems of equations. Such underdetermined systems arise frequently in modern applications, and the existence of *unique* solutions to such systems is responsible for the three surprises given in the introduction. Consider the case of the simplex.

Consider the underdetermined system of equations *y*_{0} = *Ax*, where *A* is *n*×*N*, *n*<*N*, and the optimization problem (LP) isLPOrdinarily, the system has infinitely many solutions, as does problem (LP).

### Lemma 2.1.

*Let A be a fixed matrix with n columns in general position in R ^{N}. Consider vectors y*

_{0}

*with a sparse solution y*

_{0}

*= Ax*

_{0}

*where x*

_{0}

*≥0 has k non-zeros. The fraction of systems (y*

_{0}

*,A) where (LP ) has that underlying x*

_{0}

*as its only solution is*

So the ratio of face counts between the projected and the unprojected simplices gives the probability that (LP) correctly recovers a *k*-sparse *x*_{0}.

Consider the case of the crosspolytope. To an underdetermined system of equations *y*_{0} = *Ax*_{0}, where *A* is *n*×*N*, *n*<*N*, apply the optimization problem (P 1)P1Ordinarily, both this system and the problem (P 1) have infinitely many solutions.

### Lemma 2.2.

*Let A be a fixed matrix with n columns in general position in R*^{N}*. Consider vectors y*_{0}*with a sparse solution y*_{0}*= Ax*_{0}*, where x*_{0}*has k non-zeros. The fraction of systems (y*_{0}*,A) where (P 1) has that underlying x*_{0}*as its unique solution is*

In short, the ratio of face counts between the projected and unprojected crosspolytopes equals the chance that (P 1) recovers the underlying *k*-sparse object.

In short, it is essential to know whether or not
for *Q* the simplex or crosspolytope.

### (c) Asymptotics of face counts with Gaussian matrices *A*

We now consider the case where the *n*×*N* matrix *A* has i.i.d. Gaussian random entries. Then the mapping *P*↦*Q* is a random projection. In this case, rather amazingly, tools from polytope theory and probability theory can be combined to study the expected face counts in high dimensions. The results demonstrate rigorously the existence of sharp thresholds in face count ratios.

### Theorem 2.3.

(Donoho (2005*a*,*b*) and Donoho & Tanner (2005*a*,*b*, 2009)). *Let the n × N random matrix A have i.i.d. N*(0,1) *Gaussian elements. Consider sequences of triples (N,n,k), where n = δ N, k = ρ n and*. *There are functions ρ (δ;Q) for Q∈{T,C} demarcating phase transitions in face counts,*

Figure 4 displays the two curves referred to in this theorem. The simplex’s transition is higher than that of the crosspolytope: *ρ*(*δ*,*T*)>*ρ*(*δ*,*C*) for *δ*∈(0,1).

## 3. Empirical results for non-Gaussian ensembles

### (a) Explorations

Over the last few years, we ran computer experiments generating millions of underdetermined systems of equations of various kinds, using standard optimization tools to select specific solutions, and checking whether or not the solution was unique and/or sparse. In overwhelmingly many cases, Gaussian polytope theory accurately matches the experimental results, even when the matrices involved are not Gaussian. We here summarize results about experiments with the non-Gaussian ensembles listed in table 1. For details, see the electronic supplementary material.

We varied the matrix shape *δ* = *n*/*N* and the solution sparsity levels *ρ* = *k*/*n*. At problem size *N* = 1600, we varied *n* systematically through a grid ranging from *n* = 160 up to *n* = 1440 in nine equal steps. At each combination *N*,*n*,*k*, we considered *M* = 200 different problem instances *x*_{0} and *A*, each one drawn randomly as above. We both generated non-negative sparse vectors and solved (LP), and generated signed sparse vectors and solved (P 1). The ‘signal processing language’ event ‘exact reconstruction of an object with non-zeros in these *k* specific positions’ corresponds to the ‘polytope language’ event ‘this specific *k*-face of *Q* is also a *k*-face of *AQ*’. In both cases, we speak of *success*, and we call the frequency of success in *M* empirical trials at a given (*k*,*n*,*N*) the *success rate*. At each combination *N*,*n*, we varied *k* systematically to sample the success rate transition region from 5 to 95 per cent. Figure 4 offers an overview; it shows the level curves for 50 per cent success rate, for each of the nine ensembles above. The appropriate theoretical curves *ρ*(*δ*;*Q*) are overlaid. The uppermost nine curves give the case of non-negative solutions, *Q* = *T*, where we solve (LP), and the nine lower curves present the data for *Q* = *C*, where we solve (P 1). (Note: the Hadamard case is exceptional and uses *N* = 512.)

At first glance, figure 4 shows excellent agreement between the actual empirical results in each matrix ensemble5 and the asymptotic theory for the Gaussian. This is not very surprising for *A* from the Gaussian ensemble; it merely proves that the large *N* polytope theory works accurately already at moderate *N* = 1600. For other ensembles, there is not, to our knowledge, any existing theory suggesting what we see so clearly here: phase transition behaviour in non-Gaussian ensembles that accurately matches the Gaussian case (compare §4).

### (b) Universality hypothesis

Figure 4 suggests to us the following hypothesis.

**Hypothesis.***Universality of phase transitions.* Suppose that the *n*×*N* matrix *A* is sampled randomly from a ‘well-behaved’ probability distribution. Suppose that the *N* by 1 vector *x*_{0} is sampled randomly from the set of *k*-sparse vectors, either with or without positivity constraints on the non-zeros of *x*_{0}. Solutions to (LP) and (P1) will exhibit, as a function of (*N*,*n*,*k*), phase transitions in success probabilities matching those that are proved to hold when sampling *A* from the i.i.d. Gaussian distribution.

This hypothesis really contains two assertions: (i) that many matrix ensembles behave like the Gaussian and (ii) that moderate-sized *N* exhibit behaviour in line with the asymptotic.

The hypothesis also contains an element of vagueness, since we do not know at the time of writing how to delineate the ensembles of random matrices over which Gaussian-like behaviour will hold. Of course universality results are well known in probability theory; the central limit theorem is the most well-known universality result for the distribution of sums of independent random variables. The precise universality class of the Gaussian distribution for such sums was discovered only two centuries after the phenomenon itself was identified. Apparently, we are here at the stage of just identifying a comparable phenomenon. We hope it does not take two centuries to identify the corresponding universality class.

*Clarification 1.* In fact, our hypothesis could also be called a *rigidity* of the phase transition—it is invariantly located at the same place in the phase diagram across a range of matrix ensembles. In statistical physics, universality of a phase transition means something different, and much weaker—not a rigidity, but instead a flexibility of the location of a phase transition while preserving an underlying structural similarity. Our hypothesis is far stronger.

*Clarification 2.* A completely general form of the hypothesis has trivial counterexamples; the matrix *A* of all ones does not generate any useful phase transition behaviour.

### (c) Experimental procedure

We conducted a Monte Carlo experiment to test the universality hypothesis.

The general procedure was like our earlier exploratory studies. We call a *suite* a distribution of problem instances (*y*,*A*) fully specified by two factors: (i) the ensemble of matrices *A* and (ii) the ensemble of coefficients *x*_{0} generating *y* = *Ax*_{0}. Matrix ensembles include Bernoulli, Ternary, … . Coefficient ensembles studied here have vectors of *N* coefficients with only *k* non-zeros, in sites chosen at random. The positive sign coefficient ensemble indicated by + has all non-zeros drawn uniformly from [0, 1]. The signed coefficient ensemble indicated by ± has non-zeros drawn uniformly from [−1, 1]. For each suite, we visited a collection of triples (*N*,*n*,*k*). At each triple, we drew a sequence of random problem instances of the given size and shape from the given problem suite. We then ran optimization software to compute the solution of the random problem. We computed observables from the obtained solution, in particular the binary observable `ExactRecon`, which takes the value 1 when the obtained solution is equal to the true solution within six digits accuracy, and zero otherwise.

We aimed to be *confirmatory* rather than *exploratory*: to use formal inferential tools and carefully explain apparent departures from our hypothesis. Our experiments differed from earlier efforts in scope and attention to detail.

(i)

*Scale.*We performed 2 948 000 separate optimizations spanning 16 984 different situations. Our computations used as many as 200 CPUs in an available cluster and overall required 6.8 CPU years. The scale of earlier exploratory studies, amounting to CPU days, is tiny by comparison.(ii)

*Calibration.*The vast majority of our experimental computations relied on Mosek, a commercial package (http://www.mosek.com). We made runs comparing the results with CVX, a popular open-source optimization package (http://www.stanford.edu/~boyd/cvx/). We believe our results are consistent across optimizers.

### (d) Inferential formulation

Rather than go on a fishing expedition, we chose, from the outset, to frame our evaluation of the evidence using standard inferential procedures.

*Two-sample comparisons.* The strict form of the universality hypothesis says that the probability of unique solution under the Gaussian ensemble is the same as the probability of unique solution at each other ensemble in the universality class. It follows that we may compare two sets of results at the same problem size, one with the Gaussian ensemble and one where everything else in the problem is the same except that a specific non-Gaussian ensemble is used. If at each ensemble we generate *M* problem instances and obtain *M* realizations of the observable `ExactRecon`, strict universality requires that the number of successes in each ensemble have a binomial probability distribution with the same success probability in both ensembles. Hence, the hypothesis really amounts to the assertion that two binomial distributions are the same. We proceed with traditional tests for equality of two binomial distributions. We chose to work with the *Z* score
3.1Here denotes ‘the fraction of cases where `ExactRecon` = 1 in ensemble *i*’ and is the appropriate standard error for comparing proportions with possibly unequal sample sizes *M*_{i}. In this comparison, describes the Gaussian baseline experiment and describes the non-Gaussian alternative experiment. Under the universality hypothesis, *Z* has an approximate standard normal distribution.

Reducing our problem merely to consideration of *Z* scores, we can formalize our hypothesis.

**Strong null hypothesis.***The Z scores have an approximate N(0,1) distribution at each value of (k,n,N).*

*Study of asymptotics with problem size.* The strong null hypothesis seems implausible *a priori* on the strength of experience from other settings.

Consider another setting where the Gaussian distribution is universal: the central limit theorem. Although the Gaussian distribution provides the correct limiting behaviour, there are well-understood departures from Gaussian behaviour at small problem sizes. Such departures of course decay with increasing problem size. The theory of Edgeworth expansions shows that, for a symmetric distribution, we will see deviations of order 1/(problem size), and, for an asymmetric one, deviations of order occur.

Analogously, in this setting, we may see systematic behaviour of the *Z* scores varying with problem size *N* and perhaps also with *k*. Studying three problem sizes (*N* = 200,400 and 1600), we might observe trends in the *Z* scores with the problem size.

**Weak null hypothesis.***The Z scores exhibit discrepancies from the standard N(0,1) distribution (e.g. in means, variances, tail probabilities) which decay to zero with increasing N.*

### (e) Results

Results of our experiment are already summarized in figure 4. For each suite in table 1, for each value of *δ* = *n*/*N*, we measured the value of *ρ* = *k*/*n* at which the empirical probability of success crossed 50 per cent. Each of the 18 different curves in figure 4 presents the results for one suite; each one depicts the 50 per cent success rate curves as a function of *δ*. These ‘empirical *ρ* curves’ exhibit very strong visual agreement with the corresponding theoretical curves *ρ*(*δ*;*T*) and *ρ*(*δ*;*C*).

#### (i) Raw *Z* scores

Formal statistical tests are much more sensitive and objective than visual impressions. Figure 5 displays the *Z* scores (3.1) for two-sample comparisons between the Gaussian ensemble with non-negative coefficients and the odd numbered suites; similarly, figure 6 presents two-sample comparisons between the Gaussian ensemble and the odd numbered suites. The eight panels in each figure depict differences between each of the non-Gaussian ensembles and the Gaussian ensemble. The problem shape *δ* = *n*/*N* runs along the horizontal axis; these plots display results for *N* = 200,400,1600 all combined.

The vast majority of the *Z* scores in these displays fall in the range −2, 2.

**Finding 1.***The Z scores*

**in bulk**

*are consistent with our hypothesis of*

**no difference**

*between the distributions.*

In effect, our experiment conducted 16 984 hypothesis tests and found relatively few ‘significant differences’ at the individual test level.

#### (ii) Rejection of strict universality

There are marked ‘tilts’ in the display of *Z* scores in figures 5 and 6; linear trends with *δ* are visually evident. Consider the general mean-shift model
where is standard normal. This expresses the idea that the observed *Z* scores exhibit ‘drift’ as a function of *δ* and *N*, but otherwise have the expected statistical properties of such scores.

If μ is not truly zero in this model, then of course the null hypothesis of no difference fails. In our setting, this means that the Gaussian ensemble does not give truly the same success probabilities as the ensemble being compared with it. Our analysis below rejects the hypothesis that μ = 0.

**Finding 2.***The Z scores are*

**not consistent**

*with the hypothesis of*

**strict universality**.

*Exact finite-problem-size agreement of success probabilities*

*p*_{1}and*p*_{0}between each alternative ensemble and the Gaussian ensemble is not supported by our experiments.#### iii Non-rejection of weak universality

We reformulate the weak universality hypothesis in terms of moments.

**Refined null hypothesis.***For each ensemble E, μ(δ;N,E) tends to zero with increasing N, and the variance of Z scores approaches 1.*

This hypothesis has a clear motivation. Earlier displays aided the eye with lines fitted to the means. Evidently, the mean *Z*-scores within a given suite are generally closer to zero for *N* large than for *N* small. Hence, in an informal appraisal, the refined null hypothesis seems quite plausible. Inspired by the ‘higher criticism’ (see Donoho & Jin 2009), we compare the observed bulk distribution of *Z* scores with the theoretical distribution *N*(0,1). Table 2 shows roughly as many large *Z* scores as one would expect under the null hypothesis.

**Finding 3.***The Z scores*

**do not reject**

*the hypothesis of*

**weak universality**.

*The difference between success probabilities*

*p*_{1}and*p*_{0}at each alternative ensemble and the Gaussian ensemble can be adequately modelled as a matrix-dependent random variable with stochastic order*p*_{1}−*p*_{0}=*O*_{p}(*N*^{−1/2}).### (f) Results presented in the electronic supplementary material

The authors present much more detailed analysis in the electronic supplementary material. Key findings include the following:

*(i) Transition zone scaling with*. The width*N**w*(*δ*,*N*;*Q*) of the zone where success probability drops from 1 to 0 scales as .*(ii) Adequacy of probit model*. The success rate varies with*ρ*as a probit function . Here is the Gaussian survival function and*w*is the transition width.*(iii) Exceptional ensembles.*It is evident from figures 5 and 6 that, at small*N*= 200, certain ensembles offer a relatively poor match to the Gaussian case. Most of these discrepancies can be accounted for by saying that in these exceptional ensembles, at small problem sizes, the level curve for 50 per cent success rate is shifted noticeably below the 50 per cent curve for the Gaussian ensemble. However, at the larger problem size*N*= 1600, both the shift and the exceptional character of those ensembles are no longer evident.

### (g) Limitations of our analysis

We considered a limited set of matrix ensembles in this study. The ensembles are not all based on i.i.d. elements—there are dependencies among rows in the Fourier and Hadamard ensembles and among columns in the Expander ensemble. Even so, there is a certain air of ‘orthogonality’ or ‘weak independence’ in these examples.

There are exceptions to the pattern presented here. The classical example of cyclic polytopes shows that (LP) can have a notably higher success rate for very special matrices than it does for random matrices (see Donoho & Tanner 2005*a*).

In addition to forward stepwise regression, (LP) and (P 1), there are several competing algorithms which we do not study here. Maleki & Donoho (2009) conducted extensive empirical testing of many such algorithms and observed clear phase-transition-like behaviour, which varies from algorithm to algorithm and from the results presented here. Unlike the phase transitions presented here, which match theoretical results in combinatorial geometry, the phase transitions observed for competing algorithms are not yet supported by theoretical derivations.

## 4. Conclusion

Certain phase transitions in a high-dimensional combinatorial geometry have been derived assuming a Gaussian distribution. We had informally observed that the Gaussian theory seemed approximately right even in some non-Gaussian cases. In this study, we made extensive computational experiments with more than a dozen matrix ensembles considering millions of instances at a range of problem sizes. Empirical results for both Gaussian and non-Gaussian ensembles show finite *N* transition bands centred around the asymptotic phase transition derived from a Gaussian assumption. The bands have a width of size *O*(*N*^{−1/2}), consistent with the proven behaviour for the Gaussian ensemble (Donoho & Tanner 2008*b*). Such a behaviour at non-Gaussian ensembles goes far beyond the current theory. Adamczak *et al.* (2009) proved that, for a range of random matrix ensembles with independent columns, there is a region in the phase diagram where the expected success fraction tends to 1, without suggesting that different ensembles have the same region.

We used standard two-sample statistical inference tools to compare the results from non-Gaussian ensembles with their Gaussian counterparts at the same problem size and sparsity level. We observed fairly good agreement of the two-sample *Z* scores with the null hypothesis of no difference; however, fitting a linear model to an array of such *Z* scores, we were able to identify statistically significant trends of the *Z* scores with problem size and with undersampling fraction *δ* = *n*/*N*. The fitted trends vary from ensemble to ensemble, decay with problem size and are consistent with weak, ‘asymptotic’, universality but not with strong, finite *N*, universality.

Our evidence points to a new form of ‘high-dimensional limit theorem’. There is some as yet unknown class of matrix ensembles that yield phase transitions at the same location as the Gaussian polytope transitions. Delineating this universality class seems an important new task for future work in stochastic geometry.

## Acknowledgements

The authors thank the Isaac Newton Mathematical Institute for hospitality during the programme *Statistical Theory and Methods for Complex, High-Dimensional Data* and for a Rothschild Visiting Professorship held by D.L.D. The authors thank Erling Andersen for donating licenses for the Mosek software package which saved a great deal of computer time in the studies described here. This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF). D.L.D. was partially supported by NSF DMS 05-05303, and J.T. was partially supported by Sloan and Leverhulme Fellowships.

## Footnotes

↵1 The auxilary parameter

*C*in*ρ*(*δ*;*C*) is used to indicate the connection of this curve with the standard crosspolytope*C*, defined in equation (2.2), from which it is derived.↵2 In order to save computer time, the actual experiment relied on asymptotic approximation, asymptotic in the size of the amplitude of the wild component. In our experiment, we set

*Z*= 0 and modified the definition of breakdown; instead of declaring breakdown when the estimated beta was wrong by more than 5 s.e. we declared breakdown if the estimated beta was wrong in the sixth digit. Because of a scale invariance and continuity enjoyed by*ℓ*^{1}, the*Z*= 0 experiments can be viewed as the limiting case as the size of the wild component increases while the ordinary noise*Z*stays non-zero and fixed.↵3 The symbol denotes an asymptotic relatonship; for precise conditions, see Donoho & Tanner (2009).

↵4 The auxilary parameter

*T*in*ρ*(*δ*;*T*) associates this curve with*T*^{N−1}, the standard simplex (2.1); see §2.↵5 Visual evidence, similar to figure 4, of qualitative agreement was presented at conferences in 2006–2009 by Donoho & Tanner for all but the Expander ensemble. Inclusion of the Expander ensemble in the results presented here was motivated by evidence in Berinde

*et al.*(2008) for an Expander ensemble (with a different choice of*P*(0)), which also shows a qualitative agreement with the asymptotic phase transition*ρ*(*δ*;*C*).One contribution of 11 to a Theme Issue ‘Statistical challenges of high-dimensional data’.

- © 2009 The Royal Society