## Abstract

Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue.

## 1. Introduction

While the origins of statistical investigation (e.g. Graunt 1662) predate even those of this eldest of extant scientific journals, the largest development of the science of statistics occurred in the twentieth century. The theory and practice of frequentist methods, the likelihood approach and the Bayesian paradigm all flourished, and informal graphical methods, computational algorithms and careful mathematical theory grew up together.

For most of the period, the primary motivating practical problems consisted of a comparatively large number of ‘experimental units’, on which a comparatively small number of features were measured. If, informally, we let *p* denote the dimension of what is ‘unknown’ and let *n* denote the cardinality of what is ‘known’, then traditional theory, and most practice, has until recently been largely limited to the ‘small *p*, large *n*’ scenario. This scenario also naturally reflected the contemporary limitations of computers (the term meant *people* prior to 1950) and graphical display.

A natural mode for asymptotic approximation therefore imagines that while *p* remains of smaller order than *n*, in fact usually fixed. Among the most familiar theoretical results of this type are the Laws of Large Numbers and the Central Limit Theorems. The former says that the sample mean of a random sample of size *n* from a population has as a limit, in a well-defined sense, the population mean, as *n* tends to . The corresponding central limit theorem shows that the limiting distribution of the sample mean about the population mean (when scaled up by ) is of the normal or Gaussian type. In statistics, such results are useful in deriving asymptotic properties of estimators of parameters, but their validity relies on there being, in theory at least, many ‘observations per parameter’.

In practice, *n* will generally correspond to the number of experimental units on which data are available; for *p*, however, there are at least two, albeit related, interpretations. The more basic interpretation is as the measure of complexity of the model to be fitted to the data. However, that is often determined by the dimension of the data as given by the number of items (variables) recorded for each experimental unit, and in our presentation we shall use *p* to represent either interpretation, as appropriate.

Over the last 20 years or so, however, the practical environment has changed dramatically, with the spectacular evolution of data acquisition technologies and computing facilities. At the same time, applications have emerged in which the number of experimental units is comparatively small but the underlying dimension is massive; illustrative examples might include image analysis, microarray analysis, document classification, astronomy and atmospheric science. Methodology has responded vigorously to these challenges, and procedures have been developed or adapted to provide practical results.

However, there is a need for consolidation in the form of a systematic and critical assessment of the new approaches as well as development of appropriate theoretical underpinning (Lindsay *et al.* 2004). In terms of asymptotic theory, the key scenarios to be investigated can be described as ‘large *p*, small *n*’ or in some cases as ‘large *p*, large *n*’; theory for the former scenario would assume that *p* goes to infinity faster than *n* and for the latter would assume that *p* and *n* go to infinity at the same rate.

The practical and theoretical challenges posed by the large *p*/small *n* settings, along with the ferment of recent research, formed the backdrop to the 2008 research programme ‘Statistical Theory and Methods for Complex, High-dimensional Data’ at the Isaac Newton Institute for Mathematical Sciences, which stimulated this Theme Issue. It is important to emphasize the breadth of the research community represented in that programme and this Theme Issue. From a theoretical/methodological point of view, it is of relevance not only to statisticians but also to many in the growing population of machine-learning researchers. In addition, increasingly many areas of application generate data, the analysis of which requires the type of theory and methodology described in this issue.

Before setting the scene for the papers in this volume, we conclude this introduction to the introduction with some general remarks.

It should not, of course, be imagined that the ‘large *p*’ scenarios are mere alternative cases to be explored in the same spirit as their ‘small *p*’ forebears. A better analogy would lie in the distinction between linear and nonlinear models and methods—the unbounded variety and complexity of departures from linearity is a metaphor (and in some cases a literal model) for the scope of phenomena that can arise as the number of parameters grows without limit.

Indeed, *a priori*, the enterprise seems impossible. Good data-analytical practice has always held that the number of data points *n* should exceed the number of parameters *p* to be estimated by some solid margin: *n*/*p*≥5 is a plausible rule of thumb, mentioned for example by Hamilton (1970) and repeated in the classic text by Huber (1981).

The large *p*/small *n* world would therefore seem to depend on a certain statistical alchemy—the computational transformation of ignorance into parameter estimates by fearless specification and fitting of high-dimensional models.

Nevertheless, as will be indicated by example in the papers in this volume, in a variety of methodological settings, as well as in numerous scientific applications, there has been notable success with large-*p* models and methods.

The key, of course, is that we are not always ignorant. Indeed, it seems clear that the enterprise can have hopes of success only if the actual number of influential parameters, *k*, say, is much smaller than the nominal number *p*. Thus, prior knowledge of the existence of a sparse representation, either of a hypothesized form or to be discovered by exploration, is a *sine qua non.*

At the same time, statistical theory is challenged to provide heuristics, principles and results to help explain when sparse models can be expected to be well estimable, or alternatively when the enterprise is simply too ambitious without further reliable prior information.

One such theoretical construct that emerges in a couple of papers in this volume is the ‘phase diagram’. An asymptotic model of a large-*p* regression or classification problem is expressed in terms of parameters such as the data ratio *n*/*p* and the effective parameter sparsity *k*/*n*, for which the diagram depicts sharp transitions between conditions in which estimation/classification is possible and those in which it must fail entirely.

## 2. Areas of application

Specific frontier fields for development and application of methods for analysing complex, high-dimensional data include a wide variety of areas within bioinformatics, classification problems in astronomy, tool development for implementing Basel II finance proposals, weather prediction and so on.

In this Theme Issue, the greatest emphasis in terms of applications is on aspects of biology. This was also the case in the Newton Institute research programme, with one of the workshops being explicitly so directed. Bickel *et al.* (in press) provide an extensive review of genomics and the array of statistical techniques that have been recruited or developed to handle the required data analysis: the use of exploratory data analysis, cluster analysis and visualization methods to investigate patterns and structures; the use of modern approaches to classification and prediction to identify disease states or motifs, with the help of methods like the ‘least absolute shrinkage and selection operator’ (Lasso), which we shall describe in §3, to handle scenarios involving very large numbers of explanatory variables or ‘predictors’; the application of latent-variable models such as hidden Markov models for DNA sequencing; and the need to overcome problems associated with multiple testing when investigation of a large number of variables requires the performance of a large number of statistical tests. This last issue is the main theme of Benjamini *et al.* (in press), the other biologically oriented paper in the batch. They apply the highly influential false-discovery-rate approach, see §5, in the context of type 2 diabetes.

## 3. From simple linear regression to high dimension

In this section, we attempt to indicate the nature of the general issue of high dimension by starting with a very elementary statistical model and showing how a straightforward process of evolution quickly takes us into realms where the difficulties of high dimension become clear.

### (a) The traditional scenario

A very simple statistical model can be described as follows. We have data in the form of a pair of measurements on *n* individuals, {*x*_{i},*y*_{i};*i* = 1,…,*n*}, where *x*_{i} is a predictor and *y*_{i} is a response, and it is assumed that the two measurements are related by
for each *i*. It is also assumed that, independently for each *i*, ε_{i}∼*N*(0,σ^{2}); that is, ε_{i} follows a Gaussian distribution with mean zero and variance σ^{2}. Thus, the ‘average’ relationship between the predictor and the response follows a straight line with intercept *β*_{1} and slope *β*_{2}; this is written as
and is referred to as the simple linear regression model for *y* on *x*, ‘simple’ because there is only one predictor. The two ‘parameters’, *β*_{1} and *β*_{2}, are unknown constants to be estimated. In one possible application, the predictor might be ‘age’ and the response might be ‘systolic blood pressure’. (Typically, σ^{2} would also be unknown, but for simplicity here we take it as known.)

There is a convenient vector–matrix notation for the model for the complete set of *n* data pairs,
3.1where the *n*×1 vector *y* contains the responses, the vector *β* contains the two (in general *p*) parameters except for σ^{2}, the *n*×1 vector ε contains the ‘noise’ and the *n*×*p* so-called design matrix *X* completes the model. In simple linear regression, each element in the first column of *X* is 1.

The standard way of estimating the unknown slope and intercept in *β* is to use Gauss’s least-squares approach and obtain
which means that is the minimizer of the sum of squares function on the right-hand side. In the general vector–matrix notation, this can be written in terms of the Euclidean or ℓ_{2} norm, as
There is another important interpretation of . Our distributional assumption about ε implies that *y*∼*N*_{n}(*Xβ*,σ^{2}*I*), in which *N*_{n} now denotes an *n*-variate multi-variate Gaussian distribution, with *Xβ* as the vector of means and σ^{2}*I* as the covariance matrix, and where *I* is the *n*×*n* identity matrix. The probability density function for *y* is then
The available data provide *y* and *X*. When viewed as a function of the parameters, this is now called the likelihood function, and, clearly,
Thus, are the so-called ‘maximum-likelihood estimators’ of *β*; the use of maximum-likelihood estimators is a very common paradigm for statistical inference.

In our problem, satisfies
and
the explicit formula in the second equation being available *provided thatX*^{T}*Xcan be inverted*. (Here *X*^{T} is the matrix transpose of *X*.) Furthermore, if the model is correct,
3.2from which (frequentist) interval estimates for *β* can be obtained, with slight modification if, as is usually the case, σ^{2} has to be estimated.

In general, for many maximum-likelihood scenarios concerning parametric models involving a fixed number of parameters *β*, for large *n*, approximately if rarely exactly,
for a certain matrix . Thus, is asymptotically unbiased, in that on average it does not overestimate or underestimate *β*, and normally distributed.

### (b) A high-dimensional reality check and what to do about it

For the above elegant and simple analysis, we must have *p*≤*n*, otherwise (*X*^{T}*X*) is singular and the parameters in the regression model cannot be uniquely estimated. Furthermore, in the general maximum-likelihood contexts, especially if *p* is not fixed, the asymptotic theory breaks down. What if *p*>*n* or even *p*≫*n* in the regression problem, i.e. *p* is much greater than *n*? One approach to side-stepping the singularity of (*X*^{T}*X*) is to use a method of *regularization*, otherwise known as *penalized least squares* or *penalized maximum likelihood*. An early example of this is *ridge regression* (Hoerl & Kennard 1970), in which we estimate *β* by
where *S*_{λ2} = (*X*^{T}*X*+λ_{2}*I*)^{−1}, and the positive scalar λ_{2} is called a ridge parameter or regularization constant. The frequency distribution of is
3.3Thus the estimator is now biased, but it can be calculated; as λ_{2} increases, bias increases, but ‘variance’ decreases, to compensate. There are a number of interpretations for :

(i) ,

(ii) minimizes subject to , for some

*c*_{2}(λ_{2}), depending on λ_{2}, and(iii) minimizes subject to , for some

*b*_{2}(λ_{2}), depending on λ_{2}.

The first interpretation shows that corresponds to what is called ℓ_{2} regularization because the penalty function is given by the ℓ_{2} or quadratic norm. (It also has an interpretation in the Bayesian approach to statistical analysis, as discussed in §7.) In the other interpretations, λ_{2} or its inverse is a Lagrange multiplier.

However, although invertible, *X*^{T}*X*+λ_{2}*I* is *p*×*p* and still potentially a very large matrix. A dominant strategy in current approaches to this sort of difficulty is to try to exploit *sparsity*; in other words, to seek a solution for *β* in which many of the elements are zero. After all, if *n*<*p*, it is intuitively plausible that only sparse solutions can be obtained ‘reliably’. Furthermore, in practice, if there are vast numbers of predictors, it is often scientifically plausible that only a small proportion are likely to be influential as predictors. With this in mind, we try changing the penalty function and consider what is called ℓ_{0} regularization. Our three equivalent formulations are now as follows:

(i) , where ∥

*β*∥_{0}, the number of non-zero elements in*β*, is the ℓ_{0}norm of*β*,(ii) minimizes subject to ∥

*β*∥_{0}≤*c*_{0}(λ_{0}), and(iii) minimizes ∥

*β*∥_{0}subject to .

The problem about implementing this approach is its combinatorial character: there is no alternative to considering, individually, each configuration of zero and non-zero values in *β*, and this leads to unacceptable computational complexity and a potential proliferation of local optima.

An intermediate strategy is to base the penalty function on the ℓ_{1} norm,
leading to the following equivalent formulations:

(i) , for some λ

_{1},(ii) minimizes subject to ∥

*β*∥_{1}≤*c*_{1}(λ_{1}), and(iii) minimizes ∥

*β*∥_{1}subject to .

The subscript L is chosen because this method is called the Lasso (Tibshirani 1996). The method has the dual advantages that it is computationally feasible, fitting the paradigm of quadratic programming, and generally leads to sparse solutions. The latter can be explained informally in the context of the second formulation. The solution will be at the point of contact of a ‘smooth’ residual sum of squares function and a convex, piecewise-flat constraint surface. The point of contact is very likely to be at a vertex of the constraint surface and therefore at a point where elements of *β* are zero. As with ℓ_{2} regularization, the Lasso has a Bayesian interpretation; see §7.

Some strong support for the Lasso strategy comes from considering noise-free versions of the problem: minimize ∥*β*∥_{0} or ∥*β*∥_{1} subject to the equality constraint *y* = *Xβ*. If the solution to the ℓ_{0} problem is sufficiently sparse—having at most a sufficiently small number *k* non-zero entries, say—then the solutions to the ℓ_{1} and ℓ_{0} problems coincide; see, for example, Candès & Tao (2005) and Donoho (2006).

The main aims of any investigation of sparsity in the linear model can be described informally as follows:

(i) to identify exactly which components of

*β*are non-zero, i.e. the*support*of*β*,(ii) to estimate reliably the true values of the non-zero components, assuming that the model is correct, and

(iii) to do this ‘as well’ as one could be expected to do if the identities of the non-zero components were known from the beginning, the so-called

*oracle*property.

We now describe briefly, at a very informal level, a few particular results along these lines and we give pointers to relevant papers later in this issue.

In their Dantzig Selector, Candès & Tao (2007) choose to minimize ∥*β*∥_{1} subject to in which *v*_{i} denotes the *i*th element of a vector *v*. Note that the residuals, *y*−*Xβ*, enter into the formulation directly rather than being squared. Let the number of non-zero elements of the true *β* be *k*. Then, if *k*<*n*/2, a version of the oracle property is obtained with overwhelming probability. The problem is amenable to algorithms based on linear programming, and the method was named in honour of the inventor of the simplex method for solving linear programmes.

Bickel *et al.* (2009) show that, under a sparsity scenario, the Lasso and Dantzig selector exhibits similar behaviour in both linear and non-parametric regression models, and satisfy parallel sparsity oracle inequalities.

For the Lasso itself, Wainwright (2006) investigates the probability, Prob(success), that the Lasso successfully identifies the correct support set of *β*, as a function of *n*, *p* and *k*, under certain assumptions about *X*. In the limit, the results are, informally, as follows:

(i) if

*n*>2*k*log(*p*−*k*), then Prob(success) tends to 1,(ii) if

*n*<(1/2)*k*log(*p*−*k*), then Prob(success) tends to 0, and(iii) as

*n*increases from to , the limiting value of Prob(success) increases from 0 to 1.

Donoho & Tanner (in press) show that sparse linear models can exhibit a transition in behaviour illustrated by a phase diagram. As one example, consider a noise-free setting, in which *y* = *Xβ*, and the matrix *X* is chosen at random, for example with independent Gaussian rows, or from the rows of a discrete Fourier transform matrix. Let ρ = *k*/*n*, the ratio of the cardinality of the support of *β* and the sample size, and let *δ* = *n*/*p*. Then, for a given value of *δ*<1, there is a threshold, ρ*(*δ*), such that, for ρ<ρ*(*δ*), one is virtually certain to be able to identify the non-zero elements of *β* and otherwise such identification will almost never occur. As *δ* increases, then so does the threshold, *d*(*δ*), resulting in an intriguing phase diagram in the ρ−*δ* plane.

A fascinating property of such diagrams is that closely related versions of them appear in apparently unconnected settings. In geometrical probability, one can look at the convex hull of *n* points in *d*-dimensional space. In high-dimensional regression of the form (3.1), one can study the performance of traditional stepwise regression methods for entering variables. In high-dimensional geometry, one can ask what happens to the faces of a simplex under random projections *A* on to lower dimensional subspaces. Donoho & Tanner (in press) review the deep connections between these and yet other settings in which a phase transition occurs. They use a large-scale computational experiment to support a conjectured ‘universality’ result that would establish such transition behaviour for a variety of sampling models for the random projection *A*.

## 4. More general subspace selection and dimension reduction

The identification of sparse *β*s in effect allows us to discard variables, leaving an informative *k*-dimensional linear subspace of the space of predictors, although of a rather special nature; if there are two predictors, *p* = 2, and if *k* = 1, the reduced space would be one of the coordinate axes. (Other modern approaches to the selection of variables, especially from a very large number of candidates, are described in the text and discussion of Fan & Lv (2008).) However, there are plenty of other one-dimensional linear subspaces of the plane, i.e. all straight lines are candidates. Why not, therefore, seek different types of informative *k*-dimensional linear subspace? The so-called *sufficient dimension reductions* reduce the effective dimension of the data space without loss of information about the distribution of *y*|*X*, in some sense, in which ‘|’ indicates conditioning. There are various ways of defining a sufficent reduction, *R*(*X*). The first approach has the property that the distribution of *y*|*R*(*X*) is the same as that of *y*|*X*; in other words, no information is lost from the (forward) regression of *Y* on *X* by restricting oneself to conditioning on *R*(*X*). In the second approach, related to the idea of inverse regression, the aim is to choose *R*(*X*) so that the distribution of *X*|*R*(*X*) is the same as that of *X*|{*y*,*R*(*X*)}. Finally, in a ‘joint’ approach, *R*(*X*) is sought such that *y* is independent of *X* given *R*(*X*). These options are identified by Cook (2007), the leading figure in this area, who takes the inverse-regression approach, proposing a model of the form
with Γ a *p*×*k* matrix, with the goal that *X*_{y} has the same distribution as that of *X* given *y*. Here, the vector *z*_{y} contains so-called ‘latent’ or ‘factor’ variables. In the present issue, Adragni & Cook (in press) rehearse and expand on these concepts in detail, as well as announcing new methodology for prediction.

All the discussion hitherto has been about the *linear* model (3.1) in which the part of the model that defines the means of the responses is linear in the parameters *β*. There is a vast array of other regression models that generalize the linear case, including nonlinear parametric models in which the mean function is defined explicitly as a nonlinear function of parameters, and various types of semiparametric or nonparametric approach. The so-called ‘errors’, like ε in equation (3.1), need not be normally distributed and need not enter the model additively. Traditionally, as with the linear model, these other approaches have been applied in contexts where the number of experimental units has comfortably exceeded the number of explanatory variables, i.e. *n*>*p*, but the complementary scenario has increasingly attracted attention. Here we limit consideration to two particular manifestations.

Ravikumar *et al*.’s (2007) SpAM method, an abbreviation of ‘Sparse Additive Models’, aims to estimate a function rather than a vector, using so-called general additive models (Hastie & Tibshirani 1990). For the *i*th observational unit, the model for the response is
in which *x*_{ij} is the value of the *j*th explanatory variable for the *i*th observational unit. The {*f*_{j}} are rather arbitrary functions, chosen to minimize the residual sum of squares but subject to constraints that both encourage sparsity and render the optimization problem convex. When *p* is very large but it is assumed that only a small subset of the explanatory variables truly contribute to the model, i.e. sparsity obtains, the objective is to identify those variables as well as possible.

Chipman *et al.* (2009) propose Bayesian additive regression trees, in which the regression function is the sum of ‘trees’,
each *T*_{j} representing a tree structure involving a number, usually very small, of the explanatory variables.

We leave the regression setting and turn to some other traditional areas of multi-variate statistical analysis. We describe briefly two active areas of development: investigation of the properties of classical methods when the number of variables is large, and development of new, typically nonlinear variants of these methods that respond to high dimension.

In the first direction, it is a striking fact that most of the standard techniques of classical multi-variate analysis—principal components analysis, canonical correlation analysis, multi-variate analysis of variance, discriminant analysis and so forth—are based on the eigenanalysis of sample covariance matrices. Under Gaussian assumptions for the distributions of the sampled data, and under the symmetry assumptions characteristic of null hypotheses, the sampling densities of the eigenvalues of these matrices have precisely the laws arising in the classical orthogonal polynomial ensembles of random matrix theory. The natural asymptotics in random matrix theory—reflecting its origins in the many-particle models of statistical and nuclear physics—allow the number of variables to become large. This leads to limiting distributions (Marčenko–Pastur, Tracy–Widom) that are new for multi-variate statistics and to approximations for the distributions of extreme eigenvalues that can work well even when the number of variables is small; see Johnstone (2007, 2008) and the references therein.

Ideas from random matrix theory are also exerting active influence on a variety of other problems connected with large covariance matrices. A rich set of examples may be found in a recent special issue of the *Annals of Statistics* devoted to the topic (Bickel 2008). For example, an unknown covariance matrix for observations on *p* variables in principle has *p*(*p*+1)/2 unknown parameters and the sample covariance matrix is a poor estimate when *p* is large. Strategies to exploit sparsity are essential, for example by direct thresholding (Bickel & Levina 2008) or by exploiting sparse covariance models (El Karoui 2008). Estimation of the leading *eigenvectors* associated with large covariance matrices can encounter problems with consistency unless sparsity is assumed, and is attracting attention both in statistics, see, for example, Nadler (2008) and Johnstone & Lu (2009), and in econometrics (Onatski 2009).

We now focus on the second direction, methods of dimension reduction: given data *x*_{1},…,*x*_{n} in , find representations *y*_{1},…,*y*_{n} in for *m*≪*p* that preserve as much of the relevant information as possible. Of course, if *m* is 3 or less, one can then visualize the reduced data. The traditional approach to this problem is via principal components analysis, or its close relation, multi-dimensional scaling, which uses the leading eigenvectors of the sample covariance matrix of (*x*_{i}) (e.g. Jolliffe 2002). Geometrically, this corresponds to finding the *m*-dimensional linear subspace into which the projections of *x*_{i} have the largest variance.

In recent years, attention has been directed at settings in which a *nonlinear* manifold of low dimension might provide a better representation. For example, a collection of images of similar objects is nominally very high-dimensional (corresponding to the number of pixels in each image), but in fact potentially described by the variation of a much smaller number of parameters. Some influential approaches to this problem include ISOMAP (Tenenbaum *et al.* 2000), local linear embedding (Roweis & Saul 2000), Laplacian eigenmaps (Belkin & Niyogi 2003) and Hessian eigenmaps (Donoho & Grimes 2003).

These nonlinear representations can also be described as spectral methods for dimension reduction, because each involves finding a small number of extreme eigenvalue/vector pairs for a suitable matrix derived from the local characteristics of the data (*x*_{i}). These and other methods are reviewed by Belabbas & Wolfe (in press) in this volume.

The computational burden of solving an eigenproblem can be significant if is large. As one approach to reducing this burden, Belabbas & Wolfe (in press) compare algorithms for selecting a subset of the *n* data points (‘landmark selection’), and performing an approximate spectral analysis known as the Nyström extension.

## 5. Classification

A variation of the regression problem is that of ‘classification’, also known as ‘discriminant analysis’ or ‘statistical pattern recognition’, in which the class label *y*_{i} is categorical or in particular binary: it might denote an image type or a disease category. Here, the main objective is to estimate a classification rule for a future ‘patient’ on the basis of their *x*_{i}, which represents the patient’s medical history, given ‘training data’ from people whose disease categories and medical histories are known. This scenario is known as ‘supervised learning’, whereas any situation in which the disease categories of even the training set are not available corresponds to ‘unsupervised learning’. There are many approaches to the classification problem, from both statistics and machine learning, the latter including the so-called ‘support vector machine’ (SVM); see, for example, Hastie *et al.* (2009). The large *p*, small *n* problem arises here too, and the method can also be described in terms of constrained optimization. In the simplest version of the problem, there are two disease categories and the training data consist, geometrically, of two clouds of points that can be separated by at least one hyperplane. The SVM simply identifies a hyperplane such that the resulting degree of separation is, informally speaking, as large as possible. Many variations of the method have been developed, to cope with non-planar separating surfaces, cases in which the two point clouds overlap, and so on.

So far as probabilistic modelling in classification is concerned, there are two approaches, corresponding to two possible factorizations of *p*(*x*,*y*),
and
where (θ_{1},θ_{2}) and (ϕ_{1},ϕ_{2}) denote two parametrizations, with θ_{1} distinct from θ_{2} and ϕ_{1} distinct from ϕ_{2}. In the modern machine-learning literature, these approaches are called the ‘generative’ and ‘discriminative’ approaches, respectively. Much earlier, Dawid (1976) called them the ‘sampling’ and ‘diagnostic’ paradigms. Which approach is selected is crucial in so-called ‘semisupervised’ situations in which the available database contains both diagnosed and undiagnosed cases. From the point of view of diagnosing new patients, the conditional distribution *p*(*y*|*x*) is the key tool. If the θ version is adopted, then the undiagnosed cases do contribute information about *p*(*y*|*x*), but if one uses the ϕ version they do not. There is much current work revisiting this issue in the machine-learning literature, under key words such as ‘domain adaptation’ as well as ‘semisupervised learning’.

The selection of good class predictors from among many raises problems incurred in ‘multiple testing’. In microarray analysis, for example, we could be comparing expression levels of *p* = 20 000 genes obtained from people from two populations, e.g. diseased and healthy, and wish to decide which genes respond differently in the two populations. It is natural to examine , the difference in the average values of the expression levels of the *j*th gene, in the members of the two disease classes in the available database, for *j* = 1,…,20 000, and to select those genes for which there is a ‘significant difference’. The following problems arise: first, with a conventional significance level of 5 per cent we should expect about 1000 significant results even when there is no real difference; secondly, results for different genes may well be correlated. There is much recent work under this banner of multiple testing on the same dataset, including methods for controlling the so-called ‘false discovery rate’ (Benjamini & Hochberg 1995). The actual false discovery rate is the number of ‘null hypotheses’ wrongly rejected divided by the total number of null hypotheses rejected. In practice, of course, this proportion is not knowable, so one aims to control its expectation. As mentioned in §2, this topic is the subject, in this issue, of Benjamini *et al.* (in press).

The problem of distinguishing between two classes on the basis of noise-filled, high-dimensional data is one that lends itself to the theoretical challenge described earlier of delimiting the limits of the possible. Perhaps the simplest idealization runs as follows. The *n* labels *y*_{i} are either −1 or +1 and the feature vectors *x*_{i} have *p* components *x*_{ij}. The components *x*_{ij} are assumed to be independently normally distributed, with means *y*_{i}*δ*_{j} and variance 1. The coordinate separations *δ*_{j} are mostly zero, that is for all but a small number of components *j* whose identities are unknown—more precisely, a proportion *p*^{−β} for 0<*β*<1. In those rare separated coordinates, the dependence on *n* and *p* is assumed to be given by

Two papers in this volume (Donoho & Jin in press; Ingster *et al.* in press) independently study this problem asymptotically. Whether or not successful classification is possible at all in this model turns out to depend in a precise way on the sparsity *β* and separation constant *r*. Both papers show that there is a ‘detection boundary’ *r* = *r*(*β*), or phase diagram, such that above the boundary—corresponding to greater separation *r* and less sparsity *β*—correct classification is possible, while below the boundary it is impossible, whatever be the method used.

Donoho & Jin (in press) in this volume, and Jin (2009) elsewhere, focus on a particular method for feature selection, based on the *higher criticism* approach to simultaneous inference associated with J. W. Tukey, while Ingster *et al.* (in press) consider a somewhat more general setting.

While this theoretical setting is certainly highly idealized, it is worth noting that a simple classifier based on higher criticism feature selection turns out to be completely competitive, if not superior, to standard classification methods including SVMs and random forests (Breiman 2001), at least on some standard test datasets as reported in Dettling (2004) and Donoho & Jin (2008).

## 6. Visualization and informal inference

The facility to *look* at data is vital, partly with a view to exploring the data and suggesting questions to be followed up more formally and partly as a way of checking assumptions that underlie a formal procedure being implemented, often after that procedure has been carried out and a model fitted; Buja *et al.* (in press) regard these two activities, respectively, as cases of ‘exploratory data analysis’ and ‘model diagnostics’. At a trivial level, in the context of the simple linear regression model of §3, a two-dimensional plot of responses against predictors will give an immediate warning if the straight-line model is implausible. It is hard to achieve such a direct look in high dimensions. However, the power of modern computers is there to be exploited. Dynamic graphics allow the display of a multitude of two-dimensional projections of three-dimensional data clouds by spinning the clouds, with the axis of rotation being chosen by the user. The so-called ‘grand tour’ generalizes this notion to the exploration of *p*-dimensional data clouds for larger values of *p*. The idea is to generate a space-filling tour of low-dimensional projections of the data, to be visualized as a movie-like presentation. Modifications of the grand tour allow the user to interact with the tour, for example, concentrating on classes of projections of particular interest. A different medium for the two-dimensional display of high-dimensional data is that of ‘parallel coordinates’. In the most basic form of this procedure for *p*-dimensional data, the *p* typically orthogonal axes are replaced by a set of *p* parallel axes, displayed in two dimensions. For a given observational unit, the values of the variables are marked off on the axes and joined up in a piecewise-linear way. This allows, for instance, ‘similar’ points to be recognized through their similar piecewise-linear plots. There are many refinements and variations of the basic idea.

For algorithmic and practical discussion of these and other visualization techniques, see Buja *et al.* (2005) and Wegman (2003), and, for treatments of the underlying mathematics, see Wegman & Solka (2002). These papers also provide links to the relevant software.

In most statistical practice, the consequence of data visualization, be it for exploratory analysis or model diagnostics, is typically an informal, non-quantitative assessment of the plausibility of some ‘hypothesis’, to be followed up by the performance of an appropriate formal test or construction of a confidence interval. In this Theme Issue, Buja *et al.* (in press) pursue the goal of adding quantification of the level of statistical significance of the sort of ‘visual discovery’ that plots and diagrams may provide.

It is in the spirit of data exploration that Banks *et al.* (in press) develop their data-analytic approach designed to exploit what they call *mixture sparsity*. For example, in the context of cluster analysis, they envisage that a dataset comprises a number of clusters, but that cluster membership indicators are missing, and that the different clusters may be best characterized by possibly different (small) subsets of variables. By ‘cherry-picking’ small numbers of variables in a way designed to identify mutually similar data points, they construct clusters one-by-one. Similarly, they tease out mixtures of regression models where, again, the different (sparse) regression models may involve different (small) sets of predictors.

## 7. A brief encounter with Bayesian statistics

We return to §3*b* and the discussion of the linear model as expressed in equation (3.1). Frequentist distributional results for estimators of *β* were given in equation (3.2) for the least-squares estimators and in equation (3.3) for the ridge-regression estimators. In addition, the ridge-regression estimators could be interpreted as penalized maximum-likelihood estimators. Here we provide another interpretation in the context of so-called Bayesian inference, in which probability distributions are developed for unknown parameters. Bayes’ theorem is used to relate the distribution *p*(*β*), assumed for *β* ‘prior’ to the data-gathering process, to the distribution *p*(*β*|*y*) ‘posterior’ to that process. To be specific,
so that, as a function of *β*,
that is,
7.1In the Bayesian approach, inference about *β* is based on *p*(*β*|*y*), historically a great source of controversy, essentially because of the combination within equation (7.1) of probability distributions of quite different natures, the likelihood term being frequentist-based and the prior not so. In addition, adoption of the approach often led to practical (computational) difficulties, to which we allude later. However, these problems do not arise in the context of the linear model in equation (3.1), if we are prepared to act as if, *a priori*,
and σ^{2} is known, for then the posterior distribution for *β* is a Gaussian distribution with mean, and indeed mode, given by . Thus, the ridge-regression estimator has the Bayesian interpretation of being a *maximum a posteriori* (MAP) estimator of *β*. Similarly, the Lasso has an MAP interpretation provided that the prior for *β* is

In other contexts, the analysis is not so straightforward, especially if there are missing or latent (unobservable) variables, *z*, as well as observed data *y*. Often (not always), analysis would be comparatively easy were *z* known, in that simple formulae would exist for *p*(*y*,*z*|*β*) and *p*(*β*|*y*,*z*). With *z* unknown, the key quantity of interest is *p*(*z*,*β*|*y*), the distribution of everything that is unknown given what is known, but this distribution is typically complicated. We briefly indicate two popular ways of handling this problem.

*Stochastic method.* Generate a large number of realizations from *p*(*z*,*β*|*y*), where
The corresponding realizations of *β* are a sample from *p*(*β*|*y*) and, for instance, their average would be a good estimator of the true posterior mean of *β*. During the last 25 years or so, a large repertoire of so-called Markov chain Monte Carlo methods, some based on antecedents from the statistical physics literature, has been developed for simulating the required realizations.

*Variational method.* Here the approach is to choose a deterministic approximation *q*_{y}(*z*,*β*) that is as close as possible to *p*(*z*,*β*|*y*), but that is of a more manageable structure than *p*(*z*,*β*|*y*). Here ‘closeness’ is measured by the Kullback–Leibler directed divergence, KL(*q*_{y},*p*), where *p* denotes *p*(*z*,*β*|*y*),
Key properties are that KL(*q*_{y},*p*)≥0, with equality essentially if and only if *q*_{y} and *p* are the same. The typical way of choosing a ‘more manageable structure’ is to assume a factorized form for *q*_{y}, that is,
choose the factors so as to minimize KL(*q*_{y},*p*) and use the resulting to approximate *p*(*β*|*y*).

This variational approach also has links to statistical physics, including concepts such as mean-field approximations. Its recent application to many implementations of the Bayesian approach has mainly developed in machine learning, see Bishop (2006) and references therein, rather than in the mainstream statistics literature, although not exclusively so (Titterington 2004; Beal & Ghahramani 2006).

In theory, the stochastic method is superior in that, if enough realizations are generated, one can get arbitrarily close to the target distribution. However, especially in problems on a very large scale, practical considerations render the stochastic approach non-viable and the variational method is a good pragmatic alternative.

Among the papers in this issue, Barber (in press) describes a particular application to the identification of clusters of books about US politics. In general, variational approximations have proved to be extremely useful in the analysis of graphical models, which involve structures of nodes and connecting edges with a wide range of architectures. As Barber (in press) indicates, such models are essential in contexts such as social networks, web analysis and bioinformatics, the models often necessarily display a high degree of complexity and efficient approximate methodology is essential; roughly speaking, the more connectivity there is in the graphical model, the more intractable the analysis thereof becomes.

## Acknowledgements

This research was supported in part by the Isaac Newton Institute for Mathematical Sciences, Cambridge, UK, and by grants NSF DMS 0505303 and NIH RO1 EB 001988.

## Footnotes

One contribution of 11 to a Theme Issue ‘Statistical challenges of high-dimensional data’.

- © 2009 The Royal Society