## Abstract

Dimension reduction for regression is a prominent issue today because technological advances now allow scientists to routinely formulate regressions in which the number of predictors is considerably larger than in the past. While several methods have been proposed to deal with such regressions, principal components (PCs) still seem to be the most widely used across the applied sciences. We give a broad overview of ideas underlying a particular class of methods for dimension reduction that includes PCs, along with an introduction to the corresponding methodology. New methods are proposed for prediction in regressions with many predictors.

## 1. Introduction

Consider the frequently encountered goal of determining a rule *m*(**x**) for predicting a future observation of a univariate response variable *Y* at the given value **x** of a *p* × 1 vector **X** of continuous predictors. Assuming that *Y* is quantitative, continuous or discrete, the mean-squared error *E*(*Y* − *m*(**x**))^{2} is minimized by choosing *m*(**x**) to be the mean *E*(*Y* | **X** = **x**) of the conditional distribution of *Y* | (**X** = **x**). Consequently, the prediction goal is often specialized immediately to the task of estimating the conditional mean function *E*(*Y* | **X**) from the regression of *Y* on **X**. When the response is categorical with sample space *S*_{Y} consisting of *h* categories *S*_{Y} = {*C*_{1}, …, *C*_{h}}, the mean function is no longer a relevant quantity for prediction. Instead, given an observation **x** on **X**, the predicted category *C*_{*} is usually taken to be the one with the largest conditional probability Pr(*Y* = *C*_{k} | **X** = **x**), where the maximization is over *S*_{Y}. When pursuing the estimation of *E*(*Y* | **X**) or Pr(*C*_{k} | **X**), it is nearly always worthwhile to consider predictions based on a function *R*(**X**) of dimension less than *p*, provided that it captures all of the information that **X** contains about *Y*, so that *E*(*Y* | **X**) = *E*(*Y* | *R*(**X**)). We can think of *R*(**X**) as a function that concentrates the relevant information in **X**. The act of replacing **X** with a lower-dimensional function *R*(**X**) is called *dimension reduction*; it is called *sufficient dimension reduction* when *R*(**X**) retains all the relevant information about *Y*. A potential advantage of sufficient dimension reduction is that predictions based on an estimated *R* may be substantially less variable than those based on **X**, without introducing worrisome bias. This advantage is not confined to predictions, but may accrue in other phases of a regression analysis as well.

One goal of this paper is to give a broad overview of ideas underlying sufficient dimension reduction for regression, along with an introduction to the corresponding methodology. Sections 1*a*,*b*, 2 and 3 are devoted largely to this review. Sufficient dimension-reduction methods are designed to estimate a population parameter called the central subspace, which is defined in §1*b*. Another goal of this paper is to describe a new method of predicting quantitative responses following sufficient dimension reduction; categorical responses will be discussed only for contrast. The focus of this paper shifts to prediction in §4, where we discuss four inverse regression models, describe the prediction methodology that stems from them and give simulation results to illustrate their behaviour. Practical implementation issues are discussed in §5, along with additional simulation results.

### (a) Dimension reduction

There are many methods available for estimating *E*(*Y* | **X**) based on a random sample (*Y*_{i}, **X**_{i}), *i* = 1, …, *n*, from the joint distribution of *Y* and **X**. If *p* is sufficiently small and *n* is sufficiently large, it may be possible to estimate *E*(*Y* | **X**) adequately by using non-parametric smoothing (see Wand & Jones 1995). Otherwise, nearly all techniques for estimating *E*(*Y* | **X**) employ some type of dimension reduction for **X**, either estimated or imposed as an intrinsic part of the model or method.

Broadly viewed, dimension reduction has always been a central statistical concept. In the second half of the nineteenth century, ‘reduction of observations’ was widely recognized as a core goal of statistical methodology, and principal components (PCs) were emerging as a general method for the reduction of multivariate observations (Adcock 1878). PCs were established as a first reductive method for regression by the mid-1900s.

Dimension reduction for regression is a prominent issue today because technological advances now allow scientists to routinely formulate regressions in which *p* is considerably larger than in the past. This has complicated the development and fitting of regression models. Experience has shown that the standard iterative paradigm for model development guided by diagnostics (Cook & Weisberg 1982, p. 7) can be imponderable when applied with too many predictors. An added complication arises when *p* is larger than the number of observations *n*, leading to the so-called ‘*n* < *p*’ problem. Standard methods of fitting and corresponding inference procedures may no longer be applicable in such regressions. These and related issues have caused a shift in the applied sciences towards a different regression genre with the goal of reducing the dimensionality of the predictor vector as a first step in the analysis. Although large-*p* regressions are perhaps mainly responsible for renewed interest, dimension-reduction methodology can be useful regardless of the size of *p*. For instance, it is often helpful to have an informative low-dimensional graphical summary of the regression to facilitate model building and gain insights. For this goal, *p* may be regarded as large when it exceeds 2 or 3 because these bounds represent the limits of our ability to view a dataset in full using computer graphics. Subsequent references to ‘large *p*’ in this paper do not necessarily imply that *n* < *p*.

Reduction by PCs is ubiquitous in the applied sciences, particularly in bioinformatics applications, where PCs have been called ‘eigen-genes’ (Alter *et al.* 2000) in microarray data analyses and ‘meta-*k*mers’ in analyses involving DNA motifs. A recent report reiterates and makes clear that past influential analyses of data on global warming are flawed because of an inappropriate use of the PC methodology.

While PCs seem to be the dominant method of dimension reduction across the applied sciences, there are many other established and recent statistical methods that might be used to address large-*p* regressions, including factor analysis, inverse regression estimation (IRE) (Cook & Ni 2005), partial least squares (PLS), projection pursuit, seeded reductions (Cook *et al.* 2007), kernel methods (Fukumizu *et al.* 2009) and sparse methods like the lasso (Tibshirani 1996) that are based on penalization.

### (b) Sufficient dimension reduction

Dimension reduction is a rather amorphous concept in statistics, changing its character and goals depending on context. Formulated specifically for regression, the following definition (Cook 2007) of a *sufficient reduction* will help in our pursuit of methods for reducing the dimension of **X** while en route to estimating *E*(*Y* | **X**).

### Definition 1.1.

A reduction *R*(**X**) : ℝ^{p} → ℝ^{q}, *q* ≤ *p*, is sufficient if it satisfies one of the following three statements:

(i) inverse reduction, **X** | (*Y*, *R*(**X**)) ∼ **X** | *R*(**X**);

(ii) forward reduction, *Y* | **X** ∼ *Y* | *R*(**X**);

(iii) joint reduction, **X** ⫫ *Y* | *R*(**X**),

where ⫫ indicates independence, ∼ means identically distributed and **A** | **B** refers to the random vector **A** given the vector **B**.

Each of the three conditions in this definition conveys the idea that the reduction *R*(**X**) carries all the information that **X** has about *Y* , and consequently all the information available to estimate *E*(*Y* | **X**). They are equivalent when (*Y*, **X**) has a joint distribution. In that case, we are free to determine a reduction inversely or jointly and then pass it to the conditional mean without additional structure: *E*(*Y* | **X**) = *E*(*Y* | *R*(**X**)). In some cases, there may be a direct connection between *R*(**X**) and *E*(*Y* | **X**). For instance, if (*Y*, **X**) follows a non-singular multivariate normal distribution, then *R*(**X**) = *E*(*Y* | **X**) is a sufficient reduction, *E*(*Y* | **X**) = *E*{*Y* | *E*(*Y* | **X**)}. This reduction is also *minimal sufficient*: if *T*(**X**) is any sufficient reduction, then *R* is a function of *T*. Further, because of the nature of the multivariate normal distribution, it can be expressed as a linear combination of the elements of **X**: *R* = *β*^{T}**X** is minimal sufficient for some vector ** β**.
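The multivariate normal case can be checked numerically: with (*Y*, **X**) jointly normal and centred, *E*(*Y* | **X** = **x**) = **C**^{T}*Σ*^{−1}**x** depends on **x** only through *β*^{T}**x** with *β* = *Σ*^{−1}**C**, so perturbing **x** in a direction orthogonal to the reduction leaves the conditional mean unchanged. A small numpy sketch with illustrative population values (the dimensions and parameters below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population parameters for a centred joint normal (Y, X), p = 4.
p = 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)     # var(X), positive definite
C = rng.standard_normal(p)          # cov(X, Y)
beta = np.linalg.solve(Sigma, C)    # beta = Sigma^{-1} C

def cond_mean(x):
    """E(Y | X = x) for the centred joint normal distribution."""
    return C @ np.linalg.solve(Sigma, x)

x1 = rng.standard_normal(p)
# Perturb x1 in a direction u with beta^T u = 0; the conditional mean is
# unchanged, so E(Y | X = x) depends on x only through R(x) = beta^T x.
v = rng.standard_normal(p)
u = v - beta * (beta @ v) / (beta @ beta)
x2 = x1 + u

diff = cond_mean(x1) - cond_mean(x2)
```

The same computation also verifies the identity *E*(*Y* | **X** = **x**) = *β*^{T}**x**, i.e. the conditional mean is itself the minimal sufficient reduction here.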

Inverse reduction by itself does not require the response *Y* to be random, and it is perhaps the only reasonable reductive route when *Y* is fixed by design. For instance, in discriminant analysis, **X**|*Y* is a random vector of features observed in one of a number of subpopulations indicated by the categorical response *Y* , and no discriminatory information will be lost if classifiers are restricted to *R*.

If we consider a generic statistical problem and reinterpret **X** as the total data *D* and *Y* as the parameter *θ*, then the condition for inverse reduction becomes *D*|(*θ*,*R*)∼*D*|*R* so that *R* is a sufficient statistic. In this way, the definition of a sufficient reduction encompasses Fisher’s (1922) classical definition of sufficiency. One difference is that sufficient statistics are observable, while a sufficient reduction may contain unknown parameters and thus needs to be estimated. For example, if (**X**,*Y*) follows a non-singular multivariate normal distribution, then *R*(**X**) = *β*^{T}**X**, and it is necessary to estimate ** β**.

In some regressions, *R*(**X**) may be a nonlinear function of **X**, and in extreme cases no reduction may be possible, so that all sufficient reductions are one-to-one functions of **X** and thus equivalent to *R*(**X**) = **X**. Most often, we encounter multi-dimensional reductions consisting of several linear combinations *R*(**X**) = *η*^{T}**X**, where *η* is an unknown *p* × *q* matrix, *q* ≤ *p*, that must be estimated from the data. Linear reductions may be imposed to facilitate progress, as in the moment-based approach reviewed in §3*a*. They can also arise as a natural consequence of modelling restrictions, as we will see in §3*b*. If *η*^{T}**X** is a sufficient linear reduction, then so is (*η***A**)^{T}**X** for any *q* × *q* full-rank matrix **A**. Consequently, only the subspace span(*η*) spanned by the columns of *η* can be identified; span(*η*) is called a *dimension-reduction subspace*. If span(*η*) is a dimension-reduction subspace, then so is span(*η*, *η*_{1}) for any *p* × *q*_{1} matrix *η*_{1}. If span(*η*_{1}) and span(*η*_{2}) are both dimension-reduction subspaces, then under mild conditions so is their intersection (Cook 1996, 1998). Consequently, the inferential target in sufficient dimension reduction is often taken to be the central subspace 𝒮_{*Y*|**X**}, defined as the intersection of all dimension-reduction subspaces (Cook 1994, 1996, 1998). A *minimal sufficient linear reduction* is then of the form *R*(**X**) = *η*^{T}**X**, where the columns of *η* now form a basis for 𝒮_{*Y*|**X**}. We assume throughout this paper that the central subspace exists, and we use *d* = dim(𝒮_{*Y*|**X**}) to denote its dimension.

The ideas of a sufficient reduction and the central subspace can be used to further our understanding of existing methodology and to guide the development of new methodology. In §§2 and 3, we consider how sufficient reductions arise in three contexts: forward linear regression, inverse moment-based reduction and inverse model-based reduction.

## 2. Reduction in forward linear regression

The standard linear regression model *Y* = *β*_{0} + *β*^{T}**X** + ϵ, with ϵ ⫫ **X** and *E*(ϵ) = 0, implies that *Y* ⫫ **X** | *β*^{T}**X** and thus that *R*(**X**) = *β*^{T}**X** is minimal sufficient. The assumption of a linear regression then automatically focuses our interest on *β*, which can be estimated straightforwardly using ordinary least squares (OLS) when *n* is sufficiently large, and it may appear that there is little to be gained from dimension reduction. However, dimension reduction has been used in linear regression to improve on the OLS estimator of *β* and to deal with *n* < *p* regressions. One approach consists of regressing *Y* on **X** in two steps. The first is the reduction step: reduce **X** linearly to **G**^{T}**X** using some methodology that produces a *p* × *q* matrix **G**, *q* ≤ *p*. The second step consists of using OLS to estimate the mean function *E*(*Y* | **G**^{T}**X**) for the reduced predictors. To describe the resulting estimator of *β* and establish notation for later sections, let 𝕐 be the *n* × 1 vector of centred responses *Y*_{i} − Ȳ, let X̄ denote the sample mean vector, let 𝕏 be the *n* × *p* matrix with rows (**X**_{i} − X̄)^{T}, *i* = 1, …, *n*, let Σ̂ = 𝕏^{T}𝕏/*n* denote the usual estimator of *Σ* = var(**X**), let Ĉ = 𝕏^{T}𝕐/*n*, which is the usual estimator of **C** = cov(**X**, *Y*), and let β̂_{ols} = Σ̂^{−1}Ĉ be the vector of coefficients from the OLS fit of *Y* on **X**. Then (Cook & Forzani 2009*b*)

β̂_{**G**} = **G**(**G**^{T}Σ̂**G**)^{−1}**G**^{T}Ĉ. (2.1)

This estimator, which is the projection of β̂_{ols} onto span(**G**) in the Σ̂ inner product, does not require computation of Σ̂^{−1} if *q* < *p* and thus could be useful when *n* < *p*, depending on the size of *q*. In any case, the estimator of *E*(*Y* | **X**) is

Ê(*Y* | **X**) = Ȳ + β̂_{**G**}^{T}(**X** − X̄). (2.2)

If **G** = **I**_{p}, then β̂_{**G**} = β̂_{ols}, which achieves nothing beyond OLS. If we choose the columns of **G** to be the first *q* eigenvectors of Σ̂, then **G**^{T}**X** consists of the first *q* PCs and β̂_{**G**} is the standard principal component regression (PCR) estimator. Setting **G** = (Ĉ, Σ̂Ĉ, …, Σ̂^{q−1}Ĉ) yields the PLS estimator with *q* factors (Helland 1990). Eliminating predictors by using an information criterion like Akaike's AIC or the Bayesian BIC (§5*a*) can result in a **G** with rows selected from the identity matrix **I**_{p}, and again we obtain a reduction in **X** prior to estimation of *β*. If span(**G**) is a consistent estimator of a dimension-reduction subspace, then β̂_{**G**} may be a reasonable estimator of *β* because, in the limit, span(*β*) ⊆ span(**G**), recalling that the intersection of any two dimension-reduction subspaces is itself a dimension-reduction subspace. However, while these estimators are well known, span(**G**) may not be a consistent estimator of a dimension-reduction subspace without additional structure, even if the linear model is accurate. The PCR estimator depends on **G** only through the marginal distribution of **X**, and this alone cannot guarantee that span(**G**) is consistent. The performance of the PLS estimator depends on the relationship between **C** and the eigenstructure of *Σ* (Naik & Tsai 2000).
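The two-step estimator (2.1) and the standard choices of **G** can be sketched in numpy. Everything below (dimensions, the data-generating coefficients, the noise level) is illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 200, 6, 3

# Illustrative forward linear model Y = beta^T X + error.
X = rng.standard_normal((n, p)) @ np.diag([3, 2, 1.5, 1, 1, 1])
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0])
Y = X @ beta_true + 0.5 * rng.standard_normal(n)

Xc = X - X.mean(axis=0)          # centred predictor matrix
Yc = Y - Y.mean()                # centred responses
Sigma_hat = Xc.T @ Xc / n        # Sigma-hat
C_hat = Xc.T @ Yc / n            # C-hat

def beta_G(G):
    """Equation (2.1): project the OLS coefficients onto span(G)."""
    return G @ np.linalg.solve(G.T @ Sigma_hat @ G, G.T @ C_hat)

beta_ols = np.linalg.solve(Sigma_hat, C_hat)

# G = I_p recovers OLS exactly.
b_identity = beta_G(np.eye(p))

# PCR: columns of G are the first q eigenvectors of Sigma-hat.
evals, evecs = np.linalg.eigh(Sigma_hat)
G_pcr = evecs[:, ::-1][:, :q]
b_pcr = beta_G(G_pcr)

# PLS: G is the Krylov matrix (C, Sigma C, ..., Sigma^{q-1} C).
G_pls = np.column_stack(
    [np.linalg.matrix_power(Sigma_hat, k) @ C_hat for k in range(q)]
)
b_pls = beta_G(G_pls)
```

Note that only **G**^{T}Σ̂**G**, a *q* × *q* matrix, is inverted, which is what makes (2.1) usable when *n* < *p*.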

A somewhat different approach is based on estimating *β* by using a penalized objective function like that for the lasso (Tibshirani 1996). The lasso estimator is

β̂_{lasso} = arg min_{*β*} {‖𝕐 − 𝕏*β*‖^{2} + λ Σ_{j=1}^{p} |*β*_{j}|},

where *β*_{j} is the *j*th element of *β*, *j* = 1, …, *p*, and the tuning parameter λ is often chosen by cross-validation. Several elements of β̂_{lasso} are typically zero, which corresponds to setting the rows of **G** to be the rows of the identity matrix **I**_{p} corresponding to the non-zero elements of β̂_{lasso}. However, with this **G**, we do not necessarily have β̂_{lasso} = β̂_{**G**}, although the two estimators are often similar. Consequently, the methodology based on penalization does not fit exactly the general form given in equation (2.1).
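A minimal coordinate-descent sketch of the lasso shows the point numerically: the non-zero pattern of the lasso fit defines a **G** with rows from **I**_{p}, but refitting OLS on the selected predictors gives a different (unshrunken) estimator. The objective scaling, data and tuning value below are illustrative; in practice one would use a tested implementation such as glmnet or scikit-learn:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 8

# Illustrative sparse linear model.
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
Y = X @ beta_true + 0.5 * rng.standard_normal(n)

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()

def lasso(Xc, Yc, lam, sweeps=200):
    """Coordinate descent for 0.5 * ||Y - X b||^2 + lam * sum_j |b_j|."""
    b = np.zeros(Xc.shape[1])
    for _ in range(sweeps):
        for j in range(Xc.shape[1]):
            r = Yc - Xc @ b + Xc[:, j] * b[j]   # partial residual excluding j
            rho = Xc[:, j] @ r
            # Soft-thresholding update for coordinate j.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (Xc[:, j] @ Xc[:, j])
    return b

b_lasso = lasso(Xc, Yc, lam=30.0)
support = np.flatnonzero(b_lasso)               # selected predictors -> rows of G
# OLS refit on the selected predictors: the beta_G implied by this G.
b_refit = np.zeros(p)
b_refit[support] = np.linalg.lstsq(Xc[:, support], Yc, rcond=None)[0]
```

The refit coefficients are not shrunk towards zero, so `b_lasso` and `b_refit` generally differ, which is the sense in which penalization does not fit the form (2.1).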

Pursuing dimension reduction based on linear regression may not produce useful results if the model is not accurate, particularly if the distribution of *Y* | **X** depends on more than one linear combination of the predictors. Many diagnostic and remedial methods are available to improve linear regression models when *p* is not too large; when *p* is large, applying these methods can be quite burdensome.

## 3. Inverse reduction

Inverse regression **X** | *Y* provides an alternative approach to estimating a sufficient reduction. It can deal straightforwardly with regressions that depend on more than one linear combination of the predictors, and does not necessarily suffer from the modelling problems that plague forward regression when *p* is large. There are two general paradigms for determining a sufficient reduction inversely. The first is by specifying a parametric model for the inverse regression of **X** on *Y*, as discussed in §§3*b* and 4. In this model-based approach, minimal sufficient reductions can in principle be determined from the model itself. For example, we saw previously that *E*(*Y* | **X**) is a sufficient reduction when (*Y*, **X**) is normally distributed. The second, discussed in §3*a*, is the moment-based approach, in which derived moment relations are used to estimate a sufficient reduction by way of the central subspace. Model accuracy is nearly always an issue in the model-based approach, while efficiency is worrisome in the moment-based approach. Some of the best moment-based methods have turned out to be quite inefficient in relatively simple settings (see Cook & Forzani 2009*a*, for an instance of this inefficiency).

### (a) Moment-based inverse reduction

In contrast to model-based reduction, there is no law to guide the choice of *R*(**X**) in moment-based reduction. However, progress is still possible by restricting consideration to multi-dimensional linear reduction and pursuing estimation of the central subspace 𝒮_{*Y*|**X**}, as discussed in §1*b*. Recall that *d* = dim(𝒮_{*Y*|**X**}) and that the columns of the *p* × *d* matrix *η* form a basis for 𝒮_{*Y*|**X**}.

Sliced inverse regression (SIR; Li 1991) and sliced average variance estimation (SAVE; Cook & Weisberg 1991) were the first moment-based methods proposed for dimension reduction. Although the concept of the central subspace was not developed until a few years after SIR and SAVE were introduced, it is now known that these methods in fact estimate 𝒮_{*Y*|**X**} under two key conditions: (i) *E*(**X** | *η*^{T}**X**) is a linear function of *η*^{T}**X** (*linearity condition*) and (ii) var(**X** | *η*^{T}**X**) is a non-random matrix (*constant covariance condition*). We forgo discussion of these conditions, which involve only the marginal distribution of **X**, as they are well known and widely regarded as mild. A good recent discussion of them was given by Li & Wang (2007). Under the linearity condition, *Σ*^{−1}{*E*(**X** | *Y*) − *E*(**X**)} ∈ 𝒮_{*Y*|**X**} for each value of *Y*, which is the population foundation for SIR. Under the linearity and constant covariance conditions, the columns of *Σ*^{−1}{*Σ* − var(**X** | *Y*)} also lie in 𝒮_{*Y*|**X**}, which is the population basis for SAVE. When the response is categorical, *E*(**X** | *C*_{k}) can be estimated straightforwardly as the average predictor vector in category *C*_{k}. The SIR estimator of 𝒮_{*Y*|**X**}, which requires *n* > *p*, is then the span of the first *d* eigenvectors of Σ̂^{−1}M̂M̂^{T}, where M̂ is the *p* × *h* matrix whose *k*th column is the centred category mean Ê(**X** | *C*_{k}) − X̄ weighted by the square root of the fraction of observations in *C*_{k}. Continuous responses are treated by slicing the observed range of *Y* into *h* categories *C*_{k} and then applying the method for a categorical response. The SAVE estimator uses a similar construction. Routines for computing SIR, SAVE and other moment-based estimates of 𝒮_{*Y*|**X**} are available in the R package ‘dr’ (http://cran.us.r-project.org/web/packages/dr/index.html) and in the Arc software (www.stat.umn.edu/arc/software.html).
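The SIR steps (slice the response, average the predictors within slices, eigen-decompose the weighted kernel) can be written in a few lines. This is an illustrative numpy version, not the ‘dr’ package’s implementation; the single-index model used to exercise it is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d, h = 2000, 6, 1, 10

# Illustrative single-index model: Y depends on X through one direction.
X = rng.standard_normal((n, p))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
Y = X @ beta + 0.5 * rng.standard_normal(n)

def sir(X, Y, d, h):
    """Slice Y, average X within slices, eigen-decompose Sigma^{-1} M."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n
    slices = np.array_split(np.argsort(Y), h)    # h slices of near-equal size
    M = np.zeros((p, p))
    for idx in slices:
        m = Xc[idx].mean(axis=0)                 # centred slice mean of X
        M += (len(idx) / n) * np.outer(m, m)     # weighted SIR kernel
    vals, vecs = np.linalg.eig(np.linalg.solve(Sigma, M))
    top = np.argsort(vals.real)[::-1][:d]
    return vecs[:, top].real                     # central-subspace basis estimate

B = sir(X, Y, d, h)
cosine = abs(B[:, 0] @ beta) / (np.linalg.norm(B[:, 0]) * np.linalg.norm(beta))
```

With a strong linear trend as here, the leading SIR direction aligns closely with the true direction; only the span is identified, so the sign of the estimate is arbitrary.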

Both SIR and SAVE provide consistent estimators of 𝒮_{*Y*|**X**} under standard conditions, but by itself consistency does not guarantee good performance in practice. It is known that SIR has difficulty finding directions that are associated with certain types of nonlinear trends in *E*(*Y* | **X**). SAVE was developed in response to this limitation, but its ability to find linear trends is generally inferior to SIR’s. Several moment-based methods have been developed in an effort to improve on the estimates of 𝒮_{*Y*|**X**} provided by SIR and SAVE. Using the same population foundations as SIR, Cook & Ni (2005) developed an asymptotically optimal method of estimating 𝒮_{*Y*|**X**} called IRE. Ye & Weiss (2003) and Zhu *et al.* (2005) attempted to combine the advantages of SIR and SAVE by using linear combinations of them. Cook & Forzani (2009*a*) used a likelihood-based objective function to develop a method called LAD (likelihood acquired directions) that apparently dominates all dimension-reduction methods based on the same population foundations as SIR and SAVE. These methods have been developed and studied mostly in regressions where *p* ≪ *n*, although there are some results for other settings (Li 2007; Li & Yin 2008). SIR, SAVE, IRE and LAD come with a range of inference capabilities, including methods for estimating *d* and tests of conditional independence hypotheses such as *Y* is independent of **X**_{1} given **X**_{2}, where we have partitioned **X** = (**X**_{1}^{T}, **X**_{2}^{T})^{T}.

Moment-based sufficient dimension-reduction methods provide estimates of the minimal sufficient linear reduction, but they are not designed specifically for prediction and do not produce predictive methods *per se*. Instead, once the estimated reduction R̂(**X**) has been determined, it is considered fixed and standard model development methods are typically used to estimate *E*{*Y* | R̂(**X**)} and thereby obtain predictions. Model-based sufficient reduction methods allow a more direct route to estimation of *E*(*Y* | **X**).

### (b) Model-based inverse reduction

Model-based sufficient dimension reduction is relatively new (Cook 2007). There are several useful characteristics of this approach, which may become clear in the sections that follow. One is that a model for **X** | *Y* can itself be inverted to provide a method for estimating the forward mean function *E*(*Y* | **X**) without specifying a model for the full joint distribution of (**X**, *Y*). For convenience, we denote the densities of **X** and **X** | *Y* by *g*(**X**) and *g*(**X** | *Y*), and so on, keeping in mind that the symbol *g* indicates a different density in each case. Because densities will always appear together with their arguments, this should cause no ambiguity. We assume that *R*(**X**) has a density as well. With these understandings, we have
*E*(*Y* | **X** = **x**) = *E*{*Y g*(**x** | *Y*)}/*E*{*g*(**x** | *Y*)} = *E*{*Y* | *R*(**X**) = *R*(**x**)} = *E*{*Y g*(*R*(**x**) | *Y*)}/*E*{*g*(*R*(**x**) | *Y*)}, (3.1)

where all right-hand side expectations are with respect to the marginal distribution of *Y*. Equation (3.1)_{1} provides a relationship between the mean function *E*{*Y* | **X**} and the conditional density of **X** | *Y*, while equation (3.1)_{2} is a restatement of the equality of the mean functions for the regressions of *Y* on **X** and *R*(**X**). The final equality (3.1)_{3} establishes a relationship between these forward mean functions and the conditional density of *R* | *Y*, and provides a method to estimate *E*(*Y* | **X**):

Ê(*Y* | **X** = **x**) = Σ_{i=1}^{n} *w*_{i}*Y*_{i}, with *w*_{i} = ĝ{R̂(**x**) | *Y*_{i}} / Σ_{j=1}^{n} ĝ{R̂(**x**) | *Y*_{j}}, (3.2)

where ĝ denotes an estimated density and R̂ is the estimated reduction. This estimator is reminiscent of a non-parametric kernel estimator (Simonoff 1996, ch. 4), but there are important differences. The weights in a kernel estimator do not depend on the response, while the weights *w*_{i} here do. Kernel weights typically depend on the full vector of predictors **X**, while the weights here depend on **X** only through the estimated reduction R̂(**X**). If *d* is small, the estimator (3.2) may avoid the curse of dimensionality. Multivariate kernels are usually taken to be the product of univariate kernels, corresponding here to constraining the components of *R* to be independent. Finally, there is no explicit bandwidth in our weights, as they are determined entirely by the estimated inverse model, which eliminates the need for bandwidth estimation by, for example, cross-validation.

The success of this approach depends on obtaining good estimators of the reduction and of its conditional density. In the next section, we address these issues by using normal models for the conditional distribution of **X** | *Y* . The models in §§4*a,b* and *d* were introduced by Cook (2007). The model in §4*c* is from Cook & Forzani (2009*b*). Our discussion includes a review of the results that are needed to use equation (3.2).

## 4. Normal inverse models

Let **X**_{y} denote a random vector distributed as **X** | (*Y* = *y*), and assume that **X**_{y} is normally distributed with mean *μ*_{y} and constant variance matrix *Δ* > 0, where the inequality means that *Δ* is positive definite. Let μ̄ = *E*(**X**) and let *Γ* denote a *p* × *d* basis matrix whose columns form a basis for the *d*-dimensional subspace span{*μ*_{y} − μ̄ : *y* ∈ *S*_{Y}}, where *S*_{Y} denotes the sample space of *Y*. Then we can write (Cook 2007)

**X**_{y} = μ̄ + *Γν*_{y} + *ε*, (4.1)

where *ε* is independent of *Y* and normally distributed with mean 0 and covariance matrix *Δ*, and *ν*_{y} = *Γ*^{T}(*μ*_{y} − μ̄); we assume that var(*ν*_{Y}) > 0. The basis matrix *Γ* is not identifiable in this model, as for any full-rank *d* × *d* matrix **A** we can always obtain an equivalent parametrization as *Γν*_{y} = (*Γ***A**^{−1})(**A***ν*_{y}). However, span(*Γ*) is identifiable and estimable, and for this reason we assume without loss of generality that *Γ* is a semiorthogonal matrix, *Γ*^{T}*Γ* = **I**_{d}.

Model (4.1) represents the fact that the translated conditional means *μ*_{y} − μ̄ fall in the *d*-dimensional subspace span(*Γ*). Under model (4.1), *R*(**X**) = *Γ*^{T}*Δ*^{−1}**X** is minimal sufficient (Cook & Forzani 2009*b*), and the goal is thus to estimate span(*Δ*^{−1}*Γ*). As the minimal sufficient reduction is linear, it follows that 𝒮_{*Y*|**X**} = span(*Δ*^{−1}*Γ*). In other words, in the class of models represented by equation (4.1), moment-based and model-based dimension reduction coincide in the population.

As a convenient notation for describing estimators of 𝒮_{*Y*|**X**}, let 𝒮_{d}(**A**, **B**) denote the span of **A**^{−1/2} times the first *d* eigenvectors of **A**^{−1/2}**B****A**^{−1/2}, where **A** and **B** are symmetric matrices and **A** is non-singular. The subspace 𝒮_{d}(**A**, **B**) can also be described as the span of **A**^{−1} times the first *d* eigenvectors of **B**. We refer to errors having covariance matrix *Δ* = *σ*^{2}**I**_{p} as *isotonic*. Isotonic models are models with isotonic errors. For notational simplicity, we will use **X**_{i} when referring to observations rather than the more awkward notation **X**_{y_{i}}.

### (a) The principal component model

The isotonic version of model (4.1) is called the *PC model* because the maximum likelihood estimator (MLE) of span(*Γ*) is 𝒮_{d}(**I**_{p}, Σ̂), and thus the *d* components of the estimated reduction R̂(**X**) = Γ̂^{T}**X** are simply the first *d* PCs. This relatively simple result is due to the nature of *Δ*. As the errors are isotonic, the contours of *Δ* are circular. When the signal *Γν*_{y} is added, the contours of *Σ* = *Γ*var(*ν*_{Y})*Γ*^{T} + *σ*^{2}**I**_{p} become *p*-dimensional ellipses with their longest *d* axes spanning span(*Γ*). The MLE of *σ*^{2} is σ̂^{2} = Σ_{j=d+1}^{p} λ̂_{j}/(*p* − *d*), where λ̂_{1} ≥ ⋯ ≥ λ̂_{p} are the eigenvalues of Σ̂, and the MLE of 𝒮_{*Y*|**X**} is simply span(Γ̂) = 𝒮_{d}(**I**_{p}, Σ̂).
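Under the isotonic PC model, the MLE computations reduce to an eigendecomposition of Σ̂: the estimated basis consists of the first *d* eigenvectors, and the error variance estimate averages the remaining eigenvalues. A numpy sketch under an illustrative choice of dimensions and signal strength:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d, sigma = 2000, 10, 2, 1.0

# Isotonic inverse model: X = Gamma nu_y + sigma * standard normal error.
Gamma, _ = np.linalg.qr(rng.standard_normal((p, d)))   # semiorthogonal true basis
nu = 3.0 * rng.standard_normal((n, d))                 # illustrative coordinates nu_y
X = nu @ Gamma.T + sigma * rng.standard_normal((n, p))

Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(Sigma_hat)
evals, evecs = evals[::-1], evecs[:, ::-1]             # descending eigenvalues

Gamma_hat = evecs[:, :d]                               # MLE basis: first d eigenvectors
sigma2_hat = evals[d:].mean()                          # average of residual eigenvalues

# Cosines of the principal angles between span(Gamma_hat) and span(Gamma):
overlap = np.linalg.svd(Gamma_hat.T @ Gamma, compute_uv=False)
```

With *n* large relative to *p*, the estimated subspace nearly coincides with span(*Γ*) and the residual-eigenvalue average recovers *σ*^{2}.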

We performed a small simulation to provide some intuition. Observations on **X** were generated as **X** = *Γ*^{*}*ν*^{*}_{y} + *ε*, where *ν*^{*}_{y} was sampled uniformly from the boundary of the square [−1, 1]^{2}, the elements of the *p* × 2 matrix *Γ*^{*} were sampled independently from a standard normal distribution, and the error vector *ε* was sampled from a normal distribution with mean 0 and variance matrix **I**_{p}. In terms of model (4.1), *Γ* = *Γ*^{*}(*Γ*^{*T}*Γ*^{*})^{−1/2} and *ν*_{y} = (*Γ*^{*T}*Γ*^{*})^{1/2}*ν*^{*}_{y}. This sampling process, which does not require an explicit choice of *Y*, was repeated *n* = 80 times for various values of *p*. Figure 1 shows plots of the first two PCs for four values of *p*. We see that for small *p* the square is not recognizable, but for larger values of *p* the square is quite clear. In figure 1*d*, there are *p* = 500 predictors, while the number of observations is still *n* = 80. The sides of the estimated square in figure 1*d* do not align with the coordinate axes because the method is designed to estimate only the subspace span(*Γ*), which is equal to 𝒮_{*Y*|**X**} with isotonic errors.

Turning to prediction, let Γ̂ denote the *p* × *d* matrix with columns consisting of the first *d* eigenvectors of Σ̂. Then, the weights (3.2) can be written as

*w*_{i} = exp{−‖R̂(**x**) − R̂(**X**_{i})‖^{2}/2σ̂^{2}} / Σ_{j=1}^{n} exp{−‖R̂(**x**) − R̂(**X**_{j})‖^{2}/2σ̂^{2}}, (4.2)

where R̂(**x**) = Γ̂^{T}**x** is the estimated reduction. In this case, Ê(*Y* | **X**) is in the form of a normal product kernel estimator with bandwidth σ̂. Here and in other weights, we assumed that *d* is known. Methods for choosing *d* in the context of prediction are discussed in §5.
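As a concrete illustration of how normal weights of the form (4.2) turn the inverse PC model into the forward prediction rule (3.2), here is a numpy sketch; the inverse model used to generate data (a single coordinate *ν*_{y} = 3*y*) is an assumption of the example, not the paper's figure 2 set-up:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, d = 300, 20, 1

# Illustrative inverse model: X = Gamma * (3 y) + standard normal error.
Y = rng.standard_normal(n)
Gamma = np.ones((p, d)) / np.sqrt(p)            # semiorthogonal basis
X = 3.0 * Y[:, None] * Gamma.T + rng.standard_normal((n, p))

X_bar = X.mean(axis=0)
Xc = X - X_bar
evals, evecs = np.linalg.eigh(Xc.T @ Xc / n)
Gamma_hat = evecs[:, ::-1][:, :d]               # first d eigenvectors of Sigma-hat
sigma2_hat = evals[::-1][d:].mean()             # isotonic error-variance estimate

R = Xc @ Gamma_hat                              # reductions R-hat(X_i), i = 1..n

def predict(x):
    """E-hat(Y | X = x): the weighted average (3.2) with normal weights."""
    r = (x - X_bar) @ Gamma_hat
    logw = -np.sum((R - r) ** 2, axis=1) / (2.0 * sigma2_hat)
    w = np.exp(logw - logw.max())               # stabilized before normalizing
    return (w / w.sum()) @ Y

def draw_x(y):
    """A fresh predictor vector from the same illustrative inverse model."""
    return 3.0 * y * Gamma[:, 0] + rng.standard_normal(p)

y_lo, y_hi = predict(draw_x(-2.0)), predict(draw_x(2.0))
```

Because the prediction is a convex combination of the observed responses, it always lies within their range, and predictions increase with the signal in the direction of the reduction.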

The method of estimating *E*(*Y* | **X**) characterized by equation (4.2) is distinct from the standard PCR estimator given in equation (2.2) with the columns of **G** being the first *d* eigenvectors of Σ̂. Predictions from equations (4.2) and (2.2) both use PCs, but the way in which they are used is quite different. Model (4.1), which leads to PCs as the MLE of the sufficient reduction, has a very general mean function, but its error structure is restrictive. Nevertheless, this error structure has recently been used in studies of gene expression (**X**) that are complicated by stratification and heterogeneity (Leek & Storey 2007). On the other hand, the usual linear model has a restrictive mean function, and under that model alone there seems to be no clear rationale for reduction by PCs (Cox 1968).

We performed a small simulation study to illustrate the relative behaviour of predictions using equations (2.2) and (4.2). The basic simulation scenario was the same as that leading to figure 1, but in this case we set *n* = 200, *d* = 1, *Γ*^{*} = (1, …, 1)^{T}, var(*ε*) = 5^{2}**I**_{p} and *ν*_{Y} = *Y*, where *Y* is a standard normal random variable. In this set-up, (**X**, *Y*) has a multivariate normal distribution, so both equations (2.2) and (4.2) are appropriate, but the predictions from PCR (2.2) use the fact that the mean function *E*(*Y* | **X**) is linear, while predictions using equation (4.2) do not. For each of 100 datasets generated in this way, the mean-squared prediction error PE_{k} for the estimated mean function was determined by sampling 200 new observations (**X**^{*}, *Y*^{*}):

PE_{k} = (1/200) Σ_{i=1}^{200} {*Y*^{*}_{i} − Ê(*Y* | **X**^{*}_{i})}^{2}. (4.3)

The final prediction error was then determined by averaging, PE = (1/100) Σ_{k=1}^{100} PE_{k}. The prediction errors are shown in figure 2*a* for 3 ≤ *p* ≤ 150. While the PCR predictions (2.2) perform a bit better at all values of *p*, the loss when using predictions from the inverse model is negligible. Figure 2*b* was constructed in the same way, except that we set *ν*_{Y} to a nonlinear function of *Y*. In this case, the predictions (4.2) from the inverse model are much better than the PCR predictions because they automatically adapt to *E*(*Y* | **X**) regardless of the nature of *ν*_{Y}.

### (b) The isotonic principal fitted component model

It will often be possible and useful to model the coordinate vectors as *ν*_{y} = *β*(**f**_{y} − f̄), where **f**_{y} is a known *r* × 1 vector-valued function of *y* with linearly independent elements, f̄ = *E*(**f**_{Y}) and *β* is an unrestricted *d* × *r* matrix of rank *d*, *d* ≤ min(*r*, *p*). Under this model for *ν*_{y}, each coordinate *X*_{yj}, *j* = 1, …, *p*, of **X**_{y} follows a linear model with predictor vector **f**_{y}. Consequently, when *Y* is quantitative, we are able to use inverse response plots (Cook 1998, ch. 10) of *X*_{yj} versus *y*, *j* = 1, …, *p*, to gain information about suitable choices for **f**_{y}, which is an ability that is not generally available in the forward regression of *Y* on **X**. When *p* is too large for such graphical inspection, the PC model can be used as a tool to aid in selecting **f**_{y}. Under the PC model, the MLE of *ν*_{y} for the *i*th observed case is ν̂_{y_{i}} = (Γ̂^{T}Γ̂)^{−1}Γ̂^{T}(**X**_{i} − X̄) = Γ̂^{T}(**X**_{i} − X̄); the second equality follows because Γ̂ is a semiorthogonal matrix. These vectors can then be plotted against *Y*_{i}, *i* = 1, …, *n*, and used in the same way as the inverse response plots. In cases like figure 1*d*, brushing a histogram of the response while observing movement of the corresponding points around the square may be useful. We can also consider vectors **f**_{y} that contain a reasonably flexible set of basis functions, like polynomial terms in *Y*, which may also be useful when it is impracticable to apply graphical methods to all of the predictors. Piecewise polynomials could also be used.

In some regressions, there may be a natural choice for **f**_{y}. Suppose for instance that *Y* is categorical, taking values in one of *h* categories *C*_{k}, *k* = 1, …, *h*. We can then set *r* = *h* − 1 and specify the *k*th element of **f**_{y} to be *J*(*y* ∈ *C*_{k}), where *J* is the indicator function. Another option consists of 'slicing' the observed values of a continuous *Y* into *h* bins (categories) *C*_{k}, *k* = 1, …, *h*, and then specifying the *k*th coordinate of **f**_{y} as *J*(*y* ∈ *C*_{k}), as for the case of a categorical *Y*. This has the effect of approximating each conditional mean *E*(*X*_{yj}) as a step function of *y* with *h* steps,

*E*(*X*_{yj}) = *μ*_{j} + Σ_{k=1}^{h} **γ**_{j}^{T}**b**_{k}*J*(*y* ∈ *C*_{k}),

where **γ**_{j}^{T} is the *j*th row of **Γ** and **b**_{k} is the *k*th column of **β**.

Models with **ν**_{y} = **βf**_{y} are called *principal fitted component (PFC) models*. To describe the MLE of span(**Γ**) when the errors are isotonic, let 𝔽 denote the *n* × *r* matrix with rows (**f**_{y_{i}} − f̄)^{T}. Then, the *n* × *p* matrix of centred fitted vectors from the linear regression of **X** on **f** is X̂ = P_{𝔽}𝕏, where 𝕏 is the *n* × *p* matrix with rows (**X**_{i} − X̄)^{T} and P_{𝔽} denotes the linear operator that projects onto the subspace spanned by the columns of 𝔽. The sample covariance matrix of these fitted vectors is Σ̂_{fit} = X̂^{T}X̂/*n*. Under this *isotonic PFC model*, the MLE of span(**Γ**) is the span of the first *d* eigenvectors of Σ̂_{fit}. Johnson (2008) gave sufficient conditions for this estimator to converge at the usual rate.
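The isotonic PFC fit described above reduces to two matrix computations: form the covariance of the fitted vectors from the OLS fit of **X** on **f**, then extract its leading *d* eigenvectors. A minimal numpy sketch, in which the simulated data and all variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r, d = 200, 8, 3, 1

# Simulated data: a linear trend in the first predictor, isotropic errors.
y = rng.uniform(0, 3, n)
X = np.outer(y, np.eye(p)[0]) + rng.normal(scale=0.5, size=(n, p))

# Centred basis matrix F (n x r) with rows f_{y_i} - fbar; here a cubic basis.
F = np.column_stack([y, y ** 2, y ** 3])
F = F - F.mean(axis=0)

# Fitted vectors from the OLS fit of X on f, and their sample covariance.
Xc = X - X.mean(axis=0)
P_F = F @ np.linalg.solve(F.T @ F, F.T)   # projection onto span(F)
X_fit = P_F @ Xc
Sigma_fit = X_fit.T @ X_fit / n

# MLE of span(Gamma): first d eigenvectors of Sigma_fit.
vals, vecs = np.linalg.eigh(Sigma_fit)
Gamma_hat = vecs[:, ::-1][:, :d]
R = Xc @ Gamma_hat                        # estimated sufficient reduction
```

With these data the leading eigenvector concentrates on the first coordinate, which carries all of the fitted variation.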

Under the isotonic PFC model, the sufficient reduction is estimated as R̂(**x**) = Γ̂^{T}**x**, where the columns of Γ̂ are the first *d* eigenvectors of Σ̂_{fit}. The elements of this R̂ are called PFCs. Additionally, the MLE of **β** is β̂ = Γ̂^{T}**B**_{ols}, where **B**_{ols} is the *p* × *r* coefficient matrix from the multivariate OLS fit of **X** on **f**, and the MLE of *σ*^{2} is σ̂^{2} = {tr(Σ̂) − Σ_{i=1}^{d} λ̂_{i}}/*p*, where the λ̂_{i} are the ordered eigenvalues of Σ̂_{fit}.

To describe the weights under this model, let **B**_{ols} be the coefficient matrix from the multivariate OLS fit of **X** on **f**, and let the fitted vectors be denoted by X̂_{i} = X̄ + **B**_{ols}(**f**_{y_{i}} − f̄). The weights from equation (3.2) are then

w_{i}(**x**) = exp{−‖Γ̂^{T}(**x** − X̂_{i})‖^{2}/2σ̂^{2}} / Σ_{j=1}^{n} exp{−‖Γ̂^{T}(**x** − X̂_{j})‖^{2}/2σ̂^{2}}.    (4.4)

These weights are of the same general form as those in equation (4.2) from the PC model, but there are two important differences. First, the isotonic PFC reduction uses the eigenvectors of Σ̂_{fit}, while the PC reduction uses the eigenvectors of Σ̂. Second, the reduction is applied to the observed predictor vectors **X**_{i} in the PC weights (4.2), while the reduction is applied to the fitted predictor vectors X̂_{i} in the isotonic PFC weights (4.4).
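Putting the pieces together, the weights (4.4) can be computed directly. This sketch is our own: the function name, the log-sum-exp normalization and the trace form of σ̂² are assumptions consistent with the expressions above, not code from the original study.

```python
import numpy as np

def isotonic_pfc_weights(x, X, F, d):
    """Weights (4.4): the reduction is applied to the fitted vectors X-hat_i.

    x : (p,) query point;  X : (n, p) predictors;  F : (n, r) basis f_{y_i}.
    """
    n, p = X.shape
    Xbar = X.mean(axis=0)
    Xc = X - Xbar
    Fc = F - F.mean(axis=0)
    P_F = Fc @ np.linalg.solve(Fc.T @ Fc, Fc.T)
    X_fit = P_F @ Xc                       # centred fitted vectors
    Sigma_fit = X_fit.T @ X_fit / n
    vals, vecs = np.linalg.eigh(Sigma_fit)
    Gamma = vecs[:, ::-1][:, :d]
    # MLE of sigma^2: residual variation averaged over the p coordinates.
    Sigma = Xc.T @ Xc / n
    s2 = (np.trace(Sigma) - vals[::-1][:d].sum()) / p
    # Reduce the query point and the fitted vectors, then form the kernel.
    z = Gamma.T @ (x - Xbar)
    Z = X_fit @ Gamma
    logw = -np.sum((z - Z) ** 2, axis=1) / (2 * s2)
    w = np.exp(logw - logw.max())          # stabilized soft-max
    return w / w.sum()
```

A prediction at **x** is then the weighted average of the observed responses, Σ_{i} w_{i}(**x**)*Y*_{i}.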

Shown in figure 3*a* are results from a simulation study to illustrate the potential advantages of using fitted components. The data were generated as for figure 2*b*. We used **f**_{y} = (*y*, *y*^{2}, *y*^{3})^{T} for the fitted model and, to broaden the scope, we increased the overall variability by reducing the sample size to *n* = 50 and increasing the conditional variance matrix to var(**ε**) = 15^{2}**I**_{p}. This is a highly variable regression, but all three methods respond quickly as the number of predictors increases. The PCR predictions do better than the inverse predictions from the PC model for relatively small *p*, but the situation is reversed for larger values of *p*. Most importantly, the prediction errors from isotonic PFC are the smallest at all values of *p*. Figure 3*b* is discussed in the next section.

### (c) The diagonal principal fitted component model

The isotonic PFC model requires that, given the response, the predictors must be independent and have the same variance. While this model will be useful in some applications, the requirement of equal variances is restrictive relative to the range of applications in which reduction may be desirable. In this section, we expand the scope of application by permitting a diagonal error covariance matrix **Δ** = diag(σ_{1}^{2}, …, σ_{p}^{2}). This allows for different measurement scales of the predictors, but still requires that they be conditionally independent. The estimated sufficient reduction for this model is R̂(**x**) = Γ̂^{T}Δ̂^{−1}**x**, where Δ̂ is the MLE of **Δ** and Γ̂ is any basis for the MLE of span(**Γ**). With this R̂, the weights have the same form as the weights for the isotonic PFC model,

w_{i}(**x**) = exp{−‖R̂(**x**) − R̂(X̂_{i})‖^{2}/2} / Σ_{j=1}^{n} exp{−‖R̂(**x**) − R̂(X̂_{j})‖^{2}/2},    (4.5)

where X̂ is as defined for equation (4.4).

It remains to determine Δ̂ and Γ̂. For this model, we were unable to find a closed-form expression for these estimators. However, a straightforward alternating algorithm can be developed based on the following reasoning. If the inverse mean function **μ** + **Γβf**_{y} is specified, then the variances can be estimated by using the sample variances of the centred variables **X**_{i} − X̄ − Γ̂β̂**f**_{y_{i}}. If **Δ** is specified, then we can standardize the predictor vector to obtain an isotonic PFC model in **Z** = **Δ**^{−1/2}**X**,

**Z**_{y} = **Δ**^{−1/2}**μ** + **Δ**^{−1/2}**Γβf**_{y} + **ε**,    (4.6)

where **ε** is normal with mean **0** and variance matrix **I**_{p}. Consequently, we can estimate span(**Γ**) as **Δ**^{1/2} times the estimate of span(**Δ**^{−1/2}**Γ**) from the isotonic model (4.6). Alternating between these two steps leads to the following algorithm.

Fit the isotonic PFC model to the original data, getting initial estimates Γ̂^{(0)} and β̂^{(0)}. The MLE of the intercept is always μ̂ = X̄, so there is no need to include this parameter explicitly in the algorithm.

For some small ϵ > 0, repeat for *j* = 1, 2, … until ‖Δ̂^{(j)} − Δ̂^{(j−1)}‖ < ϵ.

(a) Calculate Δ̂^{(j)} as the diagonal matrix of sample variances of the centred variables **X**_{i} − X̄ − Γ̂^{(j−1)}β̂^{(j−1)}**f**_{y_{i}}.

(b) Transform **Z**_{i} = (Δ̂^{(j)})^{−1/2}**X**_{i}.

(c) Fit the isotonic PFC model to **Z**, yielding estimates Γ̂_{z}, β̂_{z} and σ̂_{z}^{2}.

(d) Back-transform the estimates to the original scale, Γ̂^{(j)} = (Δ̂^{(j)})^{1/2}Γ̂_{z} and β̂^{(j)} = β̂_{z}.
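The alternating steps above can be sketched as follows. This is a minimal illustration under our own naming conventions: convergence is monitored on the diagonal variances, and the intercept is handled by centring, as noted in the first step.

```python
import numpy as np

def fit_isotonic_pfc(X, F, d):
    """Isotonic PFC fit for centred X (n x p) and centred F (n x r)."""
    n = X.shape[0]
    P_F = F @ np.linalg.solve(F.T @ F, F.T)
    Sigma_fit = X.T @ P_F @ X / n
    vecs = np.linalg.eigh(Sigma_fit)[1][:, ::-1]
    Gamma = vecs[:, :d]                              # p x d
    B_ols = np.linalg.solve(F.T @ F, F.T @ X).T      # p x r OLS coefficients
    beta = Gamma.T @ B_ols                           # d x r
    return Gamma, beta

def fit_diagonal_pfc(X, F, d, eps=1e-6, max_iter=100):
    """Alternating algorithm for the diagonal PFC model (sketch)."""
    Xc = X - X.mean(axis=0)
    Fc = F - F.mean(axis=0)
    Gamma, beta = fit_isotonic_pfc(Xc, Fc, d)        # initial isotonic fit
    Delta = np.ones(X.shape[1])
    for _ in range(max_iter):
        # (a) sample variances of the centred residuals
        resid = Xc - Fc @ beta.T @ Gamma.T
        Delta_new = resid.var(axis=0)
        # (b) standardize, (c) refit the isotonic model, (d) back-transform
        Z = Xc / np.sqrt(Delta_new)
        Gz, bz = fit_isotonic_pfc(Z, Fc, d)
        Gamma, beta = np.sqrt(Delta_new)[:, None] * Gz, bz
        if np.max(np.abs(Delta_new - Delta)) < eps:
            Delta = Delta_new
            break
        Delta = Delta_new
    return Gamma, beta, Delta
```

In practice only a handful of iterations are typically needed, since each step solves its subproblem exactly.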

Cook & Forzani (2009*b*) proposed a different algorithm for fitting the diagonal PFC model. The algorithm here is more reliable when *n* < *p*.

We again report partial results of a simulation study to illustrate the potential benefits of the diagonal PFC model. The scenario for generating the data was the same as that for figure 2*a* with 3 ≤ *p* ≤ 200 and *n* = 50, modified to induce different conditional variances. The conditional variances were generated once as the order statistics for a sample of size 200 from 60 times a chi-squared random variable with two degrees of freedom, and we then used the first *p* of these ordered variances for a regression with *p* predictors. In this way, the conditional variances increased as we added predictors. While it is unlikely that predictors would be so ordered in practice, this arrangement seems useful to highlight the relative behaviour of the methods. In each case, we fitted the isotonic and diagonal PFC models using **f**_{y} = *y*. Predictions were again assessed using 200 new simulated observations, and the entire set-up was again replicated 100 times to obtain the average prediction errors shown in figure 3*b*. With *p* = 3 predictors, all methods perform well because the first three conditional variances are similarly small. The prediction error for the diagonal PFC model changed very little as we added predictors, while the prediction errors for all other methods increased substantially. This happened because the MLEs from the diagonal PFC model can weight each predictor according to its conditional variance, while the predictors are treated equally by the other methods. Predictions from the PC model were consistently the worst. This is perhaps to be expected because the other three methods use mean functions that are appropriate for this simulation.

Predictions from the diagonal PFC model should perform well also when ** Δ** > 0 is not a diagonal matrix, provided that the conditional predictor correlations are small to moderate. This is in agreement with Prentice & Zhao (1991, p. 830) who recommended in a related context that independence models generally should be adequate for a broad range of applications in which the dependences are not strong. The methodology discussed in the next section may be useful in the presence of strong conditional correlations.

### (d) The principal fitted component model

In this section, we describe the estimator (3.2) for general **Δ** > 0, allowing the predictors to be conditionally dependent with different variances. The methods deriving from the isotonic and diagonal PFC models do not require *p* < *n*. In contrast, the methods of this section work best in data-rich regressions where *p* ≪ *n*. The analysis relies on the MLEs given by Cook (2007, §7.2) and studied in more detail by Cook & Forzani (2009*b*).

The following notation will help describe the MLEs under the PFC model. Let Σ̂_{res} = Σ̂ − Σ̂_{fit} be the sample covariance matrix of the residual vectors from the OLS fit of **X** on **f**. Let λ̂_{1} ≥ ⋯ ≥ λ̂_{p} and V̂ be the eigenvalues and the corresponding matrix of eigenvectors of Σ̂_{res}^{−1/2}Σ̂_{fit}Σ̂_{res}^{−1/2}. Finally, let K̂ be a *p* × *p* diagonal matrix, with the first *d* diagonal elements equal to zero and the last *p* − *d* diagonal elements equal to λ̂_{d+1}, …, λ̂_{p}. Then, the MLE of **Δ** is

Δ̂ = Σ̂_{res}^{1/2}V̂(**I**_{p} + K̂)V̂^{T}Σ̂_{res}^{1/2}.

If *d* = *r*, then K̂ = 0, and this estimator reduces to Δ̂ = Σ̂_{res}. Otherwise, Δ̂ recovers information on variation due to overfitting (*r* > *d*). The MLE of **β** is β̂ = (Γ̂^{T}Δ̂^{−1}Γ̂)^{−1}Γ̂^{T}Δ̂^{−1}**B**_{ols}, so that Γ̂β̂ is the projection of **B**_{ols} onto the span of Γ̂ in the Δ̂^{−1} inner product, and the MLE of **Δ**^{−1}**Γ** is Σ̂_{res}^{−1/2}V̂_{d}, where V̂_{d} holds the first *d* columns of V̂; this is equal to the SIR estimator of span(**Δ**^{−1}**Γ**) when the response is categorical (Cook & Forzani 2009*b*).

Turning to prediction, we substituted these MLEs into the multivariate normal density for **X** | *Y* and simplified the resulting expression by in part ignoring proportionality constants not depending on the observation *i* to obtain the weights (3.2):

w_{i}(**x**) = exp{−(**x** − X̂_{i})^{T}Δ̂^{−1}(**x** − X̂_{i})/2} / Σ_{j=1}^{n} exp{−(**x** − X̂_{j})^{T}Δ̂^{−1}(**x** − X̂_{j})/2},    (4.7)

where X̂ is as defined for equation (4.4). We next discuss how these weights can be simplified to a more intuitive and computationally efficient form.

Let Ĥ = Σ̂_{res}^{−1/2}Σ̂_{fit}Σ̂_{res}^{−1/2}. Then it is easy to show that Δ̂ and Ĥ share the eigenvector matrix V̂, and that the columns of Σ̂_{res}^{−1/2}V̂ are eigenvectors of Σ̂_{res}^{−1}Σ̂_{fit}. Let V̂_{d} and Ŵ_{d} = Σ̂_{res}^{−1/2}V̂_{d} denote the *p* × *d* matrices consisting of the first *d* columns of V̂ and Σ̂_{res}^{−1/2}V̂. Then, because the MLE of **Δ**^{−1}**Γ** is Ŵ_{d}, we may take Γ̂ = Σ̂_{res}^{1/2}V̂_{d}, which implies the reduction

R̂(**x**) = Ŵ_{d}^{T}**x** = V̂_{d}^{T}Σ̂_{res}^{−1/2}**x**.

Using these results in equation (4.7), the terms in the exponent that depend on *i* reduce to ‖R̂(**x**) − R̂(X̂_{i})‖^{2}/2, and thus

w_{i}(**x**) = exp{−‖R̂(**x**) − R̂(X̂_{i})‖^{2}/2} / Σ_{j=1}^{n} exp{−‖R̂(**x**) − R̂(X̂_{j})‖^{2}/2}.    (4.8)

The reduction in this case is applied to **x** and X̂_{i} in the same way that it was for the weights from the isotonic and diagonal PFC models, but the reduction itself differs. A perhaps clearer connection between the PFC reduction and the isotonic PFC reduction can be seen by noting that span(Ŵ_{d}) is equal to the span of the first *d* eigenvectors of Σ̂_{res}^{−1}Σ̂_{fit}. Under the isotonic PFC model, the reduction is computed by using the first *d* eigenvectors of Σ̂_{fit}, while under the PFC model the reduction is computed using the first *d* eigenvectors of Σ̂_{res}^{−1}Σ̂_{fit}. Additionally, as Σ̂ = Σ̂_{fit} + Σ̂_{res}, the reduction could be computed also as the first *d* eigenvectors of Σ̂^{−1}Σ̂_{fit}.
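Computationally, the PFC reduction amounts to one standardization and one eigen-decomposition. A numpy sketch for the data-rich case *p* ≪ *n* follows; the function and variable names are our own, and a symmetric inverse square root is one of several equivalent ways to form Σ̂_{res}^{−1/2}.

```python
import numpy as np

def pfc_reduction(X, F, d):
    """General PFC reduction: first d eigenvectors of the standardized
    fitted covariance, applied on the Sigma_res^{-1/2} scale."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Fc = F - F.mean(axis=0)
    P_F = Fc @ np.linalg.solve(Fc.T @ Fc, Fc.T)
    Sigma_fit = Xc.T @ P_F @ Xc / n
    Sigma_res = Xc.T @ (np.eye(n) - P_F) @ Xc / n    # residual covariance
    # Symmetric inverse square root of Sigma_res (requires p << n).
    w, U = np.linalg.eigh(Sigma_res)
    S_inv_half = U @ np.diag(w ** -0.5) @ U.T
    H = S_inv_half @ Sigma_fit @ S_inv_half
    V_d = np.linalg.eigh(H)[1][:, ::-1][:, :d]       # leading d eigenvectors
    W_d = S_inv_half @ V_d                           # basis for span(Delta^{-1} Gamma)
    return Xc @ W_d                                  # n x d reduced predictors
```

The same `W_d` basis is what enters the weights (4.8) via R̂(**x**) = Ŵ_{d}^{T}**x**.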

## 5. Implementation

In this section, we present additional simulation results and provide some suggestions for issues that must be addressed prior to implementation. We consider only the isotonic, diagonal and general PFC models, and address the implementation issues first.

### (a) Choice of *d*

Two general methods for choosing *d* have been studied in the literature. One is by using an information criterion such as the Akaike (AIC) and Bayesian (BIC) information criteria. Let *L*(*d*_{0}) denote the value of the maximized log likelihood for any of the three inverse models under consideration fitted with a value *d*_{0} for *d*. Then the dimension is selected that minimizes the information criterion −2*L*(*d*_{0}) + *h*(*n*)*g*(*d*_{0}), where *h*(*n*) is equal to log *n* for BIC and 2 for AIC, and *g*(*d*_{0}) = *p* + *d*_{0}(*p* − *d*_{0}) + *d*_{0}*r* + *D*(**Δ**) is the number of parameters to be estimated as a function of *d*_{0}. The first addend in this count corresponds to **μ**, the second to span(**Γ**) and the third to **β**. The final term *D*(**Δ**) is the number of parameters required for **Δ**, which is equal to 1, *p* and *p*(*p* + 1)/2 for the isotonic, diagonal and general PFC models. The second method is based on using the likelihood ratio statistic in a sequential scheme to choose *d*. Using a common test level and starting with *d*_{0} = 0, choose the estimate of *d* as the first hypothesized value that is not rejected; see Cook (2007) and Cook & Forzani (2009*a*) for further discussion of these methods.

These methods for choosing *d* have been shown to work well in data-rich regressions where *p* ≪ *n*, but can be unreliable otherwise. Consequently, in the context of this paper, we will determine *d* using the cross-validation method proposed by Hastie *et al.* (2001, ch. 7).
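For reference, the parameter count *g*(*d*_{0}) and the resulting criterion minimization are easy to encode. In this sketch the maximized log likelihoods are assumed to be supplied by the model fits, and the function names are ours.

```python
import numpy as np

def n_params(d0, p, r, model):
    """Parameter count g(d0) = p + d0*(p - d0) + d0*r + D(Delta)."""
    D = {"isotonic": 1, "diagonal": p, "general": p * (p + 1) // 2}[model]
    return p + d0 * (p - d0) + d0 * r + D

def select_d(loglik, n, p, r, model, criterion="bic"):
    """loglik[d0] is the maximized log likelihood for dimension d0.

    Returns the d0 minimizing -2 L(d0) + h(n) g(d0)."""
    h = np.log(n) if criterion == "bic" else 2.0
    ic = [-2 * loglik[d0] + h * n_params(d0, p, r, model)
          for d0 in range(len(loglik))]
    return int(np.argmin(ic))
```

For example, with maximized log likelihoods that level off after *d*_{0} = 1, BIC picks the smaller dimension because of its heavier log *n* penalty.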

### (b) Screening

When dealing with large-*p* regressions, there may be a possibility that a substantial subset **X**_{2} of the predictors is inactive; that is, **X**_{2} is independent of the response given the remaining predictors **X**_{1}. In terms of PFC models, **X**_{2} is inactive if and only if the corresponding rows of **Δ**^{−1}**Γ** are all equal to zero. Addressing this possibility in terms of the general PFC model requires a data-rich regression, and is outside the scope of this paper (see Cook & Forzani 2009*b* for further discussion). For the isotonic and diagonal PFC models, the predictors **X**_{2} are inactive if and only if the corresponding rows of **Γ** are all equal to 0. In these models, we can test if the *k*th row of **Γ** is equal to 0 straightforwardly by using an *F* statistic to test if **α** = 0 in the univariate regression model of the *k*th predictor *X*_{k} on **f**, *X*_{k} = *μ* + **α**^{T}**f** + ε. This method with a common test level of 0.1 was used for screening in the simulations that follow. When *d* = 1 and **f** = *y*, this method is equivalent to sure independence screening proposed by Fan & Lv (2008).
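The coordinate-wise *F* test just described can be sketched as follows. We use scipy for the *F* tail probability; the function name and loop organization are our own.

```python
import numpy as np
from scipy import stats

def screen_predictors(X, F, level=0.1):
    """F-test of alpha = 0 in X_k = mu + alpha^T f + eps, for each k.

    Returns a boolean mask of predictors retained at the given level."""
    n, p = X.shape
    r = F.shape[1]
    Fc = F - F.mean(axis=0)
    Xc = X - X.mean(axis=0)
    P_F = Fc @ np.linalg.solve(Fc.T @ Fc, Fc.T)
    keep = np.zeros(p, dtype=bool)
    for k in range(p):
        xk = Xc[:, k]
        rss1 = xk @ (np.eye(n) - P_F) @ xk   # residual SS, full model
        rss0 = xk @ xk                       # residual SS, intercept only
        f_stat = ((rss0 - rss1) / r) / (rss1 / (n - r - 1))
        pval = stats.f.sf(f_stat, r, n - r - 1)
        keep[k] = pval < level
    return keep
```

With *d* = 1 and **f** = *y* this reduces to ranking predictors by their marginal correlation with the response, matching the sure independence screening connection noted above.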

### (c) Simulation results

To allow for some flexibility, we used the following general set-up as the basis for our simulations. Let **J**_{k} and **O**_{k} denote vectors of length *k* having all elements equal to 1 and 0, respectively. A response was generated from a uniform (0, 3) distribution. Then, with **f**_{y} = (*y*, e^{2y})^{T}, a predictor vector of length *p* was generated using an integer *p*_{0} ≤ *p*/2 as **X**_{y} = **Γβf**_{y} + **ϵ**, where the two columns of **Γβ** are supported on the first and second blocks of *p*_{0} coordinates (constructed from **J**_{p_{0}} and **O**_{k} blocks) and **ϵ** ∼ *N*(**0**, **Δ**) with **Δ** diagonal. Datasets of *n* independent observations generated in this way have the following properties: (i) one set of *p*_{0} predictors is linearly related to the response, another set of *p*_{0} is nonlinearly related and the remaining *p* − 2*p*_{0} predictors are independent of the response; (ii) the conditional covariance of **X** | *Y* has a diagonal structure; (iii) **Γ** is a *p* × 2 matrix and thus the sufficient reduction is composed of *d* = 2 linear combinations of the predictors.
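The exact loading constants of this set-up did not survive extraction, so the following generator is only a schematic version of it: unit loadings on the two blocks and unit conditional variances are our own placeholder choices, while the block structure and **f**_{y} = (*y*, e^{2y})^{T} follow the text.

```python
import numpy as np

def generate_data(n, p, p0, rng):
    """One dataset from the simulation set-up: p0 predictors linear in y,
    p0 nonlinear (via e^{2y}) and p - 2*p0 pure noise; diagonal Delta.

    The unit loadings and unit variances below are illustrative placeholders."""
    y = rng.uniform(0, 3, n)
    f = np.column_stack([y, np.exp(2 * y)])   # f_y = (y, e^{2y})^T
    Gb = np.zeros((p, 2))                     # Gamma beta, a p x 2 matrix
    Gb[:p0, 0] = 1.0                          # linearly related block
    Gb[p0:2 * p0, 1] = 1.0                    # nonlinearly related block
    Delta = np.ones(p)                        # diagonal conditional variances
    X = f @ Gb.T + rng.normal(size=(n, p)) * np.sqrt(Delta)
    return y, X
```

Fitting the diagonal PFC model to such data with a flexible polynomial **f**_{y} reproduces the qualitative pattern reported in the three cases below.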

We present results from three cases of this simulation model. In each case, we fitted the diagonal PFC model with **f**_{y} = (*y*, *y*^{2}, *y*^{3}, *y*^{4})^{T}, so the fitted **f**_{y} was not the same as that used to generate the data. The prediction errors were determined using the method of equation (4.3).

### Case 1.

*p* ≪ *n*. In this regression, we set *n* = 400, *p* = 40 and *p*_{0} = 20, so all predictors are relevant and no screening was used. The results are shown in the third column of table 1. The first row shows the prediction errors when *d* is determined by the cross-validation (c.v.) method of Hastie *et al.* (2001, ch. 7). In this case, the estimated *d* can vary across the 100 replicate datasets. The next four rows show the prediction errors when *d* is fixed at the indicated value across all datasets. The final two rows show the prediction errors for PLS and the lasso applied straightforwardly to the linear regression of *Y* on **X** using the R packages 'pls' and 'lars'. We see that the PFC prediction errors are fairly stable across the choices for *d* and that they are roughly one-third of the prediction error for the forward methods.

### Case 2.

*n* ≪ *p*. In this simulation, we set *n* = 100, *p*_{0} = 60 and *p* = 120, so again all predictors are relevant and no screening was used. The results, which are shown in the fourth column of table 1, are qualitatively similar to those for case 1, except that the prediction error for the lasso is notably less than that for PLS.

### Case 3.

*n* ≪ *p*. For this case, we set *n* = 100, *p*_{0} = 20 and *p* = 120. Now there are only 40 relevant predictors out of 120 total predictors and, as discussed in §5*b*, screening was used prior to fitting the diagonal PFC model. The results shown in the final column of table 1 are qualitatively similar to the other two cases. Here, we were a bit surprised that the lasso did not perform better because it is designed specifically for sparse regressions. Evidently, the adaptability of the diagonal PFC model through the choices of **f** and *d* can be important for good predictions.

### (d) Illustration

We conclude this section with an illustrative analysis of an economic regression using data from Enz (1991). The data and additional discussion were given by Cook & Weisberg (1999, p. 140). The response, which was measured in 1991, is the average minutes of labour required to buy a BigMac hamburger and french fries in *n* = 45 world cities. There are nine continuous predictors. We fitted the diagonal PFC model as discussed in §4*c*, and estimated the prediction error using leave-one-out cross validation. We set *d* = 1 to allow an even-handed comparison with forward methods.

The results are shown in table 2. Two entries are given for PFC. The first corresponds to fitting PFC with the cubic polynomial basis **f** = (*y*, *y*^{2}, *y*^{3})^{T} and the second uses a more elaborate piecewise constant polynomial with 10 slices for **f**. PFC with the piecewise constant polynomial basis gives a better prediction error than PFC with the cubic polynomial basis, but both methods outperform the predictions by five forward methods, the elastic net (Zou & Hastie 2005), the lasso, ridge regression, PLS and OLS. The relative behaviour of PFC illustrated here is typical of our experiences, regardless of the number of predictors.

## 6. Discussion

The methodology introduced in this paper applies across a useful range of applications with continuous predictors that, given the response, are mildly correlated. We view the ability to choose **f**_{y} as an important feature because it allows a level of adaptability that is not really available when directly modelling the mean function *E*(*Y* |**X**) in large-*p* regressions. The assumption of normal errors is not crucial, but the predictions may not perform well under extreme deviations from normality. For instance, we expect that it is possible to develop substantially better methodology for regressions in which all the predictors are binary, perhaps employing a quadratic exponential family for **X**|*Y* in place of the multivariate normal. Work along these lines is in progress.

It also seems possible to develop models 'between' the diagonal and general PFC models that allow for some of the conditional predictor correlations to be substantial but stop short of permitting a general **Δ** > 0. One route is to model the conditional covariance matrix as **Δ** = **ΓMΓ**^{T} + **Γ**_{0}**M**_{0}**Γ**_{0}^{T}, where **Γ** is the same as that in model (4.1), (**Γ**, **Γ**_{0}) is an orthogonal *p* × *p* matrix, **M** > 0 and **M**_{0} > 0 (Cook 2007). The minimal sufficient reduction in this case is *R*(**X**) = **Γ**^{T}**X**, which does not involve the **M**s. This model for **Δ** allows arbitrary correlations among the elements of *R*(**X**) and among the elements of **Γ**_{0}^{T}**X**, but requires that *R*(**X**) and **Γ**_{0}^{T}**X** be conditionally independent. Predictions under this model in a data-rich regression can be obtained as an extension of results by Cook (2007). The behaviour of this model in *n* ≪ *p* regressions is under study.

## Acknowledgements

Research for this paper was supported in part by the US National Science Foundation, Microsoft Research and the Newton Institute for Mathematical Sciences, Cambridge, UK.

## Footnotes

One contribution of 11 to a Theme Issue ‘Statistical challenges of high-dimensional data’.

© 2009 The Royal Society