Sufficient dimension reduction and prediction in regression

Kofi P. Adragni, R. Dennis Cook

Abstract

Dimension reduction for regression is a prominent issue today because technological advances now allow scientists to routinely formulate regressions in which the number of predictors is considerably larger than in the past. While several methods have been proposed to deal with such regressions, principal components (PCs) still seem to be the most widely used across the applied sciences. We give a broad overview of ideas underlying a particular class of methods for dimension reduction that includes PCs, along with an introduction to the corresponding methodology. New methods are proposed for prediction in regressions with many predictors.

1. Introduction

Consider the frequently encountered goal of determining a rule m(x) for predicting a future observation of a univariate response variable Y at the given value x of a p × 1 vector X of continuous predictors. Assuming that Y is quantitative, continuous or discrete, the mean-squared error E{Y − m(x)}^2 is minimized by choosing m(x) to be the mean E(Y | X = x) of the conditional distribution of Y | (X = x). Consequently, the prediction goal is often specialized immediately to the task of estimating the conditional mean function E(Y | X) from the regression of Y on X. When the response is categorical with sample space SY consisting of h categories SY = {C1, …, Ch}, the mean function is no longer a relevant quantity for prediction. Instead, given an observation x on X, the predicted category C* is usually taken to be the one with the largest conditional probability, C* = arg max_k Pr(Y = Ck | X = x), where the maximization is over SY. When pursuing the estimation of E(Y | X) or Pr(Ck | X), it is nearly always worthwhile to consider predictions based on a function R(X) of dimension less than p, provided that it captures all of the information that X contains about Y so that E(Y | X) = E(Y | R(X)). We can think of R(X) as a function that concentrates the relevant information in X. The action of replacing X with a lower-dimensional function R(X) is called dimension reduction; it is called sufficient dimension reduction when R(X) retains all the relevant information about Y. A potential advantage of sufficient dimension reduction is that predictions based on an estimated R may be substantially less variable than those based on X, without introducing worrisome bias. This advantage is not confined to predictions, but may accrue in other phases of a regression analysis as well.

One goal of this paper is to give a broad overview of ideas underlying sufficient dimension reduction for regression, along with an introduction to the corresponding methodology. Sections 1a,b, 2 and 3 are devoted largely to this review. Sufficient dimension-reduction methods are designed to estimate a population parameter called the central subspace, which is defined in §1b. Another goal of this paper is to describe a new method of predicting quantitative responses following sufficient dimension reduction; categorical responses will be discussed only for contrast. The focus of this paper shifts to prediction in §4, where we discuss four inverse regression models, describe the prediction methodology that stems from them and give simulation results to illustrate their behaviour. Practical implementation issues are discussed in §5, along with additional simulation results.

(a) Dimension reduction

There are many methods available for estimating E(Y | X) based on a random sample (Yi, Xi), i = 1, …, n, from the joint distribution of Y and X. If p is sufficiently small and n is sufficiently large, it may be possible to estimate E(Y | X) adequately by using non-parametric smoothing (see Wand & Jones 1995). Otherwise, nearly all techniques for estimating E(Y | X) employ some type of dimension reduction for X, either estimated or imposed as an intrinsic part of the model or method.

Broadly viewed, dimension reduction has always been a central statistical concept. In the second half of the nineteenth century, ‘reduction of observations’ was widely recognized as a core goal of statistical methodology, and principal components (PCs) were emerging as a general method for the reduction of multivariate observations (Adcock 1878). By the mid-1900s, PCs were established as a first reductive method for regression.

Dimension reduction for regression is a prominent issue today because technological advances now allow scientists to routinely formulate regressions in which p is considerably larger than in the past. This has complicated the development and fitting of regression models. Experience has shown that the standard iterative paradigm for model development guided by diagnostics (Cook & Weisberg 1982, p. 7) can be imponderable when applied with too many predictors. An added complication arises when p is larger than the number of observations n, leading to the so-called ‘n < p’ problem. Standard methods of fitting and corresponding inference procedures may no longer be applicable in such regressions. These and related issues have caused a shift in the applied sciences towards a different regression genre with the goal of reducing the dimensionality of the predictor vector as a first step in the analysis. Although large-p regressions are perhaps mainly responsible for renewed interest, dimension-reduction methodology can be useful regardless of the size of p. For instance, it is often helpful to have an informative low-dimensional graphical summary of the regression to facilitate model building and gain insights. For this goal, p may be regarded as large when it exceeds 2 or 3 because these bounds represent the limits of our ability to view a dataset in full using computer graphics. Subsequent references to ‘large p’ in this paper do not necessarily imply that n < p.

Reduction by PCs is ubiquitous in the applied sciences, particularly in bioinformatics applications, where PCs have been called ‘eigen-genes’ (Alter et al. 2000) in microarray data analyses and ‘meta-kmers’ in analyses involving DNA motifs. A recent report reiterates and makes clear that past influential analyses of data on global warming are flawed because of an inappropriate use of the PC methodology.

While PCs seem to be the dominant method of dimension reduction across the applied sciences, there are many other established and recent statistical methods that might be used to address large-p regressions, including factor analysis, inverse regression estimation (IRE) (Cook & Ni 2005), partial least squares (PLS), projection pursuit, seeded reductions (Cook et al. 2007), kernel methods (Fukumizu et al. 2009) and sparse methods like the lasso (Tibshirani 1996) that are based on penalization.

(b) Sufficient dimension reduction

Dimension reduction is a rather amorphous concept in statistics, changing its character and goals depending on context. Formulated specifically for regression, the following definition (Cook 2007) of a sufficient reduction will help in our pursuit of methods for reducing the dimension of X while en route to estimating E(Y | X).

Definition 1.1.

A reduction R : ℝ^p → ℝ^q, q ≤ p, is sufficient if it satisfies one of the following three statements:

  • (i) inverse reduction, X | (Y, R(X)) ∼ X | R(X),

  • (ii) forward reduction, Y | X ∼ Y | R(X),

  • (iii) joint reduction, X ⫫ Y | R(X),

where ⫫ indicates independence, ∼ means identically distributed and A | B refers to the random vector A given the vector B.

Each of the three conditions in this definition conveys the idea that the reduction R(X) carries all the information that X has about Y, and consequently all the information available to estimate E(Y | X). They are equivalent when (Y, X) has a joint distribution. In that case, we are free to determine a reduction inversely or jointly and then pass it to the conditional mean without additional structure: E(Y | X) = E(Y | R(X)). In some cases, there may be a direct connection between R(X) and E(Y | X). For instance, if (Y, X) follows a non-singular multivariate normal distribution, then R(X) = E(Y | X) is a sufficient reduction, E(Y | X) = E{Y | E(Y | X)}. This reduction is also minimal sufficient: if T(X) is any sufficient reduction, then R is a function of T. Further, because of the nature of the multivariate normal distribution, it can be expressed as a linear combination of the elements of X: R = β^T X is minimal sufficient for some vector β.

Inverse reduction by itself does not require the response Y to be random, and it is perhaps the only reasonable reductive route when Y is fixed by design. For instance, in discriminant analysis, X|Y is a random vector of features observed in one of a number of subpopulations indicated by the categorical response Y , and no discriminatory information will be lost if classifiers are restricted to R.

If we consider a generic statistical problem and reinterpret X as the total data D and Y as the parameter θ, then the condition for inverse reduction becomes D | (θ, R) ∼ D | R so that R is a sufficient statistic. In this way, the definition of a sufficient reduction encompasses Fisher’s (1922) classical definition of sufficiency. One difference is that sufficient statistics are observable, while a sufficient reduction may contain unknown parameters and thus needs to be estimated. For example, if (X, Y) follows a non-singular multivariate normal distribution, then R(X) = β^T X, and it is necessary to estimate β.

In some regressions, R(X) may be a nonlinear function of X, and in extreme cases no reduction may be possible, so that all sufficient reductions are one-to-one functions of X and thus equivalent to R(X) = X. Most often, we encounter multi-dimensional reductions consisting of several linear combinations R(X) = η^T X, where η is an unknown p × q matrix, q ≤ p, that must be estimated from the data. Linear reductions may be imposed to facilitate progress, as in the moment-based approach reviewed in §3a. They can also arise as a natural consequence of modelling restrictions, as we will see in §3b. If η^T X is a sufficient linear reduction, then so is (ηA)^T X for any q × q full-rank matrix A. Consequently, only the subspace span(η) spanned by the columns of η can be identified; span(η) is called a dimension-reduction subspace. If span(η) is a dimension-reduction subspace, then so is span(η, η1) for any p × q1 matrix η1. If span(η1) and span(η2) are both dimension-reduction subspaces, then under mild conditions so is their intersection span(η1) ∩ span(η2) (Cook 1996, 1998). Consequently, the inferential target in sufficient dimension reduction is often taken to be the central subspace S_{Y|X}, defined as the intersection of all dimension-reduction subspaces (Cook 1994, 1996, 1998). A minimal sufficient linear reduction is then of the form R(X) = η^T X, where the columns of η now form a basis for S_{Y|X}. We assume that the central subspace exists throughout this paper, and use d = dim(S_{Y|X}) to denote its dimension.

The ideas of a sufficient reduction and the central subspace can be used to further our understanding of the existing methodology and to guide the development of a new methodology. In §§2 and 3, we consider how sufficient reductions arise in three contexts: forward linear regression, inverse moment-based reduction and inverse model-based reduction.

2. Reduction in forward linear regression

The standard linear regression model Y = β_0 + β^T X + ε, with ε ⫫ X and E(ε) = 0, implies that Y | X ∼ Y | β^T X and thus that R(X) = β^T X is minimal sufficient. The assumption of a linear regression then automatically focuses our interest on β, which can be estimated straightforwardly using ordinary least squares (OLS) when n is sufficiently large, and it may appear that there is little to be gained from dimension reduction. However, dimension reduction has been used in linear regression to improve on the OLS estimator of β and to deal with n < p regressions. One approach consists of regressing Y on X in two steps. The first is the reduction step: reduce X linearly to G^T X using some methodology that produces a p × q matrix G, q ≤ p. The second step consists of using OLS to estimate the mean function E(Y | G^T X) for the reduced predictors. To describe the resulting estimator β̂_G of β and establish notation for later sections, let 𝕐 be the n × 1 vector of centred responses Y_i − Ȳ, let X̄ denote the sample mean vector of the predictors, let 𝕏 be the n × p matrix with rows (X_i − X̄)^T, i = 1, …, n, let Σ̂ = 𝕏^T 𝕏/n denote the usual estimator of Σ = var(X), let Ĉ = 𝕏^T 𝕐/n, which is the usual estimator of C = cov(X, Y), and let β̂_ols = Σ̂^{-1} Ĉ be the vector of coefficients from the OLS fit of Y on X. Then (Cook & Forzani 2009b)

β̂_G = G(G^T Σ̂ G)^{-1} G^T Ĉ.   (2.1)

This estimator, which is the projection of β̂_ols onto span(G) in the Σ̂ inner product, does not require computation of Σ̂^{-1} if q < p and thus could be useful when n < p, depending on the size of q. In any case, the estimator of E(Y | X) is

Ê(Y | X) = Ȳ + β̂_G^T (X − X̄).   (2.2)

If G = I_p, then β̂_G = β̂_ols and nothing is gained. If we choose the columns of G to be the first q eigenvectors of Σ̂, then G^T X consists of the first q PCs and β̂_G is the standard principal component regression (PCR) estimator. Setting G = (Ĉ, Σ̂Ĉ, …, Σ̂^{q−1}Ĉ) yields the PLS estimator with q factors (Helland 1990). Eliminating predictors by using an information criterion like Akaike AIC or Bayesian BIC (§5a) can result in a G with rows selected from the identity matrix I_p, and again we obtain a reduction in X prior to estimation of β. If span(G) is a consistent estimator of a dimension-reduction subspace, then β̂_G may be a reasonable estimator of β because span(β) is itself a dimension-reduction subspace under the linear model and is then contained in the limit of span(G), recalling that the intersection of any two dimension-reduction subspaces is itself a dimension-reduction subspace. However, while these estimators are well known, span(G) may not be a consistent estimator of a dimension-reduction subspace without an additional structure, even if the linear model is accurate. The PCR estimator depends on G only through the marginal distribution of X, and this alone cannot guarantee that span(G) is consistent. The performance of the PLS estimator depends on the relationship between C and the eigenstructure of Σ (Naik & Tsai 2000).
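To make the two-step construction concrete, here is a minimal R sketch of equations (2.1) and (2.2), with G chosen as the first q eigenvectors of Σ̂ so that β̂_G is the PCR estimator; the function name, the simulated data and the choice q = 3 are illustrative assumptions, not part of the original development.

```r
## Reduced OLS estimator (2.1) and predictor (2.2) with the PCR choice of G:
## G = first q eigenvectors of the sample covariance matrix of X.
reduced_ols <- function(X, y, q) {
  n  <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)          # centred predictors
  yc <- y - mean(y)                                     # centred responses
  Sigma <- crossprod(Xc) / n                            # estimator of var(X)
  Chat  <- crossprod(Xc, yc) / n                        # estimator of cov(X, Y)
  G <- eigen(Sigma, symmetric = TRUE)$vectors[, 1:q, drop = FALSE]
  beta_G <- G %*% solve(t(G) %*% Sigma %*% G, t(G) %*% Chat)   # equation (2.1)
  predict_fun <- function(Xnew) {                       # equation (2.2)
    mean(y) + sweep(Xnew, 2, colMeans(X)) %*% beta_G
  }
  list(beta = beta_G, predict = predict_fun)
}

## Illustrative use on simulated data
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rep(1, p)) + rnorm(n)
fit <- reduced_ols(X, y, q = 3)
head(fit$predict(X))
```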

A somewhat different approach is based on estimating β by using a penalized objective function like that for the lasso (Tibshirani 1996). The lasso estimator is

β̂_lasso = arg min_β { Σ_{i=1}^n (Y_i − β_0 − β^T X_i)^2 + λ Σ_{j=1}^p |β_j| },

where β_j is the jth element of β, j = 1, …, p, and the tuning parameter λ is often chosen by cross validation. Several elements of β̂_lasso are typically zero, which corresponds to setting the rows of G to be the rows of the identity matrix I_p corresponding to the non-zero elements of β̂_lasso. However, with this G, we do not necessarily have β̂_G = β̂_lasso, although the two estimators are often similar. Consequently, the methodology based on penalization does not fit exactly the general form given in equation (2.1).

Pursuing dimension reduction based on linear regression may not produce useful results if the model is not accurate, particularly if the distribution of Y | X depends on more than one linear combination of the predictors. There are many diagnostic and remedial methods available to improve linear regression models when p is not too large. Otherwise, application of these methods can be quite burdensome.

3. Inverse reduction

Inverse regression X | Y provides an alternative approach to estimating a sufficient reduction. It can deal straightforwardly with regressions that depend on more than one linear combination of the predictors, and does not necessarily suffer from the modelling problems that plague forward regression when p is large. There are two general paradigms for determining a sufficient reduction inversely. The first is by specifying a parametric model for the inverse regression of X on Y, as discussed in §§3b and 4. In this model-based approach, minimal sufficient reductions can in principle be determined from the model itself. For example, we saw previously that E(Y | X) is a sufficient reduction when (Y, X) is normally distributed. The second, which is discussed in §3a, is the moment-based approach in which derived moment relations are used to estimate a sufficient reduction by way of the central subspace. Model accuracy is nearly always an issue in the model-based approach, while efficiency is worrisome in the moment-based approach. Some of the best moment-based methods have turned out to be quite inefficient in relatively simple settings (see Cook & Forzani 2009a, for an instance of this inefficiency).

(a) Moment-based inverse reduction

In contrast to model-based reduction, there is no law to guide the choice of R(X) in moment-based reduction. However, progress is still possible by restricting consideration to multi-dimensional linear reduction and pursuing estimation of the central subspace S_{Y|X}, as discussed in §1b. Recall that d = dim(S_{Y|X}) and that the columns of the p × d matrix η form a basis for S_{Y|X}.

Sliced inverse regression (SIR; Li 1991) and sliced average variance estimation (SAVE; Cook & Weisberg 1991) were the first moment-based methods proposed for dimension reduction. Although the concept of the central subspace was not developed until a few years after SIR and SAVE were introduced, it is now known that these methods in fact estimate S_{Y|X} under two key conditions: (i) E(X | η^T X) is a linear function of η^T X (linearity condition) and (ii) var(X | η^T X) is a non-random matrix (constant covariance condition). We forgo discussion of these conditions, which involve only the marginal distribution of X, as they are well known and widely regarded as mild. A good recent discussion of them was given by Li & Wang (2007). Under the linearity condition, span{Σ^{-1}(E(X | Y) − E(X))} ⊆ S_{Y|X}, which is the population foundation for SIR. Under the linearity and constant covariance conditions, span{Σ^{-1}(Σ − var(X | Y))} ⊆ S_{Y|X}, which is the population basis for SAVE. When the response is categorical, E(X | Ck) can be estimated straightforwardly as the average predictor vector X̄_k in category Ck. The SIR estimator of S_{Y|X}, which requires n > p, is then the span of the first d eigenvectors of Σ̂^{-1} M D M^T, where M is the p × h matrix with columns X̄_k − X̄ and D is the diagonal matrix of sample fractions of observations falling in the h categories. Continuous responses are treated by slicing the observed range of Y into h categories Ck and then applying the method for a categorical response. The SAVE estimator uses a similar construction. Routines for computing SIR, SAVE and other moment-based estimates of S_{Y|X} are available in the R package ‘dr’ (http://cran.us.r-project.org/web/packages/dr/index.html) and in the Arc software (www.stat.umn.edu/arc/software.html).
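As a concrete illustration of the slicing construction just described, the following R sketch computes a basic SIR estimate of the central subspace for a continuous response. It is only a bare-bones stand-in for the implementation in the ‘dr’ package; the number of slices, the equal-probability slicing and the simulated example are illustrative choices.

```r
## Bare-bones SIR: slice Y, average X within slices, and take the span of
## the first d eigenvectors of Sigma^{-1} (weighted covariance of slice means).
sir <- function(X, y, d = 2, h = 5) {
  n  <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  Sigma <- crossprod(Xc) / n
  slice <- cut(y, breaks = quantile(y, probs = seq(0, 1, length.out = h + 1)),
               include.lowest = TRUE, labels = FALSE)
  M <- sapply(1:h, function(k) colMeans(Xc[slice == k, , drop = FALSE]))  # p x h slice means
  D <- diag(as.numeric(table(slice)) / n)               # slice proportions
  ## Work in the standardized scale to keep the eigenproblem symmetric
  es <- eigen(Sigma, symmetric = TRUE)
  Sinvhalf <- es$vectors %*% diag(1 / sqrt(es$values)) %*% t(es$vectors)
  kernel <- Sinvhalf %*% M %*% D %*% t(M) %*% Sinvhalf
  eta <- Sinvhalf %*% eigen(kernel, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]
  eta                                                   # estimated basis of the central subspace
}

## Illustrative example with a two-dimensional central subspace
set.seed(2)
n <- 300; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + X[, 2]^2 + 0.5 * rnorm(n)
basis <- sir(X, y, d = 2, h = 10)
```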

Both SIR and SAVE provide √n-consistent estimators of S_{Y|X} under standard conditions, but by itself consistency does not guarantee good performance in practice. It is known that SIR has difficulty finding directions that are associated with certain types of nonlinear trends in E(Y | X). SAVE was developed in response to this limitation, but its ability to find linear trends is generally inferior to SIR’s. Several moment-based methods have been developed in an effort to improve on the estimates of S_{Y|X} provided by SIR and SAVE. Using the same population foundations as SIR, Cook & Ni (2005) developed an asymptotically optimal method of estimating S_{Y|X} called IRE. Ye & Weiss (2003) and Zhu et al. (2005) attempted to combine the advantages of SIR and SAVE by using linear combinations of them. Cook & Forzani (2009a) used a likelihood-based objective function to develop a method called LAD (likelihood acquired directions) that apparently dominates all dimension-reduction methods based on the same population foundations as SIR and SAVE. These methods have been developed and studied mostly in regressions where p ≪ n, although there are some results for other settings (Li 2007; Li & Yin 2008). SIR, SAVE, IRE and LAD come with a range of inference capabilities, including methods for estimating d and tests of conditional independence hypotheses such as Y is independent of X1 given X2, where we have partitioned X = (X1^T, X2^T)^T.

Moment-based sufficient dimension-reduction methods provide estimates of the minimal sufficient linear reduction, but they are not designed specifically for prediction and do not produce predictive methods per se. Instead, once an estimated basis η̂ of S_{Y|X} has been determined, it is considered fixed and standard model development methods are typically used to estimate E(Y | η̂^T X) and thereby obtain predictions. Model-based sufficient reduction methods allow a more direct route to estimation of E(Y | X).

(b) Model-based inverse reduction

Model-based sufficient dimension reduction is relatively new (Cook 2007). There are several useful characteristics of this approach, which may become clear in the sections that follow. One is that a model for X | Y can itself be inverted to provide a method for estimating the forward mean function E(Y | X) without specifying a model for the full joint distribution of (X, Y). For convenience, we denote the densities of X and X | Y by g(X) and g(X | Y), and so on, keeping in mind that the symbol g indicates a different density in each case. Because densities will always appear together with their arguments, this should cause no ambiguity. We assume that R(X) has a density as well. With these understandings, we have

E(Y | X = x) = E{Y g(x | Y)}/E{g(x | Y)} = E{Y | R(X) = R(x)} = E{Y g(R(x) | Y)}/E{g(R(x) | Y)},   (3.1)

where all right-hand side expectations are with respect to the marginal distribution of Y. Equation (3.1)1 provides a relationship between the mean function E{Y | X} and the conditional density of X | Y, while equation (3.1)2 is a restatement of the equality of the mean functions for the regressions of Y on X and R(X). The final equality (3.1)3 establishes a relationship between these forward mean functions and the conditional density of R | Y, and provides a method to estimate E(Y | X):

Ê(Y | X = x) = Σ_{i=1}^n w_i(x) Y_i, with w_i(x) = ĝ{R̂(x) | y_i} / Σ_{j=1}^n ĝ{R̂(x) | y_j},   (3.2)

where ĝ denotes an estimated density and R̂ is the estimated reduction. This estimator is reminiscent of a non-parametric kernel estimator (Simonoff 1996, ch. 4), but there are important differences. The weights in a kernel estimator do not depend on the response, while the weights w_i here do. Kernel weights typically depend on the full vector of predictors X, while the weights here depend on X only through the estimated reduction R̂(X). If d is small, the estimator (3.2) may avoid the curse of dimensionality. Multivariate kernels are usually taken to be the product of univariate kernels, corresponding here to constraining the components of R to be independent. Finally, there is no explicit bandwidth in our weights as they are determined entirely from the estimated conditional density ĝ, which eliminates the need for bandwidth estimation by, for example, cross validation.
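Equation (3.2) is simply a weighted average of the observed responses, so its generic structure fits in a few lines of R. In the sketch below, the arguments reduce and ghat are placeholders for the estimated reduction and the estimated conditional density, which are supplied by the normal inverse models of §4; both names are ours, not the authors'.

```r
## Generic prediction rule (3.2):
##   Ehat(Y | X = x) = sum_i w_i(x) Y_i,
##   w_i(x) = ghat(Rhat(x) | y_i) / sum_j ghat(Rhat(x) | y_j).
inverse_predict <- function(xnew, y, reduce, ghat) {
  r <- reduce(xnew)                                       # Rhat(xnew)
  dens <- vapply(seq_along(y), function(i) ghat(r, y[i]), numeric(1))
  w <- dens / sum(dens)                                   # weights of equation (3.2)
  sum(w * y)                                              # estimated E(Y | X = xnew)
}
```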

The success of this approach depends on obtaining good estimators of the reduction and of its conditional density. In the next section, we address these issues by using normal models for the conditional distribution of X | Y . The models in §§4a,b and d were introduced by Cook (2007). The model in §4c is from Cook & Forzani (2009b). Our discussion includes a review of the results that are needed to use equation (3.2).

4. Normal inverse models

Let Xy denote a random vector distributed as X | (Y = y), and assume that Xy is normally distributed with mean μy and constant variance matrix Δ > 0, where the inequality means that Δ is positive definite. Let μ = E(X) and let Γ denote a p × d basis matrix whose columns form a basis for the d-dimensional subspace S_Γ = span{μy − μ : y ∈ SY}, where SY denotes the sample space of Y. Then we can write (Cook 2007)

Xy = μ + Γνy + ε,   (4.1)

where ε is independent of Y and normally distributed with mean 0 and covariance matrix Δ, and νy = Γ^T(μy − μ) ∈ ℝ^d; we assume that var(νY) > 0. The basis matrix Γ is not identifiable in this model, as for any full-rank d × d matrix A we can always obtain an equivalent parametrization as Γνy = (ΓA^{-1})(Aνy). However, span(Γ) is identifiable and estimable, and for this reason we assume without loss of generality that Γ is a semiorthogonal matrix, Γ^T Γ = I_d.

Model (4.1) represents the fact that the translated conditional means μy − μ fall in the d-dimensional subspace S_Γ = span(Γ). Under model (4.1), R(X) = Γ^T Δ^{-1} X is minimal sufficient (Cook & Forzani 2009b), and the goal is thus to estimate span(Δ^{-1}Γ). As the minimal sufficient reduction is linear, it follows that S_{Y|X} = span(Δ^{-1}Γ). In other words, in the class of models represented by equation (4.1), moment-based and model-based dimension reduction coincide in the population.

As a convenient notation for describing estimators of span(Δ^{-1}Γ), let S_d(A, B) denote the span of A^{-1/2} times the first d eigenvectors of A^{-1/2} B A^{-1/2}, where A and B are symmetric matrices and A is non-singular. The subspace S_d(A, B) can also be described as the span of the first d eigenvectors of A^{-1} B. We refer to errors having covariance matrix Δ = σ^2 I_p as isotonic. Isotonic models are models with isotonic errors. For notational simplicity, we will use Xi when referring to observations rather than the more awkward notation Xyi.

(a) The principal component model

The isotonic version of model (4.1) is called the PC model because the maximum likelihood estimator (MLE) of S_Γ = span(Γ) is the span of the first d eigenvectors of Σ̂, and thus the d components of R̂(X) = Γ̂^T X are simply the first d PCs. This relatively simple result is due to the nature of Δ. As the errors are isotonic, the contours of Δ are circular. When the signal Γνy is added, the contours of Σ = Γ var(νY) Γ^T + σ^2 I_p become p-dimensional ellipses with their longest d axes spanning span(Γ). The MLE σ̂^2 of σ^2 is the average of the smallest p − d eigenvalues of Σ̂, σ̂^2 = Σ_{j=d+1}^p λ̂_j/(p − d), where λ̂_1 ≥ ⋯ ≥ λ̂_p are the eigenvalues of Σ̂, and the MLE of the sufficient reduction is simply R̂(X) = Γ̂^T X, with the columns of Γ̂ given by the first d eigenvectors of Σ̂.

We performed a small simulation to provide some intuition. Observations on X were generated as X = Γ*ν + ε, where ν ∈ ℝ^2 was sampled uniformly from the boundary of the square [−1, 1]^2, the elements of the p × 2 matrix Γ* were sampled independently from a standard normal distribution, and the error vector ε was sampled from a normal distribution with mean 0 and variance matrix I_p. In terms of model (4.1), μ = 0, Γ = Γ*(Γ*^T Γ*)^{-1/2} and νy = (Γ*^T Γ*)^{1/2}ν. This sampling process, which does not require an explicit choice of Y, was repeated n = 80 times for various values of p. Figure 1 shows plots of the first two PCs for four values of p. We see that for small p the square is not recognizable, but for larger values of p the square is quite clear. In figure 1d, there are p = 500 predictors, while the number of observations is still n = 80. The sides of the estimated square in figure 1d do not align with the coordinate axes because the method is designed to estimate only the subspace span(Γ), which is equal to S_{Y|X} with isotonic errors.

Figure 1.

Plots of the estimated sufficient reduction from the PC model with n=80 observations and varying number of predictors p: (a) p=3; (b) p=5; (c) p=25; (d) p=500.
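The simulation behind figure 1 can be sketched in R as follows; the seed and plotting details are illustrative assumptions, since only the structure of the data-generating process is described in the text.

```r
## Illustrative version of the figure 1 set-up: with isotonic errors the
## first two PCs of X recover the square traced out by nu as p grows.
set.seed(3)
n <- 80; p <- 500

rsquare <- function(n) {            # uniform sample on the boundary of [-1, 1]^2
  side <- sample(1:4, n, replace = TRUE)
  u <- runif(n, -1, 1)
  t(vapply(seq_len(n), function(i)
    switch(side[i], c(u[i], -1), c(u[i], 1), c(-1, u[i]), c(1, u[i])),
    numeric(2)))
}

nu    <- rsquare(n)                                   # n x 2 coordinate vectors
Gstar <- matrix(rnorm(p * 2), p, 2)                   # p x 2 matrix Gamma*
X     <- nu %*% t(Gstar) + matrix(rnorm(n * p), n, p) # X = Gamma* nu + error

Xc <- scale(X, center = TRUE, scale = FALSE)
pc <- eigen(crossprod(Xc) / n, symmetric = TRUE)$vectors[, 1:2]
plot(Xc %*% pc, xlab = "first PC", ylab = "second PC")
```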

Turning to prediction, let Γ̂ denote the p × d matrix with columns consisting of the first d eigenvectors of Σ̂. Then, the weights (3.2) can be written as

w_i(x) = exp{−‖R̂(x) − R̂(X_i)‖^2/(2σ̂^2)} / Σ_{j=1}^n exp{−‖R̂(x) − R̂(X_j)‖^2/(2σ̂^2)},   (4.2)

where R̂(x) = Γ̂^T x is the estimated reduction. In this case, the estimated conditional density ĝ{R̂(x) | y_i} is in the form of a normal product kernel with bandwidth σ̂. Here and in other weights, we assumed that d is known. Methods for choosing d in the context of prediction are discussed in §5.
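Putting the pieces of the PC model together, the R sketch below computes Γ̂ from the first d eigenvectors of Σ̂, σ̂^2 from the remaining eigenvalues, and then uses the weights (4.2) in equation (3.2); it is a minimal illustration of these formulas rather than a packaged implementation.

```r
## Prediction under the PC model: reduction Gamma-hat^T x and weights (4.2).
pc_fit <- function(X, y, d) {
  n <- nrow(X); p <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  ev <- eigen(crossprod(Xc) / n, symmetric = TRUE)
  Gam  <- ev$vectors[, 1:d, drop = FALSE]        # first d eigenvectors of Sigma-hat
  sig2 <- mean(ev$values[(d + 1):p])             # average of the trailing eigenvalues
  list(Gam = Gam, sig2 = sig2, Rtrain = X %*% Gam, y = y)
}

pc_predict <- function(fit, Xnew) {
  Rnew <- Xnew %*% fit$Gam
  apply(Rnew, 1, function(r) {
    d2 <- colSums((t(fit$Rtrain) - r)^2)         # squared distances in the reduced scale
    w  <- exp(-d2 / (2 * fit$sig2))              # unnormalized weights (4.2)
    sum(w * fit$y) / sum(w)                      # equation (3.2)
  })
}
```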

The method of estimating E(Y | X) characterized by equation (4.2) is distinct from the standard PCR estimator given in equation (2.2) with the columns of G being the first d eigenvectors of Σ̂. Predictions from equations (4.2) and (2.2) both use PCs, but the way in which they are used is quite different. Model (4.1), which leads to PCs as the MLE of the sufficient reduction, has a very general mean function, but its error structure is restrictive. Nevertheless, this error structure has recently been used in studies of gene expression (X) that are complicated by stratification and heterogeneity (Leek & Storey 2007). On the other hand, the usual linear model has a restrictive mean function, and under that model alone there seems to be no clear rationale for reduction by PCs (Cox 1968).

We performed a small simulation study to illustrate the relative behaviour of predictions using equations (2.2) and (4.2). The basic simulation scenario was the same as that leading to figure 1, but in this case we set n = 200, d = 1, Γ* = (1, …, 1)^T, var(ε) = 5^2 I_p and νY = Y, where Y is a standard normal random variable. In this set-up, (X, Y) has a multivariate normal distribution, so both equations (2.2) and (4.2) are appropriate, but the predictions from PCR (2.2) use the fact that the mean function E(Y | X) is linear, while predictions using equation (4.2) do not. For each of 100 datasets generated in this way, the mean-squared prediction error PE_k for the estimated mean function Ê_k(Y | X) was determined by sampling 200 new observations (X*, Y*):

PE_k = (1/200) Σ_{j=1}^{200} {Y*_j − Ê_k(Y | X = X*_j)}^2.   (4.3)

The final prediction error was then determined by averaging, PE = (1/100) Σ_{k=1}^{100} PE_k. The prediction errors are shown in figure 2a for 3 ≤ p ≤ 150. While the PCR predictions (2.2) perform a bit better at all values of p, the loss when using predictions from the inverse model is negligible. Figure 2b was constructed in the same way, except that we set νY to be a nonlinear function of Y. In this case, the predictions (4.2) from the inverse model are much better than the PCR predictions because they automatically adapt to E(Y | X) regardless of the nature of νY.

Figure 2.

Comparison of prediction errors: (a) the linear scenario with νY = Y; (b) the nonlinear scenario. The line passing through the plotted points corresponds to predictions from the inverse model using equation (3.2) with weights (4.2). The other line is for PCR predictions (2.2).

(b) The isotonic principal fitted component model

It will often be possible and useful to model the coordinate vectors as νy = βfy, where fy ∈ ℝ^r is a known vector-valued function of y with linearly independent elements and β is an unrestricted d × r matrix of rank d, d ≤ r. Under this model for νy, each coordinate Xyj, j = 1, …, p, of Xy follows a linear model with predictor vector fy. Consequently, when Y is quantitative, we are able to use inverse response plots (Cook 1998, ch. 10) of Xyj versus y, j = 1, …, p, to gain information about suitable choices for fy, which is an ability that is not generally available in the forward regression of Y on X. When p is too large for such graphical inspection, the PC model can be used as a tool to aid in selecting fy. Under the PC model, the MLE of νy for the ith observed case is ν̂_i = (Γ̂^T Γ̂)^{-1} Γ̂^T (X_i − X̄) = Γ̂^T (X_i − X̄); the second equality follows because Γ̂ is a semiorthogonal matrix. These vectors ν̂_i can then be plotted against Yi, i = 1, …, n, and used in the same way as the inverse response plots. In cases like figure 1d, brushing a histogram of the response while observing movement of the corresponding points around the square may be useful. We can also consider vectors fy that contain a reasonably flexible set of basis functions, like polynomial terms in Y, which may also be useful when it is impracticable to apply graphical methods to all of the predictors. Piecewise polynomials could also be used.

In some regressions, there may be a natural choice for fy. Suppose for instance that Y is categorical, taking values in one of h categories Ck, k = 1, …, h. We can then set r = h − 1 and specify the kth element of fy to be J(y ∈ Ck), where J is the indicator function. Another option consists of ‘slicing’ the observed values of a continuous Y into h bins (categories) Ck, k = 1, …, h, and then specifying the kth coordinate of fy as for the case of a categorical Y. This has the effect of approximating each conditional mean E(Xyj) as a step function of y with h steps, E(Xyj) ≈ μ_j + γ_j^T b_k for y ∈ Ck, where γ_j^T is the jth row of Γ and b_k is the kth column of β.
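The two choices of fy just described are easy to construct; the R sketch below builds a centred polynomial basis and a centred slice-indicator basis, returning in each case the n × r matrix with rows (f_{y_i} − f̄)^T used in the fitting formulas that follow. The helper names and default values are illustrative.

```r
## Centred basis matrices for two common choices of f_y.
poly_basis <- function(y, degree = 3) {
  F <- sapply(1:degree, function(k) y^k)          # columns y, y^2, ..., y^degree
  scale(F, center = TRUE, scale = FALSE)
}

slice_basis <- function(y, h = 5) {
  slice <- cut(y, breaks = quantile(y, probs = seq(0, 1, length.out = h + 1)),
               include.lowest = TRUE, labels = FALSE)
  F <- sapply(1:(h - 1), function(k) as.numeric(slice == k))   # r = h - 1 indicators
  scale(F, center = TRUE, scale = FALSE)
}
```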

Models with νy = βfy are called principal fitted component (PFC) models. To describe the MLE of span(Γ) when the errors are isotonic, let 𝔽 denote the n × r matrix with rows (f_{y_i} − f̄)^T, where f̄ is the sample average of the f_{y_i}. Then, the n × p matrix of centred fitted vectors from the linear regression of X on f is P_𝔽 𝕏 and the sample covariance matrix of these fitted vectors is Σ̂_fit = 𝕏^T P_𝔽 𝕏/n, where P_𝔽 denotes the linear operator that projects onto the subspace spanned by the columns of 𝔽. Under this isotonic PFC model, the MLE of span(Γ) is the span of the first d eigenvectors of Σ̂_fit. Johnson (2008) gave sufficient conditions for this estimator to converge at the usual √n rate.

Under the isotonic PFC model, the sufficient reduction is estimated as R̂(X) = Γ̂^T X, where the columns of Γ̂ are the first d eigenvectors of Σ̂_fit. The elements of this R̂(X) are called PFCs. Additionally, the MLE of β is β̂ = Γ̂^T B̂_ols, where B̂_ols is the p × r coefficient matrix from the multivariate OLS fit of X on f (see equation (4.4)), and the MLE of σ^2 is σ̂^2 = {trace(Σ̂) − Σ_{j=1}^d λ̂_j}/p, where the λ̂_j are the ordered eigenvalues of Σ̂_fit.

To describe the weights under this model, let B̂_ols denote the p × r coefficient matrix from the multivariate OLS fit of X on f, and let the fitted vectors be denoted by X̂_i = X̄ + B̂_ols(f_{y_i} − f̄). The weights from equation (3.2) are then

w_i(x) = exp{−‖R̂(x) − R̂(X̂_i)‖^2/(2σ̂^2)} / Σ_{j=1}^n exp{−‖R̂(x) − R̂(X̂_j)‖^2/(2σ̂^2)}.   (4.4)

These weights are of the same general form as those in equation (4.2) from the PC model, but there are two important differences. First, the isotonic PFC reduction uses the eigenvectors from Σ̂_fit, while the PC reduction uses the eigenvectors from Σ̂. Second, the reduction is applied to the observed predictor vectors X_i in the PC weights (4.2), while the reduction is applied to the fitted predictor vectors X̂_i in the isotonic PFC weights (4.4).
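The isotonic PFC fit and the weights (4.4) can be sketched in R as below, reusing the basis helpers above; the reduction is built from the eigenvectors of Σ̂_fit and applied to the fitted vectors, as in the text, but the code itself is only an unoptimized illustration.

```r
## Isotonic PFC: Gamma-hat from Sigma_fit, weights (4.4) used in (3.2).
iso_pfc_fit <- function(X, y, F, d) {
  n <- nrow(X); p <- ncol(X)
  Xc   <- scale(X, center = TRUE, scale = FALSE)
  PF   <- F %*% solve(crossprod(F), t(F))               # projection onto span(F)
  Xfit <- PF %*% Xc                                     # centred fitted vectors
  Sfit <- crossprod(Xfit) / n                           # Sigma_fit
  ev   <- eigen(Sfit, symmetric = TRUE)
  Gam  <- ev$vectors[, 1:d, drop = FALSE]
  sig2 <- (sum(diag(crossprod(Xc) / n)) - sum(ev$values[1:d])) / p   # MLE of sigma^2
  Rfit <- sweep(Xfit, 2, colMeans(X), "+") %*% Gam      # reductions of fitted vectors
  list(Gam = Gam, sig2 = sig2, Rfit = Rfit, y = y)
}

iso_pfc_predict <- function(fit, Xnew) {
  Rnew <- Xnew %*% fit$Gam
  apply(Rnew, 1, function(r) {
    w <- exp(-colSums((t(fit$Rfit) - r)^2) / (2 * fit$sig2))         # weights (4.4)
    sum(w * fit$y) / sum(w)
  })
}

## e.g. fit <- iso_pfc_fit(X, y, poly_basis(y, 3), d = 1); iso_pfc_predict(fit, X)
```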

Shown in figure 3a are results from a simulation study to illustrate the potential advantages of using fitted components. The data were generated as for figure 2b. We used fy = (y, y^2, y^3)^T for the fitted model and, to broaden the scope, we increased the overall variability by reducing the sample size to n = 50 and increasing the conditional variance matrix to var(ε) = 15^2 I_p. This is a highly variable regression, but all three methods respond quickly as the number of predictors increases. The PCR predictions do better than the inverse predictions from the PC model for relatively small p, but the situation is reversed for larger values of p. Most importantly, the prediction errors from isotonic PFC are the smallest at all values of p. Figure 3b is discussed in the next section.

Figure 3.

Comparison of prediction errors from PCR (equation (2.2)), inverse PCs (equation (4.2)), isotonic principal fitted components (iso. PFC) (equation (4.4)) and diagonal principal fitted components (diag. PFC) (equation (4.5)): (a) isotonic PFC, fy = (y, y^2, y^3)^T; (b) diagonal PFC, fy = y.

(c) The diagonal principal fitted component model

The isotonic PFC model requires that, given the response, the predictors be independent and have the same variance. While this model will be useful in some applications, the requirement of equal variances is restrictive relative to the range of applications in which reduction may be desirable. In this section, we expand the scope of application by permitting a diagonal error covariance matrix Δ = diag(δ_1^2, …, δ_p^2). This allows for different measurement scales of the predictors, but still requires that they be conditionally independent. The estimated sufficient reduction for this model is R̂(X) = Γ̂^T Δ̂^{-1} X, where Δ̂ is the MLE of Δ and Γ̂ is any basis for the MLE of span(Γ). With this R̂, the weights have the same form as the weights for the isotonic PFC model,

w_i(x) = ĝ{R̂(x) | y_i} / Σ_{j=1}^n ĝ{R̂(x) | y_j},   (4.5)

where ĝ(· | y_i) is the fitted normal density of R(X) | (Y = y_i), with mean R̂(X̂_i) and covariance matrix Γ̂^T Δ̂^{-1} Γ̂, and X̂_i is as defined for equation (4.4).

It remains to determine Δ̂ and Γ̂. For this model, we were unable to find a closed-form expression for these estimators. However, a straightforward alternating algorithm can be developed based on the following reasoning. If the inverse mean function is specified, then the variances δ_j^2 can be estimated by using the sample variances of the coordinates of the centred vectors X_i − μ̂ − Γ̂ν̂_{y_i}. If Δ is specified, then we can standardize the predictor vector to obtain an isotonic PFC model in Z = Δ^{-1/2} X,

Zy = Δ^{-1/2} μ + (Δ^{-1/2} Γ) νy + ε,   (4.6)

where ε is normal with mean 0 and variance matrix I_p. Consequently, we can estimate span(Γ) as Δ^{1/2} times the estimate of span(Δ^{-1/2} Γ) from the isotonic model (4.6). Alternating between these two steps leads to the following algorithm.

  1. Fit the isotonic PFC model to the original data, getting initial estimates Γ̂_0 and σ̂_0^2. The MLE of the intercept μ is always X̄, so there is no need to include this parameter explicitly in the algorithm.

  2. For some small ϵ > 0, repeat for j = 1, 2, … until the estimates change by less than ϵ between successive iterations.

    • (a) Calculate Δ̂_j = diag(δ̂_1^2, …, δ̂_p^2) from the sample variances of the current residuals.

    • (b) Transform Z_i = Δ̂_j^{-1/2} X_i, i = 1, …, n.

    • (c) Fit the isotonic PFC model to Z, yielding estimates Γ̂_Z, β̂_Z and σ̂_Z^2.

    • (d) Backtransform the estimates to the original scale: Γ̂_j = Δ̂_j^{1/2} Γ̂_Z, β̂_j = β̂_Z.

Cook & Forzani (2009b) proposed a different algorithm for fitting the diagonal PFC model. The algorithm here is more reliable when n < p.
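A compact R rendering of the alternating algorithm is given below. It reuses iso_pfc_fit from the sketch in §4b; the convergence tolerance, the maximum number of iterations and the way residuals are formed from the current fit are implementation choices made for illustration, not prescriptions from the paper.

```r
## Alternating fit of the diagonal PFC model (steps 1 and 2(a)-(d) above).
diag_pfc_fit <- function(X, y, F, d, eps = 1e-6, maxit = 100) {
  n <- nrow(X); p <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  PF <- F %*% solve(crossprod(F), t(F))
  init  <- iso_pfc_fit(X, y, F, d)                 # step 1: initial isotonic fit
  GamZ  <- init$Gam                                # current basis on the standardized scale
  delta <- rep(init$sig2, p)                       # current diagonal of Delta
  for (j in seq_len(maxit)) {
    ## (a) update Delta from the sample variances of the residuals
    Zc   <- sweep(Xc, 2, sqrt(delta), "/")
    fitZ <- PF %*% Zc %*% GamZ %*% t(GamZ)         # fitted inverse means on the Z scale
    res  <- Xc - sweep(fitZ, 2, sqrt(delta), "*")  # residuals on the original scale
    delta_new <- colMeans(res^2)
    ## (b) standardize and (c) refit the isotonic PFC model
    Z    <- sweep(X, 2, sqrt(delta_new), "/")
    GamZ <- iso_pfc_fit(Z, y, F, d)$Gam
    ## (d) a basis for span(Gamma) on the original scale is Delta^{1/2} GamZ
    converged <- max(abs(delta_new - delta)) < eps
    delta <- delta_new
    if (converged) break
  }
  list(Gam = sweep(GamZ, 1, sqrt(delta), "*"), delta = delta, GamZ = GamZ)
}
```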

We again report partial results of a simulation study to illustrate the potential benefits of the diagonal PFC model. The scenario for generating the data was the same as that for figure 2a with 3 ≤ p ≤ 200 and n = 50, modified to induce different conditional variances. The conditional variances δ_1^2 ≤ ⋯ ≤ δ_200^2 were generated once as the order statistics for a sample of size 200 from 60 times a chi-squared random variable with two degrees of freedom. We then used the first p of these ordered variances for a regression with p predictors. In this way, the conditional variances increased as we added predictors. While it is unlikely that predictors would be so ordered in practice, this arrangement seems useful to highlight the relative behaviour of the methods. In each case, we fitted the isotonic and diagonal PFC models using fy = y. Predictions were again assessed using 200 new simulated observations, and the entire set-up was again replicated 100 times to obtain the average prediction errors shown in figure 3b. With p = 3 predictors, all methods perform well because the first three conditional variances are similarly small. The prediction error for the diagonal PFC model changed very little as we added predictors, while the prediction errors for all other methods increased substantially. This happened because the MLEs from the diagonal PFC model can weight each predictor according to its conditional variance, while the predictors are treated equally by the other methods. Predictions from the PC model were consistently the worst. This is perhaps to be expected because the other three methods use mean functions that are appropriate for this simulation.

Predictions from the diagonal PFC model should also perform well when Δ > 0 is not a diagonal matrix, provided that the conditional predictor correlations are small to moderate. This is in agreement with Prentice & Zhao (1991, p. 830), who suggested in a related context that independence models should generally be adequate for a broad range of applications in which the dependences are not strong. The methodology discussed in the next section may be useful in the presence of strong conditional correlations.

(d) The principal fitted component model

In this section, we describe the estimator (3.2) for general Δ > 0, allowing the predictors to be conditionally dependent with different variances. The methods deriving from the isotonic and diagonal PFC models do not require p < n. In contrast, the methods of this section work best in data-rich regressions where p ≪ n. The analysis relies on the MLEs given by Cook (2007, §7.2) and studied in more detail by Cook & Forzani (2009b).

The following notation will help describe the MLEs under the PFC model. Let Σ̂_res = Σ̂ − Σ̂_fit be the sample covariance matrix of the residual vectors X_i − X̂_i from the OLS fit of X on f. Let λ̂_1 ≥ ⋯ ≥ λ̂_p and V̂ be the eigenvalues and the corresponding matrix of eigenvectors of Σ̂_res^{-1/2} Σ̂_fit Σ̂_res^{-1/2}. Finally, let K̂ be a p × p diagonal matrix, with the first d diagonal elements equal to zero and the last p − d diagonal elements equal to λ̂_{d+1}, …, λ̂_p. Then, the MLE of Δ is

Δ̂ = Σ̂_res^{1/2} V̂ (I_p + K̂) V̂^T Σ̂_res^{1/2}.

If d = r, then K̂ = 0, and this estimator reduces to Δ̂ = Σ̂_res. Otherwise, Δ̂ recovers information on variation due to overfitting (r > d). The MLE of β satisfies Γ̂β̂ = P_{Γ̂(Δ̂)} B̂_ols, the projection of B̂_ols onto the span of Γ̂ in the Δ̂^{-1} inner product, and the MLE of span(Δ^{-1}Γ) is the span of the first d eigenvectors of Σ̂^{-1} Σ̂_fit, which is equal to the SIR estimator of S_{Y|X} when the response is categorical (Cook & Forzani 2009b).

Turning to prediction, we substituted these MLEs into the multivariate normal density for X | Y and simplified the resulting expression, in part by ignoring proportionality constants not depending on the observation i, to obtain the weights (3.2):

w_i(x) = ĝ(x | y_i) / Σ_{j=1}^n ĝ(x | y_j),   (4.7)

where ĝ(· | y_i) denotes the fitted normal density of X | (Y = y_i), whose mean is determined by the fitted vector X̂_i defined for equation (4.4) and whose covariance matrix is Δ̂. We next discuss how these weights can be simplified to a more intuitive and computationally efficient form.

Let V̂_d denote the p × d matrix consisting of the first d columns of V̂. It is then easy to show that Δ̂^{-1} = Σ̂_res^{-1/2} V̂ (I_p + K̂)^{-1} V̂^T Σ̂_res^{-1/2}, and that the MLE of span(Δ^{-1}Γ) is spanned by the columns of Σ̂_res^{-1/2} V̂_d. Consequently, we may take the estimated reduction to be R̂(x) = V̂_d^T Σ̂_res^{-1/2} x, whose conditional covariance matrix under the fitted model is I_d. Using these results in equation (4.7), the weights become

w_i(x) = exp{−‖R̂(x) − R̂(X̂_i)‖^2/2} / Σ_{j=1}^n exp{−‖R̂(x) − R̂(X̂_j)‖^2/2}.   (4.8)

The reduction in this case is applied to x and X̂_i in the same way that it was for the weights from the isotonic and diagonal PFC models, but the reduction itself differs. A perhaps clearer connection between the PFC reduction and the isotonic PFC reduction can be seen by noting that span(Σ̂_res^{-1/2} V̂_d) is equal to the span of the first d eigenvectors of Σ̂_res^{-1} Σ̂_fit. Under the isotonic PFC model, the reduction is computed by using the first d eigenvectors of Σ̂_fit, while under the PFC model the reduction is computed using the first d eigenvectors of Σ̂_res^{-1} Σ̂_fit. Additionally, as Σ̂ = Σ̂_fit + Σ̂_res, the reduction could be computed also from the first d eigenvectors of Σ̂^{-1} Σ̂_fit.
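For completeness, here is an R sketch of the general PFC reduction and the weights (4.8): it forms Σ̂_fit and Σ̂_res, takes the eigenvectors V̂ of Σ̂_res^{-1/2} Σ̂_fit Σ̂_res^{-1/2}, and applies R̂(x) = V̂_d^T Σ̂_res^{-1/2} x to both the new observation and the fitted vectors. As with the other sketches, it is an illustrative reimplementation of the displayed formulas, not the authors' code, and it assumes p ≪ n so that Σ̂_res is positive definite.

```r
## General PFC: reduction R(x) = V_d^T Sres^{-1/2} x and weights (4.8).
sym_power <- function(A, pow) {                  # power of a symmetric p.d. matrix
  e <- eigen(A, symmetric = TRUE)
  e$vectors %*% diag(e$values^pow) %*% t(e$vectors)
}

pfc_fit <- function(X, y, F, d) {
  n <- nrow(X)
  Xc   <- scale(X, center = TRUE, scale = FALSE)
  PF   <- F %*% solve(crossprod(F), t(F))
  Xfit <- PF %*% Xc
  Sfit <- crossprod(Xfit) / n                    # Sigma_fit
  Sres <- crossprod(Xc - Xfit) / n               # Sigma_res
  Sres_ih <- sym_power(Sres, -1/2)
  Vd <- eigen(Sres_ih %*% Sfit %*% Sres_ih, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]
  reduce <- function(M) M %*% Sres_ih %*% Vd     # rows of M mapped to R(x)
  Rfit <- reduce(sweep(Xfit, 2, colMeans(X), "+"))
  list(reduce = reduce, Rfit = Rfit, y = y)
}

pfc_predict <- function(fit, Xnew) {
  Rnew <- fit$reduce(Xnew)
  apply(Rnew, 1, function(r) {
    w <- exp(-colSums((t(fit$Rfit) - r)^2) / 2)  # weights (4.8)
    sum(w * fit$y) / sum(w)
  })
}
```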

5. Implementation

In this section, we present additional simulation results and provide some suggestions for issues that must be addressed prior to implementation. We consider only the isotonic, diagonal and general PFC models, and address the implementation issues first.

(a) Choice of d

Two general methods for choosing d have been studied in the literature. One is by using an information criterion such as the Akaike (AIC) and Bayesian (BIC) information criteria. Let L(d_0) denote the value of the maximized log likelihood for any of the three inverse models under consideration fitted with a value d_0 for d. Then the dimension is selected that minimizes the information criterion −2L(d_0) + h(n)g(d_0), where h(n) is equal to log n for BIC and 2 for AIC, and g(d_0) = p + d_0(p − d_0) + d_0 r + D(Δ) is the number of parameters to be estimated as a function of d_0. The first addend in this count corresponds to μ, the second to span(Γ) and the third to β. The final term D(Δ) is the number of parameters required for Δ, which is equal to 1, p and p(p + 1)/2 for the isotonic, diagonal and PFC models. The second method is based on using the likelihood ratio statistic for the hypothesis d = d_0 in a sequential scheme to choose d. Using a common test level and starting with d_0 = 0, choose the estimate of d as the first hypothesized value that is not rejected; see Cook (2007) and Cook & Forzani (2009a) for further discussion of these methods.

These methods for choosing d have been shown to work well in data-rich regressions where pn, but can be unreliable otherwise. Consequently, in the context of this paper, we will determine d using the cross-validation method proposed by Hastie et al. (2001, ch. 7).
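As an illustration of the cross-validation strategy adopted here, the sketch below selects d for the isotonic PFC predictor of §4b by K-fold cross-validation of the squared prediction error; the fold count, the candidate values of d and the rebuilt basis within each fold are our implementation choices.

```r
## K-fold cross-validation over d for the isotonic PFC predictor.
cv_choose_d <- function(X, y, make_f, d_max = 3, K = 5) {
  n <- nrow(X)
  folds <- sample(rep(1:K, length.out = n))
  cv_err <- sapply(1:d_max, function(d0) {
    mean(sapply(1:K, function(k) {
      test <- folds == k
      Ftr  <- make_f(y[!test])                   # rebuild the basis within the training fold
      fit  <- iso_pfc_fit(X[!test, , drop = FALSE], y[!test], Ftr, d0)
      mean((y[test] - iso_pfc_predict(fit, X[test, , drop = FALSE]))^2)
    }))
  })
  which.min(cv_err)                              # estimated d
}

## e.g. dhat <- cv_choose_d(X, y, function(yy) poly_basis(yy, 3))
```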

(b) Screening

When dealing with large-p regressions, there may be a possibility that a substantial subset X2 of the predictors is inactive; that is, X2 is independent of the response given the remaining predictors X1. In terms of PFC models, X2 is inactive if and only if the corresponding rows of Δ−1Γ are all equal to zero. Addressing this possibility in terms of the general PFC models requires a data-rich regression, and is outside the scope of this paper (see Cook & Forzani 2009b for further discussion). For the isotonic and diagonal PFC models, the predictors X2 are inactive if and only if the corresponding rows of Γ are all equal to 0. In these models, we can test if the kth row of Γ is equal to 0 straightforwardly by using an F statistic to test if α=0 in the univariate regression model of the kth predictor Xk on f, Xk = μ + αTf + ε. This method with a common test level of 0.1 was used for screening in the simulations that follow. When d = 1 and f = y, this method is equivalent to sure independence screening proposed by Fan & Lv (2008).
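The screening step described above takes only a few lines in R: regress each predictor on f and keep those whose overall F test rejects α = 0 at the chosen level (0.1 in our simulations). The function name and the example call are illustrative.

```r
## Screen predictors by testing alpha = 0 in X_k = mu_k + alpha^T f + error.
screen_predictors <- function(X, F, level = 0.1) {
  pvals <- apply(X, 2, function(xk) {
    fs <- summary(lm(xk ~ F))$fstatistic           # overall F statistic of the inverse fit
    pf(fs[1], fs[2], fs[3], lower.tail = FALSE)
  })
  which(pvals < level)                             # indices of the retained (active) predictors
}

## e.g. keep <- screen_predictors(X, poly_basis(y, 3)); Xred <- X[, keep, drop = FALSE]
```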

(c) Simulation results

To allow for some flexibility, we used the following general set-up as the basis for our simulations. Let Jk and Ok denote vectors of length k having all elements equal to 1 and 0, respectively. A response was generated from a uniform (0, 3) distribution. Then, with fy = (y, e^{2y})^T and an integer p_0 ≤ p/2, a predictor vector of length p was generated from model (4.1) with νy = βfy, taking the first column of Γ proportional to (J_{p_0}^T, O_{p_0}^T, O_{p−2p_0}^T)^T, the second column proportional to (O_{p_0}^T, J_{p_0}^T, O_{p−2p_0}^T)^T, and ε ∼ N(0, Δ) with Δ diagonal. Datasets of n independent observations generated in this way have the following properties: (i) one set of p_0 predictors is linearly related to the response, another set of p_0 is nonlinearly related and the remaining p − 2p_0 predictors are independent of the response; (ii) the conditional covariance of X | Y has a diagonal structure; (iii) Γ is a p × 2 matrix and thus the sufficient reduction is composed of d = 2 linear combinations of the predictors.

We present results from three cases of this simulation model. In each case, we fitted the diagonal PFC model with fy = (y, y^2, y^3, y^4)^T, so the fitted fy was not the same as that used to generate the data. The prediction errors were determined using the method of equation (4.3).

Case 1.

p ≪ n. In this regression, we set n = 400, p = 40 and p_0 = 20, so all predictors are relevant and no screening was used. The results are shown in the third column of table 1. The first row shows the prediction errors when d is determined by the cross-validation (c.v.) method of Hastie et al. (2001, ch. 7). In this case, the estimated d can vary across the 100 replicate datasets. The next four rows show the prediction errors when d is fixed at the indicated value across all datasets. The final two rows show the prediction errors for PLS and the lasso applied straightforwardly to the linear regression of Y on X using the R packages ‘pls’ and ‘lars’. We see that the PFC prediction errors are fairly stable across the choices for d and that they are roughly one-third of the prediction error for the forward methods.

Table 1.

Prediction errors (4.3) for three simulation scenarios, with standard errors given in parentheses.

Case 2.

n < p. In this simulation, we set n = 100, p_0 = 60 and p = 120, so again all predictors are relevant and no screening was used. The results, which are shown in the fourth column of table 1, are qualitatively similar to those for case 1, except that the prediction error for the lasso is notably less than that for PLS.

Case 3.

n < p. For this case, we set n = 100, p_0 = 20 and p = 120. Now there are only 40 relevant predictors out of 120 total predictors and, as discussed in §5b, screening was used prior to fitting the diagonal PFC model. The results shown in the final column of table 1 are qualitatively similar to the other two cases. Here, we were a bit surprised that the lasso did not perform better because it is designed specifically for sparse regressions. Evidently, the adaptability of the diagonal PFC model through the choices of f and d can be important for good predictions.

(d) Illustration

We conclude this section with an illustrative analysis of an economic regression using data from Enz (1991). The data and additional discussion were given by Cook & Weisberg (1999, p. 140). The response, which was measured in 1991, is the average minutes of labour required to buy a BigMac hamburger and french fries in n = 45 world cities. There are nine continuous predictors. We fitted the diagonal PFC model as discussed in §4c, and estimated the prediction error using leave-one-out cross validation. We set d = 1 to allow an even-handed comparison with forward methods.

The results are shown in table 2. Two entries are given for PFC. The first corresponds to fitting PFC with the cubic polynomial basis f = (y, y^2, y^3)^T and the second uses a more elaborate piecewise constant polynomial basis with 10 slices for f. PFC with the piecewise constant basis gives a smaller prediction error than PFC with the cubic polynomial basis, but both versions outperform the predictions from the five forward methods: the elastic net (Zou & Hastie 2005), the lasso, ridge regression, PLS and OLS. The relative behaviour of PFC illustrated here is typical of our experiences, regardless of the number of predictors.

Table 2.

BigMac dataset.

6. Discussion

The methodology introduced in this paper applies across a useful range of applications with continuous predictors that, given the response, are mildly correlated. We view the ability to choose fy as an important feature because it allows a level of adaptability that is not really available when directly modelling the mean function E(Y |X) in large-p regressions. The assumption of normal errors is not crucial, but the predictions may not perform well under extreme deviations from normality. For instance, we expect that it is possible to develop substantially better methodology for regressions in which all the predictors are binary, perhaps employing a quadratic exponential family for X|Y in place of the multivariate normal. Work along these lines is in progress.

It also seems possible to develop models ‘between’ the diagonal and general PFC models that allow for some of the conditional predictor correlations to be substantial but stop short of permitting a general Δ > 0. One route is to model the conditional covariance matrix as Δ = ΓMΓ^T + Γ_0 M_0 Γ_0^T, where Γ is the same as that in model (4.1), (Γ, Γ_0) is an orthogonal p × p matrix, M > 0 and M_0 > 0 (Cook 2007). The minimal sufficient reduction in this case is R(X) = Γ^T X, which does not involve the Ms. This model for Δ allows arbitrary correlations among the elements of R and among the elements of Γ_0^T X, but requires that R and Γ_0^T X be conditionally independent. Predictions under this model in a data-rich regression can be obtained as an extension of results by Cook (2007). The behaviour of this model in n < p regressions is under study.

Acknowledgements

Research for this paper was supported in part by the US National Science Foundation, Microsoft Research and the Newton Institute for Mathematical Sciences, Cambridge, UK.
