## Abstract

We consider two-class linear classification in a high-dimensional, small-sample-size setting. Only a small fraction of the features are useful, these being unknown to us, and each useful feature contributes weakly to the classification decision. This was called the rare/weak (RW) model in our previous study (Donoho, D. & Jin, J. 2008 *Proc. Natl Acad. Sci. USA* **105**, 14 790–14 795). We select features by thresholding feature *Z*-scores. The threshold is set by *higher criticism* (HC). For 1≤*i*≤*N*, let π_{i} denote the *p*-value associated with the *i*th *Z*-score and π_{(i)} denote the *i*th order statistic of the collection of *p*-values. The HC threshold (HCT) is the order statistic of the *Z*-score corresponding to the index *i* maximizing HC(*i*; π_{(i)}) = √*N*·(*i*/*N* − π_{(i)})/√((*i*/*N*)(1 − *i*/*N*)). The ideal threshold optimizes the classification error. In that previous study, we showed that HCT was numerically close to the ideal threshold.

We formalize an asymptotic framework for studying the RW model, considering a sequence of problems with increasingly many features and relatively fewer observations. We show that, along this sequence, the limiting performance of ideal HCT is essentially just as good as the limiting performance of ideal thresholding. Our results are expressed in terms of a two-dimensional *phase space*, a diagram with coordinates quantifying ‘rare’ and ‘weak’ in the RW model. The phase space can be partitioned into two regions—one where ideal threshold classification is successful, and one where the features are so weak and so rare that it must fail. Surprisingly, the regions where ideal HCT succeeds and fails make exactly the same partition of the phase diagram. Other threshold methods, such as false (feature) discovery rate (FDR) threshold selection, are successful in a substantially smaller region of the phase space than either HCT or ideal thresholding. The FDR and local FDR of the ideal and HC threshold selectors have surprising phase diagrams, which are also described. Results showing the asymptotic equivalence of HCT with ideal HCT can be found in a forthcoming paper (Donoho, D. & Jin, J. In preparation).

## 1. Introduction

The modern era of high-throughput data collection creates data in abundance. Some devices—spectrometers and gene chips come to mind—automatically generate measurements on thousands of standard features of each observational unit.

What has not changed in science is the difficulty of obtaining good observational units. High-throughput devices do not help us to find and enrol qualified patients, or to catch exotic butterflies or to observe primate mating behaviour. So the number of observational units—be they patients, butterflies or matings—has not increased; it stays in the dozens or hundreds. However, each of those few observational units can now be subjected to a large battery of automatic feature measurements.

Many of those automatically measured features will have little relevance to any given project. This new era of *feature glut* poses a needle-in-a-haystack problem: we must detect a relatively few valuable features among many useless ones. Unfortunately, the combination of small sample sizes (few observational units) and high dimensions (many feature measurements) makes it hard to tell needles from straw.

Orthodox statistical methods assumed quite a different set of conditions: more observations than features, and all features highly relevant. Modern statistical research has started to address this modern, unorthodox setting; such research comprised much of the activity in the recent six-month Newton Institute programme *Statistical Theory and Methods for Complex, High-Dimensional Data*.

This paper examines the challenges that feature glut poses to linear classification schemes. A new feature selection principle—higher criticism (HC) thresholding—and a new optimality property—optimality of the phase diagram—will be introduced.

### (a) Multivariate normal classification

In the standard model of linear classifier training, we have labelled training samples (*Y*_{i}, *X*_{i}), *i*=1,…,*n*, where each label *Y*_{i} is ±1 and each feature vector *X*_{i}∈*R*^{p}. For simplicity, let us assume that the training set contains equal numbers of +1 and −1 labels and that the feature vectors obey *X*_{i}∼*N*(*Y*_{i}μ, *Σ*), *i*=1,…,*n*, for an unknown *mean contrast vector* μ∈*R*^{p}; here *Σ* denotes the feature covariance matrix and *n* is the training set size. In this simple setting, one normally uses linear classifiers to classify an unlabelled test vector *X*; these take the general form *L*(*X*) = Σ_{j=1}^{p} *w*( *j*)*X*( *j*), with ‘feature weights’ *w*=(*w*( *j*): *j*=1,…,*p*).

Classical theory going back to R. A. Fisher shows that the optimal classifier has feature weights *w* ∝ *Σ*^{−1}μ (Anderson 2003). However, modern feature glut renders Fisher’s formula inapplicable.

### (b) When *p* is larger than *n*

Frequently, in modern settings, the number *p* of measurements exceeds the number *n* of observational units. Naive application of the formula *w* ∝ *Σ*^{−1}μ to empirical data in the *p*>*n* setting is then problematic, at a minimum, because the matrix of empirical feature covariances is not invertible. Even the generalized inverse gives weights *w* = *Σ̂*^{†}μ̂, with *Σ̂* the empirical covariance matrix and μ̂ the empirical mean contrast, yielding ‘noisy’ classifiers with low accuracy.

A by-now standard response to this problem is simply to ignore feature covariances, and to standardize the features to mean 0 and variance 1. In effect, one pretends that the feature covariance matrix *Σ* is the identity matrix, and uses the formula *w*( *j*) = μ̂( *j*), *j*=1,…,*p* (Bickel & Levina 2004; Fan & Fan 2008). Even after this reduction, further challenges remain.

### (c) When features are rare and weak

When feature glut arises from promiscuous measurement, researchers cannot say in advance which features will be useful in a given project. Moreover, although one hopes to find a single ‘silver bullet’ feature that reliably classifies, in all too many projects the useful features are relatively rare and individually quite weak.

Such thinking motivated the following *rare/weak (RW) feature model* in Donoho & Jin (2008). Under this model, the following hold.

(i) **Useful features are rare.** The contrast vector μ is non-zero in only *k* out of *p* elements, where ε=*k*/*p* is small, i.e. close to zero. As an example, think of *p*=10 000, *k*=100, so ε=*k*/*p*=0.01.

(ii) **Useful features are weak.** The non-zero elements of μ have *common* amplitude μ_{0}, which is assumed not to be ‘large’. ‘Large’ can be measured using τ = √*n*·μ_{0}; values of τ in the range of 2–4 imply that corresponding values of μ_{0} are not large.

Since the elements *X*( *j*) of the feature vector where the class contrast μ( *j*)=0 are entirely uninformative about the value of *Y*, only the *k* features where μ( *j*)=μ_{0} are useful. The problem is how to identify and benefit from those rare/weak features.

We speak of ε and τ as the *sparsity* and *strength* parameters for the RW model, and refer to the RW(ε,τ) model. Models with a ‘sparsity’ parameter ε are common in estimation settings (Donoho *et al.* 1992; Donoho & Johnstone 1994; Johnstone & Silverman 2004), but not with the feature strength constraint τ. Also closely related to the RW model is work in multiple testing by Ingster and other authors (Ingster 1997; Jin 2003; Donoho & Jin 2004).

### (d) Feature selection by thresholding

Feature selection—i.e. working only with an empirically selected subset of features—is a standard response to feature glut. We are supposing, as announced in §1*b*, that feature correlations can be ignored and that features are already standardized to variance 1. We therefore focus on the vector of feature *Z*-scores, with components *Z*( *j*) = *n*^{−1/2}Σ_{i=1}^{n}*Y*_{i}*X*_{i}( *j*), *j*=1,…,*p*. These are the *Z*-scores of two-sided tests of *H*_{0,j}: Cov(*Y*,*X*( *j*))=0. Under our assumptions, *Z*∼*N*(θ,*I*_{p}), with θ = √*n*·μ and μ the feature contrast vector. Features with non-zero μ( *j*) typically have significantly non-zero *Z*( *j*), while all other features will have *Z*( *j*) values largely consistent with the null hypothesis μ( *j*)=0. In such a setting, selecting features with *Z*-scores above a threshold makes sense.

We identify three useful threshold functions η_{t}^{★}, with ★∈{clip, hard, soft}. These are: clipping, η_{t}^{clip}(*z*) = sgn(*z*)·1_{{|z|>t}}, which ignores the size of the *Z*-score, provided it is large; hard thresholding, η_{t}^{hard}(*z*) = *z*·1_{{|z|>t}}, which uses the size of the *Z*-score, provided it is large; and soft thresholding, η_{t}^{soft}(*z*) = sgn(*z*)·(|*z*|−*t*)_{+}, which uses a shrunken *Z*-score, provided it is large.

### Definition 1.1.

Let ★∈{soft, hard, clip}. The threshold feature selection classifier makes its decision based on *L*(*X*) <> 0, where *L*(*X*) = Σ_{j=1}^{p} *ŵ*( *j*)*X*( *j*), and *ŵ*( *j*) = η_{t}^{★}(*Z*( *j*)), *j*=1,…,*p*.

In other words, the classifier sums across features with large training set *Z*-scores, and a simple function of the *Z*-score generates the feature weight.
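As a concrete sketch (the function names and toy data layout are our own; features are assumed standardized, as in §1*b*), the three nonlinearities and the resulting classifier can be written in Python:

```python
import math

def eta(z, t, kind):
    """Threshold nonlinearities: zero out |z| <= t, then clip/keep/shrink."""
    if abs(z) <= t:
        return 0.0
    if kind == "clip":
        return math.copysign(1.0, z)       # ignores the size of the Z-score
    if kind == "hard":
        return z                            # uses the size of the Z-score
    if kind == "soft":
        return math.copysign(abs(z) - t, z)  # uses a shrunken Z-score
    raise ValueError(kind)

def z_scores(X, Y):
    """Feature Z-scores Z(j) = n^(-1/2) * sum_i Y_i * X_i(j)."""
    n, p = len(X), len(X[0])
    return [sum(Y[i] * X[i][j] for i in range(n)) / math.sqrt(n)
            for j in range(p)]

def classify(x, Z, t, kind="clip"):
    """Threshold feature-selection classifier: sign of sum_j eta(Z(j)) * x(j)."""
    s = sum(eta(zj, t, kind) * xj for zj, xj in zip(Z, x))
    return 1 if s > 0 else -1
```

Only features whose training *Z*-scores exceed *t* in magnitude contribute to the decision; the choice of nonlinearity determines how much they contribute.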

Several methods for linear classification in bioinformatics follow this approach. The shrunken centroids method (Tibshirani *et al*. 2002) is a variant of soft thresholding in this two-class setting. The highly cited methods in Golub *et al*. (1999) and Hedenfalk *et al*. (2001) are variants of hard thresholding. Clipping makes sense in the theoretical setting of the RW model (since then the useful features all have the same strength) and is simpler to analyse than the other nonlinearities.

One crucial question remains: how to choose the threshold based on the data? Popular methods for threshold choice include cross-validation (Tibshirani *et al*. 2002), control of the false discovery rate (Abramovich & Benjamini 1995; Benjamini & Hochberg 1995; Abramovich *et al*. 2006; Donoho & Jin 2006) and control of the local false discovery rate (Efron *et al*. 2001).

### (e) Higher criticism

Recent work on multiple comparisons offers a new method of threshold choice.

#### (i) Higher criticism testing

Suppose that we have a collection of *N* *p*-values π_{i}, which, under the global null hypothesis, are uniformly (independent and identically) distributed: π_{i}∼_{i.i.d.}*U*[0,1]. Consider the order statistics π_{(1)}≤π_{(2)}≤⋯≤π_{(N)}. Under the null hypothesis, these order statistics have the usual properties of uniform order statistics, including the asymptotic normality π_{(i)}∼_{approx}Normal(*i*/*N*, (*i*/*N*)(1−*i*/*N*)/*N*). HC forms a *Z*-score comparing π_{(i)} with its mean under the null, and then maximizes this score over a wide range of *i*.

### Definition 1.2

(**HC testing** (Donoho & Jin 2004)). *The higher criticism objective is*

HC(*i*; π_{(i)}) = √*N*·(*i*/*N* − π_{(i)})/√((*i*/*N*)(1 − *i*/*N*)).   (1.1)

Fix α_{0}∈(0,1) (e.g. α_{0}=0.10). The higher criticism test statistic is HC* = max_{1≤i≤α₀·N} HC(*i*; π_{(i)}).

HC seems insensitive to the selection of α_{0} in rare/weak situations; here we always use α_{0}=0.10.

Expressed in words, we look for the largest standardized discrepancy for any π_{(i)} between the observed behaviour and the expected behaviour under the null. When this is large, the whole collection of *p*-values is not consistent with the global null hypothesis. The phrase ‘higher criticism’ is due to John Tukey, and reflects the shift in emphasis from single test results to the whole collection of tests.
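As an illustration, the HC statistic takes only a few lines to compute (a sketch, assuming the standardization by the null mean *i*/*N* and null standard deviation √((*i*/*N*)(1−*i*/*N*)/*N*) described above; the function name is ours):

```python
import math

def hc_statistic(pvals, alpha0=0.10):
    """Higher criticism: the largest standardized discrepancy between
    the sorted p-values and their expected behaviour under the null,
    maximized over the smallest alpha0 fraction of indices."""
    N = len(pvals)
    ps = sorted(pvals)
    best = -math.inf
    for i in range(1, max(2, int(alpha0 * N)) + 1):
        q = i / N                                   # null mean of pi_(i)
        hc = math.sqrt(N) * (q - ps[i - 1]) / math.sqrt(q * (1 - q))
        best = max(best, hc)
    return best
```

A collection with an excess of small *p*-values produces a large value of the statistic; a null (uniform) collection produces a value of modest size.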

#### (ii) Higher criticism thresholding

We return to the classification setting of the previous sections. We have a vector of feature *Z*-scores (*Z*( *j*), *j*=1,…,*p*). To apply HC notions, we translate *Z*-scores into two-sided *p*-values, and maximize the HC objective over index *i* in the appropriate range. Define the *feature p-values* π_{i} = Prob{|*N*(0,1)| > |*Z*(*i*)|}, *i*=1,…,*p*; define the increasing rearrangement π_{(i)} and the HC objective function HC(*i*; π_{(i)}), and order |*Z*|_{(i)} correspondingly, so that π_{(i)} is the *p*-value associated with |*Z*|_{(i)}. Here is our proposal.

### Definition 1.3 (HC thresholding)

Apply the HC procedure to the feature *p*-values. Let the maximum HC objective be achieved at index *î*. The *higher criticism threshold* (HCT) is the value *t̂*_{HC} = |*Z*|_{(î)}. The *HC threshold feature selector* selects features whose *Z*-scores exceed *t̂*_{HC} in magnitude.
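Definition 1.3 can be sketched directly in Python (our own names; we assume the same HC objective and α_{0} restriction as in §1*e*(i)):

```python
import math

def phi_bar(t):
    """Two-sided normal tail: P(|N(0,1)| > t)."""
    return math.erfc(t / math.sqrt(2.0))

def hc_threshold(z, alpha0=0.10):
    """Return |Z| at the index maximizing the HC objective over the
    alpha0 fraction of smallest feature p-values."""
    p = len(z)
    absz = sorted((abs(v) for v in z), reverse=True)  # largest |Z| first
    best, t_hc = -math.inf, absz[0]
    for i in range(1, max(2, int(alpha0 * p)) + 1):
        pi = phi_bar(absz[i - 1])    # i-th smallest p-value
        q = i / p
        hc = math.sqrt(p) * (q - pi) / math.sqrt(q * (1 - q))
        if hc > best:
            best, t_hc = hc, absz[i - 1]
    return t_hc
```

Features whose |*Z*( *j*)| meet or exceed the returned value are the ones the HC threshold feature selector keeps.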

Figure 1 illustrates the procedure. Figure 1*a* shows a sample of *Z*-scores, figure 1*b* shows a probability–probability (PP) plot of the corresponding ordered *p*-values versus *i*/*p*, and figure 1*c* shows a standardized PP plot. The standardized PP plot has its largest deviation from 0 at *î*; and this generates the threshold value.

#### (iii) Previously reported results for higher criticism threshold

Donoho & Jin (2008) reported several findings on the behaviour of HCT based on numerical and empirical evidence. In the RW model, define the *ideal threshold* as the optimal threshold based on full knowledge of the RW parameters ε and τ (see §2 below).

(i) The HCT threshold is numerically very close to the ideal threshold.

(ii) In the case of very weak feature *Z*-scores, HCT has a false feature discovery rate (FDR) substantially higher than other popular approaches, but a feature missed detection rate (MDR) substantially lower than those other approaches.

(iii) The FDR and MDR of HCT closely match those of the ideal threshold.

In short, HCT has operating characteristics unlike those of popular thresholding schemes such as FDR thresholding (Abramovich & Benjamini 1995; Abramovich *et al*. 2006) and Bonferroni thresholding, but very similar to the ideal ones.

### (f) Asymptotic rare/weak model, and the phase diagram

This paper reports a large-sample analysis of HCT and the RW model. The number of observations *n* and the number of features *p* will tend to infinity in a linked fashion, with *n* remaining very small compared to *p*. (Empirical results in Donoho & Jin (2008) show that large-*p* theory is applicable at moderate *n* and *p*).

### Definition 1.4.

The phrase *asymptotic rare/weak (ARW) model* refers to the following combined assumptions.

(i) *Asymptotic setting.* We consider a sequence of problems, with the number of observations *n* and the number of features *p* both tending to ∞.

(ii) *p dramatically larger than n.* Along this sequence, *n* ∼ *c*·log^{γ}(*p*) for some positive constants *c* and γ, so there are dramatically more features per observational unit than there are observational units.

(iii) *Increasing rarity.* The sparsity ε tends to 0: ε = *p*^{−β}, 0 < *β* < 1.

(iv) *Decreasing strength.* The strength τ varies with *n* and *p* according to τ = √(2*r* log *p*), 0 < *r* < 1; equivalently, the amplitude is μ_{0} = τ/√*n*.

The symbol ARW(*r*,*β*;*c*,γ) refers to the model combining these assumptions.

In this model, because *r*<1, useful features are individually too weak to detect and, because 0<*β*<1, useful features are increasingly rare with increasing *p*, while increasing in total number with *p*. It turns out that *c* and γ are incidental, while *r* and *β* are the driving parameters. Hence, we always simply write ARW(*r*,*β*) below.
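For intuition, the ARW parametrization is easy to tabulate. A minimal sketch (hypothetical values; we assume ε = *p*^{−β}, τ = √(2*r* log *p*) and μ_{0} = τ/√*n*, as above):

```python
import math

def arw_params(r, beta, p, n):
    """ARW(r, beta): sparsity eps = p^(-beta), strength tau = sqrt(2 r log p),
    per-feature amplitude mu0 = tau / sqrt(n) (since tau = sqrt(n) * mu0)."""
    eps = p ** (-beta)
    tau = math.sqrt(2.0 * r * math.log(p))
    mu0 = tau / math.sqrt(n)
    return eps, tau, mu0

# Example: p = 10 000 features, n = 100 observations, beta = 1/2, r = 0.3.
eps, tau, mu0 = arw_params(r=0.3, beta=0.5, p=10_000, n=100)
# The count of useful features, eps * p, grows with p even as the
# fraction eps shrinks to zero.
```

With these values, τ falls in the ‘weak’ range of 2–4 discussed in §1*c*, while each individual amplitude μ_{0} is far smaller.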

For many choices of (*r*,*β*), asymptotically successful classification is possible; for another large family of choices, it is impossible. To understand this fact, we use the concept of *phase space*, the two-dimensional domain, 0<*r*,*β*<1. We show that this domain is partitioned into two regions or ‘phases’. In the ‘impossible’ phase, useful features are so rare and so weak that classification is asymptotically impossible *even with the ideal choice of threshold*. In the ‘possible’ phase, successfully separating the two groups is indeed possible—*if* one has access to the ideal threshold. Figure 2 displays this domain and its partition into phases. Because of the partition into two phases, we also call this display the *phase diagram*. An explicit formula for the graph *r*=*ρ**(*β*) bounding these phases is given in equation (3.2) below.

The phase diagram provides a convenient platform for comparing different procedures. A threshold choice is *optimal* if it gives the same partition of phase space as the one obtained with the ideal choice of threshold.

How does HCT compare to the ideal threshold, and what partition in the phase space does HCT yield? For reasons of space, we focus in this paper on the *ideal HC threshold*, which is obtained upon replacing the empirical distribution of feature *Z*-scores by its expected value. The ideal HC threshold is thus the one that HCT is ‘trying to estimate’.

The central surprise of our story is how well ideal HCT behaves: the partition of phase space describing the two regions where ideal thresholding fails and/or succeeds also describes the two regions where ideal HCT fails and/or succeeds. The situation is depicted in table 1: here ‘succeeds’ means asymptotically zero misclassification rate, and ‘fails’ means asymptotically 50 per cent misclassification rate.

In this sense of size of regions of success, HCT is just as good as the ideal threshold. Such statements cannot be made for some other popular thresholding schemes, such as false discovery rate threshold selection. Even the very popular cross-validated choice of threshold will fail if the training set size is bounded, while HCT will still succeed in the RW model in that case.

### (g) Contents

The paper is organized as follows. Section 2 introduces a functional framework and several ideal quantities. These include the proxy classification error, in which Fisher’s separation (Sep) plays a key role; the ideal threshold, as a proxy for the optimal threshold; and the ideal HCT, as a proxy for the HCT. Section 3 presents the main results on the asymptotic behaviour of the HC threshold under the asymptotic RW model; the focal point is the phase diagram. Section 4 outlines the basic ideas behind the main results, followed by the proofs. Section 5 discusses the connection between the ideal threshold and the ideal HCT. Section 6 discusses the ideal behaviour of Bonferroni threshold feature selection and FDR-controlling feature selection. Section 7 discusses the link between ideal HCT and ordinary HCT, the finite-*p* phase diagram, and other appearances of HC in the recent literature.

This paper has three companions: Donoho & Jin (2008, in preparation) and Jin (2009). The full proof of a broader set of claims, with a considerably more general treatment, will appear in Donoho & Jin (in preparation) (see §7).

## 2. Fisher’s separation functional and ideal threshold

Suppose *L* is a fixed, *non-random* linear classifier, with decision boundary *L*<>0. Will *L* correctly classify the future realization (*Y*,*X*)? Suppose that *Y*=±1 are equiprobable and *X*∼*N*(*Y*μ,*I*_{p}). The misclassification probability can be written as

Prob{*L*(*X*)·*Y* < 0} = Φ(−Sep(*L*;μ)/2),   (2.1)

where Φ denotes the standard normal distribution function and Sep(*L*;μ) measures the standardized interclass distance:

Sep(*L*;μ) = (*E*(*L*|*Y*=1) − *E*(*L*|*Y*=−1))/SD(*L*).   (2.2)

The *ideal linear classifier* *L*_{μ}, with feature weights *w* ∝ μ and decision threshold *L*_{μ}<>0, implements the likelihood ratio test. It also maximizes Sep, since, for every other linear classifier *L*, Sep(*L*;μ)≤Sep(*L*_{μ};μ)=2∥μ∥_{2}.
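The identity Prob{mistake} = Φ(−Sep/2) can be checked by simulation. Below is a hedged sketch (our own helper names) comparing the exact expression with a Monte Carlo estimate for a fixed weight vector, using Sep = 2(*w*·μ)/∥*w*∥ for *L*(*x*) = *w*·*x* and *X*∼*N*(*Y*μ,*I*):

```python
import math
import random

def misclass_prob(w, mu):
    """Exact error of the fixed rule sign(w.x) when X ~ N(Y*mu, I):
    Phi(-Sep/2) with Sep = 2 (w.mu) / ||w||."""
    sep = (2.0 * sum(wi * mi for wi, mi in zip(w, mu))
           / math.sqrt(sum(wi * wi for wi in w)))
    return 0.5 * (1.0 + math.erf(-sep / 2.0 / math.sqrt(2.0)))

def simulate_error(w, mu, trials=20000, seed=0):
    """Monte Carlo estimate of the same misclassification probability."""
    rng = random.Random(seed)
    errs = 0
    for _ in range(trials):
        y = rng.choice((-1, 1))
        x = [y * m + rng.gauss(0, 1) for m in mu]
        s = sum(wi * xi for wi, xi in zip(w, x))
        errs += (s > 0) != (y > 0)
    return errs / trials
```

For example, with *w* = μ = (0.5, 0.5, 0.5, 0.5), Sep = 2∥μ∥ = 2, and the error is Φ(−1).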

### (a) Certainty equivalent heuristic

Threshold selection rules give *random* linear classifiers: the classifier weight vector *w* is a *random* variable, because it depends on the *Z*-scores of the realized training sample. If *L*_{Z} denotes a linear classifier constructed from such a realized vector of *Z*-scores, then the misclassification error can be written as

Prob{*L*_{Z}(*X*)·*Y* < 0 | *Z*, μ} = Φ(−Sep(*L*_{Z};μ)/2);   (2.3)

this is a random variable depending on *Z* and on μ. Heuristically, because there is a large number of coordinates, some statistical regularity appears, and we anticipate that random quantities can be replaced by expected values. We proceed as if

Sep(*L*_{Z};μ) ≈ 2·*E*{Σ_{j}*w*( *j*)μ( *j*)}/(*E*{Σ_{j}*w*( *j*)^{2}})^{1/2},   (2.4)

where the expectation is over the conditional distribution of *Z* conditioned on μ. Next, let *I*_{1},*I*_{2},…,*I*_{p} be independent Bernoulli random variables with parameter ε, and μ_{j}=*I*_{j}μ_{0}, so that μ has about ε*p* non-zero coordinates, in random locations. Since *w*_{i}=η_{t}(*Z*_{i}), we write heuristically

Sep(*L*_{Z};μ) ≈ 2√*p*·*E*{η_{t}(*Z*_{1})μ_{1}}/(*E*{η_{t}(*Z*_{1})^{2}})^{1/2}.

Here *Z*_{1} and μ_{1} are the first coordinates of *Z* and μ, respectively, and μ_{0} is a constant as before.

### Definition 2.1.

Let the threshold *t* be fixed and chosen independently of the training set. In the RW(ε,τ) model, we use the following expression for *proxy separation*:

Sep~(*t*; ε,τ,*n*,*p*) = 2√*p*·εμ_{0}*E*{η_{t}(τ+*W*)}/√((1−ε)*E*{η_{t}(*W*)^{2}} + ε*E*{η_{t}(τ+*W*)^{2}}),

where μ_{0} = τ/√*n* and *W* denotes a standard normal random variable. By *proxy classification error* we mean

Err~(*t*; ε,τ,*n*,*p*) = Φ(−Sep~(*t*; ε,τ,*n*,*p*)/2).

Normalizations are chosen so that, in large samples, Sep(*L*_{Z};μ) ≈ Sep~(*t*; ε,τ,*n*,*p*). While, normally, we expect averages to ‘behave like expectations’ in large samples, the word *proxy* reminds us that there is a difference (presumably small). Software to compute proxy expressions has been developed by the authors, and some numerical results were reported in Donoho & Jin (2008).

Of course, the rationale for our interest in these proxy expressions is our heuristic understanding that they accurately describe the exact large-sample behaviour of certain threshold selection schemes. This issue is settled in the affirmative, after considerable effort, in Donoho & Jin (in preparation).

### (b) Certainty equivalent threshold functionals

In general, the best threshold to use in a given instance of the RW model depends on both ε and τ. It also depends on the specific realization of μ and even of *Z*. However, the dependence on μ and *Z* is simply ‘noise’ that goes away in large samples, while the dependence on ε and τ remains.

### Definition 2.2.

The ideal threshold functional *T*_{ideal}(ε,τ) maximizes the proxy separation: *T*_{ideal}(ε,τ) = argmax_{t>0} Sep~(*t*; ε,τ,*n*,*p*).

Heuristically, *T*_{ideal} represents a near-optimal threshold in all sufficiently large samples; it is what we ‘ought’ to be attempting to use.

### Definition 2.3 (Folding).

The following concepts and notation will be used in connection with distributions of absolute values of random variables.

(i) The half-normal distribution function Ψ(*t*) = *P*{|*N*(0,1)|≤*t*}.

(ii) The non-central half-normal distribution function Ψ_{τ}(*t*) = *P*{|*N*(τ,1)|≤*t*}.

(iii) Given a distribution function *F*, the *folded distribution* is given by *G*(*t*) = *F*(*t*)−*F*(−*t*). The half-normal is the folded version of the standard normal, and the non-central half-normal is the folded version of a normal with unit standard deviation and non-zero mean equal to the non-centrality parameter.

(iv) Let *F*_{ε,τ} denote the two-point mixture *F*_{ε,τ}(*t*) = (1−ε)Φ(*t*)+εΦ(*t*−τ); then *G*_{ε,τ} denotes the corresponding folded distribution, given by *G*_{ε,τ}(*t*) = (1−ε)Ψ_{0}(*t*)+εΨ_{τ}(*t*).

We now define an HCT functional representing the target that HC thresholding aims for. For a distribution function *F*, the survival function is its complement.

### Definition 2.4.

Let *F* be a distribution function that is not the standard normal Φ. At such a distribution, we define the HCT functional by

*T*_{HC}(*F*) = argmax_{t≥t₀} (Ḡ(*t*) − Ψ̄(*t*))/√(Ḡ(*t*)(1 − Ḡ(*t*))).

Here *G* is the folding of *F*, Ḡ = 1−*G* and Ψ̄ = 1−Ψ denote the corresponding survival functions, and *t*_{0} is a fixed parameter of the HCT method (e.g. the value corresponding to α_{0}=0.10). The HC threshold in the RW(ε,τ) model may be written, in an abuse of notation, as *T*_{HC}(ε,τ), meaning *T*_{HC}(*F*_{ε,τ}).

Let *F*_{n,p} denote the usual empirical distribution of the feature *Z*-scores *Z*_{i}. The HCT of definition 1.3 can be written as *t̂*_{HC} = *T*_{HC}(*F*_{n,p}). Let *F* denote the expected value of *F*_{n,p}; then *T*_{HC}(*F*) will be called the *ideal HC threshold*. Heuristically, we expect the usual sampling fluctuations and that *T*_{HC}(*F*_{n,p})≈*T*_{HC}(*F*) with a discrepancy decaying as *p* and *n* increase. This issue is carefully considered in the companion paper (Donoho & Jin in preparation), which shows that the empirical HC threshold in the ARW model indeed closely matches the ideal HC threshold.
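Under the RW model, the ideal HC threshold *T*_{HC}(*F*_{ε,τ}) can be computed numerically. Below is a sketch (our own names and grid), assuming the population objective is obtained by replacing the empirical fraction *i*/*p* with the folded survival function Ḡ(*t*) and π_{(i)} with Ψ̄(*t*):

```python
import math

def psi_bar(t):
    """Null folded survival: P(|N(0,1)| > t)."""
    return math.erfc(t / math.sqrt(2.0))

def g_bar(t, eps, tau):
    """Survival of the folded two-point mixture G_{eps,tau}."""
    signal = (math.erfc((t - tau) / math.sqrt(2.0))
              + math.erfc((t + tau) / math.sqrt(2.0))) / 2.0  # P(|N(tau,1)| > t)
    return (1.0 - eps) * psi_bar(t) + eps * signal

def ideal_hct(eps, tau, t_lo=0.5, t_hi=8.0, steps=3000):
    """Grid maximizer of (G_bar - Psi_bar) / sqrt(G_bar * (1 - G_bar))."""
    best, t_best = -math.inf, t_lo
    for k in range(steps + 1):
        t = t_lo + (t_hi - t_lo) * k / steps
        g = g_bar(t, eps, tau)
        obj = (g - psi_bar(t)) / math.sqrt(g * (1.0 - g))
        if obj > best:
            best, t_best = obj, t
    return t_best
```

For ε=0.01 and τ=3, for instance, the maximizer lies close to τ itself, in line with the near-ideal behaviour reported in Donoho & Jin (2008).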

For comparison purposes, we considered two other threshold schemes. First, (ideal) false discovery rate thresholding (FDRT). For a threshold *t*, and parameters (*p*,ε,τ), the expected number of useful features selected is

*E*(TP)(*t*) = ε*p*·*P*{|τ+*W*| > *t*} = ε*p*(1 − Ψ_{τ}(*t*)),

and the expected number of useless features selected is

*E*(FP)(*t*) = (1−ε)*p*·*P*{|*W*| > *t*} = (1−ε)*p*(1 − Ψ_{0}(*t*)).

Here, TP/FP stand for true/false positives. Let TPR(*t*)=*p*^{−1}*E*(TP)(*t*) denote the expected *rate* of useful features above threshold and FPR(*t*)=*p*^{−1}*E*(FP)(*t*) the expected *rate* of useless features above threshold. By analogy with our earlier heuristic, we define the proxy false discovery rate

FDR~(*t*) = FPR(*t*)/(FPR(*t*) + TPR(*t*)).

The term ‘proxy’ reminds us that

FDR~(*t*) ≠ *E*{FP/(FP+TP)}(*t*),

although for large *p* the difference will often be small.

We define the FDRT-α functional by

*T*_{FDR,α}(ε,τ) = min{*t*: FDR~(*t*) ≤ α}.

Heuristically, this is the threshold that FDRT is ‘trying’ to learn from noisy empirical data. We will also need the *proxy local FDR*:

fdr~(*t*) = FPR′(*t*)/(FPR′(*t*) + TPR′(*t*)).

Here FPR′ denotes the derivative of FPR, which exists, using smoothness properties of Ψ_{0}; similarly for TPR′ and Ψ_{τ}. Intuitively, fdr~(*t*) denotes the expected fraction of useless features among features having observed *Z*-scores near level *t*.
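These proxy quantities are easy to evaluate numerically. A sketch (our own function names; the local FDR is computed from the density ratio of the two-point mixture):

```python
import math

def psi_bar(t):
    """P(|N(0,1)| > t)."""
    return math.erfc(t / math.sqrt(2.0))

def psi_bar_nc(t, tau):
    """P(|N(tau,1)| > t)."""
    return (math.erfc((t - tau) / math.sqrt(2.0))
            + math.erfc((t + tau) / math.sqrt(2.0))) / 2.0

def tpr(t, eps, tau):   # expected rate of useful features above threshold
    return eps * psi_bar_nc(t, tau)

def fpr(t, eps, tau):   # expected rate of useless features above threshold
    return (1.0 - eps) * psi_bar(t)

def proxy_fdr(t, eps, tau):
    """FPR / (FPR + TPR)."""
    return fpr(t, eps, tau) / (fpr(t, eps, tau) + tpr(t, eps, tau))

def proxy_local_fdr(t, eps, tau):
    """Useless fraction among features with Z-scores near t (density ratio)."""
    phi = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    f0 = 2.0 * (1.0 - eps) * phi(t)
    f1 = eps * (phi(t - tau) + phi(t + tau))
    return f0 / (f0 + f1)

def fdrt(alpha, eps, tau, t_max=8.0, step=0.001):
    """Smallest t with proxy FDR <= alpha (the FDRT-alpha functional)."""
    t = 0.0
    while t <= t_max and proxy_fdr(t, eps, tau) > alpha:
        t += step
    return t
```

The proxy FDR decreases as the threshold rises, so the FDRT-α functional is well defined by the grid scan above.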

Secondly, we considered Bonferroni-based thresholding: *t*_{Bon} = Ψ̄^{−1}(1/*p*), i.e. the value of *t* at which *p*·*P*{|*N*(0,1)| > *t*} = 1. This threshold level is set at the level that would cause on average one false alarm in a set of *p* null cases. Figs 2 and 3 of Donoho & Jin (2008) presented numerical calculations of these functionals and their separation behaviour in two cases: (*a*) *p*=10 000, ε=0.01, and (*b*) *p*=10^{6}, ε=0.0001. Although our calculations are exact numerical finite-*p* calculations, we remark that they correspond to sparsity exponents *β*=1/2 and *β*=2/3, respectively. Those figures show the following.

(i) There is a very close numerical approximation of the HCT to the ideal threshold, not just at large τ but even at quite small τ, 2<τ<3.

(ii) FDR and Bonferroni thresholding behave differently from ideal thresholding and ideal HCT.

(iii) The separation behaviour of the HCT is nearly ideal. For the constant FDR rules, the separation behaviour is close to ideal at some τ but becomes noticeably sub-ideal at other τ.

(iv) The false discovery rate behaviour of HCT and ideal thresholding depends on τ. At small τ, both rules tolerate a high FDR, while, at large τ, both rules obtain a small FDR.

(v) The missed detection rate of HCT and ideal thresholding also depends on τ. At small τ, the missed detection rate is high, but noticeably less than 100 per cent. At large τ, the missed detection rate falls, but remains noticeably above 0 per cent. In contrast, the MDR for FDR procedures is essentially 100 per cent for small τ and falls below that of HCT/ideal for large τ.

These numerical examples illustrate the idealized behaviour of different procedures. We can think of the HCT functional as the threshold that is being estimated by the actual HCT rule. On an actual dataset sampled from the underlying *F*, the HC threshold will behave differently, primarily owing to the stochastic fluctuations of *F*_{n,p} around *F*. Nevertheless, the close approximation of the HC threshold to the ideal one is striking and, to us, compelling.

## 3. Behaviour of ideal threshold, asymptotic rare/weak model

We now study the ideal threshold in the asymptotic RW model of definition 1.4. That is, we fix parameters *r*, *β* in that model and study the choice of threshold *t* maximizing class separation.

We first make precise a structural fact about the ideal threshold, first observed informally in Donoho & Jin (2008).

### Definition 3.1 (ROC curve).

The *feature detection receiver operating characteristic curve* (ROC) is the curve parametrized by (FPR(*t*), TPR(*t*)). The *tangent* to this curve at *t* is tan(*t*) = TPR′(*t*)/FPR′(*t*), and the *secant* is sec(*t*) = TPR(*t*)/FPR(*t*). Note that, in the RW(ε,τ) model, TPR, FPR, tan and sec all depend on *t*, ε, τ and *p*, although we may, as here, indicate only a dependence on *t*.

### Theorem 3.2 (Tangent–secant rule).

*In the* ARW(*r*,*β*) *model of definition 1.4, we have*

tan(*T*_{ideal}) = (1+o(1))·sec(*T*_{ideal})/(2 + sec(*T*_{ideal})),  *p*→∞.   (3.1)

### Definition 3.3 (Success region).

The *region of asymptotically successful ideal threshold feature selection* in the (*β*, *r*) plane is the interior of the subset where, under the ARW(*r*, *β*) model of definition 1.4, the ideal threshold choice *T*_{ideal}(ε,τ) obeys Sep~(*T*_{ideal}(ε,τ); ε,τ,*n*,*p*) → ∞ as *p* → ∞.

The interesting range involves (*β*,*r*)∈[0,1]^{2}. The following function is important for our analysis, and has previously appeared in important roles in other (seemingly unrelated) problems (see §7):

ρ*(*β*) = 0, 0 < *β* ≤ 1/2;  ρ*(*β*) = *β* − 1/2, 1/2 < *β* ≤ 3/4;  ρ*(*β*) = (1 − √(1−*β*))^{2}, 3/4 < *β* < 1.   (3.2)

As it turns out, it marks the boundary between success and failure for threshold feature selection. (The correct interpretation of ρ*(*β*)=0 for *β*∈(0,1/2] is simply that the critical value of τ separating success from failure is of smaller order than √(2 log *p*).)
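Equation (3.2) is simple to implement (a sketch, assuming the standard piecewise form quoted above; note the boundary is continuous at *β* = 3/4):

```python
import math

def rho_star(beta):
    """Boundary of the success region: success iff r > rho_star(beta)."""
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def threshold_selection_succeeds(r, beta):
    """Interior of the success region of definition 3.3."""
    return r > rho_star(beta)
```

For example, at *β* = 0.7 the boundary value is 0.2, so *r* = 0.3 lies in the success region while *r* = 0.1 does not.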

### Theorem 3.4 (Existence of phases).

*The success region is precisely r > ρ*(β), 0 < β < 1. In the interior of the complementary region, r < ρ*(β), 1/2 < β < 1, even the ideal threshold cannot send the proxy separation to infinity with increasing (n, p).*

### Definition 3.5 (Regions I, II, III).

The success region can be split into three regions, referred to here and below as regions I–III. The interiors of the regions are as follows (figure 2):

I. 0 < *r* ≤ *β*/3 and 1/2 < *β* < 3/4; *r* > ρ*(*β*).

II. *β*/3 < *r* ≤ *β* and 1/2 < *β* < 1; *r* > ρ*(*β*).

III. *β* < *r* < 1 and 1/2 < *β* < 1; *r* > ρ*(*β*).

In the asymptotic RW model, the optimal threshold must behave asymptotically like *t*_{q} = √(2*q* log *p*) for a certain *q*=*q*(*r*,*β*). Surprisingly, we need not have *q*=*r*.

### Theorem 3.6 (Formula for ideal threshold).

*Under the asymptotic RW model* ARW(*r*, *β*)*, with r >* ρ*(*β*)*, the ideal threshold has the form*

*T*_{ideal}(ε,τ) = (1+o(1))·√(2*q** log *p*),

*where*

*q** = *q**(*r*,*β*) = 4*r* in region I, and *q** = (*β*+*r*)^{2}/(4*r*) in regions II and III.   (3.3)

Note in particular that, in regions I and II, *q**>*r*, and hence *T*_{ideal}(ε,τ)>τ. Although the features truly have strength τ, the threshold is best set *higher than* τ. An apparently similar result was found in the detection setting (Donoho & Jin 2004), but the formula for *q** was subtly different from equation (3.3).
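A sketch of (3.3) and the induced threshold (assuming, as the region boundaries suggest, *q** = 4*r* in region I and (*β*+*r*)²/(4*r*) in regions II–III; the two branches agree at *r* = *β*/3):

```python
import math

def q_star(r, beta):
    """Exponent of the ideal threshold T_ideal ~ sqrt(2 q* log p)."""
    if r <= beta / 3.0:                      # region I
        return 4.0 * r
    return (beta + r) ** 2 / (4.0 * r)       # regions II and III

def t_ideal(r, beta, p):
    """Leading-order ideal threshold in ARW(r, beta)."""
    return math.sqrt(2.0 * q_star(r, beta) * math.log(p))
```

In regions I and II this gives *q** > *r*, i.e. a threshold set above the feature strength τ = √(2*r* log *p*); in region III it gives *q** < *r*.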

We now turn to FDR properties. The tangent–secant rule implies immediately

fdr~(*T*_{ideal}) = (1+o(1))·(1 + FDR~(*T*_{ideal}))/2.   (3.4)

Hence, any result about FDR is tied to one about local FDR, and vice versa.

### Theorem 3.7.

*Under the asymptotic RW model* ARW(*r*, *β*)*, at the ideal threshold T*_{ideal}(ε,τ)*, the proxy FDR and proxy local FDR obey*

FDR~(*T*_{ideal}) = 1/(1 + sec(*T*_{ideal}))  and  fdr~(*T*_{ideal}) = (1+o(1))·(1 + FDR~(*T*_{ideal}))/2,   (3.5)

*where* sec(*T*_{ideal}) = PL(*p*)·*p*^{ω(r,β)}, *with* ω(*r*,*β*) = 3*r*−*β* *in region I*, ω(*r*,*β*) = 0 *in region II and* ω(*r*,*β*) = (*r*−*β*)^{2}/(4*r*) *in region III*.

Several aspects of the above solution are of interest.

*(i) Threshold elevation.* The threshold is significantly higher than τ in regions I and II. Instead of looking for features at the amplitude they can be expected to have, we look for them at much higher amplitudes.

*(ii) Fractional harvesting.* Outside region III, we are selecting only a small fraction of the truly useful features.

*(iii) False discovery rate.* Outside region III, we actually have a very large false discovery rate, which is very close to 1 in region I. Surprisingly, *even though most of the selected features are useless*, we still correctly classify!

*(iv) Training versus test performance.* The quantity √(*q**/*r*) can be interpreted as a ratio: √(*q**/*r*) = strength of the selected useful features in training/strength of those features in test. From equation (3.3), we learn that, in region I, the selected useful features perform about half as well in testing as we might expect from their performance in training.

## 4. Behaviour of ideal clipping threshold

This section sketches the arguments supporting theorems 3.2, 3.4, 3.6 and 3.7. In view of space restrictions, arguments supporting lemmas from this and later sections may be found in the electronic supplementary material.

In the RW model, it makes particular sense to use the clipping threshold function, since all non-zeros are known to have the same amplitude. The ideal clipping threshold is also very easy to analyse heuristically. However, it turns out that all the statements in theorems 3.2, 3.4, 3.6 and 3.7 are equally valid for all three types of threshold function, so we prefer to explain the derivations using clipping.

### (a) Proxy separation in terms of true and false discoveries

In the RW model, we can express the components of the proxy separation very simply when using the clipping threshold:

*E*{η_{t}^{clip}(*Z*_{1})μ_{1}} = εμ_{0}(*P*{τ+*W* > *t*} − *P*{τ+*W* < −*t*})

and

*E*{η_{t}^{clip}(*Z*_{1})^{2}} = (1−ε)*P*{|*W*| > *t*} + ε*P*{|τ+*W*| > *t*},

where *W* denotes an *N*(0,1) random variable. Recall the definitions of useful selections TP and useless selections FP; we must also count *inverted detections*, for the case where μ_{i}>0 but *Z*_{i} < −*t*. Put *E*(ID)(*t*;ε,τ,*p*)=ε*p*Φ(−*t*−τ), with Φ again the standard normal distribution, and define the *inverted detection rate* by IDR=*p*^{−1}*E*(ID). Then

*E*{η_{t}^{clip}(*Z*_{1})μ_{1}} = μ_{0}(TPR(*t*) − 2·IDR(*t*))

and

*E*{η_{t}^{clip}(*Z*_{1})^{2}} = TPR(*t*) + FPR(*t*).

We arrive at an identity for the proxy separation in the case of clipping:

Sep~(*t*; ε,τ,*n*,*p*) = 2√*p*·μ_{0}·(TPR(*t*) − 2·IDR(*t*))/√(TPR(*t*) + FPR(*t*)).

To explain theorem 3.2, drop the term IDR and consider the alternative proxy

Sep~_{alt}(*t*; ε,τ,*n*,*p*) = 2√*p*·μ_{0}·TPR(*t*)/√(TPR(*t*) + FPR(*t*)).

### Lemma 4.1.

*Let ε>0 and τ>0. The threshold t*_{alt} *maximizing* Sep~_{alt} *as a function of t satisfies the tangent–secant rule as an exact identity; at this threshold we have* fdr~(*t*_{alt}) = (1 + FDR~(*t*_{alt}))/2.

The connection follows from part (iii) of the next lemma.

### (b) Analysis in the asymptotic rare/weak model

We now invoke the ARW(*r*,*β*) model: ε = *p*^{−β}, τ = √(2*r* log *p*), μ_{0} = τ/√*n* and *n* = *c* log^{γ}(*p*). Let *t*_{q} = √(2*q* log *p*). We have the Mills’ ratio asymptotic

Φ̄(*t*) := 1 − Φ(*t*) = (1+o(1))·φ(*t*)/*t*,  *t*→∞,   (4.1)

where φ denotes the standard normal density.
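Mills’ ratio and the convenient *p*^{−q} form it yields are easy to check numerically. A sketch (assuming the standard two-sided bounds φ(*t*)/(*t*+1/*t*) ≤ Φ̄(*t*) ≤ φ(*t*)/*t*, valid for all *t* > 0):

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def phi_tail(t):
    """Phi-bar(t) = P(N(0,1) > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

# Mills' ratio bounds: phi(t)/(t + 1/t) <= Phi-bar(t) <= phi(t)/t, t > 0.
for t in (2.0, 4.0, 6.0):
    assert phi(t) / (t + 1.0 / t) <= phi_tail(t) <= phi(t) / t

# The convenient form: Phi-bar(sqrt(2 q log p)) = PL(p) * p^(-q);
# the leftover factor is only poly-log in p.
p, q = 10**6, 0.5
t_q = math.sqrt(2.0 * q * math.log(p))
ratio = phi_tail(t_q) * p ** q
```

Here `ratio` is the PL(*p*) factor: a quantity of order 1/√(log *p*), far smaller than any power of *p*.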

We also need some notation for poly-log terms.

### Definition 4.2.

Any occurrence of the symbol PL(*p*) denotes a term that falls between (*C* log^{ζ}*p*)^{−1} and *C* log^{ζ}*p* as *p*→∞, for some ζ>0 and *C*>0. Different occurrences of this symbol may stand for different such terms.

In particular, we may well have *T*_{1}(*p*)=PL(*p*) and *T*_{2}(*p*)=PL(*p*) as *p*→∞, and yet *T*_{1}(*p*)/*T*_{2}(*p*)→∞ as *p*→∞. However, certainly *T*_{1}(*p*)/*T*_{2}(*p*)=PL(*p*), *p*→∞.

The following lemma exposes the main phenomena driving theorems 3.2, 3.4, 3.6 and 3.7. It follows by simple algebra and several uses of Mills' ratio (4.1), in the convenient form Φ̄(√(2*q* log *p*))=PL(*p*)·*p*^{−*q*}.

### Lemma 4.3.

*In the asymptotic RW model* ARW(*r*,β)*, with t*_{q}(*p*)=√(2*q* log *p*) *and as p*→∞*, we have the following results.*

*(i) Quasi power law for useful feature discoveries:* TPR(*t*_{q}(*p*))=PL(*p*)·*p*^{−δ}, *where the useful feature discovery exponent δ obeys* δ=δ(*q*;β,*r*)=β+(√*q*−√*r*)^{2} *for q>r, and* δ=β *for q≤r.*

*(ii) Quasi power law for useless feature discoveries:* FPR(*t*_{q}(*p*))=PL(*p*)·*p*^{−*q*}.

*(iii) Negligibility of inverted detections:* IDR(*t*_{q}(*p*))=*o*(TPR(*t*_{q}(*p*))), *p*→∞.
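The quasi-power laws can be illustrated numerically: with *t*_{q}(*p*)=√(2*q* log *p*), the empirical exponents −log TPR/log *p* and −log FPR/log *p* drift towards δ and *q* as *p* grows, with the slow convergence that the PL(*p*) factors predict. The formulas below follow our reconstruction of the lemma; the parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

beta, r, q = 0.5, 0.1, 0.8                       # illustrative; q > r
delta = beta + (np.sqrt(q) - np.sqrt(r))**2      # useful-discovery exponent

exps_T, exps_F = [], []
for p in [1e4, 1e8, 1e12, 1e16]:
    logp = np.log(p)
    tau = np.sqrt(2 * r * logp)                  # feature amplitude
    t_q = np.sqrt(2 * q * logp)                  # threshold
    TPR = p**(-beta) * norm.sf(t_q - tau)        # eps * Phibar(t_q - tau)
    FPR = 2 * norm.sf(t_q)                       # up to a (1 - eps) factor
    exps_T.append(-np.log(TPR) / logp)
    exps_F.append(-np.log(FPR) / logp)

print(exps_T, delta)   # decreasing slowly towards delta
print(exps_F, q)       # decreasing slowly towards q
```

Even at *p*=10^{16} the empirical exponents are off by several hundredths, a concrete reminder that the PL(*p*) corrections decay only poly-logarithmically.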

As an immediate corollary, under ARW(*r*,β) we have Sep(*t*_{q}(*p*))=PL(*p*)·*p*^{1−2δ}/(*p*^{−δ}+*p*^{−*q*}). On the right-hand side of this display, the poly-log term is relatively unimportant; the driving effect is the power-law behaviour of the fraction. The following lemma contains the core idea behind the appearance of ρ*(β) in theorems 3.2 and 3.4, and behind the distinction between region I and regions II and III.

### Lemma 4.4.

*Let t*_{q}(*p*)=√(2*q* log *p*)*. For fixed q, r and β, let γ(q;r,β) denote the rate at which the proxy separation* Sep(*t*_{q}(*p*)) *tends to infinity as p*→∞*. Then γ>0 if and only if r>ρ*(β). A choice of q maximizing this rate is given by equation (3.3).*

Lemma 4.4 shows that equation (3.3) gives us *one choice* of threshold maximizing the rate at which the proxy separation tends to infinity. Is it the *only choice*? Define *q*_{2}=*q*_{2}(*r*,β)=(β+*r*)^{2}/(4*r*). Except in the case *r*>β (and so *q*_{2}∈(β,*r*)), this is indeed the only choice. As the proof of lemma 4.4 shows, if *r*>β, *any* *q*∈[β,*r*] optimizes the *rate* of separation. It turns out that, in that case, our formula *q** not only maximizes the rate of separation, but also correctly describes the leading-order asymptotic behaviour of *T*_{ideal}. The key point is the tangent–secant formula, which picks out, from among all *q*∈[β,*r*], uniquely *q*_{2}. This is shown by the next two lemmas, which thereby complete the proof of theorem 3.6.
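The dichotomy just described, a unique maximizer *q*_{2} when *r*<β and a flat plateau on [β,*r*] when *r*>β, can be seen numerically from the rate function. The sketch below uses the reconstructed rate γ(*q*)=1−2δ(*q*)+min(δ(*q*),*q*) with δ(*q*)=β+(√*q*−√*r*)₊²; this formula is our reconstruction, and the (β,*r*) pairs are illustrative.

```python
import numpy as np

def delta(q, beta, r):
    # useful-feature discovery exponent (reconstructed from lemma 4.3)
    return beta + max(np.sqrt(q) - np.sqrt(r), 0.0)**2

def gamma(q, beta, r):
    # rate exponent of the proxy separation: Sep(t_q(p)) ~ PL(p) * p**gamma
    d = delta(q, beta, r)
    return 1 - 2*d + min(d, q)

qs = np.linspace(0.01, 0.99, 9801)

# Case r < beta: unique maximizer, at q2 = (beta + r)^2 / (4 r)
beta, r = 0.6, 0.3
g1 = np.array([gamma(q, beta, r) for q in qs])
q_hat = qs[g1.argmax()]
q2 = (beta + r)**2 / (4 * r)
print(q_hat, q2)

# Case r > beta: gamma is flat, equal to 1 - beta, on the interval [beta, r]
beta, r = 0.3, 0.6
g2 = np.array([gamma(q, beta, r) for q in qs])
plateau = qs[g2 > g2.max() - 1e-9]
print(plateau.min(), plateau.max(), g2.max())
```

In the first case the grid maximizer reproduces *q*_{2}; in the second, every *q* in [β,*r*] attains the same rate 1−β, so the rate alone cannot single out a threshold, and the tangent–secant formula is needed.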

### Lemma 4.5.

*Set t*_{q}(*p*)=√(2*q* log *p*)*, for q∈(0,1). In the* ARW(*r*, β) *model, consider the threshold t*_{q}(*p*)*. Then, as p*→∞*,*

TPR(*t*_{q}(*p*))/FPR(*t*_{q}(*p*))=PL(*p*)·*p*^{*q*−δ}.      (4.2)

*Moreover, suppose q≠r. Then*

FPR′(*t*_{q}(*p*))/TPR′(*t*_{q}(*p*))∼2·*p*^{β+*r*−2√(*qr*)}.      (4.3)

### Lemma 4.6.

*The above-defined q*_{2}(*r*, β) *is the unique solution of γ*_{0}(*q*;*r*,β)=0*. Suppose r>β. Then, as p*→∞*, T*_{ideal}=*t*_{*q*_{2}}(*p*)(1+*o*(1)).

### (c) False discovery rate/local false discovery rate properties of ideal threshold

We now turn to theorem 3.7 and make an important observation.

The asymptotic *T*_{ideal}=*t*_{q*}(*p*)(1+*o*(1)) is simply too crude to determine the FDR and Lfdr properties of *T*_{ideal}; it is necessary to consider the second-order effects implicit in the (1+*o*(1)) term. For this, the tangent–secant formula is essential.

Indeed, equation (4.3) shows that the only possibilities for the limiting local FDR of a threshold of the *exact* form *t*_{q}(*p*) are 0, 1/2 and 1. The actual local FDR of *T*_{ideal} spans a continuum of values in [0,1], because the small perturbations implicit in the *o*(1) term of *t*_{q}(*p*)(1+*o*(1)) can change the local FDR. To understand this, for *q*≠*r* and *s*>0, define a perturbed threshold *t*_{*q*,*s*}(*p*), in which the sign of ± is + if and only if *q*>*r*. Clearly, *t*_{*q*,*s*}(*p*) is well defined for all sufficiently large *p*; the peculiar definition ensures the scaling relations recorded below.

By simple algebra one can show the following result.

### Lemma 4.7.

*Let q, r and s be fixed. With* *,* *and* *,* *.*

Let us put, for short, *T*_{s} and *F*_{s} for the true- and false-discovery quantities at the perturbed threshold, with derivatives *T*′_{s} and *F*′_{s}. Combining several identities from the last few sentences, we have, as *p*→∞, *T*′_{s}=*s*·*T*′_{1}, *F*′_{s}∼*F*′_{1}, *T*_{s}∼*s*·*T*_{1} and *F*_{s}∼*F*_{1}. Hence,

Choosing *s* appropriately, we can therefore obtain a perturbed threshold whose Lfdr and FDR differ from those of *t*_{q}(*p*). In fact, there is a unique choice of *s* that ensures the tangent–secant formula.

### Lemma 4.8.

*For given values F*_{1}*, F*′_{1}*, T*_{1}≠0 *and T*′_{1}≠0*, put s**=(*F*′_{1}/*T*′_{1})−2(*F*_{1}/*T*_{1})*. This choice of s obeys the tangent–secant rule:* *F*′_{*s**}/*T*′_{*s**}−2*F*_{*s**}/*T*_{*s**}=1.

To use this, recall equations (4.2) and (4.3). These formulae give expressions for *T*_{1}/*F*_{1} and *T*′_{1}/*F*′_{1}. Plugging in *q*=*q**, we get the following corollary.

### Corollary 4.9.

*We have T*_{ideal}∼*t*_{*q**,*s**}(*p*)*, where s** *is obtained by setting F*_{1}/*T*_{1}=2(β−*r*)/(β+*r*)*, T*′_{1}=1 *and F*′_{1}=2*. Moreover, if β/3<r<β, then* FDR(*T*_{ideal})∼(β−*r*)/(2*r*) *and* Lfdr(*T*_{ideal})∼(*r*+β)/(4*r*).

## 5. Connection of higher criticism objective with the proxy separation

Let *F*=*F*_{ε,τ} be the two-point mixture of definition 2.3 and *G*=*G*_{ε,τ} the corresponding folded distribution. For *t*=*t*_{q}(*p*) in the asymptotic RW model, we have (5.1). Step (5.1)_{3} follows from the behaviour of the folded distribution *G* at *t*_{q}(*p*) as *p*→∞. Step (5.1)_{5} follows from εFPR(*t*_{q}(*p*))=*o*(TPR(*t*_{q}(*p*))) as *p*→∞. Step (5.1)_{6} follows from TPR(*t*_{q}(*p*))=*o*(FPR(*t*_{q}(*p*))) as *p*→∞.

This derivation can be made rigorous when *q*=*q**(*r*,β), where *q** is as announced in theorem 3.4. With extra work, not shown here, one obtains the following theorem.

### Theorem 5.1.

*The statements made for the ideal threshold in theorems 3.2, 3.4 and 3.6 are equally valid for the ideal HC threshold.*
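To see the connection concretely, one can simulate feature *Z*-scores from the RW model and compare an empirical HC threshold with *t*_{q*}(*p*). The sketch below is our own illustration, not a computation from the paper: it uses the standardized HC objective in one common convention (compare Donoho & Jin 2004), takes *q**=(β+*r*)²/(4*r*) in its reconstructed form (valid for β/3<*r*<β), and all parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

p = 200_000
beta, r = 0.4, 0.3                        # beta/3 < r < beta
eps = p**(-beta)                          # fraction of useful features
tau = np.sqrt(2 * r * np.log(p))          # their common Z-score amplitude

# Z-scores under the rare/weak model
useful = rng.random(p) < eps
z = rng.standard_normal(p) + tau * useful

# HC threshold from the ordered two-sided p-values
pv = np.sort(np.clip(2 * norm.sf(np.abs(z)), 1e-300, 1 - 1e-16))
i = np.arange(1, p + 1)
hc = np.sqrt(p) * (i / p - pv) / np.sqrt(pv * (1 - pv))
k = int(hc[: p // 2].argmax())            # search the smaller p-values
t_hc = norm.isf(pv[k] / 2)                # back to the Z-score scale

q_star = (beta + r) ** 2 / (4 * r)        # reconstructed formula
t_qstar = np.sqrt(2 * q_star * np.log(p))
print(t_hc, t_qstar)
```

At moderate *p* the two thresholds need not be close, since the poly-log factors decay slowly, but both sit well above τ, the amplitude at which the useful features actually appear.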

## 6. Suboptimality of phase diagram for other methods

Thresholds can instead be chosen by false discovery rate control (FDRT) or by control of the family-wise error rate (FWERT, i.e. Bonferroni). Theorems 3.4 and 3.6 implicitly show the suboptimality of these approaches. The electronic supplementary material sketches an argument implying the following theorem.

### Theorem 6.1.

*Under the asymptotic RW model* ARW(*r*, β)*, the regions of the* (*r*,β) *phase space for successful classification by FDRT and by FWERT coincide; in each case, the region of success is r*>(1−√(1−β))^{2}*, 0<β<1.*

### Remark.

The success regions for FDRT and FWERT are smaller than for HCT. However, even over the region of joint success, HCT still yields asymptotically much larger Fisher separations than the other principles. Figure 3 displays the exponents of the Fisher separations for *β*=1/2 and *β*=5/8 (see also table 2).
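Using the reconstructed rate exponent from §4 (γ(*q*)=1−2δ(*q*)+min(δ(*q*),*q*), with δ(*q*)=β+(√*q*−√*r*)₊²; this formula is our reconstruction, and the *r* values below are illustrative), one can tabulate the separation exponents attained by the best threshold against those of the Bonferroni-style choice *q*=1, in the spirit of figure 3:

```python
import numpy as np

def delta(q, beta, r):
    # useful-feature discovery exponent (reconstructed from lemma 4.3)
    return beta + max(np.sqrt(q) - np.sqrt(r), 0.0)**2

def gamma(q, beta, r):
    # rate exponent of the proxy separation at threshold t_q(p)
    d = delta(q, beta, r)
    return 1 - 2*d + min(d, q)

qs = np.linspace(0.001, 1.0, 2000)
results = {}
for beta in (0.5, 0.625):                 # the beta values of figure 3
    for r in (0.25, 0.4):                 # illustrative feature strengths
        g_best = max(gamma(q, beta, r) for q in qs)   # ideal/HCT-style rate
        g_fwer = gamma(1.0, beta, r)                  # FWERT/Bonferroni: q = 1
        results[(beta, r)] = (g_best, g_fwer)

for key, (g_best, g_fwer) in sorted(results.items()):
    print(key, round(g_best, 4), round(g_fwer, 4))
```

In every case the best threshold attains a strictly larger exponent than the *q*=1 choice, matching the remark that HCT dominates even on the region of joint success.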

## 7. Discussion and conclusions

These results form part of a series including Donoho & Jin (2008, in preparation) and Jin (2009). Donoho & Jin (2008) reported numerical results both with simulated data from the RW model and with real datasets often used as standard benchmarks for classifier performance. The current paper develops an analysis of ideal HCT that is both transparent and insightful. While this paper focuses on one particular linkage of *n* and *p*, the more general cases, where *n* is fixed as *p* grows or where *n*=*n*_{p}=*p*^{θ} for some θ∈(0,1), require much new discussion. This wider setting is considered in Jin (2009) (see also Ingster *et al*. in press), who proved the impossibility of successfully training classifiers in what we call the region of impossibility. In Donoho & Jin (in preparation), we will develop a complete set of results for the more general linkages of *n* with *p* and for HCT as opposed to ideal HCT.

*Beyond ideal performance*. Conceptually, the ideal threshold envisages a situation with an oracle, who, knowing ε and τ, and *n* and *p*, chooses the very best threshold possible under those given parameters. In this paper we have analysed the behaviour of this threshold within a certain asymptotic framework. Clearly, no empirical procedure can duplicate the performance of the ideal threshold.

However, HC thresholding comes close. The HC threshold does not involve optimal exploitation of knowledge of ε and τ; it needs only the feature *Z*-scores. In this paper, we analysed the behaviour of the ideal HCT: *t*^{HC}=*T*_{HC}(*F*_{ε,τ}).

This is an ideal procedure because, in a real-life situation, we do not know the distribution *F*_{ε,τ} of the *Z*-scores; we have instead the empirical cumulative distribution function (CDF) *F*_{n,p} of the observed *Z*-scores. The (non-ideal) HCT that we defined in §1*e* is then simply *T*_{HC}(*F*_{n,p}). Because *F*_{ε,τ}(*z*)=*E*(*F*_{n,p})(*z*), there is nothing controversial about the assertion that *T*_{HC}(*F*_{n,p})≈*T*_{HC}(*F*_{ε,τ}) for large *n* and *p*. Indeed, there are generations of experience for *other* functionals *T* showing that we typically have *T*(*F*_{n,p})≈*T*(*E*(*F*_{n,p})) for large *n*, *p* for those other functionals. Proving this for *T*=*T*_{HC} is more challenging than one might anticipate; the problem is that the functional *T*_{HC} is not continuous at *F*=Φ, and yet *F*_{ε,τ} converges to Φ as *p*→∞. After considerable effort, we justify the approximation of HCT by ideal HCT in Donoho & Jin (in preparation). The analysis presented here does not fully prove that HCT gives near-optimal threshold feature selection; it explains the connection only at the level of ideal quantities. But this is the main insight.

*Phase diagram for finite sample sizes*. The phase diagram correctly reflects finite (*n*,*p*) behaviour. In figure 4, we consider *p*=3×10^{3}×*N*, *N*∈{1,10,100}. For such *p*, with a correspondingly chosen *n*, we display the boundary of the set of (β,*r*) where ideal HCT yields a classification error between 10 and 40 per cent. As *p* grows, the upper and lower bounds migrate towards the limit curve *r*=ρ*(β).

*Other work on HC*. HC was originally proposed for use in a detection problem that has nothing to do with threshold feature selection: testing an intersection null hypothesis μ(*j*)=0, for all *j* (Donoho & Jin 2004). The literature has developed since then. Hall & Jin (2008, 2009) extend the optimality of HC in detection to correlated settings. Wellner and his collaborators investigated HC in the context of goodness of fit; see, for example, Jager & Wellner (2007). Hall and his collaborators have investigated HC for robustness; see, for example, Delaigle & Hall (2009). HC has been applied to data analysis in cosmology and astronomy (Cayon *et al*. 2005; Jin *et al*. 2005). HC has also been used as a principle to synthesize new procedures: Cai *et al*. (2007) use HC to motivate estimators for the proportion ε of non-null effects.

*HC in classification*. HC was previously applied to high-dimensional classification in Hall *et al*. (2008), but with a key conceptual difference from the present paper. Our approach uses HC in classifier *training*—it selects features in designing a linear classifier—but the actual classification decisions are made by the linear classifier when presented with specific test feature vectors. The approach in Hall *et al*. (2008) uses HC to directly make decisions from feature vector data. Let us call these two different strategies *HC-based feature selection* (used here) and *HC-based decision* (used in Hall *et al*. 2008).

In the ARW model, for a given classifier performance level, the HC-based decision strategy requires much stronger feature contrasts to reach that level than the HC-based feature selection strategy. In this paper, we have shown that HC feature selection requires useful features to have contrasts of order √(log(*p*)/*n*), whereas HC decision requires useful features to have contrasts of order √(log *p*). Therefore, if the number of training samples *n*>1, HC feature selection has an advantage. Imagine that we have *n*=36 training samples; we will find that HC feature selection can asymptotically detect features at roughly 1/6 of the strength required when using HC directly for decision.

*Other work on the RW model.* During the review of this paper, a new manuscript by Ingster *et al*. (2009) appeared on arXiv, analysing a feature selection setting equivalent to the ARW model using non-HC-based approaches.

## Acknowledgements

The authors would like to thank the Isaac Newton Mathematical Institute at Cambridge for hospitality during the program *Statistical Theory and Methods for Complex, High-Dimensional Data*, and for a Rothschild Visiting Professorship held by D.D. We would also like to thank an anonymous referee for many helpful comments. D.D. was partially supported by NSF DMS-0505303, and J.J. was partially supported by NSF CAREER award DMS-0908613.

## Footnotes

One contribution of 11 to a Theme Issue ‘Statistical challenges of high-dimensional data’.

- © 2009 The Royal Society