## Abstract

We study the problem of classification of *d*-dimensional vectors into two classes (one of which is ‘pure noise’) based on a training sample of size *m*. The main specific feature is that the dimension *d* can be very large. We suppose that the difference between the distribution of the population and that of the noise is only in a shift, which is a sparse vector. For Gaussian noise, fixed sample size *m*, and dimension *d* that tends to infinity, we obtain the sharp classification boundary, i.e. the necessary and sufficient conditions for the possibility of successful classification. We propose classifiers attaining this boundary. We also give extensions of the result to the case where the sample size *m* depends on *d* and satisfies the condition *m* ≍ *d*^{γ}, 0 ≤ γ < 1, and to the case of non-Gaussian noise satisfying the Cramér condition.

## 1. Introduction

Over the last decade, high-dimensional data have become a common reality in many fields, for example, in microarray analysis, astronomy and information technology. Often the task is to perform a prediction or to classify a *d*-dimensional observation vector using the information from a training sample of vectors of the same dimension. The main difficulty is that the dimension *d* can be very large. For example, in gene microarray data, *d* is typically of the order 10^{5}, so that, speaking in mathematical terms, it is natural to consider the asymptotics in the dimension, i.e. as *d* → ∞. This situation is highly non-standard, in the sense that it is not covered by the classical statistical techniques.

In this paper, we study the two-class classification problem in such a framework. When the dimension *d* is large, there is no hope of constructing a successful classification method for vectors of general form. However, we shall show that it becomes possible if the underlying high-dimensional vectors have some simple structure. Below, we shall assume that they are sparse, i.e. contain only a small fraction of non-zero components described by the sparsity index *β*. This assumption is natural in many applications. We shall present efficient classification methods under such a sparsity assumption. What is more, we shall establish the classification boundary, i.e. the necessary and sufficient conditions for the possibility of successful classification. These conditions characterize the classification problem and not a particular method. They can be helpful for the practitioner who needs to design experiments. Modern experiments, e.g. in bioinformatics, are often resource intensive (time, raw material, cost). Our results on the classification boundary can give guidelines to decide whether or not an experiment is worth performing. Indeed, it turns out that the regions where the practitioner can or cannot expect good performance in classification are determined by two parameters of the problem, namely, the sparsity index *β* and a kind of signal-to-noise ratio. The most striking result here is that the regions are completely described through very simple conditions on these two parameters. Knowing in advance that the parameters of the experiment are correct regarding our theoretical results can help the practitioner to conserve resources. On the other hand, our theory would give some hints about how to improve the experimental design. Another striking result is that successful classification can be expected even if the coordinates of the signal cannot be accurately estimated. 
For example, in cancer classification using gene microarray data, our results show that, for a typical set-up with *m* = 10 patients and more than *d* = 20 000 genes (see Yeoh *et al.* 2002; Ross *et al.* 2003, 2004), the signal-to-noise ratio can be smaller than 1 even when very few coordinates are informative. This point is very important and can lead one to discard a broadly used methodology that consists of selecting genes prior to classification in the diagnosis problem. Indeed, our results show that, from the diagnosis point of view, the practitioner should use the whole range of data. This also suggests taking a new look at previously collected data that were uninformative with the traditional methodology.

### (a) Model and problem

Assume that we have an independent and identically distributed (i.i.d.) sample **Y** = (*Y*_{1},…,*Y*_{m}) from a population with probability distribution *P*_{Y} on ℝ^{d}. Here, *Y*_{j} = (*Y*_{j}^{1},…,*Y*_{j}^{d}), where *Y*_{j}^{k} are the components of *Y*_{j}. In this paper we assume that
*Y*_{j}^{k} = *u*_{k} + ξ_{j}^{k},  *j* = 1,…,*m*, *k* = 1,…,*d*, (1.1)
where *u* = (*u*_{1},…,*u*_{d}) is a deterministic mean vector and the errors ξ_{j}^{k} are (unless other conditions are explicitly mentioned) jointly i.i.d. zero-mean random variables with probability density *f* on ℝ. Let *P*_{η} denote the probability distribution of the pure noise vector η = (η^{1},…,η^{d}), whose components are i.i.d. with density *f*.

We consider the classification problem when the dimension *d* of the observations is very large (tends to ∞). Assume that we observe a random vector *Z* = (*Z*^{1},…,*Z*^{d}) independent of **Y** and we know that the distribution of *Z* is either *P*_{η} or *P*_{Y}. Our aim is to classify *Z*, i.e. to decide whether *Z* comes from the population with distribution *P*_{Y} or from the ‘pure noise’ population *P*_{η}.

Distinguishing between *P*_{η} and *P*_{Y} presents a difficulty only when the vector *u* is close to 0. A particular type of closeness for large *d* can be characterized by the sparsity assumption (Ingster & Suslina 2002*a*; Donoho & Jin 2004) that we shall adopt in this paper. Similar to Ingster & Suslina (2002*a*) and Donoho & Jin (2004), we introduce the following set of sparse vectors in ℝ^{d} characterized by a positive number *a*_{d} and a sparsity index *β* ∈ (0,1]:
*U*_{β,a_{d}} = {*u* ∈ ℝ^{d} : *u*_{k} = *a*_{d}ϵ_{k}, ϵ_{k} ∈ {0,1}, *c*_{1}*d*^{1−β} ≤ Σ_{k=1}^{d}ϵ_{k} ≤ *C*_{1}*d*^{1−β}}.
Here, 0 < *c*_{1} ≤ *C*_{1} are two constants that are supposed to be fixed throughout the paper. The value *p* = *d*^{−β} can be interpreted as the ‘probability’ of occurrence of non-zero components in the vector *u*. A similar model was also considered recently by Hall *et al.* (2008).
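As a concrete illustration of this sparsity model, the following simulation draws a vector from (an analogue of) the sparse set above and a training sample from model (1.1); all numerical values are ours, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta, m = 20_000, 0.6, 10       # dimension, sparsity index, sample size (illustrative)
a_d = 1.5                          # common value of the non-zero components (illustrative)

p = d ** (-beta)                   # 'probability' of a non-zero component, p = d^(-beta)
eps = rng.random(d) < p            # sparsity pattern (epsilon_k in {0, 1})
u = np.where(eps, a_d, 0.0)        # sparse mean vector; about d^(1-beta) non-zero entries

# i.i.d. training sample Y_1, ..., Y_m: Y_j = u + xi_j with standard Gaussian noise
Y = u + rng.standard_normal((m, d))

# a new observation Z to classify: either pure noise or population with mean u
Z_noise = rng.standard_normal(d)
Z_pop = u + rng.standard_normal(d)
print(int(eps.sum()), "non-zero components out of", d)
```

With these values the expected number of non-zero components is *d*^{1−β} ≈ 53 out of 20 000, which illustrates how sparse the shift is even for moderate *β*.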

In this paper, we establish the classification boundary, i.e. we specify the necessary and sufficient conditions on *β* and *a*_{d} such that successful classification is possible. Let us first define the notion of successful classification. We shall need some notation. Let *ψ* be a decision rule, i.e. a measurable function of **Y**, *Z* with values in [0,1]. If *ψ* = 0, we allocate *Z* to the *P*_{η} population, whereas for *ψ* = 1 we allocate *Z* to the *P*_{Y} population. The rules *ψ* taking intermediate values in (0,1) can be interpreted as randomized decision rules. Let *P*_{H0} and *P*_{H1} denote the joint probability distributions of **Y**, *Z* when *Z* is only noise and when *Z* ∼ *P*_{Y}, respectively, and let *E*_{H0}, *Var*_{H0} and *E*_{H1}, *Var*_{H1} denote the corresponding expectation and variance operators. We shall also denote by *P*^{(u)} the distribution of **Y** and by *E*^{(u)}, *Var*^{(u)} the corresponding expectation and variance operators. Consider the Bayes risk, defined for a prior probability 0 < π < 1 of the noise population, and the maximum risk, defined by taking the supremum over *u* ∈ *U*_{β,a_{d}}. Let *R*(*ψ*) be either the Bayes risk or the maximum risk.

We shall say that successful classification is possible if *β* and *a*_{d} are such that there exists a sequence of decision rules *ψ* = *ψ*_{d} with
*R*(*ψ*_{d}) → 0 as *d* → ∞, (1.2)
where *R* is the maximum risk or the Bayes risk with any fixed 0 < π < 1. Conversely, we shall say that successful classification is impossible if *β* and *a*_{d} are such that
lim inf_{d→∞} inf_{ψ} *R*(*ψ*) ≥ *r*_{0}, (1.3)
where *r*_{0} = 1/2 for the maximum risk and *r*_{0} = min(π, 1−π) for the Bayes risk with 0 < π < 1.

We call equation (1.2) the upper bound of classification and equation (1.3) the lower bound of classification. The lower bound equation (1.3) for the maximum risk is interpreted as the fact that no decision rule is better (in a minimax sense) than a simple random guess. For the Bayes risk , the lower bound equation (1.3) is attained at the degenerate decision rule that does not depend on the observations: *ψ* ≡ 0 if π > 1/2 or *ψ* ≡ 1 if π ≤ 1/2.

The condition on (*β*,*a*_{d}) corresponding to the passage from equation (1.2) to (1.3) is called the classification boundary. We shall say that a classifier *ψ* = *ψ*_{d} is asymptotically optimal (or that *ψ* attains the classification boundary) if, for all *β* and *a*_{d} such that successful classification is possible, we have
*R*(*ψ*_{d}) → 0 as *d* → ∞, (1.4)
where *R* is the maximum risk or the Bayes risk with any fixed 0 < π < 1.

### (b) Summary of the main results

According to the value of *β*, we shall distinguish between *moderately sparse vectors* and *highly sparse vectors*. This division depends on the relationship between *m* and *d*. For *m* not too large, i.e. when log *m* = *o*(log *d*), moderately sparse vectors correspond to *β* ∈ (0,1/2] and highly sparse vectors to *β* ∈ (1/2,1). For large *m*, i.e. when *m* ≍ *d*^{γ} with 0 < *γ* < 1, moderate sparsity corresponds to *β* ∈ (0,(1 − γ)/2] and high sparsity to *β* ∈ ((1−γ)/2,1−γ).

The classification boundary for moderately sparse vectors is obtained in a relatively simple way (cf. §2). It is of the form (1.5): successful classification is possible if *R*_{d} → ∞, and it is impossible if *R*_{d} → 0 as *d* → ∞. The result is valid both for *β* ∈ (0,1/2] and *m* ≥ 1 fixed, and for *m* depending on *d* such that *m* ≍ *d*^{γ} as *d* → ∞, 0 < *γ* < 1, with *β* ∈ (0,(1−γ)/2]. Moreover, equation (1.5) holds under weak assumptions on the noise. In particular, for the upper bound of classification we only need to assume that the noise has mean 0 and finite second moment (cf. §2). The lower bound is proved under a mild regularity condition on the density *f* of the noise.

The case of highly sparse vectors is more involved. We establish the classification boundary for the following scenarios:

(A) *m* ≥ 1 is a fixed integer, and the noise density *f* is Gaussian with known or unknown σ > 0.

(B) *m* → ∞ as *d* → ∞, log *m* = *o*(log *d*), and *f* is Gaussian with known or unknown σ > 0.

(C) *m* ≍ *d*^{γ} with 0 < γ < 1, and *f* is Gaussian with known or unknown σ > 0.

The upper bounds are extended to the following additional scenario:

(D) *m* ≍ *d*^{γ} with 0 < γ < 1 as *d* → ∞, and the noise satisfies the Cramér condition.

The conditions on the noise in (A)–(D) are crucial and, as we shall see later, they suggest that a special dependence of *a*_{d} on *d* and *m* is meaningful in the highly sparse case. More specifically, we take *a*_{d} as in (1.6), where *x*_{1} > 0 is fixed. The classification boundary in (A), (B) and (D) is then expressed by condition (1.7) on *β*, *s* and *m*, with the function *ϕ* defined in (1.8) and its components *ϕ*_{1}, *ϕ*_{2} in (1.9). In other words, successful classification is possible if *x*_{1} ≥ *ϕ*(*β*) + *δ*, and it is impossible if *x*_{1} ≤ *ϕ*(*β*) − *δ*, for any *δ* > 0 and *d* large enough. This classification boundary is also extended to the case where *x*_{1} depends on *d* but stays bounded.

For scenario (C), let *a*_{d} be parametrized by a fixed *x* > 0 as above. We show that in this framework successful classification is impossible if *β* > 1 − *γ* (cf. remark 1 in §2), and therefore we are interested in *β* ∈ ((1 − *γ*)/2, 1 − *γ*). Set *β** = *β*/(1 − *γ*) ∈ (1/2,1) and *x** = *x*/(1 − *γ*). Then the classification boundary is of the form
*x** = *ϕ*(*β**), (1.10)
for the function *ϕ*(*β*) defined above. This means that successful classification is possible if *x** > *ϕ*(*β**), and it is impossible if *x** < *ϕ*(*β**) as *d* → ∞.

For *m* = 0 (i.e. when there is no sample **Y** and only *Z* is observed), the problem that we consider here reduces to the problem of signal detection with growing dimension *d* (Ingster 1997; Ingster & Suslina 2001*a*,*b*, 2002*a*,*b*; Jin 2003, 2004; Donoho & Jin 2004; Jager & Wellner 2007), and our classification boundary coincides with the *detection boundary* established by Ingster (1997). Sharp asymptotics in the detection problem were studied by Ingster (1997) (see also Ingster & Suslina 2002*a*, ch. 8) for known *a*_{d} or *β*. The adaptive version of the problem (this corresponds to unknown *a*_{d} and *β*) was studied by Ingster & Suslina (2001*a*,*b*). Various procedures to attain the detection boundary were proposed by Ingster & Suslina (2002*b*), Donoho & Jin (2004) and Jager & Wellner (2007). Ingster & Suslina (2002*b*) introduced a method to attain the detection boundary based on the combination of three different procedures for the zones *β*∈(0,1/2], *β* ∈ (1/2,3/4] and *β* ∈ (3/4,1). Later, Donoho & Jin (2004) showed that a test based on the higher criticism (HC) statistic attains the detection boundary simultaneously for these zones. More recently, Jager & Wellner (2007) proved that the same is true for a large class of statistics including the HC statistic. Hall & Jin (2009) have extended those results to the case of correlated noise. Haupt *et al.* (2008) showed that the rates of detection can be improved when using adaptive multiple sampling schemes.
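For readers unfamiliar with the HC statistic mentioned above, here is a minimal sketch in one textbook form (Donoho & Jin 2004); restricting the maximum to the smaller half of the ordered p-values is one common convention, and the simulation parameters are ours, purely illustrative:

```python
import numpy as np
from math import erf, sqrt

def gauss_p2(v):
    """Two-sided Gaussian p-value 2(1 - Phi(|v|))."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(v) / sqrt(2.0))))

def higher_criticism(z):
    """HC statistic: max over the smaller half of the ordered p-values of
    sqrt(d) * (k/d - p_(k)) / sqrt(p_(k) * (1 - p_(k)))."""
    d = len(z)
    p = np.sort([gauss_p2(v) for v in z])
    k = np.arange(1, d + 1)
    hc = sqrt(d) * (k / d - p) / np.sqrt(p * (1.0 - p) + 1e-300)
    return float(np.max(hc[: d // 2]))

rng = np.random.default_rng(1)
z_noise = rng.standard_normal(10_000)     # pure noise vector
z_signal = rng.standard_normal(10_000)
z_signal[:30] += 4.0                      # a sparse shift on 30 coordinates
hc0, hc1 = higher_criticism(z_noise), higher_criticism(z_signal)
print(hc0, hc1)                           # HC grows markedly under the sparse shift
```

Under pure noise HC stays of moderate size, while a sparse shift well inside the detectable region inflates it; thresholding HC gives a test that adapts to unknown *β* and *a*_{d}.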

Inference on the fraction of non-zero components in the model with *m*=0 has been studied by Cai *et al.* (2007). They derived minimax rates of convergence for estimators of the fraction and the corresponding confidence sets.

Hall *et al.* (2008) deal with the same classification model as the one that we consider here, but study a problem that is different from ours. They analyse the conditions under which some simple (e.g. minimum-distance) classifiers *ψ* satisfy (1.11). Hall *et al.* (2008) conclude that, for minimum-distance classifiers, equation (1.11) holds if and only if 0 < *β* < 1/2. This implies that such classifiers cannot be optimal for 1/2 ≤ *β* < 1. They also derive equation (1.11) for some other classifiers in the case *m* = 1.

The results of this paper and their extensions to the multi-class setting were summarized by Pouet (2008) and presented at the meeting ‘Rencontres de Statistique Mathématique’ (Luminy, 16–21 December 2008) and at the Oberwolfach meeting ‘Sparse Recovery Problems in High Dimensions: Statistical Inference and Learning Theory’ (15–21 March 2009). A detailed version with full proofs is given by Ingster *et al.* (2009). In work parallel to ours, Donoho & Jin (2008, in press) and J. Jin (2009, unpublished data) independently and contemporaneously have analysed a setting less general than the present one. They did not consider a minimax framework, but rather demonstrated that the HC methodology can be successfully extended to the classification problem. Donoho & Jin (in press) showed that, for a special case of scenario (B), the ‘ideal’ (i.e. depending on the unknown *a*_{d} and *β*) HC statistic attains the same upper bound of classification that we prove below. Together with our lower bound, this implies that the ‘ideal’ HC statistic is asymptotically optimal, in the sense defined above, for scenario (B). Donoho and Jin announce that similar results for the HC statistic in scenarios (A) and (C) will appear in their work in preparation.

This paper is organized as follows. Section 2 contains some preliminary remarks. In §3, we present the classification boundary and an asymptotically optimal classifier for moderately sparse vectors under rather general conditions on the noise. In §4, we give the classification boundary and asymptotically optimal classifiers for highly sparse vectors under scenarios (A), (B) and (C), together with an extension to scenario (D). Section 5 describes adaptive procedures combining the proposed classifiers, and §6 contains the lower bounds. The definitions of our classifiers do not involve the unknown parameters *a*_{d} and *β*, in contrast to the ‘ideal’ HC of Donoho & Jin (in press). The proofs are sketched in §7. The detailed proofs can be found in Ingster *et al.* (2009).

## 2. Preliminary remarks

In this section we collect some basic remarks on the problem assuming that *f* is the standard Gaussian density. As a starting point, we discuss some natural limitations for *a*_{d}.

### Remark 1.

We remark that *a*_{d} cannot be too small. Indeed, assume that, instead of the set *U*_{β,a_{d}}, we have only one vector *u* = (*a*_{d}ϵ_{1},…,*a*_{d}ϵ_{d}) with known ϵ_{k} ∈ {0,1}. Then we get a familiar problem of classification with two given Gaussian populations. The notion of classification boundary can be defined here in the same terms as above, and the explicit form of the boundary can be derived from standard textbook results. It is expressed through the behaviour of *Q*_{d} = *a*_{d}² Σ_{k=1}^{d} ϵ_{k}:

(i) if *Q*_{d} → 0, then successful classification is impossible;

(ii) if *Q*_{d} → ∞, then successful classification is realized by the maximum-likelihood classifier *ψ** = 1{*T** > 0}, where *T** is the corresponding log-likelihood ratio statistic.

Here and below 1{·} denotes the indicator function.

If we assume that the number of non-zero components satisfies Σ_{k} ϵ_{k} ≍ *d*^{1−β}, we immediately obtain some consequences for our model defined in §1. We see that successful classification in that model is impossible if *a*_{d} is so small that *Q*_{d} → 0, and it makes sense to consider only such *a*_{d} that
*a*_{d}² *d*^{1−β} → ∞. (2.1)
In particular, for *γ* > 1 − *β*, successful classification is impossible under scenario (C) with the choice of *a*_{d} considered there.

We shall later see that equation (2.1) is a rough condition, which is necessary but not sufficient for successful classification in the model of §1. For example, in that model with *β* ∈ (0,1/2] and fixed *m*, the value *a*_{d} should be substantially larger than the level *d*^{(β−1)/2} given by condition (2.1) (cf. equation (1.5)).

### Remark 2.

Our second remark is that, at the other extreme, for sufficiently large *a*_{d} the problem is trivial. Specifically, non-trivial results can be expected only when the signal-to-noise ratio is small enough, in the sense of condition (2.2) with a constant *B*_{*} > 0. To make this remark clear, assume, for example, that *a*_{d} is large enough for condition (2.3) to hold. Then the problem becomes simple, in the sense that successful classification is easily realizable under (2.3) and the classical condition (2.1). Indeed, take an analogue of the statistic *T**, where *a*_{d} and ϵ_{k} are replaced by their natural estimators defined in (2.4), and consider the corresponding plug-in classifier *ψ*. We can write the *k*th coordinate-wise sample mean as *u*_{k} + ζ_{k}*m*^{−1/2}, where ζ_{k} are independent standard normal random variables. It is well known that max_{1≤k≤d}|ζ_{k}| ≤ (2 log *d*)^{1/2} with probability tending to 1 as *d* → ∞. This and equation (2.3) imply that, with probability tending to 1, the estimated sparsity pattern recovers exactly (ϵ_{1},…,ϵ_{d}) and the statistic *T* coincides with *T**. Evaluating the means and the variances of *T* under the two hypotheses and applying Chebyshev’s inequality, we find that under condition (2.1) the risks in (2.5) vanish as *d* → ∞. This argument is applicable in the general model of §1 (since the convergence in equation (2.5) is uniform in *u* ∈ *U*_{β,a_{d}}), implying successful classification by *ψ* under conditions (2.1) and (2.3).
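The support-recovery argument of this remark can be sketched numerically as follows; this is our own illustrative implementation, with the threshold √(2 log *d*/*m*) and the plug-in statistic patterned on the text rather than taken verbatim from (2.3) and (2.4):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 50_000, 25
a_d = 2.0 * np.sqrt(2.0 * np.log(d) / m)   # 'large signal' regime in the spirit of (2.3)

eps = np.zeros(d, dtype=bool)
eps[:40] = True                            # 40 non-zero coordinates
u = np.where(eps, a_d, 0.0)

Y = u + rng.standard_normal((m, d))        # training sample, model (1.1)
ybar = Y.mean(axis=0)                      # sample means u_k + zeta_k / sqrt(m)

# max_k |zeta_k| / sqrt(m) <= sqrt(2 log d / m) with probability tending to 1,
# so thresholding the sample means at that level recovers the support w.h.p.
eps_hat = ybar > np.sqrt(2.0 * np.log(d) / m)

def classify(Z):
    """Plug-in analogue of the likelihood statistic on the estimated support:
    allocate Z to the population (1) if T > 0, to pure noise (0) otherwise."""
    T = np.sum((Z - ybar / 2.0) * ybar * eps_hat)
    return int(T > 0)

Z_pop = u + rng.standard_normal(d)
Z_noise = rng.standard_normal(d)
print(int(eps_hat.sum()), classify(Z_pop), classify(Z_noise))
```

With these values the signal coordinates exceed the threshold by more than four standard deviations each, so exact recovery and correct classification occur with probability close to 1, exactly as the remark predicts.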

### Remark 3.

Let us now discuss a connection between conditions (2.1) and (2.3). First, (2.3) implies (2.1) if *m* is not too large, in the sense of condition (2.6). On the other hand, if *m* is very large, in the sense of condition (2.7), then condition (2.1) implies (2.3). Thus, if *m* is very large (satisfies (2.7)), the relationship between *a*_{d}, *d* and *m* expressed by (2.1) determines the classification boundary in the general model of §1.

### Remark 4.

Finally, note that conditions (2.3) and (2.2) involve the unknown *a*_{d}. In practice, we can replace them by their data-driven counterparts based on an estimator of *a*_{d} that is consistent with probability tending to 1 as *d* → ∞. It is therefore convenient to consider the following pre-classifier taking values in {0, 1, ND} (ND means ‘no decision’, i.e. we need to switch to some other classifier): *ψ*^{pre} outputs a decision based on the statistic *T* given by (2.4) when the estimated signal-to-noise ratio is large, and outputs ND otherwise. Suppose that we choose some classifier *ψ̃* whenever *ψ*^{pre} takes the value ND, and denote the resulting combined classifier by *ψ*^{comb}. The risks of *ψ*^{comb} and *ψ̃* then differ only by o(1) terms, which are due to the argument in remark 2. We now take *ψ̃* as one of the classifiers suggested below in this paper. We prove their optimality under assumption (2.2) with any *B*_{*} > 0. This allows us to treat the terms with *ψ̃* in the above relationships. Indeed, with probability tending to 1 as *d* → ∞, the condition *ψ*^{pre} = ND implies (2.2) with a suitable *B*_{*} (cf. (7.1) below). In conclusion, *ψ*^{comb} has asymptotically the same risk behaviour as *ψ̃*.

The above remarks can be easily extended to the case of Gaussian errors with known variance σ² > 0 by using the normalization *Z*^{k}/σ, *SY*^{k}/σ. Moreover, they extend to the case of non-Gaussian errors under the Cramér condition and an additional assumption on the growth of *m* (cf. §4*d*).

## 3. Classifier for moderately sparse vectors

In this section we consider the case of moderately sparse vectors. To simplify the notation, without loss of generality we set σ = 1. Assume that *R*_{d} = *d*^{1/2−β}*a*_{d} satisfies (3.1) and consider the classifier *ψ*^{lin} based on a linear statistic *T*′. Note that *T*′ is similar to the statistic *T* defined in (2.4), the difference being that in *T*′ we do not threshold to estimate the positions of the non-zero ϵ_{k}. Indeed, here we do not necessarily assume (2.3), and thus there is no guarantee that the ϵ_{k} can be correctly recovered.

Assume that the errors ξ_{j}^{k}, for all *k*, *j*, are random variables with mean 0 and variance 1. Then the means of the coordinate-wise statistics of **Y** and of *Z*^{k} are determined by *u*_{k}, their variances are equal to 1, and the mean and variance of *T*′ can be evaluated explicitly under both hypotheses. We now consider a vector *u* of the form (3.2). By equation (3.2), Chebyshev’s inequality and equation (3.1), we obtain that the misclassification probability of *ψ*^{lin} under the population hypothesis tends to 0 as *d* → ∞. An analogous argument yields the same under the noise hypothesis. The convergence here is uniform in *u* satisfying equation (3.2), and thus uniform in *u* ∈ *U*_{β,a_{d}}. Therefore, we have the following result.

### Theorem 3.1.

*Let the errors, for all k, j, be random variables with mean 0 and variance 1. If (3.1) holds, then successful classification is possible and it is realized by the classifier ψ*^{lin}.

### Remark.

We have proved theorem 3.1 with the set of vectors *u* defined by (3.2), which is larger than *U*_{β,a_{d}}. The upper bound on the number of non-zero components in the definition of *U*_{β,a_{d}} is not needed. Also the variances of the errors need not be equal to 1. It is easy to see that the result of theorem 3.1 remains valid if these random variables have unknown variances uniformly bounded by an (unknown) constant.

Finally, note that theorem 3.1 is valid for all *β*∈(0,1). However, for *β*>1/2 its assumption guaranteeing successful classification is much too restrictive when compared with the correct classification boundary that we shall derive in the subsequent sections.
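A numerical sketch of a classifier of this linear type follows (our own illustrative centring and threshold; the paper’s exact *T*′ and *ψ*^{lin} are not reproduced):

```python
import numpy as np

def psi_lin(Y, Z):
    """Linear classifier: correlate Z with the training sample means.
    Since E||ybar||^2 = ||u||^2 + d/m, subtracting (||ybar||^2 - d/m)/2 centres
    the statistic near -||u||^2/2 under pure noise and +||u||^2/2 under P_Y,
    so we allocate Z to P_Y (return 1) when the statistic is positive."""
    m, d = Y.shape
    ybar = Y.mean(axis=0)
    t = np.sum(ybar * Z) - (np.sum(ybar ** 2) - d / m) / 2.0
    return int(t > 0)

rng = np.random.default_rng(3)
d, m, beta = 100_000, 5, 0.4                 # moderately sparse regime, beta <= 1/2
a_d = 4.0 * d ** (beta - 0.5)                # inside the 'possible' region: R_d stays large
u = np.zeros(d)
u[: int(round(d ** (1 - beta)))] = a_d       # about d^(1-beta) non-zero coordinates

Y = u + rng.standard_normal((m, d))
Z_pop = u + rng.standard_normal(d)
Z_noise = rng.standard_normal(d)
print(psi_lin(Y, Z_pop), psi_lin(Y, Z_noise))
```

No thresholding or support estimation is performed, in line with the discussion above: the aggregate correlation alone separates the two populations in the moderately sparse regime.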

## 4. Classifiers for highly sparse vectors

We now analyse the case of highly sparse vectors, i.e. we suppose that *β* ∈ (1/2,1) if *m* is fixed or log *m* = *o*(log *d*), and *β** = *β*/(1−*γ*) ∈ (1/2,1) if *m* ≍ *d*^{γ}. The classification boundary in this case is expressed in terms of the function *ϕ*(*β*), equal to *ϕ*_{1}(*β*) for *β* ∈ (1/2,3/4] and to *ϕ*_{2}(*β*) for *β* ∈ (3/4,1), where the functions *ϕ*_{1} and *ϕ*_{2} are defined in equation (1.9). Note that *ϕ*_{1} and *ϕ*_{2} are monotone increasing on (1/2,1), satisfy *ϕ*_{1}(*β*) ≤ *ϕ*_{2}(*β*) for all *β* ∈ (1/2,1), and the equality holds if and only if *β* = 3/4.
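To fix ideas, the shape of such a boundary function is easiest to see in the limiting case *m* = 0, where (as recalled in §1) the classification boundary coincides with the detection boundary of Ingster (1997) and Donoho & Jin (2004). The sketch below implements that *m* = 0 boundary; the paper’s *ϕ*_{1}, *ϕ*_{2}, which also depend on *m*, are not reproduced here:

```python
from math import sqrt

def phi_detection(beta):
    """Detection boundary for the m = 0 problem (Ingster 1997; Donoho & Jin 2004):
    with a_d = sqrt(2 x log d), detection is possible iff x > phi(beta),
    for a sparsity index beta in (1/2, 1)."""
    assert 0.5 < beta < 1.0
    if beta <= 0.75:
        return beta - 0.5                      # moderate branch
    return (1.0 - sqrt(1.0 - beta)) ** 2       # extreme-sparsity branch

# the two branches meet continuously at beta = 3/4, where both equal 1/4
for b in (0.6, 0.75, 0.9):
    print(b, phi_detection(b))
```

The piecewise structure, with the two branches agreeing exactly at *β* = 3/4, mirrors the relation *ϕ*_{1}(3/4) = *ϕ*_{2}(3/4) noted above.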

We shall propose some classifiers that attain this boundary.

The notation introduced in (4.1) and (4.2) will be useful in the sequel; clearly, *x*_{0} < *x* < *x*_{1}. We allow *s*, *x*, *x*_{0}, *x*_{1} to depend on *d* but do not indicate this dependence in the notation for the sake of brevity. We shall also suppose throughout that (2.2) holds, so that *x*_{1} = *O*(1) as *d* → ∞.

### (a) Classifier for fixed *m*

We now propose optimal classifiers under scenario (A). First, we consider a procedure that attains the classification boundary only for *β* ∈ [3/4,1), but has a simple structure. It is based on the maxima of the coordinate-wise statistics *SY*^{k} and *SZ*^{k} defined in (4.3). Taking a small *c*_{0} > 0, we consider a classifier *ψ*^{max} of the corresponding thresholding form.

### Theorem 4.1.

*Consider scenario (A). Let β ∈ (0,1) and let (2.2) hold. Then, for any c₀ > 0, relationship (4.4) holds. If x₁ < ϕ₂(β), then, for any c₀ > 0, (4.5) holds. If x₁ > ϕ₂(β), then there exists c₀ > 0 such that (4.6) holds.*

Theorem 4.1 (cf. equations (4.4) and (4.6) and the fact that *ϕ*(*β*) = *ϕ*_{2}(*β*) for *β* ∈ [3/4,1)) and theorem 6.3 below imply that *ψ*^{max} attains the classification boundary for *β* ∈ [3/4,1). On the other hand, equation (4.5) implies that, for *β* ∈ (1/2,3/4) (where *ϕ*(*β*) = *ϕ*_{1}(*β*) < *ϕ*_{2}(*β*)), the classifier *ψ*^{max} does not do the correct job. Its maximal risk is asymptotically 1, which is larger than the risk 1/2 of the simple random guess. We therefore introduce another classifier that has, however, a more involved structure. Consider the ratio statistics Δ(*t*), where *Φ* is the standard normal cumulative distribution function and the statistics *SY*^{k}, *SZ*^{k} are defined in equation (4.3). Consider the grid (4.7) with a step *h* > 0 depending on *d* and such that *h* = *o*(1) slowly enough that 1 ≪ *N* ≪ *T*_{d} as *d* → ∞ (here and below *v*_{d} ≪ *w*_{d} for *v*_{d} > 0 and *w*_{d} > 0 depending on *d* means that *v*_{d}/*w*_{d} → 0). Set the maximal ratio statistic Δ, where *H* = *H*_{d} is such that (4.8) holds for any *B* > 0, *b* > 0 and any *d* > *d*_{0}(*B*,*b*), where *d*_{0}(*B*,*b*) is a constant depending only on *B* and *b* (such an *H* can always be determined depending on the choice of *h*). Now consider the classifier *ψ**_{m} defined by thresholding this statistic.

### Theorem 4.2.

*Consider scenario (A) with β ∈ (1/2,1) and assume (2.2). Then (4.9) holds. If x₁ > ϕ(β), then (4.10) holds.*

Theorems 4.1, 4.2 and 6.3 below show that the classification boundary for highly sparse vectors (i.e. for *β* ∈ (1/2,1)) is given by (1.7). Furthermore, the classifier *ψ**_{m} is optimal (attains the classification boundary) for *β* ∈ (1/2,1), except for one special case, which is already covered by the classifier *ψ*^{max} for all *β* ∈ (1/2,1).

### (b) Classifier for scenario (B)

In this subsection, we analyse scenario (B). Then *m* → ∞ as *d* → ∞ and the classifier *ψ**_{m} is not, in general, optimal. Nevertheless, we propose another classifier that essentially attains the same classification boundary as in §4*a* above. We introduce the ratio statistics Δ(*t*), where the maximum is taken over the grid (4.7). Here and below we use the same notation Δ(*t*), Δ as previously for different ratio statistics, since it causes no ambiguity. Set also the companion statistic and define the classifier by (4.11), where *H* satisfies (4.8).

### Theorem 4.3.

*Consider scenario (B). Let β ∈ (1/2,1) and let (2.2) hold. Then (4.12) holds. If x₁ > ϕ(β), then (4.13) holds.*

### (c) Classifier for scenario (C)

We now suggest an asymptotically optimal classifier for scenario (C). For *t* ≥ 0, we introduce the statistics *L*^{0}(*t*) and *L*^{1}(*t*). Take a grid *t*_{1},…,*t*_{N} of the form (4.7) and define the classifier by thresholding the corresponding ratios.

### Theorem 4.4.

*Consider scenario (C) with m ≍ d^{γ}, 0 < γ < 1. Let β\* = β/(1−γ) ∈ (1/2,1) and let (2.2) hold. Then (4.14) holds. If x\* > ϕ(β\*), then (4.15) holds.*

### (d) Extensions for non-Gaussian noise

We now discuss an extension of our results to scenario (D). The remarks on the pre-classifier *ψ*^{pre} in §2 and the proofs of the upper bounds are based only on the constraint (2.2) and the property (4.16) of the tails of the Gaussian distribution. Here *S*ζ^{m} = *m*^{−1/2}(ζ_{1}+⋯+ζ_{m}), and the ζ_{i} are i.i.d. random variables.

Indeed, in the proof of theorem 4.4 (see §7 below), we can write *Φ*(−*t**T*_{d}) as *P*(*S*ζ^{m} > σ*t**T*_{d}) for standard Gaussian ζ_{i}. From (4.16) we deduce a bound with a factor *A*_{d} satisfying, for *t*_{+} = *O*(1), relationship (4.17) as *d* → ∞. This is exactly the relationship (4.18), which is also the only property of the noise distribution needed for the proof of theorem 4.4.

If *m* is large enough, relationship (4.16) holds not only for the Gaussian ζ_{i}. It suffices to have i.i.d. ζ_{i} with mean 0 and variance σ² satisfying the Cramér condition: *E* exp(*h*ζ_{i}) < ∞ for all |*h*| < *h*_{0} and some *h*_{0} > 0. In fact, using theorem 5.23 in Petrov (1995) we get that, under the Cramér condition and for moderately large arguments, the Gaussian tail approximation holds up to a factor involving the Cramér series λ(*t*). Inserting the expression for the Cramér series and using the growth condition on *m*, we obtain the required analogue of (4.16) as *d* → ∞. These remarks allow us to follow the proof of theorem 4.4, leading to the next result.

### Theorem 4.5.

*Consider scenario (D) with m ≍ d^{γ}, 0 < γ < 1. Let β\* = β/(1−γ) ∈ (1/2,1) and let (2.2) hold. Then (4.19) holds. If x\* > ϕ(β\*), then (4.20) holds.*

## 5. Adaptive procedures

We have proposed several classifiers in §§3 and 4. We would like to stress that their definitions do not involve the unknown parameters *a*_{d} and *β*, in contrast to the ‘ideal’ classifiers of Donoho & Jin (in press). We prove the optimality of our classifiers (i.e. the fact that they attain the boundary) under different conditions on *a*_{d} and *β*. This can be easily generalized to obtain an adaptive procedure, i.e. one attaining the classification boundary simultaneously for all the considered domains of *a*_{d} and *β*. It suffices to combine the classifiers as follows. We start with the pre-classifier *ψ*^{pre}. If it outputs ‘no decision’ (ND) and *m* is not too large (say, *m* ≤ 10–30), then we combine the classifiers *ψ*^{lin}, *ψ*^{max} and *ψ**_{m} using the Bonferroni device. In other words, we allocate *Z* to the *P*_{Y} population if and only if it is allocated to *P*_{Y} by at least one of the three classifiers. Analogously, if *m* is large (in practice, *m* > 30), then for *ψ*^{pre} = ND it suffices to classify by the classifier defined in (4.11) or by the classifier of §4*c*.
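The combination rule described above can be sketched as follows (the classifiers themselves are stand-ins: any callables returning 0 or 1, with the pre-classifier also allowed to return 'ND'):

```python
def combine(psi_pre, classifiers, Y, Z):
    """Adaptive combination: follow the pre-classifier unless it abstains ('ND');
    otherwise allocate Z to P_Y iff at least one classifier does (Bonferroni device)."""
    decision = psi_pre(Y, Z)
    if decision != "ND":
        return decision
    return int(any(c(Y, Z) == 1 for c in classifiers))

# toy stand-ins for psi_pre, psi_lin, psi_max, psi*_m
abstain = lambda Y, Z: "ND"
vote_noise = lambda Y, Z: 0
vote_pop = lambda Y, Z: 1
print(combine(abstain, [vote_noise, vote_pop], None, None))    # -> 1
print(combine(abstain, [vote_noise, vote_noise], None, None))  # -> 0
```

The Bonferroni device can only inflate the type I error by the number of combined classifiers, which is fixed, so the adaptive combination inherits the vanishing risk of whichever component is optimal in the given regime.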

## 6. Lower bounds

We now prove the lower bounds on the risks to show that relationships (1.5) and (1.10) determine the classification boundaries for moderately and highly sparse vectors, respectively.

### (a) Lower bound for moderately sparse vectors

The lower bound for moderately sparse vectors is based on the next theorem. For *a* > 0, set ℓ_{a}(*t*) = *f*(*t* − *a*)/*f*(*t*) and *D*_{a} = ∫ ℓ_{a}²(*t*)*f*(*t*) d*t*.

### Theorem 6.1.

*Let either m ≥ 1 be fixed or m → ∞ with m = O(d^{1−2β}). If*
*m* *d*^{1−2β}(*D*_{a_{d}} − 1) → 0 as *d* → ∞, (6.1)
*then successful classification is impossible.*

### Corollary 6.2.

*Let f be the density of the standard normal distribution. If*
*m* *d*^{1−2β}*a*_{d}² → 0 as *d* → ∞, (6.2)
*then successful classification is impossible for β ∈ (0,1/2] and m fixed, or for β ∈ (0,1/2) and m → ∞ such that m = O(d^{1−2β}).*

### Proof.

For standard normal errors, we have *D*_{a} = e^{a²}. Therefore, condition (6.1) can be satisfied only if *a*_{d} → 0 as *d* → ∞. Moreover, in this case
*D*_{a_{d}} − 1 ∼ *a*_{d}². (6.3)
Thus, if *a*_{d} = *o*(1), conditions (6.1) and (6.2) are equivalent. Now, (6.2) and the assumption *β* ∈ (0,1/2] imply that *a*_{d} = *o*(1). This proves the corollary for fixed *m*. Also, if *β* ∈ (0,1/2) and *m* → ∞ such that *m* = *O*(*d*^{1−2β}), then (6.2) again implies *a*_{d} = *o*(1). ▪
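The identity *D*_{a} = e^{a²} used here is easy to check numerically: for the standard normal density *f*, *D*_{a} = ∫ *f*(*t* − *a*)²/*f*(*t*) d*t*. A sketch (the grid and integration range are our arbitrary choices):

```python
import numpy as np

def D(a, n=160_001, lim=8.0):
    """Riemann-sum approximation of D_a = int f(t-a)^2 / f(t) dt,
    with f the standard normal density."""
    t = np.linspace(-lim, lim, n)
    f = lambda x: np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
    return float(np.sum(f(t - a) ** 2 / f(t)) * (t[1] - t[0]))

for a in (0.1, 0.5, 1.0):
    print(a, D(a), np.exp(a * a))   # the two columns agree to numerical precision

# for small a the expansion (6.3) is visible: D_a - 1 is close to a^2
```

The integrand equals e^{a²} times the *N*(2*a*,1) density, which is why the integral reproduces e^{a²} exactly and why *D*_{a} − 1 ∼ *a*² as *a* → 0.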

Note that, for *m* ≍ *d*^{γ} with 0 < *γ* < 1, the condition *m* = *O*(*d*^{1−2β}) in theorem 6.1 becomes *β* ≤ (1 − *γ*)/2.

### Remark.

Relationship (6.3) is valid for a larger class of noise distributions, e.g. for non-Gaussian noise with finite Fisher information. Indeed, assume that ℓ_{a}(*t*) is *L*_{2}(*f*)-differentiable at the point *a* = 0, i.e. there exists a function ℓ̇(*t*) such that
∫(ℓ_{a}(*t*) − 1 − *a*ℓ̇(*t*))²*f*(*t*) d*t* = *o*(*a*²) as *a* → 0. (6.4)
Observe that *I*_{f} = ∫ℓ̇²(*t*)*f*(*t*) d*t* is the Fisher information of *f* (with *f*′ defined in a somewhat stronger sense than, for instance, in Ibragimov & Has’minski (1981)). Under assumption (6.4), we have *D*_{a} − 1 ∼ *I*_{f}*a*² as *a* → 0.

Combining the remarks mentioned earlier with theorems 3.1 and 6.1, we see that relationship (1.5) determines the classification boundary for *β* ∈ (0,1/2] when *m* is fixed or *m* = *O*(*d*^{1−2β}), if the errors have mean 0, finite variance and finite Fisher information.

As corollaries of theorems 3.1 and 6.1, we can establish classification boundaries for particular choices of *a*_{d}. Recall that non-trivial results can be expected only if *a*_{d} satisfies (2.2). For instance, consider *a*_{d} = *d*^{−s} with some *s* > 0. Then, for fixed *m*, the classification boundary in the region *β* ∈ (0,1/2] is given by *s* = 1/2 − *β*, i.e. successful classification is possible if *s* < 1/2 − *β*, and is impossible if *s* > 1/2 − *β*. Other choices of *a*_{d} appear to be less interesting when *β* ∈ (0,1/2]. For example, in the next section we consider the sequence *a*_{d} of the form (1.6) with some *s* > 0. If *a*_{d} is chosen in this way, successful classification is possible for all *β* ∈ (0,1/2] with no exception, so that there is no classification boundary in this range of *β*.

### (b) Lower bound for highly sparse vectors

### Theorem 6.3.

*Let the noise density f be Gaussian with variance σ² > 0. Assume that β ∈ (1/2,1) and x₁ < ϕ(β). Then successful classification is impossible for fixed m and for m → ∞ as d → ∞.*

Though theorem 6.3 is valid with no restriction on *m*, it does not provide a correct classification boundary if *m* is large, i.e. when *m* ≍ *d*^{γ}, as in scenarios (C) and (D). The correct lower bound for large *m* is given in the next theorem.

### Theorem 6.4.

*Consider scenario (C) with β\* = β/(1−γ) ∈ (1/2,1) and with a_{d} parametrized by x > 0 as in §1b. Assume that x\* < ϕ(β\*). Then successful classification is impossible.*

Recall that, by an elementary argument, under scenario (C) and for *a*_{d} as in theorem 6.4, successful classification is impossible if *β*>1−*γ* (cf. remark after (2.1)). This is the reason why in theorem 6.4 we consider only *β*<1−*γ*.

## 7. Sketch of the proofs

### (a) Proofs of upper bounds

Theorem 4.1 follows from the relationships in (7.1), which hold for all *δ* > 0 uniformly in *u* ∈ *U*_{β,a_{d}}.

The proofs of theorems 4.2–4.4 are based on evaluation of the expectations and variances of the statistics under consideration. For theorem 4.2, we bound the expectations, for *x*, *x*_{s} > *b* (*s* = 0,1) and all *d* large enough, uniformly in *u* ∈ *U*_{β,a_{d}} up to a log-factor *A*_{d}. Since the maximum of *t*²/4 − ((*t* − *x*)_{+})²/2 is attained either at *t* = 2*x*, when this point lies in the admissible range, or at the endpoint of that range otherwise, we obtain (7.2). The variances are bounded analogously for *s* = 0,1. Take *N*_{0} > 0 large enough. By Chebyshev’s inequality, for each *l* = 1,…,*N*, each *u* ∈ *U*_{β,a_{d}} and *s* = 0,1, with probability close to 1, the statistics deviate from their expectations by at most a multiple of the corresponding standard deviations, and these inequalities are also valid for Δ(·) instead of Δ_{0}(·). All these inequalities (with Δ(·) and Δ_{0}(·)) hold simultaneously with probability tending to 1 (uniformly in *u* ∈ *U*_{β,a_{d}}). On this event of high probability we can evaluate Δ_{0} and Δ by taking the maxima of the expectations and comparing them with the maxima of the square roots of the variances. Proceeding in this way and using the bounds obtained above, we find the required bounds with *P*^{(u)}-probability tending to 1 as *d* → ∞, and, for *s* = 0,1, with probability tending to 1 as *d* → ∞ (the convergence of all the probabilities is uniform in *u* ∈ *U*_{β,a_{d}}). Using these relationships we get the results of theorem 4.2.

For theorem 4.3, we proceed analogously, first bounding the expectations of the statistics and then, for the two hypotheses in turn, their variances. Using these relationships, similarly to the above, we obtain theorem 4.3.

For theorem 4.4, we evaluate the means and variances of the statistics *L*^{0}(*t*) and *L*^{1}(*t*). Thus, for all *l* = 1,…,*N*, with probability tending to 1, the statistics *L*^{1}(*t*_{l}) belong to the corresponding confidence intervals, and the statistics *L*^{0}(*t*) belong to theirs with probability tending to 1.

Consider the ratios Δ(*t*_{l}), *l* = 1,…,*N*. We check that, for all *l* = 1,…,*N* such that *R*(*t*_{l}) > 4*N*², with probability tending to 1, Δ(*t*_{l}) < 4*N* for *s* = 0, and the corresponding bound holds for *s* = 1. Thus, uniformly over *u* ∈ *U*_{β,a_{d}}, it suffices to show that *s** > τ for some τ > 0. Let us study this quantity assuming that *x** > *ϕ*(*β**). We have the corresponding bounds with a logarithmic factor *A*_{d}. Set *t*_{0} = *x*/2 + *β*/*x*. We check the relationships (7.3), which imply that *s** > τ for some τ > 0 whenever *x** > *ϕ*(*β**). Since *s*(·) is a Lipschitz function, we can replace the maximum over the interval by the maximum over our grid with step *h*, inducing an error of order *O*(*h*). This yields theorem 4.4.

### (b) Proofs of lower bounds

We bound the maximal risk from below by the Bayesian risk corresponding to the mixtures *P*_{H1} of the measures *P*^{(u)} over the product prior measure μ^{d}(d*u*), where μ = (1 − *p*)δ_{0} + *p*δ_{a_{d}}, δ_{a} is the Dirac mass at the point *a* and *p* = *d*^{−β}. We show that μ^{d}(*U*_{β,a_{d}}) → 1. Then it suffices to verify (7.4). Set *L*(**Y**,*Z*) = d*P*_{H1}/d*P*_{H0} and ℓ_{a}(*t*) = *f*(*t* − *a*)/*f*(*t*); the likelihood ratio *L* factorizes over the coordinates. Theorem 6.1 follows from the relationships in (7.5) (with the measure *P*_{H0,k} corresponding to the *k*th coordinate). For the proof of theorem 6.3, without loss of generality consider σ = 1 and introduce suitable truncation events. We show (7.6), repeating calculations similar to (7.5) for the truncated versions of the statistics. The relationships (7.6) then yield the required bounds for the truncated likelihood ratio, and we observe that the resulting exponent is negative under the assumption of theorem 6.3.

For theorem 6.4, in view of the first two lines of (7.5), it suffices to bound the corresponding moments of the likelihood ratio. Take a threshold *H*_{*} such that *p**L*_{1}(*H*_{*}) = 1, i.e. *t* = *x*/2 + *β*/*x*. Then *p**L*(**Y**^{1}) < 1 (respectively, *p**L*(**Y**^{1}) > 1) is equivalent to *SY*^{1} < *H*_{*} (respectively, *SY*^{1} > *H*_{*}). We evaluate the resulting terms under *P* = (1 − *p*)*P*_{0} + *p**P*_{a}, where *a* = *a*_{d} for brevity and *P*_{a} is the Gaussian measure with the density *f*_{a}(·). Evaluating these items, we get, with log-factor *A*_{d}, a condition that is equivalent to *x** < *ϕ*(*β**) for *β** ∈ (1/2,1).

## Acknowledgements

Yu.I.I. was partially supported by the RFBI grant 08-01-00692-a and by grant NSh–638.2008.1. A.B.T. would like to thank the Isaac Newton Institute for Mathematical Sciences in Cambridge for hospitality during the programme Statistical Theory and Methods for Complex, High-Dimensional Data. His research was partially supported by the grants EPSRC 531174, ANR-06-BLAN-0194 and by the PASCAL Network of Excellence. C.P. was partially supported by the grant ANR-07-BLAN-0234 and by PICS–2715.

## Footnotes

One contribution of 11 to a Theme Issue ‘Statistical challenges of high-dimensional data’.

- © 2009 The Royal Society