## Abstract

Considered in the paper is the problem of selecting a diagnostic biomarker that has the highest classification rate among several candidate markers with dichotomous outcomes. The probability of correct selection depends on a number of nuisance parameters from the joint distribution of the biomarkers and thus can be substantially affected if these nuisance parameters are misspecified. A two-stage procedure is proposed to compute the needed sample size that achieves the desired level of correct selection, as so confirmed by simulation results.

## 1. Introduction

In the area of diagnostic medicine, a biomarker is used to identify subjects that have a certain condition, e.g. a disease or risk status for the disease, so that proper treatments can be followed (Zhou *et al*. 2002). Often, multiple candidate biomarkers are developed simultaneously and medical investigators are interested in selecting the one with the highest diagnostic accuracy among these newly identified biomarkers so that future studies can be focused on the selected biomarker.

One approach to achieving this goal is as follows. A number of subjects are randomly selected from a population with the condition and one without the condition, hereafter referred to as ‘diseased’ and ‘non-diseased’ populations, respectively. The candidate biomarkers are then tested on each subject and their diagnostic accuracy is evaluated from the test results. The biomarker that yields the highest (estimated) accuracy will then be identified as the best biomarker.

A key feature associated with such a selection process is the probability of correct selection (PCS), that is, the probability that the selected biomarker indeed has the highest diagnostic accuracy. The number of diseased and non-diseased subjects needs to be sufficiently large so that the PCS can be maintained at a certain desired level. Since the biomarkers are tested on the same subject, their test outcomes from the same subject are correlated, following a multivariate distribution. Thus, the PCS depends on a number of parameters from the joint distribution which have to be specified in order to have an estimate of the sample size. Unfortunately, however, the PCS can be substantially adversely affected if these parameters are not correctly quantified.

To relax the dependence of the PCS on these parameters, we propose a two-stage procedure. First, we randomly select a number of diseased and non-diseased subjects and use the test results from these subjects to estimate the corresponding parameters in the PCS function. Then, the total number of diseased and non-diseased subjects is computed using these estimates that achieve the desired PCS level when the diagnostic accuracy of the best biomarker differs from the other biomarkers by a specified amount. Simulation results are presented to demonstrate the effectiveness of the proposed procedure in achieving the PCS. This paper ends with some discussions in §6.

## 2. Formulation of the problem

Suppose there are *K* candidate biomarkers under consideration, and each yields a binary test result on a subject, 1 or 0, representing the subject being classified as diseased or non-diseased, respectively, by the biomarker. To assess their diagnostic accuracy, these biomarkers are tested on a random sample of *m* diseased and *n* non-diseased subjects. For the *k*th biomarker, its test result is denoted by *X*_{ik} from the *i*th diseased subject and *Y*_{jk} from the *j*th non-diseased subject. With these notations, the sensitivity (the probability of correctly classifying a diseased subject) of the *k*th biomarker is and can be estimated by . Its specificity (the probability of correctly classifying a non-diseased subject) is , with and can be estimated by , with . The classification rate *θ*_{k} of the *k*th biomarker is the sum of the two quantities, i.e. , and can be estimated subsequently by . The best biomarker is the one with the largest *θ*, and the one that yields the largest *θ*-estimate will be chosen as the ‘best’ biomarker. Selection error occurs when the largest *θ*-estimate is not from the biomarker with the largest *θ*.

Without loss of generality, we assume that the first biomarker is the best among the candidate markers, i.e. . The *l*th biomarker is selected as the best if Therefore, the PCS is

The key issue is to determine *m* and *n* so that the PCS will be at least *γ*, a specified level of probability, under certain parametric configuration of the *θ*s. Usually we require that *ρ*≥*γ* when , where *Δ*>0 is also a prespecified constant. For sufficiently large *m* and *n*, follows asymptotically a multivariate normal distribution. Thus if *ρ*=*γ* for , then *ρ*≥*γ* for . Therefore, the sample sizes *m* and *n* are computed by solving the equationIn order to compute the sample sizes, parameters other than *Δ* which appear in the equation are given values based on ‘educational guess’ or other data sources if possible.

## 3. Computation of PCS

The test outcomes of the *K* biomarkers from a diseased or non-diseased subject follow a multinomial distribution. Exact calculation of the PCS involves the joint probabilities of {*X*_{i1}=0 or 1, …, *X*_{iK}=0 or 1} and {*Y*_{j1}=0 or 1, …, *Y*_{jK}=0 or 1}, which becomes extremely complicated when *K* is relatively large. Instead, we may use normal approximation to relax the computational complexity, assuming that the sample sizes are relatively large.

Write and . Then by classical asymptotic theory (Bickel & Doksum 1977), follows asymptotically a multivariate normal distribution with mean vector *θ* and a variance-covariance matrix given byandwhere .

Define and , for , and write and . Then is also normally distributed with mean vector *δ* and variance-covariance matrix , whereand

Let be the multivariate normal distribution function with mean *δ* and variance-covariance matrix *Σ*, then the PCS is approximately(3.1)

The sample sizes *m* and *n* are computed from the equation under the constraints that *δ*_{k}=*Δ* for all .

## 4. A two-stage procedure

The asymptotic formula (3.1) of the PCS involves *K*(*K*+1) parameters from the multinomial distributions. With the *K*−1 constraints *δ*_{k}=*Δ*, there are *K*^{2}+1 distributional parameters (referred to as nuisance parameters) left to be specified in order to solve for sample sizes, assuming that *m*/*n*=*λ* is a known constant. If these nuisance parameters are misspecified, then the PCS can be substantially adversely affected, resulting in an unacceptably high level of selection error; see §5 for numerical demonstration. In order to maintain the desired level of the PCS, we propose, following Stein's (1945) idea, a two-stage selection procedure to compute the required sample size *N*=*m*+*n*=(1+*λ*)*n* with ratio *λ*=*m*/*n* fixed.

*Step 1*. Randomly select *m*_{1} diseased and *n*_{1} non-diseased subjects from the target populations and test the biomarkers on these subjects. Use the test outcomes *X*_{ik}, *i*=1, …, *m*_{1} and *Y*_{jk}, *j*=1, …, *n*_{1} to obtain empirical estimates of the nuisance parameters, given as follows:(4.1)

(4.2)

*Step 2*. Plug these estimates into the PCS formula and then solve for sample size *N*′ from the equation *ρ*=*γ*. The required sample size *N* is then set to max(*N*_{1}, *N*′), with *N*_{1}=*m*_{1}+*n*_{1}.

The sample size *N* computed in this way is no longer a fixed quantity. Rather, it is a random variable, a statistic of the data from the first stage. Note that the estimates in (4.1) and (4.2) are consistent estimators of the corresponding parameters. For sufficient sample size *N*_{1}, the PCS is expected to be approximately *γ* regardless of the true values of the nuisance parameters; see simulation results below.

## 5. Numerical evaluation

Numerical studies are conducted to investigate the effects of the nuisance parameters on PCS and the performance of the proposed two-stage procedure in maintaining the level of PCS. To reduce the computational burden, we use three biomarkers (*K*=3) for illustration.

Table 1 presents the values of PCS under various configurations of the parameter space which determine the joint distribution of the three biomarkers, that is, the joint probability of and , for *i*, *j*, *k*=0, 1. These joint probabilities thus give the marginal probabilities, for example,and so on.

All configurations in table 1 correspond to a fixed value of 0.1 for . The first row in the table specifies the parameters used to compute the sample size, for a single-stage procedure, assuming that *λ*=*m/n*=1. It turns out that, with this configuration, 102 diseased and 102 non-diseased subjects are needed to achieve a PCS of at least 0.95.

The *ρ*-values, computed using formula (3.1), in the remaining rows of table 1 are the PCS with 102 diseased and 102 non-diseased subjects under other specifications of the parameter values. We see that the PCS can be substantially lower than the desired level of 0.95 if the parameters differ from the specified values in the first row of the table. For example, if the corresponding parameters from the joint distribution of the three biomarkers take values from the second row rather than as being specified in the first row of the table, then the PCS can be decreased by more than 20%!

To investigate whether the proposed two-stage procedure can relax the dependence on nuisance parameters and maintain the PCS at the desired level, we conducted 10 000 simulations for each configuration of the parameter space in table 1. The simulated values of PCS are presented as *ρ*^{*} in the last column of the table. For the first-stage sample size, we used *m*_{1}=75 diseased and *n*_{1}=75 non-diseased subjects to estimate the corresponding nuisance parameters.

The results in table 1 clearly show that the PCS is maintained at the desired level of 0.95, regardless of the parametric configuration under which the data are generated. Thus the two-stage procedure is proved to be efficient in controlling selection error rates. Numerical results (not shown here) for other configurations and ratios of *λ* reveal similar findings. With such a multidimensional parameter space, however, it is difficult to characterize any configurations for which the two-stage procedure would perform relatively better.

## 6. Discussion

In this paper, we proposed a two-stage procedure to select the best diagnostic biomarker among a number of candidate markers, and numerically demonstrated its effectiveness in controlling the selection error. While there is no universal standard on how the first-stage sample size is determined, it should be relatively large so that accurate estimates of the nuisance parameters can be obtained. On the other hand, it should not be too large when taking the cost effectiveness of a study into consideration.

We confine our attention to biomarkers with dichotomous outcomes. However, the two-stage approach proposed in this paper can be easily extended to biomarkers with ordinal or continuous test results. In these two cases, an appealing selection criterion is to select the marker that has the largest area under its receiver operating characteristic curve (Hanley 1989).

Various selection problems and related procedures have been discussed extensively in the statistical literature (see Bechhofer *et al*. 1968; Gibbons 1988), but mostly focusing on multivariate normal distributions. Applications of the selection concepts and theory to the area of diagnostic medicine are relatively new, and many other issues remain to be solved such as selecting a fixed number of best diagnostic biomarkers and selecting the best biomarker (or all biomarkers) that are better than a standard biomarker. Further research efforts are much needed along these lines.

Throughout this paper we assume that the data will be available for all biomarkers from each subject. Very often values on some biomarkers may be missing. One possible approach to dealing with this situation is to use multiple imputation based on the multinomial distributions and then apply the proposed method to the imputed data. The effectiveness of this approach in controlling the PCS needs further investigation. Moreover, selection bias may occur when the subjects are not randomly selected, and correction for such bias in estimating the PCS is desirable.

## Acknowledgments

This research was supported by the Intramural Research Program of the National Institute of Child Health and Human Development, National Institutes of Health. The opinions expressed in this article are those of the authors and not of the National Institutes of Health. The authors thank Helen Guo from the University of Virginia for help with simulation.

## Footnotes

Electronic supplementary material is available at http://dx.doi.org/10.1098/rsta.2008.0032 or via http://journals.royalsociety.org.

One contribution of 13 to a Theme Issue ‘Mathematical and statistical methods for diagnoses and therapies’.

- © 2008 The Royal Society