## Abstract

When hypotheses concerning the sensitivity and specificity of a binary medical diagnostic test are simultaneously tested using a group sequential procedure, constructing point and interval estimates of the parameters is challenging because there is no unique way to order sample points in the two-dimensional space. In this paper, we compare the bias and mean squared error of the maximum-likelihood and Rao–Blackwell unbiased estimators of sensitivity and specificity upon termination of a group sequential procedure. Confidence intervals (CIs) for the two parameters are constructed using the normal approximation and Woodroofe's pivot method, based on maximum-likelihood and Rao–Blackwell unbiased estimates. The coverage probability and the expected length of the CIs are compared by simulation studies.

## 1. Introduction

Accurate diagnosis of a medical condition is often the first step towards cure of the disease. Sensitivity (se) and specificity (sp), the probabilities of correctly classifying a true case of disease as diseased and a non-case as non-diseased, respectively, are two important measures for evaluating a medical diagnostic test (Zhou *et al*. 2002). Although medical test results are frequently measured on a continuous scale or in ordinal categories, they can always be dichotomized into binary outcomes by choosing a threshold value; we focus on such binary outcomes here. In assessing the performance of a diagnostic test, we want to know whether the test results differentiate the two health states, namely presence and absence of the medical condition.
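As a concrete illustration of the two definitions, the following minimal sketch computes empirical sensitivity and specificity from the four cells of a 2×2 table of test result against true disease status. The function and argument names are hypothetical and chosen here only for illustration.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from 2x2 classification counts.

    Hypothetical helper for illustration; the argument names are assumptions.
    """
    n_diseased = true_pos + false_neg        # all truly diseased subjects
    n_non_diseased = true_neg + false_pos    # all truly non-diseased subjects
    if n_diseased == 0 or n_non_diseased == 0:
        raise ValueError("both groups must contain at least one subject")
    return true_pos / n_diseased, true_neg / n_non_diseased


# Example: 26 of 32 diseased and 27 of 32 non-diseased subjects are
# classified correctly by the dichotomized test.
se_hat, sp_hat = sensitivity_specificity(26, 6, 27, 5)
```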

Group sequential testing procedures have been considered to evaluate a diagnostic test and to estimate parameters of accuracy. Obuchowski & Zhou (2002) considered two-stage designs in which all subjects tested positive in the first stage; a random sample of test negatives is selected for disease verification in the second stage to reduce the costs of prospective studies of binary diagnostic tests. Wruck *et al*. (2006) proposed estimators based on a two-step sequential sampling scheme in a large cohort study, in which the first step of sampling is conducted to efficiently estimate specificity, and the second step is to estimate sensitivity if the candidate test is determined to have sufficiently high specificity. The sensitivity and the specificity have also been evaluated simultaneously in a group sequential design with a case-control sampling method under the criterion of minimizing the expected sample size when the candidate diagnostic test performs poorly (Shu *et al*. 2007).

In medical studies, a more complete analysis is usually required than simply the ‘accept’ or ‘reject’ decision of a hypothesis. Medical investigators are interested in how accurately a medical test can distinguish diseased from disease-free subjects in the general population. An interval estimate for the measures of accuracy will allow consideration of the magnitude of the accuracy and indicate its practical importance, irrespective of whether a hypothesis test declares statistical significance.

For sequential sampling with Bernoulli trials, Girshick *et al*. (1946) and Lehmann & Stein (1950) considered unbiased estimation of the success probability and further found necessary and sufficient conditions for completeness, enabling conclusions about uniformly minimum variance unbiased estimators. Liu *et al*. (2006) considered unbiased estimation following a group sequential test for distributions in a one-parameter exponential family. Jennison & Turnbull (1983) described a method for constructing CIs for a binomial parameter following a multistage test. The method involves defining an ordering among points on the stopping boundaries and computing the probability of a point being more extreme in this ordering than the observed one. Whether a given ordering of the sample space is reasonable depends on the particular stopping rule used in the sequential testing. Todd & Whitehead (1997) extended Woodroofe's (1992) pivot technique to provide CIs for the absolute success probability of two treatments in a clinical trial.

In this paper, we consider the joint inference of sensitivity and specificity following a group sequential test. In §2, we derive the distribution theory of the empirical sensitivity and specificity upon termination of the group sequential testing. In §3, the maximum-likelihood and Rao–Blackwell unbiased estimates are obtained and compared in terms of their bias and mean squared error. Adopting Woodroofe's pivot method, approximate CIs of sensitivity and specificity are constructed in §4. Simulation results are presented to compare the performance of different CIs. Some discussions are given in §5.

## 2. Distribution theory

Suppose that the hypotheses concerning the sensitivity and specificity of a binary diagnostic test are

$$H_0: \boldsymbol{\theta} \in \Theta_0 \quad \text{versus} \quad H_1: \boldsymbol{\theta} \in \Theta_1, \tag{2.1}$$

where $\boldsymbol{\theta} = (\mathrm{se}, \mathrm{sp})^{T}$ is a column vector (the superscript 'T' denotes the transpose of a vector), and $\Theta_i$, $i = 0, 1$, are the parameter spaces under $H_i$, $i = 0, 1$. The multistage testing procedure with $K$ interim analyses is as follows. At the $k$th stage, $1 \le k < K$, if $S_k$ falls in the stopping region, stop the sampling and reject $H_0$; otherwise, if $S_k$ falls in the continuation region, continue the sampling to the $(k+1)$th stage. Here $S_k$ is a two-dimensional vector whose two elements represent, respectively, the cumulative numbers of correctly classified diseased and non-diseased subjects up to stage $k$, and the stopping and continuation regions are specified separately for each stage $k$.

Upon termination of the sequential testing, we observe two statistics: $M$, the (random) number of analyses performed, and $S_M$. Since, for each fixed $\boldsymbol{n}$, $S_{\boldsymbol{n}}$ is sufficient for $\boldsymbol{\theta}$, the statistic $(M, S_M)$ is jointly sufficient for $\boldsymbol{\theta}$ with respect to the sample space of the sequential procedure. Thus, estimation of $\boldsymbol{\theta}$ can be based solely on $(M, S_M)$.

Denote by '+' and '−' the diseased and non-diseased groups, respectively, by $\boldsymbol{n}_k = (n_k^{+}, n_k^{-})^{T}$ the vector representing the cumulative sample sizes (for diseased and non-diseased) up to stage $k$, and by $\boldsymbol{n}_k - \boldsymbol{n}_{k-1}$ (with $\boldsymbol{n}_0 = \boldsymbol{0}$) the vector representing the incremental sample sizes (for diseased and non-diseased) at stage $k$. To obtain the joint distribution function of $(M, S_M)$, we first note that the increments $X_1 = S_1$, $X_2 = S_2 - S_1$, …, $X_k = S_k - S_{k-1}$ are independently distributed, with joint density given by equation (2.2) as a product of binomial terms, where $b(a, m, p)$ is the binomial density function with $a$ successes in $m$ independent trials with success probability $p$.

Using this density function, we can easily obtain the density function of the sufficient statistic $(M, S_M)$ at $(k, \boldsymbol{s})$, equation (2.3), whose terms are defined directly for $k = 1$ and through equation (2.4) for $k > 1$. In these expressions, $\boldsymbol{n} = (n^{+}, n^{-})^{T}$, $\boldsymbol{s} = (S^{+}, S^{-})^{T}$, and $\binom{n}{m}$ is the combinatorial number of choosing $m$ from $n$.
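As a sketch only, the form that such densities take for independent binomial increments with stage-wise stopping can be written out as follows. The symbols $m_k$ (incremental sample sizes), $\mathcal{C}_k$ (continuation region at stage $k$) and $C_k(s)$ (a path-counting coefficient) are notation assumed here for illustration, with $n_k$ read as cumulative sample sizes; they need not match equations (2.2) to (2.4) symbol for symbol. A binomial coefficient is taken to be zero when its lower index is negative or exceeds the upper index.

```latex
% Sketch under assumed notation (not taken from the source):
%   m_k = n_k - n_{k-1}     incremental sample sizes at stage k (n_0 = 0)
%   \mathcal{C}_{k}         continuation region at stage k
%   C_k(s)                  path-counting coefficient at stage k
\begin{aligned}
% Joint density of the independent binomial increments x_1, ..., x_k
&\prod_{j=1}^{k} b(x_j^{+}, m_j^{+}, \mathrm{se})\, b(x_j^{-}, m_j^{-}, \mathrm{sp}), \\
% Density of (M, S_M) at a terminal point (k, s)
&P_{\theta}(M = k,\ S_M = s) = C_k(s)\,
  \mathrm{se}^{s^{+}} (1-\mathrm{se})^{n_k^{+}-s^{+}}\,
  \mathrm{sp}^{s^{-}} (1-\mathrm{sp})^{n_k^{-}-s^{-}}, \\
% Coefficients: direct at stage 1, recursive afterwards
&C_1(s) = \binom{n_1^{+}}{s^{+}} \binom{n_1^{-}}{s^{-}}, \qquad
C_k(s) = \sum_{t \in \mathcal{C}_{k-1}} C_{k-1}(t)
  \binom{m_k^{+}}{s^{+}-t^{+}} \binom{m_k^{-}}{s^{-}-t^{-}} \quad (k > 1).
\end{aligned}
```

The parameters enter only through the powers of se and sp, while the coefficient counts the sample paths that reach $s$ without having stopped earlier.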

Formula (2.3) provides the basis for computing the expectation, equation (2.5), and the variance, equation (2.6), of any statistic that is a function of $(M, S_M)$.
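The computation described above can be sketched by direct enumeration for a two-stage design. The sketch below is illustrative rather than the authors' implementation: the function names are hypothetical, the default sample sizes and stage 1 boundaries are those of the two-stage example quoted in §3(c), and it is assumed that sampling continues only when both stage 1 counts strictly exceed their boundaries.

```python
from math import comb


def binom_pmf(a, m, p):
    """b(a, m, p): probability of a successes in m independent trials."""
    return comb(m, a) * p**a * (1 - p) ** (m - a)


def joint_distribution(se, sp, n1=(14, 12), n2=(32, 32), bound1=(10, 9)):
    """Return {(k, s_plus, s_minus): probability} for a two-stage design.

    Assumptions: n1 and n2 are cumulative (diseased, non-diseased) sample
    sizes, and sampling continues to stage 2 only if both stage 1 counts
    strictly exceed bound1.
    """
    m_pos, m_neg = n2[0] - n1[0], n2[1] - n1[1]   # stage 2 increments
    dist = {}
    for t_pos in range(n1[0] + 1):
        for t_neg in range(n1[1] + 1):
            p1 = binom_pmf(t_pos, n1[0], se) * binom_pmf(t_neg, n1[1], sp)
            if t_pos > bound1[0] and t_neg > bound1[1]:
                # Continue: spread p1 over all possible stage 2 increments.
                for x_pos in range(m_pos + 1):
                    for x_neg in range(m_neg + 1):
                        p2 = (binom_pmf(x_pos, m_pos, se)
                              * binom_pmf(x_neg, m_neg, sp))
                        key = (2, t_pos + x_pos, t_neg + x_neg)
                        dist[key] = dist.get(key, 0.0) + p1 * p2
            else:
                dist[(1, t_pos, t_neg)] = p1      # stop at stage 1
    return dist


def mean_and_variance(statistic, dist):
    """Expectation and variance of statistic(k, s_plus, s_minus)."""
    mean = sum(statistic(*point) * p for point, p in dist.items())
    second = sum(statistic(*point) ** 2 * p for point, p in dist.items())
    return mean, second - mean**2


dist = joint_distribution(se=0.8, sp=0.85)
total_probability = sum(dist.values())
# Example statistic: the number of analyses performed, M.
mean_m, var_m = mean_and_variance(lambda k, s_pos, s_neg: k, dist)
```

Any estimator that is a function of the terminal point can be passed as `statistic`, for instance `lambda k, s_pos, s_neg: s_pos / (14 if k == 1 else 32)` for the empirical sensitivity under the assumed design.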

## 3. Point estimation

### (a) Maximum-likelihood estimates

Suppose that a null hypothesis concerning the value of a parameter $\boldsymbol{\theta}$ is tested using a sequential procedure, and the maximum-likelihood estimator of $\boldsymbol{\theta}$ is calculated from the data collected. Its value will not be altered by the fact that the test was conducted sequentially; however, the distribution of the estimator will be affected in a substantial way. Estimators that are unbiased or almost unbiased in fixed size sampling usually yield substantial bias if used with sequential sampling.

The likelihood function of $\boldsymbol{\theta}$ is given by equation (3.1), in which only $f_{M,\boldsymbol{\theta}}(\boldsymbol{s})$ involves the parameters. Maximizing the likelihood function yields the observed proportions of correctly classified subjects at termination, $\widehat{\mathrm{se}} = S_M^{+}/n_M^{+}$ and $\widehat{\mathrm{sp}} = S_M^{-}/n_M^{-}$ (equation (3.2)), where $n_M^{+}$ and $n_M^{-}$ denote the numbers of diseased and non-diseased subjects sampled by termination. Thus, the values of the maximum-likelihood estimates of sensitivity and specificity will not be altered by the test being conducted sequentially.

However, these estimators are no longer unbiased. The expectations of the maximum-likelihood estimators in two-stage designs are given by equations (3.3) and (3.4). Further simplification of these equations depends on the stopping and continuation regions at each stage. Unlike in fixed size sampling, the biases of the maximum-likelihood estimates are functions of both sensitivity and specificity, which is understandable since the stopping of the sampling depends on both parameters.

The variances of the two maximum-likelihood estimators ($i = 1, 2$) for a two-stage design can be computed using (2.6).

### (b) Rao–Blackwell unbiased estimates

Consider the unique unbiased estimator of $\eta$ based on the sufficient statistic, which coincides with the maximum-likelihood estimator, for non-sequential sampling with the first-stage sample size. The Rao–Blackwell estimator, equation (3.5), is the conditional expectation of this estimator given $(M, S_M)$; it is unbiased for $\eta$ and has reduced variance. The Rao–Blackwell estimator can be expressed recursively, with one expression for the first stage and another for $k > 1$. In a two-stage design, the Rao–Blackwell unbiased estimator at the second stage is a weighted average of the first-stage estimator, equation (3.6). The variance of the Rao–Blackwell estimator can be computed using (2.6).
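The weighted-average form can be made concrete for sensitivity in a two-stage design. The following is a minimal sketch, not the authors' code: the design constants are those of the example quoted in §3(c), the continuation rule (both stage 1 counts strictly exceed their boundaries) is an assumption, and the function names are hypothetical. The weights are the numbers of sample paths through each admissible stage 1 count, and the final line checks unbiasedness by enumeration.

```python
from math import comb

# Assumed two-stage design: cumulative (diseased, non-diseased) sample sizes
# and stage 1 boundaries; continue only if both counts strictly exceed BOUND1.
N1, N2, BOUND1 = (14, 12), (32, 32), (10, 9)


def binom_pmf(a, m, p):
    """Probability of a successes in m independent trials."""
    return comb(m, a) * p**a * (1 - p) ** (m - a)


def rb_sensitivity(k, s_pos):
    """Rao-Blackwell estimate of sensitivity at terminal point (k, s_pos).

    This is the conditional expectation of the stage 1 proportion given
    (M, S_M). Because the assumed continuation region is rectangular, the
    non-diseased counts cancel from the stage 2 weights.
    """
    if k == 1:
        return s_pos / N1[0]
    m = N2[0] - N1[0]
    num = den = 0.0
    for t in range(BOUND1[0] + 1, N1[0] + 1):     # stage 1 counts that continue
        x = s_pos - t                             # implied stage 2 increment
        if 0 <= x <= m:
            weight = comb(N1[0], t) * comb(m, x)  # number of paths through t
            num += weight * t / N1[0]
            den += weight
    return num / den


def expected_value(estimator, se, sp):
    """E[estimator(M, S_M^+)] by enumerating the two-stage sample space."""
    # Probability that the non-diseased count permits continuation.
    q = sum(binom_pmf(t, N1[1], sp) for t in range(BOUND1[1] + 1, N1[1] + 1))
    m = N2[0] - N1[0]
    total = 0.0
    for t in range(N1[0] + 1):
        p_t = binom_pmf(t, N1[0], se)
        if t > BOUND1[0]:
            total += p_t * (1 - q) * estimator(1, t)   # stopped by specificity
            for x in range(m + 1):
                total += p_t * q * binom_pmf(x, m, se) * estimator(2, t + x)
        else:
            total += p_t * estimator(1, t)             # stopped by sensitivity
    return total


rb_mean = expected_value(rb_sensitivity, se=0.8, sp=0.85)  # equals se up to rounding
```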

### (c) Comparison of estimators

Shu *et al*. (2007) proposed an optimal two-stage test to evaluate sensitivity and specificity of a binary medical test simultaneously. The objective was to minimize the expected sample size when the accuracies of a medical test are below the desired level. The design assigns a first-stage stopping boundary to each of the two test statistics. If both test statistics exceed their stopping boundaries, then the process continues to the second stage; otherwise it stops for low sensitivity or specificity.

Suppose that the values of the parameters under the null hypothesis are se_{0}=0.7 and sp_{0}=0.75, that the power of the test is evaluated at a specified alternative, and that the probabilities of type I and II errors are *α*=0.05 and *β*=0.1, respectively. The optimal two-stage design requires 14 diseased and 12 non-diseased subjects in stage 1, and a total of 32 subjects per group at the end of stage 2. The null hypothesis is rejected only when both test statistics exceed (10, 9) in stage 1 and exceed (25, 27) in stage 2 (see Shu *et al*. 2007). It is clear that the stopping and continuation regions depend on both test statistics. Consequently, the bias, variance and mean squared error of the estimates are functions of both sensitivity and specificity. Figure 1 shows the bias of the maximum-likelihood estimate in this optimal two-stage design.

Firstly, we observe that the negative biases shown in figure 1 indicate that the maximum-likelihood estimators underestimate both sensitivity and specificity. This negativity of the bias is attributed to the design features, where only one-sided lower stopping boundaries are being used, and the sequential testing only stops early under the null hypothesis. (Liu *et al*. (2005) noted similar patterns in the bias of the maximum-likelihood estimators for certain optimal two-stage designs for phase II clinical trials.) Secondly, figure 1 reveals a consistent trend that the magnitude of the bias of the maximum-likelihood estimator of one parameter (sensitivity or specificity) increases as the value of the other parameter increases. For example, in figure 1*a*, the maximum absolute bias of the maximum-likelihood estimator of sensitivity is close to 0.025 when the true value of specificity is 0.9, and decreases to 0.015 when specificity is 0.8; the bias is negligible when specificity is less than 0.5. Thirdly, the absolute bias appears to reach its maximum at around the values of the parameters under the null hypothesis (se_{0}=0.7 and sp_{0}=0.75).
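The dependence of the bias on the other parameter can be checked numerically. The sketch below computes the exact bias of the maximum-likelihood estimator of sensitivity by enumeration for the design quoted above (14 and 12 subjects in stage 1, 32 per group in total, stage 1 boundaries (10, 9)). It assumes that "exceed" means strictly greater than the boundary; the function names are hypothetical.

```python
from math import comb

# Design quoted in the text; the strict-inequality continuation rule is an
# assumption made for this sketch.
N1, N2, BOUND1 = (14, 12), (32, 32), (10, 9)


def binom_pmf(a, m, p):
    """Probability of a successes in m independent trials."""
    return comb(m, a) * p**a * (1 - p) ** (m - a)


def mle_sensitivity_bias(se, sp):
    """Exact bias of S_M^+ / n_M^+ under the assumed two-stage rule."""
    # Probability that the non-diseased count permits continuation.
    q = sum(binom_pmf(t, N1[1], sp) for t in range(BOUND1[1] + 1, N1[1] + 1))
    m = N2[0] - N1[0]
    expectation = 0.0
    for t in range(N1[0] + 1):
        p_t = binom_pmf(t, N1[0], se)
        p_continue = q if t > BOUND1[0] else 0.0
        # Contribution from stopping at stage 1 with estimate t / n1.
        expectation += p_t * (1 - p_continue) * t / N1[0]
        # Contribution from continuing, with estimate (t + x) / n2.
        for x in range(m + 1):
            expectation += (p_t * p_continue * binom_pmf(x, m, se)
                            * (t + x) / N2[0])
    return expectation - se


# Bias at the null sensitivity 0.7 for three values of specificity.
biases = {sp: mle_sensitivity_bias(0.7, sp) for sp in (0.5, 0.8, 0.9)}
```

Under these assumptions the bias is negative at every setting and its magnitude grows with specificity, because a higher specificity makes continuation to stage 2 more likely.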

Figure 2 shows the square root of the mean squared error of the maximum-likelihood and the Rao–Blackwell unbiased estimators of sensitivity and specificity in the optimal two-stage design. The difference in mean squared error between the maximum-likelihood and the Rao–Blackwell estimators is more obvious when the parameters of interest have relatively large values. The value of the other parameter has more effect on the mean squared error of the maximum-likelihood estimator of one parameter than on that of the Rao–Blackwell estimator of the same parameter. Although the curves in figure 2*a*,*b* show that the mean squared error of the maximum-likelihood estimator of sensitivity decreases more than that of the Rao–Blackwell estimator when specificity increases, the reduction is too small to be attractive once the bias of the maximum-likelihood estimator is taken into account. The same pattern is observed when comparing figure 2*c*,*d* for specificity.

## 4. CIs for sensitivity and specificity

### (a) CIs based on pivotal quantities

Woodroofe (1992) proposed a simple approach for the problem of estimating a normal mean *θ* following a truncated sequential probability ratio test and provided a useful method for constructing confidence bounds and intervals based on an approximate pivot. The CIs of sensitivity and specificity can also be obtained using this approximate pivot method. The method is based on constructing an approximately normally distributed pivotal quantity and then employs traditional probability arguments.

Continuing to use the notation of the previous sections, let $W_1(\mathrm{se})$ and $W_2(\mathrm{sp})$ denote the pivots for sensitivity and specificity. In group sequential designs, the sample sizes at termination are random variables that depend on the test statistics in previous stages. Hence, $W_1(\mathrm{se})$ and $W_2(\mathrm{sp})$ are no longer normally distributed with zero mean and unit variance. Without loss of generality, we consider the CI of sensitivity. Let $\delta(\mathrm{se})$ and $\sigma(\mathrm{se})$ be the mean and standard deviation functions of $W_1(\mathrm{se})$, respectively. The standardized pivot is then defined as

$$\frac{W_1(\mathrm{se}) - \delta(\mathrm{se})}{\sigma(\mathrm{se})}. \tag{4.1}$$

When the sample size is sufficiently large, this quantity, which has zero mean and unit standard deviation, is treated as approximately following the standard normal distribution. An approximate $(1-\alpha)$-level CI for sensitivity, equation (4.2), is then obtained by replacing $\delta(\mathrm{se})$ and $\sigma(\mathrm{se})$ with proper estimates. Because both functions are determined by the distribution of the sample points $\boldsymbol{s} = (s^{+}, s^{-})$, estimates of $\delta(\mathrm{se})$ and $\sigma(\mathrm{se})$ can be obtained by plugging in estimates of the parameters.

Working along similar lines, an approximate (1−*α*)-level CI for specificity by Woodroofe's pivot method is given by equation (4.3).

### (b) Coverage probabilities and expected length of CIs

Because the stopping boundaries in sequential designs are constructed on a two-dimensional test statistic, CIs for sensitivity and specificity in equations (4.2) and (4.3), respectively, cannot be expressed by a single parameter.

Tables 1 and 2 show the coverage probability and the expected length of the CIs for sensitivity and specificity, respectively, based on a simulation study of 5000 iterations for five optimal two-stage designs. As defined in §3, each design is described by its stopping boundaries in stage 1, the required numbers of diseased and non-diseased subjects in stage 1, and *n*_{2}, the number of subjects per group at the end of stage 2 with equal allocation between the two groups. For each two-stage design, three sets of the underlying true values of the parameters are specified in the second column. The CIs constructed by Woodroofe's pivot method are based on different plug-in estimators, i.e. W.MLE for the maximum-likelihood estimators and W.RB for the Rao–Blackwell unbiased estimators. These CIs are compared with the conventional 100(1−*α*)% CIs based on the maximum-likelihood estimates and the normal approximation, equation (4.4).

In tables 1 and 2, the coverage probabilities of the CIs fall below the nominal level for small sample sizes, and they decrease further as the true values of the parameters approach 1.0. This is expected, because approximating a binomial distribution by a normal distribution becomes inaccurate when the binomial parameter *p* is close to 0 or 1. Despite coverage probabilities below the nominal level, the CIs from Woodroofe's pivot method using the maximum-likelihood and the Rao–Blackwell estimators tend to be shorter than the normal-approximation intervals based on the maximum-likelihood estimates, while having competitive coverage probabilities.
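A simulation of this kind can be sketched for the conventional normal-approximation interval alone. The code below is a minimal illustration, not a reproduction of tables 1 and 2: it uses the two-stage example design from §3(c), assumes that sampling continues only when both stage 1 counts strictly exceed their boundaries, and uses hypothetical function and parameter names.

```python
import random
from math import sqrt


def simulate_wald_ci(se, sp, n_iter=5000, seed=1, z=1.959964,
                     n1=(14, 12), n2=(32, 32), bound1=(10, 9)):
    """Monte Carlo coverage and mean length of the normal-approximation CI
    for sensitivity after an assumed two-stage design."""
    rng = random.Random(seed)

    def draw(m, p):
        # Binomial(m, p) draw as a sum of Bernoulli trials.
        return sum(rng.random() < p for _ in range(m))

    covered = 0
    total_length = 0.0
    for _ in range(n_iter):
        s_pos, s_neg = draw(n1[0], se), draw(n1[1], sp)
        n_pos = n1[0]
        if s_pos > bound1[0] and s_neg > bound1[1]:
            # Continue to stage 2; only the diseased group is needed here.
            s_pos += draw(n2[0] - n1[0], se)
            n_pos = n2[0]
        est = s_pos / n_pos                       # maximum-likelihood estimate
        half = z * sqrt(est * (1 - est) / n_pos)  # fixed-sample standard error
        covered += (est - half <= se <= est + half)
        total_length += 2 * half
    return covered / n_iter, total_length / n_iter


coverage, mean_length = simulate_wald_ci(se=0.9, sp=0.9)
```

With a true sensitivity close to 1, the estimated coverage from this sketch falls below the nominal 95% level, in line with the pattern described above.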

## 5. Discussion

We evaluated the bias and mean squared error of the point estimates of sensitivity and specificity in group sequential designs, and demonstrated the difference between maximum-likelihood and Rao–Blackwell unbiased estimates using an optimal two-stage design proposed by Shu *et al*. (2007). Although the bias of the maximum-likelihood estimates is relatively small in two-stage designs, it is expected to increase as the number of interim analyses increases.

Whitehead (1986) suggests adjusting the maximum-likelihood estimate by subtracting an estimate of the bias of the MLE. Another approach, suggested by Wang & Leung (1997), is to use parametric bootstrap methods to find bias-adjusted estimators. The applications of Whitehead's bias correction method and Wang & Leung's bootstrap methods in the group sequential designs of sensitivity and specificity are of interest for future studies.

The simulation results for the CIs show that the normal approximation approach does not work well in small samples. The coverage probabilities are less than the nominal level of confidence, especially when the estimates are closer to the boundaries. One approach is to construct the CI based on a transformation of the parameters. A logit transformation is a traditional choice for estimating proportions. Furthermore, Whitehead *et al*. (2000) described methods for setting CIs for secondary parameters in a way that provides the correct coverage probability in repeated frequentist realizations of the sequential design used.
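The logit-transformation idea can be illustrated with a short sketch. It builds a normal-approximation interval on the logit scale, using the delta-method standard error, and transforms the end points back to the probability scale. This is a fixed-sample construction under assumed names; it does not adjust for the sequential stopping rule and is meant only to show how the transformation keeps the interval inside (0, 1).

```python
from math import exp, log, sqrt


def logit_ci(successes, n, z=1.959964):
    """Approximate CI for a proportion via the logit transformation.

    Hypothetical helper: assumes 0 < successes < n so that the logit of the
    observed proportion is finite.
    """
    if not 0 < successes < n:
        raise ValueError("logit interval needs 0 < successes < n")
    p_hat = successes / n
    centre = log(p_hat / (1 - p_hat))             # logit of the estimate
    half = z / sqrt(n * p_hat * (1 - p_hat))      # delta-method standard error
    def expit(v):
        return 1 / (1 + exp(-v))
    return expit(centre - half), expit(centre + half)


# Example: 28 of 32 diseased subjects correctly classified.
lower, upper = logit_ci(28, 32)
```

The resulting interval is asymmetric about the point estimate, extending further towards 0.5 than towards 1.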

The challenge of constructing exact CIs of the parameters is how to order the two-dimensional sample space following a multistage test. There are a variety of ways to order sample points on stopping boundaries and across sequential stages. With a specific ordering, a more ‘extreme’ case can be defined and exact CIs can be computed. Further research is much needed to develop these methods.

## Footnotes

One contribution of 13 to a Theme Issue ‘Mathematical and statistical methods for diagnoses and therapies’.

- © 2008 The Royal Society