Abstract
There is an inherent relationship between two-sided hypothesis tests and confidence intervals. A series of two-sided hypothesis tests may be inverted to obtain the 100(1-\(\alpha\))% confidence interval defined as the smallest interval that contains all point null parameter values that would not be rejected at the \(\alpha\) level. Unfortunately, for discrete data there are several different ways of defining two-sided exact tests and the most commonly used two-sided exact tests are defined one way, while the most commonly used exact confidence intervals are inversions of tests defined another way. This can lead to inconsistencies where the exact test rejects but the exact confidence interval contains the null parameter value. The packages exactci and exact2x2 provide several exact tests with the matching confidence intervals avoiding these inconsistencies as much as possible. Examples are given for binomial and Poisson parameters and both paired and unpaired \(2 \times 2\) tables.Applied statisticians are increasingly being encouraged to report
confidence intervals (CI) and parameter estimates along with p-values
from hypothesis tests. The htest class of the stats
package is ideally suited to these kinds of analyses, because all the
related statistics may be presented when the results are printed. For
exact two-sided tests applied to discrete data, a test-CI inconsistency
may occur: the p-value may indicate a significant result at level \(\alpha\) while the associated 100(1-\(\alpha\))% confidence interval may cover
the null value of the parameter. Ideally, we would like to present a
unified report (Hirji 2006), whereby the
p-value and the confidence interval match as much as possible.
I was asked to help design a study to determine if adding a new drug
(albendazole) to an existing treatment regimen
(ivermectin) for the treatment of a parasitic disease
(lymphatic filariasis) would increase the incidence of a rare
serious adverse event when given in an area endemic for another
parasitic disease (loa loa). There are many statistical issues
related to that design (Fay M. P., Huang, and
Twum-Danso 2007), but here consider a simple scenario to
highlight the point of this paper. A previous mass treatment using the
existing treatment had 2 out of 17877 experiencing the serious adverse
event (SAE) giving an observed rate of 11.2 per 100,000. Suppose the new
treatment was given to 20,000 new subjects and suppose that 10 subjects
experienced the SAE giving an observed rate of 50 per 100,000. Assuming
Poisson rates, an exact test using
poisson.test(c(2,10),c(17877,20000)) from the stats
package (throughout we assume version 2.11.0 for the stats package)
gives a p-value of \(p=0.0421\)
implying significant differences between the rates at the 0.05 level,
but poisson.test also gives a 95% confidence interval of
\((0.024, 1.050)\) which contains a
rate ratio of 1, implying no significant difference. We return to the
motivating example in the ‘Poisson two-sample’ section later.
We briefly review inferences using the p-value function for discrete data. For details see Hirji (2006) or Blaker (2000). Suppose you have a discrete statistic \(t\) with random variable \(T\) such that larger values of \(T\) imply larger values of a parameter of interest, \(\theta\). Let \(F_{\theta}(t) = Pr[T \leq t ; \theta]\) and \(\bar{F}_{\theta}(t) = Pr[T \geq t ; \theta]\). Suppose we are testing \[\begin{aligned} H_0: \theta \geq \theta_0 \\ H_1: \theta < \theta_0 \end{aligned}\] where \(\theta_0\) is known. Then smaller values of \(t\) are more likely to reject and if we observe \(t\), then the probability of observing equal or smaller values is \(F_{\theta_0}(t)\) which is the one-sided p-value. Conversely, the one-sided p-value for testing \(H_0: \theta \leq \theta_0\) is \(\bar{F}_{\theta_0}(t)\). We reject when the p-value is less than or equal to the significance level, \(\alpha\). The one-sided confidence interval would be all values of \(\theta_0\) for which the p-value is greater than \(\alpha\).
We list 3 ways to define the two-sided p-value for testing \(H_0: \theta=\theta_0\), which we denote
\(p_c\), \(p_m\) and \(p_b\) for the central,
minlike, and blaker methods, respectively:
\(p_c\) is 2 times the minimum of
the one-sided p-values bounded above by 1, or mathematically, \[p_c = \min \left\{ 1, 2 \times \min \left(
F_{\theta_0}(t),
\bar{F}_{\theta_0}(t) \right) \right\}.\] The name
central is motivated by the associated inversion confidence
intervals which are central intervals, i.e., they guarantee that the
lower (upper) limit of the 100(1-\(\alpha\))% confidence interval has less
than \(\alpha/2\) probability of being
greater (less) than the true parameter. This is called the TST (twice
the smaller tail method) by Hirji
(2006).
\(p_m\) is the sum of probabilities of outcomes with likelihoods less than or equal to the observed likelihood, or \[p_m = \sum_{T:f(T) \leq f(t)} f(T)\] where \(f(t) = Pr[T=t;\theta_0]\). This is called the PB (probability based) method by Hirji (2006).
\(p_b\) combines the probability of
the smaller observed tail with the smallest probability of the opposite
tail that does not exceed that observed tail probability. Blaker (2000) showed that this p-value may be
expressed as \[p_b = Pr \left[ \gamma(T) \leq
\gamma(t) \right]\] where \(\gamma(T) =
\min \left\{ F_{\theta_0}(T), \bar{F}_{\theta_0}(T) \right\}\).
The name blaker is motivated by Blaker (2000) which comprehensively studies the
associated method for confidence intervals, although the method had been
mentioned in the literature earlier, see e.g., Cox and Hinkley (1974), p. 79. This is called
the CT (combined tail) method by Hirji
(2006).
There are other ways to define two-sided p-values, such as defining extreme values according to the score statistic (see e.g., Hirji (2006, chap. 3), or Agresti and Y. Min (2001)). Note that \(p_c \geq p_b\) for all cases, so that \(p_b\) gives more powerful tests than \(p_c\). On the other hand, although generally \(p_c > p_m\) it is possible for \(p_c < p_m\).
If \(p(\theta_0)\) is a two-sided
p-value testing \(H_0:
\theta=\theta_0\), then its \(100(1-\alpha)\)% matching confidence
interval is the smallest interval that contains all \(\theta_0\) such that \(p(\theta_0)>\alpha\). To calculate the
matching confidence intervals, we consider only regular cases where
\(F_{\theta}(t)\) and \(\bar{F}_{\theta}(t)\) are monotonic
functions of \(\theta\) (except perhaps
the degenerate cases where \(F_{\theta}(t)=1\) or \(\bar{F}_{\theta}(t)=0\) for all \(\theta\) when \(t\) is the maximum or minimum). In this
case the matching confidence limits to the central test are
\((\theta_L, \theta_U)\) which are
solutions to: \[\alpha/2 =
\bar{F}_{\theta_L}(t)\] and \[\alpha/2
= {F}_{\theta_U}(t)\] except when \(t\) is the minimum or maximum, in which
case the limit is set at the appropriate extreme of the parameter space.
The matching confidence intervals for \(p_m\) and \(p_b\) require a more complicated algorithm
to ensure precision of the confidence limits (M.
Fay 2010).
If matching confidence intervals are used then test-CI
inconsistencies will not happen for the central method, and
will happen very rarely for the minlike and
blaker methods. We discuss those rare test-CI
inconsistencies in the ‘Unavoidable inconsistencies’ section later, but
the main point of this article is that it is not rare for \(p_m\) to be inconsistent with the
central confidence interval (M. Fay
2010) and that particular test-CI combination is the default for
many exact tests in the stats
package. We show some examples of such inconsistencies in the following
sections.
If \(X\) is binomial with parameters
\(n\) and \(\theta\), then the central
exact interval is the Clopper-Pearson confidence interval. These are the
intervals given by binom.test. The p-value given by
binom.test is \(p_m\). The
matching interval to the \(p_m\) was
proposed by Stern (1954) (see Blaker (2000)).
When \(\theta_0=0.5\) we have \(p_c=p_m=p_b\), and there is no chance of a
test-CI inconsistency even when the confidence intervals are not
inversions of the test as is the case in binom.test. When
\(\theta_0 \neq 0.5\) there may be
problems. We explore these cases in the ‘Poisson: two-sample’ section
later, since the associated tests reduce through conditioning to
one-sample binomial tests.
Note that there is a theoretically proven set of shortest confidence
intervals for this problem. These are called the Blyth-Still-Casella
intervals in StatXact (StatXact Procs Version 8). The problem with these
shortest intervals is that they are not nested, so that one could have a
parameter value that is included in the 90% confidence interval but not
in the 95% confidence interval (see Theorem 2 of Blaker (2000)). In contrast, the matching
intervals of the binom.exact function of the exactci
will always give nested intervals.
If \(X\) is Poisson with mean \(\theta\), then poisson.test
from stats gives
the exact central confidence intervals (Garwood 1936), while the p-value is \(p_m\). Thus, we can easily find a test-CI
inconsistency: poisson.test(5, r=1.8) gives a p-value of
\(p_m=0.036\) but the 95% central
confidence interval of \((1.6, 11.7)\)
contains the null rate of \(1.8\). As
\(\theta\) gets large the Poisson
distribution may be approximated by the normal distribution and these
test-CI inconsistencies become more rare.
The exactci
package contains the poisson.exact function, which has
options for each of the three methods of defining p-values and gives
matching confidence intervals. The code
poisson.exact(5, r=1.8, tsmethod="central") gives the same
confidence interval as above, but a p-value of \(p_c=0.073\); while
poisson.exact(5, r=1.8, tsmethod="minlike") returns a
p-value equal to \(p_m\), but a 95%
confidence interval of \((2.0, 11.8)\).
Finally, using tsmethod="blaker" we get \(p_b=0.036\) (it is not uncommon for \(p_b\) to equal \(p_m\)) and a 95% confidence interval of
\((2.0, 11.5)\). We see that there is
no test-CI inconsistency when using the matching confidence
intervals.
For the control group, let the random variable of the counts be \(Y_0\), the rate be \(\lambda_0\) and the population at risk be \(m_0\). Let the corresponding values for the test group be \(Y_1\), \(\lambda_1\) and \(m_1\). If we condition on \(Y_0+Y_1=N\) then the distribution of \(Y_1\) is binomial with parameters \(N\) and \[\begin{aligned} \theta &= \frac{m_1 \lambda_1}{m_0 \lambda_0 + m_1 \lambda_1} \end{aligned}\] This parameter may be written in terms of the ratio of rates, \(\rho = \lambda_1/\lambda_2\) as \[\begin{aligned} \theta &= \frac{ m_1 \rho }{m_0 + m_1 \rho } \end{aligned}\] or equivalently,
\[\begin{aligned} \label{eq:rho} \rho &= \frac{m_0 \theta }{ m_1 \left( 1 - \theta \right) }. \end{aligned} (\#eq:rho) \]
Thus, the null hypothesis that \(\lambda_1=\lambda_0\) is equivalent to
\(\rho=1\) or \(\theta=m_1/(m_0+m_1)\), and confidence
intervals for \(\theta\) may be
transformed into confidence intervals for \(\rho\) by Equation @ref(eq:rho). So the
inner workings of the poisson.exact function when dealing
with two-sample tests simply use the binom.exact function
and transform the results using Equation @ref(eq:rho).
Let us return to our motivating example (i.e., testing the difference
between observed rates \(2/17877\) and
\(10/20000\)). As in the other
sections, the results from poisson.test output \(p_m\) but the 95% central confidence
intervals, as we have seen, give a test-CI inconsistency. The
poisson.exact function avoids such test-CI inconsistency in
this case by giving the matching confidence interval; here are the
results of the three tsmethod options:
| tsmethod | p-value | \(95\%\) confidence interval |
|---|---|---|
| central | 0.061 | (0.024, 1.050) |
| minlike | 0.042 | (0.035, 0.942) |
| blaker | 0.042 | (0.035, 0.936) |
The \(2 \times 2\) table may be
created from many different designs, consider first designs where there
are two groups of observations with binary outcomes. If all the
observations are independent, even if the number in each group is not
fixed in advance, proper inferences may still be obtained by
conditioning on those totals (Lehmann and Romano
2005). M. Fay (2010) considers the
\(2 \times 2\) table with independent
observations, so we only briefly present his motivating example here.
The usual two-sided application of Fisher’s exact test:
fisher.test(matrix(c(4,11,50,569), 2, 2)) gives \(p_m=0.032\) using the minlike
method, but 95% confidence interval on the odds ratio of \((0.92, 14.58)\) using the
central method. As with the other examples, the test-CI
inconsistency disappears when we use either the exact2x2 or
fisher.exact function from the exact2x2
package.
The case not studied in M. Fay (2010) is when the data are paired, the case which motivates McNemar’s test. For example, suppose you have twins randomized to two treatment groups (Test and Control) then test on a binary outcome (pass or fail). There are 4 possible outcomes for each pair: (a) both twins fail, (b) the twin in the control group fails and the one in the test group passes, (c) the twin on the test group fails and the one in the control group passes, or (d) both twins pass. Here is a table where the numbers of sets of twins falling in each of the four categories are denoted \(a\), \(b\), \(c\) and \(d\):
| Test | ||
| Control | Fail | Pass |
| Fail | \(a\) | \(b\) |
| Pass | \(c\) | \(d\) |
In order to test if the treatment is helpful, we use only the numbers
of discordant pairs of twins, \(b\) and
\(c\), since the other pairs of twins
tell us nothing about whether the treatment is helpful or not. McNemar’s
test statistic is \[\begin{aligned}
Q \equiv Q(b,c) &= \frac{ (b- c)^2 }{b+c}
\end{aligned}\] which for large samples is distributed like a
chi-squared distribution with 1 degree of freedom. A closer
approximation to the chi-squared distribution uses a continuity
correction: \[\begin{aligned}
Q_C \equiv Q_C(b,c) &= \frac{ \left( |b- c|-1 \right)^2 }{b+c}
\end{aligned}\] In R this test is given by the function
mcnemar.test.
Case-control data may be analyzed this way as well. Suppose you have a set of people with some rare disease (e.g., a certain type of cancer); these are called the cases. For this design you match each case with a control who is as similar as feasible on all important covariates except the exposure of interest. Here is a table:
| Exposed | ||
| Not Exposed | Control | Case |
| Control | \(a\) | \(b\) |
| Case | \(c\) | \(d\) |
For this case as well we can use \(Q\) or \(Q_C\) to test for no association between case/control status and exposure status.
For either design, we can estimate the odds ratio by \(b/c\), which is the maximum likelihood estimate (see Breslow and Day (1980), p. 165). Consider some hypothetical data (chosen to highlight some points):
| Test | ||
| Control | Fail | Pass |
| Fail | 21 | 9 |
| Pass | 2 | 12 |
When we perform McNemar’s test with the continuity correction we get
\(p=0.070\) while without the
correction we get \(p=0.035\). Since
the inferences are on either side of the traditional \(0.05\) cutoff of significance, it would be
nice to have an exact version of the test to be clearer about
significance at the 0.05 level. From the exact2x2
package using mcnemar.exact we get the exact McNemar’s test
p-value of \(p=.065\). We now give the
motivation for the exact version of the test.
After conditioning on the total number of discordant pairs, \(b+c\), we can treat the problem as \(B \sim Binomial(b+c, \theta)\), where \(B\) is the random variable associated with \(b\). Under the null hypothesis \(\theta=0.5\). We can transform the parameter \(\theta\) into an odds ratio by
\[\begin{aligned} \label{eq:orp} Odds Ratio \equiv \phi = \frac{\theta}{1-\theta} . \end{aligned} (\#eq:orp) \]
(Breslow and Day (1980), p. 166). Since
it is easy to perform exact tests on a binomial parameter, we can
perform exact versions of McNemar’s test internally using the
binom.exact function of the package exactci
then transform the results into odds ratios via Equation @ref(eq:orp).
This is how the calculations are done in the exact2x2
function when paired=TRUE. The alternative and
the tsmethod options work in the way one would expect. So
although McNemar’s test was developed as a two-sided test testing the
null that \(\theta=0.5\) (or
equivalently \(\phi=1\)), we can easily
extend this to get one-sided exact McNemar-type Tests. For two-sided
tests we can get three different versions of the two-sided exact
McNemar’s p-value function using the three tsmethod
options, but all three are equivalent to the exact version of McNemar’s
test when testing the usual null that \(\theta=0.5\) (see the Appendix in
vignette("exactMcNemar") in exact2x2).
If we narrowly define McNemar’s test as only testing the null that \(\theta=0.5\) as was done in the original
formulation, there is only one exact McNemar’s test; it is only when we
generalize the test to test null hypotheses of \(\theta=\theta_0 \neq 0.5\) that there are
differences between the three methods. Those differences between the
tsmethod options become apparent in the calculation of the
confidence intervals. The default is to use central
confidence intervals so that they guarantee that the lower (upper) limit
of the 100(1-\(\alpha\))% confidence
interval has less than \(\alpha/2\)
probability of being greater (less) than the true parameter. These
guarantees on each tail are not true for the minlike and
blaker two-sided confidence intervals; however, the latter
give generally tighter confidence intervals.
In order to gain insight as to why test-CI inconsistencies occur, we
can plot the p-value function. This type of plot explores one data
realization and its many associated p-values on the vertical axis
representing a series of tests modified by changing the point null
hypothesis parameter (\(\theta_0\)) on
the horizontal axis. There is a default plot command for
binom.exact, poisson.exact, and
exact2x2 that plots the p-value as a function of the point
null hypotheses, draws vertical lines at the confidence limits, draws a
line at 1 minus the confidence level, and adds a point at the null
hypothesis of interest. Other plot functions
(exactbinomPlot, exactpoissonPlot, and
exact2x2Plot) can be used to add to that plot for comparing
different methods. In Figure 1 we create such a
plot for the motivating example. Here is the code to create that
figure:
x <- c(2, 10) n <- c(17877, 20000) poisson.exact(x, n, plot = TRUE) exactpoissonPlot(x, n, tsmethod = "minlike", doci = TRUE, col = "black", cex = 0.25, newplot = FALSE)
We see from Figure 1 that the
central method has smoothly changing p-values, while the
minlike method has discontinuous ones. The usual confidence
interval is the inversion of the central method (the limits
are the vertical gray lines, where the dotted line at the significance
level intersects with the gray p-values), while the usual p-value at the
null that the rate ratio is 1 is where the black line is. To see this
more clearly we plot the lower right hand corner of Figure 1 in Figure 2.
From Figure 2 we see why the test-CI
inconsistencies occur, the minlike method is generally more
powerful than the central method, so that is why the
p-values from the minlike method can reject a specific null
when the confidence intervals from the central method imply
failing to reject that same null. We see that in general if you use the
matching confidence interval to the p-value, there will not be test-CI
inconsistencies.
We have argued for using a unified report whereby the p-value and the
confidence interval are calculated from the same p-value function (also
called the evidence function or confidence curve). We have provided
several practical examples. Although the theory of these methods have
been extensively studied (Hirji 2006),
software has not been readily available. The exactci
and exact2x2
packages fill this need. We know of no other software that provides the
minlike and blaker confidence intervals,
except the PropCIs
package which provides the Blaker confidence interval for the single
binomial case only.
Finally, we briefly consider closely related software. The rateratio.test
package does the two-sample Poisson exact test with confidence intervals
using the central method. The PropCIs
package does several different asymptotic confidence intervals, as well
as the Clopper-Pearson (i.e. central) and Blaker exact
intervals for a single binomial proportion. The PropCIs
package also performs the mid-p adjustment to the Clopper-Pearson
confidence interval which is not currently available in exactci.
Other exact confidence intervals are not covered in the current version
of PropCIs
(Version 0.1-6). The coin and perm
packages give very general methods for performing exact permutation
tests, although neither perform the exact matching confidence intervals
for the cases studied in this paper.
I did not perform a comprehensive search of commercial statistical
software; however, SAS (Version 9.2) (perhaps the most comprehensive
commercial statistical software) and StatXact (Version 8) (the most
comprehensive software for exact tests) both do not implement the
blaker and minlike confidence intervals for
binomial, Poisson and 2x2 table cases.