Advisors:
D. Brillinger, S. Fienberg, J. Gani, J. Hartigan,
J. Kiefer, K. Krickeberg
John W. Pratt
Jean D. Gibbons
Concepts of
Nonparametric Theory
With 23 Figures
Springer-Verlag
New York Heidelberg Berlin
John W. Pratt
Graduate School of Business Administration
Harvard University
Boston, Massachusetts 02163
USA

Jean D. Gibbons
Graduate School of Business Administration
University of Alabama
University, Alabama 35486
USA
Preface

This book explores both nonparametric and general statistical ideas by
developing nonparametric procedures in simple situations. The major goal
is to give the reader a thorough intuitive understanding of the concepts
underlying nonparametric procedures and a full appreciation of the properties
and operating characteristics of those procedures covered. This book differs
from most statistics books by including considerable philosophical and
methodological discussion. Special attention is given to discussion of the
strengths and weaknesses of various statistical methods and approaches.
Difficulties that often arise in applying statistical theory to real data also
receive substantial attention.
The approach throughout is more conceptual than mathematical. The
"Theorem-Proof" format is avoided; generally, properties are "shown,"
rather than "proved." In most cases the ideas behind the proof of an im-
portant result are discussed intuitively in the text and formal details are left
as an exercise for the reader. We feel that the reader will learn more from
working such things out than from checking step-by-step a complete presen-
tation of all details.
Those who are interested in applications of nonparametric procedures
and not primarily in the mathematical side of things, but who would like
to have a general understanding of the theoretical bases and properties of
these techniques, will find this book useful as both a reference and a text. In
order to follow most of the main ideas and concepts presented, the reader
should have a good knowledge of the basic concepts of probability theory
and statistical inference at the level of introductory books with a pre-
requisite of one or two years of calculus. More advanced topics require more
mathematical and statistical sophistication. The particularly advanced
sections are indicated by an asterisk and may be omitted. The many exercises
at the end of each chapter also vary in level, from a straightforward data
analysis to a complicated proof. They are designed to supplement, com-
plement, and illustrate the materials covered in the text. The extensive
references provide ample sources for further study. The nonparametric
area is still a fertile field for research, and the interested reader will find no
dearth of topics for further study; this book might provide an impetus for
additional research in nonparametric inference.
The instructor who adopts this book for classroom use can proceed in
various directions and at various levels, as appropriate to the level and
interests of the students. If this course is the student's first exposure to non-
parametric methods, we recommend coverage of selected (unstarred)
portions of Chap. 1-7. If the student has already had an elementary survey
course in nonparametric methods, this book can be used for a second course
to provide more advanced material and deeper coverage of the properties of
the procedures already known to the student. In assigning problems, the
instructor should indicate how much rigor is expected in the solution.
Appropriate references could be assigned for reports on selected topics. The
book could be supplemented by outside readings from some of the references
given.
The book does not attempt to provide a complete compendium of all
the nonparametric methods presently available; only the most important
procedures for testing and estimation that are applicable to the one-sample
and two-sample situations are included. However, those procedures covered
are treated in considerable detail.
This book originated from notes which provided the basis for a course in
nonparametric statistics first given in 1959 at Harvard University. Over the
years, many readers have made valuable comments and suggestions. As there
are too many to name individually, we can only acknowledge a large collec-
tive debt to all readers in the past.
The authors are particularly grateful to the Office of Naval Research,
the National Science Foundation, the Guggenheim Foundation, The
Associates of the Harvard Business School, the Kyoto Institute of Economic
Research, Kyoto University, and the Board of Visitors of the Graduate
School of Business at The University of Alabama, for support; to Robert
Schlaifer for computation of some entries in Tables 8.1 and 11.1 of Chapter 8;
to Arthur Schleifer, Jr. for computation of the entries in Table C; and to
the Literary Executor of the late Sir Ronald A. Fisher, F.R.S., to Dr. Frank
Yates, F.R.S., and to Longman Group Ltd., London, for permission to
reprint Table III from their book Statistical Tables for Biological, Agricultural
and Medical Research (6th edition, 1974).
A Note to the Reader
A two-digit reference system is used throughout this book (with the exception
of problems). The first digit denotes a section number within a chapter. For
subsections, equations, theorems, figures and tables within each section, a
second digit is added. If a reference is made to a different chapter, the chapter
number is included, but within the same chapter, it is omitted. Numerical
references, except to equations, are preceded by an appropriate designation,
like Section or Table. Equation numbers always appear in parentheses and
are referred to in this way, e.g., (3.4) of Chap. 2 means Eq. (3.4), which is the
fourth numbered equation in Sect. 3 of Chap. 2. Problems are given at the
end of each chapter; they are numbered sequentially.
Justification of a result, even when entirely heuristic, is sometimes labeled
proof and separated from the rest of the text so that the reader who is not
interested can skip that portion. The end of a proof is indicated by a □ when
it seems helpful. References in the text are given by surname of author and
date. The full citations for these and other pertinent references are given in
the Bibliography.
Throughout the book, more difficult material is indicated by an asterisk *
at the beginning and end. These portions may be omitted without detriment
to understanding other parts of the book.
Contents
CHAPTER 1
Concepts of Statistical Inference and the Binomial Distribution
1 Introduction 1
2 Probability Distributions 2
3 Estimators and their Properties 6
3.1 Unbiasedness and Variance 7
3.2 Consistency 8
3.3 Sufficiency 8
3.4 Minimum Variance 12
4 Hypothesis Testing 14
4.1 Tests and their Interpretation 14
4.2 Errors 17
4.3 One-Tailed Binomial Tests 22
4.4 P-values 23
4.5 Two-Tailed Test Procedures and P-values 28
4.6 Other Conclusions in Two-Tailed Tests 32
5 Randomized Test Procedures 34
5.1 Introduction: Motivation and Examples 34
5.2 Randomized Tests: Definitions 37
5.3 Nonrandomized Tests Equivalent to Randomized Tests 38
5.4 Usefulness of Randomized Tests in Theory and Practice 39
5.5 *Randomized P-values 40
6 Confidence Regions 41
6.1 Definition and Construction in the Binomial Case 41
6.2 Definition of Confidence Regions and Relationship to Tests in the
General Case 45
6.3 Interpretation of Confidence Regions 46
6.4 True Confidence Level 48
CHAPTER 2
One-Sample and Paired-Sample Inferences Based on the Binomial
Distribution 82
1 Introduction 82
2 Quantile Values 83
3 The One-Sample Sign Test for Quantile Values 85
3.1 Test Procedures 85
3.2 "Optimum" Properties 88
3.3 *Proofs 91
4 Confidence Procedures Based on the Sign Test 92
5 Interpolation between Attainable Levels 96
6 The Sign Test with Zero Differences 97
6.1 Discussion of Procedures 97
6.2 Conditional Properties of Conditional Sign Tests 99
6.3 Unconditional Properties of Conditional Sign Tests 101
6.4 *Proof for One-Sided Alternatives 101
6.5 *Proof for Two-Sided Alternatives 102
7 Paired Observations 104
8 Comparing Proportions using Paired Observations 106
8.1 Test Procedure 108
8.2 Alternative Presentations 109
8.3 Example 110
8.4 Interpretation of the Test Results 112
8.5 Properties of the Test 114
CHAPTER 3
One-Sample and Paired-Sample Inferences Based on Signed Ranks 145
1 Introduction 145
2 The Symmetry Assumption or Hypothesis 146
3 The Wilcoxon Signed-Rank Test 147
3.1 Test Procedure and Exact Null Distribution Theory 147
3.2 Asymptotic Null Distribution Theory 150
3.3 Large Sample Power 151
3.4 Consistency 153
3.5 Weakening the Assumptions 155
4 Confidence Procedures Based on the Wilcoxon Signed-Rank
Test 157
5 A Modified Wilcoxon Procedure 158
6 Zeros and Ties 160
6.1 Introduction 160
6.2 Obtaining the Signed Ranks 162
6.3 Test Procedures 163
6.4 Warnings and Anomalies: Examples 167
6.5 Comparison of Procedures 170
7 Other Signed-Rank Procedures 171
7.1 Sums of Signed Constants 172
7.2 Signed Ranks and Walsh Averages 173
7.3 Confidence Bounds Corresponding to Signed-Rank Tests 174
7.4 Procedures Involving a Small Number of Walsh Averages 175
8 Invariance and Signed-Rank Procedures 177
8.1 Permutation Invariance 177
8.2 Invariance under Increasing, Odd Transformations 179
9 Locally Most Powerful Signed-Rank Tests 181
Problems 185
CHAPTER 4
One-Sample and Paired-Sample Inferences Based on the Method of
Randomization 203
1 Introduction 203
2 Randomization Procedures Based on the Sample Mean and
Equivalent Criteria 205
CHAPTER 5
Two-Sample Rank Procedures for Location 231
1 Introduction 231
2 The Shift Assumption 232
3 The Median Test, Other Two-Sample Sign Tests, and Related
Confidence Procedures 234
3.1 Reduction of Data to a 2 x 2 Table 234
3.2 Fisher's Exact Test for 2 x 2 Tables 238
3.3 Ties 241
3.4 Corresponding Confidence Procedures 242
3.5 Power 243
3.6 Consistency 245
3.7 "Optimum" Properties 246
3.8 Weakening the Assumptions 247
4 Procedures Based on Sums of Ranks 249
4.1 The Rank Sum Test Procedure 249
4.2 Null Distribution of the Rank Sum Statistics 252
4.3 Corresponding Confidence Procedures 253
4.4 Approximate Power 255
4.5 Consistency 257
4.6 Weakening the Assumptions 257
4.7 Ties 258
4.8 Point and Confidence Interval Estimation of P(X > Y) 263
5 Procedures Based on Sums of Scores 265
6 Two-Sample Rank Tests and the Y - X Differences 269
7 Invariance and Two-Sample Rank Procedures 269
8 Locally Most Powerful Rank Tests 272
8.1 Most Powerful and Locally Most Powerful Rank Tests Against
Given Alternatives 272
8.2 The Class of Locally Most Powerful Rank Tests 277
Problems 279
CHAPTER 6
Two-Sample Inferences Based on the Method of Randomization 296
1 Introduction 296
2 Randomization Procedures Based on the Difference Between Sample
Means and Equivalent Criteria 297
2.1 Tests 297
2.2 Weakening the Assumptions 298
2.3 Related Confidence Procedures 300
2.4 Properties of the Exact Randomization Distribution 301
2.5 Approximations to the Exact Randomization Distribution 302
3 The Class of Two-Sample Randomization Tests 305
3.1 Definition 305
3.2 Properties 307
4 Most Powerful Randomization Tests 310
4.1 General Case 310
4.2 One-Sided Normal Alternatives 311
4.3 Two-Sided Normal Alternatives 312
Problems 314
CHAPTER 7
Kolmogorov-Smirnov Two-Sample Tests 318
1 Introduction 318
2 Empirical Distribution Function 319
3 Two-Sample Kolmogorov-Smirnov Statistics 320
4 Null Distribution Theory 322
4.1 An Algorithm for the Exact Null Distribution 323
4.2 Relation Between One-Tailed and Two-Tailed Procedures 325
4.3 Exact Formulas for Equal Sample Sizes 325
4.4 Asymptotic Null Distributions 328
5 Ties 330
6 Performance 331
7 One-Sample Kolmogorov-Smirnov Statistics 334
Problems 336
CHAPTER 8
Asymptotic Relative Efficiency 345
1 Introduction 345
2 Asymptotic Behavior of Tests: Heuristic Discussion 347
2.1 Asymptotic Power of a Test 347
2.2 Nuisance Parameters 351
2.3 Asymptotic Relative Efficiency of Two Tests 353
3 Asymptotic Behavior of Point Estimators: Heuristic Discussion 355
3.1 Estimators of the Same Quantity 355
3.2 Relation of Estimators and Tests 357
3.3 *Estimators of Different Quantities 360
Tables 425
Table A Cumulative Standard Normal Distribution 426
Table B Cumulative Binomial Distribution 428
Table C Binomial Confidence Limits 431
Table D Cumulative Probabilities for Wilcoxon Signed-Rank Statistic 433
Table E Cumulative Probabilities for Hypergeometric Distribution 435
Table F Cumulative Probabilities for Wilcoxon Rank Sum Statistic 437
Table G Kolmogorov-Smirnov Two-Sample Statistic 443
Bibliography 445
Index 455
CHAPTER 1
Concepts of Statistical Inference
and the Binomial Distribution
1 Introduction
Most readers of this book will already be well acquainted with the binomial
probability distribution, since it arises in a wide variety of statistical problems,
is simple to understand and use, and is extensively tabled. Our study of
nonparametric statistics will begin with a rather thorough discussion of the
basic concepts of statistical inference, developed and explained in the context
of the binomial model. This approach has been chosen for two reasons. First,
some important nonparametric procedures lead to the binomial model, and
the properties of these nonparametric procedures therefore depend on
properties of binomial procedures. Second, the binomial model provides a
familiar and easy context for the illustration of many of the concepts, terms
and notations which are necessary for an understanding of the nonparametric
procedures developed later in this book. Some of these ideas will be familiar to
the reader, but many belong especially to the area of nonparametric statistics
and will require more careful study. The reader may also find that even the
"simple" binomial situation is less simple than it may have seemed on
previous acquaintance.
In this first chapter, after a brief introduction to probability distributions,
we will discuss the basic concepts and principles of point estimation, hypo-
thesis testing and interval estimation. The various inference techniques will be
described, with an emphasis on problems arising in their interpretation. In the
process of illustrating the procedures, we will study many properties of the
binomial probability distribution, including approximations using other
distributions.
2 Probability Distributions
Suppose that the possible outcomes of an experiment are distinguished only
as belonging to one of two possible categories which we call success and failure.
The two categories must be mutually exclusive, but the terms success and
failure are completely arbitrary and are used solely for convenience. (For
example, if the experiment involves administering a drug to a patient, we
might assign the label "success" to the event that the patient dies. This choice
might be convenient, not merely macabre, because tables are sometimes lim-
ited to the situation where the probability of success does not exceed 0.50.) We
denote the probability of success by p and the probability of failure by q, where
q = 1 − p for any p, 0 ≤ p ≤ 1. The set of all possible outcomes of this simple
experiment could be written as {Success, Failure}, but it will be more con-
venient to write {1, 0} where 1 denotes a success and 0 denotes a failure.
When an experiment of this type is repeated, the trials are called Bernoulli
trials if they are independent and the probability p of success is identical on
every trial. Consider a sequence of n Bernoulli trials where n is fixed. Then the
possible outcomes of this compound experiment can be written as n-tuples
(x₁, ..., xₙ), where each xᵢ is 1 or 0. By independence, the probability of any
particular outcome consisting of exactly r 1's and n − r 0's is p^r q^(n−r), for any
r, r = 0, 1, ..., n. This probability is the same for every arrangement of
exactly r 1's and n − r 0's. There are C(n, r) = n!/[r!(n − r)!] such arrangements,
so the total number of successes, S say, has probability

    P(S = r) = C(n, r) p^r q^(n−r).

This holds for r = 0, 1, ..., n. The probability is zero for all other values of r.
This result is useful whenever we want to distinguish the possible outcomes of
the compound experiment only according to the value of S, i.e., the number of
l's irrespective of the order in which they appear.
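These binomial probabilities are simple to compute directly; the following minimal sketch (Python, standard library only; the values n = 10 and p = 0.6 are merely illustrative) tabulates P(S = r) and checks that the probabilities sum to 1.

    from math import comb

    def binom_pmf(r, n, p):
        # P(S = r) = C(n, r) p^r (1 - p)^(n - r)
        return comb(n, r) * p**r * (1 - p)**(n - r)

    n, p = 10, 0.6                      # illustrative values
    for r in range(n + 1):
        print(r, round(binom_pmf(r, n, p), 3))
    total = sum(binom_pmf(r, n, p) for r in range(n + 1))
    assert abs(total - 1) < 1e-12       # the distribution sums to 1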
More formally, what we have done is to map the set of n-tuples into a set of
nonnegative integers which represent the number of 1's. The function can be
denoted by S(x₁, ..., xₙ), with a range of {0, 1, ..., n}. The function S is then
called a random variable. This means that it is a function whose domain is the
set of all possible outcomes of an experiment, each outcome of which has a
probability, known or unknown.
This illustrates the usual mathematical definition of a random variable as a
"function on a sample space." Intuitively, a random variable is any uncertain
quantity to which one is willing to attach probability statements. The "sample
space" can be refined if necessary so that all such quantities are functions
thereon. A measurement, or the outcome of an experiment, is a random vari-
able, provided the probabilities of its possible values are subject to discussion.
A random variable may be multidimensional-when several one-dimensional
uncertain quantities are considered simultaneously. Thus a large set of
measurements may be considered a random variable, but as a vector or an
n-tuple.
Any function of a random variable-for instance the sum of a set of measure-
ments-is also ipso facto a random variable. Any function of observable
random variables is called a statistic. For instance, in Bernoulli trials, a
random variable describing the outcome of the ith trial could be defined as
Xᵢ = 1 or 0 according as the ith trial is a success or a failure.
If a random variable X has a density function f(x), the value of P(a ≤ X ≤ b)
is then the area under the density function f(x) and above the x axis, between
a and b. The area P(X ≤ z) for arbitrary z is
shown in Fig. 2.1 as the hatched region. Generalization to vector random
variables is straightforward.
Figure 2.1 [the standard normal density φ(x); the hatched area is P(X ≤ z)]
The particular density function graphed in Figure 2.1 is called the standard,
or unit, normal density and is given by the formula
    φ(x) = (1/√(2π)) e^(−x²/2).    (2.3)
The area under this curve from z to infinity, or P(X ≥ z), is given by Table A
for z ≥ 0. Because the density is symmetric about 0, we have P(X ≤ −z) =
P(X ≥ z). Thus the probability to the left of a negative number, that is, the
area from minus infinity to −z for z ≥ 0, can also be read directly from this
table. If X has a normal distribution with mean μ and standard deviation σ,
then Z = (X − μ)/σ has the standard normal density φ above.
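Entries like those of Table A can be reproduced with any routine for the normal integral; a minimal sketch (Python, using the standard identity Φ(x) = [1 + erf(x/√2)]/2; the numerical values are illustrative):

    from math import erf, sqrt

    def std_normal_cdf(x):
        # Phi(x) = P(X <= x) for a standard normal X
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def upper_tail(z):
        # P(X >= z), the area from z to infinity
        return 1.0 - std_normal_cdf(z)

    print(round(upper_tail(1.96), 4))        # 0.025
    print(round(std_normal_cdf(-1.96), 4))   # 0.025, by symmetry
    # Standardization: for X normal with mean mu and s.d. sigma,
    # P(X <= b) = Phi((b - mu)/sigma).
    mu, sigma, b = 50.0, 10.0, 65.0          # illustrative values
    print(round(std_normal_cdf((b - mu) / sigma), 4))   # Phi(1.5) = 0.9332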
The cumulative distribution function, or c.d.f., of any random variable X,
is defined as F(x) = P(X ≤ x), so that F(x) is the sum, or in the continuous
case the integral, of the frequency or density function over all values not
exceeding x.
Note that F(−∞) = 0 and F(∞) = 1, while F(a) ≤ F(b) for all a ≤ b. It is
customary to denote the c.d.f. by the capital of that letter which in lower case
denotes the frequency or density function. The c.d.f. of a discrete random
variable jumps upward by an amount equal to the value of the frequency
function at each possible value.
¹ This book omits measurability and "almost everywhere" qualifications. Anyone who ought
to care about them should have no difficulty deciding where they are appropriate.
3.1 Unbiasedness and Variance

One property of the estimator S/n of the parameter p in the binomial dis-
tribution is that its expected value, or mean, is exactly p. Denoting this
expectation by E, we write this statement as

    E(S/n) = p.    (3.1)

An estimator whose expected value is equal to the parameter being estimated,
whatever the true value of that parameter, is called unbiased; thus S/n is an
unbiased estimator of p. Since the variance of each Xᵢ is p(1 − p), the variance
of S/n is

    var(S/n) = p(1 − p)/n.    (3.2)

Hence, for any n ≥ 2, the variance of S/n is smaller than the variance of the
single observation Xᵢ for all values of p except 0 and 1 (where both variances
are 0).
In fact, the same comparison holds between S/n and any other unbiased
estimator, so that S/n is the unique minimum variance unbiased estimator of p.
This property will be further discussed and proved in Sect. 3.4.
When reporting the value of an estimator T of a parameter θ, one should
report also some measure of its spread or an estimate thereof. For this purpose,
it is better to use the square root of the variance, the standard deviation,
because it has the same units as θ and T. For many theoretical purposes,
however, like that just mentioned, the variance is slightly more convenient.
Of course, the variance or standard deviation is a measure of the spread of
the estimator around the parameter only if the estimator is unbiased. For an
arbitrary estimator T it is also useful to define the bias as the difference between
the expected value of the estimator and the parameter θ being estimated, or
E(T) − θ, and the mean squared error as E[(T − θ)²]. The latter can be
written as

    E[(T − θ)²] = E{[T − E(T)]²} + [E(T) − θ]² = var(T) + [bias(T)]².

If the bias contributes a negligible proportion of the mean squared error, then
the lack of unbiasedness is of little consequence. Of course the bias, variance,
and expectations above depend on the distribution of T, which may or may
not be completely determined by θ under any given assumptions.
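The decomposition is easy to confirm by simulation; in the minimal sketch below (Python; the shrinkage estimator T = (S + 1)/(n + 2) is a deliberately biased choice used purely for illustration), the simulated mean squared error matches var(T) + [bias(T)]².

    import random

    def simulate_mse(n, p, reps=100_000):
        # Estimate bias, variance, and MSE of T = (S + 1)/(n + 2).
        ests = []
        for _ in range(reps):
            s = sum(random.random() < p for _ in range(n))
            ests.append((s + 1) / (n + 2))
        mean_t = sum(ests) / reps
        var_t = sum((t - mean_t) ** 2 for t in ests) / reps
        bias = mean_t - p
        mse = sum((t - p) ** 2 for t in ests) / reps
        print(f"mse = {mse:.6f}   var + bias^2 = {var_t + bias ** 2:.6f}")

    random.seed(1)
    simulate_mse(n=10, p=0.6)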
3.2 Consistency
3.3 Sufficiency
on any particular three of the n trials. This is true regardless of the value of p.
More generally, given S = s, for any p, the s successes are equally likely to have
occurred on each set of s of the n trials. Consequently, once the number of
successes is known, it appears intuitively that no further information is gained
about p by knowing which trials produced these successes. The meaning and
implications of this intuitive idea can be made more explicit as follows.
Suppose that we (the authors) know the outcomes of the individual trials
X₁, ..., Xₙ while you (the reader) know only S. It might seem that we have an
advantage by access to more complete information about the experiment.
Suppose, upon observing S = s, however, that you choose s out of the n trials
at random and arbitrarily call those trials successes and the rest failures,
getting what might be called simulated trials Y₁, ..., Yₙ. Then, whatever the
value of p, your simulated trials Y₁, ..., Yₙ have the same distribution as the
trials X₁, ..., Xₙ which actually took place and whose outcomes we know.
(Proof: Whatever the value of p, the X's and the Y's have the same conditional
joint distribution given S; and of course S is common to both sets of trials.
Consequently Y₁, ..., Yₙ have the same unconditional joint distribution as
X₁, ..., Xₙ for every p.)
It is now evident that any inference about p which we can make knowing
X₁, ..., Xₙ, you can mimic knowing only S. More explicitly, suppose we use a
certain procedure depending on X₁, ..., Xₙ, such as an estimator of p, or an
inference statement about p, or a forecasting statement about future observa-
tions Xₙ₊₁, ..., or a decision rule whose payoff depends on p and/or Xₙ₊₁, ....
Suppose you use the same procedure, but applied to the simulated trials
Y₁, ..., Yₙ in place of X₁, ..., Xₙ. Then, although we may not get the same
result in a particular instance, your procedure will have exactly the same
probabilistic behavior as ours regardless of the value of p. The probability of
any event defined in terms of X₁, ..., Xₙ depends only on the value of p. The
same event defined in terms of the simulated trials Y₁, ..., Yₙ will have the
same probability, whatever the value of p, because the Y's have the same
distribution as the X's for all p. If, for instance, we estimate p by a function of
X₁, ..., Xₙ and you estimate p by the same function of Y₁, ..., Yₙ, then your
estimator will have the same bias, variance, and distribution as ours for all p.
In short, our procedure and yours have the same operating characteristics,
where the term operating characteristic means any specific aspect of the
probabilistic behavior of a procedure. Introducing the vector notations X =
(X₁, ..., Xₙ) and Y = (Y₁, ..., Yₙ) for convenience, we state the following
properties.
(1) f(x; θ) factors into a function of S and θ times a function of x, say

    f(x; θ) = g[S(x); θ] h(x),    (3.6)

for all real vectors x.
(2) The conditional distribution of X given S does not depend on θ.
(3) There is a (random) function Y of S such that, for all θ, Y(S) has the same
distribution as X.
Any of these might be considered a justification for the intuitive explanation
that S is sufficient for X if the distribution of X depends on θ only through S, or
S contains all the information about θ, so it is fortunate that they agree. We
will now prove that (1) implies (2), and (2) implies (3). The converse proofs,
and the special case of the binomial, are left for the reader in Problem 3. Some
of these proofs have been given in part already.
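The mimicking construction described above is easy to try out numerically; a minimal simulation sketch (Python; the values n = 10, p = 0.6 and the comparison statistic are illustrative choices):

    import random

    def bernoulli_trials(n, p):
        return [1 if random.random() < p else 0 for _ in range(n)]

    def simulated_trials(s, n):
        # Knowing only S = s, call s randomly chosen trials successes:
        # the simulated trials Y1, ..., Yn.
        y = [0] * n
        for i in random.sample(range(n), s):
            y[i] = 1
        return y

    random.seed(2)
    n, p, reps = 10, 0.6, 100_000
    x_first = y_first = 0
    for _ in range(reps):
        x = bernoulli_trials(n, p)
        y = simulated_trials(sum(x), n)
        x_first += x[0]       # outcome of trial 1 in the real trials
        y_first += y[0]       # outcome of trial 1 in the simulated trials
    # Both relative frequencies estimate the same probability p:
    print(x_first / reps, y_first / reps)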
PROOFS. To avoid technicalities in the general definition of conditional
distributions, we will assume that X has a discrete distribution with frequency
function f(x; θ).
Suppose that f factors as in (3.6). To show that S is sufficient, we compute the
conditional distribution of X given S = s as follows.

    P_θ(X = x | S = s) = P_θ(X = x, S = s)/P_θ(S = s)
                       = P_θ(X = x)/P_θ(S = s)   if S(x) = s,    (3.7)
                       = 0                        otherwise.

Using (3.6), we have

    P_θ(X = x) = f(x; θ) = g[S(x); θ] h(x)    (3.8)

    P_θ(S = s) = Σ_{S(x′)=s} P_θ(X = x′) = g(s; θ) Σ_{S(x′)=s} h(x′),    (3.9)

where the sums are over those x′ for which S(x′) = s. Substituting (3.8) and
(3.9) into (3.7), we find

    P_θ(X = x | S = s) = h(x)/Σ_{S(x′)=s} h(x′)   if S(x) = s,    (3.10)
                       = 0                         otherwise.
Since the right-hand side does not depend on θ, the conditional distribution of
X given S does not depend on θ, and S is sufficient.
Now suppose that the conditional distribution of X given S does not
depend on θ. We will define Y(s) so that it has the same distribution as X for
every θ. Let

    f_s(x) = P(X = x | S = s),    (3.11)

which by assumption does not depend on θ. For each s let Y(s) be a random
variable with frequency function f_s(x).
We must show that Y(S) has the same distribution as X for all θ. This follows
from the fact that it has the same conditional distribution given S as X has, for
all θ. Explicitly,

    P_θ[Y(S) = x] = Σ_s P_θ(S = s) P_θ[Y(S) = x | S = s]
                  = Σ_s P_θ(S = s) f_s(x)    (3.12)
                  = Σ_s P_θ(S = s) P_θ(X = x | S = s)
                  = P_θ(X = x),

so Y(S) has indeed the same distribution as X for all θ. □
3.4 Minimum Variance

It was stated in Sect. 3.1 that S/n is the unique minimum variance unbiased
estimator of the parameter p of the binomial distribution. In this section we
will first discuss this important property briefly and then prove the statement.
In general, an unbiased estimator T of a parameter θ is called a minimum
variance unbiased estimator if no other unbiased estimator has smaller
variance for any distribution under discussion, so that T minimizes the
variance, among unbiased estimators, simultaneously for every distribution
of whatever family has been assumed. This sounds like a splendid property for
an estimator to have, and a minimum variance unbiased estimator is ordinarily
a good one to choose. Note, however, that no such estimator need exist.
Furthermore, even if one does, nothing in the definition precludes the
possibility that some other estimator, though biased, has much smaller mean
squared error. Also, a minimum variance unbiased estimate is sometimes
smaller than the smallest possible value of the parameter, or larger than the
largest (Problem 4). When this happens, the estimate seems clearly unreason-
able, and replacing it by the smallest or largest possible value of the parameter
as appropriate obviously reduces estimation error, though it makes the esti-
mator biased.
Thus as a concept of optimality in estimation, minimum variance un-
biasedness is not completely satisfactory. But neither is any other concept.
Mean squared error cannot ordinarily be minimized for more than one
distribution at a time (Problem 5). However, seeking a truly satisfactory
concept is taking the point estimation problem too seriously. Formal versions
of it do not correspond at all closely to any real problem of inference. There
is, after all, no need to give just a single estimate, "optimal" or not. (In
making actual decisions, treating a decision as an estimate or vice versa is
more confusing than clarifying.) A full-fledged inference must somehow re-
flect the uncertainty in the situation. An estimate is just an intuitive first step.
PROOF. Suppose that two functions of S are both unbiased estimators of p,
and let Δ(S) denote their difference. Then Δ(S) has expected value 0 for all p,
that is,

    E_p[Δ(S)] = Σ_{s=0}^{n} Δ(s) P_p(S = s) = Σ_{s=0}^{n} Δ(s) C(n, s) p^s (1 − p)^(n−s) = 0

for all p. Dividing by (1 − p)ⁿ for p ≠ 1 and replacing p(1 − p)⁻¹ by y gives
the polynomial equation in y

    Σ_{s=0}^{n} Δ(s) C(n, s) y^s = 0    (3.14)

for all y ≥ 0. But a polynomial vanishes identically (in fact, at more points
than its degree) if and only if all coefficients vanish. Hence

    Δ(s) C(n, s) = 0    (3.15)

for all s. Since C(n, s) > 0, it follows that Δ(s) = 0 for all s. This says that the
difference of two functions of S which are both unbiased estimators of the
binomial parameter p is always 0; accordingly, there is only one such function.
□
*The main part of this proof showed that a function of S having expected
value 0 for all p must be identically 0, that is, the binomial family of distri-
butions with n fixed is "complete." A family of distributions is called complete
if the only function having expected value 0 under every distribution in the
family is the function which is identically 0.
4 Hypothesis Testing
Null Hypothesis
The null hypothesis (called "null" to distinguish the hypothesis under test
from alternative possible hypotheses) is a statement about the distribution of
the observations. Here it might be that "the number of successes is binomial
with p ≤ 0.10," for example. As long as the binomial family is clearly under-
stood in the context of the problem, the statement might be given as simply
"p ≤ 0.10." It is customary to denote the null hypothesis by H₀.
A distribution of the observations is called a null distribution if it satisfies
the null hypothesis, and an alternative distribution otherwise. Although we
defined a null hypothesis as a statement about the distribution of the observa-
tions, we can also define it as the set of all distributions satisfying that state-
ment, that is, the set of all null distributions. It will be convenient to allow both
usages. An alternative hypothesis may be defined similarly as a set of alternative
distributions. If the null hypothesis completely specifies the null distribution
including all parameters, that is, the set contains only one particular distribu-
tion, then the null hypothesis is called simple. Otherwise, it is called composite.
An alternative hypothesis may also be either simple or composite.
Rejection Rule
The rejection rule is a criterion saying when the null hypothesis should be
rejected; it is "accepted" (see below) otherwise. The rule may depend on the
observations in any way, but must not depend on any unknown parameters.
It determines the rejection region, often called the critical region, given usually
as a range or region of values of a test statistic. A test statistic may be any
function of the observations. For example, here the rule might be "reject if
S ≤ 3," or "reject if |S/n − 0.3| > 0.10," when stated in terms of the test
statistics S and S/n. A test is said to be a one-tailed or a two-tailed test based on
a statistic S if it rejects in one or both tails of S, that is, it rejects for S outside
some interval but not for S inside the interval. Each end of the interval may be
closed or open, finite or infinite. (More complicated regions are occasionally
useful, but in this book the term "test statistic" will always imply a one- or
two-tailed test.) The least extreme value of a test statistic in the rejection
region is called its critical value. For instance, if the rejection region is S ≤ 3,
then 3 is the critical value of S. A two-tailed test has a critical value in each tail,
called the lower and upper critical values.
It is sometimes convenient to represent a test based on a random variable
X with observed value x by the critical function φ(x), 0 ≤ φ(x) ≤ 1 for all x.
The rejection region corresponds to φ(x) = 1, that is, those values x for which
φ(x) = 1, while the "acceptance" region corresponds to φ(x) = 0. When
randomization is considered (Sect. 5), the value of φ(x) is the probability that,
given the observed x, the test will choose to reject the null hypothesis, while
1 − φ(x) is the probability that it will not. Regions where the test may either
reject or "accept" thus correspond to values x such that 0 < φ(x) < 1. In
any case, if the distribution of X is F, the probability of rejection by a test with
critical function φ(x) is E_F[φ(x)].
In an actual application, when the observations are such that the value of the
test statistic lies in the critical region, it is customary to announce that H₀ is
rejected by the test, or that the set of observations is statistically significant
by the test, or simply that the result is significant. In the contrary case, one may
say that the null hypothesis is not rejected by the test, or that the set of obser-
vations is not statistically significant. Of course (Problem 8), a result which is
not statistically significant may still appear significant for practical purposes,
especially if the test is weak, and vice versa, especially if the test is strong
(technically, powerful-see Sect. 4.2).
If a null hypothesis is rejected by a "reasonable" statistical test, then one is
presumably justified in concluding, at least tentatively in the absence of other
evidence, that the null hypothesis is false. (Unfortunately, this statement is
either very vague or merely a definition of "reasonable.") If the null hypo-
thesis is not rejected, this does not ordinarily justify a conclusion that the null
hypothesis is true. We will find it convenient to say that the null hypothesis is
"accepted" whenever it is not rejected, but will use quotation marks to
emphasize that "accepting" the null hypothesis does not justify concluding
it is true in the same sense that rejecting it justifies concluding it is false.
4.2 Errors
When a statistical test is performed, two kinds of error are possible. We may
reject the null hypothesis when it is true, making a Type I error (or error of the
first kind). On the other hand, we may "accept" (fail to reject) the null hypo-
thesis when it is false, making a Type II error (or error of the second kind). The
types of errors and correct decisions, which cover all four possibilities, are
shown in the diagram below.
                        Conclusion
               "Accept" H₀          Reject H₀
H₀ true      correct decision      Type I error
H₀ false     Type II error         correct decision

Consider, for example, the test which rejects when S ≤ 3 in n = 10 Bernoulli
trials. The probability that this test rejects is

    α(p) = P_p(S ≤ 3) = Σ_{i=0}^{3} C(10, i) p^i (1 − p)^(10−i)    (4.1)
which may be looked up in Table B. The upper curve in Fig. 4.1, labeled S = 3,
shows a graph of this probability for all values of p. If the null hypothesis under
test is H₀: p = 0.5, then the probability of a Type I error, rejecting H₀ when it
is true, is simply α(0.5) on the curve, which is 0.172. On the other hand, if we
test the null hypothesis H′₀: p ≥ 0.6 with this same rejection region, then the
probability of a Type I error is given by that part of the curve α(p) in Fig. 4.1
where p ≥ 0.6. Since the curve never rises above 0.055 for p ≥ 0.6, this prob-
ability is never more than 0.055.
The probability of a Type II error is calculated in a similar manner. Since
the null hypothesis is "accepted" whenever it is not rejected, the probability of
"acceptance" is one minus the probability of rejection. Hence the probability
of "acceptance" in the example of the previous paragraph is given by

    1 − α(p) = 1 − P_p(S ≤ 3).

Figure 4.1 [graph of the rejection probability α(p) of (4.1) against p, 0 ≤ p ≤ 1; the curve is labeled S = 3]
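The values of α(p) quoted in this section are easy to recompute from (4.1); a minimal sketch (Python, standard library only):

    from math import comb

    def alpha(p, n=10, c=3):
        # alpha(p) = P_p(S <= c), Eq. (4.1)
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(c + 1))

    print(round(alpha(0.5), 3))   # 0.172, the Type I error for H0: p = 0.5
    print(round(alpha(0.6), 3))   # 0.055
    # For H0': p >= 0.6 the rejection probability never exceeds 0.055,
    # since alpha(p) decreases as p increases:
    print(round(max(alpha(0.6 + 0.01 * k) for k in range(41)), 3))   # 0.055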
Level
Ordinarily, for any statistical test, the probability of rejecting the null
hypothesis depends on the distribution of the observations. It can be cal-
culated as long as this distribution is known. If the null hypothesis is true, this
probability may be a fixed number or may still depend on the distribution of
the observations. If the null hypothesis is simple, like H₀: p = 0.5 above, then
the probability of rejecting the null hypothesis when it is true necessarily has
only one value. If the null hypothesis is composite, then the probability of
rejecting the null hypothesis may or may not be the same for all distributions
allowed by the null hypothesis. For instance, in the previous example, the
probability of rejecting the null hypothesis H′₀: p ≥ 0.6 when true depends on
p.
If for a test of a particular null hypothesis, simple or composite, the
probability of a Type I error is less than or equal to some selected value α for
all null distributions, then the test is said to have level α or to be at level α
(0.05 and 0.01 are popular values for α). The level may be described as con-
servative to emphasize that this kind of level is meant rather than nominal or
exact level (defined below). Thus the test above, which rejects the null hypo-
thesis H′₀: p ≥ 0.6 when S ≤ 3, has level 0.10. It also has level 0.08 and level
0.06. It does not quite have level 0.05, however, because there is a distribution
satisfying the null hypothesis and giving a probability of rejection greater
than 0.05; for example, p = 0.6 gives probability 0.055 of rejection. If a test
has level α, it is natural to say also that the level of the test is α. The word "the"
here is somewhat misleading, though not seriously so in practice, since, as the
foregoing example illustrates, the level of a test is not unique.
If a test of the null hypothesis H′₀: p ≥ 0.6 is desired at level 0.10, the test
rejecting when S ≤ 3 might be selected. The number 0.10 would then be called
the "nominal level" of the test; 0.055 is called the "exact level" because it is
the maximum probability of rejection for p ≥ 0.6. In general, the nominal level
of a test is the level which one set out to achieve, while the exact level is the
actual maximum probability of rejection under the null hypothesis. The
exact level is the smallest conservative level the test actually has. It would
perhaps be simpler mathematically to define only exact levels, but conservative
and nominal levels are needed in practice and must be discussed, and it is
convenient to have names for them.
The term size is sometimes used instead of level, but it will not be used in
this book. Connotatively, "size" seems to place more emphasis on the rejection
region, and less on the null hypothesis, than the word "level." A test is some-
times called valid at level α if it has level α; this terminology is especially useful
when a null hypothesis has been changed or broadened. Sometimes significance
level is used instead of simply level, to distinguish it from confidence level
which will be defined in Sect. 6.
Interpretation of Level
Power
The probability of rejection when the null hypothesis is false also depends on
the distribution of the observations, and is called the power of the test. Power
is evaluated using an alternative distribution. If the alternative is simple, the
power is a single number; otherwise it is a function.
Consider again, for instance, the test above for n = 10, which rejects H′₀:
p ≥ 0.6 when S ≤ 3. Its power against the alternative p < 0.6 is given by the
function (4.1) for values of p < 0.6. The power curve is then represented by
that portion of the curve in Fig. 4.1 for which p < 0.6. Specifically, the
power of the test is 0.172 when p = 0.5; 0.650 when p = 0.3; 0.987 when p =
0.1; etc. Recall that the remaining portion of the curve, where p ≥ 0.6,
represents the probability of a Type I error, and the complements of the
ordinates for that portion where p < 0.6 represent the probability of a Type II
error. Clearly, the power is always one minus the probability of a Type II
error.
which provides large power against those alternatives which are of particular
interest because of their practical importance or ("subjective!") probability
of occurrence.
Power comparisons of different tests should ordinarily be made at the same
exact levels. Otherwise they may be seriously misleading, because the power of
a test can be increased by increasing the probability of a Type I error. Con-
fusion generally attends comparisons of tests which have the same nominal or
conservative levels but different exact levels.
Two tests, say A and B, are called equivalent if test A rejects if and only if
Test B rejects. Equivalent tests necessarily have the same exact level and the
same power against all alternatives, but the converse is not true (Problem 11).
Correspondingly, two test statistics are called equivalent if any test based on
either statistic is equivalent to a test based on the other. This holds whenever
the test statistics are strictly monotonically related (Problem 12).
4.3 One-Tailed Binomial Tests

The null hypothesis p ≥ 0.6 and the alternative p < 0.6 are each one-sided, in
an obvious sense. The test which rejects when S ≤ 3 is also called one-sided,
or one-tailed. Explicitly, it is called lower-tailed or left-tailed since the
rejection region is at the lower end of the range of the test statistic S.
More generally, suppose S is the number of successes in n Bernoulli trials
and we wish to test the null hypothesis p = p₀ or p ≥ p₀ against the alternative
p < p₀. One rule for testing either of these null hypotheses is to reject when
S ≤ s_c, where the critical value s_c is chosen so that the level of the test is some
preselected number α. Let s_c be the largest integer possible, subject to the
restriction that the left-tail probability P(S ≤ s_c) is less than or equal to α
when p = p₀. This critical value is easily found from Table B. Algebraically
it is the largest integer s_c for which

    P_{p₀}(S ≤ s_c) = Σ_{i=0}^{s_c} C(n, i) p₀^i (1 − p₀)^(n−i)    (4.2)

does not exceed the nominal level. (The subscript on P indicates that the
probability is to be computed for p = p₀.) Given s_c, a simple comparison of
the observed value of S with s_c determines whether the observations are
significant or not.
For the null hypothesis p = p₀, the exact level of this test is, of course, the
exact value of the probability in (4.2). Furthermore, it has the same exact level
for the null hypothesis p ≥ p₀ (with of course the same power against alter-
natives p < p₀). Intuitively, this is because testing the null hypothesis p ≥ p₀
against alternatives p < p₀ is the same as testing the "least favorable case"
p = p₀ against alternatives p < p₀. For the binomial distribution, this
intuition is correct (see Problem 13).
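The search for the critical value is mechanical; a minimal sketch (Python) that reproduces the working example of this chapter (n = 10, p₀ = 0.6, nominal level 0.10):

    from math import comb

    def lower_tail(c, n, p):
        # P_p(S <= c), the left-tail probability of (4.2)
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(c + 1))

    def critical_value(n, p0, alpha):
        # Largest s_c with P_{p0}(S <= s_c) <= alpha; -1 means never reject.
        sc = -1
        for c in range(n + 1):
            if lower_tail(c, n, p0) <= alpha:
                sc = c
        return sc

    sc = critical_value(n=10, p0=0.6, alpha=0.10)
    print(sc)                                  # 3
    print(round(lower_tail(sc, 10, 0.6), 3))   # exact level 0.055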
The power of this test, namely P_p(S ≤ s_c) where p is any value less than p₀,
is also easily found from Table B, and can be expressed algebraically as

    P_p(S ≤ s_c) = Σ_{i=0}^{s_c} C(n, i) p^i (1 − p)^(n−i)   for any p < p₀.    (4.3)

For the null hypothesis p = p₀ or p ≤ p₀ against the alternative p > p₀, the
corresponding rule is to reject when S ≥ s_c, where s_c is now the smallest
integer for which

    P_{p₀}(S ≥ s_c) = Σ_{i=s_c}^{n} C(n, i) p₀^i (1 − p₀)^(n−i) ≤ α    (4.4)
and again Table B can be used. For example, when n = 10 and the alternative
is p > 0.6, rejecting the null hypothesis p = 0.6 or p ≤ 0.6 when S ≥ 9 gives
an exact level of 0.046. This is specifically called an upper-tailed or right-tailed
test procedure. In fact, these testing problems and procedures correspond to
the one-sided binomial testing problems and procedures discussed earlier by
the simple exchange of the definitions of "failure" and "success." For instance,
previously we tested the null hypothesis that the probability of success is equal
to 0.6 against the alternative that it is less than 0.6. Using the rule "reject when
the number of successes is 3 or fewer in 10 trials," the exact level was 0.055.
This is precisely the same as testing the null hypothesis that the probability of
failure is equal to 0.4 against the alternative that it is greater than 0.4, by
rejecting if there are 7 or more failures in the 10 trials. Since "success" and
"failure" are completely arbitrary designations anyway, we can rename the
failures "successes." This is therefore a right-tailed test, and properties of the
two types correspond, with the exact level again 0.055.
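This correspondence can be checked directly; a quick sketch (Python, same conventions as the sketches above):

    from math import comb

    def pmf(i, n, p):
        return comb(n, i) * p**i * (1 - p)**(n - i)

    n = 10
    # Lower-tailed test of success probability 0.6: reject when S <= 3.
    lower = sum(pmf(i, n, 0.6) for i in range(4))
    # Relabeled version: "success" probability 0.4, reject when S >= 7.
    upper = sum(pmf(i, n, 0.4) for i in range(7, n + 1))
    print(round(lower, 4), round(upper, 4))    # both 0.0548
    # Upper-tailed test of p = 0.6 rejecting when S >= 9:
    print(round(sum(pmf(i, n, 0.6) for i in range(9, n + 1)), 3))   # 0.046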
4.4 P-values
Definition of P-values
For some well-behaved problems, the P-value as a tail probability and the
critical level as the level of just significance can be sensibly defined and are
equal. This value can therefore be reported and provides more information
than a statement that the observations are or are not significant at a pre-
selected level. It is possible in some problems, even by some kinds of "opti-
mum" tests, for a set of observations to be significant at one level but not at a
larger level, for instance at the 0.01 level but not at the 0.05 level. Then rejection
regions at different levels would not be nested, and P-values and critical levels
would be difficult to interpret, even if they were defined. No such situations
arise in this book; Chernoff [1951] gives an example which illustrates the
pathology.
Interpretation of P-values
as 6 times or as large as 12 times the P-value for P-values between 0.001 and
0.05, although it is seldom less than 3 times or more than 30 times the P-value
(according to Good [1958]). These figures are rough, but based on consider-
able though unpublished evidence. See also Jeffreys [1961] and Lindley
[1957]. For interesting examples with discussion, see Good [1969], Efron
[1971], and Pratt [1973]. In this framework then, if the value of a test
statistic is just significant at the 0.05 level, there is still a substantial chance
(at least 0.15) that the null hypothesis is nearly true. This suggests that bare
significance at the 0.05 level, a P-value just below 0.05, is at best not a very
strong justification for concluding that the null hypothesis is appreciably false.
Of course, significance substantially beyond the 0.05 level is another matter.
We note that this illustrates again the disadvantage of a mere statement of
significance or nonsignificance.
In the special case of a one-tailed test of a truly one-sided null hypothesis,
such as p ≤ p₀ (not p = p₀) or a multiparameter analogue, the P-value may
often be expected to be close to the posterior probability of the null hypo-
thesis (see, e.g., Pratt [1965]). It can be argued that in all other cases, both
P-values and tests should be interpreted with great caution.
When the exact level nearest the nominal level is used, what is the border-
line level of significance? Suppose, for example, that S = 3 is observed. Then
the exact P-value is 0.0548. The probability P_{0.6}(S ≤ 2) = 0.0123 will be
called the next P-value. The average of these two numbers is (0.0548 +
0.0123)/2 = 0.0336, called the mid-P-value; this is the borderline level of
significance since a nominal level greater than 0.0336 is nearer to 0.0548 than
to 0.0123, while one less is nearer to 0.0123.
For a test of the null hypothesis p ≥ p₀ in a binomial problem, by the
same rule, S = s is significant at nominal levels greater than, and not sig-
nificant at nominal levels smaller than, the mid-P-value

    [P_{p₀}(S ≤ s) + P_{p₀}(S < s)]/2.    (4.6)

In general, as long as the possible outcomes x can be ordered according to
how extreme they are, the mid-P-value is defined as the arithmetic average of
the exact P-value and the next P-value, and is the borderline level of signific-
ance² or critical level according to the rule of "exact level nearest the nominal
level." Here the next P-value, also called the tail probability beyond x, is the
maximum probability under null distributions of an outcome more extreme
than x (see Lancaster [1952] for further discussion).
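A sketch of the computation (Python), reproducing the numbers above for n = 10, p₀ = 0.6, and S = 3 observed:

    from math import comb

    def pmf(i, n, p):
        return comb(n, i) * p**i * (1 - p)**(n - i)

    def lower_tail(c, n, p):
        return sum(pmf(i, n, p) for i in range(c + 1))

    n, p0, s = 10, 0.6, 3
    exact_p = lower_tail(s, n, p0)       # P_{p0}(S <= s)
    next_p = lower_tail(s - 1, n, p0)    # P_{p0}(S < s), the next P-value
    mid_p = (exact_p + next_p) / 2       # Eq. (4.6)
    print(round(exact_p, 4), round(next_p, 4), round(mid_p, 4))
    # 0.0548 0.0123 0.0335 (0.0336 when the rounded tail values are averaged)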
Summary Recommendations
As mentioned earlier, reporting the P-value for the outcome observed gives
more information than a report of simply significant or not significant, and in
effect permits everyone to choose his own significance level.
For test statistics with discrete null distributions, however, there is still the
question of whether to report the exact P-value or the mid-P-value when they
differ appreciably. This seems a matter of taste, as long as it is made clear
which is being done. If there is a custom for the particular type of problem, this
should be followed. For some audiences, it may be desirable to report both the
exact and next P-values (the tail probabilities including and beyond x).
Approximations based on continuous distributions generally approximate
the mid-P-value rather than the exact P-value unless a "correction for
continuity" is made. This sometimes makes the mid-P-value a little more
convenient to compute. Some people, especially if they believe that the P-value
can be given a precise interpretation, may feel that some one number should
be chosen for the purpose and that there are fundamental grounds for choice
between the exact P-value and the mid-P-value (or something else). See
the discussion of randomized P-values in Sect. 5 and, for instance, Lancaster
[1961].
² Whether the outcome would be significant at precisely this nominal level depends on whether
one chooses the larger or smaller when the nominal level is halfway between the two nearest
exact levels.
4.5 Two-Tailed Test Procedures and P-values

In the binomial problem, how should we test the null hypothesis p = 0.6
against the alternative p ≠ 0.6? This alternative is a combination of the two
alternatives p < 0.6 and p > 0.6, and is two-sided in an obvious sense. A test
might be performed by combining the left-tail and right-tail tests discussed
previously. Thus, for n = 10, one might reject when S ≤ 3 and also when
S ≥ 9. Since the two tails are mutually exclusive, the exact level of this test can
be computed as the sum of the two tail probabilities under the null hypo-
thesis p = 0.6. These two tail probabilities, the exact levels of the two one-
tailed tests, are respectively 0.055 and 0.046. The exact level of this two-tailed
test is thus

    P_{0.6}(S ≤ 3 or S ≥ 9) = P_{0.6}(S ≤ 3) + P_{0.6}(S ≥ 9)
                            = 0.055 + 0.046 = 0.101.

Similarly, given that the nominal levels of the two individual tests were both
0.10, the nominal level of this two-tailed test is 0.20.
In general, it is always true that combining individual tests at levels α₁,
α₂, ... for the same null hypothesis H₀ gives a test at level α₁ + α₂ + ⋯ for H₀.
The exact level of the combined test is the sum of the exact levels of the in-
dividual tests if the null hypothesis is simple (or the same distribution is
"least favorable" for all tests) and the tests are mutually exclusive (that is, no
possible set of observations is rejected by more than one of the tests). Other-
wise the exact level may be less than (but cannot be more than) the sum
(Problem 19).
For a binomial null hypothesis H₀: p = p₀, the standard two-tailed test at
level α rejects when either one-tailed test at level α/2 rejects. Thus, specifically
it rejects if S ≤ s_l or if S ≥ s_u, where s_l is the largest integer which satisfies

    P_{p₀}(S ≤ s_l) ≤ α/2    (4.7)

and s_u is the smallest integer which satisfies

    P_{p₀}(S ≥ s_u) ≤ α/2.    (4.8)
While this test has nominal level α, its exact level is the sum of the actual
values of the left-hand sides of Eqs. (4.7) and (4.8). Similarly, its power
function is the sum

    P_p(S ≤ s_l) + P_p(S ≥ s_u),

calculated under any alternative p ≠ p₀.
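A sketch of this construction (Python; α = 0.20 is chosen so that the result matches the example above, with 0.10 nominal in each tail):

    from math import comb

    def pmf(i, n, p):
        return comb(n, i) * p**i * (1 - p)**(n - i)

    def standard_two_tailed(n, p0, alpha):
        lo = [sum(pmf(i, n, p0) for i in range(c + 1)) for c in range(n + 1)]
        hi = [sum(pmf(i, n, p0) for i in range(c, n + 1)) for c in range(n + 1)]
        s_l = max((c for c in range(n + 1) if lo[c] <= alpha / 2), default=-1)
        s_u = min((c for c in range(n + 1) if hi[c] <= alpha / 2), default=n + 1)
        exact = (lo[s_l] if s_l >= 0 else 0.0) + (hi[s_u] if s_u <= n else 0.0)
        return s_l, s_u, exact

    s_l, s_u, exact = standard_two_tailed(n=10, p0=0.6, alpha=0.20)
    print(s_l, s_u, round(exact, 3))           # 3 9 0.101

    def power(p, n, s_l, s_u):
        # P_p(S <= s_l) + P_p(S >= s_u)
        return (sum(pmf(i, n, p) for i in range(s_l + 1))
                + sum(pmf(i, n, p) for i in range(s_u, n + 1)))

    print(round(power(0.3, 10, s_l, s_u), 3))  # 0.650 at the alternative p = 0.3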
It is not automatic that a test against a combined alternative should be
constructed by combining the tests which would be chosen for the individual
alternatives, nor even that a test against a two-sided alternative should
be two-tailed in form just because a one-tailed test would have been used
with each of the one-sided alternatives. It can be shown, however, that
two-tailed tests are (in a sense to be made precise later) the only ones that need
be considered against two-sided alternatives in binomial problems and indeed
in most practical problems. (See Sect. 8.3.)
Even if attention is restricted to two-tailed tests, the critical values s_l and
s_u can be chosen in ways other than that above. They may be critical values
for any one-tailed levels α₁ and α₂ such that α₁ + α₂ ≤ α. In other words, they
need only satisfy

    P_{p₀}(S ≤ s_l) + P_{p₀}(S ≥ s_u) ≤ α.    (4.9)
Various possibilities, including an "optimality criterion" which is really a
convention, will be discussed later, when the properties of two-tailed tests are
investigated in Sect. 8. It is difficult, however, to give a convincing justification
in the frequentist framework for choosing a particular two-tailed test among
those at level α.
For two-tailed tests, then, we have the problem of selecting not only a
significance level α, but also the upper and lower critical values, or α₁ and α₂,
for a given significance level. For the latter, like the former, except by adoption
of some convention, the usual methodology of hypothesis testing tells us only
to "look at the power functions and make a choice," but sheds no light on how
to do so. It is easier said than done.
For two-tailed, as for one-tailed, tests, we can avoid the problem of choosing a
significance level by reporting a P-value. Unfortunately, however, the very
definition of the P-value for two-tailed tests presents a problem equivalent to
that of choosing α₁ and α₂ for a given significance level. (This problem has no
counterpart for one-tailed tests.)
One possibility is to report the one-tailed P-value even for a two-tailed test,
and remark that the two-tailed P-value, while depending on what kind of two-
tailed critical region would have been formed, is presumably about twice as
large as the one-tailed P-value reported. Some people go further and claim
that P-values are not appropriate in two-sided situations, but that seems an
severely asymmetric null distributions, just those where the choice of pro-
cedure matters most.
We illustrate these procedures in the binomial case with n = 10 and H₀:
p = 0.6. For convenience, the point probabilities for S under H₀ are given in
Table 4.1.

Table 4.1

    s               0      1      2      3      4      5      6      7      8      9     10
    P_{0.6}(S = s)  0.000  0.002  0.010  0.043  0.111  0.201  0.251  0.215  0.121  0.040  0.006
Suppose that S = 3 is observed. The appropriate one-tailed P-value is
lower-tailed, and P_{0.6}(S ≤ 3) = 0.055. One procedure is simply to report this,
with the comment that the two-tailed P-value is presumably about 0.110.
This is the borderline level of significance if the standard two-tailed test with
level α/2 in each tail is used. Since no upper tail probability equals 0.055, the
first procedure suggested above would add the next smaller upper-tail
probability, which is P_{0.6}(S ≥ 9) = 0.046, and report a two-tailed P-value
of 0.055 + 0.046 = 0.101. This is the exact level of the standard two-tailed test
at the borderline nominal level 0.110. It is also the borderline level of signific-
ance of a test based on the one-tailed P-value. The nearest attainable upper-
tail probability is the next smallest in this case and hence gives the same
result.
For the minimum likelihood procedure, the values of S with probability
smaller than that of S = 3 under the null hypothesis are 0, 1, 2, 9, and 10. Hence
the P-value is P_{0.6}(S ≤ 3) + P_{0.6}(S ≥ 9) = 0.101 again.
Since the mean, median, and mode of the null distribution each equal 6,
locating the two tails at equal distances from any of these also gives the same
result of 0.101. Using the midrange of 5, however, gives a P-value of P_{0.6}(S ≤ 3)
+ P_{0.6}(S ≥ 7) = 0.055 + 0.382 = 0.437. If the observed value of S had been
in the upper tail, this procedure with the midrange would have given a two-
tailed P-value smaller than twice the one-tailed P-value in this example since
the null distribution is skewed left here.
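The competing definitions are easily compared side by side; a sketch (Python) reproducing the values just found for S = 3 observed, n = 10, H₀: p = 0.6:

    from math import comb

    def pmf(i, n, p):
        return comb(n, i) * p**i * (1 - p)**(n - i)

    n, p0, s = 10, 0.6, 3
    probs = [pmf(i, n, p0) for i in range(n + 1)]
    one_tailed = sum(probs[: s + 1])               # lower tail, 0.055

    # (a) double the one-tailed P-value
    doubled = 2 * one_tailed                       # 0.110
    # (b) add the largest opposite-tail probability not exceeding it
    opposite = max((sum(probs[c:]) for c in range(n + 1)
                    if sum(probs[c:]) <= one_tailed), default=0.0)
    added = one_tailed + opposite                  # 0.055 + 0.046 = 0.101
    # (c) minimum likelihood: total probability of all outcomes no more
    #     likely under H0 than the one observed
    min_lik = sum(q for q in probs if q <= probs[s])   # 0.101

    print(round(doubled, 3), round(added, 3), round(min_lik, 3))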
We mention one other procedure that could be used for statistics with a
finite range. It places an equal number of possible values in each tail if the
distribution is discrete. If the possible values are equally spaced, this also makes
the tails equal in length, and this is the procedure used for continuous
distributions also. Except for discrete distributions with unequally spaced
values, which are unusual in practice, this procedure is equivalent to locating
the tails at equal distances from the midrange, as in the example of the pre-
vious paragraph. Unfortunately, this procedure not only is restricted to
statistics with a finite range, but also, even when defined, can give absurd
results if the null distribution is highly skewed. For example, in the binomial
case with H₀: p = 0.1, suppose that S = 7 is observed with n = 10. The one-
tailed P-value is then P_{0.1}(S ≥ 7) = 0.000. When an equal number of extreme
values are placed in the lower tail, the two-tailed P-value becomes P_{0.1}(S ≤ 3)
+ P_{0.1}(S ≥ 7) = 0.987. Even though S = 7 strongly contradicts H₀, a
P-value of 0.987 would lead to the conclusion that H₀ is highly "acceptable."
Intuitively, the extent to which the data contradict a null hypothesis should
not change sharply if the hypothesis is changed slightly. However, for all of
these procedures except doubling the one-tailed P-value, a slight change in the
null hypothesis can lead to a sharp change in the P-value because the P-value
is a discontinuous function of the null hypothesis (Problem 23). The authors
consider this counterintuitive property less a flaw in the methods of forming
two-tailed P-values than a symptom of the fundamental difficulty of inter-
preting P-values as measuring the support or contradiction of the null
hypothesis. The extent of contradiction depends in part on the congruence
of the data with plausible alternatives, whereas the P-value depends only on the
null distribution.
In summary, a reasonable two-tailed P-value can be obtained in most
situations by either doubling the one-tailed P-value, or adding to it the largest
attainable probability not exceeding it in the other tail. The minimum like-
lihood method may also be satisfactory. The practice of doubling the one-
tailed P-value is perhaps the most popular, but that may be more the result of
habit than a thoughtful consideration of the merits. When all is said and done,
however, we find the game of defining a precise two-tailed P-value not worth
the candle. If a single procedure is to be recommended as appropriate for two-
tailed tests based on any distribution and any outcome, we prefer reporting
the one-tailed P-value and the direction of the observed departure from the
null hypothesis. The primary basis for this recommendation is that the P-
value then retains its clearest interpretation. Further, when the one-tailed
P-value is small, the sample outcome is extreme in a particular direction and a
one-sided conclusion will usually be desired. On the other hand, if the one-
tailed P-value is moderate or not small, the null hypothesis will be "accepted"
whether it is doubled or not. In borderline cases, the appropriate conclusions
require careful thought, not blind adherence to some rule. Careful thought is
perhaps best encouraged by reporting a one-tailed P-value with suitable
commentary attached. For further discussion, see Gibbons and Pratt [1975].
The recommendation for reporting the one-tailed P-value even with a two-
tailed test is further reinforced when we consider the test procedures which
allow a greater variety of conclusions to be reached, as our next topic.
able" two-tailed test, while one may draw essentially no conclusion if it is not
rejected. This two-conclusion interpretation is indeed appropriate in some
situations. For example, rejection may amount to deciding from a preliminary
experiment that further study is worthwhile. Alternatively, it may mean that a
simple model is inadequate, in circumstances where it is not necessary to
conclude how the model might be made adequate.
In many situations, however, more definite conclusions are desirable. For
instance, when the null hypothesis p = p₀ is rejected by a two-sided binomial
test, we might want to conclude that p < p₀ if S ≤ s_l and that p > p₀ if S ≥ s_u.
Then there would be three possible conclusions, namely p < p₀, p > p₀, and
"no conclusion" (which corresponds to "accepting" the null hypothesis).
Table 4.2 gives the probability of drawing each conclusion in each kind of
situation where it is erroneous. These probabilities are bounded by the one-
tailed levels α₁ and α₂. For instance, if p > p₀, the probability of concluding
that p < p₀ depends on p but is less than α₁. There are no entries in the third
column because "accepting" the null hypothesis is regarded as drawing no
conclusion and hence cannot be erroneous. No matter what the true situation,
a two-tailed test, with this three-conclusion interpretation, leads to an er-
roneous conclusion with probability at most α = α₁ + α₂, the ordinary two-
tailed significance level.
Now suppose we modify the test procedure so that instead of concluding
that p < p₀ when S ≤ s_l, we conclude that p ≤ p₀. This leads to the prob-
abilities of erroneous conclusions given in Table 4.3. No matter what the true
situation, the probability of an erroneous conclusion is now at most the
larger of α₁ and α₂.
For example, if the two one-tailed tests each have level 0.05, Table 4.3
shows that this two-tailed test procedure will lead to an erroneous con-
clusion with probability at most 0.05 (the one-tailed level). The procedure
of Table 4.2 permits a more refined conclusion in one case, but at the cost
of increasing the bound on the probability of an erroneous conclusion to 0.10
(the two-tailed level).
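As a hedged illustration of where the entries of Tables 4.2 and 4.3 come from, the sketch below (ours; the critical values s_l = 2, s_u = 9 for n = 10, p₀ = 0.6 are our hypothetical choices, each giving a one-tailed exact level below 0.05) computes the probability of each directional conclusion:

from math import comb

def binom_cdf(k, n, p):
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) for s in range(k + 1))

n, p0, s_l, s_u = 10, 0.6, 2, 9         # hypothetical one-tailed critical values

def p_less(p):                          # probability of concluding p < p0
    return binom_cdf(s_l, n, p)

def p_greater(p):                       # probability of concluding p > p0
    return 1 - binom_cdf(s_u - 1, n, p)

# at p = p0 each erroneous conclusion occurs with its one-tailed exact level
print(round(p_less(p0), 4), round(p_greater(p0), 4))    # 0.0123 0.0464
# for p > p0, concluding p < p0 is still erroneous, with smaller probability
for p in (0.7, 0.8, 0.9):
    print(p, round(p_less(p), 4))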
If we would be just as happy to conclude that p ≤ p₀ as p < p₀, the second
procedure would be much better than the first because of its much lower error
rate. One might add p₀ to the upper-sided conclusion instead of, or in addition
to, the lower-sided, making the first two conclusions p < p₀ and p ≥ p₀, or
(symmetrically) p ≤ p₀ and p ≥ p₀. The probability that the procedure will
lead to an erroneous conclusion is still at most the larger of α₁ and α₂ in each
case, although the appropriate tables will differ somewhat from Table 4.3
(Problem 24). There is also a procedure with a similar property allowing all
five conclusions mentioned above (Problem 25).
The validity of Tables 4.2 and 4.3, and thus of the alternative interpreta-
tions of two-tailed tests considered here, follows from the fact that the prob-
ability that S ≥ s_u is larger when p = p₀ than when p < p₀, and similarly in
the other tail. For all two-tailed tests which are used in practice, this fact and
consequently Tables 4.2 and 4.3 remain valid when S is replaced by the test
statistic and p by a suitable parameter θ. Thus the corresponding alternative
interpretations of two-tailed tests are always valid in practice.
In summary, when reporting a conclusion from a two-tailed test, a con-
clusion at the appropriate one-tailed level may be more descriptive of the
true probability of erroneous rejection than the two-tailed level, unless it is
clear that rejection requires the conclusion θ ≠ θ₀ or one of the conclusions
θ < θ₀ and θ > θ₀, not θ ≤ θ₀ or θ ≥ θ₀. From this point of view, a one-
tailed P-value is also more descriptive even in the two-tailed test situation.
This further suggests the desirability of reporting a one-tailed P-value so that
when a definite conclusion rather than a P-value is required, the choice of the
two-tailed procedure which best fits the purposes and problem at hand is left
to the ultimate decision-maker.
exceed the nominal level α. In reporting whether the observations are signifi-
cant, the exact level might be stated instead of, or in addition to, the nominal
level.
Consider, for example, a lower-tailed binomial test with n = 10, p₀ = 0.6
and α = 0.10. The "conservative" procedure has critical value s_c = 3 with an
exact level of 0.055. If the rejection region could be enlarged without increas-
ing the exact level above 0.10, the power would increase. However, the next
possibility is s_c = 4 and P0.6(S ≤ 4) = 0.166. If we reject when S ≤ 4, the
exact level increases to 0.166, which considerably exceeds the nominal level of
0.10. Nevertheless, the conservative test is too conservative, and even if the
critical value is chosen to give the exact level nearest the nominal level, the test
remains the same. The exact level 0.055 is far smaller than we would like,
while 0.166 is far larger. What then shall we do?
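The difficulty is easy to see by enumerating the attainable exact levels; a minimal sketch (ours) for the example above:

from math import comb

def binom_cdf(k, n, p):
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) for s in range(k + 1))

n, p0 = 10, 0.6
for c in range(n + 1):                  # exact level of the test rejecting S <= c
    print(c, round(binom_cdf(c, n, p0), 3))
# the attainable levels jump from 0.055 (c = 3) to 0.166 (c = 4), so no
# nonrandomized lower-tailed test has exact level equal to the nominal 0.10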
There is a definite theoretical answer to this question, but unfortunately it
is not a satisfactory practical alternative because it introduces an irrelevant
random variable. This is a procedure called a randomized test. Consider again
the binomial problem, but now suppose that n = 6 and we wish to test H₀:
p ≥ 0.5 versus H₁: p < 0.5, at the level α = 6/64. (This example is used because
it leads to simpler arithmetic than the previous one. The ideas are the same for
both.) For p = 0.5, n = 6, the binomial probabilities are given in Table 5.1.
We naturally plan to reject when S = 0. If we reject when S = 1 as well, the
exact level is 7/64, larger than the value 6/64 selected for α. However, if we reject
only when S = 0, the exact level is only 1/64. How can we enlarge the rejection
region in order to increase the power? A good solution might appear to be to
reject the null hypothesis when S = 1, but not when S = O. This test has
greater power for most values of p < 0.5 than the test rejecting only when
S = O. This is not an appropriate solution, however, since another procedure
is clearly superior, as will be shown shortly.
The respective power functions for any p are

α(p) = (1 − p)^6 for the test rejecting only when S = 0, (5.1)
α(p) = 6p(1 − p)^5 for the test rejecting only when S = 1. (5.2)
These two functions are graphed in Fig. 5.1. For p very small, a case where
rejection is especially desirable, the power of the test rejecting when S = 1 only
is smaller than for S = 0 only, and in fact decreases to zero, while the power of
the test rejecting only when S = 0 increases to one. From (5.1) and (5.2), the
power of the S = 1 only test is greater than the power of the S = 0 only test for
all p > 1/7 (Problem 26).
Table 5.1 Binomial probabilities for n = 6, p = 0.5

s           0     1     2      3      4      5     6
P0.5(S=s)   1/64  6/64  15/64  20/64  15/64  6/64  1/64
[Figure 5.1: Power functions, plotted against p, of the test rejecting only when S = 0, the test rejecting only when S = 1, and the randomized test of Sect. 5.1; H₀ is false for p < 0.5 and true for p ≥ 0.5.]

A test which is a combination of these two tests and
has power everywhere greater than either of them would be desirable; this can
be accomplished using a randomized test procedure.
Specifically, consider a test which rejects when S = 0 is observed, rejects
with probability 5/6 when S = 1 is observed, and "accepts" otherwise. For
instance, when S = 1 we might roll a fair die, reject if the number of spots is less
than 6, and "accept" if it is equal to 6. This procedure makes the probability
of a Type I error when p = 0.5 equal to (1/64) + (5/6)(6/64) = 6/64 = α; for
general p its probability of rejection is α(p) = (1 − p)^6 + 5p(1 − p)^5.
This function is also plotted in Fig. 5.1 for all p. The figure shows that this
probability is smaller than 6/64 for all p > 0.5. It also shows that the randomized
test has power everywhere greater than either of the other two tests, and has
smaller probability of a Type I error than the test which rejects when S = 1
only, except at p = 0.5 where the probabilities are the same. Later we will
show that similar properties hold more generally.
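The three power functions can be tabulated directly; a small sketch (ours) using the expressions (5.1), (5.2), and the mixture above:

# power functions in the n = 6 example: reject S = 0 only, S = 1 only,
# and the randomized combination of Sect. 5.1
def power_S0(p):
    return (1 - p)**6                   # (5.1)

def power_S1(p):
    return 6 * p * (1 - p)**5           # (5.2)

def power_rand(p):
    return power_S0(p) + (5 / 6) * power_S1(p)

for p in (0.05, 0.1, 0.3, 0.5, 0.7):
    print(p, round(power_S0(p), 4), round(power_S1(p), 4), round(power_rand(p), 4))
# for p < 0.5 the randomized test beats both; at p = 0.5 it and the S = 1
# test both have level 6/64, and for p > 0.5 its level is smaller than
# that of the S = 1 test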
Whether or not one would ever use a randomized test in practice, the fact
that there is a randomized test everywhere better than the test rejecting when
S = 1 only shows that the latter test should not be used. In the next subsection,
we will explain what is meant by randomized procedures generally and why it
is useful to talk about them even though no claim is made that people do or
should carry out irrelevant randomizations.
The basic idea behind any randomized procedure is that we decide what action
to take, or what inference to make, or how to report the results, not only on the
basis of an observed random variable X as previously, but also at least in part
on the basis of some irrelevant random experiment. When we observe X = x,
we may reject the null hypothesis, or we may not. We may even decide at
random what to do. That is, we may reject the null hypothesis with probability
φ(x), say, and "accept" it otherwise, where φ(x) may be any value, 0 ≤ φ(x)
≤ 1. This kind of procedure is called a randomized test, and φ(x) is its critical
function, as already defined in Sect. 4.1. If such a test were carried out re-
peatedly, in the long run the null hypothesis would be rejected in a proportion
φ(x) of those cases in which the observation is x. The randomized test dis-
cussed in Sect. 5.1, which rejected always when S = 0 and with probability
5/6 when S = 1, and "accepted" otherwise, is given by φ(0) = 1, φ(1) = 5/6, and
φ(s) = 0 for s ≥ 2.
Ordinary (nonrandomized) tests are equivalent to randomized tests for
which φ(x) takes on only the values 0 and 1. Specifically, the nonrandomized
test with rejection region R is given by φ(x) = 1 for x in R and φ(x)
= 0 otherwise. Thus we reject the null hypothesis with probability 1 for all
x ∈ R and we "accept" it with probability 1 for all x ∉ R.
The randomization necessary to perform a randomized test could be
carried out by drawing a random variable U (independent of X) from the
uniform distribution between 0 and 1 and rejecting the null hypothesis if
U ≤ φ(x) but not otherwise. Such a U may be obtained, to any desired
accuracy, from a table of random digits or generated by a computer. Since the
event U ≤ φ(x) has probability φ(x), this procedure rejects the null hypo-
thesis with probability φ(x) when x is observed. We note incidentally that this
makes any randomized test based on X equivalent to a nonrandomized
test based on (X, U). Instead of carrying out the randomization, one might
report the value of φ(x) for the x observed.
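A minimal sketch (ours) of this uniform-U mechanism, with the critical function stored as a Python dictionary (a representation we choose for illustration):

import random

def randomized_test(s, phi, rng=random):
    """Reject (True) with probability phi[s] via an auxiliary uniform U."""
    u = rng.random()                    # U uniform on (0, 1), independent of S
    return u <= phi.get(s, 0.0)

phi = {0: 1.0, 1: 5 / 6}                # critical function of the Sect. 5.1 test
rejections = sum(randomized_test(1, phi) for _ in range(10000))
print(rejections / 10000)               # close to 5/6 in the long run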
In the general binomial problem, suppose that S is the number of successes
in n independent Bernoulli trials with probability p of success on each trial.
A two-tailed randomized test of H₀: p = p₀ rejects when S < s_l or S > s_u,
and rejects with probabilities φ_l and φ_u when S = s_l and S = s_u respectively.
Its exact level is

α = Pp₀(S < s_l) + φ_l Pp₀(S = s_l) + Pp₀(S > s_u) + φ_u Pp₀(S = s_u). (5.5)

There is an infinite number of two-tailed tests at a given exact level α. For
each α₁, 0 ≤ α₁ ≤ α, there is one given by the lower-tailed test at exact level α₁
and the upper-tailed test at exact level α₂ = α − α₁. The difficulty of choosing
among them was pointed out in Section 4.5.

3 We assume that s_l ≤ s_u, and φ_l + φ_u ≤ 1 if s_l = s_u, so that the upper and lower tails are
mutually exclusive.
Extending the earlier definition of the P-value to randomized tests gives what
we shall call the randomized P-value, which is uniformly distributed between
the exact P-value and the next P-value (Problem 32). If the P-value is to
measure the extent to which the data support the null hypothesis (see Sect. 4.4
for difficulties with this interpretation), a single number is presumably
desired for each possible outcome. The mid-P-value is suggested by the fact
that the distribution of the randomized P-value is symmetric about it (and
in particular has mean and median equal to it). The observations are sig-
nificant at nominal levels above this value and not significant at nominal levels
below it if the test chosen at nominal level α is the nonrandomized test having
greatest probability of agreeing with the randomized test at exact level α, or,
as mentioned in Sect. 4.4, that with exact level nearest to the nominal level.
This does not alter the recommendation made above to report the exact
P-value as well as the next P-value. Anyone who thinks that the mid-P-value
is of special interest can then compute it.*
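For concreteness, here is one way (ours; the conventions are as described above, for a lower-tailed test in the binomial example of Sect. 4) to compute the exact, next, and mid-P-values:

from math import comb

def binom_pmf(s, n, p):
    return comb(n, s) * p**s * (1 - p)**(n - s)

n, p0, s_obs = 10, 0.6, 3               # lower-tailed example: observed S = 3
exact_P = sum(binom_pmf(s, n, p0) for s in range(s_obs + 1))   # P(S <= 3)
next_P = exact_P - binom_pmf(s_obs, n, p0)                     # P(S <= 2)
mid_P = next_P + 0.5 * binom_pmf(s_obs, n, p0)                 # halfway between
print(round(exact_P, 3), round(next_P, 3), round(mid_P, 3))    # 0.055 0.012 0.034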
6 Confidence Regions
We now introduce the concept of confidence regions. This form of inference,
like estimation, refers to any parameter value, not just a preselected one, yet
also, like a significance test, provides an exact statement of error probability.
We shall lead up to confidence regions by way of tests to remove the mystery
of their construction and to emphasize the intimate relationship between the
two concepts.
[Table 6.1: lower and upper confidence limits p_l(s) and p_u(s) for s = 0, 1, ..., 5; the values for s = 2 are 0.112 and 0.754.]

Table 6.1 gives the confidence limits at level 0.10, for n = 5 (Problem 35). The upper confidence limits are found
following a procedure analogous to (6.1)–(6.3), for each possible value of S.
The lower limits are found similarly. Notice that half of the values in Table 6.1
can be obtained by subtraction, since p_l(s) = 1 − p_u(n − s) for any s (Prob-
lem 37). This example will be discussed further in Section 6.4.
The results in Table 6.1 are plotted as points on two curves in Fig. 6.1. The
principle of construction for this graph can be extended to any sample size;
this has been done by Clopper and Pearson [1934] to produce the well-known
Clopper-Pearson charts. The chart for confidence level 0.95 is reproduced as
Fig. 6.2 in order to illustrate the format. These charts provide a convenient
method for finding upper, lower, or two-sided confidence limits in binomial
problems. They also provide a graphic version of the foregoing derivation.
Consider the region between the curves for a given n. The horizontal sections
are the values of s/n for which each given p would be "accepted." The vertical
sections are the values of p which would be "accepted" for a given s/n. The
vertical section corresponding to the observed s/n covers the true p if and only
if the horizontal section corresponding to the true p covers the observed s/n.
The relations between tests and confidence regions and between their error
probabilities follow.
Though slightly less convenient than graphs, Table C is compact and
allows greater accuracy in finding binomial confidence limits. It includes five
common levels α in the range 0.005 ≤ α ≤ 0.100. For each s, for s/n < 0.50,
the tabulated values are n times the confidence limits, and hence are simply
divided by n to obtain the confidence limits. For s/n > 0.50, Table C is
entered with 1 − (s/n) and lower and upper are interchanged; the correspond-
ing entries are then divided by n and subtracted from 1 to find the confidence
limits. The values s = 0 and s = n are special cases, as explained in the table.
To illustrate the use of Table C, consider n = 5 and s = 2. Then s/n = 0.40,
and the table entries for α = 0.10 are 0.561 and 3.77. These numbers are
divided by 5 to get 0.112 and 0.754 as lower and upper limits, which agree
(except for rounding) with Table 6.1.

Figure 6.1 Lower and upper 90% confidence limits for p when n = 5.

Figure 6.2 Chart providing confidence limits for p in binomial sampling, given a
sample fraction Y/n, confidence coefficient 1 − 2α = 0.95. The numbers printed along
the curves indicate the sample size n. If for a given value of the abscissa Y/n, L and U
are the ordinates read from (or interpolated between) the appropriate lower and upper
curves, then P(L ≤ p ≤ U) ≥ 1 − 2α. (Adapted from Table 41, pp. 204–205, of E. S.
Pearson and H. O. Hartley, Eds. (1962), Biometrika Tables for Statisticians, Vol. I,
Cambridge University Press, Cambridge, with permission of the Biometrika Trustees.)
Binomial confidence limits are quantiles of the beta distribution. They
are also ratios of linear functions of quantiles of the F distribution. The
approximation at the end of Table C results from a transformation of the F
distribution derived from the cube root transformation of the chi-square
distribution (see Wilson and Hilferty [1931], Paulson [1942], Camp [1951]
and Pratt [1968]).
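The beta-quantile characterization gives a direct computation; a sketch assuming SciPy is available (scipy.stats.beta is a standard SciPy routine, but the wrapper binomial_limits below is ours):

from scipy.stats import beta

def binomial_limits(s, n, alpha):
    """One-sided binomial confidence limits at level 1 - alpha each,
    as beta quantiles (the construction behind Table C)."""
    lower = beta.ppf(alpha, s, n - s + 1) if s > 0 else 0.0
    upper = beta.ppf(1 - alpha, s + 1, n - s) if s < n else 1.0
    return lower, upper

print(binomial_limits(2, 5, 0.10))      # about (0.112, 0.754), as in Table 6.1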
The concept and derivation of confidence limits explained above for binomial
problems generalize directly to confidence regions for any parameter θ in any
situation. Suppose that we have a test, based on any random variable S, of the
null hypothesis θ = θ₀, for each value of θ₀ and level α. For any fixed θ₀ and α,
the test will reject for certain values of S and not for others. The set of values of
S for which the null hypothesis θ = θ₀ would be "accepted" will be denoted by
A(θ₀). Once S = s is observed, we are interested in the converse question: for
which values of θ₀ would the null hypothesis θ = θ₀ be "accepted"? The
set of all such values of θ₀ is a region C(s), defined by

C(s) = {θ₀ : s ∈ A(θ₀)}. (6.7)
If the test of the null hypothesis θ = θ₀ has level α, then the probability of
"accepting" θ = θ₀ when it is true is at least 1 − α, that is,

Pθ₀[S ∈ A(θ₀)] ≥ 1 − α. (6.8)

(If θ = θ₀ allows more than one distribution of S, this holds for all of them, and
similarly hereafter.) But the event S ∈ A(θ₀) is, by the definition (6.7) of C,
equivalent to the event θ₀ ∈ C(S). Substituting this equivalence in (6.8), we
have

Pθ₀[θ₀ ∈ C(S)] ≥ 1 − α. (6.9)
If, for each θ₀, the test of the null hypothesis θ = θ₀ is a test at level α, then
(6.9) holds for every θ₀. This is the defining condition for a confidence region
C(S) at confidence level 1 − α. We see that a test at level α for each θ₀ leads
immediately to a corresponding confidence region at confidence level 1 − α.
The left-hand side of (6.9) is called the true confidence level and is discussed in
Sect. 6.4. The exact confidence level is the maximum value of 1 − α such that
(6.9) holds for all possible distributions of S. This is the minimum (or infimum)
of the left-hand side, the probability that C(S) includes the true parameter
value, over all possible distributions. Nominal and conservative levels are
defined in the obvious way.
Conversely, if one has a confidence region C(S) at confidence level 1 − α,
then for each θ₀ a test of the null hypothesis θ = θ₀ at level α may be performed
by "accepting" if the confidence region C(S) includes θ₀ and rejecting other-
wise. This is equivalent to defining the "acceptance" region A(θ₀) by (6.7),
"accepting" the null hypothesis θ = θ₀ if S ∈ A(θ₀) and rejecting it otherwise.
Thus there is an exact correspondence between a confidence region for θ at
confidence level 1 − α and a family of tests of null hypotheses θ = θ₀, each
test at significance level α. It is conventional, if not particularly convenient, to
measure significance levels as (Type I) error probabilities and confidence
levels as 1 minus the error probability. Typical significance levels are 0.10,
0.05, 0.01, and the corresponding typical confidence levels are 0.90, 0.95, 0.99.
We generally adhere to this convention, although context would determine
the meaning anyway. Notice that the definition of a confidence region, that
(6.9) holds for all θ₀, makes no reference to hypothesis testing. Nevertheless,
the relationship is so intimate that it should never be lost sight of.
For the parameter p of the binomial distribution, the confidence regions
defined above agree with the confidence limits derived in Sect. 6.1. Specifically,
the confidence region corresponding to the family of lower-tailed tests at
level α is the interval p < p_u, where p_u is the upper confidence limit for p at
level 1 − α, and similarly for upper-tailed and two-tailed tests. Verification
beyond what has already been given is left to the reader (Problem 40). Some
confidence procedures for the binomial parameter which correspond to other
two-tailed tests are discussed later, in Sect. 8.2.
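The defining relation (6.7) can be applied mechanically: scan candidate values p₀ and keep those the observed s would "accept." A rough grid-search sketch (ours; the conservative equal-tailed test is one choice among the tests discussed above):

from math import comb

def binom_cdf(k, n, p):
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) for s in range(k + 1))

def accepted(p0, s, n, alpha):
    """Conservative equal-tailed test of p = p0: 'accept' unless the observed
    s lies in a tail of null probability at most alpha/2."""
    return (binom_cdf(s, n, p0) > alpha / 2
            and 1 - binom_cdf(s - 1, n, p0) > alpha / 2)

n, s, alpha = 5, 2, 0.20
grid = [i / 1000 for i in range(1, 1000)]
region = [p0 for p0 in grid if accepted(p0, s, n, alpha)]
print(region[0], region[-1])            # about 0.113 and 0.753: the 80% interval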
The random region C(S) must be distinguished from the particular region
C(s) obtained for the particular value s of S. C(S) is a random variable,
because it is a function of a random variable. The distinction between C(S)
and C(s) is another case of the distinction between a random variable and its
value, like that between estimator and estimate in Sect. 3. Unfortunately, the
term confidence region is standard for both the random variable and its value,
and it must be determined from the context which is meant. Intuitively, a
confidence region is random before the observations are made, but not
afterwards.
If C(S) is a confidence region at confidence level 1 − α, then, by definition,
the event θ₀ ∈ C(S) has probability at least 1 − α whatever the true value θ₀.
However, once a value S = s is observed, the probability that the true value θ₀
lies in C(s) is not necessarily 1 - 0:. In the "frequentist" framework of
probability, one can say only that the probability that the true value lies in C(s)
is unknown, but must be either 0 or 1. (A similar point was made in Sect. 4.2 in
connection with hypothesis tests.) The confidence level is a property of the
confidence procedure, not of the confidence region for a particular outcome.
One interpretation of a confidence region is as those values which cannot
be rejected, with a testing interpretation of rejection. Some people are careful
to limit themselves to this interpretation. Most, however, probably go some-
what further in allowing the connotations of the word "confidence" to enter
their thinking, without necessarily claiming any very strong justification for
doing so.
The Bayesian framework sometimes justifies confidence in observed
confidence regions. When probabilities are used in a Bayesian framework to
represent" degrees of belief" (see Sect. 4.2), prior belief may have much less
Figure 6.3 True confidence level for 80% confidence intervals for p when n = 5.
from Table B. This can be calculated for any p, and Fig. 6.3 shows a graph of
the result.
Numerical values for selected p are shown in Table 6.2 under the heading
True Level. The true, two-sided confidence level can also be computed in two
pieces as follows. The probability that the interval includes p is one minus the
probability that it does not include p, and the latter will occur if and only if p is
either (a) smaller than the lower confidence limit, or (b) larger than the upper
confidence limit. But p is smaller than the lower limit if and only if, for that p,
the null hypothesis would be rejected by an upper-tailed test. This occurs if the
observed s satisfies Pp(S ≥ s) ≤ 0.10, that is, if s is in the upper 0.10 tail of the
distribution corresponding to p. Hence the probability of (a) is the largest
upper-tail probability, not exceeding 0.10, of the distribution corresponding
to p. For p = 0.4, n = 5, for instance, we see in Table 6.1 that p is smaller than
the lower confidence limit if and only if the observed s is 4 or 5, which has
probability P0.4(S ≥ 4) = 1 − 0.9130 = 0.0870. This is indeed the largest
upper binomial tail probability for p = 0.4 which does not exceed 0.10.
Similarly, the probability of (b) is the largest lower binomial tail probability
which does not exceed 0.10. For p = 0.4, this is P0.4(S ≤ 0) = 0.0778. Since
(a) and (b) are mutually exclusive events, the true two-sided confidence level
is 1 − (0.0870 + 0.0778) = 0.8352, as found in the previous paragraph.
Similar results for other values of p ≤ 0.5 are also shown in Table 6.2 under
the headings Upper and Lower respectively. (In Table 6.2, Upper is the
probability that the upper confidence limit is smaller than p, which equals the
largest lower-tail probability not exceeding 0.10; Lower is the probability that
the lower confidence limit is larger than p, which equals the largest upper-tail
probability not exceeding 0.10.) The entries for values of p not
included in Table B are found in the same way, but using more extensive
tables of the binomial distribution or interpolation or a computer. Since the
true level is not a continuous function of p, it is necessary to give special
attention to the points of discontinuity, which are the possible confidence
limits. A similar table for p > 0.5 can be obtained by changing the label p to
1 − p and interchanging the labels Lower and Upper in Table 6.2.
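The whole of Fig. 6.3 and Table 6.2 can be reproduced in a few lines; the sketch below (ours) solves for the limits of Table 6.1 by bisection and then computes the true level at any p:

from math import comb

def pmf(s, n, p):
    return comb(n, s) * p**s * (1 - p)**(n - s)

def tail_ge(s, n, p):                   # P_p(S >= s)
    return sum(pmf(k, n, p) for k in range(s, n + 1))

def tail_le(s, n, p):                   # P_p(S <= s)
    return sum(pmf(k, n, p) for k in range(s + 1))

def solve(f, target):                   # bisection for increasing f on (0, 1)
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n = 5
p_l = [0.0] + [solve(lambda p, s=s: tail_ge(s, n, p), 0.10) for s in range(1, n + 1)]
p_u = [solve(lambda p, s=s: 1 - tail_le(s, n, p), 0.90) for s in range(n)] + [1.0]

def true_level(p):
    return sum(pmf(s, n, p) for s in range(n + 1) if p_l[s] <= p <= p_u[s])

print(round(p_l[2], 3), round(p_u[2], 3))   # about 0.112 and 0.754
print(round(true_level(0.4), 4))            # 0.8352, as computed above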
4 As used here, the word "size" is to be interpreted in a nontechnical sense, not to be confused
with the previous technical definition as the level of a test. Since the technical term size is not used
in this book, no difficulty in interpretation should occur.
which this is true is complicated by the fact that different tests may be powerful
against different alternatives, and, correspondingly, different confidence
regions have high probability of excluding false values and are of small size
under different hypotheses. These ideas will be discussed briefly below.
Further explanation is given by Pratt [1961] and Madansky [1962].
Suppose C(S) is a confidence procedure corresponding to a family of tests
with "acceptance" regions A(θ₀). Since θ₀ ∈ C(S) if and only if S ∈ A(θ₀), we
have

Pθ₁[θ₀ ∈ C(S)] = Pθ₁[S ∈ A(θ₀)] for all θ₁. (6.10)
For θ₁ = θ₀, this says that the probability of "accepting" the null hypothesis
θ = θ₀ when it is true is equal to the probability that the confidence region
will include the value θ₀ when θ₀ is the true value. This result has already
been used in demonstrating that 1 − α is the confidence level if all the tests
have level α (establishing (6.9) from (6.8)). For θ₁ ≠ θ₀, (6.10) says that the
probability of a Type II error at θ₁ for the test of the null hypothesis θ = θ₀
equals the probability, when θ₁ is the true value, that the confidence region will
include the false value θ₀. Thus, the ability of a confidence procedure to
exclude false values is the same as the power of the corresponding tests, in this
specific sense. This also leads to a correspondence between certain "optimum"
properties of tests and confidence procedures (see Sect. 7.3).
We consider next the size of a region, defined as its length (if it is an interval),
its area (if it is a region in the plane), and in general its k-dimensional volume
(if it is a region in k-dimensional space). It is perhaps more reasonable to be
concerned with the probability of including false values than the size of a
confidence region, since there is no merit in a small region if it is not even close
to the true value. At the same time, it is natural to feel that a good confidence
procedure would produce small regions. There turns out to be a direct
connection between the size of the confidence region and the probability of
including false values which implies that making either one small will tend to
make the other small.
Since a confidence procedure does not ordinarily give a region of fixed size,
let us consider the expected size of the confidence region. Consider, for instance,
a two-sided confidence interval [θ_l, θ_u] for a one-dimensional real parameter
θ. The size is the length θ_u − θ_l, and the expected length is Eθ₁[θ_u(S) − θ_l(S)]
if θ₁ is the true value. By Problem 41 we have

Eθ₁[θ_u(S) − θ_l(S)] = ∫ Pθ₁[θ_l(S) ≤ θ ≤ θ_u(S)] dθ,

and the integral is unchanged if the integration is over all θ except the true
value. In other words, the expected length is the integral of the probability of
including false values. This is true of size generally (Problem 42). The
essential condition is that the expected size and the probability of inclusion
must be computed under the same distribution.
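The identity can be checked numerically; in the sketch below (ours) the intervals C(s) are arbitrary hypothetical choices for n = 2, since the identity holds for any intervals:

from math import comb

def pmf(s, n, p):
    return comb(n, s) * p**s * (1 - p)**(n - s)

n, p1 = 2, 0.3
C = {0: (0.0, 0.5), 1: (0.2, 0.8), 2: (0.5, 1.0)}   # hypothetical intervals

expected_length = sum(pmf(s, n, p1) * (C[s][1] - C[s][0]) for s in range(n + 1))

m = 100000                              # integrate P_p1(theta in C(S)) over theta
integral = sum(
    sum(pmf(s, n, p1) for s in range(n + 1) if C[s][0] <= (i + 0.5) / m <= C[s][1])
    for i in range(m)
) / m

print(round(expected_length, 4), round(integral, 4))   # both 0.542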
Similarly, suppose we are concerned only with the upper confidence bound
θ_u. Then we do not mind including false values smaller than the true value θ₁,
but prefer to exclude those greater than θ₁. That is, it does not matter if θ_u > θ
for θ < θ₁, but we would like the probability that θ_u > θ to be small for θ > θ₁.
We would also like to overestimate θ₁ by as little as possible. In this case the
role of size is played by the "excess," defined as θ_u − θ₁ for θ_u > θ₁ and 0 for
θ_u ≤ θ₁. In particular, the expected "excess" equals the integral over all θ > θ₁
of the probability that θ_u exceeds θ. When this probability is small, the expec-
ted "excess" is small, and conversely.
The foregoing statements imply that when confidence regions correspond
to powerful tests, the confidence regions will have, first, high probability of
excluding false values and, second, small expected size. The exact relationship
between the properties of confidence regions and the relative emphasis placed
on various alternatives in seeking powerful tests is subtle. It is common
practice to choose tests which have desirable power functions without reference
to confidence properties, and to use the corresponding confidence regions
without investigating the confidence properties. The remarks above suggest
that this practice will provide good confidence regions.
Theorem 7.1. A lower-tailed binomial test at exact level α of the simple null
hypothesis H₀: p = p₀ has maximum power against any alternative p = p₁ with
p₁ < p₀, among all level α tests of H₀.
The property in Theorem 7.1 also holds for the broader null hypothesis
p ≥ p₀ because the test has the same exact level for p ≥ p₀, while any other
test at level α for p ≥ p₀ has level α for p = p₀, and therefore, by Theorem 7.1,
cannot be more powerful against any alternative p₁ < p₀.
Of course, if a lower-tailed test has level α for the null hypothesis p ≥ p₀,
but its exact level is α₀ where α₀ < α, then there will be other tests at level α
which are more powerful; however, they all must have exact level greater than
α₀. As a matter of fact, slightly more is true. Any test whose power is greater
than a lower-tailed test's against any alternative p₁ < p₀ also has greater
probability of a Type I error under every null distribution p ≥ p₀.
Similar results hold for upper-tailed tests, and indeed follow simply by
interchanging" success" and "failure."
The results of Sect. 7.1 imply that any lower-tailed test for the null hypothesis
P ~ Po is admissible, meaning that any test with smaller probability of error
under some distribution has greater probability of error under some other
distribution. (The errors may be Type I or Type II, as the distributions may be
null or alternative.) Briefly, any test which is better somewhere is worse
somewhere else. If a test is not admissible, then it can be eliminated from
consideration because there is another test at least as good everywhere, and
better somewhere. This other test has power at least as great against every
alternative and probability of a Type I error at least as small under every null
distribution, and it either has greater power against some alternative, or else
5 It does not immediately follow that all tests which are not admissible can be eliminated
simultaneously, because one could imagine an infinite sequence of tests, each better than the one
before but with no test better than all of them. This does not occur in practical testing problems,
however. See, for instance, Lehmann [1959, Appendix 4].
6 In restricting consideration to sufficient statistics, we have already adopted the view that when
several procedures have the same probability of rejection for all p, they are equivalent, and only
one of them need be considered. Our definition of "complete" reflects this view. Often "com-
plete" is defined to require excluded procedures to be strictly inferior; what we call "complete"
here is then called "essentially complete."
*7.4 Proofs
Let α(p, φ) be the probability that the null hypothesis p = p₀ will be rejected
by the test φ when p is the true value. Given any p₀, 0 < p₀ < 1, and any
α₀, 0 ≤ α₀ < 1, there is one and only one lower-tailed randomized test φ*
based on S for which α(p₀, φ*) = α₀. (This is easy to see by considering what
happens as the lower tail is gradually augmented.) We will now prove that this
test φ* uniformly maximizes the probability of rejection α(p₁, φ) for p₁ < p₀
and uniformly minimizes α(p₁, φ) for p₁ > p₀, among tests φ for which
α(p₀, φ) = α₀. It will be left to the reader (Problems 48 and 49) to verify that
all statements of Sect. 7.1–7.3 follow (with the help of hints given there and the
relation between tests and confidence procedures, particularly the discussion
of including false values in Sect. 6.5).
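The construction of φ* by gradually augmenting the lower tail translates directly into code; a sketch (ours):

from math import comb

def pmf(s, n, p):
    return comb(n, s) * p**s * (1 - p)**(n - s)

def lower_tailed_phi(n, p0, alpha0):
    """Critical function of the lower-tailed randomized test with exact level
    alpha0: fill in the lower tail until alpha0 is exhausted, then randomize
    at the boundary point; phi(s) = 0 for all larger s (omitted here)."""
    phi, used = {}, 0.0
    for s in range(n + 1):
        q = pmf(s, n, p0)
        if used + q <= alpha0:
            phi[s] = 1.0                # reject outright
            used += q
        else:
            phi[s] = (alpha0 - used) / q   # boundary randomization
            break
    return phi

print(lower_tailed_phi(6, 0.5, 6 / 64))    # {0: 1.0, 1: 0.833...}: the Sect. 5.1 test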
Let φ be any (randomized) test. As in Sect. 5, denote by φ(s) the prob-
ability that φ rejects the null hypothesis when a particular value s is observed.
Then

α(p₁, φ) = Σ_s φ(s) Pp₁(S = s), (7.1)
α(p₀, φ) = Σ_s φ(s) Pp₀(S = s). (7.2)
It is intuitively clear that we should choose φ(s) = 1 for those s where the
contribution to the sum (7.1) is greatest compared to the contribution to the
sum (7.2). In other words, the maximizing φ will be of the form

φ₀(s) = 1 if λ(s) > k, φ₀(s) = 0 if λ(s) < k, (7.3)

where λ(s) = P₁(S = s)/P₀(S = s) is the likelihood ratio of the alternative to
the null distribution.
For any test φ satisfying (i) and (ii),

E₁[φ₀(S)] − E₁[φ(S)] = Σ_s [φ₀(s) − φ(s)] P₁(S = s)
  ≥ Σ_s [φ₀(s) − φ(s)] k P₀(S = s)
  = k{E₀[φ₀(S)] − E₀[φ(S)]}
  ≥ kα₀ − kα₀ = 0. (7.5)
The first inequality holds term by term because φ₀(s) = 1 ≥ φ(s) where
P₁(S = s) > kP₀(S = s), and φ₀(s) = 0 ≤ φ(s) where P₁(S = s) <
kP₀(S = s). The inequality is strict unless φ also satisfies (iii). The second
inequality holds because E₀[φ₀(S)] = α₀ ≥ E₀[φ(S)] and k ≥ 0. It is strict
unless k = 0 or E₀[φ(S)] = α₀. It follows that E₁[φ(S)] ≤ E₁[φ₀(S)] for any
φ satisfying (i) and (ii), thus proving the direct half of the theorem. If φ also
maximizes E₁[φ(S)], then neither inequality is strict. It follows that φ satisfies
7 Actually, this case covers any two distributions if densities with respect to an arbitrary measure are used.
(iii), and that E₀[φ(S)] = α₀, except perhaps when k = 0. Thus the converse
half of the theorem is proved except when k = 0, φ satisfies (iii), and E₀[φ(S)]
< α₀. In this case, the likelihood ratio is greater than k on the set A of all
s where P₁(S = s) > 0; then P₁(S ∈ A) = 1, and φ(s) = 1 for s ∈ A, so that
P₀(S ∈ A) ≤ E₀[φ(S)] < α₀, satisfying the last clause of the theorem. □
In the last section we found that in the one-sided binomial problem, a one-
tailed randomized test based on S is uniformly most powerful against
alternatives in the appropriate direction. For two-sided alternatives it is
natural to use two-tailed tests, and later in this section, we shall show that no
others should be considered. No test is uniformly most powerful against a
two-sided alternative, however, since different one-tailed tests are most
powerful against alternatives on the two sides. Hence, even if the level α is
given, we need some further criterion to select among all two-tailed tests at
level α.
One possible criterion for choice is called the equal-tails criterion. The
idea is that a two-tailed test at level α should be a combination of two one-
tailed tests, each at level α/2. To make the exact levels equal, in most discrete
problems, one of the one-tailed tests must ordinarily be randomized. (If the
null distribution is symmetric, either both or neither must be randomized.)
The usual nonrandomized two-tailed binomial test (Clopper and Pearson
[1934]) discussed earlier in this book has equal nominal levels in the two
tails. Specifically, it rejects at nominal level α when either one-tailed, non-
randomized test at conservative level α/2 would reject, and "accepts"
otherwise.
The usual two-tailed binomial test is therefore conservative, as was
illustrated in the context of confidence intervals by Fig. 6.3. In fact, it might be
considered ultraconservative, since it would sometimes be possible to add a
point to the rejection region without making the exact level greater than the
nominal level. For the null distribution given in Table 4.1, for example, at
level α = 0.06, the usual test of H₀: p = 0.6 would reject only for S = 0, 1, 2,
or 10. S = 9 could be added to the rejection region without
raising the level above 0.06. The disadvantage of including such a point is that
the level of one of the one-tailed tests would then exceed α/2, so that the one-
tailed and two-tailed procedures would not be simply related. Still, under the
8 If the tests are unbiased but have different exact levels for different members of the family, the
corresponding confidence property is more complicated and will not be discussed here. (Problem
53 requests a statement of this property.)
This may be maximized by more than one level α test. Crow [1956] took
advantage of this flexibility to modify the minimum likelihood procedure so
that the corresponding confidence regions are intervals (for n ≤ 30 and
α = 0.10, 0.05, or 0.01). Crow made further modifications to shorten the
intervals for S near 0 or n and to make the interval for a given S at level 0.01
contain that at level 0.05 and the interval at level 0.05 contain that at level 0.10.
Crow's confidence procedure, and any other confidence procedure cor-
responding to nonrandomized level α tests which maximize the number of
values of S in each rejection region, have the following properties among
nonrandomized procedures at level α. First, they minimize the average pro-
bability of including each point p₀, ∫₀¹ Pp(p₀ included) dp, which can also be
interpreted as the average probability of including the false value p₀. (In the
integrand, if p is the true value and p ≠ p₀, then p₀ is a false value; omitting
p = p₀ from the region of integration has no effect.) Second, they also
minimize both the average expected length of the confidence region and the
ordinary, arithmetic average of the lengths of the n + 1 possible confidence
regions, these being in fact equal (Problem 56). Specifically, for intervals these
averages are

∫₀¹ Ep[p_u(S) − p_l(S)] dp = (1/(n + 1)) Σ_{s=0}^{n} [p_u(s) − p_l(s)]. (8.2)
If, for some s, the confidence region C(s) is not an interval, its generalized
length (Borel measure) ∫_{C(s)} dp must be used here in place of the length
p_u(s) − p_l(s). (Since "p₀ included" is equivalent to "p₀ not rejected," the first
property follows from (8.1) and the second from the relation

Σ_{s=0}^{n} ∫_{C(s)} dp = ∫₀¹ [number of values of s such that p ∈ C(s)] dp, (8.3)
9 The definition of completeness requires only "at least as great." The statement is true only if
one-tailed tests are included in the class of two-tailed tests, as we shall take them to be and as they
are by the formal definition of two-tailed tests given above. In the binomial problem, we can take
s_l = 0, φ_l = 0 or s_u = n, φ_u = 0.
given any two different two-tailed tests, each one has greater probability of
making the correct decision than the other at some values of p. Thus we need
consider only two-tailed tests, but none can be excluded from consideration
without adducing some further criterion.
The facts of the last two paragraphs have been proved by Karlin [1955] for
any strictly Pólya type 3 distribution. (The monotone likelihood ratio property
mentioned in Sect. 7.4 is Pólya type 2. Pólya type 3 is a generalization.)
Karlin's proof uses a fundamental theorem of game theory, that the class of all
Bayes' procedures and their limits is complete, under certain compactness
conditions. One could use instead the generalization of the Neyman-Pearson
fundamental lemma to two side conditions. The proof of this generalization
(see, for instance, Lehmann [1959]) is related to the usual proof of the funda-
mental theorem of game theory. With the help of completeness, a proof of
admissibility is analogous to that given in Sect. 7.4.*
Results similar to those of Sect. 8.3 hold for the various three-conclusion inter-
pretations of two-sided problems. For definiteness (the discussion applies to
the others also with trivial modifications), we consider the conclusions as
(a) p < p₀, (b) "accept" the null hypothesis (that is, draw no conclusion), and
(c) p > p₀. Suppose that when in fact p < p₀, we prefer (a) to (b) and (b) to (c);
when p = p₀ we prefer (b) to either (a) or (c); and when p > p₀ we prefer (c) to
(b) and (b) to (a). Then, given any procedure for reaching one of these con-
clusions, there is a two-tailed test which (in its three-conclusion interpretation)
is at least as good for every p. Specifically, if α₁ equals the probability when
p = p₀ of concluding p < p₀ and α₂ equals the probability when p = p₀ of
concluding p > p₀ by the given procedure, then the two-tailed test combining
the lower-tailed test at exact level α₁ and the upper-tailed test at exact level α₂
is at least as good as the given procedure, whatever the value of p. When
p < p₀, its probability of concluding p < p₀ is at least as large, and its prob-
ability of concluding p > p₀ is at least as small, as that of the original pro-
cedure. (This follows from the results given in Sect. 7.) When p > p₀, the
same statement holds with the inequalities reversed; and when p = p₀, the
two procedures have the same probability of leading to each conclusion.
Thus the two-tailed tests form a complete class in the three-conclusion
interpretation (with the natural preferences given above). Are all admissible in
this interpretation, or might there now be one which is at least as good as
another whatever the value of p and better for some p? The results given earlier
for one-tailed tests imply immediately that, given any two-tailed test inter-
preted as a three-conclusion procedure, any procedure having greater
probability under any p > p₀ of correctly concluding p > p₀ has also greater
probability under all p ≤ p₀ of incorrectly concluding p > p₀. This statement
also holds with the inequalities reversed. These results are not quite what we
would like, however, because we might prefer a procedure having somewhat
smaller probability, under p > p₀, of correctly concluding p > p₀, if it also
had a sufficiently smaller probability of incorrectly concluding p < p₀ (and
therefore, of course, larger probability of "accepting" the null hypothesis).
That is, a sufficient decrease in the probability of the least desirable conclusion
p < p₀ might more than offset a decrease in the probability of the most
desirable conclusion p > p₀.
In order to make the problem definite enough to investigate, let us suppose
that the undesirability of a procedure, when p is the true value, is measured by
weighting the probabilities of errors and adding. The weights may depend on
p and are denoted by L_j(p) for the conclusions j = a, b, and c. Specifically, then,
when p is the true value, the undesirability is the sum

L_a(p) Pp(test concludes p < p₀)
  + L_b(p) Pp(test "accepts") (8.4)
  + L_c(p) Pp(test concludes p > p₀).

In accordance with the preferences expressed earlier, we have

L_a(p) < L_b(p) < L_c(p) for p < p₀, (8.5)
L_b(p) < L_a(p) and L_b(p) < L_c(p) for p = p₀, (8.6)
L_c(p) < L_b(p) < L_a(p) for p > p₀. (8.7)
Beyond this, the L_j may be chosen almost arbitrarily and the statement we are
about to make will still hold.
If undesirability is interpreted as the sum given in (8.4), and a mild further
restriction is satisfied by the L_j, it can be proved (Karlin and Rubin [1956];
Karlin [1957b]; Problem 59) that any two-tailed test whose upper and lower
critical values are not equal is admissible; that is, no other procedure is as
desirable under all p and more desirable under some p. (For the sample sizes
and null hypotheses that occur in practice, tests whose upper and lower
critical values are equal have large probability of rejecting the null hypothesis
when it is true and hence are never considered.)
These statements about the three-conclusion interpretation of two-sided
binomial problems also hold if the null hypothesis p = p₀ is replaced by the
null hypothesis that p lies in an interval containing p₀, as was done above for
the two-conclusion interpretation.*
9 Appendices to Chapter 1
Straightforward tabulation of the binomial distribution involves three
variables (n, p, and s) and leads to extremely bulky tables. (Table B is straight-
forward but very abbreviated and is not useful for in-between values, for large
where s' and s" are integers, 0 $ s' $ s", and == denotes approximate equality
(in absolute, not relative terms). If n is large but n(1 - p) is moderate, then
n - S has approximately a Poisson distribution with mean m = n(1 - p).
This yields immediately an approximation to the distribution of S.
Precise limit statements corresponding to these approximations are as
follows. Suppose S_n is binomial with parameters n and p, and p depends on n
in such a way that np → m as n → ∞. (The subscript has been added to S
because n is no longer fixed and the distribution of S depends on n.) Then

P(S_n = s) → (m^s e^{-m})/s!  as n → ∞,  s = 0, 1, ..., (9.3)

P(s′ ≤ S_n ≤ s″) → Σ_{s=s′}^{s″} (m^s e^{-m})/s!  as n → ∞,  0 ≤ s′ ≤ s″ ≤ ∞. (9.4)
Of course, these limit statements, although they have precise meanings, say
nothing about when the limits are approximately reached.
*The statements in (9.3) and (9.4) are easily proved as follows. The exact
probability function for S_n is

P(S_n = s) = [n!/(s!(n − s)!)] p^s (1 − p)^{n−s} (9.5)

 = (1/s!) [(n/n)((n − 1)/n) ⋯ ((n − s + 1)/n)] (np)^s [(1 − p)^{1/p}]^{np} (1 − p)^{−s}. (9.6)

As n → ∞, each factor in the first bracket approaches 1, np → m, and therefore
p → 0; it follows that (1 − p)^{1/p} → e^{−1} and sp → 0, which proves (9.3).
Summing (9.3) over s gives (9.4) for s″ < ∞. Limits cannot be taken under
infinite summation signs without further justification, but (9.4) now follows
for s″ = ∞ as well, since P(s′ ≤ S_n) = 1 − P(S_n ≤ s′ − 1) →
1 − Σ_{s=0}^{s′−1} (m^s e^{-m})/s! = Σ_{s=s′}^{∞} (m^s e^{-m})/s!.
The density function of the standard normal distribution was given earlier in
(2.3) as

φ(z) = (1/√(2π)) e^{−z²/2}. (9.8)

In this book the symbol Φ will be reserved to denote the cumulative distri-
bution function of the standard normal. Values of Φ(−z) are given in Table A
for z ≥ 0; by symmetry, Φ(z) = 1 − Φ(−z). A random variable X is normal
with mean μ and variance σ² if the standardized random variable, (X − μ)/σ,
has a standard normal distribution.
If S is binomial with parameters n and p, and np and n(1 − p) are both large
(that is, n is large and p is not too close to 0 or 1), then S is approximately
normal with mean μ = np and variance σ² = np(1 − p). In other words, the
standardized random variable

(S − μ)/σ = (S − np)/√(np(1 − p)) = ((S/n) − p)/√(p(1 − p)/n) = ((S/n) − p) √(n/(p(1 − p))) (9.9)

is approximately standard normal. Correspondingly,

P(S = s) ≈ (1/σ) φ((s − μ)/σ), (9.10)

where μ = np, σ = √(np(1 − p)). Further, for any integers s′ and s″, s′ ≤ s″,
we have approximately

P(s′ ≤ S ≤ s″) ≈ Φ((s″ + ½ − μ)/σ) − Φ((s′ − ½ − μ)/σ). (9.11)
The values of Φ are given by Table A. This approximation can be recommen-
ded only for extremely casual use, where ease of remembering and calculating
are paramount. There is a normal approximation based on cube roots which
is only a little more complicated but an order of magnitude more accurate.
Another normal approximation based on logarithms is only slightly more
complicated and is yet another order of magnitude more accurate. The latter
is given at the end of Table B and the former is the basis of the approximate
confidence bounds given at the end of Table C.
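The simple approximation (9.11) with the continuity correction is easily compared with the exact probability; a sketch (ours, with arbitrary illustrative values of n, p, s′, s″):

from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) for s in range(k + 1))

def Phi(z):                             # standard normal cdf via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 20, 0.4
mu, sigma = n * p, sqrt(n * p * (1 - p))
s1, s2 = 6, 11
exact = binom_cdf(s2, n, p) - binom_cdf(s1 - 1, n, p)
approx = Phi((s2 + 0.5 - mu) / sigma) - Phi((s1 - 0.5 - mu) / sigma)
print(round(exact, 4), round(approx, 4))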
Precise limit statements corresponding to the approximations (9.10) and
(9.11) are as follows. Suppose S_n is binomial with parameters n and p, p fixed,
0 < p < 1, and let μ = np, σ = √(np(1 − p)). If s_n is an integer depending on n
in such a way that (s_n − μ)/σ approaches some number, say z, then

P(S_n = s_n) / [(1/σ) φ(z)] → 1  as n → ∞ (9.12)

and

P(S_n ≤ s_n) → Φ(z)  as n → ∞. (9.13)

More generally,

P(S_n ≤ s) − Φ((s − μ)/σ) → 0  as n → ∞. (9.14)
Suppose that for each n we have a sequence of random variables, say {Xn}, and
a corresponding sequence of distributions. Suppose further that, as in the
previous section, the random variables X n are discrete and, for every real
number x, the probability that X_n = x, say f_n(x), approaches a limit, say
f(x), as n → ∞. Then as n → ∞

P(X_n ∈ A) → Σ_{x∈A} f(x) for every finite set A.

This is true for infinite sets as well if the limit f is a discrete frequency function.
Specifically, we must have Σ_x f(x) = 1, which cannot be taken for granted.
This added condition also holds when f is the Poisson frequency function, of
course, and the proof of (9.4) for s″ = ∞ used it.
An analogous result holds for densities. These facts and others are given
formally in the following theorem.
Theorem 9.1
(1) If X_n has discrete frequency function f_n, n = 1, 2, ..., f_n(x) → f(x) for all
x, and Σ_x f(x) = 1, then f is the discrete frequency function of a random
variable X, and
(i) Σ_x |f_n(x) − f(x)| → 0, (9.15)
(ii) P(X_n ∈ A) → P(X ∈ A) uniformly in all sets A. (9.16)
(2) If X_n has density f_n, n = 1, 2, ..., f_n(x) → f(x) for all x, and f is a density,
then f is the density of a random variable X, and
(i′) ∫ |f_n(x) − f(x)| dx → 0, (9.17)
(ii′) P(X_n ∈ A) → P(X ∈ A) uniformly in all sets A, (9.18)
(iii′) E[g(X_n)] → E[g(X)] uniformly for all functions g with |g| ≤ 1. (9.19)
10 Here, as elsewhere in this book, all sets and functions are assumed to be measurable without
specific mention.
The statement in (i) follows. The proof of (i′) is similar, using Fatou's Lemma,
which says that ∫ lim inf_n g_n ≤ lim inf_n ∫ g_n for any nonnegative functions g_n.
Actually, the densities may be with respect to an arbitrary measure. With this
understanding, the second case covers the first. (The method of proof used
here appears in a more natural context in Young [1911]. See also Pratt [1960].
A somewhat different proof is given by Scheffé [1947].) □
The foregoing discussion does not apply directly to the approach of the
binomial distribution to the normal, since the binomial distribution is discrete
while the normal has a density. The discussion does apply indirectly, however.
Specifically, suppose S_n is binomial with parameters n and p, and U is uni-
formly distributed between −½ and ½. Then S_n + U has a density; in fact, the
density of S_n + U at y is

P(S_n = s), (9.21)

where s is the integer nearest y. (The definition for y half-way between adjacent
integers is immaterial.) This density approaches 0 as n → ∞ for y fixed, as
would be expected since the variance of S_n + U approaches ∞. Consider

X_n = (S_n + U − μ)/σ. (9.22)
11 The abbreviation inf stands for infimum, which is the greatest lower bound. Similarly, sup
denotes supremum or least upper bound. The infimum and supremum of a set of numbers always
exist; either or both may be infinite.
where f is the standard normal density. Thus f_n(x) → f(x) for all x, and
Theorem 9.1 applies.
It follows in particular that P(X_n ∈ A) → P(X ∈ A) uniformly in all sets A,
where X is standard normal; the statements (9.13) and (9.14) are special cases.
Suppose X 1> X 2, .•. , and X are real- or vector-valued random variables with
respective cumulative distribution functions F b F 2, ... , and F. Then the
following conditions are equivalent (Problem 62):

(1) F_n(x) → F(x) at every point x where F is continuous;
(2) E[g(X_n)] → E[g(X)] for every bounded continuous function g;
(3) P(X_n ∈ A) → P(X ∈ A) for every set A whose boundary has probability 0 under F.
If these conditions hold (if one holds, they all do, since they are equivalent),
then F_n is said to converge in distribution to F. Alternative terminology is that
X_n converges in distribution to X, or X_n to F, or F_n to X; X_n or F_n is asympto-
tically distributed as X or F; F is the limiting distribution of X_n or F_n; etc.
Part of the definition of convergence in distribution is that the limit F
should be a cumulative distribution function. It is possible for (1) to hold
without F being a cumulative distribution function (Problem 63); then Fn
does not converge in distribution to F (though the customary terminology is
to say that it converges "weakly" to F). If (1) holds, F must satisfy the mono-
tonicity properties of a cumulative distribution function; thus the further
requirement is just that it should behave properly at ± 00, which amounts to
the requirement that X be finite with probability one.
Conditions (1), (2), and (3) above are somewhat weaker than the cor-
responding statements of Theorem 9.1. Thus the hypotheses of Theorem 9.1
imply convergence in distribution, while convergence in distribution does not
imply the conclusions (or hypotheses) of Theorem 9.1, even if all the distribu-
tions are discrete. (It does if all the distributions are concentrated on the same
finite set of points. The proof is Problem 64.)
Notice also that the convergence in distribution of X n to X does not imply
that X_n is probably near X for n large. X₁, X₂, ..., X might be independent;
indeed, their joint distribution is not under discussion and need not exist. The
convergence in distribution of Xn to X says only that the distribution of Xn
is close to the distribution of X for n large, in a certain sense of close.
For independent, identically distributed real random variables X₁, X₂, ... with
mean μ and finite variance σ², the classical Central Limit Theorem says that the
standardized sum

(Σ_{j=1}^{n} X_j − nμ)/(σ√n) (9.26)

is asymptotically standard normal.
Theorem 9.3 (Liapounov Central Limit Theorem). If X₁, X₂, ... are inde-
pendent real random variables with possibly different distributions, each having
finite absolute moments of the order 2 + δ for some number δ > 0, and if

Σ_{j=1}^{n} E(|X_j − μ_j|^{2+δ}) / (Σ_{j=1}^{n} σ_j²)^{1+δ/2} → 0  as n → ∞, (9.27)

where μ_j and σ_j² denote the mean and variance of X_j, then
Σ_{j=1}^{n} (X_j − μ_j) / (Σ_{j=1}^{n} σ_j²)^{1/2} is asymptotically standard normal.
For example, if the X_j are identically distributed, the left-hand side of (9.27)
equals [E(|X₁ − μ₁|^{2+δ}) / σ₁^{2+δ}] n^{−δ/2},
which indeed approaches 0 as n → ∞. This illustrates the fact that the left-hand
side of (9.27) has a tendency to approach 0 at the rate n^{−δ/2} for some δ > 0, so
the absolute moments E(|X_j − μ_j|^{2+δ}) must misbehave quite badly before
(9.27) will fail.
PROBLEMS
1. Show that for any estimator T_n(X₁, ..., X_n) of a parameter θ based on a sample of
size n, if lim_{n→∞} E(T_n) = θ and lim_{n→∞} var(T_n) = 0, then T_n is consistent for θ, that
is, (3.5) holds.
2. Show that, for n Bernoulli trials, the probability that s successes occur on s specified
trials is the same regardless of which s of the n trials are designated as successes.
3. Complete the proof of the equivalence of the three conditions for sufficiency given
in Sect. 3.3
(a) for the binomial distribution.
(b) for an arbitrary frequency function.
(c) for an arbitrary density function.
4. (a) Show that, if S is binomial with parameters n and p, then S(n − S)/[n(n − 1)]
is a minimum variance unbiased estimator of p(1 − p).
(b) Show that max_p p(1 − p) = 1/4.
(c) Show that the maximum value of the estimator in (a) exceeds 1/4 by 1/(4n) for n
odd, and by 1/(4n − 4) for n even.
5. (a) Show that the mean squared error of an estimator T of a parameter θ can be
reduced to 0 for a particular value of θ by suitable choice of T.
*(b) Suppose the same estimator has mean squared error 0 for two different values
of θ. What unusual condition would follow for the distribution of the obser-
vations?
*6. (a) Show that the nonrandomized estimator T(S) defined in Sect. 3.4 has the same
mean as the randomized estimator Y_S, and smaller variance unless they are
equal with probability one.
(b) Generalize (a) to any statistic S, sufficient or not, in any problem, and to any
loss function v(y, θ) which is convex in the estimator y for each value of the
parameter θ.
*7. Let X be any random variable. Show that
(a) The points of positive probability are at most countable.
(b) There exists a finite or countably infinite set A such that P(X ∈ A) = 1 if and
only if the points of positive probability account for all of the probability.
(Either of these conditions could therefore be used to define "discrete" for
random variables or distributions.)
8. Give (real or hypothetical) examples of results of tests of binomial hypotheses
which are
(a) not statistically significant but are apparently practically significant.
(b) statistically significant but are practically not significant.
9. Show that if a test of hypothesis has level 0.05, then it also has level 0.10.
10. Graph the power of the upper-tailed binomial test at level 0.10 of H₀: p ≤ 0.6 in
the case where n = 10.
11. Show that any two tests which are equivalent have the same exact level and the
same power against all alternatives, but not conversely.
12. Show that two test statistics are equivalent if they are strictly monotonically
related.
13. Suppose that the rejection region of a lower-tailed binomial test is S ≤ s_c, and let
α(p) = P_p(S ≤ s_c). Show that as p increases, α(p) decreases for any fixed n and s_c,
so that α(p₀) > α(p) for any p > p₀ and α(p₀) < α(p) for any p < p₀. Hence, if the
null hypothesis is H₀: p ≥ p₀ and s_c is chosen in such a way that α(p₀) = α, then α is
the maximum probability of a Type I error for all null distributions, and 1 − α is the
maximum probability of a Type II error for all alternative distributions, and both
these maxima are achieved for the "least favorable case" p = p₀.
14. Suppose that 1 success is observed in a sequence of 6 Bernoulli trials. Is the sample
result significant for the one-sided test of H₀: p ≥ 0.75
(a) at the 0.10 level?
(b) at the 0.05 level?
(c) at the 0.01 level?
What is the level of just significance (critical level)?
15. Let p be the proportion of defective items in a certain manufacturing process. The
hypothesis p = 0.10 is to be tested against the alternative p > 0.10 by the following
procedure in a sample of size 10. "If there is no defective, the null hypothesis is
accepted; if there are two or more defectives, the null hypothesis is rejected; if there
is one defective, another sample of size 5 is taken. In this latter situation, the null
hypothesis is accepted if there is no defective in the second sample, and it is rejected
otherwise."
(a) Find the exact probability of a Type I error for this test procedure.
(b) Find the power of the test for the alternative distribution where p = 0.20.
(c) Graph the power curve as a function of p.
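The exact error probabilities asked for in Problem 15 follow by conditioning on the first-stage count; a Python sketch of that computation (ours, not a prescribed solution):

# Exact operating characteristics of the two-stage procedure in
# Problem 15, computed by conditioning on the first-stage count.
def reject_prob(p, n1=10, n2=5):
    p0 = (1 - p) ** n1                 # no defectives in the first sample
    p1 = n1 * p * (1 - p) ** (n1 - 1)  # exactly one defective
    # Reject outright with 2+ defectives; with exactly one, reject
    # if the second sample of n2 contains any defective.
    return (1 - p0 - p1) + p1 * (1 - (1 - p) ** n2)

print("exact Type I error (p = 0.10):", reject_prob(0.10))
print("power at p = 0.20:", reject_prob(0.20))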
16. A manufacturing process ordinarily produces items at the rate of 5% defective. The
process is considered "in control" if the percent defective does not exceed 10%.
(a) Find a procedure for testing H₀: p ≤ 0.05 for a sample of size 20 and a signifi-
cance level of 0.05.
(b) Find the power of this procedure when p = 0.10.
(c) A sample of size n is to be drawn to test the null hypothesis H₀: p = 0.05 against
the alternative p > 0.05. Determine n so that the level is 0.10 and the power
against p = 0.10 is 0.30.
17. Let p be the true proportion of voters who favor a school bond. Suppose we use a
sample of size 100 to test H₀: p ≤ 0.50 against the alternative H₁: p > 0.50, and 44
are in favor. Find the P-value. Does the test "accept" or reject at level 0.10?
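The P-value in Problem 17 is a single binomial tail sum; in Python (our check):

from math import comb

# Upper-tailed P-value: the probability of 44 or more favorable
# responses in 100 trials when p = 0.50.
p_value = sum(comb(100, k) for k in range(44, 101)) / 2**100
print(p_value)  # about 0.90, so H0 is "accepted" at level 0.10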
*18. A particular genetic trait occurs in all individuals in a certain population with
probability either ___ or ¼. It is desired to determine which probability applies to this
population. If a sample of 400 is to be taken, construct a test at level 0.01 for the null
hypothesis H₀: p = ¼.
(a) Find the power of this test.
(b) If 60 individuals in the sample have this genetic trait, what decision would you
reach?
19. (a) Suppose that k individual tests of a null hypothesis H₀ are given, and their
respective exact levels are α₁, ..., αₖ. Show that the combined test, which re-
jects H₀ if and only if at least one of the given tests rejects H₀, has exact
level α ≤ α* where α* = α₁ + ... + αₖ.
(b) Under what circumstances can we have α = α*?
*(c) If the individual tests are independent under H₀, show that α*/(1 + α*) <
α ≤ α*. (This implies, for instance, that α* does not overstate α by more than
10% if α ≤ 0.10.) (Hint: Show that 1 − α = ∏ᵢ₌₁ᵏ (1 − αᵢ) and 1 + α* ≤
∏ᵢ₌₁ᵏ (1 + αᵢ), and multiply.)
(d) If the individual tests have possibly different null hypotheses H₀ᵢ, i = 1, ..., k,
show that (a) applies with H₀ = ∩ᵢ₌₁ᵏ H₀ᵢ, the intersection of the H₀ᵢ.
20. Define a two-tailed test T(α) at level α by combining two conservative one-tailed
tests at level α/2, as at (4.7) and (4.8). Let α*(α) be the exact level of this two-tailed
test.
(a) Show that α*(α) ≤ α.
(b) Under what circumstances will we have α*(α) = α? Does it matter whether
the null hypothesis is simple? Does it matter whether the null distribution is
continuous or symmetric?
(c) Under what circumstances will T[α*(α)] = T(α)?
(d) Under what circumstances will α*[α*(α)] = α*(α)?
21. Consider a two-tailed test for which "extreme" is defined by the one-tailed P-value
of the test statistic. Assume the test statistic has a unique null distribution.
(a) If this distribution is continuous, show that the two-tailed P-value is twice the
one-tailed P-value.
(b) Derive the two-tailed P-value (described in the text) in the discrete case.
(c) Apply this definition to all possible outcomes in Table 4.1.
(d) In Table 4.1, what test corresponds to this definition and what exact levels are
possible?
22. Define a two-tailed P-value as the one-tailed P-value plus the nearest attainable
probability from the other tail.
(a) Apply this definition to all possible outcomes in Table 4.1.
(b) In Table 4.1, what test corresponds to this definition and what exact levels are
possible?
(c) Compare these results with (c) and (d) of Problem 21.
(d) Apply this definition to all possible outcomes for the following null distribution.
[Table giving the null distribution of s omitted.]
28. Show that, for any exact level α, 0 ≤ α ≤ 1, there is exactly one lower-tailed
randomized test based on a given statistic S if the distribution of S is uniquely
determined under H₀ and all tests with the same critical function are considered
the same.
29. Show that when n = 6, the binomial test which rejects either if there are no successes,
or if there is just one success and it does not occur on the first trial, has exactly the
same conditional probability of rejection given S for every p as does the lower-tailed
randomized test based on S which rejects always if there are no successes and with
probability ⅚ if there is one success and it occurs on any trial. Hence, in particular, it
has the same exact level (α = 3/32 for p₀ = 0.5) and the same power.
30. Show that for n = 6, H₀: p ≥ 0.5, the most powerful (randomized) test at level α,
for 1/64 < α < 7/64, is to reject always when no success is observed and with prob-
ability (64α − 1)/6 when 1 success is observed.
31. Consider the binomial problem of testing the simple null hypothesis H₀: p = 0.5
against the simple alternative p = 0.3 when n = 6 using a lower-tailed test based on
S, the observed number of successes in n trials. If we restrict consideration to non-
randomized tests, there are 7 different critical values of S, and hence only 7 different
exact levels α possible. For each of these, the corresponding probability of a Type II
error β is easily found from Table B.
(a) Plot these 7 pairs of values (α, β) on a graph to see how α and β interact.
(b) If randomized tests are allowed, any exact α can be obtained. Find the ran-
domized tests for some arbitrary values of α in between the exact values, and
compute the corresponding values of β. Plot these points on the graph in (a).
(c) Show that the points in (b) lie on the straight line segments which connect
successive points in (a). Complete the (α, β) graph for randomized tests. If the
nominal level is 0.10, the graph provides strong support for not choosing a
conservative (nonrandomized) test in this situation, while if the nominal
level is 0.05, the graph provides some support for using randomized tests in this
case. What if the nominal level is 0.20?
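The seven (α, β) pairs of Problem 31(a) can be enumerated directly instead of read from Table B; a Python sketch (ours):

from math import comb

def binom_cdf(c, n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# Lower-tailed nonrandomized tests reject when S <= c.  Under
# H0: p = 0.5 the exact level is alpha = P(S <= c); under the
# alternative p = 0.3 the Type II error is beta = P(S > c).
for c in range(-1, 6):          # c = -1 is the test that never rejects
    alpha = binom_cdf(c, 6, 0.5) if c >= 0 else 0.0
    beta = 1 - (binom_cdf(c, 6, 0.3) if c >= 0 else 0.0)
    print(c, round(alpha, 4), round(beta, 4))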
34. Prove that if a confidence region has confidence level 0.99, then it also has level 0.95.
35. Verify the numerical values of the upper and lower 90% confidence limits shown in
Table 6.1.
36. How do the regions A(Oo) and C(S) of Sect. 6.2 relate geometrically to the Clopper-
Pearson charts?
37. Show that binomial confidence limits satisfy p_L(s) = 1 − p_U(n − s).
38. Suppose that no successes occur in n Bernoulli trials with probability of success p.
Find the one-sided upper confidence limit for p at an arbitrary level α using (a) the
binomial distribution, (b) the Poisson approximation, and (c) the normal approxi-
mation. For n = 4, graph the upper limit as a function of α for each of the procedures
(a), (b) and (c).
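For Problem 38, the exact and approximate upper limits have simple closed forms when no successes occur; the sketch below (ours) takes α to be one minus the confidence level and uses one score-type inversion for the normal case, though other normal approximations are possible:

from math import log
from statistics import NormalDist

def upper_limits(n, alpha):
    exact = 1 - alpha ** (1 / n)        # solves (1 - p)**n = alpha
    poisson = -log(alpha) / n           # solves exp(-n*p) = alpha
    z = NormalDist().inv_cdf(1 - alpha)
    score = z**2 / (n + z**2)           # solves (0 - n*p)/sqrt(n*p*(1-p)) = -z
    return exact, poisson, score

for alpha in (0.10, 0.05, 0.01):
    print(alpha, upper_limits(4, alpha))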
39. One of the large automobile manufacturers has received many complaints con-
cerning brake failure in one of their current models. The cause was traced to a
factory-defective part. This same part was found defective in six out of a group of
sixteen cars inspected; these six cars were designated "unsafe."
(a) Test the hypothesis that if this model is recalled for inspection, no more than
10% in this population will be designated "unsafe."
(b) Find an upper confidence bound for the proportion "unsafe," with a level of
0.95.
(c) Use the large-sample method to find an approximate upper 95 % confidence
bound.
(d) Find a two-sided 95 % confidence interval for the proportion of cars without
the defective part.
(e) What inference procedure seems most helpful to the company managers and
why?
(f) Which assumption for the binomial model is likely not to be satisfied in this
example?
40. Show that the confidence region corresponding to the usual two-tailed binomial
test (defined in Sect. 4.5) is an interval and that its endpoints are the confidence
bounds (defined in Sect. 6.1) at level 1 − (α/2).
41. Verify the result stated in (6.11) concerning the expected length of a confidence
interval.
*42. Let C(S) be a confidence region for a parameter θ, and let V(S) = ∫_{C(S)} dθ be the size
of C(S). Denote by Q(θ) the probability that C(S) includes θ under any fixed
distribution of S, that is, Q(θ) = P[θ ∈ C(S)]. Show that the expected size is
E[V(S)] = ∫ Q(θ) dθ. (Hint: This is just a change of order of integration in disguise.)
*44. For the randomized, lower-tailed binomial test of p at exact level α, show that the
probability of "acceptance" when s is observed is a(p, s) if 0 ≤ a(p, s) ≤ 1, is 1
if a(p, s) ≥ 1, and is 0 if a(p, s) ≤ 0, where

a(p, s) = [P_p(S ≤ s) − α]/P_p(S = s).
*45. Show that a(p, s) as defined in Problem 44 is a decreasing function of p for fixed s.
*46. The table below gives the usual upper 90% confidence limits and some alternate
limits for a binomial parameter p when n = 6. Note that the alternate limit is larger
for S = 0 than for S = 1.
(a) Show that the alternate procedure has confidence level 0.90.
(b) Show that, when the true p = 0.5, the alternate limits have smaller probability
of exceeding p₀ than the usual limits for 0.500 < p₀ < 0.510 and the same
probability for p₀ > 0.510.
(c) Show that the alternate limits have smaller expected "excess" when p = 0.5.
(d) What happens in (b) and (c) for other values of p?
(e) Show that the "acceptance" region for H₀: p = p₀ corresponding to the
alternate procedure is not an interval for 0.500 < p₀ < 0.510.
[Table of usual and alternate upper 90% confidence limits for s = 0, 1, ..., 6 omitted.]
*47. Consider the family of nonrandomized, lower-tailed binomial tests which, for each
null hypothesis value p, have exact level as near α as possible. Show that the cor-
responding confidence region is an interval.
*48. Using the facts stated in the first paragraph of Sect. 7.4, show that for a null hypo-
thesis p = p₀ or p ≥ p₀
(a) a randomized, lower-tailed binomial test
(i) is uniformly most powerful at its exact level;
(ii) uniformly minimizes the probability of rejection for p > p₀ among tests
as powerful at p₁ for any p₁ < p₀;
(iii) is admissible;
(b) the class of all randomized, lower-tailed binomial tests is complete.
*49. (a) Show that, under any true value p₀, the usual, nonrandomized upper con-
fidence bound for the binomial parameter p uniformly minimizes both
(i) the probability of exceeding values of p > p₀,
and
(ii) the expected "excess" over p₀,
among upper confidence bounds having no greater probability of falling below
any true value p₀.
(b) Show that the randomized upper confidence bound at exact level α for all p has
the properties stated in (a).
(c) Prove a similar result for any confidence procedure corresponding to a family of
one-tailed binomial tests.
50. (a) Show that there is one and only one unbiased, two-tailed test for a given
binomial null hypothesis p = p₀ (0 < p₀ < 1) at a given level α.
(b) Show that this test is not equal-tailed in general.
(c) Show that, in general, it is randomized at both critical values.
(d) Show that, in general, even adjusting α will not make the test nonrandomized.
51. Show that a one-tailed binomial test is unbiased, and hence that a one-sided
confidence procedure is also unbiased.
52. Show that a minimum likelihood test based on a sufficient statistic S having a
unique null distribution is most powerful against the alternative that S is uniformly
distributed, among tests at the same exact level.
53. Suppose that an unbiased test of the binomial null hypothesis p = p₀ is given for
each p₀, but the exact level α(p₀) varies with p₀. What property related to unbiased-
ness does the corresponding confidence procedure have?
54. Show that if p is distributed uniformly over (0, 1) and, for given p, S is binomial with
parameters n and p, then the marginal distribution of S is discrete uniform on the
integers 0, 1, 2, ..., n. (For a generalization, see Raiffa and Schlaifer [1961],
pp. 237-241.)
55. (Continuation of Problem 54) Show that the average power of a nonrandomized
test of a binomial null hypothesis p = p₀, that is, the integral of the power curve
over p, equals the number of possible values of S which are in the rejection region
divided by n + 1.
56. Demonstrate formula (8.2) for the average expected length of a binomial confidence
region.
*57. Generalize formula (8.2) to binomial confidence regions which are not necessarily
intervals.
62. Show the equivalence of the conditions (1)-(5) for convergence in distribution given
at the beginning of Sect. 9.2.
63. (a) If Xₙ is normal with mean 0 and variance n, what is lim Fₙ(x)?
(b) If F is nondecreasing on (−∞, ∞) and 0 ≤ F ≤ 1, construct a sequence of
c.d.f.'s Fₙ such that Fₙ(x) → F(x) for every x at which F is continuous.
64. (a) Show that if a sequence of distributions on a finite set converges in distribution,
then the conditions of Theorem 9.1(1) hold.
(b) Give a counterexample for a countably infinite set.
65. Apply the Central Limit Theorems 9.2 and 9.3 to the binomial distribution.
CHAPTER 2
One-Sample and Paired-Sample Inferences Based on the Binomial Distribution

1 Introduction
The goal of statistical inference procedures is to use sample data to obtain
information, albeit uncertain, about some larger population or data-
generating process. The inferences may concern any aspect of a suitably
defined population (or process) from which observations are obtained, for
example, the form or shape of the probability distribution of some variable
in the population, or any definable properties, characteristics or parameters
of that distribution, or a comparison of some related aspects of two or more
populations. Procedures are usually classified as nonparametric when some
of their important properties hold even if only very general assumptions are
made or hypothesized about the probability distribution of the observations.
The word "distribution-free" is also frequently used in this context. We will
not attempt to give an exact definition of "nonparametric" now or later, as it
is only this general spirit, rather than any exact definition, which underlies
the topics covered in this book.
In order to perform an inference in one-sample (or paired sample)
problems using the methodology of parametric statistics, information about
the specific form of the population must be postulated throughout or in-
corporated into the null hypothesis. The traditional parametric procedure
then either postulates or hypothesizes a specific form of population, often the
normal, and the inferences concern some population parameters, typically
the mean or variance or both. The exact distribution theory of the statistic,
and hence the probabilities of both types of errors in testing and the confi-
dence level in estimation, depend on this population form. Such inference
procedures may or may not be highly sensitive to the population form. If they
are not, the procedure is said to be "robust." Robustness has been extensively
studied. (See, for instance, Bradley [1968, pp. 28-40] and references given
there.)
A nonparametric procedure is specifically designed so that only very
general characteristics of the relevant populations need be postulated or
hypothesized, for example, that a distribution is symmetric about some
specified point. The inference is then applicable to, and completely valid in,
quite general situations. In the one-sample case, the inference concerns some
definable property or aspect of the one population. For example, if symmetry
is assumed, such an inference might concern the true value of the center of
symmetry and its exact level may be the same for all symmetric populations.
Symmetry is a much less restrictive assumption than normality. Alternatively,
the inference may be an estimate or hypothesis test of the value of some other
parameter in a general population. In short, a nonparametric procedure is
designed so as to be perfectly robust in certain respects (usually the exact
significance or confidence level) under some very general assumptions.
The remainder of this book will consider various situations where in-
ferences can be made using nonparametric procedures, rather than studying
post hoc the robustness of parametrically derived procedures. In this chapter
the inferences will be based on the binomial probability distribution; how-
ever, they are valid for observations from general populations. The first type
of inference to be covered relates to the value of a population percentile point
like the median, the first quartile, etc. For data consisting of matched pairs
of measurements, the same procedures are applicable for inferences con-
cerning the population of differences of pairs. If the matched pair data are
classificatory rather than quantitative, for example classified as either success
or failure, inferences about the differences of pairs can also be made by similar
procedures, but they merit separate discussion. Finally, we will discuss one-
sample procedures for setting tolerance limits for the distribution from which
observations are obtained.
2 Quantile Values
Most people are familiar with the terms percentiles, median, quartiles, etc.,
when used in relation to measurement data, for example, in reports of test
scores. These are points which divide the measurements into two parts, with
a specified percentage on either side. If a real random variable X has a
continuous distribution with a positive density, the statement that the median
or fiftieth percentile is equal to 10 means that 10 is the point having exactly
one-half of the distribution of X below it and one-half above. This statement
can be expressed in terms of probability as
P(X < 10) = P(X > 10) = 0.5, or equivalently F(10) = 0.5.

Figure 2.1 [c.d.f.'s F(x) illustrating quantile values in three cases, panels (a), (b), and (c)]
3 The One-Sample Sign Test for Quantile Values
Let X₁, ..., Xₙ be n independent observations drawn from the same distri-
bution, and suppose we wish to test the null hypothesis that the median of
this distribution is 0. Let us suppose that the point 0 does not have positive
probability, that is, assume that P(Xⱼ = 0) = 0. (The contrary case is more
complicated and will be discussed in Sect. 6.) Then with probability 1, each
observation is either positive (>0) or negative (<0), and P(Xⱼ < 0) =
P(Xⱼ > 0) = 0.5 under the null hypothesis. We could then test the null
hypothesis by counting the number S of negative (or positive) observations.
Under the null hypothesis, S is binomial with parameter p = 0.5, and the
tests discussed in Chap. 1 for this hypothesis may be applied to S as defined
here.
An upper-tailed test, rejecting when S is larger than some critical value,
that is, when there are too many negative observations, is appropriate against
alternatives under which the probability of a negative observation is larger
than 0.5, that is, alternatives with P(Xⱼ < 0) > 0.5. Under such alternatives,
the population distribution is more negative than under the null hypothesis
in the sense that its median is negative instead of O.
A lower-tailed test is appropriate against alternatives under which the
population distribution has a positive median. A two-tailed test is appropriate
when one is concerned with both alternatives.
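All three versions of the test are immediate to compute from exact binomial tail probabilities; the following Python sketch (ours, not the authors') implements them for a hypothesized median of 0:

from math import comb

def sign_test_pvalue(xs, tail="two"):
    """Exact sign test of H0: median = 0, assuming no observation is
    exactly 0.  S, the number of negative observations, is
    binomial(n, 0.5) under H0 (a minimal sketch)."""
    n = len(xs)
    s = sum(1 for x in xs if x < 0)
    p_le = sum(comb(n, k) for k in range(s + 1)) / 2**n     # P(S <= s)
    p_ge = sum(comb(n, k) for k in range(s, n + 1)) / 2**n  # P(S >= s)
    if tail == "upper":   # against alternatives with a negative median
        return p_ge
    if tail == "lower":   # against alternatives with a positive median
        return p_le
    return min(1.0, 2 * min(p_le, p_ge))

print(sign_test_pvalue([0.3, -1.2, 2.1, 0.7, -0.4, 1.5, 0.9]))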
Let F be the c.d.f. of the distribution from which the Xⱼ are sampled. Since
we are assuming that the point 0 does not have positive probability, F(0) =
P(Xⱼ ≤ 0) = P(Xⱼ < 0), and the null hypothesis can be stated as H₀:
F(0) = 0.5. Notice that an alternative distribution with F(0) > 0.5 is more
negative, in the above sense, because if the probability of a negative observa-
tion exceeds 0.5, the population median must be negative. That is, loosely
speaking, the larger F is, the more negative the population. This is illustrated
in Fig. 3.1, for arbitrary c.d.f.'s F₁ and F₂, where F₂(0) > F₁(0) = 0.5 and the
medians are related by ξ₂ < ξ₁ = 0.
The power of the tests above is, of course, just the power of whichever
binomial test is used (lower-, upper-, or two-tailed) against the alternative
F(0) = p for some p ≠ 0.5.
Of course, there is nothing special about the particular value 0 for the
median. To test the null hypothesis that the distribution of every observation
has median ξ₀, say, assuming that P(Xⱼ = ξ₀) = 0 for all j, we would define
S as the number of observations which are smaller than ξ₀. Under this null
hypothesis, S is again binomial with parameter p = 0.5.

Figure 3.1 [c.d.f.'s F₁ and F₂ with F₂(0) > F₁(0) = 0.5 and ξ₂ < ξ₁ = 0]
integers just above and below np, that is, whenever p ≤ (s_ℓ + 1)/n and/or
p ≥ (s_u − 1)/n, where s_ℓ and s_u are the lower and upper critical values respec-
tively. It underestimates the power of a one-tailed test for p ≤ s_ℓ/n or p ≥ s_u/n
as relevant (where the power exceeds 0.5, approximately). The power of a
two-tailed test can be underestimated in the same region by simply ignoring
the (ordinarily negligible) probability of rejection in the "wrong" tail and
using the one-tailed lower bound for the "correct" tail. These results for the
"fixed effects" model are due to Hoeffding [1956].
The one-tailed sign tests are, in a sense which will now be described, the best
possible. Certain of the two-tailed tests have similar but less strong proper-
ties, and all are admissible.
Assume once more that the observations X₁, ..., Xₙ are independent and
identically distributed and we wish to test the null hypothesis that the median
ξ of the population distribution is a particular number ξ₀. Assume also that
P(Xⱼ = ξ₀) = 0 under both null and alternative hypotheses. Consider the
class of tests based on S, the number of observations smaller than ξ₀. Since
S is binomial, there is a level α test based on S which is uniformly most
powerful against one-sided alternatives, namely the appropriate one-tailed
binomial test at exact level α. This test is just the one-tailed sign test at exact
level α.
What if one considers not only tests based on S, but also tests which make
further use of the original observations? One might think that a better test
could be produced by taking into account how far above and below ξ₀ the
observations fall, rather than just how many fall above and how many below.
This is not possible, however, as long as one insists that the test have level α
for every population distribution with median ξ₀. The one-tailed sign test at
exact level α which rejects when S is too small has greater power than any
other such test against every alternative distribution under which the median
exceeds ξ₀. A similar statement holds for the test rejecting in the other tail
and alternatives on the other side. In other words, a one-tailed sign test at
exact level α is uniformly most powerful against the appropriate one-sided
alternative, among all level α tests based on X₁, ..., Xₙ for the null hypothesis
that the median is ξ₀.
For two-sided alternatives there is, of course, no uniformly most powerful
test of the null hypothesis that the median is ξ₀. Suppose, however, that we
consider only unbiased tests, that is, tests which reject with probability at
least as great under every alternative as under the null hypothesis. Among
these tests, the unbiased sign test (which is equal-tailed) is uniformly most
powerful.
The symmetry of the situation suggests another way of restricting the class
of tests to be considered with a two-sided alternative. Suppose we had
observed not X₁, ..., Xₙ but Y₁, ..., Yₙ, where Yⱼ is the same distance from
ξ₀ as Xⱼ but on the other side (that is, Yⱼ = 2ξ₀ − Xⱼ). If the Xⱼ satisfy the
null hypothesis, so do the Yⱼ; if they satisfy the alternative, so do the Yⱼ. Hence,
in the absence of other considerations, it seems equally reasonable to apply
a test to the Y's as to the X's, and it would be unpleasant if different outcomes
resulted. This suggests requiring that a test be symmetric in the sense that
applying it to the Y's always gives the same outcome as applying it to the
X's. Among such tests also, the equal-tailed sign test is uniformly most
powerful.
Every two-tailed sign test is admissible (in the two-conclusion interpreta-
tion of two-tailed tests); that is, any test having greater power at some
alternative has either smaller power at some other alternative or greater
probability of rejection under some distribution of the null hypothesis.
The restriction to identically distributed observations can be relaxed
without affecting any of the properties above. That is, the results hold for
alternatives under which X₁, ..., Xₙ are independent, with P(Xⱼ < ξ₀) the
same for all j, but are not necessarily identically distributed, provided the null
hypothesis is similarly enlarged.
If" po-point" is substituted for the hypothesized median value eo through-
out, the foregoing properties continue to hold, except that an unbiased test
is not equal-tailed when Po :f= 0.5, and the discussion of symmetry no longer
applies. In summary, the" optimum" properties ofthe sign tests are as follows.
If X 1> ••• , X n are independent and identically distributed with
P(X j = eo) = 0, then among tests of the null hypothesis P(X j < eo) = Po:
(a) A one-tailed sign test is uniformly most powerful against the appropriate
one-sided alternative;
(b) Any two-tailed sign test is admissible;
(c) A two-tailed, unbiased sign test is uniformly most powerful against the
two-sided alternative P(Xj < eo) :f= Po among unbiased tests and, when
Po = 0.5, among symmetric tests.
If Xl' ... , X n are not necessarily identically distributed but are indepen-
dent with P(Xj < eo) = P(Xj ~ eo) = p for allj, then the same statements
apply to the null hypothesis p = Po and the alternative P :f= Po.
The proof of the foregoing statements, which will be given in Sect. 3.3,
depends on the fact that the null hypotheses are very broad and are satisfied
by some peculiar distributions, like the density represented by the dotted line
in Fig. 3.2(b). If one is willing to test a more restrictive null hypothesis, there
could well be tests which are more powerful, at least against some alterna-
tives. For instance, the hypothesis that the median is ξ₀ might be replaced by
the hypothesis that the distribution is symmetric around ξ₀. Nonparametric
tests of this null hypothesis will be discussed in Chaps. 3 and 4.
For p₀ ≠ 0.5, restrictions of the null hypothesis that the p₀-point is ξ₀
have been studied only for parametric situations. For instance, the hypothesis
might be that the observations come from a normal population with po-point
Figure 3.2 [(a) c.d.f.'s of the type defined by F and G; (b) the corresponding densities f and g, with a dotted line representing a peculiar null density]
ξ₀. Procedures based on such assumptions are outside the scope of this book,
but it is relevant to mention that considerable risk accompanies their apparent
advantages. The risks and advantages are easily seen in terms of the estimators
involved. With no distribution assumption, the true population probability
below ξ₀, say p, would be estimated by S/n, where S is the number of observa-
tions below ξ₀ in the sample. With the assumption of a normal population,
the probability below ξ₀ is p′ = Φ[(ξ₀ − μ)/σ], where μ and σ are the
population mean and standard deviation and Φ is the standard normal c.d.f.
Intuition would lead one to use p̂ = Φ[(ξ₀ − X̄)/s] as an estimator for p′,
where X̄ and s are the sample mean and standard deviation. (This estimator
p̂ is slightly biased for p′ in normal populations, but can be adjusted to be
unbiased and in fact minimum variance unbiased [Ellison, 1964].) Under
normality, p̂ (or p̂ adjusted) is a much better estimator than S/n. However, a
departure from normality which looks minor can easily lead to an important
difference between p′ and the true proportion p and therefore very poor
properties for p̂. (Typical goodness-of-fit tests of the normality assumption
will throw almost no light on this crucial question.) In fact, such information
as the sample provides about p beyond the value of S/n is, in common sense,
relatively little and difficult to extract. The advantage of the estimator p̂ over
S/n relies most heavily on the assumed normal shape when ξ₀ is in the extreme
tails, and hence these reservations about p̂ apply most strongly when p is
close to 0 or 1, which is unfortunately just when the advantage of p̂ is also
greatest. (See also Sect. 3.1 of Chap. 8.)
Why do similar reservations not apply to nonparametric procedures
based on symmetry? Because symmetry may well be more plausible than
normality, and the effects of departures from the assumption are less serious.
In fact, nonparametric procedures based on symmetry are often used to make
inferences about the location of the "center" of the population distribution;
a departure from symmetry will require that this "center" be defined some-
how, and the definition implicit in the procedure used may be satisfactory.
In contrast, an inference about p is typically made in a situation where the
*3.3 Proofs

We will demonstrate the optimum properties summarized above for the sign
tests only in the case where X₁, ..., Xₙ are independent and identically
distributed with P(Xⱼ = ξ₀) = 0. We will prove that the one-tailed, and
unbiased two-tailed, sign tests of the null hypothesis P(Xⱼ < ξ₀) = p₀ are,
respectively, uniformly most powerful, and uniformly most powerful un-
biased (and, when p₀ = 0.5, uniformly most powerful symmetric) against the
appropriate alternatives. The proofs of admissibility and the extension to
observations with possibly different distributions are requested in Problems
11 and 12.
Consider any alternative distribution G with P_G(Xⱼ < ξ₀) =
P_G(Xⱼ ≤ ξ₀) = p_G ≠ p₀. For any p, define F as a distribution with prob-
ability p to the left of ξ₀ and the same conditional distribution as G on each
side of ξ₀. (To simplify notation, the dependence of F on p will not be in-
dicated.) Specifically, let

F(x) = (p/p_G) G(x)   for x ≤ ξ₀,
F(x) = 1 − [(1 − p)/(1 − p_G)] [1 − G(x)]   for x ≥ ξ₀.
Then P_F(Xⱼ < ξ₀) = P_F(Xⱼ ≤ ξ₀) = p. Also, the conditional distribution of
Xⱼ, given that Xⱼ < ξ₀, is the same under F as G, as is the conditional
distribution given that Xⱼ > ξ₀. The family of distributions F includes G
(when p = p_G) and a null distribution F₀ (when p = p₀). The point of the
definition is that F₀ is the null distribution "most like" the given alternative
G, or "least favorable" for testing against G, as we shall see.
Figure 3.2(a) illustrates possible c.d.f.'s of the type defined by F and G;
notice the abrupt change of slope in F at ξ₀. The corresponding densities f
and g are shown in Fig. 3.2(b), where f and g are related by

f(x) = (p/p_G) g(x)   for x < ξ₀,
f(x) = [(1 − p)/(1 − p_G)] g(x)   for x > ξ₀.
Theorem 3.1. Let H₀ be a null hypothesis and suppose that H′₀ is contained in
H₀.
(a) A test is most powerful against an alternative G among level α tests of H₀
if it has level α for H₀ and is most powerful against G among level α tests of
H′₀.
(b) A test is most powerful against G among level α tests of H₀ which are
unbiased against H₁ if it has level α for H₀, is unbiased against H₁, and is
most powerful against G among level α tests of H′₀ which are unbiased
against H′₁, where H′₁ is contained in H₁.
(c) A test is most powerful against G among level α symmetric tests of H₀ if it is
symmetric, has level α for H₀, and is most powerful against G among
symmetric level α tests of H′₀.
(d) The property in (c) holds if the requirement of symmetry is replaced by any
requirement that does not depend on H₀.
P(X₍ᵣ₎ < ξ_p < X₍ᵥ₎) = P(X₍ᵣ₎ < ξ_p) − P(X₍ᵥ₎ ≤ ξ_p),
where

B(r, v) = (r + v − 1)!/[(r − 1)!(v − 1)!].   (4.5)
As a final comment, we note that the event X₍ᵣ₎ < ξ_p is equivalent to
F(X₍ᵣ₎) < p. Applying the probability integral transformation, the latter
inequality can be replaced by U₍ᵣ₎ < p, where U₍ᵣ₎ is the rth order statistic
from a uniform distribution on (0, 1). Since the density of U₍ᵣ₎ is the integrand
of (4.4) (Problem 30), this observation provides a direct verification of the
expression given in (4.4).
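This order-statistic coverage probability is itself a binomial sum, which can be confirmed by simulation; a Python sketch (ours; the standard normal parent is an arbitrary choice):

from math import comb
from random import gauss, seed
from statistics import NormalDist

# P(X_(r) <= xi_p < X_(v)) equals a binomial(n, p) sum, since the event
# occurs exactly when the count K of observations at or below xi_p
# satisfies r <= K <= v - 1.
def coverage(n, r, v, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, v))

def simulated_coverage(n, r, v, p, reps=50_000):
    xi_p = NormalDist().inv_cdf(p)      # true p-quantile of the parent
    seed(0)
    hits = 0
    for _ in range(reps):
        xs = sorted(gauss(0, 1) for _ in range(n))
        hits += xs[r - 1] <= xi_p < xs[v - 1]
    return hits / reps

print(coverage(20, 6, 15, 0.5), simulated_coverage(20, 6, 15, 0.5))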
(see Problem 49, Chap. 3). Thus the weights for the true α are (1 − λ, λ),
while the approximation above uses (1/2, 1/2). For α fixed, λ → 1/2 as
n → ∞, but at typical levels and in samples small enough to make discreteness
severe and interpolation really interesting, λ is not close to 1/2 and the
approximation in (5.1) is disappointingly inaccurate. For instance, for α near
0.05 the exact values of λ are 0.28 when n = 9 and 0.32 when n = 17
(Problem 39). These two sample sizes were chosen since the attainable levels
for testing p = 0.5 are approximately equidistant from 0.05 (see Table B).
With p< = P(Xⱼ < ξ₀) and p> = P(Xⱼ > ξ₀), consider the parameter

p = p< / (p< + p>).   (6.1)
It is obvious that p = 0.5 when p< = p>, and that p is larger or smaller than
0.5 according as p< is larger or smaller than p>. A test of the null hypothesis
p< = p> (or equivalently that p = 0.5), against either one- or two-sided
alternatives, can then be based on S<. This amounts to omitting from the
sample those observations which equal ξ₀ and applying to the remaining
observations a test of Sect. 3.1 for the null hypothesis that ξ₀ is the median,
using the reduced sample size. Any test of this type will be called a conditional
sign test, because it is conditional on the value of N.
The parameter p may itself be of interest, as may the quantity p< − p>.
For inferences about this difference, approximate methods can be used,
based on the fact that S< − S> is approximately normally distributed with
mean n(p< − p>) and variance n[p< + p> − (p< − p>)²], provided that this
variance is not too small (Problem 42). The estimated variance
S< + S> − [(S< − S>)²/n] can be used in place of the unknown true variance.
It may be appropriate to incorporate a correction for continuity in the
amount of ½ to S< − S>, although the appropriate correction would be 1
rather than ½ in the case p< + p> = 1,
when S< + S> = n with certainty (Problem 42).
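A sketch (ours) of this large-sample two-sided test, using the estimated variance and the ½ continuity correction just described; the counts shown are hypothetical:

from statistics import NormalDist

def approx_difference_test(s_less, s_greater, n):
    """Two-sided large-sample test of H0: p< = p> based on
    D = S< - S>, with estimated variance S< + S> - D**2/n and a
    1/2 continuity correction; n is the full sample size, ties
    included.  A sketch under the approximations in the text."""
    d = s_less - s_greater
    est_var = s_less + s_greater - d**2 / n
    z = (abs(d) - 0.5) / est_var**0.5
    return 2 * (1 - NormalDist().cdf(z))

# hypothetical counts: 12 observations below xi0, 4 above, 4 ties
print(approx_difference_test(s_less=12, s_greater=4, n=20))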
Since the procedures and properties of tests concerning cases (a), (b) and (c)
of the previous subsection have already been discussed in Sect. 3, the remain-
der of this section will be devoted to those tests appropriate for case (d),
specifically, the conditional sign tests just described. We first consider a direct
argument in favor of performing a test for the null hypothesis H₀: p< = p>
conditionally on N, the number of observations which are not equal to ξ₀.
The argument is as follows. The number N does not pertain to the matter
under test but is, in effect, the sample size, because the n - N observations
which equal ξ₀ (the "ties") are irrelevant. Accordingly, whatever test is
performed, the effective sample size N and the properties of the test for that
given N should be reported. Failing to do so would be tantamount to using
a procedure involving a sample size which is random but reporting its
overall properties for any size sample instead of its properties for the sample
size actually used. For example, suppose a sample is taken, of size n₁ or n₂,
where the probability of each sample size is ½. Suppose further that a test at
level 0.01 is made when the sample size is n₁ and a test at level 0.09 when it is
n₂. The overall level of this procedure is 0.05, but it would be misleading to
report simply that a test had been made at level 0.05, withholding the in-
formation about whether in a particular instance the level was really 0.01
or 0.09. This is not an argument against varying the level in this way, but only
an argument for quoting the level and sample size actually used.
Such an argument for a conditional procedure can be very compelling.
On the other hand, there are situations where the conditional argument leads
to chaos at best. It is not always possible to condition on everything one
might like. Worse yet, the conditional argument, together with some ap-
parently harmless assumptions, leads to the radical conclusion that tail
probabilities are irrelevant and inferences should be based only on the
probability of the actual sample under the various hypotheses (the likelihood)
[Birnbaum, 1962]. Thus conditioning poses fundamental problems for
orthodox inference methods; these problems have no satisfactory resolution
entirely within the frequency interpretation of probability. Even the radical
conclusion is in accord with Bayesian and likelihood philosophies, however.
Consider the one-sided alternative p< < p>, that is, P(Xⱼ < ξ₀) <
P(Xⱼ > ξ₀). Let G be any distribution satisfying this alternative hypothesis.
In order to adapt the method of Sect. 3.3 to the present situation, for any p<
and p>, we define a distribution F as follows: P_F(Xⱼ < ξ₀) = p<; the condi-
tional distribution of Xⱼ, given that Xⱼ < ξ₀, is the same under F as under G;
P_F(Xⱼ = ξ₀) = 1 − p< − p>; P_F(Xⱼ > ξ₀) = p>; and the conditional dis-
tribution of Xⱼ, given that Xⱼ > ξ₀, is the same under F as under G (Problem
43a). Then the family ℱ of distributions F includes G and it includes a null
distribution for each value of p< = p> (Problem 43b).
For notational convenience we let S = S<, the number of observations
below ξ₀. Then for this family of distributions, the statistics S and N are
jointly sufficient (Problem 43c). We shall show that any unbiased test based
on S and N is conditional. We already know that, among tests at conditional
level α, a one-tailed conditional sign test at level α has uniformly greatest
conditional power against the appropriate one-sided alternative. The
desired conclusion, (a) of Sect. 6.3, then follows (Problem 44c).
It remains to show that any unbiased test φ(S, N) is conditional. We note
first that
E_F[φ(S, N)] ≤ α   (6.3)
for all null distributions F, while
E_K[φ(S, N)] ≥ α   (6.4)
for all alternative distributions K, by unbiasedness. Second, every null
distribution F is a limit of alternative distributions (e.g., the alternatives
K = [(m − 1)F + G]/m, or alternatives in the family ℱ for which p< and
p> approach their values under F). It follows that
E_F[φ(S, N)] = α   (6.5)
for all null distributions F.
Consider now the null distributions of the family ℱ above. For this
subfamily, N is a sufficient statistic (Problem 43d), and hence the conditional
probability of rejection given N is a function of N alone, say α(N). That is,
for p< = p> = r/2 say,
E_r[φ(S, N)|N] = α(N)   (6.6)
where α(N) does not depend on r. Since (6.5) holds for all null distributions,
taking the expected value of both sides of (6.6) gives
E_r[α(N)] = α for all r.   (6.7)
Now N, the number of observations not equal to ξ₀, is binomially distributed
with parameters n and r = p< + p>. Since the family of binomial distribu-
tions is complete (Chap. 1, Sect. 3.4) it follows that α(N) = α for all N, that is,
the test is conditional. This is all that remained to be proved. □
Remarks
The type of argument employed in the previous two paragraphs often applies.
It is summarized in the following theorem, whose proof is requested in
Problem 45.
Theorem 6.1. Any unbiased test at level α has probability exactly α of rejection
on the common boundary K of the null hypothesis H₀ and the alternative H₁.
If T is a complete sufficient statistic for K, then any unbiased test at level α has
conditional level exactly α for all distributions of K, conditional on T. Hence if
T is a complete sufficient statistic for H₀, and if H₀ ⊆ K, that is, H₀ is con-
tained in the boundary of H₁, then any unbiased test at level α is a conditional
test, conditional on T.
A level α test of the null hypothesis p< = p> which is unbiased against
p< ≠ p> must have conditional level α given N, as follows from either
corresponding one-sided statement, but it need not be conditionally unbiased
(Problem 46). Accordingly, in contrast to the one-sided case (Sect. 6.4), the
fact that the equal-tailed, level α, conditional sign test is uniformly most
powerful among conditionally unbiased tests does not imply directly that it is
uniformly most powerful among unconditionally unbiased tests. To prove
that it is, let G be any alternative and consider its family ℱ of distributions F
as defined in Sect. 6.4. We will prove that, among tests having level α for the
null distributions of the family ℱ and unbiased against the alternatives of the
family ℱ, the equal-tailed, level α, conditional sign test is uniformly most
powerful against these alternatives, and, in particular, is most powerful
against G. Since it is in fact a level α, unbiased test for the original, more
inclusive, null and alternative hypotheses, it is, among such tests also, most
powerful against G (Theorem 3.1) and thus uniformly most powerful, since
G was arbitrary.
Now restrict the problem to the family ℱ. For this family, a sufficient
statistic is (S, N). The distribution of (S, N) may be described as follows. N is
binomial with parameters n and r = p< + p>, while given N, S is binomial
with parameters N and p = p</r, as at (6.1).
We seek a test φ(S, N) of the null hypothesis p = 0.5 (that is, p< = p>),
which is unbiased against the alternative p ≠ 0.5 (that is, p< ≠ p>). Let
α(r, p) = E_{r,p}[φ(S, N)]   (6.8)
be the power (the level, when p = 0.5) of the test φ. Let
α(p|N) = E_{r,p}[φ(S, N)|N]   (6.9)
be the conditional power (level, when p = 0.5) of φ given N, which is a
function of p and N alone, not depending on r, because the conditional
distribution of S given N is a function of p and N alone. If φ is unbiased at
level α, then
α(r, 0.5) ≤ α,
α(r, p) ≥ α for p ≠ 0.5.   (6.10)
It follows, as we saw in the one-sided case, that α(r, 0.5) = α and that the
conditional level α(0.5|N) = α. It also follows (Problem 47) that
7 Paired Observations
Frequently in practice measurements or observations occur in pairs; the two
members of a pair might be treated and untreated, or male and female, or
math score and reading score, etc. While the pairs themselves may be in-
dependent, the members of a pair are related in some way. This relationship
within pairs may be present because of the nature of the problem, or may be
artificially imposed by design, as when experimental units are matched
according to some criterion. The units or pairs may be drawn randomly from
some population of interest, and the assignment within pairs may be random.
If not, additional assumptions may be needed, depending on the type of
inference desired.
For example, suppose that a random sample of individuals is drawn, and
a pair of observations is obtained for each individual, like one before and one
after some treatment. Then each individual acts as his own "control." Under
the assumption (not to be treated lightly) that there is no time-related change
other than the treatment, one can estimate the effect of the treatment, and
with smaller sampling variability than if the controls were chosen indepen-
dently of the treated individuals. If instead each individual receives two
treatments, such as a headache remedy administered on two completely
separate occasions, it may be possible to assign the treatments to the occasions
randomly for each individual. For comparing the two treatments in terms of
some measure of effectiveness, this provides a similar advantage in efficiency
without requiring such strong assumptions. One of the treatments could, of
course, be a placebo or other control treatment.
More generally, suppose that the units to be observed are formed into pairs
in some way, either naturally or according to some relevant criterion, and
observations are made on both members of each pair. A pair here might be
two siblings, two litter mates, a husband-and-wife couple, two different sides
of a leaf, two different but similar schools, etc., or one individual at two times,
as above. If the matching is such that the members of a pair would tend to
respond similarly if treated alike, random variation within pairs is reduced
and the nonrandom variation is easier to observe. If a difference is then ob-
served between two treatments, the difference can be attributed to the effect
of the treatments rather than to random differences between units with more
assurance than could an equal difference observed in a situation without
matching.
If, within each pair, one unit is selected at random to receive a certain
treatment, and the other unit receives a second treatment (or serves as a
control), we have a matched-pair experiment. If the pairs themselves are
independently drawn from some population of pairs, we have a simple random
sample of pairs. Another possibility is to draw a simple random sample of
individuals and then form pairs within this sample. If either type of randomi-
zation is lacking, as in the before-after example above, it is especially impor-
tant to consider the assumptions required for the type of inference being made.
In an analysis of paired observations, it is technically improper and
ordinarily disadvantageous to disregard the pairing. A convenient approach
to taking advantage of the pairing usually results if the measurements on the
two members of each pair are subtracted and the analysis is performed on the
resulting sample of differences. To a great extent, this procedure effectively
reduces a paired-sample problem to a one-sample problem, but the assump-
tions which are appropriate for the sample of differences depend on what
assumptions are appropriate for the pairs.
In the situation under discussion here, the data might be recorded using the
format of Table 8.1. For each pair, we subtract the score on I from the score
on II to obtain differences which are either +1, −1, or 0. While there may
be any number n of pairs observed, there are only four categories of response,
that is, four distinguishable pairs of scores, and these are listed in the table.
The last column shows the four symbols which we shall use to designate the
number of pairs observed in each of the four categories.
Now suppose that the null hypothesis of primary interest is that the
probability of a score of 1 on I is the same as the probability of a score of 1
on II, that is, p_I = p_II. The difference p_II − p_I is equal to the probability of a
positive difference score II − I, minus the probability of a negative difference
score II − I (Problem 53). That is, the null hypothesis p_II − p_I = 0 is
equivalent to the hypothesis that the difference scores of +1 and −1 are
equally likely to occur in the population. Accordingly, the test suggested in
(d) of Sect. 6.1 applied to the numbers A, B, C, and D in Table 8.1 is appropri-
ate in this situation. (The number B corresponds to S< in Sect. 6.1.) The A + D
zero difference scores are ignored, and under the null hypothesis, given
B + C = N, the distribution of B is binomial with parameters N and p = ½.
Hence the sign test with zero differences or ties, which was introduced in
Sect. 6, can be used to test this hypothesis. The properties and interpretation
of this test, which will be called here a test for equality of proportions based
on paired or matched observations (it is also frequently called the McNemar
test) will be discussed later in this section. An example is given in Sect. 8.3.
Table 8.1

Score on I   Score on II   Difference II − I   Category   Number
1            1             0                   1          A
1            0             −1                  2          B
0            1             1                   3          C
0            0             0                   4          D
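In code, the test for equality of paired proportions just described is a conditional binomial computation on B given N = B + C; a minimal Python sketch (ours), with hypothetical counts:

from math import comb

def matched_pairs_pvalue(b, c, tail="two"):
    """McNemar-type sign test of H0: pI = pII.  The A and D
    'agreement' counts drop out; given N = B + C, B is
    binomial(N, 1/2) under H0."""
    n = b + c
    p_le = sum(comb(n, k) for k in range(b + 1)) / 2**n
    p_ge = sum(comb(n, k) for k in range(b, n + 1)) / 2**n
    if tail == "lower":
        return p_le
    if tail == "upper":
        return p_ge
    return min(1.0, 2 * min(p_le, p_ge))

print(matched_pairs_pvalue(b=2, c=9))   # hypothetical counts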
Table 8.2

                  Score on II
                  1    0
Score on I   1    A    B
             0    C    D
Table 8.3

           I        II
Score 1    A + B    A + C
Score 0    C + D    B + D
Table 8.3 shows another alternative method of presenting the data. With
this 2 x 2 format, the quantity we are interested in is the difference between
the proportions in the two columns, since this is equal to (C - B)/
(A + B + C + D). It may therefore appear that Fisher's exact test and the
chi-square test of "no association" are now applicable, but again they cannot
be used, in this case because the assumptions are not satisfied. The quantities
in the two columns labeled I and II are not independent. In fact, the numbers
or proportions here refer to matched pairs, and each pair appears twice in
Table 8.3, once in each column. An adjustment can be made, but it leads
either to the test already suggested or to a large-sample approximation to
that test (Stuart [1957]).
Of course, the format for presentation of the data is largely a matter of
taste and is irrelevant to proper analysis, provided that the situation is
correctly understood. Table 8.1 is quite clear, but is less compact than might
be desired. Tables 8.2 and 8.3, although compact, might lead to misinter-
pretation. In addition, since Table 8.3 gives only the marginal totals of
Table 8.2, it alone does not contain sufficient information for application of
the appropriate test, which requires knowledge of at least B and B + C = N.
Table 8.2 is the more common, but the format of Table 8.1 generalizes more
easily to more than two types of unit or measurement when each observation
is still recorded as 0 or 1. This generalization is equivalent to 0-1 observa-
tions occurring as k-tuples rather than as matched pairs, and the usual test
procedure is Cochran's Q Test (Problem 57).
8.3 Example

Table 8.4

Preference of (H, W)   Number
(M, M)                 5
(M, B)                 8
(B, M)                 3
(B, B)                 4
The purpose of the study was to determine whether views of family preference
for vacation are largely influenced by sex, and hence a possible source of
serious disharmony between husband and wife. Specifically, we wish to
determine whether a married man's view of family preference differs system-
atically or only randomly from his wife's view.
We first present the data in each of the formats that were described in
Sect. 8.2. The frequencies of occurrence for the four response categories of
pairs are easily counted. The results shown in Tables 8.4 and 8.5 are examples
of the general format of Tables 8.1 and 8.2 respectively. Table 8.6 is analogous
to Table 8.3, and it is clear here that the entries in the two columns are not
independent, because each couple appears twice in the table.
Suppose we wish to test the null hypothesis that the probability that the
husband responds mountains while the wife responds beach is equal to the
reverse type of disagreement, that is,
P(M, B) = P(B, M).   (8.1)
If we add P(M, M) to both sides of (8.1), the left-hand side is simply the
probability that the husband responds mountains since
P(M, B) + P(M, M) = P[(M, B) or (M, M)] = P_H(M),
say, while the right-hand side of (8.1) similarly becomes the probability
P_W(M) that the wife responds mountains. Hence the null hypothesis can also
be stated as either
P_H(M) = P_W(M) or P_H(B) = P_W(B),
which may be easier to interpret than (8.1). If H is Type I and M is score 1,
then P_H(M) = P_W(M) represents p_I = p_II here. The ordinary binomial test
procedure of Sect. 6 can therefore be applied, conditionally on the number of
disagreements B + C = N.
Table 8.5

                          Wife's Preference
                          M    B
Husband's Preference   M  5    8
                       B  3    4

Table 8.6

               H     W
Preference  M  13    8
            B  7     12
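Applying the conditional binomial test to the B = 8 and C = 3 disagreements (this numerical application is our illustration, in Python):

from math import comb

# B = 8 (husband M, wife B) and C = 3 (husband B, wife M) discordant
# couples, so N = 11; under (8.1), B is binomial(11, 1/2) given N.
N, B = 11, 8
p_upper = sum(comb(N, k) for k in range(B, N + 1)) / 2**N
print("one-tailed P-value:", p_upper)            # P(B >= 8 | N = 11, p = 1/2)
print("two-tailed P-value:", min(1.0, 2 * p_upper))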
The interpretation of the results of a before versus after test of any kind bears
close scrutiny. Suppose, for example, that the same characteristic is measured
before (I) and after (II) a treatment, and by a one-sided, level α test for equality
of proportions based on matched observations there are significantly more
1's after the treatment than before. Then, if the units constitute a random
sample from some population, the inference, at level α, is that if all elements in
the population had been treated, there would have been more 1's after
treatment than before. However, the inference that the population would
have changed in this direction if treated does not automatically justify the
inference that the treatment would have changed the population. The
observations in themselves provide no information about what would have
happened in the absence of the treatment. In order to make an inference
about the treatment, it is necessary either to assume that the proportion of
1's in the population would not have changed in the absence of treatment or
to run an additional control experiment (Problem 60).
In example (a) at the beginning of Sect. 8, for instance, it is reasonable to
assume that the soldiers would not have changed their opinions in the absence
of a lecture. Hence the effect, if real, may reasonably be attributed to the
lecture and the circumstances surrounding it. Of course, it is conceivable
that a dull lecture on any topic would have made the soldiers pessimistic
about a speedy end to the war. In example (c) if the drugs were given in the
same order to every patient, an apparent difference between the drugs
might be due to a time effect. If the order was randomized for each patient
independently, this difficulty would be obviated. Actually, in this experiment
exactly half of the patients in the sample were chosen at random and given
drug I first while the rest were given drug II first. This alters the null distribu-
tion of the test statistic if there is a time effect, but may be more powerful
inasmuch as it balances out the time effect by design rather than by random-
ization. The test for equality of matched proportions will still be approx-
imately "valid" as long as the time effect is not too large and is "conservative"
in any case (Problem 61). An exact test is given in Problem 106c of Chap.
5. See also Problem 9c and text in Chap. 5.
Even if the effect is attributable to the treatment, the inference is limited
to the population sampled; the effect might be quite different on a population
with a different proportion of 1's initially. Sometimes, of course, one might be
willing to assume that the effect on different populations would be in the same
Some properties of the sign test with ties described earlier in Sect. 6 carry
over to the present situation. We assume throughout that the only data
available are the numbers A, B, C, and D in the four response categories for
a simple random sample of pairs, and that the null hypothesis is p_I = p_II,
with no further restriction on the probabilities.
Then, specifically, the one-tailed test as applied in this section has
uniformly greatest conditional power against the appropriate one-sided
alternative among tests at its conditional level, where "conditional" here
means "given B + C." The equal-tailed test has uniformly greatest conditional
power against any alternative among tests at its conditional level which are
conditionally unbiased against the alternative that the probability p_I of
scoring 1 on I differs from the probability p_II of scoring 1 on II. The one-
tailed, level α, conditional test is, from an unconditional point of view,
uniformly most powerful against the appropriate one-sided alternative
p_I < p_II or p_I > p_II, among level α tests which are unbiased against this
alternative. The equal-tailed, level α, conditional test is, from an unconditional
point of view, uniformly most powerful against the alternative p_I ≠ p_II,
among level α tests which are unbiased against this alternative. Proofs are
requested in Problem 63.
In the situation under discussion, inferences other than a test for equality of
paired proportions may also be of interest. Some of these will be discussed
in this subsection. We continue to use the notation introduced in Sect. 8.1,
that pI and pII are the proportions of pairs in the population scoring 1 on
I and II respectively. We will also now be referring to the joint classification
of observations on the basis of scores of both Types; hence it will be con-
venient to introduce the notation pij, for i = 0, 1, j = 0, 1, to denote the
proportion of pairs in the population with Type I score i and Type II score j.
For example, p01 is the true proportion scoring 0 on I and 1 on II. Thus p11,
p10, p01 and p00 denote the true proportions of the population of pairs
corresponding to the observed numbers A, B, C, and D respectively in
Tables 8.1-8.3. Notice that pI = p10 + p11 and pII = p01 + p11.
The test already described in Sect. 8.1 was for H0: pII - pI = 0, or equiva-
lently p01 - p10 = 0; under this null hypothesis, given B + C = N,
the test statistic B follows the binomial distribution with parameters N and
p = 1/2. The difference p01 - p10 is relevant for comparison of the proportions
of kinds of "disagreements" between scores for the two Types, or kinds of
"switches." This parameter corresponds to the difference parameter p< - p>
(see Sect. 6.1) discussed in the context of the conditional sign test with ties in
Sect. 6, so that the test and confidence interval procedures discussed there
are relevant here also.
In the present context and notation, the parameter p defined in Eq. (6.1)
can be expressed as p = p10/(p10 + p01). It represents the conditional
probability of a score of 1 on I given a disagreement between scores for the
two Types. Since B is the number of observations scoring 1 on I and 0 on II,
we know that, given B + C = N, B follows the binomial distribution with
parameters N and p. (Recall that B corresponds to the number S< defined
in Sect. 6.1.) Hence the usual binomial procedures are appropriate for tests
of hypotheses and confidence intervals for this p and also for (1/p) - 1 =
p01/p10 (Problem 64a). In the present situation, the primary advantage of the
parameter (1/p) - 1 seems to be its adaptability to simple inference tech-
niques, but it will be given a useful interpretation in the next subsection.
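As a concrete illustration, the following minimal sketch (not from the original text; SciPy is assumed available) carries out these binomial procedures. The disagreement counts B and C are hypothetical, chosen only for the example.

```python
# Hypothetical disagreement counts: B pairs scored (1, 0), C scored (0, 1).
from scipy.stats import binomtest

B, C = 15, 6
N = B + C

# Exact binomial test of H0: p = 1/2 (equivalently p10 = p01), given B + C = N.
result = binomtest(B, N, p=0.5)
print("two-sided P-value:", result.pvalue)

# Clopper-Pearson interval for p = p10/(p10 + p01).
ci = result.proportion_ci(confidence_level=0.95, method="exact")
print("95% CI for p:", (ci.low, ci.high))

# Since (1/p) - 1 = p01/p10 is a decreasing function of p, the same
# interval transforms directly (the endpoints swap).
print("95% CI for p01/p10:", (1 / ci.high - 1, 1 / ci.low - 1))
```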
Another quantity which might be of interest is the true proportion of
observations scoring 1 on II among those scoring 1 on I, or p11/pI =
p11/(p10 + p11). This quantity is the conditional probability of a score of 1 on II
given a score of 1 on I. Alternatively, the conditional probability of a score
of 1 on II given a score of 0 on I might be of interest. Inferences about these
quantities can appropriately be based on the binomial distribution, and the
procedures are easily developed (Problem 64b).
We have discussed inferences concerning the "disagreements" between
scores on the two categories; now what about the "agreements"? For
example, one might be interested in testing the null hypothesis that the
probability of Type I and II scoring the same does not depend on the score
of Type I. This condition reduces successively to
P(same | 1 on I) = P(same)
p11/(p10 + p11) = p11 + p00
p11(1 - p10 - p11 - p00) = p00 p10
p11 p01 = p00 p10, (8.2)
and the result is identical if we start with either of the relations P(same | 0 on
I) = P(same) or P(same | 0 on I) = P(same | 1 on I). If the data are represented
in a new 2 × 2 table using the format of Table 8.7, it is clear that the usual
contingency table test of independence (of score on I and sameness) is
appropriate for the null hypothesis in (8.2). This is of course equivalent to a
test of equality of proportions within the rows of Table 8.7, or
p11/(p10 + p11) = p00/(p00 + p01), (8.3)
which in the present context says P(same | 1 on I) = P(same | 0 on I). Another
equivalent way of stating the null hypothesis in (8.2) is as an equality of odds,
or
p11/p10 = p00/p01, (8.4)
which says here that the odds for "same given 1 on I" are equal to the odds
for "same given 0 on I," or
P(same | 1 on I)/P(different | 1 on I) = P(same | 0 on I)/P(different | 0 on I).
Table 8.7                          Table 8.8

             same   different                    same   different
   I    1     A        B             II    1      A        C
        0     D        C                   0      D        B
A test of the null hypothesis that the probability of Types I and II scoring
the same does not depend on the score of Type II can also be performed using
a test of independence (of score on II and sameness), or a test of equality of
proportions within the rows of the new table shown as Table 8.8. This
hypothesis is not the same as (8.2)-(8.4), but is equivalent (Problem 65) to
p11 p10 = p01 p00. (8.5)
These hypotheses of independence of sameness and score on I or II are
not as easy to interpret as they may seem. If there are many more scores of 1
than 0 on II, then it is easier to be the same given 1 on I than given 0 on I.
(Compare Sect. 8.4.) It may further exemplify the difficulty of interpretation
of independence in Table 8.7 and Table 8.8 to remark that independence in
both implies that p11 = p00 and p10 = p01, except in the degenerate case
where either P(same) = 0 or P(different) = 0 (Problem 66).
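For the contingency-table test of independence mentioned after (8.2), a minimal sketch follows; the pair counts A, B, C, D are hypothetical, and the ordinary chi-square test from SciPy stands in for whatever exact or corrected version one might prefer.

```python
import numpy as np
from scipy.stats import chi2_contingency

A, B, C, D = 30, 15, 6, 19   # hypothetical (1,1), (1,0), (0,1), (0,0) counts

# Table 8.7: rows = score on I (1 then 0), columns = same, different.
table_8_7 = np.array([[A, B],
                      [D, C]])

chi2, pval, dof, expected = chi2_contingency(table_8_7)
print(f"chi-square = {chi2:.3f}, df = {dof}, P = {pval:.3f}")
```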
In Sects. 8.1-8.6, the essential assumption was that the observed pairs con-
stitute a simple random sample from some population of pairs. In Sect. 7,
we indicated that in the case of matched pairs of continuous observations
an alternative assumption is often made. This is that the observations of a
given pair are random (as in a matched-pair experiment), while the pairs
themselves have "fixed effects" and need not be random at all. Each observa-
tion then reflects the fixed effect of the pair to which it belongs, as well as the
effect of the treatment (or Type). An analogous assumption for paired
dichotomous observations has been discussed by D. R. Cox (1958c). We
explain it here in the context of the drug example mentioned in (c) at the
beginning of Sect. 8, where two drugs are tried on a group of patients, each
drug once on each patient. Consider the group of patients as fixed, and sup-
pose that, for patient i, drug I causes nausea with probability PH and drug II
causes nausea with probability Pu i' Note that the randomness is now as-
sociated with different possible outcomes on a given patient, rather than with
the choice of a patient from the population. Suppose also that the outcomes
of separate trials on the same patient are independent (as trials on different
patients would be). If each drug is tried once on each patient, then the
probability p11i that patient i scores 1 (nausea) on each drug is, by the
independence assumption,
p11i = pIi pIIi. (8.6)
Similarly, with obvious definitions, we have
p10i = pIi(1 - pIIi) (8.7)
p01i = (1 - pIi) pIIi (8.8)
p00i = (1 - pIi)(1 - pIIi). (8.9)
Suppose now that we make the additional assumption that the drug effect
is constant, in the sense that, for all patients, the odds for nausea under drug
II are the same multiple θ of the odds for nausea under drug I. In symbols,
this assumption is
pIIi/(1 - pIIi) = θ pIi/(1 - pIi) for all i.
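One consequence of this assumption is easy to check numerically: by (8.7) and (8.8), the conditional probability of the outcome (1, 0) given a disagreement is p10i/(p10i + p01i) = 1/(1 + θ) for every patient i, so that (1/p) - 1 = θ; this appears to be the interpretation promised for (1/p) - 1 in Sect. 8.6. A small sketch (θ and the pIi values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.5                                   # illustrative odds multiple
p_I = rng.uniform(0.05, 0.95, size=8)         # nausea probabilities under drug I
odds_II = theta * p_I / (1 - p_I)             # odds under II = theta * odds under I
p_II = odds_II / (1 + odds_II)

p10 = p_I * (1 - p_II)                        # (8.7), patient by patient
p01 = (1 - p_I) * p_II                        # (8.8), patient by patient
print(p10 / (p10 + p01))                      # every entry equals 1/(1 + theta)
print(1 / (1 + theta))
```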
9 Tolerance Regions
One-sample procedures based on the binomial distribution are also useful
for obtaining tolerance regions. The methodology can be viewed as a general-
ization of the procedure for constructing confidence limits for the median
or any specified p-point (quantile) of a distribution. Because of difficulties
analogous to that of defining a unique p-point for discrete distributions, it is
convenient to assume throughout this section that the relevant distribution
is continuous.
Recall that the median of a population is the 50% point of the distribution,
or the point such that 50% of the population lies below it. Let X* be an upper
95% confidence bound for the population median. This means that X* is
obtained in such a way that it has probability 0.95 of exceeding the population
median. It follows that X* has probability 0.95 of exceeding at least 50% of
the population. Equivalently, the region to the left of X* has probability 0.95
of covering (including) at least 50% of the population. This is perhaps the
simplest example of a "tolerance region," more specifically, a "50% tolerance
region" at the "confidence" level 0.95.
In this section we will define tolerance regions exactly, mention some
practical situations where they might be useful, and explain a simple method
of constructing them from a random sample. Then we will discuss their
usefulness for description and prediction, pointing out some difficulties in
the interpretation of tolerance regions. Finally we will generalize the con-
struction procedures. The question of what would be a "good" or "best"
tolerance procedure will not be discussed.
9.1 Definition
Suppose that, for some purpose, we would ideally like to find a region with
coverage 0.5, that is, a region including 50% of the population. Lacking
special knowledge about the population distribution, we cannot accomplish
this exactly. We might be willing, instead, to define a region (depending on a
sample) so that there is probability 0.95 that it will have coverage at least 0.5.
This would perhaps sound difficult to do, had an example not been given
above.
In general, a tolerance region is a random region having a specified
probability, say 1 - α, that its coverage is at least a specified value, say c.
Various names are given to 1 - α and c in the literature. We shall call 1 - α
the confidence level and c the tolerance proportion, the latter because in some
situations it is the minimum proportion of the population which it is con-
sidered tolerable to cover. We shall also speak of a "c tolerance region with
confidence 1 - α." Regions which have this property under essentially no
restrictions on the population are sometimes called "nonparametric toler-
ance regions," to distinguish them from "parametric tolerance regions,"
which have the required property as long as the population belongs to some
specified parametric family, but not in general otherwise. Only nonpara-
metric tolerance regions will be discussed here.
In nonspecific settings, tolerance regions are often suggested for the purpose
of describing the underlying population or for predicting future observations.
For these purposes, however, a tolerance region at a conventional confidence
level like 0.95 is of doubtful value. For description, for instance, the difficulty
of interpreting a tolerance region is analogous to the difficulty of interpreting
a confidence bound by itself as an estimator. Such difficulties, and possible
remedies, will be explained further later, in Sects. 9.4 and 9.5.
The specific context in which tolerance regions are most often employed
is that of production processes, since then it is natural to be concerned with
whether the items produced are meeting specifications or measuring within
some design tolerances, such as 100 ohms ± 10 ohms. If certain deviations
of various characteristics from designated values are specified in advance as
tolerable, it is easy to make nonparametric inferences about the proportion
of the population in this region of tolerable values (Problem 68). This is not
the type of tolerance region defined above, however; we shall be concerned
here not with prespecified tolerance limits or regions of tolerable values, but
rather with finding a tolerance region, based on a sample, such that a pre-
specified proportion of the population will be covered by that region with a
preselected level of confidence. Because these regions are based on a sample,
they are often called "statistical" tolerance regions. We will not repeat the
adjective in the discussion to follow, since all tolerance regions here will be
statistical.
¹ The authors are indebted to Frederick Mosteller for a discussion of uses of tolerance regions.
find. For example, clearly relevant but hard to estimate are the costs of the
various possible acts in each problem, both when the production process is
unchanged or the people are normal as regards cholesterol level, and when
any of a variety of possible alternatives holds.
Assume that X1, ..., Xn are independent observations on the same distribu-
tion, with c.d.f. F. Let X(1), ..., X(n) denote the order statistics of these
observations. Assume that F is continuous, so that, with probability one,
there are no ties and a unique ordering X(1) < X(2) < ... < X(n) exists. Let
Ck be the coverage of the interval between X(k-1) and X(k). Then by (9.1) we
have, for k = 2, ..., n,
Ck = F(X(k)) - F(X(k-1)). (9.2)
(Since F was assumed continuous, the coverage is the same whether the
endpoints are included in the interval or not. If a specific statement were
required, we would assume that right (upper) endpoints are included, and
left (lower) endpoints are not.) We further define
C1 = F(X(1)) and Cn+1 = 1 - F(X(n))
as the coverages of the interval below X(1) and the interval above X(n) respec-
tively. The definition in (9.2) applies also to these two intervals once we define
X(0) = -∞ and X(n+1) = ∞.
We now have n + 1 coverages, C1, C2, ..., Cn+1, corresponding to the
n + 1 intervals into which the n sample points divide the real line. These n + 1
coverages are random variables, and their joint distribution has a number of
interesting properties (Problems 70 and 71). A property which provides an
easy method of construction of tolerance regions is the following. Let
i1, ..., is be any s different integers between 1 and n + 1 inclusive; then the
sum C of the corresponding coverages,
C = Ci1 + Ci2 + ... + Cis,
has the same distribution as the sth smallest observation in a sample of n from
the uniform distribution on the interval (0, 1) (Problem 71g), namely
P(C ≥ c) = n (n-1 choose s-1) ∫_c^1 u^(s-1) (1 - u)^(n-s) du (9.3)
         = Σ_{k=0}^{s-1} (n choose k) c^k (1 - c)^(n-k). (9.4)
Notice that the distribution depends on s and n only, not on which s integers
are chosen, nor on the distribution from which the sample was drawn, as long
as F is continuous. Notice also that C has a beta distribution by (9.3) and that
(9.4) is a left-tail binomial probability.
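The two expressions are easy to evaluate and compare numerically. In the sketch below (the values of n, s, and c are illustrative, not from the text), (9.3) is the upper tail of a Beta(s, n - s + 1) distribution and (9.4) a left-tail binomial probability.

```python
from scipy.stats import beta, binom

n, s, c = 20, 17, 0.65

# (9.3): C is distributed as the s-th order statistic of n uniforms,
# i.e. Beta(s, n - s + 1); take its upper tail at c.
p_beta = beta.sf(c, s, n - s + 1)

# (9.4): the equivalent left-tail binomial probability.
p_binom = binom.cdf(s - 1, n, c)

print(p_beta, p_binom)   # both about 0.956: a 0.65 tolerance region
                         # with confidence about 0.95 when n = 20
```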
[Figure 9.1: Murphy's graphs of the tolerance proportion c against the sample size n at confidence level 1 - α = 0.95, one curve for each number m of intervals omitted; the plotted curves are not legible in this reproduction.]
If the tolerance region is the interval from X(k) to X(n+1-l), then m = k + l. If values of c and α are also selected, then n is the
smallest value for which the probability of m or more successes exceeds
1 - α, under the binomial distribution with parameters n and p = 1 - c.
Murphy [1948] gives graphs of c versus n for various values of m and
1 - α = 0.90, 0.95, and 0.99. These are somewhat easier to use than binomial
tables for some purposes. They are essentially equivalent to graphs of
binomial confidence limits or percent points of the beta distribution (as, for
instance, Fig. 6.2, Chap. 1), arranged in a certain way (Problem 73). Figure 9.1
reproduces the graph for 1 - α = 0.95; m denotes the number of intervals
omitted, as explained in the previous paragraph.
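The search for the smallest n described above is mechanical; a sketch follows (the function name and the example values are ours, not Murphy's; SciPy assumed).

```python
from scipy.stats import binom

def smallest_n(c, alpha, m):
    """Smallest n so that m or more successes have probability at least
    1 - alpha under the binomial distribution with parameters n and 1 - c."""
    n = m
    while binom.sf(m - 1, n, 1 - c) < 1 - alpha:
        n += 1
    return n

# A 0.75 tolerance region from the whole sample range omits m = 2
# intervals (below X(1) and above X(n)); at confidence 0.95:
print(smallest_n(0.75, 0.05, 2))   # 18
```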
[Figure 9.2 graphs the distribution function of the actual coverage; the horizontal axis is p = coverage, from 0.1 to 1.0.]
Figure 9.2 Distribution of actual coverage of a 0.65 tolerance region with confidence
0.95 when n = 20.

c-point ξc. Then the use of the tolerance region for description amounts to
the use of the upper confidence limit X(s) as a sample descriptor of ξc. How-
ever, a single confidence limit at a typical level would not ordinarily be
considered a descriptor in this sense, and would be a very lopsided descriptive
device. For example, an upper 95% confidence limit for a parameter, say the
median, is not by itself very descriptive of what one knows about the param-
eter, since the limit could be any distance from the parameter.
Some methods of avoiding this deceptiveness when using tolerance regions
as a descriptive device are listed below.
(1) Use the confidence level 0.50. Then the tolerance region is analogous to
a median unbiased estimator.²
(2) State several combinations of the tolerance proportion and confidence
level for the region given, for instance, the tolerance proportions corre-
sponding to the confidence levels 0.05, 0.25, 0.50, 0.75, and 0.95.
² An estimator T is called median unbiased for a parameter θ if T has median θ for any allowed
distribution. A significant fact about median unbiasedness is that h(T) is median unbiased for
h(θ) if T is median unbiased for θ and h is monotone (see also van der Vaart [1961]). This prop-
erty does not hold for ordinary (mean) unbiasedness.
(3) Instead of giving any tolerance proportion and confidence level, give the
expected coverage, as defined in the next subsection. This is somewhat
analogous to unbiased estimation.
(4) Give two regions, one a tolerance region as defined already, and the
other what might be called an inner tolerance region with the same
tolerance proportion c and confidence level 1 - α. This is analogous to
giving both upper and lower confidence bounds. By an inner tolerance
region is meant a region having probability 1 - α that its coverage is at
most c. It can be chosen to lie inside the ordinary tolerance region (as
long as α < 0.5). Its complement is a tolerance region with tolerance
proportion 1 - c and the same confidence level 1 - α.
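Method (2) above is simple to implement, since by Sect. 9.3 the coverage of a region made up of s of the n + 1 blocks has the Beta(s, n - s + 1) distribution; the tolerance proportion at confidence 1 - α is therefore a quantile of that distribution. A sketch with illustrative n and s:

```python
from scipy.stats import beta

n, s = 20, 17
for conf in (0.05, 0.25, 0.50, 0.75, 0.95):
    # c solves P(coverage >= c) = conf, i.e. c is the (1 - conf)-quantile.
    c = beta.ppf(1 - conf, s, n - s + 1)
    print(f"confidence {conf:4.2f}: tolerance proportion {c:.3f}")
```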
region R. For example, before any observations are taken, the probability
that the (n + 1)th observation will lie within the entire range X(1) to X(n) of
the first n is (n - 1)/(n + 1), and the probability it will lie between any two
successive observations X(i-1) and X(i) is 1/(n + 1).
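These prediction probabilities are distribution-free and easy to confirm by simulation; a quick check of the first claim (the sample size and the normal distribution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 100_000
x = rng.standard_normal((reps, n + 1))        # first n columns, then one more
lo = x[:, :n].min(axis=1)
hi = x[:, :n].max(axis=1)
inside = (x[:, n] > lo) & (x[:, n] < hi)
print(inside.mean(), (n - 1) / (n + 1))       # both about 0.818
```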
If an ordinary tolerance region at a high confidence level is used for the
region R, then the probability in (9.6) will be larger, and may be much larger,
than the tolerance proportion c, since the actual coverage C(X1, ..., Xn)
exceeds the tolerance proportion with probability equal to the confidence
level chosen. This illustrates in another way the difficulty of making a simple
interpretation of a tolerance region as a descriptor.
The procedure described in Sect. 9.3 for constructing tolerance regions can
be generalized in a number of ways. As an aid to understanding, we shall
proceed informally and one step at a time.
Suppose first that we have a sample of bivariate observations rather than
univariate observations, and we seek a tolerance region in two-dimensional
rather than one-dimensional space. Of course, we could look only at the first
coordinate of each observation and construct a univariate tolerance region
based on these n univariate observations. The corresponding (equivalent)
region in the plane would then be a bivariate tolerance region. For instance,
if the univariate region is an interval I, then the corresponding bivariate
region would be simply the vertical band whose intersection with the hori-
zontal axis is I.
A more interesting possibility would be to look at some real-valued
function φ other than the first coordinate of the bivariate observations, which
we denote by X. Suppose that Z = φ(X) has a continuous c.d.f. G. Define
a tolerance region S in Z-space based on the order statistics Z(1), ..., Z(n)
of the sample of values Zi = φ(Xi). Let R be the corresponding region in
X-space, or formally, R = {x: φ(x) ∈ S}. Then R has the same coverage for
X that S has for Z and hence is a tolerance region in X-space with the same
tolerance proportion and confidence level. For example, if φ(X) is the distance
of the point X from the origin, or the length of the vector X, then the Zi are
the distances of the Xi from the origin. If the Z tolerance region S is the interval
from Z(k) to Z(k+s), then the corresponding X tolerance region R is the ring
consisting of those points x whose distance from the origin is between Z(k)
and Z(k+s). If some other well-behaved function φ had been used in place of
distance, the boundaries of R would still be the two contours where φ has
the values Z(k) and Z(k+s).
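A sketch of this ring construction follows; the bivariate sample, and the choices k = 1 and s = 17, are illustrative.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)
n, k, s = 20, 1, 17
X = rng.standard_normal((n, 2))               # bivariate sample
Z = np.sort(np.linalg.norm(X, axis=1))        # ordered distances from the origin

inner, outer = Z[k - 1], Z[k - 1 + s]         # Z(k) and Z(k+s), 0-based indexing
print(f"ring: {inner:.3f} <= |x| <= {outer:.3f}")

# The ring is the union of s blocks, so its coverage obeys (9.3)-(9.4):
c = 0.65
print("P(coverage >= 0.65):", binom.cdf(s - 1, n, c))
```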
The same method can be applied to any kind of X-space whatever, by
letting φ be a real-valued function on this space. We shall require only that
φ(X) have a continuous distribution G (this avoids the difficulties of discrete-
ness). To carry the ideas a bit further, let Z(1), ..., Z(n) again be the order
statistics of the sample of values Zi = φ(Xi). The Z(i) separate the Z-space
(the real line) into n + 1 intervals. Let R1, ..., Rn+1 be the corresponding
regions in X-space, that is, Ri is the X-region where φ(X) is between Z(i-1)
and Z(i). Specifically, Ri = {x: Z(i-1) < φ(x) ≤ Z(i)}, where Z(0) = -∞ and
Z(n+1) = ∞ as before. The coverage Ci of Ri is the probability under the X
distribution in the region Ri, which is the probability under the Z distribution
in the interval between Z(i-1) and Z(i), which is G(Z(i)) - G(Z(i-1)). The
joint distribution of these coverages is therefore the same as in Sect. 9.3, and
in particular, the union of any s of the regions Ri is a tolerance region having
a coverage C whose distribution is given by (9.3) and (9.4). The indices
i1, ..., is of the included regions are to be selected in advance, of course.
Regions Ri whose coverages have the same joint distribution as in Sect. 9.3
are called "statistically equivalent blocks," the equivalence being that any
permutation of the coverages has the same joint distribution as any other.
By generalizing the procedure for constructing statistically equivalent blocks,
we can obtain more general tolerance procedures. We shall proceed further
in this way.
Instead of using the same function throughout, we could use a sequence
of functions φ1, φ2, ..., φn. One way to do so is as follows. First, let R1 be the
region where φ1(X) is smaller than the smallest value φ1(Xj) observed. Next,
remove the minimizing Xj from the sample and R1 from the X-space, and
apply the same procedure to the remaining sample and the remaining portion
of X-space, using φ2 in place of φ1. And so on. At each stage, the remaining
X's are a sample from the original distribution except restricted to the re-
maining portion of the X-space; therefore, the conditional distribution of the
coverage of the next region to be removed given the coverages of the regions
already removed is the same no matter what function φj is used next and
hence is the same as when all functions φj are the same. Therefore, the regions
R1, R2, ..., Rn+1 obtained are again statistically equivalent blocks.
The first step above may be thought of as using φ1 to cut the X-space into
two regions, one consisting of one block and one consisting of n blocks
not yet subdivided. The second step then uses φ2 to cut the latter region into
two regions, of 1 and n - 1 blocks, etc. Geometrically, the successive cuts
are along contours of φ1, φ2, ..., φn. Instead of cutting off one block at each
step, however, we could choose some arbitrary split. In this case, the first
step is to choose an integer r1 between 1 and n, find the r1th from the smallest
of the values φ1(Xi), i = 1, 2, ..., n, and cut the X-space into two regions
according to whether φ1(x) is smaller or larger than this r1th value. We then
have one region containing r1 - 1 X's and still to be subdivided into r1
blocks, and a second region containing n - r1 X's and still to be subdivided
into n - r1 + 1 blocks; the remaining X, say X[1], is the borderline value
through which the first cut passes. The second step is to cut one of these two
regions, using the function φ2 and another arbitrary integer. After n steps, n
cuts have been made, all the Xi have been used, and there are n + 1 regions,
each fully subdivided, i.e., consisting of a single block. These n + 1 regions
and make the third cut at the largest value of φ3(X(i)), i = 2, ..., n - 1. Then
the tolerance region will be the interval from
X(1) + X(n) - X(n-1) to X(n-1) if X(2) - X(1) > X(n) - X(n-1),
and from X(2) to X(1) + X(n) - X(2) otherwise.
PROBLEMS
1. Show that, if F(x) = p for a < x < b, then a and b are both pth quantiles of F.
2. Show that ξ is a pth quantile of F if and only if the point (ξ, p) lies on the graph of
F with any vertical jumps filled in.
3. Show that ξ is a pth quantile of a distribution if and only if, when ξ is included, the
left tail probability is at least p and the right tail probability is at least 1 - p.
4. (a) Sketch seven c.d.f.'s to exhibit the seven possible combinations of cases (a)-(c)
of Sect. 2.
(b) Which combinations are possible for
(i) discrete distributions?
(ii) distributions with densities?
5. What is the relation between the quantiles and the inverse function of a c.d.f.?
6. An estimator is called median unbiased for a parameter if its median is that para-
meter. Show that, for odd sample sizes, the sample median is a median unbiased
estimator of the population median.
7. (a) If ξp is a pth quantile of a distribution on a finite set, show that either p or ξp
is not unique. Relate this to cases (a)-(c) of Sect. 2.
(b) For what countably infinite sets does (a) hold?
*10. Let X1, X2, ..., Xn be independent, identically distributed m-vectors and let
p = P(Xj ∈ A) for A a given set in m-space. Find a uniformly most powerful test
of the null hypothesis H0: p ≤ p0 against the alternative H1: p > p0 [Lehmann,
1959, p. 93].
*11. Prove that any two-tailed sign test is admissible if X1, ..., Xn are independent and
identically distributed with P(Xj = ξ0) = 0, where ξ0 is the hypothesized median
value.
*12. Prove the optimum properties of the sign test for H0: p = p0 given in Sect. 3.2 for
the case where the observations are not necessarily identically distributed but are
independent with P(Xj < ξ0) = P(Xj ≤ ξ0) = p for all j.
*13. Prove that S, the number of observations below ξ0, is a sufficient statistic for p
if X1, ..., Xn are independent and identically distributed with distribution F
belonging to the family defined in Sect. 3.3, paragraph 2.
17. An automobile manufacturer wishes to design a certain new model such that the
front-seat headroom is sufficient for all but the tallest 5% of male drivers. A random
sample of 100 male drivers is taken. The heights of the 9 tallest are as follows:
(a) Find a 90% two-sided confidence interval for the 95th percent point of the
population of male drivers.
(b) Former studies by the Federal government have shown that the 95th percentile
point for height of U.S. males is 70.2 inches. Does this result appear to be valid
now and for the population of male drivers?
18. A sample of 100 names was drawn from the registered voters in Appaloosa County
and sent questionnaires regarding a proposed taxation bill. Of the 75 usable
returns, 50 were in favor of the bill. Find a 95% confidence interval for the true
proportion of registered voters in favor of the bill. What assumption are you
making about the unusable returns?
19. Suppose that a quantile of order p is not unique for fixed p, that is, ξp is any value
in the closed interval [ξp′, ξp″] for some ξp′ < ξp″. Show that if lower and upper
confidence bounds, say L and U, are determined by the usual sign test procedure,
each at level 1 - α, then
P(L ≤ ξp′) ≥ 1 - α and P(U ≥ ξp″) ≥ 1 - α.
20. (a) Suppose that L is a lower confidence bound for a pth quantile ξp constructed
by the usual sign test procedure at exact level 1 - α for a continuous popula-
tion. Show that if the population is discontinuous, the lower confidence
bound has at least the indicated probability of falling at or below ξp, and at
most the indicated probability of falling strictly below ξp, that is
24. Let X(r) denote the rth from the smallest in a random sample of size 5 from any
continuous population with ξp the quantile of order p. Evaluate the following
probabilities:
(a) P(X(1) < ξ0.50 < X(5))
(b) P(X(1) < ξ0.25 < X(3))
(c) P(X(4) < ξ0.80 < X(5)).
25. If X(1) and X(n) are respectively the smallest and largest observations in a random
sample of size n from any continuous distribution F with median ξ0.50, find the
smallest value of n such that
(a) P(X(1) < ξ0.50 < X(n)) ≥ 0.95
(b) P[F(X(n)) - F(X(1)) ≥ 0.50] ≥ 0.95.
26. Let V denote the proportion of the population lying between the smallest and
largest observations in a random sample of size n from any continuous population.
Find the mean and variance of V.
27. In a random sample of size n from any continuous population F, the interval
(X(r), X(n-r+1)) for any r < n/2 gives a level 1 - α confidence interval for the
median of F. Show that α can be written as
α = (0.5)^(n-1) Σ_{k=0}^{r-1} (n choose k)
  = 2n (n-1 choose r-1) ∫_0^{0.50} x^(n-r) (1 - x)^(r-1) dx.
28. Show that the exact confidence level of a confidence interval for the median
with endpoints the second smallest and second largest observation is equal to
1 - (n + 1)/2^(n-1).
29. Show that the joint density of the order statistics of a random sample of size n
from a population with density f is
n! Π_{i=1}^n f(xi) for x1 < x2 < ... < xn.
30. Let X(r) be the rth order statistic of a random sample of size n from a population
with continuous c.d.f. F.
(a) Differentiate (4.1) to show that the marginal density of X(r) is
[B(r, n - r + 1)]^(-1) F(t)^(r-1) [1 - F(t)]^(n-r) f(t).
(b) Show that the c.d.f. of the density in (a) can be written as
G(t) = P(X(r) ≤ t) = ∫_0^{F(t)} [B(r, n - r + 1)]^(-1) u^(r-1) (1 - u)^(n-r) du,
and hence this binomial sum is equivalent to the incomplete beta c.d.f. above.
(c) Integrate (a) by parts repeatedly to obtain the binomial form in (4.1).
31. By considering P(X(r) > t/n) in the binomial form given in (4.1), find the asymp-
totic distribution of X(r) for r fixed and n → ∞ if
(a) F is the uniform distribution on (0, 1).
(b) F is an arbitrary continuous c.d.f.
32. Let X(n) denote the largest value in a random sample of size n from the population
with density function f.
(a) Show that lim_{n→∞} P(n^(-1) X(n) ≤ x) = exp(-α/πx) if f(x) = α/[π(α² + x²)]
(Cauchy).
(b) Show that lim_{n→∞} P(n^(-2) X(n) ≤ x) = exp(-α√(2/(πx))) if f(x) = (α/√(2π)) x^(-3/2)
exp(-α²/2x) for x ≥ 0.
33. Show that the joint density of two order statistics X(r), X(v), 1 ≤ r < v ≤ n, of an
independent random sample from a population with continuous c.d.f. F, is
n!/[(r - 1)!(v - r - 1)!(n - v)!] F(x)^(r-1) [F(y) - F(x)]^(v-r-1) [1 - F(y)]^(n-v) f(x) f(y) for x < y.
This result can be expressed in binomial form as in (4.3), or, by Problem 30(b), it
can be written as the difference
35. Let X(1) < ... < X(n) be order statistics of a random sample of size n from the
exponential density f(x) = e^(-x), x ≥ 0.
(a) Show that X(r) and X(v) - X(r) are independent for any r < v.
(b) Find the distribution of X(r+1) - X(r).
(c) Interpret the significance of these results if the sample arose from a life test of n
items with exponential lifetimes.
36. Find the density and c.d.f. of the range, X(n) - X(1), of a random sample of size n
from any continuous population.
37. Suppose we want to use a random sample of size n to find a level 0.95 confidence
interval for θ in the density f(x) = exp[-(x - θ)] for x > θ. Since the smallest
observation X(1) is a sufficient statistic for θ (and also its maximum likelihood
estimator), some function of X(1) would be a natural choice for the confidence
bounds. If X(1) is the upper confidence bound, find that lower confidence bound
g(X(1)) which gives a two-sided level of 0.95.
38. Suppose we have a normal population with unknown mean and median ξ and
known variance σ² and we require a test of the null hypothesis H0: ξ = ξ0 against
the simple alternative ξ = ξ1, where ξ1 > ξ0, such that the level is α and the power
is 1 - β. Let α and β be fixed while (ξ1 - ξ0)/σ = δ approaches 0.
We consider two different tests for the situation described above. Test A is the
appropriate normal theory test for a sample of size nA, that is, the test based on
Z = √nA (X̄ - ξ0)/σ with a right-tail critical value zα from Table A. Test B is the
sign test for a sample of size nB, that is, the test based on S, the number of observa-
tions smaller than ξ0.
(a) Obtain a general expression for the sample size nA required by the normal
theory test.
(b) Obtain an approximate expression for the sample size nB required by the sign
test, by approximating the binomial distribution by the normal distribution.
(c) The limit of the ratio nA/nB as δ → 0 is known as the asymptotic efficiency of
the sign test relative to the optimum normal theory test. Using (a) and (b),
show that the asymptotic relative efficiency equals 2/π. This example and
others will be discussed at length in Chap. 8.
39. (a) Show that the value of i/n such that X(i) is a lower confidence bound for the
median at conservative level 0.95 is 0.28 when n = 9, and 0.32 when n = 17.
What is the true level in each case?
(b) Find the true level of X(i+1) as a lower confidence bound for the median
when i/n = 0.28, n = 9 and when i/n = 0.32, n = 17.
(c) Since the true levels in (a) and (b) are approximately equidistant from 0.95
for each n, one might consider using (X(i) + X(i+1))/2 as a lower confidence
bound for the median. If the population is symmetric, what is the true level of
this bound when n = 9, and when n = 17?
41. Use the results of Problem 40 to show that ξp0 = ξ0 for all p0 points if and only if
p< ≤ p0 ≤ 1 - p>, and thus that the two-sided test that rejects if S≤ ≤ sl and also if
S< ≥ su is appropriate for a two-sided test of the null hypothesis that ξ0 is a p0-
point when one or more observations is equal to ξ0.
42. (a) Show that
E(S< - S>) = n(p< - p>) and var(S< - S>) = n[p< + p> - (p< - p>)²].
(b) Show that if p< + p> = 1 then S< - S> is even with probability 1 for n even,
odd for n odd.
(c) Show that S< - S> is asymptotically normal with the mean and variance
given in (a), even if the parameters p< and p> depend on n, provided that this
variance approaches infinity.
(d) Derive approximate tests and confidence bounds for p< - p> and show
that their levels are asymptotically valid.
43. (a) Given any c.d.f. G and any positive p<, p> with p< + p> ≤ 1, show that
there is exactly one c.d.f. F having the same conditional distributions as G
on each side of ξ0, PF(X < ξ0) = p<, and PF(X > ξ0) = p>, and express F
algebraically in terms of G, p<, and p>.
(b) Show that the family of such distributions F includes G and includes a dis-
tribution with p< = p> for each p<, 0 < p< < 1/2.
(c) Show that the statistics S and N are jointly sufficient for this family, where S
and N are the numbers of observations in a sample which are respectively < ξ0
and ≤ ξ0.
(d) Show that, for the subfamily of such distributions F with p< = p>, the statistic
N is sufficient.
44. (a) Show that a conditional test at conditional level α is an unconditional test at
level α.
(b) Show that a conditionally unbiased test is unconditionally unbiased.
(c) Show that if all unbiased tests against a certain alternative are conditional
and if a certain conditional test has uniformly greatest conditional power,
then this test is uniformly most powerful unbiased against this alternative.
*45. Prove Theorem 6.1 relating unbiased and conditional tests.
46. In the situation and notation of Sect. 6, for n = 4, let φ(S<, N) be defined by the
accompanying table, with φ(S<, N) = 0 elsewhere.

N       0      1      2      3      4
S<      0     0, 1   0, 2   1, 2   0, 4
φ      1/8    1/8    1/4    1/6     1

(a) Show that the test φ has conditional level 1/8 for each N.
(b) Show that φ is biased conditional on N = 3.
(c) Show that φ has power
where Pm = P(N = m) = (4 choose m) r^m (1 - r)^(4-m) with r = p< + p> and p = p</r.
*(d) Show that the conditional power for N = 4 exceeds that for N = 2, i.e.,
p⁴ + (1 - p)⁴ > (1/4)[p² + (1 - p)²] for p ≠ 1/2.
(e) Show that the conditional power for N = 2 and that for N = 3 average to
1/8, i.e., (1/8)[p² + (1 - p)²] + (1/4)p(1 - p) = 1/8.
*(f) Show that P3 ≤ P2 + P4.
*(g) Use these facts to show that φ is unconditionally unbiased.
*47. Fill in the details of the proof that the equal-tailed conditional sign test is uniformly
most powerful unbiased when ξ0 has positive probability by showing the results
stated in (6.11)-(6.13).
48. Suppose that a pair of control and treatment measurements (V, W) would be
permutable if the treatment had no effect, while the effect of the treatment is to
add a constant amount μ to W. Show that the median of the difference X = W - V
is μ.
49. Suppose that a confidence interval for the median is obtained by applying the
methods of Sect. 4 to the differences Xi = Wi - Vi of independent, identically
distributed pairs (Vi, Wi). Show that the confidence procedure is valid for the "shift"
θ which makes P(Wi - θ < Vi) and P(Wi - θ > Vi) both no larger than 0.5. (In
the continuous case, Wi - θ is equally likely to be less than or greater than Vi.
Nevertheless, θ need not equal the difference between the medians of Wi and Vi;
see the following problems and Sect. 2 of Chap. 3.)
50. This problem and the next one show that the population medians of the treatment-
control differences in matched-pair experiments may be all positive or all negative
even though the treatment has no effect on the mean or median for any unit. If
the dispersion is larger for larger-valued units, and if the treatment accentuates
this, then the treatment-control differences typically have negative medians al-
though they may have positive medians. The following problem gives a similar
result for skewness.
Suppose that each unit possesses a "unit effect" u and would yield the observed
value U = u if untreated but T = u + τ(u)Z if treated, where τ(u)Z is a "random
error," independent from unit to unit, with positive scale factor τ(u). Suppose that a
pair of units is given whose unit effects are u′ and u″, with u′ < u″, say; then one
of the two units is chosen at random for treatment. Let X be the treatment-control
difference observed, and let δ = u″ - u′, τ′ = τ(u′), and τ″ = τ(u″).
(a) Show that P(X ≤ x) = (1/2)P(δ + τ″Z ≤ x) + (1/2)P(-δ + τ′Z ≤ x).
(b) Suppose that P(Z = -1) = P(Z = 1) = 0.5. Show that the possible medians
of X are all positive if τ″ < δ < τ′, all negative if τ′ < δ < τ″, and include 0
otherwise. [Hint: P(X = -δ - τ′) = P(X = -δ + τ′) = P(X = δ - τ″) =
P(X = δ + τ″) = 0.25.]
(c) Suppose that τ(u) is an increasing function of u and that Z is symmetrically
distributed about zero, that is, P(Z ≥ z) = P(Z ≤ -z) for all z. Show that
P(X < 0) ≥ 0.5 with equality holding if and only if P(δ/τ″ ≤ Z ≤ δ/τ′) = 0.
(In the case of inequality, the median of X must be negative.)
51. Suppose that the situation of the previous problem holds except that U is dis-
tributed as u + ν(u)Z with ν(u) > 0, and the pairs (U, T) are independent from
unit to unit. (For any one unit, only the marginal distributions of U and T matter.
Beyond this their joint distribution is irrelevant.) Let ν′ = ν(u′) and ν″ = ν(u″).
(a) Show that P(X ≤ x) = (1/2)P(τ″Z″ - ν′Z′ ≤ x - δ) + (1/2)P(τ′Z′ - ν″Z″ ≤ x +
δ), where Z′ and Z″ are independently distributed as Z.
(b) Show that P(X < 0) > 0.5 if Z is normal with mean 0 and the treatment
variance minus the control variance, τ²(u) - ν²(u), is an increasing function
of the unit effect u.
*(c) Show that if Z is uniformly distributed over the interval [-R, R], τ″ - ν″ ≥
τ′ - ν′, and (ν″τ″ - ν′τ′)(ν′τ″ - ν″τ′) > 0, then P(X < 0) ≥ 0.5 with equality
if and only if R ≤ δ/(τ″ + ν″).
(d) Show that τ and ν satisfy the conditions of (b) and (c) if τ(u) - ν(u) is a positive,
increasing function of u and ν(u) is nondecreasing.
(e) Show that P(X ≤ 0) = 1/2 if P(Z = z) = 1/3 for z = -1, 0, 1 and ν′ < δ < τ″ <
τ′ + δ and δ < τ′ < τ″ < ν′ + δ.
If ν(u) and τ(u) - ν(u) are increasing functions of u, parts (b)-(d) illustrate that
the median of X will typically be negative, although it may be positive, even for
symmetric Z, since these conditions do not preclude the conditions of part (e).
52. In the situation of the previous two problems, replace the assumptions about
U and T by E(U) = E(T) = u (so the treatment has no effect on the mean), var(U) =
ν²(u) and var(T) = τ²(u) are finite, and E(U - u)³ = E(T - u)³ = 0, with (U, T)
still independent from unit to unit. Show that E(X) = 0 and E(X³) = 1.5δ(τ″² -
ν″² - τ′² + ν′²). Therefore the distribution of X is skewed to the right if τ²(u) -
ν²(u) is an increasing function of u, even though U and T have no skewness.
53. In the paired situation of Sect. 8, define a difference score for each pair as the
score on II minus the score on I. Show that the probability of a positive difference
score minus the probability of a negative difference score equals pII - pI, where
pI and pII denote the probabilities of a score of 1 on I and II respectively.
54. Suppose that one male and one female chick are selected at random from each of
10 litters and inoculated with an organism which is thought to produce an equal
chance of life (1) or death (0) for every chick. The death occurs within 24 hours
if at all, and the organism has no effect on life after the first 24 hours. Test the data
below to investigate whether the sexes differ in their response to the organism.

 1   1  0
 2   0  0
 3   1  1
 4   1  0
 5   0  1
 6   0  0
 7   0  0
 8   0  1
 9   0
10   0
55. Prior to a nationally televised series of debates between the presidential candidates
of the two major parties, a random sample of 100 persons were asked their pre-
ference between these candidates. Sixty-three persons favored the Democrat at
this time. After the debate the same 100 people were asked their preference again,
and now seventy-two favored the Democratic candidate. Of these, 12 had pre-
viously stated a preference for the Republican candidate. Test the null hypothesis
that the voting preferences were unchanged by the debate. Can you say anything
about the effect of the debate?
56. In a study to determine whether constant exposure of children to violence on TV
affects their tendency toward violent behavior and possibly crime, disturbance,
etc., a group of 100 matched pairs of children were randomly selected. The pairs
1  1  18
1  0  43
0  1   8
0  0  31
57. Suppose that an achievement test is divided into k parts, all covering the same type
of achievement but using different methods of investigation. All parts of the test
are given to each of r subjects and each is given a score of pass or fail on each part
of the test. The data might be recorded as follows, where 1 represents pass and 0
represents fail.

Subject   Part 1   Part 2   ...   Part k   Row total
   1        0        1      ...               R1
   2        1        0      ...               R2
   ...
   r        0        0      ...               Rr
Total      C1       C2      ...     Ck         N

The symbols R1, ..., Rr and C1, ..., Ck represent the row and column totals
respectively, so that Ri is the number of parts passed by subject i and Cj is the
number of subjects who passed part j. N is the total number of parts passed by all
subjects.
Let pij denote the probability that the score of the ith subject on the jth part of
the test will be a 1 (pass). Suppose we are interested in the null hypothesis
H0: pi1 = pi2 = ... = pik for i = 1, ..., r.
This says that the probability of a pass is the same on all parts of the test for each
subject, or that the parts are of equal difficulty. The Cochran Q test statistic is
defined as
Q = k(k - 1) Σ_{j=1}^k (Cj - N/k)² / Σ_{i=1}^r Ri(k - Ri).
(a) Give a rationale for the use of the quantity Q to test the hypothesis of interest.
What is the appropriate tail for rejection?
(b) While the exact null distribution of Q conditional on R1, ..., Rr can be gen-
erated by enumeration, it is time-consuming and difficult to tabulate. Hence a
large sample approximation is usually used instead unless r is quite small,
specifically the chi-square distribution with (k - 1) degrees of freedom. Give a
justification of this approximation along the following lines. Within each
column, the observations are Bernoulli trials. Hence for each j, Cj follows the
binomial distribution with mean Σ_{i=1}^r pij and variance Σ_{i=1}^r pij(1 - pij).
Under the null hypothesis, this mean and variance can be estimated from the
sample data as N/k and Σ_{i=1}^r (Ri/k)[1 - (Ri/k)], but this latter estimate is
improved by multiplying by the correction factor k/(k - 1). Standardized
variables are then squared and summed to obtain Q. While the Cj are not
independent, as r increases the Cj approach independence. One degree of
freedom is lost for the estimation procedure.
(c) Show that the test statistic Q above can be written in the equivalent form
Q = (k - 1)(k Σ_{j=1}^k Cj² - N²) / (kN - Σ_{i=1}^r Ri²),
which is easier for calculation.
(d) Show that when k = 2, the test statistic Q reduces to
Q = (C - B)²/(C + B),
where C and B are defined as in Table 8.1, that is, C is the number of (0, 1)
pairs and B is the number of (1, 0) pairs. Hence the test statistic Q when k = 2
is equivalent to the test statistic for equality of paired proportions presented
in Sect. 8.1.
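The following sketch computes Q in the computational form of part (c) and refers it to the chi-square approximation of part (b); the 0/1 data matrix (rows are subjects, columns are parts) is invented for illustration.

```python
import numpy as np
from scipy.stats import chi2

def cochran_q(data):
    r, k = data.shape
    R = data.sum(axis=1)          # R_i, parts passed by subject i
    C = data.sum(axis=0)          # C_j, subjects passing part j
    N = data.sum()
    Q = (k - 1) * (k * (C ** 2).sum() - N ** 2) / (k * N - (R ** 2).sum())
    return Q, chi2.sf(Q, k - 1)   # upper-tail chi-square P-value, k - 1 df

data = np.array([[1, 0, 1],
                 [1, 1, 1],
                 [0, 1, 0],
                 [1, 1, 0],
                 [1, 0, 0]])
print(cochran_q(data))
```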
58. Forty-six subjects were each given drugs A, B and C and observed as having a
favorable (1) or unfavorable (0) response to each. The results reported in Grizzle
et al. [1969, p. 494] are shown in the table below.

A  B  C   Number of subjects
1  1  1          6
1  1  0         16
1  0  1          2
0  1  1          2
1  0  0          4
0  1  0          4
0  0  1          6
0  0  0          6
Test the null hypothesis that the drugs are equally effective.
59. Four different surgical procedures for a duodenal ulcer are A (drainage and
vagotomy), B (25% resection and vagotomy), C (50% resection and vagotomy)
and D (75% resection). Each procedure was used for a fixed period of time in
each of 15 different hospitals, and an overall clinical evaluation made of the
severity of the "dumping syndrome," an undesirable aftereffect of surgery for
duodenal ulcer. The overall evaluation was made as simply "not severe" (0) or
"present to at least some degree" (1). Analyze the results below for any significant
difference between aftereffects of the four surgical procedures.
Surgical Procedure
Hospitals A B C D
1,7,8, 11 1 0 1 0
2,3,13 0 0 1 0
4,10 1 1 0 1
5, 12 1 1 0 0
6 0 1 0 1
9, 14, 15 1 0 0 0
60. Suppose that for a group of n matched pairs of individuals, one member of each
pair is selected to receive a treatment for a certain period while the other serves as
a control (is untreated or given a placebo). Each individual is measured both
before and after the treatment period so that there are a total of 4n observations.
Indicate how this can be reduced to a matched pair situation while making use of
both before-after and treatment-control information. If all 4n observations mea-
sure only the presence or absence of response, can the procedures of Sect. 8.1 be
applied? (Hint: How many different response categories are there for the dIffer-
ences?)
*61. Suppose that half of a given group of patients is selected at random to receive
Drug I at a certain time and Drug II at a later time, while the remaining patients
receive Drug II first and Drug I second. On each occasion, the characteristic
measured is dichotomous. Suppose further that the two drugs are known to have
exactly the same effect on all patients but there is a time effect. Show that a test for
equality of proportions using matched observations ignoring the time effect is
conservative. (Of course one would expect to obtain better power from a suitable
test which takes into account the time effect. See also the end of Sect. 3.1.)
62. Suppose that the randomization in Problem 61 is carried out within pairs of
patients rather than over the whole group of patients. How does the situation then
relate to that of Problem 60?
64. In the paired dichotomy situation and notation of Sect. 8.6, derive tests and con-
fidence procedures for
(a) the parameter λ = p01/p10,
(b) the parameter λ = p11/pI.
65. In the paired dichotomy situation and notation of Sect. 8.6, show that inde-
pendence of score on II and sameness is equivalent to p11 p10 = p01 p00.
66. In the paired dichotomy situation and notation of Sect. 8.6, show that if sameness
is independent of score on I and of score on II, then either P(same) = 0 or
P(different) = 0 or p11 = p00 and p10 = p01.
67. Show that, under the model of Sect. 8.7 for paired dichotomous observations,
a one-sided test of p = 0.5 as described in Sect. 8.1 (or θ = θ0 as described in
Sect. 8.6) is uniformly most powerful against a one-sided alternative. Show also
that the related unbiased tests are uniformly most powerful unbiased against the
alternative p ≠ p0.
68. (a) Let c be the coverage of a specified (nonrandom) region R. Suggest non-
parametric tests and confidence procedures for c.
(b) What optimum properties would you expect your tests in (a) to have?
(c) Show that the optimum properties you stated in (b) hold.
69. Suppose a tolerance region with tolerance proportion 0.90 and confidence level
0.95 is set up for a production process. Thereafter, pairs of items are observed,
and trouble-shooting is undertaken whenever both fall outside the tolerance
region. What can you say about the amount of trouble-shooting required when the
process is in control?
70. Let C1, ..., Cn+1 be the coverages of a sample of n from a continuous c.d.f. F.
Let Vs = Σ_{i=1}^s Ci, s = 1, ..., n.
(a) What is the joint density of V1, ..., Vn? Of C1, ..., Cn?
(b) What is the joint density of Vi, Vj for i ≠ j? Of Ci, Cj for i ≠ j?
71. Let C1, ..., Cn+1 be the coverages of a sample of n from a continuous c.d.f. F,
and Vs = Σ_{i=1}^s Ci for s = 1, ..., n. Show that the following properties hold.
(a) The joint distribution of C1, ..., Cn+1 does not depend on F.
(b) The random variables Vs are the order statistics of a sample of n from the
uniform distribution on (0, 1).
(c) Given V1, ..., Vs-1, Vs+1, ..., Vn, the conditional distribution of Vs is
uniform on the interval (Vs-1, Vs+1).
(d) Any permutation of C1, ..., Cn+1 has the same joint distribution as C1, ...,
Cn+1.
(e) E(Ci) = 1/(n + 1) for all i.
(f) The correlation of Ci and Cj is -1/n for all i, j with i ≠ j. (Hint: This can be
shown without integration.)
(g) The sum of any s coverages is distributed like Vs (which has the beta distribu-
tion (9.3) by Problem 30).
(h) If m intervals (X(i-1), X(i)) are excluded, the probability that the coverage of
the remaining region is at least c is the probability of m or more successes
under the binomial distribution with parameters n and p = 1 - c.
(i) The conditional distribution of V1, ..., Vs given Vs+1 is that of the order
statistics of a sample of s from the uniform distribution on (0, Vs+1).
(j) The random variables Ci′ = Ci/Vs+1 = Ci/(C1 + ... + Cs+1), i = 1, ..., s + 1,
are jointly distributed like the coverages of a sample of s from a continuous
distribution.
(k) Given V3, V7, the conditional distribution of V1, V2, V4, V5, V6, V8, ..., Vn
(n > 7) is that of the order statistics of three independent samples, one of two
observations from the uniform distribution on (0, V3), one of three observations
from the uniform distribution on (V3, V7), and one of n - 7 observations
from the uniform distribution on (V7, 1). (Note that this generalizes to any
set of given V's.)
(l) The coverages Ci have the same joint distribution as the random variables
Zi/(Z1 + ... + Zn+1) where the Zi are independently distributed with
density e^(-z), z ≥ 0.
72. In a sample of size n from a continuous distribution, let I1 be the interval between
the smallest and largest observation and I2 be the interval between -∞ and the
next-to-largest observation. Show that
(a) I1 and I2 are both tolerance regions with tolerance proportion 0.5 and con-
fidence level 1 - (n + 1)/2^n.
(b) I1 is a confidence interval for the median with confidence level 1 - 1/2^(n-1).
(c) I2 is a confidence interval for the median with confidence level 1 - (n + 1)/2^n.
(d) I1 and I2 are tolerance regions with tolerance proportion 0.75 and confidence
level 1 - (n + 3)3^(n-1)/4^n.
(e) I1 is a confidence interval for the upper quartile with confidence level 1 -
(1 + 3^n)/4^n.
(f) I2 is a confidence interval for the upper quartile with confidence level
1 - (n + 3)3^(n-1)/4^n.
73. (a) Show that Murphy's graphs (Fig. 9.1) give the upper confidence limit for a
binomial parameter p, as a function of n, for a fixed number of successes m
and confidence level 1 - α.
(b) To what extent do you agree with the accompanying estimates of ease and
accuracy of using various types of tables and graphs for various purposes?
Make reasonable assumptions about the grids of values employed, etc. The
notation is that 1 - α is the probability of r or more successes in n binomial
trials with parameter p.
[The accompanying table of estimates is not reproduced; its columns are the Direct binomial, Fisher-Yates, and Clopper-Pearson tables and the Murphy graphs.]
74. (a) Verify that the coverage of a 0.65 tolerance region with confidence 0.95 based
on 20 observations has the distribution graphed in Fig. 9.2.
(b) Show that the same region has coverage c with confidence 1 - α for any
values c and 1 - α related according to the graph.
75. How could a tolerance region be defined for all sample sizes n ≥ 10 so that the
relationship between the tolerance proportion and the confidence level is the
same for all n?
76. (a) Show that the region obtained from the cutting functions φ1(x) = x, φ2(x) = x,
φ3(x) = |x - (1/2)(X(1) + X(n))| is a valid univariate tolerance region.
(b) Which of the generalizations discussed in Sect. 9.6 are adequate to cover this
case, and how do they do so?
CHAPTER 3
One-Sample and Paired-Sample
Inferences Based on Signed Ranks
1 Introduction
In Chap. 2 we saw that the sign test is the best possible test (in strong senses
of "best") at level α for a null hypothesis which is as inclusive as the statement
"the observations are a random sample from a population with median 0
(or ξ0)." It certainly seems as though better use could be made of the observa-
tions by taking their magnitudes into account. However, since the sign test
is optimum at level α for this inclusive set of null distributions, any procedure
which considers magnitudes would be a better test at level α only for some
smaller and more restrictive set of null distributions. Such procedures are of
special relevance if the restricted set is the one of interest anyway, or if the
"restriction" of the null hypothesis can reasonably be assumed as a part of
the model, so that the restricted hypothesis essentially amounts to the
unrestricted one above. Furthermore, their exact levels may vary only
slightly under the kinds of departure from assumptions which are likely,
and their increased power may well be worth the small price. Recall that, in
principle, level and power considerations should always be balanced off.
This chapter will present and discuss tests and confidence procedures
suggested by the more restrictive hypothesis that the observations are a
random sample from a population which is symmetric with median 0 (or
any other value). (Symmetry will be defined precisely below.) Other related
hypotheses will also be considered. All of the inference procedures presented
are based on what are called the "signed ranks" of the observations, and the
primary emphasis is on the well-known Wilcoxon signed-rank test. All of the
discussion here will be in the context of observations in a single random
sample. However, as in the last chapter, these procedures may be performed
on the differences of paired observations, like treatment-control differences
If there is symmetry, there must be a center of symmetry, μ, and this point is
both the mean (if it exists) and a median of the distribution (Problem 3).
It may be noted that (2.1) holds for all x if and only if it holds for all x > 0,
and that the two (nonstrict) inequalities therein may be replaced by two
strict inequalities (Problem 2).
Symmetry can be viewed in several alternative ways. Some common
conditions, each of which is equivalent to the statement that X is symmetri-
cally distributed about μ (Problem 4), are listed below.
The null hypothesis of interest in this chapter is that the observations are
a random sample from a population symmetric about a given value μ, that is,
the observations are independently, identically and symmetrically distributed
about μ, where μ is specified. Since μ is then the median, the sign test of
Chap. 2 also applies here, but the additional assumption of symmetry permits
"better" tests. These tests have corresponding confidence procedures for
the center of symmetry.
The procedures developed in this chapter are tests and confidence intervals
for location, where the location parameter is the center of symmetry. In
this sense, they are nonparametric analogs of the classical (normal-theory)
tests and confidence intervals for the mean. Even when the symmetry
assumption is relaxed in certain ways, the procedures of this chapter remain
valid, that is, they retain their level of significance or confidence.
In paired-sample applications with observation pairs (V, W), the symmetry
property must hold for the distribution of differences W - V. The difference
between the individual population medians of V and W is not necessarily
a center of symmetry, or even a median, of the difference W - V (Problem 6).
It is a center of symmetry when V and W are independent and related by
translation, or when their joint distribution is symmetric (permutable,
exchangeable) in the sense that (V, W) has the same distribution as (W, V)
(Problems 7 and 8). This means that in a treatment-control experiment,
for example, randomization of treatment and control within pairs makes a
test of the null hypothesis of 0 center of symmetry valid as a test of the null
hypothesis that the treatment has no effect on any unit. (See Sect. 7, Chap. 2
and Sect. 9, Chap. 8 for more complete discussion.)
Xj¹            49  -67    8   16    6   23   28   41   14   29   56   24   75   60  -48
rank of |Xj|   11   14    2    4    1    5    7    9    3    8   12    6   15   13   10
signed rank    11  -14    2    4    1    5    7    9    3    8   12    6   15   13  -10
The signed-rank sum T is defined as the sum of the signed ranks. It may be
expressed as the positive-rank sum T+ minus the negative-rank sum T-, where
T+ and T- are the sums of the ranks of the positive and negative observa-
tions respectively. (Note that, as defined, T- is positive.) Thus, in this
example, we have
T+ = 11 + 2 + ... + 13 = 96, T- = 14 + 10 = 24,
T = T+ - T- = 72.
Each of these three statistics determines the other two by a linear relation,
so that only the most convenient one need be computed. Specifically,
because T+ + T- equals the sum of all the ranks, or 1 + 2 + ... + n =
n(n + 1)/2, we have
T+ = n(n + 1)/2 - T- = n(n + 1)/4 + T/2 (3.1)
T- = n(n + 1)/2 - T+ = n(n + 1)/4 - T/2 (3.2)
T = T+ - T- = n(n + 1)/2 - 2T- = 2T+ - n(n + 1)/2. (3.3)
Any of these three statistics may be called the Wilcoxon signed-rank statistic,
as the test may be based on any one of them.
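A brief computational sketch may help fix these definitions. The following Python fragment (our illustration, not part of the original example; variable names are ours) ranks the absolute values, assuming no ties, and computes T+, T-, and T for the data of Sect. 3.1.

```python
# Compute the Wilcoxon signed-rank statistics for the plant-height data;
# assumes all absolute values are distinct (no ties).
data = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
n = len(data)

# Rank the absolute values from 1 (smallest) to n (largest).
order = sorted(range(n), key=lambda i: abs(data[i]))
rank = [0] * n
for r, i in enumerate(order, start=1):
    rank[i] = r

t_plus = sum(rank[i] for i in range(n) if data[i] > 0)
t_minus = sum(rank[i] for i in range(n) if data[i] < 0)
print(t_plus, t_minus, t_plus - t_minus)      # 96 24 72
print(t_plus + t_minus == n * (n + 1) // 2)   # True, as used in (3.1)-(3.3)
```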
Under the null hypothesis that the population center of symmetry μ is
zero, the signs of the signed ranks are independent and equally likely to be
positive or negative (Problem 9). This fact determines the null distribution
of each of the statistics. The null distribution of T+ is the same as that of
T- (Problem 10). The probability of a particular value of T+, say, is

    P(T+ = t) = un(t)/2ⁿ    (3.4)
where un(t) is the number of ways that plus and minus signs can be attached
to the first n integers 1,2, ... , n such that the sum of the integers with positive
signs equals t. Equivalently, un(t) is the number of subsets of the first n
¹ The Xj are differences, in eighths of an inch, between heights of cross- and self-fertilized plants
of the same pair, as given by Fisher [1966 and earlier editions]. The original experiment was done
by Darwin. Fisher's discussion is extensive and interesting, including pertinent quotations from
Darwin and Galton. After applying the ordinary t test to these data, obtaining t = 2.148 and a
one-tailed probability of 0.02485, Fisher introduces the method of Sect. 2.1, Chap. 4 and obtains
a one-tailed probability of 0.02634.
integers whose sum is t. The values of un(t) and the probabilities in (3.4)
may be easily generated using recursive techniques (Problem 11).
The possible values of T+ (and of T-) range from 0 to n(n + 1)/2. The
mean and variance (Problem 14) under the null hypothesis are
E(T+) = E(T-) = n(n + 1)/4 (3.5)
var(T+) = var(T-) = n(n + 1)(2n + 1)/24 = (2n + 1)E(T+)/6. (3.6)
From these results and the relation (3.3), we obtain
E(T) =0 (3.7)
var(T) = 4 var(T+) = n(n + 1)(2n + 1)/6. (3.8)
The null distributions of T+, T- and T are all symmetric about their
respective means (Problem 15).
The left-tail cumulative probabilities from (3.4), that is, P(T+ ≤ t) =
P(T- ≤ t), are given in Table D for all different integer values of t not
exceeding n(n + 1)/4 (probabilities not exceeding 0.5), for all sample sizes
n ≤ 20. The tables in Harter and Owen [1970] give these probabilities for
all n ≤ 50. We now describe how these tables are used to perform the test.
If the population center of symmetry μ is positive, one would anticipate
more positive signs than negative, and hence more positive signed ranks
among the observations, at the higher ranks as well as the lower. Such an
outcome would give larger values of T, and hence of T+. This suggests a
test rejecting the null hypothesis if T is too large, that is, when T falls at or
above its critical value, which is the upper α-point of its null distribution
for a test at level α. By (3.1)-(3.3), this is equivalent to rejecting if T+ is too
large, and also to rejecting if T- is too small. Since Table D gives only the
lower-tailed cumulative null distribution of T+ or T-, the convenient
rule for rejection based on this table, for the alternative μ > 0, is T- less
than or equal to its critical value. The table entry for an observed value of
T- is the one-tailed (left-tailed) P-value according to the Wilcoxon signed-
rank test.
The corresponding test against the alternative μ < 0 rejects if T is too
small (i.e., highly negative), which is equivalent to T+ too small (or T-
too large). Hence the appropriate P-value here, the probability that T+
is less than or equal to an observed value, is again found from Table D as a
left-tailed probability.
An equal-tailed test against the two-sided alternative μ ≠ 0 rejects at
level 2α if either of the foregoing tests rejects at level α. Since the table gives
left-tailed probabilities only, it should be entered with the smaller of T+
and T- for an equal-tailed test, and the P-value is twice the tabulated value.
In the example above we found T- = 24. According to Table D, under
the null hypothesis the probability that T- ≤ 24 is 0.0206. A one-tailed
Wilcoxon test in this direction would then reject at level 0.025 but not at
0.020; an equal-tailed test would have P-value 0.0412 and would reject at
level 0.05 but not at 0.04.
In order to test the null hypothesis that the center of symmetry is μ = μ₀
for any value of μ₀ other than 0, the analogous procedure is to subtract
μ₀ from every Xⱼ and then proceed to find the signed ranks as before; that is,
the Wilcoxon test is applied to X₁ - μ₀, ..., Xₙ - μ₀ instead of X₁, ..., Xₙ.
The corresponding confidence bounds on μ will be discussed in Sect. 4.
A different representation of the Wilcoxon signed-rank statistics will be
convenient later, although it is not convenient for hypothesis testing. We
define a Walsh average as the average of any two observations Xᵢ, Xⱼ, that is

    (Xᵢ + Xⱼ)/2 for 1 ≤ i ≤ j ≤ n.    (3.9)

Note that each observation is itself a Walsh average where i = j. From a
sample of n then, we obtain n(n + 1)/2 Walsh averages, n(n - 1)/2 where
i < j and n where i = j. The sign of (Xᵢ + Xⱼ)/2 is equal to the sign of either
Xᵢ or Xⱼ, whichever is larger in absolute value. The theorem below follows
easily from this observation (Problem 18).
Theorem 3.1. The positive-rank sum T+ equals the number of positive Walsh
averages. Specifically, define

    Tᵢⱼ = 1 if Xᵢ + Xⱼ > 0, and Tᵢⱼ = 0 otherwise,    (3.10)

for 1 ≤ i ≤ j ≤ n; then we can write T+ as

    T+ = Σ Σ Tᵢⱼ, the double sum extending over 1 ≤ i ≤ j ≤ n.    (3.11)
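The representation can be checked numerically. The sketch below (ours; it assumes no zeros or ties among the sums Xᵢ + Xⱼ) counts the positive Walsh averages for the data of Sect. 3.1 and compares the count with T+ computed from the signed ranks.

```python
# Verify Theorem 3.1 for the plant-height data: T+ equals the number
# of positive Walsh averages (x_i + x_j)/2 with 1 <= i <= j <= n.
data = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
n = len(data)

order = sorted(range(n), key=lambda i: abs(data[i]))
rank = [0] * n
for r, i in enumerate(order, start=1):
    rank[i] = r
t_plus = sum(rank[i] for i in range(n) if data[i] > 0)

walsh = [(data[i] + data[j]) / 2 for i in range(n) for j in range(i, n)]
print(len(walsh) == n * (n + 1) // 2)            # True: 120 Walsh averages
print(t_plus == sum(1 for w in walsh if w > 0))  # True: both equal 96
```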
For n large, T, T+, and T- are approximately normal under the null hypo-
thesis, with the means and variances given in (3.5)-(3.8). (Asymptotic
normality was discussed in Sect. 9, Chap. 1.) The normal approximation may
be used for sample sizes outside the range of Table D. The procedure for
finding approximate normal tail probabilities based on T+ or T- is shown
at the bottom of this table. When using T, the mean and variance in (3.7)
and (3.8) are appropriate. A continuity correction may be incorporated,
in the amount of 1/2 for T+ or T- and 1 for T, since T takes on only alternate
integer values (Problem 19). However, the amount of the continuity correc-
tion is small (Problem 20) and in fact it reduces the accuracy of the approxi-
mation for small tail probabilities. The approximation without continuity
correction is very good for P = 0.025, and the approximation with continuity
correction is very good for P = 0.05. In general, for very small P, the correc-
tion is in the opposite direction from the error in the approximation, and of
much smaller magnitude. For details and an extensive investigation of the
accuracy of critical values based on the normal approximation, see
McCornack [1965] and Claypool and Holbert [1975]. Some simple approxi-
mations involving the t distribution are investigated in Iman [1974].
In the example of Sect. 3.1, with n = 15, the mean and variance of T
under the null hypothesis are 0 and 1240 respectively. The approximate
normal deviate corresponding to P(T ≥ 72) is thus 72/√1240 = 2.04, for an
approximate upper-tailed probability of 0.0204, close to the exact value
0.0206 found earlier.
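The arithmetic of the approximation is easily reproduced; the following sketch (ours) uses only the standard normal c.d.f. and the moments (3.7) and (3.8).

```python
# Normal approximation to P(T >= 72) for n = 15: var(T) = 15*16*31/6 = 1240.
from math import erf, sqrt

def upper_tail(z):
    # P(Z >= z) for a standard normal Z
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

var_t = 15 * 16 * 31 / 6
print(round(upper_tail(72 / sqrt(var_t)), 4))        # 0.0204, no correction
print(round(upper_tail((72 - 1) / sqrt(var_t)), 4))  # 0.0219, correction of 1
```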
Theorem 3.2. Let X₁, ..., Xₙ be a random sample from some continuous
population. Then

    E(T+) = np₁ + n(n - 1)p₂/2    (3.12)

where

    p₁ = P(Xᵢ > 0),    (3.13)
    p₂ = P(Xᵢ + Xⱼ > 0) for all i ≠ j,    (3.14)

and

    var(T+) = np₁(1 - p₁) + n(n - 1)p₂(1 - p₂)/2
              + 2n(n - 1)(p₃ - p₁p₂) + n(n - 1)(n - 2)(p₄ - p₂²)    (3.15)

where

    p₃ = P(Xᵢ > 0 and Xᵢ + Xⱼ > 0) for all i ≠ j,    (3.16)
    p₄ = P(Xᵢ + Xⱼ > 0 and Xᵢ + Xₖ > 0) for all distinct i, j, k.    (3.17)

The results in (3.12) and (3.15) follow once these moments are substituted in
(3.18) and (3.19). □
The moments in Theorem 3.2 hold for a random sample from any con-
tinuous population. If the population is symmetric about zero, as under the
null hypothesis that the population center of symmetry is zero, the prob-
abilities defined in Theorem 3.2 have the values (Problem 25)

    p₁ = 1/2, p₂ = 1/2, p₃ = 3/8, p₄ = 1/3.
3.4 Consistency
The first step in the proof is to observe that the test must, for sufficiently
large n, reject whenever the estimator falls more than e away from the value
to which it converges in probability under the null hypothesis (more than e
away in one direction if the test is one-tailed). The second step is to consider an
alternative under which the estimator converges in probability to a different
value, and to observe that, under such an alternative, the probability ap-
proaches 1 that the estimator will lie within e of this different value, and hence,
if e is small enough, more than e away from the null hypothesis value, whence
rejection occurs.
Note that it may be necessary to redefine the test statistic so that it
becomes a consistent estimator of something. For instance, we considered
We have been assuming that X₁, ..., Xₙ are independent and identically
distributed, with a continuous and symmetric common distribution. The
Wilcoxon test procedures were developed using only the fact that the signs
of the signed ranks are independent and are equally likely to be positive or
negative. The level of the test will therefore be unaffected if, in particular, the
Xⱼ are independent with continuous distributions symmetric about μ₀,
even if their distributions are not the same (Problem 30). (The level of the
corresponding confidence procedure to be discussed in Sect. 4 is preserved
if the Xj are independent with continuous distributions possibly different,
but all symmetric about the same μ.)
Even independence is not required for the validity of the Wilcoxon test,
provided that the conditional distribution of each X j given the others is
symmetric about μ (Problem 31). It is valid, for example, under the null
hypothesis that the treatment actually has no effect when the Xⱼ are the
treatment-control differences in a matched-pairs experiment, as long as
the controls are chosen independently and at random, one from each pair
(Problem 32).
If the continuity assumption is relaxed, then there is positive probability
of a zero or tie, and the test has not yet been defined. This situation is discussed
in some detail in Sect. 6.
For the one-tailed test, even the assumption of symmetry can be relaxed.
Specifically, suppose the Xⱼ are independent and continuously distributed,
not necessarily identically. Consider the one-tailed test at level α for the
null hypothesis of symmetry about the origin with rejection region in the
lower tail of the negative-rank sum. It can be shown (Problem 35) that this
test rejects with probability at most α if

    P(Xⱼ < -x) ≥ P(Xⱼ > x) for all x ≥ 0 and all j    (3.21)

and with probability at least α if

    P(Xⱼ < -x) ≤ P(Xⱼ > x) for all x ≥ 0 and all j.    (3.22)
Under (3.21), the probability in any left tail of the distribution of Xⱼ exceeds
or equals the probability in the corresponding right tail, so that the dis-
tribution of Xⱼ is "to the left of or equal to" ("stochastically smaller" than)
Theorem 3.3. If X₁, ..., Xₙ are independent, Xⱼ has c.d.f. Fⱼ, and φ(X₁, ..., Xₙ)
is a randomized test function which is increasing in each Xⱼ, then the probability
of rejection

    α(F₁, ..., Fₙ) = E[φ(X₁, ..., Xₙ)]    (3.23)

satisfies

    α(F₁, ..., Fₙ) ≤ α(G₁, ..., Gₙ) if Fⱼ(x) ≥ Gⱼ(x) for all x and all j.    (3.24)

The same holds if the words increasing and decreasing are interchanged
and the first inequality of (3.24) is reversed. (The theorem does not depend
on the fact that φ is a test, but it will be applied only to tests.)
PROOF. Suppose that X₁, ..., Xₙ are independent and Xⱼ has c.d.f. Fⱼ. Then
there exist independent random variables Y₁, ..., Yₙ such that Yⱼ has c.d.f.
Gⱼ and P(Yⱼ ≥ Xⱼ) = 1 for each j (Problem 34). It follows that, for φ in-
creasing in each argument,

    α(G₁, ..., Gₙ) = E[φ(Y₁, ..., Yₙ)] ≥ E[φ(X₁, ..., Xₙ)] = α(F₁, ..., Fₙ). □
the number of negative Walsh averages among the (Xⱼ - μ) equals the
number of Walsh averages less than μ for the original Xⱼ. Therefore, the
Wilcoxon test with rejection region T- ≤ k, when applied to the observa-
tions (Xⱼ - μ), would reject or accept according as μ is smaller than or
larger than the (k + 1)th smallest Walsh average (Xᵢ + Xⱼ)/2, counting
from smallest to largest in order of algebraic (not absolute) value. That is, the
(k + 1)th Walsh average from the smallest is a lower confidence bound for μ
with a one-sided α equal to the null probability that T- ≤ k. Similarly, the
(k + 1)th from the largest Walsh average is an upper confidence bound at
the same level. Hence, the endpoints of any confidence region based on the
Wilcoxon signed-rank procedure are order statistics of the Walsh averages.
All of the Walsh averages can be easily generated and ordered (sorted) by
computer. However, there is also a convenient graphical procedure for
finding the confidence bounds as Walsh averages. Plot each Xⱼ value on a
horizontal line, the "X axis," as in Fig. 4.1 for the data -1, 2, 3, 4, 5, 6, 9, 13
with n = 8. On one side of the X axis (either above or below), draw two rays
emanating from each Xⱼ, as in the diagram. All rays should make equal
angles with the X axis. Then the Walsh averages are exactly the horizontal
coordinates of the intersections of the rays. The points on the X axis must be
included, as they correspond to the n Walsh averages where i = j, that is, the
original observations. The (k + l)th smallest Walsh average can be identified
by counting from the left in the diagram. Its value may be read from the
graph or, if greater accuracy is desired, calculated as the corresponding
(Xᵢ + Xⱼ)/2. In the latter case, it may be necessary to calculate more than
one Walsh average to determine exactly which is the (k + 1)th. Continuing
the example, for n = 8 we find from Table D that P(T- ≤ 3) = 0.0195, so
that with k = 3 the fourth smallest and fourth largest Walsh averages are
lower and upper confidence bounds for μ, each at one-sided level 0.0195.
Figure 4.1 Adapted from Gibbons, Jean D. [1971], Nonparametric Statistical Inference,
New York: McGraw-Hill, p. 118, Fig. 3.1. With permission of the publisher and author.
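The confidence bounds can also be obtained by direct computation rather than graphically. The sketch below (ours) sorts the Walsh averages for the data of Fig. 4.1 and, with k = 3 so that each one-sided level is P(T- ≤ 3) = 0.0195, takes the (k + 1)th smallest and (k + 1)th largest as the two bounds.

```python
# Wilcoxon confidence interval for the center of symmetry, n = 8, k = 3.
data = [-1, 2, 3, 4, 5, 6, 9, 13]
n = len(data)

walsh = sorted((data[i] + data[j]) / 2
               for i in range(n) for j in range(i, n))
k = 3
print(len(walsh))                 # 36 = n(n+1)/2 Walsh averages
print(walsh[k], walsh[-(k + 1)])  # 1.5 9.0, the lower and upper bounds
```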
Xj                      49  -67    8   16    6   23   28   41   14   29   56   24   75   60  -48
modified rank of |Xj|   10   13    1    3    0    4    6    8    2    7   11    5   14   12    9
modified signed rank    10  -13    1    3    0    4    6    8    2    7   11    5   14   12   -9

T₀+ = 83, T₀- = 22, T₀ = 61, n = 15.
Under the null hypothesis of symmetry about 0, with either the original
or weakened assumptions, T₀- has the same distribution in a sample of size n
as does T- in a sample of size n - 1 (Problem 38). Accordingly, Table D
applies to T₀-, and similarly to T₀+, provided that (n - 1) is used in place of n.
In the example above, then, the one-tailed P-value for the lower tail of T₀- is
0.0290 for the modified test, since this is the probability from Table D that
T- ≤ 22 for 14 observations.
With n - 1 in place of n, the formulas in (3.5) and (3.6) for the mean and
variance under the null hypothesis apply to T₀+, as does the asymptotic
normality discussed in Sect. 3.2.
T₀+ is the number of positive Walsh averages excluding those with i = j,
that is, the number of positive (Xᵢ + Xⱼ)/2 for i < j (Problem 39). Therefore,
the graphical method of obtaining confidence bounds works for the modified
procedure with slight changes. Specifically, the points on the X axis must now
be excluded, and of course the critical value k is now that for T₀+ rather than
T+.
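A sketch of the modified statistic (ours, assuming no zeros or ties) follows; it counts the positive Walsh averages with i < j for the data of this section, and also gives the estimate of p₂ discussed later in this section.

```python
# T0+ as the number of positive Walsh averages (x_i + x_j)/2 with i < j.
data = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
n = len(data)

t0_plus = sum(1 for i in range(n) for j in range(i + 1, n)
              if data[i] + data[j] > 0)
print(t0_plus)                                # 83, agreeing with the example
print(round(2 * t0_plus / (n * (n - 1)), 3))  # 0.79, the estimate of p2
```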
Under alternative distributions, we have (Problem 40)

    E(T₀+) = n(n - 1)p₂/2    (5.1)
    var(T₀+) = n(n - 1)(n - 2)(p₄ - p₂²) + n(n - 1)p₂(1 - p₂)/2    (5.2)

where p₂ and p₄ are defined by (3.14) and (3.17) respectively. Although we
will not prove it here, T₀+ is approximately normal with this mean and vari-
ance. Thus we can compute the approximate power of a test based on T₀+.
Since the dominant terms of (5.1) and (5.2) are the same as those of the
corresponding moments of T+ given in (3.12) and (3.15), it follows that
T₀+ and T+ are asymptotically equivalent, in a sense which is not trivial to
make precise (see Chap. 8). In particular, the consistency results for T+ in
Sect. 3.4 apply to T₀+.
A natural estimator of p₂, the probability of a positive Walsh average,
is 2T₀+/n(n - 1). When the Xⱼ are a random sample from some continuous
population, this estimator is unbiased, by (5.1), and consistent, as just
remarked. If the class of possible distributions is broad enough (all con-
tinuous distributions, for instance), then no other unbiased estimator has
variance as small (Problem 41). On the basis of asymptotic normality, an
approximate confidence interval for p₂ could also be obtained from (5.3).
In practice one may estimate var(T₀+), as given by (5.2), using one of the
foregoing methods for p₄ and the estimate 2T₀+/n(n - 1) for p₂, and substitute
this estimate of var(T₀+) on the right-hand side of (5.3) (Problem 44).
The statistic T₀+ is thus especially natural and useful for estimation of p₂.
There are two further reasons for introducing the modified Wilcoxon
procedure. The first is that it illustrates the inevitable arbitrariness involved
in the choice of a test or confidence procedure. Neither T nor T₀ has any
definitive theoretical or practical advantage over the other, and they are
equally easy to use. In the example of Sect. 3.1 with n = 15 we get a one-tailed
P-value of 0.0206 from T and 0.0290 from T₀, and lower and upper confidence
bounds of 10 and 39 each at the one-tailed level 0.0535 from T, 7 and 39 each
at the one-tailed level 0.0520 from T₀ (Problem 45). There is no reason to
say that one set of values is any better than the other, and both methods are
reasonable. A single "correct" or "best" method of statistical analysis does
not exist in anything like the present framework.
The other reason for introducing the modified procedure is that the
choice of exact levels is thereby increased. It may happen that T provides a
nonrandomized procedure at nearly the level desired but T₀ does not, or
vice versa. For instance, suppose the nominal level selected is 0.05 for n = 10.
From Table D, we see that T₀ provides a test at the one-sided level 0.0488
while the one-sided levels for T which are nearest 0.05 are 0.0420 and 0.0527.
If there is some real reason for insisting on one particular level and a ran-
domized procedure is undesirable, the fact that T and T₀ provide tests at
different natural levels may be grounds for choice between them in a parti-
cular case. Of course, with either T or T₀, the approximate method of
interpolating confidence limits between those at attainable levels may be
used. (This was explained generally in Sect. 5, Chap. 2.) The accuracy of
such interpolation has not been investigated extensively, but results given in
Problem 49 and quoted in Chap. 2 for interpolation midway between any
two successive order statistics, and results given in Problem 50 for inter-
polation between the two smallest (or largest) Walsh averages, suggest that,
especially when discreteness matters most, the accuracy is often quite poor
but interpolation tends to be conservative.
An observation which equals 0 is called a "zero"; its sign has not been
defined. Two or more observations which have the same absolute value
(zero or not) are called a "tie"; their ranks (in order of absolute value) have
not been defined. As a result, if zeros or ties occur, the sum of the signed
ranks cannot be determined without some further specification. Under some
specifications, the null distribution of the Wilcoxon test statistic as given in
Table D can still be
applied, while others produce a different null distribution so that ordinary
tables cannot be used. Furthermore, some other surprising traps and
anomalies can arise. See Pratt [1959] for a more complete discussion than
is given here.
There are several different methods of obtaining signed ranks for observa-
tions which include zeros and/or ties. We will now describe each of these
methods briefly. In the next three subsections, we illustrate the resulting
test procedures and discuss certain properties of them.
Ties may occur in only nonzero observations, in only zero observations,
or in both simultaneously. Zeros may occur with or without ties. We will
consider each relevant case starting with nonzero ties.
The three basic methods of handling nonzero ties which we shall discuss
are called (a) the average rank method, (b) breaking the ties randomly, and
(c) breaking the ties "conservatively."
The average rank (or midrank) method assigns to each member of a group
of tied observations the simple average of the ranks they would have if they
were not tied. In general, for a set of tied observations larger than the (k - 1)th
smallest and smaller than the (l + 1)th smallest observations, that is, tied values
in rank positions k through l, the average rank is (k + l)/2, which is always
an integer or half integer, and this rank is given to each of the observations
in this tied set. This approach gives a unique set of ranks, and tied observa-
tions are given tied ranks. Since the Wilcoxon statistic uses the signed ranks
for the absolute values of the observations, the possible ranks are averaged
for sets of tied absolute values of the observations and then signs are attached
to these resulting average ranks. (Examples will be given shortly.)
Methods which break the ties give each observation, including the tied
values, a distinct integer rank, that is, as if there were no ties. Two methods
of doing so are the "random" method and the "conservative" method. In
the present context, with tied absolute values at ranks k through l, if m of
these belong to negative observations, the random method would attach
the m minus signs to a sample of size m drawn at random without replace-
ment from the integers k through l. The "conservative" method would
attach the m minus signs to the smallest m of these integers when testing for
rejection in the direction of smaller values of μ, the largest m in the other
direction, thus breaking all ties in favor of "acceptance." Other methods of
breaking ties will not be considered here.
For zeros, which may occur singly or as ties, there are analogs of each of
the foregoing methods. Still another method is to discard the zeros from the
sample before ranking the remaining observations. This last method will be
called the reduced sample procedure. For the signed-rank test, Wilcoxon
[1945] recommended this practice, with the ordinary test then being applied
to the reduced sample if there are no nonzero ties. However, this leads to a
surprising anomaly, which will be discussed later. Further, one might argue
purely intuitively that the ambiguity about the signed ranks of the zero
observations is irrelevant to the ranks of the nonzero observations, and hence
these ranks should not be changed by discarding the zeros. Nevertheless,
if the zeros are retained, they must also be given signed ranks. One possi-
bility is to give each zero a signed rank of zero; we call this the signed-rank
zero procedure. The signed-rank zero is actually an average rank, since it is
the average of the two signed ranks which could be assigned to any zero,
regardless of what method is used to obtain the unsigned rank for the zero.
Tiebreaking procedures, on the other hand, would assign the signed ranks
±1, ±2, etc., to the zeros, choosing the signs either randomly (independently
with equal probability) or "conservatively" (all + signs when testing for
rejection in the direction of smaller values of μ, all - signs in the other
direction).
If both nonzero ties and one or more zeros are present, it would be
possible to use anyone of the procedures for zeros in conjunction with
anyone of the procedures for nonzero ties. With any procedure used for
nonzero ties, however, it is natural to use either the corresponding procedure
for zeros or the reduced sample procedure. When the "conservative"
procedure is viewed as inadequate, we recommend that the signed-rank
zero method be used in conjunction with the average rank procedure, for
reasons indicated later in Sects. 6.4 and 6.5.
We illustrate the basic methods of handling ties, with zeros also present,
for the following data (arranged in order of absolute value):
0, 0, -1, -2, 2, 2, 3. (6.2)
(a) The assignment of ranks to all observations by the average rank
method, in combination with the signed-rank zero procedure, is illustrated in
Table 6.1 for the data in (6.2). By the average rank and signed-rank zero
methods, the values of the Wilcoxon signed-rank statistics are T+ = 17,
T- = 8, T = 9. Note that the relationships between T+, T- and T, as given
in (3.1)-(3.3), must be modified; if v zeros are present, then v(v + 1) must be
subtracted from n(n + 1) throughout (Problem 52d).
Table 6.1
Xj 0 0 -1 -2 2 2 3
Possible ranks 1,2 1,2 3 4,5,6 4,5,6 4,5,6 7
Average rank 1.5 1.5 3 5 5 5 7
Signed rank 0 0 -3 -5 5 5 7
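The assignment in Table 6.1 is mechanical enough to sketch in code (our illustration; the midrank helper is ours). It attaches average ranks to the absolute values, signed-rank zeros to the zeros, and signs to the rest.

```python
# Average rank (midrank) method with the signed-rank zero procedure,
# applied to the data in (6.2).
data = [0, 0, -1, -2, 2, 2, 3]
abs_sorted = sorted(abs(x) for x in data)

def midrank(a):
    # average of the 1-based rank positions occupied by absolute value a
    first = abs_sorted.index(a) + 1
    return first + (abs_sorted.count(a) - 1) / 2

signed = [0.0 if x == 0 else midrank(abs(x)) * (1 if x > 0 else -1)
          for x in data]
print(signed)                            # [0.0, 0.0, -3.0, -5.0, 5.0, 5.0, 7.0]
print(sum(s for s in signed if s > 0),   # T+ = 17
      -sum(s for s in signed if s < 0))  # T- = 8
```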
The null distributions of these test statistics are not as given in Table D,
as is obvious since the ranks with signs are not the first n integers. An exact
test can, however, be performed conditionally on the number of zeros
present and the ranks assigned to the nonzero observations. Under the null
hypothesis of symmetry about 0, given the pattern of zeros and ties (and even
the absolute values) present, the conditional distribution of the positive and
negative rank sums by the average rank method is determined by the fact
that all assignments of signs to the average ranks of the nonzero observations
are equally likely.
For the data in (6.2), given the absolute values present, including two
zeros, three observations tied at ranks 4-6, and two untied observations at
ranks 3 and 7, there are 2⁵ equally likely assignments of + and - signs to the
relevant average ranks, which are 3, 5, 5, 5, and 7. We enumerate these
assignments and calculate T- as follows.
Negative ranks     T-    Cases
none                0      1
3                   3      1
5                   5      3
7                   7      1
3, 5                8      3
3, 7               10      1
5, 5               10      3
5, 7               12      3
3, 5, 5            13      3
3, 5, 7            15      3
5, 5, 5            15      1
5, 5, 7            17      3
3, 5, 5, 5         18      1
3, 5, 5, 7         20      3
5, 5, 5, 7         22      1
3, 5, 5, 5, 7      25      1

t              0   3   5   7   8  10  12  13  15  17  18  20  22  25
2⁵ P(T- = t)   1   1   3   1   3   4   3   3   4   3   1   3   1   1
2⁵ P(T- ≤ t)   1   2   5   6   9  13  16  19  23  26  27  30  31  32
The observed value here is T- = 8, with P-value P(T- ≤ 8) = 9/32 = 0.28.
The probability of a value smaller than that observed, the next P-value, is
P(T- < 8) = P(T- ≤ 7) = 6/32 = 0.19.
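The conditional enumeration is easily automated; the sketch below (ours) generates all 2⁵ sign assignments to the average ranks 3, 5, 5, 5, 7 and recovers the distribution tabulated above.

```python
# Conditional null distribution of T- for the data in (6.2), given the
# pattern of zeros and ties: signs on the ranks 3, 5, 5, 5, 7.
from itertools import product

ranks = [3, 5, 5, 5, 7]
t_values = [sum(r for r, s in zip(ranks, signs) if s < 0)
            for signs in product([1, -1], repeat=len(ranks))]

print(len(t_values))                       # 32 equally likely cases
print(sum(1 for t in t_values if t <= 8))  # 9, so P(T- <= 8) = 9/32
print(sum(1 for t in t_values if t <= 7))  # 6, so P(T- <= 7) = 6/32 = 0.19
```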
When there are relatively many ties, as here, the normal approximation
may be very inaccurate. The relevant mean and variance are easily obtained
and the distribution is symmetric (Problem 52), but the uneven spacing of
the possible values of T- makes correction for continuity difficult and of
doubtful value. Further, general lumpiness of the distribution removes any
possibility of high accuracy, even when a continuity correction is used.
However, when enumeration of the null distribution is too laborious, one
could use the Monte Carlo method (simulation) described further in Sect. 2.5,
Chap. 4. A table of critical values for the case when there are zeros but no
nonzero ties is given in Rahe [1974].
(b) If a tiebreaking procedure is applied to the data in (6.2) without
omitting the zeros, there are 12 different possible signed-rank assignments
for the possible ranks shown in Table 6.1, obtained as follows. The two
zeros, with possible ranks 1 and 2, could be given signed ranks either 1 and 2,
or -1 and 2, or 1 and -2, or -1 and -2. The three observations with
absolute value 2 must each be assigned one of the ranks 4, 5, and 6. When
signs are attached, one of these ranks must be negative since one of the 2's
was negative. Thus the signed rank associated with -2 is either -4, -5, or
-6, and the two observations which are +2 have the remaining two ranks
with positive sign. The observation -1 has signed rank -3 and the observa-
tion 3 has signed rank 7 in any case.
As a result of these possibilities, breaking the ties in this set of observations
could lead to any one of 4(3) = 12 sets of signed ranks. In all 12 cases, the
negative-rank sums are between 7 and 12 inclusive. The two methods of
breaking the ties which produce (i) the smallest, and (ii) the largest, negative-
rank sum are shown in Table 6.2. There are in addition two cases with
T- = 8, three with T- = 9, three with T- = 10, and two with T- = 11,
as shown in Table 6.3 (Problem 53).
Table 6.2
Xj                  0    0   -1   -2    2    2    3    T-
Signed ranks (i)    1    2   -3   -4    5    6    7     7
Signed ranks (ii)  -1   -2   -3   -6    5    4    7    12
The random method of breaking ties selects one of the possible resolutions
of ties by using some supplementary random experiment which ensures that
each possible set of signed-rank assignments is equally likely to be chosen.
Because this preserves the usual null distribution of the Wilcoxon signed-
rank statistic (Problem 54), the usual critical value can then be used, or the
P-value can be found from Table D. Thus one of the columns of Table 6.3
would be selected, with probability proportional to the number of cases.
Instead of actually breaking the ties at random and using standard tables,
we might report the probability for which doing so would lead to rejection,
as in reporting φ(x) for a randomized test (Sect. 5.2, Chap. 1). For example,
for the data of (6.2), if we were using the critical value t = 8, three of the
twelve possible ways of breaking ties would lead to rejection. If the ties
were broken at random, therefore, there would be a 3/12 = 0.25 chance of
rejection. Instead of actually randomizing, one could report this probability.
(For P-values one could in principle report similarly that randomizing
would give P = 0.1484 with probability 1/12, P = 0.1875 with probability 2/12,
etc., or some summary of this information.)
The "conservative" method of breaking ties, when rejecting for small
values of T-, would assign negative signs to the two zeros of the data (6.2),
and assign rank 6 rather than 4 or 5 to the negative observation - 2, so as
to maximize T -, as in the bottom line of Table 6.2. The "conservative"
value of T- would therefore be T- = 12 and the "conservative" P-value
from Table 6.3 is 0.4063. This means that all methods of breaking ties would
give T- S; 12 and would reject at any critical value of 12 or more and hence
at any level oc ;::: 0.4063.
The other side of the conservative coin is that no method of breaking
ties would give a value T- < 7. Hence all methods of breaking ties lead to
"acceptance" for any critical value of 6 or less. For critical values between 7
and 11 inclusive, however, the conclusion is indeterminate, and the ordinary
Wilcoxon test would reject by one resolution of ties but not by another.
Approaching the hypothesis testing problem from a confidence interval
viewpoint sheds some light on the interpretation of results when ties are
broken in all possible ways. The Wilcoxon signed-rank test for the null
hypothesis μ = 0 should presumably "accept" μ = 0 when 0 is inside and
reject when 0 is outside the usual confidence interval defined by the order
statistics of the Walsh averages. By Problem 55, the ordinary signed-rank
test leads to rejection no matter how the ties are broken if and only if 0 is
outside the confidence interval and not an endpoint of it. The same holds for
"acceptance" and "inside." The remaining possibility, that 0 is an endpoint
of the confidence region, occurs when and only when the ordinary Wilcoxon
test would reject by one method of breaking the ties but not by another.
Hence, instead of resolving the indeterminacy as to whether the test "accepts"
or rejects μ = 0, one might simply state that 0 is an endpoint of the corre-
sponding confidence interval, as suggested earlier. For the data in (6.2),
we have seen that indeterminacy occurs for the critical values 7-11, and it is
easily verified that the corresponding confidence bounds, the Walsh averages
at ranks 8-12, are all 0 (Problem 56).
Note that these comments apply only to tiebreaking procedures, meaning
that the ordinary Wilcoxon test is used after the ties are broken. They
unfortunately do not apply to the average rank procedure, which, as we
shall see, may give a conclusion opposite to that based on a tiebreaking
procedure even when the latter is determinate. Though inconvenient, this
is not a telling objection to the average rank procedure, since there is nothing
sacrosanct about the Wilcoxon procedure itself, even in the absence of ties.
(c) A reduced sample procedure would omit the two zeros in the data of
(6.2), leaving a sample of size 5 with a three-way nonzero tie. The tie could
be handled by any of the methods described above. We shall not illustrate
this procedure here. However, note that it can disagree with tiebreaking in
the complete sample, and has a still more objectionable property which will
be described shortly.
Nonzero ties
For a given pattern of zeros and ties (strictly, for given absolute values), if
the average rank procedure would be used in some cases, it is to be used in all
cases, even those where tiebreaking is unambiguous. Furthermore, it may
lead to the opposite conclusion. Thus, it is not a valid shortcut to use tie-
breaking when it is unambiguous and average ranks when tiebreaking is
indeterminate. Similar comments would presumably apply to other pro-
cedures for handling zeros and ties, in the absence of proof to the contrary.
To illustrate the difficulty, consider the following sample, in which the
tied observations all have the same sign:
1, 1, 1, 1, 2, 3, -4, 5. (6.3)
Any method of breaking the ties gives the same signed ranks, namely
1, 2, 3, 4, 5, 6, -7, 8. (6.4)
Thus, one is tempted to apply the Wilcoxon test to these signed ranks without
further ado. The null probability of a negative-rank sum of less than 7 is
14/2⁸ while that of 7 or less is 19/2⁸ (0.0547 and 0.0742 respectively, from
Table D). Hence, when the ties are broken in any way, (6.3) would be judged
not significant at any one-sided level α ≤ 0.0547 and significant at any level
α ≥ 0.0742. (For 0.0547 < α < 0.0742, the exact level α is unavailable, but
such values of α are not required for the present discussion.)
Now if the average rank procedure is used on the sample in (6.3), the
signed ranks are
2.5, 2.5, 2.5, 2.5, 5, 6, - 7, 8, (6.5)
and the test statistic still has the value T- = 7. The left tail of the null
distribution of T- given the ties at ranks 1, 2, 3, and 4 is shown below (Prob-
lem 57).

t              0   2.5    5    6    7   7.5    8   8.5   9.5
2⁸ P(T- = t)   1    4     7    1    1    8     1    4     4
Hence, by the average-rank procedure, P(T- ≤ 7) = 14/2⁸ = 0.0547, and
the sample in (6.3) would be judged significant at the level 0.0547, even
though it is not significant at this level when the ties are broken, no matter
how they are broken. For α = 0.0547, the two methods reach opposite
conclusions. (This is true for all α in some interval including 0.0547, but
this interval depends on what is done when the exact level α is not available.
Furthermore, similar disagreement in the other direction is also possible
(Problem 58).)
In terms of P-values, the exact P-value is 0.0547 by the average-rank
procedure, while it is 0.0742 by the Wilcoxon test no matter how the ties are
broken. In terms of confidence bounds, for α = 0.0547, the lower confidence
bound is 1 by the average-rank procedure but -0.5 by the usual procedure.
Thus the two procedures give bounds with opposite signs (and 0 is not an
endpoint of either confidence interval).
There is no contradiction here, but there is a warning; namely, it is not
valid to use a tiebreaking procedure when tiebreaking is unambiguous,
not even when all tied observations have the same sign, if one would have
used another procedure (such as average ranks) in other cases (with the
same absolute values).
Consider, for example, a sample with the same absolute values as (6.3),
but different observed signs, as in (6.6):
-1, -1, -1, 1, 2, 3, 4, 5. (6.6)
Using the average rank procedure, we find T- = 7.5. The null distribution
of T- for the average rank procedure is the same as that given above for
sample (6.3), so that P(T- ≤ 7.5) = 22/2⁸ = 0.0859. But this computation
of the null distribution of T- by the average rank procedure for sample (6.6)
assumes that the signed ranks for sample (6.3) are those of (6.5), not those of
(6.4) which result from breaking the ties. Thus, if we would use the average
rank procedure for the sample (6.6), we must also use it for the sample (6.3),
even though tiebreaking would be easier and unambiguous for (6.3). The
alternative would be to use tiebreaking in both samples (and others with the
same absolute values). When the ties in (6.6) are broken, the possible values
of T- are 6 ≤ T- ≤ 9, with corresponding P-values from Table D of
0.0547 and 0.1250 at the extremes. If this degree of ambiguity is too great,
one might prefer the average rank procedure, but one must make this
decision in advance, without knowing the signs and hence without knowing
whether the actual sample is (6.3), or (6.6), or some other sample with the
same absolute values.
Zeros
In the case of zeros, if there are no nonzero ties, it can be shown that the
signed-rank zero procedure gives the same conclusion as tiebreaking
whenever the latter is unambiguous (Problem 59). The reduced sample
procedure, however, may exhibit strange behavior in this and other respects,
as can be illustrated by applying it to the 13 observations
0, 2, 3, 4, 6, 7, 8, 9, 11, 14, 15, 17, -18. (6.7)
Dropping the zero before ranking leaves 12 observations with a negative-
rank sum of 12, which is not significant at any one-sided level α ≤ 55/2¹² =
0.0134 and is significant at any α ≥ 70/2¹² = 0.0171, these being the null
probabilities of less than 12 and of 12 or less, respectively. On the other hand,
tiebreaking in favor of the null hypothesis, assigning 0 the signed rank -1,
would result in 13 observations with negative ranks 1 and 13 and a negative-
rank sum of 14, which is significant at α = 109/2¹³ = 0.0133, the null
probability of 14 or less. Thus, for 0.0133 ≤ α ≤ 0.0134, the reduced sample
procedure disagrees with tiebreaking even though tiebreaking is unam-
biguous (and, as before, disagreement occurs for a wider range of α which,
however, depends on what is done when the exact level α is not available).
The P-value is 0.0171 by the reduced sample procedure, but it cannot
exceed 0.0133 if the zero is retained, no matter what sign is given to the
zero. These results are comparable to those above for the average rank
procedure for nonzero ties. When we examine confidence regions, however,
an anomaly appears which is more striking and disturbing than before.
For α = 0.0133, the usual lower confidence bound is 1. What is the
confidence region by the reduced sample procedure? It contains μ = 0,
since the reduced sample procedure would accept this hypothesis at this
level. If an amount μ between 0 and 1 is subtracted from every observation,
there will be no zero or tie, so the usual procedure will be used, and will
reject the value μ. This is already strange: the sample is not significant
as it stands but becomes significant in the positive direction if every observa-
tion is reduced by the same small amount. Correspondingly, the confidence
region is not an interval; it contains the point μ = 0, excludes all other values
μ < 1, and contains all μ > 1. Thus it is an interval plus an exterior point.
(Strictly speaking, for those integer and half-integer values of μ where
nonzero ties occur, the procedure has not been defined, but the statement
holds for the average rank procedure and for any tiebreaking procedure.)
It is also possible for the reduced sample procedure to judge a sample
significant in the positive direction, yet not significant when every observa-
tion is increased by the same small amount, corresponding to a confidence
region which is an interval with an interior point removed (Problem 61).
Thus the reduced sample procedure is not only inconsistent with tie-
breaking, but also inconsistent with itself, in the sense that shifting the
sample in one way may shift the conclusion the opposite way. The signed-
rank zero and average rank procedures are self-consistent in this sense.
conditions. However, the true level α is unknown and may be much less than
the nominal level, and considerable loss of power may result, especially if
many ties are likely.
Breaking the ties at random, whether by actually doing so and using
standard tables or by reporting the probability with which doing so would
lead to rejection, satisfies all of the above conditions and also the following
version of (i) (Problem 63), which is stronger for randomized test procedures.
(i') The probability that a sample is judged significantly positive shall not
decrease, nor the probability that it is judged significantly negative increase,
when (a) some observations are increased, or (b) all observations are in-
creased by equal amounts.
Breaking the ties at random permits use of the regular tables but then the
analysis depends on an irrelevant randomization. Imposing this extraneous
randomness in an artificial way is unpleasant in itself, and presumably
greater power could be achieved without it [see also Putter, 1955]. The
unpleasantness can be somewhat mitigated, but not entirely eliminated, by
reporting instead the probability with which breaking the ties would lead
to rejection for the sample at hand, rather than actually breaking the ties
in one particular, randomly chosen way. This, however, requires additional
calculation.
However reasonable the properties above may seem in general, in par-
ticular cases larger observations may not be greater evidence of positivity
in the population (Problem 64). Even the normal-theory t statistic does not
satisfy (i)(a) and may decrease when some observations are increased, since
this affects the sample variance as well as the mean (Problem 65). Neverthe-
less, in the absence of information about the underlying distribution, as in
the present nonparametric context, the conditions appear desirable in-
tuitively.
Because the average rank procedure does not satisfy (i)(b) and (iii), it is
all the more tempting to resort to it only when tiebreaking is indeterminate,
i.e., only for μ₀ at the end of the usual confidence interval. Unfortunately
there seems to be no easy way to do this and preserve the level α. Accordingly
our recommendation is to use the "conservative" procedure if it is not too
conservative in view of the type of inference desired and the extent of zeros and
ties expected. (If one is testing a null hypothesis, and not forming a con-
fidence interval, one may look at the absolute values present before deciding,
but not at the signs.) Otherwise, we recommend the average rank procedure
in conjunction with the signed-rank zero procedure.
Xj                 49   -67     8    16     6    23    28    41    14    29    56    24    75    60   -48
k = rank of |Xj|   11    14     2     4     1     5     7     9     3     8    12     6    15    13    10
signed cₖ         c₁₁  -c₁₄    c₂    c₄    c₁    c₅    c₇    c₉    c₃    c₈   c₁₂    c₆   c₁₅   c₁₃  -c₁₀
probabilities for cₖ arise naturally (in Sect. 9, for instance). However, even if
appropriate tables are available, these other tests (and especially the cor-
responding confidence procedures) are at least somewhat more difficult to
use than those just mentioned. In the absence of tables, the normal approxi-
mation could be used (Problem 80), but it has only limited, though perhaps
adequate, accuracy in small samples. In large samples, where the normal
approximation is more accurate, confidence limits may be preferable to
tests but especially difficult to obtain. With more effort, the null distribution
for any particular set of constants cₖ can be obtained by enumeration, or
approximated by Monte Carlo methods (simulation). These approximations
will be discussed in the next chapter, in connection with "observation-
randomization" procedures (where tabulation is impossible). Here, we need
only note that the distribution can be determined or at least approximated
well. In some problems, the advantages of these procedures may warrant
the extra effort in analysis.
For any signed-rank test and corresponding confidence procedure, the
assumptions made in the introduction to this section can be relaxed as in
the first two paragraphs of Sect. 3.5. In some circumstances, the continuity
assumption can also be relaxed so that ties have positive probability; in
particular, tests based on sums of signed constants and corresponding
confidence bounds are conservative for discrete distributions if the cₖ
satisfy 0 ≤ c₁ ≤ c₂ ≤ ... ≤ cₙ (Problem 109). For one-tailed signed-rank
tests, the assumption of symmetry can also be relaxed as in Sect. 3.5, provided
the test satisfies the condition of Theorem 3.3. This condition is satisfied,
in particular, by any one-tailed test based on a sum of signed constants cₖ
such that 0 ≤ c₁ ≤ ... ≤ cₙ (Problem 83).
The tests based on sums of signed constants cₖ depend only on the signed
ranks of the Xⱼ; if Xⱼ has signed rank -k, the signed constant is -cₖ, and
if Xⱼ has signed rank +k, the signed constant is +cₖ. Tests depending only
on the signed ranks, including those in this general class and many more,
are called signed-rank tests.
We have already seen that the Wilcoxon signed-rank test depends only
on the signs of the Walsh averages (because the positive-rank sum is the
number of positive Walsh averages, by Theorem 3.1), and that the corre-
sponding confidence limits are order statistics of the Walsh averages. We
will see in this subsection that all signed-rank tests are in practice equivalent
to tests depending only on the signs of the Walsh averages, and in the next
subsection that the corresponding confidence limits are always Walsh
averages.
The exact statement of the relationship for tests is given in the following
theorem.
Theorem 7.1. The signed ranks determine the signs of the Walsh averages,
so any test depending only on the signs of the Walsh averages is a signed-rank
test. Conversely, the signs of the Walsh averages determine the signed ranks
except possibly for the order in which they occur. Therefore, any signed-rank
test which does not take into account the order of the signed ranks depends
only on the signs of the Walsh averages.
The proof of this theorem is similar to that of Theorem 3.1 (Problem 81).
To illustrate the point about order, consider a sample X₁, X₂, X₃ whose
Walsh averages have the signs given below.
to any signed-rank test are Walsh averages of the original sample. (The
region will be an interval provided the test satisfies condition (i)(b) of Sect.
6.5. This holds for a test based on a sum of signed cₖ's provided cₖ₊₁ ≥
cₖ ≥ 0 for all k (Problem 85).)
As a result, the confidence limits corresponding to signed-rank tests are
always Walsh averages. However, except in the Wilcoxon case, the confidence
limit at a given level does not always have the same rank among the Walsh
averages, and the ordered Walsh averages have different confidence levels in
different samples. For certain tests, such as the sign test and others mentioned
in the next subsection, the relevant Walsh average can be easily identified.
In general it cannot, but the following trial and error procedure seems likely
to identify it fairly quickly in most cases.
Consider the Walsh averages arranged in order of algebraic size. Let
T(μ) be the value of the test statistic for the hypothesized value μ. For a
signed-rank test, T(μ) is constant for μ in each interval between adjacent
Walsh averages. Suppose, as is usual, that T(μ) is a monotone function of μ,
so that its values in successive intervals are increasing (or decreasing)
throughout -∞ < μ < ∞. Start the search with some Walsh average,
such as the Wilcoxon confidence bound or a Walsh average close to the
normal-theory bound or to some other approximate bound appropriate
to the test being used. Find the value of the test statistic T(μ) for μ just below
the starting point (i.e., between it and the next smaller Walsh average).
Move to the next smaller or greater Walsh average depending on whether
T(μ) is smaller than or greater than the critical value of the test. Continue
(in the same direction) until the value of T(μ) equals the critical value, or
until successive values bracket it. The Walsh average which separates
rejectable from "acceptable" values of T(μ) is the confidence bound sought.
It may be helpful to rank beforehand all Walsh averages in what seems to be
the relevant range. At each step, at most two signed ranks will change,
and it may be easier to calculate the change in T(μ) than to recalculate
T(μ) from scratch (Problem 84; see also Bauer [1972]).
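The search can be sketched in code. In the fragment below (ours), the scores and critical value are illustrative assumptions; the Wilcoxon scores cₖ = k are used so that the answer can be checked against the known bound, the fourth smallest Walsh average of the Fig. 4.1 data when k = 3.

```python
# Trial-and-error search for the confidence bound of a signed-rank test:
# step through the ordered Walsh averages until the statistic crosses the
# critical value. Scores and critical value here are illustrative only.
def t_minus(data, mu, c):
    # sum of scores c[k] over observations below mu, ranked by |x - mu|
    shifted = sorted((abs(x - mu), x - mu) for x in data)
    return sum(c[k] for k, (_, d) in enumerate(shifted) if d < 0)

data = [-1, 2, 3, 4, 5, 6, 9, 13]
n = len(data)
c = list(range(1, n + 1))   # Wilcoxon scores c_k = k, for illustration
walsh = sorted((data[i] + data[j]) / 2 for i in range(n) for j in range(i, n))

crit = 3                    # reject the value mu if T-(mu) <= crit
for w in walsh:             # find where "reject" first turns to "accept"
    if t_minus(data, w + 1e-9, c) > crit:
        break
print(w)                    # 1.5, the Wilcoxon lower confidence bound
```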
If, as for the Wilcoxon confidence intervals, a procedure involves ranking all
the Walsh averages, at least implicitly, then it will automatically be per-
mutation invariant (independent of the order of the observations), and it is
immaterial whether the Walsh averages are defined in terms of the original
observations Xⱼ or the sample order statistics X(j). Confidence procedures
for the center of symmetry which are simpler and still permutation invariant
can, however, be obtained from the sample order statistics by using a small
number of Walsh averages of the form (X(i) + X(j))/2. We cannot choose a
completely arbitrary function of the Walsh averages; for validity under
the assumption that the population is symmetric, the corresponding test
should be a signed-rank test and hence should depend only on the signs
of the Walsh averages (Theorem 7.1).
The simplest case would be to use a single Walsh average (X(i) + X(j))/2.
For i = j, the confidence limit is simply an order statistic and the procedure
corresponds to a sign test, as discussed in Chap. 2. For i < j, the confidence
limit corresponds to a sign test on the n + i - j observations which are
largest in absolute value (Problem 86; [Noether, 1973]).
With two Walsh averages there are more possibilities, and a more difficult
probability problem must be solved to obtain the confidence level, but for
sample sizes n ≤ 15, procedures have been worked out by Walsh [1949a,b].
Specifically, his tests are of the form: reject the null hypothesis μ = 0 (or
μ ≤ 0) in favor of the alternative μ > 0 if both (X(i) + X(j))/2 and (X(k) +
X(l))/2 are positive, and "accept" otherwise, where the four indices i, j, k,
and l are chosen, not necessarily all different, to give the desired level. The
corresponding lower confidence bound is

    min[(X(i) + X(j))/2, (X(k) + X(l))/2].    (7.1)

A lower-tailed test and corresponding upper confidence bound can be
obtained similarly, and combining one-sided procedures gives two-sided
procedures. For 4 ≤ n ≤ 15, Walsh [1949a,b] gives a table of tests of this
form with one-tailed levels near the conventional values 0.005, 0.01, 0.025,
0.05, and two-tailed levels twice these values. He does not define a particular
method of choosing i, j, k, and l for n > 15.
The Wilcoxon procedures with the critical value 0, 1, or 2 are of this type.
Specifically, for any n ≥ 3 (Problem 36),
of all procedures which are ordinarily considered for the situation at hand.
One justification for restricting consideration to signed-rank procedures is
based on another kind of invariance. The notion of invariance applies very
generally, with similar rationale and limitations, and the "principle of
invariance" can be invoked for any suitable class of transformations. In
this subsection we consider a large class of transformations which leads to a
much reduced and very useful class of invariant procedures, in particular,
to signed-rank tests.
Suppose, for convenience, that X₁, ..., Xₙ are independent and identically
distributed, and that we are testing the null hypothesis that the distribution
is symmetric about 0 against the alternative that it is not. Consider the class
of transformations defined by all strictly increasing, odd functions g, where
odd means that
    g(-x) = -g(x) for all x.    (8.3)
Then if X₁, ..., Xₙ satisfy the null hypothesis, so also do g(X₁), ..., g(Xₙ),
and similarly for the alternative hypothesis. The transformation in (8.3)
then carries null distributions into null distributions and alternative
distributions into alternative distributions (Problem 97). Accordingly, we
could "invoke the principle of in variance," that is, require that the test
treat Xl' ... , X nand g(X 1), ... , g(Xn) in the same ~ay. If this is required
for all strictly increasing, odd functions g, then any two sets of observations
with the same signed ranks must be treated alike, because any set of observa-
tions can be carried into any other set with the same signed ranks by such a
function 9 (Problem 98). In short, the signed-rank tests are the only tests
which, for these hypotheses, are invariant under all strictly increasing, odd
transformations g.
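The invariance is easy to observe numerically; the sketch below (ours) applies the odd, strictly increasing function g(x) = x³ to a sample and confirms that the signed ranks are unchanged.

```python
# Signed ranks are invariant under odd, strictly increasing transformations.
def signed_ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i]))
    out = [0] * len(xs)
    for r, i in enumerate(order, start=1):
        out[i] = r if xs[i] > 0 else -r
    return out

data = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
print(signed_ranks(data) == signed_ranks([x ** 3 for x in data]))  # True
```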
The signed-rank tests are also invariant under this class of transformations
for some other null and alternative hypotheses we have been considering in
this chapter. The argument above applies to any hypotheses for which
strictly increasing, odd transformations carry null distributions into null
distributions and alternatives into alternatives. This restriction is satisfied
for the null hypothesis that the Xⱼ are independent with possibly different
distributions but all symmetric about 0, and for null and alternative hypo-
theses of the form P(Xj < 0) > P(Xj > 0) or the reverse (as in Sect. 6.1,
Chap. 2), etc. However, it does not hold for alternatives under which the
Xj are symmetrically distributed about a value other than 0 (Problem 101).
Similarly, the confidence procedures for the center of symmetry μ which
correspond to signed-rank tests are not justifiable by this invariance argu-
ment, because they are not invariant under all strictly increasing, odd
transformations. They are not themselves signed-rank procedures, that is,
they are not functions of the signed ranks of the original observations. The
relevant transformations are different for different μ.
The argument for invariance under the class of transformations in (8.3)
is far less compelling than the argument for permutation invariance. On
that signed-rank test which is most powerful against any alternative dis-
tribution, and in particular against alternatives of various kinds which are
"close to" the null distribution. We assume that the observations are
independently, identically and continuously distributed so that ties have
probability zero. Also, by sufficiency, we can ignore the order of the observa-
tions and restrict consideration to permutation-invariant tests (Sect. 8.1).
If we did not, they would result anyway. This and some other points which
arise here will be discussed more fully in Chap. 5.
For a sample of size n, there are 2ⁿ possible assignments of signs to the
ranks 1, ..., n, and accordingly 2ⁿ sets of signed ranks r₁, ..., rₙ, where
rⱼ = ±j. We assume as usual that all 2ⁿ possible sets of signed ranks are
equally likely under the null hypothesis. By the Neyman-Pearson Lemma
(Theorem 7.1 of Chap. 1), it follows (Problem 101) that among signed-rank
tests at level α, the most powerful test against the alternative F rejects if the
probability under F of the observed set of signed ranks is greater than a
constant k, and "accepts" if it is less than k, where k and the probability of
rejection at k are chosen to make the level exactly α. Letting PF(r₁, ..., rₙ)
be the probability under F of signed ranks r₁, ..., rₙ, we may express this
test as
Assume it is legitimate to differentiate (9.2) under the integral sign, and let
(9.3)
(9.4)
where |X|_(1) < ··· < |X|_(n) in (9.4) are the absolute values of a sample of n
from F_0, arranged in order of size. Expanding P_θ(r_1, ..., r_n) in a Taylor's
series about θ = 0 and using (9.4) for its derivative, we have
where |Z|_(1) < ··· < |Z|_(n) are the ordered absolute values of a sample of
n from the standard normal distribution. Accordingly, this test is the locally
most powerful signed-rank test against normal alternatives with positive
mean. The corresponding lower-tailed test is similarly the locally most
powerful signed-rank test against normal alternatives with negative mean.
The test with the c_j in (9.9) (or, equivalently, (9.10)) is frequently referred
to as the Fraser (normal scores) test since it was derived by Fraser [1957a].
Additional properties, as well as the values of the scores in (9.10) and the
critical values, are given in Klotz [1963]. This test is asymptotically equivalent
to a test with "inverse normal scores" as constants, that is, c_j =
Φ^{-1}(j/(n + 1)), where Φ(x) is the standard normal c.d.f. The values of these
scores are more readily accessible than those in (9.10); see, e.g., Fisher and
Yates [1963] and van der Waerden and Nievergelt [1956]. This test is mentioned
in van Eeden [1963].
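The inverse normal scores are easy to compute directly. The following Python sketch (illustrative data; using the standard library's NormalDist) evaluates the statistic formed as a sum of signed constants c_j = Φ^{-1}(j/(n + 1)):

    # Sketch: the "inverse normal scores" statistic, a sum of signed constants
    # c_j = Phi^{-1}(j/(n+1)).  The data are illustrative.
    from statistics import NormalDist

    def inverse_normal_scores_stat(xs):
        n = len(xs)
        inv = NormalDist().inv_cdf
        c = [inv(j / (n + 1)) for j in range(1, n + 1)]   # c_1 <= ... <= c_n
        # Attach the sign of the observation ranked j in absolute value.
        order = sorted(range(n), key=lambda i: abs(xs[i]))
        return sum(c[j] * (1 if xs[i] > 0 else -1)
                   for j, i in enumerate(order))

    print(inverse_normal_scores_stat([0.3, -0.8, 0.4, 0.6, -0.2]))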
Figure 9.1 [graphs of the c.d.f. F(x) and the density f(x)]
PROBLEMS
1. Show that a distribution with discrete frequency function f is symmetric as defined
by (2.1) or (2.2) if and only if f satisfies (2.3).
2. (a) Show that the condition in (2.1) is equivalent to the same condition with
strict inequalities, that is, P(X < μ − x) = P(X > μ + x) for all x.
(b) Show that (2.1) holds for all x if and only if it holds for all x > 0.
3. Suppose that X is symmetrically distributed about μ. Show that
(a) μ is a median of X.
(b) μ is the mean of X (provided it exists).
(c) μ is the mode of X if X is unimodal.
4. Show that each of the symmetry conditions given as (a), (b), and (c) in Sect. 2 is
equivalent to the condition that X is symmetrically distributed about μ.
*5. Show that a distribution is symmetric about 0 if and only if its characteristic
function is real everywhere. (Hint: Each is equivalent to the statement that X and
- X have the same distribution.)
6. Suppose that V and W have the joint density

    f(v, w) = 1/2 for −1 ≤ w ≤ v ≤ 1, v − w ≤ 1, and for 0 ≤ v + 1 ≤ w ≤ 1.

(a) Show that the marginal distributions of V and W are both the uniform density
over (−1, 1), and hence the medians of V and W are each equal to 0.
(b) Show that P(W < V) = 3/4, which implies that the median of the population
of differences X = W − V must be negative and hence cannot equal the
difference between the medians of W and V.
(c) Show that the density of the difference X = W − V is

    f(x) = (2 + x)/2 for −1 < x ≤ 0,
    f(x) = (2 − x)/2 for 1 < x ≤ 2,

and f(x) = 0 otherwise. This distribution is not symmetric and does not have
median 0. It has a unique median −2 + √3.
7. Suppose that V and W are independent and X = W − V. Show that the following
properties hold. Note that (a)-(c) give conditions under which the difference X
is symmetrically distributed about a median which equals the difference of the
medians of W and V, while (d) shows that, even if X is symmetrically distributed,
its center of symmetry may not be equal to the difference of the medians of W and V.
(a) If V and W are symmetrically distributed about μ and λ respectively, then X is
symmetrically distributed about λ − μ.
(b) If V and W are identically distributed, then X is symmetrically distributed
about 0.
(c) If W has the same distribution as V + θ for some "shift" θ, then X is symmetrically
distributed about θ.
*(d) If V and W have any two distributions of the following family, then X is
symmetrically distributed about 0, even though the medians of V and W may
differ. The family of distributions is discrete with frequency functions f_θ given
by

    f_θ(1) = 2θ, f_θ(2) = 2/3 − 3θ, f_θ(3) = 1/3, f_θ(4) = θ,

all for 0 ≤ θ ≤ 2/9. The median is uniquely 2 for θ < 1/6, uniquely 3 for θ > 1/6.
What is the median for θ = 1/6?
8. Define the difference X = W − V where W and V need not be independent. The
example in Problem 6 shows that in general, the median of X need not be equal
to the difference of the medians of W and V even if the marginal distributions
of V and W are identical and symmetric. Show that in the following situations,
the median of the difference is equal to the difference of the individual medians,
by showing first the results stated.
(a) If (V, W) has the same distribution as either (W, V) or (−V, −W), then X is
symmetrically distributed about 0 (and the medians of V and W are equal).
(b) If the distributions of V, W, and X are each symmetric (with finite means), then
the center of symmetry of X is equal to the difference of the centers of symmetry
of W and V.
9. For a set of n independent observations from a population which is continuous and
symmetric about zero, show that the signs of the signed ranks are mutually independent
and each is equally likely to be positive or negative. Show that the signed
ranks themselves are dependent if the original order of the observations is retained.
10. Show that T+ and T− have identical null distributions.
11. (a) If u_n(t) denotes the number of subsets of the first n integers whose sum is equal
to t, show that

    u_n(t) = u_{n-1}(t) + u_{n-1}(t − n)

for all t = 0, 1, ..., n(n + 1)/2 and all positive n, with the following initial
and boundary conditions:

    u_n(t) = 0 for all t < 0
    u_0(0) = 1
    u_0(t) = 0 for all t ≠ 0
    u_n(t) = 0 for t > n(n + 1)/2.

This provides a simple recursive method for generating the frequencies of
values of T+ and hence the null probability function p_0(t) = P(T+ = t) for
samples of size n using p_0(t) = u_n(t)/2^n.
12. Use the recursive method developed in Problem 11 to generate the complete
null distribution of T+ for all n ≤ 4. Check your results against Table D.
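A Python sketch of the recursive method of Problems 11 and 12, assuming the recursion u_n(t) = u_{n-1}(t) + u_{n-1}(t − n) as reconstructed above:

    # Sketch: null distribution of T+ via the subset-count recursion
    # u_n(t) = u_{n-1}(t) + u_{n-1}(t - n), with u_0(0) = 1.
    def null_dist_T_plus(n):
        u = [1]                        # u_0: one empty subset, with sum 0
        for m in range(1, n + 1):
            new = [0] * (len(u) + m)
            for t, cnt in enumerate(u):
                new[t] += cnt          # subsets not containing m
                new[t + m] += cnt      # subsets containing m
            u = new
        total = 2 ** n
        return [cnt / total for cnt in u]   # p_0(t) for t = 0, ..., n(n+1)/2

    print(null_dist_T_plus(4))
    # counts over 16: 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1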
*13. Define u_n(t) as in Problem 11 and let u(t) be the number of subsets of all positive
integers whose sum is equal to t. Show that
(a) u_n(t) = u(t) for t ≤ n.
(b) The number of subsets of all positive integers with sum t and maximum m
is u_{m-1}(t − m).
(c) u_n(t) = u(t) − Σ_{i≥0} u_{n+i}(t − n − 1 − i), where the sum actually terminates
because the summand vanishes for i > t − n − 1. (Hint: The terms in the
sum count the subsets with sum t and maximum n + 1 + i.)
(d) u_n(t) = u(t) − u^{(1)}(t − n − 1) for t ≤ 2n + 1, where u^{(1)}(s) = Σ_{i≥0} u(s − i).
(e) u_n(t) = u(t) − u^{(1)}(t − n − 1) + Σ_{i≥0} Σ_{j≥0} u_{n+i+j}(t − 2n − 2 − 2i − j).
(f) u_n(t) = u(t) − u^{(1)}(t − n − 1) + u^{(2)}(t − 2n − 2) for t ≤ 3n + 2, where u^{(2)}(s)
= Σ_{i≥0} u^{(1)}(s − 2i).
(g) u_n(t) = Σ_{k≥0} (−1)^k u^{(k)}(t − kn − k), where u^{(0)}(s) = u(s) and
u^{(k)}(s) = Σ_{i≥0} u^{(k-1)}(s − ki).
Both sums terminate because u^{(k)}(s) = 0 for s < 0. (For a complete, formal
proof, it may be convenient to introduce u_n^{(k)}(t) = Σ_{i≥0} u_{n+i}^{(k-1)}(t − ki) and
prove by induction on k that u_n^{(k)}(t) = u^{(k)}(t) − u_n^{(k+1)}(t − n − 1).)
(h) All of the foregoing equalities hold for U_n(t) = Σ_{i=0}^t u_n(i) = the number of
subsets of {1, 2, ..., n} with sum at most t, if u^{(k)} is replaced by U^{(k)}, where
U^{(k)}(t) = Σ_{i≥0} U^{(k-1)}(t − ki) and U^{(0)}(t) = U(t) = Σ_{i=0}^t u(i) = u^{(1)}(t). How
is (b) interpreted in this case?
(i) The null probability function and c.d.f. of T+ satisfy
Note: Instead of tabulating the null distribution directly, one could tabulate the
functions u^{(k)} (for point probabilities) or U^{(k)} (for tail probabilities). The total
number of lower tail probabilities less than 0.5 for sample sizes 1, ..., n is
[[n(n + 1)(n + 2)/12]] − [[(n + 2)/4]], where [[x]] denotes the largest
integer not exceeding x. The number of function values U^{(k)}(t) required to
cover the same range is [[(n + 4)/4]]{[[n(n + 1)/4]] − [[n/4]](n + 1)/2}.
For large n, this is about 3/8 as large a number, and the values of t covered are covered
for all sample sizes. The tabulation required could be reduced still further by
omitting U^{(k)} for alternate values of k (odd or even) and using U^{(k-1)}(s) =
U^{(k)}(s) − U^{(k)}(s − k). Tables of the functions u^{(k)} and U^{(k)} are easily generated
for successive k from their definitions and a table of u. The function u is a well-studied
partition function and is tabled in National Bureau of Standards
[1964, Table 24.5], where further references may be found. It can be generated
recursively (without need for u_n(t)) from the nontrivial relation
    P(T+ = t) = Σ_{k=0}^{n} (n choose k) 2^{-n} P(T+ = t | S = k)
where S denotes the number of positive observations. This representation might be
useful for systematic generation of the null distribution of the signed-rank statistic.
Further, P(T+ = t | S = k) is a null probability for the Wilcoxon two-sample
statistic (covered in Chap. 5) where the positive observations are interpreted as
from one sample and the negative from another. Problem 17 gives further insight
into the relationship between the one-sample and two-sample statistics.
17. Let D_1, ..., D_N be a sample of N nonzero observations and define X_i or Y_i for each
i by

    D_i = X_i if D_i > 0, D_i = −Y_i if D_i < 0.

18. Show that

    T = Σ_{1≤i≤j≤n} sign(X_i + X_j)

where

    sign(X) = 1 if X > 0, sign(X) = −1 if X < 0.
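The representation in Problem 18 (as reconstructed here) can be checked numerically; a Python sketch with illustrative data:

    # Sketch: T = T+ - T- equals the double sum of sign(X_i + X_j) over i <= j.
    # Data are illustrative; no ties among absolute values, no zeros.
    def signed_rank_T(xs):
        order = sorted(range(len(xs)), key=lambda i: abs(xs[i]))
        return sum((r if xs[i] > 0 else -r) for r, i in enumerate(order, 1))

    def sign_sum(xs):
        sgn = lambda v: 1 if v > 0 else -1
        n = len(xs)
        return sum(sgn(xs[i] + xs[j]) for i in range(n) for j in range(i, n))

    x = [0.3, -0.8, 0.4, 0.6, -0.2]
    print(signed_rank_T(x), sign_sum(x))   # both give 3 for these data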
19. (a) Show that the possible values of T = T+ - T- are alternate integers between
-n(n + 1)/2 and n(n + 1)/2 inclusive.
(b) For what values of n are the possible values of T even?
20. (a) Show that the continuity correction to the approximate normal deviate for
the Wilcoxon test is [6/n(n + 1)(2n + 1)]^{1/2}.
(b) Show that this correction is less than 0.02 if (and only if) n ≥ 20, less than 0.01
if (and only if) n ≥ 31.
(c) Show that the corresponding values for the sign test (of an hypothesized
median) are 1/n^{1/2}, 2501, and 10001.
*21. Show that T+ and T- are asymptotically normal by using the fact that T is
asymptotically normal.
*22. Show that Σ_j R_j, as defined in Sect. 3.2, satisfies the Liapounov criterion.
23. Show that the standardized statistics [T+ − E(T+)]/√var(T+) and [T− −
E(T−)]/√var(T−) are identical in absolute value as long as the means and variances are
calculated under the same distribution.
24. Verify the moments of the T_{ij} which are given in the proof of Theorem 3.2.
25. Under the null hypothesis of symmetry about 0,
(a) Show that the probabilities defined in Theorem 3.2 have the values
p_1 = 1/2, p_2 = 1/2,
(b) Verify that the expressions given in (3.12) and (3.15) for the mean and variance
of T+ reduce correctly to (3.5) and (3.6).
26. Use the method of Problem 1 of Chap. 1 to show that 2T+/n^2 is a consistent
estimator of p_2.
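A simulation sketch of this consistency claim, with an illustrative shifted-normal alternative for which p_2 = P(X_i + X_j > 0) is known in closed form:

    # Sketch: 2*T+/n^2 approximates p2 = P(X_i + X_j > 0).
    # The shifted-normal data are illustrative.
    import random
    from statistics import NormalDist

    random.seed(2)
    mu = 0.5
    x = [random.gauss(mu, 1.0) for _ in range(400)]

    order = sorted(range(len(x)), key=lambda i: abs(x[i]))
    t_plus = sum(r for r, i in enumerate(order, 1) if x[i] > 0)

    p2_hat = 2 * t_plus / len(x) ** 2
    p2 = NormalDist().cdf(2 * mu / 2 ** 0.5)   # X_i + X_j is N(2*mu, 2)
    print(round(p2_hat, 3), round(p2, 3))      # close for large n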
27. (a) Consider the sign test of Chap. 2 for an hypothesized median based on an
independent random sample. Against what alternatives is it consistent?
(b) Give an example of an alternative against which the sign test is consistent but
the Wilcoxon test is not, and vice versa.
28. Show that the Wilcoxon test based on an independent random sample from a
symmetric distribution is consistent against shift alternatives.
*29. Suppose that |E(Z_n)| < B and var(Z_n) ≤ B for all n and all null distributions.
Use Chebyshev's inequality to show that an equal-tailed test based on Z_n is consistent
against any alternative under which |E(Z_n)| → ∞ and var(Z_n) is bounded.
30. Show that if X_1, ..., X_n are independent with distributions which are continuous
and symmetric about μ_0, then a Wilcoxon test for center of symmetry μ_0 has the
same level irrespective of whether the distributions are identical or not.
31. If the conditional distribution of X_1 given X_2, ..., X_n is symmetric about 0,
show that the conditional probability that X_1 > 0 equals the conditional probability
that X_1 < 0, given the signs of X_2, ..., X_n.
32. Consider a matched-pairs experiment with the null hypothesis that the treatment
actually has no effect. Show that randomization validates the null distribution
of the Wilcoxon test statistic defined on treatment-control differences.
33. (a) Let X have continuous c.d.f. F and let G(x) = 1/2 + (1/2)[F(x) − F(−x)]. Show
that G is the c.d.f. of a symmetric distribution and is stochastically larger than
F if and only if P(X < −x) ≥ P(X > x) for all x ≥ 0.
(b) Generalize (a) to allow discrete distributions.
*34. (a) If X has c.d.f. F and F(x) ≥ G(x) for all x, show that there exists a random
variable Y with c.d.f. G such that P(Y ≥ X) = 1.
(b) If X_1, ..., X_n are independent, X_j has c.d.f. F_j and F_j(x) ≥ G_j(x) for all x and j,
show that there exist independent random variables Y_j such that Y_j has c.d.f.
G_j and P(Y_j ≥ X_j) = 1 for all j.
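One standard construction for (a), offered here only as a sketch and not necessarily the intended solution, is the quantile coupling Y = G^{-1}(F(X)); the exponential example below is ours:

    # Sketch: quantile coupling.  If F(x) >= G(x) for all x, then
    # Y = G^{-1}(F(X)) has c.d.f. G and satisfies Y >= X.
    # Exponential example: F has rate 2, G has rate 1, so F >= G pointwise.
    import math, random

    F = lambda x: 1 - math.exp(-2 * x)       # c.d.f. of X
    G_inv = lambda u: -math.log(1 - u)       # inverse c.d.f. of Y (rate 1)

    random.seed(3)
    ok = True
    for _ in range(10_000):
        x = G_inv(random.random()) / 2       # draw X ~ F by inversion
        y = G_inv(F(x))
        ok &= (y >= x)
    print(ok)                                # True: Y >= X on every draw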
35. Use Theorem 3.3 to show that a suitable one-tailed Wilcoxon test rejects with
probability at most α under (3.21) and at least α under (3.22).
*36. Let X_1, ..., X_n be independent observations on a continuous distribution which is
symmetric about μ.
(a) Show that for any n ≥ 3, we have
*37. Consider the Walsh averages W_{ij} = (X_(i) + X_(j))/2, for i ≤ j, defined in terms of
the order statistics X_(1), ..., X_(n) of a set of n observations.
(a) Note that always W_{11} ≤ W_{12} ≤ all other W_{ij}. What other inequalities always
hold between W_{ij} with i ≤ j ≤ 4?
(b) Recall that the three smallest W_{ij} are W_{11}, W_{12}, and min{W_{22}, W_{13}}. Which
W_{ij} can be fourth smallest for some data sets? (There are three possibilities.)
(c) Show that the minimum possible rank of W_{ij} among the Walsh averages is
i(2j − i + 1)/2.
(d) Show that the maximum possible rank of W_{1j} among the Walsh averages
is j(j − 1)/2 + 1. What is the maximum possible rank of W_{ij} for i ≥ 2?
(e) Which W_{ij} can be fifth smallest for some data sets (four possibilities)? Sixth
smallest (six possibilities)?
(f) Show that the fourth smallest W_{ij} is min{W_{14}, max(W_{22}, W_{13})} = max{W_{13},
min(W_{22}, W_{14})}.
(g) Show that the fifth smallest W_{ij} is min{W_{15}, W_{23}, max(W_{14}, W_{22})} =
max{min(W_{14}, W_{23}), min(W_{15}, W_{22})}.
(h) Show that the fourth and fifth smallest W_{ij} are confidence bounds corresponding
to one-sided Wilcoxon tests at levels α = 5/2^n and 7/2^n for n ≥ 5.
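A Python sketch for experimenting with Problems 37 and 47: it enumerates the Walsh averages in sorted order, from which the ordered W_{ij} and the corresponding confidence bounds can be read off (data illustrative):

    # Sketch: enumerate the Walsh averages W_ij = (X_(i) + X_(j))/2, i <= j,
    # in sorted order.  The (k+1)th smallest serves as a confidence bound for
    # the center of symmetry at the level of the corresponding Wilcoxon test.
    def walsh_averages(xs):
        s = sorted(xs)
        n = len(s)
        return sorted((s[i] + s[j]) / 2 for i in range(n) for j in range(i, n))

    x = [0.3, -0.8, 0.4, 0.6, -0.2, 1.0, 0.9]
    w = walsh_averages(x)
    print(len(w))        # n(n+1)/2 = 28 averages for n = 7
    print(w[:4])         # the four smallest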
38. Show that the modified Wilcoxon test statistic T_0^- of Sect. 5 has the same null
distribution in a sample of size n as T− in a sample of size n − 1.
39. For a sample with neither zeros nor ties, show that
(a) T+ = T_0^+ + S where S is the number of positive observations. (This result
relates the Wilcoxon statistic, the modified Wilcoxon statistic and the sign
test statistic.)
(b) T_0^+ is the number of positive Walsh averages (X_i + X_j)/2 with i < j.
40. Verify the expressions given in (5.1) and (5.2) for the mean and variance of T_0^+.
41. Show that 2T_0^+/n(n − 1) is the minimum variance unbiased estimator of p_2 for
the family of all continuous distributions.
42. With the definitions of Sect. 3.3, show that p_4 ≤ p_2 for all distributions.
*43. (a) Show that the suggestions following (5.3) lead to inequalities of the form
(p̂_2 − p_2)^2 ≤ C + 2Bp_2 − Ap_2^2, where p̂_2 = 2T_0^+/n(n − 1) and A, B, C are
nonnegative constants.
(b) Show that the corresponding confidence region is an interval with endpoints

    {p̂_2 + B ± [(p̂_2 + B)^2 − (p̂_2^2 − C)(1 + A)]^{1/2}}/(1 + A),

except that it is empty if the quantity in brackets is negative (which is impossible
if p_4 is replaced by the upper bound p_2 and extremely unlikely if it is
estimated as described following (5.3)).
*44. Assuming that the asymptotic distribution of [T_0^+ − E(T_0^+)]/√var(T_0^+) is
standard normal, show that P{|T_0^+ − n(n − 1)p_2/2| ≤ z√V} → 1 − 2α if z is
the upper α point of the standard normal distribution and V is an estimator of
var(T_0^+) satisfying V/n^3 → p_4 − p_2^2 in probability. This provides an asymptotic
confidence interval for p_2.
45. For the data in Sect. 2, verify the P-values, confidence bounds and confidence
levels given in Sect. 5 for the Wilcoxon and modified Wilcoxon procedures.
46. Show that T_0^- = 0 if and only if T− ≤ 1. Thus the modified Wilcoxon test with
critical value 0 is equivalent to the ordinary Wilcoxon test with critical value 1.
(Otherwise the tests are not equivalent.)
*47. Modifying Problem 37, consider only those Walsh averages W_{ij} with i < j.
(a) Show that the smallest three W_{ij} with i < j are W_{12}, W_{13}, and min{W_{23}, W_{14}}.
(b) Show that the minimum and maximum possible ranks of the W_{ij} among
those with i < j are the same as those of W_{i,j-1} in Problem 37.
(c) Show that the formulas for the ordered W_{ij} in Problem 37 apply here if the
second subscript is increased by 1 throughout. For instance, (a) is so related
to 37(b). Similarly, 37(f) gives that the fourth smallest W_{ij} among those with
i < j is min{W_{15}, max(W_{23}, W_{14})} = max{W_{14}, min(W_{23}, W_{15})}.
(d) Show that the first five ordered W_{ij} with i < j are confidence bounds corresponding
to one-sided modified Wilcoxon tests at levels 2/2^n, 4/2^n, 6/2^n, 10/2^n,
and 14/2^n respectively for n ≥ 6. In particular, T_0^- ≤ 0, 1, or 2 respectively if
and only if 0 < W_{12}, W_{13}, or min{W_{23}, W_{14}}.
48. For the data in Sect. 2, use procedures corresponding to the tests based on T+
and T_0^+ to find upper and lower confidence bounds for μ, each at level approximately
0.025, by applying the methods of interpolation between attainable levels
(explained in Sect. 5, Chap. 2).
*49. In order to investigate the effect of interpolating halfway between two adjacent
order statistics of a random sample to find a confidence bound for the population
median μ, note that the true level of the interpolated confidence bound is

    P[(X_(i) + X_(i+1))/2 > μ] = P(X_(i) > μ) + pP(X_(i) ≤ μ < X_(i+1))
                              = (1 − p)P(X_(i) > μ) + pP(X_(i+1) > μ)

where

    p = P(X_(i+1) − μ > μ − X_(i) | X_(i) ≤ μ < X_(i+1)).

Linear interpolation approximates p by 1/2. Show that, for a continuous, symmetric
population,

    p = P(R_1 = −1 | S^- = i) = i/n

where R_1 is the first signed rank and S^- is the number negative among the values
X_j − μ. Thus, linear interpolation overestimates the error probability (for one-tail
probabilities below 0.5).
*50. (a) In order to investigate the effect of interpolating between the two smallest
(or largest) Wilcoxon confidence limits, show that for n observations on a
density f,

    P(r) = P[(1 − r)X_(1) + r(X_(1) + X_(2))/2 > 0]
         = n(n − 1) ∬ f(x)f(y)[1 − F(y)]^{n-2} dx dy

where the region of integration is (1 − r/2)x + ry/2 > 0, x < y. Show that
for 0 ≤ r ≤ 1, this reduces to

    P(r) = [1 − F(0)]^n for r = 0,
    P(r) = 2/2^n for r = 1 and f symmetric about 0.

The accuracy of linear interpolation depends on how linear the integral is,
and hence on the behavior of F[−ry/(2 − r)], as a function of r, 0 ≤ r ≤ 1.
(b) Show that, for the uniform distribution, F[−ry/(2 − r)] is a concave function
of r in the relevant range, and hence P(r) is convex.
(c) Show that, for the standard normal distribution, F[−ry/(2 − r)] is a concave
function of r for 0 ≤ r ≤ 2 − y^2, and hence for 0 ≤ r ≤ 1 and 0 ≤ y ≤ 1.
(Values of y > 1 contribute relatively little to the integral above, since both
the first and last terms in the integrand decrease rapidly as y increases above 1.)
These results suggest that the tail probability P(r) tends to be convex in r and
hence to be overestimated by linear interpolation.
51. Show that, if L is the (k + 1)th smallest Walsh average of a sample from a distribution
which is symmetric about μ, then P(L ≤ μ) ≥ 1 − α ≥ P(L < μ), where
1 − α is the exact confidence level in the continuous case. (Hint: What confidence
region corresponds to the randomization method of breaking ties?)
52. (a) Show that the null mean and variance of T+, based on n nonzero observations
with ties and calculated using the average rank procedure, conditional on the
ties observed, are
55. Show that 0 is an endpoint of the usual Wilcoxon confidence interval if and only
if the ordinary Wilcoxon test would reject by one method of breaking the ties but
not by another.
56. Verify that the (k + 1)th smallest Walsh average is 0 for 7 ≤ k ≤ 11 for the data
in (6.2).
57. Verify the results given after Equation (6.5) for the left tail of the null distribution
of T− for n = 8 by the average rank procedure, given a tie at ranks 1, 2, 3, and 4.
tion is increased by the same small amount. The corresponding confidence
region for μ is an interval with the interior point removed.
62. Show that for the requirements to avoid anomalies in the presence of zeros and
ties given in Sect. 6.5, condition (i)(b) holds if and only if condition (ii) holds.
*63. Show that, for the procedure of breaking the ties at random, either actually using
standard tables or reporting the probability with which doing so would lead to
rejection, conditions (i), (ii), and (iii) and also (i') of Sect. 6.5 hold.
64. In order to show that, despite intuition, larger observations are not always greater
evidence of positivity, consider the density

    f(x) = 0.4 for −1 < x < 0,
    f(x) = 0.8 − 0.4x for 0 ≤ x < 1,
    f(x) = 0 otherwise.

(a) Show that this is a "positive" density, since f(x) > f(−x) for 0 < x < 1 and
f(x) ≥ f(−x) for all x > 0.
(b) Show that if a sample of size 2 is drawn from a population with this density,
the signed ranks 1, - 2 are more likely than the signed ranks -1,2.
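Part (b) can be checked by simulation. The following Monte Carlo sketch (our own inversion of the c.d.f., not from the text) estimates the probabilities of the two sign patterns:

    # Sketch: Monte Carlo check of Problem 64(b) for the density
    # f(x) = 0.4 on (-1,0), f(x) = 0.8 - 0.4x on [0,1).
    import random

    def draw():
        u = random.random()
        if u <= 0.4:                      # the uniform piece on (-1, 0)
            return u / 0.4 - 1.0
        d = 0.64 - 0.8 * (u - 0.4)        # invert the c.d.f. on [0, 1)
        return (0.8 - d ** 0.5) / 0.4

    random.seed(1)
    n12 = n21 = 0
    for _ in range(200_000):
        x, y = draw(), draw()
        lo, hi = sorted((x, y), key=abs)  # smaller, larger in absolute value
        if lo > 0 > hi:
            n12 += 1                      # signed ranks 1, -2
        elif lo < 0 < hi:
            n21 += 1                      # signed ranks -1, 2
    print(n12 / 200_000, n21 / 200_000)   # approx. 0.267 versus 0.213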
65. (a) Show that increasing one observation in a sample may decrease the ordinary t
statistic.
(b) More generally, show that t is a decreasing function of X_1 for X_1 > Σ_2^n X_i^2/Σ_2^n X_i
if Σ_2^n X_i > 0, and that t → 1 as X_1 → ∞.
*66. Show that, if the Wilcoxon statistic is computed for each possible way of breaking
the ties (as in Sect. 6.3), the simple average of the resulting values is numerically
equal to the statistic obtained by the average rank and signed-rank zero procedures.
68. For the data 0, 0, −2, −3, −5, 6, 9, 11, 12, 15, 16, and the negative-rank sum as test
statistic,
(a) Show that, by the signed-rank zero procedure, the exact P-value is 23/2^9 and the
next P-value is 19/2^9.
(b) Show that, by the reduced sample procedure, the exact P-value is 14/2^9 and
the next P-value is 10/2^9.
(c) If the zeros are given signed ranks ± 1, ±2, what are the possible P-values?
(d) Do the results in (c) agree or disagree with those in (a) and (b)?
*69. The modified Wilcoxon procedure of Sect. 5 agrees with the reduced sample
procedure for the data given in (6.7) and in Problem 61. Why is the modified
procedure not subject to the same objections?
*70. Construct examples showing that, for the modified Wilcoxon procedure of Sect. 5,
(a) If ties are handled by the average rank procedure, neither condition (i)(a)
nor (iii) of Sect. 6.5 need hold.
(b) If zeros are handled by the reduced sample procedure, none of the conditions
(i)-(iii) of Sect. 6.5 need hold.
71. Show that the Wilcoxon statistic, calculated using the average rank procedure and
including the zeros in the ranking but giving them signed-rank zero, can be written
as

    T = Σ_{i≤j} sign(X_i + X_j).
*72. Show that the one-tailed Wilcoxon test in the appropriate direction using the
average rank and signed-rank zero procedures for ties and zeros is consistent
against alternatives for which the X_i are independent and P(X_i + X_j > 0) −
P(X_i + X_j < 0) is at least some fixed amount for all i ≠ j, while the "conservative"
procedure is not.
73. A manufacturer of suntan lotion is testing a new formula to see whether it provides
more protection against sunburn than the old formula. Ten subjects are chosen.
The two types of lotion are applied to the back of each subject, one on each side,
randomly allocated. Each subject is then exposed to a controlled but intense
amount of sun. Degree of sunburn was measured for each side of each subject, with
the results shown below (higher numbers represent more severe sunburn; the
columns are presumably the old and new formulas, in that order).

    Subject    Old formula    New formula
       1           41             37
       2           42             39
       3           48             31
       4           38             39
       5           38             34
       6           45             47
       7           21             19
       8           28             30
       9           29             25
      10           14              8
(a) Test the null hypothesis that the difference of degree of sunburn is symmetrically
distributed about 0, against the one-sided alternative that the new formula
is more effective than the old. Use a Wilcoxon test at level 0.05, handling ties by
the average rank procedure and using Table D as an approximation.
(b) Compute the exact P-value by generating the appropriate tail of the distribu-
tion using average ranks.
(c) Find the range of P-values which results when the ties are broken.
(d) Do (b) and (c) of this problem always lead to the same decision when α = 0.05?
Find the range of α for which the decisions are the same.
(e) Find a 90% upper confidence bound for the median difference assuming that
the distribution of differences is symmetric.
74. For the data given in Problem 73, use the sign test procedure of Chap. 2 to
(a) Find the P-value for testing the null hypothesis that the median difference is O.
(b) Find an upper confidence bound at level 0.90 for the median difference.
75. The Johnson Rod Mill Company produces steel rods. When the process is operat-
ing properly, the rods have a median length of 10 meters. A sample of 10 rods,
randomly selected from the production line, yielded the results listed below.
9.8, 10.0, 9.7, 9.9, 10.0, 10.0, 9.8, 9.7, 9.8, 9.9
Does the process seem to be operating properly? How would you recommend
handling the ties?
76. The Brighton Steel Works orders a certain size casting in large quantities. Before
the castings can be used, they must be machined to a specified tolerance. The
machining is either done by the company or is subcontracted, according to the
following decision rule:
"If average weight of casting exceeds 25 kilograms, subcontract the order for
machining. If average weight of castings is 25 kilograms or less, do not subcontract."
The company developed this decision rule in an effort to reduce costs, because the
weight of a casting is a good indication of the amount of machining that will be
necessary while the cost of subcontracting the castings is a function of the number
of castings to be machined rather than the amount of machining required by each
casting.
The following data are for a random sample taken from a lot of 100 castings.
    Casting:    1     2     3     4     5     6     7     8
    Weight:    24.3  25.8  25.4  24.8  25.2  25.1  25.5  24.6
(a) What decision is suggested by the Wilcoxon signed-rank test at level 0.05?
(b) What assumption of the Wilcoxon test is critical here?
(c) What do you think of this method of making a decision?
77. The manufacturers of Fusion, "a new toothpaste for the post-atomic age," hired
Anonymous Unlimited, an independent research organization, to test their
product. Anonymous Unlimited induced children to go to the dentist and have
their cavities counted and filled, and then to switch from their regular brand to
Fusion. A year later they went to the dentist again. Advertisements blared the
astounding news: 87.5% had fewer cavities.
The actual data were as follows:
Apply to these data the statistical methods you consider most applicable and
comment on your choice of methods. What conclusions can be drawn from the
experiment, under what assumptions, and with what reservations? How could the
experiment have been improved (without changing its scope greatly)? Be brief.
78. A sail-maker wanted to know whether the sails he makes of dacron are better than
the sails he makes of cotton. He made 5 suits of dacron sails and 5 suits of cotton
sails, all for a certain type of boat. He obtained 10 boats of this type, labeled A, B, ...,
J, and had them sail in two races. He picked 5 of the 10 boats at random; these 5
(they were A, C, E, G, and H) used dacron sails in the first race and cotton sails
in the second race. The other five (B, D, F, I, and J) used cotton sails in the first race
and dacron sails in the second. The order of finish in the first race was C, H, A, J,
B, E, I, F, G, D; in the second race it was A, B, H, J, I, C, D, F, E, G. Analyze these
results to shed light on the sail-maker's question.
79. Generalize the relations in (3.1)-(3.3) among T, T+, and T− so that they apply to
sums of all, positive, and negative signed constants c_k.
80. Represent the general test statistic based on the sum of signed constants as the
linear combination

    T′ = Σ_k c_k S_k

where S_k denotes the sign of the observation with rank k in absolute value. Under
the null hypothesis that the observations are independently, continuously distributed,
symmetrically about 0, show that
(a) E(T′) = 0
(b) var(T′) = Σ_k c_k^2
(c) T′ is symmetrically distributed about 0.
*81. Show that the signed ranks and the signs of the Walsh averages have the relation-
ship stated in Theorem 7.1.
*82. In a sample of size 10, suppose that the Walsh averages (X_i + X_j)/2 are negative
if both i and j are odd, and positive otherwise. What could be the signed ranks of
X_1, ..., X_10?
83. Show that, if c_{k+1} ≥ c_k ≥ 0 for all k, the test based on the sum of the signed c_k's
satisfies the hypothesis of Theorem 3.3 (increasing an observation never decreases
the probability of rejection φ(X_1, ..., X_n)).
*84. Consider n observations X_i such that there are no ties in the Walsh averages.
(a) Show that, as μ increases, the signed ranks of the centered observations
X_i − μ change only when μ equals some Walsh average (X_i + X_j)/2.
(b) Show that, in the situation described in (a), the only changes are as follows.
If i = j, the signed rank of X_i changes from 1 to −1. If i ≠ j and X_i < X_j, the
signed rank of X_i changes from −(k + 1) to −(k + 2) and that of X_j from
(k + 2) to (k + 1), where k is the number of observations between X_i and X_j.
(c) Show that the sum of negative constants c_k increases by c_1 if i = j and by
(c_{k+2} − c_{k+1}) if i ≠ j, while the sum of positive c_k's decreases by the same
amount, and the sum of signed c_k's decreases by twice this amount.
85. Show that the confidence region for the population center of symmetry μ corresponding
to a one- or two-sided test based on a sum of signed c_k's is an interval
if c_{k+1} ≥ c_k ≥ 0 for all k.
86. Show that, if c_k = 0 for k ≤ n − m and c_k = 1 for k > n − m, then
(a) The test of Sect. 7.1 is equivalent to carrying out a sign test on the m observations
largest in absolute value.
(b) The corresponding confidence limits are (X_(k+1) + X_(n-m+k+1))/2 and
(X_(n-k) + X_(m-k))/2, where k and m − k are the lower and upper critical
values for the number of negative observations in a sample of m. (Noether
[1973] suggests these confidence limits for easy calculation and studies their
efficiency.)
87. Show that, if c_k = 0 for k ≤ n − m and c_k = k + m − n for k > n − m, then
(a) The test of Sect. 7.1 is equivalent to carrying out a Wilcoxon signed-rank test
on the m observations largest in absolute value. (This fact could be exploited to
reduce tabulation.)
(b) The corresponding confidence bounds are the (t + 1)th smallest and largest
among those Walsh averages (X_(i) + X_(j))/2 with j − i ≥ n − m, where t is
the critical value of the negative rank sum in a sample of m. (Hint: Show that the
rank of |X_(i)| among the m largest in absolute value is the number of negative
X_(i) + X_(j) with j − i ≥ n − m if X_(i) < 0.)
92. Show that multiplying all c_k's by the same positive constant has no effect on a test
based on the sum of signed c_k's. What about a negative constant? What about
adding a constant to all c_k's?
93. Given a procedure φ, let ψ consist of applying φ after permuting the observations
randomly. Show that ψ is a permutation-invariant procedure.
*94. In a testing situation where permutation invariance is applicable (every permutation
π carries null into null and alternative into alternative distributions), show that
(a) If φ is uniformly most powerful (at level α), then φ has the same power against F
and Fπ for all alternatives F and all permutations π, and there is a permutation-invariant
test which is uniformly most powerful.
(b) The statement in (a) remains true when the words "most powerful" are
replaced by "most powerful unbiased."
(c) The envelope power is the same at F and Fπ for all F and π, and if there is a
most stringent test (at level α), then there is a most stringent test which is
permutation invariant, where we use the following definitions. Let θ index the
alternative distributions, let α(θ; φ) be the power of φ against θ, let the "envelope
power" α*(θ) be the maximum of α(θ; φ) over all φ at level α, and let b(φ)
be the maximum of α*(θ) − α(θ; φ) over all θ (the maximum shortfall of φ);
φ* is "most stringent" if it minimizes b(φ) among tests φ at level α.
(d) If there is a uniformly most powerful invariant test, then it is a most stringent
test.
(e) What properties of the permutations as a class of transformations are significant
for these results, and how?
*95. Let S be a set of transformations of the observations which are one-to-one and onto.
Suppose that, in a testing problem, all transformations in S carry null into null
and alternative into alternative distributions.
(a) Show that the same is true of all transformations in the group G generated by S
(under composition).
(b) Let θ index the possible distributions of the observations. Show that S and G
induce sets S̄ and Ḡ of transformations of θ, and that Ḡ is the group generated
by S̄.
(c) Show that Ḡ is homomorphic, but not necessarily isomorphic, to G.
101. (a) Show that the most powerful signed-rank test against F is of the form (9.1) if
all combinations of signed ranks are equally likely under the null hypothesis.
(b) Explicitly, how is k determined and what happens when P_F(r_1, ..., r_n) = k?
102. Verify formula (9.2) for the probability of signed ranks r_1, ..., r_n in a sample from a
density f_θ.
103. Show that |X|_(j) in (9.4) is the jth order statistic in a sample of n from the cumulative
distribution 2F_0 − 1.
104. Show that a test of the form (9.6) is equivalent to one based on a sum of signed
constants (9.7).
*106. Show that the locally most powerful signed-rank test against the Laplace family
of alternatives

    F_θ(x) = (1/2)e^{x−θ} for x ≤ θ, F_θ(x) = 1 − (1/2)e^{−(x−θ)} for x > θ,
    f_θ(x) = (1/2)e^{−|x−θ|},

is equivalent to the sign test of Chap. 2.
*107. Let c_j = E[log(1 + U_j) − log(1 − U_j)] where U_j has a beta distribution with
parameters j and n − j + 1. Show that a test based on the sum of signed constants
c_j is a locally most powerful signed-rank test of θ = 0 for every Lehmann family of
alternatives F_θ = F^{1+θ}, where F is the c.d.f. of a continuous distribution symmetric
about 0.
*108. Obtain the sign test by invoking the principle of invariance for a suitable class of
transformations when the observations are not assumed identically distributed.
and for confidence intervals. Thus closed confidence regions are conservative
for discrete distributions.)
(e) Show that the confidence bounds corresponding to the Wilcoxon signed-rank
test satisfy the hypotheses of (d).
(f) Show that the statement in (e) also holds for tests based on sums of signed
constants c_k with 0 ≤ c_1 ≤ c_2 ≤ ··· ≤ c_n. (See Problem 85.)
(g) Show that, in the hypothesis of (d), continuous distributions can be replaced by
distributions having densities.
CHAPTER 4
One-Sample and Paired-Sample
Inferences Based on the
Method of Randomization
1 Introduction
The signed-rank tests of the previous chapter rely on the fact that all assign-
ments of signs to the ranks of the absolute values of independent observations
are equally likely under the null hypothesis of distributions symmetric
about zero, or about some arbitrary point which is subtracted from each
observation before ranking. The same fact is true also for the observations
themselves, not merely for their ranks, and the idea underlying these tests
can be applied to any function of the observations, not merely to a function
of the signed ranks.
More specifically, consider any function of sets of sample observations,
and the possible values of this function under all assignments of signs to a
given set of observations, or equivalently, to their absolute values. Dividing
the frequency of each possible value of the function by the total number of
possible assignments of signs to the given observations generates a frequency
distribution called the randomization distribution. In general, this distribu-
tion depends on the given set of observations through their absolute values.
As its name indicates, the randomization distribution is derived merely
from the conditions that, given the absolute values of the n observations,
all 2^n possible randomizations of the signs, each either + or −, attached to
the n absolute values, are equally likely to occur. As discussed in the previous
chapter, this condition holds under the null hypothesis of symmetry about
zero. In the paired-sample case it follows from the physical act of randomiza-
tion under the null hypothesis of no treatment effect whatever.
There are many interesting tests based on such a randomization distribu-
tion. Any statistic could be used as the test statistic, and any such test is
called a randomization test. If the value of the test statistic is determined by
the signed ranks, as for the sign test or the Wilcoxon signed-rank test, the
test is a signed-rank test as defined in Chap. 3. A signed-rank test may also
be called a rank-randomization test, for contrast with more general ran-
domization tests. Similarly, an arbitrary randomization test may be called
an observation-randomization test to indicate that the value of the test
statistic is determined by the signed observations, as opposed to the signed
ranks. (The terms randomization test, rank-randomization test and observa-
tion-randomization test generalize to the case of more than one sample, as
we will see in Chap. 6.) The randomization distribution of a rank-randomiza-
tion (signed-rank) test statistic does not depend on the particular set of
observations obtained as long as their absolute values are all different.
For an observation-randomization test, however, the randomization dis-
tribution of the test statistic does depend in general on the observations
obtained, specifically on their absolute values. Since the randomization
distribution treats the absolute values as given and depends only on them,
observation-randomization tests are conditional tests, conditional on the
absolute values observed.
The principle of randomization tests is usually attributed to R. A. Fisher;
it is discussed in both of Fisher's first editions [1970, first edition 1925; and
1966, first edition 1935]. Many nonparametric tests are based on this prin-
ciple, as it is easily applied to a wide variety of problems. The randomization
may be a designation of sign, or sample label, or an actual rearrangement of
symbols or numbers. The test criterion is frequently a classical or parametric
test statistic applicable for the same situation, or some monotonic function
thereof which is equivalent for the purpose but simplifies calculations. In all
situations, the randomization distribution derives from the condition that all
possible outcomes of the randomization are equally likely.
Since many randomization distributions are generated by permutations,
randomization tests are frequently called permutation tests. This name will
not be used here, however, because an interpretation of "permutation"
which is broad enough to include designating signs is not natural, and we
have already used this term in discussing "permutation invariant" tests (as
in Sect. 8, Chap. 3). Conditional tests is another possible name. However,
this is insufficiently specific since many statistical tests are conditional tests,
but with different conditioning than here. Another term which appears in
the literature is Pitman tests, since Pitman [1937a, 1937b, 1938] studied
them extensively. The term randomization test could lead to confusion with
randomized tests (Sect. 5, Chap. 1), which are entirely unrelated, but some
nomenclature must be adopted and none is perfect.
In this chapter we will first discuss the one-sample randomization test
based on the sample mean, and the corresponding confidence procedure.
Then we will define the general class of randomization tests for the one-
sample problem, study some properties of these tests, and obtain most
powerful randomization tests. Two-sample observation-randomization tests
will be covered in Chap. 6.
While the presentation here is limited to the one-sample case, all the pro-
cedures are equally applicable to treatment-control differences and in
general to paired-sample observations when the differences of the pairs
are used as a single sample of observations. The hypotheses and any assump-
tions then refer to these differences and their distributions, as do the pro-
perties of the statistical procedures based on them. (See Sect. 7, Chap. 2.) It
is not necessary that a paired-sample randomization test be based on only
these differences, but other tests are seldom needed and will not be discussed
in this book.
2 Randomization Procedures Based on the Sample Mean and Equivalent Criteria

Statistics which are somewhat easier to use than the sample mean X̄ but are equivalent
for the purpose of a randomization test (Problem 1) are the sum of all the
sample observations S = Σ_j X_j, the sum S+ of the positive observations,
and the sum S− of the negative observations with sign reversed (so that
S− ≥ 0). Student's t statistic, calculated in the ordinary way, is also equivalent,
as will be shown below.
The method of generating the randomization distribution and the
equivalence of these test statistics is illustrated in Table 2.1. In practice, of
Table 2.1"
Sample Values: 0.3, -0.8,0.4,0.6, -0.2,1.0,0.9,5.8,2.1,6.1
a These data are from Manis, Melvin [1955], Social interaction and the self concept,
Journal of Abnormal and Social Psychology, 51, 362-370. The X_j are differences
X_j = X_j′ − X_j″, where X_j′ is the decrease in the "distance" between a subject's self-concept
and a friend's impression of that subject after a certain period of time, and
X_j″ is the corresponding decrease for a subject and a nonfriend; the subject-friend
pair was matched with the subject-nonfriend pair according to the value of their
"distance" at the beginning of the time period. Since the nonfriends were roommates
assigned randomly to the subjects, the subjects were expected to have the same
amount of contact with nonfriends as with friends during the time period. Manis'
hypothesis II was that over a given period of time, there will be a greater increase in
agreement between an individual's self-concept and his friend's impression of him
than there will be between an individual's self-concept and his nonfriend's impression.
This hypothesis is supported if the null hypothesis that the X_j are symmetric
about 0 is rejected in favor of a positive alternative. Manis used the Wilcoxon
signed-rank test, which gives a one-tailed P-value of 0.0137 (Problem 2).
course, the values of only one statistic would be calculated; we give details
for several only to make their relationship more intuitive. The first step is to
list the absolute values of the observations to which signs are to be attached.
It is generally easier to predict which assignments lead to the extreme values
if this listing is in order of absolute magnitude (increasing or decreasing).
While there are 2^10 = 1024 different assignments of signs, Table 2.1 enumerates
only those 17 cases which lead to the largest values of X̄, S, or S+, and
the smallest values of S−. S− = 1.0 was observed for these data and Table
2.1 shows that only 16 of the 1024 cases give S− that small; hence the one-tailed
P-value from the randomization test is 16/1024 = 0.0156.
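The count of 16 cases is easy to confirm by full enumeration; a Python sketch using the sample values of Table 2.1:

    # Sketch: enumerate all 2^10 sign assignments for the Table 2.1 data and
    # count those giving S- (sum of negatives, sign reversed) <= 1.0.
    from itertools import product

    a = [0.3, 0.8, 0.4, 0.6, 0.2, 1.0, 0.9, 5.8, 2.1, 6.1]   # |observations|
    hits = sum(
        1
        for signs in product((1, -1), repeat=len(a))
        if sum(v for s, v in zip(signs, a) if s < 0) <= 1.0 + 1e-9
    )
    print(hits, hits / 2 ** len(a))    # 16, 0.015625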
The calculation is most easily performed in terms of S− when X̄ is "large"
as here, and in terms of S+ when X̄ is small. If the same value of S+ or S−
occurs more than once, each occurrence must be counted separately. Although
calculating the complete randomization distribution straightforwardly
would require enumerating 2^n possibilities, this is never necessary
for any randomization test. In order to find P (the P-value), the enumeration
must include only those 2^n P assignments which lead to values as extreme as
that observed in the appropriate direction (that is, less than or equal to an
observed S− or greater than or equal to an observed S+, when S− ≤ S+).
Furthermore, if a nonrandomized test is desired at level α, a decision to
reject (P ≤ α) can be reached by enumerating these 2^n P cases, and a decision
to "accept" by identifying any 2^n α cases as extreme as that observed. A
test decision by direct counting therefore requires enumeration of only
2^n α or 2^n P cases as extreme as that observed, whichever is smaller. To com-
pute the P-value, of course, it is necessary to enumerate every point in the
relevant tail, and it is difficult to select them in the correct order for enumera-
tion except in the very extreme end of the relevant tail. Considerable care is
required if the entire distribution is not calculated. A systematic approach is
to enumerate according to the number of + or - signs (starting with 0).
Clever tricks, such as the "branch and bound" method of mathematical
programming, might reduce the work entailed. Even to calculate the entire
distribution, it is sufficient to enumerate 2^n/2 = 2^{n-1} assignments, since the
randomization distributions of S+, S−, X̄, S and t are all symmetric (Problem
4). This also means that the natural two-tailed test is equal-tailed. Ap-
proximations for use when exact calculation is too difficult will be discussed
in Sect. 2.5.
If the null hypothesis is generalized to state that the observations are
symmetrically distributed about some point μ_0, or the symmetry is assumed
and the null hypothesis is H_0: μ = μ_0, the same procedure can be used but
applied to the observations X_j − μ_0. That is, the randomization distribution
is generated by assigning signs to the |X_j − μ_0|.
Since, given the |X_j|, the statistics S, X̄, S+, and S− are all linearly related,
they provide equivalent randomization test criteria. Although the ordinary
t statistic is not linearly related to these other statistics, it also provides an
equivalent randomization test criterion, as we now show.
We have been assuming that the observations are independent and identically
distributed with a symmetric common distribution under H_0. If the assumptions
are relaxed so that the X_j are not identically distributed but are independent,
As mentioned in Sect. 2.1, the statistics S, X̄, S+, S− and t all provide equivalent
randomization test criteria and all have symmetric distributions,
although only S+ and S− are identically distributed.
The means and variances of these statistics under the randomization
distribution are easily calculated. These moments are of course conditional
on the absolute values of the observations. Since they are considered
constants, we denote the values |X_1|, ..., |X_n| by a_1, ..., a_n. Then for
S = Σ_j X_j (Problem 11), for instance, we have

    E(S) = 0    (2.2)
    var(S) = Σ_j a_j^2 = σ^2.    (2.3)

Thus the center of symmetry for S is zero, as it is for X̄ and t, but not for
S+ or S−. Note that σ^2 is defined by (2.3) and is the variance of the randomization
distribution of S, not the population variance of the X_j, although it is
an unbiased estimator of n times the latter under the null hypothesis.
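Both moments can be verified by direct enumeration of the randomization distribution; a Python sketch with illustrative absolute values:

    # Sketch: the randomization mean and variance of S, checked by
    # enumerating all 2^n sign assignments of fixed absolute values a_j.
    from itertools import product

    a = [0.3, 0.8, 0.4, 0.6, 0.2]             # any fixed absolute values
    vals = [sum(s * v for s, v in zip(signs, a))
            for signs in product((1, -1), repeat=len(a))]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    print(mean, var, sum(v * v for v in a))   # 0.0, and var = sum of a_j^2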
    t = Z[(n − 1)/(n − Z^2)]^{1/2}, where Z = S/σ.    (2.8)
    d = 1 + ⋯    (2.10)

where

    k = n − 1 + ⋯ + terms of order (1/n).
These results show clearly the order of magnitude of the corrections to the
normal theory values k = n - 1, c = 1.
We illustrate both of these approximations using the Darwin data (where
n = 15) introduced in Sect. 3.1, Chap. 3. Fisher [1966 and earlier editions]
gives the one-tailed P-value in the randomization distribution as 863/2^15 =
0.02634. Using Student's t distribution with n − 1 = 14 degrees of freedom,
Fisher obtains 0.02485 for the ordinary t test (t = 2.148), and 0.02529 with
a continuity correction to allow for the discreteness of the measurements
(t = 2.139). If we use the approximation that treats t^2 as F distributed with
d and (n − 1)d degrees of freedom, we first calculate d from (2.9) for these
same data as d = 0.937; then the F distribution with degrees of freedom
d = 0.937 and (n − 1)d = 13.12 gives the one-tailed P-value as 0.02643
without a continuity correction, and 0.02686 with one. If we use the method
even defined. There are several natural definitions, but under any of them,
the conditional result is much stronger than the unconditional one and does
not follow from an ordinary central limit theorem (Problem 20).
The randomization test based on the sample mean, or any of the equivalent
test criteria given in Sect. 2.1, relies for its level only on the assumption that,
given the absolute values of the observations, all assignments of signs are
equally likely. This randomization test and randomization distribution are
therefore conditional on these absolute values. In particular, this means that
the level of the randomization test is α if, under the null hypothesis, the conditional
probability of rejection given |X_1|, ..., |X_n| is at most α, and that
the P-value is the corresponding conditional probability of a value of the
test statistic equal to or more extreme than that observed.
Generalizing, we define a one-sample randomization test for center of
symmetry equal to zero as a test which is conditional on the absolute values
of the observations, having a null distribution that is the randomization
distribution generated by assigning signs randomly to these absolute values.
Tests which are members of this general class may depend on the actual
values of the X_j. Signed-rank (or rank-randomization) tests are those which
depend only on the signed ranks of the X_j. Any signed-rank test, including
the Wilcoxon signed-rank test or any test based on a sum of signed constants
(as defined in Sect. 7.1, Chap. 3) is a randomization test; however, not all
randomization tests are signed-rank tests, as the test of Sect. 2.1 based on
the sample mean or equivalent criteria is not.
The class of all randomization tests is even broader than it may seem. Two
other specific examples of randomization tests are described below.
(1) Consider the composite test defined as choosing a constant c and applying
a level α Wilcoxon signed-rank test whenever Σ_j X_j^2 ≤ c(Σ_j |X_j|)^2,
and a level α sign test otherwise. This is a randomization test because, given
the |X_j|, the conditional level of the composite test is α regardless of which
test is used. Of course, any number of component tests may be used, but in
order for such a composite test to be a randomization test, the rule for
choosing which component test to use must depend only on the |X_j|. A
rule of this kind aimed at achieving good power for a wide variety of dis-
tributions has come to be called "adaptive" [Hogg, 1974]. The possibility
was recognized at least as early as 1956, when Stein [1956] showed that in
this way, for large n, power can be attained arbitrarily close to that of the
3.2 Properties
Let φ(X_1, ..., X_n) denote the probability of rejection (critical function) of
an arbitrary test. We write X_j = a_j I_j where a_j = |X_j| and

    I_j = −1 if X_j < 0, I_j = 1 if X_j > 0.

Then we may write the critical function as

    φ(X_1, ..., X_n) = φ(a_1 I_1, ..., a_n I_n).    (3.1)
The a_j = |X_j| are to be treated as constants. Given the |X_j| = a_j, the conditional
expected value of the critical function, under the null hypothesis
H_0 of a distribution symmetric about zero, is simply the mean of its randomization
distribution, or

    E_0[φ(X_1, ..., X_n) | |X_1| = a_1, ..., |X_n| = a_n] = Σ φ(±a_1, ..., ±a_n)/2^n,    (3.2)

where the sum is over all the 2^n possible assignments of signs. This expected
value is the conditional probability of a Type I error for the test. Accordingly,
a test φ of H_0 has conditional level α if the quantity in (3.2) is less than or equal
to α. If this holds for all a_1, ..., a_n, then the test is a randomization test. Any
such test also has level α unconditionally (Problem 21) by the usual argument
(see Sect. 6.3, Chap. 2).
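A Python sketch of the conditional-level computation in (3.2); the critical function used is illustrative only:

    # Sketch: conditional level (3.2) of a test phi, given |X_j| = a_j, by
    # averaging phi over all 2^n sign assignments.  The test is illustrative.
    from itertools import product

    def conditional_level(phi, a):
        pts = list(product((1, -1), repeat=len(a)))
        return sum(phi([s * v for s, v in zip(signs, a)])
                   for signs in pts) / len(pts)

    # Illustrative test: reject (phi = 1) iff every observation is positive.
    phi = lambda xs: 1.0 if all(v > 0 for v in xs) else 0.0
    print(conditional_level(phi, [0.3, 0.8, 0.4, 0.6, 0.2]))   # 1/32 = 0.03125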
In the notation just introduced, we may say that a signed-rank (or rank-randomization)
test is one whose critical function depends on the a_j = |X_j|
only through their ranks, as well as the signs I_j.
The statements made in Sect. 2.2 about weakening the assumptions apply
without change to all members of the general class of one-sample randomiza-
tion tests, as do the statements of Sect. 3.5, Chap. 3 provided the tests are
monotonic where monotonicity is obviously called for.
Theorem 3.1. If a test has level α for either H_0 or H_0′, then the test is a randomization
test.

PROOF. In order to see this for H_0, suppose that the conditional level were
greater than α, given some set of absolute values |X_1| = a_1, ..., |X_n| = a_n.
Consider independent X_j with a distribution such that P(X_j = a_j) =
P(X_j = −a_j) = 1/2; then H_0 is satisfied (H_0′ is not) but the null probability
of rejection is greater than α. This contradiction shows that the supposition
is impossible and any level α test of this null hypothesis has conditional
level α given |X_1|, ..., |X_n|. The result for H_0′ will not be proved here. See
Lehmann [1959, Sect. 5.10] or Lehmann and Stein [1949]. □
Theorem 3.2. If a test has level α for H_0′ and is unbiased against H_1′, or has
level α for H_0″ and is unbiased against H_1″, then it is an (n! 2^n)-type randomization
test.

The proof of this property will be given shortly. As will be evident from
this proof, other, less broad, null and alternative hypotheses would also
imply the conclusion of this theorem.

The result in Theorem 3.2 "justifies" restricting consideration to (n! 2^n)-type
randomization tests when the observations are identically distributed,
but we have not yet justified the further restriction to 2^n-type randomization
tests. Recall, however, that in Sect. 8, Chap. 3, we defined a procedure as
permutation invariant if it is unchanged by permutations of the observations.
If a test is permutation invariant, then averaging over the permutations has
no effect, and hence the two types of randomization are equivalent. As a
result, a permutation-invariant test is a 2^n-type randomization test if and
only if it is an (n! 2^n)-type randomization test (Problem 22b). The reasons
given in Sect. 8, Chap. 3 for using a test which does not depend on the order
of the observations apply equally here. In particular, for independent, identically
distributed observations, nothing whatever is gained by taking the
order into account. Therefore, any test of H_0′ against H_1′, or H_0″ against
H_1″, may as well be taken as permutation invariant, and if it is unbiased, it
must be a randomization test (2^n-type, or equivalently, (n! 2^n)-type).
PROOF OF THEOREM 3.2. We outline here a proof of Theorem 3.2 for continuous
distributions only. A proof for the discrete case is requested in
Problem 23. Suppose we have a test of H_0″ which has level α under every
common continuous distribution that is symmetric about 0 and is unbiased
against the alternative H_1″ of a common continuous distribution that is
symmetric about some other point. In order to prove that this test must have
conditional level α given |X|_(1), ..., |X|_(n), we need prove only that the
|X|_(1), ..., |X|_(n) are sufficient and boundedly complete for the common
boundary K of H_0″ and H_1″, which is H_0″ (by Theorem 6.1 of Chap. 2). The
sufficiency part is trivial, and will be left as an exercise (Problem 25). To
prove completeness, consider the family of symmetric densities given by
where the integration is over the region of possible values of T_1, ..., T_n
and where J(t_1, ..., t_n) is exp(−Σ_j x_j^2) times the Jacobian of the |x|_(j)
with respect to the t_k. It follows by the theory of Laplace transforms that the
integrand is zero and hence that φ = 0. For more details on this approach see
Lehmann [1959, pp. 132-133]. Alternatively, see Fraser [1957b, pp. 28-31]
or Lehmann [1959, pp. 152-153] for an approach that is derived from the
discrete case. These sources deal with the usual order statistics X_(1), ..., X_(n)
and arbitrary densities rather than |X|_(1), ..., |X|_(n) and symmetric densities;
this difference affects the proof only slightly. □
Matched Pairs
with means μ_j' - μ_j'' and μ_j'' - μ_j', and possibly extremely small variance.
Thus, it may, in a sense, approximate a two-point distribution X_j =
±(μ_j' - μ_j'') = ±ν_j, say. Third, the result holds for any null hypothesis
that contains H_0^c, i.e., any null hypothesis including all the distributions in
H_0^c (Problem 26). Fourth, the result does not imply that the test is a ran-
domization test based solely on the treatment-control differences. One might,
for example, use a composite or "adaptive" test with components that
depend only on the treatment-control differences but a rule for choosing the
component test that depends on the sums within pairs (Problem 27).
4 Most Powerful Randomization Tests

Reasons for using randomization tests were given in Sect. 3.2. In this sub-
section we will see how to find that randomization test which is most powerful
against any specified alternative distribution. The particular case of normal
alternatives will be illustrated in the two subsections following.
We consider the usual (2^n-type) randomization distribution, under which,
given the absolute values |X_1|, ..., |X_n|, there are 2^n possible assignments
of signs, and hence 2^n possible samples, which are equally likely. The implica-
tion of using a randomization test is that this condition holds under the
null hypothesis. Consider now an alternative distribution with joint density
or discrete frequency function f(x_1, ..., x_n). Under this alternative, given
|X_1|, ..., |X_n|, the conditional probabilities of each of the 2^n possible
samples x_1, ..., x_n are proportional to f(x_1, ..., x_n). By the Neyman-
Pearson Lemma (Theorem 7.1 of Chap. 1), it follows (Problem 28) that among
tests with conditional level α given |X_1|, ..., |X_n|, that is, among randomiza-
tion tests, the conditional power against f is maximized by a test of the form

reject if f(X_1, ..., X_n) > k,
"accept" if f(X_1, ..., X_n) < k. (4.1)

Randomization may be necessary at k (that is, the test may be randomized).
The choice of k and the randomization at k must be determined so that the
test has conditional level exactly α.
In other words, the procedure for finding the conditionally most powerful
randomization test is to consider the value of f at each x_1, ..., x_n having
the same absolute values as the observed X_1, ..., X_n. The possible samples
x_1, ..., x_n are placed in the rejection region in decreasing order of their
corresponding values of f, starting with the largest value of f, until their null
probabilities total α. The region will consist of the α2^n possible samples
x_1, ..., x_n which produce the largest values of f if α2^n is an integer and if the
(α2^n)th and (α2^n + 1)th values of f are not tied. Ties may be broken arbitrarily.
If α2^n is not an integer, a randomized test will be necessary.
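As a concrete illustration (ours, not from the text), the following Python
sketch enumerates the 2^n sign assignments and returns the conditional
P-value of the observed sample under an alternative density f (any function
of the full sample); rejecting when this P-value is at most α gives the test
(4.1), apart from the randomization needed at the boundary when α2^n is not
an integer or when ties occur.

    from itertools import product

    def mp_randomization_pvalue(x, f):
        # Conditional P-value of the observed sample: the fraction of the
        # 2^n equally likely sign assignments whose value of the alternative
        # density f is at least f evaluated at the observed sample.
        a = [abs(v) for v in x]
        f_obs = f(x)
        count = sum(1 for signs in product([1, -1], repeat=len(a))
                    if f([s * v for s, v in zip(signs, a)]) >= f_obs)
        return count / 2 ** len(a)

    # Example: a normal alternative with mu = 1, sigma = 1.
    # import math
    # f = lambda xs: math.exp(-sum((v - 1) ** 2 for v in xs) / 2)
    # mp_randomization_pvalue([0.8, 1.4, -0.2, 0.9], f)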
Since this test is the randomization test which maximizes the conditional
power against f, it also maximizes the ordinary (unconditional) power
against f (Problem 28). Thus we have shown how to find the (2^n-type) ran-
domization test which is most powerful against any specified alternative f.
The conditions given are necessary and sufficient. The method for (n! 2^n)-type
randomization tests is similar.
Consider now the alternative that X_1, ..., X_n are a random sample from a
normal distribution with mean μ and variance σ². Then the joint density is

f(x_1, ..., x_n) = ∏_{j=1}^n (2πσ²)^{-1/2} exp[-(x_j - μ)²/(2σ²)].
For given |X_1|, ..., |X_n| (and μ and σ²), this f is an increasing function of
Σ X_j if μ > 0, and a decreasing function of Σ X_j if μ < 0. Therefore, rejecting
if f(X_1, ..., X_n) is one of the k largest of its possible values given |X_1|, ...,
|X_n| is equivalent to rejecting if Σ X_j (or an equivalent statistic) is one of the
k largest of its possible values given |X_1|, ..., |X_n| when μ > 0, and if it is
one of the k smallest when μ < 0. Thus the result of Sect. 4.1 shows that the
upper-tailed observation-randomization test based on S = Σ X_j (or X̄) is
the most powerful randomization test against any normal alternative with
μ > 0, and similarly for the lower-tailed test and μ < 0. In short, the one-
tailed randomization tests based on the sample mean are uniformly most
powerful among randomization tests against one-sided normal alternatives.
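Since f is monotone in Σ X_j for fixed absolute values, the test needs only the
randomization distribution of the sum. A minimal sketch (ours) of the
upper-tailed observation-randomization P-value:

    from itertools import product

    def sum_randomization_pvalue(x):
        # Upper-tailed P-value based on S = sum of the observations: the
        # fraction of sign assignments with sum(+/-|x_j|) >= the observed sum.
        a = [abs(v) for v in x]
        s_obs = sum(x)
        count = sum(1 for signs in product([1, -1], repeat=len(a))
                    if sum(s * v for s, v in zip(signs, a)) >= s_obs)
        return count / 2 ** len(a)

    # With all observations positive, e.g. [1.2, 0.4, 2.1, 0.7, 1.5], only
    # the identity assignment attains the observed sum, so the P-value is 1/32.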
Consider now the two-sided alternative that X_1, ..., X_n is a random sample
from a normal distribution with mean μ ≠ 0 and variance σ². Since we found
in the previous subsection that different randomization tests are uniformly
most powerful against μ < 0 and against μ > 0, there is no uniformly most
powerful randomization test against this two-sided alternative. However,
the equal-tailed randomization test based on X̄ is the uniformly most
powerful test against μ ≠ 0, among unbiased randomization tests (Problem
29). Further, in the class of randomization tests which are invariant under the
transformation carrying X_1, ..., X_n into -X_1, ..., -X_n, we show below
that the equal-tailed randomization test is again the uniformly most powerful
test. This invariance means that the test is unaffected if the signs of all the
observations are reversed. Notice that this transformation carries the alter-
native given by μ, σ² into that given by -μ, σ², so the invariance rationale
(Sect. 8, Chap. 3) can be applied. In particular, any invariant test has the
same power against μ, σ² as against -μ, σ².

PROOF. From the last sentence it follows that any invariant test has the
same power against μ, σ² as against the density h obtained by averaging the
density for μ, σ² and the corresponding density for -μ, σ², namely

h(x_1, ..., x_n) = (2πσ²)^{-n/2} exp[-(Σ_j x_j² + nμ²)/(2σ²)] cosh(μ Σ_j x_j/σ²). (4.4)

For fixed |x_1|, ..., |x_n|, (4.4) is an increasing function of |μ Σ x_j/σ²|, and
hence of |Σ x_j| (Problem 31). It follows from (4.1) that the most powerful
randomization test against h is that which rejects for large |Σ X_j|. Since
this is simply the equal-tailed randomization test based on X̄, the proof is
complete.
5 Observation-Randomization versus
Rank-Randomization Tests
We have found that the one-sample observation-randomization test of
Sect. 2 is valid with almost no distribution assumptions, and it is the most
powerful randomization test against normal alternatives. One can show
PROBLEMS
1. (a) Express X̄, S, S+, and S- each in terms of each of the others and Σ|X_i|.
(b) Show that the randomization tests based on these statistics are equivalent.
2. (a) Find the one-tailed P-value of the Wilcoxon signed-rank test for Manis' data
given in Table 2.1.
(b) For Manis' data, find lower and upper confidence bounds corresponding to
the Wilcoxon signed-rank test at the one-sided level α = 10/1024 for an assumed
center of symmetry.
(c) Do (b) for the randomization test based on X̄.
(d) How would you interpret the confidence bounds in (b) or (c)?
(e) In Manis' data, some subjects had smaller initial "distance" than others, and
hence less possibility of reducing the distance over time. How does this affect the
various randomization tests of the null hypothesis of symmetry about 0? What
alternative procedures might be considered, with what advantages and dis-
advantages?
3. For the data in Sect. 3.1, Chap. 3, Fisher [1966 and earlier editions] obtained the
one-tailed P-value of 0.02634 for a randomization test based on the t statistic. How
many values of t as small as or smaller than that observed must he have counted?
4. Show that the randomization distributions of X̄, S, S+, S-, and t are all symmetric.
5. Consider the null hypothesis H_0 that the observations X_1, ..., X_n are independently
distributed, symmetrically about 0, and the randomization test that rejects H_0 when
K ≤ k for some chosen integer k, where K is the number of different assignments of
signs for which Σ ±X_j ≥ Σ X_j (including Σ X_j itself). Show that
(a) This test has level α ≤ k/2^n, irrespective of whether the observations are iden-
tically distributed. The P-value is K/2^n.
(b) This test has level α = k/2^n if the observations are also continuously distributed.
6. Consider a matched-pairs experiment with the null hypothesis of no treatment
effect on any observation. Show that randomization within pairs induces the
randomization distribution of the treatment-control differences whatever the
paired observations may be.
7. Show that the confidence bounds corresponding to the upper-tailed randomization
test based on X̄ are
(a) X_(1) at level 1/2^n,
(b) (X_(1) + X_(2))/2 at level 2/2^n,
(c) min{X_(2), (X_(1) + X_(3))/2} at level 3/2^n,
where X_(i) is the ith order statistic of a sample of size n. Compare Problem 36,
Chap. 3.
*8. Given sample observations X_1, ..., X_n, let K be the number of different assignments
of signs for which Σ ±X_j ≥ Σ X_j (including Σ X_j itself). Show that
(a) The number of nonnegative subsample means is K - 1. (Hint: A subsample
total equals (Σ X_j - Σ ±X_j)/2 where a - is assigned if X_j is in the sub-
sample and a + is assigned otherwise.)
(b) The confidence bound for center of symmetry μ corresponding to the ran-
domization test that rejects for K ≤ k is the kth smallest subsample mean.
(c) The test that rejects for K ≤ k has level α ≤ k/2^n if the X_j are independent and
symmetric about 0 (not necessarily identically distributed). The P-value is
K/2^n. If the X_j are also continuously distributed, then α = k/2^n.
(d) If the X_j are independent with a distribution that is continuous and symmetric
about μ, then the 2^n - 1 subsample means all differ with probability one, and
they partition the real line into 2^n intervals that are equally likely to contain μ.
*9. (Randomization tests with randomly chosen subsets of assignments of signs)
Let X_1, ..., X_n be independent observations with a distribution that is symmetric
about 0 under the null hypothesis H_0. Define Y_0 = Σ_j X_j and Y_i = Σ ±X_j =
Σ ±|X_j| for i = 1, ..., m, where the signs are drawn at random either (i) un-
restrictedly, or (ii) without replacement from all 2^n possible assignments except
that corresponding to Y_0. Let K denote the number of values of i for which Y_i ≥ Y_0
for 0 ≤ i ≤ m, let R_i denote the set of values of j for which the sign of X_j differs in the
sums for Y_0 and Y_i, and let Z_i denote the mean of the X_j for which j is in R_i, for R_i
nonempty. Show that
(a) The two sums given above to define Y_i are equivalent.
(b) Y_0 is a random sample of size one from the order statistics of Y_0, Y_1, ..., Y_m under
H_0.
(c) The test that rejects H_0 if K ≤ k has level α ≤ k/(m + 1) and P-value equal to
K/(m + 1).
(d) The R_i, i = 1, ..., m, are a random sample drawn with replacement from all 2^n
possible subsets of the set {1, 2, ..., n} in case (i) above, and without replacement
from all 2^n - 1 nonempty subsets in case (ii) above.
(e) If H is the number of nonnegative Z_i and E is the number of empty R_i for 1 ≤
i ≤ m, then K = H + 1 + E. In case (ii) above, E = 0.
(f) The confidence limit for center of symmetry μ that corresponds to the test that
rejects for K ≤ k is the (k - E)th smallest Z_i for 1 ≤ i ≤ m (which is the kth
smallest in case (ii) above).
(g) If the X_j have a continuous distribution that is symmetric about μ, then in
case (ii) the level of the test in (c) is α = k/(m + 1) and the Z_i partition the real
line into m + 1 intervals that are equally likely to contain μ. (This result can
also be proved from Problem 8d using Problem 10 below.)
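Problem 9 describes what is often called a Monte Carlo randomization test.
A sketch of case (i), under our reading of the problem (unrestricted random
sign draws; names and interface are ours):

    import random

    def monte_carlo_pvalue(x, m, seed=0):
        # P-value K/(m+1), where K counts the sums Y_0, Y_1, ..., Y_m
        # (Y_0 = observed sum, included) that are at least Y_0.
        rng = random.Random(seed)
        a = [abs(v) for v in x]
        y0 = sum(x)
        k = 1  # Y_0 itself
        for _ in range(m):
            yi = sum(v if rng.random() < 0.5 else -v for v in a)
            if yi >= y0:
                k += 1
        return k / (m + 1)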
*10. Let W_1 < ... < W_N be continuously distributed with P(W_i > 0) = i/(N + 1) for
1 ≤ i ≤ N. Let W_1' < ... < W_m' be the order statistics of a sample of m drawn without
replacement from W_1, ..., W_N. Show that P(W_i' > 0) = i/(m + 1) for 1 ≤ i ≤ m.
(Hint: One method is to consider the case where the W_i are obtained by subtracting
from the order statistics of a sample an additional observation from the same
distribution.)
11. (a) Find the center of symmetry and the variance of the randomization distribu-
tions of each of X̄, S, S+, S-, and t.
(b) How are the moments of X̄, S, S+, and S- related? Why do the moments of t
not have a simple relationship to these moments?
12. Let G be the set of all n-dimensional vectors J = (J_1, ..., J_n) of 1's and -1's such
that an even number of elements are equal to -1. Let G_X be the set defined in
procedure (b) of Sect. 2.4. Show that
(a) G has 2^{n-1} members.
(b) If no X_j = 0, then G_X also has 2^{n-1} members.
(c) If J and J' both belong to the set G, then the vector (J_1 J_1', ..., J_n J_n') also belongs
to G.
(d) Given that X belongs to a particular set G_Y, all members of G_Y are equally likely
under the randomization distribution. What if X does not belong to G_Y?
*13. Let G be the set of all n-dimensional vectors of 1's and -1's such that the first two
elements are not both equal to -1. Let G_X be the set defined in procedure (b) of
Sect. 2.4. Show that
(a) G has 3(2^{n-2}) members. (G does not satisfy Problem 12c.)
(b) X = (X_1, ..., X_n) has the smallest or second smallest mean among members
of G_X if all X_j > 0 except possibly X_1 or X_2 but not both.
(c) If no X_j = 0, then under the randomization distribution the probability is at
least 3/2^n that X has the smallest or second smallest mean among members of
G_X. (If X could be treated as a random member of G_X, the probability would be
(8/3)/2^n, which is always smaller than 3/2^n.)
*14. (a) Show that the confidence bound corresponding to a randomization test based
on X̄ with a randomization set restricted by means of a group G as described
in procedure (b) of Sect. 2.4 is the kth smallest or largest subsample mean
among subsamples corresponding to members of G, where k is the critical
value of the randomization test and the subsample corresponding to a vector J
in G consists of those X_j for which J_j = -1.
(b) What operation on the subsamples corresponds to the group multiplication
in G?
15. (a) Make a small table to compare the values of z_α[(n - 1)/(n - z_α²)]^{1/2}, t_α, and
z_α, where z_α is the upper α quantile of the standard normal distribution and
t_α is that of the Student's t distribution with n - 1 degrees of freedom. (A good
picture can be obtained from the values α = 0.10, 0.05, 0.025, 0.01; n = 3, 6,
10, 20.)
*(b) How do the values of t_α, z_α[(n - 1)/(n - z_α²)]^{1/2}, and z_α compare for large
sample sizes?
(c) Find the ranges of the standardized randomization test statistic in (2.4) and the
t statistic in (2.7), given the sample absolute values. Find the ranges uncon-
ditionally.
(d) What do the ranges found in (c) imply about the normal approximation (2.4)
to the randomization test?
*(e) Show that the values of α for which t_α - z_α[(n - 1)/(n - z_α²)]^{1/2} is of smaller
order than 1/n as n → ∞ are the α values such that z_α = 0, ±√3.
16. (a) Find the moments of the randomization distribution of t² and compare them
with the corresponding moments of the F distribution with 1 and (n - 1)
degrees of freedom.
(b) Find the moments of the randomization distribution of t²/(t² + n - 1) and
compare them with the corresponding moments of the beta distribution with
parameters 1/2 and (n - 1)/2.
17. (a) Show that the expression for d in (2.10) is equivalent to the expression in (2.9).
(b) Show that b_2, as defined by (2.11), is a consistent estimator of the kurtosis of
the distribution of the X_j under the null hypothesis of a distribution symmetric
about zero.
*18. (a) Derive the first four moments of the randomization distribution of the statistic
Z defined by (2.4).
(b) Find the parameters of the beta distribution whose first two moments are the
same as the corresponding moments of the randomization distribution of Z²/n.
(c) Show that approximating the randomization distribution of Z²/n by the beta
distribution in (b) is equivalent to approximating the randomization distribu-
tion of t² = (n - 1)Z²/(n - Z²) by an F distribution and find the degrees of
freedom of this F distribution.
(d) Let V have the F distribution with degrees of freedom 1 and k. Find the values
of c and k such that c²V has the same first two moments as the F distribution
in (c).
19. Apply the approximations of (c) in Sect. 2.5 to the Darwin data in Sect. 3.1 of Chap. 3
to verify the results given in the text.
*20. (a) Give at least two possible definitions of convergence in distribution for con-
ditional distributions.
(b) Show that the definitions you gave in (a) imply the usual convergence in
distribution of the marginal distributions.
(c) Show that the converse of (b) does not hold.
(d) Show that the ordinary central limit theorems do not apply to these definitions.
21. Show that a randomization test at level α has unconditional level α under the null
hypothesis H_0 of a distribution symmetric about zero if the observations are
independent and identically distributed.
22. Show explicitly, in terms of the expectation of the critical function φ under ran-
domization, that
(a) Any 2^n-type randomization test is an (n! 2^n)-type randomization test, but not
conversely.
(b) A permutation-invariant test has the same level under both types of randomiza-
tion.
*23. Show that for the null hypothesis H_0': X_1, ..., X_n are independent with a common
distribution symmetric about zero, if a test is unbiased against the alternative
H_1': X_1, ..., X_n are independent with a common distribution that is symmetric
about some point μ ≠ 0 (or μ > 0), then that test is an (n! 2^n)-type randomization
test. (Remember that the common distribution need not be continuous.)
*24. Show that, in Problem 23, if unbiasedness is required only against continuous
alternatives, then the test need not be a randomization test for all discrete dis-
tributions. Why does this result not really contradict the results given in Sect. 3.2?
25. Show that the order statistics of the absolute values of the observations are sufficient
statistics for a sample from an arbitrary distribution that is symmetric about zero.
*26. In a matched pairs experiment, show that if a test has level α for a null hypothesis
containing H_0^c as stated in Sect. 3.2, then this test is a conditional test given the
observations X_j' and X_j''.
27. In a matched pairs experiment, give an example of a test which is conditional on
the observations X_j' and X_j'' but does not depend only on the treatment-control
differences. Why might such a test be desirable?
28. Show that, both conditionally and unconditionally, the most powerful randomiza-
tion test against an alternative density or discrete frequency function f(x_1, ..., x_n)
is of the form given in (4.1).
*29. (a) Show that the equal-tailed randomization test based on X̄ is a uniformly most
powerful unbiased randomization test against the alternative that the observa-
tions are a random sample from a normal distribution with mean μ ≠ 0 and
variance σ².
(b) Using the fact that the test in (a) is uniformly most powerful among tests which
are invariant under reversal of signs, show that it is the most stringent ran-
domization test against the same alternative.
*30. Show in general that under appropriate conditions
(a) If a uniformly most powerful invariant test and a uniformly most powerful
unbiased test both exist, then they must be the same test.
(b) If a uniformly most powerful invariant test exists, then it is most stringent.
31. Show that, for fixed |x_1|, ..., |x_n|, the function h(x_1, ..., x_n) given in (4.4) is an
increasing function of |Σ x_j|.
32. Let X_1, ..., X_n be independent and identically distributed with density f(x) =
(1/a)exp{-ρ([x - θ]/a)} where ρ is a known symmetric function and a is a known
scale factor. Let the null hypothesis be H_0: θ = 0.
(a) Find the most powerful randomization test of H_0 against the alternative
H_1: θ = θ_1 for θ_1 specified.
*(b) Under what circumstances does there exist a uniformly most powerful ran-
domization test against the alternative H_1': θ > 0?
(c) Show that the (locally) most powerful randomization test against the alternative
H_1': θ > 0 for small θ can be based on the statistic Σ_j ψ(x_j/a) where ψ(x) =
ρ'(x) = dρ(x)/dx. (See Sect. 9, Chap. 3.)
(d) Show that the tests in (a) and (c) are valid even if the a and ρ assumed are
incorrect.
(e) Show that the maximum likelihood estimate θ̂ of θ satisfies Σ_j ψ([x_j - θ̂]/a) =
0 where ψ(x) = ρ'(x). (Estimates of this form appear already in work of
Edgeworth [1908-9] and Fisher [1935] and were named M-estimates in Huber
[1964], which studies them extensively.)
(f) Show that the estimate that corresponds (in a suitable sense) to the test in (c)
is the maximum likelihood estimate.
33. With reference to Problem 32,
(a) If a is unknown, how could it be estimated without affecting the validity of
the randomization tests in (a) and (c)?
(b) If in addition ρ is not fully known but has some unknown shape parameters,
how could they be estimated without affecting the validity of the randomization
tests in (a) and (c)?
(c) What estimates would correspond to the tests in (a) and (b)?
CHAPTER 5
Two-Sample Rank Procedures
for Location
1 Introduction
The previous chapters dealt with inference procedures applicable in one-
sample (or paired-sample) problems. We now consider the situation where
there are two mutually independent random samples, one from each of two
populations. We discuss tests which apply to the null hypothesis that the
two populations are identical and the confidence procedures related to these
tests.
In choosing an appropriate test from among those available for the null
hypothesis of identical populations, consideration should be given to the
alternative hypothesis, since different tests are sensitive to different alter-
natives. The alternatives may be simply that the two populations differ in
some unspecified way, but frequently some specific type of difference is of
particular interest. A general alternative which is frequently important is
that the observations from one population tend to be larger than the observa-
tions from the other ("stochastic dominance"). A particular case of this
relationship occurs when the populations satisfy the shift assumption, which
is explained explicitly in the next section. Frequently, the difference in
"location" between the two populations is of primary interest. Under the
shift assumption, this difference is the same whatever location parameter is
chosen, and is the amount of shift required to make the two populations
identical. Furthermore, we can develop confidence procedures (correspond-
ing to the test procedures) which give confidence intervals for this difference
in location, or shift.
The primary discussion of two-sample tests in this book is divided among
three chapters. The median test, tests based on rank sums, and more general
Two arbitrary density functions which satisfy (2.3) are shown in Fig. 2.1 (c).
The shift assumption means that the two populations have the same
shape, and in particular their variances must be equal if they exist (Problem
1). Two normal populations with the same variance satisfy the shift assump-
tion, but two normal populations with different variances do not, nor do two
Poisson or exponential populations with different parameters.
If the shift assumption holds, then μ, the amount of the shift, must equal
the difference between the two population medians. It must also equal the
Figure 2.1. (a) F(x) = G(x + μ), F normal, μ < 0. (b) F(x) = G(x + μ), F exponential,
μ > 0. (c) f(x) = g(x + μ), μ < 0.
difference between the two population means, and indeed the difference
between the two values of any other location parameter (if it exists), such
as the mode, the midrange, or the average of the lower and upper quartiles.
The mean need not equal the median (or any other location parameter) in
either population, but the difference between the mean and the median must
be the same for both populations. Since the populations are the same except
for location, the difference in location is the same however it is measured,
and it equals the shift μ (Problem 2). For this reason, the shift parameter is
sometimes called the location parameter.
The confidence procedures in this chapter are developed under the shift
assumption, and accordingly they provide confidence intervals for μ, the
amount of the shift. If the shift assumption fails badly, the procedures will
not perform as advertised since the confidence level will not ordinarily be
valid.
On the other hand, the test procedures here can be developed and justified
logically without assuming that the shift assumption (or any other relation-
ship between the distributions) holds under the alternative hypothesis. The
tests retain their level as long as F = G under the null hypothesis. Thus,
while "acceptance" of this null hypothesis may be interpreted as not rejecting
the possibility that the shift assumption holds with μ = 0, rejection of the
null hypothesis implies no inference about whether the shift assumption
holds for any other μ. Furthermore, the tests developed here appear as if
they would be good against alternatives which include more than shifts,
and certain mathematical properties to be discussed provide justification for
this view. Of course similar statements apply to parametric tests of a dif-
ference in location for two otherwise identical populations, including the
normal theory test for equal means, so these points are not new or special to
nonparametric tests.
Table 3.1
            X's      Y's
Below        A        B        t
Above      m - A    n - B    N - t
             m        n        N
the entire table, since the column totals, m and n, are fixed by the sample
sizes. In case (2), since t is also fixed by the procedure, A alone determines the
entire table.
In case (1), A and B are determined by simply comparing each observation
with the fixed ξ. In case (2), the most straightforward procedure is to combine
both sets of observations into a single array but keep track of which observa-
tions are X's and which are Y's. Then A and B are the number of X's and Y's
respectively among the t smallest in the single combined array.
For any given data set, the same 2 x 2 tables can be obtained by method
(1) with various choices of ξ and by method (2) with various choices of t.
Specifically, the t in a table obtained by method (1) gives the same table by
method (2). Further, any method (2) table is also given by method (1) if ξ
is selected in such a way that exactly t observations are "below ξ," that is,
if ξ is any number greater than the tth smallest observation and smaller
than or equal to the (t + 1)th smallest observation. (If no such ξ exists
because of ties, then t is not a possible value by either method.) Note that
method (2) does not require any consideration of ξ at all, however.
For example, consider the two samples¹ below where m = 10 and n = 12.
X_i: 13.8, 24.5, 20.7, 22.5, 26.5, 14.5, 6.4, 20.0, 17.1, 15.5
Y_j: 16.2, 23.9, 24.3, 17.8, 15.7, 14.9, 6.1, 11.1, 16.5, 17.9, 15.3, 14.3
In case (1), if we take ξ = 18.0, we find that 5 of the X's and 10 of the Y's are
less than 18.0 so that A = 5 and B = 10; this gives Table 3.2. For case (2),
¹ These data are from Mosteller, F. and D. Wallace [1964, Sect. 4.8 at pp. 174-175], Inference
and Disputed Authorship: The Federalist, Addison-Wesley Publishing Co., Reading, Mass.
The Y's are scores computed in a certain way for the 12 "Federalist Papers" whose authorship
is in dispute between Hamilton and Madison. More specifically, Y_j is the natural logarithm of
the odds provided by the data in favor of Madison's having written the jth disputed paper,
under certain assumptions about the underlying model, except that it has been adjusted to allow
for the length of the paper. The X's are scores computed in the same way for 10 segments of
about 2,000 words each taken from material known to be by Madison. With the adjustment
for length, the X's and Y's should come from approximately the same population if the model is
reasonably good and if the disputed papers are by Madison. If the X's and Y's are not from the
same population, this by no means indicates that the disputed papers are by Hamilton; the
Y's are vastly different from the scores for Federalist Papers known to be by Hamilton. The
indication would be rather that something remains to be explained, perhaps an inadequacy in
the model. The adequacy of the model is explored extensively by Mosteller and Wallace.
Table 3.2                          Table 3.3
            X's   Y's                          X's   Y's
< 18.0       5    10    15         Below        3     5     8
≥ 18.0       5     2     7         Above        7     7    14
            10    12    22                     10    12    22
not made in advance but is based on the combined sample, as will be dis-
cussed.)
Specifically, if there are no ties at the combined sample median, the median
test for N even is equivalent to always choosing the value t = N/2; for N odd
it is equivalent to t = (N - 1)/2 if the combined sample median is counted
as "above" rather than "below," as it is by our earlier, arbitrary convention.
Ties will be discussed in Sect. 3.3.
As an example, we develop the 2 x 2 table that arises when the median
test is applied to the Mosteller and Wallace data given earlier in this section.
Since N = 22, we choose t = 11. The smallest t = 11 observations in the
combined sample include four X's and seven Y's, as Table 3.4 shows. Fisher's
exact test or an approximation may be applied to this 2 x 2 table. We need
not consider the combined sample median explicitly, but it is any number
between the eleventh and twelfth observations in the ordered pooled sample.
These observations are 16.2 and 16.5 respectively. Dichotomizing at say
16.4 leads to Table 3.4, as would dichotomizing at any other ξ in the interval
16.2 < ξ ≤ 16.5.
The two-sample sign test for fixed ξ is not a rank test because it does not
depend only on the ranks of the two samples. This test would be appropriate
to use when the measurement scale has only a small number of possible
(or likely) values, since then it is natural to choose ξ equal to the central
value expected.
The two-sample median and other quantile tests are particularly useful
in analyzing data related to experiments involving life testing because they
permit termination of the experiment before all units under test have ex-
pired. The information needed to perform the test is complete once t units
have expired, and sometimes well before that (Problem 5). The control
median test and the first-median test are variants of the two-sample quantile
tests with particular forms of termination rules. These variants reach the
same decisions as a two-sample quantile test (usually earlier) and hence
coincide if sampling is terminated as soon as a decision is reached, except
that the two tails may correspond to different two-sample quantile tests
(Problems 6 and 7). Rosenbaum [1954] gives a test which is equivalent to a
special case of a two-sample quantile test since it is based on the number of
observations in one sample that are larger than the largest value in the other
sample.
Table 3.4
            X's   Y's
Below        4     7    11
Above        6     5    11
            10    12    22
The general test known as Fisher's exact test provides a method for analyzing
2 x 2 tables like Table 3.1 that arise in two-sample sign tests. In Fisher's
test, the marginal totals m, n, t, and N - t are all fixed, either initially or after
conditioning, and the test is based on the conditional distribution of A
given t.
Assume that the X's and Y's that produced Table 3.1 are independent.
Under the null hypothesis that the X's and Y's have the same distribution,
the data in Table 3.1 represent N independent observations from a single
population. As long as either ξ or t is preselected, for any given t, m, and n all
subsets of size t are equally likely to be the subset containing the t smallest
observations, and hence any set of t observations out of the N is as likely as
any other set of t to constitute the observations in the first row of Table 3.1.
It follows that the conditional distribution of A given t is the hypergeometric
distribution (Problem 8), with discrete frequency function given by

P(A = a) = C(m, a)C(n, t - a)/C(N, t). (3.1)
² There are, however, 2 x 2 tables for which Fisher's exact test is not appropriate. (See Sect. 8,
Chap. 2 for examples.)
Even if t is chosen after the fact, as long as its choice does not depend
on which observations are X's and which are Y's, the null distribution of A
given t is given by (3.1).
The one-tailed P-value of Fisher's exact test is the cumulative probability
in (3.1) of the observed A or less for the left tail, and the observed A or more
for the right tail; these are tail probabilities in the hypergeometric distribu-
tion.
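For example, the left-tail P-value can be computed directly from the
hypergeometric frequency function (3.1); the following sketch uses only the
Python standard library (the function name is ours):

    from math import comb

    def fisher_left_pvalue(A, m, n, t):
        # P(hypergeometric variable <= A) with parameters m, n, t; N = m + n.
        N = m + n
        lo = max(0, t - n)   # smallest possible number of X's among the t smallest
        hi = min(A, m, t)
        return sum(comb(m, a) * comb(n, t - a)
                   for a in range(lo, hi + 1)) / comb(N, t)

    # For Table 3.4 (m = 10, n = 12, t = 11, A = 4) this gives 0.335,
    # agreeing with the value quoted from Table E below.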
Tables of the hypergeometric distribution are available [for instance,
Lieberman and Owen, 1961], but they are necessarily bulky. In order to
perform the median test in the absence of ties, only one value of t is required
for each combination of m and n, so that more convenient tables are possible.
Table E is designed for use with the median test for t = N/2 if N is even and
t = (N ± 1)/2 if N is odd, when m ≤ n. For example, it applies to Table 3.4
and gives a P-value for A ≤ 4 of 0.335. If m > n the designations X and Y
can be interchanged so that A still represents the observed number "below"
in the smaller sized sample. This is actually equivalent to basing the test on
B instead of A, with large B corresponding to small A. If other values of t
are required, as when there are ties at the combined sample median or
naturally dichotomous observations (see Sect. 3.3), Table E cannot be used.
Notice that A is symmetrically distributed for t = N/2.
In the absence of tables or outside the range of available tables, we must
use a computer program or approximations to the null distribution. The
most common approximation is based on Z², the chi-square statistic cor-
rected for continuity, or on its signed square root, Z, which is approximately
the normal deviate corresponding to the one-tailed P-value and can there-
fore be referred to Table A. An advantage of Z is that it reflects the direction
of the sample difference; Z² masks this direction, and hence can only be
used for two-tailed tests. Formulas for Z are

Z = (A ± 1/2 - mt/N)[N³/(mnt(N - t))]^{1/2} (3.2a)
  = [A(n - B) - (m - A)B ± N/2][N/(mnt(N - t))]^{1/2}. (3.2b)
For Table 3.2, for example, (3.2b) gives

Z = [5(2) - 5(10) + 11][22/(10(12)(15)(7))]^{1/2} = -1.212.

The normal approximation is quite accurate in the case of the median test
provided the smaller sample size is at least 12. If m/N and t/N are both far
from 1/2, however, the approximation is not very accurate for one-tailed
probabilities.
The test based on chi-square is popularly known as "the" test for 2 x 2
tables, but it is really just an approximation to Fisher's exact test. The exact
test seems to be less frequently used, probably because tables of the chi-
square distribution are much more accessible and sample sizes are frequently
large anyway.
*Several binomial approximations are also available, but they require
the use of binomial tables, which are themselves limited by having two more
parameters than normal tables. They work best when the table is rearranged
so that the two margins opposite A are the two smallest margins. That is, if
necessary, interchange the columns to make m ≤ n and the rows to make
t ≤ N - t. The simplest binomial approximation is to treat A as binomial
with parameters

n1 = min(m, t),  p1 = max(m, t)/N. (3.3)

A second approximation treats A as binomial with parameters

n2 = an integer near mt(N - 1)/[N(N - 1) - n(N - t)],  p2 = mt/(n2 N). (3.4)

A third approximation uses

n3 = min(m, t) = n1,  p3 = [max(m, t) - A/2]/[N - (n3 - 1)/2]. (3.5)
This is not equivalent to treating A as binomial, since p3 depends on A.
For this reason, it is not easy to see what mean and variance this approxima-
tion assigns to A, but it obviously gives the correct range. This is the first
approximation given by Wise [1954], and is actually an upper bound on the
probability when A = O. It is based on approximating the sum of hyper-
geometric probabilities by the incomplete beta function plus a correction
factor.
To apply these three approximations to the data in Table 3.2, we first rearrange
it in the form of Table 3.5.
Table 3.5
            X's   Y's
≥ 18.0       5     2     7
< 18.0       5    10    15
            10    12    22
p1 = 10/22 = 0.4546, P-value = 0.159;

n2 = 5, p2 = 10(7)/[5(22)] = 0.6363, P-value = 0.104;

p3 = (10 - 2.5)/(22 - 3) = 0.3947, P-value = 0.091.
As mentioned earlier, the exact P-value is 0.113. Here the second approxima-
tion is better than the third. In other cases, the third approximation may be
better than the second, and both are almost always better than the first.
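These calculations are easy to reproduce; the sketch below (ours) evaluates
the three binomial upper-tail approximations for the rearranged table and
recovers the P-values just quoted:

    from math import comb

    def binom_upper(n, p, a):
        # P(A >= a) for A binomial with parameters n and p.
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(a, n + 1))

    m, n, t, A = 10, 12, 7, 5          # Table 3.5 after rearrangement
    N = m + n
    print(binom_upper(min(m, t), max(m, t) / N, A))            # (3.3): 0.159
    n2 = round(m * t * (N - 1) / (N * (N - 1) - n * (N - t)))  # n2 = 5
    print(binom_upper(n2, m * t / (n2 * N), A))                # (3.4): 0.104
    p3 = (max(m, t) - A / 2) / (N - (min(m, t) - 1) / 2)
    print(binom_upper(min(m, t), p3, A))                       # (3.5): 0.091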
These and other approximations based on the binomial, Poisson, normal
and other distributions are discussed more fully in Lieberman and Owen
[1961] and Johnson and Kotz [1969]. Peizer, extending Peizer and Pratt
[1968] and Pratt [1968], developed an excellent normal approximation that
is easily calculated. It has been refined and studied by Ling and Pratt [1981]
and is given at the end of Table E. See also Molenaar [1970].
3.3 Ties
Ties present no problem in the two-sample sign test with fixed ξ because we
defined "above" as meaning greater than or equal to and "below" as strictly
below. Ties are also easily handled for a quantile test, but a brief discussion is
needed here. Consider the median test, for example; then we intend to choose
t = N/2 or (N - 1)/2. However, if ties occur at the median of the com-
bined sample, then dichotomizing at the median will ordinarily lead to some
other value of t, a smaller value when the observations equal to the median are
counted as "above." The value of t could be preserved by breaking these ties
at random, along the lines of Sect. 6 of Chap. 3. A more appealing procedure
would be to dichotomize at a point slightly above or slightly below the
median, whichever value makes t closer to N /2. This is equivalent to keeping
the sample median as the dichotomizing point and redefining the terms
"above" and "below" in order to make t as close as possible to N /2. In
other words, the observations at the median are assigned to that category,
"above" or "below," which contains fewer other observations. We are free
to do this since t may be chosen as any function of the combined sample
without changing the null distribution given in (3.1), as remarked earlier.
It may sometimes be preferable, especially if the observations are changes
and the median occurs at "no change," to omit the observations at the
median. Then "below" means strictly below and "above" means strictly
above, and the sample sizes are the numbers of observations different from
the median. The hypergeometric distribution continues to apply under the
null hypothesis, however (Problem 9c).
In the special case where the observations are not only not all different,
but also have only two possible values, the data are inherently in the form of
a 2 x 2 table. Then there is no freedom of choice regarding the value of t,
and the situation is more like the case of having a fixed ξ. However, if one of
the two possible values is called "below" and the other "above," this could
be considered an extreme case of the situation with ties described earlier.
Suppose that the shift assumption, as stated in Eqs. (2.1), (2.2) or (2.3),
holds so that X and Y - μ have the same distribution, but μ is unknown.
In order to test a null hypothesis which specifies a particular value for μ
using a two-sample sign test procedure, we could subtract μ from each Y_j
and then apply a two-sample test for identical distributions to the observa-
tions X_1, ..., X_m, Y_1 - μ, ..., Y_n - μ. The confidence region corresponding
to such a test consists of those values of μ which would be "accepted" when
so tested. We could proceed by trial and error, testing various values of μ to
see which ones lead to "acceptance." However, for a two-sample median or
other quantile test, there is a very simple way to obtain these confidence limits
explicitly, as we will now see.
Consider a two-sample quantile test at level α specifying t as the marginal
total of the first row. Let a and a' be the lower and upper critical values of A,
that is, P(A ≤ a) + P(A ≥ a') ≤ α under the hypergeometric distribution
with parameters m, n, and t. Then we would "accept" μ if there are at least
(a + 1) X's and at most (a' - 1) X's among the t smallest of X_1, ..., X_m,
Y_1 - μ, ..., Y_n - μ, and reject μ otherwise. This region of "acceptance" is the
interval between two confidence limits which can be very simply stated in
terms of the order statistics of the two samples as follows.
Let X_(1), ..., X_(m) be the X's rearranged in order of increasing (algebraic)
value, so that X_(1) ≤ X_(2) ≤ ... ≤ X_(m). Define Y_(1), ..., Y_(n) similarly. Then
the test procedure "accepts" μ if and only if (Problem 10)

Y_(t-a'+1) - X_(a') ≤ μ ≤ Y_(t-a) - X_(a+1), (3.6)
except perhaps at the endpoints, where the procedure has not been defined.
Equation (3.6) then gives the confidence interval for the shift μ which cor-
responds to a two-sample quantile test with first row total t. The confidence
level is 1 - α, where α = P(A ≤ a) + P(A ≥ a') according to the hyper-
geometric distribution for this m, n, and t. Of course, either the left-hand or
right-hand side of (3.6) may be used separately as a one-sided confidence
bound; then α is P(A ≥ a') or P(A ≤ a) respectively. If the distributions are
not continuous, the confidence levels are conservative as long as the end-
points are included in the confidence regions (Problem 107).
The values of a and a' for given α, or of α for some selected a and a', can
be found from Table E for t = N/2 if N is even, or t = (N ± 1)/2 if N is
odd, that is, the values of t corresponding to the median test. It may be
desirable to use other values of t, especially since the choice of α for any one
t is very limited. However, this requires a more extensive table, since com-
putation of α must of course be based on the value of t actually used.
Suppose that m = n = 10 and we use t = N/2 = 10. Then Table E shows
that choosing a = 2, a' = 8 gives α = 0.0115 + 0.0115 = 0.0230. Thus at
level 1 - α = 0.9770, (3.6) gives

Y_(3) - X_(8) ≤ μ ≤ Y_(8) - X_(3) (3.7)

as the confidence interval for the shift μ which corresponds to the median
test.
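A small sketch (ours) that computes these limits from the two samples,
assuming the form of (3.6) as reconstructed above and 1-indexed order
statistics:

    def shift_confidence_limits(x, y, t, a, a_prime):
        # Confidence limits (3.6) for the shift mu:
        #   Y_(t-a'+1) - X_(a')  <=  mu  <=  Y_(t-a) - X_(a+1)
        xs, ys = sorted(x), sorted(y)
        lower = ys[(t - a_prime + 1) - 1] - xs[a_prime - 1]
        upper = ys[(t - a) - 1] - xs[(a + 1) - 1]
        return lower, upper

    # For m = n = 10, t = 10, a = 2, a' = 8 this returns
    # (Y_(3) - X_(8), Y_(8) - X_(3)), the interval (3.7) above.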
*Mood [1950, pp. 395-398] suggests confidence bounds of the form
Y_(r) - X_(s) without relating them to the median test and its counterpart
with arbitrary t. (See also Mood and Graybill [1963, pp. 412-416], and Mood
et al. [1974, pp. 521-522].) It is not obvious that the formula given there for
α is equivalent to the hypergeometric formula (Problem 11). The confidence
limits corresponding to the control median test, the first median test, and
Rosenbaum's test are all of the form (3.6) except that the upper and lower
limits employ different values of t (Problem 13).*
3.5 Power
Consider now the median test, or, more generally, any quantile test with
the value of t fixed in advance. Suppose that the populations are continuous,
so that we may ignore the possibility of ties and hence of not being able to
attain the chosen t. Then, by (3.6), the two-tailed test rejects the null hypo-
thesis of equal populations if and only if X_(a+1) > Y_(t-a) or X_(a') < Y_(t-a'+1).
The power of the test is the probability of this event under the alternative.
Treating the difference of the order statistics in the first event as approximately
normal leads to the standardized deviate

z = [μ(X, a + 1) - μ(Y, t - a)]/[σ²(X, a + 1) + σ²(Y, t - a)]^{1/2}, (3.8)

where μ(X, a + 1) and σ²(X, a + 1) denote the mean and variance of X_(a+1),
and similarly for Y_(t-a). One approximation for the mean and variance is
given by (Problem 16)

μ(X, a + 1) = the quantile of order p of the X distribution, (3.9)
σ²(X, a + 1) = [p(1 - p)/m]{f[μ(X, a + 1)]}^{-2}, (3.10)

where f is the density of the X distribution.
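The following sketch (ours, purely illustrative) applies (3.8)-(3.10) to normal
populations. The quantile orders p = (a + 1)/(m + 1) for X_(a+1) and
p = (t - a)/(n + 1) for Y_(t-a) are our assumptions; the text's choice of p is
not shown in this excerpt.

    from statistics import NormalDist

    def approx_onetail_power(mu, m, n, t, a):
        # Rough P(X_(a+1) > Y_(t-a)) for X ~ N(0,1), Y ~ N(mu,1), using the
        # order-statistic approximations (3.9) and (3.10).
        X, Y = NormalDist(0, 1), NormalDist(mu, 1)
        px, py = (a + 1) / (m + 1), (t - a) / (n + 1)  # assumed quantile orders
        mx, my = X.inv_cdf(px), Y.inv_cdf(py)          # (3.9)
        vx = px * (1 - px) / (m * X.pdf(mx) ** 2)      # (3.10) for X_(a+1)
        vy = py * (1 - py) / (n * Y.pdf(my) ** 2)      # (3.10) for Y_(t-a)
        z = (mx - my) / (vx + vy) ** 0.5               # (3.8)
        return NormalDist().cdf(z)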
3.6 Consistency
[Figure 3.1: panels (a) and (b)]
*These facts can be proved (Problem 18a) using the fact that the left and
right sides of (3.6) are both consistent estimators of the difference of the
medians. Alternatively, (3.11) can be used along with the consistency of the
two-sample sign test with fixed ξ (Problem 18b).*
and the advantages of optimality for fixed ξ are ordinarily outweighed by the
advantages of the median test in being able to select the point of dichotomiza-
tion sensibly in light of the combined sample.
We have been assuming that all the observations are independent and
identically distributed under the null hypothesis. The name and construc-
tion of the median test might suggest that its level would be retained as long
as all observations are drawn from distributions with the same median.
However, this is unfortunately not the case. If the X's are drawn from one
population and the Y's from another population, where the medians are the
same but the scale parameters differ, the level may be seriously affected even
in large samples (see Pratt [1964] for further discussion and numerical,
asymptotic results). The same point applies, a fortiori, to the corresponding
confidence procedure. The two-sample sign test with fixed ξ is, of course,
valid whenever the probability below ξ is the same for the two populations,
but it would seldom happen that we know in advance a value of ξ for which
this assumption holds under a null hypothesis that allows different scales in
the two populations.
In a treatment-control comparison where the units are assigned at
random to the two groups, the randomization itself guarantees the level of
the median test for the null hypothesis that the treatment has no effect.
To be more specific, suppose the X's refer to a control group and the Y's
to a treatment group. Given the N units in the experiment, if any set of m
of them is as likely as any other set of m to be the control group, and each
unit would yield the same measurement whether treated or not, then the
probability of rejection by the median test is the usual null probability for
2 x 2 tables obtained from the hypergeometric distribution. The randomiza-
tion does not guarantee that the level is preserved in the corresponding
confidence procedure, however, in the absence of some property such as no
interaction between treatment and units, or the shift assumption. (See
Sect. 2, Problem 21, and for further detail and discussion, Sect. 7 of Chap. 2,
and Sect. 9 of Chap. 8.) The same statements apply to any other quantile
test.
Another kind of weakening is possible for all one-tailed, two-sample
sign tests, whether a quantile test or one with fixed ξ. Suppose the X_i and
Y_j are independent, but not necessarily identically distributed, and consider
any one-tailed, two-sample sign test with rejection region in the upper tail
of A (too many X's are in the "below" category). This test rejects with prob-
ability at most α (the exact level when all observations are independently,
identically distributed) if

P(X_i < z) ≤ P(Y_j < z) for all z, i, and j, (3.12)

and rejects with probability at least α if

P(X_i < z) ≥ P(Y_j < z) for all z, i, and j. (3.13)
(Compare Eqs. (3.21) and (3.22), Chap. 3.) Under (3.12), any X_i is less likely
to be to the left of any specified point than any Y_j is, so that the distribution
of every X_i is "to the right" of ("stochastically larger" than) the distribution
of every Y_j. See Fig. 2.1(a) for a graphic illustration of this relationship.
Similarly (3.13) means that all the X's are "stochastically smaller" than all
the Y's. Since the probability of rejection is at most α when (3.12) holds, the
null hypothesis could be broadened to include (3.12) without affecting the
significance level. Similarly it is natural to broaden the alternative to include
(3.13); since the probability of rejection is at least α when (3.13) holds, the
test is by definition unbiased against (3.13).
On the other hand, against certain alternatives under which the X's are
drawn from one population and the Y's from another population with a
different median, the power of the median test is less than its level. The test
is then biased against this alternative; this is true even for the one-tailed
test in the indicated direction (Problem 22).
*The statements of the next-to-last paragraph are consequences (Problem
23) of the following fact, which is of interest in itself. Suppose a test φ is
"increasing" in the Y direction in the sense that, if φ rejects for X_1, ..., X_m,
Y_1, ..., Y_n and any X_i is decreased or Y_j increased, φ still rejects. (The one-
tailed two-sample sign tests rejecting if A is too large have this property, by
Problem 23.) Then the probability that φ will reject increases (not necessarily
strictly) when the distribution of X_i is moved to the left or the distribution of
Y_j is moved to the right, that is, when the c.d.f. F_i of X_i is replaced by F_i*
where F_i*(x) ≥ F_i(x) for all x, or when the c.d.f. G_j of Y_j is replaced by G_j*
where G_j*(y) ≤ G_j(y) for all y. Formally, for randomized tests φ, we have the
following theorem.
Theorem 3.1. If X_1, ..., X_m, Y_1, ..., Y_n are independent with c.d.f.'s F_i, G_j and
φ(X_1, ..., X_m, Y_1, ..., Y_n) is a randomized test function which is decreasing
in each X_i and increasing in each Y_j, then the probability of rejection
increases (not necessarily strictly) when any F_i is replaced by F_i* with
F_i* ≥ F_i or any G_j is replaced by G_j* with G_j* ≤ G_j.
To carry out the rank sum test procedure, we first combine the m X's and
n Y's into a single group of m + n = N observations, which are all different
because of the continuity assumption. We then arrange the pooled observa-
tions in order of magnitude, but keep track of which observations are from
which sample. We assign the ranks 1,2, ... , N to the combined ordered
observations, with 1 for the smallest and N for the largest.
The data shown in Table 4.1 have been ranked by this procedure. (Often
in practice the first row is omitted and the values which are from say the X
sample are underlined or similarly indicated.)
The rank sum can be defined as the sum of the ranks in either sample;
we use Rx to denote the sum of the ranks of the X observations and Ry for
Table 4.1a
Sample  Y     X     X     Y     Y     X     X     Y     X     Y
Value   1.25  1.75  3.25  4.25  5.25  6.25  6.75  7.25  9.00  10.00
Rank    1     2     3     4     5     6     7     8     9     10
a These data are from United States Senate [1953], Hearings Before the Select
Committee on Small Business, Eighty-third Congress, First Session on Investiga-
tion of Battery Additive AD-X2 (March 31, June 22-26). The X's and Y's refer to
untreated batteries and batteries treated with AD-X2 respectively. The values
given here were obtained by averaging the ranks given on performance of the
batteries by two representatives of the manufacturer. The assumptions of the
beginning of this section are not satisfied, but the batteries for treatment were
selected randomly from the 10 batteries, and this also validates the test, as dis-
cussed in Section 4.6.
the sum of the ranks of the Y observations. Since Rx + Ry is the sum of all
the ranks, 1 + 2 + ... + N = N(N + 1)/2, we have

Rx + Ry = N(N + 1)/2, (4.1)

and the tests based on Rx and Ry are therefore equivalent. In Table 4.1 we
have

Rx = 2 + 3 + 6 + 7 + 9 = 27,
Ry = 1 + 4 + 5 + 8 + 10 = 28,

and

Rx + Ry = 55.
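The ranking computation is mechanical; a minimal sketch (ours), valid when
there are no ties:

    def rank_sums(x, y):
        # Pool the samples, rank from 1 (smallest) to N (largest) assuming
        # no ties, and return (Rx, Ry).
        pooled = sorted([(v, 'X') for v in x] + [(v, 'Y') for v in y])
        rx = sum(rank for rank, (v, lab) in enumerate(pooled, start=1)
                 if lab == 'X')
        N = len(pooled)
        return rx, N * (N + 1) // 2 - rx

    # For the data of Table 4.1 this returns (27, 28), as found above.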
U_ij = 1 if Y_j < X_i,
U_ij = 0 if Y_j > X_i. (4.5b)

The Wilcoxon rank sum statistic is represented naturally as

Rx = Σ_{k=1}^N k I_k, (4.6a)

where

I_k = 1 if the observation with rank k is an X,
I_k = 0 if the observation with rank k is a Y. (4.6b)
The exact null distribution of any of these rank sum statistics is based on the
fact stated before that, under the null hypothesis of identical distributions,
the X ranks constitute a random sample of size m drawn without replacement
from the first N integers, where N = m + n. Equivalently, all arrangements
of the m X's and n Y's in order of size are equally likely. With C(N, m) =
N!/(m! n!) possible arrangements, each one occurs with probability 1/C(N, m).
This fact determines the null distribution of Rx, and by (4.1), (4.3) and (4.4),
also that of Ry, U and U'.
The direct method of generating the null distribution of say Rx is to list
all possible arrangements, calculate the value of Rx for each, and tally the
results. Then

P(Rx = t) = v(t)/C(N, m),

where v(t) is the number of arrangements for which the sum of the X ranks
equals t. For tabulation it is more efficient to use an easily developed recur-
sive technique (Problem 30). Fix and Hodges [1955] present a more sophisti-
cated approach, tabulating related quantities more compactly than is possible
for the distribution itself (Problem 32).
The mean and variance of these rank sum statistics under the null hypo-
thesis are most easily evaluated by using the fact that Rx is the sum of m
observations drawn without replacement from the finite population consist-
ing of {1, ..., N}. The mean and variance of this population (Problem 33)
are (N + 1)/2 and (N² - 1)/12. The mean and variance (calculated using
the finite-population correction factor) of the sample sum are therefore

E(Rx) = m(N + 1)/2,  var(Rx) = mn(N + 1)/12.

The possible values of Ry are the integers from n(n + 1)/2 to
n(2N - n + 1)/2 (Problem 36). The distributions
of U, U', Rx and Ry are all symmetric about their respective means for any
m and n (Problem 37).
Since all the rank sum statistics are equivalent, a table of the null dis-
tribution is needed for only one of these statistics. Table F at the back of the
book gives the cumulative tail probabilities of Rx for m ≤ n ≤ 10. Only the
smaller tail probability is given; each entry is both a lower tail probability
for Rx ≤ m(N + 1)/2 and a symmetrically equivalent upper tail probability
for Rx ≥ m(N + 1)/2. In order to use this table, the sample with fewer
observations should be labeled the X sample. More extensive tables are
published in Harter and Owen [1970].
For m and n large, Rx, Ry, U and U' are all approximately normally
distributed under the null hypothesis [Mann and Whitney, 1947], with the
means and variances given above. Small tail probabilities are generally
overestimated by the normal approximation. For sample sizes both smaller
than 20, for example, it is better to omit the continuity correction of 1/2 in
such a way as to reduce the tail probability when the standardized normal
variable is greater than 2 or, for comparison with critical values from normal
tables, when the one-sided significance level is smaller than 0.025. See Jacob-
son [1963] for more detail.
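For small m and n, the exact null distribution can be generated by direct
enumeration as described above and compared with the normal
approximation; a sketch (ours):

    from itertools import combinations
    from statistics import NormalDist

    def exact_left_tail(m, n, r):
        # Exact P(Rx <= r): all C(N, m) sets of X ranks are equally likely.
        N = m + n
        arrangements = list(combinations(range(1, N + 1), m))
        return sum(sum(s) <= r for s in arrangements) / len(arrangements)

    def normal_left_tail(m, n, r):
        # Normal approximation with a continuity correction of 1/2 (see the
        # caveats in the text about omitting it for small samples).
        N = m + n
        mean, var = m * (N + 1) / 2, m * n * (N + 1) / 12
        return NormalDist().cdf((r + 0.5 - mean) / var ** 0.5)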
carried out using Rx. If c is that number from Table F such that P(Rx ≤ c) =
α, then by (4.3) we have
[Figure 4.1]
sets of observations, along with general expressions for the mean and
variance of the test statistic.)
In order to approximate the power of the rank sum test against a specified
alternative distribution, it will be convenient to introduce the probabilities

p1 = P(X_i > Y_j),
p2 = P(X_i > Y_j and X_k > Y_j),
p3 = P(X_i > Y_j and X_i > Y_l),

for all i, j, k, l with i ≠ k and j ≠ l. Hence p1 is the probability that an X
variable exceeds a Y variable; p2 is the probability that two different X
variables both exceed a single Y variable; and p3 is the probability that an
X variable exceeds both of two different Y variables. Integral expressions for
these three probabilities are given in Problem 41.
For any X and Y distributions, the moments of U needed for standardiza-
tion can be expressed in terms of these probabilities as

E(U) = mnp1,
var(U) = mnp1(1 - p1) + mn(m - 1)(p2 - p1²) + mn(n - 1)(p3 - p1²).

We will now prove these results using the expression for U given in (4.5a),
where U is the sum of mn indicator variables U_ij defined by (4.5b) (U_ij = 1
if X_i > Y_j). These U_ij are Bernoulli random variables, identically distributed
although not all independent. In terms of the probability p1, their mean is
E(U_ij) = p1, so that

E(U) = Σ_{i=1}^m Σ_{j=1}^n E(U_ij) = mnp1.

In terms of the probabilities p1, p2 and p3, the second-order moments of the
U_ij are

E(U_ij²) = p1,
E(U_ij U_kj) = p2 for i ≠ k,
E(U_ij U_il) = p3 for j ≠ l,
E(U_ij U_kl) = p1² for i ≠ k and j ≠ l. (4.21)
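A quick Monte Carlo check (ours, purely illustrative) of the relation
E(U) = mnp1, with X's drawn from a normal distribution shifted by an
arbitrary amount relative to the Y's:

    import random

    def check_EU(m, n, shift, reps=5000, seed=1):
        # Compare the average of U with m*n*p1_hat, where p1_hat estimates
        # p1 = P(X > Y) from independent draws.
        rng = random.Random(seed)
        u_avg = p1_hat = 0.0
        for _ in range(reps):
            x = [rng.gauss(shift, 1) for _ in range(m)]
            y = [rng.gauss(0, 1) for _ in range(n)]
            u_avg += sum(xi > yj for xi in x for yj in y) / reps
            p1_hat += (rng.gauss(shift, 1) > rng.gauss(0, 1)) / reps
        return u_avg, m * n * p1_hat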
4.5 Consistency
The null distribution of the rank sum test statistic is derived under the
assumption that the X ranks are equally likely to be any set of m out of the
integers 1, 2, ..., N. This assumption is satisfied if the X's and Y's are
drawn from the same population. If the populations differ in any way, how-
ever, the level of the test is ordinarily affected. In particular, the assumption
that p1 = 1/2, where p1 = P(X > Y), is not sufficient to guarantee the level.
Even for populations which are symmetric about the same point and have
the same shape, the level may be seriously affected if their variances differ.
(See Pratt [1964] and also Problem 52.) The same observation applies, a
fortiori, to the corresponding confidence procedure.
4.7 Ties
Two or more observations which are equal in value are called tied. The
ranks of tied observations have not yet been defined, and hence the rank sum
test cannot be applied in the presence of ties without some further specifica-
tion. We have avoided this difficulty so far by assuming continuous dis-
tributions and hence zero probability of a tie. In practice, we must have a
method of dealing with ties because of discontinuous distributions or un-
refined measurements. The discussion here will parallel but abridge the
corresponding discussion of zeros and ties in Sect. 6 of Chap. 3; in particular,
"zeros" have no counterpart here.
The confidence procedures (given in Sect. 4.3) for the amount of a shift
depend only on the differences Y_j − X_i. Even when some of these differences
are tied, we can determine the (k + 1)th from the smallest difference and this is
still a lower confidence bound L for the shift μ, with k defined as before. How-
ever, the exact confidence level now depends on whether L is included in the
confidence interval or not. More precisely,
    P(L ≤ μ) ≥ 1 − α ≥ P(L < μ),                          (4.23)

where 1 − α is the exact confidence level in the continuous case (Problem
56; see also Problems 62 and 107). A corresponding statement holds for an
upper confidence bound, and for two-sided confidence limits. Thus the
confidence procedures of Sect. 4.3 can still be used, but now it makes a
theoretical difference whether or not the endpoints are included in the stated
4 PlOccdures Based on Sums of Ranks 259
interval. Ordinarily this is of no practical consequence and the issue need not
be resolved.
Since the confidence procedures are still applicable, they could be used
to test the null hypothesis that the amount of the shift is μ = 0, which is
equivalent to the hypothesis of identical populations. If 0 is not an endpoint
of the confidence interval, the corresponding test rejects or "accepts" the
null hypothesis according as 0 is outside or inside the confidence interval.
By Problem 57, this is equivalent to rejecting ("accepting") if the ordinary
rank sum test rejects ("accepts") no matter how the ties are broken. If 0
is an endpoint of the confidence interval, it may be sufficient to state this
fact and not actually carry the test procedure further. Another possibility
is to be "conservative" and "accept" the null hypothesis in all borderline
cases; this amounts to breaking the ties in the direction of "acceptance"
and corresponds to including the endpoint in the confidence interval state-
ment. When many ties are likely, however, both these possibilities may reduce
the power considerably.
Two other basic methods of handling ties are the average rank method
and breaking the ties, which we now discuss. Examples will be given shortly.
The average rank (or midrank) method assigns to each member of a
group of tied observations the simple average of the ranks they would have
if they were not tied. The rank sum statistic is then computed as before, but
its null distribution is not the same as for observations without ties. The
exact distribution conditional on the ties can be enumerated, or a normal
approximation can be used (see below). The average rank procedure is
equivalent to defining the Mann-Whitney U statistic as the number of
(X_i, Y_j) pairs for which X_i > Y_j plus one-half of the number for which
X_i = Y_j, because U and Rx continue to be related by Equation (4.3) when
U is defined in this way and Rx is computed from the average ranks (Problem
58).
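To make these tie conventions concrete, the following sketch (ours; it uses scipy's rankdata for midranks) computes Rx from average ranks and U with half-credit for ties, and confirms the relation Rx = U + m(m + 1)/2 of Problem 58 on the data of Table 4.2:

    from scipy.stats import rankdata

    x = [0, 1, 2, 3]                         # X observations of Table 4.2
    y = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]       # Y observations of Table 4.2
    ranks = rankdata(list(x) + list(y))      # midranks for tied values
    rx = ranks[:len(x)].sum()
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    m = len(x)
    print(rx, u + m * (m + 1) / 2)           # both equal 19.5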
Methods which break the ties assign distinct integer ranks to the tied
observations. If the ties are broken randomly, the usual null distribution of
the rank sum statistic is preserved. Another possibility already mentioned
is to break the ties in favor of acceptance.
Table 4.2
Observation    0   1   1   1   2   2   2   3    3    3    3    4     4     5
Sample         X   X   Y   Y   X   Y   Y   X    Y    Y    Y    Y     Y     Y
Average Rank   1   3   3   3   6   6   6   9.5  9.5  9.5  9.5  12.5  12.5  14
can again be determined from the fact that each possible set of m ranks is
equally likely to be that belonging to the X observations, but this distribution
is now conditional on the positions of the ties in the combined sample, or
equivalently, on the average ranks present. For the data in (4.24), we have
m = 4 and n = 10 so that there are (N choose m) = (14 choose 4) = 1001 ways to select a set of
m = 4 ranks out of the N = 14 ranks. Table 4.2 shows that there are only 6
different average ranks present, in a pattern consisting of one 1, three 3's,
three 6's, four 9.5's, two 12.5's, and one 14. Of the 1001 possible selections of
m = 4 average ranks given this pattern, one selection (three 3's and one 1)
gives Rx = 10, nine selections give Rx = 13, three give Rx = 15, etc. A
portion of the lower tail of the distribution of Rx, given this pattern of ties,
is shown in Table 4.3. Note that the distribution is very uneven and lumpy,
as is frequently the case when many ties occur.
For the data in (4.24), the X rank sum can be found from Table 4.2 as
Rx = 19.5. From the distribution in Table 4.3, we see that under the null
hypothesis, given the ties observed, the exact probability of an X rank sum
as small as or smaller than that observed is P(Rx ≤ 19.5) = 84/1001 = 0.084.
Enumeration of the exact distribution of Rx based on average ranks in the
presence of ties can be lengthy, but it is not difficult to carry out by computer.
In favorable circumstances, the distribution of Rx given in Table F could be
used, but it applies exactly only when no ties are present; in general, it
should not be used when the average rank method is applied in examples with
many ties. For the data in (4.24), Table F gives the P-values P(Rx ≤ 19) =
0.071 and P(Rx ≤ 20) = 0.094; these results, although close to the true P-
value 0.084 found in the paragraph above, are not correct, and in other
examples the discrepancy may be greater. Another possibility is to use the
normal approximation once the relevant mean and variance conditional
on the ties observed are obtained (Problem 59). This procedure may also
be very inaccurate in the presence of many ties because the exact distribution
is generally lumpy, as noted for Table 4.3. Simulation could also be used.
Lehman [1961] performed an interesting but limited comparison between
the exact and approximate distributions with ties for the case m = n = 5.
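The following sketch (our own; standard library only) carries out the enumeration for the tie pattern of Table 4.2, reproducing the conditional distribution tabulated in Table 4.3 below:

    from collections import Counter
    from itertools import combinations

    # Midranks for the tie pattern of Table 4.2:
    # one 1, three 3's, three 6's, four 9.5's, two 12.5's, one 14.
    midranks = [1] + [3] * 3 + [6] * 3 + [9.5] * 4 + [12.5] * 2 + [14]
    m = 4
    counts = Counter(sum(c) for c in combinations(midranks, m))
    total = sum(counts.values())             # 1001 = (14 choose 4)
    cum = 0
    for r in sorted(counts):
        cum += counts[r]
        if r <= 19.5:
            print(r, counts[r], cum / total) # cumulative reaches 84/1001 at 19.5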
Table 4.3
r                 10  13  15  16  16.5  18  18.5  19  19.5  21
1001 P(Rx = r)     1   9   3   9    12   9     4   1    36   3
Table 4.4
Observation  0  1  1  1  2  2  2  3   3  3   3   4   4   5
Sample       X  X  Y  Y  X  Y  Y  X   Y  Y   Y   Y   Y   Y
Ranks (i)    1  2  3  4  5  6  7  8   9  10  11  12  13  14
Ranks (ii)   1  4  2  3  7  5  6  11  8  9   10  12  13  14
Table 4.5
Observation  0  1  1  1  2  2  2  3  3  3  3  4  4  5
Sample       X  Y  Y  Y  X  X  X  Y  Y  Y  Y  Y  Y  Y
Notice in Table 4.5 that all the ties are within samples. In such a case,
any method of breaking the ties gives the same X ranks, namely 1, 5, 6, 7,
and the same rank sum Rx = 19. If we ignore the ties and use Table F, the
probability of an even smaller rank sum is P(Rx < 19) = 0.053, and of one
as small or smaller is P(Rx ≤ 19) = 0.071. At the one-sided level 0.05, the
populations would not be judged significantly different when the ties are
broken, no matter how they are broken; tiebreaking leads to no ambiguity.
On the other hand, if the average rank method is used on the data in
Table 4.5, the rank sum is again Rx = 19, but the null distribution of the
average rank test statistic given the ties observed is that in Table 4.3, not
Table F. Table 4.3 gives P(Rx ≤ 19) = 0.048, so that the average rank test
judges this sample as significant at the 0.05 level, the opposite conclusion
from tie breaking.
The null distribution in Table 4.3 is correct only if the average rank test
is used on all samples with ties in the same positions. If the average rank
method would be used in some cases, such as the data in (4.24), and if the
null distribution would be calculated conditional on the pattern of ties
observed, then the average rank method must be used in all cases, including
those where the tie breaking is unambiguous, such as the data in Table 4.5.
This example shows that, unfortunately, trying to obtain the best of both
worlds affects the level of the average rank procedure.
The "conservative" procedure, that is, breaking the ties and choosing
the value of the test statistic that is least favorable to rejection, satisfies all of
these requirements. However, the true significance level is unknown and
may sometimes be much smaller than the nominal level, resulting in a con-
siderable reduction in power over the average rank procedure.
Breaking the ties at random permits use of the ordinary tables and
satisfies all of the requirements above (Problem 62). However, the introduc-
tion of extraneous randomness in an artificial way is objectionable in itself,
and presumably reduces the power.
The confidence bounds for an assumed shift μ corresponding to any
method of breaking ties are those obtained in Sect. 4.3. Whether or not the
confidence bounds are included in the confidence interval depends on how
the ties are broken. The confidence regions corresponding to the average
rank procedure may be different, although they are also intervals (Problem
63).
There are situations in which one is interested in the probability that a ran-
domly selected member of the X population will exceed an independent,
randomly selected member of the Y population. This probability is the
parameter p_1 = P(X > Y) defined earlier by (4.12). Suppose, for example,
that X is the strength of a manufactured item and Y is the maximum stress
to which it will be subjected when installed in an assembly [Birnbaum,
1956]. If X > Y, the component will not fail in use. In such a case, p_1 is a
parameter of clear economic importance. It might also be of interest in non-
economic contexts. In a comparison of two populations, it is frequently
desirable to say something about how much they differ, in addition to, or
instead of, performing a test of the hypothesis that they are the same. The
difference between the population means or medians, and the amount
of the shift μ if the shift assumption is made, are defined only if the difference
between two items can be measured on some numerical scale. A point
estimate or confidence interval for these quantities has meaning only to the
extent that the scale has meaning. However, in the absence of such a meaning-
ful scale, as long as the items can be ranked, p_1 is still meaningful. Accordingly,
we will now discuss point and interval estimation of p_1, but again under the
assumption that ties occur with probability zero.
Since p_1 is the probability that an X exceeds a Y, a natural estimator is
the proportion of (X_i, Y_j) pairs for which X_i > Y_j, that is, U/mn. This estima-
tor has expected value p_1 by (4.15), and the variance can be found from (4.16).
Hence U/mn is unbiased for p_1, and it is consistent (Problem 65), that is,
for every ε > 0,

    P(|U/mn − p_1| ≥ ε) → 0   as m and n → ∞.
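A simulation along these lines (ours; the exponential populations are an arbitrary choice) illustrates the consistency, with the spread of U/mn shrinking as m and n grow:

    import numpy as np

    rng = np.random.default_rng(1)
    # For X ~ exponential with mean 3 and Y ~ exponential with mean 1,
    # p_1 = P(X > Y) = 3/(3 + 1) = 0.75.
    for m, n in [(5, 5), (20, 20), (80, 80)]:
        est = []
        for _ in range(500):
            x = rng.exponential(3.0, m)
            y = rng.exponential(1.0, n)
            est.append((x[:, None] > y[None, :]).mean())   # U/mn
        print(m, n, round(np.mean(est), 3), round(np.std(est), 3))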
This bound on the variance can be made sharper if the class of distributions
is restricted. For example, if the X population is stochastically smaller than
the Y, that is, F(t) ≥ G(t) for all t, then the variance satisfies (Birnbaum and
Klose [1957]; Rustagi [1962])

    var(U) ≤ mn[(1 − 2p_1)^{3/2}(2m − n − 1) + (n − 2m + 1)
                + 3p_1(2m − 1) − 3p_1^2(m + n − 1)]/3     (4.27)

for m ≤ n, and similarly for m > n with m and n interchanged in (4.27). If
var(U) is replaced by either of these upper bounds, the right-hand side of
(4.25) still depends on p_1, so that an interval for p_1 is not immediately
obtained. The inequalities resulting in (4.25) could be solved for p_1 (Problem
69), or the estimate U/mn could be substituted for p_1 on the right-hand side
of (4.25) to produce endpoints which do not involve p_1.
PROOF. The inequality in (4.26) follows from (4.22) and the inequalities
below. For i ≠ k, j ≠ l,
In obtaining (4.28) we used the fact that the three events (X_i > Y_j and
X_k > Y_j), (X_i < Y_j and X_i < Y_l), and (X_k < Y_j and X_i > Y_l) are mutually
exclusive, and hence the sum of their probabilities is at most 1. For m ≤ n,
say, we write Equation (4.22) as
and substitute (4.18), (4.21), (4.28) and (4.30) to obtain the desired result. □
Sample   Y     X     X     Y     Y     Y     X     Y     X     Y
Value    1.25  1.75  3.25  4.25  5.25  6.25  6.75  7.25  9.00  10.00
Rank     1     2     3     4     5     6     7     8     9     10
The null distribution of the sum of scores test statistic can be determined
by enumeration in a manner analogous to that described for Rx in Sect. 4.2,
since under the null hypothesis the X scores constitute a random sample
of m scores drawn without replacement from the N available. A table of
P-values could then be constructed for any particular set of constants c_k.
Alternatively, a normal approximation could be used, by standardizing with
the mean and variance given in Problem 77a.
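Both the enumeration and the moment formulas are easy to program; the sketch below (ours, standard library only) checks the mean and variance of Problem 77a against a brute-force enumeration for the rank sum scores c_k = k:

    from itertools import combinations

    def null_moments(scores, m):
        # Mean and variance of the sum of m X scores under the null hypothesis,
        # by the formulas of Problem 77a and by brute-force enumeration.
        N = len(scores)
        n = N - m
        cbar = sum(scores) / N
        mean = m * cbar
        var = m * n * (sum(c * c for c in scores) / N - cbar ** 2) / (N - 1)
        sums = [sum(c) for c in combinations(scores, m)]
        emp_mean = sum(sums) / len(sums)
        emp_var = sum((s - emp_mean) ** 2 for s in sums) / len(sums)
        return (mean, var), (emp_mean, emp_var)

    print(null_moments(list(range(1, 9)), 3))   # both pairs are (13.5, 11.25)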
For a test of the null hypothesis that the Y population is the same as the
X population except for a shift by the amount μ, the foregoing test is applied
to X_1, ..., X_m, Y_1 − μ, ..., Y_n − μ. The set of all values of μ which would
be "accepted" when so tested forms a confidence region for the amount of
the shift, under the shift assumption. The confidence region will
be an interval if the c_k form a monotone sequence (c_{k+1} − c_k has the same
sign for all k), and each confidence bound will be one of the mn differences
Y_j − X_i (see Sect. 6 and Problem 76).
The general sum of scores statistic can be written in a form analogous to
(4.6) as
(5.1a)
where
Similar statistics analogous to Ry are also easily defined. For any particular
set of scores c_k, the sum of scores statistics for the two samples are again
linearly related and hence equivalent as test statistics.
Many different two-sample rank tests are of this general type, including
all the ones we have studied so far in this chapter. If c_k = k for k = 1, ..., N,
the sum of scores test is simply the rank sum test. If c_k = 1 for k ≤ N/2 and
and between 4.7% (43/924) and 5.3% inclusive. Furthermore, all three
statistics are almost monotonic functions of one another, although frequently
one stays constant while another changes. In other words, in the portion of
the tail listed for all three statistics (a little over 5%), the possible sets of X
ranks can be put in an order such that they enter the critical region in almost
this order for all three tests, the only difference being how many enter at one
time. The one exception is where −T_2 = 2.79 and −T_3 = 2.55. For larger
sample sizes there would be more differences, although those between T_2
and T_3 are always minor.
other words, the X's and Y's are all independent, the X's are identically dis-
tributed, the Y's are identically distributed, the null hypothesis is that the
X's and Y's have the same distribution and the alternative is that they do not.
Let g be any strictly increasing function. If X_1, ..., X_m, Y_1, ..., Y_n satisfy
the null hypothesis, then so also do g(X_1), ..., g(X_m), g(Y_1), ..., g(Y_n), and
the same applies to the alternative hypothesis. Accordingly, we can "in-
voke the principle of invariance" and require that a test treat X_1, ..., Y_n
in the same way as g(X_1), ..., g(Y_n). If this is required for all strictly in-
creasing functions g, then any two sets of observations with the same X
ranks and Y ranks must be treated alike, because any set of observations can
be carried into any other set with the same ranks by such a g (Problem 89).
In short, tests based on the ranks of the observations are the only tests which
are invariant under all strictly increasing transformations g.
The same argument applies to other null and alternative hypotheses of
the sort we have been considering, provided only that all strictly increasing
transformations g carry null distributions into null distributions and alter-
natives into alternatives. This holds, for instance, if the earlier alternative
hypothesis is tightened to require that the X's be stochastically larger than
the Y's (Sects. 3.8 and 4.6), or relaxed to permit the X's or Y's or both to be
not necessarily identically distributed (Problem 90). It does not hold under
the shift assumption, however (Problem 90d).
Arguments were given earlier for restricting consideration to permutation
invariant procedures, that is, for excluding procedures which depend on the
order of the X's or of the Y's separately. Applying the argument for permuta-
tion invariance, along with the argument for procedures which are invariant
under transformations by any strictly increasing function, that is, rank tests,
leads to restricting consideration to permutation invariant rank tests. These
tests depend only on the ranks of the X's and Y's in the combined sample,
without regard to their order within the separate samples. The null distribu-
tions of their test statistics can be generated by taking as equally likely the
(N choose m) separations of the ranks 1, ..., N into m X ranks and n Y ranks.
These procedures can also be defined in terms of the following indicator
variables: I_j = 1 if the jth smallest observation in the combined sample is an X,
and I_j = 0 if it is a Y, for j = 1, ..., N.
8.1 Most Powerful and Locally Most Powerful Rank Tests Against
Given Alternatives
Let r_1, ..., r_m, r'_1, ..., r'_n be the respective ranks corresponding to the
observations X_1, ..., X_m, Y_1, ..., Y_n after they are pooled and arranged
from smallest to largest. Thus r_i is the rank of X_i and r'_j is the rank of Y_j in
the combined sample. If we distinguish different orders of the X's and Y's
within samples, then there are N! possible arrangements of these ranks
(N = m + n). We could argue that, by sufficiency, it is not necessary to
distinguish order within samples, but omit this step because our derivations
will reach this conclusion automatically. We will derive the most powerful
tests among all rank tests, not merely among permutation-invariant rank
tests. We will see that the resulting test is permutation invariant, but proving
this first would not facilitate the derivation.
As usual, consider the null hypothesis of identical X and Y populations.
Under this hypothesis, the N! possible arrangements of the ranks are all
equally likely. By the Neyman-Pearson Lemma (Theorem 7.1 of Chap. 1),
it follows that, among rank tests at level α, the most powerful test against a
simple (completely specified) alternative K rejects if the probability under
K of the observed rank arrangement is greater than a constant k, and "ac-
cepts" if it is less than k.
(See Fig. 9.1, Chap. 3.) Specifically, consider the alternative that the X's
and Y's are drawn independently from logistic distributions with means
μ_1 and μ_2 and common scale parameter σ. We will show that any non-
randomized, rank sum test which is one-tailed in the appropriate direction
is the unique most powerful rank test at its level against every such alterna-
tive with 0 < (μ_2 − μ_1)/σ < ε, for some sufficiently small, positive ε. Among
rank tests it also uniquely maximizes the derivative of the power with respect
to θ = (μ_2 − μ_1)/σ at θ = 0. At other levels, the derivative of the power at
θ = 0 is maximized by a (randomized) rank sum test, though not always
uniquely, and maximizing the power near θ = 0 may require differential
treatment of different borderline rank arrangements (Problem 95).
These results will be derived by a general method so that tests with similar
properties can be derived for any one-parameter, one-sided alternative, and
for any alternative reducible to this form (by a strictly monotonic transforma-
tion of the data that is allowed to depend on nuisance parameters).
Although we have already used the term "locally most powerful," it has
not actually been defined here. One definition, consistent with that given
somewhat informally for the one-sample case in Sect. 9 of Chap. 3, is that a
test is locally most powerful among tests in some designated class against
some designated alternative if it has maximum power among all such tests
at all alternative distributions which are sufficiently close in some specified
sense to the null hypothesis. If we deal with an alternative which can be
indexed by a parameter e that equals 0 when the null hypothesis is true and
is positive under the alternative, and if we define "close" in the obvious
way, then this definition will require that the test be uniformly most powerful
in some interval (0, e) with e > O. Such a test will also maximize the slope of
the power function at e = O. The converse is not always true, however, and
is not automatic even when true; this difficulty is easily overlooked.
PROOFS. Consider a family of distributions F_θ for the X population and a
family G_θ for the Y population, both depending on a one-dimensional
parameter θ. Suppose that θ = 0 satisfies the null hypothesis of identical
distributions, that is, F_0 = G_0, and consider the alternative θ > 0.
Let r = (r_1, ..., r_m, r'_1, ..., r'_n) denote the rank arrangement and P_θ(r)
its probability under the alternative θ. Since all rank arrangements are
equally likely under the null hypothesis, it follows (Problem 96) that a
rank test maximizes the derivative of the power at θ = 0 if and only if it
is of the form

    reject if  (d/dθ) P_θ(r) |_{θ=0} > k,
    "accept" if  (d/dθ) P_θ(r) |_{θ=0} < k.               (8.3)
The value of k and the probability of rejection for r on the boundary given
by k need only be chosen so that the test has level exactly α. More circumspect
as θ → 0, it follows from (8.1) that the most powerful rank test against θ
is again of the form (8.3) for sufficiently small θ; however, if two or more
rank arrangements r lie on the boundary, those with larger values of the
remainder terms must be favored for the rejection region.
At certain levels there is no room for randomization or other choice at
the boundary, and the situation is simple. Specifically, a test of the form

    reject if  (d/dθ) P_θ(r) |_{θ=0} ≥ k                  (8.4)

among rank tests at its level, uniquely maximizes both the derivative of the
power at θ = 0 and the power against θ for all θ in some interval 0 < θ < ε
(Problem 96). □
    P_θ(r) = ∫_R Π_{i=1}^{m} f_θ(x_i) Π_{j=1}^{n} g_θ(y_j) dx_1 ··· dx_m dy_1 ··· dy_n,    (8.5)

where R is the region in (X_1, ..., X_m, Y_1, ..., Y_n)-space where the rank ar-
rangement is r and f_θ, g_θ are the densities of F_θ and G_θ. Assume it is legitimate
to differentiate (8.5) under the integral sign, and let

    h_1(x) = (∂/∂θ) log f_θ(x) |_{θ=0} = [1/f_0(x)] (∂/∂θ) f_θ(x) |_{θ=0},    (8.6)

with h_2(y) defined analogously in terms of g_θ. Then

    (d/dθ) P_θ(r) |_{θ=0} = (1/N!) { Σ_i E_0[h_1(Z_(r_i))] + Σ_j E_0[h_2(Z_(r'_j))] },    (8.7)
where, on the right-hand side, Z_(1) < ··· < Z_(N) are an ordered sample of
N from the distribution with density f_0 = g_0. Therefore the test (8.4), for
example, is equivalent to

    reject if  Σ_i E_0[h_1(Z_(r_i))] + Σ_j E_0[h_2(Z_(r'_j))] ≥ k,    (8.8)
    "accept" otherwise,
where the constant k may differ from formula to formula. This result may
also be written in the form

    reject if  Σ_{j=1}^{N} c_j I_j ≥ k,                   (8.9)
    "accept" otherwise,

where

    I_j = 1 if Z_(j) is an X,   I_j = 0 if Z_(j) is a Y,

and

    c_j = A { E_0[h_1(Z_(j))] − E_0[h_2(Z_(j))] } + γ,   j = 1, ..., N,    (8.10)
for arbitrary constants γ and A, A positive. Since the test in (8.9) is equivalent
to that in (8.4), (8.9) also has the property that, among rank tests at its level,
it uniquely maximizes the derivative of the power at θ = 0 and uniquely
maximizes the power against θ for all θ in some interval 0 < θ < ε. Notice
that the test is therefore based on a sum of scores in the sense of Sect. 5.
Similarly, (8.3) is equivalent to

    reject if  Σ_{j=1}^{N} c_j I_j > k,
    "accept" if  Σ_{j=1}^{N} c_j I_j < k,                 (8.11)

and at any level α, a rank test maximizes the derivative of the power at
θ = 0 if and only if it is of the form (8.11), where the constant k and the
probability of rejection when Σ_{j=1}^{N} c_j I_j = k are such that the test has level
exactly α.
    Similar statements hold for θ < 0, with rejection when Σ_{j=1}^{N} c_j I_j is too
small.
In the case of normal shift, say F_θ is N(θ, 1) and G_θ is N(0, 1), (8.6) becomes

    h_1(x) = x,   h_2(y) = 0,

and the Z_(j) are an ordered sample from N(0, 1). The c_j given by (8.10)
with A = 1, γ = 0 are therefore the expectations of the normal order statistics.
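These expected normal order statistics (the scores of the Fisher-Yates-Terry test of Sect. 5) are easy to approximate numerically; a Monte Carlo sketch (ours, assuming NumPy):

    import numpy as np

    def normal_scores(N, reps=200000, seed=0):
        # Estimate c_j = E[Z_(j)], the expected j-th normal order statistic.
        rng = np.random.default_rng(seed)
        samples = np.sort(rng.standard_normal((reps, N)), axis=1)
        return samples.mean(axis=0)

    print(np.round(normal_scores(5), 3))   # approx. -1.163, -0.495, 0, 0.495, 1.163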
and h_2(y) = 0. The Z_(j) are now order statistics from the logistic distribution
F_{0,1}, so
                                                          (8.13)
the answer is Yes. Specifically, given any arbitrary scores c_j, there is a one-
parameter family of alternatives given by densities f_θ, g_θ such that the c_j
satisfy (8.10) for some A > 0 and some γ. Given any c_j, therefore, there
exists a one-parameter alternative such that, among tests at the same level,
any test of the form (8.9) uniquely maximizes the derivative of the power at
θ = 0 and uniquely maximizes the power against θ for all θ in some interval
0 < θ < ε, and any test of the form (8.11) maximizes the derivative of the
power at θ = 0.
Roughly speaking then, the class of locally most powerful rank tests is
identical with the class of sum-of-scores tests. Intuitively, it may seem un-
reasonable that the c_j should be utterly arbitrary. The intuition not reflected
in the theoretical result is that, while any particular set of scores c_j is locally
most powerful against some alternative, this may not be at all like the kind of
alternative against which good power is desired. For example, if good
power is desired against alternatives which are one-tailed in the direction of
X larger than Y and natural in other respects, then (presumably) increasing
one of the X's should not decrease the test statistic. This implies that the
c_j should be monotonically increasing in j. In general, if the class of alter-
natives is sufficiently restricted, the locally most powerful rank tests will
not yield all sets of scores. The sets of scores which arise from restricted
classes of alternatives are complicated, however, and will not be discussed
here. We note, though, that stochastic dominance alone does not imply
monotonic c_j (Problem 103). The reader is referred to Uzawa [1960] for a
complete presentation of the conditions on the c_j which result when certain
restrictions are placed on the family of alternatives.
PROOF. We will show that every set of scores c_1, ..., c_N satisfies (8.10) for
some positive A, some γ, and some one-parameter family of the following
kind. Let the Y distribution be uniform on the interval (0, 1) and let

    ∫_0^1 h(x) dx = 0.                                    (8.16)
as then h(u) = q(u) + b will be bounded, will satisfy (8.16) for some b, and
will satisfy (8.17) with A = 1, γ = −b. We will use a polynomial as the
bounded function q(u). Now it is true (Problem 94d) that

    E(U_j^k) = j(j + 1) ··· (j + k − 1) / [(N + 1) ··· (N + k)],   j = 1, ..., N.

Therefore, for q(u) = Σ_k a_k (N + 1) ··· (N + k) u^k, Equation (8.18) becomes

    c_j = Σ_k a_k j(j + 1) ··· (j + k − 1),   j = 1, ..., N,    (8.19)
and it remains to find a_k which satisfy (8.19). The right-hand side of (8.19)
can be considered a polynomial in j. There is certainly a polynomial Σ_k b_k j^k
such that

    c_j = Σ_k b_k j^k,   j = 1, ..., N.                   (8.20)

Therefore (8.19) will be satisfied if the two polynomials are identical, that is,
if a_k can be chosen so that, as polynomials in j,

    Σ_k a_k j(j + 1) ··· (j + k − 1) = Σ_k b_k j^k.       (8.21)
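As a small worked instance (our own illustration, using the identity of Problem 104): for the scores c_j = j^2 we have

    c_j = j^2 = −j + j(j + 1),   j = 1, ..., N,

so (8.19) holds with a_1 = −1, a_2 = 1, and all other a_k = 0; the corresponding bounded polynomial is q(u) = −(N + 1)u + (N + 1)(N + 2)u^2.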
PROBLEMS
1. Let F and G be any two c.d.f.'s that satisfy the shift assumption F(x) = G(x + μ).
Show that they have the same central moments for all k, that is, E_F{[X − E_F(X)]^k} =
E_G{[X − E_G(X)]^k} for all k; in particular, their variances are equal.
2. Let X and Y be any two random variables with c.d.f.'s F and G respectively that
satisfy the shift assumption F(x) = G(x + μ). Show that μ is the difference in
their locations, no matter how location is measured; that is, μ is the difference
between means, modes, medians, p-points, etc., whenever these quantities exist
and are unique. What happens if the p-points are not unique?
3. For the Mosteller and Wallace data given in Section 3.1 find the one-tailed P-value
according to
(a) A two-sample sign test with fixed ξ = 17.0.
(b) A two-sample quantile test with t = 13.
(c) All two-sample sign tests with fixed ξ for 14 ≤ ξ ≤ 20.
(d) All two-sample quantile tests for 6 ≤ t ≤ 16.
4. Given a sample of m X's and n Y's with no ties, construct a path of N steps in the
plane, starting at the origin, such that the kth step is one unit to the right if the kth
smallest observation is an X and one unit upward if it is a Y.
(a) Show that there is a one-to-one correspondence between the possible paths
and the possible rank arrangements of the X's and Y's.
(b) Describe the acceptance and rejection regions of a two-sample quantile
test in terms of these paths.
    P = [ (w − 1 choose t − a − 1)(N − w choose n − t + a)
          + (w − 1 choose a′ − 1)(N − w choose m − a′) ] / (N choose m)
        if w ≤ t − a′ + a + 1,

and, with the same formula for P, is

    P + [ (w − 1 choose t − a′)(N − w choose n − t + a′ − 1)
          + (w − 1 choose a)(N − w choose m − a − 1) ] / (N choose m)
        if t − a′ + a + 2 ≤ w ≤ t.
(h) What can be said about the time required for a decision in parts (d) and (g)?
(i) How could P-values be defined when a curtailed sampling procedure is used?
*6. Given two mutually independent random samples of m X's and n Y's from popula-
tions with continuous c.d.f.'s F and G respectively, let U be the number of Y's
that are smaller than the median of the X sample. If the X sample is regarded as the
control group and the Y sample as the treatment group, the control median test
proposed by Kimball et al. [1957] (see also Gastwirth [1968]) is to reject the
null hypothesis F = G if U is small. Generalizing to an arbitrary quantile, let U be
the number of Y's which are smaller than the kth smallest of the m X observations.
(Hint: Problem 4 may be helpful.)
(a) Show that the null frequency function of U is

            Below    k        u        k + u
            Above    m − k    n − u    N − k − u
                     m        n        N

(c) Show that a one-tailed test based on U, which might be called a control
quantile test, always reaches the same decision as a suitably chosen, one-tailed,
two-sample quantile test, and vice versa, but this is not true for two-tailed
tests. (The lower tail is of primary interest.)
(d) What is the confidence bound corresponding to a one-tailed test based on U?
(e) In the situation of Problem 5, show that a decision of reject cannot be reached
early by a lower-tailed control quantile test, and a decision of "accept"
cannot be reached early by an upper-tailed control quantile test. When can
the other decisions be reached?
(f) Show that, if sampling is curtailed as soon as a decision can be reached, the
one-tailed control quantile and ordinary quantile tests coincide in all respects
(stopping time and decision).
*7. In the situation described in Problem 6, let V be the number of X's that are smaller
than the median of the Y sample. The first-median test, proposed by Gastwirth
[1968], is based on U if the median of the X sample is smaller than the median of
the Y sample, and on V otherwise. Hence it permits an earlier decision than the
control median test in some circumstances, especially in two-tailed tests. Gen-
eralizing to an arbitrary quantile, let X_(k) be the kth smallest among the m X's and
Y_(l) the lth smallest among the n Y's, and let U be the number of Y's smaller than
X_(k) and V the number of X's smaller than Y_(l). Note that U has the distribution
given in Problem 6a. (Hint: Problem 4 may also be helpful in answering the
questions below.)
(a) Show that V has the same null distribution as U but with different parameters.
What are the parameters?
(b) Let the test statistic be U if X_(k) < Y_(l) and V otherwise. Show that U is the
test statistic if and only if U ≤ l − 1 and V is the test statistic if and only if
V ≤ k − 1.
(c) Let the critical value be u if X_(k) < Y_(l) and v otherwise, where u ≤ l − 1 and
v ≤ k − 1; express the level of this test as a sum of two hypergeometric
probabilities. Such a test might be called a first-quantile test. (Gastwirth
[1968] considers the case u = v, m = 2k − 1, n = 2l − 1.)
(d) Show that each tail of a first-quantile test always reaches the same decision
as a suitably chosen, one-tailed, two-sample quantile test, and vice versa, but
the two-tailed, two-sample quantile tests reach the same decision as only
those first-quantile tests having k + u = l + v. (They reach "accept" decisions
at the same time, but the first-quantile tests reach reject decisions sooner.
See, however, parts (j) and (k) of this problem.)
(e) Find a convenient expression for the null conditional distribution of U given
that U is the test statistic.
(f) If k = l, show that the test statistic is min(U, V).
(g) Show that if m = 2k − 1 and n = 2l − 1 the test statistic is U with prob-
ability ½ under the null distribution.
(h) Show that if k = l and m = n then U and V are identically distributed and
the test statistic is U with probability ½ under the null distribution.
(i) In (h), the two tails of the test are alike, so a two-tailed P-value can be defined
naturally as twice the one-tailed P-value. Discuss the problems of defining
the P-value of first-quantile tests in other situations.
(j) In the situation of Problem 5, show that a decision of reject cannot be reached
early by a first-quantile test. When can a decision of "accept" be reached?
(k) If sampling is curtailed as soon as a decision can be reached, show that the
two-tailed, two-sample quantile tests coincide in all respects with the first-
quantile test having k + u = l + v.
(l) Show that, under the null hypothesis, the probability that a first-quantile
test requires w observations for a decision is

    P = [ (w − 1 choose l − 1)(N − w choose n − l)
          + (w − 1 choose k − 1)(N − w choose m − k) ] / (N choose m)
        if w ≤ u + v + 1,

    P + [ (w − 1 choose u)(N − w choose n − u − 1)
          + (w − 1 choose v)(N − w choose m − v − 1) ] / (N choose m)
        if u + v + 2 ≤ w < t.
8. Prove that, in each of the three situations described in Sect. 3.2, the conditional
distribution of A given t, or the distribution of A for fixed t, is the hypergeometric
distribution given in (3.1),
(a) If the N observations are drawn from the same population.
(b) If N given units are assigned to the two columns by randomization.
9. Given two samples, suppose that ties occur at the median of the combined sample
and all those values at the median are counted as "above."
(a) Under what circumstances will the margin t used in the two-sample median
test be unchanged?
(b) Show that ties reduce t otherwise.
(c) Show that, if the values at the median are omitted, the hypergeometric dis-
tribution still applies to the resulting 2 × 2 table under the null hypothesis.
10. Prove that the median test accepts (rejects) the hypothesis μ = μ_0 under the shift
assumption if the value μ_0 is interior (exterior) to the random interval (3.6).
11. (a) Verify the following results given by Mood [1950, p. 396]. Under the shift
assumption when s′ ≥ s and r ≥ r′
15. With the notation of Table 3.1 for the two-sample quantile test, show that a/m
and (t − a)/n are bounded away from 0 and 1 under suitable conditions on m, n, t,
and α.
*16. (a) Argue that if X has density f and c.d.f. F, then the order statistic X_(i) is approxi-
mately normal with mean F^{−1}(i/m) and variance i(m − i)/m³{f[F^{−1}(i/m)]}²
under suitable conditions on i, m, and F. (One such argument uses the normal
approximation to the binomial distribution of the empirical cumulative
distribution.)
(b) What kind of precise limit statements along these lines would you expect to
hold, under how broad conditions?
(c) Sketch a proof, perhaps under more restrictive conditions.
17. Show that (3.11) is a necessary and sufficient condition for two random variables
X and Y to have different medians, whatever median is chosen for each if the
median is not unique.
18. Show that the median test is consistent against alternatives with different popula-
tion medians
(a) Using the idea of consistent estimation and its relation to consistent tests
(see Sect. 3.4, Problems 26 and 28 of Chap. 3 and Problem 1 of Chap. 1).
(b) Using the consistency of the sign test with fixed ξ as in (3.11).
*19. (a) Show that the two-sample sign test with fixed ξ has the optimum properties
stated in Sect. 3.7 when the X's come from one population and the Y's from
another.
(b) Show that these properties also hold for suitable alternatives under which
neither the X's nor the Y's need be identically distributed.
*20. Show that the hypergeometric probability of a or less in Table 3.1 is less than the
binomial probability of a or less for the binomial parameters m and p = t/N if
a ≤ (mt/N) − 1, and greater than if a ≥ mt/N [Johnson and Kotz, 1969; Uhlmann,
1966].
*21. (a) Argue or show by example that in a treatment-control experiment, if the
treatment effect is not an identical shift in every unit, the level of the confidence
procedure corresponding to the median test may be less than nominal despite
random assignment of units to groups, even when both population distribu-
tions are symmetric and the treatment effect is defined as the difference
between the two centers of symmetry. (One type of example has one group
much less dispersed than the other and uses Problem 20.)
(b) In (a), show that if the populations have the same shape but different scales,
then asymptotically the confidence level is always less than nominal. Assume
the population has nonzero density at the median. (Hint: Use Problem 16.)
*22. Show that the lower-tailed median test is biased against the alternative that the
median of the X population is larger than the median of the Y population. (An
example of power less than α can be constructed as in Problem 21a if the difference
in medians is very small.)
23. Use Theorem 3.1 to show that a suitable median test rejects with probability at
most α under (3.12) and at least α under (3.13). Show that, more generally, the
test functions of one-tailed, two-sample sign tests rejecting for A large are increasing
in the Y direction.
24. Prove Theorem 3.1, that a test is monotonic in the distributions of the observations
if the critical function is monotonic in the observations.
25. (a) Use (3.8) to (3.10) to give an easily evaluated expression for the approximate
power of the one-tailed median test against the alternative that the X and Y
populations are both normal but with possibly different means and variances.
(b) Evaluate this power expression for m = 6, n = 9, α = 0.10, population means
differing by 1, and population variances in the ratios 0.25, 1, and 4.
26. A sample of size t is drawn without replacement from a finite dichotomous popula-
tion of size N which contains exactly m elements with the value 1 and n = N − m
with the value 0. For the sample observations X_1, ..., X_t, the number of 1's in the
sample is Σ_{i=1}^{t} X_i = tX̄.
(a) Show that
27. Let ξ be the median of a combined sample of m X and n Y random variables. Let
F_m(ξ) and G_n(ξ) denote the respective sample proportions of X's and Y's which
are smaller than ξ, so that the median test statistic can be written as A = mF_m(ξ).
(a) Show that the one-tailed median test that rejects for small values of A is
equivalent to a test rejecting for small values of F_m(ξ) − G_n(ξ).
(b) If n → ∞ while m remains fixed, show that ξ converges to the median of the Y
population, find the limiting value of the two-sample median test statistic,
and show that the median test approaches the one-sample sign test (Moses
[1964]).
*28. What happens to the two-sample quantile test if N → ∞ with t fixed?
29. Verify the linear relationships between the Mann-Whitney and Wilcoxon statistics
given in (4.3) and (4.4).
30. (a) If f_{m,n}(u) denotes the number of arrangements of m X and n Y variables such
that the value of U, the number of (X, Y) pairs for which X > Y, equals u,
show that

    f_{m,n}(u) = f_{m−1,n}(u − n) + f_{m,n−1}(u)

for all u = 0, 1, ..., mn and all positive integer-valued m and n, with the
following initial and boundary conditions for all m ≥ 0 and n ≥ 0:

    f_{m,n}(u) = 0 for u < 0;   f_{m,0}(u) = f_{0,n}(u) = 1 if u = 0, and 0 otherwise.

(b) What change is required in order to generate directly the null cumulative
probabilities F_{m,n}(u) = P(U ≤ u)?
(c) What change is required in order to generate the null probability function of
the X rank sum Rx? The null cumulative function of Rx?
31. Use the recursive method developed in Problem 30 to generate the complete
null distribution of U or Rx for all m + n ≤ 6. Check your results using Table F.
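A sketch of such a routine (ours; the recursion conditions on whether the largest observation is an X, which contributes n to U, or a Y, which contributes nothing):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def f(m, n, u):
        # f_{m,n}(u): number of arrangements of m X's and n Y's with U = u.
        if u < 0:
            return 0
        if m == 0 or n == 0:
            return 1 if u == 0 else 0
        return f(m - 1, n, u - n) + f(m, n - 1, u)

    m, n = 3, 3
    dist = [f(m, n, u) for u in range(m * n + 1)]
    print(dist, sum(dist))   # the counts sum to (6 choose 3) = 20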
*32. Derive results for the two-sample case that are analogs of the one-sample results
given in Problem 13 of Chap. 3. (Fix and Hodges [1955] give these two-sample
results and a table. Their work inspired Problem 13 of Chap. 3.)
33. Let R be uniformly distributed on the integers 1, 2, ..., N and let S be independently
uniformly distributed on (0, 1).
(a) Show that S has mean ½ and variance 1/12.
(b) Show that R + S is uniformly distributed on (1, N + 1) and hence has the
same distribution as NS + 1.
(c) Use (a) and (b) to show that R + S has mean N/2 + 1 and variance N²/12.
(d) Use (c) and (a) to show that R has mean (N + 1)/2 and variance (N² − 1)/12.
34. Use the null mean and variance of Rx given in (4.7) and (4.8) to verify the cor-
responding moments of Ry and U given in (4.9) and (4.10).
35. Show that the null distributions of U and U' are identical.
36. Show that the possible values of U, U′, Rx, and Ry are as stated in the paragraph
following (4.10).
37. Show that U and U′ are symmetrically distributed about mn/2, Rx is symmetric
about m(N + 1)/2, and Ry is symmetric about n(N + 1)/2.
38. (a) Show that the continuity correction for the approximate normal deviate of
the rank sum test statistic is √{3/[mn(N + 1)]}.
(b) Show that for 0.1 < m/N < 0.9, this continuity correction is less than
1/√(0.03N³) and hence less than 0.02 if N ≥ 44, less than 0.01 if N ≥ 70.
(c) Show that for 0.1 < m/N < 0.9, the continuity correction in the normal
approximation (3.2) for the median test is less than 1/(0.3√N) and hence
less than 0.02 if N ≥ 27778, less than 0.01 if N ≥ 111112.
39. Show that the value of k given in (4.11) and needed for the confidence bound is
one less than the rank (among all possible values of Rx) of the lower-tail critical
value at level α, and hence k + 1 is the rank.
40. Represent two samples of sizes m and n by a path as explained in Problem 4. This
path separates the rectangle with corners (0, 0), (0, n), (m, 0), and (m, n) into two
parts. Show that the upper left part has area U and the lower right part has area U′,
where U and U′ are the Mann-Whitney statistics.
41. Given continuous c.d.f.'s F and G, show that p_1, p_2, and p_3 as given in Equations
(4.12)-(4.14) can be written as follows, where dF(x) can be replaced by f(x) dx
if F has density f, and similarly for dG(y).
(a) p_1 = ∫ G(x) dF(x) = 1 − ∫ F(y) dG(y).
(b) p_2 = ∫ [1 − F(y)]² dG(y) = 2 ∫ [1 − F(x)]G(x) dF(x).
(c) p_3 = ∫ G²(x) dF(x) = 1 − 2 ∫ F(y)G(y) dG(y).
Problems 287
42. Calculate p_1, p_2, and p_3 defined in Equations (4.12)-(4.14) if X and Y are drawn
from identical populations, and substitute these results in (4.15) and (4.16) to
verify the results given in (4.10).
43. Derive the expression for var(U) in Equation (4.22) by using (4.18)-(4.21).
44. Natural estimators of p_1, p_2, and p_3, as defined in Equations (4.12)-(4.14), are the
corresponding proportions of sample comparisons which satisfy the respective
inequalities.
(a) Show that these estimators for p_1 and p_2 can be expressed as
51. Show that the rank sum test is consistent against stochastic dominance, and hence,
by Problem 50, is consistent against shifts.
52. Suppose the X and Y populations have the same median but, with high probability,
X is close to the median and Y is not. Show that
(a) In the notation of Sect. 4.4, p_1 = ½, p_2 = ½, and p_3 = ¼, approximately.
(b) E(U) = mn/2 and var(U) = m²n/4, approximately.
(c) As approximated, the mean is the same as under the null hypothesis, but the
variance is larger for n < 2m − 1, smaller for n > 2m − 1.
(d) The probability of rejection by the rank sum test must be greater than α if
n < 2m − 1, less if n > 2m − 1, for some levels α and some populations of
the type described.
*53. (a) What questions about the differences Y_j − X_i correspond to questions about
the Walsh averages in Problem 37 of Chap. 3?
(b) Investigate some of the questions raised in (a).
54. Prove that the rank sum test is unbiased against the alternative that the distribu-
tion of every X_i is stochastically smaller than the distribution of every Y_j.
55. Use Theorem 3.1 to show that the rank sum test rejects with probability at most α
under (3.12) and at least α under (3.13).
56. For a sample of X's and a sample of Y's, possibly with ties, let L be the (k + 1)th
smallest difference Y_j − X_i. Show that if the Y population differs from the X
population only by a shift μ, then P(L ≤ μ) ≥ 1 − α ≥ P(L < μ), where 1 − α is
the exact probability in the continuous case. (Hint: Consider breaking ties ran-
domly. See also Problem 107.)
57. Consider the confidence interval for shift corresponding to the rank sum test.
(a) Show that if 0 is not an endpoint, then all methods of breaking any ties lead
to the same decision, namely, to reject if 0 is outside and "accept" if 0 is
inside.
(b) Show that if 0 is an endpoint, then there are tied observations and breaking
ties one way leads to rejection, another way to acceptance.
58. Show that the value of Rx computed using the average rank procedure is equal to
U + m(m + 1)/2, where U is equal to the number of (X_i, Y_j) pairs for which
X_i > Y_j plus one-half of the number for which X_i = Y_j.
59. Show that when ties are handled by the average rank procedure, under H_0,
(a) The means of Rx and U are not affected by ties but the variance is reduced to

    var(Rx) = var(U) = mn[N(N² − 1) − Σ t(t² − 1)]/[12N(N − 1)],

where t is the number of observations tied at a given rank and the sum is over
all sets of tied observations (as in Problem 52 of Chap. 3).
(b) The distributions of Rx and U may not be symmetrical.
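The reduced variance in (a) is simple to compute; a sketch (ours, standard library only):

    from collections import Counter

    def tie_corrected_var(x, y):
        # Null variance of Rx (or U) under the average rank procedure:
        # mn[N(N^2 - 1) - sum t(t^2 - 1)] / [12N(N - 1)].
        m, n = len(x), len(y)
        N = m + n
        ties = sum(t * (t * t - 1) for t in Counter(list(x) + list(y)).values())
        return m * n * (N * (N * N - 1) - ties) / (12.0 * N * (N - 1))

    print(tie_corrected_var([0, 1, 2, 3], [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]))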
60. Show that conditions (i)(b) and (ii) in Section 4.7 are equivalent.
61. Show that the average rank test procedure satisfies (i)(b) and (ii) but not (i)(a)
and (iii) in Section 4.7.
*63. Show that the confidence regions for shift corresponding to the average rank
procedure are always intervals.
64. Show that the multiplicities of the average ranks and hence the exact pattern of
ties can be determined from the ranks without multiplicities. (For instance, the
average ranks 1, 3, 6, 9.5, 12.5, 14 can arise only from a combined sample with the
pattern of ties displayed in Table 4.2.)
65. Use the result in Problem 1 of Chap. 1 to show that U/mn is a consistent estimator
of p_1 = P(X_i > Y_j).
66. Show that U/mn is a minimum variance unbiased estimator of p_1 if the only
restriction is that the X and Y distributions are continuous.
*67. Under the assumption that the X and Y populations are normal with arbitrary
means, express the minimum variance unbiased estimator of p_1 as conveniently
as possible for the case of variances
(a) Known.
(b) Common but arbitrary (unknown).
(c) Arbitrary, possibly not all the same.
*68. (a) Show that if the X and Y populations are normal with common variance,
then for large m and n, the variance of Φ[(X̄ − Ȳ)/(s√2)] is approximately
½φ²(δ)(1/m + 1/n + δ²/N), where δ = (μ_X − μ_Y)/(σ√2).
(b) Compare the result in (a) to the variance of U/mn.
69. How might one find the values of p_1 for which (U − mnp_1)²/z² equals
(a) The right-hand side of (4.26)?
(b) The right-hand side of (4.27)?
*70. Show that the bound on var(U) given by (4.27) is never greater than that given by
(4.26). (Hint: The inequality to be proved is equivalent to [(1 − 2p)^{3/2} − 1 +
3p](2m − n − 1) ≤ 3p²(m − 1) for 0 < p < ½. You may wish to prove and use
1 − 3p ≤ (1 − 2p)^{3/2} ≤ 1 − 3p + 2p² for 0 < p < ½.)
71. Mr. Greenthumb has come to consult you about the following problem. To test
the effect of a certain type of fertilizer on the growth of spinach, he divided his
spinach field into 20 plots, picked ten plots at random and fertilized them, and left
the other ten unfertilized. Upon harvesting he obtained the following yields, in
bushels.

Unfertilized plots  6.1, 10.2, 8.7, 6.4, 7.3, 10.9, 7.7, 8.4, 9.0, 9.8
Fertilized plots    10.1, 11.2, 12.3, 9.2, 12.0, 11.9, 9.6, 10.8, 10.3, 12.7

When you question him persistently, he admits to thinking that, in the absence
of fertilizer, the yields should be independently distributed, approximately nor-
mally with the same mean and variance for all plots. He thinks that the effect of the
fertilizer is to increase the yield by some amount; he is sure it does no harm.
The people paying for his research do not like distribution assumptions,
however, and he has consented to analyze the data without such assumptions.
(a) What methods of analyzing his data would you suggest he consider? What
would you tell him about these methods? Be as precise as you can, but re-
member that Mr. Greenthumb, though highly intelligent, is nevertheless not a
statistician.
(b) Mr. Greenthumb would like to make a preliminary report right away. For this
purpose, analyze the data in some quick but reasonable way, even if it is not
the way you consider optimum.
72. Twenty mice are placed randomly in individual cages and the cages are divided
randomly into two groups, each of size 10. All the mice are infected with tubercu-
losis; then the mice in the second group are each given a certain drug (B) while
those in the first group are given a placebo (A). Since the drug (B) is known to be
nontoxic, those mice in the treatment group would not be expected to die sooner
than the control group. The number of days to death after infection are shown
below.
Control (A): 5, 6, 7, 7, 8, 8, 8, 9, 10, 12
Drug (B): 7, 8, 8, 8, 9, 9, 12, 13, 14, 17
(a) Test the hypothesis that the drug is without effect, using the rank sum test
and the average rank procedure to handle ties.
(b) Find the smallest and largest P-values when the ties are broken.
(c) Find a lower confidence bound for the effect of the drug, using a level of 0.90.
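For orientation, part (a) can be checked by machine; the sketch below (ours) uses scipy's mannwhitneyu, whose asymptotic method with midranks corresponds to the average rank procedure with the normal approximation (signature as in recent SciPy versions):

    from scipy.stats import mannwhitneyu

    control = [5, 6, 7, 7, 8, 8, 8, 9, 10, 12]
    drug = [7, 8, 8, 8, 9, 9, 12, 13, 14, 17]
    # One-sided alternative: survival under the drug is stochastically longer.
    res = mannwhitneyu(control, drug, alternative='less', method='asymptotic')
    print(res.statistic, res.pvalue)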
73. A professor decided that since it was necessary to give tests in overcrowded
classrooms, the temptation for eyes to wander should be minimized. He decided
to give two sets of tests, with the only difference being the order in which the
questions appeared. The tests were distributed in such a way that no student
could gain information if his eyes wandered. The test results are given below.
Determine whether there is a significant difference between the average grades for
these sets, and find a confidence interval for the difference.
Set A: 78, 68, 78, 90, 66, 75, 50, 42, 80, 74
Set B: 82, 81, 83, 95, 91
74. Suppose X and Y samples each have possible values 1, 2, ..., r and the value i is
observed m_i times in the X sample, n_i times in the Y sample, Σ_i m_i = m and Σ_i n_i =
n. The observed frequencies can be arranged in an r × 2 table with ith row entries
m_i, n_i and row total N_i = m_i + n_i. Consider using the rank sum test with ties
handled by the average rank procedure.
(a) Express the test statistic in terms of the observed frequencies.
(b) How does this test relate to the ordinary chi-square test for an r × 2 table
when r = 2? When r > 2?
(c) If the possible values are some arbitrary numbers a_1, a_2, ..., a_r instead of
1, 2, ..., r, what effect would this have?
*75. (Early decision in rank sum tests). Suppose X and Y observations are obtained in
order of magnitude and a one-tailed rank sum test is to be used based on sample
sizes m and n and rejecting for Rx ≤ t. Assume no ties. The results below are based
on Alling [1963]. The similar problem for censored samples is discussed in Halperin
and Ware [1974].
(a) Given the first N' = m' + n' observations, derive expressions for the minimum
and maximum possible values of Rx.
(b) When can a decision to reject first be reached? An "accept" decision?
(c) Show that a decision to reject can be first reached only after observing an X,
an "accept" decision after observing a Y.
(d) Show that a decision can always be reached early and that this is true of all
rank tests.
76. Show that the confidence region for a shift parameter corresponding to a one-
tailed or two-tailed test based on a sum of scores is an interval if the c_k form a
monotone sequence.
77. (a) Show that the null distribution of the sum of the X scores in a sample of m X's
and n Y's has mean m Σ_1^N c_k/N and variance mn[Σ_1^N c_k²/N − (Σ_1^N c_k/N)²]/
(N − 1), in the notation of Sect. 5.
(b) Show without further algebra that the sum of the Y scores has the same variance
as the sum of the X scores (under all circumstances).
(c) Use the result in (a) to verify (4.7) and (4.8) for the rank sum test.
(d) Use the result in (a) to obtain the null mean and variance of the median test
statistic A.
78. (a) Argue that if c_k is approximately J[(k − 0.5)/N] or J[k/(N + 1)] for some
function J, then Σ_1^N c_k/N is approximately ∫_0^1 J(u) du and Σ_1^N c_k²/N is
approximately ∫_0^1 J²(u) du.
(b) What kinds of conditions would be needed to make the approximations in (a)
good?
(c) What function J corresponds to the rank sum test?
(d) Compare the approximate and exact values for the rank sum test.
(e) What function J corresponds to the Fisher-Yates-Terry test? The van der
Waerden test?
(f) For these test statistics, what approximate values arise by applying the result
in (a) to the mean and variance in Problem 77(a)?
*79. Consider a test based on the sum of scores c_k for c_k = k + 2^{−k}. Formulate the
two-sample counterpart of Problem 91 of Chap. 3 and answer the questions posed.
80. Use Theorem 3.1 to show that a sum of scores test rejects with probability at most
α under (3.12) and at least α under (3.13) if the scores form a monotonic sequence.
81. Do (a)-(f) below using the Fisher-Yates-Terry procedure.
(a) Suppose the following are two independent random samples, the second
drawn from the same distribution as the first except for translation by an
amount θ (to the right if θ > 0). Test at a level near 0.01 the null hypothesis
θ = 2 against the alternative θ < 2.
82. Do (a)-(f) of Problem 81 using the median test and related confidence procedure.
83. Do (a)-(f) of Problem 81 using the rank sum test and related confidence procedure.
84. Suppose that m X observations are independent and follow the uniform distribu-
tion on (0, 1), and that n Y observations are independent and have the density
85. What is the relation between the densities in Problem 84 and Lehmann alterna-
tives F(x) = [G(x)]^k for all x?
86. Suppose that X_1, X_2 have density f(x) = 1 for 0 ≤ x ≤ 1; Y_1, Y_2 have density
g(y) = 1 for −ε ≤ y ≤ 0 and ε ≤ y ≤ 1; and all are independent. Find the
probability of each possible set of combined ranks. (Note that this probability
is not increasing in the X ranks even though F(x) ≤ G(x) for all x.) Find a function
Q such that F(x) = Q[G(x)].
88. Suppose that X_1, ..., X_m, Y_1, ..., Y_n are independent but neither the X's nor the
Y's are necessarily identically distributed. How might one argue for using a
permutation-invariant procedure? What are some circumstances under which
permutation invariance would clearly be inappropriate?
89. Given arbitrary X_i, Y_j, X′_i, Y′_j for i = 1, ..., m, j = 1, ..., n, show that the X_i
and Y_j have the same ranks in the combined sample of all X_i, Y_j as do the X′_i, Y′_j in
the combined sample of all X′_i, Y′_j, if and only if there exists a strictly increasing
function g such that g(X_i) = X′_i and g(Y_j) = Y′_j for all i, j.
90. (a) Show that all strictly increasing transformations of the observations leave the
following hypothesis invariant: X_1, ..., X_m and Y_1, ..., Y_n are independent
and the X population is stochastically larger than the Y population.
(b) Show the same for the hypothesis that X_1, ..., X_m, Y_1, ..., Y_n are independent
(but not necessarily identically distributed).
(c) Show the same for the hypothesis that X_1, ..., X_m, Y_1, ..., Y_n are independent
and the X's are identically distributed.
(d) Show that this invariance does not apply to the shift hypothesis.
91. Show that a two-sample test is a permutation-invariant rank test if and only if its
critical function depends only on the indicator variables I_j defined in Sect. 6.
92. Show that a rank test is the most powerful rank test at level α against a simple
alternative K if and only if it has the form (8.1) and level exactly α.
*93. Suppose the X population is N(θ, 1) and the Y population is N(0, 1). Use Problem
97 to show that
where the Z_(j) are order statistics from N(0, 1) and O(θ³) indicates terms of
order θ³.
(b) For sufficiently small positive θ, the most powerful rank test at level 3/(N choose m)
rejects if the X ranks are any permutation of (n + 1, n + 2, ..., N), (n, n + 2,
n + 3, ..., N), or (n + 1 − k, n + k, n + 3, n + 4, ..., N), where k = 1 or 2, and it
remains to be determined whether k = 1 or k = 2.
(c) If m = n, both choices for k give the same coefficient of θ and hence the most
powerful test is whichever choice gives the larger value of var(Z_(n+1−k) +
Z_(n+k) + Z_(n+3) + ··· + Z_(N)). With appropriate tables of the variances and
94. If U_1 < U_2 < ··· < U_N are order statistics of a sample of N from the uniform
distribution on (0, 1), show that
(a) E(U_i) = i/(N + 1).
(b) cov(U_i, U_j) = i(N + 1 − j)/[(N + 1)²(N + 2)] for i < j.
(c) var(U_i) = i(N + 1 − i)/[(N + 1)²(N + 2)].
(d) E(U_i^r) = i(i + 1) ··· (i + r − 1)/[(N + 1) ··· (N + r)].
*95. Suppose the X population is logistic (θ, 1) and the Y population is logistic (0, 1).
Use Problem 97 to show that
where the U_k are order statistics from the uniform distribution on (0, 1),
λ = e^{−θ} − 1, and O(θ³) indicates terms of order θ³.
(b) The statement of Problem 93b holds here also.
(c) Both choices for k give the same coefficients of θ and θ², and hence determination of the most powerful choice requires either further terms or a different approach. (Hint: Use Problem 94. Note that E(U_(n+1−k) + U_(n+k)), cov(U_(n+1−k) + U_(n+k), U_(j)), and E[(U_(n+1−k) + U_(n+k))U_(j)] do not depend on k for j > n + k.)
(d) The derivative of the power at θ = 0 is maximized by either choice of k (and thus by a nonrandomized test).
(e) A randomized rank sum test also maximizes the derivative of the power at θ = 0. What is its critical function and how does this test relate to the foregoing nonrandomized test?
96. In the two-sample problem, consider a one-parameter family of alternatives indexed by θ, where θ = 0 gives the null hypothesis. Let α(θ) be the power of an arbitrary rank test and α′(0) the derivative of the power at θ = 0. Show that
(a) A rank test maximizes α′(0) among rank tests at level α if and only if it has the form (8.3) and level exactly α.
(b) A rank test of the form (8.4) uniquely maximizes both α′(0) and α(θ) for all θ in some interval 0 < θ < ε.
97. Suppose that two populations have c.d.f.'s F and G and densities f and g, and that f vanishes whenever g vanishes. Show that the rank arrangement r has probability
$$P(r) = E\Bigg[\prod_{i=1}^{m} \frac{f(Z_{(r_i)})}{g(Z_{(r_i)})}\Bigg]\Big/ N!$$
where Z_(1) < ⋯ < Z_(N) are the order statistics of a sample of N from the distribution with density g.
*98. Show that the locally most powerful rank test against continuous Lehmann alternatives F_θ = G^{1+θ} is based on the scores c_j = E(log V_j), where V_j has the beta distribution with parameters j and N − j + 1.
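Since E(log V) = ψ(j) − ψ(N + 1) when V has the beta distribution with parameters j and N − j + 1, where ψ is the digamma function, these scores are simple to compute. A sketch, assuming SciPy is available (the function name is ours):

```python
from scipy.special import digamma

def lehmann_scores(N):
    """Scores c_j = E(log V_j), V_j ~ Beta(j, N - j + 1), j = 1, ..., N,
    using the identity E(log V) = digamma(j) - digamma(N + 1)."""
    return [digamma(j) - digamma(N + 1) for j in range(1, N + 1)]

print(lehmann_scores(5))
```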
*99. (a) Show that the locally most powerful rank test against an alternative with monotone likelihood ratio f_θ(x)/g_θ(x) which is, say, nondecreasing in x for every θ, is based on monotonic scores c_j.
(b) Show by example that the result in (a) need not hold if the property of monotone likelihood ratios is replaced by stochastic dominance (which is weaker).
100. Show that, when the observations are obtained in order, a two-tailed quantile test that rejects for A ≤ a and A ≥ a′ first reaches a decision at the second smallest of X_(a+1), X_(a′), Y_(n−a′+1), Y_(n−a).
101. Suppose that two populations have c.d.f.'s F and G and densities f and g, and that f vanishes whenever g vanishes. Show that
(a) H(u) can be defined so that F(x) = H[G(x)] for all x.
(b) H′(u) = l(u) = f[G⁻¹(u)]/g[G⁻¹(u)].
(c) P(r) = E[∏_{i=1}^{m} l(U_{(r_i)})]/N!, where U_(1) < ⋯ < U_(N) are uniform order statistics on (0, 1).
(d) For the parametric alternative F = F_θ, G = F_0,
*102. Derive an expression for the scores that give the locally most powerful rank test for alternatives under which the X and Y distributions have densities in
(a) An exponential family p_θ(x) = C(θ)h(x)e^{Q(θ)T(x)}.
(b) A shift parameter family p_θ(x) = p(x − θ).
*103. (a) Show that the most powerful rank test against a simple alternative with densities f, g and likelihood ratio f(x)/g(x) which is nondecreasing in x rejects when the X ranks are r_1, …, r_m if it rejects when they are r′_1, …, r′_m and r′_i ≤ r_i for all i. More generally, show that the critical function of this test as a function of the X ranks is nondecreasing in each. (Hint: Use Problem 97.)
(b) Show by example that the most powerful rank test against a simple alternative with X stochastically larger than Y need not have this property.
104. Show that for any k, x^k is a linear combination of terms of the form x(x + 1)⋯(x + i − 1) with i ≤ k. (For instance, x² = −x + x(x + 1).)
CHAPTER 6
Two-Sample Inferences Based on the Method of Randomization

1 Introduction
In Chap. 4, the principle, method, and rationale of randomization tests
applicable to the one-sample (or paired-sample) case and the null hypothesis
of symmetry about a specified point were illustrated and some properties
of these tests were discussed. In this chapter we employ the same ideas in
the case of two independent sets of observations and the null hypothesis of
identical populations.
As in Chap. 5, we may have either two random samples independently
drawn from two populations or N = m + n units, m of which are chosen at
random for some treatment. Under the null hypothesis that the two popula-
tions are identical, or that the treatment has no effect, given the N observa-
tions, all $\binom{N}{m}$ distinguishable separations of them into two groups, m labeled
X and n labeled Y, are equally likely. This fact was used to generate the null
distributions of all the test statistics considered in the last chapter. The
distribution of a two-sample statistic generated by taking all possible
separations of the observations into two groups, X and Y, as equally likely
is called its randomization distribution. Hence this distribution applies under
the null hypotheses mentioned above. A two-sample randomization test then
is one whose level or P-value is determined by this randomization distribu-
tion.
The two-sample tests discussed in Chap. 5 are called rank tests, since the
test statistic employed in each case is a function of the ranks of the X's and
Y's in the combined group. However, they are also members of the class of
two-sample randomization tests and hence might be called rank-randomiza-
tion tests. On the other hand, randomization tests may use a test statistic
which is not a function of the ranks alone; rather, the statistic may be any
function of the actual values of the observations. To emphasize this, these
randomization tests might be called observation-randomization tests. In
general, the randomization distribution of an observation-randomization
test statistic must be generated anew for each different set of observations.
Since the randomization distribution applies under the null hypothesis
conditionally on any given set of N observed values, these randomization
tests are conditional tests, conditional on the N values observed.
In this chapter we will first discuss the two-sample observation-randomiza-
tion test based on the difference between the two sample means (or any
equivalent criterion), and the corresponding confidence procedure. Then
we will introduce the general class of two-sample randomization tests and
study most powerful randomization tests.
2 Randomization: Difference Between Sample Means and Equivalent Criteria

2.1 Tests

Test statistics equivalent to Ȳ − X̄ as randomization test criteria (Problem 1) are the ordinary two-sample t statistic (less convenient!), the sum Σᵢ Xᵢ of the X's, and S* = Σ_{r_i>m} X_i − Σ_{r′_j≤m} Y_j, where r_i is the rank of X_i and r′_j is the rank of Y_j in the combined ordered sample.
The calculations are most easily performed using S* when Ȳ − X̄ is "large," that is, in the upper tail, and using a corresponding statistic defined in Problem 1 when Ȳ − X̄ is in the lower tail. If the same value of S* occurs more than once, each occurrence must be counted separately. (Theoretically this has probability zero if continuous distributions are assumed, but in practice it may occur.) One will ordinarily try to avoid enumerating all $\binom{N}{m}$ possible separations. In order to find a P-value, only those $\binom{N}{m}P$ separations which lead to values of S* equal to or more extreme (in the appropriate direction) than that observed must be enumerated. If a nonrandomized test is desired at level α, a decision to reject can be reached by enumerating these same $\binom{N}{m}P$ separations, with P < α, and a decision to "accept" requires identifying any $\binom{N}{m}\alpha$ cases as extreme as that observed. For rejection, or a P-value, every point in the relevant tail must be included; since it is difficult to select the cases in the correct order, considerable care is required.
The procedure for generating the entire randomization distribution is illustrated in Table 2.1. For m = 3, n = 4, there are $\binom{7}{3}$ = 35 distinguishable separations into X's and Y's. Each separation is listed in the table and the value of S* calculated for each. (The Σ X column is included only to illustrate the evaluation of a different test statistic and to make the test more intuitive; the Ȳ − X̄ column is included for use in Sect. 2.4.) The observed value of S* is 1.5 and only three of the enumerated separations produce an S* that small. Hence the one-tailed P-value by the randomization test is 3/35 = 0.0857. Since the randomization distribution is far from symmetric here, different ways of relating the two tails would give appreciably different two-tailed P-values. (See also Sect. 2.4 below, Problem 5, and Sect. 4.5, Chap. 1.)
If the null hypothesis is that the Y population is the same as the X population except for a shift by a specified amount μ_0, then the foregoing randomization test may be applied to X_1, …, X_m, Y_1 − μ_0, …, Y_n − μ_0. The corresponding confidence procedure will be discussed in Sect. 2.3.
In Sect. 2.1, the validity of the randomization distribution as the null distribution for the randomization test was based on the assumption that the X and Y samples come from identical populations under the null hypothesis. As noted in Sect. 1, the randomization distribution is also valid if N given units are randomly separated into two groups, one of which is treated, and the null hypothesis is that the treatment has no effect on any unit.
The probability of rejection is ordinarily affected, however, and sometimes increased, if the samples are drawn from two populations which differ in other respects.
Table 2.1ᵃ
X sample: −0.2, 0.9, 2.0
Y sample: 0.5, 6.5, 11.5, 14.3
X̄ = 0.9, Ȳ = 8.2, m = 3, n = 4, S* = 1.5
Sample Separations
−0.2 0.5 0.9 2.0 6.5 11.5 14.3 | Σ X | S* | Ȳ − X̄
X X X Y Y Y Y 1.2 0 8.18
X X Y X Y Y Y 2.3 1.1 7.53
X Y X X Y Y Y 2.7 1.5 7.30
Y X X X Y Y Y 3.4 2.2 6.89
X X Y Y X Y Y 6.8 5.6 4.91
X Y X Y X Y Y 7.2 6.0 4.68
Y X X Y X Y Y 7.9 6.7 4.27
X Y Y X X Y Y 8.3 7.1 4.03
Y X Y X X Y Y 9.0 7.8 3.62
Y Y X X X Y Y 9.4 8.2 3.39
X X Y Y Y X Y 11.8 10.6 1.99
X Y X Y Y X Y 12.2 11.0 1.76
Y X X Y Y X Y 12.9 11.7 1.35
X Y Y X Y X Y 13.3 12.1 1.12
Y X Y X Y X Y 14.0 12.8 0.71
Y Y X X Y X Y 14.4 13.2 0.48
X X Y Y Y Y X 14.6 13.4 0.36
X Y X Y Y Y X 15.0 13.8 0.12
Y X X Y Y Y X 15.7 14.5 -0.28
X Y Y X Y Y X 16.1 14.9 -0.52
Y X Y X Y Y X 16.8 15.6 -0.92
Y Y X X Y Y X 17.2 16.0 −1.16
X Y Y Y X X Y 17.8 16.6 -1.51
Y X Y Y X X Y 18.5 17.3 -1.92
Y Y X Y X X Y 18.9 17.7 -2.15
Y Y Y X X X Y 20.0 18.8 -2.79
X Y Y Y X Y X 20.6 19.4 -3.14
Y X Y Y X Y X 21.3 20.1 −3.55
Y Y X Y X Y X 21.7 20.5 -3.78
Y Y Y X X Y X 22.8 21.6 -4.42
X Y Y Y Y X X 25.6 24.4 -6.06
Y X Y Y Y X X 26.3 25.1 -6.47
Y Y X Y Y X X 26.7 25.5 -6.70
Y Y Y X Y X X 27.8 26.6 -7.34
Y Y Y Y X X X 32.3 31.1 −9.97
a These data are percent change in retail sales of Alabama drug stores from April 1971 to
April 1972. The X values are for three counties selected at random from those Alabama
counties classified as SMSA's (Standard Metropolitan Statistical Areas), and the Y values
are for randomly selected other counties. Small samples were selected so that generation of
the entire randomization distribution could be illustrated. A null hypothesis of practical
importance here is that the average percent change in retail sales during this period for
metropolitan areas in Alabama is not smaller than the corresponding change for less urban
areas.
Under the shift assumption, the randomization test procedure can also be used to construct a confidence region for the amount of the shift μ. Unfortunately, the randomization distribution is different when different values of μ are subtracted from the Y's. The confidence region is nevertheless an interval, and its endpoints could be obtained by trial and error by subtracting successive values of μ, larger or smaller than previous values as appropriate, until the value of the test statistic equals the appropriate upper or lower critical value of a randomization test at level α for shift 0. The endpoints of the normal theory confidence interval for the difference of means at the same level could be used as initial trial values of μ.
However, as in the corresponding one-sample problem, there is a more systematic and convenient method which can be used to find the confidence limits for μ exactly. Consider all pairs of equal-sized subsamples of X's and Y's. Specifically, for the sample of m X's, consider all possible subsamples of size r, and for the sample of n Y's, consider all possible subsamples of the same size r. Take all possible pairs of these equal-sized subsamples of X's and Y's for all r, 1 ≤ r ≤ min(m, n). The total number of different pairs is $\binom{N}{m} - 1$.
For each pair, take the difference of the subsample means, say the Y subsample mean minus the X subsample mean. Consider the $\binom{N}{m} - 1$ differences in order of algebraic (not absolute) value. It can be shown that the kth smallest and kth largest of these differences of equal-sized subsample means are the lower and upper confidence bounds respectively, each at level 1 − α, that correspond to the two-sample randomization test based on Ȳ − X̄ (or any other equivalent test criterion), where α = k/$\binom{N}{m}$ and hence k = $\binom{N}{m}$α (Problem 2).
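A direct, brute-force implementation of this confidence procedure might look as follows (a Python sketch; the function name and interface are ours):

```python
from itertools import combinations

def shift_bounds(x, y, k):
    """kth smallest and kth largest of the C(N, m) - 1 differences of
    equal-sized subsample means (Y subsample mean minus X subsample mean);
    these are the lower and upper confidence bounds for the shift, each at
    level 1 - k/C(N, m), corresponding to the randomization test."""
    diffs = [sum(ys) / r - sum(xs) / r
             for r in range(1, min(len(x), len(y)) + 1)
             for xs in combinations(x, r)
             for ys in combinations(y, r)]
    diffs.sort()                 # there are C(N, m) - 1 differences in all
    return diffs[k - 1], diffs[-k]

# e.g., k = 1 gives the most extreme bounds, each at level 1 - 1/35
print(shift_bounds([-0.2, 0.9, 2.0], [0.5, 6.5, 11.5, 14.3], 1))
```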
To save labor, instead of using all $\binom{N}{m} - 1$ differences one could use a smaller number, say M, if they are selected either at random without replacement (Problem 3) or in a "balanced" manner determined by group theory; then α = k/(M + 1). Both methods are discussed briefly in Sect. 2.5;
more detail for a single sample from a symmetric population was given in
Sect. 2.5 of Chap. 4.
In terms of the combined observations a_1, …, a_N,
$$\bar Y - \bar X = \frac{\sum_1^N a_j - \sum_1^m X_i}{n} - \bar X \tag{2.1}$$
and
$$\bar a = \sum_1^N a_j \Big/ N = \Big(\sum_1^m X_i + \sum_1^n Y_j\Big)\Big/ N. \tag{2.2}$$
The mean and variance of the randomization distribution of X̄ are
$$E(\bar X) = \bar a, \tag{2.3}$$
$$\operatorname{var}(\bar X) = \frac{n}{m(N-1)} \sum_1^N (a_j - \bar a)^2 \Big/ N = \frac{nS^2}{mN(N-1)}, \tag{2.4}$$
where $S^2 = \sum_1^N (a_j - \bar a)^2$.
Furthermore, by (2.1), (2.3) and (2.4) the mean and variance σ² of the randomization distribution of Ȳ − X̄ are given by
$$E(\bar Y - \bar X) = 0 \tag{2.5}$$
$$\operatorname{var}(\bar Y - \bar X) = \sigma^2 = \frac{NS^2}{mn(N-1)}. \tag{2.6}$$
The corresponding moments of several other equivalent test criteria are easily derived (Problem 4).
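These moments are easy to confirm by enumeration. A Python sketch, using the data of Table 2.1, that checks (2.5) and (2.6) directly:

```python
from itertools import combinations

a = [-0.2, 0.5, 0.9, 2.0, 6.5, 11.5, 14.3]
N, m = len(a), 3
n = N - m
tot = sum(a)

# enumerate the randomization distribution of Ybar - Xbar
diffs = [(tot - sum(x)) / n - sum(x) / m for x in combinations(a, m)]
mean = sum(diffs) / len(diffs)
var = sum((d - mean) ** 2 for d in diffs) / len(diffs)

abar = tot / N
S2 = sum((v - abar) ** 2 for v in a)
print(mean)                              # 0, as in (2.5)
print(var, N * S2 / (m * n * (N - 1)))   # equal, as in (2.6)
```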
For each transformation in G except the identity, take the mean of all Y's that are permuted to become X's and subtract the mean of all X's that are permuted to become Y's. The kth smallest (or kth largest) of these differences of subsample means is the lower (or upper) confidence bound corresponding to the test having k points in the critical region, and the one-tailed α is k divided by the size of the subgroup.
(c) Approximations. As in the one-sample case in Chap. 4, several approximations that are based on tabled probability distributions are possible. Four will be given here, but none of them reflects the asymmetry of the randomization distribution of Ȳ − X̄.
A natural approximation is the standard normal distribution for the standardized randomization test statistic (Ȳ − X̄)/σ, where σ is given by (2.6). The following reasoning suggests, however, that a better approximation may be obtained by calculating the ordinary two-sample t statistic and treating it as Student's t distributed with N − 2 degrees of freedom.
The ordinary two-sample t statistic for equality of means assuming equal variances can be written here as (Problem 10)
$$t = Z\left(\frac{N-2}{1-Z^2}\right)^{1/2}, \tag{2.8}$$
where $Z = (\bar Y - \bar X)/[(N-1)\sigma^2]^{1/2}$.
$$d = 1 + \frac{N+1}{N-1}\, C_2 \left[\frac{2mn(N-2)}{6mn - N^2 - N} - C_2\right]^{-1} \tag{2.11}$$
where C_2 is given by (2.12).
The first two moments of the randomization distribution of
$$Z^2 = \frac{(\bar Y - \bar X)^2}{(N-1)\sigma^2} \tag{2.13}$$
are
$$E(Z^2) = \frac{1}{N-1} \tag{2.14}$$
and
$$E(Z^4) = \frac{D}{N-1}, \tag{2.15}$$
where D is given in (2.10). Equating the moments in (2.14) and (2.15) to the corresponding moments of the beta distribution with parameters a and b as given in (2.16) of Chap. 4 gives the relations (Problem 14b)
$$a = \frac{1-D}{(N-1)D-1}, \qquad b = (N-2)a. \tag{2.16}$$
3 The Class of Two-Sample Randomization Tests
3.1 Definition

The level of the randomization test based on the difference between the means of two independent samples, or on any other equivalent test criterion, relies only on the assumption that, given the observations, say a_1, …, a_N, all possible separations into X_1, …, X_m, Y_1, …, Y_n are equally likely, as they are under the null hypothesis of identical populations, or the null hypothesis that the treatment has no effect when the treatment group is selected randomly from the whole set. Thus it is a conditional test, conditional on a_1, …, a_N. In particular, it has level α if, under the null hypothesis, the conditional probability of rejection given a_1, …, a_N is at most α, and its P-value is the corresponding conditional probability of a value of the test statistic equal to or more extreme than that observed.
More generally, as stated in Sect. 1, a two-sample randomization test is a test which is conditional on the observations a_1, …, a_N, its null distribution being the corresponding randomization distribution.
3.2 Properties

Denote the critical function, that is, the probability of rejection, of an arbitrary test by φ(X_1, …, X_m; Y_1, …, Y_n). Consider first a null hypothesis H_0 under which, given the observations a_1, …, a_N, all N! arrangements of them into two samples X_1, …, X_m; Y_1, …, Y_n are equally likely. Then under H_0, the conditional expected value given the a_j of φ(X_1, …, X_m; Y_1, …, Y_n) is simply the mean of its N!-type randomization distribution, or
$$\frac{1}{N!}\sum \phi(a_{\pi_1}, \ldots, a_{\pi_m};\, a_{\pi_{m+1}}, \ldots, a_{\pi_N}), \tag{3.1}$$
where the sum is over the N! permutations π_1, …, π_N of the integers 1, …, N.
Alternatively, consider a null hypothesis H_0 under which, given the observations a_1, …, a_N, all $\binom{N}{m}$ separations into an X sample of size m and a Y sample of size n are equally likely. Then under H_0, the conditional expected value given the a_j of φ(X_1, …, X_m; Y_1, …, Y_n) is simply the mean of the $\binom{N}{m}$-type randomization distribution, or
$$\binom{N}{m}^{-1}\sum \phi(a_{\pi_1}, \ldots, a_{\pi_m};\, a_{\pi_{m+1}}, \ldots, a_{\pi_N}), \tag{3.2}$$
where the sum is over the $\binom{N}{m}$ separations of the integers into two sets {π_1, …, π_m} and {π_{m+1}, …, π_N} with π_1 < ⋯ < π_m and π_{m+1} < ⋯ < π_N.
The expected value (3.1) or (3.2), whichever applies, is the conditional probability of a Type I error for the test. Accordingly, a test φ has conditional level α, given the a_j, if the quantity (3.1) or (3.2) is less than or equal to α. If this holds for all a_1, …, a_N, then the test is a randomization test, of the N! type or the $\binom{N}{m}$ type respectively. Any such test also has level α unconditionally by the usual argument. We shall see that, conversely, a test having unconditional level α must, under certain circumstances, have conditional level α given the observations a_1, …, a_N, that is, must be an N!-type or $\binom{N}{m}$-type randomization test at level α.
The statements in Sect. 2.2 about weakening the assumptions apply here also as long as, for those statements referring to stochastic dominance, the critical function is suitably monotonic in the X_i and Y_j.
H_0: The observations Z_1, …, Z_N are independent with arbitrary distributions, and X_1, …, X_m; Y_1, …, Y_n are a random separation of Z_1, …, Z_N into an X sample and a Y sample.
H′_0: H_0 holds and the X_1, …, Y_n have densities.
The same conclusion also holds under less broad hypotheses, including
H″_0: H_0 holds and the Z_j are normally distributed with arbitrary means μ_j and common variance σ².
Note that H_0, for instance, does not say that X_1, …, X_m; Y_1, …, Y_n are independently distributed with arbitrary distributions. This would place no restriction whatever on the relation between the X's and the Y's and hence could not serve usefully as a null hypothesis. By a random separation we mean that, given Z_1, …, Z_N, the X's are a random sample of the Z's without replacement but in their original order, while the Y's are the remaining Z's in their original order.
We have in mind the kind of situation in which, for example, there are N
available experimental units, of which m are to be chosen at random to
receive a treatment and the rest to serve as controls. If the null hypothesis
is that the treatment has no effect whatever on any unit, then the random
selection of the units to be treated guarantees the validity of the level of any
randomization test. Are there any other tests at a specified level α? Suppose that, if all experimental units were untreated (or if the treatment had no effect whatever), one would be willing to assume no more than H_0, that is, that the N observations on the N experimental units are independently distributed with arbitrary distributions, possibly all different. The fact stated
above is that the only tests having level α under such a weak assumption are the randomization tests at level α. Furthermore, the null hypothesis can be made considerably more restrictive without upsetting this conclusion. For example, the conclusion holds for a normal model with common variance but arbitrary unit effects, as in H″_0. It therefore holds for any null hypothesis which permits this normal model (Problem 24b). It also holds if the unit effects are arbitrary constants (Problem 24a).
These properties are summarized in Theorem 3.1.
Theorem 3.1. If a test has level α for H_0, H′_0, or H″_0, then it is a $\binom{N}{m}$-type randomization test. Conversely, a randomization test has level α for H_0 and hence for H′_0 and H″_0.
The proof of Theorem 3.1 for H_0 is requested in Problem 24c. The proofs for H′_0 and H″_0 involve measure-theoretic considerations and will not be considered here. See Lehmann [1959, Sect. 5.10] or Lehmann and Stein [1949].
Suppose also that the test φ, at level α, is unbiased against an alternative hypothesis which includes distributions arbitrarily close to each distribution of the null hypothesis. Then, by the usual argument (Sect. 6.3, Chap. 2), φ
must have level exactly α under every null distribution included in the null hypothesis. This in turn implies that φ has conditional level α given the combined sample observations but not their assignment to X_1, …, X_m; Y_1, …, Y_n. (The proof, Problem 25b, is like that in Sect. 3.2 of Chap. 4.)
Since the conditional null distribution is the N!-type randomization distribution, it follows that if φ is an unbiased test of H_0 or H′_0 against a sufficiently broad alternative, then it is an N!-type randomization test. Alternatives which are sufficiently broad are (Problem 25a)
H_1: X_1, …, X_m; Y_1, …, Y_n are drawn independently from two populations which are the same except for a shift.
H′_1: H_1 holds and the populations have densities.
Theorem 3.2. If a test has level α for H_0 or H′_0 and is unbiased against one-sided or two-sided shifts, then it is an N!-type randomization test.
4 Most Powerful Randomization Tests

Reasons for using randomization tests were given in Sect. 3.2. In this subsection we will see how to find that randomization test which is most powerful against any specific alternative distribution. The particular case of normal shift alternatives will be illustrated in the two subsections following.
Suppose, for definiteness, that we are considering all of the N!-type randomization tests. Then under the randomization distribution, given the observations a_1, …, a_N, all N! possible arrangements into X_1, …, X_m, Y_1, …, Y_n are equally likely. A randomization test is valid as long as this condition is satisfied under the null hypothesis. Consider now an alternative
with joint density or discrete frequency function f(x_1, …, x_m, y_1, …, y_n). Under this alternative, given a_1, …, a_N, the conditional probabilities of each of the N! possible arrangements X_1, …, X_m, Y_1, …, Y_n are proportional to f(X_1, …, X_m, Y_1, …, Y_n). By the Neyman-Pearson Lemma (Theorem 7.1 of Chap. 1), it follows (Problem 26) that among randomization tests, the conditional power against f is maximized by a test of the form
$$\phi = \begin{cases} 1 & \text{if } f(X_1, \ldots, X_m, Y_1, \ldots, Y_n) > k \\ 0 & \text{if } f(X_1, \ldots, X_m, Y_1, \ldots, Y_n) < k, \end{cases} \tag{4.1}$$
where k is a function of the order statistics of the combined sample.
Consider now the alternative that the X's are normal with mean μ_1 and variance σ², the Y's are normal with mean μ_2 and the same variance σ², and all are independent. It follows from (4.1) (Problem 27) that the upper-tailed randomization test based on Ȳ − X̄ (or an equivalent statistic) is the most powerful randomization test (of either the N!-type or the $\binom{N}{m}$-type) against any such alternative with μ_2 > μ_1; that is, it is the uniformly most powerful randomization test against μ_2 > μ_1. Similarly, the lower-tailed randomization test based on Ȳ − X̄ is the uniformly most powerful randomization test against μ_2 < μ_1. Note that the one-tailed tests here do not depend on the values of the parameters, in contrast to the most powerful rank tests of Sect. 8.1 of Chap. 5.
Suppose that the X's and Y's are normal with common variance, as above, and consider the alternative μ_1 ≠ μ_2. There is no uniformly most powerful randomization test against this alternative, different randomization tests being most powerful against μ_1 < μ_2 and μ_1 > μ_2. It is apparently unknown whether there is a uniformly most powerful unbiased randomization test. We shall prove, however, that the randomization test rejecting for large |Ȳ − X̄| has two other properties. This test may be thought of as a two-tailed randomization test based on Ȳ − X̄, but as mentioned earlier, unless m = n, it is not ordinarily the equal-tailed randomization test based on Ȳ − X̄ because the randomization distribution of Ȳ − X̄ is not ordinarily symmetric unless m = n (Problem 5).
One property of the randomization test rejecting for large |Ȳ − X̄| is that it is uniformly most powerful against μ_1 ≠ μ_2 among randomization tests which are invariant under transformations carrying X_1, …, X_m, Y_1, …, Y_n into c − X_1, …, c − X_m, c − Y_1, …, c − Y_n, where c is an arbitrary constant. Notice that such a transformation carries the alternative given by μ_1, μ_2, σ into that given by c − μ_1, c − μ_2, σ, so the invariance rationale (Sect. 8, Chap. 3) can be applied. In particular, any invariant test has the same power against all alternatives with the same μ_1 − μ_2 and the same σ (Problem 28). The statement is that no randomization test which is invariant under all such transformations is more powerful against even one alternative with μ_1 ≠ μ_2 than the randomization test rejecting when |Ȳ − X̄| is too large.
*This randomization test is also the "most stringent" randomization test in the situation under discussion. This property is defined as follows. Let α*(μ_1, μ_2, σ) be the power of the most powerful randomization test against the alternative given by (μ_1, μ_2, σ). Then the power α(μ_1, μ_2, σ) of any other randomization test φ is at most α*(μ_1, μ_2, σ), and the difference measures how far short of optimum the test φ is. Accordingly, we define
$$\Delta = \sup_{\mu_1, \mu_2, \sigma} [\alpha^*(\mu_1, \mu_2, \sigma) - \alpha(\mu_1, \mu_2, \sigma)]$$
as the maximum amount by which φ falls short of optimum. The randomization test which rejects for large |Ȳ − X̄| has the property that it minimizes Δ among randomization tests and hence is most stringent. Specifically, the maximum amount by which the power of φ is less than that of the best randomization test against each alternative separately is minimized among randomization tests by the randomization test rejecting for large |Ȳ − X̄|. For no ε > 0 is there a randomization test which comes within Δ − ε of the optimum at every alternative, and the randomization test rejecting for large |Ȳ − X̄| comes within Δ everywhere. Notice that this property does not in itself exclude the possibility that another randomization test is much better against most alternatives but slightly worse against some (Problem 31).*
*PROOF. The same basic device can be used to obtain both of the above properties. (This is not surprising in light of the fact that a uniformly most powerful invariant test is most stringent under quite general conditions [Lehmann, 1959, p. 340].)
Consider the alternatives given by (μ_1, μ_2, σ) and (c − μ_1, c − μ_2, σ). The average power of any test against these two alternatives is the same as its power against
$$h(x_1, \ldots, x_m, y_1, \ldots, y_n) = \frac{1}{2}\Bigg[\prod_{i=1}^{m} g(x_i; \mu_1, \sigma)\prod_{j=1}^{n} g(y_j; \mu_2, \sigma) + \prod_{i=1}^{m} g(x_i; c-\mu_1, \sigma)\prod_{j=1}^{n} g(y_j; c-\mu_2, \sigma)\Bigg] \tag{4.2}$$
where g(z; μ, σ) is the normal density with mean μ and variance σ² (Problem 32a). We shall show that, for any (μ_1, μ_2, σ), there is a c such that the randomization test rejecting for large |Ȳ − X̄| is the most powerful randomization test against h. From this and some further arguments, the desired conclusions follow (Problem 32c).
By straightforward calculation, the first term on the right-hand side of (4.2) is a multiple of the expression (4.3). The second term on the right-hand side of (4.2) is the same as (4.3) but with c − μ_1 in place of μ_1 and c − μ_2 in place of μ_2, and therefore with the quantity in (4.5) in place of the quantity μ̄ in (4.4). If c = 2μ̄, so that the two quantities in (4.4) and (4.5) are equal, then the two terms of (4.2) combine, and the most powerful randomization test against the resulting density rejects for large |Ȳ − X̄| (Problem 32b).
PROBLEMS
1. Show that the randomization tests based on Ȳ − X̄, Ȳ, Σᵢ Xᵢ, Σⱼ Yⱼ, S*, S**, the ordinary two-sample t statistic, and r are all equivalent; here S* = Σ_{r_i>m} X_i − Σ_{r′_j≤m} Y_j and S** = Σ_{r′_j>n} Y_j − Σ_{r_i≤n} X_i, where r_i is the rank of X_i and r′_j is the rank of Y_j in the combined ordered sample, and r is the ordinary product-moment correlation coefficient between the N observations and the N indicator variables defined by
I_k = 1 if the observation with rank k is an X, I_k = 0 if the observation with rank k is a Y.
*2. Show that the confidence bounds for shift corresponding to the one-tailed, two-sample randomization tests based on Ȳ − X̄ at level α = k/$\binom{N}{m}$ are the kth smallest and largest of the differences of equal-sized subsample means.
*3. Consider the $\binom{N}{m}$ − 1 differences of equal-sized subsample means in the two-sample problem. Show that under the shift assumption
(a) The $\binom{N}{m}$ intervals into which these differences divide the real line are equally likely to contain the true shift μ.
(b) If M of the differences are selected at random without replacement, then the kth smallest is a lower confidence bound for μ at level α = k/(M + 1).
4. Find the mean and variance of the randomization distribution of the statistics Ȳ, S*, and S** defined in Problem 1.
5. (a) Show that the randomization distribution of Ȳ − X̄ is symmetric about 0 if (i) m = n, or (ii) the combined sample is symmetric.
*(b) Can you construct an example in which the randomization distribution of Ȳ − X̄ is symmetric about 0 but neither (i) nor (ii) of (a) holds? (The authors have not done so.)
6. Consider the samples X: 0.5, 0.9, 2.0 and Y: −0.2, 6.5, 11.5, 14.3. Show that
(a) The randomization distribution of Ȳ − X̄ is that given in Table 2.1.
(b) The randomization test based on |Ȳ − X̄| has upper-tailed P-value 6/35 and hence rejects the null hypothesis at level α = 6/35.
(c) The equal-tailed randomization test based on Ȳ − X̄ at level α = 6/35 and the lower-tailed test at level 3/35 both "accept" the null hypothesis. Find the P-values.
7. Suppose the following are two independent random samples, the second drawn from the same distribution as the first except for a shift of the amount μ:
X: 0.2, 0.6, 1.2 and Y: 1.0, 1.8, 2.3, 2.4, 4.1.
(a) Use the randomization test based on the difference of sample means (or an equivalent randomization test statistic) to test the null hypothesis μ = 2 against the alternative μ < 2, at a level near 0.01.
(b) Give the exact P-value of the test in (a).
(c) Give the confidence bound for μ which corresponds to the test used in (a).
(d) Find the approximate P-value based on (i) the standard normal distribution and (ii) Student's t distribution with N − 2 = 6 degrees of freedom.
8. Show that the randomization test based on Ȳ − X̄ and the two-sample rank sum test are equivalent for one-tailed α ≤ 2/$\binom{N}{m}$ but not for α = 3/$\binom{N}{m}$ or 4/$\binom{N}{m}$ (assume m ≥ 4, n ≥ 4).
*9. Express the confidence bounds for shift corresponding to the randomization test based on Ȳ − X̄ in terms of the order statistics of the two samples separately for α = k/$\binom{N}{m}$, k = 1, 2, 3, 4.
11. Verify that (2.9) and (2.11) are equivalent expressions for the degrees-of-freedom correction factor d.
12. Show that C_2 as given by (2.12) is a consistent estimator of the population kurtosis minus 3 if the X's and Y's are independent and identically distributed with a suitable number of finite moments.
13. Show that the step relating F with fractional degrees of freedom to a scaled t distribution in Sect. 2.5 is the same as in Sect. 2.5 of Chap. 4 with n replaced by N − 1 (see Problem 18 of Chap. 4).
14. (a) Derive formulas (2.14) and (2.15) for the first two moments of the randomization distribution of (Ȳ − X̄)²/[(N − 1)σ²].
(b) Show that the beta distribution with the same first two moments has parameters given by (2.16).
(c) Show that approximating the randomization distribution of (Ȳ − X̄)²/[(N − 1)σ²] by this beta distribution is equivalent to approximating the randomization distribution of t² by an F distribution with d and (N − 2)d degrees of freedom, where d is given by (2.9).
15. Show that the randomization distribution of t has mean 0 but need not have third moment 0.
*16. Show that a group of transformations can be used to restrict the randomization set as described in (b) of Sect. 2.5.
*17. Let T(x_1, …, x_m, y_1, …, y_n) be nondecreasing in each x_i and nonincreasing in each y_j, and consider randomization tests based on T. Show that
(a) The corresponding confidence regions are intervals.
(b) The level of the lower-tailed test remains valid when the observations are mutually independent and every X_i is "stochastically larger" than every Y_j.
(c) The lower-tailed test is unbiased against alternatives under which the observations are mutually independent and every X_i is "stochastically smaller" than every Y_j.
Note: Remember that the critical value of a randomization test depends on the observations and hence is not constant.
18. (a) Show that the randomization distribution of Σ X_i is the same as the null distribution of the sum of the X scores for scores a_k (Sect. 5, Chap. 5).
(b) Relate formulas (2.3) and (2.4) for the mean and variance of the randomization distribution of X̄ to the corresponding results given in Problem 77a of Chap. 5 for sum of scores tests.
(c) Why, despite (a), is the randomization test not a sum of scores test?
19. Invent an "adaptive" two-sample randomization test and discuss the rationale for it.
20. In each of the following situations, identify the real world counterparts of X_1, …, X_m, Y_1, …, Y_n and a_1, …, a_N. Would it be possible to distinguish order within samples? Meaningful? Desirable? If a randomization test is to be used, should it be $\binom{N}{m}$-type? N!-type? What null hypothesis would be appropriate? Why? The situations are sketchily described; give further details as you need or desire.
(a) A library ceiling has 50 light bulbs. Two types of bulbs are used in an experiment and the lifetime of each bulb is recorded.
(b) In a library and a less well ventilated hallway with the same type of bulbs, bulb lifetimes are recorded to see if they are affected by ventilation.
(c) A set of patients who are deemed appropriate and have given consent receives the usual medication for some type of high fever. A randomly selected subset of this set is also given a standard dose of an additional new drug. The temperature change in a four-hour period is recorded for all patients.
(d) Same as (c) except that the dose of the new drug is varied from 20% below to 20% above the standard level.
21. In one of the situations of Problem 20 or some other situation, describe a randomization test which is not permutation invariant and explain why it might be desirable to use it.
22. (a) Show that a randomization test is N!-type if it is $\binom{N}{m}$-type.
(b) In what sense is an $\binom{N}{m}$-type randomization test more conditional than an N!-type?
23. Given k and n, in how many ways can one select integers j_1, j_2, …, j_n which are all different and satisfy 1 ≤ j_1 < j_2 < ⋯ < j_k ≤ n and 1 ≤ j_{k+1} < j_{k+2} < ⋯ < j_n ≤ n?
24. (a) Show that all level α tests are $\binom{N}{m}$-type randomization tests if the null hypothesis is that X_1, …, X_m; Y_1, …, Y_n are a random separation of a_1, …, a_N into an X sample and a Y sample, where a_1, …, a_N are arbitrary constants.
(b) Suppose it is known that all tests having level α for some null hypothesis H*_0 are $\binom{N}{m}$-type randomization tests. Show that the same is true for every H**_0 which contains H*_0.
(c) Show that all level α tests are $\binom{N}{m}$-type randomization tests for H_0: The observations Z_1, …, Z_N are independent with arbitrary distributions, and X_1, …, X_m; Y_1, …, Y_n are a random separation of Z_1, …, Z_N into an X sample and a Y sample.
25. (a) Show that if a test has level α for the H_0 or H′_0 given after Theorem 3.1 and is unbiased against one-sided (or two-sided) shift alternatives, then it has level exactly α under every null distribution.
(b) Show that the result in (a) in turn implies that the test has conditional level α given the combined sample observations.
26. (a) Show that a test is the most powerful N!-type randomization test at level α against a simple alternative if and only if it has level exactly α and is of the form (4.1), where k is a function of the order statistics of the combined sample.
*(b) What change occurs in the statement in (a) for $\binom{N}{m}$-type randomization tests?
27. Show that a one-tailed randomization test based on Ȳ − X̄ (or any equivalent statistic) is uniformly most powerful against one-sided normal alternatives with common variance.
28. (a) Show that if a test is invariant under transformations carrying X into c − X and Y into c − Y for all c, then its power against the alternative that X is N(μ_1, σ²) and Y is N(μ_2, σ²) depends only on μ_1 − μ_2 and σ.
(b) Show that if the test in (a) is also invariant under changes of scale (carrying X into bX and Y into bY), then its power depends only on (μ_1 − μ_2)/σ.
*(c) Why were changes of scale not considered in Sect. 4.3?
*29. Show that the randomization test based on |Ȳ − X̄| has the "most stringent" property of Sect. 4.3
(a) If the alternative is restricted to the region |μ_1 − μ_2| > bσ, where b is some given constant.
(b) If α*(μ_1, μ_2, σ) is redefined as the maximum power achievable by any test.
*30. Show that uniformly most powerful invariant tests are "generally" most stringent under suitable conditions. What condition is most important?
31. In the situation of Sect. 4.3, draw hypothetical graphs of α*(μ_1, μ_2, σ) and of α(μ_1, μ_2, σ) for the level α randomization test based on |Ȳ − X̄| as functions of (μ_2 − μ_1)/σ. Indicate how to find Δ from these graphs. What do the properties of α(μ_1, μ_2, σ) as "uniformly most powerful invariant" and "most stringent" imply about the graphs of the power of other randomization tests? (Assume the other randomization tests are invariant under increasing linear transformations so that their power depends only on (μ_1 − μ_2)/σ, but do not assume that they are invariant under changes of sign.) What can you say about the power of a one-tailed randomization test based on Ȳ − X̄ at level α?
32. (a) Show that the power of a test against the density h given by (4.2) is the average of its power against the two normal alternatives given by (μ_1, μ_2, σ) and (c − μ_1, c − μ_2, σ).
(b) Show that the most powerful randomization test against h is that rejecting for large |Ȳ − X̄|.
*(c) From this, show that the randomization test rejecting for large |Ȳ − X̄| is uniformly most powerful invariant and most stringent against normal alternatives as stated in Sect. 4.3.
CHAPTER 7
Kolmogorov-Smirnov
Two-Sample Tests
1 Introduction
We have not previously discussed the use of criteria suggested by direct comparison of empirical (sample) cumulative distribution functions with one another or with hypothetical c.d.f.'s ("goodness of fit"). This important
approach leads to a wide variety of procedures which stand apart from the
procedures of earlier chapters in several respects. They are expressed in a
different form. The relevant statistics are not approximately or asymptotically
normally distributed. The theory of their asymptotic behavior is fascinating
and raises different kinds of problems requiring different kinds of tools. The
mathematical interest of these and other problems has played a larger role
than statistical questions in motivating the extensive literature about them,
although there is also some excellent work on statistically important
questions.
The one-sample test procedures require that a completely specified distribution be hypothesized. In this sense they are not "nonparametric." Although the test statistics are "distribution-free" as that term is usually defined, they relate to null hypotheses that are entirely different from those in Chaps. 2-4, which require only symmetry. The two-sample procedures, however, relate to the "nonparametric" null hypothesis of identical distributions used in Chaps. 5 and 6.
The Kolmogorov-Smirnov criterion of maximum difference, defined below, has received the most attention. Another natural criterion is the Cramér-von Mises integrated squared difference. These and other variations of them involving "weights" are the only specific criteria developed in this tradition which have been broadly investigated for statistical purposes. Pearson's chi-square goodness-of-fit criterion, which compares cell frequencies rather than cumulative frequencies, is even more popular.
2 Empirical Distribution Function

The empirical distribution function F_m of a sample X_1, …, X_m is defined by F_m(t) = (number of X_i ≤ t)/m.
The following properties of F_m(t) are easily proved (Problems 1 and 2) for observations which are independently and identically distributed with c.d.f. F.
(a) The random variable mF_m(t) follows the binomial distribution with parameters m and F(t).
(b) The mean and variance of F_m(t) are
E[F_m(t)] = F(t), var[F_m(t)] = F(t)[1 − F(t)]/m.
(c) F_m(t) is a consistent estimator of F(t) for fixed t.
(d) F_m(t) converges uniformly to F(t) in probability, that is,
P[|F_m(t) − F(t)| < ε for all t] → 1 as m → ∞, for all ε > 0.
(e) F_m(t) converges uniformly to F(t) with probability one (Glivenko-Cantelli Theorem).
(f) F_m(t) is asymptotically normal with mean and variance given in (b).
(g) The empirical distribution is the mean of the indicator random variables defined by
δ_i(t) = 1 if X_i ≤ t, 0 otherwise,
that is,
F_m(t) = Σ_{i=1}^m δ_i(t)/m.
In particular, F_1(t) = δ_1(t). The covariance between values of δ_i(t) for the same i but different t is
cov[δ_i(t_1), δ_i(t_2)] = F(t_1)[1 − F(t_2)] if t_1 ≤ t_2, and F(t_2)[1 − F(t_1)] if t_2 ≤ t_1; call it a(t_1, t_2).
(h) cov[F_m(t_1), F_m(t_2)] = a(t_1, t_2)/m.
(i) For any fixed t_1, …, t_k, the random variables F_m(t_1), …, F_m(t_k) are asymptotically multivariate normal with mean vector [F(t_1), …, F(t_k)] and covariance matrix with (i, j) element a(t_i, t_j)/m. This can be proved by applying the multivariate Central Limit Theorem for identically distributed random variables to the vectors δ_i = [δ_i(t_1), …, δ_i(t_k)] or by way of the multinomial distribution of the increments F_m(t_1), F_m(t_2) − F_m(t_1), …, F_m(t_k) − F_m(t_{k−1}), 1 − F_m(t_k).
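Property (g) translates directly into a computational definition of the empirical c.d.f.; a short Python sketch (the function name is ours):

```python
def ecdf(sample):
    """The empirical distribution function F_m: F_m(t) is the fraction of
    observations less than or equal to t, i.e., the mean of the
    indicators delta_i(t) of property (g)."""
    s = sorted(sample)
    m = len(s)
    return lambda t: sum(v <= t for v in s) / m

F = ecdf([0.5, 0.9, 2.0])
print(F(0.0), F(0.9), F(5.0))   # 0.0, 2/3, 1.0
```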
The two samples, X_1, …, X_m from F and Y_1, …, Y_n from G, have empirical c.d.f.'s F_m and G_n. The Kolmogorov-Smirnov statistics are
$$D_{mn} = \sup_t |G_n(t) - F_m(t)| \tag{3.1}$$
$$D^+_{mn} = \sup_t\, [G_n(t) - F_m(t)] \tag{3.2}$$
$$D^-_{mn} = \sup_t\, [F_m(t) - G_n(t)] \tag{3.3}$$
where (3.1) is called the two-sided statistic since the absolute value measures differences in both directions, and (3.2) and (3.3) are called the one-sided statistics. Appropriate critical regions are to reject F = G if D_mn is "too large" for a two-sided test, if D⁺_mn is "too large" for a one-sided alternative G ≥ F, and if D⁻_mn is "too large" for a one-sided alternative G ≤ F. (Assume each of these alternatives excludes F(t) = G(t) for all t.)
Tests based on these statistics would appear to be sensitive to all types
of departures from the null hypothesis F = G, and hence not especially
sensitive to a particular type of difference between F and G. However, even
for location alternatives, against which most of the two-sample tests pre-
sented in this book are designed to perform well, the Kolmogorov-Smirnov
statistics are sometimes quite powerful. They are primarily useful, however,
when any type of difference is of interest.
Alternative expressions of the Kolmogorov-Smirnov statistics are more easily evaluated, and they also show that the maxima are achieved. In (3.2), for instance, note that reducing t to the next smaller Y_j does not change G_n(t) and hence can only increase the maximand. Therefore, we can write
$$D^+_{mn} = \max_j\, [G_n(Y_{(j)}) - F_m(Y_{(j)})] = \max_j \left[\frac{j}{n} - \frac{M_j}{m}\right] \tag{3.4}$$
where Y_(j) is the jth smallest Y and M_j is the number of X's less than or equal to Y_(j). Similarly,
$$D^-_{mn} = \max_i\, [F_m(X_{(i)}) - G_n(X_{(i)})] = \max_i \left[\frac{i}{m} - \frac{N_i}{n}\right] \tag{3.5}$$
where N_i is the number of Y's less than or equal to X_(i), the ith smallest X. The two-sided statistic is simply
$$D_{mn} = \max(D^+_{mn}, D^-_{mn}). \tag{3.6}$$
These representations (and others, Problem 3) also make it evident that the Kolmogorov-Smirnov statistics depend only on the ranks of the X's and Y's in the combined sample. Thus the tests based on them are two-sample rank tests. In particular, their distributions under the null hypothesis that the X's and Y's are independent and identically distributed with an unspecified common distribution do not depend on what that common distribution is as long as it has a continuous c.d.f. (This result is also evident otherwise; Problem 3.) The same null distributions hold also in the situation of say a treatment-control experiment where m units are selected at random from N to receive treatment, if the null hypothesis is no treatment effect and the distribution of the characteristic being measured is continuous so that ties have probability 0. (Ties are discussed in Sect. 5.)
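The representations (3.4)-(3.6) give an immediate algorithm, since the suprema need only be examined at the observed points. A Python sketch (function name ours):

```python
def ks_two_sample(x, y):
    """D+_mn, D-_mn, and D_mn via (3.4)-(3.6); the maxima need only be
    checked at the observations themselves."""
    m, n = len(x), len(y)
    F = lambda t: sum(v <= t for v in x) / m   # empirical c.d.f. of the X's
    G = lambda t: sum(v <= t for v in y) / n   # empirical c.d.f. of the Y's
    d_plus = max(G(t) - F(t) for t in y)       # (3.4): max over the Y's
    d_minus = max(F(t) - G(t) for t in x)      # (3.5): max over the X's
    return d_plus, d_minus, max(d_plus, d_minus)   # (3.6)

print(ks_two_sample([0.5, 0.9, 2.0], [-0.2, 6.5, 11.5, 14.3]))
# (0.25, 0.75, 0.75)
```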
Carvalho [1959], and Depaix [1962]. Hodges [1957] includes a useful review of
algorithmic methods. (See also Steck [1969] for results that also make use of
the first place where the maximum occurs.) Hajek and Sidak [1967] give a
good summary of the results known about the Kolmogorov-Smirnov
statistics. Darling [1957] also gives a valuable exposition on these and
related statistics and a rather complete guide to the literature through 1956.
Barton and Mallows [1965] give an Appendix with references on subsequent
developments.
4 Null Distribution Theory

Throughout this section we assume (as does most of the literature most
of the time) that the common distribution of the independent random
variables X and Y is continuous, so that, with probability one, no two ob-
servations are equal. Then we can ignore the possibility of ties either within
or between samples. Ties will be discussed in Sect. 5.
Figure 4.1. The path from (0, 0) to (m, n) for the sample arrangement X Y Y X X Y Y: each X is a unit step to the right, each Y a unit step up.
As successive observations are encountered, the moving point (u, v) reaches every lattice point (point with integer coordinates) on the path. Thus the event D_mn < c occurs if and only if each such lattice point (u, v) on the path satisfies
$$|u/m - v/n| < c. \tag{4.1}$$
Consider this event geometrically. The expression v/n = u/m is the equation of the diagonal line which connects the origin (0, 0) and the terminal point of the path (m, n); the vertical distance from any point (u, v) on the path to this line is |v − (nu/m)|. Accordingly, D_mn < c occurs if and only if the path stays always within a vertical distance of cn from the diagonal connecting (0, 0) to (m, n) (or equivalently, a horizontal distance of cm).
Let A(u′, v′) be the number of paths from (0, 0) to (u′, v′) which stay within this distance, i.e., which satisfy (4.1) for u ≤ u′, v ≤ v′. (This number depends also on m, n, and c, but since they are fixed throughout we do not display this dependence notationally.) Since every path from (0, 0) to (m, n) has equal probability under the null hypothesis, the probability we seek is
$$P(D_{mn} < c) = A(m, n)\Big/\binom{m+n}{m}.$$
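The quantity A(m, n), and hence the exact null probability, can be computed by a simple recursion over the lattice (compare Problem 6). A Python sketch:

```python
from math import comb

def p_less_than(m, n, c):
    """P(D_mn < c) under the null hypothesis: count the paths A(u, v) of
    Sect. 4.1 that satisfy |u/m - v/n| < c at every lattice point."""
    A = [[0] * (n + 1) for _ in range(m + 1)]
    A[0][0] = 1
    for u in range(m + 1):
        for v in range(n + 1):
            if (u or v) and abs(u / m - v / n) < c:
                # each admissible point is reached from the left or from below
                A[u][v] = (A[u - 1][v] if u else 0) + (A[u][v - 1] if v else 0)
    return A[m][n] / comb(m + n, m)

print(p_less_than(3, 4, 0.75))
```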
In the special case where m = n, expressions can be given in closed form for the null c.d.f.'s of both the one-sided and two-sided Kolmogorov-Smirnov statistics. Specifically, we will show that, for k = 1, 2, …, n,
$$P(D^+_{nn} \ge k/n) = \binom{2n}{n-k}\Big/\binom{2n}{n} \tag{4.7}$$
$$P(D_{nn} \ge k/n) = 2\sum_{i=1}^{[n/k]} (-1)^{i+1}\binom{2n}{n-ik}\Big/\binom{2n}{n} \tag{4.8}$$
Figure 4.2. The diagonal-step path from (0, 0) to (m + n, n − m) for the sample arrangement X Y Y X X Y Y.
where [n/k] denotes the largest integer not exceeding n/k. This gives the entire distribution in each case, since the statistics can take on values only of the form k/n when m = n. These formulas are easily evaluated, but they apply only to the case m = n.
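A Python sketch evaluating (4.7) and (4.8) directly (and which can be cross-checked against the lattice recursion above):

```python
from math import comb

def p_one_sided(n, k):
    """P(D+_nn >= k/n), from (4.7)."""
    return comb(2 * n, n - k) / comb(2 * n, n)

def p_two_sided(n, k):
    """P(D_nn >= k/n), from (4.8)."""
    s = sum((-1) ** (i + 1) * comb(2 * n, n - i * k)
            for i in range(1, n // k + 1))
    return 2 * s / comb(2 * n, n)

print(p_one_sided(10, 5), p_two_sided(10, 5))
```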
We now derive these results (following the methods of Drion [1952] and
Carvalho [1959], although they are much older). We again represent the
combined sample arrangement by a path starting at the origin, but the
argument will be easier to follow if we make the steps diagonal. Specifically,
we move one unit to the right and one unit up for each Y, one unit to the right
and one unit down for each X. This gives the same path as before except for
a rotation clockwise by 45° (and a change of scale). Figure 4.2 depicts the
path constructed by this rule for the same sample as Fig. 4.1, X Y Y X X Y Y.
The path now ends at (m + n, n - m), which is (2n, 0) when m = n.
Furthermore, by analysis like that at (4.1), the event D⁺_nn ≥ k/n occurs if and only if the path reaches a height of at least k units above the horizontal axis before it terminates at (2n, 0). We shall prove shortly that the number of such paths is $\binom{2n}{n-k}$. Since all paths are equally likely under the null hypothesis, and since the total number is $\binom{2n}{n}$, it follows that the probability P(D⁺_nn ≥ k/n) is $\binom{2n}{n-k}/\binom{2n}{n}$ as given in (4.7). The number of paths reaching height k is given by setting l = 0 in the following lemma, which will be needed later for both negative and positive l and k.
Lemma 4.1. Let N(k, l) be the number of paths going from (0, 0) to (2n, 2l) and reaching a height of k units. Suppose k is not between 0 and 2l. (If it is, all paths terminating at height 2l obviously reach height k.) Then
$$N(k, l) = \binom{2n}{n-k+l} = \binom{2n}{n+k-l}. \tag{4.9}$$
PROOF. Paths of this sort can be put into one-to-one correspondence with paths terminating at height 2k − 2l by reflecting that portion of each path to the right of the point where it last reaches height k, as in Fig. 4.3. The number of paths terminating at height 2k − 2l is $\binom{2n}{n+k-l}$, which gives (4.9).
Figure 4.3. Reflection of the portion of a path to the right of the point where it last reaches height k, carrying a path from (0, 0) to (2n, 2l) into one from (0, 0) to (2n, 2k − 2l).
For the two-sided statistic, consider the event D_nn ≥ k/n. It occurs if and only if the path reaches at least k units above or below the horizontal axis. By symmetry, the number of paths reaching height −k is the same as the number reaching height k, which was found above. The difficulty is that some paths may reach both k and −k. We will count paths according to the boundary they reach first. It is convenient to extend the notation of Lemma 4.1, letting
N(j, k, l) = the number of paths going from (0, 0) to (2n, 2l) reaching heights j and k, j first;
N(not j, k, l) = the number of paths going from (0, 0) to (2n, 2l) reaching height k without having reached height j.
(Note that, in either case, heights j and k may subsequently be reached any number of times.) In this notation the number of paths satisfying D_nn ≥ k/n is the number reaching the upper boundary first plus the number reaching the lower boundary first, which is
$$N(\mathrm{not}\ {-k},\, k,\, 0) + N(\mathrm{not}\ k,\, {-k},\, 0).$$
For m = n, the exact formulas (4.7) and (4.8) for the null tail probabilities lead directly to the asymptotic null distributions of the one-sided and two-sided Kolmogorov-Smirnov statistics. We first investigate the behavior of (4.7) for large k and n. The right-hand side can be written as
$$\binom{2n}{n-k}\Big/\binom{2n}{n} = \Big(1 - \frac{k}{n+k}\Big)\Big(1 - \frac{k}{n+k-1}\Big)\cdots\Big(1 - \frac{k}{n+1}\Big)$$
where k now and hereafter is the largest integer not exceeding λ√(2n). Thus, for λ fixed, k²/n → 2λ² and we find by (4.15) that
$$\lim_{n\to\infty} P\big(\sqrt{n/2}\, D^+_{nn} \ge \lambda\big) = e^{-2\lambda^2}. \tag{4.16}$$
This is a very simple, easily calculated expression for the asymptotic probabilities. We note also that n(D⁺_nn)² is asymptotically exponentially distributed and 2n(D⁺_nn)² is asymptotically chi-square distributed with 2 degrees of freedom (Problem 28), so that exponential or chi-square tables could be used.
The limiting distribution of the two-sided statistic is found the same way but using (4.8) as follows.
$$P\big(\sqrt{n/2}\, D_{nn} \ge \lambda\big) = P\big(D_{nn} \ge \lambda\sqrt{2n}/n\big) = 2\sum_{i=1}^{[n/k]}(-1)^{i+1}\binom{2n}{n-ik}\Big/\binom{2n}{n} \tag{4.17}$$
$$\lim_{n\to\infty} P\big(\sqrt{n/2}\, D_{nn} \ge \lambda\big) = 2\sum_{i=1}^{\infty}(-1)^{i+1} e^{-2i^2\lambda^2}. \tag{4.19}$$
Taking the limit as n → ∞ term by term in (4.17) can be justified by the fact that the sum is, for each n, and in the limit, an alternating series whose terms decrease monotonically in absolute value to 0 as i → ∞, and therefore dominate the sum of all later terms.
Now we will show heuristically that if D⁺_mn and D_mn are suitably standardized, their limiting distributions do not depend on how m and n approach ∞, and hence are the same for all m and n as for m = n, so that the expressions on the right-hand sides of (4.16) and (4.19) apply also to unequal sample sizes.
For two independent samples, with arbitrary but fixed t_1, …, t_k, the random vectors
[F_m(t_1), …, F_m(t_k)] (4.20a)
and
[G_n(t_1), …, G_n(t_k)] (4.20b)
are independent. By the property (i) in Sect. 2, if the observations are identically distributed with c.d.f. F, these random vectors are both asymptotically multivariate normal with the same mean vector and with covariances a(t_i, t_j)/m and a(t_i, t_j)/n respectively. It follows that the vector
$$\sqrt{mn/(m+n)}\,[G_n(t_1) - F_m(t_1)],\ \ldots,\ \sqrt{mn/(m+n)}\,[G_n(t_k) - F_m(t_k)]$$
is asymptotically multivariate normal with mean vector 0 and the covariances a(t_i, t_j). Hence for each t_1, …, t_k, the quantities √(mn/(m + n))[G_n(t_i) − F_m(t_i)], i = 1, …, k, have the same limiting joint distribution however m and n approach ∞. This suggests that the maximum, or the maximum absolute value, will exhibit this same property, that is, that (4.16) and (4.19) generalize to
$$\lim P\Big(\sqrt{mn/(m+n)}\, D^+_{mn} \ge \lambda\Big) = e^{-2\lambda^2}, \qquad \lim P\Big(\sqrt{mn/(m+n)}\, D_{mn} \ge \lambda\Big) = 2\sum_{i=1}^{\infty}(-1)^{i+1}e^{-2i^2\lambda^2} \tag{4.21}$$
as m, n → ∞.
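The limiting expressions in (4.21) are trivial to evaluate numerically; a Python sketch (function names ours):

```python
from math import exp, sqrt

def one_sided_tail(lam):
    """Limiting P(sqrt(mn/(m+n)) D+_mn >= lambda), from (4.21)."""
    return exp(-2 * lam ** 2)

def two_sided_tail(lam, terms=100):
    """Limiting P(sqrt(mn/(m+n)) D_mn >= lambda), from (4.21);
    the alternating series converges very rapidly."""
    return 2 * sum((-1) ** (i + 1) * exp(-2 * i ** 2 * lam ** 2)
                   for i in range(1, terms + 1))

# e.g., for m = n = 10 and an observed D_nn = 0.6: lambda = sqrt(5) * 0.6
print(two_sided_tail(sqrt(5) * 0.6))
```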
5 Ties
The finite sample and asymptotic null distributions of the Kolmogorov-Smirnov statistics discussed in Sect. 4 hold exactly only for a common distribution F which is continuous, and they do not depend on F as long as it is continuous.
6 Performance
Suppose that the X's and Y's are independent random samples from populations with c.d.f.'s F and G respectively. We have already seen that the sample c.d.f. F_m(t) for the X sample is a consistent estimator of the population c.d.f. F(t), not merely at each t, but in the stronger sense of (d) in Sect. 2; we call this property "strong consistency" temporarily. An equivalent statement (Problem 2b) is that
D_m = sup_t |F_m(t) − F(t)| → 0 in probability.
Similarly, G_n is a strongly consistent estimator of G. It follows that G_n − F_m is a strongly consistent estimator of G − F and hence that D_mn, D⁺_mn, and D⁻_mn are consistent estimators of the corresponding population quantities, namely, the maxima (or suprema) of |G(t) − F(t)|, G(t) − F(t), and F(t) − G(t) respectively (Problem 31).
Consistency properties of the Kolmogorov-Smirnov tests follow in turn. Under the null hypothesis F = G, all three population quantities are 0, and the statistics are distribution-free if the common population is continuous, and stochastically smaller otherwise. Therefore, at any fixed level, the critical values of all three statistics approach 0 as m and n approach ∞. On the other hand, the statistics converge in probability to positive values, and consequently each test is consistent, whenever the corresponding population quantity is positive. Specifically, the two-sided test which rejects for large values of D_mn is consistent against all alternatives F, G with F ≠ G, that is, with F(t) ≠ G(t) for some t. (Details of the proof are left as Problem 32.) The one-sided test which rejects for large values of D⁺_mn is consistent against all alternatives F, G with F(t) < G(t) for some t, and similarly for D⁻_mn and alternatives with F(t) > G(t) for some t. Note that these one-sided alternatives include stochastic dominance and the shift model of Sect. 2 of Chap. 5 as special cases.
We now derive quite simply a lower bound on the power of the one-sided test and the behavior of this bound as m, n → ∞ [Massey, 1950b]. The bound and its asymptotic behavior provide useful insight and will be relevant in Chap. 8. Let c_{mn,α} denote the right-tailed critical value of D⁺_mn for a test at level α. Define Δ = sup_t [G(t) − F(t)] and suppose, for convenience, that the maximum is actually achieved so that Δ = G(t_0) − F(t_0) for some t_0. We know that Δ > 0 if F(t) < G(t) for some t. Since D⁺_mn is certainly never less than G_n(t_0) − F_m(t_0), the power of the test satisfies the inequality
$$P(D^+_{mn} \ge c_{mn,\alpha}) \ge P[G_n(t_0) - F_m(t_0) \ge c_{mn,\alpha}]. \tag{6.1}$$
Since mF_m(t_0) and nG_n(t_0) are independent and binomially distributed (see (a) of Sect. 2), the right-hand side of (6.1) can be evaluated without great difficulty for any specified F and G. Furthermore, for m and n large, c_{mn,α} in (6.1) can be approximated by means of (4.21) and the binomial distribution can be approximated by the normal distribution (see (f) of Sect. 2); the resulting approximation is given in Problem 33.
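The right-hand side of (6.1) is just a double sum over the two binomial distributions; a Python sketch (function and argument names ours):

```python
from math import comb

def pmf(k, n, p):
    """Binomial probability P(K = k) for K ~ binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def power_bound(m, n, F0, G0, c):
    """P[G_n(t0) - F_m(t0) >= c], the right-hand side of (6.1);
    F0 = F(t0) and G0 = G(t0) must be supplied, and mF_m(t0) and
    nG_n(t0) are independent binomials by (a) of Sect. 2."""
    return sum(pmf(i, m, F0) * pmf(j, n, G0)
               for i in range(m + 1) for j in range(n + 1)
               if j / n - i / m >= c)

print(power_bound(10, 10, 0.3, 0.7, 0.5))
```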
7 One-Sample Kolmogorov-Smirnov Statistics

For a single sample X_1, …, X_n with empirical c.d.f. F_n and a hypothesized c.d.f. F_0, the corresponding one-sample statistics are
$$D_n = \sup_t |F_n(t) - F_0(t)| \tag{7.1}$$
$$D^+_n = \sup_t\, [F_n(t) - F_0(t)] \tag{7.2}$$
$$D^-_n = \sup_t\, [F_0(t) - F_n(t)]. \tag{7.3}$$
Against the two-sided alternative F(t) ≠ F_0(t) for some t, a test rejecting the null hypothesis for large values of D_n is natural, and is consistent (Problem 35). Similarly, a test based on D⁺_n is consistent against all alternatives F with F(t) > F_0(t) for some t, and one based on D⁻_n is consistent against F(t) < F_0(t) for some t. Curiously enough, however, each of these tests is biased (slightly) against the corresponding alternative (Massey [1952a]; Problem 36). When F_0 is not continuous, tests using the null distribution or critical values for the continuous case are conservative (Problem 37).
Of course, the c.d.f. F_0 must be fully specified in order to calculate the value of any of these one-sample test statistics. In this sense, the tests are "parametric," and accordingly are discussed only briefly here. The statistics are however "nonparametric," or at least distribution-free, in the sense that when F = F_0, their distributions do not depend on F_0 as long as F_0 is continuous (Problem 38). Because of this, the problem of deriving their exact null distribution in the continuous case need be solved for only one continuous F_0, for instance, the uniform distribution on (0, 1). These null distributions are continuous. Further, the one-sided statistics, D⁺_n and D⁻_n, have identical distributions by symmetry. Massey [1950a] (see also Kolmogorov [1933]) derived a recursive relation for the null probability P(D_n ≤ k/n) for integer values of k; his method applies also to D⁺_n. Birnbaum and Tingey [1951] found an expression in closed form for the entire cumulative null distribution of D⁺_n.
Suppose we form pointwise confidence intervals based on F_n(x) at each x and seek an adjustment to make them valid with overall
(simultaneous) confidence 1 - α. If we start with the confidence regions in the form |F(x) - F_n(x)| ≤ c, then the two-sided Kolmogorov-Smirnov confidence band results. If we start with |F(x) - F_n(x)| ≤ c_α W[F(x)] for an arbitrary function W, such as W[F(x)] = √{F(x)[1 - F(x)]}, then a different band results, corresponding to the statistic sup_x |F_n(x) - F(x)|/W[F(x)]. W determines the relative emphasis on short confidence intervals in different portions of the distribution. See Anderson and Darling [1952] for further discussion.
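As a concrete illustration of the unweighted band (W constant), the following minimal sketch computes the two-sided band from a sample; it assumes scipy's kstwo distribution, which gives the null distribution of D_n for a given n.

    import numpy as np
    from scipy.stats import kstwo

    def ks_band(sample, alpha=0.05):
        # Two-sided band F_n(x) +/- c, where c satisfies P(D_n > c) = alpha.
        n = len(sample)
        c = kstwo.ppf(1 - alpha, n)
        x = np.sort(sample)
        Fn = np.arange(1, n + 1) / n        # empirical c.d.f. at the order statistics
        return x, np.clip(Fn - c, 0, 1), np.clip(Fn + c, 0, 1)

Replacing the constant c by c_α W[F(x)] would give the weighted bands discussed above.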
The D_n or D_n^+ statistic also provides procedures for determining the minimum sample size required to state with a predetermined probability 1 - α that if F(x) is estimated by F_n(x), the error in the estimate will nowhere exceed a fixed value ε. For instance, we can use tables of the null distribution of D_n to find the smallest integer n such that P(D_n < ε) ≥ 1 - α (or P(D_n^+ < ε) ≥ 1 - α). This is the minimum n for which the two-sided (or one-sided) confidence bands described earlier lie within ε of the empirical distribution function.
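A small sketch of this sample-size computation, again assuming scipy's kstwo for the null distribution of D_n; the values ε = 0.10 and α = 0.05 are illustrative.

    from scipy.stats import kstwo

    def min_sample_size(eps, alpha):
        # Smallest n with P(D_n < eps) >= 1 - alpha, by direct search.
        n = 1
        while kstwo.sf(eps, n) > alpha:     # P(D_n >= eps) still exceeds alpha
            n += 1
        return n

    print(min_sample_size(0.10, 0.05))      # near (1.358/0.10)^2 = 184 asymptotically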
PROBLEMS
6. In the notation of Section 4.1, show that for 0 ≤ k ≤ m + n,

$A(m, n) = \sum_{u=0}^{m} A(u, k - u)A(m - u, n - k + u).$

The range of u can be restricted to max(0, k - n) ≤ u ≤ min(k, m). Using this expression for A(m, n) with k = (m + n)/2 or (m + n + 1)/2 permits the recursion to be terminated at u + v = k. (See Hodges [1957].)
7. What change is required in the definition of A(u, v) in Section 4.1 to obtain the exact null distribution of D_mn^+?
8. Use the result of Problem 7 to find the P-value of D_mn^+ for the sample arrangement X Y Y X X Y Y (the same arrangement as in Problem 4).
9. Show the symmetry relations (4.5) for the null distributions of the one-sided
Kolmogorov-Smirnov statistics.
10. Assume that all possible arrangements of two samples have positive probability. Show that
(a) P(D_mn ≥ c) = P(D_mn^+ ≥ c) + P(D_mn^- ≥ c) - P[min(D_mn^+, D_mn^-) ≥ c].
(b) D_mn^+ and D_mn^- cannot both exceed 0.5.
(c) If m or n is even, samples exist with D_mn^+ = D_mn^- = 0.5.
(d) If m and n are both odd, the largest c for which min(D_mn^+, D_mn^-) ≥ c is possible, and hence the largest c for which P(D_mn ≥ c) < P(D_mn^+ ≥ c) + P(D_mn^- ≥ c), is c' = 0.5 - [0.5/max(m, n)].
(e) For m = 5, n = 7, 16/35 is a possible value of D_mn^+ but 17/35 is not. Since c' = 6/14 = 15/35 here, this illustrates that values of D_mn^+ between c' and 0.5 of the form k/M, where k is an integer and M is the least common multiple of m and n, are sometimes possible and sometimes impossible.
(f) If c is the critical value of D_mn^+ at exact level α, and if c > 0.5 or m and n are both odd and c > 0.5 - [0.5/max(m, n)], then c is the critical value of D_mn at exact level 2α. This is not always true if the word exact is omitted.
11. Let P_1 be the larger of the two one-tailed Kolmogorov-Smirnov P-values and P_2 be the two-tailed P-value. Show that 0 ≤ 2P_1 - P_2 ≤ 2P_1^4
(a) Asymptotically, that is, when the right-hand sides of (4.21) and (4.22) are used for P_1 and P_2 respectively.
*(b) For m = n. (Hint: Show that $2P_1 - P_2 \le 2\binom{2m}{m-2k}/\binom{2m}{m} \le 2P_1^4 = 2[\binom{2m}{m-k}/\binom{2m}{m}]^4$, where P_1 and P_2 are given by (4.7) and (4.8), by showing that

$\binom{2m}{m-2k}\binom{2m}{m}^3 \le \binom{2m}{m-k}^4$

for 1 ≤ k ≤ m. See (4.14).)
12. Define "k is between c and d" as meaning c ≤ k ≤ d or d ≤ k ≤ c. In the notation of Section 4.2, show that
(a) N(j, k, l) = N(-j, -k, -l).
(b) N(j, k, l) = N(j, l) - N(k, j, l) if k is between j and 2l.
(c) N(not j, k, l) = N(k, j, l) if j is between k and 2l.
(d) N(j, k, l) = N(j, k - l) - N(k, j, k - l) if k is between j and 2k - 2l.
(e) N(not j, k, l) = N(k, l) - N(not k, j, k - l) if k is between j and 2k - 2l.
(f) N(not j, k, l) = N(not j, k, k - l).
(g) N(not j, k, l) = N(j, k - l) - N(not k, j, k - l) if j is between k and 2k - 2l.
(h) k is between j and 2k - 2l if and only if k is not between j and 2l.
13. Derive the formula for the number of paths which satisfy D_mn ≥ k/n using
(a) Part (c) of Problem 12.
(b) Part (d) of Problem 12.
14. At (4.12), if the portion of the path to the left of the point where the path reaches height k is reflected, instead of the portion to the right, what result is obtained and how does it relate to (4.12)?
15. Let N(i, j, k, l) and N(i, not j, k, l) be the numbers of paths starting at height i, reaching height k after, and without, respectively, having first reached height j, and terminating at height l, where the steps are diagonal as in Sect. 4.3. Give properties of N like those given in Sect. 4.3 and Problem 12. Assume that the sum (i + l + the number of steps) is even. (What happens if this sum is odd?)
16. The exact null probability distribution of any of the Kolmogorov-Smirnov two-sample statistics can always be determined by enumerating all possible arrangements of the X and Y observations and computing the value of the statistic for each. Enumerate the $\binom{m+n}{m} = 20$ arrangements for m = n = 3. Determine the exact null distributions of D_mn and D_mn^+, both from the enumeration and from (4.7) and (4.8).
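A minimal enumeration sketch for Problem 16: it runs over the positions of the X's in the 20 orderings and tallies the exact null distributions (counts out of C(6,3) = 20).

    from itertools import combinations
    from fractions import Fraction
    from collections import Counter

    m = n = 3
    d_plus, d_two = Counter(), Counter()
    for xpos in combinations(range(m + n), m):   # positions of X's in the ordering
        fm, gn = 0, 0
        best_plus, best_two = Fraction(0), Fraction(0)
        for k in range(m + n):
            if k in xpos:
                fm += 1
            else:
                gn += 1
            diff = Fraction(gn, n) - Fraction(fm, m)  # G_n - F_m after the kth point
            best_plus = max(best_plus, diff)
            best_two = max(best_two, abs(diff))
        d_plus[best_plus] += 1                   # value of D_mn^+ for this arrangement
        d_two[best_two] += 1                     # value of D_mn
    print(sorted(d_plus.items()))
    print(sorted(d_two.items()))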
17. Show that the Mann-Whitney statistics can be expressed in terms of the empirical distribution functions as follows:

$U = n\sum_{i=1}^{m} G_n(X_i), \qquad U' = m\sum_{j=1}^{n} F_m(Y_j).$
18. Define the indicator variables I_k = 1 if the kth smallest observation in the combined ordered sample of m X's and n Y's is an X, and I_k = 0 otherwise, k = 1, 2, ..., m + n = N. Show that, if m = n, the two-sample Kolmogorov-Smirnov statistics can be expressed as

$D_{mn}^+ = \frac{1}{n}\max_{1\le j\le 2n}\Bigl[\,j - 2\sum_{k=1}^{j} I_k\Bigr] \quad\text{and}\quad D_{mn} = \frac{1}{n}\max_{1\le j\le 2n}\Bigl|\,j - 2\sum_{k=1}^{j} I_k\Bigr|.$
*22. Show by elementary considerations that P(D_nn ≥ 1/n) = 1 always, and show that (4.8) gives the same answer under the null hypothesis. (Hint: Manipulations are like Problem 21.)
*23. Show that the null distribution of D_mn^+ can be obtained by the following recursive procedure. Let v_i be the smallest integer greater than or equal to (ni/m) + nc for 0 ≤ i ≤ I, where I = [m(1 - c)] is the greatest integer in m(1 - c), and let N_i = i + v_i. Let M_0 = 1 and, for i ≥ 1,

$M_i = \binom{N_i}{i} - \sum_{j=0}^{i-1} M_j \binom{N_i - N_j}{i - j}.$

Then

$P(D_{mn}^+ \ge c) = \sum_{i=0}^{I} M_i \binom{N - N_i}{m - i} \Big/ \binom{N}{m}.$

(Hint: In the representation of Sect. 4.1, the N_i th step is the first to reach the boundary and (i, v_i) is the point reached. Altogether there are $\binom{N_i}{i}$ paths to this point, of which M_i reach the boundary first at the last (N_i th) step and $M_j\binom{N_i - N_j}{i - j}$ at the N_j th step. There are $\binom{N}{m}$ paths to (m, n), of which $M_i\binom{N - N_i}{m - i}$ reach the boundary first at the N_i th step. For calculation, choosing m ≤ n minimizes the number of terms, and replacing N_i by N_i - 1 (without changing N) in the recursive formula for M_i reduces their size somewhat and can be justified by observing that M_i is the number of paths to (i, v_i - 1) not reaching the boundary.) See Hodges [1957], Korolyuk [1955], and Steck [1969].
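A sketch of the recursion of Problem 23 in Python, using exact rational arithmetic for the boundary values; the final lines check it against (4.7) for m = n = 5 and c = 2/5.

    import math
    from fractions import Fraction

    def p_dplus_ge(m, n, c):
        # P(D_mn^+ >= c) by the recursion of Problem 23; pass c as a Fraction.
        N, I = m + n, math.floor(m * (1 - c))
        v = [math.ceil(Fraction(n * i, m) + n * c) for i in range(I + 1)]
        Ni = [i + v[i] for i in range(I + 1)]
        M = []
        for i in range(I + 1):
            Mi = math.comb(Ni[i], i)
            for j in range(i):
                Mi -= M[j] * math.comb(Ni[i] - Ni[j], i - j)
            M.append(Mi)
        num = sum(M[i] * math.comb(N - Ni[i], m - i) for i in range(I + 1))
        return Fraction(num, math.comb(N, m))

    print(p_dplus_ge(5, 5, Fraction(2, 5)))                  # recursion: 10/21
    print(Fraction(math.comb(10, 3), math.comb(10, 5)))      # (4.7): C(2n, n-k)/C(2n, n)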
24. Show that

$D_{mn}^+ = \max_{i,j}\Bigl\{\frac{j}{n} - \frac{i}{m} : X_{(i+1)} > Y_{(j)}\Bigr\}, \qquad D_{mn}^- = \max_{i,j}\Bigl\{\frac{i}{m} - \frac{j}{n} : Y_{(j+1)} > X_{(i)}\Bigr\}.$
25. Show that the confidence bounds for a shift parameter corresponding to the two-sample Kolmogorov-Smirnov tests with critical value c (rejecting for D_mn^+ ≥ c, D_mn^- ≥ c, or D_mn ≥ c) are as follows. The upper bound is

$\min_{j}\,\{Y_{(j)} - X_{(i_j)}\} = \min_{i}\,\{Y_{(j_i)} - X_{(i)}\},$

where i_j is the smallest integer exceeding (mj/n) - mc and j_i is the smallest integer not less than [n(i - 1)/m] + nc. The lower bound is

$\max_{i,j}\Bigl\{Y_{(j)} - X_{(i)} : \frac{i}{m} - \frac{j-1}{n} \ge c\Bigr\} = \max_{j}\,\{Y_{(j)} - X_{(k_j)}\} = \max_{i}\,\{Y_{(l_i)} - X_{(i)}\},$

where k_j is the smallest integer not less than [m(j - 1)/n] + mc and l_i is the smallest integer exceeding (ni/m) - nc.
26. Compare the confidence bounds for a shift parameter corresponding to the two-sample Kolmogorov-Smirnov tests (Problem 25) with critical value c ≥ 1 - [1/min(m, n)] to those corresponding to the rank sum test.
27. Show that the largest possible values of the one-sided, two-sample Kolmogorov-Smirnov statistic D_mn^+ with m ≤ n and the associated tail probabilities under the null hypothesis can be expressed as follows, where k is the largest integer in n/m:
(a) P(D_mn^+ = 1) = $1/\binom{N}{m}$ for all m ≤ n.
(b) P(D_mn^+ ≥ 1 - 1/n) = $(m + 1)/\binom{N}{m}$ for m < n.
(c) P(D_mn^+ ≥ 1 - 1/m) = $N/\binom{N}{m}$ for m ≤ n < 2m.
(d) P(D_mn^+ ≥ 1 - i/n) = $\binom{m+i}{i}/\binom{N}{m}$ for 0 < i < n/m.
(e) P(D_mn^+ ≥ 1 - 1/m) = $[\binom{m+k}{k} + n - k]/\binom{N}{m}$.
(f) P(D_mn^+ ≥ 1 - i/n) = $[\binom{m+i}{i} + \binom{m+i-k-1}{i-k-1}(n - i)]/\binom{N}{m}$ for n/m < i < 2n/m.
(g) P(D_mn^+ ≥ 1 - 1/m - i/n) = $[\binom{m+k+i}{k+i} + \binom{m+k+i-1}{k+i-1}(n - k - i)]/\binom{N}{m}$ for 0 < i < n/m.
28. Show that the asymptotic null distribution of (2mn/N)(D_mn^+)² is exponential and that of (4mn/N)(D_mn^+)² is chi-square with 2 degrees of freedom.
*29. Show that P(D_mn ≥ c) and P(D_mn^+ ≥ c) are strictly smaller for discontinuous than for continuous F for all c, 0 < c ≤ 1, when all observations are independent with c.d.f. F. (Hint: Consider the possibility that all observations are tied.)
30. Show that
(a) The Kolmogorov-Smirnov statistics are discontinuous functions of the observations.
(b) The Kolmogorov-Smirnov statistics are lower-semicontinuous, where h(x) is defined as lower-semicontinuous at x_0 if, for every ε > 0, there is a neighborhood of x_0 where h(x) ≥ h(x_0) - ε.
*(c) The level of any test for discontinuously distributed observations is no greater than its level for continuously distributed observations if the acceptance region of the test is of the form T ≤ c where T is a lower-semicontinuous function of the observations.
31. In the situation and terminology of Sect. 6, show that G_n - F_m is a strongly consistent estimator of G - F and use this to show that the three two-sample Kolmogorov-Smirnov statistics are consistent estimators of the corresponding population quantities.
32. Show that the one-sided and two-sided Kolmogorov-Smirnov two-sample tests
are consistent against the alternatives stated in Sect. 6.
33. Derive the asymptotic lower bound (6.2) on the power of the one-sided two-sample
Kolmogorov-Smirnov test.
34. In the situation of Sect. 6, show that
(a) The power of the one-sided Kolmogorov-Smirnov test is at most the null probability that D_mn^+ ≥ c_{mn,α} - Δ.
(b) For large m and n, the probability in (a) is approximately exp{-2[max(c_α - Δ√(mn/N), 0)]²}.
(c) The two-sided Kolmogorov-Smirnov test with critical value c_{mn,α} has power at least

$P[G_n(t_0) - F_m(t_0) \ge c_{mn,\alpha}] + P[G_n(t_0) - F_m(t_0) \le -c_{mn,\alpha}].$

(d) For large m and n the quantity in (c) is approximately

$\Phi[(\Delta - c_\alpha\sqrt{N/mn})/\sigma] + \Phi[(-\Delta - c_\alpha\sqrt{N/mn})/\sigma] \ge \Phi[2(\Delta\sqrt{mn/N} - c_\alpha)] + \Phi[2(-\Delta\sqrt{mn/N} - c_\alpha)],$

where c_α is the value of λ for which (4.18) equals α.
*(e) Parts (c) and (d) and (6.1)-(6.4) are valid for any x_0, with Δ replaced by δ = G(x_0) - F(x_0). What choice of x_0 gives the tightest lower bound in (6.3), (6.4), and part (d)?
35. Show that the Kolmogorov-Smirnov one-sample tests are consistent as stated in Sect. 7.
36. Show that the Kolmogorov-Smirnov one-sample tests are biased. (Hint: Let Z be a function of X such that X < Z < a for X < a, X > Z > b for X > b, and Z = X otherwise, where F_0(a) = 1 - F_0(b) = critical value of the test statistic. An X sample rejects whenever the corresponding Z sample does, but not conversely.) [Massey, 1950b]
*37. Show that the Kolmogorov-Smirnov one-sample statistics are stochastically smaller under the null hypothesis when F_0 is discontinuous than when it is continuous, and hence that the P-values and critical values that are exact for the continuous case are conservative in the discontinuous case.
38. Show that the one-sample Kolmogorov-Smirnov statistics are distribution-free for a sample from a population with a continuous c.d.f. F_0(x). (Hint: Let U = F_0(X).)
39. Show that the two-sample Kolmogorov-Smirnov statistics approach the cor-
responding one-sample statistics as one sample size becomes infinite. Define the
type of convergence you use.
40. Show that the lower, upper, and two-sided confidence bands defined by the critical values of the Kolmogorov-Smirnov one-sample statistics each have probability at least 1 - α of completely covering the true c.d.f. sampled, whatever it may be.
41. Show that a one-sample Kolmogorov-Smirnov test would "accept" the c.d.f. F_0 if and only if F_0 lies entirely within the corresponding confidence band. Assume the same critical value is used for F_0 discontinuous as for F_0 continuous.
42. For F_0 continuous, show that
(a) The P-value of D_n is twice the smaller of the two corresponding one-sided P-values if D_n ≥ 0.5, and less than twice if 0 < D_n < 0.5.
(b) The critical value D_{n,α} = D^+_{n,α/2} if D^+_{n,α/2} ≥ 0.5. Otherwise D_{n,α} < D^+_{n,α/2}.
43. Use a symmetry argument to show that the null distribution of D_n^- is identical to that of D_n^+.
44. Show that, in the definitions (7.1)-(7.3) of the Kolmogorov-Smirnov one-sample statistics, the supremum is always achieved for D_n^+ but may not be for D_n^- or D_n.
45. Show that for F_0 continuous the Kolmogorov-Smirnov one-sample statistics defined in (7.1)-(7.3) can be written as

$D_n^+ = \max_{1\le i\le n}\,[(i/n) - F_0(X_{(i)})], \qquad D_n^- = \max_{1\le i\le n}\,[F_0(X_{(i)}) - (i-1)/n].$
46. Show that under the null hypothesis F = F_0 for F_0 continuous, the null distribution of the one-sample Kolmogorov-Smirnov statistics can be expressed as follows, where U_1 < U_2 < ... < U_n are the order statistics of a sample of size n from the uniform distribution on (0, 1).
(a) P(D_n^+ ≤ c) = P[U_i ≥ (i/n) - c for i = 1, ..., n].
(b) P(D_n ≤ c) = P[(i/n) - c ≤ U_i ≤ ((i - 1)/n) + c for i = 1, ..., n].
(c) $P(D_n^+ \le c) = n!\int_{a_n}^{1}\int_{a_{n-1}}^{u_n}\cdots\int_{a_1}^{u_2} du_1\cdots du_n$, where a_i = (i/n) - c for i > nc, a_i = 0 otherwise, and 0 ≤ c ≤ 1.
(d) $P(D_n \le c) = \int_{1-c}^{1-(1/n)+c}\int_{1-(1/n)-c}^{1-(2/n)+c}\cdots\int_{(1/n)-c}^{c} f(u_1, \ldots, u_n)\, du_1\cdots du_n$, where f(u_1, ..., u_n) = n! for 0 < u_1 < ... < u_n < 1 and 0 otherwise, and 1/2n ≤ c ≤ 1.
*49. Let F_m be the empirical distribution of a sample of size m from a population with continuous c.d.f. F. Let b be a constant, b ≥ 1.
(a) Verify that P[F_m(t) ≤ bF(t) for all t] = 1 - (1/b) for m = 1 and m = 2.
(b) Prove the result in (a) for all m. (Hint: Take F uniform on (0, 1). Let Z be the sample maximum. Given Z, the remainder of the sample is distributed uniformly on (0, Z). The result for m - 1 implies that P[F_m(t) ≤ bt for 0 < t < Z | Z] = 1 - [(m - 1)/(mbZ)]. The remaining requirement for the event to occur is Z ≥ 1/b.) This result is due to Daniels [1945], and a special case is given in Dempster [1959].
*50. Show that the null distribution of D_n^- can be expressed in the following form due to Birnbaum and Tingey [1951].

$P(D_n^- \ge c) = (1 - c)^n + c\sum_{j=1}^{[n(1-c)]} \binom{n}{j}\Bigl(1 - c - \frac{j}{n}\Bigr)^{n-j}\Bigl(c + \frac{j}{n}\Bigr)^{j-1}.$

(Hint: Referring to Problem 46(a), let the last i for which U_i < (i/n) - c be n - j. Then exactly n - j of the U's are smaller than 1 - c - (j/n), and this has probability $\binom{n}{j}[1 - c - (j/n)]^{n-j}[c + (j/n)]^{j}$. Furthermore, the remaining U's are conditionally uniformly distributed on the interval [1 - c - (j/n), 1] and at most k of them are in [1 - c - (j/n), 1 - c + ((k - j)/n)]. By Problem 49, this has conditional probability 1 - {(j/n)/[c + (j/n)]} = c/[c + (j/n)]. Multiplying these two probabilities gives the jth term of the sum, which is the probability that the (n - j)th order statistic is the last at which the empirical c.d.f. exceeds the upper bound. See also Chapman [1958], Dempster [1959], Dwass [1959], Pyke [1959], and Problem 52.)
51. Verify directly, from both Problem 46(c) and Problem 50, that under the null hypothesis, P(D_n^- > 0.447) = 0.10 for n = 5.
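A direct numerical check of Problem 51 from the Birnbaum-Tingey formula of Problem 50, as a small Python sketch:

    import math

    def birnbaum_tingey(n, c):
        # P(D_n^- >= c), Problem 50's closed form.
        total = (1 - c) ** n
        for j in range(1, math.floor(n * (1 - c)) + 1):
            total += (c * math.comb(n, j) * (1 - c - j / n) ** (n - j)
                      * (c + j / n) ** (j - 1))
        return total

    print(birnbaum_tingey(5, 0.447))    # 0.0999..., i.e., 0.10 as stated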
*52. Generalize the Birnbaum-Tingey formula in Problem 50 to a formula for the null probability that sup_t [F_m(t) - bF(t)] ≥ c for arbitrary constants c > 0 and b > 1 - c. See the references cited in Problem 50.
*53. Derive the asymptotic null distribution of D_n^- from the Birnbaum-Tingey formula in Problem 50 (Dempster [1955]).
54. Under the null hypothesis show that
(a) D_n^- is uniformly distributed on (0, 1) for n = 1.
(b) The density of D_n^- for n = 2 is

$h(x) = \begin{cases} 1 + 2x & 0 \le x \le \tfrac{1}{2} \\ 2(1 - x) & \tfrac{1}{2} \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$
55. Let F_m be the empirical distribution function of a random sample of size m from the uniform distribution on (0, 1). Define

$X_m(t) = \sqrt{m}\,[F_m(t) - t], \qquad Z_m(t) = (t + 1)X_m[t/(t + 1)]$

for all 0 ≤ t ≤ 1.
(a) Find E[X_m(t)], E[Z_m(t)], var[X_m(t)], var[Z_m(t)], cov[X_m(t), X_m(u)], cov[Z_m(t), Z_m(u)], for all 0 ≤ t ≤ u ≤ 1 and all m.
(b) What regions for Z_m correspond to the regions D_m^+ < c and D_m < c for the Kolmogorov-Smirnov one-sample statistics?
56. (a) Under the null hypothesis that a sample comes from a normal population, consider the Kolmogorov-Smirnov one-sample statistics with parameters estimated by the sample mean X̄ and standard deviation s, namely, D_n^+ = sup_t {F_n(t) - Φ[(t - X̄)/s]} and D_n = sup_t |F_n(t) - Φ[(t - X̄)/s]|, where Φ is the standard normal c.d.f. Show that their null distributions do not depend on the mean and variance of the normal population.
(b) Give an analogous result for the null hypothesis of an exponential population.
(c) Does an analogous result hold for all parametric null hypotheses?
57. Measures of the distance between F_m and G_n other than their maximum difference can also be used as distribution-free tests of the null hypothesis F = G that two continuous distributions are identical. The one called the Cramér-von Mises statistic is defined as

$\frac{mn}{N}\int_{-\infty}^{\infty} [F_m(t) - G_n(t)]^2\, dH_N(t),$

where H_N is the empirical c.d.f. of the combined sample.
CHAPTER 8
Asymptotic Relative Efficiency

1 Introduction
In any given inference situation, many statistical procedures may be available, both parametric and nonparametric. Some measure of their relative merits is needed, especially as regards their performance or operating characteristics. For instance, a comparison of the power functions of various tests of the same (or essentially the same) hypotheses would be of interest. It is frequently more convenient, and also more suggestive, to use a measure of relative merit called the relative efficiency.
The relative efficiency of a procedure O_1 with respect to a procedure O_2 is defined as the ratio of sample sizes needed to achieve the same performance, namely n_2/n_1, where n_2 is the number of observations required to achieve the same performance using O_2 as can be achieved using O_1 with n_1 observations. Thus, in particular, the relative efficiency of O_1 with respect to O_2 is less than or greater than 1 according as O_2 requires fewer or more observations than O_1 to achieve the same performance.
The use of this definition poses certain problems. For one thing, the comparison procedure O_2, at least, must be defined for each possible sample size n_2. Thus it should really be regarded as a sequence of procedures, one for each sample size. The fact that n_2 is not a continuous variable poses another slight problem, because typically there will be no integer n_2 for which the performance of O_2, in some specified respect, exactly matches the performance of O_1 with n_1 observations.
The main problem, however, is that there are many ways to specify what it means to "achieve the same performance." Each specification will produce its own value of n_2 and hence of the relative efficiency. This multiplicity of values can confuse comparisons.
In testing, for example, one could ask that O_2 achieve the same power as O_1 for any specific alternative, or the same average power for some weighted average over a set of alternatives. The relative efficiency, sometimes called the power efficiency in this case, is thus a function of all those variables which determine power, including n_1 and α as well as the alternative distribution. Usually, however, it varies much less than the power as a function of these variables. Consequently, the relative performance of two tests can usually be described much more concisely in terms of relative efficiency than in terms of power functions directly. In some cases, conveniently, the relative efficiency is approximately constant, so that the entire comparison reduces to a single number.
Sometimes the relative efficiency has a lower bound close to one everywhere in the range of importance. If, for example, the relative efficiency of a nonparametric test with respect to a parametric test is never much below 1, the nonparametric test is wasting at most some small fraction of the observations. One might then feel that the advantages, like simplicity and broader applicability, of the nonparametric test outweigh this small waste. Such a conclusion should not be reached casually, however, without serious assessment of the real value of such advantages and of increasing power.
For point estimation, relative efficiency is usually defined as the ratio of sample sizes needed to achieve the same variances or mean squared errors of the estimators. Other functions of the error (other "loss" functions) could be used instead of the square for matching. Matching one function does not ordinarily match others or the error distributions exactly. The relative efficiency of two estimators also depends on the assumed distribution and n_1, just as for tests it depends on n_1 and the alternative where the power is matched.
For confidence intervals, relative efficiency might be defined by matching expected lengths. This would not entirely match the distributions of the endpoints, of course. The probability of covering some specified false value could be matched instead, and the result would then be a function of the false value used. In any case, the relative efficiency of two confidence procedures will depend on the confidence level, as well as the true distribution and n_1.
Since relative efficiency generally depends on so many factors, its implications may be difficult to assess and interpret. This problem often disappears conveniently when limits are taken. The asymptotic relative efficiency of a procedure O_1 with respect to a procedure O_2 is defined roughly as the limit of the relative efficiency as n_1 → ∞. Here, of course, both O_1 and O_2 must be sequences of procedures, defined for arbitrarily large sample sizes n_1 and n_2. It would seem that the limit required in this definition might well fail to exist, and that when it does exist it might depend on essentially the same variables as the relative efficiencies for finite sample sizes. We shall see, however, that things become much simpler as the sample sizes approach infinity, and a single number will describe a great many features of the comparison.
2 Asymptotic Behavior of Tests: Heuristic Discussion
In particular, the rate at which θ_n → θ_0 can be adjusted so that the argument of Φ is finite. Furthermore, to a first approximation, the power function of the test is completely determined by the quantity μ_n′(θ_0)/σ_n(θ_0). Notice that the variance σ_n²(θ) is needed only under the null hypothesis θ = θ_0.
As an example, consider the power function of the ordinary sign test against normal alternatives. Let T_n be the number of negative observations. If the true distribution of the population is normal with mean θ and variance σ², then the probability of a negative observation is

$p = \Phi(-\theta/\sigma).$  (2.6)

At θ = θ_0 = 0, the null hypothesis p = ½ is satisfied. T_n is binomial with parameters p and n, and hence is approximately normal with mean and variance given by

$\mu_n(\theta) = np = n\Phi(-\theta/\sigma)$  (2.7)
$\sigma_n^2(\theta) = np(1 - p) = n\Phi(-\theta/\sigma)[1 - \Phi(-\theta/\sigma)].$  (2.8)
Letting φ denote the standard normal density, we calculate

$\frac{\mu_n'(\theta_0)}{\sigma_n(\theta_0)} = \frac{n(-1/\sigma)\phi(0)}{\sqrt{n/4}} = -2\sqrt{n}\,\phi(0)/\sigma = -\sqrt{2n/\pi}/\sigma.$  (2.9)

Substituting (2.9) in (2.5) then gives the approximate power of the sign test against normal alternatives. Define

$e_n = [\mu_n'(\theta_0)]^2/\sigma_n^2(\theta_0).$  (2.10)
The quantity e_n is, in general, called the efficacy of the test statistic T_n for the family of distributions in question (at θ_0). As defined here, it depends on the choice of the approximations μ_n and σ_n, which are not unique, and on the choice among equivalent test statistics. As we shall see, however, these choices have only a second-order effect, and the efficacy of a test is uniquely defined to a first order of approximation.
In taking the square, the sign of μ_n′(θ_0)/σ_n(θ_0) is lost, but this is unimportant for present purposes. The sign is always the same as the sign of μ_n′(θ_0) and merely indicates whether large values of T_n are associated with large or small values of θ. For example, the negative sign in (2.9) corresponds to the fact that for normal distributions the number of negative observations tends to be large when the mean θ is small, which implies that a one-tailed test rejecting when there are too many negative observations is appropriate against
the alternative θ < 0, not the alternative θ > 0. Provided the appropriate one-tailed test is used for a one-sided alternative, the sign of μ_n′(θ_0) is of no consequence.
The efficacy e_n therefore contains exactly the information we need here, and conveniently is typically of order n. In terms of e_n, by (2.5), the power of the appropriate one-tailed test against the alternative θ > θ_0 can be expressed approximately as

$\Phi[\sqrt{e_n}\,(\theta - \theta_0) - z_\alpha].$  (2.11)

Figure 2.1
When σ is known, the test based on the sample mean rejects for

$\sqrt{n}\,(\bar{X} - \theta_0)/\sigma \ge z_\alpha.$  (2.14)

If the sample standard deviation S is substituted for σ when σ is unknown, the resulting test rejects for

$\sqrt{n}\,(\bar{X} - \theta_0)/S \ge z_\alpha.$  (2.15)

The previous paragraph states that the test (2.15) is approximately the same as the test (2.14) which requires the true value of σ, and this holds whatever that value may be. The t test is of the form (2.15) except that the constant z_α is adjusted to make the test exact under normality. The amount of the adjustment approaches zero as n → ∞, however.
This illustrates the typical situation where a statistic T_n is selected for a test but the null distribution of T_n depends on one or more nuisance parameters. Then we cannot base the test on T_n in the sense of comparing T_n with a fixed critical value. We can in another sense, however, as follows. Compute the critical value of T_n as a function of the nuisance parameters, substitute consistent estimates of the nuisance parameters, and compare T_n with the resulting critical value. Then the test will be approximately the same as a test which is based on T_n in the ordinary sense but requires knowledge of the values of the nuisance parameters. In particular, the efficacy computed for T_n using (2.10) will relate in the usual way to the power of the test. For instance, Equation (2.11) will apply (Problem 6a). This will still be true if the test is adjusted further to make its level exactly α under some null hypothesis (Problem 6b). In short, the efficacy of T_n will typically apply to the power of all tests which are based essentially on T_n.
There is an alternative argument that can often be used to justify the approximations of the previous paragraph. For example, since the t test at level α = 0.50 can be based on the sample mean, the efficacy computed from the sample mean applies to the t test at level α = 0.50. However, since the efficacy does not depend on the value of α, the efficacy of the sample mean must apply to the t test at all levels. This argument is further explained and used in connection with point estimation in Sect. 3.2.
Now consider two tests at the same level, which are based on statistics T_{1,n} and T_{2,n} with efficacies e_{1,n} and e_{2,n} respectively. The power of each test is given approximately by (2.11). For a given sample size, the two power functions are then approximately the same shape but they differ by a scale factor √(e_{2,n}/e_{1,n}). That is, the power of the first test at θ_0 + δ is approximately the same as the power of the second test at θ_0 + δ√(e_{1,n}/e_{2,n}). Figure 2.1 illustrates this situation with graphs of normal power functions for one-sided tests at level α = 0.05, that is, Φ(δ√k - 1.64) as a function of δ, for several values of k, where k represents e_{2,n}/e_{1,n}.
We have seen that typically e_{1,n} is of order n. More specifically, as n → ∞, the ratio e_{1,n}/n typically approaches¹ some positive constant e_1,

$\lim_{n\to\infty} e_{1,n}/n = e_1.$  (2.16)

We call e_1 the limiting efficacy per observation or, more briefly, the asymptotic efficacy of the first test statistic. If the limiting efficacies per observation exist for both test statistics, then the ratio e_{1,n}/e_{2,n} approaches E_{1:2}, where

$E_{1:2} = \frac{\lim e_{1,n}/n}{\lim e_{2,n}/n} = \frac{e_1}{e_2}.$  (2.17)

¹ All limits in this chapter are to be taken as the relevant sample sizes approach infinity, but this will not be repeated explicitly each time.

Since (2.16) implies that e_{1,n_1} ≈ n_1 e_1 and e_{2,n_2} ≈ n_2 e_2, substituting these in (2.18) shows that the power functions of the two tests will be equal when

$n_2/n_1 = e_1/e_2 = E_{1:2}.$  (2.19)
Thus E_{1:2} can be interpreted as the limiting ratio of sample sizes for which the tests have the same power. This is the usual definition of asymptotic relative efficiency. The foregoing discussion indicates that both this and the scale-factor interpretation of E_{1:2} mentioned earlier will hold in typical situations.
As an example, consider the asymptotic efficiency of the ordinary sign test relative to the classical normal-theory test for the same situation. The asymptotic relative efficiency depends on the family of distributions under discussion. Let us consider a normal family as one relevant possibility. Let T_{1,n} be the test statistic for the sign test, that is, the number of negative observations. From the results given in (2.9), the efficacy of T_{1,n} for the normal distribution is

$e_{1,n} = 2n/(\pi\sigma^2).$  (2.20)

The normal-theory test is based essentially on the sample mean. The test statistic is the sample mean if σ is known and the t statistic if σ is unknown, but, as explained in the last subsection, we may proceed as if the sample mean were the test statistic in both cases. Accordingly, we let T_{2,n} be the sample mean and obtain its efficacy from (2.12) as

$e_{2,n} = n/\sigma^2.$  (2.21)

The asymptotic efficiency of the sign test relative to the normal-theory test (for σ either known or unknown), against normal alternatives, is then the ratio

$E_{1:2} = \lim \frac{e_{1,n}}{e_{2,n}} = \frac{2}{\pi} = 0.64.$  (2.22)
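The following minimal sketch compares the approximation (2.11), with the sign-test efficacy (2.20), to the exact binomial power of the sign test under normal alternatives; n = 100, σ = 1, α = 0.05, and the grid of alternatives are illustrative choices, not values from the text.

    import math
    from scipy.stats import binom, norm

    n, sigma, alpha = 100, 1.0, 0.05
    z = norm.ppf(1 - alpha)
    e1 = 2 * n / (math.pi * sigma ** 2)     # sign-test efficacy, (2.20)
    crit = binom.ppf(1 - alpha, n, 0.5)     # reject if the count exceeds crit
    for theta in (0.1, 0.2, 0.3):
        p = norm.cdf(theta / sigma)         # P(observation > 0) when the mean is theta
        exact = binom.sf(crit, n, p)        # exact power, counting positive observations
        approx = norm.cdf(math.sqrt(e1) * theta - z)   # approximation (2.11)
        print(theta, round(float(exact), 3), round(approx, 3))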
3 Asymptotic Behavior of Point Estimators: Heuristic Discussion

Suppose we are considering two estimators of the same quantity, say μ. For example, the sample median and the sample mean may be regarded as estimators of the same quantity if the population median and mean are assumed to be equal. Let the two estimators be T_{1,n} and T_{2,n}, and suppose that the true distribution belongs to some specified family of distributions indexed by a one-dimensional parameter θ. Since θ determines the true distribution, it also determines the quantity being estimated, which we may write accordingly as μ(θ).
Suppose that T_{1,n} and T_{2,n} are approximately normal, as estimators usually are, with mean μ(θ) and variances σ²_{1,n}(θ) and σ²_{2,n}(θ) respectively. Now if the definitions of Sect. 2 are applied to T_{1,n} and T_{2,n} regarded as test statistics for a particular null value of θ (which they could be at least within our one-parameter family of distributions), then their efficacies and their asymptotic relative efficiency are given by

$e_{i,n}(\theta) = [\mu'(\theta)]^2/\sigma_{i,n}^2(\theta), \qquad i = 1, 2,$  (3.1)
$E_{1:2}(\theta) = \lim_{n\to\infty} \frac{e_{1,n}(\theta)}{e_{2,n}(\theta)} = \lim_{n\to\infty} \frac{\sigma_{2,n}^2(\theta)}{\sigma_{1,n}^2(\theta)}.$  (3.2)

Note that since μ′(θ) is the same for both tests, E_{1:2}(θ) can be obtained without actually computing μ′(θ).
The two interpretations of E_{1:2}(θ) in estimation are similar to the two in testing. First, the errors of the two estimators T_{1,n} and T_{2,n} have approximately the same distribution except for a scale factor 1/√E_{1:2}(θ), since both estimators are approximately normal with mean μ(θ) and the ratio of the variance of the first estimator to that of the second is 1/E_{1:2}(θ).
Second, if the two estimators are based on samples of sizes n_1 and n_2 in the ratio n_2/n_1 = E_{1:2}(θ), then the distributions of the estimators and hence of the errors will be approximately the same. To see this, recall again that both are approximately normal with mean μ(θ). They will, therefore, have approximately the same distribution if their variances are approximately equal, and hence if their efficacies, given by (3.1), are approximately equal. But this is exactly the condition of Equation (2.19) for the tests based on T_{1,n_1} and T_{2,n_2} to have approximately the same power function, except that the dependence on θ_0 was suppressed there while the dependence on θ is not suppressed here. Accordingly, by the same argument as in Sect. 2.3, if the ratio n_2/n_1 is, in the limit, equal to E_{1:2}(θ), then the two estimators, T_{1,n_1} and T_{2,n_2}, will, in the limit, have the same distribution. Note that in general, this ratio depends on the true value of θ, although this dependence will disappear in our examples.
In typical situations, then, E_{1:2}(θ) will have the two interpretations above and can be computed for estimators exactly as for test statistics when the estimators are estimating the same quantity.
As an example, suppose the quantity to be estimated is p, the proportion negative in some population; T_{1,n} is the proportion of negative observations in the sample, and T_{2,n} = Φ(-X̄/S), where Φ is the standard normal c.d.f. and X̄ and S are the sample mean and standard deviation. The second estimator is appropriate for normal populations with unknown standard deviation. If the population is normal with mean θ and standard deviation σ, then the asymptotic efficiency of the first estimator relative to the second can be computed (Problem 7) as

$E_{1:2}(\theta) = \frac{(1 + \tfrac{1}{2}w^2)\phi^2(w)}{p(1 - p)},$  (3.3)

where w = -θ/σ, p = Φ(w), and φ is the standard normal density function.
E_{1:2}(θ) is plotted as a function of p in Fig. 3.1. When 25% of the population is negative, for example, E_{1:2}(θ) = 0.66. This implies that the variance of the sample proportion negative is approximately 1/0.66 = 1.51 times the variance of Φ(-X̄/S), and that their errors have approximately the same distribution except for a scale factor 1/√0.66 = 1.23. It also implies that the sample proportion negative has approximately the same distribution as Φ(-X̄/S) based on 66% as many observations. If θ = 0, then p = 0.5 and the asymptotic relative efficiency is the same as for the sign test relative to the t test, namely 2/π = 0.64. As can be seen from Fig. 3.1, the normal-theory estimator, Φ(-X̄/S), is always better under normality and may be much better. Unfortunately, however, if the normality assumption fails, then as discussed in Sect. 3.2 of Chap. 2, Φ(-X̄/S) will not ordinarily be a natural or good estimator of the proportion negative at all.
Figure 3.1
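A quick evaluation of (3.3) as a function of p, a small sketch reproducing the values cited above (0.66 at p = 0.25 and 2/π = 0.64 at p = 0.5):

    import math
    from scipy.stats import norm

    def E12(p):
        w = norm.ppf(p)                     # w such that p = Phi(w)
        return (1 + 0.5 * w * w) * norm.pdf(w) ** 2 / (p * (1 - p))

    print(round(E12(0.25), 2), round(E12(0.5), 2), round(2 / math.pi, 2))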
The previous subsection makes it appear that estimators have the same
asymptotic relative efficiency as tests. However, more needs to be said con-
cerning which estimators have the same relative efficiency as which tests.
When we start with tests and look for corresponding estimators, the
following difficulty arises. A test can be defined equivalently in terms of
many different test statistics. However, if the asymptotic relative efficiency
of two tests is to apply to their test statistics when regarded as estimators,
then both test statistics must be estimators of the same population quantity,
which must not depend on n. Consider, for instance, the sign test and the
normal-theory test in a one-sample problem. The sign test was defined
earlier in terms of the number of negative observations, but this estimates
a quantity depending on n. An equivalent test statistic which eliminates this
dependence is the proportion of negative sample observations, which
estimates the proportion negative in the population. However, the normal-
theory test statistic, either the sample mean or the t statistic, is not an
estimator of this parameter, and so we still do not have a pair of estimators
which are related in the manner of the previous section and correspond to
the two tests, the sign test and the normal-theory test.
If, on the other hand, we start with estimators, the difficulty is that they
may not appear to be test statistics for the right kind of test. The sample
median, for instance, is not the test statistic for the sign test. Furthermore, since its distribution depends on the shape of the population, it is a possible test statistic only under very restrictive assumptions. How, then, can its efficacy in estimation be related to the efficacy of any interesting test?
One answer is to consider one-tailed tests at the level α = 0.50. For example, the upper-tailed sign test at level α = 0.50 rejects if the number of negative observations exceeds n/2. This is equivalent to rejecting if the sample median is negative. Thus the sample median can serve as the test statistic for a one-tailed sign test at level α = 0.50, although not at other levels. As we have seen, however, the efficacy of a test does not depend on the level. Therefore, the efficacy of the sample median must be asymptotically the same as the efficacy of the sign test (Problem 8). While we have been talking about the sign test for the null hypothesis that the population median is 0, the argument extends immediately to any hypothesized value of the median. A sign test of the null hypothesis that the median is μ_0 is ordinarily based on the number of observations less than μ_0, but the sample median can again serve as the test statistic for a one-tailed test at level α = 0.50.
Thus the efficacy of the sample median as an estimator of the population median in any family of distributions will be asymptotically the same at each value of the population median as the efficacy of the sign test for that value of the population median. Of course, this can be verified directly (Problem 9). However, the foregoing argument relates the sample median to the sign test explicitly and shows that their efficacies must be asymptotically equal and need not both be computed individually.
A similar argument relates the sample mean to the t test (Problem 10).
(Section 2.2 presented another argument, namely that the t test at any level
is asymptotically equivalent for present purposes to a test based on the sample
mean.)
To generalize the argument relating a family of tests for a parameter μ to an estimator of μ with asymptotically the same efficacy, suppose we perform a one-tailed test at level α = 0.50 for each value of μ. Then for any given set of observations there is usually one value of μ such that all values on one side of it would be accepted, and all values on the other side rejected. This point of division, the value of μ which would be "just accepted" at the level α = 0.50, may be considered an estimator of μ corresponding to the family of tests in question. It is just the confidence bound for μ at level 1 - α = 0.50 which corresponds to the family of tests at level α = 0.50. Its efficacy might be very difficult to obtain directly. However, since this estimator could be used as a test statistic for any one of the tests at level α = 0.50, its efficacy at any value of μ must be asymptotically the same as that of the test for that value of μ.
In summary, the foregoing argument shows that, given a family of tests for μ, the corresponding 50% confidence bound is a naturally related estimator of μ, with asymptotically the same efficacy.
In the two-sample shift problem, similarly, the value of the shift which would be "just accepted" at level α = 0.50 serves as the corresponding estimator: a test of any hypothesized value μ of the amount of the shift by which the populations differ can be performed by comparing this estimate with μ. The estimator corresponding to the two-
sample t test in this way is the difference of the sample means. Similarly, the
difference of the sample medians corresponds to the two-sample median test,
and the median of the set of all differences Y_j - X_i corresponds to the two-
sample rank-sum test (Problem 12).
Summary
Now let us consider two arbitrary estimators T_{1,n} and T_{2,n} for (presumably) different quantities. There is no natural way to make a direct comparison between T_{1,n} as an estimator of one quantity and T_{2,n} as an estimator of another. If T_{1,n} has smaller variance than T_{2,n}, this may only be because the quantity estimated by T_{1,n} is the easier one to estimate. In fact, it could happen at the same time that some function of T_{2,n} estimates the same quantity as T_{1,n} and has smaller variance than T_{1,n}. Then T_{2,n} would certainly be more useful than T_{1,n} even though it has larger variance. Two estimators can be compared in a straightforward way only when both are estimating the same quantity.
We could leave the problem here, since there is no really compelling reason to compare two estimators of different quantities. However, an instructive comparison turns out to be possible between any two statistics T_{1,n} and T_{2,n}, which need not even be estimators at all. We shall find that there is a very natural way to use functions of them to estimate the same quantity asymptotically, and that the asymptotic relative efficiency of the
resulting estimators is the same as that of the corresponding tests. To see this, let μ_{1,n}(θ) denote the approximate mean of T_{1,n} and define θ̂_{1,n} = μ⁻¹_{1,n}(T_{1,n}); expanding μ⁻¹_{1,n} around μ_{1,n}(θ) gives

$\hat\theta_{1,n} - \theta = \frac{T_{1,n} - \mu_{1,n}(\theta)}{\mu_{1,n}'(\theta)} + \text{remainder}.$  (3.6)

Since T_{1,n} is approximately normal with mean μ_{1,n}(θ) and variance σ²_{1,n}(θ), it follows that θ̂_{1,n} is approximately normal with mean θ and variance

$\sigma_{1,n}^2(\theta)/[\mu_{1,n}'(\theta)]^2 = 1/e_{1,n}(\theta).$  (3.7)

Similarly θ̂_{2,n} = μ⁻¹_{2,n}(T_{2,n}) is approximately normal with mean θ and variance 1/e_{2,n}(θ).
Therefore, in large samples, there is a natural way to use T_{1,n} and T_{2,n} for purposes of estimating θ, and the resulting estimators are approximately normal with the same mean θ and variances 1/e_{1,n}(θ) and 1/e_{2,n}(θ). The asymptotic efficiency of the estimator of θ based on T_{1,n} relative to that based on T_{2,n} is then E_{1:2}(θ), just as before. If the assumed family of distributions is changed in some way, then the tests and estimators for θ based on T_{1,n} and T_{2,n} will generally depend on the distribution assumption, because μ_{1,n} and μ_{2,n} do (Problem 14).

4 Asymptotic Behavior of Confidence Bounds
Thus we have obtained an approximation to the c.d.f. of T_n when the true distribution is given by θ.
The efficacy e_n(θ) need not be computed from the confidence bound T_n, as it can be computed from any test statistic which yields the corresponding test of the null value μ(θ). T_n is one such statistic, but is often not the easiest one to use in computing the efficacy, even in the case α = 0.50 discussed in Sect. 3.2. Indeed, the natural test statistic for the null value μ usually depends on μ. Consider, for example, the confidence bounds related to the Wilcoxon signed-rank test as in Sect. 4 of Chap. 3. It would be difficult, if not impossible,
to compute the mean and variance of this confidence bound directly. However, the corresponding test for any null value of μ can be carried out by subtracting μ from every observation and then computing the signed-rank sum. Note that this test statistic is a function of μ. Its efficacy is easy to compute for any given μ, and is also the efficacy of the confidence bound.
Now suppose we have two upper confidence bounds T_{1,n} and T_{2,n} for the same quantity μ at the same confidence level 1 - α. The c.d.f. of each is given approximately by (4.2). In addition, as n → ∞, e_{1,n}(θ)/e_{2,n}(θ) → E_{1:2}(θ), the asymptotic relative efficiency of the corresponding tests. It follows (Problem 15b) that for a given sample size, the c.d.f. of T_{1,n} at μ(θ) + δ is approximately the c.d.f. of T_{2,n} at μ(θ) + δ√E_{1:2}(θ); that is, T_{1,n} - μ(θ) and T_{2,n} - μ(θ) have approximately the same c.d.f. except for a scale factor 1/√E_{1:2}(θ). (Compare Fig. 2.1.) In particular, the expectation of T_{1,n} - μ(θ) is approximately 1/√E_{1:2}(θ) times the expectation of T_{2,n} - μ(θ). The quantity T_{1,n} - μ(θ) with its algebraic sign is the amount of overestimation. One might instead be interested in its positive part,

$[T_{1,n} - \mu(\theta)]^+ = \max\{T_{1,n} - \mu(\theta), 0\},$

since it is desirable that an upper confidence bound for μ be small as long as it exceeds μ but not when it is smaller than μ. Just as for the signed overestimation, the expectation of [T_{1,n} - μ(θ)]⁺ is approximately 1/√E_{1:2}(θ) times the expectation of [T_{2,n} - μ(θ)]⁺. For other measures of "error" or "loss," √E_{1:2}(θ) is the scale factor in the random variable but not always in the expected loss (Problem 16).
This provides one interpretation of the asymptotic relative efficiency of tests in connection with the corresponding confidence procedures. We obtain another interpretation by considering the two confidence bounds T_{1,n_1} and T_{2,n_2}, based on samples of sizes n_1 and n_2 respectively. By the same argument as in earlier sections (Problem 17), if the ratio n_2/n_1 is, in the limit, equal to E_{1:2}(θ), then the confidence bounds T_{1,n_1} and T_{2,n_2} will, in the limit, have the same distribution and hence the same expected loss for any loss function.
These conclusions of course apply to lower as well as upper confidence bounds. As regards two confidence intervals, the previous discussion provides a comparison between the distributions of the two upper bounds, and similarly for the two lower bounds. A full comparison would also require consideration of the joint distribution of the upper and lower bounds, which we have not discussed. What we already know implies, however, that the expected length of the first interval is approximately 1/√E_{1:2}(θ) times that of the second (Problem 18). Furthermore, the length of a typical confidence interval is asymptotically constant, with a standard deviation of smaller order than its mean. Hence, in comparing two confidence intervals, the scale factor √E_{1:2}(θ) applies asymptotically to both their lengths and the deviation of their endpoints from μ(θ), and therefore to all other aspects of
their relationships to μ(θ). Thus E_{1:2}(θ) has the same kind of interpretation for confidence intervals as for confidence bounds. A different kind of argument would be required, however, to establish this for confidence intervals whose length is not asymptotically constant.
In typical situations, then, the quantity E_{1:2}(θ) will have the interpretations given above and will therefore be called the asymptotic efficiency of the confidence procedure T_{1,n} relative to T_{2,n}. It is the same as the asymptotic relative efficiency of the corresponding tests of the null value μ(θ), which is usually easier to compute directly.
As an example, consider a confidence bound for the population median based on an order statistic as in Sect. 4 of Chap. 2, and the normal-theory confidence bound X̄ + z_α S/√n for the population median, where X̄ and S are the sample mean and standard deviation and z_α is an appropriate constant. Both are confidence bounds for the same quantity if the population mean and median are assumed equal. (The confidence level of the normal-theory procedure depends on the population shape, but will be correct asymptotically.) Let μ be the common value of the population mean and median. The confidence bound based on the order statistic corresponds to the sign test for each μ, which is based on the number of observations less than μ. The normal-theory confidence bound corresponds to the t test for each μ. Under the normal assumption (the population is normal with mean θ = μ and standard deviation σ), the asymptotic efficiency of the first test with respect to the second was found in Sect. 2.3 to be 2/π = 0.64 at μ = θ = 0, and the same value clearly applies at other values of μ. For the two confidence bounds, this asymptotic efficiency means that under normality the order statistic bound has approximately the same probability of falling below μ + δ as the normal-theory bound has of falling below μ + δ√0.64 = μ + 0.80δ. In other words, the two confidence bounds differ from μ by amounts having the same distribution except for a scale factor 1/√0.64 = 1.25. In particular, the expected amount by which the order statistic bound exceeds μ is approximately 1.25 times the corresponding expectation for the normal-theory bound, and the same holds for the expected lengths of confidence intervals. Furthermore, the order statistic bound has approximately the same distribution as the normal-theory bound based on 64% as many observations.
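A simulation sketch of this comparison under normality; n = 100 and the nominal level 0.95 are illustrative assumptions, with the order-statistic index set from the binomial distribution as in Sect. 4 of Chap. 2. The ratio of the standard deviations of the two bounds should come out near 1/√0.64 = 1.25.

    import numpy as np
    from scipy.stats import binom, norm

    n, reps = 100, 20000
    rng = np.random.default_rng(1)
    k = int(binom.ppf(0.95, n, 0.5))    # 0-based order-statistic index for the median bound
    z = norm.ppf(0.95)
    x = rng.normal(size=(reps, n))
    order_bound = np.sort(x, axis=1)[:, k]   # roughly 95% upper bound for the median
    t_bound = x.mean(axis=1) + z * x.std(axis=1, ddof=1) / np.sqrt(n)
    print(order_bound.std() / t_bound.std()) # near 1.25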
5 Example
We have seen that the asymptotic relative efficiency of two tests applies
also to the corresponding estimators and confidence bounds, with at least
two interpretations in each case. Accordingly, any numerical efficiency value
has many meanings. We now illustrate this whole range of ideas for two
sets of related procedures in the one-sample problem.
Consider the sign test for the null hypothesis that the population median
has a specified value. The natural test statistic is the number of observations
falling below the specified value. We thus have a whole family of tests (and
test statistics), one for each value which might be specified. The estimator
of the population median corresponding to this family of tests is the sample
median, as explained in Sect. 3.2. The corresponding confidence bounds are
order statistics, as explained in Sect. 4 of Chap. 2. For convenience, all these
procedures will be referred to in this section as "median procedures;" they
all permit inferences about the population median.
Consider also the family of t tests for null hypotheses specifying the
population mean. The corresponding estimator of the population mean is
the sample mean, as noted in Sect. 3.2, and the corresponding confidence
bounds are the usual ones based on the t distribution. All these procedures
will be referred to here as normal-theory procedures, because they are
exact under the assumption of normality. They give asymptotically valid
inferences about the population mean regardless of the population shape
provided that the variance is finite. (If this were not true, comparisons with
other procedures under assumptions other than normality would be com-
plicated by the discrepancy between their true level and their nominal level.
Such a discrepancy would invalidate all our earlier analysis, and would also
bring a new consideration into the problem-the trade-off between level
and power. See also the discussion of power comparisons using conservative
tests in Sect. 4.3 of Chap. 1.)
Our object here is to compare the median procedures with the normal-
theory procedures under the assumption that the population median and
mean are equal. Before proceeding, however, let us emphasize that these
procedures lead in general to inferences about different quantities. A median
procedure provides an inference about the population median, while a
normal-theory procedure provides an inference about the population mean.
Accordingly, the first question to ask is whether it is really the median or the
mean of the population which is of interest. Careful thought may reveal that
one or the other (or something else entirely) is really the parameter of in-
terest. If so, and if we are also unwilling to assume that they differ neg-
ligibly, then our choice of procedure will be clear and an efficiency compari-
son irrelevant. Such considerations are at least as important as efficiency.
They receive less attention here simply because they require less explanation.
On the other hand, if we believe that the population median and mean
differ negligibly compared to the uncertainty resulting from sampling
variability, then the relative efficiency of the median and normal-theory
procedures will be of interest. It will also be of interest in situations where
we think the population median and mean may well differ appreciably and
it is immaterial which one the inference concerns-but then perhaps a more
meaningful and useful way of making or scaling the measurements could
be found. In either case, the choice among procedures may be facilitated by
learning something about their relative efficiency.
Suppose, then, that the true distribution is the Laplace distribution with density

$f(x) = \frac{1}{2\lambda}\, e^{-|x-\theta|/\lambda},$  (5.1)

where λ is arbitrary but fixed. For this family, the asymptotic efficiency of the median procedures relative to the normal-theory procedures is E_{1:2}(θ) = 2.
This result does not depend on θ, as could be anticipated from the nature of the procedures and the way θ enters the density (5.1) (Problem 21d). According to Sects. 2-4, the implications of this result under the model (5.1) are as follows.
Consider the median procedure and the normal-theory procedure for testing the hypothesis θ = θ_0, that is, the sign test and the t test for this null hypothesis. We assume always that the tests have the same level α and are either both one-tailed in the same direction or both two-tailed with the same division of the significance level between the two tails. If the tests are based on samples of the same (large) size, then the power of the sign test at a point δ units away from θ_0 is approximately the same as the power of the t test at a point δ√2 = 1.41δ units away in the same direction; that is, the sign test gives approximately the same power at any point as the t test gives at a point farther from θ_0 by the factor √2. In terms of different sample sizes (still large), we may say that the sign test requires approximately one-half as many observations as the t test to give a specified power at a specified alternative near the null hypothesis. Both these statements apply to the Laplace alternatives (5.1), of course.
Approximations to the power itself can be given in terms of e_{1,n}(θ) and e_{2,n}(θ); by Equation (2.5) the respective powers against θ of the sign test and the t test are approximately Φ[√n(θ - θ_0)/λ - z_α] and Φ[√(n/2)(θ - θ_0)/λ - z_α] for one-tailed tests appropriate against the alternative θ > θ_0. Similar expressions could be written for one-tailed tests appropriate against θ < θ_0 and for two-tailed tests. One might expect these approximations to be better than usual here, because when the μ_{i,n}(θ) are linear in θ and the σ²_{i,n}(θ) do not depend on θ, as here, (2.4) is exact and (2.5) and (2.7) agree exactly with (2.3) for both the mean and the median. This is misleading, however, as the mean and median can serve as test statistics only at the level α = 0.50. At other levels, other test statistics would be required, and, for them, (2.5) and (2.7) would not agree exactly with (2.3).
To estimate μ, the median and normal-theory procedures use the sample median and mean respectively. In large samples from the Laplace distribution (5.1), both are approximately normal with mean μ = θ, and the variance of the median is approximately one-half of the variance of the mean from a sample of the same size. The estimation error of the median has approximately the same distribution as 1/√2 times the estimation error of the mean. If the sample size for the median is one-half of the sample size for the mean, their distributions will be approximately the same. The situation is particularly simple in that the factor one-half does not depend on θ. (This simplification occurs frequently, but not always.)
For large samples from the Laplace distribution (5.1), an upper confidence bound for μ computed by the median procedure has approximately the same probability of falling below μ + δ as the normal-theory bound has of falling below μ + δ√2. The amount by which the former exceeds μ has
approximately the same distribution as 1/√2 times the amount by which the latter exceeds μ. Similar statements hold for a lower confidence bound. The expectation of the difference between a confidence bound and μ, or of the positive part of this difference, or of the length of a confidence interval, is approximately 1/√2 times as great for the median procedure as for the normal-theory procedure. The median procedure using a sample size one-half as large as the normal-theory procedure gives confidence bounds having approximately the same distribution. Again, the implications of these statements are particularly simple because the factor one-half does not depend on θ.
All these statements are implied by the single statement that the asymptotic efficiency of the median relative to the mean is 2 (for all θ). Note that these results apply specifically when the true distribution is Laplace, (5.1), and its center of symmetry θ is the parameter of interest. For a true distribution with a different shape, the asymptotic relative efficiency will generally be different, as we shall illustrate next.
Suppose now that the population is normal with mean θ and variance σ²,
where σ² is arbitrary but fixed, like λ above. In this case the asymptotic
efficiency of the median procedures relative to the normal-theory procedures
is (Problems 7c and 21)

$$E_{1:2}(\theta) = 2/\pi = 0.64. \qquad (5.5)$$
In large samples, then, the sign test of a null hypothesis θ = θ₀ has approximately
the same power at any point as the t test at a point √0.64 = 0.80
times as far from θ₀. The estimation error of the median has approximately the
same distribution as 1/√0.64 = 1.25 times the estimation error of the mean.
A confidence interval for θ computed by the median procedure differs from
θ by an amount having approximately the same distribution as 1/√0.64 times
the corresponding difference for the normal-theory procedure. All these
statements apply to samples of the same size. The median procedures for
testing, estimation, and setting confidence limits behave approximately like
the normal-theory procedures based on a sample 0.64 times as large. Again
the factor is independent of θ.
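Both factors are easy to verify by simulation. The sketch below (Python with NumPy; the sample size, replication count, and seed are arbitrary choices) estimates the ratio of the variance of the sample mean to that of the sample median, which should come out near 2 for Laplace samples and near 2/π = 0.64 for normal samples.

    import numpy as np

    rng = np.random.default_rng(0)

    def variance_ratio(sampler, n=200, reps=20000):
        """Monte Carlo ratio var(sample mean) / var(sample median)."""
        samples = sampler(size=(reps, n))
        return samples.mean(axis=1).var() / np.median(samples, axis=1).var()

    print(variance_ratio(rng.laplace))           # near 2 (Laplace population)
    print(variance_ratio(rng.standard_normal))   # near 2/pi = 0.64 (normal population)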
If the population is normal, the median procedures are asymptotically
0.64 times as efficient as the normal-theory procedures. On the other hand,
if the population is Laplace, we saw earlier that the median procedures are
asymptotically twice as efficient as the normal-theory procedures. It is
convenient to consider both the normal and Laplace densities as special
cases of the general density (which might be called the "double exponential
power" or "power Laplace" density) given by

$$f(x; \theta) = \frac{e^{-(|x-\theta|/\lambda)^k}}{2\lambda\,\Gamma(1 + 1/k)}. \qquad (5.6)$$

[Figure 5.1: asymptotic relative efficiency plotted against k, for k running from 0 to 10.]
6 Definitions of Asymptotic Relative Efficiency

6.1 Introduction

Those readers who are satisfied with the informal definitions of asymptotic
relative efficiency given in Sects. 2-4 may skip the precise formulation presented
in Sect. 6 and go immediately to Sect. 7.
6.2 Tests
Suppose first that we are comparing two test procedures and that θ₀ gives a
distribution belonging to the null hypothesis for each test. If both tests are
one-tailed, assume that they are appropriate against the same one-sided
alternative and that the exact levels of the tests under θ₀ approach the same
positive constant as n → ∞. If both tests are two-tailed, make the same
assumption about the level in each tail separately. Then the asymptotic
efficiency E = E₁:₂(θ₀) of the first test relative to the second can be expected
to have the following properties, any one of which could be taken as the definition
of asymptotic relative efficiency.
A(i). For two tests based on the sample size n, the difference between
the power of the first test at θ₀ + δ and the power of the second test at
θ₀ + δ√E approaches zero uniformly in δ as n → ∞.

If P₁,ₙ and P₂,ₙ are the power functions of the two tests, the condition is

$$P_{1,n}(\theta_0 + \delta) - P_{2,n}(\theta_0 + \delta\sqrt{E}) \to 0 \quad \text{uniformly in } \delta \text{ as } n \to \infty. \qquad (6.1)$$

The statement of (6.1) in ε terminology, directly from the definition of uniform
convergence, is that for every ε > 0, there is an N not depending on δ such
that, for all n > N and all δ,

$$|P_{1,n}(\theta_0 + \delta) - P_{2,n}(\theta_0 + \delta\sqrt{E})| < \varepsilon. \qquad (6.2)$$

Uniform convergence is equivalent by definition to convergence of the
maximum absolute difference. Thus (6.1) is equivalent (Problem 24) to

$$\max_{\delta}\,|P_{1,n}(\theta_0 + \delta) - P_{2,n}(\theta_0 + \delta\sqrt{E})| \to 0 \quad \text{as } n \to \infty. \qquad (6.3)$$
We saw in Sect. 2.1 how the power function of an arbitrary test can typically
be approximated by a normal distribution as in Equation (2.11). Using this
approximation, along with the fact that the efficacy is typically of order
n, we observe that the powers of the two tests at θ₀ + δₙ and θ₀ + δₙ√E
appearing in (6.1)-(6.4) typically behave as follows, for an arbitrary E.
If δₙ → 0 so fast that √n δₙ → 0, then neither test is effective asymptotically
and both powers approach the significance level α. If, at the other extreme,
√n δₙ → ±∞, then both powers approach one (or zero for one-tailed tests
in the wrong direction). If δₙ → 0 in such a way that √n δₙ → d for d nonzero
and finite, then both powers approach limits other than 1, α, or 0 in general.
These limits will be equal if E = E₁:₂(θ₀), but otherwise will not, with one
minor exception. (For two-tailed tests with unequal tails in the limit, for
each E ≠ E₁:₂(θ₀) there is one particular value of d such that the limits are
equal and less than α. See Problem 26.)
Notice that, without the condition "uniformly in δ," (6.1) would lose all
force, since both terms in (6.1) approach one (or zero for one-tailed tests in
the wrong direction) for every fixed δ ≠ 0 regardless of the value of E. But
with the uniformity, property A(i) implies that the power functions will be
the same in the limit except for a scale factor 1/√E even when the alternatives
are rescaled so that the limits are not degenerate, specifically, when the powers
are considered as functions of √n(θ − θ₀).
From the foregoing it also follows that the only value of E with property
A(i) is E = E₁:₂(θ₀). In most of the properties to follow, uniformity is
essential to the meaning and can be restated in the style of (6.3) and (6.4);
similar comments about degenerate limits and rescaling apply; and similar
converses can be stated. This will not be mentioned each time, but left to the
reader to fill in (Problems 27-29).
The asymptotic relative efficiency property A(i) concerns the power of
the two tests at different alternatives for the same sample size. The next two
properties relate to the same alternative but different sample sizes.

A(ii). For two tests with sample sizes n₁ and n₂ respectively, the difference
between the powers at the same point θ approaches zero uniformly in θ
when n₁ and n₂ approach infinity simultaneously in such a way that
n₂/n₁ → E.
In the same notation as before, the condition here is

$$P_{1,n_1}(\theta) - P_{2,n_2}(\theta) \to 0 \quad \text{uniformly in } \theta \text{ if } n_2/n_1 \to E. \qquad (6.5)$$
The final property of the asymptotic relative efficiency E of two tests
which we state is

A(iii). For two tests, if n₁ is the minimum n for which the first test has
power at least 1 − β against the alternative θ and n₂ is defined similarly
for the second test, then n₂/n₁ → E as θ → θ₀.
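Property A(i) can be checked crudely by Monte Carlo. In the sketch below (Python with NumPy and SciPy; all names, the sample size, and the normal population are our own choices), the first test is the sign test and the second the t test, so E = 2/π for normal data; the estimated power of the sign test at θ₀ + δ should roughly match that of the t test at θ₀ + δ√(2/π). Agreement is only rough at moderate n, partly because of the discreteness of the sign test.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def mc_power(test, n, theta, alpha=0.05, reps=4000):
        """Monte Carlo power of a one-tailed level-alpha test of H0: theta = 0."""
        return np.mean([test(rng.normal(theta, 1.0, n), alpha) for _ in range(reps)])

    def t_test(x, alpha):
        t = np.sqrt(len(x)) * x.mean() / x.std(ddof=1)
        return t > stats.t.ppf(1 - alpha, len(x) - 1)

    def sign_test(x, alpha):
        # Reject for a large number of positive observations.
        return np.sum(x > 0) > stats.binom.ppf(1 - alpha, len(x), 0.5)

    n, delta, E = 100, 0.25, 2 / np.pi
    print(mc_power(sign_test, n, delta))            # sign test at theta0 + delta
    print(mc_power(t_test, n, delta * np.sqrt(E)))  # t test at theta0 + delta*sqrt(E)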
6.3 Estimators
Consider next two estimators T₁,ₙ and T₂,ₙ of the same quantity μ(θ). Since
there is no distinguished value of θ in this context, we return to the notation
E₁:₂(θ) for the asymptotic relative efficiency, to emphasize its dependence
on θ. Continuing our delineation of the properties which we expect E₁:₂(θ)
to have, we state four such properties, B(i)-B(iv), any one of which could
serve to define asymptotic relative efficiency for estimators.
B(i). When the true distribution is given by θ, the difference between the
probability of an error of δ or less using T₁,ₙ and the probability of an
error of δ√E₁:₂(θ) or less using T₂,ₙ approaches zero uniformly in δ as
the common sample size n → ∞.

In symbols,

$$P_\theta[T_{1,n} - \mu(\theta) \le \delta] - P_\theta[T_{2,n} - \mu(\theta) \le \delta\sqrt{E_{1:2}(\theta)}] \to 0 \qquad (6.6)$$

uniformly in δ as n → ∞.
This says that the errors of the two estimators have, in an appropriate
sense, asymptotically the same distribution except for a scale factor
1/√E₁:₂(θ) (Problem 30). The errors themselves have distributions which
concentrate at zero as n → ∞, so that both terms in (6.6) approach zero for
δ < 0 and both approach one for δ > 0. By uniformity, however, Property
B(i) implies that even when the errors are scaled up in such a way that their
distributions are not degenerate in the limit, the distributions will be the
same in the limit except for the scale factor 1/√E₁:₂(θ). Specifically, when the
errors are scaled up by the factor √n, their distributions typically converge to
normal distributions with mean 0 and variances in the ratio 1/E₁:₂(θ);
equivalently, the distributions of √n times the error of T₁,ₙ and √(n/E₁:₂(θ))
times the error of T₂,ₙ approach the same normal distribution, with mean
zero and positive variance.
Scaling, the importance of uniformity, alternative statements of it, and
the uniqueness of E are mostly much the same as in Sect. 6.2 and will not be
discussed further here, but they should be borne in mind.
B(ii). When the true distribution is given by θ, the ratio of the variance of
T₁,ₙ to the variance of T₂,ₙ approaches 1/E₁:₂(θ) as n → ∞. The same is
true of the ratio of their mean squared errors. The ratio of the standard
deviation of T₁,ₙ to that of T₂,ₙ approaches 1/√E₁:₂(θ). The same holds for
the ratio of their mean absolute errors.
Property B(ii) does not follow automatically from B(i) because the variance
of a limiting distribution, though ordinarily the limit of the variances for
finite n, need not be. For instance, √n[T₁,ₙ − μ(θ)] can have infinite variance
for every n even though its limiting distribution has finite variance (Problem
31). Common methods of deriving limiting distributions do not apply to the
limit of the variances. Similar remarks hold for mean squared error, standard
deviation, and mean absolute error. Accordingly, statements and proofs about
asymptotic relative efficiency, in this book and elsewhere, usually apply
directly to B(i), but for B(ii), some statements need qualification, especially
very general ones, and most proofs need additional justification.
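A concrete instance in the spirit of Problem 31 (our construction, not the book's): estimate 1/μ by 1/X̄ from normal data. Since X̄ has positive density at zero, E[(1/X̄)²] is infinite for every n, yet by the delta method √n(1/X̄ − 1/μ) has a normal limit, here with variance σ²/μ⁴ = 1. In the sketch below (Python with NumPy), the raw Monte Carlo variance is erratic across seeds, while a crudely trimmed version sits near the limiting value (somewhat inflated at n = 10 by higher-order terms).

    import numpy as np

    mu, sigma, n = 1.0, 1.0, 10
    # T_n = 1/Xbar estimates 1/mu.  The delta method gives
    # sqrt(n)(T_n - 1/mu) -> N(0, sigma^2/mu^4) = N(0, 1) here, although
    # E[T_n^2] = infinity for every n, since Xbar has positive density at 0.
    for seed in range(3):
        rng = np.random.default_rng(seed)
        xbar = rng.normal(mu, sigma, (200000, n)).mean(axis=1)
        err = np.sqrt(n) * (1 / xbar - 1 / mu)
        # The raw variance is dominated by rare near-zero values of Xbar;
        # a crude trim recovers something near the limiting variance 1.
        print(np.var(err).round(2), np.var(err[np.abs(err) < 5]).round(2))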
B(iii). For two estimators with sample sizes n₁ and n₂ respectively, if the
true distribution corresponds to θ, and n₁ and n₂ approach infinity simultaneously
in such a way that n₂/n₁ → E₁:₂(θ), then the difference between
the c.d.f.'s of the two estimators converges uniformly to zero.

In symbols,

$$P_\theta(T_{1,n_1} \le t) - P_\theta(T_{2,n_2} \le t) \to 0 \qquad (6.7)$$

uniformly in t if n₂/n₁ → E₁:₂(θ).
B(iv). For two estimators with sample sizes n₁ and n₂ respectively, if the
true distribution corresponds to θ and n₁ and n₂ approach infinity simultaneously
in such a way that n₂/n₁ → E₁:₂(θ), then the ratio of the variances
of the estimators approaches one, as does the ratio of their mean squared
errors, the ratio of their standard deviations, the ratio of their mean
absolute errors, and indeed the ratio of the expectation of any function of
their errors.
Comments similar to those following property B(ii) apply to B(iii) and
B(iv).

The final phrase of B(iv) implies that no matter what function of the error
(and even of the true value as well) is chosen to represent the loss of misestimation,
the ratio of the expected losses still approaches one if n₁ and n₂
approach infinity in such a way that n₂/n₁ → E₁:₂(θ). The conditions needed
to prove that this property holds depend of course on the loss function, and
may be strong.
6.4 Confidence Bounds

We consider next two confidence bounds T₁,ₙ and T₂,ₙ for the same quantity
μ(θ) and state four properties, C(i)-C(iv), which we may expect the asymptotic
relative efficiency E₁:₂(θ) to have and could use to define asymptotic relative
efficiency for confidence bounds.
C(i). When the true distribution is given by θ, the difference between the
c.d.f. of T₁,ₙ at μ(θ) + δ and the c.d.f. of T₂,ₙ at μ(θ) + δ√E₁:₂(θ) approaches
zero uniformly in δ as the common sample size n → ∞.

In symbols,

$$P_\theta[T_{1,n} \le \mu(\theta) + \delta] - P_\theta[T_{2,n} \le \mu(\theta) + \delta\sqrt{E_{1:2}(\theta)}] \to 0 \qquad (6.8)$$

uniformly in δ as n → ∞.
This implies that, in a relative sense, the two confidence bounds have
asymptotically the same distribution except for a scale factor 1/√E₁:₂(θ). The
discussions of scaling, uniformity, and uniqueness of E in Sects. 6.2 and 6.3
apply here with little change and will not be repeated.
C(ii). When the true distribution is given by θ, the ratio of the expectation
of T₁,ₙ − μ(θ) to the expectation of T₂,ₙ − μ(θ) approaches 1/√E₁:₂(θ) as
n → ∞. The ratio of the expected overestimations, that is, of the expectations
of [T₁,ₙ − μ(θ)]⁺ and [T₂,ₙ − μ(θ)]⁺, and the ratio of the expected underestimations
[μ(θ) − T₁,ₙ]⁺ and [μ(θ) − T₂,ₙ]⁺, also approach 1/√E₁:₂(θ)
as n → ∞.
As with property B(ii), C(ii) does not strictly follow from C(i), and statements
about asymptotic relative efficiency often need additional justification
and may need qualification to apply to property C(ii).

The definition of [Tᵢ,ₙ − μ(θ)]⁺ and the reason for considering it when
Tᵢ,ₙ is an upper confidence bound were given in Section 4. The reason for
considering [μ(θ) − Tᵢ,ₙ]⁺ when Tᵢ,ₙ is a lower confidence bound is
analogous.
C(iii). For two confidence bounds with sample sizes n₁ and n₂ respectively,
if the true distribution is given by θ and if n₁ and n₂ approach infinity
simultaneously in such a way that n₂/n₁ → E₁:₂(θ), then the difference
between the c.d.f.'s of the two confidence bounds converges uniformly to
zero.

In symbols,

$$P_\theta[T_{1,n_1} \le t] - P_\theta[T_{2,n_2} \le t] \to 0 \qquad (6.9)$$

uniformly in t if n₂/n₁ → E₁:₂(θ).
C(iv). For two confidence bounds for μ with sample sizes n₁ and n₂
respectively, if the true distribution is given by θ, and if n₁ and n₂ approach
infinity simultaneously in such a way that n₂/n₁ → E₁:₂(θ), then the ratio
of their expected overestimations approaches one, as does the ratio of the
expected values of any function of their differences from μ.

Comments similar to those following B(ii), B(iv), and C(ii) apply to C(iii)
and C(iv).
6.5 Confidence Intervals

D(i). For two confidence interval procedures based on the same sample
size n, the upper and lower endpoints each have property C(i), and the ratio
L₁,ₙ/L₂,ₙ of the length of the first interval to that of the second converges
in probability to 1/√E₁:₂(θ); that is,

$$P_\theta\!\left[\,\left|\frac{L_{1,n}}{L_{2,n}} - \frac{1}{\sqrt{E_{1:2}(\theta)}}\right| > \varepsilon\right] \to 0 \quad \text{as } n \to \infty. \qquad (6.10)$$

Thus, in addition to property C(i) for the endpoints, property D(i) requires
that the ratio of the lengths of the intervals has arbitrarily high probability
of being arbitrarily near 1/√E₁:₂(θ) if n is large enough.
D(iii). For two confidence interval procedures with sample sizes n₁ and n₂
respectively, the upper and lower endpoints each have property C(iii), and
the ratio of the length of the first confidence interval to the length of the
second converges in probability to one if n₁ and n₂ approach infinity
simultaneously in such a way that n₂/n₁ → E₁:₂(θ) when the true distribution
is given by θ.
It is natural to define properties D(ii) and D(iv) in a similar way, requiring
the endpoints to have properties C(ii) and C(iv) but requiring convergence
in the ordinary sense of the ratio of the expectations of the lengths
(or of any functions of the lengths) in place of convergence in probability of
the ratio of lengths. Statements about asymptotic relative efficiency often
need additional justification and may need qualification to apply to such
properties, because they involve expectations.
6.6 Summary
Since related tests, estimators, and confidence procedures typically have the
same asymptotic relative efficiency as a result, it is not really misleading to think
of the quantity E₁:₂(θ) as being the same throughout. However, the definition
of the asymptotic relative efficiency of tests, for instance, is quite independent
of the definition for estimators and confidence procedures. Hence it is
perfectly legitimate to talk about the asymptotic relative efficiency of tests
without referring to any estimators or confidence procedures or to their
relation to these tests. A rigorous definition of what it means for tests,
estimators, and confidence procedures to be "related" has not been given here.
In the statistical literature, asymptotic relative efficiency is usually
defined for tests by property A(ii), or a slightly weaker version of it (Problem
32). The other properties for tests are included here because they enrich the
meaning of asymptotic relative efficiency and do generally hold, even though
they may not be mentioned or their validity proved. Occasionally, reference
is made to "Mood's definition" [Mood, 1954] of asymptotic relative effi-
ciency. This applies to one-tailed tests and to two-tailed tests with equal
tails. It ordinarily agrees with the usual definition in the case of two-tailed
tests with equal tails, but it gives the square root of the usual value in the
case of one-tailed tests (Problem 33). Definitions of the asymptotic relative
efficiency of point estimators and confidence procedures do not appear
frequently in the literature. The properties of these procedures which are
most similar to A(ii) for tests are B(iii), C(iii), and D(iii).
7 Pitman's Formula
The asymptotic relative efficiency of two procedures was defined in Sect. 1
as the limit of the relative efficiency, that is, the limit of the ratio of the sample
sizes needed to achieve the same performance. The heuristic discussion in
Sects. 2-4 indicated that a single number describes the relative efficiency
asymptotically for a wide range of measures of performance. This discussion
also led to a formula, that given in (2.17) and usually called Pitman's formula,
for computing asymptotic relative efficiency as the limiting ratio of two
efficacies. We used this formula for calculation in the examples of Sect. 5.
In light of the discussion, we could reasonably define asymptotic relative
efficiency as a quantity with the properties indicated in Sects. 2-4 and
laid out in Sect. 6, or some specified subset of them. Pitman's formula is
not a part of the definition of asymptotic relative efficiency, and numbers com-
puted from it have these properties only under suitable conditions (which,
for many of the properties, are very weak). However, the formula provides
a convenient and widely applicable method for computation of asymptotic
relative efficiencies, and we will use it further below. Nevertheless, we will not
give sufficient conditions for the validity of the formula, nor prove formally
that numbers computed from it have the properties given in Sect. 6.
For convenience and easier reference, Pitman's formula will now be
repeated. Consider a family of distributions indexed by a one-dimensional
parameter θ, and let μₙ(θ) and σₙ²(θ) be the mean and variance of a test
statistic (or estimator) Tₙ. The efficacy of Tₙ is eₙ(θ) = [μₙ′(θ)]²/σₙ²(θ),
and Pitman's formula gives the asymptotic efficiency of procedure 1 relative
to procedure 2 as the ratio of the limiting efficacies per observation,

$$E_{1:2}(\theta) = \frac{\lim_n e_{1,n}(\theta)/n}{\lim_n e_{2,n}(\theta)/n}.$$

8 Asymptotic Relative Efficiencies of One-Sample Procedures for Shift Families
We start by listing the procedures we consider and formulas for their asymp-
totic efficacies for any shift family h. Thereafter we give numerical values for
the shift families just mentioned.
(1) The normal-theory procedures include the t test for the null hypothesis
that the population mean has a specified value (or the test based on the
sample mean if the variance is assumed known), the sample mean as estimator
of the population mean, and the confidence bounds for the population mean
corresponding to the foregoing tests.

The efficacy is most conveniently computed from the estimator, the
sample mean, and is e₁,ₙ = n/σ², where σ² is the population variance, the
variance of the density h (Problem 35a). Thus the asymptotic efficacy is

$$e_1 = 1/\sigma^2. \qquad (8.2)$$
(2) The median procedures (in the sense of Sect. 5) are the sign test for the
null hypothesis that the population median has a specified value, the sample
median as estimator of the population median, and the corresponding
confidence bounds for the population median which have appropriate order
statistics as endpoints.

The efficacy is conveniently computed from the test statistic, the number
of observations smaller than the median value specified by the null hypothesis,
and is 4nh²(0) if h is a density with median 0 (Problem 35b). The
asymptotic efficacy is then

$$e_2 = 4h^2(0). \qquad (8.3)$$
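Formulas (8.2) and (8.3) already suffice to reproduce several ratios in Table 8.1. A minimal sketch (Python with NumPy; the variances and central densities are standard values, assuming the Laplace at λ = 1, the logistic at scale 1, and the uniform on (−1, 1)):

    import numpy as np

    # Asymptotic efficacies per observation for the shift family f(x; theta) = h(x - theta):
    #   normal-theory (mean) procedures: e1 = 1/sigma^2, Eq. (8.2)
    #   median (sign test) procedures:   e2 = 4 h(0)^2,  Eq. (8.3)
    densities = {                     # name: (variance of h, density at the median 0)
        "normal":   (1.0, 1 / np.sqrt(2 * np.pi)),
        "Laplace":  (2.0, 0.5),                     # lambda = 1, variance 2*lambda^2
        "logistic": (np.pi ** 2 / 3, 0.25),         # scale 1
        "uniform":  (1 / 3, 0.5),                   # on (-1, 1)
    }
    for name, (var, h0) in densities.items():
        print(f"{name:8s}  E(median : mean) = {4 * h0 ** 2 * var:.3f}")
    # normal 0.637 (= 2/pi), Laplace 2.000, logistic 0.822 (= pi^2/12), uniform 0.333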
(3) The Wilcoxon procedures are the Wilcoxon signed-rank test for the
null hypothesis that the population is symmetric about a specified value, the
corresponding estimator of the population center of symmetry, namely the
median of the set of Walsh averages (see Sect. 3.2), and the corresponding
confidence bounds (Sect. 4, Chap. 3).
(8.5)

where, in the expectation, |X|₍₁₎ < ⋯ < |X|₍ₙ₎ are the absolute values of a
sample of n observations from the density h arranged in increasing order of
size. (This notation is imprecise, since the meaning and distribution of
|X|₍ⱼ₎ depend on n as well as j.) The asymptotic efficacy is lim e₄,ₙ/n, and
will exist only under some restriction on the cₙⱼ.
A common situation is that

$$c_{nj} = b_n\, c\!\left(\frac{j-1}{n}\right) + \text{remainder}, \qquad (8.7)$$

where bₙ is a constant for each n, c(u) is a function of u defined for 0 < u < 1,
and the remainders are small enough not to contribute asymptotically. For
example, the sign test statistic has this form with cₙⱼ = 1, c(u) = 1, and bₙ = 1.
The Wilcoxon signed-rank test statistic has this form with cₙⱼ = j, c(u) = u,
and bₙ = n. The factor bₙ has no effect on the test and is included in (8.7)
for the convenience of using the test statistics as previously defined. For
example, without bₙ the Wilcoxon signed-rank test would have cₙⱼ = j/n,
an equivalent but inconvenient form. The reason for evaluating c(u) at (j − 1)/n
rather than j/n in (8.7) is to allow the possibility that c(u) → ∞ as u → 1, in
which case c(j/n) could not be used for j = n.
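A quick numerical check of the representation (8.7) for the Wilcoxon scores (Python with NumPy; the choice n = 10 is arbitrary):

    import numpy as np

    n = 10
    j = np.arange(1, n + 1)
    # Wilcoxon signed-rank scores c_nj = j in the form (8.7), with b_n = n
    # and c(u) = u evaluated at u = (j - 1)/n; the remainder is exactly 1.
    print(np.max(np.abs(j - n * (j - 1) / n)))   # 1.0: bounded, hence negligible
    # Sign test scores c_nj = 1 take the form with b_n = 1, c(u) = 1, remainder 0.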
It is shown at the end of this section that the asymptotic efficacy when (8.7)
holds is

$$e_4 = \frac{\left\{2\int_{1/2}^{1} c(2u-1)\,\frac{h'[H^{-1}(u)]}{h[H^{-1}(u)]}\,du\right\}^2}{\int_0^1 c^2(u)\,du} = \frac{\left\{2\int_0^\infty c[2H(x)-1]\,h'(x)\,dx\right\}^2}{\int_0^1 c^2(u)\,du}. \qquad (8.8)$$

When (8.7) holds with suitable conditions on c, the asymptotic efficacy for
the uniform density takes the special form given in (8.9) (Problem 37).
(5) The normal scores procedures are of the foregoing type and satisfy
(8.7) with c(u) = Φ⁻¹[(1 + u)/2], where Φ is the standard normal c.d.f.
For example, we might take cₙⱼ as the quantile of order (j − ½)/n or j/(n + 1) of
the absolute value of a standard normal random variable, that is, the
(n + j − ½)/2n or (n + j + 1)/2(n + 1) quantile of the standard normal distribution,
or take cₙⱼ = E[|Z|₍ⱼ₎] where |Z|₍₁₎ < ⋯ < |Z|₍ₙ₎ are the ordered
absolute values of a sample of n from the standard normal distribution (Sect.
9, Chap. 3).
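The two quantile descriptions coincide exactly for the j/(n + 1) version, since Φ⁻¹[(1 + u)/2] at u = j/(n + 1) equals the (n + j + 1)/2(n + 1) quantile of the standard normal distribution. A small check (Python with SciPy; the function name is ours):

    import numpy as np
    from scipy.stats import norm

    def normal_scores(n):
        """Normal scores c_nj as the j/(n+1) quantiles of |Z|."""
        j = np.arange(1, n + 1)
        return norm.ppf((n + j + 1) / (2 * (n + 1)))

    n = 5
    u = np.arange(1, n + 1) / (n + 1)
    print(normal_scores(n).round(3))
    print(norm.ppf((1 + u) / 2).round(3))   # c(u) = Phi^{-1}[(1+u)/2]: same values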
When h is symmetric about zero and sufficiently regular, the asymptotic
efficacy of any normal scores procedure can be expressed as (Problem 38)

$$e_5 = \left\{\int_{-\infty}^{\infty} \Phi^{-1\prime}[H(x)]\,h^2(x)\,dx\right\}^2, \qquad (8.11)$$

where H is the c.d.f. of h, Φ⁻¹ is the inverse of the standard normal distribution,
and Φ⁻¹′ is the derivative of this inverse function.
(6) The procedures based on the randomization distribution are the random-
ization test of the null hypothesis of symmetry around a specified value and
the corresponding estimator and confidence procedures, as described in
Sects. 2.1 and 2.2 of Chap. 4. With the sample mean as the test statistic, the
efficacy and limiting efficacy per observation are the same as for the normal-
theory procedures. However, further argument is needed to justify this
[Table 8.1: asymptotic efficacies of the mean, median, Wilcoxon, squared
ranks, and normal scores procedures, and of the asymptotically efficient
procedure, for shifts of various densities h(x).]
This agrees with (5.8). The graph in Fig. 5.1 includes these values and several
others which can be verified similarly.
This concludes our explanation of Table 8.1. The numbers in the table
can all be obtained from Equations (8.2), (8.3), (8.5) and (8.8)-(8.11). These
calculations, except those which require numerical integration, are left to
the reader as Problems 45-52.
*Derivations
We will first derive formula (8.6) for the efficacy of the procedures based on
sums of signed constants cₙⱼ, and then formula (8.8) for the asymptotic
efficacy. We assume that h is symmetric about zero and sufficiently regular
to justify the steps to follow.
The test statistic for the null hypothesis θ = 0 is

$$T_n = \sum_{j=1}^{n} S_j c_{nj}, \qquad (8.13)$$

where Sⱼ is +1 or −1 according as the observation with the jth smallest
absolute value is positive or negative. Differentiating the mean of Tₙ gives

$$\mu_n'(0) = \frac{d}{d\theta} E_\theta(T_n)\Big|_{\theta=0} = \sum_{j=1}^{n} c_{nj}\,\frac{d}{d\theta} E_\theta(S_j)\Big|_{\theta=0}. \qquad (8.16)$$
Let |X|₍₁₎, …, |X|₍ₙ₎ be the absolute values of the observations ordered
from smallest to largest. Given |X|₍₁₎, …, |X|₍ₙ₎, the conditional probability
that Sⱼ = 1 is (Problem 53)

$$P_\theta(S_j = 1 \mid |X|_{(1)}, \ldots, |X|_{(n)}) = \frac{f(|X|_{(j)}; \theta)}{f(|X|_{(j)}; \theta) + f(-|X|_{(j)}; \theta)}. \qquad (8.17)$$

Let f*(z₁, …, zₙ; θ) be the joint density of |X|₍₁₎, …, |X|₍ₙ₎ at z₁, …, zₙ.
Then

$$E_\theta(S_j) = P_\theta(S_j = 1) - P_\theta(S_j = -1) = 2P_\theta(S_j = 1) - 1 = 2E_\theta[P_\theta(S_j = 1 \mid |X|_{(1)}, \ldots, |X|_{(n)})] - 1.$$
Differentiating under the integral and using the symmetry of h,

$$\frac{d}{d\theta} E_\theta(S_j)\Big|_{\theta=0} = -\int\!\cdots\!\int \frac{h'(z_j)}{h(z_j)}\, f^*(z_1, \ldots, z_n; 0)\, dz_1 \cdots dz_n.$$

Substituting this into (8.16) and using (8.7), one finds

$$\mu_n'(0) = \sum_{j=1}^{n} c_{nj}\left\{-E\!\left[\frac{h'(|X|_{(j)})}{h(|X|_{(j)})}\right]\right\} = 2n b_n \int_{1/2}^{1} c(2u-1)\left\{-\frac{h'[H^{-1}(u)]}{h[H^{-1}(u)]}\right\} du + \text{remainder}. \qquad (8.21)$$

Dividing the square of this by n times the quantity (8.20) and letting n → ∞
gives the first formula of (8.8). The second formula follows immediately.*
The only condition on h, unless otherwise stated, is that h be such that the
procedures under discussion are valid, at least asymptotically.
We consider first the asymptotic efficiency of the normal-theory (mean)
procedures relative to the median procedures. Notice that its value in Table
8.1 ranges from zero for the Cauchy distribution to 3 for the uniform dis-
tribution. (Its value is zero for any density h which is positive at its median
and has infinite variance.) As it happens, the largest value possible for any
shift family given by a density h which has its maximum at its median is 3,
although an asymptotic efficiency of infinity is possible for a symmetric,
multimodal density h (Problem 54). Thus, for f(x; θ) = h(x − θ), the asymp-
totic efficiency of the normal-theory procedures with respect to the median
procedures can be zero, even if h is required to be symmetric and unimodal;
it can be infinite if h is unrestricted or merely assumed symmetric; and it can
be as large as 3, but no larger, if h has its maximum at its median or, therefore,
if h is symmetric and unimodal (because this implies that h has its maximum
at its median).
Next we consider the asymptotic efficiency of the Wilcoxon procedures
relative to the normal-theory procedures. It can be infinite even if h is sym-
metric and unimodal (the Cauchy family again provides an example), and
it can be as small as but no smaller than 108/125 = 0.864 if h is symmetric,
whether or not it is also required to be unimodal. The value 0.864 is achieved
by the symmetric beta density with r = 2. This density is a quadratic function
with a negative squared term, except that where the quadratic function is
negative the density is zero. A proof that 0.864 is the minimum appears at the
end of this subsection under (1).
The asymptotic efficiency of the Wilcoxon procedures relative to the
median procedures can be arbitrarily small, even if h is symmetric and uni-
modal; it can be infinite if h is unrestricted or merely symmetric; and it can
be as large as 3 but not larger if h has its maximum at its median or is sym-
metric and unimodal (Problem 55). The uniform distribution gives the value
3, while the double exponential power family in Equation (5.6) gives values
which approach zero as k → 0 (Problem 55).
The normal scores procedures are at least as efficient asymptotically as the
normal-theory procedures for all symmetric h and may be infinitely more
so. The asymptotic efficiency of the normal scores procedures relative to the
normal-theory procedures is infinite for h either Cauchy or uniform; it is
one if h is normal; and it is never less than one if h is symmetric. These results
are proved later under (2). Clearly, a requirement of unimodality would not
improve the bounds.
The asymptotic efficiency of the normal scores procedures relative to the
median procedures can be infinite and it can be arbitrarily small even if h
is symmetric and unimodal. The uniform distribution provides an example
of the former, and Problem 56 of the latter.
The asymptotic efficiency of the normal scores procedures relative to the
Wilcoxon procedures can be infinite, even if h is symmetric and unimodal,
Median I"
3 00 I 1 I"
3 CIJ 0 00
Wilcoxon 1Q.!!
0 3" 1 I 0 J!
125 00
•
n
Normal scores 1 00 0 00 "6 00 I I
and it must be more than but can come arbitrarily close to π/6 = 0.524
if h is symmetric, whether or not h is also required to be unimodal. The
uniform distribution again provides an example of the former, and the
proof of the latter is requested in Problem 57.
Table 8.2 summarizes most of the foregoing bounds. Each cell relates to
the procedures designating its row and column and contains the greatest
lower bound and the least upper bound on the asymptotic efficiency of the
row procedure relative to the column procedure for shifts of a symmetric,
unimodal density. If unimodality is not assumed, the bounds are the same
except in the cases footnoted. Notice that the table is symmetric, in the sense
that, for instance, the greatest lower bound for the median relative to the mean
is 1/3, while the least upper bound for the mean relative to the median is the
reciprocal of 1/3, or 3.
*Proofs
(1) What is the minimum asymptotic efficiency of the Wilcoxon procedures
relative to the normal-theory procedures for shifts of a density
h which is symmetric about zero? By (8.4) and (8.6), the quantity to be
minimized is

$$E = 12\sigma^2\left[\int_{-\infty}^{\infty} h^2(x)\,dx\right]^2, \qquad (8.26)$$

where σ² is the variance of h.
The next-to-last term of Equation (8.30) vanishes if σ² < ∞ (Problem 58a);
then applying the Schwarz inequality (or the fact that a correlation is at most
one) to the last term gives (8.31). The integral in Equation (8.31) is equal to
one (Problem 58b), and we may assume without loss of generality that σ² = 1,
since we can rescale H if necessary. Then Q ≥ 1 and (8.32) follows.
Equality holds if and only if H = Φ. Therefore E₅:₁ ≥ 1, with equality if and
only if h is normal. □
Proof (2) was adapted from Chernoff and Savage [1958] and is similar
to that of Gastwirth and Wolff [1968] (Problem 59). The reader may wish
to consult these papers for further insight.
In this subsection we will give explicitly the locally most powerful signed-
rank test for a symmetric shift alternative, show that its asymptotic efficacy
equals the Fisher information and hence that the test and related estimators
and confidence procedures are asymptotically efficient for the specified
alternative, and discuss the possibility of an "adaptive" procedure which
would be asymptotically efficient for all symmetric shifts simultaneously.

The signed-rank test which is locally most powerful for a shift family
of densities f(x; θ) = h(x − θ) with h symmetric about zero is, by (9.8) of
Chap. 3, based on the sum of signed constants with

$$c_{nj} = -E\!\left[\frac{h'(|X|_{(j)})}{h(|X|_{(j)})}\right], \qquad (8.33)$$

where the |X|₍ⱼ₎ are as usual the ordered absolute values of a sample from the
population with distribution h (Problem 60). (The meaning of h is different
in Chap. 3.) The efficacy of this procedure for the family f(x; θ) = h(x − θ)
is, by (8.6), equal to

$$e_n = \sum_{j=1}^{n} c_{nj}^2. \qquad (8.34)$$
It can be shown (see below) that the asymptotic efficacy of this procedure is

$$\lim_{n\to\infty} e_n/n = I, \qquad (8.35)$$

where I is the Fisher information given in (8.12) for the family f(x; θ) =
h(x − θ). As discussed earlier (see Sect. 8.2), no procedure satisfying suitable
regularity conditions has asymptotic efficacy greater than the Fisher information.
Therefore the locally most powerful signed-rank procedure for
the family f(x; θ) = h(x − θ) is asymptotically efficient for this family,
and it has asymptotic efficiency at least one relative to every other procedure
satisfying the regularity conditions. This applies, of course, to the corresponding
estimators and confidence procedures as well as tests. Exact regularity
conditions which are also simple are hard to find, but the foregoing
statements certainly apply to most standard nonparametric and parametric
procedures.
We now have an asymptotically efficient procedure for any given h. Is
there a procedure which is asymptotically efficient for all h simultaneously?
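Before turning to that question, note that the scores (8.33) are easy to approximate numerically for a given h. The sketch below (Python with NumPy; the Monte Carlo sizes are arbitrary) does this for the logistic density, for which −h′(x)/h(x) = 2H(x) − 1; the estimated scores come out close to j/(n + 1), recovering the classical fact that the locally most powerful signed-rank test for logistic shifts is essentially the Wilcoxon test.

    import numpy as np

    rng = np.random.default_rng(3)
    n, reps = 8, 200000

    # Scores (8.33) for logistic h, where -h'(x)/h(x) = 2H(x) - 1:
    # estimate E[2H(|X|_(j)) - 1] over the ordered absolute values.
    absx = np.sort(np.abs(rng.logistic(size=(reps, n))), axis=1)
    scores = (2 / (1 + np.exp(-absx)) - 1).mean(axis=0)

    print(scores.round(3))                          # close to j/(n+1): Wilcoxon-type scores
    print((np.arange(1, n + 1) / (n + 1)).round(3))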
Since cₙⱼ is now equal to the expectation multiplying it in (8.21), this argument
gives, for the efficacy (8.34),

$$\sum_{j=1}^{n} c_{nj}^2 = 2n\int_{1/2}^{1}\left\{\frac{h'[H^{-1}(u)]}{h[H^{-1}(u)]}\right\}^2 du + \text{remainder} = 2n\int_0^\infty \left[\frac{h'(x)}{h(x)}\right]^2 h(x)\,dx + \text{remainder} = nE_h\!\left[\frac{h'(X)}{h(X)}\right]^2 + \text{remainder}. \qquad (8.36)$$

Alternatively, since the square of an expectation is the expectation of the
square minus the variance,

$$\sum_{j=1}^{n} c_{nj}^2 = \sum_{j=1}^{n} E_h\!\left[\frac{h'(|X|_{(j)})}{h(|X|_{(j)})}\right]^2 - \sum_{j=1}^{n} \operatorname{var}_h\!\left[\frac{h'(|X|_{(j)})}{h(|X|_{(j)})}\right].$$

The last term here represents the amount by which the efficacy falls short
of the information in n observations.* □
9 Asymptotic Relative Efficiency of Procedures for Matched Pairs

9.1 Assumptions
$$X_j - \theta = (U_j - \theta) - V_j \qquad (9.1)$$

in standard notation. Alternatives with σ_U² = σ_V² are usual but not necessary.
Dropping the bivariate normal assumption, now suppose that there is an
"effect" associated with each pair, which may be regarded as either fixed or
random, and that the control and treatment observations in the jth pair are

$$U_j = \mu_j + V_j, \qquad (9.2)$$
$$Y_j = \mu_j + \theta + W_j, \qquad (9.3)$$
where μⱼ is the effect associated with the jth pair, θ is the treatment effect,
and Vⱼ and Wⱼ are "disturbances" or "errors." If the Vⱼ and Wⱼ are mutually
independent and normally distributed with mean 0 and common variance
σ², this is a normal, fixed-effects or mixed model for two-way analysis of
variance with no interaction; the design consists of two treatments (one is
the control) in blocks of size two. If, more generally, the n pairs of errors
(V₁, W₁), (V₂, W₂), …, (Vₙ, Wₙ) are independently, identically and symmetrically²
distributed, then the random variables

$$X_j = Y_j - U_j = \theta + W_j - V_j \qquad (9.4)$$

are independently, identically, symmetrically distributed about θ (Problem
63). The one-sample tests based on symmetry apply to the null hypothesis
θ = 0, and the alternative θ ≠ 0 is a shift alternative provided, as we assume,
the distribution of (Vⱼ, Wⱼ) is the same for all θ.
Specifically, if Vⱼ and Wⱼ have a joint density, then Xⱼ will have a density
f(x; θ) = h(x − θ) where h is the density of Wⱼ − Vⱼ. Any density h which is
symmetric around zero, like those in Table 8.1, arises from some symmetric
joint density for Vⱼ and Wⱼ (Problem 63d).

If we assume, however, that the two "errors" Vⱼ and Wⱼ are themselves
independently, identically distributed, with density q, say, then h is given
by

$$h(x) = \int_{-\infty}^{\infty} q(y)\,q(y + x)\,dy. \qquad (9.5)$$
² The pair of random variables (Vⱼ, Wⱼ) is said to be distributed symmetrically ("permutationally
symmetrically" is more specific but rarely used) if (Vⱼ, Wⱼ) has the same joint distribution
as (Wⱼ, Vⱼ). This contrasts with the definition of symmetry around θ for a single random variable.
[Table 9.1: for matched pairs, each cell contains bounds on the asymptotic
efficiency of the procedure for that row relative to the procedure for that
column for shifts of a density of the form (9.5). Except as indicated by a
footnote, the bounds are greatest lower and least upper bounds, whether or
not q is symmetric and/or unimodal.]
10 Asymptotic Relative Efficiency of Two-Sample Procedures for Shift Families

For two samples of sizes m and n, each of the procedures to be considered
has an efficacy e_{m,n} which relates in exactly the same way as the one-sample
efficacy to the power of a test or the distribution of an estimator or confidence
limit. A specified performance (power, for instance) can now, however, be
achieved by various combinations of m and n. Both the total sample size
N = m + n and the allocation, say λ = m/N, are important. It turns out,
however, as one might expect, that the influence of the allocation is the same
for all procedures of the types we have considered. The role of the limiting
efficacy per observation can be played here by

$$e_* = \lim \frac{N}{mn}\, e_{m,n}. \qquad (10.2)$$
Since the limit (10.2) is the same however m and n approach infinity,
that is, whatever the allocation m/N, we need not carry out additional com-
putations for the specific procedures discussed in Chaps. 5 and 6 and the
distributions h considered in Sect. 8, because each of the two-sample tests
has a corresponding one-sample limit for which we have already computed
the efficacy. A two-sample test for identical but unspecified populations, based
on samples of sizes m and n, reduces as m → ∞ with n fixed to a one-sample
test of a hypothesis about the distribution of the Y sample, which is of
finite size n. In particular, the two-sample t test for equality of means ap-
proaches the standard t test for the hypothesis that the Y mean equals a
given value, namely the X population mean as determined from the infinite
X sample; the two-sample median test for identical populations reduces to
the sign test of the hypothesis that the probability is ½ that a Y observation
exceeds a given value, namely the X population median (see Problem 27,
Chap. 5); and Problem 69 shows that the asymptotic properties of the two-
sample Wilcoxon rank-sum statistic for identical populations are the same
as the asymptotic properties of the one-sample Wilcoxon signed-rank statistic
for the null hypothesis of symmetry about a given point. Similar reductions
hold for the two-sample normal scores test and the randomization test based
on the difference of the sample means. Accordingly, the results for efficacy
given in Table 8.1 and for bounds on asymptotic efficiency given in Table 8.2
apply to the corresponding two-sample tests here.
Unfortunately, there seems to be no easy way to see that the limit (10.2)
does not depend on how m and n approach infinity except by calculation
in special cases or appeal to powerful theorems. Problem 70a requests such
calculation for the two-sample normal-theory, median, and Wilcoxon pro-
cedures. For a procedure based on a sum of scores cₘₙₖ (Sect. 5, Chap. 5),
the efficacy for the shift alternative (10.4) is (Problem 71a)
(10.5)
If the scores are of the form

$$c_{mnk} = b_N\, c\!\left(\frac{k-1}{N}\right) + \text{remainder}, \qquad (10.6)$$

then under suitable regularity conditions the limit (10.2) exists and is (Problem 71b)

$$e_* = \frac{\left\{\int_0^1 c(u)\,\frac{h'[H^{-1}(u)]}{h[H^{-1}(u)]}\,du\right\}^2}{\int_0^1 c^2(u)\,du - \left[\int_0^1 c(u)\,du\right]^2}. \qquad (10.7)$$
If h is symmetric and c(1 − u) = −c(u) for all u, this reduces to (8.8) with
c(u) and c[H(x)] in place of c(2u − 1) and c[2H(x) − 1]. Thus the one-
sample and two-sample asymptotic efficacies are the same for shifts of
symmetric densities h if the one-sample function c⁽¹⁾(u) and the two-sample
function c⁽²⁾(u) satisfy

$$c^{(2)}(u) = -c^{(2)}(1 - u) = c^{(1)}(2u - 1) \quad \text{for } \tfrac{1}{2} < u < 1. \qquad (10.8)$$

These conditions on c⁽²⁾ and c⁽¹⁾ are actually quite natural. For instance,
the one- and two-sample normal scores procedures have

$$c^{(1)}(u) = \Phi^{-1}[(1 + u)/2] \quad \text{and} \quad c^{(2)}(u) = \Phi^{-1}(u),$$

which satisfy (10.8). The one- and two-sample median and Wilcoxon procedures
also satisfy (10.8), and a two-sample sum-of-scores procedure corresponding
similarly to the one-sample squared rank procedure of Sect. 8
has scores cₘₙₖ = ±(k − N/2)², one sign applying for k > N/2 and the
other for k < N/2 (Problem 72).
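The correspondence (10.8) is easy to confirm numerically for the normal scores functions (Python with SciPy; the grid of u values is an arbitrary choice):

    import numpy as np
    from scipy.stats import norm

    # Correspondence (10.8) for the normal scores functions:
    # c2(u) = Phi^{-1}(u) and c1(v) = Phi^{-1}[(1 + v)/2] with v = 2u - 1.
    u = np.linspace(0.55, 0.95, 9)
    c2 = norm.ppf(u)
    c1_of_2u_minus_1 = norm.ppf((1 + (2 * u - 1)) / 2)
    print(np.allclose(c2, c1_of_2u_minus_1))     # True: c2(u) = c1(2u - 1)
    print(np.allclose(norm.ppf(1 - u), -c2))     # True: c2(1 - u) = -c2(u)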
Thus it can be shown that the efficiencies of appropriately corresponding
one- and two-sample procedures are the same for symmetric densities h, and
that Tables 8.1 and 8.2 apply to two-sample procedures as well as to one-
sample procedures. For asymmetric densities, where the one-sample procedures
are generally not valid, the two-sample procedures are valid and
could also be compared using the formulas above. For the normal-theory
procedures we still have e_* = 1/σ², and for the median procedures e_* =
4h²(ξ₀.₅), where ξ₀.₅ is the median of h, but for the Wilcoxon procedure we
must use

$$e_* = 12\left[\int_{-\infty}^{\infty} h^2(x)\,dx\right]^2.$$
11 Asymptotic Efficiency of Kolmogorov-Smirnov Procedures

The asymptotic distribution of a Kolmogorov-Smirnov statistic has a nonnormal
shape under the null hypothesis. This is also true under alternative hypotheses,
as we shall see below. Accordingly, the asymptotic power function of
a Kolmogorov-Smirnov test has a different shape from that of the other
tests, and its dependence on the level is also different. The sample size
required to achieve a given level and a given power at a given alternative
therefore depends on the level, power, and alternative in a fundamentally
different way. Consequently, the asymptotic efficiency of the Kolmogorov-
Smirnov procedures relative to other procedures depends on the Type I and
Type II errors, α and β, and therefore will have a much more restrictive
meaning in this section than in the rest of this chapter.
Since the asymptotic distributions of √(mn/N) D_mn and √(mn/N) D⁺_mn are independent
of the way m and n approach infinity under both the null hypothesis
and nearby alternatives, it is true here as in the previous section that the asymptotic
efficiency of the Kolmogorov-Smirnov procedures relative to other
procedures with the same allocation m/N does not depend on what that
allocation is. Therefore, two-sample efficiencies are again the same as one-
sample efficiencies. We shall treat the one-sample case here since the notation
is simpler, but the results are applicable to the two-sample case once
√n is replaced by √(mn/N). One-sided and two-sided tests are not simply
related, however. We shall discuss one-sided tests first.
For uniform shift alternatives, the asymptotic power of the one-sided
Kolmogorov-Smirnov test at level α against an alternative for which the true
c.d.f. exceeds the hypothesized c.d.f. by δ is

$$\begin{cases} e^{-2(c_\alpha - \sqrt{n}\,\delta)^2} & \text{if } \sqrt{n}\,\delta < c_\alpha \\ 1 & \text{if } \sqrt{n}\,\delta \ge c_\alpha, \end{cases} \qquad (11.1)$$

where c_α is the upper α point of the asymptotic null distribution of √n Dₙ⁺
given in (6.3) of Chap. 7, so that c_α = [½ log(1/α)]^{1/2}.
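A small sketch of (11.1) (Python with NumPy; the function name is ours, and c_α = √(½ log(1/α)) follows from setting the tail probability e^{−2c²} equal to α):

    import numpy as np

    def ks_power(n, delta, alpha=0.05):
        """Asymptotic power (11.1) of the one-sided KS test, uniform shift alternative."""
        c = np.sqrt(0.5 * np.log(1 / alpha))   # c_alpha: solves exp(-2 c^2) = alpha
        z = np.sqrt(n) * delta
        return 1.0 if z >= c else float(np.exp(-2 * (c - z) ** 2))

    print(ks_power(100, 0.00))   # reduces to the level alpha = 0.05
    print(ks_power(100, 0.08))   # about 0.70
    print(ks_power(100, 0.15))   # 1 (sqrt(n)*delta exceeds c_alpha)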
The limiting ratio of the sample sizes needed by the median test and by the
Kolmogorov-Smirnov test to achieve level α and power 1 − β at the same
alternative δ is therefore (Problem 79c)

(11.4)

as δ → 0 and the sample sizes approach infinity. The dependence of the
asymptotic efficiency on α and β is immediately evident. Some numerical
values are given in Table 11.1, which is explained further below. The efficiency
approaches 1 as α → 0 for fixed β, it approaches ∞ as β → 0 for fixed α, and
it approaches the limit given by (11.5) (Problem 79d)
as β → 1 − α, that is, when the alternative approaches the null hypothesis
faster than 1/√n approaches zero.
Equation (11.4) gives the asymptotic efficiency of the one-sided, one-
or two-sample Kolmogorov-Smirnov test relative to the one- or two-sample
median test, for uniform shift alternatives. Its asymptotic efficiency relative to
any other test can be obtained by dividing (11.4) by the asymptotic efficiency
of the other test relative to the median test for the same alternatives, for
instance, if the other test appears in Table 8.1, by an entry in the line labeled
uniform.
For other alternatives, the dependence on α and β will generally differ
from that in (11.4). While no simple, general method of computation is
known, it is possible to obtain bounds which provide some insight.
First, for any symmetric, unimodal shift alternative, (11.1) is an asymptotic
upper bound on the power of the Kolmogorov-Smirnov test (Problem
79e), where δ is now the difference between the true and hypothesized c.d.f.
at the median. (In the case of a uniform shift, this agrees with the previous
definition of δ, and the uniform is the "most favorable" symmetric unimodal
shift.) Therefore (11.4) is an upper bound on the asymptotic efficiency
of the Kolmogorov-Smirnov test relative to the median test. If (11.4) is
divided by the asymptotic efficiency of any other test relative to the median
test for any particular symmetric, unimodal shift family, we obtain an upper
bound on the asymptotic efficiency of the Kolmogorov-Smirnov test
relative to the other test for this family.
Second, it is easy to obtain a lower bound (unfortunately very weak) on
the power by observing that the Kolmogorov-Smirnov test will certainly
reject H₀ if Fₙ(μ) − F(μ) exceeds the Kolmogorov-Smirnov critical value
at the median μ, or equivalently, if the median test rejects even when the
Kolmogorov-Smirnov critical value is used in place of the (smaller) median
test critical value for Fₙ(μ) − F(μ). Approximating the relevant binomial
probability by a normal probability in the usual way, a lower bound on the
approximate power is obtained (Problem 80a) as

(11.6)
The power of the median test with its own critical value is again approximated
by (11.3), and it follows (Problem 80b) that
(11.7)
Lower bounds relative to tests other than the median test are obtained by
division, as the upper bounds in the previous paragraph were.
Unfortunately, for alternatives very near the null hypothesis, (11.6)
does not even say that the power exceeds α, and (11.7) is correspondingly
poor. Furthermore, the right-hand side of (11.7) is always less than 1, while
we could hope to be able to prove that the Kolmogorov-Smirnov test is
asymptotically more efficient than the median test for at least some alternatives
other than the uniform shift. It is worth noting, however, that the
right-hand side of (11.7) approaches 1 as α → 0 [Capon, 1965] or β → 0, so
that the Kolmogorov-Smirnov test is asymptotically almost as efficient
as the median test if α or β is small enough. Unfortunately the approach is
very slow, as is evident from Table 11.2.
A lower bound which improves on the previous one for any alternative
G ≥ F can be obtained as follows.

$$P_G\{\sup_t [F_n(t) - F(t)] \ge c\} = 1 - P_G\{\sup_t [F_n(t) - G(t) + G(t) - F(t)] < c\}$$
$$\ge 1 - P_G\{\sup_t [F_n(t) - G(t)] < c \ \text{and}\ F_n(\mu) - G(\mu) < c - \delta\}, \qquad (11.8)$$

where δ = G(μ) − F(μ) as before. This last probability, conditional on Fₙ(μ),
can be evaluated asymptotically by arguments like those leading to the
asymptotic null distribution, and the expectation of the result over the distribution
of Fₙ(μ) can be found (see below). The resulting lower bound on
the asymptotic power is (Problem 82a; Quade [1965, Theorem 4.2(d) with
T = ½])

(11.9)
Table 11.1. Asymptotic efficiency of the one-sided Kolmogorov-Smirnov test
relative to the median test. Within each α block, the columns run from β near
1 − α (leftmost) to β → 0 (rightmost).

α = 0.05      1.41  1.45  1.56  1.68  1.83  2.17  2.97  3.88  4.86  ∞
              0.71  0.74  0.77  0.79  0.80  0.82  0.84  0.86  0.87  1
              0     0.11  0.34  0.45  0.53  0.62  0.69  0.73  0.76  1
α = 0.025     1.35  1.42  1.52  1.62  1.76  2.06  2.77  3.57  4.42  ∞
              0.73  0.77  0.79  0.81  0.82  0.84  0.86  0.87  0.88  1
              0     0.23  0.43  0.52  0.59  0.66  0.72  0.76  0.78  1
α = 0.005     1.28  1.36  1.45  1.54  1.65  1.90  2.48  3.12  3.78  ∞
              0.76  0.81  0.83  0.84  0.85  0.87  0.88  0.89  0.90  1
              0     0.43  0.56  0.63  0.67  0.72  0.77  0.80  0.81  1
α = 0.001     1.22  1.33  1.40  1.48  1.58  1.80  2.30  2.83  3.38  ∞
              0.78  0.84  0.86  0.87  0.88  0.89  0.90  0.90  0.91  1
              0     0.55  0.65  0.69  0.73  0.77  0.80  0.82  0.84  1
α = 0.0001    1.20  1.29  1.36  1.43  1.52  1.70  2.12  2.57  3.03  ∞
              0.81  0.87  0.88  0.89  0.89  0.90  0.91  0.92  0.92  1
              0     0.66  0.72  0.75  0.78  0.81  0.83  0.85  0.86  1

Note: For each α and β pair, the entry in the first row is the value for uniform
shifts, (11.4), which is also an upper bound for alternatives G most distant
from F at the median; the entry in the second row is the value for Laplace
shifts, (11.19), which is also a lower bound for shifts of a density h satisfying
h(x) ≥ 2h(μ) min{H(x), 1 − H(x)}, where μ is the median; and the entry in
the third row is the lower bound (11.10) for stochastically one-sided alternatives.
All bounds are valid for symmetric unimodal shifts. The leftmost column
applies as β → 1 − α.
$$P_G[F_n(x) - G(x) \le c - \delta_i \ \text{for}\ \mu_{i-1} < x < \mu_i \mid U_1, U_2, \ldots, U_r], \quad i = 1, 2, \ldots, r + 1, \qquad (11.14)$$

where vᵢ = G(μᵢ) − G(μᵢ₋₁) and U₀ = U_{r+1} = 0. The asymptotic value of
the right-hand side of (11.13) is

(11.15)
and that (11.12b) still holds. Then (11.13) changes correspondingly, but
the only effect in (11.14) and (11.15) is to replace the first δᵢ by γᵢ. (This
follows because, given Uᵢ₋₁ and Uᵢ and a linear boundary under the limiting
distribution of Fₙ, only the distances from Uᵢ₋₁ and Uᵢ to the boundary
matter.) The nature of the integrand in (11.15) as a function of U₁, …, U_r is
thus unchanged.
For r = 1, δ₁ = δ, γ₁ = δ₂ = 0, γ₂ = δ, and μ₁ = μ (the median),
the condition in (11.16) becomes

$$G(x) - F(x) \le \begin{cases} 2\delta\,G(x) & \text{for } x < \mu \\ 2\delta\,[1 - G(x)] & \text{for } x > \mu. \end{cases} \qquad (11.17)$$
Some values for the lower bound in this case are given in Table 11.1. When
α → 0 or β → 0, the lower bound (11.19) again approaches 1, but as β → 1 − α,
it approaches a positive limit (Problem 84) rather than 0. Thus, this lower
bound is effective in the neighborhood of the null hypothesis, while (11.7)
and (11.10) were not.
For a symmetric shift family with density h(x − μ), as μ → 0, the condition
(11.17) under which the foregoing bound applies becomes the condition
(11.22), or, in terms of the upper tail, the condition (11.23) that the hazard
function h(x)/[1 − H(x)] be monotonically increasing for x > 0 (increasing
failure rate). A sufficient condition for this, in turn, is that

$$h'(x)/h(x) \ \text{is monotonically decreasing}. \qquad (11.24)$$

Problem 85 gives further conditions equivalent to these, which help clarify
their relationship.
Condition (11.24) is satisfied, and hence so are (11.23) and (11.22), and the
bound in (11.19) applies, for the normal, logistic, Laplace, and uniform
distributions, and more generally for the double exponential power distribution
(5.6) with k ≥ 1 and the symmetric beta distribution with r ≥ 1.
Conditions (11.22) and (11.23) also hold, although (11.24) does not, for the
symmetric beta with r < 1 and the Cauchy distributions. For the double
exponential power distribution with k < 1 (high-tailed distributions),
however, not even condition (11.22) holds, and the bound in (11.19) has not
been proved. Derivation of these results is requested in Problem 86.

For the Laplace distribution, equality holds in (11.19), which therefore
gives the actual minimum, not just a lower bound, under any one of the conditions
(11.17), (11.22), (11.23), or (11.24) (Problem 87).
For shifts of a density h, as β → 1 − α, that is, in the asymptotic neighborhood
of the null hypothesis, the quantity

$$\left\{\frac{2\alpha c_\alpha}{\phi(z_\alpha)}\int_0^1 \{2\Phi[T(u)c_\alpha] - 1\}\,\frac{h'[H^{-1}(u)]}{h[H^{-1}(u)]}\,du\right\}^2, \qquad (11.25)$$

where

$$T(u) = (2u - 1)/[u(1 - u)]^{1/2}, \qquad (11.26)$$
plays the role of the asymptotic efficacy of the one-sided Kolmogorov-
Smirnov test. In other words, (11.25) can be divided by the limiting efficacy
per observation of another test to obtain the asymptotic efficiency of the
Kolmogorov-Smirnov test relative to the other test when β is very close to
1 − α. Equation (11.25) may be compared to (8.8) for tests based on sums of
signed constants or scores. Unfortunately, it is typically difficult to evaluate,
as well as being applicable only as β → 1 − α. For proof, see Hájek and
Šidák [1967].*
We turn now to two-sided tests. One might expect the results for a one-
sided test at level α/2 to apply to the two-sided test at level α, and some of
them do approximately in some situations, as we shall see. They do not
apply directly or exactly, however, and may not apply even approximately,
because they do not take account of either rejection in the "wrong" direction
or rejection in both directions at once.

The possibility of rejection in both directions at once implies that the
rejection region for a two-sided Kolmogorov-Smirnov test is the union of two
one-sided regions which are neither mutually exclusive nor each at level
It would be possible to compute the maximum asymptotic power and
efficiency over alternatives G satisfying

$$F(x) \le G(x) \le F(x) + \delta \ \text{for all } x \quad\text{and}\quad G(\mu) = F(\mu) + \delta \ \text{at the median } \mu. \qquad (11.28)$$

We have not done so, however, because the calculation is rather complicated
and the distributions coming close to equality are multimodal in a pathological
way. When additional conditions on G are imposed, it is not known
what form the most favorable distributions or maximum power or efficiency
have, and they are probably of a form leading to even more difficult
calculation.
PROBLEMS
2. Let μₙ(θ) and σₙ²(θ) be the exact mean and variance of a statistic Tₙ for some one-
parameter family of densities f(xₙ; θ). Derive the Cramér-Rao inequality [μₙ′(θ)]² ≤
σₙ²(θ)Iₙ(θ), where Iₙ(θ) = −E[(∂²/∂θ²) log f(xₙ; θ)] is the Fisher information, by
showing that μₙ′(θ) = cov(Tₙ, Uₙ) and Iₙ(θ) = var(Uₙ), where Uₙ = (∂/∂θ) log f(xₙ; θ).
Conclude that the efficacy satisfies eₙ(θ) ≤ Iₙ(θ). Note that, if Tₙ is an estimator of θ,
then μₙ′(θ) = 1 + bₙ′(θ), where bₙ(θ) is the bias of Tₙ. See also Problem 43.
*3. Relate the asymptotic efficiency of maximum likelihood estimators to the develop-
ment in this chapter.
*5. Trace the development of Sect. 2 and give the leading error terms in (2.11) and the
approximations leading to it when Tₙ is the sample mean, the population is normal
with variance σ²(θ) depending on the mean, and the null hypothesis is given by
θ₀, σ²(θ₀).

*6. (a) Derive the order of magnitude of the error introduced in (2.11) and the approximations
leading to it when a consistent estimate is substituted for a nuisance
parameter as described in Sect. 2.2.
(b) Is the order of magnitude of this error changed if the test is adjusted to make its
level exact under some null hypothesis?
7. (a) Derive the asymptotic efficiency (3.3) of the sample proportion negative
relative to the normal-theory estimator of p, the proportion negative in the
population, for a normal population.
(b) Show that this efficiency depends only on p.
(c) Show that this efficiency is the same at 1 - P as at p.
(d) Show that this efficiency is 2/π when p = 0.5.
(e) Evaluate this efficiency at p = 0.01, 0.05, 0.10, 0.25, and 0.50.
(f) Which value of p gives the largest efficiency? (See also Problem 38(c) of Chap. 2.)
8. Show that the efficacies of equivalent test statistics, as defined by (2.10), must be
asymptotically the same, but need not be identical in finite samples, even if μₙ(θ) is
defined as the exact finite-sample mean.
9. (a) For a sample of n independently, identically distributed observations with
density f(x; θ), find the limiting efficacy per observation of the sample median
by direct calculation, using the fact that the median is asymptotically normal
with mean equal to the population median μ(θ) and variance 1/{4nf²[μ(θ); θ]}.
*(b) Find the efficacy of the number of negative observations and verify that it
agrees with the limiting efficacy in (a).
(Hint: Show that F[μ(θ); θ] = 0.5 and find μ′(θ) by differentiating.)
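For readers who want a numerical companion to Problem 9(a), the following sketch (Python with NumPy; standard normal data, so μ = 0 and f(μ) = 1/√(2π)) compares n times the Monte Carlo variance of the sample median with the asymptotic value π/2:

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps = 200, 40000
    medians = np.median(rng.standard_normal((reps, n)), axis=1)
    # The asymptotic variance of the sample median is 1/(4 n f(mu)^2); for the
    # standard normal, f(0) = 1/sqrt(2*pi), so n * var(median) -> pi/2 = 1.5708.
    print(n * medians.var())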
10. Use tests at level α = 0.50 to argue that the sample mean and the t statistic have
asymptotically the same efficacy in samples from a distribution with finite variance.

11. Show that application of the method of Sect. 3.2 to the Wilcoxon signed-rank test
for an assumed center of symmetry μ gives the median of the Walsh averages as an
estimator of μ with the same asymptotic efficacy.
12. In the two-sample shift problem, apply the method of Sect. 3.2 to obtain estimators
that have the same asymptotic efficacy as
(a) The two-sample t test.
(b) The two-sample median test.
(c) The two-sample rank-sum test.
13. Derive Equation (3.6) by expanding the right-hand side of (3.4) in a Taylor's series
about μ₁,ₙ(θ), or the left-hand side of (3.5) about θ.
*14. (a) Consider a family of normal distributions with mean μ and standard deviation σ
that are indexed by the 90th percentile value θ and σ. Given that σ = 1, how
would tests and estimators for θ be based on the sample mean, the sample
median, or the 90th percentile of the sample?
(b) What would be the asymptotic efficacies and relative efficiencies of the tests
and estimators in (a)?
(c) Consider the following three possible definitions of θ for arbitrary distributions:
(i) θ = μ + 1.645σ
(ii) θ = ξ₀.₅ + 1.645σ
(iii) θ = ξ₀.₉
where ξ_p is the quantile of order p in the distribution. Show that each of these
definitions agrees with the definition given for normal distributions with σ = 1.
(d) How might tests and estimators based on the sample mean, the sample median,
or the 90th percentile of the sample be developed for each of the extended
definitions of θ in (c)?
15. (a) Derive the approximate c.d.f. (4.2) of a confidence bound Tₙ from the approximate
power (4.1) of the corresponding test.
(b) Show using (a) that the c.d.f.'s of two confidence bounds T₁,ₙ and T₂,ₙ are
approximately the same except for a scale factor 1/√E₁:₂(θ).

16. Let L(t − μ, θ) be the "loss" incurred when the confidence bound t is given for the
parameter μ(θ). Show that, in the situations of Sect. 4,
(a) The distribution of L(T₁,ₙ − μ(θ), θ) is approximately the same as that of
L{[T₂,ₙ − μ(θ)]/√E₁:₂(θ), θ}.
(b) If L(z, θ) is homogeneous of degree k in z, that is, L(az, θ) = aᵏL(z, θ), then
E_θ{L[T₁,ₙ − μ(θ), θ]} is approximately [E₁:₂(θ)]^(−k/2) E_θ{L[T₂,ₙ − μ(θ), θ]}.
(c) Apply (b) to L(z, θ) = z and L(z, θ) = max(z, 0).
*(d) Show that if ∂L(z, θ)/∂z exists and does not vanish at z = 0, then the same
conclusion as for k = 1 in (b) holds.
(e) If T₁,ₙ and T₂,ₙ are estimators rather than confidence bounds, what changes are
necessary in (a)-(d)?
(f) Apply (b) for estimators to the loss functions
|z|, z², and c max(z, 0) + d max(−z, 0).
17. Show that the interpretation of asymptotic relative efficiency in terms of sample
sizes applies also to confidence bounds.
18. If T′ᵢ,ₙ and T″ᵢ,ₙ are respectively lower and upper confidence bounds for μ(θ), i = 1, 2,
and if the asymptotic relative efficiencies of T′₁,ₙ with respect to T′₂,ₙ, and of T″₁,ₙ
with respect to T″₂,ₙ, are both E₁:₂(θ), show that

$$E_\theta(T''_{1,n} - T'_{1,n}) = E_\theta(T''_{2,n} - T'_{2,n})/\sqrt{E_{1:2}(\theta)}$$

asymptotically.
19. For a sample from a population with median μ and a density positive and continuous
at μ, show that the sample median is asymptotically normal with mean μ and variance
1/(4nd²), where d is the density at μ.

20. For the Laplace density (5.1),
(a) Show that the mean is θ and the variance is 2λ².
(b) Show that the efficacies of the sample median and the sample mean are as
given in (5.2) and (5.3) respectively.
21. For a sample of n from a density of the form f[(x − θ)/λ],
(a) Show that the efficacy of the normal-theory procedures satisfies eₙ(θ; λ) =
n e₁(θ; 1)/λ² and in particular is independent of θ and inversely proportional
to λ².
(b) Show that the results in (a) hold also for the median procedures.
(c) Show that the asymptotic relative efficiency of the median procedures relative
to the normal-theory procedures is independent of both θ and λ and depends only
on f.
(d) Generalize the results in (a)-(c).
*22. Show that the double exponential power density (5.6) approaches the uniform
density on (θ − λ, θ + λ) as k → ∞.

*23. (a) Find the efficacies of the sample median and mean and their asymptotic relative
efficiency for the double exponential power density (5.6).
(b) Show that the asymptotic relative efficiency in (a) approaches 1/3 as k → ∞.
(c) Show that the result in (b) agrees with that of a direct calculation for the uniform
distribution.
24. Show that (6.1)-(6.4) are equivalent, that is, that (6.3) and (6.4) are equivalent to
uniform convergence.
25. How might the various properties of asymptotic relative efficiency be restated for
the case where it is 0 or oo?
26. In typical testing situations as described in Section 2, show that
(a) The powers P1.,,(00 + 0,,) and P 2 ,,,(Oo + onfi) appearing in (6.4) approach
limits other than 1, Ct., or 0 if fio" ..... d of. 0, 00, and E > O.
(b) If E of. E 1 :2 (00), the limits in (a) are different except in the situation mentioned
in Sect. 6.2.
27. Restate the uniform convergence conditions (6.5)-(6.9) in the style of (6.3) and (6.4).
28. Rescale the variables in the uniform convergence conditions (6.5)-(6.9) in such a
way that the terms have nondegenerate limits and uniformity is no longer essential
to the meaning. What are the limits? State the corresponding properties of asymp-
totic relative efficiency in terms of these limits.
29. State and justify converses to the properties of asymptotic relative efficiency,
according to which each property determines the asymptotic relative efficiency
uniquely.
30. Let T1 ." and T2 ." be estimators of the same quantity II = 11(0). If fi(T1." - It) has
a nondegenerate limiting distribution and property B(i) holds, then fi(T2 ,n - /-I)
has the same non degenerate limiting distribution except for the scale factor
I/JE~'2(O). Show this.
416 8 Asymptotic Relative Efficiency
*31. (a) Give an example of an estimator Tl,n and a population distribution such that
TI, n has infinite variance for every 11 but its limiting distribution has finite
variance.
(b) Can the variance of the limiting distributIOn exceed the limit (or lim inf or
lim sup) of the variance of ~ TI • n ?
32. (a) Show that Property A(ii) implies that, for all D, the difference between the two
powers PI.".(0o + J~IO) - P 2 .",({)o + j,~O) approaches zero if 112/111 -> E.
ThiS property is sometimes taken as the definitIOn of asymptotic relative
efficiency.
*(b) Show the converse of (a) under the additional condition that the power functions
Pl,n(O) and P 2 • n(0) are both monotone functions of 0 for all 11.
(c) Give statements analogous to (a) and (b) for estimators.
(d) Give statements analogous to (a) and (b) for confidence bounds.
33. (a) For one-tailed tests, choose dn such that the powers P l.n(OO + t) and P 2, n(OO +
dll t) have the same derivative with respect to tat t = O. Argue that dn->J E I z{(0)
as 11 -> 00.
(b) For unbiased two-tailed tests, choose dn such that the powers in (a) have the
same second derivative with respect to t at t = O. Argue that dn -> EI'z{Oo) as
11 -> 00.
38. (a) Derive the asymptotic efficacy of the normal scores procedures given m (8.11)
for a symmetric shift.
(b) Show thatthe result in (a) can also be written as e5 = UM<I>-l '(1)/ H- l'(l}}dl]2.
39. Determme the effect of a change of scale in the model on the efficacies of the pro-
cedures in Sect. 8.2.
*40. (a) Show that the double exponential power density with the value of a specified
in Table 8.1 approaches t for Ix I < 1 and 0 for Ix I > 1 as k -+ 00.
(b) What happens to the limit in (a) at x = I? Is it relevant?
(c) What happens to the limit in (a) if a is kept fixed as k -+ w?
*41. (a) Show that the symmetric beta density with the value of a specified in Table 8.1
approaches a normal density for all x as ,. -+ w.
(b) What happens to the limit in (a) if a is kept fixed as ,. -+ w?
42. Show that the logistic density appearing in Table 8.1 can be written in the alternative
forms
IJ(x) = a/2(l + cosh ax) = a/4 cosh2(ax/2)
where cosh z = (e + e-
Z Z )/2.
43. For the shift family f(x; 6) = h(x - 6), show that the Fisher information is
/(0) = E[h'(X)/IJ(X)]2 = - E[cfJ"(X)]
where X has density hand cfJ(x)= log h(x). (See also Problem 2.)
44. Let Un be the sample mean and let T" = Un if I Un I > n - 1/4 and T" = 0 otherwise.
For a population with mean ~ and finite variance (J2,
(a) Show that T" is asymptotically N(~, (J2/n) if Jl =f. O.
(b) What is the asymptotic distribution of T" if ~ = O?
(c) What is the asymptotic efficacy of T" ? Is it continuous at ~ = O?
"'45. For the double exponential power density h(x) of Table 8.1, show that
(a) (J2 = r(3/k)/a 2 r(l/k) and the asymptotic efficacy of the mean procedures is
1/(J2.
(b) h(O) = ak/2r(l/k) and the asymptotic efficacy of the median procedures is
[ak/r(l/k)Y
J
(c) IJ 2 (x)dx = ak/2 1+ (I/k)r(l/k) and the asymptotic efficacy of the Wilcoxon
procedures is 3[ak/21/kr{1/k)]2.
(d) / = a2 k 2 r(2 - I/k)/r(l/k).
(e) The entries in Table 8.1 corresponding to (a)-(d) are correct.
(f) The asymptotic efficiencies of the mean and Wilcoxon procedures relative to
the median procedures both approach 3 as k -+ 00.
46. For the normal scores procedures,
(a) Show that the asymptotic efficacy is 2/n for the Laplace distribution and 1 for
the standard normal distribution.
(b) Verify the entries in Table 8.1 corresponding to (a). (Note that the variance of
the normal density there is nI2.)
*47. For the one-sample squared rank procedures,
(a) Show that the asymptotic efficacy is i for the Laplace distribution and
80(arctan J!YIn 3 (with arctan III radians) for the standard normal distributIOn.
418 8 AsymptotIc RelatIve EfficIency
(b) Verify the entries in Table 8.1 corresponding to (a). (Note that the variance of
the normal density there is n/2.)
48. For the uniform distribution on [ -1, 1], verify the asymptotic efficacies given in
Table 8.1.
*49. For the symmetric beta density hex) of Table 8.1, show that
(a) (12 = 1/(2r + l)a 2 and the asymptotic efficacy of the mean procedures is
(2r + l)a 2 •
(b) h(O) = a/2 2r - 1 B(r, r) and the asymptotic efficacy of the median procedures is
[a/4r - t B(r, r)Y
(c) f h2(X)dx = aB(2,. - 1, 2r - 1)/2B 2(,., r) and the asymptotic efficacy of the
Wilcoxon procedures is 3a 2B2(2r - 1,2,. - 1)/B4 (,., r).
(d) I = a2 (,. - 1)(2r - 1)/(r - 2) for r > 2. What happens if r :5; 2?
(e) The entries in Table 8.1 corresponding to (a)-(d) are correct.
r
50. For the one-sample squared rank procedures for shifts, show that
r
(a) The asymptotic efficacy IS
*(b) The result in (a) for the symmetric beta density of Table 8.1 with a = 1 is
5(33/32)2 if r = 2; 320(2/n)6(n/3 + 1/5 - 7/W if r = 1.5; 5(283.15/7.2 9 )2 if
r = 3; 5(187.175/3.2 13 )2 if r = 4.
*(c) Verify the entries in Table 8.1 corresponding to (b).
*51. For the cumulative logistIc distribution H(x) = 1/(1 + e- X ), show that (Hint:
The relation hex) = H(x)[1 - H(x)] and the substitution y = H(x) may sometimes
be helpful.)
(a) (12 = n 2/3 and the asymptotic efficacy of the mean procedures is 3/n 2.
(b) h(O) = i and the asymptotic efficacy of the median procedures is t.
(c) f h2(X )dx = i and the asymptotic efficacy of the Wilcoxon procedures is t-
(d) The asymptotic efficacy of procedures based on sums of signed constants
satisfying (8.7) is e4 = [fA uc(u)du]2m c2(u)du.
(e) For c(u) = u2, e4. = -h.
(f) The efficacy of the normal scores procedures is l/n.
(g) 1= t.
(h) The entries in Table 8.1 corresponding to (a)-(g) are correct.
52. For the Cauchy density hex) = l/n(1 + x 2), show that
(a) (12 = 00 and the asymptotic efficacy of the mean procedures is O.
(b) h(O) = l/n and the asymptotic efficacy of the median procedures is 4/n2.
(c) f h2(X)dx = 1/2n and the asymptotic efficacy of thc Wilcoxon procedures IS
3/n 2 •
*(d) The asymptotic efficacy of procedures based on sums of signed constants
satisfying (8.7) is
53. Derive Equation (8.17) for the conditional probability, given the absolute values of a
sample, that the jth smallest belongs to a positive observation.
54. Show that the asymptotic efficiency of the normal-theory procedures relative to
the median procedures for shifts h(x - 8) can be infinite even if h is symmetric,
but cannot exceed 3 if h has its maximum at its median.
55. Justify the bounds given in Sect. 8.3 for the asymptotic etTIciency of the Wilcoxon
procedures relative to the median procedures.
56. Show that the asymptotic efficiency of the normal scores procedures relative to the
median procedures can be arbitrarily small for shifts of a symmetric density with a
spike of height l/e and width e2 at the median.
*57. (a) Show that the asymptotic efficiency of the Wilcoxon procedures relative to the
normal scores procedures for shifts of a symmetric density h cannot exceed 6/rt.
(Hmt: Use the second expression in (8.11).)
(b) Show that the value in (a) is approached for a suitable symmetric unimodal
density with a spike of height 1/e2 and width e3 at the median.
*58. (a) If H is a c.dJ. with finite variance, show that xl/>{<I>-l[H(x)]} ..... 0 as x ..... ± 00,
where <I> and I/> are the standard normal cumulative and density functions.
(Hint: Tchebycheff's inequality can be used [Gastwirth and Wolff, 1968].)
(b) If H is continuous, show that {J <I>-1[H(x)]h(x)dx}2 = 1.
59. (a) Show that E(l/Z) ;::: l/E(Z) if the random variable Z is positive with prob-
ability one. (One proof uses Jensen's inequality; another uses (8.27) or a similar
inequality. For others, see Gurland [1967].)
*(b) Use the result in (a) instead of (8.27) to show that the asymptotic efficiency of
the normal scores procedures relative to the normal-theory procedures for
symmetric shifts is at least one [Gastwirth and Wolff, 1968].)
*60. Derive the locally most powerful signed-rank test given by (8.33) for a symmetric
shift family.
*61. (a) Show that the test based on the sum of signed constants (8.33) satisfies (8.7) with
(b) Verify that substitution of the result in (a) in (8.8) gives e4. = I.
*62. Derive the expression (8.37) relating the efficacy of the signed-rank test given by
(8.33) to the Fisher information.
63. Let the jomt density of (V, W) belong to a shift family g(v, w; 0) = g(v, w - 8; 0)
and let X = W - V.
(a) Express the density of X in terms of g and show that it can be written as a shift
family f(x; 8) = h(x - 8).
(b) Show that every univariate shift family of densities can arise as in (a).
(c) If (V, W) is permutationally symmetrically distributed when 8 = 0, show that
X is symmetrically distributed about 8.
(d) Show that every symmetric density h can arise in the manner of (c).
(e) What role does the restriction to densities play in (a)-(d)?
420 8 Asymptotic Relative Efficiency
69. Let X" ... , Xm have c.dJ. F and Y" . .. , y" have c.dJ. G; let Z,' Z 2' ... be indepen-
dently, identically distributed according to H which is symmetric about 0; and
define
the number of lj < y if y < 0,
nG*(y) = {
" the number of lj ~ y if y ;;:: o.
Show that
(a) The two-sample Wilcoxon test statistic is equivalent to L Fm( lj), which
approaches L, F(lj) as m --> 00.
(b) The one-sample Wilcoxon test statistic for the Y sample is equivalent to
LJ [G:( lj) - G:( - lj)].
(c) For shifts of H, each of the statistics in (a) and (b) has the same asymptotic
behavior as a linear function of L H(Zj + 0).
(d) A two-sample sum of scores test statistic satisfying (10.6) is asymptotically
equivalent to L c[F(lj)] as first m --> 00 and then n --> 00.
(e) A one-sample sum of scores test statistic satisfying (8.7) is asymptotically
equivalent to LJ c[G:(lj) - G:( -lj)] as n --> 00.
(f) In the situation of (c), the sum in (d) is distributed as L c[H(Zj + 0)] while
the sum in (e) IS distributed as L c[2H(Zj + 0) - 1].
(g) With c = C(I) or c(2) as appropriate, the sums in (d)-(f) are the same under
condition (10.8).
*70. For each of the two-sample normal-theory, median, and Wilcoxon procedures
for shift alternatives (10.4),
(a) Derive directly the asymptotic efficacy.
(b) Verify that the efficacy does not depend on the ratio of the sample sizes and
agrees with the corresponding one-sample efficacy.
71. For a two-sample procedure based on a sum of scores,
(a) Derive the efficacy (10.5).
(b) Derive the asymptotic efficacy (10.7).
72. (a) Show that the one- and two-sample median and Wilcoxon procedures have
corresponding scores functions in the sense of (10.8).
(b) Derive the two-sample procedure that corresponds similarly to the one-
sample squared ranks procedure.
73. Problems 73-78 all concern the asymptotic robustness of two-sample location
tests against inequality of scale. See also Pratt [1964]. Consider a test at level a
for the null hypothesis that the densitiesfand g, with c.dJ.'s F and G respectively,
are equal. Suppose, for definiteness, that a is one-tailed, and let K = 11>-I(a) be the
corresponding standard normal deviate. Assume that miN --> A(which now matters).
For the two-sample t test, show that
(a) The probability of rejection approaches 0 or 1 iff and 9 have different means
(and finite variances).
(b) The probability of rejection approaches I1>(Kd) if f and 9 have equal means,
where d = [A + (l - A)02]1/2/(l - A + A02)1/2 and 0 = var(Y)/var(X).
(c) For fixed A, the range of possible values of d is an interval with endpoints
(A -I - 1)'/2 and (A -I - 1)-1 /2.
*74. For the two-sample median test, show that results like those in Problem 73 hold
with medIans replacing means, d = (1 - A+ AO)/(1 - A + A02)1/2, and 0 = fig at
422 8 Asymptotic Relative Efficiency
the common median, and the endpoints of the range of dare [min(A, I - A)]1/2
and 1. (Hint: Use the corresponding confidence bound given in (3.6) of Chap. 5.)
*75. For the two-sample Wilcoxon test, show that results like those of Problem 73 hold
but PI = 1/2 replaces equal means, d = [12AP2 + 12(1 - A)P3 - 3r1 / 2 , and the
t
range of d has endpoints (3 - 31)-1/2 and (I - 1)1/2/(3A - 412)1/2 if 1::;; and the
same with 1 replaced by (1 - 1) otherwise, where PI' P2' and P3 are defined by
(4.12)-(4.14) of Chap. 5. Forthe range of d, use var(U)~ mn[3m - (m _1)2 /(n -1)]/12
If PI = t where U is the Mann-Whitney statistic defined in (4.5) of Chap. 5 [Birnbaum
and Klose, 1957]. (Hint: Equation (4.26) of Chap. 5 gives an upper bound for
var(U) which is achieved for constant Y.)
*76. In Problem 75, if J and g have the same center of symmetry and variances in the
ratio (J2 = var( y)/var(X), show that
(a) P2 = ! + (I/2n) arcsin [(J2/(02 + I)] for J and g normal, where the arCSIn IS in
radIans.
(b) P2 = ! + (J2/2«(J + 1)(2(J + I) for f and g the Laplace density (5.6).
(c) P2 = ! + (J2/12 if (J ::;; 1 and P2 = t-
1/6(J if (J ~ 1 for Jand g uniform.
(d) P3 is given by the same formulas but with (J replaced by 1/(J.
*77. For the two-sample normal scores test, show that results like those of Problem
73(a) and (b) hold but')' = 0 replaces equal means and
= [2A.J(F, G,1) + 2(1 - 1)J(G, F, 1 - 1)]-1/2
d
J(F, G, A) = ff
x<y
G(x)[1 - G(y)]<I>-I,[H(x)]<I>-I,[H(y)]J(x)J(y)dxdy.
*78. What are the implications of Problems 73-77 for confidence procedures?
79. (a) Derive the approximation (11.1) for the power of the one-sided, one-sample
Kolmogorov-Smirnov test against shifts of the uniform distribution.
(b) Derive the approximate power (11.3) of the median test against the alternative
in (a).
(c) Derive the asymptotic relative efficiency (11.4) of the one-sided Kolmogorov-
Smirnov test relative to the median test for uniform shift alternatives.
*(d) Find the limits of the asymptotic relative efficiency as (1. --+ 0 for fixed {J, and as
{J --+ 0 for fixed (1., and as {J --+ I - (1..
*(e) Show that (a) and (c) provide upper bounds for any symmetric, unimodal
shift alternative.
*(f) For what other alternatives do (a) and (c) provide upper bounds?
80. For the one-sided, one-sample Kolmogorov-Smirnov test, derive
(a) The lower bound (11.6) for its power.
*(b) The lower bound (11.7) for its asymptotic efficiency relative to the median test.
(Hint: Use (11.2) and (11.3.)
81. (a) LetJbe a density on (- 00, B) where B is finite and let ~o be its quantile of order
O. Let g be a density on (~o, (0) such that g(y) = J(y) for eo::;; y ::;; B. Show that
the power of a Kolmogorov-Smirnov test of the null hypothesis J against the
alternative g depends only on (J, and not onf, and in particular, it is the same for
any J as for a uniform distribution on (0, I).
*(b) What happens in (a) if no such finite limit B exists?
Problems 423
*87. Show that the Laplace distribution achieves equality in the lower bounds (11.18)
and (l1.19).
88. Let nl and n2 be the sample sizes required to achieve a given p by a one-sided
normal-theory test at level ct./2 and by a two-sided normal-theory test at level ct.
for the one-sample shift problem.
424 8 AsymptotIc RelatIve Efficiency
(a) Give expressions that determine "l and "2 for alternatives near the null hypo-
thesIs.
(b) Show that 1lt/1l2 :::;; 1.02 asymptotically for (X :::;; 0.1 and p :::;; 0.74; also for
(X :::;; 0.05 and P : :; 0.85; also for other combinations of (X and p.
*89. For the Kolmogorov-Smirnov test with 'one-sided critical value c at level (X/2,
show that
(a) PG[inf(Fn - F) :::;; -c and sup(Fn - F) :::;; c] :::;; (X/2 if G(x) ~ F(x) for all x.
(b) Asymptotically the probability in (a) comes arbitrarily close to
(X/2 - PG[inf(Fn - G) :::;; -c and sup(Fn - G) ~ c]
and the inequality (11.27) comes arbitrarily close to equality for shifts of a
distributIOn which is umform on U7~ I (I, i + t) for sufficiently large m.
*90. Show that, in the notation of Problem 89, for G continuous, as II --+ 00
+ e-2((2i-1)c-.0]2 _
<Xl
--+ L e - 2((2i-l)c-(.-1)8I' 2e- 2•2(2c-O)2.
J= 1
Tables
425
426 Tables
0.00 001 002 003 004 0.05 0.06 007 0.08 0.09
00 0 50000 49601 49202 48803 48405 48006 47608 47210 46812 46414
0.1 46017 45620 45224 44828 44433 44038 43644 43251 42858 42465
0.2 42074 41683 41294 40905 40517 40129 39743 39358 38974 38591
03 38209 37828 37448 37070 36693 36317 35942 35569 35197 34827
0.4 34458 34090 33724 33360 32997 32636 32276 31918 31561 31207
05 30854 30503 30153 29806 29460 29116 28774 28434 28096 27760
0.6 27425 27093 26763 26435 26109 25785 25463 25143 24825 24510
07 24196 23885 23576 23270 22965 22663 22363 22065 21770 21476
08 21186 20897 20611 20327 20045 19766 19489 19215 18943 18673
09 18406 18141 17879 17619 17361 17106 16853 16602 16354 16109
1.0 15866 15625 15386 15151 14917 14686 14457 14231 14007 13786
II 13567 13350 13136 12924 12714 12507 12302 12100 11900 11702
12 11507 11314 11123 10935 10749 10565 10383 10204 10027 09853
1.3 00 96800 95098 93418 91759 90123 88508 86915 85343 83793 82264
14 80757 79270 77804 76359 74934 73529 72145 70781 69437 68112
15 66807 65522 64255 63008 61780 60571 59380 58208 57053 55917
16 54799 53699 52616 51551 50503 49471 48457 47460 46479 45514
17 44565 43633 42716 41815 40930 40059 39204 38364 37538 36727
18 35930 35148 34380 33625 32884 32157 31443 30742 30054 29379
19 28717 28067 27429 26803 26190 25588 24998 24419 23852 23295
20 22750 22216 21692 21178 20675 20182 19699 19226 18763 18309
2I 17864 17429 17003 16586 16177 15778 15386 15003 14629 14262
22 13903 13553 13209 12874 12545 12224 11911 11604 11304 11011
23 10724 10444 10170 09903 09642 09387 09137 08894 08656 08424
24 00' 81975 79763 77603 75494 73436 71428 69469 67557 65691 63872
2.5 62097 60366 58677 57031 55426 53861 52336 50849 49400 47988
26 46612 45271 43965 42692 41453 40246 39070 37926 36811 35726
27 34670 33642 32641 31667 30720 29798 28901 28028 27179 26354
28 25551 24771 24012 23274 22557 21860 21182 20524 19884 19262
2.9 18658 18071 17502 16948 16411 15889 15382 14890 14412 13949
3.0 13499 13062 12639 12228 11829 11442 11067 10703 10350 10008
3.1 00 3 96760 93544 90426 87403 84474 81635 78885 76219 73638 71136
3.2 68714 66367 64095 61895 59765 57703 55706 53774 51904 50094
3.3 48342 46648 45009 43423 41889 40406 38971 37584 36243 34946
3.4 33693 32481 31311 30179 29086 28029 27009 26023 25071 24151
35 23263 22405 21577 20778 20006 19262 18543 17849 17180 16534
36 15911 15310 14730 14171 13632 13112 12611 12128 11662 11213
37 10780 10363 09961 09574 09201 08842 08496 08162 07841 07532
3.8 0.0' 72348 69483 66726 64072 61517 59059 56694 54418 52228 50122
39 48096 46148 44274 42473 40741 39076 37475 35936 34458 33037
40 31671 30359 29099 27888 26726 25609 24536 23507 22518 21569
41 20658 19783 18944 18138 17365 16624 15912 15230 14575 13948
42 13346 12769 12215 11685 11176 10689 10221 09774 09345 08934
43 00' 85399 81627 78015 74555 71241 68069 65031 62123 59340 56675
44 54125 51685 49350 47117 44979 42935 40980 39110 37322 35612
Tables 427
Table A (continued)
z 000 001 002 0.03 004 005 006 007 008 009
4.5 33977 32414 30920 29492 28127 26823 25577 24386 23249 22162
4.6 21125 20133 19187 18283 17420 16597 15810 15060 14344 13660
4.7 13008 12386 11792 11226 10686 10171 09680 09211 08765 08339
48 00 6 79333 75465 71779 68267 64920 61731 58693 55799 53043 50418
49 47918 45538 43272 41115 39061 37107 35247 33476 31792 30190
For larger values of z, P can be approximated by (21t)-1/2 Me- "/2 = 0.398942 Me-"/2 where
The error IS less than 0 0005P for z ~ I 2. See Pelzer and Pratt [1968].
Source Taken from Table III of Fisher and Yates Stallsllcal Tablesfor BIOlogical, Agricultural and Medical Research, pubhshed
by Longman Group Ltd., London (6th edition, 1974, page 45) (previously pubhshed by Ohver & Boyd, Ltd, EdlDburgh), by
permISSion of the authors and publishers
428 Tables
r
use P(S = 0) = (I - p)n For other values of s, consider the complementary tall For other values of p andlor II,
Table A can be used wllh the approximate standard normal deviate
Z=~~- d -~- (
{1211 s'
s'ln-+ t'ln-t')
Is' - IIpl 611 + I lip IIq
where s' = s + 1/2, t' = II - s - 1/2, d = s + 2/3 - (II + 1/6)p + 002 [ql(s + I) - pl(1I - s) + (q - 05)/(11 + I)]
The errorls less than I % of the smaller tall probability If~, II - s - I ~ 2 and 0 19 $ s'qlt'p $ 53 It IS less than 02%
If S, II - S - I ~ 4 and 040 $ s'qlt'p $ 25 See Pelzer and Pratt [1968]. It IS less than 000082 If s, II - S - I ~ I,
than 000012 If S, II - S - I ~ 4 See Lmg [1978].
p 005 010 020 030 040 050 060 070 080 090 095
II n-s
4 I 09860 09477 0.8192 06517 04752 03125 01792 00837 00272 00037 00005 3
5 I 0.9774 09185 07373 05282 03370 01875 00870 00308 00067 00005 00000 4
2 09988 09914 09421 08369 06826 05000 0.3174 0.1631 00579 00086 00012 3
6 I 09672 08857 06554 04202 0.2333 01094 00410 0.0109 0.0016 0.0001 00000 5
2 09978 09842 09011 07443 05443 03438 01792 0,0705 0.0170 00013 00001 4
7 I 09556 08503 0.5767 0.3294 0.1586 00625 00188 00038 00004 0.0000 0.0000 6
2 09962 0.9743 08520 06471 0.4199 0.2266 00963 0.0288 00047 00002 00000 5
3 09998 0.9973 09667 08740 07102 05000 0.2898 01260 0.0333 0.0027 0.0002 4
8 I 09428 08131 0.5033 02553 0.1064 0.0352 00085 00013 00001 0.0000 0.0000 7
2 09942 09619 0.7969 0.5518 0.3154 01445 00498 0.0113 0.0012 0.0000 00000 6
3 09996 09950 09437 08059 0.5941 03633 01737 0.0580 00104 00004 00000 5
9 I 09288 07748 0.4362 01960 00705 0.0195 00038 00004 00000 0.0000 0.0000 8
2 09916 0.9470 0.7382 04628 02318 00898 00250 0.0043 0.0003 00000 00000 7
3 0.9994 09917 0.9144 07297 0.4826 02539 0.0994 00253 00031 00001 00000 6
4 10000 09991 09804 09012 07334 05000 0.2666 00988 0.0196 00009 00000 5
10 09139 07361 0.3758 0.1493 0.0464 0.0107 00017 00001 00000 0.0000 0.0000 9
2 09885 09298 0.6778 0.3828 01673 0.0547 0.0123 00016 0.0001 0.0000 0.0000 8
3 09990 09872 08791 0.6496 0.3823 0.1719 0.0548 00106 00009 0.0000 0.0000 7
4 09999 09984 0.9672 0.8497 0.6331 0.3770 0.1662 00473 00064 00001 00000 6
II I 08981 06974 03221 01130 0.0302 0.0059 00007 00000 00000 00000 00000 10
2 09848 09104 0.6174 0.3127 01189 00327 0.0059 00006 00000 0.0000 00000 9
3 09984 09815 08389 05696 02963 01133 00293 00043 00002 00000 0.0000 8
4 09999 09972 0.9496 0.7897 05328 02744 00994 00216 00020 00000 00000 7
5 10000 09997 09883 09218 07535 05000 02465 00782 00117 00003 0.0000 6
12 I 08816 0.6590 02749 0.0850 0.0196 00032 00003 00000 00000 00000 00000 II
2 09804 0.8891 05583 0.2528 00834 0.0193 00028 00002 00000 00000 00000 10
3 09978 09744 0.7946 04925 02253 00730 0.0153 00017 0.0001 00000 00000 9
4 0.9998 0.9957 0.9274 07237 04382 0.1938 00573 0.0095 0.0006 00000 00000 8
5 10000 09995 09806 0.8822 06652 0.3872 01582 0.0386 00039 00001 00000 7
13 I 08646 06213 02336 00637 00126 00017 00001 00000 0.0000 0.0000 0.0000 12
2 09755 08661 0.5017 02025 00579 00112 0.0013 00001 0.0000 00000 00000 II
3 09969 09658 07473 04206 01686 00461 00078 0.0007 00000 00000 00000 10
4 09997 09935 09009 06543 03530 01334 0.0321 00040 00002 0.0000 00000 9
5 10000 0.9991 09700 08346 05744 02905 00977 00182 00012 00000 00000 8
6 10000 09999 09930 09376 07712 05000 02288 0.0624 0.0070 00001 00000 7
I-p 0.95 0.90 0.80 0.70 0.60 050 0.40 030 020 010 0.05
Tables 429
Table B (continued)
p 005 0.10 020 030 040 050 060 070 080 090 095
/I " -s
14 08470 05848 01979 00475 00081 00009 00001 00000 00000 00000 00000 13
2 09699 08416 04481 01608 00398 00065 00006 00000 00000 00000 00000 12
3 0.9958 09559 06982 03552 01243 00287 00039 0.0002 0.0000 0.0000 00000 II
4 09996 0.9908 08702 05842' 02793 0.0898 00175 00017 00000 00000 00000 10
5 10000 09985 09561 0.7805 04859 02120 00583 00083 00004 00000 00000 9
6 10000 09998 09884 09067 06925 03953 01501 0.0315 00024 00000 00000 8
15 I 08290 05490 01671 00353 00052 00005 0.0000 00000 00000 00000 00000 14
2 09638 08159 03980 01268 00271 00037 00003 00000 00000 00000 00000 13
3 09945 09444 06482 02969 00905 00176 00019 00001 00000 00000 00000 12
4 09994 09873 08358 05155 02173 0.0592 0.0093 0.0007 00000 00000 00000 II
5 09999 09978 09389 07216 04032 01509 00338 00037 00001 00000 00000 10
6 10000 09997 09819 08689 06098 03036 0.0950 0.0152 0.0008 0.0000 0.0000 9
7 10000 10000 09958 09500 07869 0.5000 02131 00500 00042 00000 0.0000 8
16 I 08108 05147 01407 00261 00033 00003 00000 0.0000 0.0000 00000 00000 15
2 09571 07892 03518 00994 0.0183 0.0021 0.0001 0.0000 00000 0.0000 0.0000 14
3 09930 09316 0.5981 02459 0.0651 00106 00009 0.0000 00000 0.0000 00000 13
4 09991 09830 07982 04499 01666 00384 00049 00003 00000 0.0000 0.0000 12
5 09999 09967 09183 06598 0.3288 01051 00191 0.0016 0.0000 0.0000 00000 II
6 10000 09995 09733 08247 0.5272 02272 00583 00071 00002 0.0000 00000 10
7 10000 09999 09930 09256 07161 04018 01423 0.0257 00015 00000 00000 9
17 I 07922 0.4818 01182 00193 00021 0.0001 0.0000 0.0000 00000 00000 00000 16
2 09497 0.7618 03096 00774 00123 0.0012 0.0001 0.0000 00000 00000 00000 15
3 09912 09174 05489 0.2019 0.0464 00064 0.0005 00000 0.0000 0.0000 0.0000 14
4 09988 09779 0.7582 03887 0.1260 0.0245 0.0025 00001 0.0000 0.0000 00000 13
5 09999 09953 08943 05968 02639 00717 0.0106 0.0007 0.0000 0.0000 00000 12
6 10000 09992 09623 07752 04478 01662 00348 0.0032 00001 0.0000 0.0000 11
7 10000 09999 09891 08954 0.6405 03145 00919 00127 00005 00000 00000 10
8 10000 10000 09974 09597 08011 05000 0.1989 0.0403 00026 0.0000 00000 9
18 I 07735 04503 00991 0.0142 00013 0.0001 0.0000 00000 00000 00000 00000 17
2 09419 07338 02713 00600 00082 00007 00000 00000 00000 00000 0.0000 16
3 09891 09018 05010 01646 00328 00038 00002 0.0000 0.0000 0.0000 00000 15
4 09985 09718 07164 03327 00942 00154 00013 00000 00000 00000 0.0000 14
5 09998 09936 08671 05344 02088 0.0481 00058 00003 0.0000 0.0000 0.0000 13
6 10000 09988 09487 07217 03743 0.1189 0.0203 00014 00000 00000 00000 12
7 10000 09998 09837 08593 05634 02403 00576 00061 00002 0.0000 00000 11
8 10000 10000 09957 09404 07368 04073 01347 00210 0.0009 0.0000 0.0000 10
19 I 07547 0.4203 00829 00104 o0008 ,0 0000 00000 00000 00000 0,0000 00000 18
2 0.9335 0.7054 0.2369 0.0462 00055 00004 00000 00000 00000 00000 00000 17
3 09869 08850 04551 01332 00230 00022 00001 00000 00000 00000 00000 16
4 0.9980 0.9648 0.6733 02822 0.0696 0.0096 00006 0.0000 0.0000 0.0000 00000 15
5 09998 0.9914 0.8369 04739 01629 00318 0,0031 0.0001 00000 00000 00000 14
6 10000 09983 09324 06655 0,3081 0.0835 0.0116 00006 00000 00000 00000 13
7 1.0000 09997 09767 08180 04878 01796 00352 00028 00000 0.0000 00000 12
8 1.0000 1,0000 09933 09161 06675 03238 0,0885 00105 0.0003 00000 00000 II
9 10000 10000 09984 09674 08139 0.5000 01861 00326 00016 00000 00000 10
1-p 0.95 0.90 0.80 0.70 0.60 050 040 030 0.20 010 005
430 Tables
Table B (continued)
p 005 010 020 030 040 050 060 070 080 0.90 0.95
n n-s
20 1 07358 0.3917 00692 00076 00005 00000 00000 00000 00000 00000 00000 19
2 0.9245 06769 0.2061 00355 00036 00002 00000 00000 00000 00000 00000 18
3 09841 0.8670 04114 0.1071 0.0160 00013 0.0000 0.0000 00000 0.0000 0.0000 17
4 09974 09568 06296 02375 00510 00059 00003 00000 00000 00000 00000 16
5 09997 09887 08042 04164 01256 00207 00016 00000 0.0000 0.0000 0.0000 15
6 1.0000 09976 09133 0.6080 02500 0.0577 00065 0.0003 0.0000 0.0000 0.0000 14
7 10000 09996 0.9679 0.7723 0.4159 0.1316 00210 0.0013 00000 0.0000 00000 13
8 10000 09999 09900 08867 05956 02517 00565 00051 00001 00000 00000 12
9 10000 1.0000 0.9974 0.9520 07553 0.4119 01275 00171 00006 00000 00000 11
I-p 095 0.90 080 070 060 050 040 030 020 010 0.05
000 0005 0.010 0.025 0051 0.105 3.89 4.74 557 6.64 743
005 0005 0.010 0025 0.051 0.105 3.62 4.32 497 578 634
0.10 0005 0010 0025 0051 0105 3.37 394 445 504 544
015 0005 0010 0025 0.051 0.105 3.14 360 3.99 4.42 470
0.20 0005 0010 0025 0051 0104 292 329 358 389 407
2 000 0103 0149 0242 0.355 0.532 532 630 7.22 841 927
2 0.05 0105 0.150 0245 0358 0535 511 597 677 776 847
2 010 0.106 0152 0247 0361 0538 4.90 565 634 7.17 7.74
2 015 0.107 0154 0249 0364 0.542 469 535 594 662 708
2 020 0109 0155 0252 0.368 0545 4.50 507 5.56 6.12 648
2 025 0110 0157 0.255 0371 0549 431 480 5.21 5.65 594
2 2\7 0111 0159 0.257 0374 0552 417 4.61 497 535 558
2 2\6 0.112 0.161 0.260 0377 0556 4.00 437 466 4.96 514
2 040 0114 0163 0.264 0.382 0.561 377 405 4.27 447 4.59
3 0.00 0338 0436 0619 0818 I 102 668 775 877 10.05 10 98
3 010 0348 0.448 0.634 0834 1.119 628 7.16 796 893 961
3 020 0358 0461 0650 0853 I 138 5.89 660 721 793 8.41
3 030 0370 0475 0667 0873 1.158 5.52 607 6.52 703 735
3 040 0383 0491 0687 0895 I 181 515 556 589 6.22 642
3 0.50 0398 0508 0.709 0.919 1205 479 508 529 5.49 560
4 0.00 0.672 0823 1090 1.366 1.745 799 9.15 10 24 1160 1259
4 010 0693 0.847 1.117 1.395 1773 760 858 947 10 54 II 30
4 020 0715 0872 1147 1427 1804 721 8.02 8.73 957 1013
4 030 0740 0901 I 179 1462 1.838 6.83 7.48 804 866 906
4 040 0768 0932 1216 1500 1876 646 696 738 782 809
4 050 0799 0968 1256 1543 1917 608 646 674 7.03 720
6 0.00 1.537 1.785 2.202 2613 3152 10.53 11.84 13.06 1457 15.66
6 0.10 I 583 1835 2.255 2.667 3202 10.14 11.27 1230 13.55 1444
6 0.20 1.634 1890 2.314 2726 3.257 9.74 10.71 11.57 1259 13.28
6 0.30 1.691 1.951 2.379 2.791 3.317 935 10.16 1086 11 66 12.19
6 040 1754 2.019 2450 2.863 3.384 8.95 961 10 16 1077 11.16
6 0.50 1826 2095 2531 2944 3.458 8.54 906 947 9.90 10.17
7 000 2.037 2.330 2.814 3285 3895 11.77 Ill5 1442 1600 17 13
7 0.10 2.098 2394 2881 3.352 3.956 11 37 1257 13.67 1499 15.92
7 0.20 2164 2.464 2.954 3.424 4.022 10 97 1201 12.93 14.02 1477
7 030 2238 2542 3035 3503 4094 10.56 11.44 1220 13.08 1367
7 0.40 2.320 2628 3124 3591 4.174 10 16 10 88 11.49 1217 1261
7 0.50 2414 2726 3.225 3690 4.264 9.74 10.31 10.77 11.27 11 59
432 Tables
Table C (continued)
Lower Tall a Upper Tall C(
s SIll 0.005 0.010 0.025 0.050 0.100 0100 0.050 0025 0.010 0.005
8 000 2571 2906 3454 3.981 4656 1299 1443 1576 1740 1858
8 010 2646 2.984 3534 4059 4727 12.59 1386 1501 1640 17.37
8 020 2.727 3.069 3621 4144 4804 1218 1328 1426 1542 1622
8 030 2818 3163 3717 4238 4888 1177 1271 13 52 1446 1510
8 040 2920 3268 3824 4341 4981 II 35 1213 1279 13 53 1402
8 050 3035 3387 3944 4458 5085 10 91 1154 1206 1261 1297
9 000 3132 3507 4115 4.695 5432 14.21 1571 1708 1878 2000
9 010 3.221 3599 4208 4785 5513 13 79 1512 16.32 17.77 18.80
9 020 3318 3699 4.309 4.883 5600 1338 1454 1557 1679 1763
9 030 3426 3810 4.420 4.990 5696 1296 1395 1482 1582 1650
9 040 3547 3933 4544 5 108 5801 1253 1336 1407 1487 1540
9 050 3684 4073 4683 5242 5919 1208 12.76 13.32 13.93 14.32
10 000 3717 4130 4.795 5425 6221 1541 1696 1839 2014 2140
10 010 3820 4235 4900 5526 6311 1499 1637 1762 1913 20.20
10 020 3932 4350 5015 5636 6409 1456 1578 1686 1814 1902
10 030 4057 4477 5.141 5756 6515 1413 1518 1610 1716 17.88
10 040 4197 4619 5281 5890 6632 1369 1458 1533 16.19 16.76
10 050 4355 4779 5439 6039 6.763 1324 1396 14.56 15.22 1565
II 000 4321 4771 5.491 6169 7021 1660 1821 1968 2149 2278
11 010 4.438 489Q 5608 6281 7120 1617 1761 1891 20.47 2157
II 020 4566 5019 5736 6402 7227 15.74 1701 18.14 19.47 20.39
II 030 4707 5162 5877 6535 7343 1530 1640 1736 18.48 19.23
11 040 4866 5.322 6033 6683 7472 1485 1579 1658 1749 1809
II 050 5045 5502 6209 6848 7616 1438 1515 1579 1650 16.96
12 000 4943 5428 6201 6924 7829 1778 19.44 2096 2282 24.14
12 010 5073 5560 6330 7.047 7937 1735 1884 2018 2180 2293
12 020 5216 5.703 6470 7179 8053 1691 1823 19.40 2079 2175
12 030 5374 5862 6625 7325 8180 1646 1761 1861 1978 20.57
12 040 5551 6039 6797 7486 8320 1600 1698 1782 1877 1941
12 050 5.751 6239 6.990 7666 8.476 15.52 16.33 17.01 1776 1825
13 0.00 5580 6.099 6.922 7.690 8646 18.96 2067 2223 24.14 25.50
13 010 5724 6.243 7.063 7.822 8762 1852 2006 21.44 23 II 24.28
13 020 5882 6401 7.216 7966 8887 1807 1944 20.65 22.09 2308
13 030 6056 6575 7.384 8124 9024 17.62 18.81 19.85 21.07 21.89
13 040 6250 6769 7.571 8298 9174 1715 18.17 19.04 2005 20.71
13 050 6471 6.988 7.781 8.493 9.342 16.66 17.51 18.22 19.01 19.53
14 000 6231 6.782 7.654 8.464 9.470 20.13 2189 2349 25.45 2684
14 010 6388 6.939 7.806 8.606 9.594 19.68 2127 22.69 2441 25.61
14 020 6560 7.111 7.972 8.761 9728 19.23 20.64 21.89 23.38 24.40
14 030 6750 7300 8.153 8.930 9874 1877 2000 21.08 2234 2320
14 040 6962 7.510 8355 9117 10.035 18.29 1935 2026 2131 22.00
14 0.50 7.203 7.748 8.581 9.327 10.214 17.79 18.67 19.42 20.25 20.80
Other values can be approxImated as follows For lower hmlts, let a = /I - s, b = S + I, Z = posItive normal
quantIle (Table A) For upper hmlts,let a = /1 - S + I, b = s, Z = negatIve normal quantIle Calculate A = 9a - I,
B = 9b - I, C = 3z, D = B2 - bC 2,E = AB + qaD + bA2)1i2 Theconfidencehmltls It[l + (b/a)2(E/D)3]
For C( ~ 0.005, the error is less than 1%if s, 11 - S ~ 9 and less than 0.5 %If s, 11 - S ~ 12. See Pratt [/968].
Tables 433
I\n 5 6 7 8 9 10 I\n 8 9 10
I\n II 12 13 14 15 16 17 18 19 20
0 00005 00002 0.0001 0.0001 0.0000 0.0000 00000 00000 00000 00000
I 00010 00005 00002 00001 00001 0.0000 0.0000 0.0000 00000 0.0000
2 00015 00007 00004 00002 00001 00000 0.0000 0.0000 0.0000 00000
3 0.0024 00012 00006 00003 00002 00001 00000 00000 00000 00000
4 00034 0.0017 00009 00004 00002 00001 00001 00000 00000 00000
5 00049 0.0024 00012 00006 0.0003 00002 0.0001 0.0000 0.0000 00000
6 0.0068 0.0034 00017 00009 00004 00002 00001 00001 00000 00000
7 00093 00046 00023 00012 00006 00003 00001 00001 00000 00000
8 00122 00061 00031 00015 0.0008 0.0004 0.0002 00001 00000 00000
9 0.0161 0.0081 00040 00020 0.0010 00005 00003 00001 00001 00000
10 00210 00105 00052 00026 00013 00007 00003 00002 00001 00000
11 00269 00134 00067 00034 00017 00008 00004 00002 00001 00001
12 00337 00171 0.0085 0.0043 0.0021 0.0011 0.0005 0.0003 00001 00001
13 0.0415 00212 00107 00054 00027 00013 00007 00003 00002 00001
14 00508 00261 0.0133 00067 00034 00017 00008 00004 00002 00001
15 00615 00320 0.0164 0.0083 00042 0.0021 0.0010 0.0005 00003 00001
16 0.0737 00386 00199 00101 00051 00026 00013 0.0006 00003 0.0002
17 00874 00461 00239 00123 00062 0.0031 00016 00008 0.0004 0.0002
18 0.1030 0.0549 00287 00148 00075 00038 00019 00010 00005 00002
19 01201 00647 00341 00176 00090 00046 00023 00012 0.0006 00003
20 01392 00757 0.0402 00209 00108 00055 0.0028 0.0014 0.0007 00004
21 0.1602 00881 00471 00247 00128 00065 0.0033 00017 00008 00004
22 01826 0.1018 00549 00290 0.0151 0.0078 00040 0.0020 00010 0.0005
23 02065 01167 0.0636 0.0338 0.0177 00091 00047 0.0024 00012 00006
24 0.2324 01331 0.0732 0.0392 00206 00107 00055 00028 0.0014 00007
25 0.2598 01506 00839 00453 00240 00125 00064 0.0033 00017 00008
26 0.2886 0.1697 0.0955 0.0520 00277 0.0145 00075 0.0038 0.0020 0.0010
27 0.3188 01902 01082 0.0594 0.0319 00168 0.0087 0.0045 0.0023 0.0012
28 03501 02119 01219 0.0676 00365 00193 00101 0.0052 0.0027 00014
29 0.3823 0.2349 0.1367 00765 0.0416 0.0222 O.oII6 0.0060 0.0031 0.0016
30 0.4155 02593 0.1527 0.0863 0.0473 00253 00133 0.0069 00036 00018
31 0.4492 02847 0.1698 0.0969 0.0535 0.0288 0.0153 00080 0.0041 00021
32 04829 0.3110 01879 01083 00603 00327 0.0174 0.0091 00047 00024
33 05171 0.3386 0.2072 0.1206 00677 00370 00198 00104 00054 00028
34 0.3667 02274 0.1338 00757 00416 00224 0.0118 00062 00032
35 0.3955 02487 0.1479 00844 0.0467 0.0253 0.0134 0.0070 0.0036
36 04250 02709 0.1629 00938 00523 00284 00152 00080 00042
37 0.4548 0.2939 0.1788 0.1039 0.0583 00319 0.0171 00090 00047
434 Tables
Table D (continued)
1\11 11 12 13 14 15 16 17 18 19 20
1\11 15 16 17 18 19 20 1\11 18 19 20
n 21 22 23 24 25 26 27 28 29 30
J1 115.5 1265 138.0 150.0 162.5 175.5 189.0 203.0 2175 2325
tJ 28.77 30.80 32.88 35.00 37.17 39.37 41.62 43.91 4625 47.83
1/ 31 32 33 34 35 36 37 38 39 40
/1 2480 264 0 280.5 2975 315.0 3330 351.5 370.5 390 0 410.0
tJ 5303 53.48 55.97 5849 61.05 6365 66.29 6895 7166 7440
Source Adapted from Table II ofH L Harter and D BOwen, Eds (1972). Selecled Tables //I MathematIcal StaIlSI/CS, Vol I.
Markham PubhshlRg Co • Chicago. With permission of The Inslltute of Mathematical StatIStiCS
Tables 435
11-111 0 I 2 3 3 4 5 5 6 7
mil /- m 0 0 I 2 2 2 3 3 4
2 0 01667 03000 01000 02000 0.2857 01429 02143 0.2778 01667 02222 02727 01818
I 08333 09000 07000 08000 0.8571 07143 07857 08333 07222 0.7778 08182 07273
3 0 0.0500 01143 00286 00714 01190 00476 00833 01212 0.0606 00909 01224 00699
1 0.5000 0.6286 03714 05000 05952 0.4048 05000 05758 04242 05000 05629 04371
4 0 00143 00397 00079 00238 0.0455 00152 0.0303 0.0490 0.0210 00350 00513 00256
1 02429 03571 01667 02619 0.3485 0.1970 0.2727 03427 02168 02797 03385 02308
2 07571 08333 06429 07381 08030 06515 07273 07832 06573 07203 0.7692 06615
5 0 00040 0.0130 00022 00076 00163 00047 00105 00186 00070 00128 00204 00091
1 01032 0.1753 00671 01212 01795 00862 01329 01818 01002 01410 01833 01109
2 05000 06082 03918 05000 05874 04126 05000 05734 04266 05000 05633 04367
6 0 00011 00041 00006 00023 00056 0.0014 0.0035 00068 00023 00045 00077 00031
1 00400 0.0775 00251 00513 00839 00350 00594 00882 00430 00656 00913 00495
2 02836 0.3835 02086 02960 03776 02308 03042 03733 02466 03100 03700 02585
3 07165 0.7914 06166 07040 07692 06224 06958 07534 06267 06900 07415 06300
7 0 00003 00012 00002 00007 00019 00004 00011 0.0024 00007 00015 00028 00010
I 00146 00317 00089 00203 00364 00134 00249 00399 00174 00286 00426 00209
2 01431 02145 01002 01573 02178 o 1170 01674 02199 01299 01749 02214 01401
3 05000 05952 04048 05000 05806 04194 05000 05700 04300 05000 05619 04381
8 0 00001 00004 00000 00002 00006 00001 00004 00008 00002 0.0005 00010 00003
1 00051 00122 00030 00076 0.0149 00049 00099 00170 00067 00119 00188 00084
2 00660 01090 00445 0.0767 01149 00549 00849 01192 00635 00913 0.1224 0.0706
3 03096 0.3992 02380 03186 03950 02549 0.3250 03916 02678 03297 03889 02779
4 06904 07620 06008 06814 07451 06050 06750 07322 06084 06703 07221 06111
9 0 00000 00001 00000 00001 00002 00000 00001 00003 00001 00002 00004 00001
I 00017 00045 00010 00027 00058 00017 00038 00069 00025 00047 00079 00033
2 0.0283 00513 00185 00349 00563 00242 00402 00602 00291 00447 00633 00335
3 01735 02422 01276 01849 02449 01421 01935 02468 01535 02002 02481 01628
4 05000 05859 04141 0.5000 05750 04250 05000 05666 04334 05000 05600 04400
10 0 00000 0.0000 0.0000 0.0000 0.0001 00000 00000 00001 00000 00001 00001 00000
I 00005 0.0016 00003 00010 00022 00006 00014 0.0027 00009 0.0018 0.0032 0.0012
2 0.0115 0.0226 0.0073 00150 00260 00101 0.0180 0.0287 0.0127 0.0207 0.0310 00151
3 00894 01349 00635 0.0992 01402 00736 01069 0.1442 0.0820 0.1131 01473 00891
4 0.3281 04100 02599 0.3350 0.4067 02735 0.3401 04041 02841 03441 04018 02928
5 0.6719 07401 05900 0.6650 07265 0.5933 0.6599 07159 05959 06559 07072 05982
II I 0.0002 0.0005 00001 00003 00008 0.0002 00005 0.0010 0.0003 0.0007 00013 00004
2 0.0045 0.0095 0.0028 00061 00114 00040 0.0077 0.0130 0.0053 0.0092 00144 00065
3 00431 0.0699 00296 0.0498 0.0749 0.0358 0.0554 0.0789 00412 00601 00821 0.0460
4 0.1974 0.2632 01504 0.2068 0.2655 01628 02142 0.2671 01730 02200 02683 01814
5 05000 05789 04211 05000 0.5704 0.4296 05000 05635 04365 05000 05579 04421
12 1 00001 00002 00000 0.0001 0.0003 0.0001 00002 00004 00001 0.0002 00005 0.0002
2 00017 00038 0.0010 0.0024 00048 00016 00032 0.0056 00021 0.0039 00064 00027
3 00196 0.0341 00131 0.0236 0.0377 0.0165 0.0271 00407 00197 00302 00433 00226
4 01102 01566 00812 01189 0.1612 00906 01259 01649 0.0987 0.1318 01678 01056
5 0.3421 04179 0.2772 0.3475 04153 0.2883 0.3518 0.4131 0.2973 03552 04112 03047
6 06579 0.7228 05821 06525 07117 0.8388 0.6482 07027 05869 06448 0.6953 05888
13 1 00000 00001 00000 00000 00001 00000 00001 00001 0.0000 0.0001 0.0002 00001
2 00006 0.0015 0.0004 0.0009 0.0019 0.0006 0.0013 0.0024 00008 00016 00028 00011
3 0.0085 00157 00056 00107 00180 00073 00127 00200 00090 00145 00218 00106
4 00576 00871 00412 00642 00919 0.0476 00697 00957 00531 00744 00990 00581
436 Tables
Table E (continued)
11-/11 o I 2 3 3 4 5 5 6 7 7
/II a 1-/11 o o I I 2 2 2 3 3 3 4
5 02169 0.2798 0.1697 02247 0.2817 01804 0.2311 02831 01814 02363 02842 0.1970
6 05000 05734 04266 05000 0.5664 0.4336 0.5000 05607 0.4383 05000 05559 04441
14 I 00000 00000 0.0000 00000 0.0000 0.0000 0.0000 00000 00000 00000 0.0001 00000
2 00002 00006 00001 00003 0.0008 0.0002 0.0005 0.0010 0.0003 00006 00012 00004
3 0.0035 00070 0.0023 0.0046 00082 0.0031 0.0057 00094 00039 00067 00105 00048
4 00285 0.0457 00199 0.0328 0.0495 00237 0.0366 0.0526 00272 00399 00554 00304
5 01284 0.1749 00974 01362 0.1790 0.1061 0.1426 01823 0.1137 0.1480 0.1851 0.1202
6 03532 0.4241 02912 0.3576 04219 03005 0.3612 04201 03082 03641 04185 0.3147
7 06468 0.7088 0.5758 0.6424 06995 0.5781 0.6388 06918 0.5799 06359 0.6853 05815
15 2 00001 00002 00009 00001 00003 00001 0.0002 00004 0.0001 00002 0.0005 00002
3 00014 00030 0.0092 00019 00036 00013 00024 00043 0.0017 0.0030 0.0049 0.0021
4 00134 00228 0.0528 0.0160 00253 00113 0.0183 0.0276 0.0133 00205 0.0296 0.0153
5 00716 01028 01862 00778 0.1072 0.0591 0.0832 01109 0.0646 00878 0.1141 0.0696
6 0.2330 0.2933 0.4311 02397 0.2949 0.1956 0.2453 0.2962 0.2036 0.2499 0.2972 0.2105
7 0.5000 05689 07067 0.5000 0.5631 04369 0.5000 05582 0.4418 05000 0.5541 0.4459
For larger sample Sizes, use
and
peA < a)
-
= [I + _I_m_ + ... + 1(1 + 1)"'(1 + a - I)
II - I + I (11- I + 1)"'(11 - I + a) a
(n)]O/(N)
I m
Alternatively, approximate cumulative probabllllles can be found from Table A With the approximate standard normal
deViate
Z= la'd' _ b'c'l 2L ( I -
a"d"-b"C"[ _I)
6N1)/( I + 61111)( I + 6,;1)( I + 6i1)( I + 6(N 1)]"2
where a', b', c', d' are the cell entnes a, b, c, d 10 the 2 x 2 table corrected by 1/2 respecllvely, and
in terms of the naturalloganthms In. If common logarithms are used, replace 2L by 4 60517 L. ThiS normal approxlma-
lion has at least the accuracy specified below if mm(a, b - I, C - I, d) IS at least the value shown See Lmg and Pratt
(1980] for the mmimum guaranteed accuracy for other values ofmm(a, b - I, c - I,d) and other tall probabilities.
mm(a, b - I, c - I,d) 3 4 6 8 12 24 50
Any tall probability 0.0'50 0.0'33 0.0'17 0.0'10 004 50 0.04 14 0.0'35
Tall probability :s;0.05 00'25 0.0'16 0.04 94 0.04 53 0.04 28 0.0'87 00'27
Tail probability :s;0.01 0.0'13 0.0 4 84 004 45 00 4 29 0.04 15 00'51 0.0'16
Source. Adapted from G J. Lieberman and D. BOwen (1961), Tables of ,lie Hypergeomelrlc Probablluy DISlrlbllllOlI, Stanford
Umvers/ty Press, Stanford, Cahforma, WIth permission.
Tables 437
R, P R, P R, P R, P
Table F (continued)
R, p R, P R, P R, P
Table F (continued)
R, p R, P R, P R, P
Table F (continued)
R, p R, P R, P R, P
Table F (continued)
R, p R, P R, P R, P
Table F (continued)
R, p R, P R, P
For m or n larger than 10, the probabilities are found from Table A as follows:
Rx + 0.5 - m(N + 1)/2 Rx - 0.5 - m(N + 1)/2
ZL = ZR =
Jmn(N + 1)/12 Jmn(N + 1)/12
Desired Approximated by
Left tail probability for Rx Right tail probability for - Z f.
RIght tail probability for R, Right tail probability for ZR
Source: Adapted from Table B of C H. Kraft and C. Van Eeden (1969), A Non-parametric
IntroductIOn to Statistics, Macmillan PublIshing Co, New York, with permIssion.
Tables 443
Table G (continued)
III n mnDmn P III n mnDmn P m n mnDml1 P
9 45 54 54 63 63
10 50 60 70 70 80
II 66 66 77 88 88
12 72 72 84 96 96
13 78 91 91 104 117
14 84 98 112 112 126
15 90 105 120 135 135
16 112 112 128 144 160
17 119 136 136 153 170
18 126 144 162 180 180
19 133 152 171 190 190
20 140 160 180 200 220
Source Adapted from Table I of H L. Harler and D B. Owen, Eds (t970), Selected Tables //I Mathemallcal
StallStlCS, Vol I, Markham Publlshmg Company, ChIcago, WIth permIssIon of the InstItute of Mathemattcal
Stattsttcs.
Bibliography
Alling, David W. : Early decision in the Wilcoxon two-sample test. J. Am. Stat. Assoc. 58,
713-720 (l963).
Anderson, T. W., Darling, D. A.: Asymptotic theory of certain" goodness of fit" test
criteria based on stochastic processes. Annals of Math. Stat. 23, 193-212 (1952).
Bahadur, R. R.: Stochastic comparison of tests. Annals ofMath. Stat. 31, 276-295 (1960).
Bahadur, R. R.: Some Limit Theorems in Statistics. (CBMS Monograph No.4),
Philadelphia: SIAM, 1971.
Barton, D. E., Mallows, C. L.: Some aspects of the random sequence. Annals of Math.
Stat. 36, 236-260 (1965).
Bauer, D F.: Constructing confidence sets using rank statistics. J Am. Stat. Assoc. 67,
687-690 (1972).
Birnbaum, A.: On the foundations of statistical inference. J. Am. Stat. Assoc. 57,
269-306 (1962).
Birnbaum, Z. W.: Numerical tabulation of the distribution of Kolmogorov's statistic
for finite sample size. J. Am. Stat. Assoc. 47, 425-441 (1952).
Birnbaum. Z. W.: On the use of the Mann-Whitney statistic. Proc. of the Third Berkeley
Symp. Math. Stat. and Probability, Vol. I, Berkeley: Untv. Calif. 1956, pp. 13-17.
Birnbaum, Z. W., Hall, R. A.: Small sample dIstributions for multI-sample statistics of
the Smirnov type. Annals of Math. Stat. 31, 710-720 (I 960}.
BIrnbaum, Z. W., Klose, O. M.: Bounds for the variance of the Mann-Whitney statIstIc.
Annals of Math. Stat. 28, 933-945 (1957).
Birnbaum, Z. W., Tingey, F. H.: One-sided confidence contours for probability distri-
butions. Annals of Math. Stat. 22, 592-596 (1951).
Blackman, J.: An extension of the Kolmogorov distnbution. Annals of Math. Stat. 27,
513-520 (1956). Correction, ibid. 29, 318-324 (1958).
Blyth. C. R. : Note on relative efficiency of tests. Annals ofMath. Stat. 29, 898-903 (l958).
Box, G. E. P., Andersen, S. L.: Permutation theory III the denvatlon of robust cntena
and the study of departures from assumption. J. Royal Stat. Soc. E, 17, 1-34 (1955).
Bradley, James V.: Distribution-Free Statistical Tests. Englewood Cliffs, New Jersey:
Prentice-Hall, 1968.
Brascamp, H. J., Lieb, E. H.: On extensions of the Brunn-Minkowski and Prekopa-
Leindler Theorems, including inequalities for log concave functions, and with an
application to the diffusion equation. J. Func. Anal. 22, 366-389 (1976).
445
446 Blbhography
Gastwirth, J. L.: The first-median test: A two-sIded version of the control median test.
J. Am. Stat. Assoc. 63, 692-706 (1968).
Gastwirth, J. L., Wolff, S.: An elementary method of obtallling lower bounds on the
asymptotic power of rank tests. Annals of Math. Stat. 39, 2128-2130 (1968).
Gibbons, J. D.: A proposed two-sample rank test: The Psi test and its properties. J.
Royal Stat. Soc. B, 26, 305-312 (1964a).
GIbbons, J. D.: Effect of nonnormality on the power function of the sIgn test. J. Am.
Stat. Assoc. 59, 142-148 (1964b).
Gibbons, J. D.: On the power of two-sample rank tests on the equality of two distribu-
tion functions. J. Royal Stat. Soc. B, 26, 293-304 (I 964c).
Gibbons, J. D.: Nonparametric Statistical Inference. New York: McGraw-Hill, 1971.
Gibbons, J. D., Pratt, J. W.: P-values: Interpretation and methodology. The Am.
Statistician, 29, 20-25 (1975).
Gnedenko, B. V.: Tests of homogeneity of probability distributions in two independent
samples. Math. Nachrichten, 12,26-66 (1954).
Gnedenko, B. V., Korolyuk, V. S.: On the maximum discrepancy between two empirical
distributions (in Russian). Doklady Akad. Nauk SSSR, 80,525-528 (1951).
Good, 1. J.: Siglllficance tests in parallel and in series. J. Am. Stat. Assoc. 53, 799-813
(1958).
Good, 1. J. : A subjecttve evaluation of Bode's Law and an "objective" test for approxi-
mate numerical rationality. J. Am. Stat. Assoc. 64, 23-49 (1969).
Goodman, L. A.: Kolmogorov-Smirnov tests for psychologIcal research. Psych. Bull.
51, 160-168 (1954).
Gflzzle, J. E., Starmer, C. F., Koch, G. C.: AnalysIs of categorical data by linear models.
Biometrics, 25, 489-504 (1969).
Groeneboom, P., Oosterhoff, J.: Bahadur efficIency and probabilitIes of large deviations.
Statistica Neerlandlca, 31, 1-24 (1977).
Gurland, J: An inequality satisfied by the expectation of the reciprocal of a random
variable. The Am. Statistician, 21, (2), 24-25 (1967).
Guttman, 1. : Statistical Tolerance Regions: Classical and Bayesian. New York: Hafner
Press, 1970.
Halperin, M., Ware, J.: Early decision in a censored Wilcoxon two-sample test for
accumulating survival data. J. Am. Stat. Assoc. 69, 414-422 (1974).
Hajek, J.: Nonparametric Statistics. San Francisco: Holden-Day, 1969.
Hajek, J., Sidak, Z.: Theory of Rank Tests. New York: AcademIC, 1967.
Harter, H. L. : Expected values of normal order statistics. Biometrika, 48, 151-165 (1961).
Harter, H. L., Owen, D. B. (eds.): Selected Tables in Mathematical StatistICS, Vol. I,
Chicago: Markham Publ., 1970.
Hartigan, J. A.: USlllg subsample values as typical values. J. Am. Stat. Assoc. 64, 1303-
1317 (1969).
Harvard University Computation Laboratory: Tables of the Cumulative Binomial
Probability DistributIOn. Cambridge: Harvard Univ., 1955.
Hodges, J. L., Jr.: The siglllficance probability of the Smirnov two-sample test. Arkiv.
Mat., 3, 469-486 (1957).
Hodges, J. L., J r., Lehmann, E. L.: The efficiency of some nonparametflc competitors
of the t-test. Annals of Math. Stat. 27, 324--335 (1956).
Hoeffding, W.: A class of statistics with asymptotically normal distributIOn. Annals
of Math. Stat. 19,293-325 (1948).
Hoeffdlllg, W.: On the distribution of the number of successes in independent trials.
Annals of Math. Stat. 27, 713-721 (1956).
Hoeffding, W.: Review of S. S. Wilks, Mathemallcal StatIstIcs. Annals of Math. Stat.
33, 1467-1473 (1962).
Hogg, R. V.: Adaptive robust procedures: A partial review and some suggestIOns for
future applications and theory. J. Am. Stat. Assoc. 69, 909-927 (1974).
Bibliography 449
Hollander, M.: Rank tests for randomized blocks. Annals oj Math. Stat. 38, 867-877
(1967).
Huber, P. J.: Robust estimation in location. Annals oj Math. Stat. 35, 73-101 (1964).
Iman, R. E.: Use of a t-statistic as an approximation to the exact distribution of the
Wilcoxon signed ranks test statistic. Comm. in Stat. 3, 795-806 (1974).
Jacobson, J. E.: The Wilcoxon two-sample statistic: Tables and bibliography. J. Am.
Stat. Assoc. 58, 1086-1103 (1963).
Jeffreys, H.: Theory of Probability, 3rd ed. Oxford: Oxford Univ., 1961.
Johnson, N. L., Kotz, S.: Distributions in Statistics: Discrete Distributions. New York:
John Wiley, 1969.
Kac, M.: On deviations between theoretical and empirical distributions. Proc. Nat.
Academy of Sci. 35, 252-257 (1949).
Kadane, J. B.: For what use are tests of hypotheses and tests of significance. Introduc-
tion. Comm. in Stat. A,S, 735-736 (1976).
Karlin, S.: Decision theory for Polya type distributions. Case of two actions, I. Proc.
Third Berkeley Symp. on Math. Stat. and Probability, Vol. 1, Berkeley: Univ. Calif.,
1956, pp. 115-129.
Karlin, S.: P6lya-type distributions, II. Annals of Math. Stat. 28, 281-308 (l957a).
Karlin, S.: P6lya-type distributions, III: AdmIssibility for multi-action problems.
Annals oj Math. Stat. 28, 839-860 (1957b).
Karlin, S., Rubin, H.: Distributions possessing a monotone likelihood ratio. J. Am.
Stat. Assoc. 51, 637-643 (1956).
Kempthorne, 0.: Of what use are tests of significance and tests of hypotheses. Comm. in
Stat. A,S, 763-777 (1976).
Kimball, A. W.: Burnett, W. T., Jr., Doherty, D. G.: Chemical protection against
ionizing radiation. I. Sampling methods for screening compounds in radiation pro-
tectIon studies with mice. Radiation Research, 7, 1-12 (1957).
Klotz, J.: Small sample power and efficiency for the one sample Wilcoxon and normal
scores tests. Annals of Math. Stat. 34, 624-632 (1963).
Klotz, J.: Asymptotic efficiency of the two sample Kolmogorov-Smirnov test. J. Am.
Stat. Assoc. 62, 932-938 (1967).
Kolmogorov, A. N.: Sulla determinazione empirica di una legge di distribuzione.
Giorn. Inst. Ita!' Attuari, 4, 83-91 (1933).
Korolyuk, V. S.: Asymptotic expansions for the criterion of fit of A. N. Kolmogorov
and N. V. Smlrnov. Doklady Akad. Nauk SSSR, 93, 443-446 (1954). (Izvestiya Akad.
Nauk SSSR Ser. Mat. 19, 103-124 (1955).)
Korolyuk, V. S.: On the deviation of empirical distributions for the case of two inde-
pendent samples. Izvestiya Akad. Nauk SSSR Ser. Mat., 19, 81-96 (1955).
Kraft, C. H., Van Eeden, c.: A Nonparametric Introduction to StatIstics. New York:
Macmillan, 1968.
Kruskal, W. H.: Histoncal notes on the Wilcoxon unpaired two-sample test. J. Am.
Stat. Assoc. 52,356-360 (1957).
Kruskal, W. H.: "Tests of Significance" in International Encyclopedia oj the Social
Sciences, 14,238-249 (1968). New York: The Free Press.
Lancaster, H. 0.: Statistical control of counting experiments. Biometrika, 39, 419-422
(1952).
Lancaster, H. 0.: Significance tests in discrete distributions. J. Am. Stat. Assoc. 56,
223-234 (1961).
Lehman, S. Y.: Exact and approximate distribution for the Wilcoxon statistic with ties.
J. Am. Stat. Assoc. 56, 293-298 (1961).
Lehmann, E. L.: The power of rank tests. Annals of Math. Stat. 24, 23-43 (1953).
Lehmann, E. L.: Testm.q Statistical Hypotheses. New York: John Wiley, 1959.
Lehmann, E. L., Stein, C. : On the theory of some non-parametric hypotheses. Annals oj
Math. Stat. 20, 28-45 (1949).
450 BIbliography
Neyman, J. : Tests of statistical hypotheses and their use in studies of natural phenomena.
Comm. in Stat. A, 5, 737-751 (1976).
Neyman, J., Pearson, E. S.: On the problem of the most efficient tests of statistical
hypotheses. Phil. Trans. of the Royal Stat. Soc. A, 231, 289-337 (1933).
Noether, G. E.: Elements of Nonparametric Statistics. New York: John Wiley, 1967.
Noether, G. E.: Some simple distribution-free confidence intervals for the center of a
symmetric distribution. J. Am. Stat. Assoc. 68, 716-719 (1973).
Ord, J. K.: Approximations to distribution functions which are hypergeometric series.
Biometrika, 55, 243-248 (1968).
Owen, D. B.: Handbook of Statisllcal Tables. Reading, Mass. : Addison-Wesley, 1962.
Paulson, E.: An approximate normalization of the analysis of variance distribution.
Annals of Math. Stat. 13,233-235 (1942).
Pearson, E. S.: Some thoughts on statistical inference. Annals ofMath. Stat. 33, 394-403
(1962).
Pearson, E. S., Hartley, H. O. (eds.): Biometrika Tablesfor StatistIcians, Vol. I. Cam-
bridge, England: Univ. Press, 1966.
Peizer, D. B., Pratt, 1. W.: A normal approximation for binomial, F, beta and other
common, related distributions. J. Am. Stat. Assoc. 63,1416-1456 (1968).
Pitman, E. J. G.: Significance tests which may be applied to samples from any popu-
lations. J. Royal Stat. Soc. B, 4, 119-130 (1937a).
Pitman, E. J. G.: Significance tests which may be applied to samples from any popula-
tions, II. The correlation coefficient test. J. Royal Stat. Soc. B, 4, 225-232 (1937b).
Pitman, E. J. G.: Significance tests which may be applied to samples from any popula-
tions, III. The analysis of variance test. Biometrika, 29,322-335 (1938).
Pratt, 1. W.: Remarks on zeros and ties in the Wilcoxon signed ranks procedures. J.
Am. Stat. Assoc. 54,655-667 (1959).
Pratt, J. W.: On interchanging limits and integrals. Annals of Math. Stat. 31, 74-77
(1960).
Pratt, J. W.: Length of confidence intervals. J. Am. Stat. Assoc. 56, 549-567 (1961).
Pratt, J. W.: Robustness of some procedures for the two-sample location problem.
J. Am. Stat. Assoc. 59, 665-680 (1964).
Pratt, J. W.: Bayesian interpretation of standard inference situations, J. Royal Stat. Soc.
B,27, 169-203 (1965).
Pratt, 1. W.: A normal approximation for binomial, F, beta, and other common,
related tail probabilities, II. J. Am. Stat. Assoc. 63,1457-1483 (1968).
Pratt, J. W.: Comment on "Post-data two sample tests of location ", J. Am. Stat. Assoc.
68, 104-105 (1973).
Pratt, 1. W.: A discussIOn of the question: For what use are tests of hypotheses and tests
of significance. Comm. in Stat. A, 5, 779-787 (1976).
Pratt, 1. W.: Concavity of the log likelihood. J. Am. Stat. Assoc. 76,103-106 (1981).
Pratt,1. W., Raiffa, H., Schlaifer, R.: The foundations of decision under uncertainty:
An elementary exposition. J. Am. Stat. Assoc. 59, 353-375 (1964).
Putter, J. : The treatment of ties in some nonparametric tests. Annals of Math. Stat. 26,
368-386 (1955).
Pyke, R. : The supremum and infimum of the Poisson process. Annals of Math. Stat. 30,
568-576 (1959).
Quade, D.: On the asymptotic power of the one-sample Kolmogorov-Smirnov tests.
Annals of Math. Stat. 36, 1000-1018 (1965).
Rahe, A. J.: Tables of critical values for the Pratt matched pair signed rank statistic.
J. Am. Stat. Assoc. 69, 368-373 (1974).
Raiffa, H., Schlaifer, R.: Applied Statistical Decision Theory. Boston: Div. Res.,
Harvard Business School, 1961.
Roberts, H. V.: For what use are tests of hypotheses and tests of significance. Comm.
in Stat. A, 5, 753-761 (1976).
Rosenbaum, S.: Tables for a nonparametric test of location. Annals of Math. Stat. 25,
146-150 (1954).
Rustagi, J. S.: Bounds for the variance of Mann-Whitney statistics. Annals of the Inst.
of Stat. Math. 13, 119-126 (1962).
Sandiford, P. J.: A new binomial approximation for use in sampling from finite popu-
lations. J. Am. Stat. Assoc. 55, 718-722 (1960).
Savage, I. R.: Bibliography of Nonparametric Statistics. Cambridge: Harvard Univ.
Press, 1962.
Savage, L. J.: The Foundations of Statistics. New York: John Wiley, 1954.
Scheffe, H.: A useful convergence theorem for probability distributions. Annals of
Math. Stat. 18, 434-438 (1947).
Scheffe, H., Tukey, J. W.: Nonparametric estimation, I. Validation of order statistics.
Annals of Math. Stat. 16, 187-192 (1945).
Siegel, S.: Non-parametric Statistics for the Behavioral Sciences. New York: McGraw-
Hill, 1956.
Singer, B.: Distribution-Free Methods for Nonparametric Problems: A Classified and
Selected Bibliography. Leicester: British Psych. Soc., 1979.
Smirnov, N. V.: Estimate of deviation between empirical distribution functions in two
independent samples (in Russian). Bull. Moscow Univ., 2, 3-16 (1939).
Smirnov, N. V.: Approximation of distribution laws of random variables from empirical
data (in Russian). Uspehi Mat. Nauk, 10, 179-206 (1944).
Steck, G. P.: The Smirnov two sample tests as rank tests. Annals of Math. Stat. 40, 1449-
1466 (1969).
Stein, C.: Efficient nonparametric testing and estimation. Proc. Third Berkeley Symp.
on Math. Stat. and Probability, Vol. I. Berkeley: Univ. Calif., 1956, pp. 187-195.
Sterne, T. E.: Some remarks on confidence or fiducial limits. Biometrika, 41, 275-278
(1954).
Stuart, A.: The comparison of frequencies in matched samples. British J. Stat. Psych. 10,
29-32 (1957).
Tate, M. W., Clelland, R. C.: Nonparametric and Shortcut Statistics. Danville, Ill.:
The Interstate Publishers & Printers, 1957.
Teichroew, D.: Tables of expected values of order statistics and products of order
statistics for samples of size twenty and less from the normal distribution. Annals of
Math. Stat. 27, 410-426 (1956).
Terry, M. E.: Some rank order tests which are most powerful against specific parametric
alternatives. Annals of Math. Stat. 23, 346-366 (1952).
Tsao, C. K.: An application of Massey's distribution of the maximum deviation between
two sample cumulative step functions. Annals of Math. Stat. 25, 587-592 (1954).
Tukey, J. W.: Nonparametric estimation, II. Statistically equivalent blocks and tolerance
regions-the continuous case. Annals of Math. Stat. 18, 529-539 (1947).
Uhlmann, W.: Vergleich der hypergeometrischen mit der Binomial-Verteilung.
Metrika, 10, 145-158 (1966).
Uzawa, H.: Locally most powerful rank tests for two-sample problems. Annals of
Math. Stat. 31, 685-702 (1960).
van der Vaart, H. R.: Some extensions of the idea of bias. Annals of Math. Stat. 32,
436-447 (1961).
van der Waerden, B. L.: Order tests for the two-sample problem and their power, I, II,
III. Proc. Koninklijke Nederlandse Akademie van Wetenschappen (A), 55 (Inda-
gationes Mathematicae, 14), 453-458 (1952); Indagationes Mathematicae 15, 303-310,
311-316 (1953); correction, Indagationes Mathematicae 15, 80 (1953).
van der Waerden, B. L.: Testing a distribution function. Proc. Koninklijke Nederlandse
Akademie van Wetenschappen (A), 56 (Indagationes Mathematicae 15), 201-207
(1953).
van der Waerden, B. L.: The computation of the X-distribution. Proc. Third Berkeley
Symp. Math. Stat. and Probability, Vol. I. Berkeley: Univ. Calif., 1956, pp. 207-208.
van der Waerden, B. L., Nievergelt, E.: Tafeln zum Vergleich zweier Stichproben
mittels X-Test und Zeichentest. Berlin-Gottingen-Heidelberg: Springer-Verlag, 1956.
van Eeden, C.: The relation between Pitman's asymptotic relative efficiency of two tests
and the correlation coefficient between their test statistics. Annals of Math. Stat.
34, 1442-1451 (1963).
von Mises, R.: Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und
theoretischen Physik. Leipzig-Wien: F. Deuticke, 1931.
Walsh, J. E.: Applications of some significance tests for the median which are valid under
very general conditions. J. Am. Stat. Assoc. 44, 342-355 (1949a).
Walsh, J. E.: Some significance tests for the median which are valid under very general
conditions. Annals of Math. Stat. 20, 64-81 (1949b).
Walsh, J. E.: Nonparametric tests for median by interpolation from sign tests. Annals
of the Inst. of Stat. Math. 11, 183-188 (1959-60).
Walsh, J. E.: Handbook of Nonparametric Statistics, I. Investigation of Randomness,
Moments, Percentiles and Distributions. New York: Van Nostrand, 1962a.
Walsh, J. E.: Some two-sided distribution-free tolerance intervals of a general nature.
J. Am. Stat. Assoc. 57, 775-784 (1962b).
Walsh, J. E.: Handbook of Nonparametric Statistics, II: Results for Two and Several
Sample Problems, Symmetry and Extremes. New York: Van Nostrand, 1965.
Walsh, J. E.: Handbook of Nonparametric Statistics, III: Analysis of Variance. New
York: Van Nostrand, 1968.
Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics, 1, 80-83 (1945).
Wilks, S. S.: Mathematical Statistics. New York: John Wiley, 1962.
Wilson, E. B., Hilferty, M. M.: The distribution of chi-square. Proc. Nat. Academy of
Sci. 17, 684-688 (1931).
Wise, M. E.: A quickly convergent expansion for cumulative hypergeometric proba-
bilities, direct and inverse. Biometrika, 41, 317-329 (1954).
Young, W. H.: On semi-integrals and oscillating successions of functions. Proc. London
Math. Soc. (2), 9, 286-324 (1911).
Zahl, S.: Bounds for the Central Limit Theorem error. Ph.D. Dissertation, Boston,
Mass.: Harvard Univ. 1962.
Index
Factorization criterion (see also sufficiency) 10
Jeffreys, H. 26
Jennrich, R. I. 322
Johnson, N. L. 241, 284
Wilcoxon signed rank test (cont.)
  one-sample 151-153
Power efficiency 346
Power Laplace distribution (see double exponential distribution)
Pratt, J. W. 17, 20, 26, 28, 32, 44, 50, 51, 71, 162, 241, 247, 257, 401, 421, 423, 427, 428, 432
Principle of invariance 178, 180-181, 218, 223-224, 308, 312-313
Principle of minimum likelihood 30, 60
Probability density function 4
Probability integral transformation 95
Pyke, R. 343
Quade, D. 402, 404, 411
Quantile 83-85
Quantile test, two-sample (see also median test) 236-248, 267
Rahe, A. J. 165
Random effects model and asymptotic relative efficiency 395-396
Random method of breaking ties (see ties)
Random variable 3
Randomization distribution, one-sample 203, 204
  approximations to 212-216
  expected value and variance 210
  (n!2ⁿ) type distribution 219
  2ⁿ type distribution 219
  t-statistic 207-208
Randomization distribution, two-sample 296, 298-299
  approximations to 303-305
  expected value and variance 301-302
  N! type distribution 306
  (N choose m) type distribution 306
  Student's t-statistic 303-305
Randomization test, one-sample (see observation-randomization test, one-sample and rank-randomization test, one-sample) 203-204, 382, 383
Randomization test, two-sample (see observation-randomization test, two-sample and rank-randomization test, two-sample) 296
Randomized confidence regions 51-52
Randomized P-value 40-41
Randomized test procedures 10, 34-41
Rank-randomization test, one-sample (see also signed rank test) 204
Rank-randomization test, two-sample (see also rank test) 232, 296
Rank sum test 77
Rank tests, two-sample 231-279, 322
Reduced sample procedure (see ties)
Regression model 117
Rejection rule 14-15
Relative efficiency 345, 346
Roberts, H. V. 17
Rosenbaum, S. 237, 243, 283
Rosenbaum's test 243
Rubin, H. 52, 65
Rustagi, J. S. 265
Sandiford, P. J. 240
Savage, I. R. 255, 392
Savage, L. J. 17, 20
Scheffe, H. 71, 130
Shift assumption 231, 232-234, 249, 297
Shift families 379
Sidak, Z. 323, 409
Sign test for quantiles, one-sample 85-97, 146
  confidence intervals 87, 92-96
  optimum properties 88-92
  random effects models 92-96
  zero differences 97-104, 107, 114
  asymptotic relative efficiency 354, 356, 357, 359, 365, 380
Sign test, two-sample 234-248, 348, 349, 354
  asymptotic relative efficiency 380
  (see also Fisher's exact test, median test)
Sign test, two-sample with fixed ξ 236, 237, 241
Signed-rank of observations 147-148
Signed-rank sum 148
Signed-rank tests, one-sample 173-177, 180-181, 203, 216
  asymptotic relative efficiency 386-392
  confidence intervals 174-175
  Walsh averages 173-174
Signed-rank zero procedure (see ties)
Significance level (see also level) 19-20
  interpretation 20
  power 21
Significance tests 14-34
Simple random sample 105
Size (see level or significance level)
T-test (see Student's t-test)
Tables 426-444
Teichroew, D. 267, 293
Terry, M. E. 267
Test statistic 15
  and asymptotic relative efficiency 357-358, 371-373
Test for equality of proportions 106-116, 236
  (see also sign test, one-sample with zero differences)
Test for 2 × 2 tables (see χ² test for equality of two proportions)
Ties
  in Kolmogorov-Smirnov tests 330-331
  non-zero 167-168
  sign test, two-sample 241-242
  sign test with zero differences, one-sample 97-104
  Wilcoxon rank sum test, two-sample 258-263
  Wilcoxon signed rank test, one-sample 160-171
    average rank procedure 162, 163-165, 167-168, 170
Uhlmann, W. 284
Unbiasedness of
  statistic 7
  confidence level 61-62
  tests 59, 101-104
Uniformly consistent estimator 154
Uniformly most powerful tests 52-53, 181, 222-224, 310-313
Uniformly most powerful unbiased test 59
United States Senate 250
Unit normal density (see standard normal density)
Uzawa, H. 278
Valid test 20
van der Vaart, H. R. 125
van der Waerden, B. L. 184, 267, 330, 333
van der Waerden statistic (see also normal scores test) 267-268, 333
van Eeden, C. 184, 442
Variance 7
von Mises, R. 344
Wallace, D. L. 235, 237
Walsh averages 150, 380